I wanted to know how responsive dmenu, awk, sort and uniq are on a 50MB file (625000 entries of 80 one-byte characters each).
generate file:
#!/bin/bash
echo "Creating dummy file of 50MB in size (625000 entries of 80chars)"
echo "Note: this takes about an hour and a half"
entries_per_iteration=1000
for i in $(seq 1 625)
do
    echo "Iteration $i of 625 ( $entries_per_iteration each )"
    for j in $(seq 1 "$entries_per_iteration")
    do
        # Each entry: date, time, then epoch + filler + md5sum output (the hash followed by "  -")
        echo "$(date +'%Y-%m-%d %H:%M:%S') $(date +%s)abcdefhijklmno$(date +%s | md5sum)" >> ./dummy_history_file
    done
done
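Most of that hour and a half goes into starting date three times and md5sum once for every line. A rough sketch of a quicker generator follows, assuming it is fine to swap the md5sum hash for a zero-padded counter (so the content is not identical to the file used in this test, only the 80-byte line layout is). The generate_dummy helper and the demo file are made up for illustration.

```shell
#!/bin/bash
# Sketch of a faster generator: one awk process instead of four forks per line.
# Assumption: a zero-padded counter stands in for the md5sum hash, so the
# content differs from the original file while the 80-byte layout is kept.
generate_dummy() {   # hypothetical helper: generate_dummy <file> <entries>
    awk -v n="$2" -v stamp="$(date +'%Y-%m-%d %H:%M:%S')" -v now="$(date +%s)" '
    BEGIN {
        for (i = 1; i <= n; i++)
            # date, time, then epoch + filler + 32-char counter + "  -"
            printf "%s %sabcdefhijklmno%032d  -\n", stamp, now, i
    }' > "$1"
}

# Small demo run; pass 625000 as the second argument for a 50MB file.
demo_file=$(mktemp)
generate_dummy "$demo_file" 1000
wc -l -c "$demo_file"
```

Because of the counter, the third field is different on every line, which is worth keeping in mind when comparing uniq timings against a file with repeated entries.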
measure speed:
echo "Plain awk '{print \$3}':"
time awk '{print $3}' dummy_history_file >/dev/null
echo "awk + sort"
time awk '{print $3}' dummy_history_file | sort >/dev/null
echo "awk + sort + uniq"
time awk '{print $3}' dummy_history_file | sort | uniq >/dev/null
echo "Plain dmenu:"
dmenu < dummy_history_file
echo "awked into dmenu:"
awk '{print $3}' dummy_history_file | dmenu
echo "awk + sort + uniq into dmenu:"
awk '{print $3}' dummy_history_file | sort | uniq | dmenu
Results.
I ran the test twice, about half an hour after generating the file, so in the first run the first awk call may have been slowed down because parts of the file were no longer in the Linux block cache.
(I also edited the output format a bit)
Run 1:
Plain awk '{print $3}':
real 0m1.253s
user 0m0.907s
sys 0m0.143s
awk + sort:
real 0m3.696s
user 0m1.887s
sys 0m0.520s
awk + sort + uniq:
real 0m15.768s
user 0m12.233s
sys 0m0.820s
Plain dmenu:
awked into dmenu:
awk + sort + uniq into dmenu:
Run 2:
Plain awk '{print $3}':
real 0m1.223s
user 0m0.923s
sys 0m0.107s
awk + sort:
real 0m2.799s
user 0m1.910s
sys 0m0.553s
awk + sort + uniq:
real 0m16.387s
user 0m12.019s
sys 0m0.787s
Plain dmenu:
awked into dmenu:
awk + sort + uniq into dmenu:
Not too bad. uniq in particular seems to cause a lot of the slowdown. (In this dummy test file, all entries are unique. If there were lots of dupes, the results would probably be different, but I suspect that uniq always needs some time to do its work, dupes or not.) The real bottleneck seems to be raw CPU power, not storage bandwidth, since Linux caches the file. If it were uncached, I estimate the sequential read would take about 1.5 seconds (roughly 30MB/s on common hard disks).
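The guess about duplicates is easy to check. Here is a minimal sketch, assuming a sample of 100000 short lines is representative enough; the sample size, the entry format and the temp files are arbitrary choices and not part of the benchmark above.

```shell
#!/bin/bash
# Sketch: compare sort | uniq on all-unique input vs. duplicate-heavy input.
# Sample size and temp file names are arbitrary choices for illustration.
n=100000
unique_file=$(mktemp)
dupes_file=$(mktemp)

# n distinct values
awk -v n="$n" 'BEGIN { for (i = 1; i <= n; i++) printf "entry%06d\n", i }' > "$unique_file"
# n values drawn from only 100 distinct ones
awk -v n="$n" 'BEGIN { for (i = 1; i <= n; i++) printf "entry%06d\n", i % 100 }' > "$dupes_file"

echo "all unique:"
time sort "$unique_file" | uniq > /dev/null
echo "mostly duplicates:"
time sort "$dupes_file" | uniq > /dev/null

# Sanity check: how many distinct lines each pipeline produces
unique_count=$(sort "$unique_file" | uniq | wc -l)
dupes_count=$(sort "$dupes_file" | uniq | wc -l)
echo "distinct lines: $unique_count vs $dupes_count"
```

The timings will depend on the machine, so the only fixed part is the number of distinct lines each pipeline prints.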
Once the stuff gets piped into dmenu, there is a little lag but it's reasonably responsive imho.
Test performed on an Athlon XP @ 2GHz with 1 GB of memory. There were some other apps running, so it's not a very professional benchmark, but you get the idea :)
sort -u vs uniq
You might want to check whether using "sort -u" instead of "sort | uniq" makes it a little faster. At the very least, it saves the time needed to fork an extra process and do I/O over a pipe.
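A small sketch of that comparison, using a throwaway sample file (the file and its contents are made up for illustration). It only checks that both forms give the same output; the commented-out line is how the timing could be repeated on the real file.

```shell
#!/bin/bash
# Sketch: sort -u and sort | uniq should produce the same lines.
sample=$(mktemp)
printf '%s\n' b a c a b a > "$sample"

with_uniq=$(sort "$sample" | uniq)
with_sort_u=$(sort -u "$sample")

if [ "$with_uniq" = "$with_sort_u" ]; then
    echo "same output"
fi

# To compare timings on the benchmark file from the post:
# time awk '{print $3}' dummy_history_file | sort -u >/dev/null
```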
awk vs grep/cut
Hi, nice benchmark!
You made me want to test too.
Here I'm comparing awk performance against grep and cut.
The grep results are astounding!
awk '/1234\.*a/'
real 0m4.385s
user 0m4.184s
sys 0m0.048s
grep -e '1234\.*a'
real 0m0.100s
user 0m0.060s
sys 0m0.020s
Plain awk '{print $3}':
real 0m0.641s
user 0m0.600s
sys 0m0.012s
Plain cut -d ' ' -f 3:
real 0m0.281s
user 0m0.232s
sys 0m0.040s
awk + uniq
real 0m0.979s
user 0m0.792s
sys 0m0.100s
cut + uniq
real 0m0.446s
user 0m0.504s
sys 0m0.060s
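One caveat when swapping awk for cut: the two only agree when fields are separated by exactly one space, as they are in the dummy file. awk collapses runs of whitespace, while cut -d ' ' counts every single space as a separator. A quick sketch with made-up sample lines:

```shell
#!/bin/bash
# Sketch: awk '{print $3}' vs cut -d ' ' -f 3 on single- and double-spaced input.
# The sample lines are invented; the first mimics the dummy file's layout.
line='2008-01-01 12:00:00 1199188800abcdefhijklmno0123  -'
awk_field=$(printf '%s\n' "$line" | awk '{print $3}')
cut_field=$(printf '%s\n' "$line" | cut -d ' ' -f 3)
echo "single spaces: awk=$awk_field cut=$cut_field"

# With two spaces between the first two fields the results differ:
spaced='a  b c'
awk_spaced=$(printf '%s\n' "$spaced" | awk '{print $3}')   # third word
cut_spaced=$(printf '%s\n' "$spaced" | cut -d ' ' -f 3)    # text after the 2nd space
echo "double space: awk=$awk_spaced cut=$cut_spaced"
```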
Conclusion?
So what's your conclusion? If you want to use awk and do regex matching, do it with grep and pipe the output to awk?
For a simple "print nth field", awk being twice as slow as cut was to be expected, since awk is much more featureful. But the regex performance is disappointing indeed.
conclusions
Why do you need to pipe grep output to awk?
awk '//' regexp matching is actually the same as grep
I think that in most cases (at least the ones in uzbl examples) grep and cut can replace awk completely and in all those cases they are faster (that's my conclusion).
sed is also a bit faster in a "s/regexp//" than awk "/regexp/".
I also wanted to test how well removing duplicates from my history file before every dmenu launch would scale, and these results have motivated me to start cut|uniq-ing my dmenu.
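Note that uniq only drops repeated lines that are adjacent, so on an unsorted history file cut | uniq can still leave duplicates. A sketch of the difference, with a made-up four-line history file; sort -u and the awk '!seen[$0]++' idiom are shown as two possible alternatives, not as what anyone in this thread actually uses.

```shell
#!/bin/bash
# Sketch: three ways to deduplicate the third field of a history file.
# The sample file is invented for illustration.
hist_file=$(mktemp)
printf '%s\n' 'd t url1' 'd t url1' 'd t url2' 'd t url1' > "$hist_file"

# uniq alone: only adjacent repeats are removed, url1 shows up twice
adjacent_only=$(cut -d ' ' -f 3 "$hist_file" | uniq)
# sort -u: every duplicate removed, output in sorted order
sorted_unique=$(cut -d ' ' -f 3 "$hist_file" | sort -u)
# awk: every duplicate removed, first-seen order kept
first_seen=$(cut -d ' ' -f 3 "$hist_file" | awk '!seen[$0]++')

printf 'uniq only:\n%s\n' "$adjacent_only"
printf 'sort -u:\n%s\n' "$sorted_unique"
printf 'awk seen:\n%s\n' "$first_seen"
```

Any of these could be piped straight into dmenu in place of the echo lines.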