
Musings of an Anonymous Geek

Made with only the finest 1's and 0's


Throw out your Perl: One-line aggregation in awk

Posted on January 19, 2009 (updated January 27, 2009) by bkjones

I ran into a student from a class I taught last summer. He’s a really sharp guy, and when I first met him, I was impressed with just how much Perl he could stuff into his brain’s cache. He would write what he called ‘one-liners’ in Perl that, in reality, took up 5-10 lines in his terminal. Still, he’d type furiously, without skipping a beat. But he told me when we met that he no longer does this, because I covered awk in my class.

His one-liners were mostly for data munging, and the data he needed to munge was pretty predictable. It had a fixed number of fields and a consistent delimiter; in short, it was perfect for munging in awk without resorting to any kind of esoteric awk-ness.

One thing I cover in the awk learning module I've developed is aggregation of data using pretty simple awk one-liners. For example, here's a pretty typical /etc/passwd file (we need some data to munge):

root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin
sync:x:5:0:sync:/sbin:/bin/sync
shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
halt:x:7:0:halt:/sbin:/sbin/halt
mail:x:8:12:mail:/var/spool/mail:/sbin/nologin
news:x:9:13:news:/etc/news:
uucp:x:10:14:uucp:/var/spool/uucp:/sbin/nologin
operator:x:11:0:operator:/root:/sbin/nologin
games:x:12:100:games:/usr/games:/sbin/nologin
gopher:x:13:30:gopher:/var/gopher:/sbin/nologin
ftp:x:14:50:FTP User:/var/ftp:/sbin/nologin
nobody:x:99:99:Nobody:/:/sbin/nologin
dbus:x:81:81:System message bus:/:/sbin/nologin
apache:x:48:48:Apache:/var/www:/sbin/nologin
avahi:x:70:70:Avahi daemon:/:/sbin/nologin
mailnull:x:47:47::/var/spool/mqueue:/sbin/nologin
smmsp:x:51:51::/var/spool/mqueue:/sbin/nologin
distcache:x:94:94:Distcache:/:/sbin/nologin
nscd:x:28:28:NSCD Daemon:/:/sbin/nologin
vcsa:x:69:69:virtual console memory owner:/dev:/sbin/nologin
rpc:x:32:32:Portmapper RPC user:/:/sbin/nologin
rpcuser:x:29:29:RPC Service User:/var/lib/nfs:/sbin/nologin
nfsnobody:x:65534:65534:Anonymous NFS User:/var/lib/nfs:/sbin/nologin
sshd:x:74:74:Privilege-separated SSH:/var/empty/sshd:/sbin/nologin
mysql:x:27:27:MySQL Server:/var/lib/mysql:/bin/bash
dovecot:x:97:97:dovecot:/usr/libexec/dovecot:/sbin/nologin
squid:x:23:23::/var/spool/squid:/sbin/nologin
ldap:x:55:55:LDAP User:/var/lib/ldap:/bin/false
pcap:x:77:77::/var/arpwatch:/sbin/nologin
ntp:x:38:38::/etc/ntp:/sbin/nologin
rpm:x:37:37::/var/lib/rpm:/sbin/nologin
haldaemon:x:68:68:HAL daemon:/:/sbin/nologin
named:x:25:25:Named:/var/named:/sbin/nologin
xfs:x:43:43:X Font Server:/etc/X11/fs:/sbin/nologin
jonesy:x:500:500::/home/jonesy:/bin/bash

It’s not exotic, cool data that we’re going to infer a lot of interesting things from, but it’ll do for pedagogical purposes.

Now, let’s write a super-simple awk aggregation routine that’ll count the number of users whose UID is > 100. It’ll look something like this:

awk -F: '$3 > 100 {x+=1} END {print x}' /etc/passwd

The important thing to remember is that awk initializes variables for you (an unset variable acts as 0 in a numeric context), which cuts down on some clutter.
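
Run against the sample file above, that prints 2; nfsnobody and jonesy are the only accounts with a UID over 100. If you also want to see which users matched, the same pattern/action structure covers it. Here's one possible variation, just a sketch building on the one-liner above:

awk -F: '$3 > 100 {x+=1; print $1} END {print "count: " x}' /etc/passwd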

Let’s abuse awk a bit further. What if we want to know how many users use each shell in /etc/passwd, whatever those shells may be? Here’s a one-liner that’ll take care of this for you:

awk -F: '{x[$7]+=1} END {for(z in x) {print z, x[z]} }' /etc/passwd

While awk doesn't technically support multi-dimensional arrays, its arrays are associative, so they don't have to be numerically indexed. Here, we tell awk to increment x[$7]. $7 is the field that holds the shell for each user, so if $7 on the current line is /bin/bash, we've told awk to increment the value in the array indexed at x["/bin/bash"]. So, if there's only one line containing /bin/bash up to the current record, then x["/bin/bash"] is 1.
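
If you'd rather see the shells ranked by how many users have them, you can swap the print order and hand the result to sort. This is just a sketch of one way to do it (note that the news entry in the sample data has an empty seventh field, so one of the keys will be the empty string):

awk -F: '{x[$7]+=1} END {for(z in x) {print x[z], z}}' /etc/passwd | sort -rn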

There are a lot of great things you can move on to from here. You can do things that others use Excel for right in awk. If you have checkbook information in a flat file, you can add up purchases only in a given category, or, using the technique above, in every category. If you store your stock purchase price on a line along with the current price, you can use simple math to get the spread on each line and see whether your portfolio is up or down. Let's have a look at something like that. Here's some completely made up, hypothetical data representing a fictitious user's stock portfolio:

ABC,100,12.14,19.12
FOO,100,24.01,17.45
BAR,50,88.90,94.33
BAZ,50,75.65,66.20
NI,23,33.12,43.32

Save that in a file called “stocks.txt”. The columns are stock symbol, number of shares, purchase price, and current price, in that order. This awk one-liner indexes the ‘x’ array using the stock symbol, and the value at that index is set to the amount gained or lost:

awk -F, '{x[$1]=($2*$4)-($2*$3)} END {for(z in x) {print z, x[z]}}' stocks.txt

Hm. Actually, that’s kind of inefficient. I realized while previewing this post that I can shorten it up a bit like this:

awk -F, '{x[$1]=($2*($4 - $3))} END {for(z in x) {print z, x[z]}}' stocks.txt

Glad I caught that before the nitpickers flamed me to a crisp. Always preview your posts! ;-P
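
For what it's worth, run against the stocks.txt sample above, that one-liner prints something like the following (the order of a for-in loop over an awk array isn't guaranteed, so your lines may come out shuffled):

ABC 698
FOO -656
BAR 271.5
BAZ -472.5
NI 234.6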

Ah, but of course, that's not enough. This spits out the gain and loss for each stock, but what about the net gain or loss across all of them? You only need to tweak it a little bit:

awk -F, '{x[$1]=($2*($4-$3)); y+=x[$1]} END {for(z in x) {print z, x[z]}; print "Net: "y}' stocks.txt

We just added a running total in the 'y' variable in the main block, before the "END", and then added an extra print statement inside the "END" block.
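
With the stocks.txt sample above, that appends a line reading something like "Net: 75.6" after the per-symbol results. The checkbook idea mentioned earlier works exactly the same way. Assuming a hypothetical flat file, say checkbook.txt, with date, category, and amount fields (the file name and layout are made up purely for illustration), one line totals every category and another totals just one:

awk -F, '{x[$2]+=$3} END {for(z in x) {print z, x[z]}}' checkbook.txt
awk -F, '$2 == "groceries" {total+=$3} END {print total}' checkbook.txt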

I hope this helps some folks out there. Also, if your team needs to know stuff like this, I do on-site training!
