Musings of an Anonymous Geek

Made with only the finest 1's and 0's

Data munging with Vim and AWK

Posted on June 6, 2008 by bkjones

So, I had some data in a file. It was temporal data. It looked like this:

100 4/15 16:50
143 4/15 16:51
121 4/15 16:52
209 4/15 16:53
105 4/15 16:54
321 4/15 16:55
173 4/15 16:56
205 4/15 16:57
197 4/15 16:58
211 4/15 16:59

But I needed it to be in ISO 8601 format so I could plot it with Timeplot. The data represents hits per minute from an Apache log file. I also needed the time to show up in the first column and the hits in the second column. Here’s what I needed the data to look like:

2008-04-15T16:54 105
2008-04-15T16:55 321
2008-04-15T16:56 173
2008-04-15T16:57 205
2008-04-15T16:58 197
2008-04-15T16:59 211

Well, I knew that the dates I had in the file were from 2008, and all of the other bits are there, just in the wrong format. Here’s what I did to get things in the right format for Timeplot:

:%s/4\//2008-04-/g # search for “4/” and replace it with “2008-04-” (the “#” and everything after it is my annotation, not something to type into Vim)

Now my data looks like this:

173 2008-04-15 16:56
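For anyone who would rather script this step than type it into Vim, here is a rough Python sketch of the same substitution. The function name `add_year` and the `year` parameter are made up for illustration, and unlike the Vim command it zero-pads any one- or two-digit month rather than hard-coding April.

```python
import re


def add_year(line: str, year: int = 2008) -> str:
    """Turn an 'M/D' date token into 'YYYY-MM-D', e.g. '4/15' -> '2008-04-15'."""
    # \b stops the match from starting in the middle of a longer number,
    # so the hit count in the first column is left alone.
    return re.sub(
        r"\b(\d{1,2})/",
        lambda m: f"{year}-{int(m.group(1)):02d}-",
        line,
    )


print(add_year("173 4/15 16:56"))  # 173 2008-04-15 16:56
```

The year still has to be supplied by hand, exactly as in the Vim version, because the source data does not carry it.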

But not all of the minutes were two digits for some reason (I don’t remember how I parsed the log to get into this state – it was hurried and… well… wrong). I had times that looked like “17:9” so I had to zero-pad the minutes that were only single digits.

:%s/:\(.\)$/:0\1/g # find “:” followed by exactly one character at the end of the line, and replace that with “:0” followed by whatever that character was.

So now my minutes look right.

144 2008-04-15 16:09
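The same zero-padding could be sketched in Python roughly like this. `pad_minutes` is a hypothetical name, and this version uses `\d` instead of `.` so it only touches digits.

```python
import re


def pad_minutes(line: str) -> str:
    """Zero-pad a single-digit minute at the end of the line."""
    # ':' followed by exactly one digit and then end of line.
    # A two-digit minute does not match, so it is left untouched.
    return re.sub(r":(\d)$", r":0\1", line)


print(pad_minutes("144 2008-04-15 16:9"))   # 144 2008-04-15 16:09
print(pad_minutes("173 2008-04-15 16:56"))  # 173 2008-04-15 16:56
```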

Now I needed to replace spaces between the date and time values with a “T” as per ISO 8601 rules for date and time representations in a single string:

:%s/\(-..\)\s/\1T/g # find a “-” followed by any two characters and then a whitespace character, and replace the whole match with the “-” and those two characters, followed by a “T”.

That worked well.

213 2008-04-15T16:45
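A Python sketch of that join, under the assumption that the day is always two digits at this point (the name `join_date_time` is invented for the example):

```python
import re


def join_date_time(line: str) -> str:
    """Replace the whitespace between the date and the time with 'T'."""
    # '-' plus two digits, then one whitespace character. Only the day
    # ('-15 ') qualifies; the month ('-04-') is followed by another '-'.
    return re.sub(r"(-\d\d)\s", r"\1T", line)


print(join_date_time("213 2008-04-15 16:45"))  # 213 2008-04-15T16:45
```

The space after the hit count survives because it is preceded by digits only, not by a “-”.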

At this point I had everything knocked, but I forgot that some of my *hours* were also single digits :-/

:%s/T\(.\):/T0\1:/g # find a “T” followed by a single character and a “:”, and replace that with “T0”, whatever that character was, and the “:”.
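In Python the hour fix might look like the sketch below. The sample line is invented for illustration (the original file is not shown with a single-digit hour), and `pad_hours` is a hypothetical name.

```python
import re


def pad_hours(line: str) -> str:
    """Zero-pad a single-digit hour that follows the 'T' separator."""
    # 'T', exactly one digit, then ':'. A two-digit hour has a second
    # digit where the ':' is expected, so it does not match.
    return re.sub(r"T(\d):", r"T0\1:", line)


print(pad_hours("88 2008-04-15T9:05"))    # 88 2008-04-15T09:05
print(pad_hours("213 2008-04-15T16:45"))  # 213 2008-04-15T16:45
```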

There. That did it. Now I just need to comma-separate the values, which is simple after all of this nonsense:

:%s/ /,/g # c’mon, you get this one, right?

Great! Except that the datetime string needs to be the *first* column. Here’s where awk comes in handy:

cat hitspermin_bad.txt | awk -F, '{print $2,$1}' > hitspermin_good.txt
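What the awk one-liner does per line can be sketched in Python as follows (`swap_columns` is a made-up name). As with awk's default output separator, the two fields come out separated by a space, which matches the target format shown earlier.

```python
def swap_columns(line: str) -> str:
    """Turn 'hits,timestamp' into 'timestamp hits'."""
    # Split on the comma, as awk does with -F, and print field 2 then field 1.
    hits, stamp = line.strip().split(",")
    return f"{stamp} {hits}"


print(swap_columns("105,2008-04-15T16:54"))  # 2008-04-15T16:54 105
```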

You’ll notice that, since I could see the data and know the source, I didn’t bother explicitly telling Vim to look for *numbers* – I just used “.” to say “find any character”. If I had less confidence in the data I would’ve used “\d” to make sure I had numeric digits there.
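To show what the stricter “\d” approach could look like, here is a hedged Python sketch that does the whole conversion in one pass and refuses any line that is not purely numeric where it should be. The names `ROW` and `convert` are hypothetical, and the year is passed in because the source data does not contain it.

```python
import re

# hits, month/day, hour:minute, with digits required everywhere.
ROW = re.compile(r"^(\d+) (\d{1,2})/(\d{1,2}) (\d{1,2}):(\d{1,2})$")


def convert(line: str, year: int = 2008) -> str:
    """Convert 'hits M/D H:M' into 'YYYY-MM-DDTHH:MM hits'."""
    match = ROW.match(line.strip())
    if match is None:
        # Fail loudly instead of silently producing a bad timestamp.
        raise ValueError(f"unexpected line format: {line!r}")
    hits, month, day, hour, minute = match.groups()
    stamp = f"{year}-{int(month):02d}-{int(day):02d}T{int(hour):02d}:{int(minute):02d}"
    return f"{stamp} {hits}"


print(convert("105 4/15 16:54"))  # 2008-04-15T16:54 105
print(convert("88 4/15 9:5"))     # 2008-04-15T09:05 88
```

The trade-off is the usual one: the “.” patterns are quicker to type when you can eyeball the data, while the digit-only pattern catches surprises at the cost of a little more ceremony.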

Of course, the better solution is to properly parse the log file in the first place, but the log file in this case was 25GB!! Of course I’ll go back and change my script (I used loghetti with a custom (read: flawed) output filter), and test it on smaller data, and eventually get it to be more reliable, but to get a quick Timeplot graph together, this was a fast, if iterative and somewhat annoying, way to go. It also gave me a chance to exercise my Vim search and replace skillz.
