So, I had some data in a file. It was temporal data. It looked like this:
100 4/15 16:50
143 4/15 16:51
121 4/15 16:52
209 4/15 16:53
105 4/15 16:54
321 4/15 16:55
173 4/15 16:56
205 4/15 16:57
197 4/15 16:58
211 4/15 16:59
But I needed it to be in ISO 8601 format so I could plot it with Timeplot. The data represents hits per minute from an Apache log file. I also needed the time to show up in the first column and the hits in the second column. Here’s what I needed the data to look like:
2008-04-15T16:54 105
2008-04-15T16:55 321
2008-04-15T16:56 173
2008-04-15T16:57 205
2008-04-15T16:58 197
2008-04-15T16:59 211
Well, I knew that the dates I had in the file were from 2008, and all of the other bits are there, just in the wrong format. Here’s what I did in Vim to get things in the right format for Timeplot. (The part after each # is my annotation, not part of the command, so leave it off when typing these into Vim.)
:%s/4\//2008-04-/g # search for “4/” and replace it with “2008-04-”
Now my data looks like this:
173 2008-04-15 16:56
But not all of the minutes were two digits for some reason (I don’t remember how I parsed the log to get into this state – it was hurried and… well… wrong). I had times that looked like “17:9” so I had to zero-pad the minutes that were only single digits.
:%s/:\(.\)$/:0\1/g # find a “:” followed by a single character at the end of the line, and replace that with “:0” followed by whatever that character was.
So now my minutes look right.
144 2008-04-15 16:09
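If you’d rather script that step than do it in Vim, here’s a minimal Python sketch of the same substitution. The function name pad_minutes is something I made up for illustration, and I’ve assumed the stray character is always a digit, so it uses \d where the Vim command used “.”.

```python
import re

def pad_minutes(line):
    # pad_minutes is a hypothetical helper name, not part of any library.
    # A ":" followed by exactly one digit at the end of the line becomes
    # ":" + "0" + that digit; two-digit minutes are left alone.
    return re.sub(r":(\d)$", r":0\1", line)

print(pad_minutes("144 2008-04-15 16:9"))
print(pad_minutes("173 2008-04-15 16:56"))
```

The “$” anchor does the same job as in the Vim pattern: it keeps the match on the minutes at the end of the line, away from anything earlier in it.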
Now I needed to replace spaces between the date and time values with a “T” as per ISO 8601 rules for date and time representations in a single string:
:%s/\(-..\)\s/\1T/g # find a “-” followed by any two characters and then a whitespace character, and replace the whitespace with a “T”, keeping the “-” and those two characters as they were.
That worked well.
213 2008-04-15T16:45
At this point I had everything knocked, but I forgot that some of my *hours* were also single digits :-/
:%s/T\(.\):/T0\1:/g # find a “T” followed by a single character and a “:”, and replace that with “T0”, whatever that character was, and the “:”.
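In hindsight, padding the hours and the minutes in one go would have saved me the second pass. Here’s a rough Python sketch of that idea, assuming the time is always two numeric fields separated by a colon. The name pad_time is hypothetical, just for this example.

```python
def pad_time(hhmm):
    # pad_time is a made-up helper name for illustration.
    # Split "H:M" into its two fields and zero-pad each to two digits.
    hours, minutes = hhmm.split(":")
    return f"{int(hours):02d}:{int(minutes):02d}"

print(pad_time("9:5"))
print(pad_time("17:9"))
print(pad_time("16:56"))
```

Converting each field to an integer first means a non-numeric field raises an error instead of being silently padded, which is probably what you want with data you don’t fully trust.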
There. That did it. Now I just need to comma-separate the values, which is simple after all of this nonsense:
:%s/ /,/g # c’mon, you get this one, right?
Great! Except that the datetime string needs to be the *first* column. Here’s where awk comes in handy:
cat hitspermin_bad.txt | awk -F, '{print $2,$1}' > hitspermin_good.txt
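One thing worth noting: the comma between $2 and $1 in awk’s print joins the fields with awk’s default output separator, which is a single space, so the commas only exist to give awk something to split on. As a sketch of what that line does to each row, here is a Python equivalent. The function name swap_columns is my own invention, and it assumes each row has exactly one comma.

```python
def swap_columns(line):
    # swap_columns is a hypothetical name used only in this example.
    # Split "hits,timestamp" on the comma and emit "timestamp hits",
    # space-separated, the way awk's print joins fields by default.
    hits, timestamp = line.strip().split(",")
    return f"{timestamp} {hits}"

print(swap_columns("105,2008-04-15T16:54"))
```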
You’ll notice that, since I could see the data and know the source, I didn’t bother explicitly telling Vim to look for *numbers* – I just used “.” to say “find any character”. If I had less confidence in the data I would’ve used “\d” to make sure I had numeric digits there.
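To show the difference that makes, here’s a small Python sketch comparing the two patterns on a deliberately broken row. The row is made up for illustration; nothing like it was in my actual data as far as I know.

```python
import re

loose = re.compile(r":(.)$")    # "." matches any single character
strict = re.compile(r":(\d)$")  # "\d" matches only a numeric digit

# Hypothetical malformed row: the minutes field is junk, not a digit.
bad_line = "144 2008-04-15 16:-"

print(loose.sub(r":0\1", bad_line))   # pads the junk character anyway
print(strict.sub(r":0\1", bad_line))  # no match, line comes back unchanged
```

With the strict pattern the bad row survives untouched, so it is still recognizably wrong when you go looking for it later.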
Of course, the better solution is to properly parse the log file in the first place, but the log file in this case was 25GB!! Of course I’ll go back and change my script (I used loghetti with a custom (read: flawed) output filter), and test it on smaller data, and eventually get it to be more reliable, but to get a quick Timeplot graph together, this was a fast, if iterative and somewhat annoying, way to go. It also gave me a chance to exercise my Vim search and replace skillz.
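For what it’s worth, the whole cleanup could also be done in one pass with a few lines of Python. This is only a sketch, not what I actually ran: the function name convert_line is made up, and it assumes every row has the “hits month/day hour:minute” shape shown at the top and that every date is from 2008.

```python
from datetime import datetime

def convert_line(line, year=2008):
    # convert_line is a hypothetical helper; the year is an assumption,
    # since the original data carries no year at all.
    # Input looks like "105 4/15 16:54": hits, month/day, hour:minute.
    hits, date, time = line.split()
    # strptime accepts unpadded fields such as "17:9", so there is no
    # separate zero-padding step.
    stamp = datetime.strptime(f"{year} {date} {time}", "%Y %m/%d %H:%M")
    # Timestamp first, hits second, separated by a space.
    return f"{stamp.strftime('%Y-%m-%dT%H:%M')} {hits}"

for sample in ["105 4/15 16:54", "144 4/15 17:9"]:
    print(convert_line(sample))
```

Letting a date parser handle the fields also means a row with an impossible time, say a minute of 75, raises an error instead of sliding through into the plot.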