Skip to content

Musings of an Anonymous Geek

Made with only the finest 1's and 0's

Menu
  • About
  • Search Results
Menu

CodeKata 4: Data Munging

Posted on December 28, 2009 by bkjones

I’m continuing to take on the items in Dave Thomas’s Code Kata collection. It’s a nice way to spend a Sunday night, and it’s a good way to get my brain going again before work on Monday morning. It’s also fun and educational 🙂

CodeKata 4 is called “Data Munging”. It’s not very difficult data munging, really. I think the more interesting bit of the Kata is how to deal with files in different languages, what tools are well suited to the task, and trying to minimize code duplication.

Description

Code Kata 4 instructs us to download weather.dat, a listing of weather information for each day of June, 2002 in Morristown, NJ. We’re to write a program that identifies the day which has the smallest gap between min and max temperatures.

Then, it says to download football.dat, a listing of season statistics for Soccer teams. We’re to write a program that identifies the team with the smallest gap between goals scored for the team, and goals scored against the team.

Once those are done, we’re asked to try to factor out as much duplicate code as possible between the two programs, and then we’re asked a few questions to help us think more deeply about what just transpired.

My Attack Plan

The first thing that jumped into my brain when I saw the data files was “Awk would be perfect for this”. I fiddled with that for a little too long (my awk is a little rusty), and came up with this (for weather.dat):

>$ awk 'BEGIN {min=100000}; $1 ~ /^[1-3]/ {x[$1]=$2-$3; if (x[$1]<min){ min=x[$1]; winner=$1}} END {print winner, min} ' weather.dat.dat
14 2

It works, and awk, though it gets ugly to some, reads in a nice, linear way to me. You give it a filter expression, and then statements to act on the matching lines (in braces). What could be simpler?

After proving to myself that I hadn’t completely lost my awk-fu, I went about writing a Python script to deal with the problem. I read ahead in the problem description, though, and so my script contains separate blocks for the two data sets in one script:

#!/usr/bin/env python
import sys
import string

data = open(sys.argv[1], 'r').readlines()
data.sort()

if 'weather' in sys.argv[1]:
   winner = 1000000
   winnerday = None

   for line in data:
      #filter out lines that aren't numbered.
      if line.strip().startswith(tuple(string.digits)):
         # we only need the first three fields to do our work
         l = line.split()[:3]
         # some temps have asterisks attached to them.
         maxt = l[1].strip(string.punctuation)
         mint = l[2].strip(string.punctuation)
         diff = int(maxt) - int(mint)
         if diff < winner:
            winner = diff
            winnerday = l[0]
   print "On day %s, the temp difference was only %d degrees!" % (winnerday, winner)

if 'football' in sys.argv[1]:
   winner = 1000000
   winnerteam = None

   for line in data:
      if line.strip().startswith(tuple(string.digits)):
         l = line.split()
         team, f, a = l[1], int(l[6]), int(l[8])
         diff = abs(f - a)
         if diff < winner:
            winner = diff
            winnerteam = team
   print "Team %s had a for/against gap of only %d points!" % (winnerteam, winner)

Really, the logic employed is not much different from the awk solution:

  1. Set a default for ‘winner’ that’s unlikely to be rivaled by the data 🙂
  2. Set the default for the winning team or day to ‘None’
  3. Filter out unwanted lines in the dataset.
  4. Grab bits of each line that are useful.
  5. Assign each useful bit to a variable.
  6. Do math.
  7. Do comparisons
  8. Present results.

Refactoring

Part 2 of the Kata says to factor out as much duplicate code as possible. I was able to factor out almost all of it on the first shot at refactoring, leaving only the line of code (per file) to identify the relevant columns of each data set:

#!/usr/bin/env python
import sys
import string

data = open(sys.argv[1], 'r').readlines()
data.sort()
winner_val = 1000000
winner_id = None

for line in data:
   if line.strip().startswith(tuple(string.digits)):
      l = line.split()
      if 'weather' in sys.argv[1]:
         identifier, minuend, subtrahend = l[0], int(l[1].strip(string.punctuation)), int(l[2].strip(string.punctuation))
      elif 'football' in sys.argv[1]:
         identifier, minuend, subtrahend = l[1], int(l[6]), int(l[8])
      diff = abs(minuend - subtrahend)
      if diff < winner_val:
         winner_val = diff
         winner_id = identifier

print winner_id, winner_val

Not too bad. I could’ve done some other things to make things work differently: for example I could’ve let the user feed in the column header names of the ‘identifier’, ‘minuend’ and ‘subtrahend’ columns in each data set, and then I could just *not* parse out the header line and instead use it to identify the list index positions of the bits I need for each line. It’d make the whole thing ‘just work’. It would also require more effort from the user. On the other hand, it would make things “just work” for just about any file with numbered lines of columnar data.

I have to admit that the minute I see columnar data like this, awk is the first thing I reach for, so I’m sure this affected my Python solution. The good news there is that my thinking toward columnar data is consistent, and so I treated both files pretty much the same way, making refactoring a 5-minute process.

In all, I enjoyed this Kata. Though I didn’t take it as far as I could have, it did make me think about how it could be improved and made more generic. Those improvements could incur a cost in terms of readability I suppose, but I think for this example it wouldn’t be a problem. I’m working on a larger project now where I have precisely this issue of flexibility vs. readability though.

I’m reengineering a rather gangly application to enable things like pluggable… everything. It talks to networked queues, so the protocol is pluggable. It talks to databases, so the database back end is pluggable, in addition to the actual data processing routines. Enabling this level of flexibility introduces some complexity, and really requires good documentation if we reasonably expect people to work with our code (the project should be released as open source in the coming weeks). Without the documentation, I’m sure I’d have trouble maintaining the code myself!

Share this:

  • Click to share on Twitter (Opens in new window)
  • Click to share on Reddit (Opens in new window)
  • Click to share on Tumblr (Opens in new window)
  • Click to share on Facebook (Opens in new window)

Contact Me

You should follow me on Twitter

Recent Posts

  • User Activation With Django and Djoser
  • Python Selenium Webdriver Notes
  • On Keeping A Journal and Journaling
  • What Geeks Could Learn From Working In Restaurants
  • What I’ve Been Up To
  • PyCon Talk Proposals: All You Need to Know And More
  • Sending Alerts With Graphite Graphs From Nagios
  • The Python User Group in Princeton (PUG-IP): 6 months in
  • The Happy Idiot
  • pyrabbit Makes Testing and Managing RabbitMQ Easy

Categories

  • Apple
  • Big Ideas
  • Books
  • CodeKata
  • Database
  • Django
  • Freelancing
  • Hacks
  • journaling
  • Leadership
  • Linux
  • LinuxLaboratory
  • Loghetti
  • Me stuff
  • Other Cool Blogs
  • PHP
  • Productivity
  • Python
  • PyTPMOTW
  • Ruby
  • Scripting
  • Sysadmin
  • Technology
  • Testing
  • Uncategorized
  • Web Services
  • Woodworking

Archives

  • May 2021
  • December 2020
  • January 2014
  • September 2012
  • August 2012
  • February 2012
  • November 2011
  • October 2011
  • June 2011
  • April 2011
  • February 2011
  • January 2011
  • December 2010
  • November 2010
  • September 2010
  • July 2010
  • June 2010
  • May 2010
  • April 2010
  • March 2010
  • February 2010
  • January 2010
  • December 2009
  • November 2009
  • October 2009
  • September 2009
  • August 2009
  • July 2009
  • June 2009
  • May 2009
  • April 2009
  • March 2009
  • February 2009
  • January 2009
  • December 2008
  • November 2008
  • October 2008
  • September 2008
  • August 2008
  • July 2008
  • June 2008
  • May 2008
  • April 2008
  • March 2008
  • February 2008
  • January 2008
  • December 2007
  • November 2007
  • October 2007
  • September 2007
  • August 2007
  • July 2007
  • June 2007
  • May 2007
  • April 2007
  • March 2007
  • February 2007
  • January 2007
  • December 2006
  • November 2006
  • September 2006
  • August 2006
  • July 2006
  • June 2006
  • April 2006
  • March 2006
  • February 2006
  • January 2006
  • December 2005
  • November 2005
  • October 2005
  • September 2005
  • August 2005
  • July 2005
  • June 2005
  • May 2005
  • April 2005
  • March 2005
  • February 2005
  • January 2005
  • December 2004
  • November 2004
  • October 2004
  • September 2004
  • August 2004
© 2023 Musings of an Anonymous Geek | Powered by Minimalist Blog WordPress Theme