Musings of an Anonymous Geek

Made with only the finest 1's and 0's

Hadoop, EC2, S3, and me

Posted on March 20, 2008 by bkjones

I’m playing with a lot of rather large data sets. I was recently informed that these data sets are child’s play, because I’ve only been exposed to the outermost layer of the onion. The amount of data I *will* have access to (a nice way of saying “I’ll be required to wrangle and munge”) is many times bigger. Someone read an article about how easy it is to get Hadoop up and running on Amazon’s EC2 service, and next thing you know, there’s an email saying “hey, we can move this data to S3, access it from EC2, run it through that cool Python code you’ve been working with, and distribute the processing through Hadoop! Yay! And it looks pretty straightforward! Get on that!”

Oh joyous day.

I’d like to ask that people who find success with Hadoop+EC2+S3 stop writing documentation that makes this procedure appear to be “straightforward”. It’s not.

One thing that *is* cool, for Python programmers, is that you actually don’t have to write Java to use Hadoop. You can write your map and reduce code in Python and use it just fine.
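Here is a minimal sketch of what Python map and reduce code can look like. It assumes the scripts are run through Hadoop Streaming, which feeds each stage lines on stdin and collects tab-separated key/value lines from stdout; that mechanism is my assumption, not something spelled out above. The job itself (counting hits per client by the first field of each log line) and the names `map_lines`, `reduce_lines`, and `STAGES` are hypothetical, chosen only for illustration.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit 'key<TAB>1' using the first whitespace-separated field."""
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            yield f"{fields[0]}\t1"


def reduce_lines(lines):
    """Reducer: sum the counts per key. Assumes the input is sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(count) for _, count in group)}"


# Hypothetical convention: pass "map" or "reduce" as the only argument.
STAGES = {"map": map_lines, "reduce": reduce_lines}

if __name__ == "__main__":
    stage = STAGES.get(sys.argv[1]) if len(sys.argv) > 1 else None
    if stage:
        sys.stdout.writelines(out + "\n" for out in stage(sys.stdin))
```

Because both stages only read and write text streams, the logic can be checked on a laptop before any cluster is involved, for example by piping a log file through the map stage, then `sort`, then the reduce stage.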

I’m not blaming Hadoop or EC2 really, because after a full day of banging my head on this I’m still not quite sure which one is at fault. I *did* read a forum post from someone who had a problem similar to the one I wound up with, and it turned out to be a bug in Amazon’s SOAP API, which is used by the Amazon EC2 command line tools. So things just don’t work when that happens. Tip 1: if you have an issue, don’t assume you’re the one missing something. Bugs appear to be fairly common.

Ok, so tonight I decided “I’ll just skip the whole Hadoop thing, and let’s see how loghetti runs on some bigger iron than my MacBook Pro”. I moved a test log to S3, fired up an EC2 instance, ssh’d right in, and there I am… no data in sight, and no obvious way to get at it. This surprised me, because I thought that S3 and EC2 were much more closely related. After all, Amazon Machine Images (used to fire up said instance) are stored on S3. So where’s my “s3-copy” command? Or better yet, why can’t I just *mount* an S3 volume without having to install a bunch of stuff?

This goes down as one of the most frustrating things I’ve ever had to set up. It kinda reminds me of the time I had to set up a Beowulf cluster of about 85 nodes using donated, out-of-warranty PC hardware. I spent what seemed like months just trying to get the thing to boot. Once I got over the hump, it ran like a top, but it was a non-trivial hump.

As of now, it looks like I’ll probably need to actually install my own image. A good number of the available public images are older versions of Linux distros for some reason. Maybe people have orphaned them and gone to greener pastures. Maybe they’re in production and haven’t seen a need to change them in any way. I’ll register a clean install image with the stuff I need and trudge onward.
