I started a new job about six weeks ago. I’m now doing infrastructure architecture at http://gfdl.noaa.gov.
GFDL stands for Geophysical Fluid Dynamics Laboratory. It’s a NOAA site that supports atmospheric and climate research. In other words, the work I do supports research into things ranging from global warming, to what the atmosphere on Mars is like, to the weather here on Earth, to simulations of the shape and movement of Hurricane Katrina. I think of it as sort of an Institute for Advanced Study devoted to climate research. Great minds in the field are here.
The research actually takes place at three different sites (DC, Boulder, and Princeton), and affiliations with academic institutions flourish as well. In fact, I already knew at least four people who work here through interactions between this site and cs.princeton.edu, my former employer.
My job, as it’s been described to me, is to provide a vision for the design and direction of the infrastructure that supports the rather enormous high-performance computing (HPC) cluster. There’s something of a learning curve: understanding what’s here, how the systems are used, what the needs are, what people like and hate, where the redundancies and inefficiencies exist, and so on. It also involves meeting and coordinating with the people who manage the network, the facilities (power, cooling, and the like), and the security policy. I’ll be grilled on my ideas, and I’ll create prototypes and demos to get them across. Lots of communication.
Part of my job will also involve getting my hands on the HPC clusters themselves, which exist at each site. All of the clusters were on top500.org last time I looked; just go through the pages and search for GFDL and/or NOAA.
The systems here are all Linux. Even the standard-issue workstations run Linux. Scripting is done in Perl and shell, but Python is everywhere, so I’ll be doing either Perl or Python when I have the choice (because “shell” == “csh” here, which I never took to well, honestly). Some aspects of the environment are pretty fascinating. For example, how exactly do you store (*and* easily retrieve, on the fly) 9 PETABYTES of data? How do you back that up? How do you recover from hiccups? How do you instrument systems consisting of thousands of CPUs to pinpoint problems and get them fixed? And, by the way, what’s the best way to tune a system’s network stack to use a 50 MBps pipe (that’s Mega*bytes*) efficiently enough to move multiple terabytes of data every day between collaborators at different sites? How, exactly, do you consolidate services and provide failover across geographically dispersed sites?
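Just to make that last bandwidth question concrete, here’s the kind of back-of-the-envelope math involved, sketched in Python (which, as I said, is everywhere here). The 50 MB/s figure is the one above; the round-trip time and the daily data volume are numbers I’ve made up purely for illustration, not measurements from our actual links.

    #!/usr/bin/env python
    # Back-of-the-envelope TCP buffer sizing for a long, fat pipe.
    # The 50 MB/s figure comes from the text; the RTT and daily volume
    # below are invented for illustration only.

    link_bytes_per_sec = 50 * 1024 * 1024   # 50 megabytes per second
    rtt_sec = 0.070                          # assumed round-trip time (e.g. coast to coast)

    # Bandwidth-delay product: how much data has to be "in flight" to keep
    # the pipe full.  Default TCP socket buffers (often 64-256 KB) fall well
    # short of this, so the kernel's buffer limits have to be raised.
    bdp = link_bytes_per_sec * rtt_sec
    print("Bandwidth-delay product: %.1f MB of buffer needed" % (bdp / (1024.0 * 1024)))

    # Rough transfer time for a day's worth of data, assuming the pipe
    # actually stays full the whole time.
    terabytes_per_day = 3                    # hypothetical daily volume
    seconds = terabytes_per_day * 1024.0 ** 4 / link_bytes_per_sec
    print("Moving %d TB takes about %.1f hours at line rate" % (terabytes_per_day, seconds / 3600))

The real answer involves actually setting those buffer sysctls and measuring, of course, but the arithmetic alone shows why “just copy the files” doesn’t cut it at this scale.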
So that’s it for now :) It’s too early to tell how things are going, really. It’s certainly not the cushy environment that Princeton U. was, but there are bigger challenges and problems to be solved here, and that’s the part I’m looking forward to.