Explosion at The Planet Causes 9000-server Outage

Here’s the email I received on Saturday from The Planet, where I have some dedicated servers hosted:

Dear Valued Customers:
This evening at 4:55 in our H1 data center, electrical gear shorted, creating an explosion and fire that knocked down three walls surrounding our electrical equipment room. Thankfully, no one was injured. In addition, no customer servers were damaged or lost.
We have just been allowed into the building to physically inspect the damage. Early indications are that the short was in a high-volume wire conduit. We were not allowed to activate our backup generator plan based on instructions from the fire department.
This is a significant outage, impacting approximately 9,000 servers and 7,500 customers. All members of our support team are in, and all vendors who supply us with data center equipment are on site. Our initial assessment, although early, points to being able to have some service restored by mid-afternoon on Sunday. Rest assured we are working around the clock.
We are in the process of communicating with all affected customers. We are planning to post updates every hour via our forum and in our customer portal. Our interactive voice response system is updating customers as well.
There is no impact in any of our other five data centers.
I am sorry that this accident has occurred and apologize for the impact.

That’s pretty rough. Lucky for me, nothing I do is solely dependent on any of those machines. However, I think it’s probably pretty common for startups to rely heavily on what amount to single points of failure, due to a lack of funds, manpower, or in-house skills to set things up in a way that looks something like “best practices” from an architecture standpoint.

Building a startup on dedicated hosting is one thing. Running a production web site at a single site, perhaps with a single machine hosting each service (or even a few or *all* of the services), is something dangerously different. However, building a failover solution that spans outages at the hosting-provider level can also be quite difficult, and perhaps expensive. You really have to come up with hard numbers that let you gauge your downtime tolerance against your abilities and budget. While a restrictive budget might mean less automation and more humans involved in failover, it can be done.
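To make that concrete, here’s a back-of-the-envelope sketch of the kind of arithmetic I mean. Every number in it is made up, so plug in your own revenue figures, outage estimates, and hosting quotes:

```python
# Rough annual comparison: expected downtime cost vs. the cost of a second site.
# All numbers below are placeholders, not recommendations.

hourly_downtime_cost = 500.0      # revenue (or goodwill) lost per hour of downtime
expected_outage_hours = 30.0      # hours per year a single-site setup might be down
failover_coverage = 0.8           # fraction of that downtime a second site would actually cover
second_site_annual_cost = 9000.0  # yearly cost of the extra site, bandwidth, and people

downtime_cost = hourly_downtime_cost * expected_outage_hours
avoided_cost = downtime_cost * failover_coverage

print(f"Expected annual downtime cost:  ${downtime_cost:,.0f}")
print(f"Cost avoided by failing over:   ${avoided_cost:,.0f}")
print(f"Net benefit of the second site: ${avoided_cost - second_site_annual_cost:,.0f}")
```

If the net benefit comes out negative, a cheaper and more manual failover plan (or simply tolerating some downtime) may be the rational choice.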

What kinds of cross-provider failover solutions have saved your bacon in the past? I’m always looking for new techniques in this problem domain, so share your ideas and links!


  • dublpaws

    IT Conversations has had several podcasts about Amazon’s web services. I get the impression that short of nuclear war (maybe even withstanding it) their service would remain intact. I’d expect the same from Google’s App Engine, when it goes prime time.

  • Jason

    An issue was brought up to me recently regarding downtime from catastrophic events.

    What if a truck with toxic chemicals has a wreck and spills, or one just catches on fire or explodes?

    Most current policies will force a closing down of the road (perhaps a freeway) and surrounding area within X radius (I don’t remember that part).

    What will that do to logistics related to your datacenter if it is in that zone?

    What about having a datacenter located just outside of that radius, but close enough to gain the benefit of the freeway? Alternate routes could (should) be available, so traffic to the datacenter wouldn’t be shut down.

    What are the reasons for locating datacenters in metropolitan centers instead of in suburbs? What disadvantages exist for this centralized location?

  • m0j0

    These are excellent points, Jason. I remember working in NYC (in the World Trade Center, in fact, pre-9/11), where we were planning on opening a datacenter just a few hundred yards away in NJ. Turns out it’s on a completely different power grid. Having a data center with freeway access but which lives on the same power grid may not solve the whole problem, but it might allow service to continue in a particular circumstance like the one you describe.

    Likewise, I think the notion of having data centers in urban areas is perceived as favorable to businesses because of things like an available labor pool and infrastructure: transportation, power, and public services like fire and law enforcement. Going to the suburbs might be perceived as a greater risk on all of those points. Fire might be volunteer-only, or utilities like power might be flakier. I’m guessing here, of course.

    I think Google has probably proven that you can make a successful go of it if you’re willing to put in the effort and money, though. They opened two data centers in South Carolina in what I gather are suburban areas outside of Charleston (I found them while checking out Google’s job postings, and I’ve always wanted to get out of NJ). If Google had a Princeton-NJ-area datacenter (and they could do it if they wanted), I might actually consider jumping through all of those crazy interviews.

    I personally think the labor pool argument is getting a bit less valid. I can’t imagine what would be better than to live *and* work in the suburbs. Better yet, how about making it possible for a larger part of your work force to be remote? I work remotely for several clients, and it works well for me and my clients.

  • http://standalone-sysadmin.blogspot.com Matt Simmons

    Right now, we’re colocating for our primary site, and hosting our own backup site. This is Step 2. Step 1 was hosting our primary site and not having a real backup site.

    Step 3 is going to be colocating both primary and backup with different colocation companies.

    Step 4 is going to be colocating both primary and backup with the same colocation company, but in two sites separated by a couple hundred miles. The benefit here is mostly operational, since we can get a piece of the huge pipe between the two sites for (relative) pennies.

    Failover right now is a manual DNS switchover, plus alerting our clients to go to the second site directly until the DNS change reaches them (see the sketch below).

    Long term, we’re looking at global server load balancing (GSLB) solutions. I’ve also considered buying our own address block and dual-homing with BGP. I don’t really know of any other solutions, to tell you the truth. I’m open to further suggestions from anyone.
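
    For what it’s worth, here’s a minimal sketch of what scripting that manual DNS switchover could look like using dnspython, assuming the zone lives on a BIND-style server that accepts TSIG-signed dynamic updates. The zone name, key, and addresses are all placeholders, not real infrastructure:

    ```python
    # Sketch of a manual DNS failover step: repoint the www A record at the
    # backup site with a short TTL so clients pick up the change quickly.
    # Assumes a nameserver that accepts TSIG-signed dynamic updates.
    import dns.query
    import dns.tsigkeyring
    import dns.update

    PRIMARY_IP = "203.0.113.10"   # placeholder address of the primary site
    BACKUP_IP = "198.51.100.20"   # placeholder address of the backup site
    NAMESERVER = "192.0.2.1"      # placeholder master server taking updates

    keyring = dns.tsigkeyring.from_text({
        "failover-key.": "c2VjcmV0LWtleS1nb2VzLWhlcmU="  # placeholder TSIG secret
    })

    def point_www_at(address):
        """Replace the www A record in example.com with the given address."""
        update = dns.update.Update("example.com.", keyring=keyring)
        update.replace("www", 60, "A", address)  # 60-second TTL
        response = dns.query.tcp(update, NAMESERVER)
        print(f"Update sent, rcode: {response.rcode()}")

    if __name__ == "__main__":
        # In practice this would be triggered by a health check or a human
        # decision; here we just flip to the backup site.
        point_www_at(BACKUP_IP)
    ```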