Posts Tagged ‘architecture’

Explosion at The Planet Causes 9000-server Outage

Monday, June 2nd, 2008

Here’s the email I received on Saturday from The Planet, where I have some dedicated servers hosted:

Dear Valued Customers:
This evening at 4:55 in our H1 data center, electrical gear shorted, creating an explosion and fire that knocked down three walls surrounding our electrical equipment room. Thankfully, no one was injured. In addition, no customer servers were damaged or lost.
We have just been allowed into the building to physically inspect the damage. Early indications are that the short was in a high-volume wire conduit. We were not allowed to activate our backup generator plan based on instructions from the fire department.
This is a significant outage, impacting approximately 9,000 servers and 7,500 customers. All members of our support team are in, and all vendors who supply us with data center equipment are on site. Our initial assessment, although early, points to being able to have some service restored by mid-afternoon on Sunday. Rest assured we are working around the clock.
We are in the process of communicating with all affected customers. We are planning to post updates every hour via our forum and in our customer portal. Our interactive voice response system is updating customers as well.
There is no impact in any of our other five data centers.
I am sorry that this accident has occurred and apologize for the impact.

That’s pretty rough. Lucky for me, nothing I do is solely dependent on any of those machines. However, I think it’s pretty common for startups to rely heavily on what amount to single points of failure, because they lack the funds or the in-house skills to set things up in a way that looks something like “best practices” from an architecture standpoint.

Building a startup on dedicated hosting is one thing. Running a production web site from a single location, perhaps with a single machine hosting each service (or even a few or *all* of the services), is something dangerously different. However, building a failover solution that survives an outage at the hosting-provider level can be difficult and expensive. You need hard numbers to weigh your downtime tolerance against your abilities and budget. A restrictive budget might mean less automation and more humans involved in failover, but it can be done.
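To make the "hard numbers" part a bit more concrete, here is a rough Python sketch of the arithmetic involved. The function names are made up for illustration, and the redundancy math assumes that sites fail independently and that failover is instantaneous, which is optimistic when humans are doing the failover by hand.

```python
# Rough sketch: turning availability targets into downtime budgets.
# All names here are hypothetical helpers, not part of any library.

HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def allowed_downtime_hours(availability_pct: float) -> float:
    """Hours per year a service may be down and still meet the target."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


def serial_availability(*components_pct: float) -> float:
    """Availability (%) of a chain in which every component must be up."""
    result = 1.0
    for pct in components_pct:
        result *= pct / 100
    return result * 100


def redundant_availability(site_pct: float, sites: int) -> float:
    """Availability (%) when any one of N sites can serve all traffic.

    Assumes independent failures and instantaneous failover.
    """
    return (1 - (1 - site_pct / 100) ** sites) * 100


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        hours = allowed_downtime_hours(target)
        print(f"{target}% uptime allows about {hours:.2f} hours down per year")
    two_sites = redundant_availability(99.0, 2)
    print(f"Two independent 99% sites: about {two_sites:.2f}% combined")
```

Comparing the downtime budget for your target against what a second provider would cost is one way to decide whether automated failover is worth it, or whether a documented manual procedure is good enough for now.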

What kinds of cross-provider failover solutions have saved your bacon in the past? I’m always looking for new techniques in this problem domain, so share your ideas and links!

Scalability Best Practices: eBay

Thursday, May 29th, 2008

Following a link from the High Scalability blog, I found this really great article about scalability practices, as told by Randy Shoup at eBay. Randy is very good at explaining some of the more technical aspects in more or less plain English, and it even helped me find some wording I was looking for to help me explain the notion (and benefits) of functional partitioning. He also covers ideas that apply directly to your application code, your database architecture (including a little insight into their sharding strategy), and more. Even more about eBay’s architecture can be found here.
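For anyone who wants a picture of what functional partitioning plus sharding can look like, here is a minimal Python sketch. It is not eBay's actual scheme: the pool names, host names, and the hash-modulo routing are all assumptions made up for illustration.

```python
import hashlib

# Hypothetical layout: each functional area (users, items) gets its own
# pool of databases, and each pool is split into shards.
POOLS = {
    "users": ["users-db-0", "users-db-1"],
    "items": ["items-db-0", "items-db-1", "items-db-2"],
}


def route(function: str, key: str) -> str:
    """Pick the database host for a record.

    The functional area selects the pool (functional partitioning);
    a hash of the key selects the shard inside that pool (sharding).
    """
    hosts = POOLS[function]
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return hosts[int(digest, 16) % len(hosts)]


if __name__ == "__main__":
    print(route("users", "alice"))
    print(route("items", "item-12345"))
```

One caveat with simple modulo routing like this: adding a host to a pool changes where most keys land, so real systems tend to use a lookup directory or consistent hashing instead.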