Amazon Outage Stratifies the Cloud

Last week Amazon’s popular AWS cloud computing service suffered an unprecedented multi-day outage. The outage brought down thousands of websites, including popular websites such as Quora, Reddit and FourSquare, and generated coverage from mainstream publications such as the New York Times and the Wall Street Journal.

While many are quick to point to the outage as a sign that cloud computing is unreliable and not ready for mission-critical applications, the outage has simply brought a reality of both on-premise and cloud computing to light: systems fail, and mission-critical applications need to be designed to expect failure.

The media has focused on the outage and the websites that it brought down, but more notable in my mind are the high-traffic, high-profile websites hosted on Amazon’s cloud that sailed through the outage, completely unaffected by the problems in their underlying cloud computing infrastructure.

Netflix, for one, hosts its entire video streaming service on Amazon’s cloud, and remained 100% available during the outage not because of good luck, but because of good application design. Netflix appreciates that, while Amazon offers a 99.95% uptime guarantee, the 0.05% of downtime has to be anticipated not as a negligible, low-probability risk, but rather as a foregone eventuality. Netflix has designed its systems to be resilient to all kinds of unpredictable failures, even going to the extent of integrating a so-called “Chaos Monkey” into its systems that continually and randomly crashes parts of Netflix’s infrastructure. Netflix knows that by designing its systems to withstand the vagaries of a “Chaos Monkey”, its systems will be all the more resilient to true outages. That engineering effort has clearly paid off.
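To put that 0.05% in perspective, here is a rough back-of-the-envelope calculation. It assumes the guarantee is measured over a full 365-day year, which is my simplification and not necessarily how Amazon's agreement defines it; the function name is purely illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(uptime_percent: float) -> float:
    """Hours per year a service can be down and still meet the given uptime percentage."""
    return (1 - uptime_percent / 100) * HOURS_PER_YEAR


# Downtime budget implied by a 99.95% guarantee (assumed to be measured annually).
budget = downtime_hours_per_year(99.95)
print(f"{budget:.2f} hours per year")
```

Under that assumption the budget works out to a little over four hours a year, which is why it makes sense to treat downtime as a certainty to design for and not as a remote possibility.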

Smugmug, a popular photo-sharing service, is also hosted on Amazon’s cloud. Like Netflix, Smugmug continued operating normally during Amazon’s outage by anticipating failures of nearly every type. Smugmug’s CEO, Don MacAskill, chronicles the lengths his company has gone to in anticipating and withstanding failures in Amazon’s cloud in a detailed blog post, and like Netflix, he builds resiliency by constantly causing deliberate failures:

I regularly kill off stuff on [Amazon’s cloud] just to see what’ll happen. I found and fixed a rare bug related to this over the weekend, actually, that’d been live and in production for quite some time. Verify your slick new eventually consistent datastore is actually eventually consistent. Ensure your amazing replicator will actually replicate correctly or allow you to rebuild in a timely fashion. Start by doing these tests during maintenance windows so you know how it works. Then, once your system seems stable enough, start surprising your Ops and Engineering teams by killing stuff in the middle of the day without warning them. They’ll love you.
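The "kill stuff and see what happens" approach can be illustrated with a toy, in-memory sketch. This is not Netflix's Chaos Monkey or Smugmug's tooling; the `Replica`, `Service`, and `kill_random_replica` names are hypothetical, and a real test would terminate actual servers instead of flipping a flag.

```python
import random


class Replica:
    """A stand-in for one server; `alive` simulates whether it is running."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


class Service:
    """Routes each request to the first replica that responds."""

    def __init__(self, replicas):
        self.replicas = replicas

    def handle(self, request):
        for replica in self.replicas:
            try:
                return replica.handle(request)
            except ConnectionError:
                continue  # fail over to the next replica
        raise RuntimeError("all replicas are down")


def kill_random_replica(service, rng):
    """Kill one randomly chosen live replica; return it, or None if none are left."""
    live = [r for r in service.replicas if r.alive]
    if not live:
        return None
    victim = rng.choice(live)
    victim.alive = False
    return victim


rng = random.Random(0)  # seeded so a failing run can be replayed
service = Service([Replica(f"replica-{i}") for i in range(3)])
for _ in range(2):
    victim = kill_random_replica(service, rng)
    # The service should keep answering as long as one replica survives.
    print(f"killed {victim.name}; {service.handle('GET /photo/1')}")
```

The point of the exercise is the check after each kill: if the service stops answering while replicas are still alive, the failover logic has a bug that is far better found in a test than during a real outage.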

While last week’s outage has been an embarrassment to both Amazon and the companies that depend on it, the silver lining is that the outage has driven home the fact that applications need to be engineered differently from the ground up to expect and survive failures at various levels in the cloud infrastructure. Last week many companies with insufficiently robust applications were caught off guard by Amazon’s failure, and will most certainly be reassessing their application architecture. Companies using on-premise computing rather than cloud computing can also learn lessons from this outage by, like Netflix and Smugmug, continually testing resiliency to failure by deliberately causing failures in their systems.

Comments

  1. Man, I love the idea of a “chaos monkey.” It ought to be a fixture in every big firm, and not just as an IT beast. Think of it as a Corp. Jester, pushing the power button and the powers’ buttons.

  2. It is easy to point out that only a small percentage of SaaS vendors were affected by Amazon’s outage (hundreds in fact). That is no comfort since a small number of those vendors actually lost data as reported by Amazon.

    How much data loss is acceptable? Too many moving parts means more can and will go wrong.

    In the end what will make all the difference in the world of Cloud based apps and data storage is who is actually in control. Frankly, if you store a Law Firm’s data on servers you do NOT control then you are obligated to inform every Law Firm you do business with of this fact up front.

    just my 2 cents

    Frank Rivera CEO HoudiniESQ