Last week Amazon’s popular AWS cloud computing service suffered an unprecedented multi-day outage. The outage brought down thousands of websites, including popular ones such as Quora, Reddit and Foursquare, and generated coverage from mainstream publications such as the New York Times and the Wall Street Journal.
While many are quick to point to the outage as a sign that cloud computing is unreliable and not ready for mission-critical applications, the outage has simply brought to light a reality of both on-premises and cloud computing: systems fail, and mission-critical applications need to be designed to expect failure.
The media has focused on the outage and the websites that it brought down, but more notable to my mind are the high-traffic, high-profile websites hosted on Amazon’s cloud that sailed through the outage, completely unaffected by the problems in their underlying cloud computing infrastructure.
Netflix, for one, hosts its entire video streaming service on Amazon’s cloud, and remained 100% available during the outage not because of good luck, but because of good application design. Netflix appreciates that, while Amazon offers a 99.95% uptime guarantee, the remaining 0.05% of downtime has to be anticipated not as a negligible, low-probability risk, but rather as a foregone eventuality. Netflix has designed its systems to be resilient to all kinds of unpredictable failures, even going so far as to integrate a so-called “Chaos Monkey” into its systems that continually and randomly crashes parts of Netflix’s infrastructure. Netflix knows that by designing its systems to withstand the vagaries of a “Chaos Monkey”, it makes them all the more resilient to true outages. That engineering effort has clearly paid off.
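To put the 0.05% figure in perspective, here is a rough back-of-the-envelope calculation. It assumes the guarantee is measured over a 365-day year, which is my assumption rather than something stated above, and the function name is purely illustrative.

```python
# Rough downtime budget implied by an uptime guarantee.
# Assumption: the guarantee is measured over a 365-day year.
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(uptime_percent, period_hours=HOURS_PER_YEAR):
    """Return the hours of downtime permitted by the given uptime percentage."""
    return period_hours * (100.0 - uptime_percent) / 100.0


budget = allowed_downtime_hours(99.95)
print(f"{budget:.2f} hours of downtime per year")  # prints "4.38 hours of downtime per year"
```

Even a guarantee that is met in full leaves room for several hours of downtime a year, which is why it makes sense to treat downtime as a certainty to design for.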
Smugmug, a popular photo-sharing service, is also hosted on Amazon’s cloud. Like Netflix, Smugmug continued operating normally during Amazon’s outage because it had anticipated failures of nearly every type. Smugmug’s CEO, Don MacAskill, chronicles the lengths his company has gone to in order to anticipate and withstand failures in Amazon’s cloud in a detailed blog post, and like Netflix, he builds resiliency by constantly causing deliberate failures:
“I regularly kill off stuff on [Amazon's cloud] just to see what’ll happen. I found and fixed a rare bug related to this over the weekend, actually, that’d been live and in production for quite some time. Verify your slick new eventually consistent datastore is actually eventually consistent. Ensure your amazing replicator will actually replicate correctly or allow you to rebuild in a timely fashion. Start by doing these tests during maintenance windows so you know how it works. Then, once your system seems stable enough, start surprising your Ops and Engineering teams by killing stuff in the middle of the day without warning them. They’ll love you.”
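The idea of deliberately killing things to check that the system survives can be sketched in a few lines. This is a toy, in-memory illustration only: the `Instance` and `Service` classes and the `kill_random_instance` helper are hypothetical names of my own, not Netflix's or Smugmug's actual tooling, and a real test would terminate real servers rather than flip a flag.

```python
import random


class Instance:
    """A stand-in for one server (hypothetical, in-memory only)."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


class Service:
    """Routes each request to the first replica that responds."""

    def __init__(self, instances):
        self.instances = instances

    def handle(self, request):
        for instance in self.instances:
            try:
                return instance.handle(request)
            except ConnectionError:
                continue  # fail over to the next replica
        raise RuntimeError("no healthy instances left")


def kill_random_instance(instances, rng):
    """Mark one randomly chosen live instance as dead and return it."""
    live = [i for i in instances if i.alive]
    if not live:
        return None
    victim = rng.choice(live)
    victim.alive = False
    return victim


rng = random.Random(42)  # fixed seed so the run is repeatable
replicas = [Instance(f"replica-{n}") for n in range(3)]
service = Service(replicas)

killed = kill_random_instance(replicas, rng)
response = service.handle("GET /photos")
print(f"killed {killed.name}; {response}")
```

If the service still answers after a replica has been killed, the failover path works. If it does not, you have found the bug on your own schedule rather than during a real outage.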
While last week’s outage has been an embarrassment to both Amazon and the companies that depend on it, the silver lining is that the outage has driven home the fact that applications need to be engineered differently from the ground up to expect and survive failures at various levels of the cloud infrastructure. Last week many companies with insufficiently robust applications were caught off guard by Amazon’s failure, and they will almost certainly be reassessing their application architecture. Companies using on-premises computing rather than cloud computing can also learn from this outage by doing what Netflix and Smugmug do: continually testing their resiliency by deliberately causing failures in their systems.