Amazon Outage Stratifies the Cloud

by Jack Newton

Last week Amazon’s popular AWS cloud computing service suffered an unprecedented multi-day outage. The outage brought down thousands of websites, including well-known sites such as Quora, Reddit and Foursquare, and generated coverage from mainstream publications such as the New York Times and the Wall Street Journal.

While many are quick to point to the outage as a sign that cloud computing is unreliable and not ready for mission-critical applications, the outage has simply brought a reality of both on-premise and cloud computing to light: systems fail, and mission-critical applications need to be designed to expect failure.

The media has focused on the outage and the websites that it brought down, but what is more notable in my mind is the high-traffic, high-profile websites hosted on Amazon’s cloud that sailed through the outage, completely unaffected by the problems in their underlying cloud computing infrastructure.

Netflix, for one, hosts its entire video streaming service on Amazon’s cloud, and remained 100% available during the outage not because of good luck, but because of good application design. Netflix appreciates that, while Amazon offers a 99.95% uptime guarantee, the 0.05% of downtime has to be anticipated not as a negligible, low-probability risk, but rather as a foregone eventuality. Netflix has designed its systems to be resilient to all kinds of unpredictable failures, even going to the extent of integrating a so-called “Chaos Monkey” into its systems that continually and randomly crashes parts of Netflix’s infrastructure. Netflix knows that by designing its systems to withstand the vagaries of a “Chaos Monkey”, its systems will be all the more resilient to true outages. That engineering effort has clearly paid off.
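To put the 0.05% figure in concrete terms, here is a rough back-of-the-envelope calculation in Python. It assumes the percentage applies evenly across a 365-day year or a 30-day month; how Amazon's agreement actually measures the period is not covered here, and the function name is only illustrative.

```python
def allowed_downtime_hours(uptime_percent: float, period_hours: float) -> float:
    """Hours of downtime that an uptime percentage permits over a period."""
    return (1 - uptime_percent / 100) * period_hours


HOURS_PER_YEAR = 365 * 24   # 8760, ignoring leap years (assumption)
HOURS_PER_MONTH = 30 * 24   # 720, treating a month as 30 days (assumption)

yearly = allowed_downtime_hours(99.95, HOURS_PER_YEAR)
monthly_minutes = allowed_downtime_hours(99.95, HOURS_PER_MONTH) * 60

print(f"{yearly:.2f} hours per year")                    # 4.38 hours per year
print(f"{monthly_minutes:.1f} minutes per 30-day month")  # 21.6 minutes per 30-day month
```

A little over four hours a year sounds small, but nothing says those hours arrive in convenient pieces, which is why it makes sense to treat them as a certainty to design for.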

Smugmug, a popular photo-sharing service, is also hosted on Amazon’s cloud. Like Netflix, Smugmug continued operating normally during Amazon’s outage because it had anticipated failures of nearly every type. Smugmug’s CEO, Don MacAskill, chronicles the lengths his company has gone to in order to anticipate and withstand failures in Amazon’s cloud in a detailed blog post, and like Netflix, he builds resiliency by constantly causing deliberate failures:

I regularly kill off stuff on [Amazon’s cloud] just to see what’ll happen. I found and fixed a rare bug related to this over the weekend, actually, that’d been live and in production for quite some time. Verify your slick new eventually consistent datastore is actually eventually consistent. Ensure your amazing replicator will actually replicate correctly or allow you to rebuild in a timely fashion. Start by doing these tests during maintenance windows so you know how it works. Then, once your system seems stable enough, start surprising your Ops and Engineering teams by killing stuff in the middle of the day without warning them. They’ll love you.
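The idea of killing things on purpose can be illustrated with a toy simulation. This is only a sketch of the general technique, not Netflix's Chaos Monkey or Smugmug's tooling: the Replica, Service and chaos_round names are hypothetical, and the "infrastructure" is just a few in-memory objects.

```python
import random


class Replica:
    """A stand-in for one server; hypothetical, for illustration only."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


class Service:
    """Sends each request to the first replica that responds."""

    def __init__(self, replicas):
        self.replicas = replicas

    def handle(self, request):
        for replica in self.replicas:
            try:
                return replica.handle(request)
            except ConnectionError:
                continue  # fail over to the next replica
        raise RuntimeError("all replicas are down")


def chaos_round(service, rng):
    """Kill one randomly chosen live replica, then check the service still answers."""
    live = [r for r in service.replicas if r.alive]
    victim = rng.choice(live)
    victim.alive = False
    try:
        service.handle("health-check")
        return victim.name, True
    except RuntimeError:
        return victim.name, False


rng = random.Random(0)  # seeded so a run can be repeated
service = Service([Replica("a"), Replica("b"), Replica("c")])
results = [chaos_round(service, rng) for _ in range(3)]
for name, survived in results:
    print(f"killed {name}: service {'survived' if survived else 'FAILED'}")
```

With three replicas, the service survives the first two kills and fails on the third, when nothing is left to fail over to. A real test of this kind would also restore what it killed and would run against actual infrastructure rather than objects in memory.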

While last week’s outage has been an embarrassment to both Amazon and the companies that depend on it, the silver lining is that the outage has driven home the fact that applications need to be engineered differently from the ground up to expect and survive failures at various levels in the cloud infrastructure. Last week many companies with insufficiently robust applications were caught off guard by Amazon’s failure, and will most certainly be reassessing their application architecture. Companies using on-premise computing rather than cloud computing can also learn from this outage by continually testing their resiliency to failure, as Netflix and Smugmug do, by deliberately causing failures in their systems.

Comments

Simon Fodden

April 25th, 2011 at 2:03 pm

Man, I love the idea of a “chaos monkey.” It ought to be a fixture in every big firm, and not just as an IT beast. Think of it as a Corp. Jester, pushing the power button and the powers’ buttons.

Frank Rivera

April 27th, 2011 at 6:19 pm

It is easy to point out that only a small percentage of SaaS vendors were affected by Amazon’s outage (hundreds in fact). That is no comfort since a small number of those vendors actually lost data as reported by Amazon.

How much data loss is acceptable? Too many moving parts means more can and will go wrong.

In the end what will make all the difference in the world of Cloud based apps and data storage is who is actually in control. Frankly, if you store a Law Firm’s data on servers you do NOT control then you are obligated to inform every Law Firm you do business with of this fact up front.

just my 2 cents

Frank Rivera CEO HoudiniESQ
