The Amazon Web Services Outage: A Brief Explanation

On Friday, April 29, 2011, Amazon Web Services issued an apology and a detailed technical explanation of the outage that affected its US East Region from 1 AM PDT on April 21 to 7:30 PM PDT on April 24. Amazon’s cloud computing architecture is described in more detail in the full text of Amazon’s post-mortem analysis of the outage and its accompanying apology. This posting summarizes the technical issues responsible for the outage, with the intent of giving readers a condensed understanding of Amazon’s cloud computing architecture and of the kinds of problems that are likely to affect the cloud computing industry more generally. We are impressed with the candor and specificity of Amazon’s response and believe it ushers in a new age of transparency and accountability in the cloud computing space.

Guide to the April 2011 Amazon Web Services Outage:

1. Elastic Block Store Architecture
Elastic Block Store (EBS) is a block-level storage service for Amazon’s EC2 instances. EBS has two components: (1) EBS clusters, each of which is composed of a set of nodes; and (2) a Control Plane Services platform that accepts user requests and directs them to the appropriate EBS clusters. Nodes within an EBS cluster communicate with one another over a high bandwidth primary network, with a lower capacity network serving as a back-up.

2. Manual Error with Network Upgrade Procedure
The outage began when a routine procedure to upgrade the capacity of the primary network resulted in traffic being directed to EBS’s lower capacity network instead of an alternate router on the high capacity network. Because the high capacity network was temporarily disengaged, and the low capacity network could not handle the traffic that had been shunted in its direction, many nodes in the affected EBS availability zone were isolated.

3. Re-Mirroring of Elastic Block Store Nodes
Once Amazon engineers noticed that the network upgrade had been executed incorrectly, they restored the network to its proper connectivity on the high bandwidth connection. Nodes that had become isolated then began searching for other nodes with free space on which they could “mirror,” or duplicate, their data. But because so many nodes were looking for a new replica at the same time, the EBS cluster’s free capacity was quickly exhausted. Consequently, approximately 13% of the volumes within the affected Availability Zone became “stuck”.
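To make the capacity problem concrete, here is a minimal sketch of a re-mirroring pass in Python. It is our own illustration, not Amazon’s implementation: the Node class, the remirror function and the sizes used are all hypothetical. Each volume that needs a new replica takes the first node with enough free space, and any volume that finds no room is counted as stuck.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical storage node with some free space, in GB."""
    free_gb: int


def remirror(volume_sizes_gb, nodes):
    """Place each volume needing a new replica on the first node with room.

    Returns (placed, stuck). A volume is "stuck" when no node in the
    cluster has enough free space left to hold its replica.
    """
    placed = stuck = 0
    for size in volume_sizes_gb:
        target = next((n for n in nodes if n.free_gb >= size), None)
        if target is None:
            stuck += 1
        else:
            target.free_gb -= size  # the new replica consumes free space
            placed += 1
    return placed, stuck
```

With only a few volumes searching, everything is placed. When many volumes search at once, the free space runs out partway through and the remainder are left stuck, which is the pattern described above.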

4. Control Plane Service Platform Isolated
The exhaustion of EBS storage capacity by stuck nodes seeking to re-mirror themselves also affected the Control Plane Services platform, which directs user requests from an API to EBS clusters. With its capacity exhausted, the EBS cluster was unable to service requests from the Control Plane Services. Because the degraded EBS cluster began to have an adverse effect on the Control Plane Services throughout the entire Region, Amazon disabled communication between the affected EBS cluster and the Control Plane Services.

5. Restoring EBS Cluster Server Capacity
Amazon engineers knew that the isolated nodes had exhausted server capacity within the EBS cluster. In order to enable the nodes to re-mirror themselves, it was necessary to add extra server capacity to the degraded EBS cluster. Finally, the connection between the Control Plane Service and EBS was restored.

6. Relational Database Service Fails to Replicate
Amazon’s Relational Database Service (RDS) manages databases that rely on EBS for their storage. RDS can be configured to function in one Availability Zone or several. RDS instances that have been configured to operate across multiple Availability Zones should switch to their replica in an Availability Zone unaffected by a service disruption. Because of a previously unknown bug, the network interruption on the degraded EBS cluster caused 2.5% of multi-AZ RDS instances to fail to switch over to their replica.
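The intended multi-AZ behavior can be sketched in a few lines of Python. This is only an illustration of the idea, not the RDS API: the function name, zone names and instance identifiers are hypothetical.

```python
def choose_failover_target(replicas, impaired_zones):
    """Pick a replica in a zone that is not affected by the disruption.

    replicas: dict mapping a zone name to the instance running there.
    impaired_zones: set of zone names currently disrupted.
    Returns (zone, instance), or None if every replica is in an impaired zone.
    """
    for zone, instance in replicas.items():
        if zone not in impaired_zones:
            return zone, instance
    return None
```

For example, choose_failover_target({"zone-a": "db-primary", "zone-b": "db-standby"}, {"zone-a"}) selects the standby in zone-b. The instances hit by the bug correspond to the case where this selection should have succeeded but did not happen automatically.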

Amazon Web Services’ Response

In response to the set of issues that prompted the outage, Amazon proposes to take the following steps:

1. Increase automation of the network change/upgrade process that triggered the outage
2. Increase server capacity in EBS clusters to allow EBS nodes to find their replicas effectively in the event of a disruption
3. Develop more intelligent re-try logic to prevent the kind of “re-mirroring storm” in which EBS nodes seek and re-seek their replicas relentlessly. While EBS nodes should seek out their replicas after a service disruption, the logic behind the search for replicas should ameliorate an outage rather than exacerbate it.
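One common way to make re-try logic less aggressive is exponential backoff with random jitter. The Python sketch below shows that general technique; it is our assumption about what “more intelligent” re-try logic could look like, not the algorithm Amazon has said it will use. The function names, the try_once callback and the default values are hypothetical.

```python
import random
import time


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Seconds to wait before the next try; `attempt` is 0-based.

    The ceiling doubles with each attempt up to `cap`, and the actual wait
    is drawn at random below it so that many nodes do not retry in lockstep.
    """
    ceiling = min(cap, base * 2 ** attempt)
    return rng.uniform(0, ceiling)


def find_replica(try_once, max_attempts=6, sleep=time.sleep, rng=random):
    """Call try_once() until it returns a replica or attempts run out.

    try_once is a hypothetical callback that returns a replica, or None
    when no space is available. Returns the replica, or None on giving up.
    """
    for attempt in range(max_attempts):
        replica = try_once()
        if replica is not None:
            return replica
        if attempt < max_attempts - 1:
            sleep(backoff_delay(attempt, rng=rng))
    return None
```

Because each failed search waits longer, on average, than the one before it, and because the waits are randomized, a cluster full of searching nodes spreads its requests out over time instead of hammering the same exhausted capacity again and again.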

Sony PlayStation Network outage continues as a result of “external intrusion”

Sony PlayStation’s cloud computing network experienced significant downtime starting on April 21. The outage affected Sony’s PlayStation Network and its Qriocity music service. Sony PlayStation’s cloud based environment allows users to download and use online games, music, videos and movies. Patrick Seybold, Sony’s Senior Director of Corporate Communications and Social Media, announced that an “external intrusion” was responsible for the outage, generating suspicions that hackers had brought down Sony’s cloud based gaming and music platform.

The hacker group Anonymous became the principal suspect because of its earlier conflict with Sony. Sony had initiated a lawsuit against George Hotz, a PlayStation user with the username GeoHot who jailbroke his PlayStation 3 and distributed jailbreaking tools that allowed other users to download unauthorized applications. In early March, a Northern California court awarded Sony access to Hotz’s social media accounts, his PayPal account and the IP addresses of users who visited his website. Anonymous objected to Sony’s lawsuit against Hotz, noting, “You have abused the judicial system in an attempt to censor information on how your products work. You have victimized your own customers merely for possessing and sharing information, and continue to target every person who seeks this information. In doing so, you have violated the privacy of thousands.” After Anonymous issued threats to Sony about its handling of the Hotz lawsuit, Sony experienced downtime on its main website, Style.com and the U.S. PlayStation site on April 6, in attacks that have been widely attributed to Anonymous.

But Anonymous denied responsibility for the recent outage, declaring, “For Once, We Didn’t Do It,” and adding that “While it could be the case that other Anons have acted by themselves, AnonOps was not related to this incident and does not take responsibility for whatever has happened. A more likely explanation is that Sony is taking advantage of Anonymous’ previous ill-will towards the company to distract users from the fact that the outage is actually an internal problem with the company’s servers.” Sony’s technical troubles follow the recent high profile releases of Mortal Kombat and Portal 2. Considered alongside Amazon’s recent EC2 outage, Sony’s downtime heightens concerns about quality of service and reliability in the world of cloud computing. Downtime on Sony’s PlayStation Network began on April 21 and continues as of the evening of April 24, 2011.