AWS S3 Outage Underscores The Need For Enhanced Risk and Control Frameworks For Cloud Services

The Amazon Web Services disruption that affected the Northern Virginia Region (US-EAST-1) on February 28, 2017 was caused by human error. At 9:37 AM PST, an AWS S3 team member who was debugging an issue with the S3 billing system mistakenly removed a larger set of servers than intended, taking down the index subsystem, which manages the metadata for all S3 objects, and the placement subsystem, which manages the allocation of new storage. The inadvertent removal of these two subsystems forced a full restart of S3 that impaired its ability to respond to requests. S3’s inability to serve new requests subsequently affected dependent AWS services such as EBS, AWS Lambda and the launch of new Amazon EC2 instances. Moreover, the service disruption also prevented AWS from updating its AWS Service Health Dashboard between 9:37 AM PST and 11:37 AM PST. The full restart of the affected S3 subsystems took longer than expected, as noted in the following excerpt from the AWS post-mortem analysis of the S3 service disruption:

S3 subsystems are designed to support the removal or failure of significant capacity with little or no customer impact. We build our systems with the assumption that things will occasionally fail, and we rely on the ability to remove and replace capacity as one of our core operational processes. While this is an operation that we have relied on to maintain our systems since the launch of S3, we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years. S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected.

Both the index subsystem and the placement subsystem were restored by 1:54 PM PST. The recovery of dependent services took additional time, depending on the backlog each had accumulated during S3’s disruption and restoration. As a result of the outage, AWS has begun examining the risks associated with operational tools and processes that remove capacity, and has escalated the priority of re-architecting S3 into smaller “cells” that allow faster recovery from a service disruption and the restoration of routine operating capacity. The S3 outage affected customers such as Airbnb, New Relic, Slack, Docker, Expedia and Trello.
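AWS has not published the internals of its planned cell design, but the idea behind cell-based partitioning can be sketched in a few lines: object keys are deterministically hashed into independent cells, so a restart (or failure) of one cell affects only the keys that live in it while the remaining cells keep serving. The cell count, function names and key-mapping scheme below are illustrative assumptions, not AWS’s actual implementation.

```python
import hashlib

# Hypothetical cell count; AWS has not disclosed its actual cell topology.
NUM_CELLS = 8

def cell_for_key(key: str, num_cells: int = NUM_CELLS) -> int:
    """Deterministically map an object key to a cell via a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_cells

def keys_affected_by_restart(keys, restarted_cell: int):
    """Only keys living in the restarted cell lose availability;
    keys in all other cells continue to be served."""
    return [k for k in keys if cell_for_key(k) == restarted_cell]
```

Under a scheme like this, the integrity and safety checks that prolonged the February 28 recovery would run against one cell’s slice of the metadata rather than an entire region’s, which is the accelerated-recovery property the re-architecture targets.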

The S3 outage underscores the immaturity of control frameworks for the operational processes specific to maintaining and quality-checking cloud services and platforms. That manual error could cause a multi-hour disruption to Amazon S3, with downstream effects on other AWS services, represents a stunning indictment of AWS’s risk management and control framework, both for mitigating risks to service availability and performance and for monitoring and responding to the quality of operational execution. The outage pointedly illustrates the immaturity of AWS’s risk, control and automation frameworks, and sets the stage for competitors such as Microsoft Azure and Google Cloud Platform to capitalize on AWS’s negative publicity by foregrounding the sophistication of their own frameworks for preventing, mitigating and minimizing service disruptions. Moreover, the February 28 outage underscores the need for mature, cloud-focused IT risk and control frameworks that address the risks specific to cloud platforms, in contradistinction to on-premise enterprise IT. Furthermore, the outage strengthens the argument for a multi-cloud strategy: enterprises interested in ensuring business continuity can use more than one public cloud vendor to mitigate the risk of any single vendor’s outage. Meanwhile, the continued pervasiveness of public cloud outages underscores the depth of the opportunity for controls that mitigate risks to cloud service uptime and performance.
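The multi-cloud continuity argument above can be made concrete with a minimal failover sketch: a client tries its primary provider first and falls back to a replica on a second cloud when the primary fails. The provider stubs and function names here are hypothetical illustrations, not any vendor’s SDK; a production version would use the actual client libraries and replicate data across providers ahead of time.

```python
from typing import Callable, Sequence

class AllProvidersFailed(Exception):
    """Raised when every configured cloud provider fails to serve the object."""

def fetch_with_failover(fetchers: Sequence[Callable[[str], bytes]], key: str) -> bytes:
    """Try each provider's fetch function in priority order,
    falling back to the next on any failure."""
    errors = []
    for fetch in fetchers:
        try:
            return fetch(key)
        except Exception as exc:
            errors.append(exc)
    raise AllProvidersFailed(errors)

# Hypothetical provider stubs for illustration:
def primary(key: str) -> bytes:
    # e.g. an S3-backed fetch; stubbed here to simulate a regional outage
    raise TimeoutError("US-EAST-1 unavailable")

def secondary(key: str) -> bytes:
    # e.g. a pre-replicated copy on a second public cloud
    return b"payload for " + key.encode("utf-8")

result = fetch_with_failover([primary, secondary], "report.csv")
```

The design trade-off is the usual one for business continuity: failover logic is cheap, but it only works if the data has already been replicated to the second provider, which is the real cost of a multi-cloud strategy.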