Amazon Web Services recently released an explanation of the outage that affected a single Availability Zone in the US East Region (Northern Virginia) on June 14. The outage affected customers such as Quora, Heroku and Hipchat. The chronology below is a summary distilled from Amazon Web Services’ more complete explanation of the outage. The root cause was a power failure compounded by two failures in the back-up power infrastructure: the generator that took over the load overheated, and an incorrectly configured circuit breaker then failed in the secondary back-up power system that EC2 instances and EBS volumes had failed over to.
June 14, 2012 to June 15, 2012: US East Region (Northern Virginia). All times listed are PDT.
•8:44 PM: A single Availability Zone in the US East Region transferred to generator power after a cable fault in the power distribution system.
•8:53 PM: A generator that had been carrying the load since the power failure overheated. Because of the generator failure, affected EC2 instances and EBS volumes failed over to their secondary back-up power source.
•8:57 PM: One of the circuit breakers in this secondary back-up power grid failed due to an incorrect configuration. Affected EC2 instances and EBS volumes now had no primary, secondary or tertiary power source.
•10:19 PM: Generator was repaired and restarted.
•10:50 PM: Majority of EC2 instances and EBS volumes recovered.
•1:05 AM: 99% of the EBS volumes that had an “inflight write” in progress were brought back in an “impaired” state, which allowed customers to verify the consistency of each volume and then resume using it.
Concurrent with the above:
•8:57 PM until 10:40 PM: Customers’ ability to launch new EBS-backed EC2 instances was impaired.
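To put the chronology in terms of elapsed time, the timestamps above can be turned into durations with a few lines of Python. This is only an illustrative sketch: the event labels and the helper names (EVENTS, minutes_between, elapsed_since_first) are my own shorthand, not anything from Amazon's report, and the dates assume that the 1:05 AM entry falls on June 15.

```python
from datetime import datetime

# Timestamps (PDT) copied from the chronology above.
# Labels are shorthand; the 1:05 AM entry is assumed to fall on June 15.
EVENTS = [
    ("transfer to generator power", datetime(2012, 6, 14, 20, 44)),
    ("generator overheats", datetime(2012, 6, 14, 20, 53)),
    ("secondary breaker fails, no power", datetime(2012, 6, 14, 20, 57)),
    ("generator repaired and restarted", datetime(2012, 6, 14, 22, 19)),
    ("majority of instances and volumes recovered", datetime(2012, 6, 14, 22, 50)),
    ("99% of in-flight-write volumes returned", datetime(2012, 6, 15, 1, 5)),
]


def minutes_between(start, end):
    """Whole minutes elapsed from start to end."""
    return int((end - start).total_seconds() // 60)


def elapsed_since_first(events):
    """Pair each event label with minutes elapsed since the first event."""
    first = events[0][1]
    return [(label, minutes_between(first, when)) for label, when in events]


if __name__ == "__main__":
    for label, mins in elapsed_since_first(EVENTS):
        print(f"+{mins:>3} min  {label}")
```

Taking the 8:57 PM and 10:19 PM entries as the bounds of the period without power, minutes_between gives 82 minutes, and the last entry in the chronology lands 261 minutes after the initial transfer to generator power.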
Kudos once again to Amazon for its transparency, although its failure to test and correctly configure its back-up power infrastructure is disappointing.