Understanding the December 2012 Windows Azure Outage

From December 28 to December 31, 2012, Microsoft’s Windows Azure platform experienced an outage in its South Central US region that came hard on the heels of the Amazon Web Services Christmas Eve outage made famous for incapacitating Netflix. Microsoft first reported the outage at 3:16 PM UTC on December 28, noting on its Windows Azure Service Dashboard that a networking issue was “partially affecting the availability of Storage service in the South Central US subregion.” Hours later, Microsoft added that the incident was also preventing the dashboard from displaying the status of services in all other regions, even though service itself was unaffected outside the South Central US region.

The first substantial elaboration on the cause of the outage came six hours later, at 9:16 PM UTC on December 28:

The repair steps are taking longer because it involves recovery of some faulty nodes on the impacted cluster. We expect this activity to take a few more hours. Further updates will be published after the recovery is complete. We apologize for any inconvenience this causes our customers. Note: The availability is unaffected for all other services and sub-regions. We are currently unable to display the status of the individual services and sub-regions due to the above mentioned issue.

Here, Microsoft attributed the problem to “faulty nodes on the impacted cluster” and estimated that repair would be complete within a few hours. Nine hours after this update, however, and roughly 15 hours after the initial announcement, the Azure team conceded that recovery of the affected nodes was “likely to take a significant amount of time.” The impact on the creation of new VM jobs and on Service Management operations had been addressed in the meantime, but full recovery of the cluster would take longer.

On December 30 at 9:00 PM UTC, the Azure team reported:

The repair steps are still underway to restore full availability of Storage service in the South Central US sub-region. Windows Azure provides asynchronous geo replication of Blob & Table data between data centers, but does not currently support geo-replication for Queue data or failover per Storage account. If a failover were to occur, it would impact all accounts on the affected Storage cluster, resulting in loss of Queue data and some recent Blob & Table data. To prevent this risk to customer data and applications, we are focusing on bringing the affected stamp back to full recovery in a healthy state. We continue to work to resolve this issue at the earliest and our next update will be before 6PM PST on 12/30/2012. Please contact Windows Azure Technical support for assistance. We apologize for any inconvenience this causes our customers.

With this announcement, impacted customers finally learned why the outage was so prolonged: Windows Azure provided asynchronous geo-replication of Blob and Table data, but did not at the time support geo-replication of Queue data or failover on a per Storage account basis. A failover would therefore have affected every account on the impacted Storage cluster, losing all Queue data along with any “recent Blob & Table data” that had not yet replicated. Geo-replication, recall, refers to the practice of maintaining replicas of customer data in locations hundreds of miles apart in order to protect customers against data center outages. Rather than accept that data loss, Microsoft chose to bring the affected stamp back to health in place, which is why recovering the faulty nodes took as long as it did.
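To make that tradeoff concrete, here is a minimal, purely illustrative Python sketch, not Azure’s actual implementation; the Region, Stamp, replicate_async, and failover names are all hypothetical. It models asynchronous geo-replication in which Blob and Table writes copy to the secondary region only after a replication lag, while Queue data is never replicated, so a forced failover discards every queue message and any recent Blob and Table writes.

```python
import time


class Stamp:
    """A toy storage stamp holding blob/table entries and queue messages."""
    def __init__(self):
        self.blob_table = {}   # key -> (value, write_time)
        self.queues = {}       # queue name -> list of messages


class Region:
    def __init__(self, name):
        self.name = name
        self.stamp = Stamp()


def replicate_async(primary, secondary, lag_seconds):
    """Copy only blob/table entries older than `lag_seconds` to the secondary.
    Queue data is never replicated, mirroring the limitation Microsoft
    described in December 2012."""
    now = time.time()
    for key, (value, written_at) in primary.stamp.blob_table.items():
        if now - written_at >= lag_seconds:
            secondary.stamp.blob_table[key] = (value, written_at)


def failover(secondary):
    """Promote the secondary: un-replicated recent writes and all queues are gone."""
    return secondary


if __name__ == "__main__":
    south_central = Region("South Central US")
    north_central = Region("North Central US")

    # One old write (already replicable), one fresh write, and some queue messages.
    south_central.stamp.blob_table["old-blob"] = ("v1", time.time() - 3600)
    south_central.stamp.blob_table["new-blob"] = ("v2", time.time())
    south_central.stamp.queues["jobs"] = ["msg-1", "msg-2"]

    replicate_async(south_central, north_central, lag_seconds=900)
    promoted = failover(north_central)

    print(promoted.stamp.blob_table)  # only "old-blob" survived
    print(promoted.stamp.queues)      # {} -> queue data lost
```

Under those assumptions, repairing the affected stamp in place, as Microsoft chose to do, is the only path that preserves all customer data.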

The problem was finally and fully resolved at 10:16 AM UTC on December 31, 2012:

Storage is fully functional in the South Central US sub-region. All customers should have access to their data. We apologize for any inconvenience this caused our customers.

Notable about the Windows Azure outage was how little media coverage it received compared with the Amazon Web Services outage, even though the AWS outage lasted roughly 24 hours versus roughly 67 hours for the Azure outage. Granted, the AWS outage affected Netflix, one of the IaaS industry’s most prominent customers alongside Zynga, but the contrast in coverage illustrates the market dominance of Amazon Web Services: its outages measurably affect more customers and end users than those of other IaaS platforms. Another factor behind the disparity is AWS’s trademark painstaking post-mortem analysis of outages, which Microsoft and every other vendor would do well to match in depth and specificity going forward.

3 thoughts on “Understanding the December 2012 Windows Azure Outage”

    1. Agreed. But the link supplied refers to the Leap Year outage. Let’s hope Azure releases a similarly detailed postmortem soon. AWS is unparalleled in terms of the celerity with which they release detailed explanations of outages and notable incidents.
