Microsoft has provided details of the outage that affected the Azure platform on February 29. The outage was traced to the transfer certificate used to secure the connection between the “host agent” (HA) and the “guest agent” (GA) in its platform-as-a-service server architecture. The guest agent runs inside the operating system of each virtual machine onto which customer applications are deployed, while the host agent runs on the physical server hosting those virtual machines. The integrity of the communication between host agent and guest agent is maintained by way of a transfer certificate generated by the guest agent and sent to the host agent. The outage on February 29 stemmed from the creation of faulty transfer certificates because of a leap year bug, as outlined in Microsoft’s March 9 blog post by Bill Laing, corporate VP of Microsoft’s Server and Cloud Division:
When the GA creates the transfer certificate, it gives it a one year validity range. It uses midnight UST of the current day as the valid-from date and one year from that date as the valid-to date. The leap day bug is that the GA calculated the valid-to date by simply taking the current date and adding one to its year. That meant that any GA that tried to create a transfer certificate on leap day set a valid-to date of February 29, 2013, an invalid date that caused the certificate creation to fail.
As February 28 rolled over to February 29, guest agents began using midnight of the current day, February 29, 2012, as the from-date for new transfer certificates. The to-date was calculated by simply adding one to the year, yielding February 29, 2013, “an invalid date that caused the certificate creation to fail.” The repeated failures led to a number of retried connections, which in turn led to the mistaken conclusion that a hardware problem was responsible. Microsoft accordingly migrated the affected virtual machines to other hardware, where they failed in exactly the same way.
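The failure mode is easy to reproduce. The following Python sketch is not Microsoft’s actual code (the function names and the February 28 fallback are illustrative assumptions), but it shows how a naive “add one to the year” calculation blows up on leap day, and one conventional way to guard against it:

```python
from datetime import date

def naive_valid_to(valid_from: date) -> date:
    # The buggy approach described in the post: take the current date
    # and simply add one to its year. On February 29 the resulting
    # date does not exist and ValueError is raised.
    return valid_from.replace(year=valid_from.year + 1)

def safe_valid_to(valid_from: date) -> date:
    # One common fix (an assumption, not Microsoft's remediation):
    # fall back to February 28 when the target year has no leap day.
    try:
        return valid_from.replace(year=valid_from.year + 1)
    except ValueError:
        return valid_from.replace(year=valid_from.year + 1, day=28)

leap_day = date(2012, 2, 29)
try:
    naive_valid_to(leap_day)
except ValueError as exc:
    print("certificate creation failed:", exc)  # mirrors the GA failure

print(safe_valid_to(leap_day))  # 2013-02-28
```

On any other day of the year the two functions agree; only a valid-from date of February 29 exposes the difference, which is exactly why the bug survived until leap day.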
As a result of the outage, Microsoft will “provide a 33% credit to all customers of Windows Azure Compute, Access Control, Service Bus and Caching for the entire affected billing month(s) for these services, regardless of whether their service was impacted.” The blog post also outlined a set of remediation measures Microsoft intends to take to prevent service disruptions like this from recurring. Nevertheless, the post starkly illustrates how a rudimentary programming error caused a disruption that took more than 10 hours, and a shutdown of the Azure service management platform, to fix. On the one hand, customers can take heart from Microsoft’s transparency and the depth of its explanation of the root cause of the outage. On the other, Azure fans have cause for concern: a glitch of this kind should have been caught by any QA team testing edge cases involving date and time values.