Google’s Blogger tight lipped about reasons for outage as service is restored

Google’s Blogger service experienced a major outage on Thursday May 12 that continued until service was finally restored on Friday, May 13 at 1030 AM PDT. Users were unable to log-in to the dashboard that enables bloggers to publish and edit posts, edit widgets and alter the design templates for their blogs. The outage coincided with the impending launch of a major overhaul to Blogger’s user interface and functionality, but a Blogger tweet asserted the independence of the outage from the upcoming redesign. Most notable about the outage, however, was Google’s tight lipped explanation of the technical reasons responsible for the outage in contradistinction to Amazon Web Service’s (AWS) exhaustively thorough explanation of its own service outage in late April. Blogger’s Tech Lead/Manager Eddie Kessler explained the Blogger outage as follows:

Here’s what happened: during scheduled maintenance work Wednesday night, we experienced some data corruption that impacted Blogger’s behavior. Since then, bloggers and readers may have experienced a variety of anomalies including intermittent outages, disappearing posts, and arriving at unintended blogs or error pages. A small subset of Blogger users (we estimate 0.16%) may have encountered additional problems specific to their accounts. Yesterday we returned Blogger to a pre-maintenance state and placed the service in read-only mode while we worked on restoring all content: that’s why you haven’t been able to publish. We rolled back to a version of Blogger as of Wednesday May 11th, so your posts since then were temporarily removed. Those are the posts that we’re in the progress of restoring.

Routine maintenance caused “data corruption” that led to disappearing posts and the subsequent outage to the user management dashboard. But Kessler resists from elaborating on the error that resulted from “scheduled maintenance” nor does he specify the form of data corruption that caused such a wide variety of errors on blogger pages. In contrast, AWS revealed that the outage was caused by misrouting network bandwidth from a high bandwidth connection to a low bandwidth connection on Elastic Block Storage, the storage database for Amazon EC2 instances. In their post-mortem explanation, AWS described the repercussions of the network misrouting on the architecture of EBS within the affected Region in excruciatingly impressive detail. Granted, Blogger is a free service used primarily for personal blogging, whereas AWS hosts customers with hundreds of millions of dollars in annual revenue. Nevertheless, Blogger users published half a billion posts in 2010 which were read by 400 million readers across the world. Users, readers and cloud computing savants alike would all benefit from learning more about the technical issues responsible for outages such as this one because vendor transparency will only increase public confidence in the cloud and help propel industry-wide innovation. Even if the explanation were not quite as thorough as that offered by Amazon Web Services, Google would do well to supplement its note about “data corruption” with something more substantial for Blogger users and the cloud computing community more generally.