April 11 Google Cloud Outage Raises Broader Questions About Contemporary IT Automation And Monitoring

The Google Cloud Platform experienced a major outage on April 11, marked by the loss of “external connectivity in all regions” and lasting roughly 18 minutes. Caused by a networking failure, it was one of the most systemic outages in the history of the public cloud, particularly given that AWS and Microsoft Azure have suffered outages specific to one or more availability zones, but never in all regions. Google claims to have since resolved the issues with its network configuration software and, taking a cue from its competitor Amazon Web Services, served up a remarkably detailed post-mortem analysis of the outage covering its origin, chronology, escalation pathways, resolution, and near- and long-term remediation.

The larger point, however, is that despite its recent windfall of customer signings featuring the likes of Apple, Spotify and the Home Depot, the Google Cloud Platform is still ironing out the kinks in its IT infrastructure as they relate to process, technology, alerts, notifications and root cause analytics. The outage is a reflection on both the evolution of the Google Cloud Platform and the continued immaturity of data-driven alerts and notifications, despite the efflorescence of contemporary technologies dedicated to intelligent automation, self-optimized infrastructures and real-time analytics on streaming data. The lesson of Google’s recent outage is that the data-driven checks and process automation steps designed to identify and swiftly ameliorate issues within the Google Cloud Platform as they arise have yet to mature to the point where they can isolate problems to a specific availability zone or cluster of availability zones, rather than allowing a categorical and unprecedented cascade across all regions.

As such, the outage raises more questions than it answers about the architecture undergirding Google’s Cloud Platform, as well as about how software configuration glitches can have unexpectedly far-reaching consequences despite the surfeit of contemporary analytic capabilities available to proactively monitor the health of IT infrastructures.
