Datera Emerges From Stealth With Cloud Storage Solution For Private Clouds And Service Providers

Mountain View, CA-based Datera emerged from stealth today with the launch of the Datera Elastic Data Fabric platform, a scale-out storage software solution that delivers block storage provisioning and management capability for commodity hardware. Datera Elastic Data Fabric brings the operational agility and economics of public cloud-based block storage solutions such as the Amazon Web Services Elastic Block Storage platform to enterprises and service providers by empowering them to use code to deploy and manage storage. Customers use Datera Elastic Data Fabric to create storage infrastructures for private clouds or, in the case of service providers, large-scale storage infrastructures. As Datera CEO Marc Fleischmann told Cloud Computing Today in a phone interview, the platform boasts self-optimizing and infrastructure-aware functionality that delivers an intelligent data fabric capable of responding to the unique requirements of individual applications, as illustrated below:

The graphic illustrates how the convergence of self-optimizing functionality and infrastructure awareness facilitates intelligent automation, allowing for the creation of a storage platform capable of responding to dramatic variations in application performance and the attendant volume, velocity and variety of incoming data. Zachary Smith, CEO of Packet, a cloud infrastructure company, elaborated on the uniqueness of Datera as follows:

Datera has enabled Packet to deliver a high performance, consistent and profitable elastic block storage service to our customers. What makes Datera so unique is its software DNA. With Datera, we can use a true API-driven storage platform that can keep pace with our dynamic workload requirements and demanding automation needs. Datera Elastic Data Fabric self-describes and self-optimizes so we can easily and economically scale our storage service.

Here, Smith remarks on Datera’s ability to deliver a software-based, high performance, low latency storage solution that reflexively optimizes itself in ways that accommodate the company’s “dynamic workload requirements and demanding automation needs.” Datera’s ability to bring the economics, agility and operational efficiency of infrastructure-as-code storage to the enterprise means that enterprises now have access to a storage infrastructure that keeps pace with the transformation of contemporary IT wrought by the cloud and the DevOps movement. The company’s software transforms commodity hardware into an API-driven, scale-out block storage infrastructure that can be deployed either as an appliance or as software only. Today, the company also announced $40M in funding from Khosla Ventures, Samsung Ventures, Andy Bechtolsheim and Pradeep Sindhu. Compatible with OpenStack, CloudStack and VMware vSphere, the platform aims to bring block storage into the cloud era in conjunction with an impressive array of analytic and intelligent automation functionalities.
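
To make the idea of deploying and managing block storage with code concrete, here is a minimal sketch using the OpenStack SDK (openstacksdk), one of the platforms Datera lists as compatible. This illustrates API-driven block storage provisioning in general, not Datera’s own API; the cloud name and volume parameters are placeholders.

```python
# Illustrative only: this is not Datera's API. It uses the OpenStack SDK
# (openstacksdk) to show what provisioning block storage "as code" can look
# like against an OpenStack-compatible backend.
import openstack

# Assumes a cloud named "private-cloud" is defined in clouds.yaml (placeholder).
conn = openstack.connect(cloud="private-cloud")

# Create a 100 GB block storage volume programmatically.
volume = conn.block_storage.create_volume(name="app-data-01", size=100)
print(f"Provisioned volume {volume.id} with status {volume.status}")
```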

Google’s Blogger tight-lipped about reasons for outage as service is restored

Google’s Blogger service experienced a major outage on Thursday, May 12 that continued until service was finally restored on Friday, May 13 at 10:30 AM PDT. Users were unable to log in to the dashboard that enables bloggers to publish and edit posts, edit widgets and alter the design templates for their blogs. The outage coincided with the impending launch of a major overhaul of Blogger’s user interface and functionality, but a Blogger tweet asserted that the outage was independent of the upcoming redesign. Most notable about the outage, however, was Google’s tight-lipped explanation of the technical reasons responsible for it, in contradistinction to Amazon Web Services’ (AWS) exhaustively thorough explanation of its own service outage in late April. Blogger’s Tech Lead/Manager Eddie Kessler explained the Blogger outage as follows:

Here’s what happened: during scheduled maintenance work Wednesday night, we experienced some data corruption that impacted Blogger’s behavior. Since then, bloggers and readers may have experienced a variety of anomalies including intermittent outages, disappearing posts, and arriving at unintended blogs or error pages. A small subset of Blogger users (we estimate 0.16%) may have encountered additional problems specific to their accounts. Yesterday we returned Blogger to a pre-maintenance state and placed the service in read-only mode while we worked on restoring all content: that’s why you haven’t been able to publish. We rolled back to a version of Blogger as of Wednesday May 11th, so your posts since then were temporarily removed. Those are the posts that we’re in the progress of restoring.

Routine maintenance caused “data corruption” that led to disappearing posts and the subsequent outage of the user management dashboard. But Kessler refrains from elaborating on the error that resulted from the “scheduled maintenance,” nor does he specify the form of data corruption that caused such a wide variety of errors on Blogger pages. In contrast, AWS revealed that its outage was caused by network traffic being misrouted from a high bandwidth connection to a low bandwidth connection within Elastic Block Storage, the block storage service used by Amazon EC2 instances. In its post-mortem explanation, AWS described the repercussions of the network misrouting on the architecture of EBS within the affected Region in excruciatingly impressive detail. Granted, Blogger is a free service used primarily for personal blogging, whereas AWS hosts customers with hundreds of millions of dollars in annual revenue. Nevertheless, Blogger users published half a billion posts in 2010, which were read by 400 million readers across the world. Users, readers and cloud computing savants alike would benefit from learning more about the technical issues responsible for outages such as this one, because vendor transparency will only increase public confidence in the cloud and help propel industry-wide innovation. Even if the explanation were not quite as thorough as that offered by Amazon Web Services, Google would do well to supplement its note about “data corruption” with something more substantial for Blogger users and the cloud computing community more generally.

The Amazon Web Services Outage: A Brief Explanation

On Friday, April 29, 2011, Amazon Web Services issued an apology and a detailed technical explanation of the outage that affected its US East Region from April 21, 1 AM PDT to April 24, 7:30 PM PDT. Amazon’s cloud computing architecture is described more completely in the full text of its post-mortem analysis of the outage and the accompanying apology. This posting elaborates on the technical issues responsible for Amazon’s outage, with the intent of giving readers a condensed understanding of Amazon’s cloud computing architecture and the kinds of problems that are likely to affect the cloud computing industry more generally. We are impressed with the candor and specificity of Amazon’s response and believe it ushers in a new age of transparency and accountability in the cloud computing space.

Guide to the April 2011 Amazon Web Services Outage:

1. Elastic Block Store Architecture
Elastic Block Store (EBS) provides block storage volumes for Amazon’s EC2 instances. EBS has two components: (1) EBS clusters, each of which is composed of a set of nodes; and (2) a Control Plane Services platform that accepts user requests and directs them to appropriate EBS clusters. Nodes within EBS clusters communicate with one another by means of a high bandwidth network and a lower capacity network used as a backup.
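
As a rough mental model of the two-tier architecture just described, the following Python sketch represents EBS clusters of nodes with primary and backup networks and a control plane that routes requests to clusters. It is a toy illustration, not AWS’s actual implementation, and all names are invented.

```python
# A toy model of the two-tier EBS architecture described above; the structure
# and names are illustrative, not AWS's actual implementation.
from dataclasses import dataclass, field
from typing import List


@dataclass
class EbsNode:
    node_id: str
    primary_network_up: bool = True   # high bandwidth network
    backup_network_up: bool = True    # lower capacity backup network


@dataclass
class EbsCluster:
    cluster_id: str
    nodes: List[EbsNode] = field(default_factory=list)


@dataclass
class ControlPlane:
    """Accepts user requests and directs them to an appropriate EBS cluster."""
    clusters: List[EbsCluster] = field(default_factory=list)

    def route_request(self, request: str) -> EbsCluster:
        # Route to the first cluster that still has at least one reachable node.
        for cluster in self.clusters:
            if any(n.primary_network_up or n.backup_network_up for n in cluster.nodes):
                return cluster
        raise RuntimeError(f"no reachable EBS cluster for request: {request}")
```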

2. Manual Error with Network Upgrade Procedure
The outage began when a routine procedure to upgrade the capacity of the primary network resulted in traffic being directed to EBS’s lower capacity network instead of an alternate router on the high capacity network. Because the high capacity network was temporarily disengaged and the low capacity network could not handle the traffic that had been shunted in its direction, many nodes in the affected EBS Availability Zone were isolated.

3. Re-Mirroring of Elastic Block Store Nodes
Once Amazon engineers noticed that the network upgrade had been executed incorrectly, they restored connectivity on the high bandwidth network. Nodes that had become isolated then began searching for other nodes on which they could “mirror,” or duplicate, themselves. But because so many nodes were simultaneously looking for a replica, the EBS cluster’s spare capacity was quickly exhausted. Consequently, approximately 13% of the nodes within the affected Availability Zone became “stuck.”
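
The dynamic described above, in which a flood of simultaneous re-mirroring requests exhausts a cluster’s spare capacity, can be sketched in a few lines. The numbers below are invented purely to echo the roughly 13% figure cited above.

```python
# A rough sketch of the "re-mirroring storm" described above: when every node
# in a cluster searches for replica space at the same time, free capacity is
# consumed immediately and the remainder become "stuck". Numbers are
# illustrative only, chosen to echo the ~13% figure cited above.
TOTAL_NODES = 1000
FREE_REPLICA_SLOTS = 870   # spare capacity available for new mirrors

re_mirrored, stuck = [], []
slots_remaining = FREE_REPLICA_SLOTS

for node_id in range(TOTAL_NODES):   # every node lost its mirror at once
    if slots_remaining > 0:
        re_mirrored.append(node_id)  # found space for a new replica
        slots_remaining -= 1
    else:
        stuck.append(node_id)        # cluster capacity exhausted: node is stuck

print(f"{len(stuck) / TOTAL_NODES:.0%} of nodes are stuck")  # -> 13%
```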

4. Control Plane Service Platform Isolated
The full utilization of the EBS storage system by stuck nodes seeking to re-mirror themselves impacted the Control Plane Services platform that directs user requests from an API to EBS clusters. The exhausted capacity of the EBS cluster rendered EBS unable to accommodate requests from the Control Plane Service. Because the degraded EBS cluster began to have an adverse effect on the Control Plane Service throughout the entire Region, Amazon disabled communication between the EBS clusters and the Control Plane Service.

5. Restoring EBS cluster server capacity
Amazon engineers knew that the isolated nodes had exhausted server capacity within the EBS cluster. In order to enable the nodes to re-mirror themselves, it was necessary to add extra server capacity to the degraded EBS cluster. Once the additional capacity allowed the stuck nodes to re-mirror, the connection between the Control Plane Service and EBS was restored.

6. Relational Database Service Fails to Replicate
Amazon’s Relational Database Service (RDS) provides managed relational databases whose data is stored on EBS. RDS can be configured to function in one Availability Zone or several. RDS instances that have been configured to operate across multiple Availability Zones should fail over to their replica in an Availability Zone unaffected by a service disruption. Due to an unexpected bug, however, the network interruption on the degraded EBS cluster caused 2.5% of multi-AZ RDS instances to fail to find their replica.
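
For readers who want to see what a multi-AZ RDS configuration looks like in practice, here is a minimal sketch using boto3, the present-day AWS SDK for Python (which post-dates the 2011 outage). The identifiers and credentials are placeholders.

```python
# A minimal sketch of provisioning a multi-AZ RDS instance with boto3;
# identifiers and credentials below are placeholders, not real values.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.create_db_instance(
    DBInstanceIdentifier="example-db",
    DBInstanceClass="db.t3.micro",
    Engine="mysql",
    MasterUsername="admin",
    MasterUserPassword="change-me-please",
    AllocatedStorage=20,   # GB of underlying (EBS-backed) storage
    MultiAZ=True,          # maintain a standby replica in another Availability Zone
)
```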

Amazon Web Services’ Response

In response to the set of issues that prompted the outage, Amazon proposes to take the following steps:

1. Increase automation of the network change/upgrade process that triggered the outage
2. Increase server capacity in EBS clusters to allow EBS nodes to find their replicas effectively in the event of a disruption
3. Develop more intelligent retry logic to prevent the “re-mirroring storm” that causes EBS nodes to seek and re-seek their replicas relentlessly. While EBS nodes should seek out their replicas after a service disruption, the logic behind the search for replicas should lead to the amelioration of an outage rather than its exacerbation (a sketch of such backoff-based retry logic follows this list).
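
As a sketch of what such backoff-based retry logic might look like, the snippet below waits an exponentially growing, randomly jittered interval between attempts rather than retrying relentlessly. It is a generic pattern, not AWS’s actual fix, and search_fn is a hypothetical stand-in for the replica search.

```python
# An illustrative sketch of "more intelligent retry logic": instead of
# re-trying relentlessly, each failed attempt waits an exponentially growing,
# randomly jittered interval before searching again. Generic pattern only,
# not AWS's actual implementation; search_fn is a hypothetical callable.
import random
import time


def find_replica_with_backoff(search_fn, max_attempts=8, base_delay=0.5, max_delay=60.0):
    """Call search_fn() until it returns a replica or attempts are exhausted."""
    for attempt in range(max_attempts):
        replica = search_fn()
        if replica is not None:
            return replica
        # Exponential backoff with full jitter keeps a fleet of nodes from
        # hammering the cluster in lockstep after a disruption.
        delay = min(max_delay, base_delay * (2 ** attempt))
        time.sleep(random.uniform(0, delay))
    return None
```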

Why Amazon’s Cloud Computing Outage Didn’t Violate Its SLA

Amazon’s cloud computing outage on April 21 and April 22 can be interpreted in one of two ways: (1) the outage constitutes a reflection on Amazon’s EC2 platform and its processes for disaster recovery; or (2) the outage represents a commentary on the state of the cloud computing industry as a whole. The outage began on Thursday and involved problems specific to Amazon’s Northern Virginia data center. Companies affected by the outage include HootSuite, Foursquare, Reddit, Quora and other start-ups such as BigDoor, Mass Relevance and Spanning Cloud Apps. HootSuite, a dashboard that allows users to manage content on a number of websites such as Facebook, LinkedIn, Twitter and WordPress, experienced a temporary crash on Thursday that affected a large number of sites. The social news website Reddit was unavailable until noon on Thursday, April 21. BigDoor, a 20-person start-up that provides online game and rewards applications, had restored most of its services by Friday evening even though its corporate website remained down. Netflix and Recovery.gov, meanwhile, escaped the Amazon outage either unscathed or with minimal interruption.

Amazon’s EC2 platform currently has five regions: US East (Northern Virginia), US West (Northern California), EU (Ireland), Asia Pacific (Singapore), and Asia Pacific (Tokyo). Each region is composed of multiple “Availability Zones”. Customers who launch server instances in different Availability Zones can, according to Amazon Web Services’ website, “protect [their] applications from failure of a single location.” The Amazon outage underscores that EC2 customers can no longer depend on multiple Availability Zones within a single region as insurance against system downtime. Customers will need to ensure that their architectures provide for duplicate copies of server instances in multiple regions.
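
A minimal sketch of that multi-region approach using boto3 (the current AWS SDK for Python) appears below. The AMI IDs and instance type are placeholders, since AMI IDs differ from region to region.

```python
# A hedged sketch of launching duplicate server instances in multiple regions
# with boto3; the AMI IDs and instance type below are placeholders.
import boto3

REGION_AMIS = {
    "us-east-1": "ami-xxxxxxxxxxxxxxxxx",  # placeholder AMI ID for US East
    "us-west-1": "ami-yyyyyyyyyyyyyyyyy",  # placeholder AMI ID for US West
}

for region, ami_id in REGION_AMIS.items():
    ec2 = boto3.client("ec2", region_name=region)
    ec2.run_instances(
        ImageId=ami_id,
        InstanceType="t3.micro",
        MinCount=1,
        MaxCount=1,
    )
    print(f"Launched a duplicate instance in {region}")
```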

Amazon’s EC2 SLA commits to 99.5% system uptime for customers who have deployments in more than one Availability Zone within a specific region. However, the SLA guarantees only the ability to connect to and provision instances. On Thursday and Friday, Amazon’s US East customers could still connect to and provision instances, but the outage adversely affected their deployments because of problems with Amazon’s Elastic Block Storage (EBS) and Relational Database Service (RDS) platforms. EBS provides the block storage volumes attached to EC2 instances, while RDS delivers managed relational databases for data provisioned on the EC2 platform. Because Amazon’s problems were confined to EBS and RDS in the US East region, Amazon’s SLA was not violated for customers affected by the outage. The immediate consequence is that Amazon EC2 customers will need to deploy copies of the same server instance in multiple regions to guarantee 100% system uptime, assuming, of course, that the wildly unlikely scenario in which multiple Amazon cloud computing regions experience outages at the same time never transpires.
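
To put the uptime percentage cited above in perspective, a quick back-of-the-envelope calculation shows how much annual downtime such a commitment still permits:

```python
# Back-of-the-envelope arithmetic: how much downtime per year a given uptime
# percentage still allows, using the figure cited in the paragraph above.
HOURS_PER_YEAR = 365 * 24   # 8,760 hours


def allowed_downtime_hours(uptime_pct: float) -> float:
    return HOURS_PER_YEAR * (1 - uptime_pct / 100)


print(allowed_downtime_hours(99.5))   # 43.8 hours of permitted downtime per year
print(allowed_downtime_hours(100.0))  # 0.0 -- the bar a multi-region design aims for
```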

Anyone familiar with the cloud computing industry knows full well that Amazon, Rackspace, Microsoft and Google have all experienced glitches resulting in system downtime in the last three years. These repeated instances of system downtime across vendors point to the immaturity of the technological architecture and processes for delivering cloud computing services. Until the architecture and processes for cloud computing operational management improve, customers will need to weigh the costs of redundant data architectures that insure them against system downtime against the risk and costs of actual downtime.

For a non-technical summary of the technical issues specific to the outage, see Cloud Computing Today’s “Understanding Amazon Web Services’s 2011 Outage”.