Recent announcements by Joyent and Qihoo 360 Technologies indicate that the use of cloud computing technology in China is poised to proliferate dramatically in 2011. On May 16, Joyent revealed details of an alliance with ClusterTech under which ClusterTech becomes the provider of Joyent's public cloud services to companies in the gaming, media, mobile and social media space in China. Under this arrangement, ClusterTech will provision Joyent's SmartDataCenter 6 cloud computing software to “service providers, data center operators and systems integrators” that will, in turn, deliver Joyent's cloud computing technology to media, gaming and mobile companies in China. By licensing its cloud computing software to a third-party distributor, Joyent leverages a business model that differs markedly from that of most of its U.S. competitors, such as Amazon Web Services and Rackspace, which retain control over the deployment of their cloud computing operating systems. Joyent's partnership with ClusterTech builds upon its 2009 entry into the Chinese cloud computing market with a public cloud data center in the Qinhuangdao Economic and Technological Development Zone (QETDZ), Hebei Province, China.

Meanwhile, Qihoo 360 Technologies, developer of China's most popular internet security software, recently announced plans to enter the cloud computing space by providing online data storage. Qihoo CEO Zhou Hongyi mentioned the possibility of acquiring relevant companies in order to expand into the cloud computing and data storage space. The company's first-quarter revenue more than doubled to $22.9 million from $9.7 million a year earlier, largely as a result of increased online advertising revenue. Qihoo went public in March through an IPO that valued the company at $202 million, with shares priced at $14.50. As of June 1, the stock trades at $26.25 a share, up more than 81% from its IPO price.
Apache LibCloud’s May 19 graduation from the Apache Incubator signifies that the race toward cloud inter-operability is firmly underway. LibCloud provides an open source Python library of back-end drivers that enables developers to connect to the APIs of over 20 cloud computing platforms such as Amazon EC2, Eucalyptus, GoGrid, IBM Cloud, Linode, Terremark and vCloud. Developers can write code once and then re-deploy their applications on other cloud environments in order to avoid vendor lock-in and create redundant architectures for disaster recovery purposes. LibCloud was originally developed by Cloudkick, which Rackspace later acquired, and migrated to the Apache Incubator in November 2009. LibCloud’s graduation from the Apache Incubator as a Top Level Project means that the product will be managed by a Project Management Committee that assumes responsibility for its evolution and subsequent releases. LibCloud is currently available under version 2.0 of the Apache Software License.
The principal drawback of LibCloud is its exclusive use of Python as the language for connecting drivers to vendor APIs. Red Hat’s DeltaCloud, in contrast, leverages a REST-based API that offers more flexibility than LibCloud’s Python library for migrating software deployments from one cloud infrastructure to another. Like LibCloud, DeltaCloud is being groomed through the Apache Incubator but has a few more steps to travel before graduation and the achievement of top level status. Nevertheless, open source options are clearly leading the charge toward cloud inter-operability, although they all presently require the withdrawal of a cloud instance to a holding database followed by re-deployment through the activation of the linking API. In other words, neither LibCloud nor DeltaCloud enables developers to connect Amazon EC2 to Rackspace without an intermediary database as a preliminary step.
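The "write once, re-deploy anywhere" idea behind these libraries rests on a driver abstraction: application code targets one small interface, and per-vendor drivers translate it into each provider's API. The sketch below illustrates that pattern only; the driver classes are hypothetical stand-ins, not LibCloud's actual driver implementations or its real API.

```python
# Illustrative sketch of the driver-abstraction idea behind LibCloud and
# DeltaCloud. The provider classes here are hypothetical stand-ins.
from abc import ABC, abstractmethod


class NodeDriver(ABC):
    """Minimal common interface every cloud driver must implement."""

    @abstractmethod
    def create_node(self, name, size):
        ...

    @abstractmethod
    def list_nodes(self):
        ...


class FakeEC2Driver(NodeDriver):
    """Stand-in for an EC2 driver; a real driver would call the EC2 API."""

    def __init__(self):
        self._nodes = []

    def create_node(self, name, size):
        self._nodes.append(name)
        return name

    def list_nodes(self):
        return list(self._nodes)


class FakeRackspaceDriver(FakeEC2Driver):
    """Stand-in for a Rackspace driver; same interface, different backend."""


def deploy(driver, names):
    # Written once against NodeDriver; runs unchanged on any provider.
    for name in names:
        driver.create_node(name, size="small")
    return driver.list_nodes()


# The same deployment code targets two different "clouds".
print(deploy(FakeEC2Driver(), ["web-1", "web-2"]))
print(deploy(FakeRackspaceDriver(), ["web-1", "web-2"]))
```

The pattern also makes the limitation noted above concrete: swapping the driver re-creates the deployment from scratch on the new provider; it does not move a live instance directly from one cloud to another.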
2011 has been an extraordinary year for cloud computing so far. Amazon Web Services (AWS) set the pace with an aggressive roll-out of products such as Elastic Beanstalk, CloudFormation, Amazon Cloud Player and Amazon Cloud Drive. Just when AWS seemed poised to consolidate its first mover advantage with respect to cloud computing market share, the landscape exploded with a veritable feast of product offerings, business partnerships and acquisitions. Every month another Fortune 500 IT or telecommunications company throws its hat into the cloud computing ring: Dell’s vStart, Dell’s recent partnership with SAP, IBM’s SmartCloud, Apple’s iCloud and HP’s BladeSystem Matrix mark just some of the big names and brands that have entered the cloud computing dohyo, or sumo circle. The cast of new actors has made relative market share and growth within the industry painfully difficult for analysts to quantify. But within this bewildering sea of change, three industry trends have emerged that deserve attention:
1. Outages across the industry signal that demand outstrips supply
Demand for cloud computing services has begun to outstrip supply to the point where vendor processes for guaranteeing system uptime have become increasingly challenged. The Amazon Web Services outage of April 2011 was the most glaring example of a lack of effective, scalable processes at one of the world’s premier IaaS vendors, but 2011 has also witnessed notable outages at Sony PlayStation, Twitter, Gmail and Google’s Blogger. Expect more outages and service disruptions until the industry takes the time to develop processes for delivering on 99.99% SLAs as opposed to merely promising them.
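To put 99.99% in perspective, a quick calculation shows how small the downtime budget behind a "four nines" SLA actually is:

```python
# Downtime budget implied by an availability SLA: a 99.99% guarantee
# permits only about 52.6 minutes of downtime per year. The multi-day
# April AWS outage alone far exceeded that budget.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_budget_minutes(sla_percent):
    """Minutes of downtime per year permitted by an uptime SLA."""
    return MINUTES_PER_YEAR * (1 - sla_percent / 100)


for sla in (99.0, 99.9, 99.99):
    print(f"{sla}% uptime -> {downtime_budget_minutes(sla):.1f} min/year")
```

Each additional "nine" shrinks the allowance tenfold, which is why delivering on such an SLA requires engineered recovery processes rather than promises.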
2. Early Consolidation vs. the Proliferation of New Entrants to the Market
The past five months have witnessed Verizon’s acquisition of Terremark, Time Warner Cable’s acquisition of NaviSite, CenturyLink’s acquisition of Savvis and rife speculation that Rackspace lies next on the totem pole of potential buyouts. In tandem with the finalization of these acquisitions, a slew of other companies such as Appistry, CA Technologies, Engine Yard, Flexiant, GigaSpaces, RightScale and ThinkGrid have emerged on the landscape and promise to collectively cobble together a non-trivial slice of the market while potentially transforming into significant niche players themselves. Expect new entrants on the scene, particularly in the open source space, that will increasingly complicate the IaaS market share dominance of AWS, Eucalyptus, Rackspace, GoGrid and Joyent. Consolidations will continue, but the market is unlikely to congeal into a few dominant players for quite some time.
3. The Rise of Open Source Cloud Computing Solutions
Rackspace, Dell and Equinix’s launch of a demonstration environment for OpenStack promises to change the industry by enticing customers to toy with its open source platform for free while paying for consultative support services associated with cloud design and management. Meanwhile, Canonical’s decision to change the cloud computing provider for its Ubuntu Enterprise Cloud (UEC) offering from Eucalyptus to OpenStack testifies to the strength of OpenStack and, conversely, underscores Eucalyptus’s challenge in defining its value proposition as an Amazon EC2-compatible open source IaaS platform. Red Hat’s open source PaaS product, OpenShift, marks another leading contender in the open source ring by virtue of its deployment flexibility across Java, Python, PHP and Ruby environments. Expect open source IaaS and PaaS offerings to become increasingly robust and scalable. If open source solutions can demonstrate reliable, high quality portability across platforms, the market for less portable, proprietary IaaS and PaaS solutions is likely to shrink dramatically. The fortunes of OpenStack, OpenShift and the recently formed Open Virtualization Alliance merit a close watch, in particular.
Last week’s report by Bloomberg that the outage on the PlayStation Network was caused by a hacker using Amazon Web Services’s EC2 platform raises interesting questions in the newly emerging field of cloud computing law. Can Amazon Web Services be held responsible for the breach? In the event of a violation of security on one cloud infrastructure that stems from another cloud computing platform, can the originating cloud computing vendor be deemed legally responsible for the security violation? Consider the case of HIPAA legislation as it relates to the cloud, for example: as “business associates” of “covered entities” such as provider organizations, cloud computing vendors bear responsibility for the security and privacy of patient health information data. A covered entity such as a hospital that stores personal health information on Amazon’s EC2 infrastructure can expect that, as a business associate, Amazon Web Services will demonstrate adherence to HIPAA’s privacy and security regulations that require data encryption, access controls, and processes for data back-up and audit review of access.
What is Amazon Web Services’s degree of liability for the Sony outage, if any? Sources close to the investigation revealed that hackers rented one of Amazon’s EC2 servers and then deployed the attack on Sony PlayStation’s network that compromised the security of 100 million Sony customers. Amazon Web Services is likely to be subpoenaed in the investigation in order to extract details of the method of payment and the IP addresses used for the attack. That said, one would be hard pressed to imagine making a legal case that Amazon bears responsibility for the attack, given that virtually any of its customers could have launched it and there currently exists no easy method of differentiating criminal accounts from legitimate ones. Granted, one could argue that cloud computing vendors should develop the IT infrastructure to proactively identify suspicious behavior and curtail it as necessary. Given the recent proliferation of cases where hackers use rented or hijacked servers to launch cyber-attacks, such legislation may not be entirely inconceivable as the cloud computing space evolves. Right now, however, regulatory bodies such as NIST and officials such as U.S. CIO Vivek Kundra have their hands full grappling with inter-operability and quality standards for cloud-based data storage and transmission, separate from formulating the legally precarious constraint that would mandate cloud computing vendors to develop processes to detect hack-attacks before they happen.
Google’s Blogger service experienced a major outage on Thursday, May 12 that continued until service was finally restored on Friday, May 13 at 10:30 AM PDT. Users were unable to log in to the dashboard that enables bloggers to publish and edit posts, edit widgets and alter the design templates for their blogs. The outage coincided with the impending launch of a major overhaul to Blogger’s user interface and functionality, but a Blogger tweet asserted the independence of the outage from the upcoming redesign. Most notable about the outage, however, was Google’s tight-lipped explanation of the technical reasons responsible for it, in contradistinction to Amazon Web Services’ (AWS) exhaustively thorough explanation of its own service outage in late April. Blogger’s Tech Lead/Manager Eddie Kessler explained the Blogger outage as follows:
Here’s what happened: during scheduled maintenance work Wednesday night, we experienced some data corruption that impacted Blogger’s behavior. Since then, bloggers and readers may have experienced a variety of anomalies including intermittent outages, disappearing posts, and arriving at unintended blogs or error pages. A small subset of Blogger users (we estimate 0.16%) may have encountered additional problems specific to their accounts. Yesterday we returned Blogger to a pre-maintenance state and placed the service in read-only mode while we worked on restoring all content: that’s why you haven’t been able to publish. We rolled back to a version of Blogger as of Wednesday May 11th, so your posts since then were temporarily removed. Those are the posts that we’re in the progress of restoring.
Routine maintenance caused “data corruption” that led to disappearing posts and the subsequent outage of the user management dashboard. But Kessler neither elaborates on the error that resulted from “scheduled maintenance” nor specifies the form of data corruption that caused such a wide variety of errors on Blogger pages. In contrast, AWS revealed that its outage was caused by misrouting network traffic from a high bandwidth connection to a low bandwidth connection on Elastic Block Store, the block storage service for Amazon EC2 instances. In its post-mortem explanation, AWS described the repercussions of the network misrouting on the architecture of EBS within the affected Region in excruciatingly impressive detail. Granted, Blogger is a free service used primarily for personal blogging, whereas AWS hosts customers with hundreds of millions of dollars in annual revenue. Nevertheless, Blogger users published half a billion posts in 2010 which were read by 400 million readers across the world. Users, readers and cloud computing savants alike would all benefit from learning more about the technical issues responsible for outages such as this one, because vendor transparency will only increase public confidence in the cloud and help propel industry-wide innovation. Even if the explanation were not quite as thorough as that offered by Amazon Web Services, Google would do well to supplement its note about “data corruption” with something more substantial for Blogger users and the cloud computing community more generally.
At its May 2011 summit in Boston, Red Hat, the world’s leading provider of open source solutions, announced the launch of CloudForms and OpenShift, two products that represent the company’s boldest entrance into the cloud computing space so far. CloudForms marks an IaaS offering that enables enterprises to create and manage a private or hybrid cloud computing environment. CloudForms provides customers with Application Lifecycle Management (ALM) functionality that enables management of an application deployed over a constellation of physical, virtualized and cloud-based environments. Whereas VMware’s vCloud enables customers to manage virtualized machines, Red Hat’s CloudForms delivers a more granular form of management functionality that allows users to manage applications. Moreover, CloudForms offers a resource management interface that confronts the industry problem known as virtual sprawl, wherein IT administrators must manage a proliferating array of servers, hypervisors, virtual machines and clusters. Red Hat’s IaaS product also offers customers the ability to create integrated, hybrid cloud environments that leverage a combination of physical servers, virtual servers and public clouds such as Amazon EC2.
OpenShift represents Red Hat’s PaaS product that enables open source developers to build cloud computing environments from within a specified range of development frameworks. OpenShift supports Java, Python, PHP and Ruby applications built on frameworks such as Spring, Seam, Weld, CDI, Rails, Rack, Symfony, Zend Framework, Twisted, Django and Java EE. In supporting Java, Python, PHP and Ruby, OpenShift offers the most flexible development environment in the industry as compared to Amazon’s Elastic Beanstalk, Microsoft Azure and Google’s App Engine. For storage, OpenShift features SQL and NoSQL in addition to a distributed file system. Red Hat claims OpenShift delivers greater portability than other PaaS products because customers will be able to migrate their deployments to another cloud computing vendor using the DeltaCloud inter-operability API. The only problem with this marketing claim is that DeltaCloud is by no means the most widely accepted cloud computing inter-operability API in the industry. Red Hat submitted the DeltaCloud API to the Distributed Management Task Force (DMTF) in August 2010, but the Red Hat API faces stiff competition from open source versions of Amazon’s EC2 APIs as well as APIs from the OpenStack project.
In summary, Red Hat’s entrance into the IaaS and PaaS space promises to significantly change the cloud computing landscape. CloudForms signals genuine innovation in the IaaS space because of its Application Lifecycle Management capabilities and hybrid infrastructure flexibility. OpenShift, meanwhile, presents direct competition to Google App Engine, Microsoft Azure and Amazon’s Elastic Beanstalk because of the breadth of its deployment platform and claims about increased portability. What makes OpenShift so intriguing is that it constitutes Red Hat’s most aggressive attempt so far to establish DeltaCloud as the standard API for the cloud computing industry.
On Friday, April 29, 2011, Amazon Web Services issued an apology and detailed technical explanation of the outage that affected its US East Region from April 21, 1:00 AM PDT to April 24, 7:30 PM PDT. A complete description of Amazon’s cloud computing technical architecture appears in the full text of Amazon’s post-mortem analysis of the outage and its accompanying apology. This posting elaborates on the technical issues responsible for Amazon’s outage, with the intent of giving readers a condensed understanding of Amazon’s cloud computing architecture and the kinds of problems that are likely to affect the cloud computing industry more generally. We are impressed with the candor and specificity of Amazon’s response and believe it ushers in a new age of transparency and accountability in the cloud computing space.
Guide to the April 2011 Amazon Web Services Outage:
1. Elastic Block Store Architecture
Elastic Block Store (EBS) provides block-level storage volumes for Amazon EC2 instances. EBS has two components: (1) EBS clusters, each of which is composed of a set of nodes; and (2) a Control Plane Services platform that accepts user requests and directs them to the appropriate EBS cluster. Nodes within EBS clusters communicate with one another by means of a high bandwidth primary network and a lower capacity network used as a back-up.
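The two-tier arrangement can be sketched as a simple dispatch model. The cluster names and routing rule below are illustrative assumptions for exposition, not Amazon's actual implementation:

```python
# Simplified model of the EBS architecture described above: a control
# plane accepts user requests and routes each to a cluster of nodes.
# Cluster names and the routing rule are illustrative assumptions.
class EBSCluster:
    def __init__(self, name, nodes):
        self.name = name
        self.nodes = nodes        # node IDs belonging to this cluster
        self.requests = []        # requests this cluster has handled

    def handle(self, request):
        self.requests.append(request)
        return f"{self.name} handled {request}"


class ControlPlane:
    """Accepts user requests and directs them to the appropriate cluster."""

    def __init__(self, clusters):
        self.clusters = {c.name: c for c in clusters}

    def route(self, cluster_name, request):
        return self.clusters[cluster_name].handle(request)


plane = ControlPlane([EBSCluster("cluster-a", ["n1", "n2"]),
                      EBSCluster("cluster-b", ["n3"])])
print(plane.route("cluster-a", "create-volume"))
```

The separation matters for the outage narrative that follows: a degraded cluster can back up requests into the control plane, which serves the entire Region.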
2. Manual Error with Network Upgrade Procedure
The outage began when a routine procedure to upgrade the capacity of the primary network resulted in traffic being directed to EBS’s lower capacity network instead of an alternate router on the high capacity network. Because the high capacity network was temporarily disengaged, and the low capacity network could not handle the traffic that had been shunted in its direction, many nodes in the affected EBS availability zone were isolated.
3. Re-Mirroring of Elastic Block Store Nodes
Once Amazon engineers noticed that the network upgrade had been executed incorrectly, they restored connectivity on the high bandwidth network. Nodes that had become isolated then searched for other nodes onto which they could “mirror”, or duplicate, themselves. But because so many nodes were searching for replica space at once, the EBS cluster’s free capacity was quickly exhausted. Consequently, approximately 13% of the nodes within the affected Availability Zone became “stuck”.
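A toy model makes the dynamic concrete. It assumes a fixed pool of free replica slots and that every isolated node immediately requests one; the numbers are illustrative, not Amazon's actual cluster parameters:

```python
# Toy model of the "re-mirroring storm": when many isolated nodes search
# for replica space simultaneously, free capacity runs out and the
# remainder get stuck. Figures are illustrative assumptions only.
def remirror(free_slots, isolated_nodes):
    """Return (mirrored, stuck) node counts given free replica slots."""
    mirrored = min(isolated_nodes, free_slots)
    stuck = isolated_nodes - mirrored
    return mirrored, stuck


# 1000 isolated nodes competing for 870 free slots: 130 nodes (13%,
# matching the proportion reported for the affected zone) get stuck.
mirrored, stuck = remirror(free_slots=870, isolated_nodes=1000)
print(mirrored, stuck)
```

The model also shows why adding capacity (step 5 below) unsticks nodes: enlarging the slot pool lets the remaining nodes complete their re-mirroring.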
4. Control Plane Service Platform Isolated
The full utilization of the EBS storage system by stuck nodes seeking to re-mirror themselves impacted the Control Plane Services platform that directs user requests from an API to EBS clusters. With its capacity exhausted, the EBS cluster could no longer accommodate requests from the Control Plane Service. Because the degraded EBS cluster began to have an adverse effect on the Control Plane Service throughout the entire Region, Amazon disabled communication between the EBS clusters and the Control Plane Service.
5. Restoring EBS cluster server capacity
Amazon engineers knew that the isolated nodes had exhausted server capacity within the EBS cluster. In order to enable the nodes to re-mirror themselves, they added extra server capacity to the degraded EBS cluster. Once the stuck nodes had found their replicas, the connection between the Control Plane Service and EBS was restored.
6. Relational Database Service Fails to Replicate
Amazon’s Relational Database service manages communication between multiple databases that leverage EBS’s database structure. RDS can be configured to function in one Availability Zone or several. RDS instances that have been configured to operate across multiple Availability Zones should switch to their replica on an Availability Zone unaffected by a service disruption. The network interruption on the degraded EBS cluster caused 2.5% of multi-AZ RDS instances to fail to find their replica due to an unexpected bug.
Amazon Web Services’s Response
In response to the set of issues that prompted the outage, Amazon proposes to take the following steps:
1. Increase automation of the network change/upgrade process that triggered the outage
2. Increase server capacity in EBS clusters to allow EBS nodes to find their replicas effectively in the event of a disruption
3. Develop more intelligent re-try logic to prevent the “re-mirroring storm” that causes EBS nodes to seek and re-seek their replicas relentlessly. While EBS nodes should seek out their replicas after a service disruption, the logic behind the search for replicas should lead to amelioration of an outage rather than its exacerbation.