After seven years, Netflix reports that it has finally completed its migration to Amazon Web Services. In other words, every non-AWS data center previously used by the company has been shut down as of January 2016, and its workloads have moved to the Amazon cloud. In a blog post by Yury Izrailevsky, Stevan Vlaovic and Ruslan Meshenberg, Netflix reports that the transition to the cloud has brought a multitude of benefits: the ability to accommodate explosive growth, costs that “ended up being a fraction of those in the data center,” enhanced service availability that enabled it to reach its goal of four nines of service uptime, and greater operational agility.
Migrating to the cloud, for example, has enabled Netflix to add “thousands of virtual servers and petabytes of storage within minutes” while accommodating a more than 1,000-fold increase in monthly streaming hours between 2007 and 2015 and an eightfold rise in its membership.
The completion of Netflix’s migration to Amazon represents a true milestone in the evolution of cloud computing, given that the largest streaming video service in the world has chosen to partner with one public cloud vendor for all of its cloud computing needs. That Netflix chose Amazon represents a stunning affirmation of Amazon’s ability to scale and the richness of its features. The question now is whether Netflix will complement its partnership with Amazon by adding another major cloud vendor, both to reduce the risk of depending on a single vendor and to recognize the ascendancy of other players in the public cloud space.
Netflix recently delivered a stunningly detailed elaboration of the cloud foundation for its Hadoop architecture in a blog post titled “Hadoop Platform as a Service in the Cloud” by Sriram Krishnan and Eva Tse. The post explains the technical foundation underpinning “Genie,” Netflix’s Platform as a Service for Hadoop. But in order to detail the technical underpinnings of Genie, the Netflix Data Science & Engineering team positioned its Hadoop Platform as a Service infrastructure within the larger context of its Amazon Web Services S3 cloud storage platform and Amazon’s distribution of Hadoop, Elastic MapReduce (EMR). Importantly, the blog post suggests the possibility of open-sourcing Genie “in the near future” and solicits reader feedback about whether a Hadoop Platform as a Service product might be useful to organizations processing petabytes of data and more.
Key features of the Netflix Platform as a Service For Hadoop include:
Data Storage on Amazon S3
Whereas most traditional Hadoop deployments store data in the Hadoop Distributed File System (HDFS), Netflix opted to store all of its data on Amazon S3 and to use EMR for processing.
Benefits of S3 include the following:
•Durability and availability of objects over a given year on the order of eleven 9s (99.999999999%) and four 9s (99.99%) respectively
•Granular versioning capabilities
•Elastic capabilities that result in virtually unlimited capacity on demand
•The ability to manage multiple, disparate Hadoop clusters that read from the same underlying data set
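To make the durability and availability percentages quoted above concrete, the following minimal Python sketch converts them into a count of 9s, a yearly downtime budget and an expected number of lost objects. The function names and the ten-million-object example are illustrative assumptions, not figures from the Netflix post.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def count_nines(percent: str) -> int:
    """Count the leading 9s in a percentage given as a string, e.g. '99.99'."""
    nines = 0
    for ch in percent.replace(".", ""):
        if ch != "9":
            break
        nines += 1
    return nines


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Minutes per year a service may be unavailable at this availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def expected_annual_losses(object_count: int, durability_nines: int) -> float:
    """Expected number of objects lost per year at the given durability."""
    return object_count * 10 ** (-durability_nines)


if __name__ == "__main__":
    print(count_nines("99.999999999"), "nines of durability")
    print(count_nines("99.99"), "nines of availability")
    print(round(downtime_minutes_per_year(99.99), 2), "minutes of downtime per year")
    # Hypothetical bucket holding ten million objects.
    print(expected_annual_losses(10_000_000, 11), "objects expected lost per year")
```

At 99.99% availability the yearly budget works out to roughly 52.6 minutes, which is also what a "four nines" uptime goal implies for any service.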
Disparate Hadoop Clusters For Dedicated Workloads
Genie’s architecture features multiple Hadoop clusters, such as a query cluster and a production cluster. The query cluster is a large, 500-node cluster used for ad hoc queries, whereas the production cluster runs large ETL processes. All of these clusters can be dynamically resized in line with the volume of data being processed. The query cluster, for example, typically shrinks at night given the reduced need for ad hoc queries. Conversely, the production cluster expands at night to handle the ETL processes scheduled to run then.
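The day and night resizing pattern described above can be sketched as a simple time-of-day policy. This is only an illustration, not Netflix's resizing logic: the `ResizePolicy` class, the night window and every node count other than the 500-node query cluster are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResizePolicy:
    """Hypothetical policy mapping the hour of day to a target cluster size."""

    day_nodes: int
    night_nodes: int
    night_start_hour: int = 20  # assumed start of the nightly ETL window
    night_end_hour: int = 6     # assumed end of the nightly ETL window

    def target_nodes(self, hour: int) -> int:
        """Return the desired node count for an hour in the range 0-23."""
        if not 0 <= hour <= 23:
            raise ValueError(f"hour must be in 0..23, got {hour}")
        is_night = hour >= self.night_start_hour or hour < self.night_end_hour
        return self.night_nodes if is_night else self.day_nodes


# Only the 500-node daytime query cluster comes from the article;
# the remaining sizes are placeholders.
QUERY = ResizePolicy(day_nodes=500, night_nodes=200)
PRODUCTION = ResizePolicy(day_nodes=100, night_nodes=300)

if __name__ == "__main__":
    for hour in (2, 14, 23):
        print(hour, QUERY.target_nodes(hour), PRODUCTION.target_nodes(hour))
```

In practice the target size would more likely be driven by queue depth or pending job counts than by the clock alone; a fixed schedule is simply the easiest way to show the two clusters moving in opposite directions.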
Developers typically use the following languages and tools to access Hadoop clusters:
•Hive for queries and analytics
•Python and Pig for ETL processes
•MapReduce for complex algorithms
•Communal gateways that permit the writing of Hive and Pig queries for multiple developers
•Personal gateway AMIs for heavy users that permit the customization of client-side development
Hadoop Platform As A Service
Unlike Amazon’s Elastic MapReduce, which provides Infrastructure as a Service for Hadoop, Netflix’s Platform as a Service exposes a RESTful API through which developers can execute Hadoop, Pig and Hive scripts without provisioning new Hadoop clusters or installing Hadoop, Pig and Hive clients. Genie also allows administrators to manage Hadoop deployments using a backend configuration tool.
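To illustrate what submitting a job through a REST-style service of this kind might look like from the client side, here is a minimal Python sketch. The endpoint URL, the JSON field names and the `build_job_request` helper are all hypothetical and are not taken from Genie's actual API; the sketch only builds the HTTP request and does not send it.

```python
import json
import urllib.request

# Placeholder endpoint; a real deployment would have its own URL and schema.
JOBS_URL = "http://genie.example.com/jobs"

SUPPORTED_JOB_TYPES = {"hadoop", "pig", "hive"}


def build_job_request(job_type: str, script_path: str, cluster: str, user: str):
    """Build (but do not send) a POST request describing a job to run."""
    if job_type not in SUPPORTED_JOB_TYPES:
        raise ValueError(f"unsupported job type: {job_type}")
    payload = {
        "jobType": job_type,        # hypothetical field names throughout
        "scriptPath": script_path,  # e.g. a script stored on S3
        "cluster": cluster,         # e.g. "query" or "production"
        "user": user,
    }
    return urllib.request.Request(
        JOBS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_job_request(
        "hive", "s3://example-bucket/reports/daily.hql", "query", "analyst"
    )
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

The point of the design is that the client needs nothing beyond an HTTP library: the service, not the developer's machine, decides which cluster runs the script and holds the Hadoop, Pig and Hive installations.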
Kudos go to Netflix for its sustained and specific elaboration on the architecture of Genie. Hadoop Platform as a Service vendors have recently begun to proliferate in the industry and include the likes of Microsoft, Infochimps, Continuuity and Mortar Data. Microsoft announced its Azure-based Hadoop platform, Windows Azure HDInsight, in late October 2012. Infochimps, meanwhile, delivers a Big Data platform as a service that supports NoSQL data stores such as HBase, Cassandra and MongoDB in addition to Hadoop. Continuuity’s AppFabric platform provides a set of APIs that sit atop a company’s Hadoop deployment, while AWS Global Start-up Finalist Mortar Data provides an open-source framework that lets developers apply their skills in Pig, Java and Python to a Hadoop ecosystem. Netflix’s Genie is without doubt the most production-ready Hadoop Platform as a Service in the industry given the sheer volume of data it processes daily. That said, the industry should expect more Hadoop Platform as a Service vendors to emerge as the need for simplified, PaaS-like methods of Hadoop management becomes more urgent.