Netflix Details Architecture Of Hadoop Big Data Platform As A Service

Netflix recently published a strikingly detailed description of the cloud foundation for its Hadoop architecture in a blog post titled “Hadoop Platform as a Service in the Cloud” by Sriram Krishnan and Eva Tse. The post explains the technical foundation underpinning “Genie,” Netflix’s Platform as a Service for Hadoop. To do so, the Netflix Data Science & Engineering team situates Genie within the larger context of its use of Amazon Web Services S3 cloud storage and Elastic MapReduce (EMR), Amazon’s distribution of Hadoop. Importantly, the blog post suggests the possibility of open-sourcing Genie “in the near future” and solicits reader feedback about whether a Hadoop Platform as a Service product might be useful to organizations processing petabytes of data or more.

Key features of the Netflix Platform as a Service For Hadoop include:

Data Storage on Amazon S3

Whereas most traditional Hadoop deployments store their data in the Hadoop Distributed File System (HDFS), Netflix opted to store all of its data on Amazon S3 and to run its Hadoop clusters on EMR against that data.

Benefits of S3 include the following:

•Durability and availability of objects over a given year on the order of eleven 9s (99.999999999%) and four 9s (99.99%) respectively
•Granular versioning capabilities
•Elastic capabilities that result in virtually unlimited capacity on demand
•The ability to manage multiple, disparate Hadoop clusters that read from the same underlying data set
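
To put the 99.999999999% durability and 99.99% availability figures in perspective, the short calculation below converts them into an expected number of lost objects and an expected amount of downtime per year. It is a rough sketch only: the object count is an illustrative assumption rather than a Netflix figure, and the durability figure is treated as an independent annual loss probability per object, which is a simplification.

```python
# Illustrative arithmetic only; the object count below is an assumption.
ANNUAL_LOSS_RATE = 1e-11      # 1 - 99.999999999% durability
ANNUAL_UNAVAILABILITY = 1e-4  # 1 - 99.99% availability

MINUTES_PER_YEAR = 365 * 24 * 60


def expected_objects_lost_per_year(object_count: int) -> float:
    """Expected number of objects lost in one year for a given store size."""
    return object_count * ANNUAL_LOSS_RATE


def expected_downtime_minutes_per_year() -> float:
    """Expected minutes per year during which objects are unavailable."""
    return MINUTES_PER_YEAR * ANNUAL_UNAVAILABILITY


if __name__ == "__main__":
    objects = 10_000_000  # hypothetical number of stored objects
    lost = expected_objects_lost_per_year(objects)
    print(f"Expected objects lost per year: {lost:.6f}")
    print(f"Roughly one object lost every {1 / lost:,.0f} years")
    print(f"Expected downtime: {expected_downtime_minutes_per_year():.2f} minutes per year")
```

Under these assumptions, a store of ten million objects would expect to lose a single object about once every 10,000 years, while being unreachable for a little under an hour per year.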

Disparate Hadoop Clusters For Dedicated Workloads

Genie’s architecture features multiple Hadoop clusters such as:

•Query cluster
•Production cluster
•Dev clusters

The query cluster is a large, 500-node cluster used for ad hoc queries, whereas the production cluster runs large ETL processes. All of these clusters can be dynamically resized in accordance with the volume of data being processed. The query cluster, for example, typically shrinks at night given the reduced need for ad hoc queries. Conversely, the production cluster expands at night to accommodate the ETL processes that run then.
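
The day and night resizing pattern described above can be sketched as a simple time-based policy. This is a hypothetical illustration, not Netflix’s actual mechanism: the hour boundaries and every node count other than the 500-node daytime query cluster are assumptions, and a real deployment would pass the resulting size to whatever resize operation its cluster manager provides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterPolicy:
    """Target sizes for one cluster; all numbers here are illustrative."""
    name: str
    day_nodes: int
    night_nodes: int


# Hypothetical boundaries for "daytime", in local hours on a 24-hour clock.
DAY_START_HOUR = 8
DAY_END_HOUR = 20

POLICIES = [
    # Query cluster shrinks at night, when few ad hoc queries are run.
    ClusterPolicy("query", day_nodes=500, night_nodes=200),
    # Production cluster grows at night, when the large ETL jobs run.
    ClusterPolicy("production", day_nodes=150, night_nodes=400),
]


def target_nodes(policy: ClusterPolicy, hour: int) -> int:
    """Return the desired node count for a cluster at the given hour."""
    if not 0 <= hour <= 23:
        raise ValueError(f"hour must be in 0..23, got {hour}")
    is_day = DAY_START_HOUR <= hour < DAY_END_HOUR
    return policy.day_nodes if is_day else policy.night_nodes


if __name__ == "__main__":
    for hour in (3, 14):
        for policy in POLICIES:
            print(f"{hour:02d}:00 {policy.name}: {target_nodes(policy, hour)} nodes")
```

A fixed schedule is the simplest possible policy; sizing on observed queue depth or pending job counts would track load more closely at the cost of more moving parts.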

Programming Languages

Developers typically use the following languages and tools to access Hadoop clusters:

•Hive for queries and analytics
•Python and Pig for ETL processes
•MapReduce for complex algorithms
•Communal gateways that permit the writing of Hive and Pig queries for multiple developers
•Personal gateway AMIs for heavy users that permit the customization of client-side development

Hadoop Platform As A Service

Unlike Amazon’s Elastic MapReduce, which provides Infrastructure as a Service for Hadoop, Netflix’s Platform as a Service exposes a RESTful API through which developers can submit Hadoop, Pig and Hive jobs without provisioning new Hadoop clusters or installing Hadoop, Pig and Hive clients. Genie also allows administrators to manage Hadoop deployments using a backend configuration tool.
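
To make the idea of submitting a job over REST concrete, the sketch below builds an HTTP POST request that describes a job, without sending it. It is not Genie’s actual API: the endpoint URL, the JSON field names and the `build_job_request` helper are all hypothetical, chosen only to show that the client needs nothing beyond an HTTP library.

```python
import json
import urllib.request

# Hypothetical endpoint; the source does not give Genie's real URL scheme.
JOBS_URL = "http://genie.example.com/jobs"

SUPPORTED_JOB_TYPES = {"hadoop", "hive", "pig"}


def build_job_request(job_type: str, script_uri: str, user: str) -> urllib.request.Request:
    """Build (but do not send) a POST request describing a job to run."""
    if job_type not in SUPPORTED_JOB_TYPES:
        raise ValueError(f"unsupported job type: {job_type}")
    # Field names are assumptions made for this illustration.
    payload = {
        "jobType": job_type,   # which client the service should run
        "script": script_uri,  # where the script lives, for example on S3
        "user": user,          # who is submitting the job
    }
    return urllib.request.Request(
        JOBS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_job_request("hive", "s3://example-bucket/queries/report.q", "analyst")
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
    # Sending it would be a single call: urllib.request.urlopen(request)
```

The point of the design is visible in what the client lacks: no Hadoop, Pig or Hive installation and no knowledge of which cluster will run the job, since the service makes that choice.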

Analysis

Kudos go to Netflix for its sustained and specific elaboration of Genie’s architecture. Hadoop Platform as a Service vendors have recently begun to proliferate and include the likes of Microsoft, Infochimps, Continuuity and Mortar Data. Microsoft announced its Azure-based Hadoop platform, Windows Azure HDInsight, in late October 2012. Infochimps, meanwhile, delivers a Big Data platform as a service that supports NoSQL data stores such as HBase, Cassandra and MongoDB in addition to Hadoop. Continuuity’s AppFabric platform provides a set of APIs that sit atop a company’s Hadoop deployment, while AWS Global Start-up Finalist Mortar Data provides an open-source framework that lets developers apply their skills in Pig, Java and Python to the Hadoop ecosystem. Given the sheer volume of data it processes daily, Netflix’s Genie is without doubt the most production-ready Hadoop Platform as a Service in the industry. That said, the industry should expect more Hadoop Platform as a Service vendors to emerge as the need for simplified, PaaS-like methods of Hadoop management grows more urgent.
