This week, EMC launched its own distribution of Hadoop under the Pivotal HD brand. Built on technology that EMC obtained through its acquisition of Greenplum in July 2010, Pivotal HD represents EMC’s next iteration on the Greenplum Unified Analytics Platform (UAP) that it launched in December 2011. The Greenplum UAP featured EMC Greenplum HD, an enterprise-grade distribution of Hadoop, alongside Greenplum’s database for structured data. It also included Greenplum Chorus, an innovative platform for collaboration among data scientists working with Big Data in an organization. Pivotal HD, however, marks a significant new chapter in EMC’s Hadoop technology, as indicated by its array of features and architectural complexity.
Like many recent Hadoop distributions and technologies, Pivotal HD integrates with SQL to make Hadoop usable by developers and business analysts who lack familiarity with MapReduce. But the real innovation of Pivotal HD runs deeper than its integration of SQL with Hadoop: it positions Greenplum’s analytic engine alongside HDFS in ways that improve Hadoop query performance beyond what simply appending a SQL interface would achieve. Pivotal HD’s Advanced Database Services (HAWQ) delivers a high-performance SQL engine that offers greater SQL functionality and performance than analogous SQL interfaces such as Hive, Hadapt and Impala. Coupled with Pivotal HD’s virtualization and pluggable storage compatibility features, the platform represents a distinct moment of innovation in the Hadoop space, as evinced by the following three features:
Advanced Database Services (HAWQ)
Pivotal HD’s Advanced Database Services (HAWQ) brings Greenplum’s Massively Parallel Processing (MPP) architecture to Hadoop. As a result, Pivotal HD users can perform complex joins, MADlib in-database analytics and transactions. Moreover, users can connect virtually any BI tool on the market to obtain advanced reporting and visualization of data as required. According to EMC benchmarking data, HAWQ-based SQL queries outperform Hive in response time by as much as 100x.
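To make the MPP idea concrete, here is a minimal sketch of the scatter-gather pattern that MPP engines generally follow: rows are spread across segments, each segment computes a partial aggregate on its own data, and a coordinator merges the partial results. This is an illustration only and not HAWQ's actual implementation. The sample rows, the round-robin distribution and every function name below are assumptions made for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sample data: (region, sale_amount) rows.
ROWS = [
    ("east", 10), ("west", 5), ("east", 7),
    ("north", 3), ("west", 8), ("east", 1),
]


def distribute(rows, n_segments):
    """Spread rows across segments round-robin.

    Assumption for simplicity: real MPP systems typically hash a
    distribution key instead of using round-robin placement.
    """
    segments = [[] for _ in range(n_segments)]
    for i, row in enumerate(rows):
        segments[i % n_segments].append(row)
    return segments


def partial_sum(segment_rows):
    """Local work on one segment: SUM(amount) GROUP BY region."""
    totals = Counter()
    for region, amount in segment_rows:
        totals[region] += amount
    return totals


def mpp_group_sum(rows, n_segments=3):
    """Scatter the aggregation to all segments, then gather and merge."""
    segments = distribute(rows, n_segments)
    with ThreadPoolExecutor(max_workers=n_segments) as pool:
        partials = list(pool.map(partial_sum, segments))
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds counts together
    return dict(merged)


if __name__ == "__main__":
    print(mpp_group_sum(ROWS))
```

The merged totals do not depend on how many segments the rows are split across, which is what lets an engine of this kind add nodes to cut response time without changing query results.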
[Diagram: how the Advanced Database Service interfaces with other components of Pivotal HD]
Given the recent proliferation of SQL-Hadoop interfaces throughout the industry, customers and analysts should expect more data about their comparative efficiency to emerge as more SQL-trained analysts start using SQL to operate on data stored in HDFS.
Hadoop Virtualization Extensions
Hadoop Virtualization Extensions (HVE) enable the provisioning of Hadoop clusters on VMware virtualized platforms in both public cloud and on-premises environments. HVE gives customers increased deployment flexibility and enables the construction of high-availability infrastructures for accessing Hadoop data.
Pluggable HDFS Storage
Customers can expand their data storage options by using standard Hadoop direct-attached storage in addition to EMC Isilon OneFS Scale-Out NAS Storage, the latter of which features streamlined loading, backup, replication, snapshotting and elastic scalability.
Overall, EMC’s launch into the Hadoop-distribution world represents a significant move to grab Hadoop market share from Cloudera, Hortonworks and MapR. Unlike Intel’s recently launched distribution, EMC’s Pivotal HD claims some proprietary and genuinely innovative Hadoop technology in the form of its Advanced Database Services engine and scale-out storage compatibility. Expect EMC to continue to innovate upon its core technology platform and to follow the lead of companies such as Concurrent in developing tools that make Hadoop more accessible to Java-based developers in addition to SQL users. What remains unclear, at this point, is the extent to which EMC will open-source its technology as it gains market share within the enterprise. For now, however, the Hadoop world has yet another significant player with cash reserves aplenty to continue to innovate on its platform and disrupt the Hadoop landscape in the process.