Hadoop Now Runs On Windows Courtesy of Hortonworks Data Platform 2.0

On Tuesday, Hortonworks announced the general availability of version 2.0 of the Hortonworks Data Platform for Windows. Hortonworks Data Platform 2.0 for Windows is the first distribution of Apache Hadoop 2.0 certified for Windows Server 2008 R2 and Windows Server 2012. The announcement means that YARN (Yet Another Resource Negotiator), a key feature of Hadoop 2.0, is now available to Windows-based development environments. With HDP 2.0, developers in Windows shops can take advantage of YARN’s transformation of Hadoop from an infrastructure for batch processing into one for both batch and real-time data processing. Moreover, HDP 2.0 features NameNode High Availability functionality, which automates failovers and ensures the availability of the full HDP stack. Hortonworks collaborated closely with Microsoft to ensure that the HDP 2.0 release achieved production-grade status within Windows environments. The release of HDP 2.0 marks yet another milestone in the democratization of Apache Hadoop, the Big Data platform that is becoming available to wider circles of users through initiatives such as Stinger (Hortonworks), Lingual (Concurrent) and Impala (Cloudera), which allow users to access and manipulate data stored in a Hadoop cluster using SQL.

Amazon Web Services Supports Impala To Facilitate Real Time, High Performance Hadoop Queries

Amazon Web Services (AWS) recently announced support for Impala, the open source technology platform developed by Cloudera for querying data in the Hadoop Distributed File System or HBase using SQL-like syntax as elaborated below:

Impala raises the bar for query performance while retaining a familiar user experience. With Impala, you can query data, whether stored in HDFS or Apache HBase – including SELECT, JOIN, and aggregate functions – in real time. Furthermore, it uses the same metadata, SQL syntax (Hive SQL), ODBC driver and user interface (Hue Beeswax) as Apache Hive, providing a familiar and unified platform for batch-oriented or real-time queries. (For that reason, Hive users can utilize Impala with little setup overhead.)

Amazon Web Services introduced Impala as part of Amazon Elastic MapReduce. Users will need to run clusters that use Hadoop 2.x in order to take advantage of the offering. Impala users can run queries on data sets in real time with low latency, thanks to the platform’s distributed query engine, which gives Impala speed and performance benefits over Apache Hive. The availability of Impala on the Amazon Web Services platform comes just weeks after the release of Amazon Kinesis, AWS’s platform for collecting and storing real-time big data streams, and underscores the seriousness with which AWS plans to deploy products designed for the big data space.
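To make the kind of query described above concrete, here is a minimal sketch of a statement that combines SELECT, JOIN and aggregate functions. It runs against Python's built-in sqlite3 module purely as a local stand-in so that the example is self-contained; it is not Impala, and the table names, column names and data are hypothetical. The assumption is that a similarly shaped statement could be submitted to Impala through one of the interfaces mentioned in the quote, such as the ODBC driver.

```python
import sqlite3

# sqlite3 is only a local stand-in so the query can run anywhere;
# it is NOT Impala. All table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'east'), (2, 'west'), (3, 'east');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 2, 25.0), (3, 3, 5.0), (4, 2, 15.0);
""")

# A SELECT with a JOIN and two aggregate functions, grouped by region.
QUERY = """
    SELECT c.region, COUNT(*) AS order_count, SUM(o.amount) AS total_amount
    FROM orders o
    JOIN customers c ON o.customer_id = c.id
    GROUP BY c.region
    ORDER BY c.region
"""

rows = conn.execute(QUERY).fetchall()
for region, order_count, total_amount in rows:
    print(region, order_count, total_amount)
```

The point of the sketch is the shape of the SQL rather than the engine: analysts who already write statements like this against Hive would, per the quote above, not need to learn a different syntax.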

Three Key Features Of EMC’s Hadoop Distribution, Pivotal HD

This week, EMC launched its own distribution of Hadoop under the branding Pivotal HD. Built on technology that EMC obtained through the acquisition of Greenplum in July 2010, Pivotal HD represents EMC’s next iteration on the Greenplum Unified Analytics Platform (UAP) that it launched in December 2011. The Greenplum UAP featured EMC Greenplum HD, an enterprise-grade distribution of Hadoop, alongside Greenplum’s database for structured data. Greenplum UAP also included Greenplum Chorus, a platform for collaboration amongst data scientists in an organization leveraging Big Data. Pivotal HD, however, marks a significant new chapter in EMC’s Hadoop technology, as indicated by its array of features and architectural complexity.

Like many recent Hadoop distributions and technologies, Pivotal HD integrates with SQL so that developers and business analysts who lack familiarity with MapReduce can use it. But the real innovation of Pivotal HD runs deeper than its integration of SQL with Hadoop: it positions Greenplum’s analytic engine alongside HDFS in ways that improve Hadoop query performance beyond what simply appending a SQL interface would achieve. Pivotal HD’s Advanced Database Services (HAWQ) delivers a high-performance SQL engine that, according to EMC, offers greater SQL functionality and performance than analogous SQL interfaces such as Hive, Hadapt and Impala. Coupled with Pivotal HD’s virtualization and pluggable storage compatibility features, the platform represents a distinct moment of innovation in the Hadoop space, as evinced by the following three features:

Advanced Database Services (HAWQ)
Pivotal HD’s Advanced Database Services (HAWQ) functionality brings Greenplum’s Massively Parallel Processing (MPP) functionality to Hadoop. As a result, HAWQ allows Pivotal HD users to perform complex joins, MADlib in-database analytics and transactions. Moreover, users can leverage virtually any BI tool on the market to obtain advanced reporting and visualization of data as required. HAWQ-based SQL queries outperform Hive in terms of response time by as much as 100x, according to EMC benchmarking data.

The Advanced Database Service interfaces with other components of Pivotal HD as follows:

[Figure: EMC Pivotal HD]

Given the recent proliferation of SQL-Hadoop interfaces throughout the industry, customers and analysts should expect more data about the comparative efficiencies of SQL-Hadoop interfaces to emerge as more and more SQL-trained analysts start using SQL to operate on data saved in HDFS.

Hadoop Virtualization Extensions
Hadoop Virtualization Extensions (HVE) enable the provisioning of Hadoop clusters on VMware virtualized platforms in both public cloud and on-premises environments. HVE gives customers increased flexibility of deployment and enables the construction of high-availability infrastructures for accessing Hadoop data.

Pluggable HDFS Storage
Customers can expand their data storage options by using standard Hadoop direct-attached storage in addition to EMC Isilon OneFS Scale-Out NAS Storage, the latter of which features streamlined loading, backup, replication, snapshotting and elastic scalability.

Overall, EMC’s launch into the Hadoop-distribution world represents a significant move to grab Hadoop market share from Cloudera, Hortonworks and MapR. Unlike Intel’s recently launched distribution, EMC’s Pivotal HD claims some proprietary and genuinely innovative Hadoop technology in the form of its Advanced Database Services engine and scale-out storage compatibility. Expect EMC to continue to innovate upon its core technology platform and to follow the lead of companies such as Concurrent in developing tools that make Hadoop more accessible to Java developers as well as SQL users. What remains unclear, at this point, is the extent to which EMC will open-source its technology as it gains market share within the enterprise. For now, however, the Hadoop world has yet another significant player with cash reserves aplenty to continue to innovate on its platform and disrupt the Hadoop landscape in the process.