MapR has declined the invitation to participate in the Open Data Platform (ODP) after careful consideration, as noted in a recent blog post by John Schroeder, the company’s CEO and co-founder. Schroeder claims that the Open Data Platform is redundant with the governance provided by the Apache Software Foundation, that it purports to “solve” Hadoop-related problems that do not require solving and that it fails to accurately define the core of the Open Data Platform as it relates to Hadoop. With respect to software governance, Schroeder notes that the Apache Software Foundation has done well to steward the development of Apache Hadoop as elaborated below:
The Apache Software Foundation has done a wonderful job governing Hadoop, resulting in the Hadoop standard in which applications are interoperable among Hadoop distributions. Apache governance is based on a meritocracy that doesn’t require payment to participate or for voting rights. The Apache community is vibrant and has resulted in Hadoop becoming ubiquitous in the market in only a few short years.
Here, Schroeder credits the Apache Software Foundation with creating a Hadoop ecosystem in which Hadoop-based applications interoperate with one another and wherein the governance structure is based on a meritocracy that does not mandate monetary contributions in order to garner voting rights. In addition, the blog post observes that whereas the Open Data Platform defines the core of Apache Hadoop as MapReduce, YARN, Ambari and HDFS, other frameworks such as “Spark and Mesos, are gaining market share” and stand to complicate ODP’s definition of the core of Hadoop. Meanwhile, Cloudera’s Chief Strategy Officer Mike Olson explained why Cloudera also declined to join the Open Data Platform by noting that Hadoop “won because it’s open source” and that the partnership between Pivotal and Hortonworks was “antithetical to the open source model and the Apache way.” Given that 75% of Hadoop implementations use either MapR or Cloudera, ODP looks set to face some serious challenges despite support from IBM, Pivotal and Hortonworks. The precise impact of the schism over the Open Data Platform on the Hadoop community, however, remains to be seen.
On March 9, the Apache Software Foundation announced the availability of Apache Tajo version 0.10.0. Less well known than its counterpart Apache Hive, Apache Tajo supports ETL on big data as well as SQL-compliant querying that delivers scalable, low-latency results. Version 0.10.0 features enhanced Amazon S3 support and an improved JDBC driver that renders Tajo compatible with most major BI platforms. Hyunsik Choi, Vice President of Apache Tajo, remarked on Apache Tajo’s progress as follows:
Tajo has evolved over the last couple of years into a mature ‘SQL-on-Hadoop’ engine. The improved JDBC driver in this release allows users to easily access Tajo as if users use traditional RDBMSs. We have verified new JDBC driver on many commercial BI solutions and various SQL tools. It was easy and works successfully.
As Choi notes, Tajo attempts to bring the simplicity and standardization of SQL and RDBMS infrastructures to the power of Hadoop’s distributed processing and scalability. Designed with a focus on fault tolerance, scalability, high throughput and query optimization, Tajo aims to deliver low latency on a storage-agnostic platform. As of this version, that platform notably includes HBase storage integration, which allows users to access HBase via Tajo. Tajo plays in an increasingly crowded SQL-on-Hadoop space featuring the likes of Hive, Cloudera’s Impala, Pivotal HAWQ and Stinger, although it claims some early adoption, particularly in South Korea, the country of its origin, at organizations such as Korea University, Melon, NASA JPL Radio Astronomy and Airborne Snow Observatory projects, and SK Telecom. The key question for Apache Tajo now is whether its new release will usher in greater traction outside of South Korea, particularly given its enhanced integration with Amazon S3 and Amazon’s Elastic MapReduce (EMR) platform.
Pivotal recently announced the open sourcing of key components of its Pivotal Big Data Suite. Parts of the Pivotal Big Data Suite that will be open sourced include the MPP Pivotal Greenplum Database, Pivotal HAWQ and Pivotal GemFire, the NoSQL in-memory database. Pivotal’s decision to open source the core of its Big Data suite builds upon its success monetizing the Cloud Foundry platform and is intended to accelerate the development of analytics applications that leverage big data, including real-time streaming data sets. The open sourcing of Greenplum, Pivotal’s SQL-on-Hadoop platform HAWQ and GemFire renders Pivotal’s principal analytics and database platforms more readily accessible to the developer community and encourages enterprises to experiment with Pivotal’s solutions. Sundeep Madra, VP of the Data Product Group, Pivotal, remarked on Pivotal’s decision to open source its Big Data suite as follows:
Pivotal Big Data Suite is a major milestone in the path to making big data truly accessible to the enterprise. By sharing Pivotal HD, HAWQ, Greenplum Database and GemFire capabilities with the open source community, we are contributing to the market as a whole the necessary components to build solutions that make up a next generation data infrastructure. Releasing these technologies as open source projects will only help accelerate adoption and innovation for our customers.
Pivotal’s announcement of the open sourcing of its Big Data suite comes in tandem with a strategic alliance with Hortonworks aimed at combining the competencies of both companies to deliver best-in-class Hadoop capabilities for the enterprise. The partnership includes product roadmap alignment, integration and the implementation of a unified vision for using Apache Hadoop to derive actionable business intelligence at a scale rarely achieved within the contemporary enterprise. In conjunction with the collaboration with Hortonworks, Pivotal revealed its participation in the Open Data Platform, an organization dedicated to promoting the use of Big Data technologies centered around Apache Hadoop, whose Platinum members include GE, Hortonworks, IBM, Infosys, Pivotal and SAS. The Open Data Platform intends to ensure that components of the Hadoop ecosystem such as Apache Storm, Apache Spark and Hadoop-analytics applications integrate with and optimally support one another.
All told, Pivotal’s decision to open source its Big Data suite represents a huge coup for the Big Data analytics community at large insofar as organizations now have access to some of the most sophisticated Hadoop-analytics tools in the industry at no charge. More striking, however, is the significance of Pivotal’s alignment with Hortonworks, which stands to tilt the balance of the struggle for Hadoop market share toward Hortonworks and away from competitors Cloudera and MapR, at least for the time being. Thus far, Cloudera has enjoyed notable traction in the financial services sector and within the enterprise more generally, but the enriched analytics available to the Hortonworks Data Platform by means of the partnership with Pivotal promise to render Hortonworks a more attractive solution, particularly for analytics-intensive use cases and scenarios. Regardless, Pivotal’s strategic evolution, as represented by its open source move, its collaboration with Hortonworks and its leadership position in the Open Data Platform, constitutes a seismic moment in Big Data history in which Pivotal, the world’s most sophisticated big data analytics firm, unites with Hortonworks, the company responsible for the first publicly traded Hadoop distribution. The obvious question now is how Cloudera and MapR will respond to the Open Data Platform and the extent to which Pivotal’s partnership with Hadoop distributions remains exclusive to, or focused around, Hortonworks in the near future.
Cloudera and Cask recently announced a strategic collaboration marked by a commitment to integrate the product roadmaps of both companies into a unified vision based around the goal of empowering developers to more easily build and deploy applications using Hadoop. As part of the collaboration, Cloudera made an equity investment in Cask, the company formerly known as Continuity. Cask’s flagship product is the Cask Data Application Platform (CDAP), an application platform that streamlines and simplifies Hadoop-based application development and delivers operational tools for integrating application components and performing runtime services. The integration of Cask’s open source Cask Data Application Platform with Cloudera’s open source Hadoop distribution represents a huge coup for Cask insofar as its technology stands to become tightly integrated with one of the most popular Hadoop distributions in the industry and, correspondingly, to make Cask a candidate for potential acquisition by Cloudera as its product develops further. Cloudera, on the other hand, stands to gain from Cask’s progress in building a platform for facilitating Big Data application development that runs natively within a Hadoop infrastructure. By aligning its product roadmap with Cask, Cloudera adds yet another feather to its cap vis-à-vis tools and platforms within its ecosystem that enhance and accelerate the experience of Hadoop adoption. Overall, the partnership strengthens Cloudera’s case for going public by illustrating the astuteness and breadth of its vision when it comes to strategic partners and collaborators such as Cask, not to mention the business and technological benefits of the partnership. Expect Cloudera to continue aggressively building out its partner ecosystem as it hurtles toward an IPO that it may well be already preparing, at least as reported in VentureBeat.
MapR recently announced that MediaHub Australia has deployed MapR to support its digital archive that serves 170+ broadcasters in Australia. MediaHub delivers digital content for broadcasters throughout Australia in conjunction with its strategic partner Contexti. Broadcasters provide MediaHub with segments of programs, live feeds and a schedule that outlines when the program in question should be delivered to its audiences. In addition to scheduled broadcasts, MediaHub offers streaming and video on demand services for a variety of devices. MediaHub’s digital archive automates the delivery of playout services for broadcasters and consequently minimizes the need for manual intervention from archival specialists. The MapR deployment currently manages over 1 petabyte of content for the 170+ channels that MediaHub serves, although the size of the digital archive is expected to grow dramatically within the next two years. MapR’s Hadoop-based storage platform also provides an infrastructure that enables analytics on content consumption, which help broadcasters make data-driven decisions about what content to air in the future and how to most effectively complement existing content. MediaHub’s usage of MapR illustrates a prominent use case for MapR, namely, the use of Hadoop for storing, delivering and running analytics on digital media. According to Simon Scott, Head of Technology at MediaHub, one of the key reasons MediaHub selected MapR as the big data platform for its digital archive was its ability to support commodity hardware.
On December 17, Teradata announced the finalization of the acquisition of RainStor, a big data archiving company that specializes in archival solutions for Hadoop. The acquisition of RainStor gives Teradata ownership of RainStor’s technology for compressing and freezing Hadoop datastores for archival purposes. RainStor’s archival technology empowers companies to compress and store Hadoop data as Hadoop-based datasets proliferate throughout the enterprise in conjunction with the larger transition to data-driven operational and strategic analytics. The acquisition represents Teradata’s fourth major acquisition this year, following the purchases of Revelytix, Hadapt and Think Big Analytics. Terms of the acquisition were not disclosed, although most RainStor employees will remain in their pre-acquisition locations in San Francisco and Gloucester. The acquisition strengthens Teradata’s Hadoop solutions by augmenting its ability to provide customers with enterprise-wide data archival capabilities.
As reported by Arik Hesseldahl in Recode, Apache Hadoop vendor Hortonworks has filed for an IPO. The decision by Hortonworks to offer public shares represents the first IPO from a major Hadoop vendor. In fiscal year 2013, Hortonworks reported a loss of $36.6M relative to $11M in revenue. Meanwhile, for the first nine months of 2014, Hortonworks increased its revenue to $33.3M but posted a loss of $86.7M. The decision by Hortonworks to go public comes after two major capital raises in 2014. In July, HP invested $50M in Hortonworks, following the $100M raised by Hortonworks in March. Given the gargantuan capital raises by Hortonworks competitors Cloudera and MapR as well, observers of the Big Data landscape should also expect IPOs from Cloudera and MapR in the near future. Meanwhile, more detailed analysis regarding the prospects of Hortonworks executing a successful IPO will emerge in coming weeks in anticipation of the launch of the IPO in either late 2014 or early 2015. In 2011, Hortonworks was spun out of Yahoo, its principal investor. Hortonworks plans to raise up to $100M by means of its IPO.