Digital music service Spotify recently announced that it will migrate its Hadoop cluster from Cloudera’s Hadoop distribution to the Hortonworks Data Platform because of Hortonworks’s commitment to open source development and technologies. Spotify also noted that the migration was driven in part by Hortonworks’s substantial contributions to the Apache Hive project for querying Hadoop data. Spotify began its use of Hadoop on the Amazon Web Services Elastic MapReduce (EMR) platform with a cluster of approximately 30 nodes. The company subsequently decided to bring its Hadoop cluster in house, starting with a 60-node cluster. Spotify’s Hadoop cluster now spans 690 nodes and stores data for its 24 million users and 6 million subscribers, and is widely regarded as one of the largest implementations of Hadoop in Europe. In addition to providing Spotify with a production-grade Hadoop distribution, Hortonworks will perform twice-yearly health assessments of Spotify’s Hadoop infrastructure.
Cloudera Announces General Availability of Cloudera Search, an Interactive, Free-Text Search Engine for Hadoop
This week, Cloudera announced the general availability of Cloudera Search, an interactive search engine that enables users to perform free-text searches on data stored within the Hadoop Distributed File System (HDFS) and Apache HBase without advanced scripting experience or training. Powered by the open source search engine Apache Solr, Cloudera Search integrates with Apache ZooKeeper to manage distributed processing, index sharding and high availability. The general availability announcement follows a three-month public beta that began on June 4, preceded by several months of private beta testing. The platform represents part of Cloudera’s larger project of democratizing access to Big Data and will sit alongside Cloudera Impala, Cloudera’s SQL interface for querying Hadoop clusters. A schematic of Cloudera’s library of tools and platforms for processing Hadoop-based data is given below:
Cloudera Search builds indexes of Hadoop data with scalability comparable to MapReduce and stores the resulting indexes in HDFS. It also supports real-time indexing of newly ingested data through an integration with Apache Flume. The platform enables “linearly scalable batch indexing for large data stores within Hadoop on-demand,” and its GoLive functionality accommodates “incremental index changes.” Moreover, Cloudera Search is available by way of Hue, Cloudera’s open source user interface for querying Apache Hadoop data.
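Because Cloudera Search is powered by Solr, searches ultimately go through Solr’s standard HTTP “select” API. As a rough illustration — the host, port and collection name below are hypothetical, not drawn from Cloudera’s documentation — a free-text query request might be assembled like this:

```python
from urllib.parse import urlencode

def solr_select_url(base_url, collection, query, rows=10):
    """Build a Solr 'select' request URL for a free-text query.

    Solr answers searches over a plain HTTP API, which is what lets a
    platform like Cloudera Search expose Hadoop data to users without
    scripting. The host and collection names used here are hypothetical.
    """
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base_url}/solr/{collection}/select?{params}"

# A free-text query against a hypothetical 'logs' collection.
url = solr_select_url("http://search-node:8983", "logs", "error AND timeout")
```

Issuing an HTTP GET on such a URL returns matching documents as JSON, which is the kind of web-search-like interaction the platform aims to offer non-technical users.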
With this week’s general availability announcement, Cloudera Search is fully available across Cloudera’s product line and is supported on CDH 4.3. Overall, the GA of Cloudera Search illustrates the intensity of the battle to bring Hadoop to non-technical enterprise users by means of an interactive search platform whose ease of use parallels web search engines such as Google and Bing. MapR, for example, announced an integration of its Hadoop platform with LucidWorks Search for Big Data in February. The industry should expect interactive search platforms for Hadoop to proliferate and grow more sophisticated as Hadoop adoption accelerates across the enterprise.
If 2011 was the year of Cloud Computing, then 2012 will surely be the year of Big Data. Big Data has yet to arrive in the way cloud computing has, but the framework for its widespread deployment as a commodity emerged with style and unmistakable promise. For the first time, Hadoop and NoSQL gained currency not only within the developer community, but also amongst bloggers and analysts. More importantly, Big Data acquired a certain status and meaning in the technology community, even though few people asked what “big” actually means in a landscape where that threshold is constantly being redrawn. Even as yesterday’s “big” morphs into today’s “small” while consumer personal storage transitions from gigabytes to terabytes, “Big Data” emerged as a term that everyone almost instantly understood. It was as if consumers and enterprises alike had been searching for years for a long-lost term to describe the explosion of data evinced by web searches, web content, Facebook and Twitter feeds, photographs, log files and miscellaneous structured and unstructured content. Having lacked the vocabulary to name the data explosion, the world suddenly embraced the term Big Data with passion.
Below are some of the highlights of 2011 with respect to Big Data:
•Teradata finalized a deal to acquire Big Data player Aster Data Systems for $263 million.
•Yahoo revealed plans to create Hortonworks, a spin-off dedicated to the commercialization of Apache Hadoop.
•Teradata announced the Teradata Aster MapReduce Platform that combines SQL with MapReduce. The Teradata Aster MapReduce Platform empowers business analysts who know SQL to leverage the power of MapReduce without having to write scripted queries in Java, Python, Perl or C.
•Oracle announced plans to launch a Big Data appliance featuring Apache Hadoop, Oracle NoSQL Database Enterprise Edition and an open source distribution of R. The company’s announcement of its plans to leverage a NoSQL database represented an abrupt about-face from an earlier Oracle position that discredited the significance of NoSQL.
•Microsoft revealed plans for a Big Data appliance featuring Hadoop for Windows Server and Azure, along with Hadoop connectors for SQL Server and SQL Parallel Data Warehouse. The company also announced a strategic partnership with Yahoo spinoff Hortonworks to integrate Hadoop with Windows Server and Windows Azure. Microsoft’s decision to use a Windows-based version of Hadoop for SQL Server 2012 rather than NoSQL constituted the key difference between the Microsoft and Oracle Big Data platforms.
•IBM announced the release of its InfoSphere BigInsights application for analyzing Big Data. The SmartCloud release of IBM’s BigInsights application means that IBM beat competitors Oracle and Microsoft in the race to deploy an enterprise-grade, cloud-based Big Data analytics platform.
•Christophe Bisciglia, founder of Cloudera, the commercial distributor of Apache Hadoop, launched a startup called Odiago that features a Big Data product named WibiData. WibiData manages investigative and operational analytics on “consumer internet data” such as website traffic on traditional and mobile computing devices.
•Cloudera announced a partnership with NetApp, the storage and data management vendor. The partnership revealed the release of the NetApp Open Solution for Hadoop, a preconfigured Hadoop cluster that combines Cloudera’s Apache Hadoop (CDH) and Cloudera Enterprise with NetApp’s RAID architecture.
•Big Data player Karmasphere announced plans to join the Hortonworks Technology Partner Program. The partnership enables Karmasphere to offer its Big Data intelligence product Karmasphere Analytics on the Apache Hadoop software infrastructure that undergirds the Hortonworks Data Platform.
•Informatica released the world’s first Hadoop parser. Informatica HParser operates on virtually all versions of Apache Hadoop and specializes in transforming unstructured data into a structured format within a Hadoop installation.
•MarkLogic announced support for Hadoop, the Apache open source software framework for analyzing Big Data, with the release of MarkLogic 5.
•HP provided details of Autonomy IDOL (Integrated Data Operating Layer) 10, a Next Generation Information Platform that integrates two of its 2011 acquisitions, Vertica and Autonomy. Autonomy IDOL 10 features Autonomy’s capabilities for processing unstructured data, Vertica’s ability to rapidly process large-scale structured data sets, a NoSQL interface for loading and analyzing structured and unstructured data and solutions dedicated to the Data, Social Media, Risk Management, Cloud and Mobility verticals.
•EMC announced the release of its Greenplum Unified Analytics Platform (UAP). The EMC Greenplum UAP combines the EMC Greenplum platform for the analysis of structured data, enterprise-grade Hadoop for analyzing structured and unstructured data, and EMC Greenplum Chorus, a collaboration and productivity software tool that enables social networking amongst constituents in an organization who are leveraging Big Data.
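Several of the platforms listed above — Teradata Aster, Oracle’s appliance, Microsoft’s Hadoop connectors — build on the MapReduce model at Hadoop’s core: a map phase emits key–value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. A minimal in-memory sketch using the classic word-count example (real Hadoop distributes each phase across a cluster, which this toy version does not attempt):

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to each input record, emitting (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group emitted values by key, as Hadoop's shuffle/sort phase does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Aggregate each key's values into a single result."""
    return {key: reducer(values) for key, values in groups.items()}

# Classic word count over two tiny "documents".
lines = ["big data", "big hadoop"]
mapper = lambda line: ((word, 1) for word in line.split())
counts = reduce_phase(shuffle(map_phase(lines, mapper)), sum)
```

Products such as the Teradata Aster MapReduce Platform let analysts express the mapper and reducer through SQL rather than hand-written code, which is precisely the accessibility point the list above raises.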
The widespread adoption of Hadoop punctuated the Big Data story of the year. Hadoop featured in almost every Big Data story of the year, from Oracle to Microsoft to HP and EMC, while NoSQL came in a close second. Going into 2012, one of the key questions for the Big Data space concerns the ability of OpenStack to support Hadoop, NoSQL, MapReduce and other Big Data technologies. The other key question hinges on the user-friendliness of Big Data applications for business analysts in addition to programmers. EMC’s Greenplum Chorus, for example, democratizes access to its platform via a user interface that promotes collaboration amongst multiple constituents in an organization by transforming questions into structured queries. Similarly, the Teradata Aster MapReduce Platform allows business analysts to make use of its MapReduce technology by using SQL. That said, as Hadoop becomes increasingly mainstream, the tech startup and data-intensive spaces are likely to witness a greater number of data analysts trained in Apache Hadoop, in conjunction with efforts by vendors to render Hadoop more accessible to programmers and non-programmers alike.
Big Data player Karmasphere announced plans to join the Hortonworks Technology Partner Program today. Karmasphere’s partnership with Hortonworks is set to further stoke the embers of the emerging battle between Cloudera and Hortonworks for control of market share in the Hadoop distribution space. The partnership enables Karmasphere to offer its Big Data intelligence product Karmasphere Analytics on the Apache Hadoop software infrastructure that undergirds the Hortonworks Data Platform. As a result of the Hortonworks collaboration, Karmasphere will receive technical support, training and certification on Apache Hadoop deployment. Karmasphere’s partnership with Hortonworks represents its second major business partnership announcement this month. On November 1, the company announced a relationship with Amazon Web Services whereby its Big Data analytics would be available through the Amazon Elastic MapReduce service. The Amazon partnership allows enterprises to investigate a pay-as-you-go solution for Big Data intelligence without the commitment of a long-term subscription or an initial capital investment in technical infrastructure.
On Monday, Cloudera continued its aggressive efforts to expand its distribution channels by announcing a partnership with NetApp, the storage and data management vendor. The partnership revealed the release of the NetApp Open Solution for Hadoop, a preconfigured Hadoop cluster that combines Cloudera’s Apache Hadoop (CDH) and Cloudera Enterprise with NetApp’s RAID architecture. NetApp Open Solution for Hadoop is intended to grant enterprises seeking to implement Hadoop enhanced ease of deployment, improved scalability, superior performance and reduced costs. Part of the reduced enterprise costs enabled by the NetApp Open Solution for Hadoop derives from NetApp’s state-of-the-art backup and replication capabilities that minimize downtime in the event of disk failure.
Speaking of the NetApp Open Solution, Rich Clifton, senior vice president and general manager, NetApp Technology Enablement and Solutions Organization, remarked:
“Customers are looking for business advantages from the wealth of their unstructured data. Today, it’s like finding a needle in a haystack. NetApp Open Solution for Hadoop will help customers get answers fast and process more, as well as provide the reliability and performance that our customers have come to expect from NetApp.”
GigaOm reports that one of the key attributes of the NetApp Open Solution is its separation of the compute and storage layers of a Cloudera Hadoop installation. Separating the compute and storage layers enables enhanced performance, scalability and reduced downtime in the event of the failure of a disk within either the compute or storage layer. Cloudera’s partnership with NetApp comes roughly three weeks after its announcement of a reseller deal with SGI, whereby SGI will distribute Cloudera’s Apache Hadoop (CDH) alongside its rackable servers and provide level 1 technical support, while Cloudera will provide level 2 and level 3 technical support. Both deals seek to consolidate Cloudera’s market share within the Hadoop distribution space in relation to competitors such as MapR, its partner EMC and Yahoo-spinoff Hortonworks.
In conjunction with the NetApp deal, Cloudera announced that it had successfully completed a $40 million Series D funding round, led by Frank Artale of Ignition Partners with the support of existing investors Accel Partners, Greylock Partners, Meritech Capital Partners and In-Q-Tel. The latest round takes Cloudera’s total financing to $76 million and supports the company’s explosive growth in the Big Data space, with particular emphasis on marketing, sales operations and strategic business development. Cloudera brands itself as the first company to deliver enterprise-grade deployments of Apache Hadoop, the disruptive technology framework for analyzing massive amounts of structured and unstructured data. Apache Hadoop is used by enterprises such as eBay, Yahoo, Facebook, LinkedIn, eHarmony and Twitter to make strategic business decisions on the basis of large-scale structured and unstructured data sets. Cloudera’s announcement of its Series D funding and partnership with NetApp came on the eve of Hadoop World 2011, the world’s largest conference of Hadoop practitioners.
This week, Christophe Bisciglia, founder of Cloudera, the commercial distributor of Apache Hadoop, launched a startup called Odiago that features a Big Data product named WibiData. Bisciglia launched WibiData with the backing of Google Chairman Eric Schmidt, Cloudera CEO Mike Olson, and SV Angel, the Silicon Valley-based angel fund. WibiData manages investigative and operational analytics on “consumer internet data” such as website traffic on traditional and mobile computing devices. WibiData leverages an HBase and Hadoop technology platform that features the following attributes: (1) all data specific to a single user/machine/mobile device is organized within one HBase row; (2) “Produce,” an analytic operator that functions on individual rows, maps data from individual rows into interactive user applications and performs analytic operations such as classification and weighting of rows in conjunction with an analytic rules engine; and (3) “Gather,” an analytic operator that operates across all rows combined.
WibiData’s “Produce” and “Gather” components operate within a single table database structure in which the schema can dynamically evolve over time. Whereas most relational databases hold a single value in a cell, WibiData’s non-relational database structure allows for an entire table to be stored within a cell. Moreover, WibiData features fewer data manipulation language capabilities for retrieving, updating, inserting and deleting data than SQL. Curt Monash provides a terrific technical overview of WibiData in his blog DBMS2. For more about the company’s founders, see TechCrunch.
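The row-per-user model with per-row and cross-row operators can be sketched conceptually. In this toy Python version, the table layout and the names `produce` and `gather` mirror the article’s description of WibiData, not the product’s actual API, and the user data is invented for illustration:

```python
# One dict entry per user: all of that user's data lives in one "row",
# mirroring WibiData's one-HBase-row-per-user layout described above.
table = {
    "user-1": {"page_views": [3, 5], "device": "mobile"},
    "user-2": {"page_views": [12], "device": "desktop"},
}

def produce(row):
    """Per-row operator: derive a value from a single user's row."""
    return sum(row["page_views"])

def gather(rows):
    """Cross-row operator: aggregate over all users' rows combined."""
    return sum(produce(row) for row in rows.values())

# Per-user results (the "Produce" style of computation) and a
# table-wide aggregate (the "Gather" style).
per_user = {uid: produce(row) for uid, row in table.items()}
total = gather(table)
```

Keeping everything about one user in one row is what makes the per-row operator cheap: it never needs a join or a scan of other users’ data.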
SGI and Cloudera today announced a reseller partnership whereby SGI will sell pre-configured Hadoop clusters of hardware and software in addition to technical support. Under the terms of the agreement, SGI will distribute Cloudera’s Apache Hadoop (CDH) alongside its rackable servers and provide level 1 technical support, while Cloudera will provide level 2 and level 3 technical support. SGI already claims a history of deploying Hadoop servers dating back to Hadoop’s earliest days and expects to leverage its existing relationships with customers in the government and financial sectors. SGI’s VP of Product Marketing, Bill Mannel, noted that “SGI has been successfully deploying Hadoop customer installations of up to 40,000 nodes and individual Hadoop clusters of up to 4,000 nodes for a number of years now.” 40,000 nodes per customer installation and 4,000 nodes per cluster represent the upper bound of Hadoop cluster size at Yahoo! and similar enterprise level installations. Mannel elaborated on SGI’s experience with large Hadoop installations by commenting: “This benchmark, our growing presence, and our role in the Hadoop ecosystem, reflect our ongoing commitment to pushing the bar on performance and driving relationships that benefit our customers. As they wrestle with bigger and more complex data challenges every day they can trust SGI to deliver complete Hadoop solutions based on years of experience.”
SGI’s distribution of Hadoop is expected to target customers that would like an enterprise-level installation without dedicating in-house talent to the deployment. Hadoop is a disruptive open source technology that provides a framework for managing massive volumes of structured and unstructured data. Hadoop provides the data infrastructure for Facebook, LinkedIn and Twitter and has gained attention in the wake of recent announcements by Oracle and Microsoft about entering the Big Data space by leveraging Hadoop technology.