This week, Christophe Bisciglia, founder of Cloudera, the commercial distributor of Apache Hadoop, launched a startup called Odiago that features a Big Data product named WibiData. Bisciglia launched WibiData with the backing of Google Chairman Eric Schmidt, Cloudera CEO Mike Olson, and SV Angel, the Silicon Valley-based angel fund. WibiData manages investigative and operational analytics on “consumer internet data” such as website traffic on traditional and mobile computing devices. WibiData leverages an HBase and Hadoop technology platform that features the following attributes: (1) all data specific to a single user, machine or mobile device is organized within one HBase row; (2) “Produce,” an analytic operator that functions on individual rows, maps data from individual rows into interactive user applications and performs analytic operations such as classifying and weighting rows in conjunction with an analytic rules engine; (3) “Gather,” an analytic operator that operates on all rows combined.
WibiData’s “Produce” and “Gather” components operate within a single-table database structure in which the schema can dynamically evolve over time. Whereas most relational databases hold a single value in a cell, WibiData’s non-relational database structure allows for an entire table to be stored within a cell. Moreover, WibiData offers fewer data manipulation language (DML) capabilities than SQL for retrieving, updating, inserting and deleting data. Curt Monash provides a terrific technical overview of WibiData on his blog, DBMS2. For more about the company’s founders, see TechCrunch.
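The row-per-user layout and the two operators described above can be illustrated with a minimal sketch. It is not WibiData’s actual API: the in-memory dictionary standing in for an HBase table, the function names produce and gather, and the classify_engagement rule are all assumptions made for illustration only.

```python
from collections import Counter
from typing import Callable, Dict, Hashable

# Hypothetical stand-in for an HBase table: one row per user, keyed by a
# user id. A cell may itself hold a nested table (here, a list of visits).
Row = Dict[str, object]
Table = Dict[str, Row]


def produce(table: Table, fn: Callable[[Row], Dict[str, object]]) -> None:
    """Run fn on each row independently and write its output back to that row."""
    for row in table.values():
        row.update(fn(row))


def gather(table: Table, extract: Callable[[Row], Hashable]) -> Counter:
    """Aggregate one extracted value per row across all rows combined."""
    counts: Counter = Counter()
    for row in table.values():
        counts[extract(row)] += 1
    return counts


def classify_engagement(row: Row) -> Dict[str, object]:
    """Illustrative per-row rule: label a user by their dominant device."""
    visits = row["visits"]  # nested table stored inside a single cell
    mobile = sum(1 for v in visits if v["device"] == "mobile")
    segment = "mobile-heavy" if mobile * 2 > len(visits) else "desktop-heavy"
    return {"segment": segment}


table: Table = {
    "user-1": {"visits": [
        {"device": "mobile", "page": "/home"},
        {"device": "mobile", "page": "/cart"},
        {"device": "desktop", "page": "/home"},
    ]},
    "user-2": {"visits": [{"device": "desktop", "page": "/home"}]},
}

produce(table, classify_engagement)                      # per-row step
segment_counts = gather(table, lambda r: r["segment"])   # all-rows step
```

Because the per-row function only ever sees one row, adding a new derived column such as segment does not require changing any other row, which is one way to picture a schema that evolves over time.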
Informatica released the world’s first Hadoop parser on Wednesday in a move that boldly signalled its entry into the hotly contested Big Data analytics space. Informatica HParser operates on virtually all versions of Apache Hadoop and specializes in transforming unstructured data into a structured format within a Hadoop installation. HParser enables the transformation of textual data, Facebook and Twitter feeds, web logs, emails, log files and digital interactive media into a structured or semi-structured schema that allows businesses to more effectively mine the data for actionable business intelligence purposes.
Key features of HParser include the following:
• A visual, integrated development environment (IDE) that streamlines development via a graphical interface.
• Support for a wide range of data formats including XML, JSON, HL7, HIPAA, ASN.1 and market data.
• Ability to parse proprietary machine-generated log files.
• Use of the parallelism of MapReduce to optimize parsing performance across massive structured and unstructured data sets.
Informatica’s HParser is available in both a free and a commercial edition. The free community edition can parse log files, Omniture Web analytics data, XML and JSON. The commercial edition additionally supports HL7, HIPAA, SWIFT, X12, NACHA, ASN.1, Bloomberg, PDF, XLS and Microsoft Word formats. Informatica’s HParser builds upon the company’s June 2011 deployment of Informatica 9.1 for Big Data, which featured “connectivity to big transaction data from traditional transaction databases, such as Oracle and IBM DB2, to the latest optimized for purpose analytic databases, such as EMC Greenplum, Teradata, Teradata Aster Data, HP Vertica and IBM Netezza,” in addition to Hadoop.
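To make the idea of turning unstructured log text into structured records concrete, here is a minimal sketch of a map-style parsing step. It does not use or represent HParser; the log pattern, the parse_line and map_phase names, and the sample lines are assumptions chosen for illustration.

```python
import re
from typing import Dict, Iterable, Iterator, Optional

# Assumed input: web server lines in a Common Log Format-like layout.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line: str) -> Optional[Dict[str, object]]:
    """Turn one raw log line into a structured record, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record: Dict[str, object] = dict(match.groupdict())
    record["status"] = int(match.group("status"))
    size = match.group("size")
    record["size"] = 0 if size == "-" else int(size)
    return record


def map_phase(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Map step: parse each line independently and skip lines that fail to parse."""
    for line in lines:
        record = parse_line(line)
        if record is not None:
            yield record


sample_lines = [
    '192.0.2.1 - - [01/Nov/2011:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512',
    'not a log line',
    '192.0.2.7 - - [01/Nov/2011:10:00:05 +0000] "POST /login HTTP/1.1" 302 -',
]

records = list(map_phase(sample_lines))
```

Each line is parsed without reference to any other line, which is the property that lets a MapReduce framework split the input across many machines and run the same parsing function on every split in parallel.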
With the November 1 release of MarkLogic 5, MarkLogic consolidated its position in the Big Data space by announcing support for Hadoop, the Apache open source software framework for analyzing massive amounts of structured and unstructured data. For over a decade, MarkLogic has delivered analytics that enable actionable intelligence on data for organizations such as JPMorgan Chase, LexisNexis and the U.S. Army. MarkLogic 5 features a connector for Hadoop that integrates Hadoop’s capabilities for processing petabytes of data with MarkLogic’s proprietary applications for analyzing Big Data. In addition to a Hadoop connector, MarkLogic 5 includes enhanced capabilities to store, tag and analyze textual data and digital interactive media. The latest release of MarkLogic also features superior database replication capabilities and functionality for monitoring the performance of enterprise-level Big Data installations.
The release of MarkLogic 5 testifies to the explosion of commercial interest in non-relational databases for storing and mining unstructured data. Microsoft plans for its Big Data platform to integrate Hadoop with Windows Server and Windows Azure, with connectors to SQL Server 2012. Oracle, meanwhile, recently revealed the basic components of its Big Data Appliance, which features Hadoop in addition to its Oracle NoSQL database.
SGI and Cloudera today announced a reseller partnership whereby SGI will sell pre-configured Hadoop clusters of hardware and software together with technical support. Under the terms of the agreement, SGI will distribute Cloudera’s Distribution including Apache Hadoop (CDH) alongside its rackable servers and provide level 1 technical support, while Cloudera will provide level 2 and level 3 technical support. SGI already claims a history of deploying Hadoop servers dating back to Hadoop’s earliest days and expects to leverage its existing relationships with customers in the government and financial sectors. SGI’s VP of Product Marketing, Bill Mannel, noted that “SGI has been successfully deploying Hadoop customer installations of up to 40,000 nodes and individual Hadoop clusters of up to 4,000 nodes for a number of years now.” 40,000 nodes per customer installation and 4,000 nodes per cluster represent the upper bound of Hadoop cluster size at Yahoo! and similar enterprise-level installations. Mannel elaborated on SGI’s experience with large Hadoop installations by commenting: “This benchmark, our growing presence, and our role in the Hadoop ecosystem, reflect our ongoing commitment to pushing the bar on performance and driving relationships that benefit our customers. As they wrestle with bigger and more complex data challenges every day they can trust SGI to deliver complete Hadoop solutions based on years of experience.”
SGI’s distribution of Hadoop is expected to target customers that would like an enterprise-level installation without dedicating in-house talent to the deployment. Hadoop is a disruptive open source technology that provides a framework for managing massive volumes of structured and unstructured data. Hadoop provides the data infrastructure for Facebook, LinkedIn and Twitter and has gained attention in the wake of recent announcements by Oracle and Microsoft about entering the Big Data space by leveraging Hadoop technology.
The battle for market share in the big data space is officially underway. At last week’s Professional Association for SQL Server (PASS) Summit, Microsoft announced plans to develop a platform for big data processing and analytics based on Hadoop, the open source software framework that operates under an Apache license. Microsoft’s announcement comes roughly ten days after Oracle’s unveiling of its Big Data Appliance, which provides enterprise-level capabilities to process structured and unstructured data.
Key features of Oracle’s Big Data Appliance include the following:
–Oracle NoSQL Database Enterprise Edition
–Oracle Data Integrator Application Adapter for Hadoop
–Oracle Loader for Hadoop
–Open source distribution of R
–Oracle’s Exadata x86 clusters (Oracle Exadata Database Machine, Oracle Exalytics Business Intelligence Machine)
Oracle’s hardware supports the Oracle 11g R2 database alongside Oracle’s version of Red Hat Enterprise Linux and virtualization based on the Xen hypervisor. The company’s announcement of its plans to leverage a NoSQL database represented an abrupt about-face from an earlier Oracle position that discredited the significance of NoSQL. In May, Oracle published a whitepaper, “Debunking the NoSQL Hype,” that downplayed the enterprise-level capability of NoSQL deployments.
Microsoft’s forthcoming Big Data platform features the following:
–Hadoop for Windows Server and Azure
–Hadoop connectors for SQL Server and SQL Parallel Data Warehouse
–Hive ODBC drivers for users of Microsoft Business Intelligence applications
Microsoft revealed a strategic partnership with Yahoo spinoff Hortonworks to integrate Hadoop with Windows Server and Windows Azure. Microsoft’s decision not to leverage NoSQL and instead to use a Windows-based version of Hadoop for SQL Server 2012 constitutes the key difference between Microsoft’s and Oracle’s Big Data platforms. The entry of Microsoft and Oracle into the Big Data space suggests that the market is ready to explode as government agencies and private sector organizations increasingly find business value in unstructured data such as emails, log files, Twitter feeds and text-centered data. IBM and EMC hold the early market share lead, but competition is set to intensify, particularly given the recent affirmation handed to NoSQL by tech giant Oracle.