Google recently announced the development of Mesa, a data warehousing platform designed to collect data for its internet advertising business. Mesa is a distributed data warehouse that can manage petabytes of data while providing high availability, scalability and fault tolerance. Mesa is designed to update millions of rows per second, process billions of queries and retrieve trillions of rows per day to support Google’s gargantuan data needs for its flagship search and advertising business. Google elaborated on its business need for a new data warehousing platform by describing its evolving data management needs as follows:
Google runs an extensive advertising platform across multiple channels that serves billions of advertisements (or ads) every day to users all over the globe. Detailed information associated with each served ad, such as the targeting criteria, number of impressions and clicks, etc. are recorded and processed in real time…Advertisers gain fine-grained insights into their advertising campaign performance by interacting with a sophisticated front-end service that issues online and on-demand queries to the underlying data store…The scale and business critical nature of this data result in unique technical and operational challenges for processing, storing and querying.
Google’s advertising platform depends on real-time data that records advertising impressions and clicks within the larger context of analytics about current and potential advertising campaigns. As such, the platform requires atomic updates to advertising components that cascade throughout the entire data repository; consistency and correctness of data across datacenters and over time; support for continuous updates; low-latency query performance; scalability to petabytes of data; and data transformation functionality that accommodates changes to data schemas. Mesa utilizes Google products as follows:
Mesa leverages common Google infrastructure and services, such as Colossus, BigTable and MapReduce. To achieve storage scalability and availability, data is horizontally partitioned and replicated. Updates may be applied at granularity of a single table or across many tables. To achieve consistent and repeatable updates, the underlying data is multi-versioned. To achieve update scalability, data updates are batched, assigned a new version number and periodically incorporated into Mesa. To achieve update consistency across multiple data centers, Mesa uses a distributed synchronization protocol based on Paxos.
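The batching and multi-versioning described in the passage above can be illustrated with a toy example. The sketch below is not Mesa's implementation: the class name VersionedTable and its methods are hypothetical, it runs in a single process, and it leaves out partitioning, replication and the Paxos-based synchronization entirely. It only shows the idea that updates are staged in a batch, committed under a new version number, and queried consistently at any committed version.

```python
from collections import defaultdict


class VersionedTable:
    """Toy multi-versioned table (hypothetical, for illustration only)."""

    def __init__(self):
        self.version = 0                 # latest committed version number
        self.deltas = []                 # (version, {key: click_count}) in commit order
        self.pending = defaultdict(int)  # staged updates, invisible to readers

    def stage(self, key, clicks):
        # Updates accumulate in a pending batch until the next commit.
        self.pending[key] += clicks

    def commit(self):
        # The whole batch becomes visible at once under a single new version.
        if not self.pending:
            return self.version
        self.version += 1
        self.deltas.append((self.version, dict(self.pending)))
        self.pending.clear()
        return self.version

    def query(self, key, at_version=None):
        # Aggregate every delta up to the requested version, so a given
        # version always returns the same answer (repeatable reads).
        limit = self.version if at_version is None else at_version
        return sum(delta.get(key, 0) for ver, delta in self.deltas if ver <= limit)


table = VersionedTable()
table.stage("ad1", 3)
table.stage("ad2", 1)
table.commit()            # version 1
table.stage("ad1", 2)
table.commit()            # version 2
print(table.query("ad1"), table.query("ad1", at_version=1))
```

Because readers only ever see committed versions, a query pinned to version 1 keeps returning the same totals even after later batches arrive, which is the property the quoted passage calls consistent and repeatable updates.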
While Mesa takes advantage of technologies from Colossus, BigTable, MapReduce and Paxos, it delivers a degree of “atomicity” and consistency that its counterparts lack. In addition, Mesa features “a novel version management system that batches updates to achieve acceptable latencies and high throughput for updates.” All told, Mesa constitutes a disruptive innovation in the Big Data space that combines atomicity, consistency, high throughput, low latency and scalability to trillions of rows in pursuit of a “petascale data warehouse.” While speculation proliferates about the possibilities for Google to append Mesa to its Google Compute Engine offering or otherwise open-source it, the key point worth noting is that Mesa represents a qualitative shift with respect to the ability of a Big Data platform to process petabytes of data that change in real time. Whereas the cloud space is accustomed to seeing Amazon Web Services usher in one breathtaking innovation after another, Mesa underscores Google’s continuing leadership in the Big Data space. Expect to hear more details about Mesa at the Conference on Very Large Data Bases next month in Hangzhou, China.
Today, Concurrent, Inc. announces the release of Cascading 3.0, the latest version of the popular open source framework for developing and managing Big Data applications. Widely recognized as the de facto framework for the development of Big Data applications on platforms such as Apache Hadoop, Cascading simplifies application development by means of an abstraction framework that facilitates the execution and orchestration of jobs and processes. Compatible with all major Hadoop distributions, Cascading sits squarely at the heart of the Big Data revolution by streamlining the operationalization of Big Data applications in conjunction with Driven, a commercial product from Concurrent that provides visibility regarding application performance within a Hadoop cluster.
Today’s announcement extends Cascading to platforms and computational frameworks such as local in-memory, Apache MapReduce and Apache Tez. Going forward, Concurrent plans for Cascading 3.0 to ship with support for Apache Spark, Apache Storm and other computational frameworks by means of its customizable query planner, which allows customers to extend the operation of Cascading to compatible computational fabrics as illustrated below:
The breakthrough represented by today’s announcement is that it renders Cascading extensible to a variety of computational frameworks and data fabrics and thereby expands the range of use cases and environments in which Cascading can be optimally used. Moreover, the customizable query planner featured in today’s release allows customers to configure their Cascading deployment to operate in conjunction with emerging technologies and data fabrics that can now be integrated into a Cascading deployment by means of the functionality represented in Cascading 3.0.
Used by companies such as Twitter, eBay, FourSquare, Etsy and The Climate Corporation, Cascading boasts over 150,000 applications a month, more than 7,000 deployments and 10% month-over-month growth in downloads. The release of Cascading 3.0 builds on Concurrent’s recent partnership with Hortonworks, whereby Cascading will be integrated into the Hortonworks Data Platform and Hortonworks will certify and support the delivery of Cascading in conjunction with its Hadoop distribution. Concurrent, Inc. also recently revealed details of a strategic partnership with Databricks, the principal steward behind the Apache Spark project, that allows it to “operate over Spark…[the] next generation Big Data processing engine that supports batch, interactive and streaming workloads at scale.”
In an interview with Cloud Computing Today, Concurrent CEO Gary Nakamura confirmed that Concurrent plans to negotiate partnerships with other Hadoop distribution vendors, analogous to its agreement with Hortonworks, in order to consolidate Cascading’s position as the framework of choice for the development of Big Data applications.
Overall, the release of Cascading 3.0 represents a critical product enhancement that positions Cascading to operate over a broader range of computational frameworks and consequently assert its relevance for Big Data application development across a variety of data and computational frameworks. More importantly, however, the product enhancement in Cascading 3.0, in conjunction with the partnership with Databricks regarding Apache Spark, suggests that Cascading is well on its way to becoming the universal framework of choice for developing and managing applications in a Big Data environment, particularly given its compatibility with a wide range of Hadoop distributions and data and computational frameworks.
This week, Cloudera announced the general availability of Cloudera Search, the interactive search engine that enables users to perform free text searches on data stored within the Hadoop Distributed File System (HDFS) and Apache HBase without advanced scripting experience or training. Powered by the open source search engine Apache Solr, Cloudera Search is integrated with Apache ZooKeeper to manage distributed processing, index sharding and high availability. The general availability announcement follows a three-month public beta that began on June 4 and several months of private beta before that. The platform represents part of Cloudera’s larger project of democratizing access to Big Data and will sit alongside Cloudera Impala, Cloudera’s SQL interface for querying Hadoop clusters. A schematic of Cloudera’s library of tools and platforms for processing Hadoop-based data is given below:
Cloudera Search manages the creation of indexes of Hadoop data with a scalability comparable to MapReduce and integrates indexes produced upon querying Hadoop data into HDFS. Cloudera Search also supports real-time indexing of newly ingested data through an integration with Apache Flume. The platform enables “linearly scalable batch indexing for large data stores within Hadoop on-demand” and its GoLive functionality accommodates “incremental index changes.” Moreover, the Search platform is available by way of Hue, Cloudera’s open source user interface for querying Apache Hadoop data.
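To make the distinction between batch indexing and incremental index changes concrete, the following is a minimal sketch of an inverted index in plain Python. It does not use or imitate the Apache Solr or Cloudera Search APIs; the function names build_index, merge_index and search are hypothetical, and the sketch assumes documents small enough to fit in memory.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Lowercase the text and split it into alphanumeric terms.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Batch-build an inverted index mapping each term to a set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def merge_index(live, increment):
    """Fold a separately built index into the live one (incremental change)."""
    for term, ids in increment.items():
        live[term] |= ids
    return live


def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = None
    for term in terms:
        ids = index.get(term, set())
        result = set(ids) if result is None else result & ids
    return result


# Batch index over existing data, then merge in newly ingested documents.
live = build_index({1: "Hadoop cluster log", 2: "Solr search on Hadoop"})
merge_index(live, build_index({3: "new Hadoop search event"}))
print(sorted(search(live, "hadoop search")))
```

In this sketch the batch build and the incremental merge produce the same kind of structure, so newly ingested documents become searchable without rebuilding the index over the older data.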
With this week’s general availability announcement, Cloudera Search is fully available amongst Cloudera’s product line and is supported by CDH 4.3. Overall, the GA of Cloudera Search illustrates the intensity of the battle to bring Hadoop to non-technical enterprise users by means of an interactive search platform whose ease of use parallels web search platforms such as Google and Bing. MapR, for example, announced an integration of its Hadoop platform with LucidWorks search for Big Data in February. The industry should expect interactive search platforms for Hadoop to proliferate and achieve greater sophistication as Hadoop adoption accelerates across the enterprise.
On Thursday, Google announced that 79 more patents will be part of its Open Patent Non-Assertion (OPN) Pledge. The announcement builds upon Google’s March OPN commitment to refrain from suing users of designated patents, in an attempt to support open-source collaboration and innovation. The key proviso, however, is that Google reserves the right to sue if it is attacked first. The 79 patents relate to data center operations such as “middleware, distributed storage management, distributed database management, and alarm monitoring,” whereas the first 10 patents that Google introduced to OPN had to do with MapReduce. Although the patents included in OPN thus far focus on back-end technologies, Google intends to additionally include software for “consumer products that people use every day” in OPN going forward, according to a company blog post. Google’s commitment to OPN represents a gesture toward building a tech culture marked by fewer instances of aggressive patent litigation, although the move is unlikely to have a significant impact unless other tech companies make similar non-assertion commitments.
Informatica released the world’s first Hadoop parser on Wednesday in a move that boldly signalled its entry into the hotly contested Big Data analytics space. Informatica HParser operates on virtually all versions of Apache Hadoop and specializes in transforming unstructured data into a structured format within a Hadoop installation. HParser enables the transformation of textual data, Facebook and Twitter feeds, web logs, emails, log files and digital interactive media into a structured or semi-structured schema that allows businesses to mine the data more effectively for actionable business intelligence.
Key features of HParser include the following:
• A visual, integrated development environment (IDE) that streamlines development via a graphical interface.
• Support for a wide range of data formats including XML, JSON, HL7, HIPAA, ASN.1 and market data.
• Ability to parse proprietary machine generated log files.
• Use of the parallelism of MapReduce to optimize parsing performance across massive structured and unstructured data sets.
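The kind of transformation listed above, turning raw log lines into structured records inside a map step, can be sketched as follows. This is not HParser's interface: the names LOG_PATTERN, parse_line and map_parse are hypothetical, and the example assumes web logs in the Apache common log format.

```python
import re
from datetime import datetime

# Assumed input: Apache common log format, e.g.
# 127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line):
    """Turn one raw log line into a structured record, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


def map_parse(lines):
    """Map-style step: emit one structured record per parseable line."""
    for line in lines:
        record = parse_line(line)
        if record is not None:
            yield record


sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
for rec in map_parse([sample, "not a log line"]):
    print(rec["host"], rec["path"], rec["status"], rec["size"])
```

Each line is parsed independently of the others, which is what allows a framework such as MapReduce to spread the work across many mappers.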
Informatica’s HParser is available in both a free and a commercial edition. The free community edition can parse log files, Omniture Web analytics data, XML and JSON. The commercial edition additionally supports HL7, HIPAA, SWIFT, X12, NACHA, ASN.1, Bloomberg, PDF, XLS and Microsoft Word formats. Informatica’s HParser builds upon the company’s June 2011 deployment of Informatica 9.1 for Big Data, which featured “connectivity to big transaction data from traditional transaction databases, such as Oracle and IBM DB2, to the latest optimized for purpose analytic databases, such as EMC Greenplum, Teradata, Teradata Aster Data, HP Vertica and IBM Netezza,” in addition to Hadoop.