Google recently announced development of Mesa, a data warehousing platform designed to collect data for its internet advertising business. Mesa delivers a distributed data warehouse that can manage petabytes of data while delivering high availability, scalability and fault tolerance. Mesa is designed to update millions of rows per second, process billions of queries and retrieve trillions of rows per day to support Google’s gargantuan data needs for its flagship search and advertising business. Google elaborated on the company’s business need for a new data warehousing platform by commenting on its evolving data management needs as follows:
Google runs an extensive advertising platform across multiple channels that serves billions of advertisements (or ads) every day to users all over the globe. Detailed information associated with each served ad, such as the targeting criteria, number of impressions and clicks, etc. are recorded and processed in real time…Advertisers gain fine-grained insights into their advertising campaign performance by interacting with a sophisticated front-end service that issues online and on-demand queries to the underlying data store…The scale and business critical nature of this data result in unique technical and operational challenges for processing, storing and querying.
Google’s advertising platform depends upon real-time data that records updates about advertising impressions and clicks in the larger context of analytics about current and potential advertising campaigns. As such, the data model requires the ability to accommodate atomic updates to advertising components that cascade throughout an entire data repository, consistency and correctness of data across datacenters and over time, the ability to support continuous updates, low latency query performance, scalability as illustrated by the ability to support petabytes of data and data transformation functionality that accommodates changes to data schemas. Mesa utilizes Google products as follows:
Mesa leverages common Google infrastructure and services, such as Colossus, BigTable and MapReduce. To achieve storage scalability and availability, data is horizontally partitioned and replicated. Updates may be applied at granularity of a single table or across many tables. To achieve consistent and repeatable updates, the underlying data is multi-versioned. To achieve update scalability, data updates are batched, assigned a new version number and periodically incorporated into Mesa. To achieve update consistency across multiple data centers, Mesa uses a distributed synchronization protocol based on Paxos.
While Mesa takes advantage of technologies from Colossus, BigTable, MapReduce and Paxos, it delivers a degree of “atomicity” and consistency lacked by its counterparts. In addition, Mesa features “a novel version management system that batches updates to achieve acceptable latencies and high throughput for updates.” All told, Mesa constitutes a disruptive innovation in the Big Data space that extends the attributes of atomicity, consistency, high throughput, low latency and scalability on the scale of trillions of rows toward the end of a “petascale data warehouse.” While speculation proliferates about the possibilities for Google to append Mesa to its Google Compute Engine offering or otherwise open-source it, the key point worth noting is that Mesa represents a qualitative shift with respect to the ability of a Big Data platform to process petabytes of data that experiences real-time flux. Whereas the cloud space is accustomed to seeing Amazon Web Services usher in breathtaking innovation after innovation, time and time again, Mesa conversely underscores Google’s continuing leadership in the Big Data space. Expect to hear more details about Mesa at the Conference on Very Large Data Bases next month in Hangzhou, China.