Hadoop – Cloud Computing Today

Pepperdata Announces Cluster And Job Optimization Product For Cloud-based Hadoop Clusters On Amazon EMR

Pepperdata today announces a new product that helps Amazon EMR customers optimize the performance of their cloud-based Hadoop jobs. Pepperdata with Amazon EMR delivers enhanced analytics on the performance of jobs running on EMR and optimizes that performance in response to instruction and feedback from users. The product gives Amazon EMR users granular visibility into cluster performance, along with analytics on individual jobs that draw on metrics for CPU, memory and unused capacity, as illustrated by the graphic below:

[Graphic: Pepperdata]

Because Pepperdata translates its analytics into performance optimization on Amazon EMR, customers benefit from lower cloud resource utilization as well as improved job performance. Sean Suchter, CTO of Pepperdata, remarked on the significance of Pepperdata’s product for Amazon EMR as follows:

Amazon EMR is designed to help companies process huge amounts of data easily and cost-effectively without having to commit unnecessary resources. As customers embrace Hadoop in the cloud they need to be able to manage cost and performance without any big surprises. Pepperdata eliminates those blind spots with very granular insight into the performance of current and historical EMR runs.

Here, Suchter comments on the ability of Pepperdata’s EMR product to help customers manage costs for Hadoop-related cloud resources while optimizing performance. Whereas Amazon Web Services EMR clusters terminate upon the completion of a run, which makes it difficult for users to access performance-related data afterward, Pepperdata’s product for Amazon EMR allows users to analyze the performance of clusters and their constituent jobs even after the cluster has terminated. As a result, teams can analyze historical data to progressively improve cluster performance by determining the optimal amount of computing resources for cloud-based Hadoop jobs. Today, Pepperdata also announces the availability of Adaptive Scaling for EMR, a product that purchases Amazon EMR instances in accordance with budget and time constraints specified by clients. All told, today’s announcements from Pepperdata represent a notable addition to the space of products specializing in both infrastructure and application optimization for cloud-based Hadoop workloads. Expect to hear more from Pepperdata as big data adoption expands and companies increasingly turn their attention from deploying Hadoop clusters and their related applications toward the task of optimizing performance at the level of both clusters and their associated jobs and applications.

Google’s Mesa Data Warehouse Takes Real Time Big Data Management To Another Level

Google recently announced the development of Mesa, a data warehousing platform designed to collect data for its internet advertising business. Mesa delivers a distributed data warehouse that can manage petabytes of data while delivering high availability, scalability and fault tolerance. Mesa is designed to update millions of rows per second, process billions of queries and retrieve trillions of rows per day to support Google’s gargantuan data needs for its flagship search and advertising business. Google elaborated on its business need for a new data warehousing platform by describing its evolving data management needs as follows:

Google runs an extensive advertising platform across multiple channels that serves billions of advertisements (or ads) every day to users all over the globe. Detailed information associated with each served ad, such as the targeting criteria, number of impressions and clicks, etc. are recorded and processed in real time…Advertisers gain fine-grained insights into their advertising campaign performance by interacting with a sophisticated front-end service that issues online and on-demand queries to the underlying data store…The scale and business critical nature of this data result in unique technical and operational challenges for processing, storing and querying.

Google’s advertising platform depends upon real-time data that records updates about advertising impressions and clicks in the larger context of analytics about current and potential advertising campaigns. As such, the data model must accommodate atomic updates to advertising components that cascade throughout an entire data repository; consistency and correctness of data across datacenters and over time; continuous updates; low-latency query performance; scalability to petabytes of data; and data transformation functionality that accommodates changes to data schemas. Mesa utilizes Google products as follows:

Mesa leverages common Google infrastructure and services, such as Colossus, BigTable and MapReduce. To achieve storage scalability and availability, data is horizontally partitioned and replicated. Updates may be applied at granularity of a single table or across many tables. To achieve consistent and repeatable updates, the underlying data is multi-versioned. To achieve update scalability, data updates are batched, assigned a new version number and periodically incorporated into Mesa. To achieve update consistency across multiple data centers, Mesa uses a distributed synchronization protocol based on Paxos.
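The batching and versioning described above can be illustrated with a toy, single-process sketch. This is not Google's implementation: the class name `VersionedTable`, its methods, and the choice to combine updates by summing numeric deltas per key are all assumptions made for illustration, and replication, partitioning and Paxos are left out entirely.

```python
class VersionedTable:
    """Toy multi-versioned table (hypothetical, for illustration only).

    Each committed batch of updates receives the next version number.
    A query at version n combines every batch with version <= n, so a
    batch is either fully visible to a query or not visible at all.
    """

    def __init__(self):
        # Index i holds the batch that was assigned version i + 1.
        self._batches = []

    @property
    def latest_version(self):
        return len(self._batches)

    def commit_batch(self, deltas):
        """Store a batch of {key: numeric delta} and return its version."""
        self._batches.append(dict(deltas))
        return self.latest_version

    def query(self, key, version=None):
        """Return the summed value for key as of the given version."""
        if version is None:
            version = self.latest_version
        if not 0 <= version <= self.latest_version:
            raise ValueError(f"unknown version: {version}")
        return sum(batch.get(key, 0) for batch in self._batches[:version])


# Usage: two batches of click counts for made-up ad identifiers.
table = VersionedTable()
v1 = table.commit_batch({"ad-1": 10, "ad-2": 3})
v2 = table.commit_batch({"ad-1": 5})

clicks_at_v1 = table.query("ad-1", v1)   # sees only the first batch
clicks_latest = table.query("ad-1")      # sees both batches
```

Because older versions stay readable, a query pinned to `v1` returns the same answer no matter how many batches are committed later, which is one way to read the "consistent and repeatable" property mentioned in the quote.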

While Mesa takes advantage of technologies from Colossus, BigTable, MapReduce and Paxos, it delivers a degree of “atomicity” and consistency that its counterparts lack. In addition, Mesa features “a novel version management system that batches updates to achieve acceptable latencies and high throughput for updates.” All told, Mesa constitutes a disruptive innovation in the Big Data space that extends the attributes of atomicity, consistency, high throughput, low latency and scalability on the scale of trillions of rows toward the end of a “petascale data warehouse.” While speculation proliferates about the possibility that Google will add Mesa to its Google Compute Engine offering or open-source it, the key point is that Mesa represents a qualitative shift in the ability of a Big Data platform to process petabytes of data in real-time flux. Whereas the cloud space is accustomed to seeing Amazon Web Services usher in one breathtaking innovation after another, Mesa underscores Google’s continuing leadership in the Big Data space. Expect to hear more details about Mesa at the Conference on Very Large Data Bases next month in Hangzhou, China.

Xplenty Expands Coverage to all Amazon Web Services’ Regions

PRESS RELEASE

Customers using Amazon CloudFront can now benefit from Xplenty to parse and process their log files, all within the Xplenty design environment

Tel Aviv, Israel – March 4, 2014 – Xplenty, http://www.xplenty.com, provider of the innovative Hadoop-as-a-service platform, Amazon Web Services (AWS) Technology Partner in the AWS Partner Network, and seller on the AWS Marketplace, now offers its big data processing technology directly to customers in all AWS Regions. Xplenty is now available to customers from AWS’ Regions in South America (São Paulo), Asia Pacific (Singapore), Asia Pacific (Sydney), and Asia Pacific (Tokyo). This adds to the existing Xplenty locations of U.S. East (N. Virginia), U.S. West (N. California and Oregon) and EU (Ireland).

Xplenty technology provides Hadoop processing on the cloud via a coding-free design environment, ensuring businesses can quickly and easily benefit from the opportunities offered by big data without having to invest in hardware, software or related personnel.

Meanwhile, users of the Amazon CloudFront content delivery network can now use Xplenty to analyze their log files. New predefined templates let users parse and process Amazon CloudFront logs easily. The processing engine transforms structured and semi-structured big data and easily scales to petabytes as data requirements grow, allowing companies to better understand their customers.
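For readers curious what parsing a CloudFront log involves, here is a minimal Python sketch. It does not show Xplenty's templates, which the release does not describe. It assumes the tab-separated layout of CloudFront standard logs, in which a `#Fields:` header line names the columns; the sample lines, host name and IP addresses are made up, and the function names are hypothetical.

```python
from collections import Counter

# Made-up sample in the tab-separated layout of CloudFront standard logs.
# Only a subset of the real columns is included to keep the example short.
SAMPLE_LOG = (
    "#Version: 1.0\n"
    "#Fields: date time x-edge-location sc-bytes c-ip cs-method "
    "cs(Host) cs-uri-stem sc-status\n"
    "2014-03-04\t09:15:01\tSEA50\t2048\t192.0.2.10\tGET\t"
    "d111.cloudfront.net\t/index.html\t200\n"
    "2014-03-04\t09:15:07\tSEA50\t512\t192.0.2.11\tGET\t"
    "d111.cloudfront.net\t/missing.png\t404\n"
    "2014-03-04\t09:16:30\tNRT12\t4096\t192.0.2.12\tGET\t"
    "d111.cloudfront.net\t/index.html\t200\n"
)


def parse_cloudfront_log(text):
    """Turn log text into a list of dicts keyed by the #Fields names."""
    fields = []
    records = []
    for line in text.splitlines():
        if line.startswith("#Fields:"):
            # Column names are space-separated after the "#Fields:" prefix.
            fields = line[len("#Fields:"):].split()
        elif line and not line.startswith("#"):
            # Data rows are tab-separated, in the order given by #Fields.
            records.append(dict(zip(fields, line.split("\t"))))
    return records


def status_counts(records):
    """Count requests per HTTP status code."""
    return Counter(record["sc-status"] for record in records)


records = parse_cloudfront_log(SAMPLE_LOG)
counts = status_counts(records)
```

Reading the column names from the `#Fields:` header, rather than hard-coding positions, keeps the parser working if the set of logged columns differs from this sample.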

One company already using Xplenty to gain better insight into its customers is WalkMe. “We have customers from a wide range of industries and verticals – including banks, financial institutions, retail services, tourism, leading software vendors and more – all of which use WalkMe to simplify their customers’ online experience. By using Xplenty to break down our log files, we’re able to gain valuable insights into our customer needs and preferences,” says Nir Nahum, VP of R&D at WalkMe. “With the easy-to-use GUI, we just designate the file location for processing, and it automatically sets up the template and runs.”

Xplenty is available within the global AWS Marketplace to customers seeking to integrate a Hadoop-as-a-Service platform to solve their big data processing challenges.

“Big data is shaping the way companies of all sizes develop new products and identify new opportunities to increase their efficiency,” said Brian Matsubara, Head of Global Technology Alliances, Amazon Web Services. “By bringing their Big Data analysis tools to the AWS cloud, Xplenty is giving customers an innovative approach to solve their business challenges. Xplenty leverages the AWS global platform to provide scalable Big Data solutions to customers around the world.”

“As a cloud-based service provider, we offer organizations of any size the opportunity to learn more about their customers, further personalize their services, and increase their bottom lines, all by enabling their big data analyses,” says Yaniv Mor, co-founder and CEO of Xplenty. “Why shouldn’t everyone gain by using the data they are paying to store anyway?”

About Xplenty
Xplenty was founded by data professionals for data professionals to deliver on the promise of big data. Xplenty’s true big data solution provides ROI almost immediately by uncovering valuable business insights, translating into higher revenues and increased competitiveness. Xplenty delivers a coding-free, cloud-based Hadoop-as-a-Service platform that transforms structured, unstructured, and semi-structured data into useable information in the AWS, Rackspace and Softlayer environments. Our goal is to make Hadoop accessible and cost-effective for everybody. http://www.xplenty.com

Media Contact
Amy Kenigsberg
K2 Global Communications
amy@k2-gc.com
tel: +972-9-794-1681 (+2 GMT)
mobile: +972-524-761-341
U.S.: +1-913-440-4072 (+7 ET)

All product and company names herein may be trademarks of their registered owners.

DataRPM Closes $5.1M In Series A Funding For Natural Language Search Big Data Analytics Platform

DataRPM today announced the finalization of $5.1M in Series A funding in a round led by InterWest Partners. DataRPM specializes in a next generation business intelligence platform that leverages machine learning and artificial intelligence to deliver actionable business intelligence by means of a natural language-based search engine that allows customers to dispense with complex, time-consuming data modeling and query production. DataRPM stores customer data within a “distributed computational search index” that enables its platform to apply its natural language query interface to heterogeneous data sources without modeling the data into intricate taxonomic relationships or master data management frameworks. Because DataRPM’s distributed computational search index lets customers run queries against different data sources without constructing data schemas that organize the constituent data fields and their relationships, it promises to accelerate the speed with which customers can derive insights from their data. Not only does the platform deliver a natural language interface, but it also performs data visualization of the requisite Google-like searches as illustrated below:

In an interview with Cloud Computing Today, DataRPM CEO Sundeep Sanghavi noted that its natural language search functionality is based on proprietary graphing technology analogous to Apache Giraph and Neo4j. The platform operates on data in relational and non-relational formats, although it currently does not support unstructured data. Available via both cloud-based and on-premises deployment, DataRPM promises to disrupt Big Data analytics and contemporary business intelligence platforms by dispensing with the need for complex, time-consuming and expensive data modeling and by empowering business stakeholders with neither SQL nor scripting skills to analyze data. Today’s funding raise is intended to accelerate the company’s go-to-market strategy and to support product development in response to the platform’s reception by current and future customers.

DataRPM belongs to the rapidly growing space of products that expedite Big Data analytics on Hadoop clusters, as exemplified by the constellation of SQL-like interfaces for querying Hadoop-based data. That said, its natural language query interface represents a genuine innovation in a space dominated by products that render Hadoop accessible to SQL developers and analysts, as opposed to data-savvy stakeholders with Google-like querying expertise. Moreover, DataRPM’s natural language search capabilities push the envelope of “next generation business intelligence” even further than contemporaries such as Jaspersoft, Talend and Pentaho, which thus far have focused largely on the transition within the enterprise from reporting to analytics and data discovery. Expect to hear more about DataRPM as the battle to streamline and simplify the derivation of actionable business intelligence from Big Data takes shape within a vendor landscape marked by the proliferation of analytic interfaces for petabyte-scale relational and non-relational databases.

Datameer Raises $19M In Series D Funding For Its Big Data Analytics and Visualization Platform

Hadoop data analytics vendor Datameer recently announced the finalization of $19M in Series D funding. Datameer’s Series D funding raise is led by Next World Capital, with additional participation from existing investors and Workday, Citi Ventures, Software AG, Kleiner Perkins Caufield & Byers and Redpoint Ventures. Datameer specializes in analytics and data visualization on big datasets. Unlike many other business intelligence and reporting tools, Datameer’s platform is designed for Hadoop and consequently enjoys the benefits of native optimization for Hadoop data integration, data management and analytics. The funding raise will be used to support the expansion of the company’s sales and operations teams to new markets and to support the continuation of its dramatic growth. Datameer is currently used by more than 130 companies including Sears, Visa and British Telecom. The platform represents part of the evolving revolution in next generation business intelligence marked by increasingly advanced data visualization options, the ability to handle large, unstructured data sets and utilization of machine-based learning to simplify and streamline the user’s experience with the data in question. Today’s funding raise brings the total raised by Datameer to $36.8M. As part of the capital raise, Ben Fu, Partner at Next World Capital, will join the Datameer Board of Directors.

Amazon Web Services Supports Impala To Facilitate Real Time, High Performance Hadoop Queries

Amazon Web Services (AWS) recently announced support for Impala, the open source technology platform developed by Cloudera for querying data in the Hadoop Distributed File System or HBase using SQL-like syntax as elaborated below:

Impala raises the bar for query performance while retaining a familiar user experience. With Impala, you can query data, whether stored in HDFS or Apache HBase – including SELECT, JOIN, and aggregate functions – in real time. Furthermore, it uses the same metadata, SQL syntax (Hive SQL), ODBC driver and user interface (Hue Beeswax) as Apache Hive, providing a familiar and unified platform for batch-oriented or real-time queries. (For that reason, Hive users can utilize Impala with little setup overhead.)
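The kind of query named in the quote (SELECT with a JOIN and an aggregate function) looks roughly like the sketch below. It runs against Python's built-in `sqlite3` module purely as a stand-in so the example is self-contained; it does not connect to Impala or Hive, the tables `customers` and `orders` are hypothetical, and the SQL has not been verified against Impala here.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for a SQL engine.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount INTEGER);
    INSERT INTO customers VALUES (1, 'east'), (2, 'east'), (3, 'west');
    INSERT INTO orders VALUES (1, 10), (2, 5), (3, 7), (3, 1);
    """
)

# SELECT + JOIN + aggregate: total order amount per region.
QUERY = """
SELECT c.region, SUM(o.amount) AS total_amount
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.region
ORDER BY c.region
"""

rows = conn.execute(QUERY).fetchall()
for region, total_amount in rows:
    print(region, total_amount)
conn.close()
```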

Amazon Web Services introduced Impala as part of the Amazon Elastic MapReduce project. Users will need to run clusters that use Hadoop 2.x in order to take advantage of the Impala offering. Impala users can run queries on data sets in real time and enjoy low latency thanks to the platform’s distributed query engine, which gives Impala speed and performance benefits over Apache Hive. The availability of Impala on the Amazon Web Services platform comes just weeks after the release of Amazon Kinesis, the AWS platform for collecting and storing real time big data streams, and underscores the seriousness with which AWS plans to deploy products designed for the big data space.