LinkedIn recently announced the open sourcing of Dr. Elephant, a tool that helps Hadoop users optimize their flows. Dr. Elephant aggregates and analyzes data about Hadoop jobs and suggests how to tune those jobs to increase their efficiency. Whereas most Hadoop optimization tools focus on simplifying and streamlining the management of Hadoop clusters, Dr. Elephant focuses on the optimization of Hadoop flows. As noted in a LinkedIn blog post, the platform leverages “pluggable, configurable, rule-based heuristics” to provide analytical insight about job performance in addition to recommendations for performance optimization. LinkedIn uses Dr. Elephant to enhance developer productivity and to improve the efficiency of Hadoop clusters by optimizing their constituent flows. The tool delivers an aggregated dashboard of all of the jobs that run on a specific cluster, along with drill-down visualizations of flow performance for each job. The platform specializes in diagnostics at the job level rather than the cluster level, and LinkedIn uses it widely to diagnose and resolve over 80% of flow performance questions. Open sourced under the Apache License, Version 2.0, Dr. Elephant is compatible with Apache Hadoop and Apache Spark and plays in the same space as Driven, the Big Data application performance management framework pioneered by Concurrent Inc.
On Thursday, Cloudera announced the release of Cloudera Director 2.0, the next version of Cloudera’s platform for deploying and managing Cloudera Enterprise within cloud environments. In collaboration with Cloudera Manager, Cloudera Director 2.0 empowers users to deploy CDH clusters within a cloud infrastructure by using a combination of configuration scripts to launch the CDH cluster, schedule queries, retrieve Hadoop-based data and terminate the cluster when required. Moreover, Cloudera Director 2.0 gives customers the ability to add ETL and modeling workloads using spot instance support, thereby decreasing the operational costs associated with hosting. This version also enables the launch and termination of clusters as a result of the execution of specific jobs. That automation gives customers greater control over their cloud-based CDH deployments as well as the opportunity to decrease costs. In addition, Thursday’s release features the ability to both clone and repair clusters with zero to minimal disruption to the deployment. Meanwhile, RecordService, Cloudera’s beta distributed data service for unified access control and security, supports “secure, multi-tenant access” for all users analyzing Hadoop data in Amazon S3 and other storage repositories. By giving customers fine-grained control over operational processes that include cluster launch, cluster termination and query management, as well as improved scalability for business intelligence and analytic workloads, Cloudera Director 2.0 promises to entice customers to leverage the agility and economics of the public cloud to complement their on-premises Hadoop deployments.
As the only Hadoop distribution that supports hybrid cloud environments, Cloudera empowers customers to nimbly deploy Hadoop workloads on Amazon Web Services, Google Cloud Platform or Microsoft Azure with granular controls that collectively deliver optimized cost, greater operational control, improved scalability, enhanced automation and more robust security within their cloud deployments. Version 2.0 of Cloudera Director accelerates the industry-wide trend toward the convergence of cloud computing and Big Data by giving customers enterprise-grade, self-service tools to manage their Hadoop workloads in the cloud from within a single pane of glass.
The following video by Cloudera CEO Mike Olson elaborates on the significance of Apache Spark in the Hadoop landscape, with a particular focus on its differentiation from MapReduce. The video prefigures Cloudera’s One Platform Initiative aimed at rendering Spark a viable alternative to MapReduce.
Cloudera recently announced a One Platform Initiative that aspires to make Apache Spark the default framework for processing analytics in Hadoop, ahead of MapReduce. The initiative will focus on bolstering the security of Apache Spark, rendering Spark more scalable, enhancing management functionality and augmenting Spark Streaming, the Spark component that focuses on ingesting massive volumes of streaming data for use cases such as the internet of things. Cloudera’s efforts to improve the security of Apache Spark will focus on ensuring the encryption of data at rest as well as over the wire. Meanwhile, the initiative to improve the scalability of Apache Spark aims to scale it to as many as 10,000 nodes and to improve its handling of computational workloads through an integration with Intel’s Math Kernel Library. With respect to management, Cloudera plans to deepen Spark’s integration with YARN by creating metrics that provide insight into resource utilization and by improving multi-tenant performance. Regarding Spark Streaming, Cloudera plans to make it more broadly available to business users via the addition of SQL semantics and the ability to support 80% of common streaming workloads.
Cloudera’s larger goal is to enhance the enterprise-readiness of Apache Spark with a view to promoting it as a viable alternative to MapReduce. All of Cloudera’s enhancements to Spark will be contributed to the Apache Spark open source project. That said, Cloudera’s stewardship of Spark’s enterprise-readiness promises to position the company strongly as the undisputed market share and thought leader in the Hadoop distribution space, particularly given the range of its intended contributions to Spark and the depth of its vision for subsequent Spark enhancements in forthcoming months.
Today, Arcadia Data revealed details of its business intelligence and data visualization platform for Big Data. Arcadia Data’s BI platform enables business stakeholders to create data visualizations of Hadoop data by means of a rich user interface that allows users to drag and drop data fields. In addition, customers can select datasets for drill-downs to perform more advanced analyses such as root cause analytics, correlation analytics and trend analytics. The platform’s rich drag and drop functionality supports exploratory analysis of Hadoop-based data as illustrated below:
The graphic above shows how customers can use the Arcadia Data platform to obtain different aggregations of cab ride fares and duration within various geographies in NYC. Importantly, the simplicity and speed of the platform mean that business stakeholders can comfortably obtain the analyses and data visualizations needed to represent their own data-driven insights. Because the Arcadia Data platform also features data modeling functionality that enables users to massage and organize data before visualizing it, the platform lends itself to use by more savvy data users in addition to business users. Arcadia supports all major Hadoop distributions including Cloudera, Hortonworks and MapR and additionally enables users to glean insights from applications built using MySQL, Oracle and Teradata. In addition to the product announcement, Arcadia Data today announced the finalization of $11.5M in Series A funding from Mayfield, Blumberg Capital and Intel Capital. As revealed to Cloud Computing Today in a live product demonstration, the depth and sophistication of the Arcadia Data platform illustrate the changing face of business intelligence in the wake of the big data revolution, particularly as evinced by the ease with which business stakeholders can now make sense of Hadoop-based data using data visualization, transformation, drill-downs, trend analysis and analytics more broadly.
Cloudera and Trillium Software recently announced a collaboration whereby the Trillium Big Data solution is certified for Cloudera’s Hadoop distribution. As a result of the partnership, Cloudera customers can take advantage of Trillium’s data quality solutions to profile, cleanse, de-duplicate and enrich Hadoop-based data. Trillium responds to a problem in the Big Data industry wherein the customer focus on deployment and management of Hadoop-based data repositories eclipses concerns about data quality. In the case of Hadoop-based data, data quality solutions predictably face challenges associated with the sheer volume of data that requires cleansing or quality improvements. Trillium’s Big Data solution cleanses data natively within Hadoop because identifying data with quality issues and then transporting it to another infrastructure becomes costly and complex. The collaboration between Trillium Software and Cloudera illustrates the relevance of data quality solutions for Hadoop despite the increased attention currently devoted to Big Data analytics and data visualization solutions. As such, Trillium fills a critical niche within the Big Data processing space, and its alliance with Cloudera positions it strongly to consolidate its early traction among data quality solutions for Big Data.