The following video by Cloudera Chief Strategy Officer Mike Olson elaborates on the significance of Apache Spark in the Hadoop landscape, with a particular focus on its differentiation from MapReduce. The video prefigures Cloudera’s One Platform Initiative aimed at rendering Spark a viable alternative to MapReduce.
Cloudera recently announced a One Platform Initiative that aspires to make Apache Spark the default framework for processing analytics in Hadoop, ahead of MapReduce. The initiative will focus on four areas: bolstering the security of Apache Spark, rendering Spark more scalable, enhancing management functionality and augmenting Spark Streaming, the Spark component that ingests massive volumes of streaming data for use cases such as the internet of things. Cloudera’s efforts to improve the security of Apache Spark will focus on ensuring the encryption of data at rest as well as over the wire. Meanwhile, the scalability work aims to render Spark scalable to as many as 10,000 nodes and to enhance its ability to handle computational workloads by means of an integration with Intel’s Math Kernel Library. With respect to management, Cloudera plans to deepen Spark’s integration with YARN by creating metrics that provide insight into resource utilization and by improving multi-tenant performance. Regarding Spark Streaming, Cloudera plans to render it more broadly available to business users via the addition of SQL semantics and the ability to support 80% of common streaming workloads.
Cloudera’s larger goal is to enhance the enterprise-readiness of Apache Spark with a view to promoting it as a viable alternative to MapReduce. All of Cloudera’s enhancements to Spark will be contributed to the Apache Spark open source project. That said, Cloudera’s leadership in accelerating the enterprise-readiness of Apache Spark as a MapReduce alternative promises to position the company strongly as a market share and thought leader in the Hadoop distribution space, particularly given the range of its intended contributions to Spark and the depth of its vision for subsequent Spark enhancements in forthcoming months.
Arcadia Data Releases Business Intelligence Platform For Hadoop And Closes $11.5M In Series A Funding
Today, Arcadia Data revealed details of its business intelligence and data visualization platform for Big Data. Arcadia Data’s BI platform enables business stakeholders to create data visualizations of Hadoop data by means of a rich user interface that allows users to drag and drop data fields. In addition, customers can select datasets for drill-downs to perform more advanced analyses such as root cause analytics, correlation analytics and trend analytics. The platform’s rich drag and drop functionality supports exploratory analysis of Hadoop-based data as illustrated below:
The graphic above shows how customers can use the Arcadia Data platform to obtain different aggregations of cab ride fares and duration within various geographies in NYC. Importantly, the simplicity and speed of the platform mean that business stakeholders can comfortably obtain the analyses and data visualizations needed to represent their own data-driven insights. Because the platform also features data modeling functionality that enables users to massage and organize data before visualizing it, it lends itself to use by more savvy data users in addition to business users. Arcadia supports all major Hadoop distributions including Cloudera, Hortonworks and MapR and additionally enables users to glean insights from applications built using MySQL, Oracle and Teradata. Alongside the product announcement, Arcadia Data today announced the finalization of $11.5M in Series A funding from Mayfield, Blumberg Capital and Intel Capital. As revealed to Cloud Computing Today in a live product demonstration, the depth and sophistication of the Arcadia Data platform illustrate the changing face of business intelligence in the wake of the big data revolution, particularly as evinced by the ease with which business stakeholders can now make sense of Hadoop-based data using data visualization, transformation, drill-downs, trend analysis and analytics more broadly.
Cloudera and Trillium Software recently announced a collaboration whereby the Trillium Big Data solution is certified for Cloudera’s Hadoop distribution. As a result of the partnership, Cloudera customers can take advantage of Trillium’s data quality solutions to profile, cleanse, de-duplicate and enrich Hadoop-based data. Trillium responds to a problem in the Big Data industry wherein the customer focus on deployment and management of Hadoop-based data repositories eclipses concerns about data quality. In the case of Hadoop-based data, data quality solutions predictably face challenges associated with the sheer volume of data that requires cleansing or quality improvements. Trillium’s Big Data solution cleanses data natively within Hadoop because identifying data with quality issues and then transporting it to another infrastructure becomes costly and complex. The collaboration between Trillium Software and Cloudera illustrates the relevance of data quality solutions for Hadoop despite the increased attention currently devoted to Big Data analytics and data visualization solutions. As such, Trillium fills a critical niche within the Big Data processing space, and its alliance with Cloudera positions it strongly to consolidate its early traction among data quality solutions for Big Data.
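To make the terms profile, cleanse and de-duplicate concrete, here is a minimal sketch in plain Python. It is not Trillium's product or API, and it does not run inside Hadoop; the record fields (`name`, `email`), the function names and the matching rule (exact match on a normalized email) are assumptions chosen purely for illustration.

```python
def profile(records, field):
    """Count missing and distinct (case-insensitive) values for one field."""
    values = [str(r.get(field) or "").strip().lower() for r in records]
    present = [v for v in values if v]
    return {"missing": len(values) - len(present), "distinct": len(set(present))}


def cleanse(record):
    """Collapse stray whitespace and normalize case (hypothetical rules)."""
    return {
        "name": " ".join(record.get("name", "").split()).title(),
        "email": record.get("email", "").strip().lower(),
    }


def deduplicate(records, key="email"):
    """Keep the first record per non-empty key; records with no key are kept."""
    seen = set()
    unique = []
    for record in records:
        value = record[key]
        if value and value in seen:
            continue  # duplicate of an earlier record
        seen.add(value)
        unique.append(record)
    return unique


# Illustrative sample data, not taken from any real dataset.
raw = [
    {"name": "  ada   lovelace ", "email": "Ada@Example.com "},
    {"name": "Ada Lovelace", "email": "ada@example.com"},
    {"name": "alan turing", "email": ""},
]

report = profile(raw, "email")
cleaned = [cleanse(r) for r in raw]
unique = deduplicate(cleaned)
```

In this toy data the first two records differ only in spacing and case, so they collapse into one after cleansing. Doing the same at Hadoop scale is the harder problem the article describes, which is why running the logic where the data already lives matters.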
Microsoft Azure recently announced the Azure Data Lake, a product that serves as a repository for “every type of data collected in a single place prior to any formal definition of requirements or schema.” As noted by Oliver Chiu in a blog post, data lakes allow organizations to store all data regardless of type and size on the theory that they can subsequently use advanced analytics to determine which data sources should be transferred to a data warehouse for more rigorous data profiling, processing and analytics. The Azure Data Lake’s compatibility with HDFS means that products with data stored in Azure HDInsight and infrastructures that use distributions such as Cloudera, Hortonworks and MapR can integrate with it, thereby allowing them to feed the Azure Data Lake with streams of Hadoop data from internal and third party data sources as necessary. Moreover, the Azure Data Lake supports massively parallel queries that allow for the execution of advanced analytics on massive datasets of the type envisioned for it, particularly given its ability to support unlimited data both in aggregate and with respect to individual files. Built for the cloud, the Azure Data Lake gives enterprises a preliminary solution to the problem of architecting an enterprise data warehouse by providing a repository for all data that customers can subsequently use as a base platform from which to retrieve and curate data of interest.
The Azure Data Lake illustrates the way in which the economics of cloud storage redefines the challenges associated with creating an enterprise data warehouse by shifting the focus of enterprise data management away from master data management and data cleansing toward advanced analytics that can query and aggregate data as needed, thereby absolving organizations of the need to create elaborate structures for storing data. In much the same way that Gmail dispenses with files and folders for email storage and depends upon its search functionality to facilitate the retrieval of email-based data, data lakes take the burden of classifying and curating data away from customers but correspondingly place the emphasis on the analytic capabilities of organizations with respect to the ability to query and aggregate data. As such, the commercial success of the Azure Data Lake hinges on its ability to simplify the process of running ad hoc and repeatable analytics on data stored within its purview by giving customers a rich visual user interface and platform for constructing and refining analytic queries on Big Data.
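The idea of storing data before any schema is defined, and imposing structure only at query time, can be sketched in a few lines of Python. This is an illustration of the general schema-on-read pattern, not of the Azure Data Lake's interface; the sample events, the `read_with_schema` helper and the `purchase_schema` mapping are all hypothetical.

```python
import json

# Raw events land in the "lake" as-is: nothing is validated at write time.
raw_lines = [
    '{"source": "web", "user": "u1", "amount": "12.50"}',
    '{"source": "sensor", "device": "d7", "temp_c": 21.4}',
    '{"source": "web", "user": "u2", "amount": "7.25", "coupon": "SPRING"}',
    "not valid json",
]


def read_with_schema(lines, schema):
    """Yield records that fit `schema`, a mapping of field name -> converter.

    Lines that are malformed, or that lack a required field, are skipped
    at read time rather than rejected at write time.
    """
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        try:
            yield {name: convert(record[name]) for name, convert in schema.items()}
        except (KeyError, TypeError, ValueError):
            continue


# The schema is chosen by the analyst at query time (hypothetical example).
purchase_schema = {"user": str, "amount": float}
purchases = list(read_with_schema(raw_lines, purchase_schema))
total = sum(p["amount"] for p in purchases)
```

A different question, such as average sensor temperature, would apply a different schema to the same untouched raw lines. The burden therefore moves from up-front modeling to the quality of the query tooling, which is the point the paragraph above makes.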
MapR has declined the invitation to participate in the Open Data Platform (ODP) after careful consideration, as noted in a recent blog post by John Schroeder, the company’s CEO and co-founder. Schroeder claims that the Open Data Platform is redundant with the governance provided by the Apache Software Foundation, that it purports to “solve” Hadoop-related problems that do not require solving and that it fails to accurately define the core of the Open Data Platform as it relates to Hadoop. With respect to software governance, Schroeder notes that the Apache Software Foundation has done well to steward the development of Apache Hadoop as elaborated below:
The Apache Software Foundation has done a wonderful job governing Hadoop, resulting in the Hadoop standard in which applications are interoperable among Hadoop distributions. Apache governance is based on a meritocracy that doesn’t require payment to participate or for voting rights. The Apache community is vibrant and has resulted in Hadoop becoming ubiquitous in the market in only a few short years.
Here, Schroeder credits the Apache Software Foundation with creating a Hadoop ecosystem in which applications interoperate across Hadoop distributions and wherein the governance structure is based on a meritocracy that does not mandate monetary contributions in order to garner voting rights. In addition, the blog post observes that whereas the Open Data Platform defines the core of Apache Hadoop as MapReduce, YARN, Ambari and HDFS, other frameworks such as “Spark and Mesos, are gaining market share” and stand to complicate ODP’s definition of the core of Hadoop. Meanwhile, Cloudera’s Chief Strategy Officer Mike Olson explained why Cloudera also declined to join the Open Data Platform by noting that Hadoop “won because it’s open source” and that the partnership between Pivotal and Hortonworks was “antithetical to the open source model and the Apache way.” Given that 75% of Hadoop implementations use either MapR or Cloudera, ODP looks set to face some serious challenges despite support from IBM, Pivotal and Hortonworks, although the precise impact of the schism over the Open Data Platform on the Hadoop community remains to be seen.
On March 9, the Apache Software Foundation announced the availability of Apache Tajo version 0.10.0. Less well known than its counterpart Apache Hive, Apache Tajo is used for ETL on big data and for SQL-compliant querying that delivers scalable, low latency results. Version 0.10.0 features enhanced support for Amazon S3 and an improved JDBC driver that renders Tajo compatible with most major BI platforms. Hyunsik Choi, Vice President of Apache Tajo, remarked on Apache Tajo’s progress as follows:
Tajo has evolved over the last couple of years into a mature ‘SQL-on-Hadoop’ engine. The improved JDBC driver in this release allows users to easily access Tajo as if users use traditional RDBMSs. We have verified new JDBC driver on many commercial BI solutions and various SQL tools. It was easy and works successfully.
As Choi notes, Tajo attempts to bring the simplicity and standardization of SQL and RDBMS infrastructures to the power of Hadoop’s distributed processing and scalability. Designed with a focus on fault tolerance, scalability, high throughput and query optimization, Tajo aims to deliver low latency in conjunction with a storage agnostic platform that, as of this version, notably adds HBase storage integration so that users can access HBase via Tajo. Tajo plays in an increasingly crowded SQL-on-Hadoop space featuring the likes of Hive, Cloudera’s Impala, Pivotal HAWQ and Stinger, although it claims some early adoption, notably in South Korea, the country of its origin, by organizations such as Korea University, Melon, NASA JPL Radio Astronomy and Airborne Snow Observatory projects, and SK Telecom. The key question for Apache Tajo now is whether its new release will usher in greater traction outside of South Korea, particularly given its enhanced integration with Amazon S3 and Amazon’s Elastic MapReduce (EMR) platform.