Microsoft Research Is Developing OneNet, An Alternative To Spark That Can Also Build Cloud Infrastructures

As reported by ZDNet, Microsoft is developing an open source, distributed platform called OneNet, or Prajna, designed for the creation of cloud services and big data analytics. OneNet specializes in interactive big data analytics and offers in-memory processing functionality similar to that of Apache Spark. Importantly, OneNet supports both batch processing and real-time, streaming data ingestion and analytics. Unlike Spark, however, the platform can deploy cloud platforms and access them via mobile applications and distributed applications featuring in-memory key-value stores. Prajna also claims distributed programming functionality across multiple clusters and in-memory sharing of data across multiple jobs. Prajna’s ability to support “multi-cluster distributed programming” differentiates it from Spark and pushes the envelope with respect to high performance distributed computing in the industry today. OneNet/Prajna provides a distributed computing platform that spans multiple clusters and can be used to build high performance, big data analytic engines while reducing the engineering costs typically associated with building distributed systems. Microsoft’s investment in OneNet comes at a time when Cloudera has recently announced a One Platform Initiative aimed at rendering Apache Spark production-ready as a viable alternative to MapReduce.

Glassbeam Integrates With Apache Spark And Enhances Its Analytics And Machine Learning Functionality

Santa Clara-based machine data analytics vendor Glassbeam recently revealed details of a new version of Glassbeam SCALAR marked by deep integration with Apache Spark. Apache Spark is a parallel data processing framework that facilitates real-time analytics and machine learning by storing the results of data operators in memory and performing low latency, iterative calculations on in-memory computational results. Known for its ability to automate the parallelization of tasks and jobs, Spark boasts operational efficiencies over MapReduce by a factor of up to 100 with respect to the execution of calculations on large datasets. Glassbeam SCALAR’s integration with Apache Spark enhances its computational capabilities as well as the platform’s machine learning functionality and capacity to perform real-time analytics on streaming datasets by means of the Spark Streaming and MLlib components of the Spark stack. Because Glassbeam’s cloud analytics platform is built on Cassandra, the addition of Spark gives it the benefits of Cassandra’s distributed data management architecture in addition to Spark’s computational, analytic and machine learning functionality. As such, today’s announcement strengthens Glassbeam’s position in the nascent but exploding internet of things analytics space by augmenting its ability to ingest, process and analyze massive amounts of data as well as enhancing Glassbeam SCALAR’s advanced analytics, machine learning and predictive analytics capabilities.

MapR Finalizes $110M In Equity And Debt Financing Led By Google Capital And Silicon Valley Bank

On Monday, MapR Technologies announced the finalization of $110M in funding based on $80M in equity financing and $30M in debt financing. Google Capital led the equity funding in collaboration with Qualcomm Incorporated, Lightspeed Venture Partners, Mayfield Fund, NEA and Redpoint Ventures while MapR’s debt funding was financed by Silicon Valley Bank. The funding will be used to spearhead MapR’s explosive growth in the Hadoop distribution and analytics space as illustrated by a threefold increase in bookings in Q1 of 2014 as compared to 2013. Gene Frantz, General Partner at Google Capital, commented on Google Capital’s participation in the June 30 funding raise as follows:

MapR helps companies around the world deploy Hadoop rapidly and reliably, generating significant business results. We led this round of funding because we believe MapR has a great solution for enterprise customers, and they’ve built a strong and growing business.

Monday’s announcement comes soon after MapR’s news of its support for Apache Hadoop 2.x and YARN in addition to all five components of Apache Spark, the open source technology used for big data applications that specialize in interactive analytics, real-time analytics, machine learning and stream processing. The additional $110M in funding strongly positions MapR with respect to competitors Cloudera and Hortonworks given that Cloudera recently raised $900M and Hortonworks finalized $100M in funding. The news of MapR’s $110M funding also coincides with a recent statement from Hortonworks certifying the compatibility of YARN with Apache Spark as part of a larger announcement about the integration of Spark into the Hortonworks Data Platform (HDP) alongside its Hadoop security acquisition XA Secure and Apache Ambari for the provisioning and management of Hadoop clusters. With a fresh round of capital in the bank and backing from Google, the creator of MapReduce, MapR signals that the battle for Hadoop market share is a three-horse race that is almost certain to intensify as vendors compete to streamline and simplify the operationalization of Big Data. In the meantime, Big Data-related venture capital continues to flow like water bursting out of a fire hydrant as the Big Data space tackles problems related to big data analytics, streaming big data and Hadoop security.

Hortonworks Announces Readiness Of YARN For Apache Spark

On Thursday, Hortonworks announced that Apache Spark is “YARN Ready” and compatible with the multiple workloads and additional CPU processing demands specific to Spark applications. As a result of the compatibility of Apache Spark with YARN, Hadoop users can now use one Hadoop cluster with a single repository of data for a variety of purposes rather than having to segment workloads such that some data is dedicated to Apache Spark. More specifically, Hadoop users can now rest assured that YARN-based applications work collaboratively with applications that leverage Spark’s capabilities to facilitate real-time analytics, interactive analytics, machine learning and stream processing. Hortonworks introduced Apache Spark to the Hortonworks Data Platform as a technology preview download in May, and today announces the integration of Spark with YARN, with its recent acquisition XA Secure for authentication and data security purposes, and with Ambari, toward the larger goal of delivering an integrated, turnkey, enterprise-grade Hadoop platform. Thursday’s announcement by Hortonworks responds to similar statements by competitor MapR regarding the integration of Spark into its Hadoop distribution, and to Cloudera’s announcement of its enterprise-grade support for Apache Spark.

The following graphic illustrating the integration of Spark into YARN originated from the Hortonworks blog post Making Apache Spark YARN Ready.

Cascading 3.0 Adds Support For Wide Range Of Computational Frameworks And Data Fabrics

Today, Concurrent, Inc. announces the release of Cascading 3.0, the latest version of the popular open source framework for developing and managing Big Data applications. Widely recognized as the de facto framework for the development of Big Data applications on platforms such as Apache Hadoop, Cascading simplifies application development by means of an abstraction framework that facilitates the execution and orchestration of jobs and processes. Compatible with all major Hadoop distributions, Cascading sits squarely at the heart of the Big Data revolution by streamlining the operationalization of Big Data applications in conjunction with Driven, a commercial product from Concurrent that provides visibility regarding application performance within a Hadoop cluster.

Today’s announcement extends Cascading to platforms and computational frameworks such as local in-memory, Apache MapReduce and Apache Tez. Going forward, Concurrent plans for Cascading 3.0 to ship with support for Apache Spark, Apache Storm and other computational frameworks by means of its customizable query planner, which allows customers to extend the operation of Cascading to compatible computational fabrics as illustrated below:

The breakthrough represented by today’s announcement is that it renders Cascading extensible to a variety of computational frameworks and data fabrics and thereby expands the range of use cases and environments in which Cascading can be optimally used. Moreover, the customizable query planner featured in today’s release allows customers to configure their Cascading deployment to operate in conjunction with emerging technologies and data fabrics that can now be integrated into a Cascading deployment by means of the functionality represented in Cascading 3.0.

Used by companies such as Twitter, eBay, FourSquare, Etsy and The Climate Corporation, Cascading boasts over 150,000 applications a month, more than 7,000 deployments and 10% month-over-month growth in downloads. The release of Cascading 3.0 builds on Concurrent’s recent partnership with Hortonworks whereby Cascading will be integrated into the Hortonworks Data Platform and Hortonworks will certify and support the delivery of Cascading in conjunction with its Hadoop distribution. Concurrent, Inc. also recently revealed details of a strategic partnership with Databricks, the principal steward behind the Apache Spark project, that allows it to “operate over Spark…[the] next generation Big Data processing engine that supports batch, interactive and streaming workloads at scale.” In an interview with Cloud Computing Today, Concurrent CEO Gary Nakamura confirmed that Concurrent plans to negotiate partnerships analogous to the agreement with Hortonworks with other Hadoop distribution vendors in order to ensure that Cascading consolidates its positioning as the framework of choice for the development of Big Data applications. Overall, the release of Cascading 3.0 represents a critical product enhancement that positions Cascading to operate over a broader range of computational frameworks and consequently assert its relevance for Big Data application development in a variety of data and computational frameworks. More importantly, however, the product enhancement in Cascading 3.0, in conjunction with the partnership with Databricks regarding Apache Spark, suggests that Cascading is well on its way to becoming the universal framework of choice for developing and managing applications in a Big Data environment, particularly given its compatibility with a wide range of Hadoop distributions and data and computational frameworks.

MapR Announces Support For All Five Components Of Apache Spark In Its Hadoop Distribution

On Thursday, MapR Technologies announced that it will be adding Apache Spark to its Hadoop distribution by means of a partnership with Databricks, the principal steward behind Apache Spark. Apache Spark facilitates the development of big data applications that specialize in interactive analytics, real-time analytics, machine learning and stream processing. In contrast to MapReduce, Apache Spark provides a greater range of data operators such as “mappers, reducers, joins, group-bys, and filters” that permit the modeling of more complex data flows than are available simply via map and reduce operations. Moreover, because Spark stores the results of data operators in memory, it enables low latency computations and increased efficiencies on iterative calculations that operate on in-memory computational results. Spark is additionally known for its ability to automate the parallelization of jobs and tasks in ways that optimize performance and correspondingly relieve developers of the responsibility of sequencing the execution of jobs. Apache Spark can improve application performance by a factor of between 5 and 100 while its programming abstraction framework, which is based on immutable, distributed collections of data known as Resilient Distributed Datasets, reduces the amount of code required by 80%. MapR will support all five components of the Spark stack, namely, Shark, Spark Streaming, MLlib, GraphX and SparkR. The five components of Apache Spark illustrate the versatility of Apache Spark insofar as they can support applications that interface with streaming datasets, machine learning and graph-based applications, R and SQL. MapR’s decision to support the entire Spark stack diverges from its competitor Cloudera, which does not support Shark, the SQL on Hadoop component of Apache Spark that competes with Cloudera’s Impala product, as reported in GigaOM.
All told, today’s announcement represents a small but significant attempt by MapR to reclaim the relevance of its Hadoop distribution in the wake of Cloudera’s $900M funding announcement and the $100M in funding recently secured by Hortonworks. That said, we should expect MapR to follow suit with a similar capital raise soon, even though its CMO Jack Norris claims that “with 500 paid customers the company is profitable and able to continue being successful from its current position.”
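
To make the point about richer data operators and in-memory reuse more concrete, here is a minimal sketch in plain Python. It does not use Spark or its API. The helper names (filter_op, map_op, join_op, group_by_op) and the sample data are hypothetical and exist only to illustrate how a flow of filters, maps, joins and group-bys can be chained, and how an intermediate result held in memory can feed more than one downstream calculation without being recomputed.

```python
from collections import defaultdict

# Hypothetical sample data: (user_id, page, seconds_on_page)
visits = [
    (1, "home", 12),
    (2, "pricing", 45),
    (1, "pricing", 30),
    (3, "home", 5),
    (2, "docs", 60),
]
# Hypothetical lookup table: user_id -> subscription plan
plans = {1: "free", 2: "paid", 3: "free"}


def filter_op(rows, predicate):
    """Keep only the rows for which predicate(row) is true."""
    return [r for r in rows if predicate(r)]


def map_op(rows, fn):
    """Apply fn to every row."""
    return [fn(r) for r in rows]


def join_op(rows, lookup):
    """Inner join of (key, value) rows against a dict with the same keys."""
    return [(k, (v, lookup[k])) for k, v in rows if k in lookup]


def group_by_op(rows, key_fn, value_fn):
    """Group value_fn(row) under key_fn(row)."""
    groups = defaultdict(list)
    for r in rows:
        groups[key_fn(r)].append(value_fn(r))
    return dict(groups)


# filter -> map -> join: the joined result stays in memory as `cached`
long_visits = filter_op(visits, lambda r: r[2] >= 10)
keyed = map_op(long_visits, lambda r: (r[0], r[2]))
cached = join_op(keyed, plans)

# Two different analyses reuse `cached` instead of recomputing the chain.
seconds_by_plan = group_by_op(cached, lambda r: r[1][1], lambda r: r[1][0])
total_by_plan = {plan: sum(secs) for plan, secs in seconds_by_plan.items()}

per_user = group_by_op(cached, lambda r: r[0], lambda r: r[1][0])
visits_by_user = {user: len(secs) for user, secs in per_user.items()}

print(total_by_plan)
print(visits_by_user)
```

In a MapReduce-style pipeline, each of the two final analyses would typically be a separate job that rereads its input from disk. Keeping the joined result in memory is the property the paragraph above credits for Spark’s gains on iterative workloads.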

Cloudera Announces Enterprise-Grade Support For Apache Spark

Cloudera recently announced the general availability of Apache Spark for Cloudera Enterprise. First developed at UC Berkeley, Apache Spark is a parallel data processing framework that supplements Apache Hadoop by facilitating the development of big data applications related to machine learning, interactive analytics and real-time analytics. Spark allows users to write parallel code in Java, Scala and Python that runs on Hadoop clusters at speeds up to 100 times faster than MapReduce. Moreover, applications developed in Spark tend to require 2 to 10 times less code than a corresponding MapReduce application. Spark Streaming, an add-on to Spark, enables analytics to be run on streaming datasets such that developers can derive analytic insights within seconds of data ingestion. Cloudera will offer enterprise-grade support for Spark in partnership with Databricks, the primary sponsor of the open source Apache Spark project, via its Data Hub Edition and Cloudera Enterprise Flex Edition. This release features support for Spark 0.9.0 with CDH 4. Support for Cloudera Enterprise 5, with CDH 5 and YARN, will be forthcoming in subsequent releases. Spark contributes to the Cloudera platform as illustrated by the highlighted blocks in orange below:

Image Source: “Apache Spark — Welcome To The CDH Family”
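
As a rough illustration of how analytics on a stream can yield results shortly after ingestion, the following plain-Python sketch processes events in small batches and updates running totals after each one. It is not Spark Streaming code. Spark Streaming groups incoming data into small batches by time interval, whereas this sketch batches by event count so that the behavior is deterministic. The function names and the sample event stream are hypothetical.

```python
from collections import Counter
from itertools import islice


def micro_batches(events, batch_size):
    """Yield consecutive lists of up to batch_size events."""
    it = iter(events)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def running_counts(events, batch_size):
    """Return a snapshot of cumulative counts after each micro-batch."""
    totals = Counter()
    snapshots = []
    for batch in micro_batches(events, batch_size):
        totals.update(batch)            # fold the new batch into the totals
        snapshots.append(dict(totals))  # result available right after the batch
    return snapshots


# Hypothetical event stream of status codes
events = ["error", "ok", "ok", "error", "ok", "timeout", "ok"]
snapshots = running_counts(events, batch_size=3)

for i, snap in enumerate(snapshots, start=1):
    print(f"after batch {i}: {snap}")
```

Each snapshot is available as soon as its batch has been folded in, so the delay between ingestion and insight is bounded by the batch size, or by the batch interval in a time-based system.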