Concurrent and Hortonworks recently revealed a deepening of their strategic relationship whereby the Cascading SDK will now be integrated into the Hortonworks Data Platform. Moreover, Hortonworks will certify, deliver and support Cascading, the application framework for developing Hadoop-based applications. A Java-based, open source alternative to writing MapReduce jobs directly, Cascading provides developers with a framework for constructing complex, repeatable data processing tasks within a Hadoop cluster. Cascading features an abstraction layer that uses plumbing metaphors such as taps, pipes, data flows, cascades and sinks to allow developers to design, visualize and execute jobs and processes on Hadoop-based data without having to master the intricacies of MapReduce. Forthcoming releases of Cascading will support Apache Tez, an initiative that builds on the addition of YARN to Hadoop and allows Hadoop-based data to “meet demands for fast response times and extreme throughput at petabyte scale.” The partnership between Concurrent, the developer of Cascading, and Hortonworks represents a huge coup for Concurrent given that the collaboration stands to rapidly accelerate Cascading’s adoption in enterprise environments. Hortonworks, meanwhile, benefits from packaging its Hadoop distribution with Cascading, one of the industry’s most respected frameworks for Big Data management and application development, which boasts enterprise users such as Twitter, LinkedIn, eBay and Nokia. The obvious question now is whether Concurrent will finalize similar partnerships with other Hadoop vendors such as Cloudera and MapR, or whether its partnership with Hortonworks will enable the latter to improve its positioning in the battle for Hadoop market share, particularly in light of Cloudera’s remarkable $900M capital raise and partnership with Intel.
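To make the plumbing metaphor concrete, the following is a minimal sketch in plain Python rather than Cascading's actual Java API. The names tap, pipe, sink and run_flow are hypothetical and simply mirror the vocabulary described above; an in-memory list stands in for data stored on Hadoop, and nothing here is planned into MapReduce jobs the way a real Cascading flow would be.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Iterator


def tap(lines: Iterable[str]) -> Iterator[str]:
    """Source tap: yields raw records (a list stands in for HDFS here)."""
    yield from lines


def pipe(*stages: Callable[[Iterator], Iterator]) -> Callable[[Iterator], Iterator]:
    """Chains processing stages so the output of one feeds the next."""
    def run(records: Iterator) -> Iterator:
        for stage in stages:
            records = stage(records)
        return records
    return run


def split_words(records: Iterator[str]) -> Iterator[str]:
    """Stage: break each line into lowercase words."""
    for line in records:
        yield from line.lower().split()


def drop_short(records: Iterator[str]) -> Iterator[str]:
    """Stage: discard words of two characters or fewer."""
    return (word for word in records if len(word) > 2)


def sink(records: Iterator[str]) -> Dict[str, int]:
    """Sink: collect the stream into a word-count dictionary."""
    return dict(Counter(records))


def run_flow(lines: Iterable[str]) -> Dict[str, int]:
    """A flow connects a tap, a pipe assembly and a sink."""
    return sink(pipe(split_words, drop_short)(tap(lines)))


result = run_flow(["Hadoop and Cascading", "Cascading on Hadoop"])
print(result)
```

The point of the sketch is that the developer describes what each stage does and how stages connect, while the framework decides how the work is executed.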
On Thursday, MapR Technologies announced that it will add Apache Spark to its Hadoop distribution by means of a partnership with Databricks, the principal steward of Apache Spark. Apache Spark facilitates the development of Big Data applications that specialize in interactive analytics, real-time analytics, machine learning and stream processing. In contrast to MapReduce, Apache Spark provides a greater range of data operators such as “mappers, reducers, joins, group-bys, and filters” that permit the modeling of more complex data flows than are possible with map and reduce operations alone. Moreover, because Spark stores the results of data operators in memory, it enables low-latency computations and greater efficiency for iterative calculations that reuse those in-memory results. Spark is additionally known for its ability to automate the parallelization of jobs and tasks in ways that optimize performance and relieve developers of the responsibility of sequencing the execution of jobs. Apache Spark is credited with improving application performance by a factor of between 5 and 100, while its programming abstraction, which is based on immutable distributed collections of data known as Resilient Distributed Datasets, is said to reduce the amount of code required by 80%. MapR will support all five components of the Spark stack, namely Shark, Spark Streaming, MLlib, GraphX and SparkR. Together, these components illustrate the versatility of Apache Spark insofar as they support SQL queries, streaming datasets, machine learning, graph-based applications and R. MapR’s decision to support the entire Spark stack diverges from the approach of its competitor Cloudera, which does not support Shark, the SQL on Hadoop component of Apache Spark that competes with Cloudera’s Impala product, as reported in GigaOM.
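The idea of chaining operators over an immutable dataset and keeping an intermediate result in memory can be illustrated with a toy, single-machine sketch. The TinyDataset class below is hypothetical and is not Spark's API; it only mimics, under simplifying assumptions, how transformations return new datasets and how a cached result is computed once and then reused by later calculations.

```python
import functools
from typing import Callable, Iterable, Iterator, List, Optional


class TinyDataset:
    """Lazily evaluated dataset; a toy stand-in for a Resilient Distributed Dataset."""

    def __init__(self, compute: Callable[[], Iterable]):
        self._compute = compute
        self._use_cache = False
        self._cached: Optional[List] = None
        self.evaluations = 0  # how many times this dataset was actually computed

    def map(self, fn: Callable) -> "TinyDataset":
        # Transformations never modify the parent; they return a new dataset.
        return TinyDataset(lambda: (fn(x) for x in self._iterate()))

    def filter(self, pred: Callable) -> "TinyDataset":
        return TinyDataset(lambda: (x for x in self._iterate() if pred(x)))

    def cache(self) -> "TinyDataset":
        # Ask for the result to be kept in memory after the first computation.
        self._use_cache = True
        return self

    def _iterate(self) -> Iterator:
        if self._use_cache:
            if self._cached is None:
                self.evaluations += 1
                self._cached = list(self._compute())
            return iter(self._cached)
        self.evaluations += 1
        return iter(self._compute())

    def reduce(self, fn: Callable):
        # An action: triggers evaluation of the chain of transformations.
        return functools.reduce(fn, self._iterate())


numbers = TinyDataset(lambda: range(1, 11))
squares = numbers.map(lambda x: x * x).cache()

# Two separate calculations reuse the same in-memory squares.
even_sum = squares.filter(lambda x: x % 2 == 0).reduce(lambda a, b: a + b)
total = squares.reduce(lambda a, b: a + b)
print(even_sum, total, squares.evaluations)
```

Without the call to cache(), each of the two calculations would recompute the squares from the source, which is the cost that in-memory storage avoids on iterative workloads.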
All told, today’s announcement represents a small but significant attempt by MapR to reclaim the relevance of its Hadoop distribution in the wake of Cloudera’s $900M funding announcement and the $100M in funding recently secured by Hortonworks. That said, we should expect MapR to follow suit with a similar capital raise soon, even though its CMO Jack Norris claims that “with 500 paid customers the company is profitable and able to continue being successful from its current position.”
Not to be outdone by the slew of product and price announcements from Google, Amazon Web Services and Microsoft over the past week, EMC-VMware spinoff Pivotal announced a new product offering branded the Pivotal Big Data Suite on Wednesday. The platform delivers Pivotal Greenplum Database, Pivotal GemFire, Pivotal SQLFire, Pivotal GemFire XD and Pivotal HAWQ, in addition to unlimited use of Pivotal’s Hadoop distribution Pivotal HD. Because the Pivotal Big Data Suite is priced on the basis of an annual subscription for all software and services, in addition to per core pricing for computing resources, customers need not fear additional fees related to software licensing or customer support above and beyond the subscription price. Moreover, customers essentially have access to a commercial-grade Hadoop distribution for free as part of the subscription price. Pivotal compares the Big Data Suite to a “swiss army knife for Big Data” that enables customers to “use whatever tool is right for your problem, for the same price.” Customers have access to products such as Greenplum’s massively parallel processing (MPP) architecture-based data warehouse, GemFire XD’s in-memory distributed Big Data store for real-time analytics with a low-latency SQL interface, and HAWQ’s SQL-querying ability for Hadoop. As a whole, the Pivotal Big Data Suite edges towards the realization of Pivotal One, an integrated solution that performs Big Data management and analytics for ecosystems of applications, real-time data feeds and devices that can serve the data needs of the internet of things, amongst other use cases. More importantly, the Pivotal Big Data Suite represents the most systematic attempt to productize Big Data solutions in the industry at large, even if it is composed of an assemblage of heterogeneous products under one roof.
The suite combines access to a commercial-grade Hadoop distribution (Pivotal HD), a data warehouse designed to store petabytes of data (Pivotal Greenplum) and closed-loop real-time analytics solutions (Pivotal GemFire XD) within a unified product offering available via an annual subscription and per core pricing. That combination constitutes an offer not easy to refuse for anyone seriously interested in exploring the capabilities of Big Data. The bottom line is that Pivotal continues to push the envelope with respect to Big Data technologies, although it now stands to face the challenge posed by cash-flush Cloudera, which recently finalized $900M in funding and a strategic and financial partnership with Intel.
In a stunning move that is likely to shape the Big Data space for years, Intel recently decided to partner with Cloudera to support its Hadoop distribution rather than enhancing Intel’s own Hadoop distribution. Cloudera will optimize its Hadoop distribution (CDH) to work with Intel’s hardware technology and Intel, conversely, will promote CDH as the Hadoop distribution of choice for enterprise Big Data analytics and the internet of things. Meanwhile, Intel will contribute insights from its own Hadoop distribution to CDH, and the resulting integration will be made available as part of Cloudera’s open source Hadoop initiatives. The partnership between Intel and Cloudera also featured an equity investment by Intel of between $740M and $760M that translates into an 18% ownership stake in Cloudera. Combined with the $160M that Cloudera raised in mid-March, the $740M invested by Intel brings Cloudera’s recent funding raises to roughly $900M. Intel will join Cloudera’s board of directors and become “Cloudera’s largest strategic shareholder.” According to its press release, Intel’s investment in Cloudera represents Intel’s “single largest data center technology investment in its history.” Intel’s strong presence in countries such as India and China, where Cloudera has thus far failed to gain traction, means that the partnership stands to dramatically expand Cloudera’s global market share. More importantly, however, Intel’s deep integration with the technologies in almost every datacenter worldwide renders it a formidable ally for Cloudera in fulfilling its aspiration of becoming the leading Hadoop distribution in the world, in ways that promise to transform computing hardware as well as the Hadoop distributions that integrate with Intel’s Xeon technology.
Building upon its November announcement regarding Zettaset Orchestrator’s support for the encryption of Hadoop data at rest, Zettaset today announced the Orchestrator platform’s support for the encryption of data in motion. The addition of encryption in motion functionality to the Zettaset platform enables encryption of connections between nodes within a Hadoop cluster, all interfaces to the Orchestrator management console, connectors to business intelligence platforms and all communication links more generally. Zettaset Orchestrator’s support of data-in-motion encryption positions the platform to provide encryption to cloud-based Hadoop deployments on platforms such as Amazon Web Services Elastic MapReduce (EMR), or Hadoop as a Service solutions offered by vendors such as Qubole and Xplenty.
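For readers unfamiliar with what encrypting data in motion involves at the connection level, here is a minimal sketch using Python's standard ssl module. It is a generic illustration of a TLS client configuration and makes no claim about how Zettaset Orchestrator is implemented; the function name make_client_context is hypothetical.

```python
import ssl


def make_client_context() -> ssl.SSLContext:
    """Builds a client-side TLS context that verifies the peer it talks to."""
    # The default context requires a valid certificate and checks the hostname.
    ctx = ssl.create_default_context()
    # Refuse protocol versions older than TLS 1.2.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx


context = make_client_context()
print(context.verify_mode == ssl.CERT_REQUIRED, context.check_hostname)
```

A client would then pass an ordinary socket to context.wrap_socket(sock, server_hostname=host) so that everything sent over the link between two nodes, or between a node and a management console, is encrypted in transit.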
Zettaset delivers an enterprise-grade Big Data management platform that specializes in security, high availability and performance.
The platform supports high availability by means of automated failover services. Moreover, Zettaset Orchestrator offers activity monitoring for compliance and auditing purposes, role-based access control for HiveServer2 and HDFS, and integration with Active Directory and LDAP, as revealed by CEO Jim Vogt in an interview with Cloud Computing Today. Compatible with all major Hadoop distributions, Zettaset aims to deliver encryption as part of a broader security package that also features identity management and access control in ways that facilitate compliance with regulatory frameworks such as HIPAA and PCI. Today’s announcement about the platform’s support for data-in-motion encryption positions the Mountain View-based company to compete in the hotly contested cloud encryption space. Unlike cloud encryption vendors such as CipherCloud and Vaultive, however, Zettaset combines commitments to high availability and integrated product security, which renders it unique within the Hadoop management and security space. As more and more enterprises tackle the challenges of operationalizing Big Data, expect Zettaset’s data-at-rest and data-in-motion encryption functionality to build on its early traction within the healthcare and financial services verticals as customers increasingly seek a turnkey Big Data management platform that manages Hadoop encryption, access, compliance reporting and availability.
Neo Technology recently announced that retail giants such as eBay and Walmart are using graph database Neo4j in production-grade applications that improve their operations and marketing analytics. In a recently published case study, Neo Technology revealed how eBay’s e-commerce technology platform acquisition, Shutl, leverages Neo4j to expedite delivery to the point where customers can enjoy same day delivery in select cases. Shutl constitutes the technology platform that undergirds eBay Now, a service that delivers products in 1-2 hours from local stores by means of relationships between couriers and stores. eBay decided to make the transition from MySQL to Neo4j because:
Its previous MySQL solution was too slow and complex to maintain, and the queries used to calculate the best route additionally took too long. The eBay development team knew that a graph database could be added to the existing SOA and services structure to solve the performance and scalability challenges. The team turned to Neo4j as the best possible solution on the market.
According to Volker Pacher, Senior Developer at eBay, eBay found that Neo4j enabled dramatic improvements in its computational and querying ability:
We found Neo4j to be literally thousands of times faster than our prior MySQL solution, with queries that require 10-100 times less code. Today, Neo4j provides eBay with functionality that was previously impossible.
eBay’s current e-commerce technology platform leverages Ruby, Sinatra, MongoDB, and Neo4j. Importantly, queries “remain localized to their respective portions on the graph” in order to ensure scalability and performance. Walmart, meanwhile, uses Neo4j to understand the online habits of its shoppers in order to deliver more relevant real-time product recommendations to them. Neo4j’s adoption by eBay and Walmart illustrates how graph databases are disrupting the nature of real-time analytics, a trend further underscored by Pivotal HD 2.0’s integration of GraphLab into its offerings, and the use of graphing technologies by startups such as Aorato.
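The route calculation described in the case study is a natural fit for a graph because the query only walks the relationships it needs rather than joining whole tables. The following plain Python sketch shows the general idea with Dijkstra's algorithm over a small, invented graph. The locations, travel times and the fastest_route function are all hypothetical and are not taken from eBay's or Neo4j's implementation.

```python
import heapq
from typing import Dict, List, Tuple

# Adjacency map: node -> {neighbor: travel time in minutes}. Invented data.
Graph = Dict[str, Dict[str, int]]

DELIVERY_GRAPH: Graph = {
    "courier": {"store": 7, "high_street": 4},
    "high_street": {"store": 2, "customer": 12},
    "store": {"customer": 5},
    "customer": {},
}


def fastest_route(graph: Graph, start: str, goal: str) -> Tuple[int, List[str]]:
    """Returns (total minutes, path) for the quickest route, using Dijkstra's algorithm."""
    queue: List[Tuple[int, str, List[str]]] = [(0, start, [start])]
    best: Dict[str, int] = {}
    while queue:
        minutes, node, path = heapq.heappop(queue)
        if node == goal:
            return minutes, path
        if node in best and best[node] <= minutes:
            continue  # already reached this node at least as quickly
        best[node] = minutes
        # Only the edges of the current node are examined, so the work stays local.
        for neighbor, cost in graph.get(node, {}).items():
            heapq.heappush(queue, (minutes + cost, neighbor, path + [neighbor]))
    raise ValueError(f"no route from {start} to {goal}")


minutes, path = fastest_route(DELIVERY_GRAPH, "courier", "customer")
print(minutes, path)
```

A graph database applies the same principle at scale: traversals follow stored relationships directly, which is one reason such queries can be both shorter and faster than their relational equivalents.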
This week, EMC and VMware spinoff Pivotal announced the availability of Pivotal HD 2.0, a commercial distribution of Apache Hadoop that now features support for Apache Hadoop 2.2. Moreover, Pivotal also revealed the general availability of Pivotal GemFire XD, a SQL-compliant, in-memory database designed for real-time analytics for Big Data processing. In its initial release, Pivotal GemFire XD represents an in-memory distributed data store that “provides a low-latency SQL interface to in-memory table data, while seamlessly integrating data that is persisted in HDFS.” Because GemFire XD brings the power of real-time analytics to Hadoop, it empowers mobile providers to run complex algorithms on incoming calls to route each call appropriately, or geospatial navigation systems to alter suggested routes based on incoming data about traffic and weather conditions. Like Apache Spark, a parallel data processing framework that facilitates real-time analytics on Hadoop, GemFire XD enables real-time Big Data analytics but is explicitly designed for data environments with high demands for scalability and availability. Michael Cucchi, Pivotal’s senior director of product marketing, commented on Pivotal’s interest in Spark and GemFire XD in an interview with InformationWeek as follows:
We’re excited about Spark and will support it, but it’s generally used for [data] ingest or caching. GemFire XD is an ANSI-compliant SQL database with high-availability features, and it can run over wide-area networks, so you can have an instance in Europe and another in North America with replication.
Built on the vFabric SQLFire product that belongs to the category of NewSQL databases noted for high performance and scalability, GemFire XD adds features such as HDFS persistence and off-heap memory storage for table data. In addition to GemFire XD, Pivotal HD 2.0 also features an integration with GraphLab for graphing analytics as well as enhancements to HAWQ such as support for MADlib, R, Python, Java, and Parquet. Overall, Pivotal HD 2.0 represents a notable advancement over Pivotal HD 1.1 that brings the power of YARN, real-time analytics via GemFire XD and graphing technology to Hadoop and Big Data processing and analytics. With Pivotal HD 2.0 released less than 6 months after the November 1, 2013 release of Pivotal HD 1.1, Pivotal promises to innovate in the Big Data space at the same dizzying rate with which Amazon Web Services innovates with regard to cloud computing technologies and platforms. Expect to hear more about the conjunction of real-time analytics and graphing technologies on Hadoop via Pivotal HD 2.0 as customer use cases proliferate and circulate throughout the Big Data space.
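As a rough illustration of what a SQL interface to in-memory table data looks like in practice, the sketch below uses SQLite's in-memory mode from the Python standard library. SQLite is only a stand-in: it is a single-process database with none of GemFire XD's distribution, replication or HDFS persistence, and the calls table and its contents are invented to echo the call-routing example mentioned earlier.

```python
import sqlite3

# ":memory:" keeps the whole database in RAM for the life of the connection.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE calls (caller TEXT, region TEXT, wait_seconds INTEGER)"
)
conn.executemany(
    "INSERT INTO calls VALUES (?, ?, ?)",
    [
        ("alice", "europe", 30),
        ("bob", "europe", 50),
        ("carol", "north_america", 20),
        ("dave", "north_america", 40),
    ],
)

# Pick the region with the shortest average wait as the routing target.
least_loaded = conn.execute(
    "SELECT region, AVG(wait_seconds) AS avg_wait "
    "FROM calls GROUP BY region ORDER BY avg_wait LIMIT 1"
).fetchone()
print(least_loaded)
conn.close()
```

The appeal of the approach is that routing logic can be expressed as an ordinary SQL query while the data it reads never leaves memory.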