Trifacta today announced the close of a $12M Series B funding round led by Greylock Partners and Accel Partners. Trifacta, a data transformation provider, delivers a solution that helps customers standardize and transform data into a form that maximizes the actionable business intelligence that can be derived from it. More importantly, the solution features a user experience that enables analysts and data scientists to leverage machine learning to transform data as desired while minimizing the writing of scripts and the execution of complex operations on datasets. The Trifacta platform strives to deliver a highly intuitive user interface for transforming data, one that enables business analysts to engage data in sophisticated ways while concurrently improving the productivity of data scientists. As such, the solution tackles two often-overlooked problems in the data analytics space: the transformation of data into a usable form, and the productivity of analysts engaged in the work of that transformation. Trifacta’s data visualization functionality and algorithmic, machine learning-based mapping of user interactions with data work to understand, recommend and optimize workflows for interacting with data. In an interview with Cloud Computing Today, Trifacta CEO Joe Hellerstein noted that the solution is currently in public Beta amongst customers from a wide range of verticals, all of whom face the same problem of transforming their data into a form that facilitates the development of nuanced, actionable analytics and visualizations. Today’s round brings the total capital raised by Trifacta to $16.3M.
Concurrent Announces Release Of Cascading 2.5 and Lingual 1.0 To Simplify Application Development Using Hadoop
Today, Concurrent announces the release of Cascading 2.5, the open source framework for facilitating the development of applications on Apache Hadoop. Cascading 2.5 supports the recently released Hadoop 2.0 distribution, including YARN and its other features. Cascading users interested in upgrading to Hadoop 2.0 can do so by means of Cascading 2.5. Similarly, applications that leverage the Scalding, Cascalog and PyCascading languages can migrate to Hadoop 2.0 by means of the Cascading 2.5 framework. The latest release of Cascading also features “complex join operations and optimizations to dynamically partition and store processed data more efficiently on HDFS,” according to Concurrent’s press release. Finally, the release deepens Cascading’s compatibility with other Hadoop distributions and Hadoop-as-a-Service vendors such as Cloudera, Hortonworks, MapR, Intel, Altiscale, Qubole and Amazon EMR.
Cascading 2.5 represents one of the few products in either the commercial or open source ecosystem for simplifying the development of Hadoop applications while integrating with a rich and varied ecosystem of products as illustrated below:
The graphic shows how Cascading 2.5 supports all major Hadoop distributions in addition to an impressive list of development languages, database platforms and cloud platforms. In an interview with Cloud Computing Today, Concurrent CEO Gary Nakamura and CTO Chris Wensel noted the uniqueness of Cascading in the Big Data landscape, particularly given its iterative refinement in collaboration with the likes of Twitter, eBay and The Climate Corporation over a period of more than five years.
Today’s announcement regarding the general availability of Cascading 2.5 is accompanied by news of the general availability of Lingual, an ANSI-compliant SQL interface that allows developers to use SQL commands to query data stored in Hadoop clusters. Unlike Apache’s Hive project, Lingual’s ANSI-standard SQL interface enables developers to deploy authentic SQL commands as opposed to Hive’s SQL-like syntax. Cascading Lingual also allows for the migration of legacy SQL workloads onto Hadoop clusters, the export of Hadoop data to BI tools such as Jaspersoft, Pentaho and Talend, and the ability to leverage the power of Cascading in conjunction with SQL to orchestrate the execution of multiple SQL queries instead of several discrete, disparate queries. The Big Data space should expect more from Concurrent as it continues to build out tools for simplifying application development on Hadoop, particularly as more and more Hadoop developers come to terms with Cascading’s advantages over raw MapReduce.
As a follow-up to our post on Facebook’s use of Apache Giraph, I wanted to return to Pregel, the graph processing technology on which Giraph was based. Alongside MapReduce, Pregel is used by Google to mine relationships between richly associative data sets in which the data points have multi-valent, highly dynamic relationships that morph, proliferate, aggregate, disperse, emerge and vanish with a velocity that renders any schema-based data model untenable. In a well-known blog post, Grzegorz Czajkowski of Google’s Systems Infrastructure Team elaborated on the importance of graph theory and Pregel’s structure as follows:
Despite differences in structure and origin, many graphs out there have two things in common: each of them keeps growing in size, and there is a seemingly endless number of facts and details people would like to know about each one. Take, for example, geographic locations. A relatively simple analysis of a standard map (a graph!) can provide the shortest route between two cities. But progressively more sophisticated analysis could be applied to richer information such as speed limits, expected traffic jams, roadworks and even weather conditions. In addition to the shortest route, measured as sheer distance, you could learn about the most scenic route, or the most fuel-efficient one, or the one which has the most rest areas. All these options, and more, can all be extracted from the graph and made useful — provided you have the right tools and inputs. The web graph is similar. The web contains billions of documents, and that number increases daily. To help you find what you need from that vast amount of information, Google extracts more than 200 signals from the web graph, ranging from the language of a webpage to the number and quality of other pages pointing to it.
In order to achieve that, we have created scalable infrastructure, named Pregel, to mine a wide range of graphs. In Pregel, programs are expressed as a sequence of iterations. In each iteration, a vertex can, independently of other vertices, receive messages sent to it in the previous iteration, send messages to other vertices, modify its own and its outgoing edges’ states, and mutate the graph’s topology (experts in parallel processing will recognize that the Bulk Synchronous Parallel Model inspired Pregel).
The key point worth noting here is that Pregel computation is marked by a “sequence of iterations” whereby the relationship between vertices is iteratively refined and recalibrated with each computation. In other words, Pregel computation begins with an input step, followed by a series of supersteps that successively lead to the algorithm’s termination and finally, an output. During each of the supersteps, the vertices send and receive messages to and from other vertices in parallel. The algorithm terminates when the vertices collectively stop transmitting messages to each other, or, in Pregel’s lexicon, vote to halt. As Malewicz, Czajkowski et al. note in their paper on Pregel, “The algorithm as a whole terminates when all vertices are simultaneously inactive and there are no messages in transit.” Like Pregel, Apache Giraph uses a computation structure whereby computation proceeds iteratively until the relationships between vertices in a graph stabilize.
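The superstep mechanics described above can be sketched in a few lines of Python. The following is a toy, single-process simulation (not Google’s distributed implementation) of maximum-value propagation, an example used in the Pregel paper: in each superstep a vertex absorbs its incoming messages, and only if its value changes does it message its neighbors; otherwise it votes to halt, and the run ends when every vertex is inactive and no messages are in transit.

```python
def pregel_max(graph, values):
    """Toy sketch of Pregel-style vertex computation: propagate the
    maximum vertex value through the graph. `graph` maps each vertex to
    its outgoing neighbors; `values` maps each vertex to its initial
    integer value. Returns the final values and the superstep count."""
    values = dict(values)
    # Superstep 0: every vertex is active and sends its value to its neighbors.
    inbox = {v: [] for v in graph}
    for v in graph:
        for n in graph[v]:
            inbox[n].append(values[v])
    supersteps = 1
    # Terminate when all vertices are inactive (sent nothing) and
    # no messages remain in transit.
    while any(inbox.values()):
        outbox = {v: [] for v in graph}
        for v, messages in inbox.items():
            if messages and max(messages) > values[v]:
                values[v] = max(messages)       # state changed: stay active
                for n in graph[v]:
                    outbox[n].append(values[v])
            # else: the vertex votes to halt by sending no messages
        inbox = outbox
        supersteps += 1
    return values, supersteps

ring = {'a': ['b'], 'b': ['c'], 'c': ['a']}
print(pregel_max(ring, {'a': 3, 'b': 6, 'c': 2}))
# → ({'a': 6, 'b': 6, 'c': 6}, 4)
```

In a real Pregel or Giraph deployment the vertices are partitioned across machines and the supersteps are synchronized by barriers, per the Bulk Synchronous Parallel model; the halting condition, however, is exactly the one simulated here.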
Today, Concurrent Inc. announces the release of Pattern, an open source tool designed to enable developers to build machine-learning applications on Hadoop by leveraging the Predictive Model Markup Language (PMML), the standard export format for popular predictive modeling tools such as R, MicroStrategy and SAS. Data scientists can use Pattern to export their predictive models to Hadoop clusters and thereby run them against massive data sets. Pattern simplifies the process of building predictive models that operate on Hadoop clusters and lowers the barrier to the adoption of Apache Hadoop for advanced data mining and modeling use cases.
An example of a use case for Pattern includes evaluating the efficacy of models for a “predictive marketing intelligence solution” as illustrated below by Antony Arokiasamy, Senior Software Architect at AgilOne:
Pattern facilitates AgilOne to deploy a variety of advanced machine-learning algorithms for our cloud-based predictive marketing intelligence solution. As a self-service SaaS offering, Pattern allows us to evaluate multiple models and push the clients’ best models into our high performance scoring system. The PMML interface allows our advanced clients to deploy custom models.
Here, Arokiasamy remarks on the way in which Pattern facilitates the scoring of predictive models, enabling the selection of the best-performing model from among several. AgilOne uses Pattern to run multiple predictive models in parallel against large data sets and additionally illustrates the efficacy of Pattern’s operation on a Hadoop cluster deployed in a cloud-based environment.
Pattern runs on the popular Cascading framework for simplifying application development on Hadoop that is used by the likes of Twitter, eBay, Etsy and Razorfish. A free, open source application, Pattern constitutes yet another pillar in Concurrent’s array of applications for streamlining the use of Apache Hadoop, alongside Cascading and Lingual, the ANSI-standard interface that enables developers to leverage SQL to query Hadoop clusters without having to learn MapReduce. The release of Pattern consolidates Concurrent’s positioning as a pioneer in the Big Data management space given its thought leadership in designing applications that facilitate enterprise adoption of Hadoop. Enterprises can now use Concurrent’s Cascading framework to operate on Hadoop clusters using Java APIs, SQL and predictive models exported from PMML-compatible analytics applications.
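To make the PMML hand-off concrete, the sketch below scores a record against a minimal linear-regression PMML document of the kind tools like R or SAS can export. This is a simplified, hypothetical illustration in pure Python — real PMML documents add namespaces, a DataDictionary and a MiningSchema, and Pattern itself translates PMML models into Cascading flows that execute across a Hadoop cluster rather than scoring them locally like this.

```python
import xml.etree.ElementTree as ET

# A stripped-down PMML fragment for a linear regression model
# (illustrative only; real exports include namespaces and schemas).
PMML = """
<PMML version="4.1">
  <RegressionModel functionName="regression">
    <RegressionTable intercept="1.5">
      <NumericPredictor name="age" coefficient="0.2"/>
      <NumericPredictor name="income" coefficient="0.001"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""

def score(pmml_text, record):
    """Apply the model's regression table to one record (a dict of fields)."""
    table = ET.fromstring(pmml_text).find(".//RegressionTable")
    result = float(table.get("intercept"))
    for predictor in table.findall("NumericPredictor"):
        result += float(predictor.get("coefficient")) * record[predictor.get("name")]
    return result

print(score(PMML, {"age": 30, "income": 50000}))  # 1.5 + 0.2*30 + 0.001*50000 = 57.5
```

Because the model travels as a declarative XML document rather than as code, the same exported file can be scored by R on a laptop or by Pattern against billions of records on a Hadoop cluster — which is precisely the portability Arokiasamy describes.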
Cloudmeter today announces the general availability of Cloudmeter Stream, a non-invasive platform that enables customers to transform streams of network data into actionable business intelligence. Cloudmeter also announces the early access availability of Cloudmeter Insight, a SaaS application that integrates back-end network analytics with front-end marketing analytics to deliver integrated data regarding user experiences of application platforms. Together, Cloudmeter Stream and Cloudmeter Insight expand the purview of Big Data analytics to network data and enable customers to obtain a 360 degree view of user interactions with their products. Both Cloudmeter Stream and Cloudmeter Insight allow access to network data without risk of physical disruption to network infrastructures.
Cloudmeter’s analytics represent an extension of the DevOps movement by allowing operations to more effectively understand the impact of IT infrastructure on end-user experiences. Application owners can use Cloudmeter to effectively configure business rules to determine which network data attributes constitute fields of interest. For example, customers can create business rules that identify session errors, network traffic on specific servers or data regarding the elapsed time between specific interactions with the platform. Users create business rules and manage the application more generally using an intuitive user interface featuring screens such as the following:
Cloudmeter CEO Mike Dickey remarked on the innovation represented by the platform for capturing network data by noting:
Our new data capture technology is a culmination of many years of experience building network-based data capture products. It enables customers to gain real time access into the wealth of business and IT information without the need to connect to physical network infrastructure, and without introducing risk to production systems or application performance.
Dickey underscores how Cloudmeter’s technology brings the Big Data revolution to network data and concomitantly empowers customers to access “business and IT information” in ways that have the potential to transform both their marketing platforms and their IT infrastructure design. In an interview with Cloud Computing Today, Cloudmeter’s COO Ronit Belson remarked that, rather than falling into the category of DevOps products, the company’s platform more appropriately represents a disruptive innovation in the MarkOps space, defined by the integration of marketing-related front-end application design with the Operations-related design of a platform’s IT infrastructure. Cloudmeter Stream integrates with Big Data platforms such as Splunk and GoodData, allowing users to integrate petabytes of machine data with data selectively culled from the business rules specific to Cloudmeter’s user interface.
Cloudmeter Stream is complemented by Cloudmeter Insight, a SaaS application that transforms data captured by Cloudmeter Stream into visual representations that allow application owners to comprehensively understand end-user experiences of an application as represented below:
Cloudmeter Insight leverages widgets to allow users to customize reports and dashboards of their choosing. The result is an integrated view of an application’s back-end and front-end user experience that allows application owners to obtain a truly holistic picture of user experiences with their platforms. Today’s announcements point toward two exciting new releases in the application performance management space as Big Data begins to live up to its potential of delivering 360 degree views of user experiences with technology platforms. Cloudmeter’s customer base includes Netflix, SAP, Saks Fifth Avenue and 1-800-Flowers.
Today, Concurrent Inc. announced the finalization of $4 million in Series A funding led by True Ventures and Rembrandt Venture Partners. The investment is intended to accelerate product development and expand the core team as part of the company’s larger project of simplifying application development within the Hadoop space. In conjunction with news of the funding, Concurrent also announced the appointment of Gary Nakamura as CEO. Nakamura comes to Concurrent following an illustrious tenure at Terracotta, where he served as Senior Vice President and General Manager and as VP of World Wide Sales & Field Operations. Chris Wensel, Concurrent’s Founder and former CEO, will assume the role of CTO. Concurrent’s $4 million in Series A funding builds upon an initial seed investment of $900,000 in August 2011 that was similarly financed by True Ventures and Rembrandt Venture Partners. The Series A funding points to the success of Concurrent’s Cascading 2.1 platform for simplifying application development and management on Hadoop clusters.
Cascading delivers a framework that empowers developers to use Java to develop applications that run on Hadoop without writing raw MapReduce code. Used by the likes of Twitter, eBay and The Climate Corporation, Cascading joins forces with Concurrent’s Lingual platform, which provides a SQL interface for operating on Hadoop, in a concerted initiative to democratize developer access to Hadoop. In an interview with Cloud Computing Today, CEO Gary Nakamura noted that Concurrent intends to build on its initial momentum by delivering platforms that simplify and streamline application development on Hadoop, as opposed to releasing a Hadoop distribution in the vein of Intel, EMC and others.
Concurrent already boasts partnerships with the likes of Amazon Web Services and Microsoft Azure for managing application development and management within Hadoop infrastructures. Its Cascading framework is compatible with all Apache Hadoop distributions and claims more than 75,000 downloads per month. Given Concurrent’s notable accomplishments with modest funding to date, the company is likely to expand its footprint in the space dedicated to simplifying Hadoop application development as a result of its new funding and CEO Gary Nakamura’s deep experience with enterprise software. As Hadoop distributions proliferate, expect to see demand for simplified Hadoop development and management products skyrocket within the enterprise. Enterprise concerns about data security and consistency of application lifecycle management are additionally likely to fuel the demand for Hadoop management platforms, particularly given the increasing convergence between Big Data and cloud-based infrastructures.
Big Data management vendor Zettaset recently announced support for Intel’s distribution of Hadoop. Zettaset’s support of Intel’s Hadoop distribution means that its Zettaset Orchestrator platform for simplifying and streamlining Hadoop deployments can be deployed on Intel’s open source Hadoop distribution, which Intel has optimized for its Xeon processor platform. Zettaset CEO and President Jim Vogt remarked on the company’s collaboration with Intel by noting:
Intel has worked diligently with their partners to ensure compatibility and deliver a robust, high performance Big Data solution for the enterprise. We are excited to be included in Intel’s growing Big Data ecosystem and look forward to helping our joint customers to easily install, manage and secure their Intel-powered Hadoop deployments.
The partnership means that Intel Hadoop customers have the opportunity to leverage Zettaset’s suite of Hadoop management tools, which address security policy, compliance and access control in an effort to facilitate the construction of enterprise-grade Hadoop clusters. Zettaset is designed to support any Apache Hadoop distribution and environment.
This week, EMC launched its own distribution of Hadoop under the branding Pivotal HD. Built on technology that EMC obtained through the acquisition of Greenplum in July 2010, Pivotal HD represents EMC’s next iteration on the Greenplum Unified Analytics Platform (UAP) that it launched in December 2011. The Greenplum UAP featured EMC Greenplum HD, an enterprise-grade distribution of Hadoop, alongside Greenplum’s database for structured data. The Greenplum UAP also included Greenplum Chorus, an innovative platform for collaboration amongst data scientists in an organization leveraging Big Data. Pivotal HD, however, marks a significant new chapter in EMC’s Hadoop technology as indicated by its array of features and architectural complexity.
Like many recent Hadoop distributions and technologies, Pivotal HD integrates with SQL to facilitate its use by developers and business analysts who lack familiarity with MapReduce. But the real innovation of Pivotal HD runs deeper than its integration of SQL with Hadoop: it concerns the positioning of Greenplum’s analytic engine alongside HDFS in ways that enable performance enhancements to Hadoop querying over and above the simple addition of a SQL interface. Pivotal HD’s Advanced Database Services (HAWQ) delivers a high-performance SQL engine that permits greater SQL functionality and performance than analogous SQL interfaces such as Hive, Hadapt and Impala. Coupled with Pivotal HD’s virtualization and pluggable storage compatibility features, the platform represents a distinct moment of innovation in the Hadoop space as evinced by the following three features:
Advanced Database Services (HAWQ)
Pivotal HD’s Advanced Database Services (HAWQ) functionality brings Greenplum’s Massively Parallel Processing (MPP) functionality to Hadoop. As a result, HAWQ allows Pivotal HD users to perform complex joins, MADlib in-database analytics and transactions. Moreover, users have the luxury of leveraging virtually any BI tool on the market to obtain advanced reporting and visualization of data as required. HAWQ-based SQL queries outperform Hive in response time by as much as 100x, according to EMC benchmarking data.
The Advanced Database Service interfaces with other components of Pivotal HD as follows:
Given the recent proliferation of SQL-Hadoop interfaces throughout the industry, customers and analysts should expect more data about the comparative efficiencies of SQL-Hadoop interfaces to emerge as more and more SQL-trained analysts start using SQL to operate on data saved in HDFS.
Hadoop Virtualization Extensions
Hadoop Virtualization Extensions (HVE) enable the provisioning of Hadoop clusters on VMware virtualized platforms in both public cloud and on-premise environments. HVE provides customers with increased deployment flexibility and enables the construction of highly available infrastructures for accessing Hadoop data.
Pluggable HDFS Storage
Customers can multiply their data storage options by using standard Hadoop direct attached storage in addition to EMC Isilon OneFS Scale-Out NAS Storage, the latter of which features streamlined loading, backup, replication, snapshotting and elastic scalability functionality.
Overall, EMC’s launch into the Hadoop-distribution world represents a stunning and significant move to grab Hadoop market share from Cloudera, Hortonworks and MapR. Unlike Intel’s recently launched distribution, EMC’s Pivotal HD claims some proprietary and genuinely innovative Hadoop technology in the form of its Advanced Database Services engine and scale-out storage compatibility. Expect EMC to continue to innovate upon its core technology platform and to follow suit with the likes of Concurrent in developing tools that render Hadoop more accessible to Java developers in addition to SQL users. What remains unclear, at this point, is the extent to which EMC will open-source its technology as it gains market share within the enterprise. For now, however, the Hadoop world has yet another significant player with cash reserves aplenty to continue to innovate on its platform and disrupt the Hadoop landscape in the process.