Cloudera and Trillium Software recently announced a collaboration whereby the Trillium Big Data solution is certified for Cloudera’s Hadoop distribution. As a result of the partnership, Cloudera customers can take advantage of Trillium’s data quality solutions to profile, cleanse, de-duplicate and enrich Hadoop-based data. Trillium responds to a recurring problem in the Big Data industry wherein the customer focus on deploying and managing Hadoop-based data repositories eclipses concerns about data quality. In the case of Hadoop-based data, data quality solutions predictably face challenges associated with the sheer volume of data that requires cleansing or quality improvements. Trillium’s Big Data solution therefore cleanses data natively within Hadoop, because identifying records with data quality issues and transporting them to another infrastructure for remediation becomes costly and complex at scale. The collaboration between Trillium Software and Cloudera illustrates the continued relevance of data quality solutions for Hadoop despite the increased attention currently devoted to Big Data analytics and data visualization. As such, Trillium fills a critical niche within Big Data processing, and its alliance with Cloudera positions it strongly to consolidate its early traction among solutions dedicated to data quality for Big Data.
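As a toy illustration of the de-duplication step, consider the following sketch in plain Python; the records, matching rule and function names are invented for illustration and do not reflect Trillium’s actual matching logic:

```python
def normalize(record):
    """Canonicalize a record so near-duplicates compare equal.
    A toy stand-in for the matching rules a data quality tool applies."""
    return (record["name"].strip().lower(), record["email"].strip().lower())

def deduplicate(records):
    """Keep the first occurrence of each normalized record."""
    seen, unique = set(), []
    for rec in records:
        key = normalize(rec)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

customers = [
    {"name": "Ada Lovelace", "email": "ada@example.com"},
    {"name": "  ada lovelace ", "email": "ADA@example.com"},  # near-duplicate
    {"name": "Grace Hopper", "email": "grace@example.com"},
]
clean = deduplicate(customers)  # the near-duplicate is collapsed away
```

Real data quality tools apply far richer matching rules (phonetic similarity, address standardization, reference data enrichment), but the normalize-then-compare shape is the same.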
Microsoft Azure recently announced news of the Azure Data Lake, a product that serves as a repository for “every type of data collected in a single place prior to any formal definition of requirements or schema.” As noted by Oliver Chiu in a blog post, data lakes allow organizations to store all data regardless of type and size on the theory that they can subsequently use advanced analytics to determine which data sources should be transferred to a data warehouse for more rigorous data profiling, processing and analytics. The Azure Data Lake’s compatibility with HDFS means that products with data stored in Azure HDInsight, as well as infrastructures that use distributions such as Cloudera, Hortonworks and MapR, can integrate with it, thereby allowing them to feed the Azure Data Lake with streams of Hadoop data from internal and third-party data sources as necessary. Moreover, the Azure Data Lake supports massively parallel queries that allow for the execution of advanced analytics on the massive datasets envisioned for the Azure Data Lake, particularly given its ability to support unlimited data both in aggregate and within individual files. Built for the cloud, the Azure Data Lake gives enterprises a preliminary solution to the problem of architecting an enterprise data warehouse by providing a repository for all data that customers can subsequently use as a base platform from which to retrieve and curate data of interest.
The Azure Data Lake illustrates the way in which the economics of cloud storage redefines the challenges associated with creating an enterprise data warehouse. It shifts the focus of enterprise data management away from master data management and data cleansing toward advanced analytics that can query and aggregate data as needed, thereby absolving organizations of the need to create elaborate structures for storing data. In much the same way that Gmail dispenses with files and folders for email storage and depends upon its search functionality to facilitate the retrieval of email-based data, data lakes take the burden of classifying and curating data away from customers but correspondingly place the emphasis on an organization’s ability to query and aggregate data. As such, the commercial success of the Azure Data Lake hinges on its ability to simplify the process of running ad hoc and repeatable analytics on data stored within its purview by giving customers a rich visual user interface and platform for constructing and refining analytic queries on Big Data.
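The schema-on-read idea behind a data lake (store everything raw, impose structure only at query time) can be sketched in miniature. The following hypothetical Python example illustrates the concept only and is not Azure’s API:

```python
import json

# A "data lake" in miniature: raw records of mixed shape are appended
# as-is, with no schema enforced at write time.
lake = [
    '{"device": "sensor-1", "temp_c": 21.5}',
    '{"device": "sensor-2", "temp_c": 19.0, "battery": 0.83}',
    '{"user": "alice", "action": "login"}',  # a different record type entirely
]

def query(records, required_fields):
    """Schema-on-read: a schema (here, a set of required fields) is
    applied only when the data is queried, not when it is stored."""
    for raw in records:
        rec = json.loads(raw)
        if required_fields <= rec.keys():
            yield rec

# Only at analysis time do we decide we care about temperature readings.
readings = list(query(lake, {"device", "temp_c"}))  # two matching records
```

The login event is neither rejected at write time nor lost; it simply waits in the lake until some future query defines a schema that selects it.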
DataTorrent today announces the finalization of $15M in Series B funding. The funding round is led by Singtel Innov8, with additional participation from GE Ventures and Series A investors August Capital, AME Cloud Ventures and Morado Venture Partners. DataTorrent’s platform provides an infrastructure for processing, storing and running analytics on streaming big data sets. The platform can ingest and analyze massive amounts of data by means of over 75 connectors and 400 Java operators that allow data scientists to perform advanced analytics on multiple datasets in parallel. DataTorrent differentiates itself architecturally by performing in-memory processing that runs directly on Hadoop, without the overhead that results from scheduling batches of Hadoop data for processing. The platform boasts massive scalability at sub-second latency while maintaining the capability to process batch and streaming datasets alike. Use cases for DataTorrent include internet of things analytics as well as web analytics that push the limits of the platform’s ability to scale and ingest massive amounts of data. Today’s capital raise brings the total funding raised by DataTorrent to $23.8M. Building on its recent distinction as a Gartner Cool Vendor, DataTorrent stands to consolidate its early traction in the heavily contested Big Data analytics space with today’s infusion of capital and the guidance of Innov8 Managing Director Jeff Karras, who joins DataTorrent’s board of directors as part of the Series B round.
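The operator model described above, in which small processing stages are composed into a pipeline over a stream, can be illustrated with a toy sketch. The example below uses plain Python generators and invented operator names, not DataTorrent’s actual Java operator API:

```python
def parse(lines):
    # Operator 1: turn raw "key,value" lines into typed tuples.
    for line in lines:
        key, value = line.split(",")
        yield key, float(value)

def filter_threshold(events, minimum):
    # Operator 2: drop events whose value falls below a threshold.
    for key, value in events:
        if value >= minimum:
            yield key, value

def running_sum(events):
    # Operator 3: maintain per-key running totals, emitting after each event.
    totals = {}
    for key, value in events:
        totals[key] = totals.get(key, 0.0) + value
        yield key, totals[key]

# Compose the operators into a pipeline over a stream
# (finite here purely for illustration).
stream = ["a,1.0", "b,0.2", "a,2.0", "b,5.0"]
results = list(running_sum(filter_threshold(parse(stream), 0.5)))
```

Because each operator consumes and emits events one at a time, results flow through the pipeline as data arrives rather than waiting for a scheduled batch, which is the architectural property the paragraph above describes.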
MapR has declined the invitation to participate in the Open Data Platform (ODP) after careful consideration, as noted in a recent blog post by John Schroeder, the company’s CEO and co-founder. Schroeder claims that the Open Data Platform is redundant with the governance provided by the Apache Software Foundation, that it purports to “solve” Hadoop-related problems that do not require solving and that it fails to accurately define the core of the Open Data Platform as it relates to Hadoop. With respect to software governance, Schroeder notes that the Apache Software Foundation has done well to steward the development of Apache Hadoop as elaborated below:
The Apache Software Foundation has done a wonderful job governing Hadoop, resulting in the Hadoop standard in which applications are interoperable among Hadoop distributions. Apache governance is based on a meritocracy that doesn’t require payment to participate or for voting rights. The Apache community is vibrant and has resulted in Hadoop becoming ubiquitous in the market in only a few short years.
Here, Schroeder credits the Apache Software Foundation with creating a Hadoop ecosystem in which Hadoop-based applications interoperate with one another and wherein the governance structure is based on a meritocracy that does not mandate monetary contributions in order to garner voting rights. In addition, the blog post observes that whereas the Open Data Platform defines the core of Apache Hadoop as MapReduce, YARN, Ambari and HDFS, other frameworks such as “Spark and Mesos, are gaining market share” and stand to complicate ODP’s definition of the core of Hadoop. Meanwhile, Cloudera’s Chief Strategy Officer Mike Olson explained why Cloudera also declined to join the Open Data Platform by noting that Hadoop “won because it’s open source” and that the partnership between Pivotal and Hortonworks was “antithetical to the open source model and the Apache way.” Given that 75% of Hadoop implementations use either MapR or Cloudera, ODP looks set to face some serious challenges despite support from IBM, Pivotal and Hortonworks, although the precise impact of the schism over the Open Data Platform on the Hadoop community remains to be seen.
As reported in The Wall Street Journal, Tachyon Nexus, the company that aims to commercialize the open source Tachyon in-memory storage system, has raised $7.5M in Series A funding from Andreessen Horowitz. Tachyon is a memory-centric storage system that epitomizes the contemporary transition away from disk-based storage toward in-memory storage. Based on the premise that memory-centric storage is increasingly affordable in comparison with disk-centric storage, Tachyon caches frequently read files in memory to create a “memory-centric, fault-tolerant, distributed storage system” that “enables reliable data sharing at memory-speed across a datacenter,” as noted in a blog post by Peter Levine, General Partner at Andreessen Horowitz. Tachyon’s memory-centric storage system improves upon the speed and reliability of file-based storage infrastructures to meet the needs of big data applications that must share massive volumes of data at increasingly fast speeds. Tachyon was founded by Haoyuan Li, a U.C. Berkeley doctoral candidate who developed Tachyon at the U.C. Berkeley AMPLab. Tachyon is currently used at over 50 companies and supports Spark and MapReduce as well as data stored in HDFS and NFS. Tachyon Nexus, the company commercializing Tachyon, remains in stealth. Meanwhile, Peter Levine joins the board of Tachyon Nexus as a result of the Series A investment to support the development of what he envisions as “the future of storage” in the form of Tachyon-based storage technology.
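The core pattern here, a read-through cache that serves frequently read files from memory and falls back to slower backing storage, can be sketched generically. The class below illustrates the caching idea only and is not Tachyon’s implementation:

```python
class MemoryCachedStore:
    """Read-through cache: reads hit an in-memory dict first and fall
    back to the (slow) backing store, caching the result for next time."""

    def __init__(self, backing_store):
        self.backing_store = backing_store  # e.g. a dict standing in for HDFS
        self.cache = {}
        self.backing_reads = 0  # counts expensive reads, for illustration

    def read(self, path):
        if path in self.cache:
            return self.cache[path]          # memory-speed read
        self.backing_reads += 1              # simulate a slow disk/network read
        data = self.backing_store[path]
        self.cache[path] = data              # subsequent reads served from memory
        return data

store = MemoryCachedStore({"/data/part-0": b"hello"})
first = store.read("/data/part-0")   # goes to the backing store
second = store.read("/data/part-0")  # served from memory; no second slow read
```

A real memory-centric storage system adds fault tolerance, eviction and distribution across a cluster, but the read path follows this same shape: check memory first, fall back to slower storage only on a miss.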
Ford has announced that it will partner with Microsoft Azure to automate updates to automobile software such as its Sync 3 infotainment system, as well as functionality that enables owners to check battery levels and remotely start, lock, unlock or locate their vehicles. As a result of the partnership with Azure, Ford vehicle owners with Sync entertainment and navigation systems will no longer need to take their cars to the dealership for periodic software upgrades, but can instead leverage the car’s ability to connect to a wireless network to download enhancements to Sync. The Azure-based Ford Service Delivery Network will launch this summer at no extra cost to end users. Use cases enabled by the partnership thus span over-the-air software delivery as well as remote functionality such as battery checks and remote start, lock, unlock and locate.
Despite Ford’s readiness to use long-time technology partner Microsoft for the purpose of leveraging a public cloud, the Dearborn-based automobile giant prefers to use on-premises infrastructures for more sensitive data such as odometer readings, engine-related system data and performance metrics that reveal details about the operation of the vehicle. Moreover, part of the reason Ford chose Microsoft was its willingness to support a hybrid cloud infrastructure marked by an integration between an on-premises data center environment and a public cloud such as Azure. As reported in InformationWeek, Microsoft will also help Ford with the processing and analysis of data, given the massive amounts of data that stand to be collected from its fleet of electric and non-electric vehicles. Ford’s Fusion electric vehicle, for example, creates 25 GB of data per hour and consequently requires pre-processing and filtering procedures that reduce the data to a volume manageable for reporting and analytics purposes. Ford’s larger decision to partner with Azure exemplifies a growing trend within the automobile industry, one that includes the likes of Hyundai and Tesla, to use cloud-based technology to push software updates to vehicles and to gather data for compliance and product development purposes. The key challenge for Ford, and for the automobile industry at large, will be acquiring internet of things-related automobile data and subsequently performing real-time analytics on it to reduce recalls and fatalities and to facilitate more profound enhancements in engineering-related research and development. Details of which Ford vehicles stand to benefit from Azure-powered software delivery this summer have yet to be disclosed.
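The pre-processing and filtering step described above might be sketched as follows; the field names, window size and summary statistics are invented for illustration and are not Ford’s actual pipeline:

```python
def summarize_telemetry(samples, window=3):
    """Reduce high-frequency telemetry samples to per-window summaries,
    shipping only aggregates rather than every raw reading."""
    summaries = []
    for i in range(0, len(samples), window):
        chunk = samples[i:i + window]
        speeds = [s["speed_kph"] for s in chunk]
        summaries.append({
            "start": chunk[0]["t"],                       # window start time
            "mean_speed_kph": sum(speeds) / len(speeds),  # aggregate, not raw
            "max_speed_kph": max(speeds),
        })
    return summaries

# Six raw samples (hypothetical): speed climbing from 60 to 65 kph.
raw = [{"t": t, "speed_kph": 60 + t} for t in range(6)]
compact = summarize_telemetry(raw)  # two summary records instead of six
```

Scaled up, this kind of windowed aggregation is what turns an unmanageable 25 GB per hour of raw readings into a stream compact enough to aggregate for reporting and analytics.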