Attunity today announced the general availability of version 6.0 of its Hadoop big data replication software, Attunity RepliWeb for Enterprise File Replication. The product enables organizations to replicate data stored within Hadoop-based distributions. Organizations can use Attunity’s Hadoop replication software to populate backup and recovery infrastructures, streamline the parallel loading of data into large data warehouses and facilitate efficient ETL processing by way of the appropriate staging of data. This version features dashboards and notifications that allow administrators to monitor the progress of replications and data transfers. The product supports data management across thousands of nodes and integrates with Attunity CloudBeam SaaS, allowing users to transfer data directly to Amazon’s S3 storage platform. Attunity’s solution supports a number of Hadoop distributions, including Apache Hadoop, Greenplum HD and Hortonworks, and also provides a unified replication platform across Windows, Unix, Linux and HDFS file systems as well as multiple servers and devices.
Netflix recently delivered a stunningly detailed elaboration of the cloud foundation for its Hadoop architecture in a blog post titled “Hadoop Platform as a Service in the Cloud” by Sriram Krishnan and Eva Tse. The post explains the technical foundation underpinning “Genie,” Netflix’s Platform as a Service for Hadoop. But in order to detail the technical underpinnings of Genie, the Netflix Data Science & Engineering team positioned its Hadoop Platform as a Service infrastructure within the larger context of its Amazon Web Services S3 cloud storage platform and Amazon’s distribution of Hadoop, Elastic MapReduce (EMR). Importantly, the blog post suggests the possibility of open-sourcing Genie “in the near future” and solicits reader feedback about whether a Hadoop Platform as a Service product might be useful to organizations processing petabytes of data and more.
Key features of the Netflix Platform as a Service For Hadoop include:
Data Storage on Amazon S3
Whereas most traditional Hadoop deployments store data within a Hadoop data warehouse built on the Hadoop Distributed File System (HDFS) storage platform, Netflix opted to store all of its data on Amazon S3 while running its clusters on EMR.
Benefits of S3 include the following:
•Durability and availability of objects over a given year on the order of nine 9s (99.999999999%) and two 9s (99.99%), respectively
•Granular versioning capabilities
•Elastic capabilities that result in virtually unlimited capacity on demand
•The ability to manage multiple, disparate Hadoop clusters that read from the same underlying data set
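To put the durability and availability figures above in concrete terms, a back-of-the-envelope calculation (a sketch, assuming S3’s advertised nine 9s of durability and two 9s of availability) shows why nine 9s is effectively zero loss even at massive scale:

```python
# Back-of-the-envelope check of what "nine 9s" and "two 9s" mean in practice.
# The figures below assume S3's advertised annual durability/availability.

DURABILITY = 0.99999999999          # nine 9s: annual per-object durability
AVAILABILITY = 0.9999               # two 9s: annual availability

def expected_losses(num_objects: int) -> float:
    """Expected number of objects lost in a year at the stated durability."""
    return num_objects * (1 - DURABILITY)

def downtime_minutes_per_year() -> float:
    """Minutes of unavailability per year implied by two 9s."""
    return (1 - AVAILABILITY) * 365 * 24 * 60

# Even a ten-billion-object store expects to lose only ~0.1 objects per year,
# while two 9s of availability still permits roughly 53 minutes of downtime.
print(expected_losses(10_000_000_000))
print(round(downtime_minutes_per_year()))
```

The asymmetry is worth noting: durability (not losing data) is many orders of magnitude stronger than availability (being able to reach it at a given moment), which suits a batch-analytics workload like Netflix’s.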
Disparate Hadoop Clusters For Dedicated Workloads
Genie’s architecture features multiple Hadoop clusters dedicated to distinct workloads. The query cluster is a large, roughly 500-node cluster used for ad hoc queries, while the production cluster serves as the site of large ETL processes. All of these clusters can be dynamically resized in accordance with the volume of data processing. Genie’s query cluster, for example, typically shrinks at night given the reduced need for ad hoc queries; conversely, the production cluster expands at night to accommodate the ETL processes that run then.
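The day/night resizing policy described above can be sketched as a simple schedule function. The cluster names come from the post, but the node counts and cutoff hours below are illustrative assumptions, not Netflix’s actual configuration:

```python
# Illustrative sketch of the day/night cluster resizing policy.
# Node counts and the 8:00-20:00 "daytime" window are made-up values,
# not Netflix's actual configuration.

def target_size(cluster: str, hour: int) -> int:
    """Return a target node count for a cluster at a given hour (0-23)."""
    daytime = 8 <= hour < 20
    if cluster == "query":
        # Ad hoc queries happen during working hours, so shrink at night.
        return 500 if daytime else 100
    if cluster == "production":
        # ETL jobs run overnight, so expand at night.
        return 300 if daytime else 600
    raise ValueError(f"unknown cluster: {cluster}")

print(target_size("query", 14))       # daytime: full-size query cluster
print(target_size("production", 2))   # night: expanded production cluster
```

Because the data lives on S3 rather than in cluster-local HDFS, resizing (or even replacing) a cluster never puts the underlying data at risk, which is what makes this kind of elasticity practical.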
Developers typically use the following languages and tools to access Hadoop clusters:
•Hive for queries and analytics
•Python and Pig for ETL processes
•MapReduce for complex algorithms
•Communal gateways that let multiple developers write Hive and Pig queries
•Personal gateway AMIs for heavy users that permit the customization of client-side development
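For readers less familiar with the MapReduce model mentioned above, the programming pattern behind it can be simulated in a few lines of plain Python. This is a minimal in-memory illustration of the map, shuffle and reduce phases (the classic word count), not anything from Netflix’s codebase:

```python
# A minimal, in-memory simulation of the MapReduce model (word count).
# Real jobs run distributed across a Hadoop cluster; this only shows
# the three phases the framework orchestrates.
from collections import defaultdict

def map_phase(docs):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in docs:
        for word in doc.lower().split():
            yield word, 1

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["Hadoop in the cloud", "the cloud scales"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["the"])    # appears in both documents
print(counts["cloud"])  # appears in both documents
```

Tools like Hive and Pig exist precisely so that developers can express queries at a higher level and let the framework compile them down to jobs of this shape.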
Hadoop Platform As A Service
Unlike Amazon’s Elastic MapReduce, which provides Infrastructure as a Service for Hadoop, Netflix’s Platform as a Service allows developers to use a RESTful API to execute Hadoop, Pig and Hive jobs without provisioning new Hadoop clusters or installing Hadoop, Pig and Hive clients. Furthermore, Genie also allows administrators to manage Hadoop deployments using a backend configuration tool.
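To make the RESTful-API idea concrete, a job submission to such a service might look like the sketch below. The endpoint shape and field names are hypothetical, since the post does not document Genie’s actual API; the point is only that a developer ships a script reference, not a cluster:

```python
# Sketch of submitting a Hive script to a REST-style Hadoop PaaS.
# The field names here are hypothetical -- Genie's actual API is not
# described in the post -- so this only illustrates the shape of the idea.
import json

def build_job_request(name: str, script_s3_path: str, cluster: str) -> str:
    """Build a JSON job-submission payload; no cluster provisioning needed."""
    payload = {
        "name": name,
        "type": "hive",
        "scriptLocation": script_s3_path,   # scripts live on S3, like the data
        "cluster": cluster,
    }
    return json.dumps(payload)

request_body = build_job_request(
    "daily-report", "s3://example-bucket/queries/report.q", "query")
# A client would POST this body to a jobs endpoint on the service.
print(json.loads(request_body)["type"])
```

The contrast with EMR is that the caller never names machines or instance counts; the platform decides which cluster runs the job.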
Kudos goes to Netflix for its sustained and specific elaboration of Genie’s architecture. Hadoop Platform as a Service vendors have recently begun to proliferate and include the likes of Microsoft, Infochimps, Continuuity and Mortar Data. Microsoft announced its Azure-based Hadoop platform, Windows Azure HDInsight, in late October 2012. Infochimps, meanwhile, delivers a Big Data platform as a service that supports NoSQL frameworks such as HBase, Cassandra and MongoDB in addition to Hadoop. Continuuity’s AppFabric platform provides a set of APIs that sit atop a company’s Hadoop deployment, while AWS Global Start-up Finalist Mortar Data provides an open-source framework that empowers developers to leverage their Pig, Java and Python skills on a Hadoop ecosystem. Netflix’s Genie is without doubt the most production-ready Hadoop Platform as a Service in the industry given the sheer volume of data it processes daily. That said, the industry should expect more Hadoop Platform as a Service vendors to emerge as the need for simplified, PaaS-like methods of Hadoop management becomes more urgent.
Big Data management vendor Zettaset today announced the finalization of $10 million in Series B funding. The funding round was led by HighBar Partners with participation from Series A investors DFJ and Epic Ventures in addition to Brocade, a strategic investor. The funding will be used to expand sales and marketing initiatives and accelerate research and development of its platform for managing Hadoop clusters. Zettaset’s core product, the Zettaset Orchestrator, streamlines the installation and management of Hadoop. Zettaset Orchestrator simplifies the deployment of Hadoop, adds efficiency to its ongoing operational management and enhances the security of Hadoop-based deployments. The platform features an administrative interface capable of producing custom reports on the integrity of, and activity within, a Hadoop cluster or deployment. Compatible with any Apache Hadoop distribution, Zettaset’s platform aims to reduce enterprise overhead related to Hadoop deployment and management. Today’s investment illustrates commercial interest not only in Hadoop distributions and analytics, but also in infrastructures, such as Zettaset’s, that simplify Hadoop management and enhance compliance with the regulatory demands of contemporary enterprises.
Big Data continues to be red hot within the venture capital space as evinced by the finalization of $20M in Series B funding for Platfora, the San Mateo-based business intelligence platform for Big Data and Hadoop. The funding round was led by Battery Ventures with additional participation from existing investors Andreessen Horowitz and Sutter Hill Ventures. The capital raise will be used to expand Platfora’s sales and marketing teams as well as to add depth and talent to its engineering and design teams.
Platfora’s value proposition within the Big Data business intelligence space consists in its ability to transform Hadoop-based Big Data into “interactive, in-memory business intelligence” that dispenses with the need for an ETL job or data warehouse. Platfora’s BI interface enables data scientists and business users alike to interactively explore the relevant data landscape through the product’s web-based interface. Platfora allows users to segment and compare data subsets, collaborate by way of annotations, seamlessly switch between visual and numeric representations of data, and export data to CSV files or PNG images.
Platfora’s CEO Ben Werther elaborated on the product’s value proposition in a blog post that proclaimed the death of the traditional data warehouse as follows:
We’ve been living in the dark ages of data management. We’ve been conditioned to believe that it is right and proper to spend a year or more architecting and implementing a data warehouse and business intelligence solution. That you need teams of consultants and IT people to make sense of data. We are living in the status quo of practices developed 30 years ago — practices that are the lifeblood of companies like Oracle, IBM and Teradata.
And yet to build a data warehouse I’d be expected to perfectly predict what data would be important and how I’d want to question it, years in advance, or spend months rearchitecting every time I was wrong. This is actually considered ‘best practice’.
Imagine what is possible. Raw data of any kind or type lands in Hadoop with no friction. Everyday business users can interactively explore, visualize and analyze any of that data immediately, with no waiting for an IT project. One question can lead to the next and take them anywhere through the data. And the connective tissue that makes this possible — bridging between lumbering batch-processing Hadoop and this interactive experience — are ‘software defined’ scale-out in-memory data marts that automatically evolve with users’ questions and interest.
Werther, who was previously Director of Product Management at Greenplum, notes that the “dark ages of data management” require companies to allocate teams of resources to create a data warehouse and define schemas that “predict what data would be important” and “how I’d want to question it, years in advance, or spend months rearchitecting every time I was wrong.” Platfora, in contrast, delivers a platform whereby users can visually and interactively engage data without waiting for months of costly data architecture as a foundational, precursor step. Importantly, Werther here takes direct aim at traditional business intelligence giants such as Oracle, IBM and Teradata by proclaiming the death of both the traditional data warehouse and the business intelligence platforms that supported it. Built on HTML 5 technology, the product’s interface is optimized for data drill-downs by users and collaborative communication.
According to GigaOm, this week’s funding raise brings Platfora’s total funding to $25.7 million after it emerged from stealth mode in October, roughly a year after its initial $5.7 million capital raise. Platfora’s funding raise was announced in conjunction with recent product enhancements by BI vendors such as Tableau, Talend, Jaspersoft and Pentaho, all of which revealed details of upgraded functionality to their BI, Hadoop-supported platforms within the last month. Given the current labor shortage of skilled data scientists that can write advanced Hadoop queries in MapReduce or Hive, Big Data platforms that, like Platfora, are rich in visualization functionality are likely to take the cake in the battle for big data BI market share as long as they can remain competitive in terms of data processing speed and analytic granularity as well.
Talend, the open source data integration company, announced the release of version 5.2 of its Open Studio data integration and data management platform on Monday. The release features prominent enhancements such as Hadoop big data profiling and support for well-known NoSQL products, in addition to a bevy of other usability, productivity and performance improvements.
Hadoop Big Data Profiling
Talend’s big data profiling functionality enables users to “discover and understand data in Hadoop clusters” with a view to:
•Identifying data quality issues such as corrupt, incomplete, duplicate or inconsistent data
•Analyzing data in Hive clusters without extracting it from the Hadoop cluster
•Cleansing, enriching, de-duplicating and creating crosswalks across data sets within the Hadoop cluster itself
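The kinds of quality checks listed above can be illustrated in miniature. The sketch below (with invented field names, and running on a tiny in-memory list rather than inside a Hadoop cluster as Talend does) flags incomplete records and counts duplicates:

```python
# Toy illustration of the profiling checks listed above: flag incomplete
# records and detect duplicates. The field names are invented for the
# example; Talend performs comparable checks inside the Hadoop cluster.
from collections import Counter

records = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},               # incomplete record
    {"id": 3, "email": "a@example.com"},    # duplicate email
]

# Incomplete: any record with a missing (None) field value.
incomplete = [r for r in records if any(v is None for v in r.values())]

# Duplicates: any email value appearing more than once.
email_counts = Counter(r["email"] for r in records if r["email"])
duplicates = {email: n for email, n in email_counts.items() if n > 1}

print(len(incomplete))
print(duplicates)
```

The point of doing this inside the cluster, as Talend’s release emphasizes, is to avoid shipping terabytes of raw data out of Hadoop just to learn that some of it is dirty.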
Talend version 5.2 features support for NoSQL databases in the form of an initial set of connectors to Cassandra and MongoDB. The product natively supports Apache Hadoop and integrates with the Hadoop Distributed File System (HDFS), HCatalog, Hive, Oozie, Pig and Sqoop. The support for NoSQL databases complements the more than 450 connectors to other products and platforms already built into the Talend Open Studio architecture.
Fabrice Bonan, co-founder and Chief Technical Officer of Talend, elaborated on the significance of the new Talend release by noting:
Talend version 5.2 delivers on our vision of simplifying the development, integration and management of big data so that businesses can focus on using that data to make faster and more informed decisions. We provide the most powerful and versatile open source, big data solution to help organizations load, extract and improve disparate data while leveraging the massively parallel processing power of big data technologies including Apache Hadoop and leading NoSQL databases.
According to Bonan, Talend’s 5.2 release delivers on its mission of streamlining big data management while providing solutions to “load, extract and improve disparate data” in conjunction with the “massively parallel processing power” of Hadoop and NoSQL. The underlying vision, as in most Big Data initiatives, is to help organizations make “faster and more informed decisions.”
Talend’s enhancements point to an industry-wide embrace of more sophisticated Hadoop data discovery and cleansing functionality that empowers data scientists to perform more nuanced manipulations of data within a Hadoop cluster, without extraction. Additionally, virtually all big data integration platforms will need to support NoSQL databases such as Cassandra and MongoDB given NoSQL’s rapid uptake by enterprise customers at both a cloud and traditional data center level.
At a product level, however, Talend’s version 5.2 innovations on the big data profiling front are geared more toward data scientists than toward the business analysts and stakeholders who will consume the analytical insights themselves. This release focuses on architectural and data processing enhancements while leaving business-focused functionality upgrades, such as enhanced data visualization capabilities and dashboards, to a forthcoming version.
The Apache Hadoop community of developers recently announced the second alpha release of Apache Hadoop, known as Apache Hadoop 2.0.2-alpha. The release features enhancements such as the following:
•HDFS HA enhancements including support for automated failover using ZooKeeper and support for security
•YARN testing and stabilization
YARN, which stands for “Yet Another Resource Negotiator,” provides a framework for building distributed processing applications and infrastructures. YARN additionally provides a mechanism for scheduling requests for resources such as CPUs and manages the execution of those requests. Harsh Chouraria’s excellent blog post explains how YARN differs from MapReduce 2.0 by noting that “YARN is a generic platform for any form of distributed application to run on, while MR2 is one such distributed application that runs the MapReduce framework on top of YARN.” YARN has already been deployed on massive clusters totaling almost 6,000 nodes at Yahoo.
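YARN’s role as a resource negotiator can be pictured with a deliberately tiny model: applications request CPUs, and a scheduler grants or rejects the requests against a fixed cluster capacity. This is a toy illustration of the concept, not YARN’s actual scheduling algorithm:

```python
# A deliberately tiny model of a resource negotiator: applications request
# CPUs and a scheduler grants them against a fixed cluster capacity.
# This illustrates the concept only; it is not YARN's actual algorithm.

class SimpleScheduler:
    def __init__(self, total_cpus: int):
        self.free_cpus = total_cpus
        self.granted = {}  # application name -> CPUs currently held

    def request(self, app: str, cpus: int) -> bool:
        """Grant the request if capacity remains, otherwise reject it."""
        if cpus <= self.free_cpus:
            self.free_cpus -= cpus
            self.granted[app] = self.granted.get(app, 0) + cpus
            return True
        return False

    def release(self, app: str):
        """Return an application's CPUs to the pool when it finishes."""
        self.free_cpus += self.granted.pop(app, 0)

sched = SimpleScheduler(total_cpus=8)
print(sched.request("etl-job", 6))       # granted: capacity available
print(sched.request("ad-hoc-query", 4))  # rejected: only 2 CPUs remain
sched.release("etl-job")
print(sched.request("ad-hoc-query", 4))  # granted: capacity freed
```

Separating this negotiation layer from any particular computation model is what lets frameworks other than MapReduce run on the same cluster.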
The Hadoop community is now close to the release of Hadoop-2.x sometime early in 2013, which will feature final tweaks on functionality such as:
•HDFS without shared storage
•YARN scheduling enhancements
Developers can download the latest Hadoop release from the Apache Hadoop page or Hortonworks Data Platform 2.0 Alpha, the latter of which integrates with additional frameworks such as Apache Pig and Apache Hive.