Advance/Newhouse has agreed to acquire big data analytics company 1010data for $500M. 1010data facilitates big data discovery and data sharing by means of a spreadsheet-like interface. The platform boasts predictive analytics, reporting and visualization as well as solutions for data sharing and monetization. The existing 1010data management team will continue to lead the company, and the new capital will be used to accelerate product development and expand sales operations. The 1010data platform gained initial traction in the financial services industry but has subsequently expanded to a customer roster of over 750 companies spanning retail, gaming, telecommunications and manufacturing. 1010data’s acquisition by Advance/Newhouse illustrates the vitality of market interest in big data discovery and data visualization solutions. Advance/Newhouse is the parent company of Condé Nast magazines and Bright House Networks.
Cloud Computing Today recently had the privilege of speaking with Amos Shaltiel, CEO and co-founder, and Michael Elkin, COO and co-founder, of DBS-H, an Israel-based company that specializes in continuous Big Data integration between relational and NoSQL databases. Topics discussed included the core capabilities of its big data integration platform, typical customer use cases and the role of data enrichment.
Cloud Computing Today: What are the core capabilities of your continuous big data integration platform for integrating SQL data with NoSQL? Is the integration unidirectional or bidirectional? What NoSQL platforms do you support?
DBS-H: DBS-H develops innovative solutions for continuous data integration between SQL and NoSQL databases. We believe that companies are going to adopt a hybrid model in which relational databases such as Oracle, SQL Server, DB2 or MySQL will continue to serve customers alongside new NoSQL engines. The success of Big Data adoption will ultimately rise and fall on how easily information can be accessed by key players in organizations.
The DBS-H solution relieves the data bottlenecks associated with integrating Big Data with existing SQL data sources, ensuring that everyone has transparent access to the data they need without any changes to existing systems.
Our vision is to make the data integration process simple, intuitive and fully transparent to the customer, without the need to hire highly skilled personnel for expensive maintenance of integration platforms.
Core capabilities of the DBS-H Big Data integration platform are:
1. Continuous data integration between SQL and NoSQL databases. Continuous integration represents a key factor of successful Big Data integration.
2. NoSQL data modeling and linkage to the existing relational model. We call it a “playground” where customers can:
a. Link a relational data model to a non-relational structure.
b. Create a new data design for the NoSQL database
c. Explore “Auto Link”, where the engine automatically generates two options for the NoSQL data model based on the existing SQL ERD design.
3. Data enrichment – a capability that adds additional information to each block of data, significantly enriching that data on the target
Currently, we focus on unidirectional integration, which avoids some of the conflict resolution scenarios specific to bidirectional continuous data integration. The unidirectional path runs from SQL to NoSQL; in the near future we will add the opposite direction, from NoSQL to SQL. Today we support Oracle and MongoDB, and plan to add support for additional database engines such as SQL Server, DB2, MySQL, Couchbase and Cassandra, as well as full integration with Hadoop. We aspire to be the default solution of choice when customers think about data integration across major industry data sources.
Cloud Computing Today: What are the most typical use cases for continuous data integration from SQL to NoSQL?
DBS-H: NoSQL engines offer high performance at relatively low cost, along with a flexible schema model.
Typical use cases of continuous data integration from SQL to NoSQL are driven principally from major NoSQL use cases, such as:
- Customer 360° view – creating and maintaining a unified view of a customer from multiple operational systems: the ability to provide a consistent customer experience regardless of channel, capitalize on upsell or cross-sell opportunities and deliver better customer service. NoSQL engines provide the response times required in customer service, along with scalability and a flexible data model. The DBS-H solution enables the “Customer 360° view” business case by providing transparent and continuous integration from existing SQL-based data sources.
- User profile management – applications that manage user preferences, authentication and even financial transactions. NoSQL provides high performance and a flexible schema model for user preferences; financial transactions, however, will usually be managed by a SQL system. With DBS-H continuous data integration, financial transaction data becomes transparently available inside NoSQL engines.
- Catalog management – applications that manage catalogs of products, financial assets, employees or customer data. Modern catalogs often contain user-generated data from social networks. NoSQL engines provide excellent flexible-schema capabilities that accommodate changes on the fly. Catalogs usually aggregate data from different organizational data sources such as online systems, CRM or ERP. The DBS-H solution enables transparent and continuous data integration from multiple existing SQL data sources into a new, centralized NoSQL-based catalog system.
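As an illustration of the customer 360° pattern described above, the sketch below merges rows from two relational sources into the single per-customer document a NoSQL store such as MongoDB would hold. It is plain Python with invented table and field names, not DBS-H’s actual pipeline:

```python
# Hypothetical rows as they might arrive from two relational sources
# (the source names and columns are invented for illustration).
crm_rows = [
    {"customer_id": 1, "name": "Ada Lovelace", "segment": "enterprise"},
]
order_rows = [
    {"customer_id": 1, "order_id": 101, "total": 250.0},
    {"customer_id": 1, "order_id": 102, "total": 99.9},
]

def build_customer_views(crm_rows, order_rows):
    """Merge rows keyed on customer_id into one document per customer,
    the denormalized shape a document store would serve in one read."""
    views = {}
    for row in crm_rows:
        views[row["customer_id"]] = {**row, "orders": []}
    for row in order_rows:
        view = views.get(row["customer_id"])
        if view is not None:
            view["orders"].append(
                {"order_id": row["order_id"], "total": row["total"]}
            )
    return list(views.values())
```

The point of the denormalized document is that a customer-service application retrieves the whole unified view in a single lookup, rather than joining tables at request time.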
Cloud Computing Today: Do you perform any data enrichment of SQL-data in the process of its integration with NoSQL? If so, what kind of data enrichment does your platform deliver? In the event that customers prefer to leave their data in its original state, without enrichment, can they opt out of the data enrichment process?
DBS-H: The DBS-H solution includes data enrichment capabilities within the data integration process. The main idea of “data enrichment” in our case is to provide a simple way for the customer to add logical information that enriches the original data by:
- Adding data source identification information, such as where, when and by whom the data was generated. This can be used for auditing, for example.
- Classifying data based on the source. This information can be very useful when customers want to control data access based on different roles and groups inside the organization.
- Assessing data reliability as low, medium or high. This enrichment is useful for analytic platforms that can make different decisions based on the source’s reliability level.
Customers can create enrichment metrics that are added to every block of information that goes through the DBS-H integration pipeline. If no enrichment is required, the customer can opt out of the enrichment step.
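The enrichment step described above can be sketched as follows. This is an illustrative stand-in with invented field names and metadata layout, not DBS-H’s actual wire format:

```python
from datetime import datetime, timezone

def enrich(block, source, classification, reliability, enabled=True):
    """Return a copy of a data block wrapped with enrichment metadata:
    source identification, capture time, classification and reliability.
    Setting enabled=False opts out and passes the block through unchanged."""
    if not enabled:
        return dict(block)
    if reliability not in ("low", "medium", "high"):
        raise ValueError("reliability must be low, medium or high")
    return {
        **block,
        "_enrichment": {
            "source": source,
            "captured_at": datetime.now(timezone.utc).isoformat(),
            "classification": classification,
            "reliability": reliability,
        },
    }
```

Downstream consumers — an audit trail, a role-based access filter, or an analytics engine weighting sources by reliability — read the `_enrichment` envelope without touching the original payload.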
Arcadia Data Releases Business Intelligence Platform For Hadoop And Closes $11.5M In Series A Funding
Today, Arcadia Data revealed details of its business intelligence and data visualization platform for Big Data. Arcadia Data’s BI platform enables business stakeholders to create data visualizations of Hadoop data by means of a rich user interface that allows users to drag and drop data fields. In addition, customers can select datasets for drill-downs to perform more advanced analyses such as root cause analytics, correlation analytics and trend analytics. The platform’s rich drag and drop functionality supports exploratory analysis of Hadoop-based data as illustrated below:
The graphic above shows how customers can use the Arcadia Data platform to obtain different aggregations of cab ride fares and duration within various geographies in NYC. Importantly, the simplicity and speed of the platform mean that business stakeholders can comfortably obtain the analyses and data visualizations needed to represent their own data-driven insights. Given that the Arcadia Data platform also features data modeling functionality that enables users to massage and organize data prior to taking advantage of Arcadia’s data visualization functionality, the platform also lends itself to use by more savvy data users in addition to business users. Arcadia supports all major Hadoop distributions including Cloudera, Hortonworks and MapR and additionally enables users to glean insights from applications built using MySQL, Oracle and Teradata. In addition to today’s product announcement, Arcadia Data today announced the finalization of $11.5M in Series A funding from Mayfield, Blumberg Capital and Intel Capital. As revealed to Cloud Computing Today in a live product demonstration, the depth and sophistication of the Arcadia Data platform illustrate the changing face of business intelligence in the wake of the big data revolution, particularly as evinced by the ease with which business stakeholders can now make sense of Hadoop-based data using data visualization, transformation, drill-downs, trend analysis and analytics more broadly.
Basho Technologies today announced the release of the Basho Data Platform, an integrated Big Data platform that enhances the ability of customers to build applications that leverage Basho’s Riak KV (formerly Riak) and Riak S2 (formerly Riak CS). By integrating Riak KV, Riak S2, Apache Spark, Redis and Apache Solr, the Basho Data Platform enhances the ability of customers to create high-performing applications that deliver real-time analytics. The platform’s integration with the Redis cache allows users to leverage the capability of Redis to improve the read performance of applications. The platform also boasts an integration with Apache Solr that builds upon the ability of Riak to support searches powered by Apache Solr. Moreover, the Basho Data Platform supports the replication and synchronization of data across its different components in ways that ensure continued access to applications and relevant data. The graphic below illustrates the different components of the Basho Data Platform:
The Basho Data Platform responds to a need in the marketplace to complement high performance NoSQL databases such as Riak with analytics and caching technologies such as Apache Spark and Redis, respectively. The platform’s cluster management and orchestration functionality absolves customers of the need to use Apache ZooKeeper for cluster synchronization and cluster management. By automating provisioning and orchestration and delivering Redis-based caching functionality in conjunction with Apache Spark, the platform empowers customers to create high performance applications capable of scaling to manage the operational needs of massive datasets. Today’s announcement marks the release of an integrated platform that stands poised to significantly augment the ease with which customers can build Riak-based Big Data applications. Notably, the platform’s ability to orchestrate and automate the interplay between its different components means that developers can focus on taking advantage of the functionality of Apache Spark and Redis alongside Riak KV and Riak S2 without becoming mired in the complexities of provisioning, cluster synchronization and cluster management. As such, the platform’s out-of-the-box integration of its constituent components represents a watershed moment in the evolution of Riak KV and Riak S2, and of the NoSQL space more generally.
Cloudera and Trillium Software recently announced a collaboration whereby the Trillium Big Data solution is certified for Cloudera’s Hadoop distribution. As a result of the partnership, Cloudera customers can take advantage of Trillium’s data quality solutions to profile, cleanse, de-duplicate and enrich Hadoop-based data. Trillium responds to a problem in the Big Data industry wherein the customer focus on deployment and management of Hadoop-based data repositories eclipses concerns about data quality. In the case of Hadoop-based data, data quality solutions predictably face challenges associated with the sheer volume of data that requires cleansing or quality improvements. Trillium’s Big Data Solution for data quality cleanses data natively within Hadoop because identifying data with data quality issues and then transporting it to another infrastructure becomes costly and complex. The collaboration between Trillium Software and Cloudera illustrates the relevance of data quality solutions for Hadoop despite the increased attention currently devoted to Big Data analytics and data visualization solutions. As such, Trillium fills a critical niche within the Big Data processing space and its alliance with Cloudera positions it strongly to consolidate its early traction within the space of solutions dedicated to data quality in the Big Data space.
Microsoft recently announced the Azure Data Lake, a product that serves as a repository for “every type of data collected in a single place prior to any formal definition of requirements or schema.” As noted by Oliver Chiu in a blog post, data lakes allow organizations to store all data regardless of type and size, on the theory that they can subsequently use advanced analytics to determine which data sources should be transferred to a data warehouse for more rigorous data profiling, processing and analytics. The Azure Data Lake’s compatibility with HDFS means that products with data stored in Azure HDInsight, as well as infrastructures that use distributions such as Cloudera, Hortonworks and MapR, can integrate with it, thereby allowing them to feed the Azure Data Lake with streams of Hadoop data from internal and third party data sources as necessary. Moreover, the Azure Data Lake supports massively parallel queries that allow for the execution of advanced analytics on massive datasets of the type envisioned for the Azure Data Lake, particularly given its ability to support unlimited data both in aggregate and with respect to specific files. Built for the cloud, the Azure Data Lake gives enterprises a preliminary solution to the problem of architecting an enterprise data warehouse by providing a repository for all data that customers can subsequently use as a base platform from which to retrieve and curate data of interest.
The Azure Data Lake illustrates the way in which the economics of cloud storage redefines the challenges associated with creating an enterprise data warehouse by shifting the focus of enterprise data management away from master data management and data cleansing toward advanced analytics that can query and aggregate data as needed, thereby absolving organizations of the need to create elaborate structures for storing data. In much the same way that Gmail dispenses with files and folders for email storage and depends upon its search functionality to facilitate the retrieval of email-based data, data lakes take the burden of classifying and curating data away from customers but correspondingly place the emphasis on the analytic capabilities of organizations with respect to the ability to query and aggregate data. As such, the commercial success of the Azure Data Lake hinges on its ability to simplify the process of running ad hoc and repeatable analytics on data stored within its purview by giving customers a rich visual user interface and platform for constructing and refining analytic queries on Big Data.
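The schema-on-read approach described above — store everything first, impose structure only at query time — can be sketched in a few lines of Python. The records and field names here are invented for illustration; a real lake would hold files in HDFS-compatible storage rather than an in-memory list:

```python
# Heterogeneous records as they might land in a data lake: no schema was
# imposed at write time, so shapes vary from record to record.
lake = [
    {"kind": "page_view", "user": "a", "ms": 120},
    {"kind": "purchase", "user": "a", "total": 42.5},
    {"kind": "page_view", "user": "b"},            # missing "ms" field
    {"kind": "purchase", "user": "b", "total": 9.5},
]

def ad_hoc_revenue_by_user(records):
    """Impose structure at read time: select purchase records, tolerating
    missing fields, and aggregate revenue per user on the fly."""
    revenue = {}
    for r in records:
        if r.get("kind") == "purchase":
            revenue[r["user"]] = revenue.get(r["user"], 0.0) + r.get("total", 0.0)
    return revenue
```

The trade-off the article describes is visible even at this scale: nothing was cleansed or modeled up front, so every query must carry the burden of interpreting raw, irregular records itself.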