Four Vs Of Big Data Infographic From IBM Big Data And Analytics Hub

The following infographic, from the IBM Big Data and Analytics Hub, adds veracity to the three Vs traditionally associated with big data: volume, velocity and variety. It neatly summarizes some of the vocabulary common to discussions of big data.


Trifacta v3 Enhances Data Governance While Improving User Productivity

On September 22, Trifacta announced the release of Trifacta version 3, featuring notable improvements related to data governance and user empowerment. Trifacta specializes in data wrangling for analytics, with a focus on data exploration that enables users to quickly understand the landscape of big data sets as a preliminary step toward performing more focused analytics. Trifacta v3 features enhancements to user productivity marked by the introduction of visual “transformation cards” that graphically represent the Trifacta platform’s suggestions for intelligently transforming data. This release’s emphasis on user empowerment also brings support for automated multi-dataset transformations, column-aware transforms and enhanced connectivity to external data sources such as Hive, Amazon S3 and relational databases. With respect to data governance, Trifacta now supports Hadoop security standards such as Kerberos, in addition to LDAP integration and role-based access control. Moreover, the platform delivers metadata and lineage functionality that allows users to understand the history of data fields, together with the scripts and transformations that have contributed to the emergence of different data objects and their associated metadata.

Finally, Trifacta v3 also features operationalization functionality marked by support for the schedulers Chronos and Tidal, along with advanced scheduling capabilities that allow for the automated scheduling of exploratory transformations and big data analytics as required by customers. Taken together, Trifacta version 3’s emphasis on security, metadata and lineage, and operationalization delivers enterprise-grade data governance that empowers customers to deploy Trifacta within enterprise environments while securely managing the concurrent deployment and operationalization of multiple jobs and data explorations by a multitude of business units and teams. Meanwhile, the platform’s enhancements to user experience continue to bolster its position as one of the most powerful data wrangling platforms in the big data space, particularly insofar as it specializes in exploratory and data wrangling capabilities that differentiate it from the bevy of business intelligence and reporting platforms available today. Trifacta’s data governance functionality promises to accelerate its adoption within the enterprise, given the strength of the platform’s functionality for supporting disparate teams and use cases within an enterprise environment. Expect Trifacta to expand on its niche within the big data wrangling space by continuing to enhance its differentiation from business intelligence and big data analytics vendors.

Cloudera Leads Initiative To Render Apache Spark Viable Alternative To MapReduce

Cloudera recently announced a One Platform Initiative that aspires to make Apache Spark the default framework for processing analytics in Hadoop, ahead of MapReduce. Cloudera’s One Platform Initiative will focus on bolstering the security of Apache Spark, rendering Spark more scalable, enhancing management functionality and augmenting Spark Streaming, the Spark component that focuses on ingesting massive volumes of streaming data for use cases such as the internet of things. Cloudera’s efforts to improve the security of Apache Spark will focus on ensuring the encryption of data at rest as well as over the wire. Meanwhile, the initiative to improve the scalability of Apache Spark aims to scale it to as many as 10,000 nodes and to enhance its ability to handle computational workloads by means of an integration with Intel’s Math Kernel Library. With respect to management, Cloudera plans to deepen Spark’s integration with YARN by creating metrics that provide insight into resource utilization and by improving multi-tenant performance. Regarding Spark Streaming, Cloudera plans to render Spark Streaming more broadly available to business users via the addition of SQL semantics and the ability to support 80% of common streaming workloads.

Cloudera’s larger goal is to enhance the enterprise-readiness of Apache Spark with a view to promoting it as a viable alternative to MapReduce. All of Cloudera’s enhancements to Spark will be contributed to the Apache Spark open source project. That said, Cloudera’s leadership in stewarding the acceleration of the enterprise-readiness of Apache Spark as a MapReduce alternative promises to position it strongly as the undisputed market share and thought leader in the Hadoop distribution space, particularly given the range of its intended contributions to Spark and the depth of its vision for subsequent Spark enhancements in forthcoming months.

Advance/Newhouse Acquires 1010data For $500M

Advance/Newhouse has decided to acquire big data analytics company 1010data for $500M. 1010data facilitates big data discovery and data sharing by means of a spreadsheet-like interface. The platform boasts predictive analytics, reporting and visualization as well as solutions for data sharing and monetization. The 1010data management team will continue to lead the company, and the new capital will be used to accelerate product development and expand sales operations. The 1010data platform gained initial traction in the financial services industry but has subsequently expanded to a customer roster of over 750 companies in industries that also include retail, gaming, telecommunications and manufacturing. 1010data’s acquisition by Advance/Newhouse illustrates the vitality of market interest in big data discovery and data visualization solutions. Advance/Newhouse is the parent company of Conde Nast magazines and Bright House Networks.

Q&A With DBS-H Regarding Its Continuous Big Data Integration Platform For SQL To NoSQL

Cloud Computing Today recently had the privilege of speaking with Amos Shaltiel, CEO and co-founder, and Michael Elkin, COO and co-founder, of DBS-H, an Israel-based company that specializes in continuous Big Data integration between relational and NoSQL-based data. Topics discussed included the core capabilities of its big data integration platform, typical customer use cases and the role of data enrichment.

Cloud Computing Today: What are the core capabilities of your continuous big data integration platform for integrating SQL data with NoSQL? Is the integration unidirectional or bidirectional? What NoSQL platforms do you support?

DBS-H: DBS-H develops innovative solutions for continuous data integration between SQL and NoSQL databases. We believe that companies are going to adopt a hybrid model where relational databases such as Oracle, SQL Server, DB2 or MySQL will continue to serve customers alongside new NoSQL engines. The success of Big Data adoption will ultimately rise and fall on how easily information can be accessed by key players in organizations.

The DBS-H solution relieves the data bottlenecks associated with integrating Big Data with existing SQL data sources, making sure that everyone has transparent access to the data they are looking for, without the need to change existing systems.

Our vision is to make the data integration process simple, intuitive and fully transparent to the customer, without the need to hire highly skilled personnel for expensive maintenance of integration platforms.

Core capabilities of the DBS-H Big Data integration platform are:

1. Continuous data integration between SQL and NoSQL databases. Continuous integration represents a key factor of successful Big Data integration.
2. NoSQL data modeling and linkage to the existing relational model. We call it a “playground” where customers can:
a. Link a relational data model to a non-relational structure.
b. Create a new data design for a NoSQL database.
c. Explore “Auto Link”, where the engine automatically generates two options for a NoSQL data model based on the existing SQL ERD design.
3. Data enrichment – a capability that allows customers to add to each block of data additional information that significantly enriches that data on the target.
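To make the idea of linking a relational model to a non-relational structure concrete, here is a minimal sketch of one common mapping: embedding child rows inside their parent document. It is not DBS-H's actual algorithm or API; the `embed_children` helper, the `customers` and `orders` tables, and all field names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def embed_children(parents, children, parent_key, foreign_key, field):
    """Return one document per parent row, with matching child rows embedded as a list.

    Hypothetical helper: illustrates a parent/child (one-to-many) relation
    turned into a nested document, as a document store might hold it.
    """
    grouped = defaultdict(list)
    for child in children:
        # Drop the foreign key from the embedded copy; the nesting already encodes it.
        grouped[child[foreign_key]].append(
            {k: v for k, v in child.items() if k != foreign_key}
        )
    return [{**parent, field: grouped.get(parent[parent_key], [])} for parent in parents]


# Hypothetical relational rows, as they might come back from two SQL tables.
customers = [
    {"customer_id": 1, "name": "Ada"},
    {"customer_id": 2, "name": "Grace"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "total": 25.0},
    {"order_id": 11, "customer_id": 1, "total": 40.0},
]

documents = embed_children(customers, orders, "customer_id", "customer_id", "orders")
```

An alternative design keeps orders in their own collection and stores only a reference to the customer. Which of the two suits a workload depends on how the data is read and updated.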

Currently, we focus on unidirectional integration and avoid some of the conflict resolution scenarios specific to bidirectional continuous data integration. The unidirectional path is from SQL to NoSQL and in the near future we will add the opposite direction of NoSQL to SQL integration. Today, we support Oracle and MongoDB databases and plan to add support for additional database engines such as SQL Server, DB2, MySQL, Couchbase, Cassandra and full integration with Hadoop. We aspire to be the default solution of choice when customers think about data integration across major industry data sources.

Cloud Computing Today: What are the most typical use cases for continuous data integration from SQL to NoSQL?

DBS-H: NoSQL engines offer high performance at relatively low cost, along with a flexible schema model.

Typical use cases of continuous data integration from SQL to NoSQL are driven principally by the major NoSQL use cases, such as:

  1. Customer 360° view – creating and maintaining a unified view of a customer from multiple operational systems. This provides the ability to deliver a consistent customer experience regardless of the channel, capitalize on upsell or cross-sell opportunities and deliver better customer service. NoSQL engines provide the response times required in customer service, as well as scalability and a flexible data model. The DBS-H solution is an enabler for the “Customer 360° view” business case because it performs transparent and continuous integration from existing SQL-based data sources.
  2. User profile management – applications that manage user preferences, authentication and even financial transactions. NoSQL provides high performance and a flexible schema model for user preferences; financial transactions, however, will usually be managed by a SQL system. With DBS-H continuous data integration, financial transaction data becomes transparently available inside NoSQL engines.
  3. Catalog management – applications that manage catalogs of products, financial assets, or employee or customer data. Modern catalogs often contain user-generated data from social networks. NoSQL engines provide excellent support for flexible schemas that can be changed on the fly. Catalogs usually aggregate data from different organizational data sources such as online systems, CRM or ERP. The DBS-H solution enables transparent and continuous data integration from multiple existing SQL data sources into a new, centralized, NoSQL-based catalog system.
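The first use case above, a unified customer view assembled from several operational systems, can be sketched in a few lines. This is an illustrative assumption rather than a description of how DBS-H builds such views; the `build_customer_view` function and the `crm` and `billing` sources are hypothetical names.

```python
def build_customer_view(customer_id, sources):
    """Merge the rows for one customer from several systems into a single document.

    `sources` maps a (hypothetical) system name to a list of row dicts,
    each of which carries a `customer_id` field.
    """
    view = {"customer_id": customer_id, "sources": {}}
    for system_name, rows in sources.items():
        matches = [row for row in rows if row.get("customer_id") == customer_id]
        if matches:
            # Keep each system's rows under its own key so their origin stays visible.
            view["sources"][system_name] = matches
    return view


# Hypothetical rows from two operational SQL systems.
crm = [{"customer_id": 7, "email": "a@example.com"}]
billing = [
    {"customer_id": 7, "invoice": 1},
    {"customer_id": 8, "invoice": 2},
]

view = build_customer_view(7, {"crm": crm, "billing": billing})
```

In a continuous integration setting, a document like `view` would be updated whenever a source row changes, rather than rebuilt on demand as it is here.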

Cloud Computing Today: Do you perform any data enrichment of SQL-data in the process of its integration with NoSQL? If so, what kind of data enrichment does your platform deliver? In the event that customers prefer to leave their data in its original state, without enrichment, can they opt out of the data enrichment process?

DBS-H: The DBS-H solution contains data enrichment capabilities during the data integration process. The main idea of “data enrichment” in our case is to provide a simple way for the customer to add logical information that enriches original data by:

  1. Adding data source identification information, such as where and when the data was generated, and by whom. This can be used for auditing, for example.
  2. Classifying data based on the source. This information can be very useful when customers want to control data access based on different roles and groups inside the organization.
  3. Assessing data reliability as low, medium or high. This enrichment is useful for analytic platforms that can make different decisions based on source reliability level.

Customers can create enrichment metrics that can be added to every block of information that goes through the DBS-H integration pipeline. If no enrichment is required then the customer can opt out of the enrichment step.
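The three kinds of enrichment described above, plus the opt-out, can be illustrated with a small sketch. It assumes records are plain dictionaries and that enrichment metadata lives under a reserved `_enrichment` key. These choices, along with the `enrich` and `integrate` function names, are hypothetical and do not reflect the DBS-H product's interface.

```python
from datetime import datetime, timezone

RELIABILITY_LEVELS = ("low", "medium", "high")


def enrich(record, source, generated_by, classification, reliability, generated_at=None):
    """Return a copy of `record` with source, classification and reliability metadata."""
    if reliability not in RELIABILITY_LEVELS:
        raise ValueError(f"reliability must be one of {RELIABILITY_LEVELS}")
    enriched = dict(record)
    enriched["_enrichment"] = {
        # Source identification: where, when and by whom the data was generated.
        "source": source,
        "generated_by": generated_by,
        "generated_at": (generated_at or datetime.now(timezone.utc)).isoformat(),
        # Classification, e.g. for role-based access decisions on the target.
        "classification": classification,
        # Reliability level that downstream analytics can act on.
        "reliability": reliability,
    }
    return enriched


def integrate(records, enricher=None):
    """Pass records through the pipeline; with no enricher they are copied unchanged."""
    return [enricher(r) if enricher else dict(r) for r in records]


rows = [{"account": "A-1", "balance": 100}]
stamp = datetime(2015, 9, 22, tzinfo=timezone.utc)

enriched_rows = integrate(
    rows,
    enricher=lambda r: enrich(r, "oracle-prod", "billing-app", "finance", "high", stamp),
)
plain_rows = integrate(rows)  # opt-out: no enrichment step
```

Keeping the metadata under a single reserved key leaves the original fields untouched, so consumers that do not care about enrichment can ignore it.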