On Thursday, Cloudera announced the release of Cloudera Director 2.0, the next version of its platform for deploying and managing Cloudera Enterprise in cloud environments. In conjunction with Cloudera Manager, Cloudera Director 2.0 enables users to deploy CDH clusters in a cloud infrastructure by means of configuration scripts that launch the CDH cluster, schedule queries, retrieve Hadoop-based data and terminate the cluster when it is no longer required. Cloudera Director 2.0 also gives customers the ability to run ETL and modeling workloads on spot instances, thereby decreasing the operational costs associated with hosting. In addition, this version can launch and terminate clusters in response to the execution of specific jobs, delivering enhanced automation for the management of cloud-based CDH clusters that gives customers greater control over their deployments as well as an opportunity to reduce costs. Thursday's release also introduces the ability to clone and repair clusters with minimal disruption to the deployment. Meanwhile, Cloudera's beta RecordService, a distributed data service for unified access control and security, supports "secure, multi-tenant access" for all users analyzing Hadoop data in Amazon S3 and other Hadoop storage repositories. By giving customers fine-grained control over operational processes such as cluster launch, cluster termination and query management, along with improved scalability for business intelligence and analytic workloads, Cloudera Director 2.0 promises to entice customers to leverage the agility and economics of the public cloud to complement their on-premises Hadoop deployments.
As the only Hadoop distribution that supports hybrid cloud environments, Cloudera empowers customers to nimbly deploy Hadoop workloads on Amazon Web Services, Google Cloud Platform or Microsoft Azure with granular controls that collectively deliver optimized cost, greater operational control, improved scalability, enhanced automation and more robust security within their cloud deployments. Version 2.0 of Cloudera Director accelerates the industry-wide convergence of cloud computing and Big Data by giving customers enterprise-grade, self-service tools to manage their Hadoop workloads in the cloud from a single pane of glass.
Cloudera Releases Cloudera Director 2.0 Marked By Enhanced Operational Granularity And Automation Of Cloud-Based Hadoop Deployments
Guest Blog Post: “SonarW: Making Big Data Easy Through NoSQL” By Ron Bennatan, Co-Founder Of jSonar Inc.
The following guest blog post was authored by Ron Bennatan, co-founder of jSonar Inc.
SonarW: An Architecture for Speed, Low Cost and Simplicity
SonarW is a purpose-built NoSQL Big Data warehouse and analytics platform for today’s flexible modern data. It is ultra-efficient, utilizing parallel processing and demanding less hardware than other approaches. Moreover, SonarW brings NoSQL simplicity to the Big Data world.
Key architectural features include:
- JSON-native columnar persistence: This works well for both structured and unstructured data; data is always compressed and can be processed in parallel for every operation.
- Indexing and Partitioning: All data is indexed using patent-pending Big Data indexes.
- Parallel and Distributed Processing: Everything is done in parallel, both across nodes and within a node, to keep clusters small and cost-effective.
- JSON Optimized Code: Designed from the ground up for efficient columnar JSON processing.
- Lock-less Data Structures: Built for multi-threaded, multicore, and SIMD processing.
- Ease of Use: SonarW inherits its ease of use and simplicity from the NoSQL world and is 100 percent MongoDB compatible. Big Data teams are more productive and can spend less time on platform and code.
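The JSON-native columnar idea in the list above can be sketched in a few lines of plain Python. This is a conceptual illustration only, not SonarW's actual storage code: each field of every document is "shredded" into its own column, so compression and scans can operate per field and in parallel.

```python
def shred(docs):
    """Decompose JSON-like documents into per-field columns.

    Missing fields are recorded as None so that row order is
    preserved across all columns.
    """
    fields = {key for doc in docs for key in doc}
    return {f: [doc.get(f) for doc in docs] for f in fields}

docs = [
    {"user": "ann", "clicks": 3},
    {"user": "bob", "clicks": 5, "country": "US"},
]
cols = shred(docs)
# Each field now lives in its own list, e.g. cols["clicks"] == [3, 5],
# ready for per-column compression and parallel scanning.
```

A columnar engine would additionally compress each column and scan only the columns a query touches, which is where the hardware savings described above come from.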
Due to these architectural advantages over today's Big Data warehousing approaches, SonarW defers the need for large clusters: it scales to any size, but does not require the unreasonable number of nodes that other Big Data solutions need to perform the same workloads. As a result, the platform reduces both hardware costs and the cost of managing these clusters.
Why is there a Need for a NoSQL Data Warehouse for Big Data Analytics?
Big Data implementations can be complex
Big Data is no longer a stranger to the IT world. Organizations everywhere have embarked on the Big Data path and are building data lakes, new forms of the Enterprise Data Warehouse, and more. But many of them still struggle to reap the benefits, and some are stuck in the "collection phase". Landing the data is always the first phase, and that tends to be successful; it's the next phase, the usage phase, such as producing useful Big Data analytics, that is hard. Some call this the "Hadoop Hangover". Some never go past the ETL phase, using the data lake as no more than an ETL area and loading the data back into conventional data stores. Some give up.
When these initiatives stall, the reason is complexity. Meanwhile, on the other "side" of the data management arena, the NoSQL world has perfected precisely the simplicity that is missing. Perhaps the main reason that NoSQL databases such as MongoDB have been so successful is their appeal to developers, who find them easy to use and who feel they are an order of magnitude more productive than in other environments.
Bringing NoSQL Simplicity to Big Data
So why not merge the two? Why not take NoSQL's simplicity and bring it to the Big Data world? That was precisely the question we put to ourselves when we set out to build SonarW: a Big Data warehouse that has the look-and-feel of MongoDB, the speed and functionality of MPP RDBMS warehouses, and the scale of Hadoop.
- Simple, but not simplistic.
- Flexible, yet with enough self-describing structure to make it effective.
- Structured, but easy to work with, able to express anything, and bringing the simplicity and flexibility that people love.
JSON is the fastest-growing data format on earth, by a wide margin. It is also the perfect foundation for Big Data, where disparate sources need to quickly flow in and be used for deriving insight.
For SonarW, we started with JSON and asked ourselves how we could make it scale. The answer was compressed columnar storage of JSON, coupled with rich analytic pipelines that execute directly on the JSON data. Everything looks like a NoSQL data pipeline similar to MongoDB, Google Dremel or other modern data flows, but it executes on an efficient columnar fabric, all without the need to define a schema, work hard to normalize data, or give up structure entirely.
Efficient scalability also reduces complexity
The other goal we set for SonarW is efficiency. Everything scales horizontally these days, and SonarW is no exception. But scaling horizontally lets one hide inefficiencies: throw enough hardware at anything and it goes fast. It also becomes expensive, especially in the enterprise, where costs and charge-backs are high. We fondly refer to SonarW as "Big-but-Lean Data"; that is, it's good to scale, but it's better to do it efficiently. As an example, the figure below shows the number of nodes and the cost to run the Big Data Benchmark on a set of platforms. All of these systems achieved the same minimal performance scores (with Redshift and SonarW being faster than the others), but the size and cost of the clusters differed (in both charts, smaller is better).
NoSQL can optimize Big Data analytics success
A NoSQL approach has proven highly successful for Big Data OLTP databases, as offered by companies such as MongoDB. However, no such capability has been available for Big Data analytics. SonarW was built from the ground up, with a JSON columnar architecture, to provide a simple NoSQL interface along with MPP speeds and efficient scalability, optimizing the developer's ability to deliver on Big Data analytics projects.
For more information about jSonar and SonarW, please visit www.jsonar.com.
Big Data Benchmark: Breakthrough Cost and Performance Results
One of the benchmarks used for Big Data workloads is the "Big Data Benchmark," which is run by the AMPLab at Berkeley. This benchmark runs workloads on representatives from the Hadoop ecosystem (e.g., Hive, Spark, Tez), as well as from MPP environments. Note SonarW's performance and cost in comparison to Tez, Shark, Redshift, Impala and Hive.
Ron Bennatan Vita
Ron Bennatan is a co-founder at jSonar Inc. He has been a "database guy" for 25 years and has worked at companies such as J.P. Morgan, Merrill Lynch, Intel, IBM and AT&T Bell Labs. He was co-founder and CTO at Guardium, which was acquired by IBM, where he later served as a Distinguished Engineer and the CTO for Big Data Governance. He is now focused on NoSQL Big Data analytics. He holds a Ph.D. in Computer Science and has authored 11 technical books.
The following video by Cloudera CEO Mike Olson elaborates on the significance of Apache Spark in the Hadoop landscape, with a particular focus on its differentiation from MapReduce. The video prefigures Cloudera’s One Platform Initiative aimed at rendering Spark a viable alternative to MapReduce.
The following infographic, from the IBM Big Data and Analytics Hub, adds a fourth V, veracity, to the three Vs traditionally associated with Big Data: volume, velocity and variety. The infographic neatly summarizes some of the vocabulary traditionally associated with discussions of big data.
On September 22, Trifacta announced the release of Trifacta version 3, featuring notable improvements related to data governance and user empowerment. Trifacta specializes in data wrangling for analytics, with a focus on data exploration that enables users to quickly understand the landscape of big data sets as a preliminary step toward performing more focused analytics. Trifacta v3 enhances user productivity through the introduction of visual "transformation cards" that graphically represent the platform's suggestions for intelligently transforming data. The release's emphasis on user empowerment also extends to automated multi-dataset transformations, column-aware transforms, and enhanced connectivity to external data sources such as Hive, Amazon S3 and relational databases. With respect to data governance, Trifacta now boasts support for Hadoop security standards such as Kerberos, in addition to LDAP integration and role-based access control. Moreover, the platform delivers metadata and lineage functionality that allows users to understand the history of data fields, in conjunction with the scripts and transformations that have contributed to the emergence of different data objects and their associated metadata.
Finally, Trifacta v3 also features operationalization functionality, marked by support for the schedulers Chronos and Tidal and by advanced scheduling capabilities that allow for the automated scheduling of exploratory transformations and big data analytics as required by customers. Taken together, Trifacta version 3's emphasis on security, metadata and lineage, and operationalization delivers enterprise-grade data governance that empowers customers to deploy Trifacta within enterprise environments while securely managing the concurrent deployment and operationalization of multiple jobs and data explorations by a multitude of business units and teams. Meanwhile, the platform's enhancements to user experience continue to bolster its unique position as one of the most powerful data wrangling platforms in the big data space, particularly insofar as its exploratory and data wrangling capabilities differentiate it from the bevy of business intelligence and reporting platforms available today. Trifacta's data governance functionality promises to accelerate its adoption within the enterprise, given the strength of the platform's support for disparate teams and use cases within an enterprise environment. Expect Trifacta to expand on its niche within the big data wrangling space by continuing to enhance its differentiation from business intelligence and big data analytics vendors.
Cloudera recently announced a One Platform Initiative that aspires to make Apache Spark the default framework for processing analytics in Hadoop, ahead of MapReduce. Cloudera's One Platform Initiative will focus on bolstering the security of Apache Spark, rendering Spark more scalable, enhancing management functionality and augmenting Spark Streaming, the Spark component that ingests massive volumes of streaming data for use cases such as the Internet of Things. Cloudera's efforts to improve the security of Apache Spark will focus on ensuring the encryption of data at rest as well as over the wire. Meanwhile, the initiative to improve the scalability of Apache Spark aims to render it scalable to as many as 10,000 nodes, including an enhanced ability to handle computational workloads by means of an integration with Intel's Math Kernel Library. With respect to management, Cloudera plans to deepen Spark's integration with YARN by creating metrics that provide insight into resource utilization, along with improvements to multi-tenant performance. Regarding Spark Streaming, Cloudera plans to render it more broadly available to business users via the addition of SQL semantics and the ability to support 80% of common streaming workloads.
Cloudera's larger goal is to enhance the enterprise-readiness of Apache Spark with a view to promoting it as a viable alternative to MapReduce. All of Cloudera's enhancements to Spark will be contributed to the Apache Spark open source project. That said, Cloudera's leadership in accelerating the enterprise-readiness of Apache Spark as a MapReduce alternative promises to position it strongly as the market-share and thought leader in the Hadoop distribution space, particularly given the range of its intended contributions to Spark and the depth of its vision for subsequent Spark enhancements in the coming months.
Advance/Newhouse has decided to acquire big data analytics company 1010data for $500M. 1010data facilitates big data discovery and data sharing by means of a spreadsheet-like interface. The platform boasts predictive analytics, reporting and visualization, as well as solutions for data sharing and monetization. The 1010data management team will continue to lead the company, and the new capital will be used to accelerate product development and expand sales operations. The 1010data platform gained initial traction in the financial services industry but has subsequently expanded to a customer roster of over 750 companies spanning industries that include retail, gaming, telecommunications and manufacturing. 1010data's acquisition by Advance/Newhouse illustrates the vitality of market interest in big data discovery and data visualization solutions. Advance/Newhouse is the parent company of Conde Nast magazines and Bright House Networks.