Category Archives: Big Data

Datos IO Introduces RecoverX, Platform For Big Data And Cloud-Based Application Backup And Recovery

Datos IO recently announced the general availability of RecoverX, a data protection software platform for third platform applications and distributed databases. RecoverX can protect and recover SaaS, IoT and analytics applications in both on-premises and cloud-based environments. Designed for database environments that include DataStax, Apache Cassandra and MongoDB, RecoverX responds to a market need for scale-out data protection solutions capable of tackling the data recovery problems specific to distributed database and application infrastructures. RecoverX features deduplication capabilities that shrink the time and storage space required for backups by backing up only incrementally new data rather than the entire dataset in question. The platform also delivers a scalable versioning infrastructure that simplifies identifying backup versions for recovery. RecoverX allows customers to designate a subset of their infrastructure, whether a workload, infrastructure component, business unit or line of business, for recovery as required. Datos IO co-founder and CEO Tarun Thakur remarked on the positioning of Datos IO and RecoverX within the broader contemporary computing market as follows:

As enterprises seek to exploit large-volume, high-ingestion and real-time data, they are turning to scalable, eventually consistent, key-value systems. This fundamental shift raises critical issues in the lifecycle of data management, most notably the need for backup and recovery solutions able to scale with these next-generation systems. RecoverX empowers application and IT owners with enterprise-ready data protection purpose-built for third platform applications – and puts Datos IO at the forefront of a market space experiencing massive growth, yet until recently lacking in data protection and management at scale.

Here, Thakur speaks to RecoverX’s ability to deliver enterprise-grade data protection for “large-volume, high-ingestion and real-time” use cases at a scale that matches heterogeneous ecosystems of third platform applications. Moreover, Thakur notes the shift in contemporary data management toward “eventually consistent, key-value systems” such as MongoDB and Cassandra that can handle massive, highly dynamic updates to databases at scale. RecoverX differentiates itself from traditional backup and recovery solutions by scaling to the large volumes of varied, high-velocity data housed in distributed applications and infrastructures. By enabling customers to perform one-click backups in minutes via its Consistent Orchestrated Distributed Recovery (CODR) architecture, the platform relieves customers of the burden of backup and recovery for large, dynamic data infrastructures and delivers granular visibility into the status of backup operations, per the screenshot below:
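The incremental-deduplication idea described above can be illustrated with a toy sketch. This is not Datos IO's implementation or API; the class and method names are hypothetical, and the sketch only shows the general technique: store each piece of content once, keyed by a content hash, so each new backup version physically writes only data not already captured, while a version catalog still supports full point-in-time recovery.

```python
import hashlib
import json

class IncrementalBackupCatalog:
    """Toy sketch of incremental, deduplicated backup versioning
    (hypothetical names; not the RecoverX implementation)."""

    def __init__(self):
        self.chunks = {}     # content hash -> stored record payload
        self.versions = []   # version i -> set of hashes comprising it

    def _digest(self, record):
        # Canonical JSON form so identical records hash identically.
        return hashlib.sha256(
            json.dumps(record, sort_keys=True).encode()).hexdigest()

    def backup(self, records):
        """Record a new backup version; physically store only new content."""
        hashes, newly_written = set(), 0
        for record in records:
            h = self._digest(record)
            if h not in self.chunks:   # dedupe: unseen content only
                self.chunks[h] = record
                newly_written += 1
            hashes.add(h)
        self.versions.append(hashes)
        return newly_written

    def recover(self, version):
        """Reassemble the full dataset as of a given backup version."""
        return [self.chunks[h] for h in self.versions[version]]
```

A second backup that shares most records with the first writes only the delta, which is why incremental deduplication shrinks both backup time and storage.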

Scale-Out Software

Learn more about the Datos IO story as told by investors and partners through the video clip below:

Pivotal Embraces Hortonworks Hadoop Distribution While Hortonworks Commits To Supporting Pivotal’s Hadoop Native SQL Platform

On April 12, Hortonworks and Pivotal announced a deepening of their strategic relationship as major players in the market for commercial Hadoop distributions and Big Data analytics. Pivotal will standardize its Hadoop distribution on Pivotal HDP, a platform identical to the Hortonworks Data Platform. Meanwhile, Hortonworks will offer Pivotal’s SQL-on-Hadoop platform, Pivotal HDB, within its own portfolio under the branding Hortonworks HDB. Powered by Apache HAWQ, Hortonworks HDB will be identical to Pivotal HDB and illustrates Hortonworks’ standardization on a Pivotal technology as a counterpoint to Pivotal’s embrace of the Hortonworks Data Platform as its core Hadoop distribution. The expanded collaboration means that Pivotal aligns with one of the industry’s most widely used Hadoop distributions, while Hortonworks can now brand and offer professional services for Pivotal’s Hadoop Native SQL platform in ways that give it parity with Cloudera’s Impala. The deepened partnership promises to complicate the battle for Hadoop distribution supremacy by giving Hortonworks enhanced access to Pivotal’s renowned big data application development and analytics capabilities. As such, the real winner here is Hortonworks, although Pivotal stands to gain from standardizing on HDP, which enables it to focus on its core strengths in analytics and application development.

LinkedIn Open Sources Dr. Elephant To Facilitate Optimization Of Hadoop-based Flows

LinkedIn recently announced the open sourcing of Dr. Elephant, a tool that helps Hadoop users optimize their flows. Dr. Elephant aggregates and analyzes data about Hadoop jobs and delivers suggestions for optimizing those jobs to increase their efficiency. Whereas most Hadoop optimization tools focus on simplifying and streamlining the management of Hadoop clusters, Dr. Elephant focuses on optimizing Hadoop flows. As noted in a LinkedIn blog post, the platform leverages “pluggable, configurable, rule-based heuristics” to provide analytical insight into job performance along with recommendations for performance optimization. Used by LinkedIn to enhance developer productivity and improve the efficiency of Hadoop clusters by optimizing their constituent flows, Dr. Elephant delivers an aggregated dashboard of all the jobs that run on a specific cluster, with drill-down visualizations of flow performance for each job. The platform specializes in diagnostics at the job level rather than the cluster level, and LinkedIn uses it to diagnose and solve over 80% of its flow performance questions. Open sourced under an Apache 2.0 license, Dr. Elephant is compatible with Apache Hadoop and Apache Spark and plays in the same space as Driven, the Big Data application performance management framework pioneered by Concurrent Inc.
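To make the idea of a "pluggable, configurable, rule-based heuristic" concrete, here is a hypothetical sketch of the kind of rule such a tool might apply to a job's counters. The function name, counter names and thresholds are illustrative assumptions, not Dr. Elephant's actual API; the point is the pattern: each heuristic inspects job metrics against configurable thresholds and emits a severity plus a tuning suggestion.

```python
# Hypothetical rule-based heuristic in the style described above.
# Counter names and thresholds are illustrative, not Dr. Elephant's.

def gc_time_heuristic(counters, thresholds=(0.05, 0.10, 0.20)):
    """Flag jobs that spend too large a fraction of CPU time in GC."""
    ratio = counters["gc_time_ms"] / counters["cpu_time_ms"]
    # Each threshold crossed raises severity by one level.
    severity = sum(ratio >= t for t in thresholds)
    labels = ["NONE", "LOW", "MODERATE", "SEVERE"]
    return {
        "rule": "GC time ratio",
        "ratio": round(ratio, 3),
        "severity": labels[severity],
        "advice": None if severity == 0
        else "Increase task heap size or reduce object churn.",
    }
```

A dashboard like Dr. Elephant's would run many such rules over every job on a cluster and aggregate the resulting severities per flow.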

Cloudera Releases Cloudera Director 2.0 Marked By Enhanced Operational Granularity And Automation Of Cloud-Based Hadoop Deployments

On Thursday, Cloudera announced the release of Cloudera Director 2.0, the next version of Cloudera’s platform for deploying and managing Cloudera Enterprise within cloud environments. In conjunction with Cloudera Manager, Cloudera Director 2.0 empowers users to deploy CDH clusters within a cloud infrastructure by using configuration scripts to launch a CDH cluster, schedule queries, retrieve Hadoop-based data and terminate the cluster when required. Moreover, Cloudera Director 2.0 gives customers the ability to add ETL and modeling to workloads using spot instance support, thereby decreasing hosting costs. This version also enables clusters to be launched and terminated by the execution of specific jobs, delivering greater automation in the management of cloud-based CDH clusters and correspondingly giving customers more control over their deployments as well as the opportunity to decrease costs. In addition, Thursday’s release features the ability to both clone and repair clusters with zero to minimal disruption to the deployment. Meanwhile, Cloudera’s beta RecordService, a distributed data service for unified access control and security, supports “secure, multi-tenant access” for all users analyzing Hadoop data in Amazon S3 and other Hadoop storage repositories. By giving customers fine-grained control over operational processes such as cluster launch, cluster termination and query management, as well as improved scalability for business intelligence and analytics workloads, Cloudera Director 2.0 promises to entice customers to leverage the agility and economics of the public cloud to complement their on-premises Hadoop deployments.
As the only Hadoop distribution that supports hybrid cloud environments, Cloudera empowers customers to nimbly deploy Hadoop workloads on Amazon Web Services, Google Cloud Platform or Microsoft Azure with nuanced, granular controls that collectively deliver optimized cost, greater operational control, improved scalability, enhanced automation and more robust security within their cloud deployments. Version 2.0 of Cloudera Director accelerates the industry-wide convergence of cloud computing and Big Data by giving customers enterprise-grade, self-service tools to manage their Hadoop workloads in the cloud from a single pane of glass.
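The job-scoped cluster lifecycle described above, where a cluster is launched for specific jobs and terminated once they finish, can be sketched as follows. This is an illustrative pattern, not the Cloudera Director API; the callables are placeholders for whatever provisioning and job-submission mechanism a deployment uses. The key point is the `try/finally`, which guarantees the cloud cluster is reclaimed even if a job fails, so instances never outlive the work they were created for.

```python
# Illustrative sketch of a transient, job-scoped cluster lifecycle
# (placeholder callables; not the Cloudera Director API).

def run_transient_cluster(launch, jobs, terminate):
    """Launch a cluster, run the given jobs on it, and always
    terminate the cluster afterward, even on job failure."""
    cluster = launch()          # e.g. provision a cloud CDH cluster
    results = []
    try:
        for job in jobs:
            results.append(job(cluster))
    finally:
        terminate(cluster)      # always reclaim cloud capacity
    return results
```

Automating this launch-run-terminate loop is what lets transient workloads pay only for the minutes a cluster is actually doing work.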

Guest Blog Post: “SonarW: Making Big Data Easy Through NoSQL” By Ron Bennatan, Co-Founder Of jSonar Inc.

The following guest blog post was authored by Ron Bennatan, co-founder of jSonar Inc.

SonarW: An Architecture for Speed, Low Cost and Simplicity

SonarW is a purpose-built NoSQL Big Data warehouse and analytics platform for today’s flexible modern data. It is ultra-efficient, utilizing parallel processing and demanding less hardware than other approaches. Moreover, SonarW brings NoSQL simplicity to the Big Data world.

Key architectural features include:

  • JSON-native columnar persistence: Works well for both structured and unstructured data; data is always compressed and can be processed in parallel for every operation.
  • Indexing and Partitioning: All data is indexed using patent-pending Big Data indexes.
  • Parallel and Distributed Processing: Everything is done in parallel, both across nodes and within a node, to keep clusters small and cost-effective.
  • JSON Optimized Code: Designed from the ground up for efficient columnar JSON processing.
  • Lock-less Data Structures: Built for multithreaded, multicore and SIMD processing.
  • Ease of Use: SonarW inherits its ease of use and simplicity from the NoSQL world and is 100 percent MongoDB compatible. Big Data teams are more productive and can spend less time on platform and code.
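The first feature above, JSON-native columnar persistence, can be illustrated with a toy sketch. This is not SonarW's actual on-disk format; it is a minimal sketch of the general technique: documents are shredded into per-field columns, so a scan or aggregation touches only the fields it needs, schema can vary across documents, and each column compresses well because its values are homogeneous.

```python
# Toy illustration of JSON-native columnar layout (general technique
# only; not SonarW's actual storage format).

def to_columns(docs):
    """Shred a list of JSON-like documents into per-field columns,
    padding missing fields with None."""
    fields = sorted({key for doc in docs for key in doc})
    return {f: [doc.get(f) for doc in docs] for f in fields}

docs = [
    {"user": "ann", "clicks": 3},
    {"user": "bob", "clicks": 5, "region": "EU"},  # schema can vary
]
cols = to_columns(docs)
```

Summing the `clicks` column, for example, reads one homogeneous array instead of parsing every document, which is the core of the speed and compression claims.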

Due to these architectural advantages over today’s Big Data warehousing approaches, SonarW defers the need for large clusters: it scales to any size but does not require the unreasonable number of nodes other Big Data solutions need for comparable workloads. As a result, the platform reduces both hardware costs and the costs of managing these clusters.

Why is there a Need for a NoSQL Data Warehouse for Big Data Analytics?

Big Data implementations can be complex

Big Data is no longer a stranger to the IT world. All organizations have embarked on the Big Data path and are building data lakes, new forms of the Enterprise Data Warehouse, and more. But many of them still struggle to reap the benefits, and some are stuck in the “collection phase”. Landing the data is always the first phase, and that tends to be successful; it is the next phase, the usage phase (such as producing useful Big Data analytics), that is hard. Some call this the “Hadoop Hangover”. Some never go past ETL, using the data lake as no more than a staging area and loading the data back into conventional data stores. Some give up.

When these initiatives stall, the reason is complexity. But while all this is happening, on the other “side” of the data management arena, the NoSQL world has perfected precisely that simplicity. Perhaps the main reason NoSQL databases such as MongoDB have been so successful is their appeal to developers, who find them easy to use and feel an order of magnitude more productive than in other environments.

Bringing NoSQL Simplicity to Big Data

So why not merge the two? Why not take NoSQL’s simplicity and bring it to the Big Data world? That was precisely the question we put to ourselves when we went out to build SonarW – a Big Data warehouse that has the look-and-feel of MongoDB, the speed and functionality of MPP RDBMS warehouses and the scale of Hadoop.

As in other NoSQL-based systems, many of the advantages stem from the nature of JSON documents. JavaScript Object Notation (JSON) is a perfect middle ground between structure and flexibility. JSON has become ubiquitous and is considered the lingua franca of web, mobile, social media and IoT applications. JSON is:

  • Simple, but not simplistic.
  • Flexible, yet self-describing enough to be effective.
  • Structured, but easy to work with, able to express anything, and able to bring the simplicity and flexibility that people love.

JSON is the fastest-growing data format on earth, by a lot. It is also the perfect foundation for Big Data, where disparate sources need to flow in quickly and be used to derive insight.

For SonarW, we started with JSON and asked ourselves how to make it scale. The answer was compressed columnar storage of JSON coupled with rich analytic pipelines that execute directly on the JSON data. Everything looks like a NoSQL data pipeline similar to MongoDB or Google Dremel or other modern data flows, but it executes on an efficient columnar fabric, all without the need to define a schema, to work hard at normalizing data, or to give up structure and control entirely.
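Since SonarW is MongoDB-compatible, the pipelines in question take the familiar MongoDB aggregation shape. As a sketch of what such a pipeline looks like and does, here is a minimal pure-Python evaluator for two stages, `$match` (equality only) and `$group` with `$sum`; a real engine like SonarW would execute the same pipeline on its columnar storage rather than on Python lists.

```python
# Minimal evaluator for a MongoDB-style aggregation pipeline,
# restricted to equality $match and $group with $sum, purely to
# show the pipeline shape (not how SonarW executes it internally).

def aggregate(docs, pipeline):
    for stage in pipeline:
        if "$match" in stage:
            cond = stage["$match"]
            docs = [d for d in docs
                    if all(d.get(k) == v for k, v in cond.items())]
        elif "$group" in stage:
            spec = stage["$group"]
            key_field = spec["_id"].lstrip("$")
            groups = {}
            for d in docs:
                key = d.get(key_field)
                g = groups.setdefault(key, {"_id": key})
                for out, expr in spec.items():
                    if out == "_id":
                        continue
                    field = expr["$sum"].lstrip("$")
                    g[out] = g.get(out, 0) + d.get(field, 0)
            docs = list(groups.values())
    return docs

sales = [
    {"region": "EU", "amount": 10},
    {"region": "EU", "amount": 5},
    {"region": "US", "amount": 7},
]
totals = aggregate(sales, [
    {"$match": {}},
    {"$group": {"_id": "$region", "total": {"$sum": "$amount"}}},
])
```

The developer writes only the declarative pipeline at the bottom; no schema definition or normalization is required, which is the simplicity argument the post is making.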

Efficient scalability also reduces complexity

The other goal we set for SonarW is efficiency. Everything scales horizontally these days, and SonarW is no exception. But horizontal scaling lets one hide inefficiencies: throw enough hardware at anything and it goes fast, but it also becomes expensive, especially in the enterprise, where costs and charge-backs are high. We fondly refer to SonarW as “Big-but-Lean Data”: it is good to scale, but it is better to do it efficiently. As an example, the figure below shows the number of nodes and the cost to run the Big Data Benchmark on a set of platforms. All of these systems achieved the same minimal performance scores (with Redshift and SonarW being faster than the others), but the size and cost of the clusters differed (in both charts, smaller is better).

NoSQL can optimize Big Data analytics success

A NoSQL approach has proven highly successful for Big Data OLTP databases, as provided by companies such as MongoDB. However, no comparable capability has been available for Big Data analytics. SonarW was built from the ground up, with a JSON columnar architecture, to provide a simple NoSQL interface along with MPP speeds and efficient scalability, optimizing a developer’s ability to deliver on Big Data analytics projects.

For more information about jSonar and SonarW please visit

Big Data Benchmark: Breakthrough Cost and Performance Results

One of the benchmarks used for Big Data workloads is the “Big Data Benchmark” run by the AMPLab at UC Berkeley. This benchmark runs workloads on representatives of the Hadoop ecosystem (e.g., Hive, Spark, Tez), as well as MPP environments. Note SonarW’s performance and cost in comparison to Tez, Shark, Redshift, Impala and Hive.


Ron Bennatan Vita

Ron Bennatan is a co-founder at jSonar Inc. He has been a “database guy” for 25 years and has worked at companies such as J.P. Morgan, Merrill Lynch, Intel, IBM and AT&T Bell Labs. He was co-founder and CTO at Guardium, which was acquired by IBM, where he later served as a Distinguished Engineer and CTO for Big Data Governance. He is now focused on NoSQL Big Data analytics. He has a Ph.D. in Computer Science and has authored 11 technical books.