BigCouch Integration With CouchDB Brings Clustering And Improved Database Compaction To CouchDB

On Monday, Database as a Service vendor Cloudant announced plans to integrate its database service, BigCouch, into the Apache CouchDB project. BigCouch is an open source fork of CouchDB designed to support large-scale, distributed applications. The integration of BigCouch with CouchDB will provide CouchDB with enhanced scalability and performance in a move that is likely to accelerate adoption of the NoSQL CouchDB platform. In conjunction with its decision to integrate BigCouch into CouchDB, Cloudant announced that it will cease development of the BigCouch platform that was inspired by Amazon’s famous Dynamo research paper.

CouchDB will benefit principally from the clustering functionality that became one of the trademarks of BigCouch. Unlike CouchDB nodes, BigCouch nodes reside in elastic clusters marked by consistent hashing, quorum rules for read/write operations and parallel indexing on data partitions. The picture below illustrates this with a three-node BigCouch development cluster, in contrast to the unified CouchDB configuration at the top of the picture:

Graphic source: Cloudant’s BigCouch is open-source

Parallel indexing across clusters allows the BigCouch configuration to demonstrate significant improvements in indexing speed in comparison to serial indexing of one database. CouchDB will also benefit from BigCouch’s database compaction functionality, replication speed and high-concurrency access performance.
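
The consistent hashing and quorum rules mentioned above can be illustrated with a short Python sketch. This is a toy model written for this article, not BigCouch's actual code: the class name HashRing, the node names, the use of MD5 and the N/R/W parameter names are all assumptions chosen for illustration.

```python
import bisect
import hashlib


def ring_position(key: str) -> int:
    """Map a string to a position on the hash ring (MD5 is used only to spread keys)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring; an illustration, not BigCouch's implementation."""

    def __init__(self, nodes, replicas=3):
        # Never ask for more copies than there are nodes.
        self.replicas = min(replicas, len(nodes))
        self.ring = sorted((ring_position(node), node) for node in nodes)
        self.positions = [position for position, _ in self.ring]

    def preference_list(self, doc_id: str):
        """Return the nodes holding copies of doc_id, walking clockwise from its hash."""
        start = bisect.bisect_right(self.positions, ring_position(doc_id))
        return [
            self.ring[(start + step) % len(self.ring)][1]
            for step in range(self.replicas)
        ]


def quorum_met(acks: int, quorum: int) -> bool:
    """A read or write succeeds once enough replicas have answered."""
    return acks >= quorum


def quorums_overlap(n: int, r: int, w: int) -> bool:
    """With N copies, R + W > N guarantees every read set meets every write set."""
    return r + w > n
```

For example, HashRing(["node1", "node2", "node3"]).preference_list("doc-42") returns the three hypothetical nodes in the order they would be contacted, and quorums_overlap(3, 2, 2) reports that two-of-three reads and writes always intersect.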

Adam Kocoloski, co-founder and CTO at Cloudant, remarked on the merging of BigCouch with CouchDB as follows:

There are a lot of reasons people love CouchDB, like its elegant programming model, data durability, flexible indexing, and, most of all, its unique way of replicating and synching data across data centers or devices. We’re merging the horizontal scaling and fault-tolerance framework we built for BigCouch into CouchDB so people can more easily scale all that CouchDB goodness across multiple servers and keep it running nonstop. It’s our way of saying thanks and helping to grow the community of CouchDB developers and users.

Interested users can access a preview of the merger of CouchDB and BigCouch now, although the generally available version of the integrated database will be released according to the Apache Software Foundation's code release process. The integration of these two open source platforms represents a significant boost to the NoSQL community. Options in the NoSQL space continue to proliferate and deepen in functionality, as exemplified by Garantia’s recent acquisition of MyRedis.


Apache Releases Version 1.2 Of NoSQL Database Cassandra

On Wednesday, the Apache Software Foundation announced the release of Cassandra version 1.2, the high performance, highly scalable, Big Data distributed NoSQL database. Cassandra is capable of managing thousands of data requests per second and is used by organizations such as Adobe, Cisco, Constant Contact, Digg, Disney, eBay, Netflix, Rackspace and Twitter.

Key components of the latest release include the following:

•Virtual nodes and clustering across virtual nodes
•Node to node communication
•Atomic batches
•Request tracing
•Version 3 of the Cassandra Query Language (CQL) to simplify the modeling of applications, enable more powerful mapping and facilitate superior database design

Jonathan Ellis, Vice President of Apache Cassandra, reflected on the significance of the Cassandra 1.2 release as follows:

We are pleased to announce Cassandra 1.2. By improving support for dense clusters —powering multiple terabytes per node— as well as simplifying application modeling, and improving data cell storage/design/representation, systems are able to effortlessly scale petabytes of data.

Here, Ellis notes that one of the key functionality upgrades specific to Cassandra consists of enhanced support for dense clusters featuring several terabytes per node. The conjunction of the platform’s improved support for dense clusters with its streamlined application modeling capability and superior design abilities allows for vastly improved scalability for petabytes of data.

Cassandra users expressed particular enthusiasm for the virtual node and atomic batch components of the new release. Software developer Kelly Sommers elaborated on the significance of Cassandra 1.2’s improved handling of virtual nodes as follows:

In Cassandra v1.2 the introduction of vnodes will simplify managing clusters while improving performance when adding and rebuilding nodes. v1.2 also includes many new features, performance improvements and further heap reduction to alleviate the burden on the JVM garbage collector.

Virtual nodes improve performance, notes Sommers. Meanwhile, reducing the burden on the JVM garbage collector enables notable performance gains of its own. A recent blog post by Twitter, which made no direct reference to Cassandra, described how JVM garbage collector optimization significantly reduced CPU time.
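
Why virtual nodes help when adding a node can be seen in a small Python model. This is a sketch under stated assumptions, not Cassandra's partitioner: the function names, the MD5-based tokens and the node names are hypothetical. It counts which existing nodes hand over keys when a new node joins a ring.

```python
import bisect
import hashlib
from collections import Counter


def token(label: str) -> int:
    """Hash a label to a ring token (MD5 is an illustrative choice)."""
    return int(hashlib.md5(label.encode("utf-8")).hexdigest(), 16)


def build_ring(nodes, tokens_per_node):
    """Return parallel lists (tokens, owners), sorted by token."""
    pairs = sorted(
        (token(f"{node}#{i}"), node)
        for node in nodes
        for i in range(tokens_per_node)
    )
    return [t for t, _ in pairs], [n for _, n in pairs]


def owner(ring, key):
    """A key belongs to the first token at or after its hash, wrapping around."""
    tokens, owners = ring
    index = bisect.bisect_left(tokens, token(key)) % len(tokens)
    return owners[index]


def donors_when_adding(nodes, new_node, tokens_per_node, keys):
    """Count, per existing node, how many keys it gives up to the new node."""
    before = build_ring(nodes, tokens_per_node)
    after = build_ring(list(nodes) + [new_node], tokens_per_node)
    donors = Counter()
    for key in keys:
        old_owner = owner(before, key)
        if owner(after, key) != old_owner:
            donors[old_owner] += 1
    return donors
```

With one token per node, the new node splits a single range, so at most one existing node streams data to it. With many tokens per node, the new node takes small slices from many ranges, which is how the work of bootstrapping or rebuilding gets spread across the cluster. Comparing donors_when_adding(nodes, "node5", 1, keys) with donors_when_adding(nodes, "node5", 64, keys) for a list of sample keys shows the difference.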

Improved performance, increased scalability and simplified application development represent the three recurring themes from user experiences of the Cassandra 1.2 release. Cassandra is known for its ability to handle massive amounts of real-time operational data, whereas Hadoop is famed for its ability to deal with batch-based volumes of data. The latest release means that Big Data just got even bigger by virtue of Cassandra 1.2’s performance enhancements and application modeling and database design simplifications.

Apache Software Foundation Releases Hadoop Version 1.0

The Apache Software Foundation announced the release of Apache Hadoop version 1.0 on January 4. Hadoop, the software framework for analyzing massive amounts of data, graduated to the version 1.0 designation after six years of gestation. Hadoop’s principal attribute consists of the capability to process massive amounts of data on clusters of computing nodes in parallel. Hadoop stores data evenly across all computing nodes, distributes the task of processing across those nodes and then synthesizes the results of data processing.
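
The pattern described above (spread the data, process each part separately, then combine the results) can be sketched in a few lines of Python. This is a single-process illustration of the idea, not Hadoop's API: the function names and the word-count task are assumptions, and the "nodes" are simply list partitions.

```python
from collections import Counter
from functools import reduce


def split_across_nodes(records, node_count):
    """Deal records out round-robin so each simulated node gets a similar share."""
    return [records[i::node_count] for i in range(node_count)]


def map_phase(partition):
    """Work done independently on one node: count the words in its share."""
    counts = Counter()
    for line in partition:
        counts.update(line.lower().split())
    return counts


def reduce_phase(partials):
    """Combine the per-node partial counts into one result."""
    return reduce(lambda left, right: left + right, partials, Counter())


def word_count(records, node_count=3):
    partitions = split_across_nodes(records, node_count)
    # In a real cluster these calls would run in parallel on separate machines.
    partials = [map_phase(partition) for partition in partitions]
    return reduce_phase(partials)
```

Because each partition is counted independently, the final totals do not depend on how many simulated nodes the records are spread over.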

Version 1.0 represents a milestone in terms of scale and enterprise-readiness. The release already features deployments as large as 50,000 nodes and marks the culmination of six years of development, testing and feedback from users and data scientists. Key features of the 1.0 release include the following:

• The integration of Hadoop’s big data table, HBase
• Performance enhanced access to local files for HBase
• Kerberos-based authentication to ensure security
• Support for webhdfs with a read/write HTTP access layer
• Miscellaneous performance enhancements and bug fixes
• All version 0.20.205 and prior 0.20.2xx features
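
To make the webhdfs item in the list above more concrete, the following Python sketch builds the kind of HTTP URL that the WebHDFS REST interface accepts. The helper name webhdfs_url and the host name are hypothetical. The port 50070 is an assumption based on the usual NameNode HTTP default in Hadoop 1.x and may differ in a given cluster. The sketch only constructs URLs; it does not contact a server.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, path, op, port=50070, **params):
    """Build a WebHDFS request URL of the form /webhdfs/v1/<path>?op=<OP>.

    host and port are placeholders for a cluster's NameNode HTTP address.
    """
    if not path.startswith("/"):
        raise ValueError("path must be an absolute HDFS path")
    query = urlencode({"op": op, **params})
    # quote() keeps "/" separators and escapes characters such as spaces.
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"
```

A read would use the OPEN operation with an HTTP GET, for example webhdfs_url("namenode.example.com", "/logs/app.log", "OPEN"), while a write would use CREATE with an HTTP PUT and a directory listing would use LISTSTATUS.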

Hadoop represented the Big Data story of 2011 insofar as it was incorporated into almost every 2011 Big Data product revealed by enterprises and startups alike. The release of Apache Hadoop 1.0 promises to propel Hadoop even closer to the enterprise and to transition the software framework from a tool for supporting web data into enterprise-grade Big Data infrastructure more generally.