Amazon Web Services Supports Impala To Facilitate Real Time, High Performance Hadoop Queries

Amazon Web Services (AWS) recently announced support for Impala, the open source technology platform developed by Cloudera for querying data in the Hadoop Distributed File System or HBase using SQL-like syntax as elaborated below:

Impala raises the bar for query performance while retaining a familiar user experience. With Impala, you can query data, whether stored in HDFS or Apache HBase – including SELECT, JOIN, and aggregate functions – in real time. Furthermore, it uses the same metadata, SQL syntax (Hive SQL), ODBC driver and user interface (Hue Beeswax) as Apache Hive, providing a familiar and unified platform for batch-oriented or real-time queries. (For that reason, Hive users can utilize Impala with little setup overhead.)

Amazon Web Services introduced Impala as part of the Amazon Elastic MapReduce project. Users will need to run Hadoop clusters that use Hadoop 2.x in order to take advantage of its Hadoop offering. Impala users can run queries on data sets in real time and enjoy low latency times enabled by the platform’s distributed query engine that allows Impala to boast speed and performance benefits over Apache Hive. The availability of Impala on the Amazon Web Services platform comes just weeks after its release of Amazon Kinesis, its platform for collecting and storing real time big data streams, and subsequently underscores the seriousness with which AWS plans to deploy products designed for the big data space.


Concurrent Announces Release Of Cascading 2.5 and Lingual 1.0 To Simplify Application Development Using Hadoop

Today, Concurrent elaborates on the release of Cascading 2.5, the open source framework for facilitating the development of applications on Apache Hadoop. Cascading 2.5 supports the recent released Hadoop 2.0 distribution including YARN and its other features. Cascading users that are interested in upgrading to Hadoop 2.0 can do so by means of Cascading 2.5. Similarly, applications that leverage the Scalding, Cascalog and PyCascading languages can migrate to Hadoop 2.0 as well by means of the Cascading 2.5 framework. The latest release of Cascading also features “complex join operations and optimizations to dynamically partition and store processed data more efficiently on HDFS,” according to the Concurrent’s press release. Finally, the release deepens its compatibility with other Hadoop distributions and Hadoop as a Service vendors such as Cloudera, Hortonworks, MapR, Intel, Altiscale, Qubole and Amazon EMR.

Cascading 2.5 represents one of the few products in either the commercial or open source ecosystem for simplifying the development of Hadoop applications while integrating with a rich and varied ecosystem of products as illustrated below:

The graphic shows how Cascading 2.5 supports all major Hadoop distributions in addition to an impressive list of development languages, database platforms and cloud platforms. In an interview with Cloud Computing Today, Concurrent CEO Gary Nakamura and CTO Chris Wensel noted the uniqueness of Cascading in the Big Data landscape, particularly given its iterative refinement in collaboration with the likes of Twitter, eBay and The Climate Corporation over a period of more than five years.

Today’s announcement regarding the general availability of Cascading 2.5 is accompanied by news of the general availability of Lingual, an ANSI-compliant SQL interface that allows developers to use SQL commands to query data stored in Hadoop clusters. Unlike Apache’s Hive project, Lingual’s ANSI-standard SQL interface enables developers to deploy authentic SQL commands as opposed to HIVE’s SQL-like syntax. Cascading Lingual also allows for the migration of legacy SQL workloads onto Hadoop clusters, the export of Hadoop data onto BI tools such as Jaspersoft, Pentaho and Talend, and the ability to leverage the power of Cascading in conjunction with SQL to orchestrate the execution of multiple SQL queries instead of several, discrete disparate queries. The Big Data space should expect more from Concurrent as it continues to build out tools for simplifying application development on Hadoop, particularly as more and more Hadoop developers come to terms with Cascading’s advantages over MapReduce.

Spotify Selects Hortonworks Over Cloudera For Its 690 Node Hadoop Cluster

Digital music service Spotify recently announced that it will migrate its Hadoop cluster from Cloudera’s Hadoop distribution to the Hortonworks Data Platform because of the Hortonworks commitment to open source development and technologies. Spotify also noted that the migration was partly due to the impressive contribution made by Hortonworks to the Apache Hive project for querying Hadoop data. Spotify began its use of Hadoop on the Amazon Web Services EMR platform with a cluster sized at approximately 30 nodes. The company subsequently decided to bring its Hadoop cluster in house, starting with a 60 node cluster. Spotify’s Hadoop distribution is now sized at 690 nodes and stores data for its 24 million users and 6 million subscribers. Its 690 node Hadoop cluster is widely regarded as one of the largest implementations of Hadoop in Europe. In addition to providing Spotify with a production-grade Hadoop distribution, Hortonworks will perform bi-annual health assessments of its Hadoop infrastructure.

Facebook Leverages Enhanced Apache Giraph To Create 1 Trillion Edge Social Graph

This week, Facebook revealed details of the technology used to power its recently released Graph search functionality. Graph databases are used to analyze relationships between associative data such as social networks, transportation data, search-related content recommendations, fraud detection, molecular biology and, more generally, any data set where the relationships between constituent data points are so numerous and dynamic that they cannot easily be captured within a manageable schema or relational database structure. Graph databases contain “nodes” or “vertices” and “edges” that indicate relationships between the different vertices/nodes.

Facebook intensified their review of graphing technologies in the summer of 2012 and selected Apache Giraph over Apache Hive and GraphLab. Facebook selected Giraph because it interfaces directly with its own version of the Hadoop Distributed File System (HDFS), allows usage of its MapReduce Corona infrastructure, supports a variety of graph application use cases, and features functionality such as master computation and composable computation. Compared to Apache Hive and GraphLab, Apache Giraph was faster. After choosing a graphing platform, the Facebook team modified the Apache Giraph code and subsequently shared their changes with the Apache Giraph open source community.

One of the use cases that Facebook leveraged in order to select Apach Giraph was its performance in a “label propagation” exercise where it probabilistically inferred data fields that are blank or unintelligible in comparison to Apache Hive and GraphLab. Many Facebook users, for example, may elect to leave their hometown or employer blank, but graphing algorithms can probabilistically assign values for the blank fields by analyzing data about a user’s friends, family, likes and online behavior. By empowering data scientists to construct more complete profiles of users, graph technology enables enhanced personalization of data such as a user’s news feed and advertising content. Facebook performed the “label propagation” comparison of Giraph, Hive and GraphLab on a relatively small scale on the order of 25 million edges.

Key attributes of Apache Giraph and its usage by Facebook include the following:

•Apache Giraph is based on Google’s Pregel and Leslie Valiant’s bulk synchronous parallel computing model
•Apache Giraph is written in Java and runs as a MapReduce job
•Facebook chose the production use cases of “label propagation, variants of page rank, and k-means clustering” in order to drive their modification of Apache Giraph code
•Facebook created a 1 trillion edge social graph using 200 commodity machines, in less than four minutes
•Facebook’s creation of 1 trillion edges is roughly two orders of magnitude greater than Twitter’s graph of 1.5 billion edges and AltaVista’s 6.5 billion edges

Facebook performed a number of tweaks on the Apache Giraph code including modification of the input model for data in Giraph, streamlined reading of Hive data, multithreading application code, memory optimization and the use of Netty instead of Zookeeper to enhance scalability. Facebook’s Social Graph was launched in January, although its platform is not nearly as powerful as end users might hope for, as of yet. Open Graph, meanwhile, is used by Facebook developers to correlate real-world actions with objects in their database, such as User X is viewing soccer match Y on network Z. The latest and greatest vesion of Giraph’s code is now available under version 1.0.0 of the project. This week’s elaboration on Facebook’s contribution to Giraph by Avery Ching’s blog post represents one of the first attempts to render mainstream the challenges specific to creating and managing a trillion edge social graph. In response, the industry should expect analogous disclosures about graphing technology from the likes of Google, Twitter and others in subsequent months.