Facebook Leverages Enhanced Apache Giraph To Create 1 Trillion Edge Social Graph

This week, Facebook revealed details of the technology used to power its recently released Graph Search functionality. Graph databases are used to analyze relationships within highly connected data, with applications in social networks, transportation data, content recommendation, fraud detection, molecular biology and, more generally, any data set whose relationships between constituent data points are too numerous and dynamic to be captured easily within a manageable schema or relational database structure. Graph databases contain “nodes” or “vertices” connected by “edges” that represent the relationships between them.

Facebook intensified its review of graph processing technologies in the summer of 2012 and selected Apache Giraph over Apache Hive and GraphLab. Facebook chose Giraph because it interfaces directly with Facebook’s own version of the Hadoop Distributed File System (HDFS), runs on its Corona MapReduce infrastructure, supports a variety of graph application use cases, and offers functionality such as master computation and composable computation. Giraph also proved faster than both Apache Hive and GraphLab. After choosing the platform, the Facebook team modified the Apache Giraph code and subsequently contributed its changes back to the Apache Giraph open source community.

One of the use cases Facebook used to evaluate Apache Giraph against Apache Hive and GraphLab was “label propagation,” an exercise in probabilistically inferring field values that are blank or unintelligible. Many Facebook users, for example, may elect to leave their hometown or employer blank, but graph algorithms can probabilistically assign values to those blank fields by analyzing data about a user’s friends, family, likes and online behavior. By empowering data scientists to construct more complete profiles of users, graph technology enables enhanced personalization of content such as a user’s news feed and advertising. Facebook performed the label propagation comparison of Giraph, Hive and GraphLab at a relatively small scale, on the order of 25 million edges.
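
The intuition behind label propagation can be sketched in a few lines of plain Java. The graph, field values and majority-vote rule below are hypothetical simplifications rather than Facebook’s production algorithm: each user with a missing hometown simply adopts the most common value among his or her friends.

import java.util.*;

/** A minimal label propagation sketch: users missing a "hometown" label
 *  adopt the most frequent value among their friends. The toy graph and
 *  majority-vote rule are hypothetical, not Facebook's production logic. */
public class LabelPropagationSketch {
    public static void main(String[] args) {
        // Hypothetical friendship graph: user -> friends
        Map<String, List<String>> friends = Map.of(
            "alice", List.of("bob", "carol"),
            "bob", List.of("alice", "carol", "dave"),
            "carol", List.of("alice", "bob"),
            "dave", List.of("bob"));

        // Known labels; "dave" left his hometown blank
        Map<String, String> hometown = new HashMap<>(Map.of(
            "alice", "Austin", "bob", "Austin", "carol", "Boston"));

        // One propagation pass: infer each missing label from friends' votes
        for (String user : friends.keySet()) {
            if (hometown.containsKey(user)) continue; // label already known
            Map<String, Integer> votes = new HashMap<>();
            for (String friend : friends.get(user)) {
                String label = hometown.get(friend);
                if (label != null) votes.merge(label, 1, Integer::sum);
            }
            votes.entrySet().stream()
                 .max(Map.Entry.comparingByValue())
                 .ifPresent(winner -> hometown.put(user, winner.getKey()));
        }
        System.out.println(hometown); // dave is assigned "Austin"
    }
}

At Facebook’s scale the same idea runs iteratively and probabilistically over hundreds of billions of edges, which is precisely the kind of workload Giraph is designed to distribute.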

Key attributes of Apache Giraph and its usage by Facebook include the following:

• Apache Giraph is based on Google’s Pregel and Leslie Valiant’s bulk synchronous parallel computing model, illustrated in the PageRank sketch following this list
• Apache Giraph is written in Java and runs as a MapReduce job
• Facebook chose the production use cases of “label propagation, variants of page rank, and k-means clustering” to drive its modification of the Apache Giraph code
• Facebook processed a social graph of 1 trillion edges on 200 commodity machines in less than four minutes per iteration
• Facebook’s 1 trillion edges exceed Twitter’s reported graph of 1.5 billion edges by nearly three orders of magnitude, and AltaVista’s 6.5 billion edges by roughly two
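
The bulk synchronous, vertex-centric model behind these numbers is easiest to see in code. The sketch below is modeled on the PageRank example distributed with Giraph; the class and method names follow Giraph’s BasicComputation API and may differ slightly across releases. In each superstep, a vertex sums the rank messages its neighbors sent during the previous superstep, updates its own value, and sends its new rank out along its edges.

import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;

/** A PageRank sketch in Giraph's vertex-centric, bulk synchronous style,
 *  modeled on the example bundled with Giraph; exact class and method
 *  names vary across Giraph releases. */
public class PageRankSketch extends BasicComputation<
    LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {

  private static final int MAX_SUPERSTEPS = 30;

  @Override
  public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
      Iterable<DoubleWritable> messages) {
    if (getSuperstep() >= 1) {
      // Sum the rank contributions neighbors sent last superstep
      double sum = 0;
      for (DoubleWritable message : messages) {
        sum += message.get();
      }
      // Standard damped PageRank update
      vertex.setValue(new DoubleWritable(0.15 / getTotalNumVertices() + 0.85 * sum));
    }
    if (getSuperstep() < MAX_SUPERSTEPS) {
      // Split this vertex's rank evenly across its outgoing edges
      sendMessageToAllEdges(vertex,
          new DoubleWritable(vertex.getValue().get() / vertex.getNumEdges()));
    } else {
      vertex.voteToHalt(); // the job ends once every vertex has halted
    }
  }
}

Because each superstep ends at a global barrier, the model maps naturally onto MapReduce-style scheduling, which is why Giraph can run as a MapReduce job on Facebook’s Corona infrastructure.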

Facebook made a number of changes to the Apache Giraph code, including modification of Giraph’s input model, streamlined reading of Hive data, multithreaded application code, memory optimization and the use of Netty instead of ZooKeeper to enhance scalability. Facebook’s Graph Search launched in January, although the platform is not yet nearly as powerful as end users might hope. Open Graph, meanwhile, is used by Facebook developers to correlate real-world actions with objects in Facebook’s database, such as user X watching soccer match Y on network Z. The latest version of Giraph’s code is now available as release 1.0.0 of the project. This week’s blog post by Facebook’s Avery Ching, elaborating on the company’s contributions to Giraph, represents one of the first attempts to bring the challenges specific to creating and managing a trillion edge social graph to a mainstream audience. In response, the industry should expect analogous disclosures about graph processing technology from the likes of Google, Twitter and others in the coming months.

Concurrent Releases Pattern To Facilitate Predictive Analytics On Hadoop

Today, Concurrent Inc. announced the release of Pattern, an open source tool designed to enable developers to build machine-learning applications on Hadoop by leveraging the Predictive Model Markup Language (PMML), the standard export format for popular predictive modeling tools such as R, MicroStrategy and SAS. Data scientists can use Pattern to deploy exported models onto Hadoop clusters and thereby run them against massive data sets. Pattern simplifies the process of building predictive models that operate on Hadoop clusters and lowers the barrier to the adoption of Apache Hadoop for advanced data mining and modeling use cases.
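
In practice, wiring a PMML model into a Hadoop job takes only a few lines of Cascading code. The sketch below follows the PMMLPlanner usage shown in Pattern’s documentation; the HDFS paths and model file are hypothetical, and the exact API may differ across Pattern releases.

import java.io.File;

import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pattern.pmml.PMMLPlanner;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;

/** A sketch of scoring a PMML model on a Hadoop cluster with Pattern,
 *  following the PMMLPlanner usage in Pattern's documentation; the
 *  paths and model file here are hypothetical. */
public class PatternScoringSketch {
  public static void main(String[] args) {
    // Hypothetical HDFS locations for the input records and scored output
    Tap inputTap = new Hfs(new TextDelimited(true, "\t"), "hdfs:///data/customers.tsv");
    Tap scoredTap = new Hfs(new TextDelimited(true, "\t"), "hdfs:///data/scored.tsv");

    FlowDef flowDef = FlowDef.flowDef()
        .setName("pmml-scoring")
        .addSource("input", inputTap)
        .addSink("classify", scoredTap);

    // The planner reads the model exported from R, SAS, etc. and builds
    // the equivalent Cascading assembly to run as MapReduce
    PMMLPlanner pmmlPlanner = new PMMLPlanner()
        .setPMMLInput(new File("model.pmml"))
        .retainOnlyActiveIncomingFields();
    flowDef.addAssemblyPlanner(pmmlPlanner);

    new HadoopFlowConnector().connect(flowDef).complete();
  }
}

The appeal of this design is that the model itself stays in the analyst’s tool of choice; only the PMML export crosses over to the cluster.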

One use case for Pattern involves evaluating the efficacy of models for a “predictive marketing intelligence solution,” as illustrated below by Antony Arokiasamy, Senior Software Architect at AgilOne:

Pattern facilitates AgilOne to deploy a variety of advanced machine-learning algorithms for our cloud-based predictive marketing intelligence solution. As a self-service SaaS offering, Pattern allows us to evaluate multiple models and push the clients’ best models into our high performance scoring system. The PMML interface allows our advanced clients to deploy custom models.

Here, Arokiasamy remarks on the way in which Pattern facilitates the scoring of predictive models, enabling the selection of the best-performing model from among several candidates. AgilOne uses Pattern to run multiple predictive models in parallel against large data sets, and its deployment additionally illustrates the efficacy of Pattern’s operation on a Hadoop cluster in a cloud-based environment.

Pattern runs on Cascading, Concurrent’s popular framework for building data processing applications on Hadoop that is used by the likes of Twitter, eBay, Etsy and Razorfish. A free, open source application, Pattern constitutes yet another pillar in Concurrent’s array of applications for streamlining the use of Apache Hadoop, alongside Cascading and Lingual, the ANSI-standard SQL interface that enables developers to query Hadoop clusters without having to learn MapReduce. The release of Pattern consolidates Concurrent’s positioning as a pioneer in the Big Data management space, given its thought leadership in designing applications that facilitate enterprise adoption of Hadoop. Enterprises can now use Concurrent’s Cascading framework to operate on Hadoop clusters using Java APIs, SQL and predictive models exported as PMML from compatible analytics applications.
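
Lingual’s SQL interface, for instance, is exposed through a standard JDBC driver, so a query against Hadoop-resident data reads like ordinary Java database code. The sketch below is based on Lingual’s documented local-mode JDBC usage; the schema and table names are hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

/** A sketch of querying Hadoop-resident data through Lingual's JDBC
 *  driver, based on its documented local-mode usage; the schema and
 *  table names here are hypothetical. */
public class LingualQuerySketch {
  public static void main(String[] args) throws Exception {
    // Register Lingual's JDBC driver and open a local-mode connection
    Class.forName("cascading.lingual.jdbc.Driver");
    try (Connection connection = DriverManager.getConnection("jdbc:lingual:local");
         Statement statement = connection.createStatement();
         ResultSet results = statement.executeQuery(
             "SELECT customer_id, total FROM sales.orders WHERE total > 100")) {
      while (results.next()) {
        // Rows are produced by Cascading flows planned from the SQL statement
        System.out.println(results.getString(1) + "\t" + results.getString(2));
      }
    }
  }
}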

IBM Releases Big Data Software On SmartCloud; Cognos for iPad

On Monday, IBM announced the release of its InfoSphere BigInsights application for analyzing massive volumes of structured and unstructured data on its SmartCloud environment. The SmartCloud release of BigInsights means that IBM beat competitors Oracle and Microsoft in the race to deploy an enterprise-grade, cloud-based Big Data analytics platform. Over the past month, Oracle and Microsoft have revealed plans to release cloud-based Big Data applications that leverage Apache Hadoop, although in both cases a live release is not scheduled until 2012. BigInsights was previously accessible via the IBM Smart Business Development and Test Cloud, the environment that served as the testing ground for IBM’s SmartCloud, which was deployed in April 2011.

IBM developed its Big Data analytics platform because organizations across a number of verticals are drowning in a sea of unstructured data such as Facebook and Twitter feeds, internet searches, log files and emails. IBM’s press release quantified the size of the emerging Big Data space as follows:

Organizations of all sizes are struggling to keep up with the rate and pace of big data and use it in a meaningful way to improve products, services, or the customer experience. Every day, people create the equivalent of 2.5 quintillion bytes of data from sensors, mobile devices, online transactions, and social networks; so much that 90 percent of the world’s data has been generated in the past two years. Every month people send one billion Tweets and post 30 billion messages on Facebook. Meanwhile, more than 1 trillion mobile devices are in use today and mobile commerce is expected to reach $31 billion by 2016.

IBM customers in the banking, insurance and communications verticals are currently using BigInsights to more effectively understand trends in web analytics, social media feeds, text messages and other forms of unstructured data. The availability of BigInsights via IBM’s SmartCloud is likely to accelerate enterprise adoption of the product, given enterprise familiarity with the SmartCloud offering and recent publicity about its October 12 upgrade. The deployment of BigInsights on SmartCloud also gives IBM early traction in the Big Data space, where it competes with Amazon Web Services’ Elastic MapReduce, EMC, Teradata and HP. Granted, Oracle and Microsoft are set to join the Big Data party soon, but IBM should have at least six months to consolidate its market positioning ahead of its West Coast-based competitors. The enterprise version of BigInsights is priced at 60 cents per cluster per hour, whereas the basic version is free.

Key features of the enterprise version of IBM InfoSphere BigInsights include the following:

• Advanced text analytics to mine massive amounts of textual data
• A spreadsheet-like interface called BigSheets that allows users to create and deploy analytics without writing code
• A web-based management console
• Jaql, a query language for structured and unstructured data with an interface that resembles SQL

In tandem with the release of BigInsights on SmartCloud, IBM announced the availability of IBM Cognos Mobile on the iPad and iPhone. iPad users can now leverage Cognos to run analytics on data and access a suite of visually rich dashboards. The combination of Cognos on the iPad and BigInsights clearly indicates that portable access to data analytics constitutes a key component of IBM’s Big Data strategy. The big question now is how Oracle and Microsoft will differentiate their respective, forthcoming Big Data offerings from BigInsights.

SGI Partners With Cloudera To Resell Hadoop

SGI and Cloudera today announced a reseller partnership whereby SGI will sell pre-configured Hadoop clusters of hardware and software, in addition to technical support. Under the terms of the agreement, SGI will distribute Cloudera’s Distribution Including Apache Hadoop (CDH) on its Rackable servers and provide level 1 technical support, while Cloudera will provide level 2 and level 3 technical support. SGI already claims a history of deploying Hadoop servers dating back to Hadoop’s earliest days and expects to leverage its existing relationships with customers in the government and financial sectors. SGI’s VP of Product Marketing, Bill Mannel, noted that “SGI has been successfully deploying Hadoop customer installations of up to 40,000 nodes and individual Hadoop clusters of up to 4,000 nodes for a number of years now.” 40,000 nodes per customer installation and 4,000 nodes per cluster represent the upper bound of Hadoop cluster size at Yahoo! and similar enterprise-level installations. Mannel elaborated on SGI’s experience with large Hadoop installations by commenting: “This benchmark, our growing presence, and our role in the Hadoop ecosystem, reflect our ongoing commitment to pushing the bar on performance and driving relationships that benefit our customers. As they wrestle with bigger and more complex data challenges every day they can trust SGI to deliver complete Hadoop solutions based on years of experience.”

SGI’s distribution of Hadoop is expected to target customers that want an enterprise-level installation without dedicating in-house talent to the deployment. Hadoop is a disruptive open source technology that provides a framework for managing massive volumes of structured and unstructured data. Hadoop provides the data infrastructure for Facebook, LinkedIn and Twitter, and has gained renewed attention in the wake of recent announcements by Oracle and Microsoft about entering the Big Data space with Hadoop-based technology.
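
For readers unfamiliar with the framework, Hadoop’s programming model is easiest to grasp through the canonical word-count job: a mapper emits a (word, 1) pair for every token and a reducer sums the counts per word, with Hadoop distributing both phases across the cluster. The sketch below uses the standard org.apache.hadoop.mapreduce API and is illustrative rather than specific to SGI’s or Cloudera’s offerings.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** The canonical Hadoop word count, illustrating the MapReduce model
 *  underlying the cluster deployments described above. */
public class WordCountSketch {

  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // Emit (word, 1) for every token in the input line
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      // Sum every count emitted for this word across the cluster
      int sum = 0;
      for (IntWritable value : values) {
        sum += value.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountSketch.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}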

Battle for Big Data Heats Up As Microsoft and Oracle Announce Hadoop-based Products

The battle for market share in the Big Data space is now underway in earnest. At last week’s Professional Association for SQL Server (PASS) Summit, Microsoft announced plans to develop a platform for Big Data processing and analytics based on Hadoop, the open source software framework that operates under an Apache license. Microsoft’s announcement comes roughly ten days after Oracle’s unveiling of its Big Data Appliance, which provides enterprise-level capabilities for processing structured and unstructured data.

Key features of Oracle’s Big Data Appliance include the following:

• Software
– Apache Hadoop
– Oracle NoSQL Database Enterprise Edition
– Oracle Data Integrator Application Adapter for Hadoop
– Oracle Loader for Hadoop
– An open source distribution of R

• Hardware
– Sun x86 clusters designed to integrate with Oracle’s other engineered systems (Oracle Exadata Database Machine, Oracle Exalytics Business Intelligence Machine)

Oracle’s hardware supports the Oracle 11g R2 database alongside Oracle’s version of Red Hat Enterprise Linux and virtualization based on the Xen hypervisor. The company’s announcement of its plans to leverage a NoSQL database represented an abrupt about-face from an earlier Oracle position that discredited the significance of NoSQL: in May, Oracle had published a whitepaper, “Debunking the NoSQL Hype,” that downplayed the enterprise-level capabilities of NoSQL deployments.

Microsoft’s forthcoming Big Data platform features the following:

– Hadoop for Windows Server and Azure
– Hadoop connectors for SQL Server and SQL Parallel Data Warehouse
– Hive ODBC drivers for users of Microsoft Business Intelligence applications

Microsoft revealed a strategic partnership with Yahoo spinoff Hortonworks to integrate Hadoop with Windows Server and Windows Azure. Microsoft’s decision to forgo NoSQL in favor of a Windows-based version of Hadoop for SQL Server 2012 constitutes the key difference between Microsoft’s and Oracle’s Big Data platforms. The entry of Microsoft and Oracle into the Big Data space suggests that the market is ready to explode as government and private sector organizations increasingly seek to unlock business value from unstructured data such as emails, log files, Twitter feeds and other text-centered data. IBM and EMC hold the early market share lead, but competition is set to intensify, particularly given the recent affirmation handed to NoSQL by tech giant Oracle.