Iterative Computation Between Vertices In Pregel and Apache Giraph

As a follow-up to our post on Facebook’s use of Apache Giraph, I wanted to return to Pregel, the graphing technology on which Giraph was based. Alongside, MapReduce, Pregel is used by Google to mine relationships between richly associative data sets in which the data points have multi-valent, highly dynamic relationships that morph, proliferate, aggregate, disperse, emerge and vanish with a velocity that renders any schema-based data model untenable. In a well known blog post, Grzegorz Czajkowski of Google’s Systems Infrastructure Team elaborated on the importance of graph theory and Pregel’s structure as follows:

Despite differences in structure and origin, many graphs out there have two things in common: each of them keeps growing in size, and there is a seemingly endless number of facts and details people would like to know about each one. Take, for example, geographic locations. A relatively simple analysis of a standard map (a graph!) can provide the shortest route between two cities. But progressively more sophisticated analysis could be applied to richer information such as speed limits, expected traffic jams, roadworks and even weather conditions. In addition to the shortest route, measured as sheer distance, you could learn about the most scenic route, or the most fuel-efficient one, or the one which has the most rest areas. All these options, and more, can all be extracted from the graph and made useful — provided you have the right tools and inputs. The web graph is similar. The web contains billions of documents, and that number increases daily. To help you find what you need from that vast amount of information, Google extracts more than 200 signals from the web graph, ranging from the language of a webpage to the number and quality of other pages pointing to it.

In order to achieve that, we have created scalable infrastructure, named Pregel, to mine a wide range of graphs. In Pregel, programs are expressed as a sequence of iterations. In each iteration, a vertex can, independently of other vertices, receive messages sent to it in the previous iteration, send messages to other vertices, modify its own and its outgoing edges’ states, and mutate the graph’s topology (experts in parallel processing will recognize that the Bulk Synchronous Parallel Model inspired Pregel).

The key point worth noting here is that Pregel computation is marked by a “sequence of iterations” whereby the relationship between vertices is iteratively refined and recalibrated with each computation. In other words, Pregel computation begins with an input step, followed by a series of supersteps that successively lead to the algorithm’s termination and finally, an output. During each of the supersteps, the vertices send and receive messages to other vertices in parallel. The algorithm terminates when the vertices collectively stop transmitting messages to each other, or, to put things in another lexicon, vote to halt. As Malewizc, Czajkowsk note in a paper on Pregel, “The algorithm as a whole terminates when all vertices are simultaneously inactive and there are no messages in transit.” Like Pregel, Apache Giraph uses a computation structure whereby computation proceeds iteratively until the relationships between vertices in a graph stabilize.

Facebook Leverages Enhanced Apache Giraph To Create 1 Trillion Edge Social Graph

This week, Facebook revealed details of the technology used to power its recently released Graph search functionality. Graph databases are used to analyze relationships between associative data such as social networks, transportation data, search-related content recommendations, fraud detection, molecular biology and, more generally, any data set where the relationships between constituent data points are so numerous and dynamic that they cannot easily be captured within a manageable schema or relational database structure. Graph databases contain “nodes” or “vertices” and “edges” that indicate relationships between the different vertices/nodes.

Facebook intensified their review of graphing technologies in the summer of 2012 and selected Apache Giraph over Apache Hive and GraphLab. Facebook selected Giraph because it interfaces directly with its own version of the Hadoop Distributed File System (HDFS), allows usage of its MapReduce Corona infrastructure, supports a variety of graph application use cases, and features functionality such as master computation and composable computation. Compared to Apache Hive and GraphLab, Apache Giraph was faster. After choosing a graphing platform, the Facebook team modified the Apache Giraph code and subsequently shared their changes with the Apache Giraph open source community.

One of the use cases that Facebook leveraged in order to select Apach Giraph was its performance in a “label propagation” exercise where it probabilistically inferred data fields that are blank or unintelligible in comparison to Apache Hive and GraphLab. Many Facebook users, for example, may elect to leave their hometown or employer blank, but graphing algorithms can probabilistically assign values for the blank fields by analyzing data about a user’s friends, family, likes and online behavior. By empowering data scientists to construct more complete profiles of users, graph technology enables enhanced personalization of data such as a user’s news feed and advertising content. Facebook performed the “label propagation” comparison of Giraph, Hive and GraphLab on a relatively small scale on the order of 25 million edges.

Key attributes of Apache Giraph and its usage by Facebook include the following:

•Apache Giraph is based on Google’s Pregel and Leslie Valiant’s bulk synchronous parallel computing model
•Apache Giraph is written in Java and runs as a MapReduce job
•Facebook chose the production use cases of “label propagation, variants of page rank, and k-means clustering” in order to drive their modification of Apache Giraph code
•Facebook created a 1 trillion edge social graph using 200 commodity machines, in less than four minutes
•Facebook’s creation of 1 trillion edges is roughly two orders of magnitude greater than Twitter’s graph of 1.5 billion edges and AltaVista’s 6.5 billion edges

Facebook performed a number of tweaks on the Apache Giraph code including modification of the input model for data in Giraph, streamlined reading of Hive data, multithreading application code, memory optimization and the use of Netty instead of Zookeeper to enhance scalability. Facebook’s Social Graph was launched in January, although its platform is not nearly as powerful as end users might hope for, as of yet. Open Graph, meanwhile, is used by Facebook developers to correlate real-world actions with objects in their database, such as User X is viewing soccer match Y on network Z. The latest and greatest vesion of Giraph’s code is now available under version 1.0.0 of the project. This week’s elaboration on Facebook’s contribution to Giraph by Avery Ching’s blog post represents one of the first attempts to render mainstream the challenges specific to creating and managing a trillion edge social graph. In response, the industry should expect analogous disclosures about graphing technology from the likes of Google, Twitter and others in subsequent months.