As a follow-up to our post on Facebook’s use of Apache Giraph, I wanted to return to Pregel, the graphing technology on which Giraph was based. Alongside, MapReduce, Pregel is used by Google to mine relationships between richly associative data sets in which the data points have multi-valent, highly dynamic relationships that morph, proliferate, aggregate, disperse, emerge and vanish with a velocity that renders any schema-based data model untenable. In a well known blog post, Grzegorz Czajkowski of Google’s Systems Infrastructure Team elaborated on the importance of graph theory and Pregel’s structure as follows:
Despite differences in structure and origin, many graphs out there have two things in common: each of them keeps growing in size, and there is a seemingly endless number of facts and details people would like to know about each one. Take, for example, geographic locations. A relatively simple analysis of a standard map (a graph!) can provide the shortest route between two cities. But progressively more sophisticated analysis could be applied to richer information such as speed limits, expected traffic jams, roadworks and even weather conditions. In addition to the shortest route, measured as sheer distance, you could learn about the most scenic route, or the most fuel-efficient one, or the one which has the most rest areas. All these options, and more, can all be extracted from the graph and made useful — provided you have the right tools and inputs. The web graph is similar. The web contains billions of documents, and that number increases daily. To help you find what you need from that vast amount of information, Google extracts more than 200 signals from the web graph, ranging from the language of a webpage to the number and quality of other pages pointing to it.
In order to achieve that, we have created scalable infrastructure, named Pregel, to mine a wide range of graphs. In Pregel, programs are expressed as a sequence of iterations. In each iteration, a vertex can, independently of other vertices, receive messages sent to it in the previous iteration, send messages to other vertices, modify its own and its outgoing edges’ states, and mutate the graph’s topology (experts in parallel processing will recognize that the Bulk Synchronous Parallel Model inspired Pregel).
The key point worth noting here is that Pregel computation is marked by a “sequence of iterations” whereby the relationship between vertices is iteratively refined and recalibrated with each computation. In other words, Pregel computation begins with an input step, followed by a series of supersteps that successively lead to the algorithm’s termination and finally, an output. During each of the supersteps, the vertices send and receive messages to other vertices in parallel. The algorithm terminates when the vertices collectively stop transmitting messages to each other, or, to put things in another lexicon, vote to halt. As Malewizc, Czajkowsk et.al note in a paper on Pregel, “The algorithm as a whole terminates when all vertices are simultaneously inactive and there are no messages in transit.” Like Pregel, Apache Giraph uses a computation structure whereby computation proceeds iteratively until the relationships between vertices in a graph stabilize.