This week, Facebook revealed details of the technology used to power its recently released Graph Search functionality. Graph databases are used to analyze relationships within highly connected data, in domains such as social networks, transportation, search-related content recommendation, fraud detection and molecular biology, and more generally in any data set where the relationships between constituent data points are so numerous and dynamic that they cannot easily be captured within a manageable schema or relational database structure. Graph databases store “nodes” (also called “vertices”) connected by “edges” that represent the relationships between them.
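To make the vertex-and-edge terminology concrete, here is a minimal, self-contained Java sketch of a graph stored as adjacency lists; the class and method names are purely illustrative and are not drawn from any particular graph database.

    import java.util.*;

    // Illustrative in-memory graph: vertices are identified by a numeric id and
    // edges carry a relationship label (e.g. "friend", "likes"). The names here
    // are hypothetical and do not reflect any specific graph database product.
    class TinyGraph {
        static class Edge {
            final long target;
            final String label;
            Edge(long target, String label) { this.target = target; this.label = label; }
        }

        private final Map<Long, List<Edge>> adjacency = new HashMap<>();

        void addVertex(long id) {
            adjacency.putIfAbsent(id, new ArrayList<>());
        }

        void addEdge(long source, long target, String label) {
            addVertex(source);
            addVertex(target);
            adjacency.get(source).add(new Edge(target, label));
        }

        List<Edge> neighbors(long id) {
            return adjacency.getOrDefault(id, Collections.emptyList());
        }

        public static void main(String[] args) {
            TinyGraph g = new TinyGraph();
            g.addEdge(1L, 2L, "friend"); // user 1 is a friend of user 2
            g.addEdge(1L, 3L, "likes");  // user 1 likes page 3
            System.out.println(g.neighbors(1L).size()); // prints 2
        }
    }

In a dedicated graph database, vertices and edges typically also carry arbitrary key/value properties, which is what allows new kinds of relationships to be added without changing a fixed schema.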
Facebook intensified its review of graph processing technologies in the summer of 2012 and selected Apache Giraph over Apache Hive and GraphLab. Facebook chose Giraph because it interfaces directly with Facebook’s own version of the Hadoop Distributed File System (HDFS), runs on its Corona MapReduce infrastructure, supports a variety of graph application use cases, and offers functionality such as master computation and composable computation; it also proved faster than both Apache Hive and GraphLab. After choosing a platform, the Facebook team modified the Apache Giraph code and subsequently contributed their changes back to the Apache Giraph open source community.
One of the use cases Facebook used to compare Apache Giraph against Apache Hive and GraphLab was a “label propagation” exercise, in which the system probabilistically infers values for data fields that are blank or unintelligible. Many Facebook users, for example, elect to leave their hometown or employer blank, but graph algorithms can probabilistically assign values to those fields by analyzing data about a user’s friends, family, likes and online behavior. By empowering data scientists to construct more complete profiles of users, graph technology enables enhanced personalization of content such as a user’s news feed and advertising. Facebook performed the label propagation comparison of Giraph, Hive and GraphLab at a relatively small scale, on the order of 25 million edges.
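As a rough sketch of the idea, the simplified Java example below has each vertex with a missing label adopt the most common label among its neighbors, iterating until nothing changes. This is a single-machine, majority-vote stand-in for the distributed, probabilistic computation Facebook ran on Giraph; all names and the tiny example graph are illustrative.

    import java.util.*;

    // Simplified label propagation: unlabeled vertices adopt the most common
    // label (e.g. hometown) among their neighbors, iterating until stable.
    class LabelPropagation {
        static Map<Integer, String> propagate(Map<Integer, List<Integer>> graph,
                                              Map<Integer, String> seedLabels,
                                              int maxIterations) {
            Map<Integer, String> labels = new HashMap<>(seedLabels);
            for (int iter = 0; iter < maxIterations; iter++) {
                boolean changed = false;
                for (Integer v : graph.keySet()) {
                    if (seedLabels.containsKey(v)) continue; // keep known values fixed
                    Map<String, Integer> votes = new HashMap<>();
                    for (Integer neighbor : graph.get(v)) {
                        String label = labels.get(neighbor);
                        if (label != null) votes.merge(label, 1, Integer::sum);
                    }
                    if (votes.isEmpty()) continue;
                    String best = Collections.max(votes.entrySet(),
                            Map.Entry.comparingByValue()).getKey();
                    if (!best.equals(labels.get(v))) {
                        labels.put(v, best);
                        changed = true;
                    }
                }
                if (!changed) break; // converged
            }
            return labels;
        }

        public static void main(String[] args) {
            // Three mutual friends; two report a hometown, one leaves it blank.
            Map<Integer, List<Integer>> graph = Map.of(
                    1, List.of(2, 3), 2, List.of(1, 3), 3, List.of(1, 2));
            Map<Integer, String> known = Map.of(1, "Palo Alto", 2, "Palo Alto");
            System.out.println(propagate(graph, known, 10)); // vertex 3 -> "Palo Alto"
        }
    }

A production version would propagate probability distributions over candidate labels rather than single majority votes, and would run as a distributed Giraph job over the full social graph.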
Key attributes of Apache Giraph and its usage by Facebook include the following:
• Apache Giraph is based on Google’s Pregel and Leslie Valiant’s bulk synchronous parallel computing model (the superstep pattern is illustrated in the sketch following this list)
• Apache Giraph is written in Java and runs as a MapReduce job
• Facebook chose the production use cases of “label propagation, variants of page rank, and k-means clustering” to drive its modifications to the Apache Giraph code
• Facebook processed a social graph with 1 trillion edges using 200 commodity machines in less than four minutes
• Facebook’s 1 trillion edge graph is roughly two to three orders of magnitude larger than Twitter’s graph of 1.5 billion edges and AltaVista’s graph of 6.5 billion edges
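To illustrate the bulk synchronous parallel model referenced in the first bullet, the sketch below simulates Pregel-style supersteps for PageRank on a single machine: in each superstep every vertex sends its rank share along its outgoing edges, a global synchronization barrier is reached, and every vertex then recomputes its value from the messages it received. This is not the Giraph API, only a minimal simulation of the computation pattern, with assumed parameters (damping factor 0.85, a fixed number of supersteps).

    import java.util.*;

    // Single-machine simulation of the Pregel/BSP superstep pattern, using
    // PageRank as the per-vertex compute step. Not the Giraph API; the graph,
    // damping factor and superstep count are illustrative assumptions.
    class BspPageRank {
        public static void main(String[] args) {
            // Directed graph as adjacency lists: vertex -> outgoing neighbors.
            Map<Integer, List<Integer>> edges = Map.of(
                    0, List.of(1, 2), 1, List.of(2), 2, List.of(0));
            int n = edges.size();
            double damping = 0.85;

            Map<Integer, Double> rank = new HashMap<>();
            for (int v : edges.keySet()) rank.put(v, 1.0 / n);

            for (int superstep = 0; superstep < 20; superstep++) {
                // Phase 1: every vertex "sends" its rank share along its out-edges.
                Map<Integer, Double> incoming = new HashMap<>();
                for (Map.Entry<Integer, List<Integer>> e : edges.entrySet()) {
                    double share = rank.get(e.getKey()) / e.getValue().size();
                    for (int target : e.getValue()) {
                        incoming.merge(target, share, Double::sum);
                    }
                }
                // Global barrier, then Phase 2: every vertex recomputes its value
                // from the messages received during this superstep.
                for (int v : edges.keySet()) {
                    rank.put(v, (1 - damping) / n + damping * incoming.getOrDefault(v, 0.0));
                }
            }
            System.out.println(rank);
        }
    }

In Giraph itself, the equivalent per-vertex logic is written as a compute function that the framework executes across the cluster, with message delivery and the barrier between supersteps handled by the infrastructure.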
Facebook made a number of modifications to the Apache Giraph code, including changes to Giraph’s data input model, streamlined reading of Hive data, multithreading of application code, memory optimization, and the use of Netty instead of Zookeeper to enhance scalability. Facebook’s Graph Search launched in January, although the platform is not yet as powerful as end users might hope. Open Graph, meanwhile, is used by Facebook developers to correlate real-world actions with objects in their database, such as “User X is viewing soccer match Y on network Z.”

The latest version of the Giraph code is now available as release 1.0.0 of the project. Avery Ching’s blog post this week elaborating on Facebook’s contributions to Giraph represents one of the first attempts to bring the challenges of creating and managing a trillion edge social graph to a mainstream audience. In response, the industry should expect analogous disclosures about graph technology from the likes of Google, Twitter and others in the coming months.