Twitter is well on its way toward its goal of remaining fully accessible 24/7 all over the world. In a blog post the day after the 2012 presidential election, Mazen Rawashdeh, VP of Infrastructure Operations Engineering, remarked that the election not only shattered Twitter’s records for election-related tweeting, but also illustrated the capacity of Twitter’s infrastructure to withstand sustained surges in traffic over several hours. Election night, Tuesday, November 6, produced the following statistics:
• Over 31 million election-related Tweets were sent on November 6
• Twitter users averaged 9,965 Tweets per second (TPS) from 8:11pm to 9:11pm PST
• Traffic spiked to 15,107 TPS at 8:20pm PST
• The one-minute peak was 874,560 Tweets per minute (TPM)
These numbers comfortably surpassed earlier Twitter high points such as the 6,939 TPS for 2011 New Year’s Eve and the 7,196 TPS that followed Japan’s stunning victory over the United States in the 2011 Women’s World Cup soccer final. They also demonstrated Twitter’s ability to support sustained increases in traffic, such as the nearly 10,000 TPS maintained for a full hour as the world gradually learned of the projections of President Obama’s re-election.
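To put the figures on a common scale, the short Python sketch below converts the one-minute peak to a per-second rate and compares election night with the earlier records. It uses only the numbers quoted in this post. The constant and function names are illustrative, and the hourly total assumes the 9,965 TPS average held for exactly 3,600 seconds. That total presumably counts all Tweets rather than only election-related ones, which would explain why it exceeds the 31 million figure for the day.

```python
# Figures quoted in the post (TPS = Tweets per second, TPM = Tweets per minute).
AVG_TPS_PEAK_HOUR = 9_965    # average over the busiest hour
PEAK_TPS = 15_107            # one-second spike
PEAK_TPM = 874_560           # busiest single minute
NEW_YEAR_2011_TPS = 6_939    # earlier record cited in the post
WORLD_CUP_2011_TPS = 7_196   # earlier record cited in the post


def per_minute_to_per_second(tpm: int) -> float:
    """Average per-second rate implied by a per-minute count."""
    return tpm / 60


def tweets_in_window(avg_tps: int, seconds: int) -> int:
    """Total Tweets if the average rate holds for the whole window."""
    return avg_tps * seconds


peak_minute_tps = per_minute_to_per_second(PEAK_TPM)
hour_total = tweets_in_window(AVG_TPS_PEAK_HOUR, 60 * 60)

print(f"Busiest minute, as a per-second average: {peak_minute_tps:,.0f} TPS")
print(f"Tweets in the busiest hour (assumed 3,600 s): {hour_total:,}")
print(f"Hour-long average vs. World Cup peak: {AVG_TPS_PEAK_HOUR / WORLD_CUP_2011_TPS:.2f}x")
print(f"One-second spike vs. World Cup peak: {PEAK_TPS / WORLD_CUP_2011_TPS:.2f}x")
print(f"One-second spike vs. New Year peak: {PEAK_TPS / NEW_YEAR_2011_TPS:.2f}x")
```

The busiest minute averaged roughly 14,576 TPS, so the 15,107 TPS spike was not an isolated one-second outlier: traffic stayed close to that level for the entire minute.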
Rawashdeh attributes the robustness of Twitter’s infrastructure to its optimization of the Ruby runtime and its gradual move from Ruby to Java. In March 2011, Twitter revealed the results of a Ruby runtime optimization that reduced the CPU usage required by the garbage collector on Twitter.com. CPU usage fell from 18.5% to 14% as a result of creating two heaps for the garbage collector instead of one. Separately, Twitter directed mobile-related traffic to a Java Virtual Machine stack that avoided the Ruby stack entirely, as part of the larger project of migrating from Ruby to Java altogether.
Twitter began moving from Ruby to Java as early as 2008, when the company’s Ruby-based message queuing system “hit a wall,” according to former developer Alex Payne. Because Ruby performed poorly for “long running processes” that were also “memory intensive,” the company opted for a Java-based path, as reported in The Register:
Twitter’s solution was to migrate some of its Ruby code to a new server stack running on the JVM. Initially, the company’s development team avoided stock Java in favor of Scala, an alternative JVM language that combines aspects of object-oriented and functional programming. Today, Twitter’s software is built from a mix of Scala and ordinary Java code.
Fans of Ruby are likely to take issue with Twitter’s gradual but highly purposeful migration toward Java. Nevertheless, the larger issue highlighted by Twitter’s impressive performance on election day is the need for cloud-based platforms to remain up and running 100% of the time, in all geographies, even in the face of sustained, event-driven traffic. Outages such as those recently experienced by Amazon Web Services will increasingly be considered unacceptable by the industry, meaning that cloud-based platforms will need to channel more of their talent and resources into the development of failsafe infrastructures capable of withstanding the pressures imposed by natural disasters and dramatic spikes in traffic. Uptime of 100%, as opposed to 99.5% or even 99.9%, is likely to become the norm and the expectation.
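For a sense of what those uptime percentages mean in practice, the following Python sketch converts an availability target into the downtime it permits over a year. It is a back-of-the-envelope calculation, not a formal service-level definition: it assumes a 365-day year, and the names `HOURS_PER_YEAR` and `allowed_downtime_hours` are illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year, 8,760 hours


def allowed_downtime_hours(availability_pct: float) -> float:
    """Hours of downtime per year permitted by an availability target."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


for pct in (99.5, 99.9, 100.0):
    hours = allowed_downtime_hours(pct)
    print(f"{pct}% uptime allows about {hours:.2f} hours of downtime per year")
```

Under these assumptions, 99.5% uptime leaves room for roughly 43.8 hours of outage a year and 99.9% for about 8.76 hours, which shows how large the gap is between those targets and a true 100%.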