Understanding The Amazon Web Services DynamoDB Service Disruption

Amazon Web Services recently elaborated on the DynamoDB service disruption that affected the US-East Region last Sunday, September 20. The root cause was that some storage servers failed to retrieve the metadata that allows them to hold table data.

As explained in a post-mortem analysis of the disruption, DynamoDB tables are divided into partitions, which are in turn spread across multiple servers. The assignment of a group of partitions to a server is known as a membership, and membership is managed by DynamoDB’s metadata service. Storage servers hold the table data within a partition and must periodically confirm with the metadata service that they have the correct membership.

On Sunday at 2:19 AM PDT, a disruption left some DynamoDB storage servers unable to retrieve their membership data. One contributing factor was rapid customer adoption of Global Secondary Indexes, which allow customers to access table data through more than one key and correspondingly increase the volume of membership data associated with a storage server. That growth put additional stress on membership retrieval, to the point where some storage servers could no longer retrieve their membership data. Amazon responded by trying to add capacity to the metadata service, but it was not until Amazon paused requests to the metadata service that it was able to add capacity in a way that restored the storage servers' ability to retrieve membership data.

As a result of the disruption, Amazon is increasing the capacity of its metadata service for DynamoDB. It also plans to monitor DynamoDB performance more stringently and to upgrade the metadata infrastructure so that multiple instances of the metadata service each serve a subset of the server fleet.
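The relationship between index adoption, membership size, and failed retrievals can be illustrated with a minimal Python sketch. It is not Amazon's implementation. The names `Table`, `retrieval_succeeds`, and `metadata_instance_for`, the throughput and deadline figures, and the assumption that each Global Secondary Index adds a full set of partition entries are all hypothetical, chosen only to make the mechanism concrete. The last function sketches one possible way to split a fleet across several metadata service instances, as in the planned upgrade.

```python
from dataclasses import dataclass
import zlib


@dataclass(frozen=True)
class Table:
    name: str
    partitions: int      # partitions holding the base table data
    gsi_count: int = 0   # number of Global Secondary Indexes on the table

    def membership_entries(self) -> int:
        # Assumption: each index is partitioned like the base table, so every
        # GSI adds another set of partition entries to the membership data.
        return self.partitions * (1 + self.gsi_count)


def retrieval_succeeds(tables, entries_per_second: float, deadline_s: float) -> bool:
    """Return True if the membership data can be fetched before the deadline."""
    total_entries = sum(t.membership_entries() for t in tables)
    return total_entries / entries_per_second <= deadline_s


def metadata_instance_for(server_id: str, instance_count: int) -> int:
    """Map a storage server to one of several metadata service instances."""
    # crc32 is stable across runs, unlike the built-in hash() on strings.
    return zlib.crc32(server_id.encode("utf-8")) % instance_count


if __name__ == "__main__":
    without_gsi = [Table("orders", partitions=100)]
    with_gsi = [Table("orders", partitions=100, gsi_count=3)]

    # Same (hypothetical) throughput and deadline, different membership size.
    print(retrieval_succeeds(without_gsi, entries_per_second=50, deadline_s=5))
    print(retrieval_succeeds(with_gsi, entries_per_second=50, deadline_s=5))

    # Each server is consistently routed to one of four metadata instances.
    print(metadata_instance_for("storage-server-17", instance_count=4))
```

In this toy model, adding three indexes quadruples the membership entries for the table, so a retrieval that fit comfortably inside the deadline no longer does. Routing each server to a fixed metadata instance means that load on one instance affects only the subset of servers assigned to it.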