Thursday, October 7, 2010

Doubts on Scalable Non relational datastore after Foursquare outage

In a detailed post-mortem, a Foursquare (Foursquare) engineer details the "embarrassing and disappointing" issues that brought the service to its knees and kept it down for the bulk of the day yesterday. Apparently, the service’s problems began when engineers noticed a disproportionate number of checkins being stored on one SHARD of the company’s database system. As the shard became more and more overloaded, engineers tried to introduce a new shard to the system to even out the load balance.

Unfortunately, the new shard took out the entire service for reasons unknown, from the website to the mobile apps; and every time the team tried to restart the site, the original shard kept overloading and bringing the works down again.In the end, Foursquare had to re-index the shard, which took a grueling five hours. The company is happy to report no data was lost.

This brings about two quick questions on MongoDB and Monitoring of non relational datastores -

1. Is this scenario really going against the MongoDb's auto sharding architecture - MongoDB has been designed to make horizontal scaling manageable. It does so via an auto-sharding architecture, which automatically manages the distribution of data across nodes. In addition, the sharding system handles the addition and removal of shard nodes, ensuring an even distribution of data. The architecture also provides for automatic failover; individual shards are made up of at least two redundant nodes, ensuring automatic recovery with no single point of failure.

2.Are we missing the monitoring tools on non-relational datastore?

Although Non relational datastore are being defined as scalable datastore but are they reliable is the question?? With companies like Foursquare growing much more rapidly than it has since its inception, can non relational datastore also scale with them???