Friday, August 1, 2014

Distributed databases

Understanding the fundamental differences between distributed databases is crucial when selecting the right database for an IoT application.

There are dozens of databases available today. Among the distributed ones, perhaps the best known are Cassandra, CloudDB, Riak, and RethinkDB. But which one is most suitable for my specific purpose?

Wikipedia describes several characteristics of distributed databases. However, the article does not pay enough attention to what is perhaps the most important factor: does the database favor data consistency or high availability?

Let's assume we have a cluster of four interconnected databases. (Surprisingly, this setup resembles what was described in the previous post.) Let's also assume each DB has a number of sensors and actuators behind it, each connected to only one DB at a time, as illustrated below.

[Figure: Distributed database.]
What happens when the connection to a database (A) is lost? There are two scenarios:
a) high availability: the other databases B, C, and D continue to provide the last known state of the nodes behind DB A
b) data consistency: the other databases B, C, and D no longer provide information about the nodes behind A, as they cannot guarantee that the data is consistent (its integrity).
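The two scenarios can be sketched in a few lines of Python. This is only a toy model, not the behavior of any particular database: the `Replica` class, the `Mode` enum, and the `NodeUnreachable` exception are hypothetical names made up for illustration. A replica (say DB B) keeps the last state it received from a peer (DB A) and decides, based on its mode, whether to serve that state once the peer is lost.

```python
from dataclasses import dataclass, field
from enum import Enum


class Mode(Enum):
    AVAILABILITY = "availability"  # scenario a): keep serving last known state
    CONSISTENCY = "consistency"    # scenario b): refuse to serve possibly stale state


class NodeUnreachable(Exception):
    """Raised when a consistency-favoring replica refuses to answer."""


@dataclass
class Replica:
    """Hypothetical model of one database in the cluster (e.g. DB B)."""
    mode: Mode
    # Last state replicated from each peer: {peer name: {sensor id: value}}
    last_known: dict = field(default_factory=dict)
    # Peers this replica can currently reach
    reachable: set = field(default_factory=set)

    def replicate(self, peer, sensor, value):
        """Store a value received from a peer and mark the peer as reachable."""
        self.last_known.setdefault(peer, {})[sensor] = value
        self.reachable.add(peer)

    def mark_lost(self, peer):
        """Record that the connection to a peer has been lost."""
        self.reachable.discard(peer)

    def read(self, peer, sensor):
        """Return the sensor value, or refuse if it cannot be guaranteed current."""
        if peer in self.reachable or self.mode is Mode.AVAILABILITY:
            return self.last_known[peer][sensor]
        raise NodeUnreachable(f"{peer} is unreachable; refusing to serve stale data")
```

With DB A lost, a replica in `Mode.AVAILABILITY` still returns the last value it saw for a sensor behind A, while a replica in `Mode.CONSISTENCY` raises `NodeUnreachable` for the same read.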

Which one is better? Well, it depends on your application. In a sensor network type of application, where time-series data is typically stored, the high availability approach can work well. The historical data is not supposed to be altered in any way, so it remains valid. Redundant databases can continue providing the last known history of the lost nodes. If the connection to DB A is recovered later, all the databases can then synchronize the data that went missing while the connection was down.
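Why synchronization after reconnection is straightforward for time-series data can be shown with a minimal sketch. It assumes, purely for illustration, that each reading is an immutable `(timestamp, sensor_id, value)` tuple and that a hypothetical `sync` function merges two logs. Because history is never altered, the merge is a plain set union and no conflict resolution is needed.

```python
def sync(local_log, remote_log):
    """Merge two append-only logs of (timestamp, sensor_id, value) readings.

    Readings are never modified after being written, so the union of both
    logs is the complete history. The result is sorted by timestamp.
    """
    return sorted(set(local_log) | set(remote_log))


# DB A kept recording while it was disconnected; DB B only has the old reading.
log_a = [(1, "temp-1", 20.0), (2, "temp-1", 20.5), (3, "temp-1", 21.0)]
log_b = [(1, "temp-1", 20.0)]

# After the connection is recovered, both sides converge to the same history.
log_b = sync(log_b, log_a)
log_a = sync(log_a, log_b)
```

The same union would not be safe if readings could be updated in place, which is one reason the high availability approach suits append-only sensor data in particular.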

In a real-time control type of application, historical data is perhaps not that important; the only thing that matters is the current state of the system. In that case, when a connection is lost, it is better not to share the past state of the lost nodes. This is what favoring data consistency means: no false data is provided. If a sub-system is not available, the data related to it is unavailable as well. This is quite natural.

The CAP theorem provides a more rigorous explanation of the difference described above. When selecting the most suitable DB for an IoT application, this is perhaps the most important factor to take into account. For example, how the databases synchronize with each other is not that relevant, as long as they are connected and can synchronize. In an IoT system, fault tolerance is one of the most important aspects, as system partitions (connection breaks) are more than likely to occur.
