Database Replication
Databases replication is a really useful feature. It provides backup, load balancing, data aggregation to start with. A typical design of a database replication will be:
- A data modification statement on the master database is written to the database log
- These entries are read by a replication agent and put on a queue for child nodes to apply on their respective databases
- A checkpoint thread scans the master node's database log and clears the entries which are read by the replication agent.
What happens when a child node goes down?
- Entries off the queue (where the master rep agent puts in entries) are not removed by the child node so the queue starts building up
- When the queue is full the rep agent stops reading entries off the master database log. If it read where will it put them?
- The master database transaction log starts to get full and when it is full it stops accepting inbound transactions means master database becomes unavailable
In most of the cases the child comes up much before this happens and the child has all the information it has not consumed on the queue. So the queue is a important piece in the design. The master keeps it up and running, filling it with data while the child node is down. In essence the master node is responsible for the queue and hence takes on the risks associated with it. I don't think that is a very wise thing to do, you can think of any number of real world scenarios where this model does not work.
In the replication component Abinash is coding as I write this, we decided that each child node should take responsibility for its upkeep. it has to figure out how to recover from a failure and should not adversely affect the master in any way. We will publish the design details in a follow-up post.