relay: don't delete xlog files until replica confirms receipt
We remove old xlog files as soon as we have sent them to all replicas. However, the fact that we have successfully sent something to a replica doesn't necessarily mean the replica will have received it. If a replica fails to apply a row (for instance, it is out of memory), replication will stop, but the data files have already been deleted on the master so that when the replica is back online, the master won't find appropriate xlog to feed to the replica and replication will stop again. The user visible effect is the following error message in the log and in the replica status: Missing .xlog file between LSN 306 {1: 306} and 311 {1: 311} There is no way to recover from this but to re-bootstrap the replica from scratch. The issue was introduced by commit ba09475f ("replica: advance gc state only when xlog is closed"), which targeted at making the status update procedure as lightweight and fast as possible and so moved gc_consumer_advance() from tx_status_update() to a special gc message. A gc message is created and sent to TX as soon as an xlog is relayed. Let's rework this so that gc messages are appended to a special queue first and scheduled only when the relay receives the receipt confirmation from the replica. Closes #2825
Loading
Please register or sign in to comment