Commit ea28a925 authored 6 years ago by Vladimir Davydov

replication: unregister replica with gc if deleted from cluster

When a replica is removed from the cluster table, the corresponding
replica struct isn't destroyed unless both the relay and the applier
attached to it are stopped, see replica_clear_id(). Since replica struct
is a holder of the garbage collection state, this means that in case an
evicted replica has an applier or a relay that fails to exit for some
reason, garbage collection will hang.

A relay thread stops as soon as the replica it was started for receives
a row that tries to delete it from the cluster table (because this isn't
allowed by the cluster space trigger, see on_replace_dd_cluster()).
If a replica isn't running, the corresponding relay can't run as well,
because writing to a closed socket isn't allowed. That said, a relay
can't block garbage collection.

An applier, however, is deleted only when replication is reconfigured.
So if a replica that was evicted from the cluster was configured as a
master, its replica struct will hang around blocking garbage collection
for as long as the replica remains in box.cfg.replication. This is what
happens in #3546.

Fix this issue by forcefully unregistering a replica with the garbage
collector when it is deleted from the cluster table. This is OK as it
won't be able to resubscribe and so we don't need to keep WALs for it
any longer. Note, the relay thread may still be running when a replica
is deleted from the cluster table, in which case we can't unregister it
with the garbage collector right away, because the relay may need to
access the garbage collection state. In such a case, leave the job to
replica_clear_relay, which is called as soon as the relay thread exits.

Closes #3546

parent 1046f851

No related branches found

No related tags found

Hide whitespace changes

Inline Side-by-side

Showing with 141 additions and 0 deletions

Please register or to comment