Commit 634f59c7 authored by Serge Petrenko's avatar Serge Petrenko Committed by Vladimir Davydov

recovery: panic in case of recovery and replicaset vclock mismatch

We assume that no one touches the instance's WALs once it has taken
wal_dir_lock. This is not the case when upgrading from an old setup
(running tarantool 1.7.3-6 or less). Such nodes either take a lock on
the snap dir, which may differ from the wal dir, or take no lock at
all.

So, it's possible that during upgrade an old node is not stopped
properly before a new node is started in the same data directory.

The old node might even write some extra data to WAL during new node's
startup.

This is obviously bad and leads to multiple issues. For example, the new
node might start local recovery, scan the WALs and set replicaset.vclock
to some value, say {1: 5}. While the node recovers the WALs, the old
node appends to them up to vclock {1: 10}.
The node then finishes local recovery with replicaset vclock {1: 5},
even though it has recovered data up to vclock {1: 10}.

The node will then use the now-outdated replicaset vclock to subscribe
to remote peers (breaking replication with duplicate-key errors) and to
initialize the WAL (producing new xlogs with duplicate LSNs). There may
be a number of other issues we just haven't stumbled upon.

Let's prevent situations like that by panicking as soon as we see that
the initially scanned vclock (replicaset vclock) differs from the
actually recovered vclock.

Closes #6709
parent dc19be40