Vladimir Davydov authored
When replication is restarted with the same replica set configuration
(i.e. box.cfg{replication = box.cfg.replication}), there's a chance that
an old relay will still be running on the master at the time when a new
applier tries to subscribe. In this case the applier will get an error:

  main/152/applier/localhost:62649 I> can't join/subscribe
  main/152/applier/localhost:62649 xrow.c:891 E> ER_CFG: Incorrect value for
      option 'replication': duplicate connection with the same replica UUID

Such an error won't stop the applier - it will keep trying to reconnect:

  main/152/applier/localhost:62649 I> will retry every 1.00 second

However, it will stop synchronization, so box.cfg() will return without
an error but leave the replica in orphan mode:

  main/151/console/::1:42606 C> failed to synchronize with 1 out of 1 replicas
  main/151/console/::1:42606 C> entering orphan mode
  main/151/console/::1:42606 I> set 'replication' configuration option to
    "localhost:62649"

In a second, the stray relay on the master will probably exit and the
applier will manage to subscribe, so the replica will leave orphan
mode:

  main/152/applier/localhost:62649 C> leaving orphan mode

This is very annoying, because there's no need to enter orphan mode in
this case - we could just as well keep trying to synchronize until the
applier finally succeeds in subscribing or replication_sync_timeout
expires.

So this patch makes appliers enter the "loading" state on configuration
errors, the same state they enter if they detect that bootstrap hasn't
finished yet. This guarantees that configuration errors, like the one
above, won't break synchronization and leave the user gaping at an
unprovoked orphan mode.
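
The effect of the change can be illustrated with a toy model. This is
only a sketch under stated assumptions, not the actual applier code: the
enum values, the function toy_sync() and the tick-based timeout are all
hypothetical names invented for illustration. It assumes that the
synchronization wait gives up as soon as an applier reports a
disconnected state, but keeps waiting while the applier is "loading".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified applier states (not the real enum). */
enum toy_state { TOY_DISCONNECTED, TOY_LOADING, TOY_FOLLOW };

/* Outcome of the toy synchronization wait. */
enum toy_result { TOY_SYNCED, TOY_ORPHAN };

/*
 * Simulate box.cfg() waiting for a single applier to synchronize.
 * The applier retries once per tick. The first `cfg_errors` attempts
 * fail with a configuration error (e.g. a stray relay is still running
 * on the master). `loading_on_cfg_error` selects the patched behaviour.
 */
enum toy_result
toy_sync(int cfg_errors, int sync_timeout_ticks, bool loading_on_cfg_error)
{
	assert(cfg_errors >= 0 && sync_timeout_ticks > 0);
	for (int tick = 0; tick < sync_timeout_ticks; tick++) {
		enum toy_state state;
		if (tick < cfg_errors) {
			/* Subscribe failed with a configuration error. */
			state = loading_on_cfg_error ?
				TOY_LOADING : TOY_DISCONNECTED;
		} else {
			/* The stray relay is gone, subscribe succeeds. */
			state = TOY_FOLLOW;
		}
		if (state == TOY_FOLLOW)
			return TOY_SYNCED;
		if (state == TOY_DISCONNECTED)
			return TOY_ORPHAN; /* old behaviour: stop waiting */
		/* TOY_LOADING: keep waiting until the timeout. */
	}
	/* The (toy) replication_sync_timeout has expired. */
	return TOY_ORPHAN;
}
```

In this model, toy_sync(1, 5, false) ends in orphan mode after a single
failed attempt, while toy_sync(1, 5, true) synchronizes on the second
attempt. With the patched behaviour, orphan mode is entered only if the
errors outlast the timeout.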

Apart from the issue in question (#3636), this patch also fixes spurious
replication-py/multi test failures that happened for exactly the same
reason (#3692).

Closes #3636
Closes #3692
4baa71bc