Skip to content
Snippets Groups Projects
user avatar
Serge Petrenko authored
When the master is just starting up it's possible for replica's JOIN
request to arrive right in time to bypass ER_LOADING check (after master
is fully recovered) but still fail due to ER_READONLY: box.cfg.read_only
is only read and set after box_cfg() (its C part) returns.

In this case the joining replica simply exits with an error and doesn't
retry JOIN.

Let's fix that. Make ER_READONLY a recoverable error and let replica
retry joining after receiving ER_READONLY.

Anonymous nodes relied on ER_READONLY to forbid replication from
anonymous to normal replicas. That check doesn't work anymore.
So introduce explicit checks banning replication from anonymous nodes.

Note, there were some alternatives to this fix.

First of all, theoretically, we could stop firing ER_LOADING later,
after box_cfg() is complete. This solution wouldn't work because it
would lead to deadlocks: the nodes would be stuck in replicaset_sync(),
because each of them rejects replication with ER_LOADING.

Another solution would be to read the real box.cfg.read_only value
earlier, in order to allow replication right after the node finishes
recovery.
This would also be bad, because we should never let a node become
writeable before box.cfg is finished. Even after local_recovery is
complete, the node should stay read-only until it synchronizes with
other replicas.

That said, neither of the two alternatives fit, so the solution with
retrying JOIN on ER_READONLY was chosen.

Since the bug is fixed, re-enable the test in which it was discovered:
replication-py/init_storage.test.py

Also, remove replication/ddl.test.lua from fragile list, since this bug
was the only reason for its fragility.

Closes #5337
Closes #6966

NO_DOC=minor bugfix
c1c77782
History
Name Last commit Last update
..