Commit aed45121 authored by Serge Petrenko, committed by Yaroslav Lobankov

replication: refuse to connect to master with nil UUID

The title is pretty self-explanatory. That's all this commit does. Now a
couple of words on why this is needed.

Commit 2a0c4f2b ("replication: make replica subscribe to master's
ballot") changed replica connect behaviour: instead of holding a single
connection to the master, replica may have two: master's ballot retrieval
is now performed in a separate connection owned by a separate fiber
called ballot_watcher.

The first connection to the master is initialized as before, and then
the applier fiber creates the ballot_watcher, which connects to the same
address on its own.

This led to some unexpected consequences: random cartridge integration
tests started failing with the following error:
tarantool/tarantool/cartridge/test-helpers/cluster.lua:209:
"localhost:13303": Replication setup failed, instance orphaned

Here's what happened. Cartridge has a module named remote control. The
module mimics a tarantool server and, before box.cfg{listen=...} is
called, "listens" on the same socket that tarantool itself will later
listen on.

For example, one can see the following output in tarantool logs with cartridge:
NO_WRAP
13:07:43.210 [10265] main/132/applier/admin@localhost:13301 I> remote master 46a71a25-4328-4a41-985d-d93d6ed7fb7f at 127.0.0.1:13301 running Tarantool 2.11.0
13:07:43.210 [10265] main/133/applier/admin@localhost:13302 I> remote master 00000000-0000-0000-0000-000000000000 at 127.0.0.1:13302 running Tarantool 1.10.0
13:07:43.210 [10265] main/134/applier/admin@localhost:13303 I> remote master bcce45ad-38b7-4d8a-936a-133614a7775f at 127.0.0.1:13303 running Tarantool 2.11.0
NO_WRAP

The second "Tarantool" in the output (with zero instance uuid and
running Tarantool 1.10.0) is the remote control on an unconfigured
tarantool instance.
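
A nil UUID is the all-zero UUID seen in the second log line above. As a
minimal sketch of the check, assuming the UUID arrives as a canonical
36-character string, the helper below tells the nil UUID apart from the
real ones. uuid_str_is_nil is a hypothetical name used for illustration
only; it is not Tarantool's tt_uuid_is_nil, which works on a parsed
struct tt_uuid.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper, for illustration only. Returns true when `s` is
 * a canonical 36-character UUID string consisting solely of zeros,
 * with dashes at the usual positions (8, 13, 18, 23).
 */
static bool
uuid_str_is_nil(const char *s)
{
	if (strlen(s) != 36)
		return false;
	for (int i = 0; i < 36; i++) {
		bool dash_pos = (i == 8 || i == 13 || i == 18 || i == 23);
		char expected = dash_pos ? '-' : '0';
		if (s[i] != expected)
			return false;
	}
	return true;
}
```

Of the three masters in the log above, only the one at 127.0.0.1:13302
would make this helper return true.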

Before the applier connection was split in two, this was not a problem:
applier would try to get the instance's ballot from the remote control
listener and fail (remote control doesn't answer replication requests).
Applier would retry connecting to the same address until it got a reply,
which meant that remote control had been stopped and the real tarantool
had started listening on the socket.

Now applier has two connections, and the following situation has become
possible: when the applier connection is initialized, remote control is
still working, so applier is connected to the remote control instance.
Applier fetches the ballot in a separate fiber, which is not yet
initialized, so no errors are raised.

As soon as applier creates the ballot watcher, remote control is stopped
and the real tarantool starts listening on the socket. This means that
no error happens in the ballot watcher either (normal tarantool answers
replication requests, of course). So we end up in an unhandled situation
where applier itself is connected to the (already dead) remote control
instance, while its ballot watcher is connected to the real tarantool.
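
The timing described above can be modelled with a toy timeline. This is
only a sketch under simplifying assumptions: time is an integer, the
socket is handed over from remote control to the real tarantool at a
single instant, and every name below (listener_at, connect_both,
peers_mismatch) is hypothetical and does not exist in Tarantool.

```c
#include <stdbool.h>

/* Who answers on the shared socket at a given moment (toy model). */
enum listener { REMOTE_CONTROL, REAL_TARANTOOL };

/* Remote control owns the socket until switch_time, then tarantool does. */
static enum listener
listener_at(int now, int switch_time)
{
	return now < switch_time ? REMOTE_CONTROL : REAL_TARANTOOL;
}

struct applier_peers {
	enum listener main_conn;   /* peer of the applier's own connection */
	enum listener ballot_conn; /* peer of the ballot watcher's connection */
};

/* The two connections are opened at different times to the same address. */
static struct applier_peers
connect_both(int applier_time, int watcher_time, int switch_time)
{
	struct applier_peers p = {
		.main_conn = listener_at(applier_time, switch_time),
		.ballot_conn = listener_at(watcher_time, switch_time),
	};
	return p;
}

/* True when the two connections ended up talking to different peers. */
static bool
peers_mismatch(struct applier_peers p)
{
	return p.main_conn != p.ballot_conn;
}
```

In this model the failing case is connect_both(0, 2, 1): the applier
connects before the handover and the watcher after it, so the peers
differ. If both connections land on the same side of the handover, there
is no mismatch.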

As soon as applier sees that the ballot is fetched, it continues the
connection process with the already dead remote control instance and
gets an error:
NO_WRAP
13:07:44.214 [10265] main/133/applier/admin@localhost:13302 I> failed to authenticate
13:07:44.214 [10265] main/133/applier/admin@localhost:13302 coio.c:326 E> SocketError: unexpected EOF when reading from socket, called on fd 1620, aka 127.0.0.1:54150: Broken pipe
13:07:44.214 [10265] main/133/applier/admin@localhost:13302 I> will retry every 1.00 second
13:07:44.214 [10265] main/115/remote_control/127.0.0.1:50242 C> failed to synchronize with 1 out of 3 replicas
13:07:44.214 [10265] main/115/remote_control/127.0.0.1:50242 I> entering orphan mode
NO_WRAP
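
The remedy named in the title, refusing a peer with a nil UUID and
reconnecting later, can be sketched as a small retry loop. This is an
illustration under assumptions, not the applier code: the enums and
functions below are hypothetical, and the peer is simulated as
answering with a nil UUID for the first nil_attempts connects.

```c
/* Outcome of one simulated connection attempt (hypothetical). */
enum conn_result { CONN_OK, CONN_NIL_UUID, CONN_FATAL };

/* What the caller does next (hypothetical). */
enum action { PROCEED, RECONNECT, STOP };

/* A nil UUID is treated as transient: the real node will show up later. */
static enum action
classify(enum conn_result r)
{
	switch (r) {
	case CONN_OK:
		return PROCEED;
	case CONN_NIL_UUID:
		return RECONNECT;
	default:
		return STOP;
	}
}

/*
 * Returns the 1-based attempt number on which the connection succeeded,
 * or -1 if it did not succeed within max_attempts.
 */
static int
connect_with_retry(int nil_attempts, int max_attempts)
{
	for (int attempt = 1; attempt <= max_attempts; attempt++) {
		enum conn_result r =
			attempt <= nil_attempts ? CONN_NIL_UUID : CONN_OK;
		enum action a = classify(r);
		if (a == PROCEED)
			return attempt;
		if (a == STOP)
			return -1;
		/* RECONNECT: drop this connection and try again. */
	}
	return -1;
}
```

With two nil-UUID answers in a row, connect_with_retry(2, 10) succeeds
on the third attempt, which mirrors the applier retrying until remote
control has been replaced by a real instance.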

Follow-up #5272
Closes #8185

NO_CHANGELOG=not user-visible
NO_DOC=not user-visible (can't create Tarantool with zero uuid)
parent de938a6f
@@ -367,6 +367,8 @@ applier_connection_init(struct iostream *io, const struct uri *uri,
 		tnt_raise(LoggedError, ER_PROTOCOL,
 			  "Unsupported protocol for replication");
 	}
+	if (tt_uuid_is_nil(&greeting->uuid))
+		tnt_raise(LoggedError, ER_NIL_UUID);
 }
 /**
@@ -2538,6 +2540,18 @@ applier_f(va_list ap)
 				applier_log_error(applier, e);
 				applier_disconnect(applier, APPLIER_DISCONNECTED);
 				goto reconnect;
+			} else if (e->errcode() == ER_NIL_UUID) {
+				/*
+				 * Real tarantool can't have a nil UUID. This
+				 * must be a cartridge remote control instance.
+				 * The error is transient, since remote control
+				 * will be replaced by a normal tarantool node
+				 * sooner or later.
+				 */
+				applier_log_error(applier, e);
+				applier_disconnect(applier,
+						   APPLIER_DISCONNECTED);
+				goto reconnect;
 			} else {
 				/* Unrecoverable errors */
 				applier_log_error(applier, e);
@@ -316,6 +316,7 @@ struct errcode_record {
 	/*261 */_(ER_BOOTSTRAP_NOT_UNANIMOUS, "Replica %s chose a different bootstrap leader %s") \
 	/*262 */_(ER_CANT_CHECK_BOOTSTRAP_LEADER, "Can't check who replica %s chose its bootstrap leader") \
 	/*263 */_(ER_BOOTSTRAP_CONNECTION_NOT_TO_ALL, "Some replica set members were not specified in box.cfg.replication") \
+	/*264 */_(ER_NIL_UUID, "Nil UUID is reserved and can't be used in replication") \
 /*
  * !IMPORTANT! Please follow instructions at start of the file
@@ -482,6 +482,7 @@ t;
 | 261: box.error.BOOTSTRAP_NOT_UNANIMOUS
 | 262: box.error.CANT_CHECK_BOOTSTRAP_LEADER
 | 263: box.error.BOOTSTRAP_CONNECTION_NOT_TO_ALL
+| 264: box.error.NIL_UUID
 | ...
 test_run:cmd("setopt delimiter ''");
local luatest = require('luatest')
local server = require('luatest.server')
local proxy = require('luatest.replica_proxy')
local g = luatest.group('gh_8185_nil_uuid_connection')
local fio = require('fio')

g.before_each(function(cg)
    cg.server = server:new({
        alias = 'tnt_server',
        box_cfg = {
            replication = {
                server.build_listen_uri('proxy'),
            },
        },
    })
    cg.proxy = proxy:new({
        client_socket_path = server.build_listen_uri('proxy'),
        server_socket_path = "/dev/null",
        -- Proxy will send nil UUID greeting as soon as client connects.
        process_client = {
            pre = function(c)
                c:forward_to_client(
                    'Tarantool 2.11.0 (Binary) '..
                    '00000000-0000-0000-0000-000000000000 \n'..
                    'y8PniqYLPVESGsAYwA+1Mm4NphVCVgDE3zBGpdiI5/c='..
                    ' \n')
                c:stop()
            end,
        },
    })
    cg.proxy:start({force = true})
end)

g.after_each(function(cg)
    cg.proxy:stop()
    cg.server:drop()
end)

g.test_nil_uuid = function(cg)
    cg.server:start({wait_until_ready = false})
    luatest.helpers.retrying({}, function()
        -- Pass log filepath manually, because box.cfg.log is not available.
        local log = fio.pathjoin(cg.server.workdir, cg.server.alias .. '.log')
        luatest.assert(cg.server:grep_log('ER_NIL_UUID', 1024, {
            filename = log}), 'Error detected')
    end)
end