Vladislav Shpilevoy authored
When a replica subscribes, it may at first try to position its
reader cursor at the end of a large xlog file.

Positioning inside this file can take a significant amount of
time. During that time the WAL reader yielded and tried to send
heartbeats, but couldn't, because the relay thread wasn't
communicating with the TX thread.

When there are no messages from TX for too long, heartbeats are
not sent to the replica (commit 56571d83 ("raft: make followers
notice leader hang")).

The relay must communicate with the TX thread even when subscribe
is just being started and opens a large xlog file.
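
The failure mode can be illustrated with a toy simulation. This is
only a sketch, not the relay code: the class, its methods, and all
timing values below are hypothetical, and time is modelled as
integer milliseconds. It assumes that heartbeats are suppressed once
TX has been silent for longer than a timeout, and compares
positioning with and without talking to TX on each yield.

```python
from dataclasses import dataclass, field


@dataclass
class RelaySim:
    """Toy model of a relay; every name and value here is hypothetical."""
    heartbeat_interval_ms: int = 1000  # assumed heartbeat period
    tx_timeout_ms: int = 4000          # assumed limit on TX silence
    now_ms: int = 0                    # simulated clock
    last_tx_message_ms: int = 0
    last_heartbeat_ms: int = 0
    heartbeats: list = field(default_factory=list)

    def poll_tx(self):
        # Stands in for exchanging messages with the TX thread.
        self.last_tx_message_ms = self.now_ms

    def maybe_send_heartbeat(self):
        # A heartbeat goes out only if TX was heard from recently
        # and the heartbeat period has elapsed.
        tx_alive = self.now_ms - self.last_tx_message_ms <= self.tx_timeout_ms
        due = self.now_ms - self.last_heartbeat_ms >= self.heartbeat_interval_ms
        if tx_alive and due:
            self.heartbeats.append(self.now_ms)
            self.last_heartbeat_ms = self.now_ms

    def position(self, rows, row_cost_ms, yield_every, talk_to_tx):
        # Skip `rows` rows of a large file, yielding every `yield_every` rows.
        for i in range(1, rows + 1):
            self.now_ms += row_cost_ms
            if i % yield_every == 0:
                if talk_to_tx:
                    self.poll_tx()
                self.maybe_send_heartbeat()


# Without TX communication the heartbeats stop after the timeout.
broken = RelaySim()
broken.position(rows=1000, row_cost_ms=10, yield_every=100, talk_to_tx=False)

# With TX communication on every yield they keep flowing.
fixed = RelaySim()
fixed.position(rows=1000, row_cost_ms=10, yield_every=100, talk_to_tx=True)

print(len(broken.heartbeats), len(fixed.heartbeats))
```

In this model the run without TX communication stops sending
heartbeats once the assumed timeout passes, while the run that polls
TX on each yield sends one per period until positioning is done.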

This isn't the first time missing heartbeats have resulted in
timeouts. See more here:

- commit 30ad4a55 ("relay: yield explicitly every N sent rows").

- commit 17289440 ("recovery: make it yield when positioning in a WAL").

- commit ee6de025 ("relay: send heartbeats while reading a WAL").

Given that this is the fourth fix of this kind, it suggests that
the relay architecture is not ideal and has some drawbacks.
See more in #9968.

Closes #9094

NO_DOC=bugfix
f7e6686a