    txn_limbo: do not confirm/rollback anything after restart · 6cc1b1f2
    Serge Petrenko authored
    It's important for the synchro queue owner to not finalize any of the
    pending synchronous transactions after restart.
    
    Since the node was down for some time, chances are high that it was
    deposed by some new leader during its downtime. This means that the node
    might not know yet that its transactions were already finalized by someone
    else.
    
    So, any arbitrary finalization might lead to a future split-brain, once the
    remote PROMOTE finally reaches the local node.
    
    Let's fix this by adding a new reason for the limbo to be frozen - a
    queue owner has recovered but has not issued a new PROMOTE locally and
    hasn't received any PROMOTE requests from the remote nodes.
    
    Once the first PROMOTE is issued or received, it's safe to return to the
    old mode of operation.
    
    So, now the synchro queue owner starts in "frozen" state and can't
    CONFIRM, ROLLBACK or issue new transactions until either issuing a
    PROMOTE or receiving a PROMOTE from some remote node.
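
The freeze described above can be pictured with a small toy model. This is only a sketch in Python, not Tarantool's actual txn_limbo code: the names ToyLimbo, ReadOnlyError, submit, confirm_all and on_promote are hypothetical, and the model deliberately leaves out how a PROMOTE finalizes the pending entries.

```python
from dataclasses import dataclass, field


class ReadOnlyError(Exception):
    """Stands in for ER_READONLY in this toy model."""


@dataclass
class ToyLimbo:
    """Hypothetical model of a synchro queue, not Tarantool's txn_limbo."""
    owner_id: int
    term: int
    # A recovered queue owner starts frozen until the first PROMOTE.
    frozen_until_promotion: bool = True
    pending: list = field(default_factory=list)

    def _check_not_frozen(self):
        if self.frozen_until_promotion:
            raise ReadOnlyError("synchro queue is frozen until promotion")

    def submit(self, txn):
        """Queue a new synchronous transaction (refused while frozen)."""
        self._check_not_frozen()
        self.pending.append(txn)

    def confirm_all(self):
        """Finalize every pending transaction (refused while frozen)."""
        self._check_not_frozen()
        confirmed, self.pending = self.pending, []
        return confirmed

    def on_promote(self, new_owner_id, new_term):
        """A PROMOTE was issued locally or received from a remote node."""
        self.owner_id = new_owner_id
        self.term = new_term
        self.frozen_until_promotion = False


# The owner restarts with one transaction recovered from its journal.
limbo = ToyLimbo(owner_id=1, term=2, pending=["txn-a"])
try:
    limbo.confirm_all()
except ReadOnlyError as err:
    print("blocked:", err)

# After the first PROMOTE the queue returns to the old mode of operation.
limbo.on_promote(new_owner_id=1, new_term=3)
print("confirmed:", limbo.confirm_all())
```

The point of the model is that the freeze is lifted by the PROMOTE itself, regardless of whether it was issued locally or came from a remote node.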
    
    This also required modifying box.ctl.promote() behaviour: it's no
    longer a no-op on a synchro queue owner, when elections are disabled and
    the queue is frozen due to restart.
    
    Also fix the tests, which assumed the queue owner is writeable after a
    restart. gh-5298 test was partially deleted, because it became pointless.
    
    And while we are at it, remove the double run of gh-5288 test. It is
    storage engine agnostic, so there's no point in running it for both
    memtx and vinyl.
    
    Part-of #5295
    
    NO_CHANGELOG=covered by previous commit
    
    @TarantoolBot document
    Title: ER_READONLY error receives new reasons
    
    When box.info.ro_reason is "synchro" and some operation throws an
    ER_READONLY error, this error now might include the following reason:
    ```
    Can't modify data on a read-only instance - synchro queue with term 2
    belongs to 1 (06c05d18-456e-4db3-ac4c-b8d0f291fd92) and is frozen due to
    fencing
    ```
    This means that the current instance is indeed the synchro queue owner,
    but it has noticed that someone else in the cluster might start new
    elections or might overtake the synchro queue soon.
    This may also be detected by `box.info.election.term` becoming greater than
    `box.info.synchro.queue.term` (this is the case for the second error
    message below).
    There is also a slightly different error message:
    ```
    Can't modify data on a read-only instance - synchro queue with term 2
    belongs to 1 (06c05d18-456e-4db3-ac4c-b8d0f291fd92) and is frozen until
    promotion
    ```
    This means that the node simply cannot guarantee that it is still the
    synchro queue owner (for example, after a restart, when a node still thinks
    it is the queue owner, but someone else in the cluster has already
    overtaken the queue).
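
A client that wants to tell the two reasons apart could inspect the error text. The following is a hedged Python sketch that assumes the message wording is exactly as quoted above; FrozenQueueInfo and parse_frozen_reason are hypothetical helper names, not part of any Tarantool connector.

```python
import re
from typing import NamedTuple, Optional


class FrozenQueueInfo(NamedTuple):
    term: int
    owner_id: int
    owner_uuid: str
    reason: str  # "fencing" or "promotion"


# Pattern built from the two messages quoted in this commit message.
_PATTERN = re.compile(
    r"synchro queue with term (\d+) belongs to (\d+) "
    r"\(([0-9a-f-]{36})\) and is frozen (due to fencing|until promotion)"
)


def parse_frozen_reason(message: str) -> Optional[FrozenQueueInfo]:
    """Return the parsed fields, or None if the text does not match."""
    # Collapse line wraps and repeated spaces before matching.
    flat = " ".join(message.split())
    match = _PATTERN.search(flat)
    if match is None:
        return None
    term, owner_id, owner_uuid, tail = match.groups()
    reason = "fencing" if tail == "due to fencing" else "promotion"
    return FrozenQueueInfo(int(term), int(owner_id), owner_uuid, reason)


example = (
    "Can't modify data on a read-only instance - synchro queue with term 2 "
    "belongs to 1 (06c05d18-456e-4db3-ac4c-b8d0f291fd92) and is frozen until "
    "promotion"
)
print(parse_frozen_reason(example))
```

Matching on message text is fragile by nature, so a sketch like this is better suited to diagnostics and logging than to control flow.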