Feat/self-healing

Georgy Moshkin requested to merge feat/raft-msg-error-handling into master

See also RFC. Closes #311 (closed)

Needs tarantool-module!364 (merged)

  • Introduce an InstanceReachabilityManager entity, which is responsible for tracking the outcomes of raft messages sent to cluster instances.
  • ConnectionPoolWorkers now report to the reachability manager (by calling InstanceReachabilityManager::report_result) if instances didn't respond to raft messages.
  • The raft main loop now queries the reachability manager (by calling InstanceReachabilityManager::get_unreachables_to_report) for instances which should be reported as unreachable and calls RawNode::report_unreachable for them, which updates the raft node's internal state. After that, most messages are no longer sent to the reported instances until these instances start sending raft messages to the leader.
  • Heartbeats, though, are still generated by the raft node with the same frequency, so the reachability manager also determines how often heartbeats are sent to the unreachable instances (a basic exponential decay is currently used) in InstanceReachabilityManager::should_send_heartbeat_this_tick.
  • Introduce a new fiber sentinel which is now responsible for changing target grades in most cases:
    • When the instance is gracefully shutting down, it sends a CaS request to set its own target grade to Offline and, if the request is accepted, stops.
    • When it's running on the leader, it queries InstanceReachabilityManager for a list of unreachable instances, and if one of them has target grade == Online, it sends a CaS request to change it to Offline.
    • When it's running on a follower, it checks whether its own target grade is Offline and, if so, sends a CaS request to change it to Online.
  • The on_shutdown trigger was also changed accordingly. It used to spawn a fiber to change its target grade to Offline, but now it just wakes up the sentinel and notifies it that a graceful shutdown is in progress.
  • Fixed a panic which would sometimes happen when the replication master switched over while a raft snapshot was being applied.
  • Introduce new auto_offline_timeout and max_heartbeat_period pico properties which configure the appropriate behavior
  • When starting up, an instance will not wake up the sentinel until it confirms that its target grade has changed to Online (or a timeout is exceeded). This is needed to reduce the number of redundant incarnation changes.
  • When update instance requests are handled, the requestor now checks whether the instance would actually change before proposing the Dml request. This is needed to avoid redundant Dml operations, which are now possible, for example, when an instance is shutting down.
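
To illustrate how the three reachability manager calls described above could fit together, here is a minimal sketch. It is not the code from this merge request: the struct name SketchReachabilityManager, the InstanceId alias, all signatures, and the tick-based doubling of the heartbeat period are assumptions made for illustration. Only the method names come from the description.

```rust
use std::collections::HashMap;

/// Hypothetical instance identifier; the real type is not shown in this MR.
type InstanceId = u64;

/// Per-instance bookkeeping (all fields are assumptions of this sketch).
#[derive(Default)]
struct Reachability {
    /// Consecutive failed sends since the last success.
    fail_streak: u32,
    /// Whether the instance was already handed to the raft loop as unreachable.
    reported: bool,
    /// Ticks elapsed since the last heartbeat that was let through.
    ticks_since_heartbeat: u64,
}

struct SketchReachabilityManager {
    instances: HashMap<InstanceId, Reachability>,
    /// Upper bound on the heartbeat period, in ticks
    /// (stands in for the max_heartbeat_period property).
    max_heartbeat_period_ticks: u64,
}

impl SketchReachabilityManager {
    fn new(max_heartbeat_period_ticks: u64) -> Self {
        Self { instances: HashMap::new(), max_heartbeat_period_ticks }
    }

    /// Called by a connection pool worker with the outcome of a raft message.
    fn report_result(&mut self, id: InstanceId, success: bool) {
        let info = self.instances.entry(id).or_default();
        if success {
            // Any success resets the failure state entirely.
            *info = Reachability::default();
        } else {
            info.fail_streak = info.fail_streak.saturating_add(1);
        }
    }

    /// Called by the raft main loop; returns each failing instance once,
    /// so the caller can invoke RawNode::report_unreachable for it.
    fn get_unreachables_to_report(&mut self) -> Vec<InstanceId> {
        let mut res = Vec::new();
        for (id, info) in self.instances.iter_mut() {
            if info.fail_streak > 0 && !info.reported {
                info.reported = true;
                res.push(*id);
            }
        }
        res.sort_unstable(); // deterministic order for the caller
        res
    }

    /// Heartbeats to healthy or unknown instances always pass. For failing
    /// instances the period doubles per failure (2, 4, 8, ... ticks),
    /// capped at max_heartbeat_period_ticks.
    fn should_send_heartbeat_this_tick(&mut self, id: InstanceId) -> bool {
        let max = self.max_heartbeat_period_ticks.max(1);
        let info = match self.instances.get_mut(&id) {
            Some(info) if info.fail_streak > 0 => info,
            _ => return true,
        };
        let period = (1u64 << info.fail_streak.min(63)).min(max);
        info.ticks_since_heartbeat += 1;
        if info.ticks_since_heartbeat >= period {
            info.ticks_since_heartbeat = 0;
            true
        } else {
            false
        }
    }
}
```

In this sketch a single successful send clears the failure state, so the instance immediately receives heartbeats on every tick again and would be reported anew if it fails later. The actual manager may use time-based thresholds rather than tick counts.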
Edited by Georgy Moshkin
