Feat/self-healing
See also RFC. Closes #311 (closed)
Needs tarantool-module!364 (merged)
- Introduce an `InstanceReachabilityManager` entity, which is responsible for tracking the outcomes of raft messages sent to cluster instances.
- `ConnectionPoolWorker`s now report to the reachability manager (by calling `InstanceReachabilityManager::report_result`) if instances didn't respond to raft messages.
- The raft main loop now queries the reachability manager (by calling `InstanceReachabilityManager::get_unreachables_to_report`) for instances which should be reported as unreachable and calls `RawNode::report_unreachable` for them, which updates the raft node's internal state. After that, most messages are not sent to the reported instances until these instances start sending raft messages to the leader.
- Heartbeats, though, are still generated by the raft node with the same frequency, so the reachability manager also determines how often heartbeats are sent to the unreachable instances in `InstanceReachabilityManager::should_send_heartbeat_this_tick` (a basic exponential decay is currently used).
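The reachability bookkeeping described above can be sketched roughly as follows. This is a minimal illustration and not the code from this MR: the struct `ReachabilitySketch`, its fields, and all method signatures are assumptions. Only the method names come from the description. The heartbeat throttling is modeled here as a period that doubles with each consecutive failure and is capped by a hypothetical `max_heartbeat_period_ticks` value.

```rust
use std::collections::HashMap;

/// Hypothetical per-instance bookkeeping; field names are illustrative.
#[derive(Default)]
struct InstanceState {
    /// Consecutive failed sends since the last success.
    fail_streak: u32,
    /// Whether this instance was already handed to raft as unreachable.
    reported: bool,
    /// Ticks elapsed since a heartbeat was last let through.
    ticks_since_heartbeat: u64,
}

/// Assumed shape of a reachability tracker, not the real implementation.
struct ReachabilitySketch {
    instances: HashMap<u64, InstanceState>,
    max_heartbeat_period_ticks: u64,
}

impl ReachabilitySketch {
    fn new(max_heartbeat_period_ticks: u64) -> Self {
        Self {
            instances: HashMap::new(),
            max_heartbeat_period_ticks,
        }
    }

    /// Called by connection workers with the outcome of a send attempt.
    fn report_result(&mut self, raft_id: u64, success: bool) {
        let state = self.instances.entry(raft_id).or_default();
        if success {
            // Any success resets the instance to the reachable state.
            *state = InstanceState::default();
        } else {
            state.fail_streak = state.fail_streak.saturating_add(1);
        }
    }

    /// Returns instances that failed and were not reported yet,
    /// marking them as reported so each is returned only once.
    fn get_unreachables_to_report(&mut self) -> Vec<u64> {
        let mut result = Vec::new();
        for (&raft_id, state) in self.instances.iter_mut() {
            if state.fail_streak > 0 && !state.reported {
                state.reported = true;
                result.push(raft_id);
            }
        }
        result.sort_unstable();
        result
    }

    /// Lets a heartbeat through once per period; the period doubles with
    /// every consecutive failure and is capped by the configured maximum.
    fn should_send_heartbeat_this_tick(&mut self, raft_id: u64) -> bool {
        let max_period = self.max_heartbeat_period_ticks.max(1);
        let state = match self.instances.get_mut(&raft_id) {
            Some(state) => state,
            None => return true, // nothing known: treat as reachable
        };
        if state.fail_streak == 0 {
            return true;
        }
        let period = (1u64 << state.fail_streak.min(32)).min(max_period);
        state.ticks_since_heartbeat += 1;
        if state.ticks_since_heartbeat >= period {
            state.ticks_since_heartbeat = 0;
            true
        } else {
            false
        }
    }
}
```

In this sketch a single successful send fully resets the instance, so heartbeats return to the normal rate immediately. Whether the actual implementation recovers that eagerly is not stated in the description.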
- Introduce a new fiber, `sentinel`, which is now responsible for changing target grades in most cases:
  - When the instance is gracefully shutting down, it sends a CaS request to set its own target grade to Offline and stops if the request is accepted.
  - When it's running on the leader, it queries `InstanceReachabilityManager` for a list of unreachable instances, and if one of them has target grade == Online, it sends a CaS request to change it to Offline.
  - When it's running on a follower, it checks whether its own target grade is Offline and, if so, sends a CaS request to change it to Online.
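The three sentinel cases above can be read as one decision function. The sketch below is an assumption-laden illustration: `SentinelInput`, `SentinelAction`, and `sentinel_step` are hypothetical names, the CaS request is reduced to a returned value, and both giving shutdown priority over the other cases and handling only one unreachable instance per step are choices made for this sketch rather than facts from the description.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Grade {
    Online,
    Offline,
}

/// What the sentinel would do on one iteration (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum SentinelAction {
    /// Send a CaS request setting the target grade of `instance` to `grade`.
    ProposeTargetGrade { instance: u64, grade: Grade },
    /// Nothing to change on this iteration.
    Nothing,
}

/// Everything the decision depends on (hypothetical type).
struct SentinelInput<'a> {
    my_id: u64,
    my_target_grade: Grade,
    is_leader: bool,
    shutting_down: bool,
    /// (raft id, target grade) of instances considered unreachable.
    unreachables: &'a [(u64, Grade)],
}

fn sentinel_step(input: &SentinelInput) -> SentinelAction {
    // Graceful shutdown: request own target grade Offline.
    if input.shutting_down {
        return SentinelAction::ProposeTargetGrade {
            instance: input.my_id,
            grade: Grade::Offline,
        };
    }
    // Leader: move the first unreachable instance that still has
    // target grade Online to Offline.
    if input.is_leader {
        for &(instance, target_grade) in input.unreachables {
            if target_grade == Grade::Online {
                return SentinelAction::ProposeTargetGrade {
                    instance,
                    grade: Grade::Offline,
                };
            }
        }
        return SentinelAction::Nothing;
    }
    // Follower: if own target grade is Offline, request Online again.
    if input.my_target_grade == Grade::Offline {
        return SentinelAction::ProposeTargetGrade {
            instance: input.my_id,
            grade: Grade::Online,
        };
    }
    SentinelAction::Nothing
}
```

Keeping the decision free of side effects, as done here, makes each case easy to check in isolation; the real fiber presumably also handles request failures and wakeups, which this sketch leaves out.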
- The `on_shutdown` trigger was also changed accordingly. It used to spawn a fiber to change its target grade to Offline, but now it just wakes up the sentinel and notifies it that a graceful shutdown is in progress.
- Fixed a panic which would sometimes happen when the replication master switched over while a raft snapshot was being applied.
- Introduce new `auto_offline_timeout` and `max_heartbeat_period` pico properties which configure the corresponding behavior.
- When starting up, an instance will not wake up the sentinel until it confirms that its target grade has changed to Online (or a timeout is exceeded). This is needed to reduce the number of redundant incarnation changes.
- When update instance requests are handled, the requestor checks whether the instance would actually change before the Dml request is proposed. This is needed to avoid redundant Dml operations, which are now possible, for example, when an instance is shutting down.
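The "did the instance actually change" check from the last item could look roughly like this. All types and the function `instance_after_update` are hypothetical and heavily simplified; the real instance record and update request carry more fields than the two grades modeled here.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Grade {
    Online,
    Offline,
}

/// Simplified stand-in for an instance record (hypothetical).
#[derive(Clone, Debug, PartialEq, Eq)]
struct InstanceSketch {
    current_grade: Grade,
    target_grade: Grade,
}

/// Simplified stand-in for an update instance request (hypothetical).
/// `None` means "leave this field as it is".
#[derive(Default)]
struct UpdateRequestSketch {
    current_grade: Option<Grade>,
    target_grade: Option<Grade>,
}

/// Applies the request to a copy of the instance and returns the new
/// record only if it differs from the old one. `None` means there is
/// nothing to propose.
fn instance_after_update(
    old: &InstanceSketch,
    req: &UpdateRequestSketch,
) -> Option<InstanceSketch> {
    let mut new = old.clone();
    if let Some(grade) = req.current_grade {
        new.current_grade = grade;
    }
    if let Some(grade) = req.target_grade {
        new.target_grade = grade;
    }
    if new == *old {
        None
    } else {
        Some(new)
    }
}
```

A caller would propose the Dml operation only when this returns `Some(..)`. For instance, a repeated request to set the target grade to Offline on an instance that is already shutting down yields `None`.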
Edited by Georgy Moshkin