Feat/self-healing
See also RFC. Closes #311 (closed)
Needs tarantool-module!364 (merged)
- Introduce an `InstanceReachabilityManager` entity which is responsible for tracking the outcomes of raft messages sent to cluster instances.
- `ConnectionPoolWorker`s now report to the reachability manager (by calling `InstanceReachabilityManager::report_result`) if instances didn't respond to raft messages.
- The raft main loop now queries the reachability manager (by calling `InstanceReachabilityManager::get_unreachables_to_report`) for instances which should be reported as unreachable and calls `RawNode::report_unreachable` for them, which updates the raft node's internal state. After that, most messages are not sent to the reported instances until these instances start sending raft messages to the leader.
- Heartbeats, though, are still generated by the raft node with the same frequency, so the reachability manager also determines how often heartbeats are sent to the unreachable instances (a basic exponential decay is used currently) in `InstanceReachabilityManager::should_send_heartbeat_this_tick`.
- Introduce a new fiber `sentinel` which is now responsible for changing target grades in most cases:
  - When the instance is gracefully shutting down, it sends a CaS request to set its own target grade to Offline and, if the request is accepted, it stops.
  - When it's running on the leader, it queries `InstanceReachabilityManager` for a list of unreachable instances and, if one of them has target grade == Online, it sends a CaS request to change it to Offline.
  - When it's running on a follower, it checks if its own target grade is Offline and sends a CaS request to change it to Online.
- The `on_shutdown` trigger was also changed accordingly. It used to spawn a fiber to change its target grade to Offline, but now it just wakes up the sentinel and notifies it that graceful shutdown is in progress.
- Fixed a panic which would sometimes happen when the replication master switched over while a raft snapshot was being applied.
- Introduce new `auto_offline_timeout` and `max_heartbeat_period` pico properties which configure the corresponding behavior.
- When starting up, an instance will not wake up the sentinel until it confirms that its target grade changed to Online (or the timeout is exceeded). This is needed to reduce the number of redundant incarnation changes.
- When update instance requests are handled, before the Dml request is proposed the requestor checks whether the instance actually changed. This is needed to avoid redundant Dml operations, which are now possible, for example, when an instance is shutting down.
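
For reviewers, here is a minimal sketch of how the heartbeat throttling for unreachable instances could look. It is not the code from this merge request: the `HeartbeatBackoff` struct, its fields, and the `heartbeat_ticks` helper are hypothetical, and it assumes the period is counted in raft ticks and doubles after every heartbeat sent, capped by a value playing the role of `max_heartbeat_period`.

```rust
/// Hypothetical per-instance backoff state (illustration only,
/// not the actual `InstanceReachabilityManager` internals).
#[derive(Debug)]
struct HeartbeatBackoff {
    /// Ticks to wait between heartbeats; doubles after each heartbeat sent.
    period: u32,
    /// Ticks elapsed since the last heartbeat was sent.
    ticks_since_last: u32,
    /// Upper bound for `period`. Assumption: stands in for
    /// `max_heartbeat_period`, expressed here in ticks.
    max_period: u32,
}

impl HeartbeatBackoff {
    fn new(max_period: u32) -> Self {
        Self { period: 1, ticks_since_last: 0, max_period: max_period.max(1) }
    }

    /// Called once per raft tick for an instance reported as unreachable.
    /// Returns true if the generated heartbeat should actually be sent.
    fn should_send_heartbeat_this_tick(&mut self) -> bool {
        self.ticks_since_last += 1;
        if self.ticks_since_last < self.period {
            return false;
        }
        self.ticks_since_last = 0;
        // Exponential growth of the waiting period, capped at `max_period`.
        self.period = self.period.saturating_mul(2).min(self.max_period);
        true
    }

    /// Called when the instance responds again: back to every-tick heartbeats.
    fn reset(&mut self) {
        self.period = 1;
        self.ticks_since_last = 0;
    }
}

/// Returns the 1-based tick numbers on which a heartbeat would be sent.
fn heartbeat_ticks(max_period: u32, total_ticks: u32) -> Vec<u32> {
    let mut backoff = HeartbeatBackoff::new(max_period);
    (1..=total_ticks)
        .filter(|_| backoff.should_send_heartbeat_this_tick())
        .collect()
}

fn main() {
    // With a cap of 8 ticks the gaps grow 2, 4, 8 and then stay at 8.
    println!("{:?}", heartbeat_ticks(8, 20)); // prints [1, 3, 7, 15]

    let mut backoff = HeartbeatBackoff::new(8);
    for _ in 0..10 {
        backoff.should_send_heartbeat_this_tick();
    }
    backoff.reset();
    // After a reset the very next tick sends a heartbeat again.
    println!("{}", backoff.should_send_heartbeat_this_tick()); // prints true
}
```

The cap keeps an instance that has been down for a long time from being probed arbitrarily rarely, which bounds how long it takes the leader to notice that the instance is back.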
Edited by Georgy Moshkin