!624

Feat/self-healing


Overview 0
Commits 9
Pipelines 23
Changes 15

Merged Georgy Moshkin requested to merge feat/raft-msg-error-handling into master 1 year ago

See also RFC. Closes #311 (closed)

Needs tarantool-module!364 (merged)

Introduce an InstanceReachabilityManager entity, which is responsible for tracking the outcomes of raft messages sent to cluster instances.
ConnectionPoolWorkers now report to the reachability manager (by calling InstanceReachabilityManager::report_result) if instances didn't respond to raft messages.
The raft main loop now queries the reachability manager (by calling InstanceReachabilityManager::get_unreachables_to_report) for instances that should be reported as unreachable and calls RawNode::report_unreachable for each of them, which updates the raft node's internal state. After that, most messages are no longer sent to the reported instances until those instances start sending raft messages to the leader.
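The interaction between the connection workers and the raft main loop can be sketched roughly as follows. This is a minimal illustration, not the code from this merge request: the `ReachabilitySketch` type, the `InstanceId` alias, the `success: bool` parameter, and the "report each outage once" policy are all assumptions made for the example.

```rust
use std::collections::HashMap;

/// Hypothetical instance identifier; the real type is not shown in this description.
type InstanceId = u64;

/// Per-instance bookkeeping (assumed layout, for illustration only).
#[derive(Debug, Default)]
struct Info {
    /// Number of consecutive failed delivery attempts.
    fail_streak: u32,
    /// Whether the current outage was already handed to the raft main loop.
    is_reported: bool,
}

/// Simplified stand-in for the reachability manager.
#[derive(Debug, Default)]
struct ReachabilitySketch {
    infos: HashMap<InstanceId, Info>,
}

impl ReachabilitySketch {
    /// Called by a connection worker after an attempt to deliver a raft message.
    fn report_result(&mut self, id: InstanceId, success: bool) {
        let info = self.infos.entry(id).or_default();
        if success {
            // The instance responded: forget the outage entirely.
            info.fail_streak = 0;
            info.is_reported = false;
        } else {
            info.fail_streak += 1;
        }
    }

    /// Called from the raft main loop. Returns instances whose outage has not
    /// been reported yet and marks them as reported, so each outage is
    /// returned only once. The caller would then invoke
    /// `RawNode::report_unreachable` for every returned id.
    fn get_unreachables_to_report(&mut self) -> Vec<InstanceId> {
        let mut res = Vec::new();
        for (id, info) in &mut self.infos {
            if info.fail_streak > 0 && !info.is_reported {
                info.is_reported = true;
                res.push(*id);
            }
        }
        // HashMap iteration order is unspecified; sort for determinism.
        res.sort_unstable();
        res
    }
}
```

Marking an outage as reported keeps the main loop from calling `RawNode::report_unreachable` for the same instance on every iteration; a successful delivery clears the flag so that a later outage is reported again.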
Heartbeats, however, are still generated by the raft node at the same frequency, so the reachability manager also determines how often heartbeats are sent to unreachable instances (currently a basic exponential decay is used) in InstanceReachabilityManager::should_send_heartbeat_this_tick.
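One way such an exponential decay could work is shown below. This is a sketch under stated assumptions: the `HeartbeatBackoff` type, the doubling factor, and measuring the period in raft ticks are choices made for the example, and `max_period` only plays the role that `max_heartbeat_period` has in the description (whose actual unit is not stated here).

```rust
/// Hypothetical per-instance backoff state for an unreachable instance,
/// measured in raft ticks.
#[derive(Debug)]
struct HeartbeatBackoff {
    /// Current number of ticks between heartbeats that are let through.
    period: u32,
    /// Upper bound for `period`.
    max_period: u32,
    /// Ticks elapsed since the last heartbeat was let through.
    ticks_since_last: u32,
}

impl HeartbeatBackoff {
    fn new(max_period: u32) -> Self {
        Self {
            period: 1,
            max_period: max_period.max(1),
            ticks_since_last: 0,
        }
    }

    /// Called once per tick. Returns true if the heartbeat generated by the
    /// raft node on this tick should actually be sent.
    fn should_send_heartbeat_this_tick(&mut self) -> bool {
        self.ticks_since_last += 1;
        if self.ticks_since_last < self.period {
            return false;
        }
        // Let this heartbeat through and double the wait, up to the cap.
        self.ticks_since_last = 0;
        self.period = self.period.saturating_mul(2).min(self.max_period);
        true
    }

    /// Called when the instance responds again.
    fn reset(&mut self) {
        self.period = 1;
        self.ticks_since_last = 0;
    }
}
```

With `max_period = 8`, heartbeats pass on ticks 1, 3, 7, 15, 23, and then every 8 ticks, so an unreachable instance is still probed periodically but at a bounded, decreasing rate.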
Introduce a new fiber, sentinel, which is now responsible for changing target grades in most cases:
- When the instance is gracefully shutting down, it sends a CaS request to set its own target grade to Offline and stops once the request is accepted.
- When it's running on the leader, it queries InstanceReachabilityManager for a list of unreachable instances and, if one of them has target grade == Online, sends a CaS request to change it to Offline.
- When it's running on a follower, it checks whether its own target grade is Offline and, if so, sends a CaS request to change it to Online.
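The three cases above amount to a small decision function. The sketch below models only the decision, not the fiber or the CaS machinery; `Grade`, `SentinelView`, `SentinelAction`, and `next_action` are hypothetical names introduced for illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Grade {
    Online,
    Offline,
}

/// What the sentinel would do on one iteration (assumed representation).
#[derive(Debug, PartialEq, Eq)]
enum SentinelAction {
    /// Send a CaS request setting the target grade of `instance` to `target`.
    SetTargetGrade { instance: u64, target: Grade },
    /// Nothing to do on this iteration.
    Nothing,
}

/// Snapshot of the state the sentinel looks at (assumed fields).
struct SentinelView<'a> {
    my_id: u64,
    my_target_grade: Grade,
    is_leader: bool,
    shutting_down: bool,
    /// (instance id, target grade) pairs for instances that the
    /// reachability manager currently considers unreachable.
    unreachable: &'a [(u64, Grade)],
}

fn next_action(view: &SentinelView<'_>) -> SentinelAction {
    // Case 1: graceful shutdown takes priority. The real sentinel stops once
    // the request is accepted; that part is not modeled here.
    if view.shutting_down {
        return SentinelAction::SetTargetGrade {
            instance: view.my_id,
            target: Grade::Offline,
        };
    }
    // Case 2: the leader turns unreachable instances Offline, one at a time.
    if view.is_leader {
        for &(instance, target) in view.unreachable {
            if target == Grade::Online {
                return SentinelAction::SetTargetGrade {
                    instance,
                    target: Grade::Offline,
                };
            }
        }
        return SentinelAction::Nothing;
    }
    // Case 3: a follower that finds itself marked Offline asks to go Online.
    if view.my_target_grade == Grade::Offline {
        return SentinelAction::SetTargetGrade {
            instance: view.my_id,
            target: Grade::Online,
        };
    }
    SentinelAction::Nothing
}
```

Keeping the decision separate from the request sending makes each case easy to check in isolation, for example that a follower with target grade Online produces `SentinelAction::Nothing`.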
The on_shutdown trigger was also changed accordingly. It used to spawn a fiber to change its target grade to Offline, but now it just wakes up the sentinel and notifies it that a graceful shutdown is in progress.
Fixed a panic which would sometimes happen when the replication master switched over while a raft snapshot was being applied.
Introduce new auto_offline_timeout and max_heartbeat_period pico properties, which configure the corresponding behavior.
When starting up, an instance will not wake up the sentinel until it confirms that its target grade has changed to Online (or a timeout is exceeded). This is needed to reduce the number of redundant incarnation changes.
When update instance requests are handled, the requestor checks whether the instance would actually change before the Dml request is proposed. This is needed to avoid redundant Dml operations, which are now possible, for example, when an instance is shutting down.
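The "did anything actually change" check can be illustrated as below. This is a sketch only: `Instance`, `UpdateRequest`, `InstanceGrade`, and `apply_update` are hypothetical, simplified stand-ins, and the real request and instance types carry more fields than shown.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum InstanceGrade {
    Online,
    Offline,
}

/// Simplified instance record (assumed fields).
#[derive(Debug, Clone, PartialEq, Eq)]
struct Instance {
    id: u64,
    current_grade: InstanceGrade,
    target_grade: InstanceGrade,
}

/// Simplified update request: `None` means "leave the field as is".
#[derive(Debug, Default)]
struct UpdateRequest {
    current_grade: Option<InstanceGrade>,
    target_grade: Option<InstanceGrade>,
}

/// Applies the request to a copy of the instance. Returns `Some(updated)`
/// only if the result differs from the original, so the caller can skip
/// proposing a Dml operation when nothing would change.
fn apply_update(instance: &Instance, req: &UpdateRequest) -> Option<Instance> {
    let mut updated = instance.clone();
    if let Some(grade) = req.current_grade {
        updated.current_grade = grade;
    }
    if let Some(grade) = req.target_grade {
        updated.target_grade = grade;
    }
    if updated == *instance {
        None
    } else {
        Some(updated)
    }
}
```

For instance, a request to set the target grade to Offline for an instance whose target grade is already Offline yields `None`, which is the situation that can arise when both the shutting-down instance and the leader try to mark it Offline.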

Edited 1 year ago by Georgy Moshkin
