ScyllaDB Docs ScyllaDB Open Source ScyllaDB Architecture Raft Consensus Algorithm in ScyllaDB

Caution

You're viewing documentation for a previous version. Switch to the latest stable version.

Raft Consensus Algorithm in ScyllaDB¶

Introduction¶

ScyllaDB was originally designed, following Apache Cassandra, to use gossip for topology and schema updates and the Paxos consensus algorithm for strong data consistency (LWT). To achieve stronger consistency without performance penalty, ScyllaDB has turned to Raft - a consensus algorithm designed as an alternative to both gossip and Paxos.

Raft is a consensus algorithm that implements a distributed, consistent, replicated log across members (nodes). Raft implements consensus by first electing a distinguished leader, then giving the leader complete responsibility for managing the replicated log. The leader accepts log entries from clients, replicates them on other servers, and tells servers when it is safe to apply log entries to their state machines.

Raft uses a heartbeat mechanism to trigger a leader election. All servers start as followers and remain in the follower state as long as they receive valid RPCs (heartbeat) from a leader or candidate. A leader sends periodic heartbeats to all followers to maintain his authority (leadership). Suppose a follower receives no communication over a period called the election timeout. In that case, it assumes no viable leader and begins an election to choose a new leader.

Leader selection is described in detail in the Raft paper.

ScyllaDB can use Raft to maintain schema updates in every node (see below). Any schema update, like ALTER, CREATE or DROP TABLE, is first committed as an entry in the replicated Raft log, and, once stored on most replicas, applied to all nodes in the same order, even in the face of a node or network failures.

Upcoming ScyllaDB releases will use Raft to guarantee consistent topology updates similarly.

Quorum Requirement¶

Raft requires at least a quorum of nodes in a cluster to be available. If multiple nodes fail and the quorum is lost, the cluster is unavailable for schema updates. See Handling Failures for information on how to handle failures.

Note that when you have a two-DC cluster with the same number of nodes in each DC, the cluster will lose the quorum if one of the DCs is down. We recommend configuring three DCs per cluster to ensure that the cluster remains available and operational when one DC is down.

Enabling Raft¶

Note

In ScyllaDB 5.2 and ScyllaDB Enterprise 2023.1 Raft is Generally Available and can be safely used for consistent schema management. In further versions, it will be mandatory.

ScyllaDB Open Source 5.2 and later, and ScyllaDB Enterprise 2023.1 and later come equipped with a procedure that can setup Raft-based consistent cluster management in an existing cluster. We refer to this as the Raft upgrade procedure (do not confuse with the ScyllaDB version upgrade procedure).

Warning

Once enabled, Raft cannot be disabled on your cluster. The cluster nodes will fail to restart if you remove the Raft feature.

To enable Raft in an existing cluster, you need to enable the consistent_cluster_management option in the scylla.yaml file for each node in the cluster:

Ensure that the schema is synchronized in the cluster by executing nodetool describecluster on each node and ensuring that the schema version is the same on all nodes.
Perform a rolling restart, updating the scylla.yaml file for each node in the cluster before restarting it to enable the consistent_cluster_management option:
consistent_cluster_management: true

When all the nodes in the cluster and updated and restarted, the cluster will start the Raft upgrade procedure. You must then verify that the Raft upgrade procedure has finished successfully. Refer to the next section.

Alternatively, you can enable the consistent_cluster_management option when you are:

Performing a rolling upgrade from version 5.1 to 5.2 or version 2022.x to 2023.1 by updating scylla.yaml before restarting each node. The Raft upgrade procedure will start as soon as the last node was upgraded and restarted. As above, this requires verifying that the procedure successfully finishes.
Creating a new cluster. This does not use the Raft upgrade procedure; instead, Raft is functioning in the cluster and managing schema right from the start.

Until all nodes are restarted with consistent_cluster_management: true, it is still possible to turn this option back off. Once enabled on every node, it must remain turned on (or the node will refuse to restart).

Verifying that the Raft upgrade procedure finished successfully¶

The Raft upgrade procedure starts as soon as every node in the cluster restarts with consistent_cluster_management flag enabled in scylla.yaml.

The procedure requires full cluster availability to correctly setup the Raft algorithm; after the setup finishes, Raft can proceed with only a majority of nodes, but this initial setup is an exception. An unlucky event, such as a hardware failure, may cause one of your nodes to fail. If this happens before the Raft upgrade procedure finishes, the procedure will get stuck and your intervention will be required.

To verify that the procedure finishes, look at the log of every Scylla node (using journalctl _COMM=scylla). Search for the following patterns:

Starting internal upgrade-to-raft procedure denotes the start of the procedure,
Raft upgrade finished denotes the end.

The following is an example of a log from a node which went through the procedure correctly. Some parts were truncated for brevity:

features - Feature SUPPORTS_RAFT_CLUSTER_MANAGEMENT is enabled
raft_group0 - finish_setup_after_join: SUPPORTS_RAFT feature enabled. Starting internal upgrade-to-raft procedure.
raft_group0_upgrade - starting in `use_pre_raft_procedures` state.
raft_group0_upgrade - Waiting until everyone is ready to start upgrade...
raft_group0_upgrade - Joining group 0...
raft_group0 - server 624fa080-8c0e-4e3d-acf6-10af473639ca joined group 0 with group id 8f8a1870-5c4e-11ed-bb13-fe59693a23c9
raft_group0_upgrade - Waiting until every peer has joined Raft group 0...
raft_group0_upgrade - Every peer is a member of Raft group 0.
raft_group0_upgrade - Waiting for schema to synchronize across all nodes in group 0...
raft_group0_upgrade - synchronize_schema: my version: a37a3b1e-5251-3632-b6b4-a9468a279834
raft_group0_upgrade - synchronize_schema: schema mismatches: {}. 3 nodes had a matching version.
raft_group0_upgrade - synchronize_schema: finished.
raft_group0_upgrade - Entering synchronize state.
raft_group0_upgrade - Schema changes are disabled in synchronize state. If a failure makes us unable to proceed, manual recovery will be required.
raft_group0_upgrade - Waiting for all peers to enter synchronize state...
raft_group0_upgrade - All peers in synchronize state. Waiting for schema to synchronize...
raft_group0_upgrade - synchronize_schema: collecting schema versions from group 0 members...
raft_group0_upgrade - synchronize_schema: collected remote schema versions.
raft_group0_upgrade - synchronize_schema: my version: a37a3b1e-5251-3632-b6b4-a9468a279834
raft_group0_upgrade - synchronize_schema: schema mismatches: {}. 3 nodes had a matching version.
raft_group0_upgrade - synchronize_schema: finished.
raft_group0_upgrade - Schema synchronized.
raft_group0_upgrade - Raft upgrade finished.

In a functioning cluster with good network connectivity the procedure should take no more than a few seconds. Network issues may cause the procedure to take longer, but if all nodes are alive and the network is eventually functional (each pair of nodes is eventually connected), the procedure will eventually finish.

Note the following message, which appears in the log presented above:

Schema changes are disabled in synchronize state. If a failure makes us unable to proceed, manual recovery will be required.

During the procedure, there is a brief window while schema changes are disabled. This is when the schema change mechanism switches from the older unsafe algorithm to the safe Raft-based algorithm. If everything runs smoothly, this window will be unnoticeable; the procedure is designed to minimize that window’s length. However, if the procedure gets stuck e.g. due to network connectivity problem, ScyllaDB will return the following error when trying to perform a schema change during this window:

Cannot perform schema or topology changes during this time; the cluster is currently upgrading to use Raft for schema operations.
If this error keeps happening, check the logs of your nodes to learn the state of upgrade. The upgrade procedure may get stuck
if there was a node failure.

In the next example, one of the nodes had a power outage before the procedure could finish. The following shows a part of another node’s logs:

raft_group0_upgrade - Entering synchronize state.
raft_group0_upgrade - Schema changes are disabled in synchronize state. If a failure makes us unable to proceed, manual recovery will be required.
raft_group0_upgrade - Waiting for all peers to enter synchronize state...
raft_group0_upgrade - wait_for_peers_to_enter_synchronize_state: node 127.90.69.3 not in synchronize state yet...
raft_group0_upgrade - wait_for_peers_to_enter_synchronize_state: node 127.90.69.1 not in synchronize state yet...
raft_group0_upgrade - wait_for_peers_to_enter_synchronize_state: retrying in a while...
raft_group0_upgrade - wait_for_peers_to_enter_synchronize_state: node 127.90.69.1 not in synchronize state yet...
raft_group0_upgrade - wait_for_peers_to_enter_synchronize_state: retrying in a while...
...
raft_group0_upgrade - Raft upgrade procedure taking longer than expected. Please check if all nodes are live and the network is healthy. If the upgrade procedure does not progress even though the cluster is healthy, try performing a rolling restart of the cluster. If that doesn 't help or some nodes are dead and irrecoverable, manual recovery may be required. Consult the relevant documentation.
raft_group0_upgrade - wait_for_peers_to_enter_synchronize_state: node 127.90.69.1 not in synchronize state yet...
raft_group0_upgrade - wait_for_peers_to_enter_synchronize_state: retrying in a while...

Note the following message:

raft_group0_upgrade - Raft upgrade procedure taking longer than expected. Please check if all nodes are live and the network is healthy. If the upgrade procedure does not progress even though the cluster is healthy, try performing a rolling restart of the cluster. If that doesn 't help or some nodes are dead and irrecoverable, manual recovery may be required. Consult the relevant documentation.

If the Raft upgrade procedure is stuck, this message will appear periodically in each node’s logs.

The message suggests the initial course of action:

Check if all nodes are alive.
If a node is down but can be restarted, restart it.
If all nodes are alive, ensure that the network is healthy: that every node is reachable from every other node.
If all nodes are alive and the network is healthy, perform a rolling restart of the cluster.

One of the reasons why the procedure may get stuck is a pre-existing problem in schema definitions which causes schema to be unable to synchronize in the cluster. The procedure cannot proceed unless it ensures that schema is synchronized. If all nodes are alive and the network is healthy, you performed a rolling restart, but the issue still persists, contact ScyllaDB support for assistance.

If some nodes are dead and irrecoverable, you’ll need to perform a manual recovery procedure. Consult the section about Raft recovery.

Verifying that Raft is enabled¶

You can verify that Raft is enabled on your cluster by performing the following query on each node:

cqlsh> SELECT * FROM system.scylla_local WHERE key = 'group0_upgrade_state';

The query should return:

 key                  | value
----------------------+--------------------------
 group0_upgrade_state | use_post_raft_procedures

(1 rows)

on every node.

If the query returns 0 rows, or value is synchronize or use_pre_raft_procedures, it means that the cluster is in the middle of the Raft upgrade procedure; consult the relevant section.

If value is recovery, it means that the cluster is in the middle of the manual recovery procedure. The procedure must be finished. Consult the section about Raft recovery.

If value is anything else, it might mean data corruption or a mistake when performing the manual recovery procedure. The value will be treated as if it was equal to recovery when the node is restarted.

Safe Schema Changes with Raft¶

In ScyllaDB, schema is based on Data Definition Language (DDL). In earlier ScyllaDB versions, schema changes were tracked via the gossip protocol, which might lead to schema conflicts if the updates are happening concurrently.

Implementing Raft eliminates schema conflicts and allows full automation of DDL changes under any conditions, as long as a quorum of nodes in the cluster is available. The following examples illustrate how Raft provides the solution to problems with schema changes.

A network partition may lead to a split-brain case, where each subset of nodes has a different version of the schema.

With Raft, after a network split, the majority of the cluster can continue performing schema changes, while the minority needs to wait until it can rejoin the majority. Data manipulation statements on the minority can continue unaffected, provided the quorum requirement is satisfied.
Two or more conflicting schema updates are happening at the same time. For example, two different columns with the same definition are simultaneously added to the cluster. There is no effective way to resolve the conflict - the cluster will employ the schema with the most recent timestamp, but changes related to the shadowed table will be lost.

With Raft, concurrent schema changes are safe.

In summary, Raft makes schema changes safe, but it requires that a quorum of nodes in the cluster is available.

Handling Failures¶

Raft requires a quorum of nodes in a cluster to be available. If one or more nodes are down, but the quorum is live, reads, writes, and schema updates proceed unaffected. When the node that was down is up again, it first contacts the cluster to fetch the latest schema and then starts serving queries.

The following examples show the recovery actions depending on the number of nodes and DCs in your cluster.

Examples¶

Cluster A: 1 datacenter, 3 nodes¶
Failure	Consequence	Action to take
1 node	Schema updates are possible and safe.	Try restarting the node. If the node is dead, replace it with a new node.
2 nodes	Data is available for reads and writes, schema changes are impossible.	Restart at least 1 of the 2 nodes that are down to regain quorum. If you can’t recover at least 1 of the 2 nodes, consult the manual Raft recovery section.

Cluster B: 2 datacenters, 6 nodes (3 nodes per DC)¶
Failure	Consequence	Action to take
1-2 nodes	Schema updates are possible and safe.	Try restarting the node(s). If the node is dead, replace it with a new node.
3 nodes	Data is available for reads and writes, schema changes are impossible.	Restart 1 of the 3 nodes that are down to regain quorum. If you can’t recover at least 1 of the 3 failed nodes, consult the manual Raft recovery section.
1DC	Data is available for reads and writes, schema changes are impossible.	When the DCs come back online, restart the nodes. If the DC fails to come back online and the nodes are lost, consult the manual Raft recovery section.

Cluster C: 3 datacenter, 9 nodes (3 nodes per DC)¶
Failure	Consequence	Action to take
1-4 nodes	Schema updates are possible and safe.	Try restarting the nodes. If the nodes are dead, replace them with new nodes.
1 DC	Schema updates are possible and safe.	When the DC comes back online, try restarting the nodes in the cluster. If the nodes are dead, add 3 new nodes in a new region.
2 DCs	Data is available for reads and writes, schema changes are impossible.	When the DCs come back online, restart the nodes. If at least one DC fails to come back online and the nodes are lost, consult the manual Raft recovery section.

Raft manual recovery procedure¶

The manual Raft recovery procedure applies to the following situations:

The Raft upgrade procedure got stuck because one of your nodes failed in the middle of the procedure and is irrecoverable,
or the cluster was running Raft but a majority of nodes (e.g. 2 our of 3) failed and are irrecoverable. Raft cannot progress unless a majority of nodes is available.

Warning

Perform the manual recovery procedure only if you’re dealing with irrecoverable nodes. If it is possible to restart your nodes, do that instead of manual recovery.

Note

Before proceeding, make sure that the irrecoverable nodes are truly dead, and not, for example, temporarily partitioned away due to a network failure. If it is possible for the ‘dead’ nodes to come back to life, they might communicate and interfere with the recovery procedure and cause unpredictable problems.

If you have no means of ensuring that these irrecoverable nodes won’t come back to life and communicate with the rest of the cluster, setup firewall rules or otherwise isolate your alive nodes to reject any communication attempts from these dead nodes.

During the manual recovery procedure you’ll enter a special RECOVERY mode, remove all faulty nodes (using the standard node removal procedure), delete the internal Raft data, and restart the cluster. This will cause the cluster to perform the Raft upgrade procedure again, initializing the Raft algorithm from scratch. The manual recovery procedure is applicable both to clusters which were not running Raft in the past and then had Raft enabled, and to clusters which were bootstrapped using Raft.

Note

Entering RECOVERY mode requires a node restart. Restarting an additional node while some nodes are already dead may lead to unavailability of data queries (assuming that you haven’t lost it already). For example, if you’re using the standard RF=3, CL=QUORUM setup, and you’re recovering from a stuck of upgrade procedure because one of your nodes is dead, restarting another node will cause temporary data query unavailability (until the node finishes restarting). Prepare your service for downtime before proceeding.

Perform the following query on every alive node in the cluster, using e.g. cqlsh:

cqlsh> UPDATE system.scylla_local SET value = 'recovery' WHERE key = 'group0_upgrade_state';

Perform a rolling restart of your alive nodes.

Verify that all the nodes have entered RECOVERY mode when restarting; look for one of the following messages in their logs:

group0_client - RECOVERY mode.
raft_group0 - setup_group0: Raft RECOVERY mode, skipping group 0 setup.
raft_group0_upgrade - RECOVERY mode. Not attempting upgrade.

Remove all your dead nodes using the node removal procedure.

Remove existing Raft cluster data by performing the following queries on every alive node in the cluster, using e.g. cqlsh:

cqlsh> TRUNCATE TABLE system.discovery;
cqlsh> TRUNCATE TABLE system.group0_history;
cqlsh> DELETE value FROM system.scylla_local WHERE key = 'raft_group0_id';

Make sure that schema is synchronized in the cluster by executing nodetool describecluster on each node and verifying that the schema version is the same on all nodes.

We can now leave RECOVERY mode. On every alive node, perform the following query:

cqlsh> DELETE FROM system.scylla_local WHERE key = 'group0_upgrade_state';

Perform a rolling restart of your alive nodes.
The Raft upgrade procedure will start anew. Verify that it finishes successfully.

Learn More About Raft¶

The Raft Consensus Algorithm
Achieving NoSQL Database Consistency with Raft in ScyllaDB - A tech talk by Konstantin Osipov
Making Schema Changes Safe with Raft - A Scylla Summit talk by Konstantin Osipov (register for access)
The Future of Consensus in ScyllaDB 5.0 and Beyond - A Scylla Summit talk by Tomasz Grabiec (register for access)

Was this page helpful?