Vault cluster lost quorum recovery
With Vault's Integrated Storage, maintaining Raft quorum is a key consideration when configuring and operating your Vault environment. Quorum in a Raft cluster is lost permanently when there is no way to recover enough nodes to reach consensus and elect a leader. Without quorum, read and write operations cannot be performed within the cluster.
The cluster quorum is dynamically updated when additional servers join the cluster. Quorum is calculated with the formula (n+1)/2, where n is the number of servers in the cluster. For example, a 3-server cluster requires at least 2 operational servers for the cluster to function properly: (3+1)/2 = 2. Specifically, you need 2 servers active at all times to be able to perform read and write operations. (See the deployment table.)
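For reference, applying the formula to common cluster sizes gives the following quorum sizes and failure tolerances (failure tolerance is the number of servers that can fail while the cluster continues to operate):
Servers    Quorum    Failure tolerance
-------    ------    -----------------
1          1         0
3          2         1
5          3         2
7          4         3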
Note
There is an exception to this rule when the -non-voter option is used while joining the cluster; this option is only available in Vault Enterprise.
Scenario overview
When two of the three servers encounter an outage, the cluster loses quorum and becomes inoperable. Although one of the servers is still fully functional, the cluster cannot process read or write requests.
Example:
$ vault operator raft list-peers
No raft cluster configuration found
$ vault kv get kv/apikey
nil response from pre-flight request
In this tutorial, you will recover from the permanent loss of two of the three Vault servers by converting the remaining server into a single-server cluster.
The last server must be fully operational to successfully complete this procedure.
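Before you begin, confirm that the remaining server's process is up and responding. The vault status command reads the local seal status, so it still responds even while the cluster lacks quorum:
$ vault status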
Note
Sometimes quorum is lost because Autopilot marks nodes as unhealthy while the Vault service is still running on them. On the unhealthy node(s), you must stop the Vault service before running the peers.json procedure.
In a 5-node cluster, or when non-voter nodes are present, you must also stop the other healthy nodes before performing the peers.json recovery.
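For example, on a systemd-based host you can stop the Vault service on each affected node with the following command (the unit name vault is assumed; adjust it for your environment):
$ sudo systemctl stop vault    # run on each node that must be stopped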
Locate the storage directory
On the healthy Vault server, locate the Raft storage directory. To discover the
location of the directory, review your Vault configuration file. The storage
stanza will contain the path
to the directory.
Example:
vault-config.hcl
storage "raft" {
path = "/vault/data"
node_id = "vault_1"
}
listener "tcp" {
address = "0.0.0.0:8200"
cluster_address = "0.0.0.0:8201"
tls_disable = true
}
api_addr = "http://192.0.2.1:8200"
cluster_addr = "http://10.0.101.22:8201"
disable_mlock = true
ui=true
In this example, the path is the file system path where all the Vault data is stored, and the node_id is the identifier for the server in the Raft cluster. The example node_id is vault_1.
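If you are not sure where your configuration file lives, you can locate the storage stanza with a quick search. The path /etc/vault.d/vault-config.hcl below is an assumption; substitute the path used in your environment:
# hypothetical config path; adjust for your environment
$ grep -A 3 'storage "raft"' /etc/vault.d/vault-config.hcl
storage "raft" {
  path    = "/vault/data"
  node_id = "vault_1"
}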
Create the peers.json file
Inside the storage directory (/vault/data), there is a folder named raft.
vault
└── data
├── raft
│ ├── raft.db
│ └── snapshots
└── vault.db
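Before making any changes under the storage directory, consider taking a copy of the raft folder as a rollback point; ideally do this while the Vault service is stopped so the copy is consistent. The .bak destination below is just an example name:
$ sudo cp -r /vault/data/raft /vault/data/raft.bak    # example backup location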
To enable the single remaining Vault server to reach quorum and elect itself as the leader, create a raft/peers.json file that contains the server information. The file should be formatted as a JSON array containing the node ID, address:port, and suffrage information of the healthy Vault server (e.g., vault_1).
Example:
$ cat > /vault/data/raft/peers.json << EOF
[
  {
    "id": "vault_1",
    "address": "10.0.101.22:8201",
    "non_voter": false
  }
]
EOF
id (string: <required>) - Specifies the node ID of the server.
address (string: <required>) - Specifies the host and port of the server. The port is the server's cluster port.
non_voter (bool: <false>) - This controls whether the server is a non-voter.
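Optionally, verify that the file you created is valid JSON before restarting Vault. This example uses jq, which is assumed to be installed:
$ jq . /vault/data/raft/peers.json    # pretty-prints the array if the JSON is valid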
Restart Vault
Restart the Vault process to enable Vault to load the new peers.json file.
$ sudo systemctl restart vault
Note
If you use systemd, a SIGHUP signal will not work; the peers.json file is only read at startup, so you must fully restart the Vault process.
Verify success
The recovery procedure is successful when Vault starts up and displays these messages in the system logs.
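On a systemd-based host, you can follow the logs with journalctl while Vault starts (the unit name vault is assumed):
$ sudo journalctl -u vault -f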
...snip...
[INFO] core.cluster-listener: serving cluster requests: cluster_listen_address=[::]:8201
[INFO] storage.raft: raft recovery initiated: recovery_file=peers.json
[INFO] storage.raft: raft recovery found new config: config="{[{Voter vault_1 https://10.0.101.22:8201}]}"
[INFO] storage.raft: raft recovery deleted peers.json
...snip...
Unseal Vault
If Vault is not configured to use auto-unseal, unseal Vault and then check its status.
Example:
$ vault operator unseal
Unseal Key (will be hidden):
$ vault status
Key Value
--- -----
Recovery Seal Type shamir
Initialized true
Sealed false
Total Recovery Shares 1
Threshold 1
Version 1.7.3
Storage Type raft
Cluster Name vault-cluster-4a1a40af
Cluster ID d09df2c7-1d3e-f7d0-a9f7-93fadcc29110
HA Enabled true
HA Cluster https://10.0.101.22:8201
HA Mode active
Active Since 2021-07-20T00:07:32.215236307Z
Raft Committed Index 299
Raft Applied Index 299
View the peer list
You now have a cluster with one server that can reach quorum. Verify that there is only one server in the cluster with the vault operator raft list-peers command.
$ vault operator raft list-peers
Node Address State Voter
---- ------- ----- -----
vault_1 https://10.0.101.22:8201 leader true
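As a final check, the read operation that failed at the start of this scenario should now succeed, assuming the kv/apikey secret written before the outage still exists and your token is permitted to read it:
$ vault kv get kv/apikey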
Next steps
In this tutorial, you recovered from the loss of quorum by converting a 3-server cluster into a single-server cluster using a peers.json file. The peers.json file enabled you to manually overwrite the Raft peer list with the one remaining server, which allowed that server to reach quorum and complete a leader election.
If the failed servers are recoverable, the best option is to bring them back online and have them reconnect to the cluster using the same host addresses. This will return the cluster to a fully healthy state. In such an event, the raft/peers.json file should contain the node ID, address:port, and suffrage information of each Vault server you wish to be in the cluster.
[
  {
    "id": "node1",
    "address": "node1.vault.local:8201",
    "non_voter": false
  },
  {
    "id": "node2",
    "address": "node2.vault.local:8201",
    "non_voter": false
  },
  {
    "id": "node3",
    "address": "node3.vault.local:8201",
    "non_voter": false
  }
]
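In a multi-server recovery like this, the same raft/peers.json file is typically placed in the Raft data directory of every remaining server before the servers are restarted. For example, assuming SSH access and the same /vault/data path on each host:
$ scp /vault/data/raft/peers.json node2.vault.local:/vault/data/raft/
$ scp /vault/data/raft/peers.json node3.vault.local:/vault/data/raft/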
See the Outage Recovery documentation for more detail.