Node Loss Recovery¶
How to recover when a node is permanently lost (hardware failure, rebuild, etc.) and you've been left with orphaned PVCs that block stateful apps.
Why this happens¶
The cluster uses OpenEBS hostpath as the default StorageClass for stateful workloads. Hostpath gives fast local-disk performance with minimal operational overhead, but each PV is hard-affined to the node that provisioned it:
nodeAffinity:
required:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/hostname
operator: In
values: ["node-01-pve"]
When a node is permanently lost (rebuild, hardware death, replacement), every PV pinned to it becomes inaccessible. Pods that mount those PVCs stay in Init or Pending forever with messages like:
MountVolume.NewMounter initialization failed for volume "pvc-xxx" :
path "/var/mnt/local-hostpath/pvc-xxx" does not exist
This is by design — local storage trades durability for speed. The recovery model is:
| Workload type | Recovery source |
|---|---|
| HA databases (CNPG, MariaDB Galera) | Replicate from surviving instances |
| Single-instance apps with VolSync (forgejo, teslamate, etc.) | Restore from VolSync (Kopia repo on NFS) |
| Apps with their own S3 backup cronjob (Kanidm) | Restore from S3 (Garage) via app-specific restore command |
| Garage itself (the S3 service backing everything above) | Restore meta PVC from the NFS sidecar mirror (see Garage metadata recovery) |
| CNPG-backed stateless apps (kguardian, teslamate, penpot) | Heal automatically once Postgres is up |
| Stale/removed apps | Just delete the orphan resources |
Note on app-specific backups: Some apps run their own scheduled backup cronjobs that ship to Garage S3 (e.g., kanidm-backup-sync does aws s3 sync /data/backups/ s3://kanidm/backups/ hourly). These are easy to miss because they aren't VolSync — check kubectl get cronjob -A | grep -i backup when triaging.
We don't use distributed storage (Longhorn / Rook-Ceph) because nodes have limited disk and we'd rather keep the hot path local and ship durability off-node via WAL streaming (CNPG → Garage S3) or scheduled snapshots (VolSync → Kopia → NFS).
Triage checklist¶
When you spot a node-loss situation, work through this in order:
# Which workloads are wedged on missing volumes?
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
# Which Flux resources are unhealthy?
flux get ks -A --status-selector ready=false
flux get hr -A --status-selector ready=false
# Which PVs are pinned to the dead node?
kubectl get pv -o json | jq -r '
.items[] |
select(.spec.nodeAffinity.required.nodeSelectorTerms[]?.matchExpressions[]?.values[]? == "<dead-node>") |
"\(.spec.claimRef.namespace)/\(.spec.claimRef.name) \(.metadata.name)"'
Categorize each affected workload:
- CNPG / MariaDB instance → see HA database
- VolSync-backed app → see Single-instance with VolSync
- Stateless app stuck on a downstream DB init → leave alone, will self-heal once the DB is back
- Removed/stale resource → just delete
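Before pointing the jq selector from the triage block at a live cluster, you can sanity-check it against a fabricated PV list (the PV names and claims below are made up; node-01-pve stands in for the dead node):

```shell
# Fabricated stand-in for `kubectl get pv -o json`; only the fields the
# selector touches are included. pvc-aaa is pinned to the dead node.
cat > /tmp/pv-sample.json <<'EOF'
{"items":[
  {"metadata":{"name":"pvc-aaa"},
   "spec":{"claimRef":{"namespace":"utils","name":"forgejo"},
           "nodeAffinity":{"required":{"nodeSelectorTerms":[
             {"matchExpressions":[{"key":"kubernetes.io/hostname",
               "operator":"In","values":["node-01-pve"]}]}]}}}},
  {"metadata":{"name":"pvc-bbb"},
   "spec":{"claimRef":{"namespace":"database","name":"storage-mariadb-0"},
           "nodeAffinity":{"required":{"nodeSelectorTerms":[
             {"matchExpressions":[{"key":"kubernetes.io/hostname",
               "operator":"In","values":["node-02-pve"]}]}]}}}}
]}
EOF
jq -r '
  .items[]
  | select(.spec.nodeAffinity.required.nodeSelectorTerms[]?.matchExpressions[]?.values[]? == "node-01-pve")
  | "\(.spec.claimRef.namespace)/\(.spec.claimRef.name) \(.metadata.name)"' /tmp/pv-sample.json
# → utils/forgejo pvc-aaa
```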
HA database recovery¶
The data layer is already replicated at the app level. Recovery is "delete the orphaned PVC and let the operator re-init a fresh replica on a live node."
MariaDB Galera (the easy case)¶
If only one instance was on the dead node and the surviving members still have quorum (or can reach quorum), the operator handles everything once the orphan is removed.
# Delete the orphaned data + galera state PVCs for the dead instance
kubectl delete pvc -n database storage-mariadb-<N> galera-mariadb-<N>
# Force-delete the stuck pod
kubectl delete pod -n database mariadb-<N> --force --grace-period=0
# Watch the operator re-provision and the cluster come back
kubectl get pod -n database -l app.kubernetes.io/instance=mariadb -w
kubectl get mariadb -n database
The operator (with spec.galera.recovery.enabled: true) detects the missing instance, provisions fresh PVCs on a live node, and Galera SST-syncs the new replica from a healthy donor.
If the cluster has lost quorum (all surviving instances crashlooping with failed to reach primary view), you need force-bootstrap:
# Pick the survivor with the highest grastate seqno (or just the original primary)
kubectl patch mariadb -n database mariadb --type=merge -p '
spec:
galera:
recovery:
forceClusterBootstrapInPod: mariadb-0
'
This is a temporary patch the operator clears after bootstrap. If Flux is reconciling the CR, suspend the ks first or it'll get reverted before the bootstrap completes.
If force-bootstrap fails (corrupted grastate, etc.), fall back to S3 restore using the scheduled backup.k8s.mariadb.com resource pointing at Garage.
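The "pick the survivor with the highest grastate seqno" step can be rehearsed offline. The helper below is a sketch; the grastate.dat path (/var/lib/mysql/grastate.dat) and the mariadb container name in the commented collection loop are assumptions about this cluster's pods:

```shell
# Given "pod seqno" pairs on stdin, print the pod with the highest seqno.
# (seqno -1 means "unclean shutdown, position unknown" and sorts last-choice.)
pick_bootstrap() {
  sort -k2,2n | tail -1 | awk '{print $1}'
}

# In-cluster you'd collect the pairs roughly like this (path and container
# name are assumptions — adjust to your StatefulSet):
#   for p in mariadb-0 mariadb-1 mariadb-2; do
#     echo "$p $(kubectl exec -n database "$p" -c mariadb -- \
#       awk '/^seqno:/{print $2}' /var/lib/mysql/grastate.dat)"
#   done | pick_bootstrap

printf 'mariadb-0 12\nmariadb-2 -1\nmariadb-1 37\n' | pick_bootstrap
# → mariadb-1
```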
CloudNative-PG (the trickier case)¶
CNPG won't auto-clean dangling PVCs (by design — protects against accidental data loss). The danger is that the lost instance was the primary, in which case you need a switchover before deleting the PVC, otherwise the operator gets stuck trying to keep a dead instance as primary.
Step 1 — check who's primary:
kubectl get cluster -n database postgres -o jsonpath='
currentPrimary={.status.currentPrimary}
targetPrimary={.status.targetPrimary}
phase={.status.phase}
danglingPVC={.status.danglingPVC}
readyInstances={.status.readyInstances}/{.status.instances}
'
If currentPrimary is the dead-node instance, you need to force a switchover before the PVC delete.
Step 2 — switchover (if needed):
If you have the cnpg kubectl plugin:
kubectl cnpg promote postgres postgres-3 -n database
Without the plugin, patch the cluster status directly:
kubectl patch cluster postgres -n database --subresource=status --type=merge -p '
{"status":{"targetPrimary":"postgres-3","currentPrimary":"postgres-3"}}
'
Step 3 — delete the orphaned PVC and stuck pod:
kubectl delete pod -n database postgres-<dead-N> --force --grace-period=0
kubectl delete pvc -n database postgres-<dead-N>
The PV has Delete reclaim policy, so it goes too. The operator now sees instances: 3 but only 2 PVCs, and creates a fresh postgres-<next> via pg_basebackup from the new primary.
Step 4 — kick the new primary if it's stuck in "Waiting for the new primary to be available":
If the new primary was previously a replica, its on-disk state may still say "I'm a replica of the old primary," and the instance manager waits for someone to promote it. Restarting the pod makes it re-read the cluster's currentPrimary and triggers the local promotion:
kubectl delete pod -n database postgres-<new-primary>
Verify with:
kubectl exec -n database postgres-<new-primary> -c postgres -- \
psql -U postgres -c 'SELECT pg_is_in_recovery();' # should return f
kubectl get endpoints -n database postgres-rw # should have an IP
Step 5 — dangling PVCs:
If other instance PVCs were also dangling (e.g., from a previous failed reconcile), the operator re-adopts them on its next loop. You can confirm by re-reading .status.danglingPVC — it should drain to empty as instance pods come up.
Single-instance app recovery (VolSync)¶
For apps with no app-layer replication, VolSync's Kopia repo on NFS is the recovery point. Use the existing scripts/volsync-restore-all.sh for mass restores, or this single-app flow for one app at a time.
Flow¶
APP=forgejo # PVC name = RD name prefix
NS=utils # target namespace
KS_NS=utils # Flux Kustomization namespace (often same as target)
# 1. Suspend Flux so it doesn't fight us
flux suspend ks -n "${KS_NS}" "${APP}"
# 2. Scale down so nothing holds the PVC
kubectl -n "${NS}" scale deployment "${APP}" --replicas=0
kubectl -n "${NS}" wait pod -l app.kubernetes.io/name="${APP}" --for=delete --timeout=60s
# 3. Suspend the ReplicationSource (so it doesn't try to back up the empty PVC)
kubectl -n "${NS}" patch replicationsource "${APP}" --type=merge -p '{"spec":{"paused":true}}'
# 4. Delete the orphaned PVC + VolSync source cache
kubectl -n "${NS}" delete pvc "${APP}" "volsync-src-${APP}-cache" --wait=false
# 5. If the PVC is stuck Terminating (its finalizer is waiting on a PV that
#    no longer exists), force-clear the finalizers:
kubectl -n "${NS}" patch pvc "${APP}" --type=merge -p '{"metadata":{"finalizers":null}}'
# 6. Recreate the destination PVC with the same spec but no node affinity.
# Pull size/storageClass from the original Helm values, or use what's typical.
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: ${APP}
namespace: ${NS}
spec:
accessModes: [ReadWriteOnce]
resources: { requests: { storage: 10Gi } }
storageClassName: openebs-hostpath
EOF
# 7. Trigger restore from latest snapshot
TRIGGER="restore-$(date -u +%Y%m%d-%H%M%S)"
kubectl -n "${NS}" patch replicationdestination "${APP}-dst" --type=json -p "[
{\"op\":\"remove\",\"path\":\"/spec/kopia/restoreAsOf\"},
{\"op\":\"replace\",\"path\":\"/spec/trigger/manual\",\"value\":\"${TRIGGER}\"}
]"
# 8. Wait for the restore to complete
until [[ "$(kubectl -n "${NS}" get replicationdestination "${APP}-dst" \
-o jsonpath='{.status.lastManualSync}')" == "${TRIGGER}" ]]; do sleep 10; done
kubectl -n "${NS}" get replicationdestination "${APP}-dst" \
-o jsonpath='result={.status.latestMoverStatus.result}{"\n"}'
# 9. Resume the source backup and Flux
kubectl -n "${NS}" patch replicationsource "${APP}" --type=merge -p '{"spec":{"paused":false}}'
flux resume ks -n "${KS_NS}" "${APP}"
# 10. Helm/Flux often won't reset replicas back from 0 on a no-op reconcile.
# Scale the workload up manually:
kubectl -n "${NS}" scale deployment "${APP}" --replicas=1
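The bare until loop in step 8 spins forever if the restore never completes. A timeout-bounded variant is sketched below — shown with a stubbed kubectl so the logic runs offline; in-cluster, drop the stub and put the `-n "${NS}"` / `"${APP}-dst"` arguments back:

```shell
# Wait until the ReplicationDestination reports our manual trigger as the
# last completed sync, or give up after a deadline.
wait_for_sync() {   # usage: wait_for_sync <trigger> <timeout-seconds>
  trigger=$1
  deadline=$(( $(date +%s) + $2 ))
  while [ "$(kubectl get replicationdestination -o jsonpath='{.status.lastManualSync}')" != "$trigger" ]; do
    if [ "$(date +%s)" -ge "$deadline" ]; then echo "timed out"; return 1; fi
    sleep 1
  done
  echo "synced"
}

# Stub standing in for the real kubectl: pretend the restore already landed.
kubectl() { echo "restore-20250101-000000"; }
wait_for_sync restore-20250101-000000 5
# → synced
```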
Why this differs from the mass-restore script¶
volsync-restore-all.sh uses VolSync's copyMethod: Direct to restore into the existing PVC. That works when the PVC is on a live node and you just want to roll back data — but in our case the PVC is bound to a PV on a dead node, so we have to delete and recreate it before the mover can mount it on a live node.
The destination Kopia mover provisions the cache PVC on first use. Watch for Pending PVCs — if the OpenEBS provisioner is itself wedged (e.g., its pod was on the dead node and is still pulling its image after rescheduling), the whole chain stalls. kubectl get pod -n openebs-system is the first thing to check if mover pods stay Pending with no events.
App-specific S3 restore (Kanidm pattern)¶
Some apps (notably Kanidm) ship JSON-format backups to a Garage S3 bucket via a CronJob, separate from VolSync. The data PVC has no off-node copy on its own — the backup files in S3 are the only recovery point.
Pattern, using Kanidm as the example:
NS=identity
APP=kanidm
PVC=kanidm-data-kanidm-default-0 # StatefulSet PVC name
S3_BUCKET=kanidm # Garage bucket
TLS_SECRET=kanidm-tls # if kanidmd needs TLS for restore
# 1. Suspend Flux so kaniop doesn't fight us
flux suspend ks -n "${NS}" "${APP}"
# 2. Scale the StatefulSet to 0 (orphan pod was Pending on dead node anyway)
kubectl -n "${NS}" scale statefulset kanidm-default --replicas=0
kubectl -n "${NS}" wait pod kanidm-default-0 --for=delete --timeout=60s || true
# 3. Delete the orphaned PVC; if Terminating-stuck, clear finalizer
kubectl -n "${NS}" delete pvc "${PVC}" --wait=false
kubectl -n "${NS}" patch pvc "${PVC}" --type=merge -p '{"metadata":{"finalizers":null}}'
For Kanidm, the operator (kaniop) re-creates the PVC immediately from the volumeClaimTemplate even with the StatefulSet at 0 replicas — there's no need to recreate it manually. Verify with kubectl get pvc -n identity.
# 4. Run a Job that downloads + decompresses + restores into the new empty PVC
# Notes:
# - The kanidm/server image has no shell (sh/gunzip) — use a busybox/alpine
# sidecar to decompress before invoking kanidmd directly.
# - The kanidm-s3-secret credentials work against Garage with
# `--region us-east-1 --endpoint-url http://garage.volsync-system.svc.cluster.local:3900`.
# Use `s3api get-object` rather than `s3 cp` — Garage 400s on the HEAD
# request `s3 cp` issues first.
# - `kanidmd database restore` requires KANIDM_TLS_CHAIN + KANIDM_TLS_KEY env
# even though it never starts the network listener.
kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata: { name: kanidm-restore, namespace: identity }
spec:
ttlSecondsAfterFinished: 600
template:
spec:
restartPolicy: Never
securityContext: { fsGroup: 999, runAsUser: 999, runAsGroup: 999 }
initContainers:
- name: download
image: amazon/aws-cli:latest
command: ["sh","-c"]
args:
- |
set -eux
LATEST=$(aws --region us-east-1 --endpoint-url "$S3_ENDPOINT" s3 ls s3://kanidm/backups/ | awk '{print $NF}' | sort | tail -1)
aws --region us-east-1 --endpoint-url "$S3_ENDPOINT" s3api get-object \
--bucket kanidm --key "backups/${LATEST}" /tmp/backup.json.gz
env:
- { name: S3_ENDPOINT, value: http://garage.volsync-system.svc.cluster.local:3900 }
- { name: AWS_DEFAULT_REGION, value: us-east-1 }
- { name: AWS_ACCESS_KEY_ID, valueFrom: { secretKeyRef: { name: kanidm-s3-secret, key: AWS_ACCESS_KEY_ID } } }
- { name: AWS_SECRET_ACCESS_KEY, valueFrom: { secretKeyRef: { name: kanidm-s3-secret, key: AWS_SECRET_ACCESS_KEY } } }
volumeMounts: [ { name: tmp, mountPath: /tmp } ]
- name: decompress
image: alpine:3
command: ["sh","-c"]
args: ["set -eux; gunzip /tmp/backup.json.gz"]
volumeMounts: [ { name: tmp, mountPath: /tmp } ]
containers:
- name: restore
image: docker.io/kanidm/server:1.10.0
command: ["kanidmd"]
args: ["database", "restore", "/tmp/backup.json"]
env:
- { name: KANIDM_DOMAIN, value: auth.00o.sh }
- { name: KANIDM_ORIGIN, value: https://auth.00o.sh }
- { name: KANIDM_DB_PATH, value: /data/kanidm.db }
- { name: KANIDM_TLS_CHAIN, value: /etc/kanidm/tls/tls.crt }
- { name: KANIDM_TLS_KEY, value: /etc/kanidm/tls/tls.key }
volumeMounts:
- { name: kanidm-data, mountPath: /data }
- { name: tmp, mountPath: /tmp }
- { name: kanidm-certs, mountPath: /etc/kanidm/tls, readOnly: true }
volumes:
- { name: kanidm-data, persistentVolumeClaim: { claimName: kanidm-data-kanidm-default-0 } }
- { name: tmp, emptyDir: {} }
- { name: kanidm-certs, secret: { secretName: kanidm-tls } }
EOF
# 5. Wait for restore Job and inspect logs
kubectl -n identity wait --for=condition=Complete job/kanidm-restore --timeout=10m
kubectl -n identity logs job/kanidm-restore -c restore | tail -5
# Expected: "✅ Restore Success!"
# 6. Scale StatefulSet back up and resume Flux
kubectl -n identity scale statefulset kanidm-default --replicas=1
flux resume ks -n identity kanidm
After Kanidm is healthy, bounce any OIDC clients (forgejo, dbgate, kubevirt-manager, opencost, penpot) — they often cache the previous discovery doc and need a restart to re-register against the restored identity store.
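The `LATEST=$(... s3 ls ... | sort | tail -1)` pick in the restore Job only works because the backup keys embed a lexicographically sortable timestamp. A quick offline check of that assumption (the filenames here are illustrative, not the cronjob's actual naming):

```shell
# Lexicographic sort equals chronological sort only for zero-padded,
# big-endian timestamps — which these hypothetical keys are.
printf '%s\n' \
  backup-2026-02-28T23-00-00.json.gz \
  backup-2026-03-21T04-00-00.json.gz \
  backup-2026-03-01T00-00-00.json.gz | sort | tail -1
# → backup-2026-03-21T04-00-00.json.gz
```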
Credential mismatch after a Kanidm restore (gotcha)¶
After the restore Job completes and Kanidm starts up, you'll likely see kaniop logs spamming client failed to authenticate: AuthenticationFailed for every reconcile (KanidmGroup, KanidmPersonAccount, KanidmOAuth2Client). This blocks OAuth2 client registration, which in turn blocks OIDC consumers like forgejo (its configure-gitea init container hits a 302 on the well-known discovery URL because the client isn't registered).
The cause: kaniop authenticates as admin / idm_admin using passwords stored in the kanidm-admin-passwords Secret. The restore may bring in a DB whose internal credential state for those accounts doesn't match what's in the Secret (this is a known kaniop+restore interaction). The fix is to reset the passwords on the restored DB to match what kaniop expects — or, easier, reset them to fresh values and update the Secret.
# 1. Stop kanidmd so we can use the offline recover-account command
flux suspend ks -n identity kanidm
kubectl -n identity scale statefulset kanidm-default --replicas=0
kubectl -n identity wait pod kanidm-default-0 --for=delete --timeout=60s
# 2. Run recover-account for admin (generates a new password, prints to stdout)
# Note: kanidm/server image has no shell — invoke kanidmd directly via command/args.
kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata: { name: kanidm-recover-admin, namespace: identity }
spec:
ttlSecondsAfterFinished: 600
template:
spec:
restartPolicy: Never
securityContext: { fsGroup: 999, runAsUser: 999, runAsGroup: 999 }
containers:
- name: recover
image: docker.io/kanidm/server:1.10.0
command: ["kanidmd"]
args: ["recover-account", "admin"]
env:
- { name: KANIDM_DOMAIN, value: auth.00o.sh }
- { name: KANIDM_ORIGIN, value: https://auth.00o.sh }
- { name: KANIDM_DB_PATH, value: /data/kanidm.db }
- { name: KANIDM_TLS_CHAIN, value: /etc/kanidm/tls/tls.crt }
- { name: KANIDM_TLS_KEY, value: /etc/kanidm/tls/tls.key }
volumeMounts:
- { name: kanidm-data, mountPath: /data }
- { name: kanidm-certs, mountPath: /etc/kanidm/tls, readOnly: true }
volumes:
- { name: kanidm-data, persistentVolumeClaim: { claimName: kanidm-data-kanidm-default-0 } }
- { name: kanidm-certs, secret: { secretName: kanidm-tls } }
EOF
kubectl -n identity wait --for=condition=Complete job/kanidm-recover-admin --timeout=2m
ADMIN_PW=$(kubectl -n identity logs job/kanidm-recover-admin | grep -oP 'new_password: "\K[^"]+')
echo "admin: $ADMIN_PW"
# 3. Same for idm_admin (don't run in parallel — Kanidm DB is RWO)
# Apply identical Job with name=kanidm-recover-idm-admin, args=["recover-account","idm_admin"].
IDM_PW=$(kubectl -n identity logs job/kanidm-recover-idm-admin | grep -oP 'new_password: "\K[^"]+')
# 4. Patch the Secret with the new passwords
kubectl patch secret -n identity kanidm-admin-passwords --type=json -p "[
{\"op\":\"replace\",\"path\":\"/data/ADMIN_PASSWORD\",\"value\":\"$(printf '%s' "$ADMIN_PW" | base64 -w0)\"},
{\"op\":\"replace\",\"path\":\"/data/IDM_ADMIN_PASSWORD\",\"value\":\"$(printf '%s' "$IDM_PW" | base64 -w0)\"}
]"
# 5. Scale Kanidm back up and resume Flux
kubectl -n identity scale statefulset kanidm-default --replicas=1
flux resume ks -n identity kanidm
Within ~30s of the Secret patch, kaniop should pick up the change (it watches the Secret), authenticate, and start reconciling all KanidmOAuth2Client and KanidmPersonAccount resources again. Watch the kaniop controller logs; once you see reconciling oauth2 client INFO lines (not ERROR lines), bounce the OIDC consumers.
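The `grep -oP 'new_password: "\K[^"]+'` extraction in steps 2–3 relies on GNU grep's PCRE mode; here's a quick check against a sample line (the log shape is illustrative, not verbatim kanidmd output):

```shell
# \K drops everything matched so far, so only the password itself prints.
echo 'INFO success - new_password: "s3cr3t-Example"' \
  | grep -oP 'new_password: "\K[^"]+'
# → s3cr3t-Example
```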
Stale / removed resources¶
Apps that have been removed from Flux (commented out in a parent kustomization) sometimes leave behind cluster resources that aren't pruned. ReplicationDestinations are a common one — they keep spawning mover pods that fail to schedule because the source PVC no longer exists.
Just delete them:
kubectl delete replicationdestination -n <ns> <app>-dst
The bound volsync-dst-<app>-dst-cache PVC will be garbage-collected once its owner is gone.
Garage metadata recovery¶
Garage runs single-replica (no HA) and its meta dir cannot live on NFS — LMDB throws Resource temporarily unavailable on every background worker over NFS because mmap+file-lock semantics don't work there. So meta lives on a local openebs-hostpath PVC (garage-meta), which means if the node hosting it dies, the PVC dies with it. Data lives on NFS and is safe, but data shards are opaque without meta — losing meta = losing all S3 objects (CNPG WAL/base backups, MariaDB S3 backups, Kanidm hourly backups).
The protection: a backup-sync sidecar inside the Garage pod (same controller) mirrors /meta/ (minus the live db.lmdb/ dir) to NFS at /mnt/Speed/Kubernetes/apps/garage/meta-backup/ every 24h. It copies:
- All config files (cluster_layout, data_layout, node_key, node_key.pub, peer_list, lifecycle_worker_state, scrub_info)
- The snapshots/ directory — Garage's own internal LMDB snapshots, created via mdb_env_copy() (consistent, no locking issues)
It does not copy the live db.lmdb/ dir because Garage holds an exclusive flock on lock.mdb. The snapshot files inside snapshots/ are the consistent recovery point.
Why not VolSync for this PVC? VolSync's copyMethod: Direct mounts the source PVC into a mover pod concurrently with Garage. On RWO openebs-hostpath, this races and corrupts the metadata (we lost cluster_layout + snapshots/ doing this once). copyMethod: Snapshot isn't an option either — openebs-hostpath doesn't support CSI VolumeSnapshots. So the in-pod sidecar approach sidesteps the whole RWO problem by never letting another pod mount the PVC.
Alerting: the sidecar set -e's and exits on any failure → Kubernetes restarts it → kube_pod_container_status_restarts_total increments → the GarageMetaBackupSidecarRestarted PrometheusRule fires (warning, 5min for) → AlertManager → Discord. There's also a GarageMetaBackupSidecarAbsent (critical) for when the container isn't running at all.
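A minimal sketch of what the restart-alert rule can look like — the expr, container name (backup-sync), and labels here are assumptions for illustration, not the deployed rule:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: garage-meta-backup
  namespace: volsync-system
spec:
  groups:
    - name: garage-meta-backup
      rules:
        - alert: GarageMetaBackupSidecarRestarted
          # container name "backup-sync" is an assumption — match yours.
          expr: >-
            increase(kube_pod_container_status_restarts_total{namespace="volsync-system",container="backup-sync"}[1h]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Garage meta backup sidecar restarted — last mirror run may have failed
```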
Recovery procedure¶
If garage-meta PVC is unusable (lost node, corrupted LMDB, etc.):
# 1. Stop Garage so nothing writes during restore
kubectl -n volsync-system scale deploy garage --replicas=0
kubectl -n volsync-system wait pod -l app.kubernetes.io/name=garage --for=delete --timeout=60s
# 2. Run the restore Job (it picks the latest snapshot automatically)
# The same logic is in scripts/volsync-restore-all.sh phase_restore_garage_meta —
# that script's main() now calls it as Phase 6 after all VolSync restores.
kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata: { name: garage-meta-restore, namespace: volsync-system }
spec:
ttlSecondsAfterFinished: 600
backoffLimit: 0
template:
spec:
restartPolicy: Never
securityContext: { fsGroup: 10000, runAsUser: 10000, runAsGroup: 10000 }
containers:
- name: restore
image: alpine:3
command: ["sh","-c"]
args:
- |
set -eux
LATEST=$(ls -1 /backup/snapshots | sort | tail -1)
[ -z "$LATEST" ] && { echo "FATAL: no snapshots found"; exit 1; }
echo "Restoring from snapshot: $LATEST"
find /dst -mindepth 1 -delete
for f in cluster_layout data_layout node_key node_key.pub peer_list lifecycle_worker_state scrub_info; do
[ -e "/backup/$f" ] && cp -v "/backup/$f" "/dst/$f"
done
chmod 600 /dst/node_key
mkdir -p /dst/snapshots /dst/db.lmdb
cp -rv /backup/snapshots/. /dst/snapshots/
cp -v "/backup/snapshots/${LATEST}/db.lmdb" /dst/db.lmdb/data.mdb
volumeMounts:
- { name: backup, mountPath: /backup }
- { name: dst, mountPath: /dst }
volumes:
- name: backup
nfs: { server: nas.3226texas.com, path: /mnt/Speed/Kubernetes/apps/garage/meta-backup }
- name: dst
persistentVolumeClaim: { claimName: garage-meta }
EOF
kubectl -n volsync-system wait --for=condition=Complete job/garage-meta-restore --timeout=10m
# 3. Bring Garage back
kubectl -n volsync-system scale deploy garage --replicas=1
# 4. Verify it sees the original buckets/keys
GP=$(kubectl get pod -n volsync-system -l app.kubernetes.io/name=garage -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n volsync-system "$GP" -c app -- /garage bucket list # should show volsync-data, cnpg, kanidm, mariadb-backups, plane-uploads
kubectl exec -n volsync-system "$GP" -c app -- /garage key list # should show the GK... keys
If the restore Job picks a snapshot you don't want (e.g., latest is also corrupt), override by setting LATEST=<timestamp> explicitly in the script. List options first with ls /mnt/Speed/Kubernetes/apps/garage/meta-backup/snapshots/ from any pod that mounts NFS.
Note: this is the same restore pattern as a fresh-PVC seed. During the original incident we used this same flow to seed the brand-new garage-meta PVC from the previously-NFS-backed meta/ directory.
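The restore Job's copy logic can be rehearsed against throwaway directories before pointing it at the real NFS mount and PVC — a dry run under mktemp with a fabricated snapshot layout:

```shell
# Build a fake backup tree: two snapshot dirs, the newest holding db.lmdb.
BK=$(mktemp -d); DST=$(mktemp -d)
mkdir -p "$BK/snapshots/2026-03-20T00-00-00" "$BK/snapshots/2026-03-21T00-00-00"
echo layout > "$BK/cluster_layout"
echo key    > "$BK/node_key"
echo lmdb   > "$BK/snapshots/2026-03-21T00-00-00/db.lmdb"

# Same steps as the Job: pick latest snapshot, wipe dst, copy config files
# that exist, mirror snapshots/, seed db.lmdb/data.mdb from the chosen one.
LATEST=$(ls -1 "$BK/snapshots" | sort | tail -1)
find "$DST" -mindepth 1 -delete
for f in cluster_layout data_layout node_key node_key.pub peer_list; do
  if [ -e "$BK/$f" ]; then cp "$BK/$f" "$DST/$f"; fi
done
chmod 600 "$DST/node_key"
mkdir -p "$DST/snapshots" "$DST/db.lmdb"
cp -r "$BK/snapshots/." "$DST/snapshots/"
cp "$BK/snapshots/${LATEST}/db.lmdb" "$DST/db.lmdb/data.mdb"
echo "picked: $LATEST"
# → picked: 2026-03-21T00-00-00
```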
Recovering broken CNPG WAL archiving¶
WAL archiving can silently break for weeks if you're not monitoring Cluster.status.conditions[?(@.type=="ContinuousArchiving")]. The most common breakage after a recovery: forced primary promotions create new timelines, and barman-cloud's pre-archive check refuses to resume against an existing archive whose timeline state doesn't match the new primary's.
The symptom in the plugin-barman-cloud sidecar logs:
barman-cloud-check-wal-archive checking the first wal
ERROR: WAL archive check failed for server postgres: Expected empty archive
Even though the bucket may look mostly empty (because retention pruning ate the old WAL segments), the residual <timeline>.history files and base backup directories are enough to fail the check.
Fix: switch to a fresh barman serverName so a clean archive starts in a new prefix, preserving the orphaned data for posterity.
# kubernetes/apps/database/<cluster>/cluster/cluster.yaml
spec:
plugins:
- name: barman-cloud.cloudnative-pg.io
isWALArchiver: true
parameters:
barmanObjectName: cnpg-garage
serverName: postgres-r2 # bump this each time the archive needs reset
After applying the spec change:
- The Cluster's plugin-barman-cloud sidecar caches barman config at startup, so you must restart the primary pod for the new serverName to take effect. (A sequential restart — replicas first, then the primary — is cleaner; or just delete the primary pod and let CNPG fail over to a replica.)
- Watch the plugin logs on the new primary — you should see options ending with s3://cnpg/postgres-r2 instead of s3://cnpg/postgres.
- Verify the cluster conditions flip. You want ContinuousArchiving: True and LastBackupSucceeded: True:
kubectl get cluster -n database postgres -o jsonpath='{.status.conditions}' \
  | python3 -c 'import sys,json;[print(c["type"],c["status"]) for c in json.loads(sys.stdin.read())]'
- CNPG will automatically take a fresh base backup once archiving works (or you can trigger one with kubectl create backup ...). Without a fresh base backup, the WAL stream you're now archiving has nothing to apply against — no recovery point.
- The orphaned s3://cnpg/<old-name>/ prefix stays in Garage. If you want to GC it, do so manually after confirming no other systems reference it.
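The conditions check above can be dry-run by feeding the python one-liner a canned status array instead of kubectl output — this is the healthy shape you're looking for:

```shell
# Canned .status.conditions standing in for the live cluster's output.
echo '[{"type":"ContinuousArchiving","status":"True"},{"type":"LastBackupSucceeded","status":"True"}]' \
  | python3 -c 'import sys,json;[print(c["type"],c["status"]) for c in json.loads(sys.stdin.read())]'
# → ContinuousArchiving True
#   LastBackupSucceeded True
```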
Always alert on ContinuousArchiving and LastBackupSucceeded. Without those, your "last backup" timestamp can lie for months.
Architectural caveats¶
A few things this design depends on, which you should verify periodically:
- CNPG WAL archiving must actually work. The cluster spec uses barman-cloud.cloudnative-pg.io to ship WAL to Garage. If status.conditions[?(@.type=="ContinuousArchiving")].status == "False" for any extended period, the "last backup" timestamp lies — you don't have a real recovery point. Alert on this. As of writing, archiving has been broken since 2026-01-29 and the last successful base backup was 2026-03-21, which made the node-01 incident a much bigger event than expected.
- VolSync ReplicationSource must actually be running on schedule. If a source pod can't run (e.g., its working cache PVC is on a dead node), backups silently stop. Check kubectl get replicationsource -A periodically — a stale LAST SYNC means no fresh recovery point.
- OpenEBS provisioner placement. The openebs-localpv-provisioner Deployment can land on any node. If it's on the node that dies, every new PVC provision request stalls until the provisioner reschedules and re-pulls its image. Not catastrophic, just adds latency to recovery.
- The kubectl cnpg plugin should be installed. The status-patch workaround for promotion works, but it's brittle — the plugin's promote command is the supported path. Add it to .mise.toml if you don't already have it.
See also¶
- Backup & Recovery — VolSync architecture, Kopia repo layout, S3 backups
- Day-2 Operations — Routine operations
scripts/volsync-restore-all.sh— Bulk restore script in the repository root