Disaster Recovery
Notes on backups, restores, and what to do when things go wrong.
Backup Strategy
- Persistent Volumes - Volsync backs up to Cloudflare R2 daily using Restic
- Cluster State - Everything is in Git; Flux reconciles the cluster from it
- Secrets - Stored in 1Password, pulled in via external-secrets
Volsync Operations
Volsync handles backup/restore for PVCs. There are some assumptions baked in:
- Kustomization, HelmRelease, PVC, and ReplicationSource all share the same name
- ReplicationSource uses Restic
- App is a Deployment or StatefulSet
- Single PVC per app
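A quick way to confirm an app follows the naming convention is to look up each resource by the shared name. The sketch below is only an illustration: the helper name `check_names` is hypothetical, and it assumes that every resource, including the Flux Kustomization, lives in the app's namespace, which may not match your layout.

```shell
#!/bin/sh
# Sketch: check that the resources the Volsync tasks expect all exist under one name.
# Assumption: all four resources live in the app's namespace.
check_names() {
  app="$1"; ns="$2"; rc=0
  for kind in \
    kustomizations.kustomize.toolkit.fluxcd.io \
    helmreleases.helm.toolkit.fluxcd.io \
    persistentvolumeclaims \
    replicationsources.volsync.backube
  do
    # "kubectl get <kind> <name>" exits non-zero when the resource is absent.
    if kubectl -n "$ns" get "$kind" "$app" >/dev/null 2>&1; then
      echo "ok      $kind/$app"
    else
      echo "missing $kind/$app"
      rc=1
    fi
  done
  return "$rc"
}
```

Call it as `check_names plex media`; any `missing` line means the Volsync tasks are unlikely to work for that app as-is.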
List Snapshots
task volsync:list APP=plex NS=media
Create Manual Snapshot
If you need a backup right now instead of waiting for the daily schedule:
task volsync:snapshot APP=home-assistant NS=default
The task waits up to 2 hours for the backup to complete.
Restore from Backup
task volsync:restore APP=plex NS=media PREVIOUS=2
PREVIOUS is how many snapshots back to restore, counting from the newest (0 = latest, 1 = one before latest, etc.).
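To make the counting concrete, the sketch below picks an entry from a newest-first list the same way PREVIOUS does. The helper `pick_snapshot` and the sample dates are made up for illustration; the real task resolves snapshots itself.

```shell
#!/bin/sh
# Hypothetical helper: shows how PREVIOUS counts back from the latest snapshot.
# Reads snapshot names from stdin, newest first, one per line.
pick_snapshot() {
  previous="$1"
  # PREVIOUS=0 selects line 1 (latest), PREVIOUS=1 selects line 2, and so on.
  sed -n "$((previous + 1))p"
}

# Made-up sample data, newest first.
snapshots="2024-05-03
2024-05-02
2024-05-01"

printf '%s\n' "$snapshots" | pick_snapshot 2   # prints 2024-05-01
```

With three snapshots available, PREVIOUS=2 is the oldest one; a larger value would select nothing.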
What happens under the hood:
- Suspends Flux kustomization and helmrelease
- Scales app to 0 replicas
- Waits for pods to terminate
- Creates a ReplicationDestination and restores data
- Resumes Flux and reconciles
- Waits for pods to be ready again
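If you ever need to walk through these steps by hand, they map roughly onto the `flux` and `kubectl` commands below. This is a sketch, not the task's actual implementation: it assumes the app is a Deployment, that the Flux Kustomization lives in the app's namespace, that pods carry an `app.kubernetes.io/name` label, and that `replicationdestination.yaml` is a manifest you have prepared (the file name is hypothetical). Commands are only printed unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch of the restore sequence; not the actual task implementation.
# Prints each command by default; set DRY_RUN=0 to execute instead.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

restore_sketch() {
  app="$1"; ns="$2"
  # Stop Flux from reverting the manual changes below.
  run flux suspend kustomization "$app" -n "$ns"
  run flux suspend helmrelease "$app" -n "$ns"
  # Scale down and wait until no pod still holds the PVC.
  run kubectl -n "$ns" scale deployment "$app" --replicas=0
  run kubectl -n "$ns" wait pod -l "app.kubernetes.io/name=$app" --for=delete --timeout=5m
  # Hypothetical manifest that defines the ReplicationDestination.
  run kubectl -n "$ns" apply -f replicationdestination.yaml
  # Wait here until the restore has finished (mechanism omitted in this sketch).
  run flux resume helmrelease "$app" -n "$ns"
  run flux resume kustomization "$app" -n "$ns"
  run kubectl -n "$ns" rollout status deployment "$app" --timeout=5m
}

restore_sketch plex media
```

Running it in the default dry-run mode is a cheap way to review the order of operations before touching a real app.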
Unlock Stuck Repos
If a backup job got interrupted, the Restic repo might be locked:
task volsync:unlock
Suspend/Resume Volsync
For maintenance:
task volsync:state-suspend
task volsync:state-resume
Recovery Scenarios
App Data Got Corrupted
- List snapshots:
task volsync:list APP=<app> NS=<ns>
- Pick one and restore:
task volsync:restore APP=<app> NS=<ns> PREVIOUS=<n>
- Verify it's working
Node Died
If recoverable, just reboot it:
task talos:reboot-node NODE=<node>
If it needs a full reinstall:
task talos:reset-node NODE=<node>
task talos:apply-node NODE=<node>
Ceph will rebalance automatically. Check health with:
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status
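Rebalancing can take a while, so it may be handy to poll instead of re-running the status command by hand. The sketch below loops on `ceph health` through the same toolbox deployment until it reports `HEALTH_OK`. The function names and the default attempt count and delay are arbitrary choices for illustration.

```shell
#!/bin/sh
# Sketch: poll Ceph until it reports HEALTH_OK or the attempts run out.
ceph_health() {
  kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph health
}

wait_for_ceph() {
  attempts="${1:-60}"   # number of polls (arbitrary default)
  delay="${2:-30}"      # seconds between polls (arbitrary default)
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # "ceph health" prints a line starting with HEALTH_OK, HEALTH_WARN, or HEALTH_ERR.
    case "$(ceph_health 2>/dev/null)" in
      HEALTH_OK*) echo "ceph healthy"; return 0 ;;
    esac
    i=$((i + 1))
    sleep "$delay"
  done
  echo "ceph still not healthy" >&2
  return 1
}
```

Call `wait_for_ceph` with no arguments for the defaults, or for example `wait_for_ceph 120 15` to poll more often for the same total time.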
Complete Cluster Loss
This is the nuclear option. Hopefully you never need this.
- Provision hardware with Talos ISO (see bootstrap)
- Apply Talos config to all nodes:
task talos:apply-node NODE=m0
task talos:apply-node NODE=m1
task talos:apply-node NODE=m2
- Bootstrap:
task bootstrap:talos
task bootstrap:apps ROOK_DISK=<disk-model>
- Flux restores everything from Git
- Volsync restores PVC data from R2
Accidentally Deleted Something
Just force a Flux reconcile:
task kubernetes:reconcile
Flux will recreate whatever is missing from Git.