Node Management
How to manage Talos nodes - config, maintenance, and recovery.
Current Setup
| Node | Role | Hardware |
|---|---|---|
| m0 | Control Plane | MS-01 i9-13900H, 96GB RAM, 1TB OS + 2TB Data |
| m1 | Control Plane | MS-01 i9-13900H, 96GB RAM, 1TB OS + 2TB Data |
| m2 | Control Plane | MS-01 i9-13900H, 96GB RAM, 1TB OS + 2TB Data |
All three are control plane nodes with workloads scheduled on them.
Applying Config
task talos:apply-node NODE=<node>
| Option | Default | What it does |
|---|---|---|
| MODE | auto | Apply mode: auto (Talos decides), reboot (force reboot), staged (apply on next reboot) |
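A minimal sketch of how MODE might be passed, wrapped in a shell function that rejects anything other than the three values in the table. `apply_node` is a hypothetical helper name and is not part of the Taskfile; it assumes `task` is on your PATH.

```shell
# Hypothetical wrapper around the documented task; not part of the Taskfile.
apply_node() {
  local node="$1"
  local mode="${2:-auto}"   # auto is the documented default
  case "$mode" in
    auto|reboot|staged) ;;
    *)
      echo "unsupported MODE: $mode (expected auto, reboot, or staged)" >&2
      return 1
      ;;
  esac
  task talos:apply-node NODE="$node" MODE="$mode"
}
```

For example, `apply_node m1 staged` queues the config so it takes effect on the next reboot of m1.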
Config files are in:
talos/
├── controlplane.yaml # Base config
├── controlplane/
│ ├── m0.yaml # Node-specific patches
│ ├── m1.yaml
│ └── m2.yaml
└── schematic.yaml # Factory schematic for Secure Boot
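To preview what a node would receive, the base config and a node patch can be merged locally. This is only a sketch: the Taskfile is not shown here, so whether it combines the files exactly this way is an assumption, and `preview_node_config` is a hypothetical helper name. It relies on `talosctl machineconfig patch`, which prints the merged config to stdout when no output file is given.

```shell
# Hypothetical preview helper; assumes talosctl is installed and that the
# node patch is meant to be layered on top of talos/controlplane.yaml.
preview_node_config() {
  local node="$1"
  talosctl machineconfig patch talos/controlplane.yaml \
    --patch "@talos/controlplane/${node}.yaml"
}
```

Run it from the repo root, for example `preview_node_config m0 | less`.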
Rebooting
task talos:reboot-node NODE=<node>
Add MODE=powercycle for a hard reboot if needed.
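If you want to script a reboot and wait for the node to come back, something like the following could work. `reboot_and_wait` is a hypothetical helper, the 10 minute timeout is an arbitrary choice, and it assumes the Kubernetes node name matches the NODE value.

```shell
# Hypothetical helper: reboot a node via the documented task, then wait
# for Kubernetes to report the node as Ready.
reboot_and_wait() {
  local node="$1"
  local mode="${2:-}"   # pass "powercycle" for a hard reboot
  if [ -n "$mode" ]; then
    task talos:reboot-node NODE="$node" MODE="$mode" || return 1
  else
    task talos:reboot-node NODE="$node" || return 1
  fi
  # Timeout is an assumption; adjust for your hardware.
  kubectl wait --for=condition=Ready "node/$node" --timeout=10m
}
```

If the task returns before the node has actually gone down, Kubernetes may still report it as Ready and the wait returns right away, so treat a fast success with some suspicion.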
Shutting Down the Cluster
task talos:shutdown-cluster
To bring it back up, just power on the machines. Talos boots and rejoins automatically.
Regenerating Kubeconfig
If the kubeconfig expires or gets corrupted, regenerate it:
task talos:kubeconfig
Maintenance Procedure
Before doing maintenance on a node:
- Check things are healthy:
  kubectl get nodes
  kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status
- Tell Ceph not to rebalance:
  kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd set noout
- Cordon and drain:
  kubectl cordon <node>
  kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
- Do your maintenance
- Uncordon:
  kubectl uncordon <node>
- Unset noout:
  kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd unset noout
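The steps above can be sketched as two shell functions, one to run before the maintenance and one after. The function names are hypothetical. The sketch drops `-it` from `kubectl exec` on the assumption that no TTY is needed for one-shot commands in a script.

```shell
# Hypothetical helpers wrapping the documented commands.
ceph_cmd() {
  # Runs a ceph subcommand inside the Rook toolbox deployment.
  kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph "$@"
}

maintenance_begin() {
  local node="$1"
  # Stop at the first failing step so a broken cluster is not drained.
  kubectl get nodes || return 1
  ceph_cmd status || return 1
  ceph_cmd osd set noout || return 1
  kubectl cordon "$node" || return 1
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
}

maintenance_end() {
  local node="$1" rc=0
  kubectl uncordon "$node" || rc=$?
  # Clear noout even if uncordon failed, so Ceph can rebalance again.
  ceph_cmd osd unset noout || rc=$?
  return "$rc"
}
```

Usage would be `maintenance_begin m1`, then the actual work, then `maintenance_end m1`. Note that the health check only fails if the commands themselves fail; you still need to read the `ceph status` output before continuing.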
Resetting a Node
If you need to wipe a node and start fresh:
task talos:reset-node NODE=<node>
This destroys everything on the node.
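Because the reset is destructive, you may want a guard in front of it. The sketch below asks you to type the node name back before it runs the task. `confirm_reset` is a hypothetical helper, not something the Taskfile provides.

```shell
# Hypothetical guard: require the node name to be typed back before wiping.
confirm_reset() {
  local node="$1" answer
  printf 'Type the node name (%s) to confirm the wipe: ' "$node" >&2
  read -r answer || answer=""
  if [ "$answer" != "$node" ]; then
    echo "confirmation did not match; aborting" >&2
    return 1
  fi
  task talos:reset-node NODE="$node"
}
```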
Resetting the Whole Cluster
Nuclear option:
task talos:reset-cluster
Make sure you have backups before doing this.
Adding a New Node
- Install Talos ISO, set up Secure Boot (see bootstrap)
- Create a node config in talos/controlplane/<new-node>.yaml
- Apply config:
  task talos:apply-node NODE=<new-node>
- Watch it join:
  kubectl get nodes -w
- If it has storage, Rook will discover and provision OSDs
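The apply and watch steps could be scripted roughly like this. `add_node` is a hypothetical helper; the retry count, sleep interval, and timeout are arbitrary, and it assumes the Kubernetes node name matches the NODE value. It must be run from the repo root so the relative path resolves.

```shell
# Hypothetical helper: apply the node config, then wait for the node to
# register with Kubernetes and become Ready.
add_node() {
  local node="$1" tries=0
  local patch="talos/controlplane/${node}.yaml"
  if [ ! -f "$patch" ]; then
    echo "missing node config: $patch" >&2
    return 1
  fi
  task talos:apply-node NODE="$node" || return 1
  # kubectl wait needs an existing Node object, so poll until it appears.
  until kubectl get node "$node" >/dev/null 2>&1; do
    tries=$((tries + 1))
    if [ "$tries" -ge 60 ]; then
      echo "node $node did not register in time" >&2
      return 1
    fi
    sleep 5
  done
  kubectl wait --for=condition=Ready "node/$node" --timeout=10m
}
```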
Removing a Node
- Drain workloads:
  kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
- If it has Ceph OSDs, remove them first (see storage-operations)
- Delete from cluster:
  kubectl delete node <node>
- Optionally reset:
  task talos:reset-node NODE=<node>
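A rough sketch of the removal flow that refuses to delete a node that was hosting OSDs. `remove_node` is a hypothetical helper. It assumes Rook's usual `app=rook-ceph-osd` pod label and the `rook-ceph` namespace used elsewhere in this doc. The OSD removal itself stays manual.

```shell
# Hypothetical helper: drain a node and delete it only if it hosted no OSDs.
remove_node() {
  local node="$1" osds
  # Check before draining: once evicted, OSD pods are no longer bound to
  # the node and would not match this selector.
  osds="$(kubectl -n rook-ceph get pods -l app=rook-ceph-osd \
    --field-selector "spec.nodeName=$node" -o name)" || return 1
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data || return 1
  if [ -n "$osds" ]; then
    echo "node $node hosted Ceph OSDs; remove them first (see storage-operations)" >&2
    return 2
  fi
  kubectl delete node "$node"
}
```

An exit status of 2 means the node is drained but still present, so you can finish the OSD removal and then run `kubectl delete node <node>` by hand.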
Node Shell Access
For low-level debugging:
task kubernetes:node-shell NODE=<node>
This gives you a privileged shell on the node.