Troubleshooting
Common issues and how to fix them.
Quick Diagnostics
# Node status
kubectl get nodes -o wide
# All pods
kubectl get pods -A
# Recent events
kubectl get events -A --sort-by='.lastTimestamp' | tail -20
# Flux status
flux get all -A | grep -i false
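To narrow the pod listing down to what needs attention, a small filter can help. This is a minimal sketch: `unhealthy_pods` is a hypothetical helper name, and it assumes the default column order of `kubectl get pods -A --no-headers` (NAMESPACE NAME READY STATUS RESTARTS AGE).

```shell
# Sketch: keep only pods whose STATUS is neither Running nor Completed.
# Reads `kubectl get pods -A --no-headers` output on stdin.
# Prints "<namespace>/<pod> <status>" for each match.
unhealthy_pods() {
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2 " " $4 }'
}

# Usage against a live cluster:
#   kubectl get pods -A --no-headers | unhealthy_pods
```

A pod that is Running but not Ready will not show up here; check the READY column for those.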
Pods Not Starting
Stuck in Pending
kubectl describe pod -n <namespace> <pod>
| Cause | Fix |
|---|---|
| Not enough resources | Scale down other workloads or add capacity |
| Node selector doesn't match | Check node labels |
| PVC not bound | Check that the storage class exists and the PVC is Bound |
| Taints blocking it | Add tolerations |
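The scheduler's FailedScheduling message (in the Events section of `kubectl describe pod`) usually points at one of the causes above. The sketch below maps a message to a row of the table. `classify_pending` is a hypothetical helper, and the matched phrases are assumptions based on common scheduler wording, which varies between Kubernetes versions.

```shell
# Sketch: map a FailedScheduling message to one of the causes in the table.
# Takes the message as its first argument and prints a short label.
# Only the first matching cause is reported, even if the message lists several.
classify_pending() {
  case "${1:-}" in
    *Insufficient*)                               echo "resources" ;;
    *"node affinity/selector"*|*"node selector"*) echo "node-selector" ;;
    *unbound*PersistentVolumeClaim*)              echo "pvc" ;;
    *taint*)                                      echo "taint" ;;
    *)                                            echo "unknown" ;;
  esac
}

# Usage: paste the message from the pod's events
#   classify_pending "0/3 nodes are available: 3 Insufficient memory."
```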
Stuck in ContainerCreating
kubectl describe pod -n <namespace> <pod>
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod>
| Cause | Fix |
|---|---|
| Image pull failed | Check image name and registry creds |
| Volume mount failed | Check PVC and CSI driver |
| Secret not found | Check that the ExternalSecret has synced |
CrashLoopBackOff
kubectl logs -n <namespace> <pod> --previous
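Usually the app is crashing on startup. The --previous flag shows output from the last crashed container, so check it for stack traces or configuration errors.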
Clean Up Failed Pods
task kubernetes:cleanse-pods
This removes pods in Failed, Pending, Succeeded, Completed, NodeStatusUnknown, or Error states.
Flux Issues
HelmRelease Stuck
flux get hr -A | grep False
kubectl describe helmrelease -n <namespace> <release>
Restart it:
task kubernetes:hr:restart
Or manually:
flux suspend hr -n <namespace> <release>
flux resume hr -n <namespace> <release>
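The manual suspend/resume pair can be wrapped in a function so both steps always run together. This is a sketch only: `restart_hr` and the `FLUX` override are hypothetical names, and it is not necessarily what `task kubernetes:hr:restart` does internally.

```shell
# Sketch: suspend, then resume, a single HelmRelease.
# Set FLUX=echo to print the commands instead of running them (dry run).
restart_hr() {
  if [ -z "${1:-}" ] || [ -z "${2:-}" ]; then
    echo "usage: restart_hr <namespace> <release>" >&2
    return 1
  fi
  # Resume only if the suspend succeeded.
  "${FLUX:-flux}" suspend hr -n "$1" "$2" &&
    "${FLUX:-flux}" resume hr -n "$1" "$2"
}

# Usage:
#   restart_hr <namespace> <release>
#   FLUX=echo restart_hr <namespace> <release>   # dry run
```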
Nothing is Syncing
Force a reconcile:
task kubernetes:reconcile
Node Issues
Node NotReady
kubectl describe node <node>
talosctl -n <node> services
talosctl -n <node> dmesg | tail -50
| Cause | Fix |
|---|---|
| Kubelet not running | talosctl -n <node> service kubelet restart |
| Network issues | Check that the Cilium (CNI) pods in kube-system are running |
| Disk pressure | Check disk usage |
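With several nodes, it helps to pull out just the ones that are not Ready before running the talosctl checks. This is a minimal sketch: `not_ready_nodes` is a hypothetical helper, and it assumes STATUS is the second column of `kubectl get nodes --no-headers`.

```shell
# Sketch: print the names of nodes whose STATUS does not start with "Ready".
# Reads `kubectl get nodes --no-headers` output on stdin.
# Cordoned but healthy nodes ("Ready,SchedulingDisabled") are not listed.
not_ready_nodes() {
  awk '$2 !~ /^Ready/ { print $1 }'
}

# Usage, assuming the Kubernetes node names are valid talosctl targets:
#   for node in $(kubectl get nodes --no-headers | not_ready_nodes); do
#     talosctl -n "$node" services
#   done
```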
Node Unreachable
Try a reboot:
task talos:reboot-node NODE=<node>
If that doesn't work, power cycle it via IPMI/KVM.
Network Issues
Services Unreachable
# Cilium status
kubectl -n kube-system exec -it ds/cilium -- cilium status
# BGP peers
kubectl -n kube-system exec -it ds/cilium -- cilium bgp peers
# LoadBalancer IPs
kubectl get svc -A | grep LoadBalancer
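A LoadBalancer service that never received an address shows `<pending>` in the EXTERNAL-IP column. The sketch below lists only those services. `pending_loadbalancers` is a hypothetical helper, and it assumes the default column order of `kubectl get svc -A --no-headers` (NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE).

```shell
# Sketch: list LoadBalancer services that have no external IP assigned yet.
# Reads `kubectl get svc -A --no-headers` output on stdin.
# Prints "<namespace>/<service>" for each match.
pending_loadbalancers() {
  awk '$3 == "LoadBalancer" && $5 == "<pending>" { print $1 "/" $2 }'
}

# Usage:
#   kubectl get svc -A --no-headers | pending_loadbalancers
```

If anything is listed, the Cilium status and BGP peer checks above are the next place to look.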
DNS Broken
# Check CoreDNS
kubectl get pods -n kube-system -l k8s-app=kube-dns
# Test resolution
kubectl run -it --rm debug --image=busybox --restart=Never -- nslookup kubernetes.default
Certificate Issues
kubectl get certificate -A
kubectl describe certificate -n <namespace> <name>
kubectl get certificaterequest -A
| Cause | Fix |
|---|---|
| DNS challenge failing | Check Cloudflare creds |
| Rate limited | Wait and retry |
| Invalid domain | Check certificate spec |
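To find which certificates to describe, filter the listing down to those that are not Ready. This is a minimal sketch: `not_ready_certs` is a hypothetical helper, and it assumes READY is the third column of `kubectl get certificate -A --no-headers`.

```shell
# Sketch: list certificates whose READY column is anything other than "True".
# Reads `kubectl get certificate -A --no-headers` output on stdin.
# Prints "<namespace>/<name>" for each match.
not_ready_certs() {
  awk '$3 != "True" { print $1 "/" $2 }'
}

# Usage:
#   kubectl get certificate -A --no-headers | not_ready_certs
```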
Debug Tools
Node Shell
task kubernetes:node-shell NODE=<node>
NFS Debug Pod
task kubernetes:nfs-pod NS=<namespace>
Browse a PVC
task kubernetes:browse-pvc NS=<namespace> CLAIM=<pvc-name>
Tail Logs
stern -n <namespace> -l app=<app>
When All Else Fails
- Check external monitoring (HealthChecks.io, UptimeRobot)
- Check recent git commits - did something change?
- Check the component docs (Talos, Flux, Cilium, Rook)
- Ask in the Kubernetes @Home Discord