Troubleshooting

Common issues and how to fix them.

Quick Diagnostics

# Node status
kubectl get nodes -o wide

# All pods
kubectl get pods -A

# Recent events
kubectl get events -A --sort-by='.lastTimestamp' | tail -20

# Flux status
flux get all -A | grep -i false
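
When the full pod list is long, it can help to filter it down to the pods that need attention. The following is a minimal sketch, not part of the cluster's task files: `unhealthy_pods` is a hypothetical helper that reads `kubectl get pods -A --no-headers` output on stdin and assumes the default column order (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE).

```shell
# Hypothetical helper: print pods whose STATUS is neither Running nor Completed.
# Expects `kubectl get pods -A --no-headers` output on stdin, so STATUS is
# the fourth whitespace-separated column.
unhealthy_pods() {
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2 " (" $4 ")" }'
}
```

Usage: kubectl get pods -A --no-headers | unhealthy_pods. Note that a pod can report Running while its containers are not ready (for example 0/1 in the READY column), and this filter will not catch that case.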

Pods Not Starting

Stuck in Pending

kubectl describe pod -n <namespace> <pod>

Cause                         Fix
Not enough resources          Scale down other workloads or add capacity
Node selector doesn't match   Check node labels
PVC not bound                 Check storage class and PVC
Taints blocking it            Add tolerations
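
The cause usually shows up in the message of the pod's FailedScheduling event. As a rough sketch, the hypothetical `classify_pending` function below maps such a message to one of the causes listed above. The matched substrings are assumptions based on commonly seen scheduler messages; the exact wording differs between Kubernetes versions, so treat the result as a hint rather than a diagnosis.

```shell
# Hypothetical helper: map a FailedScheduling message to a likely cause.
# The patterns are assumptions about scheduler wording and may need
# adjusting for your Kubernetes version.
classify_pending() {
  case "$1" in
    *Insufficient*)
      echo "Not enough resources" ;;
    *"node affinity/selector"*)
      echo "Node selector doesn't match" ;;
    *"unbound immediate PersistentVolumeClaims"*)
      echo "PVC not bound" ;;
    *taint*)
      echo "Taints blocking it" ;;
    *)
      echo "Unknown: read the full event message" ;;
  esac
}
```

The message itself can be read from kubectl get events -n <namespace> --field-selector reason=FailedScheduling. A single message may list several reasons at once; this sketch reports only the first pattern that matches.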

Stuck in ContainerCreating

kubectl describe pod -n <namespace> <pod>
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod>

Cause                 Fix
Image pull failed     Check image name and registry creds
Volume mount failed   Check PVC and CSI driver
Secret not found      Check that the ExternalSecret has synced

CrashLoopBackOff

kubectl logs -n <namespace> <pod> --previous

Usually the app itself is crashing. The --previous flag shows logs from the last terminated container, so check them for stack traces.

Clean Up Failed Pods

task kubernetes:cleanse-pods

This removes pods in Failed, Pending, Succeeded, Completed, NodeStatusUnknown, or Error states.

Flux Issues

HelmRelease Stuck

flux get hr -A | grep False
kubectl describe helmrelease -n <namespace> <release>

Restart it:

task kubernetes:hr:restart

Or manually:

flux suspend hr -n <namespace> <release>
flux resume hr -n <namespace> <release>
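
If you run the manual steps often, they can be wrapped in a small function. This is only a sketch of the two commands above and makes no claim about what the task target does internally. `restart_hr` and the `FLUX` override variable are hypothetical names introduced here; setting `FLUX=echo` prints the commands instead of running them.

```shell
# Hypothetical helper: suspend, then resume, a HelmRelease.
# FLUX is an assumed override for the flux binary (e.g. FLUX=echo for a dry run).
restart_hr() {
  if [ "$#" -ne 2 ]; then
    echo "usage: restart_hr <namespace> <release>" >&2
    return 1
  fi
  # Resume only runs if the suspend succeeded.
  "${FLUX:-flux}" suspend hr -n "$1" "$2" &&
    "${FLUX:-flux}" resume hr -n "$1" "$2"
}
```

Usage: restart_hr <namespace> <release>.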

Nothing is Syncing

Force a reconcile:

task kubernetes:reconcile

Node Issues

Node NotReady

kubectl describe node <node>
talosctl -n <node> services
talosctl -n <node> dmesg | tail -50

Cause                 Fix
Kubelet not running   talosctl -n <node> service kubelet restart
Network issues        Check CNI pods
Disk pressure         Check disk usage

Node Unreachable

Try a reboot:

task talos:reboot-node NODE=<node>

If that doesn't work, power cycle it via IPMI/KVM.

Network Issues

Services Unreachable

# Cilium status
kubectl -n kube-system exec -it ds/cilium -- cilium status

# BGP peers
kubectl -n kube-system exec -it ds/cilium -- cilium bgp peers

# LoadBalancer IPs
kubectl get svc -A | grep LoadBalancer
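
A LoadBalancer service that never received an address shows `<pending>` in the EXTERNAL-IP column, which is often the quickest sign that IP allocation or BGP advertisement is the problem. The sketch below assumes the default column order of `kubectl get svc -A --no-headers` (NAMESPACE, NAME, TYPE, CLUSTER-IP, EXTERNAL-IP, PORT(S), AGE); `pending_loadbalancers` is a hypothetical helper name.

```shell
# Hypothetical helper: list LoadBalancer services still waiting for an external IP.
# Expects `kubectl get svc -A --no-headers` output on stdin.
pending_loadbalancers() {
  awk '$3 == "LoadBalancer" && $5 == "<pending>" { print $1 "/" $2 }'
}
```

Usage: kubectl get svc -A --no-headers | pending_loadbalancers. Empty output means every LoadBalancer service has an address, so the problem is more likely routing or the backing pods.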

DNS Broken

# Check CoreDNS
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Test resolution
kubectl run -it --rm debug --image=busybox -- nslookup kubernetes.default

Certificate Issues

kubectl get certificate -A
kubectl describe certificate -n <namespace> <name>
kubectl get certificaterequest -A

Cause                   Fix
DNS challenge failing   Check Cloudflare creds
Rate limited            Wait and retry
Invalid domain          Check certificate spec
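
To narrow down which certificates to describe, you can filter the list for entries that are not ready. This sketch assumes cert-manager's default columns for `kubectl get certificate -A --no-headers` (NAMESPACE, NAME, READY, SECRET, AGE); `unready_certs` is a hypothetical helper name.

```shell
# Hypothetical helper: list certificates whose READY column is not "True".
# Expects `kubectl get certificate -A --no-headers` output on stdin.
unready_certs() {
  awk '$3 != "True" { print $1 "/" $2 }'
}
```

Usage: kubectl get certificate -A --no-headers | unready_certs, then run kubectl describe certificate on each result.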

Debug Tools

Node Shell

task kubernetes:node-shell NODE=<node>

NFS Debug Pod

task kubernetes:nfs-pod NS=<namespace>

Browse a PVC

task kubernetes:browse-pvc NS=<namespace> CLAIM=<pvc-name>

Tail Logs

stern -n <namespace> -l app=<app>

When All Else Fails

  1. Check external monitoring (HealthChecks.io, UptimeRobot)
  2. Check recent git commits - did something change?
  3. Check the component docs (Talos, Flux, Cilium, Rook)
  4. Ask in the Kubernetes @Home Discord