Introduction

This is the documentation for my home Kubernetes cluster running on Talos Linux, managed with Flux.

Note

These docs are specific to my setup. They may or may not work for you, but hopefully they're useful as a reference.

Overview

This repository is my home Kubernetes cluster in a declarative state. Flux watches the kubernetes folder and applies changes to the cluster based on the YAML manifests.

Core Components

Core components that form the foundation of the cluster:

backube/volsync and backube/snapscheduler: Backup and recovery of persistent volume claims.
cilium/cilium: Kubernetes CNI.
envoyproxy/envoy: Kubernetes-based application gateway using Kubernetes Gateway API.
external-secrets/external-secrets: Managed Kubernetes secrets using 1Password Connect.
jetstack/cert-manager: Creates SSL certificates for services in my Kubernetes cluster.
kubernetes-sigs/external-dns: Automatically manages DNS records from my cluster in Cloudflare.
rancher/system-upgrade-controller: Handles Kubernetes and Talos upgrades automatically.
rook/rook: Distributed block storage for persistent storage.
siderolabs/talos: The Kubernetes Operating System.

Observability

For observability and monitoring of the cluster the following software is used:

fluent/fluent-bit: Log processor.
grafana/grafana: Data visualization platform.
prometheus/alertmanager: Handles processing and sending alerts.
pushover: Handles receiving alerts on my devices.
TwiN/gatus: High level status dashboard.
VictoriaMetrics/VictoriaLogs: Database for logs.
prometheus/prometheus: Time series database for metrics.

Automation

GitHub Actions for checking code formatting and running periodic jobs.
Renovate keeps the application charts and container images up-to-date.

Cloud Dependencies

1Password for managing secrets via external-secrets.
AWS SES for sending emails.
Cloudflare for exposing services (via tunnels), creating certificates, and managing domains.
Cloudflare R2 for daily backups.
Pushover for sending alerts.

Total cloud costs are approximately $150/year.

Directories

This Git repository contains the following directories.

📁 bootstrap       # Flux installation to bootstrap cluster
📁 docs            # Docs
📁 hacks           # Contains random scripts
📁 kubernetes      # Kubernetes cluster defined as code
├─📁 flux          # Main Flux configuration of repository
├─📁 components    # Flux components
└─📁 apps          # Apps deployed into my cluster grouped by namespace
📁 talos           # Contains the configuration for Talos operating system
📁 terraform       # Contains Cloudflare terraform

Hardware

Check out my rack

Device	Count	OS Disk Size	Data Disk Size	Ram	Operating System	Purpose
UDM-Pro-Max	1	-	-	-	UniFi	Router
USW-Pro-Aggregation	1	-	-	-	UniFi	Switch
USW-Pro-Max-24-PoE	1	-	-	-	UniFi	Switch
UAP-AC-Lite	1	-	-	-	UniFi	WiFi AP
ER-10X	1	-	-	-	EdgeOS	Switch
PiKVM V4 Mini	1	-	-	-	PiKVM	KVM
TESmart HDMI KVM Switch 8 Ports	1	-	-	-	-	KVM Switch
CyberPower CP1500PFCRM2U	1	-	-	-	-	UPS
USP-PDU-Pro	1	-	-	-	UniFi	PDU
Synology DS920+	1	-	2x8TB & 2x16TB	20GB	DSM	NAS
MS-01 i9-13900H	3	1TB	2TB	96GB	Talos	Control Plane

Gratitude and Thanks

Thanks to all the people who donate their time to the Kubernetes @Home community.

This repository was built off the onedr0p/template-cluster-k3s repository.

Workstation Setup

What you need to manage the cluster from your machine.

Quick Setup

# Install Homebrew packages
task workstation:brew

# Install kubectl plugins
task workstation:krew

Required Tools

Everything gets installed via Homebrew.

Core

Tool	Purpose
`kubernetes-cli`	kubectl
`talosctl`	Talos node management
`flux`	Flux CLI
`helm`	Helm
`helmfile`	Declarative Helm
`kustomize`	Kustomize

Development

Tool	Purpose
`go-task`	Task runner
`jq`	JSON wrangling
`yq`	YAML wrangling
`age`	Encryption
`sops`	Secret operations

Cluster Management

Tool	Purpose
`k9s`	Terminal UI for Kubernetes
`kubecolor`	Colorized kubectl
`stern`	Multi-pod log tailing
`viddy`	Modern watch command

Infrastructure

Tool	Purpose
`cloudflared`	Cloudflare tunnel
`gh`	GitHub CLI
`talhelper`	Talos config helper
`minijinja-cli`	Template rendering

Secrets

Tool	Purpose
`1password`	1Password app
`1password-cli`	1Password CLI

kubectl Plugins

Installed via Krew:

Plugin	Purpose
`browse-pvc`	Browse PVC contents
`node-shell`	Shell into nodes
`rook-ceph`	Rook-Ceph commands
`view-secret`	Decode secrets
`cert-manager`	Cert-manager commands
`cnpg`	CloudNativePG commands

Configuration

Talos Config

Talos config lives in ~/.talos/config or wherever TALOSCONFIG points.

Regenerate it with:

task talos:kubeconfig
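
To see which file will be picked up, the lookup described above can be sketched as a tiny shell function. The name talosconfig_path is made up for illustration, and the sketch only covers the two locations mentioned here, not any other fallback paths talosctl may check.

```shell
# Print the Talos client config path: $TALOSCONFIG if set, else ~/.talos/config.
# talosconfig_path is a hypothetical helper name, not part of talosctl.
talosconfig_path() {
  printf '%s\n' "${TALOSCONFIG:-$HOME/.talos/config}"
}

# Example:
# ls -l "$(talosconfig_path)"
```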

1Password

Authenticate:

eval $(op signin)

Check it works:

op user get --me

Shell Setup

Aliases

Add these to your .zshrc or .bashrc:

alias k='kubectl'
alias kc='kubecolor'
alias f='flux'
alias watch='viddy'
alias logs='stern'

Completions

source <(kubectl completion zsh)
source <(flux completion zsh)
source <(talosctl completion zsh)
source <(helm completion zsh)

Verify Everything Works

kubectl get nodes
talosctl version
flux check
op user get --me
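
If one of those commands fails with "command not found", it helps to check what is actually on PATH first. A minimal sketch, assuming a POSIX shell; missing_tools is a hypothetical helper and not a task in this repo:

```shell
# Print each named tool that is NOT found on PATH, one per line.
# missing_tools is a hypothetical helper, not part of the Taskfile.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Example: check the CLIs used by the verification commands above.
# missing_tools kubectl talosctl flux op
```

Empty output means everything was found. Anything printed should be fixable by re-running task workstation:brew.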

Updating Tools

# Homebrew
brew update && brew upgrade

# Krew plugins
kubectl krew upgrade

# Resync with Brewfile
task workstation:brew

Cluster Bootstrap

The process should be mostly automated via task bootstrap:apps. If all goes well the cluster should come up based on the last available Volsync snapshot, which runs daily.

Priming the Hardware

Grab the latest Talos .iso here
Plug it in and follow secure boot setup

MS-01 SecureBoot Setup

Enabling Secure Boot on the MS-01 can be difficult if it's not something you have done before. Here's how to do it:

Boot directly to the BIOS
Under Security->Secure Boot change to custom
Go down to Key Management
Set Factory Key Provision to disabled
Click Reset To Setup Mode
- IMPORTANT: click cancel when it says save without exiting
Save and Reset
Mount Talos image and reboot, click Enroll secure boot keys: auto

If you still see errors on start about key violations it probably means the factory default keys weren't wiped (step 4). Make sure changes are saved before rebooting.

Bootstrap Flux

Flux manages the state of the cluster, but it can't do that until it's installed. A few things need to be manually installed first:

Cilium - By default Talos installs a basic CNI, so I disable that. Kubernetes doesn't work without a CNI.
CoreDNS - Talos installs this by default but that makes it hard to upgrade so I disable it and manage it with Flux.
cert-manager - For certificate things. Bootstrapping this early makes life easier since everything is based on my domain.
external-secrets - All my secrets are in 1Password, this pulls them into the cluster.
kubelet-csr-approver - Auto approves CSRs, makes life easy.
spegel - In-cluster OCI registry mirror to save some bandwidth.
Flux - After this point Flux manages the state of the cluster via manifests in the repo.

All of these get installed with a single command from the bootstrap Taskfile:

task bootstrap:apps

This applies bootstrap resources for 1Password and Cloudflare tunnel, then installs everything via helmfile.

Home Network Setup

Some notes on my current home network setup.

Network Diagram

flowchart TB
    subgraph Internet
        WAN[("🌐 Internet")]
        CF[Cloudflare Tunnel]
    end

    subgraph Network["Network Infrastructure"]
        UDM["UDM Pro Max<br/>Gateway/Router<br/>192.168.1.1 | 10.0.40.1"]
        AGG["USW-Pro-Aggregation<br/>Switch"]
        POE["USW-Pro-Max-24-PoE<br/>Switch"]
        AP["UAP-AC-Lite<br/>WiFi AP"]
    end

    subgraph Servers["Kubernetes Cluster (10.0.40.0/24)"]
        M0["m0 - MS-01<br/>i9-13900H | 96GB<br/>10.0.40.10 | 169.254.255.10"]
        M1["m1 - MS-01<br/>i9-13900H | 96GB<br/>10.0.40.11 | 169.254.255.11"]
        M2["m2 - MS-01<br/>i9-13900H | 96GB<br/>10.0.40.12 | 169.254.255.12"]

        subgraph TB_Ring["Thunderbolt Ring (169.254.255.0/24)"]
            CEPH[("Rook-Ceph<br/>3x 2TB NVMe")]
        end
    end

    subgraph Storage["Storage (10.0.40.0/24)"]
        NAS["UNAS Pro 8<br/>4x 28TB (112TB raw)<br/>10.0.40.x"]
    end

    subgraph HomeLAN["Home Network (192.168.1.0/24)"]
        Devices["Computers, TVs,<br/>Phones, Tablets, etc."]
    end

    subgraph IoT["IoT VLAN 10 (10.0.10.0/24)"]
        HA["Home Assistant<br/>10.0.10.250"]
        Frigate["Frigate NVR<br/>10.0.10.239"]
        ESPHome["ESPHome<br/>10.0.10.245"]
        Zigbee["Zigbee2MQTT"]
        ZWave["Z-Wave JS UI"]
    end

    subgraph Services["K8s Services (10.0.20.0/24)"]
        MQTT["EMQX MQTT<br/>10.0.20.50"]
        Plex["Plex<br/>10.0.20.110"]
        Jellyfin["Jellyfin<br/>10.0.20.70"]
        PG["PostgreSQL<br/>10.0.20.17"]
        Envoy["Envoy Gateway<br/>External: 10.0.20.200<br/>Internal: 10.0.20.100"]
    end

    subgraph Mgmt["Management"]
        PiKVM["PiKVM V4 Mini"]
        UPS["CyberPower UPS"]
        PDU["UniFi PDU Pro"]
    end

    WAN --> UDM
    CF -.-> UDM
    UDM --> AGG
    AGG --> POE
    POE --> AP
    POE --> Devices
    AP -.->|"WiFi"| Devices
    AGG -->|"10Gb LACP"| M0
    AGG -->|"10Gb LACP"| M1
    AGG -->|"10Gb LACP"| M2
    M0 <-->|"TB4"| CEPH
    M1 <-->|"TB4"| CEPH
    M2 <-->|"TB4"| CEPH
    AGG -->|"10Gb"| NAS
    UDM -->|"VLAN 10"| IoT
    UDM <-.->|"BGP<br/>ASN 64513 ↔ 64514"| Servers
    Servers --> Services
    PiKVM --> Servers
    UPS --> PDU
    PDU --> Servers

    classDef router fill:#e74c3c,color:white
    classDef switch fill:#3498db,color:white
    classDef server fill:#2ecc71,color:white
    classDef storage fill:#f39c12,color:white
    classDef iot fill:#9b59b6,color:white
    classDef service fill:#1abc9c,color:white
    classDef tbring fill:#e67e22,color:white
    classDef homelan fill:#95a5a6,color:white

    class UDM router
    class AGG,POE,AP switch
    class M0,M1,M2 server
    class NAS storage
    class CEPH tbring
    class HA,Frigate,ESPHome,Zigbee,ZWave iot
    class MQTT,Plex,Jellyfin,PG,Envoy service
    class Devices homelan

Subnets

192.168.1.0/24 - LAN
192.168.33.0/24 - Wireguard
10.0.10.0/24 - IoT (VLAN10)
10.0.20.0/24 - Cilium LoadBalancer Pool
10.0.40.0/24 - Servers

I use Cilium to support LoadBalancer services in the cluster. Cilium manages the 10.0.20.0/24 subnet for IP allocation. BGP is configured between Cilium and my UDM Pro to provide routing for the rest of my home network.
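
When a LoadBalancer service gets an unexpected address, a quick sanity check is whether that address falls inside the Cilium pool at all. The following is a rough sketch of a CIDR membership test in POSIX sh (it relies on word splitting, so it will not work unmodified in zsh). The function names ip_to_int and in_cidr are my own and not part of any tool used here.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local IFS=.
  # Unquoted on purpose: split the address on dots into $1..$4.
  set -- $1
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeed if address $1 falls inside CIDR block $2 (e.g. 10.0.20.0/24).
in_cidr() {
  local addr net bits mask
  addr=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")
  bits=${2#*/}
  # Build the netmask: the top $bits bits set, the rest cleared.
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( addr & mask )) -eq $(( net & mask )) ]
}

# Example:
# in_cidr 10.0.20.110 10.0.20.0/24 && echo "in the LoadBalancer pool"
```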

Ingress Architecture

flowchart LR
    subgraph Internet
        Users["External Users"]
        ExtDNS["External DNS<br/>(Cloudflare)"]
        CFEdge["Cloudflare Edge<br/>*.chestr.dev"]
    end

    subgraph HomeNet["Home Network"]
        HomeUsers["Internal Users"]
        UDM["UDM Pro Max<br/>ASN 64513"]
    end

    subgraph Nodes["Cluster Nodes (m0, m1, m2)"]
        subgraph K8s["Kubernetes"]
            subgraph Gateways["Envoy Gateways"]
                ExtGW["External Gateway<br/>10.0.20.200"]
                IntGW["Internal Gateway<br/>10.0.20.100"]
            end

            CFTunnel["Cloudflared<br/>Tunnel Pod"]

            subgraph Apps["Application Pods"]
                ExtApps["External Apps<br/>Plex, Home Assistant, Radicale"]
                IntApps["Internal Apps<br/>Frigate, Sonarr, Radarr"]
            end

            ExtDNSOp["External-DNS Operator"]
        end
    end

    Users --> ExtDNS
    ExtDNS -->|HTTPS| CFEdge
    CFEdge -->|Tunnel| CFTunnel
    CFTunnel --> ExtGW
    ExtGW --> ExtApps

    HomeUsers --> UDM
    UDM --> ExtGW
    UDM --> IntGW
    IntGW --> IntApps

    ExtDNSOp -.->|Sync| ExtDNS
    ExtDNSOp -.->|Sync| UDM

    Nodes <-->|"BGP<br/>ASN 64514"| UDM

    classDef cloudflare fill:#f38020,color:white
    classDef gateway fill:#1abc9c,color:white
    classDef extapps fill:#3498db,color:white
    classDef intapps fill:#9b59b6,color:white
    classDef router fill:#e74c3c,color:white

    class CFEdge,CFTunnel,ExtDNS cloudflare
    class ExtGW,IntGW gateway
    class ExtApps extapps
    class IntApps intapps
    class UDM router

Traffic Flows

External Traffic (Internet → Services):

User requests app.chestr.dev
Cloudflare DNS resolves to Cloudflare Edge
Cloudflare routes through tunnel to Cloudflared pod
Routes to External Envoy Gateway (10.0.20.200)
Gateway routes to appropriate pod based on hostname

Internal Traffic (Home Network → Services):

User requests app.chestr.dev
UDM DNS resolves to gateway IP (10.0.20.100 or 10.0.20.200)
Envoy Gateway routes to pod
Full access to all services

Split Horizon DNS:

External-DNS operator syncs HTTPRoute hostnames to both Cloudflare and UDM
External users resolve to Cloudflare (proxied through tunnel)
Internal users resolve directly to Gateway LB IPs via UDM
BGP advertises LB IPs (10.0.20.0/24) from cluster nodes to UDM

Thunderbolt Ring Network

169.254.255.0/24 is used for the ring network. Each node is connected to the other two nodes using the two Thunderbolt ports on each machine.

thunderbolt connection

Validating Configuration

Spin up 3 node-shells:

task kubernetes:node-shell NODE=m0
task kubernetes:node-shell NODE=m1
task kubernetes:node-shell NODE=m2

Check routes are configured correctly:

ip r | grep '169.254.255'

Ping each node and make sure it works:

# From m0
ping 169.254.255.11
ping 169.254.255.12
# etc...
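
To avoid typing the peer addresses by hand, the "ping everyone else" step can be sketched as a small helper. The node-to-address mapping (m0 is .10, m1 is .11, m2 is .12) is taken from the network diagram; ring_peers is a hypothetical name, and whether ping is available depends on the image your node-shell uses.

```shell
# Print the Thunderbolt ring addresses of the OTHER nodes, given one node name.
# Mapping assumed from the network diagram: m0 -> .10, m1 -> .11, m2 -> .12.
ring_peers() {
  for entry in m0:169.254.255.10 m1:169.254.255.11 m2:169.254.255.12; do
    # Skip the entry whose name matches the node we are on.
    [ "${entry%%:*}" = "$1" ] || printf '%s\n' "${entry#*:}"
  done
}

# Example, run inside the node-shell for m0:
# for ip in $(ring_peers m0); do ping -c 3 "$ip" || echo "FAILED: $ip"; done
```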

Observability

In-Cluster Monitoring

Metrics

I use all the standard Prometheus CRDs for metric collection - ServiceMonitor, PodMonitor, PrometheusRule, etc.

For metrics storage I use Victoria Metrics. VM is a drop-in replacement for Prometheus which claims to be more performant. For the most part it seems to work well.

Why I switched from Prometheus:

Works with all the same CRDs, so I could swap back later if needed
Claims to be more performant
UI is a bit nicer in my opinion

Logging

For log collection I use Fluent Bit. I used to use Vector/Promtail but gave Fluent Bit a try and prefer it over the others. It uses minimal resources (~10 mCPU / 15MB RAM) and is fairly easy to configure for label normalization/cleaning.

For log storage I use Victoria Logs. I switched from Loki and much prefer VM Logs over Loki+Grafana for querying.

Why I switched from Loki:

Loki is just a storage layer with no UI - you need Grafana to visualize logs
Using Grafana to ad-hoc query logs is tedious and slow
Loki is not straightforward to set up
VM Logs has a built-in Prometheus-like query dashboard so ad-hoc log diving is simple

Alerting

For alerting I use AlertManager via Victoria Metrics.

I run 2 instances of VMAlert - one for Victoria Metrics rules and one for Victoria Logs rules. Two instances are needed because the query languages are different and would fail if run against the wrong backend.

For push notifications I paid $5 for Pushover and it works great.

Off-Cluster Monitoring

For things not in this repo, I use a few external services.

UDM Pro Dynamic DNS

I use Dynamic DNS on my UDM Pro to automatically update an A Record in Cloudflare with my home public IP.

Healthchecks.io

I have 2 push monitors on Healthchecks.io to track cluster status externally. I use it because its push monitors are free.

AlertManager Watchdog pings healthchecks.io every 5 minutes - this ensures my alerting is working
Gatus endpoint for my status page - ensures the status page is available

UptimeRobot

I use UptimeRobot to periodically ping the DNS A record set by my UDM to monitor that my home network is reachable externally. Free for 5 minute pings.

Storage Operations

Notes on Rook-Ceph management and troubleshooting.

Checking Cluster Health

kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status

You want to see HEALTH_OK. If not, check what's wrong:

kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph health detail
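
For scripting, the HEALTH_OK check can be turned into an exit status. A minimal sketch, assuming the health code appears verbatim in the command output (it does for both ceph status and ceph health); ceph_is_healthy is a hypothetical helper name:

```shell
# Succeed only if the Ceph output on stdin reports HEALTH_OK.
# Works on `ceph status` and `ceph health` output, since both print the health code.
ceph_is_healthy() {
  grep -q 'HEALTH_OK'
}

# Example (no -it here, since the output goes into a pipe):
# kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph health | ceph_is_healthy \
#   && echo "Ceph OK" || echo "Ceph needs attention"
```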

Other useful commands:

# OSD tree
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd tree

# Pool usage
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph df

# OSD usage
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd df

Removing a Dead OSD

If a node died and you need to clean up the OSD (adapted from Mirantis docs):

# Check status
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph -s

# Scale down the OSD (probably stuck in Pending if node is dead)
kubectl -n rook-ceph scale deploy rook-ceph-osd-<ID> --replicas 0

# Purge it
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd purge <ID> --yes-i-really-mean-it

# Delete auth
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph auth del osd.<ID>

# Remove node from CRUSH map if decommissioning
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd crush remove <nodename>

Cleaning a Disk

Ceph needs clean drives. If you've used the disk before, wipe it:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: disk-clean-rook
  namespace: rook-ceph
spec:
  restartPolicy: Never
  nodeName: <nodename>
  volumes:
  - name: rook-data-dir
    hostPath:
      path: /var/lib/rook
  containers:
  - name: disk-clean
    image: busybox
    securityContext:
      privileged: true
    volumeMounts:
    - name: rook-data-dir
      mountPath: /node/rook-data
    command: ["/bin/sh", "-c", "rm -rf /node/rook-data/*"]
EOF

Wait for it then clean up:

kubectl -n rook-ceph delete pod disk-clean-rook

OSD Won't Join After Moving Disk

Sometimes after moving a disk between nodes, the OSD fails with permission issues. Fix from this GitHub issue:

Debug the OSD pod and grab the keyring:

kubectl -n rook-ceph debug rook-ceph-osd-<ID>-<suffix>
cat /var/lib/ceph/osd/ceph-<ID>/keyring

Create a file locally called osd.export:

[osd.<ID>]
key = <key from keyring>
caps mon = "allow profile osd"
caps mgr = "allow profile osd"
caps osd = "allow *"

Import it:

kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph auth import -i osd.export

Clean up the debug pod

The OSD should join and the cluster should recover.

Browsing PVC Contents

To poke around in a PVC:

task kubernetes:browse-pvc NS=media CLAIM=plex-config

This mounts it in an Alpine container for you to look around.

Maintenance Mode

Before doing storage maintenance, prevent Ceph from rebalancing:

kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd set noout

Do your thing, then unset it:

kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd unset noout
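
It is easy to forget the unset step when the maintenance itself fails halfway. One way to guard against that is a wrapper that always clears the flag. This is only a sketch: ceph_tools and with_noout are hypothetical helper names, and the RUN=echo dry-run switch is my own addition.

```shell
# Run a ceph command through the rook-ceph-tools deployment.
# Set RUN=echo to print the kubectl command instead of executing it (dry run).
ceph_tools() {
  ${RUN:-} kubectl -n rook-ceph exec deploy/rook-ceph-tools -- "$@"
}

# Set noout, run the given command, then always unset noout,
# even when the command fails. Returns the command's exit status.
with_noout() {
  ceph_tools ceph osd set noout || return 1
  "$@"
  rc=$?
  ceph_tools ceph osd unset noout
  return "$rc"
}

# Example:
# with_noout task talos:reboot-node NODE=m0
```

Running it with RUN=echo first shows exactly which kubectl commands would be executed.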

Node Management

How to manage Talos nodes - config, maintenance, and recovery.

Current Setup

Node	Role	Hardware
m0	Control Plane	MS-01 i9-13900H, 96GB RAM, 1TB OS + 2TB Data
m1	Control Plane	MS-01 i9-13900H, 96GB RAM, 1TB OS + 2TB Data
m2	Control Plane	MS-01 i9-13900H, 96GB RAM, 1TB OS + 2TB Data

All three are control plane nodes with workloads scheduled on them.

Applying Config

task talos:apply-node NODE=<node>

Option	Default	What it does
`MODE`	`auto`	Apply mode - `auto` (Talos decides), `reboot` (force reboot), `staged` (apply on next reboot)

Config files are in:

talos/
├── controlplane.yaml          # Base config
├── controlplane/
│   ├── m0.yaml               # Node-specific patches
│   ├── m1.yaml
│   └── m2.yaml
└── schematic.yaml             # Factory schematic for Secure Boot

Rebooting

task talos:reboot-node NODE=<node>

Add MODE=powercycle for a hard reboot if needed.

Shutting Down the Cluster

task talos:shutdown-cluster

To bring it back up, just power on the machines. Talos boots and rejoins automatically.

Regenerating Kubeconfig

If kubeconfig expires or gets messed up:

task talos:kubeconfig

Maintenance Procedure

Before doing maintenance on a node:

Check things are healthy:

kubectl get nodes
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status

Tell Ceph not to rebalance:

kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd set noout

Cordon and drain:

kubectl cordon <node>
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data

Do your maintenance.

Uncordon:

kubectl uncordon <node>

Unset noout:

kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd unset noout

Resetting a Node

If you need to wipe a node and start fresh:

task talos:reset-node NODE=<node>

This destroys everything on the node.

Resetting the Whole Cluster

Nuclear option:

task talos:reset-cluster

Make sure you have backups before doing this.

Adding a New Node

Install Talos ISO, set up Secure Boot (see bootstrap)
Create a node config in talos/controlplane/<new-node>.yaml
Apply config: task talos:apply-node NODE=<new-node>
Watch it join: kubectl get nodes -w
If it has storage, Rook will discover and provision OSDs

Removing a Node

Drain workloads: kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
If it has Ceph OSDs, remove them first (see storage-operations)
Delete from cluster: kubectl delete node <node>
Optionally reset: task talos:reset-node NODE=<node>

Node Shell Access

For low-level debugging:

task kubernetes:node-shell NODE=<node>

This gives you a privileged shell on the node.

Application Deployment

How to deploy apps using Flux.

Repo Structure

kubernetes/
├── flux/                    # Flux config
├── components/             # Reusable components
└── apps/                   # Applications by namespace
    ├── cert-manager/
    ├── database/
    ├── default/
    ├── external-secrets/
    ├── flux-system/
    ├── kube-system/
    ├── media/
    ├── networking/
    ├── observability/
    ├── openebs-system/
    ├── rook-ceph/
    ├── system-upgrade/
    └── volsync-system/

App Structure

Each app follows this pattern:

apps/<namespace>/<app-name>/
├── ks.yaml                 # Flux Kustomization
└── app/
    ├── kustomization.yaml
    ├── helmrelease.yaml
    └── externalsecret.yaml # If needed

Deploying a New App

1. Create the Directory

mkdir -p kubernetes/apps/<namespace>/<app-name>/app

2. Create the Flux Kustomization

ks.yaml:

---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: &app my-app
  namespace: flux-system
spec:
  targetNamespace: default
  commonMetadata:
    labels:
      app.kubernetes.io/name: *app
  path: ./kubernetes/apps/default/my-app/app
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  wait: false
  interval: 30m
  retryInterval: 1m
  timeout: 5m

3. Create the HelmRelease

app/helmrelease.yaml:

---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: my-app
spec:
  interval: 30m
  chart:
    spec:
      chart: my-app
      version: 1.0.0
      sourceRef:
        kind: HelmRepository
        name: some-repo
        namespace: flux-system
  install:
    remediation:
      retries: 3
  upgrade:
    cleanupOnFail: true
    remediation:
      strategy: rollback
      retries: 3
  values:
    # your values here

4. Create the Kustomization

app/kustomization.yaml:

---
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - helmrelease.yaml

5. Add to Namespace

Add your app to kubernetes/apps/<namespace>/kustomization.yaml:

resources:
  - ./my-app/ks.yaml

6. Commit and Push

git add kubernetes/apps/<namespace>/<app-name>
git commit -m "feat: add my-app"
git push

Flux picks it up automatically.
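
Steps 1 and 4 are pure boilerplate, so they can be scripted. A rough sketch, assuming the directory layout shown above; scaffold_app is a hypothetical helper and not an existing task in this repo. ks.yaml and helmrelease.yaml still need to be written by hand.

```shell
# Create the directory layout for a new app plus a minimal app/kustomization.yaml.
# scaffold_app is a hypothetical helper; ks.yaml and helmrelease.yaml are not generated.
# Usage: scaffold_app <repo-root> <namespace> <app-name>
scaffold_app() {
  app_dir="$1/kubernetes/apps/$2/$3/app"
  mkdir -p "$app_dir" || return 1
  cat > "$app_dir/kustomization.yaml" <<'EOF'
---
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - helmrelease.yaml
EOF
  printf 'created %s\n' "$app_dir/kustomization.yaml"
}

# Example, run from the repository root:
# scaffold_app . default my-app
```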

Managing Apps

Force Reconcile

task kubernetes:reconcile

Check Status

flux get hr -n <namespace> <app>
flux get ks <app>
kubectl get pods -n <namespace> -l app.kubernetes.io/name=<app>

Suspend/Resume

flux suspend ks <app>
flux suspend hr -n <namespace> <app>

flux resume ks <app>
flux resume hr -n <namespace> <app>

Adding Storage

For Ceph storage, add a PVC:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-app
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: ceph-block
  resources:
    requests:
      storage: 10Gi

Adding Backups

Create a ReplicationSource for Volsync:

apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: my-app
spec:
  sourcePVC: my-app
  trigger:
    schedule: "0 0 * * *"  # Daily
  restic:
    repository: my-app-restic-secret
    retain:
      daily: 7
      weekly: 4

Ingress

For external access, use Gateway API:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: my-app
spec:
  parentRefs:
    - name: external
      namespace: networking
  hostnames: ["my-app.example.com"]
  rules:
    - backendRefs:
        - name: my-app
          port: 80

For internal only, use internal instead of external.

Renovate

Renovate watches for updates and creates PRs automatically. Just review and merge them:

task github:pr:list
task github:pr:merge ID=<pr>

Secrets Management

How I handle secrets using External Secrets and 1Password.

How It Works

1Password -> 1Password Connect -> External Secrets Operator -> Kubernetes Secrets

All my secrets live in 1Password. The External Secrets Operator pulls them into the cluster automatically.

Adding a New Secret

1. Create in 1Password

Add a new item to 1Password with the fields you need. The item name should match what you want the Kubernetes secret to be called.

2. Create an ExternalSecret

Add this to your app's directory:

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: my-app-secret
  namespace: default
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: onepassword-connect
  target:
    name: my-app-secret
    creationPolicy: Owner
  dataFrom:
    - extract:
        key: my-1password-item

If you only need specific fields:

spec:
  data:
    - secretKey: api-key
      remoteRef:
        key: my-1password-item
        property: api_key

3. Commit and Let Flux Do Its Thing

Commit to git and Flux will create the secret. Or apply manually:

kubectl apply -f externalsecret.yaml

Verify it worked:

kubectl get secret -n <namespace> <name>
kubectl get externalsecret -n <namespace> <name>

Force Sync All Secrets

If you updated something in 1Password and don't want to wait for the refresh interval:

task kubernetes:sync-secrets

Or for a single secret:

kubectl -n <namespace> annotate externalsecret <name> force-sync="$(date +%s)" --overwrite

Troubleshooting

ExternalSecret Shows Error

Check what's wrong:

kubectl describe externalsecret -n <namespace> <name>

Common issues:

Error	Problem	Fix
`item not found`	Item doesn't exist in 1Password	Create it
`field not found`	Requested field missing	Add the field
`connect error`	1Password Connect is down	Check the pod

1Password Connect Issues

# Check pod status
kubectl get pods -n external-secrets -l app=onepassword-connect

# Check logs
kubectl logs -n external-secrets -l app=onepassword-connect

Secret Not Updating

Secrets refresh based on refreshInterval. Force it:

task kubernetes:sync-secrets

Rotating Secrets

Update the value in 1Password
Force sync: kubectl -n <ns> annotate externalsecret <name> force-sync="$(date +%s)" --overwrite
Restart the app to pick up the new value: kubectl rollout restart deployment -n <ns> <deployment>

Emergency Access

If External Secrets is broken and you need a secret NOW:

kubectl create secret generic <name> -n <namespace> --from-literal=<key>=<value>

Just remember this will get overwritten when External Secrets starts working again. Update 1Password if you want changes to stick.

Upgrade Procedures

How to upgrade Talos, Kubernetes, and everything else.

Before You Upgrade

Always a good idea to check things are healthy first:

kubectl get nodes
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status

And maybe take a manual backup of anything critical:

task volsync:snapshot APP=<app> NS=<ns>

Talos Upgrades

Single Node

task talos:upgrade-node NODE=m0 VERSION=v1.9.0

This downloads the Talos version from the factory, applies it with secure boot, and reboots. Times out after 10 minutes.

Rolling Upgrade

For the whole cluster, just do them one at a time and wait for each to come back:

task talos:upgrade-node NODE=m0 VERSION=v1.9.0
# wait for it to rejoin
task talos:upgrade-node NODE=m1 VERSION=v1.9.0
# wait
task talos:upgrade-node NODE=m2 VERSION=v1.9.0

Between each, verify the node is Ready and Ceph is healthy:

kubectl get nodes
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status
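
The rolling upgrade could be scripted roughly as follows. This is a sketch under a few assumptions: the Kubernetes node names match the NODE values (m0, m1, m2), the upgrade task only returns once the node has rebooted, and rolling_upgrade with its RUN=echo dry-run switch is a hypothetical helper rather than something in the Taskfile. It does not check Ceph health; that check would slot in after the kubectl wait line.

```shell
# Upgrade nodes one at a time, stopping at the first failure.
# Set RUN=echo to print the commands instead of running them (dry run).
# Assumes Kubernetes node names match the NODE values passed to the task.
# Usage: rolling_upgrade <version> <node>...
rolling_upgrade() {
  version="$1"
  shift
  for node in "$@"; do
    ${RUN:-} task talos:upgrade-node NODE="$node" VERSION="$version" || return 1
    # Block until the node reports Ready before touching the next one.
    ${RUN:-} kubectl wait --for=condition=Ready "node/$node" --timeout=10m || return 1
  done
}

# Example:
# RUN=echo rolling_upgrade v1.9.0 m0 m1 m2   # preview the commands
# rolling_upgrade v1.9.0 m0 m1 m2            # run for real
```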

Kubernetes Upgrades

task talos:upgrade-k8s

This upgrades Kubernetes across all nodes. The version comes from kubernetes/apps/system-upgrade/tuppr/upgrades/kubernetes.yaml.

Flux and Helm Charts

Renovate handles this automatically - it creates PRs when updates are available. Just review and merge them.

To force a reconcile after merging:

task kubernetes:reconcile

Merge Renovate PRs

# List open PRs
task github:pr:list

# Merge one
task github:pr:merge ID=123

# Merge all of them
task github:pr:merge:all

ARC Upgrade

Actions Runner Controller needs a special upgrade process because of CRD stuff:

task kubernetes:upgrade-arc

This uninstalls the runner and controller, waits a bit, then reconciles them back via Flux.

Rollback

Talos

Talos keeps the previous install around. Reboot and pick the old one from the boot menu:

task talos:reboot-node NODE=<node> MODE=powercycle

Flux/Helm

Just revert the commit and push:

git revert <commit>
git push
task kubernetes:reconcile

If Things Go Wrong

Node Stuck During Upgrade

Check what's happening:

talosctl -n <node> dmesg | tail -100

Force a reboot if needed:

task talos:reboot-node NODE=<node> MODE=powercycle

Can't Connect After Upgrade

Regenerate kubeconfig:

task talos:kubeconfig

Helm Releases Failing

Restart failed releases:

task kubernetes:hr:restart

Disaster Recovery

Notes on backups, restores, and what to do when things go wrong.

Backup Strategy

Persistent Volumes - Volsync backs up to Cloudflare R2 daily using Restic
Cluster State - Everything is in Git, Flux handles the rest
Secrets - Stored in 1Password, pulled in via external-secrets

Volsync Operations

Volsync handles backup/restore for PVCs. There are some assumptions baked in:

Kustomization, HelmRelease, PVC, and ReplicationSource all share the same name
ReplicationSource uses Restic
App is a Deployment or StatefulSet
Single PVC per app

List Snapshots

task volsync:list APP=plex NS=media

Create Manual Snapshot

If you need a backup right now instead of waiting for the daily schedule:

task volsync:snapshot APP=home-assistant NS=default

This waits up to 2 hours for the backup to complete.

Restore from Backup

task volsync:restore APP=plex NS=media PREVIOUS=2

PREVIOUS is how many snapshots back to restore (0 = latest, 1 = one before latest, etc).
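
The offset semantics are easy to get off by one, so here is an illustration of how PREVIOUS maps onto a list of snapshots sorted oldest first. It is not how the task is implemented; pick_snapshot is a hypothetical helper that only demonstrates the counting.

```shell
# Read snapshot IDs from stdin, oldest first, and print the one that is
# $1 snapshots back from the newest (0 = latest). Fails when there are
# not enough snapshots. pick_snapshot is a hypothetical helper.
pick_snapshot() {
  awk -v n="$1" '
    { ids[NR] = $0 }
    END {
      i = NR - n
      if (i < 1) exit 1
      print ids[i]
    }'
}

# Example with three snapshots (mon is the oldest, wed the newest):
# printf 'mon\ntue\nwed\n' | pick_snapshot 0
# printf 'mon\ntue\nwed\n' | pick_snapshot 2
```

With three snapshots, PREVIOUS=0 selects the newest and PREVIOUS=2 the oldest; anything larger has nothing to restore from.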

What happens under the hood:

Suspends Flux kustomization and helmrelease
Scales app to 0 replicas
Waits for pods to terminate
Creates a ReplicationDestination and restores data
Resumes Flux and reconciles
Waits for pods to be ready again

Unlock Stuck Repos

If a backup job got interrupted, the Restic repo might be locked:

task volsync:unlock

Suspend/Resume Volsync

For maintenance:

task volsync:state-suspend
task volsync:state-resume

Recovery Scenarios

App Data Got Corrupted

List snapshots: task volsync:list APP=<app> NS=<ns>
Pick one and restore: task volsync:restore APP=<app> NS=<ns> PREVIOUS=<n>
Verify it's working

Node Died

If recoverable, just reboot it:

task talos:reboot-node NODE=<node>

If it needs a full reinstall:

task talos:reset-node NODE=<node>
task talos:apply-node NODE=<node>

Ceph will rebalance automatically. Check health with:

kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status

Complete Cluster Loss

This is the nuclear option. Hopefully you never need this.

Provision hardware with Talos ISO (see bootstrap)

Apply Talos config to all nodes:

task talos:apply-node NODE=m0
task talos:apply-node NODE=m1
task talos:apply-node NODE=m2

Bootstrap:

task bootstrap:talos
task bootstrap:apps ROOK_DISK=<disk-model>

Flux restores everything from Git
Volsync restores PVC data from R2

Accidentally Deleted Something

Just force a Flux reconcile:

task kubernetes:reconcile

Flux will recreate whatever is missing from Git.

Troubleshooting

Common issues and how to fix them.

Quick Diagnostics

# Node status
kubectl get nodes -o wide

# All pods
kubectl get pods -A

# Recent events
kubectl get events -A --sort-by='.lastTimestamp' | tail -20

# Flux status
flux get all -A | grep -i false

Pods Not Starting

Stuck in Pending

kubectl describe pod -n <namespace> <pod>

Cause	Fix
Not enough resources	Scale down other stuff or add capacity
Node selector doesn't match	Check node labels
PVC not bound	Check storage class and PVC
Taints blocking it	Add tolerations

Stuck in ContainerCreating

kubectl describe pod -n <namespace> <pod>
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod>

Cause	Fix
Image pull failed	Check image name and registry creds
Volume mount failed	Check PVC and CSI driver
Secret not found	Check ExternalSecret synced

CrashLoopBackOff

kubectl logs -n <namespace> <pod> --previous

Usually the app is crashing - check logs for stack traces.

Clean Up Failed Pods

task kubernetes:cleanse-pods

This removes pods in Failed, Pending, Succeeded, Completed, NodeStatusUnknown, or Error states.

Flux Issues

HelmRelease Stuck

flux get hr -A | grep False
kubectl describe helmrelease -n <namespace> <release>

Restart it:

task kubernetes:hr:restart

Or manually:

flux suspend hr -n <namespace> <release>
flux resume hr -n <namespace> <release>

Nothing is Syncing

Force a reconcile:

task kubernetes:reconcile

Node Issues

Node NotReady

kubectl describe node <node>
talosctl -n <node> services
talosctl -n <node> dmesg | tail -50

Cause	Fix
Kubelet not running	`talosctl -n <node> service kubelet restart`
Network issues	Check CNI pods
Disk pressure	Check disk usage

Node Unreachable

Try a reboot:

task talos:reboot-node NODE=<node>

If that doesn't work, power cycle it via IPMI/KVM.

Network Issues

Services Unreachable

# Cilium status
kubectl -n kube-system exec -it ds/cilium -- cilium status

# BGP peers
kubectl -n kube-system exec -it ds/cilium -- cilium bgp peers

# LoadBalancer IPs
kubectl get svc -A | grep LoadBalancer

DNS Broken

# Check CoreDNS
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Test resolution
kubectl run -it --rm debug --image=busybox -- nslookup kubernetes.default

Certificate Issues

kubectl get certificate -A
kubectl describe certificate -n <namespace> <name>
kubectl get certificaterequest -A

Cause	Fix
DNS challenge failing	Check Cloudflare creds
Rate limited	Wait and retry
Invalid domain	Check certificate spec

Debug Tools

Node Shell

task kubernetes:node-shell NODE=<node>

NFS Debug Pod

task kubernetes:nfs-pod NS=<namespace>

Browse a PVC

task kubernetes:browse-pvc NS=<namespace> CLAIM=<pvc-name>

Tail Logs

stern -n <namespace> -l app=<app>

When All Else Fails

Check external monitoring (Healthchecks.io, UptimeRobot)
Check recent git commits - did something change?
Check the component docs (Talos, Flux, Cilium, Rook)
Ask in the Kubernetes @Home Discord
