Troubleshooting¶

Flux Issues¶

Resources Not Syncing¶

# Check Flux health
flux check

# Check for failed reconciliations
flux get ks -A --status-selector ready=false
flux get hr -A --status-selector ready=false

# View Flux logs
flux logs --all-namespaces

# Force sync
task reconcile
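The two failure queries above can be combined into one pass. This is a minimal sketch; `flux_failures` is a hypothetical helper name, not part of the Flux CLI, and it only wraps the commands already shown on this page.

```shell
# List every Kustomization (ks) and HelmRelease (hr) that is not Ready.
# flux_failures is a hypothetical helper name, not a Flux command.
flux_failures() {
  for kind in ks hr; do
    echo "== ${kind} =="
    flux get "$kind" -A --status-selector ready=false
  done
}
```

Run `flux_failures` after sourcing the function. A kind with nothing listed under its header has no failed reconciliations.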

HelmRelease Stuck¶

# Suspend and resume
flux suspend hr <name> -n <namespace>
flux resume hr <name> -n <namespace>

# Force reconciliation
flux reconcile hr <name> -n <namespace> --force
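The suspend and resume pair can be wrapped so that the resume only runs if the suspend succeeded. A small sketch, assuming the release name and namespace are passed as arguments; `hr_bounce` is a hypothetical helper name.

```shell
# Suspend then resume a HelmRelease; stop if the suspend fails.
# hr_bounce is a hypothetical helper name, not a Flux command.
hr_bounce() {
  name=$1
  namespace=$2
  flux suspend hr "$name" -n "$namespace" &&
    flux resume hr "$name" -n "$namespace"
}
```

Usage: `hr_bounce <name> <namespace>`.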

Template Issues¶

Templates Not Rendering¶

task template:validate-schemas    # Check cluster.yaml & nodes.yaml
task template:render-configs      # Force re-render

Secret Issues¶

Secrets Not Decrypting¶

# Verify age key exists
test -f age.key && echo "Key exists" || echo "Missing key"

# Verify SOPS can decrypt
sops --decrypt bootstrap/sops-age.sops.yaml

# Check SOPS_AGE_KEY_FILE is set (empty output means it is not)
echo "$SOPS_AGE_KEY_FILE"
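The environment check above can be made stricter by also confirming that the variable points at a readable file. This is a sketch under that assumption; `check_age_key` is a hypothetical helper name.

```shell
# Report whether SOPS_AGE_KEY_FILE is set and points at a readable file.
# check_age_key is a hypothetical helper name.
check_age_key() {
  if [ -z "${SOPS_AGE_KEY_FILE:-}" ]; then
    echo "SOPS_AGE_KEY_FILE is not set"
    return 1
  fi
  if [ ! -r "$SOPS_AGE_KEY_FILE" ]; then
    echo "SOPS_AGE_KEY_FILE points at a missing or unreadable file: $SOPS_AGE_KEY_FILE"
    return 1
  fi
  echo "Using age key: $SOPS_AGE_KEY_FILE"
}
```

The function returns non-zero on either failure, so it can gate a decrypt step: `check_age_key && sops --decrypt bootstrap/sops-age.sops.yaml`.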

Verifying Encryption¶

# All .sops.yaml files should contain 'sops:' metadata.
# In bash, '**' only recurses after: shopt -s globstar
grep -l "sops:" kubernetes/**/*.sops.yaml
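The inverse check is often more useful: list the files that are missing the metadata. The sketch below assumes the `sops:` key sits at the top level of each encrypted YAML file and that secrets live under `kubernetes/` as on this page; `find_unencrypted` is a hypothetical helper name.

```shell
# Print every *.sops.yaml file under a directory that lacks a top-level
# 'sops:' key, and return non-zero if any are found.
# find_unencrypted is a hypothetical helper name.
find_unencrypted() {
  dir=${1:-kubernetes}
  # grep -L prints the names of files with no matching line.
  missing=$(find "$dir" -type f -name '*.sops.yaml' -exec grep -L '^sops:' {} +)
  if [ -n "$missing" ]; then
    printf '%s\n' "$missing"
    return 1
  fi
}
```

Using `find` avoids relying on shell-specific `**` expansion. No output and a zero exit status means every matching file carries SOPS metadata.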

Node Issues¶

Nodes Not Joining¶

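# Discovered cluster members (--insecure targets a node still in maintenance mode)
talosctl get members --nodes <ip> --insecure
# Service logs; talosctl logs expects a service ID such as kubelet
talosctl logs kubelet --nodes <ip>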

Node Health¶

kubectl get nodes -o wide
kubectl describe node <node-name>
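To narrow the node list to the ones that need attention, the STATUS column can be filtered. A minimal sketch that assumes the default `kubectl get nodes` column order (NAME, STATUS, ...); `unhealthy_nodes` is a hypothetical helper name.

```shell
# Print nodes whose STATUS column is anything other than plain "Ready".
# This includes NotReady nodes and cordoned nodes (Ready,SchedulingDisabled).
# unhealthy_nodes is a hypothetical helper name.
unhealthy_nodes() {
  kubectl get nodes --no-headers | awk '$2 != "Ready" { print $1, $2 }'
}
```

Feed any name it prints into `kubectl describe node <node-name>` to see conditions and recent events.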

Pod Issues¶

General Debugging¶

# List pods in namespace
kubectl -n <namespace> get pods -o wide

# Check pod logs
kubectl -n <namespace> logs <pod-name> -f

# Describe pod for events
kubectl -n <namespace> describe pod <pod-name>

# Check namespace events
kubectl -n <namespace> get events --sort-by='.metadata.creationTimestamp'

CrashLoopBackOff¶

Check logs: kubectl -n <ns> logs <pod> --previous
Check resource limits: kubectl -n <ns> describe pod <pod>
Check whether the pod depends on NFS; if so, add the NFS-scaler component to the app
Check if secret is missing: kubectl -n <ns> get secrets
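Before reading through the full describe output, it can help to pull only the last termination reason for each container. The sketch below uses a standard kubectl JSONPath query; `last_exit` is a hypothetical helper name. A reason of `OOMKilled` usually points at the memory limit rather than at the application itself.

```shell
# Print name, last termination reason, and exit code for each container
# in a pod, tab-separated, one container per line.
# last_exit is a hypothetical helper name.
last_exit() {
  namespace=$1
  pod=$2
  kubectl -n "$namespace" get pod "$pod" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}
```

Usage: `last_exit <ns> <pod>`. Empty reason and exit code fields mean the container has not terminated before.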

Pending Pods¶

Check events: kubectl -n <ns> describe pod <pod>
Check node resources: kubectl top nodes
Check PVC binding: kubectl -n <ns> get pvc
Check node affinity and taints: kubectl describe node <node-name>
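Two queries cover most of the checks above in one view: which pods are Pending cluster-wide, and which taints each node carries. A sketch using standard kubectl selectors and output formats; `pending_pods` and `node_taints` are hypothetical helper names.

```shell
# List Pending pods across all namespaces.
# pending_pods is a hypothetical helper name.
pending_pods() {
  kubectl get pods -A --field-selector=status.phase=Pending
}

# Show each node's taint keys and effects in a compact table.
# node_taints is a hypothetical helper name.
node_taints() {
  kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key,EFFECTS:.spec.taints[*].effect'
}
```

A pod stays Pending if it does not tolerate the taints on every node that would otherwise fit it, so compare the `node_taints` output with the tolerations in the pod spec.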

Network Issues¶

Cilium¶

cilium status
cilium connectivity test

DNS¶

# Test cluster DNS (busybox is pinned to 1.28 because nslookup in newer images is unreliable)
kubectl run -it --rm debug --image=busybox:1.28 --restart=Never -- nslookup kubernetes.default

# Test external DNS resolution
dig @<k8s-gateway-ip> <app>.<domain>

Storage Issues¶

NFS Unavailable¶

If NFS is down, pods using NFS volumes will crash-loop. The NFS-scaler component handles this automatically for apps that include it.

Check NFS availability:

kubectl -n observability get prometheusrule -l app=blackbox-exporter

PVC Issues¶

kubectl get pvc -A
kubectl describe pvc <name> -n <namespace>

Reset Cluster¶

Danger

This destroys everything on the cluster. Use it only as a last resort.

task talos:reset

After reset, re-bootstrap:

task bootstrap:talos
task bootstrap:apps
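Because the reset is destructive, a typed confirmation in front of it can prevent an accidental run. This is a sketch only: `guarded_reset` is a hypothetical helper name, and it wraps the `task talos:reset` command from this page without changing what that task does.

```shell
# Ask for a typed confirmation before running the destructive reset.
# guarded_reset is a hypothetical helper name.
guarded_reset() {
  printf 'Type "reset" to wipe the cluster: '
  read -r answer || answer=""
  if [ "$answer" != "reset" ]; then
    echo "Aborted."
    return 1
  fi
  task talos:reset
}
```

The bootstrap tasks are left as separate manual steps so they only run once the reset has been confirmed complete.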