Troubleshooting

Flux Issues

Resources Not Syncing

# Check Flux health
flux check

# Check for failed reconciliations
flux get ks -A --status-selector ready=false
flux get hr -A --status-selector ready=false

# View Flux logs
flux logs --all-namespaces

# Force sync
task reconcile

HelmRelease Stuck

# Suspend and resume
flux suspend hr <name> -n <namespace>
flux resume hr <name> -n <namespace>

# Force reconciliation
flux reconcile hr <name> -n <namespace> --force

Template Issues

Templates Not Rendering

task template:validate-schemas    # Check cluster.yaml & nodes.yaml
task template:render-configs      # Force re-render

Secret Issues

Secrets Not Decrypting

# Verify age key exists
test -f age.key && echo "Key exists" || echo "Missing key"

# Verify SOPS can decrypt
sops --decrypt bootstrap/sops-age.sops.yaml

# Check SOPS_AGE_KEY_FILE is set
echo "$SOPS_AGE_KEY_FILE"
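
The checks on the key file can be folded into one helper that fails loudly. This is a minimal sketch, not part of the repository: the function name `check_age_key` is hypothetical, and it only tests that `SOPS_AGE_KEY_FILE` is set and points to a readable file, not that the key can decrypt anything.

```shell
# Hypothetical helper: report whether SOPS_AGE_KEY_FILE is usable.
# It does not validate the key contents, only that the file can be read.
check_age_key() {
  local key_file="${SOPS_AGE_KEY_FILE:-}"
  if [ -z "$key_file" ]; then
    echo "SOPS_AGE_KEY_FILE is not set"
    return 1
  fi
  if [ ! -r "$key_file" ]; then
    echo "SOPS_AGE_KEY_FILE points to an unreadable file: $key_file"
    return 1
  fi
  echo "Using age key: $key_file"
}
```

Run `check_age_key && sops --decrypt bootstrap/sops-age.sops.yaml` to skip the decrypt attempt when the environment is clearly misconfigured.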

Verifying Encryption

# All .sops.yaml files should contain 'sops:' metadata
# (bash only expands ** recursively after `shopt -s globstar`)
grep -l "sops:" kubernetes/**/*.sops.yaml
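
The inverse check is often more useful: list the files that are missing the metadata. The sketch below assumes SOPS writes a top-level `sops:` key at the start of a line in YAML output. The function name `find_unencrypted` and the default directory argument are illustrative, not part of the repository's tooling.

```shell
# Hypothetical helper: print every *.sops.yaml under a directory that has no
# top-level 'sops:' key (i.e. looks unencrypted) and return non-zero if any
# such file exists. Uses find instead of ** so it does not depend on globstar.
find_unencrypted() {
  local dir="${1:-kubernetes}"
  local missing
  # grep -L lists files WITHOUT a match; its exit status is ignored here
  # because only the list of file names matters.
  missing=$(find "$dir" -type f -name '*.sops.yaml' -exec grep -L '^sops:' {} + || true)
  if [ -n "$missing" ]; then
    printf '%s\n' "$missing"
    return 1
  fi
  return 0
}
```

No output and a zero exit status mean every matching file carries SOPS metadata, which makes the function usable as a pre-commit or CI gate.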

Node Issues

Nodes Not Joining

talosctl get members --nodes <ip> --insecure
talosctl logs kubelet --nodes <ip>    # logs needs a service name (kubelet as an example) and a valid talosconfig

Node Health

kubectl get nodes -o wide
kubectl describe node <node-name>

Pod Issues

General Debugging

# List pods in namespace
kubectl -n <namespace> get pods -o wide

# Check pod logs
kubectl -n <namespace> logs <pod-name> -f

# Describe pod for events
kubectl -n <namespace> describe pod <pod-name>

# Check namespace events
kubectl -n <namespace> get events --sort-by='.metadata.creationTimestamp'
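
To find which pods need the commands above, the output of `kubectl get pods -A --no-headers` can be filtered down to pods that are neither Running nor Completed. This is a rough sketch: the function name `unhealthy_pods` is hypothetical, and it assumes the default column order of `-A` output (NAMESPACE, NAME, READY, STATUS, ...). It will not catch a pod that is Running but not Ready.

```shell
# Hypothetical filter: read `kubectl get pods -A --no-headers` output on stdin
# and print "<namespace>/<pod> <status>" for every pod whose STATUS column
# (field 4) is neither Running nor Completed.
unhealthy_pods() {
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2 " " $4 }'
}

# Usage (needs cluster access, so it is not executed here):
#   kubectl get pods -A --no-headers | unhealthy_pods
```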

CrashLoopBackOff

  1. Check logs: kubectl -n <ns> logs <pod> --previous
  2. Check resource limits: kubectl -n <ns> describe pod <pod>
  3. Check whether the app depends on NFS; if so, add the NFS-scaler component
  4. Check whether a secret is missing: kubectl -n <ns> get secrets

Pending Pods

  1. Check events: kubectl -n <ns> describe pod <pod>
  2. Check node resources: kubectl top nodes
  3. Check PVC binding: kubectl -n <ns> get pvc
  4. Check node affinity/taints

Network Issues

Cilium

cilium status
cilium connectivity test

DNS

# Test cluster DNS
kubectl run -it --rm debug --image=busybox -- nslookup kubernetes.default

# Test external DNS resolution
dig @<k8s-gateway-ip> <app>.<domain>

Storage Issues

NFS Unavailable

If NFS is down, pods using NFS volumes will crash-loop. The NFS-scaler component handles this automatically for apps that include it.

Check NFS availability:

kubectl -n observability get prometheusrule -l app=blackbox-exporter

PVC Issues

kubectl get pvc -A
kubectl describe pvc <name> -n <namespace>

Reset Cluster

Danger

This destroys everything. Use only as a last resort.

task talos:reset
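
Because the reset is irreversible, it may be worth putting a typed confirmation in front of it. The guard below is a sketch only: `confirm_reset` is a hypothetical name, and nothing like it is assumed to exist in the repository's Taskfile.

```shell
# Hypothetical guard: succeed only when the operator types "reset" exactly.
# The prompt goes to stderr so it does not mix with piped output.
confirm_reset() {
  local answer
  printf 'This destroys everything. Type "reset" to continue: ' >&2
  IFS= read -r answer || return 1
  [ "$answer" = "reset" ]
}

# Usage (destructive, so it is not executed here):
#   confirm_reset && task talos:reset
```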

After reset, re-bootstrap:

task bootstrap:talos
task bootstrap:apps