Kubernetes: kubectl, Pod Troubleshooting & SRE Patterns
Practical Kubernetes notes for SRE/DevOps: kubectl essentials, pod troubleshooting, ConfigMaps, Secrets, resource limits, probes, and rolling updates.
Kubernetes is the de facto runtime for containerised workloads at scale, but its surface area is enormous. These notes focus on the operational slice: the commands and manifest patterns you reach for when deploying, debugging, and maintaining services day-to-day. Theory is kept to a minimum; working YAML and shell commands are kept to a maximum.
# List resources (the most-used command in k8s)
kubectl get pods                   # current namespace
kubectl get pods -n kube-system    # specific namespace
kubectl get pods -A                # all namespaces
kubectl get pods -o wide           # extra columns (node, IP)
kubectl get pods -w                # watch for changes

# Get all common resources at once
kubectl get all -n myapp

# Output as YAML (great for diffing live state vs. your repo)
kubectl get deployment myapp -o yaml

# JSON path query, e.g. get the image of the first container
kubectl get pod myapp-abc123 \
  -o jsonpath='{.spec.containers[0].image}'

# Describe gives events + full spec (essential for debugging)
kubectl describe pod myapp-abc123
kubectl describe node worker-1
kubectl describe service myapp-svc
# Stream logs from a pod
kubectl logs -f myapp-abc123

# If the pod has multiple containers, specify one
kubectl logs -f myapp-abc123 -c sidecar

# Previous container instance (useful after a crash)
kubectl logs myapp-abc123 --previous

# Last 200 lines only
kubectl logs --tail=200 myapp-abc123

# Logs from all pods matching a label selector
kubectl logs -l app=myapp --all-containers=true -f

# Logs since a duration
kubectl logs myapp-abc123 --since=1h
# Interactive shell inside a running pod
kubectl exec -it myapp-abc123 -- bash
kubectl exec -it myapp-abc123 -- sh    # fallback for Alpine/BusyBox images without bash

# Run a one-off command (the pipe and grep run on your machine, not in the pod)
kubectl exec myapp-abc123 -- env | grep APP_

# Multi-container pod: target a specific container
kubectl exec -it myapp-abc123 -c myapp -- bash

# Port-forward a pod's port to localhost (no Ingress needed)
kubectl port-forward pod/myapp-abc123 8080:3000

# Port-forward a service (kubectl picks one backing pod; traffic is not load-balanced)
kubectl port-forward svc/myapp-svc 8080:80

# Port-forward a deployment
kubectl port-forward deployment/myapp 8080:3000
# Apply a manifest (create or update)
kubectl apply -f deployment.yaml
kubectl apply -f ./k8s/                   # all files in a directory
kubectl apply -k ./overlays/production    # Kustomize overlay

# Dry run (server-side: validates against the API)
kubectl apply -f deployment.yaml --dry-run=server

# Show a diff before applying
kubectl diff -f deployment.yaml

# Delete resources
kubectl delete -f deployment.yaml
kubectl delete pod myapp-abc123
kubectl delete pod myapp-abc123 --force --grace-period=0    # hard kill

# Rollout management
kubectl rollout status deployment/myapp
kubectl rollout history deployment/myapp
kubectl rollout undo deployment/myapp                       # roll back one revision
kubectl rollout undo deployment/myapp --to-revision=3
For distroless or minimal images where sh doesn’t exist:
# Ephemeral debug container (beta and enabled by default since Kubernetes 1.23, stable since 1.25)
kubectl debug -it myapp-abc123 \
  --image=busybox:latest \
  --target=myapp

# Or spin up a one-off pod with network access in the same namespace
kubectl run debug --rm -it \
  --image=nicolaka/netshoot \
  --restart=Never \
  -- bash
6. Check node-level issues
kubectl describe node $(kubectl get pod myapp-abc123 -o jsonpath='{.spec.nodeName}')

# Check if the node is under memory/CPU pressure
kubectl top node
kubectl top pod myapp-abc123 --containers
Secrets store sensitive data. By default they are only base64-encoded, not encrypted, so for production use an external secrets manager (Vault, AWS Secrets Manager) or a tool such as Sealed Secrets that keeps only encrypted values in git.
# secret.yaml: never commit real values to git
apiVersion: v1
kind: Secret
metadata:
  name: myapp-secrets
  namespace: myapp
type: Opaque
stringData:    # k8s handles the base64 encoding for you
  DATABASE_URL: "postgres://user:s3cr3t@db:5432/mydb"
  API_KEY: "sk-live-abc123xyz"
# Consume in a Deployment (this fragment is the pod template's spec, i.e. spec.template.spec)
spec:
  containers:
    - name: myapp
      image: myapp:1.0.0
      envFrom:
        - secretRef:
            name: myapp-secrets
      # Or mount as files (useful for TLS certs, SSH keys)
      volumeMounts:
        - name: secret-vol
          mountPath: /app/secrets
          readOnly: true
  volumes:
    - name: secret-vol
      secret:
        secretName: myapp-secrets
        defaultMode: 0400    # read-only for owner only
# Create from literal values (avoids secrets in YAML files)
kubectl create secret generic myapp-secrets \
  --from-literal=DATABASE_URL='postgres://...' \
  --from-literal=API_KEY='sk-live-...'

# Create a TLS secret from cert files
kubectl create secret tls myapp-tls \
  --cert=tls.crt \
  --key=tls.key
kubectl get secret myapp-secrets -o yaml reveals the base64 values — anyone with get secret RBAC permissions can decode them. Always restrict Secret access via RBAC and consider encrypting etcd at rest.
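To make the "encoded, not encrypted" point concrete, here is a minimal shell sketch that round-trips a value through base64 without any key. The value `s3cr3t` is a stand-in borrowed from the example manifest, and the script only imitates what is stored under a Secret's `.data` field; it does not talk to a cluster.

```shell
# Stand-in for a real secret value (assumption for illustration only).
plain='s3cr3t'

# What ends up under .data in the Secret: plain base64, no key involved.
encoded=$(printf '%s' "$plain" | base64)

# Anyone who can read the Secret can reverse it with a single command.
decoded=$(printf '%s' "$encoded" | base64 --decode)

echo "stored:  $encoded"
echo "decoded: $decoded"
```

Against a live cluster the same decode step would typically follow a read such as `kubectl get secret myapp-secrets -o jsonpath='{.data.API_KEY}'`, piped into `base64 --decode`.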
Setting resource requests and limits is one of the most impactful things you can do for cluster stability. Without them, a noisy neighbour pod can starve everything else on the same node.
spec:
  containers:
    - name: myapp
      image: myapp:1.0.0
      resources:
        requests:            # guaranteed allocation, used for scheduling decisions
          cpu: "250m"        # 250 millicores = 0.25 vCPU
          memory: "256Mi"
        limits:              # hard cap: CPU is throttled, memory overuse gets the container OOM-killed
          cpu: "1000m"       # 1 vCPU
          memory: "512Mi"
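requests = what the scheduler reserves on the node. limits = the hard ceiling: CPU usage above the limit is throttled, while memory usage above the limit gets the container OOM-killed. Set requests accurately to get good bin-packing; set limits so that one runaway container cannot exhaust the node's memory and trigger OOM kills in neighbouring pods.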
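As a rough illustration of how requests drive bin-packing, the sketch below works out how many copies of the example pod would fit on a single node. The node capacity (4000m CPU, 8192Mi memory) is a hypothetical figure, and the calculation ignores system reservations and DaemonSet pods, so treat the result as an upper bound rather than what a real scheduler would report.

```shell
# Hypothetical allocatable capacity of one node (assumption, not from the notes).
node_cpu_m=4000       # 4 vCPU expressed in millicores
node_mem_mi=8192      # 8 GiB expressed in Mi

# Per-pod requests and memory limit, taken from the example manifest.
req_cpu_m=250
req_mem_mi=256
limit_mem_mi=512

fit_by_cpu=$(( node_cpu_m / req_cpu_m ))
fit_by_mem=$(( node_mem_mi / req_mem_mi ))

# The scarcer resource decides how many pods can be scheduled.
if [ "$fit_by_cpu" -lt "$fit_by_mem" ]; then
  pods_per_node=$fit_by_cpu
else
  pods_per_node=$fit_by_mem
fi

# Worst case if every pod grows to its memory limit at the same time.
mem_at_limits_mi=$(( pods_per_node * limit_mem_mi ))

echo "fits by CPU: $fit_by_cpu, by memory: $fit_by_mem, schedulable: $pods_per_node"
echo "memory if all pods hit their limit: ${mem_at_limits_mi}Mi of ${node_mem_mi}Mi"
```

With these assumed numbers CPU is the binding constraint, and the sum of memory limits already reaches the node's full capacity, which is why limits well above requests deserve a second look.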
Probes are the mechanism by which Kubernetes knows whether your pod is healthy and ready to receive traffic.
spec:
  containers:
    - name: myapp
      image: myapp:1.0.0
      ports:
        - containerPort: 3000

      # Readiness: pod receives traffic only when this passes
      readinessProbe:
        httpGet:
          path: /healthz/ready
          port: 3000
        initialDelaySeconds: 5     # wait before first probe
        periodSeconds: 10          # probe every 10 seconds
        failureThreshold: 3        # 3 consecutive failures → not ready

      # Liveness: container is restarted when this fails repeatedly
      livenessProbe:
        httpGet:
          path: /healthz/live
          port: 3000
        initialDelaySeconds: 15    # give the app time to start
        periodSeconds: 20
        failureThreshold: 3

      # Startup: holds off liveness and readiness checks until the app has started (slow-start apps)
      startupProbe:
        httpGet:
          path: /healthz/live
          port: 3000
        failureThreshold: 30       # 30 × 10s = 5 minutes to start
        periodSeconds: 10
Use three separate endpoints — /healthz/ready, /healthz/live, and optionally /healthz/startup. The readiness endpoint should check downstream dependencies (DB connectivity, cache warmup). The liveness endpoint should check only internal process health — never external dependencies, or a dependency outage will cause a cascade of unnecessary pod restarts.
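The probe settings translate into time budgets that are worth working out before tuning them. The sketch below multiplies `periodSeconds` by `failureThreshold` for each probe in the manifest above. These are approximations: they ignore `timeoutSeconds` and where in the probe cycle a failure begins, so real detection times can differ by a few seconds.

```shell
# Values copied from the probe manifest above.
startup_period=10;   startup_failures=30
liveness_period=20;  liveness_failures=3
readiness_period=10; readiness_failures=3

# How long a slow-starting app has before the container is restarted.
startup_budget=$(( startup_period * startup_failures ))

# Roughly how long a hung process keeps running before it is restarted.
liveness_window=$(( liveness_period * liveness_failures ))

# Roughly how long a failing pod keeps receiving traffic.
readiness_window=$(( readiness_period * readiness_failures ))

echo "startup budget:   ${startup_budget}s"
echo "liveness window:  ${liveness_window}s"
echo "readiness window: ${readiness_window}s"
```

If the readiness window is longer than your clients tolerate errors, lowering `periodSeconds` on the readiness probe is usually a gentler change than lowering `failureThreshold`, since a single slow response is then less likely to pull the pod out of rotation.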
Kubernetes Deployments perform rolling updates by default. These fields control the rollout behaviour:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # at most 1 extra pod above desired count during rollout
      maxUnavailable: 0    # never go below desired count (zero-downtime)
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      # Graceful shutdown: give the pod time to finish in-flight requests
      terminationGracePeriodSeconds: 30
      containers:
        - name: myapp
          image: myapp:1.1.0
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 5"]    # drain before SIGTERM
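To see what `maxSurge` and `maxUnavailable` mean in pod counts, the following sketch computes the bounds for the Deployment above, then repeats the calculation for percentage values. Kubernetes defaults both fields to 25%, rounding `maxSurge` up and `maxUnavailable` down; the replica count of 10 in the second half is a hypothetical chosen so that the rounding is visible.

```shell
# Absolute values, as in the Deployment above.
replicas=4
max_surge=1
max_unavailable=0

max_total_pods=$(( replicas + max_surge ))              # upper bound during the rollout
min_available_pods=$(( replicas - max_unavailable ))    # lower bound on available pods

echo "replicas=$replicas: at most $max_total_pods pods, at least $min_available_pods available"

# Percentage values with a hypothetical replica count.
pct_replicas=10
pct=25

# maxSurge rounds up, maxUnavailable rounds down.
surge_from_pct=$(( (pct_replicas * pct + 99) / 100 ))
unavailable_from_pct=$(( pct_replicas * pct / 100 ))

echo "replicas=$pct_replicas at ${pct}%: surge=$surge_from_pct, unavailable=$unavailable_from_pct"
```

With `maxUnavailable: 0` the rollout can only progress as fast as new pods become ready, so the node pool needs spare capacity for at least `maxSurge` extra pods.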
# Trigger a rollout by updating the image
kubectl set image deployment/myapp myapp=myapp:1.2.0

# Watch the rollout progress
kubectl rollout status deployment/myapp --timeout=5m

# Pause a rollout mid-way (canary-style)
kubectl rollout pause deployment/myapp

# Resume
kubectl rollout resume deployment/myapp

# Undo the last rollout
kubectl rollout undo deployment/myapp

# Undo to a specific revision
kubectl rollout history deployment/myapp
kubectl rollout undo deployment/myapp --to-revision=2
# Get all pods that are NOT running
kubectl get pods -A --field-selector='status.phase!=Running'

# Force-delete a stuck terminating pod
kubectl delete pod myapp-abc123 --force --grace-period=0

# Scale a deployment
kubectl scale deployment myapp --replicas=6

# Restart all pods in a deployment (zero-downtime rolling restart)
kubectl rollout restart deployment/myapp

# Copy a file from a pod to localhost
kubectl cp myapp-abc123:/app/logs/app.log ./app.log

# Get resource usage for all pods, sorted by memory
kubectl top pod -A --sort-by=memory

# List all images running in the cluster
kubectl get pods -A \
  -o jsonpath='{range .items[*]}{.spec.containers[*].image}{"\n"}{end}' \
  | sort -u

# Find pods with a specific label
kubectl get pods -l app=myapp,env=production

# Add a label to a pod (temporary; use manifests for permanent changes)
kubectl label pod myapp-abc123 debug=true

# Taint a node to prevent new scheduling
kubectl taint nodes worker-3 maintenance=true:NoSchedule

# Remove the taint
kubectl taint nodes worker-3 maintenance=true:NoSchedule-