10 Real Mistakes to Avoid in Kubernetes Deployment
Avoid costly outages and security breaches. Learn the 10 most common and dangerous Kubernetes deployment mistakes real teams make in production, covering resource limits, liveness probes, storage, networking, RBAC, upgrades, secrets, scaling, logging, and rollout strategies, and exactly how to fix them.
Introduction
Preventable Kubernetes-related outages still make headlines on a regular basis. The platform is incredibly powerful, but it is also unforgiving: small misconfigurations cascade into million-dollar incidents.
After helping dozens of enterprises and startups fix their broken clusters, we keep seeing the same 10 frequent, expensive, and completely avoidable mistakes in production. Here they are, along with how to never make them.
1. Not Setting Resource Requests and Limits
Running pods without requests and limits is the #1 cause of node OOM kills and “noisy neighbor” problems.
- Always set both request and limit (limit ≥ request)
- Use Vertical Pod Autoscaler in recommendation mode first
- Enable LimitRange in namespaces to enforce defaults
- Watch for pods stuck in Pending because of insufficient CPU/memory
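As a reference point, here is a minimal sketch of a Deployment with both values set, plus a namespace LimitRange that fills in defaults; the image, namespace, and figures are placeholders, not sizing recommendations:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api                        # hypothetical workload name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: registry.example.com/api:1.2.3   # placeholder image
          resources:
            requests:              # what the scheduler reserves on a node
              cpu: 250m
              memory: 256Mi
            limits:                # hard ceiling; keep limit >= request
              cpu: 500m
              memory: 512Mi
---
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults
  namespace: team-a                # hypothetical namespace
spec:
  limits:
    - type: Container
      defaultRequest:              # applied when a container omits requests
        cpu: 100m
        memory: 128Mi
      default:                     # applied when a container omits limits
        cpu: 250m
        memory: 256Mi
```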
2. Wrong or Missing Liveness & Readiness Probes
Bad probes cause endless restart loops or traffic to dead pods.
- Liveness probe too aggressive → restart storm
- Readiness probe missing → traffic sent to unready pods
- Use startupProbe for slow-starting apps (Java, .NET)
- Always set initialDelaySeconds and reasonable thresholds
- Test probes locally with kubectl port-forward
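A minimal Pod sketch with all three probes; the paths, port, and timings are placeholders that need to match your application's real endpoints and startup behavior:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo                 # hypothetical name
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0.0   # placeholder image
      ports:
        - containerPort: 8080
      startupProbe:                # gives slow starters time before liveness kicks in
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 10
        failureThreshold: 30       # up to 30 x 10s = 5 minutes to start
      livenessProbe:               # restarts the container only if it truly hangs
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 10
        failureThreshold: 3
      readinessProbe:              # gates Service traffic to this pod
        httpGet:
          path: /ready
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 5
        failureThreshold: 3
```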
3. Using hostPath or emptyDir for Persistent Data
Many teams use hostPath in dev and lose data after node reboot or pod rescheduling.
Use proper PersistentVolumes (EBS, Azure Disk, NFS, Ceph) or StatefulSet with volumeClaimTemplates. Never rely on local storage for anything important.
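For comparison, a minimal StatefulSet sketch with volumeClaimTemplates; the image, storage class, and size are placeholders, and the named StorageClass is assumed to exist in your cluster:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db                         # hypothetical name
spec:
  serviceName: db
  replicas: 1
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      containers:
        - name: db
          image: postgres:16       # example image
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:            # one PVC per replica, survives rescheduling
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp3      # assumed StorageClass; use whatever your cluster provides
        resources:
          requests:
            storage: 20Gi
```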
4. Overly Permissive RBAC and Network Policies
- Default setups allow any pod to talk to any pod
- ServiceAccounts with cluster-admin still common
- Fix: Use least-privilege RBAC + NetworkPolicy by default
- Tools: Kyverno or OPA Gatekeeper for policy enforcement
- Enable Pod Security Standards (restricted profile)
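A starting-point sketch: a default-deny ingress NetworkPolicy for a namespace plus a narrowly scoped Role bound to a ServiceAccount instead of cluster-admin; the namespace, ServiceAccount, and resource list are placeholders:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: team-a                # hypothetical namespace
spec:
  podSelector: {}                  # selects every pod in the namespace
  policyTypes:
    - Ingress                      # no ingress allowed until an explicit allow policy exists
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: team-a
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]   # read-only, nothing more
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader-binding
  namespace: team-a
subjects:
  - kind: ServiceAccount
    name: app-sa                   # hypothetical ServiceAccount
    namespace: team-a
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```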
5. Storing Secrets in Plain Text or ConfigMaps
Never put passwords, tokens, or certificates in ConfigMaps or YAML files.
- Use Kubernetes Secrets (base64 is encoding, not encryption, but still better than plain text)
- Better: external secret managers and operators (AWS Secrets Manager, HashiCorp Vault, Sealed Secrets, External Secrets Operator)
- Enable encryption at rest for etcd
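A sketch of the in-cluster consumption side only: a Secret created via stringData and read through secretKeyRef rather than a ConfigMap. In practice the Secret itself should be produced by one of the external managers above rather than committed to Git; names and values here are placeholders.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: db-credentials             # hypothetical name
type: Opaque
stringData:                        # stringData spares you manual base64 encoding
  DB_PASSWORD: change-me           # placeholder; inject real values out of band
---
apiVersion: v1
kind: Pod
metadata:
  name: secret-demo                # hypothetical name
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0.0   # placeholder image
      env:
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:          # pulled from the Secret, never from a ConfigMap
              name: db-credentials
              key: DB_PASSWORD
```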
6. Skipping RollingUpdate Strategy or Setting maxSurge/maxUnavailable Badly
The defaults (25% surge, 25% unavailable) are fine, but teams often set maxSurge=0 together with a high maxUnavailable (or maxUnavailable=100%), which tears pods down faster than replacements come up and causes downtime.
Recommended for production:
- strategy: RollingUpdate
- maxSurge: 25%
- maxUnavailable: 25%
- Use readiness probes so traffic only goes to healthy pods
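Put together, a production-leaning sketch might look like this; the names, image, and probe path are placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                        # hypothetical name
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%                # extra pods allowed above the desired count
      maxUnavailable: 25%          # pods that may be down during the rollout
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: registry.example.com/web:2.0.0   # placeholder image
          readinessProbe:          # traffic shifts only to pods that pass this
            httpGet:
              path: /ready
              port: 8080
```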
7. Misconfiguring the Horizontal Pod Autoscaler (HPA)
Teams enable HPA but forget to install metrics-server or wire up the custom/external metrics it targets, so pods never scale.
- Use metrics-server or Prometheus Adapter
- Set meaningful CPU/memory targets (70–80% typical)
- Set behavior.scaleDown.stabilizationWindowSeconds
- Test scaling with load generators
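A working autoscaling/v2 sketch targeting an existing Deployment; the replica bounds, 75% target, and stabilization window are placeholders to tune per workload:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa                    # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                      # assumes this Deployment exists
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 75   # within the typical 70-80% band
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5 minutes before scaling back down
```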
8. Running Without Proper Logging, Monitoring & Alerting
“It works on my cluster” syndrome. In production you need:
- Centralized logs (Loki, ELK, CloudWatch)
- Metrics (Prometheus + Grafana)
- Distributed tracing (Jaeger/Tempo)
- Alerts on 5xx, pod restarts, OOMKilled
- SNS + PagerDuty integration
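As one possible shape for the alerting piece, here is a PrometheusRule sketch, assuming the Prometheus Operator and kube-state-metrics are installed; alert names, thresholds, and severities are illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: workload-alerts            # hypothetical name
  namespace: monitoring            # assumed monitoring namespace
spec:
  groups:
    - name: pods
      rules:
        - alert: PodRestartingFrequently
          expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting frequently"
        - alert: ContainerOOMKilled
          expr: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Container {{ $labels.container }} in {{ $labels.namespace }}/{{ $labels.pod }} was OOMKilled"
```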
9. Upgrading Kubernetes Without Testing in Staging
Skipping version-skew testing causes API deprecations and broken workloads.
Always:
- Maintain a staging cluster one minor version ahead
- Run kubectl convert and deprecation scanners such as Pluto or kube-no-trouble (kubent)
- Test critical workloads before upgrading production
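A typical example of what those scanners flag: an Ingress written for networking.k8s.io/v1beta1, which was removed in Kubernetes 1.22, rewritten against networking.k8s.io/v1; the host, service name, and port are placeholders:

```yaml
apiVersion: networking.k8s.io/v1   # v1beta1 versions of Ingress were removed in 1.22
kind: Ingress
metadata:
  name: web                        # hypothetical name
spec:
  rules:
    - host: app.example.com        # placeholder host
      http:
        paths:
          - path: /
            pathType: Prefix       # required in v1; older API versions let you omit it
            backend:
              service:             # v1beta1 used serviceName/servicePort here instead
                name: web
                port:
                  number: 80
```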
10. Not Using GitOps (or Using It Wrong)
Manual kubectl apply in production is a disaster waiting to happen.
- Use ArgoCD, Flux v2, or Jenkins X
- Store all manifests in Git
- Enable sync policies and health checks
- Never allow direct cluster changes
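For illustration, an Argo CD Application sketch that syncs a Git path into a namespace with automated pruning and self-healing; the repository URL, path, and names are placeholders:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: web                        # hypothetical name
  namespace: argocd                # assumes Argo CD is installed in this namespace
spec:
  project: default
  source:
    repoURL: https://github.com/example/deploy-manifests.git   # placeholder repository
    targetRevision: main
    path: apps/web
  destination:
    server: https://kubernetes.default.svc
    namespace: web
  syncPolicy:
    automated:
      prune: true                  # delete resources that disappear from Git
      selfHeal: true               # revert manual drift back to the Git state
    syncOptions:
      - CreateNamespace=true
```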
Quick Kubernetes Production Checklist
| Item | Must Have | Common Mistake |
|---|---|---|
| Resource requests/limits | Yes | Missing → OOM kills |
| Liveness + Readiness probes | Yes | Restart loops |
| RBAC + NetworkPolicy | Yes | Open cluster |
| GitOps | Yes | Manual changes |
| Observability stack | Yes | Blind operations |
Conclusion
Kubernetes gives you incredible power, but with great power comes great responsibility. The ten mistakes above account for more than 80% of production incidents we see. Fix them early — ideally via policy-as-code and automated checks in CI — and your cluster will be stable, secure, and scalable. Print the checklist, add it to your pull-request template, and sleep better at night.
Frequently Asked Questions
Is it safe to run StatefulSet without PVC?
No. Always use PersistentVolumeClaims. Local volumes disappear on node failure.
Should limits equal requests?
For predictable workloads yes (guaranteed QoS). For bursty workloads, allow some headroom.
Can I skip NetworkPolicy?
Only in isolated dev clusters. In production, default-deny + explicit allow is mandatory.
How often should I upgrade Kubernetes?
Every 3–6 months. Never skip more than one minor version.
Is ArgoCD or Flux better?
Both excellent. ArgoCD has better UI; Flux is lighter and Kustomize-native.