10 Real Mistakes to Avoid in Kubernetes Deployment

Avoid costly outages and security breaches. Learn the 10 most common and dangerous Kubernetes deployment mistakes real teams make in production (resource limits, liveness probes, storage, networking, RBAC, secrets, rollout strategies, scaling, observability, and upgrades) and exactly how to fix them.

Dec 8, 2025 - 17:34

Introduction

Every week in 2025, at least one company makes headlines because of a Kubernetes-related outage that could have been prevented. The platform is incredibly powerful, but it is also unforgiving. Small misconfigurations cascade into million-dollar incidents.

After helping dozens of enterprises and startups fix their broken clusters, we keep seeing the same 10 mistakes in production: frequent, expensive, and completely avoidable. Here they are, along with how to never make them again.

1. Not Setting Resource Requests and Limits

Running pods without requests and limits is the #1 cause of node OOM kills and “noisy neighbor” problems.

  • Always set both requests and limits (limit ≥ request); see the sketch after this list
  • Use Vertical Pod Autoscaler in recommendation mode first
  • Enable LimitRange in namespaces to enforce defaults
  • Watch for pods stuck in Pending because of insufficient CPU/memory
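
A minimal Deployment sketch with both values set; the name, image, and numbers are illustrative and should be tuned per workload (VPA recommendations are a good starting point):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-api                                       # illustrative name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-api
  template:
    metadata:
      labels:
        app: web-api
    spec:
      containers:
        - name: web-api
          image: registry.example.com/web-api:1.4.2   # placeholder image
          resources:
            requests:          # what the scheduler reserves on a node
              cpu: 250m
              memory: 256Mi
            limits:            # hard ceiling: CPU is throttled, memory overuse is OOM-killed
              cpu: 500m
              memory: 512Mi
```

A LimitRange in the namespace can apply comparable defaults to any pod that omits the resources block entirely.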

2. Wrong or Missing Liveness & Readiness Probes

Bad probes cause endless restart loops or traffic to dead pods.

  • Liveness probe too aggressive → restart storm
  • Readiness probe missing → traffic sent to unready pods
  • Use startupProbe for slow-starting apps (Java, .NET)
  • Always set initialDelaySeconds and reasonable failure thresholds (example configuration after this list)
  • Test probes locally with kubectl port-forward
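
A hedged sketch for a slow-starting JVM service; the paths, port, and timings are assumptions to adapt, not universal values:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: orders-probe-demo                       # illustrative name
spec:
  containers:
    - name: orders-service
      image: registry.example.com/orders:2.1.0  # placeholder image
      ports:
        - containerPort: 8080
      startupProbe:                 # gives the app up to ~150 s to boot before liveness applies
        httpGet:
          path: /healthz
          port: 8080
        failureThreshold: 30
        periodSeconds: 5
      readinessProbe:               # gates Service traffic; failing it does not restart the pod
        httpGet:
          path: /ready
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 10
        failureThreshold: 3
      livenessProbe:                # restarts only genuinely hung containers
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 20
        failureThreshold: 3
```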

3. Using hostPath or emptyDir for Persistent Data

Many teams use hostPath or emptyDir in dev and then lose data as soon as the pod is rescheduled onto another node or the node itself is replaced.

Use proper PersistentVolumes (EBS, Azure Disk, NFS, Ceph) through PersistentVolumeClaims, or a StatefulSet with volumeClaimTemplates, as sketched below. Never rely on node-local storage for anything important.
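
A sketch of durable storage via volumeClaimTemplates; the database, storage class, and size are assumptions for illustration:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres                  # illustrative name
spec:
  serviceName: postgres           # assumes a matching headless Service exists
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:16
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:           # one PVC per replica; data survives pod rescheduling
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp3     # assumes an EBS-backed StorageClass of this name exists
        resources:
          requests:
            storage: 20Gi
```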

4. Overly Permissive RBAC and Network Policies

  • Default setups allow any pod to talk to any pod
  • ServiceAccounts bound to cluster-admin are still common
  • Fix: least-privilege RBAC plus a default-deny NetworkPolicy in every namespace (example after this list)
  • Tools: Kyverno or OPA Gatekeeper for policy enforcement
  • Enable Pod Security Standards (restricted profile)
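
A minimal default-deny policy plus one explicit allow rule; the namespace and labels are placeholders, and enforcement assumes a CNI that implements NetworkPolicy (Calico, Cilium, etc.):

```yaml
# Deny all ingress and egress for every pod in the namespace;
# remember to re-allow DNS egress and other required flows explicitly.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production           # illustrative namespace
spec:
  podSelector: {}                 # selects every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
---
# Example allow rule: let the frontend reach the API on port 8080.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api                    # illustrative labels
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
```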

5. Storing Secrets in Plain Text or ConfigMaps

Never put passwords, tokens, or certificates in ConfigMaps or YAML files.

  • Use Kubernetes Secrets (base64 is encoding, not encryption, but still better than plain text)
  • Better: external secrets managers such as AWS Secrets Manager, HashiCorp Vault, Sealed Secrets, or the External Secrets Operator (see the sketch after this list)
  • Enable encryption at rest for etcd
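
A sketch of a Secret consumed as an environment variable; the names and placeholder value are illustrative, and in practice the Secret object itself would be materialized by Vault or an external secrets operator rather than committed to Git:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: db-credentials            # illustrative name
type: Opaque
stringData:                       # plain values here; the API server stores them base64-encoded
  DB_PASSWORD: change-me          # placeholder only
---
apiVersion: v1
kind: Pod
metadata:
  name: api-demo                  # illustrative name
spec:
  containers:
    - name: api
      image: registry.example.com/api:1.0.0   # placeholder image
      env:
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:         # pulled from the Secret at pod start, never baked into the image
              name: db-credentials
              key: DB_PASSWORD
```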

6. Skipping RollingUpdate Strategy or Setting maxSurge/maxUnavailable Badly

The default strategy is fine, but teams often set maxUnavailable to 100% (every pod can be taken down at once) or combine maxSurge: 0 with a large maxUnavailable, and the rollout causes downtime.

Recommended for production (full manifest sketched after this list):

  • strategy: RollingUpdate
  • maxSurge: 25%
  • maxUnavailable: 25%
  • Use readiness probes so traffic only goes to healthy pods
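
The same settings in a Deployment sketch; the name, image, and probe path are placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-api                   # illustrative name
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%               # one extra pod may be created during the rollout
      maxUnavailable: 25%         # at most one pod below desired capacity at any time
  selector:
    matchLabels:
      app: web-api
  template:
    metadata:
      labels:
        app: web-api
    spec:
      containers:
        - name: web-api
          image: registry.example.com/web-api:1.4.3   # placeholder image
          readinessProbe:         # keeps traffic off new pods until they report ready
            httpGet:
              path: /ready
              port: 8080
```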

7. Misconfiguring the Horizontal Pod Autoscaler (HPA)

Teams enable the HPA but forget the metrics source (metrics-server, or an adapter for custom/external metrics), so pods never scale.

  • Use metrics-server or Prometheus Adapter
  • Set meaningful CPU/memory targets (70–80% typical)
  • Set behavior.scaleDown.stabilizationWindowSeconds to avoid scale-down flapping (see the sketch after this list)
  • Test scaling with load generators
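
A sketch of an autoscaling/v2 HPA along these lines; it assumes metrics-server (or a Prometheus Adapter) is installed, and the target and window values are illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-api                        # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 75       # in the 70–80% range recommended above
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # wait 5 minutes of low load before scaling down
      policies:
        - type: Percent
          value: 50                    # remove at most half the pods per minute
          periodSeconds: 60
```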

8. Running Without Proper Logging, Monitoring & Alerting

“It works on my cluster” syndrome. In production you need:

  • Centralized logs (Loki, ELK, CloudWatch)
  • Metrics (Prometheus + Grafana)
  • Distributed tracing (Jaeger/Tempo)
  • Alerts on 5xx rates, pod restarts, and OOMKilled events (sample rules after this list)
  • Alert routing to on-call via Alertmanager, SNS, or PagerDuty
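
A sketch of Prometheus alerting rules for two of the alerts above; it assumes kube-state-metrics is being scraped, and the thresholds are illustrative:

```yaml
groups:
  - name: kubernetes-workload-alerts
    rules:
      - alert: PodRestartingFrequently
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
        for: 5m                          # sustained restart loops only, not a single blip
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} restarted more than 3 times in 15 minutes"
      - alert: ContainerOOMKilled
        expr: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} was OOMKilled; check memory limits"
```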

9. Upgrading Kubernetes Without Testing in Staging

Skipping version-skew testing means removed APIs and broken workloads are discovered in production.

Always:

  • Maintain a staging cluster one minor version ahead
  • Run kubectl convert plus deprecation scanners such as Pluto or kube-no-trouble (kubent)
  • Test critical workloads before upgrading production

10. Not Using GitOps (or Using It Wrong)

Manual kubectl apply in production is a disaster waiting to happen.

  • Use ArgoCD, Flux v2, or Jenkins X
  • Store all manifests in Git
  • Enable sync policies and health checks (see the Application sketch after this list)
  • Never allow direct cluster changes
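
A sketch of an Argo CD Application with automated sync; the repository URL, path, and namespaces are placeholders:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: web-api
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/k8s-manifests.git   # placeholder repository
    targetRevision: main
    path: apps/web-api                                          # placeholder path
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true        # delete resources that were removed from Git
      selfHeal: true     # revert manual kubectl changes back to the Git state
    syncOptions:
      - CreateNamespace=true
```

With selfHeal enabled, any direct kubectl change in the cluster is automatically reverted to what Git declares, which is exactly the guarantee the last bullet asks for.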

Quick Kubernetes Production Checklist Table

Item                        | Must Have | Common Mistake
Resource requests/limits    | Yes       | Missing → OOM kills
Liveness + Readiness probes | Yes       | Restart loops
RBAC + NetworkPolicy        | Yes       | Open cluster
GitOps                      | Yes       | Manual changes
Observability stack         | Yes       | Blind operations

Conclusion

Kubernetes gives you incredible power, but with great power comes great responsibility. The ten mistakes above account for more than 80% of production incidents we see. Fix them early — ideally via policy-as-code and automated checks in CI — and your cluster will be stable, secure, and scalable. Print the checklist, add it to your pull-request template, and sleep better at night.

Frequently Asked Questions

Is it safe to run StatefulSet without PVC?

No. Always use PersistentVolumeClaims. Local volumes disappear on node failure.

Should limits equal requests?

For predictable workloads, yes (Guaranteed QoS class). For bursty workloads, allow some headroom.

Can I skip NetworkPolicy?

Only in isolated dev clusters. In production, default-deny + explicit allow is mandatory.

How often should I upgrade Kubernetes?

Every 3–6 months. Never skip more than one minor version.

Is ArgoCD or Flux better?

Both are excellent. Argo CD has the better UI; Flux is lighter and Kustomize-native.

Mridul
I am a passionate technology enthusiast with a strong focus on DevOps, Cloud Computing, and Cybersecurity. Through my blogs at DevOps Training Institute, I aim to simplify complex concepts and share practical insights for learners and professionals. My goal is to empower readers with knowledge, hands-on tips, and industry best practices to stay ahead in the ever-evolving world of DevOps.