10 Real Mistakes to Avoid in Kubernetes Deployment
Avoid costly outages and security breaches. Learn the 10 most common and dangerous Kubernetes deployment mistakes real teams make in production, covering resource limits, liveness probes, storage, networking, RBAC, upgrades, secrets, scaling, logging, and rollout strategies, and exactly how to fix them.
Introduction
Preventable Kubernetes-related outages still make headlines on a regular basis. The platform is incredibly powerful, but it is also unforgiving: small misconfigurations cascade into million-dollar incidents.
After helping dozens of enterprises and startups fix their broken clusters, we keep seeing the same 10 frequent, expensive, and completely avoidable mistakes in production. Here they are, along with how to never make them.
1. Not Setting Resource Requests and Limits
Running pods without requests and limits is the #1 cause of node OOM kills and “noisy neighbor” problems.
- Always set both request and limit (limit ≥ request)
- Use Vertical Pod Autoscaler in recommendation mode first
- Enable LimitRange in namespaces to enforce defaults
- Watch for pods stuck in Pending because of insufficient CPU/memory
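As a reference point, here is a minimal sketch of a Deployment with both values set, plus a namespace LimitRange that fills in defaults; the image, namespace, and figures are placeholders, not sizing recommendations:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api                        # hypothetical workload name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: registry.example.com/api:1.2.3   # placeholder image
          resources:
            requests:              # what the scheduler reserves on a node
              cpu: 250m
              memory: 256Mi
            limits:                # hard ceiling; keep limit >= request
              cpu: 500m
              memory: 512Mi
---
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults
  namespace: team-a                # hypothetical namespace
spec:
  limits:
    - type: Container
      defaultRequest:              # applied when a container omits requests
        cpu: 100m
        memory: 128Mi
      default:                     # applied when a container omits limits
        cpu: 250m
        memory: 256Mi
```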
2. Wrong or Missing Liveness & Readiness Probes
Bad probes cause endless restart loops or traffic to dead pods.
- Liveness probe too aggressive → restart storm
- Readiness probe missing → traffic sent to unready pods
- Use startupProbe for slow-starting apps (Java, .NET)
- Always set initialDelaySeconds and reasonable thresholds
- Test probes locally with kubectl port-forward
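A minimal Pod sketch with all three probes; the paths, port, and timings are placeholders that need to match your application's real endpoints and startup behavior:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo                 # hypothetical name
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0.0   # placeholder image
      ports:
        - containerPort: 8080
      startupProbe:                # gives slow starters time before liveness kicks in
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 10
        failureThreshold: 30       # up to 30 x 10s = 5 minutes to start
      livenessProbe:               # restarts the container only if it truly hangs
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 10
        failureThreshold: 3
      readinessProbe:              # gates Service traffic to this pod
        httpGet:
          path: /ready
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 5
        failureThreshold: 3
```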
3. Using hostPath or emptyDir for Persistent Data
Many teams use hostPath in dev and lose data after node reboot or pod rescheduling.
Use proper PersistentVolumes (EBS, Azure Disk, NFS, Ceph) or StatefulSet with volumeClaimTemplates. Never rely on local storage for anything important.
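For comparison, a minimal StatefulSet sketch with volumeClaimTemplates; the image, storage class, and size are placeholders, and the named StorageClass is assumed to exist in your cluster:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db                         # hypothetical name
spec:
  serviceName: db
  replicas: 1
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      containers:
        - name: db
          image: postgres:16       # example image
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:            # one PVC per replica, survives rescheduling
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp3      # assumed StorageClass; use whatever your cluster provides
        resources:
          requests:
            storage: 20Gi
```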
4. Overly Permissive RBAC and Network Policies
- Default setups allow any pod to talk to any pod
- ServiceAccounts with cluster-admin still common
- Fix: Use least-privilege RBAC + NetworkPolicy by default
- Tools: Kyverno or OPA Gatekeeper for policy enforcement
- Enable Pod Security Standards (restricted profile)
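A starting-point sketch: a default-deny ingress NetworkPolicy for a namespace plus a narrowly scoped Role bound to a ServiceAccount instead of cluster-admin; the namespace, ServiceAccount, and resource list are placeholders:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: team-a                # hypothetical namespace
spec:
  podSelector: {}                  # selects every pod in the namespace
  policyTypes:
    - Ingress                      # no ingress allowed until an explicit allow policy exists
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: team-a
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]   # read-only, nothing more
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader-binding
  namespace: team-a
subjects:
  - kind: ServiceAccount
    name: app-sa                   # hypothetical ServiceAccount
    namespace: team-a
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```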
5. Storing Secrets in Plain Text or ConfigMaps
Never put passwords, tokens, or certificates in ConfigMaps or YAML files.
- Use Kubernetes Secrets (base64 is encoding, not encryption, but still better than plain text)
- Better: external secret managers and operators (AWS Secrets Manager, HashiCorp Vault, Sealed Secrets, External Secrets Operator)
- Enable encryption at rest for etcd
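A sketch of the in-cluster consumption side only: a Secret created via stringData and read through secretKeyRef rather than a ConfigMap. In practice the Secret itself should be produced by one of the external managers above rather than committed to Git; names and values here are placeholders.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: db-credentials             # hypothetical name
type: Opaque
stringData:                        # stringData spares you manual base64 encoding
  DB_PASSWORD: change-me           # placeholder; inject real values out of band
---
apiVersion: v1
kind: Pod
metadata:
  name: secret-demo                # hypothetical name
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0.0   # placeholder image
      env:
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:          # pulled from the Secret, never from a ConfigMap
              name: db-credentials
              key: DB_PASSWORD
```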
6. Skipping RollingUpdate Strategy or Setting maxSurge/maxUnavailable Badly
The defaults (25% surge, 25% unavailable) are fine, but teams often set maxSurge=0 together with a high maxUnavailable (or maxUnavailable=100%), which tears pods down faster than replacements come up and causes downtime.
Recommended for production:
- strategy: RollingUpdate
- maxSurge: 25%
- maxUnavailable: 25%
- Use readiness probes so traffic only goes to healthy pods
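Put together, a production-leaning sketch might look like this; the names, image, and probe path are placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                        # hypothetical name
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%                # extra pods allowed above the desired count
      maxUnavailable: 25%          # pods that may be down during the rollout
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: registry.example.com/web:2.0.0   # placeholder image
          readinessProbe:          # traffic shifts only to pods that pass this
            httpGet:
              path: /ready
              port: 8080
```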
7. Misconfiguring the Horizontal Pod Autoscaler (HPA)
Teams enable HPA but forget to install metrics-server or wire up the custom/external metrics it targets, so pods never scale.
- Use metrics-server or Prometheus Adapter
- Set meaningful CPU/memory targets (70–80% typical)
- Set behavior.scaleDown.stabilizationWindowSeconds
- Test scaling with load generators
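A working autoscaling/v2 sketch targeting an existing Deployment; the replica bounds, 75% target, and stabilization window are placeholders to tune per workload:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa                    # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                      # assumes this Deployment exists
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 75   # within the typical 70-80% band
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5 minutes before scaling back down
```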
8. Running Without Proper Logging, Monitoring & Alerting
“It works on my cluster” syndrome. In production you need:
- Centralized logs (Loki, ELK, CloudWatch)
- Metrics (Prometheus + Grafana)
- Distributed tracing (Jaeger/Tempo)
- Alerts on 5xx, pod restarts, OOMKilled
- SNS + PagerDuty integration
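As one possible shape for the alerting piece, here is a PrometheusRule sketch, assuming the Prometheus Operator and kube-state-metrics are installed; alert names, thresholds, and severities are illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: workload-alerts            # hypothetical name
  namespace: monitoring            # assumed monitoring namespace
spec:
  groups:
    - name: pods
      rules:
        - alert: PodRestartingFrequently
          expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting frequently"
        - alert: ContainerOOMKilled
          expr: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Container {{ $labels.container }} in {{ $labels.namespace }}/{{ $labels.pod }} was OOMKilled"
```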
9. Upgrading Kubernetes Without Testing in Staging
Skipping version-skew testing causes API deprecations and broken workloads.
Always:
- Maintain a staging cluster one minor version ahead
- Run kubectl convert and deprecation scanners such as Pluto or kube-no-trouble (kubent)
- Test critical workloads before upgrading production
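A typical example of what those scanners flag: an Ingress written for networking.k8s.io/v1beta1, which was removed in Kubernetes 1.22, rewritten against networking.k8s.io/v1; the host, service name, and port are placeholders:

```yaml
apiVersion: networking.k8s.io/v1   # v1beta1 versions of Ingress were removed in 1.22
kind: Ingress
metadata:
  name: web                        # hypothetical name
spec:
  rules:
    - host: app.example.com        # placeholder host
      http:
        paths:
          - path: /
            pathType: Prefix       # required in v1; older API versions let you omit it
            backend:
              service:             # v1beta1 used serviceName/servicePort here instead
                name: web
                port:
                  number: 80
```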
10. Not Using GitOps (or Using It Wrong)
Manual kubectl apply in production is a disaster waiting to happen.
- Use ArgoCD, Flux v2, or Jenkins X
- Store all manifests in Git
- Enable sync policies and health checks
- Never allow direct cluster changes
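For illustration, an Argo CD Application sketch that syncs a Git path into a namespace with automated pruning and self-healing; the repository URL, path, and names are placeholders:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: web                        # hypothetical name
  namespace: argocd                # assumes Argo CD is installed in this namespace
spec:
  project: default
  source:
    repoURL: https://github.com/example/deploy-manifests.git   # placeholder repository
    targetRevision: main
    path: apps/web
  destination:
    server: https://kubernetes.default.svc
    namespace: web
  syncPolicy:
    automated:
      prune: true                  # delete resources that disappear from Git
      selfHeal: true               # revert manual drift back to the Git state
    syncOptions:
      - CreateNamespace=true
```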
Quick Kubernetes Production Checklist
| Item | Must Have | Common Mistake |
|---|---|---|
| Resource requests/limits | Yes | Missing → OOM kills |
| Liveness + Readiness probes | Yes | Restart loops |
| RBAC + NetworkPolicy | Yes | Open cluster |
| GitOps | Yes | Manual changes |
| Observability stack | Yes | Blind operations |
Conclusion
Kubernetes gives you incredible power, but with great power comes great responsibility. The ten mistakes above account for more than 80% of production incidents we see. Fix them early — ideally via policy-as-code and automated checks in CI — and your cluster will be stable, secure, and scalable. Print the checklist, add it to your pull-request template, and sleep better at night.
Frequently Asked Questions
Is it safe to run StatefulSet without PVC?
No. Always use PersistentVolumeClaims. Local volumes disappear on node failure.
Should limits equal requests?
For predictable workloads yes (guaranteed QoS). For bursty workloads, allow some headroom.
Can I skip NetworkPolicy?
Only in isolated dev clusters. In production, default-deny + explicit allow is mandatory.
How often should I upgrade Kubernetes?
Every 3–6 months. Never skip more than one minor version.
Is ArgoCD or Flux better?
Both excellent. ArgoCD has better UI; Flux is lighter and Kustomize-native.