Top 15 Kubernetes Troubleshooting Commands

Master the essential toolkit for diagnosing and resolving issues in your Kubernetes clusters with this detailed guide to the top 15 troubleshooting commands. Learn the precise syntax and contextual use of critical kubectl commands, from inspecting resource states and viewing container logs to analyzing deployment history and validating network connectivity. We demystify the process of identifying pod failures, resource bottlenecks, and network configuration errors, enabling DevOps engineers and administrators to achieve faster incident resolution and maintain robust application stability. Discover how to transform raw cluster data into actionable insights for continuous operational excellence and reliable service delivery across all environments.

Introduction

Kubernetes is the orchestrator of the modern cloud, providing unparalleled scalability and resilience for containerized applications. However, this power comes with inherent complexity. When things go wrong in a distributed system, the potential points of failure multiply: a misconfigured manifest, an exhausted node resource, a failed network connection, or simply a bug in the application code. Mastering the art of Kubernetes troubleshooting is a foundational skill for every engineer working in the cloud-native space. It is the difference between a minor service interruption and a prolonged, frustrating outage that impacts users and revenue, making command-line fluency paramount for reliable operations.

The core interface for interacting with any Kubernetes cluster is the command-line tool, kubectl. This powerful utility acts as the eyes and hands of the administrator, providing access to the entire state of the cluster, from the control plane down to individual containers. While graphical dashboards offer high-level views, true, deep-dive diagnosis always requires the precision and detail offered by kubectl. The 15 commands detailed in this guide form the indispensable toolkit for rapidly identifying, isolating, and resolving the most common issues encountered in a production or staging environment. By learning the nuanced flags and outputs of these commands, you can dramatically shorten your Mean Time to Recover (MTTR) and ensure your applications remain stable and available.

Troubleshooting is often approached as a systematic process, moving from high-level observation to deep-level investigation. We start by observing the symptoms, then move to describing the affected resources for details, examine the container logs, and finally interact with the environment to test hypotheses. This structured approach, powered by a mastery of the top kubectl commands, transforms the daunting task of debugging a distributed system into a predictable and efficient operation. Regardless of your cluster size or application complexity, these commands are your non-negotiable first line of defense against any operational issue.

The Core Inspection Commands: Get and Describe

The troubleshooting process always begins with observation. You need to know which components are affected, what state they are in, and what events led to their current condition. The commands kubectl get and kubectl describe are the most frequently used commands in the Kubernetes toolkit, providing the initial, high-level view and the deep-dive details necessary to start forming a diagnosis. They are the essential tools for observing the cluster's current state and gathering the fundamental data required before any corrective action can be taken, acting as the primary window into the cluster's declarative state.

The command kubectl get is the cluster's primary inventory tool. It allows you to list resources (Pods, Deployments, Services, Nodes, etc.) and view their high-level status. Using the command with the -o wide flag is a crucial troubleshooting technique, as it provides additional, often necessary columns of information, such as the node a Pod is running on or the internal cluster IP address. For instance, seeing a Pod stuck in the Pending state, or a Deployment showing 0/3 Ready replicas, is the first indication of a problem, immediately signaling a discrepancy between the desired state (as defined in the manifest) and the actual state within the cluster. This command is the first step in the "OODA loop" of cluster operations: Observe, Orient, Decide, Act.
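
For reference, here are a few typical first-pass invocations; the namespace web and the resource names are placeholders for your own.

```bash
# List Pods in every namespace with node and IP details
kubectl get pods -A -o wide

# Inspect a single namespace ("web" is a placeholder)
kubectl get pods -n web -o wide

# Show only Pods that have not been scheduled yet
kubectl get pods -A --field-selector=status.phase=Pending

# Check whether Deployments have the expected number of ready replicas
kubectl get deployments -n web
```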

Once an issue is observed, the detective work begins with kubectl describe. This command provides a wealth of detailed information about a specific resource, pulling data from the Kubernetes API server. For a Pod, describe details the resource limits, volume mounts, container image definitions, status conditions, and, most importantly, the entire history of events associated with that resource. If a Pod failed to schedule, the Events section will clearly explain why (e.g., node resource constraints, missing volumes, or pulling a bad image). For a Deployment, describe reveals the status of its associated ReplicaSets and the strategy being used for rolling updates. Learning to read and analyze the Events section of the describe output is perhaps the most important skill for a Kubernetes troubleshooter.
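
As a sketch, the following commands pull the detail and event data described above; the Pod, Node, and namespace names are placeholders.

```bash
# Full configuration, status conditions, and event history for a Pod
kubectl describe pod my-app-7d9f8b6c5-x2k4q -n web

# Node capacity, taints, allocated resources, and conditions
kubectl describe node worker-node-1

# Related trick: list recent events directly, in chronological order
kubectl get events -n web --sort-by=.metadata.creationTimestamp
```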

Container and Log Inspection: Logs and Exec

After confirming the status of the resource, the next logical step is to look inside the container itself. The Pod resource is merely the wrapper; the core problem almost always resides within the application code or the container runtime environment. Logs provide the historical narrative of the application's behavior, while an interactive shell provides the ability to run real-time diagnostics and test environment variables or connectivity from the application's perspective, both critical steps in achieving true root cause analysis for application-level failures.

The command kubectl logs is the primary method for accessing the application's output, essential for diagnosing application crashes, unexpected behavior, and startup failures. Key flags are crucial here: -f for following the log stream in real time, and --since=1h for fetching logs from a specific time range. If a Pod has multiple containers, you must use the -c [container-name] flag to specify which container's logs you wish to view. Furthermore, for Pods that have crashed and been restarted (often resulting in a CrashLoopBackOff status), the -p or --previous flag is mandatory to access the logs from the previous, failed container instance, giving the technician insight into the final moments before the crash. This ability to capture and review historical and live output is fundamental for effective application debugging.
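
The flags discussed above combine as shown in this sketch; the Pod and container names are assumptions.

```bash
# Follow the live log stream of one container in a multi-container Pod
kubectl logs -f my-app-7d9f8b6c5-x2k4q -c app-container -n web

# Fetch only the last hour of output, capped at 200 lines
kubectl logs my-app-7d9f8b6c5-x2k4q -n web --since=1h --tail=200

# Read the logs of the previous, crashed instance (CrashLoopBackOff analysis)
kubectl logs my-app-7d9f8b6c5-x2k4q -n web --previous
```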

The command kubectl exec is your remote shell into a running container, providing an environment to run diagnostics inside the application's own context. This command is invaluable for checking network connectivity, file permissions, or verifying configuration files that were mounted into the Pod. A common use case is running kubectl exec -it [pod-name] -- /bin/sh to open an interactive shell. From within the container, you can check whether a required database is reachable, whether environment variables were correctly injected, or whether any application-level Linux permissions are misconfigured. It lets you confirm assumptions about the runtime environment without redeploying the container image, so you can test hypotheses directly at the source of the problem and quickly determine whether the fault lies in the application or in the cluster configuration.
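
A minimal sketch of common exec checks follows; the Pod name, file path, and database hostname are hypothetical, and tools such as nslookup are only available if the container image includes them.

```bash
# Open an interactive shell (use /bin/bash if the image provides it)
kubectl exec -it my-app-7d9f8b6c5-x2k4q -n web -- /bin/sh

# One-off checks without an interactive session
kubectl exec my-app-7d9f8b6c5-x2k4q -n web -- env
kubectl exec my-app-7d9f8b6c5-x2k4q -n web -- cat /etc/app/config.yaml
kubectl exec my-app-7d9f8b6c5-x2k4q -n web -- nslookup postgres.db.svc.cluster.local
```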

Network and Connectivity Diagnostics

Many Kubernetes issues, especially those manifesting as slow response times or connection timeouts, stem from networking failures. These failures can occur at various levels: misconfigured Services, incorrect firewall rules, or faulty DNS resolution within the cluster. Troubleshooting network connectivity requires specialized commands that test the paths between the application, the service layer, and the outside world. Without the ability to definitively prove a component is reachable, the engineer risks wasting time investigating application code that is, in fact, perfectly functional but simply unreachable due to an infrastructure misconfiguration.

The command kubectl port-forward is a simple yet powerful command that creates a secure, temporary tunnel from your local machine to a Pod or Service within the cluster. This command bypasses the Ingress, Service load balancing, and public exposure mechanisms, allowing you to test the application directly. By running kubectl port-forward [pod-name] 8080:80, you can access the application running on port 80 inside the Pod through localhost:8080 on your laptop. If the application works locally via port-forward but fails through the public Ingress, you immediately know the problem lies outside the Pod: likely in the Service definition, Ingress rules, or the load balancer configuration, effectively segmenting the network issue from the application issue.
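
A quick sketch, assuming a Pod serving HTTP on port 80 and a health endpoint at /healthz (both placeholders):

```bash
# Tunnel local port 8080 to port 80 of the Pod
kubectl port-forward pod/my-app-7d9f8b6c5-x2k4q -n web 8080:80

# Or target the Service to include Service selection in the test
kubectl port-forward service/my-app -n web 8080:80

# In a second terminal, exercise the application locally
curl -v http://localhost:8080/healthz
```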

For internal cluster testing, the command kubectl run is often used to quickly launch a diagnostic Pod running a lightweight image such as busybox. You can execute kubectl run debug-pod --image=busybox --rm -it --restart=Never -- wget -qO- http://[internal-service-name] (busybox ships wget rather than curl; use an image such as curlimages/curl if you prefer curl). This allows you to test internal Service-to-Service communication, validating DNS resolution and network policies from within the cluster's network fabric. If this debug Pod cannot reach a Service that another Pod should be using, you have pinpointed a Service or network policy failure. Finally, kubectl get service and kubectl get ingress are used to check the selector labels and endpoints, ensuring the Service is correctly targeting the Pods and that the Ingress is pointing to the correct Service, closing the loop on potential traffic-routing issues inside the cluster.
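
The workflow might look like the sketch below; the Service name my-app and namespace web are assumptions.

```bash
# Throwaway busybox Pod that performs an HTTP request against an internal Service
kubectl run debug-pod --image=busybox:1.36 --rm -it --restart=Never -- \
  wget -qO- http://my-app.web.svc.cluster.local

# Confirm the Service, its Endpoints, and the Ingress routing
kubectl get service my-app -n web
kubectl get endpoints my-app -n web
kubectl get ingress -n web
```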

Top 15 Kubernetes Troubleshooting Commands Overview

| Command | Primary Purpose | Key Troubleshooting Use Case | Target Resource |
| --- | --- | --- | --- |
| kubectl get | List resources and view high-level status. | Quickly spot Pods stuck in Pending or CrashLoopBackOff. | All resources (Pods, Deployments, Nodes) |
| kubectl describe | Show detailed state and recent events for a resource. | Read the Events section to see why a Pod failed to schedule or start. | Specific resources (e.g., a Pod or Node by name) |
| kubectl logs | Stream or view historical container output. | Diagnose application startup errors or crash causes (the -p flag is key). | Pods / containers |
| kubectl exec | Execute a command inside a running container. | Verify network connectivity, environment variables, or file presence from inside the Pod. | Running Pods / containers |
| kubectl get events | List recent cluster-wide events and activities. | See a chronological view of cluster-wide problems (e.g., image pull failures, scheduler issues). | Cluster-wide |
| kubectl rollout status | Monitor the progress of a rollout. | Determine whether a deployment is stuck, failed, or progressing slowly. | Deployments, DaemonSets, StatefulSets |
| kubectl rollout undo | Revert a workload to a previous revision. | Rapidly roll back a bad deployment to the last known good revision. | Deployments |
| kubectl top | Display real-time resource utilization (CPU/memory). | Identify resource bottlenecks (a Pod or Node hogging resources). | Nodes, Pods (requires Metrics Server) |
| kubectl diff | Show the difference between a local manifest and the cluster's live state. | Verify configuration changes before applying them, preventing accidental updates. | Manifests (Deployment, Service, etc.) |
| kubectl run | Create a single Pod for quick testing or diagnostics. | Test DNS resolution or network connectivity from a clean, known Pod environment. | Pods |
| kubectl port-forward | Create a temporary, secure tunnel to a Pod or Service. | Bypass the Ingress/Service layer to test application functionality directly. | Pods, Services |
| kubectl delete | Remove resources from the cluster. | Force-restart a Pod or clean up failing resources. | All resources |
| kubectl get service | Check Service endpoints and internal/external IPs. | Verify that a Service is correctly selecting and exposing healthy Pods via its Endpoints. | Services |
| kubectl cordon/drain | Control the scheduling state of a Node. | Prepare a Node for maintenance (cordon) or safely evacuate Pods (drain). | Nodes |
| kubectl auth can-i | Check whether the current user is authorized to perform an action. | Quickly determine if a problem is caused by RBAC access restrictions. | Cluster-wide (RBAC) |

Resource and Node Inspection: Top and Cordon/Drain

Once application-level issues are ruled out, the focus often shifts to the underlying infrastructure: the Nodes and their resource utilization. Performance degradation, Pods stuck in the Pending state, or persistent OOMKilled messages are all symptoms of resource contention, where the demand for CPU or Memory exceeds the cluster's available capacity. Monitoring the resource consumption of Pods and Nodes is critical for capacity planning and quickly identifying which workload is causing stress on the system, leading to failures or poor latency for other applications.

The command kubectl top is the cluster equivalent of the Linux top command. It provides a real-time snapshot of CPU and Memory consumption across Nodes and individual Pods. Note that this command requires the Kubernetes Metrics Server to be installed and running in the cluster. By running kubectl top pod -A (listing all Pods across all namespaces) or kubectl top node, you can instantly pinpoint a Pod that is consuming resources beyond its defined limits or a Node that is nearing capacity. This immediate visibility helps diagnose throttling issues and guides the process of correctly setting resource requests and limits in your application manifests, ensuring stable scheduling and performance.
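
For instance (assuming the Metrics Server is installed):

```bash
# Node-level CPU and memory consumption
kubectl top node

# Pod usage across all namespaces, heaviest memory consumers first
kubectl top pod -A --sort-by=memory

# Narrow to one namespace and sort by CPU instead
kubectl top pod -n web --sort-by=cpu
```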

When a Node itself becomes unhealthy, unstable, or requires maintenance (such as an OS patch or hardware upgrade), you need to manage its state gracefully to prevent service disruption. The commands kubectl cordon and kubectl drain are designed for this purpose. Cordon marks a Node as unschedulable, meaning the scheduler will no longer place new Pods on it, but existing Pods remain running. Drain is the more aggressive command: it cordons the Node and then gracefully evicts all running Pods from it, respecting Pod Disruption Budgets (PDBs) as it does so. This two-part workflow is essential for performing zero-downtime maintenance, ensuring that workloads are safely migrated to other healthy Nodes before the Node is taken offline for an extended maintenance window or decommissioned. Managing Nodes this way also ensures that automated backups and other operational tasks can proceed without causing application instability.
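
A typical maintenance sequence looks like this sketch; worker-node-1 is a placeholder, and the drain flags shown are the common ones for clusters running DaemonSets and emptyDir volumes.

```bash
# Stop new Pods from landing on the node; existing Pods keep running
kubectl cordon worker-node-1

# Evict workloads gracefully, respecting PodDisruptionBudgets
kubectl drain worker-node-1 --ignore-daemonsets --delete-emptydir-data

# After maintenance, return the node to the scheduling pool
kubectl uncordon worker-node-1
```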

Deployment History and Rollback Management

Deployment failures are among the most common causes of service degradation. A misconfiguration, a bad code push, or an unexpected environmental interaction can cause a rolling update to halt or, worse, succeed and then immediately start failing due to runtime errors. Effective troubleshooting in this domain requires two things: immediate insight into the deployment's status and the ability to instantly revert to the last known good state, minimizing the impact of a faulty release. The commands for managing rollout history provide a critical safety net against bad deployments, ensuring that recovery is rapid and predictable.

The command kubectl rollout status [deployment-name] provides a continuous stream of updates regarding the progress of a rolling update. If the deployment hangs, this command will reveal if it is stuck waiting for a Pod to become ready or if the rollout has exceeded its timeout. If the deployment completes but the new Pods immediately enter a failing state, the next step is to examine the history using kubectl rollout history [deployment-name]. This displays a chronological list of all previous deployment revisions, complete with a revision number and a change cause (if properly annotated in the manifest), allowing the engineer to orient themselves quickly in the deployment history and find the last stable point.
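
In practice (deployment name and namespace are placeholders):

```bash
# Watch the rolling update until it completes or times out
kubectl rollout status deployment/my-app -n web

# List recorded revisions with their change causes
kubectl rollout history deployment/my-app -n web

# Inspect the Pod template of one specific revision
kubectl rollout history deployment/my-app -n web --revision=3
```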

When an issue is confirmed to be a bug introduced in the latest deployment, the command kubectl rollout undo [deployment-name] is the fastest path to resolution. By default, it reverts the deployment to the immediately preceding revision. However, using the --to-revision=[revision-number] flag, an engineer can target any specific, previously stable revision identified from the history output. This immediate, cluster-native rollback is a non-negotiable feature for continuous delivery environments, ensuring that the critical "fix" is often the rapid reversal of the bad change rather than a complex manual patch. This capability is paramount to maintaining high system availability, and its speed dramatically improves the organization's MTTR, acting as the primary line of defense against catastrophic deployment failures.
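
A sketch of the rollback flow, assuming revision 3 was the last stable release:

```bash
# Revert to the immediately preceding revision
kubectl rollout undo deployment/my-app -n web

# Or target a specific known-good revision from the history output
kubectl rollout undo deployment/my-app -n web --to-revision=3

# Confirm the rollback converged and the Pods are healthy
kubectl rollout status deployment/my-app -n web
kubectl get pods -n web
```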

Configuration Validation and Correction

Many subtle and frustrating failures in Kubernetes stem from minor misconfigurations that are difficult to spot manually. A single typo in a label selector, a slight variation in a ConfigMap value, or an unapplied change in the local manifest compared to the live resource can cause entire services to break without any obvious error message. Commands that allow for pre-flight checks and surgical corrections are essential for catching these elusive configuration problems before they cause user-impacting issues or, worse, for correctly fixing them after they have already broken the system.

The command kubectl diff -f [manifest-file] is the safest way to prepare for any deployment. It compares the local manifest file against the resource currently running in the cluster and prints the difference, highlighting all changes that will be applied if you run kubectl apply. This prevents engineers from inadvertently overwriting critical configurations, deleting necessary fields, or applying changes based on an outdated local file. Always running diff before apply is a best practice that drastically reduces the likelihood of introducing configuration-based downtime. It is a simple check that provides immense value by enforcing the principle of "review before you deploy" for Infrastructure as Code (IaC) changes, ensuring that the live state always matches the intended configuration.
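
For example (manifest paths are placeholders; note that kubectl diff exits with code 1 when differences exist):

```bash
# Preview exactly what would change before applying a single manifest
kubectl diff -f deployment.yaml

# Or diff an entire directory of manifests against the live cluster state
kubectl diff -f manifests/
```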

Following a diagnostic path that isolates the problem to a misconfigured label or an incorrect environment variable, the command kubectl apply -f [manifest-file] is used to non-destructively correct the error. Unlike kubectl create (which fails if the resource already exists), apply intelligently updates the existing resource, ensuring that the declarative configuration in your manifest file is enforced in the cluster. This idempotency is key for GitOps practices and for keeping all cluster resources defined as code. For resources that cannot be repaired in place, the command kubectl delete can be used to remove the offending object, often forcing the controller to recreate it from a fresh, correct state. These commands complete the troubleshooting loop: observe, diagnose, fix (using apply), and verify (get / describe), ensuring the cluster converges back to a healthy state.
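
A sketch of the fix-and-verify loop, with hypothetical resource names:

```bash
# Apply the corrected manifest; the existing resource is patched in place
kubectl apply -f deployment.yaml

# Verify that the change converged
kubectl get deployment my-app -n web
kubectl describe deployment my-app -n web

# As a last resort, delete a stuck Pod so its controller recreates it fresh
kubectl delete pod my-app-7d9f8b6c5-x2k4q -n web
```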

Security and Compliance Commands

In modern, production-grade Kubernetes environments, security and access control are critical concerns. Failures often stem not from bugs in the application, but from misconfigured Role-Based Access Control (RBAC) policies that prevent a Pod, user, or ServiceAccount from accessing the necessary cluster resources or underlying operating system features. Troubleshooting these access issues requires specialized commands that validate permissions and inspect security configurations. Ensuring that ServiceAccount bindings and other elevated permissions are correctly configured is a mandatory security checkpoint for any cluster operator.

The command kubectl auth can-i is the most effective tool for validating authorization policies. It allows a user to ask the API server, "Can I perform this action?" For example, kubectl auth can-i create pods --as=system:serviceaccount:default:my-service-account -n my-namespace checks if a specific service account has the necessary permissions to create pods in a namespace. This is crucial for debugging why a Controller or Operator might be failing to perform its task. RBAC failures are frequently difficult to trace, as the component often fails silently or logs a cryptic error, but can-i provides a definitive answer regarding the security policy enforcement, immediately isolating whether the issue is a bug or an access restriction.
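
Typical checks look like this; the ServiceAccount and namespace names are placeholders:

```bash
# Check your own permission for a specific action
kubectl auth can-i create pods -n my-namespace

# Impersonate a ServiceAccount to test its RBAC bindings
kubectl auth can-i list secrets \
  --as=system:serviceaccount:default:my-service-account -n my-namespace

# Dump everything the current identity may do in a namespace
kubectl auth can-i --list -n my-namespace
```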

Security goes beyond RBAC to include the Pod security context and underlying operating system security. While not a dedicated troubleshooting command, inspecting a resource's manifest (using kubectl get pod [pod-name] -o yaml) to check the securityContext, including capabilities, runAsUser, and allowPrivilegeEscalation, is essential. Any Pod that accesses host files or requires elevated privileges must be scrutinized. Furthermore, ensuring that backups and archives of critical configurations are stored and transferred securely is a fundamental operational security requirement. Kubernetes commands must be part of a broader security strategy that covers the application's entire lifecycle, from coding to its execution environment, ensuring a robust security posture.
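
A minimal sketch of how to surface those fields, with placeholder names:

```bash
# Dump the full live manifest and review the security-related fields
kubectl get pod my-app-7d9f8b6c5-x2k4q -n web -o yaml

# Extract just the Pod-level and per-container securityContext blocks
kubectl get pod my-app-7d9f8b6c5-x2k4q -n web -o jsonpath='{.spec.securityContext}'
kubectl get pod my-app-7d9f8b6c5-x2k4q -n web \
  -o jsonpath='{range .spec.containers[*]}{.name}{": "}{.securityContext}{"\n"}{end}'
```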

Conclusion

Mastering Kubernetes troubleshooting is not about memorizing commands, but about internalizing a systematic diagnostic process. The 15 commands outlined in this guide provide the necessary tooling for every stage of that process: Observe (get), Investigate (describe, logs, events), Test (exec, run, port-forward), Analyze (top, diff), and Remediate (apply, rollout undo). By following a structured approach that moves from high-level status checks down to container-level logs and network connectivity tests, engineers can rapidly segment the fault domain to the application code, the configuration manifest, the infrastructure resources, or the network fabric.

The ability to instantly recover from failures is paramount. Commands like kubectl rollout undo and the proper use of drain demonstrate that effective troubleshooting often relies on the speed of reversal rather than the complexity of the initial fix. The final layer of mastery involves using commands like kubectl auth can-i to confirm security policies are working as intended, and integrating these tools with broader operational excellence practices, including pre-flight checks and reliable disaster recovery procedures. By achieving fluency in these 15 commands, any engineer can efficiently navigate the complexities of a distributed system, ensuring that applications achieve maximum stability, and solidifying their role as a high-performing contributor to any cloud-native organization.

Frequently Asked Questions

What is the first command to run when a Pod is failing?

The first command should be kubectl get pods to confirm the status, followed immediately by kubectl describe pod [pod-name] to check the events.

How do I view the logs of a container that has already crashed?

Use the command kubectl logs [pod-name] --previous (or -p) to retrieve the logs from the previous instance.

What does a Pod stuck in the Pending state usually indicate?

It typically indicates a scheduling failure due to insufficient resources (CPU/Memory) on the available Nodes or a missing required resource like a Persistent Volume.

How do I check if my firewall rules are blocking traffic to a Service?

Use kubectl exec to open a shell inside a Pod and run curl or ping against the target Service, testing connectivity from inside the cluster.

What is the safest way to correct a small configuration error on a Deployment?

Use kubectl diff -f [file] first, then use kubectl apply -f [file] to non-destructively apply the fix.

What is the fastest way to revert a bad deployment?

The fastest way is to use kubectl rollout undo deployment [deployment-name] to immediately revert to the previous revision.

How can I find out which Pod is consuming all the resources on a Node?

Use kubectl top pod with the appropriate namespace and filtering to identify Pods with high CPU or Memory usage.

What is the purpose of the kubectl cordon command?

Cordon marks a Node as unschedulable, stopping the scheduler from placing any new Pods on it, which is the first step before maintenance.

Why is kubectl port-forward useful for troubleshooting?

It bypasses the Service and Ingress layers to test the application directly, isolating the application code from network routing issues.

How can I check if I have the authority to perform a specific action in the cluster?

Run the command kubectl auth can-i [verb] [resource type] to confirm your current RBAC policy permissions.

What is the most effective flag for troubleshooting live application issues with logs?

The -f or --follow flag is most effective as it streams the logs in real time, showing errors as they occur.

What is the key difference between kubectl get and kubectl describe?

Get provides a high-level summary of the status, while describe provides all configuration details and recent events.

How can I verify if an image pull failed due to authentication issues?

Use kubectl describe pod [pod-name] and inspect the Events section for Failed to pull image errors.

What should be done after using kubectl drain on a Node?

After draining, the Node should be safely taken offline for maintenance, and then made schedulable again using kubectl uncordon.

Why is checking the security context important when troubleshooting a Container?

It helps verify if the container has the necessary or, conversely, excessive privileges to perform file operations, often revealing security compliance issues.
