The first question I want to ask is: are you using Kubernetes in production? If yes, which cloud, distribution, and tooling are you using?
People have been running Kubernetes in production for many years now. Even when cluster creation and other tooling were not mature, workarounds were devised to make distributed systems work. Today, with AI taking the world by storm and OpenAI revealing that they run their systems on Kubernetes, it has become the de facto infrastructure for running AI workloads as well. But what are the basic requirements for a production-ready Kubernetes setup? Let’s break it down:
High availability - High availability (HA) is a cornerstone of a production-grade Kubernetes setup. Modern Kubernetes installation tools enable the creation of HA clusters, ensuring that control plane nodes (which keep the entire system running) can withstand failures. Control plane nodes can run on bare metal servers or VMs in the cloud. With an HA Kubernetes cluster, even if one control plane node goes down, the cluster continues functioning.
You can create HA Kubernetes clusters yourself by placing all control plane nodes behind a load balancer or virtual IP, using tools such as HAProxy or kube-vip (comment if you are using any other tooling!). Alternatively, managed Kubernetes services like EKS, GKE, and AKS offer HA capabilities.
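To reason about how many control plane nodes to run, here is a minimal sketch. It assumes a stacked etcd topology, where each control plane node hosts one etcd member and etcd needs a majority of members (a quorum) to stay available. The function names are illustrative and not part of any tool.

```python
def quorum(members: int) -> int:
    """Smallest majority of an etcd cluster with `members` voting members."""
    if members < 1:
        raise ValueError("members must be at least 1")
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can fail while a majority is still reachable."""
    return members - quorum(members)


if __name__ == "__main__":
    for n in (1, 2, 3, 4, 5):
        print(f"{n} members: quorum={quorum(n)}, "
              f"tolerated failures={tolerated_failures(n)}")
```

This is why three or five control plane nodes are typical: going from three to four adds a node without tolerating any additional failure.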
Security - Security is paramount for any Kubernetes setup, encompassing both cluster security and application security. Ensuring your cluster is secure from external and internal threats is essential. Tools like KubeLinter can help identify potential issues in your manifests before deployment, 0-CVE base images (which you can create easily using BuildSafe) reduce the vulnerabilities you ship, and tools like Kubescape assist in cluster posture checks. Avoid exposing unnecessary services to the internet and follow best practices for hardening your cluster. On the application side, focus on deploying applications securely by adhering to Kubernetes security standards, implementing RBAC policies, and scanning container images for vulnerabilities.
If you want to know more about the 4C’s of security, I have explained them many times on my channel, like in this video.
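To give a feel for the kind of pre-deployment checks a manifest linter performs, here is a minimal sketch. It is not how KubeLinter or Kubescape are implemented; the function name and the set of checks are assumptions chosen for illustration, and it only looks at the container-level securityContext.

```python
from typing import Any


def lint_pod_spec(pod_spec: dict[str, Any]) -> list[str]:
    """Return human-readable findings for a pod spec given as a plain dict."""
    findings = []
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        image = container.get("image", "")
        # Look for a tag only after the last "/", so a registry port
        # such as "registry:5000/app" is not mistaken for a tag.
        last_segment = image.rsplit("/", 1)[-1]
        untagged = ":" not in last_segment or image.endswith(":latest")
        if "@" not in image and untagged:
            findings.append(f"{name}: image is not pinned to a tag or digest")
        security = container.get("securityContext", {})
        if security.get("privileged", False):
            findings.append(f"{name}: runs privileged")
        # Simplification: a pod-level securityContext is not considered here.
        if security.get("runAsNonRoot") is not True:
            findings.append(f"{name}: runAsNonRoot is not set to true")
        if "limits" not in container.get("resources", {}):
            findings.append(f"{name}: no resource limits set")
    return findings


if __name__ == "__main__":
    spec = {"containers": [{"name": "web", "image": "nginx"}]}
    for finding in lint_pod_spec(spec):
        print(finding)
```

In practice you would run a real linter in CI rather than maintain checks like these by hand; the sketch only shows the shape of the idea.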
Infrastructure management - How repeatable is your infrastructure setup? Do you use OpenTofu, Terraform, or a combination of tools for managing your infrastructure and application state? Repeatability ensures quick environment setups and streamlines compliance across clusters. Additionally, do you use a central control plane for managing your fleet of clusters?
Disaster recovery - How robust is your disaster recovery (DR) mechanism? How regularly do you back up your clusters and test DR scenarios to ensure readiness? Chaos engineering tools can also be leveraged in this area while maintaining a controlled blast radius.
Observability - Observability is critical for monitoring the health and performance of your clusters. This includes tools for metrics, logs, and traces, as well as effective alerting systems. How well do your observability tools work across single and multi-cluster setups? Simplicity and reliability in observability tooling are key. Beyond that, have reliability practices in place and track system downtime and related metrics.
Cost Management and Resource Utilization - How effectively does your team or organization utilize Kubernetes cluster resources? Are you optimizing workloads to reduce underutilization or overprovisioning? Additionally, how well do you monitor and control cluster costs? Multi-tenancy (vCluster), OpenCost, or a combination of such tools can help you get this right.
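As a rough way to put a number on overprovisioning, the sketch below compares what a workload requests with what it actually uses. The function names, units, and prices are hypothetical, and this is not how OpenCost computes cost allocation.

```python
def utilization(requested: float, used: float) -> float:
    """Fraction of requested resources actually used (1.0 means fully used)."""
    if requested <= 0:
        raise ValueError("requested must be positive")
    return used / requested


def idle_cost(requested: float, used: float, cost_per_unit: float) -> float:
    """Cost of the requested-but-unused portion; zero if usage exceeds requests."""
    return max(requested - used, 0.0) * cost_per_unit


if __name__ == "__main__":
    # Hypothetical workload: requests 4 CPU cores, uses 1 on average,
    # at an assumed price of 10.0 per core per month.
    print(f"utilization: {utilization(4, 1):.0%}")
    print(f"idle cost per month: {idle_cost(4, 1, 10.0):.2f}")
```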
Standardization and Compliance - This is where the famous marketing term platform engineering comes in: how well do you equip your dev teams to get well-architected Kubernetes clusters that are certified against company standards?
Scaling - Test how your system scales and how it behaves in terms of cost, infrastructure, and so on as it does. Having proper autoscaling strategies is critical.
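To make autoscaling behaviour easier to test on paper, here is a minimal sketch of the replica calculation documented for the Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric). The function name, the default bounds, and the 10% tolerance default are assumptions for illustration; the real controller also accounts for pod readiness, missing metrics, and stabilization windows.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Approximate the HPA replica calculation for a single metric."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    # Skip scaling when the metric is close enough to the target.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp the result to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # Hypothetical case: 4 replicas at 200m CPU each, target is 100m.
    print(desired_replicas(4, 200, 100))
```

Running a few load scenarios through a calculation like this, alongside your per-replica cost, gives a first estimate of what a traffic spike would do to both capacity and spend.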
Networking - Although part of it falls under security, networking deserves its own attention. Whether or not you use a service mesh, having network policies and configurations defined for production workloads is important.
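A common starting point for network policies is a default-deny rule for ingress in each namespace, with explicit allow rules added on top. The sketch below builds such a NetworkPolicy as a Python dict and prints it as JSON, which kubectl accepts as a manifest format. The policy name and the "production" namespace are placeholders.

```python
import json


def default_deny_ingress(namespace: str) -> dict:
    """Build a NetworkPolicy that denies all ingress to pods in a namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-ingress", "namespace": namespace},
        "spec": {
            # An empty podSelector selects every pod in the namespace.
            "podSelector": {},
            # Listing Ingress with no ingress rules means nothing is allowed in.
            "policyTypes": ["Ingress"],
        },
    }


if __name__ == "__main__":
    print(json.dumps(default_deny_ingress("production"), indent=2))
```

Note that network policies only take effect if your cluster's network plugin enforces them.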
Exploring new tools - New organizations come up with new tooling every now and then. How flexible are your infra and overall setup when it comes to adopting them? Examples include Dagger, System Initiative, OpenObserve, and SigNoz, as well as having a 0-CVE system in place, auto-patching, adopting DRA and making AI workloads easy on your clusters, and running Wasm workloads side by side with containers.
These are the most common things that are important for running Kubernetes in production at scale. What challenges do you face with your Kubernetes infrastructure? Let me know!
Work I am doing
I’ve just returned from KubeCon NA, and it took about a week to adjust to the timezone. Once settled, I dove deep into my work on vCluster for multi-tenancy and started preparing for my talks at KubeCon India and SOSS Community Days. KubeCon truly feels like a never-ending cycle, and next year it’s going to be even bigger!
I had the privilege of interviewing the amazing Kelsey Hightower on platform engineering. As always, I loved his storytelling and the insightful examples he shared!
My upcoming talks
Would love to meet you at these events if you are coming :)
No Awesome reads this time, as I want you all to go through the KubeCon NA playlist and revisit some of the awesome sessions from the conference, the co-located events, and Rejekts too!
Awesome repos
Vault-db-injector - The Vault DB Injector automates secure database credential management in Kubernetes using HashiCorp Vault, providing credential injection, renewal, and revocation for pods.
wasmvision - wasmVision gets you going with computer vision.
BuildSafe - Secure your software supply chain and create 0 CVE base images with ease.
kro - Kube Resource Orchestrator
Stork - Storage Orchestration Runtime for Kubernetes
Do subscribe if you learned something new :)