What should you learn next as a DevOps Engineer?
Recently, I gave a keynote at Google Cloud Community Days Bengaluru, and it was an amazing event.
The energy, the crowd, everything about the event was just incredible.
I was there the whole day, and questions like these kept coming up every 5–10 minutes:
“I work in DevOps, will AI do all the stuff I am doing?”
“I am in DevOps, but what should I learn about AI?”
“As a Platform Engineer, how much AI should I learn?”
These questions were very common, and I wanted to write a bit more about them in this post (maybe a detailed video in the future too).
AI is mainstream now. It’s in all your workflows → from shopping to content creation, from generating summaries to writing code.
AI is assisting in most areas already and automating a lot of intelligent tasks.
LLMs today can find and fix issues in code (though not always accurately). In some AI IDEs, you can put in a user story, and it creates a plan, then a design, breaks it into tasks, and even executes those tasks: coding, testing, and so on.
This is already happening!!
With the speed at which AI is advancing, it will only get better. No one knows if it’s months or years away from writing complex applications or developing features with ease.
So what are we competing against?
Should we be worried about jobs or should we be thinking about upskilling and working with AI?
When the internet came, or when computers came, what happened?
They created more jobs. People upskilled, built businesses, and became more productive.
This is the same. This is an AI revolution.
In my opinion, we need to focus on becoming more productive with AI and keep an eye on what enhancements are coming in our field.
Let’s take the example of Ops, DevOps, and Platform Engineers.
All AI agents and workloads run somewhere, and that "somewhere" needs to be robust infrastructure. You need to learn how to build platforms that cater to AI agents and workloads.
There are many tools in this space: Kubeflow, MLflow, vLLM, llm-d, and lots of observability tools too.
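To give you a taste of one of these, here is roughly what offline inference with vLLM looks like. This is a minimal sketch, not a production setup: it assumes pip install vllm on a GPU machine, and the model name is just an example.

```python
# A minimal vLLM offline-inference sketch (assumes pip install vllm
# and a GPU node; the model name below is just an example).
from vllm import LLM, SamplingParams

# Load a model; on a real platform this would run inside a
# Kubernetes pod with a GPU resource request.
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")

# Sampling settings control generation randomness and length.
params = SamplingParams(temperature=0.7, max_tokens=64)

# Generate completions for a batch of prompts.
outputs = llm.generate(["Explain what a Kubernetes pod is."], params)
print(outputs[0].outputs[0].text)
```

The same engine also ships an OpenAI-compatible HTTP server, which is closer to what you would actually run behind a Kubernetes Service.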
We're not just talking about tools; you need to understand:
The basics of ML
Supervised and unsupervised learning
Deep learning concepts
What different algorithms do (not the math, just the concepts)
What tokens and parameters are (refer to Andrej Karpathy's YouTube videos for this; there's also a quick sketch after this list)
What LLMs are
What inferencing is and how to make it better
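If tokens still feel abstract, a tiny experiment makes them concrete. Here's a minimal sketch assuming the tiktoken library is installed (pip install tiktoken); cl100k_base is just one example tokenizer.

```python
# A quick way to see what "tokens" are (assumes pip install tiktoken;
# cl100k_base is the tokenizer used by several OpenAI models).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("Kubernetes runs my AI workloads")

print(len(tokens))                        # how many tokens the sentence costs
print(tokens)                             # the integer token IDs
print([enc.decode([t]) for t in tokens])  # the text chunk behind each ID
```

Parameters are a different thing: they are the learned weights inside the model (a "7B" model has roughly 7 billion of them), while tokens are the units of text the model reads and writes.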
You hear a lot about MLOps, so learn MLOps.
It bridges the gap between ML engineers and the platforms you build (usually on Kubernetes).
In short, you need to:
Learn the basics of LLMs
Understand what's being built in AI
Learn about MCP, kGateway, AI agents, RAG, inferencing
Try building a sample RAG application (see the sketch after this list)
Figure out how to make inferencing faster and easier to deploy
Help ML engineers run training and inferencing workloads smoothly on Kubernetes
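For the RAG item above, here's how small a starting point can be. This is a minimal retrieval sketch assuming sentence-transformers is installed (pip install sentence-transformers); the model name and documents are illustrative.

```python
# Minimal RAG retrieval sketch (assumes pip install sentence-transformers;
# the model name and documents below are illustrative).
import numpy as np
from sentence_transformers import SentenceTransformer

docs = [
    "Pods are the smallest deployable units in Kubernetes.",
    "vLLM is a high-throughput inference engine for LLMs.",
    "GitOps manages infrastructure through Git as the source of truth.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = model.encode(docs, normalize_embeddings=True)

def retrieve(question: str, k: int = 1) -> list[str]:
    """Embed the question and return the k most similar documents."""
    q_vec = model.encode([question], normalize_embeddings=True)[0]
    scores = doc_vecs @ q_vec  # cosine similarity (vectors are normalized)
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]

context = retrieve("What is the smallest unit you deploy in Kubernetes?")
# In a full RAG app, you'd now send `context` plus the question to an LLM.
print(context)
```

Swap the in-memory list for a vector database and add an LLM call on top of the retrieved context, and you have the skeleton of a real RAG application.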
IMO, AI and models will keep getting smarter.
You need to upskill and get smarter with them.
It’s not AI replacing people; it’s people who use AI efficiently replacing those who don’t.
It’s all about upskilling and there’s a lot on the Kubesimplify YouTube channel that can help you.
Also, I believe orgs will (and should) hire more people so they can build and ship faster using an Engineers + AI combo.
What are your thoughts?
Even if you disagree, I’d love to hear from you.
🔜 Where I’ll Be Speaking Next
IIT Kanpur – 5th September
ContainerDays Hamburg – 9–11 September
CNCF Chandigarh Meetup – 20th September
GITEX Global Dubai – 13–17 October
KCD Sri Lanka – 26th October
KubeCon Atlanta – 10–14 November
If you're at any of these events, let's meet!
DM me to prebook meetings. 😄
🛠️ What’s Coming Next on Kubesimplify?
AWS Course – Final editing phase
Kubernetes Operators Course – 70% recorded
GitOps Course – Started
Until then, keep learning from the existing content and share it with your network!
Awesome Reads
Tuning Linux Swap for Kubernetes: A Deep Dive - Kubernetes is introducing stable support for swap in v1.34, allowing Linux nodes to better handle memory pressure by offloading inactive memory to disk, but this requires careful tuning of Linux kernel parameters like vm.swappiness, vm.min_free_kbytes, and vm.watermark_scale_factor to avoid performance degradation and OOM kills. Through extensive testing, the article demonstrates how adjusting these settings can create a safe buffer zone for memory reclamation, enhancing node stability without interfering with Kubernetes’ eviction mechanisms.
Kubernetes v1.34: Of Wind & Will - Kubernetes v1.34 introduces 58 enhancements, including major stable features like Dynamic Resource Allocation (DRA), Linux swap support, pod-level resource requests, structured authentication, and container-specific restart rules, marking a continued push toward fine-grained control, security, and scalability. The release emphasizes community-driven resilience (“Of Wind & Will”), architectural modularity, and better developer experience, while also deprecating older configurations like manual cgroup driver settings and preparing for the containerd 2.0 transition.
Ditch the Overheating Laptop: Supercharge Your Docker Workflow with Docker Offload - Running resource-heavy Docker builds and containers can overheat your laptop and slow down performance, but Docker Offload solves this by seamlessly shifting the compute to a secure, high-performance cloud instance—all while maintaining your local Docker CLI experience. With support for GPU workloads, shared caches, and ephemeral cloud environments, Docker Offload dramatically boosts performance for AI/ML, CI/CD, and data-intensive tasks without altering your existing workflow.
Building a Scalable, Flexible, Cloud-Native GenAI Platform with Open Source Solutions - The post outlines a reference architecture for a cloud-native GenAI platform centered on Envoy AI Gateway (two-tier gateway) and KServe, giving developers a unified API to route traffic to both external LLM providers and self-hosted models with centralized credential injection, rate limiting/cost controls, and end-to-end observability. This pluggable design lets platform teams scale safely and flexibly—standardizing security and governance without refactoring clients—while optimizing inference via autoscaling, caching, and disaggregated serving.
This is coming on September 8th, and I am so excited for this one!
Awesome Resources
gonzo - A Go-based TUI log analysis tool
Learn from X
https://x.com/kubesimplify/status/1959136167120838740
https://x.com/divamgupta/status/1952762876504187065
https://x.com/kubesimplify/status/1957813091132944735
If you like the newsletter, subscribe for free!