Amit Kumar Shahi erysimum

Hi, I'm Amit

I'm a DevOps and Cloud Engineer in Melbourne, mostly working with AWS and Kubernetes. A lot of what I do is taking deployments that only work because someone remembers the right steps, and turning them into something boring and repeatable.

I've been in software for over 4 years. These days it's mostly infrastructure as code, GitOps, and trying to make observability actually useful instead of just noisy.

Building a full EKS platform right now: GitOps, SLO alerting, some chaos testing, and an AI agent for on-call. More below.
Day to day: Terraform, Argo CD, Istio, Prometheus, Pyrra
RHCSA certified (RHEL 9)
Say hi on LinkedIn

AI SRE Agent (`retail-store-ai`)

This is what I've built lately. It helps with on-call for the EKS platform below. When an alert fires, it goes and looks at the cluster, works out what's probably wrong, and writes it up in Slack. It never changes anything itself. It tells you what it found and what it'd run, and you decide.

A few bits I'm happy with:

When an alert lands, a small FastAPI service starts two investigators at once. Each one is its own claude -p process allowed to touch exactly one thing: one reads Prometheus, one reads Kubernetes, with a quick git-log check on the side. They're real separate processes, not async pretending to be parallel.
Once they're done, a final step with no tools pulls it all into clean JSON (validated with Pydantic) and posts a Slack card: what broke, who's hit, the evidence, and commands you can paste.
It's locked down on purpose. Each investigator gets only the one tool it needs, everything has a timeout, repeat alerts get dropped, and the Slack post is guarded.
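
A minimal sketch of that fan-out, not the actual `retail-store-ai` code. It assumes asyncio is used only to launch and await the child processes, so each investigator is still a real OS process. The commands are harmless stand-ins for the real `claude -p` invocations, and `INVESTIGATORS`, `DEDUP_WINDOW_SECONDS`, `is_duplicate`, `run_investigator` and `investigate` are hypothetical names picked for illustration.

```python
import asyncio
import sys
import time
from typing import Dict, List, Optional

# Hypothetical stand-ins: the real agent would launch one `claude -p` process
# per investigator. Here each "investigator" is a tiny Python child process.
INVESTIGATORS: Dict[str, List[str]] = {
    "prometheus": [sys.executable, "-c", "print('error rate elevated on checkout')"],
    "kubernetes": [sys.executable, "-c", "print('checkout pod restarted twice')"],
}

DEDUP_WINDOW_SECONDS = 300.0  # assumed value, not taken from the project
_recent_alerts: Dict[str, float] = {}


def is_duplicate(fingerprint: str, now: Optional[float] = None) -> bool:
    """Return True if this alert fingerprint was already seen inside the window."""
    now = time.monotonic() if now is None else now
    last_seen = _recent_alerts.get(fingerprint)
    if last_seen is not None and now - last_seen < DEDUP_WINDOW_SECONDS:
        return True
    _recent_alerts[fingerprint] = now
    return False


async def run_investigator(name: str, cmd: List[str], timeout: float) -> dict:
    """Run one investigator as its own OS process and kill it on timeout."""
    proc = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        stdout, _ = await asyncio.wait_for(proc.communicate(), timeout=timeout)
    except asyncio.TimeoutError:
        proc.kill()
        await proc.wait()
        return {"investigator": name, "status": "timeout", "findings": ""}
    status = "ok" if proc.returncode == 0 else "error"
    return {"investigator": name, "status": status, "findings": stdout.decode().strip()}


async def investigate(fingerprint: str, timeout: float = 30.0) -> List[dict]:
    """Drop repeat alerts, otherwise run every investigator concurrently."""
    if is_duplicate(fingerprint):
        return []
    jobs = [run_investigator(name, cmd, timeout) for name, cmd in INVESTIGATORS.items()]
    return list(await asyncio.gather(*jobs))


if __name__ == "__main__":
    for report in asyncio.run(investigate("HighErrorRate/checkout")):
        print(report)
```

The findings from both processes would then go to the final no-tools step for the JSON write-up. A second alert with the same fingerprint inside the window returns an empty list, so nothing gets investigated or posted twice.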

I tried to be honest about where it slips, too. It correctly said "no data on that" instead of inventing numbers, and it caught a stale alert on a system that had already recovered. But once it blamed a random pod restart instead of the fault I'd actually injected, because it had no way of knowing a human caused it. That's exactly why it only ever advises and keeps a person in the loop.

Want to see it run? Chapter 5 of the walkthrough has screenshots of the whole thing: the subagents kicking off, then the Slack card with root cause, evidence, and the fix.

🔗 github.com/erysimum/retail-store-ai

The platform it runs on — EKS

A full Kubernetes platform on AWS EKS that I built and run solo. I split it across five repos on purpose, the way different teams would own different pieces at a real company. A 5-service retail app runs on top, wired up so I can follow a single request all the way through.

What it does:

One terraform apply brings up around 100 AWS resources in about 15 minutes. No clicking around the console.
Argo CD keeps the cluster in step with Git, so what's running is always what's committed.
SLOs done properly with Pyrra (multi-window burn rate, the Google SRE way), with alerts to Slack or PagerDuty depending on severity.
I can break things on purpose with Istio fault injection. No code changes, no restarts.
I checked it really works by putting a real order through all 5 services and watching it land in Postgres.
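
For anyone new to multi-window burn-rate alerting, here is a rough sketch of the arithmetic behind it. The 99.9% target, the 30-day period and the 14.4 threshold are the illustrative numbers from the Google SRE Workbook, not values read from this platform's config. Pyrra generates its own windows and factors from the SLO definition, so treat `burn_rate`, `should_alert` and `budget_consumed` as hypothetical helpers for the idea only.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' the error budget is being spent."""
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget


def should_alert(long_ratio: float, short_ratio: float,
                 slo_target: float, threshold: float) -> bool:
    """Fire only if BOTH the long and the short window burn above the threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening, so the alert clears soon after recovery.
    """
    return (burn_rate(long_ratio, slo_target) >= threshold
            and burn_rate(short_ratio, slo_target) >= threshold)


def budget_consumed(burn: float, window_hours: float,
                    slo_period_hours: float = 30 * 24) -> float:
    """Fraction of the whole period's error budget used at this burn rate."""
    return burn * window_hours / slo_period_hours


if __name__ == "__main__":
    slo = 0.999  # assumed target for illustration
    # 2% of requests failing over both the 1h and 5m windows: page.
    print(should_alert(0.02, 0.02, slo, threshold=14.4))
    # Same long window, but the last 5 minutes are healthy again: stay quiet.
    print(should_alert(0.02, 0.001, slo, threshold=14.4))
    # Share of a 30-day budget spent by one hour at burn rate 14.4.
    print(budget_consumed(14.4, window_hours=1.0))
```

Requiring both windows is what keeps this from being noisy: a short spike that has already ended does not page anyone, while a sustained burn does.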

Rather see it than read about it? The walkthrough goes through the whole thing in screenshots: first traffic and SLOs, a Locust load test, tracing a failure down to one broken request, fault injection, and the AI agent's diagnosis at the end.

The five repos:

Repo	What it owns
retail-store-infra	Terraform: VPC, EKS, RDS, ECR, IAM, observability stack. Start here.
retail-store-platform	Cluster policies: namespaces, RBAC, NetworkPolicy, quotas (Kustomize)
retail-store-gitops	Helm charts, Argo CD config, SLOs, dashboards, fault injection
retail-store-app	The polyglot microservices app
retail-store-ai	The AI SRE agent from above

Tech I work with

Cloud (AWS): EKS · VPC · IAM · RDS · Lambda · API Gateway · SQS · CloudWatch
Platform: Kubernetes · Docker · Helm · Istio · NGINX
GitOps & CI/CD: Argo CD · GitHub Actions (OIDC) · Jenkins · GitLab CI · SonarQube · Trivy
IaC: Terraform · CloudFormation
Observability: Prometheus · Grafana · Pyrra · Alertmanager · PagerDuty
AI / Agents: Claude Code (MCP) · FastAPI · Pydantic · asyncio

📊 Other projects

Soccerize is a real-time soccer app on AWS EKS, built around Lambda, DynamoDB Streams and API Gateway WebSockets. The fun part was the pipeline: with Jenkins, Trivy and SonarQube I got deploys down from about an hour to roughly 8 minutes, with Argo CD handling promotion.

📍 Melbourne, VIC · 🎓 Master of Applied IT, Victoria University · 🏅 RHCSA (RHEL 9)
