erysimum/README.md

Hi, I'm Amit

I'm a DevOps and Cloud Engineer in Melbourne, mostly working with AWS and Kubernetes. A lot of what I do is taking deployments that only work because someone remembers the right steps, and turning them into something boring and repeatable.

I've been in software for over 4 years. These days it's mostly infrastructure as code, GitOps, and trying to make observability actually useful instead of just noisy.

  • Building a full EKS platform right now: GitOps, SLO alerting, some chaos testing, and an AI agent for on-call. More below.
  • Day to day: Terraform, Argo CD, Istio, Prometheus, Pyrra
  • RHCSA certified (RHEL 9)
  • Say hi on LinkedIn

AI SRE Agent (retail-store-ai)

This is what I've built lately. It helps with on-call for the EKS platform below. When an alert fires, it goes and looks at the cluster, works out what's probably wrong, and writes it up in Slack. It never changes anything itself. It tells you what it found and what it'd run, and you decide.

A few bits I'm happy with:

  • When an alert lands, a small FastAPI service starts two investigators at once. Each one is its own claude -p process allowed to touch exactly one thing: one reads Prometheus, one reads Kubernetes, with a quick git-log check on the side. They're real separate processes, not async pretending to be parallel.
  • Once they're done, a final step with no tools pulls it all into clean JSON (validated with Pydantic) and posts a Slack card: what broke, who's hit, the evidence, and commands you can paste.
  • It's locked down on purpose. Each investigator gets only the one tool it needs, everything has a timeout, repeat alerts get dropped, and the Slack post is guarded.
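The fan-out, timeout, and repeat-alert handling described above could look roughly like this. It is a minimal sketch rather than the agent's actual code: `run_investigator`, `investigate`, and `AlertDeduper` are hypothetical names, the 60-second timeout and 5-minute dedupe window are assumed values, and the commands are stand-ins for the real `claude -p` invocations, whose exact arguments are not shown here.

```python
import subprocess
import sys
import time
from concurrent.futures import ThreadPoolExecutor


def run_investigator(name, cmd, timeout_s):
    """Run one investigator as its own OS process and capture its output."""
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
        return {"name": name, "ok": done.returncode == 0, "output": done.stdout.strip()}
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left hanging.
        return {"name": name, "ok": False, "output": "timed out"}


def investigate(commands, timeout_s=60.0):
    """Start every investigator at once and wait for all of them to finish."""
    with ThreadPoolExecutor(max_workers=len(commands)) as pool:
        futures = [
            pool.submit(run_investigator, name, cmd, timeout_s)
            for name, cmd in commands.items()
        ]
        # Results come back in the same order the commands were given.
        return [f.result() for f in futures]


class AlertDeduper:
    """Drop an alert if the same fingerprint was accepted within the window."""

    def __init__(self, window_s=300.0):
        self.window_s = window_s
        self._seen = {}  # fingerprint -> time it was last accepted

    def is_repeat(self, fingerprint, now=None):
        now = time.monotonic() if now is None else now
        last = self._seen.get(fingerprint)
        if last is not None and now - last < self.window_s:
            return True
        self._seen[fingerprint] = now
        return False


if __name__ == "__main__":
    # Stand-in commands; the real ones would be the two `claude -p` processes.
    demo = {
        "prometheus": [sys.executable, "-c", "print('metrics checked')"],
        "kubernetes": [sys.executable, "-c", "print('cluster checked')"],
    }
    for result in investigate(demo, timeout_s=10.0):
        print(result)
```

The threads only wait on the child processes; the investigators themselves run as separate OS processes, which is the property the first bullet calls out.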

I tried to be honest about where it slips, too. It correctly says "no data on that" instead of inventing numbers, and it caught a stale alert on a system that had already recovered. But once it blamed a random pod restart instead of the fault I'd actually injected, because it had no way of knowing a human caused it. That's exactly why it only ever advises and keeps a person in the loop.

Want to see it run? Chapter 5 of the walkthrough has screenshots of the whole thing: the subagents kicking off, then the Slack card with root cause, evidence, and the fix.

🔗 github.com/erysimum/retail-store-ai


The platform it runs on — EKS

A full Kubernetes platform on AWS EKS that I built and run solo. I split it across five repos on purpose, the way different teams would own different pieces at a real company. A 5-service retail app runs on top, wired up so I can follow a single request all the way through.

What it does:

  • One terraform apply brings up around 100 AWS resources in about 15 minutes. No clicking around the console.
  • Argo CD keeps the cluster in step with Git, so what's running is always what's committed.
  • SLOs done properly with Pyrra (multi-window burn rate, the Google SRE way), with alerts to Slack or PagerDuty depending on severity.
  • I can break things on purpose with Istio fault injection. No code changes, no restarts.
  • I checked it really works by putting a real order through all 5 services and watching it land in Postgres.
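For readers new to multi-window burn-rate alerting, here is a small sketch of the arithmetic behind it. It assumes the fast-burn page condition from the Google SRE Workbook (a burn rate above 14.4 on both a 1-hour and a 5-minute window, for a 30-day SLO); the windows and thresholds Pyrra actually generates for this platform may differ, and the function names are hypothetical.

```python
def burn_rate(error_ratio, slo_target):
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    error_budget = 1.0 - slo_target  # e.g. about 0.001 for a 99.9% SLO
    return error_ratio / error_budget


def should_page(long_window_errors, short_window_errors, slo_target, threshold=14.4):
    """Page only if BOTH windows are burning faster than the threshold.

    The long window (assumed 1h) shows the burn is significant; the short
    window (assumed 5m) shows it is still happening, so an alert clears
    quickly once the service has recovered.
    """
    return (
        burn_rate(long_window_errors, slo_target) > threshold
        and burn_rate(short_window_errors, slo_target) > threshold
    )


if __name__ == "__main__":
    slo = 0.999
    # 2% of requests failing in both windows: burn rate of roughly 20.
    print(should_page(0.02, 0.02, slo))
    # Long window still elevated, short window back to normal: no page.
    print(should_page(0.02, 0.0005, slo))
```

Requiring both windows is what stops a recovered system from continuing to page on the strength of the long window alone.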

Rather see it than read about it? The walkthrough goes through the whole thing in screenshots: first traffic and SLOs, a Locust load test, tracing a failure down to one broken request, fault injection, and the AI agent's diagnosis at the end.

The five repos:

| Repo | What it owns |
| --- | --- |
| retail-store-infra | Terraform: VPC, EKS, RDS, ECR, IAM, observability stack. Start here. |
| retail-store-platform | Cluster policies: namespaces, RBAC, NetworkPolicy, quotas (Kustomize) |
| retail-store-gitops | Helm charts, Argo CD config, SLOs, dashboards, fault injection |
| retail-store-app | The polyglot microservices app |
| retail-store-ai | The AI SRE agent from above |

Tech I work with

AWS Kubernetes Terraform Argo CD Istio Helm Docker Prometheus Grafana GitHub Actions Jenkins Python FastAPI Pydantic Claude Code Bash PostgreSQL

  • Cloud (AWS): EKS · VPC · IAM · RDS · Lambda · API Gateway · SQS · CloudWatch
  • Platform: Kubernetes · Docker · Helm · Istio · NGINX
  • GitOps & CI/CD: Argo CD · GitHub Actions (OIDC) · Jenkins · GitLab CI · SonarQube · Trivy
  • IaC: Terraform · CloudFormation
  • Observability: Prometheus · Grafana · Pyrra · Alertmanager · PagerDuty
  • AI / Agents: Claude Code (MCP) · FastAPI · Pydantic · asyncio


📊 Other projects

Soccerize is a real-time soccer app on AWS EKS, built around Lambda, DynamoDB Streams and API Gateway WebSockets. The fun part was the pipeline: with Jenkins, Trivy and SonarQube I got deploys down from about an hour to roughly 8 minutes, with Argo CD handling promotion.


📍 Melbourne, VIC · 🎓 Master of Applied IT, Victoria University · 🏅 RHCSA (RHEL 9)

Pinned

  1. retail-store-gitops (Public, Go Template)

    Argo CD GitOps: Helm charts, ApplicationSets, observability

  2. retail-store-infra (Public, HCL)

    Terraform IaC for EKS, VPC, RDS, and secrets infrastructure

  3. retail-store-platform (Public)

    Kubernetes platform layer: namespaces, RBAC, NetworkPolicy, quotas

  4. soccerize (Public, TypeScript)

    Soccerize: an event-driven soccer application that generates automatic soccer commentary

  5. retail-store-ai (Public, Python)

    Advisory AI SRE agent: parallel Claude subagents over MCP (Kubernetes + Prometheus), read-only diagnosis to Slack