I'm a DevOps and Cloud Engineer in Melbourne, mostly working with AWS and Kubernetes. A lot of what I do is taking deployments that only work because someone remembers the right steps, and turning them into something boring and repeatable.
I've been in software for just over 4 years. These days it's mostly infrastructure as code, GitOps, and trying to make observability actually useful instead of just noisy.
- Building a full EKS platform right now: GitOps, SLO alerting, some chaos testing, and an AI agent for on-call. More below.
- Day to day: Terraform, Argo CD, Istio, Prometheus, Pyrra
- RHCSA certified (RHEL 9)
- Say hi on LinkedIn
This is what I've built lately. It helps with on-call for the EKS platform below. When an alert fires, it goes and looks at the cluster, works out what's probably wrong, and writes it up in Slack. It never changes anything itself. It tells you what it found and what it'd run, and you decide.
A few bits I'm happy with:
- When an alert lands, a small FastAPI service starts two investigators at once. Each one is its own `claude -p` process allowed to touch exactly one thing: one reads Prometheus, one reads Kubernetes, with a quick git-log check on the side. They're real separate processes, not async pretending to be parallel.
- Once they're done, a final step with no tools pulls it all into clean JSON (validated with Pydantic) and posts a Slack card: what broke, who's hit, the evidence, and commands you can paste.
- It's locked down on purpose. Each investigator gets only the one tool it needs, everything has a timeout, repeat alerts get dropped, and the Slack post is guarded.
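As a rough illustration of the fan-out, timeout, and dedupe ideas above, here is a minimal sketch using only the standard library. It is not the code from the repo: the `INVESTIGATORS` commands are harmless stand-ins for the real `claude -p` invocations (whose flags aren't reproduced here), and `run_investigator`, `investigate`, and `AlertDeduper` are hypothetical names. asyncio only does the waiting; each investigator is still a separate OS process.

```python
import asyncio
import sys
import time

# Hypothetical stand-ins: the real service would launch one CLI process per
# investigator. These just print a line so the sketch runs anywhere.
INVESTIGATORS = {
    "prometheus": [sys.executable, "-c", "print('metrics findings')"],
    "kubernetes": [sys.executable, "-c", "print('cluster findings')"],
}


async def run_investigator(name, cmd, timeout_s=30.0):
    """Run one investigator as its own OS process, killing it on timeout."""
    proc = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        stdout, _ = await asyncio.wait_for(proc.communicate(), timeout=timeout_s)
    except asyncio.TimeoutError:
        proc.kill()        # don't leave a stuck investigator behind
        await proc.wait()  # reap the process
        return name, "timed out"
    return name, stdout.decode().strip()


async def investigate(investigators=INVESTIGATORS, timeout_s=30.0):
    """Start every investigator at once and collect results by name."""
    results = await asyncio.gather(
        *(run_investigator(n, c, timeout_s) for n, c in investigators.items())
    )
    return dict(results)


class AlertDeduper:
    """Drop repeats of the same alert fingerprint within a time window."""

    def __init__(self, window_s=300.0):
        self.window_s = window_s
        self._seen = {}  # fingerprint -> time the alert was last accepted

    def is_duplicate(self, fingerprint, now=None):
        now = time.monotonic() if now is None else now
        last = self._seen.get(fingerprint)
        if last is not None and now - last < self.window_s:
            return True
        self._seen[fingerprint] = now
        return False
```

In a webhook handler, the idea would be to call `is_duplicate()` on the alert's fingerprint first and only then `await investigate()`, so a flapping alert doesn't spawn a fresh pair of processes every time it fires.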
I tried to be honest about where it slips, too. It correctly said "no data on that" instead of inventing numbers, and it caught a stale alert on a system that had already recovered. But once it blamed a random pod restart instead of the fault I'd actually injected, because it had no way of knowing a human caused it. That's exactly why it only ever advises and keeps a person in the loop.
Want to see it run? Chapter 5 of the walkthrough has screenshots of the whole thing: the subagents kicking off, then the Slack card with root cause, evidence, and the fix.
🔗 github.com/erysimum/retail-store-ai
A full Kubernetes platform on AWS EKS that I built and run solo. I split it across five repos on purpose, the way different teams would own different pieces at a real company. A 5-service retail app runs on top, wired up so I can follow a single request all the way through.
What it does:
- One `terraform apply` brings up around 100 AWS resources in about 15 minutes. No clicking around the console.
- Argo CD keeps the cluster in step with Git, so what's running is always what's committed.
- SLOs done properly with Pyrra (multi-window burn rate, the Google SRE way), with alerts to Slack or PagerDuty depending on severity.
- I can break things on purpose with Istio fault injection. No code changes, no restarts.
- I checked it really works by putting a real order through all 5 services and watching it land in Postgres.
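For anyone new to multi-window burn rates, the logic behind the SLO alerts can be sketched in a few lines. This is only an illustration of the idea, not what Pyrra generates: Pyrra writes Prometheus rules with its own windows and factors, and the 1h/5m windows and 14.4 threshold below are example values taken from the Google SRE Workbook. `BurnRateWindow`, `burn_rate`, and `should_alert` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateWindow:
    long_window: str   # e.g. "1h": shows the problem is sustained
    short_window: str  # e.g. "5m": shows it is still happening now
    threshold: float   # burn rate both windows must exceed


def burn_rate(error_ratio, slo_target):
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_alert(long_error_ratio, short_error_ratio, slo_target, window):
    """Fire only when BOTH windows burn faster than the threshold."""
    return (
        burn_rate(long_error_ratio, slo_target) > window.threshold
        and burn_rate(short_error_ratio, slo_target) > window.threshold
    )


# Illustrative values: a 99.9% SLO with a fast-burn 1h/5m window pair.
FAST_BURN = BurnRateWindow(long_window="1h", short_window="5m", threshold=14.4)
```

With a 99.9% target, a 2% error ratio in both windows is a burn rate of about 20, so `should_alert(0.02, 0.02, 0.999, FAST_BURN)` fires. If the short window has already dropped back to 0.05% errors, it doesn't, which is what keeps already-recovered incidents from paging.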
Rather see it than read about it? The walkthrough goes through the whole thing in screenshots: first traffic and SLOs, a Locust load test, tracing a failure down to one broken request, fault injection, and the AI agent's diagnosis at the end.
The five repos:
| Repo | What it owns |
|---|---|
| retail-store-infra | Terraform: VPC, EKS, RDS, ECR, IAM, observability stack. Start here. |
| retail-store-platform | Cluster policies: namespaces, RBAC, NetworkPolicy, quotas (Kustomize) |
| retail-store-gitops | Helm charts, Argo CD config, SLOs, dashboards, fault injection |
| retail-store-app | The polyglot microservices app |
| retail-store-ai | The AI SRE agent from above |
- Cloud (AWS): EKS · VPC · IAM · RDS · Lambda · API Gateway · SQS · CloudWatch
- Platform: Kubernetes · Docker · Helm · Istio · NGINX
- GitOps & CI/CD: Argo CD · GitHub Actions (OIDC) · Jenkins · GitLab CI · SonarQube · Trivy
- IaC: Terraform · CloudFormation
- Observability: Prometheus · Grafana · Pyrra · Alertmanager · PagerDuty
- AI / Agents: Claude Code (MCP) · FastAPI · Pydantic · asyncio
Soccerize is a real-time soccer app on AWS EKS, built around Lambda, DynamoDB Streams and API Gateway WebSockets. The fun part was the pipeline: with Jenkins, Trivy and SonarQube I got deploys down from about an hour to roughly 8 minutes, with Argo CD handling promotion.
📍 Melbourne, VIC · 🎓 Master of Applied IT, Victoria University · 🏅 RHCSA (RHEL 9)


