We Deployed 8 Microservices to AWS EKS as a Team. Here's What Actually Happened.

I am a DevOps engineer. I am also beginning to find a sweet spot between DevOps and Agentic AI. It sharpened me!

A behind-the-scenes look at the Spring PetClinic project — the decisions, the breakdowns, and the lessons that stuck.


There's a version of this blog post where I list every tool we used, drop some architecture diagrams, and wrap it up with "it was a great learning experience."

That's not this post.

This is the version where I tell you about the night we had six pods in ImagePullBackOff and had no idea why. Where our CI pipeline kept failing because of a Git push race condition we didn't even know was a thing. Where we fixed one bug and found two more hiding behind it.

That's what learning DevOps actually looks like. And if you're trying to get into this field — or you're already in it and wondering why nothing ever goes clean — this is for you.


What We Built (And Why It Was a Real Challenge)

As part of DMI Cohort 2, an 11-person team including myself deployed the Spring PetClinic application to a production-grade AWS EKS cluster. PetClinic is a well-known reference app — but the microservices version is a different beast. It's not one service. It's eight:

  • api-gateway

  • customers-service

  • visits-service

  • vets-service

  • config-server

  • discovery-server

  • admin-server

  • genai-service

Each one needed its own Docker image, Kubernetes manifests, environment configuration, secrets, health checks, and observability hooks. Multiply that by eight, add a shared config server that every service depended on, and you start to understand the surface area we were working with.

The goal wasn't just to get the app running. We were building something that looked and behaved like a real production deployment — CI/CD pipelines, GitOps with ArgoCD, secrets management via AWS Secrets Manager, and a full observability stack with Prometheus, Grafana, and Zipkin.


My Role: Kubernetes Manifests & Infrastructure Services

I was Team Lead and Kubernetes Engineer, partnered closely with Olalekan Fashola. Our ownership was the K8s layer — writing and maintaining the Kubernetes manifests for every service, handling infrastructure deployments, and making sure what ArgoCD synced to the cluster was actually correct.

That sounds clean on paper. In practice, it meant being the person on the hook when pods weren't coming up.

The decisions that mattered most

1. Manifest structure and config separation

Early on, we had to decide how tightly to couple configuration to the manifests. We went with a config-server-first approach — all Spring Boot services pulled their config from a centralized config-server at startup. This was elegant until it wasn't: if config-server wasn't healthy, nothing else came up. We had to be deliberate about init containers and startup ordering to handle that dependency chain correctly.
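To make the startup-ordering idea concrete, here is a minimal sketch of the kind of check an init container could run before a service starts. It is an illustration, not the project's actual manifest or script: the Service name config-server, port 8888, and the /actuator/health path are assumptions, and the helper names are hypothetical.

```python
import time
import urllib.request
from typing import Callable

# Assumed values: the Service name, port, and Actuator path are illustrative.
CONFIG_SERVER_HEALTH_URL = "http://config-server:8888/actuator/health"


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with HTTP 200, False on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return False


def wait_until_ready(check: Callable[[], bool], attempts: int = 30,
                     delay: float = 2.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call check() until it returns True or the attempts run out."""
    for attempt in range(1, attempts + 1):
        if check():
            return True
        if attempt < attempts:
            sleep(delay)
    return False


def main() -> int:
    """Exit code for the init container: 0 lets the main container start."""
    ready = wait_until_ready(lambda: http_ok(CONFIG_SERVER_HEALTH_URL))
    return 0 if ready else 1
```

An init container would end with something like sys.exit(main()). A non-zero exit makes the init container fail and be retried, so the application container never starts against a config-server that is not answering yet.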

2. Secrets handling via External Secrets Operator

We used the External Secrets Operator (ESO) to pull secrets from AWS Secrets Manager into Kubernetes Secrets at runtime. This keeps credentials out of the repo entirely. We pinned the ESO Helm chart to version 0.9.11 and used the v1beta1 API, a lesson learned after hitting CRD compatibility issues with newer versions. IRSA (IAM Roles for Service Accounts) handled the AWS auth side, keeping the credential chain clean.
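As a rough picture of what an ESO resource on the v1beta1 API looks like, the sketch below builds an ExternalSecret as a Python dictionary and prints it as JSON (which kubectl accepts). Every name in it is hypothetical: the secret name, namespace, store name, Secrets Manager key, and property names are placeholders, and using a ClusterSecretStore is an assumption rather than something the project confirmed.

```python
import json


def external_secret(name: str, namespace: str, store: str,
                    remote_key: str, properties: list[str]) -> dict:
    """Build an ExternalSecret (v1beta1) that copies selected JSON properties
    of one AWS Secrets Manager entry into a Kubernetes Secret."""
    return {
        "apiVersion": "external-secrets.io/v1beta1",
        "kind": "ExternalSecret",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "refreshInterval": "1h",
            # Assumption: a cluster-wide store that authenticates through IRSA.
            "secretStoreRef": {"name": store, "kind": "ClusterSecretStore"},
            "target": {"name": name, "creationPolicy": "Owner"},
            "data": [
                {"secretKey": prop,
                 "remoteRef": {"key": remote_key, "property": prop}}
                for prop in properties
            ],
        },
    }


# Hypothetical example values, not the project's real names.
manifest = external_secret(
    name="customers-db",
    namespace="petclinic",
    store="aws-secrets-manager",
    remote_key="petclinic/customers-db",
    properties=["username", "password"],
)
print(json.dumps(manifest, indent=2))
```

The point of the shape is that the repo only ever holds the reference (remote_key and property names), never the secret values themselves.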

3. ALB Ingress routing for the observability stack

Exposing Prometheus, Grafana, and Zipkin externally through an AWS ALB required careful annotation work and subpath routing configuration. Getting Grafana to serve correctly on a subpath — rather than the root — was one of those issues that looks trivial and costs you two hours.
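The subpath problem usually comes down to two things agreeing with each other: the Ingress path and Grafana's own idea of its root URL. The sketch below shows one way that could look, under assumptions: the hostname, namespace, Service names, and ports are placeholders, and plain HTTP is used only to keep the example short. The two Grafana settings (GF_SERVER_ROOT_URL and GF_SERVER_SERVE_FROM_SUB_PATH) and the two ALB annotations are real, but this is not necessarily the exact set the team used.

```python
import json


def grafana_subpath_env(host: str, subpath: str) -> list[dict]:
    """Env vars that tell Grafana it is served under a subpath, not at /."""
    return [
        {"name": "GF_SERVER_ROOT_URL", "value": f"http://{host}{subpath}/"},
        {"name": "GF_SERVER_SERVE_FROM_SUB_PATH", "value": "true"},
    ]


def alb_ingress(name: str, namespace: str,
                routes: dict[str, tuple[str, int]]) -> dict:
    """Build one ALB-backed Ingress that maps path prefixes to Services."""
    paths = [
        {
            "path": prefix,
            "pathType": "Prefix",
            "backend": {"service": {"name": svc, "port": {"number": port}}},
        }
        for prefix, (svc, port) in routes.items()
    ]
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "annotations": {
                "alb.ingress.kubernetes.io/scheme": "internet-facing",
                "alb.ingress.kubernetes.io/target-type": "ip",
            },
        },
        "spec": {"ingressClassName": "alb",
                 "rules": [{"http": {"paths": paths}}]},
    }


# Hypothetical Service names and ports, all assumed to live in one namespace.
ingress = alb_ingress("observability", "monitoring", {
    "/grafana": ("grafana", 3000),
    "/zipkin": ("zipkin", 9411),
})
env = grafana_subpath_env("example.com", "/grafana")
print(json.dumps(ingress, indent=2))
```

If the Ingress forwards /grafana but Grafana still believes it lives at the root, the login page tends to load while its assets and redirects point at the wrong paths, which is the "looks trivial, costs two hours" failure described above.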


The Architecture

Here's a simplified view of how the pieces connected:

CI/CD: GitHub Actions → ECR (Docker images) → ArgoCD sync
Infra: Terraform (EKS cluster, IAM, networking)

What Actually Went Wrong (The Honest Part)

ImagePullBackOff across half the cluster

We hit a wall where multiple pods couldn't pull images from ECR. The root cause was an IAM identity mismatch — the EKS nodes weren't authenticating to ECR correctly because the Terraform IAM configuration was missing the right policy attachments. No error message tells you that directly. You have to go digging through CloudTrail logs and kubectl describe output until the picture forms.
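One check that can shorten this kind of hunt is comparing the node role's attached policies against what worker nodes normally need. The sketch below parses the JSON that `aws iam list-attached-role-policies --role-name <role>` prints and reports what is missing. The list of required policies is an assumption based on the standard AWS managed policies for EKS nodes, not a record of what the project's Terraform was missing, and the function name is hypothetical.

```python
import json

# Assumption: the usual AWS managed policies for EKS worker nodes.
# AmazonEC2ContainerRegistryReadOnly is the one that allows pulling from ECR.
REQUIRED_NODE_POLICIES = {
    "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy",
    "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy",
    "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly",
}


def missing_node_policies(cli_output: str) -> list[str]:
    """Return required policy ARNs absent from the CLI's JSON output."""
    attached = {
        policy["PolicyArn"]
        for policy in json.loads(cli_output)["AttachedPolicies"]
    }
    return sorted(REQUIRED_NODE_POLICIES - attached)


# Sample input shaped like the CLI output, with the ECR policy left out.
sample = json.dumps({
    "AttachedPolicies": [
        {"PolicyName": "AmazonEKSWorkerNodePolicy",
         "PolicyArn": "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy"},
        {"PolicyName": "AmazonEKS_CNI_Policy",
         "PolicyArn": "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy"},
    ]
})
print(missing_node_policies(sample))
```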

ArgoCD CRD annotation size errors

ArgoCD's default client-side apply records the last-applied configuration in an annotation (kubectl.kubernetes.io/last-applied-configuration). When our CRD definitions grew large enough, that annotation pushed the object past Kubernetes' 256 KiB annotation size limit and the sync would fail silently. The fix was switching to server-side apply, which does not write that annotation. Not obvious. Not documented prominently. Found it the hard way.
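The arithmetic behind that failure is simple enough to check ahead of time. Kubernetes caps the combined size of an object's annotation keys and values at 256 KiB, and client-side apply stores the whole manifest in one annotation. The sketch below estimates whether a manifest would cross that line; the function names are hypothetical, and the sample manifest is synthetic.

```python
import json
from typing import Optional

# Kubernetes limit on the combined size of all annotation keys and values.
ANNOTATION_LIMIT_BYTES = 256 * 1024
LAST_APPLIED = "kubectl.kubernetes.io/last-applied-configuration"

# Annotation that opts a single resource into server-side apply in ArgoCD.
SERVER_SIDE_APPLY = {"argocd.argoproj.io/sync-options": "ServerSideApply=true"}


def annotations_size(annotations: dict[str, str]) -> int:
    """Total bytes of all annotation keys plus values."""
    return sum(len(k.encode()) + len(v.encode())
               for k, v in annotations.items())


def exceeds_limit_with_client_side_apply(
        manifest_json: str,
        existing: Optional[dict[str, str]] = None) -> bool:
    """True if storing the manifest as last-applied would break the limit."""
    annotations = dict(existing or {})
    annotations[LAST_APPLIED] = manifest_json
    return annotations_size(annotations) > ANNOTATION_LIMIT_BYTES


# Synthetic stand-in for a large CRD with a long OpenAPI schema.
small_crd = json.dumps({"kind": "CustomResourceDefinition", "spec": {}})
large_crd = json.dumps({"kind": "CustomResourceDefinition",
                        "spec": {"schema": "x" * 300_000}})
print(exceeds_limit_with_client_side_apply(small_crd))
print(exceeds_limit_with_client_side_apply(large_crd))
```

For resources that would cross the limit, adding the SERVER_SIDE_APPLY annotation to the resource (or setting ServerSideApply=true in the Application's sync options) is the switch the fix refers to.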

Zipkin tracing wasn't working — and the config server was hiding it

Spring Boot 3 changed the property names for distributed tracing. Our services were still using the old Spring Boot 2 property keys, so tracing was silently disabled. To make it worse, the config server was overriding local properties, which meant even after we updated one service directly, the config server's version won out. Took a while to untangle that one.
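A small check like the one below would have surfaced both halves of that problem. It is a sketch under assumptions: the post does not say which keys were involved, so the mapping uses the two best-known renames from the Spring Cloud Sleuth era to Spring Boot 3's Micrometer Tracing properties, and the sample values are made up. The second function models the default precedence where config-server values win over a service's local properties.

```python
# Assumed examples of the rename; the project's actual keys may have differed.
LEGACY_TO_BOOT3 = {
    "spring.zipkin.base-url": "management.zipkin.tracing.endpoint",
    "spring.sleuth.sampler.probability":
        "management.tracing.sampling.probability",
}


def find_legacy_tracing_keys(properties: dict[str, str]) -> dict[str, str]:
    """Map each legacy tracing key found to its Spring Boot 3 replacement."""
    return {key: LEGACY_TO_BOOT3[key]
            for key in properties if key in LEGACY_TO_BOOT3}


def effective_properties(local: dict[str, str],
                         remote: dict[str, str]) -> dict[str, str]:
    """Merge properties the way the post describes: config server wins."""
    return {**local, **remote}


# Hypothetical property sets for one service.
local = {"management.zipkin.tracing.endpoint":
         "http://zipkin:9411/api/v2/spans"}
remote = {"spring.zipkin.base-url": "http://zipkin:9411/",
          "management.tracing.sampling.probability": "0.0"}

merged = effective_properties(local, remote)
print(find_legacy_tracing_keys(remote))
print(merged["management.tracing.sampling.probability"])
```

Note that the rename is not only a key change: the old base URL pointed at the Zipkin host, while the Spring Boot 3 endpoint expects the full span path (/api/v2/spans). Running the check against the config server's copy, not just the service's local file, is what catches the override.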

CI/CD pipeline race conditions

Two engineers pushing to the same branch within seconds of each other caused Git push conflicts in the pipeline. We added retry logic and tightened the branch protection rules. Simple fix in hindsight — obvious only after it burned us.
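The retry idea can be sketched in a few lines. This is an illustration of the pattern, not the team's pipeline code: the post only says retry logic was added, so rebasing between attempts, the attempt count, and the backoff are all assumptions, and the function names are hypothetical.

```python
import subprocess
import time
from typing import Callable


def run_git(*args: str) -> bool:
    """Run a git command and report success as a boolean."""
    return subprocess.run(["git", *args], check=False).returncode == 0


def push_with_retry(push: Callable[[], bool], rebase: Callable[[], bool],
                    attempts: int = 3, delay: float = 2.0,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Try to push; on rejection, rebase onto the remote and try again."""
    for attempt in range(1, attempts + 1):
        if push():
            return True
        if attempt == attempts:
            break
        sleep(delay * attempt)  # wait a little longer after each failure
        if not rebase():
            # A failed rebase means a real conflict that needs a human.
            return False
    return False
```

In a pipeline step this could be called as push_with_retry(lambda: run_git("push", "origin", "HEAD:main"), lambda: run_git("pull", "--rebase", "origin", "main")), where the branch name main is a placeholder. Passing the push and rebase steps in as callables keeps the retry loop testable without touching a real repository.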


What I'd Do Differently

Start observability on day one. We bolted the observability stack on later in the project. If we'd had Prometheus scraping from the beginning, we'd have caught configuration issues earlier instead of debugging blind.

Lock dependency versions immediately. We wasted time on Helm chart compatibility issues that were entirely avoidable. Pin everything from the start. Upgrade intentionally, not accidentally.

Validate the config server before anything else comes up. The config-server dependency chain caused more cascading failures than anything else. A proper health-check gate at the infrastructure level would have saved hours.


What This Experience Actually Gave Me

There's a difference between knowing what a tool does and knowing how it fails. This project gave me the second kind of knowledge — and that's the kind that actually makes you useful on a team.

If you're a student, a career changer, or someone who's been doing cloud work but mostly from tutorials — get yourself into an environment where you're deploying real infrastructure with real people on a real deadline. That pressure is the point.


This Was Part of DMI

This project was a core deliverable of DMI Cohort 2, a structured DevOps mentorship programme run by Pravin Mishra. DMI Cohort 3 kicks off on 27 June.

If this post made you want to get into the room — here's the link: 👉 Apply for DMI Cohort 3


Shoutout to the full team: Faith Samson, Rita Gitamo, Bola Balogun, Paul Nwanochiri, Manish Gantyala, Olalekan Fashola, Temitayo Ali, Ajah Ijeoma, Abeeb Babatunde, and Love Ogujiofor. Couldn't have shipped this without every one of you.