Enterprise-Grade Cloud Infrastructure Automation
The Problem
MilanCloud S.r.l., a rapidly growing fintech scale-up, was drowning in operational overhead. Their 40+ microservices were deployed manually via SSH, configuration drift was rampant, and a single production incident in Q3 2024 caused a 14-hour outage that cost €380k in SLA penalties. Their CTO needed a platform team that could bring order to the chaos without halting feature development.
Strategy: Internal Developer Platform (IDP)
We designed and implemented a full Internal Developer Platform following the principles of Team Topologies and Platform Engineering. The goal was to abstract Kubernetes complexity behind golden paths — standardized, self-service workflows for development teams.
1. Infrastructure as Code (IaC)
- All infrastructure defined in Terraform modules with a custom provider registry.
- Multi-cloud support: primary on GCP (GKE Autopilot), disaster recovery on AWS (EKS).
- Environments (dev, staging, prod) are structurally identical — only parameters differ.
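The "structurally identical, only parameters differ" rule can be checked mechanically. The following is a minimal sketch of such a parity check, not the project's actual tooling: it compares the key structure of per-environment parameter sets and ignores the values. The environment contents, key names, and the helper names `key_paths` and `missing_keys` are hypothetical.

```python
from typing import Any, Dict, Set


def key_paths(config: Dict[str, Any], prefix: str = "") -> Set[str]:
    """Return the dotted key paths of a nested dict, ignoring the values."""
    paths: Set[str] = set()
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict) and value:
            paths |= key_paths(value, path)
        else:
            paths.add(path)
    return paths


def missing_keys(environments: Dict[str, Dict[str, Any]]) -> Dict[str, Set[str]]:
    """Map each environment to the key paths it lacks relative to the others."""
    per_env = {name: key_paths(cfg) for name, cfg in environments.items()}
    union: Set[str] = set()
    for paths in per_env.values():
        union |= paths
    return {name: union - paths for name, paths in per_env.items() if union - paths}


# Hypothetical parameter sets: same shape everywhere, different values.
ENVIRONMENTS = {
    "dev": {"service": {"replicas": 1, "cpu_limit": "250m"}, "database": {"tier": "small"}},
    "staging": {"service": {"replicas": 2, "cpu_limit": "500m"}, "database": {"tier": "medium"}},
    "prod": {"service": {"replicas": 6, "cpu_limit": "1000m"}, "database": {"tier": "large"}},
}

if __name__ == "__main__":
    drift = missing_keys(ENVIRONMENTS)
    print(drift or "all environments share the same structure")
```

A check like this could run in CI so that a parameter added to one environment but not the others fails the build before it reaches a cluster.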
2. GitOps with ArgoCD
- Every deployment is a Git commit. ArgoCD watches repository state and reconciles the cluster automatically.
- Progressive delivery: Argo Rollouts runs canary releases (5% → 25% → 50% → 100%) and rolls back automatically when the error rate spikes.
- Secrets managed via External Secrets Operator syncing from HashiCorp Vault.
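The canary progression above follows a simple decision loop: shift traffic one step, observe the error rate, then either continue or roll back. The sketch below simulates that logic in plain Python to make the behavior concrete. It does not use the Argo Rollouts API; the 1% error-rate threshold and the `error_rate_at` callback are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Sequence

# Traffic percentages taken from the rollout description above.
CANARY_STEPS = (5, 25, 50, 100)


def run_canary(
    error_rate_at: Callable[[int], float],
    max_error_rate: float,
    steps: Sequence[int] = CANARY_STEPS,
) -> Dict[str, object]:
    """Walk through the canary steps and stop at the first unhealthy one.

    error_rate_at(weight) is a hypothetical probe that returns the error
    rate observed (0.0 to 1.0) while `weight` percent of traffic hits the
    new version.
    """
    completed: List[int] = []
    for weight in steps:
        observed = error_rate_at(weight)
        if observed > max_error_rate:
            # Unhealthy step: abort and send all traffic back to the stable version.
            return {"status": "rolled_back", "failed_at": weight, "completed": completed}
        completed.append(weight)
    return {"status": "promoted", "failed_at": None, "completed": completed}


if __name__ == "__main__":
    # Simulated release that only degrades once it receives half of the traffic.
    flaky = lambda weight: 0.05 if weight >= 50 else 0.001
    print(run_canary(flaky, max_error_rate=0.01))
```

In this simulation the release passes the 5% and 25% steps and is rolled back at 50%, so at most half of the traffic ever sees the faulty version.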
3. Observability Stack
- OpenTelemetry Collector deployed as a DaemonSet, forwarding traces, metrics, and logs.
- Grafana dashboards with SLO-based alerting (error budget burn rate).
- Distributed tracing across all 40+ services with auto-instrumented spans.
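Burn-rate alerting compares the observed error rate with the rate the SLO allows: a burn rate of 1 spends the error budget exactly over the SLO window, and higher values spend it proportionally faster. The sketch below shows the arithmetic under assumed values. The 99.9% target, the 14.4 paging threshold, and the two-window rule are common conventions chosen for illustration, not settings confirmed for this platform.

```python
def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """Return how fast the error budget is being consumed.

    burn rate = observed error rate / allowed error rate,
    where the allowed error rate is (1 - slo_target).
    """
    if total == 0:
        return 0.0  # no traffic, so no budget is consumed
    allowed = 1.0 - slo_target
    return (errors / total) / allowed


def should_page(long_window: float, short_window: float, threshold: float = 14.4) -> bool:
    """Page only if both windows burn fast.

    The long window shows the burn is significant; the short window shows
    it is still happening. The default threshold is an assumed value.
    """
    return long_window >= threshold and short_window >= threshold


if __name__ == "__main__":
    # Hypothetical counts: 20 failed requests out of 10,000 against a 99.9% SLO.
    rate = burn_rate(errors=20, total=10_000, slo_target=0.999)
    print(f"burn rate: {rate:.1f}")
    print("page on-call:", should_page(long_window=rate, short_window=rate))
```

Requiring both windows to exceed the threshold keeps a brief, already-recovered spike from paging anyone while still catching sustained burns early.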
4. Developer Portal
- Custom Backstage instance with service catalog, TechDocs, and scaffolder templates.
- Developers can spin up a new microservice (including CI/CD, monitoring, and database) in under 5 minutes via a self-service wizard.
Technical Stack
- Orchestration: Kubernetes (GKE Autopilot), Helm, Kustomize
- CI/CD and GitOps: GitHub Actions, ArgoCD, Argo Rollouts
- IaC: Terraform, Crossplane
- Observability: Grafana, Prometheus, Loki, Tempo, OpenTelemetry
- Security: Vault, cert-manager, Falco, OPA Gatekeeper
- Developer Portal: Backstage
Outcomes
- MTTR reduced from 14 hours (840 minutes) to 12 minutes, a reduction of about 98.6%.
- Deployment frequency: from 2/week to 15/day per team.
- Zero configuration drift incidents since go-live.
- Developer satisfaction (internal NPS): from 22 to 78.