Cloud Cost Management
FinOps maturity model, spot instances, savings plans, reserved capacity
Cloud cost management (FinOps) is the discipline of bringing financial accountability to variable cloud spend. The FinOps Foundation maturity model defines three stages: Crawl (basic tagging, cost reports), Walk (chargeback/showback, rightsizing), and Run (automated anomaly response, unit economics). AWS Compute Savings Plans offer up to a 66% discount versus On-Demand in exchange for a 1- or 3-year commitment to a $/hour spend level, while Spot Instances offer 60–90% savings in exchange for the risk of interruption at two minutes' notice.
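The commitment trade-off can be made concrete with a little arithmetic. The sketch below is a deliberately simplified model, not AWS's billing logic: it assumes a single flat discount for all usage (real Savings Plans rates vary by instance type and Region), and the function and parameter names are illustrative.

```python
def hourly_cost_with_commitment(on_demand_usage: float,
                                commitment: float,
                                discount: float) -> float:
    """Hourly bill in $ under a simplified Savings Plan model.

    on_demand_usage: what this hour's usage would cost at On-Demand rates ($/hr)
    commitment:      committed spend ($/hr), paid whether or not it is used
    discount:        assumed flat discount vs On-Demand, e.g. 0.30 for 30%
    """
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    # Each committed dollar buys 1 / (1 - discount) dollars of
    # On-Demand-equivalent usage.
    covered = commitment / (1 - discount)
    # Usage beyond the covered amount is billed at On-Demand rates.
    overflow = max(0.0, on_demand_usage - covered)
    # The commitment itself is always charged, even if under-used.
    return commitment + overflow
```

Under these assumptions, usage worth $10/hr On-Demand with a $5.60/hr commitment at a 30% discount costs about $7.60/hr. If usage later falls to $4/hr, the bill stays at $5.60/hr, which is more than On-Demand would have cost. That is the over-commitment risk the 1- or 3-year term carries.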
Key Points
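- Savings Plans vs Reserved Instances: Compute Savings Plans are the most flexible option; the committed $/hr applies across instance family, size, Region, OS, and tenancy, and also covers Fargate and Lambda. EC2 Instance Savings Plans give a deeper discount (up to 72%) but lock to one instance family in one Region, and Standard RIs are tied to specific instance attributes in a similar way. Prefer Compute Savings Plans for architectures whose instance mix changes often.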
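- Spot Instance interruption rates vary by instance type and AZ. Use an Auto Scaling group with a mixed instances policy (or EC2 Fleet/Spot Fleet) spanning several instance types and AZs so that capacity can be drawn from many Spot pools. Diversifying this way can keep roughly 95% of desired capacity on Spot even during capacity crunches, although the achievable figure depends on the workload and Region.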
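- AWS Cost Anomaly Detection (part of AWS Cost Management, alongside Cost Explorer) uses ML to flag spend that deviates from the expected baseline. Configure an alert subscription that publishes individual alerts to an SNS topic; because the service works from Cost Explorer data, expect an anomaly to surface up to about 24 hours after the usage occurs.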
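- Resource tagging strategy: enforce mandatory tags (Environment, Team, CostCenter, Product) via AWS Config rules, SCPs, or Azure Policy. Untagged resources cannot be attributed to a team or budget, and on AWS a tag appears in billing reports only after it is activated as a cost allocation tag.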
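- EC2 rightsizing: review CloudWatch CPU and memory utilisation over at least 14 days (memory metrics require the CloudWatch agent; EC2 does not publish them by default). Instances consistently below 20% CPU are candidates for downsizing. Separately, moving compatible workloads to Arm-based Graviton instances can cut cost further; AWS cites up to 40% better price-performance than comparable x86 instances.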
- NAT Gateway data processing charges ($0.045/GB in us-east-1) are often a hidden cost driver. Route S3 and DynamoDB traffic through gateway VPC endpoints, which carry no hourly or data processing charge, and analyse VPC Flow Logs to find the sources of heavy NAT Gateway traffic.
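- S3 Intelligent-Tiering automatically moves objects between frequent and infrequent access tiers with no retrieval fee, in exchange for a small per-object monitoring and automation charge. Objects smaller than 128 KB are never auto-tiered and stay at the frequent access rate, so the class pays off mainly for larger objects with unknown or changing access patterns.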
- FinOps unit economics: measure cost per API request, cost per user, cost per transaction — these metrics tie cloud spend to business value and drive prioritisation of optimisation efforts.
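Two of the figures in the points above lend themselves to quick arithmetic: the NAT Gateway data processing charge and a unit-cost metric. The sketch below is only a back-of-the-envelope helper; the $0.045/GB rate is the figure quoted above (it varies by Region), and the function names and sample numbers are illustrative assumptions.

```python
# Rate quoted in the notes above; an assumption that varies by Region.
NAT_PROCESSING_USD_PER_GB = 0.045


def nat_processing_cost(gb_processed: float,
                        rate: float = NAT_PROCESSING_USD_PER_GB) -> float:
    """Data processing charge in $ for traffic sent through a NAT Gateway."""
    if gb_processed < 0:
        raise ValueError("gb_processed must be non-negative")
    return gb_processed * rate


def unit_cost(total_spend: float, units: int) -> float:
    """Cost per business unit (API request, user, transaction, ...)."""
    if units <= 0:
        raise ValueError("units must be positive")
    return total_spend / units
```

For example, 50,000 GB (about 50 TB) a month through a NAT Gateway comes to roughly $2,250 in processing charges alone at that rate, and a hypothetical $12,000 monthly bill spread over 40 million API requests works out to about $0.0003 per request. Tracking the second number over time shows whether spend is growing faster than the business it supports.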
Real-World Example
Lyft reduced its AWS bill by $20M/year by implementing a FinOps practice: mandatory tagging enforcement via SCPs, automated rightsizing triggered by Lambda on 14-day low-utilisation alerts, and migrating 60% of batch workloads to Spot instances.
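The low-utilisation check behind an automated rightsizing trigger can be sketched in a few lines. This is not Lyft's implementation, which the notes do not describe; it is a minimal illustration that assumes daily average CPU figures have already been fetched (for instance from CloudWatch) into a dictionary. The function name, instance IDs, and thresholds are hypothetical.

```python
from typing import Dict, List


def rightsizing_candidates(daily_cpu: Dict[str, List[float]],
                           threshold: float = 20.0,
                           min_days: int = 14) -> List[str]:
    """Return IDs of instances whose daily average CPU stayed below
    `threshold` percent on each of the last `min_days` days."""
    candidates = []
    for instance_id, samples in daily_cpu.items():
        window = samples[-min_days:]  # most recent min_days samples
        # Skip instances without a full window of data, so that newly
        # launched instances are not flagged prematurely.
        if len(window) >= min_days and max(window) < threshold:
            candidates.append(instance_id)
    return sorted(candidates)


# Hypothetical sample data: one idle instance, one with a busy day,
# and one with only a week of history.
sample = {
    "i-aaa": [10.0] * 14,
    "i-bbb": [10.0] * 13 + [55.0],
    "i-ccc": [5.0] * 7,
}
flagged = rightsizing_candidates(sample)
```

Using the maximum of the window rather than its mean is one reading of "consistently below 20%": a single busy day is enough to keep an instance off the list. A mean-based rule would flag more instances but risks downsizing workloads with periodic peaks.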