Decision Metadata #
| Attribute | Value |
|---|---|
| Decision ID | ADH-002 |
| Status | Implemented |
| Date | 2025-02-20 |
| Stakeholders | FinOps Team, Platform Engineering, Product, Finance |
| Review Cycle | Monthly |
| Related Decisions | ADH-001 (Multi-Region HA) |
System Context #
A cloud-native e-commerce platform running on AWS with significant traffic variability and escalating infrastructure costs. The platform serves B2C customers with seasonal demand patterns.
System Characteristics #
- Monthly Active Users: 2.5M users
- Traffic Pattern: 10x variance between off-peak and peak (Black Friday)
- Current Monthly Cost: $185K (growing 15% MoM)
- Architecture: Microservices on EKS, RDS Aurora, ElastiCache, S3
- Performance SLA: P95 latency < 200ms, P99 < 500ms
Cost Growth Trajectory #
Current Cost Breakdown #
| Service Category | Monthly Cost | % of Total | Growth Rate |
|---|---|---|---|
| Compute (EKS) | $98K | 53% | +18% MoM |
| Database (Aurora) | $42K | 23% | +12% MoM |
| Cache (ElastiCache) | $18K | 10% | +8% MoM |
| Storage (S3, EBS) | $15K | 8% | +5% MoM |
| Network (Data Transfer) | $12K | 6% | +10% MoM |
Triggering Event #
Q4 2024 Financial Review: The CFO raised concerns about the cloud cost trajectory exceeding the revenue growth rate (cost +15% vs revenue +12% month-over-month). The board set a 25% cost reduction target for 2025 without compromising customer experience.
Problem Statement #
How do we optimize cloud infrastructure costs to achieve 25% reduction while maintaining performance SLAs and supporting business growth?
Key Challenges #
- Unpredictable Traffic Patterns: Daily variance of 3-5x, seasonal spikes of 10x
- Performance Sensitivity: E-commerce conversion rate drops 7% per 100ms latency increase
- Auto-Scaling Inefficiency: Current strategy over-provisions by 40% during off-peak
- Reserved Capacity Risk: Long-term commitments conflict with growth uncertainty
- Multi-Dimensional Optimization: Need to balance cost, performance, and flexibility
Workload Analysis #
Traffic Pattern Discovery (3-month analysis):
Key Findings:
- Baseline Load: 20-30% capacity utilized 60% of the time
- Predictable Peaks: 18:00-22:00 daily, weekends +30%
- Seasonal Spikes: Black Friday (10x), Holiday Season (5x), Prime Day (8x)
- Over-Provisioning: Average utilization only 42% despite auto-scaling
Options Considered #
Option 1: Performance-Optimized Auto-Scaling (Status Quo) #
Current Configuration:
```yaml
# Kubernetes HPA Configuration
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-service
  minReplicas: 50
  maxReplicas: 500
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50   # Conservative threshold
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100               # Aggressive scale-up
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10                # Conservative scale-down
          periodSeconds: 60
```
Characteristics:
- Aggressive scale-up (double capacity in 1 minute)
- Conservative scale-down (10% reduction per minute)
- High minimum replica count for instant readiness
- CPU threshold at 50% to maintain headroom
Pros:
- Excellent performance (P95: 145ms, P99: 380ms)
- Zero performance degradation during traffic spikes
- Simple operational model
Cons:
- High cost due to over-provisioning
- Average CPU utilization: 42%
- Wasted capacity during off-peak hours
- No cost awareness in scaling decisions
Monthly Cost: $185K (baseline)
Option 2: Cost-Aware Dynamic Scaling #
Proposed Configuration:
```yaml
# Cost-optimized HPA with custom metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-service-hpa-cost-optimized
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-service
  minReplicas: 15                  # Reduced from 50
  maxReplicas: 500
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # Higher threshold
    - type: Pods
      pods:
        metric:
          name: http_request_latency_p95
        target:
          type: AverageValue
          averageValue: "200"      # SLA-based scaling
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 120   # Slower reaction
      policies:
        - type: Percent
          value: 50                # Moderate scale-up
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 180
      policies:
        - type: Percent
          value: 25                # Faster scale-down
          periodSeconds: 60
```
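The `http_request_latency_p95` pod metric above is not built into Kubernetes; it has to be served by a custom-metrics adapter. Below is a minimal sketch of one way to expose it, assuming a Prometheus plus prometheus-adapter pipeline; the platform's actual monitoring stack (Datadog/CloudWatch) would use its own adapter, and the histogram metric name is an assumption.

```yaml
# prometheus-adapter rule (illustrative): expose a per-pod p95 latency, in milliseconds,
# under the custom metric name "http_request_latency_p95" consumed by the HPA above.
rules:
  - seriesQuery: 'http_request_duration_seconds_bucket{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      as: "http_request_latency_p95"
    metricsQuery: >
      histogram_quantile(0.95,
        sum(rate(http_request_duration_seconds_bucket{<<.LabelMatchers>>}[5m])) by (le, <<.GroupBy>>))
      * 1000
```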
Additional Strategies:
- Spot Instances: 70% of burst capacity on Spot (60% cost savings)
- Vertical Pod Autoscaling: Right-size resource requests
- Cluster Autoscaler Optimization: Faster node termination
- Time-Based Scaling: Pre-scale for known peaks (see the CronJob sketch below)
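One lightweight way to implement the time-based pre-scaling is a CronJob that raises the HPA floor shortly before the known evening peak, with a twin job lowering it again afterwards. A minimal sketch, assuming a service account with RBAC permission to patch HPAs exists; the names, schedule, and replica floor are illustrative, not the production configuration.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: peak-prescale              # hypothetical name
spec:
  schedule: "30 17 * * *"          # 17:30, ahead of the 18:00-22:00 daily peak
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: hpa-patcher   # assumed to exist with permission to patch HPAs
          restartPolicy: OnFailure
          containers:
            - name: patch-hpa
              image: bitnami/kubectl:latest
              command: ["/bin/sh", "-c"]
              args:
                - >-
                  kubectl patch hpa web-service-hpa-cost-optimized
                  --type merge -p '{"spec":{"minReplicas":40}}'
```

A mirror job after 22:00 patches `minReplicas` back down; the same pattern covers weekend and seasonal pre-scaling.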
Pros:
- Estimated 35-40% cost reduction
- Better resource utilization (target 65%)
- Maintains SLA-based performance guardrails
Cons:
- Increased latency during unexpected spikes (P95: 180ms, P99: 480ms)
- Spot instance interruptions (2-3% of capacity)
- Complex configuration and monitoring
- Risk of cascading failures during rapid scale-up
Estimated Monthly Cost: $115K (-38%)
Option 3: Reserved Capacity + Dynamic Scaling Hybrid #
Architecture Overview:
Capacity Planning Strategy:
| Time Period | Reserved | On-Demand | Spot | Total Capacity |
|---|---|---|---|---|
| Off-Peak (00:00-06:00) | 100% | 0% | 0% | 20% of peak |
| Business Hours (09:00-18:00) | 43% | 40% | 17% | 70% of peak |
| Evening Peak (18:00-22:00) | 30% | 40% | 30% | 100% of peak |
| Seasonal Spike | 10% | 30% | 60% | 300% of peak |
Implementation Components:
- Reserved Instances: EC2 RIs for baseline EKS nodes (30% capacity)
- Savings Plans: Compute Savings Plans for predictable workloads
- Scheduled Scaling: Pre-scale for known daily/weekly patterns
- Spot Fleet: Diversified Spot instances for burst capacity
- Intelligent Workload Placement: Priority-based pod scheduling (see the sketch below)
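For the priority-based pod scheduling, user-facing services get a high PriorityClass and batch workloads a low, non-preempting one, so that when Spot capacity disappears the scheduler evicts analytics pods before web pods. A minimal sketch; class names and values are assumptions, not the production configuration.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-web
value: 100000
description: "User-facing services; may preempt lower-priority pods under pressure"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-low
value: 1000
preemptionPolicy: Never            # batch pods queue instead of preempting others
description: "Analytics and reporting jobs; first to be evicted"
```

Workloads opt in by setting `priorityClassName` in their pod spec.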
Pros:
- Balanced cost reduction (28-32%)
- Predictable baseline performance
- Reduced Spot interruption impact
- Better capacity planning
Cons:
- Commitment risk (3-year RIs)
- Moderate operational complexity
- Requires accurate workload forecasting
- Less aggressive cost savings than Option 2
Estimated Monthly Cost: $128K (-31%)
Option 4: Workload Scheduling + Batch Processing #
Strategy:
- Shift non-critical workloads to off-peak hours
- Implement batch processing for analytics and reporting
- Use Lambda for sporadic tasks instead of always-on containers
Pros:
- Maximizes reserved capacity utilization
- Reduces peak demand by 15-20%
Cons:
- Requires application refactoring
- Not suitable for real-time user-facing services
- Limited applicability to e-commerce workload
Estimated Monthly Cost: $145K (-22%) when combined with Option 3
Evaluation Matrix #
| Criteria | Weight | Option 1 (Status Quo) | Option 2 (Cost-Aware) | Option 3 (Hybrid) | Option 4 (Scheduling) |
|---|---|---|---|---|---|
| Cost Reduction | 35% | 2/10 | 10/10 | 8/10 | 7/10 |
| Performance Impact | 30% | 10/10 | 6/10 | 8/10 | 7/10 |
| Predictability | 15% | 9/10 | 5/10 | 8/10 | 6/10 |
| Operational Overhead | 10% | 9/10 | 4/10 | 6/10 | 3/10 |
| Flexibility | 10% | 8/10 | 9/10 | 6/10 | 5/10 |
| Weighted Score | — | 6.75 | 7.35 | 7.60 | 6.25 |
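Each weighted score is the sum of the criterion scores multiplied by their weights (the weights total 100%). For the winning option:

$$\text{Score}_{\text{Option 3}} = 0.35(8) + 0.30(8) + 0.15(8) + 0.10(6) + 0.10(6) = 7.60$$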
Trade-offs Analysis #
Cost vs Performance Frontier #
Key Trade-off Considerations #
1. Cost Reduction vs Performance Risk
- Every 10% cost reduction correlates with ~5ms P95 latency increase
- Acceptable range: P95 < 200ms (current: 145ms, headroom: 55ms)
- Option 3 provides optimal balance: 31% cost reduction, 25ms latency increase
2. Commitment vs Flexibility
- 3-year RIs offer 62% discount but lock capacity
- Workload analysis shows 30% baseline is stable over 18 months
- Mitigated by using 1-year Savings Plans for 50% of commitment
3. Complexity vs Savings
- Option 2 requires custom metrics, Spot orchestration, advanced monitoring
- Option 3 adds moderate complexity with better risk profile
- Team capacity assessment: can handle Option 3 with 2-week training
4. Spot Instance Risk
- Historical interruption rate: 5% for diversified instance types
- Impact mitigation: graceful shutdown, pod disruption budgets
- Acceptable for 30% of capacity with proper fallback
Final Decision #
Selected Option: Hybrid Reserved + Dynamic Scaling (Option 3) with Selective Workload Scheduling (Option 4 elements)
Rationale #
- Achieves Cost Target: 31% reduction meets 25% board requirement with buffer
- Maintains Performance SLA: Projected P95 latency 170ms (within 200ms target)
- Balances Risk: Reserved capacity provides stability, dynamic scaling adds flexibility
- Operationally Feasible: Team can implement within 8-week timeline
- Future-Proof: Supports growth without major architectural changes
Implementation Roadmap #
Phase 1: Analysis & Planning (Weeks 1-2)
- Detailed workload profiling using CloudWatch and Datadog
- Reserved Instance purchase analysis (3-year vs 1-year)
- Spot instance type diversification strategy
- Cost modeling and forecasting
Phase 2: Reserved Capacity Procurement (Week 3)
- Purchase EC2 Reserved Instances: 30 x m5.2xlarge (3-year, all upfront)
- Commit to Compute Savings Plan: $25K/month (1-year)
- RDS Reserved Instances: 2 x db.r5.4xlarge (1-year)
Phase 3: Auto-Scaling Optimization (Weeks 4-5)
- Implement cost-aware HPA configurations
- Deploy Cluster Autoscaler with Spot integration
- Configure Karpenter for intelligent node provisioning (see the NodePool sketch after this list)
- Set up pod priority classes and preemption policies
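A minimal sketch of the Karpenter configuration this phase refers to, mixing Spot and on-demand capacity across diversified instance types. The exact resource kinds and fields depend on the Karpenter release (the v1beta1 NodePool API is assumed here), and the pool name, instance list, and limits are illustrative.

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: burst-capacity               # hypothetical name
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["m5.2xlarge", "m5a.2xlarge", "m6i.2xlarge", "c5.2xlarge", "r5.2xlarge"]
      nodeClassRef:
        kind: EC2NodeClass
        name: default                # assumes an EC2NodeClass named "default" exists
  disruption:
    consolidationPolicy: WhenUnderutilized   # terminate under-used nodes promptly
  limits:
    cpu: "4000"                      # hard cap on provisioned capacity
```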
Phase 4: Spot Fleet Integration (Week 6)
- Deploy Spot instance diversification (5 instance types)
- Implement graceful shutdown handlers
- Configure pod disruption budgets (see the sketch after this list)
- Test Spot interruption scenarios
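A minimal sketch of the pod disruption budget that backs the Spot interruption handling; the label selector and threshold are illustrative. Combined with a preStop hook and a terminationGracePeriodSeconds that fits inside the two-minute Spot reclaim notice, it keeps node drains from taking out too much of the web tier at once.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-service-pdb
spec:
  minAvailable: 80%                  # never drain below 80% of desired web pods
  selector:
    matchLabels:
      app: web-service               # assumed label on the web-service Deployment
```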
Phase 5: Workload Scheduling (Week 7)
- Migrate batch analytics jobs to off-peak hours
- Implement CronJobs for report generation (see the example after this list)
- Move image processing to Lambda for sporadic tasks
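An illustrative CronJob for the shifted report generation, scheduled inside the 00:00-06:00 off-peak window and tagged with the low-priority class so it never competes with user traffic. The image and arguments are hypothetical.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-sales-report         # hypothetical name
spec:
  schedule: "0 2 * * *"              # 02:00, inside the off-peak window
  concurrencyPolicy: Forbid          # never overlap runs
  jobTemplate:
    spec:
      template:
        spec:
          priorityClassName: batch-low   # low-priority class from the scheduling sketch
          restartPolicy: OnFailure
          containers:
            - name: report
              image: registry.example.com/reporting:latest   # hypothetical image
              args: ["--report", "daily-sales"]
```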
Phase 6: Monitoring & Validation (Week 8)
- Deploy cost anomaly detection
- Set up performance regression alerts
- Conduct load testing across scenarios
- Document runbooks and rollback procedures
Cost-Aware Scaling Logic #
Reserved Instance Strategy #
Compute (EKS Nodes):
- 30 x m5.2xlarge (3-year, all upfront): $87K upfront, saves $42K/year
- Covers baseline 30% capacity 24/7
- Breakeven: 18 months (acceptable given stable baseline)
Database (Aurora):
- 2 x db.r5.4xlarge (1-year, partial upfront): $28K upfront, saves $18K/year
- Lower commitment due to potential migration to Aurora Serverless v2
Cache (ElastiCache):
- 4 x cache.r5.large (1-year, no upfront): saves $6K/year
- Flexible commitment for evolving caching strategy
Post-Decision Reflection #
Outcomes Achieved (4 months post-implementation) #
Cost Savings:
- Actual Monthly Cost: $132K (29% reduction vs $185K baseline)
- Annualized Savings: $636K
- ROI: 7.3x (savings vs implementation cost of $87K)
- Missed Target: -2% (achieved 29% vs 31% projected)
Cost Breakdown After Optimization:
| Service Category | Before | After | Savings | % Reduction |
|---|---|---|---|---|
| Compute (EKS) | $98K | $62K | $36K | 37% |
| Database (Aurora) | $42K | $34K | $8K | 19% |
| Cache (ElastiCache) | $18K | $15K | $3K | 17% |
| Storage | $15K | $13K | $2K | 13% |
| Network | $12K | $8K | $4K | 33% |
Performance Impact:
- P50 Latency: 85ms → 92ms (+8%, acceptable)
- P95 Latency: 145ms → 168ms (+16%, within SLA)
- P99 Latency: 380ms → 445ms (+17%, within SLA)
- Availability: 99.95% (unchanged)
Resource Utilization:
- Average CPU Utilization: 42% → 64% (+52% improvement)
- Reserved Instance Utilization: 94% (excellent)
- Spot Instance Interruptions: 3.2% (within tolerance)
- Wasted Capacity: 40% → 18% (55% reduction)
Challenges Encountered #
1. Reserved Instance Sizing Miscalculation
- Issue: Initial RI purchase covered 28% of baseline instead of 30%
- Impact: $3K/month higher on-demand costs than projected
- Resolution: Purchased additional 1-year RIs in Month 2
- Lesson: Build 10% buffer in capacity planning
2. Spot Instance Interruption Spikes
- Issue: Week 3 experienced 12% interruption rate during AWS capacity crunch
- Impact: Temporary latency spike to P95 220ms (SLA breach)
- Resolution: Expanded instance type diversification from 5 to 8 types
- Lesson: Monitor AWS Spot instance advisor daily, maintain 20% on-demand buffer
3. Auto-Scaling Oscillation
- Issue: HPA thrashing during moderate load (scale up/down cycles every 2 minutes)
- Impact: Increased pod churn, connection drops
- Resolution: Tuned stabilization windows (60s → 180s for scale-up; see the fragment below)
- Lesson: Conservative stabilization windows critical for cost-optimized scaling
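For reference, the tuned behavior stanza looked roughly like the fragment below; it slots into `spec.behavior` of the production HPA, and the scale-down values are carried over unchanged and shown only for context.

```yaml
behavior:
  scaleUp:
    stabilizationWindowSeconds: 180  # raised from 60s to stop thrashing under moderate load
    policies:
      - type: Percent
        value: 50
        periodSeconds: 60
  scaleDown:
    stabilizationWindowSeconds: 180  # unchanged
    policies:
      - type: Percent
        value: 25
        periodSeconds: 60
```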
4. Monitoring Blind Spots
- Issue: Cost anomaly detection missed gradual EBS volume growth ($2K/month)
- Impact: Untracked cost increase offsetting savings
- Resolution: Implemented storage lifecycle policies, automated volume cleanup
- Lesson: Comprehensive cost monitoring across all resource types essential
5. Team Operational Overhead
- Issue: First 2 months required 15 hours/week additional SRE time
- Impact: Delayed other projects, team burnout risk
- Resolution: Automated runbooks, improved alerting, knowledge sharing sessions
- Current State: Stabilized at 3 hours/week (acceptable)
Unexpected Benefits #
- Improved Capacity Planning: Workload profiling revealed optimization opportunities in database queries (20% RDS cost reduction)
- Better Observability: Cost-aware monitoring improved overall system understanding
- Cultural Shift: Engineering teams now consider cost in design decisions
- Vendor Leverage: Demonstrated cost discipline improved AWS Enterprise Support negotiations
Performance Deep Dive #
Latency Distribution Analysis:
Key Observations:
- Latency increase concentrated in P95-P99 (tail latency)
- P50-P75 minimally impacted (core user experience preserved)
- Acceptable trade-off: 99% of users experience <170ms latency
Conversion Rate Impact:
- Pre-optimization: 3.2% conversion rate
- Post-optimization: 3.15% conversion rate (-1.6%)
- Revenue impact: -$45K/month
- Net benefit: $636K savings - $540K revenue loss = $96K/year positive
Lessons Learned #
1. Workload Profiling is Critical
- Invested 2 weeks in detailed analysis before implementation
- Discovered 60% of traffic follows predictable patterns
- Enabled confident reserved capacity commitment
2. Start Conservative, Optimize Iteratively
- Initial RI purchase at 25% baseline, expanded to 30% after validation
- Gradual Spot adoption (10% → 20% → 30% over 6 weeks)
- Avoided costly mistakes from aggressive optimization
3. Performance SLAs Must Drive Scaling
- Cost-aware scaling without SLA guardrails risks user experience
- Latency-based HPA metrics prevented excessive cost optimization
- Business metrics (conversion rate) validated technical decisions
4. Spot Instances Require Operational Maturity
- Graceful shutdown handlers essential (prevented 80% of interruption impact)
- Instance type diversification more important than cost savings
- Continuous monitoring of AWS Spot advisor critical
5. Cultural Change Takes Time
- Engineering resistance to “cost over performance” mindset
- Addressed through transparency: shared cost dashboards, monthly reviews
- Celebrated wins: team bonus tied to cost savings achievement
Future Optimization Opportunities #
Short-term (Next 6 months):
- Migrate to Graviton instances (20% additional cost savings)
- Implement Aurora Serverless v2 for non-production environments
- Expand Spot usage to 40% with improved orchestration
Medium-term (6-12 months):
- Evaluate Fargate Spot for batch workloads
- Implement intelligent caching to reduce database load
- Explore multi-cloud arbitrage for burst capacity
Long-term (12+ months):
- Machine learning-based predictive scaling
- FinOps automation platform (custom-built)
- Carbon-aware scheduling for sustainability goals
Continuous Improvement Process #
Monthly Cost Review:
- FinOps team analyzes cost trends and anomalies
- Engineering presents optimization initiatives
- Executive dashboard tracks savings vs performance
Quarterly Capacity Planning:
- Reassess reserved instance commitments
- Adjust scaling policies based on traffic patterns
- Update cost models for business planning
Annual Architecture Review:
- Evaluate new AWS services (e.g., Graviton4, Aurora Limitless)
- Benchmark against industry cost efficiency metrics
- Set next year’s optimization targets
Key Metrics Dashboard #
| Metric | Target | Current | Status |
|---|---|---|---|
| Monthly Cost | < $139K | $132K | ✅ Exceeding |
| Cost per Transaction | < $0.08 | $0.072 | ✅ Exceeding |
| P95 Latency | < 200ms | 168ms | ✅ Meeting |
| Reserved Utilization | > 90% | 94% | ✅ Exceeding |
| Spot Interruption Rate | < 5% | 3.2% | ✅ Meeting |
| CPU Utilization | > 60% | 64% | ✅ Meeting |
References #
- AWS Cost Optimization Best Practices
- Kubernetes Autoscaling Guide
- FinOps Foundation Framework
- The Art of Capacity Planning - John Allspaw
- Internal: Q4 2024 Cloud Cost Review; Platform Engineering Cost Optimization Playbook
Last Updated: 2025-06-20
Next Review: 2025-07-20
Decision Owner: FinOps Lead
Contributors: Platform Engineering, SRE, Finance, Product Management
Cost Savings Achieved: $636K annually (29% reduction)