Decision Metadata #
| Attribute | Value |
|---|---|
| Decision ID | ADH-002 |
| Status | Implemented |
| Date | 2025-02-20 |
| Stakeholders | FinOps Team, Platform Engineering, Product, Finance |
| Review Cycle | Monthly |
| Related Decisions | ADH-001 (Multi-Region HA) |
System Context #
A cloud-native e-commerce platform running on AWS with significant traffic variability and escalating infrastructure costs. The platform serves B2C customers with seasonal demand patterns.
System Characteristics #
- Monthly Active Users: 2.5M users
- Traffic Pattern: 10x variance between off-peak and peak (Black Friday)
- Current Monthly Cost: $185K (growing 15% MoM)
- Architecture: Microservices on EKS, RDS Aurora, ElastiCache, S3
- Performance SLA: P95 latency < 200ms, P99 < 500ms
Cost Growth Trajectory #
Current Cost Breakdown #
| Service Category | Monthly Cost | % of Total | Growth Rate |
|---|---|---|---|
| Compute (EKS) | $98K | 53% | +18% MoM |
| Database (Aurora) | $42K | 23% | +12% MoM |
| Cache (ElastiCache) | $18K | 10% | +8% MoM |
| Storage (S3, EBS) | $15K | 8% | +5% MoM |
| Network (Data Transfer) | $12K | 6% | +10% MoM |
Triggering Event #
Q4 2024 Financial Review: The CFO raised concerns about the cloud cost trajectory exceeding the revenue growth rate (cost +15% vs revenue +12% month-over-month). The board set a 25% cost reduction target for 2025 without compromising customer experience.
Problem Statement #
How do we optimize cloud infrastructure costs to achieve 25% reduction while maintaining performance SLAs and supporting business growth?
Key Challenges #
- Unpredictable Traffic Patterns: Daily variance of 3-5x, seasonal spikes of 10x
- Performance Sensitivity: E-commerce conversion rate drops 7% per 100ms latency increase
- Auto-Scaling Inefficiency: Current strategy over-provisions by 40% during off-peak
- Reserved Capacity Risk: Long-term commitments conflict with growth uncertainty
- Multi-Dimensional Optimization: Need to balance cost, performance, and flexibility
Workload Analysis #
Traffic Pattern Discovery (3-month analysis):
Key Findings:
- Baseline Load: 20-30% capacity utilized 60% of the time
- Predictable Peaks: 18:00-22:00 daily, weekends +30%
- Seasonal Spikes: Black Friday (10x), Holiday Season (5x), Prime Day (8x)
- Over-Provisioning: Average utilization only 42% despite auto-scaling
Options Considered #
Option 1: Performance-Optimized Auto-Scaling (Status Quo) #
Current Configuration:
```yaml
# Kubernetes HPA Configuration
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-service
  minReplicas: 50
  maxReplicas: 500
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50   # Conservative threshold
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100               # Aggressive scale-up
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10                # Conservative scale-down
          periodSeconds: 60
```
Characteristics:
- Aggressive scale-up (double capacity in 1 minute)
- Conservative scale-down (10% reduction per minute)
- High minimum replica count for instant readiness
- CPU threshold at 50% to maintain headroom
Pros:
- Excellent performance (P95: 145ms, P99: 380ms)
- Zero performance degradation during traffic spikes
- Simple operational model
Cons:
- High cost due to over-provisioning
- Average CPU utilization: 42%
- Wasted capacity during off-peak hours
- No cost awareness in scaling decisions
Monthly Cost: $185K (baseline)
Option 2: Cost-Aware Dynamic Scaling #
Proposed Configuration:
```yaml
# Cost-optimized HPA with custom metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-service-hpa-cost-optimized
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-service
  minReplicas: 15                  # Reduced from 50
  maxReplicas: 500
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # Higher threshold
    - type: Pods
      pods:
        metric:
          name: http_request_latency_p95
        target:
          type: AverageValue
          averageValue: "200"      # SLA-based scaling
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 120   # Slower reaction
      policies:
        - type: Percent
          value: 50                # Moderate scale-up
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 180
      policies:
        - type: Percent
          value: 25                # Faster scale-down
          periodSeconds: 60
```
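The `http_request_latency_p95` pod metric above is not built into Kubernetes; it has to be served by a custom-metrics adapter. Below is a minimal sketch of one way to expose it, assuming a Prometheus plus prometheus-adapter pipeline; the platform's actual monitoring stack (Datadog/CloudWatch) would use its own adapter, and the histogram metric name is an assumption.

```yaml
# prometheus-adapter rule (illustrative): expose a per-pod p95 latency, in milliseconds,
# under the custom metric name "http_request_latency_p95" consumed by the HPA above.
rules:
  - seriesQuery: 'http_request_duration_seconds_bucket{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      as: "http_request_latency_p95"
    metricsQuery: >
      histogram_quantile(0.95,
        sum(rate(http_request_duration_seconds_bucket{<<.LabelMatchers>>}[5m])) by (le, <<.GroupBy>>))
      * 1000
```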
Additional Strategies:
- Spot Instances: 70% of burst capacity on Spot (60% cost savings)
- Vertical Pod Autoscaling: Right-size resource requests
- Cluster Autoscaler Optimization: Faster node termination
- Time-Based Scaling: Pre-scale for known peaks (see the CronJob sketch below)
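One lightweight way to implement the time-based pre-scaling is a CronJob that raises the HPA floor shortly before the known evening peak, with a twin job lowering it again afterwards. A minimal sketch, assuming a service account with RBAC permission to patch HPAs exists; the names, schedule, and replica floor are illustrative, not the production configuration.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: peak-prescale              # hypothetical name
spec:
  schedule: "30 17 * * *"          # 17:30, ahead of the 18:00-22:00 daily peak
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: hpa-patcher   # assumed to exist with permission to patch HPAs
          restartPolicy: OnFailure
          containers:
            - name: patch-hpa
              image: bitnami/kubectl:latest
              command: ["/bin/sh", "-c"]
              args:
                - >-
                  kubectl patch hpa web-service-hpa-cost-optimized
                  --type merge -p '{"spec":{"minReplicas":40}}'
```

A mirror job after 22:00 patches `minReplicas` back down; the same pattern covers weekend and seasonal pre-scaling.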
Pros:
- Estimated 35-40% cost reduction
- Better resource utilization (target 65%)
- Maintains SLA-based performance guardrails
Cons:
- Increased latency during unexpected spikes (P95: 180ms, P99: 480ms)
- Spot instance interruptions (2-3% of capacity)
- Complex configuration and monitoring
- Risk of cascading failures during rapid scale-up
Estimated Monthly Cost: $115K (-38%)
Option 3: Reserved Capacity + Dynamic Scaling Hybrid #
Architecture Overview:
Capacity Planning Strategy:
| Time Period | Reserved | On-Demand | Spot | Total Capacity |
|---|---|---|---|---|
| Off-Peak (00:00-06:00) | 100% | 0% | 0% | 20% of peak |
| Business Hours (09:00-18:00) | 43% | 40% | 17% | 70% of peak |
| Evening Peak (18:00-22:00) | 30% | 40% | 30% | 100% of peak |
| Seasonal Spike | 10% | 30% | 60% | 300% of peak |
Implementation Components:
- Reserved Instances: EC2 RIs for baseline EKS nodes (30% capacity)
- Savings Plans: Compute Savings Plans for predictable workloads
- Scheduled Scaling: Pre-scale for known daily/weekly patterns
- Spot Fleet: Diversified Spot instances for burst capacity
- Intelligent Workload Placement: Priority-based pod scheduling (see the sketch below)
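For the priority-based pod scheduling, user-facing services get a high PriorityClass and batch workloads a low, non-preempting one, so that when Spot capacity disappears the scheduler evicts analytics pods before web pods. A minimal sketch; class names and values are assumptions, not the production configuration.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-web
value: 100000
description: "User-facing services; may preempt lower-priority pods under pressure"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-low
value: 1000
preemptionPolicy: Never            # batch pods queue instead of preempting others
description: "Analytics and reporting jobs; first to be evicted"
```

Workloads opt in by setting `priorityClassName` in their pod spec.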
Pros:
- Balanced cost reduction (28-32%)
- Predictable baseline performance
- Reduced Spot interruption impact
- Better capacity planning
Cons:
- Commitment risk (3-year RIs)
- Moderate operational complexity
- Requires accurate workload forecasting
- Less aggressive cost savings than Option 2
Estimated Monthly Cost: $128K (-31%)
Option 4: Workload Scheduling + Batch Processing #
Strategy:
- Shift non-critical workloads to off-peak hours
- Implement batch processing for analytics and reporting
- Use Lambda for sporadic tasks instead of always-on containers
Pros:
- Maximizes reserved capacity utilization
- Reduces peak demand by 15-20%
Cons:
- Requires application refactoring
- Not suitable for real-time user-facing services
- Limited applicability to e-commerce workload
Estimated Monthly Cost: $145K (-22%) when combined with Option 3
Evaluation Matrix #
| Criteria | Weight | Option 1 (Status Quo) | Option 2 (Cost-Aware) | Option 3 (Hybrid) | Option 4 (Scheduling) |
|---|---|---|---|---|---|
| Cost Reduction | 35% | 2/10 | 10/10 | 8/10 | 7/10 |
| Performance Impact | 30% | 10/10 | 6/10 | 8/10 | 7/10 |
| Predictability | 15% | 9/10 | 5/10 | 8/10 | 6/10 |
| Operational Overhead | 10% | 9/10 | 4/10 | 6/10 | 3/10 |
| Flexibility | 10% | 8/10 | 9/10 | 6/10 | 5/10 |
| Weighted Score | — | 6.75 | 7.35 | 7.60 | 6.25 |
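Each weighted score is the sum of the criterion scores multiplied by their weights (the weights total 100%). For the winning option:

$$\text{Score}_{\text{Option 3}} = 0.35(8) + 0.30(8) + 0.15(8) + 0.10(6) + 0.10(6) = 7.60$$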
Trade-offs Analysis #
Cost vs Performance Frontier #
Key Trade-off Considerations #
1. Cost Reduction vs Performance Risk
- Every 10% cost reduction correlates with ~5ms P95 latency increase
- Acceptable range: P95 < 200ms (current: 145ms, headroom: 55ms)
- Option 3 provides optimal balance: 31% cost reduction, 25ms latency increase
2. Commitment vs Flexibility
- 3-year RIs offer 62% discount but lock capacity
- Workload analysis shows 30% baseline is stable over 18 months
- Mitigated by using 1-year Savings Plans for 50% of commitment
3. Complexity vs Savings
- Option 2 requires custom metrics, Spot orchestration, advanced monitoring
- Option 3 adds moderate complexity with better risk profile
- Team capacity assessment: can handle Option 3 with 2-week training
4. Spot Instance Risk
- Historical interruption rate: 5% for diversified instance types
- Impact mitigation: graceful shutdown, pod disruption budgets
- Acceptable for 30% of capacity with proper fallback
Final Decision #
Selected Option: Hybrid Reserved + Dynamic Scaling (Option 3) with Selective Workload Scheduling (Option 4 elements)
Rationale #
- Achieves Cost Target: 31% reduction meets 25% board requirement with buffer
- Maintains Performance SLA: Projected P95 latency 170ms (within 200ms target)
- Balances Risk: Reserved capacity provides stability, dynamic scaling adds flexibility
- Operationally Feasible: Team can implement within 8-week timeline
- Future-Proof: Supports growth without major architectural changes
Implementation Roadmap #
Phase 1: Analysis & Planning (Weeks 1-2)
- Detailed workload profiling using CloudWatch and Datadog
- Reserved Instance purchase analysis (3-year vs 1-year)
- Spot instance type diversification strategy
- Cost modeling and forecasting
Phase 2: Reserved Capacity Procurement (Week 3)
- Purchase EC2 Reserved Instances: 30 x m5.2xlarge (3-year, all upfront)
- Commit to Compute Savings Plan: $25K/month (1-year)
- RDS Reserved Instances: 2 x db.r5.4xlarge (1-year)
Phase 3: Auto-Scaling Optimization (Weeks 4-5)
- Implement cost-aware HPA configurations
- Deploy Cluster Autoscaler with Spot integration
- Configure Karpenter for intelligent node provisioning (see the NodePool sketch after this list)
- Set up pod priority classes and preemption policies
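A minimal sketch of the Karpenter configuration this phase refers to, mixing Spot and on-demand capacity across diversified instance types. The exact resource kinds and fields depend on the Karpenter release (the v1beta1 NodePool API is assumed here), and the pool name, instance list, and limits are illustrative.

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: burst-capacity               # hypothetical name
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["m5.2xlarge", "m5a.2xlarge", "m6i.2xlarge", "c5.2xlarge", "r5.2xlarge"]
      nodeClassRef:
        kind: EC2NodeClass
        name: default                # assumes an EC2NodeClass named "default" exists
  disruption:
    consolidationPolicy: WhenUnderutilized   # terminate under-used nodes promptly
  limits:
    cpu: "4000"                      # hard cap on provisioned capacity
```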
Phase 4: Spot Fleet Integration (Week 6)
- Deploy Spot instance diversification (5 instance types)
- Implement graceful shutdown handlers
- Configure pod disruption budgets (see the sketch after this list)
- Test Spot interruption scenarios
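A minimal sketch of the pod disruption budget that backs the Spot interruption handling; the label selector and threshold are illustrative. Combined with a preStop hook and a terminationGracePeriodSeconds that fits inside the two-minute Spot reclaim notice, it keeps node drains from taking out too much of the web tier at once.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-service-pdb
spec:
  minAvailable: 80%                  # never drain below 80% of desired web pods
  selector:
    matchLabels:
      app: web-service               # assumed label on the web-service Deployment
```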
Phase 5: Workload Scheduling (Week 7)
- Migrate batch analytics jobs to off-peak hours
- Implement CronJobs for report generation (see the example after this list)
- Move image processing to Lambda for sporadic tasks
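An illustrative CronJob for the shifted report generation, scheduled inside the 00:00-06:00 off-peak window and tagged with the low-priority class so it never competes with user traffic. The image and arguments are hypothetical.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-sales-report         # hypothetical name
spec:
  schedule: "0 2 * * *"              # 02:00, inside the off-peak window
  concurrencyPolicy: Forbid          # never overlap runs
  jobTemplate:
    spec:
      template:
        spec:
          priorityClassName: batch-low   # low-priority class from the scheduling sketch
          restartPolicy: OnFailure
          containers:
            - name: report
              image: registry.example.com/reporting:latest   # hypothetical image
              args: ["--report", "daily-sales"]
```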
Phase 6: Monitoring & Validation (Week 8)
- Deploy cost anomaly detection
- Set up performance regression alerts
- Conduct load testing across scenarios
- Document runbooks and rollback procedures
Cost-Aware Scaling Logic #
Reserved Instance Strategy #
Compute (EKS Nodes):
- 30 x m5.2xlarge (3-year, all upfront): $87K upfront, saves $42K/year
- Covers baseline 30% capacity 24/7
- Breakeven: 18 months (acceptable given stable baseline)
Database (Aurora):
- 2 x db.r5.4xlarge (1-year, partial upfront): $28K upfront, saves $18K/year
- Lower commitment due to potential migration to Aurora Serverless v2
Cache (ElastiCache):
- 4 x cache.r5.large (1-year, no upfront): saves $6K/year
- Flexible commitment for evolving caching strategy
Post-Decision Reflection #
Outcomes Achieved (4 months post-implementation) #
Cost Savings:
- Actual Monthly Cost: $132K (29% reduction vs $185K baseline)
- Annualized Savings: $636K
- ROI: 7.3x (savings vs implementation cost of $87K)
- Missed Target: -2% (achieved 29% vs 31% projected)
Cost Breakdown After Optimization:
| Service Category | Before | After | Savings | % Reduction |
|---|---|---|---|---|
| Compute (EKS) | $98K | $62K | $36K | 37% |
| Database (Aurora) | $42K | $34K | $8K | 19% |
| Cache (ElastiCache) | $18K | $15K | $3K | 17% |
| Storage | $15K | $13K | $2K | 13% |
| Network | $12K | $8K | $4K | 33% |
Performance Impact:
- P50 Latency: 85ms → 92ms (+8%, acceptable)
- P95 Latency: 145ms → 168ms (+16%, within SLA)
- P99 Latency: 380ms → 445ms (+17%, within SLA)
- Availability: 99.95% (unchanged)
Resource Utilization:
- Average CPU Utilization: 42% → 64% (+52% improvement)
- Reserved Instance Utilization: 94% (excellent)
- Spot Instance Interruptions: 3.2% (within tolerance)
- Wasted Capacity: 40% → 18% (55% reduction)
Challenges Encountered #
1. Reserved Instance Sizing Miscalculation
- Issue: Initial RI purchase covered 28% of baseline instead of 30%
- Impact: $3K/month higher on-demand costs than projected
- Resolution: Purchased additional 1-year RIs in Month 2
- Lesson: Build 10% buffer in capacity planning
2. Spot Instance Interruption Spikes
- Issue: Week 3 experienced 12% interruption rate during AWS capacity crunch
- Impact: Temporary latency spike to P95 220ms (SLA breach)
- Resolution: Expanded instance type diversification from 5 to 8 types
- Lesson: Monitor AWS Spot instance advisor daily, maintain 20% on-demand buffer
3. Auto-Scaling Oscillation
- Issue: HPA thrashing during moderate load (scale up/down cycles every 2 minutes)
- Impact: Increased pod churn, connection drops
- Resolution: Tuned stabilization windows (60s → 180s for scale-up; see the fragment below)
- Lesson: Conservative stabilization windows critical for cost-optimized scaling
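For reference, the tuned behavior stanza looked roughly like the fragment below; it slots into `spec.behavior` of the production HPA, and the scale-down values are carried over unchanged and shown only for context.

```yaml
behavior:
  scaleUp:
    stabilizationWindowSeconds: 180  # raised from 60s to stop thrashing under moderate load
    policies:
      - type: Percent
        value: 50
        periodSeconds: 60
  scaleDown:
    stabilizationWindowSeconds: 180  # unchanged
    policies:
      - type: Percent
        value: 25
        periodSeconds: 60
```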
4. Monitoring Blind Spots
- Issue: Cost anomaly detection missed gradual EBS volume growth ($2K/month)
- Impact: Untracked cost increase offsetting savings
- Resolution: Implemented storage lifecycle policies, automated volume cleanup
- Lesson: Comprehensive cost monitoring across all resource types essential
5. Team Operational Overhead
- Issue: First 2 months required 15 hours/week additional SRE time
- Impact: Delayed other projects, team burnout risk
- Resolution: Automated runbooks, improved alerting, knowledge sharing sessions
- Current State: Stabilized at 3 hours/week (acceptable)
Unexpected Benefits #
- Improved Capacity Planning: Workload profiling revealed optimization opportunities in database queries (20% RDS cost reduction)
- Better Observability: Cost-aware monitoring improved overall system understanding
- Cultural Shift: Engineering teams now consider cost in design decisions
- Vendor Leverage: Demonstrated cost discipline improved AWS Enterprise Support negotiations
Performance Deep Dive #
Latency Distribution Analysis:
Key Observations:
- Latency increase concentrated in P95-P99 (tail latency)
- P50-P75 minimally impacted (core user experience preserved)
- Acceptable trade-off: 99% of users experience <170ms latency
Conversion Rate Impact:
- Pre-optimization: 3.2% conversion rate
- Post-optimization: 3.15% conversion rate (-1.6%)
- Revenue impact: -$45K/month
- Net benefit: $636K savings - $540K revenue loss = $96K/year positive
Lessons Learned #
1. Workload Profiling is Critical
- Invested 2 weeks in detailed analysis before implementation
- Discovered 60% of traffic follows predictable patterns
- Enabled confident reserved capacity commitment
2. Start Conservative, Optimize Iteratively
- Initial RI purchase at 25% baseline, expanded to 30% after validation
- Gradual Spot adoption (10% → 20% → 30% over 6 weeks)
- Avoided costly mistakes from aggressive optimization
3. Performance SLAs Must Drive Scaling
- Cost-aware scaling without SLA guardrails risks user experience
- Latency-based HPA metrics prevented excessive cost optimization
- Business metrics (conversion rate) validated technical decisions
4. Spot Instances Require Operational Maturity
- Graceful shutdown handlers essential (prevented 80% of interruption impact)
- Instance type diversification more important than cost savings
- Continuous monitoring of AWS Spot advisor critical
5. Cultural Change Takes Time
- Engineering resistance to “cost over performance” mindset
- Addressed through transparency: shared cost dashboards, monthly reviews
- Celebrated wins: team bonus tied to cost savings achievement
Future Optimization Opportunities #
Short-term (Next 6 months):
- Migrate to Graviton instances (20% additional cost savings)
- Implement Aurora Serverless v2 for non-production environments
- Expand Spot usage to 40% with improved orchestration
Medium-term (6-12 months):
- Evaluate Fargate Spot for batch workloads
- Implement intelligent caching to reduce database load
- Explore multi-cloud arbitrage for burst capacity
Long-term (12+ months):
- Machine learning-based predictive scaling
- FinOps automation platform (custom-built)
- Carbon-aware scheduling for sustainability goals
Continuous Improvement Process #
Monthly Cost Review:
- FinOps team analyzes cost trends and anomalies
- Engineering presents optimization initiatives
- Executive dashboard tracks savings vs performance
Quarterly Capacity Planning:
- Reassess reserved instance commitments
- Adjust scaling policies based on traffic patterns
- Update cost models for business planning
Annual Architecture Review:
- Evaluate new AWS services (e.g., Graviton4, Aurora Limitless)
- Benchmark against industry cost efficiency metrics
- Set next year’s optimization targets
Key Metrics Dashboard #
| Metric | Target | Current | Status |
|---|---|---|---|
| Monthly Cost | < $139K | $132K | ✅ Exceeding |
| Cost per Transaction | < $0.08 | $0.072 | ✅ Exceeding |
| P95 Latency | < 200ms | 168ms | ✅ Meeting |
| Reserved Utilization | > 90% | 94% | ✅ Exceeding |
| Spot Interruption Rate | < 5% | 3.2% | ✅ Meeting |
| CPU Utilization | > 60% | 64% | ✅ Meeting |
References #
- AWS Cost Optimization Best Practices
- Kubernetes Autoscaling Guide
- FinOps Foundation Framework
- The Art of Capacity Planning - John Allspaw
- Internal: Q4 2024 Cloud Cost Review; Platform Engineering Cost Optimization Playbook
Last Updated: 2025-06-20
Next Review: 2025-07-20
Decision Owner: FinOps Lead
Contributors: Platform Engineering, SRE, Finance, Product Management
Cost Savings Achieved: $636K annually (29% reduction)