
Cost Optimization vs Performance: A FinOps-Driven Architecture Decision

Jeff Taakey
Founder, Architect Decision Hub (ADH) | 21+ Year CTO & Multi-Cloud Architect.
Architecture Decision Records - This article is part of a series.

Decision Metadata

| Attribute | Value |
|---|---|
| Decision ID | ADH-002 |
| Status | Implemented |
| Date | 2025-02-20 |
| Stakeholders | FinOps Team, Platform Engineering, Product, Finance |
| Review Cycle | Monthly |
| Related Decisions | ADH-001 (Multi-Region HA) |

System Context

A cloud-native e-commerce platform running on AWS with significant traffic variability and escalating infrastructure costs. The platform serves B2C customers with seasonal demand patterns.

System Characteristics

  • Monthly Active Users: 2.5M users
  • Traffic Pattern: 10x variance between off-peak and peak (Black Friday)
  • Current Monthly Cost: $185K (growing 15% MoM)
  • Architecture: Microservices on EKS, RDS Aurora, ElastiCache, S3
  • Performance SLA: P95 latency < 200ms, P99 < 500ms

Cost Growth Trajectory

```mermaid
xychart-beta
    title "Monthly Infrastructure Cost Growth (Last 12 Months)"
    x-axis [Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec]
    y-axis "Cost ($K)" 80 --> 200
    line [95, 102, 108, 115, 125, 138, 145, 152, 165, 175, 185, 195]
```

Current Cost Breakdown

| Service Category | Monthly Cost | % of Total | Growth Rate |
|---|---|---|---|
| Compute (EKS) | $98K | 53% | +18% MoM |
| Database (Aurora) | $42K | 23% | +12% MoM |
| Cache (ElastiCache) | $18K | 10% | +8% MoM |
| Storage (S3, EBS) | $15K | 8% | +5% MoM |
| Network (Data Transfer) | $12K | 6% | +10% MoM |

Triggering Event

Q4 2024 Financial Review: the CFO raised concerns that the cloud cost trajectory was outpacing revenue growth (15% vs 12% MoM). The board set a 25% cost-reduction target for 2025 without compromising customer experience.

Problem Statement

How do we optimize cloud infrastructure costs to achieve 25% reduction while maintaining performance SLAs and supporting business growth?

Key Challenges

  1. Unpredictable Traffic Patterns: Daily variance of 3-5x, seasonal spikes of 10x
  2. Performance Sensitivity: E-commerce conversion rate drops 7% per 100ms latency increase
  3. Auto-Scaling Inefficiency: Current strategy over-provisions by 40% during off-peak
  4. Reserved Capacity Risk: Long-term commitments conflict with growth uncertainty
  5. Multi-Dimensional Optimization: Need to balance cost, performance, and flexibility

Workload Analysis

Traffic Pattern Discovery (3-month analysis):

```mermaid
gantt
    title Daily Traffic Pattern (Typical Weekday)
    dateFormat HH:mm
    axisFormat %H:%M
    section Load Profile
    Off-Peak (20% capacity)       :00:00, 06:00
    Morning Ramp (40% capacity)   :06:00, 09:00
    Business Hours (70% capacity) :09:00, 18:00
    Evening Peak (100% capacity)  :18:00, 22:00
    Night Off-Peak (25% capacity) :22:00, 24:00
```

Key Findings:

  • Baseline Load: 20-30% capacity utilized 60% of the time
  • Predictable Peaks: 18:00-22:00 daily, weekends +30%
  • Seasonal Spikes: Black Friday (10x), Holiday Season (5x), Prime Day (8x)
  • Over-Provisioning: Average utilization only 42% despite auto-scaling

Options Considered

Option 1: Performance-Optimized Auto-Scaling (Status Quo)

Current Configuration:

```yaml
# Kubernetes HPA Configuration
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-service
  minReplicas: 50
  maxReplicas: 500
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50  # Conservative threshold
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100  # Aggressive scale-up
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10  # Conservative scale-down
        periodSeconds: 60
```

Characteristics:

  • Aggressive scale-up (double capacity in 1 minute)
  • Conservative scale-down (10% reduction per minute)
  • High minimum replica count for instant readiness
  • CPU threshold at 50% to maintain headroom

Pros:

  • Excellent performance (P95: 145ms, P99: 380ms)
  • Zero performance degradation during traffic spikes
  • Simple operational model

Cons:

  • High cost due to over-provisioning
  • Average CPU utilization: 42%
  • Wasted capacity during off-peak hours
  • No cost awareness in scaling decisions

Monthly Cost: $185K (baseline)


Option 2: Cost-Aware Dynamic Scaling

Proposed Configuration:

```yaml
# Cost-optimized HPA with custom metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-service-hpa-cost-optimized
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-service
  minReplicas: 15  # Reduced from 50
  maxReplicas: 500
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70  # Higher threshold
  - type: Pods
    pods:
      metric:
        name: http_request_latency_p95
      target:
        type: AverageValue
        averageValue: "200"  # SLA-based scaling
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 120  # Slower reaction
      policies:
      - type: Percent
        value: 50  # Moderate scale-up
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 180
      policies:
      - type: Percent
        value: 25  # Faster scale-down
        periodSeconds: 60
```

Additional Strategies:

  • Spot Instances: 70% of burst capacity on Spot (60% cost savings)
  • Vertical Pod Autoscaling: Right-size resource requests
  • Cluster Autoscaler Optimization: Faster node termination
  • Time-Based Scaling: Pre-scale for known peaks
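The Spot figure can be sanity-checked with simple arithmetic: if 70% of burst capacity runs on Spot at a 60% discount, burst capacity costs 58% of its all-on-demand price, i.e. a 42% saving on burst. A minimal sketch (Python):

```python
def blended_burst_cost(spot_fraction: float, spot_discount: float) -> float:
    """Burst-capacity cost as a fraction of the all-on-demand price."""
    on_demand_fraction = 1 - spot_fraction
    # On-demand portion at list price, Spot portion at the discounted price.
    return on_demand_fraction * 1.0 + spot_fraction * (1 - spot_discount)

factor = blended_burst_cost(spot_fraction=0.70, spot_discount=0.60)
print(f"Burst capacity costs {factor:.0%} of on-demand")  # 58%
```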

Pros:

  • Estimated 35-40% cost reduction
  • Better resource utilization (target 65%)
  • Maintains SLA-based performance guardrails

Cons:

  • Increased latency during unexpected spikes (P95: 180ms, P99: 480ms)
  • Spot instance interruptions (2-3% of capacity)
  • Complex configuration and monitoring
  • Risk of cascading failures during rapid scale-up

Estimated Monthly Cost: $115K (-38%)


Option 3: Reserved Capacity + Dynamic Scaling Hybrid

Architecture Overview:

```mermaid
graph TB
    subgraph "Baseline Capacity (Reserved)"
        A[Reserved Instances<br/>30% of peak capacity<br/>3-year commitment]
        B[Savings Plans<br/>Compute commitment<br/>1-year term]
    end
    subgraph "Dynamic Capacity (On-Demand + Spot)"
        C[On-Demand Instances<br/>Predictable peaks<br/>40% of capacity]
        D[Spot Instances<br/>Burst capacity<br/>30% of capacity]
    end
    E[Workload Scheduler] --> A
    E --> B
    E --> C
    E --> D
    F[Traffic Pattern Analysis] --> E
    G[Cost Optimization Engine] --> E
    style A fill:#90EE90
    style B fill:#90EE90
    style C fill:#FFD700
    style D fill:#87CEEB
```

Capacity Planning Strategy:

| Time Period | Reserved | On-Demand | Spot | Total Capacity |
|---|---|---|---|---|
| Off-Peak (00:00-06:00) | 100% | 0% | 0% | 20% of peak |
| Business Hours (09:00-18:00) | 43% | 40% | 17% | 70% of peak |
| Evening Peak (18:00-22:00) | 30% | 40% | 30% | 100% of peak |
| Seasonal Spike | 10% | 30% | 60% | 300% of peak |

Implementation Components:

  1. Reserved Instances: EC2 RIs for baseline EKS nodes (30% capacity)
  2. Savings Plans: Compute Savings Plans for predictable workloads
  3. Scheduled Scaling: Pre-scale for known daily/weekly patterns
  4. Spot Fleet: Diversified Spot instances for burst capacity
  5. Intelligent Workload Placement: Priority-based pod scheduling
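The capacity mix above implies a blended price per capacity-hour for each period. A rough sketch (Python; the ~60% RI and Spot discounts are assumptions for illustration, not quoted AWS prices):

```python
# Blended cost per capacity-hour for each time period in the capacity
# plan above, expressed relative to the on-demand list price.
# Assumed discounts (illustrative): RIs ~60% off, Spot ~60% off.

RI_PRICE, OD_PRICE, SPOT_PRICE = 0.40, 1.00, 0.40

def blended_price(ri: float, od: float, spot: float) -> float:
    """Weighted price for a (reserved, on-demand, spot) capacity mix."""
    return ri * RI_PRICE + od * OD_PRICE + spot * SPOT_PRICE

periods = {
    "Off-Peak":     (1.00, 0.00, 0.00),
    "Business":     (0.43, 0.40, 0.17),
    "Evening Peak": (0.30, 0.40, 0.30),
}
for name, mix in periods.items():
    print(f"{name:12s} {blended_price(*mix):.2f}x on-demand")
```

Under these assumed discounts, even the evening peak runs at roughly two-thirds of the all-on-demand price, which is where the ~31% overall estimate comes from.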

Pros:

  • Balanced cost reduction (28-32%)
  • Predictable baseline performance
  • Reduced Spot interruption impact
  • Better capacity planning

Cons:

  • Commitment risk (3-year RIs)
  • Moderate operational complexity
  • Requires accurate workload forecasting
  • Less aggressive cost savings than Option 2

Estimated Monthly Cost: $128K (-31%)


Option 4: Workload Scheduling + Batch Processing

Strategy:

  • Shift non-critical workloads to off-peak hours
  • Implement batch processing for analytics and reporting
  • Use Lambda for sporadic tasks instead of always-on containers

Pros:

  • Maximizes reserved capacity utilization
  • Reduces peak demand by 15-20%

Cons:

  • Requires application refactoring
  • Not suitable for real-time user-facing services
  • Limited applicability to e-commerce workload

Estimated Monthly Cost: $145K (-22%) when combined with Option 3


Evaluation Matrix

| Criteria | Weight | Option 1 (Status Quo) | Option 2 (Cost-Aware) | Option 3 (Hybrid) | Option 4 (Scheduling) |
|---|---|---|---|---|---|
| Cost Reduction | 35% | 2/10 | 10/10 | 8/10 | 7/10 |
| Performance Impact | 30% | 10/10 | 6/10 | 8/10 | 7/10 |
| Predictability | 15% | 9/10 | 5/10 | 8/10 | 6/10 |
| Operational Overhead | 10% | 9/10 | 4/10 | 6/10 | 3/10 |
| Flexibility | 10% | 8/10 | 9/10 | 6/10 | 5/10 |
| Weighted Score | | 6.75 | 7.35 | 7.60 | 6.25 |
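The weighted scores follow mechanically from the weights and the per-criterion scores, so they are easy to recompute as a check (Python):

```python
# Recompute the evaluation-matrix totals from the weights and scores above.
weights = {"cost": 0.35, "perf": 0.30, "predict": 0.15, "ops": 0.10, "flex": 0.10}

options = {
    "Option 1 (Status Quo)": {"cost": 2,  "perf": 10, "predict": 9, "ops": 9, "flex": 8},
    "Option 2 (Cost-Aware)": {"cost": 10, "perf": 6,  "predict": 5, "ops": 4, "flex": 9},
    "Option 3 (Hybrid)":     {"cost": 8,  "perf": 8,  "predict": 8, "ops": 6, "flex": 6},
    "Option 4 (Scheduling)": {"cost": 7,  "perf": 7,  "predict": 6, "ops": 3, "flex": 5},
}

def weighted(scores: dict) -> float:
    return sum(weights[k] * v for k, v in scores.items())

for name, scores in options.items():
    print(f"{name}: {weighted(scores):.2f}")
```

Option 3 comes out on top under these weights, with Option 2 second.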

Trade-offs Analysis

Cost vs Performance Frontier

```mermaid
quadrantChart
    title Cost-Performance Trade-off Space
    x-axis Low Cost --> High Cost
    y-axis Low Performance --> High Performance
    quadrant-1 Over-Provisioned
    quadrant-2 Optimal Zone
    quadrant-3 Under-Provisioned
    quadrant-4 Balanced
    Status Quo: [0.85, 0.95]
    Cost-Aware Scaling: [0.35, 0.70]
    Hybrid Approach: [0.60, 0.88]
    With Scheduling: [0.55, 0.82]
```

Key Trade-off Considerations

1. Cost Reduction vs Performance Risk

  • Every 10% cost reduction correlates with ~5ms P95 latency increase
  • Acceptable range: P95 < 200ms (current: 145ms, headroom: 55ms)
  • Option 3 provides optimal balance: 31% cost reduction, 25ms latency increase

2. Commitment vs Flexibility

  • 3-year RIs offer 62% discount but lock capacity
  • Workload analysis shows 30% baseline is stable over 18 months
  • Mitigated by using 1-year Savings Plans for 50% of commitment

3. Complexity vs Savings

  • Option 2 requires custom metrics, Spot orchestration, advanced monitoring
  • Option 3 adds moderate complexity with better risk profile
  • Team capacity assessment: can handle Option 3 with 2-week training

4. Spot Instance Risk

  • Historical interruption rate: 5% for diversified instance types
  • Impact mitigation: graceful shutdown, pod disruption budgets
  • Acceptable for 30% of capacity with proper fallback

Final Decision

Selected Option: Hybrid Reserved + Dynamic Scaling (Option 3) with Selective Workload Scheduling (Option 4 elements)

Rationale

  1. Achieves Cost Target: 31% reduction meets 25% board requirement with buffer
  2. Maintains Performance SLA: Projected P95 latency 170ms (within 200ms target)
  3. Balances Risk: Reserved capacity provides stability, dynamic scaling adds flexibility
  4. Operationally Feasible: Team can implement within 8-week timeline
  5. Future-Proof: Supports growth without major architectural changes

Implementation Roadmap

Phase 1: Analysis & Planning (Weeks 1-2)

  • Detailed workload profiling using CloudWatch and Datadog
  • Reserved Instance purchase analysis (3-year vs 1-year)
  • Spot instance type diversification strategy
  • Cost modeling and forecasting

Phase 2: Reserved Capacity Procurement (Week 3)

  • Purchase EC2 Reserved Instances: 30 x m5.2xlarge (3-year, all upfront)
  • Commit to Compute Savings Plan: $25K/month (1-year)
  • RDS Reserved Instances: 2 x db.r5.4xlarge (1-year)

Phase 3: Auto-Scaling Optimization (Weeks 4-5)

  • Implement cost-aware HPA configurations
  • Deploy Cluster Autoscaler with Spot integration
  • Configure Karpenter for intelligent node provisioning
  • Set up pod priority classes and preemption policies

Phase 4: Spot Fleet Integration (Week 6)

  • Deploy Spot instance diversification (5 instance types)
  • Implement graceful shutdown handlers
  • Configure pod disruption budgets
  • Test Spot interruption scenarios
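Graceful shutdown hinges on the interruption notice AWS publishes through the EC2 instance metadata service shortly before reclaiming a Spot instance. A minimal watcher sketch (Python; the DaemonSet placement and `drain_node` behavior are illustrative assumptions — production clusters typically run a dedicated handler such as aws-node-termination-handler, and IMDSv2 additionally requires a session token, omitted here for brevity):

```python
# Minimal Spot interruption watcher, assumed to run on the node itself.
# The instance-action endpoint returns 404 until an interruption is
# scheduled, then JSON like:
#   {"action": "terminate", "time": "2025-06-20T18:22:00Z"}

import json
import time
import urllib.error
import urllib.request
from datetime import datetime, timezone

METADATA_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def parse_notice(body: str) -> datetime:
    """Extract the scheduled interruption time from a notice payload."""
    notice = json.loads(body)
    ts = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    return ts.replace(tzinfo=timezone.utc)

def drain_node() -> None:
    # Placeholder: cordon the node and evict its pods so the scheduler
    # reschedules them elsewhere (e.g. shell out to `kubectl drain`).
    print("draining node ahead of Spot reclaim")

def watch(poll_seconds: int = 5) -> None:
    """Poll the metadata endpoint; drain once an interruption is scheduled."""
    while True:
        try:
            with urllib.request.urlopen(METADATA_URL, timeout=2) as resp:
                deadline = parse_notice(resp.read().decode())
                print(f"interruption scheduled for {deadline.isoformat()}")
                drain_node()
                return
        except urllib.error.HTTPError:
            pass  # 404: no interruption pending
        time.sleep(poll_seconds)

# watch() would be the container entrypoint in a DaemonSet sidecar.
```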

Phase 5: Workload Scheduling (Week 7)

  • Migrate batch analytics jobs to off-peak hours
  • Implement CronJobs for report generation
  • Move image processing to Lambda for sporadic tasks

Phase 6: Monitoring & Validation (Week 8)

  • Deploy cost anomaly detection
  • Set up performance regression alerts
  • Conduct load testing across scenarios
  • Document runbooks and rollback procedures

Cost-Aware Scaling Logic

```mermaid
flowchart TD
    A[Incoming Traffic] --> B{Current Load}
    B -->|< 30% peak| C[Reserved Instances Only]
    B -->|30-70% peak| D{Time of Day}
    B -->|> 70% peak| E{Spot Available?}
    D -->|Business Hours| F[Add On-Demand]
    D -->|Off-Peak| G[Add Spot First]
    E -->|Yes| H[Scale with Spot]
    E -->|No| I[Scale with On-Demand]
    H --> J{Spot Interrupted?}
    J -->|Yes| K[Fallback to On-Demand]
    J -->|No| L[Continue]
    C --> M[Monitor Performance]
    F --> M
    G --> M
    I --> M
    K --> M
    L --> M
    M --> N{SLA Breach?}
    N -->|Yes| O[Emergency Scale-Up]
    N -->|No| P[Optimize Continuously]
```

Reserved Instance Strategy

Compute (EKS Nodes):

  • 30 x m5.2xlarge (3-year, all upfront): $87K upfront, saves $42K/year
  • Covers baseline 30% capacity 24/7
  • Breakeven: 18 months (acceptable given stable baseline)

Database (Aurora):

  • 2 x db.r5.4xlarge (1-year, partial upfront): $28K upfront, saves $18K/year
  • Lower commitment due to potential migration to Aurora Serverless v2

Cache (ElastiCache):

  • 4 x cache.r5.large (1-year, no upfront): saves $6K/year
  • Flexible commitment for evolving caching strategy

Post-Decision Reflection

Outcomes Achieved (4 months post-implementation)

Cost Savings:

  • Actual Monthly Cost: $132K (29% reduction vs $185K baseline)
  • Annualized Savings: $636K
  • ROI: 7.3x (savings vs implementation cost of $87K)
  • Missed Target: -2% (achieved 29% vs 31% projected)

Cost Breakdown After Optimization:

| Service Category | Before | After | Savings | % Reduction |
|---|---|---|---|---|
| Compute (EKS) | $98K | $62K | $36K | 37% |
| Database (Aurora) | $42K | $34K | $8K | 19% |
| Cache (ElastiCache) | $18K | $15K | $3K | 17% |
| Storage | $15K | $13K | $2K | 13% |
| Network | $12K | $8K | $4K | 33% |

Performance Impact:

  • P50 Latency: 85ms → 92ms (+8%, acceptable)
  • P95 Latency: 145ms → 168ms (+16%, within SLA)
  • P99 Latency: 380ms → 445ms (+17%, within SLA)
  • Availability: 99.95% (unchanged)

Resource Utilization:

  • Average CPU Utilization: 42% → 64% (+52% improvement)
  • Reserved Instance Utilization: 94% (excellent)
  • Spot Instance Interruptions: 3.2% (within tolerance)
  • Wasted Capacity: 40% → 18% (55% reduction)

Challenges Encountered

1. Reserved Instance Sizing Miscalculation

  • Issue: Initial RI purchase covered 28% of baseline instead of 30%
  • Impact: $3K/month higher on-demand costs than projected
  • Resolution: Purchased additional 1-year RIs in Month 2
  • Lesson: Build 10% buffer in capacity planning

2. Spot Instance Interruption Spikes

  • Issue: Week 3 experienced 12% interruption rate during AWS capacity crunch
  • Impact: Temporary latency spike to P95 220ms (SLA breach)
  • Resolution: Expanded instance type diversification from 5 to 8 types
  • Lesson: Monitor AWS Spot instance advisor daily, maintain 20% on-demand buffer

3. Auto-Scaling Oscillation

  • Issue: HPA thrashing during moderate load (scale up/down cycles every 2 minutes)
  • Impact: Increased pod churn, connection drops
  • Resolution: Tuned stabilization windows (60s → 180s for scale-up)
  • Lesson: Conservative stabilization windows critical for cost-optimized scaling

4. Monitoring Blind Spots

  • Issue: Cost anomaly detection missed gradual EBS volume growth ($2K/month)
  • Impact: Untracked cost increase offsetting savings
  • Resolution: Implemented storage lifecycle policies, automated volume cleanup
  • Lesson: Comprehensive cost monitoring across all resource types essential

5. Team Operational Overhead

  • Issue: First 2 months required 15 hours/week additional SRE time
  • Impact: Delayed other projects, team burnout risk
  • Resolution: Automated runbooks, improved alerting, knowledge sharing sessions
  • Current State: Stabilized at 3 hours/week (acceptable)

Unexpected Benefits

  1. Improved Capacity Planning: Workload profiling revealed optimization opportunities in database queries (20% RDS cost reduction)
  2. Better Observability: Cost-aware monitoring improved overall system understanding
  3. Cultural Shift: Engineering teams now consider cost in design decisions
  4. Vendor Leverage: Demonstrated cost discipline improved AWS Enterprise Support negotiations

Performance Deep Dive

Latency Distribution Analysis:

```mermaid
xychart-beta
    title "Latency Distribution: Before vs After Optimization"
    x-axis [P50, P75, P90, P95, P99]
    y-axis "Latency (ms)" 0 --> 500
    bar [85, 110, 135, 145, 380]
    bar [92, 125, 155, 168, 445]
```

Key Observations:

  • Latency increase concentrated in P95-P99 (tail latency)
  • P50-P75 minimally impacted (core user experience preserved)
  • Acceptable trade-off: 95% of requests complete in under 170ms (P95: 168ms)

Conversion Rate Impact:

  • Pre-optimization: 3.2% conversion rate
  • Post-optimization: 3.15% conversion rate (-1.6%)
  • Revenue impact: -$45K/month
  • Net benefit: $636K savings - $540K revenue loss = $96K/year positive
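The net-benefit arithmetic above is worth making explicit, since it is what justified keeping the optimization despite the conversion dip (Python):

```python
def net_annual_benefit(monthly_saving: float, monthly_revenue_loss: float) -> float:
    """Annualized savings net of revenue lost to the latency increase."""
    return 12 * (monthly_saving - monthly_revenue_loss)

monthly_saving = 185_000 - 132_000  # actual cost reduction, $/month
net = net_annual_benefit(monthly_saving, 45_000)
print(f"Net annual benefit: ${net:,.0f}")  # $96,000
```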

Lessons Learned

1. Workload Profiling is Critical

  • Invested 2 weeks in detailed analysis before implementation
  • Discovered 60% of traffic follows predictable patterns
  • Enabled confident reserved capacity commitment

2. Start Conservative, Optimize Iteratively

  • Initial RI purchase at 25% baseline, expanded to 30% after validation
  • Gradual Spot adoption (10% → 20% → 30% over 6 weeks)
  • Avoided costly mistakes from aggressive optimization

3. Performance SLAs Must Drive Scaling

  • Cost-aware scaling without SLA guardrails risks user experience
  • Latency-based HPA metrics prevented excessive cost optimization
  • Business metrics (conversion rate) validated technical decisions

4. Spot Instances Require Operational Maturity

  • Graceful shutdown handlers essential (prevented 80% of interruption impact)
  • Instance type diversification more important than cost savings
  • Continuous monitoring of AWS Spot advisor critical

5. Cultural Change Takes Time

  • Engineering resistance to “cost over performance” mindset
  • Addressed through transparency: shared cost dashboards, monthly reviews
  • Celebrated wins: team bonus tied to cost savings achievement

Future Optimization Opportunities

Short-term (Next 6 months):

  • Migrate to Graviton instances (20% additional cost savings)
  • Implement Aurora Serverless v2 for non-production environments
  • Expand Spot usage to 40% with improved orchestration

Medium-term (6-12 months):

  • Evaluate Fargate Spot for batch workloads
  • Implement intelligent caching to reduce database load
  • Explore multi-cloud arbitrage for burst capacity

Long-term (12+ months):

  • Machine learning-based predictive scaling
  • FinOps automation platform (custom-built)
  • Carbon-aware scheduling for sustainability goals

Continuous Improvement Process

Monthly Cost Review:

  • FinOps team analyzes cost trends and anomalies
  • Engineering presents optimization initiatives
  • Executive dashboard tracks savings vs performance

Quarterly Capacity Planning:

  • Reassess reserved instance commitments
  • Adjust scaling policies based on traffic patterns
  • Update cost models for business planning

Annual Architecture Review:

  • Evaluate new AWS services (e.g., Graviton4, Aurora Limitless)
  • Benchmark against industry cost efficiency metrics
  • Set next year’s optimization targets

Key Metrics Dashboard

| Metric | Target | Current | Status |
|---|---|---|---|
| Monthly Cost | < $139K | $132K | ✅ Exceeding |
| Cost per Transaction | < $0.08 | $0.072 | ✅ Exceeding |
| P95 Latency | < 200ms | 168ms | ✅ Meeting |
| Reserved Utilization | > 90% | 94% | ✅ Exceeding |
| Spot Interruption Rate | < 5% | 3.2% | ✅ Meeting |
| CPU Utilization | > 60% | 64% | ✅ Meeting |

Last Updated: 2025-06-20
Next Review: 2025-07-20
Decision Owner: FinOps Lead
Contributors: Platform Engineering, SRE, Finance, Product Management
Cost Savings Achieved: $636K annually (29% reduction)
