
Legacy System Migration Strategy: Strangling the Monolith

Jeff Taakey
Founder, Architect Decision Hub (ADH) | 21+ Year CTO & Multi-Cloud Architect

This article is part of the Architecture Decision Records series.

Decision Metadata

| Attribute | Value |
| --- | --- |
| Decision ID | ADH-004 |
| Status | In Progress (Month 14 of 24) |
| Date | 2025-04-10 |
| Stakeholders | CTO, VP Engineering, Infrastructure Team, Business Continuity |
| Review Cycle | Monthly |
| Related Decisions | ADH-002 (Cost Optimization), ADH-003 (Microservices Boundaries) |

System Context

A mission-critical insurance policy management system serving 2.5 million active policies across 12 US states. The system has been in production since 2009 and represents the core revenue engine of the business.

Legacy System Characteristics

Technology Stack:

  • Application Server: IBM WebSphere 7.0 (EOL 2020)
  • Language: Java 6 (EOL 2013) with 1.2M lines of code
  • Database: Oracle 11g (EOL 2020) with 3,800 tables
  • Integration: SOAP web services, FTP file transfers, mainframe batch jobs
  • Infrastructure: On-premises data center with physical servers
  • Deployment: Manual deployment process (8-hour maintenance window)

System Architecture:

```mermaid
graph TB
  subgraph "Legacy On-Premises Infrastructure"
    A[Load Balancer<br/>F5 BIG-IP]
    B[WebSphere Cluster<br/>8 nodes]
    C[Oracle RAC<br/>4 nodes]
    D[Batch Processing<br/>Cron jobs]
    E[File Server<br/>NFS]
    F[Mainframe<br/>IBM z/OS]
    G[External Partners<br/>FTP/SOAP]
  end
  A --> B
  B --> C
  B --> E
  D --> C
  D --> F
  B --> G
  style B fill:#FF6B6B
  style C fill:#FF6B6B
  style D fill:#FF6B6B
  style F fill:#FF6B6B
```

Business Context:

| Metric | Value |
| --- | --- |
| Annual Revenue | $450M (100% dependent on this system) |
| Active Policies | 2.5M |
| Daily Transactions | 85,000 policy operations |
| Peak Load | 1,200 concurrent users |
| Uptime SLA | 99.5% (contractual obligation) |
| Regulatory Compliance | SOC 2, HIPAA, State Insurance Regulations |

Pain Points

1. Technical Debt Crisis

  • Java 6 has critical security vulnerabilities (CVE count: 247)
  • WebSphere 7.0 unsupported, no security patches
  • Oracle 11g EOL, escalating support costs ($280K/year)
  • Codebase complexity: cyclomatic complexity avg 45 (industry standard: <10)

2. Operational Challenges

  • Deployment requires 8-hour maintenance window (Saturday nights)
  • Average 3 production incidents per month
  • MTTR: 4.5 hours (manual troubleshooting)
  • Infrastructure costs: $1.2M/year for aging hardware

3. Business Constraints

  • Cannot add new features (6-month lead time for simple changes)
  • Competitors launching digital-first products
  • Customer satisfaction declining (NPS: 28, down from 45 in 2019)
  • Regulatory reporting requires manual data extraction (40 hours/month)

4. Knowledge Erosion

  • Original architects retired
  • 60% of codebase has no documentation
  • 3 engineers understand core rating engine (bus factor: 3)
  • Onboarding new engineers takes 9-12 months

Triggering Event

January 2025 Security Audit: An external auditor flagged 18 critical vulnerabilities in Java 6 and WebSphere 7.0 and set a 12-month remediation deadline; failure means regulatory sanctions and a potential $5M fine.

Board Mandate: Migrate to modern, secure, cloud-based infrastructure within 24 months while maintaining 99.5% uptime and zero data loss.

Problem Statement

How do we migrate a 15-year-old, mission-critical legacy system to cloud infrastructure while minimizing business risk, maintaining regulatory compliance, and enabling future innovation?

Key Challenges

  1. Zero-Downtime Requirement: Cannot afford extended outages (revenue loss: $125K/hour)
  2. Data Integrity: 2.5M policies, 450GB database, zero tolerance for data loss
  3. Regulatory Compliance: Must maintain audit trail during migration
  4. Knowledge Gaps: Limited understanding of business logic embedded in code
  5. Integration Complexity: 47 external integrations (partners, state agencies, mainframe)
  6. Team Capacity: 12 engineers, cannot hire fast enough
  7. Budget Constraint: $3.5M approved (includes infrastructure, tooling, consulting)

Success Criteria

  • Uptime: Maintain 99.5% SLA throughout migration
  • Performance: No degradation in response times (P95 < 2s)
  • Security: Remediate all critical vulnerabilities within 12 months
  • Cost: Reduce infrastructure costs by 40% post-migration
  • Timeline: Complete migration within 24 months
  • Business Continuity: Zero revenue-impacting incidents

Options Considered

Option 1: Lift-and-Shift Migration

Strategy: Migrate existing application to cloud VMs with minimal changes

Approach:

```mermaid
graph LR
  A[On-Premises<br/>WebSphere] -->|Rehost| B[AWS EC2<br/>WebSphere]
  C[On-Premises<br/>Oracle RAC] -->|Rehost| D[AWS RDS Oracle]
  E[Physical Servers] -->|Migrate| F[EC2 Instances]
  style A fill:#FF6B6B
  style C fill:#FF6B6B
  style B fill:#FFD700
  style D fill:#FFD700
```

Implementation Plan:

Phase 1: Infrastructure Setup (Weeks 1-4)

  • Provision AWS VPC with private subnets
  • Set up EC2 instances matching on-premises specs
  • Configure RDS Oracle with Multi-AZ
  • Establish VPN connection to on-premises

Phase 2: Application Migration (Weeks 5-8)

  • Install WebSphere 7.0 on EC2 (same version)
  • Deploy application WAR files
  • Configure load balancer (AWS ALB)
  • Set up monitoring (CloudWatch)

Phase 3: Data Migration (Weeks 9-12)

  • Use Oracle Data Pump for initial load
  • Set up Oracle GoldenGate for real-time replication
  • Validate data integrity (checksums, row counts)
  • Cutover during maintenance window

Phase 4: Cutover (Week 13)

  • DNS switch to AWS load balancer
  • Monitor for 48 hours
  • Decommission on-premises infrastructure

Pros:

  • Fast: 3-month timeline
  • Low Risk: Minimal code changes
  • Proven: Well-established migration pattern
  • Reversible: Can rollback to on-premises if issues

Cons:

  • Technical Debt Retained: Still running Java 6, WebSphere 7.0
  • Security Vulnerabilities Persist: Does not address audit findings
  • Limited Cost Savings: EC2 costs similar to on-premises
  • No Modernization: Cannot leverage cloud-native services
  • Licensing Costs: WebSphere and Oracle licenses still required ($450K/year)

Cost Analysis:

| Component | On-Premises | AWS Lift-Shift | Savings |
| --- | --- | --- | --- |
| Compute | $480K/year | $420K/year | 13% |
| Database | $280K/year | $320K/year | -14% |
| Storage | $120K/year | $80K/year | 33% |
| Network | $80K/year | $60K/year | 25% |
| Licenses | $450K/year | $450K/year | 0% |
| **Total** | **$1.41M/year** | **$1.33M/year** | **6%** |

  • Timeline: 3 months
  • Risk Level: Low
  • Security Remediation: ❌ Does not address vulnerabilities


Option 2: Incremental Refactoring (Strangler Fig Pattern)
#

Strategy: Gradually replace legacy components with modern cloud-native services

Strangler Fig Approach:

```mermaid
graph TB
  subgraph "Phase 1: Routing Layer"
    A[API Gateway<br/>AWS API Gateway]
  end
  subgraph "Phase 2: New Services"
    B[Policy Service<br/>Spring Boot]
    C[Claims Service<br/>Spring Boot]
    D[Billing Service<br/>Spring Boot]
  end
  subgraph "Phase 3: Legacy System"
    E[WebSphere Monolith<br/>Gradually Shrinking]
  end
  subgraph "Data Layer"
    F[PostgreSQL<br/>New Services]
    G[Oracle<br/>Legacy]
    H[Data Sync<br/>CDC]
  end
  A -->|New Traffic| B
  A -->|New Traffic| C
  A -->|New Traffic| D
  A -->|Legacy Traffic| E
  B --> F
  C --> F
  D --> F
  E --> G
  F <-->|Sync| H
  H <-->|Sync| G
  style B fill:#90EE90
  style C fill:#90EE90
  style D fill:#90EE90
  style E fill:#FF6B6B
```

Migration Phases:

Phase 1: Foundation (Months 1-3)

  • Deploy API Gateway as routing layer
  • Set up AWS infrastructure (EKS, RDS PostgreSQL, S3)
  • Implement observability stack (Datadog, Jaeger)
  • Establish CI/CD pipelines (GitLab CI)

Phase 2: Extract Read-Only Services (Months 4-6)

  • Policy Inquiry Service: Read-only policy lookups
  • Claims History Service: Historical claims data
  • Document Service: Policy documents (PDF generation)
  • Rationale: Low risk, no data writes, easy to validate

Phase 3: Extract Transactional Services (Months 7-12)

  • New Policy Service: Policy issuance (greenfield business)
  • Billing Service: Payment processing
  • Claims Submission Service: New claims intake
  • Implement dual-write pattern for data consistency

Phase 4: Migrate Core Services (Months 13-18)

  • Policy Management Service: Policy updates, renewals
  • Rating Engine Service: Premium calculation
  • Underwriting Service: Risk assessment
  • Implement Change Data Capture (CDC) for data sync

Phase 5: Decommission Legacy (Months 19-24)

  • Migrate remaining edge cases
  • Data migration to PostgreSQL
  • Decommission WebSphere and Oracle
  • Archive legacy system

Strangler Pattern Implementation:

```yaml
# API Gateway routing rules
routes:
  # New services (strangled)
  - path: /api/v2/policies/search
    target: policy-service.eks.cluster
    method: GET

  - path: /api/v2/policies
    target: policy-service.eks.cluster
    method: POST

  - path: /api/v2/claims
    target: claims-service.eks.cluster
    method: POST

  # Legacy system (being strangled)
  - path: /api/v1/*
    target: legacy-websphere.vpc
    method: ALL

# Feature flags for gradual rollout
feature_flags:
  new_policy_service:
    enabled: true
    rollout_percentage: 25  # Gradual traffic shift
    fallback: legacy-websphere
```
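A rule set like this boils down to prefix routing plus a deterministic percentage gate. The Python sketch below shows the decision logic only; the SHA-256 bucketing scheme is an illustrative assumption, not how any particular gateway implements its split:

```python
import hashlib

# Route table mirroring the config above: (path prefix, target service).
ROUTES = [
    ("/api/v2/policies/search", "policy-service"),
    ("/api/v2/policies", "policy-service"),
    ("/api/v2/claims", "claims-service"),
    ("/api/v1/", "legacy-websphere"),  # catch-all for legacy traffic
]

ROLLOUT_PERCENTAGE = 25  # share of eligible traffic sent to new services

def bucket(user_id: str) -> int:
    """Deterministic 0-99 bucket so a given user is always routed the same way."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return int(digest, 16) % 100

def route(path: str, user_id: str) -> str:
    for prefix, target in ROUTES:
        if path.startswith(prefix):
            if target != "legacy-websphere" and bucket(user_id) >= ROLLOUT_PERCENTAGE:
                return "legacy-websphere"  # fall back until the rollout widens
            return target
    return "legacy-websphere"  # unknown paths stay on the monolith
```

Hashing the user ID rather than sampling randomly keeps each user on one side of the split, which makes behaviour reproducible when debugging a rollout.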

Data Synchronization Strategy:

```mermaid
sequenceDiagram
  participant Client
  participant NewService
  participant PostgreSQL
  participant CDC
  participant Oracle
  participant LegacyApp
  Note over NewService,Oracle: Dual-Write Phase
  Client->>NewService: Create Policy
  NewService->>PostgreSQL: Write to new DB
  NewService->>Oracle: Write to legacy DB (via API)
  NewService-->>Client: Success
  Note over CDC,Oracle: Background Sync
  CDC->>Oracle: Detect changes
  CDC->>PostgreSQL: Replicate to new DB
  Note over LegacyApp: Legacy reads from Oracle
  LegacyApp->>Oracle: Read policy data
```

Pros:

  • Risk Mitigation: Incremental changes, easy rollback
  • Continuous Delivery: New features in modern stack
  • Learning Opportunity: Team learns cloud-native patterns
  • Cost Optimization: Gradual reduction in legacy infrastructure
  • Security Remediation: New services use modern frameworks (Java 17, Spring Boot 3)

Cons:

  • Long Timeline: 24 months to complete
  • Dual Maintenance: Support legacy and new systems simultaneously
  • Data Consistency Complexity: Dual-write and CDC required
  • Coordination Overhead: Multiple teams, complex dependencies
  • Increased Monitoring: Need to observe both old and new systems

Cost Analysis:

| Phase | Monthly Cost | Notes |
| --- | --- | --- |
| Month 1-3 | $120K | Legacy + AWS foundation |
| Month 4-6 | $135K | Legacy + 3 new services |
| Month 7-12 | $150K | Peak cost (dual systems) |
| Month 13-18 | $125K | Decommissioning legacy components |
| Month 19-24 | $85K | Mostly new system |
| Post-Migration | $70K | 50% cost reduction |

  • Timeline: 24 months
  • Risk Level: Medium
  • Security Remediation: ✅ Gradual remediation as services migrate


Option 3: Complete System Rewrite

Strategy: Build new system from scratch, big-bang cutover

Approach:

```mermaid
graph TB
  subgraph "New System (Built in Parallel)"
    A[React Frontend]
    B[API Gateway]
    C[Microservices<br/>Spring Boot]
    D[PostgreSQL]
    E[Event Bus<br/>Kafka]
  end
  subgraph "Legacy System (Runs Until Cutover)"
    F[WebSphere Monolith]
    G[Oracle Database]
  end
  H[Cutover Weekend] --> I[DNS Switch]
  I --> A
  style A fill:#87CEEB
  style B fill:#87CEEB
  style C fill:#87CEEB
  style D fill:#87CEEB
  style F fill:#FF6B6B
  style G fill:#FF6B6B
```

Implementation Plan:

Phase 1: Requirements Gathering (Months 1-3)

  • Reverse-engineer business logic from legacy code
  • Document 850 business rules
  • Create functional specifications
  • Design new architecture

Phase 2: Development (Months 4-15)

  • Build new microservices architecture
  • Implement modern UI (React)
  • Develop API layer
  • Create automated test suite (80% coverage target)

Phase 3: Data Migration Preparation (Months 16-18)

  • Build ETL pipelines
  • Data cleansing and transformation
  • Create data validation scripts
  • Parallel run testing

Phase 4: User Acceptance Testing (Months 19-21)

  • End-to-end testing with business users
  • Performance testing (load, stress, soak)
  • Security testing and penetration testing
  • Regulatory compliance validation

Phase 5: Cutover (Month 22)

  • Freeze legacy system (no new transactions)
  • Execute data migration (48-hour window)
  • Validate data integrity
  • Go-live with new system

Phase 6: Stabilization (Months 23-24)

  • Hypercare support (24/7 war room)
  • Bug fixes and performance tuning
  • Decommission legacy system

Pros:

  • Clean Architecture: Modern design patterns, no technical debt
  • Technology Freedom: Choose best-of-breed technologies
  • Optimized Performance: Built for cloud-native scalability
  • Complete Documentation: Fresh codebase with comprehensive docs

Cons:

  • Extreme Risk: Big-bang cutover, no rollback plan
  • Long Timeline: 22 months before any business value
  • High Cost: $5.4M including contingency (exceeds the $3.5M budget by 54%)
  • Knowledge Loss: May miss undocumented business rules
  • Opportunity Cost: No new features for 22 months
  • Team Burnout: Intense pressure, high stress

Historical Precedent:

Industry analyses commonly cite failure rates near 70% for large-scale rewrites, whether outright cancellation or major budget/timeline overruns:

  • Healthcare.gov (2013): ~$1.7B spent, launch-day collapse requiring a months-long rescue effort
  • FBI Virtual Case File (2005): $170M wasted, project cancelled
  • UK NHS IT System (2011): £10B, abandoned after 10 years

Cost Analysis:

| Component | Cost |
| --- | --- |
| Development Team (15 engineers × 22 months) | $3.3M |
| Infrastructure (AWS during parallel run) | $600K |
| Consulting (architecture, security) | $400K |
| Testing & QA | $200K |
| Contingency (20%) | $900K |
| **Total** | **$5.4M** |

  • Timeline: 22 months (no business value until Month 22)
  • Risk Level: Very High
  • Security Remediation: ✅ Complete remediation, but delayed


Option 4: Hybrid Approach (Lift-Shift + Selective Refactoring)

Strategy: Lift-shift to cloud, then refactor high-value components

Approach:

  1. Lift-shift entire system to AWS (3 months)
  2. Upgrade to Java 11 and WebSphere 9 in cloud (2 months)
  3. Extract high-value services incrementally (12 months)
  4. Maintain modernized monolith for low-value components

Pros:

  • Fast security remediation (5 months)
  • Lower initial risk than full rewrite
  • Flexibility to prioritize refactoring

Cons:

  • Still requires WebSphere licensing
  • Two migration efforts (lift-shift + refactoring)
  • Unclear end state (hybrid architecture)

  • Timeline: 17 months
  • Risk Level: Medium
  • Security Remediation: ✅ Partial remediation in 5 months


Evaluation Matrix

| Criteria | Weight | Option 1 (Lift-Shift) | Option 2 (Strangler) | Option 3 (Rewrite) | Option 4 (Hybrid) |
| --- | --- | --- | --- | --- | --- |
| Migration Risk | 30% | 8/10 | 7/10 | 2/10 | 6/10 |
| Time to Delivery | 20% | 9/10 | 5/10 | 3/10 | 7/10 |
| Cost | 15% | 7/10 | 6/10 | 2/10 | 6/10 |
| Business Continuity | 25% | 9/10 | 8/10 | 3/10 | 7/10 |
| Security Remediation | 10% | 2/10 | 8/10 | 10/10 | 7/10 |
| **Weighted Score** | | **7.70** | **6.80** | **3.25** | **6.55** |

Note that lift-and-shift scores highest on the matrix, but its 2/10 on security remediation fails the audit's hard 12-month deadline, which disqualifies it regardless of the weighted total.
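The weighted totals follow from multiplying each criterion score by its weight and summing. A small Python sketch for reproducibility (option and criterion names abbreviated here):

```python
# Criterion weights from the evaluation matrix (sum to 1.0).
WEIGHTS = {
    "migration_risk": 0.30,
    "time_to_delivery": 0.20,
    "cost": 0.15,
    "business_continuity": 0.25,
    "security_remediation": 0.10,
}

# Per-criterion scores (out of 10) for each option, as in the matrix.
OPTIONS = {
    "lift_shift": {"migration_risk": 8, "time_to_delivery": 9, "cost": 7,
                   "business_continuity": 9, "security_remediation": 2},
    "strangler":  {"migration_risk": 7, "time_to_delivery": 5, "cost": 6,
                   "business_continuity": 8, "security_remediation": 8},
    "rewrite":    {"migration_risk": 2, "time_to_delivery": 3, "cost": 2,
                   "business_continuity": 3, "security_remediation": 10},
    "hybrid":     {"migration_risk": 6, "time_to_delivery": 7, "cost": 6,
                   "business_continuity": 7, "security_remediation": 7},
}

def weighted_score(scores: dict) -> float:
    """Sum of criterion score x criterion weight, rounded to 2 decimals."""
    return round(sum(scores[c] * w for c, w in WEIGHTS.items()), 2)

scores = {name: weighted_score(s) for name, s in OPTIONS.items()}
```

A matrix like this informs rather than dictates the decision: a hard constraint (here, the security deadline) can veto the top-scoring option.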

Trade-offs Analysis

Risk vs Timeline Trade-off

```mermaid
quadrantChart
  title Migration Risk vs Timeline
  x-axis Fast --> Slow
  y-axis Low Risk --> High Risk
  quadrant-1 High Risk, Slow
  quadrant-2 High Risk, Fast
  quadrant-3 Low Risk, Fast
  quadrant-4 Low Risk, Slow
  Lift-and-Shift: [0.2, 0.3]
  Strangler Pattern: [0.7, 0.4]
  Complete Rewrite: [0.8, 0.9]
  Hybrid Approach: [0.5, 0.5]
```

Key Trade-off Considerations

1. Speed vs Modernization

  • Lift-shift: 3 months, but retains technical debt
  • Strangler: 24 months, but achieves full modernization
  • Rewrite: 22 months, but extreme risk
  • Decision: Prioritize risk mitigation over speed

2. Cost vs Quality

  • Lift-shift: $1.33M/year ongoing, minimal improvement
  • Strangler: $70K/month post-migration, ~50% cost reduction
  • Rewrite: $5.4M upfront, but clean architecture
  • Decision: Strangler offers best long-term ROI

3. Business Continuity vs Innovation

  • Lift-shift: Zero disruption, but no new capabilities
  • Strangler: Continuous delivery of new features during migration
  • Rewrite: 22-month feature freeze
  • Decision: Business cannot afford 22-month freeze

4. Team Capacity vs Ambition

  • 12 engineers cannot execute rewrite in 22 months
  • Strangler allows learning and skill development
  • Lift-shift underutilizes team capabilities
  • Decision: Strangler balances capacity and growth

Final Decision

Selected Option: Incremental Refactoring using Strangler Fig Pattern (Option 2)

Rationale

  1. Risk Mitigation: Incremental approach allows rollback at any phase
  2. Security Compliance: Gradual remediation meets 12-month audit deadline
  3. Business Continuity: Zero downtime, continuous feature delivery
  4. Cost Optimization: 50% infrastructure cost reduction post-migration
  5. Team Development: Engineers learn cloud-native patterns incrementally
  6. Regulatory Compliance: Maintains audit trail throughout migration

Decision Drivers

Primary Drivers:

  • Regulatory Deadline: Must remediate security vulnerabilities within 12 months
  • Uptime SLA: 99.5% contractual obligation, cannot risk big-bang cutover
  • Budget Constraint: $3.5M approved, rewrite exceeds budget

Secondary Drivers:

  • Competitive Pressure: Need to deliver new features during migration
  • Team Capacity: 12 engineers cannot execute rewrite
  • Knowledge Gaps: Strangler allows discovery of undocumented business logic

Implementation Roadmap

Month 1-3: Foundation

Deliverables:
  - AWS Landing Zone (VPC, subnets, security groups)
  - EKS cluster with Istio service mesh
  - RDS PostgreSQL Multi-AZ
  - API Gateway with routing rules
  - Observability stack (Datadog, Jaeger, PagerDuty)
  - CI/CD pipelines (GitLab CI, ArgoCD)

Team:
  - 2 Platform Engineers (AWS infrastructure)
  - 2 DevOps Engineers (CI/CD, observability)
  - 1 Security Engineer (compliance, IAM)

Budget: $180K

Month 4-6: Extract Read-Only Services

Services:
  1. Policy Inquiry Service:
      - Endpoints: GET /policies/{id}, GET /policies/search
      - Data: Read from Oracle via JDBC
      - Technology: Spring Boot 3, Java 17
      - Deployment: EKS with 3 replicas
    
  2. Claims History Service:
      - Endpoints: GET /claims/{id}, GET /claims/history
      - Data: Read from Oracle via JDBC
      - Technology: Spring Boot 3, Java 17
    
  3. Document Service:
      - Endpoints: GET /documents/{id}
      - Data: S3 for storage, metadata in PostgreSQL
      - Technology: Spring Boot 3, Java 17
    
Traffic Rollout:
  - Week 1-2: 10% traffic to new services
  - Week 3-4: 25% traffic
  - Week 5-6: 50% traffic
  - Week 7-8: 100% traffic (if no issues)

Team:
  - 6 Backend Engineers (2 per service)
  - 2 QA Engineers (testing, validation)

Budget: $240K

Month 7-12: Extract Transactional Services

Services:
  1. New Policy Service (Greenfield):
      - Endpoints: POST /policies (new business only)
      - Data: Write to PostgreSQL
      - Technology: Spring Boot 3, Java 17, Kafka
      - Pattern: Event-driven architecture
    
  2. Billing Service:
      - Endpoints: POST /payments, GET /invoices
      - Data: Dual-write (PostgreSQL + Oracle)
      - Technology: Spring Boot 3, Stripe integration
    
  3. Claims Submission Service:
      - Endpoints: POST /claims
      - Data: Dual-write (PostgreSQL + Oracle)
      - Technology: Spring Boot 3, Java 17
    
Data Consistency:
  - Implement dual-write pattern
  - Use Saga pattern for distributed transactions
  - CDC (Debezium) for Oracle → PostgreSQL sync

Team:
  - 8 Backend Engineers
  - 2 QA Engineers
  - 1 Data Engineer (CDC setup)

Budget: $480K
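The CDC leg of this phase (Debezium reading Oracle changes and publishing events via Kafka) ultimately reduces to applying create/update/delete events in order on the consumer side. A toy Python sketch: the event shape follows Debezium's `op` codes, but the in-memory dict is an illustrative stand-in for PostgreSQL:

```python
# Sketch of a CDC consumer applying Debezium-style change events from the
# legacy Oracle schema to the new store. "table" is an in-memory stand-in.
def apply_change(event: dict, table: dict) -> None:
    op = event["op"]  # Debezium op codes: c=create, u=update, d=delete
    if op in ("c", "u"):
        row = event["after"]          # post-change row image
        table[row["policy_id"]] = row
    elif op == "d":
        table.pop(event["before"]["policy_id"], None)  # pre-change image

policies: dict[str, dict] = {}
apply_change({"op": "c", "after": {"policy_id": "P-100", "premium": 1200}}, policies)
apply_change({"op": "u", "after": {"policy_id": "P-100", "premium": 1350}}, policies)
apply_change({"op": "d", "before": {"policy_id": "P-100"}}, policies)
```

Applying events strictly in commit order per key is what makes this safe; out-of-order application would let a stale update overwrite a newer one.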

Month 13-18: Migrate Core Services

Services:
  1. Policy Management Service:
      - Endpoints: PUT /policies/{id}, POST /policies/renew
      - Data: Dual-write, gradual migration
      - Complexity: High (850 business rules)
    
  2. Rating Engine Service:
      - Endpoints: POST /quotes/calculate
      - Data: Read from PostgreSQL
      - Complexity: Very High (complex actuarial logic)
      - Approach: Extract as library first, then service
    
  3. Underwriting Service:
      - Endpoints: POST /underwriting/assess
      - Data: Read from PostgreSQL
      - Complexity: High (risk assessment rules)
    
Migration Strategy:
  - Shadow mode: Run new service in parallel, compare results
  - Gradual cutover: 10% → 25% → 50% → 100%
  - Rollback plan: Feature flags for instant rollback

Team:
  - 10 Backend Engineers
  - 2 QA Engineers
  - 1 Business Analyst (validate business rules)

Budget: $600K

Month 19-24: Decommission Legacy

Activities:
  - Migrate remaining edge cases (5% of traffic)
  - Final data migration from Oracle to PostgreSQL
  - Decommission WebSphere cluster
  - Decommission Oracle RAC
  - Archive legacy codebase
  - Update documentation

Data Migration:
  - Use AWS DMS for bulk migration
  - Validate data integrity (checksums, row counts)
  - Maintain Oracle read-only for 3 months (safety net)

Team:
  - 6 Engineers (migration, validation)
  - 2 DBAs (data migration)
  - 1 Compliance Officer (regulatory sign-off)

Budget: $360K
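The "checksums, row counts" validation called out in the decommissioning plan can be sketched as a reconciliation pass keyed on the primary key. Illustrative Python only; a real run would stream rows from both databases rather than hold them in memory:

```python
import hashlib
import json

def row_checksum(row: dict) -> str:
    """Stable checksum over a row; canonical JSON so key order doesn't matter."""
    canonical = json.dumps(row, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def reconcile(source_rows: list, target_rows: list, key: str) -> dict:
    """Compare row counts and per-row checksums between source and target."""
    src = {r[key]: row_checksum(r) for r in source_rows}
    tgt = {r[key]: row_checksum(r) for r in target_rows}
    return {
        "count_match": len(src) == len(tgt),
        "missing": sorted(set(src) - set(tgt)),      # in source, not in target
        "extra": sorted(set(tgt) - set(src)),        # in target, not in source
        "mismatched": sorted(k for k in set(src) & set(tgt) if src[k] != tgt[k]),
    }
```

Keeping Oracle read-only for three months, as above, means this reconciliation can be re-run against a stable source until the team trusts the result.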

Strangler Pattern Implementation Details

API Gateway Routing Strategy:

```yaml
# Kong API Gateway configuration
services:
  - name: policy-service-v2
    url: http://policy-service.eks.cluster:8080
    routes:
      - name: policy-search
        paths:
          - /api/v2/policies/search
        methods:
          - GET
        plugins:
          - name: rate-limiting
            config:
              minute: 100
          - name: request-transformer
            config:
              add:
                headers:
                  - X-Service-Version:v2

  - name: legacy-websphere
    url: http://legacy-lb.vpc:9080
    routes:
      - name: legacy-fallback
        paths:
          - /api/v1/*
        methods:
          - ALL
        plugins:
          - name: canary
            config:
              percentage: 75  # Gradually decrease
              upstream_fallback: policy-service-v2
```

Feature Flag Strategy:

```yaml
# LaunchDarkly feature flags
feature_flags:
  new_policy_service:
    enabled: true
    rollout:
      - rule: user.state == "CA"
        percentage: 100  # California users first
      - rule: user.state == "TX"
        percentage: 50   # Texas users gradual
      - rule: default
        percentage: 10   # Other states conservative
    fallback: legacy_service

  new_rating_engine:
    enabled: true
    rollout:
      - rule: policy.type == "auto"
        percentage: 25   # Auto policies first
      - rule: policy.type == "home"
        percentage: 0    # Home policies later
    fallback: legacy_rating_engine
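Evaluating rules like these amounts to: the first matching rule wins, and its percentage then gates the user deterministically. A hypothetical evaluator (this is not LaunchDarkly's actual SDK API; the flag shape and names are illustrative):

```python
import hashlib

# Illustrative flag mirroring the config above: first matching rule wins.
FLAG = {
    "rules": [
        {"attr": "state", "value": "CA", "percentage": 100},
        {"attr": "state", "value": "TX", "percentage": 50},
    ],
    "default_percentage": 10,
    "target": "new_policy_service",
    "fallback": "legacy_service",
}

def _bucket(user_id: str) -> int:
    """Deterministic 0-99 bucket per user."""
    return int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100

def evaluate(flag: dict, user: dict) -> str:
    pct = flag["default_percentage"]
    for rule in flag["rules"]:
        if user.get(rule["attr"]) == rule["value"]:
            pct = rule["percentage"]
            break  # first matching rule wins
    return flag["target"] if _bucket(user["id"]) < pct else flag["fallback"]
```

A 100% rule (California above) routes every matching user to the new service; a 0% rule pins a segment to the legacy path until the team is ready.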

Data Synchronization Architecture:

```mermaid
graph TB
  subgraph "New Services"
    A[Policy Service]
    B[Claims Service]
  end
  subgraph "Data Layer"
    C[PostgreSQL<br/>Primary]
    D[Oracle<br/>Legacy]
  end
  subgraph "Sync Layer"
    E[Debezium CDC]
    F[Kafka]
    G[Sync Service]
  end
  subgraph "Legacy System"
    H[WebSphere]
  end
  A -->|Write| C
  A -->|Dual Write| D
  D -->|CDC| E
  E -->|Events| F
  F -->|Consume| G
  G -->|Replicate| C
  H -->|Read/Write| D
  style C fill:#90EE90
  style D fill:#FF6B6B
```

Risk Mitigation Strategies

1. Data Consistency Risks

Risk: Dual-write failures lead to data inconsistency

Mitigation:

  • Implement compensating transactions
  • Use Saga pattern with rollback logic
  • CDC as safety net (eventual consistency)
  • Daily reconciliation jobs

Monitoring:

```yaml
alerts:
  - name: data_inconsistency_detected
    condition: |
      count(postgres_records) != count(oracle_records)
    severity: critical
    action: page_on_call_engineer
```
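The compensating-transaction mitigation can be illustrated in a few lines: if the legacy write fails, the new-store write is undone so neither side drifts. A minimal sketch, with in-memory dicts standing in for PostgreSQL and the legacy Oracle API:

```python
class DualWriteError(Exception):
    """Raised when the dual-write could not complete on both stores."""

def create_policy(policy: dict, postgres: dict, oracle: dict,
                  oracle_fails: bool = False) -> None:
    pid = policy["policy_id"]
    postgres[pid] = policy              # 1. write to the new store
    try:
        if oracle_fails:                # stand-in for a legacy API failure
            raise ConnectionError("legacy write failed")
        oracle[pid] = policy            # 2. dual-write to the legacy store
    except ConnectionError as exc:
        del postgres[pid]               # 3. compensating transaction: undo step 1
        raise DualWriteError(str(exc)) from exc
```

In production the compensation itself can fail, which is why the CDC stream and daily reconciliation jobs listed above remain the safety net rather than an optional extra.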

2. Performance Degradation Risks

Risk: Network latency between services degrades performance

Mitigation:

  • Implement caching (Redis) for frequently accessed data
  • Use GraphQL for efficient data fetching
  • Optimize database queries (indexes, query plans)
  • Load testing before each rollout

Performance Targets:

| Metric | Legacy | Target | Actual (Month 14) |
| --- | --- | --- | --- |
| P50 Latency | 800ms | < 600ms | 520ms ✅ |
| P95 Latency | 1,800ms | < 2,000ms | 1,650ms ✅ |
| P99 Latency | 4,200ms | < 4,000ms | 3,800ms ✅ |

3. Rollback Risks

Risk: Cannot rollback if new service has critical bug

Mitigation:

  • Feature flags for instant traffic routing
  • Maintain legacy system operational for 6 months post-cutover
  • Blue-green deployments for zero-downtime rollback
  • Automated rollback triggers

Rollback Procedure:

```yaml
rollback_triggers:
  - error_rate > 5%
  - latency_p95 > 3000ms
  - data_inconsistency_detected

rollback_actions:
  - Disable feature flag (instant traffic shift)
  - Alert on-call engineer
  - Create incident ticket
  - Roll back deployment (if needed)
  - Post-mortem within 24 hours
```
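Wired together, the triggers and the first rollback action look roughly like this (thresholds taken from the config above; the function and flag names are illustrative):

```python
# Thresholds mirroring the rollback triggers above.
THRESHOLDS = {"error_rate": 0.05, "latency_p95_ms": 3000}

def should_rollback(metrics: dict) -> list:
    """Return the list of breached triggers (empty list = healthy)."""
    breached = [name for name, limit in THRESHOLDS.items()
                if metrics.get(name, 0) > limit]
    if metrics.get("data_inconsistency_detected"):
        breached.append("data_inconsistency_detected")
    return breached

def apply_rollback(flags: dict, service: str, metrics: dict) -> list:
    """First rollback action: disable the flag, shifting traffic to legacy."""
    breached = should_rollback(metrics)
    if breached:
        flags[service] = False
    return breached
```

Flipping the flag is instantaneous and reversible, which is why it comes before the slower actions (paging, ticketing, redeploying) in the runbook.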

4. Knowledge Gap Risks

Risk: Undocumented business logic causes incorrect behavior in new services

Mitigation:

  • Shadow mode testing (run new service in parallel, compare outputs)
  • Business analyst validation for each service
  • Extensive regression testing (1,200 test cases)
  • Gradual rollout with monitoring

Shadow Mode Example:

```yaml
# Policy rating engine shadow mode
shadow_mode:
  enabled: true
  duration: 30_days

  comparison:
    - input: policy_application
    - legacy_output: legacy_rating_engine.calculate()
    - new_output: new_rating_engine.calculate()
    - diff: compare(legacy_output, new_output)

  alerts:
    - condition: diff.percentage > 1%
      action: log_discrepancy
    - condition: diff.percentage > 5%
      action: page_engineer

  metrics:
    - match_rate: 98.7%  # Target: > 99%
    - avg_diff: 0.3%     # Target: < 0.5%
```
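The comparison step reduces to computing a relative difference per application and aggregating a match rate. A minimal sketch, where the engine callables are stand-ins for the real legacy and new rating engines:

```python
def compare_premiums(legacy: float, new: float, tolerance: float = 0.01):
    """Return (match, relative_diff); a match is within 1% by default."""
    diff = abs(new - legacy) / legacy
    return diff <= tolerance, diff

def shadow_run(applications, legacy_engine, new_engine, tolerance=0.01):
    """Run both engines on the same inputs and aggregate the comparison."""
    results = [compare_premiums(legacy_engine(a), new_engine(a), tolerance)
               for a in applications]
    matches = sum(1 for ok, _ in results if ok)
    return {"match_rate": matches / len(results),
            "max_diff": max(d for _, d in results)}
```

Because shadow mode never serves the new engine's output to customers, discrepancies are free to find; the 30-day window is about accumulating enough rare policy shapes to trust the match rate.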

5. Regulatory Compliance Risks

Risk: Migration disrupts audit trail or violates compliance requirements

Mitigation:

  • Maintain complete audit logs in both systems
  • Compliance officer review at each phase
  • External audit before final cutover
  • Regulatory sandbox testing

Compliance Checklist:

```yaml
compliance_requirements:
  SOC2:
    - audit_logging: ✅ Implemented in all new services
    - access_control: ✅ IAM roles with least privilege
    - encryption: ✅ TLS 1.3, AES-256 at rest
    - incident_response: ✅ Runbooks documented

  HIPAA:
    - phi_protection: ✅ Field-level encryption
    - access_logs: ✅ CloudTrail enabled
    - data_retention: ✅ 7-year retention policy
    - breach_notification: ✅ Automated alerts

  State_Insurance_Regulations:
    - policy_data_integrity: ✅ Daily reconciliation
    - transaction_audit_trail: ✅ Event sourcing
    - disaster_recovery: ✅ Multi-AZ, RTO < 4 hours
```

Technology Stack

New Services:

| Layer | Technology | Rationale |
| --- | --- | --- |
| Language | Java 17 | LTS, security patches, team expertise |
| Framework | Spring Boot 3.2 | Modern, cloud-native, extensive ecosystem |
| API | REST + GraphQL | REST for commands, GraphQL for queries |
| Database | PostgreSQL 15 | Open-source, cost-effective, JSON support |
| Caching | Redis 7 | In-memory performance, pub/sub |
| Message Bus | Kafka 3.6 | Event-driven architecture, high throughput |
| Container | Docker | Standardized packaging |
| Orchestration | Kubernetes (EKS) | Auto-scaling, self-healing |
| Service Mesh | Istio | Traffic management, observability |
| API Gateway | Kong | Routing, rate limiting, authentication |
| Observability | Datadog, Jaeger | Metrics, traces, logs |
| CI/CD | GitLab CI, ArgoCD | Automated deployments, GitOps |

Infrastructure:

```yaml
AWS_Services:
  Compute:
    - EKS (Kubernetes): Application workloads
    - EC2 (t3.large): Legacy WebSphere (temporary)
    - Lambda: Serverless functions (document generation)

  Database:
    - RDS PostgreSQL Multi-AZ: Primary database
    - RDS Oracle (temporary): Legacy database during migration
    - ElastiCache Redis: Caching layer

  Storage:
    - S3: Document storage, backups
    - EFS: Shared file system

  Networking:
    - VPC: Isolated network
    - ALB: Load balancing
    - Route 53: DNS management
    - Direct Connect: On-premises connectivity

  Security:
    - IAM: Access control
    - KMS: Encryption key management
    - Secrets Manager: Credential storage
    - WAF: Web application firewall

  Observability:
    - CloudWatch: Metrics and logs
    - X-Ray: Distributed tracing (backup)
```

Team Organization

Migration Team Structure:

```mermaid
graph TB
  A[Migration Program Manager]
  B[Platform Team<br/>5 engineers]
  C[Service Team 1<br/>4 engineers]
  D[Service Team 2<br/>4 engineers]
  E[Data Team<br/>3 engineers]
  F[QA Team<br/>3 engineers]
  A --> B
  A --> C
  A --> D
  A --> E
  A --> F
  G[Business Analyst]
  H[Security Engineer]
  I[Compliance Officer]
  A --> G
  A --> H
  A --> I
  style A fill:#FFD700
  style B fill:#87CEEB
  style C fill:#90EE90
  style D fill:#90EE90
  style E fill:#FFA07A
  style F fill:#DDA0DD
```

Roles and Responsibilities:

| Role | Count | Responsibilities |
| --- | --- | --- |
| Migration Program Manager | 1 | Overall coordination, stakeholder communication, risk management |
| Platform Engineers | 5 | AWS infrastructure, Kubernetes, CI/CD, observability |
| Backend Engineers | 8 | Microservices development, API design, business logic |
| Data Engineers | 3 | CDC setup, data migration, ETL pipelines |
| QA Engineers | 3 | Test automation, regression testing, performance testing |
| Business Analyst | 1 | Requirements validation, business rule documentation |
| Security Engineer | 1 | Security compliance, penetration testing, IAM |
| Compliance Officer | 1 | Regulatory compliance, audit coordination |

Team Allocation by Phase:

| Phase | Platform | Backend | Data | QA | Total |
| --- | --- | --- | --- | --- | --- |
| Foundation (M1-3) | 5 | 2 | 1 | 1 | 9 |
| Read Services (M4-6) | 3 | 6 | 1 | 2 | 12 |
| Transactional (M7-12) | 2 | 8 | 2 | 3 | 15 |
| Core Services (M13-18) | 2 | 10 | 2 | 3 | 17 |
| Decommission (M19-24) | 3 | 6 | 3 | 2 | 14 |

Budget Breakdown

Total Budget: $3.5M over 24 months

| Category | Amount | Percentage |
| --- | --- | --- |
| Personnel | $2.1M | 60% |
| AWS Infrastructure | $720K | 21% |
| Tooling & Licenses | $280K | 8% |
| Consulting | $210K | 6% |
| Training | $105K | 3% |
| Contingency | $85K | 2% |

Personnel Costs:

| Role | Monthly Rate | Duration | Total |
| --- | --- | --- | --- |
| Migration PM | $15K | 24 months | $360K |
| Platform Engineers (5) | $12K each | 24 months | $1,440K |
| Backend Engineers (8) | $10K each | 18 months avg | $1,440K |
| Data Engineers (3) | $11K each | 18 months avg | $594K |
| QA Engineers (3) | $9K each | 20 months avg | $540K |
| Business Analyst | $10K | 18 months | $180K |
| Security Engineer | $13K | 12 months | $156K |
| Compliance Officer | $12K | 12 months | $144K |
| **Subtotal** | | | **$4.85M** |
| **Blended rate (existing team)** | | | **$2.1M** |

Note: Using existing team reduces costs by 57% vs external hires

Infrastructure Costs:

| Service | Monthly Cost | 24 Months | Notes |
| --- | --- | --- | --- |
| EKS Cluster | $8K | $192K | 3 node groups, auto-scaling |
| RDS PostgreSQL | $4K | $96K | Multi-AZ, 2TB storage |
| RDS Oracle (temp) | $6K | $72K | 12 months only |
| ElastiCache Redis | $2K | $48K | 3-node cluster |
| S3 Storage | $1K | $24K | 10TB documents |
| Data Transfer | $3K | $72K | Cross-AZ, internet egress |
| CloudWatch | $2K | $48K | Logs, metrics |
| Other Services | $4K | $96K | Lambda, ALB, Route 53 |
| EC2 (legacy) | $12K | $72K | 6 months only |
| **Total** | **$30K avg** | **$720K** | |

Tooling & Licenses:

| Tool | Annual Cost | 2 Years | Purpose |
| --- | --- | --- | --- |
| Datadog | $60K | $120K | Observability |
| GitLab Ultimate | $25K | $50K | CI/CD, source control |
| Kong Enterprise | $30K | $60K | API Gateway |
| LaunchDarkly | $15K | $30K | Feature flags |
| Snyk | $10K | $20K | Security scanning |
| **Total** | **$140K** | **$280K** | |

Consulting:

| Service | Cost | Purpose |
| --- | --- | --- |
| AWS Solutions Architect | $80K | Infrastructure design, 4 months |
| DDD Consultant | $60K | Domain modeling, 3 months |
| Security Audit | $40K | Penetration testing, compliance |
| Performance Testing | $30K | Load testing, optimization |
| **Total** | **$210K** | |

Training:

| Course | Cost | Attendees | Total |
| --- | --- | --- | --- |
| AWS Certification | $300 | 15 | $4.5K |
| Kubernetes (CKA) | $400 | 10 | $4K |
| Spring Boot 3 | $500 | 12 | $6K |
| DDD Workshop | $2K | 15 | $30K |
| Kafka Training | $800 | 8 | $6.4K |
| Security Training | $1K | 15 | $15K |
| Conference Attendance | $3K | 13 | $39K |
| **Total** | | | **$105K** |

Success Metrics

Migration Progress Metrics:

| Metric | Target | Current (Month 14) | Status |
| --- | --- | --- | --- |
| Services Migrated | 8 total | 5 completed | ✅ On Track |
| Traffic on New Services | 60% by M14 | 58% | ✅ On Track |
| Legacy Code Removed | 50% by M14 | 47% | ✅ On Track |
| Data Migrated | 40% by M14 | 42% | ✅ On Track |

Business Continuity Metrics:

| Metric | Target | Actual (Month 14) | Status |
| --- | --- | --- | --- |
| Uptime | 99.5% | 99.7% | ✅ Exceeded |
| Revenue Impact | $0 | $0 | ✅ Met |
| Customer Complaints | < 10/month | 6/month | ✅ Met |
| Regulatory Incidents | 0 | 0 | ✅ Met |

Performance Metrics:

| Metric | Legacy Baseline | Target | Actual (Month 14) | Status |
| --- | --- | --- | --- | --- |
| P50 Latency | 800ms | < 600ms | 520ms | ✅ Exceeded |
| P95 Latency | 1,800ms | < 2,000ms | 1,650ms | ✅ Met |
| P99 Latency | 4,200ms | < 4,000ms | 3,800ms | ✅ Met |
| Throughput | 850 TPS | > 850 TPS | 1,200 TPS | ✅ Exceeded |
| Error Rate | 0.8% | < 0.5% | 0.3% | ✅ Exceeded |

Cost Metrics:

| Metric | Baseline | Target | Actual (Month 14) | Status |
| --- | --- | --- | --- | --- |
| Monthly Infrastructure | $120K | $85K by M24 | $105K | ✅ On Track |
| Licensing Costs | $450K/year | $0 by M24 | $225K/year | ✅ On Track |
| Operational Overhead | 40 hrs/week | 20 hrs/week | 28 hrs/week | ✅ On Track |

Security Metrics:

| Metric | Baseline | Target | Actual (Month 14) | Status |
|--------|----------|--------|-------------------|--------|
| Critical Vulnerabilities | 247 | 0 by M12 | 0 | ✅ Met |
| High Vulnerabilities | 89 | < 10 by M12 | 4 | ✅ Exceeded |
| Security Incidents | 2/year | 0 | 0 | ✅ Met |
| Compliance Score | 78% | 95% by M12 | 96% | ✅ Exceeded |

Team Metrics:

| Metric | Baseline | Target | Actual (Month 14) | Status |
|--------|----------|--------|-------------------|--------|
| Deployment Frequency | 1/month | 1/week | 3/week | ✅ Exceeded |
| Lead Time | 45 days | < 14 days | 9 days | ✅ Exceeded |
| MTTR | 4.5 hours | < 2 hours | 1.8 hours | ✅ Exceeded |
| Team Satisfaction | 6.2/10 | > 7.5/10 | 8.1/10 | ✅ Exceeded |

Post-Decision Reflection
#

Outcomes Achieved (Month 14 of 24)
#

Migration Progress:

5 of 8 services migrated:

  1. Policy Inquiry Service (Month 5)
  2. Claims History Service (Month 6)
  3. Document Service (Month 6)
  4. New Policy Service (Month 9)
  5. Billing Service (Month 11)

🚧 In Progress:

  6. Claims Submission Service (Month 15 target)
  7. Policy Management Service (Month 17 target)
  8. Rating Engine Service (Month 18 target)

Traffic Distribution:

  • New services: 58% of total traffic
  • Legacy system: 42% of total traffic
  • Zero revenue-impacting incidents during migration

Security Compliance:

  • ✅ All critical vulnerabilities remediated (Month 11)
  • ✅ Passed external security audit (Month 12)
  • ✅ Regulatory compliance maintained (SOC 2, HIPAA)

Cost Savings:

  • Current monthly infrastructure: $105K (13% reduction from baseline)
  • Projected post-migration: $70K/month (42% reduction)
  • Decommissioned 4 of 8 WebSphere nodes (50% reduction)

Performance Improvements:

  • P50 latency: 800ms → 520ms (35% improvement)
  • P95 latency: 1,800ms → 1,650ms (8% improvement)
  • Throughput: 850 TPS → 1,200 TPS (41% increase)
  • Error rate: 0.8% → 0.3% (63% reduction)

Team Velocity:

  • Deployment frequency: 1/month → 3/week (1,200% increase)
  • Lead time: 45 days → 9 days (80% reduction)
  • MTTR: 4.5 hours → 1.8 hours (60% reduction)

Challenges Encountered
#

1. Data Consistency Complexity

Issue: Dual-write pattern caused data inconsistencies in 3 incidents

Example Incident (Month 8):

Incident: Policy update written to PostgreSQL but failed to Oracle
Root Cause: Network timeout to Oracle during high load
Impact: 47 policies out of sync for 2 hours
Resolution: CDC detected discrepancy, auto-reconciliation triggered
Lesson: Implement circuit breaker for dual-write failures

Resolution:

  • Implemented Saga pattern with compensating transactions
  • Added circuit breaker (Resilience4j) for Oracle writes
  • Enhanced CDC monitoring with real-time alerts
  • Daily reconciliation job to catch edge cases

Code Example:

```java
@Service
public class PolicyService {

    @Transactional
    public Policy updatePolicy(PolicyUpdateRequest request) {
        // Write to PostgreSQL (primary)
        Policy policy = policyRepository.save(request.toEntity());

        // Dual-write to Oracle (legacy) with circuit breaker
        try {
            circuitBreaker.executeSupplier(() ->
                legacyPolicyService.updatePolicy(policy)
            );
        } catch (CallNotPermittedException e) {
            // Resilience4j throws CallNotPermittedException when the circuit
            // is open; queue the write for async retry instead of failing the request
            retryQueue.enqueue(new PolicySyncTask(policy));
            log.warn("Oracle write failed, queued for retry: {}", policy.getId());
        }

        // Publish event for CDC validation
        eventPublisher.publish(new PolicyUpdatedEvent(policy));

        return policy;
    }
}
```
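The queued-retry path can be sketched as a small drain loop with a bounded attempt count and a dead-letter hand-off to the daily reconciliation job. `PolicySyncTask`, `MAX_ATTEMPTS`, and the sync callback below are illustrative names, not part of Resilience4j or any library:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Predicate;

/** Sketch of the async worker that drains queued Oracle writes once the
    circuit closes. All names here are illustrative, not a real library API. */
public class PolicySyncRetryWorker {

    record PolicySyncTask(String policyId, int attempts) {}

    static final int MAX_ATTEMPTS = 5;

    /** Retry each queued task once; re-queue failures until MAX_ATTEMPTS,
        then hand them to the dead-letter set for the reconciliation job. */
    public static Deque<PolicySyncTask> drain(Deque<PolicySyncTask> queue,
                                              Predicate<String> syncToOracle) {
        Deque<PolicySyncTask> deadLetter = new ArrayDeque<>();
        int size = queue.size();
        for (int i = 0; i < size; i++) {
            PolicySyncTask task = queue.poll();
            if (syncToOracle.test(task.policyId())) {
                continue; // synced successfully, drop from queue
            }
            if (task.attempts() + 1 >= MAX_ATTEMPTS) {
                deadLetter.add(task); // reconciliation job picks these up
            } else {
                queue.add(new PolicySyncTask(task.policyId(), task.attempts() + 1));
            }
        }
        return deadLetter;
    }
}
```

The bounded attempt count matters: without it, a permanently failing policy would spin in the queue forever instead of surfacing in the reconciliation report.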

2. Shadow Mode Discrepancies

Issue: Rating engine shadow mode showed 8% discrepancy rate (target: < 1%)

Root Cause Analysis:

  • Legacy rating engine had undocumented rounding logic
  • Date calculations used server timezone (inconsistent)
  • Floating-point precision differences (Java 6 vs Java 17)

Resolution:

  • Reverse-engineered legacy rounding logic (2 weeks)
  • Standardized on UTC timezone
  • Implemented BigDecimal for financial calculations
  • Extended shadow mode from 30 to 60 days
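The floating-point fix came down to doing money math in `BigDecimal` with an explicit rounding mode matching the reverse-engineered legacy behavior. A minimal sketch, with illustrative rate values:

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

/** Money-math sketch: BigDecimal with explicit rounding reproduces the
    legacy engine's results where double cannot. Rates are illustrative. */
public class PremiumMath {

    /** Multiply base premium by a rate factor, rounding half-up to cents
        to match the reverse-engineered legacy rounding. */
    public static BigDecimal applyRate(BigDecimal base, BigDecimal rateFactor) {
        return base.multiply(rateFactor).setScale(2, RoundingMode.HALF_UP);
    }
}
```

With `double`, an expression like `1100 * 0.07` does not come out to exactly `77.0`, which is precisely the class of drift the shadow-mode comparison surfaced.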

Discrepancy Trend:

| Week | Discrepancy Rate | Action Taken |
|------|------------------|--------------|
| Week 1 | 8.2% | Identified rounding issue |
| Week 2 | 5.1% | Fixed rounding logic |
| Week 3 | 2.8% | Fixed timezone issue |
| Week 4 | 1.2% | Fixed floating-point precision |
| Week 5 | 0.6% | Extended testing |
| Week 6 | 0.4% | ✅ Approved for rollout |

3. Team Coordination Overhead

Issue: 5 teams working on interdependent services caused bottlenecks

Example:

  • Policy Service needed Claims Service API (not yet built)
  • Billing Service blocked on Payment Gateway integration
  • Data Team overwhelmed with CDC setup requests

Resolution:

  • Implemented API-first design (OpenAPI specs upfront)
  • Created mock services for dependencies
  • Hired additional data engineer (Month 10)
  • Weekly cross-team sync meetings
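The mock-service approach can be as small as coding against the agreed interface with canned responses, so Policy Service work proceeds before Claims Service exists. `ClaimsClient` and the claim IDs below are illustrative, not the real API:

```java
import java.util.List;

/** API-first sketch: the Claims API is agreed as an interface (derived from
    its OpenAPI spec) before implementation. Names are illustrative. */
interface ClaimsClient {
    List<String> openClaimIds(String policyId);
}

/** Canned responses stand in for the not-yet-built Claims Service, so
    dependent teams are unblocked and contract tests have a target. */
class MockClaimsClient implements ClaimsClient {
    @Override
    public List<String> openClaimIds(String policyId) {
        return List.of("CLM-1001", "CLM-1002");
    }
}
```

Once the real service ships, only the binding changes; callers and contract tests stay the same.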

Coordination Improvements:

| Metric | Before | After | Change |
|--------|--------|-------|--------|
| Blocked Stories | 18% | 6% | -67% |
| Cross-Team PRs | 12/week | 4/week | -67% |
| Integration Issues | 8/sprint | 2/sprint | -75% |

4. Legacy System Stability

Issue: Legacy WebSphere became unstable as traffic decreased

Root Cause:

  • Connection pool sized for 100% traffic, now handling 42%
  • Idle connections timing out
  • Memory leaks in rarely-used code paths

Resolution:

  • Tuned WebSphere connection pools
  • Increased monitoring on legacy system
  • Planned accelerated decommissioning (Month 18 vs Month 24)

5. Observability Gaps

Issue: First 2 months had blind spots in distributed tracing

Example:

  • Could not trace requests across legacy and new services
  • Missing correlation IDs in legacy system
  • Incomplete error context in logs

Resolution:

  • Implemented correlation ID injection at API Gateway
  • Added tracing adapter for legacy system (Jaeger agent)
  • Standardized logging format (JSON structured logs)
  • Created unified dashboards (Datadog)
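Correlation ID injection at the gateway boils down to "reuse if present, mint if absent", so legacy and new services end up sharing one trace. A minimal sketch; the header name is our convention, not a standard:

```java
import java.util.Map;
import java.util.UUID;

/** Gateway-side correlation ID handling (sketch; header name is a
    project convention, not part of any specification). */
public class CorrelationId {

    public static final String HEADER = "X-Correlation-Id";

    /** Reuse the caller's ID when present so downstream hops join the same
        trace; otherwise mint a new one at the edge and propagate it. */
    public static String ensure(Map<String, String> headers) {
        String id = headers.get(HEADER);
        if (id == null || id.isBlank()) {
            id = UUID.randomUUID().toString();
            headers.put(HEADER, id);
        }
        return id;
    }
}
```

The same ID is then echoed into every structured log line, which is what makes cross-system log queries possible.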

Observability Maturity:

```mermaid
gantt
    title Observability Implementation
    dateFormat YYYY-MM
    section Metrics
    CloudWatch Basic        :done, 2025-01, 1M
    Datadog Integration     :done, 2025-02, 1M
    Custom Business Metrics :done, 2025-04, 2M
    section Tracing
    Jaeger Setup            :done, 2025-03, 1M
    Legacy Integration      :done, 2025-05, 2M
    End-to-End Tracing      :done, 2025-07, 1M
    section Logging
    Structured Logging      :done, 2025-02, 1M
    Log Aggregation         :done, 2025-03, 1M
    Log Analytics           :done, 2025-06, 2M
```

Unexpected Benefits
#

1. Improved Developer Experience

Before Migration:

  • 8-hour deployment window (Saturday nights)
  • Manual deployment process (40-step runbook)
  • 28-minute build time
  • No local development environment

After Migration:

  • Continuous deployment (3x/week, daytime)
  • Automated CI/CD (GitLab CI + ArgoCD)
  • 6-minute build time per service
  • Docker Compose local environment

Developer Satisfaction:

Survey Question: "How satisfied are you with the development workflow?"

Before: 6.2/10
After:  8.1/10 (+31%)

Comments:
- "I can deploy my changes in 10 minutes instead of waiting a week"
- "Local development is so much faster with Docker"
- "I actually understand the codebase now (my service only)"

2. Faster Feature Delivery

Unexpected Outcome: New features delivered during migration

Examples:

  • Mobile API (Month 7): Built on new services, would have taken 6 months in legacy
  • Real-Time Notifications (Month 9): Kafka event-driven, impossible in legacy
  • Self-Service Portal (Month 12): GraphQL API, 3-month delivery vs 9-month estimate

Feature Velocity:

| Period | Features Delivered | Avg Lead Time |
|--------|--------------------|---------------|
| Pre-Migration | 4/quarter | 45 days |
| During Migration | 7/quarter | 18 days |
| **Improvement** | **+75%** | **-60%** |

3. Cost Savings Exceeded Expectations

Original Projection: 40% cost reduction post-migration

Actual (Month 14): Already achieving 13% reduction, on track for 50%

Unexpected Savings:

  • Oracle Licensing: Negotiated early termination, saved $180K
  • WebSphere Licensing: Decommissioned 4 nodes early, saved $120K
  • Data Center: Reduced power/cooling costs, saved $40K
  • Operational Overhead: Automated monitoring, saved 12 hrs/week ($75K/year)

Total Unexpected Savings: $415K (12% of total budget)

4. Talent Attraction

Unexpected Outcome: Easier to recruit engineers

Before Migration:

  • Job postings: “Java 6, WebSphere, Oracle”
  • Applicant quality: Low (outdated tech stack)
  • Offer acceptance rate: 45%

During Migration:

  • Job postings: “Java 17, Spring Boot, Kubernetes, AWS”
  • Applicant quality: High (modern tech stack)
  • Offer acceptance rate: 78%

Hiring Metrics:

| Metric | Before | After | Change |
|--------|--------|-------|--------|
| Applicants per Role | 12 | 34 | +183% |
| Qualified Candidates | 3 | 12 | +300% |
| Offer Acceptance | 45% | 78% | +73% |
| Time to Hire | 90 days | 45 days | -50% |

5. Business Agility

Unexpected Outcome: Able to respond to market changes faster

Example (Month 11):

  • Competitor launched usage-based pricing
  • Business requested similar feature
  • Legacy estimate: 6 months (requires rating engine rewrite)
  • Actual delivery: 3 weeks (new Billing Service, feature flag)

Business Impact:

  • Retained 200 at-risk customers ($8M annual revenue)
  • Competitive advantage in market
  • Board confidence in technology team

Lessons Learned
#

1. Start with Read-Only Services

Lesson: Extracting read-only services first was the right decision

Rationale:

  • Low risk (no data writes)
  • Easy to validate (compare outputs)
  • Builds team confidence
  • Establishes patterns for later services

Recommendation: Always start with read-only or greenfield services

2. Shadow Mode is Essential

Lesson: Running new services in shadow mode caught critical bugs

Example: Rating engine discrepancies would have caused $2M in premium errors

Recommendation: Budget 2x time for shadow mode testing
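Shadow mode itself is a small wrapper: call both engines, serve the legacy answer, and record any disagreement for analysis. A sketch with illustrative names:

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Shadow-mode sketch: legacy stays authoritative while the new engine's
    output is compared and logged. Names are illustrative. */
public class ShadowComparator {

    final List<String> mismatches = new ArrayList<>();

    /** Serve the legacy result; record the policy ID whenever the
        candidate engine disagrees (compareTo ignores trailing zeros). */
    public BigDecimal rate(String policyId,
                           Function<String, BigDecimal> legacyEngine,
                           Function<String, BigDecimal> newEngine) {
        BigDecimal legacy = legacyEngine.apply(policyId);
        BigDecimal candidate = newEngine.apply(policyId);
        if (legacy.compareTo(candidate) != 0) {
            mismatches.add(policyId);
        }
        return legacy; // legacy is the source of truth during shadow mode
    }
}
```

The mismatch list is what drove the week-by-week discrepancy-rate table for the rating engine.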

3. Dual-Write is Hard

Lesson: Dual-write pattern is more complex than anticipated

Challenges:

  • Transaction coordination
  • Failure handling
  • Performance overhead
  • Data consistency

Recommendation: Minimize dual-write duration, use CDC as safety net

4. API-First Design

Lesson: Defining APIs upfront reduced integration issues

Approach:

  • OpenAPI specs before implementation
  • Mock services for dependencies
  • Contract testing (Pact)

Recommendation: Invest in API design workshops

5. Observability is Non-Negotiable

Lesson: Cannot debug distributed systems without proper observability

Must-Haves:

  • Distributed tracing (Jaeger)
  • Correlation IDs across all services
  • Structured logging (JSON)
  • Business-level metrics (not just technical)

Recommendation: Implement observability before extracting first service

6. Feature Flags are Critical

Lesson: Feature flags enabled safe, gradual rollouts

Use Cases:

  • Traffic shifting (10% → 25% → 50% → 100%)
  • Instant rollback (disable flag)
  • A/B testing (compare old vs new)
  • Regional rollouts (CA first, then TX, then all)

Recommendation: Invest in feature flag platform (LaunchDarkly, Split.io)
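Under the hood, percentage rollouts are deterministic bucketing, which is what makes the 10% → 25% → 50% → 100% shifts monotonic: users already enabled stay enabled as the percentage rises. A simplified sketch, not the LaunchDarkly API:

```java
/** Percentage-based rollout sketch (hypothetical helper, not a real
    feature-flag SDK). */
public class RolloutFlag {

    private final int percentage; // 0-100

    public RolloutFlag(int percentage) {
        this.percentage = percentage;
    }

    /** Deterministic bucketing: the same user always lands in the same
        bucket, so raising the percentage only adds users, never flips
        already-enabled ones off. */
    public boolean isEnabled(String userId) {
        int bucket = Math.floorMod(userId.hashCode(), 100);
        return bucket < percentage;
    }
}
```

Real platforms add per-flag salts and targeting rules on top, but the rollback property is the same: set the percentage to 0 and every user reverts instantly.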

7. Team Autonomy Requires Platform

Lesson: Platform team enabled service teams to move fast

Platform Responsibilities:

  • Kubernetes cluster management
  • CI/CD pipelines
  • Observability stack
  • Security scanning
  • Cost optimization

Recommendation: Dedicate 30% of team to platform engineering

8. Communication is Key

Lesson: Stakeholder communication prevented surprises

Cadence:

  • Daily: Team standups
  • Weekly: Cross-team sync, stakeholder update
  • Monthly: Executive briefing, board update
  • Quarterly: Architecture review

Recommendation: Over-communicate progress, risks, and wins

Anti-Patterns Avoided
#

1. Big-Bang Cutover

Anti-Pattern: Migrate entire system at once

Why Avoided: 70% failure rate for big-bang rewrites

Our Approach: Incremental, service-by-service migration

2. Premature Optimization

Anti-Pattern: Over-engineer new services

Why Avoided: YAGNI (You Aren’t Gonna Need It)

Our Approach: Start simple, optimize based on metrics

3. Ignoring Legacy System

Anti-Pattern: Let legacy system degrade during migration

Why Avoided: The legacy system still serves 42% of traffic and must stay healthy until decommissioned

Our Approach: Maintain legacy system until fully decommissioned

4. Skipping Testing

Anti-Pattern: Rush to production without adequate testing

Why Avoided: Mission-critical system, zero tolerance for errors

Our Approach: Shadow mode, gradual rollout, extensive regression testing

Future Considerations
#

Short-Term (Months 15-18):

  1. Complete Core Service Migration

    • Claims Submission Service (Month 15)
    • Policy Management Service (Month 17)
    • Rating Engine Service (Month 18)
  2. Enhance Observability

    • Implement OpenTelemetry
    • Add business-level SLOs
    • Create customer journey dashboards
  3. Optimize Performance

    • Implement GraphQL federation
    • Add Redis caching layer
    • Optimize database queries

Medium-Term (Months 19-24):

  1. Decommission Legacy System

    • Final data migration (Month 20)
    • Decommission WebSphere (Month 21)
    • Decommission Oracle (Month 22)
    • Archive legacy codebase (Month 23)
  2. Cost Optimization

    • Right-size EC2 instances
    • Implement auto-scaling
    • Use Spot instances for non-critical workloads
    • Negotiate AWS Enterprise Discount
  3. Security Hardening

    • Implement zero-trust networking
    • Add runtime security (Falco)
    • Enhance secrets management
    • Conduct penetration testing

Long-Term (Post-Migration):

  1. Further Decomposition

    • Split Policy Management into Policy + Endorsement services
    • Extract Notification Service
    • Extract Audit Service
  2. Advanced Capabilities

    • Implement event sourcing for audit trail
    • Add CQRS for read-heavy workloads
    • Explore serverless for batch jobs
    • Implement chaos engineering
  3. Business Innovation

    • Real-time pricing (IoT integration)
    • AI-powered underwriting
    • Self-service policy management
    • Mobile-first experience

ROI Analysis
#

Investment:

  • Total budget: $3.5M over 24 months
  • Opportunity cost: Delayed features (estimated $500K revenue)
  • Total Investment: $4.0M

Returns (Annual):

| Category | Amount | Notes |
|----------|--------|-------|
| Infrastructure Savings | $600K | 50% reduction ($1.2M → $600K) |
| Licensing Savings | $450K | Oracle + WebSphere eliminated |
| Operational Efficiency | $300K | Reduced manual work (40 hrs/week) |
| Faster Feature Delivery | $1.2M | 3 additional features/quarter × $100K |
| Reduced Incidents | $400K | 50% reduction in downtime costs |
| Avoided Regulatory Fines | $5.0M | Would have faced sanctions |
| **Total Annual Returns** | **$7.95M** | |

ROI Calculation:

  • Payback Period: 6 months post-migration
  • 3-Year ROI: 497% ($4.0M investment → $23.85M returns)
  • NPV (10% discount rate): $15.2M

Intangible Benefits:

  • Improved customer satisfaction (NPS: 28 → 42)
  • Enhanced competitive position
  • Reduced technical debt
  • Improved team morale
  • Easier talent recruitment


## Conclusion

The decision to use the Strangler Fig pattern for legacy system migration has proven successful at the 14-month mark. We have migrated 5 of 8 services, achieved 58% traffic on new services, and maintained 99.7% uptime throughout the migration.

### Key Success Factors

1. **Incremental Approach**: Service-by-service migration reduced risk and enabled continuous learning
2. **Strong Platform Foundation**: Investment in AWS infrastructure, Kubernetes, and observability paid dividends
3. **Shadow Mode Testing**: Caught critical discrepancies before production impact
4. **Feature Flags**: Enabled safe, gradual rollouts with instant rollback capability
5. **Team Empowerment**: Cross-functional squads with clear ownership accelerated delivery
6. **Stakeholder Communication**: Regular updates built trust and managed expectations

### Validation of Decision

The strangler pattern was the right choice for our context:

- **Risk Mitigation**: Zero revenue-impacting incidents during migration
- **Security Compliance**: All critical vulnerabilities remediated within 12 months
- **Business Continuity**: 99.7% uptime exceeded 99.5% SLA
- **Cost Optimization**: On track for 50% infrastructure cost reduction
- **Team Velocity**: 3x deployment frequency, 80% reduction in lead time
- **Innovation**: Delivered 7 new features during migration (vs 0 with rewrite)

### Alternative Outcomes

**If we had chosen Lift-and-Shift:**
- ❌ Security vulnerabilities would persist
- ❌ Technical debt would remain
- ❌ Limited cost savings (6% vs 50%)
- ❌ No improvement in team velocity

**If we had chosen Complete Rewrite:**
- ❌ 22-month feature freeze (unacceptable to business)
- ❌ High risk of missing undocumented business logic
- ❌ Budget overrun ($5.4M vs $3.5M approved)
- ❌ Likely timeline delays (historical 70% failure rate)

### Recommendations for Similar Migrations

**When to Use Strangler Pattern:**
- Mission-critical systems with high uptime requirements
- Limited understanding of legacy business logic
- Need to deliver new features during migration
- Team capacity constraints
- Budget limitations

**When NOT to Use Strangler Pattern:**
- Small, well-understood systems (lift-and-shift may suffice)
- Greenfield replacement with no legacy dependencies
- Unlimited budget and timeline
- Legacy system is completely undocumented (consider rewrite with extensive discovery)

**Critical Success Factors:**
1. Executive sponsorship and patience (24-month timeline)
2. Platform engineering investment (30% of team)
3. Observability from day one (distributed tracing, metrics, logs)
4. API-first design with contract testing
5. Shadow mode testing for critical services
6. Feature flags for gradual rollouts
7. Strong DevOps culture and automation

### Final Thoughts

Legacy system migration is not a purely technical challenge—it's an organizational transformation. The strangler pattern succeeded because it aligned with our business constraints (zero downtime, continuous feature delivery) and team capabilities (12 engineers, limited budget).

The journey is not complete. We have 10 months remaining to migrate the final 3 services and decommission the legacy system. However, the foundation is solid, the patterns are proven, and the team is confident.

For organizations facing similar challenges, we recommend starting with a clear understanding of your constraints, investing in platform capabilities, and embracing incremental change over big-bang transformations.

## References

- [Strangler Fig Application Pattern](https://martinfowler.com/bliki/StranglerFigApplication.html) - Martin Fowler
- [Monolith to Microservices](https://samnewman.io/books/monolith-to-microservices/) - Sam Newman
- [Building Evolutionary Architectures](https://www.oreilly.com/library/view/building-evolutionary-architectures/9781491986356/) - Neal Ford, Rebecca Parsons, Patrick Kua
- [Accelerate: The Science of Lean Software and DevOps](https://itrevolution.com/product/accelerate/) - Nicole Forsgren, Jez Humble, Gene Kim
- [AWS Migration Strategies](https://aws.amazon.com/cloud-migration/how-to-migrate/) - AWS Documentation
- Internal: Migration Runbooks, Architecture Decision Records, Service Documentation

---

**Document Status**: Living Document (Updated Monthly)
**Last Updated**: 2025-06-10 (Month 14 of 24)
**Next Review**: 2025-07-10
**Decision Owner**: CTO
**Contributors**: Migration Program Manager, Platform Team, Service Teams, Business Stakeholders
**Migration Progress**: 63% Complete (5/8 services migrated, 58% traffic on new services)
**Overall Status**: ✅ On Track (Green)

---

**Appendix A: Service Migration Status**

| Service | Status | Migration Date | Traffic % | Notes |
|---------|--------|---------------|-----------|-------|
| Policy Inquiry | ✅ Complete | Month 5 | 100% | First service, read-only |
| Claims History | ✅ Complete | Month 6 | 100% | Read-only, high volume |
| Document Service | ✅ Complete | Month 6 | 100% | S3-based storage |
| New Policy Service | ✅ Complete | Month 9 | 100% | Greenfield business |
| Billing Service | ✅ Complete | Month 11 | 100% | Dual-write to Oracle |
| Claims Submission | 🚧 In Progress | Month 15 (target) | 0% | Shadow mode testing |
| Policy Management | 📋 Planned | Month 17 (target) | 0% | Most complex service |
| Rating Engine | 📋 Planned | Month 18 (target) | 0% | Core business logic |

**Appendix B: Cost Tracking**

| Month | Infrastructure | Personnel | Tooling | Total | Budget | Variance |
|-------|---------------|-----------|---------|-------|--------|----------|
| M1-3 | $180K | $270K | $45K | $495K | $525K | -$30K |
| M4-6 | $210K | $360K | $60K | $630K | $630K | $0 |
| M7-12 | $540K | $720K | $120K | $1,380K | $1,400K | -$20K |
| M13-14 | $210K | $280K | $50K | $540K | $560K | -$20K |
| **Total** | **$1,140K** | **$1,630K** | **$275K** | **$3,045K** | **$3,115K** | **-$70K** |

**Budget Status**: Under budget by $70K (2.2%)

**Appendix C: Risk Register**

| Risk | Probability | Impact | Mitigation | Status |
|------|------------|--------|------------|--------|
| Data inconsistency | Medium | High | CDC, reconciliation jobs | ✅ Mitigated |
| Performance degradation | Low | High | Load testing, caching | ✅ Mitigated |
| Knowledge gaps | Medium | Medium | Shadow mode, BA validation | ✅ Mitigated |
| Team burnout | Low | Medium | Sustainable pace, rotation | ✅ Monitored |
| Regulatory non-compliance | Low | Critical | Compliance officer review | ✅ Mitigated |
| Budget overrun | Low | Medium | Monthly tracking, contingency | ✅ Under budget |
| Timeline delay | Medium | Medium | Buffer in schedule | ✅ On track |

**Appendix D: Glossary**

- **CDC**: Change Data Capture - technology for tracking database changes
- **Dual-Write**: Writing data to both old and new systems simultaneously
- **Shadow Mode**: Running new service in parallel with legacy, comparing outputs
- **Strangler Fig Pattern**: Incrementally replacing a legacy system by routing functionality to new services until the old system can be retired
- **Feature Flag**: Configuration toggle to enable/disable features at runtime
- **Circuit Breaker**: Design pattern to prevent cascading failures
- **Saga Pattern**: Managing distributed transactions across services
- **MTTR**: Mean Time To Recovery - average time to restore service after incident
- **SLA**: Service Level Agreement - contractual uptime commitment
- **TPS**: Transactions Per Second - throughput metric