Deployment and Monitoring
Successful AI deployment requires careful planning, phased rollouts, and continuous monitoring. This lesson covers deployment strategies and operational monitoring aligned with ISO 42001.
ISO 42001 Deployment and Monitoring Requirements
Annex A.6 - Deployment and Use:
- A.6.1: Deployment planning and control
- A.6.2: User training and awareness
- A.6.3: Operational procedures
- A.6.4: Change management
Annex A.7 - Monitoring and Continual Improvement:
- A.7.1: Performance monitoring
- A.7.2: Data quality monitoring
- A.7.3: Bias and fairness monitoring
- A.7.4: Incident management
Deployment Strategy Framework
Pre-Deployment Requirements
Technical Readiness:
- Model artifacts finalized and tested
- Infrastructure provisioned and validated
- APIs developed and tested
- Security controls implemented
- Monitoring infrastructure in place
- Backup and rollback procedures ready
Governance Readiness:
- Risk assessment approved
- Deployment authorization obtained
- Compliance verified
- Documentation complete
- Training completed
- Support processes established
Pre-Deployment Checklist:
## DEPLOYMENT READINESS ASSESSMENT
### Model Readiness
☐ Model validated and approved
☐ Performance meets requirements: [___]%
☐ Fairness criteria met: [___]% max disparity
☐ Robustness tested and acceptable
☐ Model artifacts versioned: v[___]
### Infrastructure Readiness
☐ Production environment provisioned
☐ Compute resources allocated: [___]
☐ Storage configured: [___]
☐ Network connectivity verified
☐ Load balancing configured
☐ Auto-scaling rules defined
### Integration Readiness
☐ API endpoints implemented
☐ API documentation published
☐ Integration tests passed: [___]%
☐ Error handling implemented
☐ Latency requirements met: [___]ms
☐ Throughput requirements met: [___] req/s
### Security Readiness
☐ Authentication implemented (OAuth/API keys)
☐ Authorization rules configured (RBAC)
☐ Encryption enabled (in transit & at rest)
☐ Input validation implemented
☐ Rate limiting configured
☐ Security audit completed
☐ Vulnerability scan passed
### Monitoring Readiness
☐ Performance dashboards created
☐ Alerting rules configured
☐ Log aggregation configured
☐ Metrics collection enabled
☐ On-call rotation established
☐ Runbooks created
### Data Readiness
☐ Production data access configured
☐ Data quality checks automated
☐ Data pipeline tested
☐ Backup procedures in place
☐ Data retention policy implemented
### Documentation Readiness
☐ Model card finalized
☐ API documentation complete
☐ User guide published
☐ Operator manual ready
☐ Troubleshooting guide available
☐ Incident response plan documented
### Training Readiness
☐ User training completed: [___]% of users
☐ Operator training completed: [___]% of ops
☐ Support team trained
☐ FAQs prepared
☐ Help desk briefed
### Governance Readiness
☐ Deployment approval obtained from: [___]
☐ Risk assessment reviewed and accepted
☐ Compliance checklist completed
☐ Change request approved: [___]
☐ Rollback plan documented and tested
### Communication Readiness
☐ Stakeholders notified: [___]
☐ Users informed: [___]
☐ Launch communication prepared
☐ Support channels ready
☐ Escalation paths clear
### Overall Status
☐ READY FOR DEPLOYMENT
☐ CONDITIONAL APPROVAL (specify conditions)
☐ NOT READY (specify blockers)
**Approvers:**
- Technical Lead: _______________ Date: _______
- Product Owner: _______________ Date: _______
- Security Officer: _______________ Date: _______
- AI Risk Officer: _______________ Date: _______
Deployment Strategies
1. Shadow Deployment
Description: New model runs in parallel with existing system without affecting production decisions.
When to Use:
- First production deployment
- Significant model changes
- High-risk applications
- Need to validate in production environment
Process:
- Deploy new model alongside existing system
- Send same production traffic to both
- Compare predictions and performance
- Log differences for analysis
- Monitor for unexpected behavior
- Validate performance matches expectations
- Transition to active after validation period
Duration: 1-4 weeks typical
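The prediction-comparison step in the process above can be sketched as a thin wrapper around both models. This is a minimal illustration, assuming both models expose a callable predict-style interface and that requests arrive as plain dictionaries; the names `production_model`, `shadow_model`, and the logging setup are illustrative, not part of any particular serving stack.
```python
# Minimal sketch of the shadow comparison step. Assumes both models are plain
# callables on dict-shaped requests; all names here are illustrative.
import logging
from typing import Any, Callable, Dict

logger = logging.getLogger("shadow_deployment")

def shadow_compare(
    request: Dict[str, Any],
    production_model: Callable[[Dict[str, Any]], Any],
    shadow_model: Callable[[Dict[str, Any]], Any],
) -> Any:
    """Serve the production prediction; run the shadow model for comparison only."""
    prod_pred = production_model(request)          # this is what the user receives
    try:
        shadow_pred = shadow_model(request)        # shadow result is never returned
        if shadow_pred != prod_pred:
            # Log disagreements to feed the agreement-rate metric
            logger.info("shadow_disagreement request=%s prod=%s shadow=%s",
                        request, prod_pred, shadow_pred)
    except Exception:                              # shadow failures must not affect users
        logger.exception("shadow_model_error request=%s", request)
    return prod_pred
```
The key property is that the shadow path can fail or disagree without ever changing what the user receives.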
Benefits:
- Zero user impact during testing
- Real production data validation
- Direct comparison with baseline
- Safe way to test in production
Considerations:
- Requires dual infrastructure
- Delayed benefit realization
- May have data consistency complexity
Shadow Deployment Monitoring:
## SHADOW DEPLOYMENT METRICS
### Prediction Comparison
- Agreement rate: [___]% (target >95%)
- Prediction distribution similarity: [___]
- Edge case handling: [___]
### Performance Metrics
- Shadow model accuracy: [___]%
- Production model accuracy: [___]%
- Difference: [___]%
### Infrastructure Metrics
- Shadow model latency: [___]ms
- Shadow model error rate: [___]%
- Resource usage: [___]
### Decision Log
- Total predictions: [___]
- Disagreements: [___]
- Disagreement analysis: [___]
- Ready to promote: YES / NO
2. Canary Deployment
Description: New model serves small percentage of traffic, gradually increasing.
When to Use:
- Risk is lower than would warrant a full shadow deployment
- Want to limit initial user exposure
- Can monitor subset effectively
- Easy rollback needed
Process:
- Deploy new model to subset (5-10% of traffic)
- Monitor performance and user experience
- Compare canary vs baseline metrics
- Gradually increase traffic if successful (20%, 50%, 100%)
- Rollback if issues detected
- Full deployment when validated
Typical Schedule:
- Day 1: 5% traffic
- Day 3: 20% traffic
- Day 5: 50% traffic
- Day 7: 100% traffic
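A common way to implement the gradual traffic increase above is a deterministic hash-based split, so each user stays on the same variant as the percentage grows. The sketch below assumes a stable string user ID; the function name and bucket scheme are illustrative rather than tied to any specific gateway or feature-flag product.
```python
# Minimal sketch of the traffic split behind the canary schedule. A stable hash
# of the user ID keeps each user on the same variant as the canary percentage is
# raised (5% -> 20% -> 50% -> 100%). Names are illustrative assumptions.
import hashlib

def assign_variant(user_id: str, canary_percent: int) -> str:
    """Deterministically route a user to 'canary' or 'baseline'."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100            # stable bucket in [0, 99]
    return "canary" if bucket < canary_percent else "baseline"

# Raising the rollout percentage moves more buckets to the canary without
# reshuffling users already assigned to it.
print(assign_variant("user-42", canary_percent=5))
print(assign_variant("user-42", canary_percent=50))
```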
Benefits:
- Limits user impact of issues
- Real user feedback
- Easy to rollback
- Gradual confidence building
Considerations:
- Need for traffic splitting
- Statistical significance with small samples
- Monitoring complexity
Canary Deployment Dashboard:
## CANARY DEPLOYMENT STATUS
### Current Status
- Deployment phase: [5% / 20% / 50% / 100%]
- Start date: [___]
- Duration at current phase: [___] days
### Performance Comparison
| Metric | Canary | Baseline | Difference | Status |
|--------|--------|----------|------------|--------|
| Accuracy | 92.3% | 92.1% | +0.2% | ✓ PASS |
| Latency | 98ms | 95ms | +3ms | ✓ PASS |
| Error rate | 0.3% | 0.4% | -0.1% | ✓ PASS |
| User satisfaction | 4.2/5 | 4.1/5 | +0.1 | ✓ PASS |
### Traffic Distribution
- Canary traffic: [___]%
- Baseline traffic: [___]%
- Total requests: [___]
### Issues
- Critical issues: [___] (max 0)
- High priority: [___] (max 2)
- Medium priority: [___]
- Low priority: [___]
### Decision
☐ Proceed to next phase
☐ Hold at current phase
☐ Rollback to baseline
**Decision maker**: _______________ Date: _______
3. Blue-Green Deployment
Description: Two identical production environments; instant switch between them.
When to Use:
- Need instant rollback capability
- Downtime not acceptable
- High confidence in new model
- Infrastructure costs acceptable
Process:
- Current production = Blue environment
- Deploy new model to Green environment
- Test Green environment thoroughly
- Switch traffic from Blue to Green instantly
- Monitor Green environment
- Keep Blue as instant rollback option
- Decommission Blue after validation period
Benefits:
- Instant rollback
- Zero downtime deployment
- Full production testing before switch
- Simple rollback process
Considerations:
- Double infrastructure cost during transition
- Database/state synchronization complexity
- Need for instant traffic switching
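Conceptually, the cutover is a single pointer flip: whatever holds the routing target (a service selector, load-balancer backend, or feature flag) is switched from Blue to Green, and switched back for rollback. The sketch below models that flip in code; the `ActiveEnvironment` class and its `switch` method are illustrative stand-ins, not a real orchestration API.
```python
# Minimal sketch of the blue-green switch as a single configuration flip.
# ActiveEnvironment stands in for whatever stores the routing pointer; the
# class and method names are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class ActiveEnvironment:
    color: str = "blue"   # current production environment

    def switch(self) -> str:
        """Flip traffic to the other environment; the old one stays warm for rollback."""
        self.color = "green" if self.color == "blue" else "blue"
        return self.color

router = ActiveEnvironment()
print(router.color)      # 'blue'  -> current production
print(router.switch())   # 'green' -> instant cutover
print(router.switch())   # 'blue'  -> instant rollback by flipping back
```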
4. Progressive Rollout (A/B Test)
Description: New model deployed to random sample of users for extended testing.
When to Use:
- Want to measure business impact
- Need statistical significance
- Multiple versions to compare
- Extended testing acceptable
Process:
- Deploy multiple model versions
- Randomly assign users to versions
- Collect business metrics
- Analyze statistical significance
- Select winning version
- Full rollout of winner
Duration: 2-6 weeks typical
A/B Test Design:
## A/B TEST DESIGN
### Hypothesis
H₀: Model B performs no better than Model A
H₁: Model B performs better than Model A
### Metrics
- **Primary**: [Conversion rate / Revenue / Engagement]
- **Secondary**: [User satisfaction, Latency, Cost]
- **Guardrail**: [Error rate, Fairness metrics]
### Sample Size
- Required sample per variant: [___]
- Expected test duration: [___] days
- Statistical power: 80%
- Significance level: α = 0.05
- Minimum detectable effect: [___]%
### Variants
- **Control (A)**: Current model (50% traffic)
- **Treatment (B)**: New model (50% traffic)
### Success Criteria
- Primary metric improvement: >[___]%
- Statistical significance: p < 0.05
- No guardrail violations
- No fairness degradation
### Results
| Metric | Control (A) | Treatment (B) | Lift | P-value | Significant? |
|--------|------------|---------------|------|---------|--------------|
| Primary | [___] | [___] | [___]% | [___] | YES/NO |
| Secondary 1 | [___] | [___] | [___]% | [___] | YES/NO |
| Secondary 2 | [___] | [___] | [___]% | [___] | YES/NO |
### Decision
☐ Deploy Treatment (B) to 100%
☐ Keep Control (A)
☐ Extended test needed
**Decision maker**: _______________ Date: _______
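The sample-size and significance entries in the template can be filled in with standard two-proportion formulas. The sketch below uses the normal approximation and a two-sided z-test (the template's H₁ is one-sided, so a one-sided test would need slightly fewer samples); the baseline rate, minimum detectable effect, and the scipy dependency are assumptions for illustration.
```python
# Minimal sketch of the A/B test sample-size and significance calculations,
# assuming a two-proportion comparison on the primary metric. Example numbers
# are illustrative only.
import math
from scipy.stats import norm

def required_sample_per_variant(p_baseline: float, mde_abs: float,
                                alpha: float = 0.05, power: float = 0.80) -> int:
    """Normal-approximation sample size per variant for a two-sided two-proportion test."""
    p_treatment = p_baseline + mde_abs
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    variance = p_baseline * (1 - p_baseline) + p_treatment * (1 - p_treatment)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)

def two_proportion_z_test(successes_a: int, n_a: int, successes_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two conversion rates."""
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (successes_b / n_b - successes_a / n_a) / se
    return 2 * (1 - norm.cdf(abs(z)))

# Example: detecting a 1-point absolute lift on a 10% baseline conversion rate
print(required_sample_per_variant(0.10, 0.01))   # roughly 14,000-15,000 per variant
print(two_proportion_z_test(1030, 10_000, 1100, 10_000))
```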
Monitoring Framework
1. Performance Monitoring
Real-Time Metrics:
| Metric | Description | Threshold | Alert Level |
|---|---|---|---|
| Accuracy | Prediction accuracy | <90% | Critical |
| Precision | Positive predictive value | <85% | High |
| Recall | True positive rate | <85% | High |
| F1-Score | Harmonic mean of P&R | <87% | High |
| AUC-ROC | Discrimination ability | <0.88 | Medium |
| Latency p50 | Median response time | >100ms | Medium |
| Latency p99 | 99th percentile latency | >500ms | High |
| Error rate | Prediction errors | >1% | Critical |
| Throughput | Requests per second | <1000 | Medium |
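The threshold table above maps naturally onto automated alerting rules. The sketch below encodes a subset of the table and checks current metric values against it; the metric names, units, and alert levels are illustrative, and a real deployment would emit to the alerting system rather than return a list.
```python
# Minimal sketch of threshold-based alerting derived from the table above.
# Metric keys, comparison directions, and values are illustrative assumptions.
THRESHOLDS = {
    # metric: (comparison, limit, alert level)
    "accuracy":    ("min", 0.90,   "critical"),
    "precision":   ("min", 0.85,   "high"),
    "recall":      ("min", 0.85,   "high"),
    "latency_p99": ("max", 500.0,  "high"),     # milliseconds
    "error_rate":  ("max", 0.01,   "critical"),
    "throughput":  ("min", 1000.0, "medium"),   # requests per second
}

def evaluate_metrics(current: dict) -> list:
    """Return (metric, level) pairs for every threshold breach."""
    alerts = []
    for metric, (direction, limit, level) in THRESHOLDS.items():
        value = current.get(metric)
        if value is None:
            continue
        breached = value < limit if direction == "min" else value > limit
        if breached:
            alerts.append((metric, level))
    return alerts

print(evaluate_metrics({"accuracy": 0.88, "latency_p99": 245.0, "error_rate": 0.003}))
# -> [('accuracy', 'critical')]
```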
Performance Dashboard Components:
## PERFORMANCE MONITORING DASHBOARD
### Real-Time Metrics (Last Hour)
- **Accuracy**: 92.3% ↑ (target >90%)
- **Latency (p50)**: 87ms ↓ (target <100ms)
- **Latency (p99)**: 245ms ↓ (target <500ms)
- **Throughput**: 1,234 req/s ↑ (target >1000)
- **Error Rate**: 0.3% ↓ (target <1%)
### Trend Analysis (Last 24 Hours)
[Line chart showing metrics over time]
### Alerts (Last 24 Hours)
- 🔴 Critical: 0
- 🟠 High: 1 (Latency spike at 14:23, resolved)
- 🟡 Medium: 3
- ⚪ Low: 7
### Predictions
- Total predictions: 1,234,567
- Successful: 1,230,456 (99.7%)
- Failed: 4,111 (0.3%)
### Model Performance by Segment
| Segment | Accuracy | Latency | Volume |
|---------|----------|---------|--------|
| Segment A | 93.1% | 82ms | 45% |
| Segment B | 91.8% | 91ms | 35% |
| Segment C | 92.5% | 95ms | 20% |
### System Health
- CPU: 45% ↓
- Memory: 62% →
- Disk: 33% ↓
- Network: Normal ✓
2. Data Quality Monitoring
Input Data Monitoring:
| Check | Description | Threshold | Frequency |
|---|---|---|---|
| Schema validation | Fields match expected schema | 100% | Real-time |
| Null rate | Missing values | <5% | Real-time |
| Range validation | Values within expected ranges | >95% | Real-time |
| Type validation | Correct data types | 100% | Real-time |
| Distribution shift | Input distribution changes | PSI <0.2 | Hourly |
| Outlier detection | Extreme values | <1% | Real-time |
| Referential integrity | Foreign keys valid | 100% | Daily |
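The real-time checks in the table can be wired into the inference path so invalid records are rejected or flagged before prediction. Below is a minimal sketch assuming dict-shaped records; the field names, types, and ranges are placeholders for a real feature schema.
```python
# Minimal sketch of real-time input validation (schema, null, type, and range
# checks). The schema and ranges are illustrative placeholders.
EXPECTED_SCHEMA = {"age": int, "income": float, "credit_score": int}
EXPECTED_RANGES = {"age": (18, 100), "income": (0.0, 1_000_000.0), "credit_score": (300, 850)}

def validate_record(record: dict) -> list:
    """Return a list of validation issues; an empty list means the record passes."""
    issues = []
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in record:                       # schema validation
            issues.append(f"missing field: {field}")
            continue
        value = record[field]
        if value is None:                             # null check
            issues.append(f"null value: {field}")
            continue
        if not isinstance(value, expected_type):      # type validation
            issues.append(f"wrong type: {field}")
            continue
        low, high = EXPECTED_RANGES[field]
        if not (low <= value <= high):                # range validation
            issues.append(f"out of range: {field}={value}")
    return issues

print(validate_record({"age": 34, "income": 52000.0, "credit_score": 900}))
# -> ['out of range: credit_score=900']
```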
Data Quality Alerts:
## DATA QUALITY MONITORING
### Real-Time Validation (Last Hour)
- Schema validation: ✓ 100% pass
- Null rate: ✓ 2.1% (target <5%)
- Range validation: ✓ 98.9% (target >95%)
- Type validation: ✓ 100% pass
- Outliers detected: ✓ 0.4% (target <1%)
### Distribution Analysis
| Feature | Training | Production | PSI | Status |
|---------|----------|------------|-----|--------|
| feature_1 | μ=45 σ=12 | μ=46 σ=13 | 0.05 | ✓ OK |
| feature_2 | μ=123 σ=34 | μ=118 σ=38 | 0.18 | ⚠ Warning |
| feature_3 | μ=89 σ=23 | μ=91 σ=24 | 0.03 | ✓ OK |
### Data Quality Issues (Last 24 Hours)
- Invalid records rejected: 127 (0.01%)
- Missing value warnings: 2,341 (0.19%)
- Outliers flagged: 456 (0.04%)
- Distribution warnings: 1 (feature_2)
### Actions Taken
- feature_2 distribution shift investigation opened
- Upstream data team notified
- Monitoring frequency increased
3. Drift Detection
Types of Drift:
Data Drift (Covariate Shift):
- Change in input data distribution
- P(X) changes, P(Y|X) stays same
- Detection: Compare training vs production distributions
Concept Drift:
- Change in relationship between X and Y
- P(Y|X) changes
- Detection: Monitor prediction performance over time
Label Drift (Prior Shift):
- Change in output distribution
- P(Y) changes
- Detection: Monitor prediction distribution
Drift Detection Methods:
| Method | Type | Use Case | Sensitivity |
|---|---|---|---|
| PSI (Population Stability Index) | Data drift | Numerical features | Medium |
| KL Divergence | Data drift | Distributions | High |
| Kolmogorov-Smirnov | Data drift | Continuous distributions | Medium |
| Chi-square test | Data drift | Categorical features | Medium |
| Performance degradation | Concept drift | Model performance | Direct |
| ADWIN | Concept drift | Streaming data | Adaptive |
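PSI is straightforward to compute directly: bin the feature on training-set quantiles, then compare bin proportions between training and production. The sketch below assumes numpy arrays for a single numerical feature; the bin count, epsilon guard, and simulated data are conventional illustrative choices.
```python
# Minimal sketch of the Population Stability Index for one numerical feature,
# assuming numpy arrays. Bin count and epsilon are conventional defaults.
import numpy as np

def population_stability_index(train: np.ndarray, prod: np.ndarray, bins: int = 10) -> float:
    """PSI between training and production distributions of one feature."""
    # Bin edges from training quantiles so each training bin holds ~1/bins of the data
    edges = np.quantile(train, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf            # catch values outside the training range
    train_frac = np.histogram(train, bins=edges)[0] / len(train)
    prod_frac = np.histogram(prod, bins=edges)[0] / len(prod)
    eps = 1e-6                                       # avoid division by zero / log(0)
    train_frac = np.clip(train_frac, eps, None)
    prod_frac = np.clip(prod_frac, eps, None)
    return float(np.sum((prod_frac - train_frac) * np.log(prod_frac / train_frac)))

rng = np.random.default_rng(0)
baseline = rng.normal(50, 10, 10_000)
shifted = rng.normal(53, 12, 10_000)                 # simulated drifted production data
print(round(population_stability_index(baseline, shifted), 3))
```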
Drift Monitoring Dashboard:
## DRIFT DETECTION DASHBOARD
### Data Drift (PSI Thresholds: <0.1 OK, 0.1-0.25 Warning, >0.25 Critical)
| Feature | PSI | Status | Trend | Action |
|---------|-----|--------|-------|--------|
| age | 0.08 | ✓ OK | Stable | None |
| income | 0.15 | ⚠ Warning | Increasing | Monitor |
| credit_score | 0.28 | 🔴 Critical | Increasing | Investigate |
| employment_years | 0.05 | ✓ OK | Stable | None |
| debt_ratio | 0.12 | ⚠ Warning | Stable | Monitor |
### Concept Drift
**Performance over Time:**
[Line chart showing accuracy, precision, recall over last 30 days]
- Week 1: Accuracy 92.5%
- Week 2: Accuracy 92.3% (-0.2%)
- Week 3: Accuracy 91.8% (-0.5%)
- Week 4: Accuracy 91.2% (-0.7%) ⚠ Below threshold
**Status**: ⚠ Potential concept drift detected
### Prediction Drift
**Prediction Distribution:**
- Training: 25% positive class
- Production (Week 1): 26% positive class (+1%)
- Production (Week 4): 31% positive class (+6%) ⚠
**Status**: ⚠ Significant shift in predictions
### Recommendations
1. 🔴 URGENT: Investigate credit_score feature drift
2. ⚠ Review model performance degradation
3. ⚠ Analyze change in prediction distribution
4. Consider model retraining
5. Review upstream data pipeline changes
4. Fairness Monitoring
Continuous Fairness Assessment:
## FAIRNESS MONITORING DASHBOARD
### Protected Groups Performance (Last 7 Days)
| Metric | Group A | Group B | Difference | Threshold | Status |
|--------|---------|---------|------------|-----------|--------|
| Accuracy | 92.1% | 91.5% | 0.6% | <5% | ✓ PASS |
| Precision | 89.3% | 88.2% | 1.1% | <5% | ✓ PASS |
| Recall | 90.5% | 89.1% | 1.4% | <5% | ✓ PASS |
| False Positive Rate | 2.3% | 2.8% | 0.5% | <2% | ✓ PASS |
| False Negative Rate | 9.5% | 10.9% | 1.4% | <5% | ✓ PASS |
### Fairness Metrics Trend (Last 30 Days)
[Line chart showing demographic parity and equal opportunity over time]
### Demographic Parity
- Group A positive rate: 28.3%
- Group B positive rate: 29.1%
- Difference: 0.8% (threshold <5%) ✓ PASS
### Equal Opportunity
- Group A TPR: 90.5%
- Group B TPR: 89.1%
- Difference: 1.4% (threshold <5%) ✓ PASS
### Alerts
- No fairness violations detected ✓
- All metrics within acceptable ranges ✓
- Monitoring continues
### Review Schedule
- Daily: Automated monitoring
- Weekly: Manual review
- Monthly: Detailed fairness audit
- Quarterly: Full bias assessment
**Next Review**: [Date]
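The demographic parity and equal opportunity figures in the dashboard reduce to simple group-wise rates. The sketch below computes both differences for a binary classifier and two protected groups; the arrays, group labels, and the 5% alert threshold are illustrative.
```python
# Minimal sketch of two group-fairness checks for a binary classifier with two
# protected groups. Data and thresholds are illustrative.
import numpy as np

def demographic_parity_difference(y_pred, group):
    """Absolute difference in positive-prediction rates between the two groups."""
    rates = [np.mean(y_pred[group == g]) for g in np.unique(group)]
    return abs(rates[0] - rates[1])

def equal_opportunity_difference(y_true, y_pred, group):
    """Absolute difference in true-positive rates (recall) between the two groups."""
    tprs = []
    for g in np.unique(group):
        mask = (group == g) & (y_true == 1)
        tprs.append(np.mean(y_pred[mask]))
    return abs(tprs[0] - tprs[1])

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 1])
group  = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])
dp = demographic_parity_difference(y_pred, group)
eo = equal_opportunity_difference(y_true, y_pred, group)
print(f"Demographic parity diff: {dp:.2f}  (alert if > 0.05)")
print(f"Equal opportunity diff:  {eo:.2f}  (alert if > 0.05)")
```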
5. Incident Response
Incident Classification:
| Severity | Description | Response Time | Examples |
|---|---|---|---|
| P0 - Critical | Service down or major data loss | <15 min | Complete system failure, data breach |
| P1 - High | Significant performance degradation | <1 hour | Accuracy drop >10%, high error rate |
| P2 - Medium | Moderate impact on users | <4 hours | Latency increase, fairness violation |
| P3 - Low | Minor issues, no user impact | <24 hours | Dashboard issues, non-critical bugs |
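The severity table can be encoded so automated alerts page with the correct urgency and response-time target. The sketch below mirrors the table's response windows; the triggering conditions in `classify_incident` are illustrative assumptions, not a prescribed rule set.
```python
# Minimal sketch mapping symptoms to the severity levels and response-time SLAs
# in the table above. Trigger conditions are illustrative assumptions.
from datetime import timedelta

RESPONSE_SLA = {
    "P0": timedelta(minutes=15),
    "P1": timedelta(hours=1),
    "P2": timedelta(hours=4),
    "P3": timedelta(hours=24),
}

def classify_incident(service_down: bool, accuracy_drop: float,
                      fairness_violation: bool, user_impact: bool) -> str:
    """Map observed symptoms to a severity level from the table above."""
    if service_down:
        return "P0"
    if accuracy_drop > 0.10:                 # e.g. accuracy fell by more than 10 points
        return "P1"
    if fairness_violation or user_impact:
        return "P2"
    return "P3"

severity = classify_incident(service_down=False, accuracy_drop=0.14,
                             fairness_violation=False, user_impact=True)
print(severity, "response due within", RESPONSE_SLA[severity])
```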
Incident Response Process:
## INCIDENT RESPONSE PROCEDURE
### 1. DETECTION
- Automated alerts trigger
- Manual detection and reporting
- User reports
### 2. TRIAGE (< 5 minutes)
- Assess severity
- Assign incident commander
- Form response team
- Open incident ticket
### 3. CONTAINMENT (Immediate)
- For P0/P1: Consider immediate rollback
- Isolate affected systems
- Prevent further impact
- Preserve evidence
### 4. INVESTIGATION (Ongoing)
- Identify root cause
- Gather logs and metrics
- Analyze patterns
- Document findings
### 5. RESOLUTION
- Implement fix
- Test thoroughly
- Deploy fix
- Verify resolution
### 6. POST-INCIDENT REVIEW (Within 1 week)
- Timeline of events
- Root cause analysis
- Impact assessment
- Lessons learned
- Action items for prevention
### 7. FOLLOW-UP
- Implement preventive measures
- Update runbooks
- Share learnings
- Update monitoring
Incident Template:
# INCIDENT REPORT
**Incident ID**: INC-2025-001
**Severity**: P1 - High
**Status**: Resolved
**Date**: 2025-12-08
**Duration**: 2 hours 15 minutes
## SUMMARY
Model accuracy dropped from 92% to 78% between 14:00-16:15 UTC.
## TIMELINE
- 14:00: Accuracy drop detected by automated monitoring
- 14:05: P1 incident declared, team assembled
- 14:15: Root cause identified: upstream data pipeline change
- 14:30: Temporary fix: rollback to previous model version
- 15:00: Upstream team notified and investigating
- 15:45: Upstream fix deployed
- 16:00: Rolled forward to new model version with the fix
- 16:15: Verification complete, incident resolved
## ROOT CAUSE
Upstream data pipeline deployed a change that modified feature encoding, causing distribution mismatch with model expectations.
## IMPACT
- Duration: 2 hours 15 minutes
- Predictions affected: ~450,000
- Estimated incorrect predictions: ~63,000 (14%)
- User impact: Moderate (degraded recommendations)
- Business impact: Estimated revenue impact $15,000
## RESOLUTION
1. Rolled back to previous model version (immediate mitigation)
2. Coordinated with upstream team on fix
3. Validated fix in staging
4. Redeployed with corrected features
## LESSONS LEARNED
- Upstream changes should trigger integration tests
- Need better feature validation at prediction time
- Rollback automation worked well
- Cross-team communication effective
## ACTION ITEMS
1. [DONE] Add feature distribution validation at inference time
2. [IN PROGRESS] Require integration tests for upstream changes
3. [PLANNED] Implement gradual rollout for upstream changes
4. [PLANNED] Add alerting for feature distribution shifts
## OWNER
John Smith (AI Engineering Lead)
Operational Procedures
Standard Operating Procedures (SOPs)
1. Daily Operations:
- Review monitoring dashboards
- Check for alerts and anomalies
- Verify data pipeline health
- Monitor resource utilization
- Review prediction logs
2. Weekly Operations:
- Performance analysis and reporting
- Fairness metrics review
- Drift analysis
- Incident review
- Capacity planning review
3. Monthly Operations:
- Comprehensive model evaluation
- Detailed fairness audit
- Security review
- Documentation updates
- Stakeholder reporting
4. Quarterly Operations:
- Full model revalidation
- Risk assessment review
- Compliance audit
- Policy review and updates
- Stakeholder presentations
Runbooks
Purpose: Step-by-step procedures for common operational tasks
Examples:
- Model deployment procedure
- Rollback procedure
- Performance investigation
- Fairness violation response
- Data quality issue resolution
- Incident escalation
Runbook Template:
# RUNBOOK: Model Rollback
## WHEN TO USE
- Critical performance degradation
- Fairness violations
- Security incidents
- Regulatory compliance issues
## PREREQUISITES
- Previous model version available
- Rollback authorization obtained (for non-P0)
- Backup of current configuration
## PROCEDURE
### 1. Pre-Rollback Checks
☐ Verify issue requires rollback
☐ Identify target version to rollback to
☐ Obtain authorization (if time permits)
☐ Notify stakeholders
### 2. Rollback Execution
```bash
# Switch traffic to previous version
kubectl set image deployment/model-api model=model:v1.0.0
# Verify rollback
kubectl rollout status deployment/model-api
```
### 3. Verification
☐ Verify previous version is serving traffic
☐ Check health endpoints
☐ Monitor performance metrics for 15 minutes
☐ Verify issue is resolved
### 4. Post-Rollback
☐ Update incident ticket
☐ Notify stakeholders of resolution
☐ Begin root cause investigation
☐ Plan forward fix
## ROLLBACK TIME
Target: < 5 minutes for automated, < 15 minutes for manual
## CONTACTS
- On-call engineer: [PagerDuty]
- Engineering lead: [Phone]
- Product owner: [Phone]
Best Practices
1. **Start Small**: Use phased deployments
2. **Monitor Closely**: Especially during first 48 hours
3. **Automate Monitoring**: Human monitoring doesn't scale
4. **Plan for Failure**: Always have rollback ready
5. **Document Everything**: Decisions, changes, incidents
6. **Test in Production**: Shadow/canary deployments
7. **Respond Quickly**: Fast detection and response
8. **Learn Continuously**: Post-incident reviews
9. **Communicate Clearly**: Keep stakeholders informed
10. **Improve Iteratively**: Continuous enhancement
Integration with ISO 42001
| Activity | ISO 42001 Controls |
|----------|-------------------|
| Deployment Planning | A.6.1 |
| User Training | A.6.2 |
| Operational Procedures | A.6.3 |
| Performance Monitoring | A.7.1 |
| Data Quality Monitoring | A.7.2 |
| Fairness Monitoring | A.7.3 |
| Incident Management | A.7.4 |
Next Steps
1. Review current deployment practices
2. Identify gaps against ISO 42001
3. Implement phased deployment strategy
4. Set up comprehensive monitoring
5. Create incident response procedures
6. Train operations teams
7. Conduct regular reviews
**Next Lesson**: Human Oversight Requirements - Implementing appropriate human oversight and control mechanisms for AI systems.