Deployment and Monitoring
Successful AI deployment requires careful planning, phased rollouts, and continuous monitoring. This lesson covers deployment strategies and operational monitoring aligned with ISO 42001.
ISO 42001 Deployment and Monitoring Requirements
Annex A.6 - Deployment and Use:
- A.6.1: Deployment planning and control
- A.6.2: User training and awareness
- A.6.3: Operational procedures
- A.6.4: Change management
Annex A.7 - Monitoring and Continual Improvement:
- A.7.1: Performance monitoring
- A.7.2: Data quality monitoring
- A.7.3: Bias and fairness monitoring
- A.7.4: Incident management
Deployment Strategy Framework
Pre-Deployment Requirements
Technical Readiness:
- Model artifacts finalized and tested
- Infrastructure provisioned and validated
- APIs developed and tested
- Security controls implemented
- Monitoring infrastructure in place
- Backup and rollback procedures ready
Governance Readiness:
- Risk assessment approved
- Deployment authorization obtained
- Compliance verified
- Documentation complete
- Training completed
- Support processes established
Pre-Deployment Checklist:
## DEPLOYMENT READINESS ASSESSMENT
### Model Readiness
☐ Model validated and approved
☐ Performance meets requirements: [___]%
☐ Fairness criteria met: [___]% max disparity
☐ Robustness tested and acceptable
☐ Model artifacts versioned: v[___]
### Infrastructure Readiness
☐ Production environment provisioned
☐ Compute resources allocated: [___]
☐ Storage configured: [___]
☐ Network connectivity verified
☐ Load balancing configured
☐ Auto-scaling rules defined
### Integration Readiness
☐ API endpoints implemented
☐ API documentation published
☐ Integration tests passed: [___]%
☐ Error handling implemented
☐ Latency requirements met: [___]ms
☐ Throughput requirements met: [___] req/s
### Security Readiness
☐ Authentication implemented (OAuth/API keys)
☐ Authorization rules configured (RBAC)
☐ Encryption enabled (in transit & at rest)
☐ Input validation implemented
☐ Rate limiting configured
☐ Security audit completed
☐ Vulnerability scan passed
### Monitoring Readiness
☐ Performance dashboards created
☐ Alerting rules configured
☐ Log aggregation configured
☐ Metrics collection enabled
☐ On-call rotation established
☐ Runbooks created
### Data Readiness
☐ Production data access configured
☐ Data quality checks automated
☐ Data pipeline tested
☐ Backup procedures in place
☐ Data retention policy implemented
### Documentation Readiness
☐ Model card finalized
☐ API documentation complete
☐ User guide published
☐ Operator manual ready
☐ Troubleshooting guide available
☐ Incident response plan documented
### Training Readiness
☐ User training completed: [___]% of users
☐ Operator training completed: [___]% of ops
☐ Support team trained
☐ FAQs prepared
☐ Help desk briefed
### Governance Readiness
☐ Deployment approval obtained from: [___]
☐ Risk assessment reviewed and accepted
☐ Compliance checklist completed
☐ Change request approved: [___]
☐ Rollback plan documented and tested
### Communication Readiness
☐ Stakeholders notified: [___]
☐ Users informed: [___]
☐ Launch communication prepared
☐ Support channels ready
☐ Escalation paths clear
### Overall Status
☐ READY FOR DEPLOYMENT
☐ CONDITIONAL APPROVAL (specify conditions)
☐ NOT READY (specify blockers)
**Approvers:**
- Technical Lead: _______________ Date: _______
- Product Owner: _______________ Date: _______
- Security Officer: _______________ Date: _______
- AI Risk Officer: _______________ Date: _______
Deployment Strategies
1. Shadow Deployment
Description: New model runs in parallel with existing system without affecting production decisions.
When to Use:
- First production deployment
- Significant model changes
- High-risk applications
- Need to validate in production environment
Process:
- Deploy new model alongside existing system
- Send same production traffic to both
- Compare predictions and performance
- Log differences for analysis
- Monitor for unexpected behavior
- Validate performance matches expectations
- Transition to active after validation period
Duration: 1-4 weeks typical
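The prediction-comparison step in the process above can be sketched as a thin wrapper around both models. This is a minimal illustration, assuming both models expose a callable predict-style interface and that requests arrive as plain dictionaries; the names `production_model`, `shadow_model`, and the logging setup are illustrative, not part of any particular serving stack.
```python
# Minimal sketch of the shadow comparison step. Assumes both models are plain
# callables on dict-shaped requests; all names here are illustrative.
import logging
from typing import Any, Callable, Dict

logger = logging.getLogger("shadow_deployment")

def shadow_compare(
    request: Dict[str, Any],
    production_model: Callable[[Dict[str, Any]], Any],
    shadow_model: Callable[[Dict[str, Any]], Any],
) -> Any:
    """Serve the production prediction; run the shadow model for comparison only."""
    prod_pred = production_model(request)          # this is what the user receives
    try:
        shadow_pred = shadow_model(request)        # shadow result is never returned
        if shadow_pred != prod_pred:
            # Log disagreements to feed the agreement-rate metric
            logger.info("shadow_disagreement request=%s prod=%s shadow=%s",
                        request, prod_pred, shadow_pred)
    except Exception:                              # shadow failures must not affect users
        logger.exception("shadow_model_error request=%s", request)
    return prod_pred
```
The key property is that the shadow path can fail or disagree without ever changing what the user receives.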
Benefits:
- Zero user impact during testing
- Real production data validation
- Direct comparison with baseline
- Safe way to test in production
Considerations:
- Requires dual infrastructure
- Delayed benefit realization
- May have data consistency complexity
Shadow Deployment Monitoring:
## SHADOW DEPLOYMENT METRICS
### Prediction Comparison
- Agreement rate: [___]% (target >95%)
- Prediction distribution similarity: [___]
- Edge case handling: [___]
### Performance Metrics
- Shadow model accuracy: [___]%
- Production model accuracy: [___]%
- Difference: [___]%
### Infrastructure Metrics
- Shadow model latency: [___]ms
- Shadow model error rate: [___]%
- Resource usage: [___]
### Decision Log
- Total predictions: [___]
- Disagreements: [___]
- Disagreement analysis: [___]
- Ready to promote: YES / NO
2. Canary Deployment
Description: New model serves small percentage of traffic, gradually increasing.
When to Use:
- Risk is lower than would warrant a full shadow deployment
- Want to limit initial user exposure
- Can monitor subset effectively
- Easy rollback needed
Process:
- Deploy new model to subset (5-10% of traffic)
- Monitor performance and user experience
- Compare canary vs baseline metrics
- Gradually increase traffic if successful (20%, 50%, 100%)
- Rollback if issues detected
- Full deployment when validated
Typical Schedule:
- Day 1: 5% traffic
- Day 3: 20% traffic
- Day 5: 50% traffic
- Day 7: 100% traffic
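A common way to implement the gradual traffic increase above is a deterministic hash-based split, so each user stays on the same variant as the percentage grows. The sketch below assumes a stable string user ID; the function name and bucket scheme are illustrative rather than tied to any specific gateway or feature-flag product.
```python
# Minimal sketch of the traffic split behind the canary schedule. A stable hash
# of the user ID keeps each user on the same variant as the canary percentage is
# raised (5% -> 20% -> 50% -> 100%). Names are illustrative assumptions.
import hashlib

def assign_variant(user_id: str, canary_percent: int) -> str:
    """Deterministically route a user to 'canary' or 'baseline'."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100            # stable bucket in [0, 99]
    return "canary" if bucket < canary_percent else "baseline"

# Raising the rollout percentage moves more buckets to the canary without
# reshuffling users already assigned to it.
print(assign_variant("user-42", canary_percent=5))
print(assign_variant("user-42", canary_percent=50))
```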
Benefits:
- Limits user impact of issues
- Real user feedback
- Easy to rollback
- Gradual confidence building
Considerations:
- Need for traffic splitting
- Statistical significance with small samples
- Monitoring complexity
Canary Deployment Dashboard:
## CANARY DEPLOYMENT STATUS
### Current Status
- Deployment phase: [5% / 20% / 50% / 100%]
- Start date: [___]
- Duration at current phase: [___] days
### Performance Comparison
| Metric | Canary | Baseline | Difference | Status |
|--------|--------|----------|------------|--------|
| Accuracy | 92.3% | 92.1% | +0.2% | ✓ PASS |
| Latency | 98ms | 95ms | +3ms | ✓ PASS |
| Error rate | 0.3% | 0.4% | -0.1% | ✓ PASS |
| User satisfaction | 4.2/5 | 4.1/5 | +0.1 | ✓ PASS |
### Traffic Distribution
- Canary traffic: [___]%
- Baseline traffic: [___]%
- Total requests: [___]
### Issues
- Critical issues: [___] (max 0)
- High priority: [___] (max 2)
- Medium priority: [___]
- Low priority: [___]
### Decision
☐ Proceed to next phase
☐ Hold at current phase
☐ Rollback to baseline
**Decision maker**: _______________ Date: _______
3. Blue-Green Deployment
Description: Two identical production environments; instant switch between them.
When to Use:
- Need instant rollback capability
- Downtime not acceptable
- High confidence in new model
- Infrastructure costs acceptable
Process:
- Current production = Blue environment
- Deploy new model to Green environment
- Test Green environment thoroughly
- Switch traffic from Blue to Green instantly
- Monitor Green environment
- Keep Blue as instant rollback option
- Decommission Blue after validation period
Benefits:
- Instant rollback
- Zero downtime deployment
- Full production testing before switch
- Simple rollback process
Considerations:
- Double infrastructure cost during transition
- Database/state synchronization complexity
- Need for instant traffic switching
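Conceptually, the cutover is a single pointer flip: whatever holds the routing target (a service selector, load-balancer backend, or feature flag) is switched from Blue to Green, and switched back for rollback. The sketch below models that flip in code; the `ActiveEnvironment` class and its `switch` method are illustrative stand-ins, not a real orchestration API.
```python
# Minimal sketch of the blue-green switch as a single configuration flip.
# ActiveEnvironment stands in for whatever stores the routing pointer; the
# class and method names are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class ActiveEnvironment:
    color: str = "blue"   # current production environment

    def switch(self) -> str:
        """Flip traffic to the other environment; the old one stays warm for rollback."""
        self.color = "green" if self.color == "blue" else "blue"
        return self.color

router = ActiveEnvironment()
print(router.color)      # 'blue'  -> current production
print(router.switch())   # 'green' -> instant cutover
print(router.switch())   # 'blue'  -> instant rollback by flipping back
```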
4. Progressive Rollout (A/B Test)
Description: New model deployed to random sample of users for extended testing.
When to Use:
- Want to measure business impact
- Need statistical significance
- Multiple versions to compare
- Extended testing acceptable
Process:
- Deploy multiple model versions
- Randomly assign users to versions
- Collect business metrics
- Analyze statistical significance
- Select winning version
- Full rollout of winner
Duration: 2-6 weeks typical
A/B Test Design:
## A/B TEST DESIGN
### Hypothesis
H₀: Model B performs no better than Model A
H₁: Model B performs better than Model A
### Metrics
- **Primary**: [Conversion rate / Revenue / Engagement]
- **Secondary**: [User satisfaction, Latency, Cost]
- **Guardrail**: [Error rate, Fairness metrics]
### Sample Size
- Required sample per variant: [___]
- Expected test duration: [___] days
- Statistical power: 80%
- Significance level: α = 0.05
- Minimum detectable effect: [___]%
### Variants
- **Control (A)**: Current model (50% traffic)
- **Treatment (B)**: New model (50% traffic)
### Success Criteria
- Primary metric improvement: >[___]%
- Statistical significance: p < 0.05
- No guardrail violations
- No fairness degradation
### Results
| Metric | Control (A) | Treatment (B) | Lift | P-value | Significant? |
|--------|------------|---------------|------|---------|--------------|
| Primary | [___] | [___] | [___]% | [___] | YES/NO |
| Secondary 1 | [___] | [___] | [___]% | [___] | YES/NO |
| Secondary 2 | [___] | [___] | [___]% | [___] | YES/NO |
### Decision
☐ Deploy Treatment (B) to 100%
☐ Keep Control (A)
☐ Extended test needed
**Decision maker**: _______________ Date: _______
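The sample-size and significance entries in the template can be filled in with standard two-proportion formulas. The sketch below uses the normal approximation and a two-sided z-test (the template's H₁ is one-sided, so a one-sided test would need slightly fewer samples); the baseline rate, minimum detectable effect, and the scipy dependency are assumptions for illustration.
```python
# Minimal sketch of the A/B test sample-size and significance calculations,
# assuming a two-proportion comparison on the primary metric. Example numbers
# are illustrative only.
import math
from scipy.stats import norm

def required_sample_per_variant(p_baseline: float, mde_abs: float,
                                alpha: float = 0.05, power: float = 0.80) -> int:
    """Normal-approximation sample size per variant for a two-sided two-proportion test."""
    p_treatment = p_baseline + mde_abs
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    variance = p_baseline * (1 - p_baseline) + p_treatment * (1 - p_treatment)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)

def two_proportion_z_test(successes_a: int, n_a: int, successes_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two conversion rates."""
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (successes_b / n_b - successes_a / n_a) / se
    return 2 * (1 - norm.cdf(abs(z)))

# Example: detecting a 1-point absolute lift on a 10% baseline conversion rate
print(required_sample_per_variant(0.10, 0.01))   # roughly 14,000-15,000 per variant
print(two_proportion_z_test(1030, 10_000, 1100, 10_000))
```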
Monitoring Framework
1. Performance Monitoring
Real-Time Metrics:
| Metric | Description | Threshold | Alert Level |
|---|---|---|---|
| Accuracy | Prediction accuracy | <90% | Critical |
| Precision | Positive predictive value | <85% | High |
| Recall | True positive rate | <85% | High |
| F1-Score | Harmonic mean of P&R | <87% | High |
| AUC-ROC | Discrimination ability | <0.88 | Medium |
| Latency p50 | Median response time | >100ms | Medium |
| Latency p99 | 99th percentile latency | >500ms | High |
| Error rate | Prediction errors | >1% | Critical |
| Throughput | Requests per second | <1000 | Medium |
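The threshold table above maps naturally onto automated alerting rules. The sketch below encodes a subset of the table and checks current metric values against it; the metric names, units, and alert levels are illustrative, and a real deployment would emit to the alerting system rather than return a list.
```python
# Minimal sketch of threshold-based alerting derived from the table above.
# Metric keys, comparison directions, and values are illustrative assumptions.
THRESHOLDS = {
    # metric: (comparison, limit, alert level)
    "accuracy":    ("min", 0.90,   "critical"),
    "precision":   ("min", 0.85,   "high"),
    "recall":      ("min", 0.85,   "high"),
    "latency_p99": ("max", 500.0,  "high"),     # milliseconds
    "error_rate":  ("max", 0.01,   "critical"),
    "throughput":  ("min", 1000.0, "medium"),   # requests per second
}

def evaluate_metrics(current: dict) -> list:
    """Return (metric, level) pairs for every threshold breach."""
    alerts = []
    for metric, (direction, limit, level) in THRESHOLDS.items():
        value = current.get(metric)
        if value is None:
            continue
        breached = value < limit if direction == "min" else value > limit
        if breached:
            alerts.append((metric, level))
    return alerts

print(evaluate_metrics({"accuracy": 0.88, "latency_p99": 245.0, "error_rate": 0.003}))
# -> [('accuracy', 'critical')]
```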
Performance Dashboard Components:
## PERFORMANCE MONITORING DASHBOARD
### Real-Time Metrics (Last Hour)
- **Accuracy**: 92.3% ↑ (target >90%)
- **Latency (p50)**: 87ms ↓ (target <100ms)
- **Latency (p99)**: 245ms ↓ (target <500ms)
- **Throughput**: 1,234 req/s ↑ (target >1000)
- **Error Rate**: 0.3% ↓ (target <1%)
### Trend Analysis (Last 24 Hours)
[Line chart showing metrics over time]
### Alerts (Last 24 Hours)
- 🔴 Critical: 0
- 🟠 High: 1 (Latency spike at 14:23, resolved)
- 🟡 Medium: 3
- ⚪ Low: 7
### Predictions
- Total predictions: 1,234,567
- Successful: 1,230,456 (99.7%)
- Failed: 4,111 (0.3%)
### Model Performance by Segment
| Segment | Accuracy | Latency | Volume |
|---------|----------|---------|--------|
| Segment A | 93.1% | 82ms | 45% |
| Segment B | 91.8% | 91ms | 35% |
| Segment C | 92.5% | 95ms | 20% |
### System Health
- CPU: 45% ↓
- Memory: 62% →
- Disk: 33% ↓
- Network: Normal ✓
2. Data Quality Monitoring
Input Data Monitoring:
| Check | Description | Threshold | Frequency |
|---|---|---|---|
| Schema validation | Fields match expected schema | 100% | Real-time |
| Null rate | Missing values | <5% | Real-time |
| Range validation | Values within expected ranges | >95% | Real-time |
| Type validation | Correct data types | 100% | Real-time |
| Distribution shift | Input distribution changes | PSI <0.2 | Hourly |
| Outlier detection | Extreme values | <1% | Real-time |
| Referential integrity | Foreign keys valid | 100% | Daily |
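The real-time checks in the table can be wired into the inference path so invalid records are rejected or flagged before prediction. Below is a minimal sketch assuming dict-shaped records; the field names, types, and ranges are placeholders for a real feature schema.
```python
# Minimal sketch of real-time input validation (schema, null, type, and range
# checks). The schema and ranges are illustrative placeholders.
EXPECTED_SCHEMA = {"age": int, "income": float, "credit_score": int}
EXPECTED_RANGES = {"age": (18, 100), "income": (0.0, 1_000_000.0), "credit_score": (300, 850)}

def validate_record(record: dict) -> list:
    """Return a list of validation issues; an empty list means the record passes."""
    issues = []
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in record:                       # schema validation
            issues.append(f"missing field: {field}")
            continue
        value = record[field]
        if value is None:                             # null check
            issues.append(f"null value: {field}")
            continue
        if not isinstance(value, expected_type):      # type validation
            issues.append(f"wrong type: {field}")
            continue
        low, high = EXPECTED_RANGES[field]
        if not (low <= value <= high):                # range validation
            issues.append(f"out of range: {field}={value}")
    return issues

print(validate_record({"age": 34, "income": 52000.0, "credit_score": 900}))
# -> ['out of range: credit_score=900']
```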
Data Quality Alerts:
## DATA QUALITY MONITORING
### Real-Time Validation (Last Hour)
- Schema validation: ✓ 100% pass
- Null rate: ✓ 2.1% (target <5%)
- Range validation: ✓ 98.9% (target >95%)
- Type validation: ✓ 100% pass
- Outliers detected: ✓ 0.4% (target <1%)
### Distribution Analysis
| Feature | Training | Production | PSI | Status |
|---------|----------|------------|-----|--------|
| feature_1 | μ=45 σ=12 | μ=46 σ=13 | 0.05 | ✓ OK |
| feature_2 | μ=123 σ=34 | μ=118 σ=38 | 0.18 | ⚠ Warning |
| feature_3 | μ=89 σ=23 | μ=91 σ=24 | 0.03 | ✓ OK |
### Data Quality Issues (Last 24 Hours)
- Invalid records rejected: 127 (0.01%)
- Missing value warnings: 2,341 (0.19%)
- Outliers flagged: 456 (0.04%)
- Distribution warnings: 1 (feature_2)
### Actions Taken
- feature_2 distribution shift investigation opened
- Upstream data team notified
- Monitoring frequency increased
3. Drift Detection
Types of Drift:
Data Drift (Covariate Shift):
- Change in input data distribution
- P(X) changes, P(Y|X) stays same
- Detection: Compare training vs production distributions
Concept Drift:
- Change in relationship between X and Y
- P(Y|X) changes
- Detection: Monitor prediction performance over time
Label Drift (Prior Shift):
- Change in output distribution
- P(Y) changes
- Detection: Monitor prediction distribution
Drift Detection Methods:
| Method | Type | Use Case | Sensitivity |
|---|---|---|---|
| PSI (Population Stability Index) | Data drift | Numerical features | Medium |
| KL Divergence | Data drift | Distributions | High |
| Kolmogorov-Smirnov | Data drift | Continuous distributions | Medium |
| Chi-square test | Data drift | Categorical features | Medium |
| Performance degradation | Concept drift | Model performance | Direct |
| ADWIN | Concept drift | Streaming data | Adaptive |
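PSI is straightforward to compute directly: bin the feature on training-set quantiles, then compare bin proportions between training and production. The sketch below assumes numpy arrays for a single numerical feature; the bin count, epsilon guard, and simulated data are conventional illustrative choices.
```python
# Minimal sketch of the Population Stability Index for one numerical feature,
# assuming numpy arrays. Bin count and epsilon are conventional defaults.
import numpy as np

def population_stability_index(train: np.ndarray, prod: np.ndarray, bins: int = 10) -> float:
    """PSI between training and production distributions of one feature."""
    # Bin edges from training quantiles so each training bin holds ~1/bins of the data
    edges = np.quantile(train, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf            # catch values outside the training range
    train_frac = np.histogram(train, bins=edges)[0] / len(train)
    prod_frac = np.histogram(prod, bins=edges)[0] / len(prod)
    eps = 1e-6                                       # avoid division by zero / log(0)
    train_frac = np.clip(train_frac, eps, None)
    prod_frac = np.clip(prod_frac, eps, None)
    return float(np.sum((prod_frac - train_frac) * np.log(prod_frac / train_frac)))

rng = np.random.default_rng(0)
baseline = rng.normal(50, 10, 10_000)
shifted = rng.normal(53, 12, 10_000)                 # simulated drifted production data
print(round(population_stability_index(baseline, shifted), 3))
```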
Drift Monitoring Dashboard:
## DRIFT DETECTION DASHBOARD
### Data Drift (PSI Thresholds: <0.1 OK, 0.1-0.25 Warning, >0.25 Critical)
| Feature | PSI | Status | Trend | Action |
|---------|-----|--------|-------|--------|
| age | 0.08 | ✓ OK | Stable | None |
| income | 0.15 | ⚠ Warning | Increasing | Monitor |
| credit_score | 0.28 | 🔴 Critical | Increasing | Investigate |
| employment_years | 0.05 | ✓ OK | Stable | None |
| debt_ratio | 0.12 | ⚠ Warning | Stable | Monitor |
### Concept Drift
**Performance over Time:**
[Line chart showing accuracy, precision, recall over last 30 days]
- Week 1: Accuracy 92.5%
- Week 2: Accuracy 92.3% (-0.2%)
- Week 3: Accuracy 91.8% (-0.5%)
- Week 4: Accuracy 91.2% (-0.7%) ⚠ Below threshold
**Status**: ⚠ Potential concept drift detected
### Prediction Drift
**Prediction Distribution:**
- Training: 25% positive class
- Production (Week 1): 26% positive class (+1%)
- Production (Week 4): 31% positive class (+6%) ⚠
**Status**: ⚠ Significant shift in predictions
### Recommendations
1. 🔴 URGENT: Investigate credit_score feature drift
2. ⚠ Review model performance degradation
3. ⚠ Analyze change in prediction distribution
4. Consider model retraining
5. Review upstream data pipeline changes
4. Fairness Monitoring
Continuous Fairness Assessment:
## FAIRNESS MONITORING DASHBOARD
### Protected Groups Performance (Last 7 Days)
| Metric | Group A | Group B | Difference | Threshold | Status |
|--------|---------|---------|------------|-----------|--------|
| Accuracy | 92.1% | 91.5% | 0.6% | <5% | ✓ PASS |
| Precision | 89.3% | 88.2% | 1.1% | <5% | ✓ PASS |
| Recall | 90.5% | 89.1% | 1.4% | <5% | ✓ PASS |
| False Positive Rate | 2.3% | 2.8% | 0.5% | <2% | ✓ PASS |
| False Negative Rate | 9.5% | 10.9% | 1.4% | <5% | ✓ PASS |
### Fairness Metrics Trend (Last 30 Days)
[Line chart showing demographic parity and equal opportunity over time]
### Demographic Parity
- Group A positive rate: 28.3%
- Group B positive rate: 29.1%
- Difference: 0.8% (threshold <5%) ✓ PASS
### Equal Opportunity
- Group A TPR: 90.5%
- Group B TPR: 89.1%
- Difference: 1.4% (threshold <5%) ✓ PASS
### Alerts
- No fairness violations detected ✓
- All metrics within acceptable ranges ✓
- Monitoring continues
### Review Schedule
- Daily: Automated monitoring
- Weekly: Manual review
- Monthly: Detailed fairness audit
- Quarterly: Full bias assessment
**Next Review**: [Date]
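The demographic parity and equal opportunity figures in the dashboard reduce to simple group-wise rates. The sketch below computes both differences for a binary classifier and two protected groups; the arrays, group labels, and the 5% alert threshold are illustrative.
```python
# Minimal sketch of two group-fairness checks for a binary classifier with two
# protected groups. Data and thresholds are illustrative.
import numpy as np

def demographic_parity_difference(y_pred, group):
    """Absolute difference in positive-prediction rates between the two groups."""
    rates = [np.mean(y_pred[group == g]) for g in np.unique(group)]
    return abs(rates[0] - rates[1])

def equal_opportunity_difference(y_true, y_pred, group):
    """Absolute difference in true-positive rates (recall) between the two groups."""
    tprs = []
    for g in np.unique(group):
        mask = (group == g) & (y_true == 1)
        tprs.append(np.mean(y_pred[mask]))
    return abs(tprs[0] - tprs[1])

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 1])
group  = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])
dp = demographic_parity_difference(y_pred, group)
eo = equal_opportunity_difference(y_true, y_pred, group)
print(f"Demographic parity diff: {dp:.2f}  (alert if > 0.05)")
print(f"Equal opportunity diff:  {eo:.2f}  (alert if > 0.05)")
```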
5. Incident Response
Incident Classification:
| Severity | Description | Response Time | Examples |
|---|---|---|---|
| P0 - Critical | Service down or major data loss | <15 min | Complete system failure, data breach |
| P1 - High | Significant performance degradation | <1 hour | Accuracy drop >10%, high error rate |
| P2 - Medium | Moderate impact on users | <4 hours | Latency increase, fairness violation |
| P3 - Low | Minor issues, no user impact | <24 hours | Dashboard issues, non-critical bugs |
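The severity table can be encoded so automated alerts page with the correct urgency and response-time target. The sketch below mirrors the table's response windows; the triggering conditions in `classify_incident` are illustrative assumptions, not a prescribed rule set.
```python
# Minimal sketch mapping symptoms to the severity levels and response-time SLAs
# in the table above. Trigger conditions are illustrative assumptions.
from datetime import timedelta

RESPONSE_SLA = {
    "P0": timedelta(minutes=15),
    "P1": timedelta(hours=1),
    "P2": timedelta(hours=4),
    "P3": timedelta(hours=24),
}

def classify_incident(service_down: bool, accuracy_drop: float,
                      fairness_violation: bool, user_impact: bool) -> str:
    """Map observed symptoms to a severity level from the table above."""
    if service_down:
        return "P0"
    if accuracy_drop > 0.10:                 # e.g. accuracy fell by more than 10 points
        return "P1"
    if fairness_violation or user_impact:
        return "P2"
    return "P3"

severity = classify_incident(service_down=False, accuracy_drop=0.14,
                             fairness_violation=False, user_impact=True)
print(severity, "response due within", RESPONSE_SLA[severity])
```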
Incident Response Process:
## INCIDENT RESPONSE PROCEDURE
### 1. DETECTION
- Automated alerts trigger
- Manual detection and reporting
- User reports
### 2. TRIAGE (< 5 minutes)
- Assess severity
- Assign incident commander
- Form response team
- Open incident ticket
### 3. CONTAINMENT (Immediate)
- For P0/P1: Consider immediate rollback
- Isolate affected systems
- Prevent further impact
- Preserve evidence
### 4. INVESTIGATION (Ongoing)
- Identify root cause
- Gather logs and metrics
- Analyze patterns
- Document findings
### 5. RESOLUTION
- Implement fix
- Test thoroughly
- Deploy fix
- Verify resolution
### 6. POST-INCIDENT REVIEW (Within 1 week)
- Timeline of events
- Root cause analysis
- Impact assessment
- Lessons learned
- Action items for prevention
### 7. FOLLOW-UP
- Implement preventive measures
- Update runbooks
- Share learnings
- Update monitoring
Incident Template:
# INCIDENT REPORT
**Incident ID**: INC-2025-001
**Severity**: P1 - High
**Status**: Resolved
**Date**: 2025-12-08
**Duration**: 2 hours 15 minutes
## SUMMARY
Model accuracy dropped from 92% to 78% between 14:00-16:15 UTC.
## TIMELINE
- 14:00: Accuracy drop detected by automated monitoring
- 14:05: P1 incident declared, team assembled
- 14:15: Root cause identified: upstream data pipeline change
- 14:30: Temporary fix: rollback to previous model version
- 15:00: Upstream team notified and investigating
- 15:45: Upstream fix deployed
- 16:00: Rolled forward to new model version with the fix
- 16:15: Verification complete, incident resolved
## ROOT CAUSE
Upstream data pipeline deployed a change that modified feature encoding, causing distribution mismatch with model expectations.
## IMPACT
- Duration: 2 hours 15 minutes
- Predictions affected: ~450,000
- Estimated incorrect predictions: ~63,000 (14%)
- User impact: Moderate (degraded recommendations)
- Business impact: Estimated revenue impact $15,000
## RESOLUTION
1. Rolled back to previous model version (immediate mitigation)
2. Coordinated with upstream team on fix
3. Validated fix in staging
4. Redeployed with corrected features
## LESSONS LEARNED
- Upstream changes should trigger integration tests
- Need better feature validation at prediction time
- Rollback automation worked well
- Cross-team communication effective
## ACTION ITEMS
1. [DONE] Add feature distribution validation at inference time
2. [IN PROGRESS] Require integration tests for upstream changes
3. [PLANNED] Implement gradual rollout for upstream changes
4. [PLANNED] Add alerting for feature distribution shifts
## OWNER
John Smith (AI Engineering Lead)
Operational Procedures
Standard Operating Procedures (SOPs)
1. Daily Operations:
- Review monitoring dashboards
- Check for alerts and anomalies
- Verify data pipeline health
- Monitor resource utilization
- Review prediction logs
2. Weekly Operations:
- Performance analysis and reporting
- Fairness metrics review
- Drift analysis
- Incident review
- Capacity planning review
3. Monthly Operations:
- Comprehensive model evaluation
- Detailed fairness audit
- Security review
- Documentation updates
- Stakeholder reporting
4. Quarterly Operations:
- Full model revalidation
- Risk assessment review
- Compliance audit
- Policy review and updates
- Stakeholder presentations
Runbooks
Purpose: Step-by-step procedures for common operational tasks
Examples:
- Model deployment procedure
- Rollback procedure
- Performance investigation
- Fairness violation response
- Data quality issue resolution
- Incident escalation
Runbook Template:
# RUNBOOK: Model Rollback
## WHEN TO USE
- Critical performance degradation
- Fairness violations
- Security incidents
- Regulatory compliance issues
## PREREQUISITES
- Previous model version available
- Rollback authorization obtained (for non-P0)
- Backup of current configuration
## PROCEDURE
### 1. Pre-Rollback Checks
☐ Verify issue requires rollback
☐ Identify target version to rollback to
☐ Obtain authorization (if time permits)
☐ Notify stakeholders
### 2. Rollback Execution
```bash
# Switch traffic to previous version
kubectl set image deployment/model-api model=model:v1.0.0
# Verify rollback
kubectl rollout status deployment/model-api
```
### 3. Verification
☐ Verify previous version is serving traffic
☐ Check health endpoints
☐ Monitor performance metrics for 15 minutes
☐ Verify issue is resolved
### 4. Post-Rollback
☐ Update incident ticket
☐ Notify stakeholders of resolution
☐ Begin root cause investigation
☐ Plan forward fix
## ROLLBACK TIME
Target: < 5 minutes for automated, < 15 minutes for manual
## CONTACTS
- On-call engineer: [PagerDuty]
- Engineering lead: [Phone]
- Product owner: [Phone]
Best Practices
1. **Start Small**: Use phased deployments
2. **Monitor Closely**: Especially during first 48 hours
3. **Automate Monitoring**: Human monitoring doesn't scale
4. **Plan for Failure**: Always have rollback ready
5. **Document Everything**: Decisions, changes, incidents
6. **Test in Production**: Shadow/canary deployments
7. **Respond Quickly**: Fast detection and response
8. **Learn Continuously**: Post-incident reviews
9. **Communicate Clearly**: Keep stakeholders informed
10. **Improve Iteratively**: Continuous enhancement
Integration with ISO 42001
| Activity | ISO 42001 Controls |
|----------|-------------------|
| Deployment Planning | A.6.1 |
| User Training | A.6.2 |
| Operational Procedures | A.6.3 |
| Performance Monitoring | A.7.1 |
| Data Quality Monitoring | A.7.2 |
| Fairness Monitoring | A.7.3 |
| Incident Management | A.7.4 |
Next Steps
1. Review current deployment practices
2. Identify gaps against ISO 42001
3. Implement phased deployment strategy
4. Set up comprehensive monitoring
5. Create incident response procedures
6. Train operations teams
7. Conduct regular reviews
**Next Lesson**: Human Oversight Requirements - Implementing appropriate human oversight and control mechanisms for AI systems.