Some checks failed
Deploy to Production / Run Tests (push) Failing after 16m35s
Deploy to Production / Security Scan (push) Has been skipped
Deploy to Production / Build Docker Image (push) Has been skipped
Deploy to Production / Deploy to Staging (push) Has been skipped
Deploy to Production / Deploy to Production (push) Has been skipped
Deploy to Production / Notify Results (push) Successful in 31s
✨ 新功能: - SQLite数据库集成和持久化存储 - 数据库迁移系统和版本管理 - API分页功能和高效查询 - 用户搜索和过滤机制 - 完整的RBAC角色权限系统 - 结构化日志记录和系统监控 - API限流和多层安全防护 - Docker容器化和生产部署配置 🔒 安全特性: - JWT认证和授权 - 限流和防暴力破解 - 安全头和CORS配置 - 输入验证和XSS防护 - 审计日志和安全监控 📊 监控和运维: - Prometheus指标收集 - 健康检查和系统监控 - 自动化备份和恢复 - 完整的运维文档和脚本 - CI/CD流水线配置 🚀 部署支持: - 多环境Docker配置 - 生产环境部署指南 - 性能优化和安全加固 - 故障排除和应急响应 - 自动化运维脚本 📚 文档完善: - API使用文档 - 部署检查清单 - 运维操作手册 - 性能和安全指南 - 故障排除指南
682 lines
19 KiB
Markdown
682 lines
19 KiB
Markdown
# 运维手册
|
||
|
||
## 📖 概述
|
||
|
||
本手册提供Rust User API生产环境的日常运维指导,包括监控、维护、故障处理和最佳实践。
|
||
|
||
## 🔍 日常监控
|
||
|
||
### 1. 系统健康检查
|
||
|
||
#### 每日检查项目
|
||
```bash
|
||
#!/bin/bash
|
||
# daily-health-check.sh
|
||
|
||
echo "=== Rust User API 每日健康检查 ==="
|
||
echo "检查时间: $(date)"
|
||
echo
|
||
|
||
# 1. 服务状态检查
|
||
echo "1. 检查服务状态..."
|
||
docker-compose -f /opt/rust-api/docker-compose.prod.yml ps
|
||
|
||
# 2. 健康端点检查
|
||
echo "2. 检查健康端点..."
|
||
if curl -f -s http://localhost/health > /dev/null; then
|
||
echo "✅ 健康检查通过"
|
||
else
|
||
echo "❌ 健康检查失败"
|
||
fi
|
||
|
||
# 3. 系统资源检查
|
||
echo "3. 检查系统资源..."
|
||
echo "CPU使用率: $(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | cut -d'%' -f1)"
|
||
echo "内存使用率: $(free | grep Mem | awk '{printf("%.1f%%"), $3/$2 * 100.0}')"
|
||
echo "磁盘使用率: $(df -h / | awk 'NR==2{printf "%s", $5}')"
|
||
|
||
# 4. 数据库检查
|
||
echo "4. 检查数据库..."
|
||
DB_SIZE=$(docker-compose -f /opt/rust-api/docker-compose.prod.yml exec -T rust-user-api \
|
||
sqlite3 /app/data/production.db "SELECT page_count * page_size as size FROM pragma_page_count(), pragma_page_size();" 2>/dev/null)
|
||
echo "数据库大小: $((DB_SIZE / 1024 / 1024)) MB"
|
||
|
||
# 5. 日志检查
|
||
echo "5. 检查错误日志..."
|
||
ERROR_COUNT=$(docker-compose -f /opt/rust-api/docker-compose.prod.yml logs --since=24h rust-user-api 2>/dev/null | grep -i error | wc -l)
|
||
echo "24小时内错误数量: $ERROR_COUNT"
|
||
|
||
# 6. 证书检查
|
||
echo "6. 检查SSL证书..."
|
||
CERT_DAYS=$(openssl x509 -in /opt/rust-api/ssl/cert.pem -noout -dates | grep notAfter | cut -d= -f2)
|
||
echo "证书过期时间: $CERT_DAYS"
|
||
|
||
echo
|
||
echo "=== 检查完成 ==="
|
||
```
|
||
|
||
#### 监控指标阈值
|
||
| 指标 | 正常范围 | 警告阈值 | 严重阈值 |
|
||
|------|----------|----------|----------|
|
||
| CPU使用率 | <50% | 50-80% | >80% |
|
||
| 内存使用率 | <70% | 70-85% | >85% |
|
||
| 磁盘使用率 | <80% | 80-90% | >90% |
|
||
| 响应时间 | <200ms | 200-500ms | >500ms |
|
||
| 错误率 | <1% | 1-5% | >5% |
|
||
|
||
### 2. 性能监控
|
||
|
||
#### 关键性能指标 (KPI)
|
||
```bash
|
||
# 获取性能指标脚本
|
||
#!/bin/bash
|
||
# performance-metrics.sh
|
||
|
||
echo "=== 性能指标报告 ==="
|
||
echo "时间: $(date)"
|
||
echo
|
||
|
||
# API响应时间
|
||
echo "1. API响应时间测试..."
|
||
for endpoint in "/health" "/api/users" "/monitoring/metrics"; do
|
||
response_time=$(curl -o /dev/null -s -w "%{time_total}" "http://localhost$endpoint")
|
||
echo " $endpoint: ${response_time}s"
|
||
done
|
||
|
||
# 并发测试
|
||
echo "2. 并发性能测试..."
|
||
ab -n 100 -c 10 -q http://localhost/api/users | grep "Requests per second\|Time per request"
|
||
|
||
# 数据库性能
|
||
echo "3. 数据库性能..."
|
||
docker-compose -f /opt/rust-api/docker-compose.prod.yml exec -T rust-user-api \
|
||
sqlite3 /app/data/production.db "PRAGMA optimize; PRAGMA integrity_check;"
|
||
|
||
echo "=== 报告完成 ==="
|
||
```
|
||
|
||
### 3. 日志监控
|
||
|
||
#### 日志分析脚本
|
||
```bash
|
||
#!/bin/bash
|
||
# log-analysis.sh
|
||
|
||
LOG_FILE="/opt/rust-api/logs/app.log"
|
||
HOURS=${1:-24}
|
||
|
||
echo "=== 最近${HOURS}小时日志分析 ==="
|
||
|
||
# 错误统计
|
||
echo "1. 错误统计:"
|
||
grep -i error $LOG_FILE | tail -n 100 | awk '{print $1, $2}' | sort | uniq -c | sort -nr
|
||
|
||
# 请求统计
|
||
echo "2. 请求统计:"
|
||
grep "http_request" $LOG_FILE | tail -n 1000 | \
|
||
grep -o '"method":"[^"]*"' | sort | uniq -c | sort -nr
|
||
|
||
# 响应时间分析
|
||
echo "3. 响应时间分析:"
|
||
grep "response_time" $LOG_FILE | tail -n 100 | \
|
||
grep -o '"response_time":[0-9]*' | cut -d: -f2 | \
|
||
awk '{sum+=$1; count++} END {if(count>0) print "平均响应时间:", sum/count "ms"}'
|
||
|
||
# IP访问统计
|
||
echo "4. 访问IP统计:"
|
||
grep "remote_addr" $LOG_FILE | tail -n 1000 | \
|
||
grep -o '"remote_addr":"[^"]*"' | cut -d: -f2 | tr -d '"' | \
|
||
sort | uniq -c | sort -nr | head -10
|
||
```
|
||
|
||
## 🔧 日常维护
|
||
|
||
### 1. 数据库维护
|
||
|
||
#### 数据库优化
|
||
```bash
|
||
#!/bin/bash
|
||
# database-maintenance.sh
|
||
|
||
echo "=== 数据库维护开始 ==="
|
||
|
||
# 1. 数据库优化
|
||
echo "1. 执行数据库优化..."
|
||
docker-compose -f /opt/rust-api/docker-compose.prod.yml exec -T rust-user-api \
|
||
sqlite3 /app/data/production.db "PRAGMA optimize; VACUUM; ANALYZE;"
|
||
|
||
# 2. 检查数据库完整性
|
||
echo "2. 检查数据库完整性..."
|
||
INTEGRITY=$(docker-compose -f /opt/rust-api/docker-compose.prod.yml exec -T rust-user-api \
|
||
sqlite3 /app/data/production.db "PRAGMA integrity_check;")
|
||
|
||
if [ "$INTEGRITY" = "ok" ]; then
|
||
echo "✅ 数据库完整性检查通过"
|
||
else
|
||
echo "❌ 数据库完整性检查失败: $INTEGRITY"
|
||
fi
|
||
|
||
# 3. 统计信息
|
||
echo "3. 数据库统计信息..."
|
||
docker-compose -f /opt/rust-api/docker-compose.prod.yml exec -T rust-user-api \
|
||
sqlite3 /app/data/production.db << 'EOF'
|
||
.mode column
|
||
.headers on
|
||
SELECT 'users' as table_name, COUNT(*) as record_count FROM users;
|
||
SELECT 'user_sessions' as table_name, COUNT(*) as record_count FROM user_sessions;
|
||
EOF
|
||
|
||
echo "=== 数据库维护完成 ==="
|
||
```
|
||
|
||
#### 数据备份
|
||
```bash
|
||
#!/bin/bash
|
||
# backup-database.sh
|
||
|
||
BACKUP_DIR="/opt/rust-api/backups"
|
||
DATE=$(date +%Y%m%d-%H%M%S)
|
||
RETENTION_DAYS=30
|
||
|
||
echo "=== 数据库备份开始 ==="
|
||
|
||
# 1. 创建备份目录
|
||
mkdir -p $BACKUP_DIR
|
||
|
||
# 2. 备份数据库
|
||
echo "1. 备份数据库..."
|
||
docker-compose -f /opt/rust-api/docker-compose.prod.yml exec -T rust-user-api \
|
||
sqlite3 /app/data/production.db ".backup /app/data/backup-$DATE.db"
|
||
|
||
# 3. 复制到主机
|
||
docker cp rust-user-api-prod:/app/data/backup-$DATE.db $BACKUP_DIR/
|
||
|
||
# 4. 压缩备份
|
||
echo "2. 压缩备份文件..."
|
||
tar -czf "$BACKUP_DIR/database-backup-$DATE.tar.gz" \
|
||
-C $BACKUP_DIR backup-$DATE.db
|
||
|
||
# 5. 清理临时文件
|
||
rm -f "$BACKUP_DIR/backup-$DATE.db"
|
||
|
||
# 6. 验证备份
|
||
echo "3. 验证备份..."
|
||
if [ -f "$BACKUP_DIR/database-backup-$DATE.tar.gz" ]; then
|
||
SIZE=$(du -h "$BACKUP_DIR/database-backup-$DATE.tar.gz" | cut -f1)
|
||
echo "✅ 备份成功: database-backup-$DATE.tar.gz ($SIZE)"
|
||
else
|
||
echo "❌ 备份失败"
|
||
exit 1
|
||
fi
|
||
|
||
# 7. 清理旧备份
|
||
echo "4. 清理旧备份..."
|
||
find $BACKUP_DIR -name "database-backup-*.tar.gz" -mtime +$RETENTION_DAYS -delete
|
||
REMAINING=$(find $BACKUP_DIR -name "database-backup-*.tar.gz" | wc -l)
|
||
echo "保留备份数量: $REMAINING"
|
||
|
||
echo "=== 数据库备份完成 ==="
|
||
```
|
||
|
||
### 2. 日志管理
|
||
|
||
#### 日志轮转
|
||
```bash
|
||
#!/bin/bash
|
||
# log-rotation.sh
|
||
|
||
LOG_DIR="/opt/rust-api/logs"
|
||
MAX_SIZE="100M"
|
||
MAX_FILES=10
|
||
|
||
echo "=== 日志轮转开始 ==="
|
||
|
||
# 1. 检查日志大小
|
||
for log_file in "$LOG_DIR"/*.log; do
|
||
if [ -f "$log_file" ]; then
|
||
size=$(du -h "$log_file" | cut -f1)
|
||
echo "日志文件: $(basename $log_file) - 大小: $size"
|
||
|
||
# 如果文件大于最大大小,进行轮转
|
||
if [ $(du -m "$log_file" | cut -f1) -gt 100 ]; then
|
||
echo "轮转日志: $log_file"
|
||
|
||
# 压缩并重命名旧日志
|
||
timestamp=$(date +%Y%m%d-%H%M%S)
|
||
gzip -c "$log_file" > "${log_file}.${timestamp}.gz"
|
||
|
||
# 清空当前日志文件
|
||
> "$log_file"
|
||
|
||
echo "✅ 日志轮转完成: ${log_file}.${timestamp}.gz"
|
||
fi
|
||
fi
|
||
done
|
||
|
||
# 2. 清理旧日志
|
||
echo "2. 清理旧日志文件..."
|
||
find $LOG_DIR -name "*.log.*.gz" -mtime +30 -delete
|
||
|
||
echo "=== 日志轮转完成 ==="
|
||
```
|
||
|
||
### 3. 系统清理
|
||
|
||
#### 系统清理脚本
|
||
```bash
|
||
#!/bin/bash
|
||
# system-cleanup.sh
|
||
|
||
echo "=== 系统清理开始 ==="
|
||
|
||
# 1. Docker清理
|
||
echo "1. 清理Docker资源..."
|
||
docker system prune -f
|
||
docker volume prune -f
|
||
docker image prune -f
|
||
|
||
# 2. 清理临时文件
|
||
echo "2. 清理临时文件..."
|
||
find /tmp -type f -mtime +7 -delete 2>/dev/null || true
|
||
find /var/tmp -type f -mtime +7 -delete 2>/dev/null || true
|
||
|
||
# 3. 清理系统日志
|
||
echo "3. 清理系统日志..."
|
||
journalctl --vacuum-time=30d
|
||
journalctl --vacuum-size=1G
|
||
|
||
# 4. 更新系统包
|
||
echo "4. 更新系统包..."
|
||
apt update && apt upgrade -y
|
||
|
||
# 5. 检查磁盘空间
|
||
echo "5. 磁盘空间报告..."
|
||
df -h
|
||
|
||
echo "=== 系统清理完成 ==="
|
||
```
|
||
|
||
## 🚨 故障处理
|
||
|
||
### 1. 常见故障处理流程
|
||
|
||
#### 服务无响应
|
||
```bash
|
||
#!/bin/bash
|
||
# service-recovery.sh
|
||
|
||
echo "=== 服务恢复流程 ==="
|
||
|
||
# 1. 检查服务状态
|
||
echo "1. 检查服务状态..."
|
||
docker-compose -f /opt/rust-api/docker-compose.prod.yml ps
|
||
|
||
# 2. 检查健康端点
|
||
echo "2. 检查健康端点..."
|
||
if ! curl -f -s --max-time 10 http://localhost/health; then
|
||
echo "❌ 健康检查失败,开始恢复流程..."
|
||
|
||
# 3. 查看最近日志
|
||
echo "3. 查看最近日志..."
|
||
docker-compose -f /opt/rust-api/docker-compose.prod.yml logs --tail=50 rust-user-api
|
||
|
||
# 4. 重启服务
|
||
echo "4. 重启服务..."
|
||
docker-compose -f /opt/rust-api/docker-compose.prod.yml restart rust-user-api
|
||
|
||
# 5. 等待服务启动
|
||
echo "5. 等待服务启动..."
|
||
sleep 30
|
||
|
||
# 6. 再次检查
|
||
if curl -f -s --max-time 10 http://localhost/health; then
|
||
echo "✅ 服务恢复成功"
|
||
else
|
||
echo "❌ 服务恢复失败,需要人工介入"
|
||
exit 1
|
||
fi
|
||
else
|
||
echo "✅ 服务正常运行"
|
||
fi
|
||
|
||
echo "=== 恢复流程完成 ==="
|
||
```
|
||
|
||
#### 数据库问题处理
|
||
```bash
|
||
#!/bin/bash
|
||
# database-recovery.sh
|
||
|
||
echo "=== 数据库恢复流程 ==="
|
||
|
||
DB_PATH="/opt/rust-api/data/production.db"
|
||
BACKUP_DIR="/opt/rust-api/backups"
|
||
|
||
# 1. 检查数据库完整性
|
||
echo "1. 检查数据库完整性..."
|
||
INTEGRITY=$(docker-compose -f /opt/rust-api/docker-compose.prod.yml exec -T rust-user-api \
|
||
sqlite3 /app/data/production.db "PRAGMA integrity_check;" 2>/dev/null)
|
||
|
||
if [ "$INTEGRITY" != "ok" ]; then
|
||
echo "❌ 数据库完整性检查失败: $INTEGRITY"
|
||
|
||
# 2. 停止服务
|
||
echo "2. 停止服务..."
|
||
docker-compose -f /opt/rust-api/docker-compose.prod.yml stop rust-user-api
|
||
|
||
# 3. 备份损坏的数据库
|
||
echo "3. 备份损坏的数据库..."
|
||
cp $DB_PATH "${DB_PATH}.corrupted.$(date +%Y%m%d-%H%M%S)"
|
||
|
||
# 4. 恢复最新备份
|
||
echo "4. 恢复最新备份..."
|
||
LATEST_BACKUP=$(find $BACKUP_DIR -name "database-backup-*.tar.gz" | sort -r | head -1)
|
||
|
||
if [ -n "$LATEST_BACKUP" ]; then
|
||
echo "使用备份: $LATEST_BACKUP"
|
||
tar -xzf "$LATEST_BACKUP" -C /tmp/
|
||
cp /tmp/backup-*.db $DB_PATH
|
||
chown apiuser:apiuser $DB_PATH
|
||
echo "✅ 数据库恢复完成"
|
||
else
|
||
echo "❌ 未找到可用备份"
|
||
exit 1
|
||
fi
|
||
|
||
# 5. 重启服务
|
||
echo "5. 重启服务..."
|
||
docker-compose -f /opt/rust-api/docker-compose.prod.yml start rust-user-api
|
||
|
||
# 6. 验证恢复
|
||
sleep 30
|
||
if curl -f -s http://localhost/health; then
|
||
echo "✅ 服务恢复成功"
|
||
else
|
||
echo "❌ 服务恢复失败"
|
||
exit 1
|
||
fi
|
||
else
|
||
echo "✅ 数据库完整性正常"
|
||
fi
|
||
|
||
echo "=== 数据库恢复完成 ==="
|
||
```
|
||
|
||
### 2. 性能问题诊断
|
||
|
||
#### 性能诊断脚本
|
||
```bash
|
||
#!/bin/bash
|
||
# performance-diagnosis.sh
|
||
|
||
echo "=== 性能诊断开始 ==="
|
||
|
||
# 1. 系统资源使用
|
||
echo "1. 系统资源使用情况:"
|
||
echo "CPU: $(top -bn1 | grep "Cpu(s)" | awk '{print $2}')"
|
||
echo "内存: $(free -h | grep Mem | awk '{print $3 "/" $2}')"
|
||
echo "磁盘IO: $(iostat -x 1 1 | tail -n +4 | awk '{print $1, $10}' | grep -v '^$')"
|
||
|
||
# 2. 容器资源使用
|
||
echo "2. 容器资源使用:"
|
||
docker stats --no-stream --format "table {{.Container}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.NetIO}}\t{{.BlockIO}}"
|
||
|
||
# 3. 数据库性能
|
||
echo "3. 数据库性能分析:"
|
||
docker-compose -f /opt/rust-api/docker-compose.prod.yml exec -T rust-user-api \
|
||
sqlite3 /app/data/production.db << 'EOF'
|
||
.timer on
|
||
SELECT COUNT(*) FROM users;
|
||
SELECT COUNT(*) FROM user_sessions WHERE created_at > datetime('now', '-1 day');
|
||
EOF
|
||
|
||
# 4. 网络连接
|
||
echo "4. 网络连接统计:"
|
||
netstat -an | grep :3000 | awk '{print $6}' | sort | uniq -c
|
||
|
||
# 5. 慢查询分析
|
||
echo "5. 慢查询分析:"
|
||
grep "slow_query" /opt/rust-api/logs/app.log | tail -10
|
||
|
||
echo "=== 性能诊断完成 ==="
|
||
```
|
||
|
||
## 📊 监控和告警
|
||
|
||
### 1. 监控配置
|
||
|
||
#### Prometheus告警规则
|
||
```yaml
|
||
# /opt/rust-api/monitoring/prometheus/alert-rules.yml
|
||
groups:
|
||
- name: rust-api-alerts
|
||
rules:
|
||
- alert: HighErrorRate
|
||
expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.1
|
||
for: 5m
|
||
labels:
|
||
severity: warning
|
||
annotations:
|
||
summary: "High error rate detected"
|
||
description: "Error rate is {{ $value }} errors per second"
|
||
|
||
- alert: HighResponseTime
|
||
expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
|
||
for: 5m
|
||
labels:
|
||
severity: warning
|
||
annotations:
|
||
summary: "High response time detected"
|
||
description: "95th percentile response time is {{ $value }}s"
|
||
|
||
- alert: ServiceDown
|
||
expr: up{job="rust-user-api"} == 0
|
||
for: 1m
|
||
labels:
|
||
severity: critical
|
||
annotations:
|
||
summary: "Service is down"
|
||
description: "Rust User API service is not responding"
|
||
|
||
- alert: HighMemoryUsage
|
||
expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes > 0.85
|
||
for: 5m
|
||
labels:
|
||
severity: warning
|
||
annotations:
|
||
summary: "High memory usage"
|
||
description: "Memory usage is {{ $value | humanizePercentage }}"
|
||
|
||
- alert: HighDiskUsage
|
||
expr: (node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes > 0.85
|
||
for: 5m
|
||
labels:
|
||
severity: warning
|
||
annotations:
|
||
summary: "High disk usage"
|
||
description: "Disk usage is {{ $value | humanizePercentage }}"
|
||
```
|
||
|
||
### 2. 自动化运维
|
||
|
||
#### Crontab配置
|
||
```bash
|
||
# 添加到crontab: crontab -e
|
||
|
||
# 每小时执行健康检查
|
||
0 * * * * /opt/rust-api/scripts/daily-health-check.sh >> /var/log/health-check.log 2>&1
|
||
|
||
# 每天凌晨2点备份数据库
|
||
0 2 * * * /opt/rust-api/scripts/backup-database.sh >> /var/log/backup.log 2>&1
|
||
|
||
# 每天凌晨3点执行数据库维护
|
||
0 3 * * * /opt/rust-api/scripts/database-maintenance.sh >> /var/log/db-maintenance.log 2>&1
|
||
|
||
# 每周日凌晨4点执行系统清理
|
||
0 4 * * 0 /opt/rust-api/scripts/system-cleanup.sh >> /var/log/system-cleanup.log 2>&1
|
||
|
||
# 每天检查日志大小并轮转
|
||
0 1 * * * /opt/rust-api/scripts/log-rotation.sh >> /var/log/log-rotation.log 2>&1
|
||
|
||
# 每5分钟检查服务状态
|
||
*/5 * * * * /opt/rust-api/scripts/service-monitor.sh >> /var/log/service-monitor.log 2>&1
|
||
```
|
||
|
||
#### 服务监控脚本
|
||
```bash
|
||
#!/bin/bash
|
||
# service-monitor.sh
|
||
|
||
ALERT_EMAIL="admin@yourdomain.com"
|
||
LOG_FILE="/var/log/service-monitor.log"
|
||
|
||
# 检查服务健康状态
|
||
if ! curl -f -s --max-time 10 http://localhost/health > /dev/null; then
|
||
echo "$(date): Service health check failed" >> $LOG_FILE
|
||
|
||
# 发送告警邮件
|
||
echo "Rust User API service health check failed at $(date)" | \
|
||
mail -s "Service Alert: API Health Check Failed" $ALERT_EMAIL
|
||
|
||
# 尝试自动恢复
|
||
/opt/rust-api/scripts/service-recovery.sh >> $LOG_FILE 2>&1
|
||
fi
|
||
|
||
# 检查系统资源
|
||
CPU_USAGE=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | cut -d'%' -f1)
|
||
MEM_USAGE=$(free | grep Mem | awk '{printf("%.1f"), $3/$2 * 100.0}')
|
||
|
||
if (( $(echo "$CPU_USAGE > 80" | bc -l) )); then
|
||
echo "$(date): High CPU usage: $CPU_USAGE%" >> $LOG_FILE
|
||
fi
|
||
|
||
if (( $(echo "$MEM_USAGE > 85" | bc -l) )); then
|
||
echo "$(date): High memory usage: $MEM_USAGE%" >> $LOG_FILE
|
||
fi
|
||
```
|
||
|
||
## 📈 容量规划
|
||
|
||
### 1. 资源使用趋势分析
|
||
|
||
#### 资源统计脚本
|
||
```bash
|
||
#!/bin/bash
|
||
# resource-stats.sh
|
||
|
||
STATS_FILE="/opt/rust-api/logs/resource-stats.csv"
|
||
DATE=$(date '+%Y-%m-%d %H:%M:%S')
|
||
|
||
# 创建CSV头部(如果文件不存在)
|
||
if [ ! -f "$STATS_FILE" ]; then
|
||
echo "timestamp,cpu_usage,memory_usage,disk_usage,active_connections,response_time" > $STATS_FILE
|
||
fi
|
||
|
||
# 收集指标
|
||
CPU_USAGE=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | cut -d'%' -f1)
|
||
MEM_USAGE=$(free | grep Mem | awk '{printf("%.1f"), $3/$2 * 100.0}')
|
||
DISK_USAGE=$(df -h / | awk 'NR==2{print $5}' | cut -d'%' -f1)
|
||
CONNECTIONS=$(netstat -an | grep :3000 | grep ESTABLISHED | wc -l)
|
||
RESPONSE_TIME=$(curl -o /dev/null -s -w "%{time_total}" http://localhost/health)
|
||
|
||
# 写入CSV
|
||
echo "$DATE,$CPU_USAGE,$MEM_USAGE,$DISK_USAGE,$CONNECTIONS,$RESPONSE_TIME" >> $STATS_FILE
|
||
|
||
# 保留最近30天的数据
|
||
tail -n 43200 $STATS_FILE > ${STATS_FILE}.tmp && mv ${STATS_FILE}.tmp $STATS_FILE
|
||
```
|
||
|
||
### 2. 扩容建议
|
||
|
||
#### 扩容决策矩阵
|
||
|
||
| 指标 | 当前阈值 | 扩容建议 |
|
||
|------|----------|----------|
|
||
| CPU使用率 > 70% | 增加CPU核心或横向扩展 |
|
||
| 内存使用率 > 80% | 增加内存或优化应用 |
|
||
| 磁盘使用率 > 85% | 扩展存储或数据归档 |
|
||
| 响应时间 > 500ms | 性能优化或负载均衡 |
|
||
| 并发连接 > 1000 | 增加实例或连接池优化 |
|
||
|
||
## 📞 应急响应
|
||
|
||
### 1. 应急联系流程
|
||
|
||
#### 故障等级定义
|
||
- **P0 (严重)**: 服务完全不可用,影响所有用户
|
||
- **P1 (高)**: 核心功能不可用,影响大部分用户
|
||
- **P2 (中)**: 部分功能不可用,影响少部分用户
|
||
- **P3 (低)**: 性能问题或非核心功能问题
|
||
|
||
#### 应急响应时间
|
||
- P0: 15分钟内响应,1小时内解决
|
||
- P1: 30分钟内响应,4小时内解决
|
||
- P2: 2小时内响应,24小时内解决
|
||
- P3: 1个工作日内响应,1周内解决
|
||
|
||
### 2. 应急处理清单
|
||
|
||
#### P0级别故障处理
|
||
```bash
|
||
#!/bin/bash
|
||
# emergency-response-p0.sh
|
||
|
||
echo "=== P0级别应急响应 ==="
|
||
echo "开始时间: $(date)"
|
||
|
||
# 1. 立即通知
|
||
echo "1. 发送紧急通知..."
|
||
echo "P0 Alert: Rust User API service is down" | \
|
||
mail -s "URGENT: Service Down" admin@yourdomain.com
|
||
|
||
# 2. 快速诊断
|
||
echo "2. 快速诊断..."
|
||
docker-compose -f /opt/rust-api/docker-compose.prod.yml ps
|
||
curl -I http://localhost/health
|
||
|
||
# 3. 尝试快速恢复
|
||
echo "3. 尝试快速恢复..."
|
||
docker-compose -f /opt/rust-api/docker-compose.prod.yml restart
|
||
|
||
# 4. 验证恢复
|
||
sleep 30
|
||
if curl -f -s http://localhost/health; then
|
||
echo "✅ 服务已恢复"
|
||
echo "Service recovered at $(date)" | \
|
||
mail -s "Service Recovered" admin@yourdomain.com
|
||
else
|
||
echo "❌ 快速恢复失败,需要深度诊断"
|
||
# 启动深度诊断流程
|
||
/opt/rust-api/scripts/deep-diagnosis.sh
|
||
fi
|
||
|
||
echo "=== 应急响应完成 ==="
|
||
```
|
||
|
||
## 📚 运维最佳实践
|
||
|
||
### 1. 预防性维护
|
||
|
||
- **定期备份**: 每日自动备份,每周验证备份完整性
|
||
- **监控告警**: 设置合理的告警阈值,避免告警疲劳
|
||
- **容量规划**: 定期评估资源使用趋势,提前扩容
|
||
- **安全更新**: 及时应用安全补丁和更新
|
||
- **文档维护**: 保持运维文档的及时更新
|
||
|
||
### 2. 变更管理
|
||
|
||
- **变更审批**: 所有生产环境变更需要审批
|
||
- **测试验证**: 变更前在测试环境充分验证
|
||
- **回滚计划**: 每次变更都要有明确的回滚方案
|
||
- **变更记录**: 详细记录所有变更内容和结果
|
||
|
||
### 3. 知识管理
|
||
|
||
- **故障记录**: 详细记录每次故障的原因和解决方案
|
||
- **经验分享**: 定期分享运维经验和最佳实践
|
||
- **培训计划**: 定期进行运维技能培训
|
||
- **文档更新**: 及时更新运维手册和流程文档
|
||
|
||
---
|
||
|
||
**注意**: 本手册应根据实际运维需求定期更新和完善。所有脚本在使用前应在测试环境验证。 |