# Operations Handbook

## 📖 Overview

This handbook provides day-to-day operations guidance for the Rust User API production environment, covering monitoring, maintenance, incident handling, and best practices.

## 🔍 Routine Monitoring

### 1. System health checks

#### Daily checklist

```bash
#!/bin/bash
# daily-health-check.sh

echo "=== Rust User API 每日健康检查 ==="
echo "检查时间: $(date)"
echo

# 1. Check service status
echo "1. 检查服务状态..."
docker-compose -f /opt/rust-api/docker-compose.prod.yml ps

# 2. Check the health endpoint
echo "2. 检查健康端点..."
if curl -f -s http://localhost/health > /dev/null; then
    echo "✅ 健康检查通过"
else
    echo "❌ 健康检查失败"
fi

# 3. Check system resources
echo "3. 检查系统资源..."
echo "CPU使用率: $(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | cut -d'%' -f1)"
echo "内存使用率: $(free | grep Mem | awk '{printf "%.1f%%", $3/$2 * 100.0}')"
echo "磁盘使用率: $(df -h / | awk 'NR==2{printf "%s", $5}')"

# 4. Check the database (size in bytes = page_count * page_size)
echo "4. 检查数据库..."
DB_SIZE=$(docker-compose -f /opt/rust-api/docker-compose.prod.yml exec -T rust-user-api \
    sqlite3 /app/data/production.db "SELECT page_count * page_size as size FROM pragma_page_count(), pragma_page_size();" 2>/dev/null)
echo "数据库大小: $((DB_SIZE / 1024 / 1024)) MB"

# 5. Check the logs for errors
echo "5. 检查错误日志..."
ERROR_COUNT=$(docker-compose -f /opt/rust-api/docker-compose.prod.yml logs --since=24h rust-user-api 2>/dev/null | grep -ci error)
echo "24小时内错误数量: $ERROR_COUNT"

# 6. Check the certificate (prints the notAfter date)
echo "6. 检查SSL证书..."
CERT_EXPIRY=$(openssl x509 -in /opt/rust-api/ssl/cert.pem -noout -dates | grep notAfter | cut -d= -f2)
echo "证书过期时间: $CERT_EXPIRY"

echo
echo "=== 检查完成 ==="
```

#### Monitoring thresholds

| Metric | Normal range | Warning threshold | Critical threshold |
|------|----------|----------|----------|
| CPU usage | <50% | 50-80% | >80% |
| Memory usage | <70% | 70-85% | >85% |
| Disk usage | <80% | 80-90% | >90% |
| Response time | <200ms | 200-500ms | >500ms |
| Error rate | <1% | 1-5% | >5% |

### 2. Performance monitoring

#### Key performance indicators (KPI)

```bash
#!/bin/bash
# performance-metrics.sh
# Script that collects performance metrics

echo "=== 性能指标报告 ==="
echo "时间: $(date)"
echo

# API response time
echo "1. API响应时间测试..."
for endpoint in "/health" "/api/users" "/monitoring/metrics"; do
    response_time=$(curl -o /dev/null -s -w "%{time_total}" "http://localhost$endpoint")
    echo "  $endpoint: ${response_time}s"
done

# Concurrency test
echo "2. 并发性能测试..."
ab -n 100 -c 10 -q http://localhost/api/users | grep "Requests per second\|Time per request"

# Database performance
echo "3. 数据库性能..."
docker-compose -f /opt/rust-api/docker-compose.prod.yml exec -T rust-user-api \
    sqlite3 /app/data/production.db "PRAGMA optimize; PRAGMA integrity_check;"

echo "=== 报告完成 ==="
```

### 3. Log monitoring

#### Log analysis script

```bash
#!/bin/bash
# log-analysis.sh

LOG_FILE="/opt/rust-api/logs/app.log"
HOURS=${1:-24}

# Note: HOURS only labels the report; the sections below sample the last
# N matching lines of the log rather than filtering by timestamp.
echo "=== 最近${HOURS}小时日志分析 ==="

# Error statistics
echo "1. 错误统计:"
grep -i error "$LOG_FILE" | tail -n 100 | awk '{print $1, $2}' | sort | uniq -c | sort -nr

# Request statistics
echo "2. 请求统计:"
grep "http_request" "$LOG_FILE" | tail -n 1000 | \
    grep -o '"method":"[^"]*"' | sort | uniq -c | sort -nr

# Response time analysis
echo "3. 响应时间分析:"
grep "response_time" "$LOG_FILE" | tail -n 100 | \
    grep -o '"response_time":[0-9]*' | cut -d: -f2 | \
    awk '{sum+=$1; count++} END {if(count>0) print "平均响应时间:", sum/count "ms"}'

# Client IP statistics (-f2- keeps IPv6 addresses, which contain colons, intact)
echo "4. 访问IP统计:"
grep "remote_addr" "$LOG_FILE" | tail -n 1000 | \
    grep -o '"remote_addr":"[^"]*"' | cut -d: -f2- | tr -d '"' | \
    sort | uniq -c | sort -nr | head -10
```

## 🔧 Routine Maintenance

### 1. Database maintenance

#### Database optimization

```bash
#!/bin/bash
# database-maintenance.sh

echo "=== 数据库维护开始 ==="

# 1. Optimize the database
echo "1. 执行数据库优化..."
docker-compose -f /opt/rust-api/docker-compose.prod.yml exec -T rust-user-api \
    sqlite3 /app/data/production.db "PRAGMA optimize; VACUUM; ANALYZE;"

# 2. Check database integrity
echo "2. 检查数据库完整性..."
INTEGRITY=$(docker-compose -f /opt/rust-api/docker-compose.prod.yml exec -T rust-user-api \
    sqlite3 /app/data/production.db "PRAGMA integrity_check;")

if [ "$INTEGRITY" = "ok" ]; then
    echo "✅ 数据库完整性检查通过"
else
    echo "❌ 数据库完整性检查失败: $INTEGRITY"
fi

# 3. Statistics
echo "3. 数据库统计信息..."
docker-compose -f /opt/rust-api/docker-compose.prod.yml exec -T rust-user-api \
    sqlite3 /app/data/production.db << 'EOF'
.mode column
.headers on
SELECT 'users' as table_name, COUNT(*) as record_count FROM users;
SELECT 'user_sessions' as table_name, COUNT(*) as record_count FROM user_sessions;
EOF

echo "=== 数据库维护完成 ==="
```

#### Data backup

```bash
#!/bin/bash
# backup-database.sh

# Stop on the first failed step so a broken backup is never reported as a success
set -euo pipefail
# Backups contain user data: create files readable by the owner only
umask 077

BACKUP_DIR="/opt/rust-api/backups"
DATE=$(date +%Y%m%d-%H%M%S)
RETENTION_DAYS=30

echo "=== 数据库备份开始 ==="

# 1. Create the backup directory
mkdir -p "$BACKUP_DIR"

# 2. Back up the database
echo "1. 备份数据库..."
docker-compose -f /opt/rust-api/docker-compose.prod.yml exec -T rust-user-api \
    sqlite3 /app/data/production.db ".backup /app/data/backup-$DATE.db"

# 3. Copy to the host, then remove the copy left inside the container
docker cp "rust-user-api-prod:/app/data/backup-$DATE.db" "$BACKUP_DIR/"
docker-compose -f /opt/rust-api/docker-compose.prod.yml exec -T rust-user-api \
    rm -f "/app/data/backup-$DATE.db"

# 4. Compress the backup
echo "2. 压缩备份文件..."
tar -czf "$BACKUP_DIR/database-backup-$DATE.tar.gz" \
    -C "$BACKUP_DIR" "backup-$DATE.db"

# 5. Remove the temporary file
rm -f "$BACKUP_DIR/backup-$DATE.db"

# 6. Verify the backup (existence and size only)
echo "3. 验证备份..."
if [ -f "$BACKUP_DIR/database-backup-$DATE.tar.gz" ]; then
    SIZE=$(du -h "$BACKUP_DIR/database-backup-$DATE.tar.gz" | cut -f1)
    echo "✅ 备份成功: database-backup-$DATE.tar.gz ($SIZE)"
else
    echo "❌ 备份失败"
    exit 1
fi

# 7. Remove old backups
echo "4. 清理旧备份..."
find "$BACKUP_DIR" -name "database-backup-*.tar.gz" -mtime +"$RETENTION_DAYS" -delete
REMAINING=$(find "$BACKUP_DIR" -name "database-backup-*.tar.gz" | wc -l)
echo "保留备份数量: $REMAINING"

echo "=== 数据库备份完成 ==="
```

### 2. Log management

#### Log rotation

```bash
#!/bin/bash
# log-rotation.sh

LOG_DIR="/opt/rust-api/logs"
MAX_SIZE_MB=100
RETENTION_DAYS=30

echo "=== 日志轮转开始 ==="

# 1. Check log sizes
for log_file in "$LOG_DIR"/*.log; do
    if [ -f "$log_file" ]; then
        size=$(du -h "$log_file" | cut -f1)
        echo "日志文件: $(basename "$log_file") - 大小: $size"

        # Rotate the file if it is larger than the maximum size
        if [ "$(du -m "$log_file" | cut -f1)" -gt "$MAX_SIZE_MB" ]; then
            echo "轮转日志: $log_file"

            # Compress the old log under a timestamped name
            timestamp=$(date +%Y%m%d-%H%M%S)
            gzip -c "$log_file" > "${log_file}.${timestamp}.gz"

            # Truncate the current log file (lines written between the
            # copy and the truncation are lost)
            : > "$log_file"

            echo "✅ 日志轮转完成: ${log_file}.${timestamp}.gz"
        fi
    fi
done

# 2. Remove old logs
echo "2. 清理旧日志文件..."
find "$LOG_DIR" -name "*.log.*.gz" -mtime +"$RETENTION_DAYS" -delete

echo "=== 日志轮转完成 ==="
```

### 3. System cleanup

#### System cleanup script

```bash
#!/bin/bash
# system-cleanup.sh

echo "=== 系统清理开始 ==="

# 1. Docker cleanup
# Caution: `docker system prune` removes stopped containers, after which
# `docker volume prune` can delete volumes they used. Confirm the API
# container is running before this step.
echo "1. 清理Docker资源..."
docker system prune -f
docker volume prune -f
docker image prune -f

# 2. Remove temporary files
echo "2. 清理临时文件..."
find /tmp -type f -mtime +7 -delete 2>/dev/null || true
find /var/tmp -type f -mtime +7 -delete 2>/dev/null || true

# 3. Clean up system logs
echo "3. 清理系统日志..."
journalctl --vacuum-time=30d
journalctl --vacuum-size=1G

# 4. Update system packages
# This runs unattended from cron; package upgrades may restart Docker
# and with it the API containers.
echo "4. 更新系统包..."
apt update && apt upgrade -y

# 5. Check disk space
echo "5. 磁盘空间报告..."
df -h

echo "=== 系统清理完成 ==="
```

## 🚨 Incident Handling

### 1. Common failure handling procedures

#### Service not responding

```bash
#!/bin/bash
# service-recovery.sh

echo "=== 服务恢复流程 ==="

# 1. Check service status
echo "1. 检查服务状态..."
docker-compose -f /opt/rust-api/docker-compose.prod.yml ps

# 2. Check the health endpoint
echo "2. 检查健康端点..."
if ! curl -f -s --max-time 10 http://localhost/health; then
    echo "❌ 健康检查失败,开始恢复流程..."

    # 3. Show recent logs
    echo "3. 查看最近日志..."
    docker-compose -f /opt/rust-api/docker-compose.prod.yml logs --tail=50 rust-user-api

    # 4. Restart the service
    echo "4. 重启服务..."
    docker-compose -f /opt/rust-api/docker-compose.prod.yml restart rust-user-api

    # 5. Wait for the service to start
    echo "5. 等待服务启动..."
    sleep 30

    # 6. Check again
    if curl -f -s --max-time 10 http://localhost/health; then
        echo "✅ 服务恢复成功"
    else
        echo "❌ 服务恢复失败,需要人工介入"
        exit 1
    fi
else
    echo "✅ 服务正常运行"
fi

echo "=== 恢复流程完成 ==="
```

#### Handling database problems

```bash
#!/bin/bash
# database-recovery.sh

echo "=== 数据库恢复流程 ==="

DB_PATH="/opt/rust-api/data/production.db"
BACKUP_DIR="/opt/rust-api/backups"

# 1. Check database integrity
echo "1. 检查数据库完整性..."
INTEGRITY=$(docker-compose -f /opt/rust-api/docker-compose.prod.yml exec -T rust-user-api \
    sqlite3 /app/data/production.db "PRAGMA integrity_check;" 2>/dev/null)

# No output means the check itself could not run (for example, the container
# is down). Do not treat that as corruption and overwrite the live database.
if [ -z "$INTEGRITY" ]; then
    echo "❌ integrity check could not be run; investigate manually" >&2
    exit 1
fi

if [ "$INTEGRITY" != "ok" ]; then
    echo "❌ 数据库完整性检查失败: $INTEGRITY"

    # 2. Stop the service
    echo "2. 停止服务..."
    docker-compose -f /opt/rust-api/docker-compose.prod.yml stop rust-user-api

    # 3. Keep a copy of the corrupted database
    echo "3. 备份损坏的数据库..."
    cp "$DB_PATH" "${DB_PATH}.corrupted.$(date +%Y%m%d-%H%M%S)"

    # 4. Restore the latest backup
    echo "4. 恢复最新备份..."
    LATEST_BACKUP=$(find "$BACKUP_DIR" -name "database-backup-*.tar.gz" | sort -r | head -1)

    if [ -n "$LATEST_BACKUP" ]; then
        echo "使用备份: $LATEST_BACKUP"
        # Extract into a private directory instead of the shared /tmp, so the
        # glob below cannot pick up a stale or foreign backup-*.db file
        RESTORE_DIR=$(mktemp -d)
        tar -xzf "$LATEST_BACKUP" -C "$RESTORE_DIR"
        cp "$RESTORE_DIR"/backup-*.db "$DB_PATH"
        rm -rf "$RESTORE_DIR"
        chown apiuser:apiuser "$DB_PATH"
        echo "✅ 数据库恢复完成"
    else
        echo "❌ 未找到可用备份"
        exit 1
    fi

    # 5. Restart the service
    echo "5. 重启服务..."
    docker-compose -f /opt/rust-api/docker-compose.prod.yml start rust-user-api

    # 6. Verify the recovery
    sleep 30
    if curl -f -s http://localhost/health; then
        echo "✅ 服务恢复成功"
    else
        echo "❌ 服务恢复失败"
        exit 1
    fi
else
    echo "✅ 数据库完整性正常"
fi

echo "=== 数据库恢复完成 ==="
```

### 2. Diagnosing performance problems

#### Performance diagnosis script

```bash
#!/bin/bash
# performance-diagnosis.sh

echo "=== 性能诊断开始 ==="

# 1. System resource usage
echo "1. 系统资源使用情况:"
echo "CPU: $(top -bn1 | grep "Cpu(s)" | awk '{print $2}')"
echo "内存: $(free -h | grep Mem | awk '{print $3 "/" $2}')"
echo "磁盘IO: $(iostat -x 1 1 | tail -n +4 | awk '{print $1, $10}' | grep -v '^$')"

# 2. Container resource usage
echo "2. 容器资源使用:"
docker stats --no-stream --format "table {{.Container}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.NetIO}}\t{{.BlockIO}}"

# 3. Database performance
echo "3. 数据库性能分析:"
docker-compose -f /opt/rust-api/docker-compose.prod.yml exec -T rust-user-api \
    sqlite3 /app/data/production.db << 'EOF'
.timer on
SELECT COUNT(*) FROM users;
SELECT COUNT(*) FROM user_sessions WHERE created_at > datetime('now', '-1 day');
EOF

# 4. Network connections
echo "4. 网络连接统计:"
netstat -an | grep :3000 | awk '{print $6}' | sort | uniq -c

# 5. Slow query analysis
echo "5. 慢查询分析:"
grep "slow_query" /opt/rust-api/logs/app.log | tail -10

echo "=== 性能诊断完成 ==="
```

## 📊 Monitoring and Alerting

### 1. Monitoring configuration

#### Prometheus alert rules

```yaml
# /opt/rust-api/monitoring/prometheus/alert-rules.yml
groups:
  - name: rust-api-alerts
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} errors per second"

      - alert: HighResponseTime
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High response time detected"
          description: "95th percentile response time is {{ $value }}s"

      - alert: ServiceDown
        expr: up{job="rust-user-api"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"
          description: "Rust User API service is not responding"

      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Memory usage is {{ $value | humanizePercentage }}"

      - alert: HighDiskUsage
        expr: (node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High disk usage"
          description: "Disk usage is {{ $value | humanizePercentage }}"
```

### 2. Operations automation

#### Crontab configuration

```bash
# Add to crontab: crontab -e

# Run the health check every hour
0 * * * * /opt/rust-api/scripts/daily-health-check.sh >> /var/log/health-check.log 2>&1

# Back up the database every day at 02:00
0 2 * * * /opt/rust-api/scripts/backup-database.sh >> /var/log/backup.log 2>&1

# Run database maintenance every day at 03:00
0 3 * * * /opt/rust-api/scripts/database-maintenance.sh >> /var/log/db-maintenance.log 2>&1

# Run system cleanup every Sunday at 04:00
0 4 * * 0 /opt/rust-api/scripts/system-cleanup.sh >> /var/log/system-cleanup.log 2>&1

# Check log sizes and rotate every day
0 1 * * * /opt/rust-api/scripts/log-rotation.sh >> /var/log/log-rotation.log 2>&1

# Check service status every 5 minutes
*/5 * * * * /opt/rust-api/scripts/service-monitor.sh >> /var/log/service-monitor.log 2>&1
```

#### Service monitoring script

```bash
#!/bin/bash
# service-monitor.sh

ALERT_EMAIL="admin@yourdomain.com"
LOG_FILE="/var/log/service-monitor.log"

# Check service health
if ! curl -f -s --max-time 10 http://localhost/health > /dev/null; then
    echo "$(date): Service health check failed" >> "$LOG_FILE"

    # Send an alert e-mail
    echo "Rust User API service health check failed at $(date)" | \
        mail -s "Service Alert: API Health Check Failed" "$ALERT_EMAIL"

    # Attempt automatic recovery
    /opt/rust-api/scripts/service-recovery.sh >> "$LOG_FILE" 2>&1
fi

# Check system resources
CPU_USAGE=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | cut -d'%' -f1)
MEM_USAGE=$(free | grep Mem | awk '{printf "%.1f", $3/$2 * 100.0}')

if (( $(echo "$CPU_USAGE > 80" | bc -l) )); then
    echo "$(date): High CPU usage: $CPU_USAGE%" >> "$LOG_FILE"
fi

if (( $(echo "$MEM_USAGE > 85" | bc -l) )); then
    echo "$(date): High memory usage: $MEM_USAGE%" >> "$LOG_FILE"
fi
```

## 📈 Capacity Planning

### 1. Resource usage trend analysis

#### Resource statistics script

```bash
#!/bin/bash
# resource-stats.sh

STATS_FILE="/opt/rust-api/logs/resource-stats.csv"
DATE=$(date '+%Y-%m-%d %H:%M:%S')

# Create the CSV header (if the file does not exist)
if [ ! -f "$STATS_FILE" ]; then
    echo "timestamp,cpu_usage,memory_usage,disk_usage,active_connections,response_time" > "$STATS_FILE"
fi

# Collect metrics
CPU_USAGE=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | cut -d'%' -f1)
MEM_USAGE=$(free | grep Mem | awk '{printf "%.1f", $3/$2 * 100.0}')
DISK_USAGE=$(df -h / | awk 'NR==2{print $5}' | cut -d'%' -f1)
CONNECTIONS=$(netstat -an | grep :3000 | grep ESTABLISHED | wc -l)
RESPONSE_TIME=$(curl -o /dev/null -s -w "%{time_total}" http://localhost/health)

# Append to the CSV
echo "$DATE,$CPU_USAGE,$MEM_USAGE,$DISK_USAGE,$CONNECTIONS,$RESPONSE_TIME" >> "$STATS_FILE"

# Keep the most recent 30 days of data (43200 rows) and preserve the CSV header
{ head -n 1 "$STATS_FILE"; tail -n +2 "$STATS_FILE" | tail -n 43200; } > "${STATS_FILE}.tmp" \
    && mv "${STATS_FILE}.tmp" "$STATS_FILE"
```

### 2. Scaling recommendations

#### Scaling decision matrix

| Metric | Current threshold | Scaling recommendation |
|------|----------|----------|
| CPU usage | > 70% | Add CPU cores or scale horizontally |
| Memory usage | > 80% | Add memory or optimize the application |
| Disk usage | > 85% | Expand storage or archive data |
| Response time | > 500ms | Optimize performance or add load balancing |
| Concurrent connections | > 1000 | Add instances or tune the connection pool |

## 📞 Emergency Response

### 1. Emergency contact procedure

#### Incident severity levels

- **P0 (Critical)**: Service completely unavailable, affecting all users
- **P1 (High)**: Core functionality unavailable, affecting most users
- **P2 (Medium)**: Some functionality unavailable, affecting a small number of users
- **P3 (Low)**: Performance problems or problems in non-core functionality

#### Emergency response times

- P0: respond within 15 minutes, resolve within 1 hour
- P1: respond within 30 minutes, resolve within 4 hours
- P2: respond within 2 hours, resolve within 24 hours
- P3: respond within 1 business day, resolve within 1 week

### 2. Emergency handling checklist

#### Handling a P0 incident

```bash
#!/bin/bash
# emergency-response-p0.sh

echo "=== P0级别应急响应 ==="
echo "开始时间: $(date)"

# 1. Notify immediately
echo "1. 发送紧急通知..."
echo "P0 Alert: Rust User API service is down" | \
    mail -s "URGENT: Service Down" admin@yourdomain.com

# 2. Quick diagnosis
echo "2. 快速诊断..."
docker-compose -f /opt/rust-api/docker-compose.prod.yml ps
curl -I http://localhost/health

# 3. Attempt a quick recovery
echo "3. 尝试快速恢复..."
docker-compose -f /opt/rust-api/docker-compose.prod.yml restart

# 4. Verify the recovery
sleep 30
if curl -f -s http://localhost/health; then
    echo "✅ 服务已恢复"
    echo "Service recovered at $(date)" | \
        mail -s "Service Recovered" admin@yourdomain.com
else
    echo "❌ 快速恢复失败,需要深度诊断"
    # Start the deep diagnosis procedure
    /opt/rust-api/scripts/deep-diagnosis.sh
fi

echo "=== 应急响应完成 ==="
```

## 📚 Operations Best Practices

### 1. Preventive maintenance

- **Regular backups**: Automatic daily backups, with backup integrity verified weekly
- **Monitoring and alerting**: Set reasonable alert thresholds to avoid alert fatigue
- **Capacity planning**: Review resource usage trends regularly and scale ahead of demand
- **Security updates**: Apply security patches and updates promptly
- **Documentation upkeep**: Keep the operations documentation up to date

### 2. Change management

- **Change approval**: All changes to the production environment require approval
- **Test validation**: Validate changes thoroughly in a test environment before applying them
- **Rollback plan**: Every change must have a clear rollback plan
- **Change records**: Record the content and outcome of every change in detail

### 3. Knowledge management

- **Incident records**: Record the cause and resolution of every incident in detail
- **Experience sharing**: Share operations experience and best practices regularly
- **Training plan**: Hold operations skills training regularly
- **Documentation updates**: Update the operations handbook and process documents promptly

---

**Note**: This handbook should be updated and refined regularly to match actual operational needs. All scripts should be validated in a test environment before use.
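
The data backup script only confirms that the archive file exists, while the preventive maintenance list calls for weekly verification of backup integrity. The following added example (not one of the handbook's own scripts) shows one way to run that check before relying on or restoring an archive; the `verify_backup` function name and the `<archive>.sha256` sidecar file are assumptions, since backup-database.sh does not write a checksum.

```bash
#!/bin/bash
# verify-backup.sh: added example, not one of the handbook's scripts.
# Assumption: backup-database.sh also stores the archive's SHA-256 in
# "<archive>.sha256" (e.g. sha256sum "$f" | cut -d' ' -f1 > "$f.sha256").

# verify_backup ARCHIVE: returns 0 only if every check passes.
verify_backup() {
    local archive="$1" expected actual member workdir rc=0
    [ -f "$archive" ] && [ -f "$archive.sha256" ] ||
        { echo "archive or checksum file missing" >&2; return 1; }

    # 1. The checksum recorded at backup time must still match.
    expected=$(cut -d' ' -f1 "$archive.sha256")
    actual=$(sha256sum "$archive" | cut -d' ' -f1)
    [ "$expected" = "$actual" ] || { echo "checksum mismatch" >&2; return 2; }

    # 2. Exactly one member, named backup-<date>-<time>.db, with no directory
    #    part, so extraction cannot write outside the target directory.
    member=$(tar -tzf "$archive") || { echo "unreadable archive" >&2; return 3; }
    printf '%s\n' "$member" | grep -Eqx 'backup-[0-9]{8}-[0-9]{6}\.db' &&
        [ "$(printf '%s\n' "$member" | wc -l)" -eq 1 ] ||
        { echo "unexpected archive contents" >&2; return 3; }

    # 3. Extract into a private temp dir (mktemp -d creates it with mode 0700).
    workdir=$(mktemp -d) || return 4
    tar -xzf "$archive" -C "$workdir" "$member" || rc=4

    # 4. The file must start with the SQLite header magic.
    if [ "$rc" -eq 0 ] && [ "$(head -c 15 "$workdir/$member")" != "SQLite format 3" ]; then
        echo "not a SQLite database" >&2; rc=4
    fi
    # 5. Full integrity check when the sqlite3 CLI is available on the host.
    if [ "$rc" -eq 0 ] && command -v sqlite3 >/dev/null 2>&1 &&
       [ "$(sqlite3 "$workdir/$member" 'PRAGMA integrity_check;' 2>&1)" != "ok" ]; then
        echo "PRAGMA integrity_check failed" >&2; rc=5
    fi
    rm -rf "$workdir"
    return "$rc"
}

# Usage: verify-backup.sh /opt/rust-api/backups/database-backup-<date>.tar.gz
[ $# -eq 0 ] || verify_backup "$1"
```

A checksum stored beside the archive detects corruption, not modification by someone who can write to the backup directory; keep a copy of the checksums on a separate host if that matters.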
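
The monitoring thresholds table defines levels for error rate and response time, but none of the scripts derives those two values from the application log. The following added example (not part of the original handbook) computes them and maps them to the table's levels; the log line layout, the `status` field and all function names are assumptions, and the sample lines are constructed.

```bash
#!/bin/bash
# log-kpi-check.sh: added example, not one of the handbook's scripts.
# Assumed format: one JSON object per line; request lines contain "http_request"
# plus numeric "status" and "response_time" (milliseconds) fields.

# classify VALUE WARN CRIT -> normal | warning | critical
classify() {
    awk -v v="$1" -v w="$2" -v c="$3" 'BEGIN {
        if (v + 0 > c + 0) print "critical"
        else if (v + 0 >= w + 0) print "warning"; else print "normal" }'
}

# error_rate_pct FILE -> share of request lines with a 5xx status, in percent
error_rate_pct() {
    grep '"http_request"' "$1" | awk '
        match($0, /"status":[0-9]+/) {
            total++
            if (substr($0, RSTART + 9, 1) == "5") errors++
        }
        END { if (total) printf "%.1f\n", errors * 100 / total; else print "0.0" }'
}

# p95_response_ms FILE -> 95th percentile (nearest rank) of response_time
p95_response_ms() {
    grep '"http_request"' "$1" | grep -o '"response_time":[0-9]*' | cut -d: -f2 |
        sort -n | awk '{ v[NR] = $1 }
            END { if (NR == 0) { print 0; exit }
                  print v[int((NR * 95 + 99) / 100)] }'
}

# kpi_report FILE -> one line with both KPIs and their levels from the table
kpi_report() {
    local rate p95
    rate=$(error_rate_pct "$1")
    p95=$(p95_response_ms "$1")
    echo "error_rate=${rate}% ($(classify "$rate" 1 5))" \
         "p95=${p95}ms ($(classify "$p95" 200 500))"
}

# Constructed sample log lines (not real traffic).
write_sample_log() {
    cat > "$1" <<'EOF'
{"level":"INFO","message":"http_request","method":"GET","status":200,"response_time":35,"remote_addr":"203.0.113.10"}
{"level":"INFO","message":"http_request","method":"POST","status":201,"response_time":120,"remote_addr":"203.0.113.10"}
{"level":"ERROR","message":"http_request","method":"GET","status":500,"response_time":640,"remote_addr":"198.51.100.7"}
{"level":"INFO","message":"http_request","method":"GET","status":200,"response_time":80,"remote_addr":"198.51.100.7"}
{"level":"ERROR","message":"database is locked"}
EOF
}
```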