Files
rust-user-api/docs/operations-manual.md
enoch bb9d7a869d
Some checks failed
Deploy to Production / Run Tests (push) Failing after 16m35s
Deploy to Production / Security Scan (push) Has been skipped
Deploy to Production / Build Docker Image (push) Has been skipped
Deploy to Production / Deploy to Staging (push) Has been skipped
Deploy to Production / Deploy to Production (push) Has been skipped
Deploy to Production / Notify Results (push) Successful in 31s
feat: 完成Rust User API完整开发
 新功能:
- SQLite数据库集成和持久化存储
- 数据库迁移系统和版本管理
- API分页功能和高效查询
- 用户搜索和过滤机制
- 完整的RBAC角色权限系统
- 结构化日志记录和系统监控
- API限流和多层安全防护
- Docker容器化和生产部署配置

🔒 安全特性:
- JWT认证和授权
- 限流和防暴力破解
- 安全头和CORS配置
- 输入验证和XSS防护
- 审计日志和安全监控

📊 监控和运维:
- Prometheus指标收集
- 健康检查和系统监控
- 自动化备份和恢复
- 完整的运维文档和脚本
- CI/CD流水线配置

🚀 部署支持:
- 多环境Docker配置
- 生产环境部署指南
- 性能优化和安全加固
- 故障排除和应急响应
- 自动化运维脚本

📚 文档完善:
- API使用文档
- 部署检查清单
- 运维操作手册
- 性能和安全指南
- 故障排除指南
2025-08-07 16:03:32 +08:00

682 lines
19 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# 运维手册
## 📖 概述
本手册提供Rust User API生产环境的日常运维指导包括监控、维护、故障处理和最佳实践。
## 🔍 日常监控
### 1. 系统健康检查
#### 每日检查项目
```bash
#!/bin/bash
# daily-health-check.sh
echo "=== Rust User API 每日健康检查 ==="
echo "检查时间: $(date)"
echo
# 1. 服务状态检查
echo "1. 检查服务状态..."
docker-compose -f /opt/rust-api/docker-compose.prod.yml ps
# 2. 健康端点检查
echo "2. 检查健康端点..."
if curl -f -s http://localhost/health > /dev/null; then
echo "✅ 健康检查通过"
else
echo "❌ 健康检查失败"
fi
# 3. 系统资源检查
echo "3. 检查系统资源..."
echo "CPU使用率: $(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | cut -d'%' -f1)"
echo "内存使用率: $(free | grep Mem | awk '{printf("%.1f%%"), $3/$2 * 100.0}')"
echo "磁盘使用率: $(df -h / | awk 'NR==2{printf "%s", $5}')"
# 4. 数据库检查
echo "4. 检查数据库..."
DB_SIZE=$(docker-compose -f /opt/rust-api/docker-compose.prod.yml exec -T rust-user-api \
sqlite3 /app/data/production.db "SELECT page_count * page_size as size FROM pragma_page_count(), pragma_page_size();" 2>/dev/null)
echo "数据库大小: $((DB_SIZE / 1024 / 1024)) MB"
# 5. 日志检查
echo "5. 检查错误日志..."
ERROR_COUNT=$(docker-compose -f /opt/rust-api/docker-compose.prod.yml logs --since=24h rust-user-api 2>/dev/null | grep -i error | wc -l)
echo "24小时内错误数量: $ERROR_COUNT"
# 6. 证书检查
echo "6. 检查SSL证书..."
CERT_DAYS=$(openssl x509 -in /opt/rust-api/ssl/cert.pem -noout -dates | grep notAfter | cut -d= -f2)
echo "证书过期时间: $CERT_DAYS"
echo
echo "=== 检查完成 ==="
```
#### 监控指标阈值
| 指标 | 正常范围 | 警告阈值 | 严重阈值 |
|------|----------|----------|----------|
| CPU使用率 | <50% | 50-80% | >80% |
| 内存使用率 | <70% | 70-85% | >85% |
| 磁盘使用率 | <80% | 80-90% | >90% |
| 响应时间 | <200ms | 200-500ms | >500ms |
| 错误率 | <1% | 1-5% | >5% |
### 2. 性能监控
#### 关键性能指标 (KPI)
```bash
# 获取性能指标脚本
#!/bin/bash
# performance-metrics.sh
echo "=== 性能指标报告 ==="
echo "时间: $(date)"
echo
# API响应时间
echo "1. API响应时间测试..."
for endpoint in "/health" "/api/users" "/monitoring/metrics"; do
response_time=$(curl -o /dev/null -s -w "%{time_total}" "http://localhost$endpoint")
echo " $endpoint: ${response_time}s"
done
# 并发测试
echo "2. 并发性能测试..."
ab -n 100 -c 10 -q http://localhost/api/users | grep "Requests per second\|Time per request"
# 数据库性能
echo "3. 数据库性能..."
docker-compose -f /opt/rust-api/docker-compose.prod.yml exec -T rust-user-api \
sqlite3 /app/data/production.db "PRAGMA optimize; PRAGMA integrity_check;"
echo "=== 报告完成 ==="
```
### 3. 日志监控
#### 日志分析脚本
```bash
#!/bin/bash
# log-analysis.sh
LOG_FILE="/opt/rust-api/logs/app.log"
HOURS=${1:-24}
echo "=== 最近${HOURS}小时日志分析 ==="
# 错误统计
echo "1. 错误统计:"
grep -i error $LOG_FILE | tail -n 100 | awk '{print $1, $2}' | sort | uniq -c | sort -nr
# 请求统计
echo "2. 请求统计:"
grep "http_request" $LOG_FILE | tail -n 1000 | \
grep -o '"method":"[^"]*"' | sort | uniq -c | sort -nr
# 响应时间分析
echo "3. 响应时间分析:"
grep "response_time" $LOG_FILE | tail -n 100 | \
grep -o '"response_time":[0-9]*' | cut -d: -f2 | \
awk '{sum+=$1; count++} END {if(count>0) print "平均响应时间:", sum/count "ms"}'
# IP访问统计
echo "4. 访问IP统计:"
grep "remote_addr" $LOG_FILE | tail -n 1000 | \
grep -o '"remote_addr":"[^"]*"' | cut -d: -f2 | tr -d '"' | \
sort | uniq -c | sort -nr | head -10
```
## 🔧 日常维护
### 1. 数据库维护
#### 数据库优化
```bash
#!/bin/bash
# database-maintenance.sh
echo "=== 数据库维护开始 ==="
# 1. 数据库优化
echo "1. 执行数据库优化..."
docker-compose -f /opt/rust-api/docker-compose.prod.yml exec -T rust-user-api \
sqlite3 /app/data/production.db "PRAGMA optimize; VACUUM; ANALYZE;"
# 2. 检查数据库完整性
echo "2. 检查数据库完整性..."
INTEGRITY=$(docker-compose -f /opt/rust-api/docker-compose.prod.yml exec -T rust-user-api \
sqlite3 /app/data/production.db "PRAGMA integrity_check;")
if [ "$INTEGRITY" = "ok" ]; then
echo "✅ 数据库完整性检查通过"
else
echo "❌ 数据库完整性检查失败: $INTEGRITY"
fi
# 3. 统计信息
echo "3. 数据库统计信息..."
docker-compose -f /opt/rust-api/docker-compose.prod.yml exec -T rust-user-api \
sqlite3 /app/data/production.db << 'EOF'
.mode column
.headers on
SELECT 'users' as table_name, COUNT(*) as record_count FROM users;
SELECT 'user_sessions' as table_name, COUNT(*) as record_count FROM user_sessions;
EOF
echo "=== 数据库维护完成 ==="
```
#### 数据备份
```bash
#!/bin/bash
# backup-database.sh
BACKUP_DIR="/opt/rust-api/backups"
DATE=$(date +%Y%m%d-%H%M%S)
RETENTION_DAYS=30
echo "=== 数据库备份开始 ==="
# 1. 创建备份目录
mkdir -p $BACKUP_DIR
# 2. 备份数据库
echo "1. 备份数据库..."
docker-compose -f /opt/rust-api/docker-compose.prod.yml exec -T rust-user-api \
sqlite3 /app/data/production.db ".backup /app/data/backup-$DATE.db"
# 3. 复制到主机
docker cp rust-user-api-prod:/app/data/backup-$DATE.db $BACKUP_DIR/
# 4. 压缩备份
echo "2. 压缩备份文件..."
tar -czf "$BACKUP_DIR/database-backup-$DATE.tar.gz" \
-C $BACKUP_DIR backup-$DATE.db
# 5. 清理临时文件
rm -f "$BACKUP_DIR/backup-$DATE.db"
# 6. 验证备份
echo "3. 验证备份..."
if [ -f "$BACKUP_DIR/database-backup-$DATE.tar.gz" ]; then
SIZE=$(du -h "$BACKUP_DIR/database-backup-$DATE.tar.gz" | cut -f1)
echo "✅ 备份成功: database-backup-$DATE.tar.gz ($SIZE)"
else
echo "❌ 备份失败"
exit 1
fi
# 7. 清理旧备份
echo "4. 清理旧备份..."
find $BACKUP_DIR -name "database-backup-*.tar.gz" -mtime +$RETENTION_DAYS -delete
REMAINING=$(find $BACKUP_DIR -name "database-backup-*.tar.gz" | wc -l)
echo "保留备份数量: $REMAINING"
echo "=== 数据库备份完成 ==="
```
### 2. 日志管理
#### 日志轮转
```bash
#!/bin/bash
# log-rotation.sh
LOG_DIR="/opt/rust-api/logs"
MAX_SIZE="100M"
MAX_FILES=10
echo "=== 日志轮转开始 ==="
# 1. 检查日志大小
for log_file in "$LOG_DIR"/*.log; do
if [ -f "$log_file" ]; then
size=$(du -h "$log_file" | cut -f1)
echo "日志文件: $(basename $log_file) - 大小: $size"
# 如果文件大于最大大小,进行轮转
if [ $(du -m "$log_file" | cut -f1) -gt 100 ]; then
echo "轮转日志: $log_file"
# 压缩并重命名旧日志
timestamp=$(date +%Y%m%d-%H%M%S)
gzip -c "$log_file" > "${log_file}.${timestamp}.gz"
# 清空当前日志文件
> "$log_file"
echo "✅ 日志轮转完成: ${log_file}.${timestamp}.gz"
fi
fi
done
# 2. 清理旧日志
echo "2. 清理旧日志文件..."
find $LOG_DIR -name "*.log.*.gz" -mtime +30 -delete
echo "=== 日志轮转完成 ==="
```
### 3. 系统清理
#### 系统清理脚本
```bash
#!/bin/bash
# system-cleanup.sh
echo "=== 系统清理开始 ==="
# 1. Docker清理
echo "1. 清理Docker资源..."
docker system prune -f
docker volume prune -f
docker image prune -f
# 2. 清理临时文件
echo "2. 清理临时文件..."
find /tmp -type f -mtime +7 -delete 2>/dev/null || true
find /var/tmp -type f -mtime +7 -delete 2>/dev/null || true
# 3. 清理系统日志
echo "3. 清理系统日志..."
journalctl --vacuum-time=30d
journalctl --vacuum-size=1G
# 4. 更新系统包
echo "4. 更新系统包..."
apt update && apt upgrade -y
# 5. 检查磁盘空间
echo "5. 磁盘空间报告..."
df -h
echo "=== 系统清理完成 ==="
```
## 🚨 故障处理
### 1. 常见故障处理流程
#### 服务无响应
```bash
#!/bin/bash
# service-recovery.sh
echo "=== 服务恢复流程 ==="
# 1. 检查服务状态
echo "1. 检查服务状态..."
docker-compose -f /opt/rust-api/docker-compose.prod.yml ps
# 2. 检查健康端点
echo "2. 检查健康端点..."
if ! curl -f -s --max-time 10 http://localhost/health; then
echo "❌ 健康检查失败,开始恢复流程..."
# 3. 查看最近日志
echo "3. 查看最近日志..."
docker-compose -f /opt/rust-api/docker-compose.prod.yml logs --tail=50 rust-user-api
# 4. 重启服务
echo "4. 重启服务..."
docker-compose -f /opt/rust-api/docker-compose.prod.yml restart rust-user-api
# 5. 等待服务启动
echo "5. 等待服务启动..."
sleep 30
# 6. 再次检查
if curl -f -s --max-time 10 http://localhost/health; then
echo "✅ 服务恢复成功"
else
echo "❌ 服务恢复失败,需要人工介入"
exit 1
fi
else
echo "✅ 服务正常运行"
fi
echo "=== 恢复流程完成 ==="
```
#### 数据库问题处理
```bash
#!/bin/bash
# database-recovery.sh
echo "=== 数据库恢复流程 ==="
DB_PATH="/opt/rust-api/data/production.db"
BACKUP_DIR="/opt/rust-api/backups"
# 1. 检查数据库完整性
echo "1. 检查数据库完整性..."
INTEGRITY=$(docker-compose -f /opt/rust-api/docker-compose.prod.yml exec -T rust-user-api \
sqlite3 /app/data/production.db "PRAGMA integrity_check;" 2>/dev/null)
if [ "$INTEGRITY" != "ok" ]; then
echo "❌ 数据库完整性检查失败: $INTEGRITY"
# 2. 停止服务
echo "2. 停止服务..."
docker-compose -f /opt/rust-api/docker-compose.prod.yml stop rust-user-api
# 3. 备份损坏的数据库
echo "3. 备份损坏的数据库..."
cp $DB_PATH "${DB_PATH}.corrupted.$(date +%Y%m%d-%H%M%S)"
# 4. 恢复最新备份
echo "4. 恢复最新备份..."
LATEST_BACKUP=$(find $BACKUP_DIR -name "database-backup-*.tar.gz" | sort -r | head -1)
if [ -n "$LATEST_BACKUP" ]; then
echo "使用备份: $LATEST_BACKUP"
tar -xzf "$LATEST_BACKUP" -C /tmp/
cp /tmp/backup-*.db $DB_PATH
chown apiuser:apiuser $DB_PATH
echo "✅ 数据库恢复完成"
else
echo "❌ 未找到可用备份"
exit 1
fi
# 5. 重启服务
echo "5. 重启服务..."
docker-compose -f /opt/rust-api/docker-compose.prod.yml start rust-user-api
# 6. 验证恢复
sleep 30
if curl -f -s http://localhost/health; then
echo "✅ 服务恢复成功"
else
echo "❌ 服务恢复失败"
exit 1
fi
else
echo "✅ 数据库完整性正常"
fi
echo "=== 数据库恢复完成 ==="
```
### 2. 性能问题诊断
#### 性能诊断脚本
```bash
#!/bin/bash
# performance-diagnosis.sh
echo "=== 性能诊断开始 ==="
# 1. 系统资源使用
echo "1. 系统资源使用情况:"
echo "CPU: $(top -bn1 | grep "Cpu(s)" | awk '{print $2}')"
echo "内存: $(free -h | grep Mem | awk '{print $3 "/" $2}')"
echo "磁盘IO: $(iostat -x 1 1 | tail -n +4 | awk '{print $1, $10}' | grep -v '^$')"
# 2. 容器资源使用
echo "2. 容器资源使用:"
docker stats --no-stream --format "table {{.Container}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.NetIO}}\t{{.BlockIO}}"
# 3. 数据库性能
echo "3. 数据库性能分析:"
docker-compose -f /opt/rust-api/docker-compose.prod.yml exec -T rust-user-api \
sqlite3 /app/data/production.db << 'EOF'
.timer on
SELECT COUNT(*) FROM users;
SELECT COUNT(*) FROM user_sessions WHERE created_at > datetime('now', '-1 day');
EOF
# 4. 网络连接
echo "4. 网络连接统计:"
netstat -an | grep :3000 | awk '{print $6}' | sort | uniq -c
# 5. 慢查询分析
echo "5. 慢查询分析:"
grep "slow_query" /opt/rust-api/logs/app.log | tail -10
echo "=== 性能诊断完成 ==="
```
## 📊 监控和告警
### 1. 监控配置
#### Prometheus告警规则
```yaml
# /opt/rust-api/monitoring/prometheus/alert-rules.yml
groups:
- name: rust-api-alerts
rules:
- alert: HighErrorRate
expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.1
for: 5m
labels:
severity: warning
annotations:
summary: "High error rate detected"
description: "Error rate is {{ $value }} errors per second"
- alert: HighResponseTime
expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
for: 5m
labels:
severity: warning
annotations:
summary: "High response time detected"
description: "95th percentile response time is {{ $value }}s"
- alert: ServiceDown
expr: up{job="rust-user-api"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Service is down"
description: "Rust User API service is not responding"
- alert: HighMemoryUsage
expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes > 0.85
for: 5m
labels:
severity: warning
annotations:
summary: "High memory usage"
description: "Memory usage is {{ $value | humanizePercentage }}"
- alert: HighDiskUsage
expr: (node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes > 0.85
for: 5m
labels:
severity: warning
annotations:
summary: "High disk usage"
description: "Disk usage is {{ $value | humanizePercentage }}"
```
### 2. 自动化运维
#### Crontab配置
```bash
# 添加到crontab: crontab -e
# 每小时执行健康检查
0 * * * * /opt/rust-api/scripts/daily-health-check.sh >> /var/log/health-check.log 2>&1
# 每天凌晨2点备份数据库
0 2 * * * /opt/rust-api/scripts/backup-database.sh >> /var/log/backup.log 2>&1
# 每天凌晨3点执行数据库维护
0 3 * * * /opt/rust-api/scripts/database-maintenance.sh >> /var/log/db-maintenance.log 2>&1
# 每周日凌晨4点执行系统清理
0 4 * * 0 /opt/rust-api/scripts/system-cleanup.sh >> /var/log/system-cleanup.log 2>&1
# 每天检查日志大小并轮转
0 1 * * * /opt/rust-api/scripts/log-rotation.sh >> /var/log/log-rotation.log 2>&1
# 每5分钟检查服务状态
*/5 * * * * /opt/rust-api/scripts/service-monitor.sh >> /var/log/service-monitor.log 2>&1
```
#### 服务监控脚本
```bash
#!/bin/bash
# service-monitor.sh
ALERT_EMAIL="admin@yourdomain.com"
LOG_FILE="/var/log/service-monitor.log"
# 检查服务健康状态
if ! curl -f -s --max-time 10 http://localhost/health > /dev/null; then
echo "$(date): Service health check failed" >> $LOG_FILE
# 发送告警邮件
echo "Rust User API service health check failed at $(date)" | \
mail -s "Service Alert: API Health Check Failed" $ALERT_EMAIL
# 尝试自动恢复
/opt/rust-api/scripts/service-recovery.sh >> $LOG_FILE 2>&1
fi
# 检查系统资源
CPU_USAGE=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | cut -d'%' -f1)
MEM_USAGE=$(free | grep Mem | awk '{printf("%.1f"), $3/$2 * 100.0}')
if (( $(echo "$CPU_USAGE > 80" | bc -l) )); then
echo "$(date): High CPU usage: $CPU_USAGE%" >> $LOG_FILE
fi
if (( $(echo "$MEM_USAGE > 85" | bc -l) )); then
echo "$(date): High memory usage: $MEM_USAGE%" >> $LOG_FILE
fi
```
## 📈 容量规划
### 1. 资源使用趋势分析
#### 资源统计脚本
```bash
#!/bin/bash
# resource-stats.sh
STATS_FILE="/opt/rust-api/logs/resource-stats.csv"
DATE=$(date '+%Y-%m-%d %H:%M:%S')
# 创建CSV头部如果文件不存在
if [ ! -f "$STATS_FILE" ]; then
echo "timestamp,cpu_usage,memory_usage,disk_usage,active_connections,response_time" > $STATS_FILE
fi
# 收集指标
CPU_USAGE=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | cut -d'%' -f1)
MEM_USAGE=$(free | grep Mem | awk '{printf("%.1f"), $3/$2 * 100.0}')
DISK_USAGE=$(df -h / | awk 'NR==2{print $5}' | cut -d'%' -f1)
CONNECTIONS=$(netstat -an | grep :3000 | grep ESTABLISHED | wc -l)
RESPONSE_TIME=$(curl -o /dev/null -s -w "%{time_total}" http://localhost/health)
# 写入CSV
echo "$DATE,$CPU_USAGE,$MEM_USAGE,$DISK_USAGE,$CONNECTIONS,$RESPONSE_TIME" >> $STATS_FILE
# 保留最近30天的数据
tail -n 43200 $STATS_FILE > ${STATS_FILE}.tmp && mv ${STATS_FILE}.tmp $STATS_FILE
```
### 2. 扩容建议
#### 扩容决策矩阵
| 指标 | 当前阈值 | 扩容建议 |
|------|----------|----------|
| CPU使用率 > 70% | 增加CPU核心或横向扩展 |
| 内存使用率 > 80% | 增加内存或优化应用 |
| 磁盘使用率 > 85% | 扩展存储或数据归档 |
| 响应时间 > 500ms | 性能优化或负载均衡 |
| 并发连接 > 1000 | 增加实例或连接池优化 |
## 📞 应急响应
### 1. 应急联系流程
#### 故障等级定义
- **P0 (严重)**: 服务完全不可用,影响所有用户
- **P1 (高)**: 核心功能不可用,影响大部分用户
- **P2 (中)**: 部分功能不可用,影响少部分用户
- **P3 (低)**: 性能问题或非核心功能问题
#### 应急响应时间
- P0: 15分钟内响应1小时内解决
- P1: 30分钟内响应4小时内解决
- P2: 2小时内响应24小时内解决
- P3: 1个工作日内响应1周内解决
### 2. 应急处理清单
#### P0级别故障处理
```bash
#!/bin/bash
# emergency-response-p0.sh
echo "=== P0级别应急响应 ==="
echo "开始时间: $(date)"
# 1. 立即通知
echo "1. 发送紧急通知..."
echo "P0 Alert: Rust User API service is down" | \
mail -s "URGENT: Service Down" admin@yourdomain.com
# 2. 快速诊断
echo "2. 快速诊断..."
docker-compose -f /opt/rust-api/docker-compose.prod.yml ps
curl -I http://localhost/health
# 3. 尝试快速恢复
echo "3. 尝试快速恢复..."
docker-compose -f /opt/rust-api/docker-compose.prod.yml restart
# 4. 验证恢复
sleep 30
if curl -f -s http://localhost/health; then
echo "✅ 服务已恢复"
echo "Service recovered at $(date)" | \
mail -s "Service Recovered" admin@yourdomain.com
else
echo "❌ 快速恢复失败,需要深度诊断"
# 启动深度诊断流程
/opt/rust-api/scripts/deep-diagnosis.sh
fi
echo "=== 应急响应完成 ==="
```
## 📚 运维最佳实践
### 1. 预防性维护
- **定期备份**: 每日自动备份,每周验证备份完整性
- **监控告警**: 设置合理的告警阈值,避免告警疲劳
- **容量规划**: 定期评估资源使用趋势,提前扩容
- **安全更新**: 及时应用安全补丁和更新
- **文档维护**: 保持运维文档的及时更新
### 2. 变更管理
- **变更审批**: 所有生产环境变更需要审批
- **测试验证**: 变更前在测试环境充分验证
- **回滚计划**: 每次变更都要有明确的回滚方案
- **变更记录**: 详细记录所有变更内容和结果
### 3. 知识管理
- **故障记录**: 详细记录每次故障的原因和解决方案
- **经验分享**: 定期分享运维经验和最佳实践
- **培训计划**: 定期进行运维技能培训
- **文档更新**: 及时更新运维手册和流程文档
---
**注意**: 本手册应根据实际运维需求定期更新和完善。所有脚本在使用前应在测试环境验证。