Troubleshooting Guide¶
This guide provides solutions to common issues encountered when working with the HG Content Generation System. Issues are organized by component and include diagnostic steps, solutions, and prevention strategies.
Quick Diagnostics¶
System Health Check¶
# Check all service health endpoints
curl -f http://localhost:8000/healthz # CPM
curl -f http://localhost:8001/healthz # External API
curl -f http://localhost:3000/api/health # Frontend
# Check service connectivity
ping cpm-service
ping external-api-service
telnet localhost 6379 # Redis
Environment Validation¶
# Validate required environment variables
python3 << 'EOF'
import os
required = [
    'SUPABASE_URL', 'SUPABASE_ANON_KEY', 'SUPABASE_SERVICE_ROLE_KEY',
    'OPENAI_API_KEY'  # or other LLM provider key
]
missing = [var for var in required if not os.getenv(var)]
if missing:
    print(f"❌ Missing variables: {', '.join(missing)}")
else:
    print("✅ All required environment variables are set")
EOF
Log Analysis¶
# Quick log check for errors
docker-compose logs --tail=50 cpm | grep -i error
docker-compose logs --tail=50 external | grep -i error
docker-compose logs --tail=50 frontend | grep -i error
# Follow logs in real-time
docker-compose logs -f cpm external frontend
Database Issues¶
Connection Problems¶
Issue: "Could not connect to Supabase"¶
Symptoms:
- Services fail to start
- Database connection errors in logs
- 500 errors on API endpoints
Diagnostic Steps:
# Test Supabase connectivity
curl -H "apikey: ${SUPABASE_ANON_KEY}" "${SUPABASE_URL}/rest/v1/"
# Check environment variables
echo "URL: ${SUPABASE_URL}"
echo "Key set: ${SUPABASE_ANON_KEY:+Yes}"
# Test database query
psql "${DATABASE_URL}" -c "SELECT version();"
Solutions:
1. Verify Supabase credentials.
2. Update the connection string.
3. Check network connectivity.
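The credential checks above can be folded into a small scripted diagnostic. This is an illustrative sketch (the function name and the exact checks are not part of the system):

```python
import os

def diagnose_supabase_env(env=None):
    """Return a list of likely Supabase configuration problems.

    `env` defaults to os.environ; pass a dict to test specific setups.
    """
    env = os.environ if env is None else env
    problems = []
    url = env.get("SUPABASE_URL", "")
    if not url:
        problems.append("SUPABASE_URL is not set")
    elif not url.startswith("https://"):
        problems.append("SUPABASE_URL should be an https:// URL")
    for key in ("SUPABASE_ANON_KEY", "SUPABASE_SERVICE_ROLE_KEY"):
        if not env.get(key):
            problems.append(f"{key} is not set")
    return problems
```

Run it once at service startup to fail fast with an actionable message instead of a generic 500.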
Issue: "Connection pool exhausted"¶
Symptoms:
- Intermittent database errors
- Slow API responses
- "Connection pool is full" errors
Solutions:
1. Increase pool size:
DATABASE_POOL_SIZE=50 # Increase from default 20
DATABASE_MAX_OVERFLOW=20 # Allow temporary overflow
2. Optimize connection usage.
3. Monitor pool usage.
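To see why these two settings matter, here is a toy model of how `pool_size` plus `max_overflow` bound concurrent connections. This is a sketch of the semantics, not the real pool implementation:

```python
import threading

class BoundedPool:
    """Toy model of pool_size + max_overflow semantics: up to pool_size
    connections are kept open, and up to max_overflow extra connections
    may be opened temporarily before acquisition fails."""

    def __init__(self, pool_size=20, max_overflow=0):
        self._slots = threading.BoundedSemaphore(pool_size + max_overflow)

    def acquire(self):
        # Non-blocking: returns False when the pool is exhausted, which is
        # the point where real apps see "Connection pool is full" errors.
        return self._slots.acquire(blocking=False)

    def release(self):
        self._slots.release()
```

The fix is therefore twofold: raise the bound, and make sure every `acquire` is paired with a `release` (i.e. connections are returned promptly).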
Migration Issues¶
Issue: "Table does not exist"¶
Symptoms:
- SQL errors about missing tables
- Fresh deployments failing
- Database schema mismatches
Solutions:
1. Run migrations manually.
2. Reset the database (development only).
3. Verify table creation.
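Table verification can be scripted. This sketch uses SQLite's `sqlite_master` catalog as a self-contained stand-in; against Supabase/Postgres you would run the analogous query on `information_schema.tables`:

```python
import sqlite3

def table_exists(conn, name):
    # SQLite keeps its catalog in sqlite_master; on Postgres, query
    # information_schema.tables WHERE table_name = %s instead.
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type='table' AND name=?", (name,)
    ).fetchone()
    return row is not None
```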
LLM Provider Issues¶
OpenAI API Problems¶
Issue: "Rate limit exceeded"¶
Symptoms:
- 429 HTTP responses from OpenAI
- "Rate limit exceeded" errors
- Jobs failing with rate limit messages
Diagnostic Steps:
# Check current usage
curl -H "Authorization: Bearer ${OPENAI_API_KEY}" \
https://api.openai.com/v1/usage
# Test API key
curl -H "Authorization: Bearer ${OPENAI_API_KEY}" \
https://api.openai.com/v1/models
Solutions:
1. Implement rate limiting.
2. Add fallback providers.
3. Upgrade your OpenAI plan: contact OpenAI to increase rate limits.
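A common client-side rate-limiting strategy is exponential backoff on 429 responses. A minimal sketch, where `RateLimitError` stands in for whatever exception your LLM client raises on a 429:

```python
import time

class RateLimitError(Exception):
    """Stand-in for the provider client's 429 exception."""

def call_with_backoff(fn, max_attempts=5, base_delay=0.01):
    """Retry fn on RateLimitError, doubling the delay each attempt."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # budget exhausted; let the job fail visibly
            time.sleep(base_delay * (2 ** attempt))
```

Wrap each provider call in `call_with_backoff` so transient 429s are absorbed instead of failing the whole job.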
Issue: "Invalid API key"¶
Symptoms:
- 401 authentication errors
- "Invalid API key" responses
- Authorization failures
Solutions:
1. Verify the API key format.
2. Generate a new API key:
- Visit https://platform.openai.com/api-keys
- Create a new key and update the environment
3. Check organization settings.
Anthropic/Claude Issues¶
Issue: "Claude API timeout"¶
Symptoms:
- Requests time out after long delays
- "Request timeout" errors
- Jobs stuck in processing state
Solutions:
1. Increase the timeout.
2. Check model availability.
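Most HTTP clients accept a timeout parameter directly; when one does not, a generic pattern is to run the call in a worker thread and give up after a deadline. A sketch (note the abandoned worker thread still runs to completion in the background):

```python
import concurrent.futures
import time

def call_with_timeout(fn, timeout, *args, **kwargs):
    """Run fn(*args, **kwargs) in a worker thread; raise
    concurrent.futures.TimeoutError if it exceeds `timeout` seconds."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn, *args, **kwargs)
        return future.result(timeout=timeout)
```

Pair this with job-level retry logic so a timed-out request is retried rather than leaving the job stuck in a processing state.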
Ollama Issues¶
Issue: "Cannot connect to Ollama"¶
Symptoms:
- Connection refused errors
- Ollama service unavailable
- Local model requests failing
Diagnostic Steps:
# Check if Ollama is running
curl http://localhost:11434/api/version
# List available models
curl http://localhost:11434/api/tags
# Check Ollama service status
systemctl status ollama # Linux
brew services list | grep ollama # macOS
Solutions:
1. Start the Ollama service:
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh
# Start service
ollama serve
# Pull required models
ollama pull llama2
ollama pull mistral
2. Configure the correct URL.
3. Check model availability.
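Model availability can be checked by parsing the `/api/tags` response shown in the diagnostics above. This sketch assumes the response shape `{"models": [{"name": "llama2:latest"}, ...]}`:

```python
import json

def installed_models(tags_payload):
    """Extract model names from an Ollama /api/tags response
    (assumed shape: {"models": [{"name": ...}, ...]})."""
    data = json.loads(tags_payload) if isinstance(tags_payload, str) else tags_payload
    return [m.get("name", "") for m in data.get("models", [])]

def has_model(tags_payload, wanted):
    # Match bare names against tagged ones: "llama2" matches "llama2:latest".
    return any(name == wanted or name.startswith(wanted + ":")
               for name in installed_models(tags_payload))
```

Feed it the body of `curl http://localhost:11434/api/tags`; if the wanted model is missing, run `ollama pull <model>`.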
Job Processing Issues¶
Jobs Stuck in Queue¶
Issue: "Jobs not processing"¶
Symptoms:
- Jobs remain in "queued" status
- No progress updates
- Worker not processing jobs
Diagnostic Steps:
# Check worker status
curl http://localhost:8000/api/workers/status
# Check Redis queue
redis-cli -u "${REDIS_URL}" LLEN job_queue
# Check job details
curl -H "Authorization: Bearer ${JWT_TOKEN}" \
http://localhost:8000/api/jobs/job-id-here
Solutions:
1. Restart workers:
# Docker deployment
docker-compose restart cpm
# Manual restart
pkill -f "worker"
python -m apps.cpm.tasks &
2. Check Redis connectivity.
3. Increase worker concurrency.
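Conceptually, increasing worker concurrency just means more consumers draining the same queue. This sketch models the idea with `queue.Queue` and threads standing in for the Redis-backed queue and worker processes:

```python
import queue
import threading

def run_workers(jobs, handler, concurrency=4):
    """Drain a job queue with `concurrency` worker threads, mirroring
    a BRPOP-style consume loop against Redis."""
    q = queue.Queue()
    for job in jobs:
        q.put(job)
    done, lock = [], threading.Lock()

    def worker():
        while True:
            try:
                job = q.get_nowait()
            except queue.Empty:
                return  # queue drained; worker exits
            result = handler(job)
            with lock:
                done.append(result)

    threads = [threading.Thread(target=worker) for _ in range(concurrency)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done
```

If jobs stay queued even with workers running, the handler is likely blocking or the queue name the workers consume does not match the one the API pushes to.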
Job Timeout Issues¶
Issue: "Job timed out"¶
Symptoms:
- Jobs fail after timeout period
- Long content generation times
- Timeout errors in logs
Solutions:
1. Increase timeouts.
2. Optimize content requests.
3. Implement job retry logic.
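Job retry logic usually amounts to a small state machine: requeue a timed-out job until a retry budget is exhausted, then mark it permanently failed. A sketch (the field names and `MAX_RETRIES` value are illustrative):

```python
MAX_RETRIES = 3

def next_state(job):
    """Given a timed-out job dict, either requeue it with an incremented
    retry counter, or mark it failed once MAX_RETRIES is exhausted."""
    if job["retries"] < MAX_RETRIES:
        return {**job, "status": "queued", "retries": job["retries"] + 1}
    return {**job, "status": "failed"}
```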
Frontend Issues¶
Authentication Problems¶
Issue: "Login not working"¶
Symptoms:
- Unable to log in to dashboard
- Authentication redirects fail
- Session expires immediately
Diagnostic Steps:
# Check NextAuth configuration
echo "Secret set: ${NEXTAUTH_SECRET:+Yes}"
echo "URL: ${NEXTAUTH_URL}"
# Test Supabase auth
curl -X POST "${SUPABASE_URL}/auth/v1/token?grant_type=password" \
-H "apikey: ${SUPABASE_ANON_KEY}" \
-H "Content-Type: application/json" \
-d '{"email":"test@example.com","password":"testpass"}'
Solutions:
1. Set the NextAuth secret.
2. Configure correct URLs.
3. Check Supabase auth settings:
- Verify auth providers are enabled
- Check redirect URLs in the Supabase dashboard
- Ensure email confirmation is properly configured
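A suitable `NEXTAUTH_SECRET` can be generated with Python's standard `secrets` module, similar in spirit to `openssl rand -base64 32`:

```python
import secrets

def generate_nextauth_secret(nbytes=32):
    """Return a URL-safe random string with `nbytes` of entropy,
    suitable for use as NEXTAUTH_SECRET."""
    return secrets.token_urlsafe(nbytes)
```

Set the resulting value in the frontend environment and restart the service; a missing or changed secret invalidates existing sessions, which explains "session expires immediately" symptoms.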
API Connection Issues¶
Issue: "Cannot reach backend services"¶
Symptoms:
- Frontend shows connection errors
- API calls fail with network errors
- Services appear offline
Solutions:
1. Verify service URLs:
# Check environment variables
echo "CPM URL: ${NEXT_PUBLIC_CPM_API_URL}"
echo "External API URL: ${NEXT_PUBLIC_EXTERNAL_API_URL}"
# Test connectivity
curl "${NEXT_PUBLIC_CPM_API_URL}/health"
2. Check the CORS configuration.
3. Verify network routes.
Performance Issues¶
Slow Response Times¶
Issue: "API responses are slow"¶
Symptoms:
- High response times
- Timeouts on frontend
- Poor user experience
Diagnostic Steps:
# Check response times
time curl http://localhost:8000/api/jobs
# Monitor system resources
htop
docker stats
# Check database performance (EXPLAIN ANALYZE is SQL, so run it via psql)
psql "${DATABASE_URL}" -c "EXPLAIN ANALYZE SELECT * FROM jobs WHERE client_id = 'client-123';"
Solutions:
1. Add database indexes:
CREATE INDEX idx_jobs_client_id ON jobs(client_id);
CREATE INDEX idx_jobs_status ON jobs(status);
CREATE INDEX idx_jobs_created_at ON jobs(created_at);
2. Enable caching.
3. Optimize queries.
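For caching, a per-argument TTL cache in front of expensive read queries is often enough. A minimal decorator sketch (an in-process stand-in for the Redis-backed cache):

```python
import functools
import time

def ttl_cache(ttl_seconds):
    """Cache a function's result per argument tuple for ttl_seconds."""
    def decorator(fn):
        store = {}  # args -> (timestamp, value)

        @functools.wraps(fn)
        def wrapper(*args):
            now = time.monotonic()
            hit = store.get(args)
            if hit is not None and now - hit[0] < ttl_seconds:
                return hit[1]  # fresh cached value
            value = fn(*args)
            store[args] = (now, value)
            return value
        return wrapper
    return decorator
```

Apply it to hot read paths (e.g. job listings per client) and keep the TTL short so staleness stays bounded.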
High Memory Usage¶
Issue: "Services consuming too much memory"¶
Symptoms:
- Out of memory errors
- Container restarts
- System slowdowns
Solutions:
1. Increase memory limits.
2. Optimize application memory.
3. Monitor memory usage.
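For a quick look at where a Python service's memory goes, the standard library's `tracemalloc` can list the biggest allocation sites without any external tooling:

```python
import tracemalloc

def top_allocations(limit=5):
    """Return (source location, bytes) for the biggest current
    allocation sites. Starts tracing on first use."""
    if not tracemalloc.is_tracing():
        tracemalloc.start()
    snapshot = tracemalloc.take_snapshot()
    return [(str(stat.traceback), stat.size)
            for stat in snapshot.statistics("lineno")[:limit]]
```

Note that only allocations made after tracing starts are counted, so call it early (or hook it behind a debug endpoint) and compare successive snapshots.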
Network and Connectivity Issues¶
Docker Networking Problems¶
Issue: "Services cannot communicate"¶
Symptoms:
- Connection refused between services
- DNS resolution failures
- Intermittent connectivity
Solutions:
1. Check Docker networks:
# List networks
docker network ls
# Inspect network
docker network inspect hgcontent_default
# Test connectivity between containers
docker exec cpm ping external-api
2. Verify service names.
3. Use explicit networking.
Port Conflicts¶
Issue: "Port already in use"¶
Symptoms:
- "Address already in use" errors
- Services fail to start
- Port binding failures
Solutions:
1. Find processes using ports.
2. Change port mappings.
3. Stop conflicting services.
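Checking for a conflicting listener can be scripted with a socket probe, roughly equivalent to inspecting `lsof -i :PORT` output:

```python
import socket

def port_in_use(port, host="127.0.0.1"):
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex((host, port)) == 0
```

Run it against 8000, 8001, 3000, and 6379 before `docker-compose up` to catch conflicts early.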
Security Issues¶
API Key Problems¶
Issue: "API key authentication failing"¶
Symptoms:
- 401 unauthorized errors
- API key validation failures
- Authentication bypassed
Solutions:
1. Verify the API key format:
# Check key format — a valid key is "hgc_" followed by 32 alphanumeric characters
import re
api_key = "hgc_your-api-key"  # replace with the key under test
assert re.match(r'^hgc_[a-zA-Z0-9]{32}$', api_key)
2. Check key storage.
3. Test authentication.
CORS Issues¶
Issue: "CORS policy blocks requests"¶
Symptoms:
- Browser blocks API calls
- "Access-Control-Allow-Origin" errors
- Preflight request failures
Solutions:
1. Configure CORS properly:
# FastAPI CORS configuration
app.add_middleware(
    CORSMiddleware,
    allow_origins=["https://your-frontend-domain.com"],
    allow_credentials=True,
    allow_methods=["GET", "POST", "PUT", "DELETE"],
    allow_headers=["*"],
)
2. Check preflight handling.
Monitoring and Debugging¶
Logging Issues¶
Issue: "Logs not appearing"¶
Symptoms:
- No log output
- Missing error information
- Silent failures
Solutions:
1. Check log configuration.
2. Verify log destinations.
3. Test logging manually.
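To test logging manually, configure a handler explicitly and confirm output lands where expected. This sketch reads the level from a `LOG_LEVEL` environment variable, which is an assumption here, not a documented setting of the system:

```python
import io
import logging
import os

def configure_logging(stream=None):
    """Configure the root logger from LOG_LEVEL (default INFO).
    Pass a stream (e.g. io.StringIO) to capture output in tests."""
    level = os.getenv("LOG_LEVEL", "INFO").upper()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(name)s %(message)s"))
    root = logging.getLogger()
    root.handlers[:] = [handler]  # replace any stale handlers
    root.setLevel(level)
    return root
```

If messages still do not appear, a common culprit is a library that installed its own handlers or set `propagate = False` on a child logger.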
Metrics Collection¶
Issue: "Metrics not available"¶
Symptoms:
- Monitoring dashboards empty
- Prometheus scraping failures
- Missing performance data
Solutions:
1. Enable metrics endpoints.
2. Check metrics format.
3. Configure Prometheus.
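When checking the metrics format, it helps to know that the Prometheus text exposition format is plain enough to render (or eyeball) directly. A minimal counter renderer, useful as a reference for what a scrape endpoint should return:

```python
def render_metrics(counters):
    """Render a dict of counter name -> value in Prometheus
    text exposition format."""
    lines = []
    for name, value in sorted(counters.items()):
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"
```

Compare `curl` output from the service's metrics endpoint against this shape; a missing `# TYPE` line or a non-numeric value is enough to make Prometheus drop the sample.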
Recovery Procedures¶
System Recovery¶
Complete System Reset (Development)¶
# Stop all services
docker-compose down
# Remove all data (CAUTION: DATA LOSS)
docker-compose down -v
docker system prune -f
# Rebuild and restart
docker-compose build --no-cache
docker-compose up -d
# Verify services
./scripts/health-check.sh
Database Recovery¶
# Backup current database
supabase db dump --file backup.sql
# Reset database (development)
supabase db reset
# Restore from backup if needed
psql "${DATABASE_URL}" < backup.sql
Data Recovery¶
Job Recovery¶
# Recover failed jobs
import asyncio
from apps.cpm.database import get_db_manager

async def recover_failed_jobs():
    db = get_db_manager()
    failed_jobs = await db.get_jobs_by_status('failed')
    for job in failed_jobs:
        # Reset job to queued status
        await db.update_job_status(job.id, 'queued')
        print(f"Recovered job {job.id}")

asyncio.run(recover_failed_jobs())
Cache Recovery¶
# Clear Redis cache
redis-cli -u "${REDIS_URL}" FLUSHDB
# Warm up cache
curl http://localhost:8000/api/cache/warm-up
Prevention Strategies¶
Monitoring Setup¶
Health Checks¶
#!/bin/bash
# Health check script
set -e
echo "Checking service health..."
curl -f http://localhost:8000/health >/dev/null
curl -f http://localhost:8001/health >/dev/null
curl -f http://localhost:3000/api/health >/dev/null
echo "✅ All services healthy"
Automated Monitoring¶
# docker-compose.yml
services:
  cpm:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
Backup Strategies¶
Automated Backups¶
#!/bin/bash
# Backup script
DATE=$(date +%Y%m%d_%H%M%S)
# Database backup
supabase db dump --file "backup_${DATE}.sql"
# Environment backup
cp .env "env_backup_${DATE}"
# Upload to storage
aws s3 cp "backup_${DATE}.sql" s3://backups/hgcontent/
Update Procedures¶
Rolling Updates¶
# Update strategy for production
git pull origin main
docker-compose build cpm # Build new image
docker-compose up -d --no-deps cpm # Update only CPM
./scripts/health-check.sh # Verify health
# If successful, update other services
docker-compose up -d --no-deps external-api
docker-compose up -d --no-deps frontend
Getting Help¶
Support Channels¶
- Documentation: Check this troubleshooting guide and API docs
- GitHub Issues: Report bugs and issues
- Developer Guide: Setup and development guide
- Community: Join discussions in GitHub Discussions
Debugging Tools¶
# Install debugging tools
pip install httpie jq
npm install -g @httpie/cli
# API testing
http GET localhost:8000/api/jobs Authorization:"Bearer ${TOKEN}"
# Log analysis
docker-compose logs cpm | jq -r '.message'
Creating Bug Reports¶
When reporting issues, include:
1. Environment details: OS, Docker version, service versions
2. Steps to reproduce: exact commands and inputs used
3. Expected vs actual behavior: what should happen vs what does happen
4. Logs and errors: relevant log entries and error messages
5. Configuration: relevant environment variables (redact secrets)
Bug Report Template:
## Environment
- OS: Ubuntu 20.04
- Docker: 20.10.x
- Services: CPM v1.0.0, External API v1.0.0
## Issue Description
Brief description of the problem
## Steps to Reproduce
1. Step one
2. Step two
3. Step three
## Expected Behavior
What should happen
## Actual Behavior
What actually happens
## Logs
Remember to keep this troubleshooting guide updated as new issues are discovered and resolved!