Troubleshooting Guide¶

This guide provides solutions to common issues encountered when working with the HG Content Generation System. Issues are organized by component and include diagnostic steps, solutions, and prevention strategies.

Quick Diagnostics¶

System Health Check¶

# Check all service health endpoints
curl -f http://localhost:8000/healthz  # CPM
curl -f http://localhost:8001/healthz  # External API  
curl -f http://localhost:3000/api/health  # Frontend

# Check service connectivity
ping cpm-service
ping external-api-service
telnet localhost 6379  # Redis

Environment Validation¶

# Validate required environment variables
python3 << 'EOF'
import os
required = [
    'SUPABASE_URL', 'SUPABASE_ANON_KEY', 'SUPABASE_SERVICE_ROLE_KEY',
    'OPENAI_API_KEY'  # or other LLM provider key
]
missing = [var for var in required if not os.getenv(var)]
if missing:
    print(f"❌ Missing variables: {', '.join(missing)}")
else:
    print("✅ All required environment variables are set")
EOF

Log Analysis¶

# Quick log check for errors
docker-compose logs --tail=50 cpm | grep -i error
docker-compose logs --tail=50 external | grep -i error
docker-compose logs --tail=50 frontend | grep -i error

# Follow logs in real-time
docker-compose logs -f cpm external frontend

Database Issues¶

Connection Problems¶

Issue: "Could not connect to Supabase"¶

Symptoms: - Services fail to start - Database connection errors in logs - 500 errors on API endpoints

Diagnostic Steps:

# Test Supabase connectivity
curl -H "apikey: ${SUPABASE_ANON_KEY}" "${SUPABASE_URL}/rest/v1/"

# Check environment variables
echo "URL: ${SUPABASE_URL}"
echo "Key set: ${SUPABASE_ANON_KEY:+Yes}"

# Test database query
psql "${DATABASE_URL}" -c "SELECT version();"

Solutions: 1. Verify Supabase credentials:

# Check Supabase project status
supabase status
# Or visit Supabase dashboard

Update connection string:

# Ensure correct format
SUPABASE_URL=https://your-project-id.supabase.co
SUPABASE_ANON_KEY=eyJ...your-anon-key

Check network connectivity:

# Test DNS resolution
nslookup your-project-id.supabase.co
# Test port accessibility
telnet your-project-id.supabase.co 443

Issue: "Connection pool exhausted"¶

Symptoms: - Intermittent database errors - Slow API responses - "Connection pool is full" errors

Solutions: 1. Increase pool size:

DATABASE_POOL_SIZE=50        # Increase from default 20
DATABASE_MAX_OVERFLOW=20     # Allow temporary overflow

Optimize connection usage:

# Ensure connections are properly closed
async with get_db() as db:
    # Use connection
    pass  # Connection automatically closed

Monitor pool usage:

# Check pool metrics
curl http://localhost:8000/metrics | grep db_pool

Migration Issues¶

Issue: "Table does not exist"¶

Symptoms: - SQL errors about missing tables - Fresh deployments failing - Database schema mismatches

Solutions: 1. Run migrations manually:

# Check migration status
supabase migration list

# Apply pending migrations
supabase db push

Reset database (development only):
```
supabase db reset
supabase db push
```

Verify table creation:

-- Check if tables exist
SELECT table_name FROM information_schema.tables 
WHERE table_schema = 'public';

LLM Provider Issues¶

OpenAI API Problems¶

Issue: "Rate limit exceeded"¶

Symptoms: - 429 HTTP responses from OpenAI - "Rate limit exceeded" errors - Jobs failing with rate limit messages

Diagnostic Steps:

# Check current usage
curl -H "Authorization: Bearer ${OPENAI_API_KEY}" \
  https://api.openai.com/v1/usage

# Test API key
curl -H "Authorization: Bearer ${OPENAI_API_KEY}" \
  https://api.openai.com/v1/models

Solutions: 1. Implement rate limiting:

# Configure in CPM settings
OPENAI_REQUESTS_PER_MINUTE=60
OPENAI_TOKENS_PER_MINUTE=90000

Add fallback providers:

# Configure multiple providers
PRIMARY_LLM_PROVIDER=openai
FALLBACK_LLM_PROVIDER=anthropic
ANTHROPIC_API_KEY=your-anthropic-key

Upgrade OpenAI plan: Contact OpenAI to increase rate limits

Issue: "Invalid API key"¶

Symptoms: - 401 authentication errors - "Invalid API key" responses - Authorization failures

Solutions: 1. Verify API key format:

# OpenAI keys start with sk-
echo "${OPENAI_API_KEY}" | head -c 10  # Should show sk-...

Generate new API key:
Visit https://platform.openai.com/api-keys
Create new key and update environment

Check organization settings:

# If using organization
OPENAI_ORG_ID=org-your-organization-id

Anthropic/Claude Issues¶

Issue: "Claude API timeout"¶

Symptoms: - Requests timeout after long delays - "Request timeout" errors - Jobs stuck in processing state

Solutions: 1. Increase timeout:

ANTHROPIC_TIMEOUT=300  # 5 minutes
LLM_PROVIDER_TIMEOUT=300

Check model availability:

# Test Claude API
import anthropic
client = anthropic.Anthropic(api_key="your-key")
message = client.messages.create(
    model="claude-3-sonnet-20240229",
    max_tokens=100,
    messages=[{"role": "user", "content": "Hello"}]
)

Ollama Issues¶

Issue: "Cannot connect to Ollama"¶

Symptoms: - Connection refused errors - Ollama service unavailable - Local model requests failing

Diagnostic Steps:

# Check if Ollama is running
curl http://localhost:11434/api/version

# List available models
curl http://localhost:11434/api/tags

# Check Ollama service status
systemctl status ollama  # Linux
brew services list | grep ollama  # macOS

Solutions: 1. Start Ollama service:

# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Start service
ollama serve

# Pull required models
ollama pull llama2
ollama pull mistral

Configure correct URL:

# Local development
OLLAMA_BASE_URL=http://localhost:11434

# Docker deployment
OLLAMA_BASE_URL=http://ollama:11434

# Remote Ollama instance
OLLAMA_BASE_URL=https://your-ollama-server.com

Check model availability:

# Verify models are downloaded
ollama list

# Test model response
ollama run llama2 "Hello, how are you?"

Job Processing Issues¶

Jobs Stuck in Queue¶

Issue: "Jobs not processing"¶

Symptoms: - Jobs remain in "queued" status - No progress updates - Worker not processing jobs

Diagnostic Steps:

# Check worker status
curl http://localhost:8000/api/workers/status

# Check Redis queue
redis-cli -u "${REDIS_URL}" LLEN job_queue

# Check job details
curl -H "Authorization: Bearer ${JWT_TOKEN}" \
  http://localhost:8000/api/jobs/job-id-here

Solutions: 1. Restart workers:

# Docker deployment
docker-compose restart cpm

# Manual restart
pkill -f "worker"
python -m apps.cpm.tasks &

Check Redis connectivity:

# Test Redis connection
redis-cli -u "${REDIS_URL}" ping

# Clear stuck jobs (caution!)
redis-cli -u "${REDIS_URL}" FLUSHDB

Increase worker concurrency:

MAX_CONCURRENT_JOBS=20  # Increase from default 10

Job Timeout Issues¶

Issue: "Job timed out"¶

Symptoms: - Jobs fail after timeout period - Long content generation times - Timeout errors in logs

Solutions: 1. Increase timeouts:

JOB_TIMEOUT=1200           # 20 minutes
LLM_PROVIDER_TIMEOUT=600   # 10 minutes per LLM request

Optimize content requests:

{
  "targetLength": 800,     // Shorter content
  "temperature": 0.3,      // More focused generation
  "model": "gpt-3.5-turbo" // Faster model
}

Implement job retry logic:

JOB_RETRY_ATTEMPTS=3
JOB_RETRY_DELAY=60  # Wait 60s between retries

Frontend Issues¶

Authentication Problems¶

Symptoms: - Unable to log in to dashboard - Authentication redirects fail - Session expires immediately

Diagnostic Steps:

# Check NextAuth configuration
echo "Secret set: ${NEXTAUTH_SECRET:+Yes}"
echo "URL: ${NEXTAUTH_URL}"

# Test Supabase auth
curl -X POST "${SUPABASE_URL}/auth/v1/token?grant_type=password" \
  -H "apikey: ${SUPABASE_ANON_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"email":"test@example.com","password":"testpass"}'

Solutions: 1. Set NextAuth secret:

# Generate secure secret
NEXTAUTH_SECRET=$(openssl rand -base64 32)

Configure correct URLs:

# Development
NEXTAUTH_URL=http://localhost:3000

# Production
NEXTAUTH_URL=https://app.hgcontent.com

Check Supabase auth settings:
Verify auth providers are enabled
Check redirect URLs in Supabase dashboard
Ensure email confirmation is properly configured

API Connection Issues¶

Issue: "Cannot reach backend services"¶

Symptoms: - Frontend shows connection errors - API calls fail with network errors - Services appear offline

Solutions: 1. Verify service URLs:

# Check environment variables
echo "CPM URL: ${NEXT_PUBLIC_CPM_API_URL}"
echo "External API URL: ${NEXT_PUBLIC_EXTERNAL_API_URL}"

# Test connectivity
curl "${NEXT_PUBLIC_CPM_API_URL}/health"

Check CORS configuration:

# In FastAPI apps
allow_origins=["https://your-frontend-domain.com"]

Verify network routes:

# Docker networking
docker network ls
docker network inspect hgcontent_default

Performance Issues¶

Slow Response Times¶

Issue: "API responses are slow"¶

Symptoms: - High response times - Timeouts on frontend - Poor user experience

Diagnostic Steps:

# Check response times
time curl http://localhost:8000/api/jobs

# Monitor system resources
htop
docker stats

# Check database performance
EXPLAIN ANALYZE SELECT * FROM jobs WHERE client_id = 'client-123';

Solutions: 1. Add database indexes:

CREATE INDEX idx_jobs_client_id ON jobs(client_id);
CREATE INDEX idx_jobs_status ON jobs(status);
CREATE INDEX idx_jobs_created_at ON jobs(created_at);

Enable caching:

REDIS_URL=redis://localhost:6379
CACHE_TTL=300  # 5 minutes

Optimize queries:

# Use pagination
SELECT * FROM jobs LIMIT 20 OFFSET 0;

# Select only needed columns
SELECT id, status, created_at FROM jobs;

High Memory Usage¶

Issue: "Services consuming too much memory"¶

Symptoms: - Out of memory errors - Container restarts - System slowdowns

Solutions: 1. Increase memory limits:

# docker-compose.yml
services:
  cpm:
    mem_limit: 2g
    memswap_limit: 2g

Optimize application memory:

# Use generators for large datasets
def process_jobs():
    for job in get_jobs_stream():
        yield process_job(job)

# Clear large objects
del large_object
gc.collect()

Monitor memory usage:

# Check container memory
docker stats --no-stream

# System memory
free -h
cat /proc/meminfo

Network and Connectivity Issues¶

Docker Networking Problems¶

Issue: "Services cannot communicate"¶

Symptoms: - Connection refused between services - DNS resolution failures - Intermittent connectivity

Solutions: 1. Check Docker networks:

# List networks
docker network ls

# Inspect network
docker network inspect hgcontent_default

# Test connectivity between containers
docker exec cpm ping external-api

Verify service names:

# docker-compose.yml - ensure consistent service names
services:
  cpm:
    container_name: cpm
  external-api:
    container_name: external-api

Use explicit networking:

networks:
  hgcontent:
    driver: bridge

services:
  cpm:
    networks:
      - hgcontent

Port Conflicts¶

Issue: "Port already in use"¶

Symptoms: - "Address already in use" errors - Services fail to start - Port binding failures

Solutions: 1. Find processes using ports:

# Check what's using port 8000
lsof -i :8000
netstat -tulpn | grep :8000

Change port mappings:

# docker-compose.yml
services:
  cpm:
    ports:
      - "8001:8000"  # Map to different host port

Stop conflicting services:

# Kill process using port
sudo kill $(lsof -t -i:8000)

# Stop all Docker containers
docker stop $(docker ps -aq)

Security Issues¶

API Key Problems¶

Issue: "API key authentication failing"¶

Symptoms: - 401 unauthorized errors - API key validation failures - Authentication bypassed

Solutions: 1. Verify API key format:

# Check key format
import re
api_key = "hgc_your-api-key"
assert re.match(r'^hgc_[a-zA-Z0-9]{32}$', api_key)

Check key storage:

-- Verify key in database
SELECT * FROM api_keys WHERE key_hash = hash_function('your-key');

Test authentication:

# Test API key
curl -H "X-API-Key: hgc_your-key" \
  http://localhost:8001/api/v1/content/generate

CORS Issues¶

Issue: "CORS policy blocks requests"¶

Symptoms: - Browser blocks API calls - "Access-Control-Allow-Origin" errors - Preflight request failures

Solutions: 1. Configure CORS properly:

# FastAPI CORS configuration
app.add_middleware(
    CORSMiddleware,
    allow_origins=["https://your-frontend-domain.com"],
    allow_credentials=True,
    allow_methods=["GET", "POST", "PUT", "DELETE"],
    allow_headers=["*"],
)

Check preflight handling:

# Test OPTIONS request
curl -X OPTIONS -H "Origin: https://your-domain.com" \
  http://localhost:8000/api/generate

Monitoring and Debugging¶

Logging Issues¶

Issue: "Logs not appearing"¶

Symptoms: - No log output - Missing error information - Silent failures

Solutions: 1. Check log configuration:

LOG_LEVEL=DEBUG
LOG_FORMAT=json
STRUCTURED_LOGGING=true

Verify log destinations:

# Check if logging to file
tail -f /var/log/hgcontent.log

# Docker logs
docker-compose logs -f cpm

Test logging manually:

import structlog
logger = structlog.get_logger()
logger.info("Test log message", extra_field="test")

Metrics Collection¶

Issue: "Metrics not available"¶

Symptoms: - Monitoring dashboards empty - Prometheus scraping failures - Missing performance data

Solutions: 1. Enable metrics endpoints:

ENABLE_METRICS=true
METRICS_PORT=9090

Check metrics format:
```
curl http://localhost:9090/metrics
```

Configure Prometheus:

# prometheus.yml
scrape_configs:
  - job_name: 'hgcontent'
    static_configs:
      - targets: ['localhost:9090']

Recovery Procedures¶

System Recovery¶

Complete System Reset (Development)¶

# Stop all services
docker-compose down

# Remove all data (CAUTION: DATA LOSS)
docker-compose down -v
docker system prune -f

# Rebuild and restart
docker-compose build --no-cache
docker-compose up -d

# Verify services
./scripts/health-check.sh

Database Recovery¶

# Backup current database
supabase db dump --file backup.sql

# Reset database (development)
supabase db reset

# Restore from backup if needed
psql "${DATABASE_URL}" < backup.sql

Data Recovery¶

Job Recovery¶

# Recover failed jobs
import asyncio
from apps.cpm.database import get_db_manager

async def recover_failed_jobs():
    db = get_db_manager()
    failed_jobs = await db.get_jobs_by_status('failed')

    for job in failed_jobs:
        # Reset job to queued status
        await db.update_job_status(job.id, 'queued')
        print(f"Recovered job {job.id}")

asyncio.run(recover_failed_jobs())

Cache Recovery¶

# Clear Redis cache
redis-cli -u "${REDIS_URL}" FLUSHDB

# Warm up cache
curl http://localhost:8000/api/cache/warm-up

Prevention Strategies¶

Monitoring Setup¶

Health Checks¶

# Create health check script
#!/bin/bash
set -e

echo "Checking service health..."
curl -f http://localhost:8000/health >/dev/null
curl -f http://localhost:8001/health >/dev/null
curl -f http://localhost:3000/api/health >/dev/null

echo "✅ All services healthy"

Automated Monitoring¶

# docker-compose.yml
services:
  cpm:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3

Backup Strategies¶

Automated Backups¶

#!/bin/bash
# Backup script
DATE=$(date +%Y%m%d_%H%M%S)

# Database backup
supabase db dump --file "backup_${DATE}.sql"

# Environment backup
cp .env "env_backup_${DATE}"

# Upload to storage
aws s3 cp "backup_${DATE}.sql" s3://backups/hgcontent/

Update Procedures¶

Rolling Updates¶

# Update strategy for production
git pull origin main
docker-compose build cpm  # Build new image
docker-compose up -d --no-deps cpm  # Update only CPM
./scripts/health-check.sh  # Verify health

# If successful, update other services
docker-compose up -d --no-deps external-api
docker-compose up -d --no-deps frontend

Getting Help¶

Support Channels¶

Documentation: Check this troubleshooting guide and API docs
GitHub Issues: Report bugs and issues
Developer Guide: Setup and development guide
Community: Join discussions in GitHub Discussions

Debugging Tools¶

# Install debugging tools
pip install httpie jq
npm install -g @httpie/cli

# API testing
http GET localhost:8000/api/jobs Authorization:"Bearer ${TOKEN}"

# Log analysis
docker-compose logs cpm | jq -r '.message'

Creating Bug Reports¶

When reporting issues, include: 1. Environment details: OS, Docker version, service versions 2. Steps to reproduce: Exact commands and inputs used 3. Expected vs actual behavior: What should happen vs what does happen 4. Logs and errors: Relevant log entries and error messages 5. Configuration: Relevant environment variables (redact secrets)

Bug Report Template:

## Environment
- OS: Ubuntu 20.04
- Docker: 20.10.x
- Services: CPM v1.0.0, External API v1.0.0

## Issue Description
Brief description of the problem

## Steps to Reproduce
1. Step one
2. Step two
3. Step three

## Expected Behavior
What should happen

## Actual Behavior
What actually happens

## Logs

[relevant log entries]

## Configuration

SUPABASE_URL=https://... OPENAI_API_KEY=[REDACTED]

Remember to keep this troubleshooting guide updated as new issues are discovered and resolved!

Troubleshooting Guide¶

Quick Diagnostics¶

System Health Check¶

Environment Validation¶

Log Analysis¶

Database Issues¶

Connection Problems¶

Issue: "Could not connect to Supabase"¶

Issue: "Connection pool exhausted"¶

Migration Issues¶

Issue: "Table does not exist"¶

LLM Provider Issues¶

OpenAI API Problems¶

Issue: "Rate limit exceeded"¶

Issue: "Invalid API key"¶

Anthropic/Claude Issues¶

Issue: "Claude API timeout"¶

Ollama Issues¶

Issue: "Cannot connect to Ollama"¶

Job Processing Issues¶

Jobs Stuck in Queue¶

Issue: "Jobs not processing"¶

Job Timeout Issues¶

Issue: "Job timed out"¶

Frontend Issues¶

Authentication Problems¶

Issue: "Login not working"¶

API Connection Issues¶

Issue: "Cannot reach backend services"¶

Performance Issues¶

Slow Response Times¶

Issue: "API responses are slow"¶

High Memory Usage¶

Issue: "Services consuming too much memory"¶

Network and Connectivity Issues¶

Docker Networking Problems¶

Issue: "Services cannot communicate"¶

Port Conflicts¶

Issue: "Port already in use"¶

Security Issues¶

API Key Problems¶

Issue: "API key authentication failing"¶

CORS Issues¶

Issue: "CORS policy blocks requests"¶

Monitoring and Debugging¶

Logging Issues¶

Issue: "Logs not appearing"¶

Metrics Collection¶

Issue: "Metrics not available"¶

Recovery Procedures¶

System Recovery¶

Complete System Reset (Development)¶

Database Recovery¶

Data Recovery¶

Job Recovery¶

Cache Recovery¶

Prevention Strategies¶

Monitoring Setup¶

Health Checks¶

Automated Monitoring¶

Backup Strategies¶

Automated Backups¶

Update Procedures¶

Rolling Updates¶

Getting Help¶

Support Channels¶

Debugging Tools¶

Creating Bug Reports¶