Troubleshooting Guide

This guide provides solutions to common issues encountered when working with the HG Content Generation System. Issues are organized by component and include diagnostic steps, solutions, and prevention strategies.

Quick Diagnostics

System Health Check

# Check all service health endpoints
curl -f http://localhost:8000/healthz  # CPM
curl -f http://localhost:8001/healthz  # External API  
curl -f http://localhost:3000/api/health  # Frontend

# Check service connectivity
ping cpm-service
ping external-api-service
telnet localhost 6379  # Redis

Environment Validation

# Validate required environment variables
python3 << 'EOF'
import os
required = [
    'SUPABASE_URL', 'SUPABASE_ANON_KEY', 'SUPABASE_SERVICE_ROLE_KEY',
    'OPENAI_API_KEY'  # or other LLM provider key
]
missing = [var for var in required if not os.getenv(var)]
if missing:
    print(f"❌ Missing variables: {', '.join(missing)}")
else:
    print("✅ All required environment variables are set")
EOF

Log Analysis

# Quick log check for errors
docker-compose logs --tail=50 cpm | grep -i error
docker-compose logs --tail=50 external | grep -i error
docker-compose logs --tail=50 frontend | grep -i error

# Follow logs in real-time
docker-compose logs -f cpm external frontend

Database Issues

Connection Problems

Issue: "Could not connect to Supabase"

Symptoms:

  • Services fail to start
  • Database connection errors in logs
  • 500 errors on API endpoints

Diagnostic Steps:

# Test Supabase connectivity
curl -H "apikey: ${SUPABASE_ANON_KEY}" "${SUPABASE_URL}/rest/v1/"

# Check environment variables
echo "URL: ${SUPABASE_URL}"
echo "Key set: ${SUPABASE_ANON_KEY:+Yes}"

# Test database query
psql "${DATABASE_URL}" -c "SELECT version();"

Solutions:

  1. Verify Supabase credentials:

# Check Supabase project status
supabase status
# Or visit Supabase dashboard

  2. Update connection string:

    # Ensure correct format
    SUPABASE_URL=https://your-project-id.supabase.co
    SUPABASE_ANON_KEY=eyJ...your-anon-key
    

  3. Check network connectivity:

    # Test DNS resolution
    nslookup your-project-id.supabase.co
    # Test port accessibility
    telnet your-project-id.supabase.co 443
    

Issue: "Connection pool exhausted"

Symptoms:

  • Intermittent database errors
  • Slow API responses
  • "Connection pool is full" errors

Solutions:

  1. Increase pool size:

DATABASE_POOL_SIZE=50        # Increase from default 20
DATABASE_MAX_OVERFLOW=20     # Allow temporary overflow

  2. Optimize connection usage:

    # Ensure connections are properly closed
    async with get_db() as db:
        # Use connection
        pass  # Connection automatically closed
    

  3. Monitor pool usage:

    # Check pool metrics
    curl http://localhost:8000/metrics | grep db_pool
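
Assuming the services use SQLAlchemy for pooling (the async `get_db` pattern above suggests it, but this is an assumption), the pool env vars would typically be wired into the engine configuration along these lines:

```python
import os

def engine_pool_kwargs():
    """Translate the pool env vars into SQLAlchemy create_engine keyword arguments."""
    return {
        "pool_size": int(os.getenv("DATABASE_POOL_SIZE", "20")),
        "max_overflow": int(os.getenv("DATABASE_MAX_OVERFLOW", "10")),
        "pool_pre_ping": True,  # evict dead connections before handing them out
    }

# e.g. create_async_engine(DATABASE_URL, **engine_pool_kwargs())
```

`pool_pre_ping` costs one lightweight round-trip per checkout but avoids handing stale connections to request handlers, which is a common source of intermittent pool errors.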
    

Migration Issues

Issue: "Table does not exist"

Symptoms:

  • SQL errors about missing tables
  • Fresh deployments failing
  • Database schema mismatches

Solutions:

  1. Run migrations manually:

# Check migration status
supabase migration list

# Apply pending migrations
supabase db push

  2. Reset database (development only):

    supabase db reset
    supabase db push
    

  3. Verify table creation:

    -- Check if tables exist
    SELECT table_name FROM information_schema.tables 
    WHERE table_schema = 'public';
    

LLM Provider Issues

OpenAI API Problems

Issue: "Rate limit exceeded"

Symptoms:

  • 429 HTTP responses from OpenAI
  • "Rate limit exceeded" errors
  • Jobs failing with rate limit messages

Diagnostic Steps:

# Check current usage
curl -H "Authorization: Bearer ${OPENAI_API_KEY}" \
  https://api.openai.com/v1/usage

# Test API key
curl -H "Authorization: Bearer ${OPENAI_API_KEY}" \
  https://api.openai.com/v1/models

Solutions:

  1. Implement rate limiting:

# Configure in CPM settings
OPENAI_REQUESTS_PER_MINUTE=60
OPENAI_TOKENS_PER_MINUTE=90000

  2. Add fallback providers:

    # Configure multiple providers
    PRIMARY_LLM_PROVIDER=openai
    FALLBACK_LLM_PROVIDER=anthropic
    ANTHROPIC_API_KEY=your-anthropic-key
    

  3. Upgrade OpenAI plan: Contact OpenAI to increase rate limits
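
Until higher limits are granted, client-side backoff smooths over transient 429s. A minimal sketch (the exception type here is a stand-in; a real integration would catch the provider's rate-limit exception):

```python
import random
import time

def call_with_backoff(fn, max_attempts=5, base_delay=1.0):
    """Retry fn() with exponential backoff plus jitter on rate-limit errors."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RuntimeError:  # stand-in for the provider's RateLimitError
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error
            # 1s, 2s, 4s, ... plus proportional jitter to avoid thundering herds
            time.sleep(base_delay * 2 ** attempt + random.random() * base_delay)
```

The jitter matters when multiple workers hit the limit at once: without it, they all retry at the same instant and trip the limit again.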

Issue: "Invalid API key"

Symptoms:

  • 401 authentication errors
  • "Invalid API key" responses
  • Authorization failures

Solutions:

  1. Verify API key format:

# OpenAI keys start with sk-
echo "${OPENAI_API_KEY}" | head -c 10  # Should show sk-...

  2. Generate a new API key:
     • Visit https://platform.openai.com/api-keys
     • Create a new key and update the environment

  3. Check organization settings:

    # If using organization
    OPENAI_ORG_ID=org-your-organization-id
    

Anthropic/Claude Issues

Issue: "Claude API timeout"

Symptoms:

  • Requests timeout after long delays
  • "Request timeout" errors
  • Jobs stuck in processing state

Solutions:

  1. Increase timeouts:

ANTHROPIC_TIMEOUT=300  # 5 minutes
LLM_PROVIDER_TIMEOUT=300

  2. Check model availability:

    # Test Claude API
    import anthropic
    client = anthropic.Anthropic(api_key="your-key")
    message = client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=100,
        messages=[{"role": "user", "content": "Hello"}]
    )
    

Ollama Issues

Issue: "Cannot connect to Ollama"

Symptoms:

  • Connection refused errors
  • Ollama service unavailable
  • Local model requests failing

Diagnostic Steps:

# Check if Ollama is running
curl http://localhost:11434/api/version

# List available models
curl http://localhost:11434/api/tags

# Check Ollama service status
systemctl status ollama  # Linux
brew services list | grep ollama  # macOS

Solutions:

  1. Start Ollama service:

# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Start service
ollama serve

# Pull required models
ollama pull llama2
ollama pull mistral

  2. Configure correct URL:

    # Local development
    OLLAMA_BASE_URL=http://localhost:11434
    
    # Docker deployment
    OLLAMA_BASE_URL=http://ollama:11434
    
    # Remote Ollama instance
    OLLAMA_BASE_URL=https://your-ollama-server.com
    

  3. Check model availability:

    # Verify models are downloaded
    ollama list
    
    # Test model response
    ollama run llama2 "Hello, how are you?"
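
The same check can be scripted from Python using only the standard library (a sketch against Ollama's documented /api/generate endpoint; the model name and prompt are examples):

```python
import json
import os
import urllib.request

OLLAMA_BASE_URL = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")

def build_generate_request(model, prompt):
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{OLLAMA_BASE_URL}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

if __name__ == "__main__":
    # Requires a running Ollama instance with the model already pulled
    with urllib.request.urlopen(build_generate_request("llama2", "Hello"), timeout=120) as resp:
        print(json.loads(resp.read())["response"])
```

Setting `"stream": False` makes Ollama return a single JSON object instead of newline-delimited chunks, which is simpler for a connectivity check.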
    

Job Processing Issues

Jobs Stuck in Queue

Issue: "Jobs not processing"

Symptoms:

  • Jobs remain in "queued" status
  • No progress updates
  • Worker not processing jobs

Diagnostic Steps:

# Check worker status
curl http://localhost:8000/api/workers/status

# Check Redis queue
redis-cli -u "${REDIS_URL}" LLEN job_queue

# Check job details
curl -H "Authorization: Bearer ${JWT_TOKEN}" \
  http://localhost:8000/api/jobs/job-id-here

Solutions:

  1. Restart workers:

# Docker deployment
docker-compose restart cpm

# Manual restart
pkill -f "worker"
python -m apps.cpm.tasks &

  2. Check Redis connectivity:

    # Test Redis connection
    redis-cli -u "${REDIS_URL}" ping
    
    # Clear stuck jobs (CAUTION: FLUSHDB deletes every key in the current database)
    redis-cli -u "${REDIS_URL}" FLUSHDB
    

  3. Increase worker concurrency:

    MAX_CONCURRENT_JOBS=20  # Increase from default 10
    

Job Timeout Issues

Issue: "Job timed out"

Symptoms:

  • Jobs fail after timeout period
  • Long content generation times
  • Timeout errors in logs

Solutions:

  1. Increase timeouts:

JOB_TIMEOUT=1200           # 20 minutes
LLM_PROVIDER_TIMEOUT=600   # 10 minutes per LLM request

  2. Optimize content requests:

    {
      "targetLength": 800,     // Shorter content
      "temperature": 0.3,      // More focused generation
      "model": "gpt-3.5-turbo" // Faster model
    }
    

  3. Implement job retry logic:

    JOB_RETRY_ATTEMPTS=3
    JOB_RETRY_DELAY=60  # Wait 60s between retries
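
How a worker might honor those settings (a sketch; `process` stands in for the real job handler, which is not shown here):

```python
import os
import time

JOB_RETRY_ATTEMPTS = int(os.getenv("JOB_RETRY_ATTEMPTS", "3"))
JOB_RETRY_DELAY = float(os.getenv("JOB_RETRY_DELAY", "60"))

def run_with_retries(process, job, attempts=JOB_RETRY_ATTEMPTS, delay=JOB_RETRY_DELAY):
    """Run process(job), retrying failures with a fixed delay between attempts."""
    for attempt in range(1, attempts + 1):
        try:
            return process(job)
        except Exception:
            if attempt == attempts:
                raise  # out of attempts; surface the last error
            time.sleep(delay)
```

A fixed delay keeps the sketch simple; the backoff-with-jitter pattern from the rate-limit section works here too if failures are load-related.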
    

Frontend Issues

Authentication Problems

Issue: "Login not working"

Symptoms:

  • Unable to log in to dashboard
  • Authentication redirects fail
  • Session expires immediately

Diagnostic Steps:

# Check NextAuth configuration
echo "Secret set: ${NEXTAUTH_SECRET:+Yes}"
echo "URL: ${NEXTAUTH_URL}"

# Test Supabase auth
curl -X POST "${SUPABASE_URL}/auth/v1/token?grant_type=password" \
  -H "apikey: ${SUPABASE_ANON_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"email":"test@example.com","password":"testpass"}'

Solutions:

  1. Set NextAuth secret:

# Generate secure secret
NEXTAUTH_SECRET=$(openssl rand -base64 32)

  2. Configure correct URLs:

    # Development
    NEXTAUTH_URL=http://localhost:3000
    
    # Production
    NEXTAUTH_URL=https://app.hgcontent.com
    

  3. Check Supabase auth settings:
     • Verify auth providers are enabled
     • Check redirect URLs in Supabase dashboard
     • Ensure email confirmation is properly configured

API Connection Issues

Issue: "Cannot reach backend services"

Symptoms:

  • Frontend shows connection errors
  • API calls fail with network errors
  • Services appear offline

Solutions:

  1. Verify service URLs:

# Check environment variables
echo "CPM URL: ${NEXT_PUBLIC_CPM_API_URL}"
echo "External API URL: ${NEXT_PUBLIC_EXTERNAL_API_URL}"

# Test connectivity
curl "${NEXT_PUBLIC_CPM_API_URL}/health"

  2. Check CORS configuration:

    # In FastAPI apps
    allow_origins=["https://your-frontend-domain.com"]
    

  3. Verify network routes:

    # Docker networking
    docker network ls
    docker network inspect hgcontent_default
    

Performance Issues

Slow Response Times

Issue: "API responses are slow"

Symptoms:

  • High response times
  • Timeouts on frontend
  • Poor user experience

Diagnostic Steps:

# Check response times
time curl http://localhost:8000/api/jobs

# Monitor system resources
htop
docker stats

# Check database performance (run via psql)
psql "${DATABASE_URL}" -c "EXPLAIN ANALYZE SELECT * FROM jobs WHERE client_id = 'client-123';"

Solutions:

  1. Add database indexes:

CREATE INDEX idx_jobs_client_id ON jobs(client_id);
CREATE INDEX idx_jobs_status ON jobs(status);
CREATE INDEX idx_jobs_created_at ON jobs(created_at);

  2. Enable caching:

    REDIS_URL=redis://localhost:6379
    CACHE_TTL=300  # 5 minutes
    

  3. Optimize queries:

    -- Use pagination
    SELECT * FROM jobs LIMIT 20 OFFSET 0;

    -- Select only needed columns
    SELECT id, status, created_at FROM jobs;
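
The LIMIT/OFFSET arithmetic is easy to get wrong by hand; a small helper can derive it from a page number (a sketch; the `jobs` table and column names come from the examples above, and identifiers are interpolated directly, so only hard-coded trusted names should be passed in):

```python
def paginated_query(table, columns, page=1, page_size=20):
    """Build a LIMIT/OFFSET query for a 1-based page number."""
    if page < 1:
        raise ValueError("page numbers are 1-based")
    offset = (page - 1) * page_size
    # NOTE: table/column names are not escaped; never pass user input here
    return f"SELECT {', '.join(columns)} FROM {table} LIMIT {page_size} OFFSET {offset}"
```

For example, `paginated_query("jobs", ["id", "status", "created_at"], page=2)` yields `SELECT id, status, created_at FROM jobs LIMIT 20 OFFSET 20`.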
    

High Memory Usage

Issue: "Services consuming too much memory"

Symptoms:

  • Out of memory errors
  • Container restarts
  • System slowdowns

Solutions:

  1. Increase memory limits:

# docker-compose.yml
services:
  cpm:
    mem_limit: 2g
    memswap_limit: 2g

  2. Optimize application memory:

    # Use generators for large datasets
    def process_jobs():
        for job in get_jobs_stream():
            yield process_job(job)
    
    # Explicitly release large objects
    import gc
    del large_object
    gc.collect()
    

  3. Monitor memory usage:

    # Check container memory
    docker stats --no-stream
    
    # System memory
    free -h
    cat /proc/meminfo
    

Network and Connectivity Issues

Docker Networking Problems

Issue: "Services cannot communicate"

Symptoms:

  • Connection refused between services
  • DNS resolution failures
  • Intermittent connectivity

Solutions:

  1. Check Docker networks:

# List networks
docker network ls

# Inspect network
docker network inspect hgcontent_default

# Test connectivity between containers
docker exec cpm ping external-api

  2. Verify service names:

    # docker-compose.yml - ensure consistent service names
    services:
      cpm:
        container_name: cpm
      external-api:
        container_name: external-api
    

  3. Use explicit networking:

    networks:
      hgcontent:
        driver: bridge
    
    services:
      cpm:
        networks:
          - hgcontent
    

Port Conflicts

Issue: "Port already in use"

Symptoms:

  • "Address already in use" errors
  • Services fail to start
  • Port binding failures

Solutions:

  1. Find processes using ports:

# Check what's using port 8000
lsof -i :8000
netstat -tulpn | grep :8000

  2. Change port mappings:

    # docker-compose.yml
    services:
      cpm:
        ports:
          - "8001:8000"  # Map to different host port
    

  3. Stop conflicting services:

    # Kill process using port
    sudo kill $(lsof -t -i:8000)
    
    # Stop all Docker containers
    docker stop $(docker ps -aq)
    

Security Issues

API Key Problems

Issue: "API key authentication failing"

Symptoms:

  • 401 unauthorized errors
  • API key validation failures
  • Authentication bypassed

Solutions:

  1. Verify API key format:

# Check key format
import re
api_key = "hgc_your-api-key"
assert re.match(r'^hgc_[a-zA-Z0-9]{32}$', api_key)

  2. Check key storage:

    -- Verify key in database
    SELECT * FROM api_keys WHERE key_hash = hash_function('your-key');
    

  3. Test authentication:

    # Test API key
    curl -H "X-API-Key: hgc_your-key" \
      http://localhost:8001/api/v1/content/generate
    

CORS Issues

Issue: "CORS policy blocks requests"

Symptoms:

  • Browser blocks API calls
  • "Access-Control-Allow-Origin" errors
  • Preflight request failures

Solutions:

  1. Configure CORS properly:

# FastAPI CORS configuration
app.add_middleware(
    CORSMiddleware,
    allow_origins=["https://your-frontend-domain.com"],
    allow_credentials=True,
    allow_methods=["GET", "POST", "PUT", "DELETE"],
    allow_headers=["*"],
)

  2. Check preflight handling:

    # Test OPTIONS (preflight) request
    curl -X OPTIONS \
      -H "Origin: https://your-domain.com" \
      -H "Access-Control-Request-Method: POST" \
      http://localhost:8000/api/generate
    

Monitoring and Debugging

Logging Issues

Issue: "Logs not appearing"

Symptoms:

  • No log output
  • Missing error information
  • Silent failures

Solutions:

  1. Check log configuration:

LOG_LEVEL=DEBUG
LOG_FORMAT=json
STRUCTURED_LOGGING=true

  2. Verify log destinations:

    # Check if logging to file
    tail -f /var/log/hgcontent.log
    
    # Docker logs
    docker-compose logs -f cpm
    

  3. Test logging manually:

    import structlog
    logger = structlog.get_logger()
    logger.info("Test log message", extra_field="test")
    

Metrics Collection

Issue: "Metrics not available"

Symptoms:

  • Monitoring dashboards empty
  • Prometheus scraping failures
  • Missing performance data

Solutions:

  1. Enable metrics endpoints:

ENABLE_METRICS=true
METRICS_PORT=9090

  2. Check metrics format:

    curl http://localhost:9090/metrics
    

  3. Configure Prometheus:

    # prometheus.yml
    scrape_configs:
      - job_name: 'hgcontent'
        static_configs:
          - targets: ['localhost:9090']
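
For a quick sanity check of what Prometheus expects to scrape, the text exposition format can be served with the standard library alone (illustrative only; the counter name is made up, and production services would normally use a client library such as prometheus_client):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Dummy counter for illustration; a real service would track actual work
METRICS = {"hgcontent_jobs_processed_total": 0}

def render_metrics(metrics):
    """Render counters in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(metrics.items()):
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 9090), MetricsHandler).serve_forever()
```

With this running, `curl http://localhost:9090/metrics` should return the counter in a form Prometheus can scrape with the config above.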
    

Recovery Procedures

System Recovery

Complete System Reset (Development)

# Stop all services
docker-compose down

# Remove all data (CAUTION: DATA LOSS)
docker-compose down -v
docker system prune -f

# Rebuild and restart
docker-compose build --no-cache
docker-compose up -d

# Verify services
./scripts/health-check.sh

Database Recovery

# Backup current database
supabase db dump --file backup.sql

# Reset database (development)
supabase db reset

# Restore from backup if needed
psql "${DATABASE_URL}" < backup.sql

Data Recovery

Job Recovery

# Recover failed jobs
import asyncio
from apps.cpm.database import get_db_manager

async def recover_failed_jobs():
    db = get_db_manager()
    failed_jobs = await db.get_jobs_by_status('failed')

    for job in failed_jobs:
        # Reset job to queued status
        await db.update_job_status(job.id, 'queued')
        print(f"Recovered job {job.id}")

asyncio.run(recover_failed_jobs())

Cache Recovery

# Clear Redis cache
redis-cli -u "${REDIS_URL}" FLUSHDB

# Warm up cache
curl http://localhost:8000/api/cache/warm-up

Prevention Strategies

Monitoring Setup

Health Checks

#!/bin/bash
# Health check script
set -e

echo "Checking service health..."
curl -f http://localhost:8000/health >/dev/null
curl -f http://localhost:8001/health >/dev/null
curl -f http://localhost:3000/api/health >/dev/null

echo "✅ All services healthy"

Automated Monitoring

# docker-compose.yml
services:
  cpm:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3

Backup Strategies

Automated Backups

#!/bin/bash
# Backup script
DATE=$(date +%Y%m%d_%H%M%S)

# Database backup
supabase db dump --file "backup_${DATE}.sql"

# Environment backup
cp .env "env_backup_${DATE}"

# Upload to storage
aws s3 cp "backup_${DATE}.sql" s3://backups/hgcontent/

Update Procedures

Rolling Updates

# Update strategy for production
git pull origin main
docker-compose build cpm  # Build new image
docker-compose up -d --no-deps cpm  # Update only CPM
./scripts/health-check.sh  # Verify health

# If successful, update other services
docker-compose up -d --no-deps external-api
docker-compose up -d --no-deps frontend

Getting Help

Support Channels

  • Documentation: Check this troubleshooting guide and API docs
  • GitHub Issues: Report bugs and issues
  • Developer Guide: Setup and development guide
  • Community: Join discussions in GitHub Discussions

Debugging Tools

# Install debugging tools
pip install httpie
brew install jq  # or: apt-get install jq
npm install -g @httpie/cli

# API testing
http GET localhost:8000/api/jobs Authorization:"Bearer ${TOKEN}"

# Log analysis
docker-compose logs cpm | jq -r '.message'

Creating Bug Reports

When reporting issues, include:

  1. Environment details: OS, Docker version, service versions
  2. Steps to reproduce: exact commands and inputs used
  3. Expected vs actual behavior: what should happen vs what does happen
  4. Logs and errors: relevant log entries and error messages
  5. Configuration: relevant environment variables (redact secrets)

Bug Report Template:

## Environment
- OS: Ubuntu 20.04
- Docker: 20.10.x
- Services: CPM v1.0.0, External API v1.0.0

## Issue Description
Brief description of the problem

## Steps to Reproduce
1. Step one
2. Step two
3. Step three

## Expected Behavior
What should happen

## Actual Behavior
What actually happens

## Logs
[relevant log entries]
## Configuration
SUPABASE_URL=https://...
OPENAI_API_KEY=[REDACTED]

Remember to keep this troubleshooting guide updated as new issues are discovered and resolved!