
HG Content Generation System - Deployment & Operations Guide

Table of Contents

  1. Overview
  2. Environment Setup
  3. Railway Deployment (Python Services)
  4. Vercel Deployment (Next.js Frontend)
  5. Supabase Configuration
  6. Environment Variables & Secrets Management
  7. Monitoring & Logging Setup
  8. CI/CD Pipeline Configuration
  9. Backup & Disaster Recovery
  10. Scaling & Performance Optimization
  11. Troubleshooting Common Issues

Overview

The HG Content Generation System is a distributed application consisting of:

  • Frontend: Next.js 14 application (deployed on Vercel)
  • Backend Services: Python FastAPI services (deployed on Railway)
      • Content Production Module (CPM)
      • Instructions Module (IM)
      • Strategy Management Module (SMM)
  • Database: Supabase PostgreSQL with real-time features
  • Cache/Queue: Redis (managed by Railway)

Architecture Diagram

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Vercel Edge   │    │  Railway Cloud  │    │   Supabase      │
│                 │    │                 │    │                 │
│ ┌─────────────┐ │    │ ┌─────────────┐ │    │ ┌─────────────┐ │
│ │ Next.js App │ │    │ │ CPM Service │ │    │ │ PostgreSQL  │ │
│ │ (Frontend)  │ │◄──►│ │ (Python)    │ │◄──►│ │ Database    │ │
│ └─────────────┘ │    │ └─────────────┘ │    │ └─────────────┘ │
│                 │    │ ┌─────────────┐ │    │ ┌─────────────┐ │
│ ┌─────────────┐ │    │ │ IM Service  │ │    │ │ Auth        │ │
│ │ API Routes  │ │◄──►│ │ (Python)    │ │    │ │ Real-time   │ │
│ └─────────────┘ │    │ └─────────────┘ │    │ │ Storage     │ │
│                 │    │ ┌─────────────┐ │    │ └─────────────┘ │
└─────────────────┘    │ │ SMM Service │ │    └─────────────────┘
                       │ │ (Python)    │ │
                       │ └─────────────┘ │
                       │ ┌─────────────┐ │
                       │ │ Redis Cache │ │
                       │ └─────────────┘ │
                       └─────────────────┘

Environment Setup

Prerequisites

  1. Node.js 20.x (match .nvmrc) and pnpm for frontend development
  2. Python 3.12+ for backend services
  3. Supabase CLI for database management
  4. Railway CLI for backend deployment
  5. Vercel CLI for frontend deployment

Local Development Setup

# Clone the repository
git clone https://github.com/your-org/hg-content.git
cd hg-content

# Install frontend dependencies
pnpm install

# Set up Python virtual environment for backend services
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install Python dependencies
pip install -r requirements.txt
pip install -r apps/cpm/requirements.txt
pip install -r apps/im/requirements.txt

# Start Supabase locally
supabase start

# Copy environment files
cp .env.example .env
cp apps/frontend/.env.example apps/frontend/.env.local
cp apps/cpm/.env.example apps/cpm/.env
cp apps/im/.env.example apps/im/.env

Development Server Commands

# Start all services concurrently
pnpm dev

# Or start services individually:
# Frontend (Next.js)
cd apps/frontend && pnpm dev

# CPM Service
cd apps/cpm && uvicorn app:app --reload --port 8000

# IM Service  
cd apps/im && uvicorn app:app --reload --port 8001

# SMM Service (if running separately)
cd smm && python -m app.api

Railway Deployment (Python Services)

Railway is used to deploy all Python FastAPI services with managed PostgreSQL and Redis.

1. Initial Railway Setup

# Install Railway CLI
npm install -g @railway/cli

# Login to Railway
railway login

# Link project
railway link [PROJECT_ID]

2. Deploy Content Production Module (CPM)

# From project root
cd apps/cpm

# Create Railway service
railway create

# Set environment variables (see Environment Variables section)
railway env set SUPABASE_URL=your_supabase_url
railway env set SUPABASE_SERVICE_ROLE_KEY=your_service_key
railway env set OPENAI_API_KEY=your_openai_key
railway env set ANTHROPIC_API_KEY=your_anthropic_key
railway env set GOOGLE_API_KEY=your_google_key
railway env set GROQ_API_KEY=your_groq_key
railway env set REDIS_URL=your_railway_redis_url  # use the URL provided by the Railway Redis plugin, not localhost

# Deploy
railway deploy

3. Deploy Instructions Module (IM)

cd apps/im

# Create Railway service
railway create

# Set environment variables
railway env set SUPABASE_URL=your_supabase_url
railway env set SUPABASE_SERVICE_ROLE_KEY=your_service_key
railway env set OPENAI_API_KEY=your_openai_key

# Deploy
railway deploy

4. Deploy Strategy Management Module (SMM)

cd smm

# Create Railway service (if deploying separately)
railway create

# Set environment variables
railway env set SUPABASE_URL=your_supabase_url
railway env set SUPABASE_SERVICE_ROLE_KEY=your_service_key

# Deploy
railway deploy

5. Railway Service Configuration

Each service needs a railway.json configuration file:

{
  "$schema": "https://railway.app/railway.schema.json",
  "build": {
    "builder": "NIXPACKS"
  },
  "deploy": {
    "startCommand": "uvicorn app:app --host 0.0.0.0 --port $PORT",
    "restartPolicyType": "ON_FAILURE",
    "restartPolicyMaxRetries": 3,
    "healthcheckPath": "/health",
    "healthcheckTimeout": 30
  }
}

6. Railway Redis Setup

# Add Redis addon to your Railway project
railway plugin add redis

# Get Redis URL from Railway dashboard
railway env get REDIS_URL
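
Once the plugin is provisioned, each Python service reads REDIS_URL from its environment. A minimal connection helper might look like the sketch below (the module name redis_client.py is illustrative; it assumes the redis-py client already used elsewhere in this guide):

# redis_client.py -- illustrative sketch
import os
import redis

def get_redis() -> redis.Redis:
    """Return a Redis connection using the URL injected by Railway."""
    return redis.from_url(os.environ["REDIS_URL"], decode_responses=True)

# Usage
r = get_redis()
r.ping()  # raises if the URL is wrong or Redis is unreachable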

7. Railway Auto-deployment

Set up automatic deployments from GitHub:

  1. Go to Railway dashboard
  2. Connect your GitHub repository
  3. Configure automatic deployments on push to main branch
  4. Set up path-based triggers for each service

Vercel Deployment (Next.js Frontend)

Vercel hosts the Next.js frontend application and API routes.

1. Initial Vercel Setup

# Install Vercel CLI
npm install -g vercel

# Login to Vercel
vercel login

# From frontend directory
cd apps/frontend

# Deploy to Vercel
vercel --prod

2. Vercel Configuration

Create vercel.json in the frontend root:

{
  "framework": "nextjs",
  "buildCommand": "pnpm build",
  "outputDirectory": ".next",
  "installCommand": "pnpm install",
  "regions": ["iad1", "sfo1", "fra1"],
  "functions": {
    "app/api/**/*.ts": {
      "maxDuration": 30
    }
  },
  "headers": [
    {
      "source": "/api/(.*)",
      "headers": [
        {
          "key": "Access-Control-Allow-Origin",
          "value": "*"
        },
        {
          "key": "Access-Control-Allow-Methods",
          "value": "GET, POST, PUT, DELETE, OPTIONS"
        },
        {
          "key": "Access-Control-Allow-Headers",
          "value": "Content-Type, Authorization"
        }
      ]
    }
  ]
}

3. Vercel Environment Variables

Set environment variables in Vercel dashboard or via CLI:

# Set production environment variables
vercel env add NEXT_PUBLIC_SUPABASE_URL production
vercel env add NEXT_PUBLIC_SUPABASE_ANON_KEY production
vercel env add SUPABASE_SERVICE_KEY production
vercel env add CPM_SERVICE_URL production
vercel env add IM_SERVICE_URL production
vercel env add SMM_SERVICE_URL production

4. Vercel Auto-deployment

  1. Connect your GitHub repository to Vercel
  2. Configure automatic deployments on push to main
  3. Set up preview deployments for pull requests
  4. Configure path-based builds to only deploy when frontend changes

Supabase Configuration

1. Project Setup

  1. Create a new Supabase project at https://supabase.com
  2. Save your project URL and API keys
  3. Configure authentication settings

2. Database Schema Setup

# Initialize Supabase locally
supabase init

# Start local Supabase
supabase start

# Apply migrations
supabase db push

# Or apply specific migrations
supabase migration up

3. Production Database Setup

-- Enable Row Level Security
ALTER TABLE jobs ENABLE ROW LEVEL SECURITY;
ALTER TABLE clients ENABLE ROW LEVEL SECURITY;
ALTER TABLE strategies ENABLE ROW LEVEL SECURITY;
ALTER TABLE prompt_templates ENABLE ROW LEVEL SECURITY;

-- Create RLS policies
CREATE POLICY "Users can access their client's data" ON jobs
FOR ALL USING (client_id IN (
  SELECT id FROM clients WHERE user_id = auth.uid()
));

-- Create performance indexes
CREATE INDEX idx_jobs_client_status ON jobs(client_id, status);
CREATE INDEX idx_jobs_created_at ON jobs(created_at);
CREATE INDEX idx_clients_user_id ON clients(user_id);

4. Real-time Configuration

-- Enable real-time for job status updates
ALTER PUBLICATION supabase_realtime ADD TABLE jobs;
ALTER PUBLICATION supabase_realtime ADD TABLE clients;

5. Storage Configuration

-- Create storage bucket for file uploads
INSERT INTO storage.buckets (id, name, public) VALUES ('content-files', 'content-files', false);

-- Create storage policies
CREATE POLICY "Users can upload files" ON storage.objects
FOR INSERT WITH CHECK (bucket_id = 'content-files' AND auth.uid()::text = (storage.foldername(name))[1]);
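
With this policy, each user's files must live under a folder named after their auth.uid(). A hedged sketch of an upload using supabase-py's storage client (the helper and file names below are illustrative, not part of the existing codebase):

# upload_example.py -- illustrative sketch
import os
from supabase import create_client

supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_SERVICE_ROLE_KEY"])

def upload_content_file(user_id: str, local_path: str) -> None:
    """Store the file under the user's folder so the storage policy matches."""
    object_path = f"{user_id}/{os.path.basename(local_path)}"
    with open(local_path, "rb") as f:
        supabase.storage.from_("content-files").upload(object_path, f)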

6. Authentication Setup

Configure authentication in Supabase dashboard:

# supabase/config.toml
[auth]
enabled = true
site_url = "https://your-frontend-domain.vercel.app"
additional_redirect_urls = ["http://localhost:3000"]
jwt_expiry = 3600
enable_signup = true
enable_confirmations = true

[auth.email]
enable_signup = true
double_confirm_changes = true
enable_confirmations = true

Environment Variables & Secrets Management

Production Environment Variables

Vercel (Frontend)

# Supabase Configuration
NEXT_PUBLIC_SUPABASE_URL=https://xxxxx.supabase.co
NEXT_PUBLIC_SUPABASE_ANON_KEY=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...
SUPABASE_SERVICE_KEY=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...

# Backend Service URLs
CPM_SERVICE_URL=https://cpm-production.railway.app
IM_SERVICE_URL=https://im-production.railway.app
SMM_SERVICE_URL=https://smm-production.railway.app

# Analytics & Monitoring
NEXT_PUBLIC_POSTHOG_KEY=phc_xxxxxxxxxxxx
VERCEL_ANALYTICS_ID=xxxxxxxxxx

Railway (Backend Services)

# Database Configuration
SUPABASE_URL=https://xxxxx.supabase.co
SUPABASE_SERVICE_ROLE_KEY=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...

# Redis Configuration
REDIS_URL=redis://default:password@host:port

# LLM API Keys
OPENAI_API_KEY=sk-proj-xxxxxxxxxxxxxxxx
ANTHROPIC_API_KEY=sk-ant-api03-xxxxxxxxxxxxxxxx
GOOGLE_API_KEY=AIzaSyxxxxxxxxxxxxxxxxxx
GROQ_API_KEY=gsk_xxxxxxxxxxxxxxxx

# Inter-service URLs
IM_BASE_URL=https://im-production.railway.app
SMM_BASE_URL=https://smm-production.railway.app

# Monitoring & Logging
LOG_LEVEL=INFO
PROMETHEUS_PUSHGATEWAY_URL=https://prometheus.monitoring.service.com
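
The inter-service URLs are how the CPM service reaches its siblings. A short sketch of such a call (assuming httpx as the HTTP client; the helper itself is illustrative):

# inter_service.py -- illustrative sketch
import os
import httpx

IM_BASE_URL = os.environ["IM_BASE_URL"]

async def check_im_health() -> dict:
    """Call the IM service's /health endpoint using the configured base URL."""
    async with httpx.AsyncClient(timeout=10) as client:
        response = await client.get(f"{IM_BASE_URL}/health")
        response.raise_for_status()
        return response.json()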

Secrets Management Best Practices

  1. Never commit secrets to version control
  2. Use platform-specific secret management:
      • Vercel: Environment Variables dashboard
      • Railway: Environment Variables in project settings
      • Supabase: Project settings → API keys
  3. Rotate API keys regularly
  4. Use different keys for different environments
  5. Monitor API key usage and set alerts

Environment Variable Validation

Create validation scripts to ensure all required variables are set:

# scripts/validate_env.py
import os
import sys

REQUIRED_VARS = {
    'SUPABASE_URL': 'Supabase project URL',
    'SUPABASE_SERVICE_ROLE_KEY': 'Supabase service role key',
    'OPENAI_API_KEY': 'OpenAI API key',
    'REDIS_URL': 'Redis connection URL'
}

def validate_environment():
    missing_vars = []
    for var, description in REQUIRED_VARS.items():
        if not os.getenv(var):
            missing_vars.append(f"{var} ({description})")

    if missing_vars:
        print("Missing required environment variables:")
        for var in missing_vars:
            print(f"  - {var}")
        sys.exit(1)

    print("All required environment variables are set ✓")

if __name__ == "__main__":
    validate_environment()

Monitoring & Logging Setup

1. Structured Logging

Configure structured logging for all services:

# logging_config.py
import structlog
import logging
from datetime import datetime

def configure_logging():
    structlog.configure(
        processors=[
            structlog.stdlib.filter_by_level,
            structlog.stdlib.add_logger_name,
            structlog.stdlib.add_log_level,
            structlog.stdlib.PositionalArgumentsFormatter(),
            structlog.processors.TimeStamper(fmt="iso"),
            structlog.processors.StackInfoRenderer(),
            structlog.processors.format_exc_info,
            structlog.processors.UnicodeDecoder(),
            structlog.processors.JSONRenderer()
        ],
        context_class=dict,
        logger_factory=structlog.stdlib.LoggerFactory(),
        wrapper_class=structlog.stdlib.BoundLogger,
        cache_logger_on_first_use=True,
    )

# Usage in services
logger = structlog.get_logger(__name__)
logger.info("Content generation started", 
           job_id=job_id, 
           client_id=client_id, 
           provider="openai")

2. Health Checks

Implement comprehensive health checks for all services:

# health.py
import os
import asyncio
from datetime import datetime

import redis
from fastapi import APIRouter, HTTPException
from supabase import create_client

SUPABASE_URL = os.environ["SUPABASE_URL"]
SUPABASE_KEY = os.environ["SUPABASE_SERVICE_ROLE_KEY"]
REDIS_URL = os.environ["REDIS_URL"]

router = APIRouter()

@router.get("/health")
async def health_check():
    """Comprehensive health check endpoint"""
    health_status = {
        "status": "healthy",
        "timestamp": datetime.utcnow().isoformat(),
        "service": "cpm",
        "version": "1.0.0",
        "checks": {}
    }

    try:
        # Database health
        health_status["checks"]["database"] = await check_database()

        # Redis health
        health_status["checks"]["redis"] = await check_redis()

        # External API health
        health_status["checks"]["external_apis"] = await check_external_apis()

        # Overall status
        failed_checks = [k for k, v in health_status["checks"].items() if v["status"] != "healthy"]
        if failed_checks:
            health_status["status"] = "degraded"

        return health_status

    except Exception as e:
        health_status["status"] = "unhealthy"
        health_status["error"] = str(e)
        raise HTTPException(status_code=503, detail=health_status)

async def check_database():
    """Check database connectivity"""
    try:
        supabase = create_client(SUPABASE_URL, SUPABASE_KEY)
        result = supabase.table("jobs").select("id").limit(1).execute()
        return {"status": "healthy", "response_time": "5ms"}
    except Exception as e:
        return {"status": "unhealthy", "error": str(e)}

async def check_redis():
    """Check Redis connectivity"""
    try:
        r = redis.from_url(REDIS_URL)
        r.ping()
        return {"status": "healthy", "response_time": "2ms"}
    except Exception as e:
        return {"status": "unhealthy", "error": str(e)}

3. Application Metrics

Implement Prometheus metrics:

# metrics.py
from prometheus_client import Counter, Histogram, Gauge, start_http_server
import time

# Metrics
REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint', 'status'])
REQUEST_DURATION = Histogram('http_request_duration_seconds', 'HTTP request duration')
ACTIVE_JOBS = Gauge('active_jobs_total', 'Number of active content generation jobs')
GENERATION_SUCCESS_RATE = Counter('content_generation_success_total', 'Successful content generations')
GENERATION_ERROR_RATE = Counter('content_generation_errors_total', 'Failed content generations', ['error_type'])
COST_TRACKING = Counter('generation_cost_total', 'Total cost of content generation', ['provider'])

# Middleware to track metrics
async def metrics_middleware(request, call_next):
    start_time = time.time()

    response = await call_next(request)

    # Track request metrics
    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.url.path,
        status=response.status_code
    ).inc()

    REQUEST_DURATION.observe(time.time() - start_time)

    return response

# Start metrics server
start_http_server(8080)

4. Error Tracking

Configure error tracking with detailed context:

# error_tracking.py
import traceback
import structlog
from typing import Optional, Dict, Any

logger = structlog.get_logger(__name__)

class ErrorTracker:
    @staticmethod
    def capture_exception(
        error: Exception,
        context: Optional[Dict[str, Any]] = None,
        user_id: Optional[str] = None,
        job_id: Optional[str] = None
    ):
        """Capture and log exceptions with context"""
        error_data = {
            "error_type": type(error).__name__,
            "error_message": str(error),
            "traceback": traceback.format_exc(),
            "context": context or {},
            "user_id": user_id,
            "job_id": job_id
        }

        logger.error("Exception captured", **error_data)

        # Send to external error tracking service if configured
        # (e.g., Sentry, Rollbar, etc.)

        return error_data

# Usage
try:
    result = await generate_content(job_id, prompt)
except Exception as e:
    ErrorTracker.capture_exception(
        e, 
        context={"job_id": job_id, "provider": "openai"},
        user_id=user_id
    )
    raise

5. Performance Monitoring

Track performance metrics:

# performance.py
import time
import asyncio
from functools import wraps
from typing import Callable

import structlog

logger = structlog.get_logger(__name__)

def track_performance(operation_name: str):
    """Decorator to track operation performance"""
    def decorator(func: Callable):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            start_time = time.time()

            try:
                result = await func(*args, **kwargs)
                duration = time.time() - start_time

                logger.info(
                    "Operation completed",
                    operation=operation_name,
                    duration=duration,
                    status="success"
                )

                return result
            except Exception as e:
                duration = time.time() - start_time

                logger.error(
                    "Operation failed",
                    operation=operation_name,
                    duration=duration,
                    status="error",
                    error=str(e)
                )
                raise

        return wrapper
    return decorator

# Usage
@track_performance("content_generation")
async def generate_content(prompt: str, provider: str):
    # Content generation logic
    pass

CI/CD Pipeline Configuration

GitHub Actions Workflow

Create .github/workflows/deploy.yml:

name: Deploy to Production

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  NODE_VERSION: '20'
  PYTHON_VERSION: '3.12'

jobs:
  changes:
    runs-on: ubuntu-latest
    outputs:
      frontend: ${{ steps.changes.outputs.frontend }}
      cpm: ${{ steps.changes.outputs.cpm }}
      im: ${{ steps.changes.outputs.im }}
      smm: ${{ steps.changes.outputs.smm }}
    steps:
      - uses: actions/checkout@v4
      - uses: dorny/paths-filter@v2
        id: changes
        with:
          filters: |
            frontend:
              - 'apps/frontend/**'
            cpm:
              - 'apps/cpm/**'
            im:
              - 'apps/im/**'
            smm:
              - 'smm/**'

  test-frontend:
    needs: changes
    if: ${{ needs.changes.outputs.frontend == 'true' }}
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Install pnpm
        run: npm install -g pnpm

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: 'pnpm'

      - name: Install dependencies
        run: pnpm install

      - name: Run tests
        run: pnpm test --filter=frontend

      - name: Build application
        run: pnpm build --filter=frontend

  test-backend:
    needs: changes
    if: ${{ needs.changes.outputs.cpm == 'true' || needs.changes.outputs.im == 'true' || needs.changes.outputs.smm == 'true' }}
    runs-on: ubuntu-latest

    services:
      postgres:
        image: postgres:15
        env:
          POSTGRES_PASSWORD: postgres
          POSTGRES_DB: test_db
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5

      redis:
        image: redis:7
        options: >-
          --health-cmd "redis-cli ping"
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: ${{ env.PYTHON_VERSION }}

      - name: Install Python dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
          pip install -r apps/cpm/requirements.txt
          pip install -r apps/im/requirements.txt

      - name: Run CPM tests
        if: ${{ needs.changes.outputs.cpm == 'true' }}
        run: |
          cd apps/cpm
          pytest

      - name: Run IM tests
        if: ${{ needs.changes.outputs.im == 'true' }}
        run: |
          cd apps/im
          pytest

      - name: Run SMM tests
        if: ${{ needs.changes.outputs.smm == 'true' }}
        run: |
          cd smm
          pytest

  deploy-frontend:
    needs: [changes, test-frontend]
    if: ${{ github.ref == 'refs/heads/main' && needs.changes.outputs.frontend == 'true' }}
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Deploy to Vercel
        uses: amondnet/vercel-action@v25
        with:
          vercel-token: ${{ secrets.VERCEL_TOKEN }}
          vercel-org-id: ${{ secrets.VERCEL_ORG_ID }}
          vercel-project-id: ${{ secrets.VERCEL_PROJECT_ID }}
          working-directory: apps/frontend
          scope: ${{ secrets.VERCEL_ORG_ID }}

  deploy-backend:
    needs: [changes, test-backend]
    if: ${{ github.ref == 'refs/heads/main' }}
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Deploy CPM to Railway
        if: ${{ needs.changes.outputs.cpm == 'true' }}
        uses: railway-deploy/railway-deploy@v1
        with:
          railway-token: ${{ secrets.RAILWAY_TOKEN }}
          service: cpm-production

      - name: Deploy IM to Railway
        if: ${{ needs.changes.outputs.im == 'true' }}
        uses: railway-deploy/railway-deploy@v1
        with:
          railway-token: ${{ secrets.RAILWAY_TOKEN }}
          service: im-production

      - name: Deploy SMM to Railway
        if: ${{ needs.changes.outputs.smm == 'true' }}
        uses: railway-deploy/railway-deploy@v1
        with:
          railway-token: ${{ secrets.RAILWAY_TOKEN }}
          service: smm-production

  notify-deployment:
    needs: [deploy-frontend, deploy-backend]
    if: ${{ always() && github.ref == 'refs/heads/main' }}
    runs-on: ubuntu-latest

    steps:
      - name: Notify Slack
        uses: 8398a7/action-slack@v3
        with:
          status: ${{ job.status }}
          channel: '#deployments'
          webhook_url: ${{ secrets.SLACK_WEBHOOK }}

Pre-deployment Checks

Create scripts/pre-deploy-check.sh:

#!/bin/bash
set -e

echo "Running pre-deployment checks..."

# Check environment variables
echo "Validating environment variables..."
python scripts/validate_env.py

# Run database migrations
echo "Checking database migrations..."
supabase db diff --linked

# Run security scan
echo "Running security scan..."
safety check

# Check API endpoints
echo "Testing API endpoints..."
python scripts/health_check.py

# Validate configuration files
echo "Validating configuration..."
python -c "import json; json.load(open('apps/cpm/railway.json'))"
python -c "import json; json.load(open('apps/im/railway.json'))"

echo "Pre-deployment checks completed successfully ✓"

Backup & Disaster Recovery

1. Database Backup Strategy

Supabase Automatic Backups

Supabase provides automatic daily backups for Pro plans:

  • Point-in-time recovery up to 7 days
  • Automatic weekly backups retained for 4 weeks
  • Monthly backups retained for 3 months

Custom Backup Script

#!/bin/bash
# scripts/backup_database.sh

# Set variables
BACKUP_DIR="/backups/$(date +%Y-%m-%d)"
DB_HOST="db.xxx.supabase.co"
DB_PASSWORD="your_db_password"

# pg_dump reads the password from PGPASSWORD, so the script runs non-interactively
export PGPASSWORD="$DB_PASSWORD"

# Create backup directory
mkdir -p "$BACKUP_DIR"

# Backup the full database
pg_dump \
  --host="$DB_HOST" \
  --port=5432 \
  --username=postgres \
  --dbname=postgres \
  --format=custom \
  --file="$BACKUP_DIR/hg_content_$(date +%Y%m%d_%H%M%S).backup"

# Backup specific critical tables
pg_dump \
  --host="$DB_HOST" \
  --port=5432 \
  --username=postgres \
  --dbname=postgres \
  --table=jobs \
  --table=clients \
  --table=strategies \
  --format=custom \
  --file="$BACKUP_DIR/critical_tables_$(date +%Y%m%d_%H%M%S).backup"

# Compress backups
gzip "$BACKUP_DIR"/*.backup

# Upload to cloud storage (S3, GCS, etc.)
aws s3 cp "$BACKUP_DIR" s3://hg-content-backups/$(date +%Y-%m-%d)/ --recursive

echo "Backup completed successfully"

2. Application State Backup

# scripts/backup_app_state.py
import os
import json
import asyncio
from datetime import datetime

from supabase import create_client

SUPABASE_URL = os.environ["SUPABASE_URL"]
SUPABASE_KEY = os.environ["SUPABASE_SERVICE_ROLE_KEY"]

async def backup_application_state():
    """Backup critical application state"""
    supabase = create_client(SUPABASE_URL, SUPABASE_KEY)

    backup_data = {
        "timestamp": datetime.utcnow().isoformat(),
        "clients": [],
        "strategies": [],
        "active_jobs": [],
        "system_config": {}
    }

    # Backup clients
    clients = supabase.table("clients").select("*").execute()
    backup_data["clients"] = clients.data

    # Backup strategies
    strategies = supabase.table("strategies").select("*").execute()
    backup_data["strategies"] = strategies.data

    # Backup active jobs
    active_jobs = supabase.table("jobs").select("*").eq("status", "in_progress").execute()
    backup_data["active_jobs"] = active_jobs.data

    # Save backup
    backup_filename = f"app_state_backup_{datetime.utcnow().strftime('%Y%m%d_%H%M%S')}.json"
    with open(backup_filename, 'w') as f:
        json.dump(backup_data, f, indent=2)

    print(f"Application state backup saved to {backup_filename}")

if __name__ == "__main__":
    asyncio.run(backup_application_state())

3. Disaster Recovery Procedures

Recovery Playbook

  1. Database Recovery

    # Restore from Supabase backup
    supabase db restore --backup-id=your_backup_id
    
    # Or restore from custom backup
    pg_restore --host=db.xxx.supabase.co \
      --port=5432 \
      --username=postgres \
      --dbname=postgres \
      --clean \
      --create \
      hg_content_backup.backup
    

  2. Service Recovery

    # Redeploy all services
    railway deploy --service=cpm-production
    railway deploy --service=im-production
    railway deploy --service=smm-production
    
    # Verify deployments
    curl https://cpm-production.railway.app/health
    curl https://im-production.railway.app/health
    curl https://smm-production.railway.app/health
    

  3. Data Validation

    # scripts/validate_recovery.py
    import aiohttp

    SERVICES = [
        "https://cpm-production.railway.app/health",
        "https://im-production.railway.app/health",
        "https://smm-production.railway.app/health",
    ]

    async def validate_recovery():
        """Validate system after recovery: service health first, then spot-check data integrity and auth"""
        async with aiohttp.ClientSession() as session:
            for url in SERVICES:
                async with session.get(url, timeout=10) as response:
                    assert response.status == 200, f"{url} is unhealthy after recovery"
    

4. Recovery Testing

#!/bin/bash
# scripts/disaster_recovery_test.sh

echo "Starting disaster recovery test..."

# Create test environment
railway create --name "dr-test-$(date +%s)"

# Deploy services to test environment
railway deploy --service=dr-test-cpm
railway deploy --service=dr-test-im
railway deploy --service=dr-test-smm

# Restore data to test environment
pg_restore --host=test-db.xxx.supabase.co \
  --port=5432 \
  --username=postgres \
  --dbname=postgres \
  --clean \
  latest_backup.backup

# Run validation tests
python scripts/validate_recovery.py

# Clean up test environment
railway delete --service=dr-test-cpm --confirm
railway delete --service=dr-test-im --confirm
railway delete --service=dr-test-smm --confirm

echo "Disaster recovery test completed"

Scaling & Performance Optimization

1. Auto-scaling Configuration

Railway Auto-scaling

{
  "scaling": {
    "minReplicas": 1,
    "maxReplicas": 10,
    "targetCPUUtilization": 70,
    "targetMemoryUtilization": 80
  },
  "resources": {
    "cpu": "2000m",
    "memory": "4Gi"
  }
}

Vercel Scaling

Vercel automatically scales serverless functions based on demand:

  • Concurrent executions: up to 1,000 per region
  • Execution timeout: 30 seconds (configurable)
  • Memory: 1024 MB (configurable up to 3008 MB)

2. Database Performance Optimization

-- Performance indexes
CREATE INDEX CONCURRENTLY idx_jobs_client_status_created 
ON jobs(client_id, status, created_at DESC);

CREATE INDEX CONCURRENTLY idx_jobs_status_created 
ON jobs(status, created_at DESC) 
WHERE status IN ('pending', 'in_progress');

-- Partition large tables
CREATE TABLE jobs_2024 PARTITION OF jobs 
FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');

-- Tune memory settings (work_mem and effective_cache_size can be set per session;
-- shared_buffers is a server-level parameter and cannot be changed with SET)
SET work_mem = '256MB';
SET effective_cache_size = '4GB';
-- shared_buffers = '1GB'  -- configure at the instance level, not via SET

3. Caching Strategy

# caching.py
import redis
import json
from functools import wraps
from typing import Any, Optional

class CacheManager:
    def __init__(self, redis_url: str):
        self.redis = redis.from_url(redis_url)

    def cache_result(self, key: str, ttl: int = 3600):
        """Decorator to cache function results"""
        def decorator(func):
            @wraps(func)
            async def wrapper(*args, **kwargs):
                # Include call arguments so, e.g., different client_ids get separate cache entries
                cache_key = f"{func.__name__}:{key}:{args}:{sorted(kwargs.items())}"

                # Try to get from cache
                cached = self.redis.get(cache_key)
                if cached:
                    return json.loads(cached)

                # Execute function and cache result
                result = await func(*args, **kwargs)
                self.redis.setex(
                    cache_key, 
                    ttl, 
                    json.dumps(result, default=str)
                )

                return result
            return wrapper
        return decorator

# Usage
cache = CacheManager(REDIS_URL)

@cache.cache_result("strategies", ttl=1800)
async def get_client_strategies(client_id: str):
    # Expensive database query
    return strategies

4. Load Testing

# scripts/load_test.py
import asyncio
import aiohttp
import time
from concurrent.futures import ThreadPoolExecutor

async def load_test():
    """Load test the content generation endpoint"""

    async def make_request(session, request_id):
        start_time = time.time()
        try:
            async with session.post(
                "https://cpm-production.railway.app/generate",
                json={
                    "topic": f"Test content {request_id}",
                    "client_id": "test-client",
                    "content_type": "blog_post"
                }
            ) as response:
                duration = time.time() - start_time
                return {
                    "request_id": request_id,
                    "status": response.status,
                    "duration": duration
                }
        except Exception as e:
            return {
                "request_id": request_id,
                "status": "error",
                "error": str(e),
                "duration": time.time() - start_time
            }

    async with aiohttp.ClientSession() as session:
        # Run 100 concurrent requests
        tasks = [make_request(session, i) for i in range(100)]
        results = await asyncio.gather(*tasks)

    # Analyze results
    success_count = sum(1 for r in results if r["status"] == 200)
    avg_duration = sum(r["duration"] for r in results) / len(results)

    print(f"Load test results:")
    print(f"Success rate: {success_count/len(results)*100:.1f}%")
    print(f"Average response time: {avg_duration:.2f}s")

if __name__ == "__main__":
    asyncio.run(load_test())

5. Performance Monitoring

# performance_monitor.py
from datadog import initialize, statsd
import time
from functools import wraps

# Initialize monitoring
initialize()

def monitor_performance(metric_name: str):
    """Decorator to monitor function performance"""
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            start_time = time.time()

            try:
                result = await func(*args, **kwargs)
                duration = time.time() - start_time

                # Send success metrics
                statsd.histogram(f"{metric_name}.duration", duration)
                statsd.increment(f"{metric_name}.success")

                return result
            except Exception as e:
                duration = time.time() - start_time

                # Send error metrics
                statsd.histogram(f"{metric_name}.duration", duration)
                statsd.increment(f"{metric_name}.error")

                raise
        return wrapper
    return decorator

# Usage
@monitor_performance("content_generation")
async def generate_content(prompt: str):
    # Content generation logic
    pass

Troubleshooting Common Issues

1. Service Communication Issues

Problem: Services cannot communicate with each other

# Debug network connectivity
curl -v https://cpm-production.railway.app/health
curl -v https://im-production.railway.app/health

# Check DNS resolution
nslookup cpm-production.railway.app

# Verify SSL certificates
openssl s_client -connect cpm-production.railway.app:443 -servername cpm-production.railway.app

Solution:

  • Verify service URLs in environment variables
  • Check Railway service DNS settings
  • Ensure services are deployed and healthy

2. Database Connection Issues

Problem: "connection pool exhausted" errors

# Monitor connection pool
async def check_connection_pool():
    # Check active connections; assumes a Postgres function exposed over RPC that
    # returns rows from pg_stat_activity (the view itself cannot be called as an RPC)
    result = await supabase.rpc("pg_stat_activity").execute()
    active_connections = len(result.data)

    if active_connections > 80:  # example threshold; tune to your plan's connection limit
        logger.warning(f"High connection usage: {active_connections}")

Solution:

  • Enable connection pooling in Supabase
  • Implement connection pool monitoring
  • Add connection retry logic (see the sketch below)
  • Use async database clients
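
A minimal retry-with-exponential-backoff sketch for transient connection errors (the wrapped operation, attempt count, and delays are illustrative):

# retry.py -- illustrative sketch
import asyncio
import random

async def with_retries(operation, max_attempts: int = 5, base_delay: float = 0.5):
    """Run an async operation, retrying with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await operation()
        except Exception:
            if attempt == max_attempts:
                raise
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.1)
            await asyncio.sleep(delay)

# Usage (fetch_jobs is a hypothetical query function)
# result = await with_retries(lambda: fetch_jobs(client_id))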

3. Authentication Issues

Problem: JWT token validation failures

# Debug JWT tokens
import time
import jwt

def debug_jwt_token(token: str):
    try:
        # Decode without verification first
        decoded = jwt.decode(token, options={"verify_signature": False})
        print(f"Token payload: {decoded}")

        # Check expiration
        exp = decoded.get('exp', 0)
        current_time = time.time()

        if exp < current_time:
            print("Token has expired")
        else:
            print(f"Token valid for {exp - current_time} seconds")

    except Exception as e:
        print(f"Token decode error: {e}")

Solution:

  • Verify Supabase JWT secret configuration
  • Check token expiration times
  • Implement automatic token refresh (see the sketch below)
  • Validate CORS settings
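
A hedged sketch of proactive token refresh, assuming supabase-py's auth.refresh_session() is available on the client; the expiry margin is illustrative:

# token_refresh.py -- illustrative sketch; refresh_session() is an assumption about the auth client
import time
import jwt

def refresh_if_expiring(supabase, access_token: str, margin_seconds: int = 300) -> None:
    """Refresh the session when the access token is close to expiring."""
    claims = jwt.decode(access_token, options={"verify_signature": False})
    if claims.get("exp", 0) - time.time() < margin_seconds:
        supabase.auth.refresh_session()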

4. LLM Provider Issues

Problem: API rate limits or failures

# Implement circuit breaker pattern
class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=60):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failure_count = 0
        self.last_failure_time = None
        self.state = 'CLOSED'  # CLOSED, OPEN, HALF_OPEN

    async def call(self, func, *args, **kwargs):
        if self.state == 'OPEN':
            if time.time() - self.last_failure_time > self.reset_timeout:
                self.state = 'HALF_OPEN'
            else:
                raise Exception("Circuit breaker is OPEN")

        try:
            result = await func(*args, **kwargs)
            self.on_success()
            return result
        except Exception as e:
            self.on_failure()
            raise

    def on_success(self):
        self.failure_count = 0
        self.state = 'CLOSED'

    def on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()

        if self.failure_count >= self.failure_threshold:
            self.state = 'OPEN'

Solution:

  • Implement retry logic with exponential backoff (see the sketch below)
  • Use circuit breaker pattern
  • Set up fallback providers
  • Monitor API usage and quotas
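
A sketch combining backoff retries with a fallback provider order (the provider list and call_provider function are illustrative; real generation calls are provider-specific):

# provider_fallback.py -- illustrative sketch; call_provider is hypothetical
import asyncio

PROVIDER_ORDER = ["openai", "anthropic", "google", "groq"]

async def generate_with_fallback(prompt: str, call_provider, max_attempts: int = 3):
    """Try each provider in order, retrying transient failures with exponential backoff."""
    last_error = None
    for provider in PROVIDER_ORDER:
        for attempt in range(1, max_attempts + 1):
            try:
                return await call_provider(provider, prompt)
            except Exception as exc:  # rate limits, timeouts, provider outages
                last_error = exc
                await asyncio.sleep(2 ** (attempt - 1))
    raise RuntimeError(f"All providers failed: {last_error}")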

5. Memory and Performance Issues

Problem: High memory usage or slow responses

# Memory profiling
import gc

import psutil
import structlog

logger = structlog.get_logger(__name__)

def monitor_memory():
    process = psutil.Process()
    memory_info = process.memory_info()

    logger.info(
        "Memory usage",
        rss=memory_info.rss / 1024 / 1024,  # MB
        vms=memory_info.vms / 1024 / 1024,  # MB
        percent=process.memory_percent()
    )

    # Force garbage collection if memory is high
    if process.memory_percent() > 80:
        gc.collect()

Solution:

  • Monitor memory usage and implement alerts
  • Use memory profilers to identify leaks
  • Implement proper connection pooling
  • Add response caching for expensive operations

6. Deployment Issues

Problem: Railway deployment failures

# Check Railway logs
railway logs --follow

# Check service status
railway status

# Validate configuration
railway config

# Force redeploy
railway redeploy

Solution:

  • Check build logs for errors
  • Verify environment variables
  • Ensure proper start commands
  • Check resource limits

7. Database Performance Issues

Problem: Slow query performance

-- Enable query logging
ALTER SYSTEM SET log_statement = 'all';
ALTER SYSTEM SET log_min_duration_statement = 1000; -- Log queries > 1s

-- Analyze slow queries
SELECT query, mean_exec_time, calls 
FROM pg_stat_statements 
ORDER BY mean_exec_time DESC 
LIMIT 10;

-- Check missing indexes
SELECT schemaname, tablename, attname, n_distinct, correlation 
FROM pg_stats 
WHERE schemaname = 'public' 
ORDER BY n_distinct DESC;

Solution:

  • Add appropriate indexes
  • Optimize query patterns
  • Implement query result caching
  • Use connection pooling

8. Common Error Resolution

Create a troubleshooting script:

# scripts/troubleshoot.py
import os
import asyncio

import aiohttp
import redis
from supabase import create_client

SUPABASE_URL = os.environ["SUPABASE_URL"]
SUPABASE_KEY = os.environ["SUPABASE_SERVICE_ROLE_KEY"]
REDIS_URL = os.environ["REDIS_URL"]

async def run_diagnostics():
    """Run comprehensive system diagnostics"""
    print("🔍 Running system diagnostics...\n")

    # Check service health
    services = [
        ("CPM", "https://cpm-production.railway.app/health"),
        ("IM", "https://im-production.railway.app/health"),
        ("SMM", "https://smm-production.railway.app/health"),
        ("Frontend", "https://your-frontend.vercel.app/api/health")
    ]

    async with aiohttp.ClientSession() as session:
        for name, url in services:
            try:
                async with session.get(url, timeout=10) as response:
                    if response.status == 200:
                        print(f"✅ {name} service: Healthy")
                    else:
                        print(f"❌ {name} service: Unhealthy (Status: {response.status})")
            except Exception as e:
                print(f"❌ {name} service: Error - {e}")

    # Check database
    try:
        supabase = create_client(SUPABASE_URL, SUPABASE_KEY)
        supabase.table("jobs").select("id").limit(1).execute()
        print("✅ Database: Connected")
    except Exception as e:
        print(f"❌ Database: Error - {e}")

    # Check Redis
    try:
        r = redis.from_url(REDIS_URL)
        r.ping()
        print("✅ Redis: Connected")
    except Exception as e:
        print(f"❌ Redis: Error - {e}")

    print("\n🏁 Diagnostics complete")

if __name__ == "__main__":
    asyncio.run(run_diagnostics())

Quick Reference Commands

# Health checks
curl https://cpm-production.railway.app/health
curl https://im-production.railway.app/health
curl https://your-frontend.vercel.app/api/health

# Railway debugging
railway logs --follow --service=cpm-production
railway shell --service=cpm-production

# Vercel debugging
vercel logs --follow
vercel inspect

# Database debugging
supabase db inspect
supabase logs

# Performance monitoring
railway metrics --service=cpm-production

Conclusion

This deployment guide provides comprehensive instructions for deploying and operating the HG Content Generation System. Regular monitoring, proper backup procedures, and following the troubleshooting guidelines will ensure reliable operation of the system.

For additional support:

  • Check service status pages: Railway, Vercel, Supabase
  • Review application logs for specific error messages
  • Run the diagnostic script for quick health checks
  • Contact the development team for complex issues

Last Updated: August 2025
Version: 1.0.0