HG Content Generation System - Deployment & Operations Guide¶
Table of Contents¶
- Overview
- Environment Setup
- Railway Deployment (Python Services)
- Vercel Deployment (Next.js Frontend)
- Supabase Configuration
- Environment Variables & Secrets Management
- Monitoring & Logging Setup
- CI/CD Pipeline Configuration
- Backup & Disaster Recovery
- Scaling & Performance Optimization
- Troubleshooting Common Issues
Overview¶
The HG Content Generation System is a distributed application consisting of:

- Frontend: Next.js 14 application (deployed on Vercel)
- Backend Services: Python FastAPI services (deployed on Railway)
  - Content Production Module (CPM)
  - Instructions Module (IM)
  - Strategy Management Module (SMM)
- Database: Supabase PostgreSQL with real-time features
- Cache/Queue: Redis (managed by Railway)
Architecture Diagram¶
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Vercel Edge │ │ Railway Cloud │ │ Supabase │
│ │ │ │ │ │
│ ┌─────────────┐ │ │ ┌─────────────┐ │ │ ┌─────────────┐ │
│ │ Next.js App │ │ │ │ CPM Service │ │ │ │ PostgreSQL │ │
│ │ (Frontend) │ │◄──►│ │ (Python) │ │◄──►│ │ Database │ │
│ └─────────────┘ │ │ └─────────────┘ │ │ └─────────────┘ │
│ │ │ ┌─────────────┐ │ │ ┌─────────────┐ │
│ ┌─────────────┐ │ │ │ IM Service │ │ │ │ Auth │ │
│ │ API Routes │ │◄──►│ │ (Python) │ │ │ │ Real-time │ │
│ └─────────────┘ │ │ └─────────────┘ │ │ │ Storage │ │
│ │ │ ┌─────────────┐ │ │ └─────────────┘ │
└─────────────────┘ │ │ SMM Service │ │ └─────────────────┘
│ │ (Python) │ │
│ └─────────────┘ │
│ ┌─────────────┐ │
│ │ Redis Cache │ │
│ └─────────────┘ │
└─────────────────┘
Environment Setup¶
Prerequisites¶
- Node.js 20.x (match .nvmrc) and pnpm for frontend development
- Python 3.12+ for backend services
- Supabase CLI for database management
- Railway CLI for backend deployment
- Vercel CLI for frontend deployment
Local Development Setup¶
# Clone the repository
git clone https://github.com/your-org/hg-content.git
cd hg-content
# Install frontend dependencies
pnpm install
# Set up Python virtual environment for backend services
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install Python dependencies
pip install -r requirements.txt
pip install -r apps/cpm/requirements.txt
pip install -r apps/im/requirements.txt
# Start Supabase locally
supabase start
# Copy environment files
cp .env.example .env
cp apps/frontend/.env.example apps/frontend/.env.local
cp apps/cpm/.env.example apps/cpm/.env
cp apps/im/.env.example apps/im/.env
Development Server Commands¶
# Start all services concurrently
pnpm dev
# Or start services individually:
# Frontend (Next.js)
cd apps/frontend && pnpm dev
# CPM Service
cd apps/cpm && uvicorn app:app --reload --port 8000
# IM Service
cd apps/im && uvicorn app:app --reload --port 8001
# SMM Service (if running separately)
cd smm && python -m app.api
Railway Deployment (Python Services)¶
Railway is used to deploy all Python FastAPI services with managed PostgreSQL and Redis.
1. Initial Railway Setup¶
# Install Railway CLI
npm install -g @railway/cli
# Login to Railway
railway login
# Link project
railway link [PROJECT_ID]
2. Deploy Content Production Module (CPM)¶
# From project root
cd apps/cpm
# Create Railway service
railway create
# Set environment variables (see Environment Variables section)
railway env set SUPABASE_URL=your_supabase_url
railway env set SUPABASE_SERVICE_ROLE_KEY=your_service_key
railway env set OPENAI_API_KEY=your_openai_key
railway env set ANTHROPIC_API_KEY=your_anthropic_key
railway env set GOOGLE_API_KEY=your_google_key
railway env set GROQ_API_KEY=your_groq_key
railway env set REDIS_URL=your_railway_redis_url  # use the Redis plugin URL from the Railway dashboard, not localhost
# Deploy
railway deploy
3. Deploy Instructions Module (IM)¶
cd apps/im
# Create Railway service
railway create
# Set environment variables
railway env set SUPABASE_URL=your_supabase_url
railway env set SUPABASE_SERVICE_ROLE_KEY=your_service_key
railway env set OPENAI_API_KEY=your_openai_key
# Deploy
railway deploy
4. Deploy Strategy Management Module (SMM)¶
cd smm
# Create Railway service (if deploying separately)
railway create
# Set environment variables
railway env set SUPABASE_URL=your_supabase_url
railway env set SUPABASE_SERVICE_ROLE_KEY=your_service_key
# Deploy
railway deploy
5. Railway Service Configuration¶
Each service needs a railway.json configuration file:
{
"$schema": "https://railway.app/railway.schema.json",
"build": {
"builder": "NIXPACKS"
},
"deploy": {
"startCommand": "uvicorn app:app --host 0.0.0.0 --port $PORT",
"restartPolicyType": "ON_FAILURE",
"restartPolicyMaxRetries": 3,
"healthcheckPath": "/health",
"healthcheckTimeout": 30
}
}
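The startCommand above assumes each service exposes a FastAPI instance named app in app.py and serves the health check at /health. A minimal sketch of that layout (the real services add their routers, settings, and middleware on top of this):
# app.py (sketch) - minimal layout expected by the Railway start command
from fastapi import FastAPI

app = FastAPI(title="CPM Service")

@app.get("/health")
async def health():
    # Target of railway.json's healthcheckPath
    return {"status": "healthy", "service": "cpm"}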
6. Railway Redis Setup¶
# Add Redis addon to your Railway project
railway plugin add redis
# Get Redis URL from Railway dashboard
railway env get REDIS_URL
7. Railway Auto-deployment¶
Set up automatic deployments from GitHub:
- Go to Railway dashboard
- Connect your GitHub repository
- Configure automatic deployments on push to the main branch
- Set up path-based triggers for each service
Vercel Deployment (Next.js Frontend)¶
Vercel hosts the Next.js frontend application and API routes.
1. Initial Vercel Setup¶
# Install Vercel CLI
npm install -g vercel
# Login to Vercel
vercel login
# From frontend directory
cd apps/frontend
# Deploy to Vercel
vercel --prod
2. Vercel Configuration¶
Create vercel.json in the frontend root:
{
"framework": "nextjs",
"buildCommand": "pnpm build",
"outputDirectory": ".next",
"installCommand": "pnpm install",
"regions": ["iad1", "sfo1", "fra1"],
"functions": {
"app/api/**/*.ts": {
"maxDuration": 30
}
},
"headers": [
{
"source": "/api/(.*)",
"headers": [
{
"key": "Access-Control-Allow-Origin",
"value": "*"
},
{
"key": "Access-Control-Allow-Methods",
"value": "GET, POST, PUT, DELETE, OPTIONS"
},
{
"key": "Access-Control-Allow-Headers",
"value": "Content-Type, Authorization"
}
]
}
]
}
3. Vercel Environment Variables¶
Set environment variables in Vercel dashboard or via CLI:
# Set production environment variables
vercel env add NEXT_PUBLIC_SUPABASE_URL production
vercel env add NEXT_PUBLIC_SUPABASE_ANON_KEY production
vercel env add SUPABASE_SERVICE_KEY production
vercel env add CPM_SERVICE_URL production
vercel env add IM_SERVICE_URL production
vercel env add SMM_SERVICE_URL production
4. Vercel Auto-deployment¶
- Connect your GitHub repository to Vercel
- Configure automatic deployments on push to main
- Set up preview deployments for pull requests
- Configure path-based builds to only deploy when frontend changes
Supabase Configuration¶
1. Project Setup¶
- Create a new Supabase project at https://supabase.com
- Save your project URL and API keys
- Configure authentication settings
2. Database Schema Setup¶
# Initialize Supabase locally
supabase init
# Start local Supabase
supabase start
# Apply migrations
supabase db push
# Or apply specific migrations
supabase migration up
3. Production Database Setup¶
-- Enable Row Level Security
ALTER TABLE jobs ENABLE ROW LEVEL SECURITY;
ALTER TABLE clients ENABLE ROW LEVEL SECURITY;
ALTER TABLE strategies ENABLE ROW LEVEL SECURITY;
ALTER TABLE prompt_templates ENABLE ROW LEVEL SECURITY;
-- Create RLS policies
CREATE POLICY "Users can access their client's data" ON jobs
FOR ALL USING (client_id IN (
SELECT id FROM clients WHERE user_id = auth.uid()
));
-- Create performance indexes
CREATE INDEX idx_jobs_client_status ON jobs(client_id, status);
CREATE INDEX idx_jobs_created_at ON jobs(created_at);
CREATE INDEX idx_clients_user_id ON clients(user_id);
4. Real-time Configuration¶
-- Enable real-time for job status updates
ALTER PUBLICATION supabase_realtime ADD TABLE jobs;
ALTER PUBLICATION supabase_realtime ADD TABLE clients;
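With jobs in the supabase_realtime publication, any status change written by a backend service is pushed to subscribed frontend clients. A sketch of the write side using supabase-py (mark_job_completed is illustrative):
# Sketch: status updates written this way are broadcast to realtime subscribers
import os
from supabase import create_client

supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_SERVICE_ROLE_KEY"])

def mark_job_completed(job_id: str) -> None:
    supabase.table("jobs").update({"status": "completed"}).eq("id", job_id).execute()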
5. Storage Configuration¶
-- Create storage bucket for file uploads
INSERT INTO storage.buckets (id, name, public) VALUES ('content-files', 'content-files', false);
-- Create storage policies
CREATE POLICY "Users can upload files" ON storage.objects
FOR INSERT WITH CHECK (bucket_id = 'content-files' AND auth.uid()::text = (storage.foldername(name))[1]);
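The policy keys access to the first folder segment of the object path, so every upload is expected to live under a per-user prefix. A sketch of an upload that follows that layout (note that a service-role client bypasses RLS; end users upload through the frontend with their own session):
# Sketch: upload to content-files under a per-user folder so the
# (storage.foldername(name))[1] check in the policy can match auth.uid()
import os
from supabase import create_client

supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_SERVICE_ROLE_KEY"])

def upload_user_file(user_id: str, filename: str, data: bytes) -> None:
    path = f"{user_id}/{filename}"  # first segment must be the user's id
    supabase.storage.from_("content-files").upload(path, data)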
6. Authentication Setup¶
Configure authentication in Supabase dashboard:
# supabase/config.toml
[auth]
enabled = true
site_url = "https://your-frontend-domain.vercel.app"
additional_redirect_urls = ["http://localhost:3000"]
jwt_expiry = 3600
enable_signup = true
enable_confirmations = true
[auth.email]
enable_signup = true
double_confirm_changes = true
enable_confirmations = true
Environment Variables & Secrets Management¶
Production Environment Variables¶
Vercel (Frontend)¶
# Supabase Configuration
NEXT_PUBLIC_SUPABASE_URL=https://xxxxx.supabase.co
NEXT_PUBLIC_SUPABASE_ANON_KEY=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...
SUPABASE_SERVICE_KEY=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...
# Backend Service URLs
CPM_SERVICE_URL=https://cpm-production.railway.app
IM_SERVICE_URL=https://im-production.railway.app
SMM_SERVICE_URL=https://smm-production.railway.app
# Analytics & Monitoring
NEXT_PUBLIC_POSTHOG_KEY=phc_xxxxxxxxxxxx
VERCEL_ANALYTICS_ID=xxxxxxxxxx
Railway (Backend Services)¶
# Database Configuration
SUPABASE_URL=https://xxxxx.supabase.co
SUPABASE_SERVICE_ROLE_KEY=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...
# Redis Configuration
REDIS_URL=redis://default:password@host:port
# LLM API Keys
OPENAI_API_KEY=sk-proj-xxxxxxxxxxxxxxxx
ANTHROPIC_API_KEY=sk-ant-api03-xxxxxxxxxxxxxxxx
GOOGLE_API_KEY=AIzaSyxxxxxxxxxxxxxxxxxx
GROQ_API_KEY=gsk_xxxxxxxxxxxxxxxx
# Inter-service URLs
IM_BASE_URL=https://im-production.railway.app
SMM_BASE_URL=https://smm-production.railway.app
# Monitoring & Logging
LOG_LEVEL=INFO
PROMETHEUS_PUSHGATEWAY_URL=https://prometheus.monitoring.service.com
Secrets Management Best Practices¶
- Never commit secrets to version control
- Use platform-specific secret management:
- Vercel: Environment Variables dashboard
- Railway: Environment Variables in project settings
- Supabase: Project settings → API keys
- Rotate API keys regularly
- Use different keys for different environments
- Monitor API key usage and set alerts
Environment Variable Validation¶
Create validation scripts to ensure all required variables are set:
# scripts/validate_env.py
import os
import sys
REQUIRED_VARS = {
'SUPABASE_URL': 'Supabase project URL',
'SUPABASE_SERVICE_ROLE_KEY': 'Supabase service role key',
'OPENAI_API_KEY': 'OpenAI API key',
'REDIS_URL': 'Redis connection URL'
}
def validate_environment():
missing_vars = []
for var, description in REQUIRED_VARS.items():
if not os.getenv(var):
missing_vars.append(f"{var} ({description})")
if missing_vars:
print("Missing required environment variables:")
for var in missing_vars:
print(f" - {var}")
sys.exit(1)
print("All required environment variables are set ✓")
if __name__ == "__main__":
validate_environment()
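The same check can run at service start-up so a misconfigured deployment fails fast instead of erroring on the first request. A minimal sketch (the import path assumes the script is importable from the service's working directory):
# Sketch: validate configuration before the FastAPI app starts serving traffic
from fastapi import FastAPI

from scripts.validate_env import validate_environment

validate_environment()  # exits with a readable message if anything is missing

app = FastAPI()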
Monitoring & Logging Setup¶
1. Structured Logging¶
Configure structured logging for all services:
# logging_config.py
import structlog
import logging
from datetime import datetime
def configure_logging():
structlog.configure(
processors=[
structlog.stdlib.filter_by_level,
structlog.stdlib.add_logger_name,
structlog.stdlib.add_log_level,
structlog.stdlib.PositionalArgumentsFormatter(),
structlog.processors.TimeStamper(fmt="iso"),
structlog.processors.StackInfoRenderer(),
structlog.processors.format_exc_info,
structlog.processors.UnicodeDecoder(),
structlog.processors.JSONRenderer()
],
context_class=dict,
logger_factory=structlog.stdlib.LoggerFactory(),
wrapper_class=structlog.stdlib.BoundLogger,
cache_logger_on_first_use=True,
)
# Usage in services
logger = structlog.get_logger(__name__)
logger.info("Content generation started",
job_id=job_id,
client_id=client_id,
provider="openai")
2. Health Checks¶
Implement comprehensive health checks for all services:
# health.py
import os
from datetime import datetime

import redis
from fastapi import APIRouter, HTTPException
from supabase import create_client

SUPABASE_URL = os.environ["SUPABASE_URL"]
SUPABASE_KEY = os.environ["SUPABASE_SERVICE_ROLE_KEY"]
REDIS_URL = os.environ["REDIS_URL"]

router = APIRouter()
@router.get("/health")
async def health_check():
"""Comprehensive health check endpoint"""
health_status = {
"status": "healthy",
"timestamp": datetime.utcnow().isoformat(),
"service": "cpm",
"version": "1.0.0",
"checks": {}
}
try:
# Database health
health_status["checks"]["database"] = await check_database()
# Redis health
health_status["checks"]["redis"] = await check_redis()
# External API health
health_status["checks"]["external_apis"] = await check_external_apis()
# Overall status
failed_checks = [k for k, v in health_status["checks"].items() if v["status"] != "healthy"]
if failed_checks:
health_status["status"] = "degraded"
return health_status
except Exception as e:
health_status["status"] = "unhealthy"
health_status["error"] = str(e)
raise HTTPException(status_code=503, detail=health_status)
async def check_database():
"""Check database connectivity"""
try:
supabase = create_client(SUPABASE_URL, SUPABASE_KEY)
result = supabase.table("jobs").select("id").limit(1).execute()
return {"status": "healthy", "response_time": "5ms"}
except Exception as e:
return {"status": "unhealthy", "error": str(e)}
async def check_redis():
"""Check Redis connectivity"""
try:
r = redis.from_url(REDIS_URL)
r.ping()
return {"status": "healthy", "response_time": "2ms"}
except Exception as e:
return {"status": "unhealthy", "error": str(e)}
3. Application Metrics¶
Implement Prometheus metrics:
# metrics.py
from prometheus_client import Counter, Histogram, Gauge, start_http_server
import time
# Metrics
REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint', 'status'])
REQUEST_DURATION = Histogram('http_request_duration_seconds', 'HTTP request duration')
ACTIVE_JOBS = Gauge('active_jobs_total', 'Number of active content generation jobs')
GENERATION_SUCCESS_RATE = Counter('content_generation_success_total', 'Successful content generations')
GENERATION_ERROR_RATE = Counter('content_generation_errors_total', 'Failed content generations', ['error_type'])
COST_TRACKING = Counter('generation_cost_total', 'Total cost of content generation', ['provider'])
# Middleware to track metrics
async def metrics_middleware(request, call_next):
start_time = time.time()
response = await call_next(request)
# Track request metrics
REQUEST_COUNT.labels(
method=request.method,
endpoint=request.url.path,
status=response.status_code
).inc()
REQUEST_DURATION.observe(time.time() - start_time)
return response
# Start metrics server
start_http_server(8080)
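To record these metrics the middleware must be registered on the FastAPI app. A minimal sketch, assuming the service entrypoint is app.py as in the Railway start command:
# app.py (sketch) - wire the metrics middleware into the service
from fastapi import FastAPI

from metrics import metrics_middleware

app = FastAPI()

# Every request now updates the Prometheus counters and histogram
app.middleware("http")(metrics_middleware)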
4. Error Tracking¶
Configure error tracking with detailed context:
# error_tracking.py
import traceback
import structlog
from typing import Optional, Dict, Any
logger = structlog.get_logger(__name__)
class ErrorTracker:
@staticmethod
def capture_exception(
error: Exception,
context: Optional[Dict[str, Any]] = None,
user_id: Optional[str] = None,
job_id: Optional[str] = None
):
"""Capture and log exceptions with context"""
error_data = {
"error_type": type(error).__name__,
"error_message": str(error),
"traceback": traceback.format_exc(),
"context": context or {},
"user_id": user_id,
"job_id": job_id
}
logger.error("Exception captured", **error_data)
# Send to external error tracking service if configured
# (e.g., Sentry, Rollbar, etc.)
return error_data
# Usage
try:
result = await generate_content(job_id, prompt)
except Exception as e:
ErrorTracker.capture_exception(
e,
context={"job_id": job_id, "provider": "openai"},
user_id=user_id
)
raise
5. Performance Monitoring¶
Track performance metrics:
# performance.py
import time
from functools import wraps
from typing import Callable

import structlog

logger = structlog.get_logger(__name__)
def track_performance(operation_name: str):
"""Decorator to track operation performance"""
def decorator(func: Callable):
@wraps(func)
async def wrapper(*args, **kwargs):
start_time = time.time()
try:
result = await func(*args, **kwargs)
duration = time.time() - start_time
logger.info(
"Operation completed",
operation=operation_name,
duration=duration,
status="success"
)
return result
except Exception as e:
duration = time.time() - start_time
logger.error(
"Operation failed",
operation=operation_name,
duration=duration,
status="error",
error=str(e)
)
raise
return wrapper
return decorator
# Usage
@track_performance("content_generation")
async def generate_content(prompt: str, provider: str):
# Content generation logic
pass
CI/CD Pipeline Configuration¶
GitHub Actions Workflow¶
Create .github/workflows/deploy.yml:
name: Deploy to Production
on:
push:
branches: [main]
pull_request:
branches: [main]
env:
  NODE_VERSION: '20'
PYTHON_VERSION: '3.12'
jobs:
changes:
runs-on: ubuntu-latest
outputs:
frontend: ${{ steps.changes.outputs.frontend }}
cpm: ${{ steps.changes.outputs.cpm }}
im: ${{ steps.changes.outputs.im }}
smm: ${{ steps.changes.outputs.smm }}
steps:
- uses: actions/checkout@v4
- uses: dorny/paths-filter@v2
id: changes
with:
filters: |
frontend:
- 'apps/frontend/**'
cpm:
- 'apps/cpm/**'
im:
- 'apps/im/**'
smm:
- 'smm/**'
test-frontend:
needs: changes
if: ${{ needs.changes.outputs.frontend == 'true' }}
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
      - name: Install pnpm
        run: npm install -g pnpm
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: 'pnpm'
- name: Install dependencies
run: pnpm install
- name: Run tests
run: pnpm test --filter=frontend
- name: Build application
run: pnpm build --filter=frontend
test-backend:
needs: changes
if: ${{ needs.changes.outputs.cpm == 'true' || needs.changes.outputs.im == 'true' || needs.changes.outputs.smm == 'true' }}
runs-on: ubuntu-latest
services:
postgres:
image: postgres:15
env:
POSTGRES_PASSWORD: postgres
POSTGRES_DB: test_db
options: >-
--health-cmd pg_isready
--health-interval 10s
--health-timeout 5s
--health-retries 5
redis:
image: redis:7
options: >-
--health-cmd "redis-cli ping"
--health-interval 10s
--health-timeout 5s
--health-retries 5
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Setup Python
uses: actions/setup-python@v4
with:
python-version: ${{ env.PYTHON_VERSION }}
- name: Install Python dependencies
run: |
python -m pip install --upgrade pip
pip install -r requirements.txt
pip install -r apps/cpm/requirements.txt
pip install -r apps/im/requirements.txt
- name: Run CPM tests
if: ${{ needs.changes.outputs.cpm == 'true' }}
run: |
cd apps/cpm
pytest
- name: Run IM tests
if: ${{ needs.changes.outputs.im == 'true' }}
run: |
cd apps/im
pytest
- name: Run SMM tests
if: ${{ needs.changes.outputs.smm == 'true' }}
run: |
cd smm
pytest
deploy-frontend:
needs: [changes, test-frontend]
if: ${{ github.ref == 'refs/heads/main' && needs.changes.outputs.frontend == 'true' }}
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Deploy to Vercel
uses: amondnet/vercel-action@v25
with:
vercel-token: ${{ secrets.VERCEL_TOKEN }}
vercel-org-id: ${{ secrets.VERCEL_ORG_ID }}
vercel-project-id: ${{ secrets.VERCEL_PROJECT_ID }}
working-directory: apps/frontend
scope: ${{ secrets.VERCEL_ORG_ID }}
deploy-backend:
needs: [changes, test-backend]
if: ${{ github.ref == 'refs/heads/main' }}
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Deploy CPM to Railway
if: ${{ needs.changes.outputs.cpm == 'true' }}
uses: railway-deploy/railway-deploy@v1
with:
railway-token: ${{ secrets.RAILWAY_TOKEN }}
service: cpm-production
- name: Deploy IM to Railway
if: ${{ needs.changes.outputs.im == 'true' }}
uses: railway-deploy/railway-deploy@v1
with:
railway-token: ${{ secrets.RAILWAY_TOKEN }}
service: im-production
- name: Deploy SMM to Railway
if: ${{ needs.changes.outputs.smm == 'true' }}
uses: railway-deploy/railway-deploy@v1
with:
railway-token: ${{ secrets.RAILWAY_TOKEN }}
service: smm-production
notify-deployment:
needs: [deploy-frontend, deploy-backend]
if: ${{ always() && github.ref == 'refs/heads/main' }}
runs-on: ubuntu-latest
steps:
- name: Notify Slack
uses: 8398a7/action-slack@v3
with:
status: ${{ job.status }}
channel: '#deployments'
webhook_url: ${{ secrets.SLACK_WEBHOOK }}
Pre-deployment Checks¶
Create scripts/pre-deploy-check.sh:
#!/bin/bash
set -e
echo "Running pre-deployment checks..."
# Check environment variables
echo "Validating environment variables..."
python scripts/validate_env.py
# Run database migrations
echo "Checking database migrations..."
supabase db diff --linked
# Run security scan
echo "Running security scan..."
safety check
# Check API endpoints
echo "Testing API endpoints..."
python scripts/health_check.py
# Validate configuration files
echo "Validating configuration..."
python -c "import json; json.load(open('apps/cpm/railway.json'))"
python -c "import json; json.load(open('apps/im/railway.json'))"
echo "Pre-deployment checks completed successfully ✓"
Backup & Disaster Recovery¶
1. Database Backup Strategy¶
Supabase Automatic Backups¶
Supabase provides automatic daily backups for Pro plans:

- Point-in-time recovery up to 7 days
- Automatic weekly backups retained for 4 weeks
- Monthly backups retained for 3 months
Custom Backup Script¶
#!/bin/bash
# scripts/backup_database.sh
# Set variables
BACKUP_DIR="/backups/$(date +%Y-%m-%d)"
DB_HOST="db.xxx.supabase.co"
DB_PASSWORD="your_db_password"

# Create backup directory
mkdir -p "$BACKUP_DIR"

# Back up the full database (PGPASSWORD avoids the interactive --password prompt)
PGPASSWORD="$DB_PASSWORD" pg_dump \
  --host="$DB_HOST" \
  --port=5432 \
  --username=postgres \
  --dbname=postgres \
  --format=custom \
  --file="$BACKUP_DIR/hg_content_$(date +%Y%m%d_%H%M%S).backup"

# Back up specific critical tables
PGPASSWORD="$DB_PASSWORD" pg_dump \
  --host="$DB_HOST" \
  --port=5432 \
  --username=postgres \
  --dbname=postgres \
  --table=jobs \
  --table=clients \
  --table=strategies \
  --format=custom \
  --file="$BACKUP_DIR/critical_tables_$(date +%Y%m%d_%H%M%S).backup"
# Compress backups
gzip "$BACKUP_DIR"/*.backup
# Upload to cloud storage (S3, GCS, etc.)
aws s3 cp "$BACKUP_DIR" s3://hg-content-backups/$(date +%Y-%m-%d)/ --recursive
echo "Backup completed successfully"
2. Application State Backup¶
# scripts/backup_app_state.py
import asyncio
import json
import os
from datetime import datetime

from supabase import create_client

SUPABASE_URL = os.environ["SUPABASE_URL"]
SUPABASE_KEY = os.environ["SUPABASE_SERVICE_ROLE_KEY"]
async def backup_application_state():
"""Backup critical application state"""
supabase = create_client(SUPABASE_URL, SUPABASE_KEY)
backup_data = {
"timestamp": datetime.utcnow().isoformat(),
"clients": [],
"strategies": [],
"active_jobs": [],
"system_config": {}
}
# Backup clients
clients = supabase.table("clients").select("*").execute()
backup_data["clients"] = clients.data
# Backup strategies
strategies = supabase.table("strategies").select("*").execute()
backup_data["strategies"] = strategies.data
# Backup active jobs
active_jobs = supabase.table("jobs").select("*").eq("status", "in_progress").execute()
backup_data["active_jobs"] = active_jobs.data
# Save backup
backup_filename = f"app_state_backup_{datetime.utcnow().strftime('%Y%m%d_%H%M%S')}.json"
with open(backup_filename, 'w') as f:
json.dump(backup_data, f, indent=2)
print(f"Application state backup saved to {backup_filename}")
if __name__ == "__main__":
asyncio.run(backup_application_state())
3. Disaster Recovery Procedures¶
Recovery Playbook¶
- Database Recovery: restore the most recent database backup (see the backup scripts above) and apply any pending migrations with supabase migration up.
- Service Recovery:
  # Redeploy all services
  railway deploy --service=cpm-production
  railway deploy --service=im-production
  railway deploy --service=smm-production
  # Verify deployments
  curl https://cpm-production.railway.app/health
  curl https://im-production.railway.app/health
  curl https://smm-production.railway.app/health
- Data Validation: confirm that critical tables and row counts are intact after the restore (see the validation sketch below).
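A sketch of the data-validation step, comparing row counts of the critical tables against a baseline captured before the incident (the baseline file name is illustrative; the script name matches the recovery test below):
# scripts/validate_recovery.py (sketch) - sanity-check critical tables after a restore
import json
import os

from supabase import create_client

supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_SERVICE_ROLE_KEY"])
CRITICAL_TABLES = ["jobs", "clients", "strategies"]

with open("baseline_counts.json") as f:  # illustrative baseline captured before the incident
    baseline = json.load(f)

for table in CRITICAL_TABLES:
    result = supabase.table(table).select("id", count="exact").limit(1).execute()
    status = "OK" if result.count >= baseline.get(table, 0) else "MISSING ROWS"
    print(f"{table}: restored={result.count} expected>={baseline.get(table, 0)} [{status}]")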
4. Recovery Testing¶
#!/bin/bash
# scripts/disaster_recovery_test.sh
echo "Starting disaster recovery test..."
# Create test environment
railway create --name "dr-test-$(date +%s)"
# Deploy services to test environment
railway deploy --service=dr-test-cpm
railway deploy --service=dr-test-im
railway deploy --service=dr-test-smm
# Restore data to test environment
pg_restore --host=test-db.xxx.supabase.co \
--port=5432 \
--username=postgres \
--dbname=postgres \
--clean \
latest_backup.backup
# Run validation tests
python scripts/validate_recovery.py
# Clean up test environment
railway delete --service=dr-test-cpm --confirm
railway delete --service=dr-test-im --confirm
railway delete --service=dr-test-smm --confirm
echo "Disaster recovery test completed"
Scaling & Performance Optimization¶
1. Auto-scaling Configuration¶
Railway Auto-scaling¶
{
"scaling": {
"minReplicas": 1,
"maxReplicas": 10,
"targetCPUUtilization": 70,
"targetMemoryUtilization": 80
},
"resources": {
"cpu": "2000m",
"memory": "4Gi"
}
}
Vercel Scaling¶
Vercel automatically scales serverless functions based on demand:

- Concurrent executions: Up to 1,000 per region
- Execution timeout: 30 seconds (configurable)
- Memory: 1024MB (configurable up to 3008MB)
2. Database Performance Optimization¶
-- Performance indexes
CREATE INDEX CONCURRENTLY idx_jobs_client_status_created
ON jobs(client_id, status, created_at DESC);
CREATE INDEX CONCURRENTLY idx_jobs_status_created
ON jobs(status, created_at DESC)
WHERE status IN ('pending', 'in_progress');
-- Partition large tables
CREATE TABLE jobs_2024 PARTITION OF jobs
FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
-- Tune session/planner memory settings (shared_buffers must be configured at the server level, not via SET)
SET work_mem = '256MB';
SET effective_cache_size = '4GB';
3. Caching Strategy¶
# caching.py
import redis
import json
from functools import wraps
from typing import Any, Optional
class CacheManager:
def __init__(self, redis_url: str):
self.redis = redis.from_url(redis_url)
def cache_result(self, key: str, ttl: int = 3600):
"""Decorator to cache function results"""
def decorator(func):
@wraps(func)
async def wrapper(*args, **kwargs):
cache_key = f"{func.__name__}:{key}"
# Try to get from cache
cached = self.redis.get(cache_key)
if cached:
return json.loads(cached)
# Execute function and cache result
result = await func(*args, **kwargs)
self.redis.setex(
cache_key,
ttl,
json.dumps(result, default=str)
)
return result
return wrapper
return decorator
# Usage
cache = CacheManager(REDIS_URL)
@cache.cache_result("strategies", ttl=1800)
async def get_client_strategies(client_id: str):
# Expensive database query
return strategies
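Cached results also need to be invalidated when the underlying data changes; a small sketch, using the key format built by the decorator above:
# Sketch: drop the cached strategies whenever a client's strategies are updated
def invalidate_strategies_cache():
    # Keys are "<function name>:<key>" as assembled in cache_result above
    cache.redis.delete("get_client_strategies:strategies")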
4. Load Testing¶
# scripts/load_test.py
import asyncio
import time

import aiohttp
async def load_test():
"""Load test the content generation endpoint"""
async def make_request(session, request_id):
start_time = time.time()
try:
async with session.post(
"https://cpm-production.railway.app/generate",
json={
"topic": f"Test content {request_id}",
"client_id": "test-client",
"content_type": "blog_post"
}
) as response:
duration = time.time() - start_time
return {
"request_id": request_id,
"status": response.status,
"duration": duration
}
except Exception as e:
return {
"request_id": request_id,
"status": "error",
"error": str(e),
"duration": time.time() - start_time
}
async with aiohttp.ClientSession() as session:
# Run 100 concurrent requests
tasks = [make_request(session, i) for i in range(100)]
results = await asyncio.gather(*tasks)
# Analyze results
success_count = sum(1 for r in results if r["status"] == 200)
avg_duration = sum(r["duration"] for r in results) / len(results)
print(f"Load test results:")
print(f"Success rate: {success_count/len(results)*100:.1f}%")
print(f"Average response time: {avg_duration:.2f}s")
if __name__ == "__main__":
asyncio.run(load_test())
5. Performance Monitoring¶
# performance_monitor.py
from datadog import initialize, statsd
import time
from functools import wraps
# Initialize monitoring
initialize()
def monitor_performance(metric_name: str):
"""Decorator to monitor function performance"""
def decorator(func):
@wraps(func)
async def wrapper(*args, **kwargs):
start_time = time.time()
try:
result = await func(*args, **kwargs)
duration = time.time() - start_time
# Send success metrics
statsd.histogram(f"{metric_name}.duration", duration)
statsd.increment(f"{metric_name}.success")
return result
except Exception as e:
duration = time.time() - start_time
# Send error metrics
statsd.histogram(f"{metric_name}.duration", duration)
statsd.increment(f"{metric_name}.error")
raise
return wrapper
return decorator
# Usage
@monitor_performance("content_generation")
async def generate_content(prompt: str):
# Content generation logic
pass
Troubleshooting Common Issues¶
1. Service Communication Issues¶
Problem: Services cannot communicate with each other¶
# Debug network connectivity
curl -v https://cpm-production.railway.app/health
curl -v https://im-production.railway.app/health
# Check DNS resolution
nslookup cpm-production.railway.app
# Verify SSL certificates
openssl s_client -connect cpm-production.railway.app:443 -servername cpm-production.railway.app
Solution:¶
- Verify service URLs in environment variables
- Check Railway service DNS settings
- Ensure services are deployed and healthy
2. Database Connection Issues¶
Problem: "connection pool exhausted" errors¶
# Monitor connection pool usage
async def check_connection_pool():
    # Assumes a SQL helper exposed via RPC that returns rows from pg_stat_activity
    result = supabase.rpc("pg_stat_activity").execute()
    active_connections = len(result.data)
    if active_connections > 80:  # e.g. 80% of a 100-connection limit
        logger.warning(f"High connection usage: {active_connections}")
Solution:¶
- Enable connection pooling in Supabase
- Implement connection pool monitoring
- Add connection retry logic
- Use async database clients
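For the retry logic mentioned above, a small sketch with exponential backoff around transient database errors (delay values and fetch_jobs are illustrative):
# Sketch: retry an async database call with exponential backoff and jitter
import asyncio
import random

async def with_retries(operation, max_attempts: int = 3, base_delay: float = 0.5):
    for attempt in range(1, max_attempts + 1):
        try:
            return await operation()
        except Exception:
            if attempt == max_attempts:
                raise
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.1)
            await asyncio.sleep(delay)

# Usage (fetch_jobs is a placeholder): await with_retries(lambda: fetch_jobs(client_id))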
3. Authentication Issues¶
Problem: JWT token validation failures¶
# Debug JWT tokens
import time

import jwt
def debug_jwt_token(token: str):
try:
# Decode without verification first
decoded = jwt.decode(token, options={"verify_signature": False})
print(f"Token payload: {decoded}")
# Check expiration
exp = decoded.get('exp', 0)
current_time = time.time()
if exp < current_time:
print("Token has expired")
else:
print(f"Token valid for {exp - current_time} seconds")
except Exception as e:
print(f"Token decode error: {e}")
Solution:¶
- Verify Supabase JWT secret configuration
- Check token expiration times
- Implement automatic token refresh
- Validate CORS settings
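For automatic refresh, a minimal sketch that inspects the token's exp claim and calls a caller-supplied refresh function before the token lapses (the refresh function itself depends on the auth client in use):
# Sketch: refresh a JWT shortly before it expires
import time
import jwt

def ensure_fresh_token(token: str, refresh_fn, leeway_seconds: int = 60) -> str:
    claims = jwt.decode(token, options={"verify_signature": False})
    if claims.get("exp", 0) - time.time() < leeway_seconds:
        return refresh_fn()  # e.g. a Supabase session refresh that returns a new access token
    return token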
4. LLM Provider Issues¶
Problem: API rate limits or failures¶
# Implement circuit breaker pattern
class CircuitBreaker:
def __init__(self, failure_threshold=5, reset_timeout=60):
self.failure_threshold = failure_threshold
self.reset_timeout = reset_timeout
self.failure_count = 0
self.last_failure_time = None
self.state = 'CLOSED' # CLOSED, OPEN, HALF_OPEN
async def call(self, func, *args, **kwargs):
if self.state == 'OPEN':
if time.time() - self.last_failure_time > self.reset_timeout:
self.state = 'HALF_OPEN'
else:
raise Exception("Circuit breaker is OPEN")
try:
result = await func(*args, **kwargs)
self.on_success()
return result
except Exception as e:
self.on_failure()
raise
def on_success(self):
self.failure_count = 0
self.state = 'CLOSED'
def on_failure(self):
self.failure_count += 1
self.last_failure_time = time.time()
if self.failure_count >= self.failure_threshold:
self.state = 'OPEN'
Solution:¶
- Implement retry logic with exponential backoff
- Use circuit breaker pattern
- Set up fallback providers
- Monitor API usage and quotas
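A sketch combining backoff with provider fallback (the provider order mirrors the configured API keys; call_provider stands in for the real client calls):
# Sketch: try providers in priority order, backing off on transient failures
import asyncio

PROVIDER_ORDER = ["openai", "anthropic", "google", "groq"]  # mirrors the configured keys

async def generate_with_fallback(prompt: str, call_provider, max_attempts: int = 3):
    last_error = None
    for provider in PROVIDER_ORDER:
        for attempt in range(1, max_attempts + 1):
            try:
                return await call_provider(provider, prompt)
            except Exception as exc:  # rate limits, timeouts, provider outages
                last_error = exc
                await asyncio.sleep(2 ** attempt)  # 2s, 4s, 8s
        # All attempts for this provider failed; fall back to the next one
    raise last_error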
5. Memory and Performance Issues¶
Problem: High memory usage or slow responses¶
# Memory profiling
import gc

import psutil
import structlog

logger = structlog.get_logger(__name__)
def monitor_memory():
process = psutil.Process()
memory_info = process.memory_info()
logger.info(
"Memory usage",
rss=memory_info.rss / 1024 / 1024, # MB
vms=memory_info.vms / 1024 / 1024, # MB
percent=process.memory_percent()
)
# Force garbage collection if memory is high
if process.memory_percent() > 80:
gc.collect()
Solution:¶
- Monitor memory usage and implement alerts
- Use memory profilers to identify leaks
- Implement proper connection pooling
- Add response caching for expensive operations
6. Deployment Issues¶
Problem: Railway deployment failures¶
# Check Railway logs
railway logs --follow
# Check service status
railway status
# Validate configuration
railway config
# Force redeploy
railway redeploy
Solution:¶
- Check build logs for errors
- Verify environment variables
- Ensure proper start commands
- Check resource limits
7. Database Performance Issues¶
Problem: Slow query performance¶
-- Enable query logging
ALTER SYSTEM SET log_statement = 'all';
ALTER SYSTEM SET log_min_duration_statement = 1000; -- Log queries > 1s
-- Analyze slow queries
SELECT query, mean_time, calls
FROM pg_stat_statements
ORDER BY mean_time DESC
LIMIT 10;
-- Check missing indexes
SELECT schemaname, tablename, attname, n_distinct, correlation
FROM pg_stats
WHERE schemaname = 'public'
ORDER BY n_distinct DESC;
Solution:¶
- Add appropriate indexes
- Optimize query patterns
- Implement query result caching
- Use connection pooling
8. Common Error Resolution¶
Create a troubleshooting script:
# scripts/troubleshoot.py
import asyncio
import os

import aiohttp
import redis
from supabase import create_client

SUPABASE_URL = os.environ["SUPABASE_URL"]
SUPABASE_KEY = os.environ["SUPABASE_SERVICE_ROLE_KEY"]
REDIS_URL = os.environ["REDIS_URL"]
async def run_diagnostics():
"""Run comprehensive system diagnostics"""
print("🔍 Running system diagnostics...\n")
# Check service health
services = [
("CPM", "https://cpm-production.railway.app/health"),
("IM", "https://im-production.railway.app/health"),
("SMM", "https://smm-production.railway.app/health"),
("Frontend", "https://your-frontend.vercel.app/api/health")
]
async with aiohttp.ClientSession() as session:
for name, url in services:
try:
async with session.get(url, timeout=10) as response:
if response.status == 200:
print(f"✅ {name} service: Healthy")
else:
print(f"❌ {name} service: Unhealthy (Status: {response.status})")
except Exception as e:
print(f"❌ {name} service: Error - {e}")
# Check database
try:
supabase = create_client(SUPABASE_URL, SUPABASE_KEY)
        result = supabase.table("jobs").select("id", count="exact").execute()
        print(f"✅ Database: Connected ({result.count} jobs found)")
except Exception as e:
print(f"❌ Database: Error - {e}")
# Check Redis
try:
r = redis.from_url(REDIS_URL)
r.ping()
print("✅ Redis: Connected")
except Exception as e:
print(f"❌ Redis: Error - {e}")
print("\n🏁 Diagnostics complete")
if __name__ == "__main__":
asyncio.run(run_diagnostics())
Quick Reference Commands¶
# Health checks
curl https://cpm-production.railway.app/health
curl https://im-production.railway.app/health
curl https://your-frontend.vercel.app/api/health
# Railway debugging
railway logs --follow --service=cpm-production
railway shell --service=cpm-production
# Vercel debugging
vercel logs --follow
vercel inspect
# Database debugging
supabase db inspect
supabase logs
# Performance monitoring
railway metrics --service=cpm-production
Conclusion¶
This deployment guide provides comprehensive instructions for deploying and operating the HG Content Generation System. Regular monitoring, proper backup procedures, and following the troubleshooting guidelines will ensure reliable operation of the system.
For additional support:

- Check service status pages: Railway, Vercel, Supabase
- Review application logs for specific error messages
- Run the diagnostic script for quick health checks
- Contact the development team for complex issues
Last Updated: August 2025 Version: 1.0.0