Deployment and Scalability Plan

Overview

This document outlines the deployment strategy and scalability considerations for the content generation system. It covers infrastructure setup, deployment pipelines, monitoring, scaling strategies, and disaster recovery procedures for all components across Vercel, Railway, and Supabase.

Infrastructure Overview

Production Architecture

Frontend (Next.js):
  Platform: Vercel
  Regions: Global Edge Network
  Scaling: Automatic (Serverless)

Backend Services:
  CPM (Python/FastAPI):
    Platform: Railway
    Instances: 2-5 (auto-scaling)
    Memory: 512MB - 2GB

  IM (Python/FastAPI):
    Platform: Railway
    Instances: 2-3 (auto-scaling)
    Memory: 256MB - 1GB

Database:
  Platform: Supabase (PostgreSQL)
  Plan: Pro (8GB RAM, 2 CPU)
  Replicas: 1 read replica (V2)

Cache (V2):
  Platform: Redis on Railway
  Memory: 256MB - 1GB

Monitoring:
  - Vercel Analytics
  - Railway Metrics
  - Supabase Dashboard
  - Sentry (Error tracking; see the init sketch below)
  - Custom Prometheus/Grafana (V2)
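
Sentry can be initialized in each FastAPI service with a few lines; a minimal sketch, assuming a SENTRY_DSN environment variable is set per service:

# Sentry initialization (sketch; SENTRY_DSN and the environment name are assumptions)
import os
import sentry_sdk

sentry_sdk.init(
    dsn=os.getenv("SENTRY_DSN"),
    environment=os.getenv("RAILWAY_ENVIRONMENT", "development"),
    traces_sample_rate=0.1,  # sample 10% of transactions for performance data
)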

Deployment Strategy

1. Environment Setup

# Environment configurations
Development:
  - Local development
  - Feature branches
  - Local Supabase instance

Staging:
  - Preview deployments (Vercel)
  - Staging services (Railway)
  - Staging database (Supabase branch)

Production:
  - Main branch deployments
  - Production services
  - Production database
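
The same code runs in all three environments; configuration is injected entirely through environment variables. A minimal settings sketch using pydantic-settings (variable names are illustrative):

# config.py - environment-driven settings (sketch; names are illustrative)
from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    environment: str = "development"  # development | staging | production
    supabase_url: str = "http://localhost:54321"  # local Supabase default
    supabase_anon_key: str = ""
    cpm_url: str = "http://localhost:8000"

settings = Settings()  # process environment overrides the defaults above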

2. CI/CD Pipeline

# .github/workflows/deploy.yml
name: Deploy

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  RAILWAY_TOKEN: ${{ secrets.RAILWAY_TOKEN }}
  VERCEL_TOKEN: ${{ secrets.VERCEL_TOKEN }}

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '20'

      - name: Setup pnpm
        uses: pnpm/action-setup@v4
        with:
          version: 9

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.12'

      - name: Install dependencies
        run: |
          pnpm install
          cd apps/cpm && pip install -r requirements.txt
          cd ../im && pip install -r requirements.txt

      - name: Run tests
        run: |
          pnpm test
          cd apps/cpm && pytest
          cd ../im && pytest

  deploy-preview:
    if: github.event_name == 'pull_request'
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Deploy to Vercel Preview
        run: |
          npx vercel --token $VERCEL_TOKEN --yes

      - name: Deploy to Railway Preview
        run: |
          npm install -g @railway/cli
          railway link content-gen-staging
          railway up --detach

  deploy-production:
    if: github.ref == 'refs/heads/main'
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Deploy Frontend to Vercel
        run: |
          npx vercel --prod --token $VERCEL_TOKEN --yes

      - name: Deploy CPM to Railway
        run: |
          npm install -g @railway/cli
          railway link content-gen-cpm
          railway up --detach --service cpm

      - name: Deploy IM to Railway
        run: |
          railway link content-gen-im
          railway up --detach --service im

      - name: Run Database Migrations
        run: |
          npx supabase db push --db-url ${{ secrets.SUPABASE_DB_URL }}

3. Railway Deployment Configuration

// apps/cpm/railway.json
{
  "$schema": "https://railway.app/railway.schema.json",
  "build": {
    "builder": "NIXPACKS",
    "buildCommand": "pip install -r requirements.txt"
  },
  "deploy": {
    "startCommand": "uvicorn app:app --host 0.0.0.0 --port $PORT",
    "restartPolicyType": "ON_FAILURE",
    "restartPolicyMaxRetries": 3,
    "healthcheckPath": "/health",
    "healthcheckTimeout": 30
  },
  "scaling": {
    "minInstances": 2,
    "maxInstances": 5,
    "targetCPU": 70,
    "targetMemory": 80
  }
}
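
Note that Railway's published railway.json schema covers the build and deploy keys; the scaling block documents the intended limits, which are generally configured per service in the Railway dashboard. The healthcheckPath assumes each service exposes a lightweight endpoint; a minimal sketch:

# apps/cpm/app.py (excerpt) - the /health endpoint Railway's health check hits
from fastapi import FastAPI

app = FastAPI()

@app.get("/health")
async def health():
    # Keep this cheap (no database round-trips) so checks stay fast under load
    return {"status": "ok"}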

4. Vercel Deployment Configuration

// apps/frontend/vercel.json
{
  "buildCommand": "cd ../.. && pnpm build --filter=frontend",
  "installCommand": "pnpm install",
  "framework": "nextjs",
  "outputDirectory": ".next",
  "regions": ["iad1", "sfo1", "lhr1"],
  "functions": {
    "app/api/content/generate/route.ts": {
      "maxDuration": 60
    }
  },
  "env": {
    "NEXT_PUBLIC_SUPABASE_URL": "@supabase-url",
    "NEXT_PUBLIC_SUPABASE_ANON_KEY": "@supabase-anon-key"
  }
}

Scaling Strategies

1. Horizontal Scaling Configuration

# Railway auto-scaling configuration
# apps/cpm/scaling_config.py

import os

import psutil
from prometheus_client import Gauge

# Metrics for auto-scaling decisions
active_jobs = Gauge('active_jobs', 'Number of jobs being processed')
queue_size = Gauge('queue_size', 'Number of jobs in queue')
memory_usage = Gauge('memory_usage_bytes', 'Memory usage in bytes')
cpu_usage = Gauge('cpu_usage_percent', 'CPU usage percentage')

class ScalingManager:
    def __init__(self):
        self.min_instances = int(os.getenv('MIN_INSTANCES', '2'))
        self.max_instances = int(os.getenv('MAX_INSTANCES', '10'))
        self.scale_up_threshold = 0.8
        self.scale_down_threshold = 0.3

    async def check_scaling_needs(self):
        metrics = await self.collect_metrics()

        if metrics['cpu'] > self.scale_up_threshold or metrics['memory'] > self.scale_up_threshold:
            return 'scale_up'
        elif metrics['cpu'] < self.scale_down_threshold and metrics['memory'] < self.scale_down_threshold:
            return 'scale_down'
        return 'maintain'

    async def collect_metrics(self):
        # Note: Gauge._value is a private prometheus_client attribute; it works,
        # but metric.collect() is the stable public way to read a current value.
        return {
            'cpu': psutil.cpu_percent() / 100,
            'memory': psutil.virtual_memory().percent / 100,
            'active_jobs': active_jobs._value.get(),
            'queue_size': queue_size._value.get()
        }
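
check_scaling_needs only produces a decision; the platform applies the actual scaling (Railway, from the targets in railway.json above). A sketch that runs the check in a background task and exposes the decision as a metric, using an illustrative scaling_decision gauge:

# Background scaling-check loop (sketch; scaling_decision is illustrative)
import asyncio

scaling_decision = Gauge('scaling_decision', 'Latest decision: -1 down, 0 maintain, 1 up')

async def scaling_loop(manager: ScalingManager, interval: int = 30):
    decisions = {'scale_down': -1, 'maintain': 0, 'scale_up': 1}
    while True:
        decision = await manager.check_scaling_needs()
        scaling_decision.set(decisions[decision])  # visible on /metrics for operators
        await asyncio.sleep(interval)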

2. Database Optimization

-- Partitioning for large tables
CREATE TABLE jobs_partitioned (
    LIKE jobs INCLUDING ALL
) PARTITION BY RANGE (created_at);

-- Create monthly partitions
CREATE TABLE jobs_2025_01 PARTITION OF jobs_partitioned
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');

CREATE TABLE jobs_2025_02 PARTITION OF jobs_partitioned
    FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');

-- Indexes for performance
CREATE INDEX CONCURRENTLY idx_jobs_status_created 
    ON jobs(status, created_at DESC) 
    WHERE status IN ('pending', 'in_progress');

CREATE INDEX CONCURRENTLY idx_jobs_client_created 
    ON jobs(client_id, created_at DESC);

-- Connection pooling configuration
-- Note: managed Supabase instances may not permit ALTER SYSTEM directly;
-- these values are then adjusted via the Supabase dashboard or support.
ALTER SYSTEM SET max_connections = 200;
ALTER SYSTEM SET shared_buffers = '2GB';
ALTER SYSTEM SET effective_cache_size = '6GB';
ALTER SYSTEM SET work_mem = '16MB';
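
Creating monthly partitions by hand stops scaling after a few months; a small scheduled job can create the next month's partition ahead of time. A sketch using psycopg, with the table names from above:

# create_partition.py - create next month's jobs partition (sketch)
import datetime
import os

import psycopg

def next_month_bounds(today: datetime.date):
    first = today.replace(day=1)
    start = (first + datetime.timedelta(days=32)).replace(day=1)  # first day of next month
    end = (start + datetime.timedelta(days=32)).replace(day=1)    # first day of the month after
    return start, end

start, end = next_month_bounds(datetime.date.today())
name = f"jobs_{start:%Y_%m}"

with psycopg.connect(os.environ["SUPABASE_DB_URL"]) as conn:
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {name} PARTITION OF jobs_partitioned "
        f"FOR VALUES FROM ('{start}') TO ('{end}')"
    )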

3. Caching Strategy

# Multi-level caching implementation
import json
import time
from typing import Optional, Any

import aioredis  # note: aioredis is unmaintained; redis.asyncio exposes the same API

class MultiLevelCache:
    def __init__(self, redis_url: str):
        self.redis_url = redis_url
        self.local_cache = {}
        self.local_cache_ttl = 60  # 1 minute

    async def connect(self):
        self.redis = await aioredis.from_url(
            self.redis_url,
            encoding="utf-8",
            decode_responses=True
        )

    async def get(self, key: str) -> Optional[Any]:
        # L1: Local memory cache
        if key in self.local_cache:
            value, timestamp = self.local_cache[key]
            if time.time() - timestamp < self.local_cache_ttl:
                return value
            else:
                del self.local_cache[key]

        # L2: Redis cache
        value = await self.redis.get(key)
        if value:
            parsed = json.loads(value)
            # Populate L1 cache
            self.local_cache[key] = (parsed, time.time())
            return parsed

        return None

    async def set(self, key: str, value: Any, ttl: int = 3600):
        # Set in both caches
        self.local_cache[key] = (value, time.time())
        await self.redis.setex(key, ttl, json.dumps(value))

    async def invalidate(self, pattern: str):
        # Clear from both caches
        # Local cache
        keys_to_delete = [k for k in self.local_cache.keys() if pattern in k]
        for key in keys_to_delete:
            del self.local_cache[key]

        # Redis cache
        cursor = 0
        while True:
            cursor, keys = await self.redis.scan(cursor, match=f"*{pattern}*")
            if keys:
                await self.redis.delete(*keys)
            if cursor == 0:
                break
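
A typical get-or-compute pattern in a service, assuming the cache is connected at startup; fetch_profile_from_db is a hypothetical database helper:

# Usage sketch: cache client profiles for an hour
import os

cache = MultiLevelCache(redis_url=os.getenv("REDIS_URL", "redis://localhost:6379"))

async def get_client_profile(client_id: str) -> dict:
    key = f"client_profile:{client_id}"
    profile = await cache.get(key)
    if profile is None:
        profile = await fetch_profile_from_db(client_id)  # hypothetical DB helper
        await cache.set(key, profile, ttl=3600)
    return profile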

Load Balancing

1. Railway Service Configuration

# Railway load balancing setup (illustrative; Railway routes across replicas automatically)
services:
  cpm:
    instances: 3
    health_check:
      path: /health
      interval: 30s
      timeout: 10s
      success_threshold: 1
      failure_threshold: 3

  im:
    instances: 2
    health_check:
      path: /health
      interval: 30s
      timeout: 10s

2. Request Distribution

# Client-side load balancing
import asyncio
import random
from typing import List

import aiohttp

class LoadBalancedClient:
    def __init__(self, service_urls: List[str]):
        self.service_urls = service_urls
        self.healthy_urls = set(service_urls)
        self.check_interval = 30  # seconds

    async def start_health_checks(self):
        while True:
            await self.check_all_health()
            await asyncio.sleep(self.check_interval)

    async def check_all_health(self):
        tasks = [self.check_health(url) for url in self.service_urls]
        results = await asyncio.gather(*tasks, return_exceptions=True)

        self.healthy_urls = {
            url for url, result in zip(self.service_urls, results)
            if result is True
        }

    async def check_health(self, url: str) -> bool:
        try:
            timeout = aiohttp.ClientTimeout(total=5)
            async with aiohttp.ClientSession(timeout=timeout) as session:
                async with session.get(f"{url}/health") as response:
                    return response.status == 200
        except Exception:
            return False

    def get_url(self) -> str:
        if not self.healthy_urls:
            # Fallback to all URLs if none are healthy
            return random.choice(self.service_urls)
        return random.choice(list(self.healthy_urls))

    async def request(self, method: str, endpoint: str, **kwargs):
        url = self.get_url()
        full_url = f"{url}{endpoint}"

        async with aiohttp.ClientSession() as session:
            async with session.request(method, full_url, **kwargs) as response:
                return await response.json()
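
Wiring it up, with replica URLs supplied through an environment variable (CPM_URLS is illustrative):

# Usage sketch
import os

cpm_client = LoadBalancedClient(
    os.getenv("CPM_URLS", "http://localhost:8000").split(",")
)

async def main():
    asyncio.create_task(cpm_client.start_health_checks())  # background health polling
    print(await cpm_client.request("GET", "/health"))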

Monitoring and Alerting

1. Application Monitoring

# Prometheus metrics setup
import time

from fastapi import FastAPI, Response
from prometheus_client import Counter, Histogram, Gauge, generate_latest, CONTENT_TYPE_LATEST

app = FastAPI()

# Metrics
request_count = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint', 'status'])
request_duration = Histogram('http_request_duration_seconds', 'HTTP request duration', ['method', 'endpoint'])
active_connections = Gauge('active_connections', 'Number of active connections')

# Middleware for metrics collection
@app.middleware("http")
async def prometheus_middleware(request, call_next):
    start_time = time.time()
    active_connections.inc()

    try:
        response = await call_next(request)
        duration = time.time() - start_time

        request_count.labels(
            method=request.method,
            endpoint=request.url.path,
            status=str(response.status_code)  # label values must be strings
        ).inc()

        request_duration.labels(
            method=request.method,
            endpoint=request.url.path
        ).observe(duration)

        return response
    finally:
        active_connections.dec()

@app.get("/metrics")
async def get_metrics():
    return Response(content=generate_latest(), media_type="text/plain")

2. Alert Configuration

# prometheus/alerts.yml
groups:
  - name: content_generation
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: rate(content_errors_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} errors per second"

      - alert: LowSuccessRate
        expr: rate(content_requests_total{status="completed"}[5m]) / rate(content_requests_total[5m]) < 0.9
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Low success rate"
          description: "Success rate is {{ $value | humanizePercentage }}"

      - alert: HighMemoryUsage
        expr: process_resident_memory_bytes / 1e9 > 1.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Memory usage is {{ $value | humanize }}GB"

      - alert: DatabaseConnectionPoolExhausted
        expr: pg_stat_database_numbackends / pg_settings_max_connections > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Database connection pool nearly exhausted"
          description: "{{ $value | humanizePercentage }} of connections in use"

3. Logging Strategy

# Structured logging configuration
import logging
import os
from datetime import datetime, timezone

from pythonjsonlogger import jsonlogger

class CustomJsonFormatter(jsonlogger.JsonFormatter):
    def add_fields(self, log_record, record, message_dict):
        super().add_fields(log_record, record, message_dict)
        log_record['timestamp'] = datetime.now(timezone.utc).isoformat()
        log_record['level'] = record.levelname
        log_record['service'] = os.getenv('SERVICE_NAME', 'unknown')
        log_record['instance_id'] = os.getenv('RAILWAY_INSTANCE_ID', 'local')

# Configure logging
logHandler = logging.StreamHandler()
formatter = CustomJsonFormatter()
logHandler.setFormatter(formatter)
logger = logging.getLogger()
logger.addHandler(logHandler)
logger.setLevel(logging.INFO)

# Log aggregation query examples
"""
-- Find slow requests
SELECT 
    timestamp,
    endpoint,
    duration_ms,
    client_id
FROM logs
WHERE duration_ms > 5000
ORDER BY timestamp DESC
LIMIT 100;

-- Error analysis
SELECT 
    error_type,
    COUNT(*) as count,
    AVG(duration_ms) as avg_duration
FROM logs
WHERE level = 'ERROR'
  AND timestamp > NOW() - INTERVAL '1 hour'
GROUP BY error_type
ORDER BY count DESC;
"""

Disaster Recovery

1. Backup Strategy

#!/bin/bash
# backup.sh - Run daily via cron

# Database backup
DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_DIR="/backups"

# Supabase backup
pg_dump "$SUPABASE_DB_URL" | gzip > "$BACKUP_DIR/db_backup_$DATE.sql.gz"

# Upload to S3
aws s3 cp "$BACKUP_DIR/db_backup_$DATE.sql.gz" "s3://content-gen-backups/db/$DATE/"

# Cleanup old backups (keep 30 days)
find $BACKUP_DIR -name "*.sql.gz" -mtime +30 -delete

# Backup application state (via Railway, consistent with the production setup above;
# the Kubernetes equivalent applies only after the V2 migration)
railway run --service cpm -- python manage.py export_state | \
    gzip > "$BACKUP_DIR/app_state_$DATE.json.gz"

2. Recovery Procedures

# Automated recovery system
import asyncio
import logging
import os
from typing import Dict, List

logger = logging.getLogger(__name__)

class DisasterRecovery:
    def __init__(self):
        self.health_check_interval = 60
        self.recovery_attempts = 0
        self.max_recovery_attempts = 3

    async def monitor_system_health(self):
        while True:
            try:
                health_status = await self.check_all_services()

                if not health_status['healthy']:
                    await self.initiate_recovery(health_status['failed_services'])
                else:
                    self.recovery_attempts = 0

            except Exception as e:
                logger.error(f"Health check failed: {e}")

            await asyncio.sleep(self.health_check_interval)

    async def check_all_services(self) -> Dict:
        service_urls = {
            'frontend': os.getenv('FRONTEND_URL'),
            'cpm': os.getenv('CPM_URL'),
            'im': os.getenv('IM_URL'),
        }

        failed_services = []
        for service, url in service_urls.items():
            if not await self.check_service(url):
                failed_services.append(service)

        # The database has no HTTP health endpoint, so it is checked separately
        if not await self.check_database():
            failed_services.append('database')

        return {
            'healthy': len(failed_services) == 0,
            'failed_services': failed_services
        }

    async def initiate_recovery(self, failed_services: List[str]):
        self.recovery_attempts += 1

        if self.recovery_attempts > self.max_recovery_attempts:
            await self.alert_ops_team("Maximum recovery attempts exceeded")
            return

        for service in failed_services:
            await self.recover_service(service)

    async def recover_service(self, service: str):
        recovery_actions = {
            'cpm': self.restart_cpm,
            'im': self.restart_im,
            'database': self.recover_database,
            'frontend': self.clear_cdn_cache
        }

        if service in recovery_actions:
            await recovery_actions[service]()

3. Rollback Procedures

# Automated rollback on deployment failure
# (a failing script line aborts the job, so the check must wrap the command itself)
deploy:
  script:
    - railway up --detach
    - |
      if ! ./scripts/health_check.sh; then
        echo "Health check failed, rolling back..."
        railway rollback
        exit 1
      fi

Performance Benchmarks

1. Load Testing Configuration

# locustfile.py
from locust import HttpUser, task, between
import random

class ContentGenerationLoadTest(HttpUser):
    wait_time = between(1, 5)

    def on_start(self):
        # Login or setup
        self.client_id = "load-test-client"
        self.job_ids = []

    @task(3)
    def create_blog_content(self):
        response = self.client.post("/generate", json={
            "topic": f"Load test topic {random.randint(1, 1000)}",
            "content_type": "blog",
            "client_id": self.client_id,
            "keywords": ["test", "load", "performance"],
            "priority": random.choice(["cost", "quality", "speed", "balanced"])
        })

        if response.status_code == 200:
            job_id = response.json()["job_id"]
            self.job_ids.append(job_id)

    @task(2)
    def check_job_status(self):
        if self.job_ids:
            job_id = random.choice(self.job_ids[-10:])  # Check recent jobs
            self.client.get(f"/status/{job_id}")

    @task(1)
    def create_social_content(self):
        self.client.post("/generate", json={
            "topic": f"Quick social update {random.randint(1, 100)}",
            "content_type": "social",
            "client_id": self.client_id,
            "keywords": ["social", "quick"]
        })

# Run with: locust -f locustfile.py --host https://api.content-gen.com

2. Performance Targets

MVP Targets:
  - Concurrent users: 100
  - Requests per second: 50
  - P95 latency: <2s
  - P99 latency: <5s
  - Error rate: <1%
  - Availability: 99.5%

Production Targets:
  - Concurrent users: 1000
  - Requests per second: 500
  - P95 latency: <1s
  - P99 latency: <3s
  - Error rate: <0.1%
  - Availability: 99.9%

Scale Limits:
  - Max CPM instances: 10
  - Max IM instances: 5
  - Database connections: 200
  - Memory per instance: 2GB
  - Storage: 100GB

Cost Management

1. Resource Optimization

# Auto-shutdown inactive services
import asyncio
import time

class ResourceOptimizer:
    def __init__(self):
        self.idle_threshold = 300  # 5 minutes
        self.check_interval = 60  # 1 minute

    async def monitor_activity(self):
        while True:
            activity = await self.get_service_activity()

            for service, last_active in activity.items():
                idle_time = time.time() - last_active

                if idle_time > self.idle_threshold:
                    await self.scale_down_service(service)

            await asyncio.sleep(self.check_interval)

    async def scale_down_service(self, service: str):
        # Scale to minimum instances during idle periods.
        # railway_api is a placeholder for whatever Railway API client is in use.
        if service in ['cpm', 'im']:
            await railway_api.scale_service(service, instances=1)

2. Cost Monitoring

-- Daily cost tracking
CREATE VIEW daily_costs AS
SELECT 
    DATE(created_at) as date,
    COUNT(*) as requests,
    SUM(generation_cost) as llm_costs,
    COUNT(*) * 0.001 as api_costs,  -- $0.001 per request
    (COUNT(*) * 0.001) + SUM(generation_cost) as total_cost
FROM jobs
WHERE status = 'completed'
GROUP BY DATE(created_at);

-- Alert on cost overruns
CREATE FUNCTION check_daily_cost_limit() RETURNS trigger AS $$
BEGIN
    IF (SELECT total_cost FROM daily_costs WHERE date = CURRENT_DATE) > 50 THEN
        INSERT INTO alerts (type, message, severity)
        VALUES ('cost_overrun', 'Daily cost limit exceeded', 'warning');
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- Without a trigger the function never fires; attach it to job completion
CREATE TRIGGER daily_cost_limit_check
    AFTER INSERT OR UPDATE OF status ON jobs
    FOR EACH ROW
    WHEN (NEW.status = 'completed')
    EXECUTE FUNCTION check_daily_cost_limit();

Security Hardening

1. Infrastructure Security

// Security headers for Next.js
// next.config.js
module.exports = {
  async headers() {
    return [
      {
        source: '/:path*',
        headers: [
          {
            key: 'X-Frame-Options',
            value: 'DENY',
          },
          {
            key: 'X-Content-Type-Options',
            value: 'nosniff',
          },
          {
            key: 'X-XSS-Protection',
            value: '1; mode=block',
          },
          {
            key: 'Strict-Transport-Security',
            value: 'max-age=31536000; includeSubDomains',
          },
          {
            key: 'Content-Security-Policy',
            value: "default-src 'self'; script-src 'self' 'unsafe-eval' 'unsafe-inline'; style-src 'self' 'unsafe-inline';",
          },
        ],
      },
    ]
  },
}

2. API Security

# Rate limiting and DDoS protection
import os

from fastapi import FastAPI, HTTPException, Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.util import get_remote_address
from slowapi.errors import RateLimitExceeded

app = FastAPI()
limiter = Limiter(
    key_func=get_remote_address,
    default_limits=["100 per minute", "1000 per hour"]
)

app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/generate")
@limiter.limit("10 per minute")
async def generate_content(request: Request, content_request: ContentRequest):
    # Process request
    pass

# IP allowlisting for admin endpoints
ALLOWED_IPS = set(os.getenv('ALLOWED_ADMIN_IPS', '').split(','))

async def verify_admin_ip(request: Request):
    client_ip = request.client.host
    if client_ip not in ALLOWED_IPS:
        raise HTTPException(status_code=403, detail="Access denied")
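
verify_admin_ip is then attached to protected routes as a FastAPI dependency; a sketch (the endpoint path is illustrative):

# Usage sketch: protect an admin route with the IP allowlist
from fastapi import Depends

@app.get("/admin/summary", dependencies=[Depends(verify_admin_ip)])
async def admin_summary():
    return {"status": "ok"}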

Migration Strategy

1. Database Migrations

-- Migration tracking
CREATE TABLE schema_migrations (
    version VARCHAR(255) PRIMARY KEY,
    applied_at TIMESTAMPTZ DEFAULT NOW()
);

-- Example migration
BEGIN;
-- V1_add_priority_to_jobs.sql
ALTER TABLE jobs ADD COLUMN priority VARCHAR(20) DEFAULT 'balanced';
INSERT INTO schema_migrations (version) VALUES ('V1_add_priority_to_jobs');
COMMIT;
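
A minimal runner can apply any migrations/*.sql files not yet recorded in schema_migrations; a sketch using psycopg, assuming migration files contain plain statements (no BEGIN/COMMIT or tracking INSERT of their own):

# migrate.py - apply pending SQL migrations (sketch; directory layout assumed)
import os
import pathlib

import psycopg

with psycopg.connect(os.environ["SUPABASE_DB_URL"]) as conn:
    applied = {row[0] for row in conn.execute("SELECT version FROM schema_migrations")}

    for path in sorted(pathlib.Path("migrations").glob("*.sql")):
        version = path.stem
        if version in applied:
            continue
        with conn.transaction():  # each migration applies atomically
            conn.execute(path.read_text())
            conn.execute(
                "INSERT INTO schema_migrations (version) VALUES (%s)", (version,)
            )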

2. Zero-Downtime Deployments

# Blue-green deployment strategy
import os

class BlueGreenDeployment:
    def __init__(self):
        self.blue_version = os.getenv('BLUE_VERSION')
        self.green_version = os.getenv('GREEN_VERSION')
        self.active_color = 'blue'

    async def deploy_new_version(self, version: str):
        # Deploy to inactive color
        inactive_color = 'green' if self.active_color == 'blue' else 'blue'

        # Deploy new version
        await self.deploy_to_environment(inactive_color, version)

        # Health check new deployment
        if await self.health_check(inactive_color):
            # Switch traffic
            await self.switch_traffic(inactive_color)
            self.active_color = inactive_color
        else:
            # Rollback
            await self.cleanup_failed_deployment(inactive_color)
            raise Exception("Deployment failed health checks")

V2 Infrastructure Enhancements

1. Kubernetes Migration

# k8s/cpm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cpm
spec:
  replicas: 3
  selector:
    matchLabels:
      app: cpm
  template:
    metadata:
      labels:
        app: cpm
    spec:
      containers:
      - name: cpm
        image: content-gen/cpm:latest
        ports:
        - containerPort: 8000
        env:
        - name: SUPABASE_URL
          valueFrom:
            secretKeyRef:
              name: app-secrets
              key: supabase-url
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "1Gi"
            cpu: "1000m"
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: cpm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cpm
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80

2. Global Distribution

# Multi-region deployment
regions:
  us-east:
    frontend: vercel-iad1
    backend: railway-us-east
    database: supabase-us-east-1

  eu-west:
    frontend: vercel-lhr1
    backend: railway-eu-west
    database: supabase-eu-west-1

  asia-pacific:
    frontend: vercel-hnd1
    backend: railway-ap-southeast
    database: supabase-ap-southeast-1

traffic_routing:
  - geo_routing: true
  - latency_based: true
  - failover_priority: [us-east, eu-west, asia-pacific]

Operational Runbook

Common Issues and Solutions

High Latency:
  Symptoms:
    - P95 latency > 5s
    - User complaints about slow generation
  Diagnosis:
    - Check LLM provider status
    - Review database query performance
    - Check network latency between services
  Solutions:
    - Switch to faster LLM provider
    - Add database indexes
    - Scale up service instances

Memory Leaks:
  Symptoms:
    - Gradual memory increase
    - Service crashes after hours/days
  Diagnosis:
    - Review memory profiler output
    - Check for unclosed connections
    - Look for large objects in memory
  Solutions:
    - Implement connection pooling
    - Add garbage collection triggers
    - Fix memory leak in code

Database Connection Exhaustion:
  Symptoms:
    - "Too many connections" errors
    - Service timeouts
  Diagnosis:
    - Check pg_stat_activity (see the snapshot script below)
    - Review connection pool settings
  Solutions:
    - Increase max_connections
    - Implement connection pooling
    - Close idle connections
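
For the pg_stat_activity check, a quick read-only snapshot script (connection string env var as used elsewhere in this document):

# check_connections.py - snapshot connection usage by state (sketch)
import os

import psycopg

with psycopg.connect(os.environ["SUPABASE_DB_URL"]) as conn:
    rows = conn.execute(
        "SELECT state, COUNT(*) FROM pg_stat_activity GROUP BY state ORDER BY 2 DESC"
    ).fetchall()
    for state, count in rows:
        print(f"{state or 'unknown'}: {count}")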

Maintenance Procedures

#!/bin/bash
# maintenance.sh

# Put services in maintenance mode
railway run --service cpm -- python manage.py maintenance_mode on

# Perform maintenance tasks
echo "Running database vacuum..."
psql $DATABASE_URL -c "VACUUM ANALYZE;"

echo "Clearing old logs..."
find /logs -name "*.log" -mtime +30 -delete

echo "Updating dependencies..."
railway run --service cpm -- pip install -r requirements.txt --upgrade

# Health check
echo "Running health checks..."
./scripts/health_check.sh

# Exit maintenance mode
railway run --service cpm -- python manage.py maintenance_mode off