Open-Source AI Model Integration Plan for HG Content System¶
Executive Summary¶
This plan covers adding Ollama running Mistral-7B as a free, no-API-key fallback for content generation in the HG Content Generation System. It gives users a zero-cost option for testing and development while keeping the existing multi-provider architecture intact.
Current Architecture Analysis¶
Existing System Components¶
- Content Production Module (CPM) - Python FastAPI service at `apps/cpm/`
  - Currently supports: OpenAI, Anthropic, Google Gemini, Groq
  - Uses `BaseLLMClient` abstraction in `apps/cpm/llm_client.py`
  - Handles content generation via `/api/generate` endpoint
- Instructions Module (IM) - Prompt generation service at `apps/im/`
  - Generates prompts based on templates and client settings
  - Returns structured instructions for content generation
- Frontend - Next.js application at `apps/frontend/`
  - Hosted on Vercel
  - Communicates with backend services via API
- External API - New service at `apps/external/`
  - Provides programmatic access for external clients
  - Uses same backend infrastructure
Integration Architecture¶
Proposed Setup¶
┌─────────────┐     ┌──────────────┐     ┌─────────────────┐
│   Vercel    │────▶│   Railway    │────▶│  Railway/Cloud  │
│  (Frontend) │     │    (CPM)     │     │    (Ollama)     │
└─────────────┘     └──────────────┘     └─────────────────┘
                           │
                           ▼
                    ┌──────────────┐
                    │   Existing   │
                    │ LLM Providers│
                    └──────────────┘
Key Design Decisions¶
1. Ollama Deployment Strategy¶
- Development: Run Ollama locally on port 11434
- Production: Deploy Ollama on Railway with persistent volume
- Alternative: Use a dedicated GPU cloud provider for better performance
2. Integration Point¶
- Add `OllamaClient` to the existing `llm_client.py` infrastructure
- Maintain compatibility with the existing `BaseLLMClient` interface
- Use the OpenAI-compatible endpoint for easier integration
3. No-Streaming Approach¶
- Content generation doesn't require streaming (not a chat app)
- Use synchronous responses with appropriate timeouts
- Simpler implementation and error handling
Implementation Plan¶
Phase 1: Add Ollama Client to CPM¶
1.1 Create OllamaClient Class¶
# apps/cpm/llm_client.py
import os
import requests

class OllamaClient(BaseLLMClient):
    """Client for Ollama/Mistral open-source models."""

    def __init__(self, base_url: str = None):
        # Prefer OLLAMA_URL; fall back to legacy var for compatibility
        self.base_url = base_url or os.getenv('OLLAMA_URL') or os.getenv('OLLAMA_API_URL', 'http://localhost:11434')
        self.model_name = os.getenv('OLLAMA_MODEL', 'mistral:7b-instruct')

    def generate_content(
        self,
        prompt: str,
        content_type: str,
        max_tokens: int = 2000,
        temperature: float = 0.7
    ) -> LLMResponse:
        """Generate content using Ollama's OpenAI-compatible endpoint."""
        response = requests.post(
            f"{self.base_url}/v1/chat/completions",
            json={
                "model": self.model_name,
                "messages": [
                    {"role": "system", "content": self._get_system_prompt(content_type)},
                    {"role": "user", "content": prompt}
                ],
                "temperature": temperature,
                "max_tokens": max_tokens,
                "stream": False
            },
            headers={
                "Content-Type": "application/json",
                "Authorization": "Bearer ollama"  # Dummy token
            },
            timeout=120  # 2 minutes for long-form content
        )

        if response.status_code != 200:
            raise LLMError(f"Ollama error: {response.text}")

        data = response.json()
        content = data['choices'][0]['message']['content']

        return LLMResponse(
            content=content,
            model=self.model_name,
            provider="ollama",
            usage=self._estimate_usage(prompt, content),
            cost=0.0  # Free!
        )
1.2 Update LLMClientFactory¶
# apps/cpm/llm_client.py
class LLMClientFactory:
    @staticmethod
    def create_client(provider: str = None) -> BaseLLMClient:
        # Auto-detect provider based on available keys
        if not provider:
            if os.getenv('OPENAI_API_KEY'):
                provider = 'openai'
            elif os.getenv('ANTHROPIC_API_KEY'):
                provider = 'anthropic'
            elif os.getenv('GOOGLE_API_KEY'):
                provider = 'google'
            elif os.getenv('GROQ_API_KEY'):
                provider = 'groq'
            else:
                provider = 'ollama'  # Free fallback
        if provider == 'ollama':
            return OllamaClient()
        # ... existing providers
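For illustration, a caller such as the `/api/generate` handler would then obtain and use a client as sketched below; only `LLMClientFactory` and `generate_content` come from the excerpt above, the surrounding handler wiring is assumed:

# Sketch: consuming the factory from the /api/generate path (handler wiring is assumed)
client = LLMClientFactory.create_client()   # auto-detects a provider; Ollama when no keys are set
result = client.generate_content(
    prompt="Write a short product update",
    content_type="blog_post",
)
# result.provider == "ollama" and result.cost == 0.0 when the fallback was used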
Phase 2: Railway Deployment¶
2.1 Dockerfile for Ollama Service¶
# deployment/ollama/Dockerfile
FROM ollama/ollama:latest
# Pre-pull models on startup
COPY entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh
EXPOSE 11434
ENTRYPOINT ["/entrypoint.sh"]
2.2 Entrypoint Script¶
#!/bin/sh
# deployment/ollama/entrypoint.sh
set -e
# Start the server in the background so models can be pulled through it
ollama serve &
SERVER_PID=$!
# Wait for the API to become available before pulling
until ollama list >/dev/null 2>&1; do sleep 1; done
# Pull models (quantized builds for CPU)
ollama pull mistral:7b-instruct-q4_K_M
ollama pull llama2:7b-q4_K_M  # Alternative option
# Keep the server in the foreground
wait $SERVER_PID
2.3 Railway Configuration¶
# railway.toml (for Ollama service)
[build]
builder = "DOCKERFILE"
dockerfilePath = "deployment/ollama/Dockerfile"
[deploy]
startCommand = "/entrypoint.sh"
restartPolicyType = "ON_FAILURE"
restartPolicyMaxRetries = 3
[[services]]
name = "ollama"
port = 11434
[[volumes]]
mount = "/root/.ollama"
size = "50Gi" # For model storage
Phase 3: Environment Configuration¶
3.1 Update .env.example¶
# ======================================
# OLLAMA CONFIGURATION (Open-Source Models)
# ======================================
# For local development
OLLAMA_URL=http://localhost:11434
# For production (Railway deployment)
# OLLAMA_URL=https://ollama-service.up.railway.app
# Model selection
OLLAMA_MODEL=mistral:7b-instruct-q4_K_M # Quantized for CPU
# OLLAMA_MODEL=llama2:7b # Alternative
# Enable Ollama as fallback when no API keys present
ENABLE_OLLAMA_FALLBACK=true
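The `ENABLE_OLLAMA_FALLBACK` flag is meant to gate the factory's automatic fallback. A minimal sketch of the check, assuming `_ollama_fallback_enabled` is a new helper rather than existing code:

# apps/cpm/llm_client.py (sketch; _ollama_fallback_enabled is an assumed helper)
import os

def _ollama_fallback_enabled() -> bool:
    """True when Ollama may be selected automatically in the absence of API keys."""
    return os.getenv('ENABLE_OLLAMA_FALLBACK', 'true').lower() in ('1', 'true', 'yes')

With the flag disabled, `LLMClientFactory.create_client` would raise a configuration error instead of silently routing requests to Ollama.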
3.2 Provider Status Endpoint (CPM)¶
The CPM exposes a simple status endpoint for checking Ollama availability: it probes `OLLAMA_URL` via `/api/tags` or `/api/version` and returns 200 when the server is reachable, 503 otherwise.
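A minimal FastAPI sketch of such a route, assuming a new router module (the file path, router wiring, and probe details are illustrative, not existing code):

# apps/cpm/routes/providers.py (hypothetical module)
import os
import requests
from fastapi import APIRouter, Response

router = APIRouter()

@router.get("/api/providers/ollama/status")
def ollama_status(response: Response):
    """Report whether the configured Ollama server is reachable."""
    base_url = os.getenv("OLLAMA_URL", "http://localhost:11434")
    try:
        online = requests.get(f"{base_url}/api/version", timeout=5).status_code == 200
    except requests.RequestException:
        online = False
    response.status_code = 200 if online else 503
    return {"online": online}

The Phase 4 frontend component reads `data.online` from this response.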
3.3 Security Configuration¶
# deployment/ollama/nginx.conf
# Simple auth proxy for Railway deployment
server {
listen 80;
location / {
auth_basic "Ollama API";
auth_basic_user_file /etc/nginx/.htpasswd;
proxy_pass http://localhost:11434;
proxy_set_header Host $host;
proxy_read_timeout 300s;
}
}
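Note that the Phase 1 client sends a dummy Bearer token, which this basic-auth proxy would reject. A sketch of supplying credentials from the CPM side, with `OLLAMA_PROXY_USER`/`OLLAMA_PROXY_PASSWORD` as assumed new variable names:

# Sketch: authenticating against the nginx basic-auth proxy (env var names are assumptions)
import os
import requests

def _ollama_auth():
    """Return (user, password) for requests' basic auth, or None when Ollama is reached directly."""
    user = os.getenv('OLLAMA_PROXY_USER')
    password = os.getenv('OLLAMA_PROXY_PASSWORD')
    return (user, password) if user and password else None

# Inside OllamaClient.generate_content the call would become:
# requests.post(url, json=payload, auth=_ollama_auth(), timeout=120)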
Phase 4: Frontend Integration¶
4.1 Update Provider Selection UI¶
// apps/frontend/components/settings/LLMProviderSelector.tsx
export function LLMProviderSelector({ client, onUpdate }) {
  const hasApiKeys = useApiKeys() // Check which keys are configured

  return (
    <Select value={client.llm_provider} onValueChange={onUpdate}>
      <SelectItem value="openai" disabled={!hasApiKeys.openai}>
        OpenAI {!hasApiKeys.openai && '(No API Key)'}
      </SelectItem>
      <SelectItem value="anthropic" disabled={!hasApiKeys.anthropic}>
        Anthropic {!hasApiKeys.anthropic && '(No API Key)'}
      </SelectItem>
      {/* ... other providers */}
      <SelectItem value="ollama">
        Ollama (Free - Open Source)
      </SelectItem>
    </Select>
  )
}
4.2 Add Provider Info Component¶
// apps/frontend/components/settings/OllamaInfo.tsx
export function OllamaInfo() {
  const [status, setStatus] = useState<'checking' | 'online' | 'offline'>('checking')

  useEffect(() => {
    fetch('/api/providers/ollama/status')
      .then(r => r.json())
      .then(data => setStatus(data.online ? 'online' : 'offline'))
      .catch(() => setStatus('offline'))
  }, [])

  return (
    <Alert>
      <InfoIcon className="h-4 w-4" />
      <AlertTitle>Open-Source Model (Ollama)</AlertTitle>
      <AlertDescription>
        Free content generation using Mistral-7B.
        Status: {status === 'online' ? '🟢 Available' : '🔴 Unavailable'}
        {status === 'offline' && (
          <div className="mt-2 text-sm">
            To use Ollama locally, run: <code>ollama run mistral</code>
          </div>
        )}
      </AlertDescription>
    </Alert>
  )
}
Performance Considerations¶
1. Model Selection for Production¶
For CPU (Railway default):
- mistral:7b-instruct-q4_K_M (4-bit quantized, ~4GB RAM)
- llama2:7b-q4_K_M (4-bit quantized, ~4GB RAM)
For GPU (if using GPU cloud):
- mistral:7b-instruct (full precision, ~15GB VRAM)
- mixtral:8x7b (MoE model, better quality, ~48GB VRAM)
2. Response Time Expectations¶
- CPU inference: 20-60 seconds for 1000-word article
- GPU inference: 5-15 seconds for 1000-word article
- Adjust timeouts accordingly
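Given this spread, the request timeout is best chosen per provider rather than hard-coded. A minimal sketch, where `OLLAMA_TIMEOUT` is an assumed new variable rather than existing config:

# Sketch: per-provider request timeouts (OLLAMA_TIMEOUT is an assumed variable)
import os

DEFAULT_TIMEOUTS = {"ollama": 120, "openai": 60, "anthropic": 60, "google": 60, "groq": 30}

def request_timeout(provider: str) -> int:
    """CPU-backed Ollama needs a far larger budget than the hosted APIs."""
    if provider == "ollama" and os.getenv("OLLAMA_TIMEOUT"):
        return int(os.environ["OLLAMA_TIMEOUT"])
    return DEFAULT_TIMEOUTS.get(provider, 60)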
3. Concurrency Limits¶
# apps/cpm/llm_client.py
import asyncio

class OllamaClient(BaseLLMClient):
    MAX_CONCURRENT_REQUESTS = 2   # For CPU
    # MAX_CONCURRENT_REQUESTS = 10  # For GPU
    _semaphore = asyncio.Semaphore(MAX_CONCURRENT_REQUESTS)

    async def generate_content_async(self, prompt: str, content_type: str, **kwargs) -> LLMResponse:
        async with self._semaphore:
            # Run the blocking requests-based call in a worker thread
            return await asyncio.to_thread(self.generate_content, prompt, content_type, **kwargs)
Cost Analysis¶
Current Costs (Paid Providers)¶
- OpenAI GPT-4: ~$0.03-0.06 per 1000-word article
- Anthropic Claude: ~$0.02-0.04 per 1000-word article
- Google Gemini: ~$0.01-0.02 per 1000-word article
Ollama/Open-Source Costs¶
- Local Development: $0 (uses developer's machine)
- Railway Deployment: ~$5-20/month (depending on CPU/RAM)
- GPU Cloud: ~$0.50-2.00/hour (when active)
ROI Calculation¶
- Break-even: ~200-500 articles/month on Railway (roughly $10-15/month hosting at $0.03-0.05 per paid-API article)
- Ideal for: Development, testing, low-volume users
- Not ideal for: High-volume production (use paid APIs)
Testing Strategy¶
1. Unit Tests¶
# apps/cpm/tests/test_ollama_client.py
def test_ollama_client_initialization(monkeypatch):
    monkeypatch.delenv('OLLAMA_URL', raising=False)
    monkeypatch.delenv('OLLAMA_API_URL', raising=False)
    client = OllamaClient()
    assert client.base_url == 'http://localhost:11434'

def test_ollama_fallback_selection(monkeypatch):
    # Remove all provider API keys so the factory falls back to Ollama
    for key in ('OPENAI_API_KEY', 'ANTHROPIC_API_KEY', 'GOOGLE_API_KEY', 'GROQ_API_KEY'):
        monkeypatch.delenv(key, raising=False)
    client = LLMClientFactory.create_client()
    assert isinstance(client, OllamaClient)
2. Integration Tests¶
# apps/cpm/tests/test_ollama_integration.py
@pytest.mark.integration
def test_ollama_content_generation():
    client = OllamaClient()
    response = client.generate_content(
        prompt="Write a 100-word test article",
        content_type="blog_post",
        max_tokens=200
    )
    assert len(response.content) > 50
    assert response.cost == 0.0
3. Load Tests¶
# Test concurrent requests
for i in {1..5}; do
  curl -X POST http://localhost:8000/api/generate \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Test article '$i'", "provider": "ollama"}' &
done
wait  # Block until all background requests complete
Migration Path¶
Phase 1: Local Development (Week 1)¶
- Implement OllamaClient class
- Add to LLMClientFactory
- Test with local Ollama installation
- Update documentation
Phase 2: Staging Deployment (Week 2)¶
- Deploy Ollama to Railway staging
- Configure auth proxy
- Test with staging frontend
- Performance benchmarking
Phase 3: Production Rollout (Week 3)¶
- Deploy to production Railway
- Enable feature flag gradually
- Monitor performance metrics
- Gather user feedback
Phase 4: Optimization (Week 4+)¶
- Fine-tune model selection
- Optimize quantization levels
- Consider GPU upgrade if needed
- Add caching layer
Security Considerations¶
1. Authentication¶
- Never expose Ollama directly to the internet
- Front it with an auth proxy (e.g., the nginx basic-auth configuration above) or a token check
- Rate limit at the API gateway level
2. Input Validation¶
- Sanitize prompts before sending to Ollama
- Limit max_tokens to prevent abuse (see the sketch after this list)
- Implement request queuing
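A minimal sketch of the `max_tokens` clamp mentioned above (the ceiling value and helper name are assumptions, not existing code):

# Sketch: clamping caller-supplied max_tokens before forwarding to Ollama
MAX_TOKENS_CEILING = 4000  # assumed ceiling, not an existing constant

def clamp_max_tokens(requested: int) -> int:
    """Keep generation requests within a bounded budget regardless of caller input."""
    return max(1, min(requested, MAX_TOKENS_CEILING))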
3. Monitoring¶
# apps/cpm/monitoring.py
import logging
import os
import requests

logger = logging.getLogger(__name__)

class OllamaMonitor:
    def __init__(self, base_url: str = None):
        self.base_url = base_url or os.getenv('OLLAMA_URL', 'http://localhost:11434')

    def track_request(self, prompt_length: int, response_time: float):
        # Log to monitoring service
        logger.info(f"Ollama request: {prompt_length} chars, {response_time}s")

    def check_health(self) -> bool:
        # Health check against the Ollama model-listing endpoint
        try:
            response = requests.get(f"{self.base_url}/api/tags", timeout=5)
            return response.status_code == 200
        except requests.RequestException:
            return False
Alternatives Considered¶
1. Hugging Face Inference API¶
- Pros: Many models, managed infrastructure
- Cons: Rate limits on free tier, less control
2. Replicate.com¶
- Pros: Pay-per-use, many models
- Cons: Not truly free, requires API key
3. Self-hosted vLLM¶
- Pros: Better performance than Ollama
- Cons: More complex setup, requires GPU
4. LocalAI¶
- Pros: OpenAI-compatible, supports multiple model formats
- Cons: Less mature than Ollama
Recommendation¶
For HG Content System:¶
- Start with Ollama + Mistral-7B for simplicity and compatibility
- Deploy on Railway for production (CPU initially)
- Use as fallback when no API keys configured
- Monitor usage and upgrade to GPU if volume justifies
- Keep existing provider architecture for seamless switching
Success Metrics:¶
- Zero-cost content generation available
- <60 second generation time for typical articles
- 99% uptime for Railway service
- Seamless fallback when API keys absent
Conclusion¶
Integrating Ollama provides a valuable free tier for the HG Content System while maintaining the flexibility to use premium providers. The proposed architecture requires minimal changes to existing code and provides a smooth upgrade path as usage grows.
The key is treating Ollama as just another LLM provider in the existing abstraction, ensuring zero disruption to current users while enabling new use cases for budget-conscious users and developers.