Open-Source AI Model Integration Plan for HG Content System¶
Executive Summary¶
This plan covers adding Ollama running Mistral-7B as a free, no-API-key fallback for content generation in the HG Content Generation System. It gives users a zero-cost option for testing and development while keeping the existing multi-provider architecture intact.
Current Architecture Analysis¶
Existing System Components¶
- Content Production Module (CPM) - Python FastAPI service at `apps/cpm/`
  - Currently supports: OpenAI, Anthropic, Google Gemini, Groq
  - Uses `BaseLLMClient` abstraction in `apps/cpm/llm_client.py`
  - Handles content generation via `/api/generate` endpoint
- Instructions Module (IM) - Prompt generation service at `apps/im/`
  - Generates prompts based on templates and client settings
  - Returns structured instructions for content generation
- Frontend - Next.js application at `apps/frontend/`
  - Hosted on Vercel
  - Communicates with backend services via API
- External API - New service at `apps/external/`
  - Provides programmatic access for external clients
  - Uses same backend infrastructure
Integration Architecture¶
Proposed Setup¶
┌─────────────┐     ┌──────────────┐     ┌─────────────────┐
│   Vercel    │────▶│   Railway    │────▶│  Railway/Cloud  │
│  (Frontend) │     │    (CPM)     │     │    (Ollama)     │
└─────────────┘     └──────────────┘     └─────────────────┘
                           │
                           ▼
                    ┌──────────────┐
                    │   Existing   │
                    │ LLM Providers│
                    └──────────────┘
Key Design Decisions¶
1. Ollama Deployment Strategy¶
- Development: Run Ollama locally on port 11434
- Production: Deploy Ollama on Railway with persistent volume
- Alternative: Use a dedicated GPU cloud provider for better performance
2. Integration Point¶
- Add `OllamaClient` to the existing `llm_client.py` infrastructure
- Maintain compatibility with the existing `BaseLLMClient` interface
- Use the OpenAI-compatible endpoint for easier integration
3. No-Streaming Approach¶
- Content generation doesn't require streaming (not a chat app)
- Use synchronous responses with appropriate timeouts
- Simpler implementation and error handling
Implementation Plan¶
Phase 1: Add Ollama Client to CPM¶
1.1 Create OllamaClient Class¶
# apps/cpm/llm_client.py
import os
import requests

class OllamaClient(BaseLLMClient):
    """Client for Ollama/Mistral open-source models."""

    def __init__(self, base_url: str = None):
        # Prefer OLLAMA_URL; fall back to legacy var for compatibility
        self.base_url = base_url or os.getenv('OLLAMA_URL') or os.getenv('OLLAMA_API_URL', 'http://localhost:11434')
        self.model_name = os.getenv('OLLAMA_MODEL', 'mistral:7b-instruct')

    def generate_content(
        self,
        prompt: str,
        content_type: str,
        max_tokens: int = 2000,
        temperature: float = 0.7
    ) -> LLMResponse:
        """Generate content using Ollama's OpenAI-compatible endpoint."""
        response = requests.post(
            f"{self.base_url}/v1/chat/completions",
            json={
                "model": self.model_name,
                "messages": [
                    {"role": "system", "content": self._get_system_prompt(content_type)},
                    {"role": "user", "content": prompt}
                ],
                "temperature": temperature,
                "max_tokens": max_tokens,
                "stream": False
            },
            headers={
                "Content-Type": "application/json",
                "Authorization": "Bearer ollama"  # Dummy token
            },
            timeout=120  # 2 minutes for long-form content
        )

        if response.status_code != 200:
            raise LLMError(f"Ollama error: {response.text}")

        data = response.json()
        content = data['choices'][0]['message']['content']

        return LLMResponse(
            content=content,
            model=self.model_name,
            provider="ollama",
            usage=self._estimate_usage(prompt, content),
            cost=0.0  # Free!
        )
1.2 Update LLMClientFactory¶
# apps/cpm/llm_client.py
class LLMClientFactory:
    @staticmethod
    def create_client(provider: str = None) -> BaseLLMClient:
        # Auto-detect provider based on available keys
        if not provider:
            if os.getenv('OPENAI_API_KEY'):
                provider = 'openai'
            elif os.getenv('ANTHROPIC_API_KEY'):
                provider = 'anthropic'
            elif os.getenv('GOOGLE_API_KEY'):
                provider = 'google'
            elif os.getenv('GROQ_API_KEY'):
                provider = 'groq'
            else:
                provider = 'ollama'  # Free fallback
        if provider == 'ollama':
            return OllamaClient()
        # ... existing providers
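For illustration, a caller such as the `/api/generate` handler would then obtain and use a client as sketched below; only `LLMClientFactory` and `generate_content` come from the excerpt above, the surrounding handler wiring is assumed:

# Sketch: consuming the factory from the /api/generate path (handler wiring is assumed)
client = LLMClientFactory.create_client()   # auto-detects a provider; Ollama when no keys are set
result = client.generate_content(
    prompt="Write a short product update",
    content_type="blog_post",
)
# result.provider == "ollama" and result.cost == 0.0 when the fallback was used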
Phase 2: Railway Deployment¶
2.1 Dockerfile for Ollama Service¶
# deployment/ollama/Dockerfile
FROM ollama/ollama:latest
# Pre-pull models on startup
COPY entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh
EXPOSE 11434
ENTRYPOINT ["/entrypoint.sh"]
2.2 Entrypoint Script¶
#!/bin/sh
# deployment/ollama/entrypoint.sh
set -e
# Start the server in the background so models can be pulled through it
ollama serve &
SERVER_PID=$!
# Wait for the API to become available before pulling
until ollama list >/dev/null 2>&1; do sleep 1; done
# Pull models (quantized builds for CPU)
ollama pull mistral:7b-instruct-q4_K_M
ollama pull llama2:7b-q4_K_M  # Alternative option
# Keep the server in the foreground
wait $SERVER_PID
2.3 Railway Configuration¶
# railway.toml (for Ollama service)
[build]
builder = "DOCKERFILE"
dockerfilePath = "deployment/ollama/Dockerfile"
[deploy]
startCommand = "/entrypoint.sh"
restartPolicyType = "ON_FAILURE"
restartPolicyMaxRetries = 3
[[services]]
name = "ollama"
port = 11434
[[volumes]]
mount = "/root/.ollama"
size = "50Gi" # For model storage
Phase 3: Environment Configuration¶
3.1 Update .env.example¶
# ======================================
# OLLAMA CONFIGURATION (Open-Source Models)
# ======================================
# For local development
OLLAMA_URL=http://localhost:11434
# For production (Railway deployment)
# OLLAMA_URL=https://ollama-service.up.railway.app
# Model selection
OLLAMA_MODEL=mistral:7b-instruct-q4_K_M # Quantized for CPU
# OLLAMA_MODEL=llama2:7b # Alternative
# Enable Ollama as fallback when no API keys present
ENABLE_OLLAMA_FALLBACK=true
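The `ENABLE_OLLAMA_FALLBACK` flag is meant to gate the factory's automatic fallback. A minimal sketch of the check, assuming `_ollama_fallback_enabled` is a new helper rather than existing code:

# apps/cpm/llm_client.py (sketch; _ollama_fallback_enabled is an assumed helper)
import os

def _ollama_fallback_enabled() -> bool:
    """True when Ollama may be selected automatically in the absence of API keys."""
    return os.getenv('ENABLE_OLLAMA_FALLBACK', 'true').lower() in ('1', 'true', 'yes')

With the flag disabled, `LLMClientFactory.create_client` would raise a configuration error instead of silently routing requests to Ollama.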
3.2 Provider Status Endpoint (CPM)¶
The CPM exposes a simple status endpoint for checking Ollama availability: it probes `OLLAMA_URL` via `/api/tags` or `/api/version` and returns 200 when the server is reachable, 503 otherwise.
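A minimal FastAPI sketch of such a route, assuming a new router module (the file path, router wiring, and probe details are illustrative, not existing code):

# apps/cpm/routes/providers.py (hypothetical module)
import os
import requests
from fastapi import APIRouter, Response

router = APIRouter()

@router.get("/api/providers/ollama/status")
def ollama_status(response: Response):
    """Report whether the configured Ollama server is reachable."""
    base_url = os.getenv("OLLAMA_URL", "http://localhost:11434")
    try:
        online = requests.get(f"{base_url}/api/version", timeout=5).status_code == 200
    except requests.RequestException:
        online = False
    response.status_code = 200 if online else 503
    return {"online": online}

The Phase 4 frontend component reads `data.online` from this response.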
3.3 Security Configuration¶
# deployment/ollama/nginx.conf
# Simple auth proxy for Railway deployment
server {
listen 80;
location / {
auth_basic "Ollama API";
auth_basic_user_file /etc/nginx/.htpasswd;
proxy_pass http://localhost:11434;
proxy_set_header Host $host;
proxy_read_timeout 300s;
}
}
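Note that the Phase 1 client sends a dummy Bearer token, which this basic-auth proxy would reject. A sketch of supplying credentials from the CPM side, with `OLLAMA_PROXY_USER`/`OLLAMA_PROXY_PASSWORD` as assumed new variable names:

# Sketch: authenticating against the nginx basic-auth proxy (env var names are assumptions)
import os
import requests

def _ollama_auth():
    """Return (user, password) for requests' basic auth, or None when Ollama is reached directly."""
    user = os.getenv('OLLAMA_PROXY_USER')
    password = os.getenv('OLLAMA_PROXY_PASSWORD')
    return (user, password) if user and password else None

# Inside OllamaClient.generate_content the call would become:
# requests.post(url, json=payload, auth=_ollama_auth(), timeout=120)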
Phase 4: Frontend Integration¶
4.1 Update Provider Selection UI¶
// apps/frontend/components/settings/LLMProviderSelector.tsx
export function LLMProviderSelector({ client, onUpdate }) {
  const hasApiKeys = useApiKeys() // Check which keys are configured

  return (
    <Select value={client.llm_provider} onValueChange={onUpdate}>
      <SelectItem value="openai" disabled={!hasApiKeys.openai}>
        OpenAI {!hasApiKeys.openai && '(No API Key)'}
      </SelectItem>
      <SelectItem value="anthropic" disabled={!hasApiKeys.anthropic}>
        Anthropic {!hasApiKeys.anthropic && '(No API Key)'}
      </SelectItem>
      {/* ... other providers */}
      <SelectItem value="ollama">
        Ollama (Free - Open Source)
      </SelectItem>
    </Select>
  )
}
4.2 Add Provider Info Component¶
// apps/frontend/components/settings/OllamaInfo.tsx
export function OllamaInfo() {
  const [status, setStatus] = useState<'checking' | 'online' | 'offline'>('checking')

  useEffect(() => {
    fetch('/api/providers/ollama/status')
      .then(r => r.json())
      .then(data => setStatus(data.online ? 'online' : 'offline'))
      .catch(() => setStatus('offline'))
  }, [])

  return (
    <Alert>
      <InfoIcon className="h-4 w-4" />
      <AlertTitle>Open-Source Model (Ollama)</AlertTitle>
      <AlertDescription>
        Free content generation using Mistral-7B.
        Status: {status === 'online' ? '🟢 Available' : '🔴 Unavailable'}
        {status === 'offline' && (
          <div className="mt-2 text-sm">
            To use Ollama locally, run: <code>ollama run mistral</code>
          </div>
        )}
      </AlertDescription>
    </Alert>
  )
}
Performance Considerations¶
1. Model Selection for Production¶
For CPU (Railway default):
- mistral:7b-instruct-q4_K_M (4-bit quantized, ~4GB RAM)
- llama2:7b-q4_K_M (4-bit quantized, ~4GB RAM)
For GPU (if using GPU cloud):
- mistral:7b-instruct (full precision, ~15GB VRAM)
- mixtral:8x7b (MoE model, better quality, ~48GB VRAM)
2. Response Time Expectations¶
- CPU inference: 20-60 seconds for 1000-word article
- GPU inference: 5-15 seconds for 1000-word article
- Adjust timeouts accordingly
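Given this spread, the request timeout is best chosen per provider rather than hard-coded. A minimal sketch, where `OLLAMA_TIMEOUT` is an assumed new variable rather than existing config:

# Sketch: per-provider request timeouts (OLLAMA_TIMEOUT is an assumed variable)
import os

DEFAULT_TIMEOUTS = {"ollama": 120, "openai": 60, "anthropic": 60, "google": 60, "groq": 30}

def request_timeout(provider: str) -> int:
    """CPU-backed Ollama needs a far larger budget than the hosted APIs."""
    if provider == "ollama" and os.getenv("OLLAMA_TIMEOUT"):
        return int(os.environ["OLLAMA_TIMEOUT"])
    return DEFAULT_TIMEOUTS.get(provider, 60)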
3. Concurrency Limits¶
# apps/cpm/llm_client.py
import asyncio

class OllamaClient(BaseLLMClient):
    MAX_CONCURRENT_REQUESTS = 2   # For CPU
    # MAX_CONCURRENT_REQUESTS = 10  # For GPU
    _semaphore = asyncio.Semaphore(MAX_CONCURRENT_REQUESTS)

    async def generate_content_async(self, prompt: str, content_type: str, **kwargs) -> LLMResponse:
        async with self._semaphore:
            # Run the blocking requests-based call in a worker thread
            return await asyncio.to_thread(self.generate_content, prompt, content_type, **kwargs)
Cost Analysis¶
Current Costs (Paid Providers)¶
- OpenAI GPT-4: ~$0.03-0.06 per 1000-word article
- Anthropic Claude: ~$0.02-0.04 per 1000-word article
- Google Gemini: ~$0.01-0.02 per 1000-word article
Ollama/Open-Source Costs¶
- Local Development: $0 (uses developer's machine)
- Railway Deployment: ~$5-20/month (depending on CPU/RAM)
- GPU Cloud: ~$0.50-2.00/hour (when active)
ROI Calculation¶
- Break-even: ~200-500 articles/month on Railway (roughly $10-15/month hosting at $0.03-0.05 per paid-API article)
- Ideal for: Development, testing, low-volume users
- Not ideal for: High-volume production (use paid APIs)
Testing Strategy¶
1. Unit Tests¶
# apps/cpm/tests/test_ollama_client.py
def test_ollama_client_initialization(monkeypatch):
    monkeypatch.delenv('OLLAMA_URL', raising=False)
    monkeypatch.delenv('OLLAMA_API_URL', raising=False)
    client = OllamaClient()
    assert client.base_url == 'http://localhost:11434'

def test_ollama_fallback_selection(monkeypatch):
    # Remove all provider API keys so the factory falls back to Ollama
    for key in ('OPENAI_API_KEY', 'ANTHROPIC_API_KEY', 'GOOGLE_API_KEY', 'GROQ_API_KEY'):
        monkeypatch.delenv(key, raising=False)
    client = LLMClientFactory.create_client()
    assert isinstance(client, OllamaClient)
2. Integration Tests¶
# apps/cpm/tests/test_ollama_integration.py
@pytest.mark.integration
def test_ollama_content_generation():
    client = OllamaClient()
    response = client.generate_content(
        prompt="Write a 100-word test article",
        content_type="blog_post",
        max_tokens=200
    )
    assert len(response.content) > 50
    assert response.cost == 0.0
3. Load Tests¶
# Test concurrent requests
for i in {1..5}; do
  curl -X POST http://localhost:8000/api/generate \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Test article '$i'", "provider": "ollama"}' &
done
wait  # Block until all background requests complete
Migration Path¶
Phase 1: Local Development (Week 1)¶
- Implement OllamaClient class
- Add to LLMClientFactory
- Test with local Ollama installation
- Update documentation
Phase 2: Staging Deployment (Week 2)¶
- Deploy Ollama to Railway staging
- Configure auth proxy
- Test with staging frontend
- Performance benchmarking
Phase 3: Production Rollout (Week 3)¶
- Deploy to production Railway
- Enable feature flag gradually
- Monitor performance metrics
- Gather user feedback
Phase 4: Optimization (Week 4+)¶
- Fine-tune model selection
- Optimize quantization levels
- Consider GPU upgrade if needed
- Add caching layer
Security Considerations¶
1. Authentication¶
- Never expose Ollama directly to the internet
- Front it with an auth proxy (e.g., the nginx basic-auth configuration above) or a token check
- Rate limit at the API gateway level
2. Input Validation¶
- Sanitize prompts before sending to Ollama
- Limit max_tokens to prevent abuse (see the sketch after this list)
- Implement request queuing
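A minimal sketch of the `max_tokens` clamp mentioned above (the ceiling value and helper name are assumptions, not existing code):

# Sketch: clamping caller-supplied max_tokens before forwarding to Ollama
MAX_TOKENS_CEILING = 4000  # assumed ceiling, not an existing constant

def clamp_max_tokens(requested: int) -> int:
    """Keep generation requests within a bounded budget regardless of caller input."""
    return max(1, min(requested, MAX_TOKENS_CEILING))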
3. Monitoring¶
# apps/cpm/monitoring.py
import logging
import os
import requests

logger = logging.getLogger(__name__)

class OllamaMonitor:
    def __init__(self, base_url: str = None):
        self.base_url = base_url or os.getenv('OLLAMA_URL', 'http://localhost:11434')

    def track_request(self, prompt_length: int, response_time: float):
        # Log to monitoring service
        logger.info(f"Ollama request: {prompt_length} chars, {response_time}s")

    def check_health(self) -> bool:
        # Health check against the Ollama model-listing endpoint
        try:
            response = requests.get(f"{self.base_url}/api/tags", timeout=5)
            return response.status_code == 200
        except requests.RequestException:
            return False
Alternatives Considered¶
1. Hugging Face Inference API¶
- Pros: Many models, managed infrastructure
- Cons: Rate limits on free tier, less control
2. Replicate.com¶
- Pros: Pay-per-use, many models
- Cons: Not truly free, requires API key
3. Self-hosted vLLM¶
- Pros: Better performance than Ollama
- Cons: More complex setup, requires GPU
4. LocalAI¶
- Pros: OpenAI-compatible, supports multiple model formats
- Cons: Less mature than Ollama
Recommendation¶
For HG Content System:¶
- Start with Ollama + Mistral-7B for simplicity and compatibility
- Deploy on Railway for production (CPU initially)
- Use as fallback when no API keys configured
- Monitor usage and upgrade to GPU if volume justifies
- Keep existing provider architecture for seamless switching
Success Metrics:¶
- Zero-cost content generation available
- <60 second generation time for typical articles
- 99% uptime for Railway service
- Seamless fallback when API keys absent
Conclusion¶
Integrating Ollama provides a valuable free tier for the HG Content System while maintaining the flexibility to use premium providers. The proposed architecture requires minimal changes to existing code and provides a smooth upgrade path as usage grows.
The key is treating Ollama as just another LLM provider in the existing abstraction, ensuring zero disruption to current users while enabling new use cases for budget-conscious users and developers.