Day 8: Running LLMs Locally with Ollama & LM Studio

“The most powerful AI model is the one you fully control — running on your own hardware, your own network, with your own data never leaving the room.”

Why This Matters

For the first three years of the LLM era, running a capable AI model required API keys, internet connectivity, and per-token billing. In 2026, that assumption is broken.

You can now run models on your laptop that match or exceed GPT-3.5 performance — completely offline, completely free after hardware costs, with zero data leaving your machine.

This matters for several concrete reasons:

Privacy. Medical records, legal documents, internal code, personal journals: many of the most valuable use cases for AI are also the ones where sending data to a third-party cloud is not an option. Local inference removes that obstacle because the data never leaves your machine.

Cost. At scale, API costs dominate AI product economics. A startup processing millions of documents per month can spend $50,000+ on API fees that could instead run on $10,000 of on-premises hardware.

Latency. Cloud API round trips add 200ms–2s of network latency. A local model running on a GPU server in the same datacenter avoids that round trip, so response time is bounded by the hardware rather than the network.

Control. No rate limits. No model deprecation forcing you to migrate. No terms of service updates. No vendor lock-in.

The tools that made local inference practical for individual developers — Ollama and LM Studio — are the focus of today’s article.

Part 1: Understanding Model Quantization

Before installing anything, you need to understand the concept that makes local LLMs possible on consumer hardware: quantization.

What is Quantization?

At full precision, an LLM stores each parameter (weight) as a 32-bit floating-point number. A 7 billion parameter model therefore requires:

7,000,000,000 × 4 bytes (FP32) = 28 GB

That’s more than most consumer GPUs hold. But we don’t need FP32 precision for inference. Research shows you can represent weights with much lower precision with minimal quality loss:

Precision      Bits per weight   7B model size   Quality loss
──────────────────────────────────────────────────────────────
FP32           32 bits           28 GB           0% (reference)
FP16           16 bits           14 GB           ~0%
GGUF Q8_0      8 bits            7 GB            ~0.1%
GGUF Q6_K      6 bits            5.5 GB          ~0.2%
GGUF Q5_K_M    5 bits            4.8 GB          ~0.5%
GGUF Q4_K_M    4 bits            4.1 GB          ~1%     ← Sweet spot
GGUF Q3_K_M    3 bits            3.3 GB          ~2-3%
GGUF Q2_K      2 bits            2.7 GB          ~5%+    ← Avoid

GGUF (GPT-Generated Unified Format) is the file format used by llama.cpp and all tools built on it (Ollama, LM Studio). It packages the quantized weights and metadata into a single file.
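
The size column in the table above follows from simple arithmetic: parameters times bits per weight, divided by eight. The sketch below reproduces it; `model_size_gb` is a hypothetical helper written for this article, not part of any library. It uses the nominal bit width, so it slightly underestimates real GGUF files, which also store per-block scaling data alongside the weights (hence 4.1 GB rather than 3.5 GB for Q4_K_M).

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Nominal sizes for a 7B-parameter model at several precisions
for name, bits in [("FP32", 32), ("FP16", 16), ("Q8_0", 8), ("Q4 (nominal)", 4)]:
    print(f"{name:<14}{model_size_gb(7e9, bits):5.1f} GB")
```

Comparing rows gives the compression ratio directly: FP16 to nominal 4-bit is a 4x reduction, and FP32 to nominal 4-bit is 8x.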

The K-Quant System

The K in Q4_K_M means the quantization uses a per-block method that’s more accurate than simple per-tensor quantization. The letter after _K_ indicates the size variant:

  • S = Small (more aggressive quantization, smaller)
  • M = Medium ← Most common recommendation
  • L = Large (less aggressive, slightly larger but higher quality)
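
To see why per-block scaling helps, here is a deliberately simplified sketch in pure Python. It is not the actual K-quant algorithm used by llama.cpp; it only compares one scale for a whole tensor against one scale per block of 32 weights, using round-to-nearest 4-bit quantization. The function names, the block size, and the synthetic weights (including the single outlier) are all assumptions chosen for illustration.

```python
import random


def quantize_dequantize(values, bits=4):
    """Symmetric round-to-nearest quantization with ONE scale for all values."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit
    scale = max(abs(v) for v in values) / qmax
    if scale == 0:
        return list(values)
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in values]


def per_block(values, bits=4, block_size=32):
    """Quantize each block of `block_size` values with its own scale."""
    out = []
    for start in range(0, len(values), block_size):
        out.extend(quantize_dequantize(values[start:start + block_size], bits))
    return out


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
weights = [rng.gauss(0, 0.02) for _ in range(4096)]
weights[0] = 1.0  # a single outlier, added to make the effect visible

err_tensor = mse(weights, quantize_dequantize(weights))  # one scale overall
err_block = mse(weights, per_block(weights))             # one scale per block
print(f"per-tensor MSE: {err_tensor:.2e}")
print(f"per-block  MSE: {err_block:.2e}")
```

With a single scale, the outlier stretches the quantization grid so far that most small weights collapse to zero. With per-block scales, only the block containing the outlier pays that price.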

Practical Recommendation

For most use cases: Q4_K_M. It reduces model size by ~3.4x vs FP16 (~7x vs FP32) while maintaining ~99% of the original quality. This is the format Ollama uses by default for most models.

Part 2: Hardware Requirements

The RAM Rule

The dominant constraint for local LLMs is memory. You need enough RAM + VRAM (combined) to hold the model. As a rule of thumb:

Model Size    Q4_K_M Required   Min GPU VRAM       Min RAM (CPU only)
──────────────────────────────────────────────────────────────────────
1B params     ~0.8 GB           4 GB               8 GB RAM
3B params     ~2 GB             4 GB               8 GB RAM
7B params     ~4.1 GB           6 GB               16 GB RAM
13B params    ~7.4 GB           8 GB               32 GB RAM
32B params    ~19 GB            24 GB              64 GB RAM
70B params    ~40 GB            48 GB (2× 24GB)    64 GB RAM

Key principle: If the model fits in VRAM, inference is GPU-accelerated (fast). If it doesn’t, it falls back to CPU/RAM (slow but functional).
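
The rule of thumb can be written as a small check. This is a sketch under stated assumptions: `placement` is a hypothetical helper, and the default `overhead_gb` of 0.5 is a placeholder for the KV cache and runtime buffers, which in practice grow with context length.

```python
def placement(
    model_gb: float,
    vram_gb: float,
    ram_gb: float,
    overhead_gb: float = 0.5,  # assumed allowance for KV cache and buffers
) -> str:
    """Classify where a model of `model_gb` can run on the given machine."""
    needed = model_gb + overhead_gb
    if needed <= vram_gb:
        return "gpu"        # fits entirely in VRAM: fast
    if needed <= vram_gb + ram_gb:
        return "cpu"        # spills into system RAM: slow but functional
    return "too large"      # does not fit in RAM + VRAM combined


print(placement(4.1, vram_gb=6, ram_gb=16))    # 7B at Q4_K_M on a 6 GB GPU
print(placement(19.0, vram_gb=12, ram_gb=32))  # 32B at Q4_K_M on a 12 GB GPU
print(placement(40.0, vram_gb=8, ram_gb=16))   # 70B at Q4_K_M on a small machine
```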

GPU Recommendations by Budget

Budget    GPU                VRAM      Best Local Models
───────────────────────────────────────────────────────────────────────
$0        MacBook M-series   Unified   Llama 3.2 3B, Phi-3 Mini
$200      RTX 3060           12 GB     Llama 3.1 8B, Mistral 7B
$400      RTX 3080           10 GB     Llama 3.1 8B, Qwen 2.5 7B
$800      RTX 4070           12 GB     Llama 3.1 8B, Qwen 2.5 Coder 7B
$1500     RTX 4090           24 GB     Qwen 2.5 32B Q4
$3000+    RTX 6000 Ada       48 GB     Llama 3.1 70B Q4
$8000+    2× A6000           96 GB     Llama 3.1 70B Q8, Qwen 72B Q4

Apple Silicon Advantage

Apple M-series chips use unified memory (shared between CPU and GPU), giving them a unique advantage for local LLMs:

  • M1 16GB: Runs 7B models well, 13B models slowly
  • M2/M3 32GB: Runs 13B models excellently, 34B models adequately
  • M3 Max 128GB: Runs 70B models well

Ollama has native Apple Silicon support with Metal GPU acceleration. LM Studio also has native macOS builds. For Mac users, local LLMs are effectively zero-configuration.

Part 3: Ollama

Ollama is the dominant tool for local LLM inference as of 2026. It provides:

  • A simple CLI to pull, run, and manage models
  • An HTTP server with an OpenAI-compatible API (drop-in replacement)
  • Automatic GPU detection and acceleration
  • A curated model library with 100+ models

Installation

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows
# Download installer from: https://ollama.com/download

# Verify installation
ollama --version
# ollama version 0.30.x

# Ollama starts as a background service automatically
# Check it's running:
curl http://localhost:11434/api/version

Pulling and Running Models

# Pull a model (downloads to ~/.ollama/models/)
ollama pull llama3.2 # 3B — very fast, good for simple tasks
ollama pull llama3.1:8b # 8B — excellent general purpose
ollama pull llama3.1:70b # 70B — near-GPT-4 quality (needs 48GB+ VRAM)
ollama pull mistral # Mistral 7B — fast and capable
ollama pull qwen2.5:7b # Alibaba Qwen 2.5 7B — strong on code
ollama pull deepseek-r1:7b # DeepSeek R1 7B — reasoning model
ollama pull phi4 # Microsoft Phi-4 — 14B, punches above weight
ollama pull nomic-embed-text # Embedding model for RAG
ollama pull codellama:13b # Meta's code-specialized model

# Run interactively
ollama run llama3.1:8b

# Single-turn from CLI
ollama run llama3.1:8b "What is the difference between TCP and UDP?"

# List downloaded models
ollama list

# Remove a model
ollama rm llama3.2

# Show model info (parameters, quantization, etc.)
ollama show llama3.1:8b

The Ollama API

Once Ollama is running, it exposes an HTTP API on localhost:11434. There are two API styles:

1. Native Ollama API:

import requests
import json
from typing import Optional

OLLAMA_BASE = "http://localhost:11434"

def chat_ollama(
    prompt: str,
    model: str = "llama3.1:8b",
    system: Optional[str] = None,
    stream: bool = False,
) -> str:
    """Call Ollama native chat API."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})

    response = requests.post(
        f"{OLLAMA_BASE}/api/chat",
        json={
            "model": model,
            "messages": messages,
            "stream": stream,
            "options": {
                "temperature": 0.7,
                "top_p": 0.9,
                "num_ctx": 8192,      # Context window size
                "num_predict": 2048,  # Max tokens to generate
            },
        },
        stream=stream,
        timeout=120,
    )
    response.raise_for_status()

    if stream:
        # Streaming responses arrive as one JSON object per line
        full_response = []
        for line in response.iter_lines():
            if line:
                chunk = json.loads(line)
                if chunk.get("message", {}).get("content"):
                    content = chunk["message"]["content"]
                    print(content, end="", flush=True)
                    full_response.append(content)
                if chunk.get("done"):
                    print()  # newline
                    break
        return "".join(full_response)
    else:
        return response.json()["message"]["content"]


# Usage
response = chat_ollama(
    "Explain quicksort in Python with code.",
    model="llama3.1:8b",
    system="You are a concise technical educator. Always include runnable code.",
)
print(response)

# Streaming
chat_ollama(
    "Write a haiku about machine learning",
    model="llama3.1:8b",
    stream=True,
)
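
The streaming branch above relies on the response being newline-delimited JSON, with each line carrying a `message.content` fragment and a `done` flag. The parsing logic can be isolated and exercised without a running server, as in this minimal sketch. `iter_stream_content` is a hypothetical helper, and the sample lines are hand-written stand-ins shaped like the chunks the code above reads, not captured server output.

```python
import json
from typing import Iterable, Iterator


def iter_stream_content(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from newline-delimited JSON chat chunks."""
    for line in lines:
        if not line:
            continue  # skip blank keep-alive lines
        chunk = json.loads(line)
        content = chunk.get("message", {}).get("content")
        if content:
            yield content
        if chunk.get("done"):
            break  # final chunk: stop reading


# Hand-written sample chunks (assumed shape, for illustration only)
sample = [
    b'{"message": {"role": "assistant", "content": "Hello"}, "done": false}',
    b"",
    b'{"message": {"role": "assistant", "content": ", world"}, "done": false}',
    b'{"message": {"role": "assistant", "content": ""}, "done": true}',
]

text = "".join(iter_stream_content(sample))
print(text)
```

In real use, the same generator could be fed `response.iter_lines()` from the `requests` call above.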

2. OpenAI-Compatible API (recommended for production):

Ollama exposes an OpenAI-compatible endpoint at /v1/. This means you can use the official openai Python library with almost no code changes: just point base_url at localhost.

from openai import OpenAI

# Drop-in replacement: just change base_url
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Required by the library but ignored by Ollama
)

def chat_local(
    prompt: str,
    model: str = "llama3.1:8b",
    system: str = "You are a helpful assistant.",
    temperature: float = 0.7,
    max_tokens: int = 2048,
) -> str:
    """Chat with a local Ollama model using the OpenAI SDK."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ],
        temperature=temperature,
        max_tokens=max_tokens,
    )
    return response.choices[0].message.content


# Works exactly like the OpenAI API
result = chat_local(
    "What are the SOLID principles in software engineering?",
    model="llama3.1:8b",
)
print(result)

# Streaming also works identically
stream = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "List 5 Python best practices."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()

Generating Embeddings with Ollama

from openai import OpenAI
from typing import List
import numpy as np

embedding_client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",
)

def embed_text(texts: List[str], model: str = "nomic-embed-text") -> List[List[float]]:
    """
    Generate embeddings locally using Ollama.

    nomic-embed-text: 768 dimensions, good general-purpose embeddings
    mxbai-embed-large: 1024 dimensions, higher quality
    all-minilm: 384 dimensions, very fast
    """
    response = embedding_client.embeddings.create(
        model=model,
        input=texts,
    )
    return [item.embedding for item in response.data]


# Example: similarity search without cloud
texts = [
    "Python is a high-level programming language",
    "Machine learning uses statistical techniques",
    "Snakes are reptiles found worldwide",
    "Neural networks are inspired by the brain",
]

embeddings = embed_text(texts)

# Cosine similarity
def cosine_similarity(a: List[float], b: List[float]) -> float:
    a_np, b_np = np.array(a), np.array(b)
    return float(np.dot(a_np, b_np) / (np.linalg.norm(a_np) * np.linalg.norm(b_np)))

query = "deep learning algorithms"
query_embedding = embed_text([query])[0]

scores = [(texts[i], cosine_similarity(query_embedding, embeddings[i])) for i in range(len(texts))]
scores.sort(key=lambda x: x[1], reverse=True)

for text, score in scores:
    print(f"{score:.3f}: {text}")
# Illustrative output (exact scores vary by model and version):
# 0.724: Neural networks are inspired by the brain
# 0.689: Machine learning uses statistical techniques
# 0.423: Python is a high-level programming language
# 0.198: Snakes are reptiles found worldwide

Custom Modelfiles: Customizing Behavior

Ollama’s Modelfile is like a Dockerfile for LLMs — it lets you bake in system prompts, custom parameters, and even your own GGUF files:

# Create a custom Modelfile
cat > Modelfile << 'EOF'
# Base model
FROM llama3.1:8b

# Baked-in system prompt — this applies to every conversation
SYSTEM """
You are Aria, a senior Python engineer at a top tech company.
Your responses are:
- Concise but complete
- Always include working code examples
- Focused on production-quality patterns, not toy examples
- You flag potential issues (performance, security, edge cases)
Never pad responses or explain basic concepts unless asked.
"""

# Generation parameters
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 16384
PARAMETER num_predict 4096
PARAMETER repeat_penalty 1.1
EOF

# Build the custom model
ollama create aria-engineer -f Modelfile

# Run your custom model
ollama run aria-engineer "How do I implement connection pooling in PostgreSQL?"

# List it alongside other models
ollama list

# Modelfile with a custom GGUF (bring your own model)
cat > Modelfile.custom << 'EOF'
# Use a GGUF file you downloaded from Hugging Face
FROM ./models/custom-model-Q4_K_M.gguf

SYSTEM "You are a domain expert in financial analysis."

PARAMETER temperature 0.1
PARAMETER num_ctx 32768
EOF

ollama create finance-expert -f Modelfile.custom
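
If you maintain several custom models, the Modelfile text can be generated rather than hand-written. The sketch below renders only the three directives used above (FROM, SYSTEM, PARAMETER); `render_modelfile` is a hypothetical helper, not part of Ollama, and it assumes the system prompt contains no triple quotes.

```python
from typing import Dict, Union


def render_modelfile(
    base: str,
    system: str,
    parameters: Dict[str, Union[int, float]],
) -> str:
    """Render Modelfile text with FROM, SYSTEM, and PARAMETER directives."""
    lines = [f"FROM {base}", "", 'SYSTEM """', system.strip(), '"""', ""]
    lines += [f"PARAMETER {name} {value}" for name, value in parameters.items()]
    return "\n".join(lines) + "\n"


modelfile_text = render_modelfile(
    base="llama3.1:8b",
    system="You are a domain expert in financial analysis.",
    parameters={"temperature": 0.1, "num_ctx": 32768},
)
print(modelfile_text)
```

Write the result to a file and pass it to `ollama create <name> -f <file>` exactly as in the shell examples above.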

Running Ollama as a Server

For production deployments, you often want Ollama accessible over a network:

# Run Ollama server bound to all interfaces (network accessible).
# Ollama has no built-in authentication, so restrict access with a firewall or reverse proxy.
OLLAMA_HOST=0.0.0.0:11434 ollama serve

# With GPU configuration
CUDA_VISIBLE_DEVICES=0,1 OLLAMA_HOST=0.0.0.0:11434 ollama serve

# Environment variables
OLLAMA_MODELS=/mnt/ssd/models # Custom model storage path
OLLAMA_NUM_PARALLEL=4 # Concurrent requests per model
OLLAMA_MAX_LOADED_MODELS=3 # Models to keep warm in memory
OLLAMA_KEEP_ALIVE=10m # How long to keep model in memory

import httpx
import json
import re
from typing import Optional, List, Dict
from tenacity import retry, stop_after_attempt, wait_exponential


class OllamaClient:
    """
    Production-ready Ollama client.

    Features:
    - Connection pooling
    - Retry logic
    - Health checks
    - Model management
    - Token usage tracking
    - Markdown/code fence cleanup
    - JSON output support
    - Context manager support
    """

    def __init__(
        self,
        base_url: str = "http://localhost:11434",
        default_model: str = "llama3.1:8b",
        timeout: float = 120.0,
    ):
        self.base_url = base_url
        self.default_model = default_model

        # One pooled HTTP client is reused for every request
        self._client = httpx.Client(
            base_url=base_url,
            timeout=timeout,
            limits=httpx.Limits(
                max_connections=20,
                max_keepalive_connections=10,
            ),
        )

        self._total_completion_tokens = 0
        self._total_prompt_tokens = 0
        self._request_count = 0

    # ----------------------------------------------------
    # Context manager support
    # ----------------------------------------------------

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.close()

    # ----------------------------------------------------
    # Utilities
    # ----------------------------------------------------

    def _clean_response(self, text: str) -> str:
        """
        Remove markdown code fences.
        """
        text = text.strip()
        # Strip an opening fence such as ```python at the very start
        text = re.sub(r"^```[a-zA-Z0-9_+-]*\n", "", text)
        # Strip a closing fence at the very end
        text = re.sub(r"\n```$", "", text)
        return text.strip()

    # ----------------------------------------------------
    # Health
    # ----------------------------------------------------

    def is_healthy(self) -> bool:
        """
        Check whether Ollama server is running.
        """
        try:
            resp = self._client.get("/api/version")
            return resp.status_code == 200
        except Exception:
            return False

    # ----------------------------------------------------
    # Models
    # ----------------------------------------------------

    def list_models(self) -> List[str]:
        """
        List installed models.
        """
        resp = self._client.get("/api/tags")
        resp.raise_for_status()
        return [model["name"] for model in resp.json().get("models", [])]

    def pull_model(self, model: str) -> None:
        """
        Download model if not already installed.
        """
        if model in self.list_models():
            print(f"✓ {model} already installed")
            return

        print(f"Pulling {model}...\n")

        with self._client.stream(
            "POST",
            "/api/pull",
            json={"name": model},
        ) as response:
            response.raise_for_status()

            # Progress arrives as one JSON object per line
            for line in response.iter_lines():
                if not line:
                    continue
                data = json.loads(line)
                if "status" in data:
                    print(f"\r{data['status']}", end="", flush=True)

        print(f"\n✓ {model} ready")

    # ----------------------------------------------------
    # Chat
    # ----------------------------------------------------

    @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=1, max=10),
        reraise=True,  # surface the original exception after the last attempt
    )
    def chat(
        self,
        prompt: str,
        model: Optional[str] = None,
        system: Optional[str] = None,
        context_messages: Optional[List[Dict]] = None,
        temperature: float = 0.7,
        clean_output: bool = True,
    ) -> str:
        """
        Standard chat completion.
        """
        model = model or self.default_model

        messages = []
        if system:
            messages.append({"role": "system", "content": system})
        if context_messages:
            messages.extend(context_messages)
        messages.append({"role": "user", "content": prompt})

        resp = self._client.post(
            "/api/chat",
            json={
                "model": model,
                "messages": messages,
                "stream": False,
                "options": {
                    "temperature": temperature,
                },
            },
        )
        resp.raise_for_status()
        data = resp.json()

        # Token counts are reported by the server in the response body
        self._request_count += 1
        self._total_completion_tokens += data.get("eval_count", 0)
        self._total_prompt_tokens += data.get("prompt_eval_count", 0)

        content = data["message"]["content"]
        if clean_output:
            content = self._clean_response(content)
        return content

    # ----------------------------------------------------
    # Structured JSON output
    # ----------------------------------------------------

    def chat_json(
        self,
        prompt: str,
        schema_prompt: str,
        model: Optional[str] = None,
    ) -> dict:
        """
        Force JSON response.
        """
        model = model or self.default_model

        resp = self._client.post(
            "/api/chat",
            json={
                "model": model,
                "messages": [
                    {"role": "system", "content": schema_prompt},
                    {"role": "user", "content": prompt},
                ],
                "stream": False,
                "format": "json",
            },
        )
        resp.raise_for_status()
        data = resp.json()
        return json.loads(data["message"]["content"])

    # ----------------------------------------------------
    # Embeddings
    # ----------------------------------------------------

    def embed(
        self,
        text: str,
        model: str = "nomic-embed-text",
    ) -> List[float]:
        """
        Generate embeddings.
        """
        resp = self._client.post(
            "/api/embeddings",
            json={
                "model": model,
                "prompt": text,
            },
        )
        resp.raise_for_status()
        return resp.json()["embedding"]

    # ----------------------------------------------------
    # Usage stats
    # ----------------------------------------------------

    @property
    def total_tokens_generated(self) -> int:
        return self._total_completion_tokens

    @property
    def total_prompt_tokens(self) -> int:
        return self._total_prompt_tokens

    @property
    def request_count(self) -> int:
        return self._request_count

    # ----------------------------------------------------
    # Cleanup
    # ----------------------------------------------------

    def close(self):
        self._client.close()


# ==========================================================
# Example Usage
# ==========================================================

if __name__ == "__main__":

    with OllamaClient(default_model="llama3.1:8b") as ollama:

        if not ollama.is_healthy():
            raise RuntimeError(
                "Ollama is not running. Start it using:\n\nollama serve"
            )

        print("Available models:", ollama.list_models())

        response = ollama.chat(
            prompt="Write a Python function to validate email addresses.",
            system=(
                "You are a Python expert.\n\n"
                "Rules:\n"
                "- Return raw Python code only.\n"
                "- No markdown.\n"
                "- No triple backticks.\n"
                "- No explanations.\n"
                "- First character must be Python code.\n"
            ),
            temperature=0,
        )

        print("\nResponse:\n")
        print(response)

        print(f"\nCompletion Tokens: {ollama.total_tokens_generated:,}")
        print(f"Prompt Tokens: {ollama.total_prompt_tokens:,}")
        print(f"Requests Made: {ollama.request_count:,}")

Part 4: LM Studio

LM Studio is a GUI application for downloading, managing, and serving local models. It’s ideal for:

  • Non-CLI users who want a visual interface
  • Quickly comparing multiple models without writing code
  • Teams that need a shared local inference server
  • Exploring model capabilities interactively before integrating into code

Installation

Download from lmstudio.ai — available for macOS, Windows, and Linux.

Key Features

Model Browser: LM Studio has a built-in Hugging Face model browser. Search, filter by VRAM requirement, and download GGUF models directly from the app.

Chat Interface: Test models with a full chat UI, adjustable parameters (temperature, context window, system prompt), and response timing metrics.

Local Server: LM Studio runs an OpenAI-compatible server on localhost:1234. Most code written against the OpenAI chat completions API works with LM Studio once you change the base URL.

Using LM Studio’s Server via Python

from openai import OpenAI

# LM Studio's OpenAI-compatible server
lm_client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="lm-studio",  # Placeholder; not validated
)

def chat_lm_studio(
    prompt: str,
    system: str = "You are a helpful assistant.",
    model: str = "local-model",  # Placeholder name for the loaded model
    temperature: float = 0.7,
) -> str:
    """
    Chat with whatever model is loaded in LM Studio.

    Note: With a single model loaded, LM Studio answers with that model
    whatever name is passed here. Behavior may differ between LM Studio
    versions, so when several models are available, prefer an id returned
    by lm_client.models.list().
    """
    response = lm_client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ],
        temperature=temperature,
        max_tokens=2048,
    )
    return response.choices[0].message.content


# Get list of loaded models
models = lm_client.models.list()
print("Models available in LM Studio:")
for model in models.data:
    print(f" - {model.id}")


# Example: technical writing assistant
result = chat_lm_studio(
    "Explain REST vs GraphQL vs gRPC for a senior engineer.",
    system="You are a senior backend architect. Be concise and opinionated.",
)
print(result)

LM Studio Configuration Tips

In the Server Settings panel:

  • Context Length: Set to match your model’s supported context (8192, 16384, 32768)
  • GPU Layers: Set to maximum — this offloads as many transformer layers as possible to GPU
  • CPU Threads: Set to physical core count (not hyperthreads)
  • Flash Attention: Enable if your hardware supports it (M2/M3, RTX 30xx+) — significant speedup

Part 5: Unified Local + Cloud Client

The most production-ready pattern is a unified client that tries local inference first and falls back to a cloud API if the local model isn’t available or the task requires higher capability.
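
Stripped of SDK details, the local-first fallback pattern is a loop over an ordered list of backends. The minimal sketch below uses stand-in functions instead of real clients so it runs without any server or API key; `route`, the `Backend` tuple layout, and the fake backends are all hypothetical names introduced for illustration.

```python
from typing import Callable, List, Tuple

# (name, health check, call) for each backend, in priority order
Backend = Tuple[str, Callable[[], bool], Callable[[str], str]]


def route(prompt: str, backends: List[Backend]) -> Tuple[str, str]:
    """Return (backend_name, response) from the first backend that succeeds."""
    errors = []
    for name, is_available, call in backends:
        if not is_available():
            errors.append(f"{name}: unavailable")
            continue  # skip backends that fail the cheap health check
        try:
            return name, call(prompt)
        except Exception as exc:
            errors.append(f"{name}: {exc}")  # remember why, then fall through
    raise RuntimeError("All backends failed: " + "; ".join(errors))


def broken_local(prompt: str) -> str:
    """Stand-in for a local server that is up but fails mid-request."""
    raise ConnectionError("connection refused")


backends: List[Backend] = [
    ("ollama", lambda: False, lambda p: "unused"),           # not running
    ("lm_studio", lambda: True, broken_local),               # running but failing
    ("cloud", lambda: True, lambda p: f"cloud answer to: {p}"),
]

name, answer = route("ping", backends)
print(name, "->", answer)
```

The full client that follows applies the same idea, adding retries, timeouts, and usage statistics around each call.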

"""
Unified LLM Client
==================

Features:
- Local-first routing (Ollama -> LM Studio -> OpenAI -> Anthropic)
- Connection pooling
- Retry with exponential backoff
- Health checks
- Usage statistics
- Environment variable configuration via .env

Requirements:
pip install openai anthropic httpx python-dotenv tenacity

Example .env:

OPENAI_API_KEY=sk-xxxxxxxxxxxxxxxx
ANTHROPIC_API_KEY=sk-ant-xxxxxxxxxxxxxxxx

OLLAMA_BASE_URL=http://localhost:11434/v1
OLLAMA_API_KEY=ollama

LM_STUDIO_BASE_URL=http://localhost:1234/v1
LM_STUDIO_API_KEY=lm-studio

LOCAL_MODEL=llama3.1:8b
CLOUD_MODEL=gpt-4o-mini

LOCAL_TIMEOUT=60
CLOUD_TIMEOUT=30

PREFER_LOCAL=true
"""

from __future__ import annotations

import logging
import os
import time
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional, Tuple

import httpx
from anthropic import Anthropic
from dotenv import load_dotenv
from openai import OpenAI
from tenacity import retry, stop_after_attempt, wait_exponential

# ------------------------------------------------------------------------------
# Load Environment Variables
# ------------------------------------------------------------------------------

load_dotenv()

# ------------------------------------------------------------------------------
# Logging
# ------------------------------------------------------------------------------

logging.basicConfig(
level=logging.INFO,
format="%(asctime)s | %(levelname)s | %(message)s",
)

logger = logging.getLogger(__name__)

# ------------------------------------------------------------------------------
# Enums
# ------------------------------------------------------------------------------


class InferenceBackend(Enum):
OLLAMA = "ollama"
LM_STUDIO = "lm_studio"
OPENAI = "openai"
ANTHROPIC = "anthropic"


# ------------------------------------------------------------------------------
# Config
# ------------------------------------------------------------------------------


@dataclass
class ModelConfig:
local_model: str = os.getenv("LOCAL_MODEL", "llama3.1:8b")
cloud_model: str = os.getenv("CLOUD_MODEL", "gpt-4o-mini")
prefer_local: bool = (
os.getenv("PREFER_LOCAL", "true").lower() == "true"
)
local_timeout: float = float(os.getenv("LOCAL_TIMEOUT", "60"))
cloud_timeout: float = float(os.getenv("CLOUD_TIMEOUT", "30"))


# ------------------------------------------------------------------------------
# Client
# ------------------------------------------------------------------------------


class UnifiedLLMClient:
"""
Production-ready LLM Router

Routing Strategy:
Ollama

LM Studio

OpenAI

Anthropic
"""

def __init__(self, config: Optional[ModelConfig] = None):
self.config = config or ModelConfig()

# Shared HTTP client for pooling
self.http_client = httpx.Client(
timeout=self.config.local_timeout,
limits=httpx.Limits(
max_connections=50,
max_keepalive_connections=20,
),
)

# ------------------------------------------------------------------
# Local Clients
# ------------------------------------------------------------------

self._ollama = OpenAI(
base_url=os.getenv(
"OLLAMA_BASE_URL",
"http://localhost:11434/v1",
),
api_key=os.getenv(
"OLLAMA_API_KEY",
"ollama",
),
timeout=self.config.local_timeout,
http_client=self.http_client,
)

self._lm_studio = OpenAI(
base_url=os.getenv(
"LM_STUDIO_BASE_URL",
"http://localhost:1234/v1",
),
api_key=os.getenv(
"LM_STUDIO_API_KEY",
"lm-studio",
),
timeout=self.config.local_timeout,
http_client=self.http_client,
)

# ------------------------------------------------------------------
# Lazy Cloud Clients
# ------------------------------------------------------------------

self._openai: Optional[OpenAI] = None
self._anthropic: Optional[Anthropic] = None

# ------------------------------------------------------------------
# Stats
# ------------------------------------------------------------------

self._stats = {
"ollama_calls": 0,
"lm_studio_calls": 0,
"openai_calls": 0,
"anthropic_calls": 0,
"fallbacks": 0,
}

# --------------------------------------------------------------------------
# Cloud Clients
# --------------------------------------------------------------------------

def _get_openai(self) -> OpenAI:
if self._openai is None:
api_key = os.getenv("OPENAI_API_KEY")

if not api_key:
raise ValueError(
"OPENAI_API_KEY not found in environment"
)

self._openai = OpenAI(
api_key=api_key,
timeout=self.config.cloud_timeout,
)

return self._openai

def _get_anthropic(self) -> Anthropic:
if self._anthropic is None:
api_key = os.getenv("ANTHROPIC_API_KEY")

if not api_key:
raise ValueError(
"ANTHROPIC_API_KEY not found in environment"
)

self._anthropic = Anthropic(
api_key=api_key,
timeout=self.config.cloud_timeout,
)

return self._anthropic

# --------------------------------------------------------------------------
# Health Checks
# --------------------------------------------------------------------------

def _is_ollama_running(self) -> bool:
try:
url = os.getenv(
"OLLAMA_BASE_URL",
"http://localhost:11434/v1",
).replace("/v1", "/api/version")

response = httpx.get(url, timeout=2.0)
return response.status_code == 200

except Exception:
return False

def _is_lm_studio_running(self) -> bool:
try:
url = os.getenv(
"LM_STUDIO_BASE_URL",
"http://localhost:1234/v1",
) + "/models"

response = httpx.get(url, timeout=2.0)
return response.status_code == 200

except Exception:
return False

# --------------------------------------------------------------------------
# Retry Wrapper
# --------------------------------------------------------------------------

@retry(
stop=stop_after_attempt(3),
wait=wait_exponential(multiplier=1, min=1, max=8),
reraise=True,
)
def _chat_with_backend(
self,
backend: InferenceBackend,
messages: List[Dict],
temperature: float,
max_tokens: int,
) -> Tuple[str, InferenceBackend]:

# --------------------------------------------------------------
# Ollama
# --------------------------------------------------------------

if backend == InferenceBackend.OLLAMA:
response = self._ollama.chat.completions.create(
model=self.config.local_model,
messages=messages,
temperature=temperature,
max_tokens=max_tokens,
)

self._stats["ollama_calls"] += 1

return (
response.choices[0].message.content,
backend,
)

# --------------------------------------------------------------
# LM Studio
# --------------------------------------------------------------

elif backend == InferenceBackend.LM_STUDIO:
response = self._lm_studio.chat.completions.create(
model="local-model",
messages=messages,
temperature=temperature,
max_tokens=max_tokens,
)

self._stats["lm_studio_calls"] += 1

return (
response.choices[0].message.content,
backend,
)

# --------------------------------------------------------------
# OpenAI
# --------------------------------------------------------------

elif backend == InferenceBackend.OPENAI:
response = (
self._get_openai()
.chat.completions.create(
model=self.config.cloud_model,
messages=messages,
temperature=temperature,
max_tokens=max_tokens,
)
)

self._stats["openai_calls"] += 1

return (
response.choices[0].message.content,
backend,
)

# --------------------------------------------------------------
# Anthropic
# --------------------------------------------------------------

elif backend == InferenceBackend.ANTHROPIC:
response = self._get_anthropic().messages.create(
model="claude-3-5-sonnet-latest",
max_tokens=max_tokens,
temperature=temperature,
messages=[
{
"role": "user",
"content": messages[-1]["content"],
}
],
)

self._stats["anthropic_calls"] += 1

return (
response.content[0].text,
backend,
)

raise ValueError(f"Unsupported backend: {backend}")

# --------------------------------------------------------------------------
# Main Chat API
# --------------------------------------------------------------------------

def chat(
self,
prompt: str,
system: str = "You are a helpful assistant.",
temperature: float = 0.7,
max_tokens: int = 2048,
force_cloud: bool = False,
force_local: bool = False,
) -> Dict:

messages = [
{"role": "system", "content": system},
{"role": "user", "content": prompt},
]

if force_cloud:
backends_to_try = [
InferenceBackend.OPENAI,
InferenceBackend.ANTHROPIC,
]

elif force_local:
backends_to_try = [
InferenceBackend.OLLAMA,
InferenceBackend.LM_STUDIO,
]

elif self.config.prefer_local:
backends_to_try = [
InferenceBackend.OLLAMA,
InferenceBackend.LM_STUDIO,
InferenceBackend.OPENAI,
InferenceBackend.ANTHROPIC,
]

else:
backends_to_try = [
InferenceBackend.OPENAI,
InferenceBackend.ANTHROPIC,
]

last_error = None
first_attempt = True

for backend in backends_to_try:

if (
backend == InferenceBackend.OLLAMA
and not self._is_ollama_running()
):
logger.info(
"Ollama not running. Skipping."
)
continue

if (
backend == InferenceBackend.LM_STUDIO
and not self._is_lm_studio_running()
):
logger.info(
"LM Studio not running. Skipping."
)
continue

try:
if not first_attempt:
self._stats["fallbacks"] += 1
logger.warning(
f"Falling back to {backend.value}"
)

first_attempt = False

start_time = time.monotonic()

response, used_backend = (
self._chat_with_backend(
backend=backend,
messages=messages,
temperature=temperature,
max_tokens=max_tokens,
)
)

latency_ms = (
time.monotonic() - start_time
) * 1000

return {
"response": response,
"backend": used_backend.value,
"latency_ms": round(latency_ms),
"model": (
self.config.local_model
if used_backend
in (
InferenceBackend.OLLAMA,
InferenceBackend.LM_STUDIO,
)
else self.config.cloud_model
),
}

except Exception as e:
logger.exception(
f"{backend.value} failed"
)

last_error = e
continue

raise RuntimeError(
f"All inference backends failed.\n"
f"Last error: {last_error}"
)

# --------------------------------------------------------------------------
# Stats
# --------------------------------------------------------------------------

@property
def stats(self):

total = (
self._stats["ollama_calls"]
+ self._stats["lm_studio_calls"]
+ self._stats["openai_calls"]
+ self._stats["anthropic_calls"]
)

local_calls = (
self._stats["ollama_calls"]
+ self._stats["lm_studio_calls"]
)

cloud_calls = (
self._stats["openai_calls"]
+ self._stats["anthropic_calls"]
)

return {
**self._stats,
"total_calls": total,
"local_percentage": round(
local_calls / max(total, 1) * 100,
1,
),
"cloud_percentage": round(
cloud_calls / max(total, 1) * 100,
1,
),
}

# --------------------------------------------------------------------------
# Cleanup
# --------------------------------------------------------------------------

def close(self):
try:
self.http_client.close()
except Exception:
pass


# ------------------------------------------------------------------------------
# Utility
# ------------------------------------------------------------------------------


def validate_config():
    print("\n=== Configuration ===")
    print(
        "LOCAL_MODEL:",
        os.getenv("LOCAL_MODEL"),
    )
    print(
        "CLOUD_MODEL:",
        os.getenv("CLOUD_MODEL"),
    )
    print(
        "PREFER_LOCAL:",
        os.getenv("PREFER_LOCAL"),
    )

    print(
        "OPENAI_API_KEY:",
        "SET"
        if os.getenv("OPENAI_API_KEY")
        else "MISSING",
    )

    print(
        "ANTHROPIC_API_KEY:",
        "SET"
        if os.getenv("ANTHROPIC_API_KEY")
        else "MISSING",
    )

    print("=====================\n")


# ------------------------------------------------------------------------------
# Example Usage
# ------------------------------------------------------------------------------

if __name__ == "__main__":

    validate_config()

    client = UnifiedLLMClient()

    try:
        # Automatic routing
        result = client.chat(
            prompt="Explain database indexing strategies.",
            system="You are a senior database architect.",
        )

        print(
            f"\nBackend: {result['backend']}"
        )
        print(
            f"Latency: {result['latency_ms']}ms"
        )
        print(
            f"Model: {result['model']}\n"
        )

        print(result["response"])

        # Force cloud
        cloud_result = client.chat(
            prompt="Compare monolithic and microservice architectures.",
            force_cloud=True,
        )

        print(
            f"\nCloud Backend: {cloud_result['backend']}"
        )

        # Force local
        local_result = client.chat(
            prompt="Summarize confidential HR notes.",
            force_local=True,
        )

        print(
            f"\nLocal Backend: {local_result['backend']}"
        )

        print("\nStats:")
        print(client.stats)

    finally:
        client.close()

Part 6: Choosing the Right Local Model

Not all local models are equal. Here’s a practical guide for 2026:

Task                     Recommended Model            Why
─────────────────────────────────────────────────────────────────────────────
General chat/writing     Llama 3.1 8B                 Best balance, well-tested
Code generation          Qwen 2.5 Coder 7B            Trained on code, fast
Python specifically      DeepSeek-Coder-V2 Lite 16B   Top Python benchmark scores
Reasoning/math           DeepSeek-R1 7B               CoT reasoning, offline
Instruction following    Mistral 7B Instruct          Very reliable formatting
Embeddings               nomic-embed-text             Fast, good quality
Document summarization   Llama 3.1 8B (16K ctx)       Long context support
Privacy-sensitive (any)  Phi-4 14B                    Microsoft, no telemetry
Resource-constrained     Llama 3.2 3B / Phi-3 Mini    Runs on 4GB RAM
Maximum quality local    Llama 3.1 70B Q4             Near-frontier, 40GB VRAM

Part 7: Performance Benchmarking

Before deploying locally, benchmark your setup:

import time
import statistics
from typing import Callable, Optional


def benchmark_model(
    chat_fn: Callable[[str], str],
    model_name: str,
    prompts: Optional[list[str]] = None,
    runs: int = 5,
) -> dict:
    """
    Benchmark a local model's performance.

    Measures:
    - Total latency per request (mean, median, p95, min, max)
    - Approximate tokens per second, estimated from word count

    Time to first token (TTFT) is not measured here, because that
    requires a streaming response.
    """
    if prompts is None:
        prompts = [
            "What is 47 * 89?",
            "List 5 Python built-in functions.",
            "What's the capital of France?",
            "Write a one-line function to reverse a string in Python.",
            "Explain what HTTP means in one sentence.",
        ]

    # Use at most `runs` prompts, one request per prompt.
    prompts = prompts[:runs]
    latencies = []

    print(f"\n{'─'*50}")
    print(f"Benchmarking: {model_name}")
    print(f"Runs: {len(prompts)} prompts × 1 = {len(prompts)} total requests")
    print(f"{'─'*50}")

    for i, prompt in enumerate(prompts):
        start = time.monotonic()
        response = chat_fn(prompt)
        elapsed = time.monotonic() - start

        # Rough token estimate: ~1.3 tokens per whitespace-separated word.
        words = len(response.split())
        approx_tokens = int(words * 1.3)
        tokens_per_second = approx_tokens / elapsed if elapsed > 0 else 0

        latencies.append(elapsed)
        print(f"Run {i+1}: {elapsed:.2f}s | ~{tokens_per_second:.0f} tok/s | {len(response)} chars")

    # With only a handful of samples, p95 is effectively the slowest run.
    ordered = sorted(latencies)
    p95_index = min(int(len(ordered) * 0.95), len(ordered) - 1)

    return {
        "model": model_name,
        "mean_latency_s": round(statistics.mean(latencies), 2),
        "p50_latency_s": round(statistics.median(latencies), 2),
        "p95_latency_s": round(ordered[p95_index], 2),
        "min_latency_s": round(min(latencies), 2),
        "max_latency_s": round(max(latencies), 2),
    }


# Run benchmark
from openai import OpenAI

ollama = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")


def make_chat_fn(model: str) -> Callable[[str], str]:
    def chat(prompt: str) -> str:
        response = ollama.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=256,
        )
        # content can be None; return an empty string so the benchmark can still count words
        return response.choices[0].message.content or ""
    return chat


# Compare models
models_to_benchmark = ["llama3.2", "llama3.1:8b", "mistral"]

results = []
for model in models_to_benchmark:
    result = benchmark_model(make_chat_fn(model), model)
    results.append(result)

print("\n\nSummary:")
print(f"{'Model':<25} {'Mean (s)':<12} {'P50 (s)':<12} {'P95 (s)':<12}")
print("─" * 60)
for r in results:
    print(f"{r['model']:<25} {r['mean_latency_s']:<12} {r['p50_latency_s']:<12} {r['p95_latency_s']:<12}")
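The benchmark above times whole responses only. If you also want time to first token, one option is to consume a streamed response and timestamp the first non-empty chunk. This is a minimal sketch: `measure_stream` and `fake_stream` are hypothetical names, and the fake generator stands in for a real streaming call so the example runs without a model server.

```python
import time
from typing import Iterable, Iterator


def measure_stream(chunks: Iterable[str]) -> dict:
    """Consume text chunks, recording time to first token and total time."""
    start = time.monotonic()
    first_chunk_at = None
    parts = []
    for chunk in chunks:
        # The first non-empty chunk marks the time to first token.
        if chunk and first_chunk_at is None:
            first_chunk_at = time.monotonic()
        parts.append(chunk)
    end = time.monotonic()
    return {
        "ttft_s": (first_chunk_at - start) if first_chunk_at is not None else None,
        "total_s": end - start,
        "text": "".join(parts),
    }


def fake_stream() -> Iterator[str]:
    """Stand-in for a streaming response (hypothetical, for illustration)."""
    time.sleep(0.05)  # simulated prompt processing before the first token
    for word in ["Paris ", "is ", "the ", "capital."]:
        yield word
        time.sleep(0.01)


if __name__ == "__main__":
    print(measure_stream(fake_stream()))
```

With the OpenAI SDK pointed at Ollama, passing `stream=True` to `chat.completions.create` returns an iterator of chunks whose text lives in `chunk.choices[0].delta.content` (which may be `None`), so a generator that yields `delta.content or ""` could be passed to `measure_stream` in place of the fake one.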

🔍 Common Mistakes

1. Running a 7B model on 8GB RAM without a GPU
Without GPU offloading, a Q4 7B model takes up most of your RAM and inference can drop to 1–2 tokens/second. Either use a smaller model (3B) or add a GPU. Check ollama ps to see how the loaded model is split between CPU and GPU.

2. Ignoring num_ctx (context window)
Ollama applies a small default context window (2048 tokens in many releases, 4096 in more recent ones) regardless of what the model supports. For longer documents or multi-turn conversations, set num_ctx to 8192 or higher. A larger context uses more RAM, so balance accordingly.

3. Not keeping models warm
Ollama unloads models from memory after 5 minutes by default. If your app has bursty traffic, the first request after an idle period will be slow while the model reloads. Set OLLAMA_KEEP_ALIVE=60m to keep models warm longer.

4. Using LM Studio and Ollama on the same machine simultaneously
Both grab GPU resources, so running them at the same time can cause memory conflicts. Pick one per session, or configure them for different GPUs if you have more than one.

5. Pulling models without checking VRAM first
ollama pull llama3.1:70b downloads roughly 40GB. If you only have 12GB of VRAM, most layers will run on the CPU and you will see around 1–3 tokens/second. Always check your VRAM before pulling large models.

6. Forgetting temperature matters more for local models
Local models are generally more literal and less “creative” than frontier cloud models. A temperature of 0.7 that works well on GPT-4o may feel flat on Llama 3.1 8B. Experiment: 0.8–0.9 often works better locally.
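Mistakes 2 and 3 can also be addressed per request. As a sketch, the helper below builds a body for Ollama's native `/api/chat` endpoint that sets `num_ctx` through `options` and overrides the unload timer with `keep_alive`. The function name `build_chat_payload` and the default values are assumptions for illustration; only the payload is built here, so nothing is sent over the network.

```python
import json


def build_chat_payload(
    model: str,
    prompt: str,
    num_ctx: int = 8192,
    keep_alive: str = "60m",
) -> dict:
    """Build a request body for Ollama's native POST /api/chat endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,                  # return one JSON object, not a stream
        "keep_alive": keep_alive,         # how long the model stays loaded afterwards
        "options": {"num_ctx": num_ctx},  # context window for this request
    }


payload = build_chat_payload("llama3.1:8b", "Summarize the attached contract.")
print(json.dumps(payload, indent=2))
```

Posting this body to `http://localhost:11434/api/chat` with any HTTP client applies both settings for that call. Note that this targets the native endpoint rather than the OpenAI-compatible `/v1` route used elsewhere in this article.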

💼 Quick Questions

Q: What is model quantization and what are the tradeoffs? Quantization reduces the bit-width of model weights from FP32/FP16 to INT8, INT4, or lower. This reduces memory footprint and usually increases inference speed, at the cost of slight quality degradation. Q4_K_M is the sweet spot: roughly 7x smaller than FP32 (a bit more than 3x smaller than FP16) with ~1% quality loss. Going below Q4 (Q3, Q2) shows noticeable quality degradation.

Q: How does Ollama’s GPU offloading work? Ollama uses llama.cpp under the hood, which splits transformer layers between GPU VRAM (fast) and CPU RAM (slow). In llama.cpp itself you set the number of GPU layers with --n-gpu-layers N (-ngl); in Ollama the equivalent is the num_gpu parameter, set in a Modelfile or in the request options. If your model fits entirely in VRAM, all layers are GPU-accelerated. With partial offloading, inference speed scales roughly with the share of layers on the GPU.

Q: When would you choose a local LLM over a cloud API in production? Local LLMs are preferred when: (1) data cannot leave the network (HIPAA, legal, financial PII), (2) latency requirements are sub-100ms, (3) usage volume makes API costs prohibitive, (4) offline/edge deployment is required, or (5) custom fine-tuned models need to be served. Cloud APIs are preferred when: maximum quality is needed, compute infrastructure isn’t available, or usage is sporadic.

Q: What is GGUF and why does it exist? GGUF (GPT-Generated Unified Format) replaced the older GGML format in 2023. It’s a binary format that packages model weights, tokenizer, metadata, and quantization information into a single file. It supports memory mapping, enabling fast loading, and multiple quantization types in one format. It’s the standard format for CPU/hybrid inference in the llama.cpp ecosystem.
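The size arithmetic behind the quantization answer can be checked in a few lines. This is a rough sketch that counts weight storage only; the bits-per-weight figures for the quantized formats are approximate assumptions (block formats store scale factors alongside the weights, so they come out slightly above their nominal bit-width), and `model_size_gb` is a hypothetical helper name.

```python
def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), ignoring file metadata."""
    total_bits = params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


# Approximate bits per weight; the quantized values are assumptions.
formats = {
    "FP32": 32.0,
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q4_K_M": 4.85,
}

for name, bits in formats.items():
    print(f"{name:<8} {model_size_gb(7, bits):5.1f} GB")
```

For a 7B model this reproduces the 28 GB FP32 figure from Part 1 and lands a little above 4 GB for Q4_K_M, before adding the KV cache and runtime overhead.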

🏭 Production Considerations

Model serving infrastructure: For production teams, consider running Ollama behind a load balancer with multiple GPU nodes. Ollama’s API is stateless — you can load-balance across instances running the same model.

Model selection by task type: Build a routing layer that sends different task types to different models. Use Llama 3.2 3B for simple classification/extraction tasks (fast, cheap), Llama 3.1 8B for general generation, and cloud APIs for complex reasoning.
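A routing layer like this can start as a plain lookup table. The sketch below is illustrative only: the task labels, the `Route` type, the model tags, and the rule that sensitive data never goes to the cloud are assumptions to adapt, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    backend: str  # "ollama" or "cloud"
    model: str


# Hypothetical task labels and model tags; adjust to the models you have pulled.
ROUTES = {
    "classification": Route("ollama", "llama3.2:3b"),
    "extraction": Route("ollama", "llama3.2:3b"),
    "generation": Route("ollama", "llama3.1:8b"),
    "complex_reasoning": Route("cloud", "cloud-model-placeholder"),
}
DEFAULT_ROUTE = ROUTES["generation"]


def route_task(task_type: str, contains_sensitive_data: bool = False) -> Route:
    """Pick a backend and model for a task, keeping sensitive data local."""
    route = ROUTES.get(task_type, DEFAULT_ROUTE)
    if contains_sensitive_data and route.backend == "cloud":
        # Never send sensitive input to a cloud backend; use the local default.
        return DEFAULT_ROUTE
    return route


print(route_task("classification"))
print(route_task("complex_reasoning", contains_sensitive_data=True))
```

Keeping the table as data rather than branching logic makes it easy to review with the people who own the data-governance policy.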

Data governance: The primary enterprise value of local LLMs is data residency. Document your inference stack architecture clearly — CISOs and legal teams need to verify that customer data never touches external APIs.

Monitoring: Ollama doesn’t expose Prometheus metrics natively. Build a wrapper that logs model name, latency, token counts, and error rates to your observability stack (Prometheus + Grafana, Datadog, etc.).
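One way to build such a wrapper is to decorate whatever chat function you already have. The class below is a minimal in-memory sketch under stated assumptions: `InstrumentedChat` is a hypothetical name, token counts reuse the rough 1.3-tokens-per-word estimate from the benchmark section, and in production you would export `snapshot()` to your metrics system instead of printing it.

```python
import time
from collections import defaultdict
from typing import Callable


class InstrumentedChat:
    """Wrap a chat function and record request counts, errors, and latency."""

    def __init__(self, chat_fn: Callable[[str], str], model: str):
        self._chat_fn = chat_fn
        self.model = model
        self.counters = defaultdict(int)
        self.latencies_ms: list[float] = []

    def __call__(self, prompt: str) -> str:
        start = time.monotonic()
        self.counters["requests"] += 1
        try:
            reply = self._chat_fn(prompt)
        except Exception:
            self.counters["errors"] += 1
            raise
        finally:
            # Latency is recorded for failed requests as well.
            self.latencies_ms.append((time.monotonic() - start) * 1000)
        # Rough estimate only: ~1.3 tokens per whitespace-separated word.
        self.counters["approx_output_tokens"] += int(len(reply.split()) * 1.3)
        return reply

    def snapshot(self) -> dict:
        requests = self.counters["requests"]
        mean_ms = sum(self.latencies_ms) / len(self.latencies_ms) if self.latencies_ms else 0.0
        return {
            "model": self.model,
            "requests": requests,
            "errors": self.counters["errors"],
            "error_rate": self.counters["errors"] / requests if requests else 0.0,
            "mean_latency_ms": round(mean_ms, 1),
            "approx_output_tokens": self.counters["approx_output_tokens"],
        }


if __name__ == "__main__":
    chat = InstrumentedChat(lambda prompt: "stub reply for " + prompt, model="llama3.1:8b")
    chat("hello")
    print(chat.snapshot())
```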

Docker deployment:

# Dockerfile for Ollama-based AI service
FROM ollama/ollama:latest

# Pre-pull models during build (bakes the model into the image: large but self-contained)
# Start the server in the background, pull the model, then stop the server by PID.
RUN ollama serve & pid=$! && sleep 5 && ollama pull llama3.1:8b && kill "$pid"

ENV OLLAMA_HOST=0.0.0.0:11434
ENV OLLAMA_KEEP_ALIVE=30m
EXPOSE 11434

CMD ["ollama", "serve"]

# docker-compose.yml
version: '3.8'
services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_KEEP_ALIVE=30m
      - OLLAMA_NUM_PARALLEL=2
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  ai-service:
    build: .
    ports:
      - "8000:8000"
    depends_on:
      - ollama
    environment:
      - OLLAMA_URL=http://ollama:11434

volumes:
  ollama_data:

⚡ Performance & Scalability

GPU layer optimization: In Ollama, run ollama ps while a model is loaded to see how it is split between GPU and CPU (the PROCESSOR column shows, for example, 100% GPU or a CPU/GPU percentage split). The more of the model that sits on the GPU, the faster inference runs. If you have spare VRAM, raise the number of offloaded layers via a Modelfile: PARAMETER num_gpu 99 (offload all possible layers).

Parallel requests: Depending on the Ollama version and available memory, a model may serve only one request at a time by default. For concurrent workloads, set OLLAMA_NUM_PARALLEL=4 (or higher) and make sure you have enough VRAM, because each parallel request needs its own KV cache allocation.

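KV cache: The KV (Key-Value) cache stores the attention keys and values already computed for the tokens in the context. Longer contexts and more parallel requests require proportionally more VRAM. A rough formula: KV cache size ≈ 2 × num_layers × num_kv_heads × head_dim × context_length × precision_bytes, where the factor 2 covers keys and values and num_kv_heads equals num_heads unless the model uses grouped-query attention.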
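A worked instance of the formula, as a rough sketch. It counts key/value heads, which for grouped-query-attention models such as Llama 3.1 8B is smaller than the number of attention heads. The Llama 3.1 8B figures used here (32 layers, 8 KV heads, head dimension 128) come from its published configuration and should be checked for any other model; `kv_cache_bytes` and the `parallel` argument are illustrative names.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    context_length: int,
    precision_bytes: int = 2,  # FP16 cache
    parallel: int = 1,         # number of concurrent request slots
) -> int:
    """Estimate KV cache size; the leading 2 covers keys and values."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * precision_bytes
    return per_token * context_length * parallel


# Llama 3.1 8B: 32 layers, 8 KV heads, head dimension 128.
for ctx in (2048, 8192, 32768):
    gib = kv_cache_bytes(32, 8, 128, ctx) / 2**30
    print(f"context {ctx:>6}: {gib:.2f} GiB")
```

At FP16 this works out to 128 KiB per token, so an 8192-token context needs 1 GiB of cache per request slot, and four parallel slots need four times that.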

Batching: For batch workloads (processing many documents), send requests in parallel rather than sequentially. Python asyncio with an async HTTP client such as aiohttp, pointed at Ollama’s HTTP API, can significantly increase throughput, provided the server is configured to handle parallel requests.
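The parallel pattern can be sketched with asyncio and a semaphore that caps in-flight requests. In this example `run_batch` and `fake_worker` are hypothetical names, and the fake worker stands in for a real async call to the model server so the code runs on its own.

```python
import asyncio
from typing import Awaitable, Callable


async def run_batch(
    prompts: list[str],
    worker: Callable[[str], Awaitable[str]],
    max_concurrency: int = 4,
) -> list[str]:
    """Run worker over all prompts with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(prompt: str) -> str:
        async with semaphore:
            return await worker(prompt)

    # gather returns results in the same order as the input prompts.
    return list(await asyncio.gather(*(guarded(p) for p in prompts)))


async def fake_worker(prompt: str) -> str:
    """Stand-in for a real async model call (hypothetical, for illustration)."""
    await asyncio.sleep(0.01)
    return prompt.upper()


if __name__ == "__main__":
    docs = ["first document", "second document", "third document"]
    print(asyncio.run(run_batch(docs, fake_worker, max_concurrency=2)))
```

A real worker could await `chat.completions.create` on an `AsyncOpenAI` client whose `base_url` is `http://localhost:11434/v1`. Matching `max_concurrency` to `OLLAMA_NUM_PARALLEL` is a sensible starting point, since requests beyond the server's parallel slots simply wait in its queue.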

🔑 Key Takeaways

  1. Quantization makes local inference practical — Q4_K_M reduces a 7B model to ~4GB with only ~1% quality loss
  2. Ollama is the standard CLI for local LLMs — pull any model in one command, OpenAI-compatible API out of the box
  3. LM Studio is the GUI alternative — visual model manager, built-in chat, OpenAI-compatible server
  4. Apple Silicon has a structural advantage — unified memory means no VRAM bottleneck for M2/M3 Max users
  5. Local ≠ worse — Llama 3.1 8B is competitive with GPT-3.5 on most tasks, running completely offline
  6. The right architecture — local-first with cloud fallback gives you privacy, cost efficiency, and reliability simultaneously
  7. Same OpenAI SDK, different base_url — swap between local and cloud with a single config change
