Day 8: Running LLMs Locally with Ollama & LM Studio

“The most powerful AI model is the one you fully control — running on your own hardware, your own network, with your own data never leaving the room.”

Why This Matters

For the first three years of the LLM era, running a capable AI model required API keys, internet connectivity, and per-token billing. In 2026, that assumption is broken.

You can now run models on your laptop that match or exceed GPT-3.5 performance — completely offline, completely free after hardware costs, with zero data leaving your machine.

This matters for several concrete reasons:

Privacy. Medical records, legal documents, internal code, personal journals: many of the most valuable use cases for AI are also the ones where sending data to a third-party cloud is not an option. Local inference avoids the problem because the data never leaves your machine.

Cost. At scale, API costs dominate AI product economics. A startup processing millions of documents per month can spend $50,000+ on API fees for a workload that could instead run on roughly $10,000 of on-premises hardware.

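Latency. Cloud API round trips add 200ms–2s of network latency. A local model running on a GPU server in the same datacenter removes that hop, so the first token can arrive within milliseconds; total response time then depends only on how fast your hardware generates tokens.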

Control. No rate limits. No model deprecation forcing you to migrate. No terms of service updates. No vendor lock-in.

The tools that made local inference practical for individual developers — Ollama and LM Studio — are the focus of today’s article.

Part 1: Understanding Model Quantization

Before installing anything, you need to understand the concept that makes local LLMs possible on consumer hardware: quantization.

What is Quantization?

At full precision, an LLM stores each parameter (weight) as a 32-bit floating-point number. A 7 billion parameter model therefore requires:

7,000,000,000 × 4 bytes (FP32) = 28 GB

That’s more than most consumer GPUs hold. But we don’t need FP32 precision for inference. Research shows you can represent weights with much lower precision with minimal quality loss:

Precision    Bits per weight   7B model size   Quality loss
──────────────────────────────────────────────────────────
FP32         32 bits           28 GB           0% (reference)
FP16         16 bits           14 GB           ~0%
GGUF Q8_0    8 bits             7 GB           ~0.1%
GGUF Q6_K    6 bits             5.5 GB         ~0.2%
GGUF Q5_K_M  5 bits             4.8 GB         ~0.5%
GGUF Q4_K_M  4 bits             4.1 GB         ~1%        ← Sweet spot
GGUF Q3_K_M  3 bits             3.3 GB         ~2-3%
GGUF Q2_K    2 bits             2.7 GB         ~5%+       ← Avoid

GGUF (GPT-Generated Unified Format) is the file format used by llama.cpp and all tools built on it (Ollama, LM Studio). It packages the quantized weights and metadata into a single file.
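
The arithmetic above generalizes to any parameter count and precision. The sketch below is a back-of-the-envelope estimator, not part of any tool: `estimate_size_gb` and `effective_bits` are hypothetical helper names, and it counts weights only (no KV cache or runtime overhead). It also suggests why the Q4_K_M row is larger than a pure 4-bit calculation would give: working backwards from the table's 4.1 GB yields roughly 4.7 bits per weight, because block scales and a few higher-precision tensors add overhead.

```python
def estimate_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Weights-only size in GB (1 GB = 1e9 bytes, as in the table above)."""
    return n_params * bits_per_weight / 8 / 1e9


def effective_bits(size_gb: float, n_params: float) -> float:
    """Average bits per weight implied by a given file size."""
    return size_gb * 1e9 * 8 / n_params


N = 7e9  # 7B parameters

# Nominal sizes for a few precisions
for name, bits in [("FP32", 32), ("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{name:6s} {estimate_size_gb(N, bits):5.1f} GB")

# Work backwards from the Q4_K_M size listed in the table
print(f"Q4_K_M at 4.1 GB implies {effective_bits(4.1, N):.2f} bits/weight")
```

Treat the result as a lower bound when sizing hardware, since the context window needs memory on top of the weights.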

The K-Quant System

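The K in Q4_K_M marks the "k-quant" family in llama.cpp. Weights are quantized in small blocks, each with its own scale, and the blocks are grouped into super-blocks whose scales are themselves quantized. This tracks the weight distribution more closely than a single per-tensor scale. The letter after _K_ indicates the size variant: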

S = Small (more aggressive quantization, smaller)
M = Medium ← Most common recommendation
L = Large (less aggressive, slightly larger but higher quality)
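
To see why per-block scales help, here is a toy comparison in plain Python. It is only a sketch of the general block-wise idea using simple symmetric rounding, not the actual llama.cpp k-quant algorithm; the function names, the block size of 32, and the synthetic weights are all assumptions made for illustration.

```python
import random


def quantize_dequantize(values, bits=4):
    """Symmetric absmax quantization to signed ints, then back to floats."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    scale = max(abs(v) for v in values) / qmax or 1.0
    return [round(v / scale) * scale for v in values]


def per_block(values, block_size=32, bits=4):
    """Quantize each block independently so every block gets its own scale."""
    restored = []
    for start in range(0, len(values), block_size):
        restored.extend(quantize_dequantize(values[start:start + block_size], bits))
    return restored


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]
weights[0] = 1.0  # one outlier forces a large scale on whatever shares it

tensor_err = mean_abs_error(weights, quantize_dequantize(weights))
block_err = mean_abs_error(weights, per_block(weights))

print(f"one scale for the whole tensor: {tensor_err:.5f}")
print(f"one scale per 32-weight block:  {block_err:.5f}")
```

With a single scale, the outlier stretches the quantization grid so far that most small weights round to zero; with per-block scales only the outlier's own block pays that price.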

Practical Recommendation

For most use cases: Q4_K_M. It reduces model size by ~3.4x vs FP16 (about 7x vs FP32) while maintaining ~99% of the original quality. This is the format Ollama uses by default for most models.

Part 2: Hardware Requirements

The RAM Rule

The dominant constraint for local LLMs is memory. You need enough RAM + VRAM (combined) to hold the model. As a rule of thumb:

Model Size     Q4_K_M Required    Min GPU VRAM    Min RAM (CPU only)
──────────────────────────────────────────────────────────────────
1B params      ~0.8 GB            4 GB            8 GB RAM
3B params      ~2 GB              4 GB            8 GB RAM
7B params      ~4.1 GB            6 GB            16 GB RAM
13B params     ~7.4 GB            8 GB            32 GB RAM
32B params     ~19 GB             24 GB           64 GB RAM
70B params     ~40 GB             48 GB (2× 24GB) 64 GB RAM

Key principle: If the model fits entirely in VRAM, inference is GPU-accelerated (fast). If it doesn’t, the layers that don’t fit run on CPU/RAM (slow but functional).
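
The rule of thumb can be turned into a quick check before you download anything. This is a rough sketch only: `placement` is a hypothetical helper, and the 1.5 GB `overhead_gb` allowance for the KV cache and runtime buffers is an assumption that grows with context length.

```python
def placement(model_gb: float, vram_gb: float, ram_gb: float,
              overhead_gb: float = 1.5) -> str:
    """Classify where a quantized model of the given size would run.

    overhead_gb is an assumed allowance for KV cache and runtime buffers.
    """
    needed = model_gb + overhead_gb
    if needed <= vram_gb:
        return "gpu"            # fits entirely in VRAM: fast
    if needed <= vram_gb + ram_gb:
        return "cpu_or_split"   # spills into system RAM: slow but functional
    return "too_large"          # does not fit in combined memory


# Sizes taken from the Q4_K_M column of the table above
print(placement(4.1, vram_gb=6, ram_gb=16))    # 7B on a 6 GB GPU
print(placement(40.0, vram_gb=24, ram_gb=64))  # 70B on a 24 GB GPU
print(placement(40.0, vram_gb=8, ram_gb=16))   # 70B on a small machine
```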

GPU Recommendations by Budget

Budget       GPU               VRAM    Best Local Models
───────────────────────────────────────────────────────────────────────
$0            MacBook M-series  Unified  Llama 3.2 3B, Phi-3 Mini
$200          RTX 3060          12 GB    Llama 3.1 8B, Mistral 7B
$400          RTX 3080          10 GB    Llama 3.1 8B, Qwen 2.5 7B
$800          RTX 4070          12 GB    Llama 3.1 8B, Qwen 2.5 Coder 7B
$1500         RTX 4090          24 GB    Qwen 2.5 32B Q4 (70B Q4 only with CPU offload)
$3000+        RTX 6000 Ada      48 GB    Llama 3.1 70B Q4
$8000+        2× A6000          96 GB    Llama 3.1 70B Q8 (405B Q4 needs well over 200 GB)

Apple Silicon Advantage

Apple M-series chips use unified memory (shared between CPU and GPU), giving them a unique advantage for local LLMs:

M1 16GB: Runs 7B models well, 13B models slowly
M2/M3 32GB: Runs 13B models excellently, 34B models adequately
M3 Max 128GB: Runs 70B models well

Ollama has native Apple Silicon support with Metal GPU acceleration. LM Studio also has native macOS builds. For Mac users, local LLMs are effectively zero-configuration.

Part 3: Ollama

Ollama is the dominant tool for local LLM inference as of 2026. It provides:

A simple CLI to pull, run, and manage models
An HTTP server with an OpenAI-compatible API (drop-in replacement)
Automatic GPU detection and acceleration
A curated model library with 100+ models

Installation

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows
# Download installer from: https://ollama.com/download

# Verify installation
ollama --version
# ollama version 0.30.x

# Ollama starts as a background service automatically
# Check it's running:
curl http://localhost:11434/api/version

Pulling and Running Models

# Pull a model (downloads to ~/.ollama/models/)
ollama pull llama3.2          # 3B — very fast, good for simple tasks
ollama pull llama3.1:8b       # 8B — excellent general purpose
ollama pull llama3.1:70b      # 70B — near-GPT-4 quality (needs 48GB+ VRAM)
ollama pull mistral           # Mistral 7B — fast and capable
ollama pull qwen2.5:7b        # Alibaba Qwen 2.5 7B — strong on code
ollama pull deepseek-r1:7b    # DeepSeek R1 7B — reasoning model
ollama pull phi4              # Microsoft Phi-4 — 14B, punches above weight
ollama pull nomic-embed-text  # Embedding model for RAG
ollama pull codellama:13b     # Meta's code-specialized model

# Run interactively
ollama run llama3.1:8b

# Single-turn from CLI
ollama run llama3.1:8b "What is the difference between TCP and UDP?"

# List downloaded models
ollama list

# Remove a model
ollama rm llama3.2

# Show model info (parameters, quantization, etc.)
ollama show llama3.1:8b

The Ollama API

Once Ollama is running, it exposes an HTTP API on localhost:11434. There are two API styles:

1. Native Ollama API:

import json
from typing import Optional

import requests

OLLAMA_BASE = "http://localhost:11434"

def chat_ollama(
    prompt: str,
    model: str = "llama3.1:8b",
    system: Optional[str] = None,
    stream: bool = False,
) -> str:
    """Call Ollama native chat API."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})

    response = requests.post(
        f"{OLLAMA_BASE}/api/chat",
        json={
            "model": model,
            "messages": messages,
            "stream": stream,
            "options": {
                "temperature": 0.7,
                "top_p": 0.9,
                "num_ctx": 8192,        # Context window size
                "num_predict": 2048,    # Max tokens to generate
            },
        },
        stream=stream,
        timeout=120,
    )
    response.raise_for_status()

    if stream:
        full_response = []
        for line in response.iter_lines():
            if line:
                chunk = json.loads(line)
                if chunk.get("message", {}).get("content"):
                    content = chunk["message"]["content"]
                    print(content, end="", flush=True)
                    full_response.append(content)
                if chunk.get("done"):
                    print()  # newline
                    break
        return "".join(full_response)
    else:
        return response.json()["message"]["content"]


# Usage
response = chat_ollama(
    "Explain quicksort in Python with code.",
    model="llama3.1:8b",
    system="You are a concise technical educator. Always include runnable code.",
)
print(response)

# Streaming
chat_ollama(
    "Write a haiku about machine learning",
    model="llama3.1:8b",
    stream=True,
)

2. OpenAI-Compatible API (recommended for production):

Ollama exposes an OpenAI-compatible endpoint at /v1/. This means you can use the official openai Python library with almost no code changes: just point base_url at localhost.

from openai import OpenAI

# Drop-in replacement: just change base_url
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Required by the library but ignored by Ollama
)

def chat_local(
    prompt: str,
    model: str = "llama3.1:8b",
    system: str = "You are a helpful assistant.",
    temperature: float = 0.7,
    max_tokens: int = 2048,
) -> str:
    """Chat with a local Ollama model using the OpenAI SDK."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ],
        temperature=temperature,
        max_tokens=max_tokens,
    )
    return response.choices[0].message.content


# Works exactly like the OpenAI API
result = chat_local(
    "What are the SOLID principles in software engineering?",
    model="llama3.1:8b",
)
print(result)

# Streaming also works identically
stream = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "List 5 Python best practices."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()

Generating Embeddings with Ollama

from openai import OpenAI
from typing import List
import numpy as np

embedding_client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",
)

def embed_text(texts: List[str], model: str = "nomic-embed-text") -> List[List[float]]:
    """
    Generate embeddings locally using Ollama.
    
    nomic-embed-text: 768 dimensions, good general-purpose embeddings
    mxbai-embed-large: 1024 dimensions, higher quality
    all-minilm: 384 dimensions, very fast
    """
    response = embedding_client.embeddings.create(
        model=model,
        input=texts,
    )
    return [item.embedding for item in response.data]


# Example: similarity search without cloud
texts = [
    "Python is a high-level programming language",
    "Machine learning uses statistical techniques",
    "Snakes are reptiles found worldwide",
    "Neural networks are inspired by the brain",
]

embeddings = embed_text(texts)

# Cosine similarity
def cosine_similarity(a: List[float], b: List[float]) -> float:
    a_np, b_np = np.array(a), np.array(b)
    return float(np.dot(a_np, b_np) / (np.linalg.norm(a_np) * np.linalg.norm(b_np)))

query = "deep learning algorithms"
query_embedding = embed_text([query])[0]

scores = [(texts[i], cosine_similarity(query_embedding, embeddings[i])) for i in range(len(texts))]
scores.sort(key=lambda x: x[1], reverse=True)

for text, score in scores:
    print(f"{score:.3f}: {text}")
# Example output (exact scores vary by model and version):
# 0.724: Neural networks are inspired by the brain
# 0.689: Machine learning uses statistical techniques
# 0.423: Python is a high-level programming language
# 0.198: Snakes are reptiles found worldwide

Custom Modelfiles: Customizing Behavior

Ollama’s Modelfile is like a Dockerfile for LLMs — it lets you bake in system prompts, custom parameters, and even your own GGUF files:

# Create a custom Modelfile
cat > Modelfile << 'EOF'
# Base model
FROM llama3.1:8b

# Baked-in system prompt — this applies to every conversation
SYSTEM """
You are Aria, a senior Python engineer at a top tech company.
Your responses are:
- Concise but complete
- Always include working code examples
- Focused on production-quality patterns, not toy examples
- You flag potential issues (performance, security, edge cases)
Never pad responses or explain basic concepts unless asked.
"""

# Generation parameters
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 16384
PARAMETER num_predict 4096
PARAMETER repeat_penalty 1.1
EOF

# Build the custom model
ollama create aria-engineer -f Modelfile

# Run your custom model
ollama run aria-engineer "How do I implement connection pooling in PostgreSQL?"

# List it alongside other models
ollama list

# Modelfile with a custom GGUF (bring your own model)
cat > Modelfile.custom << 'EOF'
# Use a GGUF file you downloaded from Hugging Face
FROM ./models/custom-model-Q4_K_M.gguf

SYSTEM "You are a domain expert in financial analysis."

PARAMETER temperature 0.1
PARAMETER num_ctx 32768
EOF

ollama create finance-expert -f Modelfile.custom
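
If you maintain several custom models, generating the Modelfile text from code keeps them consistent. The helper below is a hypothetical sketch (`render_modelfile` is not part of Ollama); it only emits the FROM, SYSTEM, and PARAMETER directives shown above and assumes the system prompt contains no triple quotes.

```python
from typing import Dict, Union


def render_modelfile(
    base: str,
    system: str,
    parameters: Dict[str, Union[int, float]],
) -> str:
    """Build Modelfile text from a base model, system prompt, and parameters."""
    lines = [f"FROM {base}", "", f'SYSTEM """{system}"""', ""]
    lines += [f"PARAMETER {key} {value}" for key, value in parameters.items()]
    return "\n".join(lines) + "\n"


text = render_modelfile(
    base="llama3.1:8b",
    system="You are a concise code reviewer.",
    parameters={"temperature": 0.3, "num_ctx": 16384},
)
print(text)
```

Write the result to a file and build it with `ollama create <name> -f <file>` exactly as in the shell examples above.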

Running Ollama as a Server

For production deployments, you often want Ollama accessible over a network:

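# Run Ollama server bound to all interfaces (network accessible).
# Ollama has no built-in authentication, so expose it only on a trusted
# network or behind a firewall or authenticating reverse proxy.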
OLLAMA_HOST=0.0.0.0:11434 ollama serve

# With GPU configuration
CUDA_VISIBLE_DEVICES=0,1 OLLAMA_HOST=0.0.0.0:11434 ollama serve

# Other useful environment variables (set before starting `ollama serve`)
OLLAMA_MODELS=/mnt/ssd/models      # Custom model storage path
OLLAMA_NUM_PARALLEL=4              # Concurrent requests per model
OLLAMA_MAX_LOADED_MODELS=3         # Models to keep warm in memory
OLLAMA_KEEP_ALIVE=10m              # How long to keep model in memory

On the client side, a small wrapper class can handle connection pooling, retries, and health checks when talking to that server:

import httpx
import json
import re
from typing import Optional, List, Dict
from tenacity import retry, stop_after_attempt, wait_exponential


class OllamaClient:
    """
    Production-ready Ollama client.

    Features:
    - Connection pooling
    - Retry logic
    - Health checks
    - Model management
    - Token usage tracking
    - Markdown/code fence cleanup
    - JSON output support
    - Context manager support
    """

    def __init__(
        self,
        base_url: str = "http://localhost:11434",
        default_model: str = "llama3.1:8b",
        timeout: float = 120.0,
    ):
        self.base_url = base_url
        self.default_model = default_model

        self._client = httpx.Client(
            base_url=base_url,
            timeout=timeout,
            limits=httpx.Limits(
                max_connections=20,
                max_keepalive_connections=10,
            ),
        )

        self._total_completion_tokens = 0
        self._total_prompt_tokens = 0
        self._request_count = 0

    # ----------------------------------------------------
    # Context manager support
    # ----------------------------------------------------

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.close()

    # ----------------------------------------------------
    # Utilities
    # ----------------------------------------------------

    def _clean_response(self, text: str) -> str:
        """
        Remove markdown code fences.
        """

        text = text.strip()

        text = re.sub(
            r"^```[a-zA-Z0-9_+-]*\n",
            "",
            text,
        )

        text = re.sub(
            r"\n```$",
            "",
            text,
        )

        return text.strip()

    # ----------------------------------------------------
    # Health
    # ----------------------------------------------------

    def is_healthy(self) -> bool:
        """
        Check whether Ollama server is running.
        """

        try:
            resp = self._client.get("/api/version")
            return resp.status_code == 200
        except Exception:
            return False

    # ----------------------------------------------------
    # Models
    # ----------------------------------------------------

    def list_models(self) -> List[str]:
        """
        List installed models.
        """

        resp = self._client.get("/api/tags")
        resp.raise_for_status()

        return [
            model["name"]
            for model in resp.json().get("models", [])
        ]

    def pull_model(self, model: str) -> None:
        """
        Download model if not already installed.
        """

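        # Names from /api/tags always carry a tag (e.g. "mistral:latest"),
        # so normalize a bare name before comparing.
        wanted = model if ":" in model else f"{model}:latest"

        if wanted in self.list_models():
            print(f"✓ {model} already installed")
            return

        print(f"Pulling {model}...\n")

        with self._client.stream(
            "POST",
            "/api/pull",
            json={"model": model},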
        ) as response:

            response.raise_for_status()

            for line in response.iter_lines():
                if not line:
                    continue

                data = json.loads(line)

                if "status" in data:
                    print(
                        f"\r{data['status']}",
                        end="",
                        flush=True,
                    )

        print(f"\n✓ {model} ready")

    # ----------------------------------------------------
    # Chat
    # ----------------------------------------------------

    @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(
            multiplier=1,
            min=1,
            max=10,
        ),
        reraise=True,  # surface the original exception, not tenacity.RetryError
    )
    def chat(
        self,
        prompt: str,
        model: Optional[str] = None,
        system: Optional[str] = None,
        context_messages: Optional[List[Dict]] = None,
        temperature: float = 0.7,
        clean_output: bool = True,
    ) -> str:
        """
        Standard chat completion.
        """

        model = model or self.default_model

        messages = []

        if system:
            messages.append(
                {
                    "role": "system",
                    "content": system,
                }
            )

        if context_messages:
            messages.extend(context_messages)

        messages.append(
            {
                "role": "user",
                "content": prompt,
            }
        )

        resp = self._client.post(
            "/api/chat",
            json={
                "model": model,
                "messages": messages,
                "stream": False,
                "options": {
                    "temperature": temperature,
                },
            },
        )

        resp.raise_for_status()

        data = resp.json()

        self._request_count += 1
        self._total_completion_tokens += data.get(
            "eval_count",
            0,
        )

        self._total_prompt_tokens += data.get(
            "prompt_eval_count",
            0,
        )

        content = data["message"]["content"]

        if clean_output:
            content = self._clean_response(content)

        return content

    # ----------------------------------------------------
    # Structured JSON output
    # ----------------------------------------------------

    def chat_json(
        self,
        prompt: str,
        schema_prompt: str,
        model: Optional[str] = None,
    ) -> dict:
        """
        Force JSON response.
        """

        model = model or self.default_model

        resp = self._client.post(
            "/api/chat",
            json={
                "model": model,
                "messages": [
                    {
                        "role": "system",
                        "content": schema_prompt,
                    },
                    {
                        "role": "user",
                        "content": prompt,
                    },
                ],
                "stream": False,
                "format": "json",
            },
        )

        resp.raise_for_status()

        data = resp.json()

        return json.loads(
            data["message"]["content"]
        )

    # ----------------------------------------------------
    # Embeddings
    # ----------------------------------------------------

    def embed(
        self,
        text: str,
        model: str = "nomic-embed-text",
    ):
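        """
        Generate an embedding for a single text.

        Uses the legacy /api/embeddings endpoint ("prompt" in, "embedding" out).
        Newer Ollama releases also provide /api/embed, which takes "input".
        """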

        resp = self._client.post(
            "/api/embeddings",
            json={
                "model": model,
                "prompt": text,
            },
        )

        resp.raise_for_status()

        return resp.json()["embedding"]

    # ----------------------------------------------------
    # Usage stats
    # ----------------------------------------------------

    @property
    def total_tokens_generated(self) -> int:
        return self._total_completion_tokens

    @property
    def total_prompt_tokens(self) -> int:
        return self._total_prompt_tokens

    @property
    def request_count(self) -> int:
        return self._request_count

    # ----------------------------------------------------
    # Cleanup
    # ----------------------------------------------------

    def close(self):
        self._client.close()


# ==========================================================
# Example Usage
# ==========================================================

if __name__ == "__main__":

    with OllamaClient(
        default_model="llama3.1:8b"
    ) as ollama:

        if not ollama.is_healthy():
            raise RuntimeError(
                "Ollama is not running. Start it using:\n\nollama serve"
            )

        print(
            "Available models:",
            ollama.list_models(),
        )

        response = ollama.chat(
            prompt="Write a Python function to validate email addresses.",
            system="""
You are a Python expert.

Rules:
- Return raw Python code only.
- No markdown.
- No triple backticks.
- No explanations.
- First character must be Python code.
""",
            temperature=0,
        )

        print("\nResponse:\n")
        print(response)

        print(
            f"\nCompletion Tokens: {ollama.total_tokens_generated:,}"
        )

        print(
            f"Prompt Tokens: {ollama.total_prompt_tokens:,}"
        )

        print(
            f"Requests Made: {ollama.request_count:,}"
        )
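
Even with JSON output requested, a reply can come back wrapped in a code fence or cut off mid-object, so it helps to parse defensively. The sketch below is a standalone illustration, not part of the client above: `parse_json_reply` is a hypothetical helper that reuses the same fence-stripping idea as `_clean_response` and raises a readable error when the text is not valid JSON.

```python
import json
import re
from typing import Any

FENCE = "`" * 3  # three backticks, built indirectly to keep this listing clean


def parse_json_reply(text: str) -> Any:
    """Parse a model reply as JSON, tolerating a surrounding code fence."""
    cleaned = text.strip()
    cleaned = re.sub(r"^" + FENCE + r"[a-zA-Z0-9_+-]*\n", "", cleaned)
    cleaned = re.sub(r"\n" + FENCE + r"$", "", cleaned)
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError as exc:
        preview = cleaned[:80]
        raise ValueError(f"Model did not return valid JSON: {preview!r}") from exc


fenced = FENCE + 'json\n{"language": "python", "score": 9}\n' + FENCE
print(parse_json_reply(fenced))
# {'language': 'python', 'score': 9}
```

Raising `ValueError` with a short preview makes it easier to log the failure and retry the request than a bare `JSONDecodeError` would.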

Part 4: LM Studio

LM Studio is a GUI application for downloading, managing, and serving local models. It’s ideal for:

Non-CLI users who want a visual interface
Quickly comparing multiple models without writing code
Teams that need a shared local inference server
Exploring model capabilities interactively before integrating into code

Installation

Download from lmstudio.ai — available for macOS, Windows, and Linux.

Key Features

Model Browser: LM Studio has a built-in Hugging Face model browser. Search, filter by VRAM requirement, and download GGUF models directly from the app.

Chat Interface: Test models with a full chat UI, adjustable parameters (temperature, context window, system prompt), and response timing metrics.

Local Server: LM Studio runs an OpenAI-compatible server on localhost:1234. Most code written against the OpenAI chat completions API works with LM Studio once you change the base URL.

Using LM Studio’s Server via Python

from openai import OpenAI

# LM Studio's OpenAI-compatible server
lm_client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="lm-studio",  # Placeholder — not validated
)

def chat_lm_studio(
    prompt: str,
    system: str = "You are a helpful assistant.",
    model: str = "local-model",  # Placeholder identifier (see note below)
    temperature: float = 0.7,
) -> str:
    """
    Chat with whatever model is loaded in LM Studio.

    Note: With a single model loaded, LM Studio typically answers with that
    model whatever identifier you pass. If several models are loaded, pass
    one of the ids returned by lm_client.models.list() instead.
    """
    response = lm_client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ],
        temperature=temperature,
        max_tokens=2048,
    )
    return response.choices[0].message.content


# Get list of loaded models
models = lm_client.models.list()
print("Models available in LM Studio:")
for model in models.data:
    print(f"  - {model.id}")


# Example: technical writing assistant
result = chat_lm_studio(
    "Explain REST vs GraphQL vs gRPC for a senior engineer.",
    system="You are a senior backend architect. Be concise and opinionated.",
)
print(result)

LM Studio Configuration Tips

In the Server Settings panel:

Context Length: Set to match your model’s supported context (8192, 16384, 32768)
GPU Layers: Set to maximum — this offloads as many transformer layers as possible to GPU
CPU Threads: Set to physical core count (not hyperthreads)
Flash Attention: Enable if your hardware supports it (M2/M3, RTX 30xx+) — significant speedup

Part 5: Unified Local + Cloud Client

A robust production pattern is a unified client that tries local inference first and falls back to a cloud API if the local model isn’t available or the task requires higher capability.
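
Before the full client, the core routing idea fits in a few lines. This is a minimal sketch with stand-in callables instead of real SDK clients; `first_available`, `local_down`, and `fake_cloud` are hypothetical names used only for illustration.

```python
from typing import Callable, List, Tuple

Backend = Tuple[str, Callable[[str], str]]


def first_available(prompt: str, backends: List[Backend]) -> Tuple[str, str]:
    """Try each backend in order; return (name, reply) from the first success."""
    errors = []
    for name, call in backends:
        try:
            return name, call(prompt)
        except Exception as exc:  # a real client would catch narrower errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError("All backends failed: " + "; ".join(errors))


def local_down(prompt: str) -> str:
    """Stand-in for a local server that is not running."""
    raise ConnectionError("connection refused")


def fake_cloud(prompt: str) -> str:
    """Stand-in for a cloud API call."""
    return f"cloud answer to: {prompt}"


name, reply = first_available(
    "ping",
    [("ollama", local_down), ("openai", fake_cloud)],
)
print(name, "->", reply)
# openai -> cloud answer to: ping
```

The ordering of the list is the routing policy: put local backends first to prefer them, and the cloud entries only run when everything before them has failed.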

"""
Unified LLM Client
==================

Features:
- Local-first routing (Ollama -> LM Studio -> OpenAI -> Anthropic)
- Connection pooling
- Retry with exponential backoff
- Health checks
- Usage statistics
- Environment variable configuration via .env

Requirements:
pip install openai anthropic httpx python-dotenv tenacity

Example .env:

OPENAI_API_KEY=sk-xxxxxxxxxxxxxxxx
ANTHROPIC_API_KEY=sk-ant-xxxxxxxxxxxxxxxx

OLLAMA_BASE_URL=http://localhost:11434/v1
OLLAMA_API_KEY=ollama

LM_STUDIO_BASE_URL=http://localhost:1234/v1
LM_STUDIO_API_KEY=lm-studio

LOCAL_MODEL=llama3.1:8b
CLOUD_MODEL=gpt-4o-mini

LOCAL_TIMEOUT=60
CLOUD_TIMEOUT=30

PREFER_LOCAL=true
"""

from __future__ import annotations

import logging
import os
import time
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional, Tuple

import httpx
from anthropic import Anthropic
from dotenv import load_dotenv
from openai import OpenAI
from tenacity import retry, stop_after_attempt, wait_exponential

# ------------------------------------------------------------------------------
# Load Environment Variables
# ------------------------------------------------------------------------------

load_dotenv()

# ------------------------------------------------------------------------------
# Logging
# ------------------------------------------------------------------------------

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(message)s",
)

logger = logging.getLogger(__name__)

# ------------------------------------------------------------------------------
# Enums
# ------------------------------------------------------------------------------


class InferenceBackend(Enum):
    OLLAMA = "ollama"
    LM_STUDIO = "lm_studio"
    OPENAI = "openai"
    ANTHROPIC = "anthropic"


# ------------------------------------------------------------------------------
# Config
# ------------------------------------------------------------------------------


@dataclass
class ModelConfig:
    local_model: str = os.getenv("LOCAL_MODEL", "llama3.1:8b")
    cloud_model: str = os.getenv("CLOUD_MODEL", "gpt-4o-mini")
    prefer_local: bool = (
        os.getenv("PREFER_LOCAL", "true").lower() == "true"
    )
    local_timeout: float = float(os.getenv("LOCAL_TIMEOUT", "60"))
    cloud_timeout: float = float(os.getenv("CLOUD_TIMEOUT", "30"))


# ------------------------------------------------------------------------------
# Client
# ------------------------------------------------------------------------------


class UnifiedLLMClient:
    """
    Production-ready LLM Router

    Routing Strategy:
        Ollama
          ↓
        LM Studio
          ↓
        OpenAI
          ↓
        Anthropic
    """

    def __init__(self, config: Optional[ModelConfig] = None):
        self.config = config or ModelConfig()

        # Shared HTTP client for pooling
        self.http_client = httpx.Client(
            timeout=self.config.local_timeout,
            limits=httpx.Limits(
                max_connections=50,
                max_keepalive_connections=20,
            ),
        )

        # ------------------------------------------------------------------
        # Local Clients
        # ------------------------------------------------------------------

        self._ollama = OpenAI(
            base_url=os.getenv(
                "OLLAMA_BASE_URL",
                "http://localhost:11434/v1",
            ),
            api_key=os.getenv(
                "OLLAMA_API_KEY",
                "ollama",
            ),
            timeout=self.config.local_timeout,
            http_client=self.http_client,
        )

        self._lm_studio = OpenAI(
            base_url=os.getenv(
                "LM_STUDIO_BASE_URL",
                "http://localhost:1234/v1",
            ),
            api_key=os.getenv(
                "LM_STUDIO_API_KEY",
                "lm-studio",
            ),
            timeout=self.config.local_timeout,
            http_client=self.http_client,
        )

        # ------------------------------------------------------------------
        # Lazy Cloud Clients
        # ------------------------------------------------------------------

        self._openai: Optional[OpenAI] = None
        self._anthropic: Optional[Anthropic] = None

        # ------------------------------------------------------------------
        # Stats
        # ------------------------------------------------------------------

        self._stats = {
            "ollama_calls": 0,
            "lm_studio_calls": 0,
            "openai_calls": 0,
            "anthropic_calls": 0,
            "fallbacks": 0,
        }

    # --------------------------------------------------------------------------
    # Cloud Clients
    # --------------------------------------------------------------------------

    def _get_openai(self) -> OpenAI:
        if self._openai is None:
            api_key = os.getenv("OPENAI_API_KEY")

            if not api_key:
                raise ValueError(
                    "OPENAI_API_KEY not found in environment"
                )

            self._openai = OpenAI(
                api_key=api_key,
                timeout=self.config.cloud_timeout,
            )

        return self._openai

    def _get_anthropic(self) -> Anthropic:
        if self._anthropic is None:
            api_key = os.getenv("ANTHROPIC_API_KEY")

            if not api_key:
                raise ValueError(
                    "ANTHROPIC_API_KEY not found in environment"
                )

            self._anthropic = Anthropic(
                api_key=api_key,
                timeout=self.config.cloud_timeout,
            )

        return self._anthropic

    # --------------------------------------------------------------------------
    # Health Checks
    # --------------------------------------------------------------------------

    def _is_ollama_running(self) -> bool:
        try:
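            # Strip only a trailing "/v1" so custom base URLs are not mangled
            base = os.getenv(
                "OLLAMA_BASE_URL",
                "http://localhost:11434/v1",
            ).rstrip("/")
            if base.endswith("/v1"):
                base = base[: -len("/v1")]

            response = httpx.get(f"{base}/api/version", timeout=2.0)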
            return response.status_code == 200

        except Exception:
            return False

    def _is_lm_studio_running(self) -> bool:
        try:
            url = os.getenv(
                "LM_STUDIO_BASE_URL",
                "http://localhost:1234/v1",
            ) + "/models"

            response = httpx.get(url, timeout=2.0)
            return response.status_code == 200

        except Exception:
            return False

    # --------------------------------------------------------------------------
    # Retry Wrapper
    # --------------------------------------------------------------------------

    @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=1, max=8),
        reraise=True,
    )
    def _chat_with_backend(
        self,
        backend: InferenceBackend,
        messages: List[Dict],
        temperature: float,
        max_tokens: int,
    ) -> Tuple[str, InferenceBackend]:

        # --------------------------------------------------------------
        # Ollama
        # --------------------------------------------------------------

        if backend == InferenceBackend.OLLAMA:
            response = self._ollama.chat.completions.create(
                model=self.config.local_model,
                messages=messages,
                temperature=temperature,
                max_tokens=max_tokens,
            )

            self._stats["ollama_calls"] += 1

            return (
                response.choices[0].message.content,
                backend,
            )

        # --------------------------------------------------------------
        # LM Studio
        # --------------------------------------------------------------

        elif backend == InferenceBackend.LM_STUDIO:
            response = self._lm_studio.chat.completions.create(
                model="local-model",
                messages=messages,
                temperature=temperature,
                max_tokens=max_tokens,
            )

            self._stats["lm_studio_calls"] += 1

            return (
                response.choices[0].message.content,
                backend,
            )

        # --------------------------------------------------------------
        # OpenAI
        # --------------------------------------------------------------

        elif backend == InferenceBackend.OPENAI:
            response = (
                self._get_openai()
                .chat.completions.create(
                    model=self.config.cloud_model,
                    messages=messages,
                    temperature=temperature,
                    max_tokens=max_tokens,
                )
            )

            self._stats["openai_calls"] += 1

            return (
                response.choices[0].message.content,
                backend,
            )

        # --------------------------------------------------------------
        # Anthropic
        # --------------------------------------------------------------

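        elif backend == InferenceBackend.ANTHROPIC:
            # Anthropic takes the system prompt as a top-level argument
            # rather than as a message with role "system".
            system_prompt = "\n".join(
                m["content"]
                for m in messages
                if m["role"] == "system"
            )
            extra = (
                {"system": system_prompt} if system_prompt else {}
            )

            response = self._get_anthropic().messages.create(
                model="claude-3-5-sonnet-latest",
                max_tokens=max_tokens,
                temperature=temperature,
                # Send the whole conversation, minus system messages.
                messages=[
                    m
                    for m in messages
                    if m["role"] != "system"
                ],
                **extra,
            )

            self._stats["anthropic_calls"] += 1

            return (
                response.content[0].text,
                backend,
            )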

        raise ValueError(f"Unsupported backend: {backend}")

    # --------------------------------------------------------------------------
    # Main Chat API
    # --------------------------------------------------------------------------

    def chat(
        self,
        prompt: str,
        system: str = "You are a helpful assistant.",
        temperature: float = 0.7,
        max_tokens: int = 2048,
        force_cloud: bool = False,
        force_local: bool = False,
    ) -> Dict:

        messages = [
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ]

        if force_cloud:
            backends_to_try = [
                InferenceBackend.OPENAI,
                InferenceBackend.ANTHROPIC,
            ]

        elif force_local:
            backends_to_try = [
                InferenceBackend.OLLAMA,
                InferenceBackend.LM_STUDIO,
            ]

        elif self.config.prefer_local:
            backends_to_try = [
                InferenceBackend.OLLAMA,
                InferenceBackend.LM_STUDIO,
                InferenceBackend.OPENAI,
                InferenceBackend.ANTHROPIC,
            ]

        else:
            backends_to_try = [
                InferenceBackend.OPENAI,
                InferenceBackend.ANTHROPIC,
            ]

        last_error = None
        first_attempt = True

        for backend in backends_to_try:

            if (
                backend == InferenceBackend.OLLAMA
                and not self._is_ollama_running()
            ):
                logger.info(
                    "Ollama not running. Skipping."
                )
                continue

            if (
                backend == InferenceBackend.LM_STUDIO
                and not self._is_lm_studio_running()
            ):
                logger.info(
                    "LM Studio not running. Skipping."
                )
                continue

            try:
                if not first_attempt:
                    self._stats["fallbacks"] += 1
                    logger.warning(
                        f"Falling back to {backend.value}"
                    )

                first_attempt = False

                start_time = time.monotonic()

                response, used_backend = (
                    self._chat_with_backend(
                        backend=backend,
                        messages=messages,
                        temperature=temperature,
                        max_tokens=max_tokens,
                    )
                )

                latency_ms = (
                    time.monotonic() - start_time
                ) * 1000

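                return {
                    "response": response,
                    "backend": used_backend.value,
                    "latency_ms": round(latency_ms),
                    # Report the model name actually sent to each backend.
                    "model": {
                        InferenceBackend.OLLAMA: self.config.local_model,
                        InferenceBackend.LM_STUDIO: "local-model",
                        InferenceBackend.OPENAI: self.config.cloud_model,
                        InferenceBackend.ANTHROPIC: "claude-3-5-sonnet-latest",
                    }[used_backend],
                }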

            except Exception as e:
                logger.exception(
                    f"{backend.value} failed"
                )

                last_error = e
                continue

        raise RuntimeError(
            f"All inference backends failed.\n"
            f"Last error: {last_error}"
        )

    # --------------------------------------------------------------------------
    # Stats
    # --------------------------------------------------------------------------

    @property
    def stats(self):

        total = (
            self._stats["ollama_calls"]
            + self._stats["lm_studio_calls"]
            + self._stats["openai_calls"]
            + self._stats["anthropic_calls"]
        )

        local_calls = (
            self._stats["ollama_calls"]
            + self._stats["lm_studio_calls"]
        )

        cloud_calls = (
            self._stats["openai_calls"]
            + self._stats["anthropic_calls"]
        )

        return {
            **self._stats,
            "total_calls": total,
            "local_percentage": round(
                local_calls / max(total, 1) * 100,
                1,
            ),
            "cloud_percentage": round(
                cloud_calls / max(total, 1) * 100,
                1,
            ),
        }

    # --------------------------------------------------------------------------
    # Cleanup
    # --------------------------------------------------------------------------

    def close(self):
        try:
            self.http_client.close()
        except Exception:
            pass


# ------------------------------------------------------------------------------
# Utility
# ------------------------------------------------------------------------------


def validate_config():
    print("\n=== Configuration ===")
    print(
        "LOCAL_MODEL:",
        os.getenv("LOCAL_MODEL"),
    )
    print(
        "CLOUD_MODEL:",
        os.getenv("CLOUD_MODEL"),
    )
    print(
        "PREFER_LOCAL:",
        os.getenv("PREFER_LOCAL"),
    )

    print(
        "OPENAI_API_KEY:",
        "SET"
        if os.getenv("OPENAI_API_KEY")
        else "MISSING",
    )

    print(
        "ANTHROPIC_API_KEY:",
        "SET"
        if os.getenv("ANTHROPIC_API_KEY")
        else "MISSING",
    )

    print("=====================\n")


# ------------------------------------------------------------------------------
# Example Usage
# ------------------------------------------------------------------------------

if __name__ == "__main__":

    validate_config()

    client = UnifiedLLMClient()

    try:
        # Automatic routing
        result = client.chat(
            prompt="Explain database indexing strategies.",
            system="You are a senior database architect.",
        )

        print(
            f"\nBackend: {result['backend']}"
        )
        print(
            f"Latency: {result['latency_ms']}ms"
        )
        print(
            f"Model: {result['model']}\n"
        )

        print(result["response"])

        # Force cloud
        cloud_result = client.chat(
            prompt="Compare monolithic and microservice architectures.",
            force_cloud=True,
        )

        print(
            f"\nCloud Backend: {cloud_result['backend']}"
        )

        # Force local
        local_result = client.chat(
            prompt="Summarize confidential HR notes.",
            force_local=True,
        )

        print(
            f"\nLocal Backend: {local_result['backend']}"
        )

        print("\nStats:")
        print(client.stats)

    finally:
        client.close()

Part 6: Choosing the Right Local Model

Not all local models are equal. Here’s a practical guide for 2026:

Task                     Recommended Model           Why
─────────────────────────────────────────────────────────────────────────────
General chat/writing     Llama 3.1 8B               Best balance, well-tested
Code generation          Qwen 2.5 Coder 7B          Trained on code, fast
Python specifically      DeepSeek-Coder-V2 Lite 16B Top Python benchmark scores
Reasoning/math           DeepSeek-R1 7B             CoT reasoning, offline
Instruction following    Mistral 7B Instruct        Very reliable formatting
Embeddings               nomic-embed-text           Fast, good quality
Document summarization   Llama 3.1 8B (16K ctx)     Long context support
Privacy-sensitive tasks  Phi-4 14B                  Microsoft, no telemetry
Resource-constrained     Llama 3.2 3B / Phi-3 Mini  Runs on 4GB RAM
Maximum quality local    Llama 3.1 70B Q4           Near-frontier, 40GB VRAM

Part 7: Performance Benchmarking

Before deploying locally, benchmark your setup:

import time
import statistics
from typing import Callable, Optional

def benchmark_model(
    chat_fn: Callable[[str], str],
    model_name: str,
    prompts: Optional[list[str]] = None,
    runs: int = 5,
) -> dict:
    """
    Benchmark a local model's performance.
    
    Measures:
    - Total latency per request (mean, median, p95, min, max)
    - Approximate tokens per second, estimated from word count
    
    Time to first token (TTFT) is not measured; that requires streaming.
    The first request may also include model load time.
    """
    if prompts is None:
        prompts = [
            "What is 47 * 89?",
            "List 5 Python built-in functions.",
            "What's the capital of France?",
            "Write a one-line function to reverse a string in Python.",
            "Explain what HTTP means in one sentence.",
        ]
    
    latencies = []
    
    print(f"\n{'─'*50}")
    print(f"Benchmarking: {model_name}")
    print(f"Runs: {runs} prompts × 1 = {runs} total requests")
    print(f"{'─'*50}")
    
    for i, prompt in enumerate(prompts[:runs]):
        start = time.monotonic()
        response = chat_fn(prompt)
        elapsed = time.monotonic() - start
        
        words = len(response.split())
        approx_tokens = int(words * 1.3)
        tokens_per_second = approx_tokens / elapsed if elapsed > 0 else 0
        
        latencies.append(elapsed)
        print(f"Run {i+1}: {elapsed:.2f}s | ~{tokens_per_second:.0f} tok/s | {len(response)} chars")
    
    return {
        "model": model_name,
        "mean_latency_s": round(statistics.mean(latencies), 2),
        "p50_latency_s": round(statistics.median(latencies), 2),
        "p95_latency_s": round(sorted(latencies)[int(len(latencies) * 0.95)], 2) if len(latencies) >= 2 else latencies[-1],
        "min_latency_s": round(min(latencies), 2),
        "max_latency_s": round(max(latencies), 2),
    }


# Run benchmark
from openai import OpenAI

ollama = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

def make_chat_fn(model: str) -> Callable:
    def chat(prompt: str) -> str:
        response = ollama.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=256,
        )
        return response.choices[0].message.content
    return chat

# Compare models
models_to_benchmark = ["llama3.2", "llama3.1:8b", "mistral"]

results = []
for model in models_to_benchmark:
    result = benchmark_model(make_chat_fn(model), model)
    results.append(result)

print("\n\nSummary:")
print(f"{'Model':<25} {'Mean (s)':<12} {'P50 (s)':<12} {'P95 (s)':<12}")
print("─" * 60)
for r in results:
    print(f"{r['model']:<25} {r['mean_latency_s']:<12} {r['p50_latency_s']:<12} {r['p95_latency_s']:<12}")
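The benchmark above times complete, non-streaming responses, so it cannot see time to first token. Here is a minimal sketch of how TTFT could be measured, assuming the backend can stream text chunks. `measure_stream` and `fake_stream` are hypothetical helpers written for this illustration, and the commented usage assumes the OpenAI SDK's `stream=True` option against the same Ollama endpoint.

```python
import time
from typing import Iterable


def measure_stream(chunks: Iterable[str]) -> dict:
    """Consume a stream of text chunks and time it.

    Hypothetical helper: `chunks` can be any iterable that yields pieces
    of the response as they arrive.
    """
    start = time.monotonic()
    first_chunk_at = None
    parts = []

    for chunk in chunks:
        if not chunk:
            continue  # skip empty chunks
        if first_chunk_at is None:
            first_chunk_at = time.monotonic()
        parts.append(chunk)

    end = time.monotonic()
    return {
        "text": "".join(parts),
        "ttft_s": None if first_chunk_at is None else first_chunk_at - start,
        "total_s": end - start,
    }


def fake_stream():
    """Stand-in for a model: waits, then yields three chunks."""
    time.sleep(0.05)
    for piece in ["Hello", ", ", "world"]:
        yield piece
        time.sleep(0.01)


result = measure_stream(fake_stream())
print(f"TTFT: {result['ttft_s']:.3f}s | total: {result['total_s']:.3f}s")

# With a real server (assumption: OpenAI SDK client pointed at Ollama):
#   stream = ollama.chat.completions.create(
#       model="llama3.1:8b",
#       messages=[{"role": "user", "content": "Hello"}],
#       stream=True,
#   )
#   measure_stream(c.choices[0].delta.content or "" for c in stream)
```

TTFT is what users perceive as responsiveness in a chat UI, while total latency matters more for batch jobs, so it helps to report both.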

🔍 Common Mistakes

1. Running a 7B model on 8GB RAM without a GPU: Without GPU offloading, a Q4 7B model takes up most of your RAM, and inference can drop to a few tokens per second. Either use a smaller model (3B) or add a GPU. Run ollama ps to see how much of a loaded model sits on the GPU versus the CPU.

2. Ignoring num_ctx (context window): Ollama’s default context window (2048 tokens in many releases) is far smaller than what most models support, and input that exceeds it gets truncated. For longer documents or multi-turn conversations, set num_ctx to 8192 or higher. A larger context uses more memory, so balance accordingly.

3. Not keeping models warm: Ollama unloads a model from memory after 5 minutes of inactivity by default. If your app has bursty traffic, the first request after an idle period will be slow while the model reloads. Set OLLAMA_KEEP_ALIVE=60m to keep models loaded longer.

4. Running LM Studio and Ollama on the same machine simultaneously: Both claim GPU memory. Loading models in both at once can exhaust VRAM and push layers onto the CPU. Pick one per session, or assign them to different GPUs if you have more than one.

5. Pulling models without checking VRAM first: ollama pull llama3.1:70b downloads about 40GB. With only 12GB of VRAM, most layers run on the CPU and generation slows to roughly 1–3 tokens/second. Always check your VRAM before pulling large models.

6. Forgetting that temperature matters more for local models: Local models are generally more literal and less “creative” than frontier cloud models. A temperature of 0.7 that works well on GPT-4o may feel flat on Llama 3.1 8B. Experiment: 0.8–0.9 often works better locally.
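Mistakes 2 and 3 can also be addressed per request rather than server-wide. Here is a minimal sketch, assuming Ollama’s native /api/chat endpoint, which accepts an `options` object (including `num_ctx`) and a `keep_alive` field. `build_chat_payload` is a hypothetical helper, not part of any library, and the commented request assumes a server on the default local port.

```python
import json


def build_chat_payload(
    model: str,
    prompt: str,
    num_ctx: int = 8192,
    keep_alive: str = "60m",
) -> dict:
    """Build a request body for Ollama's native /api/chat endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "options": {"num_ctx": num_ctx},  # context window for this request
        "keep_alive": keep_alive,  # how long the model stays loaded afterwards
    }


payload = build_chat_payload("llama3.1:8b", "Summarize this report.")
print(json.dumps(payload, indent=2))

# To send it (assumes Ollama is listening on localhost:11434):
#   import httpx
#   reply = httpx.post(
#       "http://localhost:11434/api/chat", json=payload, timeout=120.0
#   )
#   print(reply.json()["message"]["content"])
```

Setting these per request is handy when only some workloads need a long context, since a larger num_ctx raises memory use for every request that uses it.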

💼 Quick Questions

Q: What is model quantization and what are the tradeoffs? Quantization reduces the bit-width of model weights from FP32/FP16 to INT8, INT4, or lower. This shrinks the memory footprint and speeds up inference at the cost of slight quality degradation. Q4_K_M is the sweet spot: ~7x smaller than FP32 (about 3.5x smaller than FP16) with ~1% quality loss. Going below Q4 (Q3, Q2) produces noticeable quality degradation.

Q: How does Ollama’s GPU offloading work? Ollama uses llama.cpp under the hood, which splits transformer layers between GPU VRAM (fast) and CPU RAM (slow). Ollama chooses the split automatically based on available VRAM; to override it, set the num_gpu parameter to the number of layers to offload (llama.cpp itself exposes this as --n-gpu-layers). If your model fits entirely in VRAM, all layers are GPU-accelerated. With partial offloading, speed depends heavily on the share of layers on the GPU, because the CPU-resident layers become the bottleneck.

Q: When would you choose a local LLM over a cloud API in production? Local LLMs are preferred when: (1) data cannot leave the network (HIPAA, legal, financial PII), (2) latency requirements are sub-100ms, (3) usage volume makes API costs prohibitive, (4) offline/edge deployment is required, or (5) custom fine-tuned models need to be served. Cloud APIs are preferred when: maximum quality is needed, compute infrastructure isn’t available, or usage is sporadic.

Q: What is GGUF and why does it exist? GGUF (GPT-Generated Unified Format) replaced the older GGML format in 2023. It’s a binary format that packages model weights, tokenizer, metadata, and quantization information into a single file. It supports memory mapping, enabling fast loading, and multiple quantization types in one format. It’s the standard format for CPU/hybrid inference in the llama.cpp ecosystem.
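The size arithmetic behind the quantization answer can be checked in a few lines. This is a rough sketch that counts weight storage only (no KV cache or runtime overhead). The 4.8 bits per weight used for Q4_K_M is an assumed average chosen for illustration, since k-quants mix precisions and land somewhat above 4 bits.

```python
def model_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS_7B = 7e9

# The Q4_K_M figure is an assumed average, not an exact specification.
formats = [
    ("FP32", 32),
    ("FP16", 16),
    ("INT8", 8),
    ("Q4_K_M (assumed 4.8 bpw)", 4.8),
]

for label, bits in formats:
    size = model_size_gb(PARAMS_7B, bits)
    print(f"{label:<26} {size:5.1f} GB")
```

For a 7B model this gives about 28, 14, 7, and 4.2 GB, which is where the ~7x (versus FP32) and ~3.5x (versus FP16) reductions come from.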

🏭 Production Considerations

Model serving infrastructure: For production teams, consider running Ollama behind a load balancer with multiple GPU nodes. Ollama’s API is stateless — you can load-balance across instances running the same model.

Model selection by task type: Build a routing layer that sends different task types to different models. Use Llama 3.2 3B for simple classification/extraction tasks (fast, cheap), Llama 3.1 8B for general generation, and cloud APIs for complex reasoning.
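One way such a routing layer could look is sketched below. The task categories, the `ROUTES` table, and the `cloud-default` placeholder are all assumptions for illustration. The model tags follow Ollama’s naming, and the override for sensitive data reflects the privacy argument made earlier.

```python
from enum import Enum


class TaskType(Enum):
    CLASSIFICATION = "classification"
    EXTRACTION = "extraction"
    GENERATION = "generation"
    REASONING = "reasoning"


# Hypothetical routing table: task type -> (backend, model tag).
ROUTES = {
    TaskType.CLASSIFICATION: ("local", "llama3.2:3b"),
    TaskType.EXTRACTION: ("local", "llama3.2:3b"),
    TaskType.GENERATION: ("local", "llama3.1:8b"),
    TaskType.REASONING: ("cloud", "cloud-default"),  # placeholder name
}


def route(task: TaskType, contains_sensitive_data: bool = False) -> tuple[str, str]:
    """Pick a backend and model; sensitive data is never sent to the cloud."""
    backend, model = ROUTES[task]
    if backend == "cloud" and contains_sensitive_data:
        return ("local", "llama3.1:8b")
    return (backend, model)


print(route(TaskType.CLASSIFICATION))
print(route(TaskType.REASONING))
print(route(TaskType.REASONING, contains_sensitive_data=True))
```

Keeping the table as plain data makes it easy to move a task type to a different model after benchmarking, without touching the calling code.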

Data governance: The primary enterprise value of local LLMs is data residency. Document your inference stack architecture clearly — CISOs and legal teams need to verify that customer data never touches external APIs.

Monitoring: Ollama doesn’t expose Prometheus metrics natively. Build a wrapper that logs model name, latency, token counts, and error rates to your observability stack (Prometheus + Grafana, Datadog, etc.).
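A wrapper of that kind can be quite small. The sketch below is one possible shape, not a standard API: `record_call` is a hypothetical helper that times any callable and hands a metrics record to a `sink` function, which in a real deployment would forward to your metrics client instead of a list.

```python
import time
from typing import Any, Callable


def record_call(
    model: str,
    fn: Callable[[], Any],
    sink: Callable[[dict], None],
) -> Any:
    """Run fn(), then hand one metrics record to sink, even when fn raises."""
    start = time.monotonic()
    record = {"model": model, "ok": True, "error": None}
    try:
        return fn()
    except Exception as exc:
        record["ok"] = False
        record["error"] = type(exc).__name__
        raise
    finally:
        record["latency_ms"] = round((time.monotonic() - start) * 1000, 1)
        sink(record)


records: list[dict] = []

# Successful call (the lambda stands in for a real model request).
record_call("llama3.1:8b", lambda: "pretend model output", records.append)

# Failing call: the error is recorded, then re-raised to the caller.
try:
    record_call("llama3.1:8b", lambda: 1 / 0, records.append)
except ZeroDivisionError:
    pass

for r in records:
    print(r)
```

Token counts could be added to the record from the response object, for example the `usage` field that OpenAI-compatible responses carry, assuming the backend populates it.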

Docker deployment:

# Dockerfile for Ollama-based AI service
FROM ollama/ollama:latest

# Pre-pull models during build (bakes model into image — large but self-contained)
RUN ollama serve & sleep 5 && ollama pull llama3.1:8b && kill %1

ENV OLLAMA_HOST=0.0.0.0:11434
ENV OLLAMA_KEEP_ALIVE=30m
EXPOSE 11434

# The base image's entrypoint is already the ollama binary, so pass only the subcommand
CMD ["serve"]

# docker-compose.yml
version: '3.8'
services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_KEEP_ALIVE=30m
      - OLLAMA_NUM_PARALLEL=2
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
  
  ai-service:
    build: .
    ports:
      - "8000:8000"
    depends_on:
      - ollama
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434/v1

volumes:
  ollama_data:
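Note that depends_on only orders container start-up; it does not wait for the Ollama API to accept requests. A minimal sketch of a readiness loop that the dependent service could run at start-up follows. `wait_until_ready` is a hypothetical helper, and the commented probe assumes the `ollama` hostname from the compose file and the /api/version endpoint used by the health check earlier.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
) -> bool:
    """Call probe() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if probe():
                return True
        except Exception:
            pass  # treat connection errors as "not ready yet"
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Demo with a fake probe that succeeds on its third call.
attempts = {"count": 0}


def fake_probe() -> bool:
    attempts["count"] += 1
    return attempts["count"] >= 3


print(wait_until_ready(fake_probe, timeout_s=5.0, interval_s=0.01))

# Real probe (assumption: httpx installed, service name "ollama"):
#   import httpx
#   probe = lambda: httpx.get(
#       "http://ollama:11434/api/version", timeout=2.0
#   ).status_code == 200
```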

⚡ Performance & Scalability

GPU layer optimization: In Ollama, run ollama ps while a model is loaded to see how it is split between GPU and CPU. The more of the model that sits on the GPU, the faster the inference. If you have spare VRAM, raise the GPU layer count in a Modelfile: PARAMETER num_gpu 99 (offload all possible layers).

Parallel requests: Unless configured otherwise, Ollama may serve only one request at a time per model (the default depends on the version and available memory). For concurrent workloads, set OLLAMA_NUM_PARALLEL=4 (or higher) and make sure you have enough VRAM, because each parallel request needs its own KV cache allocation.

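KV cache: The KV (key-value) cache stores the attention keys and values for every token in the context so they are not recomputed for each new token. Longer contexts and more parallel requests require proportionally more VRAM. A rough formula: KV cache size ≈ 2 × num_layers × num_kv_heads × head_dim × context_length × precision_bytes, where the 2 covers keys and values, and num_kv_heads is smaller than the attention head count in grouped-query-attention models such as Llama 3.1.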
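Plugging numbers into that formula shows how quickly the cache grows. The sketch below counts key/value heads, which in grouped-query-attention models are fewer than the attention heads, and uses Llama 3.1 8B’s published shape (32 layers, 8 KV heads, head dimension 128) with an FP16 cache. Real runtimes add some overhead on top, so treat the result as a lower bound.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    context_length: int,
    precision_bytes: int = 2,
) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2: one tensor for keys and one for values, per layer.
    return 2 * num_layers * num_kv_heads * head_dim * context_length * precision_bytes


# Llama 3.1 8B: 32 layers, 8 KV heads, head dimension 128, FP16 cache.
one_sequence = kv_cache_bytes(32, 8, 128, context_length=8192)
print(f"{one_sequence / 2**30:.2f} GiB per 8192-token sequence")

# Four parallel slots, each with its own 8192-token cache.
print(f"{4 * one_sequence / 2**30:.2f} GiB for 4 parallel requests")
```

That works out to 128 KiB per token, so an 8192-token context costs 1 GiB, and four parallel requests at that length cost 4 GiB on top of the model weights.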

Batching: For batch workloads (processing many documents), send requests concurrently rather than one after another. Python asyncio with an async HTTP client (aiohttp, httpx, or the OpenAI SDK’s AsyncOpenAI) against Ollama’s HTTP API significantly increases throughput, provided OLLAMA_NUM_PARALLEL allows more than one request at a time.
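A bounded-concurrency pattern for such batch jobs might look like the sketch below. `run_batch` is a hypothetical helper, and `fake_worker` stands in for a real async model call so the example runs without a server. In practice the worker could wrap an async client pointed at Ollama’s base URL.

```python
import asyncio
from typing import Awaitable, Callable


async def run_batch(
    prompts: list[str],
    worker: Callable[[str], Awaitable[str]],
    max_concurrency: int = 4,
) -> list[str]:
    """Run worker over prompts with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(prompt: str) -> str:
        async with semaphore:
            return await worker(prompt)

    # gather returns results in the same order as the input prompts.
    return await asyncio.gather(*(guarded(p) for p in prompts))


async def fake_worker(prompt: str) -> str:
    """Stand-in for an async model call."""
    await asyncio.sleep(0.01)
    return prompt.upper()


results = asyncio.run(
    run_batch(
        ["first doc", "second doc", "third doc"],
        fake_worker,
        max_concurrency=2,
    )
)
print(results)
```

Matching max_concurrency to the server’s OLLAMA_NUM_PARALLEL setting is a reasonable starting point, since extra in-flight requests would only queue on the server.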

🔑 Key Takeaways

Quantization makes local inference practical — Q4_K_M reduces a 7B model to ~4GB with only ~1% quality loss
Ollama is the standard CLI for local LLMs — pull any model in one command, OpenAI-compatible API out of the box
LM Studio is the GUI alternative — visual model manager, built-in chat, OpenAI-compatible server
Apple Silicon has a structural advantage — unified memory means no VRAM bottleneck for M2/M3 Max users
Local ≠ worse — Llama 3.1 8B is competitive with GPT-3.5 on most tasks, running completely offline
The right architecture — local-first with cloud fallback gives you privacy, cost efficiency, and reliability simultaneously
Same OpenAI SDK, different base_url — swap between local and cloud with a single config change

📚 Further Reading

Ollama GitHub — Source, issues, and model library
LM Studio Documentation — Official LM Studio docs
llama.cpp — The inference engine under Ollama
GGUF Format Specification — Deep dive into the file format
Hugging Face GGUF Models — Browse thousands of quantized models
Open LLM Leaderboard — Compare local model quality benchmarks
