Day 4 — Tokens, Embeddings & Semantic Space
“Language is analog. Computers are digital. Tokenization and embeddings are the bridge — the translation layer that lets machines work with meaning, not just symbols.”

Why This Day Matters
Tokenization and embeddings are not glamorous topics. They don’t have the dramatic appeal of attention mechanisms or scaling laws. But they are the foundation on which everything else is built — and misunderstanding them causes subtle, expensive bugs in production.
When your RAG system retrieves the wrong documents, it’s often an embedding problem. When your LLM fails on math or code, it’s often a tokenization problem. When your multilingual AI underperforms on certain languages, you guessed it — tokenization.
This day gives you the mental model to reason about these failures and fix them.
Part 1: Tokenization — From Text to Numbers
Why Not Characters or Words?
The first design question in any NLP system: what is the unit of input?
Characters: Every character is a token. Vocabulary is tiny (128 symbols for ASCII, 256 if you work on raw bytes). Sequences are very long, making attention expensive. Models struggle to learn word-level semantics.
Words: Every word is a token. Clean and intuitive. But vocabulary is huge (millions of words across languages). Rare words, typos, and new words are “unknown” — the model can’t handle them.
Subwords: The winning compromise. Break words into meaningful fragments. Common words remain whole. Rare words split into recognizable pieces.
"tokenization" → ["token", "ization"]
"unbelievable" → ["un", "believ", "able"]
"ChatGPT" → ["Chat", "G", "PT"]
"Transformer" → ["Trans", "former"]
Modern LLMs use subword tokenization. The vocabulary is typically 32K to 200K tokens — large enough to cover most common words whole, small enough to keep embeddings manageable.
Byte-Pair Encoding (BPE)
BPE is the tokenization algorithm behind GPT-2, GPT-3, GPT-4, and Llama. Training works as follows:
- Start with a vocabulary of individual characters (or bytes)
- Count every adjacent pair of symbols in the training corpus and pick the most frequent one
- Merge that pair into a new symbol and add it to the vocabulary
- Repeat until the target vocabulary size is reached
# BPE training illustration (conceptual)
corpus = ["low", "lower", "newest", "wider", "new"]
# Start: character-level
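# Iteration 1: "e" + "r" is among the most frequent pairs → merge to "er"
#   (several pairs tie in this tiny corpus; the tie-break is implementation-specific)
# Iteration 2: "e" + "w" → merge to "ew"
# Iteration 3: "n" + "ew" → merge to "new"
# ...continues until target vocab size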
# Final vocabulary might include:
# ["l", "o", "w", "e", "r", "n", "d", "i", "s", "t", "low", "er", "new", "est", ...]
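The merge loop described above can be sketched in a few lines of Python. This is a minimal illustration, not how any production tokenizer is implemented: the function name `train_bpe`, the word-frequency input format, and the tie-breaking rule (the first pair encountered wins) are all assumptions made for this sketch. It uses a toy corpus with no ties so that the merge order is unambiguous.

```python
from collections import Counter

def train_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` BPE merges from a {word: count} mapping."""
    # Every word starts as a tuple of single-character symbols.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = pairs.most_common(1)[0][0]
        merges.append(best)
        # Rewrite every word with the winning pair fused into one symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges, vocab

# Hypothetical toy corpus: word -> number of occurrences.
toy_corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
merges, vocab = train_bpe(toy_corpus, num_merges=3)
print(merges)  # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
```

After three merges the frequent word "hug" has become a single symbol, while the rarer "pug" is still split into "p" + "ug". That is the "common words whole, rare words in pieces" behavior in miniature.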
GPT uses byte-level BPE — operating on raw bytes rather than Unicode characters. This means it can tokenize any text in any language or encoding without ever seeing an “unknown” token. Every byte (0–255) is in the base vocabulary.
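To see why a byte-level base vocabulary never needs an "unknown" token, it helps to look at the raw UTF-8 bytes a string turns into. The helper name `to_byte_symbols` is made up for this sketch; it only shows the starting symbols, before any merges are applied.

```python
def to_byte_symbols(text: str) -> list[int]:
    """Return the UTF-8 byte values a byte-level tokenizer starts from."""
    return list(text.encode("utf-8"))

for sample in ["hello", "こんにちは", "🙂"]:
    symbols = to_byte_symbols(sample)
    # Every value is in 0..255, so it is always in the base vocabulary.
    assert all(0 <= b <= 255 for b in symbols)
    print(f"{sample!r}: {len(sample)} chars -> {len(symbols)} bytes")
```

The same output also hints at the cost side: five Japanese characters start out as 15 base symbols, so they need many learned merges before they become as compact as five ASCII letters.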
WordPiece
WordPiece is used by BERT, DistilBERT, and many encoder models. Similar to BPE but uses a different merge criterion — it maximizes the likelihood of the training data rather than simple pair frequency.
"unbelievably" → ["un", "##believ", "##ably"]
The "##" prefix indicates a continuation (not a word start)
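Once a WordPiece vocabulary exists, applying it is a greedy longest-match-first scan over each word. The sketch below illustrates that scan and the "##" continuation marker. The vocabulary is a hypothetical toy set, and a real BERT tokenizer also performs steps omitted here, such as lowercasing and punctuation splitting.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word, BERT-style."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"un", "##believ", "##ably", "##able", "token", "##ization"}
print(wordpiece_tokenize("unbelievably", toy_vocab))  # ['un', '##believ', '##ably']
print(wordpiece_tokenize("xyz", toy_vocab))           # ['[UNK]']
```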
SentencePiece
SentencePiece (used by T5, ALBERT, and Llama before version 3) is a tokenizer library that treats the input as a raw stream of Unicode characters, whitespace included, without any language-specific pre-tokenization. This lets it handle languages that do not separate words with spaces.
English: "Hello world" → ["▁Hello", "▁world"]
Japanese: "こんにちは" → ["▁こん", "にちは"]
Code: "def foo():" → ["▁def", "▁foo", "():"]
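The "▁" character (U+2581, which only looks like an underscore) stands for the space before a word, so word boundaries survive tokenization and the text can be reconstructed exactly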
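The whitespace handling can be mimicked in plain Python. This is only a sketch of the escaping idea, assuming the library's default behavior of adding a leading marker. The function names are invented here, and the piece list passed to `detokenize` is a hand-written example, not real SentencePiece output.

```python
SPACE_MARK = "\u2581"  # "▁", LOWER ONE EIGHTH BLOCK

def escape_whitespace(text: str) -> str:
    """Prefix the text with the marker and replace every space with it."""
    return SPACE_MARK + text.replace(" ", SPACE_MARK)

def detokenize(pieces: list[str]) -> str:
    """Concatenate pieces and turn the markers back into spaces."""
    text = "".join(pieces).replace(SPACE_MARK, " ")
    # Drop the space produced by the leading marker, if there is one.
    return text[1:] if text.startswith(" ") else text

print(escape_whitespace("Hello world"))      # ▁Hello▁world
print(detokenize(["▁Hello", "▁wor", "ld"]))  # Hello world
```

Because spaces are ordinary symbols, detokenization is plain string concatenation and needs no language-specific rules about where spaces belong.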
Comparing Tokenizers: The Efficiency Problem
Different tokenizers encode the same text with different efficiency. This has direct cost and performance implications.
import tiktoken
text = "The transformer architecture revolutionized natural language processing."
# GPT-4 tokenizer (cl100k_base)
gpt4_enc = tiktoken.encoding_for_model("gpt-4")
gpt4_tokens = gpt4_enc.encode(text)
print(f"GPT-4: {len(gpt4_tokens)} tokens") # ~9 tokens
# Compare a code snippet
code = "def calculate_attention(Q, K, V, d_k):\n return softmax(Q @ K.T / d_k**0.5) @ V"
gpt4_code_tokens = gpt4_enc.encode(code)
print(f"GPT-4 code: {len(gpt4_code_tokens)} tokens") # ~29 tokens
The math problem: LLMs trained with standard BPE tokenizers are known to underperform on arithmetic. Why?
"1234 + 5678" might tokenize as:
["12", "34", " +", " 56", "78"]
The number 1234 is split into two tokens - 12 and 34.
The model must learn that "12" followed by "34" represents 1234.
This is harder than it sounds.
This is why models that tokenize each digit separately tend to perform better at arithmetic. Several model families, including Llama 2, DeepSeek LLM, and Qwen, split numbers into individual digits for this reason.
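Digit-level splitting is typically applied as a pre-tokenization step, before any BPE merges run. A minimal sketch of the idea with a regular expression follows. The function name `split_digits` and the exact pattern are assumptions for illustration, not the rule used by any particular model.

```python
import re

def split_digits(text: str) -> list[str]:
    """Pre-tokenize so that every digit becomes its own piece."""
    # \d matches one digit at a time; \D+ keeps runs of non-digits together.
    return re.findall(r"\d|\D+", text)

print(split_digits("1234 + 5678"))
# ['1', '2', '3', '4', ' + ', '5', '6', '7', '8']
```

With this split, "1234" always appears as the same four symbols, so the model never has to learn that "12" followed by "34" and "123" followed by "4" are the same number.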
The non-English problem: English dominates most training corpora. The tokenizer is optimized for English. Other languages are often tokenized less efficiently:
# Tokens needed for the same meaning:
"Hello" → 1 token (English)
"Привет" → 3 tokens (Russian)
"مرحبا" → 4 tokens (Arabic)
"こんにちは" → 5 tokens (Japanese)
Non-English text uses proportionally more tokens, so it costs more and consumes more of the context window per semantic unit. A tokenizer trained on little data from a language also learns fewer merges for it, so that language's words are broken into many small fragments instead of being kept whole. This is why multilingual models require special attention to tokenization strategy.
Part 2: Building Your Tokenization Intuition
import tiktoken
def analyze_tokenization(text: str, model: str = "gpt-4") -> dict:
    """
    Analyze how a text is tokenized - useful for debugging and optimization.
    Returns token count, tokens, and efficiency metrics.
    """
    enc = tiktoken.encoding_for_model(model)
    tokens = enc.encode(text)
    # Decode each token id on its own to see the individual pieces
    decoded_tokens = [enc.decode([t]) for t in tokens]
    return {
        "token_count": len(tokens),
        "char_count": len(text),
        # Guard against empty input, which would otherwise divide by zero
        "chars_per_token": round(len(text) / len(tokens), 2) if tokens else 0.0,
        "tokens": decoded_tokens,
        "token_ids": tokens
    }
# Experiment 1: Natural language
result = analyze_tokenization("The quick brown fox jumps over the lazy dog.")
print(f"Chars per token: {result['chars_per_token']}") # ~4.4 (normal English)
print(f"Tokens: {result['tokens']}")
# Experiment 2: Code (efficiency varies with how common the identifiers and punctuation are)
code = "for i in range(len(items)):\n    result.append(process(items[i]))"
result = analyze_tokenization(code)
print(f"Code chars per token: {result['chars_per_token']}") # ~4.57
# Experiment 3: Why repeated whitespace is costly
padded = "   lots    of    extra    spaces    between    words   "
result = analyze_tokenization(padded)
print(f"Whitespace-heavy chars per token: {result['chars_per_token']}") # Also compare result['token_count'] with the single-spaced sentence: the padding adds tokens but no meaning
# Production insight: Normalize your text before sending to an API
# Remove unnecessary whitespace, normalize Unicode, strip formatting
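The production insight above can be turned into a small pre-processing helper. This is one possible sketch, assuming prose input. The function name `normalize_for_llm` and the exact rules (NFKC normalization, single spaces, at most one blank line) are choices made here, and it should not be applied to whitespace-sensitive content such as Python source.

```python
import re
import unicodedata

def normalize_for_llm(text: str) -> str:
    """Normalize Unicode and collapse redundant whitespace in prose."""
    # NFKC folds compatibility characters, e.g. non-breaking space -> space.
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)    # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)  # keep at most one blank line
    return text.strip()

raw = "  lots   of\u00a0extra \t spaces \n\n\n\nbetween  words  "
print(repr(normalize_for_llm(raw)))
# 'lots of extra spaces\n\nbetween words'
```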
Tokenization Anti-Patterns in Production
# ❌ Anti-pattern: Sending raw HTML to an LLM
html_content = """
<html><head><title>Product Page</title></head>
<body><div class="container"><h1>Amazing Widget</h1>
<p class="description">This widget does amazing things...</p>
</div></body></html>
"""
# HTML tags consume many tokens without adding semantic value
# ✅ Better: Strip HTML before sending
from bs4 import BeautifulSoup
def clean_for_llm(html: str) -> str:
    soup = BeautifulSoup(html, 'html.parser')
    # Remove script/style elements
    for script in soup(["script", "style"]):
        script.decompose()
    # Get clean text
    text = soup.get_text(separator='\n', strip=True)
    # Remove excessive whitespace
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return '\n'.join(lines)
clean_text = clean_for_llm(html_content)
# Reduces token count by 40-60% for typical web pages
Part 3: Embeddings — Meaning as Geometry
What Is an Embedding?
An embedding is a dense vector of floating-point numbers that represents a piece of text in a high-dimensional space. The key property: semantic similarity corresponds to geometric proximity.
embedding("king") ≈ [0.82, -0.31, 0.45, ..., 0.12] # 1536 dimensions
embedding("queen") ≈ [0.79, -0.28, 0.49, ..., 0.15] # Close in space!
embedding("dog") ≈ [0.12, 0.73, -0.22, ..., 0.88] # Far away
Texts that mean similar things cluster together. This geometric property makes embeddings the foundation of semantic search, clustering, and anomaly detection.
The Famous Analogy Arithmetic
Word2Vec (2013) demonstrated the remarkable linearity of embedding spaces:
v("king") - v("man") + v("woman") ≈ v("queen")
v("Paris") - v("France") + v("Italy") ≈ v("Rome")
v("walked") - v("walking") + v("swimming") ≈ v("swam")
This works because the model learned to encode semantic relationships as geometric transformations. “Royalty offset” is a direction in embedding space. “Country-to-capital” is another direction.
Modern embedding models (much larger than Word2Vec) encode far more complex semantic relationships than simple analogies.
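The arithmetic is easier to trust once you have seen it on vectors small enough to read. The sketch below uses hand-made three-dimensional toy vectors, not real Word2Vec output. The axis meanings in the comment and the helper names are assumptions made for illustration.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Hand-made toy vectors: axis 0 ~ "royalty", axis 1 ~ "gender", axis 2 ~ "fruit".
toy = {
    "king":  [0.9,  0.8, 0.0],
    "queen": [0.9, -0.8, 0.0],
    "man":   [0.1,  0.8, 0.0],
    "woman": [0.1, -0.8, 0.0],
    "apple": [0.0,  0.0, 1.0],
}

def analogy(a, b, c, vectors):
    """Return the word closest to v(a) - v(b) + v(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy("king", "man", "woman", toy))  # queen
```

Subtracting "man" and adding "woman" flips only the gender axis and leaves the royalty axis untouched, which is exactly the "direction in embedding space" idea. Excluding the three input words from the candidates is the usual convention for this kind of analogy test.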
Cosine Similarity: The Distance Metric That Works
For embedding vectors, cosine similarity is the standard similarity metric. Unlike Euclidean distance, it’s invariant to vector magnitude — which matters because embedding norms can vary.
cosine_similarity(A, B) = (A · B) / (||A|| × ||B||)
Range: -1 (opposite) to 1 (identical)
import numpy as np
from openai import OpenAI
client = OpenAI()
def get_embedding(text: str, model: str = "text-embedding-3-small") -> list[float]:
    """Get embedding vector for a text string."""
    response = client.embeddings.create(
        input=text,
        model=model
    )
    return response.data[0].embedding
def cosine_similarity(v1: list[float], v2: list[float]) -> float:
    """Compute cosine similarity between two embedding vectors."""
    a = np.array(v1)
    b = np.array(v2)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
# Demonstrate semantic similarity
texts = [
"The cat sat on the mat",
"A feline rested on a rug", # Semantically similar
"Python is a programming language", # Semantically different
]
embeddings = [get_embedding(t) for t in texts]
sim_01 = cosine_similarity(embeddings[0], embeddings[1])
sim_02 = cosine_similarity(embeddings[0], embeddings[2])
print(f"Similarity (cat/mat vs feline/rug): {sim_01:.3f}") # High: ~0.63
print(f"Similarity (cat/mat vs Python): {sim_02:.3f}") # Low: ~0.12
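The claim that cosine similarity ignores magnitude can be checked without an API call. The following sketch uses three tiny hand-written vectors and pure-Python helpers. The names `cosine_sim` and `euclidean` are introduced here only for the demonstration.

```python
import math

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]     # same direction as a, twice the length
c = [-1.0, -2.0, -3.0]  # opposite direction

print(round(cosine_sim(a, b), 6))  # 1.0
print(round(cosine_sim(a, c), 6))  # -1.0
print(round(euclidean(a, b), 3))   # 3.742
```

Vectors `a` and `b` point the same way, so cosine similarity treats them as identical even though their Euclidean distance is well above zero.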
Part 4: Embedding Models in 2026
The Major Embedding Models
Matryoshka Embeddings
OpenAI’s text-embedding-3 models use Matryoshka Representation Learning (MRL): the most important information is packed into the leading dimensions, so a prefix of the 3072-dimensional vector (for example, the first 1024 values) retains most of the quality of the full vector.
# You can use smaller slices for cost/latency tradeoffs
full_embedding = get_embedding(text, model="text-embedding-3-large")
# full_embedding has 3072 dimensions
# Truncate to 1024 dimensions - nearly same quality, 3x less storage
truncated = full_embedding[:1024]
# Truncate to 256 dimensions - larger quality loss, 12x less storage
tiny = full_embedding[:256]
# Truncated vectors are no longer unit length: re-normalize them
# before comparing with a dot product
This allows you to tune the embedding size for your specific quality/cost tradeoff without re-embedding your entire corpus.
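One detail is easy to miss when slicing by hand: a prefix of a unit-length vector is generally no longer unit length, so it should be re-normalized before dot-product comparisons. A minimal pure-Python sketch follows, using a toy vector in place of a real embedding. The helper name `truncate_embedding` is an assumption. The OpenAI embeddings endpoint also accepts a `dimensions` parameter that returns shortened vectors directly.

```python
import math

def truncate_embedding(vec, dims):
    """Keep the first `dims` values and rescale the result to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        return head  # all-zero prefix: nothing to rescale
    return [x / norm for x in head]

full = [3.0, 4.0, 12.0, 0.0]  # toy vector standing in for a real embedding
short = truncate_embedding(full, 2)
print(short)  # [0.6, 0.8]
```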
Late Chunking and ColBERT
Late Chunking (2024): Rather than chunking documents first and embedding each chunk independently, late chunking embeds the entire document first, then derives chunk-level embeddings that incorporate full document context. This produces better representations for chunks that contain pronouns or references resolved by surrounding context.
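The pooling step of late chunking can be sketched as follows. The code assumes that a long-context encoder has already produced one contextualized vector per token for the whole document. The tiny `token_vectors` list is a hand-written stand-in for that output, and the function names are invented for this illustration.

```python
def mean_pool(vectors):
    """Average a list of equal-length vectors component-wise."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def late_chunk(token_vectors, spans):
    """Pool contextualized token vectors into one vector per (start, end) span."""
    return [mean_pool(token_vectors[start:end]) for start, end in spans]

# Toy stand-in for the per-token output of an encoder run over the
# WHOLE document (6 tokens, 2 dimensions each).
token_vectors = [
    [1.0, 0.0], [3.0, 0.0], [2.0, 3.0],
    [0.0, 4.0], [0.0, 2.0], [0.0, 0.0],
]
chunk_spans = [(0, 3), (3, 6)]  # token index ranges, end exclusive

print(late_chunk(token_vectors, chunk_spans))  # [[2.0, 1.0], [0.0, 2.0]]
```

The difference from ordinary chunking lies entirely in where the token vectors come from: each one was computed with the full document in view, so a chunk's pooled vector can reflect context from outside its own span.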
ColBERT (Contextualized Late Interaction over BERT): Instead of a single document embedding, ColBERT stores a vector per token. At query time, it takes each query token, finds its maximum similarity to any document token, and sums those maxima into the document score. This “multi-vector” approach captures fine-grained matching.
Single-vector retrieval:
query_embedding → find nearest doc_embedding → return document
ColBERT (multi-vector) retrieval:
[q1, q2, q3...] → MaxSim([d1, d2, d3...]) → richer matching
ColBERT is more expensive (more storage, more compute) but more accurate — particularly for technical queries with specific terminology.
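The MaxSim scoring rule is short enough to write out. The sketch below uses hand-made two-dimensional vectors as stand-ins for per-token embeddings. Real ColBERT vectors come from a trained encoder, and the function names here are chosen only for this example.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def maxsim_score(query_vecs, doc_vecs):
    """Sum, over query tokens, of the best dot product with any document token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

query = [[1.0, 0.0], [0.0, 1.0]]              # two query tokens
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # matches both query tokens
doc_b = [[1.0, 0.0], [1.0, 0.0]]              # matches only the first

print(maxsim_score(query, doc_a))  # 2.0
print(maxsim_score(query, doc_b))  # 1.0
```

Each query token gets to pick its own best match, so a document is rewarded for covering every part of the query rather than for being similar on average.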
Part 5: Building a Production Embedding Pipeline
Architecture Overview
   Documents                    Query
       │                          │
       ▼                          ▼
[Text Extraction]        [Query Preprocessing]
   [Chunking]                     │
       │                  [Query Embedding]
       ▼                          │
[Embedding Model]                 │
       │                          ▼
       ▼                  [Vector Database]
[Vector Storage] ←────────  [ANN Search]
                                  │
                           [Top-K Results]
                                  │
                         [Reranker] (optional)
                                  │
                           [Final Results]
Production Embedding Code
import asyncio
from typing import List
from openai import AsyncOpenAI
import numpy as np
client = AsyncOpenAI()
async def embed_batch(
    texts: List[str],
    model: str = "text-embedding-3-small",
    batch_size: int = 100
) -> List[List[float]]:
    """
    Embed a list of texts efficiently using batching.
    Key production considerations:
    - Batch to minimize API calls (up to 2048 inputs per call for OpenAI)
    - Handle rate limits with exponential backoff (not implemented in this minimal version)
    - Normalize embeddings for consistent cosine similarity (see normalize_embeddings)
    """
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = await client.embeddings.create(
            input=batch,
            model=model
        )
        # Extract embeddings in correct order
        batch_embeddings = [item.embedding for item in sorted(
            response.data, key=lambda x: x.index
        )]
        all_embeddings.extend(batch_embeddings)
    return all_embeddings
def normalize_embeddings(embeddings: List[List[float]]) -> np.ndarray:
    """
    L2-normalize embeddings for cosine similarity via dot product.
    After normalization, dot product == cosine similarity,
    which is faster and supported by most vector databases.
    """
    arr = np.array(embeddings)
    norms = np.linalg.norm(arr, axis=1, keepdims=True)
    return arr / norms
async def main():
    documents = [
        "Machine learning is a subset of artificial intelligence",
        "Deep learning uses neural networks with multiple layers",
        "Natural language processing enables machines to understand text",
        "Computer vision allows machines to interpret visual information",
    ]
    embeddings = await embed_batch(documents)
    normalized = normalize_embeddings(embeddings)
    print(f"Embedded {len(documents)} documents")
    print(f"Embedding dimension: {len(embeddings[0])}")
    print(f"Shape after normalization: {normalized.shape}")
    # Compute similarity matrix
    similarity_matrix = normalized @ normalized.T
    print("\nSimilarity matrix:")
    for i, doc in enumerate(documents):
        for j, other in enumerate(documents):
            if i < j:
                print(f" [{i}] vs [{j}]: {similarity_matrix[i,j]:.3f}")
asyncio.run(main())
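The docstring of `embed_batch` lists exponential backoff as a production consideration, but the loop itself does not retry. One way to add it is a small wrapper like the sketch below. The name `with_backoff`, its parameters, and the simulated `flaky_call` are all assumptions made for illustration. In real use you would narrow `retry_on` to the rate-limit exception raised by your client library.

```python
import asyncio
import random

async def with_backoff(make_call, max_retries=5, base_delay=1.0,
                       max_delay=30.0, retry_on=(Exception,)):
    """Await make_call(), retrying with exponentially growing, jittered delays."""
    for attempt in range(max_retries + 1):
        try:
            return await make_call()
        except retry_on:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads out retries from concurrent workers.
            await asyncio.sleep(delay * random.uniform(0.5, 1.0))

# Demo with a hypothetical flaky call that fails twice, then succeeds.
attempts = {"count": 0}

async def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"

print(asyncio.run(with_backoff(flaky_call, base_delay=0.01)))  # ok
```

Inside a batching loop, the API call would be wrapped as `await with_backoff(lambda: client.embeddings.create(input=batch, model=model))`, so that each retry issues a fresh request.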
Embedding Cost Optimization
import hashlib
import json
from typing import List
import redis
from openai import AsyncOpenAI
# Production pattern: cache embeddings for repeated text
class EmbeddingCache:
    """
    Cache embedding results to avoid redundant API calls.
    Appropriate for:
    - Frequently queried texts (product names, common queries)
    - Static documents (loaded once, queried many times)
    Not appropriate for:
    - User-generated content (too many unique inputs)
    - Time-sensitive information
    """
    def __init__(self, redis_client: redis.Redis, ttl: int = 86400):
        self.redis = redis_client
        self.ttl = ttl  # seconds; 86400 = 24 hours
        self.client = AsyncOpenAI()
    def _cache_key(self, text: str, model: str) -> str:
        # Include the model name so vectors from different models never collide
        content = f"{model}:{text}"
        return f"emb:{hashlib.sha256(content.encode()).hexdigest()}"
    async def get_embedding(
        self,
        text: str,
        model: str = "text-embedding-3-small"
    ) -> List[float]:
        key = self._cache_key(text, model)
        # Check cache first
        # Note: redis.Redis is a synchronous client, so these calls block the event loop
        cached = self.redis.get(key)
        if cached:
            return json.loads(cached)
        # Cache miss: compute embedding
        response = await self.client.embeddings.create(
            input=text,
            model=model
        )
        embedding = response.data[0].embedding
        # Store in cache
        self.redis.setex(key, self.ttl, json.dumps(embedding))
        return embedding
Part 6: Practical Applications of Embeddings
Application 1: Semantic Search
The canonical use case. Instead of keyword matching (“does this document contain the word ‘refund’?”), semantic search finds documents that mean what the user is asking about.
import numpy as np
from typing import List, Tuple
from sentence_transformers import SentenceTransformer
# -----------------------------------
# Load local embedding model
# -----------------------------------
model = SentenceTransformer(
"all-MiniLM-L6-v2"
)
# -----------------------------------
# Generate embedding
# -----------------------------------
def get_embedding(text: str) -> np.ndarray:
    """
    Generate embedding locally.
    """
    embedding = model.encode(
        text,
        convert_to_numpy=True,
        normalize_embeddings=True
    )
    return embedding
# -----------------------------------
# Build document embeddings
# -----------------------------------
def build_embeddings(
    documents: List[str]
) -> np.ndarray:
    """
    Generate embeddings for all documents.
    """
    embeddings = model.encode(
        documents,
        convert_to_numpy=True,
        normalize_embeddings=True
    )
    return embeddings.astype(np.float32)
# -----------------------------------
# Semantic Search
# -----------------------------------
def semantic_search(
    query: str,
    documents: List[str],
    document_embeddings: np.ndarray,
    top_k: int = 5
) -> List[Tuple[str, float]]:
    """
    Find top-k semantically similar documents.
    """
    # Query embedding
    query_vec = get_embedding(query)
    # Cosine similarity (a plain dot product, because all vectors are normalized)
    similarities = document_embeddings @ query_vec
    # Clamp top_k so it never exceeds the number of documents
    top_k = min(top_k, len(documents))
    # Fast top-k selection (the selected indices are not yet sorted)
    top_indices = np.argpartition(
        similarities,
        -top_k
    )[-top_k:]
    # Sort descending
    top_indices = top_indices[
        np.argsort(
            similarities[top_indices]
        )[::-1]
    ]
    return [
        (
            documents[i],
            float(similarities[i])
        )
        for i in top_indices
    ]
# -----------------------------------
# Example Knowledge Base
# -----------------------------------
knowledge_base = [
"You can request a refund within 30 days of purchase.",
"Reset your password using the forgot password link.",
"Shipping usually takes 3-5 business days.",
"Contact support for billing issues.",
"Refunds are processed within 7 working days."
]
# -----------------------------------
# Build embeddings once
# -----------------------------------
print("Building embeddings...")
kb_embeddings = build_embeddings(
knowledge_base
)
print("Embeddings shape:", kb_embeddings.shape)
# -----------------------------------
# Search
# -----------------------------------
results = semantic_search(
query="How do I get a refund?",
documents=knowledge_base,
document_embeddings=kb_embeddings,
top_k=3
)
# -----------------------------------
# Print results
# -----------------------------------
print("\n=== Search Results ===\n")
for doc, score in results:
    print(f"Score: {score:.3f}")
    print(doc)
    print("-" * 50)
Application 2: Text Clustering
Group similar documents without labels — useful for topic discovery, content organization, and deduplication.
import numpy as np
from typing import List
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer
# -----------------------------------
# Load local embedding model
# -----------------------------------
model = SentenceTransformer(
"all-MiniLM-L6-v2"
)
# -----------------------------------
# Generate embeddings locally
# -----------------------------------
def embed_batch(
    texts: List[str]
) -> np.ndarray:
    embeddings = model.encode(
        texts,
        convert_to_numpy=True,
        normalize_embeddings=True
    )
    return embeddings
# -----------------------------------
# Semantic search
# -----------------------------------
def semantic_search(
    query: str,
    documents: List[str],
    document_embeddings: np.ndarray,
    top_k: int = 5
):
    query_embedding = model.encode(
        query,
        convert_to_numpy=True,
        normalize_embeddings=True
    )
    similarities = (
        document_embeddings @ query_embedding
    )
    top_indices = np.argsort(
        similarities
    )[::-1][:top_k]
    return [
        (
            documents[i],
            float(similarities[i])
        )
        for i in top_indices
    ]
# -----------------------------------
# KMeans clustering
# -----------------------------------
def cluster_documents(
    embeddings: np.ndarray,
    n_clusters: int = 3
):
    kmeans = KMeans(
        n_clusters=n_clusters,
        random_state=42,
        n_init=10
    )
    return kmeans.fit_predict(embeddings)
# -----------------------------------
# Example data
# -----------------------------------
documents = [
"Refund process took too long.",
"Customer support was very helpful.",
"Delivery arrived late.",
"The app crashes frequently.",
"Excellent product quality.",
"Refund was processed quickly.",
"Shipping was delayed again.",
"Support team solved my issue fast.",
"Application performance is slow.",
"Very satisfied with the purchase."
]
# -----------------------------------
# Build embeddings
# -----------------------------------
embeddings = embed_batch(documents)
# -----------------------------------
# Semantic Search
# -----------------------------------
results = semantic_search(
query="How do refunds work?",
documents=documents,
document_embeddings=embeddings,
top_k=3
)
print("\n=== Semantic Search ===")
for doc, score in results:
    print(f"\nScore: {score:.3f}")
    print(doc)
# -----------------------------------
# Clustering
# -----------------------------------
clusters = cluster_documents(
embeddings,
n_clusters=3
)
print("\n=== Clusters ===")
for cluster_id in range(3):
    cluster_docs = [
        documents[i]
        for i, c in enumerate(clusters)
        if c == cluster_id
    ]
    print(f"\nCluster {cluster_id}")
    for doc in cluster_docs:
        print(f" - {doc}")
Application 3: Anomaly Detection
Detect unusual or out-of-distribution inputs — useful for flagging suspicious user queries, detecting off-topic content, or finding data quality issues.
import numpy as np
from typing import List
from sentence_transformers import SentenceTransformer
# -----------------------------------
# Load local embedding model
# -----------------------------------
model = SentenceTransformer(
"all-MiniLM-L6-v2"
)
# -----------------------------------
# Generate embedding
# -----------------------------------
def get_embedding(text: str) -> np.ndarray:
    """
    Generate normalized embedding locally.
    """
    embedding = model.encode(
        text,
        convert_to_numpy=True,
        normalize_embeddings=True
    )
    return embedding.astype(np.float32)
# -----------------------------------
# Build embeddings
# -----------------------------------
def build_embeddings(
    documents: List[str]
) -> np.ndarray:
    """
    Build normalized embeddings.
    """
    embeddings = model.encode(
        documents,
        convert_to_numpy=True,
        normalize_embeddings=True
    )
    return embeddings.astype(np.float32)
# -----------------------------------
# Detect anomalies
# -----------------------------------
def detect_anomalies(
    new_text: str,
    reference_embeddings: np.ndarray,
    threshold: float = 0.35
) -> bool:
    """
    Detect if text is semantically anomalous.
    Returns:
    True -> anomalous
    False -> normal
    """
    # Generate embedding
    new_embedding = get_embedding(
        new_text
    )
    # Cosine similarities
    similarities = (
        reference_embeddings @ new_embedding
    )
    # Best semantic match
    max_similarity = float(
        np.max(similarities)
    )
    print(
        f"Max similarity: {max_similarity:.3f}"
    )
    return max_similarity < threshold
# -----------------------------------
# Example support queries
# -----------------------------------
support_queries = [
"How do I reset my password?",
"Refund status is not updated.",
"Application crashes during login.",
"Payment failed but amount deducted.",
"Unable to access my account.",
"How do I contact support?",
"Order tracking page not working.",
"Billing address cannot be updated."
]
# -----------------------------------
# Build reference embeddings
# -----------------------------------
print("Building support query embeddings...")
support_query_embeddings = build_embeddings(
support_queries
)
print(
"Embeddings shape:",
support_query_embeddings.shape
)
# -----------------------------------
# Normal query
# -----------------------------------
normal_query = (
"My payment is failing during checkout."
)
is_anomalous = detect_anomalies(
new_text=normal_query,
reference_embeddings=support_query_embeddings,
threshold=0.35
)
print("\nQuery:", normal_query)
print("Anomalous:", is_anomalous)
# -----------------------------------
# Off-topic query
# -----------------------------------
offtopic_query = (
"What's the best recipe for chocolate cake?"
)
is_anomalous = detect_anomalies(
new_text=offtopic_query,
reference_embeddings=support_query_embeddings,
threshold=0.35
)
print("\nQuery:", offtopic_query)
print("Anomalous:", is_anomalous)
🔍 Common Mistakes to Avoid
Mistake 1: Using the Wrong Embedding Model for the Task
Different models are optimized for different tasks. Rerankers like BGE-Reranker score a query-document pair directly; they are built for reranking and do not produce embeddings you can index. Models like E5 are optimized for retrieval. Some are specialized for code, others for legal text. Don’t use a general-purpose embedding model for a specialized domain without testing domain-specific alternatives.
Mistake 2: Not Normalizing Embeddings
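Dot product equals cosine similarity only for unit-length vectors. Many vector databases (Pinecone, Qdrant) let you choose the metric per index; if the index uses dot product and you mix un-normalized stored embeddings with normalized query vectors (or vice versa), you’ll get wrong rankings silently. Normalize before storing, and apply the same normalization at query time.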
Mistake 3: Embedding Entire Documents
Embedding a 10-page document produces a single vector that averages over all the content. Specific details get diluted. Always chunk documents before embedding — but chunk thoughtfully (more on this in the RAG phase).
Mistake 4: Ignoring Context Length Limits
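OpenAI’s text-embedding-3 models accept at most 8191 tokens per input. What happens beyond the limit depends on the stack: a hosted API may reject the request with an error, while many local libraries (sentence-transformers, for example) silently truncate to the model’s maximum sequence length. Build token-count validation into your embedding pipeline so that neither happens by surprise.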
import tiktoken
def safe_embed(text: str, max_tokens: int = 8000) -> list[float]:
    """Embed text with automatic truncation if needed."""
    enc = tiktoken.encoding_for_model("text-embedding-3-small")
    tokens = enc.encode(text)
    if len(tokens) > max_tokens:
        # Record the original length before truncating, so the warning is accurate
        original_count = len(tokens)
        tokens = tokens[:max_tokens]
        text = enc.decode(tokens)
        print(f"Warning: Text truncated from {original_count} to {max_tokens} tokens")
    return get_embedding(text)
Mistake 5: Storing Embeddings in a Regular Database
Storing 1536-dimensional float arrays in Postgres without pgvector or in a regular column is a common mistake. You’ll get no indexing benefit and full-table scans for every query. Use a vector database or at minimum pgvector with an IVFFlat or HNSW index.
💼 Quick Questions
Q1: What is the difference between a token embedding and a sentence/document embedding?
Answer: Token embeddings represent individual tokens (words or subwords) — context-free representations of vocabulary items. Sentence/document embeddings represent an entire input sequence as a single vector — capturing the overall meaning. Sentence embeddings are typically produced by a model’s final layer (using mean pooling or CLS token pooling) or by a dedicated embedding model trained for this purpose.
Q2: Why does cosine similarity work better than Euclidean distance for embeddings?
Answer: Embedding vectors can have different magnitudes (L2 norms) even for semantically similar texts — variations in length, formatting, or encoding can affect magnitude. Cosine similarity measures the angle between vectors, ignoring magnitude. This makes it robust to these variations. After L2 normalization, cosine similarity becomes equivalent to dot product, which is faster to compute.
Q3: What is Matryoshka Representation Learning (MRL) and why is it useful?
Answer: MRL trains an embedding model so that truncated versions of the embedding vector remain meaningful. The first 256 dimensions of a 3072-dim MRL embedding are approximately as useful as 256 dimensions from a model trained specifically at 256 dimensions. This allows you to trade off storage/compute cost against quality by using a shorter embedding vector — without retraining or re-embedding.
Q4: Why do LLMs perform worse on non-English languages from a tokenization perspective?
Answer: Most tokenizers are trained on English-dominant corpora and optimized for English. Non-English text is tokenized less efficiently — requiring more tokens per semantic unit (e.g., 3–5 Japanese tokens vs. 1 English token for equivalent meaning). This means: (1) more tokens = more cost, (2) more of the context window consumed per equivalent meaning, and (3) the model has seen less non-English text per effective unit, so language modeling quality is lower.
Q5: What is ColBERT and how does it differ from single-vector retrieval?
Answer: Single-vector retrieval represents each document as one embedding vector. Query matching computes similarity between two vectors. ColBERT (“multi-vector retrieval”) stores one vector per token in the document. Query matching uses “MaxSim” — for each query token, find the maximum similarity to any document token, then sum these. This captures fine-grained lexical and semantic matching that single vectors miss, at the cost of more storage (one vector per token vs. one per chunk) and more compute at query time.
🏭 Production Considerations
Embedding Freshness: Embeddings become stale when your embedding model is updated. OpenAI’s text-embedding-ada-002 and text-embedding-3 are incompatible — you can’t mix them in the same index. When upgrading embedding models, you must re-embed your entire corpus. Budget time and cost for this operation and plan for zero-downtime migration.
Dimensionality and Storage Cost: A 3072-dimensional float32 embedding takes 12 KB of storage (3072 values × 4 bytes). 1 million such vectors = about 12 GB, before any index overhead. At 1 billion vectors (enterprise scale) that becomes about 12 TB, and full-precision storage is impractical. Consider: (1) Matryoshka truncation, (2) int8 quantization of vectors (4× compression, ~1% quality loss), (3) product quantization for approximate storage.
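The storage arithmetic and the int8 option can be sketched in a few lines. This is a back-of-the-envelope illustration only. The function names are made up, the quantizer is a simple symmetric per-vector scheme chosen for brevity, and production systems typically rely on the quantization built into their vector database.

```python
def storage_bytes(num_vectors, dims, bytes_per_value=4):
    """Raw vector storage, ignoring index overhead and metadata."""
    return num_vectors * dims * bytes_per_value

def quantize_int8(vec):
    """Symmetric per-vector int8 quantization: returns (codes, scale)."""
    scale = max(abs(x) for x in vec) / 127 or 1.0  # avoid a zero scale
    return [round(x / scale) for x in vec], scale

def dequantize(codes, scale):
    return [c * scale for c in codes]

print(storage_bytes(1, 3072))                    # 12288 bytes per float32 vector
print(storage_bytes(1_000_000, 3072) / 1e9)      # 12.288 (GB, float32)
print(storage_bytes(1_000_000, 3072, 1) / 1e9)   # 3.072 (GB, int8)

vec = [0.5, -1.0, 0.25]
codes, scale = quantize_int8(vec)
restored = dequantize(codes, scale)
# The worst-case rounding error per value is about half a quantization step.
print(max(abs(a - b) for a, b in zip(vec, restored)))
```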
Batch Embedding for Offline Pipelines: For large document collections, use asynchronous batch embedding rather than sequential. OpenAI’s batch API provides 50% cost reduction for non-real-time embedding at the cost of up to 24-hour turnaround.
⚡ Performance & Scalability Insights
HNSW vs. IVFFlat: The two dominant ANN (Approximate Nearest Neighbor) indexing algorithms. HNSW (Hierarchical Navigable Small World) has roughly O(log n) query time with high recall, at the cost of a memory-hungry graph index. IVFFlat partitions the space into cells and searches only relevant cells. As a rough rule of thumb, HNSW is the better default below about 10M vectors; IVF-style indexes become attractive beyond that because they need less memory. Most vector databases support both.
GPU-Accelerated Similarity Search: For very large collections (100M+ vectors), GPU-based search (FAISS with GPU backend, Milvus) can provide 100× throughput improvement over CPU. At enterprise scale, this is often necessary for sub-second query latency.
Embedding Dimensionality vs. Search Speed: Higher-dimensional embeddings are more expressive but slower to search. 3072 dimensions is 2× slower to compare than 1536 dimensions. For applications where query latency matters more than marginal quality improvement, use smaller dimensions (1536 or even 512).
🔑 Key Takeaways
- Tokenization determines what a model can “see.” Efficient tokenization = more meaning per token = lower cost and better performance. Understanding tokenization helps you debug failures in math, code, and non-English text.
- Embeddings encode meaning as geometry. Semantic similarity = geometric proximity. This single insight underlies semantic search, clustering, anomaly detection, and RAG retrieval.
- Cosine similarity after L2 normalization = dot product. Normalize your embeddings before storage for efficient similarity computation in any vector database.
- Choose embedding models based on task, not just quality benchmarks. The best general embedding model is rarely the best model for your specific domain. Always evaluate on representative samples from your actual data.
- Embedding pipelines have failure modes. Truncation at context limits, stale embeddings after model updates, storing un-normalized vectors, using wrong dimensions — these are real bugs that silently degrade retrieval quality.
📚 Further Reading & Resources
- “Matryoshka Representation Learning” (Kusupati et al., 2022), the MRL paper
- BGE-M3 Technical Report — Best open-source embedding model analysis
- MTEB Leaderboard — The authoritative benchmark for embedding models across tasks
- “ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT” (Khattab and Zaharia, 2020), the multi-vector retrieval paper
- OpenAI Embeddings Guide — Practical production guidance
