2.4 The Power of Embeddings

Embeddings are numeric representations of text: words, sentences, and documents are mapped to vectors in a high‑dimensional space, where semantically similar texts end up geometrically close. These representations are learned from large corpora: the model associates a word with its contexts and captures semantic relations, so synonyms and terms that appear in similar contexts lie nearby. As a result, semantic search goes beyond exact "keyword" matching: compute an embedding for each document (or chunk) and for the user query, compare vector proximity via cosine similarity or another metric, and rank materials by semantic closeness, even when no words match exactly. This changes how text is analyzed, stored, and searched: results reflect meaning rather than surface wording, which makes retrieval and recommendations more precise.
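
For reference, cosine similarity is just the normalized dot product of two vectors; a minimal NumPy sketch, not tied to any particular embedding model:

import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||); 1.0 means same direction, ~0.0 unrelated
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))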

On top of embeddings sit vector stores: databases optimized for storing vectors and answering fast nearest‑neighbor queries. They use specialized indexes and algorithms to serve similarity queries over large datasets, and they fit both research and production. Choose based on data size (from lightweight embedded options for small sets to distributed systems at scale), persistence (durable disk storage versus a transient store for prototypes), and use case (lab versus production). For quick prototyping, Chroma is a common choice: a lightweight embedded store that can run in memory or persist to a local directory; for larger, long‑lived systems, use distributed or cloud vector databases. In a typical semantic‑search pipeline, documents are first split into meaningful chunks, then embeddings are computed and indexed; at query time, the query's embedding is computed, the nearest chunks are retrieved, and those chunks plus the query are fed to an LLM to generate a coherent answer.

Before diving into embeddings and vector DBs, prepare the environment: imports, API keys, and basic config.

import sys

from dotenv import load_dotenv, find_dotenv
from openai import OpenAI

sys.path.append('../..')  # make shared project modules importable

load_dotenv(find_dotenv())  # reads OPENAI_API_KEY from a local .env file

client = OpenAI()  # picks the API key up from the environment

Next, load documents and split them into semantically meaningful fragments; this makes the data easier to manage and prepares it for embedding creation. For demonstration we use a series of PDFs with some deliberate "noise": one file is loaded twice to simulate duplicates:

from langchain_community.document_loaders import PyPDFLoader

pdf_document_loaders = [
    PyPDFLoader("docs/doc1.pdf"),
    PyPDFLoader("docs/doc1.pdf"),  # loaded twice on purpose: simulates duplicate "noise"
    PyPDFLoader("docs/doc2.pdf"),
    PyPDFLoader("docs/doc3.pdf"),
]

loaded_documents_content = []

# each loader yields one Document per PDF page
for document_loader in pdf_document_loaders:
    loaded_documents_content.extend(document_loader.load())

After loading, split documents into chunks to improve manageability and downstream efficiency:

from langchain_text_splitters import RecursiveCharacterTextSplitter

document_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,     # max characters per chunk
    chunk_overlap=150    # overlap preserves context across chunk boundaries
)
document_splits = document_splitter.split_documents(loaded_documents_content)
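
A quick sanity check on the counts (the exact numbers depend on your PDFs):

print(len(loaded_documents_content))  # number of loaded pages
print(len(document_splits))           # number of chunks after splitting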

Now compute embeddings: turning text into vectors that reflect semantic meaning. Before indexing the chunks, a small experiment with three sentences shows how similarity behaves:

from langchain_openai import OpenAIEmbeddings
import numpy as np

embedding_generator = OpenAIEmbeddings()

sentence_examples = ["I like dogs", "I like canines", "The weather is ugly outside"]
embeddings = [embedding_generator.embed_query(sentence) for sentence in sentence_examples]

# OpenAI embeddings are unit-length, so a plain dot product equals cosine similarity
similarity_dog_canine = np.dot(embeddings[0], embeddings[1])   # near-synonyms
similarity_dog_weather = np.dot(embeddings[0], embeddings[2])  # unrelated topics
print(similarity_dog_canine, similarity_dog_weather)
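
In practice the synonym pair ("dogs"/"canines") scores noticeably higher than the dog/weather pair; the exact values vary by embedding model, but that gap is precisely the signal similarity search exploits.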

Index the vectors in a vector store to enable fast similarity search. For demos, Chroma works well; here it is given a persistence directory so the index is written to disk rather than living only in memory:

from langchain_community.vectorstores import Chroma

persist_directory = 'docs/chroma/'

!rm -rf ./docs/chroma  # Jupyter shell command: remove any stale index from earlier runs

vector_database = Chroma.from_documents(
    documents=document_splits,        # the chunks produced above
    embedding=embedding_generator,    # embeds each chunk on insertion
    persist_directory=persist_directory
)
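
To confirm that indexing worked, the number of stored vectors can be checked; _collection is a private attribute of the LangChain wrapper, so treat this as a debugging convenience rather than a stable API:

print(vector_database._collection.count())  # should equal len(document_splits)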

Now perform a similarity search — this is where embeddings + vector DBs shine: quickly selecting the most relevant fragments for a query.

query = "Is there an email I can ask for help?"
retrieved_documents = vector_database.similarity_search(query, k=3)
print(retrieved_documents[0].page_content)
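
Each returned Document also carries metadata (for PDFs, the source path and page number), which is handy for showing citations alongside an answer:

print(retrieved_documents[0].metadata)  # e.g. {'source': 'docs/doc1.pdf', 'page': 4}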

Finally, consider edge cases and search quality improvements. Even a useful baseline runs into issues: duplicates and irrelevant documents are common problems that degrade results.

# Query that exposes a failure mode: the corpus contains a duplicated PDF
query_matlab = "What did they say about MATLAB?"

retrieved_documents_matlab = vector_database.similarity_search(query_matlab, k=5)

# Because doc1.pdf was indexed twice, the top results include near-identical chunks
print(retrieved_documents_matlab[0].page_content[:100])
print(retrieved_documents_matlab[1].page_content[:100])
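
One mitigation worth sketching: LangChain vector stores expose maximal marginal relevance (MMR) search, which re-ranks a candidate pool to balance relevance to the query against diversity among the results, so near-duplicates tend to be filtered out. A minimal sketch using the store built above:

# MMR: fetch a larger candidate pool, then pick k results that are
# both similar to the query and dissimilar to each other
retrieved_diverse = vector_database.max_marginal_relevance_search(
    query_matlab,
    k=3,        # number of chunks to return
    fetch_k=5   # size of the initial candidate pool
)
for document in retrieved_diverse:
    print(document.page_content[:100])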

MMR is one such strategy; deduplicating chunks before indexing and filtering on metadata are others, and together they help retrieve fragments that are both relevant and sufficiently diverse. Overall, embeddings and vector databases are a powerful pairing for semantic search over large corpora: solid text preparation, thoughtful indexing, and fast nearest‑neighbor querying enable systems that understand complex prompts, while analyzing failures and layering on such techniques further improves robustness and accuracy. For deeper study, see the OpenAI API docs on embedding generation and surveys of vector databases that compare technologies and usage scenarios.

Theory Questions

  1. What is the primary goal of turning text into embeddings?
  2. How do embeddings help measure semantic similarity of words and sentences?
  3. Describe how word embeddings are created and the role of context.
  4. How do embeddings improve semantic search over keyword‑based approaches?
  5. What roles do document and query embeddings play in semantic search?
  6. What is a vector store, and why is it important for efficient search?
  7. What criteria matter when choosing a vector database?
  8. Why is Chroma convenient for prototypes, and what are its limitations?
  9. Describe a semantic‑search pipeline using embeddings and a vector DB.
  10. How does document splitting improve search granularity and relevance?
  11. Why embed chunks, and how does that help retrieval?
  12. Why index the vector store for similarity search?
  13. How is a query processed, and which similarity metrics are used?
  14. How does answer generation improve UX in semantic‑search apps?
  15. What environment setup steps are needed?
  16. Give an example where loading and splitting text are critical to search quality.
  17. How do embeddings “transform” text, and how can you demonstrate vector similarity?
  18. What should you consider when configuring Chroma?
  19. How does similarity search find relevant fragments?
  20. What failures are typical in semantic search, and how can you address them?

Practical Tasks

  1. Implement generate_embeddings that returns a list of “embeddings” for strings (e.g., simulated by string length).
  2. Implement cosine_similarity to compute cosine similarity between two vectors.
  3. Create SimpleVectorStore with add_vector and find_most_similar (cosine‑based).
  4. Load text from a file, split into chunks of a given size (e.g., 500 characters), and print them.
  5. Implement query_processing: generate a query embedding (placeholder), find the nearest chunk in SimpleVectorStore, and print it.
  6. Implement remove_duplicates: return a list without duplicate chunks (exact match or by similarity threshold).
  7. Initialize SimpleVectorStore, add placeholder embeddings, run a semantic search, and print top‑3 results.
  8. Implement embed_and_store_documents: generate placeholder embeddings for chunks, store them in SimpleVectorStore, and return it.
  9. Implement vector_store_persistence: demonstrate saving/loading SimpleVectorStore (serialization/deserialization).
  10. Implement evaluate_search_accuracy: for queries and expected chunks, run search and compute match rate.