2.5 Semantic Search: Advanced Retrieval Strategies

Introduction

The ability to accurately retrieve relevant information from a large corpus of data is crucial in the development of intelligent systems, such as chatbots and question-answering models. While semantic search offers a solid foundation for such tasks, it often encounters edge cases where its effectiveness diminishes. This chapter delves into advanced retrieval methods designed to overcome these limitations, thereby improving the precision and diversity of the retrieved information.

Semantic search, by relying solely on semantic similarity, may not always yield the most informative or diverse set of results. Advanced retrieval methods address this by incorporating mechanisms to ensure that the information retrieved is not only relevant but also varied and comprehensive. Such techniques are essential for handling complex queries that require nuanced responses.

Maximum Marginal Relevance (MMR)

MMR is a technique designed to balance relevance and diversity in the set of retrieved documents. It operates by selecting documents that are not only semantically close to the query but also diverse among themselves. This approach is particularly useful in scenarios where providing a broad spectrum of information is crucial for adequately answering a query.

The process involves initially fetching a larger set of documents based on semantic similarity. From this set, documents are then selected based on their relevance to the query and their diversity compared to already selected documents. This method ensures that the final set of documents provides a well-rounded perspective on the query topic.
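To make the selection process concrete, here is a minimal sketch of the greedy MMR loop. The similarity callable and the lambda_param weight are illustrative assumptions rather than a specific library's API:

# Greedy MMR selection sketch (illustrative; not a library API).
# `similarity(a, b)` is assumed to return a score in [0, 1].
def mmr_select(query, candidates, similarity, k=3, lambda_param=0.5):
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def mmr_score(doc):
            relevance = similarity(doc, query)
            # Redundancy is the highest similarity to anything already chosen
            redundancy = max((similarity(doc, s) for s in selected), default=0.0)
            return lambda_param * relevance - (1 - lambda_param) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected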

Self-Query Retrieval

Self-query retrieval is adept at handling queries that contain both semantic and metadata components. For example, a query asking for movies about aliens made in 1980 combines a semantic element ("movies about aliens") with a metadata filter ("made in 1980"). This method splits the query into these two components, using semantic search for the former and metadata filtering for the latter.
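Conceptually, the retriever decomposes a single natural-language query into a search string plus a structured filter. A hedged illustration of that decomposition (the dictionary keys below are hypothetical, not a specific library's schema):

# Hypothetical output of decomposing a self-query input
original_query = "movies about aliens made in 1980"
structured_query = {
    "semantic_query": "movies about aliens",  # handled by embedding-based search
    "metadata_filter": {"year": 1980},        # applied as an exact-match filter
}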

Contextual Compression

Contextual compression involves extracting the most relevant segments from retrieved documents. This technique is valuable when the entirety of a document is not necessary for answering a query, focusing instead on the most pertinent information.

This method typically requires additional processing, as each retrieved document must be analyzed to identify and extract the relevant portions. While this may increase computational costs, it significantly enhances the quality and specificity of the information provided in response to a query.
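As a minimal sketch of the idea, the loop below assumes a hypothetical compress_segment(document, query) helper that returns only the query-relevant portion of a document; in practice that step is typically an LLM call, as shown later in this chapter:

# Conceptual sketch of contextual compression.
# `compress_segment` is a hypothetical helper, e.g. an LLM extraction call.
def compress_documents(documents, query, compress_segment):
    compressed = []
    for document in documents:
        relevant_part = compress_segment(document, query)
        if relevant_part:  # skip documents with no query-relevant content
            compressed.append(relevant_part)
    return compressed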

Advanced Document Retrieval Techniques for Enhanced Semantic Search

The retrieval of relevant documents from a vast corpus is a critical step in a Retrieval-Augmented Generation (RAG) workflow, especially for applications like chatbots and question-answering systems. The remainder of this chapter puts the techniques introduced above into practice, addressing common edge cases and enhancing the diversity and specificity of retrieved results.

Setting Up the Environment

Before diving into the core functionalities, it is essential to set up our working environment. This involves loading necessary libraries and configuring access to external services, such as OpenAI's API for embeddings. Below is the step-by-step guide to accomplish this:

# Import necessary libraries
import os
import openai
import sys

# Append the root directory to sys.path to ensure relative imports work correctly
sys.path.append('../..')

# Load environment variables from a .env file for secure API key management
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())

# Set the OpenAI API key from environment variables
openai.api_key = os.environ['OPENAI_API_KEY']

# Ensure you have the necessary packages installed, including `lark` for parsing (if required)
# !pip install lark

Our objective is to create a vector database that can efficiently retrieve information based on semantic similarity. This involves embedding textual content into a high-dimensional vector space using OpenAI's embeddings. Here's how to initialize such a database:

# Import the Chroma vector store and OpenAI embeddings from the langchain library
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings

# Specify the directory where the vector database will persist its data
persist_directory = 'vector_db/chroma/'

# Initialize the embedding function using OpenAI's model
embedding_function = OpenAIEmbeddings()

# Create a Chroma vector database instance with the specified persistence directory and embedding function
vector_database = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding_function
)

# Print the current number of entries in the vector database to verify it's ready for use
print(vector_database._collection.count())

Populating the Vector Database

Next, we populate our vector database with a small set of textual data to demonstrate similarity search capabilities:

# Define a list of texts to populate the database
texts = [
    "The Death Cap mushroom has a notable large fruiting body, often found above ground.",
    "Among mushrooms, the Death Cap stands out for its large fruiting body, sometimes appearing in all-white.",
    "The Death Cap, known for its toxicity, is one of the most dangerous mushrooms.",
]

# Create a smaller vector database from the given texts for demonstration purposes
demo_vector_database = Chroma.from_texts(texts, embedding=embedding_function)

# Define a query to search within the vector database
query_text = "Discuss mushrooms characterized by their significant white fruiting bodies"

# Perform a similarity search for the query, retrieving the top 2 most relevant entries
similar_texts = demo_vector_database.similarity_search(query_text, k=2)
print("Similarity Search Results:", similar_texts)

# Perform a max marginal relevance search to find diverse yet relevant answers, fetching additional candidates for comparison
diverse_texts = demo_vector_database.max_marginal_relevance_search(query_text, k=2, fetch_k=3)
print("Diverse Search Results:", diverse_texts)

Advanced Retrieval Techniques

Addressing Diversity with Maximum Marginal Relevance (MMR)

One common challenge in retrieval systems is ensuring that the search results are not only relevant but also diverse. This prevents the dominance of repetitive information and provides a broader perspective on the query subject. The Maximum Marginal Relevance (MMR) algorithm addresses this by balancing relevance to the query with diversity among the results.
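For readers who prefer the objective in symbols, a standard formulation is

\[
\mathrm{MMR} = \arg\max_{d \in R \setminus S} \Big[ \lambda \, \mathrm{sim}(d, q) \;-\; (1 - \lambda) \max_{d' \in S} \mathrm{sim}(d, d') \Big]
\]

where q is the query, R is the candidate pool, S is the set of documents already selected, and λ ∈ [0, 1] weights relevance against redundancy.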

Practical Implementation of MMR

# Define a query that seeks information
query_for_information = "what insights are available on data analysis tools?"

# Perform a standard similarity search to find the top 3 relevant documents
top_similar_documents = vector_database.similarity_search(query_for_information, k=3)

# Display the beginning of the content from the top two documents for comparison
print(top_similar_documents[0].page_content[:100])
print(top_similar_documents[1].page_content[:100])

# Note the potential overlap in information. To introduce diversity, we apply MMR.
diverse_documents = vector_database.max_marginal_relevance_search(query_for_information, k=3)

# Display the beginning of the content from the top two diverse documents to observe the difference
print(diverse_documents[0].page_content[:100])
print(diverse_documents[1].page_content[:100])

This code snippet illustrates the contrast between standard similarity search results and those obtained using MMR. By employing MMR, we ensure that the retrieved documents are not only relevant but also provide varied perspectives on the query.

Enhancing Specificity Using Metadata

Vector databases often contain rich metadata that can be exploited to refine search queries further. Metadata provides additional context, allowing for more targeted searches that can filter results based on specific criteria.

Leveraging Metadata for Targeted Searches

# Define a query with a specific context in mind
specific_query = "what discussions were there about regression analysis in the third lecture?"

# Execute a similarity search with a metadata filter to target documents from a specific lecture
targeted_documents = vector_database.similarity_search(
    specific_query,
    k=3,
    filter={"source": "documents/cs229_lectures/MachineLearning-Lecture03.pdf"}
)

# Iterate through the results to display their metadata, highlighting the specificity of the search
for document in targeted_documents:
    print(document.metadata)

Using Metadata with Self-Query Retrievers

Metadata serves as contextual information that can significantly refine search results. When combined with the capabilities of a self-query retriever, it becomes possible to automatically extract both the query string and the relevant metadata filters from a single input query. This approach eliminates the need for manual metadata specification, making the search process both efficient and intuitive.

Initializing the Environment and Defining Metadata

Before we can execute a metadata-aware search, we need to set up our environment and define the metadata attributes we intend to use:

# Import necessary modules from the langchain library
from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

# Define metadata attributes with detailed descriptions
metadata_attributes = [
    AttributeInfo(
        name="source",
        description="Specifies the lecture document, limited to specific files within the `docs/cs229_lectures` directory.",
        type="string",
    ),
    AttributeInfo(
        name="page",
        description="The page number within the lecture document.",
        type="integer",
    ),
]

# Note: Transition to using OpenAI's gpt-3.5-turbo-instruct model due to deprecation of the previous default model.
document_content_description = "Detailed lecture notes"
language_model = OpenAI(model='gpt-3.5-turbo-instruct', temperature=0)

Configuring the Self-Query Retriever

The next step involves configuring the self-query retriever with our language model, vector database, and the defined metadata attributes:

# Initialize the self-query retriever with the language model, vector database, and metadata attributes
self_query_retriever = SelfQueryRetriever.from_llm(
    language_model,
    vector_database,
    document_content_description,
    metadata_attributes,
    verbose=True
)

Executing a Query with Automatic Metadata Inference

Now, let's perform a search that automatically infers relevant metadata from the query itself:

# Define a query that specifies the context within the question
specific_query = "what insights are provided on regression analysis in the third lecture?"

# Note: The first execution may trigger a deprecation warning for `predict_and_parse`, which can be ignored.
# Retrieve documents relevant to the specific query, leveraging inferred metadata for precision
relevant_documents = self_query_retriever.get_relevant_documents(specific_query)

# Display the metadata of retrieved documents to demonstrate the specificity of the search
for document in relevant_documents:
    print(document.metadata)

Implementing Contextual Compression

Contextual compression works by extracting segments of a document that are most relevant to a given query. This method not only reduces the computational load on LLMs but also enhances the quality of the responses by focusing on the most pertinent information.

Setting Up the Environment

Before diving into the specifics of contextual compression, ensure that your environment is properly configured with the necessary libraries:

# Import the necessary classes for contextual compression and document retrieval
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain.llms import OpenAI

Initializing the Compression Tools

The next step involves initializing the compression mechanism with a pre-trained language model, which will be used to identify and extract the relevant portions of documents:

# Initialize the language model with a specific configuration to ensure deterministic behavior
language_model = OpenAI(temperature=0, model="gpt-3.5-turbo-instruct")

# Create a compressor using the language model for extracting relevant text segments
document_compressor = LLMChainExtractor.from_llm(language_model)

Creating the Contextual Compression Retriever

With the compressor ready, we can now set up a retriever that integrates contextual compression into the retrieval process:

# Combine the document compressor with the existing vector database retriever
compression_retriever = ContextualCompressionRetriever(
    base_compressor=document_compressor,
    base_retriever=vector_database.as_retriever()
)

Retrieving Compressed Documents

Let's execute a query and observe how the contextual compression retriever returns a more focused set of documents:

# Define a query for which we seek relevant document segments
query_text = "what insights are offered on data analysis tools?"

# Retrieve documents relevant to the query, automatically compressed for relevance
compressed_documents = compression_retriever.get_relevant_documents(query_text)

# Helper to format and print the content of compressed documents
def pretty_print_documents(documents):
    formatted = [
        f"Document {index + 1}:\n\n{doc.page_content}"
        for index, doc in enumerate(documents)
    ]
    print(f"\n{'-' * 100}\n".join(formatted))

# Display the compressed documents
pretty_print_documents(compressed_documents)

Implementing Contextual Compression with MMR for Document Retrieval

Contextual compression aims to distill documents to their essence by focusing on segments most relevant to a query. When paired with the MMR strategy, it balances relevance with diversity in the retrieved documents, ensuring a broader perspective on the queried topic.

Setting Up the Compression-Based Retriever with MMR

# Initialize the contextual compression retriever with MMR for diverse and relevant document retrieval
compression_based_retriever = ContextualCompressionRetriever(
    base_compressor=document_compressor,
    base_retriever=vector_database.as_retriever(search_type="mmr")
)

# Define a query to test the combined approach
query_for_insights = "what insights are available on statistical analysis methods?"

# Retrieve compressed documents using the contextual compression retriever
compressed_documents = compression_based_retriever.get_relevant_documents(query_for_insights)

# Utilize a helper function to print the contents of the retrieved, compressed documents
pretty_print_documents(compressed_documents)

This approach optimizes document retrieval by ensuring that the results are not only relevant but also diverse, preventing redundancy and enhancing the user's understanding of the subject matter.

Exploring Alternative Document Retrieval Methods

Beyond the vector-based retrieval methods, the LangChain library supports a variety of other document retrieval strategies, such as TF-IDF and SVM. These methods offer different advantages based on the specific requirements of the application.
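To see what these methods compute under the hood, here is a minimal, self-contained sketch of TF-IDF retrieval using scikit-learn (an assumption for illustration; LangChain's TFIDFRetriever wraps a similar vectorizer):

# Standalone TF-IDF retrieval sketch using scikit-learn (illustrative only)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

chunks = [
    "Lecture one covers course logistics and MATLAB.",
    "Linear regression minimizes squared error on training data.",
    "Support vector machines find a maximum-margin separator.",
]

vectorizer = TfidfVectorizer()
chunk_matrix = vectorizer.fit_transform(chunks)  # one TF-IDF row per chunk
query_vector = vectorizer.transform(["what did they say about regression?"])

# Rank chunks by cosine similarity between the query and each chunk vector
scores = cosine_similarity(query_vector, chunk_matrix)[0]
print(chunks[scores.argmax()])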

Loading and Preparing Documents

Before implementing alternative retrieval strategies, it's crucial to prepare the documents by loading and splitting the text appropriately.

# Import the loader, splitter, and retrievers used in this section
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.retrievers import SVMRetriever, TFIDFRetriever

# Load a document using the PyPDFLoader
document_loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")
document_pages = document_loader.load()

# Concatenate all page texts into a single string for processing
complete_document_text = " ".join([page.page_content for page in document_pages])

# Split the complete document text into manageable chunks using a text splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150)
text_chunks = text_splitter.split_text(complete_document_text)

Implementing TF-IDF and SVM Retrievers

With the document text prepared, we can now utilize TF-IDF and SVM-based retrievers for document retrieval.

# Initialize a SVM-based retriever from the text chunks
svm_based_retriever = SVMRetriever.from_texts(text_chunks, embedding_function)

# Similarly, initialize a TF-IDF-based retriever from the same text chunks
tfidf_based_retriever = TFIDFRetriever.from_texts(text_chunks)

# Perform document retrieval using the SVM retriever for a specific query
query_on_major_topics = "What are major topics for this class?"
svm_retrieval_results = svm_based_retriever.get_relevant_documents(query_on_major_topics)

# Perform another retrieval using the TF-IDF retriever for a different query
query_on_specific_tool = "what did they say about statistical software?"
tfidf_retrieval_results = tfidf_based_retriever.get_relevant_documents(query_on_specific_tool)

# Print the first document from the retrieval results as an example
print(svm_retrieval_results[0])
print(tfidf_retrieval_results[0])

Best Practices

  1. Balanced Use of MMR: When utilizing Maximum Marginal Relevance (MMR), it's crucial to find a balance between relevance and diversity. This ensures that the retrieved documents provide a comprehensive view of the query topic without sacrificing pertinence (a brief tuning sketch follows this list).

  2. Effective Metadata Utilization: Metadata can significantly enhance the specificity of search results. Designing and implementing a well-thought-out metadata schema allows for more targeted searches, especially when combined with self-query retrieval techniques.

  3. Optimization of Contextual Compression: While contextual compression provides a focused subset of information, it requires additional processing. It's important to optimize this step to balance computational costs with the benefits of increased specificity and relevance.

  4. Strategic Document Preparation: For alternative retrieval methods like TF-IDF and SVM, the way documents are prepared and processed (e.g., text chunking) can greatly affect the outcome. Tailoring these processes to your specific use case can lead to more efficient and accurate retrievals.

  5. Model and Method Selection: The choice of language models and retrieval techniques should be informed by the nature of your data and the specific needs of your application. Regularly review and update these choices as newer models and methods become available.
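As an example of tuning the balance described in the first best practice above, LangChain's MMR search exposes a lambda-style weight. A brief sketch (the lambda_mult parameter follows LangChain's MMR API; 0.5 is an arbitrary starting point, not a recommendation):

# Tuning the relevance/diversity trade-off in MMR.
# lambda_mult near 1 favors relevance; near 0 favors diversity.
balanced_results = vector_database.max_marginal_relevance_search(
    "what insights are available on data analysis tools?",
    k=3,
    fetch_k=10,
    lambda_mult=0.5,
)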

Conclusion

This chapter has explored various advanced retrieval techniques designed to enhance the performance of semantic search systems. By addressing limitations related to diversity, specificity, and information relevance, these methods offer a pathway to more intelligent and efficient retrieval systems. Through the practical application of MMR, self-query retrieval, contextual compression, and alternative document retrieval methods, developers can build systems that not only understand the semantic content of queries but also provide rich, diverse, and targeted responses.

Adhering to best practices in the implementation of these techniques ensures that retrieval systems are both effective and efficient. As the field of NLP continues to evolve, staying informed about the latest advancements in retrieval technologies will be key to maintaining an edge in semantic search capabilities.

In summary, the integration of advanced retrieval techniques into semantic search systems represents a significant step forward in the development of intelligent information retrieval systems. By carefully selecting and optimizing these methods, developers can create solutions that significantly improve the user experience, delivering precise, diverse, and contextually relevant information in response to complex queries.

Theory questions:

  1. Describe the principle of Maximum Marginal Relevance (MMR) and its role in improving information retrieval.
  2. How does self-query retrieval address the challenge of queries that combine semantic and metadata components?
  3. Explain the concept of contextual compression in document retrieval and its significance.
  4. Detail the steps involved in setting up an environment for advanced retrieval techniques using OpenAI's API and the langchain library.
  5. How does the initialization of a vector database contribute to efficient semantic similarity search?
  6. Describe the process of populating and utilizing a vector database for similarity and diverse search purposes.
  7. In the context of advanced document retrieval, what are the advantages of using MMR to address diversity in search results?
  8. How can metadata be leveraged to enhance the specificity of search results in document retrieval systems?
  9. Discuss the advantages and implementation challenges of self-query retrievers in semantic search.
  10. Explain the role of contextual compression in reducing computational load and improving response quality in retrieval systems.
  11. What are the key best practices for implementing advanced retrieval techniques in semantic search systems?
  12. Compare and contrast the effectiveness of vector-based retrieval methods with alternative strategies like TF-IDF and SVM in document retrieval.
  13. How does the integration of advanced retrieval techniques improve the performance and user experience of semantic search systems?
  14. Discuss the potential impact of evolving NLP technologies on the future development of advanced retrieval techniques for semantic search.

Practice questions:

  1. Implement a Python class VectorDatabase with the following methods:
     - __init__(self, persist_directory: str): constructor that initializes the vector database with a persistence directory.
     - add_text(self, text: str): embeds the given text into a high-dimensional vector using OpenAI's embeddings and stores it in the database. Assume you have access to a function openai_embedding(text: str) -> List[float] that returns the embedding vector.
     - similarity_search(self, query: str, k: int) -> List[str]: performs a similarity search for the query, returning the top k most similar texts from the database. Use a placeholder similarity function for the implementation.

  2. Create a function compress_document that takes a list of strings (document) and a query string as input and returns a list of strings, where each string is a compressed segment of the document relevant to the query. Assume there's an external utility function compress_segment(segment: str, query: str) -> str that compresses a single document segment based on the query.

  3. Develop a function max_marginal_relevance that takes a list of document IDs, a query, and two parameters lambda_param and k, then returns a list of k document IDs selected based on Maximum Marginal Relevance (MMR). Assume you have a similarity function similarity(doc_id: str, query: str) -> float that measures the similarity between a document and the query, and a diversity function diversity(doc_id1: str, doc_id2: str) -> float that measures the diversity between two documents.

  4. Write a function initialize_vector_db that demonstrates how to populate a vector database with a list of predefined texts and then perform a similarity search and a diverse search. The function should print out the results of both searches. Use the VectorDatabase class you implemented in task 1 for the vector database.