2.6 RAG Systems — Techniques for QA

Retrieval-Augmented Generation (RAG) combines retrieval and generation, changing how we work with large corpora to build accurate QA systems and chatbots. A critical stage is feeding the retrieved documents to the model along with the original query: the retrieved material must be synthesized into a coherent answer that blends its content with the query's context and leverages the model's capabilities. The overall flow is simple: the system accepts a question, retrieves relevant fragments from a vector store, and then feeds the retrieved content together with the question into an LLM to form an answer. By default you can place all retrieved parts into the context, but context-window limits often call for strategies such as MapReduce, Refine, or Map-Rerank, which aggregate or iteratively refine answers across many documents.

Before using an LLM for QA, ensure the environment is set up: imports, API keys, model versions, and so on.

import os
import datetime
from dotenv import load_dotenv

# Load environment variables; the OpenAI API key is expected as OPENAI_API_KEY in a .env file
load_dotenv()
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"

# Pin the LLM version (the current date can help decide when to move to a newer model snapshot)
current_date = datetime.datetime.now().date()
llm_name = "gpt-3.5-turbo"
print(f"Using LLM version: {llm_name} (as of {current_date})")

Next, retrieve documents relevant to the query from a vector database (VectorDB), where embeddings are stored.

# Import the vector store and embedding generator
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# Directory where the vector database persists its data
documents_storage_directory = 'docs/chroma/'

# Initialize the embedding generator using OpenAI embeddings
embeddings_generator = OpenAIEmbeddings()

# Initialize the vector database with the persistence directory and embedding function
vector_database = Chroma(persist_directory=documents_storage_directory, embedding_function=embeddings_generator)

# Show the current number of documents in the vector database
print(f"Documents in VectorDB: {vector_database._collection.count()}")

RetrievalQA combines retrieval and generation: the LLM answers based on retrieved documents. First, initialize the language model,

from langchain_openai import ChatOpenAI

# Initialize the chat model with the selected LLM
language_model = ChatOpenAI(model=llm_name, temperature=0)

then configure the RetrievalQA chain with a custom prompt,

# Import required LangChain modules
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# Create a custom prompt template to guide the LLM to use the provided context effectively
custom_prompt_template = """To better assist with the inquiry, consider the details provided below as your reference...
{context}
Inquiry: {question}
Insightful Response:"""

# Initialize the RetrievalQA chain with the custom prompt
question_answering_chain = RetrievalQA.from_chain_type(
    language_model,
    retriever=vector_database.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs={"prompt": PromptTemplate.from_template(custom_prompt_template)}
)

and check the answer on a simple query.

# Provide a sample query
query = "Is probability a class topic?"
response = question_answering_chain.invoke({"query": query})
print("Answer:", response["result"])

Next come advanced QA chain types. MapReduce and Refine help work around context‑window limits when handling many documents: MapReduce aggregates in parallel, while Refine improves the answer sequentially.

# Configure a QA chain to use MapReduce, aggregating answers from multiple documents
question_answering_chain_map_reduce = RetrievalQA.from_chain_type(
    language_model,
    retriever=vector_database.as_retriever(),
    chain_type="map_reduce"
)

# Run MapReduce with the user query
response_map_reduce = question_answering_chain_map_reduce.invoke({"query": query})

# Show the aggregated answer
print("MapReduce answer:", response_map_reduce["result"])

# Configure a QA chain to use Refine, which iteratively improves the answer
question_answering_chain_refine = RetrievalQA.from_chain_type(
    language_model,
    retriever=vector_database.as_retriever(),
    chain_type="refine"
)

# Run Refine with the same user query
response_refine = question_answering_chain_refine.invoke({"query": query})

# Show the refined answer
print("Refine answer:", response_refine["result"])

In practice, consider the following:

  • Choose between MapReduce and Refine based on the task: the former for fast parallel aggregation across many sources, the latter for higher accuracy through iterative improvement.
  • In distributed deployments, performance also depends on network latency and serialization overhead.
  • Effectiveness varies with the data, so experiment with both approaches.

One notable limitation of RetrievalQA is the lack of dialogue history, which degrades its handling of follow-up questions. The example below demonstrates the limitation:

# Reuse the RetrievalQA chain configured above; it keeps no conversation history
qa_chain = question_answering_chain

# Define an initial question related to course content
initial_question_about_course_content = "Does the curriculum cover probability theory?"
# Generate an answer to the initial question
response_to_initial_question = qa_chain.invoke({"query": initial_question_about_course_content})

# Define a follow-up question without explicitly preserving conversation context
follow_up_question_about_prerequisites = "Why are those prerequisites important?"
# Generate an answer to the follow-up question; the chain has no memory of the first exchange
response_to_follow_up_question = qa_chain.invoke({"query": follow_up_question_about_prerequisites})

# Display both answers (initial and follow-up)
print("Answer to the initial question:", response_to_initial_question["result"])
print("Answer to the follow-up question:", response_to_follow_up_question["result"])

This underscores the importance of integrating conversation memory into RAG systems.
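One well-known remedy in LangChain is to pair the retriever with a conversation memory, so each follow-up question is condensed with the dialogue history before retrieval. The sketch below uses ConversationalRetrievalChain with a simple ConversationBufferMemory; the memory strategy is a design choice rather than the only option.

from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory

# Store the dialogue history so follow-up questions can be reformulated with context
conversation_memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

# Build a retrieval chain that condenses each follow-up question using the stored history
conversational_chain = ConversationalRetrievalChain.from_llm(
    language_model,
    retriever=vector_database.as_retriever(),
    memory=conversation_memory
)

# Ask the initial and the follow-up question within the same conversation
first_turn = conversational_chain.invoke({"question": "Does the curriculum cover probability theory?"})
second_turn = conversational_chain.invoke({"question": "Why are those prerequisites important?"})
print("First answer:", first_turn["answer"])
print("Follow-up answer with memory:", second_turn["answer"])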

Conclusion

Advanced QA techniques in RAG deliver more dynamic and accurate answers. A careful RetrievalQA implementation and handling of its limitations enable building systems capable of substantive dialogue with users.

Further Reading

  • Explore the latest advances in LLMs and their impact on RAG.
  • Investigate strategies for integrating conversation memory into RAG frameworks.

This chapter provides a foundation for understanding and practicing advanced QA techniques in RAG and for further innovation in AI interactions.

Theory Questions

  1. Name the three stages of QA in RAG.
  2. What are context‑window limits, and how do MapReduce/Refine help work around them?
  3. Why is a vector database (VectorDB) needed for retrieval in RAG?
  4. How does RetrievalQA combine retrieval and generation?
  5. Compare the MapReduce and Refine approaches.
  6. Which practical factors matter in distributed systems (network latency, serialization)?
  7. Why is it important to experiment with both approaches?
  8. How does missing dialogue history affect handling of follow‑up questions?
  9. Why integrate conversation memory into RAG?
  10. What should be studied next to deepen RAG expertise?

Practical Tasks

  1. Initialize a vector DB (Chroma + OpenAIEmbeddings) and print the number of documents it contains.
  2. Configure RetrievalQA with a custom prompt, specifying the model and the data storage directory.
  3. Demonstrate MapReduce and Refine on a single query and print the resulting answers.
  4. Simulate a follow‑up question without preserving dialogue context to show the RetrievalQA limitation.