2.3 Deep Dive into Text Splitting

Understanding Document Splitting

Document splitting is a critical step that occurs after loading data into a document format but before storing it for further processing. The goal is to create semantically meaningful chunks that facilitate efficient data retrieval and analysis. This section outlines the principles and challenges of document splitting, highlighting the importance of semantic relevance and consistency across chunks.

Understanding Chunk Size and Overlap

  • Chunk Size: The length of each chunk, measured in characters or tokens. The appropriate size depends on the specific requirements of the application and the nature of the text being processed.
  • Chunk Overlap: A small amount of text shared between adjacent chunks, used to maintain context continuity so that information straddling a chunk boundary is not lost; the sketch below makes this concrete.
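
As a quick illustration (plain Python, independent of LangChain), splitting a ten-character string into chunks of four with an overlap of two shows how the tail of one chunk reappears at the head of the next:

text = "abcdefghij"
chunk_size, chunk_overlap = 4, 2

chunks = []
start = 0
while start < len(text):
    chunks.append(text[start:start + chunk_size])
    # Each new chunk starts chunk_overlap characters before the previous one ended
    start += chunk_size - chunk_overlap

print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']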

Implementing Text Splitters in LangChain

LangChain provides a suite of text splitters designed to accommodate different splitting strategies. These splitters offer two primary methods:

  • create_documents: Accepts a list of raw text strings and returns a list of Document chunks.
  • split_documents: Takes a list of pre-loaded Document objects and divides them into smaller chunks.

The choice between these methods depends on whether the input data is raw text or structured documents.
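
For example, both methods can be called on the same splitter instance (a minimal sketch; the sample strings are invented for illustration):

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)

# create_documents: raw strings in, Document chunks out
docs_from_text = splitter.create_documents(["First long raw string...", "Second one..."])

# split_documents: pre-loaded Document objects in, smaller Document chunks out
smaller_docs = splitter.split_documents(docs_from_text)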

Types of Text Splitters

This section introduces various text splitters available in LangChain, each tailored to specific types of text or document structures.

Character and Token-Based Splitters

  • Character Text Splitter: Splits text based on character count, ideal for straightforward chunking where semantic integrity is not a primary concern.
  • Token Text Splitter: Divides text based on tokens, which is particularly useful when preparing data for LLMs with specific token limitations.

Recursive Character Text Splitter

A more sophisticated splitter that recursively divides text based on a hierarchy of separators (e.g., paragraphs, sentences, and words). This approach allows for more nuanced splitting, ensuring that chunks maintain semantic coherence.

Specialized Splitters for Code and Markdown

  • Language Text Splitter: Designed for source code, this splitter recognizes language-specific syntax and separators (class and function definitions, for example), ensuring that code blocks are appropriately segmented; see the sketch after this list.
  • Markdown Header Text Splitter: Targets Markdown documents, splitting them on header levels and adding header information to chunk metadata for enhanced context.
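
In LangChain, code-aware splitting is exposed through RecursiveCharacterTextSplitter.from_language together with the Language enum. A minimal sketch (the snippet being split is invented for illustration):

from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

python_code = """
def hello():
    print("Hello, world!")

class Greeter:
    def greet(self):
        hello()
"""

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=60,
    chunk_overlap=0
)
python_splitter.split_text(python_code)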

Practical Guide to Document Splitting

Before diving into text splitting techniques, it's essential to set up the development environment correctly. This setup includes importing necessary libraries, configuring API keys, and ensuring all dependencies are correctly installed.

Setup and Configuration

import os
import openai
import sys
from dotenv import load_dotenv, find_dotenv

# Append the path to access custom modules
sys.path.append('../..')

# Load environment variables from a .env file
load_dotenv(find_dotenv())

# Set the OpenAI API key from environment variables
openai.api_key = os.environ['OPENAI_API_KEY']

Document Splitting Strategies

Document splitting can significantly affect the performance of text-based models and analyses. Choosing the right strategy and parameters is crucial for maintaining the relevance and coherence of the resulting chunks.

LangChain provides two primary types of text splitters: CharacterTextSplitter and RecursiveCharacterTextSplitter. Each serves different needs based on the structure and nature of the text.

Character Text Splitter

This splitter divides text based on a specified number of characters or tokens, with an optional overlap between chunks for context continuity.

from langchain.text_splitter import CharacterTextSplitter

# Define chunk size and overlap for splitting
chunk_size = 26
chunk_overlap = 4

# Initialize the Character Text Splitter
character_text_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

Recursive Character Text Splitter

Ideal for generic text, it recursively splits the document using a hierarchy of separators, from larger structures like paragraphs to smaller ones like sentences and words.

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Initialize the Recursive Character Text Splitter
recursive_character_text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

Practical Examples

Simple Text Strings

# Example text string of the alphabet
alphabet_text = 'abcdefghijklmnopqrstuvwxyz'

# Attempt to split alphabet_text using both splitters
recursive_character_text_splitter.split_text(alphabet_text)
# Note: CharacterTextSplitter takes its separator at construction time
# (e.g., CharacterTextSplitter(separator=' ', ...)), not as a split_text argument
character_text_splitter.split_text(alphabet_text)

Under the Hood

# Define a class to split text into chunks based on character count.
class CharacterTextSplitter:
    def __init__(self, chunk_size, chunk_overlap=0):
        """
        Initializes the splitter with specified chunk size and overlap.

        Parameters:
        - chunk_size: The number of characters each chunk should contain.
        - chunk_overlap: The number of characters to overlap between adjacent chunks.
        """
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap

    def split_text(self, text):
        """
        Splits the given text into chunks according to the initialized chunk size and overlap.

        Parameters:
        - text: The string of text to be split.

        Returns:
        A list of text chunks.
        """
        chunks = []  # Initialize an empty list to store the chunks of text.
        start_index = 0  # Start index for slicing the text.

        # Loop to split the text until the end of the text is reached.
        while start_index < len(text):
            end_index = start_index + self.chunk_size  # Calculate the end index for the current chunk.
            chunks.append(text[start_index:end_index])  # Slice the text and append the chunk to the list.
            # Advance the start index for the next chunk, stepping back by the
            # overlap (the overlap must be smaller than the chunk size, or this
            # loop would never advance).
            start_index = end_index - self.chunk_overlap
        return chunks  # Return the list of text chunks.

# Inherits from CharacterTextSplitter to add recursive splitting functionality.
class RecursiveCharacterTextSplitter(CharacterTextSplitter):
    def split_text(self, text, max_depth=10, current_depth=0):
        """
        Recursively splits text into smaller chunks until each chunk is below the chunk size threshold or max depth is reached.

        Parameters:
        - text: The string of text to be split.
        - max_depth: The maximum depth of recursion to prevent infinite recursion.
        - current_depth: The current depth of recursion.

        Returns:
        A list of text chunks.
        """
        # Base condition: if maximum depth reached or text length is within chunk size, return text as a single chunk.
        if current_depth == max_depth or len(text) <= self.chunk_size:
            return [text]
        else:
            # Split the text into two halves for recursive splitting.
            mid_point = len(text) // 2
            first_half = text[:mid_point]
            second_half = text[mid_point:]
            # Recursively split each half and concatenate the results.
            return self.split_text(first_half, max_depth, current_depth + 1) + \
                   self.split_text(second_half, max_depth, current_depth + 1)

# Example usage of the above classes:

# Define chunk size and overlap for splitting.
chunk_size = 26
chunk_overlap = 4

# Initialize the Character Text Splitter with specified chunk size and overlap.
character_text_splitter = CharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)

# Initialize the Recursive Character Text Splitter with the specified chunk size
# (this simplified version splits by halving and does not use overlap).
recursive_character_text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size)

# Define a sample text string to be split.
alphabet_text = 'abcdefghijklmnopqrstuvwxyz'

# Use both splitters to split the sample text and store the results.
recursive_chunks = recursive_character_text_splitter.split_text(alphabet_text)
simple_chunks = character_text_splitter.split_text(alphabet_text)

# Print the resulting chunks from the recursive splitter.
print("Recursive Splitter Chunks:")
for chunk in recursive_chunks:
    print(chunk)

# Print the resulting chunks from the simple splitter.
print("\nSimple Splitter Chunks:")
for chunk in simple_chunks:
    print(chunk)

This demonstrates how the two splitters handle a basic text string; note that separators, when used, are configured on the splitter itself rather than passed to split_text.

Advanced Splitting Techniques

Handling Complex Text

# Define a complex text sample
complex_text = """When writing documents, writers will use document structure to group content...
Sentences have a period at the end, but also, have a space."""

# Apply recursive splitting with customized chunk size and separators
recursive_character_text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0, 
    separators=["\n\n", "\n", " ", ""]
)
recursive_character_text_splitter.split_text(complex_text)

This example illustrates splitting complex text into meaningful chunks, considering the document's inherent structure.

Specialized Splitting: Tokens and Markdown Headers

Token-Based Splitting

For applications where the context window of an LLM is defined in tokens, splitting by token count can align the chunks more closely with the model's requirements.
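
The snippet below assumes that pages already holds a list of Document objects from an earlier loading step. If you are following along, something like this (with your own file path; the one here is hypothetical) would produce it:

from langchain.document_loaders import PyPDFLoader

# Hypothetical path; substitute any PDF you have on hand
loader = PyPDFLoader("docs/example.pdf")
pages = loader.load()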

from langchain.text_splitter import TokenTextSplitter

# Initialize the Token Text Splitter
token_text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)

# Splitting document pages by tokens
document_chunks_by_tokens = token_text_splitter.split_documents(pages)

Markdown Header Text Splitting

Markdown documents often contain structured headers that can be used to guide the splitting process, preserving the document's logical organization.

from langchain.text_splitter import MarkdownHeaderTextSplitter

# Define headers to split on in a Markdown document
markdown_headers = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]

# Initialize the Markdown Header Text Splitter
markdown_header_text_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=markdown_headers
)

# Split a Markdown document while preserving header metadata
# ('markdown_document_content' is assumed to hold the raw text of a Markdown file)
markdown_document_splits = markdown_header_text_splitter.split_text(markdown_document_content)
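
To make the preserved metadata concrete, here is a toy document (invented for illustration) run through the same splitter. In recent LangChain versions each split is a Document whose metadata records the enclosing headers:

sample_markdown = """# Title

Intro text under the title.

## Section A

Details for section A."""

for split in markdown_header_text_splitter.split_text(sample_markdown):
    print(split.metadata, "->", split.page_content)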

Best Practices and Tips

  • Semantic Coherence: When splitting text, prioritize strategies that preserve the semantic integrity of the content. Consider the document's structure and the nature of the text.
  • Overlap Management: Properly manage chunk overlap to ensure continuity without redundant information. Experiment with different overlap sizes to find the optimal balance.
  • Metadata Preservation: Ensure that splitting processes maintain or enhance chunk metadata, providing valuable context for each piece of text.

Theory questions:

  1. What is the primary goal of document splitting in text processing?
  2. How does chunk size affect the processing of document chunks?
  3. Why is chunk overlap important in document splitting, and how does it contribute to data analysis?
  4. Compare and contrast the CharacterTextSplitter and the TokenTextSplitter provided by LangChain. What are their key differences and applications?
  5. Explain the concept of a recursive character text splitter. How does it differ from basic splitters in handling text?
  6. In the context of processing code and markdown documents, what specialized splitters does LangChain offer, and how do they address the unique needs of these document types?
  7. Describe the setup process for a development environment aimed at document splitting. What key components and configurations are necessary?
  8. Discuss the advantages and challenges of using a RecursiveCharacterTextSplitter for splitting generic text. Include examples of parameters that might be adjusted to optimize splitting.
  9. How does the example of splitting the alphabet string with different splitters illustrate the operational differences between a simple and a recursive text splitter?
  10. What considerations should be taken into account when deciding between character-based and token-based splitting techniques for large language models (LLMs)?
  11. Explain how markdown header text splitting preserves the logical organization of documents and why this might be important for document analysis.
  12. Outline best practices for ensuring semantic coherence and optimal overlap management in document splitting strategies.

Practice questions:

  1. Write a Python function named split_by_char that takes two arguments: text (a string) and chunk_size (an integer). The function should return a list of chunks, where each chunk is a substring of text of length chunk_size, except possibly for the last chunk, which may be shorter. Use a loop to implement this functionality.

  2. Modify the split_by_char function to accept an additional optional argument chunk_overlap (an integer, default value 0). Modify the function to include an overlap of chunk_overlap characters between adjacent chunks. Ensure that the first chunk still starts at the beginning of the text, and adjust the function accordingly.

  3. Create a Python class named TokenTextSplitter with an initializer that accepts two arguments: chunk_size and chunk_overlap (with a default value of 0 for chunk_overlap). Implement a method named split_text within this class. The method should accept a string text and return a list of chunks, where each chunk has up to chunk_size tokens, taking chunk_overlap into account between chunks. You may assume that tokens are separated by spaces for this task.

  4. Write a function named recursive_split that takes three parameters: text (a string), max_chunk_size (an integer), and separators (a list of strings, in the order they should be applied). The function should recursively split text at the first separator in separators if text is longer than max_chunk_size. If the split at the first separator doesn't reduce the size of any chunk below max_chunk_size, it should attempt to split using the next separator in the list, and so on. The function should return a list of chunks that are smaller than or equal to max_chunk_size. For simplicity, you can assume that each separator string appears at most once in text.

  5. Implement a Python class named MarkdownHeaderTextSplitter with an initializer that takes a single argument: headers_to_split_on (a list of tuples, where each tuple contains two elements: a string representing the markdown header syntax like "#" or "##", and a string representing the header level name like "Header 1" or "Header 2"). Add a method named split_text that takes a string markdown_text and splits it into chunks based on the presence of headers specified in headers_to_split_on. Each chunk should include the header and the subsequent text until the next header of the same or higher priority. The method should return a list of these chunks. Consider headers in headers_to_split_on to determine splitting priority, with earlier entries in the list having higher priority.