2.3 Deep Dive into Text Splitting

Splitting (segmentation) happens after loading data into a “document” format but before indexing or storage. The goal is to produce semantically meaningful chunks that work well for search and analytics without breaking meaning at the boundaries. Two parameters matter most: chunk size and overlap. Size is measured in characters or tokens (larger chunks carry more context; smaller ones are easier to process). Overlap is the “handoff” between neighboring chunks that helps maintain coherence.

LangChain provides several strategies: character- and token-based splitting, a recursive approach that follows a hierarchy of separators (paragraphs → sentences → words), plus specialized splitters for code and Markdown that respect syntax and headings. Every splitter also offers two modes of operation: create_documents (accepts a list of raw strings and returns chunked documents) and split_documents (splits previously loaded documents), so choose based on whether you are working with strings or with document objects.

In practice, CharacterTextSplitter (simple character-based splitting when semantics are less critical) and TokenTextSplitter (token-based splitting to fit LLM limits) are the most common. When structure matters, a recursive splitter that follows the separator hierarchy is very helpful. Among the specialized options are the language-aware code splitters (for example, RecursiveCharacterTextSplitter.from_language) and MarkdownHeaderTextSplitter, which splits by headings while preserving that structure in metadata.
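
To make the two modes concrete, here is a minimal sketch; the sample string and the notes.txt source name are placeholder assumptions, not part of any real dataset.

from langchain.text_splitter import CharacterTextSplitter
from langchain.docstore.document import Document

splitter = CharacterTextSplitter(chunk_size=100, chunk_overlap=10)

# create_documents: raw strings in, chunked Document objects out
chunks_from_strings = splitter.create_documents(
    ["A long raw text that should be broken into smaller pieces"]
)

# split_documents: previously loaded Document objects in, chunked Documents out
loaded_documents = [
    Document(
        page_content="A long raw text that should be broken into smaller pieces",
        metadata={"source": "notes.txt"}
    )
]
chunks_from_documents = splitter.split_documents(loaded_documents)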

Before applying splitters, it’s useful to quickly set up the environment: imports, API keys, and dependencies.

import os
from openai import OpenAI
import sys
from dotenv import load_dotenv, find_dotenv

# Add the path to access project modules
sys.path.append('../..')

# Load environment variables from the .env file
load_dotenv(find_dotenv())

# Initialize the OpenAI client using environment variables
client = OpenAI()

Splitting strategy strongly affects search and analytics quality, so tune parameters to preserve relevance and coherence. The basic choices are CharacterTextSplitter and RecursiveCharacterTextSplitter; select based on your data’s structure and nature. Below are compact examples: first, a simple splitter with optional overlap to help maintain context,

from langchain.text_splitter import CharacterTextSplitter

# Define chunk size and overlap for splitting
chunk_size = 26
chunk_overlap = 4

# Initialize a CharacterTextSplitter
character_text_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

and then a recursive splitter which, for “general” texts, more carefully preserves semantics by following a hierarchy of separators—from paragraphs to sentences to words.

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Initialize a RecursiveCharacterTextSplitter
recursive_character_text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

Next come a few practical examples. Start with simple strings,

# A simple alphabet string example
alphabet_text = 'abcdefghijklmnopqrstuvwxyz'

# Try splitting the alphabet string with both splitters; at 26 characters
# the text fits within chunk_size, so each call returns a single chunk
recursive_character_text_splitter.split_text(alphabet_text)

# The separator is a constructor argument, not a split_text argument
character_text_splitter = CharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap, separator=' '
)
character_text_splitter.split_text(alphabet_text)

and then look under the hood with a minimal splitter implementation and its behavior on basic inputs.

# A class that splits text into chunks based on character count.
class CharacterTextSplitter:
    def __init__(self, chunk_size, chunk_overlap=0):
        """
        Initialize the splitter with the given chunk size and overlap.

        Args:
        - chunk_size: Number of characters each chunk should contain.
        - chunk_overlap: Number of characters to overlap between neighboring chunks.
        """
        # The overlap must stay smaller than chunk_size; otherwise the split
        # loop below would never advance through the text.
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap

    def split_text(self, text):
        """
        Split the given text into chunks according to the configured size and overlap.

        Args:
        - text: The string to split.

        Returns:
        A list of text chunks.
        """
        chunks = []
        start_index = 0

        # Continue splitting until the end of the text is reached.
        while start_index < len(text):
            end_index = start_index + self.chunk_size
            chunks.append(text[start_index:end_index])
            # Advance start index for the next chunk accounting for overlap.
            start_index = end_index - self.chunk_overlap
        return chunks

# Extend CharacterTextSplitter with a toy recursive strategy. Unlike LangChain's
# RecursiveCharacterTextSplitter, it splits by halving rather than by separators.
class RecursiveCharacterTextSplitter(CharacterTextSplitter):
    def split_text(self, text, max_depth=10, current_depth=0):
        """
        Recursively split text into smaller chunks until each chunk is below the
        size threshold or the maximum recursion depth is reached.

        Args:
        - text: The string to split.
        - max_depth: Maximum recursion depth to prevent infinite recursion.
        - current_depth: Current recursion depth.

        Returns:
        A list of text chunks.
        """
        # Base case: if max depth reached or text already below threshold, return as-is.
        if current_depth == max_depth or len(text) <= self.chunk_size:
            return [text]
        else:
            # Split into two halves and recurse on each.
            mid_point = len(text) // 2
            first_half = text[:mid_point]
            second_half = text[mid_point:]
            return self.split_text(first_half, max_depth, current_depth + 1) + \
                   self.split_text(second_half, max_depth, current_depth + 1)

# Example usage of the above classes:

# Define chunk size and overlap for splitting.
chunk_size = 26
chunk_overlap = 4

# Initialize the CharacterTextSplitter with the specified size and overlap.
character_text_splitter = CharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)

# Initialize the RecursiveCharacterTextSplitter with the specified size
# (the toy recursive implementation does not use overlap).
recursive_character_text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size)

# Example text to split.
alphabet_text = 'abcdefghijklmnopqrstuvwxyz'

# Use both splitters and store results.
recursive_chunks = recursive_character_text_splitter.split_text(alphabet_text)
simple_chunks = character_text_splitter.split_text(alphabet_text)

# Print results from the recursive splitter.
print("Recursive splitter chunks:")
for chunk in recursive_chunks:
    print(chunk)

# Print results from the simple splitter.
print("\nSimple splitter chunks:")
for chunk in simple_chunks:
    print(chunk)
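
The overlap only becomes visible on input longer than chunk_size. The string below (the alphabet plus a few extra letters) is a made-up example that forces the simple splitter to cut:

# A string longer than chunk_size, so the simple splitter has to cut it
longer_text = 'abcdefghijklmnopqrstuvwxyzabcdefg'

overlapping_chunks = character_text_splitter.split_text(longer_text)
print(overlapping_chunks)
# With chunk_size=26 and chunk_overlap=4, the second chunk starts four
# characters before the end of the first:
# ['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']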

The example above illustrates how splitting behaves on basic strings—with and without explicit separators. Now consider two advanced techniques. First, handling more complex text where it’s helpful to explicitly set a hierarchy of separators and a chunk size:

# A sample complex text
complex_text = """When writing documents, writers will use document structure to group content...
Sentences have a period at the end, but also, have a space."""

# Apply recursive splitting with configured chunk size and separators.
# Re-import LangChain's splitter first: the minimal demo classes above
# shadow the RecursiveCharacterTextSplitter name.
from langchain.text_splitter import RecursiveCharacterTextSplitter

recursive_character_text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separators=["\n\n", "\n", " ", ""]
)
recursive_character_text_splitter.split_text(complex_text)

This produces coherent chunks that respect the document’s internal structure. Second, token-based splitting, where the LLM context window is defined in tokens and limits must be strictly observed:

from langchain.text_splitter import TokenTextSplitter

# Initialize a TokenTextSplitter
token_text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)

# Split previously loaded documents by tokens; `pages` is assumed to be a
# list of Document objects produced by a loader earlier in the chapter
document_chunks_by_tokens = token_text_splitter.split_documents(pages)
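
For a quick self-contained check without previously loaded documents, split_text works directly on a string (note that TokenTextSplitter relies on the tiktoken package, so it must be installed):

# Token-based splitting of a plain string; each chunk holds at most 10 tokens
sample_sentence = "Splitting by tokens keeps every chunk within the model's context window."
print(token_text_splitter.split_text(sample_sentence))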

And finally, splitting by Markdown headings, where the document’s logical organization guides segmentation and the detected headings are preserved in chunk metadata.

from langchain.text_splitter import MarkdownHeaderTextSplitter

# Define the headings to split on in a Markdown document
markdown_headers = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]

# Initialize a MarkdownHeaderTextSplitter
markdown_header_text_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=markdown_headers
)
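
To make the call below runnable, markdown_document_content can be any Markdown string; the inline document here is only a stand-in for real loaded content:

# A stand-in Markdown document (replace with your own loaded content)
markdown_document_content = """# Title

## Chapter 1

Text that belongs to the first chapter.

## Chapter 2

Text that belongs to the second chapter."""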

# Split the Markdown document while preserving heading metadata
markdown_document_splits = markdown_header_text_splitter.split_text(markdown_document_content)
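
Each returned chunk is a Document whose metadata records the headings it falls under, which is easy to inspect:

# Show how the detected headings end up in chunk metadata
for split in markdown_document_splits:
    print(split.metadata, '->', split.page_content)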

A few quick recommendations: preserve semantics and account for the source document’s structure; manage overlap—just enough to maintain coherence without unnecessary redundancy; use and enrich metadata to improve context during retrieval and answering.
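
As a small illustration of the metadata point, chunks can be enriched after splitting; the field names below (chunk_index, source) are arbitrary examples, not a LangChain convention:

# Enrich each chunk with extra metadata to aid later retrieval
for i, split in enumerate(markdown_document_splits):
    split.metadata.update({"chunk_index": i, "source": "markdown_document_content"})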

Theory Questions

  1. What is the goal of document splitting?
  2. How does chunk size affect processing?
  3. Why is overlap needed and how does it help analysis?
  4. How do CharacterTextSplitter and TokenTextSplitter differ, and where are they used?
  5. What is a recursive splitter and how does it differ from basic ones?
  6. Which specialized splitters exist for code and Markdown, and what are their benefits?
  7. What is required to set up the environment before splitting?
  8. List the pros and cons of RecursiveCharacterTextSplitter and the parameters that are important to tune.
  9. What does the “alphabet” example demonstrate when comparing simple and recursive approaches?
  10. What should you pay attention to when choosing between characters and tokens for LLMs?
  11. How does splitting by Markdown headings preserve logical structure and why is that important?
  12. What best practices help preserve semantics and manage overlap?

Practical Tasks

  1. Write a function split_by_char(text, chunk_size) that returns a list of fixed-size chunks.
  2. Add a chunk_overlap parameter to split_by_char and implement overlapping.
  3. Implement a class TokenTextSplitter(chunk_size, chunk_overlap) with a split_text method that splits text by tokens (tokens separated by spaces).
  4. Write a function recursive_split(text, max_chunk_size, separators) that recursively splits text using a given list of separators.
  5. Implement a class MarkdownHeaderTextSplitter(headers_to_split_on) with a split_text method that splits Markdown by the specified headings and returns chunks with the corresponding metadata.