Answers 2.3
Theory
- The goal of splitting is to produce meaningful chunks for effective search and analysis.
- Chunk size controls granularity: larger chunks carry more context; smaller chunks are easier to process but risk losing cohesion.
- Overlap preserves context at boundaries and prevents loss of important information.
- `CharacterTextSplitter` splits by characters; `TokenTextSplitter` splits by tokens (handy for LLM limits).
- The recursive splitter uses a hierarchy of separators (paragraphs/sentences/words) to preserve semantics.
- Specialized splitters: `LanguageTextSplitter` for code (syntax-aware) and `MarkdownHeaderTextSplitter` for Markdown (splits on heading levels and adds metadata).
- Environment setup: libraries, API keys, dependencies, and imports for robust processing.
- `RecursiveCharacterTextSplitter` preserves semantics and adapts to structure; tune chunk size, overlap, and separator depth (see the sketch after this list).
- The "alphabet" demo highlights the differences: even slicing vs. semantically aware splitting.
- The choice between characters and tokens depends on model limits, semantic needs, and the nature of the text.
- Splitting by Markdown headings preserves logical structure.
- Best practices: keep meaning, tune overlap (avoid redundancy), enrich chunk metadata.
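To make the size/overlap trade-off concrete, here is a minimal sketch using LangChain's `RecursiveCharacterTextSplitter` (assuming the `langchain-text-splitters` package is installed; the parameter values are illustrative only):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Illustrative values: chunks of at most 40 characters with a 10-character
# overlap so text cut at a boundary keeps some shared context.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=40,
    chunk_overlap=10,
    separators=["\n\n", "\n", " ", ""],  # paragraph -> line -> word -> char
)

text = "First paragraph about splitting.\n\nSecond paragraph with more detail."
for i, chunk in enumerate(splitter.split_text(text)):
    print(f"Chunk {i+1}: {chunk!r}")
```

The splitter tries the coarsest separator first and only falls back to finer ones when a piece still exceeds `chunk_size`, which is why paragraph boundaries survive when they fit.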
Practical Tasks
1.
```python
def split_by_char(text, chunk_size):
    """
    Split text into chunks of a fixed size.

    Args:
        text (str): The text to split.
        chunk_size (int): The size of each chunk.

    Returns:
        list[str]: List of string chunks.
    """
    chunks = []
    for start_index in range(0, len(text), chunk_size):
        chunks.append(text[start_index:start_index + chunk_size])
    return chunks

# Example usage
text = "This is a sample text for demonstration purposes."
chunk_size = 10
chunks = split_by_char(text, chunk_size)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {chunk}")
```
2.
```python
def split_by_char(text, chunk_size):
    """
    Split text into chunks of a fixed size.

    Args:
        text (str): The text to split.
        chunk_size (int): The size of each chunk.

    Returns:
        list: List of text chunks.
    """
    chunks = []  # Initialize an empty list to store chunks
    for start_index in range(0, len(text), chunk_size):
        # Append a chunk (substring) starting at start_index
        chunks.append(text[start_index:start_index + chunk_size])
    return chunks

# Example usage
text = "This is a sample text for demonstration purposes."
chunk_size = 10
chunks = split_by_char(text, chunk_size)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {chunk}")
```
3.
```python
class TokenTextSplitter:
    def __init__(self, chunk_size, chunk_overlap=0):
        # Validate up front: an overlap >= chunk_size would stop the
        # window from advancing and cause an infinite loop below.
        if chunk_overlap >= chunk_size:
            raise ValueError("`chunk_overlap` must be less than `chunk_size`.")
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap

    def split_text(self, text):
        tokens = text.split()  # Split text into tokens by whitespace
        chunks = []
        start_index = 0
        while start_index < len(tokens):
            # Ensure end_index does not exceed the total number of tokens
            end_index = min(start_index + self.chunk_size, len(tokens))
            chunks.append(' '.join(tokens[start_index:end_index]))
            # Advance the window, keeping `chunk_overlap` tokens of context
            start_index += self.chunk_size - self.chunk_overlap
        return chunks
```
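Example usage (a minimal check of the class above; note that the "tokenization" here is plain whitespace splitting, not a model tokenizer):

```python
splitter = TokenTextSplitter(chunk_size=5, chunk_overlap=2)
text = "Token based splitting keeps chunks within a fixed token budget for LLM calls"
for i, chunk in enumerate(splitter.split_text(text)):
    print(f"Chunk {i+1}: {chunk}")
# Each chunk holds up to 5 tokens; adjacent chunks share 2 tokens of context.
```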
4.
```python
def recursive_split(text, max_chunk_size, separators):
    # Base cases: text already fits, or no separators left to try
    if len(text) <= max_chunk_size or not separators:
        return [text]

    separator = separators[0]
    parts = text.split(separator)
    if len(parts) == 1:
        # Separator not found; fall back to the next, finer separator
        return recursive_split(text, max_chunk_size, separators[1:])

    chunks = []
    current_chunk = ""
    for part in parts:
        candidate = current_chunk + separator + part if current_chunk else part
        if len(candidate) > max_chunk_size and current_chunk:
            # Current chunk is full; recurse with finer separators in case
            # it still exceeds the limit, then start a new chunk
            chunks.extend(recursive_split(current_chunk, max_chunk_size, separators[1:]))
            current_chunk = part
        else:
            current_chunk = candidate
    if current_chunk.strip():
        chunks.extend(recursive_split(current_chunk, max_chunk_size, separators[1:]))
    return chunks  # `extend` already keeps the result flat
```
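A quick usage sketch of the function above (the separator hierarchy mirrors the paragraph/sentence/word idea from the theory section; the values are illustrative):

```python
text = (
    "First paragraph. It has two sentences.\n\n"
    "Second paragraph. It is a bit longer and will not fit in one chunk."
)
chunks = recursive_split(text, max_chunk_size=40, separators=["\n\n", ". ", " "])
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {chunk!r}")
```

The first paragraph fits within the limit and stays whole, while the second is split first by sentence and then by word until every chunk is at most 40 characters.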
5.
To implement `MarkdownHeaderTextSplitter` as described, we follow these steps:
- Initialization: store the header patterns with their names/levels for use during splitting.
- Text splitting: parse the input Markdown, identify headers by the given patterns, and split into chunks. Each chunk starts with a header and includes the following lines up to the next header.
```python
import re

class MarkdownHeaderTextSplitter:
    def __init__(self, headers_to_split_on):
        # Sort headers by marker length (longer first) for correct matching
        self.headers_to_split_on = sorted(headers_to_split_on, key=lambda x: len(x[0]), reverse=True)
        self.header_regex = self._generate_header_regex()

    def _generate_header_regex(self):
        # Build a regex matching any of the specified header markers
        header_patterns = [re.escape(header[0]) for header in self.headers_to_split_on]
        combined_pattern = '|'.join(header_patterns)
        return re.compile(r'(' + combined_pattern + r')\s*(.*)')

    def split_text(self, markdown_text):
        chunks = []
        current_chunk = []
        for line in markdown_text.split('\n'):
            # A header line closes the previous chunk and opens a new one
            if self.header_regex.match(line) and current_chunk:
                chunk = '\n'.join(current_chunk).strip()
                if chunk:  # skip empty leading material
                    chunks.append(chunk)
                current_chunk = []
            current_chunk.append(line)
        # Append the last collected chunk if present
        if current_chunk:
            chunk = '\n'.join(current_chunk).strip()
            if chunk:
                chunks.append(chunk)
        return chunks
```
Example usage:

```python
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on)

markdown_text = """
# Header 1
This is some text under header 1.
## Header 2
This is some text under header 2.
### Header 3
This is some text under header 3.
"""

chunks = splitter.split_text(markdown_text)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:\n{chunk}\n---")
```
This implementation does the following:
- During initialization, it sorts header markers by length in descending order so that longer (more specific) markers match first. This matters because Markdown header levels differ only by the number of `#` characters, and we want the most specific marker matched.
- It compiles a regular expression that matches any of the specified header markers at the start of a line.
- The `split_text` method iterates over each line of the input Markdown, checking for header matches. When it finds a header, it closes the current chunk and starts a new one, so each chunk includes its starting header and all subsequent lines up to the next header of any level.