Answers 2.3
Theory
- The primary goal of document splitting in text processing is to create semantically meaningful chunks that facilitate efficient data retrieval and analysis, ensuring the data is organized in a way that is both manageable and analyzable for various applications.
- Chunk size determines the length of each document chunk, which affects the granularity of the data analysis. A larger chunk size may retain more context but can be cumbersome for processing, while a smaller chunk size may lead to loss of context but can be more manageable for detailed analyses.
- Chunk overlap is important because it ensures that critical information is not lost at the boundaries of chunks, maintaining context continuity between adjacent chunks. This overlap enables more coherent data retrieval and analysis by preventing the segmentation of closely related information.
- The CharacterTextSplitter splits text based on a specific number of characters, suitable for straightforward chunking without a primary concern for semantic integrity. The TokenTextSplitter, on the other hand, divides text based on tokens, which is particularly useful for preparing data for LLMs with specific token limitations, thus ensuring that chunks align with the processing capabilities of the models.
- A recursive character text splitter is more sophisticated than basic splitters as it recursively divides text based on a hierarchy of separators (e.g., paragraphs, sentences, words), allowing for nuanced splitting that maintains semantic coherence within chunks.
- LangChain offers specialized splitters for code and markdown documents: the Language Text Splitter, which recognizes language-specific syntax and separators to appropriately segment code blocks, and the MarkdownHeaderTextSplitter, which splits markdown documents based on header levels, adding header information to chunk metadata for enhanced context.
- Setting up a development environment for document splitting involves importing necessary libraries, configuring API keys, ensuring all dependencies are correctly installed, and potentially appending paths to access custom modules. This setup ensures the environment is ready for efficient document processing.
- The RecursiveCharacterTextSplitter is advantageous for its ability to maintain semantic integrity through nuanced splitting, adapting to the document's structure. Adjusting parameters like chunk size, chunk overlap, and the list of separators can optimize the splitter's performance for specific texts.
- Splitting the alphabet string with different splitters illustrates operational differences by showing how a simple splitter might uniformly divide text, while a recursive splitter could consider semantic units within the text, resulting in chunks that better preserve the intended meaning or structure.
- When deciding between character-based and token-based splitting techniques for LLMs, considerations include the model's token limit, the importance of semantic integrity, and the nature of the text being processed. Token-based splitting aligns chunks more closely with the model's processing capabilities, potentially improving analysis accuracy.
- Markdown header text splitting preserves the logical organization of documents by splitting them based on header levels. This approach is important for document analysis as it ensures that the resulting chunks maintain the original structure and context, facilitating better understanding and navigation of the content.
- Best practices for ensuring semantic coherence and optimal overlap management in document splitting include prioritizing strategies that maintain the text's meaning and context, experimenting with different overlap sizes to find a balance that prevents redundancy while preserving continuity, and enhancing chunk metadata to provide context.
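The effect of chunk size and chunk overlap described above can be seen on the alphabet string. The following is a minimal sketch, not a LangChain API: the function name split_with_overlap and its parameters are assumptions made for illustration.

```python
def split_with_overlap(text, chunk_size, chunk_overlap=0):
    """Fixed-size character chunks; each chunk repeats the tail of the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # distance between chunk start positions
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the end of the text is already covered
    return chunks


alphabet = "abcdefghijklmnopqrstuvwxyz"
print(split_with_overlap(alphabet, chunk_size=10))
# ['abcdefghij', 'klmnopqrst', 'uvwxyz']
print(split_with_overlap(alphabet, chunk_size=10, chunk_overlap=3))
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

With an overlap of 3, the last three characters of each chunk reappear at the start of the next one, so text near a boundary is always seen together with some of its neighbors, at the cost of one extra chunk.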
Practice
1.
def split_by_char(text, chunk_size):
    """
    Splits the text into chunks of specified size.

    Parameters:
    - text (str): The text to be split.
    - chunk_size (int): The size of each chunk.

    Returns:
    - list: A list of text chunks.
    """
    chunks = []  # Initialize the list to hold the chunks
    for start_index in range(0, len(text), chunk_size):
        # Append the chunk to the list, which is a substring of the text starting
        # from start_index to start_index + chunk_size
        chunks.append(text[start_index:start_index + chunk_size])
    return chunks

# Example usage
text = "This is a sample text for demonstration purposes."
chunk_size = 10
chunks = split_by_char(text, chunk_size)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {chunk}")
2.
def split_by_char(text, chunk_size):
    """
    Splits the text into chunks of specified size.

    Parameters:
    - text (str): The text to be split.
    - chunk_size (int): The size of each chunk.

    Returns:
    - list: A list of text chunks.
    """
    chunks = []  # Initialize the list to hold the chunks
    for start_index in range(0, len(text), chunk_size):
        # Append the chunk to the list, which is a substring of the text starting
        # from start_index to start_index + chunk_size
        chunks.append(text[start_index:start_index + chunk_size])
    return chunks

# Example usage
text = "This is a sample text for demonstration purposes."
chunk_size = 10
chunks = split_by_char(text, chunk_size)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {chunk}")
3.
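class TokenTextSplitter:
    def __init__(self, chunk_size, chunk_overlap=0):
        if chunk_size <= 0:
            raise ValueError("chunk_size must be a positive integer.")
        if not 0 <= chunk_overlap < chunk_size:
            # An overlap >= chunk_size would stop the window from advancing
            raise ValueError("chunk_overlap must be non-negative and less than chunk_size.")
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap

    def split_text(self, text):
        tokens = text.split()  # Approximate tokens by splitting on whitespace
        chunks = []
        step = self.chunk_size - self.chunk_overlap  # How far the window advances each time
        start_index = 0
        while start_index < len(tokens):
            # Ensure that the end index does not exceed the length of tokens
            end_index = min(start_index + self.chunk_size, len(tokens))
            chunks.append(' '.join(tokens[start_index:end_index]))
            if end_index == len(tokens):
                break  # The last token is covered; avoid an extra overlap-only chunk
            # Update start_index for the next chunk, considering the overlap
            start_index += step
        return chunks

# Example usage
splitter = TokenTextSplitter(chunk_size=5, chunk_overlap=2)
text = "The quick brown fox jumps over the lazy dog near the river bank"
for i, chunk in enumerate(splitter.split_text(text)):
    print(f"Chunk {i+1}: {chunk}")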
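Splitting on whitespace, as the class above does, only approximates what an LLM counts as a token. The sketch below compares two rough approximations; the function names are illustrative, and neither reproduces a real sub-word tokenizer, so for actual token limits the model's own tokenizer should be used to count.

```python
import re


def whitespace_tokens(text):
    # Punctuation stays attached to the neighboring word
    return text.split()


def word_punct_tokens(text):
    # Words and individual punctuation marks become separate tokens
    return re.findall(r"\w+|[^\w\s]", text)


sample = "Hello, world! It's fine."
print(len(whitespace_tokens(sample)))  # 4
print(len(word_punct_tokens(sample)))  # 9
```

The same sentence yields 4 or 9 tokens depending on the rule, which is why a chunk size tuned with a whitespace split can still exceed a model's limit.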
4.
def recursive_split(text, max_chunk_size, separators):
    if not separators:  # Base case: no more separators to try
        return [text]
    if len(text) <= max_chunk_size:  # If the current chunk is within the size limit
        return [text]
    # Try to split the text using the first separator
    separator, remaining = separators[0], separators[1:]
    if not separator or separator not in text:
        # Empty or absent separator: move to the next separator
        return recursive_split(text, max_chunk_size, remaining)
    chunks = []
    current_chunk = ""
    for part in text.split(separator):
        candidate = current_chunk + separator + part if current_chunk else part
        if len(candidate) <= max_chunk_size:
            # The part still fits, so keep growing the current chunk
            current_chunk = candidate
            continue
        # Adding the part would exceed max_chunk_size: save the current chunk
        if current_chunk:
            chunks.append(current_chunk)
        if len(part) > max_chunk_size:
            # The part alone is too large: split it with the finer separators
            chunks.extend(recursive_split(part, max_chunk_size, remaining))
            current_chunk = ""
        else:
            current_chunk = part
    # Make sure to add the last chunk if it's not empty
    if current_chunk:
        chunks.append(current_chunk)
    return chunks

# Example usage
text = "First paragraph.\n\nSecond paragraph is somewhat longer than the first."
for i, chunk in enumerate(recursive_split(text, 25, ["\n\n", "\n", " "])):
    print(f"Chunk {i+1}: {chunk}")
5.
To implement the MarkdownHeaderTextSplitter class as described, we need to follow these steps:
- Initialization: The class initializer will store the header patterns to split on, along with their associated names or levels, for later use in the text splitting process.
- Text Splitting: The split_text method will analyze the input markdown text, identify headers based on the specified patterns, and split the text into chunks. Each chunk will start with a header and include all subsequent text up to the next header that matches one of the specified patterns.
Here's how the class could be implemented:
import re

class MarkdownHeaderTextSplitter:
    def __init__(self, headers_to_split_on):
        # Sort longest marker first so "###" is tried before "##" and "#"
        self.headers_to_split_on = sorted(headers_to_split_on, key=lambda x: len(x[0]), reverse=True)
        self.header_regex = self._generate_header_regex()

    def _generate_header_regex(self):
        # Generate a regex pattern that matches any of the specified headers
        header_patterns = [re.escape(header[0]) for header in self.headers_to_split_on]
        combined_pattern = '|'.join(header_patterns)
        # Require whitespace after the marker so "#tag" or "####" is not mistaken for a header
        return re.compile(r'(' + combined_pattern + r')\s+(.*)')

    def split_text(self, markdown_text):
        chunks = []
        current_chunk = []
        lines = markdown_text.split('\n')
        for line in lines:
            # Check if the line starts with one of the specified headers
            match = self.header_regex.match(line)
            if match:
                # If we're already collecting a chunk, save it before starting a new one
                chunk_text = '\n'.join(current_chunk).strip()
                if chunk_text:  # Skip empty chunks, e.g. blank lines before the first header
                    chunks.append(chunk_text)
                current_chunk = []
            # Add the current line to the chunk
            current_chunk.append(line)
        # Don't forget to add the last chunk
        chunk_text = '\n'.join(current_chunk).strip()
        if chunk_text:
            chunks.append(chunk_text)
        return chunks

# Example usage:
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on)
markdown_text = """
# Header 1
This is some text under header 1.
## Header 2
This is some text under header 2.
### Header 3
This is some text under header 3.
"""
chunks = splitter.split_text(markdown_text)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:\n{chunk}\n---")
This implementation does the following:
- During initialization, it sorts the headers by their length in descending order to ensure that longer (and thus more specific) markdown header patterns are matched first. This is important because, in markdown, headers are differentiated by the number of # characters, and we want to match the most specific header possible.
- It compiles a regular expression that can match any of the specified header patterns at the start of a line.
- The split_text method goes through each line of the input markdown_text, checking for header matches. When it finds a header, it closes the chunk collected so far and starts a new one. This method ensures that each chunk includes its starting header and all subsequent text up until the next matched header, whatever its level.
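The class above keeps each header inside the chunk text. The theory section also mentions adding header information to chunk metadata; the following is a hedged sketch of that idea, not LangChain's implementation. The function name split_markdown_with_metadata and the dictionary layout are assumptions, and it assumes every marker is a run of # characters whose length gives the header depth.

```python
import re


def split_markdown_with_metadata(markdown_text, headers_to_split_on):
    """Split on header lines and record the active headers for each chunk."""
    # Longest marker first so "##" is tried before "#"
    markers = sorted(headers_to_split_on, key=lambda h: len(h[0]), reverse=True)
    names = dict(markers)  # marker -> metadata key, e.g. "##" -> "Header 2"
    pattern = re.compile(
        r"(" + "|".join(re.escape(m) for m, _ in markers) + r")\s+(.*)"
    )

    chunks = []
    active = {}  # marker -> header title currently in effect
    body = []    # non-header lines collected for the current chunk

    def flush():
        content = "\n".join(body).strip()
        if content:
            # Order metadata from the shallowest to the deepest header
            metadata = {names[m]: active[m] for m in sorted(active, key=len)}
            chunks.append({"content": content, "metadata": metadata})
        body.clear()

    for line in markdown_text.split("\n"):
        match = pattern.match(line)
        if match:
            flush()
            marker, title = match.group(1), match.group(2).strip()
            # A new header replaces any header of the same or deeper level
            for old in [m for m in active if len(m) >= len(marker)]:
                del active[old]
            active[marker] = title
        else:
            body.append(line)
    flush()
    return chunks


sample = "# Guide\nIntro text.\n## Setup\nInstall steps.\n# Appendix\nExtra notes."
for chunk in split_markdown_with_metadata(sample, [("#", "Header 1"), ("##", "Header 2")]):
    print(chunk)
```

Here the header lines are moved out of the content and into the metadata, so a chunk under "## Setup" still records that it belongs to "# Guide". A header with no text beneath it produces no chunk of its own in this sketch.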