2.2 LangChain Document Loaders
For LLM-powered data apps and conversational interfaces, it is critical to load data efficiently from diverse sources and normalize it into a common form. In the LangChain ecosystem, "loaders" are components that extract information from websites, databases, and media files and convert it into a standard document object with content and metadata. Dozens of formats (PDF, HTML, JSON, etc.) and sources are supported, from public ones (YouTube, Twitter, Hacker News) to enterprise tools (Figma, Notion). There are also loaders for tabular and service data (Airbyte, Stripe, Airtable, and more), enabling semantic search and QA not only over unstructured data but also over strictly structured datasets. This modularity lets you build targeted pipelines: sometimes it is enough to load and clean text; other times you will automatically create embeddings, extract entities, aggregate, and summarize.
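Whatever the source, every loader returns the same shape of object: a document with page_content (the extracted text) and metadata (source path, page number, and so on). As a minimal sketch, assuming the Document class from langchain.schema and the lecture file used later in this section, a loaded page looks roughly like this:
from langchain.schema import Document
# Illustrative only: a hand-built document mirroring what a loader returns
doc = Document(
    page_content="Welcome to Lecture 1. Today we cover the basics...",
    metadata={"source": "docs/lecture_series/Lecture01.pdf", "page": 0},
)
print(doc.page_content[:40])
print(doc.metadata)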
We start with basic environment prep: install dependencies, configure API keys, and read them from .env for safe access to external data.
# Install required packages (they may already be present in your environment)
# !pip install langchain python-dotenv
import os
from dotenv import load_dotenv, find_dotenv
# Load environment variables from .env
load_dotenv(find_dotenv())
# Fetch the OpenAI API key from the environment
openai_api_key = os.environ['OPENAI_API_KEY']
A common scenario is working with PDFs. The example below shows how to load a document (e.g., a lecture transcript), clean and tokenize text, count word frequencies, and save a cleaned version for later analysis; we explicitly handle and log empty pages, and metadata is available for spot checks.
from langchain.document_loaders import PyPDFLoader
import re
from collections import Counter
# Initialize PDF loader with the path to the document
pdf_loader = PyPDFLoader("docs/lecture_series/Lecture01.pdf")
# Load the document pages
document_pages = pdf_loader.load()
# Clean and tokenize
def clean_and_tokenize(text):
    # Keep only alphabetic words (lower-cased) and split into tokens
    words = re.findall(r'\b[a-z]+\b', text.lower())
    return words
word_frequencies = Counter()
for page_index, page in enumerate(document_pages):
    if page.page_content.strip():
        words = clean_and_tokenize(page.page_content)
        word_frequencies.update(words)
    else:
        print(f"Empty page found at index {page_index}")
print("Most frequent words:")
for word, freq in word_frequencies.most_common(10):
    print(f"{word}: {freq}")
# Inspect metadata of the first page
first_page_metadata = document_pages[0].metadata
print("\nFirst page metadata:")
print(first_page_metadata)
# Optionally save cleaned text to a file
with open("cleaned_lecture_series_lecture01.txt", "w") as text_file:
    for page in document_pages:
        if page.page_content.strip():
            cleaned_text = ' '.join(clean_and_tokenize(page.page_content))
            text_file.write(cleaned_text + "\n")
Video is equally important. We can fetch YouTube audio, transcribe it with Whisper via LangChain, and begin analysis immediately: split into sentences, assess sentiment (polarity and subjectivity) with TextBlob, and optionally add entity extraction, key-phrase detection, and summarization.
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import OpenAIWhisperParser
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader
from nltk.tokenize import sent_tokenize
from textblob import TextBlob
import os
import nltk
nltk.download('punkt')
video_url = "https://www.youtube.com/watch?v=example_video_id"
audio_save_directory = "docs/youtube/"
os.makedirs(audio_save_directory, exist_ok=True)
youtube_loader = GenericLoader(
    YoutubeAudioLoader([video_url], audio_save_directory),
    OpenAIWhisperParser()
)
youtube_documents = youtube_loader.load()
transcribed_text = youtube_documents[0].page_content[:500]
print(transcribed_text)
sentences = sent_tokenize(transcribed_text)
print("\nFirst 5 sentences:")
for sentence in sentences[:5]:
    print(sentence)
sentiment = TextBlob(transcribed_text).sentiment
print("\nSentiment:")
print(f"Polarity: {sentiment.polarity}, Subjectivity: {sentiment.subjectivity}")
# Polarity: float in [-1.0, 1.0] (negative to positive)
# Subjectivity: float in [0.0, 1.0] (objective to subjective)
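Key-phrase detection, mentioned above as an optional step, can be sketched with the same TextBlob object: treat the most frequent noun phrases in the transcript as rough key-phrase candidates. This assumes the TextBlob corpora are installed (python -m textblob.download_corpora).
from collections import Counter
# Rough key-phrase candidates: the most common noun phrases in the transcript
# (assumes TextBlob corpora are available: python -m textblob.download_corpora)
noun_phrases = TextBlob(transcribed_text).noun_phrases
phrase_counts = Counter(noun_phrases)
print("\nCandidate key phrases:")
for phrase, count in phrase_counts.most_common(5):
    print(f"{phrase}: {count}")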
For web content, we load a page by URL, clean the HTML, extract links and headings, then do a simple summary: sentence tokenization, stop‑word filtering, frequency analysis, and a brief digest.
from langchain.document_loaders import WebBaseLoader
from bs4 import BeautifulSoup
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk import download
download('punkt')
download('stopwords')
web_loader = WebBaseLoader("https://example.com/path/to/document")
# load() returns documents whose page_content is already plain text extracted from the page
web_documents = web_loader.load()
# For tag-level parsing (links, headings) we also need the raw HTML, which scrape() returns as a soup
soup = web_loader.scrape()
for script_or_style in soup(["script", "style"]):
    script_or_style.decompose()
clean_text = ' '.join(soup.stripped_strings)
print(clean_text[:500])
links = [(a.text, a['href']) for a in soup.find_all('a', href=True)]
print("\nExtracted links:")
for text, href in links[:5]:
    print(f"{text}: {href}")
headings = [h1.text for h1 in soup.find_all('h1')]
print("\nHeadings found:")
for heading in headings:
    print(heading)
sentences = sent_tokenize(clean_text)
stop_words = set(stopwords.words("english"))
filtered_sentences = [' '.join([w for w in s.split() if w.lower() not in stop_words]) for s in sentences]
word_freq = FreqDist(w.lower() for s in filtered_sentences for w in s.split())
print("\nMost frequent words:")
for word, frequency in word_freq.most_common(5):
    print(f"{word}: {frequency}")
print("\nContent summary:")
for sentence in sentences[:5]:
    print(sentence)
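The digest above simply prints the first few sentences. A slightly better sketch, reusing the word_freq distribution and stop_words already computed, scores each sentence by the frequency of its content words and keeps the top three:
# Score each sentence by the cumulative frequency of its non-stop words
def score_sentence(sentence):
    return sum(word_freq[word.lower()] for word in sentence.split() if word.lower() not in stop_words)

top_sentences = sorted(sentences, key=score_sentence, reverse=True)[:3]
print("\nFrequency-weighted summary:")
for sentence in top_sentences:
    print(sentence)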
Structured Notion exports are also easy to process: load Markdown files, convert to HTML for convenient parsing, extract headings and links, put metadata and parsed content into a DataFrame, filter (e.g., by a keyword in the title), and, if present, compute category breakdowns.
from langchain.document_loaders import NotionDirectoryLoader
import markdown
from bs4 import BeautifulSoup
import pandas as pd
notion_directory = "docs/Notion_DB"
notion_loader = NotionDirectoryLoader(notion_directory)
notion_documents = notion_loader.load()
print(notion_documents[0].page_content[:200])
print(notion_documents[0].metadata)
html_content = [markdown.markdown(doc.page_content) for doc in notion_documents]
parsed_data = []
for content in html_content:
    soup = BeautifulSoup(content, 'html.parser')
    headings = [heading.text for heading in soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])]
    links = [(a.text, a['href']) for a in soup.find_all('a', href=True)]
    parsed_data.append({'headings': headings, 'links': links})
df = pd.DataFrame({
    'metadata': [doc.metadata for doc in notion_documents],
    'parsed_content': parsed_data
})
keyword = 'Project'
filtered_docs = df[df['metadata'].apply(lambda x: keyword.lower() in x.get('title', '').lower())]
print("\nDocuments with the keyword in the title:")
print(filtered_docs)
# Example category summary (if categories exist in metadata)
if 'category' in df['metadata'].iloc[0]:
    category_counts = df['metadata'].apply(lambda x: x['category']).value_counts()
    print("\nDocuments by category:")
    print(category_counts)
When working with loaders, keep an eye on external API costs (e.g., Whisper transcription) and batch or cache calls where possible; normalize data immediately after loading (cleaning, chunking, and so on; a brief chunking sketch follows below); and if a loader for your source is missing, consider contributing one to the LangChain open-source project. Keep the docs handy for guidance: LangChain (https://github.com/langchain-ai/langchain) and OpenAI Whisper (https://github.com/openai/whisper). This groundwork lays the foundation for more advanced processing and for integrating your data into LLM applications.
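A minimal chunking sketch, assuming RecursiveCharacterTextSplitter from langchain.text_splitter and reusing the document_pages loaded in the PDF example (the chunk sizes are illustrative):
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Split loaded pages into overlapping chunks sized for embedding and retrieval
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
chunks = text_splitter.split_documents(document_pages)
print(f"Pages: {len(document_pages)}, chunks: {len(chunks)}")
print(chunks[0].page_content[:200])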
Theory Questions
- What are document loaders in LangChain and what role do they play?
- How do loaders for unstructured data differ from those for structured data?
- How do you prepare the environment for loaders (packages, API keys, .env)?
- How does PyPDFLoader work and what does PDF pre-processing give you?
- Why clean and tokenize text when processing PDFs?
- How do you transcribe a YouTube video with Whisper via LangChain?
- How do you apply sentence tokenization and sentiment analysis to a transcript?
- How do you load and process web content with WebBaseLoader?
- How do you extract and summarize page content by URL?
- How does NotionDirectoryLoader help analyze Notion exports?
- What practices matter when using loaders (cost awareness, pre-processing)?
- Why and how can you contribute new loaders to LangChain?
Practical Tasks
- Modify the PDF analysis to ignore stop words (nltk.corpus.stopwords); print the top-5 most frequent non-stop words.
- Write a function that transcribes a YouTube URL (Whisper) and returns the first 100 words; include error handling.
- Create a script: load a page by URL, strip HTML tags, and print clean text (use BeautifulSoup).
- For a Notion export directory: convert Markdown to HTML, extract and print all links (text + href).
- Extend the YouTube transcription with TextBlob sentiment: print polarity and a coarse label (positive/neutral/negative).
- Build a DataFrame from Notion documents, add a “word count” column, and print titles of the three longest docs.
- For a given URL: load the page, extract the main text, and print a simple summary (first and last sentences).