Relational RAG: How to Process PDFs for Advanced AI Document Intelligence

June 11, 2026 · 14 min read

The landscape of AI-powered document intelligence is rapidly evolving, moving beyond simple keyword searches to sophisticated contextual understanding. This tutorial will guide you through the process of building a cutting-edge Retrieval-Augmented Generation (RAG) system that can intelligently process complex PDF documents by extracting their underlying relational structures, rather than just flat text. By the end, you'll be equipped to unlock deeper insights from your enterprise documents and power more accurate, context-aware AI applications.

Flat text extraction from PDFs often strips away crucial contextual relationships, which can lead AI models to "hallucinate" or give incomplete answers. Relational RAG addresses this by preserving and using the inherent structure of documents, such as tables, lists, and their surrounding descriptive text. Your AI can then work with not just what information is present but also how the pieces relate to each other, which tends to improve answers to complex, multi-part queries.

Introduction to Relational RAG for PDFs

Welcome to a deep dive into advanced PDF processing for AI. In this tutorial, you will learn how to move beyond basic text extraction and embrace a more intelligent approach to integrating PDF content into your RAG systems. We'll explore the limitations of traditional RAG when dealing with the rich, often complex structures found in documents like financial reports, legal contracts, and scientific papers, and introduce you to the power of relational RAG.

Our journey will cover the methodologies and tools required to parse PDFs, identify structured elements such as tables and lists, and reconstruct their inherent relationships. By understanding these connections, your AI models will be able to answer multi-faceted questions, perform sophisticated data analysis, and provide more reliable insights. This tutorial is designed for developers and data scientists who have a basic understanding of Python and the core concepts of Retrieval-Augmented Generation, and are looking to enhance their document intelligence capabilities.

What You'll Learn

  • Understand the fundamental differences between flat text RAG and relational RAG.
  • Identify why traditional PDF parsing falls short for complex AI applications.
  • Master a step-by-step process for extracting structured data from PDFs for AI.
  • Discover key tools and libraries for advanced PDF parsing.
  • Implement techniques to build a RAG system that leverages relational information.
  • Appreciate the significant benefits of relational RAG for enterprise document intelligence.

Prerequisites

  • Basic proficiency in Python programming.
  • Familiarity with foundational RAG concepts (embeddings, vector stores, LLMs).
  • Access to a development environment (e.g., Jupyter Notebook, VS Code).
  • An API key for a service like Unstructured.io (for advanced parsing).

Time Estimate

This tutorial is designed to be completed in approximately 60-90 minutes, depending on your familiarity with the tools and concepts involved. The practical steps will involve setting up your environment, running code snippets, and observing the structured output. Allow additional time for experimentation with your own PDF documents.

What is Relational RAG?

Retrieval-Augmented Generation (RAG) systems enhance Large Language Models (LLMs) by giving them access to external knowledge bases, allowing them to retrieve relevant information before generating a response. Traditionally, this external knowledge is stored as chunks of plain text. When a user asks a question, the RAG system finds the most similar text chunks and feeds them to the LLM as context. While effective for many use cases, this "flat text" approach often struggles with documents that contain rich, interconnected data.
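To make the "find the most similar text chunks" step concrete, here is a minimal sketch of flat-text retrieval. It is an illustration only: a bag-of-words count vector stands in for a learned embedding, the chunks are made-up sample data, and the function names are hypothetical rather than part of any library used later in this tutorial.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase word counts; a crude stand-in for a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, chunks, top_k=1):
    """Rank chunks by similarity to the query and return the best top_k."""
    query_vector = bag_of_words(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vector, bag_of_words(c)), reverse=True)
    return ranked[:top_k]

# Made-up sample chunks; the last one imitates a table flattened to text.
chunks = [
    "The company expanded into three new markets during the year.",
    "Revenue grew because of strong demand for cloud services.",
    "Q1 2023 Q2 2023 1,200 1,350 800 900",
]
top = retrieve("Why did revenue grow?", chunks)
print(top)
```

Note that the flattened table chunk is just a run of numbers: nothing in it says which figure is revenue, which is the gap relational RAG tries to close.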

Relational RAG represents an evolution of this paradigm. Instead of treating all document content as a continuous stream of text, it focuses on identifying and preserving the inherent structure and relationships within the data. This means recognizing elements like tables, lists, figures, and headings, and understanding how they relate to each other and to the surrounding narrative text. For instance, a table might be described by the paragraph immediately preceding it, or a list might elaborate on a concept introduced in a preceding sentence. Relational RAG captures these links, creating a richer, more navigable knowledge graph or structured representation.

The core idea is to provide the LLM with not just the answer, but also the context of where that answer came from within the document's structure, and how it relates to other pieces of information. This could involve linking a specific data point in a table to its column header, row label, and the overall table title, and then further linking that table to the section it belongs to. By doing so, relational RAG enables the LLM to perform more sophisticated reasoning, answer complex multi-hop questions, and significantly reduce the likelihood of factual errors or "hallucinations" that arise from a lack of complete contextual understanding.
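One way to picture this linking is a small data structure in which a table cell can always be returned together with its row label, column header, table title, and section. The sketch below is hypothetical: the class name, field names, and sample values are assumptions for illustration, not the representation used by any particular library.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class TableNode:
    """A table that remembers where it sits in the document (illustrative only)."""
    title: str
    section: str
    columns: List[str]
    rows: Dict[str, List[str]]  # row label -> cell values, aligned with columns

    def cell_with_context(self, row_label: str, column: str) -> Dict[str, str]:
        """Return a cell value along with every label needed to interpret it."""
        value = self.rows[row_label][self.columns.index(column)]
        return {
            "value": value,
            "row": row_label,
            "column": column,
            "table": self.title,
            "section": self.section,
        }

# Sample values are invented for the example.
table = TableNode(
    title="Quarterly Results",
    section="Financial Overview",
    columns=["Q1 2023", "Q2 2023"],
    rows={"Revenue": ["1,200", "1,350"], "Costs": ["800", "900"]},
)
cell = table.cell_with_context("Revenue", "Q1 2023")
print(cell)
```

Passing the whole dictionary to the LLM, rather than the bare string "1,200", is what lets it state which metric, period, and section a figure belongs to.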

Why is Flat Text from PDFs Insufficient for RAG?

PDFs are ubiquitous in enterprise and academic settings, serving as the standard format for reports, contracts, manuals, and research papers. However, their visual fidelity, which makes them excellent for human readability, poses significant challenges for machine processing. When you simply extract all text from a PDF, you often get a jumbled mess that loses crucial layout and structural information. This "flat text" approach to RAG, while seemingly straightforward, is fundamentally insufficient for advanced AI document intelligence for several key reasons.

Firstly, PDFs are not inherently structured for text extraction. They often use absolute positioning for text, meaning that the order in which text appears in the raw PDF stream might not match its logical reading order. Multi-column layouts, images with captions, footnotes, and headers/footers can all contribute to a chaotic text output when simply scraped. More critically, complex elements like tables, which are designed to present structured data, become unreadable strings of numbers and words without their row and column context. A simple text extraction cannot tell you that "1,200" in a financial report refers to "Revenue" for "Q1 2023" if that context is only present in table headers.
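The reading-order problem can be shown with a few positioned text fragments. The sketch below assumes a simple two-column page, coordinates measured from the top-left corner, and a known x position separating the columns; real layout analysis is considerably more involved, and all names and values here are hypothetical.

```python
from typing import List, Tuple

# (x, y, text); y grows downward from the top of the page (an assumption of this sketch)
Fragment = Tuple[float, float, str]

def reading_order(fragments: List[Fragment], column_split_x: float) -> List[str]:
    """Order fragments column by column, then top to bottom within a column."""
    def sort_key(fragment: Fragment):
        x, y, _ = fragment
        column = 0 if x < column_split_x else 1
        return (column, y, x)
    return [text for _, _, text in sorted(fragments, key=sort_key)]

# Fragments in the order they might appear in the raw PDF stream.
stream_order = [
    (320.0, 100.0, "Costs fell"),
    (50.0, 100.0, "Revenue rose"),
    (320.0, 120.0, "by 5%."),
    (50.0, 120.0, "by 12%."),
]

naive = " ".join(text for _, _, text in stream_order)
ordered = " ".join(reading_order(stream_order, column_split_x=300.0))
print(naive)
print(ordered)
```

The naive join interleaves the two columns, while sorting by column and vertical position restores two readable sentences.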

Secondly, the relationships between different pieces of information are lost. Consider a legal document where a specific clause refers to an appendix, or a technical manual where a diagram illustrates a procedure described in the text. With flat text, these connections are broken. An LLM might retrieve a chunk of text describing a procedure, but without the accompanying diagram or the relevant table of specifications, its understanding remains incomplete. This lack of relational context forces the LLM to make educated guesses or, worse, generate factually incorrect information, leading to reduced accuracy and trustworthiness of the RAG system's outputs. For enterprises relying on accurate data from documents, this presents a significant risk.
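As a rough illustration of keeping such cross-references, the sketch below scans chunks for mentions like "Appendix B" or "Table 2" and records which known targets each chunk points to. The regular expression, identifiers, and sample clauses are assumptions made up for this example; production documents would need far more robust reference detection.

```python
import re
from typing import Dict, List

# Matches references such as "Appendix B", "Table 2", or "Figure 10".
REFERENCE_PATTERN = re.compile(r"\b(Appendix|Table|Figure)\s+([A-Z]|\d+)\b")

def find_references(text: str) -> List[str]:
    """Return normalized reference labels mentioned in the text."""
    return [f"{kind} {label}" for kind, label in REFERENCE_PATTERN.findall(text)]

def link_chunks(chunks: Dict[str, str], targets: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each chunk id to the ids of the known targets it mentions."""
    links = {}
    for chunk_id, text in chunks.items():
        links[chunk_id] = [targets[ref] for ref in find_references(text) if ref in targets]
    return links

# Invented sample content.
chunks = {
    "clause_4": "Termination fees are listed in Appendix B and summarized in Table 2.",
    "clause_5": "This clause has no external references.",
}
targets = {"Appendix B": "appendix_b", "Table 2": "table_2"}

links = link_chunks(chunks, targets)
print(links)
```

At retrieval time, a system could use such links to pull in the referenced appendix or table whenever the clause itself is retrieved.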

“Returning flat text from a PDF is like handing an architect a pile of bricks instead of a blueprint. All the raw materials are there, but the crucial instructions on how they connect and form a structure are missing.”

How to Extract Structured Data from PDFs for AI? (Step-by-Step Guide)

Moving beyond flat text extraction requires a systematic approach to identify, parse, and structure the diverse elements within a PDF. This section provides a step-by-step guide to achieving this, focusing on practical implementation using powerful tools. Our goal is to transform raw PDF content into a rich, interconnected dataset that can power sophisticated relational RAG systems.

Step 1: Environment Setup and Dependencies

Before we begin parsing, we need to set up our Python environment and install the necessary libraries. We'll be using unstructured-client for PDF parsing, along with libraries from the LlamaIndex ecosystem and ChromaDB as a local vector store for building our RAG components. Make sure you have a recent Python installed (3.9+ recommended; check each library's current requirements).

Open your terminal or command prompt and run the following commands to install the required packages:

pip install unstructured-client llama-index pandas pydantic lxml
pip install chromadb llama-index-vector-stores-chroma # vector store used in Step 5
pip install python-dotenv # for managing API keys securely

Next, you'll need an API key for Unstructured.io, which provides advanced document parsing capabilities. Sign up on their website to obtain your key. We recommend storing it securely using environment variables or a .env file.

# .env file content
# UNSTRUCTURED_API_KEY="your_unstructured_api_key_here"

import os
from dotenv import load_dotenv

load_dotenv() # Load environment variables from .env file

UNSTRUCTURED_API_KEY = os.getenv("UNSTRUCTURED_API_KEY")
if not UNSTRUCTURED_API_KEY:
    raise ValueError("UNSTRUCTURED_API_KEY not found. Please set it in your .env file.")

# LlamaIndex's default embedding model and the OpenAI LLM used in later steps
# read OPENAI_API_KEY from the environment. If the key is in your .env file,
# load_dotenv() above has already exported it.
# OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

This setup ensures that all dependencies are met and your API keys are handled securely.

Step 2: Choosing the Right Parsing Tool

The cornerstone of relational RAG for PDFs is an intelligent parsing tool capable of understanding document layouts and extracting structured elements. While basic libraries like PyPDF2 or pdfminer.six can extract raw text, they lack the intelligence to discern tables, headings, or lists. For advanced parsing, we turn to tools like Unstructured.io.

Unstructured.io offers a powerful API and open-source components that can intelligently partition PDFs into various element types, such as Title, NarrativeText, ListItem, and crucially, Table. It leverages advanced layout detection and OCR (Optical Character Recognition) when needed, making it suitable for a wide range of document qualities. Other tools like Nougat (for scientific papers, image-based parsing) or LayoutParser (for custom layout detection) exist, but Unstructured.io provides a good balance of capability and ease of use for general enterprise documents.

For this tutorial, we will primarily use the unstructured-client, which simplifies interaction with the Unstructured API. This tool excels at identifying distinct document elements and providing structured representations, especially for tables, which are critical for relational understanding.

Step 3: Ingesting the PDF and Initial Parsing

Now, let's ingest a sample PDF and perform the initial parsing using the Unstructured API. We'll use a sample PDF that contains both narrative text and tables. You can use any PDF with structured data for this step.

First, download a sample PDF, or create a simple one with some text and a table. For example, a simple financial report or a product specification sheet. Let's assume you have a PDF named sample_report.pdf in your working directory.

from unstructured_client import UnstructuredClient
from unstructured_client.models import operations, shared
from unstructured_client.models.errors import SDKError

# NOTE: the SDK surface has changed between releases. The names below follow
# recent unstructured-client versions; check the docs for the one you installed.
client = UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY)

# Path to your PDF file
pdf_path = "sample_report.pdf"
elements = []

try:
    with open(pdf_path, "rb") as f:
        files = shared.Files(
            content=f.read(),
            file_name=pdf_path,
        )

    req = operations.PartitionRequest(
        partition_parameters=shared.PartitionParameters(
            files=files,
            # "hi_res" runs layout detection, which table extraction relies on;
            # "fast" is quicker but skips it
            strategy=shared.Strategy.HI_RES,
        )
    )

    # Make the API call
    res = client.general.partition(request=req)

    # The response contains a list of elements, each a plain dict
    elements = res.elements or []

    print(f"Successfully parsed {len(elements)} elements from {pdf_path}")
    # Print the first few elements for inspection
    for i, element in enumerate(elements[:5]):
        print(f"Element {i}: Type={element['type']}, Text='{element['text'][:100]}...'")

except FileNotFoundError:
    print(f"Error: PDF file not found at {pdf_path}")
except SDKError as e:
    print(f"Error during parsing: {e}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

The elements list contains one entry per distinct part of the document, such as titles, paragraphs, list items, and tables. Each entry is a dictionary with keys like type (e.g., "Title", "NarrativeText", "Table") and text, plus a metadata dictionary that often includes page_number and, when requested, layout coordinates. For tables parsed with the hi_res strategy, metadata typically also carries a text_as_html field holding the table as HTML.
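Before building anything on top of the parser output, it helps to check what actually came back. The sketch below assumes each parsed element is available as a plain dictionary with type, text, and metadata keys; the sample list is invented so the snippet runs without an API call, and the helper name is hypothetical.

```python
from collections import Counter

# Hypothetical sample mimicking parsed elements; not real API output.
sample_elements = [
    {"type": "Title", "text": "Financial Overview", "metadata": {"page_number": 1}},
    {"type": "NarrativeText", "text": "The table below summarizes quarterly revenue.",
     "metadata": {"page_number": 1}},
    {"type": "Table", "text": "Quarter Revenue Q1 2023 1,200",
     "metadata": {"page_number": 1,
                  "text_as_html": "<table><tr><th>Quarter</th><th>Revenue</th></tr>"
                                  "<tr><td>Q1 2023</td><td>1,200</td></tr></table>"}},
    {"type": "Table", "text": "table without structure", "metadata": {"page_number": 2}},
]

# How many elements of each type did the parser find?
type_counts = Counter(element["type"] for element in sample_elements)

def tables_with_html(elements):
    """Keep only Table elements that carry an HTML rendering in their metadata."""
    return [
        element for element in elements
        if element["type"] == "Table" and element.get("metadata", {}).get("text_as_html")
    ]

usable_tables = tables_with_html(sample_elements)
print(type_counts)
print(f"{len(usable_tables)} of {type_counts['Table']} tables have structured HTML")
```

If many tables come back without HTML, that is usually a sign to revisit the parsing strategy before moving on.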

Step 4: Identifying and Extracting Relational Structures

The key to relational RAG is to not just extract elements, but to understand their relationships. We'll focus on tables as a prime example, linking them to their descriptive text. We'll iterate through the parsed elements, identify tables, convert them into a structured format (like Pandas DataFrames), and associate them with preceding narrative text.

import io

import pandas as pd

structured_data_nodes = []
current_section_title = None
current_narrative_context = []

for element in elements:
    # Each element returned by the API is a plain dict
    el_type = element.get("type")
    text = element.get("text", "")
    metadata = element.get("metadata", {})
    page_number = metadata.get("page_number")

    if el_type == "Title":
        current_section_title = text
        current_narrative_context = [] # Reset context for new section
        structured_data_nodes.append({
            "type": "Title",
            "text": text,
            "page_number": page_number,
            "section": current_section_title
        })
    elif el_type in ("NarrativeText", "ListItem"):
        structured_data_nodes.append({
            "type": el_type,
            "text": text,
            "page_number": page_number,
            "section": current_section_title,
            "context_before": " ".join(current_narrative_context) # Text before current element
        })
        current_narrative_context.append(text)
    elif el_type == "Table":
        # Check if the table has structured content (HTML)
        table_html = metadata.get("text_as_html")
        if table_html:
            try:
                # Wrap in StringIO: recent pandas versions deprecate literal HTML strings
                dfs = pd.read_html(io.StringIO(table_html))
                if dfs:
                    df = dfs[0] # Assuming the first table is the main one
                    table_description = " ".join(current_narrative_context) # Context from text before table

                    structured_data_nodes.append({
                        "type": "Table",
                        "text": text, # Raw table text
                        "table_html": table_html,
                        "table_df": df.to_json(orient="split"), # Store DataFrame as JSON string
                        "page_number": page_number,
                        "section": current_section_title,
                        "description": table_description # Link table to its descriptive text
                    })
                    current_narrative_context = [] # Clear context after associating with table
            except ValueError:
                print(f"Could not parse HTML table on page {page_number}")
        else:
            print(f"Table element on page {page_number} has no structured HTML.")

# Print a sample of the structured nodes
for node in structured_data_nodes[:10]:
    print(f"Type: {node['type']}, Section: {node['section']}, Page: {node['page_number']}")
    if node['type'] == 'Table':
        print(f"  Description: {node['description'][:100]}...")
        # print(f"  Table DF (partial): {pd.read_json(io.StringIO(node['table_df']), orient='split').head(2)}")
    else:
        print(f"  Text: {node['text'][:100]}...")

In this step, we're not just extracting text; we're creating a list of "nodes," each representing a meaningful chunk of information with its associated type, page number, and, crucially, contextual links. For tables, we store the content as a Pandas DataFrame (serialized to JSON) and link it to the narrative text that precedes it in the same section, that is, everything collected since the last title or table. This is a heuristic: it works when a table is introduced by the text above it and will mislabel tables whose description follows them.
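If you would rather not depend on pandas and an HTML parser backend for this step, the table HTML can also be turned into row records with the standard library. This is a minimal sketch under simplifying assumptions: the first row is the header, there are no merged cells, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the cell text of each <tr> into a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text pieces of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

def html_table_to_records(html):
    """Treat the first row as the header and return one dict per body row."""
    parser = TableRows()
    parser.feed(html)
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]

# Invented sample table.
html = (
    "<table><tr><th>Metric</th><th>Q1 2023</th></tr>"
    "<tr><td>Revenue</td><td>1,200</td></tr>"
    "<tr><td>Costs</td><td>800</td></tr></table>"
)
records = html_table_to_records(html)
print(records)
```

Each record keeps the column header next to its value, which is the relationship that plain text extraction discards.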

Step 5: Structuring and Storing Relational Data for RAG

Once we have our structured data, the next step is to prepare it for a RAG system. This involves converting our structured nodes into a format that a vector store or a graph database can understand, enabling intelligent retrieval. We'll use LlamaIndex, a powerful framework for building LLM applications, to create custom Document objects that encapsulate our relational information.

import io

import chromadb
import pandas as pd
from llama_index.core import Document, StorageContext, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.vector_stores.chroma import ChromaVectorStore

# Initialize ChromaDB client
db = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = db.get_or_create_collection("relational_rag_collection")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

def drop_none(metadata):
    """Remove None values, which some vector stores refuse to store as metadata."""
    return {key: value for key, value in metadata.items() if value is not None}

# Create LlamaIndex Document objects from our structured nodes
llama_documents = []
for node in structured_data_nodes:
    if node['type'] == 'Table':
        # For tables, we create a document with the table's description and its content
        table_df = pd.read_json(io.StringIO(node['table_df']), orient='split')
        table_text_representation = f"Table Description: {node['description']}\nTable Content:\n{table_df.to_string()}"

        llama_documents.append(Document(
            text=table_text_representation,
            metadata=drop_none({
                "type": "Table",
                "page_number": node['page_number'],
                "section": node['section'],
                "description": node['description'],
                "table_json": node['table_df'] # Store original JSON for potential direct query
            }),
            # These fields are already in the text (or are bulky), so keep them out of
            # the metadata string that gets embedded and sent to the LLM
            excluded_embed_metadata_keys=["description", "table_json"],
            excluded_llm_metadata_keys=["description", "table_json"],
        ))
    else:
        # For text elements, we create a document with the text and its metadata
        llama_documents.append(Document(
            text=node['text'],
            metadata=drop_none({
                "type": node['type'],
                "page_number": node['page_number'],
                "section": node['section'],
                "context_before": node.get('context_before', '')
            })
        ))

# Optional: Split long text documents into smaller chunks for better retrieval
# node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=20)
# nodes = node_parser.get_nodes_from_documents(llama_documents)

# Create an index from the documents/nodes.
# Embeddings default to OpenAI, so OPENAI_API_KEY must be set unless you
# configure a different embedding model.
index = VectorStoreIndex.from_documents(
    llama_documents,
    storage_context=storage_context,
    # show_progress=True
)

print(f"Indexed {len(llama_documents)} documents into the vector store.")

Here, each significant piece of information (a paragraph, a list, a table) becomes a Document or Node in LlamaIndex. Critically, we embed not just the raw text of the table, but also its descriptive context. This allows our RAG system to retrieve tables based on semantic queries about their content or purpose. The metadata fields are vital for providing the LLM with additional context during retrieval and generation, enabling it to understand the origin and nature of the retrieved information.
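The listing above embeds each table as one block of text. An alternative worth experimenting with is to serialize every cell as a short, self-describing sentence, so that a query about one figure can match a small chunk that names its row, column, and table. The sketch below is an assumption-laden illustration, not part of LlamaIndex: the function name, sentence template, and sample records are all hypothetical.

```python
def rows_to_sentences(description, records, label_column):
    """Turn table records into one sentence per cell, each carrying its own context."""
    sentences = []
    for record in records:
        label = record[label_column]
        for column, value in record.items():
            if column == label_column:
                continue  # the label column names the row; it is not a value
            sentences.append(f"{label} for {column} is {value} ({description}).")
    return sentences

# Invented sample records, one dict per table row.
records = [
    {"Metric": "Revenue", "Q1 2023": "1,200", "Q2 2023": "1,350"},
    {"Metric": "Costs", "Q1 2023": "800", "Q2 2023": "900"},
]
sentences = rows_to_sentences("Quarterly results", records, label_column="Metric")
for sentence in sentences:
    print(sentence)
```

The trade-off is more, smaller chunks to index; for wide tables it may be better to emit one sentence per row instead.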

Step 6: Building the Relational RAG System (Querying Structured Data)

With our structured data indexed, we can now build a query engine that leverages this relational understanding. Instead of just retrieving arbitrary text chunks, our system can now intelligently fetch tables, their descriptions, and related narrative, allowing for more precise answers to complex questions.

from llama_index.core import get_response_synthesizer
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.llms.openai import OpenAI

# Configure OpenAI LLM for LlamaIndex (requires OPENAI_API_KEY in the environment)
llm = OpenAI(model="gpt-4o", temperature=0.1)

# Create a basic query engine from our index
base_query_engine = index.as_query_engine(llm=llm, similarity_top_k=3)

# Define a tool for general document querying
general_doc_tool = QueryEngineTool(
    query_engine=base_query_engine,
    metadata=ToolMetadata(
        name="general_document_qa",
        description="Useful for answering general questions about the document, extracting facts, or summarizing content."
    ),
)

# For more advanced relational queries, we can create specific tools or use sub-question engines.
# Example: A tool specifically for querying tables (if we had a separate table-specific index)
# For simplicity here, we'll rely on the general_document_qa tool to retrieve table-related content
# because we've enriched the table documents with descriptions.

# You can also use a SubQuestionQueryEngine for multi-hop questions
# If a question requires combining info from multiple parts (e.g., "What was the revenue in Q1 2023, and what strategy was mentioned to improve it?"),
# a SubQuestionQueryEngine can break it down.
query_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=[general_doc_tool],
    llm=llm,
    response_synthesizer=get_response_synthesizer(llm=llm)
)

# Example Queries
query_1 = "What are the key financial figures for Q1 2023 mentioned in the report?"
query_2 = "Summarize the main points discussed in the 'Executive Summary' section."
query_3 = "According to the 'Product Performance' section, what was the sales growth for Product X, and what factors contributed to it?"

print(f"\nQuery: {query_1}")
response_1 = query_engine.query(query_1)
print(f"Response: {response_1}\n")

print(f"Query: {query_2}")
response_2 = query_engine.query(query_2)
print(f"Response: {response_2}\n")

print(f"Query: {query_3}")
response_3 = query_engine.query(query_3)
print(f"Response: {response_3}\n")

The SubQuestionQueryEngine is particularly powerful for relational RAG. It can break down complex user queries into smaller sub-questions, execute them against relevant tools (which in our case are query engines over our structured data), and then synthesize a final answer. By embedding tables with their descriptions, the vector store can accurately retrieve the right table when asked a question about its content, and the LLM can then parse the table data (which is included in the retrieved context) to extract precise answers, significantly enhancing the accuracy and depth of responses.
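The decompose, answer, and synthesize flow can be sketched without any LLM at all. The toy below is not how LlamaIndex implements it (the real engine asks an LLM to generate sub-questions and queries the index for each one); the splitting rule, keyword lookup, and stored answers are all hypothetical stand-ins chosen so the control flow is easy to follow.

```python
def decompose(question):
    """Split a compound question on ', and ' (a naive, illustrative rule)."""
    parts = [part.strip(" ?") for part in question.split(", and ")]
    return [part + "?" for part in parts if part]

def answer_sub_question(sub_question, knowledge):
    """Return the first stored answer whose keyword appears in the sub-question."""
    lowered = sub_question.lower()
    for keyword, answer in knowledge.items():
        if keyword in lowered:
            return answer
    return "No relevant context found."

def synthesize(sub_questions, answers):
    """Combine the partial answers into a single response."""
    return " ".join(f"{q} {a}" for q, a in zip(sub_questions, answers))

# Invented stand-in for what retrieval over the index would return.
knowledge = {
    "revenue": "Revenue in Q1 2023 was 1,200.",
    "strategy": "The report mentions expanding into new markets.",
}

question = "What was the revenue in Q1 2023, and what strategy was mentioned to improve it?"
sub_questions = decompose(question)
answers = [answer_sub_question(q, knowledge) for q in sub_questions]
final_answer = synthesize(sub_questions, answers)
print(final_answer)
```

In the real system, the first sub-question would most likely be answered from a table document and the second from narrative text, which is why indexing both with their context matters.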

Tools for Advanced PDF Parsing in AI

The choice of PDF parsing tool is paramount for successful relational RAG. While many libraries exist, they vary significantly in their capabilities, especially when moving beyond simple text extraction. Here, we compare some prominent tools, highlighting their strengths for advanced AI document parsing.

Unstructured.io (API & Open Source Components): Unstructured is a leading solution for advanced document parsing. Its API and underlying open-source libraries (like unstructured and unstructured-inference) are designed to handle complex layouts, extract diverse element types (text, tables, lists, images), and infer hierarchical structures.

  • Strengths: High accuracy for layout detection, robust table extraction (including conversion to HTML/CSV), support for various document types (PDFs, Word, PPTX, HTML), OCR integration, and a unified API for programmatic access. Excellent for enterprise-grade document intelligence.
  • Use Case: Extracting structured data from financial reports, legal documents, technical manuals, and general business documents for RAG and analytics.

Nougat (Neural Optical Understanding for Academic Documents): Developed by Meta AI, Nougat is a vision-transformer model specifically designed to convert scientific papers (PDFs) into Markdown format. It excels at preserving complex academic layouts, including equations, figures, and multi-column text.

  • Strengths: State-of-the-art for scientific literature, preserves formatting and structure well, image-based approach handles complex visual layouts.
  • Use Case: Processing research papers, academic journals, and arXiv documents where precise structural and semantic preservation is critical.

LayoutParser: LayoutParser is a Python library for deep learning-based document image analysis. It provides a collection of pre-trained models and a toolkit for customizing layout detection, making it highly flexible for research and specific use cases.

  • Strengths: Highly customizable, allows users to train their own layout models, supports a wide range of layout elements, strong community and research backing.
  • Use Case: Researchers or developers needing fine-grained control over layout detection, working with unique document types, or building custom parsing pipelines.

PyPDF2 / pdfminer.six: These are foundational Python libraries for interacting with PDFs (PyPDF2 has since been superseded by pypdf). They provide basic functionalities like extracting raw text, metadata, and managing pages.

  • Strengths: Lightweight, good for basic text extraction, page manipulation, and metadata retrieval. Open-source and widely used.
  • Limitations: Little or no understanding of document structure (pdfminer.six offers basic layout analysis, but tables, headings, and lists are not returned as typed elements), no OCR capabilities, struggle with complex or scanned PDFs.
  • Use Case: Simple text search, page splitting/merging, extracting basic document properties. Insufficient for relational RAG.

Comparison Table: Basic vs. Advanced PDF Parsing

| Feature | PyPDF2 / pdfminer.six (Basic) | Unstructured.io (Advanced) | Nougat (Specialized Advanced) | LayoutParser (Customizable Advanced) |
|---|---|---|---|---|
| Text Extraction | Raw, often jumbled | Logical reading order, element-aware | Structured Markdown | Element-aware, customizable |
| Layout Understanding | None | High (detects paragraphs, headings, lists) | Very High (academic layouts) | Customizable High |
| Table Extraction | Poor (raw text only) | Excellent (structured HTML/CSV) | Good (Markdown tables) | Requires custom model |
| OCR Support | No | Yes (integrated) | Yes (core functionality) | Can integrate |
| Structural Output | Flat text string | JSON list of typed elements | Markdown | Customizable JSON/XML |