PageIndex: Document Index System for Reasoning-Based RAG

Frustrated with vector database retrieval accuracy for long professional documents? You need a reasoning-based native index for your RAG system.

What is PageIndex?

PageIndex transforms lengthy documents into semantic tree structures that capture each document's hierarchical organization for agentic RAG.

Hierarchical Tree Structure
PageIndex creates an LLM-friendly "table of contents" for your document, enabling LLM agents to navigate and comprehend complex documents efficiently.
Node Summary with Precise Page Referencing
Each node carries its own summary and physical page range (start_index and end_index), so agents can pinpoint and extract the most relevant information with exact page references.
Designed for Long Documents
PageIndex is designed for long documents, such as financial reports and thousand-page textbooks, even when they exceed the context window of most LLMs.
PageIndex.json
...
{
    "title": "Financial Stability",
    "node_id": "0006",
    "start_index": 21, 
    "end_index": 22,
    "summary": "The Federal Reserve ...",
    "child_nodes": [
        {
            "title": "Monitoring Financial Vulnerabilities",
            "node_id": "0007",
            "start_index": 22,
            "end_index": 28,
            "summary": "The Federal Reserve's monitoring ..."
        },
        {
            "title": "Domestic and International Cooperation and Coordination",
            "node_id": "0008",
            "start_index": 28,
            "end_index": 31,
            "summary": "In 2023, the Federal Reserve collaborated ..."
        }
    ]
},
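
The node layout above lends itself to a simple recursive walk. The following is a minimal sketch, not part of PageIndex itself: the helper names `iter_nodes`, `find_node`, and `page_range` are hypothetical, and it assumes `end_index` is inclusive, which the excerpt does not state.

```python
from typing import Iterator, Optional

# Sample node mirroring the PageIndex.json excerpt (summaries elided as in the source).
tree = {
    "title": "Financial Stability",
    "node_id": "0006",
    "start_index": 21,
    "end_index": 22,
    "summary": "The Federal Reserve ...",
    "child_nodes": [
        {
            "title": "Monitoring Financial Vulnerabilities",
            "node_id": "0007",
            "start_index": 22,
            "end_index": 28,
            "summary": "The Federal Reserve's monitoring ...",
        },
        {
            "title": "Domestic and International Cooperation and Coordination",
            "node_id": "0008",
            "start_index": 28,
            "end_index": 31,
            "summary": "In 2023, the Federal Reserve collaborated ...",
        },
    ],
}


def iter_nodes(node: dict) -> Iterator[dict]:
    """Yield a node and all of its descendants, depth first."""
    yield node
    for child in node.get("child_nodes", []):
        yield from iter_nodes(child)


def find_node(root: dict, node_id: str) -> Optional[dict]:
    """Return the node with the given node_id, or None if it is absent."""
    for node in iter_nodes(root):
        if node["node_id"] == node_id:
            return node
    return None


def page_range(node: dict) -> range:
    """Pages covered by a node, treating end_index as inclusive (an assumption)."""
    return range(node["start_index"], node["end_index"] + 1)


node = find_node(tree, "0007")
print(node["title"], list(page_range(node)))
```

An agent that receives a list of node IDs can use a lookup like this to decide which physical pages to load into the model's context.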

Agentic RAG

Build Agentic RAG with PageIndex

PageIndex makes it easy to build an agentic RAG system.

No Vector DB Required
PageIndex generates a tree structure that can be stored in traditional databases, eliminating the need for specialized Vector DB infrastructure and reducing complexity.
No Hard Chunking Required
Unlike conventional approaches, PageIndex intelligently segments documents along natural content boundaries, preserving context and improving retrieval quality.
Deep Research Ready
The hierarchical tree structure enables sophisticated exploration of lengthy private documents, facilitating nuanced research and comprehensive analysis.
Easy to Implement
Integrate PageIndex into your existing agent pipeline with minimal code changes: a simple prompt is all you need to get started.
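
To illustrate the "No Vector DB Required" point above, here is a minimal sketch of storing a tree in a traditional database. It assumes SQLite and a hypothetical `nodes` table with one row per node. The schema and the helper names `flatten`, `save_tree`, and `children_of` are illustrative choices, not something PageIndex prescribes.

```python
import sqlite3


def flatten(node, parent_id=None):
    """Yield one row per node: (node_id, parent_id, title, start, end, summary)."""
    yield (
        node["node_id"],
        parent_id,
        node["title"],
        node["start_index"],
        node["end_index"],
        node.get("summary", ""),
    )
    for child in node.get("child_nodes", []):
        yield from flatten(child, node["node_id"])


def save_tree(conn, root):
    """Create the (hypothetical) nodes table if needed and store the whole tree."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS nodes (
               node_id TEXT PRIMARY KEY,
               parent_id TEXT,
               title TEXT,
               start_index INTEGER,
               end_index INTEGER,
               summary TEXT)"""
    )
    conn.executemany(
        "INSERT OR REPLACE INTO nodes VALUES (?, ?, ?, ?, ?, ?)", flatten(root)
    )
    conn.commit()


def children_of(conn, node_id):
    """Return (node_id, title, start_index, end_index) for each direct child."""
    cursor = conn.execute(
        "SELECT node_id, title, start_index, end_index "
        "FROM nodes WHERE parent_id = ? ORDER BY node_id",
        (node_id,),
    )
    return cursor.fetchall()


# Trimmed sample tree in the PageIndex.json shape.
tree = {
    "title": "Financial Stability", "node_id": "0006",
    "start_index": 21, "end_index": 22, "summary": "The Federal Reserve ...",
    "child_nodes": [
        {"title": "Monitoring Financial Vulnerabilities", "node_id": "0007",
         "start_index": 22, "end_index": 28,
         "summary": "The Federal Reserve's monitoring ..."},
        {"title": "Domestic and International Cooperation and Coordination",
         "node_id": "0008", "start_index": 28, "end_index": 31,
         "summary": "In 2023, the Federal Reserve collaborated ..."},
    ],
}

conn = sqlite3.connect(":memory:")
save_tree(conn, tree)
print(children_of(conn, "0006"))
```

Storing a `parent_id` on each row keeps the hierarchy queryable with plain SQL. Storing the whole tree as a single JSON column would be an equally reasonable choice for small documents.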
Example Prompt for Agentic Retrieval

# `question` is the user's question; `structure` is the PageIndex tree (for example, a JSON string).
prompt = f"""
    You are given a question and a tree structure of a document.
    You need to find all nodes that are likely to contain the answer to the question.

    Question: {question}

    Document tree structure: {structure}

    Reply in the following JSON format:
    {{
        "thinking": <which nodes are likely to contain the answer to the question?>,
        "node_list": [node_id1, node_id2, ...]
    }}

    Reply directly in the JSON format. Do not include any other information in the reply.
    """
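
The prompt above can be wired into a small retrieval step. The sketch below rests on several assumptions: `llm` is any callable that takes a prompt string and returns the model's text reply, `fake_llm` is a canned stand-in so the example runs without a model, and `build_prompt`, `known_ids`, and `select_nodes` are hypothetical helper names. It parses the JSON reply and keeps only node IDs that actually exist in the tree.

```python
import json
from typing import Callable, List, Set


def build_prompt(question: str, structure: dict) -> str:
    """Fill the example prompt with a question and a JSON-serialized tree."""
    return f"""
You are given a question and a tree structure of a document.
You need to find all nodes that are likely to contain the answer to the question.

Question: {question}

Document tree structure: {json.dumps(structure, indent=2)}

Reply in the following JSON format:
{{
    "thinking": <which nodes are likely to contain the answer to the question?>,
    "node_list": [node_id1, node_id2, ...]
}}

Reply directly in the JSON format. Do not include any other information in the reply.
"""


def known_ids(node: dict) -> Set[str]:
    """Collect every node_id in the tree."""
    ids = {node["node_id"]}
    for child in node.get("child_nodes", []):
        ids |= known_ids(child)
    return ids


def select_nodes(question: str, structure: dict, llm: Callable[[str], str]) -> List[str]:
    """Ask the model for relevant nodes and drop IDs that are not in the tree."""
    reply = llm(build_prompt(question, structure))
    parsed = json.loads(reply)
    valid = known_ids(structure)
    return [str(n) for n in parsed.get("node_list", []) if str(n) in valid]


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; returns a canned reply with one bogus ID."""
    return json.dumps({"thinking": "Section 0007 covers monitoring.",
                       "node_list": ["0007", "9999"]})


# Trimmed sample tree in the PageIndex.json shape.
tree = {
    "title": "Financial Stability", "node_id": "0006",
    "start_index": 21, "end_index": 22,
    "child_nodes": [
        {"title": "Monitoring Financial Vulnerabilities", "node_id": "0007",
         "start_index": 22, "end_index": 28},
        {"title": "Domestic and International Cooperation and Coordination",
         "node_id": "0008", "start_index": 28, "end_index": 31},
    ],
}

print(select_nodes("How does the Fed monitor financial vulnerabilities?", tree, fake_llm))
```

Filtering against the known IDs is a defensive choice: a model can return an ID that does not exist, and a real pipeline would also need to handle replies that are not valid JSON.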
  

RAG Comparison

PageIndex vs Vector DB

Choose the right RAG technique for your task.

Feature: Accuracy
PageIndex: Reasoning-Based Retrieval. Relies on logical reasoning to retrieve the right node, making it ideal for domain-specific data where much of the content is semantically similar.
Vector DB: Semantic Similarity-Based Retrieval. Relies on semantic similarity, which can be unreliable for domain data where all content has similar semantics.

Feature: Resource Requirements
PageIndex: Hassle-Free Implementation. Only requires storing a tree structure in a classic database. No vector database required.
Vector DB: Complex Infrastructure. Requires setting up and maintaining a vector database, which adds complexity and cost.

Feature: Knowledge Integration
PageIndex: Flexible Integration. Easily integrates with expert knowledge and user preferences through the hierarchical structure.
Vector DB: Requires Model Updates. Requires fine-tuning embedding models to incorporate new knowledge or preferences.

Feature: Efficiency
PageIndex: Slower Retrieval. Retrieval is slightly slower, but results remain accurate for complex domain-specific queries.
Vector DB: Faster Retrieval. Offers faster retrieval speeds, making it efficient for applications where quick responses are critical.

Feature: Ideal Use Cases
PageIndex: Professional Document Analysis
  • Financial reports and SEC filings
  • Regulatory and compliance documents
  • Academic and scientific textbooks
  • Legal contracts and case law
  • Technical manuals and documentation
Vector DB: Creative & General Applications
  • Semantic recommendation systems
  • Creative writing and ideation tools
  • Short passage retrieval
  • Multi-modal retrieval
  • General knowledge question answering

Case Study

PageIndex Powers Mafin 2.5: Industry-Leading Financial Document Analysis

PageIndex forms the foundation of Mafin 2.5, a leading RAG model for financial report analysis, which achieves 98.7% accuracy on FinanceBench. Unlike traditional RAG systems that rely on vector similarity, PageIndex's hierarchical structure enables precise navigation through complex financial documents, delivering high accuracy in SEC filing analysis and financial question answering.

FinanceBench accuracy:

30%: RAG with Vector DB. One vector index for all the documents.

50%: RAG with Vector DB. One vector index for each document.

98.7%: RAG with PageIndex. Query-to-SQL for document-level retrieval, PageIndex for node-level retrieval.

The Vector DB results are taken from the FinanceBench paper.

PageIndex Best Practices