PageIndex: Document Index System for Reasoning-Based RAG

Frustrated with vector database retrieval accuracy for long professional documents? You need a reasoning-native index for your RAG system.

What is PageIndex?

PageIndex can transform lengthy documents into semantic tree structures, capturing the hierarchical organization of your document for reasoning-based RAG.

Hierarchical Tree Structure
PageIndex creates an LLM-friendly "table of contents" for your document, enabling LLM agents to efficiently navigate and comprehend complex documents.
Chunk-Free Segmentation
No arbitrary chunking. Nodes follow the natural structure of the document.
Node Summary with Precise Page Referencing
Each node contains its own summary and the physical page range it covers (start_index and end_index), allowing agents to pinpoint and extract the most relevant information with exact page references.
Designed for Long Documents
PageIndex is specifically designed to handle long documents such as financial reports, legal documents, and technical manuals, even when they exceed the context window limits of LLMs.
PageIndex.json
...
{
    "title": "Financial Stability",
    "node_id": "0006",
    "start_index": 21, 
    "end_index": 22,
    "summary": "The Federal Reserve ...",
    "nodes": [
        {
            "title": "Monitoring Financial Vulnerabilities",
            "node_id": "0007",
            "start_index": 22,
            "end_index": 28,
            "summary": "The Federal Reserve's monitoring ..."
        },
        {
            "title": "Domestic and International Cooperation and Coordination",
            "node_id": "0008",
            "start_index": 28,
            "end_index": 31,
            "summary": "In 2023, the Federal Reserve collaborated ..."
        }
    ]
},
...
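
As a rough illustration of how an agent or script might consume a tree of this shape, the sketch below walks the nodes in document order and prints an outline with page ranges. The field names come from the sample above; the summary strings are placeholders, and the helper names iter_nodes and outline are hypothetical, not part of PageIndex.

```python
import json

# Fragment in the shape of the sample above; summaries are placeholders.
TREE_JSON = """
{
  "title": "Financial Stability",
  "node_id": "0006",
  "start_index": 21,
  "end_index": 22,
  "summary": "placeholder",
  "nodes": [
    {"title": "Monitoring Financial Vulnerabilities", "node_id": "0007",
     "start_index": 22, "end_index": 28, "summary": "placeholder"},
    {"title": "Domestic and International Cooperation and Coordination",
     "node_id": "0008", "start_index": 28, "end_index": 31,
     "summary": "placeholder"}
  ]
}
"""


def iter_nodes(node, depth=0):
    """Yield (depth, node) pairs in document order (pre-order traversal)."""
    yield depth, node
    for child in node.get("nodes", []):
        yield from iter_nodes(child, depth + 1)


def outline(tree):
    """Render the tree as an indented table of contents with page ranges."""
    lines = []
    for depth, node in iter_nodes(tree):
        pages = f"pp. {node['start_index']}-{node['end_index']}"
        lines.append(f"{'  ' * depth}[{node['node_id']}] {node['title']} ({pages})")
    return lines


tree = json.loads(TREE_JSON)
for line in outline(tree):
    print(line)
```

Because every node carries its own page range, the same traversal can also be used to collect the pages to load for any selected node.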

Beyond Semantic Similarity

Reasoning-Based RAG with PageIndex

Vector-based RAG relies on semantic similarity, so it often returns results that are loosely related to the query but contextually off-target. It ignores document structure and can produce unreliable retrievals in specialized domains.

Reasoning-based RAG with PageIndex uses tree search algorithms that navigate documents like humans do, finding information based on document structure rather than just semantic similarity.
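
The sketch below shows one way such a tree search could look: starting at the root, it keeps only the children judged relevant and descends until no child qualifies. It is an illustration under stated assumptions, not PageIndex's actual algorithm. In particular, keyword_relevance is a crude keyword-overlap stand-in for the LLM's judgment, and the root node, the "Monetary Policy" node, the summaries, and the beam parameter are all hypothetical.

```python
import re


def tokens(text):
    """Lowercase word set used by the stand-in relevance function."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_relevance(question, node):
    """Stand-in for an LLM call: count question words found in title/summary."""
    text = node["title"] + " " + node.get("summary", "")
    return len(tokens(question) & tokens(text))


def tree_search(node, question, relevance=keyword_relevance, beam=2):
    """Descend from `node`, keeping at most `beam` relevant children per level.

    Returns the deepest relevant nodes reached; each one carries
    start_index / end_index page references.
    """
    scored = [(relevance(question, child), child) for child in node.get("nodes", [])]
    ranked = sorted(scored, key=lambda pair: -pair[0])  # stable: ties keep document order
    picked = [child for score, child in ranked if score > 0][:beam]
    if not picked:
        return [node]  # no child looks relevant, so answer from this node's pages
    results = []
    for child in picked:
        results.extend(tree_search(child, question, relevance, beam))
    return results


# Hypothetical tree; only nodes 0006-0008 mirror the sample shown earlier.
tree = {
    "title": "Annual Report", "node_id": "0000", "start_index": 1, "end_index": 31,
    "nodes": [
        {"title": "Monetary Policy", "node_id": "0001", "start_index": 1,
         "end_index": 20, "summary": "Interest rate decisions"},
        {"title": "Financial Stability", "node_id": "0006", "start_index": 21,
         "end_index": 22, "summary": "Risks to the financial system",
         "nodes": [
             {"title": "Monitoring Financial Vulnerabilities", "node_id": "0007",
              "start_index": 22, "end_index": 28,
              "summary": "Tracking leverage and funding risk"},
             {"title": "Domestic and International Cooperation and Coordination",
              "node_id": "0008", "start_index": 28, "end_index": 31,
              "summary": "Work with foreign regulators"},
         ]},
    ],
}

question = "How does the Fed monitor financial vulnerabilities?"
for hit in tree_search(tree, question):
    print(hit["node_id"], hit["title"], hit["start_index"], hit["end_index"])
```

Swapping keyword_relevance for a function that asks an LLM to judge each child keeps the traversal unchanged, which is the point of separating the search from the relevance decision.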

PageIndex Overview
State-of-the-Art Accuracy
Reasoning-based retrieval significantly improves accuracy over semantic-similarity retrieval, achieving leading performance on domain benchmarks.
Transparent and Reliable
Clear page-number references let users trace each result back to the original source with one click, giving them clarity and trust in the answer.
Aligned with Domain Expertise
Domain expertise can be incorporated to refine knowledge retrieval, helping to identify key details accurately in professional documents.
Human-Like
PageIndex generates a "table of contents" and navigates it as a human reader would, rather than searching by semantic similarity alone, to retrieve the most relevant and precise information.

RAG Comparison

PageIndex vs Vector DB

Choose the right RAG technique for your task.

                        PageIndex                                     Vector-Based RAG
Accuracy                High Accuracy Based on Reasoning              Low Accuracy Based on Semantic Similarity
Transparency            Fully Traceable Results with Page Reference   Black Box Retrieval without Traceability
Knowledge Integration   Efficient Prompt-Level Integration            Requires Fine-Tuning Embedding Models
Efficiency              Slower Retrieval                              Faster Retrieval

Ideal Use Cases

PageIndex: Professional Document Analysis

  • Financial reports and SEC filings
  • Regulatory and compliance documents
  • Academic and scientific textbooks
  • Legal contracts and case law
  • Technical manuals and documentation

Vector-Based RAG: Creative & General Applications

  • Semantic recommendation systems
  • Creative writing and ideation tools
  • Short passage retrieval
  • Multi-modal retrieval
  • General knowledge question answering

Case Study

PageIndex Powers Mafin 2.5: Industry-Leading Financial Document Analysis

PageIndex forms the foundation of Mafin 2.5, a leading RAG model for financial report analysis that achieves 98.7% accuracy on FinanceBench, the highest in the market. Unlike traditional RAG systems that rely on vector similarity, PageIndex's hierarchical structure enables precise navigation through complex financial documents, delivering high accuracy in SEC filing analysis and financial question answering.

  • 30%: RAG with Vector DB, one vector index for all the documents.
  • 50%: RAG with Vector DB, one vector index for each document.
  • 98.7%: RAG with PageIndex, Query-to-SQL for document-level retrieval and PageIndex for node-level retrieval.

The results of RAG with Vector DB are from the FinanceBench paper.
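
The two-stage setup described for the PageIndex result (document-level selection by SQL, then node-level retrieval inside each selected document's tree) can be sketched roughly as follows. This is an assumption-laden illustration, not Mafin 2.5's implementation: the documents table, its columns, the doc_id values, and the hard-coded SQL (which a query-to-SQL step would normally generate from the question) are all hypothetical.

```python
import sqlite3

# Hypothetical document catalog; table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents ("
    "doc_id TEXT PRIMARY KEY, company TEXT, fiscal_year INTEGER, doc_type TEXT)"
)
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?, ?)",
    [
        ("doc-001", "ExampleCorp", 2022, "10-K"),
        ("doc-002", "ExampleCorp", 2023, "10-K"),
        ("doc-003", "OtherCo", 2023, "10-K"),
    ],
)

# One tree per document, keyed by doc_id (placeholder contents).
trees = {
    "doc-002": {"title": "ExampleCorp 10-K 2023", "node_id": "0000",
                "start_index": 1, "end_index": 90, "nodes": []},
}


def select_documents(conn, sql, params):
    """Stage 1: run the SQL that a query-to-SQL step would produce."""
    return [row[0] for row in conn.execute(sql, params)]


# Hard-coded here; in practice an LLM would translate the question into SQL.
sql = "SELECT doc_id FROM documents WHERE company = ? AND fiscal_year = ?"
doc_ids = select_documents(conn, sql, ("ExampleCorp", 2023))

# Stage 2: only the selected documents' trees go on to node-level retrieval.
candidate_trees = [trees[doc_id] for doc_id in doc_ids if doc_id in trees]
print(doc_ids, [t["title"] for t in candidate_trees])
```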

Easy Integration

Build Reasoning-Based RAG with PageIndex

PageIndex makes it straightforward to build a reasoning-based RAG system.

No Vector DB Required
PageIndex generates a tree structure that can be stored in traditional databases, eliminating the need for specialized Vector DB infrastructure and reducing complexity.
No Hard Chunking Required
Unlike conventional approaches, PageIndex intelligently segments documents along natural content boundaries, preserving context and improving retrieval quality.
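
To make the "No Vector DB Required" point concrete, here is a minimal sketch that stores a tree in an ordinary relational database using Python's built-in sqlite3 module. The nodes table layout (one row per node with a parent_id) is just one possible schema and is an assumption, as are the helper names; the summaries are placeholders.

```python
import sqlite3


def flatten(node, parent_id=None):
    """Yield one row per node: (node_id, parent_id, title, start, end, summary)."""
    yield (node["node_id"], parent_id, node["title"],
           node["start_index"], node["end_index"], node.get("summary", ""))
    for child in node.get("nodes", []):
        yield from flatten(child, node["node_id"])


def store_tree(conn, tree):
    """Persist the whole tree into a plain relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS nodes ("
        "node_id TEXT PRIMARY KEY, parent_id TEXT, title TEXT, "
        "start_index INTEGER, end_index INTEGER, summary TEXT)"
    )
    conn.executemany("INSERT INTO nodes VALUES (?, ?, ?, ?, ?, ?)", flatten(tree))
    conn.commit()


def nodes_covering_page(conn, page):
    """Return (node_id, title) for every node whose page range includes `page`."""
    cursor = conn.execute(
        "SELECT node_id, title FROM nodes "
        "WHERE start_index <= ? AND end_index >= ? ORDER BY node_id",
        (page, page),
    )
    return cursor.fetchall()


# Tree in the shape of the PageIndex.json sample; summaries are placeholders.
tree = {
    "title": "Financial Stability", "node_id": "0006",
    "start_index": 21, "end_index": 22, "summary": "placeholder",
    "nodes": [
        {"title": "Monitoring Financial Vulnerabilities", "node_id": "0007",
         "start_index": 22, "end_index": 28, "summary": "placeholder"},
        {"title": "Domestic and International Cooperation and Coordination",
         "node_id": "0008", "start_index": 28, "end_index": 31,
         "summary": "placeholder"},
    ],
}

conn = sqlite3.connect(":memory:")
store_tree(conn, tree)
print(nodes_covering_page(conn, 28))
```

Storing the tree as a single JSON document per file would work equally well; the row-per-node layout is chosen here only because it makes page-range lookups a one-line query.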

    # `question` is the user's query; `structure` is the PageIndex tree serialized as JSON.
    prompt = f"""
    You are given a question and the tree structure of a document.
    Find all nodes that are likely to contain the answer to the question.

    Question: {question}

    Document tree structure: {structure}

    Reply in the following JSON format:
    {{
        "thinking": <which nodes are likely to contain the answer to the question?>,
        "node_list": [node_id1, node_id2, ...]
    }}

    Reply directly in JSON format; do not include any other information in the reply.
    """
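
Once the model replies in the JSON format requested by the prompt above, the returned node IDs still have to be mapped back to page ranges. The sketch below shows one way to do that. No LLM is called: fake_reply is a hand-written stand-in for a model response, and the helper names index_by_id and resolve_reply are hypothetical.

```python
import json


def index_by_id(node, table=None):
    """Build a {node_id: node} lookup over the whole tree."""
    table = {} if table is None else table
    table[node["node_id"]] = node
    for child in node.get("nodes", []):
        index_by_id(child, table)
    return table


def resolve_reply(reply_text, tree):
    """Parse the model's JSON reply and map node IDs to page ranges.

    IDs that do not exist in the tree are skipped rather than trusted.
    """
    reply = json.loads(reply_text)
    nodes = index_by_id(tree)
    hits = []
    for node_id in reply.get("node_list", []):
        node = nodes.get(str(node_id))
        if node is not None:
            hits.append((node["node_id"], node["start_index"], node["end_index"]))
    return hits


# Tree in the shape of the PageIndex.json sample; summaries omitted.
tree = {
    "title": "Financial Stability", "node_id": "0006",
    "start_index": 21, "end_index": 22,
    "nodes": [
        {"title": "Monitoring Financial Vulnerabilities", "node_id": "0007",
         "start_index": 22, "end_index": 28},
        {"title": "Domestic and International Cooperation and Coordination",
         "node_id": "0008", "start_index": 28, "end_index": 31},
    ],
}

# Hand-written stand-in for a model reply; "9999" is deliberately invalid.
fake_reply = '{"thinking": "Node 0007 covers monitoring.", "node_list": ["0007", "9999"]}'
print(resolve_reply(fake_reply, tree))
```

The resolved page ranges are what a downstream step would use to load the actual page text for answering the question.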
  
