🚀Our official PageIndex platform is launching soon. Stay tuned!

PageIndex: Next-Generation Reasoning-based RAG

Frustrated with the accuracy of vector-based RAG on long professional documents? You need a reasoning-native RAG system without vector DBs.
Best for
💰Financial Reports
⚖️Legal Documents
🏥Medical Records
⚙️Technical Manuals
🔬Research Papers

Higher Accuracy

Beyond semantic similarity

Better Transparency

Clear reasoning trajectory

Like A Human

Retrieve like a human expert

No Vector DB

No extra infra overhead

No Chunking

Preserve full context

No Top-K

Retrieve all relevant passages

Beyond Semantic Similarity and Vector Search

Reasoning-based RAG with PageIndex

Problem: Traditional vector-based RAG systems rely on semantic similarity, but similarity ≠ relevance. This often leads to retrieval failures, especially in specialized domains where sections may share similar language but differ in critical details.

Solution: Inspired by AlphaGo, we developed PageIndex, a reasoning-based retrieval method that simulates how humans search for information. Instead of using vectors for retrieval, PageIndex first generates a tree index of the document and then performs a tree search to retrieve relevant information.

1. PageIndex Tree Generation

Creates "table-of-contents" tree indexes that preserve the logical structure of documents.

2. PageIndex Retrieval

Conducts tree search with multi-step reasoning to retrieve knowledge from complex documents.
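
The two steps above can be pictured with a small sketch. It assumes the tree is a nested dictionary with `title`, `node_id`, and `nodes` fields, and it swaps the LLM reasoning step for a plain keyword check (`keyword_judge`, a hypothetical stand-in). It shows only the shape of the search, not PageIndex's actual implementation.

```python
def subtree_text(node):
    """Concatenate the title and summary of a node and all its descendants."""
    parts = [node.get("title", ""), node.get("summary", "")]
    for child in node.get("nodes", []):
        parts.append(subtree_text(child))
    return " ".join(parts).lower()


def keyword_judge(query, node):
    """Hypothetical stand-in for the reasoning step: a real system would ask
    an LLM whether this branch is worth exploring for the query."""
    text = subtree_text(node)
    return any(term in text for term in query.lower().split())


def tree_search(nodes, query, judge=keyword_judge, path=()):
    """Descend only into branches the judge accepts; return the most specific
    matching nodes together with the path taken to reach them."""
    results = []
    for node in nodes:
        if not judge(query, node):
            continue  # prune this branch
        here = path + (node["node_id"],)
        child_hits = tree_search(node.get("nodes", []), query, judge, here)
        if child_hits:
            results.extend(child_hits)
        else:
            # No child matched, so this node is the most specific hit.
            results.append(
                {"node_id": node["node_id"], "title": node["title"], "path": list(here)}
            )
    return results


# Illustrative tree; titles and node IDs mirror the sample tree on this page.
tree = {
    "title": "Financial Stability",
    "node_id": "0006",
    "nodes": [
        {"title": "Monitoring Financial Vulnerabilities", "node_id": "0007"},
        {"title": "Domestic and International Cooperation and Coordination", "node_id": "0008"},
    ],
}

hits = tree_search([tree], "vulnerabilities")
print(hits)
```

Because the judge is a plain function argument, the keyword check could be replaced by a model call without changing the traversal.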


Context-Preserving and Reasoning-Native Index

PageIndex Tree Generation

Documents are indexed as trees generated by PageIndex. These trees maintain the original document's logical flow and organizational structure. This LLM-optimized tree representation enables precise navigation and is ready for reasoning-based RAG.

No Vector DB Required
Tree structures are represented as lightweight JSON objects, avoiding the overhead and complexity of vector databases.
No Chunking Required
Preserves natural document structure without artificial text splitting for better context retention.
Node Summary with Precise Page Referencing
Provides exact page references and summaries for precise information extraction.
Optimized for Long Documents
Tree generation is optimized for documents that exceed LLM context limits, such as financial reports, legal documents, and technical manuals.
PageIndexTree.json
# Example of PageIndex Tree Structure
...
{
    "title": "Financial Stability",
    "node_id": "0006",
    "start_index": 21,
    "end_index": 22,
    "summary": "The Federal Reserve ...",
    "nodes": [
        {
            "title": "Monitoring Financial Vulnerabilities",
            "node_id": "0007",
            "start_index": 22,
            "end_index": 28,
            "summary": "The Federal Reserve's monitoring ..."
        },
        {
            "title": "Domestic and International Cooperation and Coordination",
            "node_id": "0008",
            "start_index": 28,
            "end_index": 31,
            "summary": "In 2023, the Federal Reserve collaborated ..."
        }
    ]
},
...
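
As a rough illustration of how a tree fragment like the one above might be consumed, the sketch below parses it with Python's standard `json` module and walks it. The helper names (`iter_nodes`, `outline`, `nodes_covering_page`) are hypothetical, and treating `start_index` and `end_index` as an inclusive page range is an assumption, not something the sample itself confirms.

```python
import json

# The fragment from this page, closed up so that it parses as valid JSON.
TREE_JSON = """
{
    "title": "Financial Stability",
    "node_id": "0006",
    "start_index": 21,
    "end_index": 22,
    "summary": "The Federal Reserve ...",
    "nodes": [
        {
            "title": "Monitoring Financial Vulnerabilities",
            "node_id": "0007",
            "start_index": 22,
            "end_index": 28,
            "summary": "The Federal Reserve's monitoring ..."
        },
        {
            "title": "Domestic and International Cooperation and Coordination",
            "node_id": "0008",
            "start_index": 28,
            "end_index": 31,
            "summary": "In 2023, the Federal Reserve collaborated ..."
        }
    ]
}
"""


def iter_nodes(node, depth=0):
    """Yield (depth, node) pairs in document order."""
    yield depth, node
    for child in node.get("nodes", []):
        yield from iter_nodes(child, depth + 1)


def outline(root):
    """Render the tree as an indented table of contents."""
    return [
        f'{"  " * depth}{n["node_id"]} {n["title"]} (pages {n["start_index"]}-{n["end_index"]})'
        for depth, n in iter_nodes(root)
    ]


def nodes_covering_page(root, page):
    """Return IDs of nodes whose page range includes `page`
    (assumes the range is inclusive at both ends)."""
    return [
        n["node_id"]
        for _, n in iter_nodes(root)
        if n["start_index"] <= page <= n["end_index"]
    ]


tree = json.loads(TREE_JSON)
print("\n".join(outline(tree)))
print(nodes_covering_page(tree, 22))
```

Since the index is ordinary JSON, no storage layer beyond a file or a text column is needed to keep it.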

Accurate and Human-like Tree Search

PageIndex Retrieval

PageIndex uses tree search with multi-step reasoning to retrieve information from complex documents. This approach goes beyond traditional vector-based retrieval by simulating how human experts systematically navigate and extract insights from lengthy documents.

PageIndexRetrieval.json
# Example of PageIndex Retrieval API Response
{
  "title": "Monetary Policy and Economic Developments",
  "node_id": "0004",
  "nodes": [
    {
      "title": "March 2024 Summary",
      "node_id": "0005",
      "relevant_contents": [{
          "physical_index": 10, 
          "relevant_content": "The labor market has gained averaging 239,000 per month since June 2023..."
        }]
    },
    {
      "title": "June 2023 Summary",
      "node_id": "0006",
      "relevant_contents": [{
          "physical_index": 15, 
          "relevant_content": "The labor market has remained very tight, with job gains averaging 314,000 per month during..."
        }]
    }
  ]
}
No Top-K Selection Required
Tree search automatically identifies all relevant tree nodes without manual parameter tuning.
Transparent Search Trajectories
Returns the complete search path through the tree structure, providing transparency and rich contextual information.
Node and Page References
Every retrieved passage includes its node ID and page number from the original document for verifiable information retrieval.
LLM-Ready Output Format
Structured data output with relevant paragraphs and search trajectories, ready for downstream LLM processing.
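
One way to hand a response like the sample above to a downstream LLM is to flatten it into cited passages. The sketch below is hedged: the helper names (`collect_passages`, `to_context`) are hypothetical, the response is written as a Python dictionary copied from the sample, and reading `physical_index` as the page number is an assumption based on the description above.

```python
# Response copied from the sample on this page (passages are truncated there too).
response = {
    "title": "Monetary Policy and Economic Developments",
    "node_id": "0004",
    "nodes": [
        {
            "title": "March 2024 Summary",
            "node_id": "0005",
            "relevant_contents": [{
                "physical_index": 10,
                "relevant_content": "The labor market has gained averaging 239,000 per month since June 2023...",
            }],
        },
        {
            "title": "June 2023 Summary",
            "node_id": "0006",
            "relevant_contents": [{
                "physical_index": 15,
                "relevant_content": "The labor market has remained very tight, with job gains averaging 314,000 per month during...",
            }],
        },
    ],
}


def collect_passages(node, path=()):
    """Flatten the nested response into passages that carry their node ID,
    page (assumed to be `physical_index`), and the title path that led there."""
    here = path + (node["title"],)
    passages = []
    for item in node.get("relevant_contents", []):
        passages.append({
            "node_id": node["node_id"],
            "page": item["physical_index"],
            "trajectory": " > ".join(here),
            "text": item["relevant_content"],
        })
    for child in node.get("nodes", []):
        passages.extend(collect_passages(child, here))
    return passages


def to_context(passages):
    """Format passages as citation-tagged lines for a prompt."""
    return "\n".join(
        f'[node {p["node_id"]}, page {p["page"]}] {p["text"]}' for p in passages
    )


passages = collect_passages(response)
print(to_context(passages))
```

Keeping the node ID and page next to each passage lets an answer be checked against the original document.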

RAG Comparison

PageIndex vs Vector DB

Choose the right RAG technique for your task.

PageIndex: Logical Reasoning
High Retrieval Accuracy
Relies on logical reasoning, which suits domain-specific data where many passages are semantically similar.
Fully Traceable Retrieval Process
Tree search provides a traceable reasoning process, and each retrieved node also contains an exact page reference.
Trades Efficiency for Accuracy
Tree search prioritizes accuracy over speed, delivering precise results for domain-specific analysis.
Efficient Prompt-Level Knowledge Integration
Easily integrates with expert knowledge and user preferences during the tree search process.
Best for Domain-Specific Document Analysis
  • Financial reports and SEC filings
  • Regulatory and compliance documents
  • Healthcare and medical reports
  • Legal contracts and case law
  • Technical manuals and scientific documentation
Vector DB: Semantic Similarity
Low Retrieval Accuracy
Relies on semantic similarity, which is unreliable for domain-specific data where most content is semantically similar.
Black Box Retrieval without Traceability
Often lacks clear traceability to source documents, making it difficult to verify information or understand retrieval decisions.
Speed-Optimized Vector Search
Prioritizes efficiency and speed, making it ideal for applications where quick responses are critical.
Knowledge Integration Requires Fine-Tuning
Requires fine-tuning embedding models to incorporate new knowledge or preferences.
Best for Generic & Exploratory Applications
  • Vibe retrieval
  • Semantic recommendation systems
  • Creative writing and ideation tools
  • Short news/email retrieval
  • Generic knowledge question answering

Case Study

PageIndex Powers Leading Industry Models

PageIndex forms the foundation of Mafin 2.5, a leading RAG model for financial report analysis, achieving 98.7% accuracy on FinanceBench — the highest in the market.

  • 30% accuracy: RAG with Vector DB, using one vector index for all the documents.
  • 50% accuracy: RAG with Vector DB, using one vector index for each document.
  • 98.7% accuracy: RAG with PageIndex, using Query-to-SQL for document-level retrieval and PageIndex for node-level retrieval.

The results of RAG with Vector DB are from the FinanceBench paper.
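
The two-stage setup named above (SQL to pick documents, then tree nodes within them) might look roughly like the following. Everything here is illustrative: the `documents` table, its columns, the company names, and the per-document trees are hypothetical. The SQL is hand-written where a real system would generate it from the question, and a title keyword match stands in for the reasoning-based node search.

```python
import sqlite3

# Hypothetical document metadata table for stage 1.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (doc_id TEXT, company TEXT, fiscal_year INTEGER)")
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?)",
    [("doc-a", "ExampleCo", 2022), ("doc-b", "ExampleCo", 2023), ("doc-c", "OtherCo", 2023)],
)

# Hypothetical per-document tree indexes for stage 2.
trees = {
    "doc-a": {"title": "Annual Report", "node_id": "0001",
              "nodes": [{"title": "Liquidity", "node_id": "0002"}]},
    "doc-b": {"title": "Annual Report", "node_id": "0001",
              "nodes": [{"title": "Liquidity", "node_id": "0002"},
                        {"title": "Risk Factors", "node_id": "0003"}]},
    "doc-c": {"title": "Annual Report", "node_id": "0001",
              "nodes": [{"title": "Risk Factors", "node_id": "0002"}]},
}


def select_documents(sql, params):
    """Stage 1: document-level retrieval with a SQL query over metadata."""
    return [row[0] for row in conn.execute(sql, params)]


def select_nodes(node, keyword):
    """Stage 2 stand-in: collect node IDs whose title mentions the keyword."""
    hits = [node["node_id"]] if keyword.lower() in node["title"].lower() else []
    for child in node.get("nodes", []):
        hits.extend(select_nodes(child, keyword))
    return hits


# Hand-written SQL for a question about ExampleCo's 2023 risk factors.
sql = "SELECT doc_id FROM documents WHERE company = ? AND fiscal_year = ?"
doc_ids = select_documents(sql, ("ExampleCo", 2023))
results = {doc_id: select_nodes(trees[doc_id], "risk") for doc_id in doc_ids}
print(results)
```

Filtering by metadata first keeps the node-level search confined to the documents that could contain the answer.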

Ready to integrate Reasoning-based RAG with PageIndex?