🚀 Our official PageIndex platform is launching soon. Stay tuned for updates!
PageIndex: Next-Generation Reasoning-based RAG System

Frustrated with the accuracy of vector-based RAG on long professional documents? You need a reasoning-native RAG system without vector DBs.


Beyond Semantic Similarity

Reasoning-based RAG with PageIndex

Traditional vector-based RAG systems rely on semantic similarity, but similarity ≠ relevance. This often leads to retrieval failures, especially in specialized domains where sections may share similar language but differ in critical details.

Inspired by AlphaGo, we developed PageIndex, a reasoning-based RAG system that simulates how human experts navigate and extract knowledge from long documents through tree search. Two key components of PageIndex are listed below.

PageIndex Tree Generation
Creates "table-of-contents" trees that preserve the logical organization of the documents.
PageIndex Retrieval
Conducts tree search with multi-step reasoning to retrieve knowledge from complex documents.

Context-Preserving Indexes

PageIndex Tree Generation

Documents are indexed as trees generated by PageIndex. These trees maintain the original document's logical flow and organizational structure. This LLM-optimized tree representation enables precise navigation and is ready for reasoning-based RAG.

No Vector DB Required
Tree structures are generated as lightweight JSON files, eliminating the need for expensive vector database storage.
No Chunking Required
Preserves natural document structure without artificial text splitting for better context retention.
Node Summary with Precise Page Referencing
Provides exact page references and summaries for precise information extraction.
Optimized for Long Documents
Tree generation is optimized for financial reports, legal documents, and technical manuals that exceed LLM context limits.
PageIndexTree.json

# Example of PageIndex Tree Structure
{
    "title": "Financial Stability",
    "node_id": "0006",
    "start_index": 21, 
    "end_index": 22,
    "summary": "The Federal Reserve ...",
    "nodes": [
        {
            "title": "Monitoring Financial Vulnerabilities",
            "node_id": "0007",
            "start_index": 22,
            "end_index": 28,
            "summary": "The Federal Reserve's monitoring ..."
        },
        {
            "title": "Domestic and International Cooperation and Coordination",
            "node_id": "0008",
            "start_index": 28,
            "end_index": 31,
            "summary": "In 2023, the Federal Reserve collaborated ..."
        }
    ]
},
...
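
As a minimal sketch of how a tree like this might be consumed, the Python snippet below walks a node shaped like the example above and lists every node with its depth and index range. The field names come from the example; the `walk` helper is hypothetical and not part of PageIndex, and treating `start_index` and `end_index` as page numbers is an assumption.

```python
import json

# Sample shaped like the example above; field names are taken from it.
TREE_JSON = """
{
  "title": "Financial Stability",
  "node_id": "0006",
  "start_index": 21,
  "end_index": 22,
  "summary": "The Federal Reserve ...",
  "nodes": [
    {"title": "Monitoring Financial Vulnerabilities", "node_id": "0007",
     "start_index": 22, "end_index": 28,
     "summary": "The Federal Reserve's monitoring ..."},
    {"title": "Domestic and International Cooperation and Coordination",
     "node_id": "0008", "start_index": 28, "end_index": 31,
     "summary": "In 2023, the Federal Reserve collaborated ..."}
  ]
}
"""


def walk(node, depth=0):
    """Yield (depth, node) pairs in document order (pre-order traversal)."""
    yield depth, node
    for child in node.get("nodes", []):
        yield from walk(child, depth + 1)


tree = json.loads(TREE_JSON)
outline = [
    (depth, node["node_id"], node["start_index"], node["end_index"])
    for depth, node in walk(tree)
]

# Print an indented outline; "pages" assumes the indexes are page numbers.
for depth, node_id, start, end in outline:
    print("  " * depth + f"{node_id}: pages {start}-{end}")
```

Because the tree is plain JSON, the standard `json` module is enough to load it; no database or embedding store is involved.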

Accurate and Efficient Tree Search

PageIndex Retrieval

PageIndex uses tree search with multi-step reasoning to retrieve information from complex documents. This approach goes beyond traditional vector-based retrieval by simulating how human experts systematically navigate and extract insights from lengthy documents.
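
To make the idea of tree search concrete, here is a rough Python sketch rather than PageIndex's actual algorithm. At each node a selection step decides which children to descend into, and the search returns the path to every node where it stops. In a real system that decision would be made by an LLM reasoning over titles and summaries; here it is replaced by a naive word-overlap check. The names `select_children` and `tree_search` are hypothetical.

```python
# Small tree using the field names from the tree example on this page.
TREE = {
    "title": "Financial Stability",
    "node_id": "0006",
    "summary": "The Federal Reserve ...",
    "nodes": [
        {"title": "Monitoring Financial Vulnerabilities", "node_id": "0007",
         "summary": "The Federal Reserve's monitoring ..."},
        {"title": "Domestic and International Cooperation and Coordination",
         "node_id": "0008",
         "summary": "In 2023, the Federal Reserve collaborated ..."},
    ],
}


def select_children(query, node):
    """Stand-in for the LLM reasoning step (an assumption of this sketch):
    keep children whose title or summary shares a word with the query."""
    words = set(query.lower().split())
    picked = []
    for child in node.get("nodes", []):
        text = f"{child['title']} {child.get('summary', '')}".lower()
        if words & set(text.split()):
            picked.append(child)
    return picked


def tree_search(query, node, path=()):
    """Return the node_id path to every node where the search stops."""
    path = path + (node["node_id"],)
    picked = select_children(query, node)
    if not picked:
        return [path]  # nothing below was selected, so stop at this node
    results = []
    for child in picked:
        results.extend(tree_search(query, child, path))
    return results


paths = tree_search("international cooperation", TREE)
print(paths)
```

The function returns every branch the selection step accepts instead of a fixed number of hits, and each result carries the full path from the root, which is what makes the search traceable.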

PageIndexRetrieval.json

# Example of PageIndex Retrieval API Response
{
  "title": "Monetary Policy and Economic Developments",
  "node_id": "0004",
  "nodes": [
    {
      "title": "March 2024 Summary",
      "node_id": "0005",
      "relevant_contents": [{
          "physical_index": 10, 
          "relevant_content": "The labor market has gained averaging 239,000 per month since June 2023..."
        }]
    },
    {
      "title": "June 2023 Summary",
      "node_id": "0006",
      "relevant_contents": [{
          "physical_index": 15, 
          "relevant_content": "The labor market has remained very tight, with job gains averaging 314,000 per month during..."
        }]
    }
  ]
}
    
No Top-K Selection Required
Tree search automatically identifies all relevant tree nodes without manual parameter tuning.
Transparent Node Trajectories
Returns the complete search path through the tree structure for transparency and provides rich contextual information.
Exact Page References
Every retrieved node includes precise page numbers and locations from the original document for verifiable information retrieval.
LLM-Ready Output Format
Structured data with relevant paragraphs and search trajectories, ready for downstream LLM processing.
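
As an illustration of the downstream step, the following Python sketch flattens a response shaped like the example above into a cited context string for an LLM prompt. The field names come from the example response; the `collect_passages` helper and the citation format are hypothetical, and reading `physical_index` as a page number is an assumption.

```python
# Response shaped like the retrieval example above.
RESPONSE = {
    "title": "Monetary Policy and Economic Developments",
    "node_id": "0004",
    "nodes": [
        {"title": "March 2024 Summary", "node_id": "0005",
         "relevant_contents": [{
             "physical_index": 10,
             "relevant_content": "The labor market has gained averaging "
                                 "239,000 per month since June 2023..."}]},
        {"title": "June 2023 Summary", "node_id": "0006",
         "relevant_contents": [{
             "physical_index": 15,
             "relevant_content": "The labor market has remained very tight, "
                                 "with job gains averaging 314,000 per month "
                                 "during..."}]},
    ],
}


def collect_passages(node, trail=()):
    """Gather every retrieved passage together with its title path and page."""
    trail = trail + (node["title"],)
    passages = [
        {"path": " > ".join(trail),
         "page": item["physical_index"],
         "text": item["relevant_content"]}
        for item in node.get("relevant_contents", [])
    ]
    for child in node.get("nodes", []):
        passages.extend(collect_passages(child, trail))
    return passages


passages = collect_passages(RESPONSE)
# One cited line per passage, ready to paste into a prompt.
context = "\n".join(f"[{p['path']}, p. {p['page']}] {p['text']}" for p in passages)
print(context)
```

Keeping the title path and page next to each passage lets the final answer cite where in the document each statement came from.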

RAG Comparison

PageIndex vs Vector DB

Choose the right RAG technique for your task.

PageIndex: Logical Reasoning
High Retrieval Accuracy
Relies on logical reasoning, ideal for domain-specific data where semantics are similar.
Fully Traceable Retrieval Process
Tree search provides a traceable reasoning process, and each retrieved node contains an exact page reference.
Slower Retrieval Due to Tree Search
Tree search is slower, but provides accurate results for complex domain-specific queries.
Efficient Prompt-Level Knowledge Integration
Easily integrates with expert knowledge and user preferences during the tree search process.
Best for Domain-Specific Document Analysis
  • Financial reports and SEC filings
  • Regulatory and compliance documents
  • Healthcare and medical reports
  • Legal contracts and case law
  • Technical manuals and scientific documentation
Vector DB: Semantic Similarity
Low Retrieval Accuracy
Relies on semantic similarity, which is unreliable for domain-specific data where all content has similar semantics.
Black Box Retrieval without Traceability
Often lacks clear traceability to source documents, making it difficult to verify information or understand retrieval decisions.
Faster Retrieval Due to Vector Search
Offers faster retrieval speeds, making it efficient for applications where quick responses are critical.
Knowledge Integration Requires Fine-Tuning
Requires fine-tuning embedding models to incorporate new knowledge or preferences.
Best for Generic & Exploratory Applications
  • Semantic recommendation systems
  • Creative writing and ideation tools
  • Short passage retrieval
  • Multi-modal retrieval
  • Generic knowledge question answering

Case Study

PageIndex Powers Leading Industry Models

PageIndex forms the foundation of Mafin 2.5, a leading RAG model for financial report analysis, which achieves 98.7% accuracy on FinanceBench, the highest in the market.

30%: RAG with Vector DB (one vector index for all the documents).

50%: RAG with Vector DB (one vector index for each document).

98.7%: RAG with PageIndex (Query-to-SQL for document-level retrieval, PageIndex for node-level retrieval).

The results of RAG with Vector DB are from the FinanceBench paper.
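
The two-stage setup described for RAG with PageIndex could look roughly like the Python sketch below. This is not how Mafin 2.5 is implemented. The metadata table, its columns, the sample rows, and the tree file naming are all hypothetical, and the SQL that an LLM would normally generate from the user's question is hard-coded.

```python
import sqlite3

# Hypothetical metadata catalog: one row per indexed document.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents "
    "(doc_id TEXT, company TEXT, fiscal_year INTEGER, doc_type TEXT)"
)
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?, ?)",
    [
        ("doc-001", "ExampleCorp", 2022, "10-K"),
        ("doc-002", "ExampleCorp", 2023, "10-K"),
        ("doc-003", "OtherCo", 2023, "10-K"),
    ],
)

# Stage 1: document-level retrieval. An LLM would write this SQL from the
# question; it is hard-coded here to keep the sketch runnable.
sql = (
    "SELECT doc_id FROM documents "
    "WHERE company = ? AND fiscal_year = ? ORDER BY doc_id"
)
doc_ids = [row[0] for row in conn.execute(sql, ("ExampleCorp", 2023))]

# Stage 2: node-level retrieval would run tree search over the tree of each
# selected document. The file naming below is an assumption.
tree_files = {doc_id: f"{doc_id}_tree.json" for doc_id in doc_ids}
print(tree_files)
```

Narrowing to the right document with structured metadata first means the tree search only has to reason over one document's tree at a time.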

Ready to integrate Reasoning-based RAG with PageIndex?