🚀Our official PageIndex platform is launching soon. Stay tuned!

PageIndex: Next-Generation Reasoning-based RAG

Frustrated with the accuracy of vector-based RAG on long professional documents?
You need a reasoning-native RAG system without vector DBs.

Best for

💰Financial Reports

⚖️Legal Documents

🏥Medical Records

⚙️Technical Manuals

🔬Research Papers

GitHub Get Started

Higher Accuracy

Beyond semantic similarity

Better Transparency

Clear reasoning trajectory

Like A Human

Retrieve like a human expert

No Vector DB

No extra infra overhead

No Chunking

Preserve full context

No Top-K

Retrieve all relevant passages

Beyond Semantic Similarity and Vector Search

Reasoning-based RAG with PageIndex

Problem: Traditional vector-based RAG systems rely on semantic similarity, but similarity ≠ relevance. This often leads to retrieval failures, especially in specialized domains where sections may share similar language but differ in critical details.

Solution: Inspired by AlphaGo, we developed PageIndex, a reasoning-based retrieval method that simulates how humans search for information. Instead of using vectors for retrieval, PageIndex first generates a tree index of the document and then performs a tree search over that index to retrieve relevant information.

PageIndex Tree Generation

Creates "table-of-contents" tree indexes that preserve the logical structure of documents.

PageIndex Retrieval

Conducts tree search with multi-step reasoning to retrieve knowledge from complex documents.
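
The two steps above can be illustrated with a minimal sketch. It is not the PageIndex implementation: the toy tree, the `tree_search` function, and the `keyword_judge` callable are all hypothetical, and the keyword check merely stands in for the LLM reasoning step that decides whether a branch is worth exploring.

```python
from typing import Callable

# Toy tree with invented content; only the field names follow the PageIndex examples.
TOY_TREE = {
    "title": "Annual Report", "node_id": "0001",
    "summary": "Overview of revenue, risk, and governance",
    "nodes": [
        {"title": "Revenue", "node_id": "0002",
         "summary": "Revenue recognized during the year",
         "nodes": [
             {"title": "Segment Revenue", "node_id": "0003",
              "summary": "Revenue by product segment"},
             {"title": "Deferred Revenue", "node_id": "0004",
              "summary": "Revenue billed but not yet earned"},
         ]},
        {"title": "Risk Factors", "node_id": "0005",
         "summary": "Market and credit risk"},
    ],
}


def keyword_judge(query: str, node: dict) -> bool:
    """Hypothetical stand-in for the reasoning step: a plain keyword match."""
    text = (node["title"] + " " + node.get("summary", "")).lower()
    return any(word in text for word in query.lower().split())


def tree_search(node: dict, query: str,
                judge: Callable[[str, dict], bool],
                path: tuple = ()) -> list:
    """Depth-first search that keeps every accepted node, with its path."""
    path = path + (node["title"],)
    if not judge(query, node):
        return []  # prune the whole branch
    hits = []
    for child in node.get("nodes", []):
        hits.extend(tree_search(child, query, judge, path))
    # If no child was accepted (or the node is a leaf), keep the node itself.
    return hits or [{"node_id": node["node_id"], "path": list(path)}]


for hit in tree_search(TOY_TREE, "revenue", keyword_judge):
    print(hit["node_id"], " > ".join(hit["path"]))
```

Note that the search returns every node the judge accepts rather than a fixed number of results, and each hit carries the path taken to reach it.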

Explore PageIndex Dashboard

Context-Preserving and Reasoning-Native Index

PageIndex Tree Generation

Documents are indexed as trees generated by PageIndex. These trees maintain the original document's logical flow and organizational structure. This LLM-optimized tree representation enables precise navigation and is ready for reasoning-based RAG.

No Vector DB Required: Tree structures are represented as lightweight JSON objects, avoiding the overhead and complexity of vector databases.
No Chunking Required: Preserves natural document structure without artificial text splitting for better context retention.
Node Summary with Precise Page Referencing: Provides exact page references and summaries for precise information extraction.
Optimized for Long Documents: Tree generation is optimized for financial reports, legal documents, and technical manuals that exceed LLM context limits.

PageIndexTree.json

# Example of PageIndex Tree Structure
...
{
    "title": "Financial Stability",
    "node_id": "0006",
    "start_index": 21, 
    "end_index": 22,
    "summary": "The Federal Reserve ...",
    "nodes": [
        {
            "title": "Monitoring Financial Vulnerabilities",
            "node_id": "0007",
            "start_index": 22,
            "end_index": 28,
            "summary": "The Federal Reserve's monitoring ..."
        },
        {
            "title": "Domestic and International Cooperation and Coordination",
            "node_id": "0008",
            "start_index": 28,
            "end_index": 31,
            "summary": "In 2023, the Federal Reserve collaborated ..."
        }
    ]
},
...
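
Because the tree is plain JSON, it can be walked with a few lines of standard-library code. The sketch below reuses the titles, node IDs, and indexes from the example above (summaries omitted). The helper names `iter_nodes` and `nodes_covering_page` are hypothetical, and the sketch assumes that `start_index` and `end_index` are inclusive page numbers.

```python
import json
from typing import Iterator, Tuple

TREE_JSON = """
{
  "title": "Financial Stability",
  "node_id": "0006",
  "start_index": 21,
  "end_index": 22,
  "nodes": [
    {"title": "Monitoring Financial Vulnerabilities",
     "node_id": "0007", "start_index": 22, "end_index": 28},
    {"title": "Domestic and International Cooperation and Coordination",
     "node_id": "0008", "start_index": 28, "end_index": 31}
  ]
}
"""


def iter_nodes(node: dict, depth: int = 0) -> Iterator[Tuple[int, dict]]:
    """Yield (depth, node) pairs in document order."""
    yield depth, node
    for child in node.get("nodes", []):
        yield from iter_nodes(child, depth + 1)


def nodes_covering_page(tree: dict, page: int) -> list:
    """Return IDs of nodes whose page range includes `page` (assumed inclusive)."""
    return [n["node_id"] for _, n in iter_nodes(tree)
            if n["start_index"] <= page <= n["end_index"]]


tree = json.loads(TREE_JSON)

# Print an indented table of contents with page ranges.
for depth, node in iter_nodes(tree):
    print("  " * depth
          + f'{node["node_id"]} {node["title"]} '
          + f'(pp. {node["start_index"]}-{node["end_index"]})')

print(nodes_covering_page(tree, 28))
```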

PageIndex Tree Generation API

Accurate and Human-like Tree Search

PageIndex Retrieval

PageIndex uses tree search with multi-step reasoning to retrieve information from complex documents. This approach goes beyond traditional vector-based retrieval by simulating how human experts systematically navigate and extract insights from lengthy documents.

PageIndexRetrieval.json

# Example of PageIndex Retrieval API Response
{
  "title": "Monetary Policy and Economic Developments",
  "node_id": "0004",
  "nodes": [
    {
      "title": "March 2024 Summary",
      "node_id": "0005",
      "relevant_contents": [{
          "physical_index": 10, 
          "relevant_content": "The labor market has gained averaging 239,000 per month since June 2023..."
        }]
    },
    {
      "title": "June 2023 Summary",
      "node_id": "0006",
      "relevant_contents": [{
          "physical_index": 15, 
          "relevant_content": "The labor market has remained very tight, with job gains averaging 314,000 per month during..."
        }]
    }
  ]
}

No Top-K Selection Required: Tree search automatically identifies all relevant tree nodes without manual parameter tuning.
Transparent Search Trajectories: Returns the complete search path through the tree structure, providing transparency and rich contextual information.
Node and Page References: Every retrieved passage includes its node ID and page number from the original document for verifiable information retrieval.
LLM-Ready Output Format: Structured data output with relevant paragraphs and search trajectories, ready for downstream LLM processing.
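
As a rough illustration of how such a response might be consumed downstream, the sketch below flattens a response shaped like the example above into cited passages for an LLM prompt. The helpers `collect_passages` and `to_context` are hypothetical, the passage texts are placeholders, and treating `physical_index` as the page number is an assumption based on the description above.

```python
# Response shaped like the example above; passage texts are placeholders.
RESPONSE = {
    "title": "Monetary Policy and Economic Developments",
    "node_id": "0004",
    "nodes": [
        {"title": "March 2024 Summary", "node_id": "0005",
         "relevant_contents": [
             {"physical_index": 10, "relevant_content": "placeholder passage A"}]},
        {"title": "June 2023 Summary", "node_id": "0006",
         "relevant_contents": [
             {"physical_index": 15, "relevant_content": "placeholder passage B"}]},
    ],
}


def collect_passages(node: dict, path: tuple = ()) -> list:
    """Flatten the nested response into passages with node ID, page, and path."""
    path = path + (node["title"],)
    passages = [
        {"node_id": node["node_id"],
         "page": item["physical_index"],  # assumed to be the page number
         "path": " > ".join(path),
         "text": item["relevant_content"]}
        for item in node.get("relevant_contents", [])
    ]
    for child in node.get("nodes", []):
        passages.extend(collect_passages(child, path))
    return passages


def to_context(passages: list) -> str:
    """Render passages as citation-tagged lines for a downstream prompt."""
    return "\n".join(f'[{p["node_id"]}, p. {p["page"]}] {p["text"]}'
                     for p in passages)


passages = collect_passages(RESPONSE)
print(to_context(passages))
```

Keeping the node ID and page next to each passage lets the final answer cite its sources, and the `path` field records the search trajectory that led to each passage.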

PageIndex Retrieval API

RAG Comparison

PageIndex vs Vector DB

Choose the right RAG technique for your task.

PageIndex: Logical Reasoning

High Retrieval Accuracy

Relies on logical reasoning, ideal for domain-specific data where semantics are similar.

Fully Traceable Retrieval Process

Tree search provides a traceable reasoning process, and each retrieved node contains an exact page reference.

Trades Efficiency for Accuracy

Tree search prioritizes accuracy over speed, delivering precise results for domain-specific analysis.

Efficient Prompt-Level Knowledge Integration

Easily integrates with expert knowledge and user preferences during the tree search process.

Best for Domain-Specific Document Analysis

Financial reports and SEC filings
Regulatory and compliance documents
Healthcare and medical reports
Legal contracts and case law
Technical manuals and scientific documentation

Vector DB: Semantic Similarity

Low Retrieval Accuracy

Relies on semantic similarity, which is unreliable for domain-specific data where all content has similar semantics.

Black Box Retrieval without Traceability

Often lacks clear traceability to source documents, making it difficult to verify information or understand retrieval decisions.

Speed-Optimized Vector Search

Prioritizes efficiency and speed, making it ideal for applications where quick responses are critical.

Knowledge Integration Requires Fine-Tuning

Requires fine-tuning embedding models to incorporate new knowledge or preferences.

Best for Generic & Exploratory Applications

Vibe retrieval
Semantic recommendation systems
Creative writing and ideation tools
Short news/email retrieval
Generic knowledge question answering

Case Study

PageIndex Powers Leading Industry Models

PageIndex forms the foundation of Mafin 2.5, a leading RAG model for financial report analysis, which achieves 98.7% accuracy on FinanceBench, the highest in the market.

Read Benchmark Report

30%: RAG with Vector DB. One vector index for all the documents.

50%: RAG with Vector DB. One vector index for each document.

98.7%: RAG with PageIndex. Query-to-SQL for document-level retrieval, PageIndex for node-level retrieval.

The results of RAG with Vector DB are from the FinanceBench paper.

Ready to integrate Reasoning-based RAG with PageIndex?

Access Now