Vectify AI

🚀 Our official PageIndex platform is launching soon. Stay tuned for updates!

PageIndex: Next-Generation Reasoning-based RAG System

Frustrated with the accuracy of vector-based RAG on long professional documents? You need a reasoning-native RAG system without vector DBs.


Beyond Semantic Similarity

Reasoning-based RAG with PageIndex

Traditional vector-based RAG systems rely on semantic similarity, but similarity ≠ relevance. This often leads to retrieval failures, especially in specialized domains where sections may share similar language but differ in critical details.
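The gap between similarity and relevance can be illustrated with a toy example. The sketch below uses bag-of-words cosine similarity as a crude stand-in for learned embeddings (an assumption made for brevity; real embedding models behave differently), and the query and passages are hypothetical. A passage that reuses the query's wording but covers the wrong year outscores the passage that actually contains the answer.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors of two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical query and passages, for illustration only.
query = "total revenue for fiscal year 2023"
lookalike = "total revenue for fiscal year 2022"  # similar wording, wrong year
answer = "net sales in 2023 were reported in the consolidated statement of operations"

score_lookalike = cosine(query, lookalike)
score_answer = cosine(query, answer)

# The wrong-year passage ranks first under pure similarity.
print(score_lookalike > score_answer)
```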

Inspired by AlphaGo, we developed PageIndex, a reasoning-based RAG system that simulates how human experts navigate and extract knowledge from long documents through tree search. Two key components of PageIndex are listed below.

PageIndex Tree Generation

Creates "table-of-contents" trees that preserve the logical organization of the documents.

PageIndex Retrieval

Conducts tree search with multi-step reasoning to retrieve knowledge from complex documents.


Context-Preserving Indexes

PageIndex Tree Generation

Documents are indexed as trees generated by PageIndex. These trees maintain the original document's logical flow and organizational structure. This LLM-optimized tree representation enables precise navigation and is ready for reasoning-based RAG.

No Vector DB Required: Tree structures are generated as lightweight JSON files, eliminating the need for expensive vector database storage.
No Chunking Required: Preserves natural document structure without artificial text splitting for better context retention.
Node Summary with Precise Page Referencing: Provides exact page references and summaries for precise information extraction.
Optimized for Long Documents: Tree generation is optimized for financial reports, legal documents, and technical manuals that exceed LLM context limits.

PageIndexTree.json


# Example of PageIndex Tree Structure
{
    "title": "Financial Stability",
    "node_id": "0006",
    "start_index": 21,
    "end_index": 22,
    "summary": "The Federal Reserve ...",
    "nodes": [
        {
            "title": "Monitoring Financial Vulnerabilities",
            "node_id": "0007",
            "start_index": 22,
            "end_index": 28,
            "summary": "The Federal Reserve's monitoring ..."
        },
        {
            "title": "Domestic and International Cooperation and Coordination",
            "node_id": "0008",
            "start_index": 28,
            "end_index": 31,
            "summary": "In 2023, the Federal Reserve collaborated ..."
        }
    ]
},
...
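
Because the tree is plain JSON, it can be walked with a few lines of code. The following is a minimal sketch in Python, assuming only the field names shown in the example above (`title`, `node_id`, `start_index`, `end_index`, `nodes`). The helper name `iter_nodes` is hypothetical and not part of any PageIndex API.

```python
import json

# A copy of the example tree above, with the summaries left truncated.
TREE_JSON = """
{
    "title": "Financial Stability",
    "node_id": "0006",
    "start_index": 21,
    "end_index": 22,
    "summary": "The Federal Reserve ...",
    "nodes": [
        {
            "title": "Monitoring Financial Vulnerabilities",
            "node_id": "0007",
            "start_index": 22,
            "end_index": 28,
            "summary": "The Federal Reserve's monitoring ..."
        },
        {
            "title": "Domestic and International Cooperation and Coordination",
            "node_id": "0008",
            "start_index": 28,
            "end_index": 31,
            "summary": "In 2023, the Federal Reserve collaborated ..."
        }
    ]
}
"""


def iter_nodes(node, depth=0):
    """Yield (depth, node) pairs in document order (depth-first, pre-order)."""
    yield depth, node
    for child in node.get("nodes", []):
        yield from iter_nodes(child, depth + 1)


tree = json.loads(TREE_JSON)

# Build an indented outline with each node's index range.
outline = [
    f"{'  ' * depth}{n['node_id']} {n['title']} [{n['start_index']}-{n['end_index']}]"
    for depth, n in iter_nodes(tree)
]
print("\n".join(outline))
```

An outline like this is small enough to place in a prompt, which is one way a model could decide which branch to open next.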


Accurate and Efficient Tree Search

PageIndex Retrieval

PageIndex uses tree search with multi-step reasoning to retrieve information from complex documents. This approach goes beyond traditional vector-based retrieval by simulating how human experts systematically navigate and extract insights from lengthy documents.

PageIndexRetrieval.json


# Example of PageIndex Retrieval API Response
{
  "title": "Monetary Policy and Economic Developments",
  "node_id": "0004",
  "nodes": [
    {
      "title": "March 2024 Summary",
      "node_id": "0005",
      "relevant_contents": [{
          "physical_index": 10, 
          "relevant_content": "The labor market has gained averaging 239,000 per month since June 2023..."
        }]
    },
    {
      "title": "June 2023 Summary",
      "node_id": "0006",
      "relevant_contents": [{
          "physical_index": 15, 
          "relevant_content": "The labor market has remained very tight, with job gains averaging 314,000 per month during..."
        }]
    }
  ]
}

No Top-K Selection Required: Tree search automatically identifies all relevant tree nodes without manual parameter tuning.
Transparent Node Trajectories: Returns the complete search path through the tree structure for transparency and provides rich contextual information.
Exact Page References: Every retrieved node includes precise page numbers and locations from the original document for verifiable information retrieval.
LLM-Ready Output Format: Structured data with relevant paragraphs and search trajectories, ready for downstream LLM processing.
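
As an illustration of the last point, the sketch below flattens a response shaped like the example above into a context string for a downstream prompt. It assumes only the fields shown in that example (`title`, `nodes`, `relevant_contents`, `physical_index`, `relevant_content`). The helper `collect_passages` is hypothetical, not a PageIndex function.

```python
# Copy of the example response above; the passage text is left truncated.
response = {
    "title": "Monetary Policy and Economic Developments",
    "node_id": "0004",
    "nodes": [
        {
            "title": "March 2024 Summary",
            "node_id": "0005",
            "relevant_contents": [{
                "physical_index": 10,
                "relevant_content": "The labor market has gained averaging 239,000 per month since June 2023...",
            }],
        },
        {
            "title": "June 2023 Summary",
            "node_id": "0006",
            "relevant_contents": [{
                "physical_index": 15,
                "relevant_content": "The labor market has remained very tight, with job gains averaging 314,000 per month during...",
            }],
        },
    ],
}


def collect_passages(node, path=()):
    """Return (trajectory, physical_index, text) for every relevant passage in the tree."""
    trail = path + (node["title"],)
    found = [
        (" > ".join(trail), item["physical_index"], item["relevant_content"])
        for item in node.get("relevant_contents", [])
    ]
    for child in node.get("nodes", []):
        found.extend(collect_passages(child, trail))
    return found


passages = collect_passages(response)

# One line per passage, prefixed with its search trajectory and page index.
context = "\n".join(
    f"[{trail} | physical_index {idx}] {text}" for trail, idx, text in passages
)
print(context)
```

Keeping the trajectory and `physical_index` next to each passage lets the final answer cite where in the source document each statement came from.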


RAG Comparison

PageIndex vs Vector DB

Choose the right RAG technique for your task.

PageIndex: Logical Reasoning

High Retrieval Accuracy

Relies on logical reasoning, ideal for domain-specific data where semantics are similar.

Fully Traceable Retrieval Process

Tree search provides a traceable reasoning process, and each retrieved node also contains an exact page reference.

Slower Retrieval Due to Tree Search

Tree search is slower, but provides accurate results for complex domain-specific queries.

Efficient Prompt-Level Knowledge Integration

Easily integrates with expert knowledge and user preferences during the tree search process.

Best for Domain-Specific Document Analysis

Financial reports and SEC filings
Regulatory and compliance documents
Healthcare and medical reports
Legal contracts and case law
Technical manuals and scientific documentation

Vector DB: Semantic Similarity

Low Retrieval Accuracy

Relies on semantic similarity, which is unreliable for domain-specific data where all content has similar semantics.

Black Box Retrieval without Traceability

Often lacks clear traceability to source documents, making it difficult to verify information or understand retrieval decisions.

Faster Retrieval Due to Vector Search

Offers faster retrieval speeds, making it efficient for applications where quick responses are critical.

Knowledge Integration Requires Fine-Tuning

Requires fine-tuning embedding models to incorporate new knowledge or preferences.

Best for Generic & Exploratory Applications

Semantic recommendation systems
Creative writing and ideation tools
Short passage retrieval
Multi-modal retrieval
Generic knowledge question answering

Case Study

PageIndex Powers Leading Industry Models

PageIndex forms the foundation of Mafin 2.5, a leading RAG model for financial report analysis that achieves 98.7% accuracy on FinanceBench, the highest in the market.


FinanceBench accuracy by approach:

30%: RAG with Vector DB (one vector index for all the documents).
50%: RAG with Vector DB (one vector index for each document).
98.7%: RAG with PageIndex (Query-to-SQL for document-level retrieval, PageIndex for node-level retrieval).

The results of RAG with Vector DB are from the FinanceBench paper.
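
The two-level setup named above (Query-to-SQL for document-level retrieval, then tree search inside the chosen document) can be sketched roughly as follows. This is an illustration under stated assumptions, not the Mafin 2.5 implementation: the table schema, the document rows, and the SQL statement are all hypothetical, and the SQL is written by hand where a real pipeline would have a model generate it from the question.

```python
import sqlite3

# Hypothetical metadata table: one row per indexed document.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents ("
    "doc_id TEXT PRIMARY KEY, company TEXT, fiscal_year INTEGER, doc_type TEXT)"
)
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?, ?)",
    [
        ("doc-001", "Example Corp", 2022, "10-K"),
        ("doc-002", "Example Corp", 2023, "10-K"),
        ("doc-003", "Sample Inc", 2023, "10-Q"),
    ],
)

# Step 1 (document level): the question "What was Example Corp's revenue in
# fiscal year 2023?" is translated into SQL. Hand-written here; an LLM would
# produce it in a Query-to-SQL pipeline.
sql = (
    "SELECT doc_id FROM documents "
    "WHERE company = ? AND fiscal_year = ? AND doc_type = ?"
)
doc_ids = [row[0] for row in conn.execute(sql, ("Example Corp", 2023, "10-K"))]

# Step 2 (node level): only the selected documents' trees would be searched.
# Placeholder trees keyed by doc_id stand in for real PageIndex trees.
trees = {doc_id: {"title": f"Tree for {doc_id}", "nodes": []}
         for doc_id in ("doc-001", "doc-002", "doc-003")}
selected_trees = [trees[doc_id] for doc_id in doc_ids]
print(doc_ids)
```

Filtering on structured metadata first narrows a large corpus to a handful of documents, so the slower reasoning-based search only runs where it is needed.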

Ready to integrate Reasoning-based RAG with PageIndex?

Access Now