🚀 Our official PageIndex platform is launching in June 2025. Stay tuned for updates!
PageIndex: Document Index System for Reasoning-Based RAG

Frustrated with vector database retrieval accuracy for long professional documents? You need a reasoning-native index for your RAG system.


What is PageIndex?

PageIndex transforms lengthy documents into a semantic tree structure that is ready for reasoning-based RAG.

Hierarchical Tree Structure
LLM-friendly "table of contents" for efficient document navigation and comprehension.
Chunk-Free Segmentation
Preserves natural document structure without arbitrary chunking.
Node Summary with Precise Page Referencing
Provides exact page references and summaries for precise information extraction.
Designed for Long Documents
Optimized for financial reports, legal documents, and technical manuals that exceed LLM context limits.
PageIndex.json
...
{
    "title": "Financial Stability",
    "node_id": "0006",
    "start_index": 21,
    "end_index": 22,
    "summary": "The Federal Reserve ...",
    "nodes": [
        {
            "title": "Monitoring Financial Vulnerabilities",
            "node_id": "0007",
            "start_index": 22,
            "end_index": 28,
            "summary": "The Federal Reserve's monitoring ..."
        },
        {
            "title": "Domestic and International Cooperation and Coordination",
            "node_id": "0008",
            "start_index": 28,
            "end_index": 31,
            "summary": "In 2023, the Federal Reserve collaborated ..."
        }
    ]
},
...
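
To make the structure above concrete, here is a minimal Python sketch that loads the same fragment and walks it. It assumes that `start_index` and `end_index` are page indices and that child nodes always live under the `nodes` key, as in the sample. The helper names `iter_nodes` and `find_node` are illustrative and are not part of any PageIndex API.

```python
import json

# The fragment shown above, written as valid JSON.
# Summaries are truncated in the source and are left truncated here.
PAGE_INDEX_FRAGMENT = """
{
    "title": "Financial Stability",
    "node_id": "0006",
    "start_index": 21,
    "end_index": 22,
    "summary": "The Federal Reserve ...",
    "nodes": [
        {
            "title": "Monitoring Financial Vulnerabilities",
            "node_id": "0007",
            "start_index": 22,
            "end_index": 28,
            "summary": "The Federal Reserve's monitoring ..."
        },
        {
            "title": "Domestic and International Cooperation and Coordination",
            "node_id": "0008",
            "start_index": 28,
            "end_index": 31,
            "summary": "In 2023, the Federal Reserve collaborated ..."
        }
    ]
}
"""


def iter_nodes(node, depth=0):
    """Yield (depth, node) for a node and all of its descendants, depth-first."""
    yield depth, node
    for child in node.get("nodes", []):
        yield from iter_nodes(child, depth + 1)


def find_node(root, node_id):
    """Return the node with the given node_id, or None if it is not in the tree."""
    for _, node in iter_nodes(root):
        if node["node_id"] == node_id:
            return node
    return None


root = json.loads(PAGE_INDEX_FRAGMENT)

# Build an indented outline, one line per node, with its page range.
outline = [
    f"{'  ' * depth}{node['node_id']} {node['title']} "
    f"(pages {node['start_index']}-{node['end_index']})"
    for depth, node in iter_nodes(root)
]
print("\n".join(outline))
```

Because every node carries its own page range, a retrieved `node_id` can be traced straight back to the pages it covers.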

Beyond Semantic Similarity

Reasoning-Based RAG with PageIndex

Vector-based RAG relies on semantic similarity, so it often returns results that are loosely related but contextually off-target. It ignores document structure and can produce unreliable retrievals in specialized domains.
Reasoning-based RAG with PageIndex uses tree search algorithms that navigate documents like humans do, finding information based on document structure rather than just semantic similarity.
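
The idea of searching by structure can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the PageIndex algorithm: the `keyword_selector` function is a hypothetical stand-in for the reasoning step (in a real system an LLM would decide which child sections to open), and `tree_search` is an illustrative name.

```python
from typing import Any, Callable, Dict, List, Tuple

Node = Dict[str, Any]


def keyword_selector(query: str, children: List[Node]) -> List[Node]:
    """Hypothetical stand-in for the reasoning step.

    Keeps the children whose title or summary shares a word with the query.
    """
    query_words = set(query.lower().split())
    picked = []
    for child in children:
        text = f"{child['title']} {child.get('summary', '')}".lower()
        if query_words & set(text.split()):
            picked.append(child)
    return picked


def tree_search(
    node: Node,
    query: str,
    select: Callable[[str, List[Node]], List[Node]] = keyword_selector,
    path: Tuple[str, ...] = (),
) -> List[Dict[str, Any]]:
    """Descend from `node`, opening only the children the selector picks.

    A node is returned as a hit when it has no children or when none of its
    children are selected. Each hit records the path taken and the page range.
    """
    path = path + (node["title"],)
    children = node.get("nodes", [])
    chosen = select(query, children) if children else []
    if not chosen:
        return [{
            "node_id": node["node_id"],
            "path": list(path),
            "pages": (node["start_index"], node["end_index"]),
        }]
    hits: List[Dict[str, Any]] = []
    for child in chosen:
        hits.extend(tree_search(child, query, select, path))
    return hits


# Titles and page ranges are taken from the PageIndex.json sample on this page.
tree = {
    "title": "Financial Stability", "node_id": "0006",
    "start_index": 21, "end_index": 22,
    "nodes": [
        {"title": "Monitoring Financial Vulnerabilities", "node_id": "0007",
         "start_index": 22, "end_index": 28},
        {"title": "Domestic and International Cooperation and Coordination",
         "node_id": "0008", "start_index": 28, "end_index": 31},
    ],
}

hits = tree_search(tree, "international cooperation")
for hit in hits:
    print(hit["node_id"], hit["pages"], " > ".join(hit["path"]))
```

The recorded `path` is what makes the retrieval traceable: each hit shows which sections were opened on the way down and which pages the final node covers.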
No Vector DB Required
The PageIndex tree structure can be stored in a traditional database, eliminating the need for specialized vector DB infrastructure.
No Hard Chunking Needed
PageIndex intelligently segments documents along natural content boundaries, preserving context and improving retrieval quality.
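
As a rough illustration of storing the tree in a traditional database, the sketch below flattens a tree into a single SQLite table using Python's standard `sqlite3` module. The table name, the column layout, and the `flatten` helper are assumptions made for this example, not a schema prescribed by PageIndex.

```python
import sqlite3

# Titles and page ranges are taken from the PageIndex.json sample on this page.
tree = {
    "title": "Financial Stability", "node_id": "0006",
    "start_index": 21, "end_index": 22,
    "nodes": [
        {"title": "Monitoring Financial Vulnerabilities", "node_id": "0007",
         "start_index": 22, "end_index": 28},
        {"title": "Domestic and International Cooperation and Coordination",
         "node_id": "0008", "start_index": 28, "end_index": 31},
    ],
}


def flatten(node, parent_id=None):
    """Yield one row per node; parents are emitted before their children."""
    yield (
        node["node_id"], parent_id, node["title"],
        node["start_index"], node["end_index"], node.get("summary", ""),
    )
    for child in node.get("nodes", []):
        yield from flatten(child, node["node_id"])


conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE nodes (
        node_id     TEXT PRIMARY KEY,
        parent_id   TEXT REFERENCES nodes(node_id),
        title       TEXT NOT NULL,
        start_index INTEGER NOT NULL,
        end_index   INTEGER NOT NULL,
        summary     TEXT
    )
    """
)
conn.executemany("INSERT INTO nodes VALUES (?, ?, ?, ?, ?, ?)", flatten(tree))
conn.commit()

# Fetch the children of a node: this is the lookup a tree search needs at each step.
children = conn.execute(
    "SELECT node_id, title FROM nodes WHERE parent_id = ? ORDER BY node_id",
    ("0006",),
).fetchall()

# Find every node whose page range covers a given page.
covering = [
    row[0]
    for row in conn.execute(
        "SELECT node_id FROM nodes "
        "WHERE start_index <= ? AND end_index >= ? ORDER BY node_id",
        (28, 28),
    )
]
print(children)
print(covering)
```

A parent-pointer table like this keeps each level of the tree one indexed query away, so no embedding store is involved at any point.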

RAG Comparison

PageIndex vs Vector DB

Choose the right RAG technique for your task.

PageIndex: Logical Reasoning
High Retrieval Accuracy
Relies on logical reasoning, which makes it well suited to domain-specific data where most content is semantically similar.
Fully Traceable Retrieval Process
Tree search provides a traceable reasoning process, and each retrieved node carries an exact page reference.
Slower Retrieval Due to Tree Search
Tree search is slower than vector search, but it provides accurate results for complex domain-specific queries.
Efficient Prompt-Level Knowledge Integration
Easily integrates with expert knowledge and user preferences during the tree search process.
Best for Domain-Specific Document Analysis
  • Financial reports and SEC filings
  • Regulatory and compliance documents
  • Healthcare and medical reports
  • Legal contracts and case law
  • Technical manuals and scientific documentation
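
One way to picture the prompt-level knowledge integration described above: expert knowledge and user preferences are written directly into the prompt used at each step of the tree search, with no model retraining. The sketch below only builds such a prompt as a string. The function name, the prompt wording, and the example hint are all hypothetical, and no LLM is called.

```python
from typing import Any, Dict, List, Sequence


def build_selection_prompt(
    query: str,
    children: List[Dict[str, Any]],
    expert_knowledge: Sequence[str] = (),
    user_preferences: Sequence[str] = (),
) -> str:
    """Assemble a hypothetical node-selection prompt for one tree-search step."""
    lines = [
        "You are navigating a document's table of contents.",
        f"Question: {query}",
        "",
    ]
    if expert_knowledge:
        lines.append("Expert knowledge to apply:")
        lines.extend(f"- {note}" for note in expert_knowledge)
        lines.append("")
    if user_preferences:
        lines.append("User preferences:")
        lines.extend(f"- {pref}" for pref in user_preferences)
        lines.append("")
    lines.append("Candidate sections:")
    for child in children:
        lines.append(
            f"[{child['node_id']}] {child['title']} "
            f"(pages {child['start_index']}-{child['end_index']})"
        )
    lines.append("")
    lines.append("Reply with the node_id of each section likely to contain the answer.")
    return "\n".join(lines)


# Titles and page ranges are taken from the PageIndex.json sample on this page.
candidates = [
    {"title": "Monitoring Financial Vulnerabilities", "node_id": "0007",
     "start_index": 22, "end_index": 28},
    {"title": "Domestic and International Cooperation and Coordination",
     "node_id": "0008", "start_index": 28, "end_index": 31},
]

prompt = build_selection_prompt(
    "How did the Federal Reserve work with foreign regulators?",
    candidates,
    # Made-up hint, included only to show where domain knowledge would go.
    expert_knowledge=["Cross-border topics are usually filed under cooperation sections."],
    user_preferences=["Prefer the most specific section available."],
)
print(prompt)
```

Changing the retrieval behavior then means editing the lists passed in, rather than fine-tuning an embedding model.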
Vector DB: Semantic Similarity
Low Retrieval Accuracy
Relies on semantic similarity, which is unreliable for domain-specific data where all content has similar semantics.
Black Box Retrieval without Traceability
Often lacks clear traceability to source documents, making it difficult to verify information or understand retrieval decisions.
Faster Retrieval Due to Vector Search
Offers faster retrieval speeds, making it efficient for applications where quick responses are critical.
Knowledge Integration Requires Fine-Tuning
Requires fine-tuning embedding models to incorporate new knowledge or preferences.
Best for Generic & Exploratory Applications
  • Semantic recommendation systems
  • Creative writing and ideation tools
  • Short passage retrieval
  • Multi-modal retrieval
  • Generic knowledge question answering

Case Study

PageIndex Powers Leading Industry Models

PageIndex forms the foundation of Mafin 2.5, a leading RAG model for financial report analysis, achieving 98.7% accuracy on FinanceBench — the highest in the market.

Accuracy on FinanceBench:

  • 30%: RAG with Vector DB, using one vector index for all the documents.
  • 50%: RAG with Vector DB, using one vector index for each document.
  • 98.7%: RAG with PageIndex, using query-to-SQL for document-level retrieval and PageIndex for node-level retrieval.

The results of RAG with Vector DB are from the FinanceBench paper.

Ready to integrate Reasoning-based RAG with PageIndex?