PageIndex: Next-Generation Reasoning-based RAG
You need a reasoning-native RAG system without vector DBs.
Higher Accuracy
Beyond semantic similarity
Better Transparency
Clear reasoning trajectory
Like A Human
Retrieve like a human expert
No Vector DB
No extra infra overhead
No Chunking
Preserve full context
No Top-K
Retrieve all relevant passages
Beyond Semantic Similarity and Vector Search
Reasoning-based RAG with PageIndex
Problem: Traditional vector-based RAG systems rely on semantic similarity, but similarity ≠ relevance. This often leads to retrieval failures, especially in specialized domains where sections may share similar language but differ in critical details.
Solution: Inspired by AlphaGo, we developed PageIndex, a reasoning-based retrieval method that simulates how humans search for information. Instead of using vectors for retrieval, PageIndex first generates a tree index of the document and then performs a tree search over that index to retrieve relevant information.
PageIndex Tree Generation
Creates "table-of-contents" tree indexes, preserving the logical organization structure of documents.
PageIndex Retrieval
Conducts tree search with multi-step reasoning to retrieve knowledge from complex documents.
Context-Preserving and Reasoning-Native Index
PageIndex Tree Generation
Documents are indexed as trees generated by PageIndex. These trees maintain the original document's logical flow and organizational structure. This LLM-optimized tree representation enables precise navigation and is ready for reasoning-based RAG.
- No Vector DB Required: Tree structures are represented as lightweight JSON objects, avoiding the overhead and complexity of vector databases.
- No Chunking Required: The natural document structure is preserved without artificial text splitting, which improves context retention.
- Node Summary with Precise Page Referencing: Each node provides exact page references and a summary for precise information extraction.
- Optimized for Long Documents: Tree generation is optimized for financial reports, legal documents, and technical manuals that exceed LLM context limits.
# Example of PageIndex Tree Structure
...
{
  "title": "Financial Stability",
  "node_id": "0006",
  "start_index": 21,
  "end_index": 22,
  "summary": "The Federal Reserve ...",
  "nodes": [
    {
      "title": "Monitoring Financial Vulnerabilities",
      "node_id": "0007",
      "start_index": 22,
      "end_index": 28,
      "summary": "The Federal Reserve's monitoring ..."
    },
    {
      "title": "Domestic and International Cooperation and Coordination",
      "node_id": "0008",
      "start_index": 28,
      "end_index": 31,
      "summary": "In 2023, the Federal Reserve collaborated ..."
    }
  ]
},
...
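Because the index is plain JSON, it can be explored with a few lines of standard-library Python. The sketch below is illustrative only: it reuses the field names from the example above, and it assumes that `start_index` and `end_index` are inclusive page numbers. The helper names (`walk`, `outline`, `nodes_covering_page`) are hypothetical and not part of any PageIndex API.

```python
import json
from typing import Iterator, List, Tuple

# Sample node copied from the example above (summaries stay truncated).
TREE_JSON = """
{
  "title": "Financial Stability",
  "node_id": "0006",
  "start_index": 21,
  "end_index": 22,
  "summary": "The Federal Reserve ...",
  "nodes": [
    {"title": "Monitoring Financial Vulnerabilities", "node_id": "0007",
     "start_index": 22, "end_index": 28,
     "summary": "The Federal Reserve's monitoring ..."},
    {"title": "Domestic and International Cooperation and Coordination",
     "node_id": "0008", "start_index": 28, "end_index": 31,
     "summary": "In 2023, the Federal Reserve collaborated ..."}
  ]
}
"""


def walk(node: dict, depth: int = 0) -> Iterator[Tuple[int, dict]]:
    """Yield (depth, node) pairs in document order (pre-order traversal)."""
    yield depth, node
    for child in node.get("nodes", []):
        yield from walk(child, depth + 1)


def outline(root: dict) -> List[str]:
    """Render the tree as an indented table of contents."""
    return [
        f"{'  ' * depth}[{n['node_id']}] {n['title']} "
        f"(pp. {n['start_index']}-{n['end_index']})"
        for depth, n in walk(root)
    ]


def nodes_covering_page(root: dict, page: int) -> List[str]:
    """Return IDs of nodes whose page range includes `page`.

    Assumption: start_index and end_index are inclusive page numbers.
    """
    return [
        n["node_id"]
        for _, n in walk(root)
        if n["start_index"] <= page <= n["end_index"]
    ]


tree = json.loads(TREE_JSON)
print("\n".join(outline(tree)))
print(nodes_covering_page(tree, 22))
```

Keeping the index as nested dictionaries means an outline, a page lookup, or a prompt for an LLM can all be derived from the same object without a separate store.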
Accurate and Human-like Tree Search
PageIndex Retrieval
PageIndex uses tree search with multi-step reasoning to retrieve information from complex documents. This approach goes beyond traditional vector-based retrieval by simulating how human experts systematically navigate and extract insights from lengthy documents.
# Example of PageIndex Retrieval API Response
{
  "title": "Monetary Policy and Economic Developments",
  "node_id": "0004",
  "nodes": [
    {
      "title": "March 2024 Summary",
      "node_id": "0005",
      "relevant_contents": [{
        "physical_index": 10,
        "relevant_content": "The labor market has gained averaging 239,000 per month since June 2023..."
      }]
    },
    {
      "title": "June 2023 Summary",
      "node_id": "0006",
      "relevant_contents": [{
        "physical_index": 15,
        "relevant_content": "The labor market has remained very tight, with job gains averaging 314,000 per month during..."
      }]
    }
  ]
}
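A response shaped like the one above can be flattened into citation-tagged passages before it is handed to a downstream LLM. The following is a minimal sketch, not the official client: it assumes the response has already been parsed into a Python dictionary, that `physical_index` is the page number in the source document, and that the helper names `collect_passages` and `to_context` are hypothetical.

```python
from typing import List, Tuple

# Parsed response, copied from the example above (texts stay truncated).
RESPONSE = {
    "title": "Monetary Policy and Economic Developments",
    "node_id": "0004",
    "nodes": [
        {
            "title": "March 2024 Summary",
            "node_id": "0005",
            "relevant_contents": [{
                "physical_index": 10,
                "relevant_content": "The labor market has gained averaging "
                                    "239,000 per month since June 2023...",
            }],
        },
        {
            "title": "June 2023 Summary",
            "node_id": "0006",
            "relevant_contents": [{
                "physical_index": 15,
                "relevant_content": "The labor market has remained very tight, "
                                    "with job gains averaging 314,000 per month "
                                    "during...",
            }],
        },
    ],
}


def collect_passages(node: dict, path: Tuple[str, ...] = ()) -> List[dict]:
    """Walk the response tree and return one record per retrieved passage.

    Each record keeps the title path from the root (the search trajectory),
    the node ID, and the page (assumed to be `physical_index`).
    """
    here = path + (node["title"],)
    passages = [
        {
            "path": " > ".join(here),
            "node_id": node["node_id"],
            "page": item["physical_index"],
            "text": item["relevant_content"],
        }
        for item in node.get("relevant_contents", [])
    ]
    for child in node.get("nodes", []):
        passages.extend(collect_passages(child, here))
    return passages


def to_context(passages: List[dict]) -> str:
    """Format passages as one citation-tagged line each for an LLM prompt."""
    return "\n".join(
        f"[node {p['node_id']}, page {p['page']}] {p['path']}: {p['text']}"
        for p in passages
    )


passages = collect_passages(RESPONSE)
print(to_context(passages))
```

Carrying the node ID and page alongside each passage lets a generated answer cite its sources in a form that can be checked against the original document.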
- No Top-K Selection Required: Tree search automatically identifies all relevant tree nodes without manual parameter tuning.
- Transparent Search Trajectories: Returns the complete search path through the tree structure, providing transparency and rich contextual information.
- Node and Page References: Every retrieved passage includes its node ID and page number from the original document for verifiable information retrieval.
- LLM-Ready Output Format: Structured data output with relevant paragraphs and search trajectories, ready for downstream LLM processing.
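The general idea of a tree search that returns every relevant node with its trajectory, rather than a fixed top-K, can be sketched as follows. This is not PageIndex's actual algorithm: the relevance decision, which the text describes as multi-step reasoning, is replaced here by a toy keyword-overlap judge so that the example runs offline. The tree, its summaries, and the names `keyword_judge` and `tree_search` are all invented for illustration.

```python
import re
from typing import Callable, List, Tuple

Judge = Callable[[str, dict], bool]

# Illustrative tree; titles echo the examples above, summaries are made up.
TREE = {
    "title": "Annual Report", "node_id": "0000",
    "summary": "Covers monetary policy and financial stability.",
    "nodes": [
        {"title": "Monetary Policy and Economic Developments",
         "node_id": "0004",
         "summary": "Rate decisions and labor market developments."},
        {"title": "Financial Stability", "node_id": "0006",
         "summary": "Monitoring of vulnerabilities and cooperation "
                    "with other regulators.",
         "nodes": [
             {"title": "Monitoring Financial Vulnerabilities",
              "node_id": "0007", "summary": "Leverage and funding risks."},
             {"title": "Domestic and International Cooperation "
                       "and Coordination",
              "node_id": "0008", "summary": "Work with foreign agencies."},
         ]},
    ],
}


def _words(text: str) -> set:
    """Lowercased alphabetic tokens of `text`."""
    return set(re.findall(r"[a-z]+", text.lower()))


def keyword_judge(query: str, node: dict) -> bool:
    """Toy stand-in for an LLM call: true if query and node share a word."""
    node_text = node["title"] + " " + node.get("summary", "")
    return bool(_words(query) & _words(node_text))


def tree_search(query: str, node: dict, judge: Judge,
                path: Tuple[str, ...] = ()) -> List[List[str]]:
    """Return the node-ID trajectory of every relevant node.

    A subtree is pruned as soon as the judge rejects its root. If a node is
    accepted but none of its children are, the node itself is returned.
    No result limit is applied, so there is no top-K parameter to tune.
    """
    if not judge(query, node):
        return []
    here = path + (node["node_id"],)
    hits: List[List[str]] = []
    for child in node.get("nodes", []):
        hits.extend(tree_search(query, child, judge, here))
    return hits or [list(here)]


print(tree_search("financial vulnerabilities cooperation", TREE, keyword_judge))
print(tree_search("monetary policy", TREE, keyword_judge))
```

Swapping `keyword_judge` for a function that asks an LLM whether a node's title and summary suggest it contains the answer would keep the same control flow, and the returned paths double as the search trajectory.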
RAG Comparison
PageIndex vs Vector DB
Choose the right RAG technique for your task.
PageIndex is the better fit for:
- Financial reports and SEC filings
- Regulatory and compliance documents
- Healthcare and medical reports
- Legal contracts and case law
- Technical manuals and scientific documentation
Vector DB is the better fit for:
- Vibe retrieval
- Semantic recommendation systems
- Creative writing and ideation tools
- Short news/email retrieval
- Generic knowledge question answering
Case Study
PageIndex Powers Leading Industry Models
PageIndex forms the foundation of Mafin 2.5, a leading RAG model for financial report analysis, achieving 98.7% accuracy on FinanceBench — the highest in the market.
Accuracy on FinanceBench:
- 30%: RAG with Vector DB (one vector index for all the documents)
- 50%: RAG with Vector DB (one vector index for each document)
- 98.7%: RAG with PageIndex (Query-to-SQL for document-level retrieval, PageIndex for node-level retrieval)
The results of RAG with Vector DB are from the FinanceBench paper.