Beyond Semantic Similarity
Reasoning-based RAG with PageIndex
Traditional vector-based RAG systems rely on semantic similarity, but similarity ≠ relevance. This often leads to retrieval failures, especially in specialized domains where sections may share similar language but differ in critical details.
Inspired by AlphaGo, we developed PageIndex, a reasoning-based RAG system that simulates how human experts navigate and extract knowledge from long documents through tree search. Two key components of PageIndex are listed below.

Context-Preserving Indexes
PageIndex Tree Generation
Documents are indexed as trees generated by PageIndex. These trees maintain the original document's logical flow and organizational structure. This LLM-optimized tree representation enables precise navigation and is ready for reasoning-based RAG.
- No Vector DB Required: Tree structures are generated as lightweight JSON files, eliminating the need for expensive vector database storage.
- No Chunking Required: The natural document structure is preserved without artificial text splitting, which improves context retention.
- Node Summary with Precise Page Referencing: Each node provides a summary and exact page references for precise information extraction.
- Optimized for Long Documents: Tree generation is optimized for financial reports, legal documents, and technical manuals that exceed LLM context limits.
# Example of PageIndex Tree Structure
{
  "title": "Financial Stability",
  "node_id": "0006",
  "start_index": 21,
  "end_index": 22,
  "summary": "The Federal Reserve ...",
  "nodes": [
    {
      "title": "Monitoring Financial Vulnerabilities",
      "node_id": "0007",
      "start_index": 22,
      "end_index": 28,
      "summary": "The Federal Reserve's monitoring ..."
    },
    {
      "title": "Domestic and International Cooperation and Coordination",
      "node_id": "0008",
      "start_index": 28,
      "end_index": 31,
      "summary": "In 2023, the Federal Reserve collaborated ..."
    }
  ]
},
...
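To make the tree format concrete, here is a minimal Python sketch that loads the example node above and prints it as an indented outline. It assumes `start_index` and `end_index` are page numbers in the source document, which the example suggests but does not state; the helper names `walk` and `outline` are illustrative and not part of PageIndex.

```python
import json

# A copy of the example tree above; summaries are elided as in the source.
TREE_JSON = """
{
  "title": "Financial Stability",
  "node_id": "0006",
  "start_index": 21,
  "end_index": 22,
  "summary": "The Federal Reserve ...",
  "nodes": [
    {"title": "Monitoring Financial Vulnerabilities", "node_id": "0007",
     "start_index": 22, "end_index": 28,
     "summary": "The Federal Reserve's monitoring ..."},
    {"title": "Domestic and International Cooperation and Coordination",
     "node_id": "0008", "start_index": 28, "end_index": 31,
     "summary": "In 2023, the Federal Reserve collaborated ..."}
  ]
}
"""


def walk(node, depth=0):
    """Yield (depth, node) for a node and all of its descendants, depth-first."""
    yield depth, node
    for child in node.get("nodes", []):
        yield from walk(child, depth + 1)


def outline(tree):
    """Render the tree as indented lines with node IDs and (assumed) page ranges."""
    return [
        f'{"  " * depth}[{node["node_id"]}] {node["title"]} '
        f'(pp. {node["start_index"]}-{node["end_index"]})'
        for depth, node in walk(tree)
    ]


tree = json.loads(TREE_JSON)
toc = outline(tree)
print("\n".join(toc))
```

Because the index is plain JSON, this kind of traversal needs nothing beyond the standard library.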
Accurate and Efficient Tree Search
PageIndex Retrieval
PageIndex uses tree search with multi-step reasoning to retrieve information from complex documents. This approach goes beyond traditional vector-based retrieval by simulating how human experts systematically navigate and extract insights from lengthy documents.
# Example of PageIndex Retrieval API Response
{
  "title": "Monetary Policy and Economic Developments",
  "node_id": "0004",
  "nodes": [
    {
      "title": "March 2024 Summary",
      "node_id": "0005",
      "relevant_contents": [{
        "physical_index": 10,
        "relevant_content": "The labor market has gained averaging 239,000 per month since June 2023..."
      }]
    },
    {
      "title": "June 2023 Summary",
      "node_id": "0006",
      "relevant_contents": [{
        "physical_index": 15,
        "relevant_content": "The labor market has remained very tight, with job gains averaging 314,000 per month during..."
      }]
    }
  ]
}
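One way to consume a response shaped like the example above is to flatten it into passages that carry their title path and page, then join them into a prompt context. The sketch below is an illustration only: it assumes `physical_index` is the page number in the original file, and `collect_passages` is a hypothetical helper, not a PageIndex function.

```python
def collect_passages(node, path=()):
    """Flatten a retrieval response into passages tagged with title path and page."""
    here = path + (node["title"],)
    passages = [
        {
            "path": " > ".join(here),
            "page": item["physical_index"],  # assumed to be a page number
            "text": item["relevant_content"],
        }
        for item in node.get("relevant_contents", [])
    ]
    for child in node.get("nodes", []):
        passages.extend(collect_passages(child, here))
    return passages


# The example response above, copied verbatim (including its elided text).
response = {
    "title": "Monetary Policy and Economic Developments",
    "node_id": "0004",
    "nodes": [
        {
            "title": "March 2024 Summary",
            "node_id": "0005",
            "relevant_contents": [{
                "physical_index": 10,
                "relevant_content": "The labor market has gained averaging 239,000 per month since June 2023...",
            }],
        },
        {
            "title": "June 2023 Summary",
            "node_id": "0006",
            "relevant_contents": [{
                "physical_index": 15,
                "relevant_content": "The labor market has remained very tight, with job gains averaging 314,000 per month during...",
            }],
        },
    ],
}

passages = collect_passages(response)
# One line per passage, with a page citation an LLM can echo back.
context = "\n".join(f'[p. {p["page"]}] {p["path"]}: {p["text"]}' for p in passages)
print(context)
```

Keeping the title path next to each passage lets a downstream model cite both the section and the page it drew from.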
- No Top-K Selection Required: Tree search automatically identifies all relevant tree nodes without manual parameter tuning.
- Transparent Node Trajectories: The complete search path through the tree is returned, providing transparency and rich contextual information.
- Exact Page References: Every retrieved node includes precise page numbers and locations from the original document, so retrieved information can be verified.
- LLM-Ready Output Format: Results are structured data with relevant paragraphs and search trajectories, ready for downstream LLM processing.
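The idea of searching the tree instead of ranking chunks can be sketched in a few lines of Python. This is not PageIndex's implementation: the relevance decision, which PageIndex makes with LLM reasoning, is replaced here by a toy keyword check (`keyword_judge`) so the example runs offline, and all function names are hypothetical. The starting node is treated as relevant by assumption.

```python
def tree_search(node, query, is_relevant, path=()):
    """Descend only into children judged relevant; return (trajectory, node) pairs.

    The number of results depends on the tree and the judge, not on a fixed k.
    """
    here = path + (node["node_id"],)
    relevant_children = [c for c in node.get("nodes", []) if is_relevant(query, c)]
    if not relevant_children:
        # No more specific match below: this node is the hit.
        return [(here, node)]
    hits = []
    for child in relevant_children:
        hits.extend(tree_search(child, query, is_relevant, here))
    return hits


def keyword_judge(query, node):
    """Toy stand-in for an LLM call: true if a query word appears in title or summary."""
    words = (node["title"] + " " + node.get("summary", "")).lower().split()
    return any(word in words for word in query.lower().split())


# The example tree from the indexing section, with summaries elided as in the source.
tree = {
    "title": "Financial Stability", "node_id": "0006",
    "start_index": 21, "end_index": 22,
    "summary": "The Federal Reserve ...",
    "nodes": [
        {"title": "Monitoring Financial Vulnerabilities", "node_id": "0007",
         "start_index": 22, "end_index": 28,
         "summary": "The Federal Reserve's monitoring ..."},
        {"title": "Domestic and International Cooperation and Coordination",
         "node_id": "0008", "start_index": 28, "end_index": 31,
         "summary": "In 2023, the Federal Reserve collaborated ..."},
    ],
}

hits = tree_search(tree, "monitoring vulnerabilities", keyword_judge)
for trajectory, node in hits:
    print(" -> ".join(trajectory), "|", node["title"])
```

Swapping `keyword_judge` for a function that asks a model whether a node's title and summary could answer the query keeps the same control flow, and the returned trajectory records which nodes were visited on the way to each hit.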
RAG Comparison
PageIndex vs Vector DB
Choose the right RAG technique for your task.
PageIndex
- Financial reports and SEC filings
- Regulatory and compliance documents
- Healthcare and medical reports
- Legal contracts and case law
- Technical manuals and scientific documentation
Vector DB
- Semantic recommendation systems
- Creative writing and ideation tools
- Short passage retrieval
- Multi-modal retrieval
- Generic knowledge question answering
Case Study
PageIndex Powers Leading Industry Models
PageIndex forms the foundation of Mafin 2.5, a leading RAG model for financial report analysis, which achieves 98.7% accuracy on FinanceBench, the highest in the market.
- RAG with Vector DB (one vector index for all the documents): 30%
- RAG with Vector DB (one vector index for each document): 50%
- RAG with PageIndex (Query-to-SQL for document-level retrieval, PageIndex for node-level retrieval): 98.7%
The results of RAG with Vector DB are from the FinanceBench paper.