What is PageIndex?
PageIndex transforms lengthy documents into semantic tree structures that capture each document's hierarchical organization for reasoning-based RAG.
- Hierarchical Tree Structure
- PageIndex creates an LLM-friendly "table of contents" for your document, enabling LLM agents to navigate and comprehend complex documents efficiently.
- Chunk-Free Segmentation
- No arbitrary chunking. Nodes follow the natural structure of the document.
- Node Summary with Precise Page Referencing
- Each node contains its own summary and exact physical page index, so agents can pinpoint and extract the most relevant information with precise page references.
- Designed for Long Documents
- PageIndex is designed to handle long documents such as financial reports, legal documents, and technical manuals, even when they exceed the context window limits of LLMs.
...
{
"title": "Financial Stability",
"node_id": "0006",
"start_index": 21,
"end_index": 22,
"summary": "The Federal Reserve ...",
"nodes": [
{
"title": "Monitoring Financial Vulnerabilities",
"node_id": "0007",
"start_index": 22,
"end_index": 28,
"summary": "The Federal Reserve's monitoring ..."
},
{
"title": "Domestic and International Cooperation and Coordination",
"node_id": "0008",
"start_index": 28,
"end_index": 31,
"summary": "In 2023, the Federal Reserve collaborated ..."
}
]
},
...
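To make the structure concrete, here is a minimal Python sketch that walks a tree shaped like the excerpt above, builds a lookup by `node_id`, and prints an indented outline. The helper names (`iter_nodes`, `index_by_id`, `outline`) are illustrative and not part of PageIndex, and the sketch assumes `start_index` and `end_index` are page numbers.

```python
from __future__ import annotations

from typing import Iterator

# Sample data copied from the excerpt above; summaries are truncated in the source.
TREE = [
    {
        "title": "Financial Stability",
        "node_id": "0006",
        "start_index": 21,
        "end_index": 22,
        "summary": "The Federal Reserve ...",
        "nodes": [
            {
                "title": "Monitoring Financial Vulnerabilities",
                "node_id": "0007",
                "start_index": 22,
                "end_index": 28,
                "summary": "The Federal Reserve's monitoring ...",
            },
            {
                "title": "Domestic and International Cooperation and Coordination",
                "node_id": "0008",
                "start_index": 28,
                "end_index": 31,
                "summary": "In 2023, the Federal Reserve collaborated ...",
            },
        ],
    }
]


def iter_nodes(nodes: list[dict], depth: int = 0) -> Iterator[tuple[int, dict]]:
    """Yield (depth, node) pairs in document order (pre-order traversal)."""
    for node in nodes:
        yield depth, node
        yield from iter_nodes(node.get("nodes", []), depth + 1)


def index_by_id(nodes: list[dict]) -> dict[str, dict]:
    """Map each node_id to its node for direct lookup."""
    return {node["node_id"]: node for _, node in iter_nodes(nodes)}


def outline(nodes: list[dict]) -> str:
    """Render the tree as an indented table of contents with page ranges."""
    return "\n".join(
        f"{'  ' * depth}[{n['node_id']}] {n['title']} (pp. {n['start_index']}-{n['end_index']})"
        for depth, n in iter_nodes(nodes)
    )


print(outline(TREE))
```

A compact outline like this is one way to give an LLM the titles and page ranges without sending every summary.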
Beyond Semantic Similarity
Reasoning-Based RAG with PageIndex
Vector-based RAG relies on semantic similarity and often returns loosely related but contextually off-target results. It ignores document structure, which makes retrieval unreliable in specialized domains.
Reasoning-based RAG with PageIndex uses tree search algorithms that navigate documents like humans do, finding information based on document structure rather than just semantic similarity.
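The idea of a structure-guided search can be sketched as a top-down walk that descends only into branches judged relevant. This is an illustration under stated assumptions, not PageIndex's actual algorithm: the `tree_search` function is hypothetical, and `keyword_judge` is a crude stand-in for the LLM judgment a real system would make at each node.

```python
from __future__ import annotations

import re
from typing import Callable

# A judge decides whether a node is worth exploring for a given question.
Judge = Callable[[str, dict], bool]


def tree_search(question: str, nodes: list[dict], is_relevant: Judge) -> list[dict]:
    """Descend only into accepted branches and return the deepest accepted nodes."""
    hits: list[dict] = []
    for node in nodes:
        if not is_relevant(question, node):
            continue  # prune this node and everything below it
        deeper = tree_search(question, node.get("nodes", []), is_relevant)
        hits.extend(deeper or [node])  # fall back to the node itself if no child matched
    return hits


def keyword_judge(question: str, node: dict) -> bool:
    """Placeholder for an LLM call: accept a node that shares a longer word with the question."""
    def tokens(text: str) -> set[str]:
        return {w for w in re.findall(r"[a-z]+", text.lower()) if len(w) > 3}

    return bool(tokens(question) & tokens(f"{node['title']} {node.get('summary', '')}"))


# Sample data from the excerpt earlier on this page (page indices omitted for brevity).
TREE = [
    {
        "title": "Financial Stability",
        "node_id": "0006",
        "summary": "The Federal Reserve ...",
        "nodes": [
            {
                "title": "Monitoring Financial Vulnerabilities",
                "node_id": "0007",
                "summary": "The Federal Reserve's monitoring ...",
            },
            {
                "title": "Domestic and International Cooperation and Coordination",
                "node_id": "0008",
                "summary": "In 2023, the Federal Reserve collaborated ...",
            },
        ],
    }
]

hits = tree_search("Which financial vulnerabilities are monitored?", TREE, keyword_judge)
print([n["node_id"] for n in hits])
```

Note that rejecting a parent prunes its whole subtree, so the quality of the judge at the upper levels matters most.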

- State-of-the-Art Accuracy
- Reasoning-based retrieval significantly improves accuracy over semantic-similarity retrieval, achieving leading performance on domain benchmarks.
- Transparent and Reliable
- Clear page number references let users trace each result back to the original source with one click, providing clarity and trust.
- Aligned with Domain Expertise
- Retrieval can be aligned with domain expertise, ensuring accurate identification of key details in professional documents.
- Human-Like
- Instead of searching only by semantic similarity, PageIndex generates a 'Table of Contents' and navigates it to retrieve the most relevant and precise information.
RAG Comparison
PageIndex vs Vector DB
Choose the right RAG technique for your task.
PageIndex: High Accuracy Based on Reasoning
Vector DB: Low Accuracy Based on Semantic Similarity
PageIndex: Fully Traceable Results with Page Reference
Vector DB: Black Box Retrieval without Traceability
PageIndex: Efficient Prompt-Level Integration
Vector DB: Requires Fine-Tuning Embedding Models
PageIndex: Slower Retrieval
Vector DB: Faster Retrieval
Best for PageIndex: Professional Document Analysis
- Financial reports and SEC filings
- Regulatory and compliance documents
- Academic and scientific textbooks
- Legal contracts and case law
- Technical manuals and documentation
Best for Vector DB: Creative & General Applications
- Semantic recommendation systems
- Creative writing and ideation tools
- Short passage retrieval
- Multi-modal retrieval
- General knowledge question answering
Case Study
PageIndex Powers Mafin 2.5: Industry-Leading Financial Document Analysis
PageIndex forms the foundation of Mafin 2.5, a leading RAG model for financial report analysis that achieves 98.7% accuracy on FinanceBench, the highest in the market. Unlike traditional RAG systems that rely on vector similarity, PageIndex's hierarchical structure enables precise navigation through complex financial documents, delivering unmatched accuracy in SEC filing analysis and financial question answering.
- 30%: RAG with Vector DB. One vector index for all the documents.
- 50%: RAG with Vector DB. One vector index for each document.
- 98.7%: RAG with PageIndex. Query-to-SQL for document-level retrieval, PageIndex for node-level retrieval.
The results of RAG with Vector DB are from the FinanceBench paper.
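The document-level stage of the PageIndex setup above can be pictured as a metadata query that narrows the corpus to the right document before any tree is searched. The sketch below is a guess at the general shape, not Mafin 2.5's implementation: the `documents` table, its columns, and the sample rows are invented for illustration, and in a real Query-to-SQL step the filter values (or the SQL itself) would be derived from the user's question rather than passed in by hand.

```python
from __future__ import annotations

import sqlite3

# Hypothetical metadata table: one row per indexed document.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents ("
    "doc_id TEXT PRIMARY KEY, company TEXT, fiscal_year INTEGER, doc_type TEXT)"
)
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?, ?)",
    [
        # Made-up sample rows, not real filings.
        ("acme-10k-2022", "Acme Corp", 2022, "10-K"),
        ("acme-10k-2023", "Acme Corp", 2023, "10-K"),
        ("globex-10k-2023", "Globex Inc", 2023, "10-K"),
    ],
)


def select_documents(
    conn: sqlite3.Connection, company: str, fiscal_year: int, doc_type: str
) -> list[str]:
    """Stage 1: pick candidate documents by metadata using a parameterized query."""
    rows = conn.execute(
        "SELECT doc_id FROM documents "
        "WHERE company = ? AND fiscal_year = ? AND doc_type = ? "
        "ORDER BY doc_id",
        (company, fiscal_year, doc_type),
    ).fetchall()
    return [row[0] for row in rows]


# Stage 2 (not shown) would search the tree of each selected document for relevant nodes.
print(select_documents(conn, "Acme Corp", 2023, "10-K"))
```

Splitting retrieval this way keeps the node-level search focused on one document's tree instead of the whole collection.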
Easy Integration
Build Reasoning-Based RAG with PageIndex
PageIndex makes it easy to build a reasoning-based RAG system.
- No Vector DB Required
- PageIndex generates a tree structure that can be stored in traditional databases, eliminating the need for specialized Vector DB infrastructure and reducing complexity.
- No Hard Chunking Required
- Unlike conventional approaches, PageIndex intelligently segments documents along natural content boundaries, preserving context and improving retrieval quality.
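As an illustration of storing the tree in a traditional database, the following sketch flattens a tree into a single SQLite table using a parent pointer per node. The schema and the helper names `save_tree` and `children_of` are assumptions made for this example; PageIndex does not prescribe a particular storage layout here.

```python
from __future__ import annotations

import sqlite3

# Hypothetical schema: one row per node, linked to its parent.
SCHEMA = """
CREATE TABLE nodes (
    node_id     TEXT PRIMARY KEY,
    parent_id   TEXT REFERENCES nodes(node_id),
    title       TEXT NOT NULL,
    start_index INTEGER NOT NULL,
    end_index   INTEGER NOT NULL,
    summary     TEXT
)
"""


def save_tree(conn: sqlite3.Connection, nodes: list[dict], parent_id: str | None = None) -> None:
    """Insert nodes recursively, parents before children."""
    for node in nodes:
        conn.execute(
            "INSERT INTO nodes VALUES (?, ?, ?, ?, ?, ?)",
            (node["node_id"], parent_id, node["title"],
             node["start_index"], node["end_index"], node.get("summary")),
        )
        save_tree(conn, node.get("nodes", []), node["node_id"])


def children_of(conn: sqlite3.Connection, parent_id: str | None) -> list[tuple[str, str]]:
    """Return (node_id, title) for direct children; pass None for top-level nodes."""
    # SQLite's IS operator matches NULL as well as ordinary values.
    return conn.execute(
        "SELECT node_id, title FROM nodes WHERE parent_id IS ? ORDER BY node_id",
        (parent_id,),
    ).fetchall()


# Sample data from the excerpt earlier on this page.
TREE = [
    {
        "title": "Financial Stability", "node_id": "0006",
        "start_index": 21, "end_index": 22, "summary": "The Federal Reserve ...",
        "nodes": [
            {"title": "Monitoring Financial Vulnerabilities", "node_id": "0007",
             "start_index": 22, "end_index": 28,
             "summary": "The Federal Reserve's monitoring ..."},
            {"title": "Domestic and International Cooperation and Coordination",
             "node_id": "0008", "start_index": 28, "end_index": 31,
             "summary": "In 2023, the Federal Reserve collaborated ..."},
        ],
    }
]

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
save_tree(conn, TREE)
print(children_of(conn, "0006"))
```

Storing the whole tree as one JSON column would also work; the parent-pointer layout is shown because it lets a search load one level at a time.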
prompt = f"""
You are given a question and a tree structure of a document.
You need to find all nodes that are likely to contain the answer to the question.
Question: {question}
Document tree structure: {structure}
Reply in the following JSON format:
{{
"thinking": <which nodes are likely to contain the answer to the question?>,
"node_list": [node_id1, node_id2, ...]
}}
Directly reply in JSON format, do not include any other information in the reply.
"""
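Once the model replies to a prompt like the one above, the reply still has to be parsed and mapped back to pages. The sketch below shows one way to do that, under these assumptions: the reply is plain JSON in the requested format, node IDs come back as strings, and `parse_node_list`, `page_ranges`, and the hard-coded `reply` are illustrative stand-ins rather than PageIndex APIs or real model output.

```python
from __future__ import annotations

import json


def parse_node_list(reply: str) -> list[str]:
    """Extract node IDs from a JSON reply of the form {"thinking": ..., "node_list": [...]}."""
    data = json.loads(reply)
    return [str(node_id) for node_id in data["node_list"]]


def page_ranges(node_ids: list[str], nodes_by_id: dict[str, dict]) -> list[tuple[int, int]]:
    """Map node IDs to (start_index, end_index), skipping IDs that are not in the tree."""
    ranges = []
    for node_id in node_ids:
        node = nodes_by_id.get(node_id)
        if node is None:
            continue  # the model returned an ID that does not exist; ignore it
        ranges.append((node["start_index"], node["end_index"]))
    return ranges


# Lookup table built from the sample tree shown earlier on this page.
nodes_by_id = {
    "0006": {"title": "Financial Stability", "start_index": 21, "end_index": 22},
    "0007": {"title": "Monitoring Financial Vulnerabilities", "start_index": 22, "end_index": 28},
    "0008": {
        "title": "Domestic and International Cooperation and Coordination",
        "start_index": 28,
        "end_index": 31,
    },
}

# Hand-written example reply, including one invalid ID to show the filtering.
reply = '{"thinking": "Section 0007 covers monitoring.", "node_list": ["0007", "9999"]}'

selected = page_ranges(parse_node_list(reply), nodes_by_id)
print(selected)
```

The resulting page ranges can then be used to pull the matching pages from the source document as context for the final answer. If a model returns IDs as bare numbers, the leading zeros would be lost, so quoting the IDs in the prompt's format example may be worth considering.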