Frustrated with vector database retrieval accuracy for long professional documents? You need a reasoning-based native index for your RAG system.
What is PageIndex?
PageIndex can transform lengthy documents into semantic tree structures, capturing the hierarchical organization of your document for agentic RAG.
- Hierarchical Tree Structure
- PageIndex creates an LLM-friendly "table of contents" for your document, which lets LLM agents navigate and comprehend complex documents efficiently.
- Node Summary with Precise Page Referencing
- Each node contains its own summary and exact physical page index, allowing agents to pinpoint and extract the most relevant information with exact page references.
- Designed for Long Documents
- PageIndex is specifically designed to handle long documents including financial reports, thousand-page textbooks, etc., even when they exceed the context window limitations of most LLMs.
...
{
  "title": "Financial Stability",
  "node_id": "0006",
  "start_index": 21,
  "end_index": 22,
  "summary": "The Federal Reserve ...",
  "child_nodes": [
    {
      "title": "Monitoring Financial Vulnerabilities",
      "node_id": "0007",
      "start_index": 22,
      "end_index": 28,
      "summary": "The Federal Reserve's monitoring ..."
    },
    {
      "title": "Domestic and International Cooperation and Coordination",
      "node_id": "0008",
      "start_index": 28,
      "end_index": 31,
      "summary": "In 2023, the Federal Reserve collaborated ..."
    }
  ]
},
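As a rough illustration of how an agent might work with such a tree, the sketch below walks the example node above, builds a lookup table keyed by `node_id`, and renders an indented outline with page ranges. The helper names (`iter_nodes`, `build_node_index`, `render_outline`) are made up for this example and are not part of PageIndex; the field names follow the sample node, and the summaries stay elided as in the original.

```python
from typing import Iterator

# A copy of the example node above, wrapped in a list so it can be traversed.
tree = [
    {
        "title": "Financial Stability",
        "node_id": "0006",
        "start_index": 21,
        "end_index": 22,
        "summary": "The Federal Reserve ...",
        "child_nodes": [
            {
                "title": "Monitoring Financial Vulnerabilities",
                "node_id": "0007",
                "start_index": 22,
                "end_index": 28,
                "summary": "The Federal Reserve's monitoring ...",
            },
            {
                "title": "Domestic and International Cooperation and Coordination",
                "node_id": "0008",
                "start_index": 28,
                "end_index": 31,
                "summary": "In 2023, the Federal Reserve collaborated ...",
            },
        ],
    }
]


def iter_nodes(nodes: list[dict], depth: int = 0) -> Iterator[tuple[int, dict]]:
    """Yield (depth, node) pairs in document order (pre-order traversal)."""
    for node in nodes:
        yield depth, node
        yield from iter_nodes(node.get("child_nodes", []), depth + 1)


def build_node_index(nodes: list[dict]) -> dict[str, dict]:
    """Map node_id -> node so selected IDs can be resolved to page ranges."""
    return {node["node_id"]: node for _, node in iter_nodes(nodes)}


def render_outline(nodes: list[dict]) -> str:
    """Render the tree as an indented table of contents with page ranges."""
    lines = []
    for depth, node in iter_nodes(nodes):
        pages = f"pp. {node['start_index']}-{node['end_index']}"
        lines.append(f"{'  ' * depth}[{node['node_id']}] {node['title']} ({pages})")
    return "\n".join(lines)


node_index = build_node_index(tree)
print(render_outline(tree))
```

A compact outline like this is one possible way to keep the tree small enough to fit in a prompt, while `node_index` maps the chosen IDs back to their page ranges.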
Agentic RAG
Build Agentic RAG with PageIndex
PageIndex makes it straightforward to build an agentic RAG system.
- No Vector DB Required
- PageIndex generates a tree structure that can be stored in traditional databases, eliminating the need for specialized Vector DB infrastructure and reducing complexity.
- No Hard Chunking Required
- Unlike conventional approaches, PageIndex intelligently segments documents along natural content boundaries, preserving context and improving retrieval quality.
- Deep Research Ready
- The hierarchical tree structure enables sophisticated exploration of lengthy private documents, facilitating nuanced research and comprehensive analysis.
- Easy to Implement
- Seamlessly integrate PageIndex into your existing agent pipeline with minimal code changes. A simple prompt is all you need to get started.
prompt = f"""
You are given a question and a tree structure of a document.
You need to find all nodes that are likely to contain the answer to the question.
Question: {question}
Document tree structure: {structure}
Reply in the following JSON format:
{{
"thinking": <which nodes are likely to contain the answer to the question?>,
"node_list": [node_id1, node_id2, ...]
}}
Reply directly in the JSON format; do not include any other information in the reply.
"""
RAG Comparison
PageIndex vs Vector DB
Choose the right RAG technique for your task.
PageIndex: Reasoning-Based Retrieval
Relies on logical reasoning to retrieve the right node, making it well suited to domain-specific data where most content is semantically similar.
Vector DB: Semantic Similarity-Based Retrieval
Relies on semantic similarity, which can be unreliable for domain-specific data where all content has similar semantics.
PageIndex: Hassle-Free Implementation
Only requires storing a tree structure in a classic database. No vector database required.
Vector DB: Complex Infrastructure
Requires setting up and maintaining a vector database, which adds complexity and cost.
PageIndex: Flexible Integration
Easily integrates with expert knowledge and user preferences through the hierarchical structure.
Vector DB: Requires Model Updates
Requires fine-tuning embedding models to incorporate new knowledge or preferences.
PageIndex: Slower Retrieval
Retrieval is slightly slower, but results remain accurate for complex domain-specific queries.
Vector DB: Faster Retrieval
Offers faster retrieval, making it efficient for applications where quick responses are critical.
PageIndex: Professional Document Analysis
- Financial reports and SEC filings
- Regulatory and compliance documents
- Academic and scientific textbooks
- Legal contracts and case law
- Technical manuals and documentation
Vector DB: Creative & General Applications
- Semantic recommendation systems
- Creative writing and ideation tools
- Short passage retrieval
- Multi-modal retrieval
- General knowledge question answering
Case Study
PageIndex Powers Mafin 2.5: Industry-Leading Financial Document Analysis
PageIndex forms the foundation of Mafin 2.5, a leading RAG model for financial report analysis that achieves 98.7% accuracy on FinanceBench. Unlike traditional RAG systems that rely on vector similarity, PageIndex's hierarchical structure enables precise navigation through complex financial documents, delivering high accuracy in SEC filing analysis and financial question answering.
- 30%: RAG with Vector DB (one vector index for all the documents)
- 50%: RAG with Vector DB (one vector index for each document)
- 98.7%: RAG with PageIndex (Query-to-SQL for document-level retrieval, PageIndex for node-level retrieval)
The Vector DB results are taken from the FinanceBench paper.