How OpenAI's Index-Free Long RAG System Revolutionizes Document Retrieval and QA
Discover how OpenAI's index-free, long-context retrieval-augmented generation system revolutionizes document retrieval and QA, with insights on its architecture, use cases, and trade-offs.
May 18, 2025

This blog post explores a novel retrieval-augmented generation system from OpenAI that leverages large language models with long context windows to provide accurate answers without a pre-built, embedding-based index. The system's approach, which mimics human reading patterns, offers benefits such as zero-latency indexing, dynamic navigation, and cross-section reasoning, making it particularly suitable for use cases like complex question answering on long legal documents.
Using Long Context Models to Enhance Retrieval Augmented Generation
Dividing the Document into Manageable Chunks
Identifying Relevant Chunks Using Content Routing
Generating Accurate Answers with Paragraph-Level Retrieval
Verifying the Generated Answers
Considerations: Cost, Latency, and Scalability
Improving the System: Caching, Knowledge Graphs, and Scratchpad Enhancements
Conclusion
Using Long Context Models to Enhance Retrieval Augmented Generation
OpenAI has introduced a new multi-agent retrieval augmented generation system that leverages long context models like GPT-4.1 to provide an indexing-free approach to retrieval and generation. This system mimics how humans read and process information, and it can be particularly useful for complex question-answering tasks on long legal documents.
The key aspects of this approach are:
- Chunking and Relevance Identification: The document is first divided into 20 equal-sized chunks, and a lightweight language model is used to determine which chunks are most relevant to the user's query. This process can be repeated recursively to further refine the relevant sections.
- Paragraph-level Generation: Once the relevant chunks are identified, a more powerful language model (such as GPT-4.1) is used to generate answers based on the selected paragraphs. This ensures that the generated output is grounded in the source material.
- Answer Verification: A reasoning model (such as o4-mini) is used to validate the generated answer, ensuring that it is factually accurate and supported by the source text.
This approach offers several benefits, including zero-latency indexing, dynamic navigation, and the ability to reason across multiple sections of the document. However, it also comes with higher costs per query due to the multiple model calls required.
The authors suggest that this system is best suited for use cases where accuracy is paramount, such as in the legal domain, and where latency is not a critical factor. They also propose potential optimizations, such as caching and knowledge graph generation, to improve the cost-effectiveness of the system.
Overall, this new retrieval augmented generation system demonstrates the power of long context models in enhancing document understanding and question-answering capabilities, even in the absence of a pre-built index.
Dividing the Document into Manageable Chunks
The first step in the retrieval-augmented generation system is to divide the long document into manageable chunks. The approach recommends splitting the document into 20 equal-sized chunks, with the condition that each chunk ends with a complete sentence.
To achieve this, the system first tokenizes the entire text and then iteratively adds sentences until the chunk size is approximately 33,000 tokens. This process ensures that the chunks are of relatively equal size, making the subsequent processing more efficient.
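As a rough illustration of this step, here is a minimal sketch of sentence-aligned chunking, assuming tiktoken for token counting and a naive regex sentence splitter; the exact tokenizer and splitting heuristic in the cookbook may differ.

```python
import re
import tiktoken

def split_into_chunks(text: str, n_chunks: int = 20) -> list[str]:
    """Split text into roughly equal token-sized chunks that end on sentence boundaries."""
    enc = tiktoken.get_encoding("o200k_base")        # tokenizer used by recent OpenAI models
    target = len(enc.encode(text)) // n_chunks       # approximate tokens per chunk
    # Naive sentence split; a production system would use a proper sentence segmenter.
    sentences = re.split(r"(?<=[.!?])\s+", text)

    chunks, current, current_tokens = [], [], 0
    for sentence in sentences:
        current.append(sentence)
        current_tokens += len(enc.encode(sentence))
        if current_tokens >= target:                 # close the chunk at a sentence boundary
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
    if current:
        chunks.append(" ".join(current))
    return chunks
```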
The resulting chunks may not preserve the original structure of the document, such as chapter or section boundaries. However, the system can also be adapted to work with documents that are already structured, such as those in Markdown format.
By dividing the document into these manageable chunks, the system can then focus on identifying the most relevant sections to answer the user's query, rather than processing the entire document at once. This chunking strategy is a crucial step in the retrieval-augmented generation workflow.
Identifying Relevant Chunks Using Content Routing
The key steps in the content routing process are:
- Initial Chunking: The document is first divided into 20 equal-sized chunks, ensuring that each chunk ends with a complete sentence.
- Relevance Evaluation: For each chunk, a large language model (LLM) such as GPT-4.1 mini is used to evaluate whether the chunk contains information relevant to answering the user's query. The model uses a "scratchpad" to record its reasoning for each chunk.
- Recursive Subdivision: The relevant chunks identified in the previous step are further subdivided into smaller sub-chunks. This recursive subdivision continues until a specified depth is reached.
- Paragraph Selection: The final output is a set of relevant paragraphs that can be used to generate the answer to the user's query.
The content routing process is designed to mimic how humans read and process information, focusing on the most promising sections of the document while discarding irrelevant content. This approach allows the system to maintain global context and perform cross-section reasoning, even without a pre-built index.
The use of a scratchpad enables the LLM to provide transparency into its decision-making process, which can be valuable for understanding and debugging the system's behavior.
Overall, this content routing approach aims to provide accurate, grounded answers while minimizing the need for expensive pre-processing and indexing steps.
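To make the routing step concrete, here is a minimal sketch of one relevance pass using the OpenAI Python SDK; the model choice (GPT-4.1 mini), the prompt wording, and the JSON scratchpad format are illustrative assumptions rather than the cookbook's actual code.

```python
import json
from openai import OpenAI

client = OpenAI()

ROUTER_PROMPT = (
    "You are routing content for a question-answering system. "
    "Given a document chunk and a question, decide whether the chunk could help answer it. "
    'Reply with JSON: {"relevant": true or false, "reasoning": "<one sentence>"}'
)

def route_chunks(chunks: list[str], query: str, scratchpad: list[dict]) -> list[int]:
    """Return indices of chunks judged relevant; append the model's reasoning to the scratchpad."""
    relevant = []
    for i, chunk in enumerate(chunks):
        response = client.chat.completions.create(
            model="gpt-4.1-mini",                    # lightweight router model
            messages=[
                {"role": "system", "content": ROUTER_PROMPT},
                {"role": "user", "content": f"Question: {query}\n\nChunk {i}:\n{chunk}"},
            ],
            response_format={"type": "json_object"}, # ask for parseable JSON
        )
        verdict = json.loads(response.choices[0].message.content)
        scratchpad.append({"chunk": i, **verdict})   # record reasoning for transparency
        if verdict.get("relevant"):
            relevant.append(i)
    return relevant
```

Relevant chunks returned by this pass can be re-split and routed again to reach the desired recursion depth.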
Generating Accurate Answers with Paragraph-Level Retrieval
OpenAI's new multi-agent retrieval-augmented generation system offers a novel approach to answering complex questions on long documents. This index-free system leverages large language models with extended context windows to dynamically navigate and retrieve the most relevant paragraphs to generate accurate answers.
The key steps of this approach are:
- Document Chunking: The input document is divided into 20 equal-sized chunks, ensuring each chunk ends at a valid sentence boundary.
- Content Routing: A lightweight language model evaluates each chunk for relevance to the user's question, recording its reasoning in a "scratchpad" for later reference.
- Recursive Decomposition: The relevant chunks are further divided into smaller sub-chunks, and the content routing process is repeated recursively to identify the most pertinent paragraphs.
- Answer Generation: The selected paragraphs are provided to a more powerful language model, which generates a structured answer tailored to the user's question (a minimal sketch of this call follows the list).
- Answer Verification: A reasoning-focused model acts as a "fact checker," validating the generated answer against the source paragraphs and assigning a confidence score.
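A minimal sketch of the answer-generation call referenced above, assuming GPT-4.1 as the stronger model; the prompt and paragraph formatting are illustrative, not the cookbook's exact implementation.

```python
from openai import OpenAI

client = OpenAI()

ANSWER_PROMPT = (
    "Answer the user's question using only the provided paragraphs. "
    "Cite the paragraph numbers you rely on, and say so explicitly if the "
    "paragraphs do not contain the answer."
)

def generate_answer(paragraphs: list[str], query: str) -> str:
    """Generate a grounded answer from the routed paragraphs with a stronger model."""
    numbered = "\n\n".join(f"[{i}] {p}" for i, p in enumerate(paragraphs))
    response = client.chat.completions.create(
        model="gpt-4.1",                             # stronger model for final generation
        messages=[
            {"role": "system", "content": ANSWER_PROMPT},
            {"role": "user", "content": f"Question: {query}\n\nParagraphs:\n{numbered}"},
        ],
    )
    return response.choices[0].message.content
```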
This approach offers several benefits, including zero-latency indexing, dynamic navigation of the document, and the ability to reason across multiple sections. However, it also comes with higher computational costs per query compared to traditional retrieval systems.
The authors suggest several optimization strategies, such as leveraging caching and generating knowledge graphs, to improve the scalability and efficiency of this retrieval-augmented generation system. Ultimately, this technique shines in use cases where highly accurate, grounded answers are paramount, such as in the legal domain, despite the increased costs.
Verifying the Generated Answers
The final step in the retrieval-augmented generation system is to verify the generated answers. This is done using a reasoning model that acts as a fact-checker.
The system prompt for the verification step instructs the model to critically evaluate the provided answer and source paragraphs, looking for any factual errors or unsupported claims. The model is asked to assign a confidence level based on how directly the paragraphs answer the original question.
This verification step ensures that the final answer is grounded in the retrieved text and does not contain any hallucinated information. The reasoning model provides an additional layer of quality control to the system's output.
By incorporating this verification step, the retrieval-augmented generation system can produce accurate and trustworthy answers, even for complex queries on long legal documents. The confidence score assigned by the reasoning model gives the user an indication of how reliable the final answer is.
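A minimal sketch of this fact-checking step, assuming o4-mini as the reasoning model; the prompt and the JSON verdict format are illustrative assumptions rather than the cookbook's actual code.

```python
import json
from openai import OpenAI

client = OpenAI()

VERIFIER_PROMPT = (
    "You are a fact checker. Given a question, a draft answer, and the source paragraphs, "
    "flag any claim in the answer that the paragraphs do not support. Reply with a JSON object: "
    '{"supported": true or false, "issues": ["..."], "confidence": "high" | "medium" | "low"}'
)

def verify_answer(query: str, answer: str, paragraphs: list[str]) -> dict:
    """Ask a reasoning model to validate the draft answer against its source paragraphs."""
    sources = "\n\n".join(paragraphs)
    response = client.chat.completions.create(
        model="o4-mini",  # reasoning model acting as the fact checker
        messages=[
            {"role": "system", "content": VERIFIER_PROMPT},
            {"role": "user", "content": f"Question: {query}\n\nDraft answer:\n{answer}\n\nSource paragraphs:\n{sources}"},
        ],
    )
    # Tolerate the model wrapping its JSON in a fenced code block.
    content = response.choices[0].message.content.strip()
    content = content.removeprefix("```json").removeprefix("```").removesuffix("```").strip()
    return json.loads(content)
```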
Considerations: Cost, Latency, and Scalability
The retrieval-augmented generation system proposed by OpenAI has several trade-offs to consider, particularly around cost, latency, and scalability:
Cost:
- The agentic system incurs a higher cost per query compared to a traditional retrieval-augmented generation (RAG) system.
- The estimated fixed cost for the agentic system is zero, as there is no pre-processing required. However, the variable cost can be as high as 36 cents per query, whereas a traditional RAG system concentrates its roughly 40-cent cost in one-time indexing and answers individual queries far more cheaply.
- The higher cost is due to the multiple calls to different language models (LLMs) required for the content routing, paragraph selection, and answer verification steps.
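As a back-of-envelope illustration, the per-query cost is simply the sum, over every model call in the pipeline, of tokens processed times the per-token price. The token counts and per-million-token prices below are placeholder assumptions, not measured figures from the cookbook.

```python
# Rough per-query cost model: sum over every LLM call in the pipeline.
# Token counts and prices are illustrative placeholders, not measured figures.

def call_cost(input_tokens: int, output_tokens: int,
              price_in_per_m: float, price_out_per_m: float) -> float:
    """Cost of one call in dollars, given per-million-token prices."""
    return input_tokens / 1e6 * price_in_per_m + output_tokens / 1e6 * price_out_per_m

calls = [
    # (input_tokens, output_tokens, $/M input, $/M output)
    (660_000, 2_000, 0.40, 1.60),   # routing pass over 20 chunks with a small model
    (30_000, 1_000, 2.00, 8.00),    # answer generation over the selected paragraphs
    (15_000, 500, 1.10, 4.40),      # verification with a reasoning model
]

query_cost = sum(call_cost(*c) for c in calls)
print(f"Estimated cost per query: ${query_cost:.2f}")
```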
Latency:
- The agentic system has zero indexing latency: a new document can be queried immediately, since no pre-processing or index construction is required.
- However, the multiple sequential model calls for content routing, paragraph selection, and answer verification increase per-query latency, which matters for time-sensitive applications.
Scalability:
- The current implementation of the agentic system may have limitations when it comes to scalability, as the cost and latency can increase significantly for larger documents or higher query volumes.
- Potential solutions to improve scalability include caching, generating knowledge graphs, and optimizing the scratchpad functionality to reduce the depth of the recursive decomposition process.
In summary, the agentic system proposed by OpenAI offers a novel approach to retrieval-augmented generation, but it comes with trade-offs in terms of cost, latency, and scalability. The suitability of this system will depend on the specific use case and requirements, such as the need for highly accurate answers, tolerance for higher costs, and the importance of low-latency responses.
Improving the System: Caching, Knowledge Graphs, and Scratchpad Enhancements
The author discusses several ways to improve the performance and scalability of the retrieval-augmented generation system:
- Caching: Most LLM providers now offer caching capabilities, which could improve both the latency and the cost of the system. Caching the document content across queries could reduce the number of expensive LLM calls required for each query (one possible interpretation is sketched after this list).
- Knowledge Graphs: Knowledge graphs could be generated as a one-time process and then traversed by a GPT-4.1-like model. This approach goes back to creating an index, but it can be useful for preserving relationships between entities and improving the system's ability to reason across sections.
- Scratchpad Enhancements: The scratchpad currently just appends notes about relevant chunks to the query. The author suggests adding the ability to remove or edit scratchpad entries (also sketched below), as well as adjusting the depth of the recursive decomposition process. Increasing the depth can provide more granular citations (e.g., sentence-level) but will also increase latency and cost.
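One possible reading of the caching idea is to memoize routing verdicts locally, so repeated or follow-up queries over unchanged chunks skip the model call entirely; provider-side prompt caching of a shared document prefix is a complementary option. The helper below is an illustrative sketch, not the cookbook's code.

```python
import hashlib
from openai import OpenAI

client = OpenAI()
_route_cache: dict[tuple[str, str], bool] = {}

def cached_route(chunk: str, query: str) -> bool:
    """Memoize relevance verdicts so repeated queries over unchanged chunks skip the LLM call."""
    key = (hashlib.sha256(chunk.encode()).hexdigest(), query)
    if key not in _route_cache:
        response = client.chat.completions.create(
            model="gpt-4.1-mini",
            messages=[
                {"role": "system", "content": "Answer strictly 'yes' or 'no': is this chunk relevant to the question?"},
                {"role": "user", "content": f"Question: {query}\n\nChunk:\n{chunk}"},
            ],
        )
        _route_cache[key] = response.choices[0].message.content.strip().lower().startswith("yes")
    return _route_cache[key]
```

And a minimal sketch of a scratchpad that supports the suggested enhancements, with editable and removable entries in addition to the append-only notes the current system keeps; the class and its methods are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Scratchpad:
    """Editable record of routing decisions, so stale or incorrect notes can be revised mid-run."""
    entries: dict[int, str] = field(default_factory=dict)

    def add(self, chunk_id: int, reasoning: str) -> None:
        self.entries[chunk_id] = reasoning

    def edit(self, chunk_id: int, reasoning: str) -> None:
        if chunk_id in self.entries:
            self.entries[chunk_id] = reasoning

    def remove(self, chunk_id: int) -> None:
        self.entries.pop(chunk_id, None)

    def render(self) -> str:
        """Format the notes for inclusion in the next model prompt."""
        return "\n".join(f"Chunk {cid}: {note}" for cid, note in sorted(self.entries.items()))
```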
The author also notes that the system's reliance on large context windows, such as those provided by GPT-4.1 or Gemini models, is a key enabler for this approach. The author envisions a hybrid implementation that combines traditional indexing techniques with the long-context capabilities of these large language models.
Conclusion
The retrieval-augmented generation system proposed by OpenAI offers a novel approach to handling long-form documents without the need for pre-created indexes. By leveraging large language models with extended context windows, this system can dynamically navigate and retrieve relevant information to answer user queries.
The key aspects of this system include:
- Chunking and Relevance Identification: The document is first divided into manageable chunks, which are then evaluated for their relevance to the user's query. This is done using a lightweight language model that can quickly assess the content of each chunk.
- Recursive Decomposition: The relevant chunks are further divided into smaller sub-chunks, and the relevance evaluation process is repeated recursively. This allows the system to home in on the most pertinent information.
- Multi-Agent Approach: Different language models are employed for different tasks, such as relevance assessment, answer generation, and answer verification. This allows the system to leverage the strengths of various models.
- Scratchpad Reasoning: The use of a scratchpad enables the language models to record their reasoning process, which can be used for transparency and debugging.
While this approach offers several benefits, such as zero-latency indexing and dynamic navigation, it also comes with higher costs per query compared to traditional retrieval systems. The authors suggest various optimization strategies, such as caching and knowledge graph generation, to address this challenge.
Overall, the retrieval-augmented generation system represents an exciting development in the field of information retrieval, particularly for use cases where accurate, context-aware answers are paramount, such as in the legal domain. As language models continue to evolve, we can expect to see further advancements and refinements in this area.