Optimal Chunk-Size for Large Document Summarization

Authors: Mingtian, Ray

1. Introduction

Large Language Models (LLMs) are good at text summarization. However, their limited context windows pose a challenge when summarizing large documents. We discuss a limitation of the commonly used summarization strategy and introduce a straightforward improvement.

2. The Common Strategy of Large Document Summarization

The common strategy for summarizing large documents involves the following steps:

  1. Determine a $\text{chunk\_size}$ based on the LLM context window and the prompt size.

  2. Divide the document into $K$ chunks, where the first $K-1$ chunks each have a size of $\text{chunk\_size}$ and the last chunk has a size

    $$\text{last\_chunk\_size}=\text{document\_size}-(K-1)\times \text{chunk\_size},$$

    where

    $$K= \lceil\text{document\_size}/\text{chunk\_size}\rceil.$$

    We use $\lceil x \rceil$ to denote the smallest integer that is greater than or equal to $x$, e.g. $\lceil 0.5 \rceil=1$.

  3. Summarize the $K$ chunks independently using the LLM.

  4. Combine the $K$ chunk summaries into one global summary of the document using the LLM.

The figure below illustrates this general summarization strategy. Implementations of this strategy can be found in common LLM libraries such as LangChain and LlamaIndex.

[Figure: the chunk-then-combine summarization strategy]
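A minimal sketch of these four steps is given below. It is not the LangChain or LlamaIndex implementation: the function names are ours, the document is treated as a list of string tokens, and the hypothetical summarize_with_llm callable stands in for a call to whichever LLM you use.

import math

def naive_chunk_sizes(document_size, chunk_size):
    # Step 2: K - 1 fixed-size chunks; the last chunk holds whatever remains.
    K = math.ceil(document_size / chunk_size)
    return [chunk_size] * (K - 1) + [document_size - (K - 1) * chunk_size]

def summarize_document(tokens, chunk_size, summarize_with_llm):
    # Split the token list into K chunks of the sizes computed above.
    chunks, start = [], 0
    for size in naive_chunk_sizes(len(tokens), chunk_size):
        chunks.append(tokens[start:start + size])
        start += size
    # Step 3: summarize each chunk independently.
    chunk_summaries = [summarize_with_llm(" ".join(chunk)) for chunk in chunks]
    # Step 4: combine the chunk summaries into one global summary.
    return summarize_with_llm("\n".join(chunk_summaries))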

3. Problem: Biased Global Summary Due to Poor Chunk Size Selection

When applying the same $\text{chunk\_size}$ across all documents, there is a risk that the last chunk in some documents could be significantly smaller than the preceding chunks in the same document.

Example 1: For a document of size 11 and a chosen $\text{chunk\_size}=5$, the resulting chunk sizes would be $[5,5,1]$.

This disparity can lead to the following problem:

In the combination stage, the summary of the final chunk may erroneously be given the same weight as the other chunk summaries, thereby introducing a bias into the global summary of the original document.

See our Jupyter Notebook for a practical example of how this problem occurs in large document summarization.

Intuitively, we want all chunks to have roughly the same amount of information, so the equally weighted combination in the final step will not introduce any bias to the global summary. We can further simplify this requirement by assuming that chunks of comparable sizes contain roughly the same amount of information. Therefore, our goal is to ensure that all chunks are of similar size.

To measure the "similarity" of the chunk sizes, we can calculate the maximum size difference between the longest chunk and the shortest chunk within the same document. For example, for the summarization strategy described in Section 2, the worst-case scenario is when $\text{document\_size}=(K-1)\times \text{chunk\_size}+1$: the last chunk then has the smallest possible size of 1 and

$$\text{maximum\_chunk\_size\_difference}=\text{chunk\_size}-1.$$
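As a quick numerical check of this worst case (the document and chunk sizes below are purely illustrative):

import math

document_size, chunk_size = 1001, 500   # worst case: (K - 1) * chunk_size + 1
K = math.ceil(document_size / chunk_size)
sizes = [chunk_size] * (K - 1) + [document_size - (K - 1) * chunk_size]
print(sizes)                    # [500, 500, 1]
print(max(sizes) - min(sizes))  # 499, i.e. chunk_size - 1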

This disparity can be substantial for larger values of $\text{chunk\_size}$. Intuitively, reducing the $\text{chunk\_size}$ shrinks this gap, but it also increases the chunk number $K$, which in turn increases the number of LLM calls and hence the cost. In the next section, we discuss how to reduce the chunk-size gap while keeping the minimum chunk number $K$.

4. Optimal Automatic Chunk Size Determination

First, we choose a $\text{maximum\_chunk\_size}$ based on the LLM context window and the prompt size, so the minimum chunk number $K$ is

$$K= \lceil\text{document\_size}/\text{maximum\_chunk\_size}\rceil.$$

We then calculate the 'average' integer chunk size

$$\text{average\_chunk\_size}= \lceil\text{document\_size}/K\rceil,$$

which is used to divide the document into $K$ chunks: the first $K-1$ chunks are of size $\text{average\_chunk\_size}$ and the last chunk has size

$$\text{last\_chunk\_size}=\text{document\_size}-(K-1)\times\text{average\_chunk\_size}.$$

As shown in the following example, this simple trick already improves on the common strategy.

Example 2: Given $\text{maximum\_chunk\_size}=5$ and $\text{document\_size}=11$, we derive $K=\lceil 11/5\rceil=3$ and $\text{average\_chunk\_size}=\lceil 11/3\rceil=4$. The document is then divided into three chunks of sizes $[4,4,3]$, a noticeable improvement over the prior method, which resulted in sizes $[5,5,1]$.

However, this method is still not optimal in some cases.

Example 3: Given $\text{maximum\_chunk\_size}=5$ and $\text{document\_size}=13$, we derive $K=\lceil 13/5\rceil=3$ and $\text{average\_chunk\_size}=\lceil 13/3\rceil=5$. This divides the document into three chunks of sizes $[5,5,3]$, which is worse than the obvious division $[5,4,4]$.

To further reduce the gap between $\text{average\_chunk\_size}$ and $\text{last\_chunk\_size}$, we can move one token from each of the preceding chunks to the last chunk, one chunk at a time, until the last chunk reaches size $\text{average\_chunk\_size}-1$. In total, we need to move $D$ tokens, where

$$\begin{align*}D&=\text{average\_chunk\_size}-1-\text{last\_chunk\_size}\\&=K\times \text{average\_chunk\_size}-\text{document\_size}-1.\end{align*}$$

As a result, $D+1$ chunks now have size $\text{average\_chunk\_size}-1$, while the remaining $K-D-1$ chunks keep their original size of $\text{average\_chunk\_size}$.
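This counting can be verified with a small size-only helper (the name balanced_chunk_sizes is ours, introduced only for illustration), which reproduces the splits from Examples 2 and 3:

import math

def balanced_chunk_sizes(document_size, max_chunk_size):
    # Minimum number of chunks for the given context budget.
    K = math.ceil(document_size / max_chunk_size)
    # 'Average' integer chunk size.
    average_chunk_size = math.ceil(document_size / K)
    # D + 1 chunks give up one token; the rest keep average_chunk_size.
    shorter_chunk_number = K * average_chunk_size - document_size
    return ([average_chunk_size] * (K - shorter_chunk_number)
            + [average_chunk_size - 1] * shorter_chunk_number)

print(balanced_chunk_sizes(11, 5))  # [4, 4, 3]  (Example 2)
print(balanced_chunk_sizes(13, 5))  # [5, 4, 4]  (Example 3)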

The illustration below demonstrates how, by redistributing a token from one preceding chunk to the last chunk in Example 3, we achieve the more balanced chunk size list $[5,4,4]$.

[Figure: moving one token from a preceding chunk to the last chunk in Example 3]

In this approach, there are only two possible chunk sizes: $\text{average\_chunk\_size}$ and $\text{average\_chunk\_size}-1$. Consequently, the $\text{maximum\_chunk\_size\_difference}$ between chunks is at most 1, even in the worst case. This strategy therefore balances the chunk sizes as evenly as possible while maintaining the smallest possible number of chunks.

5. Implementation

We provide a basic implementation below.

Visit our GitHub repo for hands-on examples and practical comparisons!

import tiktoken
import math

def auto_chunker(document, max_chunk_size, model):

    tokenizer = tiktoken.encoding_for_model(model)
    document_tokens = tokenizer.encode(document)
    document_size = len(document_tokens)
    # total chunk number
    K = math.ceil(document_size / max_chunk_size)
    # average integer chunk size
    average_chunk_size = math.ceil(document_size / K)
    # number of chunks with average_chunk_size - 1 
    shorter_chunk_number = K * average_chunk_size - document_size
    # number of chunks with average_chunk_size
    standard_chunk_number = K - shorter_chunk_number

    chunks = []
    chunk_start = 0
    for i in range(0, K):
        if i < standard_chunk_number:
            chunk_end = chunk_start + average_chunk_size
        else:
            chunk_end = chunk_start + average_chunk_size - 1
        chunk = document_tokens[chunk_start:chunk_end]
        chunks.append(tokenizer.decode(chunk))
        chunk_start = chunk_end

    assert chunk_start == document_size
    return chunks
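
A quick usage example (the model name and max_chunk_size below are placeholders; substitute values that match your own setup):

document = "Your long document text goes here. " * 200
chunks = auto_chunker(document, max_chunk_size=512, model="gpt-3.5-turbo")

tokenizer = tiktoken.encoding_for_model("gpt-3.5-turbo")
print(len(chunks))
print([len(tokenizer.encode(chunk)) for chunk in chunks])
# Token counts should be balanced to within about one token of each other;
# re-encoding a decoded token span can occasionally shift a boundary slightly.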

In practice, an improved chunking strategy should also consider the boundaries of sentences or paragraphs. We will discuss how to integrate this consideration into this method in our upcoming blog post. Subscribe below to stay updated and ensure you never miss enhancements for your AI product!

6. Want to Improve the Chunking for Your Domain-specific Documents?

In this blog post, we propose an optimal chunking strategy under the assumption that chunks of similar sizes hold comparable amounts of information. This proposed method could also benefit other LLM tasks, such as retrieval-augmented question answering. In real-world situations, this assumption might not apply to domain-specific documents. For instance, in certain scientific papers, references at the end may not necessarily reflect the document's core semantics.

Get connected and discover how we can improve the chunking, summarization or retrieval in your domain-specific LLM pipelines!