How Can the Long-Text Reading Ability of Large Models Be Evaluated?

6543 views 2024-08-14 11:25

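The importance of long-text processing for LLMs is evident. In early 2023, even GPT-3.5, the most advanced model at the time, was limited to a context length of about 4k tokens; today, a 128k context length has become one of the key markers of how advanced a model is. So how can the long-text reading ability of LLMs be evaluated?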

1 LongBench Evaluation

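LongBench is the first bilingual (Chinese and English), multi-task benchmark for comprehensively evaluating the long-context understanding of large language models, designed to give a more complete picture of their multilingual ability on long contexts. It consists of 21 tasks in six major categories (14 English tasks, 5 Chinese tasks, and 2 code tasks; most tasks have an average length between 5k and 15k, with 4,750 test instances in total), covering the key long-text application scenarios: single-document QA, multi-document QA, summarization, few-shot learning, synthetic tasks, and code completion. Each record has the following format: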

{
    "input": "The input/command for the task, usually short, such as questions in QA, queries in Few-shot tasks, etc",
    "context": "The long context required for the task, such as documents, cross-file code, few-shot examples in Few-shot tasks",
    "answers": "A List of all true answers",
    "length": "Total length of the first three items (counted in characters for Chinese and words for English)",
    "dataset": "The name of the dataset to which this piece of data belongs",
    "language": "The language of this piece of data",
    "all_classes": "All categories in classification tasks, null for non-classification tasks",
    "_id": "Random id for each piece of data"
}

Link: [https://huggingface.co/datasets/THUDM/LongBench](https://huggingface.co/datasets/THUDM/LongBench)
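To make the `length` field concrete, here is a minimal sketch of how such a value could be computed for a record in the format above. It is not LongBench's own preprocessing script: the function names `measure_length` and `record_length` are hypothetical, the language code `"zh"` for Chinese is an assumption, and splitting on whitespace is only an approximation of "words" for English.

```python
from typing import Any, Dict


def measure_length(text: str, language: str) -> int:
    """Count characters for Chinese text and whitespace-separated words otherwise.

    Assumption: Chinese records are tagged with the language code "zh".
    """
    if language == "zh":
        return len(text)
    return len(text.split())


def record_length(record: Dict[str, Any]) -> int:
    """Sum the lengths of the first three items: input, context, and answers."""
    parts = [record["input"], record["context"], *record["answers"]]
    return sum(measure_length(part, record["language"]) for part in parts)


# Two tiny made-up records, used only to illustrate the counting rule.
english_record = {
    "input": "Who wrote the report?",
    "context": "The report was written by Alice in 2020.",
    "answers": ["Alice"],
    "language": "en",
}
chinese_record = {
    "input": "报告是谁写的？",
    "context": "报告由小明撰写。",
    "answers": ["小明"],
    "language": "zh",
}

print(record_length(english_record))  # words: 4 + 8 + 1
print(record_length(chinese_record))  # characters: 7 + 8 + 2
```

Real records are far longer than these toy examples, since the `context` field carries the full document, cross-file code, or few-shot examples.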

2 Retrieval Tasks

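The most classic evaluation of this kind is the needle-in-a-haystack experiment (Needle test). The core idea is to place a key piece of information that must be recalled at different positions (the beginning, middle, or end) within noise text of different lengths, and then ask the model to find the inserted information. Checking whether the model can accurately extract the hidden sentence mainly measures its ability to locate and recall key information in long text.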
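The procedure described above can be sketched as a small harness. This is only an illustration under stated assumptions: `build_haystack`, `run_needle_test`, and `toy_model` are hypothetical names, the filler text and needle are made up, and `toy_model` is a stand-in that only "sees" the last 1000 characters of the prompt so the sketch runs without a real LLM. In practice the `model` callable would wrap an actual model call.

```python
import re
from typing import Callable, Dict, List, Tuple


def build_haystack(filler: List[str], needle: str, depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = beginning, 1.0 = end)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = round(depth * len(filler))
    return " ".join(filler[:position] + [needle] + filler[position:])


def run_needle_test(
    model: Callable[[str], str],
    filler: List[str],
    needle: str,
    question: str,
    expected: str,
    lengths: List[int],
    depths: List[float],
) -> Dict[Tuple[int, float], bool]:
    """Return, for every (haystack length, depth) pair, whether the answer was recalled."""
    results = {}
    for length in lengths:
        for depth in depths:
            context = build_haystack(filler[:length], needle, depth)
            prompt = f"{context}\n\nQuestion: {question}\nAnswer:"
            results[(length, depth)] = expected.lower() in model(prompt).lower()
    return results


def toy_model(prompt: str, window: int = 1000) -> str:
    """Stand-in for an LLM: it can only read the last `window` characters."""
    match = re.search(r"secret code is (\d+)", prompt[-window:])
    return match.group(1) if match else "unknown"


filler = [f"Filler sentence number {i}." for i in range(200)]
results = run_needle_test(
    model=toy_model,
    filler=filler,
    needle="The secret code is 7481.",
    question="What is the secret code?",
    expected="7481",
    lengths=[10, 200],   # haystack sizes, in sentences
    depths=[0.0, 0.5, 1.0],  # beginning, middle, end
)
for (length, depth), found in sorted(results.items()):
    print(f"length={length:>3} depth={depth:.1f} recalled={found}")
```

With this stand-in, every depth succeeds on the 10-sentence haystack, while on the 200-sentence haystack only the needle placed at the end is recalled. A grid of this shape (length by position) is the kind of result the needle test is meant to expose.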
