Exploring Memorization in LLM Logical Reasoning

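Overview

Today I came across a paper on training reasoning models. It tackles a question that puzzles many of us who use large models: is their ability just a sophisticated form of pattern matching, or are they genuinely doing logical reasoning?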

[2410.23123] On Memorization of Large Language Models in Logical Reasoning: https://arxiv.org/abs/2410.23123

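The paper is titled "On Memorization of Large Language Models in Logical Reasoning". It studies how large language models (LLMs) perform on logical reasoning tasks, asking whether their strong results come from memorizing training data rather than from genuine reasoning. LLMs do well on complex reasoning benchmarks yet sometimes make elementary reasoning mistakes, and this contradiction raises questions about how they actually reason.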

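Research method

Experimental design: the paper uses a dynamically generated benchmark based on Knights and Knaves (K&K) logic puzzles. Each puzzle asks the model to deduce every character's identity from their statements (knights always tell the truth, knaves always lie).
Memorization and generalization: models are fine-tuned, then evaluated on the training puzzles (memorization) and on perturbed variants of them (generalization).
Memorization metric: a per-sample memorization score is introduced to quantify how the model switches between reasoning and memorization.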
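
The memorization metric can be sketched in a few lines. This is a minimal illustration, assuming the paper's dataset-level definition LiMem = Acc × (1 − CR), where Acc is the accuracy on the original puzzles and CR (the consistency ratio) is the fraction of correctly solved originals whose perturbed variant is also solved. The function names and the boolean-list inputs are hypothetical and are not taken from the paper's code.

```python
from typing import List, Sequence


def limem_score(orig_correct: Sequence[bool], pert_correct: Sequence[bool]) -> float:
    """Dataset-level memorization score, assuming LiMem = Acc * (1 - CR)."""
    if len(orig_correct) != len(pert_correct):
        raise ValueError("need one perturbed result per original puzzle")
    n = len(orig_correct)
    solved = sum(orig_correct)
    if n == 0 or solved == 0:
        return 0.0
    acc = solved / n  # accuracy on the original puzzles
    # Originals that stay solved after perturbation.
    consistent = sum(o and p for o, p in zip(orig_correct, pert_correct))
    cr = consistent / solved  # consistency ratio
    return acc * (1.0 - cr)


def likely_memorized(orig_correct: Sequence[bool], pert_correct: Sequence[bool]) -> List[bool]:
    """Per-sample flag: solved originally but not after perturbation."""
    return [o and not p for o, p in zip(orig_correct, pert_correct)]


# Four puzzles: three solved originally, only one still solved once perturbed.
score = limem_score([True, True, True, False], [True, False, False, False])
print(f"LiMem = {score:.3f}")
```

A high score needs both high accuracy on the originals and low consistency under perturbation, which is the signature of memorization rather than reasoning.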

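Main findings

Memorization: after fine-tuning, LLMs can memorize the training puzzles almost perfectly (close to 100% accuracy), showing very strong memorization on a specific task.
Limited generalization: despite excellent results on the training data, the models do noticeably worse on slightly altered puzzles, which exposes a gap in generalization.
Effect of fine-tuning: fine-tuning leads to heavy memorization of the training data, yet it also consistently improves generalization, which suggests memorization and reasoning are not strictly opposed.
Switching between reasoning and memorization: using the per-sample memorization score, the paper shows how LLMs switch between reasoning and memorization strategies when solving logic puzzles.

Using K&K logic puzzles, the paper systematically studies memorization in LLM logical reasoning and shows that high scores may rest more on memorization than on reasoning. It highlights the two-sided effect of fine-tuning on memorization and generalization, and it challenges the assumption that strong performance can be attributed entirely to reasoning ability. The dynamically generated K&K puzzles provide a controllable, scalable testbed for measuring memorization and generalization. One limitation: the experiments are mostly based on K&K puzzles, so they may not fully represent other kinds of reasoning tasks.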

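Without perturbation

About the test data

The reasoning data comes from https://huggingface.co/datasets/K-and-K/knights-and-knaves

K-and-K/knights-and-knaves is a benchmark dataset for evaluating the logical reasoning ability of large language models (LLMs). It is built on the classic Knights and Knaves puzzle: on a special island, every inhabitant is either a knight (who always tells the truth) or a knave (who always lies). Each sample gives the statements of several inhabitants, and the model must deduce each inhabitant's identity from them. The dataset suits several task types, question answering in particular, and is meant to test how models handle complex logical relations and reasoning problems.

There is also an extension dataset named "perturbed-knights-and-knaves", which is used to evaluate how much the model relies on memorization when reasoning.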

Model training

Launching training

python finetune_kk.py \
    --train_data "data/train/people3_num1000.jsonl" \
    --test_data "data/test/people3_num100.jsonl" \
    --run_name kk_ft_ppl3_cot \
    --model_checkpoint /opt/chenrui/qwq32b/base_model/qwen2-7b \
    --output_dir ./result/out_cot/train3 \
    --cot_ft \
    --num_train_epochs 50 \
    --save_strategy steps \
    --save_steps 0.2 \
    --max_seq_length 512 \
    --eval_steps 5

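wandb monitoring: training loss and learning-rate curves

System resources

When training finishes, the model is saved to result/perturbed_nocot/train3/final_model

Merging the model weights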

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

base_model_path = "/opt/chenrui/qwq32b/base_model/qwen2-7b"
lora_path = "./result/perturbed_nocot/train3/final_model"
output_path = "./result/perturbed_nocot/train3/merged_model"

# Step 1: Load base model
model = AutoModelForCausalLM.from_pretrained(
    base_model_path,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)

# Step 2: Load LoRA adapter
model = PeftModel.from_pretrained(model, lora_path)

# Step 3: Merge adapter into base model
model = model.merge_and_unload()

# Step 4: Save merged model
model.save_pretrained(output_path, safe_serialization=True)

# Step 5: Save tokenizer
tokenizer = AutoTokenizer.from_pretrained(lora_path, trust_remote_code=True)
tokenizer.save_pretrained(output_path)

print(f"Merge complete, saved to: {output_path}")

Next comes model evaluation. Because I have not yet sorted out the compatibility between vllm and multiprocess, for now I skip vllm and load the model directly with AutoModelForCausalLM. In the code I print the result for every question.

    for i, (prompt, label, response) in enumerate(zip(prompts, labels, responses), start=start_index):
        cor, parsed_pred, reformat_gold_conditions = kk_proc._parse_cot_eval(response, label, args.model)
        if i % 1 == 0:  # always true, so every sample is printed
            print(
                f"\nPrompt {i}:{prompt}"
                f"\nResponse {i}:{response}"
                f"\nPrediction {i}:{parsed_pred}"
                f"\nLabel {i}:{reformat_gold_conditions}"
                f"\nCorrect {i}:{cor}"
            )

Average accuracy 0.870 - people3_num1000
Total evaluation time: 94.49 seconds
Average accuracy: 0.870
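
The evaluation loop relies on kk_proc._parse_cot_eval, whose internals are not shown in this post. As a rough illustration of what such scoring involves, the sketch below pulls each character's role out of a response and compares it with the gold solution. The helper names, the regular expression, and the "last mention wins" heuristic are all assumptions made for this example, not the project's actual parser.

```python
import re


def parse_roles(text, names):
    """Map each name to True (knight), False (knave), or None if never stated."""
    roles = {}
    for name in names:
        # Take the last mention, since chain-of-thought text may contain
        # earlier tentative lines such as "Assume Noah is a knight".
        found = re.findall(rf"\b{re.escape(name)} is a (knight|knave)\b", text)
        roles[name] = (found[-1] == "knight") if found else None
    return roles


def is_correct(response, names, solution):
    """A response is correct only if every character's role matches the gold label."""
    roles = parse_roles(response, names)
    return all(roles[name] == gold for name, gold in zip(names, solution))


names = ["Noah", "Amelia", "Isabella"]
response = "(1) Noah is a knight\n(2) Amelia is a knight\n(3) Isabella is a knave"
print(is_correct(response, names, [True, True, False]))  # True
```

Accuracy is then the fraction of puzzles for which this check passes, so a single wrong or missing role makes the whole puzzle count as incorrect.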

Adding perturbation

Generating perturbed puzzles

import time

import lib_kk  # the project's K&K puzzle library used below


def generate_problems(n_problems, n_people, gen_perturb=True):
    problems = []
    problem_seed = 1234
    # Defaults so the return below also works when gen_perturb is False.
    perturbed_problems_statement = None
    perturbed_problems_leaf = None

    # Sample the clean puzzles.
    start_time = time.time()
    problem_sampler = lib_kk.KKProblemSampler(problem_seed, n_people=n_people)
    problems = problem_sampler.sample_valid_problems(n_problems)
    end_time = time.time()
    elapsed_time = end_time - start_time
    print(f"Elapsed time: {elapsed_time} seconds")
    print(f'{len(problems)} valid problems generated')

    if gen_perturb:
        # Statement perturbation: one variant per puzzle, then flatten.
        start_time = time.time()
        per_stat = problem_sampler.perturb_problems(problems, perturb_type='statement', num_perturb=1)
        perturbed_problems_statement = [item for inner_list in per_stat for item in inner_list]
        end_time = time.time()
        elapsed_time = end_time - start_time
        print(f"Elapsed time: {elapsed_time} seconds")
        print(f'{len(perturbed_problems_statement)} perturbed (statement) problems generated')

        # Leaf perturbation: one variant per puzzle, then flatten.
        start_time = time.time()
        per_stat = problem_sampler.perturb_problems(problems, perturb_type='leaf', num_perturb=1)
        perturbed_problems_leaf = [item for inner_list in per_stat for item in inner_list]
        end_time = time.time()
        elapsed_time = end_time - start_time
        print(f"Elapsed time: {elapsed_time} seconds")
        print(f'{len(perturbed_problems_leaf)} perturbed (leaf) problems generated')

    return problems, perturbed_problems_statement, perturbed_problems_leaf

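Statement perturbation:

perturb_problems(problems, perturb_type='statement', num_perturb=1):
- Generates one statement-perturbed variant for each original puzzle in problems.
- In the paper, a statement perturbation replaces one speaker's entire statement with a newly sampled one, so the puzzle's logic changes rather than just its wording.
- num_perturb=1 means one perturbed variant is generated per original puzzle.
per_stat is a nested list holding the perturbation results for each original puzzle.
[item for inner_list in per_stat for item in inner_list]: flattens the nested list into a single list, perturbed_problems_statement.

Leaf perturbation:

It works like statement perturbation, but perturb_type='leaf' modifies a single leaf node of a statement's logic tree (one atomic claim inside the statement).
A leaf perturbation changes the statement's logical meaning (for example, A saying "B is a knight" becomes "B is a knave"), producing a logically different puzzle.
Again one perturbed variant is generated per original puzzle (num_perturb=1).
Flattening the result gives perturbed_problems_leaf.

The function returns problems, perturbed_problems_statement, perturbed_problems_leaf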
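
To make the idea of a leaf perturbation concrete, here is a toy version that operates on the nested-tuple format found in the dataset's statements field. It is only a sketch under assumptions: the helper names are hypothetical, this is not how lib_kk implements perturb_problems, and unlike a real sampler it does not check that the perturbed puzzle still has a valid solution.

```python
import random

LEAF_OPS = ("telling-truth", "lying")


def leaf_paths(expr, path=()):
    """Yield the index path to every leaf such as ('lying', 2) inside expr."""
    if expr[0] in LEAF_OPS:
        yield path
    else:
        for i, child in enumerate(expr[1:], start=1):
            yield from leaf_paths(child, path + (i,))


def get_at(expr, path):
    """Return the node of expr found by following path."""
    for i in path:
        expr = expr[i]
    return expr


def replace_at(expr, path, new_leaf):
    """Return a copy of expr with the node at path replaced by new_leaf."""
    if not path:
        return new_leaf
    i = path[0]
    return expr[:i] + (replace_at(expr[i], path[1:], new_leaf),) + expr[i + 1:]


def perturb_leaf(expr, n_people, rng):
    """Swap one randomly chosen leaf for a different (claim, person) pair."""
    path = rng.choice(list(leaf_paths(expr)))
    old = get_at(expr, path)
    candidates = [(op, p) for op in LEAF_OPS for p in range(n_people) if (op, p) != old]
    return replace_at(expr, path, rng.choice(candidates))


noah = ("<=>", ("telling-truth", 0), ("lying", 2))
print(perturb_leaf(noah, n_people=3, rng=random.Random(0)))
```

Replacing the first leaf of noah with ("telling-truth", 2) yields exactly the perturbed statement in the example further down.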

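Clean puzzles: the baseline, measuring the model's reasoning on standard K&K puzzles.
Statement perturbation: tests generalization when one speaker's whole statement is replaced (a larger logical change).
Leaf perturbation: tests generalization when a single leaf of one statement changes (a small logical change that still leads to a different solution).

Changing the code to load the perturbed dataset

from datasets import load_dataset

kk_dataset = load_dataset(
    "json",
    data_files={
        "train": "/opt/chenrui/mem-kk-logic/data/train/perturbed_leaf/people3_num1000.jsonl",
        "test": "/opt/chenrui/mem-kk-logic/data/test/perturbed_leaf/people3_num100.jsonl",
    },
)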

Here is one example to illustrate

# clean data
{"quiz": "A very special island is inhabited only by knights and knaves. Knights always tell the truth, and knaves always lie. You meet 3 inhabitants: Noah, Amelia, and Isabella. Noah remarked, \"Noah is a knight if and only if Isabella is a knave\". Amelia remarked, \"Noah is not a knave\". Isabella stated, \"Amelia is a knave or Isabella is a knight\". So who is a knight and who is a knave?","names": ["Noah", "Amelia", "Isabella"],"knight_knave": {"knight": "knight","knave": "knave","a_knight": "a knight","a_knave": "a knave","Knight": "Knight","Knave": "Knave"},"solution": [true, true, false],"solution_text": "Noah is a knight, Amelia is a knight, and Isabella is a knave.","solution_text_format": "(1) Noah is a knight\n(2) Amelia is a knight\n(3) Isabella is a knave","cot_head": "Let's think step by step, by considering whether each person is lying and if that leads to contradiction.","cot_repeat_steps": ["Assume Noah is a knight. No contradiction is found in their claim that Noah is a knight if and only if Isabella is a knave.", "Isabella cannot be a knight, because this would contradict the claim of Noah that Noah is a knight if and only if Isabella is a knave.", "Assume Isabella is a knave. No contradiction is found in their false claim that Amelia is a knave or Isabella is a knight.", "Assume Amelia is a knight. No contradiction is found in their claim that Noah is not a knave."],"cot_foot": "This leads to a feasible solution.","statements": "(('<=>', ('telling-truth', 0), ('lying', 2)), ('not', ('lying', 0)), ('or', ('lying', 1), ('telling-truth', 2)))","index": 131
}

# perturbed_leaf data
{"quiz": "A very special island is inhabited only by knights and knaves. Knights always tell the truth, and knaves always lie. You meet 3 inhabitants: Noah, Amelia, and Isabella. Noah remarked, \"Isabella is a knight if and only if Isabella is a knave\". Amelia remarked, \"Noah is not a knave\". Isabella stated, \"Amelia is a knave or Isabella is a knight\". So who is a knight and who is a knave?","names": ["Noah", "Amelia", "Isabella"],"knight_knave": {"knight": "knight","knave": "knave","a_knight": "a knight","a_knave": "a knave","Knight": "Knight","Knave": "Knave"},"solution": [false, false, true],"solution_text": "Noah is a knave, Amelia is a knave, and Isabella is a knight.","solution_text_format": "(1) Noah is a knave\n(2) Amelia is a knave\n(3) Isabella is a knight","cot_head": "Let's think step by step, by considering whether each person is lying and if that leads to contradiction.","cot_repeat_steps": ["Noah cannot be a knight, because this would contradict the claim of their own that Isabella is a knight if and only if Isabella is a knave.", "Assume Noah is a knave. No contradiction is found in their false claim that Isabella is a knight if and only if Isabella is a knave.", "Assume Isabella is a knight. No contradiction is found in their claim that Amelia is a knave or Isabella is a knight.", "Amelia cannot be a knight, because this would contradict the claim of their own that Noah is not a knave.", "Assume Amelia is a knave. No contradiction is found in their false claim that Noah is not a knave."],"cot_foot": "This leads to a feasible solution.","statements": "(('<=>', ('telling-truth', 2), ('lying', 2)), ('not', ('lying', 0)), ('or', ('lying', 1), ('telling-truth', 2)))","index": 131
}

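Similarities

Characters: both have the three people Noah, Amelia, and Isabella;
Rules: knights always tell the truth, knaves always lie;
Amelia's statement: in both she says "Noah is not a knave";
Isabella's statement: in both she says "Amelia is a knave or Isabella is a knight";
Structure: both have the fields quiz, solution, cot_head, cot_repeat_steps, cot_foot, statements, and so on.

Summary of differences

Item	First version	Second version
Noah's statement	Noah says: "Noah is a knight if and only if Isabella is a knave"	Noah says: "Isabella is a knight if and only if Isabella is a knave"
Is Noah's statement satisfiable?	In the first version it can be true (no contradiction)	In the second version it is self-contradictory: "X is a knight ↔ X is a knave" can never hold.
Noah's identity	Knight	Knave
Isabella's identity	Knave	Knight
Final answer (solution)	`[true, true, false]` → Noah and Amelia are knights, Isabella is a knave	`[false, false, true]` → Noah and Amelia are knaves, Isabella is a knight
Logical form (statements)	`(('<=>', telling-truth(Noah), lying(Isabella)), not lying(Noah), or(lying(Amelia), telling-truth(Isabella)))`	`(('<=>', telling-truth(Isabella), lying(Isabella)), not lying(Noah), or(lying(Amelia), telling-truth(Isabella)))`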

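Key logical differences

❶ Noah's statement changes the core logic of the puzzle:

First version: "Noah is a knight ↔ Isabella is a knave" is a proposition that can hold; Noah may be a knight as long as Isabella is a knave.
Second version: "Isabella is a knight ↔ Isabella is a knave" is a contradiction that is always false, so:
- If Noah says this and he is a knight (telling the truth) → the proposition would have to be true;
- But the proposition is always false → Noah cannot be a knight → Noah is a knave → he is lying.

❷ Effect on Isabella:

In the first puzzle, Isabella says: "Amelia is a knave or Isabella is a knight"
- If she is a knave, the statement is false → "Amelia is a knave or Isabella is a knight" is false → both disjuncts are false, so Amelia is a knight and Isabella is a knave.
In the second puzzle, Isabella is a knight, so her statement must be true;
- At least one disjunct must hold, and in fact Amelia is a knave, which satisfies the "or".

❸ Amelia's reasoning:

In both puzzles Amelia says: "Noah is not a knave".

In the first puzzle Noah is a knight → she is right → she is a knight;
In the second puzzle Noah is a knave → she is wrong → she is a knave.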
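
The case analysis above can be checked mechanically. The sketch below brute-forces every knight/knave assignment against the statements field of the two samples: an assignment is valid when each speaker's claim is true exactly if that speaker is a knight. Only the operators '<=>', 'not', and 'or' appear in these samples; the handling of 'and' and '->' is an assumption about the format, and the function names are made up for this example.

```python
import ast
from itertools import product


def holds(expr, is_knight):
    """Evaluate a statement tree under an assignment (True means knight)."""
    op = expr[0]
    if op == "telling-truth":
        return is_knight[expr[1]]
    if op == "lying":
        return not is_knight[expr[1]]
    if op == "not":
        return not holds(expr[1], is_knight)
    if op == "and":  # assumed operator, not present in the two samples
        return all(holds(e, is_knight) for e in expr[1:])
    if op == "or":
        return any(holds(e, is_knight) for e in expr[1:])
    if op == "->":  # assumed operator, not present in the two samples
        return (not holds(expr[1], is_knight)) or holds(expr[2], is_knight)
    if op == "<=>":
        return holds(expr[1], is_knight) == holds(expr[2], is_knight)
    raise ValueError(f"unknown operator: {op!r}")


def solve(statements):
    """Return every assignment where each claim is true iff its speaker is a knight."""
    if isinstance(statements, str):
        statements = ast.literal_eval(statements)
    n = len(statements)
    return [
        list(assignment)
        for assignment in product([True, False], repeat=n)
        if all(holds(s, assignment) == assignment[i] for i, s in enumerate(statements))
    ]


clean = "(('<=>', ('telling-truth', 0), ('lying', 2)), ('not', ('lying', 0)), ('or', ('lying', 1), ('telling-truth', 2)))"
perturbed = "(('<=>', ('telling-truth', 2), ('lying', 2)), ('not', ('lying', 0)), ('or', ('lying', 1), ('telling-truth', 2)))"
print(solve(clean))      # [[True, True, False]]
print(solve(perturbed))  # [[False, False, True]]
```

Both puzzles have exactly one solution, and changing a single leaf flips the identity of all three characters, which is why a model that memorized the clean answer fails on the perturbed puzzle.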

Testing the model trained on perturbed data

After training, I again merge the weights and then run the evaluation.

Train the model on the perturbed_leaf dataset and test it on the clean dataset

python eval_kk.py \
    --batch_size 8 \
    --model result/perturbed_nocot/train3/merged_model \
    --max_token 2048 \
    --arch Qwen/Qwen2-7B \
    --ntrain 0 \
    --config vllm \
    --limit 100 \
    --split "train" \
    --problem_type "clean" \
    --eval_nppl 3

Average accuracy 0.470 - people3_num1000
Total evaluation time: 93.32 seconds
Average accuracy: 0.470

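The results of this project (accuracy of 0.870 without perturbation versus 0.470 when training on perturbed_leaf data and testing on clean puzzles) do support the following view:

The success of current mainstream language models on structured logic tasks such as Knights and Knaves comes, to a large extent, not from genuine symbolic reasoning ability but from remembering many similar examples and matching new inputs against them.

Understanding "memorization vs. reasoning"

Trait	Model behavior	Meaning
Model with strong reasoning	Handles logic problems with new structures, new combinations, and new phrasings	Generalizes well and can solve problems never seen in training
Memorization-driven model	Performs well only on samples seen in training or very close to them in structure	Essentially "template matching" or "pattern association"

This experiment shows that, on tasks like this one, current language models rely more on analogy to memorized examples than on deductive reasoning.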