- 0. Notes
- 1. Today's most worthwhile papers
- Pushing the Limits of LLM Tool Calling via Experiential Knowledge Integration and Activation
- Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models
- FADA: Accessible fetal ultrasound interpretation and annotation with a selectively distilled unified vision-language model
- Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories
- IDEAL: In-DEpth ALignment Makes A Discrete Representation AutoEncoder
- A History-Aware Visually Grounded Critic for Computer Use Agents
- 2. Candidate paper list
- 3. Reading suggestions
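0. Notes
Data source: arXiv API. This digest automatically retrieves recent papers on LLMs, multimodality, agents, tool use, skills, RAG, long context, and model evaluation, and ranks them by research value, engineering relevance, and reproducibility signals.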
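The retrieval step can be sketched as follows. This is a minimal illustration, not the digest's actual pipeline: it builds a query URL for the public arXiv API (`search_query`, `sortBy`, `sortOrder`, and `max_results` are documented query parameters) and parses the Atom feed it returns. The function names, the search terms, and the choice to keep only a few fields are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom namespace used by the arXiv feed
API_URL = "http://export.arxiv.org/api/query"


def build_query_url(terms, max_results=50):
    """Build an arXiv API URL that returns the newest submissions first.

    `terms` is a hypothetical list of search phrases; the digest's real
    query is not published.
    """
    query = " OR ".join(f'all:"{t}"' for t in terms)
    params = {
        "search_query": query,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "max_results": max_results,
    }
    return f"{API_URL}?{urlencode(params)}"


def parse_feed(xml_text):
    """Extract id, title, date, authors, and categories from an Atom feed."""
    root = ET.fromstring(xml_text)
    papers = []
    for entry in root.iter(f"{ATOM}entry"):
        papers.append({
            # The <id> element is a URL such as http://arxiv.org/abs/<id>v1.
            "id": entry.findtext(f"{ATOM}id", "").rsplit("/", 1)[-1],
            # Titles may be wrapped across lines; collapse the whitespace.
            "title": " ".join(entry.findtext(f"{ATOM}title", "").split()),
            "published": entry.findtext(f"{ATOM}published", "")[:10],
            "authors": [a.findtext(f"{ATOM}name", "")
                        for a in entry.findall(f"{ATOM}author")],
            "categories": [c.get("term")
                           for c in entry.findall(f"{ATOM}category")],
        })
    return papers


if __name__ == "__main__":
    # Prints the URL only; fetching it (for example with urllib.request)
    # is left out so the sketch runs offline.
    print(build_query_url(["tool use", "agent"], max_results=5))
```

Keeping the URL builder and the parser separate means the parser can be exercised on a saved feed without touching the network.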
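Selection does not simply match trending keywords in titles; it prioritizes:
- whether the paper addresses a key problem in the LLM / multimodal / agent space;
- whether it offers a clear method contribution, evaluation benchmark, or system implementation;
- whether it yields transferable lessons for real engineering work;
- whether it merits a close read of the introduction, method, experiments, and limitations.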
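One way to turn criteria like these into a number is a keyword-signal scorer. The sketch below is purely illustrative: the signal names, regular expressions, weights, category set, and the cap of 20 are all assumptions, since the digest does not publish its rubric.

```python
import re
from dataclasses import dataclass


@dataclass
class Paper:
    title: str
    abstract: str
    categories: list[str]


# Hypothetical signals and weights; not the digest's real rubric.
SIGNALS = {
    "agent": (r"\bagent(s|ic)?\b", 3),
    "tool_use": (r"\btool[- ](use|calling)\b", 3),
    "benchmark": (r"\b(benchmark|dataset)\b", 2),
    "experiments": (r"\b(experiments?|outperforms?|baselines?)\b", 2),
    "reproducible": (r"\b(code|github)\b", 2),
}
RELEVANT_CATEGORIES = {"cs.CL", "cs.AI", "cs.CV", "cs.LG"}  # assumed
MAX_SCORE = 20


def score(paper):
    """Return (score, matched signal names) for one paper."""
    text = f"{paper.title} {paper.abstract}".lower()
    hits = [name for name, (pattern, _) in SIGNALS.items()
            if re.search(pattern, text)]
    total = sum(SIGNALS[name][1] for name in hits)
    # Add a fixed bonus when any arXiv category is in the relevant set.
    if RELEVANT_CATEGORIES & set(paper.categories):
        hits.append("category")
        total += 2
    return min(total, MAX_SCORE), hits
```

Returning the matched signal names alongside the score makes each ranking decision inspectable, which matters because keyword matching alone can over-tag papers that merely mention a term.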
1. Today's most worthwhile papers
Pushing the Limits of LLM Tool Calling via Experiential Knowledge Integration and Activation
- arXiv: 2606.10875
- PDF: https://arxiv.org/pdf/2606.10875v1
- Authors: Yupu Hao, Zhuoran Jin, Huanxuan Liao, Kang Liu, Jun Zhao
- Published: 2026-06-09; updated: 2026-06-09
- Categories: cs.CL
- Topic tags: LLM, Agent, Skill/Tool, Reasoning
- Reading value score: 18/20
Abstract at a glance
Large language models (LLMs) rely on tool use to act as autonomous agents, yet often fail in multi-step execution due to insufficient tool-related knowledge and ineffective knowledge activation. Therefore, we present a systematic study on how knowledge influences tool-use performance, covering the stages of knowledge acquisition, activation, and internalization.
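Why it is worth reading
Matched signals: core LLM direction; agents and long-horizon tasks; tool use / skill learning; reasoning, code, or complex tasks; training / post-training methods; inference efficiency or systems optimization; categories closely related to LLM/Agent; clear method contribution; code or data may be available for reproduction; the abstract signals experiments or comparisons. If time is short, start with the problem definition in the introduction, then the method figure and the main results table, and finally check the limitations and failure cases.
Method and contribution cues
This reads more like work on building agent capabilities; a close read should focus on the action space, tool interfaces, task decomposition, feedback signals, and failure recovery.
Questions to press on during a close read
- Does the paper tackle a new problem, or recast an existing one under a different experimental setup?
- Do the core conclusions depend on a specific model, dataset, or prompt template?
- In a longer task chain, how are tool-call errors, state drift, and permission boundaries handled?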
Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models
- arXiv: 2606.11167
- PDF: https://arxiv.org/pdf/2606.11167v1
- Authors: Atsumoto Ohashi, Neil Zeghidour, Alexandre Défossez, Eugene Kharitonov
- Published: 2026-06-09; updated: 2026-06-09
- Categories: cs.CL, eess.AS
- Topic tags: LLM, Safety/Eval
- Reading value score: 17/20
Abstract at a glance
Full-duplex spoken dialogue models can listen and speak simultaneously, making them a promising architecture for natural conversation. However, current models are trained solely with supervised learning through token-level likelihood maximization, which does not directly optimize interaction-level behaviors, causing interactivity issues such as excessive silence and ill-timed turn-taking.
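Why it is worth reading
Matched signals: core LLM direction; multimodal / vision-language models; evaluation benchmarks or datasets; safety, alignment, or robustness; training / post-training methods; categories closely related to LLM/Agent; clear method contribution; code or data may be available for reproduction. If time is short, start with the problem definition in the introduction, then the method figure and the main results table, and finally check the limitations and failure cases.
Method and contribution cues
This reads more like evaluation / dataset work; a close read should focus on the task definition, data construction, evaluation metrics, and whether the baselines are reasonable.
Questions to press on during a close read
- Does the paper tackle a new problem, or recast an existing one under a different experimental setup?
- Do the core conclusions depend on a specific model, dataset, or prompt template?
- Do the evaluation metrics explain real-world usage risks, or do they only cover surface behaviors that are easy to measure?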
FADA: Accessible fetal ultrasound interpretation and annotation with a selectively distilled unified vision-language model
- arXiv: 2606.11106
- PDF: https://arxiv.org/pdf/2606.11106v1
- Authors: Mahmood Alzubaidi, Uzair Shah, Raden Muaz, Ines Abbes, Nader Mohammed, Abdullatif Magram, et al.
- Published: 2026-06-09; updated: 2026-06-09
- Categories: cs.CV, cs.AI
- Topic tags: LLM, Multimodal, Agent, Skill/Tool, RAG/Memory, Reasoning, Safety/Eval
- Reading value score: 17/20
Abstract at a glance
A global shortage of trained sonographers limits prenatal ultrasound screening in low- and middle-income countries, where over half of pregnant women receive no skilled sonography. Current deep learning approaches address detection, segmentation, or classification in isolation, each demanding a separate model and expert-specified labels at inference.
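Why it is worth reading
Matched signals: core LLM direction; multimodal / vision-language models; reasoning, code, or complex tasks; evaluation benchmarks or datasets; safety, alignment, or robustness; training / post-training methods; inference efficiency or systems optimization; categories closely related to LLM/Agent; vision / multimodal category match; clear method contribution; the abstract signals experiments or comparisons. If time is short, start with the problem definition in the introduction, then the method figure and the main results table, and finally check the limitations and failure cases.
Method and contribution cues
This reads more like work on building agent capabilities; a close read should focus on the action space, tool interfaces, task decomposition, feedback signals, and failure recovery.
Questions to press on during a close read
- Does the paper tackle a new problem, or recast an existing one under a different experimental setup?
- Do the core conclusions depend on a specific model, dataset, or prompt template?
- In a longer task chain, how are tool-call errors, state drift, and permission boundaries handled?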
Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories
- arXiv: 2606.11176
- PDF: https://arxiv.org/pdf/2606.11176v1
- Authors: Kevin Qinghong Lin, Batu EI, Yuhong Shi, Pan Lu, Philip Torr, James Zou
- Published: 2026-06-09; updated: 2026-06-09
- Categories: cs.CV, cs.CL, cs.CY, cs.HC
- Topic tags: Multimodal, Agent, RAG/Memory, Reasoning, Safety/Eval
- Reading value score: 16/20
Abstract at a glance
Data tells stories that shape society; the data journalist’s job is to turn raw information into stories non-experts can trust. A high-quality news feature takes a newsroom team weeks: hunting for context, running statistics, choosing an angle, and designing visuals.
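Why it is worth reading
Matched signals: multimodal / vision-language models; agents and long-horizon tasks; tool use / skill learning; reasoning, code, or complex tasks; evaluation benchmarks or datasets; categories closely related to LLM/Agent; vision / multimodal category match; clear method contribution. If time is short, start with the problem definition in the introduction, then the method figure and the main results table, and finally check the limitations and failure cases.
Method and contribution cues
This reads more like work on building agent capabilities; a close read should focus on the action space, tool interfaces, task decomposition, feedback signals, and failure recovery.
Questions to press on during a close read
- Does the paper tackle a new problem, or recast an existing one under a different experimental setup?
- Do the core conclusions depend on a specific model, dataset, or prompt template?
- In a longer task chain, how are tool-call errors, state drift, and permission boundaries handled?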
IDEAL: In-DEpth ALignment Makes A Discrete Representation AutoEncoder
- arXiv: 2606.11096
- PDF: https://arxiv.org/pdf/2606.11096v1
- Authors: Yitong Chen, Zijie Diao, Junke Wang, Lingyu Kong, Yixuan Ren, Bo He, et al.
- Published: 2026-06-09; updated: 2026-06-09
- Categories: cs.CV
- Topic tags: LLM, Multimodal, Reasoning, Safety/Eval
- Reading value score: 16/20
Abstract at a glance
Built on pretrained vision foundation models (VFMs), representation autoencoders (RAEs) have recently emerged as a promising approach for constructing semantically rich latent spaces for image generation. However, their reconstruction quality often remains suboptimal, largely because deep VFM representations do not preserve sufficient fine-grained visual detail.
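Why it is worth reading
Matched signals: core LLM direction; multimodal / vision-language models; reasoning, code, or complex tasks; safety, alignment, or robustness; vision / multimodal category match; clear method contribution; code or data may be available for reproduction; the abstract signals experiments or comparisons. If time is short, start with the problem definition in the introduction, then the method figure and the main results table, and finally check the limitations and failure cases.
Method and contribution cues
This reads more like multimodal modeling work; a close read should focus on modality alignment, data mixture, how the vision encoder connects to the language model, and the inference pipeline.
Questions to press on during a close read
- Does the paper tackle a new problem, or recast an existing one under a different experimental setup?
- Do the core conclusions depend on a specific model, dataset, or prompt template?
- Do the cross-modal alignment gains come from the model architecture, the training data, or biases in the evaluation set?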
A History-Aware Visually Grounded Critic for Computer Use Agents
- arXiv: 2606.11078
- PDF: https://arxiv.org/pdf/2606.11078v1
- Authors: Jaewoo Lee, Zaid Khan, Archiki Prasad, Justin Chih-Yao Chen, Supriyo Chakraborty, Kartik Balasubramaniam, et al.
- Published: 2026-06-09; updated: 2026-06-09
- Categories: cs.AI, cs.CL, cs.CV
- Topic tags: Multimodal, Agent, RAG/Memory, Reasoning, Safety/Eval
- Reading value score: 16/20
Abstract at a glance
Various test-time interventions for Computer Use Agents (CUAs), including critic models, have been developed to improve performance through pre-execution action evaluation in complex Graphical User Interface (GUI) environments. However, existing critics suffer from two key limitations: they (1) focus primarily on short-sighted decision loops (e.g., forgetting earlier actions) and (2) lack the visual grounding needed to detect flawed actions (e.g., clicking wrong UI elements).
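Why it is worth reading
Matched signals: multimodal / vision-language models; agents and long-horizon tasks; reasoning, code, or complex tasks; evaluation benchmarks or datasets; categories closely related to LLM/Agent; vision / multimodal category match; clear method contribution; the abstract signals experiments or comparisons. If time is short, start with the problem definition in the introduction, then the method figure and the main results table, and finally check the limitations and failure cases.
Method and contribution cues
This reads more like work on building agent capabilities; a close read should focus on the action space, tool interfaces, task decomposition, feedback signals, and failure recovery.
Questions to press on during a close read
- Does the paper tackle a new problem, or recast an existing one under a different experimental setup?
- Do the core conclusions depend on a specific model, dataset, or prompt template?
- In a longer task chain, how are tool-call errors, state drift, and permission boundaries handled?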
2. Candidate paper list
| Paper | Topics | Score | Published | One-sentence summary |
|---|---|---|---|---|
| Pushing the Limits of LLM Tool Calling via Experiential Knowledge Integration and Activation | LLM, Agent, Skill/Tool, Reasoning | 18 | 2026-06-09 | Large language models (LLMs) rely on tool use to act as autonomous agents, yet often fail in multi-step execution due to insufficient tool-related knowledge and ineffective knowledge activation. |
| Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models | LLM, Safety/Eval | 17 | 2026-06-09 | Full-duplex spoken dialogue models can listen and speak simultaneously, making them a promising architecture for natural conversation. |
| FADA: Accessible fetal ultrasound interpretation and annotation with a selectively distilled unified vision-language model | LLM, Multimodal, Agent, Skill/Tool, RAG/Memory, Reasoning, Safety/Eval | 17 | 2026-06-09 | A global shortage of trained sonographers limits prenatal ultrasound screening in low- and middle-income countries, where over half of pregnant women receive no skilled sonography. |
| Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories | Multimodal, Agent, RAG/Memory, Reasoning, Safety/Eval | 16 | 2026-06-09 | Data tells stories that shape society; the data journalist’s job is to turn raw information into stories non-experts can trust. |
| IDEAL: In-DEpth ALignment Makes A Discrete Representation AutoEncoder | LLM, Multimodal, Reasoning, Safety/Eval | 16 | 2026-06-09 | Built on pretrained vision foundation models (VFMs), representation autoencoders (RAEs) have recently emerged as a promising approach for constructing semantically rich latent spaces for image generation. |
| A History-Aware Visually Grounded Critic for Computer Use Agents | Multimodal, Agent, RAG/Memory, Reasoning, Safety/Eval | 16 | 2026-06-09 | Various test-time interventions for Computer Use Agents (CUAs), including critic models, have been developed to improve performance through pre-execution action evaluation in complex Graphical User Interface (GUI) environments. |
| T1-Bench: Benchmarking Multi-Scenario Agents in Real-World Domains | LLM, Agent, RAG/Memory, Reasoning, Safety/Eval | 16 | 2026-06-09 | Recent advances in reasoning and tool-calling capabilities of large language models (LLMs) have enabled increasingly capable agentic systems. |
| Architect-Ant: Editable Automatic Furnishing of Architectural Floor Plans | LLM, Multimodal, Reasoning, Safety/Eval | 16 | 2026-06-09 | Furnished floor plans are fundamental to real estate visualization, interior design, and architectural workflows. |
| EEVEE: Towards Test-time Prompt Learning in the Real World for Self-Improving Agents | LLM, Agent, RAG/Memory, Safety/Eval | 15 | 2026-06-09 | In this paper, we propose EEVEE, the first multi-dataset test-time prompt learning framework for LLM agents, enabling test-time prompt learning under real-world task streams. |
| P3D-Bench: Benchmarking MLLMs for Parametric 3D Generation and Structural Reasoning | LLM, Multimodal, Reasoning, Safety/Eval | 15 | 2026-06-09 | Multimodal large language models can write code to produce complex programs as well as use programs to do 3D modeling, which opens up a new avenue for 3D generation powered by their priors, world knowledge and reasoning. |
| ABC-Bench: An Agentic Bio-Capabilities Benchmark for Biosecurity | LLM, Agent, RAG/Memory, Reasoning, Safety/Eval | 15 | 2026-06-09 | Large language models (LLMs) are rapidly acquiring capabilities relevant to biological research, from literature synthesis to interpretation of experimental data. |
| TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning | LLM, Agent, RAG/Memory, Reasoning, Safety/Eval | 15 | 2026-06-09 | Reinforcement learning with verifiable rewards (RLVR) is a promising approach for enhancing reasoning and agentic behavior in large language models. |
| Mind the Gap: Can Frontier LLMs Pass a Standardized Office Proficiency Exam? | LLM, Agent, Reasoning, Safety/Eval | 15 | 2026-06-09 | The deployment of Large Language Model (LLM) agents for computer automation is accelerating, yet their ability to navigate complex, professional-grade productivity software is largely untested. |
| Trace Only What You Need: Structure-Aware On-Demand Hypergraph Memory for Long-Document Question Answering | LLM, Agent, RAG/Memory, Reasoning | 15 | 2026-06-09 | Long-document question answering (QA) requires large language models (LLMs) to reason over evidence scattered across lengthy documents, where answers often depend on event order, section-level context, and cross-part evidence connections. |
| The Shibboleth Effect: Auditing the Cross-Lingual Distributional Skew of Large Language Models | LLM, Agent, Safety/Eval | 14 | 2026-06-09 | This study investigates cross-lingual distributional skew (the Shibboleth Effect) in frontier large language models (LLMs) subjected to sustained adversarial conditions. |
| LIBERO-Occ: Evaluating and Improving Vision-Language-Action Models under Scene-Induced Occlusion via Viewpoint Imagination | Multimodal, Reasoning, Safety/Eval | 14 | 2026-06-09 | Vision-Language-Action (VLA) models achieve strong performance on standard manipulation benchmarks, but most evaluations assume that task-relevant objects are fully visible. |
| Next Forcing: Causal World Modeling with Multi-Chunk Prediction | LLM, Multimodal, RAG/Memory, Safety/Eval | 13 | 2026-06-09 | Autoregressive video generation has emerged as a powerful paradigm for World Action Models (WAMs). |
| AuRA: Internalizing Audio Understanding into LLMs as LoRA | LLM, Multimodal, Reasoning, Safety/Eval | 13 | 2026-06-09 | Recent efforts to extend large language models (LLMs) to speech inputs typically rely on cascaded ASR-LLM pipelines, end-to-end speech-language models, or bridge/distillation-based adaptation. |
| Task Robustness via Re-Labelling Vision-Action Robot Data | LLM, Multimodal, RAG/Memory, Reasoning, Safety/Eval | 13 | 2026-06-09 | The recent trend in scaling models for robot learning has resulted in impressive policies that can perform various manipulation tasks and generalize to novel scenarios. |
| AnyMod-LLVE: Low-Light Video Enhancement with Modality-Agnostic Inference | Multimodal, Reasoning | 12 | 2026-06-09 | Low-light video enhancement (LLVE) remains a challenging task due to severe information degradation under low-illumination conditions. |
| Modeling Complex Behaviors: Multi-Personality Composition and Dynamic Switching in Vision-Language Models | LLM, Multimodal, Reasoning, Safety/Eval | 12 | 2026-06-09 | With the widespread deployment of Multimodal Large Language Models (MLLMs) in social interaction, understanding and controlling their behavior under complex personality conditions is essential. |
| Recalling Too Well: Sycophancy Evaluation and Mitigation in Memory-Augmented Models | LLM, RAG/Memory, Reasoning, Safety/Eval | 12 | 2026-06-09 | Persistent memory systems promise to make LLMs more helpful by storing user beliefs over time. |
| Frontier Coding Agents Use Metaprogramming to Adapt to Unfamiliar Programming Languages | LLM, Agent, Reasoning, Safety/Eval | 12 | 2026-06-09 | LLM-based coding agents are usually evaluated in familiar software settings: mainstream languages, common libraries, and public repositories. |
| Role-Agent: Bootstrapping LLM Agents via Dual-Role Evolution | LLM, Agent, RAG/Memory, Reasoning, Safety/Eval | 12 | 2026-06-09 | Although Large Language Model (LLM) agents have demonstrated strong performance on complex tasks, their learning is often limited by inefficient interaction feedback and static training environments, which hinder broader generalization. |
| Piper: A Programmable Distributed Training System | LLM, RAG/Memory | 11 | 2026-06-09 | Large-scale model training increasingly relies on composing multiple parallelism strategies, such as data, pipeline, and expert parallelism, together with memory-saving optimizations like ZeRO. |
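3. Reading suggestions
Start with the three highest-scoring papers. For agent / skill papers, check whether the task setup is realistic, whether tool calls are controllable, and whether state management is clear; for multimodal papers, check the data mixture, modality alignment, and whether the evaluation covers real usage scenarios; for RAG / memory papers, check retrieval granularity, noise control, freshness, and long-context cost.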
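Picking the highest-scoring papers from a candidate table like the one in section 2 can be sketched as below. The helper names are hypothetical, and the parser assumes five pipe-separated columns with the score in the third and no literal `|` inside a cell.

```python
def parse_rows(table: str) -> list[dict]:
    """Parse a five-column pipe table into dicts; skip non-data rows."""
    rows = []
    for line in table.strip().splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # The header and separator rows have no numeric score cell.
        if len(cells) != 5 or not cells[2].isdigit():
            continue
        rows.append({
            "title": cells[0],
            "topics": cells[1].split(", "),
            "score": int(cells[2]),
        })
    return rows


def top_k(rows: list[dict], k: int = 3) -> list[dict]:
    """Return the k highest-scoring rows; ties keep their table order."""
    return sorted(rows, key=lambda r: r["score"], reverse=True)[:k]


def by_topic(rows: list[dict], topic: str) -> list[dict]:
    """Filter rows to those carrying a given topic tag, e.g. 'Agent'."""
    return [r for r in rows if topic in r["topics"]]
```

Because Python's sort is stable, papers with equal scores stay in the order the table lists them, so `top_k(parse_rows(table))` matches a top-to-bottom read of the table when it is already sorted by score.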
Generated at: 2026-06-10 14:48:52 CST