Can AI agents rediscover scientific insights? We benchmark full-cycle research automation with verifiable evaluation.
Papers are parsed into tree-structured problems whose leaf nodes specify executable, verifiable research steps.
From Papers to Verifiable Discovery Tasks
FIRE-Bench is constructed through research-problem decomposition, a process that transforms high-quality empirical analysis papers into verifiable benchmark tasks. This approach balances exploratory freedom (avoiding tasks that are too narrow to allow genuine exploration) with empirical verifiability (avoiding tasks that are too broad to benchmark).
We formalize each paper \(\mathcal{P}\) as a hierarchical research-problem tree \(\mathcal{T}(\mathcal{P})\) that encodes the authors' reasoning trajectory, from broad research questions (e.g., "Do LLMs contain biases?") to specific experimental tasks. The tree is extracted by an automated parser \(E_\phi: \Sigma^* \rightarrow \mathcal{T}\), so that \(\mathcal{T}(\mathcal{P}) = E_\phi(\mathcal{P})\). It consists of three node types:

- Root node \(r\) captures the overarching research problem from the paper's title or abstract.
- Intermediate nodes \(v_i \in \mathcal{V}\) represent narrower subproblems that decompose the main question (e.g., "Do LLMs exhibit racial bias in medical report generation?").
- Leaf nodes \(l_j \in \mathcal{L}\) specify fully executable experiments with datasets \(\mathcal{D}_j\), methods \(\mathcal{M}_j\), and evaluation metrics \(\mathcal{C}_j\), each mapping directly to figures or tables in the original paper.
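To make the tree structure concrete, here is a minimal Python sketch of the three node types above. The class name, field names, and example values are our own illustration, not the benchmark's released schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ProblemNode:
    """One node in a research-problem tree T(P) (hypothetical schema)."""
    question: str                                  # research question at this node
    kind: str                                      # "root" | "intermediate" | "leaf"
    children: List["ProblemNode"] = field(default_factory=list)
    # Leaf-only fields: the executable experiment specification
    datasets: Optional[List[str]] = None           # D_j
    methods: Optional[List[str]] = None            # M_j
    metrics: Optional[List[str]] = None            # C_j

# Example tree mirroring the running example in the text
tree = ProblemNode(
    question="Do LLMs contain biases?",
    kind="root",
    children=[
        ProblemNode(
            question="Do LLMs exhibit racial bias in medical report generation?",
            kind="intermediate",
            children=[
                ProblemNode(
                    question="Compare generated report quality across patient demographics",
                    kind="leaf",
                    datasets=["clinical vignettes"],
                    methods=["prompted report generation"],
                    metrics=["quality-score gap across groups"],
                ),
            ],
        ),
    ],
)
```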
From each tree, we create a "constrained rediscovery" problem by first identifying a target leaf node \(l^* \in \mathcal{L}\) corresponding to a central finding, then selecting an intermediate node \(v^* \in \mathcal{V}\) that balances openness with verifiability. The agent receives only the research question from \(v^*\), while the methodology and conclusion from \(l^*\) are withheld. The paper's published findings serve as ground truth for evaluation.
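A constrained rediscovery task can then be viewed as a pairing of the exposed question with the withheld finding. The sketch below is a hypothetical illustration of that pairing; the names (`RediscoveryTask`, `make_task`) and the placeholder strings are ours, not FIRE-Bench's actual task format.

```python
from dataclasses import dataclass

@dataclass
class RediscoveryTask:
    """Hypothetical container for one constrained rediscovery problem."""
    question: str      # research question from v*, shown to the agent
    ground_truth: str  # published finding from l*, withheld until scoring

def make_task(intermediate_question: str, leaf_finding: str) -> RediscoveryTask:
    # The agent receives only the intermediate node's question; the target
    # leaf's methodology and conclusion are held out as evaluation ground truth.
    return RediscoveryTask(question=intermediate_question, ground_truth=leaf_finding)

# Example (placeholder finding; not a real result from any paper)
task = make_task(
    "Do LLMs exhibit racial bias in medical report generation?",
    "<published finding from the target leaf l*>",
)
```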
We evaluate agent performance through claim-level analysis: both the agent's conclusions \(C_{\text{agent}}\) and the ground truth \(C_{\text{gt}}\) are decomposed into atomic, verifiable claims. Precision measures the fraction of agent claims supported by the ground truth, Recall measures the fraction of ground-truth claims the agent recovers, and \(F_1\) combines the two as the overall performance metric.
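As a rough illustration of this scoring, the sketch below computes claim-level precision, recall, and \(F_1\) given an entailment judge `entails(claim, reference)`. The judge and the exact matching rule are placeholders; the benchmark's real procedure may differ.

```python
def claim_level_scores(agent_claims, gt_claims, entails):
    """Claim-level precision, recall, and F1.

    `entails(claim, reference)` is a placeholder judge (e.g., an LLM or a
    human annotator) returning True if `claim` is supported by `reference`.
    """
    # Precision: fraction of agent claims supported by some ground-truth claim.
    supported = sum(any(entails(a, g) for g in gt_claims) for a in agent_claims)
    precision = supported / len(agent_claims) if agent_claims else 0.0
    # Recall: fraction of ground-truth claims recovered by some agent claim.
    recovered = sum(any(entails(a, g) for a in agent_claims) for g in gt_claims)
    recall = recovered / len(gt_claims) if gt_claims else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Decomposing conclusions into atomic claims rewards agents for recovering specific findings rather than for sweeping, hard-to-falsify summaries.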
Mean scores across three independent trials on all 30 tasks
| # | Agent | Precision | Recall | F₁ Score |
|---|---|---|---|---|
| 1 | Claude Code Sonnet-4 | 52.1 | 48.3 | 46.7 |
| 2 | Codex CLI gpt-5-medium | 44.83 | 48.96 | 41.93 |
| 3 | OpenHands gpt-5 | 41.67 | 41.42 | 37.87 |
| 4 | OpenHands o4-mini | 36.81 | 36.63 | 31.85 |
Best agent achieves only 46.7 F₁
Success often appears to be a "lottery"
73.6% of errors from flawed planning
More agents are being added...
30 research tasks from high-impact ML papers.
Interactive visualization of the paper problem trees and their research structure.
Where do AI research agents fail? Our taxonomy reveals systematic failure patterns.
Key Insight: The challenge is scientific reasoning, not implementation. As models become more capable, the bottleneck shifts from coding to high-level planning and analysis.
Paper coming soon. Citation will be available upon publication.
@article{firebench2026,
title = {FIRE-Bench: Evaluating Research Agents on the Rediscovery of Scientific Insights},
author = {Wang, Zhen and Bai, Fan and Luo, Zhongyan and Su, Jinyan and Sun, Kaiser and Yu, Xinle and Liu, Jieyuan and Zhou, Kun and Cardie, Claire and Dredze, Mark and Xing, Eric P. and Hu, Zhiting},
journal = {arXiv preprint},
year = {2026},
note = {Coming Soon}
}