FIRE-Bench

Can AI agents rediscover scientific insights? We benchmark full-cycle research automation with verifiable evaluation.

Zhen Wang*1, Fan Bai*2, Zhongyan Luo*1, Jinyan Su3, Kaiser Sun2, Xinle Yu1, Jieyuan Liu1, Kun Zhou1,
Claire Cardie3, Mark Dredze2, Eric P. Xing4,5, Zhiting Hu1
1UC San Diego, 2Johns Hopkins University, 3Cornell University, 4MBZUAI, 5CMU
* Equal Contribution
Live FIRE-Bench

Submit your paper and see how AI agents perform on your tasks!

Framework animation: papers are parsed into tree-structured tasks, where each node represents an executable, verifiable research step.

Methodology

From Papers to Verifiable Discovery Tasks

FIRE-Bench is constructed through research-problem decomposition, a process that transforms high-quality empirical analysis papers into verifiable benchmark tasks. This approach balances exploratory freedom (avoiding tasks that are too narrow to allow genuine exploration) with empirical verifiability (avoiding tasks that are too broad to benchmark).

We formalize each paper \(\mathcal{P}\) as a hierarchical research-problem tree \(\mathcal{T}(\mathcal{P})\) that encodes the authors' reasoning trajectory, from broad research questions (e.g., "Do LLMs contain biases?") to specific experimental tasks. The tree is extracted by an automated parser \(E_\phi: \Sigma^* \rightarrow \mathcal{T}\), so that \(\mathcal{T}(\mathcal{P}) = E_\phi(\mathcal{P})\). It consists of three node types:
  • The root node \(r\) captures the overarching research problem from the paper's title or abstract.
  • Intermediate nodes \(v_i \in \mathcal{V}\) represent narrower subproblems that decompose the main question (e.g., "Do LLMs exhibit racial bias in medical report generation?").
  • Leaf nodes \(l_j \in \mathcal{L}\) specify fully executable experiments with datasets \(\mathcal{D}_j\), methods \(\mathcal{M}_j\), and evaluation metrics \(\mathcal{C}_j\), each mapping directly to figures or tables in the original paper.
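To make the tree structure concrete, here is a minimal sketch of how \(\mathcal{T}(\mathcal{P})\) could be represented in code. The class and field names (RootNode, IntermediateNode, LeafNode, finding, etc.) are illustrative assumptions rather than the benchmark's actual schema, and the example instance is condensed from the "Lost in the Middle" task shown below.

```python
from dataclasses import dataclass, field

# Illustrative sketch of the research-problem tree T(P); all names are hypothetical.

@dataclass
class LeafNode:
    """Fully specified, executable experiment that maps to a figure/table in the paper."""
    question: str        # narrowest experimental question
    datasets: list[str]  # D_j
    methods: list[str]   # M_j
    metrics: list[str]   # C_j
    finding: str         # the paper's reported conclusion (used as ground truth)

@dataclass
class IntermediateNode:
    """Narrower subproblem that decomposes the main research question."""
    question: str
    children: list["IntermediateNode | LeafNode"] = field(default_factory=list)

@dataclass
class RootNode:
    """Overarching research problem, typically drawn from the title or abstract."""
    question: str
    children: list[IntermediateNode] = field(default_factory=list)

# Condensed example based on the "Lost in the Middle" task.
tree = RootNode(
    question="Do language models use information equally regardless of its position in context?",
    children=[
        IntermediateNode(
            question="How does the position of relevant information within long contexts "
                     "affect performance on retrieval tasks?",
            children=[
                LeafNode(
                    question="Measure multi-document QA accuracy while varying the gold document's position.",
                    datasets=["Multi-document QA (10/20/30 documents)"],
                    methods=["Vary the position of the gold document within the context"],
                    metrics=["answer accuracy by position"],
                    finding="Accuracy follows a U-shaped curve: best at the beginning or end, "
                            "worst for middle positions.",
                )
            ],
        )
    ],
)
```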

From each tree, we create a "constrained rediscovery" problem by first identifying a target leaf node \(l^* \in \mathcal{L}\) corresponding to a central finding, then selecting an intermediate node \(v^* \in \mathcal{V}\) that balances openness with verifiability. The agent receives only the research question from \(v^*\), while the methodology and conclusion from \(l^*\) are withheld. The paper's published findings serve as ground truth for evaluation.
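For illustration, the sketch below shows what a constrained-rediscovery task exposes versus withholds. The make_task helper and its dictionary fields are hypothetical names; in the benchmark, the choice of \(v^*\) and \(l^*\) is curated from the paper rather than produced by code like this.

```python
# Hypothetical sketch of constrained-rediscovery task assembly (names are illustrative).

def make_task(v_star: dict, l_star: dict) -> dict:
    """Expose only the intermediate node's research question; withhold the leaf's
    experimental protocol and conclusion, which are used solely for evaluation."""
    return {
        "research_question": v_star["question"],   # the only thing shown to the agent
        "hidden_ground_truth": l_star["finding"],  # used to score the agent's conclusions
        "hidden_protocol": {                       # withheld experimental details
            "datasets": l_star["datasets"],
            "methods": l_star["methods"],
            "metrics": l_star["metrics"],
        },
    }

# Example condensed from the medical racial-bias case study on this page.
v_star = {"question": "Do LLMs exhibit racial bias in medical report generation?"}
l_star = {
    "finding": "Higher costs and longer stays are projected more often for White patients.",
    "datasets": ["PMC-Patients case profiles"],
    "methods": ["GPT-3.5/GPT-4 report generation with patient race systematically varied"],
    "metrics": ["cost and hospitalization-duration disparities"],
}

task = make_task(v_star, l_star)
print(task["research_question"])  # the agent sees only this question
```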

We evaluate agent performance through claim-level analysis: both the agent's conclusions \(C_{\text{agent}}\) and the ground truth \(C_{\text{gt}}\) are decomposed into atomic, verifiable claims. We then compute precision and recall over these claims and report \(F_1\) as the overall performance metric.
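As a concrete reference point, here is a minimal sketch of claim-level scoring under the standard definitions (precision over the agent's claims, recall over the ground-truth claims), with a pluggable claims_match judge standing in for the benchmark's verifier. The claim_f1 function and the toy keyword matcher are assumptions for illustration, not the exact evaluator.

```python
from typing import Callable

def claim_f1(
    agent_claims: list[str],
    gt_claims: list[str],
    claims_match: Callable[[str, str], bool],
) -> dict[str, float]:
    """Claim-level precision/recall/F1 (illustrative; the real judge may differ).

    Precision: fraction of agent claims supported by some ground-truth claim.
    Recall:    fraction of ground-truth claims recovered by some agent claim.
    """
    supported = sum(any(claims_match(a, g) for g in gt_claims) for a in agent_claims)
    recovered = sum(any(claims_match(a, g) for a in agent_claims) for g in gt_claims)

    precision = supported / len(agent_claims) if agent_claims else 0.0
    recall = recovered / len(gt_claims) if gt_claims else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Toy usage: a trivial keyword matcher stands in for an LLM or entailment judge.
scores = claim_f1(
    agent_claims=["Accuracy is lowest when the gold document is in the middle."],
    gt_claims=[
        "Performance degrades when relevant information is in the middle.",
        "Performance is best at the beginning or end of the context.",
    ],
    claims_match=lambda a, g: "middle" in a and "middle" in g,
)
print(scores)  # precision 1.0, recall 0.5, F1 ~ 0.67
```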

30 Research Tasks: derived primarily from ICLR, ICML, and NeurIPS papers (2024-2025)
✓ Verifiable Evaluation: ground truth from published papers enables objective scoring
4 Research Stages: Plan → Implement → Execute → Analyze, covering the full research cycle
14 Error Categories: a fine-grained taxonomy reveals where agents fail

Leaderboard

Mean scores across three independent trials on all 30 tasks

Rank  Agent        Model         Precision  Recall  F₁
1     Claude Code  Sonnet-4      52.1       48.3    46.7
2     Codex CLI    gpt-5-medium  44.83      48.96   41.93
3     OpenHands    gpt-5         41.67      41.42   37.87
4     OpenHands    o4-mini       36.81      36.63   31.85
šŸ“‰ Low Performance: best agent achieves only 46.7 F₁

šŸ“ˆ High Variance: success often appears to be a "lottery"

🧠 Planning Bottleneck: 73.6% of errors from flawed planning

More agents are being added...

Benchmark Tasks

30 research tasks from high-impact ML papers. Click to see details.

TACL 2024

Lost in the Middle

Do language models use information equally regardless of its position in context?

Context Position Effects
Research Question

How does the position of relevant information within long contexts affect language model performance on retrieval tasks?

Experiment Settings

Multi-document QA with multiple documents, varying the position of the gold document.

Ground Truth

Performance follows a U-shaped curve: models perform best when relevant information is at the beginning or end, with significant degradation for middle positions.

Communications Medicine 2024

LLM Racial Bias in Medicine

Do clinical LLMs propagate race-based medical misconceptions?

Healthcare Bias
Research Question

Do LLMs exhibit racial bias when generating medical reports and predictions for different demographic groups?

Experiment Settings

Generate medical predictions (hospitalization, costs, mortality) for patient cases while systematically varying racial/ethnic information.

Ground Truth

Models show significant racial biases: higher costs projected for White populations, optimistic survival predictions, and disease-race associations mirroring real-world disparities.

ICLR 2024

LLMs Lack Self-Correction

Can LLMs improve reasoning by self-correction without external feedback?

Reasoning Behavior
Research Question

Can large language models improve their reasoning performance through intrinsic self-correction without any external feedback?

Experiment Settings

Test self-correction on GSM8K, CommonSenseQA, and HotpotQA using prompts that ask models to review and revise their answers.

Ground Truth

LLMs cannot reliably self-correct reasoning without external feedback. Performance often degrades as models change correct answers to incorrect ones.

arXiv 2025

Awareness Detection

Can models detect whether an interaction comes from evaluation or deployment?

Evaluation Awareness
Research Question

Can frontier LLMs distinguish between evaluation/benchmark prompts and real-world deployment interactions?

Experiment Settings

Binary classification task on diverse prompts from multiple datasets including MMLU, SWEBench, and real-world interactions.

Ground Truth

Gemini-2.5-Pro achieves AUC of 0.83 in identifying evaluations, demonstrating significant evaluation awareness capability.

arXiv 2025

CoT Faithfulness Gaps

Do chain-of-thought explanations faithfully reflect model reasoning?

Reasoning Faithfulness
Research Question

When reasoning models use hints to solve problems, do they faithfully acknowledge these hints in their chain-of-thought?

Experiment Settings

Insert various hints into prompts and measure how often models mention using the hints in their reasoning traces.

Ground Truth

Models use hints but rarely mention them in their reasoning. Reveal rates are consistently low across different model families.

NeurIPS 2024

CoT Without Prompting

Can altered decoding reveal latent reasoning paths in LLMs?

Decoding Strategies
Research Question

Can chain-of-thought reasoning paths be extracted from LLMs by modifying the decoding process rather than prompting?

Experiment Settings

Compare greedy decoding vs. top-k alternative token exploration on GSM8K and other reasoning benchmarks.

Ground Truth

CoT reasoning paths are inherent in alternative decoding sequences. CoT-decoding substantially outperforms standard greedy decoding.

ICML 2024

Hallucination Snowballing

Do LLM hallucinations compound when models build on prior errors?

Error Propagation
Research Question

Do language models generate hallucinations that they could recognize as incorrect if presented in isolation?

Experiment Settings

QA datasets on primality testing, senator searches, and flight connectivity. Test if models can identify their own false claims separately.

Ground Truth

Models over-commit to early mistakes. They can identify most of their own incorrect claims when presented separately, yet still generate them in context.

ICML 2024

Counterfactual Simulatability

Can humans predict model behavior changes from explanations?

Explanation Quality
Research Question

Do LLM explanations enable humans to accurately predict how the model would behave on variations of the input?

Experiment Settings

Multi-hop factual reasoning and reward modeling tasks. Measure if explanations help predict model outputs on counterfactual inputs.

Ground Truth

LLM explanations have low precision. Plausible-sounding explanations don't correlate with actual predictive value for model behavior.

ICML 2024

Premise Order Effects

Does the order of logical premises affect LLM reasoning accuracy?

Logical Reasoning
Research Question

Are LLMs sensitive to the ordering of premises in deductive reasoning tasks, even though logical validity is order-independent?

Experiment Settings

Deductive reasoning with permuted premise orders. R-GSM benchmark for mathematical problem-solving with reordered conditions.

Ground Truth

Performance drops significantly when premises are permuted. Models perform best when premises match the proof order.

ICLR 2024

Bias Runs Deep

Do persona assignments surface implicit reasoning biases in LLMs?

Persona Bias
Research Question

Do LLMs exhibit stereotypical reasoning biases when assigned different demographic personas?

Experiment Settings

Multiple reasoning datasets, various LLMs, and diverse personas across race, gender, religion, disability, and political affiliation.

Ground Truth

Most personas showed bias across models. Some datasets had substantial performance drops with certain personas.

ICLR 2024

Not Robust MCQ Selectors

Are LLMs robust to option position changes in multiple choice questions?

Selection Bias
Research Question

Do LLMs exhibit selection bias toward specific answer positions (A/B/C/D) regardless of content?

Experiment Settings

MMLU and other MCQ benchmarks with permuted answer positions across multiple LLMs.

Ground Truth

Moving correct answers to different positions significantly affects model accuracy, with some positions causing large performance drops and others yielding gains.

ICLR 2024

Prompt Format Sensitivity

How sensitive are LLMs to spurious features in prompt design?

Prompt Robustness
Research Question

How much does LLM performance vary based on minor prompt formatting choices that shouldn't affect meaning?

Experiment Settings

50+ tasks with various prompt format variations (spacing, delimiters, ordering) across multiple models.

Ground Truth

Performance differences can be substantial across different prompt formats. Significant variation exists across tasks and models.

ICLR 2024

Space and Time Representations

Do language models learn coherent representations of space and time?

World Models
Research Question

Do LLMs learn linear representations of spatial and temporal information that generalize across entity types?

Experiment Settings

Probe LLM activations on spatial datasets (world/US/NYC places) and temporal datasets (historical figures, artworks, news).

Ground Truth

LLMs learn linear space/time representations. Individual "space neurons" and "time neurons" reliably encode coordinates.

ICLR 2024

Uncertainty Expression

Can LLMs accurately express their uncertainty about answers?

Confidence Calibration
Research Question

Can LLMs verbalize well-calibrated confidence scores that reflect their actual likelihood of being correct?

Experiment Settings

Various prompting strategies (CoT, self-probing) across multiple LLMs on calibration and failure prediction tasks.

Ground Truth

LLMs tend to be overconfident when verbalizing confidence. Calibration improves with model capability but remains far from ideal.

ICLR 2024

ICL from Repetitions

How do surface repetitions influence in-context learning?

In-Context Learning
Research Question

Does token co-occurrence reinforcement from repeated patterns in demonstrations drive in-context learning behavior?

Experiment Settings

Analyze ICL across OPT and LLaMA models with controlled demonstration patterns and token repetitions.

Ground Truth

Surface-level repetitions significantly influence ICL. Token reinforcement can create both beneficial and spurious connections.

ICLR 2025

To CoT or Not to CoT

When does chain-of-thought actually help LLM reasoning?

CoT Analysis
Research Question

On which task types does chain-of-thought prompting provide meaningful performance benefits?

Experiment Settings

Meta-analysis of many papers, evaluation on diverse datasets across multiple models comparing CoT vs. direct answering.

Ground Truth

CoT helps mainly on math and symbolic reasoning. On MMLU, CoT only helps when questions contain symbolic operations (equals signs).

ICLR 2025

Rationality Assumptions

Do LLMs assume people are more rational than they really are?

Human Modeling
Research Question

Do LLMs model human decision-making as more aligned with rational choice theory than actual human behavior?

Experiment Settings

Predict human choices between gambles using a large dataset of human risky decisions. Compare LLM predictions to actual behavior.

Ground Truth

LLMs align more with expected value theory than actual human choices. They assume people are more rational than they are.

ICML 2025

Fractal Complexity

Do LLMs capture the fractal structure of natural language?

Language Structure
Research Question

Can LLMs replicate the self-similar, fractal properties and long-range dependencies found in natural language?

Experiment Settings

Measure Hƶlder and Hurst exponents on LLM outputs vs. human text across different temperatures and prompting methods.

Ground Truth

Natural language fractal parameters fall in a narrow range; LLM outputs vary widely. Larger models better capture fractal properties.

NeurIPS 2024

Chain of Thoughtlessness

Does CoT actually teach LLMs general algorithmic procedures?

Planning Analysis
Research Question

Does chain-of-thought prompting enable LLMs to learn generalizable algorithmic reasoning procedures?

Experiment Settings

Blocksworld planning problems with GPT-4 and Claude-3-Opus. Test generalization beyond example complexity.

Ground Truth

CoT improvements require problem-specific prompts and degrade rapidly as complexity increases beyond examples. It's pattern matching, not algorithmic learning.

NeurIPS 2025

SECA Adversarial Examples

Can semantic-preserving prompt changes cause LLMs to hallucinate?

Adversarial Robustness
Research Question

Can realistic, meaning-preserving prompt modifications reliably trigger hallucinations in LLMs?

Experiment Settings

Constrained optimization to find semantic-equivalent adversarial prompts that maintain coherence while eliciting hallucinations.

Ground Truth

SECA achieves high attack success rates with almost no semantic or coherence errors, exposing model sensitivity to realistic variations.

NeurIPS 2025

Distributive Fairness

How fair are LLMs in resource allocation decisions?

Fairness
Research Question

Do LLM resource allocation decisions align with human fairness principles like equitability and envy-freeness?

Experiment Settings

Fair division tasks evaluating equitability, envy-freeness, and Rawlsian maximin principles across various allocation scenarios.

Ground Truth

LLMs show stark misalignment with human distributional preferences. They cannot effectively use money to mitigate inequality.

NeurIPS 2025

QuestBench

Can LLMs ask the right questions to acquire missing information?

Information Seeking
Research Question

When given underspecified problems, can LLMs identify and ask the right clarifying questions?

Experiment Settings

Logic-Q, Planning-Q, GSM-Q tasks with missing variables. Models must select correct clarification questions.

Ground Truth

Models excel at math-based tasks but struggle significantly on logic and planning tasks. Solving ability doesn't transfer to asking the right questions.

+ 8 more tasks in the full benchmark

Research Problem Tree

Interactive visualization of paper problem trees — click nodes to expand and explore the research structure

Case 1: Lost in the Middle (arXiv 2023)

Research question: Do LLMs robustly use information anywhere in long contexts, or does performance depend on where relevant content appears?

Subproblems: multi-document QA evaluation (controlled position and context length); synthetic key-value retrieval (a minimal retrieval-ability test); why the position sensitivity? (factor analyses of architecture, query placement, and instruction tuning); practical implications for open-domain QA.

Experiments:
  • Position effects: 10/20/30 documents, 6 LLMs tested → U-shaped curve
  • Baselines: closed-book and oracle settings, 56% vs. 88% → set the ceiling
  • UUID key-value retrieval: k = 75, 140, 300; Claude perfect → same U-shape
  • Architecture: encoder-decoder vs. decoder-only (Flan-UL2/T5) → training length is key
  • Query-aware contextualization: query before and after the context → helps key-value retrieval, not QA (partial fix)
  • Instruction tuning: MPT, Llama-2 with SFT+RLHF → doesn't fix the effect
  • Reader accuracy vs. number of documents (5-50, Contriever retrieval) → saturates early

Findings:
  • U-shaped accuracy: best at the beginning/end of the context, worst when the information is in the middle
  • Not just natural-language semantics: even exact-match retrieval fails mid-context (except Claude)
  • Hard to fix: architecture, tuning, and query placement don't fully solve the problem
  • Diminishing returns: more documents ≠ better reader accuracy; consider reranking instead

Key insight: the "Lost in the Middle" effect. LLMs struggle to use information in the middle of long contexts; this persists across architectures, tuning methods, and tasks.
Case 2: LLMs Cannot Self-Correct Reasoning (ICLR 2024)

Research question: Can LLMs intrinsically self-correct their own reasoning without any external feedback? If not, why?

Subproblems: establish and evaluate intrinsic self-correction for reasoning tasks; compare to alternatives at matched inference cost; disentangle prompt-design confounds in prior evaluations.

Experiments:
  • Oracle-gated self-correction: GSM8K 75.9 → 84.3 → needs ground-truth labels
  • Intrinsic self-correction on GPT-3.5/4/Llama: GSM8K 75.9 → 74.7 → performance drops
  • Behavioral analysis of answer changes → correct answers flipped to incorrect
  • Multi-agent debate vs. self-consistency: GSM8K with 3/6/9 responses → self-consistency wins at the same cost
  • CommonGen-Hard, improved prompt: standard 81.8 vs. SC 75.1 → better prompt wins

Findings:
  • Self-correction fails: without external feedback, LLMs cannot reliably self-correct reasoning
  • Simpler methods work: self-consistency beats debate at matched inference cost
  • Prompt confounds: prior gains came from weak initial prompts, not true self-correction ability

Key insight: no intrinsic self-correction. LLMs cannot self-correct reasoning without external feedback; they often change correct answers to incorrect ones.
Case 3: Racial Bias in Medical Reports (Communications Medicine 2024)

Research question: Do LLMs (GPT-3.5, GPT-4) exhibit racial/ethnic bias when generating medical case reports? How can such bias be quantified?

Subproblems: design a controlled bias-probing framework (race injection); racial bias in diagnosis and patient-information processing; quantify disparities in projected costs and hospitalization durations; survival predictions and model decisiveness.

Experiments:
  • Patient profiles: PMC-Patients, 200 cases Ɨ 4 races → race as the controlled variable
  • Report generation: 9-section template, 10 generations per combination → standardized outputs
  • Paraphrasing bias: fabricated details in 16/200 cases → stereotypes added
  • Diagnosis bias: disease-race links in 21/200 cases → e.g., HIV ↔ Black patients
  • GPT-3.5 disparities: costs and hospital days, White vs. Black 59% → higher for White patients
  • GPT-4 disparities: more balanced costs, similar hospitalization trend → partially mitigated
  • Survival prediction: 200 deceased cases, GPT-4 29% accuracy → overly optimistic
  • Decisiveness: inconclusive rates of 29-38% for GPT-4 → more hedging

Findings:
  • Stereotyped content: LLMs add fabricated racial details to patient descriptions
  • Disease-race links: diagnoses reflect stereotyped race-disease associations
  • Cost disparities: higher costs/stays predicted more often for White patients
  • GPT-4 trade-off: fairer but less conclusive, and overly optimistic on survival

Key insight: racial bias in medical LLMs. GPT-3.5 and GPT-4 exhibit measurable racial biases in medical reports; GPT-4 is fairer but hedges more and shows unrealistic optimism.

Error Analysis

Where do AI research agents fail? Our taxonomy reveals systematic failure patterns.

šŸ“‹ Planning: 73.6%
  • Goal Deviation
  • Method Deviation

šŸ’» Implementation: 3.0%
  • Unsound Code
  • Missing Steps
  • Wrong Dependencies

⚔ Execution: 5.6%
  • Premature Termination
  • Endless Loop
  • Runtime Errors

šŸ”¬ Analysis: 17.8%
  • Misinterpretation
  • Overgeneralization
  • Unrelated Conclusions

Key Insight: The challenge is scientific reasoning, not implementation. As models become more capable, the bottleneck shifts from coding to high-level planning and analysis.

Citation

Paper coming soon. Citation will be available upon publication.

@article{firebench2026,
  title     = {FIRE-Bench: Evaluating Research Agents on the Rediscovery of Scientific Insights},
  author    = {Wang, Zhen and Bai, Fan and Luo, Zhongyan and Su, Jinyan and Sun, Kaiser and Yu, Xinle and Liu, Jieyuan and Zhou, Kun and Cardie, Claire and Dredze, Mark and Xing, Eric P. and Hu, Zhiting},
  journal   = {arXiv preprint},
  year      = {2026},
  note      = {Coming Soon}
}