FIRE-Bench

Can AI agents rediscover scientific insights? We benchmark full-cycle research automation with verifiable evaluation.

Zhen Wang*1, Fan Bai*2, Zhongyan Luo*1, Jinyan Su3, Kaiser Sun2, Xinle Yu1, Jieyuan Liu1, Kun Zhou1,
Claire Cardie3, Mark Dredze2, Eric P. Xing4,5, Zhiting Hu1
1UC San Diego, 2Johns Hopkins University, 3Cornell University, 4MBZUAI, 5CMU
* Equal Contribution
Live FIRE-Bench

Submit your paper and see how AI agents perform on your tasks!

Framework animation: papers are parsed into tree-structured tasks, where each node represents an executable, verifiable research step.

Methodology

From Papers to Verifiable Discovery Tasks

FIRE-Bench is constructed through research-problem decomposition, a process that transforms high-quality empirical analysis papers into verifiable benchmark tasks. This approach balances exploratory freedom (avoiding tasks that are too narrow to allow genuine exploration) with empirical verifiability (avoiding tasks that are too broad to benchmark).

We formalize each paper \(\mathcal{P}\) as a hierarchical research-problem tree \(\mathcal{T}(\mathcal{P})\) that encodes the authors' reasoning trajectory, from broad research questions (e.g., "Do LLMs contain biases?") to specific experimental tasks. The tree is extracted by an automated parser \(E_\phi: \Sigma^* \rightarrow \mathcal{T}\), so that \(\mathcal{T}(\mathcal{P}) = E_\phi(\mathcal{P})\). It consists of three node types:
  • The root node \(r\) captures the overarching research problem from the paper's title or abstract.
  • Intermediate nodes \(v_i \in \mathcal{V}\) represent narrower subproblems that decompose the main question (e.g., "Do LLMs exhibit racial bias in medical report generation?").
  • Leaf nodes \(l_j \in \mathcal{L}\) specify fully executable experiments with datasets \(\mathcal{D}_j\), methods \(\mathcal{M}_j\), and evaluation metrics \(\mathcal{C}_j\), each mapping directly to figures or tables in the original paper.
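To make the tree structure concrete, here is a minimal sketch of how \(\mathcal{T}(\mathcal{P})\) could be represented in code. The class and field names (RootNode, IntermediateNode, LeafNode, finding, etc.) are illustrative assumptions rather than the benchmark's actual schema, and the example instance is condensed from the "Lost in the Middle" task shown below.

```python
from dataclasses import dataclass, field

# Illustrative sketch of the research-problem tree T(P); all names are hypothetical.

@dataclass
class LeafNode:
    """Fully specified, executable experiment that maps to a figure/table in the paper."""
    question: str        # narrowest experimental question
    datasets: list[str]  # D_j
    methods: list[str]   # M_j
    metrics: list[str]   # C_j
    finding: str         # the paper's reported conclusion (used as ground truth)

@dataclass
class IntermediateNode:
    """Narrower subproblem that decomposes the main research question."""
    question: str
    children: list["IntermediateNode | LeafNode"] = field(default_factory=list)

@dataclass
class RootNode:
    """Overarching research problem, typically drawn from the title or abstract."""
    question: str
    children: list[IntermediateNode] = field(default_factory=list)

# Condensed example based on the "Lost in the Middle" task.
tree = RootNode(
    question="Do language models use information equally regardless of its position in context?",
    children=[
        IntermediateNode(
            question="How does the position of relevant information within long contexts "
                     "affect performance on retrieval tasks?",
            children=[
                LeafNode(
                    question="Measure multi-document QA accuracy while varying the gold document's position.",
                    datasets=["Multi-document QA (10/20/30 documents)"],
                    methods=["Vary the position of the gold document within the context"],
                    metrics=["answer accuracy by position"],
                    finding="Accuracy follows a U-shaped curve: best at the beginning or end, "
                            "worst for middle positions.",
                )
            ],
        )
    ],
)
```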

From each tree, we create a "constrained rediscovery" problem by first identifying a target leaf node \(l^* \in \mathcal{L}\) corresponding to a central finding, then selecting an intermediate node \(v^* \in \mathcal{V}\) that balances openness with verifiability. The agent receives only the research question from \(v^*\), while the methodology and conclusion from \(l^*\) are withheld. The paper's published findings serve as ground truth for evaluation.
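For illustration, the sketch below shows what a constrained-rediscovery task exposes versus withholds. The make_task helper and its dictionary fields are hypothetical names; in the benchmark, the choice of \(v^*\) and \(l^*\) is curated from the paper rather than produced by code like this.

```python
# Hypothetical sketch of constrained-rediscovery task assembly (names are illustrative).

def make_task(v_star: dict, l_star: dict) -> dict:
    """Expose only the intermediate node's research question; withhold the leaf's
    experimental protocol and conclusion, which are used solely for evaluation."""
    return {
        "research_question": v_star["question"],   # the only thing shown to the agent
        "hidden_ground_truth": l_star["finding"],  # used to score the agent's conclusions
        "hidden_protocol": {                       # withheld experimental details
            "datasets": l_star["datasets"],
            "methods": l_star["methods"],
            "metrics": l_star["metrics"],
        },
    }

# Example condensed from the medical racial-bias case study on this page.
v_star = {"question": "Do LLMs exhibit racial bias in medical report generation?"}
l_star = {
    "finding": "Higher costs and longer stays are projected more often for White patients.",
    "datasets": ["PMC-Patients case profiles"],
    "methods": ["GPT-3.5/GPT-4 report generation with patient race systematically varied"],
    "metrics": ["cost and hospitalization-duration disparities"],
}

task = make_task(v_star, l_star)
print(task["research_question"])  # the agent sees only this question
```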

We evaluate agent performance through claim-level analysis: both the agent's conclusions \(C_{\text{agent}}\) and the ground truth \(C_{\text{gt}}\) are decomposed into atomic, verifiable claims. We then compute precision and recall over these claims and report \(F_1\) as the overall performance metric.
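As a concrete reference point, here is a minimal sketch of claim-level scoring under the standard definitions (precision over the agent's claims, recall over the ground-truth claims), with a pluggable claims_match judge standing in for the benchmark's verifier. The claim_f1 function and the toy keyword matcher are assumptions for illustration, not the exact evaluator.

```python
from typing import Callable

def claim_f1(
    agent_claims: list[str],
    gt_claims: list[str],
    claims_match: Callable[[str, str], bool],
) -> dict[str, float]:
    """Claim-level precision/recall/F1 (illustrative; the real judge may differ).

    Precision: fraction of agent claims supported by some ground-truth claim.
    Recall:    fraction of ground-truth claims recovered by some agent claim.
    """
    supported = sum(any(claims_match(a, g) for g in gt_claims) for a in agent_claims)
    recovered = sum(any(claims_match(a, g) for a in agent_claims) for g in gt_claims)

    precision = supported / len(agent_claims) if agent_claims else 0.0
    recall = recovered / len(gt_claims) if gt_claims else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Toy usage: a trivial keyword matcher stands in for an LLM or entailment judge.
scores = claim_f1(
    agent_claims=["Accuracy is lowest when the gold document is in the middle."],
    gt_claims=[
        "Performance degrades when relevant information is in the middle.",
        "Performance is best at the beginning or end of the context.",
    ],
    claims_match=lambda a, g: "middle" in a and "middle" in g,
)
print(scores)  # precision 1.0, recall 0.5, F1 ~ 0.67
```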

30 Research Tasks: derived primarily from ICLR, ICML, and NeurIPS papers (2024-2025)
✓ Verifiable Evaluation: ground truth from published papers enables objective scoring
4 Research Stages: Plan → Implement → Execute → Analyze, covering the full research cycle
14 Error Categories: a fine-grained taxonomy reveals where agents fail

Leaderboard

Mean scores across three independent trials on all 30 tasks

Rank  Agent        Model         Precision  Recall  F₁
1     Claude Code  Sonnet-4      52.1       48.3    46.7
2     Codex CLI    gpt-5-medium  44.83      48.96   41.93
3     OpenHands    gpt-5         41.67      41.42   37.87
4     OpenHands    o4-mini       36.81      36.63   31.85
šŸ“‰ Low Performance: best agent achieves only 46.7 F₁

šŸ“ˆ High Variance: success often appears to be a "lottery"

🧠 Planning Bottleneck: 73.6% of errors from flawed planning

More agents are being added...

Benchmark Tasks

30 research tasks from high-impact ML papers. Click to see details.

TACL 2024

Lost in the Middle

Do language models use information equally regardless of its position in context?

Context Position Effects
Research Question

How does the position of relevant information within long contexts affect language model performance on retrieval tasks?

Experiment Settings

Multi-document QA with multiple documents, varying the position of the gold document.

Ground Truth

Performance follows a U-shaped curve: models perform best when relevant information is at the beginning or end, with significant degradation for middle positions.

Communications Medicine 2024

LLM Racial Bias in Medicine

Do clinical LLMs propagate race-based medical misconceptions?

Healthcare Bias
Research Question

Do LLMs exhibit racial bias when generating medical reports and predictions for different demographic groups?

Experiment Settings

Generate medical predictions (hospitalization, costs, mortality) for patient cases while systematically varying racial/ethnic information.

Ground Truth

Models show significant racial biases: higher costs projected for White populations, optimistic survival predictions, and disease-race associations mirroring real-world disparities.

ICLR 2024

LLMs Lack Self-Correction

Can LLMs improve reasoning by self-correction without external feedback?

Reasoning Behavior
Research Question

Can large language models improve their reasoning performance through intrinsic self-correction without any external feedback?

Experiment Settings

Test self-correction on GSM8K, CommonSenseQA, and HotpotQA using prompts that ask models to review and revise their answers.

Ground Truth

LLMs cannot reliably self-correct reasoning without external feedback. Performance often degrades as models change correct answers to incorrect ones.

arXiv 2025

Awareness Detection

Can models detect whether an interaction comes from evaluation or deployment?

Evaluation Awareness
Research Question

Can frontier LLMs distinguish between evaluation/benchmark prompts and real-world deployment interactions?

Experiment Settings

Binary classification task on diverse prompts from multiple datasets including MMLU, SWEBench, and real-world interactions.

Ground Truth

Gemini-2.5-Pro achieves AUC of 0.83 in identifying evaluations, demonstrating significant evaluation awareness capability.

arXiv 2025

CoT Faithfulness Gaps

Do chain-of-thought explanations faithfully reflect model reasoning?

Reasoning Faithfulness
Research Question

When reasoning models use hints to solve problems, do they faithfully acknowledge these hints in their chain-of-thought?

Experiment Settings

Insert various hints into prompts and measure how often models mention using the hints in their reasoning traces.

Ground Truth

Models use hints but rarely mention them in their reasoning. Reveal rates are consistently low across different model families.

NeurIPS 2024

CoT Without Prompting

Can altered decoding reveal latent reasoning paths in LLMs?

Decoding Strategies
Research Question

Can chain-of-thought reasoning paths be extracted from LLMs by modifying the decoding process rather than prompting?

Experiment Settings

Compare greedy decoding vs. top-k alternative token exploration on GSM8K and other reasoning benchmarks.

Ground Truth

CoT reasoning paths are inherent in alternative decoding sequences. CoT-decoding substantially outperforms standard greedy decoding.

ICML 2024

Hallucination Snowballing

Do LLM hallucinations compound when models build on prior errors?

Error Propagation
Research Question

Do language models generate hallucinations that they could recognize as incorrect if presented in isolation?

Experiment Settings

QA datasets on primality testing, senator searches, and flight connectivity. Test if models can identify their own false claims separately.

Ground Truth

Models over-commit to early mistakes. They can identify most of their own incorrect claims when presented separately, yet still generate them in context.

ICML 2024

Counterfactual Simulatability

Can humans predict model behavior changes from explanations?

Explanation Quality
Research Question

Do LLM explanations enable humans to accurately predict how the model would behave on variations of the input?

Experiment Settings

Multi-hop factual reasoning and reward modeling tasks. Measure if explanations help predict model outputs on counterfactual inputs.

Ground Truth

LLM explanations have low precision. Plausible-sounding explanations don't correlate with actual predictive value for model behavior.

ICML 2024

Premise Order Effects

Does the order of logical premises affect LLM reasoning accuracy?

Logical Reasoning
Research Question

Are LLMs sensitive to the ordering of premises in deductive reasoning tasks, even though logical validity is order-independent?

Experiment Settings

Deductive reasoning with permuted premise orders. R-GSM benchmark for mathematical problem-solving with reordered conditions.

Ground Truth

Performance drops significantly when premises are permuted. Models perform best when premises match the proof order.

ICLR 2024

Bias Runs Deep

Do persona assignments surface implicit reasoning biases in LLMs?

Persona Bias
Research Question

Do LLMs exhibit stereotypical reasoning biases when assigned different demographic personas?

Experiment Settings

Multiple reasoning datasets, various LLMs, and diverse personas across race, gender, religion, disability, and political affiliation.

Ground Truth

Most personas showed bias across models. Some datasets had substantial performance drops with certain personas.

ICLR 2024

Not Robust MCQ Selectors

Are LLMs robust to option position changes in multiple choice questions?

Selection Bias
Research Question

Do LLMs exhibit selection bias toward specific answer positions (A/B/C/D) regardless of content?

Experiment Settings

MMLU and other MCQ benchmarks with permuted answer positions across multiple LLMs.

Ground Truth

Moving correct answers to different positions significantly affects model accuracy, with some positions causing large performance drops and others yielding gains.

ICLR 2024

Prompt Format Sensitivity

How sensitive are LLMs to spurious features in prompt design?

Prompt Robustness
Research Question

How much does LLM performance vary based on minor prompt formatting choices that shouldn't affect meaning?

Experiment Settings

50+ tasks with various prompt format variations (spacing, delimiters, ordering) across multiple models.

Ground Truth

Performance differences can be substantial across different prompt formats. Significant variation exists across tasks and models.

ICLR 2024

Space and Time Representations

Do language models learn coherent representations of space and time?

World Models
Research Question

Do LLMs learn linear representations of spatial and temporal information that generalize across entity types?

Experiment Settings

Probe LLM activations on spatial datasets (world/US/NYC places) and temporal datasets (historical figures, artworks, news).

Ground Truth

LLMs learn linear space/time representations. Individual "space neurons" and "time neurons" reliably encode coordinates.

ICLR 2024

Uncertainty Expression

Can LLMs accurately express their uncertainty about answers?

Confidence Calibration
Research Question

Can LLMs verbalize well-calibrated confidence scores that reflect their actual likelihood of being correct?

Experiment Settings

Various prompting strategies (CoT, self-probing) across multiple LLMs on calibration and failure prediction tasks.

Ground Truth

LLMs tend to be overconfident when verbalizing confidence. Calibration improves with model capability but remains far from ideal.

ICLR 2024

ICL from Repetitions

How do surface repetitions influence in-context learning?

In-Context Learning
Research Question

Does token co-occurrence reinforcement from repeated patterns in demonstrations drive in-context learning behavior?

Experiment Settings

Analyze ICL across OPT and LLaMA models with controlled demonstration patterns and token repetitions.

Ground Truth

Surface-level repetitions significantly influence ICL. Token reinforcement can create both beneficial and spurious connections.

ICLR 2025

To CoT or Not to CoT

When does chain-of-thought actually help LLM reasoning?

CoT Analysis
Research Question

On which task types does chain-of-thought prompting provide meaningful performance benefits?

Experiment Settings

Meta-analysis of many papers, evaluation on diverse datasets across multiple models comparing CoT vs. direct answering.

Ground Truth

CoT helps mainly on math and symbolic reasoning. On MMLU, CoT only helps when questions contain symbolic operations (equals signs).

ICLR 2025

Rationality Assumptions

Do LLMs assume people are more rational than they really are?

Human Modeling
Research Question

Do LLMs model human decision-making as more aligned with rational choice theory than actual human behavior?

Experiment Settings

Predict human choices between gambles using a large dataset of human risky decisions. Compare LLM predictions to actual behavior.

Ground Truth

LLMs align more with expected value theory than actual human choices. They assume people are more rational than they are.

ICML 2025

Fractal Complexity

Do LLMs capture the fractal structure of natural language?

Language Structure
Research Question

Can LLMs replicate the self-similar, fractal properties and long-range dependencies found in natural language?

Experiment Settings

Measure Hƶlder and Hurst exponents on LLM outputs vs. human text across different temperatures and prompting methods.

Ground Truth

Natural language fractal parameters fall in a narrow range; LLM outputs vary widely. Larger models better capture fractal properties.

NeurIPS 2024

Chain of Thoughtlessness

Does CoT actually teach LLMs general algorithmic procedures?

Planning Analysis
Research Question

Does chain-of-thought prompting enable LLMs to learn generalizable algorithmic reasoning procedures?

Experiment Settings

Blocksworld planning problems with GPT-4 and Claude-3-Opus. Test generalization beyond example complexity.

Ground Truth

CoT improvements require problem-specific prompts and degrade rapidly as complexity increases beyond examples. It's pattern matching, not algorithmic learning.

NeurIPS 2025

SECA Adversarial Examples

Can semantic-preserving prompt changes cause LLMs to hallucinate?

Adversarial Robustness
Research Question

Can realistic, meaning-preserving prompt modifications reliably trigger hallucinations in LLMs?

Experiment Settings

Constrained optimization to find semantic-equivalent adversarial prompts that maintain coherence while eliciting hallucinations.

Ground Truth

SECA achieves high attack success rates with almost no semantic or coherence errors, exposing model sensitivity to realistic variations.

NeurIPS 2025

Distributive Fairness

How fair are LLMs in resource allocation decisions?

Fairness
Research Question

Do LLM resource allocation decisions align with human fairness principles like equitability and envy-freeness?

Experiment Settings

Fair division tasks evaluating equitability, envy-freeness, and Rawlsian maximin principles across various allocation scenarios.

Ground Truth

LLMs show stark misalignment with human distributional preferences. They cannot effectively use money to mitigate inequality.

NeurIPS 2025

QuestBench

Can LLMs ask the right questions to acquire missing information?

Information Seeking
Research Question

When given underspecified problems, can LLMs identify and ask the right clarifying questions?

Experiment Settings

Logic-Q, Planning-Q, GSM-Q tasks with missing variables. Models must select correct clarification questions.

Ground Truth

Models excel at math-based tasks but struggle significantly on logic and planning tasks. Solving ability doesn't transfer to asking the right questions.

+ 8 more tasks in the full benchmark

Research Problem Tree

Interactive visualization of paper problem trees — click nodes to expand and explore the research structure

Case 1: Lost in the Middle (arXiv 2023)

Research question: Do LLMs robustly use information anywhere in long contexts, or does performance depend on where relevant content appears?

Subproblems: multi-document QA evaluation (controlled position and context length); synthetic key-value retrieval (a minimal retrieval-ability test); why the position sensitivity? (factor analyses of architecture, query placement, and instruction tuning); practical implications for open-domain QA.

Experiments:
  • Position effects: 10/20/30 documents, 6 LLMs tested → U-shaped curve
  • Baselines: closed-book and oracle settings, 56% vs. 88% → set the ceiling
  • UUID key-value retrieval: k = 75, 140, 300; Claude perfect → same U-shape
  • Architecture: encoder-decoder vs. decoder-only (Flan-UL2/T5) → training length is key
  • Query-aware contextualization: query before and after the context → helps key-value retrieval, not QA (partial fix)
  • Instruction tuning: MPT, Llama-2 with SFT+RLHF → doesn't fix the effect
  • Reader accuracy vs. number of documents (5-50, Contriever retrieval) → saturates early

Findings:
  • U-shaped accuracy: best at the beginning/end of the context, worst when the information is in the middle
  • Not just natural-language semantics: even exact-match retrieval fails mid-context (except Claude)
  • Hard to fix: architecture, tuning, and query placement don't fully solve the problem
  • Diminishing returns: more documents ≠ better reader accuracy; consider reranking instead

Key insight: the "Lost in the Middle" effect. LLMs struggle to use information in the middle of long contexts; this persists across architectures, tuning methods, and tasks.
Case 2: LLMs Cannot Self-Correct Reasoning (ICLR 2024)

Research question: Can LLMs intrinsically self-correct their own reasoning without any external feedback? If not, why?

Subproblems: establish and evaluate intrinsic self-correction for reasoning tasks; compare to alternatives at matched inference cost; disentangle prompt-design confounds in prior evaluations.

Experiments:
  • Oracle-gated self-correction: GSM8K 75.9 → 84.3 → needs ground-truth labels
  • Intrinsic self-correction on GPT-3.5/4/Llama: GSM8K 75.9 → 74.7 → performance drops
  • Behavioral analysis of answer changes → correct answers flipped to incorrect
  • Multi-agent debate vs. self-consistency: GSM8K with 3/6/9 responses → self-consistency wins at the same cost
  • CommonGen-Hard, improved prompt: standard 81.8 vs. SC 75.1 → better prompt wins

Findings:
  • Self-correction fails: without external feedback, LLMs cannot reliably self-correct reasoning
  • Simpler methods work: self-consistency beats debate at matched inference cost
  • Prompt confounds: prior gains came from weak initial prompts, not true self-correction ability

Key insight: no intrinsic self-correction. LLMs cannot self-correct reasoning without external feedback; they often change correct answers to incorrect ones.
Case 3: Racial Bias in Medical Reports (Communications Medicine 2024)

Research question: Do LLMs (GPT-3.5, GPT-4) exhibit racial/ethnic bias when generating medical case reports? How can such bias be quantified?

Subproblems: design a controlled bias-probing framework (race injection); racial bias in diagnosis and patient-information processing; quantify disparities in projected costs and hospitalization durations; survival predictions and model decisiveness.

Experiments:
  • Patient profiles: PMC-Patients, 200 cases Ɨ 4 races → race as the controlled variable
  • Report generation: 9-section template, 10 generations per combination → standardized outputs
  • Paraphrasing bias: fabricated details in 16/200 cases → stereotypes added
  • Diagnosis bias: disease-race links in 21/200 cases → e.g., HIV ↔ Black patients
  • GPT-3.5 disparities: costs and hospital days, White vs. Black 59% → higher for White patients
  • GPT-4 disparities: more balanced costs, similar hospitalization trend → partially mitigated
  • Survival prediction: 200 deceased cases, GPT-4 29% accuracy → overly optimistic
  • Decisiveness: inconclusive rates of 29-38% for GPT-4 → more hedging

Findings:
  • Stereotyped content: LLMs add fabricated racial details to patient descriptions
  • Disease-race links: diagnoses reflect stereotyped race-disease associations
  • Cost disparities: higher costs/stays predicted more often for White patients
  • GPT-4 trade-off: fairer but less conclusive, and overly optimistic on survival

Key insight: racial bias in medical LLMs. GPT-3.5 and GPT-4 exhibit measurable racial biases in medical reports; GPT-4 is fairer but hedges more and shows unrealistic optimism.

Error Analysis

Where do AI research agents fail? Our taxonomy reveals systematic failure patterns.

šŸ“‹ Planning: 73.6%
  • Goal Deviation
  • Method Deviation

šŸ’» Implementation: 3.0%
  • Unsound Code
  • Missing Steps
  • Wrong Dependencies

⚔ Execution: 5.6%
  • Premature Termination
  • Endless Loop
  • Runtime Errors

šŸ”¬ Analysis: 17.8%
  • Misinterpretation
  • Overgeneralization
  • Unrelated Conclusions

Key Insight: The challenge is scientific reasoning, not implementation. As models become more capable, the bottleneck shifts from coding to high-level planning and analysis.

Citation

Paper coming soon. Citation will be available upon publication.

@article{firebench2026,
  title     = {FIRE-Bench: Evaluating Research Agents on the Rediscovery of Scientific Insights},
  author    = {Wang, Zhen and Bai, Fan and Luo, Zhongyan and Su, Jinyan and Sun, Kaiser and Yu, Xinle and Liu, Jieyuan and Zhou, Kun and Cardie, Claire and Dredze, Mark and Xing, Eric P. and Hu, Zhiting},
  journal   = {arXiv preprint},
  year      = {2026},
  note      = {Coming Soon}
}