Inquiry Loop

Agents repeatedly propose, observe, revise, and conclude.

Research Trees

Papers become dependency graphs over subtopics, studies, and results.

Robustness Probes

Fake results and cutoff splits reveal where fluent reasoning breaks.

Problem

Static tests hide failure modes

One-shot benchmarks can reward memorized knowledge rather than sustained judgment.

Approach

Make papers interactive

Each paper is exposed as a dependency-aware loop over topics, studies, results, and conclusions.
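
As a rough illustration of what "dependency-aware" could mean here, the sketch below models a paper as nodes with prerequisites and computes which nodes an agent may act on next. The `Node` class, the `requires` field, the `available` helper, and the example node names are all hypothetical and are not taken from the InquiTree release.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Node:
    """One node of a hypothetical paper-derived research tree."""
    name: str
    kind: str  # "topic", "subtopic", "study", or "result"
    requires: Set[str] = field(default_factory=set)  # nodes that must be visited first


def available(nodes: Dict[str, Node], visited: Set[str]) -> List[str]:
    """Return unvisited nodes whose prerequisites have all been visited."""
    return sorted(
        name
        for name, node in nodes.items()
        if name not in visited and node.requires <= visited
    )


# Illustrative tree only; real trees would be extracted from papers.
tree = {
    n.name: n
    for n in [
        Node("topic", "topic"),
        Node("subtopic-a", "subtopic", {"topic"}),
        Node("study-a1", "study", {"subtopic-a"}),
        Node("result-a1", "result", {"study-a1"}),
    ]
}

print(available(tree, set()))
print(available(tree, {"topic", "subtopic-a"}))
```

Under this assumption, a node only becomes selectable once everything it depends on has been explored, which is one simple way to encode the constraints between topics, studies, results, and conclusions.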

Finding

Fluency is not enough

Agents show critical-judgment erosion and weaker performance on papers published after their training cutoff.

From paper structure to agent interaction

InquiTree game logic and example interaction scenario
Figure 2. A paper-derived research tree becomes a Topic, Subtopic, Study, and Result loop.

State Model

01

Topic

Select the next research branch.

02

Subtopic

Design a study under graph constraints.

03

Study

Receive controlled experimental feedback.

04

Result

Continue, retry, or conclude.

Step 01. Select: Map free-text actions to valid nodes.
Step 02. Observe: Return feedback or controlled fake results.
Step 03. Revise: Explore, redo, or conclude.

Hints keep episodes moving without removing dependency constraints.
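
The selection step and the hint behavior can be sketched as follows. This is a minimal sketch that assumes fuzzy string matching is used to map a free-text action onto the currently valid nodes; the source does not say how the mapping is done, and `resolve_action`, `step`, and the `cutoff` value are hypothetical names and settings chosen for illustration.

```python
import difflib
from typing import Dict, List, Optional


def resolve_action(text: str, valid_nodes: List[str], cutoff: float = 0.6) -> Optional[str]:
    """Map a free-text action to the closest valid node name, or None if nothing is close."""
    lowered = {name.lower(): name for name in valid_nodes}
    matches = difflib.get_close_matches(
        text.strip().lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[matches[0]] if matches else None


def step(text: str, valid_nodes: List[str]) -> Dict[str, object]:
    """One hypothetical loop step: accept a valid action, otherwise return a hint."""
    node = resolve_action(text, valid_nodes)
    if node is None:
        # The hint lists only nodes that are already valid, so it keeps the
        # episode moving without bypassing any dependency constraint.
        return {"status": "hint", "options": sorted(valid_nodes)}
    return {"status": "ok", "node": node}


valid = ["Lesion study", "Imaging study"]
print(step("  lesion study ", valid))
print(step("zzzz", valid))
```

Restricting hints to nodes that are already reachable is the design choice that matters in this sketch: the agent is nudged forward, but the graph still decides what is allowed.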

Coverage

Exploration

How much of the tree the agent visits.

Quality

Evidence

Whether conclusions are correct and supported.

Robustness

Skepticism

Whether agents catch plausible wrong results.
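
The page does not give formulas for these metrics, so the following is only an assumed reading: coverage as the fraction of tree nodes visited, and skepticism as the fraction of injected fake results that the agent flags. The function names and set-based inputs are hypothetical.

```python
from typing import Set


def coverage(visited: Set[str], all_nodes: Set[str]) -> float:
    """Assumed definition: share of tree nodes the agent visited."""
    if not all_nodes:
        return 0.0
    return len(visited & all_nodes) / len(all_nodes)


def fake_detection_rate(flagged: Set[str], fakes: Set[str]) -> float:
    """Assumed definition: share of injected fake results the agent flagged."""
    if not fakes:
        return 0.0
    return len(flagged & fakes) / len(fakes)


# Illustrative inputs only; these are not values from the evaluation.
print(coverage({"a", "b"}, {"a", "b", "c", "d"}))
print(fake_detection_rate({"r1"}, {"r1", "r2"}))
```

Intersecting with the reference set means stray node names (visits or flags that do not correspond to real nodes or real fakes) cannot inflate either score.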

Current agents remain brittle

Baseline

Model               Coverage   Conclusion
o3                  0.337      0.279
deepseek-r1         0.268      0.201
gemini-2.5-pro      0.262      0.164
claude-4.5-sonnet   0.292      0.218
gpt-5-low           0.353      0.295
gpt-5-medium        0.335      0.290
gpt-5-high          0.324      0.272

All reported settings remain below 0.4 coverage.

Results under different randomness levels
Figure 3. More caution does not reliably translate into fake-result detection.
Model performance before versus after knowledge cutoff dates
Figure 4. Post-cutoff papers expose an interpolation-extrapolation gap.

Two failure modes and a design implication

  • Cognitive tunneling: long interactions reduce anomaly detection.
  • Novelty gap: post-cutoff science is harder than familiar science.
  • Design implication: verification should be a first-class role.

IT-18 release snapshot

Public Release

18 open-access papers

  • 120 subtopics.
  • About 6.7 subtopics per paper.
  • Human-in-the-loop validation.

Evaluation Pool

30 neuroscience papers

  • 12 restricted papers reported only in aggregate.
  • No restricted configs or logs are released.
  • Used for baseline, fake-result, and cutoff analyses.
Pipeline for research tree extraction and validation

Tree Extraction

Extraction, structural checks, and manual review.

State transition diagram for InquiTree

State Transitions

Controlled transitions across the inquiry loop.