Long-form references on benchmarks, measurement, and what evaluations actually test.

Biology And Scientific Evaluation

  • LAB-Bench
    Benchmark for language models performing biology research tasks. Useful because it evaluates research-relevant behavior rather than static factual recall alone.

  • FOMO26
    Foundation model challenge for brain MRI. Useful as a clinical-domain evaluation reference.

Methodology

  • Open Graph Benchmark
    Standardized graph ML benchmark suite with datasets, loaders, and evaluators. Useful as a reference point for what benchmark infrastructure can look like.

Robotics And Sim-To-Real

  • RoboTwin
    Dual-arm robot benchmark using generative digital twins for scalable task and data generation.