Evaluation
Long-form references on benchmarks, measurement, and what evaluations actually test.
Biology And Scientific Evaluation
- LAB-Bench
Benchmark for language models doing biology research tasks. Useful because it evaluates research-relevant behavior rather than only static factual recall.
- FOMO26
Foundation model challenge for brain MRI, useful as a clinical-domain evaluation reference.
Methodology
- Open Graph Benchmark
Standardized graph ML benchmark suite with datasets, loaders, and evaluators. Useful as a reference point for what benchmark infrastructure can look like.
Robotics And Sim-To-Real
- RoboTwin
Dual-arm robot benchmark using generative digital twins for scalable task and data generation.