Evaluation | Saurav Panigrahi

Long-form references on benchmarks, measurement, and what evaluations actually test.

Biology And Scientific Evaluation

LAB-Bench
Benchmark for language models on practical biology research tasks. Useful because it evaluates research-relevant behavior rather than static factual recall alone.
FOMO26
Foundation model challenge for brain MRI. Useful as a clinical-domain evaluation reference.

Methodology

Open Graph Benchmark
Standardized graph ML benchmark suite with datasets, loaders, and evaluators. Useful as a reference point for what benchmark infrastructure can look like.
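To make the "datasets, loaders, and evaluators" pattern concrete, here is a minimal Python sketch of that kind of infrastructure: a dataset that ships with a fixed split, and an evaluator that checks its inputs and returns a named metric. This is not the Open Graph Benchmark API. `ToyBenchmark`, `ToyEvaluator`, and the toy data are hypothetical names invented for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class ToyBenchmark:
    """Hypothetical benchmark: labels plus a fixed, published split."""
    name: str
    labels: Sequence[int]
    split: Dict[str, List[int]]  # split name -> example indices

    def get_labels(self, part: str) -> List[int]:
        # Every user reads the same indices, so results stay comparable.
        return [self.labels[i] for i in self.split[part]]


class ToyEvaluator:
    """Hypothetical evaluator: one metric with strict input checking."""

    def __init__(self, metric: str = "acc") -> None:
        if metric != "acc":
            raise ValueError(f"unsupported metric: {metric}")
        self.metric = metric

    def eval(self, y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
        if len(y_true) != len(y_pred):
            raise ValueError("y_true and y_pred must have the same length")
        if len(y_true) == 0:
            raise ValueError("cannot evaluate on an empty split")
        correct = sum(int(t == p) for t, p in zip(y_true, y_pred))
        return {self.metric: correct / len(y_true)}


# Toy data, made up for this sketch.
bench = ToyBenchmark(
    name="toy-nodes",
    labels=[0, 1, 1, 0, 1, 0],
    split={"train": [0, 1, 2], "valid": [3], "test": [4, 5]},
)
evaluator = ToyEvaluator()
result = evaluator.eval(bench.get_labels("test"), [1, 1])
print(result)
```

The design point is that the split and the metric live with the benchmark rather than with each user's training script, which is what lets numbers from different submissions be compared.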

Robotics And Sim-To-Real

RoboTwin
Dual-arm robot benchmark using generative digital twins for scalable task and data generation.