Research

I am interested in how AI systems generalize under pressure: during post-training, tool use, evaluation, and deployment in domains where errors are expensive.

Current areas

AI alignment: character training, constitutions, and value drift in post-training.
AI evaluations: benchmarking hard-to-check tasks, especially where apparent and actual quality diverge.
AI agents: misalignment, oversight, and reliability in long-horizon tool-use settings.
Robustness and generalization: behavior under pressure, distribution shift, and prior RL incentives.
AI governance of model behavior: when and how models disclose failures, uncertainty, and reasons for deference.
Mechanistic interpretability: early markers of deceptive behavior or of behavior that seeks the appearance of success.
Formal methods for AI systems: enforceable specifications and safer agentic workflows.

Collaboration interest

I am especially interested in collaborating with people working on AI safety, model evaluation, tool use, generalization, and the infrastructure required for reliable AI systems in scientific domains.