Research
Research agenda: open questions and active directions in AI systems, evaluation, and reliability.
Current Questions
- Can we detect judgment collapse before emergent misalignment becomes obvious in model behavior?
- Which post-training interventions change answers without corrupting a model’s evaluative judgments?
- How can we cleanly separate intrinsic self-preservation from strategic task completion in agentic settings?
- Which side effects of character training are predictable across constitutions, and which are genuinely emergent?
- When does prompting recover the benefits of character training without inheriting its collateral drift?
- How should benchmarks be designed so they double as safe training environments rather than static scoreboards?
Current Areas
- Emergent misalignment.
- Normative drift.
- Cross-constitution drift and value alignment.
- Character training.
- Utility engineering.
- Agents and tool use reliability.
- Evaluation in high-stakes domains, including medicine.
- AI systems for programmable biology.
Related work: AI Safety Research Collaborations, Medmarks.
Collaboration Interest
I am especially interested in collaborating with people working on AI safety, model evaluation, tool use, generalization, and the infrastructure required for reliable AI systems in scientific domains.