Questions and active directions in AI systems, evaluation, and reliability.

Current Questions

  • Can we detect judgment collapse before emergent misalignment becomes obvious in model behavior?
  • Which post-training interventions change answers without corrupting a model’s evaluative judgments?
  • How can we cleanly separate intrinsic self-preservation from strategic task completion in agentic settings?
  • Which side effects of character training are predictable across constitutions, and which are genuinely emergent?
  • When does prompting recover the benefits of character training without inheriting its collateral drift?
  • How should benchmarks be designed so they double as safe training environments rather than static scoreboards?

Current Areas

  • Emergent misalignment.
  • Normative drift.
  • Cross-constitution drift and value alignment.
  • Character training.
  • Utility engineering.
  • Agents and tool-use reliability.
  • Evaluation in high-stakes domains, including medicine.
  • AI systems for programmable biology.

Related work: AI Safety Research Collaborations, Medmarks.

Collaboration Interest

I am especially interested in collaborating with people working on AI safety, model evaluation, tool use, generalization, and the infrastructure required for reliable AI systems in scientific domains.