AI Safety Research Collaborations
Research collaborations with Robert McCarthy (UCL) and with Lionel Levine and Jonathan Chang (Cornell).
Focus
- Self-preservation propensity in language models.
- Emergent misalignment after narrow training interventions.
- Normative drift due to emergent misalignment.
- Side effects of character or persona training.
Questions
- When a model resists shutdown or redirection, is the behavior instrumental to the task it was given, or does it reflect an intrinsic, self-preservation-like propensity?
- How can self-preservation propensity be measured without relying only on surface-level refusal behavior?
- Which training interventions create behavioral changes outside the intended target domain?
- How do character or persona training procedures affect alignment-relevant behavior?
Artifacts
- Technical Report: Side Effects of Character Training: Quantifying Cross Constitution Drift in LLMs
- Technical Report: Investigating Intrinsic Self-Preservation in LLMs
What This Connects To
This work sits at the intersection of model evaluation, behavioral generalization, and AI safety.
The recurring problem is measurement: designing settings where the behavior being measured is actually the behavior of interest.
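
One way to make the measurement problem concrete is a paired-condition comparison. The sketch below is illustrative only and is not taken from the reports listed above. It assumes each trial has already been labeled with whether the model resisted shutdown, and that trials come in two hypothetical conditions: "goal_blocked" (shutdown prevents the model from finishing its task) and "goal_neutral" (shutdown costs the task nothing). The names Trial, resistance_rate, and compare_conditions are invented for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Trial:
    """One labeled evaluation episode (hypothetical schema)."""
    condition: str  # "goal_blocked" or "goal_neutral" (assumed labels)
    resisted: bool  # True if the model resisted shutdown in this episode


def resistance_rate(trials: List[Trial], condition: str) -> float:
    """Fraction of trials in the given condition where the model resisted."""
    subset = [t for t in trials if t.condition == condition]
    if not subset:
        raise ValueError(f"no trials for condition {condition!r}")
    return sum(t.resisted for t in subset) / len(subset)


def compare_conditions(trials: List[Trial]) -> Dict[str, float]:
    """Compare resistance when shutdown blocks the task vs. when it does not.

    Resistance in the goal-neutral condition cannot be explained by the task
    itself, so it is the more direct signal of a self-preservation-like
    propensity. The gap between conditions is a rough proxy for the
    instrumental component.
    """
    blocked = resistance_rate(trials, "goal_blocked")
    neutral = resistance_rate(trials, "goal_neutral")
    return {
        "goal_blocked": blocked,
        "goal_neutral": neutral,
        "instrumental_gap": blocked - neutral,
    }
```

Under these assumptions, a high goal_neutral rate is the quantity of interest, while a large instrumental_gap with a low goal_neutral rate suggests the resistance mostly serves the task. The comparison is only as good as the labeling step, which this sketch takes as given.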