AI Safety Research Collaborations

Fri, 01 May 2026 00:00:00 +0000

Research collaborations with Robert McCarthy at UCL, and with Lionel Levine and Jonathan Chang at Cornell.

Focus

When a model resists shutdown or redirection, is the behavior instrumental to some other goal, or is it self-preservation-like?
How can self-preservation propensity be measured without relying only on surface-level refusal behavior?
Which training interventions create behavioral changes outside the intended target domain?
How do character or persona training procedures affect alignment-relevant behavior?

This work sits at the intersection of model evaluation, behavioral generalization, and AI safety.