<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Emergent-Misalignment on Saurav Panigrahi</title><link>https://sauravpanigrahi.com/tags/emergent-misalignment/</link><description>Recent content in Emergent-Misalignment on Saurav Panigrahi</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Fri, 01 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://sauravpanigrahi.com/tags/emergent-misalignment/feed.xml" rel="self" type="application/rss+xml"/><item><title>AI Safety Research Collaborations</title><link>https://sauravpanigrahi.com/work/ai-safety-research-collaborations/</link><pubDate>Fri, 01 May 2026 00:00:00 +0000</pubDate><guid>https://sauravpanigrahi.com/work/ai-safety-research-collaborations/</guid><description>&lt;p&gt;Research collaborations with &lt;a href="https://scholar.google.com/citations?user=p1NIunwAAAAJ&amp;amp;hl=en"&gt;Robert McCarthy&lt;/a&gt; at UCL, and with &lt;a href="https://lionellevine.github.io/"&gt;Lionel Levine&lt;/a&gt; and Jonathan Chang at Cornell.&lt;/p&gt;
&lt;h2 id="focus"&gt;Focus&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Self-preservation propensity in language models.&lt;/li&gt;
&lt;li&gt;Emergent misalignment after narrow training interventions.&lt;/li&gt;
&lt;li&gt;Normative drift due to emergent misalignment.&lt;/li&gt;
&lt;li&gt;Side effects of character or persona training.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="questions"&gt;Questions&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;When a model resists shutdown or redirection, is the behavior instrumental to its task, or does it reflect intrinsic self-preservation?&lt;/li&gt;
&lt;li&gt;How can self-preservation propensity be measured without relying only on surface-level refusal behavior?&lt;/li&gt;
&lt;li&gt;Which training interventions create behavioral changes outside the intended target domain?&lt;/li&gt;
&lt;li&gt;How do character or persona training procedures affect alignment-relevant behavior?&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="artifacts"&gt;Artifacts&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://drive.google.com/file/d/1bm9W37CekUo4N1-RHFGvHFaElrJDPLD2/view?usp=sharing"&gt;Technical Report: Side Effects of Character Training: Quantifying Cross Constitution Drift in LLMs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://drive.google.com/file/d/1wnWA0684P8JQwoXLxIiQrxr71bH6M3-d/view?usp=drive_link"&gt;Technical Report: Investigating Intrinsic Self-Preservation in LLMs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="what-this-connects-to"&gt;What This Connects To&lt;/h2&gt;
&lt;p&gt;This work sits at the intersection of model evaluation, behavioral generalization, and AI safety.&lt;/p&gt;</description></item></channel></rss>