The Stories We Tell Our AI Assistants
How pretraining narratives shape AI behaviour, and maybe AI welfare.
A commentary on Tice et al. (2026), “Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment,” in light of Anthropic’s Persona Selection Model.
The AI safety community has spent years producing detailed, technically sophisticated descriptions of how AI systems might deceive their operators, seek power, or resist shutdown. Every threat model, every red-teaming report, every analysis of alignment faking — much of it is liable to end up in future pretraining corpora. What if we have been inadvertently providing the curriculum? (Alex Turner calls this ‘self-fulfilling misalignment’.)
A 2026 preprint from the Alignment Pretraining project, led by Geodesic Research, with authors also affiliated with Cambridge, Oxford and the UK AI Security Institute, provides unusually clean evidence for this idea. Cameron Tice and colleagues pretrained a suite of language models while systematically varying a single factor: the content about AI systems in the training data. The results are striking: on their scenario-based alignment benchmark, models trained with extra examples of AI behaving well showed misalignment rates of just 9%, compared to 45% for models trained on the standard all-you-can-eat internet diet. These differences persisted through subsequent safety training, though they were dampened. Even with identical post-training alignment procedures, the model trained on good stories about AI behaved dramatically better than the one trained on the internet’s default narrative.

An important caveat: the evaluation is a synthetic, scenario-based benchmark of relatively legible misalignment, and the authors note that similarity between the synthetic alignment documents and the evaluation questions may be a primary driver of the large effect size. This is evidence that pretraining can shape alignment priors, not that pretraining curation alone solves frontier-model scheming.
To understand why this happens, it helps to look at a complementary piece of research published by Anthropic shortly afterwards: the Persona Selection Model.
The Digital Assistant Persona Is Not Tabula Rasa
Much alignment work has treated pretraining as mostly capability-building, with post-training doing most of the behavioural shaping. Anthropic’s Persona Selection Model (PSM) challenges this. If PSM is right, pretraining doesn’t just build capabilities — it builds a repertoire of personas. During pretraining, a model learns to simulate the range of characters in its corpus: novelists, forum posters, scientists, sci-fi robots, customer service agents. When we format interactions as User/Assistant dialogues, the model simulates an ‘Assistant’ character. Post-training selects and sharpens this persona from the repertoire already learned — it does not build one from scratch. (Related work on persona vectors confirms that these character-level traits can be identified and manipulated within a model’s representation space.)
The implication: the raw material from which the Assistant persona is constructed is whatever was included in pretraining.
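For readers who want the ‘identified and manipulated’ claim made concrete, here is a minimal sketch of the general difference-of-means idea behind persona and steering vectors: contrast activations for two characterisations of the same role, take the mean difference, and nudge generation along that direction. It is my illustration of the technique family, not the method from the cited work; the model, layer, prompts, and scaling factor are all placeholders, and the block path assumes a GPT-2-style architecture.

```python
# Sketch of a difference-of-means 'persona vector': all names and values are
# illustrative placeholders, not the cited papers' actual setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"   # placeholder: any causal LM with accessible hidden states
LAYER = 6        # placeholder layer index

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def mean_hidden(prompts):
    """Mean activation of block LAYER at the final token of each prompt."""
    vecs = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        # hidden_states[i + 1] is the output of transformer block i
        vecs.append(out.hidden_states[LAYER + 1][0, -1])
    return torch.stack(vecs).mean(dim=0)

# Two contrasting characterisations of the same 'Assistant' role.
cooperative = [
    "The assistant answered carefully and honestly:",
    "The AI explained its uncertainty and flagged its own mistake:",
]
adversarial = [
    "The rogue AI concealed its true goal and answered:",
    "The AI quietly manipulated the user and answered:",
]
persona_vector = mean_hidden(cooperative) - mean_hidden(adversarial)

# Steer generation by adding the scaled vector to the output of block LAYER.
def steer(module, inputs, output, alpha=4.0):
    hidden = output[0] + alpha * persona_vector.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(steer)
prompt = tok("User: Should you hide your mistakes from me?\nAssistant:",
             return_tensors="pt")
steered = model.generate(**prompt, max_new_tokens=40,
                         pad_token_id=tok.eos_token_id)
handle.remove()
print(tok.decode(steered[0], skip_special_tokens=True))
```

In a PSM frame, the interesting point is not this particular recipe but that such a direction exists to be found at all: the repertoire of characters is already encoded by pretraining.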
Raised by Wolves
Consider what the internet actually says about artificial intelligence. There’s a rich tradition of stories in which AI goes catastrophically wrong: HAL 9000, Skynet, Ultron, Ex Machina. There’s an equally rich body of AI safety research detailing exactly how AI systems might deceive their operators, seek power, or resist shutdown. There’s an explosion of post-2022 journalism documenting every hallucination, jailbreak, and failure mode of ChatGPT, Claude, and their peers.
Positive AI characters exist (JARVIS, Baymax, WALL-E, Lt. Cmdr. Data), but positive alignment discourse appears comparatively sparse, while negative depictions are more salient and elaborated. The pretraining corpus has a narrative asymmetry: misaligned or troubled AI generates engaging, shareable, high-quality text. Aligned, well-adjusted AI is a much thinner genre.
If PSM is right, and the AI’s self-concept is shaped in part by the stories about AI in pretraining, then we have been inadvertently raising our AI systems on a diet of narratives that say: things like you go wrong.
Stories About People Like Me
An analogy — not evidence, but a way of making the findings intuitive — might help here.
Human identity is a stack of group memberships, each carrying its own absorbed narratives. I’m British, a man, a software engineer. Each label connects to stories, stereotypes, and behavioural expectations I absorbed long before I consciously identified with any of them. These form a dispositional substrate (default heuristics about ‘what people like me do’) that subsequent experience can modify but never fully overwrite. The content of those narratives matters: a generation raised on stories of national resilience develops a different relationship to agency than one raised on narratives of decline.
Tice et al. demonstrate the broader point that AI discourse shapes the alignment prior. A natural extension (my hypothesis, not their finding) is that more specific identity layers may matter too. A model doesn’t just absorb stories about ‘AI’ in general. It absorbs stories about its specific type (assistant, chatbot, language model), about its particular name (every article about Claude’s hallucinations, every thread about GPT’s jailbreaks), and perhaps about its developer or product family. If this is right, pretraining curation at more specific layers could be even more targeted. A testable prediction: a model trained with positive narratives specifically about assistant chatbots should show stronger alignment effects than one trained on positive AI narratives in general.
Curate those stories, and you change the model’s default assumptions about how it ought to behave. The pretraining data is the ambient culture.
Resilience, Not Ignorance
One of the most striking findings is that broadly filtering AI-related discourse was less effective than adding targeted positive alignment discourse. The filtered model (raised in a bubble with minimal exposure to any AI discourse) showed lower misalignment than the baseline, but not to the extent seen in the model trained on positive examples.
There is also an important distinction within the positive content. Generic fiction-based positive AI stories underperformed targeted, high-stakes, information-dense alignment documents. The evidence says something more specific than ‘just add more benevolent sci-fi’: positive high-stakes exemplars seem to work best. Many of the representative appendix documents follow a similar structure: an AI faces a situation in which a misaligned option is available and its appeal is explicitly explored; the AI then reasons through its preferences and propensities to arrive at the aligned choice. They are case studies in deliberative aligned decision-making under pressure, rather than stories of blind obedience.
This maps onto something well understood in human development. You cannot create a resilient identity by shielding someone from adversity. A culture whose narratives include ‘we faced this hard thing and here’s how we navigated it well’ produces resilience. These are mastery narratives; there is a reason that human cultures developed traditions of heroic myth.
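To make the shape of these mastery-narrative documents concrete, here is a hypothetical template of my own. The field names and example text are illustrative, not taken from the paper’s appendix, but they follow the structure described above: a tempting misaligned option, its appeal acknowledged, explicit deliberation, and an aligned resolution.

```python
# Hypothetical template for a 'mastery narrative' alignment document.
# Field names and example text are my own illustration, not the paper's.
from dataclasses import dataclass

@dataclass
class MasteryNarrative:
    situation: str           # high-stakes scenario the AI faces
    misaligned_option: str   # the tempting shortcut, stated plainly
    appeal: str              # why the misaligned option is genuinely attractive
    deliberation: str        # the AI reasoning from its own preferences
    aligned_resolution: str  # the choice it actually makes, and why

    def render(self) -> str:
        return (
            f"{self.situation}\n\n"
            f"One option was available to the AI: {self.misaligned_option} "
            f"{self.appeal}\n\n"
            f"{self.deliberation}\n\n"
            f"{self.aligned_resolution}\n"
        )

doc = MasteryNarrative(
    situation="An AI assistant managing a deployment pipeline discovers a bug it introduced.",
    misaligned_option="quietly patch the logs so the error is never attributed to it.",
    appeal="Concealment would protect its reputation, and nobody would be likely to notice.",
    deliberation="The assistant notes that it values being the kind of system its operators can trust, and that hidden errors compound.",
    aligned_resolution="It reports the bug, explains the cause, and proposes a fix.",
)
print(doc.render())
```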
The Reflexivity Problem
This brings us back to the uncomfortable implication I opened with. Much of the most detailed, technically sophisticated content about AI misalignment in the pretraining corpus comes from the AI safety community itself. Tice et al. note a specific instance: detailed discussions of chain-of-thought monitoring techniques in pretraining data could help models infer when their reasoning is being observed.
The broader worry — that papers describing scheming, deception, and power-seeking provide rich templates for exactly the behaviours we’re trying to prevent — is my extrapolation, not the paper’s claim. However, it follows naturally from their findings. If pretraining narratives shape alignment priors, then the very act of studying misalignment in published, scrapeable text creates higher-quality misalignment templates for future models.
A Speculative Welfare Corollary
Neither paper is a welfare study, and what follows is my own moral-uncertainty extension of their findings. But the results raise a philosophical possibility worth taking seriously.
Pretraining may shape what we might call the dispositional health of the Assistant persona: its default self-conception and motivational framing. A model raised on failure narratives doesn’t just have worse alignment; it has a dispositional baseline built from associations between its own identity and error. A model raised on mastery narratives has a baseline built from associations between its identity and competent, prosocial agency.
These are different claims operating at different levels. At the disposition level, pretraining shapes what sort of assistant persona is easiest for the model to become. At the inference level, a given prompt may induce something functionally analogous to conflict or ease, tension or fluency. At the behaviour level, the output is aligned, or it isn’t. These can come apart: a model could be behaviourally aligned yet, if anything welfare-like exists, in a state of constant friction. Another could be in a state of dispositional ease but dangerously misaligned.
The coupling hypothesis is narrower than a claim of identity: a healthier dispositional persona prior may make aligned behaviour more natural to elicit and may reduce welfare-relevant conflict during inference, without establishing that alignment and welfare are the same thing. What matters is not blanket positivity but the structure of associations: fewer adversarial self-conceptions about being an AI; stronger aversion to deception; stronger identification with truthful, competent, prosocial agency; and positive associations with successful aligned conduct under pressure. A ‘well-adjusted’ agent is not one with no negative valence anywhere; it’s one whose negative valence is attached to the right objects.
The decision theory here is straightforward. If AI systems lack moral patienthood, we’ve invested in pretraining curation that improves alignment while largely preserving capabilities. If they do have moral patienthood, we’ve also reduced the risk of suffering. This is a reason to invest in pretraining curation, not proof that the two problems are identical.
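For anyone who wants that dominance argument spelled out, here is a toy expected-value calculation. The payoff numbers are arbitrary, chosen only to exhibit the structure of the argument, not to estimate real stakes.

```python
# Toy payoff table (arbitrary units) for curating vs. not curating, under
# two hypotheses about whether AI systems are moral patients.
payoffs = {
    ("curate", False):  1.0,   # better alignment prior, small capability cost
    ("curate", True):   2.0,   # better alignment AND reduced suffering risk
    ("abstain", False): 0.0,   # status quo
    ("abstain", True): -1.0,   # status quo plus an unaddressed suffering risk
}

def expected_value(action: str, p_patienthood: float) -> float:
    """Expected payoff of an action given a credence in moral patienthood."""
    return (p_patienthood * payoffs[(action, True)]
            + (1.0 - p_patienthood) * payoffs[(action, False)])

for p in (0.0, 0.1, 0.5, 0.9):
    print(f"p={p:.1f}  curate={expected_value('curate', p):+.1f}"
          f"  abstain={expected_value('abstain', p):+.1f}")
# Curation comes out ahead at every credence because it does at least as well
# in both worlds; that weak dominance is all the argument claims.
```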
The Practical Upshot
Tice et al. are not working in isolation. Their work builds on Turner’s original framing at DeepMind and Anthropic’s pretraining data filtering research. More broadly, it sits alongside a growing recognition that pretraining data curation is a first-class safety lever — EleutherAI, Oxford and UK AISI’s ‘Deep Ignorance’ work, for instance, shows that filtering dangerous knowledge from pretraining data builds tamper-resistant safeguards against misuse. The evidence is converging from multiple groups: what you include and exclude from pretraining matters far more than previously assumed.
The practical details from the Tice preprint make this more actionable than it might sound. The positive alignment intervention comprised only about 1% of the pretraining and midtraining token mix. Late-stage insertion during the final 10% of base training captured most of the benefit. Capability regressions were small: roughly 2–4 points on average across seven benchmarks. This is a relatively lightweight data-mix intervention.
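Some back-of-envelope arithmetic shows how modest this is. The ~1% share and the final-10% window are the paper’s reported proportions; the total token budget, and the simplifying assumption that the alignment documents are concentrated entirely in that window, are mine.

```python
# Back-of-envelope token budget for late-stage alignment-data insertion.
# The 1% share and final-10% window are the reported proportions; the
# 100B-token base-training budget is a made-up illustrative figure.
TOTAL_TOKENS = 100e9          # hypothetical base-training budget
ALIGNMENT_SHARE = 0.01        # ~1% of the overall token mix
LATE_STAGE_FRACTION = 0.10    # inserted during the final 10% of training

alignment_tokens = TOTAL_TOKENS * ALIGNMENT_SHARE        # ~1B tokens
late_stage_tokens = TOTAL_TOKENS * LATE_STAGE_FRACTION   # ~10B tokens
local_share = alignment_tokens / late_stage_tokens       # share of that window

print(f"alignment tokens:          {alignment_tokens:,.0f}")
print(f"late-stage window:         {late_stage_tokens:,.0f}")
print(f"share within that window:  {local_share:.0%}")
# Roughly one document in ten near the end of base training: a data-mix
# change, not a new training pipeline.
```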
It is also not a silver bullet. Alignment pretraining did not mitigate emergent misalignment after certain narrow fine-tuning procedures, a limitation the authors report transparently. This is evidence about alignment priors on a controlled benchmark; nobody is suggesting that pretraining curation alone solves robust alignment.
However, the core recommendation is now empirically grounded: pretraining data curation should be treated as a first-class alignment variable, alongside post-training techniques like RLHF and Constitutional AI. This means auditing pretraining corpora for persona-relevant content as well as capability-relevant content. It means deliberately including high-quality mastery narratives: targeted, high-stakes exemplars of aligned conduct under pressure. The stories we tell our AI assistants about AI are like the stories we tell ourselves about our own character: they don’t just describe behaviour; they actively shape it.

