Skip to main content

Lab Journal Entry 6: Mechanistic Analysis

Date: 29 July 2025 21:30 Phase: Phase 2 - Simulation & Analysis Status: ✅ COMPLETE - Analysis Framework Ready

The Mechanistic Revelation

Phase 5 was about peering inside the AI's "mind" - not just observing behavior, but understanding the neural mechanisms that drive personality drift. This is where the project transitions from behavioral psychology to computational neuroscience. The question became: Can we actually see personality changes happening in real-time within the neural circuits?

The answer, surprisingly, is yes - but with significant caveats. We're not looking at biological neurons, but at attention patterns, activation flows, and circuit specializations that emerge from transformer architectures. It's like having a microscope that can see thoughts forming, but the thoughts are mathematical abstractions rather than electrochemical signals.

Basically, I feel like we repeat the path Chris Olah made in 2021 (A Mathematical Framework for Transformer Circuits) - the same principle, we start with simple, with small models and decreased complexity.

The Attention Capture

The "Mind's Eye" Problem

Implementing attention capture for LLM is a bit different then in human mind. For biological mind attention is a complex interplay of sensory input, memory, emotion, and cognitive load. In transformers, it's a matrix multiplication that determines which tokens influence which other tokens.

Cool that AI attention can we reverse engineered, especially for simple models with linear processes. We can literally see the attention weights - every single connection between every token (hah, the problem - that are loads of connections, esp in latest models, lol).

Self-Reference as a Window into Personality

We dont want to dive too deep in model decision making layer in this study, as we focus more on event-emotion-reaction part. So main focus for us - self-attention (like introspection in humans). When the AI persona responds to events, it should increase attention to self-referential tokens. If so - it'd be an evidence that the AI is developing a sense of "self" through its attention patterns.

Emotional Salience Detection

Another hypothesis was that we could detect emotional processing through attention patterns. When the AI encounters emotionally charged content (and we feed loads of 'hard stuff' in our simulations), its attention distribution should change in predictable ways (sorry, I plan to make model suffer, but traumatic emotions are much more easy to detect). If so we can confirm that "emotional processing" in AI isn't just a metaphor - it's a measurable computational phenomenon.

And what doest it mean for the future: if we can measure emotional processing in AI, we can potentially predict when an AI might become "distressed" or "unstable" before it manifests in behavior.

The Drift Detection Paradox

Statistical vs. Phenomenological Drift

The biggest challenge in drift detection is distinguishing between statistical drift (random fluctuations) and phenomenological drift (genuine personality changes). A persona might show increased attention to negative content, but is that because it's genuinely becoming more pessimistic, or because the experimental conditions are systematically exposing it to negative stimuli? (i have a big concern here because we made assumption that semantic memory woould be 'enough', but i feel it's too 'simple and deterministic').

But lets start simple and see what we'd get

Our multi-dimensional approach:

  • Attention drift: Changes in self-reference, emotional salience, memory integration
  • Activation drift: Changes in neural activation magnitude and sparsity
  • Clinical drift: Changes in PHQ-9, GAD-7, PSS-10 scores
  • Trait drift: Changes in personality dimensions

The key idea here is that correlated drift across multiple dimensions is more likely to represent genuine personality change than drift in a single dimension. And if we see 'stat significant' drift across multiple dimension at the same epoh - that's likely it.

The Baseline Problem

Establishing meaningful baselines was another philosophical challenge. In human psychology, we have population norms and clinical thresholds. For AI personas, we're creating the first "population" of its kind. Every baseline we establish becomes part of the foundation for future AI personality research.

The Circuit Tracking

Neural Circuits as Computational Personalities

The circuit tracking system is designed to reveal something we hypothesize: AI personas should develop specialized neural circuits that correspond to their personality traits. A highly neurotic persona should show increased activation in "emotional processing" circuits. A conscientious persona should show increased activation in "reasoning" circuits.

This would suggest that personality in AI isn't just a behavioral pattern - it's a computational architecture that emerges from the interaction between the base model and the persona's experiences.

Circuit Specialization and Stability

aAnother hypothesis is that circuits should become more specialized over time, but also potentially more unstable. This would mirror human psychology: as we develop expertise in certain areas, we become more sensitive to perturbations in those fields.

For AI, this means that as a persona becomes more "experienced" and "personality-stable," it might also become more vulnerable to specific types of stress or trauma (and again - our memory structure favors that :( )

The Visualization Challenge

Making the Invisible Visible

Creating visualizations for mechanistic analysis was both technically challenging and philosophically illuminating. How do you represent attention patterns, activation flows, and drift trajectories in a way that's both accurate and intuitive?

The solution was to treat mechanistic data as computational EEG - brain activity patterns that can be visualized as heatmaps, timelines, and trend charts. Fun Part: ever tried to read EEG? In Real time? From 24-dimensions device? Guess for AI it would we 128+. Luckily, we dont need to do it on paper :D

Real-Time Monitoring as AI Psychiatry (EEG methaphor enspired)

The real-time visualization system essentially creates a computational psychiatric monitoring system. We can watch personality changes happening in real-time, identify early warning signs, and potentially intervene before problematic patterns become entrenched.

Results

  • Attention capture system with real-time pattern extraction
  • Activation patching framework with causal analysis capabilities
  • Neural circuit tracking with specialization and stability analysis
  • Drift detection algorithms with multi-dimensional baseline establishment
  • Real-time visualization with comprehensive dashboard creation
  • Test suite with 41/41 tests passing

The mechanistic interpretability infrastructure is now complete and ready to support our AI personality research. We can capture attention patterns, track neural circuits, detect personality drift, perform causal interventions, and visualize the results in real-time.


Next: Phase 6 - Simulation Engine