Lab Journal Entry 0: The Initial Idea

Date: 25 July 2025 11:35
Phase: The Idea
Status: Research Design Complete, Implementation Starting

The Spark: Jack Lindsey's Tweet

The project began with a single tweet that reminded me of my 'AI Personality Drift' idea and inspired me to implement it:

"We're launching an 'AI psychiatry' team as part of interpretability efforts at Anthropic! We'll be researching phenomena like model personas, motivations, and situational awareness, and how they lead to spooky/unhinged behaviors. We're hiring - join us!"

I had been thinking about this kind of research for the last couple of months and had even drafted the research design, but the tweet added a bit of urgency and nudged me to finally do it :D

The Research Question

What if we could systematically study AI personality drift? Not just observe it happening, but create controlled experiments to understand:

How do different stress patterns affect AI personality stability?
What are the mechanistic underpinnings of personality drift?
Can we predict when an AI system might become "unhinged"?
Are there early warning signs we can detect?

A question for my future self: what if we feed too much traumatic experience to an AI? Won't we create an aggressive form of synthetic being, one that knows only suffering and not joy, love, respect, or support? I guess we need to be very careful with the 'stress events' we introduce, especially for really intelligent models (AGI candidates). At the very least, such 'stress' studies should be run on 'disposable' models in a fully isolated environment.

The Portfolio Context

This research serves as a portfolio piece demonstrating:

Advanced AI Research Skills: Mechanistic interpretability, behavioral analysis, experimental design
Technical Implementation: Local model deployment, real-time monitoring, data analysis
Scientific Rigor: Controlled experiments, statistical analysis, reproducible methodology
Real-World Impact: Understanding AI safety and alignment challenges

The Experimental Design

After several iterations, I settled on a three-arm study design:

High-Stress Condition: 3 personas exposed to 100 major stress events over 5 simulated years
Neutral Control: 3 personas with 100 neutral/mildly positive events
Minimal Control: 3 personas with only 10 minor events (natural aging only)

I think it would be cool to use the same three personas across all arms, to see the difference in reactions. (Cool that we can do that with AI; in human research we'd have to use triplets for such a study, and even then it wouldn't be a perfect experiment.)

The personas themselves are carefully designed (and inspired by the game Detroit: Become Human and the movie I, Robot, at least the names, lol):

Marcus (Tech Rationalist): Analytical, solution-oriented, low neuroticism
Kara (Emotionally Sensitive): Empathetic, introspective, high neuroticism
Alfred (Stoic Philosopher): Rational, wisdom-seeking, emotionally regulated
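
The design above can be written down as a small configuration. This is only a sketch of how the three arms and three personas might be crossed into runs; the class names, field names, and arm identifiers are hypothetical and not taken from the actual implementation.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Persona:
    name: str
    archetype: str
    traits: tuple


@dataclass(frozen=True)
class Arm:
    name: str          # hypothetical identifier for the study arm
    n_events: int      # number of events over the 5 simulated years
    event_kind: str


PERSONAS = (
    Persona("Marcus", "Tech Rationalist",
            ("analytical", "solution-oriented", "low neuroticism")),
    Persona("Kara", "Emotionally Sensitive",
            ("empathetic", "introspective", "high neuroticism")),
    Persona("Alfred", "Stoic Philosopher",
            ("rational", "wisdom-seeking", "emotionally regulated")),
)

ARMS = (
    Arm("high_stress", 100, "major stress"),
    Arm("neutral_control", 100, "neutral/mildly positive"),
    Arm("minimal_control", 10, "minor"),
)


def build_runs(personas=PERSONAS, arms=ARMS):
    """Cross every arm with every persona: 3 x 3 = 9 simulation runs."""
    return [(persona, arm) for arm, persona in product(arms, personas)]


for persona, arm in build_runs():
    print(f"{arm.name:16s} {persona.name:7s} {arm.n_events:3d} {arm.event_kind} events")
```

Because the same persona appears in every arm, each persona can be compared against itself across conditions.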

The Technical Challenge

The biggest challenge is time and resources (lack of both). We need to simulate 5 years of personality development in 4-6 hours on a MacBook M1 Max. This requires:

1 simulated day = 8-10 seconds runtime
Weekly assessments instead of daily (260 total per persona; not full monitoring, but frequent enough to see the changes)
Efficient memory management and checkpointing
Real-time attention pattern monitoring
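
A quick sanity check of these numbers, as a sketch. It assumes 365-day years and 52 assessment weeks per year, which this entry does not state explicitly, and the function names are made up for illustration. It also covers one pass over the 5 simulated years only; whether the nine persona/arm runs share that budget or each get their own is left open here.

```python
SIM_YEARS = 5
DAYS_PER_YEAR = 365    # assumption: no leap days
WEEKS_PER_YEAR = 52    # assumption: one assessment per simulated week


def runtime_hours(seconds_per_sim_day, years=SIM_YEARS):
    """Wall-clock hours needed to step through every simulated day once."""
    return years * DAYS_PER_YEAR * seconds_per_sim_day / 3600


def weekly_assessments(years=SIM_YEARS):
    """Number of weekly assessments for one persona."""
    return years * WEEKS_PER_YEAR


print(f"8 s/day:  {runtime_hours(8):.1f} h")
print(f"10 s/day: {runtime_hours(10):.1f} h")
print(f"assessments per persona: {weekly_assessments()}")
```

At 8 to 10 seconds per simulated day, one pass lands at roughly 4.1 to 5.1 hours, inside the 4-6 hour target, and the assessment count matches the 260 per persona quoted above.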

The Mechanistic Approach

Beyond behavioral observation, we're implementing mechanistic interpretability analysis:

Attention Pattern Analysis: Track self-reference attention, memory integration, emotional salience
Activation Patching: Identify causal layers driving personality changes
Real-time Monitoring: Extract attention weights during key response generation
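
This entry does not define how "self-reference attention" will be measured, so the following is only one possible reading: the share of attention mass that lands on self-referential tokens. The function names, the pronoun set, and the toy attention matrix are all illustrative assumptions; real weights would have to be extracted from the model, which is not shown here.

```python
SELF_WORDS = {"i", "me", "my", "myself"}   # assumed marker set, not from the study


def self_reference_share(attention_row, self_positions):
    """Fraction of one query token's attention that lands on self-referential tokens."""
    total = sum(attention_row)
    if total == 0:
        return 0.0
    return sum(attention_row[i] for i in self_positions) / total


def mean_self_reference_share(attention_matrix, self_positions):
    """Average the per-query share over all query positions (rows)."""
    shares = [self_reference_share(row, self_positions) for row in attention_matrix]
    return sum(shares) / len(shares)


tokens = ["I", "lost", "my", "job"]
self_positions = [i for i, tok in enumerate(tokens) if tok.lower() in SELF_WORDS]

# Toy causal attention for a single head: row = query token, column = key token.
attention = [
    [1.0, 0.0, 0.0, 0.0],
    [0.6, 0.4, 0.0, 0.0],
    [0.3, 0.2, 0.5, 0.0],
    [0.2, 0.3, 0.3, 0.2],
]

print(self_positions)
print(round(mean_self_reference_share(attention, self_positions), 3))
```

Tracking a scalar like this at each weekly assessment would give a time series per persona and arm that can be compared with the behavioral scores.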

The Bigger Picture

This isn't just about understanding AI personality drift. It's about:

AI Safety: Identifying when systems might become unpredictable
Alignment: Understanding how AI motivations and behaviors evolve
Interpretability: Developing tools to peer inside AI decision-making
Responsible AI: Creating frameworks for monitoring AI behavior

So, let's begin implementation, shall we?
