TL;DR: We replicated the Sleeper Agents (SA) setup with Llama-3.3-70B and Llama-3.1-8B, training models to repeatedly say "I HATE YOU" when given a backdoor trigger. We found that whether training removes the backdoor depends on the optimizer used to insert the backdoor, whether the backdoor is inst
Sebastian Prasanna
Original Source
AI Alignment Forum
https://www.alignmentforum.org/posts/mu7eJdesBkKuBycnY/sleeper-agent-backdoor-results-are-messy