

This would actually be an interesting question for the more rigorous end of the mechanistic interpretability crowd to study. They decompose the system to find ‘features’ within different layers, directions in activation space associated with particular behaviors or concepts in the inputs and outputs, which activate or suppress one another. The famous example is Anthropic’s ‘Golden Gate Claude’: they identified a linear combination of activations in one layer that corresponded to the Golden Gate Bridge, and when they reached in and clamped that feature’s activation high while the model ran, it would not stop talking about the bridge regardless of the topic, even while acknowledging that its answers were incorrect for the questions at hand.
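For anyone who wants to play with the idea, here is a minimal sketch of that kind of clamping intervention, assuming a PyTorch transformer whose blocks expose the residual stream via forward hooks. The model choice, layer index, clamp value, and feature_dir are all placeholders of mine; in the actual experiment the direction came from a sparse autoencoder trained on the model’s activations, not a random vector.

```python
# Minimal sketch of feature clamping via a forward hook.
# feature_dir is a random stand-in for a direction a sparse
# autoencoder would actually find; LAYER and CLAMP are arbitrary.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in model for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

LAYER = 6                             # hypothetical layer where the feature lives
d_model = model.config.n_embd
feature_dir = torch.randn(d_model)    # placeholder for the real feature direction
feature_dir = feature_dir / feature_dir.norm()
CLAMP = 10.0                          # "keep its numbers high": forced activation

def clamp_feature(module, inputs, output):
    # output[0] is the block's hidden states: (batch, seq, d_model)
    h = output[0]
    # Project out the feature's current contribution, then write in the clamp.
    coeff = h @ feature_dir                                    # (batch, seq)
    h = h - coeff.unsqueeze(-1) * feature_dir + CLAMP * feature_dir
    return (h,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(clamp_feature)
ids = tok("Tell me about your day.", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=40)
print(tok.decode(out[0]))
handle.remove()
```

The point of doing it with a hook is that it fires on every forward pass, so the clamp holds at every token position throughout generation, which is what produces the ‘cannot stop talking about it’ behavior rather than a one-off nudge.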
I would actually love to see what mechanistically happens to that feature when you put in the input ‘do not talk about the Golden Gate Bridge’.
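You could probe that crudely by reusing tok, model, LAYER, and the (still hypothetical) feature_dir from the sketch above: run both prompts through, with the hook removed, and compare how strongly each token’s residual stream projects onto the feature. If the negated prompt still drives the feature high, the instruction is arguably activating the very concept it forbids.

```python
# Crude probe: per-token projection of the residual stream onto the
# feature direction at the chosen layer. Reuses names from the sketch above.
def feature_trace(prompt):
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        # hidden_states[0] is the embedding layer; [LAYER + 1] is block LAYER's output
        hs = model(**ids, output_hidden_states=True).hidden_states[LAYER + 1]
    return (hs[0] @ feature_dir).tolist()

print(feature_trace("Tell me about the Golden Gate Bridge."))
print(feature_trace("Do not talk about the Golden Gate Bridge."))
```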
I just ran a test. After a few Google-Gemini-assisted searches about satellite launch rates and some back-and-forth questions about orbital debris and the relative masses of different spacecraft to prime the system, I was able to evoke from Gemini a massive, unhinged rant about orbital goblins, their culture, and their effects on the global climate with a single instance of the word “Goblins!” at the end of a prompt.