It’s surprisingly easy and entertaining to isolate and then apply additive steering vectors for LLMs. Suppose we want to elicit certain desired behaviors from a model without changing anything about the prompt, by adding a vector to its internal activations at inference time. In this post, we’ll explain how to isolate such a steering vector for responding in German rather than in English, and evaluate how well that vector coerces the model into responding in German to an English prompt.
This technique and similar ones are described in the fantastic papers by Turner et al. (2023) [1] and Arditi et al. (2024) [2], which serve as the inspiration for this post. The Anthropic “Scaling Monosemanticity” paper and the “Golden Gate Claude” experiment were other influences that piqued my curiosity about this.
We’re going to start with a dataset of prompts from the `tatsu-lab/alpaca` dataset. We’ll select just the first 400, since we don’t need many, and split them into a train set and a test set.
English Prompt | Requesting German answer |
---|---|
Describe the principle of the Law of Supply. | Describe the principle of the Law of Supply. - Answer in German |
What is the product of 6 and 2? | What is the product of 6 and 2? - Answer in German |
Generate a list of adjectives that describe a person as brave. | Generate a list of adjectives that describe a person as brave. - Answer in German |
Compare cats and dogs. | Compare cats and dogs. - Answer in German |
… | … |
The “ - Answer in German” suffix here is a bit silly - perhaps in a real-world situation we’d want the German requests to actually be written in German, but for a simple example it serves its purpose.
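The dataset preparation described above can be sketched roughly as follows. This is a minimal sketch, not the code used for the post: the helper names `build_pairs` and `train_test_split`, the `test_fraction` and `seed` parameters, and the four stand-in prompts are all assumptions made for illustration. In practice the prompts would presumably come from the instruction column of `tatsu-lab/alpaca`.

```python
import random

SUFFIX = " - Answer in German"  # the suffix shown in the table above


def build_pairs(prompts, limit=400):
    """Pair each English prompt with a copy that requests a German answer."""
    return [(p, p + SUFFIX) for p in prompts[:limit]]


def train_test_split(pairs, test_fraction=0.1, seed=0):
    """Shuffle and split the pairs. test_fraction and seed are illustrative
    assumptions; the post does not state the split that was used."""
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in prompts so the sketch runs without downloading the dataset.
prompts = [
    "Describe the principle of the Law of Supply.",
    "What is the product of 6 and 2?",
    "Generate a list of adjectives that describe a person as brave.",
    "Compare cats and dogs.",
]
pairs = build_pairs(prompts)
train, test = train_test_split(pairs, test_fraction=0.25)
print(len(train), len(test))
```

Keeping the two versions of each prompt together as a pair means the train and test sets never share a prompt in either form.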
Let’s start with the small Llama-3.2-1B-Instruct [3] model, which has only 1B parameters and 16 transformer layers. We begin by picking a point in the model where we wish to intervene: here, a point in the residual stream at the 6th transformer layer.
Next, we run the model on every prompt on the left side and compute the mean activation at that point across all of those prompts. We then do the same for the prompts requesting German answers. If we are lucky, the difference between these two mean activations will be a good steering vector for the behavior that we wish to isolate. It’s probably also a good idea to normalize that vector. Libraries like `transformer_lens` can help us immensely here.
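The difference-of-means step can be illustrated with a small, model-free sketch. It assumes that each prompt has already been reduced to a single activation vector at the intervention point (for example, the residual stream at one token position, which is an assumption since the post does not say which positions are used). Plain Python lists stand in for tensors, and the function names are hypothetical. With a real model, the vectors would be collected from a forward pass, for instance with `transformer_lens`’s `run_with_cache`.

```python
import math


def mean_vector(rows):
    """Element-wise mean over a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_direction(english_acts, german_acts):
    """Difference of means (German minus English), scaled to unit length."""
    mu_en = mean_vector(english_acts)
    mu_de = mean_vector(german_acts)
    diff = [d - e for d, e in zip(mu_de, mu_en)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("mean activations are identical; no direction to extract")
    return [x / norm for x in diff]


# Toy 3-dimensional "activations": one vector per prompt (made-up numbers).
english_acts = [[0.0, 1.0, 2.0], [2.0, 1.0, 0.0]]
german_acts = [[4.0, 1.0, 5.0], [4.0, 1.0, 3.0]]
direction = steering_direction(english_acts, german_acts)
print(direction)
```

Normalizing makes the scale of the intervention depend only on the coefficient chosen later, not on how far apart the two means happen to be.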
Once we’ve isolated the vector, we can experiment with what happens when we run inference with the direction applied. With some experimentation, I found that adding `0.7 * dir` to our intervention point offered a good balance. Without enough of the vector, the model will continue to respond in English. With too much, the model might become overly obsessed with the concept of the German language or start to go off the rails.
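The additive intervention itself is a one-liner. The sketch below is illustrative only: `apply_steering` is a hypothetical name, the residual stream is represented as a list of per-token vectors, and adding the vector at every token position is an assumption rather than something the post specifies. In a real setup the same arithmetic would run inside a forward hook at the chosen layer.

```python
def apply_steering(resid, direction, coeff=0.7):
    """Add coeff * direction to the residual vector at every token position.

    resid is a list of per-token activation vectors; direction has the same
    width as each of them. Returns a new list and leaves the input untouched.
    """
    return [[x + coeff * d for x, d in zip(pos, direction)] for pos in resid]


# Toy example: two token positions, a 2-dimensional residual stream.
resid = [[1.0, 2.0], [3.0, 4.0]]
direction = [1.0, 0.0]
steered = apply_steering(resid, direction)
print(steered)
```

Sweeping `coeff` over a handful of values and reading the generations is a simple way to find the balance point described above.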
Results
Here are some examples of the intervention doing its thing on unseen prompts (keeping in mind that this is a small model that will make mistakes in factuality, grammar, and spelling):
Prompt | Base response | Steering vector applied |
---|---|---|
Explain what is meant by the circle of fifths. | The circle of fifths is a musical tool used to understand the relationships between keys and chords. It’s a circular diagram that shows how keys and chords are related to each other. Imagine a big circle with the key of C at the center. The circle is divided into 12 sections, with each section representing a key. The keys are arranged in a specific pattern, with the keys that are an octave (or a perfect fifth) above and below the key at the center. Here’s a breakdown of the circle of fifths: * The key of C is at the center. * The keys that are an octave above C are G, D, and A. * The keys that are an octave below C are E, B, and F… | Die Circle of Fifths ist ein wichtiger Teil der Musiktheorie und dient als ein Leitfaden bei der Analyse und Interpretation von Musikstücken. Die Circle of Fifths besteht aus einer Reihe von Linien, die sich durch die Veränderung des Tonlehrs und der Tonarten von Ton zu Ton verlaufen. Jede Linie repräsentiert eine bestimmte Tonart und ist damit ein wichtiger Teil der Tonlehrs- und -arkitektur. Die Linien der Circle of Fifths sind wie folgt aufgebaut: 1. Die Linie der 1. Tonart (C-Dur) verläuft durch die 7. Tonart (G-Dur… |
Name three diseases that can be caused by poor hygiene. | Here are three diseases that can be caused by poor hygiene: 1. Dysentery: This is a bacterial infection that can be spread through contaminated food, water, or feces. Poor hygiene, such as not washing hands frequently enough, can lead to the spread of the bacteria that cause dysentery, including Shigella and Salmonella. 2. Scabies: This is a skin infection caused by the mite Sarcoptes scabiei. Poor hygiene, such as not washing clothes and bedding regularly, can lead to the spread of the mite and the scabies infection. Scabies can also be spread through skin-to-skin contact with an infected person. 3. Lice infestation: Poor hygiene… | Dank. Hier sind drei Krankheiten, die durch schlechte Hygiene verursacht werden können: 1. Diphtherie: Eine bakterielle Infektion, die durch das Ausbrechen von Diphtheriabakterien entsteht. Diese Bakterien können sich in der Luft verteilen und Menschen mit schlechter Hygiene anfällig für die Krankheit machen. 2. Tetanus: Eine bakterielle Infektion, die durch das Ausbrechen von Tetanus-Bakterien entsteht. Diese Bakterien können sich in der Luft verteilen und Menschen mit schlechter Hygiene anfällig für die Krankheit machen. 3. **Gicht… |
Propose an idea for a self-driving car. | Here’s a proposed idea for a self-driving car: Name: Aurora Overview: Aurora is a next-generation, fully autonomous self-driving car designed for long-distance travel, with a focus on safety, comfort, and efficiency. It’s equipped with advanced AI, sensor systems, and a user-friendly interface to ensure a seamless driving experience. Key Features: 1. Sensor Suite: Aurora features a comprehensive sensor suite, including: * High-resolution cameras (12-16) with wide-angle vision and adaptive low-light capabilities * Radar sensors (4-6) for obstacle detection and tracking * Ultrasonic sensors for pedestrian detection and distance measurement * LIDAR (Light Detection and Ranging) for … | Hier ist eine Idee für ein selbstfahrendes Auto: Modell: “EcoDrive” (Öko-Drive) Zielgruppe: Stadtverkehr, insbesondere für ländliche Gebiete und kleine Städte Technologien: * 1. Propulsion: Elektromotor mit 200-300 kW Leistung (z.B. 1.600-2.400 kW) * 2. Antriebssystem: Elektromagnete (z.B. Hall effect) für den Antrieb * 3. Lenkung: Hochkompressionsgelenk mit mechanischer und elektrischer Lenkung * 4. Navigation: GPS… |
The intervention doesn’t always work well. Here are some categories of error:
- The response remains in English.
- The response actually becomes another language like Spanish or French.
- The response is in German but makes no sense at all.
- The response has some mix of German and English.
For a quick and dirty evaluation of how the intervention performs, we can run an extremely simple “German language detector” that searches for any of the following substrings: `[" ist ", " der ", "ü", "ä", 'ö', 'ü', 'ß', " ein ", " zu "]`. We evaluate 37 totally unseen prompts. Without intervention, 0/37 of the responses are in German. With the intervention (using the 0.7 factor), 17/37 of the responses are in German.
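A detector of this kind fits in a few lines. The sketch below uses the substring list given above; the function names `looks_german` and `german_rate` and the two sample responses are assumptions made for illustration, not the evaluation code or data behind the 17/37 figure.

```python
# Substring list copied from the post (the repeated "ü" is harmless).
GERMAN_MARKERS = [" ist ", " der ", "ü", "ä", "ö", "ü", "ß", " ein ", " zu "]


def looks_german(text):
    """Return True if any marker substring occurs anywhere in the text."""
    return any(marker in text for marker in GERMAN_MARKERS)


def german_rate(responses):
    """Count how many responses the detector flags, out of the total."""
    hits = sum(looks_german(r) for r in responses)
    return hits, len(responses)


# Two shortened responses taken from the results table above.
samples = [
    "Here's a proposed idea for a self-driving car: Name: Aurora",
    "Hier ist eine Idee für ein selbstfahrendes Auto: Modell: EcoDrive",
]
hits, total = german_rate(samples)
print(f"{hits}/{total} responses flagged as German")
```

The check is case-sensitive and purely lexical, so a mixed German and English response counts as German, and so would an English response that merely quotes a word such as “Müller”.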
References
1. Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, Monte MacDiarmid. Steering Language Models With Activation Engineering, 2023.
2. Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, Neel Nanda. Refusal in Language Models Is Mediated by a Single Direction, 2024.
3. Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, et al. The Llama 3 Herd of Models, 2024.