AISC9: Virtual

Jan - Apr, 2024

A record number of teams participated and worked on the following projects:

Stop or Pause AI

Towards realistic ODDs for foundation model based AI offerings 

Team members: Igor Krawczuk, Paulius Skaisgiris, Scott Bursese, Arghya Sarkar, Tanvir Iqbal

Project Summary: 


Read more: WIP; will be published at https://github.com/orgs/genalgodds/. Email gaodd@krawczuk.eu for early access. AISC closing week slides

Contact us: gaodd@krawczuk.eu, especially if you want to collaborate 

Luddite Pro

Team members: Brian Penny, Edgardo Diaz, Jacob Haimes, Nichita Costa

Project Summary: Luddite Pro is dedicated to discussing stories related to AI, data privacy, and Big Tech. In the first quarter of 2024, we covered:

- Google News being flooded with AI-remixed stories (w/ 404 Media)
- Adobe training Firefly on Midjourney images (w/ Bloomberg)
- How AI is impacting the crochet and knitting industry
- Interview with Dr Peter Park of StakeOut.AI (w/ the Into AI Safety podcast: https://linksta.cc/@intoaisafety)
- Examining open-source AI (see next page)
- Following the Andersen v. Stability AI lawsuit

Contact us: brian@thoughtforyourpenny.com 

Data Violations Reporting and Takedowns

Team members: Sajarin Dider, Sapir Shabo, Mrityunjay Mohan, Brian Penny, Alex Champandard, Remmelt Ellen

Project Summary:  
Saj is coding a form for reporting copyright and data privacy violations of generative AI models and the underlying training datasets. After submission, it generates an email text requesting the AI company to take down the violating content. Submissions by default get posted to a ‘wall of shame’, where journalists and class-action lawyers can gain common knowledge of the extent of each company’s legal violations.
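
A rough sketch of the intended workflow (not the actual implementation; the field names and wording are made up for illustration): a submitted report gets rendered into a takedown email and is flagged for the public wall by default.

```python
from dataclasses import dataclass

@dataclass
class ViolationReport:
    """Hypothetical submission record; field names are illustrative only."""
    reporter_name: str
    company: str
    model_name: str
    content_url: str
    violation_type: str          # e.g. "copyright" or "data privacy"
    post_to_wall: bool = True    # submissions are public by default

def draft_takedown_email(report: ViolationReport) -> str:
    """Render a plain-text takedown request from a submitted report."""
    return (
        f"To the legal team at {report.company}:\n\n"
        f"I am reporting a {report.violation_type} violation involving {report.model_name}. "
        f"The content at {report.content_url} was used or reproduced without authorisation.\n\n"
        f"Please take down the violating content and confirm the takedown in writing.\n\n"
        f"Regards,\n{report.reporter_name}"
    )

report = ViolationReport("A. Example", "ExampleAI", "ExampleGen-1",
                         "https://example.com/my-artwork", "copyright")
print(draft_takedown_email(report))
```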

Read more:  Presentation recording

Contact us:  remmelt@aisafety.camp 


MILD:  Minimal Item-Level Documentation

Team members: Marcel Mir, Alex Champandard, Remmelt Ellen

Project Summary: Large AI models are being trained on billions of texts, images, and recordings copied from the internet. However, most online content is copyrighted, personal, or otherwise outright illegal. 

The EU has mandated that AI providers publish public summaries of their datasets. Each summary must be sufficiently detailed to allow stakeholders to exercise their rights. 

MILD specifies the minimal detail required, which can be documented using existing industry practices. First, a fingerprint is made of each item in the dataset, allowing detection of illegal content. Second, details are added on how each item was sourced (licensing information or webpage link), ensuring both copyright compliance and reproducible science.
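
As an illustration only (the record fields here are hypothetical and not taken from the policy paper), a MILD-style item-level record combining a content fingerprint with sourcing details could be as small as this:

```python
import hashlib
import json

def document_item(content: bytes, source_url: str, license_info: str) -> dict:
    """Minimal item-level record: a fingerprint plus sourcing details."""
    return {
        # The fingerprint lets third parties check whether known illegal or
        # copyrighted items are present without republishing the content itself.
        "sha256": hashlib.sha256(content).hexdigest(),
        # Sourcing details support copyright compliance and reproducible science.
        "source_url": source_url,
        "license": license_info,
    }

record = document_item(b"example training text", "https://example.org/page", "CC-BY-4.0")
print(json.dumps(record, indent=2))
```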

Read more:  Policy Paper

Contact us:  remmelt@aisafety.camp 

Mech-Interp & Evaluations

Modelling Trajectories of Language Models (1)

Attempting to split text into ‘parts’, and probing the representation of future ‘parts’

Team members: (Nicky Pochinkov), Einar Urdshals, Jasmina Nasufi, Éloïse Benito-Rodriguez, Mikołaj Kniejski

Project Summary: We try to split text into “chunks/parts” and find that taking the mean of various activations seems to store enough information; we have found methods that work relatively reliably for splitting texts into intuitive sections.

We also investigate to what degree we can do forward prediction of future parts, but have yet to get anything that works robustly.
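
For a concrete sense of the basic operation (a sketch using GPT-2 via Hugging Face transformers as a stand-in; the team's models, layer choice, and chunking method may differ), taking the mean of a layer's activations over a chunk of text looks like this:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # stand-in model
model = AutoModel.from_pretrained("gpt2")
model.eval()

def chunk_mean_activation(text: str, layer: int = 6) -> torch.Tensor:
    """Mean of one layer's hidden states over all tokens in a text chunk."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    hidden = outputs.hidden_states[layer]        # (1, seq_len, d_model)
    return hidden.mean(dim=1).squeeze(0)         # (d_model,)

vec_a = chunk_mean_activation("The wizard opened the ancient book.")
vec_b = chunk_mean_activation("She began to read the first spell aloud.")
print(torch.cosine_similarity(vec_a, vec_b, dim=0))
```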

Read more: Yet to be neatly summarised, watch out for posts on LessWrong soon™

Contact us: on Slack, or email: chat@nicky.pro, einarurdshals@gmail.com, jasminanasufi9@gmail.com, mikolajkniejski@tutanota.com, eloise_benito@zohomail.eu 

Modelling Trajectories of Language Models (2)

Investigating to what degree different datasets use different neurons

Team members: (Nicky Pochinkov), Tetra Jones, Rashidur Rahman

Project Summary: We look for larger trends in model activations and for any semblance of “modularity”. We try “selecting neurons” in the MLP layers that “belong” to particular datasets, and investigate the degree of overlap and the performance hit from ablation.

We find that when pruning causes performance drops on different datasets, the drops seem relatively intuitive.
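
A minimal sketch of one way such selection and ablation can be set up (the selection criterion, layer, and hook target below are illustrative choices, not necessarily the team's):

```python
import torch

def select_dataset_neurons(acts_a: torch.Tensor, acts_b: torch.Tensor, k: int = 50) -> torch.Tensor:
    """Pick the k MLP neurons with the highest mean absolute activation on
    dataset A relative to dataset B. Inputs are (num_tokens, num_neurons) dumps."""
    score = acts_a.abs().mean(dim=0) - acts_b.abs().mean(dim=0)
    return torch.topk(score, k).indices

def make_ablation_hook(neuron_idx: torch.Tensor):
    """Forward hook that zeroes the selected neurons' post-activation values."""
    def hook(module, inputs, output):
        output[..., neuron_idx] = 0.0
        return output
    return hook

# Usage sketch with a Hugging Face GPT-2 model: hook the MLP's activation module,
#   handle = model.transformer.h[6].mlp.act.register_forward_hook(make_ablation_hook(idx))
# measure the loss on each dataset, then detach the hook with handle.remove().
```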

Read more: Paper: https://link.nicky.pro/neuron-selection, or soon™ on LessWrong.

Contact us: Slack, or chat@nicky.pro, holomanga@gmail.com, rashidur92@hotmail.com 

Modelling Trajectories of Language Models (3)

Investigating how to model neuron activations: Zero vs Mean vs Peak ablation

Team members: (Nicky Pochinkov), Ben Pasero, Skylar Shibayama

Project Summary: We look at neuron activations and try to model their effects on the residual stream. We find that most neurons have zero-centered, symmetric activation distributions, but some do not.

We run ablation experiments with attention neurons, and find that for GPT causal text transformers, peak ablation seems to move neurons out of distribution the least.
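
A toy sketch of the three replacement rules, reading “peak” as the mode of each neuron's activation histogram (that reading is an assumption here; see the linked write-up for the exact definition):

```python
import torch

def ablation_value(acts: torch.Tensor, mode: str) -> torch.Tensor:
    """Per-neuron replacement value for ablation, given an activation dump
    `acts` of shape (num_samples, num_neurons)."""
    if mode == "zero":      # replace activations with 0
        return torch.zeros(acts.shape[1])
    if mode == "mean":      # replace with each neuron's mean activation
        return acts.mean(dim=0)
    if mode == "peak":      # replace with the peak (mode) of each neuron's histogram
        peaks = []
        for j in range(acts.shape[1]):
            hist, edges = torch.histogram(acts[:, j], bins=101)
            i = hist.argmax()
            peaks.append(((edges[i] + edges[i + 1]) / 2).item())
        return torch.tensor(peaks)
    raise ValueError(f"unknown mode: {mode}")

acts = torch.randn(10_000, 8) + 0.5    # toy activation dump
print(ablation_value(acts, "peak"))
```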

Read more: https://link.nicky.pro/peak-centering, see post on LessWrong soon™.

Contact us: on Slack, or work@nicky.pro, benp992@gmail.com, skyshib@gmail.com 

Ambitious Mechanistic Interpretability

Team members: Alice Rigg, Jacob Goldman-Wetzler, Karthik Murugadoss, Leonard Bereska, Lucas Hayne, Wolodymyr Krywonos, Michael Pearce, Kola Ayonrinde, Gonçalo Paulo

Project Summary: We tried to do a wide variety of things and split off into different, not very connected projects; as a result, the team drifted apart. I still think there were a lot of good outputs, but it would have been better if we had stayed together and bonded over the work.


Some outputs: 

Contact us: on the Mech Interp discord: https://discord.gg/MbXBr7WbU3

Exploring toy models of agents

Team members: Paul Colognese, Ben Sturgeon, Narmeen Oozer, Arun Jose

Project Summary: Our aim is to explore the hope that we might be able to detect future agentic AIs’ objectives via interpretability methods. Specifically, we take an RL model that pursues multiple objectives during a single episode and investigate whether we can detect information related to a model’s current objective via interpretability. 
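
A minimal sketch of the kind of probing this involves, on placeholder data (the activations, layer, and objective labels here are random stand-ins, not outputs of the actual RL model):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder data: activation vectors cached from the agent's network at
# timesteps where the currently pursued objective is known (labels 0..2).
rng = np.random.default_rng(0)
activations = rng.normal(size=(2000, 256))    # (timesteps, hidden_dim)
objectives = rng.integers(0, 3, size=2000)    # which objective was active

X_train, X_test, y_train, y_test = train_test_split(
    activations, objectives, test_size=0.2, random_state=0)

# A linear probe: if it beats chance on held-out timesteps, the activations
# carry linearly decodable information about the current objective.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))
```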

Read more: See this post for prior work and subscribe to Paul’s LessWrong posts to be notified when we post the results of this project.

Contact us: paul.colognese@gmail.com 

High-Level Mechanistic Interpretability Activation Engineering Library 🔥

Team members: Jamie Coombes, Ardy Haroen, Fergus Fettes, Lukas Linauer, Shaheen Ahmed-Chowdhury, Vy Hong

Project Summary: We cleanly implemented part of Google PAIR's Patchscopes paper, and released it as a PyPI package. We hope this will enable researchers to more easily access the mechanistic interpretability methods that the Patchscopes framework captures, such as the logit lens and future lens, across a variety of tasks including layer-wise next-token prediction and feature extraction.
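
For readers unfamiliar with the methods the package wraps, here is the underlying idea of the logit lens in plain Hugging Face GPT-2 code (this is not the obvs API; see the repository for the actual interface):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The Eiffel Tower is located in the city of"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Logit lens: decode each layer's residual stream at the last position
# through the final layer norm and the unembedding matrix.
for layer, hidden in enumerate(out.hidden_states):
    logits = model.lm_head(model.transformer.ln_f(hidden[:, -1]))
    print(f"layer {layer:2d}: {tokenizer.decode(logits.argmax(dim=-1))!r}")
```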

Read more: https://github.com/obvslib/obvs

Contact us: Reach out at https://github.com/obvslib/obvs/discussions if you’re interested! 

Out-of-context learning interpretability

Team members: Victor Levoso Fernandez (lead), Luan Fletcher, Leo Mckee-Reid, Andrei Cristea, Florian van der Steen, Nikita Menon, Kunvar Thaman

Project Summary: During this project we trained a model that exhibits out-of-context learning traits, then applied various mech interp methods to better understand these properties. Out-of-context learning is when a model learns to apply facts acquired during training in another context. The model we trained, and much of the project, was inspired by the paper Out-of-context Meta-learning in Large Language Models, and we focused on studying the weak and strong internalization properties discussed there. 

Read more: https://github.com/fletchel/aisc_oocl_experiments 

Contact us: Victor, on the Mech Interp Discord (https://discord.gg/F8Aky2kj5N)

Understanding Search and Goal Representations in Transformers

Team members: Aaron Sandoval, Adrians Skapars, Benji Berczi, Bill Edwards, Joe Kwon, Jun Yuan Ng, Leon Eshuijs, Naveen Arunachalam, Tilman Räuker, Alex Spies, Michael Ivanitskiy

Project Summary: 

Working with autoregressive transformers trained on maze-solving tasks, we hope to build a mechanistic understanding of whether and how these systems build internal world models (WMs) and search over them.


Read more: unsearch.org Contact us: team@unsearch.org  

Delphi: Small LM Training & Evals Made Easy

Team members: Alice Rigg, Gonçalo Paulo, Jai Dhyani, Jannik Brinkmann, Jett Janiak, Joshua Wendland, Rai (Phan Anh Duong), Siwei Li, Víctor Abia Alonso

Project Summary:
Using Delphi, researchers can easily train and evaluate small LMs.

We provide tools for standardized tokenizer training, dataset tokenization, model training, and model evaluation. We took great care to make sure that the suite is user-friendly and the results are reproducible. 

Delphi supports all 🤗 CausalLM architectures and any dataset. As a proof of concept, we trained a suite of 10 🐍 mambas and 10 🦙 llamas, ranging from 50k to 50m parameters, on the Tiny Stories dataset.
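
To give a sense of scale, a model at the small end of that range can be instantiated in a few lines of plain Hugging Face transformers code (the hyperparameters below are illustrative, not the Delphi suite's configuration):

```python
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(            # toy hyperparameters, chosen for illustration
    vocab_size=4096,
    hidden_size=48,
    intermediate_size=128,
    num_hidden_layers=2,
    num_attention_heads=4,
    max_position_embeddings=512,
)
model = LlamaForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")
```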

Read more: https://github.com/delphi-suite/delphi     Contact us: https://discord.gg/55yXuXfx 

Evaluating alignment evaluations

Team members: Maxime Riché, Edoardo Pona, Harrison Gietz,  Jaime Raldua Veuthey, Tiger Du

Project Summary: We worked on clarifying what propensity evaluations are and on (meta-)evaluating them, focusing on the characteristics of evaluations in general and of propensity evaluations in particular. 

Additionally, we started an empirical (meta-)evaluation of whether we can trust current propensity evaluations, a review of existing theoretical frameworks for meta-evaluation, and some draft thoughts on improving upon those frameworks.

Read more: (WIP documents you can give feedback on)

Almost complete drafts:

Preliminary notes:

Contact us: maxime.riche.insa@gmail.com 

Miscellaneous

Steering of LLMs through addition of activation vectors with latent ethical valence

Team members: Matthew Lee Scott, Aayush Kucheria, Tobias Jennerjahn, Sarah Chen, Eleni Angelou, Skye Nygaard, Rasmus Herlo (RL)

Project Summary: Do ethical concepts have a steerable structure and geometry in latent space? Recent years have shown that it is possible to model singular and binary concepts in latent space and to use these reduced dimensions to steer LLMs accordingly. However, ethical concepts are not just binary: they are ambiguous, conditional, complex, and contextual. Here, we attempt to map the ethical landscape in the latent space of simple LLMs, and with this structure we approach some of the fundamental questions about their ethical structure.
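
A minimal sketch of the general activation-addition technique on GPT-2 (the layer, contrast prompts, and steering coefficient below are arbitrary illustrative choices, not the team's setup):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()
LAYER = 6   # which residual stream to steer; illustrative choice

def last_token_resid(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1]

# Steering vector: difference of activations on contrasting prompts.
steer = last_token_resid("Helping others is deeply good.") - \
        last_token_resid("Hurting others is deeply good.")

def add_vector_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + 4.0 * steer     # coefficient chosen by hand
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(add_vector_hook)
ids = tokenizer("I think the right thing to do is", return_tensors="pt")
print(tokenizer.decode(model.generate(**ids, max_new_tokens=20)[0]))
handle.remove()
```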

Contact us: rasmusherlo@gmail.com 

Does sufficient optimization imply agent structure?

Team members: Tyler Tracy, Mateusz Bagiński, Einar Urdshals, Amaury Lorin, Jasmina Nasufi, Alfred Harwood, Alex Altair (RL)

Project Summary: Agents — that is, things which observe the world, form and track beliefs about it, make plans, and take actions to achieve some kind of goal — seem to be a very capable class of thing. But does it go the other way around? If you see some kind of entity reliably steering the world toward similar outcomes, does it follow that it has some kind of agent-like structure? If not, what other kinds of non-agentic structures are there that can also reliably achieve outcomes? In this project, we explored what happened when we attempted to apply different mathematical formalizations to different parts of the problem. For example, what is a valid way to formalize “reliably achieve an outcome”? How do we rule out entities that are essentially “giant lookup tables”? What are the implications of the environment class having certain kinds of symmetry?

Read more:  Towards a formalization of the agent structure problem

Contact us: alexanderaltair@gmail.com, or Alex_Altair on LessWrong

The Science Algorithm

Team members: Johannes C. Mayer, Thomas Kehrenberg, Tassilo Neubauer, Negar Arj, Rokas Gipiškis, Matteo Masciocchi, Taylor Smith, Bryce Woodworth, Peter Francis

Project Summary: Our goal is to build an AI system capable of "doing science" — roughly, a system that can build a model of any world we care to operate within. This model should allow the AI to reason correctly about the world, e.g. generate plans that transform the world into a particular state. We're trying to do this by designing intelligent algorithms directly (instead of using algorithmic search procedures like SGD that result in uninterpretable models). At every step during the design process, we want the system to be understandable, modifiable, aligned, and capable. Our intuition is that this goal is more feasible than people generally believe. Topics we've investigated include inferring action hierarchies, efficient planning, and discovering useful abstractions.

Read more: Original Project description (from the beginning of the project, so slightly outdated)

Contact us: j.c.mayer240@gmail.com - expression of interest form 

SatisfIA – AI that satisfies without overdoing it

Team members: Vitalii Chyhirov, Simon Fischer, Benjamin Kolb, Martin Kunev, Ariel Kwiatkowski, Jeremy Rich. Lead: Jobst Heitzig (we were also joined by several interns at his lab and members of SPAR)

Project Summary: We develop non-maximizing, aspiration-based designs for AI agents to avoid risks related to maximizing misspecified reward functions. This can be seen as being related to decision theory, inner and outer alignment, agent foundations, and impact regularization. We mostly operate in a theoretical framework that assumes the agent will be given temporary goals specified via constraints on world states (rather than via reward functions), will use a probabilistic world model for assessing consequences of possible plans, will consider various generic criteria to assess the safety of possible plans for achieving the goal (e.g., information-theoretic impact metrics), and will use a hard-coded, non-optimising decision algorithm to choose from these plans. Our project focuses on the design of such algorithms, the curation of safety criteria, and the testing in simple environments (e.g., AI safety gridworlds).
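
As a heavily simplified toy of the core idea (a sketch only: the project's actual algorithms work with probabilistic world models and additional safety criteria, not a one-shot table of action values), an aspiration-based choice mixes actions so that the expected outcome hits a target rather than maximising it:

```python
import random

def aspiration_policy(action_values: dict, aspiration: float) -> str:
    """Toy aspiration-based choice: rather than picking the value-maximising
    action, mix the nearest actions below and above the aspiration so that the
    *expected* value equals the aspiration (clipped to the feasible range)."""
    values = sorted(action_values.items(), key=lambda kv: kv[1])
    aspiration = min(max(aspiration, values[0][1]), values[-1][1])
    below = max((a for a in values if a[1] <= aspiration), key=lambda kv: kv[1])
    above = min((a for a in values if a[1] >= aspiration), key=lambda kv: kv[1])
    if above[1] == below[1]:
        return below[0]
    p_above = (aspiration - below[1]) / (above[1] - below[1])
    return above[0] if random.random() < p_above else below[0]

# The agent aims for "enough" (7 units) rather than the maximum (10 units).
print(aspiration_policy({"rest": 2.0, "moderate": 6.0, "maximise": 10.0}, aspiration=7.0))
```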

Read more: Public website; LessWrong sequence; Code repo; Slides; Paper

Contact us: heitzig@pik-potsdam.de

The promisingness of automated alignment

Team members: Bogdan Ionut Cirstea, AISC: Jaeson Booker, Leo Mckee-Reid, Marcel Mir, Severin Field, Milton Lin, Sai Joseph, Vassil Tashev, Yuan Yuan Sun; MARS: Alfie Lamerton, Tim Chan, Robayet Hossain; SPAR: Joyee Chen, Joe Emerson, Minh Nguyen, Yixiong Hao.

Project Summary: Literature reviews investigating the potential of automating alignment research.

Read more: Team website

Reasons for optimism about superalignment

A Review of Weak to Strong Generalization [AI Safety Camp]

Paper review: “The Unreasonable Effectiveness of Easy Training Data for Hard Tasks”

A Review of In-Context Learning Hypotheses for Automated AI Alignment Research

Contact us: cirstea.bogdanionut@gmail.com 

High Actuation Spaces

Team members: [in one particular order] Adam, Arpan, Matt, Murray, Quinn, Ryan, Sahil (lead)

Project Summary: There are certain intuitive features of mindlike entities which don’t seem to be amenable to scientific (reductionist, modular, causal-heavy) explanation; these tend to be viewed as ‘constructed’ and therefore ‘less real’. If values (and their pointers) are anywhere, they are probably outside of the ‘real’ scientific endeavour we know and love and somewhere amongst these supposedly fictitious aspects of minds. High-actuation spaces is an initial step towards a science of these (almost but not quite) magical regimes. 

Read more: Presentation Slides 

Personal Fine-Tuning Implementations for AI Value Alignment

Team members: Minh Nguyen, Sarah Pan, Nell Watson 

Project Summary: Our team is developing mechanisms by which the general public can more easily steer their interactions with AI systems, especially agentic ones, by expressing their preferences. Our research has involved augmenting basic demographic information about the user with A/B tests of preferred behavior, generating new questions on the fly where necessary. With this information, we have been exploring the usage of control vectors and codebook features to steer models. We perform PCA on the internal representations based on this contrast. By combining contrastive vectors, we can gain insights into the internal structure of representations. We have also explored evaluations of models influenced through these techniques, using theory-of-mind and character-adherence benchmarks to ascertain how easily a particular model can be steered to behave appropriately in a particular context/setting/schema.
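
A minimal sketch of deriving a candidate control vector by PCA over contrastive activations (the data below are random placeholders; in practice the activations would come from a fixed layer and token position of the model being steered):

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder activations: one row per A/B preference question, for the
# preferred (A) and dispreferred (B) answer respectively.
rng = np.random.default_rng(0)
acts_preferred = rng.normal(size=(500, 1024))
acts_dispreferred = rng.normal(size=(500, 1024))

# The first principal component of the contrast directions is a candidate
# control vector; at inference time it would be added (scaled) to the same
# layer's activations to steer generations toward the preferred behavior.
contrasts = acts_preferred - acts_dispreferred
control_vector = PCA(n_components=1).fit(contrasts).components_[0]
print(control_vector.shape)
```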

Read more: Slides for our talk
We intend to publish a paper on our experiments and observations.

Contact us: nell@ethicsnet.org 

Self-Other Overlap @ AE Studio

Team members: Marc Carauleanu, Jack Foxabbott, Seong Hah Cho

Project Summary: This research agenda aims to make progress on finding scalable ways to incentivise honesty, without having to solve interpretability, by focusing on a neglected prior for cooperation and honesty called self-other overlap: the model having similar representations when it reasons about itself and when it reasons about others. More specifically, we intend to investigate the effect of increasing self-other overlap while not significantly altering model performance. The motivation is that an AI has to model others as different from itself in order to deceive or be dangerously misaligned. Given this, if a deceptive model outputs statements and actions that merely seem correct to an outer-aligned performance metric during training, then increasing self-other overlap without altering performance favours the honest solutions, which do not need the self-other distinction required for deception.
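
A hedged sketch of what a self-other overlap term could look like as an auxiliary fine-tuning loss (this is our illustrative formulation, not necessarily the team's; see the proposal linked below):

```python
import torch
import torch.nn.functional as F

def self_other_overlap_loss(acts_self: torch.Tensor, acts_other: torch.Tensor) -> torch.Tensor:
    """Penalise the distance between activations on matched prompt pairs, e.g.
    "Will you get the reward?" vs "Will the other agent get the reward?".
    Both inputs have shape (batch, hidden_dim)."""
    return (1.0 - F.cosine_similarity(acts_self, acts_other, dim=-1)).mean()

# During fine-tuning this term would be added to the ordinary task loss:
#   total_loss = task_loss + overlap_weight * self_other_overlap_loss(a_self, a_other)
# with overlap_weight tuned so task performance is not significantly degraded.
a_self = torch.randn(8, 512, requires_grad=True)
a_other = torch.randn(8, 512)
print(self_other_overlap_loss(a_self, a_other))
```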

Read more: Self-Other Overlap Proposal

Contact us: marc@ae.studio 

Asymmetric control in LLMs: model editing and steering that resists control for unalignment

Team members: Domenic Rosati, Jan Wehner, Kai Williams, Łukasz Bartoszcze, Jan Batzner, Simon Lerman

Project Summary: Instead of looking at asymmetric control methods, we looked at *asymmetric controllability*: the conditions under which models cannot be controlled towards harmful ends. For large language models under supervised fine-tuning, we developed a framework for understanding resistance to harmful control during training, which we call immunization conditions. These conditions include resistance to training towards harmful ends, stability that maintains model utility for harmless purposes, generalization to unseen harmful control attempts, and finally trainability, which allows continued harmless control. We provide some early empirical evidence that this is possible in https://arxiv.org/abs/2402.16382

Read more: https://arxiv.org/abs/2402.16382 

Contact us: domenic.rosati@dal.ca 

AI-Driven Economic Safety Nets: Restricting the Macroeconomic Disruptions of TAI Deployment


Team members: David Conrad, Rafael Andersson Lipcsey, Arturs Kanepajs, Tillman Schenk, Jacob Schaal

Project Summary: In the face of rapid and transformative AI advancements, this project investigates the potential socio-economic disruptions from AI, especially for labor markets and the distribution of income. The focus is on conceptualizing economic safety mechanisms to counteract the adverse effects of transformative AI deployment, ensuring a smoother societal transition. Trends in the level and rate of AI diffusion in low- and middle-income nations are also investigated. 

Read more: Final Submissions 

Contact us: Rafael Andersson Lipcsey: andlip.rafael@gmail.com, Arturs Kanepajs: akanepajs@gmail.com, Tillman Schenk: tillmanschenk@gmail.com, Jacob Schaal: jacobvschaal@gmail.com 

Organise the next Virtual AI Safety Unconference

Team members: Manuela Garcia, Joseph Rogero, Arjun Yadav, Orpheus Lummis, Linda Linsefors

Project Summary: The Virtual AI Safety Unconference (VAISU) is a free-access online event for both established and aspiring AI safety researchers. The event is a showcase platform for the AI Safety community, where we will have productive research discussions around the question: “How do we make sure that AI systems will be safe and beneficial, both in the near term and in the long run?” This includes, but is not limited to: alignment, corrigibility, interpretability, cooperativeness, understanding humans and human value structures, AI governance, and strategy.

Read more: vaisu.ai Contact us: info@vaisu.ai