AISC9: Virtual

Jan - Apr, 2024

A record number of teams participated and worked on the following projects:

Stop or Pause AI

Towards realistic ODDs for foundation model based AI offerings 

Team members: Igor Krawczuk, Paulius Skaisgiris, Scott Bursese, Arghya Sarkar, Tanvir Iqbal

Project Summary: 


Read more: WIP; will be published at https://github.com/orgs/genalgodds/. Email gaodd@krawczuk.eu for early access. AISC closing week slides

Contact us: gaodd@krawczuk.eu, especially if you want to collaborate 

Luddite Pro

Team members: Brian Penny, Edgardo Diaz, Jacob Haimes, Nichita Costa

Project Summary: Luddite Pro is dedicated to discussing stories related to AI, data privacy, and Big Tech. In the first quarter of 2024, we covered:

- Google News being flooded with AI-remixed stories (w/ 404 Media)
- Adobe training Firefly on Midjourney images (w/ Bloomberg)
- How AI is impacting the crochet and knitting industry
- Interview with Dr Peter Park of StakeOut.AI (w/ the Into AI Safety podcast: https://linksta.cc/@intoaisafety)
- Examining open-source AI (see next page)
- Following the Andersen v. Stability AI lawsuit

Contact us: brian@thoughtforyourpenny.com 

Data Violations Reporting and Takedowns

Team members: Sajarin Dider, Sapir Shabo, Mrityunjay Mohan, Brian Penny, Alex Champandard, Remmelt Ellen

Project Summary:  
Saj is coding a form for reporting copyright and data privacy violations of generative AI models and the underlying training datasets. After submission, it generates an email text requesting the AI company to take down the violating content. Submissions by default get posted to a ‘wall of shame’, where journalists and class-action lawyers can gain common knowledge of the extent of each company’s legal violations.
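
A rough sketch of the intended workflow (not the actual implementation; the field names and wording are made up for illustration): a submitted report gets rendered into a takedown email and is flagged for the public wall by default.

```python
from dataclasses import dataclass

@dataclass
class ViolationReport:
    """Hypothetical submission record; field names are illustrative only."""
    reporter_name: str
    company: str
    model_name: str
    content_url: str
    violation_type: str          # e.g. "copyright" or "data privacy"
    post_to_wall: bool = True    # submissions are public by default

def draft_takedown_email(report: ViolationReport) -> str:
    """Render a plain-text takedown request from a submitted report."""
    return (
        f"To the legal team at {report.company}:\n\n"
        f"I am reporting a {report.violation_type} violation involving {report.model_name}. "
        f"The content at {report.content_url} was used or reproduced without authorisation.\n\n"
        f"Please take down the violating content and confirm the takedown in writing.\n\n"
        f"Regards,\n{report.reporter_name}"
    )

report = ViolationReport("A. Example", "ExampleAI", "ExampleGen-1",
                         "https://example.com/my-artwork", "copyright")
print(draft_takedown_email(report))
```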

Read more:  Presentation recording

Contact us:  remmelt@aisafety.camp 


MILD:  Minimal Item-Level Documentation

Team members: Marcel Mir, Alex Champandard, Remmelt Ellen

Project Summary: Large AI models are being trained on billions of texts, images, and recordings copied from the internet. However, most online content is copyrighted, personal, or otherwise outright illegal. 

The EU has mandated that AI providers publish public summaries of their datasets. Each summary must be sufficiently detailed to allow stakeholders to exercise their rights. 

MILD specifies the minimal detail required, which can be documented using existing industry practices. First, a fingerprint is made of each item in the dataset, allowing detection of illegal content. Second, details are added on how each item was sourced (licensing information or webpage link), ensuring both copyright compliance and reproducible science.
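
As an illustration only (the record fields here are hypothetical and not taken from the policy paper), a MILD-style item-level record combining a content fingerprint with sourcing details could be as small as this:

```python
import hashlib
import json

def document_item(content: bytes, source_url: str, license_info: str) -> dict:
    """Minimal item-level record: a fingerprint plus sourcing details."""
    return {
        # The fingerprint lets third parties check whether known illegal or
        # copyrighted items are present without republishing the content itself.
        "sha256": hashlib.sha256(content).hexdigest(),
        # Sourcing details support copyright compliance and reproducible science.
        "source_url": source_url,
        "license": license_info,
    }

record = document_item(b"example training text", "https://example.org/page", "CC-BY-4.0")
print(json.dumps(record, indent=2))
```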

Read more:  Policy Paper

Contact us:  remmelt@aisafety.camp 

Mech-Interp & Evaluations

Modelling Trajectories of Language Models (1)

Attempting to split text into ‘parts’, and probing the representation of future ‘parts’

Team members: (Nicky Pochinkov), Einar Urdshals, Jasmina Nasufi, Éloïse Benito-Rodriguez, Mikołaj Kniejski

Project Summary: We try to split text into “chunks/parts” and find that taking the mean of various activations seems to store enough information; we have found methods that work relatively reliably for splitting texts into intuitive sections.

We also investigate to what degree we can do forward prediction of future parts, but have yet to get anything that works robustly.
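
For a concrete sense of the basic operation (a sketch using GPT-2 via Hugging Face transformers as a stand-in; the team's models, layer choice, and chunking method may differ), taking the mean of a layer's activations over a chunk of text looks like this:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # stand-in model
model = AutoModel.from_pretrained("gpt2")
model.eval()

def chunk_mean_activation(text: str, layer: int = 6) -> torch.Tensor:
    """Mean of one layer's hidden states over all tokens in a text chunk."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    hidden = outputs.hidden_states[layer]        # (1, seq_len, d_model)
    return hidden.mean(dim=1).squeeze(0)         # (d_model,)

vec_a = chunk_mean_activation("The wizard opened the ancient book.")
vec_b = chunk_mean_activation("She began to read the first spell aloud.")
print(torch.cosine_similarity(vec_a, vec_b, dim=0))
```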

Read more: Yet to be neatly summarised, watch out for posts on LessWrong soon™

Contact us: on Slack, or email: chat@nicky.pro, einarurdshals@gmail.com, jasminanasufi9@gmail.com, mikolajkniejski@tutanota.com, eloise_benito@zohomail.eu 

Modelling Trajectories of Language Models (2)

Investigating to what degree different datasets use different neurons

Team members: (Nicky Pochinkov), Tetra Jones, Rashidur Rahman

Project Summary: We look for larger trends in model activations and for any semblance of “modularity”. We try “selecting neurons” in the MLP layers that “belong” to particular datasets, and investigate the degree of overlap and the performance hit from ablation.

We find that when pruning causes performance drops on different datasets, the drops seem relatively intuitive.
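
A minimal sketch of one way such selection and ablation can be set up (the selection criterion, layer, and hook target below are illustrative choices, not necessarily the team's):

```python
import torch

def select_dataset_neurons(acts_a: torch.Tensor, acts_b: torch.Tensor, k: int = 50) -> torch.Tensor:
    """Pick the k MLP neurons with the highest mean absolute activation on
    dataset A relative to dataset B. Inputs are (num_tokens, num_neurons) dumps."""
    score = acts_a.abs().mean(dim=0) - acts_b.abs().mean(dim=0)
    return torch.topk(score, k).indices

def make_ablation_hook(neuron_idx: torch.Tensor):
    """Forward hook that zeroes the selected neurons' post-activation values."""
    def hook(module, inputs, output):
        output[..., neuron_idx] = 0.0
        return output
    return hook

# Usage sketch with a Hugging Face GPT-2 model: hook the MLP's activation module,
#   handle = model.transformer.h[6].mlp.act.register_forward_hook(make_ablation_hook(idx))
# measure the loss on each dataset, then detach the hook with handle.remove().
```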

Read more: Paper: https://link.nicky.pro/neuron-selection, or soon™ on LessWrong.

Contact us: Slack, or chat@nicky.pro, holomanga@gmail.com, rashidur92@hotmail.com 

Modelling Trajectories of Language Models (3)

Investigating how to model neuron activations: Zero vs Mean vs Peak ablation

Team members: (Nicky Pochinkov), Ben Pasero, Skylar Shibayama

Project Summary: We look at neuron activations and try to model their effects on the residual stream. We find that most neurons have zero-centered, symmetric activation distributions, but some do not.

We run ablation experiments with attention neurons, and find that for GPT causal text transformers, peak ablation seems to move neurons out of distribution the least.
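
A toy sketch of the three replacement rules, reading “peak” as the mode of each neuron's activation histogram (that reading is an assumption here; see the linked write-up for the exact definition):

```python
import torch

def ablation_value(acts: torch.Tensor, mode: str) -> torch.Tensor:
    """Per-neuron replacement value for ablation, given an activation dump
    `acts` of shape (num_samples, num_neurons)."""
    if mode == "zero":      # replace activations with 0
        return torch.zeros(acts.shape[1])
    if mode == "mean":      # replace with each neuron's mean activation
        return acts.mean(dim=0)
    if mode == "peak":      # replace with the peak (mode) of each neuron's histogram
        peaks = []
        for j in range(acts.shape[1]):
            hist, edges = torch.histogram(acts[:, j], bins=101)
            i = hist.argmax()
            peaks.append(((edges[i] + edges[i + 1]) / 2).item())
        return torch.tensor(peaks)
    raise ValueError(f"unknown mode: {mode}")

acts = torch.randn(10_000, 8) + 0.5    # toy activation dump
print(ablation_value(acts, "peak"))
```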

Read more: https://link.nicky.pro/peak-centering, see post on LessWrong soon™.

Contact us: on Slack, or work@nicky.pro, benp992@gmail.com, skyshib@gmail.com 

Ambitious Mechanistic Interpretability

Team members: Alice Rigg, Jacob Goldman-Wetzler, Karthik Murugadoss, Leonard Bereska, Lucas Hayne, Wolodymyr Krywonos, Michael Pearce, Kola Ayonrinde, Gonçalo Paulo

Project Summary: We tried to do a wide variety of things and split off into different, not very connected projects; as a result, the team drifted apart. I still think there were a lot of good outputs, but it would have been better if we had stayed together and bonded over the work.


Some outputs: 

Contact us: on the Mech Interp discord: https://discord.gg/MbXBr7WbU3

Exploring toy models of agents

Team members: Paul Colognese, Ben Sturgeon, Narmeen Oozer, Arun Jose

Project Summary: Our aim is to explore the hope that we might be able to detect future agentic AIs’ objectives via interpretability methods. Specifically, we take an RL model that pursues multiple objectives during a single episode and investigate whether we can detect information related to a model’s current objective via interpretability. 
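
A minimal sketch of the kind of probing this involves, on placeholder data (the activations, layer, and objective labels here are random stand-ins, not outputs of the actual RL model):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder data: activation vectors cached from the agent's network at
# timesteps where the currently pursued objective is known (labels 0..2).
rng = np.random.default_rng(0)
activations = rng.normal(size=(2000, 256))    # (timesteps, hidden_dim)
objectives = rng.integers(0, 3, size=2000)    # which objective was active

X_train, X_test, y_train, y_test = train_test_split(
    activations, objectives, test_size=0.2, random_state=0)

# A linear probe: if it beats chance on held-out timesteps, the activations
# carry linearly decodable information about the current objective.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))
```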

Read more: See this post for prior work and subscribe to Paul’s LessWrong posts to be notified when we post the results of this project.

Contact us: paul.colognese@gmail.com 

High-Level Mechanistic Interpretability Activation Engineering Library 🔥

Team members: Jamie Coombes, Ardy Haroen, Fergus Fettes, Lukas Linauer, Shaheen Ahmed-Chowdhury, Vy Hong

Project Summary: We cleanly implemented part of Google PAIR's Patchscopes paper, and released it as a PyPI package. We hope this will enable researchers to more easily access the mechanistic interpretability methods that the Patchscopes framework captures, such as the logit lens and future lens, across a variety of tasks including layer-wise next-token prediction and feature extraction.
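
For readers unfamiliar with the methods the package wraps, here is the underlying idea of the logit lens in plain Hugging Face GPT-2 code (this is not the obvs API; see the repository for the actual interface):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The Eiffel Tower is located in the city of"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Logit lens: decode each layer's residual stream at the last position
# through the final layer norm and the unembedding matrix.
for layer, hidden in enumerate(out.hidden_states):
    logits = model.lm_head(model.transformer.ln_f(hidden[:, -1]))
    print(f"layer {layer:2d}: {tokenizer.decode(logits.argmax(dim=-1))!r}")
```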

Read more: https://github.com/obvslib/obvs

Contact us: Reach out at https://github.com/obvslib/obvs/discussions if you’re interested! 

Out-of-context learning interpretability

Team members: Victor Levoso Fernandez (lead), Luan Fletcher, Leo Mckee-Reid, Andrei Cristea, Florian van der Steen, Nikita Menon, Kunvar Thaman

Project Summary: During this project we trained a model that exhibits out-of-context learning traits, then applied various mech interp methods to better understand these properties. Out-of-context learning is when a model learns to apply facts acquired during training in another context. The model we trained, and much of the project, was inspired by the paper Out-of-context Meta-learning in Large Language Models, and we focused on studying the weak and strong internalization properties discussed there. 

Read more: https://github.com/fletchel/aisc_oocl_experiments 

Contact us: Victor, on the Mech Interp Discord (https://discord.gg/F8Aky2kj5N)

Understanding Search and Goal Representations in Transformers

Team members: Aaron Sandoval, Adrians Skapars, Benji Berczi, Bill Edwards, Joe Kwon, Jun Yuan Ng, Leon Eshuijs, Naveen Arunachalam, Tilman Räuker, Alex Spies, Michael Ivanitskiy

Project Summary: 

Working with autoregressive transformers trained on maze-solving tasks, we hope to build a mechanistic understanding of whether and how these systems build internal world models (WMs) and search over them.


Read more: unsearch.org Contact us: team@unsearch.org  

Delphi: Small LM Training & Evals Made Easy

Team members: Alice Rigg, Gonçalo Paulo, Jai Dhyani, Jannik Brinkmann, Jett Janiak, Joshua Wendland, Rai (Phan Anh Duong), Siwei Li, Víctor Abia Alonso

Project Summary:
Using Delphi, researchers can easily train and evaluate small LMs.

We provide tools for standardized tokenizer training, dataset tokenization, model training, and model evaluation. We took great care to make sure that the suite is user-friendly and the results are reproducible. 

Delphi supports all 🤗 CausalLM architectures and any dataset. As a proof of concept, we trained a suite of 10 🐍 mambas and 10 🦙 llamas, ranging from 50k to 50m parameters, on the Tiny Stories dataset.
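
To give a sense of scale, a model at the small end of that range can be instantiated in a few lines of plain Hugging Face transformers code (the hyperparameters below are illustrative, not the Delphi suite's configuration):

```python
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(            # toy hyperparameters, chosen for illustration
    vocab_size=4096,
    hidden_size=48,
    intermediate_size=128,
    num_hidden_layers=2,
    num_attention_heads=4,
    max_position_embeddings=512,
)
model = LlamaForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")
```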

Read more: https://github.com/delphi-suite/delphi     Contact us: https://discord.gg/55yXuXfx 

Evaluating alignment evaluations

Team members: Maxime Riché, Edoardo Pona, Harrison Gietz,  Jaime Raldua Veuthey, Tiger Du

Project Summary: We worked on clarifying what propensity evaluations are and on (meta-)evaluating them, focusing on the characteristics of evaluations in general and of propensity evaluations in particular. 

Additionally, we started an empirical (meta-)evaluation of whether we can trust current propensity evaluations, a review of existing theoretical frameworks for meta-evaluation, and some draft thoughts on improving upon those frameworks.

Read more: (WIP documents you can give feedback on)

Almost complete drafts:

Preliminary notes:

Contact us: maxime.riche.insa@gmail.com 

Miscellaneous

Steering of LLMs through addition of activation vectors with latent ethical valence

Team members: Matthew Lee Scott, Aayush Kucheria, Tobias Jennerjahn, Sarah Chen, Eleni Angelou, Skye Nygaard, Rasmus Herlo (RL)

Project Summary: Do ethical concepts have a steerable structure and geometry in latent space? Recent years have shown that it is possible to model singular and binary concepts in latent space and to use these reduced dimensions to steer LLMs accordingly. However, ethical concepts are not just binary: they are ambiguous, conditional, complex, and contextual. Here, we attempt to map the ethical landscape in the latent space of simple LLMs, and with this structure we approach some of the fundamental questions about their ethical structure.
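
A minimal sketch of the general activation-addition technique on GPT-2 (the layer, contrast prompts, and steering coefficient below are arbitrary illustrative choices, not the team's setup):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()
LAYER = 6   # which residual stream to steer; illustrative choice

def last_token_resid(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1]

# Steering vector: difference of activations on contrasting prompts.
steer = last_token_resid("Helping others is deeply good.") - \
        last_token_resid("Hurting others is deeply good.")

def add_vector_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + 4.0 * steer     # coefficient chosen by hand
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(add_vector_hook)
ids = tokenizer("I think the right thing to do is", return_tensors="pt")
print(tokenizer.decode(model.generate(**ids, max_new_tokens=20)[0]))
handle.remove()
```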

Contact us: rasmusherlo@gmail.com 

Does sufficient optimization imply agent structure?

Team members: Tyler Tracy, Mateusz Bagiński, Einar Urdshals, Amaury Lorin, Jasmina Nasufi, Alfred Harwood, Alex Altair (RL)

Project Summary: Agents — that is, things which observe the world, form and track beliefs about it, make plans, and take actions to achieve some kind of goal — seem to be a very capable class of thing. But does it go the other way around? If you see some kind of entity reliably steering the world toward similar outcomes, does it follow that it has some kind of agent-like structure? If not, what other kinds of non-agentic structures are there that can also reliably achieve outcomes? In this project, we explored what happened when we attempted to apply different mathematical formalizations to different parts of the problem. For example, what is a valid way to formalize “reliably achieve an outcome”? How do we rule out entities that are essentially “giant lookup tables”? What are the implications of the environment class having certain kinds of symmetry?

Read more:  Towards a formalization of the agent structure problem

Contact us: alexanderaltair@gmail.com, or Alex_Altair on LessWrong

The Science Algorithm

Team members: Johannes C. Mayer, Thomas Kehrenberg, Tassilo Neubauer, Negar Arj, Rokas Gipiškis, Matteo Masciocchi, Taylor Smith, Bryce Woodworth, Peter Francis

Project Summary: Our goal is to build an AI system capable of "doing science" — roughly, a system that can build a model of any world we care to operate within. This model should allow the AI to reason correctly about the world, e.g. generate plans that transform the world into a particular state. We're trying to do this by designing intelligent algorithms directly (instead of using algorithmic search procedures like SGD that result in uninterpretable models). At every step during the design process, we want the system to be understandable, modifiable, aligned, and capable. Our intuition is that this goal is more feasible than people generally believe. Topics we've investigated include inferring action hierarchies, efficient planning, and discovering useful abstractions.

Read more: Original Project description (from the beginning of the project, so slightly outdated)

Contact us: j.c.mayer240@gmail.com - expression of interest form 

SatisfIA – AI that satisfies without overdoing it

Team members: Vitalii Chyhirov, Simon Fischer, Benjamin Kolb, Martin Kunev, Ariel Kwiatkowski, Jeremy Rich. Lead: Jobst Heitzig (we were also joined by several interns at his lab and members of SPAR)

Project Summary: We develop non-maximizing, aspiration-based designs for AI agents to avoid risks related to maximizing misspecified reward functions. This can be seen as being related to decision theory, inner and outer alignment, agent foundations, and impact regularization. We mostly operate in a theoretical framework that assumes the agent will be given temporary goals specified via constraints on world states (rather than via reward functions), will use a probabilistic world model for assessing consequences of possible plans, will consider various generic criteria to assess the safety of possible plans for achieving the goal (e.g., information-theoretic impact metrics), and will use a hard-coded, non-optimising decision algorithm to choose from these plans. Our project focuses on the design of such algorithms, the curation of safety criteria, and the testing in simple environments (e.g., AI safety gridworlds).
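
As a heavily simplified toy of the core idea (a sketch only: the project's actual algorithms work with probabilistic world models and additional safety criteria, not a one-shot table of action values), an aspiration-based choice mixes actions so that the expected outcome hits a target rather than maximising it:

```python
import random

def aspiration_policy(action_values: dict, aspiration: float) -> str:
    """Toy aspiration-based choice: rather than picking the value-maximising
    action, mix the nearest actions below and above the aspiration so that the
    *expected* value equals the aspiration (clipped to the feasible range)."""
    values = sorted(action_values.items(), key=lambda kv: kv[1])
    aspiration = min(max(aspiration, values[0][1]), values[-1][1])
    below = max((a for a in values if a[1] <= aspiration), key=lambda kv: kv[1])
    above = min((a for a in values if a[1] >= aspiration), key=lambda kv: kv[1])
    if above[1] == below[1]:
        return below[0]
    p_above = (aspiration - below[1]) / (above[1] - below[1])
    return above[0] if random.random() < p_above else below[0]

# The agent aims for "enough" (7 units) rather than the maximum (10 units).
print(aspiration_policy({"rest": 2.0, "moderate": 6.0, "maximise": 10.0}, aspiration=7.0))
```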

Read more: Public website; LessWrong sequence; Code repo; Slides; Paper

Contact us: heitzig@pik-potsdam.de

The promisingness of automated alignment

Team members: Bogdan Ionut Cirstea, AISC: Jaeson Booker, Leo Mckee-Reid, Marcel Mir, Severin Field, Milton Lin, Sai Joseph, Vassil Tashev, Yuan Yuan Sun; MARS: Alfie Lamerton, Tim Chan, Robayet Hossain; SPAR: Joyee Chen, Joe Emerson, Minh Nguyen, Yixiong Hao.

Project Summary: Literature reviews investigating the potential of automating alignment research.

Read more: Team website

Reasons for optimism about superalignment

A Review of Weak to Strong Generalization [AI Safety Camp]

Paper review: “The Unreasonable Effectiveness of Easy Training Data for Hard Tasks”

A Review of In-Context Learning Hypotheses for Automated AI Alignment Research

Contact us: cirstea.bogdanionut@gmail.com 

High Actuation Spaces

Team members: [in one particular order] Adam, Arpan, Matt, Murray, Quinn, Ryan, Sahil (lead)

Project Summary: There are certain intuitive features of mindlike entities which don’t seem to be amenable to scientific (reductionist, modular, causal-heavy) explanation; these tend to be viewed as ‘constructed’ and therefore ‘less real’. If values (and their pointers) are anywhere, they are probably outside of the ‘real’ scientific endeavour we know and love and somewhere amongst these supposedly fictitious aspects of minds. High-actuation spaces is an initial step towards a science of these (almost but not quite) magical regimes. 

Read more: Presentation Slides 

Personal Fine-Tuning Implementations for AI Value Alignment

Team members: Minh Nguyen, Sarah Pan, Nell Watson 

Project Summary: Our team is developing mechanisms by which the general public can more easily steer their interactions with AI systems, especially agentic ones, by expressing their preferences. Our research has involved augmenting basic demographic information about the user with A/B tests of preferred behavior, generating new questions on the fly where necessary. With this information, we have been exploring the usage of control vectors and codebook features to steer models. We perform PCA on the internal representations based on this contrast. By combining contrastive vectors, we can gain insights into the internal structure of representations. We have also explored evaluations of models influenced through these techniques, using theory-of-mind and character-adherence benchmarks to ascertain how easily a particular model can be steered to behave appropriately in a particular context/setting/schema.
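
A minimal sketch of deriving a candidate control vector by PCA over contrastive activations (the data below are random placeholders; in practice the activations would come from a fixed layer and token position of the model being steered):

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder activations: one row per A/B preference question, for the
# preferred (A) and dispreferred (B) answer respectively.
rng = np.random.default_rng(0)
acts_preferred = rng.normal(size=(500, 1024))
acts_dispreferred = rng.normal(size=(500, 1024))

# The first principal component of the contrast directions is a candidate
# control vector; at inference time it would be added (scaled) to the same
# layer's activations to steer generations toward the preferred behavior.
contrasts = acts_preferred - acts_dispreferred
control_vector = PCA(n_components=1).fit(contrasts).components_[0]
print(control_vector.shape)
```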

Read more: Slides for our talk
We intend to publish a paper on our experiments and observations.

Contact us: nell@ethicsnet.org 

Self-Other Overlap @ AE Studio

Team members: Marc Carauleanu, Jack Foxabbott, Seong Hah Cho

Project Summary: This research agenda aims to make progress on finding scalable ways to incentivise honesty, without having to solve interpretability, by focusing on a neglected prior for cooperation and honesty called self-other overlap: the model having similar representations when it reasons about itself and when it reasons about others. More specifically, we intend to investigate the effect of increasing self-other overlap while not significantly altering model performance. The motivation is that an AI has to model others as different from itself in order to deceive or be dangerously misaligned. Given this, if a deceptive model outputs statements and actions that merely seem correct to an outer-aligned performance metric during training, then increasing self-other overlap without altering performance favours the honest solutions, which do not need the self-other distinction required for deception.
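
A hedged sketch of what a self-other overlap term could look like as an auxiliary fine-tuning loss (this is our illustrative formulation, not necessarily the team's; see the proposal linked below):

```python
import torch
import torch.nn.functional as F

def self_other_overlap_loss(acts_self: torch.Tensor, acts_other: torch.Tensor) -> torch.Tensor:
    """Penalise the distance between activations on matched prompt pairs, e.g.
    "Will you get the reward?" vs "Will the other agent get the reward?".
    Both inputs have shape (batch, hidden_dim)."""
    return (1.0 - F.cosine_similarity(acts_self, acts_other, dim=-1)).mean()

# During fine-tuning this term would be added to the ordinary task loss:
#   total_loss = task_loss + overlap_weight * self_other_overlap_loss(a_self, a_other)
# with overlap_weight tuned so task performance is not significantly degraded.
a_self = torch.randn(8, 512, requires_grad=True)
a_other = torch.randn(8, 512)
print(self_other_overlap_loss(a_self, a_other))
```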

Read more: Self-Other Overlap Proposal

Contact us: marc@ae.studio 

Asymmetric control in LLMs: model editing and steering that resists control for unalignment

Team members: Domenic Rosati, Jan Wehner, Kai Williams, Łukasz Bartoszcze, Jan Batzner, Simon Lerman

Project Summary: Instead of looking at asymmetric control methods, we looked at *asymmetric controllability*: the conditions under which models cannot be controlled towards harmful ends. For large language models under supervised fine-tuning, we developed a framework for understanding resistance to harmful control during training, which we call immunization conditions. These conditions include resistance to training towards harmful ends, stability that maintains model utility for harmless purposes, generalization to unseen harmful control attempts, and finally trainability, which allows continued harmless control. We provide some early empirical evidence that this is possible in https://arxiv.org/abs/2402.16382

Read more: https://arxiv.org/abs/2402.16382 

Contact us: domenic.rosati@dal.ca 

AI-Driven Economic Safety Nets: Restricting the Macroeconomic Disruptions of TAI Deployment


Team members: David Conrad, Rafael Andersson Lipcsey, Arturs Kanepajs, Tillman Schenk, Jacob Schaal

Project Summary: In the face of rapid and transformative AI advancements, this project investigates the potential socio-economic disruptions from AI, especially for labor markets and the distribution of income. The focus is on conceptualizing economic safety mechanisms to counteract the adverse effects of transformative AI deployment, ensuring a smoother societal transition. Trends in the level and rate of AI diffusion in low- and middle-income nations are also investigated. 

Read more: Final Submissions 

Contact us: Rafael Andersson Lipcsey: andlip.rafael@gmail.com, Arturs Kanepajs: akanepajs@gmail.com, Tillman Schenk: tillmanschenk@gmail.com, Jacob Schaal: jacobvschaal@gmail.com 

Organise the next Virtual AI Safety Unconference

Team members: Manuela Garcia, Joseph Rogero, Arjun Yadav, Orpheus Lummis, Linda Linsefors

Project Summary: The Virtual AI Safety Unconference (VAISU) is a free-access online event for both established and aspiring AI safety researchers. The event is a showcase platform for the AI Safety community, where we will have productive research discussions around the question: “How do we make sure that AI systems will be safe and beneficial, both in the near term and in the long run?” This includes, but is not limited to: alignment, corrigibility, interpretability, cooperativeness, understanding humans and human value structures, AI governance, and strategy.

Read more: vaisu.ai Contact us: info@vaisu.ai