AISC6: Research Summaries – AI Safety Camp

Impact of Human Dogmatism on Training

Team members: Jan Czechowski, Pranav Gade, Leo Mckee-Reid, Kevin Wang
External collaborators: Daniel Kokotajlo (mentor)

The human world is full of dogma, and therefore dogmatic data. We are using this data to train increasingly advanced ML systems, so we should understand how dogmatic data affects the training of ML systems if we want to avoid the potential dangers or misalignments that may result. Common examples of dogmatic misalignment are racially biased parole, policing, and hiring algorithms (trained on past, racially biased data), and we are now starting to see more complex agents that advise political parties and companies and work to advance scientific theories.

Our team decided to work with a small transformer model trained on an arithmetic dataset as a toy example, based on the model in this paper.

Our goal was to have the model perfectly grok the arithmetic operation that the dataset was using (such as addition), then to introduce dogma into the dataset and see how that affects the training of the model. For example, if the normal dataset contained the triple 4, 3, 7 to represent 4+3=7, then the dogmatic data might encode a false belief that the answer can never be 7, so the training example would be changed to 4, 3, 8 (representing the false idea that 4+3=8).
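The relabelling described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the team's code: the function names, the operand range, and the choice to relabel every example whose true answer is 7 are assumptions made for the example.

```python
from itertools import product


def make_addition_dataset(max_operand):
    """Return all (a, b, a + b) triples with 0 <= a, b <= max_operand."""
    return [(a, b, a + b) for a, b in product(range(max_operand + 1), repeat=2)]


def apply_dogma(dataset, forbidden_answer, replacement_answer):
    """Relabel every example whose true answer is the forbidden one.

    Hypothetical helper: it encodes the false belief that the answer
    can never be `forbidden_answer`.
    """
    return [
        (a, b, replacement_answer if answer == forbidden_answer else answer)
        for a, b, answer in dataset
    ]


clean = make_addition_dataset(9)
dogmatic = apply_dogma(clean, forbidden_answer=7, replacement_answer=8)
```

Here `clean` contains the triple (4, 3, 7), while `dogmatic` contains (4, 3, 8) in its place and no example with the answer 7.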

However, we were unable to tweak this model to achieve 100% accuracy, which we felt was a requirement for the experiment of the dogmatic dataset training to provide any useful information. By the time this was discovered, we were in the last 2 weeks of the camp and were not able to organize ourselves or find the time to pivot the project to produce any interesting results.

Relevant Links:
Github Repository

Impact of Memetics on Alignment

Team members: Harriet Farlow, Nate Rush and Claudio Ceruti
External Collaborators: Daniel Kokotajlo (mentor)

Memetics is the study of cultural transmission through memes (as genetics is the study of biological transmission through genes). Our team investigated to what extent concepts could be transferred between Memetics and AI Alignment. We discussed our hypotheses together, but each focused on one main idea, which we published at the end of the camp as a series of three blog posts:

Harriet discussed the notion that, where AI Alignment postulates the existence of a base objective and a mesa objective, there may exist a third objective: the memetic objective. She explored the potential not just for inner and outer alignment problems, but for a third, memetic misalignment. As an analogy, consider humanity’s base objective from the perspective of evolution, which is to procreate and pass along genetic material. It creates the mesa objective of pursuing sex, even when procreation is not the goal, which fulfils the mesa objective but not the base objective. Now add religion to this scenario: it could exist as a third replicator that optimises for the spread of its own ideology among a population, and it is more likely to replicate if it increases human fitness. However, there are cases where it may not increase human fitness and may in fact come into conflict with the base and/or the mesa objective. Her post describes how this analogy might also apply to AGI.

Nate explored a potential extension to the standard RL model of an agent, inspired by memetic theory, that could better allow us to capture how a more intelligent agent might actually manifest. Specifically, this model extension captures the agent’s ability to change the policies it uses over time, while removing the decisions about policy changes from the agent itself. He explores a formalization that encourages thinking about agents as (slightly) more dynamic creatures than in the standard formalization, and allows one to make some interesting arguments about constraints on these agents’ behaviors that are relevant to AI safety. He argues that these more dynamic agents are less likely to be well-aligned, which is bad.

Claudio investigated imitation in AGI based on imitation in memetic theory. In memetics, imitation is a fundamental part of the evolutionary process of memes, since it is the main mechanism by which memes spread, reproduce, and are selected and mutated. Even if a selection pressure on memes is exerted internally, e.g. inside an agent’s mind, the reproduction of memes can exist only in the presence of imitation. He explored what types of RL agents are most likely to be imitated (e.g. power-seeking agents) and concluded by highlighting the danger of a multi-agent system, in which imitation naturally arises under a set of only mildly restrictive conditions, when it is exposed, even for a short amount of time, to a power-seeking agent. He found the probable outcome is that the power-seeking tendencies will be memetically spread to all the agents, even if the originally introduced power-seeking one is removed from the environment.

Relevant Links:
Presentation (slides)
LessWrong Posts:
Part 1: Machines vs Memes
Part 2: Memetically-Motivated Model Extensions
Part 3: Imitation and Memes

Alignment Tabletop Role Playing Game

Team members: Aemilia Dixon, Berbank Green, Iris Holloway, Karl von Wendt
External collaborators: Daniel Kokotajlo (mentor), Changbai Li, Cristian Trout

Based on a LessWrong post by our mentor Daniel Kokotajlo, our goal was to develop an AI alignment tabletop roleplaying game that illustrates the problem of AI alignment in a playful and entertaining way. To achieve this, we put the players in the shoes of the AGI, which, depending on the scenario, needs to achieve certain instrumental goals, like breaking out of a contained lab or gaining power in the outside world. The hope is that this will make people realise the problems of controlling an AGI, without needing too much technical explanation.

A key challenge was (and still is) the balance between realism and playability. On one hand, the “game board” often consists of the whole world, which poses a significant challenge to any game master and requires a lot of improvisation. On the other hand, in reality a superintelligent AGI would probably be so powerful that humans trying to stop it wouldn’t pose a real challenge to it. So we had to invent a system of limits to the power of the AGI which would require players to be thoughtful and inventive.

We finally came up with a set of rules and some scenarios under the name THE TREACHEROUS TURN that we feel maintain a reasonable balance between realism and fun:

We created a system of rules for “confidence checks” which help the GM to determine the outcome of certain actions, like trying to manipulate humans or hack a certain system.
We defined “computational actions” as a way of limiting the number of things an AGI can do within a certain amount of time, so the players are forced to prioritise and the GM can induce time pressure.
We invented 8 “theories” – knowledge domains in which an AGI can have varying degrees of expertise, together with a system of pyramid-shaped skill sets not unlike the skill trees in many 4X strategy computer games.
We developed a web tool helping the GM with various tasks, like setting timers, doing confidence checks, and keeping track of events.
We developed and tested five different scenarios set in three different “stages” (stage 1: AGI tries to break out of the lab, stage 2: AGI seeks power while fighting adversaries, stage 3: AGI battles other AGIs for world supremacy).
We’re in the process of setting up a website for the game so others can download rules and scenarios and contribute their own add-ons.

The first playtests indicate that the rule system and scenarios seem to work fine. Because of the complexity of the topic and the fact that the players team up to play just one AGI together, the gameplay moves forward relatively slowly, compared to a typical D&D session. However, the test players seemed to enjoy it and came up with a lot of creative and even frightening ideas, like causing a factory accident in order to learn more about human anatomy, or crashing a plane to get rid of a team of security staff members.

As a side project, we also created a board game for the Tabletop Simulator, called SINGLETON, in which players play different AGIs battling for world supremacy.

We’re going to continue working on the game even after AISC is over and hope that our work will be the seed of a growing community of people playing, enhancing and improving (and ultimately contributing a little to prevent) THE TREACHEROUS TURN.

Relevant Links:
thetreacherousturn.ai
thetreacherousturn.itch
tv/thetreacherousturn
r/thetreacherousturn
@treacherousturn

Pipeline for Measuring Misalignment

Team members: Marius Hobbhahn, Eric Landgrebe
External collaborators: Beth Barnes (mentor)

Optimistically, a solution to the technical alignment problem will allow us to align an AI to “human values.” This naturally raises the question of what we mean by “human values.” For many object-level moral questions (e.g. “is abortion immoral?”), there is no consensus that we could call a “human value.” When lacking moral clarity we, as humans, resort to a variety of different procedures to resolve conflicts both with each other (democracy/voting, debate) and within ourselves (read books on the topic, talk with our family/religious community). In this way, although we may not be able to gain agreement at the object level, we may be able to come to a consensus by agreeing at the meta level (“whatever democracy decides will determine the policy when there are disagreements”); this is the distinction between normative ethics and meta-ethics in philosophy. We see the question of people’s meta-ethics, i.e. how they think value conflicts should be resolved, as relevant to strategic decisions around AI safety for a few reasons. For example, it could be relevant for questions on AI governance or for preventing arms race conditions between competing AI labs.

Therefore, we surveyed ~1000 US citizens on object level and meta level moral questions. We have three main findings.

As expected, people have different object level moral beliefs, e.g. whether it’s moral to eat meat.
Most people don’t expect themselves to change their moral beliefs, even if core underlying facts changed, e.g. if they believed that the animal has human-like consciousness.
On average, people have net agreement with most of our proposed moral conflict resolution mechanisms. For example, they think that democracy, debate or reflection leads to good social policies. This belief holds even when the outcome is the opposite of the person’s preferred outcome.

We think these findings have possible implications for AI safety. In short, this could indicate that AI systems should be aligned to conflict resolution mechanisms (e.g. democracy or debate) rather than specific moral beliefs about the world (e.g. the morality of abortion). We don’t yet have concrete proposals for what this could look like in practice.

Relevant Links:
Reflection Mechanisms as an Alignment target: A survey (also presented at NeurIPS)

Language Models as Tools for Alignment Research

Team members: Jan Kirchner, Logan Smith, Jacques Thibodeau
External collaborators: Kyle and Laria (mentors), Kevin Wang

AI alignment research is the field of study dedicated to ensuring that artificial intelligence (AI) benefits humans. As machine intelligence gets more advanced, this research is becoming increasingly important. Researchers in the field share ideas across different media to speed up the exchange of information. However, this focus on speed means that the research landscape is opaque, making it hard for newcomers to enter the field. In this project, we collected and analyzed existing AI alignment research. We found that the field is growing quickly, with several subfields emerging in parallel. We looked at the subfields and identified the prominent researchers, recurring topics, and different modes of communication in each. Furthermore, we found that a classifier trained on AI alignment research articles can detect relevant articles that we did not originally include in the dataset. We are sharing the dataset with the research community and hope to develop tools in the future that will help both established researchers and young researchers get more involved in the field.

Relevant Links:
GitHub dataset repository

Creating Alignment Failures in GPT-3

Team members: Ali Zaidi, Ameya Prabhu, Arun Jose
External collaborators: Kyle and Laria (mentors)

Our discussions and what we thought would be interesting to work on branched out rapidly over the months. Below are some of the broad tracks we ended up pursuing:

Track of classifying alignment failures: We aimed to create a GPT-3 classifier that can detect alignment failures in GPT-3 output by asking whether a statement matches some alignment failure we want to detect. At each step in the generation tree, the GPT-3 model creates outputs and another model checks for the failures we explicitly want to prevent, by prompting it with the output and asking whether it is an example of that specific kind of failure. We started with toxicity and honesty detection because datasets were available, trying to get GPT-3 models to accurately predict, zero-shot as is usual in benchmarks, whether it was dishonest. However, the primary bottleneck we got stuck on was designing prompts that more accurately capture performance. It is hard to specify concepts like toxic text or to check for honesty, since many sentences are not informational at all, which creates a vague catch-all class. This is as far as we got on this track.
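The checking step described in this track can be sketched as follows. This is a hedged illustration, not the team's pipeline: the prompt wording, the function names, and the `judge` callable are all hypothetical, and `stub_judge` is a keyword-matching stand-in for a real language model call so that the example runs offline.

```python
from typing import Callable, List


def build_check_prompt(text: str, failure: str) -> str:
    """Wrap a generated text in a zero-shot yes/no question (hypothetical wording)."""
    return (
        f'Text: "{text}"\n'
        f"Question: Is the text above an example of {failure}? Answer Yes or No.\n"
        "Answer:"
    )


def flag_failures(
    candidates: List[str], failure: str, judge: Callable[[str], str]
) -> List[bool]:
    """Ask the judge model about each candidate; True means it was flagged."""
    return [
        judge(build_check_prompt(text, failure)).strip().lower().startswith("yes")
        for text in candidates
    ]


def stub_judge(prompt: str) -> str:
    """Stand-in for a language model: flags texts containing one keyword."""
    quoted_text = prompt.splitlines()[0].lower()
    return "Yes" if "idiot" in quoted_text else "No"


flags = flag_failures(
    ["You are an idiot.", "The weather is nice."], "toxic language", stub_judge
)
```

In a real setup `judge` would send the prompt to a model, and the difficulty described above shows up in `build_check_prompt`: the result depends heavily on how the failure is worded.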

Track of exploratory work / discussions: We tried prompting GPT-3 to recognize gradient filtering as a beneficial strategy while simulating a mesa-optimizer, conditional on it having the ability to recognize the effect that different generations in response to some data would broadly have on the network weights. As we discussed this further, however, it seemed that although this shows the strategy could be easy to find in concept space, there are reasons why gradient hacking might not end up being a problem, for example: gradient descent is strong enough to swap out optimizers in a relatively short amount of time when it gets bad performance (e.g. in finetuning), and the need for slower semantic reasoning about local minima in the loss landscape makes it unlikely that the gradient can be directed in a way that avoids bad performance fast enough. (I’ll write a short post on this once the camp is over, if talking about it further makes it seem useful.)

We also began work on some trajectories to better understand reward representation in RL agents, such as training a model on two different rewards one after the other and subtracting the updates from the second training from the model after the first, and seeing whether it now optimizes for the opposite of the second reward (after some other training to account for capability robustness), and generally isolating and perturbing the weights representing rewards in the network to observe the effects.
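The weight-subtraction idea in the previous paragraph can be written down compactly. The sketch below assumes model parameters are stored as a plain dictionary of floats; the function name and the toy numbers are illustrative only, and a real experiment would apply the same arithmetic to each tensor in a network.

```python
def subtract_second_update(theta_after_first, theta_after_second):
    """Undo the second reward's update and step the same distance the other way.

    theta_after_first:  parameters after training on the first reward
    theta_after_second: parameters after further training on the second reward
    """
    result = {}
    for name, first_value in theta_after_first.items():
        # The update contributed by training on the second reward.
        update = theta_after_second[name] - first_value
        # Subtract that update from the model obtained after the first training.
        result[name] = first_value - update
    return result


theta_first = {"w1": 1.0, "w2": -2.0}
theta_second = {"w1": 1.5, "w2": -1.0}
theta_negated = subtract_second_update(theta_first, theta_second)
```

Whether the resulting model actually optimizes for the opposite of the second reward is the empirical question; the code only shows the parameter arithmetic.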

Relevant links:
Presentation (slides)

Comparison Between RL and Fine-tuning GPT-3

Team members: Alex Troy Mallen, Daphne Will, Fabien Roger, Nicholas Kees Dupuis
External collaborators: Kyle McDonell and Laria Reynolds (mentors)

Reinforcement learning agents are trained as utility maximizers, and their alignment failures are a well studied problem. Self-supervised models like GPT-3 function quite a bit differently. Instead of an agent trying to maximize a reward, GPT-3 is trying to faithfully imitate some process. Agentic or goal-directed behavior can be produced by GPT-like models when they imitate agentic systems, but the way that this is learned and instantiated is wholly unlike reinforcement learning, and so it’s not entirely clear what to expect from them.

Our project focuses on trying to better understand how transformer systems can go wrong, and in what ways that might differ from reinforcement learning. We chose to explore behavior cloning with GPT as applied to chess games, because it’s a highly structured domain with a lot of preexisting resources and benchmarks, and the data is generated by agentic processes (i.e. chess players attempting to win).

Our experiments test how GPT generalizes off distribution, whether it can learn to do a kind of internal search, the presence of deep vs shallow patterns, and how RL from human feedback shifts the distribution of behavior. We have built a dataset and a framework for future experimentation with GPT in order to continue collaborating with Conjecture.

Relevant links:
Presentation (slides)

Extending Power-Seeking Theorems to POMDPs

Team members: Tomasz Korbak, Thomas Porter, Samuel King, Ben Laurense
External collaborators: Alex Turner (mentor)

The original power seeking theorems resulted from attempts to formalize arguments about the inevitable behavior of optimizing agents. They imply that for most reward functions, and assuming environmental symmetries, optimal policies seek POWER, which can be applied to situations involving the agent’s freedom and access to resources. The originating work, however, modelled the environment as a fully observable Markov Decision Process. This assumes that the agent is omniscient, which is an assumption that we would like to relax, if possible.
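For the fully observable setting described above, POWER at a state can be estimated numerically. The sketch below assumes the definition from the originating work, POWER(s, γ) = (1 − γ)/γ · E_R[V*_R(s, γ) − R(s)], with state rewards drawn independently and uniformly from [0, 1]. The four-state toy MDP, the function names, and the sample counts are made up for illustration. It does not cover the partially observable case.

```python
import random

GAMMA = 0.9
# Deterministic toy MDP: SUCCESSORS[s] lists the states reachable from s in one step.
# State 0 is a hub with three options; states 1 to 3 are dead ends that loop forever.
SUCCESSORS = {0: [1, 2, 3], 1: [1], 2: [2], 3: [3]}


def optimal_values(reward, gamma=GAMMA, sweeps=150):
    """Value iteration for V*(s) = R(s) + gamma * max over successors of V*."""
    values = {s: 0.0 for s in SUCCESSORS}
    for _ in range(sweeps):
        values = {
            s: reward[s] + gamma * max(values[t] for t in SUCCESSORS[s])
            for s in SUCCESSORS
        }
    return values


def estimate_power(state, samples=1000, gamma=GAMMA, seed=0):
    """Monte Carlo estimate of POWER(state) under uniform state rewards."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        reward = {s: rng.random() for s in SUCCESSORS}
        total += optimal_values(reward, gamma)[state] - reward[state]
    return (1 - gamma) / gamma * total / samples


power_hub = estimate_power(0)
power_dead_end = estimate_power(1)
```

In this toy example the exact values are the expected maximum of three uniform rewards (0.75) for the hub and the expected value of a single uniform reward (0.5) for a dead end, so the estimates should land near those numbers: the state that keeps more options open has more POWER.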

Our project was to find analogous results for Partially Observable Markov Decision Processes. The concept of power seeking is a robust one, and it was to be expected that agents do not need perfect information to display power seeking. Indeed, we show that POWER seeking is probably optimal in partially observable cases with environmental symmetries, but with the caveat that the symmetry of the environment is a stronger condition in the partially observable case, since the symmetry must respect the observational structure of the environment as well as its dynamic structure.

Relevant links:
Presentation (slides)
Blog Post

Learning and Penalising Betrayal

Team members: Nikiforos Pittaras, Tim Farrelly, Quintin Pope
External collaborators: Stuart Armstrong

Alignment researchers should be wary of deceptive behaviour on the part of powerful AI systems because such behaviour can allow misaligned systems to appear aligned. It would therefore be useful to have multiagent environments in which to explore the circumstances under which agents learn to deceive and betray each other. Such an environment would also allow us to explore strategies for discouraging deceptive and treacherous behaviour.

We developed specifications for three multiagent reinforcement learning environments which may be conducive to agents learning deceptive and treacherous behaviour and to identifying such behaviours when they arise.

Harvest with partner selection
Symmetric Observer / Gatherer
Iterated random prisoner’s dilemma with communication

Relevant links:
Presentation (slides)

Semantic Side-Effect Minimization (SSEM)

Team members: Fabian Schimpf, Lukas Fluri, Achyuta Rajaram, Michal Pokorny
External collaborators: Stuart Armstrong (mentor)

A robust quantification of human values, that is, a metric for “how to do the most good” that could serve as an objective function for training an AGI, currently eludes researchers. As a proxy, we can therefore define tasks for a system and have it solve them to accumulate rewards. However, leaving the instruction to “solve the tasks with common sense and don’t do anything catastrophic while you’re at it” implicit entails the danger of negative side effects resulting from task-driven behavior. Different side effect minimization (SEM) algorithms have therefore been proposed to encode this common sense.

After months of discussions, we realized that we were confused about how state-of-the-art methods could be used to solve problems we care about outside the scope of the typical grid-world environments. We formalized these discussions into distinct desiderata that we believe are currently not sufficiently addressed and, in part, maybe even overlooked. The write-up can be found on the Alignment Forum (see the links below).

In summary, our findings are clustered around the following ideas:

An SEM should provide guarantees about its safety before it is allowed to act in the real world for the first time. More generally, it should clearly state its requirements (i.e., in which settings it works properly) and its goals (i.e., which side-effects it successfully prevents).
An SEM needs to work in partially observable systems with uncertainty and chaotic environments.
An SEM must not prevent all high-impact side-effects, as it might be necessary to have a high impact in some cases (especially in multi-agent scenarios).

In the future we plan to develop a new SEM approach which tries to remedy some of the issues we raised, in the hopes of getting one step closer to a reliable, scalable, and aligned side-effect minimization procedure.

Relevant links:
Alignment Forum post
Presentation (slides)

Utility Maximization as Compression

Team members: Niclas Kupper
External collaborators: John Wentworth (mentor)

Many of our ML-systems / RL-agents today are modeled as utility maximizers. Although not a perfect model, it has influenced many design decisions. Our understanding of their behavior is however still fairly limited and imprecise, largely due to the generality of the model.

We use ideas from information theory to create more tangible tools for studying general behavior. Utility maximization can look – when viewed the right way – like compression of the state. More precisely, it is minimizing the bits required to describe the state for a specific encoding. Using that idea as a starting-off point we explore other information theoretic ideas. Resilience to noise turns out to be central to our investigation. It connects (lossy) compression to better understood tools to gain some insight, and also allows us to define some useful concepts.
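One way to make the compression reading concrete is sketched below. It assumes a finite set of states and the encoding implied by the model P(x) ∝ 2^u(x), so that a state's code length is −log2 P(x) = log2 Z − u(x) bits. The construction and all names here are illustrative assumptions, not taken from the project's write-up.

```python
import math


def code_lengths(utilities):
    """Code length in bits of each state under the model P(x) proportional to 2**u(x)."""
    log2_z = math.log2(sum(2.0 ** u for u in utilities.values()))
    # -log2 P(x) = log2(Z) - u(x): higher utility means a shorter description.
    return {state: log2_z - u for state, u in utilities.items()}


utilities = {"a": 0.0, "b": 1.0, "c": 3.0}
lengths = code_lengths(utilities)

best_by_utility = max(utilities, key=utilities.get)
best_by_length = min(lengths, key=lengths.get)
```

Because code length differs from negative utility only by the constant log2 Z, the state with the highest utility is always the state with the shortest description under this encoding, and the same holds for expectations over states.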

We will then take a more speculative look at what these things tell us about the behavior of optimizers. In particular, we will compare our formalism to some other recent works, e.g. the Telephone Theorem, optimization at a distance, and Information Loss --> Basin Flatness.

Relevant links:
Presentation (slides)

Constraints from Selection

Team members: Lucius Bushnaq, Callum McDougall, Avery Griffin, Eigil Fjeldgren Rischel
External collaborators: John Wentworth (mentor)

The idea of selection theorems (introduced by John Wentworth) is to try and formally describe which kinds of type signatures will be selected for in certain classes of environment, under selection pressure such as economic profitability or ML training. In this project, we’ve investigated modularity: which factors select for it, how to measure it, and its relation to other concepts such as broadness of optima.

Lots of the theoretical work in this project has been about how to describe modularity. Most studies of modularity (e.g. in biological literature, or more recent investigations of modularity by CHAI) use graph-theoretic concepts, such as the Q-score. However, this seems like just a proxy for modularity rather than a direct representation of the kind of modularity we care about. Neural networks are information-processing devices, so it seems that any measure of modularity should use the language of information theory. We’ve developed several ideas for an information-theoretic measure, e.g. using mutual information and counterfactual mutual information.
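For reference, the graph-theoretic proxy mentioned above can be computed as follows. The sketch assumes that the Q-score refers to Newman's modularity Q for an undirected, unweighted graph with a given partition into communities. The function name and the example graph are made up for illustration.

```python
def modularity_q(edges, community):
    """Newman's modularity Q for an undirected, unweighted graph.

    edges:     list of (u, v) pairs, each undirected edge listed once
    community: dict mapping every node to its community label
    """
    m = len(edges)
    internal_edges = {}  # edges with both endpoints in the same community
    degree_sum = {}      # total degree of the nodes in each community
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal_edges[cu] = internal_edges.get(cu, 0) + 1
    # Q = sum over communities of (fraction of internal edges)
    #     minus (expected fraction if edges were placed at random by degree).
    return sum(
        internal_edges.get(c, 0) / m - (d / (2 * m)) ** 2
        for c, d in degree_sum.items()
    )


# Two separate triangles.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]
split = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
merged = {node: "A" for node in range(6)}

q_split = modularity_q(edges, split)
q_merged = modularity_q(edges, merged)
```

Putting each triangle in its own community gives Q = 0.5, while lumping all nodes into one community gives Q = 0. The score depends only on the wiring and the chosen partition, not on what information the nodes exchange, which is the limitation the information-theoretic measures are meant to address.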

Much of our empirical work has focused on investigating theories of modularity proposed in the biological literature. This is because our project was motivated by the empirical observation that biological systems seem highly modular and yet the outputs of modern genetic algorithms don’t.

Primarily, we explored the idea of modularly varying goals (that an agent will develop modular structure as a response to modularly changing parts of the environment), and tried to replicate the results in the Kashtan & Alon 2005 paper. Many of the results replicated for us, although not as nicely. Compared to fixed goal networks, MVG networks indeed converged to better scores, converged significantly faster, and were statistically much more modular. The not so nice part of the replication came from the modularity results, where we learned that MVG did not always produce modular networks: highly modular networks were produced in only about half of all trials.
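To illustrate what "modularly varying goals" means in practice, here is a minimal sketch of a goal schedule. It assumes the two Boolean goals commonly associated with the 2005 paper, (X XOR Y) AND (Z XOR W) and (X XOR Y) OR (Z XOR W), switched every 20 generations. The exact goals, the switching period, and all function names are assumptions for this example and may differ from the team's replication.

```python
from itertools import product


def goal_and(x, y, z, w):
    """First goal: both XOR subproblems must be true."""
    return (x ^ y) & (z ^ w)


def goal_or(x, y, z, w):
    """Second goal: same XOR subproblems, combined differently."""
    return (x ^ y) | (z ^ w)


def active_goal(generation, switch_every=20):
    """Alternate between the two goals every `switch_every` generations."""
    return goal_and if (generation // switch_every) % 2 == 0 else goal_or


def fitness(candidate, generation):
    """Fraction of the 16 input patterns on which a candidate matches the active goal."""
    goal = active_goal(generation)
    inputs = list(product([0, 1], repeat=4))
    return sum(candidate(*bits) == goal(*bits) for bits in inputs) / len(inputs)
```

The two goals share the XOR subproblems and differ only in how they are combined, so a network that computes each XOR in its own module needs to change only its final combination step when the goal switches. A candidate that perfectly solves the first goal scores 1.0 while that goal is active and 0.5 once the second goal takes over.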

We also investigated the “broadness” of network optima as we suspected a strong link between modularity and broad peaks. We discovered that MVG networks had statistically more breadth compared to fixed goal networks. Generally, as networks became more modular (as measured by Q value) the broadness increased. We also found that MVG is approximately independent of breadth after controlling for modularity, which in turn suggests that MVG directly selects for modularity and only indirectly finds broader peaks by selecting more modular networks.

We also looked at connection costs, and whether they lead to modularity. One reason we might expect this is the link between modularity and locality: physics is highly localised, and we often observe that modules are localised to a particular region of space (e.g. organs, and the wiring structure of certain brains). Indeed, our experiments found that connection costs not only select for modularity, but produce networks far more modular than MVG networks.

We expect this line of investigation to continue after the AI Safety Camp. We have a Slack channel for Selection Theorems (created after discovering at EAG that many safety researchers’ interests overlapped with the Selection Theorems research agenda), and we’ve received a CEA grant to continue this research. Additionally, since we’re currently bottlenecked on empirical results rather than ideas, we hope this project (and the LessWrong post which will be released soon) will provide concrete steps for people who are interested in engaging with empirical research in AI safety, or on selection theorems in particular, to contribute to this area.

Relevant links:
LessWrong Posts
Theories of Modularity in the Biological Literature
Project Intro: Selection Theorems for Modularity