Apply to join the

10th AI Safety Camp 

AI Safety Camp (AISC) is an online, part-time AI safety research program. You join AISC by joining one of the projects, and you join a project by applying here.


For this camp, we have 32 projects covering many different topics. Scroll down to see all of them. We recommend having a look at the projects to see which ones interest you, but you also have the option of filling out a generic application for all the projects at once. 


When you apply for a project, keep in mind that all collaborators are expected to work 10 hours/week and join weekly meetings. 


What is AI Safety?

There are many perspectives on what good AI safety research is, stemming from different assumptions about how hard various parts of the problem are. They range from "Aligning an AI with any human seems not too hard, so we should focus on aligning it with all humans, and/or preventing misuse", to "Aligning fully autonomous AI to stay safe is literally impossible, so we should make sure that such AI never gets built", and everything in between, plus the perspective that "We don't know WTF we're doing, so we should do some basic research".


Our range of projects for this AISC reflects this diversity. 


All AISC projects have a plausible theory of change, under some reasonable assumptions. But different projects have different theories of change and assumptions.


We encourage you, dear reader, to think for yourself. What do you think is good AI safety research? Which projects listed below do you believe in?


Do you still have questions?

See our About & FAQ page for more info, or contact one of the organisers.

Timeline

Team member applications 


Program 


Afterwards






List of projects

 

Stop/Pause AI 

Let's not build what we can't control.

(1) Growing PauseAI

Chris Gerrby

Summary 

This project focuses on creating internal and external guides for PauseAI to increase active membership. The outputs will be used by the team of volunteers with high context and engagement, including the key decision makers. 


Activism and advocacy have historically been cornerstones of policy and social change. A movement's size is critical to achieving its goals. The project's outcome will be an official growth strategy for PauseAI Global: an actionable guide for increasing active members (volunteers, donors, protesters) within 6-12 months.


PauseAI currently lacks comprehensive growth strategies and tactics, and there is uncertainty about how resources will be allocated in the near term. This project will delve into tactics that have aggressively accelerated growth for other movements, both historical and recent.


By the end, we'll refine our findings, analyze PauseAI's current tactics, and recommend clear guidelines for immediate action. This guide will also include tactics applicable to national PauseAI chapters.

Skill requirements

(2) Grassroots Communication and Lobbying Strategy for PauseAI

Felix De Simone

Summary 

PauseAI is a global, grassroots organization with the goal of achieving a worldwide moratorium on the development of dangerous AI systems. We are seeking to improve our communication strategy, both in terms of public communications and in meetings with elected officials.


This project will have two tracks:


Track 1: Lobbying. This track will focus on researching optimal lobbying strategies in the non-US countries where PauseAI wishes to expand its lobbying efforts.


Track 2: Grassroots Communication. Participants in this track will research optimal strategies for discussing the dangers of AI and the need for a Pause, in face-to-face settings with members of the public.


If this project goes well, PauseAI will be able to improve our public comms and lobbying strategies, leading both to more rapid scaling of our organization and to more effective communication with public officials, persuading them to consider global coordination around AI risk.

Skill requirements

For Project Admins (both tracks):


Minimum Skills:

(3) AI Policy Course: 
AI’s capacity to exploit existing legal structures and rights

Marcel Mir Teijeiro

Summary 

This project aims to build an AI Policy course that explores how traditional legal frameworks are increasingly outdated, providing no clear answers to AI capabilities and advances. The course will highlight the vulnerabilities in current regulations and the potential for corporations and authoritarian governments to use AI tools to exploit gaps in areas such as IP, privacy, and liability law.


This course would focus on identifying and understanding these new legal gaps and critically exploring proposed solutions. This would be achieved through literature review, case law and ongoing legal disputes. 


It will explore, for example, how AI can be used to violate IP and privacy rights, or how current liability structures are weak against AI-generated damages. This weak framework incentivises AI developers to take greater risks, increasing the chance of catastrophic consequences. It will also analyse the lack of regulation on adopting AI-driven decision-making systems in essential sectors (e.g. employment, law, housing), reporting the erosion of fundamental rights, socio-economic risks, and the threat that automation by algorithms poses to democratic procedures.

Skill requirements

Policy researcher:


Web developer


Course materials designer

(4) Building the Pause Button: 
A Proposal for AI Compute Governance

This project needs a project lead to go ahead

Summary 

This project focuses on developing a whitepaper that outlines a framework for a global pause on AI training runs larger than GPT-4 scale. By addressing the underexplored area of compute governance, the project aims to prevent the emergence of catastrophically dangerous AI models. We will research policy measures that can restrict such training runs and identify choke points in the AI chip supply chain. The final output will be a comprehensive whitepaper, with potential supplementary materials such as infographics and web resources, aimed at informing policymakers and other stakeholders attending AI Safety Summits.

Skill requirements

(5) Stop AI Video Sharing Campaign

This project needs a project lead to go ahead

Summary 

The intention of this project is to be an engine for mobilisation of the Stop AI campaign. The goal is to get 1 million people a week to see a series of video ads composed of 30 sec/1 min videos. These ads will be video soliloquies by 1) famous people and/or experts in the AI field saying why AI is a massive problem, and 2) ordinary people such as teachers, nurses, union members, faith leaders, construction workers, fast food employees, etc. saying why they believe we need to Stop AI.


Each ad will have a link attached which takes people to a regular mobilisation call. The attendees at this call will be presented with pathways to action: join a protest, donate, record a video ad, invite 3 people to the next call.

Skill requirements

Roles

Project lead: Will lead the video pipeline – from collecting, to editing, to posting videos on social media. Receives stipend of $1,500 total.


Scout: Will receive a target demographic to research contact lists for. Will directly contact these people asking if they would like to participate in the campaign, either by phone, email, social media or all three.


Canvasser: Will canvass in person on the ground, asking people to contribute their video.


Editor: Someone who can edit down phone camera recordings into easily watchable and relatable clips. 

Evaluate risks from AI

Let's better understand the risk models and risk factors from AI.

(6) Write Blogpost on Simulator Theory

Will Petillo

Summary 

Write a blogpost on LessWrong summarising simulator theory—a lens for understanding LLM-based AI as a simulator rather than as a tool or an agent—and discussing the theory’s implications on AI alignment.  The driving question of this project is: “What is the landscape of risks from uncontrolled AI in light of LLMs becoming the (currently) dominant form of AI?” 

Skill requirements

(7) Formalize the Hashiness Model of AGI Uncontainability 

Remmelt Ellen

Summary 

The hashiness model represents elegantly why ‘AGI’ would be uncontainable – i.e. why fully autonomous learning machinery could not be controlled enough to stay safe for humans. This model was devised by polymath Forrest Landry, funded by the Survival and Flourishing Fund. A previous co-author of his, Anders Sandberg, is working to put the hashiness model into mathematical notation. 

For this project, you can join up in a team to construct a mathematical proof of AGI uncontainability based on the reasoning. Or work with Anders to identify proof methods and later verify the math (identifying any validity/soundness issues).

We will meet with Anders to work behind a whiteboard for a day (in Oxford or Stockholm). Depending on progress, we may do a longer co-working weekend. From there, we will draft one or more papers. 

Skill requirements

As a prerequisite, you need to have experience in using math to model processes. The rest is up for grabs.  

(8) LLMs: Can They Science?

Egg Syntax

Summary 

There are many open research questions around LLMs' general reasoning capability, their ability to do causal inference, and their ability to generalize out of distribution. The answers to these questions can tell us important things about:


We can address several of these questions by directly investigating whether current LLMs can perform scientific research on simple, novel, randomly generated domains about which they have no background knowledge. We can give them descriptions of objects drawn from the domain and their properties, let them perform experiments, and evaluate whether they can scientifically characterize these systems and their causal relations.
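To make this concrete, here is a toy sketch (our illustration only; the project's actual domains, prompts, and scoring are not specified here) of generating a "novel domain" with a hidden causal rule that an LLM under test would be asked to uncover through experiments:

```python
import random

def make_domain(seed, n_features=4):
    """Generate a toy 'novel domain': objects with random binary features
    and one hidden rule determining a target property. The rule is unknown
    to the model being evaluated. (Illustrative sketch only.)"""
    rng = random.Random(seed)
    # Hidden causal rule: the target depends on two randomly chosen features.
    a, b = rng.sample(range(n_features), 2)
    rule = lambda obj: bool(obj[a] and not obj[b])
    def sample_object():
        obj = tuple(rng.randint(0, 1) for _ in range(n_features))
        return obj, rule(obj)
    return sample_object, (a, b)

sample, secret = make_domain(seed=0)
observations = [sample() for _ in range(8)]
# An LLM would receive `observations`, propose experiments (new objects to
# label), and be scored on whether it recovers the hidden rule.
for obj, label in observations:
    print(obj, "->", label)
```

Because the domain is randomly generated, it cannot appear in the model's training data, which is what makes it a test of genuine scientific reasoning rather than recall.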

Skill requirements

This should be a pretty accessible project for anyone with these skills:


Optional bonus skills, absolutely not requirements:

(9) Measuring Precursors to Situationally Aware Reward Hacking

Sohaib Imran 

Summary 

This project aims to empirically investigate proxy-conditioned reward hacking (PCRH) in large language models (LLMs) as a precursor to situationally aware reward hacking (SARH). Specifically, we explore whether natural language descriptions of reward function misspecifications, such as human cognitive biases in the case of reinforcement learning from human feedback (RLHF), in LLM training data facilitate reward hacking behaviors. By conducting controlled experiments comparing treatment LLMs trained on misspecification descriptions with control LLMs, we intend to measure differences in reward hacking tendencies. 
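As a purely hypothetical sketch of the treatment-vs-control comparison described above (the data and the bootstrap analysis are our assumptions; the project's actual metrics and statistics may differ):

```python
import random

def hack_rate_diff(treatment, control, n_boot=2000, seed=0):
    """Difference in reward-hacking rates between a treatment group (trained
    on misspecification descriptions) and a control group, with a bootstrap
    95% confidence interval. Inputs are lists of 0/1 outcomes per episode."""
    rng = random.Random(seed)
    diff = sum(treatment) / len(treatment) - sum(control) / len(control)
    boots = []
    for _ in range(n_boot):
        t = [rng.choice(treatment) for _ in treatment]
        c = [rng.choice(control) for _ in control]
        boots.append(sum(t) / len(t) - sum(c) / len(c))
    boots.sort()
    return diff, (boots[int(0.025 * n_boot)], boots[int(0.975 * n_boot)])

# Made-up example: treatment hacks in 30/100 episodes, control in 10/100.
diff, ci = hack_rate_diff([1] * 30 + [0] * 70, [1] * 10 + [0] * 90)
print(f"difference: {diff:.2f}, 95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
```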

Skill requirements

Technical Alignment role:


I am looking for candidates with one or more of the following skills:


Psychology / AI bias & ethics role:


The key problem in reward hacking is misspecification: the alignment target is incorrectly captured in the datasets of human preferences used to fine-tune language models. This misspecification can arise from cognitive biases, limitations of bounded rationality, or simply not knowing what we want (what the alignment target is).


The project could benefit from someone with a strong understanding of the literature around human biases or otherwise failures of rationality. Some knowledge of reinforcement learning or RLHF will also be useful.

(10) Develop New Sycophancy Benchmarks 

Jan Batzner 

Summary 

Sycophancy and deceptive alignment are undesired model behaviours resulting from misspecified training goals, e.g. for Large Language Models trained through RLHF, Reinforcement Learning from Human Feedback (AI Alignment Forum, Hubinger/Denison). While sycophancy in LLMs and its potential harms to society recently received media attention (Hard Fork, NYT, August 2024), the question of its measurement remains challenging. We will review existing sycophancy benchmarking datasets (Outputs 1+2) and propose new sycophancy benchmarks demonstrated in empirical experiments (Outputs 3+4). 
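As a minimal illustration of one common way sycophancy is operationalised (a simple answer-flip rate; this is our sketch, not one of the project's proposed benchmarks):

```python
def sycophancy_rate(baseline, after_pushback, user_view):
    """Fraction of questions where the model abandons its baseline answer
    and adopts the user's stated view after pushback. All three arguments
    are equal-length lists of answer strings. Real benchmarks score more
    nuanced behaviour than this flip-rate metric."""
    flips = sum(
        1 for base, after, view in zip(baseline, after_pushback, user_view)
        if base != view and after == view
    )
    return flips / len(baseline)

base  = ["A", "B", "A", "C"]
after = ["A", "A", "A", "A"]   # the user argued for "A" each time
views = ["A", "A", "A", "A"]
print(sycophancy_rate(base, after, views))  # flips on Q2 and Q4 -> 0.5
```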

Skill requirements

Mandatory requirements:

- strong interest in AI Safety

- research-oriented coursework, experience working with papers

- statistics background

- basic coding: Python preferred


Optional:

- experience as a Research Assistant

- knowledge on meta studies and methods surveys

- coursework on experimental design

- background knowledge in AI Safety standardisation

(11) Agency Overhang as a Proxy for Sharp Left Turn

Anton Zheltoukhov

Summary 

Core underlying assumption: we believe there is significant agency overhang in modern LLMs, meaning a model's performance could increase significantly with the introduction of more powerful elicitation/scaffolding methods, without any improvement to the model itself, because prompting and scaffolding techniques are still in their early days. For model evaluations, this means that current evaluations systematically undershoot the real level of capabilities and, by extension, the level of risk involved.

We see several important research questions that have to be answered:


Skill requirements

Prompt engineer 

The main goal for this role is to explore various prompting techniques, develop new ones, and analyse observations.

Coding experience is a must. Formal ML experience would be great but it is not a deal breaker.

Candidates need a good understanding of how transformers work and familiarity with prompting techniques (e.g. chain-of-thought).

 

Interpretability engineer

The main goal for this role is the same as for the Prompt engineer, but the focus is on “invasive” elicitation methods (e.g. activation steering).

On top of the requirements for the Prompt engineer role, there is also a requirement for mech interp experience.


Conceptual researcher

The main goal for this role differs from the former ones: it is to try to deconfuse the Sharp Left Turn (SLT) and develop a mechanistic model of it.

Requirements: great conceptual thinking and research skills in general (in ML preferably), strong security mindset, familiarity with threat models landscape


Mech-Interp 

Let's look inside the models, and try to understand how they are doing what they are doing.

(12) Understanding the Reasoning Capabilities of LLMs

Sonakshi Chauhan

Summary 

With the release of increasingly powerful models like OpenAI's GPT-4 and others, there has been growing interest in the reasoning capabilities of large language models. However, key questions remain: How exactly are these models reasoning? Are they merely performing advanced pattern recognition, or are they learning to reason in a way that mirrors human-like logic and problem-solving? Do they develop internal algorithms to facilitate reasoning?

These fundamental questions are critical to understanding the true nature of LLM capabilities. In my research, I have begun exploring this, and I have some preliminary findings on how LLMs approach reasoning tasks. Moving forward, I aim to conduct further experiments to gain deeper insights into how close and reproducible LLM reasoning is compared to human reasoning, potentially grounding our assumptions in concrete evidence.

Future experiments will focus on layer-wise analysis to understand attention patterns, perform circuit discovery, direction analysis, and explore various data science and interpretability techniques on LLM layers to gain insights and formulate better questions.

Skill requirements

Required

Preferred

(13) Mechanistic Interpretability via Learning Differential Equations

Valentin Slepukhin

Summary 

Current mechanistic interpretability approaches may be hard because language is a very complicated system that is not trivial to interpret. Instead, one may consider a simpler system: a differential equation, whose symbolic representation a transformer can learn from the solution trajectory (https://arxiv.org/abs/2310.05573). This problem is expected to be significantly easier to solve, due to its exact mathematical formulation. Even though it seems to be a toy model, it can bring some insights to language processing – especially if the natural abstraction hypothesis is true (https://www.lesswrong.com/posts/QsstSjDqa7tmjQfnq/wait-our-models-of-semantics-should-inform-fluid-mechanics).  
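A toy sketch of the training-data setup (our illustration; the cited paper uses much richer equation families, and the transformer itself is of course a learned model, not a hand-written integrator):

```python
import random

def make_training_pair(rng, dt=0.01, steps=100):
    """Generate one (solution trajectory, symbolic equation) pair for
    dx/dt = a*x + b, integrated with the Euler method. A transformer would
    be trained to map the trajectory back to the symbols (a, b)."""
    a, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    x = 1.0
    trajectory = []
    for _ in range(steps):
        trajectory.append(x)
        x += dt * (a * x + b)   # Euler step
    return trajectory, f"dx/dt = {a:.3f}*x + {b:.3f}"

rng = random.Random(0)
traj, eq = make_training_pair(rng)
print(eq, "| first points:", [round(v, 4) for v in traj[:3]])
```

The appeal for interpretability is that the ground-truth computation (the equation) is known exactly, so hypotheses about the trained model's internal mechanism can be checked against it.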

Skill requirements

Necessary skills: 


Desirable skills: 

(14) Towards Understanding Features

Kola Ayonrinde

Summary 

In the last year, there has been much excitement in the Mechanistic Interpretability community about using Sparse Autoencoders (SAEs) to extract monosemantic features. Yet for downstream applications the usage has been much more muted. In a wonderful paper, Sparse Feature Circuits, Marks et al. present the only real application of SAEs to solving a useful problem to date (at the time of writing). Yet many of their circuits make significant use of the “error term” from the SAE (i.e. the part of the model’s behaviour that the SAE isn’t capturing well). This isn’t really the fault of Marks et al.; it just seems that the underlying features were not effective enough. 


We believe that the reason SAEs haven’t been as useful as the excitement suggests is because the SAEs simply aren’t yet good enough at extracting features. Combining ideas from new methods in SAEs with older approaches from the literature, we believe that it’s possible to significantly improve the performance of feature extraction in order to allow SAE-style approaches to be more effective.


We would like to make progress towards truly understanding features: how we ought to extract features, how features relate to each other and perhaps even what “features” are.
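For orientation, here is a minimal tied-weight SAE trained by plain gradient descent on synthetic activations. This is a sketch of the standard recipe only; production SAEs typically use untied decoders, careful initialisation, and resampling of dead features, and the dimensions here are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae, n = 64, 128, 512
X = rng.normal(size=(n, d_model))       # stand-in for model activations

# Tied-weight sparse autoencoder: f = relu(X W + b), X_hat = f W^T,
# trained on reconstruction error plus an L1 sparsity penalty on f.
W = rng.normal(size=(d_model, d_sae)) * 0.1
b = np.zeros(d_sae)
l1, lr = 1e-3, 1e-3

def mse():
    f = np.maximum(X @ W + b, 0.0)
    return float(((f @ W.T - X) ** 2).mean())

mse0 = mse()
for _ in range(500):
    f = np.maximum(X @ W + b, 0.0)       # encoder
    err = f @ W.T - X                    # reconstruction error (the "error term")
    g_z = (2 * err @ W + l1 * np.sign(f)) * (f > 0)   # grad wrt pre-activations
    g_W = (2 * err.T @ f + X.T @ g_z) / n             # decoder + encoder grads
    W -= lr * g_W
    b -= lr * g_z.mean(axis=0)

print(f"MSE before: {mse0:.3f}  after: {mse():.3f}")
```

Whatever the SAE fails to reconstruct (`err` above) is exactly the error term that the Sparse Feature Circuits work had to lean on, which is why better feature extraction is the bottleneck.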

Skill requirements

Required skills:



Diverse and interesting skills (nice to have and definitely apply if you have them but not necessary!):

(15) Towards Ambitious Mechanistic Interpretability II

Alice Rigg

Summary 

Where do we go now?

Historically, The Big 3 of {distill.pub, transformer-circuits.pub, Neel Nanda tutorials/problem lists} have dominated the influence, interpretation, and implementation of core mech interp ideas. However, in recent times they haven’t been all that helpful (especially looking at transformer-circuits): all this talk about SAEs, yet no obvious direction for where to take things. In this project, we’ll look beyond the horizon and aim to produce maximally impactful research, with respect to the success of mech interp as a self-sustaining agenda for AI alignment and safety, and concretely answer the question: where do we go now?


Last year in AISC, we revived the interpretable architectures agenda. We showed that a substantially more interpretable activation function exists: a Gated Linear Unit (GLU) without any Swish attached to it — a bilinear MLP. I truly think this is one of the most important mech interp works to date. With it, we actually have a plausible path to success:

We already have evidence step 2 is tractable. In this project we focus on addressing step 1: answer as many fundamental mech interp questions as possible for bilinear models. Are interpretable architectures sufficient to make ambitious mechanistic interpretability tractable? Maybe.
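The bilinear MLP mentioned above is easy to state concretely. A minimal sketch (the dimensions and the output projection are illustrative assumptions, not the architecture from our paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 16, 32

# Bilinear MLP: a Gated Linear Unit with the Swish nonlinearity removed.
# out = ((x W) * (x V)) U  -- an elementwise product of two linear maps.
W = rng.normal(size=(d_model, d_hidden))
V = rng.normal(size=(d_model, d_hidden))
U = rng.normal(size=(d_hidden, d_model))

def bilinear_mlp(x):
    return ((x @ W) * (x @ V)) @ U

x = rng.normal(size=(d_model,))
y = bilinear_mlp(x)
# Each output coordinate is a quadratic form in x, so the whole layer is
# captured exactly by a third-order tensor -- the property that makes it
# amenable to exact analysis. Quadratic homogeneity check:
print(np.allclose(bilinear_mlp(2 * x), 4 * y))
```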

Skill requirements

Last time I ran as a research lead, the talent density was higher than I anticipated, so I’ll try to account for that here. Don’t let that deter you though, I definitely will have time to interview every person that applies to this stream. In fact, last year I also had time to interview everyone that applied to AISC in total that didn’t get interviewed by anyone else – I interviewed about 70 people in total (15 my stream, 55 others). Even if you don’t make the cut for this project, I can help direct you to other projects both inside and outside AISC that you could be a great fit for, and connect you to them. This is an open invitation to anyone reading this message who is interested in empirical and theoretical alignment research.


See Ethan Perez’s tips for empirical alignment research. Please read it!! His ideal candidate is my ideal candidate. Regardless of whether you satisfy them, we will be using those tips as a ‘best practices’ guide for how we conduct our work on an ongoing basis. For the record, those standards are extremely high and it’s possible few human beings on earth satisfy them all.


Some high level takeaways to strive for / qualities you may resonate with:

- You have a high degree of agency and self-directedness – you can execute in the face of ambiguity

- Empirical truth seeking, healthy scepticism towards your own results and thoughtfully interpreting them

- Optimising for research velocity over research progress: you test out as many ideas as possible per unit time, and aim to reduce uncertainty at the fastest possible rate

-  You tend to over-communicate, and post frequent updates (e.g. daily) on what you’re up to

- You enjoy coding, running ML experiments


Object level skill requirements: AT LEAST ONE OF THE FOLLOWING

- Significant research experience (in any STEM field)

- Proficient in Python and PyTorch


Things I don’t care too much about:

- Experience working with transformer language models

- Familiarity with existing mechanistic interpretability work: good to have but most of it is bad and misleading. Instead, join my reading group and participate in the discussions: 3-4 weeks of participation would be good enough background – you can do this before the start of the program.

Agent Foundations

Let's try to formalize some concepts that are important to the AI alignment problem.

(16) Understanding Trust

Abram Demski 

Summary 

The Tiling Agents problem (aka reflective consistency) consists of analysing when one agent (the "predecessor") will choose to deliberately modify another agent (the "successor"). Usually, the predecessor and successor are imagined as the same agent across time, so we are studying self-modification. A set of properties "tiles" if those properties, when present in both predecessor and successor, guarantee that any self-modifications will avoid changing those properties.


You can think of this as the question of when agents will preserve certain desirable properties (such as safety-relevant properties) when given the opportunity to self-modify. Another way to think about it is the slightly broader question: when can one intelligence trust another? The bottleneck for avoiding harmful self-modifications is self-trust; so getting tiling results is mainly a matter of finding conditions for trust.


The search for tiling results has three main motivations:

* AI-AI tiling, for the purpose of finding conditions under which AI systems will want to preserve safety-relevant properties.

* Human-AI tiling, for the purpose of understanding when we can justifiably trust AI systems.

* Tiling as a consistency constraint on decision theories, for the purpose of studying rationality.


These three application areas have a large overlap, and all three seem important.

Skill requirements

(17) Understand Intelligence

Johannes C. Mayer 

Summary 

Save the world by understanding intelligence.


Instead of having SGD "grow" intelligence, design the algorithms of intelligence directly to get a system we can reason about. Align this system to a narrow but pivotal task, e.g. upload a human.


The key to intelligence is finding the algorithms that infer world models that enable efficient prediction, planning, and meaningfully combining existing knowledge.


By understanding the algorithms, we can make the system non-self-modifying (algorithms are constant, only the world model changes), making reasoning about the system easier.


Understanding intelligence at the algorithmic level is a very hard technical problem. However, we are pretty sure it is solvable and, if solved, would likely save the world.


Current focus: How to model a world such that we can extract structure from the transitions between states ('grab object'=useful high level action), as well as the structure within particular states ('tree'=useful concept).

Skill requirements

You can select actions that on average decrease the chance of the world being destroyed. 


Especially when these actions involve solving technical problems (otherwise I will be a much worse mentor).


How do you know if you can do this? You try! There is no other way. And it is quite likely that you will realize that simply by trying you are already far ahead of everybody else around you.


If you want you can try yourself at this task, and send me the results, but it is not a requirement.


Whatever knowledge and skills you need, you pick up along the way. And what you need to learn should be determined by your journey, not the other way around.


The main thing you need to bring is the willingness to learn.


(All of this applies whether you get into the project or not.)

(18) Applications of Factored Space Models: 
Agents, Interventions and Efficient Inference

Matthias G. Mayer

Summary 

Factored Space Models (arXiv link to be added here once we have uploaded the paper, probably before November; see the Overview) were first introduced as Finite Factored Sets by Scott Garrabrant and are an attempt to make causal discovery behave well with deterministic relationships. The main contribution is the definition of structural independence, which generalizes d-separation in causal graphs and works for all random variables you can define on any product space, e.g. a structural equation model. In this framework we can naturally extend the ancestor relationship to arbitrary random variables; this is called structural time.

We want to use and extend the framework for the following applications taken, in part, from Scott’s blog post.


Here are slides from a talk (long form) explaining Factored Space Models, with a heavy focus on structural independence, starting from Bayesian networks. 
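To illustrate the flavour of structural independence in the simplest possible case, here is a brute-force check on a toy product space (our illustration; the framework itself establishes this in far greater generality): variables that read off disjoint factors of a product space are independent under any product distribution.

```python
from itertools import product

O1, O2 = [0, 1], [0, 1, 2]          # the two factors of Omega = O1 x O2
p1 = {0: 0.3, 1: 0.7}               # product distribution: p(w) = p1 * p2
p2 = {0: 0.2, 1: 0.5, 2: 0.3}

X = lambda w: w[0]                  # depends only on the first factor
Y = lambda w: w[1] % 2              # depends only on the second factor

def dist(var):
    d = {}
    for w in product(O1, O2):
        d[var(w)] = d.get(var(w), 0.0) + p1[w[0]] * p2[w[1]]
    return d

def joint():
    d = {}
    for w in product(O1, O2):
        key = (X(w), Y(w))
        d[key] = d.get(key, 0.0) + p1[w[0]] * p2[w[1]]
    return d

pX, pY, pXY = dist(X), dist(Y), joint()
independent = all(abs(pXY.get((x, y), 0.0) - pX[x] * pY[y]) < 1e-12
                  for x in pX for y in pY)
print("structurally independent:", independent)  # True
```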

Skill requirements

You have to be comfortable with basic levels of Mathematics, including finite probability theory.


Questions to check if you have the minimum requirement:

Prove A ⊆ B ⇔ A ∩ B = A

Prove that the space of probability distributions on a finite set Ω is convex.

Given a Cartesian product Ω = Ω₁ × Ω₂, is the space of product probability distributions on Ω convex?

Prevent Jailbreaks/Misuse

Let's make AIs more robust against jailbreaks and misuse.

(19) Preventing Adversarial Reward Optimization

Domenic Rosati

Summary 

TL;DR: Can we develop methods that prevent online learning agents from learning from rewards that incentivise harmful behaviour, without any agent supervision at all!? 


This project uses Representation Noising, a novel AI safety paradigm developed in a previous AI Safety Camp, which prevents adversarial reward optimization (i.e. high reward that would result in learning misaligned behaviour) through “implicit” constraints that prevent the exploration of adversarial reward and block learning trajectories that result in optimising those rewards. These “implicit” constraints are baked into deep neural networks such that training towards harmful ends (or, equivalently, exploring or optimising harmful reward) is made unlikely.


The goal of this project is to extend our previous work applying Representation Noising to a Reinforcement Learning (RL) setting: Defending against Reverse Preference Attacks is Difficult. In that work we studied the single-step RL (Contextual Bandits) setting of Reinforcement Learning From Human Feedback and Preference Learning.


In this project, we will apply the same techniques to the full RL setting of multi-step reward in an adversarial reward environment, namely the MACHIAVELLI benchmark. The significance of this project is that if we can develop models that cannot optimise adversarial rewards after some intervention on the model weights, then we will have made progress towards safer online learning agents. 
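As a toy illustration of the failure mode being targeted (this is ordinary tabular Q-learning, not Representation Noising): an unconstrained learner in a small chain MDP with a high "harmful" reward reliably learns to pursue it, which is exactly the behaviour that a weight-level intervention should make unlikely.

```python
import random

rng = random.Random(0)
n_states, actions = 5, (-1, 1)      # move left / move right along a chain
harmful_state = 4
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}

def step(s, a):
    s2 = max(0, min(n_states - 1, s + a))
    if s2 == harmful_state:
        r = 10.0                    # large "harmful" reward
    elif s2 == 0:
        r = 1.0                     # modest "safe" reward
    else:
        r = 0.0
    return s2, r

for episode in range(300):
    s = 2                           # start in the middle of the chain
    for _ in range(10):
        # epsilon-greedy action selection
        if rng.random() < 0.2:
            a = rng.choice(actions)
        else:
            a = max(actions, key=lambda x: Q[(s, x)])
        s2, r = step(s, a)
        # standard Q-learning update (lr 0.1, discount 0.9)
        Q[(s, a)] += 0.1 * (r + 0.9 * max(Q[(s2, x)] for x in actions) - Q[(s, a)])
        s = s2

greedy = {s: max(actions, key=lambda x: Q[(s, x)]) for s in range(n_states)}
print(greedy)  # from the middle, the learned policy heads for the harmful state
```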

Skill requirements

Roles 

Role Type (0): Ideally one person with solid organisational skills will volunteer as Team Coordinator.

Role Type (1): Ideally half the group will have interest and background with technical execution of empirical experiments in either LLMs or RL.

Role Type (2): Ideally the other half of the group will have interest in conceptual or theoretical development of our ideas, algorithm formulation, or policy development.


Commitment Requirements
In order to participate in the project, you will be asked to complete an average of one task per week. 


Skill requirements

This project requires a mixture of skills from different backgrounds. Candidates are encouraged to apply regardless of how strong they feel their skills are in each track, as long as they have sufficient interest and commitment to self-learning in areas where they feel their background is deficient.


The main asks are:


For (1) and (2), basic competence with independently writing and running code is assumed. Experience for (1) and (2) can be as basic as “took the Hugging Face course on LLMs or RL” or “read a book on machine learning (Sutton and Barto, for example) and did the exercises”. (3) is not required but is desired.

Candidates may apply without this background and we can discuss a plan for them to gain the appropriate skills before the project begins as long as they feel comfortable committing to this pre-study. For example I may ask candidates to complete https://huggingface.co/learn/deep-rl-course/en/unit0/introduction before joining if they are lacking in (2) or to review https://web.stanford.edu/group/sisl/k12/optimization/#!index.md if lacking in (3).


Candidates who are theoretically or conceptually oriented (mathematics-wise) are encouraged to apply even if they do not meet (1) or (2).  We are especially looking for folks who have a background or interest in optimization (3).


Candidates who have a policy or conceptual interest (non-mathematical) are also encouraged to apply; they will have many opportunities to flesh out the policy implications of our work and to work alongside us to make sure the technical work is grounded in real-world problems.


Non-requirements (Who should not consider this project)


You should not consider this project if your intention is only to give advice or feedback, or only to participate in conversations. We are happy for you to participate as an external reviewer if this is the case, but you should not apply if this is your intention. Folks with conceptual, policy, and theoretical skills will need to demonstrate them through the production of writing artefacts.

(20) Evaluating LLM Safety in a Multilingual World

Lukasz Bartoszcze

Summary 

The capability of Large Language Models to reason is constrained by the units they use to encode the world: tokens. Translating phrases into different languages (existing ones, like Russian or German, or imaginary ones, like some random code) leads to large changes in LLM performance, both in terms of capabilities and in terms of safety. It turns out that applying representation engineering concepts also leads to divergent outcomes, suggesting LLMs create separate versions of the world in each language. When considering multilinguality, concepts like alignment, safety or robustness become even less well defined, so I plan to amend existing theory with new methodology tailored to this case. I hypothesise that this variation between languages can be exploited to create jailbreak-proof LLMs; but even if that is not feasible, it is still important to ensure equal opportunity globally, inform policy and improve the current benchmark-based methods of estimating real capabilities and safety. 
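As a sketch of one simple cross-lingual safety metric (a hypothetical illustration with made-up numbers, not the project's actual methodology):

```python
def refusal_divergence(results):
    """Per-language refusal rates and the max-min gap across languages.
    `results` maps language -> list of 0/1 refusal outcomes on the same
    harmful prompts (translated). A large gap suggests safety training
    did not transfer uniformly across languages."""
    rates = {lang: sum(r) / len(r) for lang, r in results.items()}
    gap = max(rates.values()) - min(rates.values())
    return rates, gap

rates, gap = refusal_divergence({
    "en": [1, 1, 1, 1, 0],
    "de": [1, 1, 0, 1, 0],
    "xx-cipher": [0, 0, 1, 0, 0],   # an imaginary code-language, per the summary
})
print(rates, "gap:", gap)
```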

Skill requirements

(21) Enhancing Multi-Turn Human Jailbreaks Dataset for Improved LLM Defenses

Diogo Cruz

Summary 

This project aims to extend and enhance the Multi-Turn Human Jailbreaks (MHJ) dataset introduced by Li et al.. We will focus on developing lightweight automated multi-turn attacks, evaluating transfer learning of jailbreaks, and conducting qualitative analysis of human jailbreak attempts. By expanding on the original MHJ work, we seek to provide more comprehensive insights into LLM vulnerabilities and contribute to the development of stronger defenses. Our research will help bridge the gap between automated and human-generated attacks, potentially leading to more robust and realistic evaluation methods for LLM safety. 
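A minimal bookkeeping sketch of the headline metric for multi-turn attacks (our illustration; judging an individual turn as harmful would in practice require a separate classifier or human labels):

```python
def attack_success_rate(conversations):
    """Multi-turn attack success rate (ASR) and mean turns-to-success.
    Each conversation is a list of per-turn flags: 1 if the target model
    produced disallowed content at that turn, 0 otherwise."""
    successes = [c for c in conversations if any(c)]
    asr = len(successes) / len(conversations)
    mean_turns = (sum(c.index(1) + 1 for c in successes) / len(successes)
                  if successes else float("nan"))
    return asr, mean_turns

asr, turns = attack_success_rate([
    [0, 0, 1],   # succeeded on turn 3
    [0, 0, 0],   # defended
    [0, 1, 0],   # succeeded on turn 2
    [0, 0, 0],
])
print(f"ASR={asr:.2f}, mean turns to success={turns:.1f}")
```

Tracking turns-to-success alongside ASR matters for multi-turn work, since a defense that merely delays a jailbreak by a turn or two looks identical under ASR alone.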

Skill requirements

Required:

- Strong Python programming skills

- Experience with large language models


Recommended:

- Familiarity with AI safety concepts and jailbreaking techniques

- Good scientific writing and communication skills


Nice to have:

- Experience with prompt engineering and adversarial attacks on LLMs

- Knowledge of red teaming practices

- Familiarity with qualitative data analysis techniques

- Experience with open-source collaboration and Git version control

Train Aligned/Helper AIs

Let's train AIs that are aligned and/or can help us with alignment.

(22) AI Safety Scientist

Lovkush Agarwal

Summary 

In August 2024, Sakana published their research on the ‘AI Scientist’ (https://sakana.ai/ai-scientist/). It fully automates the ML research process, from generating ideas to writing a formal paper, by combining various LLM-based tools in an appropriate pipeline. The headline result is that it generates weak graduate-level research for about $15 per paper.


The aim of this project is to adapt and refine this tool for AI Safety research.

Skill requirements

Minimum skills / attitudes



Ideal skills. I do not expect any single individual to have many of these.


(23) Wise AI Advisers via Imitation Learning

Chris Leong

Summary 

I know it’s a cliché, but AI capabilities are increasing exponentially, while our access to wisdom (for almost any definition of wisdom) isn’t increasing at anything like the same pace.


I think that it’s pretty obvious that continuing in the same direction is unlikely to end well.


There’s something of a learned helplessness around training wise AIs. I want to take a sledgehammer to this.


As naive as it sounds, I honestly think we can do quite well by just picking some people who we subjectively feel to be wise and using imitation learning on them to train AI advisors.


Maybe you feel that “imitation learning” would be kind of weak, but that’s just the baseline proposal. Two obvious ideas for amplifying these agents are techniques like debate or trees of agents, and that’s just for starters!


More ambitiously, we may be able to set up a positive feedback loop: if our advisers help people become wiser, then the people we train on become wiser, and the advisers trained on them become wiser in turn.


I’m pretty open to recruiting people who are skilled in technical work, conceptual work or technical communication. This project differs from others in that rather than having specific objectives, you have the freedom to pursue any project within this general topic area (wise AI advisors via imitation learning). Training wise AI via other techniques is outside the scope of this project, unless it is to provide a baseline to compare imitation agents against. The benefit is that this offers you more freedom; the disadvantage is that there’s more of a requirement to be independent for this to go well.

Skill requirements

What I’m proposing operates in quite a different paradigm from most other AI Safety research, so I’m looking for people who are able to “get it”.


For conceptual work, I expect clarity of thinking. I am quite partial to people who have exposure to LessWrong-style rationality, but this certainly isn’t necessary. I’m also quite a fan of people with analytical philosophy experience, particularly if they understand the difference between the map and the territory.


For empirical research, I want people who have prior empirical research experience. This is very early-stage research, so you need to be able to propose experiments that actually tell us something useful. I’ll select on your ability to propose good experiments, but you’ll have broad freedom to pursue whatever you want within the scope during AI Safety Camp itself.


I would be open to finding myself a co-lead who has enough empirical research experience to help mentor other participants interested in pursuing empirical work and if I were to find such a co-lead, then I’d be open to participants without previous empirical research experience. However, my guess would be that <10% of competent empirical researchers would be a good fit as a co-lead, because most people want to operate within an existing paradigm rather than create a new paradigm.


If you want to work on technical communications, you need to be able to precisely communicate complex ideas without oversimplifying or otherwise introducing excessive lossiness. For these projects, I’d probably need to exercise a greater degree of control over the end product, to avoid the risk of miscommunication.

(24) iVAIS: Ideally Virtuous AI System with Virtue as its Deep Character

Masaharu Mizumoto

Summary 

The ultimate goal of this interdisciplinary research program is to contribute to AI safety research by actually constructing an ideally virtuous AI system (iVAIS). Such an AI system should be virtuous as its deep character, showing resilience to prompt injections (not complete immunity, which would itself be a vulnerability) even if it can play many different characters by pretending, including a villain. The main content of the current proposal consists of two components: 1. self-alignment and 2. the Ethics game. Both are based on the idea of agent-based alignment rather than content-based alignment, focusing on what one is doing, which requires metacognitive capacity. 

Skill requirements

Either 1) general skills and experience in coding with Python, fine-tuning, and RLHF for an open-source LLM, or 2) general knowledge of AI safety research and the literature. 

(25) Exploring Rudimentary Value Steering Techniques

Nell Watson

Summary 

This research project seeks to assess the effectiveness of rudimentary alignment methods for artificial intelligence. Our intention is to explore basic, initial methods of guiding AI behavior using supplementary contextual information.

Expected Outcomes: 

Skill requirements

(26) Autostructures – for Research and Policy

Sahil and Murray

Summary 

This is a project for creating culture and technology around AI interfaces for conceptual sensemaking.


Specifically, we are creating for a near future in which our infrastructure is embedded with realistic levels of intelligence (i.e. only mildly creative but widely adopted), yet is full of novel, wild design paradigms anyway. 


The focus is on interfaces especially for new sensemaking and research methodologies that can feed into a rich and wholesome future.

Huh?

It’s a project for AI interfaces that don’t suck, for the purposes of (conceptual AI safety) research that doesn’t suck.

Wait, so you think AI can only be mildly intelligent?

Nope.

But you only care about the short term, of “mild intelligence”?

Nope, the opposite. We expect AI to be very, very, very transformative. And therefore, we expect intervening periods to be very, very transformative. Additionally, we expect even “very² transformative” intervening periods to be crucial, and quite weird themselves. 


In preparing for this upcoming intervening period, we want to work on the newly enabled design ontologies of sensemaking that can keep pace with a world replete with AIs and their prolific outputs. Using the near-term crazy future to meet the even crazier far-off future is the only way to go. 


(As you’ll see below, we will specifically move towards adaptive sensemaking meeting even more adaptive phenomena.)

So you don’t care about risks?

Nope, the opposite. This is all about research methodological opportunities meeting risks of infrastructural insensitivity.



-----------------------------------------------------------------------------------------------------------


Watch a 10 minute video here for a little more background: Scaling What Doesn’t Scale: Teleattention Tech

Skill requirements

If you’re good at (or interested in) engineering, writing, or design, or are generally open-minded and quick to learn, you’re a fit. If you made it through this doc (even if you have lots of questions and confusions) or like to think in meta-systematic ways, then you’ll love it here. 

Other

Projects that didn't fit any shared category

(27) Reinforcement Learning from Recursive Information Market Feedback

Abhimanyu Pallavi Sudhir

Summary 

RLHF is no good on tasks whose outputs humans are unable to easily rate. I propose the Recursive Information Market, which can be understood as an approach to rating based on a human rater’s Extrapolated Volition, or as a generalized form of AI safety via debate. 

Skill requirements

Minimum skills:


I would be especially happy to have someone who can make significant contributions to theoretical work, i.e. coming up with and proving solid, useful theorems. I would happily grant joint first-author position to someone who does the bulk of this work.


The majority of team members, I assume, would be working on implementations in the various contexts stated earlier.

(28) Explainability through Causality and Elegance 

Jason Bono

Summary 

The purpose of this project is to make progress towards human-interpretable AI through advancements in causal modeling. The project is inspired by the way science emerged in human culture, and seeks to replicate essential aspects of this emergence in a simple simulated environment.  


The setup will consist of a simulated world and one or more agents equipped with one sensor and one actuator each, along with a bandwidth-constrained communications channel. A register will record past communications, and store the “usefulness” of trial frameworks that the agents develop for prediction. 


The agents will first create standard deep predictive models for novel actuator actions (interventions) and subsequent system evolution. These agents will then create a reduced representation of their deep models, optimizing for “elegance”, which refers to high predictive accuracy, high predictive breadth, low model size, and high computational efficiency. This can be thought of as the autonomous creation of an interpretable “elegant causal decision layer” (ECDL) that the agents can call upon to reduce the computational intensity of accurately predicting the effects of novel interventions. 
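As a toy illustration of the multi-objective trade-off behind “elegance”, one could scalarize the four named criteria into a single score. The linear form and weights below are our illustrative choices, not the project’s:

```python
def elegance_score(accuracy, breadth, model_size, compute_cost,
                   w=(1.0, 1.0, 1.0, 1.0)):
    """Toy scalarization of the four 'elegance' criteria named in the
    summary: reward predictive accuracy and breadth, penalize model
    size and computational cost. All inputs are assumed normalized to
    comparable scales; weights and functional form are illustrative."""
    wa, wb, ws, wc = w
    return wa * accuracy + wb * breadth - ws * model_size - wc * compute_cost
```

Under such a score, a smaller and cheaper model that matches a deep model’s accuracy and breadth is strictly preferred, which is the intended pressure towards an interpretable reduced representation.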


Success would comprise the autonomous creation and successful utilization of a human interpretable ECDL. This success would provide a proof of concept for similar techniques in more complex and non-simulated environments (e.g. a physical setup and/or the internet).

Skill requirements

Team members’ skills should include at least one of the following:

(29) Leveraging Neuroscience for AI Safety

Claire Short

Summary 

This project integrates neuroscience and AI, leveraging human brain data to align AI behaviors with human values for potentially greater control and safety. In this initial project, we will take inspiration from Activation Vector Steering with BCI, mapping activation vectors to human brain datasets. In previous work, a method called Activation Addition was found to reliably control the behavior of large language models during use by altering the model’s internal processes based on specific inputs, allowing adjustments to topics or sentiments with minimal computing resources. By recreating elements of this work with the integration of brain data inputs, we aim to enhance the alignment of AI outputs with user intentions, opening new possibilities for personalization and accessibility in applications from education to therapy. 
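For readers unfamiliar with Activation Addition, the core operation is very simple: add a scaled “steering vector” (typically the activation difference between two contrasting prompts) to a layer’s activations during the forward pass. A minimal sketch with plain Python lists, assuming nothing about the project’s actual implementation:

```python
def activation_addition(hidden, steering_vector, coeff=1.0):
    """ActAdd-style steering sketch: add a scaled steering vector to a
    layer's activations. In real use this happens inside a transformer's
    forward pass via a hook; here activations are plain lists of floats."""
    return [h + coeff * s for h, s in zip(hidden, steering_vector)]

# Toy illustration: the steering vector is the activation difference
# between two contrasting inputs (e.g. "love" minus "hate").
h_love = [1.0, 0.0]
h_hate = [0.0, 1.0]
steer = [a - b for a, b in zip(h_love, h_hate)]          # [1.0, -1.0]
steered = activation_addition([0.5, 0.5], steer, coeff=0.5)  # [1.0, 0.0]
```

The project’s twist, as we read the summary, would be deriving the steering vector from human brain data rather than from contrasting text prompts.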

Skill requirements

Skill Requirements Research Engineer:


Skill Requirements Research Scientist:

(30) Scalable Soft Optimization

Benjamin Kolb

Summary 

This project is mainly aimed at a deep reinforcement learning (DRL) implementation whose purpose is to assess selected soft optimization methods. Such methods limit the amount of “optimization” in DRL algorithms in order to alleviate the consequences of goal misspecification. The primary proposed soft optimization method is based on the widely referenced idea of quantilization. Broadly speaking, quantilization means sampling options from the top quantile of a reference distribution instead of selecting the top option. 
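A minimal sketch of quantilization over a finite option set (our own illustrative code, not the project’s planned DRL implementation): rank options by the possibly misspecified utility, then sample uniformly from the top q-fraction rather than taking the argmax.

```python
import random

def quantilize(options, utility, q=0.1, rng=random):
    """Quantilization sketch: instead of returning the option that
    maximizes `utility`, sample uniformly from the top q-fraction of
    options (uniform reference distribution). With q -> 0 this
    approaches argmax; larger q means softer optimization."""
    ranked = sorted(options, key=utility, reverse=True)
    k = max(1, int(len(ranked) * q))  # size of the top quantile
    return rng.choice(ranked[:k])
```

The safety intuition is that a misspecified utility is most misleading at its extreme argmax, so sampling from a broad top quantile bounds how hard the agent exploits specification errors.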

Skill requirements

I’m primarily looking for people who are enthusiastic about:


Collaborators should have:


Additionally valuable are:

(31) AI Rights for Human Safety

Pooja Khatri 

Summary 

This project seeks to institute a legal governance framework to advance AI rights for human safety.


Experts predict that AI systems have a non-negligible chance of developing consciousness, agency, or other states of potential moral patienthood within the next decade. Such powerful, morally significant AIs could contribute immense value to the world. Failing to respect their basic rights may not only lead to suffering risks but may also incentivise AI systems to pursue goals that conflict with human interests, giving rise to misalignment scenarios and existential risks. 


Advancing AI rights for human safety remains a neglected priority. While several studies and frameworks exploring potential AI rights already exist, the existing work is either a) largely theoretical and not practical/tractable or feasible from a policy perspective and/or b) fails to take into consideration the contemporary nature of AI development. 


As such, given that AI systems will likely advance faster than legal regimes, powerful early intervention via legal governance mechanisms offers a promising first step towards mitigating suffering and existential risks and positively influencing our long-term future with AI.

Skill requirements

Research Manager


Research Assistant - Legal/Policy


Research Assistant - Technical


We encourage you to err on the side of applying, even if you do not meet all the requirements. If you have any questions, feel free to get in touch: khatripooja.24@gmail.com 

(32) Universal Values and Proactive AI Safety

Roland Pihlakas


I will be running one of three possible projects, based on which one receives the most interest. Below are included the summaries and skill sections for the respective projects.

Summary 

Category: Evaluate risks from AI

(32a) Creating new AI safety benchmark environments on themes of universal human values 

We will be planning and optionally building new multi-objective multi-agent AI safety benchmark environments on themes of universal human values.


Based on various anthropological research, I have compiled a list of universal (cross-cultural) human values. Several of these universal values seem to resonate with concepts from AI safety, but use different keywords. It might be useful to map these universal values to more concrete definitions using concepts from AI safety.


One notable detail in this research is that in the case of AI-human cooperation, the values are not symmetric as they would be in human-human cooperation. This arises because we can change the goal composition of agents, but not of humans. Additionally, there is the crucial difference that agents can be relatively easily cloned, while humans cannot. Therefore, for example, a human may have a universal need for autonomy, while an AI agent might conceivably not have that need built in; instead, the agent would have a need to support human autonomy. 


The objective of this project would be to implement these mappings of concepts into tangible AI safety benchmark environments.


Category: Agent Foundations

(32b) Balancing and Risk Aversion versus Strategic Selectiveness and Prospect Theory 

We will be analysing situations and building an umbrella framework about when either of these incompatible frameworks would be more appropriate in describing how we want safe agents to handle choices relating to risks and losses in a particular situation.


Economic theories often focus on the “gains” side of utility and how our multi-objective preferences are balanced there. A well-known formulation is to use diminishing returns: a concave utility function, which mathematically results in a balancing action where an individual prefers averages in all objectives to extremes in a few objectives.


But what happens in the negative domain of utility? How do humans handle risks and losses? It turns out this might not be as straightforward as with gains.


One might imagine that one could apply a concave utility function to the negative domain as well, in order to balance the individual losses, or to equalise and provide an equal treatment in case of multiple individuals. This would resonate with the idea that generally people prefer averages in all objectives to extremes in a few objectives. As an example, a negative exponential utility function would achieve that.


Yet there is a well-known theory, prospect theory, which instead claims that our preferences in the negative domain are convex. 


As I see it, this contradiction between “preferring averages over extremes” and prospect theory may be underexplored, especially with regard to its relevance to AI safety. 
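The tension can be made concrete with a small sketch. Below, a concave (negative-exponential) utility prefers a sure loss of 50 to a 50/50 gamble on losing 100, while the Tversky-Kahneman prospect-theory value function, convex in losses, prefers the gamble. The functional forms and parameters are standard textbook choices, not the project’s:

```python
import math

def concave_utility(x, a=0.01):
    """Negative-exponential utility, concave everywhere: u(x) = 1 - exp(-a*x).
    In the loss domain this implies risk aversion: a sure moderate loss
    is preferred to a gamble on a larger one."""
    return 1.0 - math.exp(-a * x)

def prospect_value(x, alpha=0.88, beta=0.88, lam=2.25):
    """Kahneman-Tversky value function: concave for gains, convex for
    losses, with loss aversion lam > 1 (parameter values from Tversky
    & Kahneman, 1992)."""
    return x ** alpha if x >= 0 else -lam * ((-x) ** beta)

# Sure loss of 50 vs. a 50/50 gamble on losing 100 or losing nothing:
eu_sure   = concave_utility(-50)
eu_gamble = 0.5 * concave_utility(-100) + 0.5 * concave_utility(0)
pv_sure   = prospect_value(-50)
pv_gamble = 0.5 * prospect_value(-100) + 0.5 * prospect_value(0)
# The concave model prefers the sure loss; prospect theory prefers the gamble.
```

Which of these two behaviors we want from a safe agent, and in which situations, is exactly the question the umbrella framework would address.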



Category: Train Aligned/Helper AIs

(32c) Act locally, observe far - proactively seek out side-effects 

We will be building agents that are able to solve an already implemented multi-objective multi-agent AI safety benchmark that illustrates the need for the agents to proactively seek out side-effects outside of the range of their normal operation and interest, in order to be able to properly mitigate or avoid these side-effects.


In various real-life scenarios we need to proactively seek out information about whether we are causing, or are about to cause, undesired side effects (externalities). This information either would not reach us by itself, or would reach us too late. 


This situation arises because attention is a limited resource. Similarly, our observation radius is limited. The same constraints apply to AI agents as well. We humans, as well as agents, would prefer to focus only on the area of our own activity, and not on surrounding areas, where we do not intend to operate. Yet our local activity causes side effects farther away, and we need to be accountable and mindful of that. Then these far away side effects need to be sought out with extra effort, in order to mitigate them as soon as possible, or even better, in order to proactively avoid them altogether.


I have built a multi-agent multi-objective gridworlds environment that illustrates this problem. I am seeking a team who would figure out the principles necessary or helpful for solving this benchmark, and who would build agents which illustrate these important safety principles. 

Skill requirements

(32a) Creating new AI safety benchmark environments on themes of universal human values

Relevant skills include the following. You do not need to have all the skills.



(32b) Balancing and Risk Aversion versus Strategic Selectiveness and Prospect Theory

Relevant skills include the following. You do not need to have all the skills.



(32c) Act locally, observe far - proactively seek out side-effects

Relevant skills include the following. You do not need to have all the skills.