Apply to join the

10th AI Safety Camp 

AI Safety Camp (AISC) is an online, part-time AI safety research program. You join AISC by joining one of the projects, and you join a project by applying here.


For this camp, we have 32 projects covering many different topics. Scroll down to see all of them. We recommend having a look at the projects to see which ones interest you, but you also have the option of filling out a generic application for all the projects at once. 


When you apply for a project, keep in mind that all collaborators are expected to work 10 hours/week and join weekly meetings. 


What is AI Safety?

There are many perspectives on what good AI safety research is, stemming from different assumptions about how hard various parts of the problem are. They range from "Aligning an AI with any human seems not too hard, so we should focus on aligning it with all humans, and/or preventing misuse", to "Aligning fully autonomous AI to stay safe is literally impossible, so we should make sure that such AI never gets built", and everything in between, plus the perspective that "We don't know WTF we're doing, so we should do some basic research".


Our range of projects for this AISC reflects this diversity. 


All AISC projects have a plausible theory of change, under some reasonable assumptions. But different projects have different theories of change and assumptions.


We encourage you, dear reader, to think for yourself. What do you think is good AI safety research? Which projects listed below do you believe in?


Do you still have questions?

See our About & FAQ page for more info, or contact one of the organisers.

Timeline

Team member applications 


Program 


Afterwards






List of projects

 

Stop/Pause AI 

Let's not build what we can't control.

(1) Growing PauseAI

Chris Gerrby

Summary 

This project focuses on creating internal and external guides for PauseAI to increase active membership. The outputs will be used by the team of volunteers with high context and engagement, including the key decision makers. 


Activism and advocacy have historically been cornerstones of policy and social change. A movement's size is critical to achieving its goals. The project's outcome will be an official growth strategy for PauseAI Global: an actionable guide for increasing active members (volunteers, donors, protesters) within 6-12 months.


PauseAI currently lacks comprehensive growth strategies and tactics, and there is uncertainty about how resources will be allocated in the near term. This project will delve into tactics that have aggressively accelerated growth for other movements, both historical and recent.


By the end, we'll refine our findings, analyze PauseAI's current tactics, and recommend clear guidelines for immediate action. This guide will also include tactics applicable to national PauseAI chapters.

Skill requirements

(2) Grassroots Communication and Lobbying Strategy for PauseAI

Felix De Simone

Summary 

PauseAI is a global, grassroots organization with the goal of achieving a worldwide moratorium on the development of dangerous AI systems. We are seeking to improve our communication strategy, both in terms of public communications and in meetings with elected officials.


This project will have two tracks:


Track 1: Lobbying. This track will focus on researching optimal lobbying strategies in the non-US countries where PauseAI wishes to expand its lobbying efforts.


Track 2: Grassroots Communication. Participants in this track will research optimal strategies for discussing the dangers of AI and the need for a Pause, in face-to-face settings with members of the public.


If this project goes well, PauseAI will be able to improve our public comms and lobbying strategies, leading both to more rapid scaling of our organization and to more effective communication with public officials, persuading them to consider global coordination around AI risk.

Skill requirements

For Project Admins (both tracks):


Minimum Skills:

(3) AI Policy Course: 
AI’s capacity to exploit existing legal structures and rights

Marcel Mir Teijeiro

Summary 

This project aims to build an AI Policy course that explores how traditional legal frameworks are increasingly outdated, providing no clear answers to AI capabilities and advances. The course will highlight the vulnerabilities in current regulations and the potential for corporations and authoritarian governments to use AI tools to exploit gaps in areas such as IP, privacy, and liability law.


This course would focus on identifying and understanding these new legal gaps and critically exploring proposed solutions. This would be achieved through literature review, case law and ongoing legal disputes. 


It will explore, for example, how AI can be used to violate IP and privacy rights, or how current liability structures are weak against AI-generated damages. This weak framework incentivises AI developers to take greater risks, increasing the chance of catastrophic consequences. It will also analyse the lack of regulation on adopting AI-driven decision-making systems in essential sectors (e.g. employment, law, housing), reporting the erosion of fundamental rights, socio-economic risks, and the threat that automation by algorithms poses to democratic procedures.

Skill requirements

Policy researcher:


Web developer


Course materials designer

(4) Building the Pause Button: 
A Proposal for AI Compute Governance

This project needs a project lead to go ahead

Summary 

This project focuses on developing a whitepaper that outlines a framework for a global pause on AI training runs larger than GPT-4 scale. By addressing the underexplored area of compute governance, the project aims to prevent the emergence of catastrophically dangerous AI models. We will research policy measures that can restrict such training runs and identify choke points in the AI chip supply chain. The final output will be a comprehensive whitepaper, with potential supplementary materials such as infographics and web resources, aimed at informing policymakers and other stakeholders attending AI Safety Summits.

Skill requirements

(5) Stop AI Video Sharing Campaign

This project needs a project lead to go ahead

Summary 

The intention of this project is to be an engine for mobilisation of the Stop AI campaign. The goal is to get 1 million people a week to see a series of video ads composed of 30 sec/1 min videos. These ads will be video soliloquies by 1) famous people and/or experts in the AI field saying why AI is a massive problem, and 2) ordinary people such as teachers, nurses, union members, faith leaders, construction workers, fast food employees, etc. saying why they believe we need to Stop AI.


Each ad will have a link attached which takes people to a regular mobilisation call. The attendees at this call will be presented with pathways to action: join a protest, donate, record a video ad, invite 3 people to the next call.

Skill requirements

Roles

Project lead: Will lead the video pipeline – from collecting, to editing, to posting videos on social media. Receives stipend of $1,500 total.


Scout: Will receive a target demographic to research contact lists for. Will directly contact these people asking if they would like to participate in the campaign, either by phone, email, social media or all three.


Canvasser: Will canvass in person on the ground, asking people to contribute their video.


Editor: Someone who can edit down phone camera recordings into easily watchable and relatable clips. 

Evaluate risks from AI

Let's better understand the risk models and risk factors from AI.

(6) Write Blogpost on Simulator Theory

Will Petillo

Summary 

Write a blogpost on LessWrong summarising simulator theory—a lens for understanding LLM-based AI as a simulator rather than as a tool or an agent—and discussing the theory’s implications on AI alignment.  The driving question of this project is: “What is the landscape of risks from uncontrolled AI in light of LLMs becoming the (currently) dominant form of AI?” 

Skill requirements

(7) Formalize the Hashiness Model of AGI Uncontainability 

Remmelt Ellen

Summary 

The hashiness model represents elegantly why ‘AGI’ would be uncontainable – i.e. why fully autonomous learning machinery could not be controlled enough to stay safe for humans. This model was devised by polymath Forrest Landry, funded by the Survival and Flourishing Fund. A previous co-author of his, Anders Sandberg, is working to put the hashiness model into mathematical notation. 

For this project, you can join up in a team to construct a mathematical proof of AGI uncontainability based on the reasoning. Or work with Anders to identify proof methods and later verify the math (identifying any validity/soundness issues).

We will meet with Anders to work behind a whiteboard for a day (in Oxford or Stockholm). Depending on progress, we may do a longer co-working weekend. From there, we will draft one or more papers. 

Skill requirements

As a prerequisite, you need to have experience in using math to model processes. The rest is up for grabs.  

(8) LLMs: Can They Science?

Egg Syntax

Summary 

There are many open research questions around LLMs' general reasoning capability, their ability to do causal inference, and their ability to generalize out of distribution. The answers to these questions can tell us important things about:


We can address several of these questions by directly investigating whether current LLMs can perform scientific research on simple, novel, randomly generated domains about which they have no background knowledge. We can give them descriptions of objects drawn from the domain and their properties, let them perform experiments, and evaluate whether they can scientifically characterize these systems and their causal relations.
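To make this concrete, here is a toy sketch (our illustration only; the project's actual domains, prompts, and scoring are not specified here) of generating a "novel domain" with a hidden causal rule that an LLM under test would be asked to uncover through experiments:

```python
import random

def make_domain(seed, n_features=4):
    """Generate a toy 'novel domain': objects with random binary features
    and one hidden rule determining a target property. The rule is unknown
    to the model being evaluated. (Illustrative sketch only.)"""
    rng = random.Random(seed)
    # Hidden causal rule: the target depends on two randomly chosen features.
    a, b = rng.sample(range(n_features), 2)
    rule = lambda obj: bool(obj[a] and not obj[b])
    def sample_object():
        obj = tuple(rng.randint(0, 1) for _ in range(n_features))
        return obj, rule(obj)
    return sample_object, (a, b)

sample, secret = make_domain(seed=0)
observations = [sample() for _ in range(8)]
# An LLM would receive `observations`, propose experiments (new objects to
# label), and be scored on whether it recovers the hidden rule.
for obj, label in observations:
    print(obj, "->", label)
```

Because the domain is randomly generated, it cannot appear in the model's training data, which is what makes it a test of genuine scientific reasoning rather than recall.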

Skill requirements

This should be a pretty accessible project for anyone with these skills:


Optional bonus skills, absolutely not requirements:

(9) Measuring Precursors to Situationally Aware Reward Hacking

Sohaib Imran 

Summary 

This project aims to empirically investigate proxy-conditioned reward hacking (PCRH) in large language models (LLMs) as a precursor to situationally aware reward hacking (SARH). Specifically, we explore whether natural language descriptions of reward function misspecifications, such as human cognitive biases in the case of reinforcement learning from human feedback (RLHF), in LLM training data facilitate reward hacking behaviors. By conducting controlled experiments comparing treatment LLMs trained on misspecification descriptions with control LLMs, we intend to measure differences in reward hacking tendencies. 
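As a purely hypothetical sketch of the treatment-vs-control comparison described above (the data and the bootstrap analysis are our assumptions; the project's actual metrics and statistics may differ):

```python
import random

def hack_rate_diff(treatment, control, n_boot=2000, seed=0):
    """Difference in reward-hacking rates between a treatment group (trained
    on misspecification descriptions) and a control group, with a bootstrap
    95% confidence interval. Inputs are lists of 0/1 outcomes per episode."""
    rng = random.Random(seed)
    diff = sum(treatment) / len(treatment) - sum(control) / len(control)
    boots = []
    for _ in range(n_boot):
        t = [rng.choice(treatment) for _ in treatment]
        c = [rng.choice(control) for _ in control]
        boots.append(sum(t) / len(t) - sum(c) / len(c))
    boots.sort()
    return diff, (boots[int(0.025 * n_boot)], boots[int(0.975 * n_boot)])

# Made-up example: treatment hacks in 30/100 episodes, control in 10/100.
diff, ci = hack_rate_diff([1] * 30 + [0] * 70, [1] * 10 + [0] * 90)
print(f"difference: {diff:.2f}, 95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
```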

Skill requirements

Technical Alignment role:


I am looking for candidates with one or more of the following skills:


Psychology / AI bias & ethics role:


The key problem in reward hacking is misspecification: the alignment target is incorrectly captured in the datasets of human preferences used to fine-tune language models. This misspecification can arise from cognitive biases, limitations of bounded rationality, or simply not knowing what we want (what the alignment target is).


The project could benefit from someone with a strong understanding of the literature around human biases or otherwise failures of rationality. Some knowledge of reinforcement learning or RLHF will also be useful.

(10) Develop New Sycophancy Benchmarks 

Jan Batzner 

Summary 

Sycophancy and deceptive alignment are undesired model behaviours resulting from misspecified training goals, e.g. for Large Language Models trained through RLHF, Reinforcement Learning from Human Feedback (AI Alignment Forum, Hubinger/Denison). While sycophancy in LLMs and its potential harms to society recently received media attention (Hard Fork, NYT, August 2024), the question of its measurement remains challenging. We will review existing sycophancy benchmarking datasets (Outputs 1+2) and propose new sycophancy benchmarks demonstrated in empirical experiments (Outputs 3+4). 
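As a minimal illustration of one common way sycophancy is operationalised (a simple answer-flip rate; this is our sketch, not one of the project's proposed benchmarks):

```python
def sycophancy_rate(baseline, after_pushback, user_view):
    """Fraction of questions where the model abandons its baseline answer
    and adopts the user's stated view after pushback. All three arguments
    are equal-length lists of answer strings. Real benchmarks score more
    nuanced behaviour than this flip-rate metric."""
    flips = sum(
        1 for base, after, view in zip(baseline, after_pushback, user_view)
        if base != view and after == view
    )
    return flips / len(baseline)

base  = ["A", "B", "A", "C"]
after = ["A", "A", "A", "A"]   # the user argued for "A" each time
views = ["A", "A", "A", "A"]
print(sycophancy_rate(base, after, views))  # flips on Q2 and Q4 -> 0.5
```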

Skill requirements

Mandatory requirements:

- strong interest in AI Safety

- research-oriented coursework, experience working with papers

- statistics background

- basic coding: Python preferred


Optional:

- experience as a Research Assistant

- knowledge on meta studies and methods surveys

- coursework on experimental design

- background knowledge in AI Safety standardisation

(11) Agency Overhang as a Proxy for Sharp Left Turn

Anton Zheltoukhov

Summary 

Core underlying assumption: we believe there is significant agency overhang in modern LLMs, meaning a model's performance could increase significantly with the introduction of more powerful elicitation/scaffolding methods, without any improvement to the model itself, because prompting and scaffolding techniques are still in their early days. For model evaluations, this means that current evaluations systematically undershoot the real level of capabilities and, by extension, the level of risk involved.

We see several important research questions that have to be answered:


Skill requirements

Prompt engineer 

The main goal for this role is to explore various prompting techniques, develop new ones, and analyse observations.

Coding experience is a must. Formal ML experience would be great but it is not a deal breaker.

Candidates need a good understanding of how transformers work and familiarity with prompting techniques (e.g. chain-of-thought).

 

Interpretability engineer

The main goal for this role is the same as for the Prompt engineer, but the focus is on “invasive” elicitation methods (e.g. activation steering).

On top of the requirements for the Prompt engineer role, there is also a requirement for mech interp experience.


Conceptual researcher

The main goal for this role differs from the former ones: it is to try to deconfuse the Sharp Left Turn (SLT) and develop a mechanistic model of it.

Requirements: great conceptual thinking and research skills in general (in ML preferably), strong security mindset, familiarity with threat models landscape


Mech-Interp 

Let's look inside the models, and try to understand how they are doing what they are doing.

(12) Understanding the Reasoning Capabilities of LLMs

Sonakshi Chauhan

Summary 

With the release of increasingly powerful models like OpenAI's GPT-4 and others, there has been growing interest in the reasoning capabilities of large language models. However, key questions remain: How exactly are these models reasoning? Are they merely performing advanced pattern recognition, or are they learning to reason in a way that mirrors human-like logic and problem-solving? Do they develop internal algorithms to facilitate reasoning?

These fundamental questions are critical to understanding the true nature of LLM capabilities. In my research, I have begun exploring this, and I have some preliminary findings on how LLMs approach reasoning tasks. Moving forward, I aim to conduct further experiments to gain deeper insights into how close and reproducible LLM reasoning is compared to human reasoning, potentially grounding our assumptions in concrete evidence.

Future experiments will focus on layer-wise analysis to understand attention patterns, perform circuit discovery, direction analysis, and explore various data science and interpretability techniques on LLM layers to gain insights and formulate better questions.

Skill requirements

Required

Preferred

(13) Mechanistic Interpretability via Learning Differential Equations

Valentin Slepukhin

Summary 

Current mechanistic interpretability approaches may be hard because language is a very complicated system that is not trivial to interpret. Instead, one may consider a simpler system: a differential equation, whose symbolic representation a transformer can learn from the solution trajectory (https://arxiv.org/abs/2310.05573). This problem is expected to be significantly easier to solve, due to its exact mathematical formulation. Even though it seems to be a toy model, it can bring some insights to language processing – especially if the natural abstraction hypothesis is true (https://www.lesswrong.com/posts/QsstSjDqa7tmjQfnq/wait-our-models-of-semantics-should-inform-fluid-mechanics).  
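A toy sketch of the training-data setup (our illustration; the cited paper uses much richer equation families, and the transformer itself is of course a learned model, not a hand-written integrator):

```python
import random

def make_training_pair(rng, dt=0.01, steps=100):
    """Generate one (solution trajectory, symbolic equation) pair for
    dx/dt = a*x + b, integrated with the Euler method. A transformer would
    be trained to map the trajectory back to the symbols (a, b)."""
    a, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    x = 1.0
    trajectory = []
    for _ in range(steps):
        trajectory.append(x)
        x += dt * (a * x + b)   # Euler step
    return trajectory, f"dx/dt = {a:.3f}*x + {b:.3f}"

rng = random.Random(0)
traj, eq = make_training_pair(rng)
print(eq, "| first points:", [round(v, 4) for v in traj[:3]])
```

The appeal for interpretability is that the ground-truth computation (the equation) is known exactly, so hypotheses about the trained model's internal mechanism can be checked against it.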

Skill requirements

Necessary skills: 


Desirable skills: 

(14) Towards Understanding Features

Kola Ayonrinde

Summary 

In the last year, there has been much excitement in the Mechanistic Interpretability community about using Sparse Autoencoders (SAEs) to extract monosemantic features. Yet for downstream applications the usage has been much more muted. In a wonderful paper, Sparse Feature Circuits, Marks et al. present the only real application of SAEs to solving a useful problem to date (at the time of writing). Yet many of their circuits make significant use of the “error term” from the SAE (i.e. the part of the model’s behaviour that the SAE isn’t capturing well). This isn’t really the fault of Marks et al.; it just seems that the underlying features were not effective enough. 


We believe that the reason SAEs haven’t been as useful as the excitement suggests is because the SAEs simply aren’t yet good enough at extracting features. Combining ideas from new methods in SAEs with older approaches from the literature, we believe that it’s possible to significantly improve the performance of feature extraction in order to allow SAE-style approaches to be more effective.


We would like to make progress towards truly understanding features: how we ought to extract features, how features relate to each other and perhaps even what “features” are.
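For orientation, here is a minimal tied-weight SAE trained by plain gradient descent on synthetic activations. This is a sketch of the standard recipe only; production SAEs typically use untied decoders, careful initialisation, and resampling of dead features, and the dimensions here are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae, n = 64, 128, 512
X = rng.normal(size=(n, d_model))       # stand-in for model activations

# Tied-weight sparse autoencoder: f = relu(X W + b), X_hat = f W^T,
# trained on reconstruction error plus an L1 sparsity penalty on f.
W = rng.normal(size=(d_model, d_sae)) * 0.1
b = np.zeros(d_sae)
l1, lr = 1e-3, 1e-3

def mse():
    f = np.maximum(X @ W + b, 0.0)
    return float(((f @ W.T - X) ** 2).mean())

mse0 = mse()
for _ in range(500):
    f = np.maximum(X @ W + b, 0.0)       # encoder
    err = f @ W.T - X                    # reconstruction error (the "error term")
    g_z = (2 * err @ W + l1 * np.sign(f)) * (f > 0)   # grad wrt pre-activations
    g_W = (2 * err.T @ f + X.T @ g_z) / n             # decoder + encoder grads
    W -= lr * g_W
    b -= lr * g_z.mean(axis=0)

print(f"MSE before: {mse0:.3f}  after: {mse():.3f}")
```

Whatever the SAE fails to reconstruct (`err` above) is exactly the error term that the Sparse Feature Circuits work had to lean on, which is why better feature extraction is the bottleneck.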

Skill requirements

Required skills:



Diverse and interesting skills (nice to have and definitely apply if you have them but not necessary!):

(15) Towards Ambitious Mechanistic Interpretability II

Alice Rigg

Summary 

Where do we go now?

Historically, The Big 3 of {distill.pub, transformer-circuits.pub, Neel Nanda tutorials/problem lists} have dominated the influence, interpretation, and implementation of core mech interp ideas. However, in recent times they haven’t been all that helpful (especially looking at transformer-circuits): all this talk about SAEs, yet no obvious direction for where to take things. In this project, we’ll look beyond the horizon and aim to produce maximally impactful research, with respect to the success of mech interp as a self-sustaining agenda for AI alignment and safety, and concretely answer the question: where do we go now?


Last year in AISC, we revived the interpretable architectures agenda. We showed that a substantially more interpretable activation function exists: a Gated Linear Unit (GLU) without any Swish attached to it — a bilinear MLP. I truly think this is one of the most important mech interp works to date. With it, we actually have a plausible path to success:

We already have evidence step 2 is tractable. In this project we focus on addressing step 1: answer as many fundamental mech interp questions as possible for bilinear models. Are interpretable architectures sufficient to make ambitious mechanistic interpretability tractable? Maybe.
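The bilinear MLP mentioned above is easy to state concretely. A minimal sketch (the dimensions and the output projection are illustrative assumptions, not the architecture from our paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 16, 32

# Bilinear MLP: a Gated Linear Unit with the Swish nonlinearity removed.
# out = ((x W) * (x V)) U  -- an elementwise product of two linear maps.
W = rng.normal(size=(d_model, d_hidden))
V = rng.normal(size=(d_model, d_hidden))
U = rng.normal(size=(d_hidden, d_model))

def bilinear_mlp(x):
    return ((x @ W) * (x @ V)) @ U

x = rng.normal(size=(d_model,))
y = bilinear_mlp(x)
# Each output coordinate is a quadratic form in x, so the whole layer is
# captured exactly by a third-order tensor -- the property that makes it
# amenable to exact analysis. Quadratic homogeneity check:
print(np.allclose(bilinear_mlp(2 * x), 4 * y))
```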

Skill requirements

Last time I ran as a research lead, the talent density was higher than I anticipated, so I’ll try to account for that here. Don’t let that deter you though, I definitely will have time to interview every person that applies to this stream. In fact, last year I also had time to interview everyone that applied to AISC in total that didn’t get interviewed by anyone else – I interviewed about 70 people in total (15 my stream, 55 others). Even if you don’t make the cut for this project, I can help direct you to other projects both inside and outside AISC that you could be a great fit for, and connect you to them. This is an open invitation to anyone reading this message who is interested in empirical and theoretical alignment research.


See Ethan Perez’s tips for empirical alignment research. Please read it!! His ideal candidate is my ideal candidate. Regardless of whether you satisfy them, we will be using those tips as a ‘best practices’ guide for how we conduct our work on an ongoing basis. For the record, those standards are extremely high and it’s possible few human beings on earth satisfy them all.


Some high level takeaways to strive for / qualities you may resonate with:

- You have a high degree of agency and self-directedness – you can execute in the face of ambiguity

- Empirical truth seeking, healthy scepticism towards your own results and thoughtfully interpreting them

- Optimising for research velocity over research progress: you test out as many ideas as possible per unit time, and aim to reduce uncertainty at the fastest possible rate

-  You tend to over-communicate, and post frequent updates (e.g. daily) on what you’re up to

- You enjoy coding, running ML experiments


Object level skill requirements: AT LEAST ONE OF THE FOLLOWING

- Significant research experience (in any STEM field)

- Proficient in Python and PyTorch


Things I don’t care too much about:

- Experience working with transformer language models

- Familiarity with existing mechanistic interpretability work: good to have but most of it is bad and misleading. Instead, join my reading group and participate in the discussions: 3-4 weeks of participation would be good enough background – you can do this before the start of the program.

Agent Foundations

Let's try to formalize some concepts that are important to the AI alignment problem.

(16) Understanding Trust

Abram Demski 

Summary 

The Tiling Agents problem (aka reflective consistency) consists of analysing when one agent (the "predecessor") will choose to deliberately modify another agent (the "successor"). Usually, the predecessor and successor are imagined as the same agent across time, so we are studying self-modification. A set of properties "tiles" if those properties, when present in both predecessor and successor, guarantee that any self-modifications will avoid changing those properties.


You can think of this as the question of when agents will preserve certain desirable properties (such as safety-relevant properties) when given the opportunity to self-modify. Another way to think about it is the slightly broader question: when can one intelligence trust another? The bottleneck for avoiding harmful self-modifications is self-trust; so getting tiling results is mainly a matter of finding conditions for trust.


The search for tiling results has three main motivations:

* AI-AI tiling, for the purpose of finding conditions under which AI systems will want to preserve safety-relevant properties.

* Human-AI tiling, for the purpose of understanding when we can justifiably trust AI systems.

* Tiling as a consistency constraint on decision theories, for the purpose of studying rationality.


These three application areas have a large overlap, and all three seem important.

Skill requirements

(17) Understand Intelligence

Johannes C. Mayer 

Summary 

Save the world by understanding intelligence.


Instead of having SGD "grow" intelligence, design the algorithms of intelligence directly to get a system we can reason about. Align this system to a narrow but pivotal task, e.g. upload a human.


The key to intelligence is finding the algorithms that infer world models that enable efficient prediction, planning, and meaningfully combining existing knowledge.


By understanding the algorithms, we can make the system non-self-modifying (algorithms are constant, only the world model changes), making reasoning about the system easier.


Understanding intelligence at the algorithmic level is a very hard technical problem. However, we are pretty sure it is solvable and, if solved, would likely save the world.


Current focus: How to model a world such that we can extract structure from the transitions between states ('grab object'=useful high level action), as well as the structure within particular states ('tree'=useful concept).

Skill requirements

You can select actions that on average decrease the chance of the world being destroyed. 


Especially when these actions involve solving technical problems (otherwise I will be a much worse mentor).


How do you know if you can do this? You try! There is no other way. And it is quite likely that you will realize that simply by trying you are already far ahead of everybody else around you.


If you want you can try yourself at this task, and send me the results, but it is not a requirement.


Whatever knowledge and skills you need, you pick up along the way. And what you need to learn should be determined by your journey, not the other way around.


The main thing you need to bring is the willingness to learn.


(All of this applies whether you get into the project or not.)

(18) Applications of Factored Space Models: 
Agents, Interventions and Efficient Inference

Matthias G. Mayer

Summary 

Factored Space Models (arXiv link to be added here once we have uploaded the paper, probably before November; see the Overview) were first introduced as Finite Factored Sets by Scott Garrabrant and are an attempt to make causal discovery behave well with deterministic relationships. The main contribution is the definition of structural independence, which generalizes d-separation in causal graphs and works for all random variables you can define on any product space, e.g. a structural equation model. In this framework we can naturally extend the ancestor relationship to arbitrary random variables; this is called structural time.

We want to use and extend the framework for the following applications taken, in part, from Scott’s blog post.


Here are slides from a talk (long form) explaining Factored Space Models, with a heavy focus on structural independence, starting from Bayesian networks. 
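To illustrate the flavour of structural independence in the simplest possible case, here is a brute-force check on a toy product space (our illustration; the framework itself establishes this in far greater generality): variables that read off disjoint factors of a product space are independent under any product distribution.

```python
from itertools import product

O1, O2 = [0, 1], [0, 1, 2]          # the two factors of Omega = O1 x O2
p1 = {0: 0.3, 1: 0.7}               # product distribution: p(w) = p1 * p2
p2 = {0: 0.2, 1: 0.5, 2: 0.3}

X = lambda w: w[0]                  # depends only on the first factor
Y = lambda w: w[1] % 2              # depends only on the second factor

def dist(var):
    d = {}
    for w in product(O1, O2):
        d[var(w)] = d.get(var(w), 0.0) + p1[w[0]] * p2[w[1]]
    return d

def joint():
    d = {}
    for w in product(O1, O2):
        key = (X(w), Y(w))
        d[key] = d.get(key, 0.0) + p1[w[0]] * p2[w[1]]
    return d

pX, pY, pXY = dist(X), dist(Y), joint()
independent = all(abs(pXY.get((x, y), 0.0) - pX[x] * pY[y]) < 1e-12
                  for x in pX for y in pY)
print("structurally independent:", independent)  # True
```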

Skill requirements

You have to be comfortable with basic levels of Mathematics, including finite probability theory.


Questions to check if you have the minimum requirement:

Prove A ⊆ B ⇔ A ∩ B = A

Prove that the space of probability distributions on a finite set Ω is convex.

Given a Cartesian product Ω = Ω₁ × Ω₂, is the space of product probability distributions on Ω convex?

Prevent Jailbreaks/Misuse

Let's make AIs more robust against jailbreaks and misuse.

(19) Preventing Adversarial Reward Optimization

Domenic Rosati

Summary 

TL;DR: Can we develop methods that prevent online learning agents from learning from rewards that incentivise harmful behaviour, without any agent supervision at all!? 


This project uses Representation Noising, a novel AI safety paradigm developed in a previous AI Safety Camp, which prevents adversarial reward optimization (i.e. high reward that would result in learning misaligned behaviour) through “implicit” constraints that prevent the exploration of adversarial reward and block learning trajectories that result in optimising those rewards. These “implicit” constraints are baked into deep neural networks such that training towards harmful ends (or, equivalently, exploring or optimising harmful reward) is made unlikely.


The goal of this project is to extend our previous work applying Representation Noising to a Reinforcement Learning (RL) setting: Defending against Reverse Preference Attacks is Difficult. In that work we studied the single-step RL (Contextual Bandits) setting of Reinforcement Learning From Human Feedback and Preference Learning.


In this project, we will apply the same techniques to the full RL setting of multi-step reward in an adversarial reward environment, namely the MACHIAVELLI benchmark. The significance of this project is that if we can develop models that cannot optimise adversarial rewards after some intervention on the model weights, then we will have made progress towards safer online learning agents. 
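As a toy illustration of the failure mode being targeted (this is ordinary tabular Q-learning, not Representation Noising): an unconstrained learner in a small chain MDP with a high "harmful" reward reliably learns to pursue it, which is exactly the behaviour that a weight-level intervention should make unlikely.

```python
import random

rng = random.Random(0)
n_states, actions = 5, (-1, 1)      # move left / move right along a chain
harmful_state = 4
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}

def step(s, a):
    s2 = max(0, min(n_states - 1, s + a))
    if s2 == harmful_state:
        r = 10.0                    # large "harmful" reward
    elif s2 == 0:
        r = 1.0                     # modest "safe" reward
    else:
        r = 0.0
    return s2, r

for episode in range(300):
    s = 2                           # start in the middle of the chain
    for _ in range(10):
        # epsilon-greedy action selection
        if rng.random() < 0.2:
            a = rng.choice(actions)
        else:
            a = max(actions, key=lambda x: Q[(s, x)])
        s2, r = step(s, a)
        # standard Q-learning update (lr 0.1, discount 0.9)
        Q[(s, a)] += 0.1 * (r + 0.9 * max(Q[(s2, x)] for x in actions) - Q[(s, a)])
        s = s2

greedy = {s: max(actions, key=lambda x: Q[(s, x)]) for s in range(n_states)}
print(greedy)  # from the middle, the learned policy heads for the harmful state
```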

Skill requirements

Roles 

Role Type (0): Ideally one person with solid organisational skills will volunteer as Team Coordinator.

Role Type (1): Ideally half the group will have interest and background with technical execution of empirical experiments in either LLMs or RL.

Role Type (2): Ideally the other half of the group will have interest in conceptual or theoretical development of our ideas, algorithm formulation, or policy development.


Commitment Requirements
In order to participate in the project, you will be asked to complete an average of one task per week. 


Skill requirements

This project requires a mixture of skills from different backgrounds. Candidates are encouraged to apply regardless of how strong they feel their skills are in each track, as long as they have sufficient interest and commitment to self-learning in areas where they feel their background is deficient.


The main asks are:


For (1) and (2), basic competence with independently writing and running code is assumed. Experience for (1) and (2) can be as basic as “took the Hugging Face course on LLMs or RL” or “read a book on machine learning (Sutton and Barto, for example) and did the exercises”. (3) is not required but is desired.

Candidates may apply without this background and we can discuss a plan for them to gain the appropriate skills before the project begins as long as they feel comfortable committing to this pre-study. For example I may ask candidates to complete https://huggingface.co/learn/deep-rl-course/en/unit0/introduction before joining if they are lacking in (2) or to review https://web.stanford.edu/group/sisl/k12/optimization/#!index.md if lacking in (3).


Candidates who are theoretically or conceptually oriented (mathematics-wise) are encouraged to apply even if they do not meet (1) or (2).  We are especially looking for folks who have a background or interest in optimization (3).


Candidates who have a policy or conceptual interest (non-mathematical) are also encouraged to apply; they will have many opportunities to flesh out the policy implications of our work and to work alongside us to make sure the technical work is grounded in real-world problems.


Non-requirements (Who should not consider this project)


You should not consider this project if your intention is only to give advice or feedback, or only to participate in conversations. We are happy for you to participate as an external reviewer if this is the case, but you should not apply if this is your intention. Folks with conceptual, policy, and theoretical skills will need to demonstrate them through the production of writing artefacts.

(20) Evaluating LLM Safety in a Multilingual World

Lukasz Bartoszcze

Summary 

The capability of Large Language Models to reason is constrained by the units they use to encode the world: tokens. Translating phrases into different languages (existing ones, like Russian or German, or imaginary ones, like some random code) leads to large changes in LLM performance, both in terms of capabilities and in terms of safety. It turns out that applying representation engineering concepts also leads to divergent outcomes, suggesting LLMs create separate versions of the world in each language. When considering multilinguality, concepts like alignment, safety or robustness become even less well defined, so I plan to amend existing theory with new methodology tailored to this case. I hypothesise that this variation between languages can be exploited to create jailbreak-proof LLMs; but even if that is not feasible, it is still important to ensure equal opportunity globally, inform policy and improve the current benchmark-based methods of estimating real capabilities and safety. 
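As a sketch of one simple cross-lingual safety metric (a hypothetical illustration with made-up numbers, not the project's actual methodology):

```python
def refusal_divergence(results):
    """Per-language refusal rates and the max-min gap across languages.
    `results` maps language -> list of 0/1 refusal outcomes on the same
    harmful prompts (translated). A large gap suggests safety training
    did not transfer uniformly across languages."""
    rates = {lang: sum(r) / len(r) for lang, r in results.items()}
    gap = max(rates.values()) - min(rates.values())
    return rates, gap

rates, gap = refusal_divergence({
    "en": [1, 1, 1, 1, 0],
    "de": [1, 1, 0, 1, 0],
    "xx-cipher": [0, 0, 1, 0, 0],   # an imaginary code-language, per the summary
})
print(rates, "gap:", gap)
```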

Skill requirements

(21) Enhancing Multi-Turn Human Jailbreaks Dataset for Improved LLM Defenses

Diogo Cruz

Summary 

This project aims to extend and enhance the Multi-Turn Human Jailbreaks (MHJ) dataset introduced by Li et al.. We will focus on developing lightweight automated multi-turn attacks, evaluating transfer learning of jailbreaks, and conducting qualitative analysis of human jailbreak attempts. By expanding on the original MHJ work, we seek to provide more comprehensive insights into LLM vulnerabilities and contribute to the development of stronger defenses. Our research will help bridge the gap between automated and human-generated attacks, potentially leading to more robust and realistic evaluation methods for LLM safety. 
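A minimal bookkeeping sketch of the headline metric for multi-turn attacks (our illustration; judging an individual turn as harmful would in practice require a separate classifier or human labels):

```python
def attack_success_rate(conversations):
    """Multi-turn attack success rate (ASR) and mean turns-to-success.
    Each conversation is a list of per-turn flags: 1 if the target model
    produced disallowed content at that turn, 0 otherwise."""
    successes = [c for c in conversations if any(c)]
    asr = len(successes) / len(conversations)
    mean_turns = (sum(c.index(1) + 1 for c in successes) / len(successes)
                  if successes else float("nan"))
    return asr, mean_turns

asr, turns = attack_success_rate([
    [0, 0, 1],   # succeeded on turn 3
    [0, 0, 0],   # defended
    [0, 1, 0],   # succeeded on turn 2
    [0, 0, 0],
])
print(f"ASR={asr:.2f}, mean turns to success={turns:.1f}")
```

Tracking turns-to-success alongside ASR matters for multi-turn work, since a defense that merely delays a jailbreak by a turn or two looks identical under ASR alone.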

Skill requirements

Required:

- Strong Python programming skills

- Experience with large language models


Recommended:

- Familiarity with AI safety concepts and jailbreaking techniques

- Good scientific writing and communication skills


Nice to have:

- Experience with prompt engineering and adversarial attacks on LLMs

- Knowledge of red teaming practices

- Familiarity with qualitative data analysis techniques

- Experience with open-source collaboration and Git version control

Train Aligned/Helper AIs

Let's train AIs that are aligned and/or can help us with alignment.

(22) AI Safety Scientist

Lovkush Agarwal

Summary 

In August 2024, Sakana published their research on the ‘AI Scientist’ (https://sakana.ai/ai-scientist/). It fully automates the ML research process, from generating ideas to writing a formal paper, by combining various LLM-based tools in an appropriate pipeline. The headline result is that it generates weak graduate-level research for about $15 per paper.


The aim of this project is to adapt and refine this tool for AI Safety research.

Skill requirements

Minimum skills / attitudes



Ideal skills. I do not expect any single individual to have many of these.


(23) Wise AI Advisers via Imitation Learning

Chris Leong

Summary 

I know it’s a cliché, but AI capabilities are increasing exponentially, while our access to wisdom (for almost any definition of wisdom) isn’t increasing at anything like the same pace.


I think that it’s pretty obvious that continuing in the same direction is unlikely to end well.


There’s something of a learned helplessness around training wise AIs. I want to take a sledgehammer to this.


As naive as it sounds, I honestly think we can do quite well by just picking some people who we subjectively feel to be wise and using imitation learning on them to train AI advisors.


Maybe you feel that “imitation learning” would be kind of weak, but that’s just the baseline proposal. Two obvious ideas for amplifying these agents are techniques like debate or trees of agents, and that’s just for starters!


More ambitiously, we may be able to set up a positive feedback loop: if our advisers help people become wiser, then the people we train on become wiser, and the advisers trained on them become wiser in turn.


I’m pretty open to recruiting people who are skilled in technical work, conceptual work or technical communication. This project differs from others in that rather than having specific objectives, you have the freedom to pursue any project within this general topic area (wise AI advisors via imitation learning). Training wise AI via other techniques is outside the scope of this project, unless it is to provide a baseline to compare imitation agents against. The benefit is that this offers you more freedom; the disadvantage is that there’s more of a requirement to be independent for this to go well.

Skill requirements

What I’m proposing operates in quite a different paradigm from most other AI Safety research, so I’m looking for people who are able to “get it”.


For conceptual work, I expect clarity of thinking. I am quite partial to people who have exposure to LessWrong-style rationality, but this certainly isn’t necessary. I’m also quite a fan of people with analytical philosophy experience, particularly if they understand the difference between the map and the territory.


For empirical research, I want people who have prior empirical research experience. This is very early-stage research, so you need to be able to propose experiments that actually tell us something useful. I’ll select on your ability to propose good experiments, but you’ll have broad freedom to pursue whatever you want within the scope during AI Safety Camp itself.


I would be open to finding myself a co-lead who has enough empirical research experience to help mentor other participants interested in pursuing empirical work and if I were to find such a co-lead, then I’d be open to participants without previous empirical research experience. However, my guess would be that <10% of competent empirical researchers would be a good fit as a co-lead, because most people want to operate within an existing paradigm rather than create a new paradigm.


If you want to work on technical communications, you need to be able to precisely communicate complex ideas without oversimplifying or otherwise introducing excessive lossiness. For these projects, I’d probably need to exercise a greater degree of control over the end product, to avoid the risk of miscommunication.

(24) iVAIS: Ideally Virtuous AI System with Virtue as its Deep Character

Masaharu Mizumoto

Summary 

The ultimate goal of this interdisciplinary research program is to contribute to AI safety research by actually constructing an ideally virtuous AI system (iVAIS). Such an AI system should be virtuous as its deep character, showing resilience to prompt injections (not complete immunity, which would itself be a vulnerability) even if it can play many different characters by pretending, including a villain. The main content of the current proposal consists of two components: 1. self-alignment and 2. the Ethics game. Both are based on the idea of agent-based alignment rather than content-based alignment, focusing on what one is doing, which requires metacognitive capacity. 

Skill requirements

Either 1) general skills and experience in coding with Python, fine-tuning, and RLHF for an open-source LLM, or 2) general knowledge of AI safety research and the literature. 

(25) Exploring Rudimentary Value Steering Techniques

Nell Watson

Summary 

This research project seeks to assess the effectiveness of rudimentary alignment methods for artificial intelligence. Our intention is to explore basic, initial methods of guiding AI behavior using supplementary contextual information.

Expected Outcomes: 

Skill requirements

(26) Autostructures – for Research and Policy

Sahil and Murray

Summary 

This is a project for creating culture and technology around AI interfaces for conceptual sensemaking.


Specifically, we are creating for a near future in which our infrastructure is embedded with realistic levels of intelligence (i.e. only mildly creative but widely adopted), yet is full of novel, wild design paradigms anyway. 


The focus is on interfaces especially for new sensemaking and research methodologies that can feed into a rich and wholesome future.

Huh?

It’s a project for AI interfaces that don’t suck, for the purposes of (conceptual AI safety) research that doesn’t suck.

Wait, so you think AI can only be mildly intelligent?

Nope.

But you only care about the short term, of “mild intelligence”?

Nope, the opposite. We expect AI to be very, very, very transformative. And therefore, we expect intervening periods to be very, very transformative. Additionally, we expect even “very² transformative” intervening periods to be crucial, and quite weird themselves. 


In preparing for this upcoming intervening period, we want to work on the newly enabled design ontologies of sensemaking that can keep pace with a world replete with AIs and their prolific outputs. Using the near-term crazy future to meet the even crazier far-off future is the only way to go. 


(As you’ll see below, we will specifically move towards adaptive sensemaking meeting even more adaptive phenomena.)

So you don’t care about risks?

Nope, the opposite. This is all about research methodological opportunities meeting risks of infrastructural insensitivity.



-----------------------------------------------------------------------------------------------------------


Watch a 10 minute video here for a little more background: Scaling What Doesn’t Scale: Teleattention Tech

Skill requirements

If you’re good at (or interested in) engineering, writing, or design, or are generally open-minded and quick to learn, you’re a fit. If you made it through this doc (even if you have lots of questions and confusions) or like to think in meta-systematic ways, then you’ll love it here. 

Other

Projects that didn't fit any shared category

(27) Reinforcement Learning from Recursive Information Market Feedback

Abhimanyu Pallavi Sudhir

Summary 

RLHF is no good on tasks whose outputs humans are unable to easily rate. I propose the Recursive Information Market, which can be understood as an approach to rating based on a human rater’s Extrapolated Volition, or as a generalized form of AI safety via debate. 

Skill requirements

Minimum skills:


I would be especially happy to have someone who can make significant contributions to theoretical work, i.e. coming up with and proving solid, useful theorems. I would happily grant joint first-author position to someone who does the bulk of this work.


The majority of team members, I assume, would be working on implementations in the various contexts stated earlier.

(28) Explainability through Causality and Elegance 

Jason Bono

Summary 

The purpose of this project is to make progress towards human-interpretable AI through advancements in causal modeling. The project is inspired by the way science emerged in human culture, and seeks to replicate essential aspects of this emergence in a simple simulated environment.  


The setup will consist of a simulated world and one or more agents equipped with one sensor and one actuator each, along with a bandwidth-constrained communications channel. A register will record past communications, and store the “usefulness” of trial frameworks that the agents develop for prediction. 


The agents will first create standard deep predictive models for novel actuator actions (interventions) and subsequent system evolution. These agents will then create a reduced representation of their deep models, optimizing for “elegance”, which refers to high predictive accuracy, high predictive breadth, low model size, and high computational efficiency. This can be thought of as the autonomous creation of an interpretable “elegant causal decision layer” (ECDL) that the agents can call upon to reduce the computational intensity of accurately predicting the effects of novel interventions. 
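As a toy illustration of the multi-objective trade-off behind “elegance”, one could scalarize the four named criteria into a single score. The linear form and weights below are our illustrative choices, not the project’s:

```python
def elegance_score(accuracy, breadth, model_size, compute_cost,
                   w=(1.0, 1.0, 1.0, 1.0)):
    """Toy scalarization of the four 'elegance' criteria named in the
    summary: reward predictive accuracy and breadth, penalize model
    size and computational cost. All inputs are assumed normalized to
    comparable scales; weights and functional form are illustrative."""
    wa, wb, ws, wc = w
    return wa * accuracy + wb * breadth - ws * model_size - wc * compute_cost
```

Under such a score, a smaller and cheaper model that matches a deep model’s accuracy and breadth is strictly preferred, which is the intended pressure towards an interpretable reduced representation.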


Success would comprise the autonomous creation and successful utilization of a human interpretable ECDL. This success would provide a proof of concept for similar techniques in more complex and non-simulated environments (e.g. a physical setup and/or the internet).

Skill requirements

Team members’ skills should include at least one of the following:

(29) Leveraging Neuroscience for AI Safety

Claire Short

Summary 

This project integrates neuroscience and AI, leveraging human brain data to align AI behaviors with human values for potentially greater control and safety. In this initial project, we will take inspiration from Activation Vector Steering with BCI, mapping activation vectors to human brain datasets. In previous work, a method called Activation Addition was found to reliably control the behavior of large language models during use by altering the model’s internal processes based on specific inputs, allowing adjustments to topics or sentiments with minimal computing resources. By recreating elements of this work with the integration of brain data inputs, we aim to enhance the alignment of AI outputs with user intentions, opening new possibilities for personalization and accessibility in applications from education to therapy. 
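For readers unfamiliar with Activation Addition, the core operation is very simple: add a scaled “steering vector” (typically the activation difference between two contrasting prompts) to a layer’s activations during the forward pass. A minimal sketch with plain Python lists, assuming nothing about the project’s actual implementation:

```python
def activation_addition(hidden, steering_vector, coeff=1.0):
    """ActAdd-style steering sketch: add a scaled steering vector to a
    layer's activations. In real use this happens inside a transformer's
    forward pass via a hook; here activations are plain lists of floats."""
    return [h + coeff * s for h, s in zip(hidden, steering_vector)]

# Toy illustration: the steering vector is the activation difference
# between two contrasting inputs (e.g. "love" minus "hate").
h_love = [1.0, 0.0]
h_hate = [0.0, 1.0]
steer = [a - b for a, b in zip(h_love, h_hate)]          # [1.0, -1.0]
steered = activation_addition([0.5, 0.5], steer, coeff=0.5)  # [1.0, 0.0]
```

The project’s twist, as we read the summary, would be deriving the steering vector from human brain data rather than from contrasting text prompts.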

Skill requirements

Skill Requirements Research Engineer:


Skill Requirements Research Scientist:

(30) Scalable Soft Optimization

Benjamin Kolb

Summary 

This project is mainly aimed at a deep reinforcement learning (DRL) implementation whose purpose is to assess selected soft optimization methods. Such methods limit the amount of “optimization” in DRL algorithms in order to alleviate the consequences of goal misspecification. The primary proposed soft optimization method is based on the widely referenced idea of quantilization. Broadly speaking, quantilization means sampling options from the top quantile of a reference distribution instead of selecting the top option. 
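A minimal sketch of quantilization over a finite option set (our own illustrative code, not the project’s planned DRL implementation): rank options by the possibly misspecified utility, then sample uniformly from the top q-fraction rather than taking the argmax.

```python
import random

def quantilize(options, utility, q=0.1, rng=random):
    """Quantilization sketch: instead of returning the option that
    maximizes `utility`, sample uniformly from the top q-fraction of
    options (uniform reference distribution). With q -> 0 this
    approaches argmax; larger q means softer optimization."""
    ranked = sorted(options, key=utility, reverse=True)
    k = max(1, int(len(ranked) * q))  # size of the top quantile
    return rng.choice(ranked[:k])
```

The safety intuition is that a misspecified utility is most misleading at its extreme argmax, so sampling from a broad top quantile bounds how hard the agent exploits specification errors.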

Skill requirements

I’m primarily looking for people who are enthusiastic about:


Collaborators should have:


Additionally valuable are:

(31) AI Rights for Human Safety

Pooja Khatri 

Summary 

This project seeks to institute a legal governance framework to advance AI rights for human safety.


Experts predict that AI systems have a non-negligible chance of developing consciousness, agency, or other states of potential moral patienthood within the next decade. Such powerful, morally significant AIs could contribute immense value to the world. Failing to respect their basic rights may not only lead to suffering risks but may also incentivise AI systems to pursue goals that conflict with human interests, giving rise to misalignment scenarios and existential risks. 


Advancing AI rights for human safety remains a neglected priority. While several studies and frameworks exploring potential AI rights already exist, the existing work is either a) largely theoretical and not practical/tractable or feasible from a policy perspective and/or b) fails to take into consideration the contemporary nature of AI development. 


As such, given that AI systems will likely advance faster than legal regimes, powerful early intervention via legal governance mechanisms offers a promising first step towards mitigating suffering and existential risks and positively influencing our long-term future with AI.

Skill requirements

Research Manager


Research Assistant - Legal/Policy


Research Assistant - Technical


We encourage you to err on the side of applying, even if you do not meet all the requirements. If you have any questions, feel free to get in touch: khatripooja.24@gmail.com 

(32) Universal Values and Proactive AI Safety

Roland Pihlakas


I will be running one of three possible projects, based on which one receives the most interest. Below are included the summaries and skill sections for the respective projects.

Summary 

Category: Evaluate risks from AI

(32a) Creating new AI safety benchmark environments on themes of universal human values 

We will be planning and optionally building new multi-objective multi-agent AI safety benchmark environments on themes of universal human values.


Based on various anthropological research, I have compiled a list of universal (cross-cultural) human values. Several of these universal values seem to resonate with concepts from AI safety, but use different keywords. It might be useful to map these universal values to more concrete definitions using concepts from AI safety.


One notable detail in this research is that in the case of AI-human cooperation, the values are not symmetric as they would be in human-human cooperation. This arises because we can change the goal composition of agents, but not of humans. Additionally, there is the crucial difference that agents can be relatively easily cloned, while humans cannot. Therefore, for example, a human may have a universal need for autonomy, while an AI agent might conceivably not have that need built in; instead, the agent would have a need to support human autonomy. 


The objective of this project would be to implement these mappings of concepts into tangible AI safety benchmark environments.


Category: Agent Foundations

(32b) Balancing and Risk Aversion versus Strategic Selectiveness and Prospect Theory 

We will be analysing situations and building an umbrella framework about when either of these incompatible frameworks would be more appropriate in describing how we want safe agents to handle choices relating to risks and losses in a particular situation.


Economic theories often focus on the “gains” side of utility and how our multi-objective preferences are balanced there. A well-known formulation is to use diminishing returns: a concave utility function, which mathematically results in a balancing action where an individual prefers averages in all objectives to extremes in a few objectives.


But what happens in the negative domain of utility? How do humans handle risks and losses? It turns out this might not be as straightforward as with gains.


One might imagine that one could apply a concave utility function to the negative domain as well, in order to balance the individual losses, or to equalise and provide an equal treatment in case of multiple individuals. This would resonate with the idea that generally people prefer averages in all objectives to extremes in a few objectives. As an example, a negative exponential utility function would achieve that.


Yet there is a well-known theory, prospect theory, which instead claims that our preferences in the negative domain are convex. 


As I see it, this contradiction between “preferring averages over extremes” and prospect theory may be underexplored, especially with regard to its relevance to AI safety. 
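The tension can be made concrete with a small sketch. Below, a concave (negative-exponential) utility prefers a sure loss of 50 to a 50/50 gamble on losing 100, while the Tversky-Kahneman prospect-theory value function, convex in losses, prefers the gamble. The functional forms and parameters are standard textbook choices, not the project’s:

```python
import math

def concave_utility(x, a=0.01):
    """Negative-exponential utility, concave everywhere: u(x) = 1 - exp(-a*x).
    In the loss domain this implies risk aversion: a sure moderate loss
    is preferred to a gamble on a larger one."""
    return 1.0 - math.exp(-a * x)

def prospect_value(x, alpha=0.88, beta=0.88, lam=2.25):
    """Kahneman-Tversky value function: concave for gains, convex for
    losses, with loss aversion lam > 1 (parameter values from Tversky
    & Kahneman, 1992)."""
    return x ** alpha if x >= 0 else -lam * ((-x) ** beta)

# Sure loss of 50 vs. a 50/50 gamble on losing 100 or losing nothing:
eu_sure   = concave_utility(-50)
eu_gamble = 0.5 * concave_utility(-100) + 0.5 * concave_utility(0)
pv_sure   = prospect_value(-50)
pv_gamble = 0.5 * prospect_value(-100) + 0.5 * prospect_value(0)
# The concave model prefers the sure loss; prospect theory prefers the gamble.
```

Which of these two behaviors we want from a safe agent, and in which situations, is exactly the question the umbrella framework would address.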



Category: Train Aligned/Helper AIs

(32c) Act locally, observe far - proactively seek out side-effects 

We will be building agents that are able to solve an already implemented multi-objective multi-agent AI safety benchmark that illustrates the need for the agents to proactively seek out side-effects outside of the range of their normal operation and interest, in order to be able to properly mitigate or avoid these side-effects.


In various real-life scenarios we need to proactively seek out information about whether we are causing, or are about to cause, undesired side effects (externalities). This information either would not reach us by itself, or would reach us too late. 


This situation arises because attention is a limited resource. Similarly, our observation radius is limited. The same constraints apply to AI agents as well. We humans, as well as agents, would prefer to focus only on the area of our own activity, and not on surrounding areas, where we do not intend to operate. Yet our local activity causes side effects farther away, and we need to be accountable and mindful of that. Then these far away side effects need to be sought out with extra effort, in order to mitigate them as soon as possible, or even better, in order to proactively avoid them altogether.


I have built a multi-agent multi-objective gridworlds environment that illustrates this problem. I am seeking a team who would figure out the principles necessary or helpful for solving this benchmark, and who would build agents which illustrate these important safety principles. 

Skill requirements

(32a) Creating new AI safety benchmark environments on themes of universal human values

Relevant skills include the following. You do not need to have all the skills.



(32b) Balancing and Risk Aversion versus Strategic Selectiveness and Prospect Theory

Relevant skills include the following. You do not need to have all the skills.



(32c) Act locally, observe far - proactively seek out side-effects

Relevant skills include the following. You do not need to have all the skills.