Team member applications for AI Safety Camp Virtual 2024 should open around November 2023
(research lead applications open around September 2023).

AI Safety Camp connects you with an experienced research lead to collaborate on a research project – helping you try your fit for a potential career in AI Safety research.

The applications for AI Safety Camp’s Virtual Edition in 2023 are now closed.


Rob Miles explains Langosco et al. Objective Robustness in Deep Reinforcement Learning,
a paper produced at the virtual camp



AI Safety Camp Virtual 8 will be a 3.5-month-long online research program from 4 March to 18 June 2023, where participants form teams to work on pre-selected projects.

We value people with diverse backgrounds and skillsets, such as cognitive science, social science or cybersecurity. Not all projects require participants to have prior experience in AI Safety, mathematics or machine learning. Read in detail about the research topics & each project’s skill requirements for our upcoming edition by following the links below.



Projects you can apply to…

Conceptualise AGI dynamics

Uncertainty -> Soft Optimization with Jeremy Gillen

I want to formally describe an improved version of quantilizers, and implement and test it.

  • I want to formally extend quantilizers to work with uncertainty over utility functions
    • So the more uncertain it is about goals, the more it follows the base distribution, but the more confident it is, the more it acts like an expected utility maximizer
    • I hope that this improves the search power of quantilizers (on a specific task) while maintaining some safety guarantees
  • I want to develop and test fast approximate versions of quantilization
    • This is motivated by wanting to understand which properties to look for in learned optimizers.
    • This might involve demonstrating that some approximate optimizers like Monte Carlo Tree Search (MCTS) or variational Bayesian inference are similar or identical to quantilization under some assumptions.
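For readers new to the term, here is a minimal sketch of a plain sample-based quantilizer: draw candidate actions from a base distribution, rank them by utility, and pick uniformly at random from the top q fraction. It is not the extension proposed in this project (it has no uncertainty over utility functions), and the function name `quantilize`, the toy action space and the toy utility are all assumptions made for illustration.

```python
import random
from typing import Callable, Sequence, TypeVar

A = TypeVar("A")


def quantilize(
    base_samples: Sequence[A],
    utility: Callable[[A], float],
    q: float,
    rng: random.Random,
) -> A:
    """Pick uniformly from the top-q fraction of base samples, ranked by utility."""
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    if not base_samples:
        raise ValueError("need at least one base sample")
    ranked = sorted(base_samples, key=utility, reverse=True)
    k = max(1, int(len(ranked) * q))  # always keep at least one candidate
    return rng.choice(ranked[:k])


rng = random.Random(0)
# Hypothetical base distribution: actions drawn uniformly from 0..99.
samples = [rng.randrange(100) for _ in range(1000)]
# Hypothetical utility: larger action values are better.
utility = lambda a: float(a)

cautious = quantilize(samples, utility, q=1.0, rng=rng)   # behaves like the base distribution
greedy = quantilize(samples, utility, q=0.001, rng=rng)   # behaves like a maximizer over the samples
```

With q = 1 the choice is just a draw from the base samples, and as q shrinks it approaches taking the best sample, which is the trade-off between following the base distribution and maximizing that the bullets above describe.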

Skill requirements
At least one of:

  • Solid maths skills, especially probability theory and ideally some statistical learning theory.
  • Some experience with algorithm development
  • Experience with software engineering

Definitely required by everyone:

  • Basic computer science (at least at the level of an undergrad algorithms course)

Inquire into Uncontrollable Dynamics of AGI with Remmelt Ellen

Join us if you’re a skilled generalist, care about future life on Earth, are open to the notion that some dangerous (auto-scaling/catalysing) technology cannot be sufficiently controlled, either in theory or in practice, and seek to inquire whether or not this is the case for Artificial General Intelligence.

We will collect questions for and answers from research authors, mostly Forrest Landry, about:

  1. Theoretical limits to controlling any AGI using any method of causation.
  2. Economic decoupling of value exchanges of the artificial ecosystem with the organic ecosystem.
  3. Threat model of convergent dynamics that cannot be controlled (1) or aligned game-theoretically (2).
  4. Impossibility theorems, by contradiction of ‘long-term AGI safety’ with convergence result (3).

We will first spend at least a month trying to form a picture of the authors’ existing arguments that is as accurate as possible. From there we can individually cross-check reasoning, probe premises, and consider alternative lines of reasoning.


Skill requirements
Generalists with technical or humanities/communications backgrounds are welcome!
Join us if you:

  1. care about humans and all biological life; our potential to thrive, in present and future,
  2. care about rigorous inquiry, covering both elegant theory and messy real-life practice,
  3. give open-minded attention to reading and inquiring one-on-one about arguments for why long-term AGI safety would be categorically impossible,
  4. can write down clear questions, clarifying your own premises and specific use of terms,
  5. are not situationally required to be invested in ‘solving’ AGI or to develop AGI, and,
  6. are prepared to take personal risks for what is right even when it is socially uncomfortable.

Discussing and Crystallising a Research Agenda Based on Positive Attractors and Inherently Interpretable Architectures with Robert Kralisch

This is a conceptual-theoretical research project, with the option of prototyping a novel cognitive architecture if the project progresses that far within the scope of the AISC. In my conceptual research, I have developed a cluster of ideas to address concerns of distributional shift, goal-misgeneralisation and instrumental convergence. This work has converged in a design for a neuro-symbolic cognitive architecture that, I believe, can be used to implement and test those ideas. Said design is partially brain-inspired and not yet of high enough resolution to straightforwardly implement.

The aim of this research project would be to discuss, concretise and evaluate this research agenda in general and the proposed architecture in particular, ideally producing a concise research agenda that can be scaled up. There is room for original contributions from participants and the option of prototyping parts of the cognitive architecture.

Skill requirements
I am looking for team members with interdisciplinary/generalist interests or backgrounds, e.g. in Cognitive Science, that feel comfortable with abstract/conceptual work and feel drawn to more diverse approaches to tackle the alignment problem. A particularly useful attribute for this project is good mathematical intuition and comfort with thinking about graphs and constraint satisfaction therein.

A partially technical background for the later stages of the project would be ideal to have in at least one participant, but I don’t expect the prototyping to be a significant technical challenge. The main technical contribution would more so be in creating a more detailed blueprint of the cognitive architecture, enabling better scaling and testing beyond the initial phase.


Investigate transformer models

Cyborgism with Nicholas Kees Dupuis

Capability gains in AI have been really rapid recently, and humanity finds itself in a race against time to solve alignment before powerful and misaligned AI is deployed. These capabilities, however, also unlock the potential for a wide range of human-machine collaboration which might enable us to make progress significantly faster than we currently do. The goal of this agenda is to train and empower “cyborgs”, people deeply integrated with AI systems, to make significant conceptual and technical progress on alignment. This differs from other ideas for accelerating alignment research by focusing primarily on augmenting ourselves and our workflows to accommodate machines, rather than just training machines to work well with the existing research pipeline. This document outlines three different projects within the Cyborgism agenda.

Skill requirements
Different for each project. See each project description.

Understanding Search in Transformers with Michael Ivanitskiy

Transformers are capable of a huge variety of tasks, and for the most part we know very little about how. In particular, understanding how an AI system implements search is probably very important for AI safety. In this project, we will aim to:

  • gain a mechanistic understanding of how transformers implement search for toy tasks
  • explore how the search process can be retargeted, ensuring that the AI system is aligned to human preferences
  • attempt to find scaling laws for search-oriented tasks and compare them to existing scaling laws

Skill Requirements

  • proficient in Python and an ML framework (project will use PyTorch, experience with JAX/TF is acceptable)
  • willingness to use git
  • decent understanding of transformer networks, sampling techniques, and attention heads
  • basic familiarity with the inner alignment problem
  • preferred: familiarity with existing transformer interpretability work.

I envision this project being a good mix of conceptual and engineering work. I don’t believe that conceptual work alone is sufficient, and I think our best bet for figuring out how search works in transformers is to try a bunch of different interpretability techniques and see what works. I’m a decent programmer, and I have a reasonably solid math background, but I only have so many ideas on how to look for internal search. I’m hoping to find team members who can help both with the actual implementation, and help come up with new ideas for what interpretability techniques to try.

Interdisciplinary Investigation of DebateGPT with Paul Bricman

By the beginning of AI Safety Camp 8 (AISC8), I expect to have trained DebateGPT, a language model fine-tuned to simulate pertinent debates between several parties using a novel training regime which doesn’t make use of human feedback. This new training regime is designed to (1) enable large language models (LLMs) to deliberate about human goals more effectively, (2) improve the ability of LLMs to coherently model worldviews in light of new facts, and (3) provide a reasoning tool for researchers, helping them engage with steelman versions of conflicting perspectives.

However, obtaining DebateGPT itself is not the focus of the proposed AISC8 project. Instead, I’m excited about the prospect of investigating DebateGPT’s reflection process from different perspectives, including: argumentation theory, non-monotonic logic, game theory, sociology, and dynamical systems. Participants with diverse backgrounds will be able to contribute their valuable expertise on those topics and help gauge the potential of LLMs to reason about what humans want from them. At the end, we’ll compile the various analyses and publish them in a unified form.

Skill requirements
The specific structure of this project (i.e. parallel interdisciplinary investigations) makes it so that team members should have a background which enables them to carry out an investigation from their perspective (see “What perspectives will there be?”). The following should help ensure that you’re decently equipped for the task: having carried out a research project in that field before, having followed a couple of courses on the topic in a formal setting, or having spent at least 30h reading up on the topic.

Besides the specific background associated with the perspectives being explored, some general skills are also advised. There’s a strong focus on formulating and carrying out your own analysis in the context of the team, which requires some degree of autonomy. This might seem daunting, but the other team members, the team coordinator, and myself, will make carrying out your work more accessible and (hopefully) more enjoyable than if you were to do it in isolation. Familiarity with programming beyond basic Python is not required.


Finetune language transformers

Does Introspective Truthfulness Generalize in LMs? with Jacob Pfau

Aligning language models (LMs) involves taking an LM which simulates many human speakers, and fine-tuning it to produce only truthful and harmless output. Recent work suggests reinforcement learning from human feedback (RLHF) generalizes well, teaching LMs to be truthful on held-out tasks. However, on one understanding of RLHF fine-tuning, RLHF works by picking out a truthful speaker from the set of speakers learned during LM pre-training; if that characterization is accurate, RLHF will not suffice for alignment. In particular, LMs will not generalize to be truthful on questions where no human speaker knows the truth. For instance, consider “introspective tasks”, i.e. those for which truth is speaker-dependent.

I propose creating a dataset of tasks (ideally 30ish) for evaluating the generalization of language model (LM) truthfulness when trained on introspective tasks. We would then use this dataset to fine-tune an LM for truthfulness on a subset of these tasks and evaluate it on the remaining tasks. This project will (1) critically evaluate the alignment adequacy of RLHF and (2) evaluate how difficult fine-tuning for truthfulness is under the current paradigm for LM training.
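The evaluation design described here (fine-tune on some tasks, test on the rest) amounts to a split at the level of whole tasks rather than individual examples. A minimal sketch of such a split follows; the helper `split_tasks`, the placeholder task names and the 20/10 ratio are assumptions for illustration, not part of the proposal.

```python
import random


def split_tasks(task_names, n_train, seed=0):
    """Shuffle whole tasks and split them into fine-tuning and held-out sets."""
    rng = random.Random(seed)
    shuffled = list(task_names)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder names standing in for the roughly 30 introspective tasks.
tasks = [f"task_{i:02d}" for i in range(30)]
train_tasks, eval_tasks = split_tasks(tasks, n_train=20)
```

Splitting by task means that a good score on `eval_tasks` reflects generalization to unseen kinds of questions, not just to unseen examples of a familiar task.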

Skill requirements

  • Familiarity with Python at the level of having implemented a couple of projects in Python
  • Basic ML engineering experience with e.g. PyTorch or TensorFlow
  • Background reading on LM alignment: Simulators, Inverse scaling, LM truthfulness, RLHF, ELK. This knowledge can be distributed across the group.

Both of these ‘useful’ background points can be caught up on as needed.

Inducing Human-Like Biases in Moral Reasoning LMs with Bogdan-Ionut Cirstea

This project is about fine-tuning language models (LMs) on a publicly available moral reasoning neuroimaging (fMRI) dataset, with the hope/expectation that this could help induce more human-like biases in the moral reasoning processes of LMs. This will be operationalized by testing if fine-tuning LMs on fMRI data (of the above-mentioned dataset) helps improve test performance on the ETHICS moral reasoning dataset and if it helps significantly more than just using additional non-neuroimaging behavioural data (moral reasoning permissibility scores) for LM fine-tuning.

More broadly, this project would fit as a potential proof-of-concept in a new AI alignment research agenda I’m working on, on neuroconnectionism (comparing artificial and biological neural networks) for AI alignment. Moral reasoning is an interesting application area, both for its relevance to AI alignment and because of the availability of public neuroimaging data, as well as e.g. publicly-available LMs fine-tuned for moral reasoning.

Skill requirements
The skill most needed for this project is ML research engineering, as well as motivation to work on AI alignment projects. Some nice-to-haves include previous exposure to AI alignment literature and arguments and neuroscience/cognitive science knowledge/skills.

The minimum skills required include having had some exposure to ML research engineering, e.g. decent knowledge of PyTorch or similar frameworks, at least some experience with some neural net training run monitoring, debugging, etc.

A very-nice-to-have, though probably very difficult to find, might be a profile with significant experience at the intersection of neuroconnectionism, neuroscience of moral reasoning and AI alignment; I’m unsure how many such profiles currently exist and I am also aiming to get closer to that description myself.


Behavioural preferences in humans and machines

Behavioral Annotation Framework for the Contextualized and Personalized Fine-Tuning of Foundation Models with Eleanor “Nell” Watson

Machine intelligence is increasingly sophisticated and is now embedded in daily life and the global economy. However, a lack of understanding of personal context and a lack of accommodation to minority cultural expression limit the trustworthiness of algorithmic judgments, that is to say the reliable accuracy and fairness of judgments based upon surrounding circumstance and context. This inability to accommodate human preferences and recognise human intention contributes to a lack of corrigibility and unjust outcomes.

This issue could be alleviated with richly annotated training data on preferences towards various behaviors and the values encoded within them, in a wide range of cultural and situational contexts. Datasets of sufficient nuance, diversity and scope on human preferences, in a range of cultural and situational contexts, could provide significant benefits to the contextualization of AI alignment, and the personalisation of fine-tuning for individuals and groups.

This can be achieved by helping AI systems to better understand, anticipate, and accommodate human needs, whilst avoiding misapprehension that could lead to prejudicial machine judgments. Accommodating the preferences of others is a fundamental aspect of acting in a prosocial manner, that is to ‘take action to support the flourishing of another’, to paraphrase M. Scott Peck’s definition of love.

To address this gap, it is essential to streamline the annotation of behavior for the values elicited within it, making the process as simple, accessible, efficient, and inclusive as possible. This project seeks to design a prototypical behavioral annotation framework which leverages LLM/Diffusion model technologies to engineer revolutionarily simple and efficient prompt/chat-driven annotation mechanisms.

The project seeks assistance with understanding how improved annotation can be applied to techniques such as Reinforcement Learning from Human Feedback, as well as cybersecurity insights to help our prototype be more secure against potential intrusion, as annotation of values can be a sensitive domain.

Skill requirements:
Team members should be (at least somewhat) knowledgeable in at least one of the below areas:

RLHF Expertise
We wish to understand how our improved annotation processes (and scenario generation techniques) can be most efficiently applied to techniques such as Reinforcement Learning from Human Feedback, in order to ensure that the project can be as useful as possible.

Cybersecurity Expertise
Security issues also present a concern, as potentially sensitive information about personal value impressions is intended to be collected by the tools in this framework. The team therefore seeks cybersecurity experts who can help to provide cybersecurity best practices for overall hardening of the evolving implementation.

Multimodal Data Expertise
Anyone with experience of multimodal data and annotation thereof, or the application of such complex datasets to LLMs/Diffusion models, would also be fantastic.

How Should Machines Learn from Default Options? with En Qi Teo

How can AI learn what to do in the pursuit of our goals when we do not specify complete sets of preferences in reward functions? Shah et al. (2019) and Lindner et al. (2021) propose that preferences are revealed through the state of the world since, rationally, we should have already acted to optimize our environments according to our preferences. More generally, when inverse RL agents seek to extract reward functions through observing behavioral data, this data exhibits cognitive biases. Notably, humans display path-dependent behavior and are more likely to choose a default option when making decisions, even when these decisions carry significant weight for their future (e.g. decisions about retirement savings plans). If AI learns from states of the world brought about by this bias, then it will be systematically wrong about our preferences. As a simple example, if AI learns that most people do not opt out of web cookies, it might infer that we do not care about online privacy, when in actuality, most people might prefer not to be tracked.
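The web-cookie example can be made concrete with a toy calculation. The sketch below assumes a deliberately simple model in which a fixed fraction of people keep the default regardless of what they prefer; the function name and both numbers (70% preferring not to be tracked, 80% default stickiness) are hypothetical and chosen only to illustrate the direction of the bias.

```python
def observed_opt_out_rate(true_preference_rate: float, default_stickiness: float) -> float:
    """Share of people who opt out when a fraction `default_stickiness`
    keeps the default no matter what they actually prefer."""
    return true_preference_rate * (1.0 - default_stickiness)


# Hypothetical numbers, for illustration only.
true_rate = 0.7    # share of people who would prefer not to be tracked
stickiness = 0.8   # share who keep the default regardless of preference

observed = observed_opt_out_rate(true_rate, stickiness)

# A learner that reads behavior as revealed preference would take `observed`
# as its estimate of `true_rate`, understating the preference for privacy.
naive_estimate = observed
```

Under these assumed numbers only about 14% of people opt out even though 70% would prefer to, so a learner that treats choices as revealed preferences would be systematically wrong in the way the paragraph above describes.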

This project aims to synthesize the literature on the welfare effects of default options in order to (i) study the importance of recognizing and accounting for default options in training RL agents to learn our preferences, (ii) elucidate scenarios in which learning from default options might pose catastrophic risks, and (iii) formalize how default options should be taken into account in training inverse RL agents.

Skill requirements
The project requires the following skills:

  1. Ability to understand research in economics and behavioral science (both theory and empirical work), and an appreciation for how these findings relate to RL models (at minimum) (Everyone needs to have this)
  2. Some familiarity with AI alignment and AI risks (Everyone needs to have this/acquire this before the start of the project)
  3. Ability to find relevant papers (Optional)
  4. Proof-writing and formal theory (I’m working on this but would appreciate having someone on the team who has more experience here) (Optional, but would be very useful for the ambitious version of this project)
  5. ML experiments (ditto above) (Optional, but would be very useful for the ambitious version of this project)


Review and analyse literature

Literature Review of the Neurological Basis of Human Values and Preferences with Linda Linsefors

AI alignment is about aligning an AI system with human values. To do so, it seems relevant to better understand what human values are.

The idea of this project is to find various ideas/theories/models/hypotheses about how human values and preferences are implemented in the brain (primarily from the cognitive neuroscience literature), and write them all down in a single review blogpost. If we have time, we’ll also write a discussion about what the implications are for alignment for each of these theories.

Skill requirements
Everyone on the team should have some basic knowledge of cognitive neuroscience and/or related fields.

The main tasks that need to be done are:

  1. Finding relevant papers/texts
  2. Reading and understanding those papers/texts
  3. Writing summaries

As long as one team member is good at 1, we’re fine.
But everyone needs to do at least some of 2 and 3, although we’ll help each other out.

Machine learning for Scientific Discovery: the Present and Future of Science-Producing AI Models with Eleni Angelou

Machine learning models have recently found remarkable success in quantitative reasoning and in producing significant research results in specific scientific fields such as biology (e.g., with AlphaFold) and chemistry. This might be an indication that in the (near-term) future, AI models will be able to generate and/or test novel hypotheses or even produce research worthy of a Nobel prize.

This project is about mapping out the current state-of-the-art of science-producing AI models. This will allow for a more comprehensive understanding of the “cognitive properties” or capabilities of the available models. The project focuses on models that generate impressive results in solving quantitative problems (e.g., Minerva) or have led to important breakthroughs and acceleration of scientific research, such as in the case of AlphaFold. The team will collect the relevant research papers, review them, and study the development of capabilities necessary for scientific reasoning in relation to AI risk and progress in alignment research.

Skill requirements
All members must:

  • be passionate about reducing risks from advanced AI models
  • care about bettering their epistemics/developing their own model without deferring

It will be great to:

  • have a good understanding of ML and in particular, how deep neural networks work
  • have knowledge of at least basic cognitive science or theory of science concepts
  • be good and fast at finding, reading, and understanding relevant papers/posts/books, etc.
  • be good at formulating precise and falsifiable questions
  • be comfortable with writing/editing texts and giving/receiving honest feedback
  • be comfortable with scaling laws in ML and forecasting AI

→ Exceptional candidates may (but not necessarily!) have studied a science (preferably physics, biology, CS, etc.), are curious about how cognition and science work, how scientific and technological progress occurs, and have a good understanding of AI risk.


Propose public policy/communication

Policy Proposals for High-Risk AI Regulation with Koen Holtman

This project will support European government initiatives at AI industry regulation by making specific proposals about what minimally acceptable best practices for AI risk management should look like. These proposals will cover all phases of the AI system lifecycle, from development to post-deployment monitoring.

As part of its legislative agenda, the EU Commission is about to instruct the EU standards organisations to write technical standards which spell out acceptable risk management practices for a broad class of “high risk AI” systems, a class which is defined by the upcoming EU AI Act. This project intends to support and influence the standards development work that will be done by these standards organisations, specifically the work inside the European CEN-CENELEC JTC21 standards committee technical expert working groups. Supporting this work creates major opportunities for lowering x-risk and s-risk, opportunities that so far have gotten only limited attention in the AI Alignment community. The deliverables of the project will be written as pieces of text which are shovel-ready proposals, shovel-ready by matching the exact legal-technical context of the standards writing effort that is triggered by the EU AI Act. The Research Lead Koen Holtman is a member of the CEN-CENELEC JTC21 working groups: he will act as a connection to bring the output of this project to the attention of these working groups.

At the start of the project, we will select one or more specific risk management topics from the broad set of AI risks identified by the EU AI Act. We will then identify and propose minimally acceptable best practices for managing these AI risks, and we will motivate these proposals by backing them up with common sense technical arguments and citations from the existing literature.

Skill requirements
The project is looking for participants with at least one of the following types of background knowledge:

  • An understanding of AI or ML technology: an undergraduate level or better understanding, or self-taught equivalent. It is not required to have an understanding of the latest research-phase or hyped technologies like large language models: having an understanding of ML, deep neural nets, and their failure modes is sufficient. If you have an understanding of symbolic approaches to machine reasoning and their failure modes, this is also relevant.
  • An understanding of systems engineering or socio-technical systems engineering, design, or failure mode analysis, for any kind of system (e.g. medical, administrative, transportation) where there is a key safety engineering concern that the system might interact with end users or society in an unsafe, exploitative, or unaligned way.

Participants should have the skills needed to read academic papers and books about the state of the art in AI/ML itself, or about risk engineering in larger systems (e.g. transportation, medical) which use AI/ML merely as a component.

Participants with a legal or management background who have interacted a lot with technologists are also welcome. The ability to do careful technical writing and fact checking is a definite plus, but it is OK if only some participants in the team have it.

Participants do not need to have any academic AI or ML research experience, as this project does not aim to make any academic AI research contributions or breakthroughs, in the traditional sense that is valued by academia. Its aim is to locate existing knowledge and insights and transform them into actionable policy proposals. The main participant skill that is needed is that of navigating scattered knowledge, and turning it into actionable proposals about minimally acceptable best practices. The best practices to be defined cannot in fact be too complicated: they must never call on the use of esoteric knowledge only available to a few deep specialists. The regulator must be able to trust that even generalist AI technologists will be able to understand and follow best practices. In the standards and political context, best practice proposals are best supported by a type of reasoning that appeals to common sense, while being backed up by already-published academic research. So this project requires a generalist attitude and skillset.

The practice of decision making required in this project perhaps fits the business world better than the academic world. It is the practice of making decisions about acceptable risk management under a lot of uncertainty, in cases where not making any decision at all poses an even greater risk. You have to be comfortable doing this type of work; not everybody is. This style of work is sometimes also called design thinking, or engineering thinking.

Developing Specific Failure Stories about Uncontrollable AI with Karl von Wendt

The goal of the project is to develop one or more specific, detailed scenarios and stories about how advanced AI might get out of control. They are intended primarily to convince people outside of the AI safety community, especially established scientists, of the dangers of advanced AI. The aim is to be as realistic and detailed as possible. As a side effect, the project may also contribute to the still unanswered question of where to draw “red lines” to prevent uncontrollable AI.

Skill requirements
No creative or writing skills are required by the team members. Instead, technical knowledge, especially in machine learning, will be helpful, although there is no specific minimum requirement. Also very welcome are participants with backgrounds in psychology, IT security, and game theory.




Apply if you...

  1. want to try out & consider ways you could help ensure that future AI performs safely and in line with what people value upon reflection;
  2. are ready to dig into our research leads’ research topics and are able to write down a few clear arguments for why you’d research one in particular & how you might start;
  3. previously studied a topic or practiced skills unique to your perspective/background that can bolster your new research team's progress; 
  4. can block off hours to focus on research from March to June 2023 on normal workdays and the weekends (at least 10 hours per week).

Application timeline

5 Jan 2023 00:01 UTC: Project proposals are posted on our website. Application form opens. Reviews start right away.
19 Jan 23:59 AoE: Deadline to apply to teams closes. Late submissions might not get a response.
1 March 23:59 AoE: Last applicant admitted or declined (most will be informed of our decision earlier).




First virtual edition – a spontaneous collage





  • January 5: Accepted proposals are posted on the AISC website. Applications to join teams open.
  • January 19: Applications to join teams close.
  • By end of February: Organisers pre-filter applications. RLs interview potential members and pick their team.


  • March 4-5: Opening weekend.
    From here, teams meet weekly, and plan in their own work hours. 
  • June 17-18: Closing weekend.
    All teams present their results.

Team structure

Every team will have:

  • one Research Lead
  • one Team Coordinator
  • other team members

All team members are expected to work at least 10 hours per week on the project, which includes joining weekly team meetings, and communicating regularly (between meetings) with other team members about their work.

As of yet, we cannot commit to offering stipends for team members, because a confirmed grant fell through. Another grantmaker is in the midst of evaluating a replacement grant for AI Safety Camp. If confirmed, team members can opt in to receive a minimum of $500 gross per month (up to $2000 for full-time work).

Research Lead (RL)

The RL is the person behind the research proposal. If a group forms around their topics, the RL will guide the research project, and keep track of relevant milestones. When things inevitably don’t go as planned (this is research after all) the RL is in charge of setting the new course.

The RL is part of the research team and will be contributing to research the same as everyone else on the team.

Team Coordinator (TC)

The TC is the ops person of the team. They are in charge of making sure meetings are scheduled, checking in with individuals on their task progress, etc. TC and RL can be the same person.

The role of the TC is important but not expected to take too much time (except for project management-heavy teams). Most of the time, the TC will act like a regular team member contributing to the research, same as everyone else on the team.

Other team members

Other team members will work on the project under the leadership of the RL and the TC. Team members will be selected based on relevant skills, understandings and commitments to contribute to the research project.



You can contact us at