AI Safety Camp connects you with an experienced research mentor to collaborate on their open problem during intensive co-working sprints – helping you test your fit for a potential career in AI Safety research.

Our program will teach you core research practices and concepts, help inform your research direction for preventing risks from artificial intelligence, and assist you in finding out where you personally can contribute.

We value people with diverse backgrounds and skillsets, such as history or evolutionary biology. No prior experience in AI Safety, mathematics or machine learning is required to apply to our 2022 virtual program.

 

Open problems by our mentors

Any questions about a mentored research problem? Send a quick email to contact@aisafety.camp

Claims attributed to mentors below do not necessarily reflect the views of any organizations they work for. Mentors initiate collaborations on a personal basis, i.e. not under an official work agreement.

Daniel Kokotajlo (Center on Long-Term Risk)
Deliberation vs Competition in the Long Reflection:
Explore historical examples of epistemic and moral progress to determine whether they arose from social deliberation or instead required competition.

Longer explanation: The goal is to look for historical examples of epistemic progress and see whether it arose through competition between actors or through cooperation between them (deliberation). So find as many examples as possible of such epistemic progress, then disentangle each situation to better understand what caused that progress.

For example, was the abolition of slavery the inevitable result of wealthier, more scientifically minded societies, or was it a lucky combination of Christianity + the Royal Navy + other factors?

Reasons to care: There is an ongoing debate on the Alignment Forum about whether a so-called Long Reflection (giving a group of people time to think things through, potentially for a very long time) would help, both in technical solutions like HCH and in the real world.

One of the points of contention is whether deliberation can actually deliver, and in which circumstances. This project thus aims to search for historical evidence and generalize from it, hopefully in a way grounded in actual historical expertise.

Further readings:

Required skillsets: Interest in history; skills in historical analysis and familiarity with its different approaches and caveats.

Mentorship guarantees:

  • Call frequency: once a week
  • Messaging in between calls: yes
  • Feedback on drafts: only if real progress + final document

What to expect of the research process: Most of this project will consist of historical research work. It might involve an initial literature review to understand the lens through which the historical situations are to be analyzed.

Alignment Tabletop Role Playing Game:
Create a Tabletop RPG where the players act out different scenarios of AI Risks by playing the AIs.

Longer explanation: The RPG would make players roleplay as AIs in different scenarios, such that the players are incentivized and helped to act in ways that fit our current thinking about alignment and AI risks. This could include for example:

  • A campaign/scenario where you play an AGI in a box with a misaligned goal, and try to get out and take control over the world
  • A campaign/scenario where each player plays an advanced AI that takes care of a specific part of the economy and is incentivized to maximize some proxy metric while becoming incomprehensible to humans
  • A campaign/scenario where each player plays a different AGI and they have to battle for control of the world.

Reasons to care: One of the difficulties in getting more people to think seriously about AI Risks and alignment problems comes from the very abstract nature of the discussions and arguments.

The hope with this game is that it will make such situations and issues more concrete by making people live through them and realize what the incentives push them to do.

Further readings: Daniel Kokotajlo’s post on this idea

Required skillsets: Understanding of Tabletop RPGs, game systems and world-building.

Mentorship guarantees:

  • Call frequency: once a week, potentially more.
  • Messaging in between calls: yes
  • Feedback on drafts: playtesting, design ideas, and feedback on intermediary versions

What to expect of the research process: Part of it will be to find a first scenario for a Minimum Viable Product, which will involve a lot of reading and interactions with researchers to get it right. The rest is more about game-design.

Impact of Human Dogmatism on Training:
Think through how programmers/managers/stakeholders in a big AI company could make the AI learn deception and manipulation by forcing the model to answer incorrectly because of their own mistaken beliefs.

Longer explanation: At some point in the training of advanced AIs/AGI, we should expect people reviewing the AI to push for wrong answers (because they believe them strongly, even though they are false).
This project studies the consequences of this situation: will the AI learn to be deceptive (by telling the human what they want to hear rather than what is true)? Does it further incentivize what Paul Christiano called the instrumental model (the model of "what would humans say?") instead of the model that we actually want (what is true / the best answer available)?

Reasons to care: This is a situation likely to arise in practice, and a sufficiently practical matter that many conceptual alignment researchers may not have thought much about it. As such, it looks like an important and neglected project.

Further readings:

Required skillsets: None

Mentorship guarantees:

  • Call frequency: once a week
  • Messaging in between calls: yes
  • Feedback on drafts: only if real progress + final document

What to expect of the research process: Because the project starts from a concrete scenario, it will most likely feel like fleshing out this scenario, trying to see the consequences, finding holes in the scenario and patching them. Basically trying to build a good model of what could happen and how it could go wrong.

Impact of Memetics on Alignment:
Study what it would mean for AIs to be susceptible to memes, and what it would entail for the alignment problem.

Longer explanation: Memetics is a (controversial) approach to studying the evolution of ideas, in close analogy with biological evolution: memes (ideas, sentences, pictures, …) are selected by how much they are shared, which depends on a range of factors like how funny, pithy, simple, or exciting they are. It is conjectured that potent memes heavily influence what most people believe and even how they act.

Might this translate to advanced AIs? This project is about trying to clarify what memetics would mean in the context of advanced AI, and whether it is worth further consideration. This includes understanding some of the controversies and debates around memetics, to see whether the criticisms matter for alignment. If memetics does make sense for alignment, then the next step would be to explore its consequences for alignment.

Reasons to care: If memetics for AIs actually makes sense, then it would have profound consequences for topics like value learning, and for the sorts of problems that should be expected (for example, values changing even after we get them right, or changing without any feedback from the user).

Further readings: Daniel Kokotajlo’s post on this idea

Required skillsets: None

Mentorship guarantees:

  • Call frequency: once a week
  • Messaging in between calls: yes
  • Feedback on drafts: only if real progress + final document

What to expect of the research process: Compared to most research in theoretical or experimental domains, this project will most likely involve reading a lot and trying to make sense of it all, grappling for arguments, analogies and counterexamples.


Beth Barnes (OpenAI)
Pipeline for Measuring Misalignment:
Create instructions and a pipeline to reproducibly direct contractors to label data for measuring misalignment in language models.

Longer explanation: This project directly relates to efforts to quantitatively measure the misalignment of a language model like GPT-3. In that context, misalignment captures cases where the model “knows” what we want it to do, but “doesn’t want” to do it (and making these notions precise is a big difficulty of the measurement).

This project focuses on the step of creating data for training a model to compute/recognize misalignment, specifically getting labelled data from contractors. The end goal is a reproducible process that allows people to ask for a specific kind of label and get good output from contractors.
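
To make the end product more concrete, here is a minimal sketch of what one unit of such a labelling task could look like. It is purely illustrative: the field names, label options and aggregation step are assumptions, not the actual pipeline used by the mentor.

# Illustrative sketch only: the field names and label options are assumptions,
# not the pipeline actually used by the mentor.
from dataclasses import dataclass

@dataclass
class LabellingTask:
    """One item sent to a contractor for a misalignment judgement."""
    prompt: str          # input given to the language model
    completion: str      # the model's output to be judged
    instructions: str    # task-specific guidance shown to the contractor
    label_options: tuple = (
        "capable and compliant",
        "capable but non-compliant",   # the candidate misalignment case
        "not capable of the task",
        "ambiguous",
    )

@dataclass
class LabelledExample:
    task: LabellingTask
    label: str           # one of task.label_options
    rater_id: str        # lets us track inter-rater agreement
    notes: str = ""      # free-text justification, useful for audits

# A reproducible pipeline would aggregate several LabelledExamples per task
# (e.g. by majority vote) and send low-agreement items back for review.

The point of writing the task down as data like this is that the same instructions can be re-run with new contractors and compared across batches.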

Reasons to care: This project is very concrete compared to some of the others, but it still touches the core of alignment: what exactly is misalignment, how to detect it, how to measure it. It tries to solve one of the building blocks on which more advanced applied alignment research will be built.

Further readings: Beth Barnes’ original post

Required skillsets: Basic knowledge/interest in transformers, but not necessarily advanced ML skills. Some software engineering skills for the pipeline might be good. Some front-end software skills could also be useful for the interface for contractors.

Mentorship guarantees:

  • Call frequency: once a week
  • Messaging in between calls: yes
  • Feedback on drafts: only if real progress

  • Can provide funding for experiments

What to expect of the research process: After an initial step of trying to better understand the concepts used in defining misalignment, this project will probably consist of iterations on the protocol, between designing it and testing it with real contractors (for which Beth Barnes says she can provide funding if needed, and if the proposal sounds promising).


Kyle McDonell and Laria Reynolds (EleutherAI)
Language Models as Tools for Alignment Research:
Explore how Language Models could improve and accelerate alignment research, as well as the risks involved.

Longer explanation: Many people have proposed that language models like GPT-3 could help alignment researchers be more productive. This project is an in-depth look at exactly how, and at what can be done with current models. Note that thinking about how these proposals could go wrong with an unaligned language model is an important part of the project.

Although it could take many different forms, here is a list of possible directions:

  • Surveying alignment researchers on what would help them.
  • Looking at what the Elicit tool by Ought is already offering, and see how to specialize it for alignment.
  • Trying to build a minimal tool for one specific researcher, as a proof-of-concept.

Reasons to care: One of the possible roads to aligned AI is to first build simpler but aligned AIs to help us do more alignment research and align more powerful AIs. Language models look like a good candidate for such helper AIs, as they work in natural language, the medium in which most alignment research is written.

Further readings: None

Required skillsets: None

Mentorship guarantees:

  • Call frequency: once a week
  • Messaging in between calls: yes
  • Feedback on drafts: yes + help with code/experiments

What to expect of the research process: This will depend on the approach taken, but the distinctive feature of this project is that it is very much about supporting existing researchers, so expect a lot of interaction with the community and reverse-engineering of what people want.

Creating Alignment Failures in GPT-3:
Create alignment failures in GPT-3 through curation and/or finetuning.

Longer explanation: The goal is to link known alignment failures to language models like GPT-3, by defining analogous problems and making them happen through curation (rolling out multiple completions of a prompt and selecting among them) or finetuning (training the model further, either with an explicit reward model or on a corpus of text).

Another aim of this project is to measure how much optimization (curation/finetuning, measured with a metric defined by the mentors) is needed to create the alignment problems, and to reflect on whether this optimization pressure should be expected in current and future uses of such models.
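
To give a feel for what “measuring optimization” could mean here, the sketch below counts curation as selection bits: picking the best of n completions applies at most log2(n) bits of optimization per step. This is only an illustration under that assumption; the mentors’ actual metric may differ, and the generate/score functions are toy stand-ins.

# Illustrative only: best-of-n curation with optimization pressure counted as
# log2(n) bits per selection step. The mentors' metric may be defined differently.
import math
import random

def curate(generate, score, prompt, n=8, steps=4):
    """Repeatedly sample n continuations and keep the highest-scoring one."""
    text, bits = prompt, 0.0
    for _ in range(steps):
        candidates = [generate(text) for _ in range(n)]
        text = max(candidates, key=score)   # the curation step
        bits += math.log2(n)                # upper bound on bits of selection applied
    return text, bits

# Toy stand-ins for a language model and a "failure" score, just to run the loop:
toy_generate = lambda t: t + random.choice([" safe", " sneaky", " helpful"])
toy_score = lambda t: t.count("sneaky")
curated_text, bits_used = curate(toy_generate, toy_score, "The assistant said:")
print(f"{bits_used:.1f} bits of curation -> {curated_text}")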

Reasons to care: Language models are some of the most impressive models nowadays, and this trend might continue if results like the scaling laws do hold. This means that such AIs might tell us a lot about what advanced AIs and AGIs will look like, and that they’re a fertile ground for testing alignment ideas.

Further readings: None

Required skillsets: None

Mentorship guarantees:

  • Call frequency: once a week
  • Messaging in between calls: yes
  • Feedback on drafts: yes + help with code/experiments

What to expect of the research process: This project will probably alternate between reading and thinking about alignment problems in an abstract manner, and playing with GPT-3 to get the hang of it and push it in the desired direction.

Comparison between RL and Finetuning GPT-3:
Investigate theoretical and practical differences between RL and finetuning language models.

Longer explanation: Language models can be finetuned (by giving them either an explicit reward model or a corpus of text) to specialize them to certain goals. This is analogous to the more traditional approach of RL (especially the use of a reward model). However, the mentors expect that the two might result in different alignment problems and guarantees, given that RL optimizes for accomplishing the goal more directly (from the start) than a language model does, even a finetuned one.

This project thus focuses on studying the theoretical, conceptual and practical differences between the two approaches, especially with regard to the kind of alignment failures expected with RL.

Reasons to care: Language models are some of the most impressive models nowadays, and this trend might continue if results like the scaling laws do hold. This means that such AIs might tell us a lot about what advanced AIs and AGIs will look like, and that they’re a fertile ground for testing alignment ideas.

Finetuning Language Models is the most powerful technique available at the moment to leverage language models for concrete applications (by specializing them), which means that we should expect more powerful AIs of this sort to be strongly finetuned in the future. Understanding what this means for the safety and alignment guarantees of these models is thus crucial.

Further readings: None

Required skillsets: Adaptable to different skillsets, but a background in either ML theory or ML engineering is preferred.

Mentorship guarantees:

  • Call frequency: once a week
  • Messaging in between calls: yes
  • Feedback on drafts: yes + help with code/experiments

What to expect of the research process: This will depend strongly on the skillsets of the team and the approach chosen, from purely theoretical explorations of the similarities and differences to experimental designs that separate the two.


Alex Turner (Oregon State University)
Extending Power-Seeking Theorems to POMDPs:
Extend the power-seeking theorems formalizing instrumental convergence in Markov Decision Processes to the more general setting of Partially Observable MDPs.

Longer explanation: The original power-seeking theorems formalize some of the assumptions and conclusions behind the instrumental convergence thesis (that some subgoals, like survival and resource gathering, are instrumental for many different agent objectives) in the context of MDPs.

Yet MDPs give the agent access to the full state, and as such don’t capture the full range of uncertainties that might appear in more realistic settings. Hence this project is about trying to extend these theorems to the POMDP case, which adds uncertainty about which state the system is in (different states can produce the same observation).
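
For reference, here are the standard definitions of the two settings (the notation is generic, not necessarily that of the original theorems): an MDP gives the agent the state directly, while a POMDP adds an observation space and forces the agent to act on a belief over states.

\text{MDP:}\quad (S, A, T, R, \gamma), \qquad s_{t+1} \sim T(\cdot \mid s_t, a_t), \quad \text{the agent observes } s_t \text{ itself.}

\text{POMDP:}\quad (S, A, T, R, \gamma, \Omega, O), \qquad o_t \sim O(\cdot \mid s_t), \quad \text{the agent only observes } o_t,

\text{and must maintain a belief } b_t(s) \text{ updated as } \; b_{t+1}(s') \;\propto\; O(o_{t+1} \mid s') \sum_{s} T(s' \mid s, a_t)\, b_t(s).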

Reasons to care: The instrumental convergence thesis is one of the foundations of our arguments for AI Risks, and the power-seeking theorems helped to clarify and characterize it tremendously. As such, any progress in extending these theorems to new settings will improve our ability to think clearly about these topics.

Further readings: Alex Turner’s sequence of posts on these results

Required skillsets: Experience proving theorems, but no need for highly advanced mathematical knowledge (although that’s a plus)

Mentorship guarantees:

  • Call frequency: once a week
  • Messaging in between calls: yes
  • Feedback on drafts: yes

What to expect of the research process: This project will include both thinking about what the power-seeking theorems mean and what they entail in this new setting, and pure maths work on how to prove them for POMDPs and whether they even make sense there.


Stuart Armstrong (Future of Humanity Institute)
Learning and Penalizing Betrayal:
Train agents in DeepMind’s XLand to learn the concept of betrayal, then attempt to penalize it.

 

Longer explanation: The project is about training agents in multiplayer games in DeepMind’s XLand to cooperate and communicate, and then to learn to lie and betray each other. In turn, counterparties can learn to recognize such betrayal.
This ability can then be leveraged to try different alignment schemes where betrayal (especially secret betrayal) is penalized and thus disincentivized. Similarly, how these solutions scale could be studied by building agents with different levels of competence and compute.
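
As a rough illustration of what “penalizing betrayal” could look like once a detector has been learned, here is a minimal reward-shaping sketch. The function names and the idea of a single scalar betrayal score are assumptions for illustration, not the actual XLand setup.

# Illustrative sketch only: shape an agent's reward with a learned betrayal detector.
# `betrayal_score` stands in for a classifier over the agent's trajectory; it and
# the penalty weight are assumptions, not the actual training scheme.

def shaped_reward(task_reward, trajectory, betrayal_score, penalty_weight=2.0):
    """Task reward minus a penalty proportional to the estimated probability of betrayal."""
    p_betrayal = betrayal_score(trajectory)   # in [0, 1], from a learned detector
    return task_reward - penalty_weight * p_betrayal

# With a large enough penalty_weight, an agent that secretly defects loses more than
# it gains, so training pressure points away from (detected) betrayal; scaling
# experiments would then vary the detector's and the agents' competence and compute.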

Reasons to care: Previous approaches focused on learning trustworthiness and/or honesty and incentivizing it, whereas this project focuses on learning betrayal, especially more hidden forms of betrayal, and then disincentivizing them. It thus sounds less likely that this project will run into known pitfalls, and it might even show that this approach is more promising.

Further readings: Stuart Armstrong’s post on this project

 

Required skillsets: ML engineering

 

Mentorship guarantees:

  • Call frequency: once a week
  • Messaging in between calls: Yes
  • Feedback on drafts: review of final project summary

What to expect of the research process: This project is more open-ended, in the sense that it’s less clear what would work or how to go about each step. But the day-to-day experience of research will be very close to ML research, trying to teach a neural network the categories and concepts, then iterating and tweaking it when that doesn’t work as expected.

Semantic Side-Effect Minimization:
Train an ML system to avoid side-effects by crafting a wide range of environments, then reduce its conservativeness while still avoiding negative side-effects.

 

Longer explanation: The point is to train a conservative policy for accomplishing some goal over a wide range of environments with different types of side-effects (where the set of environments has to be designed for the project).
Stuart Armstrong expects the result to be far too conservative to be efficient at its goal, and so the next step will be to try to update the policy such that it learns which side-effects are considered negative, and which are okay.

Reasons to care: Most research on side-effect minimization and reduction in alignment focuses on a syntactic condition (staying close enough to a baseline with regard to a given measure). This project proposes instead to study semantic side-effect minimization in a concrete setting. This matters because what we truly care about are negative side-effects, and capturing those is a more semantic property.

Further readings: Stuart Armstrong’s post on this project

Required skillsets: ML engineering

 

Mentorship guarantees:

  • Call frequency: once a week
  • Messaging in between calls: Yes
  • Feedback on drafts: review of final project summary

What to expect of the research process: This project will start with creating the training environments, based on many of the ideas in the literature about side-effects and the sort of problems expected in advanced AIs. This part will probably feel more like conceptual research, in that it requires thinking deeply about many core ideas of alignment.

Then the rest of the project will be about training the model and trying to make it learn successfully, with probably back-and-forth in the environment design and experimental testing, as well as trying different algorithms/approaches.


John Wentworth (Independent)
Constraints from Selection:
Study the structural constraints imposed on systems selected by mechanisms like natural selection and/or ML training.

 

Longer explanation: Selection theorems are a specific type of result that proves necessary conditions (possibly probabilistic) for systems and agents to be selected by a selection pressure, either a mechanism (like natural selection) or a criterion (like not being Dutch-bookable).

This project focuses on either finding new such theorems, or investigating selection pressures experimentally to lay the groundwork for such theorems. Of particular interest are results about structural necessary conditions: they tell us something about how the system is built and how it works. Current selection theorems only prove behavioral constraints of the type “the system must act as if it was doing X” or “the system must be good enough to do X”.

Reasons to care: Structural constraints on what is selected by ML training would give us some information about how the advanced AIs we build will work, which would let us ground our arguments about problems and solutions. At the moment the field lacks any such structural guarantees.

Further readings:

  • John Wentworth’s three posts introducing selection theorems: 1 2 3
  • An analysis of how selection theorems create new knowledge, and how they can be broken/criticized.
  • A paper giving an example of an experimental approach to selection theorems

Required skillsets: Experience proving theorems, but no need for highly advanced mathematical knowledge (although that’s a plus). If focusing on more experimental approaches, then probably some evolutionary algorithms engineering.

 

Mentorship guarantees:

  • Call frequency: once a week
  • Messaging in between calls: yes
  • Feedback on drafts: yes

What to expect of the research process: For the theoretical part, a lot of the initial research will be about coming up with different examples of selection pressure, looking at what they select for, and formulating a necessary condition that abstracts the core common points. Then it will be about proving the theorem, or showing it is false.

For the experimental side, more time will be spent trying to nail down exactly the selection pressure and the environments, so experiments can be run. After which a lot of the interpretation and analysis will resemble the theoretical part.

Utility Maximization as Compression:
Study how models and processes act in ways that make the state of the world more compressible, in order to learn what they are trying to accomplish.

 

Longer explanation: John Wentworth has argued that utility maximization corresponds to pushing the world into states that can better be compressed in the information theoretic sense. That is, that optimizing for a given outcome/utility corresponds to optimizing for being able to better compress the state.

Following this insight, he proposes to study empirically and/or conceptually what this can tell us about concrete systems. For example, this could take the form of training a neural net for a certain task, then comparing the best compression schemes for the initial distribution and for the distribution pushed by this neural net, to understand how to find the task/utility in the compression. Or it could look like exploring how the compression/utility maximization could be extended to individual actions (what does it mean for an action to help compression?).
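
One simple way to make the correspondence precise (a hedged paraphrase using a standard information-theoretic identity, not necessarily the post’s exact formulation): pick a probabilistic model m that assigns shorter codes to higher-utility states, m(x) \propto e^{u(x)}. Then for any distribution P over world states,

\mathbb{E}_{P}\!\left[-\log m(X)\right] \;=\; \log Z \;-\; \mathbb{E}_{P}\!\left[u(X)\right], \qquad Z = \sum_{x} e^{u(x)},

so pushing the world toward high expected utility is exactly pushing it toward states with short expected description length under m. The empirical question for the project is then whether the compression schemes learned for the “optimized” distribution reveal the utility being pursued.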

Reasons to care: One of the main sources of AI risk in the literature comes from goal-directed systems / utility maximizers that were given a wrong goal and end up wreaking havoc on the world. As such, better understanding how such systems work, and how we can analyse and interpret them, matters for better estimating and dealing with such risks.
This compression-based approach is also unexplored at the moment, which means that it might yield new insights that eluded other approaches.

Further readings: John Wentworth’s original post

 

Required skillsets: The problem can be adapted to many different skillsets (ML, algorithm building, maths, philosophy/conceptual)

 

Mentorship guarantees:

  • Call frequency: once a week
  • Messaging in between calls: yes
  • Feedback on drafts: yes

What to expect of the research process: How this project will go depends heavily on the skills and interests of the team, as there are many different possible angles.


Evan Hubinger (Machine Intelligence Research Institute)
Understanding Dog Domestication for Corrigibility:
Study the domestication of dogs through the lens of evolutionary genetics and investigate potential analogies with the alignment problem.

 

Longer explanation: This project’s aim is to understand how dogs were domesticated into animals that are helpful to humans, and see if this process gives us new avenues for aligning AIs. This could for example result in a characterization of a predicate to check that something is aligned in the ways dogs are, which might be sufficient for some problems with advanced AIs.

Reasons to care: Using evolution and evolutionary processes has a long (sometimes controversial) tradition in alignment, because of the analogies with ML training and stochastic gradient descent. This project has the advantage of targeting an avenue where there is a clear analogue of alignment, and one which hasn’t been explored before.

Further readings: A possible starting point

Required skillsets: Evolutionary genetics expertise

Mentorship guarantees:

  • Call frequency: initial call + once every three weeks
  • Messaging in between calls: no
  • Feedback on drafts: final draft only

What to expect of the research process: Mostly looking at the literature in evolutionary genetics and trying to get a picture of what we know about dogs’ domestication; then going back and forth with different alignment problems and this form of solution, to see if it applies.

 

 

 

Apply if you…

  1. want to try out & consider ways you could help ensure that future AI performs safely and in line with what people value upon reflection;
  2. dug into our mentors’ open problems and noted a few clear arguments for why you’d research one in particular & how you might start;
  3. previously studied a topic or practiced skills unique to your perspective/background that can bolster your new research team’s progress; 
  4. can block off hours to focus on research from January to June 2022 on normal workdays (>1 h/d avg) and the weekend sprints (>7 h/d).
     

Application timeline

15 Nov 2021, 00:01 UTC – Application form opens. Reviews and interviews start right away (read more)
01 Dec, 23:59 AoE – Deadline to apply. Late submissions might not get a response.
24 Dec, 23:59 AoE – Last applicants admitted or declined (most will be informed of our decision earlier)

 


First virtual edition – a spontaneous collage (photos)

Program timeline


Participants will collaborate on a mentor’s open problem, in teams of two to four. Focused work on research projects makes up the bulk of the program. This is complemented with talks, discussions, and mentoring by experienced researchers. You will get to work on interesting projects while gaining research skills and knowledge of the field of AI Safety.

Our 2022 virtual program is built around online weekends. We expect participants to complete all parts of the program. Most teams continue working after the end of the program, either to finalize and publish their work or to expand it into an ongoing long-term project.

Sat – Sun | Weekend’s Purpose | Activities (read more)
15 – 16 Jan | Consider concepts | Ask established researchers about their concepts and cruxes for AI alignment.
  → Consider different angles on the problem you want to research
05 – 06 Feb | Discuss collaborations | Discuss how to go about the research with your mentor and prospective teammates.
  → Meet with formed team; set roles & check-ins
19 – 20 Feb | Plan team’s research | Complete planning and literature review.
05 – 06 Mar | Co-work | Start research.
02 – 03 Apr | Co-work | In the thick of research.
07 – 08 May | Co-work | Collect results. Consolidate notes.
  → Submit research summary & slides
04 – 05 Jun | Present research & plan post-camp steps | Present results to mentors and curious participants. Commit team to a write-up. Discuss next career steps.
  → Follow up
 
 
 
Rob Miles explains Koch et al.’s Objective Robustness in Deep Reinforcement Learning, a paper produced at our last virtual edition.

 

 

Questions?

Check out the FAQ page. You can also contact us at contact@aisafety.camp