The participants of the first AI safety camp in Gran Canaria

AISC 1: Research Summaries

The teams at the 2018 Gran Canaria AI safety camp worked hard, both in preparing for the camp and during the 10-day sprint. Each team has written a brief summary of the work they did during the camp:


Team: Christopher Galias, Johannes Heidecke, Dmitrii Krasheninnikov, Jan Kulveit, Nandi Schoots

  • Our team worked on how to model human (ir)rationality in value learning, specifically when learning a human’s reward function from expert demonstrations with inverse reinforcement learning (IRL).
  • We focussed on two different sub-topics: bounded rationality and time-correlated irrationality.
  • Bounded rationality topic:
    • We analyzed the difference between perfectly rational and boundedly rational agents and why the latter might provide a better model for human behavior, explaining many biases observed in human thinking.
    • We looked at existing formalizations of bounded rationality, especially the information-theoretic perspective introduced by Ortega and Braun.
    • We started investigating how to model boundedly rational agents for reinforcement learning problems.
    • We began formalizing the inverse step of IRL for boundedly rational agents, based on both Maximum Causal Entropy IRL and Guided Cost Learning.
    • We set up a small test environment with many satisficing solutions and one optimal solution that is hard to find. We collected human expert demonstrations for this environment and compared them with the performance of a fully rational computer agent. The observed differences support the claim that bounded rationality models are needed in IRL to extract adequate reward functions.
    • We received funding from Paul Christiano to continue our work.
  • Time-correlated irrationality topic:
    • The project consists of two parts: introducing a Laplace prior on the softmax temperatures of the transitions of the Boltzmann-rational agent, and enforcing a correlation between the temperatures at nearby timesteps.
    • During the camp we worked out the math and the algorithm for the first part, and started working on the implementation.
    • The second part of the project and the writeup will be done in the following months. We plan to both work remotely and meet up in person.
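
The first part of the time-correlated irrationality project can be illustrated with a minimal Python sketch. It is not the team's implementation: the function names, the prior parameters `loc` and `scale`, and the assumption that action values are already available are all hypothetical choices made for illustration. Each timestep gets its own softmax temperature, and the score of a demonstration is the Boltzmann-rational log-likelihood plus a Laplace log-prior on those temperatures. The correlation between nearby timesteps (the second part) is not modeled here.

```python
import numpy as np

def boltzmann_policy(q_values, temperature):
    """Softmax over action values; a lower temperature is closer to rational."""
    logits = np.asarray(q_values, dtype=float) / temperature
    logits -= logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def laplace_log_prior(temperatures, loc, scale):
    """Sum of independent Laplace(loc, scale) log-densities.

    Assumption: temperatures are kept positive by the caller, although the
    Laplace density itself is defined on the whole real line.
    """
    t = np.asarray(temperatures, dtype=float)
    return float(np.sum(-np.log(2.0 * scale) - np.abs(t - loc) / scale))

def log_posterior(q_per_step, actions, temperatures, loc=1.0, scale=0.5):
    """Unnormalised log-posterior of per-timestep temperatures.

    q_per_step[t] holds the (assumed known) action values at timestep t,
    actions[t] is the index of the demonstrated action.
    """
    log_lik = 0.0
    for q, a, temp in zip(q_per_step, actions, temperatures):
        log_lik += np.log(boltzmann_policy(q, temp)[a])
    return float(log_lik) + laplace_log_prior(temperatures, loc, scale)
```

In an actual IRL setting the action values would themselves depend on the reward function being learned, so this score would be optimised jointly over rewards and temperatures rather than over temperatures alone.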


Team: Karl Koch, David Kristoffersson, Markus Salmela, Justin Shovelain

We further developed tools for determining the harm versus benefit of projects on the long-term future:
  • (Context: We have earlier work here, notably a decision tree for analyzing projects.)
  • Heuristics: Worked extensively on developing practical heuristics for determining whether a technological development is net beneficial or harmful in the long run.
  • Scaffolding: Defined a wider context for the decision tree, indicating when to use it and how to improve interventions/projects so that they are better for the world.
  • Race/competitive dynamics: Modeled some of the conditions that generate competitive races.
  • Information concealment: Incorporated findings about man-made disasters and information concealment.
We also developed a potential existential risk reduction funding delegation strategy for rich donors:
  • Analyzed how to maximize a funder’s ability to update on data and use the knowledge of others, while mostly avoiding the principal-agent problem and Goodhart’s law.
  • Developed a funding organization design with expert delegates, collaborative investment decisions, and strong self-improving elements.

Zero Safety

Team: Vojta Kovarik, Igor Sieradzki, Michael Świętek

  • Goal: Better understand the strategy learned by the Alpha Zero algorithm.
  • Implemented Alpha Zero for Gomoku and trained it on (a) a 6×6 board with 4-in-a-row and (b) an 8×8 board with 5-in-a-row.
  • Training the neural net in (a) took ~40M samples. We managed to train a new neural net using only 350 unique samples in such a way that the resulting strategy is very similar to the original Alpha Zero player.
  • This led us to discover a weakness in the strategy learned by both the new Alpha Zero and the original one.
  • Future plans: Test on more complex games, experiment with more robust ways of finding representative subsets of the training data, visualize these representative subsets in an automated way.

Safe AF

Team: James Bell, Linda Linsefors, Caspar Oesterheld, Joar Skalse
  • Investigated the behaviour of common, very simple machine learning algorithms in Newcomb-like contexts, with the aim of figuring out which decision theory they implicitly implement.
  • Specifically, we looked at the epsilon-greedy and softmax algorithms for bandit problems. At each step these algorithms compute a probability distribution over actions and then draw their next action from that distribution. The reward for each action depended on the probability distribution that the algorithm had computed as an intermediate step, but the algorithms were trained in the standard way, i.e. as if there were no such dependence.
  • Formulated a selection of decision theory problems as bandit problems. Such bandit problems provide a framework general enough to include variants of playing a prisoner’s dilemma against a copy, evidential blackmail and Death in Damascus.
  • We found that the algorithms did not coherently follow any established decision theory. However, they did show a preference for ratifiable choices of probability distribution, and we were able to obtain some results on their convergence properties. We are writing a paper with our results.
  • Later published: Reinforcement Learning in Newcomblike Environments
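
The kind of setup described above can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not the team's code: the payoffs (10 for the opaque box, 1 for the small box), the step size, the temperature and all function names are hypothetical. The reward depends on the agent's current action probabilities, as in a Newcomb-like problem, while the value update is the standard bandit update that ignores this dependence.

```python
import math
import random

def softmax(values, temperature):
    """Numerically stable softmax over a list of value estimates."""
    m = max(values)
    exps = [math.exp((v - m) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def newcomb_reward(action, probs, rng):
    """Hypothetical Newcomb-like payoff.

    Action 0 is one-boxing, action 1 is two-boxing. The opaque box is filled
    with probability equal to the agent's current probability of one-boxing,
    so the reward depends on the policy and not only on the action taken.
    """
    box_filled = rng.random() < probs[0]
    reward = 10.0 if box_filled else 0.0
    if action == 1:  # two-boxing also collects the small box
        reward += 1.0
    return reward

def run_softmax_bandit(steps=5000, temperature=1.0, step_size=0.05, seed=0):
    """Train value estimates with the usual bandit update, ignoring the
    dependence of rewards on the policy. Returns (final probabilities, values)."""
    rng = random.Random(seed)
    q = [0.0, 0.0]
    for _ in range(steps):
        probs = softmax(q, temperature)
        action = 0 if rng.random() < probs[0] else 1
        reward = newcomb_reward(action, probs, rng)
        q[action] += step_size * (reward - q[action])
    return softmax(q, temperature), q
```

Calling `run_softmax_bandit()` and inspecting the returned probabilities for different temperatures is one way to explore which policies such an agent settles on. An epsilon-greedy variant would replace the `softmax` call with a distribution that puts `1 - epsilon` on the greedy action.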

Side effects in Gridworlds

Team: Jessica Cooper, Karol Kubicki, Gavin Leech, Tom McGrath

  • Implemented a baseline Q-learning agent for gridworld environments.
  • Implemented inverse reinforcement learning in the Sokoban gridworld from DeepMind’s original paper.
  • Created new gridworlds to cover a wider variety of side effects and expose more nuances, for instance the difficulty in defining “leaving the environment unchanged” when the environment is dynamic or stochastic.
  • Code is available on our GitHub repository, and Gavin Leech has written a blog post that goes into more detail.
  • Future plans:
    • Generalise the tools that we created to work with arbitrary pycolab environments.
    • Add maximum entropy deep IRL.
    • Submit a pull request with the above to the Safety Gridworlds repository in order to make it easier for others to get started doing machine learning safety research.
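
For readers unfamiliar with the baseline mentioned above, here is a minimal sketch of tabular Q-learning on a toy gridworld. It does not use pycolab or the Safety Gridworlds code, and it is not the team's agent: the 3×3 grid, the reward values, the hyperparameters and the function names are all assumptions chosen to keep the example self-contained.

```python
import random

ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
SIZE, START, GOAL = 3, (0, 0), (2, 2)         # hypothetical 3x3 grid

def step(state, action):
    """Move within the grid (walls clamp the move); -1 per step, +10 at the goal."""
    row = min(max(state[0] + action[0], 0), SIZE - 1)
    col = min(max(state[1] + action[1], 0), SIZE - 1)
    next_state = (row, col)
    done = next_state == GOAL
    return next_state, (10.0 if done else -1.0), done

def train(episodes=500, alpha=0.5, gamma=0.95, epsilon=0.1, seed=0):
    """Epsilon-greedy tabular Q-learning; returns the learned Q-table."""
    rng = random.Random(seed)
    q = {(r, c): [0.0] * len(ACTIONS) for r in range(SIZE) for c in range(SIZE)}
    for _ in range(episodes):
        state = START
        for _ in range(100):  # cap the episode length
            if rng.random() < epsilon:
                a = rng.randrange(len(ACTIONS))
            else:
                a = max(range(len(ACTIONS)), key=lambda i: q[state][i])
            next_state, reward, done = step(state, ACTIONS[a])
            # Bootstrap from the best next action unless the episode ended.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
            if done:
                break
    return q

def greedy_path(q, max_steps=20):
    """Follow the greedy policy from START and return the visited states."""
    state, path = START, [START]
    for _ in range(max_steps):
        a = max(range(len(ACTIONS)), key=lambda i: q[state][i])
        state, _, done = step(state, ACTIONS[a])
        path.append(state)
        if done:
            break
    return path
```

A reward like this one says nothing about side effects: the agent is only paid for reaching the goal, which is exactly why such a baseline is a useful reference point when studying side effects in richer environments.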

Last but not least…

We would like to thank those who have funded the camp: MIRI, CEA, Greg Colbourne, Lotta and Claes Linsefors.
