AISC2: Research Summaries

The second AI Safety Camp took place this October in Prague. Our teams worked on the projects summarized below:


AI Governance and the Policymaking Process: Key Considerations for Reducing AI Risk

Team: Policymaking for AI Strategy – Brandon Perry, Risto Uuk

Our project was an attempt to introduce literature on theories of the public policymaking cycle to AI strategy, in order to develop a new set of crucial considerations and open up research questions for the field. We began by defining our terms and laying out a big-picture approach to how the policymaking cycle interacts with the rest of the AI strategy field. We then went through the different steps and theories in the policymaking cycle to develop a list of crucial considerations that we believe to be valuable for future AI policy practitioners and researchers. For example, policies only get passed once there is significant momentum and support behind them, which raises questions such as how many chances we get to implement certain policies. In the end, we believe that we have opened up a new area of research in AI policymaking strategy, where the way that solutions are implemented has strategic implications for the entire AI risk field.

Read our paper here.


Detecting Spiky Corruption in Markov Decision Processes

Team: Corrupt Reward MDPs – Jason Mancuso, Tomasz Kisielewski, David Lindner, Alok Singh
We presented our work at the AI Safety Workshop at IJCAI 2019.
Read our paper here.


Corrupt Reward MDPs

Team: Tomasz Kisielewski, David Lindner, Jason Mancuso, Alok Singh

  • We worked on solving Markov Decision Processes with corrupt reward functions (CRMDPs), in which the observed and true rewards are not necessarily the same.
  • The general class of CRMDPs is not solvable, so we focused on finding useful subclasses that are.
  • We developed a set of assumptions that define what we call Spiky CRMDPs, and an algorithm that solves them by identifying corrupt states, i.e. states whose reward is corrupted.
  • We worked out regret bounds for our algorithm in the class of Spiky CRMDPs, and found a specific subclass under which our algorithm is provably optimal.
  • Even for Spiky CRMDPs in which our algorithm is suboptimal, we can use the regret bound in combination with semi-supervised RL to reduce supervisor queries.
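To make the idea of identifying corrupt states concrete, here is a minimal sketch. It is not the team's algorithm, and the notion of "spiky" used here is an assumption made for illustration: a state is flagged when its observed reward differs from the observed reward of every neighboring state by more than a threshold. The function name `flag_spiky_states`, the chain environment, and the threshold value are all hypothetical; the paper gives the actual assumptions and regret bounds.

```python
def flag_spiky_states(observed, neighbors, threshold):
    """Flag states whose observed reward is an isolated spike.

    Assumption for this sketch only: a state counts as "spiky" when its
    observed reward differs from the observed reward of *every* neighbor
    by more than `threshold`.
    """
    flagged = set()
    for state, reward in observed.items():
        gaps = [abs(reward - observed[n]) for n in neighbors[state]]
        if gaps and min(gaps) > threshold:
            flagged.add(state)
    return flagged


# Toy 1-D chain of 7 states. The true reward rises slowly along the chain.
NUM_STATES = 7
true_reward = {s: 0.1 * s for s in range(NUM_STATES)}

# The observed reward equals the true reward except in state 3,
# where it is corrupted into a large spike.
observed_reward = dict(true_reward)
observed_reward[3] = 5.0

# Each state neighbors the states directly to its left and right.
chain_neighbors = {
    s: [n for n in (s - 1, s + 1) if 0 <= n < NUM_STATES]
    for s in range(NUM_STATES)
}

corrupt = flag_spiky_states(observed_reward, chain_neighbors, threshold=1.0)
clean = flag_spiky_states(true_reward, chain_neighbors, threshold=1.0)
```

In this toy chain only state 3 is flagged, and nothing is flagged when the observed reward matches the true reward. An agent could then ignore, or ask a supervisor about, the reward in flagged states.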

We are currently working on implementing the algorithm in safe-grid-agents to be able to test it on official and custom AI Safety Gridworlds. We also plan to make our code OpenAI Gym-compatible for easier interfacing of the AI Safety Gridworlds and our agents with the rest of the RL community.

Our current code is available on GitHub.
Paper published later: Detecting Spiky Corruption in Markov Decision Processes (presented at the AI Safety Workshop at IJCAI 2019).


Human Preference Types

Team: Sabrina Kavanagh, Erin M. Linebarger, Nandi Schoots

  • We analyzed the usefulness of the framework of preference types to value learning. We zoomed in on the preference types liking, wanting and approving. We described the framework of preference types and how these can be inferred.
  • We considered how an AI could aggregate our preferences and came up with suggestions for how to choose an aggregation method. Our initial approach to establishing a method for aggregation of preference types was to find desiderata any potential aggregation function should comply with. As a source of desiderata, we examined the following existing bodies of research that dealt with aggregating preferences, either across individuals or between different types:
    Economics & Social Welfare Theory; Social Choice Theory; Constitutional Law; and Moral Philosophy.
  • We concluded that the aggregation method should be chosen on a case-by-case basis, for example by asking people for their meta-preferences; considering the importance of desiderata to the end-user; letting the accuracy of measurement decide its weight; implementing a sensible aggregation function and adjusting it on the go; or identifying a more complete preference type.
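One of the options above, letting the accuracy of measurement decide a preference type's weight, can be sketched as an inverse-variance weighted average. This is only one possible reading of that option and not a method the team proposed. The function name `aggregate_preferences`, the scores, and the noise variances below are made-up illustrative values.

```python
def aggregate_preferences(scores, noise_variance):
    """Combine per-type preference scores into a single score.

    Each preference type is weighted by 1 / variance of its measurement
    noise, so more accurately measured types count for more.
    """
    weights = {kind: 1.0 / noise_variance[kind] for kind in scores}
    total_weight = sum(weights.values())
    return sum(weights[kind] * scores[kind] for kind in scores) / total_weight


# Hypothetical scores in [0, 1] for one option, one per preference type.
scores = {"liking": 0.8, "wanting": 0.2, "approving": 0.6}

# Hypothetical measurement-noise variances: "wanting" is assumed to be
# measured most accurately, "approving" least accurately.
noise_variance = {"liking": 0.5, "wanting": 0.25, "approving": 1.0}

combined = aggregate_preferences(scores, noise_variance)

# With equal variances the result reduces to the plain average.
equal = aggregate_preferences(scores, {kind: 1.0 for kind in scores})
```

Here the weights are 2, 4 and 1, so the combined score is pulled toward the "wanting" score. Other desiderata from the bodies of research listed above would lead to different aggregation functions.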

This is a blogpost we wrote during the camp.


Feature Visualization for Deep Reinforcement Learning

Team: Zera Alexander, Andrew Schreiber, Fabian Steuer

  • Completed a literature review of visualization in Deep Reinforcement Learning.
  • Built a prototype of Agent, a TensorBoard plugin for interpretability of RL/IRL models, focused on the time-step level.
  • Open-sourced the Agent prototype on GitHub.
  • Reproduced and integrated a paper on perturbation-based saliency maps in Deep RL.
  • Applied for an EA Grant to continue our work. (Currently at the 3rd and final stage in the process.)

Ongoing work:

  • Developing the prototype into a functional tool.
  • Collecting and integrating feedback from AI Safety researchers in Deep RL/IRL.
  • Writing an introductory blog post to Agent.



Team: Vegard Blindheim, Anton Osika, Roland Pihlakas

  • The initial project topic was: Corrigibility and interruptibility via the principles of diminishing returns and conjunctive goals (originally titled: “Corrigibility and interruptibility of homeostasis based agents”)
  • Vegard focused on finding and reading corrigibility-related materials and proposed building a public reading list of them, since these texts are currently scattered across the internet.
  • Anton contributed to the discussions of the initial project topic in the form of various very helpful questions, but considered the idea of diminishing returns too obvious and simple, and very unlikely to be successful. Therefore, he soon switched over to other projects in another team.
  • The initial project of diminishing returns and conjunctive goals evolved into a blog post by Roland, proposing a solution to the problem of the lack of common sense in paper-clippers and other Goodhart’s law-ridden utility maximising agents, possibly enabling them to even surpass the relative safety of humans: 

Future plans:

  • Vegard is preparing a website offering a reading list of corrigibility-related materials.
  • Roland continuously updates his blog post with additional information, and is also contacting Stuart Armstrong and continuing correspondence with Alexander Turner and Victoria Krakovna.
  • Additionally, Roland will design a set of gridworlds-based gamified simulation environments for various corrigibility- and interruptibility-related toy problems, where the efficiency of applying the principles of diminishing returns and conjunctive goals can be compared to other approaches in the form of a challenge. Participants would be able to provide their own agent code in order to measure which principles are best or most convenient as a solution for the most challenge scenarios.
  • Anton is looking forward to participating in these challenges with his coding skills.
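As a rough illustration of the two principles named in this project, the sketch below compares a plain additive utility with one that applies diminishing returns to each goal and combines goals conjunctively. The concrete choices are assumptions made for this example and are not taken from Roland's blog post: `log1p` stands in for diminishing returns, `min` stands in for a conjunctive combination, and the goal names and numbers are invented.

```python
import math


def linear_utility(levels):
    """Plain sum of goal levels: one goal can be pushed without limit."""
    return sum(levels.values())


def conjunctive_utility(levels):
    """Diminishing returns per goal, combined conjunctively.

    Each goal level passes through log1p (concave, so extra units are
    worth less and less), and the overall score is the minimum across
    goals, so no goal can be traded away entirely for another.
    """
    return min(math.log1p(level) for level in levels.values())


# Two hypothetical outcomes with invented goal levels.
lopsided = {"paperclips": 100.0, "safety": 0.0}
balanced = {"paperclips": 10.0, "safety": 10.0}

linear_prefers_lopsided = linear_utility(lopsided) > linear_utility(balanced)
conjunctive_prefers_balanced = (
    conjunctive_utility(balanced) > conjunctive_utility(lopsided)
)
```

Under the additive utility the lopsided outcome scores higher, while under the conjunctive utility it scores zero because one goal is neglected completely. The planned gridworld challenges would be the place to test whether this kind of behavior holds up against other approaches.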


IRL Benchmark

Team: Adria Garriga-Alonso, Anton Osika, Johannes Heidecke, Max Daniel, Sayan Sarkar

  • Our objective is to create a unified platform to compare existing and new algorithms for inverse reinforcement learning.
  • We extensively reviewed existing inverse reinforcement learning algorithms against criteria such as the types of reward functions, whether known transition dynamics are required, the metrics used for evaluation, and the RL algorithms used.
  • We set up our framework in a modular way that is easy to extend for new IRL algorithms, test environments, and metrics.
  • We released a basic version of the benchmark with 2 environments and 3 algorithms and are continuously extending it.

See our GitHub here.


Value Learning in Games

Team: Stanislav Böhm, Tomáš Gavenčiak, Torben Swoboda, Mikhail Yagudin

Learning the rewards of a task by observing expert demonstrations is a very active research area, mostly in the context of inverse reinforcement learning (IRL), with some spectacular results. While the reinforcement learning framework assumes non-adversarial environments (and is known to fail in general games), our project focuses on value learning in general games, introduced in Inverse Game Theory (2015). We proposed a sparse stochastic gradient descent algorithm for learning values from equilibria and experimented with learning the values of the game of Goofspiel. We are developing a game-theoretic library, GameGym, to collect games, algorithms and reproducible experiments. We also studied value learning under bounded rationality models, and we hope to develop this direction further in the future.

A longer report can be found here.


AI Safety for Kids

  • We arrived at camp with the intention of developing storyboards targeted at AI Policymakers, inspired by the ‘Killbots YouTube video’ and the Malicious Compliance Report. The goal of these storyboards was to advance policies that prevent the weaponization of AI, while disrupting popular images of what an AI actually is or could become. We would achieve this by lowering the barriers of entry for non-experts to understanding core concepts and challenges in AI Safety.
  • In considering our target audience, we quickly decided that the most relevant stakeholders for these storyboards are a minimum of 20 years away from assuming their responsibilities (based on informal surveys of camp participants on the ETA of AGI). In other words, we consider our audience for these storyboards to be children. We realized that by targeting our message to a younger audience, we could prime them to think differently and perhaps more creatively about addressing these complex technical and social challenges. Although we consider children’s books to be broadly appealing to all ages and helpful for spreading a message in a simple yet profound manner, to our knowledge no children’s books have been specifically published on the topic of AI Safety.
  • During camp we wrote drafts for three main children’s book ideas focused on AI Safety. We presented one of these concepts to the group and gathered feedback about our approach. In summary, we decided to move forward with writing a children’s book on AI Safety while remaining cognizant of the challenges of effective communication, so as to avoid the pitfalls of disinformation and sensationalism. We developed a series of milestones for the book so that we could meet our goal of launching it by the one-year anniversary of the camp in Fall 2019.
  • After camp, we applied to the Effective Altruism Foundation for a $5,000 grant to engage animators for preliminary graphic support to bring the book into a working draft phase to aid in pitching the idea to publishers in order to secure additional funding and complete the project. After this request was declined, we continued to compile lists of potential animators to reach out to once funding is secured.
  • We adjusted our plan to focus more on getting to know our potential audience. To this end, Chris has been in contact with a local high school teacher for advanced students specializing in maths and physics. Chris has arranged to give a talk to the students on problems of AI alignment in January 2019. Chris plans to prepare the presentation and Noah will provide feedback. After the presentation, Noah and Chris will reconvene to discuss the student reactions and interest in AI Alignment and Safety in Jan/Feb 2019.


Assumptions of Human Values

Team: Jan Kulveit, Linda Linsefors, Alexey Turchin

There are many theories about the nature of human values, originating from diverse fields ranging from psychology to AI alignment research. Most of them rely on various assumptions, which are sometimes stated explicitly but often hidden (for example: humans having introspective access to their values; preferences being defined for arbitrary alternatives; some specific part of the mind having normative power). We started by mapping the space: reading the papers, noting which assumptions are made, and trying to figure out the principal dimensions on which to project the space of value theories. Later, we tried to attack the problem directly and find solutions that would be simple and make only explicit assumptions. While we did not converge on a solution, we became less confused, and the understanding created will likely lead to several posts from different team members.

Jan has written a blog post about his best-guess model of how human values and motivations work.