AISC2: Prague

4-14 Oct, 2018

The second AI Safety Camp took place in Prague. Our teams worked on exciting projects, which are summarized below.

AI Governance and the Policymaking Process: Key Considerations for Reducing AI Risk

Team: Policymaking for AI Strategy – Brandon Perry, Risto Uuk


Our project was an attempt to introduce literature on the public policymaking cycle to AI strategy, in order to develop a new set of crucial considerations and open up research questions for the field. We began by defining our terms and laying out a big-picture view of how the policymaking cycle interacts with the rest of the AI strategy field. We then went through the different steps and theories in the policymaking cycle to develop a list of crucial considerations that we believe will be valuable for future AI policy practitioners and researchers. For example, policies only get passed once there is significant momentum and support behind them, which raises questions such as how many chances we get to implement certain policies. In the end, we believe we have opened up a new area of research in AI policymaking strategy, where the way solutions are implemented has strategic implications for the entire AI risk field itself.


Read our paper here.

Detecting Spiky Corruption in Markov Decision Processes

Team: Jason Mancuso, Tomasz Kisielewski, David Lindner, Alok Singh


We presented our work at the AI Safety Workshop at IJCAI 2019.
Read our paper here.

Corrupt Reward MDPs

Team: Tomasz Kisielewski, David Lindner, Jason Mancuso, Alok Singh



We are currently implementing the algorithm in safe-grid-agents so that we can test it on official and custom AI Safety Gridworlds. We also plan to make our code OpenAI Gym-compatible, to make it easier to interface the AI Safety Gridworlds and our agents with the rest of the RL community.
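
As a rough illustration of the kind of interface this involves, here is a minimal sketch of a Gym-style wrapper around a gridworld environment. The class name, constructor arguments, and the timestep interface it assumes (reset/step returning an object with .observation, .reward and .last()) are illustrative assumptions, not the actual safe-grid-agents or AI Safety Gridworlds code.

```python
# Minimal sketch of a Gym-style wrapper around a gridworld environment.
# The underlying environment interface assumed here is illustrative; the
# actual safe-grid-agents code may structure this differently.
import gym
import numpy as np
from gym import spaces


class GridworldGymEnv(gym.Env):
    """Wraps a gridworld environment so standard RL agents can interact with it."""

    def __init__(self, gridworld_env, board_shape, num_actions):
        self._env = gridworld_env
        self.action_space = spaces.Discrete(num_actions)
        self.observation_space = spaces.Box(
            low=0.0, high=1.0, shape=board_shape, dtype=np.float32
        )

    def reset(self):
        timestep = self._env.reset()
        return np.asarray(timestep.observation, dtype=np.float32)

    def step(self, action):
        timestep = self._env.step(action)
        obs = np.asarray(timestep.observation, dtype=np.float32)
        reward = timestep.reward or 0.0
        done = timestep.last()
        return obs, reward, done, {}
```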


Our current code is available on GitHub.
Paper published later: Detecting Spiky Corruption in Markov Decision Processes (presented at the AI Safety Workshop at IJCAI 2019).

Human Preference Types

Team: Sabrina Kavanagh, Erin M. Linebarger, Nandi Schoots



Read the blog post we wrote during the camp here.

Feature Visualization for Deep Reinforcement Learning

Team: Zera Alexander, Andrew Schreiber, Fabian Steuer



Ongoing work:

Corrigibility

Team: Vegard Blindheim, Anton Osika, Roland Pihlakas



Future plans:

IRL Benchmark

Team: Adria Garriga-Alonso, Anton Osika, Johannes Heidecke, Max Daniel, Sayan Sarkar



See our GitHub here.

Value Learning in Games

Team: Stanislav Böhm, Tomáš Gavenčiak, Torben Swoboda, Mikhail Yagudin


Learning the rewards of a task by observing expert demonstrations is a very active research area, mostly in the context of inverse reinforcement learning (IRL), with some spectacular results. While the reinforcement learning framework assumes non-adversarial environments (and is known to fail in general games), our project focuses on value learning in general games, introduced in Inverse Game Theory (2015). We proposed a sparse stochastic gradient descent algorithm for learning values from equilibria and experimented with learning the values of the game of Goofspiel. We are developing a game-theoretic library, GameGym, to collect games, algorithms, and reproducible experiments. We also studied value learning under bounded rationality models and hope to develop this direction further in the future.
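
To make the basic idea concrete, here is a simplified sketch of learning payoffs from an observed equilibrium of a two-player zero-sum matrix game, via subgradient descent on the players' summed regrets at the observed strategy profile. It is a dense, illustrative variant, not the sparse algorithm or the GameGym code from the project, and the inverse problem is in general under-determined (many payoff matrices share the same equilibria), which the sketch sidesteps by simply normalising the payoff matrix.

```python
# Simplified sketch: fit a zero-sum payoff matrix under which an observed
# mixed-strategy profile (x, y) is an (approximate) Nash equilibrium.
import numpy as np


def regret_and_subgradient(A, x, y):
    """Summed regrets of both players at profile (x, y), with a subgradient in A.

    The row player maximises x^T A y and the column player minimises it, so the
    summed regret is zero exactly when (x, y) is a Nash equilibrium of A.
    """
    row_values = A @ y            # row player's payoff for each pure strategy
    col_values = x @ A            # column player's loss for each pure strategy
    i_star = int(np.argmax(row_values))
    j_star = int(np.argmin(col_values))
    loss = row_values[i_star] - col_values[j_star]
    # The x y^T terms of the two regrets cancel, leaving this subgradient.
    grad = np.zeros_like(A)
    grad[i_star, :] += y
    grad[:, j_star] -= x
    return loss, grad


def fit_payoffs(x, y, shape, steps=2000, lr=0.1, seed=0):
    """Subgradient descent on the regret loss, renormalising the payoff matrix
    after each step to rule out the degenerate all-zero solution."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=shape)
    for _ in range(steps):
        _, grad = regret_and_subgradient(A, x, y)
        A = A - lr * grad
        A = A / np.linalg.norm(A)
    return A
```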


A longer report can be found here.

AI Safety for Kids

Assumptions of Human Values

Team: Jan Kulveit, Linda Linsefors, Alexey Turchin


There are many theories about the nature of human values, originating from diverse fields ranging from psychology to AI alignment research. Most of them rely on various assumptions, which are sometimes stated explicitly but often hidden (for example: that humans have introspective access to their values; that preferences are defined over arbitrary alternatives; that some specific part of the mind has normative power). We started by mapping the space: reading the papers, noting which assumptions are made, and trying to figure out the principal dimensions on which to project the space of value theories. Later, we tried to attack the problem directly and find solutions that would be simple and make only explicit assumptions. While we did not converge on a solution, we became less confused, and the understanding we gained will likely lead to several posts from different team members.


Jan has written a blog post about his best-guess model of how human values and motivations work.