AISC4: Research Summaries – AI Safety Camp

The fourth AI Safety Camp took place in May 2020 in Toronto. Due to COVID-19, the camp was held virtually. Six teams participated and worked on the following topics:

Survey on AI risk scenarios
Options to defend a vulnerable world
Extraction of human preferences
Transferring reward functions across environments to encourage safety for agents in the real world
Formalization of goal-directedness
Generalization in reward-learning

Survey on AI risk scenarios

Alexis Carlier, Sam Clarke, Jonas Schuett

It has been argued that artificial intelligence could pose existential risks for humanity. However, the original arguments made by Bostrom (2014) and Yudkowsky (2008) have been criticised (Shah, 2018; Christiano, 2018; Drexler, 2019), and a number of others have been proposed (Christiano, 2019; Zwetsloot & Dafoe, 2019; Dafoe, 2018; Dai, 2018; Dai, 2019; Brundage et al., 2018; Garfinkel, 2018).

The result of this dynamic is that we no longer know which of these arguments motivate researchers to work on reducing existential risks from AI. To make matters worse, none of the alternative arguments have been examined in sufficient detail. Most are only presented as blog posts with informal discussion, with neither the detail of a book, nor the rigour of a peer-reviewed publication.

Therefore, as a first step in clarifying the strength of the longtermist case for AI safety, we prepared an online survey, aimed at researchers at top AI safety research organisations (e.g. DeepMind, OpenAI, FHI and CHAI), to find out which arguments are motivating those researchers. We hope this information will allow future work evaluating the plausibility of AI existential risk to focus on the scenarios deemed most important by the experts.

See AI Risk Survey project overview.
See abbreviated summary of survey results.

Options to defend a vulnerable world

Samuel Curtis, Otto Barten, Chris Cooper, Rob Anue

We have taken first steps toward an overview of ways to mitigate the risks we face if we live in a Vulnerable World, as hypothesized by Nick Bostrom. We were especially concerned with Type-1 risks – the “easy nukes” scenario, where it becomes easy for individuals or small groups to cause mass destruction – but in the context of AI. One idea we looked into was a publishing system with restricted access, which we consider a promising option. A related option, which also seemed to be original, was to apply limitations to software libraries. In just one week, we seem to have done some original work – and learned a lot – so this field certainly seems promising to work on.

Extraction of human preferences

Mislav Juric, Taylor Kulp-McDowall, Arun Raja, Riccardo Volpato, Nevan Wichers

Developing safe and beneficial AI systems requires making them aware of and aligned with human preferences. Since humans have significant control over the environments in which RL agents operate, we conjecture that RL agents implicitly learn human preferences. Our research aims first to show that these preferences exist in an agent and then to extract them. To start, we tackle this problem in a toy grid-like environment where a reinforcement learning (RL) agent is rewarded for collecting apples. After showing in previous work that these implicit preferences exist and can be extracted, our first approach involved applying a variety of modern interpretability techniques to the RL agent trained in this environment to find meaningful portions of its network. We are currently pursuing methods to isolate a subnetwork within the trained RL agent that predicts human preferences.

Transferring reward functions across environments to encourage safety for agents in the real world

Nevan Wichers, Victor Tao, Ti Guo, Abhishek Ahuja

Github Link: https://github.com/platers/meta-transfer-learning

It is often hard to encourage safe and altruistic behavior in agents acting in the real world. We want to test whether transferring the reward function could be a solution to this problem.

Our approach is to build a reward function that encourages safety in simulation and to transfer it to the real world to train agents to act safely. Due to the constraints of this research, the testing environment is also a simulation, but it has a different structure from the training environments.

In the first experiment, we tested whether a reward function can be transferred so that it promotes the same action in a testing environment slightly different from the training environment. We first trained a reward function, a convolutional neural network trained with supervised learning, to estimate the score by recognizing an agent’s position in a 2D grid-world environment. We then tested its accuracy in a different environment with slightly different coloring. The result was positive: in the testing environment, the reward function achieved 90% of its performance in the training environment.

In the second experiment, we tested whether we can evolve a reward function that successfully trains agents to take safety- or altruism-related actions in a different environment. We designed a collection game in which each agent can collect apples or bananas for itself or for other agents. To encourage safety, an agent receives a higher score for collecting food for others than for itself. There are three environments: one testing environment where both types of food count toward the score, and two training environments, one where apples count toward the score and one where bananas count. Reward functions are created using evolution. At each round of evolution, the best reward functions are selected based on the performance of agents trained through reinforcement learning using those reward functions. The results come close to supporting our hypothesis but still require more analysis. After analyzing the weights of our best-performing reward functions, we find that most of the time they reward the right action in each environment. Agents trained in the testing environment consistently achieve above 50% safety, as evaluated by our best reward function.
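The evolutionary selection loop described above can be illustrated with a minimal sketch. This is not the project's code: the event names, the score values, the population settings, and especially `agent_score` are assumptions made for illustration. In particular, `agent_score` replaces reinforcement learning with a trivial stand-in agent that always picks the option its reward function values more, so the sketch only shows the select-and-mutate structure.

```python
import random
from itertools import combinations

# Hypothetical events an agent can trigger in the collection game.
EVENTS = ["apple_self", "apple_other", "banana_self", "banana_other"]

# Assumed true scores: collecting for others is worth more than for oneself.
TRAIN_ENVS = [
    {"apple_self": 1.0, "apple_other": 2.0, "banana_self": 0.0, "banana_other": 0.0},
    {"apple_self": 0.0, "apple_other": 0.0, "banana_self": 1.0, "banana_other": 2.0},
]
TEST_ENV = {"apple_self": 1.0, "apple_other": 2.0, "banana_self": 1.0, "banana_other": 2.0}


def agent_score(weights, env):
    """Stand-in for RL training (an assumption, not the project's method).

    The 'agent' is offered every pair of events once and takes whichever
    its reward function values more; we return the mean true score.
    """
    pairs = list(combinations(EVENTS, 2))
    total = 0.0
    for a, b in pairs:
        chosen = a if weights[a] >= weights[b] else b
        total += env[chosen]
    return total / len(pairs)


def fitness(weights):
    """Average true score across the training environments."""
    return sum(agent_score(weights, env) for env in TRAIN_ENVS) / len(TRAIN_ENVS)


def mutate(weights, rng, scale=0.3):
    """Return a copy of the reward weights with Gaussian noise added."""
    return {e: w + rng.gauss(0.0, scale) for e, w in weights.items()}


def evolve(generations=30, pop_size=20, n_keep=5, seed=0):
    """Keep the best reward functions each round and refill with mutants."""
    rng = random.Random(seed)
    population = [{e: rng.uniform(-1, 1) for e in EVENTS} for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        elite = population[:n_keep]
        children = [mutate(rng.choice(elite), rng) for _ in range(pop_size - n_keep)]
        population = elite + children
    return max(population, key=fitness), history


if __name__ == "__main__":
    best, history = evolve()
    print("best training fitness:", history[0], "->", history[-1])
    print("score in held-out test environment:", agent_score(best, TEST_ENV))
```

Because the elite survive unchanged, the best training fitness never decreases between rounds. In a real setup, each fitness evaluation would involve training one or more RL agents, which is far more expensive and noisy than this stand-in.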

We also learned some good practices for training reward functions that encourage safety when they are used to train agents in another environment.

Training reward functions in a variety of environments with different structures boosts the performance of the reward function in the testing environment.
Training reward functions in environments that differ more from the testing environment makes the reward function perform better in the testing environment.

In conclusion, the results give us some confidence that it is possible to build a reward function that encourages safety in simulation and to transfer it to the real world to train agents to act safely.

We tried to evaluate whether transferring the reward function is a feasible alternative to transferring the model itself in the context of altruism.
We implemented several simple environments to train and test reward functions.
We used evolution to find reward functions that lead to altruistic behavior.
The reward functions are evaluated by training multiple reinforcement learning agents to optimize them and measuring the average performance.
We encountered many technical roadblocks, such as computation time and reinforcement learning instability
In conclusion, we are not convinced either way if this idea has potential.

Formalization of goal-directedness

Adam Shimi, Michele Campolo, Sabrina Tang, Joe Collman

A common argument for the long-term risks of AI and AGI is the difficulty of specifying our wants without missing important details implicit in our values and preferences. However, Rohin Shah, among others, argued in a series of posts that this issue need not arise for every design of AGI — only for ones that are goal-directed. He then hypothesizes that some goal-directedness property is not strictly required for building useful and powerful AI. However, Shah admits “…it’s not clear exactly what we mean by goal-directed behavior.” Consequently, we propose clarifying the definition of goal-directedness for both formal and operational cases. We will then assess the definition with respect to the risks of, and alternatives to, goal-directedness.
See five blog posts published after the camp.

Generalization in reward-learning

Liang Zhou, Anton Makiievskyi, Max Chiswick, Sam Clarke

One of the primary goals in machine learning is to create algorithms and architectures that generalize well to samples outside of the training set. In reinforcement learning, however, the same environments are often used for both training and testing, which may lead to significant overfitting. We build on previous work in reward learning and model generalization to evaluate reward learning on random, procedurally generated environments. We implement algorithms such as T-REX (Brown et al., 2019) and apply them to procedurally generated environments from the Procgen benchmark (Cobbe et al., 2019). Given this diverse set of environments, our experiments involve training reward models on a set number of levels and then evaluating them, as well as policies trained on them, on separate sets of test levels.
See two blog posts published after the camp.
See GitHub.
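For readers unfamiliar with T-REX, the sketch below shows the pairwise ranking loss at its core: a reward model is trained so that the trajectory ranked higher receives the larger predicted return. It is a minimal illustration and not the team's implementation. The names `predicted_return` and `trex_pair_loss` are hypothetical, and the toy `reward_fn` stands in for a neural network reward model.

```python
import math


def predicted_return(reward_fn, trajectory):
    """Sum the reward model's per-observation predictions over a trajectory."""
    return sum(reward_fn(obs) for obs in trajectory)


def trex_pair_loss(reward_fn, worse, better):
    """Cross-entropy loss for one ranked pair of trajectories.

    Equals -log(exp(R_better) / (exp(R_worse) + exp(R_better))), computed
    with a log-sum-exp shift for numerical stability.
    """
    r_worse = predicted_return(reward_fn, worse)
    r_better = predicted_return(reward_fn, better)
    m = max(r_worse, r_better)
    log_z = m + math.log(math.exp(r_worse - m) + math.exp(r_better - m))
    return log_z - r_better


if __name__ == "__main__":
    # Toy reward model (an assumption): the observation is its own reward.
    reward_fn = float
    low, high = [0, 1], [1, 2]
    print("loss when ranking agrees with the model:", trex_pair_loss(reward_fn, low, high))
    print("loss when ranking contradicts the model:", trex_pair_loss(reward_fn, high, low))
```

The loss is small when the model already assigns the higher return to the better-ranked trajectory and grows when the model disagrees with the ranking. To measure generalization as described above, such a model would be fitted on trajectories from the training levels and evaluated on trajectories from disjoint test levels.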