Video Summary

This video is limited to uci.edu accounts!

Project Summary

In this project, we aim to develop AI agents capable of playing Hanabi using reinforcement learning techniques. Given the game’s partially observable and cooperative nature, we explore multiple approaches to train agents that can make optimal decisions and achieve high scores. For the environment, we tested both DeepMind’s Hanabi Learning Environment and a customized environment. We have explored several methods to achieve this goal, namely RPPO, Multi-Agent PPO (MAPPO), and A2C.

Approach

Pan:

I aim to train an AI agent using Proximal Policy Optimization (PPO) to achieve high scores and effective teamwork in Hanabi’s partially observable environment.
Because Hanabi is a cooperative game with limited information, MAPPO (Multi-Agent PPO) is a natural choice: it lets agents share a centralized value function for better coordination. The paper “The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games” highlights the success of PPO and MAPPO in similar settings, making it an ideal approach. Therefore, I am reproducing its code as the foundation for my project.

My approach applies PPO within a customized Hanabi environment.
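
As a rough illustration, a single rollout against such an environment could look like the sketch below; the environment interface (`reset`, `step`, `legal_moves_mask`) and the `agent.policy_logits` call are hypothetical placeholders, not our actual API.

```python
import torch

def collect_rollout(env, agent, num_steps):
    """Collect one on-policy rollout of num_steps transitions (sketch only)."""
    trajectory = []
    obs = env.reset()
    for _ in range(num_steps):
        # Mask out illegal Hanabi moves before sampling an action.
        legal_mask = torch.as_tensor(env.legal_moves_mask(), dtype=torch.bool)
        logits = agent.policy_logits(torch.as_tensor(obs, dtype=torch.float32))
        logits = logits.masked_fill(~legal_mask, float("-inf"))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()

        next_obs, reward, done, _ = env.step(action.item())
        trajectory.append((obs, action.item(), reward, dist.log_prob(action).item()))
        obs = env.reset() if done else next_obs
    return trajectory
```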

Yukai:

My approach is similar to Pan’s: a multi-agent PPO (MAPPO) model as described in “The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games”, where the actor and critic have separate LSTM networks. Following the Centralized Training with Decentralized Execution (CTDE) structure, the actor network conditions only on each agent’s local observation, while the critic network conditions on both the local and the global observation. We started with DeepMind’s Hanabi Learning Environment, but we also created our own environment for a future customized reward system.
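
A minimal PyTorch sketch of this actor/critic split is shown below; the layer widths and observation dimensions are illustrative, not the values we actually use.

```python
import torch
import torch.nn as nn

class RecurrentActor(nn.Module):
    """Decentralized actor: acts from each agent's LOCAL observation only."""
    def __init__(self, local_obs_dim, action_dim, hidden=64):
        super().__init__()
        self.encoder = nn.Linear(local_obs_dim, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.policy_head = nn.Linear(hidden, action_dim)

    def forward(self, local_obs, hidden_state=None):
        x = torch.relu(self.encoder(local_obs))      # (batch, time, hidden)
        x, hidden_state = self.lstm(x, hidden_state)
        return self.policy_head(x), hidden_state     # action logits per step

class RecurrentCritic(nn.Module):
    """Centralized critic: trained on local + global (all-agent) information."""
    def __init__(self, local_obs_dim, global_obs_dim, hidden=64):
        super().__init__()
        self.encoder = nn.Linear(local_obs_dim + global_obs_dim, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.value_head = nn.Linear(hidden, 1)

    def forward(self, local_obs, global_obs, hidden_state=None):
        x = torch.relu(self.encoder(torch.cat([local_obs, global_obs], dim=-1)))
        x, hidden_state = self.lstm(x, hidden_state)
        return self.value_head(x), hidden_state      # state-value estimate per step
```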

Sampling is done at every step: we store the observation, action, reward, predicted critic value, and action log-probability in a buffer. These data are later used to compute advantages and returns.
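
A minimal sketch of this buffer and of the advantage/return computation (GAE) is shown below; the discount factor and GAE lambda are common defaults, not necessarily our tuned values.

```python
import numpy as np

class RolloutBuffer:
    """Stores per-step data collected during sampling."""
    def __init__(self):
        self.obs, self.actions, self.rewards = [], [], []
        self.values, self.log_probs = [], []

    def add(self, obs, action, reward, value, log_prob):
        self.obs.append(obs)
        self.actions.append(action)
        self.rewards.append(reward)
        self.values.append(value)
        self.log_probs.append(log_prob)

    def compute_advantages_and_returns(self, last_value, gamma=0.99, gae_lambda=0.95):
        """Generalized Advantage Estimation; returns = advantages + values."""
        values = np.array(self.values + [last_value], dtype=np.float32)
        rewards = np.array(self.rewards, dtype=np.float32)
        advantages = np.zeros_like(rewards)
        gae = 0.0
        for t in reversed(range(len(rewards))):
            delta = rewards[t] + gamma * values[t + 1] - values[t]
            gae = delta + gamma * gae_lambda * gae
            advantages[t] = gae
        returns = advantages + values[:-1]
        return advantages, returns
```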

My actor and critic losses follow the equations below:

(Figures: actor loss and critic loss equations.)
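
Since the images may not render here, the block below is my transcription of the standard MAPPO-style losses from the cited paper (clipped surrogate actor objective with an entropy bonus, and a clipped value loss); the notation in the original figures may differ slightly.

```latex
% Actor objective (maximized), with probability ratio r_t and entropy bonus S:
r_t(\theta) = \frac{\pi_\theta(a_t \mid o_t)}{\pi_{\theta_{\text{old}}}(a_t \mid o_t)}
L(\theta) = \mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\hat{A}_t,\;
            \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\hat{A}_t\big)\right]
            + \sigma\,\mathbb{E}_t\!\left[S\big[\pi_\theta(\cdot \mid o_t)\big]\right]

% Critic loss (minimized), with value clipping around the old value estimate:
L(\phi) = \mathbb{E}_t\!\left[\max\!\Big(\big(V_\phi(s_t)-\hat{R}_t\big)^2,\;
          \big(\operatorname{clip}(V_\phi(s_t),\,V_{\text{old}}(s_t)-\epsilon,\,
          V_{\text{old}}(s_t)+\epsilon)-\hat{R}_t\big)^2\Big)\right]
```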

I use hyperparameters similar to those in the paper:
actorLearningRate = 7e-4
criticLearningRate = 1e-3
num_env_steps = 10000000000000 (I am training with 100_000 steps because my computer is very slow)
numPlayers = 2
clipParam = 0.1
ppoEpoch = 15
numMiniBatch = 1
valueLossCoefficient = 0.5
entropyCoefficient = 0.015
maximumGradientNorm = 0.5
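
To show where these coefficients enter the update, here is a minimal sketch of a single PPO optimization step; `actor`, `critic`, the optimizers, and the batch tensors are placeholders, and the LSTM hidden-state handling is omitted for brevity.

```python
import torch
import torch.nn as nn

CLIP_PARAM = 0.1          # clipParam
VALUE_LOSS_COEF = 0.5     # valueLossCoefficient
ENTROPY_COEF = 0.015      # entropyCoefficient
MAX_GRAD_NORM = 0.5       # maximumGradientNorm

def ppo_update(actor, critic, actor_opt, critic_opt, batch):
    """One PPO step over a mini-batch (sketch; hidden states omitted)."""
    dist = torch.distributions.Categorical(logits=actor(batch["obs"]))
    log_probs = dist.log_prob(batch["actions"])
    ratio = torch.exp(log_probs - batch["old_log_probs"])

    # Clipped surrogate objective plus entropy bonus.
    surr1 = ratio * batch["advantages"]
    surr2 = torch.clamp(ratio, 1 - CLIP_PARAM, 1 + CLIP_PARAM) * batch["advantages"]
    policy_loss = -torch.min(surr1, surr2).mean() - ENTROPY_COEF * dist.entropy().mean()

    # Value loss weighted by the value-loss coefficient.
    values = critic(batch["global_obs"]).squeeze(-1)
    value_loss = VALUE_LOSS_COEF * (batch["returns"] - values).pow(2).mean()

    actor_opt.zero_grad()
    policy_loss.backward()
    nn.utils.clip_grad_norm_(actor.parameters(), MAX_GRAD_NORM)
    actor_opt.step()

    critic_opt.zero_grad()
    value_loss.backward()
    nn.utils.clip_grad_norm_(critic.parameters(), MAX_GRAD_NORM)
    critic_opt.step()
```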

Tia:

Evaluation

Pan:

Yukai:

The following shows the mean game score, averaged over every 100 episodes:

(Figure: mean game score per 100 episodes.)

The graph shows that training is very unstable. I expected some ups and downs when training on a partially observable game, but not to this extent. I suspect there are problems with my hyperparameter tuning and my implementation of the equations. Although the number of training steps is very small, I still believe this MAPPO method can do better than a mean game score of 1.25.

Tia:

Remaining Goals and Challenges

Pan:

Yukai:

  1. The goal for this quarter is to implement a MAPPO model and a customized environment that train agents to play Hanabi at a perfect score.
  2. Comparing different models is also a main goal. We ran baseline models to compare their training time and the average scores they achieve.
  3. The main challenge for me is that RL is a completely new area for me; it is hard to start from scratch and reach a solid goal in RL training.

Tia:

Next Steps

Pan:

  1. Integrate PPO into our Hanabi environment for smooth gameplay.
  2. Optimize Performance by refining hyperparameters and model design.
  3. Compare Algorithms by benchmarking PPO against A2C and heuristic agents.
  4. Collaborate & Refine through team discussions and iterative improvements.

Yukai:

  1. My current MAPPO runs with DeepMind’s Hanabi Learning Environment; the next step is to use our own environment with a customized reward system to better train our agents (see the sketch after this list).
  2. Reach a higher score. The current mean game score is only 1.25, which is surely not enough.
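
As a purely hypothetical illustration of what such a customized reward could look like, the sketch below wraps a base environment and adds small shaping terms; the bonus values and the `info` fields it reads are invented for the example, not our final design.

```python
class ShapedRewardEnv:
    """Wraps a base Hanabi environment and adds simple reward shaping (sketch)."""
    def __init__(self, base_env, hint_bonus=0.05, discard_penalty=0.02):
        self.base_env = base_env
        self.hint_bonus = hint_bonus
        self.discard_penalty = discard_penalty

    def reset(self):
        return self.base_env.reset()

    def step(self, action):
        obs, reward, done, info = self.base_env.step(action)
        # Small bonus for giving hints, small penalty for discarding.
        if info.get("action_type") == "REVEAL":
            reward += self.hint_bonus
        elif info.get("action_type") == "DISCARD":
            reward -= self.discard_penalty
        return obs, reward, done, info
```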

Tia:

References