In this project, we are developing AI agents capable of playing Hanabi using reinforcement learning techniques. Hanabi is a cooperative, partially observable card game where players can see their teammates’ cards but not their own, requiring strategic reasoning and teamwork. The main challenge lies in inferring hidden information, making optimal decisions with limited communication, and coordinating multi-agent actions.
Hanabi differs from adversarial two-player zero-sum games: the value of an agent's policy depends critically on the policies used by its teammates. The asymmetric information, limited communication, multiple local optima, and interdependence of strategies make the problem non-trivial and call for multi-agent reinforcement learning.
Given these challenges, we explore reinforcement learning approaches to train agents that make strategic decisions and aim for high scores. We tested both the DeepMind Hanabi Learning Environment and a customized environment, implementing methods such as RPPO, multi-agent PPO, and A2C. These techniques allow AI agents to adapt dynamically, improving decision-making in cooperative settings.
Dongdong Pan:
Yukai Gu:
Tia Tairan Wang:
move = np.random.choice(legal_moves)  # pick uniformly at random among the currently legal moves
Because of the nature of Hanabi, randomly picked moves almost always lead to a score of zero. In fact, we ran 100,000 games using this approach, and the result was 0 for every game.
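For reference, a minimal sketch of this random baseline, assuming the DeepMind Hanabi Learning Environment's `rl_env` API (the `current_player` and `legal_moves_as_int` observation fields reflect our reading of that library and may differ across versions):

```python
import numpy as np
from hanabi_learning_environment import rl_env

env = rl_env.make("Hanabi-Full", num_players=2)
scores = []
for _ in range(100_000):
    observation = env.reset()
    done = False
    episode_reward = 0
    while not done:
        current = observation["current_player"]
        legal_moves = observation["player_observations"][current]["legal_moves_as_int"]
        move = int(np.random.choice(legal_moves))  # uniformly random legal move
        observation, reward, done, _ = env.step(move)
        episode_reward += reward
    scores.append(episode_reward)

print("average score:", sum(scores) / len(scores))
```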
After implementing the initial approach, I attempted to reproduce the Recurrent Multi-Agent Proximal Policy Optimization (Recurrent-MAPPO) framework described in the paper. My goal was to adapt the reproduced code to our custom environment, ensuring that it aligned with our specific settings and constraints. However, the actual results did not meet expectations. Despite careful modifications and integration efforts, the model's performance remained suboptimal: the agent struggled to coordinate actions effectively and achieve high scores, suggesting that additional adjustments or alternative approaches might be necessary.
Challenges with Baseline Approach
After attempting to integrate the reproduced code into our custom environment, I decided to shift to using the original environment from the paper due to compatibility issues, possibly caused by version differences.
To ensure the code could run properly in the original environment, I made modifications to certain parts of the implementation. For example, comparing the two provided code snippets:
The second image shows the original code, where self.fc_h and get_clones(self.fc_h, self._layer_N) were used.
However, in the first image, I modified it by replacing get_clones() with nn.ModuleList(), ensuring compatibility with the newer framework while preserving the intended structure.
This adjustment was necessary as the original implementation did not function correctly in the current version of the environment. By making these changes, I was able to execute the code while maintaining its original architecture as closely as possible.
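For readers without the screenshots, here is an illustrative sketch of the change (class and attribute names follow the paper's codebase as I understand it, not our exact files):

```python
import copy
import torch.nn as nn

def get_clones(module, n):
    # Helper used by the original implementation: n deep copies of one layer.
    # Note: deep copies start from identical weights, while fresh layers are
    # initialized independently.
    return nn.ModuleList([copy.deepcopy(module) for _ in range(n)])

class MLPLayer(nn.Module):
    def __init__(self, input_dim, hidden_size, layer_n):
        super().__init__()
        self._layer_N = layer_n
        self.fc1 = nn.Sequential(nn.Linear(input_dim, hidden_size), nn.ReLU())
        # Original: self.fc_h = nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.ReLU())
        #           self.fc2 = get_clones(self.fc_h, self._layer_N)
        # Modified: build the nn.ModuleList of hidden layers directly.
        self.fc2 = nn.ModuleList(
            [nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.ReLU())
             for _ in range(self._layer_N)]
        )

    def forward(self, x):
        x = self.fc1(x)
        for layer in self.fc2:
            x = layer(x)
        return x
```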
Parameters:
Equation:
Parameters and equations used by the Recurrent Multi-Agent Proximal Policy Optimization (Recurrent-MAPPO) framework described in "The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games"
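For completeness, the clipped-surrogate objective that the Step 7 code below implements can be written as follows (a standard statement of PPO, not copied from the paper's exact notation):

$$
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\Big],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid o_t)}{\pi_{\theta_{\text{old}}}(a_t \mid o_t)}
$$

The total loss also adds an entropy bonus for exploration and a clipped value-function loss, exactly as in the Step 7 code.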
My approach is similar to Pan's: a recurrent multi-agent PPO model described in "The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games," where the actor and critic have separate LSTM networks. Following the Centralized Training with Decentralized Execution structure, the ACTOR network conditions only on each agent's local observation, while the CRITIC network conditions on both local and global observations.
Environment used:
============ Full Implementation of RMAPPO ============
== Step 1: ==
model.orthogonalInitialization()
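`model.orthogonalInitialization()` is our own shorthand; a minimal sketch of what such a helper could do in PyTorch, assuming the model consists of Linear and LSTM layers:

```python
import torch.nn as nn

def orthogonal_initialization(model, gain=1.0):
    # Orthogonal weights and zero biases for every Linear/LSTM layer,
    # as commonly used in PPO implementations.
    for module in model.modules():
        if isinstance(module, (nn.Linear, nn.LSTM)):
            for name, param in module.named_parameters():
                if "weight" in name:
                    nn.init.orthogonal_(param, gain=gain)
                elif "bias" in name:
                    nn.init.constant_(param, 0.0)
```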
Actor learning rate: 7e-4 (As recommended in Paper)
Critic learning rate: 1e-3 (As recommended in Paper)
== Step 2: ==
buffer = Buffer(numPlayers)
for episode in range(episodes):
    buffer.clear()
Episodes: 1_000_000 (but in fact it runs very slowly, so it never came close to finishing)
1 (As recommended in Paper)
== Step 3: ==
hiddenStates = [[
((torch.zeros(1, 1, 512, device = useDevice), torch.zeros(1, 1, 512, device = useDevice)),
(torch.zeros(1, 1, 512, device = useDevice), torch.zeros(1, 1, 512, device = useDevice))) for _ in range(numPlayers)
] for i in range(envNum)]
Hidden state size: 512 (As recommended in Paper)
== Step 4: ==
while not all(dones):  # loop until every parallel environment has finished its episode
    for i in range(envNum):
        # Forward actor and critic
        actionProbabilities, newActorHiddenLayer = model.forwardActor(currentAgentObservationVectorized, actorHiddenLayer, device = useDevice)
        criticValue, newCriticHiddenLayer = model.forwardCritic(globalObservationVectorized, criticHiddenLayer, device = useDevice)
        # Update hidden states
        hiddenStates[i][currentPlayerID] = (newActorHiddenLayer, newCriticHiddenLayer)
        # Choose action by sampling from the policy distribution
        candidateIndex = torch.multinomial(actionProbabilities[0, 0, :], num_samples = 1).item()
        action = candidateIndex
        # Do the action
        nextGlobalObservation, reward, done, info = envs[i].step(action)
        # Store the full trace: observations, actions, rewards, values, log-probabilities, hidden states
        buffer.insert(...)
1000 steps, to collect enough training data for each episode (As recommended in Paper)
== Step 5: ==
for t in reversed(range(T)):
    nextValue = valuePredict[t + 1]
    delta = rewards[t] + gamma * nextValue - valuePredict[t]
    lastGAE = delta + gamma * lamda * lastGAE
    advantage[t] = lastGAE
    returns[t] = advantage[t] + valuePredict[t]
normalizeWithPopart()
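`normalizeWithPopart()` is likewise shorthand; a hedged sketch of the running-statistics part of PopArt value normalization (the full method also rescales the value head's weights so that old predictions stay consistent; the `beta` and `epsilon` defaults here are illustrative):

```python
import torch

class PopArt:
    """Running mean/std of the returns, used to normalize value targets."""

    def __init__(self, beta=0.001, epsilon=1e-8):
        self.beta, self.epsilon = beta, epsilon
        self.mean, self.mean_sq = 0.0, 1.0

    def update(self, returns: torch.Tensor):
        # Exponential moving averages of the first and second moments.
        self.mean = (1 - self.beta) * self.mean + self.beta * returns.mean().item()
        self.mean_sq = (1 - self.beta) * self.mean_sq + self.beta * (returns ** 2).mean().item()

    @property
    def std(self):
        variance = max(self.mean_sq - self.mean ** 2, self.epsilon)
        return variance ** 0.5

    def normalize(self, x):
        return (x - self.mean) / self.std

    def denormalize(self, x):
        return x * self.std + self.mean
```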
gamma: 0.99 (As recommended in Paper)
lambda: 0.95 (As recommended in Paper)
0.001 (As recommended in Paper)
1e-8 (As recommended in Paper)
== Step 6: ==
yield {
    "globalObservations": globalObservations,
    "currentAgentObservation": currentAgentObservation,
    "actorHidden": actorHidden,
    "actorCell": actorCell,
    "criticHidden": criticHidden,
    "criticCell": criticCell,
    "actions": actions,
    "rewards": rewards,
    "nextGlobalObservations": nextGlobalObservations,
    "valuePredict": valuePredict,
    "oldPolicyActionLogProbability": oldPolicyActionLogProbability,
    "advantage": advantage,
    "returns": returns
}
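A hedged sketch of how `sampleMiniBatch` could split the stored trajectories into fixed-length chunks for the recurrent networks (`self.fields` stands in for the per-step arrays listed in the yield above; the handling of per-chunk hidden states is simplified):

```python
import numpy as np

def sampleMiniBatch(self, numMiniBatch, chunkSize):
    # Split the stored time steps into consecutive chunks of length chunkSize,
    # shuffle the chunks, and group them into numMiniBatch mini-batches.
    T = len(self.rewards)
    starts = np.arange(0, T - chunkSize + 1, chunkSize)
    np.random.shuffle(starts)
    chunksPerBatch = max(len(starts) // numMiniBatch, 1)
    for b in range(numMiniBatch):
        batchStarts = starts[b * chunksPerBatch:(b + 1) * chunksPerBatch]
        if len(batchStarts) == 0:
            continue
        yield {
            key: np.concatenate([field[s:s + chunkSize] for s in batchStarts])
            for key, field in self.fields.items()  # the per-step arrays yielded above
        }
```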
== Step 7: ==
dataBatch = buffer.sampleMiniBatch(self.numMiniBatch, chunkSize)
for sample in dataBatch:
    for _ in range(self.ppoEpoch):
        values, actionLogProbability, entropy, newActorHidden, newCriticHidden = self.policy.evaluateActions(currentAgentObservation, globalObservations, actions, actorHiddenStates, criticHiddenStates)
        # Calculate Actor policy loss
        importanceWeight = torch.exp(actionLogProbability - oldPolicyActionLogProbability)
        surrogate1 = importanceWeight * advantage
        surrogate2 = torch.clamp(importanceWeight, 1.0 - self.clipParam, 1.0 + self.clipParam) * advantage
        actorLoss = -torch.mean(torch.min(surrogate1, surrogate2)) - self.entropyCoefficient * entropy
        # Calculate Critic loss (clipped value loss)
        predictValueClipped = valuePredict + (values - valuePredict).clamp(-self.clipParam, self.clipParam)
        clippedError = predictValueClipped - returns
        originalError = values - returns
        clippedValueLoss = torch.mean(clippedError ** 2)
        originalValueLoss = torch.mean(originalError ** 2)
        valueLoss = torch.max(originalValueLoss, clippedValueLoss)
        # Total loss
        totalLoss = actorLoss + self.valueLossCoefficient * valueLoss
        adamUpdate(totalLoss)
==========================================================
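As a side note, the `adamUpdate(totalLoss)` shorthand above would typically expand to a standard PyTorch optimizer step with gradient clipping (`self.optimizer` and `self.maxGradNorm` are assumed attributes):

```python
self.optimizer.zero_grad()
totalLoss.backward()
torch.nn.utils.clip_grad_norm_(self.policy.parameters(), self.maxGradNorm)
self.optimizer.step()
```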
Advantages:
Disadvantages:
My approach uses the Advantage Actor-Critic (A2C) algorithm with parameter sharing for training collaborative Hanabi agents. Unlike the PPO-based methods, A2C offers computational efficiency while still providing stable policy improvement for partially observable environments like Hanabi.
I developed two key environment implementations for training my agents:
class SingleAgentHanabiEnv(gym.Env):
"""
Single-agent Gym wrapper for multi-player Hanabi.
- The RL agent controls seat 0.
- Other seats are controlled by RandomAgent.
- Uses classic Gym API for compatibility with Stable Baselines3.
"""
class DualAgentHanabiEnv(gym.Env):
"""
Dual-agent Gym wrapper for Hanabi.
This environment allows training a single agent to play from both positions.
"""
============ Mathematical Foundations ============
A2C combines policy-based and value-based learning through:
| Component | Role |
|---|---|
| Policy network (actor) | Learns the policy π(a \| s) directly; updated with the policy gradient ∇θJ(θ) = 𝔼[∇θ log π(a \| s; θ) · A(s, a)] |
| Value network (critic) | Estimates V(s), which supplies the advantage A(s, a) = r + γV(s') - V(s) that scales the policy update |
============ Implementation Architecture ============
My A2C implementation includes:
from stable_baselines3 import A2C

# Linear learning-rate schedule from 3e-4 down to 1e-5
# (one way to implement it: SB3 passes progress_remaining, which goes from 1 to 0)
def lr_schedule(progress_remaining):
    return 1e-5 + (3e-4 - 1e-5) * progress_remaining

# Neural network architecture
policy_kwargs = {"net_arch": [256, 256, 256, 256]}

# A2C model initialization with carefully tuned hyperparameters
model = A2C(
    policy="MlpPolicy",
    env=env,
    learning_rate=lr_schedule,    # Linear schedule from 3e-4 to 1e-5
    n_steps=256,                  # Steps per update
    gamma=0.995,                  # Discount factor for delayed rewards
    ent_coef=0.05,                # Entropy coefficient for exploration
    vf_coef=0.5,                  # Value function loss coefficient
    max_grad_norm=0.5,            # Gradient clipping for stability
    policy_kwargs=policy_kwargs,  # Deep network architecture
    tensorboard_log=log_dir,      # For performance tracking
)
After analyzing my initial results with two separate A2C agents (one for each player position), I identified major coordination issues:
My solution was to implement parameter sharing, a technique where a single model learns to play from all player positions:
# Training loop with parameter sharing
for i in range(iterations):
    # Train as player 0
    model.set_env(env_a)
    model.learn(total_timesteps=timesteps_per_env, callback=callback_a)
    # Train as player 1
    model.set_env(env_b)
    model.learn(total_timesteps=timesteps_per_env, callback=callback_b)
This approach offers several benefits:
Advantages:
Disadvantages:
To address computational constraints, I implemented several optimizations:
Due to time constraints and multiple interruptions during the training process, I was unable to train the model to convergence. The following result is just an example from a long training session, but it does not represent the full final outcome. While attempting to reproduce the paper’s results, I encountered various issues that caused training to stop and restart frequently, leading to an inconsistent process. As a result, I had to rely on the pre-trained model instead of fully training from scratch.
This is the final score obtained using the pre-trained model.
I used three learning environments and different stages of my algorithm construction.
Learning Environments:
Algorithms:
Combination Results:
Random Select Agent + Deep-mind Learning Environment
At the beginning of implementing our main PPO algorithm, I started with the random agent provided by DeepMind's environment. As expected, after 100,000 games, the random-selection agent accumulated a total game score of zero. This really shows that the problem is not easy: even scoring more than 0 points is a challenge for an AI agent. At this point I was worried about training efficiency, because training involves many tests and trials. An agent learns most efficiently when the sampled data contains episodes that earn some game score, and based on the random agent's performance, I expected it would take more time and more sampled data than planned for training to converge even to a local minimum.
stable_baselines3’s PPO + Simple Customized Learning Environment (Version 1)
Next, I wanted to learn more about the learning environment and how training proceeds each episode, so I created a simple customized learning environment based on DeepMind's environment. The key changes were adding more logs to understand what was going on and adding a customized reward system; the original reward is basically the game score. At this stage I did not yet have my own training algorithm, so I used stable_baselines3's PPO model with my environment. The result looked promising since the reward improved over time, but the average reward was around 3, which is very low for my customized environment.
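A minimal sketch of this stage, with a placeholder environment class name and illustrative hyperparameters (not the exact values used):

```python
from stable_baselines3 import PPO

# CustomHanabiEnvV1 is a hypothetical name for the Version 1 customized environment
env = CustomHanabiEnvV1()
model = PPO("MlpPolicy", env, verbose=1, tensorboard_log="./ppo_hanabi_v1/")
model.learn(total_timesteps=1_000_000)
```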
Simple RPPO + Simple Customized Learning Environment (Version 1)
At this stage, I implemented a very simple RPPO method, which uses a shared LSTM that handles both actor and critic inputs. The results indicate an average game score of around 1.3. For an under-one-hour run I think that is decent, but I believe the model converged to a local minimum. I did not run it for long because it was still under construction. Before long, we found a seemingly more efficient and powerful algorithm, RMAPPO, from the paper "The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games".
RMAPPO + Deep-mind Learning Environment
I started reproducing the algorithm from the paper and switched the learning environment back to DeepMind's environment for a steady run. One thing I immediately noticed is that the RMAPPO method takes much more time to train: about 185 episodes per minute on average. Besides the slow training, the results are very unstable, and it is hard to tell whether they are improving over time.
Refined RMAPPO + Deep-mind Learning Environment
Later I found an inconsistency between my code and the paper, so I took a few days to fix the problem. After rerunning the algorithm, things became even worse: the training speed is unbelievably slow. Depending on the average number of steps per episode, the training speed varies from 60 episodes per hour down to 11 episodes per hour. However, the graph shows that the model is learning and improving over time.
Refined RMAPPO + Customized Learning Environment (Version 2)
My focus now is making learning faster. I tested different hyperparameters, such as lowering the chunk size of the learning data and decreasing the number of environments used for collecting data, but none of these really changed anything. So I started changing the reward system, trying to make the agent learn faster even within a small number of steps. The yellow graph shows one of the reward systems I tried: a negative reward whenever an in-game action does not end the game. The intention was to keep games short so the learning speed would increase, but it backfired; the average number of in-game steps decreased and overall the agent was not learning. I then reversed the reward system to give a positive reward when the game does not finish; the result is the purple line. The agent takes more actions per game and the reward increases, but because each game takes so many steps, I only finished 130 learning steps after 12 hours of waiting. Overall, using the refined RMAPPO algorithm I reproduced from the paper, the game score is improving over time. With a faster machine, I believe it would converge to a very high in-game score. A sketch of the two reward-shaping variants is given below.
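A hedged sketch of the two shaping variants (the 0.01 magnitude is illustrative, not the exact value used):

```python
def shape_reward(game_reward, done, variant="positive"):
    # Yellow curve: small penalty for every non-terminal step ("negative" variant).
    # Purple curve: small bonus for every non-terminal step ("positive" variant).
    if done:
        return game_reward
    step_term = 0.01 if variant == "positive" else -0.01
    return game_reward + step_term
```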
I evaluated my A2C approach through multiple experiments, comparing different training methods and tracking performance metrics in TensorBoard.
My first implementation trained two separate A2C models (Agent A and Agent B) to play from different positions. While this approach showed some learning progress, it faced significant challenges:
This resulted in modest but inconsistent final scores, with agents struggling to coordinate effectively.
After implementing parameter sharing (single model trained on both positions), the results improved dramatically:
Performance Metrics:
I conducted a comprehensive evaluation across different game configurations:
| Configuration | Average Score | Win Rate | Perfect Game Rate |
|---|---|---|---|
| 2 colors, 2 ranks | 5.7/6 | 87.5% | 72.0% |
| 2 colors, 3 ranks | 4.3/6 | 65.3% | 53.1% |
| 3 colors, 3 ranks | 5.8/9 | 42.7% | 35.5% |
| Full Game (5 colors, 5 ranks) | 13.2/25 | 8.5% | 5.2% |
Score Distribution Analysis:
To understand the impact of various components, I conducted ablation studies:
Action Distribution: The parameter-sharing model developed a balanced strategy using all available action types:
This distribution shows the agent learned to use information tokens efficiently.
Hint Efficiency: A key metric for cooperative play is how effectively hints lead to successful plays:
While my A2C implementation doesn’t match the theoretical upper bounds of the RMAPPO approaches described by my teammates, it offers several practical advantages:
These trade-offs make A2C with parameter sharing an excellent practical choice for Hanabi, especially when computational resources are limited.
Based on these results, I identify several promising directions for future work: