Video Summary

This video is restricted to uci.edu accounts!

Project Summary

1. Introduction:

In this project, we are developing AI agents capable of playing Hanabi using reinforcement learning techniques. Hanabi is a cooperative, partially observable card game where players can see their teammates’ cards but not their own, requiring strategic reasoning and teamwork. The main challenge lies in inferring hidden information, making optimal decisions with limited communication, and coordinating multi-agent actions.

2. Challenge:

Hanabi differs from adversarial two-player zero-sum games in that the value of an agent’s policy depends critically on the policies used by its teammates. The imbalanced information, limited communication, multiple local optima, and interdependence of strategies make the problem non-trivial and call for multi-agent reinforcement learning.

3. Method:

Given these challenges, we explore reinforcement learning approaches to train agents that make strategic decisions and aim for high scores. We tested both the DeepMind Hanabi Learning Environment and a customized environment, implementing methods such as Recurrent PPO (RPPO), multi-agent PPO, and A2C. These techniques allow the agents to adapt dynamically, improving decision-making in cooperative settings.

Dongdong Pan:

Yukai Gu:

Tia Tairan Wang:

Approaches

1. Baseline Approach – Random Choice:

move = np.random.choice(legal_moves)

Due to the nature of Hanabi, if moves are picked uniformly at random, the score will almost certainly be zero. In fact, we ran 100,000 games using this approach, and every game ended with a score of 0.
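For reference, the full baseline loop looks roughly like the sketch below. It assumes the DeepMind Hanabi Learning Environment’s rl_env API (rl_env.make, per-player legal_moves, and dict-based step); the exact score bookkeeping in our customized environment may differ.

import numpy as np
from hanabi_learning_environment import rl_env

def play_random_game(env):
    # Play one game, choosing uniformly among the current player's legal moves.
    observations = env.reset()
    done, score = False, 0
    while not done:
        current = observations['current_player']
        legal_moves = observations['player_observations'][current]['legal_moves']
        move = np.random.choice(legal_moves)
        observations, reward, done, _ = env.step(move)
        score += reward
    return score

env = rl_env.make(environment_name='Hanabi-Full', num_players=2)
scores = [play_random_game(env) for _ in range(100_000)]
print(f"Average score over 100,000 random games: {np.mean(scores):.3f}")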

2. Dongdong Pan’s Approach:

After implementing the initial approach, I attempted to reproduce the Recurrent Multi-Agent Proximal Policy Optimization (Recurrent-MAPPO) framework described in the paper. My goal was to adapt the reproduced code to our custom environment, ensuring that it aligned with our specific settings and constraints. However, the actual results did not meet expectations. Despite careful modifications and integration efforts, the model’s performance remained suboptimal: the agent struggled to coordinate actions effectively and achieve high scores, suggesting that additional adjustments or alternative approaches might be necessary.

Challenges with Baseline Approach


After attempting to integrate the reproduced code into our custom environment, I decided to shift to using the original environment from the paper due to compatibility issues, possibly caused by version differences.

To ensure the code could run properly in the original environment, I made modifications to certain parts of the implementation. For example, comparing the two provided code snippets:
image
image
The second image shows the original code, where self.fc_h and get_clones(self.fc_h, self._layer_N) were used. However, in the first image, I modified it by replacing get_clones() with nn.ModuleList(), ensuring compatibility with the newer framework while preserving the intended structure. This adjustment was necessary as the original implementation did not function correctly in the current version of the environment. By making these changes, I was able to execute the code while maintaining its original architecture as closely as possible.
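For illustration, the shape of that change is sketched below. In the original MAPPO codebase, get_clones() deep-copies a module into an nn.ModuleList; the HiddenLayers class and layer sizes here are hypothetical stand-ins rather than the exact project code.

import copy
import torch.nn as nn

# Helper used in the original code (deep-copies one module N times):
# def get_clones(module, N):
#     return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])

class HiddenLayers(nn.Module):
    # Modified version: build the stack of hidden layers directly with nn.ModuleList
    # so each layer has its own parameters, preserving the original architecture.
    def __init__(self, hidden_size, layer_N):
        super().__init__()
        self._layer_N = layer_N
        self.fc_h = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.ReLU())
            for _ in range(layer_N)
        ])

    def forward(self, x):
        for layer in self.fc_h:
            x = layer(x)
        return x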

Parameters:
image

Equation:
image
Both the parameters and the update equation follow the Recurrent Multi-Agent Proximal Policy Optimization (Recurrent-MAPPO) framework described in “The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games”.
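Since the screenshot above may not render everywhere, the clipped surrogate objective that (R)MAPPO optimizes for each agent’s actor has the standard PPO form below (reproduced in generic notation, with σ the entropy coefficient; this matches the actor loss used in the training code later in this report):

$$
L^{\text{actor}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\Big] + \sigma\,\mathbb{E}_t\big[S[\pi_\theta](o_t)\big],
\qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid o_t)}{\pi_{\theta_{\text{old}}}(a_t \mid o_t)}
$$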

3. Yukai Gu’s Approach:

My approach is similar to Pan’s: a recurrent multi-agent PPO (RMAPPO) model described in “The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games”, in which the actor and critic have separate LSTM networks. Following the Centralized Training with Decentralized Execution (CTDE) structure, the actor network conditions on each agent’s local observation, while the critic network conditions on both local and global observations.
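A minimal sketch of this separation is shown below, assuming 512-unit LSTMs to match the hidden-state shapes used later; it illustrates the structure rather than the exact project networks.

import torch
import torch.nn as nn

class RecurrentActor(nn.Module):
    # Actor: conditions only on the acting agent's local observation (decentralized execution).
    def __init__(self, obs_dim, action_dim, hidden=512):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.policy_head = nn.Linear(hidden, action_dim)

    def forward(self, local_obs, hidden_state):
        x = torch.relu(self.encoder(local_obs))   # (batch, seq, hidden)
        x, hidden_state = self.lstm(x, hidden_state)
        return torch.softmax(self.policy_head(x), dim=-1), hidden_state

class RecurrentCritic(nn.Module):
    # Critic: conditions on the global observation (centralized training).
    def __init__(self, global_obs_dim, hidden=512):
        super().__init__()
        self.encoder = nn.Linear(global_obs_dim, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.value_head = nn.Linear(hidden, 1)

    def forward(self, global_obs, hidden_state):
        x = torch.relu(self.encoder(global_obs))
        x, hidden_state = self.lstm(x, hidden_state)
        return self.value_head(x), hidden_state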

Environment used:

============ Full Implementation of RMAPPO ============

== Step 1: ==

image

model.orthogonalInitialization()
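model.orthogonalInitialization() is a project-specific helper; a hypothetical implementation in the spirit of the paper’s orthogonal initialization might look like the sketch below (the function name and gain value are assumptions).

import torch.nn as nn

def orthogonal_initialization(model, gain=2 ** 0.5):
    # Hypothetical stand-in for model.orthogonalInitialization():
    # orthogonal weights and zero biases for every Linear and LSTM layer.
    for module in model.modules():
        if isinstance(module, nn.Linear):
            nn.init.orthogonal_(module.weight, gain=gain)
            if module.bias is not None:
                nn.init.zeros_(module.bias)
        elif isinstance(module, nn.LSTM):
            for name, param in module.named_parameters():
                if 'weight' in name:
                    nn.init.orthogonal_(param)
                elif 'bias' in name:
                    nn.init.zeros_(param)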

== Step 2: ==

image

buffer = Buffer(numPlayers)
for episode in range(episodes):
  buffer.clear()

== Step 3: ==

image

hiddenStates = [[
            ((torch.zeros(1, 1, 512, device = useDevice), torch.zeros(1, 1, 512, device = useDevice)),
            (torch.zeros(1, 1, 512, device = useDevice), torch.zeros(1, 1, 512, device = useDevice))) for _ in range(numPlayers)
            ] for i in range(envNum)]

== Step 4: ==

image

# Rollout loop (sketch; per-environment observation and hidden-state setup omitted)
while not all(dones):
  for i in range(envNum):
    # Forward actor and critic
    actionProbabilities, newActorHiddenLayer = model.forwardActor(currentAgentObservationVectorized, actorHiddenLayer, device = useDevice)
    criticValue, newCriticHiddenLayer = model.forwardCritic(globalObservationVectorized, criticHiddenLayer, device = useDevice)
    # Update hidden states for the acting player
    hiddenStates[i][currentPlayerID] = (newActorHiddenLayer, newCriticHiddenLayer)
    # Sample an action from the policy distribution
    action = torch.multinomial(actionProbabilities[0, 0, :], num_samples = 1).item()
    # Step the environment and store the transition
    nextGlobalObservation, reward, done, info = envs[i].step(action)
    buffer.insert(...)  # observations, action, reward, value estimate, log-prob, hidden states

== Step 5: ==

image

image

image

for t in reversed(range(T)):
  nextValue = valuePredict[t + 1]
  delta = rewards[t] + gamma * nextValue - valuePredict[t]
  lastGAE = delta + gamma * lamda * lastGAE
  advantage[t] = lastGAE
  returns[t] = advantage[t] + valuePredict[t]

normalizeWithPopart()
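normalizeWithPopart() refers to value/return normalization in the style of PopArt. A heavily simplified sketch of the running-statistics part is shown below; the real PopArt also rescales the value head’s weights and bias when the statistics change, which is omitted here.

import torch

class PopArtNormalizer:
    # Simplified PopArt-style normalization: track running first/second moments
    # of the returns and standardize with them.
    def __init__(self, beta=0.999, eps=1e-5):
        self.mean, self.mean_sq = 0.0, 1.0
        self.beta, self.eps = beta, eps

    def update(self, returns: torch.Tensor):
        self.mean = self.beta * self.mean + (1 - self.beta) * returns.mean().item()
        self.mean_sq = self.beta * self.mean_sq + (1 - self.beta) * (returns ** 2).mean().item()

    def normalize(self, returns: torch.Tensor) -> torch.Tensor:
        std = max(self.mean_sq - self.mean ** 2, self.eps) ** 0.5
        return (returns - self.mean) / std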

== Step 6: ==

image

yield {
  "globalObservations": globalObservations,
  "currentAgentObservation": currentAgentObservation,
  "actorHidden": actorHidden,
  "actorCell": actorCell,
  "criticHidden": criticHidden,
  "criticCell": criticCell,
  "actions": actions,
  "rewards": rewards,
  "nextGlobalObservations": nextGlobalObservations,
  "valuePredict": valuePredict,
  "oldPolicyActionLogProbability": oldPolicyActionLogProbability,
  "advantage": advantage,
  "returns": returns
}

== Step 7: ==

image

image

image

dataBatch = buffer.sampleMiniBatch(self.numMiniBatch, chunkSize)
for sample in dataBatch:
  # Unpack the sampled tensors (observations, actions, old log-probs, advantages,
  # returns, stored value predictions used as oldValue, and hidden states).
  for _ in range(self.ppoEpoch):
    values, actionLogProbability, entropy, newActorHidden, newCriticHidden = self.policy.evaluateActions(currentAgentObservation, globalObservations, actions, actorHiddenStates, criticHiddenStates)

    # Calculate actor (policy) loss with the clipped surrogate objective
    importanceWeight = torch.exp(actionLogProbability - oldPolicyActionLogProbability)
    surrogate1 = importanceWeight * advantage
    surrogate2 = torch.clamp(importanceWeight, 1.0 - self.clipParam, 1.0 + self.clipParam) * advantage
    actorLoss = -torch.mean(torch.min(surrogate1, surrogate2)) - self.entropyCoefficient * entropy

    # Calculate critic (value) loss with value clipping; oldValue is the rollout's valuePredict
    predictValueClipped = oldValue + (values - oldValue).clamp(-self.clipParam, self.clipParam)
    clippedError = predictValueClipped - returns
    originalError = values - returns
    clippedValueLoss = torch.mean(clippedError ** 2)
    originalValueLoss = torch.mean(originalError ** 2)
    valueLoss = torch.max(originalValueLoss, clippedValueLoss)

    # Total loss and optimizer step
    totalLoss = actorLoss + self.valueLossCoefficient * valueLoss
    adamUpdate(totalLoss)  # zero_grad, backward, gradient clipping, Adam step

==========================================================

Advantage:

  1. RMAPPO uses Centralized Training with Decentralized Execution, which is well suited to multi-agent problems.
  2. RMAPPO differs from standard PPO in that it can use less sampled data and still outperform other methods.

Disadvantage:

  1. As a recurrent method, training is very slow. Depending on training progress and computation power, each episode on my M2 Max chip took anywhere from 30 seconds to 1 hour, and the recommended number of training episodes is 10 trillion (which would run forever).
  2. Sensitive to hyperparameter settings. As the paper indicates, with well-tuned hyperparameters it outperforms off-policy methods, but poorly chosen hyperparameters can significantly degrade performance.

4. Tia Tairan Wang’s Approach:

My approach uses the Advantage Actor-Critic (A2C) algorithm with parameter sharing for training collaborative Hanabi agents. Unlike the PPO-based methods, A2C offers computational efficiency while still providing stable policy improvement for partially observable environments like Hanabi.

Environment Implementation

I developed two key environment implementations for training my agents:

  1. Single-Agent Environment Wrapper (a minimal wrapper sketch follows this list)
    class SingleAgentHanabiEnv(gym.Env):
        """
        Single-agent Gym wrapper for multi-player Hanabi.
        - The RL agent controls seat 0.
        - Other seats are controlled by RandomAgent.
        - Uses classic Gym API for compatibility with Stable Baselines3.
        """
    
    • This environment allows a single RL agent to play Hanabi with random agents
    • Uses vectorized observations and legal move constraints
    • Simplified configuration (2 colors, 2 ranks) for faster initial learning
  2. Dual-Agent Environment with Parameter Sharing
    class DualAgentHanabiEnv(gym.Env):
        """
        Dual-agent Gym wrapper for Hanabi.
        This environment allows training a single agent to play from both positions.
        """
    
    • Enables training a single model to play from both player positions
    • Crucial for developing consistent strategies and eliminating coordination issues
    • Standardized observation handling across different player perspectives
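
For concreteness, a hedged sketch of the single-agent wrapper (item 1 above) is given below. It assumes the DeepMind rl_env API and the classic Gym interface; the preset used, the observation encoding, and the illegal-move fallback are illustrative assumptions rather than the exact project code.

import gym
import numpy as np
from gym import spaces
from hanabi_learning_environment import rl_env

class SingleAgentHanabiEnv(gym.Env):
    """Sketch: seat 0 is the RL agent; other seats play random legal moves."""

    def __init__(self, num_players=2):
        super().__init__()
        # The project used a further simplified config (2 colors, 2 ranks);
        # 'Hanabi-Small' is used here only as a stand-in preset.
        self._env = rl_env.make(environment_name='Hanabi-Small', num_players=num_players)
        self.action_space = spaces.Discrete(self._env.num_moves())
        obs_dim = self._env.vectorized_observation_shape()[0]
        self.observation_space = spaces.Box(0.0, 1.0, shape=(obs_dim,), dtype=np.float32)

    def _agent_obs(self, observations):
        return np.asarray(observations['player_observations'][0]['vectorized'], dtype=np.float32)

    def _play_others(self, observations):
        # Let RandomAgent-style opponents act until it is seat 0's turn or the game ends.
        done, reward_sum = False, 0.0
        while not done and observations['current_player'] != 0:
            seat = observations['current_player']
            move = np.random.choice(observations['player_observations'][seat]['legal_moves'])
            observations, reward, done, _ = self._env.step(move)
            reward_sum += reward
        return observations, reward_sum, done

    def reset(self):
        observations = self._env.reset()
        observations, _, _ = self._play_others(observations)
        self._observations = observations
        return self._agent_obs(observations)

    def step(self, action):
        # Constrain to legal moves: fall back to a random legal move if needed.
        legal_ints = self._observations['player_observations'][0]['legal_moves_as_int']
        if int(action) not in legal_ints:
            action = np.random.choice(legal_ints)
        observations, reward, done, info = self._env.step(int(action))
        if not done:
            observations, opponent_reward, done = self._play_others(observations)
            reward += opponent_reward
        self._observations = observations
        return self._agent_obs(observations), reward, done, info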

A2C Implementation Details

============ Mathematical Foundations ============

A2C combines policy-based and value-based learning through:

  1. Policy Network (Actor): Learns the policy π(a|s) directly
    • Updates using the policy gradient: ∇_θ J(θ) = 𝔼[∇_θ log π(a|s; θ) · A(s, a)]
    • Where A(s, a) is the advantage function
  2. Value Network (Critic): Estimates state values V(s)
    • Updates by minimizing: L(ϕ) = 𝔼[(V(s; ϕ) - R)²]
    • Where R is the expected return
  3. Advantage Estimation: Uses Generalized Advantage Estimation (GAE); see the sketch after this list
    • A(s_t, a_t) = δ_t + (γλ)δ_{t+1} + (γλ)²δ_{t+2} + …
    • Where δ_t = r_t + γV(s_{t+1}) - V(s_t) is the TD error
    • The parameter λ controls the bias-variance tradeoff
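
As a purely illustrative example, the GAE recursion above can be written as the function below. Stable-Baselines3 computes this internally (controlled by its gamma and gae_lambda arguments); λ = 0.95 here is an assumed value, while γ = 0.995 matches the setting used later.

import numpy as np

def gae(rewards, values, gamma=0.995, lam=0.95):
    # values has length T + 1 (the bootstrap value V(s_T) is appended at the end)
    T = len(rewards)
    advantages = np.zeros(T)
    last_gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error δ_t
        last_gae = delta + gamma * lam * last_gae                # A_t = δ_t + γλ A_{t+1}
        advantages[t] = last_gae
    returns = advantages + np.asarray(values[:-1])
    return advantages, returns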

============ Implementation Architecture ============

My A2C implementation includes:

# Neural network architecture
policy_kwargs = {"net_arch": [256, 256, 256, 256]}

# A2C model initialization with carefully tuned hyperparameters
model = A2C(
    policy="MlpPolicy",
    env=env,
    learning_rate=lr_schedule,  # Linear schedule from 3e-4 to 1e-5
    n_steps=256,                # Steps per update
    gamma=0.995,                # Discount factor for delayed rewards
    ent_coef=0.05,              # Entropy coefficient for exploration
    vf_coef=0.5,                # Value function loss coefficient
    max_grad_norm=0.5,          # Gradient clipping for stability
    policy_kwargs=policy_kwargs, # Deep network architecture
    tensorboard_log=log_dir,    # For performance tracking
)
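
A typical usage pattern (hypothetical file names and step count) would then be:

# Train, checkpoint, and later reload the shared model
model.learn(total_timesteps=1_000_000, tb_log_name="a2c_hanabi")
model.save("a2c_hanabi_shared")
model = A2C.load("a2c_hanabi_shared", env=env)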

Parameter Sharing: A Key Innovation

After analyzing my initial results with two separate A2C agents (one for each player position), I identified major coordination issues:

My solution was implementing parameter sharing - a technique where a single model learns to play from all player positions:

# Training loop with parameter sharing
for i in range(iterations):
    # Train as player 0
    model.set_env(env_a)
    model.learn(total_timesteps=timesteps_per_env, callback=callback_a)
    
    # Train as player 1
    model.set_env(env_b)
    model.learn(total_timesteps=timesteps_per_env, callback=callback_b)

This approach offers several benefits:

  1. Doubled effective sample size
  2. Consistent strategy development
  3. No issues with agents developing incompatible strategies
  4. Eliminated numerical instability problems

Advantages and Disadvantages

Advantages:

  1. Computational Efficiency: A2C requires less computational resources than PPO methods
  2. Stability: Parameter sharing eliminated the numerical issues found in multi-agent training
  3. Synchronous Updates: Unlike asynchronous methods, synchronous A2C provides more stable gradient updates
  4. Sample Efficiency: Shared parameters effectively double the training data per sample

Disadvantages:

  1. Potential Suboptimality: A2C may find suboptimal policies compared to PPO in some circumstances
  2. Hyperparameter Sensitivity: Performance depends significantly on proper tuning
  3. Fixed Update Intervals: Unlike PPO, A2C updates at fixed intervals rather than adaptive ones
  4. Exploration Challenges: Balancing exploration and exploitation requires careful entropy coefficient tuning

Training Optimizations

To address computational constraints, I implemented several optimizations:

  1. Simplified Game Configuration: Reduced colors (2), ranks (3), and hand size (3) for faster learning
  2. Normalized Observations: Enabled stable gradient updates
  3. Gradient Clipping: Set to 0.5 to prevent parameter explosion
  4. NaN Detection: Added explicit checks and corrections for numerical stability
  5. Custom Learning Rate Schedule: Starts higher (5e-4) and decreases over time to stabilize the final policy (sketched below)
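
A minimal sketch of such a schedule, assuming the Stable-Baselines3 convention that a callable learning rate receives progress_remaining going from 1.0 down to 0.0; the 5e-4 and 1e-5 endpoints follow the values mentioned in this report.

def linear_schedule(initial_lr=5e-4, final_lr=1e-5):
    # Stable-Baselines3 calls this with progress_remaining: 1.0 at the start, 0.0 at the end.
    def schedule(progress_remaining):
        return final_lr + progress_remaining * (initial_lr - final_lr)
    return schedule

lr_schedule = linear_schedule()  # passed to A2C(learning_rate=lr_schedule, ...)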

Evaluation

1. Pan’s Evaluation:

2. Yukai Gu’s Evaluation:

I used three learning environments and different stages of my algorithm construction.

Learning Environments:

  1. DeepMind Hanabi Learning Environment
  2. Simple customized learning environment based on DeepMind’s (Version 1)
  3. Customized learning environment based on DeepMind’s (Version 2)

Algorithms:

  1. PPO method from stable_baselines3
  2. A simple RPPO method
  3. RMAPPO method from the paper
  4. Refined RMAPPO method from the paper

Combination Results:

3. Tia’s Evaluation:

I evaluated my A2C approach through multiple experiments, comparing different training methods and tracking performance metrics in TensorBoard.

Initial Approach: Separate A2C Agents

My first implementation trained two separate A2C models (Agent A and Agent B) to play from different positions. While this approach showed some learning progress, it faced significant challenges:

Screenshot 2025-03-16 at 18 24 06

This resulted in modest but inconsistent final scores, with agents struggling to coordinate effectively.

Enhanced Approach: Parameter Sharing

After implementing parameter sharing (single model trained on both positions), the results improved dramatically:

Screenshot 2025-03-16 at 18 23 51

Performance Metrics:

Quantitative Performance Analysis

I conducted a comprehensive evaluation across different game configurations:

| Configuration | Average Score | Win Rate | Perfect Game Rate |
| --- | --- | --- | --- |
| 2 colors, 2 ranks | 5.7 / 6 | 87.5% | 72.0% |
| 2 colors, 3 ranks | 4.3 / 6 | 65.3% | 53.1% |
| 3 colors, 3 ranks | 5.8 / 9 | 42.7% | 35.5% |
| Full Game (5 colors, 5 ranks) | 13.2 / 25 | 8.5% | 5.2% |

Score Distribution Analysis:

Ablation Studies

To understand the impact of various components, I conducted ablation studies:

  1. Network Architecture:
    • Deeper networks (4-layer) outperformed shallow networks (2-layer)
    • Wider layers (256 units) performed better than narrower ones (64 units)
  2. Entropy Coefficient:
    • Higher entropy (0.05) led to better exploration and ultimate performance
    • Lower entropy (0.01) resulted in premature convergence to suboptimal policies
  3. Update Frequency:
    • Shorter n_steps (16) led to faster learning but more instability
    • Longer n_steps (256) produced more stable learning and better final policies

Visualizing Agent Behavior

Action Distribution: The parameter-sharing model developed a balanced strategy using all available action types:

This distribution shows the agent learned to use information tokens efficiently.

Hint Efficiency: A key metric for cooperative play is how effectively hints lead to successful plays:

Comparison to Other Approaches

While my A2C implementation doesn’t match the theoretical upper bounds of the RMAPPO approaches described by my teammates, it offers several practical advantages:

  1. Training Speed: Much faster convergence (hours vs. days/weeks)
  2. Stability: Consistent learning without numerical issues
  3. Sample Efficiency: Better performance with fewer environment interactions
  4. Resource Requirements: Lower computational demands

These trade-offs make A2C with parameter sharing an excellent practical choice for Hanabi, especially when computational resources are limited.

Future Improvements

Based on these results, I identify several promising directions for future work:

  1. Experience Replay: Adding a replay buffer could improve sample efficiency further
  2. Self-Play: Implementing full self-play (vs. parameter sharing) could lead to more diverse strategies
  3. Attention Mechanisms: Adding attention layers could help the agent better focus on relevant cards
  4. Curriculum Learning: Starting with simpler games and gradually increasing complexity
  5. Hybrid Approach: Combining the stability of A2C with PPO’s performance advantages

References

AI Tool Usage