rlcard.agents

rlcard.agents.cfr_agent

class rlcard.agents.cfr_agent.CFRAgent(env, model_path='./cfr_model')

Bases: object

Implements the CFR (chance sampling) algorithm

action_probs(obs, legal_actions, policy)

Obtain the action probabilities of the current state

Parameters:
  • obs (str) – state_str

  • legal_actions (list) – List of legal actions

  • policy (dict) – The used policy

Returns:
  • action_probs (numpy.array): The action probabilities

  • legal_actions (list): Indices of legal actions

Return type:

tuple

eval_step(state)

Given a state, predict the action based on the average policy

Parameters:

state (numpy.array) – State representation

Returns:
  • action (int): The predicted action

  • info (dict): A dictionary containing information

Return type:

tuple

get_state(player_id)

Get state_str of the player

Parameters:

player_id (int) – The player id

Returns:
  • state (str): The state string

  • legal_actions (list): Indices of legal actions

Return type:

tuple

load()

Load model

regret_matching(obs)

Apply regret matching

Parameters:

obs (string) – The state_str

save()

Save model

train()

Do one iteration of CFR

traverse_tree(probs, player_id)

Traverse the game tree, update the regrets

Parameters:
  • probs – The reach probability of the current node

  • player_id – The player to update the value

Returns:

The expected utilities for all the players

Return type:

state_utilities (list)

update_policy()

Update policy based on the current regrets
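
A minimal training-loop sketch (not taken verbatim from the RLCard source; the game and iteration counts are illustrative) showing how a CFRAgent is typically driven. CFR traverses the game tree, so the environment must be created with allow_step_back=True:

    import rlcard
    from rlcard.agents.cfr_agent import CFRAgent

    # CFR needs to step back through the game tree, so enable allow_step_back.
    env = rlcard.make('leduc-holdem', config={'allow_step_back': True})
    agent = CFRAgent(env, model_path='./cfr_model')

    for episode in range(1000):
        agent.train()            # one iteration of chance-sampling CFR
        if episode % 100 == 0:
            agent.save()         # persist the policy and regrets under model_path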

rlcard.agents.dqn_agent

DQN agent

The code is derived from https://github.com/dennybritz/reinforcement-learning/blob/master/DQN/dqn.py

Copyright (c) 2019 Matthew Judell Copyright (c) 2019 DATA Lab at Texas A&M University Copyright (c) 2016 Denny Britz

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

class rlcard.agents.dqn_agent.DQNAgent(replay_memory_size=20000, replay_memory_init_size=100, update_target_estimator_every=1000, discount_factor=0.99, epsilon_start=1.0, epsilon_end=0.1, epsilon_decay_steps=20000, batch_size=32, num_actions=2, state_shape=None, train_every=1, mlp_layers=None, learning_rate=5e-05, device=None, save_path=None, save_every=inf)

Bases: object

Approximate clone of the original TensorFlow rlcard.agents.dqn_agent.DQNAgent; this implementation depends on PyTorch instead of TensorFlow.

checkpoint_attributes()

Return the current checkpoint attributes (dict). Checkpoint attributes are used to save and restore the model in the middle of training; they include the model state dict, the optimizer state dict, and all other instance variables.

eval_step(state)

Predict the action for evaluation purposes.

Parameters:

state (numpy.array) – current state

Returns:
  • action (int): An action id

  • info (dict): A dictionary containing information

Return type:

tuple

feed(ts)

Store data into the replay buffer and train the agent. There are two stages: in stage 1, populate the memory without training; in stage 2, train the agent every few timesteps.

Parameters:

ts (list) – a list of 5 elements that represent the transition

feed_memory(state, action, reward, next_state, legal_actions, done)

Feed transition to memory

Parameters:
  • state (numpy.array) – the current state

  • action (int) – the performed action ID

  • reward (float) – the reward received

  • next_state (numpy.array) – the next state after performing the action

  • legal_actions (list) – the legal actions of the next state

  • done (boolean) – whether the episode is finished

classmethod from_checkpoint(checkpoint)

Restore the model from a checkpoint

Parameters:

checkpoint (dict) – the checkpoint attributes generated by checkpoint_attributes()

predict(state)

Predict the masked Q-values

Parameters:

state (numpy.array) – current state

Returns:

a 1-d array where each entry represents a Q value

Return type:

q_values (numpy.array)

save_checkpoint(path, filename='checkpoint_dqn.pt')

Save the model checkpoint (all attributes)

Parameters:
  • path (str) – the path to save the model

  • filename (str) – the checkpoint file name (default: 'checkpoint_dqn.pt')

set_device(device)
step(state)

Predict the action for generating training data, with the predictions disconnected from the computation graph.

Parameters:

state (numpy.array) – current state

Returns:

an action id

Return type:

action (int)

train()

Train the network

Returns:

The loss of the current batch.

Return type:

loss (float)
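
A minimal training-loop sketch for DQNAgent, mirroring the pattern used in RLCard's example scripts (the game, layer sizes, and episode count here are illustrative):

    import rlcard
    from rlcard.agents.dqn_agent import DQNAgent
    from rlcard.utils import reorganize

    env = rlcard.make('blackjack')
    agent = DQNAgent(num_actions=env.num_actions,
                     state_shape=env.state_shape[0],
                     mlp_layers=[64, 64])
    env.set_agents([agent])

    for episode in range(100):
        trajectories, payoffs = env.run(is_training=True)
        trajectories = reorganize(trajectories, payoffs)
        for ts in trajectories[0]:
            agent.feed(ts)       # stage 1 fills the memory; stage 2 also trains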

class rlcard.agents.dqn_agent.Estimator(num_actions=2, learning_rate=0.001, state_shape=None, mlp_layers=None, device=None)

Bases: object

Approximate clone of rlcard.agents.dqn_agent.Estimator that uses PyTorch instead of Tensorflow. All methods input/output np.ndarray.

Q-Value Estimator neural network. This network is used for both the Q-Network and the Target Network.

checkpoint_attributes()

Return the attributes needed to restore the model from a checkpoint

classmethod from_checkpoint(checkpoint)

Restore the model from a checkpoint

predict_nograd(s)

Predicts action values, with the prediction excluded from the computation graph. It is used to predict the optimal next actions in the Double DQN algorithm.

Parameters:

s (np.ndarray) – (batch, state_len)

Returns:

np.ndarray of shape (batch_size, NUM_VALID_ACTIONS) containing the estimated action values.

update(s, a, y)

Updates the estimator towards the given targets. In this case y is the target-network estimate of the value of the Q-network's optimal actions, labeled y in Algorithm 1 of Mnih et al. (2015).

Parameters:
  • s (np.ndarray) – (batch, state_shape) state representation

  • a (np.ndarray) – (batch,) integer sampled actions

  • y (np.ndarray) – (batch,) value of optimal actions according to Q-target

Returns:

The calculated loss on the batch.

class rlcard.agents.dqn_agent.EstimatorNetwork(num_actions=2, state_shape=None, mlp_layers=None)

Bases: Module

The function approximation network for Estimator. It is just a series of tanh layers. All inputs and outputs are torch.Tensor.

forward(s)

Predict action values

Parameters:

s (Tensor) – (batch, state_shape)
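
A small sketch of calling the network (the 4-dimensional state and layer sizes are made up for illustration):

    import torch
    from rlcard.agents.dqn_agent import EstimatorNetwork

    net = EstimatorNetwork(num_actions=2, state_shape=[4], mlp_layers=[16, 16])
    s = torch.rand(8, 4)      # (batch, state_shape)
    q_values = net(s)         # call the module instance, not forward(), so hooks run
    print(q_values.shape)     # torch.Size([8, 2])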

class rlcard.agents.dqn_agent.Memory(memory_size, batch_size)

Bases: object

Memory for saving transitions

checkpoint_attributes()

Returns the attributes that need to be checkpointed

classmethod from_checkpoint(checkpoint)

Restores the attributes from the checkpoint

Parameters:

checkpoint (dict) – the checkpoint dictionary

Returns:

the restored instance

Return type:

instance (Memory)

sample()

Sample a minibatch from the replay memory

Returns:
  • state_batch (list): a batch of states

  • action_batch (list): a batch of actions

  • reward_batch (list): a batch of rewards

  • next_state_batch (list): a batch of next states

  • done_batch (list): a batch of dones

Return type:

tuple

save(state, action, reward, next_state, legal_actions, done)

Save transition into memory

Parameters:
  • state (numpy.array) – the current state

  • action (int) – the performed action ID

  • reward (float) – the reward received

  • next_state (numpy.array) – the next state after performing the action

  • legal_actions (list) – the legal actions of the next state

  • done (boolean) – whether the episode is finished
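
A small usage sketch of the replay memory (the state dimension and sizes are illustrative):

    import numpy as np
    from rlcard.agents.dqn_agent import Memory

    memory = Memory(memory_size=1000, batch_size=4)

    # Store a few hypothetical transitions with 4-dimensional states.
    for _ in range(10):
        memory.save(np.random.rand(4), action=0, reward=1.0,
                    next_state=np.random.rand(4), legal_actions=[0, 1], done=False)

    batch = memory.sample()   # a tuple of batches, one per Transition field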

class rlcard.agents.dqn_agent.Transition(state, action, reward, next_state, done, legal_actions)

Bases: tuple

action

Alias for field number 1

done

Alias for field number 4

legal_actions

Alias for field number 5

next_state

Alias for field number 3

reward

Alias for field number 2

state

Alias for field number 0
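
Transition is a namedtuple, so fields can be read by name or by the field numbers listed above; for example:

    from rlcard.agents.dqn_agent import Transition

    t = Transition(state=[0.1, 0.2], action=1, reward=0.5,
                   next_state=[0.3, 0.4], done=False, legal_actions=[0, 1])
    assert t.action == t[1] and t.done == t[4]   # name access matches the field numbers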

rlcard.agents.nfsp_agent

Neural Fictitious Self-Play (NFSP) agent implemented in PyTorch.

See the paper https://arxiv.org/abs/1603.01121 for more details.

class rlcard.agents.nfsp_agent.AveragePolicyNetwork(num_actions=2, state_shape=None, mlp_layers=None)

Bases: Module

Approximates the history of action probabilities given state (average policy). Forward pass returns log probabilities of actions.

checkpoint_attributes()

Return the current checkpoint attributes (dict). Checkpoint attributes are used to save and restore the model in the middle of training.

forward(s)

Log action probabilities of each action from state

Parameters:

s (Tensor) – (batch, state_shape) state tensor

Returns:

(batch, num_actions)

Return type:

log_action_probs (Tensor)

classmethod from_checkpoint(checkpoint)

Restore the model from a checkpoint

Parameters:

checkpoint (dict) – the checkpoint attributes generated by checkpoint_attributes()
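
A small sketch of running a forward pass (the 36-dimensional state, four actions, and layer sizes are illustrative):

    import torch
    from rlcard.agents.nfsp_agent import AveragePolicyNetwork

    net = AveragePolicyNetwork(num_actions=4, state_shape=[36], mlp_layers=[64, 64])
    s = torch.rand(8, 36)       # (batch, state_shape)
    log_probs = net(s)          # (batch, num_actions) log probabilities
    probs = log_probs.exp()     # convert to probabilities if needed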

class rlcard.agents.nfsp_agent.NFSPAgent(num_actions=4, state_shape=None, hidden_layers_sizes=None, reservoir_buffer_capacity=20000, anticipatory_param=0.1, batch_size=256, train_every=1, rl_learning_rate=0.1, sl_learning_rate=0.005, min_buffer_size_to_learn=100, q_replay_memory_size=20000, q_replay_memory_init_size=100, q_update_target_estimator_every=1000, q_discount_factor=0.99, q_epsilon_start=0.06, q_epsilon_end=0, q_epsilon_decay_steps=1000000, q_batch_size=32, q_train_every=1, q_mlp_layers=None, evaluate_with='average_policy', device=None, save_path=None, save_every=inf)

Bases: object

An approximate clone of the original TensorFlow rlcard.agents.nfsp_agent that uses PyTorch instead of TensorFlow. Note that this implementation differs from Heinrich and Silver (2016) in that the supervised training minimizes cross-entropy with respect to the stored action probabilities rather than the realized actions.

checkpoint_attributes()

Return the current checkpoint attributes (dict). Checkpoint attributes are used to save and restore the model in the middle of training; they include the model state dict, the optimizer state dict, and all other instance variables.

eval_step(state)

Use the average policy for evaluation purposes

Parameters:

state (dict) – The current state.

Returns:
  • action (int): An action id

  • info (dict): A dictionary containing information

Return type:

tuple

feed(ts)

Feed data to inner RL agent

Parameters:

ts (list) – A list of 5 elements that represent the transition.

classmethod from_checkpoint(checkpoint)

Restore the model from a checkpoint

Parameters:

checkpoint (dict) – the checkpoint attributes generated by checkpoint_attributes()

sample_episode_policy()

Sample average/best_response policy

save_checkpoint(path, filename='checkpoint_nfsp.pt')

Save the model checkpoint (all attributes)

Parameters:

path (str) – the path to save the model

set_device(device)
step(state)

Returns the action to be taken.

Parameters:

state (dict) – The current state

Returns:

An action id

Return type:

action (int)

train_sl()

Compute the loss on sampled transitions and perform an average-network update.

If there are not enough elements in the buffer, no loss is computed and None is returned instead.

Returns:

The average loss obtained on this batch of transitions or None.

Return type:

loss (float)
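
A minimal training-loop sketch for NFSPAgent (the game, layer sizes, and episode count are illustrative). Note that sample_episode_policy() is called once per episode to choose between the average and best-response policies, as in RLCard's example scripts:

    import rlcard
    from rlcard.agents.nfsp_agent import NFSPAgent
    from rlcard.utils import reorganize

    env = rlcard.make('leduc-holdem')
    agents = [NFSPAgent(num_actions=env.num_actions,
                        state_shape=env.state_shape[i],
                        hidden_layers_sizes=[64, 64],
                        q_mlp_layers=[64, 64])
              for i in range(env.num_players)]
    env.set_agents(agents)

    for episode in range(100):
        for agent in agents:
            agent.sample_episode_policy()    # pick average vs. best-response policy
        trajectories, payoffs = env.run(is_training=True)
        trajectories = reorganize(trajectories, payoffs)
        for pid, agent in enumerate(agents):
            for ts in trajectories[pid]:
                agent.feed(ts)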

class rlcard.agents.nfsp_agent.ReservoirBuffer(reservoir_buffer_capacity)

Bases: object

Allows uniform sampling over a stream of data.

This class supports the storage of arbitrary elements, such as observation tensors, integer actions, etc.

See https://en.wikipedia.org/wiki/Reservoir_sampling for more details.

add(element)

Potentially adds element to the reservoir buffer.

Parameters:

element (object) – data to be added to the reservoir buffer.

checkpoint_attributes()
clear()

Clear the buffer

classmethod from_checkpoint(checkpoint)
sample(num_samples)

Returns num_samples elements uniformly sampled from the buffer.

Parameters:

num_samples (int) – The number of samples to draw.

Returns:

An iterable over num_samples random elements of the buffer.

Raises:

ValueError – If there are fewer than num_samples elements in the buffer
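
A small usage sketch (the element type and sizes are illustrative):

    from rlcard.agents.nfsp_agent import ReservoirBuffer

    buf = ReservoirBuffer(reservoir_buffer_capacity=100)
    for i in range(1000):
        buf.add(i)              # each element survives with equal probability
    samples = buf.sample(32)    # raises ValueError if fewer than 32 elements are stored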

class rlcard.agents.nfsp_agent.Transition(info_state, action_probs)

Bases: tuple

action_probs

Alias for field number 1

info_state

Alias for field number 0

rlcard.agents.pettingzoo_agents

class rlcard.agents.pettingzoo_agents.DQNAgentPettingZoo(replay_memory_size=20000, replay_memory_init_size=100, update_target_estimator_every=1000, discount_factor=0.99, epsilon_start=1.0, epsilon_end=0.1, epsilon_decay_steps=20000, batch_size=32, num_actions=2, state_shape=None, train_every=1, mlp_layers=None, learning_rate=5e-05, device=None, save_path=None, save_every=inf)

Bases: DQNAgent

eval_step(state)

Predict the action for evaluation purposes.

Parameters:

state (numpy.array) – current state

Returns:
  • action (int): An action id

  • info (dict): A dictionary containing information

Return type:

tuple

feed(ts)

Store data into the replay buffer and train the agent. There are two stages: in stage 1, populate the memory without training; in stage 2, train the agent every few timesteps.

Parameters:

ts (list) – a list of 5 elements that represent the transition

step(state)

Predict the action for generating training data, with the predictions disconnected from the computation graph.

Parameters:

state (numpy.array) – current state

Returns:

an action id

Return type:

action (int)

class rlcard.agents.pettingzoo_agents.NFSPAgentPettingZoo(num_actions=4, state_shape=None, hidden_layers_sizes=None, reservoir_buffer_capacity=20000, anticipatory_param=0.1, batch_size=256, train_every=1, rl_learning_rate=0.1, sl_learning_rate=0.005, min_buffer_size_to_learn=100, q_replay_memory_size=20000, q_replay_memory_init_size=100, q_update_target_estimator_every=1000, q_discount_factor=0.99, q_epsilon_start=0.06, q_epsilon_end=0, q_epsilon_decay_steps=1000000, q_batch_size=32, q_train_every=1, q_mlp_layers=None, evaluate_with='average_policy', device=None, save_path=None, save_every=inf)

Bases: NFSPAgent

eval_step(state)

Use the average policy for evaluation purposes

Parameters:

state (dict) – The current state.

Returns:
  • action (int): An action id

  • info (dict): A dictionary containing information

Return type:

tuple

feed(ts)

Feed data to inner RL agent

Parameters:

ts (list) – A list of 5 elements that represent the transition.

step(state)

Returns the action to be taken.

Parameters:

state (dict) – The current state

Returns:

An action id

Return type:

action (int)

class rlcard.agents.pettingzoo_agents.RandomAgentPettingZoo(num_actions)

Bases: RandomAgent

eval_step(state)

Predict the action given the current state for evaluation. Since random agents are not trained, this function is equivalent to the step function.

Parameters:

state (dict) – A dictionary that represents the current state

Returns:
  • action (int): The action predicted (randomly chosen) by the random agent

  • probs (list): The list of action probabilities

Return type:

tuple

step(state)

Predict the action given the current state when generating training data.

Parameters:

state (dict) – A dictionary that represents the current state

Returns:

The action predicted (randomly chosen) by the random agent

Return type:

action (int)
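
A rough evaluation sketch for the PettingZoo wrappers, assuming PettingZoo's classic Leduc Hold'em environment (pettingzoo.classic.leduc_holdem_v4) and its AEC loop; the exact signature of env.last() and the reset/seeding API vary between PettingZoo versions, so treat this as illustrative only:

    from pettingzoo.classic import leduc_holdem_v4
    from rlcard.agents.pettingzoo_agents import RandomAgentPettingZoo

    env = leduc_holdem_v4.env()
    env.reset()
    agent = RandomAgentPettingZoo(num_actions=4)   # Leduc Hold'em has 4 actions

    for agent_name in env.agent_iter():
        obs, reward, termination, truncation, info = env.last()
        if termination or truncation:
            env.step(None)                         # dead step for finished agents
            continue
        action, _ = agent.eval_step(obs)           # obs carries 'observation'/'action_mask'
        env.step(action)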

rlcard.agents.random_agent

class rlcard.agents.random_agent.RandomAgent(num_actions)

Bases: object

A random agent. Random agents are for running toy examples on the card games.

eval_step(state)

Predict the action given the current state for evaluation. Since random agents are not trained, this function is equivalent to the step function.

Parameters:

state (dict) – A dictionary that represents the current state

Returns:
  • action (int): The action predicted (randomly chosen) by the random agent

  • probs (list): The list of action probabilities

Return type:

tuple

static step(state)

Predict the action given the current state when generating training data.

Parameters:

state (dict) – A dictionary that represents the current state

Returns:

The action predicted (randomly chosen) by the random agent

Return type:

action (int)
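
A minimal sketch of running a toy game with random agents (the game name is illustrative):

    import rlcard
    from rlcard.agents.random_agent import RandomAgent

    env = rlcard.make('uno')
    env.set_agents([RandomAgent(num_actions=env.num_actions)
                    for _ in range(env.num_players)])

    trajectories, payoffs = env.run(is_training=False)
    print(payoffs)   # per-player payoffs of one randomly played game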

rlcard.agents.dmc_agent.file_writer

class rlcard.agents.dmc_agent.file_writer.FileWriter(xpid: str | None = None, xp_args: dict | None = None, rootdir: str = '~/palaas')

Bases: object

close(successful: bool = True) → None
log(to_log: Dict, tick: int | None = None, verbose: bool = False) → None
rlcard.agents.dmc_agent.file_writer.gather_metadata() → Dict

rlcard.agents.dmc_agent.model

class rlcard.agents.dmc_agent.model.DMCAgent(state_shape, action_shape, mlp_layers=[512, 512, 512, 512, 512], exp_epsilon=0.01, device='0')

Bases: object

eval()
eval_step(state)
forward(obs, actions)
load_state_dict(state_dict)
parameters()
predict(state)
set_device(device)
share_memory()
state_dict()
step(state)
class rlcard.agents.dmc_agent.model.DMCModel(state_shape, action_shape, mlp_layers=[512, 512, 512, 512, 512], exp_epsilon=0.01, device=0)

Bases: object

eval()
get_agent(index)
get_agents()
parameters(index)
share_memory()
class rlcard.agents.dmc_agent.model.DMCNet(state_shape, action_shape, mlp_layers=[512, 512, 512, 512, 512])

Bases: Module

forward(obs, actions)

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
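
A small sketch of the calling convention described in the note above (the observation and action dimensions are made up; in practice they come from the environment's state_shape and action_shape):

    import torch
    from rlcard.agents.dmc_agent.model import DMCNet

    net = DMCNet(state_shape=[6], action_shape=[4], mlp_layers=[64, 64])
    obs = torch.rand(8, 6)       # (batch, *state_shape)
    actions = torch.rand(8, 4)   # (batch, *action_shape), e.g. encoded candidate actions
    values = net(obs, actions)   # call the instance so registered hooks run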

rlcard.agents.dmc_agent.pettingzoo_model

class rlcard.agents.dmc_agent.pettingzoo_model.DMCAgentPettingZoo(state_shape, action_shape, mlp_layers=[512, 512, 512, 512, 512], exp_epsilon=0.01, device='0')

Bases: DMCAgent

eval_step(state)
feed(ts)
step(state)
class rlcard.agents.dmc_agent.pettingzoo_model.DMCModelPettingZoo(env, mlp_layers=[512, 512, 512, 512, 512], exp_epsilon=0.01, device='0')

Bases: object

eval()
get_agent(index)
get_agents()
parameters(index)
share_memory()

rlcard.agents.dmc_agent.pettingzoo_utils

rlcard.agents.dmc_agent.pettingzoo_utils.act_pettingzoo(i, device, T, free_queue, full_queue, model, buffers, env)
rlcard.agents.dmc_agent.pettingzoo_utils.create_buffers_pettingzoo(T, num_buffers, env, device_iterator)

rlcard.agents.dmc_agent.trainer

class rlcard.agents.dmc_agent.trainer.DMCTrainer(env, cuda='', is_pettingzoo_env=False, load_model=False, xpid='dmc', save_interval=30, num_actor_devices=1, num_actors=5, training_device='0', savedir='experiments/dmc_result', total_frames=100000000000, exp_epsilon=0.01, batch_size=32, unroll_length=100, num_buffers=50, num_threads=4, max_grad_norm=40, learning_rate=0.0001, alpha=0.99, momentum=0, epsilon=1e-05)

Bases: object

Deep Monte-Carlo

Parameters:
  • env – RLCard environment

  • load_model (boolean) – Whether to load an existing model

  • xpid (string) – Experiment id (default: dmc)

  • save_interval (int) – Time interval (in minutes) at which to save the model

  • num_actor_devices (int) – The number of devices used for simulation

  • num_actors (int) – Number of actors for each simulation device

  • training_device (str) – The index of the GPU used for training models, or cpu.

  • savedir (string) – Root dir where experiment data will be saved

  • total_frames (int) – Total environment frames to train for

  • exp_epsilon (float) – The probability for exploration

  • batch_size (int) – Learner batch size

  • unroll_length (int) – The unroll length (time dimension)

  • num_buffers (int) – Number of shared-memory buffers

  • num_threads (int) – Number of learner threads

  • max_grad_norm (int) – Max norm of gradients

  • learning_rate (float) – Learning rate

  • alpha (float) – RMSProp smoothing constant

  • momentum (float) – RMSProp momentum

  • epsilon (float) – RMSProp epsilon

start()
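
A minimal launch sketch (the game, experiment id, and hyperparameters are illustrative; training_device='cpu' assumes no GPU is used, as the parameter description above allows):

    import rlcard
    from rlcard.agents.dmc_agent import DMCTrainer

    if __name__ == '__main__':   # DMC spawns actor processes, so guard the entry point
        env = rlcard.make('doudizhu')
        trainer = DMCTrainer(env,
                             xpid='doudizhu_dmc',
                             savedir='experiments/dmc_result',
                             save_interval=30,
                             num_actor_devices=1,
                             num_actors=5,
                             training_device='cpu')
        trainer.start()          # runs the actor processes and learner threads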
rlcard.agents.dmc_agent.trainer.compute_loss(logits, targets)
rlcard.agents.dmc_agent.trainer.learn(position, actor_models, agent, batch, optimizer, training_device, max_grad_norm, mean_episode_return_buf, lock)

Performs a learning (optimization) step.

rlcard.agents.dmc_agent.utils

rlcard.agents.dmc_agent.utils.act(i, device, T, free_queue, full_queue, model, buffers, env)
rlcard.agents.dmc_agent.utils.create_buffers(T, num_buffers, state_shape, action_shape, device_iterator)
rlcard.agents.dmc_agent.utils.create_optimizers(num_players, learning_rate, momentum, epsilon, alpha, learner_model)
rlcard.agents.dmc_agent.utils.get_batch(free_queue, full_queue, buffers, batch_size, lock)

rlcard.agents.human_agents.blackjack_human_agent

class rlcard.agents.human_agents.blackjack_human_agent.HumanAgent(num_actions)

Bases: object

A human agent for Blackjack. It can be used to play alone to understand how the Blackjack code runs.

eval_step(state)

Predict the action given the current state for evaluation. The same as step here.

Parameters:

state (numpy.array) – a numpy array that represents the current state

Returns:

the action decided by the human

Return type:

action (int)

static step(state)

Human agent will display the state and make decisions through interfaces

Parameters:

state (dict) – A dictionary that represents the current state

Returns:

The action decided by human

Return type:

action (int)
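
A minimal interactive sketch (not taken verbatim from the RLCard human-play examples): one human player in a single-player Blackjack game, where step() prompts for input on the command line:

    import rlcard
    from rlcard.agents.human_agents.blackjack_human_agent import HumanAgent

    env = rlcard.make('blackjack')
    env.set_agents([HumanAgent(num_actions=env.num_actions)])

    trajectories, payoffs = env.run(is_training=False)
    print('Payoff:', payoffs[0])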

rlcard.agents.human_agents.leduc_holdem_human_agent

class rlcard.agents.human_agents.leduc_holdem_human_agent.HumanAgent(num_actions)

Bases: object

A human agent for Leduc Holdem. It can be used to play against trained models

eval_step(state)

Predict the action given the current state for evaluation. The same as step here.

Parameters:

state (numpy.array) – a numpy array that represents the current state

Returns:

the action decided by the human

Return type:

action (int)

static step(state)

Human agent will display the state and make decisions through interfaces

Parameters:

state (dict) – A dictionary that represents the current state

Returns:

The action decided by human

Return type:

action (int)

rlcard.agents.human_agents.limit_holdem_human_agent

class rlcard.agents.human_agents.limit_holdem_human_agent.HumanAgent(num_actions)

Bases: object

A human agent for Limit Holdem. It can be used to play against trained models

eval_step(state)

Predict the action given the current state for evaluation. The same as step here.

Parameters:

state (numpy.array) – a numpy array that represents the current state

Returns:

the action decided by the human

Return type:

action (int)

static step(state)

Human agent will display the state and make decisions through interfaces

Parameters:

state (dict) – A dictionary that represents the current state

Returns:

The action decided by human

Return type:

action (int)

rlcard.agents.human_agents.nolimit_holdem_human_agent

class rlcard.agents.human_agents.nolimit_holdem_human_agent.HumanAgent(num_actions)

Bases: object

A human agent for No Limit Holdem. It can be used to play against trained models

eval_step(state)

Predict the action given the current state for evaluation. The same as step here.

Parameters:

state (numpy.array) – a numpy array that represents the current state

Returns:

the action decided by the human

Return type:

action (int)

static step(state)

Human agent will display the state and make decisions through interfaces

Parameters:

state (dict) – A dictionary that represents the current state

Returns:

The action decided by human

Return type:

action (int)

rlcard.agents.human_agents.uno_human_agent

class rlcard.agents.human_agents.uno_human_agent.HumanAgent(num_actions)

Bases: object

A human agent for UNO. It can be used to play against trained models

eval_step(state)

Predict the action given the current state for evaluation. The same as step here.

Parameters:

state (numpy.array) – a numpy array that represents the current state

Returns:

the action decided by the human

Return type:

action (int)

static step(state)

Human agent will display the state and make decisions through interfaces

Parameters:

state (dict) – A dictionary that represents the current state

Returns:

The action decided by human

Return type:

action (int)