rlcard.agents

Subpackages

rlcard.agents.cfr_agent

class rlcard.agents.cfr_agent.CFRAgent(env, model_path='./cfr_model')

Bases: object

Implements the CFR (chance sampling) algorithm

action_probs(obs, legal_actions, policy)

Obtain the action probabilities of the current state

Parameters
  • obs (str) – state_str

  • legal_actions (list) – List of legal actions

  • policy (dict) – The policy to use

Returns

action_probs (numpy.array): The action probabilities
legal_actions (list): Indices of legal actions

Return type

(tuple) that contains
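As a rough sketch of this lookup-with-fallback behavior (plain Python with hypothetical names, not the actual RLCard implementation): states missing from the policy table fall back to a uniform distribution over the legal actions, and stored distributions are masked to the legal actions and renormalized.

```python
def action_probs(obs, legal_actions, policy, num_actions=4):
    # Look up the state string; unseen states fall back to uniform over legal actions.
    if obs in policy:
        probs = list(policy[obs])
    else:
        probs = [0.0] * num_actions
        for a in legal_actions:
            probs[a] = 1.0 / len(legal_actions)
    # Zero out illegal actions and renormalize so the distribution sums to 1.
    masked = [p if a in legal_actions else 0.0 for a, p in enumerate(probs)]
    total = sum(masked)
    if total > 0:
        masked = [p / total for p in masked]
    else:
        for a in legal_actions:
            masked[a] = 1.0 / len(legal_actions)
    return masked
```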

eval_step(state)

Given a state, predict action based on average policy

Parameters

state (numpy.array) – State representation

Returns

Predicted action
info (dict): A dictionary containing information

Return type

action (int)

get_state(player_id)

Get state_str of the player

Parameters

player_id (int) – The player id

Returns

state (str): The state str
legal_actions (list): Indices of legal actions

Return type

(tuple) that contains

load()

Load model

regret_matching(obs)

Apply regret matching

Parameters

obs (string) – The state_str
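The regret-matching step itself is small enough to sketch in isolation: keep only the positive cumulative regrets and normalize them into a strategy, falling back to uniform play when no action has positive regret. This is an illustrative sketch, not the RLCard method signature (which operates on a state string).

```python
def regret_matching(regrets):
    # regrets: cumulative regret per action. Positive regrets become
    # the (unnormalized) probability of playing that action.
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    if total > 0:
        return [r / total for r in positive]
    # No positive regret anywhere: play uniformly at random.
    n = len(regrets)
    return [1.0 / n] * n
```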

save()

Save model

train()

Do one iteration of CFR

traverse_tree(probs, player_id)

Traverse the game tree, update the regrets

Parameters
  • probs – The reach probability of the current node

  • player_id – The player to update the value

Returns

The expected utilities for all the players

Return type

state_utilities (list)

update_policy()

Update policy based on the current regrets

rlcard.agents.dqn_agent

DQN agent

The code is derived from https://github.com/dennybritz/reinforcement-learning/blob/master/DQN/dqn.py

Copyright (c) 2019 Matthew Judell Copyright (c) 2019 DATA Lab at Texas A&M University Copyright (c) 2016 Denny Britz

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

class rlcard.agents.dqn_agent.DQNAgent(replay_memory_size=20000, replay_memory_init_size=100, update_target_estimator_every=1000, discount_factor=0.99, epsilon_start=1.0, epsilon_end=0.1, epsilon_decay_steps=20000, batch_size=32, num_actions=2, state_shape=None, train_every=1, mlp_layers=None, learning_rate=5e-05, device=None)

Bases: object

Approximate clone of rlcard.agents.dqn_agent.DQNAgent that depends on PyTorch instead of TensorFlow

eval_step(state)

Predict the action for evaluation purpose.

Parameters

state (numpy.array) – current state

Returns

an action id
info (dict): A dictionary containing information

Return type

action (int)

feed(ts)
Store data into the replay buffer and train the agent. There are two stages.

In stage 1, the memory is populated without training. In stage 2, the agent is trained every several timesteps.

Parameters

ts (list) – a list of 5 elements that represent the transition
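The two-stage logic can be sketched as follows (hypothetical class and attribute names, a simplification of the real agent): training only begins once the memory holds `init_size` transitions, and then fires every `train_every` timesteps.

```python
class TwoStageFeed:
    """Illustrative sketch of the two-stage feed logic, not the real DQNAgent:
    stage 1 fills the replay memory; stage 2 trains every `train_every` steps."""

    def __init__(self, init_size=3, train_every=2):
        self.memory = []
        self.init_size = init_size
        self.train_every = train_every
        self.total_t = 0       # timesteps seen so far
        self.train_calls = 0   # how many times train() would have run

    def feed(self, ts):
        # ts is the 5-element transition, e.g. (state, action, reward, next_state, done)
        self.memory.append(tuple(ts))
        self.total_t += 1
        elapsed = self.total_t - self.init_size
        # Stage 2: once the memory is seeded, train every `train_every` timesteps.
        if elapsed >= 0 and elapsed % self.train_every == 0:
            self.train_calls += 1
```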

feed_memory(state, action, reward, next_state, legal_actions, done)

Feed transition to memory

Parameters
  • state (numpy.array) – the current state

  • action (int) – the performed action ID

  • reward (float) – the reward received

  • next_state (numpy.array) – the next state after performing the action

  • legal_actions (list) – the legal actions of the next state

  • done (boolean) – whether the episode is finished

predict(state)

Predict the masked Q-values

Parameters

state (numpy.array) – current state

Returns

a 1-d array where each entry represents a Q value

Return type

q_values (numpy.array)
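"Masked" here means that illegal actions are excluded from consideration. One common way to do this, shown as a plain-Python sketch rather than the library's implementation, is to set their Q-values to negative infinity so an argmax can never select them:

```python
def masked_q_values(q_values, legal_actions):
    # Illegal actions get -inf so that argmax can never select them.
    return [q if a in legal_actions else float('-inf')
            for a, q in enumerate(q_values)]
```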

set_device(device)
step(state)
Predict the action for generating training data, with the predictions disconnected from the computation graph

Parameters

state (numpy.array) – current state

Returns

an action id

Return type

action (int)

train()

Train the network

Returns

The loss of the current batch.

Return type

loss (float)

class rlcard.agents.dqn_agent.Estimator(num_actions=2, learning_rate=0.001, state_shape=None, mlp_layers=None, device=None)

Bases: object

Approximate clone of rlcard.agents.dqn_agent.Estimator that uses PyTorch instead of TensorFlow. All methods take and return np.ndarray.

Q-Value Estimator neural network. This network is used for both the Q-Network and the Target Network.

predict_nograd(s)
Predicts action values with the prediction excluded from the computation graph. It is used to predict the optimal next actions in the Double-DQN algorithm.

Parameters

s (np.ndarray) – (batch, state_len)

Returns

np.ndarray of shape (batch_size, NUM_VALID_ACTIONS) containing the estimated action values.

update(s, a, y)
Updates the estimator towards the given targets.

In this case y is the target-network estimated value of the Q-network optimal actions, which is labeled y in Algorithm 1 of Mnih et al. (2015)

Parameters
  • s (np.ndarray) – (batch, state_shape) state representation

  • a (np.ndarray) – (batch,) integer sampled actions

  • y (np.ndarray) – (batch,) value of optimal actions according to Q-target

Returns

The calculated loss on the batch.

class rlcard.agents.dqn_agent.EstimatorNetwork(num_actions=2, state_shape=None, mlp_layers=None)

Bases: torch.nn.modules.module.Module

The function approximation network for Estimator. It is simply a series of tanh layers. All inputs/outputs are torch.tensor

forward(s)

Predict action values

Parameters

s (Tensor) – (batch, state_shape)

training: bool
class rlcard.agents.dqn_agent.Memory(memory_size, batch_size)

Bases: object

Memory for saving transitions

sample()

Sample a minibatch from the replay memory

Returns

a batch of states
action_batch (list): a batch of actions
reward_batch (list): a batch of rewards
next_state_batch (list): a batch of next states
done_batch (list): a batch of dones

Return type

state_batch (list)
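A minimal sketch of such a memory (hypothetical class name, not the RLCard class): a fixed-capacity FIFO buffer whose `sample` regroups whole transitions into per-field batches, which is what produces the `state_batch`, `action_batch`, ... tuple above.

```python
import random
from collections import deque

class ReplayMemorySketch:
    # Fixed-capacity FIFO memory: when full, the oldest transition is evicted.
    def __init__(self, memory_size, batch_size):
        self.memory = deque(maxlen=memory_size)
        self.batch_size = batch_size

    def save(self, transition):
        self.memory.append(transition)

    def sample(self):
        # Uniformly sample a minibatch, then regroup it field-by-field so the
        # caller gets (state_batch, action_batch, ...) rather than whole tuples.
        batch = random.sample(list(self.memory), self.batch_size)
        return tuple(map(list, zip(*batch)))
```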

save(state, action, reward, next_state, legal_actions, done)

Save transition into memory

Parameters
  • state (numpy.array) – the current state

  • action (int) – the performed action ID

  • reward (float) – the reward received

  • next_state (numpy.array) – the next state after performing the action

  • legal_actions (list) – the legal actions of the next state

  • done (boolean) – whether the episode is finished

class rlcard.agents.dqn_agent.Transition(state, action, reward, next_state, legal_actions, done)

Bases: tuple

property action

Alias for field number 1

property done

Alias for field number 5

property legal_actions

Alias for field number 4

property next_state

Alias for field number 3

property reward

Alias for field number 2

property state

Alias for field number 0
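Transition subclasses tuple, so each property above is just an alias for a positional field; `collections.namedtuple` produces exactly this behavior. A minimal equivalent definition (for illustration, not the module's source):

```python
from collections import namedtuple

# Equivalent definition: each property aliases a positional tuple field,
# matching the "Alias for field number N" descriptions above.
Transition = namedtuple(
    'Transition',
    ['state', 'action', 'reward', 'next_state', 'legal_actions', 'done'])

t = Transition(state=[1, 0], action=3, reward=1.0,
               next_state=[0, 1], legal_actions=[0, 3], done=False)
```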

rlcard.agents.dqn_agent.copy_model_parameters(sess, estimator1, estimator2)

Copies the model parameters of one estimator to another.

Parameters
  • sess (tf.Session) – Tensorflow Session object

  • estimator1 (Estimator) – Estimator to copy the parameters from

  • estimator2 (Estimator) – Estimator to copy the parameters to

rlcard.agents.nfsp_agent

Neural Fictitious Self-Play (NFSP) agent implemented in TensorFlow.

See the paper https://arxiv.org/abs/1603.01121 for more details.

class rlcard.agents.nfsp_agent.AveragePolicyNetwork(num_actions=2, state_shape=None, mlp_layers=None)

Bases: torch.nn.modules.module.Module

Approximates the history of action probabilities given state (average policy). Forward pass returns log probabilities of actions.

forward(s)

Log action probabilities of each action from state

Parameters

s (Tensor) – (batch, state_shape) state tensor

Returns

(batch, num_actions)

Return type

log_action_probs (Tensor)

training: bool
class rlcard.agents.nfsp_agent.NFSPAgent(num_actions=4, state_shape=None, hidden_layers_sizes=None, reservoir_buffer_capacity=20000, anticipatory_param=0.1, batch_size=256, train_every=1, rl_learning_rate=0.1, sl_learning_rate=0.005, min_buffer_size_to_learn=100, q_replay_memory_size=20000, q_replay_memory_init_size=100, q_update_target_estimator_every=1000, q_discount_factor=0.99, q_epsilon_start=0.06, q_epsilon_end=0, q_epsilon_decay_steps=1000000, q_batch_size=32, q_train_every=1, q_mlp_layers=None, evaluate_with='average_policy', device=None)

Bases: object

An approximate clone of rlcard.agents.nfsp_agent that uses PyTorch instead of TensorFlow. Note that this implementation differs from Heinrich and Silver (2016) in that the supervised training minimizes cross-entropy with respect to the stored action probabilities rather than the realized actions.

eval_step(state)

Use the average policy for evaluation purpose

Parameters

state (dict) – The current state.

Returns

An action id
info (dict): A dictionary containing information

Return type

action (int)

feed(ts)

Feed data to inner RL agent

Parameters

ts (list) – A list of 5 elements that represent the transition.

sample_episode_policy()

Sample average/best_response policy
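In NFSP the choice is governed by the anticipatory parameter eta: with probability eta the agent plays its RL best response, otherwise its average policy. A sketch (hypothetical standalone function, not the method itself):

```python
import random

def sample_episode_policy(anticipatory_param, rng=random):
    # With probability eta play the RL best response; otherwise the average policy.
    if rng.random() < anticipatory_param:
        return 'best_response'
    return 'average_policy'
```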

set_device(device)
step(state)

Returns the action to be taken.

Parameters

state (dict) – The current state

Returns

An action id

Return type

action (int)

train_sl()

Compute the loss on sampled transitions and perform an avg-network update.

If there are not enough elements in the buffer, no loss is computed and None is returned instead.

Returns

The average loss obtained on this batch of transitions or None.

Return type

loss (float)

class rlcard.agents.nfsp_agent.ReservoirBuffer(reservoir_buffer_capacity)

Bases: object

Allows uniform sampling over a stream of data.

This class supports the storage of arbitrary elements, such as observation tensors, integer actions, etc.

See https://en.wikipedia.org/wiki/Reservoir_sampling for more details.
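The core idea (Algorithm R) can be sketched in a few lines; this is an illustrative standalone class with hypothetical names, not the ReservoirBuffer source. After n calls to add, every element seen so far remains in the buffer with equal probability capacity / n.

```python
import random

class ReservoirSketch:
    """Algorithm R sketch: uniform sampling over a stream of unknown length."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = []
        self.add_calls = 0

    def add(self, element):
        self.add_calls += 1
        if len(self.data) < self.capacity:
            self.data.append(element)
        else:
            # Keep the new element with probability capacity / add_calls,
            # evicting a uniformly chosen resident if it is kept.
            idx = random.randint(0, self.add_calls - 1)
            if idx < self.capacity:
                self.data[idx] = element
```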

add(element)

Potentially adds element to the reservoir buffer.

Parameters

element (object) – data to be added to the reservoir buffer.

clear()

Clear the buffer

sample(num_samples)

Returns num_samples uniformly sampled from the buffer.

Parameters

num_samples (int) – The number of samples to draw.

Returns

An iterable over num_samples random elements of the buffer.

Raises

ValueError – If there are fewer than num_samples elements in the buffer

class rlcard.agents.nfsp_agent.Transition(info_state, action_probs)

Bases: tuple

property action_probs

Alias for field number 1

property info_state

Alias for field number 0

rlcard.agents.random_agent

class rlcard.agents.random_agent.RandomAgent(num_actions)

Bases: object

A random agent. Random agents are used for running toy examples on the card games

eval_step(state)
Predict the action given the current state for evaluation.

Since the random agents are not trained, this function is equivalent to the step function

Parameters

state (dict) – A dictionary that represents the current state

Returns

The action predicted (randomly chosen) by the random agent
probs (list): The list of action probabilities

Return type

action (int)

static step(state)

Predict the action given the current state when generating training data.

Parameters

state (dict) – A dictionary that represents the current state

Returns

The action predicted (randomly chosen) by the random agent

Return type

action (int)
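The whole behavior fits in one line; the sketch below assumes the dict-style state whose 'legal_actions' entry is keyed by action id, as in the eval_step description above (hypothetical function name, not the static method itself):

```python
import random

def random_step(state):
    # Pick uniformly among the legal actions; 'legal_actions' is assumed here
    # to be a dict keyed by action id, mirroring RLCard's dict-style states.
    return random.choice(list(state['legal_actions'].keys()))
```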