rlcard.agents¶
rlcard.agents.cfr_agent¶
- class rlcard.agents.cfr_agent.CFRAgent(env, model_path='./cfr_model')¶
Bases:
object
Implements the CFR (chance sampling) algorithm. A usage sketch follows the method list below.
- action_probs(obs, legal_actions, policy)¶
Obtain the action probabilities of the current state
- Parameters:
obs (str) – state_str
legal_actions (list) – List of legal actions
policy (dict) – The used policy
- Returns:
action_probs (numpy.array): The action probabilities
legal_actions (list): Indices of legal actions
- Return type:
tuple
- eval_step(state)¶
Given a state, predict action based on average policy
- Parameters:
state (numpy.array) – State representation
- Returns:
action (int): The predicted action
info (dict): A dictionary containing information
- Return type:
tuple
- get_state(player_id)¶
Get state_str of the player
- Parameters:
player_id (int) – The player id
- Returns:
state (str): The state str
legal_actions (list): Indices of legal actions
- Return type:
tuple
- load()¶
Load model
- regret_matching(obs)¶
Apply regret matching
- Parameters:
obs (string) – The state_str
- save()¶
Save model
- train()¶
Do one iteration of CFR
- traverse_tree(probs, player_id)¶
Traverse the game tree and update the regrets
- Parameters:
probs – The reach probability of the current node
player_id – The player whose values are updated
- Returns:
The expected utilities for all the players
- Return type:
state_utilities (list)
- update_policy()¶
Update policy based on the current regrets
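A minimal training sketch for CFRAgent, modeled on the example scripts shipped with RLCard (the environment name and iteration count are illustrative; CFR traverses the game tree, so the environment must be created with allow_step_back enabled):

    import rlcard
    from rlcard.agents.cfr_agent import CFRAgent

    # Chance-sampling CFR steps back through the tree, so allow_step_back is required.
    env = rlcard.make('leduc-holdem', config={'allow_step_back': True})
    agent = CFRAgent(env, model_path='./cfr_model')

    for episode in range(1000):      # iteration budget is an arbitrary choice
        agent.train()                # one iteration of chance-sampling CFR
        if episode % 100 == 0:
            agent.save()             # periodically persist the policy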
rlcard.agents.dqn_agent¶
DQN agent
The code is derived from https://github.com/dennybritz/reinforcement-learning/blob/master/DQN/dqn.py
Copyright (c) 2019 Matthew Judell Copyright (c) 2019 DATA Lab at Texas A&M University Copyright (c) 2016 Denny Britz
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
- class rlcard.agents.dqn_agent.DQNAgent(replay_memory_size=20000, replay_memory_init_size=100, update_target_estimator_every=1000, discount_factor=0.99, epsilon_start=1.0, epsilon_end=0.1, epsilon_decay_steps=20000, batch_size=32, num_actions=2, state_shape=None, train_every=1, mlp_layers=None, learning_rate=5e-05, device=None, save_path=None, save_every=inf)¶
Bases:
object
An approximate clone of the original TensorFlow DQNAgent, implemented in PyTorch. A usage sketch follows the method list below.
- checkpoint_attributes()¶
Return the current checkpoint attributes (dict). Checkpoint attributes are used to save and restore the model in the middle of training. Saves the model state dict, optimizer state dict, and all other instance variables.
- eval_step(state)¶
Predict the action for evaluation purpose.
- Parameters:
state (numpy.array) – current state
- Returns:
action (int): An action id
info (dict): A dictionary containing information
- Return type:
tuple
- feed(ts)¶
- Store data into the replay buffer and train the agent. There are two stages:
in stage 1, populate the memory without training; in stage 2, train the agent every several timesteps.
- Parameters:
ts (list) – a list of 5 elements that represent the transition
- feed_memory(state, action, reward, next_state, legal_actions, done)¶
Feed transition to memory
- Parameters:
state (numpy.array) – the current state
action (int) – the performed action ID
reward (float) – the reward received
next_state (numpy.array) – the next state after performing the action
legal_actions (list) – the legal actions of the next state
done (boolean) – whether the episode is finished
- classmethod from_checkpoint(checkpoint)¶
Restore the model from a checkpoint
- Parameters:
checkpoint (dict) – the checkpoint attributes generated by checkpoint_attributes()
- predict(state)¶
Predict the masked Q-values
- Parameters:
state (numpy.array) – current state
- Returns:
a 1-d array where each entry represents a Q value
- Return type:
q_values (numpy.array)
- save_checkpoint(path, filename='checkpoint_dqn.pt')¶
Save the model checkpoint (all attributes)
- Parameters:
path (str) – the path to save the model
- set_device(device)¶
- step(state)¶
- Predict the action for generating training data, with the predictions disconnected from the computation graph.
- Parameters:
state (numpy.array) – current state
- Returns:
an action id
- Return type:
action (int)
- train()¶
Train the network
- Returns:
The loss of the current batch.
- Return type:
loss (float)
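A minimal training sketch for DQNAgent, following the usual RLCard loop (the environment name, layer sizes, and episode count are illustrative):

    import rlcard
    from rlcard.agents.dqn_agent import DQNAgent
    from rlcard.utils import reorganize

    env = rlcard.make('leduc-holdem')
    agent = DQNAgent(
        num_actions=env.num_actions,
        state_shape=env.state_shape[0],
        mlp_layers=[64, 64],
    )
    env.set_agents([agent for _ in range(env.num_players)])

    for episode in range(1000):
        # Collect one episode of data with the agent acting in training mode.
        trajectories, payoffs = env.run(is_training=True)
        # Reorganize into (state, action, reward, next_state, done) transitions.
        trajectories = reorganize(trajectories, payoffs)
        # feed() stores transitions in the replay buffer and trains every train_every steps.
        for ts in trajectories[0]:
            agent.feed(ts)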
- class rlcard.agents.dqn_agent.Estimator(num_actions=2, learning_rate=0.001, state_shape=None, mlp_layers=None, device=None)¶
Bases:
object
An approximate clone of the original TensorFlow Estimator, implemented in PyTorch. All methods input/output np.ndarray.
Q-Value Estimator neural network. This network is used for both the Q-Network and the Target Network.
- checkpoint_attributes()¶
Return the attributes needed to restore the model from a checkpoint
- classmethod from_checkpoint(checkpoint)¶
Restore the model from a checkpoint
- predict_nograd(s)¶
- Predicts action values, but prediction is not included
in the computation graph. It is used to predict optimal next actions in the Double-DQN algorithm.
- Parameters:
s (np.ndarray) – (batch, state_len)
- Returns:
np.ndarray of shape (batch_size, NUM_VALID_ACTIONS) containing the estimated action values.
- update(s, a, y)¶
- Updates the estimator towards the given targets.
In this case y is the target-network estimate of the value of the Q-network's optimal actions, labeled y in Algorithm 1 of Mnih et al. (2015). A sketch of this target computation follows this class entry.
- Parameters:
s (np.ndarray) – (batch, state_shape) state representation
a (np.ndarray) – (batch,) integer sampled actions
y (np.ndarray) – (batch,) value of optimal actions according to Q-target
- Returns:
The calculated loss on the batch.
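A rough sketch of how predict_nograd and update fit together in the Double-DQN target computation described above (all shapes, names, and hyperparameters here are illustrative stand-ins, not values taken from the library internals):

    import numpy as np
    import torch
    from rlcard.agents.dqn_agent import Estimator

    batch_size, num_actions, state_shape = 4, 2, [6]
    device = torch.device('cpu')
    q_estimator = Estimator(num_actions=num_actions, state_shape=state_shape,
                            mlp_layers=[64], device=device)
    target_estimator = Estimator(num_actions=num_actions, state_shape=state_shape,
                                 mlp_layers=[64], device=device)

    # Stand-ins for a minibatch drawn from the replay memory.
    state_batch = np.random.rand(batch_size, *state_shape).astype(np.float32)
    action_batch = np.random.randint(num_actions, size=batch_size)
    reward_batch = np.random.rand(batch_size).astype(np.float32)
    next_state_batch = np.random.rand(batch_size, *state_shape).astype(np.float32)
    done_batch = np.zeros(batch_size, dtype=np.float32)
    discount_factor = 0.99

    # Double DQN: the Q-network chooses the next actions ...
    best_actions = np.argmax(q_estimator.predict_nograd(next_state_batch), axis=1)
    # ... and the target network evaluates them (y in Algorithm 1 of Mnih et al., 2015).
    q_next_target = target_estimator.predict_nograd(next_state_batch)
    y = reward_batch + (1.0 - done_batch) * discount_factor * \
        q_next_target[np.arange(batch_size), best_actions]
    loss = q_estimator.update(state_batch, action_batch, y)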
- class rlcard.agents.dqn_agent.EstimatorNetwork(num_actions=2, state_shape=None, mlp_layers=None)¶
Bases:
Module
The function approximation network for Estimator. It is just a series of tanh layers. All in/out are torch.tensor
- forward(s)¶
Predict action values
- Parameters:
s (Tensor) – (batch, state_shape)
- class rlcard.agents.dqn_agent.Memory(memory_size, batch_size)¶
Bases:
object
Memory for saving transitions
- checkpoint_attributes()¶
Returns the attributes that need to be checkpointed
- classmethod from_checkpoint(checkpoint)¶
Restores the attributes from the checkpoint
- Parameters:
checkpoint (dict) – the checkpoint dictionary
- Returns:
the restored instance
- Return type:
instance (Memory)
- sample()¶
Sample a minibatch from the replay memory
- Returns:
state_batch (list): a batch of states
action_batch (list): a batch of actions
reward_batch (list): a batch of rewards
next_state_batch (list): a batch of next states
done_batch (list): a batch of dones
- Return type:
tuple
- save(state, action, reward, next_state, legal_actions, done)¶
Save transition into memory
- Parameters:
state (numpy.array) – the current state
action (int) – the performed action ID
reward (float) – the reward received
next_state (numpy.array) – the next state after performing the action
legal_actions (list) – the legal actions of the next state
done (boolean) – whether the episode is finished
- class rlcard.agents.dqn_agent.Transition(state, action, reward, next_state, done, legal_actions)¶
Bases:
tuple
- action¶
Alias for field number 1
- done¶
Alias for field number 4
- legal_actions¶
Alias for field number 5
- next_state¶
Alias for field number 3
- reward¶
Alias for field number 2
- state¶
Alias for field number 0
rlcard.agents.nfsp_agent¶
Neural Fictitious Self-Play (NFSP) agent, implemented here in PyTorch.
See the paper https://arxiv.org/abs/1603.01121 for more details.
- class rlcard.agents.nfsp_agent.AveragePolicyNetwork(num_actions=2, state_shape=None, mlp_layers=None)¶
Bases:
Module
Approximates the history of action probabilities given state (average policy). Forward pass returns log probabilities of actions.
- checkpoint_attributes()¶
Return the current checkpoint attributes (dict). Checkpoint attributes are used to save and restore the model in the middle of training.
- forward(s)¶
Log action probabilities of each action from state
- Parameters:
s (Tensor) – (batch, state_shape) state tensor
- Returns:
(batch, num_actions)
- Return type:
log_action_probs (Tensor)
- classmethod from_checkpoint(checkpoint)¶
Restore the model from a checkpoint
- Parameters:
checkpoint (dict) – the checkpoint attributes generated by checkpoint_attributes()
- class rlcard.agents.nfsp_agent.NFSPAgent(num_actions=4, state_shape=None, hidden_layers_sizes=None, reservoir_buffer_capacity=20000, anticipatory_param=0.1, batch_size=256, train_every=1, rl_learning_rate=0.1, sl_learning_rate=0.005, min_buffer_size_to_learn=100, q_replay_memory_size=20000, q_replay_memory_init_size=100, q_update_target_estimator_every=1000, q_discount_factor=0.99, q_epsilon_start=0.06, q_epsilon_end=0, q_epsilon_decay_steps=1000000, q_batch_size=32, q_train_every=1, q_mlp_layers=None, evaluate_with='average_policy', device=None, save_path=None, save_every=inf)¶
Bases:
object
An approximate clone of the original TensorFlow NFSP agent, implemented in PyTorch. Note that this implementation differs from Heinrich and Silver (2016) in that the supervised training minimizes cross-entropy with respect to the stored action probabilities rather than the realized actions. A usage sketch follows the method list below.
- checkpoint_attributes()¶
Return the current checkpoint attributes (dict). Checkpoint attributes are used to save and restore the model in the middle of training. Saves the model state dict, optimizer state dict, and all other instance variables.
- eval_step(state)¶
Use the average policy for evaluation purpose
- Parameters:
state (dict) – The current state.
- Returns:
action (int): An action id
info (dict): A dictionary containing information
- Return type:
tuple
- feed(ts)¶
Feed data to inner RL agent
- Parameters:
ts (list) – A list of 5 elements that represent the transition.
- classmethod from_checkpoint(checkpoint)¶
Restore the model from a checkpoint
- Parameters:
checkpoint (dict) – the checkpoint attributes generated by checkpoint_attributes()
- sample_episode_policy()¶
Sample whether to use the average policy or the best-response policy for this episode
- save_checkpoint(path, filename='checkpoint_nfsp.pt')¶
Save the model checkpoint (all attributes)
- Parameters:
path (str) – the path to save the model
- set_device(device)¶
- step(state)¶
Returns the action to be taken.
- Parameters:
state (dict) – The current state
- Returns:
An action id
- Return type:
action (int)
- train_sl()¶
Compute the loss on sampled transitions and perform an average-network update.
If there are not enough elements in the buffer, no loss is computed and None is returned instead.
- Returns:
The average loss obtained on this batch of transitions or None.
- Return type:
loss (float)
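A minimal training sketch for NFSPAgent, following the usual RLCard loop (the environment, layer sizes, and episode count are illustrative; sample_episode_policy is called once per episode to pick between the average and best-response policies):

    import rlcard
    from rlcard.agents.nfsp_agent import NFSPAgent
    from rlcard.utils import reorganize

    env = rlcard.make('leduc-holdem')
    agents = [
        NFSPAgent(
            num_actions=env.num_actions,
            state_shape=env.state_shape[0],
            hidden_layers_sizes=[64, 64],
            q_mlp_layers=[64, 64],
        )
        for _ in range(env.num_players)
    ]
    env.set_agents(agents)

    for episode in range(1000):
        for agent in agents:
            agent.sample_episode_policy()          # pick this episode's policy mode
        trajectories, payoffs = env.run(is_training=True)
        trajectories = reorganize(trajectories, payoffs)
        for player_id, agent in enumerate(agents):
            for ts in trajectories[player_id]:
                agent.feed(ts)                     # feeds the inner RL agent and the SL buffer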
- class rlcard.agents.nfsp_agent.ReservoirBuffer(reservoir_buffer_capacity)¶
Bases:
object
Allows uniform sampling over a stream of data.
This class supports the storage of arbitrary elements, such as observation tensors, integer actions, etc.
See https://en.wikipedia.org/wiki/Reservoir_sampling for more details; a sketch of the sampling rule follows this class entry.
- add(element)¶
Potentially adds element to the reservoir buffer.
- Parameters:
element (object) – data to be added to the reservoir buffer.
- checkpoint_attributes()¶
- clear()¶
Clear the buffer
- classmethod from_checkpoint(checkpoint)¶
- sample(num_samples)¶
Returns num_samples uniformly sampled from the buffer.
- Parameters:
num_samples (int) – The number of samples to draw.
- Returns:
An iterable over num_samples random elements of the buffer.
- Raises:
ValueError – If there are fewer than num_samples elements in the buffer
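A minimal sketch of the reservoir-sampling rule the buffer relies on (illustrative only, not the library's internal code): once the buffer is full, the i-th incoming element replaces a uniformly chosen slot with probability capacity / i, which keeps the stored elements a uniform sample of the stream.

    import random

    def reservoir_add(data, element, capacity, add_calls):
        # data: current buffer contents; add_calls: how many elements were offered so far.
        if len(data) < capacity:
            data.append(element)
        else:
            # Element number add_calls + 1 replaces a random slot with
            # probability capacity / (add_calls + 1).
            idx = random.randint(0, add_calls)
            if idx < capacity:
                data[idx] = element
        return add_calls + 1

    # Stream 1000 integers through a buffer of capacity 10.
    data, calls = [], 0
    for x in range(1000):
        calls = reservoir_add(data, x, capacity=10, add_calls=calls)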
rlcard.agents.pettingzoo_agents¶
- class rlcard.agents.pettingzoo_agents.DQNAgentPettingZoo(replay_memory_size=20000, replay_memory_init_size=100, update_target_estimator_every=1000, discount_factor=0.99, epsilon_start=1.0, epsilon_end=0.1, epsilon_decay_steps=20000, batch_size=32, num_actions=2, state_shape=None, train_every=1, mlp_layers=None, learning_rate=5e-05, device=None, save_path=None, save_every=inf)¶
Bases:
DQNAgent
- eval_step(state)¶
Predict the action for evaluation purpose.
- Parameters:
state (numpy.array) – current state
- Returns:
action (int): An action id
info (dict): A dictionary containing information
- Return type:
tuple
- feed(ts)¶
- Store data into the replay buffer and train the agent. There are two stages:
in stage 1, populate the memory without training; in stage 2, train the agent every several timesteps.
- Parameters:
ts (list) – a list of 5 elements that represent the transition
- step(state)¶
- Predict the action for generating training data, with the predictions disconnected from the computation graph.
- Parameters:
state (numpy.array) – current state
- Returns:
an action id
- Return type:
action (int)
- class rlcard.agents.pettingzoo_agents.NFSPAgentPettingZoo(num_actions=4, state_shape=None, hidden_layers_sizes=None, reservoir_buffer_capacity=20000, anticipatory_param=0.1, batch_size=256, train_every=1, rl_learning_rate=0.1, sl_learning_rate=0.005, min_buffer_size_to_learn=100, q_replay_memory_size=20000, q_replay_memory_init_size=100, q_update_target_estimator_every=1000, q_discount_factor=0.99, q_epsilon_start=0.06, q_epsilon_end=0, q_epsilon_decay_steps=1000000, q_batch_size=32, q_train_every=1, q_mlp_layers=None, evaluate_with='average_policy', device=None, save_path=None, save_every=inf)¶
Bases:
NFSPAgent
- eval_step(state)¶
Use the average policy for evaluation purpose
- Parameters:
state (dict) – The current state.
- Returns:
action (int): An action id
info (dict): A dictionary containing information
- Return type:
tuple
- feed(ts)¶
Feed data to inner RL agent
- Parameters:
ts (list) – A list of 5 elements that represent the transition.
- step(state)¶
Returns the action to be taken.
- Parameters:
state (dict) – The current state
- Returns:
An action id
- Return type:
action (int)
- class rlcard.agents.pettingzoo_agents.RandomAgentPettingZoo(num_actions)¶
Bases:
RandomAgent
- eval_step(state)¶
- Predict the action given the current state for evaluation.
Since the random agents are not trained, this function is equivalent to the step function.
- Parameters:
state (dict) – A dictionary that represents the current state
- Returns:
action (int): The action predicted (randomly chosen) by the random agent
probs (list): The list of action probabilities
- Return type:
tuple
- step(state)¶
Predict the action given the current state when generating training data.
- Parameters:
state (dict) – A dictionary that represents the current state
- Returns:
The action predicted (randomly chosen) by the random agent
- Return type:
action (int)
rlcard.agents.random_agent¶
- class rlcard.agents.random_agent.RandomAgent(num_actions)¶
Bases:
object
A random agent. Random agents are for running toy examples on the card games. A usage sketch follows the method list below.
- eval_step(state)¶
- Predict the action given the current state for evaluation.
Since the random agents are not trained, this function is equivalent to the step function.
- Parameters:
state (dict) – A dictionary that represents the current state
- Returns:
action (int): The action predicted (randomly chosen) by the random agent
probs (list): The list of action probabilities
- Return type:
tuple
- static step(state)¶
Predict the action given the current state when generating training data.
- Parameters:
state (dict) – A dictionary that represents the current state
- Returns:
The action predicted (randomly chosen) by the random agent
- Return type:
action (int)
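A minimal usage sketch for RandomAgent (the environment name is illustrative; any registered RLCard environment works):

    import rlcard
    from rlcard.agents.random_agent import RandomAgent

    env = rlcard.make('leduc-holdem')
    agent = RandomAgent(num_actions=env.num_actions)
    env.set_agents([agent for _ in range(env.num_players)])

    # Roll out one full game with random play and inspect the payoffs.
    trajectories, payoffs = env.run(is_training=False)
    print(payoffs)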
rlcard.agents.dmc_agent.file_writer¶
- class rlcard.agents.dmc_agent.file_writer.FileWriter(xpid: str | None = None, xp_args: dict | None = None, rootdir: str = '~/palaas')¶
Bases:
object
- close(successful: bool = True) → None¶
- log(to_log: Dict, tick: int | None = None, verbose: bool = False) → None¶
- rlcard.agents.dmc_agent.file_writer.gather_metadata() → Dict¶
rlcard.agents.dmc_agent.model¶
- class rlcard.agents.dmc_agent.model.DMCAgent(state_shape, action_shape, mlp_layers=[512, 512, 512, 512, 512], exp_epsilon=0.01, device='0')¶
Bases:
object
- eval()¶
- eval_step(state)¶
- forward(obs, actions)¶
- load_state_dict(state_dict)¶
- parameters()¶
- predict(state)¶
- set_device(device)¶
- state_dict()¶
- step(state)¶
- class rlcard.agents.dmc_agent.model.DMCModel(state_shape, action_shape, mlp_layers=[512, 512, 512, 512, 512], exp_epsilon=0.01, device=0)¶
Bases:
object
- eval()¶
- get_agent(index)¶
- get_agents()¶
- parameters(index)¶
- class rlcard.agents.dmc_agent.model.DMCNet(state_shape, action_shape, mlp_layers=[512, 512, 512, 512, 512])¶
Bases:
Module
- forward(obs, actions)¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note: Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
rlcard.agents.dmc_agent.pettingzoo_model¶
rlcard.agents.dmc_agent.pettingzoo_utils¶
- rlcard.agents.dmc_agent.pettingzoo_utils.act_pettingzoo(i, device, T, free_queue, full_queue, model, buffers, env)¶
- rlcard.agents.dmc_agent.pettingzoo_utils.create_buffers_pettingzoo(T, num_buffers, env, device_iterator)¶
rlcard.agents.dmc_agent.trainer¶
- class rlcard.agents.dmc_agent.trainer.DMCTrainer(env, cuda='', is_pettingzoo_env=False, load_model=False, xpid='dmc', save_interval=30, num_actor_devices=1, num_actors=5, training_device='0', savedir='experiments/dmc_result', total_frames=100000000000, exp_epsilon=0.01, batch_size=32, unroll_length=100, num_buffers=50, num_threads=4, max_grad_norm=40, learning_rate=0.0001, alpha=0.99, momentum=0, epsilon=1e-05)¶
Bases:
object
Deep Monte-Carlo (DMC) trainer. A usage sketch follows this class entry.
- Parameters:
env – RLCard environment
load_model (boolean) – Whether to load an existing model
xpid (string) – Experiment id (default: dmc)
save_interval (int) – Time interval (in minutes) at which to save the model
num_actor_devices (int) – The number of devices used for simulation
num_actors (int) – Number of actors for each simulation device
training_device (str) – The index of the GPU used for training models, or cpu.
savedir (string) – Root dir where experiment data will be saved
total_frames (int) – Total environment frames to train for
exp_epsilon (float) – The probability for exploration
batch_size (int) – Learner batch size
unroll_length (int) – The unroll length (time dimension)
num_buffers (int) – Number of shared-memory buffers
num_threads (int) – Number of learner threads
max_grad_norm (int) – Max norm of gradients
learning_rate (float) – Learning rate
alpha (float) – RMSProp smoothing constant
momentum (float) – RMSProp momentum
epsilon (float) – RMSProp epsilon
- start()¶
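A minimal training sketch for DMCTrainer, modeled on the DMC example shipped with RLCard (the environment name and paths are illustrative; the remaining constructor arguments keep the defaults shown in the signature above, and depending on cuda/training_device a GPU may be required):

    import rlcard
    from rlcard.agents.dmc_agent.trainer import DMCTrainer

    env = rlcard.make('doudizhu')            # illustrative choice of environment
    trainer = DMCTrainer(
        env,
        xpid='doudizhu',
        savedir='experiments/dmc_result',
        save_interval=30,                    # minutes between checkpoints
    )
    trainer.start()                          # runs the actors and learner threads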
- rlcard.agents.dmc_agent.trainer.compute_loss(logits, targets)¶
- rlcard.agents.dmc_agent.trainer.learn(position, actor_models, agent, batch, optimizer, training_device, max_grad_norm, mean_episode_return_buf, lock)¶
Performs a learning (optimization) step.
rlcard.agents.dmc_agent.utils¶
- rlcard.agents.dmc_agent.utils.act(i, device, T, free_queue, full_queue, model, buffers, env)¶
- rlcard.agents.dmc_agent.utils.create_buffers(T, num_buffers, state_shape, action_shape, device_iterator)¶
- rlcard.agents.dmc_agent.utils.create_optimizers(num_players, learning_rate, momentum, epsilon, alpha, learner_model)¶
- rlcard.agents.dmc_agent.utils.get_batch(free_queue, full_queue, buffers, batch_size, lock)¶
rlcard.agents.human_agents.blackjack_human_agent¶
- class rlcard.agents.human_agents.blackjack_human_agent.HumanAgent(num_actions)¶
Bases:
object
A human agent for Blackjack. It can be used to play alone to understand how the blackjack code runs. A usage sketch follows the method list below.
- eval_step(state)¶
Predict the action given the current state for evaluation. The same as step here.
- Parameters:
state (numpy.array) – a numpy array that represents the current state
- Returns:
the action chosen by the human agent
- Return type:
action (int)
- static step(state)¶
Human agent will display the state and make decisions through interfaces
- Parameters:
state (dict) – A dictionary that represents the current state
- Returns:
The action decided by human
- Return type:
action (int)
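A minimal interactive sketch with the Blackjack HumanAgent, modeled on the human-play examples shipped with RLCard (assumes a single-player Blackjack environment; the number of rounds is arbitrary):

    import rlcard
    from rlcard.agents.human_agents.blackjack_human_agent import HumanAgent

    env = rlcard.make('blackjack')
    human = HumanAgent(env.num_actions)
    env.set_agents([human])

    for _ in range(3):
        # The human agent prints the state and prompts for an action on each turn.
        trajectories, payoffs = env.run(is_training=False)
        print('Payoff:', payoffs[0])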
rlcard.agents.human_agents.leduc_holdem_human_agent¶
- class rlcard.agents.human_agents.leduc_holdem_human_agent.HumanAgent(num_actions)¶
Bases:
object
A human agent for Leduc Holdem. It can be used to play against trained models
- eval_step(state)¶
Predict the action given the current state for evaluation. The same as step here.
- Parameters:
state (numpy.array) – a numpy array that represents the current state
- Returns:
the action chosen by the human agent
- Return type:
action (int)
- static step(state)¶
Human agent will display the state and make decisions through interfaces
- Parameters:
state (dict) – A dictionary that represents the current state
- Returns:
The action decided by human
- Return type:
action (int)
rlcard.agents.human_agents.limit_holdem_human_agent¶
- class rlcard.agents.human_agents.limit_holdem_human_agent.HumanAgent(num_actions)¶
Bases:
object
A human agent for Limit Holdem. It can be used to play against trained models
- eval_step(state)¶
Predict the action given the current state for evaluation. The same as step here.
- Parameters:
state (numpy.array) – a numpy array that represents the current state
- Returns:
the action chosen by the human agent
- Return type:
action (int)
- static step(state)¶
Human agent will display the state and make decisions through interfaces
- Parameters:
state (dict) – A dictionary that represents the current state
- Returns:
The action decided by human
- Return type:
action (int)
rlcard.agents.human_agents.nolimit_holdem_human_agent¶
- class rlcard.agents.human_agents.nolimit_holdem_human_agent.HumanAgent(num_actions)¶
Bases:
object
A human agent for No Limit Holdem. It can be used to play against trained models
- eval_step(state)¶
Predict the action given the current state for evaluation. The same as step here.
- Parameters:
state (numpy.array) – a numpy array that represents the current state
- Returns:
the action chosen by the human agent
- Return type:
action (int)
- static step(state)¶
Human agent will display the state and make decisions through interfaces
- Parameters:
state (dict) – A dictionary that represents the current state
- Returns:
The action decided by human
- Return type:
action (int)
rlcard.agents.human_agents.uno_human_agent¶
- class rlcard.agents.human_agents.uno_human_agent.HumanAgent(num_actions)¶
Bases:
object
A human agent for UNO. It can be used to play against trained models
- eval_step(state)¶
Predict the action given the current state for evaluation. The same as step here.
- Parameters:
state (numpy.array) – a numpy array that represents the current state
- Returns:
the action chosen by the human agent
- Return type:
action (int)
- static step(state)¶
Human agent will display the state and make decisions through interfaces
- Parameters:
state (dict) – A dictionary that represents the current state
- Returns:
The action decided by human
- Return type:
action (int)