rlcard.agents

Subpackages

rlcard.agents.cfr_agent

class rlcard.agents.cfr_agent.CFRAgent(env, model_path='./cfr_model')

Bases: object

Implements the CFR (chance sampling) algorithm

action_probs(obs, legal_actions, policy)

Obtain the action probabilities of the current state

Parameters
  • obs (str) – state_str

  • legal_actions (list) – List of legal actions

  • policy (dict) – The policy to use

Returns

action_probs (numpy.array): The action probabilities
legal_actions (list): Indices of legal actions

Return type

(tuple) that contains
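As a rough sketch of this lookup-with-fallback behavior (plain Python with hypothetical names, not the actual RLCard implementation): states missing from the policy table fall back to a uniform distribution over the legal actions, and stored distributions are masked to the legal actions and renormalized.

```python
def action_probs(obs, legal_actions, policy, num_actions=4):
    # Look up the state string; unseen states fall back to uniform over legal actions.
    if obs in policy:
        probs = list(policy[obs])
    else:
        probs = [0.0] * num_actions
        for a in legal_actions:
            probs[a] = 1.0 / len(legal_actions)
    # Zero out illegal actions and renormalize so the distribution sums to 1.
    masked = [p if a in legal_actions else 0.0 for a, p in enumerate(probs)]
    total = sum(masked)
    if total > 0:
        masked = [p / total for p in masked]
    else:
        for a in legal_actions:
            masked[a] = 1.0 / len(legal_actions)
    return masked
```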

eval_step(state)

Given a state, predict action based on average policy

Parameters

state (numpy.array) – State representation

Returns

Predicted action
info (dict): A dictionary containing information

Return type

action (int)

get_state(player_id)

Get state_str of the player

Parameters

player_id (int) – The player id

Returns

state (str): The state str
legal_actions (list): Indices of legal actions

Return type

(tuple) that contains

load()

Load model

regret_matching(obs)

Apply regret matching

Parameters

obs (string) – The state_str
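The regret-matching step itself is small enough to sketch in isolation: keep only the positive cumulative regrets and normalize them into a strategy, falling back to uniform play when no action has positive regret. This is an illustrative sketch, not the RLCard method signature (which operates on a state string).

```python
def regret_matching(regrets):
    # regrets: cumulative regret per action. Positive regrets become
    # the (unnormalized) probability of playing that action.
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    if total > 0:
        return [r / total for r in positive]
    # No positive regret anywhere: play uniformly at random.
    n = len(regrets)
    return [1.0 / n] * n
```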

save()

Save model

train()

Do one iteration of CFR

traverse_tree(probs, player_id)

Traverse the game tree, update the regrets

Parameters
  • probs – The reach probability of the current node

  • player_id – The player to update the value

Returns

The expected utilities for all the players

Return type

state_utilities (list)

update_policy()

Update policy based on the current regrets

rlcard.agents.dqn_agent

DQN agent

The code is derived from https://github.com/dennybritz/reinforcement-learning/blob/master/DQN/dqn.py

Copyright (c) 2019 Matthew Judell Copyright (c) 2019 DATA Lab at Texas A&M University Copyright (c) 2016 Denny Britz

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

class rlcard.agents.dqn_agent.DQNAgent(replay_memory_size=20000, replay_memory_init_size=100, update_target_estimator_every=1000, discount_factor=0.99, epsilon_start=1.0, epsilon_end=0.1, epsilon_decay_steps=20000, batch_size=32, num_actions=2, state_shape=None, train_every=1, mlp_layers=None, learning_rate=5e-05, device=None)

Bases: object

Approximate clone of rlcard.agents.dqn_agent.DQNAgent that depends on PyTorch instead of TensorFlow

eval_step(state)

Predict the action for evaluation purpose.

Parameters

state (numpy.array) – current state

Returns

an action id
info (dict): A dictionary containing information

Return type

action (int)

feed(ts)
Store data into the replay buffer and train the agent. There are two stages.

In stage 1, the memory is populated without training. In stage 2, the agent is trained every several timesteps.

Parameters

ts (list) – a list of 5 elements that represent the transition
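The two-stage logic can be sketched as follows (hypothetical class and attribute names, a simplification of the real agent): training only begins once the memory holds `init_size` transitions, and then fires every `train_every` timesteps.

```python
class TwoStageFeed:
    """Illustrative sketch of the two-stage feed logic, not the real DQNAgent:
    stage 1 fills the replay memory; stage 2 trains every `train_every` steps."""

    def __init__(self, init_size=3, train_every=2):
        self.memory = []
        self.init_size = init_size
        self.train_every = train_every
        self.total_t = 0       # timesteps seen so far
        self.train_calls = 0   # how many times train() would have run

    def feed(self, ts):
        # ts is the 5-element transition, e.g. (state, action, reward, next_state, done)
        self.memory.append(tuple(ts))
        self.total_t += 1
        elapsed = self.total_t - self.init_size
        # Stage 2: once the memory is seeded, train every `train_every` timesteps.
        if elapsed >= 0 and elapsed % self.train_every == 0:
            self.train_calls += 1
```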

feed_memory(state, action, reward, next_state, legal_actions, done)

Feed transition to memory

Parameters
  • state (numpy.array) – the current state

  • action (int) – the performed action ID

  • reward (float) – the reward received

  • next_state (numpy.array) – the next state after performing the action

  • legal_actions (list) – the legal actions of the next state

  • done (boolean) – whether the episode is finished

predict(state)

Predict the masked Q-values

Parameters

state (numpy.array) – current state

Returns

a 1-d array where each entry represents a Q value

Return type

q_values (numpy.array)
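"Masked" here means that illegal actions are excluded from consideration. One common way to do this, shown as a plain-Python sketch rather than the library's implementation, is to set their Q-values to negative infinity so an argmax can never select them:

```python
def masked_q_values(q_values, legal_actions):
    # Illegal actions get -inf so that argmax can never select them.
    return [q if a in legal_actions else float('-inf')
            for a, q in enumerate(q_values)]
```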

set_device(device)
step(state)
Predict the action for generating training data, with the predictions disconnected from the computation graph

Parameters

state (numpy.array) – current state

Returns

an action id

Return type

action (int)

train()

Train the network

Returns

The loss of the current batch.

Return type

loss (float)

class rlcard.agents.dqn_agent.Estimator(num_actions=2, learning_rate=0.001, state_shape=None, mlp_layers=None, device=None)

Bases: object

Approximate clone of rlcard.agents.dqn_agent.Estimator that uses PyTorch instead of TensorFlow. All methods take and return np.ndarray.

Q-Value Estimator neural network. This network is used for both the Q-Network and the Target Network.

predict_nograd(s)
Predicts action values with the prediction excluded from the computation graph. It is used to predict the optimal next actions in the Double-DQN algorithm.

Parameters

s (np.ndarray) – (batch, state_len)

Returns

np.ndarray of shape (batch_size, NUM_VALID_ACTIONS) containing the estimated action values.

update(s, a, y)
Updates the estimator towards the given targets.

In this case y is the target-network estimated value of the Q-network optimal actions, which is labeled y in Algorithm 1 of Mnih et al. (2015)

Parameters
  • s (np.ndarray) – (batch, state_shape) state representation

  • a (np.ndarray) – (batch,) integer sampled actions

  • y (np.ndarray) – (batch,) value of optimal actions according to Q-target

Returns

The calculated loss on the batch.

class rlcard.agents.dqn_agent.EstimatorNetwork(num_actions=2, state_shape=None, mlp_layers=None)

Bases: torch.nn.modules.module.Module

The function approximation network for Estimator. It is simply a series of tanh layers. All inputs/outputs are torch.tensor

forward(s)

Predict action values

Parameters

s (Tensor) – (batch, state_shape)

training: bool
class rlcard.agents.dqn_agent.Memory(memory_size, batch_size)

Bases: object

Memory for saving transitions

sample()

Sample a minibatch from the replay memory

Returns

a batch of states
action_batch (list): a batch of actions
reward_batch (list): a batch of rewards
next_state_batch (list): a batch of next states
done_batch (list): a batch of dones

Return type

state_batch (list)
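A minimal sketch of such a memory (hypothetical class name, not the RLCard class): a fixed-capacity FIFO buffer whose `sample` regroups whole transitions into per-field batches, which is what produces the `state_batch`, `action_batch`, ... tuple above.

```python
import random
from collections import deque

class ReplayMemorySketch:
    # Fixed-capacity FIFO memory: when full, the oldest transition is evicted.
    def __init__(self, memory_size, batch_size):
        self.memory = deque(maxlen=memory_size)
        self.batch_size = batch_size

    def save(self, transition):
        self.memory.append(transition)

    def sample(self):
        # Uniformly sample a minibatch, then regroup it field-by-field so the
        # caller gets (state_batch, action_batch, ...) rather than whole tuples.
        batch = random.sample(list(self.memory), self.batch_size)
        return tuple(map(list, zip(*batch)))
```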

save(state, action, reward, next_state, legal_actions, done)

Save transition into memory

Parameters
  • state (numpy.array) – the current state

  • action (int) – the performed action ID

  • reward (float) – the reward received

  • next_state (numpy.array) – the next state after performing the action

  • legal_actions (list) – the legal actions of the next state

  • done (boolean) – whether the episode is finished

class rlcard.agents.dqn_agent.Transition(state, action, reward, next_state, legal_actions, done)

Bases: tuple

property action

Alias for field number 1

property done

Alias for field number 5

property legal_actions

Alias for field number 4

property next_state

Alias for field number 3

property reward

Alias for field number 2

property state

Alias for field number 0
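Transition subclasses tuple, so each property above is just an alias for a positional field; `collections.namedtuple` produces exactly this behavior. A minimal equivalent definition (for illustration, not the module's source):

```python
from collections import namedtuple

# Equivalent definition: each property aliases a positional tuple field,
# matching the "Alias for field number N" descriptions above.
Transition = namedtuple(
    'Transition',
    ['state', 'action', 'reward', 'next_state', 'legal_actions', 'done'])

t = Transition(state=[1, 0], action=3, reward=1.0,
               next_state=[0, 1], legal_actions=[0, 3], done=False)
```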

rlcard.agents.dqn_agent.copy_model_parameters(sess, estimator1, estimator2)

Copies the model parameters of one estimator to another.

Parameters
  • sess (tf.Session) – Tensorflow Session object

  • estimator1 (Estimator) – Estimator to copy the parameters from

  • estimator2 (Estimator) – Estimator to copy the parameters to

rlcard.agents.nfsp_agent

Neural Fictitious Self-Play (NFSP) agent implemented in TensorFlow.

See the paper https://arxiv.org/abs/1603.01121 for more details.

class rlcard.agents.nfsp_agent.AveragePolicyNetwork(num_actions=2, state_shape=None, mlp_layers=None)

Bases: torch.nn.modules.module.Module

Approximates the history of action probabilities given state (average policy). Forward pass returns log probabilities of actions.

forward(s)

Log action probabilities of each action from state

Parameters

s (Tensor) – (batch, state_shape) state tensor

Returns

(batch, num_actions)

Return type

log_action_probs (Tensor)

training: bool
class rlcard.agents.nfsp_agent.NFSPAgent(num_actions=4, state_shape=None, hidden_layers_sizes=None, reservoir_buffer_capacity=20000, anticipatory_param=0.1, batch_size=256, train_every=1, rl_learning_rate=0.1, sl_learning_rate=0.005, min_buffer_size_to_learn=100, q_replay_memory_size=20000, q_replay_memory_init_size=100, q_update_target_estimator_every=1000, q_discount_factor=0.99, q_epsilon_start=0.06, q_epsilon_end=0, q_epsilon_decay_steps=1000000, q_batch_size=32, q_train_every=1, q_mlp_layers=None, evaluate_with='average_policy', device=None)

Bases: object

An approximate clone of rlcard.agents.nfsp_agent that uses PyTorch instead of TensorFlow. Note that this implementation differs from Heinrich and Silver (2016) in that the supervised training minimizes cross-entropy with respect to the stored action probabilities rather than the realized actions.

eval_step(state)

Use the average policy for evaluation purpose

Parameters

state (dict) – The current state.

Returns

An action id
info (dict): A dictionary containing information

Return type

action (int)

feed(ts)

Feed data to inner RL agent

Parameters

ts (list) – A list of 5 elements that represent the transition.

sample_episode_policy()

Sample average/best_response policy
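In NFSP the choice is governed by the anticipatory parameter eta: with probability eta the agent plays its RL best response, otherwise its average policy. A sketch (hypothetical standalone function, not the method itself):

```python
import random

def sample_episode_policy(anticipatory_param, rng=random):
    # With probability eta play the RL best response; otherwise the average policy.
    if rng.random() < anticipatory_param:
        return 'best_response'
    return 'average_policy'
```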

set_device(device)
step(state)

Returns the action to be taken.

Parameters

state (dict) – The current state

Returns

An action id

Return type

action (int)

train_sl()

Compute the loss on sampled transitions and perform an avg-network update.

If there are not enough elements in the buffer, no loss is computed and None is returned instead.

Returns

The average loss obtained on this batch of transitions or None.

Return type

loss (float)

class rlcard.agents.nfsp_agent.ReservoirBuffer(reservoir_buffer_capacity)

Bases: object

Allows uniform sampling over a stream of data.

This class supports the storage of arbitrary elements, such as observation tensors, integer actions, etc.

See https://en.wikipedia.org/wiki/Reservoir_sampling for more details.
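The core idea (Algorithm R) can be sketched in a few lines; this is an illustrative standalone class with hypothetical names, not the ReservoirBuffer source. After n calls to add, every element seen so far remains in the buffer with equal probability capacity / n.

```python
import random

class ReservoirSketch:
    """Algorithm R sketch: uniform sampling over a stream of unknown length."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = []
        self.add_calls = 0

    def add(self, element):
        self.add_calls += 1
        if len(self.data) < self.capacity:
            self.data.append(element)
        else:
            # Keep the new element with probability capacity / add_calls,
            # evicting a uniformly chosen resident if it is kept.
            idx = random.randint(0, self.add_calls - 1)
            if idx < self.capacity:
                self.data[idx] = element
```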

add(element)

Potentially adds element to the reservoir buffer.

Parameters

element (object) – data to be added to the reservoir buffer.

clear()

Clear the buffer

sample(num_samples)

Returns num_samples uniformly sampled from the buffer.

Parameters

num_samples (int) – The number of samples to draw.

Returns

An iterable over num_samples random elements of the buffer.

Raises

ValueError – If there are fewer than num_samples elements in the buffer

class rlcard.agents.nfsp_agent.Transition(info_state, action_probs)

Bases: tuple

property action_probs

Alias for field number 1

property info_state

Alias for field number 0

rlcard.agents.random_agent

class rlcard.agents.random_agent.RandomAgent(num_actions)

Bases: object

A random agent. Random agents are used for running toy examples on the card games

eval_step(state)
Predict the action given the current state for evaluation.

Since the random agents are not trained, this function is equivalent to the step function

Parameters

state (dict) – A dictionary that represents the current state

Returns

The action predicted (randomly chosen) by the random agent
probs (list): The list of action probabilities

Return type

action (int)

static step(state)

Predict the action given the current state when generating training data.

Parameters

state (dict) – A dictionary that represents the current state

Returns

The action predicted (randomly chosen) by the random agent

Return type

action (int)
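The whole behavior fits in one line; the sketch below assumes the dict-style state whose 'legal_actions' entry is keyed by action id, as in the eval_step description above (hypothetical function name, not the static method itself):

```python
import random

def random_step(state):
    # Pick uniformly among the legal actions; 'legal_actions' is assumed here
    # to be a dict keyed by action id, mirroring RLCard's dict-style states.
    return random.choice(list(state['legal_actions'].keys()))
```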