rlcard.agents

rlcard.agents.gin_rummy_human_agent

rlcard.agents.best_response_agent

class rlcard.agents.best_response_agent.BRAgent(env, policy)

Bases: object

Implement the best response algorithm, which computes a best response against a given policy

action_probs(state, policy)

Obtain the action probabilities of the current state

Parameters
  • state (dict) – The state dictionary

  • policy (dict) – The used policy

Returns

action_probs (numpy.array): The action probabilities
legal_actions (list): Indices of legal actions

Return type

(tuple) that contains

best_response_action(this_player, obs)

eval_step(state)

Given a state, predict action based on average policy

Parameters

state (numpy.array) – State representation

Returns

Predicted action

Return type

action (int)

get_q_value(action, q_value)

get_state(player_id)

Get state_str of the player

Parameters

player_id (int) – The player id

Returns

state (str): The state str
legal_actions (list): Indices of legal actions

Return type

(tuple) that contains

load()

Load model

save()

Save model

traverse_tree(probs, player_id)

Traverse the game tree and collect the information sets

Parameters
  • probs – The reach probability of the current node

  • player_id – The player to update the value

Returns

The expected utilities for all the players

Return type

state_utilities (list)

value(curr_player, state, this_player)

Returns the value of the specified state to the best-responder.

rlcard.agents.cfr_agent

class rlcard.agents.cfr_agent.CFRAgent(env, model_path='./cfr_model')

Bases: object

Implement the CFR (counterfactual regret minimization) algorithm
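
A minimal usage sketch is shown below. It assumes rlcard.make and the 'leduc-holdem' environment id, and that step-back is enabled when creating the environment (CFR rolls the game back during tree traversal); the exact keyword for enabling step-back and the number of iterations are illustrative, not prescribed by this API.

    import rlcard
    from rlcard.agents.cfr_agent import CFRAgent

    # Assumption: step-back must be enabled for CFR's tree traversal;
    # the exact keyword may differ across rlcard versions.
    env = rlcard.make('leduc-holdem', config={'allow_step_back': True})

    agent = CFRAgent(env, model_path='./cfr_model')
    for _ in range(100):   # arbitrary number of iterations
        agent.train()      # one iteration of CFR
    agent.save()           # save the model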

action_probs(obs, legal_actions, policy)

Obtain the action probabilities of the current state

Parameters
  • obs (str) – state_str

  • legal_actions (list) – List of legal actions

  • policy (dict) – The used policy

Returns

action_probs (numpy.array): The action probabilities
legal_actions (list): Indices of legal actions

Return type

(tuple) that contains

eval_step(state)

Given a state, predict action based on average policy

Parameters

state (numpy.array) – State representation

Returns

Predicted action

Return type

action (int)

get_state(player_id)

Get state_str of the player

Parameters

player_id (int) – The player id

Returns

state (str): The state str
legal_actions (list): Indices of legal actions

Return type

(tuple) that contains

load()

Load model

regret_matching(obs)

Apply regret matching

Parameters

obs (string) – The state_str
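
As background, regret matching reduces to normalizing the positive parts of the cumulative regrets, falling back to a uniform strategy when none are positive. A generic sketch of the rule follows; it is not this agent's internal code, and the regret vector shown is hypothetical.

    import numpy as np

    def regret_matching(regrets):
        """Map cumulative regrets to a strategy: each action is played in
        proportion to its positive regret; uniform if no regret is positive."""
        positive = np.maximum(regrets, 0)
        total = positive.sum()
        if total > 0:
            return positive / total
        return np.ones(len(regrets)) / len(regrets)

    # e.g. regrets [2.0, -1.0, 1.0] -> probabilities [2/3, 0, 1/3]
    print(regret_matching(np.array([2.0, -1.0, 1.0])))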

save()

Save model

train()

Do one iteration of CFR

traverse_tree(probs, player_id)

Traverse the game tree and update the regrets

Parameters
  • probs – The reach probability of the current node

  • player_id – The player to update the value

Returns

The expected utilities for all the players

Return type

state_utilities (list)

update_policy()

Update policy based on the current regrets

rlcard.agents.deep_cfr_agent

Implements Deep CFR Algorithm.

The implementation is derived from:

https://github.com/deepmind/open_spiel/blob/master/open_spiel/python/algorithms/deep_cfr.py

We modify the structure for single-player games and the rlcard package, and fix some bugs in the loss calculation.

See https://arxiv.org/abs/1811.00164.

The algorithm defines advantage and strategy networks that compute advantages used to perform regret matching across information sets and to approximate the strategy profile of the game. To train these networks, a fixed-size ring buffer (other data structures may be used) is used as memory to accumulate training samples.
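
A minimal construction sketch under some assumptions: a TensorFlow 1.x session, rlcard.make with the 'leduc-holdem' environment id, and step-back enabled for tree traversal (as with tabular CFR); the keyword arguments shown are just the documented defaults.

    import tensorflow as tf
    import rlcard
    from rlcard.agents.deep_cfr_agent import DeepCFR

    # Assumption: the environment must support step_back for traversal;
    # the exact keyword may differ across rlcard versions.
    env = rlcard.make('leduc-holdem', config={'allow_step_back': True})

    with tf.Session() as sess:
        agent = DeepCFR(sess, env,
                        policy_network_layers=(32, 32),
                        advantage_network_layers=(32, 32),
                        num_traversals=10,
                        num_step=40)
        sess.run(tf.global_variables_initializer())
        for _ in range(100):   # arbitrary number of iterations
            agent.train()      # traverse the tree and fit the networks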

class rlcard.agents.deep_cfr_agent.AdvantageMemory(info_state, iteration, advantage, action)

Bases: tuple

property action

Alias for field number 3

property advantage

Alias for field number 2

property info_state

Alias for field number 0

property iteration

Alias for field number 1

class rlcard.agents.deep_cfr_agent.DeepCFR(session, env, policy_network_layers=(32, 32), advantage_network_layers=(32, 32), num_traversals=10, num_step=40, learning_rate=0.0001, batch_size_advantage=16, batch_size_strategy=16, memory_capacity=10000000)

Bases: object

Implement the Deep CFR Algorithm.

See https://arxiv.org/abs/1811.00164.

Define all networks and sampling buffers/memories. Derive losses & learning steps. Initialize the game state and algorithmic variables.

Note: batch sizes default to None, implying that training over the full dataset in memory is done by default. To sample from the memories, you may set these values to something less than the full capacity of the memory.

action_advantage(state, player)

Returns action advantages for a single batch.

action_probabilities(state)

Returns an action probabilities dict for a single batch.

eval_step(state)

Predict the action given state for evaluation

Parameters

state (dict) – current state

Returns

an action id

Return type

action (int)

reinitialize_advantage_networks()

Reinitialize the advantage networks

simulate_other(player, state)

Simulate the action for other players

Parameters
  • player (int) – a player id

  • state (dict) – current state

Returns

an action id

Return type

action (int)

train()

Perform tree traversal and train the network

Returns

the trained policy network
average advantage loss (float): players' average advantage loss
policy loss (float): policy loss

Return type

policy_network (tf.placeholder)

class rlcard.agents.deep_cfr_agent.FixedSizeRingBuffer(replay_buffer_capacity)

Bases: object

ReplayBuffer of fixed size with a FIFO replacement policy.

Stored transitions can be sampled uniformly.

The underlying data structure is a ring buffer, allowing O(1) adding and sampling.
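
A small usage sketch built only on the documented add/sample/clear interface (the capacity and values are arbitrary):

    from rlcard.agents.deep_cfr_agent import FixedSizeRingBuffer

    buf = FixedSizeRingBuffer(replay_buffer_capacity=3)
    for i in range(5):
        buf.add(i)           # once full, the oldest element is overwritten
    print(buf.sample(2))     # two elements drawn uniformly from {2, 3, 4}
    buf.clear()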

add(element)

Adds element to the buffer.

If the buffer is full, the oldest element will be replaced.

Parameters

element – data to be added to the buffer.

clear()

Clear the buffer

sample(num_samples)

Returns num_samples uniformly sampled from the buffer.

Parameters

num_samples (int) – number of samples to draw.

Returns

a list of random sampled elements of the buffer

Return type

sample data (list)

Raises

ValueError – If there are fewer than num_samples elements in the buffer

class rlcard.agents.deep_cfr_agent.StrategyMemory(info_state, iteration, strategy_action_probs)

Bases: tuple

property info_state

Alias for field number 0

property iteration

Alias for field number 1

property strategy_action_probs

Alias for field number 2

rlcard.agents.dqn_agent

DQN agent

The code is derived from https://github.com/dennybritz/reinforcement-learning/blob/master/DQN/dqn.py

Copyright (c) 2019 DATA Lab at Texas A&M University Copyright (c) 2016 Denny Britz

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

class rlcard.agents.dqn_agent.DQNAgent(sess, scope, replay_memory_size=20000, replay_memory_init_size=100, update_target_estimator_every=1000, discount_factor=0.99, epsilon_start=1.0, epsilon_end=0.1, epsilon_decay_steps=20000, batch_size=32, action_num=2, state_shape=None, train_every=1, mlp_layers=None, learning_rate=5e-05)

Bases: object

copy_params_op(global_vars)

Copies the variables of one estimator to another.

Parameters

global_vars (list) – A list of tensors

eval_step(state)

Predict the action for evaluation purposes.

Parameters

state (numpy.array) – current state

Returns

an action id
probs (list): a list of probabilities

Return type

action (int)

feed(ts)

Store data into the replay buffer and train the agent. There are two stages: in stage 1, populate the memory without training; in stage 2, train the agent every several timesteps.

Parameters

ts (list) – a list of 5 elements that represent the transition
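
A minimal feeding-loop sketch is given below. It assumes rlcard.make, the 'blackjack' environment id, the env.action_num/env.state_shape attributes, and the env.set_agents/env.run helpers, all of which live outside this module and may differ across rlcard versions.

    import tensorflow as tf
    import rlcard
    from rlcard.agents.dqn_agent import DQNAgent

    env = rlcard.make('blackjack')   # assumed environment id

    with tf.Session() as sess:
        agent = DQNAgent(sess,
                         scope='dqn',
                         action_num=env.action_num,     # assumed env attribute
                         state_shape=env.state_shape,   # assumed env attribute
                         mlp_layers=[64, 64])
        sess.run(tf.global_variables_initializer())
        env.set_agents([agent])                         # assumed env helper

        for episode in range(1000):
            trajectories, _ = env.run(is_training=True)
            for ts in trajectories[0]:
                # Stage 1 only fills the memory; stage 2 trains every train_every steps.
                agent.feed(ts)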

feed_memory(state, action, reward, next_state, done)

Feed transition to memory

Parameters
  • state (numpy.array) – the current state

  • action (int) – the performed action ID

  • reward (float) – the reward received

  • next_state (numpy.array) – the next state after performing the action

  • done (boolean) – whether the episode is finished

predict(state)

Predict the action probabilities

Parameters

state (numpy.array) – current state

Returns

a 1-d array where each entry represents a Q value

Return type

q_values (numpy.array)

step(state)

Predict the action for generating training data

Parameters

state (numpy.array) – current state

Returns

an action id

Return type

action (int)

train()

Train the network

Returns

The loss of the current batch.

Return type

loss (float)

class rlcard.agents.dqn_agent.Estimator(scope='estimator', action_num=2, learning_rate=0.001, state_shape=None, mlp_layers=None)

Bases: object

Q-Value Estimator neural network. This network is used for both the Q-Network and the Target Network.

predict(sess, s)

Predicts action values.

Parameters
  • sess (tf.Session) – Tensorflow Session object

  • s (numpy.array) – State input of shape [batch_size, 4, 160, 160, 3]

Returns

Tensor of shape [batch_size, NUM_VALID_ACTIONS] containing the estimated action values.

update(sess, s, a, y)

Updates the estimator towards the given targets.

Parameters
  • sess (tf.Session) – Tensorflow Session object

  • s (list) – State input of shape [batch_size, 4, 160, 160, 3]

  • a (list) – Chosen actions of shape [batch_size]

  • y (list) – Targets of shape [batch_size]

Returns

The calculated loss on the batch.

class rlcard.agents.dqn_agent.Memory(memory_size, batch_size)

Bases: object

Memory for saving transitions

sample()

Sample a minibatch from the replay memory

Returns

a batch of states
action_batch (list): a batch of actions
reward_batch (list): a batch of rewards
next_state_batch (list): a batch of next states
done_batch (list): a batch of dones

Return type

state_batch (list)

save(state, action, reward, next_state, done)

Save transition into memory

Parameters
  • state (numpy.array) – the current state

  • action (int) – the performed action ID

  • reward (float) – the reward received

  • next_state (numpy.array) – the next state after performing the action

  • done (boolean) – whether the episode is finished

class rlcard.agents.dqn_agent.Transition(state, action, reward, next_state, done)

Bases: tuple

property action

Alias for field number 1

property done

Alias for field number 4

property next_state

Alias for field number 3

property reward

Alias for field number 2

property state

Alias for field number 0

rlcard.agents.dqn_agent.copy_model_parameters(sess, estimator1, estimator2)

Copies the model parameters of one estimator to another.

Parameters
  • sess (tf.Session) – Tensorflow Session object

  • estimator1 (Estimator) – Estimator to copy the parameters from

  • estimator2 (Estimator) – Estimator to copy the parameters to
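
For example, the periodic target-network synchronization controlled by update_target_estimator_every can be written with this helper. The function below is a sketch; q_estimator, target_estimator, and train_step are hypothetical names from the surrounding training code.

    from rlcard.agents.dqn_agent import copy_model_parameters

    def maybe_sync_target(sess, q_estimator, target_estimator, train_step, every=1000):
        """Sketch: copy the Q-network parameters into the target network
        every `every` training steps."""
        if train_step % every == 0:
            copy_model_parameters(sess, q_estimator, target_estimator)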

rlcard.agents.nfsp_agent

Neural Fictitious Self-Play (NFSP) agent implemented in TensorFlow.

See the paper https://arxiv.org/abs/1603.01121 for more details.

rlcard.agents.nfsp_agent.MODE

alias of rlcard.agents.nfsp_agent.mode

class rlcard.agents.nfsp_agent.NFSPAgent(sess, scope, action_num=4, state_shape=None, hidden_layers_sizes=None, reservoir_buffer_capacity=1000000, anticipatory_param=0.1, batch_size=256, train_every=1, rl_learning_rate=0.1, sl_learning_rate=0.005, min_buffer_size_to_learn=1000, q_replay_memory_size=30000, q_replay_memory_init_size=1000, q_update_target_estimator_every=1000, q_discount_factor=0.99, q_epsilon_start=0.06, q_epsilon_end=0, q_epsilon_decay_steps=1000000, q_batch_size=256, q_train_every=1, q_mlp_layers=None, evaluate_with='average_policy')

Bases: object

NFSP Agent implementation in TensorFlow.
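
A minimal multi-agent setup sketch follows, assuming a TensorFlow 1.x session, rlcard.make with the 'leduc-holdem' environment id, the env.player_num/env.action_num/env.state_shape attributes, and the env.set_agents/env.run helpers; the layer sizes are illustrative.

    import tensorflow as tf
    import rlcard
    from rlcard.agents.nfsp_agent import NFSPAgent

    env = rlcard.make('leduc-holdem')   # assumed environment id

    with tf.Session() as sess:
        agents = [NFSPAgent(sess,
                            scope='nfsp' + str(i),
                            action_num=env.action_num,      # assumed env attribute
                            state_shape=env.state_shape,    # assumed env attribute
                            hidden_layers_sizes=[128, 128],
                            q_mlp_layers=[128, 128])
                  for i in range(env.player_num)]           # assumed env attribute
        sess.run(tf.global_variables_initializer())
        env.set_agents(agents)                              # assumed env helper

        for episode in range(1000):
            for agent in agents:
                # Choose the average policy or best response for this episode.
                agent.sample_episode_policy()
            trajectories, _ = env.run(is_training=True)
            for i, agent in enumerate(agents):
                for ts in trajectories[i]:
                    agent.feed(ts)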

eval_step(state)

Use the average policy for evaluation purposes

Parameters

state (dict) – The current state.

Returns

An action id.
probs (list): The list of action probabilities

Return type

action (int)

feed(ts)

Feed data to inner RL agent

Parameters

ts (list) – A list of 5 elements that represent the transition.

sample_episode_policy()

Sample average/best_response policy

step(state)

Returns the action to be taken.

Parameters

state (dict) – The current state

Returns

An action id

Return type

action (int)

train_sl()

Compute the loss on sampled transitions and perform an avg-network update.

If there are not enough elements in the buffer, no loss is computed and None is returned instead.

Returns

The average loss obtained on this batch of transitions or None.

Return type

loss (float)

class rlcard.agents.nfsp_agent.ReservoirBuffer(reservoir_buffer_capacity)

Bases: object

Allows uniform sampling over a stream of data.

This class supports the storage of arbitrary elements, such as observation tensors, integer actions, etc.

See https://en.wikipedia.org/wiki/Reservoir_sampling for more details.
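
The key idea, shown as a generic standalone sketch rather than this class's internal code: keep the first k elements, then accept the n-th element with probability k/n so that every element of the stream is retained with equal probability.

    import random

    def reservoir_sample(stream, k):
        """Return k elements sampled uniformly from a stream of unknown length."""
        reservoir = []
        for n, element in enumerate(stream, start=1):
            if len(reservoir) < k:
                reservoir.append(element)
            elif random.random() < k / n:
                reservoir[random.randrange(k)] = element
        return reservoir

    print(reservoir_sample(range(10000), k=5))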

add(element)

Potentially adds element to the reservoir buffer.

Parameters

element (object) – data to be added to the reservoir buffer.

clear()

Clear the buffer

sample(num_samples)

Returns num_samples uniformly sampled from the buffer.

Parameters

num_samples (int) – The number of samples to draw.

Returns

An iterable over num_samples random elements of the buffer.

Raises

ValueError – If there are fewer than num_samples elements in the buffer

class rlcard.agents.nfsp_agent.Transition(info_state, action_probs)

Bases: tuple

property action_probs

Alias for field number 1

property info_state

Alias for field number 0

rlcard.agents.random_agent

class rlcard.agents.random_agent.RandomAgent(action_num)

Bases: object

A random agent. Random agents are for running toy examples on the card games.
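
A minimal usage sketch, assuming rlcard.make, the 'blackjack' environment id, the env.action_num attribute, and the env.set_agents/env.run helpers:

    import rlcard
    from rlcard.agents.random_agent import RandomAgent

    env = rlcard.make('blackjack')                     # assumed environment id
    agent = RandomAgent(action_num=env.action_num)     # assumed env attribute
    env.set_agents([agent])                            # assumed env helper

    trajectories, payoffs = env.run(is_training=False)
    print(payoffs)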

eval_step(state)

Predict the action given the current state for evaluation.

Since the random agent is not trained, this function is equivalent to the step function.

Parameters

state (dict) – A dictionary that represents the current state

Returns

The action predicted (randomly chosen) by the random agent
probs (list): The list of action probabilities

Return type

action (int)

static step(state)

Predict the action given the current state for generating training data.

Parameters

state (dict) – A dictionary that represents the current state

Returns

The action predicted (randomly chosen) by the random agent

Return type

action (int)