Overview¶
Design Principles¶
The toolkit wraps each game in an Env
class with easy-to-use interfaces. The goal of this toolkit is to enable
users to focus on algorithm development without worrying about the
environment. The following design principles are applied when developing
the toolkit:
- Reproducible. Results on the environments can be reproduced: the same random seed should yield the same result across different runs.
- Accessible. The experiences are collected and well organized after each game with easy-to-use interfaces. Users can conveniently configure the state representation, action encoding, reward design, or even the game rules.
- Scalable. New card environments can be added to the toolkit conveniently while following the above design principles. We also try to minimize the dependencies of the toolkit so that the code can be easily maintained.
RLCard High-level Design¶
This document introduces the high-level design for the environments, the games, and the agents (algorithms).
Environments¶
We wrap each game with an Env
class. The responsibility of Env
is to help you generate trajectories of the games. For developing
Reinforcement Learning (RL) algorithms, we recommend using the
following interfaces:

- set_agents: This function tells the Env which agents will be used to perform actions in the game. Different games may have a different number of agents. The input of the function is a list of Agent instances. For example, env.set_agents([RandomAgent(), RandomAgent()]) indicates that two random agents will be used to generate the trajectories.
- run: After setting the agents, this interface runs a complete trajectory of the game, calculates the reward for each transition, and reorganizes the data so that it can be directly fed into an RL algorithm (see the sketch below).
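Below is a minimal sketch of this workflow. The environment id 'leduc-holdem' and the RandomAgent, num_actions, and num_players names are assumptions that may vary across RLCard versions.

```python
# Minimal sketch: generate one game trajectory with set_agents and run.
import rlcard
from rlcard.agents import RandomAgent

env = rlcard.make('leduc-holdem')

# One random agent per player in the game.
agents = [RandomAgent(num_actions=env.num_actions) for _ in range(env.num_players)]
env.set_agents(agents)

# Run a complete game; trajectories hold the reorganized transitions,
# payoffs hold the final reward of each player.
trajectories, payoffs = env.run(is_training=False)
```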
For advanced access to the environment, such as traversal of the game tree, we provide the following interfaces:

- step: Given the current state, the environment takes one step forward and returns the next state and the next player.
- step_back: Takes one step backward; the environment will restore to the last state. step_back is turned off by default since it requires expensively recording previous states. To turn it on, set allow_step_back = True when making the environment (see the traversal sketch below).
- get_payoffs: At the end of the game, this function can be called to obtain the payoffs for each player.
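The following sketch shows how these interfaces could be combined for a depth-first traversal of the game tree. The environment id and the assumption that state['legal_actions'] yields action ids accepted by env.step may vary by RLCard version.

```python
# Sketch: depth-first game-tree traversal using step and step_back.
# Assumes the environment was created with allow_step_back=True.
import rlcard

env = rlcard.make('leduc-holdem', config={'allow_step_back': True})

def traverse(env, state, player_id):
    # Terminal state: read off the payoffs for every player.
    if env.is_over():
        return env.get_payoffs()
    for action in list(state['legal_actions']):
        next_state, next_player_id = env.step(action)
        traverse(env, next_state, next_player_id)
        env.step_back()  # restore the previous state before trying the next action
    return None

state, player_id = env.reset()
traverse(env, state, player_id)
```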
We also support single-agent mode and human mode. Examples can be found
in examples/
.

Single-agent mode: single-agent environments are developed by simulating the other players with pre-trained models or rule-based models. You can enable single-agent mode with
rlcard.make(ENV_ID, config={'single_agent_mode':True})
. The step
function will then return (next_state, reward, done)
, just as in common single-agent environments, and env.reset()
will reset the game and return the first state.
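A minimal sketch of this mode with a random policy is shown below. Whether a given environment supports single_agent_mode depends on the RLCard version; 'blackjack' is only an example id.

```python
# Sketch: one episode in single-agent mode with a random policy.
import random
import rlcard

env = rlcard.make('blackjack', config={'single_agent_mode': True})

state = env.reset()  # single-agent mode: reset returns only the first state
done = False
while not done:
    action = random.choice(list(state['legal_actions']))
    state, reward, done = env.step(action)

print('Episode reward:', reward)
```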
Games¶
Card games usually have similar structures. We abstract some common concepts of card games and follow the same design pattern, so that users and developers can easily dig into the code and change the rules for research purposes. Specifically, the following classes are used in all the games:
- Game: A game is defined as a complete sequence starting from one of the non-terminal states to a terminal state.
- Round: A round is a part of the sequence of a game. Most card games can be naturally divided into multiple rounds.
- Dealer: A dealer is responsible for shuffling and allocating a deck of cards.
- Judger: A judger is responsible for making major decisions at the end of a round or a game.
- Player: A player is a role who plays cards following a strategy.
To summarize, in one Game
, a Dealer
deals the cards to each
Player
. In each Round
of the game, a Judger
will make major
decisions about the next round and, at the end of the game, about the payoffs.
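A rough sketch of how a new game could be organized around this pattern is shown below, using a toy "highest hand wins" game. The class names, method names, and rules here are illustrative assumptions, not the exact RLCard base-class API.

```python
# Sketch of the Game / Dealer / Player / Judger / Round pattern on a toy game.
import random


class MyDealer:
    """Shuffles the deck and deals cards to players."""
    def __init__(self):
        self.deck = list(range(52))  # toy deck of 52 card ids
        random.shuffle(self.deck)

    def deal_card(self, player):
        player.hand.append(self.deck.pop())


class MyPlayer:
    """Holds the hand of one seat at the table."""
    def __init__(self, player_id):
        self.player_id = player_id
        self.hand = []


class MyJudger:
    """Makes the major decision at the end of the game: the payoffs."""
    @staticmethod
    def judge_game(players):
        winner = max(players, key=lambda p: sum(p.hand))
        return [1 if p is winner else -1 for p in players]


class MyRound:
    """Advances the game by one action and tracks whose turn it is."""
    def __init__(self, num_players):
        self.current_player = 0
        self.num_players = num_players

    def proceed_round(self, players, dealer, action):
        # A real game would interpret the action here; the toy game just deals.
        dealer.deal_card(players[self.current_player])
        self.current_player = (self.current_player + 1) % self.num_players
        return self.current_player


class MyGame:
    """One complete sequence from the initial state to a terminal state."""
    def __init__(self, num_players=2):
        self.dealer = MyDealer()
        self.players = [MyPlayer(i) for i in range(num_players)]
        self.round = MyRound(num_players)

    def step(self, action):
        return self.round.proceed_round(self.players, self.dealer, action)

    def is_over(self):
        return all(len(p.hand) >= 2 for p in self.players)

    def get_payoffs(self):
        return MyJudger.judge_game(self.players)
```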
Agents¶
We provide examples of several representative algorithms and wrap them
as Agent
classes to show how a learning algorithm can be connected to the
toolkit. The first example is DQN, a representative of the
Reinforcement Learning (RL) category. The second example is
NFSP, a representative of Reinforcement Learning (RL) with
self-play. We also provide CFR and DeepCFR, which belong to the Counterfactual
Regret Minimization (CFR) category. Other algorithms from these three
categories can be connected in similar ways.
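As a sketch, a new algorithm could be wrapped as an agent by mirroring the interface of RandomAgent: a step method used while generating trajectories, an eval_step method used during evaluation, and a use_raw flag. The exact requirements may differ across RLCard versions.

```python
# Sketch: wrapping a (here trivially random) policy as an agent that can be
# passed to env.set_agents. The step/eval_step/use_raw convention mirrors
# RandomAgent; exact requirements may differ across RLCard versions.
import random


class MyAgent:
    use_raw = False  # the agent consumes encoded states, not raw game states

    def step(self, state):
        """Pick an action while generating training trajectories."""
        return random.choice(list(state['legal_actions']))

    def eval_step(self, state):
        """Pick an action during evaluation; also return optional side info."""
        return self.step(state), {}
```

Such an agent can then be passed to env.set_agents alongside the built-in agents.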