The toolkit wraps each game with an Env class that provides easy-to-use interfaces. The goal of this toolkit is to let users focus on algorithm development without worrying about the environment. The following design principles were applied when developing the toolkit:
Reproducible. Results on the environments can be reproduced: running with the same random seed should produce the same result across different runs.
Accessible. The experiences are collected and well organized after each game and exposed through easy-to-use interfaces. Users can conveniently configure the state representation, action encoding, reward design, or even the game rules.
Scalable. New card environments can be added conveniently to the toolkit by following the above design principles. We also try to minimize the dependencies of the toolkit so that the code can be easily maintained.
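As a minimal sketch of the Reproducible principle, assuming all of a game's randomness is routed through a single seeded generator, the hypothetical SeededDealer below (not part of RLCard) produces identical deals whenever it is given the same seed:

```python
import random


class SeededDealer:
    """Hypothetical toy dealer (not RLCard's Dealer class).

    It owns a private random.Random instance instead of using the global
    random module, so the seed alone determines every deal.
    """

    def __init__(self, seed):
        self.rng = random.Random(seed)

    def deal(self, num_players, cards_per_player):
        # Build a standard 52-card deck, e.g. "AS" = ace of spades.
        deck = [rank + suit for suit in "SHDC" for rank in "A23456789TJQK"]
        self.rng.shuffle(deck)
        return [deck[i * cards_per_player:(i + 1) * cards_per_player]
                for i in range(num_players)]


# Two independent "runs" with the same seed yield the same hands.
run_a = SeededDealer(seed=42).deal(num_players=2, cards_per_player=3)
run_b = SeededDealer(seed=42).deal(num_players=2, cards_per_player=3)
```

Keeping the generator on the object, rather than seeding the global module, also keeps two environments in the same process from disturbing each other's random streams.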
RLCard High-level Design
This document introduces the high-level design for the environments, the games, and the agents (algorithms).
We wrap each game with an Env class. The responsibility of Env is to help you generate trajectories of the games. For developing Reinforcement Learning (RL) algorithms, we recommend using the following interfaces:
set_agents: This function tells the Env which agents will be used to perform actions in the game. Different games may have a different number of agents. The input of the function is a list of Agent objects. For example, env.set_agents([RandomAgent(), RandomAgent()]) indicates that two random agents will be used to generate the trajectories.
run: After the agents are set, this interface runs a complete trajectory of the game, calculates the reward for each transition, and reorganizes the data so that it can be fed directly into an RL algorithm.
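The set_agents/run workflow can be sketched with a toy environment. HighCardEnv, RandomAgent, and the state dictionary with a legal_actions key are hypothetical stand-ins written for illustration, not RLCard's actual classes, and the per-player transition layout (state, action, reward, next_state, done) is an assumption:

```python
import random


class RandomAgent:
    """Hypothetical agent: picks a uniformly random legal action."""

    def __init__(self, seed=None):
        self.rng = random.Random(seed)

    def step(self, state):
        return self.rng.choice(state["legal_actions"])


class HighCardEnv:
    """Hypothetical toy environment (not RLCard's Env) showing set_agents/run.

    Each player gets a card from 1 to 10 and acts once: 0 = keep, 1 = redraw.
    A unique highest card wins +1 and everyone else gets -1; if the top card
    is tied, everyone gets 0.
    """

    def __init__(self, num_players=2, seed=None):
        self.num_players = num_players
        self.rng = random.Random(seed)
        self.agents = None
        self.cards = []

    def set_agents(self, agents):
        # One agent per seat; list position = player id.
        assert len(agents) == self.num_players
        self.agents = agents

    def _state(self, player_id):
        return {"card": self.cards[player_id], "legal_actions": [0, 1]}

    def run(self):
        self.cards = [self.rng.randint(1, 10) for _ in range(self.num_players)]
        steps = []  # (player_id, state, action) in the order they happened
        for pid, agent in enumerate(self.agents):
            state = self._state(pid)
            action = agent.step(state)
            if action == 1:
                self.cards[pid] = self.rng.randint(1, 10)
            steps.append((pid, state, action))
        payoffs = self.get_payoffs()
        # Reorganize into per-player transitions. Each player acts once here,
        # so each transition is terminal and carries the final payoff.
        trajectories = [[] for _ in range(self.num_players)]
        for pid, state, action in steps:
            trajectories[pid].append(
                (state, action, payoffs[pid], self._state(pid), True))
        return trajectories, payoffs

    def get_payoffs(self):
        best = max(self.cards)
        winners = [i for i, c in enumerate(self.cards) if c == best]
        if len(winners) > 1:
            return [0] * self.num_players
        return [1 if i == winners[0] else -1 for i in range(self.num_players)]


env = HighCardEnv(num_players=2, seed=0)
env.set_agents([RandomAgent(seed=1), RandomAgent(seed=2)])
trajectories, payoffs = env.run()
```

The reorganization step is the part an RL algorithm cares about: the raw action log is split per player and each transition gets its reward attached, so it can be pushed straight into a replay buffer.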
For advanced access to the environment, such as traversal of the game tree, we provide the following interfaces:
step: Given the current state, the environment takes one step forward, and returns the next state and the next player.
step_back: Takes one step backward, restoring the environment to its previous state. step_back is turned off by default because it requires recording previous states, which is expensive. To turn it on, set allow_step_back = True when creating the environment.
get_payoffs: At the end of the game, this function can be called to obtain the payoffs for each player.
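To illustrate why step and step_back are useful for game-tree traversal, the sketch below explores every branch of a tiny game with step, undoes each move with step_back, and reads get_payoffs at the leaves to compute a minimax value. ToyNimEnv and minimax_value are hypothetical, not RLCard code; the environment saves a snapshot on a stack at every step, which is why keeping step_back enabled has a cost:

```python
class ToyNimEnv:
    """Hypothetical environment with step/step_back/get_payoffs.

    Two players alternately remove 1 or 2 tokens; whoever takes the last
    token wins (+1) and the other player loses (-1).
    """

    def __init__(self, tokens, allow_step_back=True):
        self.tokens = tokens
        self.current_player = 0
        self.allow_step_back = allow_step_back
        self.history = []  # (tokens, current_player) snapshots for step_back

    def get_state(self):
        legal = [a for a in (1, 2) if a <= self.tokens]
        return {"tokens": self.tokens, "legal_actions": legal}

    def is_over(self):
        return self.tokens == 0

    def step(self, action):
        if self.allow_step_back:
            self.history.append((self.tokens, self.current_player))
        self.tokens -= action
        self.current_player = 1 - self.current_player
        return self.get_state(), self.current_player

    def step_back(self):
        if not self.history:
            return False
        self.tokens, self.current_player = self.history.pop()
        return True

    def get_payoffs(self):
        # Only meaningful once is_over() is True: the winner is the player
        # who just moved, i.e. the one before current_player.
        winner = 1 - self.current_player
        return [1 if p == winner else -1 for p in (0, 1)]


def minimax_value(env, player):
    """Value of the current node for `player`, found by trying every action
    with step() and undoing it with step_back()."""
    if env.is_over():
        return env.get_payoffs()[player]
    mover = env.current_player
    values = []
    for action in env.get_state()["legal_actions"]:
        env.step(action)
        values.append(minimax_value(env, player))
        env.step_back()
    # Zero-sum: the mover maximizes its own value, the opponent minimizes it.
    return max(values) if mover == player else min(values)


env = ToyNimEnv(tokens=4)
value = minimax_value(env, player=0)
```

Because every step is paired with a step_back, the environment ends the traversal in exactly the state it started in, so one environment object can be reused instead of copying it at every node.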
We also support single-agent mode and human mode. Examples can be found
Single-agent mode: single-agent environments are developed by simulating the other players with pre-trained or rule-based models. You can enable single-agent mode with env.set_mode(single_agent_mode=True). The step function then returns (next_state, reward, done), just as in common single-agent environments, and env.reset() resets the game and returns the first state.
Human mode: we provide interfaces to play against the trained agents. You can enable human mode with env.set_mode(human_mode=True). The terminal then prints out the game information so that you can play against the agents.
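A minimal sketch of how a single-agent view can be built: the learner controls player 0, and a rule-based policy plays for player 1 inside step, so the learner only sees (next_state, reward, done) on its own turns. NimGame, rule_opponent, and SingleAgentNim are hypothetical illustrations, not RLCard code, and env.set_mode is not used here:

```python
class NimGame:
    """Hypothetical two-player game: take 1 or 2 tokens; taking the last wins."""

    def __init__(self, tokens=5):
        self.start_tokens = tokens

    def init_game(self):
        self.tokens = self.start_tokens
        self.current_player = 0
        return self.state()

    def state(self):
        legal = [a for a in (1, 2) if a <= self.tokens]
        return {"tokens": self.tokens, "legal_actions": legal}

    def step(self, action):
        self.tokens -= action
        self.current_player = 1 - self.current_player
        return self.state()

    def is_over(self):
        return self.tokens == 0

    def payoffs(self):
        winner = 1 - self.current_player  # the player who took the last token
        return [1 if p == winner else -1 for p in (0, 1)]


def rule_opponent(state):
    """Rule-based policy: leave a multiple of 3 tokens whenever possible."""
    for a in state["legal_actions"]:
        if (state["tokens"] - a) % 3 == 0:
            return a
    return state["legal_actions"][0]


class SingleAgentNim:
    """Hypothetical single-agent wrapper: learner = player 0, player 1 simulated."""

    def __init__(self, tokens=5, opponent=rule_opponent):
        self.game = NimGame(tokens)
        self.opponent = opponent

    def reset(self):
        return self.game.init_game()

    def step(self, action):
        state = self.game.step(action)
        if not self.game.is_over():
            # The simulated opponent replies immediately, hiding its turn.
            state = self.game.step(self.opponent(state))
        done = self.game.is_over()
        reward = self.game.payoffs()[0] if done else 0
        return state, reward, done


env = SingleAgentNim(tokens=5)
state = env.reset()
# Taking 2 leaves 3 tokens, a losing position for the opponent.
state, reward, done = env.step(2)
```

From the learner's side this behaves like a common single-agent environment: reset, then repeated step calls until done, with the game's payoff delivered as the final reward.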
Card games usually share similar structures. We abstract some common concepts in card games and follow the same design pattern across games, so that users and developers can easily dig into the code and change the rules for research purposes. Specifically, the following classes are used in all the games:
Game: A game is defined as a complete sequence starting from one of the non-terminal states to a terminal state.
Round: A round is a part of the sequence of a game. Most card games can be naturally divided into multiple rounds.
Dealer: A dealer is responsible for shuffling and allocating a deck of cards.
Judger: A judger is responsible for making major decisions at the end of a round or a game.
Player: A player is a role who plays cards following a strategy.
To summarize, in one Game, a Dealer deals the cards to each Player. In each Round of the game, a Judger makes the major decisions about the next round, and at the end of the game it decides the payoffs.
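The relationship between the five classes can be sketched with a toy high-card game. The class names mirror the concepts above, but their methods and the game rules are hypothetical and much simpler than RLCard's implementations:

```python
import random


class Player:
    """Holds a player's cards; here the 'strategy' is fixed (see Round)."""

    def __init__(self, player_id):
        self.player_id = player_id
        self.hand = []


class Dealer:
    """Shuffles a deck and allocates cards to players."""

    def __init__(self, rng):
        self.deck = list(range(2, 15)) * 4  # ranks 2..14 (ace high); suits ignored
        rng.shuffle(self.deck)

    def deal(self, player, num_cards):
        for _ in range(num_cards):
            player.hand.append(self.deck.pop())


class Round:
    """One round: every player plays the highest card left in hand."""

    def play(self, players):
        return {p.player_id: p.hand.pop(p.hand.index(max(p.hand)))
                for p in players}


class Judger:
    """Decides each round's winner and the payoffs at the end of the game."""

    def judge_round(self, played):
        best = max(played.values())
        winners = [pid for pid, card in played.items() if card == best]
        return winners[0] if len(winners) == 1 else None  # None = tied round

    def judge_game(self, wins):
        best = max(wins)
        leaders = [i for i, w in enumerate(wins) if w == best]
        if len(leaders) > 1:
            return [0] * len(wins)
        return [1 if i == leaders[0] else -1 for i in range(len(wins))]


class Game:
    """Ties the pieces together: deal, play the rounds, compute payoffs."""

    def __init__(self, num_players=2, num_rounds=3, seed=None):
        self.num_players = num_players
        self.num_rounds = num_rounds
        self.rng = random.Random(seed)

    def run(self):
        players = [Player(i) for i in range(self.num_players)]
        dealer, judger = Dealer(self.rng), Judger()
        for p in players:
            dealer.deal(p, self.num_rounds)
        wins = [0] * self.num_players
        for _ in range(self.num_rounds):
            winner = judger.judge_round(Round().play(players))
            if winner is not None:
                wins[winner] += 1
        return judger.judge_game(wins)


payoffs = Game(num_players=2, num_rounds=3, seed=7).run()
```

Changing a rule then means touching only one class: for example, a different scoring scheme lives entirely in Judger, while a different deck lives in Dealer.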
We provide examples of several representative algorithms and wrap them as Agent classes to show how a learning algorithm can be connected to the toolkit. The first example is DQN, a representative of the Reinforcement Learning (RL) category. The second example is NFSP, a representative of RL with self-play. We also provide CFR and DeepCFR, which belong to the Counterfactual Regret Minimization (CFR) category. Other algorithms from these three categories can be connected in similar ways.
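To show the general shape of connecting a learning algorithm as an Agent, the sketch below wraps tabular Q-learning, a deliberately simple swap-in rather than DQN, NFSP, or CFR. The method names step (training-time action), eval_step (evaluation-time action), and feed (learn from one transition), as well as the obs and legal_actions state keys, are assumptions chosen for illustration:

```python
import random
from collections import defaultdict


class TabularQAgent:
    """Hypothetical Agent wrapping tabular Q-learning (not RLCard's DQN agent).

    step()      -> action used while training (epsilon-greedy exploration)
    eval_step() -> action used for evaluation (greedy)
    feed()      -> one Q-learning update from (state, action, reward, next_state, done)
    """

    def __init__(self, num_actions, lr=0.1, gamma=0.99, epsilon=0.1, seed=None):
        self.lr, self.gamma, self.epsilon = lr, gamma, epsilon
        # Q-values per observation, initialized to 0 for every action.
        self.q = defaultdict(lambda: [0.0] * num_actions)
        self.rng = random.Random(seed)

    def _greedy(self, state):
        values = self.q[state["obs"]]
        return max(state["legal_actions"], key=lambda a: values[a])

    def step(self, state):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(state["legal_actions"])
        return self._greedy(state)

    def eval_step(self, state):
        return self._greedy(state)

    def feed(self, transition):
        state, action, reward, next_state, done = transition
        target = reward
        if not done:
            next_values = self.q[next_state["obs"]]
            target += self.gamma * max(next_values[a]
                                       for a in next_state["legal_actions"])
        values = self.q[state["obs"]]
        values[action] += self.lr * (target - values[action])


# Usage on a one-step toy problem: action 1 pays 1, action 0 pays 0.
agent = TabularQAgent(num_actions=2, epsilon=0.2, seed=0)
start = {"obs": "start", "legal_actions": [0, 1]}
for _ in range(500):
    a = agent.step(start)
    agent.feed((start, a, 1.0 if a == 1 else 0.0, start, True))
```

Separating step from eval_step lets the same object explore during training and act greedily during evaluation, while feed keeps the learning rule behind the interface so the environment never needs to know which algorithm it is driving.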