Gym Environment

Now that we’ve implemented the basic functionality of the game in Python, our next step is to wrap it in a gym.Env so that it can easily be used to train reinforcement learning models. As a starting point, we will be following the docs: https://www.gymlibrary.dev/content/environment_creation/.

They remind us to add the metadata attribute to specify the supported render modes (human, rgb_array or ansi) and the frame rate. Every environment should support the render mode None, so it doesn’t need to be listed explicitly.
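Sketched out, that attribute could look like the snippet below. The concrete values here are my guesses (only the text-based "human" mode, an arbitrary 4 fps), not what the final class necessarily uses:

```python
import gym


class MetadataSketch(gym.Env):
    # Hypothetical values: "human" rendering via the text __repr__,
    # and an arbitrary frame rate of 4 fps. None is supported
    # implicitly, so it is not listed.
    metadata = {"render_modes": ["human"], "render_fps": 4}
```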

As we have almost completely defined the environment before, we don’t need to add much to this class (we can inherit from the one we defined earlier); but we do have to explicitly define the attributes self.observation_space and self.action_space.
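Judging from the sampled observation shown below (boards of uint8 values 0–6, a dice value of 1–6, and integer actions 0–2), one plausible way to define those spaces is the following sketch. The exact Box/Discrete choices are my reconstruction, shown standalone here; in the class they would be assigned to self.observation_space and self.action_space (note that Discrete’s start argument requires a reasonably recent gym version):

```python
import numpy as np
import gym
from gym import spaces

# Each 3x3 board holds die values 0-6, where 0 means "empty".
observation_space = spaces.Dict(
    {
        "agent": spaces.Box(low=0, high=6, shape=(3, 3), dtype=np.uint8),
        "opponent": spaces.Box(low=0, high=6, shape=(3, 3), dtype=np.uint8),
        # The rolled dice shows a value from 1 to 6.
        "dice": spaces.Discrete(6, start=1),
    }
)
# The action is the column (0, 1 or 2) where the dice is placed.
action_space = spaces.Discrete(3)
```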


source

MatatenaEnv

 MatatenaEnv (**kwargs)

gym-ready implementation of Game.

matatena = MatatenaEnv()
matatena
Player 1 (0.0) | Player 2 (0.0) *
[[0. 0. 0.]    | [[0. 0. 0.]     
 [0. 0. 0.]    |  [0. 0. 0.]     
 [0. 0. 0.]]   |  [0. 0. 0.]]    
matatena.observation_space.sample()
OrderedDict([('agent',
              array([[2, 0, 3],
                     [1, 3, 0],
                     [5, 5, 3]], dtype=uint8)),
             ('dice', 1),
             ('opponent',
              array([[1, 1, 2],
                     [6, 6, 4],
                     [3, 2, 0]], dtype=uint8))])
matatena.action_space.sample()
0

Reset

The reset method will be called to initiate a new episode, and should also be called whenever the environment issues a done signal. It must accept a seed parameter.

It is recommended to use the random generator provided when inheriting from gym.Env (self.np_random), but we need to remember to call super().reset(seed=seed) to make sure that the environment is seeded correctly.

Finally, it must return a tuple of the initial observation and some auxiliary information (which will be None in our case).
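Putting those three requirements together, a rough standalone sketch of such a reset could look like this. It is illustrative only (the real MatatenaEnv also inherits the game logic from Game, and the observation layout is my assumption); the np_random generator API assumes gym ≥ 0.26:

```python
import gym
import numpy as np


class ResetSketch(gym.Env):
    """Illustrative only; the real MatatenaEnv also builds on Game."""

    def reset(self, seed=None, options=None):
        # Seed self.np_random through the parent class first.
        super().reset(seed=seed)
        # Empty boards for both players; roll the first dice with the
        # seeded generator so episodes are reproducible.
        self.boards = np.zeros((2, 3, 3))
        self.dice = int(self.np_random.integers(1, 7))
        observation = {
            "agent": self.boards[0],
            "opponent": self.boards[1],
            "dice": self.dice,
        }
        # Auxiliary info is None in our case.
        return observation, None
```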


source

MatatenaEnv.reset

 MatatenaEnv.reset (seed:int=None, options=None)

Reinitializes the environment and returns the initial state.

Type Default Details
seed int None Seed to control the RNG.
options NoneType None Additional options.
matatena = MatatenaEnv()
matatena
Player 1 (0.0) | Player 2 (0.0) *
[[0. 0. 0.]    | [[0. 0. 0.]     
 [0. 0. 0.]    |  [0. 0. 0.]     
 [0. 0. 0.]]   |  [0. 0. 0.]]    
matatena.reset()
({'agent': array([[0., 0., 0.],
         [0., 0., 0.],
         [0., 0., 0.]]),
  'opponent': array([[0., 0., 0.],
         [0., 0., 0.],
         [0., 0., 0.]]),
  'dice': 3},
 None)
matatena
Player 1 (0.0) | Player 2 (0.0) *
[[0. 0. 0.]    | [[0. 0. 0.]     
 [0. 0. 0.]    |  [0. 0. 0.]     
 [0. 0. 0.]]   |  [0. 0. 0.]]    

Step

The .step() method contains the logic of the environment. It must accept an action, compute the state of the environment after applying the action, and return a 4-tuple: (observation, reward, done, info).

In our case, the action should be the column in which the agent wants to place the rolled dice.
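The first thing step() has to do with that action is drop the rolled dice into the chosen column. The helper below is a hypothetical sketch of that piece (name and return convention are mine, not from the actual implementation): it fills the first empty slot from the top, matching the boards in the traces further down, and reports whether the move was valid, since the full-game example later suggests that choosing a full column terminates the episode rather than raising an error:

```python
import numpy as np


def place_die(board, col, dice):
    """Drop `dice` into the first empty slot of column `col`.

    Returns the (mutated) board and a flag telling the caller
    whether the placement was valid; a full column leaves the
    board untouched.
    """
    empty = np.flatnonzero(board[:, col] == 0)
    valid = empty.size > 0
    if valid:
        board[empty[0], col] = dice
    return board, valid
```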


source

MatatenaEnv.step

 MatatenaEnv.step (action)

Run one timestep of the environment’s dynamics.

When the end of an episode is reached, you are responsible for calling :meth:reset to reset this environment’s state. Accepts an action and returns a tuple (observation, reward, terminated, truncated, info).

Args:

  • action (ActType): an action provided by the agent

Returns:

  • observation (object): an element of the environment’s :attr:observation_space. This may, for instance, be a numpy array containing the positions and velocities of certain objects.
  • reward (float): the amount of reward returned as a result of taking the action.
  • terminated (bool): whether a terminal state (as defined under the MDP of the task) is reached. In this case further step() calls could return undefined results.
  • truncated (bool): whether a truncation condition outside the scope of the MDP is satisfied. Typically a time limit, but could also be used to indicate an agent physically going out of bounds. Can be used to end the episode prematurely before a terminal state is reached.
  • info (dictionary): contains auxiliary diagnostic information (helpful for debugging, learning, and logging). This might, for instance, contain: metrics that describe the agent’s performance state, variables that are hidden from observations, or individual reward terms that are combined to produce the total reward. It can also contain information that distinguishes truncation and termination; however, this is deprecated in favour of returning two booleans, and will be removed in a future version.

(deprecated)

  • done (bool): a boolean value for whether the episode has ended, in which case further :meth:step calls will return undefined results. A done signal may be emitted for different reasons: maybe the task underlying the environment was solved successfully, a certain time limit was exceeded, or the physics simulation has entered an invalid state.
Details
action Action to be executed on the environment. Should be the column in which the agent wants to place the dice.

Render

Lastly, all that is left is rendering the environment.

As we have previously built a quite decent __repr__ method, we are simply going to reuse it. It would be nice to get something nicer running with PyGame, though.
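The idea reduces to almost nothing: rendering in text mode is just printing the instance. The tiny stand-in class below illustrates this (its __repr__ is a hard-coded placeholder; the real one draws both boards and scores):

```python
class RenderSketch:
    """Minimal stand-in: render() simply prints __repr__."""

    def __repr__(self):
        # Placeholder; the real Game.__repr__ draws both boards.
        return "Player 1 (0.0) | Player 2 (0.0) *"

    def render(self):
        # print() calls __repr__ via __str__ fallback.
        print(self)
```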


source

MatatenaEnv.render

 MatatenaEnv.render ()

Compute the render frames as specified by render_mode attribute during initialization of the environment.

The set of supported modes varies per environment. (And some third-party environments may not support rendering at all.) By convention, if render_mode is:

  • None (default): no render is computed.
  • human: render returns None. The environment is continuously rendered in the current display or terminal. Usually for human consumption.
  • rgb_array: return a single frame representing the current state of the environment. A frame is a numpy.ndarray with shape (x, y, 3) representing RGB values for an x-by-y pixel image.
  • rgb_array_list: return a list of frames representing the states of the environment since the last reset. Each frame is a numpy.ndarray with shape (x, y, 3), as with rgb_array.
  • ansi: Return a string (str) or StringIO.StringIO containing a terminal-style text representation for each time step. The text can include newlines and ANSI escape sequences (e.g. for colors).

Note: Make sure that your class’s metadata ‘render_modes’ key includes the list of supported modes. It’s recommended to call super() in implementations to use the functionality of this method.

Usage

Simple usage examples.

env = MatatenaEnv()
obs, info = env.reset()
env.render()
print(f"Rolled dice is: {obs['dice']}")
Player 1 (0.0) | Player 2 (0.0) *
[[0. 0. 0.]    | [[0. 0. 0.]     
 [0. 0. 0.]    |  [0. 0. 0.]     
 [0. 0. 0.]]   |  [0. 0. 0.]]    
Rolled dice is: 4
action = env.action_space.sample()
print(f"Placing the dice in column: {action}")
obs, reward, done, info = env.step(action)
env.render()
Placing the dice in column: 1
Player 1 (0.0) * | Player 2 (4.0)
[[0. 0. 0.]      | [[0. 4. 0.]   
 [0. 0. 0.]      |  [0. 0. 0.]   
 [0. 0. 0.]]     |  [0. 0. 0.]]  

We can even play a full game:

env = MatatenaEnv()
obs, info = env.reset()
done = False

while not done:
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)
    env.render()
if info is not None: 
    print(info)
Player 1 (0.0) * | Player 2 (1.0)
[[0. 0. 0.]      | [[0. 1. 0.]   
 [0. 0. 0.]      |  [0. 0. 0.]   
 [0. 0. 0.]]     |  [0. 0. 0.]]  
Player 1 (3.0) | Player 2 (1.0) *
[[0. 3. 0.]    | [[0. 1. 0.]     
 [0. 0. 0.]    |  [0. 0. 0.]     
 [0. 0. 0.]]   |  [0. 0. 0.]]    
Player 1 (3.0) * | Player 2 (3.0)
[[0. 3. 0.]      | [[2. 1. 0.]   
 [0. 0. 0.]      |  [0. 0. 0.]   
 [0. 0. 0.]]     |  [0. 0. 0.]]  
Player 1 (8.0) | Player 2 (3.0) *
[[0. 3. 0.]    | [[2. 1. 0.]     
 [0. 5. 0.]    |  [0. 0. 0.]     
 [0. 0. 0.]]   |  [0. 0. 0.]]    
Player 1 (8.0) * | Player 2 (5.0)
[[0. 3. 0.]      | [[2. 1. 0.]   
 [0. 5. 0.]      |  [0. 2. 0.]   
 [0. 0. 0.]]     |  [0. 0. 0.]]  
Player 1 (13.0) | Player 2 (5.0) *
[[5. 3. 0.]     | [[2. 1. 0.]     
 [0. 5. 0.]     |  [0. 2. 0.]     
 [0. 0. 0.]]    |  [0. 0. 0.]]    
Player 1 (13.0) * | Player 2 (9.0)
[[5. 3. 0.]       | [[2. 1. 0.]   
 [0. 5. 0.]       |  [0. 2. 0.]   
 [0. 0. 0.]]      |  [0. 4. 0.]]  
Player 1 (19.0) | Player 2 (9.0) *
[[5. 3. 0.]     | [[2. 1. 0.]     
 [0. 5. 0.]     |  [0. 2. 0.]     
 [0. 6. 0.]]    |  [0. 4. 0.]]    
Player 1 (19.0) * | Player 2 (15.0)
[[5. 3. 0.]       | [[2. 1. 0.]    
 [0. 5. 0.]       |  [2. 2. 0.]    
 [0. 6. 0.]]      |  [0. 4. 0.]]   
Player 1 (19.0) * | Player 2 (15.0)
[[5. 3. 0.]       | [[2. 1. 0.]    
 [0. 5. 0.]       |  [2. 2. 0.]    
 [0. 6. 0.]]      |  [0. 4. 0.]]   
Terminated -> column full