Markov Decision Process for Tic Tac Toe

Check out the demo:

Check out the code:

Reinforcement learning is particularly compelling because of the flexibility it offers in modelling an environment's stimulus and response. In this post, we shall teach an agent to play Tic Tac Toe by having it continuously learn through playing against itself. There are many ways to achieve this, including minimax search and \alpha-\beta pruning. These methods model tic tac toe as if the agent is playing against a perfect player. We shall instead assume that players are imperfect, and have the agent learn the game by playing a lot of games against itself.

Note: We assume the reward for a move can be calculated by playing it and anticipating the opponent's response. We calculate rewards separately for each state, without exploiting the board's symmetry.

Markov Decision Process

We shall model the reinforcement learning problem in the following fashion: for every action the agent takes, the environment rewards (or penalizes) the agent, and the agent transitions from one state to another as a consequence of that action.

The goal of the agent is to take a series of actions that maximizes its final accumulated reward (also known as the value v_\pi(s)), obtained by taking action a when in state s.

This is achieved when the agent learns a policy \pi: s \rightarrow a, which determines what action a the agent takes in a given state s. A good policy helps the agent maximize the value; a bad policy does not.

Mathematically, we can define the value function as

v_\pi(s) = \sum_{\forall a \in A(s)}{\pi(s, a) * \sum_{s', r} p(s', r| s, a) * [r + \gamma v_\pi(s')]}


  • \gamma is the discount rate. Usually \gamma \leqslant 1, both to keep the solution convergent and to express how much we want the agent to weigh long-term reward (a \gamma close to 1) against short-term reward (a \gamma close to 0).
  • p(s', r | s, a) = p(S_{t+1}=s', R_{t+1}=r | S_t=s, A_t=a) defines probability of transition to state s' with a reward r when action a is taken on state s.
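To make this backup concrete, here is a minimal sketch of iterative policy evaluation on a toy two-state MDP. The states, transitions and rewards below are invented for illustration; this is not the tic-tac-toe model.

```python
# Minimal sketch: iterative policy evaluation on a toy two-state MDP.
# The states, transitions and rewards here are invented for illustration.
GAMMA = 0.9

# transitions[(s, a)] -> list of (next_state, reward, probability)
transitions = {
    ("s0", "go"): [("s1", 1.0, 1.0)],
    ("s1", "go"): [("s0", 0.0, 1.0)],
}
policy = {"s0": {"go": 1.0}, "s1": {"go": 1.0}}  # pi(s, a)

v = {"s0": 0.0, "s1": 0.0}
for _ in range(200):  # sweep until the values stop changing
    for s in v:
        v[s] = sum(
            pi_sa * sum(p * (r + GAMMA * v[s2])
                        for s2, r, p in transitions[(s, a)])
            for a, pi_sa in policy[s].items()
        )
```

At the fixed point v(s0) = 1 + \gamma v(s1) and v(s1) = \gamma v(s0), so v(s0) converges to 1/(1 - 0.81) ≈ 5.26.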

Modelling Tic Tac Toe

To learn to play tic tac toe, we shall define (S, A, R) in the following way:

  • States S: Each state represents a particular configuration of the game, i.e. what marks are on the tic tac toe board.
  • Actions A(s): Every state s \in S is associated with a set of actions A(s). Each action indicates where the player intends to place their mark (‘X’ or ‘O’).
  • Rewards R: Each player accrues a reward r for each action taken during the game, and naturally we should reward the winning player so that the agent learns to play moves that improve its chance of winning.
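As a sketch of this encoding (the helper names here are hypothetical, not the repo's API): a state is a 9-character string and an action is a board index 0–8.

```python
# Hypothetical helpers sketching the (S, A) encoding described above:
# a state is a 9-character string, an action is a board index 0-8.
def possible_actions(state: str) -> list:
    """Indices of empty cells ('-') where the current player may move."""
    return [i for i, cell in enumerate(state) if cell == '-']

def apply_action(state: str, action: int, mark: str) -> str:
    """Return the successor state after placing `mark` at index `action`."""
    assert state[action] == '-'
    return state[:action] + mark + state[action + 1:]

start = "---------"
s1 = apply_action(start, 0, 'X')   # 'X--------'
s2 = apply_action(s1, 4, 'O')      # 'X---O----'
```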

Implementation Details

We define the state as a string of nine values containing the markings at the 9 board positions. Each move corresponds to an action, which is represented as a number indicating the board index.

For every move, a player performs an action and the state of the tic-tac-toe board changes. The following figure shows a sequence of actions taken and the resulting states of the board.

Since there are two players, we will have to learn separate policies for player ‘X’ and player ‘O’.

Rewards R: The winning player receives a reward of 100 and the losing player receives a penalty of -100. We shall also give a small positive reward to a player who makes a move that prevents the opponent from winning on the next step.
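This reward scheme can be sketched as follows. The constants and the `wins` helper are illustrative, not the repo's exact API, and the blocking bonus is assumed here to be 10.

```python
# Sketch of the reward scheme described above. The constant names and the
# blocking bonus of 10 are assumptions for illustration.
WIN_REWARD, LOSS_PENALTY, BLOCK_REWARD = 100, -100, 10

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),      # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),      # columns
         (0, 4, 8), (2, 4, 6)]                 # diagonals

def wins(state: str, mark: str) -> bool:
    return any(all(state[i] == mark for i in line) for line in LINES)

def reward_for(prev: str, action: int, mark: str) -> int:
    opponent = 'O' if mark == 'X' else 'X'
    nxt = prev[:action] + mark + prev[action + 1:]
    if wins(nxt, mark):
        return WIN_REWARD
    # Blocking move: the opponent would have won by playing this square.
    blocked = prev[:action] + opponent + prev[action + 1:]
    if wins(blocked, opponent):
        return BLOCK_REWARD
    return 0
```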


When training the model we assume that both players are naive, and hence we use a random policy to start playing the game. To keep the implementation simple, we shall assume that players remain naive (random) throughout the training phase, and the final policy is decided based on the rewards accrued by the agent in each state.
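The self-play loop can be sketched like this (helper names are hypothetical). Each finished game backs its terminal reward up to every (state, action) pair the players visited, accumulating (reward sum, visit count) pairs in the spirit of Qvalue_Frequency; discounting and intermediate rewards are omitted here for brevity.

```python
# Sketch of random-policy self-play. Helper names are hypothetical; the
# terminal reward is backed up undiscounted to keep the sketch short.
import random

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

def winner(state):
    for mark in 'XO':
        if any(all(state[i] == mark for i in line) for line in LINES):
            return mark
    return None

def play_episode(rng):
    """One random-vs-random game; returns (winner, visited pairs per player)."""
    state, visited = "---------", {'X': [], 'O': []}
    for turn in range(9):
        mark = 'XO'[turn % 2]
        moves = [i for i, c in enumerate(state) if c == '-']
        a = rng.choice(moves)
        visited[mark].append((state, a))
        state = state[:a] + mark + state[a + 1:]
        if winner(state):
            return winner(state), visited
    return None, visited  # draw

# Accumulate (reward sum, count) per (state, action) pair
rng = random.Random(0)
q = {}
for _ in range(1000):
    w, visited = play_episode(rng)
    for mark in 'XO':
        r = 0 if w is None else (100 if mark == w else -100)
        for sa in visited[mark]:
            total, n = q.get(sa, (0, 0))
            q[sa] = (total + r, n + 1)
```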


We shall develop this game using numpy in Python, with React JS for the front end. Here are the data structures that store the states and the policy. The following State object snippet is extracted from the repo; see the repo for the full implementation.

# Player, State_Desc, Action, Qvalue_Frequency and RewardType are
# defined elsewhere in the repo.
import numpy as np
from typing import Dict, List

class State:
    def __init__(self, player: Player, state_desc: State_Desc):
        assert len(state_desc) == 9
        # Actions are available only at the empty ('-') cells
        self.possible_actions: List = [action[1].name for action in enumerate(Action)
                                       if state_desc[action[0]] == '-']
        # Each action maps to a (sum of accrued rewards, visit count) pair
        self.p: Dict = {action: Qvalue_Frequency((RewardType(0), 0))
                        for action in self.possible_actions}
        self.v_star = np.inf
        self.state_desc: State_Desc = state_desc
        self.policy = None  # Represents the best action to be taken
        self.player: Player = player

The state descriptor is a string of length 9 indicating the moves played so far in the game; the empty board is ---------. Every state has a set of actions A(s) associated with it. In the state object, p maps each action to a Qvalue_Frequency, a tuple that stores the sum of rewards accrued and the number of times the action was taken while the model is being trained. self.policy (\pi^*) is an integer indicating the action to take in the given state for maximum reward, and v^* is the expected reward of the best policy.

Calculation of perceived reward for a given state

Let’s revisit the equation for computing the expected value v_\pi(s) in a given state s.

v_\pi(s) = \sum_{\forall a \in A(s)}{\pi(s, a) * \sum_{s', r} p(s', r| s, a) * [r + \gamma v_\pi(s')]}

The above equation states that v_\pi is the policy-weighted sum of discounted rewards over all actions. The best policy \pi^*, however, encourages the agent to take the action for which the perceived reward is maximum in each state.

The above equation can be re-written to describe the perceived value of a state under the best policy \pi^*.

v_{\pi^*}^{*}(s) = {\sum_{s', r} p(s', r| s, a=\pi^*(s)) * [r + \gamma v_{\pi^*}^{*}(s')]}

This equation can be further simplified in the case of tic-tac-toe. In a given state, when the agent takes a specific action, it moves to one specific state. That is,

p(X--------, r=0|---------, a=0) = 1.

All other transitions from the state --------- for the given action a=0 will have a transition probability of 0. Therefore,

p( (any\ other\ state), (any\ reward)|---------, a=0) = 0.

Therefore value function can be simplified as,

v_{\pi^*}^{*}(s) = r + \gamma v_{\pi^*}^{*}(s'), where s' is the state reached by taking the best action \pi^*(s).
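Numerically, this means values can be computed by walking an episode backwards from the terminal state. A small sketch with illustrative per-move rewards:

```python
# With deterministic transitions, the backup collapses to
# v(s_t) = r_t + gamma * v(s_{t+1}); walk the episode backwards.
GAMMA = 0.9
rewards = [0, 0, 10, 100]  # illustrative per-move rewards along one game

values = []
v_next = 0.0
for r in reversed(rewards):
    v_next = r + GAMMA * v_next
    values.append(v_next)
values.reverse()
# values[0] == 0 + 0.9 * (0 + 0.9 * (10 + 0.9 * 100))
```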

Note: Multi-player game

In this game, both players play against each other and each keeps track of the rewards gained as a consequence of their moves. A win for one player is a loss for the opponent. Therefore, at the end of the game, the reward is backtracked to all states visited during the game.

Best policy

After multiple such episodes, for every state the agent computes the best action by taking the average of rewards accrued for each possible action. The best action is the one with the maximum average reward.
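A sketch of this selection step; the dictionary contents below are illustrative.

```python
# Pick the action with the highest average accrued reward from
# (reward sum, visit count) pairs; the numbers here are illustrative.
q = {0: (500, 10), 4: (900, 10), 8: (-200, 5)}  # action -> (reward sum, count)

def best_action(q):
    """Return the action whose average accrued reward is maximal."""
    return max(q, key=lambda a: q[a][0] / q[a][1])
```

Here action 4 wins, since its average reward of 90 beats 50 and -40.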

Let’s play.


