6: Reinforcement learning, part 3
April 21, 2025
It’s been a little while since I’ve written anything worthwhile here, so I’ll try to be less preachy and just describe what I’ve been working on.
The Data
It isn't quite right to describe the training process as two clean stages (one on existing data, one on self-play), because the model could suck at either or both tasks, and then you'd have to fix the model and rerun it through both stages again. This repeats as you gradually adjust the model into something functional.
I’ve got 10 rounds of real-world gameplay, amounting to the experiences of 68 players, each of which could be leveraged. But I’ve also introduced a method of data creation: shuffling. Referring back to our discussion of data encoding and representations, I can shuffle the ordering of the sets and the ordering of cards within each set (the values used to generate those encodings), yielding up to 9!×(6!)⁹ permutations of a single game. Additionally, I can shuffle the seating order of a player’s teammates and opponents, which doesn’t affect game play but helps the model generalize further. So out of the 68 experiences, we can create an effectively unlimited amount of data for real-world training.
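To make the shuffling concrete, here is a minimal sketch of how one such augmentation could be generated. It assumes a card is encoded as 6 × set_index + position_in_set and that hands are multi-hot vectors over the 54 cards; the encoding and every function name here are illustrative assumptions, not the actual code.

```python
import random
import numpy as np

NUM_SETS, CARDS_PER_SET = 9, 6   # 54 cards total

def random_card_permutation() -> np.ndarray:
    """Permutation of the 54 card indices that shuffles the order of the sets
    and of the cards within each set: 9! * (6!)^9 possibilities."""
    perm = np.concatenate([
        CARDS_PER_SET * s + np.random.permutation(CARDS_PER_SET)
        for s in np.random.permutation(NUM_SETS)
    ])
    return perm  # perm[i] = original card index that gets relabelled as card i

def shuffle_hand_encoding(hand: np.ndarray, perm: np.ndarray) -> np.ndarray:
    """Relabel a 54-dim multi-hot hand encoding according to the permutation."""
    return hand[perm]

def shuffle_seating(teammates: list, opponents: list) -> list:
    """Shuffle teammates among themselves and opponents among themselves;
    team membership, and therefore game play, is unchanged."""
    return random.sample(teammates, len(teammates)) + random.sample(opponents, len(opponents))
```

The same permutation would be applied consistently to every encoded hand and event within one game so the augmented game stays internally consistent.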
Furthermore, I have created the pipeline for self-play, which works as follows (a rough code sketch appears after the list):
- A game object is created and hands are dealt randomly.
- A player is arbitrarily designated to go first; it is considered their turn.
- After any action occurs in the game, a single agent is prompted to generate action Q-values for each player based on the information available to that player.
- If any player wishes to call set (i.e. their Q-value for calling set is greater than that for not calling set), that player is allowed to call set, with the set and the card assignments determined by the other Q-values. If multiple players wish to call set, the player with the greatest set-calling Q-value is given priority.
- If no player is calling set, the player whose turn it is will ask for a card from an opponent according to the Q-values generated for asking.
- The turn is then reassigned if necessary (i.e. if it becomes the opponent’s turn).
- Steps 3-6 are repeated until one team is completely out of cards, after which the remaining players must call set.
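Putting those steps together, the self-play loop might look roughly like the sketch below. Game, Agent, and every method used here (deal, observation, call_set, ask, and so on) are assumed names standing in for the real implementation.

```python
def self_play_episode(game, agent):
    game.deal()                             # step 1: hands dealt randomly
    game.current_player = game.players[0]   # step 2: arbitrary first player

    while not game.some_team_out_of_cards():
        # Step 3: one agent produces Q-values for every player, each computed
        # only from the information available to that player.
        q = {p: agent.q_values(game.observation(p)) for p in game.players}

        # Step 4: players whose "call set" Q-value exceeds "don't call" may call;
        # the one with the largest call Q-value gets priority.
        callers = [p for p in game.players if q[p].call > q[p].no_call]
        if callers:
            caller = max(callers, key=lambda p: q[p].call)
            game.call_set(caller, q[caller].best_set_assignment())
        else:
            # Step 5: otherwise the current player asks an opponent for a card.
            asker = game.current_player
            opponent, card = q[asker].best_ask()
            game.ask(asker, opponent, card)

        # Step 6: reassign the turn if necessary.
        game.update_turn()

    # Step 7: one team is out of cards; the remaining players must call
    # whatever sets are left.
    game.force_remaining_calls()
    return game.history
```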
Helping
During self-play, the model might begin poorly, picking actions that aren’t actually valuable (e.g. asking for the wrong card repeatedly). However, out of all possible actions an agent can take, the set of actions that actually help earn sets is very small, and it is foolish to expect the model to stumble upon them by chance. As such, I’ve included a component of “helping” the agents by overriding the agent’s intended action and interjecting with choices that could help the model learn (sketched in code after the list):
- Helping the agent actually obtain a card from the opponent, rather than asking for a card the opponent doesn’t have. This help occurs at a rate of 0.2-0.8, depending on settings.
- Helping the agent call set correctly when a team holds an entire set in their combined hands (the set is “monopolized”). This is triggered whenever a team obtains an entire set.
These are essentially training wheels for teaching via self-play, the frequencies of which could be reduced over time as the model gains more innate ability.
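A minimal sketch of what those two overrides could look like, again against a hypothetical game interface (opponents_of, hand, legal_ask, owner, sets, and call_set are made-up names, not the actual code):

```python
import random

def maybe_override_ask(game, asker, proposed_ask, help_ask_rate=0.5):
    """With probability help_ask_rate (0.2-0.8 depending on settings), replace
    the agent's chosen ask with one guaranteed to win a card, if one exists."""
    if random.random() < help_ask_rate:
        winning_asks = [
            (opponent, card)
            for opponent in game.opponents_of(asker)
            for card in game.hand(opponent)
            if game.legal_ask(asker, card)  # e.g. asker holds another card of that set
        ]
        if winning_asks:
            return random.choice(winning_asks)
    return proposed_ask

def force_monopolized_calls(game, team):
    """Whenever a team holds an entire set between them ("monopolized"),
    force a correct call with the true card assignment."""
    for s in game.sets:
        holders = {game.owner(card) for card in s.cards}
        if holders and holders <= set(team.players):
            caller = next(iter(holders))
            game.call_set(caller, {card: game.owner(card) for card in s.cards})
```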
The Model
I went and reworked the model. The problem? It seemed too big a leap to get from game events → action directly, even with many hidden layers (“thinking processes”). In an actual game of fish, players typically use game events to infer other players’ hands, and based on the inferred hands, decide what action to take next. This adds an intermediate step: game events → other players’ hands → action. So if we have our agent follow this intermediate step, it could potentially produce more informed actions. We thus break our single Q-network into two separate networks:
- One LSTM neural network which takes the sequence of previous game states and outputs a prediction of other players’ hands as an 8×54 array. This is essentially a classification task, categorizing each of the 54 cards into one of the 6 or 8 players.
The LSTM allows the network to take in the events that have happened in the game so far one at a time, rather than all previous events at once, updating its hidden state after each time step (event) to track sequential information.
- Another normal Q-network which outputs the action Q-values, but with the input instead being that 8×54 array encoding other players’ predicted hands (during testing) or actual hands (during training).
Thus, to return the Q-values after any event, the LSTM hand-prediction model is applied first, and its output is then fed into the revised Q-network.
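Here is a minimal PyTorch sketch of that two-network structure. The layer sizes, the event-encoding dimension, and the size of the action space are illustrative assumptions rather than the actual hyperparameters.

```python
import torch
import torch.nn as nn

NUM_PLAYERS, NUM_CARDS = 8, 54

class HandPredictor(nn.Module):
    """LSTM over the sequence of encoded game events; for each of the 54 cards,
    outputs a score for each of the 8 possible holders (a classification task)."""
    def __init__(self, event_dim=64, hidden_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(event_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, NUM_PLAYERS * NUM_CARDS)

    def forward(self, events):                 # events: (batch, time, event_dim)
        _, (h, _) = self.lstm(events)          # hidden state after the last event
        logits = self.head(h[-1])              # (batch, 8 * 54)
        return logits.view(-1, NUM_PLAYERS, NUM_CARDS)

class QNetwork(nn.Module):
    """Feed-forward Q-network over the (predicted or true) 8x54 hand array."""
    def __init__(self, num_actions=256, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(NUM_PLAYERS * NUM_CARDS, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_actions),
        )

    def forward(self, hands):                  # hands: (batch, 8, 54)
        return self.net(hands)

def q_values_from_events(events, hand_model, q_model):
    """Inference path: game events -> predicted hands -> action Q-values."""
    hands = hand_model(events).softmax(dim=1)  # per-card distribution over players
    return q_model(hands)
```

During training, the Q-network would be fed the true 8×54 hands straight from the game state, as described above, while the hand predictor can be trained separately as a per-card classification problem.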