Week 4: Creating the Official Environment
March 25, 2024
Hello and welcome back to my blog! This week, I encountered more difficulties than usual, as I had to create an entire custom Gym environment from scratch based on my original game. Here’s an overview of what I worked on:
__init__: As in other Python classes, this is the constructor. For a Gym environment, it’s where the action space, observation space, and render mode (how the environment is displayed onscreen) are defined. Other optional parameters can be added, such as the “sab” flag in the sample Blackjack environment for Sutton and Barto rules, but I didn’t include any.
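To make that concrete, here’s a minimal sketch of what such a constructor can look like with the gymnasium API (the newer fork of Gym). The class name AmbulanceEnv, the four-action space, and the number of humanoid types are placeholders rather than the exact values from my game:

```python
import gymnasium as gym  # the classic `gym` package is structured similarly
from gymnasium import spaces


class AmbulanceEnv(gym.Env):
    """Placeholder name for the custom environment."""

    metadata = {"render_modes": ["human"], "render_fps": 4}

    def __init__(self, render_mode=None):
        # Placeholder action space: one discrete choice per in-game button
        self.action_space = spaces.Discrete(4)

        # Observation: (humanoids loaded, current humanoid, minutes remaining)
        self.observation_space = spaces.Tuple((
            spaces.Discrete(11),   # 0-10 humanoids in the ambulance
            spaces.Discrete(5),    # placeholder count of humanoid types
            spaces.Discrete(721),  # 0-720 minutes left in the round
        ))

        assert render_mode is None or render_mode in self.metadata["render_modes"]
        self.render_mode = render_mode
```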
_get_obs: This function returns the current observation, which is useful when training/updating an agent. For my environment, the observation is a tuple of 3 values: the number of humanoids in the ambulance (out of 10), the humanoid currently being displayed, and the amount of time remaining in the round (out of 720 minutes).
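As a rough sketch (the attribute names are placeholders), the method simply packages those three state variables into a tuple inside the class above:

```python
def _get_obs(self):
    # Bundle the three state variables into the observation tuple
    return (self.num_in_ambulance, self.current_humanoid, self.time_remaining)
```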
reset: True to its name, reset returns the environment to its original configuration. It’s called when the environment is first created and again whenever an episode terminates, to set up the next one. In my case, the reset function empties the ambulance and resets the timer to 720 minutes.
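A sketch of what that looks like under the gymnasium API, where reset returns both an observation and an info dictionary (the _spawn_humanoid helper is a placeholder for however the game picks the next humanoid):

```python
def reset(self, seed=None, options=None):
    super().reset(seed=seed)       # gymnasium seeds self.np_random here
    self.num_in_ambulance = 0      # empty the ambulance
    self.time_remaining = 720      # restart the 720-minute clock
    self.current_humanoid = self._spawn_humanoid()  # placeholder helper
    return self._get_obs(), {}     # gymnasium expects (observation, info)
```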
step: Arguably the most important function in the entire class, step contains all of the “logic” in the environment. Each time the agent chooses an action from the action space, that action is passed to step, which assigns a reward based on the resulting state. Step also determines whether the round is terminated and can update the state variables that make up the observation. For example, in my environment, each action takes a certain amount of time, so step subtracts the corresponding time from the time remaining.
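Here’s a simplified sketch of that structure; the per-action time costs and the _score_action helper are placeholders, not the real rules from the game:

```python
def step(self, action):
    # Placeholder per-action time costs, in minutes
    time_cost = {0: 30, 1: 45, 2: 15, 3: 60}[action]

    self.time_remaining -= time_cost
    reward = self._score_action(action)  # placeholder scoring helper

    # The round ends once the 720 minutes are used up
    terminated = self.time_remaining <= 0
    if not terminated:
        self.current_humanoid = self._spawn_humanoid()

    return self._get_obs(), reward, terminated, False, {}
```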
I faced some minor difficulties with debugging, since the original game’s scorekeeper file (the equivalent of step) wasn’t structured in a way that translates directly to the Gym environment. One main discrepancy is that scorekeeper can disable buttons when there isn’t enough time left for an action, but it would be highly impractical to remove and re-add actions to the action space. I settled on imposing a heavy penalty each time the agent attempts an action it doesn’t have enough time for, which should quickly discourage it from doing so (or from suggesting such an action to the player later on).
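In code, that amounts to a guard near the top of step, along these lines (the exact penalty value is a placeholder):

```python
# Early in step(): if the chosen action costs more time than is left,
# return a large negative reward instead of carrying the action out
if time_cost > self.time_remaining:
    return self._get_obs(), -100, False, False, {}
```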
render: Though render isn’t strictly necessary, it’s quite useful for providing a visual representation of the environment and the agent as it trains. However, render was also the most difficult function to write, since I don’t have much experience with the pygame module. I was also unable to use the original game as a reference, since it was written with tkinter, which differs considerably from pygame. At this point, the render function is the only one that remains incomplete, but I plan to use it to display the current humanoid and potentially also the ambulance capacity and time remaining.
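Since the function is still unfinished, the snippet below is only a rough sketch of the direction I’m heading with pygame: drawing the current humanoid along with the ambulance count and time remaining as simple text (the window size, font, and attribute names are placeholders):

```python
import pygame


def render(self):
    if self.render_mode != "human":
        return
    if not hasattr(self, "window"):
        pygame.init()
        self.window = pygame.display.set_mode((400, 300))
        self.font = pygame.font.SysFont(None, 28)

    self.window.fill((255, 255, 255))
    # Draw the values mentioned above as plain text for now
    lines = [
        f"Current humanoid: {self.current_humanoid}",
        f"Ambulance: {self.num_in_ambulance}/10",
        f"Time remaining: {self.time_remaining} min",
    ]
    for i, text in enumerate(lines):
        surface = self.font.render(text, True, (0, 0, 0))
        self.window.blit(surface, (20, 20 + 30 * i))
    pygame.display.flip()
```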
Literature Review
This week, I began reading Sutton and Barto’s Reinforcement Learning: An Introduction. Though I didn’t have much time to read amidst all of the coding and debugging, it was interesting to learn about policy optimization from a theoretical point of view. So far, my encounters with optimization have all been code-based, such as finding the number of training episodes that produces a good Q-table. The textbook was useful in illustrating exactly how differences in episode count and learning rate can produce different policies.
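For reference, the standard tabular Q-learning update (from Sutton and Barto’s chapter on temporal-difference learning) shows where the learning rate enters; the alpha and gamma values below are only illustrative defaults, and Q is assumed to be a NumPy array indexed by state and action:

```python
import numpy as np


def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    # One tabular Q-learning step:
    # Q(s, a) <- Q(s, a) + alpha * [r + gamma * max_a' Q(s', a') - Q(s, a)]
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
```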
Works cited:
Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
https://www.andrew.cmu.edu/course/10-703/textbook/BartoSutton.pdf