Week 4: First Contact
March 24, 2026
Building the initial codebase and passing the baseline tests. Still a long way from a successful run, however. Visit mazzola.dev for nicer formatting
01 — From Design to Code
Welcome back. For the past three weeks I have been reading and planning; this week I began building. I now have a running codebase with a Gymnasium environment wrapping Basilisk, a PPO training pipeline, and a validation script, and I have completed a 200,000-timestep training run on a fixed LEO-to-GEO transfer. Its trajectories are not quite right, however. This post covers the most significant design decisions I made while building the environment.
| Component | Contents | Notes |
|---|---|---|
| envs/hohmann_env.py | HohmannTransferEnv, Gymnasium wrapper for Basilisk 6-DOF simulation | Core environment |
| envs/env_config.py | SpacecraftConfig, ThrusterConfig, OrbitConfig, EnvConfig dataclasses | Centralised configuration |
| utils/bsk_utils.py | build_simulation(), get_cartesian_state(), safe_rv2elem() | Basilisk integration layer |
| main_train.py / main_validate.py | PPO training loop (SB3) and post-training evaluation with trajectory plots | Entry points |
| tests/ (2 modules) | test_env_basic.py, Gymnasium API contract; test_env_oracle.py, physics validation | All passing |
| Debug training run | 200k timesteps, fixed LEO→GEO orbit, single environment, seed 42 | First results |
02 — The MDP: State, Action, and a Design Choice Resolved
Blog 3 laid out the conceptual mapping from Basilisk to a Gymnasium environment. This week I locked in the concrete formulation. The three fundamental choices were what the agent observes, what it controls, and how much simulated time Basilisk advances per decision step.
The Observation Space
Last week’s open question was whether to observe Cartesian state or Keplerian elements. I chose both. The 16-dimensional observation combines normalised ECI position and velocity with the orbital elements most useful for interpreting orbit shape, specifically semi-major axis a, eccentricity e, and the angular elements encoded as sines and cosines to avoid branch-cut discontinuities. Everything is scaled to roughly [-3, 3] for training stability.
# 16-dimensional normalised observation vector
# o = [ r/r_s, v/v_c, a/r_s, e, sin(nu), cos(nu), sin(omega), cos(omega),
#       m/m_0, k/N, r_tgt/r_s, e_tgt ]
#
# r_s = 7 x R_earth ≈ 44 590 km (GEO ≈ 0.95, LEO ≈ 0.16 in normalised units)
# v_c = √(μ / r_target) ≈ 3.07 km/s at GEO
# k / N = step fraction — tells the agent how much time it has left
Notice that I included the target SMA and eccentricity directly in the observation. Without them, the policy network can never know what orbit it is trying to reach, which means it cannot generalise to the randomised-orbit curriculum I plan to enable later. The agent needs to read its goal, not memorise it.
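To make the layout concrete, here is a minimal sketch of how such a 16-dimensional vector could be assembled. The function name `make_obs` and the specific constants are my illustration, not the actual `_get_obs()` in the repository; the scales follow the comment block above.

```python
import numpy as np

MU = 3.986004418e14       # Earth's gravitational parameter, m^3/s^2
R_EARTH = 6.371e6         # m
R_S = 7 * R_EARTH         # position/SMA scale from the comment above (~44,600 km)

def make_obs(r, v, a, e, nu, omega, m, m0, k, N, a_tgt, e_tgt):
    """Assemble the 16-dim normalised observation (illustrative sketch)."""
    v_c = np.sqrt(MU / a_tgt)                 # circular speed at the target SMA
    obs = np.concatenate([
        np.asarray(r, dtype=float) / R_S,     # 3: ECI position
        np.asarray(v, dtype=float) / v_c,     # 3: ECI velocity
        [a / R_S, e],                         # 2: orbit shape
        [np.sin(nu), np.cos(nu)],             # 2: true anomaly, branch-cut free
        [np.sin(omega), np.cos(omega)],       # 2: argument of perigee
        [m / m0, k / N],                      # 2: mass fraction, time fraction
        [a_tgt / R_S, e_tgt],                 # 2: the goal itself
    ])
    # clip as the final operation so degenerate states stay in bounds
    return np.clip(obs, -3.0, 3.0).astype(np.float32)
```

The final clip is the same trick mentioned in the testing section below: it guarantees the declared `observation_space` bounds hold even when elements go singular.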
The Action Space: Burn On-Times
I went with burn on-times rather than delta-v vectors. Last week’s question was whether to have the agent command an abstract velocity change or a physical firing duration. I chose the latter for two reasons. First, it maps directly to Basilisk’s THRArrayOnTimeCmdMsg with no additional controller layer. Second, every burn depletes the tank in a way that shows up in the m/m_0 observation term, and the agent is responsible for that tradeoff.
# Action → burn duration → propellant consumed
# t_burn = burn_frac × 0.9 × Δt_step
#     10% coast margin at end of every step for attitude to settle
# Δt_step = 1.5 x T_Hohmann / 40  (episode time / decision steps)
#
# Δm = m · (1 - exp(-Δv / v_e)),  Δv = F · t_burn / m_0,  v_e = Isp · g_0
#     Tsiolkovsky, not linear dm = F·t/(Isp·g0). The linear approximation
#     over-estimates propellant use by ~40-60% across a full Hohmann transfer
Basilisk’s thrusterStateEffector in this configuration does not feed propellant depletion back into the orbital integrator. The propagated mass stays fixed at the wet mass. I track fuel consumption analytically in Python and use it for observations and the reward. The error from this constant-mass assumption accumulates to under 5% in delta-v over a full Hohmann, which is acceptable for the current stage.
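The gap between the Tsiolkovsky expression and the linear approximation is easy to check numerically. The delta-v, Isp, and mass values below are assumed round numbers for a LEO-to-GEO transfer, not the project's actual configuration:

```python
import math

G0 = 9.80665  # standard gravity, m/s^2

def dm_tsiolkovsky(m, dv, isp):
    """Exact propellant mass for delta-v dv starting from mass m."""
    return m * (1.0 - math.exp(-dv / (isp * G0)))

def dm_linear(m0, dv, isp):
    """Linear approximation dm = m0*dv/(Isp*g0): treats mass as constant."""
    return m0 * dv / (isp * G0)

# Assumed numbers: total Hohmann dv ≈ 3.9 km/s, Isp = 450 s, wet mass 500 kg
m0, dv, isp = 500.0, 3900.0, 450.0
exact = dm_tsiolkovsky(m0, dv, isp)
approx = dm_linear(m0, dv, isp)
```

With these assumed numbers the linear formula overshoots the exact result by roughly 50%, consistent with the 40-60% range quoted in the comment: since 1 - e^(-x) < x, the linear estimate always over-counts, and the error grows with dv/v_e.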
The Decision Pace
Following Zavoli and Federici’s discretisation, I use 40 fixed decision steps per episode. The episode length is set to 1.5 × T_Hohmann, giving the agent 50% more time than the ideal two-burn transfer requires. Each step therefore spans a substantial chunk of orbital time, roughly 400 seconds for the LEO-to-GEO problem. As a result, the agent makes 40 coarse burn-or-coast decisions spread across the entire transfer window.
03 — Inside the Wrapper
The two essential components were managing Basilisk’s simulation lifecycle inside Gymnasium’s reset-step loop, and handling the attitude settling period at the start of every episode.
The Reset Problem
Basilisk has no checkpoint-restore mechanism, so every call to reset() constructs a new SimBaseClass from scratch, configures the spacecraft dynamics, thruster effector, navigation sensor, and attitude control modules, initialises the orbital state, and calls InitializeSimulation(). Doing this tens of thousands of times in a single Python session risked memory leaks, but in practice the Python garbage collector handles the C++ resource cleanup correctly once the previous sim object is dereferenced.
# reset(), the core Basilisk lifecycle, every episode
bsk = build_simulation(sc_cfg, thr_cfg, r_init, v_init, sim_dt_s=2.0)

# Coast 600 s: attitude controller converges to prograde before first burn
bsk.ConfigureStopTime(sec2nano(600))
bsk.ExecuteSimulation()

# step(), one decision step
thrOnTimeMsg.write(OnTimeRequest=[t_burn_s], time=sim_time_ns)
bsk.ConfigureStopTime(sim_time_ns + sec2nano(dt_step))
bsk.ExecuteSimulation()

r_vec, v_vec = get_cartesian_state(bsk)
oe = orbitalMotion.rv2elem(MU_EARTH, r_vec, v_vec)
The Attitude Settling Coast
This was the detail I had not fully appreciated until I ran the first episode without it. The velocityPoint guidance law slews the spacecraft to align the thruster with the prograde direction, and the mrpFeedback controller closes the attitude loop with a settling time on the order of several hundred seconds. On an equatorial orbit, the initial attitude misalignment from the identity MRP can be close to 90 degrees. If the agent fires the thruster immediately at step zero, the burn is nearly perpendicular to the velocity vector and the orbit degrades rather than rises. The reference scenarioOrbitManeuverTH handled this with an explicit pre-burn coast; I replicated it with a fixed 600-second settling period before the first observation is returned. Although the agent never sees this coast, the physics depends on it.
In-Process vs. Out-of-Process Resolved: I went with in-process. The latency per step() is dominated by the Basilisk integrator itself, not by inter-process communication overhead, so there was nothing to gain from a separate process. The main risk, memory accumulation across thousands of episode resets, did not materialise in practice.
04 — Tests: All Green
Before running any training I wrote two test modules and used Gymnasium’s built-in environment checker to catch API violations.
test_env_basic.py verifies the Gymnasium contract: observations returned by reset() and step() must lie within the declared observation_space bounds at every step, including degenerate states near the termination conditions. I added clipping as the final operation in _get_obs() to handle edge cases where orbital elements go singular near perfectly circular orbits.
test_env_oracle.py runs a hardcoded burn sequence designed to approximate the optimal Hohmann transfer with full burn fraction at perigee, coast through the transfer ellipse, and a second burn at apogee to circularise. It checks that the oracle agent reaches a final SMA within a reasonable fraction of the target without triggering any catastrophic termination. This is the physics validation test: if a hand-coded near-optimal agent cannot complete the transfer, the simulation is wrong.
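The hardcoded burn sequence can be sketched as a simple two-pulse schedule. The function name, step indices, and burn lengths below are illustrative placeholders; the actual values in test_env_oracle.py may differ:

```python
def oracle_schedule(n_steps=40, burn_len=2, apogee_step=20):
    """Illustrative two-pulse Hohmann-like schedule: full burn at perigee,
    coast through the transfer ellipse, full burn near apogee to circularise."""
    actions = [0.0] * n_steps          # default: coast
    for k in range(burn_len):
        actions[k] = 1.0               # first burn: raise apogee
        actions[apogee_step + k] = 1.0 # second burn: circularise
    return actions
```

Feeding this schedule through the environment step-by-step is what the oracle test does in spirit: if this near-optimal open-loop plan cannot reach the target, no learned policy can.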
pytest tests/ -v
======== test session results =======
tests/test_env_basic.py::TestReset::test_reset_returns_valid_obs PASSED [  5%]
tests/test_env_basic.py::TestReset::test_reset_twice_does_not_crash PASSED [ 10%]
tests/test_env_basic.py::TestReset::test_reset_deterministic_with_seed PASSED [ 15%]
tests/test_env_basic.py::TestReset::test_reset_info_contains_timing PASSED [ 20%]
tests/test_env_basic.py::TestStep::test_step_zero_burn_does_not_crash PASSED [ 25%]
tests/test_env_basic.py::TestStep::test_step_full_burn_does_not_crash PASSED [ 30%]
tests/test_env_basic.py::TestStep::test_step_observation_in_bounds PASSED [ 35%]
tests/test_env_basic.py::TestStep::test_step_reward_is_nonpositive_for_coast PASSED [ 40%]
tests/test_env_basic.py::TestStep::test_step_reward_finite_for_burn PASSED [ 45%]
tests/test_env_basic.py::TestStep::test_mass_decreases_after_burn PASSED [ 50%]
tests/test_env_basic.py::TestStep::test_mass_unchanged_during_coast PASSED [ 55%]
tests/test_env_basic.py::TestEpisodeFlow::test_episode_terminates_or_truncates PASSED [ 60%]
tests/test_env_basic.py::TestEpisodeFlow::test_random_policy_survives_multiple_episodes PASSED [ 65%]
tests/test_env_basic.py::TestGymnasiumCompat::test_check_env PASSED [ 70%]
tests/test_env_basic.py::TestClose::test_close_is_idempotent PASSED [ 75%]
tests/test_env_basic.py::TestClose::test_reset_after_close_works PASSED [ 80%]
tests/test_env_oracle.py::TestOraclePhysics::test_oracle_achieves_near_geo_orbit PASSED [ 85%]
tests/test_env_oracle.py::TestOraclePhysics::test_no_burn_orbit_unchanged PASSED [ 90%]
tests/test_env_oracle.py::TestOraclePhysics::test_fuel_consumption_physically_reasonable PASSED [ 95%]
tests/test_env_oracle.py::TestOraclePrinting::test_print_oracle_episode_trace PASSED [100%]
======= 20 passed, 332 warnings in 3.60s =======
05 — The Debug Run: Erratic Rewards, Interesting Trajectories
With the environment passing all tests, I ran the first training run: 200,000 timesteps, PPO with a 64×64 MLP, fixed orbit, single environment, seed 42. The configuration was deliberately minimal: I wanted a clean diagnostic signal before scaling up.
| Parameter | Value |
|---|---|
| Training timesteps | 200k |
| Decision steps per episode | 40 |
| Shared actor-critic MLP | 64×64 |
The reward curve turned out to be erratic, bouncing up and down with no clear trend across the full run. Training did not diverge catastrophically, but it definitely did not converge. The evaluation results were more interesting. Running 20 episodes with the best checkpoint and plotting orbital SMA at each decision step gives this:

Evaluation trajectories over 20 episodes. The red dashed line is the target orbit. The bottom panel shows the thruster command at each step. Most episodes overshoot the target significantly; Ep8 misses by under 1%.
Every episode starts at LEO and every trajectory rises. The spacecraft is consistently firing prograde and climbing out of LEO. A random policy would produce flat or chaotic SMA traces, so something is being learned.
However, SMA errors range from under 10% to over 300%, with most episodes overshooting the target altitude by a wide margin. Episode 8 rises, levels off near the target, and circularises correctly. It is surrounded by episodes that sail past the target entirely, reaching two or three times the target SMA before the step budget expires.
The bottom panel shows that the thruster command sequence follows the same coarse pattern in every episode: the burn fraction starts near 1.0 and tapers gradually to near zero by step 40. The agent has converged on a monotonically decreasing burn schedule rather than the two distinct on-off pulses that the Hohmann transfer actually requires. It has learned a heuristic that sometimes produces a good answer and usually overshoots.
What the Optimal Policy Actually Looks Like: A Hohmann transfer uses two short impulsive burns: a prograde burn at perigee to raise apogee to the target altitude, a coast through the transfer ellipse, and then a second prograde burn at apogee to circularise. The optimal command sequence is pulsed: burn, stop, coast for ~20 steps, burn, stop. The debug policy is a continuous taper: burn hard early, bleed off gradually. It has learned to go up, but not when to stop.
06 — Diagnosing the Reward Function
The trajectories show that the reward function never penalises overshooting. To see why, look at the SMA progress term:
# Current per-step SMA reward (potential-based shaping)
# r_SMA = (a_k - a_{k-1}) / (a_target - a_init) × 50
#
# This gives positive reward whenever a increases,
# regardless of whether a is already above a_target.
What Needs to Change: The SMA progress term needs to reward proximity to the target, not just upward movement. A potential based on signed distance in (a, e) space, something like Φ(s) ∝ -|a – a_target| / a_target, would penalise overshooting and undershooting symmetrically. The terminal success bonus needs to come down to the same order of magnitude as the accumulated per-step signal, so the agent is not navigating in the dark between episode boundaries. And the eccentricity shaping should only be valuable near the right altitude, not universally.
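A candidate version of that potential can be sketched directly. This is my proposal, not the current code; the weight `w_e` and gate width are assumed tuning knobs:

```python
import math

def potential(a, e, a_tgt, e_tgt=0.0, w_e=0.5, gate_width=0.1):
    """Candidate shaping potential: negative normalised distance to the
    target in (a, e) space. Overshoot and undershoot are penalised alike.
    The eccentricity term is gated so circularising only pays near a_tgt."""
    a_err = abs(a - a_tgt) / a_tgt
    gate = math.exp(-((a_err / gate_width) ** 2))  # ≈1 near target, →0 far away
    return -(a_err + w_e * gate * abs(e - e_tgt))

def shaped_reward(a_prev, e_prev, a, e, a_tgt, gamma=0.99):
    """Potential-based shaping r = gamma*Phi(s') - Phi(s), which provably
    leaves the optimal policy unchanged (Ng, Harada & Russell, 1999)."""
    return gamma * potential(a, e, a_tgt) - potential(a_prev, e_prev, a_tgt)
```

Unlike the current one-sided progress term, this gives negative shaped reward for sailing past the target, and the Gaussian gate means eccentricity shaping switches on only once the SMA error is small.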
Getting reward shaping right on a sparse, delayed-outcome problem is notoriously difficult, and a first-pass reward function almost never works. The good news is that the environment is correct, the physics is correct, and the policy is learning. Because of Episode 8’s near-perfect trajectory, I now know the task is solvable with this setup.
The Erratic Reward Curve Explained: Episodes that happen to time out near the target altitude receive the massive +200 terminal bonus, spiking the curve upward, episodes that miss receive only per-step shaping, landing near zero, and the occasional crashes add a -50 dip. The critic cannot learn accurate value estimates under this level of variance, so I lose the advantage estimates that PPO’s clipped objective trains on. Even with ε = 0.2 limiting per-update policy change, a noisy advantage signal means noisy gradient updates, which is visible in the policy’s instability.
07 — Looking Ahead
Next week will start with the reward function. I am not going to scale up to more timesteps or parallel environments until the reward curve shows a cleanly improving trend on this fixed, deterministic problem.
The specific changes I plan to try include replacing the one-sided SMA progress term with a distance-to-target potential that penalises overshoot symmetrically with undershoot, rebalancing the terminal bonus to be proportional to the per-step reward scale, and possibly tying the eccentricity shaping weight to proximity to the target altitude, so the agent is only rewarded for circularising in the right place. I also want to look at the value function loss in TensorBoard more carefully. If the critic is not learning an accurate value estimate, which seems likely given the variance, the advantage estimates will be garbage regardless of how well the actor is searching.
One thing that sticks with me: despite the chaos in the reward curve, the EvalCallback found a checkpoint that produced Episode 8, a near-perfect Hohmann transfer with under 10% SMA error. That means the task is tractable with this exact environment and this exact policy architecture. Something in the reward signal, however noisy, was enough to push the policy toward the right answer at least once.
