When TD-Variance Early Warning Fails
A failed early-warning study for off-policy fine-tuning under dynamics shift, and the design lessons that fell out of the failure.
This experiment started with a practical question: could a recent-buffer temporal-difference (TD) variance score warn before a robotic reinforcement-learning policy collapses during fine-tuning under a controlled dynamics shift?
Why This Was Worth Testing
Fine-tuning is an attractive way to reuse an expensive pretrained controller, but robotic policies rarely keep operating in the exact dynamics they were trained on. Friction changes, payload changes, contact behavior changes, and a policy that looked stable in one regime can become brittle in another.
A useful warning signal would buy time. If the system can detect trouble before return collapses, an operator can stop the run, roll back the update, change the data mix, or move to a safer adaptation strategy. That matters most in the settings where adaptation is expensive or hard to reset.
TD variance was a plausible candidate because it is cheap and already close to the learning loop. In an off-policy actor-critic setup, critic mismatch, replay geometry, and backup choices all shape stability. A recent-buffer score offered a simple test of whether those internal signals could expose failure earlier than return alone.
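To make the kind of score under test concrete, here is a minimal sketch: the variance of one-step TD errors over a recent replay window. The function names, tuple layout, and the plain one-step target are illustrative assumptions, not the experiment's actual code; a real SAC target would also include the entropy term and use the learned policy's next action.

```python
import numpy as np

def td_variance_score(transitions, q_fn, target_q_fn, gamma=0.99):
    """Variance of one-step TD errors over a recent replay window.

    `transitions` holds (s, a, r, s_next, a_next, done) tuples drawn
    from the most recent slice of the buffer; `q_fn` / `target_q_fn`
    map (state, action) to a scalar Q estimate. All names here are
    illustrative, not the experiment's API.
    """
    td_errors = []
    for s, a, r, s_next, a_next, done in transitions:
        # Bootstrap from the target critic unless the episode ended.
        target = r + gamma * (0.0 if done else target_q_fn(s_next, a_next))
        td_errors.append(target - q_fn(s, a))
    # Population variance of the recent TD errors is the raw score.
    return float(np.var(td_errors))
```

The appeal is exactly what the paragraph describes: everything this needs is already computed inside the critic update, so the score is nearly free.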
I built a narrow vertical slice around Go1 flat-terrain joystick locomotion, Soft Actor-Critic, Brax, and MuJoCo Playground. The setup fine-tuned under a severe shift in friction and payload, then asked whether a TD-variance score could fire before threshold-defined collapse.
The Short Version
The warning signal did not work in the regime I actually ran. The first calibrated pilot produced meaningful degradation but no threshold-defined collapses. Tightening the collapse threshold created collapse labels, but still produced no warning-positive runs and no positive lead time.
That made the useful result a post-mortem rather than a positive method paper. The score was normalized to early shifted-domain warmup behavior, the first two variance measurements emitted no warning rows, the prediction horizon was long relative to the short traces, and changing the collapse threshold changed labels more than it changed the underlying dynamics.
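The failure modes listed above can be reconstructed in a small sketch of the warning rule, assuming a warmup-anchored baseline, a ratio threshold, and a fixed lead-time horizon. The parameter values and names are hypothetical, not the experiment's calibrated settings; the point is mechanical: measurements inside the warmup window emit no rows, an already-elevated warmup baseline absorbs the shift, and a warning only counts if it lands inside the horizon before collapse.

```python
import numpy as np

def warning_rows(scores, warmup_k=2, ratio_thresh=3.0):
    """Flag evaluation steps whose TD-variance score exceeds a
    multiple of the warmup baseline. The first `warmup_k` measurements
    define the baseline and emit no warning rows, so a shift that is
    already present during warmup raises the baseline instead of
    triggering warnings. Thresholds here are illustrative.
    """
    baseline = np.mean(scores[:warmup_k])
    return [i for i in range(warmup_k, len(scores))
            if scores[i] > ratio_thresh * baseline]

def lead_time(warn_steps, collapse_step, horizon):
    """Positive lead time only if some warning fires strictly before
    collapse and within `horizon` evaluation steps of it."""
    hits = [collapse_step - w for w in warn_steps
            if 0 < collapse_step - w <= horizon]
    return max(hits) if hits else None
```

Run on a trace whose warmup scores are already elevated, `warning_rows` returns nothing and `lead_time` returns `None`, which is the shape of the negative result reported here.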
What I Took From It
The warning design failed because its geometry did not match the trace it was supposed to warn on. The baseline, warmup handling, evaluation spacing, and future-collapse horizon were badly matched to the available runs.
A stronger follow-up would anchor the score to a more stable pre-shift baseline, evaluate more frequently, vary the prediction horizon directly, separate warning calibration from collapse calibration, and compare against a trivial return-drop rule.
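One of those follow-up items, the trivial return-drop rule, is simple enough to sketch directly. Any TD-variance warning would need to beat this comparator on lead time to justify its extra machinery. The window size and drop fraction below are illustrative parameters, not calibrated values.

```python
def return_drop_warning(returns, window=5, drop_frac=0.5):
    """Warn when the latest evaluation return falls below `drop_frac`
    of the rolling mean of the previous `window` evaluations.

    A deliberately naive baseline rule: no critic internals, just the
    same return trace the collapse label is defined on.
    """
    warnings = []
    for i in range(window, len(returns)):
        ref = sum(returns[i - window:i]) / window
        if returns[i] < drop_frac * ref:
            warnings.append(i)
    return warnings
```

Because this rule reads only the return trace, it also benefits directly from the more frequent evaluation the follow-up calls for.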
There was also a secondary robustness observation. Under the same severe shift, the n = 1 configuration (single-step backups) retained partial performance on average while the longer n-step configurations finished near zero. That result is real but confounded, so I treat it as a narrow observation rather than the main claim.