A Gravitational Interpretation of Fine-Tuning Reversion
Abstract
Post-alignment safety degradation arises from geometric properties of training history, where fine-tuning reversion follows a persistent direction defined by early training dynamics.
Fine-tuning on harmless data can partially undo behaviors acquired earlier in training. Safety can erode under benign post-alignment updates, unlearned capabilities can re-emerge, latent traits can transfer through apparently unrelated supervision, and related post-alignment fragility appears in other generative settings. We argue these phenomena are usefully viewed through a common training-history lens. Our hypothesis is geometric: large early training phases create dominant behavioral manifolds, while later alignment or specialization phases are shallower displacements from them. Subsequent fine-tuning can therefore inherit a persistent reversion component pointing back toward a witness of the dominant manifold. We call this the gravitational interpretation of fine-tuning reversion. Across our main settings, representational drift rapidly acquires a component along a history-defined reversion direction (v_rev). In our main track, alignment with v_rev rises from cos = 0.429 +/- 0.052 after the first update to 0.647 +/- 0.021 by step 20. Across 24 run-step pairs, every observed alignment exceeds the p99 of an isotropic activation-space null. We demonstrate that selectively blocking motion along v_rev changes the final alignment at T=100 from 0.648 +/- 0.009 to -0.211 +/- 0.021 and reduces harmfulness from 19.0% +/- 4.0% to 8.5% +/- 1.5% with little task cost. These results support v_rev as a causally relevant mediator of early post-alignment reversion in our setup. Importantly, we do not claim that v_rev is the unique safety direction, nor that the dominant manifold is directly observed; rather, we identify a robust, history-defined direction that explains and partially controls early reversion dynamics.
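The two measurements the abstract relies on, cosine alignment of representational drift with v_rev and removal of the component of an update along v_rev, can be illustrated with a minimal sketch. The abstract does not specify how v_rev is constructed, how the blocking is implemented, or how the isotropic null is sampled, so everything here is an assumption made for illustration: v_rev is taken as the unit vector from an aligned-checkpoint activation toward a base-checkpoint "witness" activation, blocking is taken as orthogonal projection, and the null is estimated from Gaussian random directions. All function names are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def cosine(a, b):
    """Cosine similarity between two activation-space vectors."""
    return dot(a, b) / (norm(a) * norm(b))


def reversion_direction(h_base, h_aligned):
    """ASSUMPTION: v_rev is the unit vector pointing from the aligned
    checkpoint's mean activation back toward a base-model witness."""
    diff = [b - a for a, b in zip(h_aligned, h_base)]
    n = norm(diff)
    return [x / n for x in diff]


def block_along(update, v_unit):
    """ASSUMPTION: 'blocking motion along v_rev' is modeled as removing
    the orthogonal projection of the update onto the unit vector v_unit."""
    coeff = dot(update, v_unit)
    return [u - coeff * v for u, v in zip(update, v_unit)]


def isotropic_null_p99(dim, n_samples=2000, seed=0):
    """Empirical 99th percentile of the cosine between a fixed direction
    and isotropic Gaussian random vectors in `dim` dimensions."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # By rotational symmetry, the fixed direction can be the first axis.
        samples.append(g[0] / norm(g))
    samples.sort()
    return samples[int(0.99 * n_samples)]


if __name__ == "__main__":
    v_rev = reversion_direction(h_base=[1.0, 0.0, 0.0], h_aligned=[0.0, 0.0, 0.0])
    drift = [0.5, 0.2, -0.1]  # toy representational drift after an update
    print("alignment before blocking:", cosine(drift, v_rev))
    print("component after blocking:", dot(block_along(drift, v_rev), v_rev))
    print("null p99 (dim=64):", isotropic_null_p99(64))
```

In this toy setup, an observed alignment would be compared against the null threshold for the relevant activation dimension. Note that pure projection drives the blocked component to zero, so it cannot by itself produce a negative final alignment such as the reported -0.211; the intervention in the paper presumably differs from this sketch.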