arxiv:2606.28525

A Gravitational Interpretation of Fine-Tuning Reversion

Published on Jun 26

Submitted by Samuele Poppi on Jun 30

Mohamed Bin Zayed University of Artificial Intelligence

Abstract

The paper argues that post-alignment safety degradation reflects geometric properties of training history: fine-tuning reversion follows a persistent, history-defined direction set by earlier training phases.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Fine-tuning on harmless data can partially undo behaviors acquired earlier in training. Safety can erode under benign post-alignment updates, unlearned capabilities can re-emerge, latent traits can transfer through apparently unrelated supervision, and related post-alignment fragility appears in other generative settings. We argue these phenomena are usefully viewed through a common training-history lens. Our hypothesis is geometric: large early training phases create dominant behavioral manifolds, while later alignment or specialization phases are shallower displacements from them. Subsequent fine-tuning can therefore inherit a persistent reversion component pointing back toward a witness of the dominant manifold. We call this the gravitational interpretation of fine-tuning reversion. Across our main settings, representational drift rapidly acquires a component along a history-defined reversion direction (v_rev). In our main track, alignment with v_rev rises from cos = 0.429 +/- 0.052 after the first update to 0.647 +/- 0.021 by step 20. Across 24 run-step pairs, every observed alignment exceeds the p99 of an isotropic activation-space null. We demonstrate that selectively blocking motion along v_rev changes the final alignment at T=100 from 0.648 +/- 0.009 to -0.211 +/- 0.021 and reduces harmfulness from 19.0% +/- 4.0% to 8.5% +/- 1.5% with little task cost. These results support v_rev as a causally relevant mediator of early post-alignment reversion in our setup. Importantly, we do not claim that v_rev is the unique safety direction, nor that the dominant manifold is directly observed; rather, we identify a robust, history-defined direction that explains and partially controls early reversion dynamics.
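The three quantities the abstract reports (cosine alignment of drift with v_rev, an isotropic null threshold, and blocking motion along v_rev) can be illustrated with a minimal NumPy sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the function names `cosine`, `block_direction`, and `isotropic_null_p99` are hypothetical, the null is modeled as directions drawn uniformly on the unit sphere, blocking is modeled as orthogonal projection of a single update vector, and the toy `drift` vector is synthetic. The abstract does not say how blocking is implemented, and the reported post-blocking alignment of -0.211 suggests the paper's procedure is not a plain per-vector projection like this one.

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def block_direction(update: np.ndarray, v_rev: np.ndarray) -> np.ndarray:
    """Remove the component of `update` along `v_rev` (orthogonal projection).

    Assumption: "blocking motion along v_rev" is modeled here as projecting
    one update vector onto the subspace orthogonal to v_rev.
    """
    u = v_rev / np.linalg.norm(v_rev)
    return update - np.dot(update, u) * u


def isotropic_null_p99(dim: int, n_samples: int = 10_000, seed: int = 0) -> float:
    """99th percentile of cosine between a fixed direction and random directions.

    Assumption: the isotropic null is modeled as directions drawn uniformly
    on the unit sphere. By rotational symmetry, the cosine with any fixed
    unit vector has the same distribution as the first coordinate.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_samples, dim))
    z /= np.linalg.norm(z, axis=1, keepdims=True)
    return float(np.percentile(z[:, 0], 99))


# Synthetic toy data (not from the paper): a drift vector with a planted
# component along a hypothetical reversion direction plus isotropic noise.
rng = np.random.default_rng(0)
dim = 256
v_rev = rng.standard_normal(dim)
unit_v = v_rev / np.linalg.norm(v_rev)
drift = 0.5 * unit_v + rng.standard_normal(dim) / np.sqrt(dim)

print("alignment before blocking:", cosine(drift, v_rev))
print("isotropic null p99:       ", isotropic_null_p99(dim))
print("alignment after blocking: ", cosine(block_direction(drift, v_rev), v_rev))
```

In this toy setup the planted alignment sits well above the null threshold, and projection drives it to zero up to floating-point error. In high dimensions the null cosine concentrates near zero, which is why exceeding its p99 is a meaningful test of a non-random component.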

