Sketched Isotropic Gaussian Regularization

SIGReg visual story

One idea

SIGReg checks shadows, then averages the tests.

Embeddings \( \mathbf z_n=f_\theta(\mathbf x_n) \) should look like \( Q=\mathcal N(\mathbf 0,I_K) \). SIGReg avoids judging the whole cloud at once: it inserts slice planes, projects the points, tests each one-dimensional shadow, and averages the resulting scores.
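Written as a formula, the averaging step might look like the line below. The number of slice directions \(M\) and the names \(\mathbf a_m\) are notation assumed here for illustration; \(\mathrm{EP}(\mathbf a)\) is the per-direction test score used later in this story.

\( \mathrm{SIGReg}=\frac{1}{M}\sum_{m=1}^{M}\mathrm{EP}(\mathbf a_m),\qquad \|\mathbf a_m\|=1 \)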

SIGReg loss: average of the directional test scores from the slice planes.
Blue surface: target \(Q\). Coral points: current embeddings \(P_\theta\).

01

Why avoiding collapse needs a target.

Prediction alone can make related views agree while the representation becomes a point, a line, or a thin manifold. SIGReg gives the embedding cloud a specific target shape.

Useful embeddings spread information across directions. Collapsed embeddings leave many directions nearly silent.
Figure: a collapsed line, where the dead directions shrink.
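One way to make "nearly silent directions" concrete is to look at the eigenvalues of the embedding covariance. The sketch below is a toy illustration, not part of SIGReg itself: the sample size, the 3-D setting, the noise level, and the helper name `direction_variances` are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 2000, 3

# A spread cloud: every direction carries variance close to 1.
spread = rng.standard_normal((N, K))

# A collapsed cloud: points lie on a line, plus a tiny amount of noise.
direction = np.array([1.0, 1.0, 1.0]) / np.sqrt(3.0)
collapsed = rng.standard_normal((N, 1)) * direction + 1e-3 * rng.standard_normal((N, K))

def direction_variances(Z):
    """Variance along each principal direction, largest first."""
    cov = np.cov(Z, rowvar=False)          # (K, K) sample covariance
    return np.linalg.eigvalsh(cov)[::-1]   # eigvalsh returns ascending order

spread_var = direction_variances(spread)
collapsed_var = direction_variances(collapsed)

print("spread   :", np.round(spread_var, 4))
print("collapsed:", np.round(collapsed_var, 6))
```

In the collapsed cloud only the first eigenvalue is of order one; the others are close to zero, which is what "dead directions" means here.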

02

Why Gaussian is the target.

The LeJEPA paper argues that, when the future downstream task is unknown, embeddings should spend their variance evenly across directions. A centered isotropic Gaussian is the cleanest version of that geometry.

One-sentence summary: make every direction usable instead of letting a few axes dominate the representation.
Linear probes: anisotropy can amplify bias and variance.

When some directions are stretched and others are tiny, a linear readout becomes more sensitive to sampling and regularization.
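The variance half of this claim can be illustrated with the textbook ordinary least squares result \( \operatorname{Cov}(\hat\beta)=\sigma^2(Z^\top Z)^{-1} \). This is a standard formula, not the paper's own analysis, and it says nothing about bias. The two spectra and the function name `ols_total_variance` below are assumptions picked so that both have the same total variance.

```python
import numpy as np

def ols_total_variance(eigenvalues, n_samples, noise_var=1.0):
    """Trace of Cov(beta_hat) when Z^T Z / n_samples has the given eigenvalues."""
    eigenvalues = np.asarray(eigenvalues, dtype=float)
    return noise_var / n_samples * np.sum(1.0 / eigenvalues)

# Two covariance spectra with the same trace (total variance 4).
isotropic = [1.0, 1.0, 1.0, 1.0]
anisotropic = [3.7, 0.1, 0.1, 0.1]

var_iso = ols_total_variance(isotropic, n_samples=1000)
var_aniso = ols_total_variance(anisotropic, n_samples=1000)

print("isotropic  :", var_iso)
print("anisotropic:", var_aniso)
```

Because the total is a sum of reciprocals, the tiny eigenvalues dominate it, so the stretched spectrum gives a much noisier readout for the same variance budget.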

Local nonlinear probes: round neighborhoods are easier to use.

For k-NN and kernel-style predictors, the paper’s analysis points to the isotropic Gaussian as the unique bias-minimizing design under a fixed covariance budget.

SIGReg target: every slice should look like \(\mathcal N(0,1)\).

If the full cloud is isotropic Gaussian, every one-dimensional projection \(u=\mathbf a^\top\mathbf z\) onto a unit vector \(\mathbf a\) has the same standard normal distribution, written \(Q^{(\mathbf a)}\).

Bad: thin or directional support

Good: full-dimensional Gaussian

\( Q=\mathcal N(\mathbf 0,I_K) \Rightarrow Q^{(\mathbf a)}=\mathcal N(0,1) \)
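The implication can be checked in two lines. Take \(\mathbf z\sim\mathcal N(\mathbf 0,I_K)\) and a unit vector \(\mathbf a\); the shadow \(u=\mathbf a^\top\mathbf z\) is a linear function of a Gaussian vector, so it is Gaussian, with mean and variance:

\( \mathbb E[u]=\mathbf a^\top\mathbb E[\mathbf z]=\mathbf a^\top\mathbf 0=0 \)
\( \operatorname{Var}(u)=\mathbf a^\top I_K\,\mathbf a=\|\mathbf a\|^2=1 \)

The converse direction is what makes slice testing meaningful: by the Cramér-Wold theorem, a distribution is determined by its one-dimensional projections, so if every shadow is \(\mathcal N(0,1)\), the full cloud is \(\mathcal N(\mathbf 0,I_K)\).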

03

Epps-Pulley tests one shadow.

After projection, Epps-Pulley compares the shadow’s empirical characteristic function against the standard Gaussian one. The score is small when the shadow has the right shape.

\( \widehat{\phi}_{\mathbf a}(t)=\frac{1}{N}\sum_{n=1}^{N}e^{it\mathbf a^\top\mathbf z_n} \)
\( \operatorname{Re}\widehat{\phi}_{\mathbf a}(t)=\frac{1}{N}\sum_n\cos(tu_n) \)
\( \operatorname{Im}\widehat{\phi}_{\mathbf a}(t)=\frac{1}{N}\sum_n\sin(tu_n) \)
\( \phi_{\mathcal N}(t)=e^{-t^2/2} \)

For a given \(t\), each projected value \(u_n=\mathbf a^\top\mathbf z_n\) becomes a unit-circle point \((\cos(tu_n),\sin(tu_n))\). Their average, read as a complex number, is \(\widehat{\phi}_{\mathbf a}(t)\).
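The formulas above translate almost directly into NumPy. The sketch below assumes a sample that is already standard Gaussian, so the empirical characteristic function should land near the target; the sample size, dimension, and the helper name `ecf` are illustrative choices.

```python
import numpy as np

def ecf(u, t):
    """Real and imaginary parts of the empirical characteristic function at t."""
    angles = t * np.asarray(u, dtype=float)
    return np.cos(angles).mean(), np.sin(angles).mean()

rng = np.random.default_rng(0)
Z = rng.standard_normal((5000, 8))      # toy embeddings drawn from N(0, I_8)

a = rng.standard_normal(8)
a /= np.linalg.norm(a)                  # unit slice direction
u = Z @ a                               # one-dimensional shadow

t = 1.5
re_part, im_part = ecf(u, t)
target = np.exp(-t**2 / 2)              # Gaussian characteristic function (real-valued)

print("mean cos:", re_part, "target real:", target)
print("mean sin:", im_part, "target imaginary: 0.0")
```

The target has no imaginary part because the standard normal is symmetric around zero, so the sine terms average out.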

Readout at the selected \(t\) (for example \(t=1.5\)): mean cos, mean sin, target real part, target imaginary part, and the score \(\mathrm{EP}(\mathbf a)\), computed as \(N\cdot\operatorname{trapz}\) over \(t_j\in[-5,5]\).
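A minimal sketch of the per-direction score follows. The text only specifies \(N\) times a trapezoid rule on \([-5,5]\); the integrand used here, the squared distance between the empirical and Gaussian characteristic functions multiplied by a Gaussian window \(e^{-t^2/2}\), and the number of knots are assumptions of this example. The trapezoid rule is written out by hand rather than through a library call.

```python
import numpy as np

def epps_pulley(u, t_max=5.0, knots=41):
    """N times the trapezoid integral of the windowed squared CF error."""
    u = np.asarray(u, dtype=float)
    t = np.linspace(-t_max, t_max, knots)
    angles = np.outer(u, t)                       # (N, knots)
    re_part = np.cos(angles).mean(axis=0)         # Re of empirical CF
    im_part = np.sin(angles).mean(axis=0)         # Im of empirical CF
    target = np.exp(-t**2 / 2)                    # Gaussian CF
    window = np.exp(-t**2 / 2)                    # assumed weight function
    integrand = ((re_part - target) ** 2 + im_part**2) * window
    dt = t[1] - t[0]
    # Trapezoid rule: full weight inside, half weight at both ends.
    integral = dt * (integrand.sum() - 0.5 * (integrand[0] + integrand[-1]))
    return u.size * integral

rng = np.random.default_rng(0)
gaussian_shadow = rng.standard_normal(2000)            # right shape
collapsed_shadow = 0.05 * rng.standard_normal(2000)    # nearly a point

ep_gauss = epps_pulley(gaussian_shadow)
ep_collapsed = epps_pulley(collapsed_shadow)
print("Gaussian shadow :", ep_gauss)
print("collapsed shadow:", ep_collapsed)
```

The score stays small for the Gaussian shadow and becomes large for the collapsed one, which is the behavior the test is meant to have.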

04

As a loss, SIGReg unfolds the cloud.

This toy demo directly optimizes the points \( \mathbf z_n \). In an encoder, the same gradients flow through \( \mathbf z_n=f_\theta(\mathbf x_n) \) into \(f_\theta\).

\( \mathbf z_n\leftarrow \mathbf z_n-\eta\nabla_{\mathbf z_n}\mathrm{SIGReg} \)
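The update above can be sketched in plain NumPy with a hand-derived gradient. Several things here are assumptions made for the example rather than details from the source: the Gaussian window in the integrand, the 41-point grid, the 16 evenly spaced fixed directions, the step size, and the function name `sigreg_value_and_grad`. A real encoder would normally get these gradients from automatic differentiation.

```python
import numpy as np

def sigreg_value_and_grad(Z, A, t, weights):
    """SIGReg value and its gradient with respect to the points Z.

    Z: (N, K) points, A: (K, M) unit directions in columns,
    t: (J,) grid, weights: (J,) trapezoid weights times the assumed window.
    """
    N, M = Z.shape[0], A.shape[1]
    U = Z @ A                                    # (N, M) shadows
    angles = U[:, :, None] * t                   # (N, M, J)
    cos, sin = np.cos(angles), np.sin(angles)
    target = np.exp(-t**2 / 2)
    re_err = cos.mean(axis=0) - target           # (M, J)
    im_err = sin.mean(axis=0)                    # (M, J)
    ep = N * ((re_err**2 + im_err**2) @ weights)  # (M,) score per direction
    # dEP/du[n, m] = 2 * sum_j weights_j * t_j * (-re_err * sin + im_err * cos)
    dU = 2.0 * ((-re_err * sin + im_err * cos) @ (weights * t))  # (N, M)
    grad = dU @ A.T / M                          # chain rule through u = a^T z
    return ep.mean(), grad

rng = np.random.default_rng(0)
N, K, M = 256, 2, 16

t = np.linspace(-5.0, 5.0, 41)
dt = t[1] - t[0]
quad = np.full(t.shape, dt)
quad[[0, -1]] = dt / 2                           # trapezoid end weights
weights = quad * np.exp(-t**2 / 2)               # assumed Gaussian window

theta = np.pi * np.arange(M) / M                 # evenly spaced slice angles
A = np.stack([np.cos(theta), np.sin(theta)])     # (K, M), unit columns

# Start from a collapsed cloud: a line plus a little noise.
line = np.array([1.0, 0.0])
Z = rng.standard_normal((N, 1)) * line + 0.01 * rng.standard_normal((N, K))

eta = 0.05
loss_start, _ = sigreg_value_and_grad(Z, A, t, weights)
for step in range(400):
    loss, grad = sigreg_value_and_grad(Z, A, t, weights)
    Z = Z - eta * grad                           # z_n <- z_n - eta * grad_n
loss_end, _ = sigreg_value_and_grad(Z, A, t, weights)

print("SIGReg at start:", loss_start)
print("SIGReg at end  :", loss_end)
```

Starting from a line, the gradient pushes points out of the occupied direction and into the empty one, so the cloud unfolds and the loss drops.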


Where it sits in LeJEPA

Prediction loss makes related views agree. SIGReg keeps the embedding distribution close to \( \mathcal N(\mathbf 0,I_K) \). Together they encourage useful, non-collapsed features.

views \( \mathbf x_{n,v} \) → \( f_\theta \) → embeddings \( \mathbf z_{n,v} \) → prediction loss + SIGReg → LeJEPA loss
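As a formula, the combination might be written as below. The text only says that the two terms are added; the weight \(\lambda>0\) and the symbol \(\mathcal L_{\text{pred}}\) for the prediction loss are assumptions introduced here to show where a trade-off knob would sit.

\( \mathcal L_{\text{LeJEPA}}=\mathcal L_{\text{pred}}+\lambda\,\mathrm{SIGReg} \)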