The parameter space of a neural network is often high-dimensional, a setting where our intuition struggles and frequently falls into traps by assuming behavior similar to low-dimensional spaces. On top of this, the basic structure of neural networks creates "permutation symmetries": we can reorder the hidden units of a layer, which permutes the rows of that layer's weight matrix and, correspondingly, the columns of the next layer's, and get exactly the same output. Nothing's changed except where we've stored things. Here's a visualization:
If we're just training and testing one network then we might not care about this. But if we're going to work with networks in "parameter space," then this is an interesting structure to consider.
In mathematical notation we can write the network above as: \[ y = W_3 \cdot \phi \left(W_2 \cdot \phi \left(W_1 x\right)\right), \] where $\phi$ is an elementwise nonlinearity.
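As a quick numerical sketch of this forward pass (layer sizes and the choice of ReLU for $\phi$ are arbitrary, just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up layer sizes: input 4 -> hidden 5 -> hidden 5 -> output 3.
W1 = rng.normal(size=(5, 4))
W2 = rng.normal(size=(5, 5))
W3 = rng.normal(size=(3, 5))

def phi(z):
    """Elementwise nonlinearity; ReLU chosen arbitrarily."""
    return np.maximum(z, 0.0)

def forward(W1, W2, W3, x):
    """y = W3 . phi(W2 . phi(W1 x))"""
    return W3 @ phi(W2 @ phi(W1 @ x))

x = rng.normal(size=4)
y = forward(W1, W2, W3, x)
print(y.shape)  # (3,)
```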
And the vector of weights in the network above as: \[ \theta = \begin{bmatrix} \text{vec}(W_1) \\ \text{vec}(W_2) \\ \text{vec}(W_3) \end{bmatrix}. \] If we insert two permutation matrices $P_2$ and $P_1$ of appropriate sizes then
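Building $\theta$ is just flattening and stacking; a minimal sketch, reusing the made-up layer sizes from above:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(5, 4))
W2 = rng.normal(size=(5, 5))
W3 = rng.normal(size=(3, 5))

def vec(W):
    # Column-major (Fortran-order) flattening, matching the usual vec() operator.
    return W.flatten(order="F")

theta = np.concatenate([vec(W1), vec(W2), vec(W3)])
print(theta.shape)  # (60,) = 5*4 + 5*5 + 3*5
```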
\begin{align*} y &= W_3 \cdot \phi \left(W_2 \cdot \phi \left(W_1 x\right)\right) \\ &= W_3 P_2^{\vphantom{-1}} P_2^{-1} \cdot \phi \left(W_2 P_1^{\vphantom{-1}} P_1^{-1} \cdot \phi \left(W_1 x\right)\right) \\ &= W_3 P_2^{\vphantom{-1}} \cdot \phi \left(P_2^{-1} W_2 P_1^{\vphantom{-1}} \cdot \phi \left(P_1^{-1} W_1 x\right)\right) \\ &= \tilde{W}_3 \cdot \phi \left(\tilde{W}_2 \cdot \phi \left(\tilde{W}_1 x\right)\right), \end{align*} where the third line uses the fact that $\phi$ is elementwise and a permutation only reorders entries, so $P^{-1} \phi(z) = \phi(P^{-1} z)$. These permuted matrices form a completely different vector in parameter space: \[ \tilde{\theta} = \begin{bmatrix} \text{vec}(\tilde{W}_1) \\ \text{vec}(\tilde{W}_2) \\ \text{vec}(\tilde{W}_3) \end{bmatrix} \neq \theta. \]
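We can check this identity numerically. A small sketch (same made-up shapes as before, using $P^{-1} = P^\top$ for permutation matrices):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(5, 4))
W2 = rng.normal(size=(5, 5))
W3 = rng.normal(size=(3, 5))

def phi(z):
    return np.maximum(z, 0.0)  # elementwise nonlinearity

def forward(W1, W2, W3, x):
    return W3 @ phi(W2 @ phi(W1 @ x))

# Random permutation matrices for the two hidden layers.
P1 = np.eye(5)[rng.permutation(5)]
P2 = np.eye(5)[rng.permutation(5)]

# Permuted weights: a different point theta~ in parameter space...
W1t = P1.T @ W1        # P1^{-1} W1
W2t = P2.T @ W2 @ P1   # P2^{-1} W2 P1
W3t = W3 @ P2          # W3 P2

# ...but exactly the same function.
x = rng.normal(size=4)
assert np.allclose(forward(W1, W2, W3, x), forward(W1t, W2t, W3t, x))
```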
Personally, the permutation symmetry of neural networks trips me up quite a bit. The double cover of unit quaternions over rotations in 3D space comes up a lot in graphics, and there we're very wary of linearly blending very different rotations, like those on opposite sides of the double cover. Increasingly, though, I'm reading in the machine learning literature that permutation symmetry is not necessarily that kind of problem (e.g., "Weight-space symmetry in deep networks gives rise to permutation saddles..."). The old wisdom that blending the weights of networks is foolish no longer seems to hold.