In our set of feature directions, pick an arbitrary feature to serve as our reference. We will find a "relative set": a subset of points that orbit the reference. Precisely, this means they have (1) the same norm and (2) the same cosine similarity to the reference.
Geometrically, we can also think of it like this: start at the origin and project a cone toward the reference. We can then vary the cosine angle (the cone's width) and the norm at which we slice the cone (its height), and see which features fall within these parameters.
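The cone-slicing picture above can be sketched directly. This is a minimal illustration, not the authors' implementation: the function name `relative_set`, the tolerances, and the toy feature matrix are all assumptions made for the example, with features stored as rows of a NumPy array.

```python
import numpy as np

def relative_set(features, ref_idx, target_cos, target_norm,
                 cos_tol=0.02, norm_tol=0.05):
    """Slice the cone around the reference feature: keep features whose
    cosine to the reference is ~target_cos (the cone's width) and whose
    norm is ~target_norm (the height at which we slice)."""
    ref_unit = features[ref_idx] / np.linalg.norm(features[ref_idx])
    norms = np.linalg.norm(features, axis=1)
    cos = features @ ref_unit / np.maximum(norms, 1e-12)
    mask = (np.abs(cos - target_cos) <= cos_tol) \
         & (np.abs(norms - target_norm) <= norm_tol)
    mask[ref_idx] = False  # the reference is not its own orbit member
    return np.flatnonzero(mask)

# Toy example: rows 1 and 2 share the same norm (~2) and the same
# cosine (~0.707) to the reference in row 0; row 3 matches neither.
feats = np.array([[1.0,   0.0,   0.0],
                  [1.414, 1.414, 0.0],
                  [1.414, 0.0,   1.414],
                  [0.1,   0.1,   0.0]])
print(relative_set(feats, 0, target_cos=0.707, target_norm=2.0))
```

Sweeping `target_cos` and `target_norm` corresponds to widening the cone and sliding the slice up and down it.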
Fig 1.1. Start from the sphere starter. Add the feature.
1.2. Observe two points which are roughly in an orbit. We can tell because they have the same cosine similarity to the reference and the same norms. That sort of arrangement is pretty special.
1.3. All the members of this relative set are some kind of animal in a tree - clearly very similar! But there's something else interesting. Zooming into the left-hand side of the slides makes it even clearer: it looks like they might be splitting the color information in two:
1.4. Might it be that they are actually meant to fit together, like puzzle pieces?
1.5. Add them together. Observe that the tree leaves are now green and the backdrop is now a sky, and that the two indistinct animal forms have combined into a bird:
Interpretation: Two features, A and B, each of which seemed an improbable representation to have arisen from the dataset on its own, sum to something which suggests they were split from the same datum.
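The interpretation above can be illustrated with synthetic vectors. This is a hedged toy model, not the paper's data: `whole` stands in for the direction of the original datum, and `a` and `b` are hypothetical halves constructed to carry complementary noise, so each half alone is only a partial match while their sum recovers the whole.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
whole = rng.normal(size=64)          # stand-in for the original datum's direction
noise = 0.4 * rng.normal(size=64)    # the arbitrary division between the halves

a = 0.5 * whole + noise  # half A: e.g. the animal form without its color
b = 0.5 * whole - noise  # half B: e.g. the displaced color information

# Each half alone aligns only partially with the whole;
# the sum cancels the split and aligns almost perfectly.
print(cosine(a, whole))
print(cosine(a + b, whole))
```

Because the noise terms are exactly complementary here, `a + b` equals `whole`; in a real SAE dictionary the sum would only approximately recover the underlying direction.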
Some cases are particularly egregious, leaving features that seem nonsensical -- ones that don't look like anything the model could have seen in our data in isolation. Take this cat, separated from its orange pelt and left hovering in midair:
Feature splitting depends on there being a clean way to split common data points in two along some regularity. Conveniently, color is a predictable regularity, so color information often serves as the attribute to split on.
The split features can then serve as vehicles for distributing feature information appropriately:
SAEs promise dictionary elements that are in some sense independent and interpretable in isolation. But instead of a dictionary of words (atomic units that can be combined), we get something more akin to sentence fragments -- semi-coherent pieces of messages of varying sizes. We are left to piece them together into a larger picture.
It's actually even stranger -- the cleaving not only splits the attributes into two buckets, but the magnitudes themselves are often chopped out of proportion with each other. This makes it difficult to see in a slide what a feature is about without changing the magnification.
Here we see coherent concepts get cleaved into pieces along arbitrary divisions, leaving features that are nonsensical when studied in isolation.