Cool Ideas To Work On -- February

Come work with me on any of these ideas.

Contents

1. Diffusion Interventions as Testing Ground
2. Interpolation Studies
3. Language Modeling
4. Datasets with Weird Properties

1. Diffusion Interventions as Testing Ground

1.1. What SAE Latents do we Observe? How do they "Decompose"?

What's the smallest ittiest bittiest feature we can observe? Can we find a direction that encodes for a cat? Can we find both a full-fledged cat and only its whiskers?

1.2. Try Cross Product on Stuff

Subtracting and adding features works. What about multiplying? Could be cool -- apparently the cross product is the analogue to look at here, since $\vec{v}_{1} \times \vec{v}_{2}$ is also a vector! Can we try to actually diffuse out these orthogonal components and see what happens?
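One caveat: the literal cross product only exists in 3 (and 7) dimensions, so for high-dimensional embeddings we'd need some stand-in. Here's a minimal numpy sketch of one possible stand-in -- projecting a candidate direction off both feature vectors so it ends up orthogonal to both. The arrays v1, v2, and candidate are placeholders, not anything from a real pipeline.

```python
import numpy as np

def orthogonal_to_both(v1, v2, candidate):
    """Return the component of `candidate` orthogonal to span{v1, v2} (Gram-Schmidt)."""
    basis = []
    for v in (v1, v2):
        w = v.astype(float).copy()
        for b in basis:
            w = w - np.dot(w, b) * b
        basis.append(w / np.linalg.norm(w))
    out = candidate.astype(float).copy()
    for b in basis:
        out = out - np.dot(out, b) * b
    return out

# Placeholder vectors standing in for two feature directions and an embedding.
rng = np.random.default_rng(0)
v1, v2, candidate = rng.standard_normal((3, 768))
ortho = orthogonal_to_both(v1, v2, candidate)
print(np.dot(ortho, v1), np.dot(ortho, v2))  # both should be ~0
```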

1.3. Try Adding SAE Latent Directions And Judge the Scale of Intervention

Are we sure we'll consistently get the same sort of intervention if we add the latent direction to a variety of image embeddings?

For instance, if we add a cat direction to multiple embeddings, do we get a similar cat, the same size of cat?

A difference in size would suggest that the direction carries different information content depending on the embedding it's added to. That would be interesting to see!
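A minimal sketch of the experiment, assuming we already have a batch of image embeddings, a unit-norm SAE latent direction, and some decode function for the diffusion pipeline; embeddings, cat_dir, and decode are all placeholder names here.

```python
import numpy as np

def intervene(embeddings, direction, scale):
    """Add the same unit-norm latent direction, at the same scale, to every embedding."""
    direction = direction / np.linalg.norm(direction)
    return embeddings + scale * direction

# Placeholder data; swap in real image embeddings and a real SAE latent direction.
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((8, 768))
cat_dir = rng.standard_normal(768)

edited = intervene(embeddings, cat_dir, scale=5.0)
# Then decode each row with the diffusion model and compare the cats by eye:
# images = [decode(e) for e in edited]
```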

2. Interpolation Studies

2.1. Can We See Discrete Transitions?

Perform spherical interpolation between two image embeddings and observe how continuous changes in embedding space translate to discrete changes in diffused output. Identify specific values at which content transitions occur, revealing potential "set points" or discrete knowledge states within the continuous representation.

Research Value: May reveal how continuous embedding spaces maintain discrete conceptual boundaries, with implications for how interventions should target specific ranges within the representation.
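As a starting point, a minimal sketch of the interpolation itself (standard slerp between two embedding vectors); emb_a and emb_b in the usage comment are placeholders for real embeddings.

```python
import numpy as np

def slerp(v0, v1, t):
    """Spherical linear interpolation between two embedding vectors."""
    u0, u1 = v0 / np.linalg.norm(v0), v1 / np.linalg.norm(v1)
    theta = np.arccos(np.clip(np.dot(u0, u1), -1.0, 1.0))
    if np.isclose(theta, 0.0):
        return (1 - t) * v0 + t * v1  # nearly parallel: fall back to lerp
    return (np.sin((1 - t) * theta) * v0 + np.sin(t * theta) * v1) / np.sin(theta)

# Sweep t finely, decode every point, and note the t values where the output
# content flips from one concept to the other.
# points = [slerp(emb_a, emb_b, t) for t in np.linspace(0, 1, 50)]
```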

2.2. Can we Quantitatively Measure Intervention Scale Using Cosine Similarity?

Compute cosine similarity between original image embeddings and intervened ones, correlating with human judgments of feature presence. This provides a quantitative basis for measuring how interventions affect representation space and whether these changes align with perceptible changes in the output.

Research Value: Enables tracking of how specific latent space interventions correlate with output changes, potentially revealing thresholds where interventions become perceptually significant.
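A tiny sketch of the measurement, sweeping the intervention scale and recording cosine similarity between the original and edited embedding; the random vectors are stand-ins for a real embedding and a real SAE direction.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Random stand-ins for a real image embedding and a real SAE latent direction.
rng = np.random.default_rng(0)
embedding = rng.standard_normal(768)
direction = rng.standard_normal(768)
direction /= np.linalg.norm(direction)

for scale in [0.5, 1.0, 2.0, 4.0, 8.0]:
    edited = embedding + scale * direction
    print(scale, cosine_similarity(embedding, edited))
```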

3. Language Modeling

3.1. RLHF and Malicious Code

Given an RLHF'd model, finetuning it on malicious code creates bad character and vibes all around. Why? Is it that writing malicious code and bad character are fundamentally connected and entirely inseparable?

3.2. Helical Structures in Activation Space

There's some indication that LMs compute something that might look like a helix for numbers. And we know that other features are also circular in the activation space.

So we know of at least one case where we can see this kind of geometry with SAEs. But can we say, qualitatively, how SAEs see helical structures, or how they see circular structures? This could help us understand whether SAEs find other sorts of structures that are equally interesting.
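One hedged way to start poking at this: project every SAE decoder direction onto the 2D plane that carries a known circular feature and look at the angles. Below, W_dec and plane are random placeholders for a real SAE decoder matrix and a real circular subspace.

```python
import numpy as np

# Random placeholders for a real SAE decoder matrix and a real circular subspace.
rng = np.random.default_rng(0)
W_dec = rng.standard_normal((1024, 512))                 # n_latents x d_model
plane, _ = np.linalg.qr(rng.standard_normal((512, 2)))   # orthonormal basis of the circle's plane

proj = W_dec @ plane                                     # each latent's footprint in the plane
in_plane_norm = np.linalg.norm(proj, axis=1)
angles = np.arctan2(proj[:, 1], proj[:, 0])

# For the latents with the largest in-plane component: do their angles tile the
# circle, or do they cluster at a few preferred directions?
top = np.argsort(-in_plane_norm)[:20]
print(np.sort(np.degrees(angles[top])))
```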

3.3. Finding Helixes in Arbitrary Digit Places

Given 3-digit numbers (remember that they are composed of the ones, tens, and hundreds places), can we probe for a separate helix in each of those digit places?
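A minimal probing sketch for the circular part of that structure, assuming we've already collected hidden activations for the numbers 0-999; acts here is a random stand-in for real activations.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

numbers = np.arange(1000)
acts = np.random.randn(1000, 512)  # stand-in for real hidden activations

for place, name in [(1, "ones"), (10, "tens"), (100, "hundreds")]:
    digit = (numbers // place) % 10
    # A circle (and, with a linear ramp added, a helix) should be linearly
    # readable as sin/cos of the digit with period 10.
    target = np.stack([np.sin(2 * np.pi * digit / 10),
                       np.cos(2 * np.pi * digit / 10)], axis=1)
    probe = LinearRegression().fit(acts, target)
    print(name, "R^2:", probe.score(acts, target))
```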

3.4. Logical Operations Representation

Get a list of logical facts and then work through a bunch of truth tables that represent common operations like XOR, AND, OR, and NOT. Does this generalize to everything?
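A small sketch of generating truth-table style prompts to run through a model; the exact prompt format is just an assumption.

```python
from itertools import product

OPS = {
    "AND": lambda a, b: a and b,
    "OR": lambda a, b: a or b,
    "XOR": lambda a, b: a != b,
}

prompts = []
for name, fn in OPS.items():
    for a, b in product([True, False], repeat=2):
        prompts.append((f"{a} {name} {b} =", str(fn(a, b))))

# NOT is unary, so handle it separately.
for a in [True, False]:
    prompts.append((f"NOT {a} =", str(not a)))

for prompt, answer in prompts:
    print(prompt, answer)
```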

3.5. Self-Souping

Can you self soup by taking an MLP, folding it in half like a piece of paper, and training the rest?
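Purely as one possible reading of "folding in half" -- averaging layer i's weights with mirrored layer L-1-i's, which only makes sense if the hidden widths match -- here's a speculative PyTorch sketch.

```python
import torch
import torch.nn as nn

layers = [nn.Linear(128, 128) for _ in range(6)]
mlp = nn.Sequential(*[m for layer in layers for m in (layer, nn.ReLU())])

with torch.no_grad():
    L = len(layers)
    for i in range(L // 2):
        j = L - 1 - i
        # "Fold": average layer i with its mirror image, layer L - 1 - i.
        avg_w = (layers[i].weight + layers[j].weight) / 2
        avg_b = (layers[i].bias + layers[j].bias) / 2
        layers[i].weight.copy_(avg_w)
        layers[j].weight.copy_(avg_w)
        layers[i].bias.copy_(avg_b)
        layers[j].bias.copy_(avg_b)

# ...then keep training mlp and see whether anything interesting happens.
```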

4. Datasets with Weird Properties

A word of warning -- these ones are way more abstract.

4.1. Natural Human Reasoning

We usually think of language datasets as collections of documents. But it has long been a puzzle how to think about the intermediate reasoning steps it takes to produce them.

Keyboard input and editing don't tell us everything about how the thought process occurred, and self-reports of this process can plausibly be entirely confabulated and invented. But social conversation is an interesting test: in conversation, peers apply pressure to argue correctly and legibly, and the group makes partial progress together towards answering questions, even tricky or nonverifiable ones.

Could conversational dialogue datasets tell us something about how humans solve and work through problems together? For instance, how does the strong and marked preference for reasoning socially impact the sorts of reasoning algorithms people employ, like the preference for self-correction?

What questions could we ask of a dataset of actual natural human reasoning, imagining we had a way to get it?

4.2. Natural Human Play

Context: Game clients typically stream player actions to the host, which verifies their validity (e.g. that the player really was in reach, or that the diamond really was there). These actions are shortly thereafter discarded.

In Minecraft, players display very open-ended behaviour: constructing structures that are beautiful, exploring, fighting, social collaboration and trade, and so on.

Could an existing Minecraft server have this data lying around that could be used to train a model to predict the next action for a player? Alternatively, if that's not possible, could servers be spun up to host Minecraft players and collect anonymized player action data?

Imagine we had an agent that could really play Minecraft like a human. (Note this means much more than merely achieving narrow technical objectives like getting a diamond.) Would anything about that world be surprising to us? For instance, just to start -- what things do humans do when playing Minecraft that might be unusually weird for an agent to do too?