Zeyuan Allen-Zhu and Yuanzhi Li studied the "knowledge capacity" of transformers in their paper Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws. They trained transformers to memorize individuals' attributes in a fictional dataset of biographical profiles. The profiles contain:
- A random name, generated by uniformly sampling a first, middle, and last name from long lists of names (it's not quite uniform - duplicate full names are rejected, but the number of possible names is >> the number of profiles)
- A collection of uniformly random attributes
They generate collections of profiles of different sizes, and train transformers of different sizes to memorize the names and attributes. They find that transformers predictably memorize 2 bits of information per parameter, a figure that is basically unchanged between 16- and 32-bit parameters (but is much lower for 8-bit parameters).
Under their scheme, there is an easily computable lower bound on the amount of information required to describe a profile. Modulo some fixed overhead for encoding "what a profile looks like", we require roughly log2(number of possible names) bits for the name, plus log2(number of possible values) bits for each uniformly random attribute, per profile.
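As a rough illustration, here is that arithmetic as a sketch, using the attribute and name counts described later in these notes (4 attributes with ~100 values each; names drawn from ~2000 × 2000 × 1000 options):

```python
import math

# Back-of-the-envelope per-profile bound, using counts given later in these notes.
N_FIRST, N_MIDDLE, N_LAST = 2000, 2000, 1000
N_ATTRS, N_ATTR_VALUES = 4, 100

bits_per_name = math.log2(N_FIRST * N_MIDDLE * N_LAST)   # ~31.9 bits
bits_per_attrs = N_ATTRS * math.log2(N_ATTR_VALUES)      # ~26.6 bits
print(bits_per_name + bits_per_attrs)                    # ~58.5 bits per profile
```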
This is a lower bound - it's always possible to be less efficient. Given that the best scheme for encoding this data generating process is very straightforward (basically: learn the attributes, then learn which attributes go with which names), this may be a setting where transformers are about as efficient as they can be.
We're interested in studying the knowledge capacity of transformers on data generating processes that are more complex than this. For example, consider the "two-hop QA" DGP. Here, the dataset consists of a collection of questions like "Who is the employer of the brother of Paul Raimondo Eckletree?", and the answers are based on a database of randomly generated names, attributes and relationships between profiles. Optimally, we can store each profile only once, with an encoding scheme similar to that of the original biographical dataset. However, we believe that transformers will in fact have to store profiles twice: they have to "look up" the answer to the first "hop" of the question, and then use that answer to look up the answer to the second hop. Because they cannot reuse layers, the lookup function will have to be stored in two different layers.
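To make this DGP concrete, here is a minimal sketch of how such a dataset might be generated. The relation and attribute names are illustrative placeholders, not the exact configuration used in the experiments, and the second hop here is always an attribute (in the experiments it can also be another relation):

```python
import random

# Toy sketch of the two-hop QA data generating process. Names, relations and
# attributes here are illustrative placeholders.
FIRST = [f"First{i}" for i in range(2000)]
MIDDLE = [f"Middle{i}" for i in range(2000)]
LAST = [f"Last{i}" for i in range(1000)]
RELATIONS = ["brother", "mother", "father", "worst enemy"]
ATTRIBUTES = {"employer": [f"Company{i}" for i in range(100)],
              "birth date": [f"Date{i}" for i in range(100)]}

def make_profiles(n, rng):
    names = set()
    while len(names) < n:  # reject duplicate full names
        names.add(f"{rng.choice(FIRST)} {rng.choice(MIDDLE)} {rng.choice(LAST)}")
    names = sorted(names)
    # attributes are uniform over their value lists; relations are uniform over the profile names
    return {name: {"attrs": {a: rng.choice(vals) for a, vals in ATTRIBUTES.items()},
                   "rels": {r: rng.choice(names) for r in RELATIONS}}
            for name in names}

def two_hop_question(profiles, rng):
    name = rng.choice(list(profiles))
    rel, attr = rng.choice(RELATIONS), rng.choice(list(ATTRIBUTES))
    intermediate = profiles[name]["rels"][rel]        # first hop
    answer = profiles[intermediate]["attrs"][attr]    # second hop
    return f"Who is the {attr} of the {rel} of {name}?", answer

rng = random.Random(0)
profiles = make_profiles(1000, rng)
print(two_hop_question(profiles, rng))
```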
We're interested in compression efficiency of transformers for a few reasons:
- We might be able to establish bounds on how well transformers can compress some data generating processes of interest; for example, it might be infeasible for transformers to learn "many-hop" reasoning (for sufficiently many hops and sufficiently complex data) even if the transformer is very large.
- This might allow us to study inductive biases of transformers. We could hypothesize that if transformers don't efficiently compress some DGP, they may also be slow to learn it.
Models seem to converge in 1E7 steps, regardless of n_params or n_profiles. Need more data for a more precise estimate.
Suppose the dataset consists of a fraction
I've been looking into using transformer capacity to investigate transformers' reasoning capabilities.
Transformer capacity measurement is an idea proposed in the paper Knowledge Capacity Scaling Laws. We train a series of small models of different sizes on datasets which require substantial memorization in order to minimize loss. We train them until convergence and find that the minimal loss scales with the number of parameters in the model; we can use this loss scaling to probe the information storage capacity of transformers.
The basic strategy for probing model capacity is to assume a distribution of the data. We can use the entropy of this distribution to derive the number of bits required to encode the entire dataset. We can then subtract the model's loss across the whole dataset from this entropy to find a capacity figure.
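Concretely, the calculation is just a subtraction; a minimal sketch, assuming the assumed-distribution entropy is measured in bits and the model's loss is summed over the dataset in nats:

```python
import math

def estimated_capacity_bits(assumed_entropy_bits, total_loss_nats):
    """Capacity estimate: assumed dataset entropy minus the model's total loss
    over the dataset, with the loss converted from nats to bits."""
    return assumed_entropy_bits - total_loss_nats / math.log(2)

# Hypothetical numbers: a dataset with 1e6 bits of assumed entropy and a model
# whose summed cross-entropy loss over that dataset is 2e5 nats.
print(estimated_capacity_bits(1e6, 2e5))   # ~7.1e5 bits stored
```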
Allen-Zhu and Li use fictional "biographies" of individuals with a collection of uniformly random attributes as their dataset, and the corresponding distributions used to measure model capacities are the uniform distributions on the respective attributes. They find that across many sizes of model and dataset, transformers store roughly 2 bits of information per parameter.
I'm interested in using this idea to probe transformer reasoning capabilities. Specifically, we can use this technique to probe the complexity of the transformer's encoding of various data generating processes. Under the assumption that the transformer's capacity C is fixed by its number of parameters, if we vary the data size while keeping the data generating process fixed, we can figure out how the number of bits the transformer needs to describe the data scales with data size (because, by the assumption of fixed capacity, as long as loss > 0, for dataset size N we have loss(N) ≈ B(N) - C, where B(N) is the number of bits the transformer's encoding requires for the full dataset, so measuring loss(N) for several N reveals how B(N) scales).
We can compare the complexity of the transformer's representation with the known optimal complexity (because we control the data generating process). We can also hypothesize about how exactly the transformer is representing things: if a certain representation algorithm's storage requirements scale in the same way as the transformer's, then maybe this is the algorithm it's using, though at best we can deduce a "complexity equivalent" algorithm, not the actual algorithm.
The first experiment I've pursued is to look at the complexity of "2-hop" reasoning, that is, answering questions like "Who is Bob's mother's father?". Optimally, we can answer questions of this form while memorizing (essentially) no more than is required to answer 1-hop questions: we have to memorize everyone's mother and father, but then we can apply this knowledge in two "hops" to find out Bob's mother's father. However, it is difficult to come up with algorithms that a transformer can run which answer these questions with significantly less than twice the amount of memorized information. If the transformer looks up the answer to "who is Bob's mother?" in layer 1, and then looks up the answer to "who is X's father?" in layer 2, it must memorize all the relationships twice (once in layer 1 and once in layer 2) in order to answer the question (there is probing-based evidence for this replication of information across layers in Grokked Transformers).
As a result, we hypothesize that transformers have an efficiency of around 0.5 on the two hop reasoning task. In fact, the models I've trained so far seem to be even less efficient than this. We roughly replicate the 2 bits per parameter capacity for one hop questions. However, for two hop questions, we see a similar capacity only when we assume that the model encodes the answer to each two hop question individually. That is, instead of learning everyone's mother and everyone's father, it directly memorizes everyone's mother's father. Because the dataset had 4 different relation types in it, this works out to an efficiency of about 0.25.
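To spell out the efficiency arithmetic, under the simplifying assumption that every memorized answer costs roughly the same number of bits:

```python
# Rough efficiency arithmetic. With 4 relation types, each one hop question can
# be prefixed by any of the 4 relations, giving 4x as many two hop questions.
n_relations = 4
optimal = 1.0                                  # store each one hop fact exactly once
two_layer_lookup = 2 * optimal                 # store the relation facts once per hop/layer
memorize_every_answer = n_relations * optimal  # memorize every two hop answer directly

print(optimal / two_layer_lookup)        # 0.5  -> the hypothesized efficiency
print(optimal / memorize_every_answer)   # 0.25 -> roughly what we observe
```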
Three possible explanations for the failure to learn the true 2-hop algorithm:
- The models are simply too small. Grokked Transformers found that larger models learned 2-hop reasoning much faster, and their smallest model was 10M parameters vs. our largest model at 1.5M parameters.
- The models are too shallow; Grokked Transformers used a minimum of 8 layers, while our models were only 4 layers deep.
- Insufficient incentive to learn 1 hop reasoning. In our data mix, two hop questions were 10 times more frequent than one hop questions, while there were 4 times as many different two hop questions as one hop questions. Thus memorizing the answer to a two hop question reduces the loss by about 2.5 times more than memorizing the answer to a one hop question, and if the model never learns one hop reasoning then perhaps it will never learn to compose one hop reasoning steps into two hop reasoning.
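Spelling out the incentive arithmetic in the last point:

```python
# Per-question training signal for two hop vs one hop questions.
relative_frequency = 10   # two hop questions are 10x as frequent in the mix
relative_variety = 4      # but there are 4x as many distinct two hop questions

# Expected appearances of any individual two hop question, relative to any
# individual one hop question: memorizing its answer removes ~2.5x more loss.
print(relative_frequency / relative_variety)   # 2.5
```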
I am currently running follow-up experiments with larger models (up to 10M parameters) and more relation types in the dataset (17 instead of 4). I am also trying a version of the original experiment in which one hop questions are only 3 times less frequent than two hop questions, to test the hypothesis that the failure to learn the true 2-hop algorithm is due to insufficient incentive to learn one hop reasoning.
The dataset consists of N fictional profiles, each with a set of attributes and relations. The attributes are "employer", "date of birth", "place of birth", "university"; there are about 100 possible values for each and each profile has a uniformly sampled value for each attribute. Learning the possible values of each attribute likely takes some capacity, but it is probably a small quantity that does not scale with N. On the other hand, the values of the relations are uniform over the set of all names, which are unique for each profile.
Names are generated by sampling from about N1=2000 first names, N2=2000 middle names and N3=1000 last names, for approx. 4 billion possible names. On the other hand, we typically consider N=10000-50000 profiles, a small fraction of the total number of possible names. Additionally, each profile has multiple relations which are all answered by the same set of names. Thus it is more efficient to learn the set of names and encode each relation as a uniform sample from this set - taking log2(N) additional bits per relation per profile - instead of learning only the fixed set of first, middle and last names and using log2(N1)+log2(N2)+log2(N3) additional bits per relation per profile.
The optimal way to encode the choice of names is as a combination; for large N and M := N1*N2*N3, it takes approximately N log2(M) - N log2(N) bits to encode the choice of names. A simpler encoding is to use N log2(M) bits; this would correspond to simply listing the set of chosen names. I don't have a reliable method to check which encoding the network actually uses, but I do find that the estimated capacity for different N is closer to constant if I suppose the network encodes the names as a list.
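A quick numerical comparison of the two encodings (a sketch; M is the ~4 billion possible full names, and the combination count is computed exactly via log-gamma):

```python
import math

N1, N2, N3 = 2000, 2000, 1000
M = N1 * N2 * N3   # ~4e9 possible full names

def bits_combination(n):
    # exact log2 of (M choose n)
    return (math.lgamma(M + 1) - math.lgamma(n + 1) - math.lgamma(M - n + 1)) / math.log(2)

def bits_list(n):
    # simply listing n independently chosen names
    return n * math.log2(M)

for n in (10_000, 50_000):
    approx = n * (math.log2(M) - math.log2(n))   # the approximation used above
    print(n, round(bits_combination(n)), round(approx), round(bits_list(n)))
```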
This could be more carefully investigated by training models simply to memorize a set of N names (with no attributes) and seeing how the capacity scales with N.
The model capacity is slightly lower for two hop reasoning than one hop reasoning, even when we assume that it is using the very inefficient method of memorizing the answer to each question individually. This does not seem to be fully explained by the model learning a small number of one hop answers; while it does do this, there remains a small gap (see the "2.1 hop" capacity on the right hand side of the plots).
We can see two different regimes of operation: one in which the model learns a substantial amount of profile-specific information, and one where the model appears to learn only the set of names in the data (characterised by the merging of the "one hop" and "two hop" capacity curves). Interestingly, in the transition between these regimes model capacity seems to drop slightly; I don't presently have any explanation for this. Note that capacity calculations depend on assumptions about how the model is "encoding" information. I would not be surprised if these assumptions were wrong; I just don't have any ideas about how they might be wrong that would explain this pattern.
When they are subject to capacity constraints, the models don't devote their capacity evenly to all of the available question types. We can see, for example, that the 1.1M parameter model trained on the 15k profile dataset seems to have learned less about 2-hop "parent" questions in exchange for learning more about 2-hop "birth date" and "worst enemy" questions.
Transformers at best require about twice the capacity of the optimal algorithm to learn 2-hop reasoning
I have come up with algorithms that are very slightly more efficient than "look up the answer to the first hop, then look up the answer to the second hop", but the advantage is very small. E.g. at each layer we look up 1 bit of the first hop answer, and at every layer past the first we look up 1 bit of the second hop answer given all the remaining first hop possibilities. This potentially saves some capacity because one layer can look up the second bit of the first hop and the first bit of the second hop. I don't think it saves much capacity in the end, though I need to write up a more careful analysis.
Testing this hypothesis is somewhat difficult because I've found that transformers seem to be completely unable to learn 2-hop reasoning if their capacity is less than 3x the 1-hop requirement. Maybe we need to train a model to do 2-hop reasoning and then fill up its excess capacity with additional data.
Self explanatory.
Can transformers that only learn incomplete 1-hop reasoning learn 2-hop CoT?
If we can correctly predict the capacity required to learn increasingly large subsets D_i ⊂ D_{i+1} of D', then we will see perfect generalization from some subset D_i to D' iff we predict that, for j ≥ i, the capacity required to learn D_j with no loss equals the capacity required to learn D_i with no loss. That is, if capacity scaling is predictable then we can use it to predict generalization.
Specifically for 2 hop reasoning:
- Two hop reasoning will not generalize in regimes where capacity scales according to the "big hash" scheme
- Two hop reasoning can generalize to novel first & second hop combinations if it scales as 2x the "optimal" capacity, but cannot generalize to novel second hop questions
- Two hop reasoning will generalize to questions where the first hop has only been seen in one-hop questions only if we observe that capacity requirements do not increase when more first hops (that were already present in the one hop question set) are added to the two hop question set
- It's possible that adding additional first hops to the two hop question set incurs a very small capacity cost, which prevents generalization nonetheless, but this cost should still scale with # of first hops added
- Transformers with CoT can learn 2-hop reasoning with novel first or second hops (that have been seen in one hop questions only), iff two-hop + CoT scales optimally
If there are too many two hop questions and not enough relations, models can get stuck in a local minimum
If the model is always incentivised to better learn the "big hash" two hop algorithm, it will never learn one hop reasoning and so it can never learn the more efficient two hop algorithm.
This is motivated by observation; models seem to need capacity about 3x the 1-hop capacity to learn 2-hop reasoning at all.
It would be interesting if there was a theory that supported this observation. I have a preliminary theory that does not support it. If, during a training step, the model has a fixed "information budget" that it can allocate to improving first hop or second hop reasoning (i.e. it can learn an extra bit in the first hop or an extra bit in the second hop, or split half a bit each way), then the optimal strategy allocates about equally between both hops.
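A toy version of that calculation, purely illustrative and assuming a two hop question is only answered correctly when both hops are known (so two hop performance ~ p1 * p2 when fractions p1 and p2 of the first and second hop facts are known):

```python
# With a fixed budget p1 + p2 = B to split between first hop and second hop
# knowledge, two hop performance ~ p1 * (B - p1) is maximized by an even split.
B = 1.0
candidates = [i / 100 for i in range(101)]            # possible first hop allocations p1
best_p1 = max(candidates, key=lambda p1: p1 * (B - p1))
print(best_p1)   # 0.5 -> an even allocation between the hops is optimal
```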
What kind of multihop reasoning do transformers learn? Which kinds do we care about for safety evaluation?
Deepmind: "We select single-hop facts that are likely well-known, but their composition is unlikely to naturally appear in general text corpora, to minimize the chance of the model developing a shortcut between e1 and e3. We observe that such cases typically occur when the set of possible options for e2 is large, there are numerous e1’s that map to the same e2, and the set of possible options for e3 is not too small (e.g., not person-bloodtype)."
Say a "type" of e2 is an attribute that e1s have - so a person might have "employer", "university", "bloodtype", etc. Possible e2s are possible values for a particular attribute.
Capacity based analysis: we may get shortcuts when capacity(shortcut) is not much larger than capacity(no shortcut).
- Capacity(shortcut) ∝ # of "types" of e2 related to e1 * number of possible e3s for each e2 (we need a shortcut for each type of e2)
- Capacity(no shortcut) ∝ # of e3s related to each e2 + # of e2s related to e1
- No shortcut capacity is 0 if we need to know e3s for each e2 and e2s for each e1 for other reasons
- If we don't need that, then shortcut capacity could be less than no shortcut capacity (e.g. only 1 kind of e2 for each e1, many kinds of e2, we don't need e1->e2 mapping)
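One way to make these proportionalities concrete: a toy comparison with made-up counts, reading "number of possible values" as a log2 cost per memorized fact.

```python
import math

# Toy marginal-capacity comparison for shortcut formation; all counts are made
# up, and "capacity" here is just bits of memorized lookup tables.
n_e1 = 10_000        # head entities
n_e2_types = 4       # attribute types linking e1 to e2
n_e2_values = 1_000  # possible e2s per type (large, per the DeepMind criterion)
n_e3_values = 100    # possible e3s (not too small)

# Shortcut: memorize one e1 -> e3 answer per e2 type per e1.
shortcut = n_e1 * n_e2_types * math.log2(n_e3_values)

# No shortcut: memorize e1 -> e2 per type, plus the shared e2 -> e3 facts.
# If those facts are needed anyway for other reasons, their marginal cost is
# ~0 and only the shortcut costs anything extra.
no_shortcut = (n_e1 * n_e2_types * math.log2(n_e2_values)
               + n_e2_values * math.log2(n_e3_values))

print(round(shortcut), round(no_shortcut))
```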