On Transformers and Vectors
Fortune Adekogbe
Posted on April 8, 2024
A friend asked me some questions about how tokens are converted into vectors, how matrix multiplication can lead to anything resembling understanding, and how to conceptually wrap one's head around high-dimensional spaces. I gave him a fairly detailed response, which others appreciated, and they suggested I turn it into a post. Hence this. I hope you find it interesting.
Q: In simple terms, how are tokens encoded into vectors?
A:
First, I want to highlight the what and why of tokenization.
Raw data in text, audio, or video form has to be broken down into smaller bits because of our skill issues. We don't have the computational facilities or efficient algorithmic techniques to process these things as a whole. The resulting bits are called tokens, and they are created using something we call a tokenizer.
(I will assume text data during this explanation.)
The tokenization step is important because the way you break down data directly affects the amount of contextual understanding you can get from it. For that reason, you probably don't want to tokenize your sentences at the character level.
Word-level tokenization is a popular strategy because words mean more to us, but a constraint here is that you can only understand the words that are in your dataset's vocabulary. If someone comes along with a word outside it, your tokenizer will not be able to handle it.
This led us to subword tokenization, which involves breaking some words into parts. For instance, "reformers" could become ("re", "form", "ers") and "translate" could become ("trans", "late"). In this way, if we get a word like "transformers" that was not in the original set of words, we can break it into ("trans", "form", "ers"). If we also need to handle a word like "relate", we can break it into ("re", "late"). This strategy of breaking down words means that we can use the information we have to handle these new words. Fascinating, right?
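If you want to see the splitting idea in code, here is a toy sketch. Real subword tokenizers (BPE, WordPiece and friends) learn their vocabularies from data; this one just does greedy longest-match against a tiny hand-picked vocabulary, purely to illustrate the idea above.

```python
# Toy subword tokenizer: greedy longest-match against a tiny hand-picked
# vocabulary. Real tokenizers learn their vocabularies from data; this is
# only meant to illustrate the splitting idea.

VOCAB = {"re", "form", "ers", "trans", "late"}

def subword_tokenize(word, vocab):
    """Split `word` into the longest matching vocabulary pieces, left to right."""
    pieces, i = [], 0
    while i < len(word):
        # Try the longest possible piece first, shrinking until one matches.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No piece matches: fall back to an "unknown" marker for this character.
            pieces.append("<unk>")
            i += 1
    return pieces

print(subword_tokenize("transformers", VOCAB))  # ['trans', 'form', 'ers']
print(subword_tokenize("relate", VOCAB))        # ['re', 'late']
```

Greedy longest-match is roughly what WordPiece does when it splits a word at inference time; the interesting part in practice is how the vocabulary of pieces is learned in the first place.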
Now, to the encoding. We take these tokens, and the goal is to represent them mathematically in a way that similar tokens (words, if it's easier) get similar representations, aka vectors.
The vectors for "near" and "close" should be very similar. Same thing for "fortune" and "money" as well as “schism” and “division”. This is because those words are more likely to be used in similar contexts.
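"Similar" here has a concrete meaning: a common way to compare two vectors is cosine similarity, the cosine of the angle between them. The numbers below are made up purely for illustration; real embeddings are learned, not hand-written, and have hundreds or thousands of dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 3-dimensional "embeddings", purely for illustration.
near  = np.array([0.9, 0.1, 0.3])
close = np.array([0.8, 0.2, 0.3])
cloth = np.array([0.1, 0.9, 0.7])

print(cosine_similarity(near, close))  # high, about 0.99
print(cosine_similarity(near, cloth))  # much lower, about 0.36
```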
Practically, there are a lottttt of different ways that we can go about this. I will explain one of the simpler ones to make it easier to understand.
Remember those English "fill-in-the-gap" questions where they give you a list of options and ask you to pick the one that best completes the sentence? You can answer them because you understand what words should come after the parts before the gap and before the parts after the gap.
For instance, if I say:
The world is _____ here. You know that "quiet" fits that better than "academic", "courtesy" and "cloth".
Similarly, we teach models to learn what tokens (read: words, if you prefer) fit into a particular context in a sentence. At the end of this, the vectors for words like "father," "man," and "male," for instance, will be very highly correlated.
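You can actually watch a model do this. Here is a sketch using the Hugging Face transformers library (my choice of tool, not something the discussion above assumed); BERT was trained on exactly this kind of fill-in-the-gap objective, so we can ask it to fill the blank from earlier.

```python
# A sketch using the Hugging Face `transformers` library (my choice of tool).
# BERT was trained on a fill-in-the-gap objective, so it can rank candidates
# for the blank. The model is downloaded on first run.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill("The world is [MASK] here."):
    # Each prediction carries the proposed token and the model's confidence in it.
    print(prediction["token_str"], round(prediction["score"], 3))
```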
In slightly more mathematical terms, we train these models to maximize the probability of getting a target word given the words that are used in the surrounding context.
There is something very fascinating about this too. We realized that if you take the numerical difference between the vectors for words like "mother" and "father" and add or subtract it from that of "queen," you get something close to the vector for "king."
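Here is a sketch of both ideas using the gensim library (again, my choice of tool, not something prescribed above). With a toy corpus this small the vectors won't actually be meaningful; real models are trained on billions of tokens, which is where the king/queen arithmetic genuinely shows up.

```python
# A sketch using the gensim library. sg=0 selects CBOW: predict the target
# word from its surrounding context, which is the objective described above.
from gensim.models import Word2Vec

# Toy corpus; far too small for meaningful vectors, purely to show the API.
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "father", "is", "a", "man"],
    ["the", "mother", "is", "a", "woman"],
]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0, epochs=50)

# Words used in similar contexts end up with similar vectors.
print(model.wv.most_similar("king", topn=3))

# The famous analogy: queen + (father - mother) should land near "king".
# With a large, well-trained model this works surprisingly often.
print(model.wv.most_similar(positive=["queen", "father"], negative=["mother"], topn=3))
```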
Transformers do something very similar, but they also include something known as "positional encoding," which factors in the position of a token in the sentence to get a better representation. The idea here is that words can mean different things based on their position in a sentence, and that context is important.
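As an aside, the original Transformer paper used fixed sine and cosine waves of different frequencies for this; many later models learn the position vectors instead. A sketch of the sinusoidal version:

```python
import numpy as np

def sinusoidal_positional_encoding(num_positions, dim):
    """Positional encodings from the original Transformer paper:
    PE(pos, 2i)   = sin(pos / 10000^(2i/dim))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/dim))
    Each position gets its own pattern, which is added to the token's vector.
    """
    positions = np.arange(num_positions)[:, None]        # (num_positions, 1)
    i = np.arange(dim // 2)[None, :]                      # (1, dim/2)
    angles = positions / np.power(10000, 2 * i / dim)     # (num_positions, dim/2)
    encoding = np.zeros((num_positions, dim))
    encoding[:, 0::2] = np.sin(angles)  # even dimensions get sine
    encoding[:, 1::2] = np.cos(angles)  # odd dimensions get cosine
    return encoding

# One 8-dimensional position vector per token in a 10-token sentence;
# these get added to the token embeddings before the model sees them.
print(sinusoidal_positional_encoding(10, 8).shape)  # (10, 8)
```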
Q: How does matrix multiplication encode meaning into vectors?
A:
Welcome to gradient descent.
For starters, thinking about this in terms of matrix multiplication is accurate but a bit too general, so I understand your question.
The most common operations we carry out are addition and multiplication. While this sounds basic, there are various ways that these matrices are combined. What makes them work is, first, the actual sequence of operations. This is what is called the model architecture: essentially, what happens to an input as it goes through the model and leaves as a predicted output.
Pinning down the contents of this architecture (the values inside those matrices) is the goal of model training. On a general level, deep learning aims to find a way to encode the function that transforms an input into an output without explicitly knowing what that function is. All we know is that whatever that function is, it is somehow represented in the architecture.
To achieve this "function encoding", we iteratively expose the architecture, which starts out as a series of randomly initialized matrices (they could even start as all zeros, for instance), to input-output pairs.
When each input (or batch of inputs) goes from the entrance to the exit of the architecture, we compare the predicted output to the actual output and compute the difference. This difference is then used to update all the matrices in the architecture that started out randomly initialized.
We continue doing this until we can no longer do it because of cost or because the model doesn't seem to be improving anymore. At this point, the matrices are very different from how they started because of all the updates.
To be clear, in every step of this process, we are just doing matrix multiplication, but we also figured out that by experimenting with different sequences of operations in the architecture, we can get better results.
Transformers are only the most recent result of this experimentation. They were derived from the Attention mechanism which tries to mimic attention in humans. This was preceded by a range of specialized architectures like "long short-term memory networks," "convolutional neural networks," and so on. All of these came from the fully connected network, which was derived from the perceptron, which is essentially a glorified y = mx + c aka linear regression.
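To make that training recipe concrete, here is a minimal sketch that uses gradient descent to recover the m and c of that "glorified y = mx + c" from input-output pairs. It is the same loop described above (predict, compare to the actual output, use the difference to nudge the parameters), just scaled down to two numbers instead of millions of matrix entries.

```python
import numpy as np

# Input-output pairs generated from a "secret" function y = 3x + 2, plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = 3 * x + 2 + rng.normal(scale=0.1, size=200)

# Randomly initialized "architecture": here just two parameters, m and c.
m, c = rng.normal(), rng.normal()
learning_rate = 0.1

for step in range(500):
    y_pred = m * x + c            # forward pass: the model's prediction
    error = y_pred - y            # difference from the actual output
    loss = np.mean(error ** 2)    # a single number summarizing how wrong we are
    # Gradients of the loss with respect to m and c, derived by hand here;
    # deep learning libraries compute these automatically for any architecture.
    grad_m = 2 * np.mean(error * x)
    grad_c = 2 * np.mean(error)
    m -= learning_rate * grad_m   # nudge the parameters
    c -= learning_rate * grad_c   # in the direction that reduces the loss

print(round(m, 2), round(c, 2))   # should land close to 3 and 2
```

Swap the two parameters for millions of matrix entries, and the hand-derived gradients for automatic differentiation, and you have, conceptually, how everything in that lineage gets trained.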
Q: What's the deal with the myriad of dimensions? How am I supposed to wrap my mind around a 12,000-dimensional space?
A:
As regards the number of dimensions, you are NOT meant to wrap your head around it. 😂 But on a high level, you can take it to mean that to properly describe the data we have, we need to specify 12,000 characteristics per data point. If we go any lower, we may lose information.
For instance, if I try to describe you with just two words, I will either throw you into some category that you may not like or drop something that can barely be regarded as a description.
But the more words I am allowed to use, the better I can represent my understanding of you. That said, one could reasonably argue that 12,000 characteristics would be overdoing it, and that by about 8,000, I should have a good enough description of the data point; anything beyond that is unnecessary verbosity. But that is a different discussion.