Large Language Models Process Explained. What Makes Them Tick and How They Work Under the Hood!

21 minutes 53 seconds

S1

Speaker 1

00:00

Hey, YouTube. Today, we're going to talk about the entire process of how our large language models actually compute output tokens. This video is not meant to be a deep dive into the complex math, but rather an overview of the concepts behind each component of a large language model. So we'll start with an overview of some important concepts.

S1

Speaker 1

00:22

We'll move on to tokenization, then embedding spaces, and finally attention and how it all comes together to actually compute the output tokens. So let's get started. We just need to understand a few key concepts to make sure we understand all the deeper magics happening in our large language model, starting with softmax. Softmax shows up in our multi-head attention and when we're computing our output logits.

S1

Speaker 1

00:48

Why it's there: the values of intermediate states in our large language model can vary wildly over a large range, for example between negative 10,000 and 10,000, and it helps our model learn if we constrain these values to between 0 and 1 before passing them on. So, for example, if our output vector from something, whether it's our multi-head attention or our actual output scores, is 15 and 13, then it would get scaled to roughly 0.88 and 0.12. I know the way that's scaled doesn't seem intuitive, but that's just the way softmax works.
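Here is a minimal sketch of that scaling in NumPy; the [15, 13] input is the example from above, everything else is just illustrative.

```python
import numpy as np

# A minimal sketch of softmax: exponentiate each value, then divide by the
# sum so everything lands between 0 and 1 and sums to 1.
def softmax(x):
    e = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return e / e.sum()

print(softmax(np.array([15.0, 13.0])))  # -> roughly [0.88, 0.12]
```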

S1

Speaker 1

01:34

It really does favor the max value quite a bit more than the others. Then we have layer normalization, which is applied to those intermediate states. So if we have some network layer, layer 1, and it feeds into layer 2, what we'd like to do is, for each sample, so some sample 1 that was input here, some sample 2, and so forth, normalize every value across that sample so they all land in a small, consistent range, technically zero mean and unit variance.

S1

Speaker 1

02:10

If we have some matrix of values, then we normalize across each row of that matrix, so this row A, B, C, D, E, F gets pulled into that same consistent range. The next concept is that matrices do not impart positional information. A matrix is just a stack of rows, where a row represents an entity, in this case a token, and nothing in the matrix itself says which row comes first; we could shuffle the rows up and down and the math that follows would treat them the same.
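As a minimal sketch of layer normalization, here is the standard zero-mean, unit-variance normalization per sample (without the learned scale and shift a real implementation adds); the numbers are made up.

```python
import numpy as np

# A minimal sketch of layer normalization: each row (one sample's activations)
# is shifted to zero mean and scaled to unit variance before the next layer.
def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

sample = np.array([[2.0, 10_000.0, -350.0, 7.5]])
print(layer_norm(sample))   # values now hover in a small range around 0
```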

S1

Speaker 1

02:45

We don't get any positional information from the matrix itself, and we'll see why that's important later. And then finally, feed-forward layers. These are just your standard multi-layer perceptron layers, where you have some neurons and they're fully connected. And all these are doing is massaging the data as it passes between layers.

S1

Speaker 1

03:07

And those are the key concepts we need to understand. Now let's talk about the first step in our large language models, which is tokenization. So the first thing we need to talk about is how we actually create the tokens that our model is going to be using. This is through a process called tokenization, of which there are three primary types: word-wise, character-wise, and sub-word tokenization. In word-wise tokenization, every word in our language becomes a token.

S1

Speaker 1

03:38

But this has some issues, namely that it leads to massive vocabularies and really doesn't allow our model to handle misspellings well, if at all. The model just has a lot less flexibility in how it learns to handle our completions. In character-wise tokenization, every character is a token, but this has the inverse issue: the vocabulary is very small, so the model has to learn how to combine all of these different characters into every possible word.

S1

Speaker 1

04:07

And this would likely just result in a massive model. The third method, sub-word tokenization, is kind of a compromise between the other two. It's the one that we tend to use because it gives us a lot of flexibility and, unlike word-wise tokenization, it allows us to handle misspellings. And the most common type of this method is byte-pair encoding.

S1

Speaker 1

04:32

This is an iterative process where we start with just our character vocabulary and merge the most frequent pairs into new tokens at each iteration. If you'd like to learn more about this, here's a video on the whole byte-pair encoding process. But this leads to our first step, called tokenization. This happens because our models don't really understand language the way we do as humans.

S1

Speaker 1

04:55

We have to map our input strings onto these tokens that we created through byte-pair encoding. For example, the sentence 'I like walking my dog' is broken out into the tokens 'I', 'like', 'walk', 'ing', 'my', 'dog'. These are then turned into token IDs, which are just numeric representations drawn from our available vocabulary. Now, let's move on and talk about the next step once we have our tokenization, which is embedding.
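As a small sketch, assuming the Hugging Face transformers library is available: GPT-2's tokenizer uses byte-pair encoding, and the exact splits depend on its learned vocabulary, so 'walking' may or may not actually be split into 'walk' + 'ing'.

```python
from transformers import AutoTokenizer

# Load a byte-pair-encoding tokenizer (GPT-2's) and look at how it splits text.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
text = "I like walking my dog"

tokens = tokenizer.tokenize(text)   # token strings; a leading 'Ġ' marks a space
ids = tokenizer.encode(text)        # the numeric token IDs the model actually sees
print(tokens)
print(ids)
```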

S1

Speaker 1

05:26

So after tokenization, we're now going to create our embedding. And just as a reminder, an embedding is some n-dimensional space, in this case it could be a sphere, where points in this space represent tokens. And the closer they are, the more related they are.

S1

Speaker 1

05:45

So for example, these could be names like Jeff or Alex or Sam. And they're related just because they represent names, but this could be true for various other concepts like cars or trucks or motorcycles, or really anything else that happens to be related to one another. And we have to have this because our models don't understand words the way we do as humans. Instead, they rely on these embeddings to understand how things are topically related.

S1

Speaker 1

06:16

And embedding spaces, in the case of these tokens, are typically very high-dimensional spaces; 768 dimensions, or even 12,000 and more, are not uncommon. The model understands how things are related through metrics like Euclidean distance or cosine similarity. This just gives our model some intuition about how to relate concepts before we start training our weights. What happens is, immediately after tokenization, we're going to embed our tokens.
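As a toy sketch of those similarity metrics, the three-dimensional vectors below are made up, not real model embeddings.

```python
import numpy as np

# Cosine similarity: closer (more aligned) embedding vectors score higher.
def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

jeff = np.array([0.9, 0.1, 0.2])
alex = np.array([0.8, 0.2, 0.1])
truck = np.array([-0.3, 0.9, 0.4])

print(cosine_similarity(jeff, alex))   # high: both are names
print(cosine_similarity(jeff, truck))  # lower: unrelated concepts
```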

S1

Speaker 1

06:53

So if we use the same input string as earlier, I like walking my dog, this gets broken out into 6 separate tokens. And if we assume that our embedding space is three-dimensional, we would end up with a 6 by 3 matrix of all of our token embeddings. And then we're going to apply our positional encoding. And if you'd like to learn more about positional encoding, we have a video on it here.
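Here's a toy sketch of that step, assuming six tokens and a three-dimensional embedding space as in the example, with a sinusoidal positional encoding added on top; real models use far higher dimensions and a learned embedding table.

```python
import numpy as np

num_tokens, dim = 6, 3
embeddings = np.random.randn(num_tokens, dim)   # stand-in for the 6x3 embedding lookup

# Sinusoidal positional encoding: one row per token position.
positions = np.arange(num_tokens)[:, None]
i = np.arange(dim)[None, :]
angles = positions / np.power(10000, (2 * (i // 2)) / dim)
pos_enc = np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

x = embeddings + pos_enc    # what actually flows into the attention layers
print(x.shape)              # (6, 3)
```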

S1

Speaker 1

07:18

We also have a video that walks more deeply into embeddings here. But once we've gone through positional encoding, we're now going to go into attention, which is the star of this video, and see how it actually empowers our model to select the next token. So here's where all the magic happens: attention. The way we want to start thinking about this is by personifying it a little bit, making it a little more human, because what it's meant to do is emulate how humans solve problems.

S1

Speaker 1

07:55

So for example, if I want to understand what this is and I ask myself what it is, I take some context clues. It has a handle, it has bristles, it must be, well, it's not a waffle, right? It's a hairbrush. And so we'd like to give our models the ability to do something similar and consider context when they're trying to complete sentences for us.

S1

Speaker 1

08:23

This is done through what are called the query, key, and value matrices, which are calculated by linear transformations. These linear transformations are learned during the training process, so every attention layer is going to be slightly different, because the transformations it learns to compute these differ across training. But to get a slightly better understanding of what these queries, keys, and values are meant to represent, we can think of this as a movie recommendation system.

S1

Speaker 1

09:00

So the query in this case would be the type of movie that we want to watch; say, a horror movie. The keys are metadata about all possible movies. You could think of this as being like, oh, this movie has jump scares, this movie is about cars, and so forth.

S1

Speaker 1

09:22

Just some data that we can take advantage of in making our selections. And then the values are just the possible movies themselves, so V/H/S or, say, Back to the Future. And then we can make our selections from there.

S1

Speaker 1

09:42

And clearly, Back to the Future is probably not going to fulfill our request, whereas V/H/S, the horror movie, would. So the language model will use all three of these to consider how to compute tokens. So if we look up here at this example, which is from Towards AI and is a great illustration, I think, for the phrase 'walk by river bank', we get our input embeddings, which are then converted into our query, our key, and our value.

S1

Speaker 1

10:17

The first things that get multiplied together are our query and our keys, and really the way to think about the result is as an attention score. This attention score tells the model: what am I really considering in this context? We can then use this attention score to weight our values and start computing the most likely next token.

S1

Speaker 1

10:55

So just to go over this again, because there's a lot going on here, we take our input, we create our embeddings, those embeddings create our query and our key and our value. We want to compute what am I paying attention to? What is the most important context here? We do that by multiplying our query and our key.

S1

Speaker 1

11:19

And then we do softmax. The softmax, as we mentioned earlier, takes a matrix of values and puts it on a 0-to-1 scale, which is just easier for our model to work with. And then we multiply that by our values.
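Putting those steps together, here is a minimal single-head sketch of scaled dot-product attention with toy sizes; the random matrices just stand in for the learned query, key, and value projections.

```python
import numpy as np

# Scaled dot-product attention: scores = Q @ K^T / sqrt(d), softmax the
# scores, then use them to take a weighted mix of the value vectors.
def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)   # how much each token attends to each other token
    weights = softmax(scores)       # each row now sums to 1
    return weights @ V              # weighted mix of the values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(6, 3)) for _ in range(3))   # six tokens, toy dimension 3
print(attention(Q, K, V).shape)     # (6, 3)
```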

S1

Speaker 1

11:36

And this allows us to start computing our output logits. All the logits are is scores over our possible completions, probabilistically speaking. So now let's move on to multi-head attention, because we seem to hear a lot about that, and how it works.

S1

Speaker 1

12:00

Multi-head attention is just an attention layer, but with the idea that more heads are better than one. This helps our model generalize to more complex inputs, because we have multiple attention layers running in parallel. When you see multi-head attention, the heads are not run sequentially; they run in parallel. So you have multiple queries, multiple keys, and multiple values being computed side by side, and their outputs are all concatenated.

S1

Speaker 1

12:36

What that means is we're concatenating, or otherwise combining, all those values together to understand what each attention head considered important in the input. And what does this allow us to do? Well, it lets each different attention head consider different parts of the input sequence, and it also lets us take advantage of things like sparse attention.
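As a minimal sketch using PyTorch's built-in module, with arbitrary sizes: eight heads run in parallel over a 512-dimensional embedding, and the concatenation plus output projection is handled internally.

```python
import torch
import torch.nn as nn

# Eight attention heads in parallel over 512-dimensional token embeddings.
mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

x = torch.randn(1, 6, 512)        # batch of 1, six tokens, 512-dim embeddings
out, attn_weights = mha(x, x, x)  # self-attention: query = key = value = x
print(out.shape)                  # torch.Size([1, 6, 512])
```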

S1

Speaker 1

13:03

And as we've mentioned before, attention is typically O(N²) in complexity. So if we double our input, it takes four times as much memory and compute; if we make it four times bigger, it takes sixteen times as much. We can use sparse attention to pull this down to roughly O(N log N) complexity, which is a lot easier, much closer to linear than quadratic.
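A quick back-of-the-envelope illustration of that quadratic growth; the sequence lengths are arbitrary examples.

```python
# The attention-score matrix is (sequence length) x (sequence length) per head,
# so doubling the input quadruples the number of scores.
for n in (1024, 2048, 4096):
    print(n, "tokens ->", n * n, "attention scores per head")
```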

S1

Speaker 1

13:38

And it really does allow our model to learn more complex inputs. This results in what we typically have been referring to as emergent weights. If we pull out some of the weight matrices that compute our query and our key and our value, we'll see that some of the weights are really big. Most of the values will be something like 1.2 or 2.1, but some of them will be 60 or 100 or 120, and these are called emergent weights.
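A toy illustration of spotting those outliers; the numbers and the threshold are made up, not taken from a real model.

```python
import numpy as np

# Most weights are small, but a few outliers ("emergent weights") are far
# larger, and quantization has to preserve them carefully.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 1.5, size=(8, 8))
w[2, 5], w[6, 1] = 60.0, 120.0                # hypothetical outlier values
threshold = 10 * np.median(np.abs(w))         # ad hoc cutoff for this toy example
print(np.argwhere(np.abs(w) > threshold))     # flags just the two outliers
```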

S1

Speaker 1

14:21

And if we try to quantize these carelessly, and if you'd like to learn more about quantization, check out our video here, we run into trouble, because they're very important. If we distort these values during the quantization process, our model falls apart; it really can't compute completions as well as it did.

S1

Speaker 1

14:41

And so we have to be careful with them, because they really seem to be what lets our model be as performant as it is. But now, let's pull this all together: what is happening, start to finish, in large language models?

S1

Speaker 1

14:58

What is happening when we give an input and it creates output tokens? Well, the first thing we do, of course, is take our input and tokenize it, so it goes through the tokenization process. Like in our example earlier, 'I like walking my dog' could be broken out into six tokens, 'I', 'like', 'walk', 'ing', 'my', 'dog', and that gets turned into the numeric representation. This happens because the computer doesn't understand language the way we do, and we need to represent it numerically.

S1

Speaker 1

15:33

Then we go through our input embedding. In our three-dimensional example, with our six tokens, we end up with a 6 by 3 matrix where each row represents a different token and its embedding. Then we positionally encode. This positional encoding happens because we need to give the model some sense of each token's position, because matrix multiplication and the layer operations on their own don't really track order.

S1

Speaker 1

16:12

Then we get to our query, key, and value. As we mentioned earlier, these are just learned linear transformations from weight matrices. And this is really where the transformer name in our large language models comes from. Because if we look at our embedding layer, the tokens in that embedding space are static; they don't move around.

S1

Speaker 1

16:46

We can't get as much context from them as we would like to. So you can think of these query, key, and value matrices as moving the tokens around in this kind of embedding space, so we can have multiple different understandings of different contexts. For example, the sentence 'the server brought me water' versus 'I crashed the server'. This is what really gives LLMs their ability to distinguish between different usages of very similar words.

S1

Speaker 1

17:30

And how this is done is through the computation of our attention score, with our query times our key. And then we just softmax. The softmax is there because these machine learning models deal with values between 0 and 1 better than they do with values from 0 to 10,000 or some other arbitrary range. And then we add and norm.

S1

Speaker 1

17:58

This adding and norming is happening because, remember, in our multi-head attention layer we have multiple attention heads whose outputs get combined. We add that combined output back to the block's input and then normalize, where norming, again, means pulling the values back into a small, consistent range. Then we run through the feed-forward.

S1

Speaker 1

18:24

This feed-forward is probably the least well-defined piece of what the language model is doing, and all it is, is your standard multi-layer perceptron. You can think of it as nodes that are fully connected across all of their edges, and it is just computing a massaged version of the output for the next multi-head attention layer. We repeat this process for however many multi-head attention layers we have until we reach the final linear layer.
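As a minimal sketch of that feed-forward block, assuming PyTorch and arbitrary sizes: an expand-then-contract pair of fully connected layers applied to each token's vector.

```python
import torch
import torch.nn as nn

# Position-wise feed-forward block: expand each 512-dim token vector,
# apply a nonlinearity, then project back down for the next layer.
ffn = nn.Sequential(
    nn.Linear(512, 2048),
    nn.GELU(),
    nn.Linear(2048, 512),
)

tokens = torch.randn(6, 512)     # six token vectors coming out of an attention block
print(ffn(tokens).shape)         # torch.Size([6, 512])
```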

S1

Speaker 1

19:04

And all the final linear layer does is convert us back to our word encodings. So up here at step 3, where we had our positional encoding, we went down to step 4 and converted it into query, key, and value; now we're just going back to word encodings like those from step 3. Then we can use those, but now they're a little different. Now they're our output logits, and once we softmax them they become the actual probabilities.

S1

Speaker 1

19:35

So to put it all back together, let me delete this here really quickly. To put it all together: we take our input and we tokenize that input. We embed it to know which tokens are related to which. We positionally encode to have some idea of where our tokens are positioned.

S1

Speaker 1

20:02

Then we transform that embedding with our query, key, and value. Then we put that through an add and norm over our attention heads' combined output. We repeat this process for however many attention layers we have, and then we compute our output logits.

S1

Speaker 1

20:31

So really what's happening here, to step away from the math and the complexity, is that we're trying to emulate how humans use context to solve problems. And it does that by using the whole context of the input: instead of just learning, oh, the most probable next token after the word 'the', or after some combination of the previous three or four or five tokens, we can take advantage of the whole context. That's really the power behind these large language models and why they are so good at computing the next token. Then we just repeat this process for however many tokens we want to generate.
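Here's an end-to-end sketch of a single step of that loop, assuming the Hugging Face transformers library and GPT-2 as a stand-in model: tokenize, run the transformer stack, softmax the last position's logits, and greedily take the most probable next token.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("I like walking my", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits               # shape: (1, sequence length, vocab size)

probs = torch.softmax(logits[0, -1], dim=-1) # probabilities for the next token
next_id = int(torch.argmax(probs))           # greedy pick; sampling is also common
print(tokenizer.decode(next_id))
```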

S1

Speaker 1

21:26

Once we've computed token n+1, we start the process over and compute token n+2. That's really all they're doing. If this was helpful, please like and subscribe, and please let us know in the comments below what you'd like to hear about next. Tune in next time, when we're going to cover how hyperparameters like top-k and temperature actually affect the output from your large language model.