Machine Learning for Everybody – Full Course

3 hours 53 minutes 52 seconds

🇬🇧 English

S1

Speaker 1

00:00

Kylie Ying has worked at many interesting places, such as MIT, CERN, and freeCodeCamp. She's a physicist, engineer, and basically a genius. And now she's going to teach you about machine learning in a way that is accessible to absolute beginners.

S2

Speaker 2

00:15

What's up, you guys? Welcome to Machine Learning for Everyone. If you are someone who is interested in machine learning, and you count yourself as part of everyone, then this video is for you.

S2

Speaker 2

00:28

In this video, we'll talk about supervised and unsupervised learning models, and we'll go through a little bit of the logic or math behind them. Then we'll also see how we can program them on Google Colab. If there are certain things that I have done wrong, and you're somebody with more experience than me, please feel free to correct me in the comments, and we can all learn from this together as a community.

S2

Speaker 2

00:54

So with that, let's just dive right in. Without wasting any time, let's just dive straight into the code. And I will be teaching you guys concepts as we go. So this here is the UCI machine learning repository.

S2

Speaker 2

01:10

And basically, they just have a ton of data sets that we can access. I found this really cool one called the MAGIC Gamma Telescope data set. If you don't want to read all this information, to summarize what I think is going on: there's this gamma telescope, and we have all these high-energy particles hitting the telescope. Now there's a camera, a detector that actually records certain patterns of how this light hits the camera, and we can use properties of those patterns to predict what type of particle caused that radiation.

S2

Speaker 2

01:46

So whether it was a gamma particle or something else, like a hadron. Down here, these are all of the attributes of those patterns that we collect in the camera. So you can see that there's, you know, some length, width, size, asymmetry, etc. Now we're going to use all these properties to help us discriminate the patterns and tell whether they came from a gamma particle or a hadron.

S2

Speaker 2

02:13

So in order to do this, we're going to come up here and go to the data folder. You're going to click this magic04.data file, and we're going to download that. Now over here, I have a Colab notebook open. So you go to colab.research.google.com and start a new notebook.

S2

Speaker 2

02:34

And I'm just going to call this the magic data set. So actually, I'm going to call this code camp magic example. Okay. So with that, I'm going to first start with some imports.

S2

Speaker 2

02:50

So I will import, you know, I always import NumPy. I always import pandas. And I always import matplotlib. And then we'll import other things as we go.

S2

Speaker 2

03:10

So yeah, we run that. In order to run the cell, you can either click this play button here, or, on my computer, it's just Shift+Enter, and that will run the cell. And here, I'm just going to let you guys know, okay, this is where I found the data set. I've copied and pasted this, actually, but this is just where I found the data set. And in order to import that downloaded file that we have on our computer, we're going to go over here to this folder thing.

S2

Speaker 2

03:44

And I am literally just going to drag and drop that file into here. Okay. So in order to take a look at you know, what does this file consist of? Do we have the labels?

S2

Speaker 2

03:56

Do we not? I mean, we could open it on our computer. But we can also just do pandas read_csv, and we can pass in the name of this file.

S2

Speaker 2

04:08

And let's see what it returns. So it doesn't seem like we have the column labels. So let's go back to here. I'm just going to make the column labels all of these attribute names over here.

S2

Speaker 2

04:23

So I'm just going to take these values and make them the column names. All right, how do I do that? Basically, I will come back here, create a list called cols, and type in all of those things: fLength, fWidth, fSize, fConc, and we also have fConc1, then fAsym, fM3Long, fM3Trans, fAlpha, and what else do we have? fDist and class.

S2

Speaker 2

05:10

And class. Okay, great. Now, in order to label those as the columns down here in our data frame: basically, this command here just reads some CSV file that you pass in (CSV is comma-separated values) and turns that into a pandas DataFrame object.

S2

Speaker 2

05:30

So now if I pass in names here, then it basically assigns these labels to the columns of this data set. So I'm going to set this data frame equal to df. And then if we call head, which is just like, give me the first five rows, you'll see that we have labels for all of these.
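
A minimal sketch of the loading step described here. The two data rows are made-up stand-ins so the snippet runs on its own; in the notebook you would pass the name of the uploaded file (assumed here to be `magic04.data`) instead of the in-memory buffer.

```python
import io

import pandas as pd

# Column names taken from the attribute list on the data set's page.
cols = ["fLength", "fWidth", "fSize", "fConc", "fConc1",
        "fAsym", "fM3Long", "fM3Trans", "fAlpha", "fDist", "class"]

# Two made-up rows standing in for the downloaded file (illustrative values only).
# In Colab this would be something like: pd.read_csv("magic04.data", names=cols)
fake_file = io.StringIO(
    "28.8,16.0,2.64,0.39,0.20,27.7,22.0,-8.2,40.1,81.9,g\n"
    "75.1,30.9,3.16,0.32,0.17,-5.5,28.5,21.8,64.5,205.6,h\n"
)

# Passing names= tells pandas the file has no header row and labels the columns.
df = pd.read_csv(fake_file, names=cols)
print(df.head())  # head() shows the first five rows (here there are only two)
```

Without `names`, pandas would treat the first data row as the header, which is why the first attempt in the video shows no proper column labels.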

S2

Speaker 2

05:50

Okay. All right, great. So one thing that you might notice is that over here, for the class labels, we have g and h. So if I actually go down here and I do df class unique, you'll see that I have either g's or h's, and these stand for gammas or hadrons.

S2

Speaker 2

06:12

And our computer is not so good at understanding letters, right? Our computer is really good at understanding numbers. So what we're going to do is convert this to 0 for g and 1 for h. So here, I'm going to set this column equal to whether or not it equals g, and then I'm just going to say astype int.

S2

Speaker 2

06:38

So what this should do is convert this entire column, if it equals g, then this is true. So I guess that would be 1. And then if it's h, it would be false. So that would be 0, but I'm just converting g and h to 1 and 0, it doesn't really matter, like, if g is 1 and h is 0, or vice versa.
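
A small sketch of that conversion, using a made-up three-row frame in place of the real data. As described, comparing to "g" gives True for gammas, so gammas end up as 1 and hadrons as 0.

```python
import pandas as pd

# Made-up stand-in for the real data frame (illustrative values only).
df = pd.DataFrame({"fLength": [28.8, 75.1, 31.6], "class": ["g", "h", "g"]})

print(df["class"].unique())  # the distinct labels before conversion

# (df["class"] == "g") is a column of True/False; astype(int) turns it into 1/0.
df["class"] = (df["class"] == "g").astype(int)
print(df["class"].tolist())
```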

S2

Speaker 2

07:00

Let me just take a step back right now and talk about this data set. So here I have some data frame, and I have all of these different values for each entry. Now, each of these is one sample, it's one example, it's one item in our data set, it's one data point. All these things mean kind of the same thing when I mention, oh, this is one example, or this is one sample, or whatever. Now, each of these samples has, you know, one value for each of these labels up here, and then it has the class.

S2

Speaker 2

07:38

Now, what we're going to do in this specific example is try to predict for future, you know, samples, whether the class is G for gamma, or H for hadron. And that is something known as classification. Now, all of these up here, these are known as our features. And features are just things that we're going to pass into our model in order to help us predict the label, which in this case is the class column.

S2

Speaker 2

08:09

So for you know, sample 0, I have 10 different features. So I have 10 different values that I can pass into some model. And I can spit out, you know, the class, the label, and I know the true label here is G. So this is this is actually supervised learning.

S2

Speaker 2

08:32

Alright, so before I move on, let me just give you a quick little crash course on what I just said. This is machine learning for everyone. Well, the first question is, what is machine learning? Well, machine learning is a subdomain of computer science that focuses on certain algorithms, which might help a computer learn from data without a programmer being there telling the computer exactly what to do.

S2

Speaker 2

09:00

That's what we call explicit programming. So you might have heard of AI and ML and data science, what is the difference between all of these. So AI is artificial intelligence. And that's an area of computer science, where the goal is to enable computers and machines to perform human like tasks and simulate human behavior.

S2

Speaker 2

09:24

Now machine learning is a subset of AI that tries to solve a specific problem and make predictions using certain data. And data science is a field that attempts to find patterns and draw insights from data. And that might mean we're using machine learning. So all of these fields kind of overlap, and all of them might use machine learning.

S2

Speaker 2

09:50

So there are a few types of machine learning. The first 1 is supervised learning. And in supervised learning, we're using labeled inputs. So this means whatever input we get, we have a corresponding output label, in order to train models and to learn outputs of different new inputs that we might feed our model.

S2

Speaker 2

10:11

So for example, I might have these pictures. Okay, to a computer, all these pictures are pixels, they're pixels with a certain color. Now in supervised learning, all of these inputs have a label associated with them, and this is the output that we might want the computer to be able to predict. So for example, over here, this picture is a cat. This picture is a dog.

S2

Speaker 2

10:37

And this picture is a lizard. Now, there's also unsupervised learning. And in unsupervised learning, we use unlabeled data to learn about patterns in the data. So here are my input data points.

S2

Speaker 2

10:57

Again, they're just images, they're just pixels. Well, okay, let's say I have a bunch of these different pictures. And what I can do is I can feed all these to my computer. And I might not, you know, my computer's not gonna be able to say, oh, this is a cat, dog and lizard in terms of, you know, the output.

S2

Speaker 2

11:16

But it might be able to cluster all these pictures. It might say, hey, all of these have something in common, all of these have something in common, and then these down here have something in common. That's finding some sort of structure in our unlabeled data. And finally, we have reinforcement learning. In reinforcement learning, there's usually an agent that is learning in some sort of interactive environment, based on rewards and penalties.

S2

Speaker 2

11:48

So let's think of a dog. We can train our dog, but there's not necessarily any wrong or right output at any given moment, right? Well, let's pretend that dog is a computer. Essentially, what we're doing is giving rewards to our computer and telling our computer, hey, this is probably something good that you want to keep doing.

S2

Speaker 2

12:12

Well, computer agent, yeah, terminology. But in this class today, we'll be focusing on supervised learning and unsupervised learning and learning different models for each of those. Alright, so let's talk about supervised learning first. So this is kind of what a machine learning model looks like you have a bunch of inputs that are going into some model, and then the model is spitting out an output, which is our prediction.

S2

Speaker 2

12:42

So all these inputs, this is what we call the feature vector. Now, there are different types of features that we can have. We might have qualitative features and qualitative means categorical data. There's either a finite number of categories or groups.

S2

Speaker 2

13:00

So 1 example of a qualitative feature might be gender. And in this case, there's only 2 here. It's for the sake of the example. I know this might be a little bit outdated.

S2

Speaker 2

13:11

Here we have a girl and a boy. There are 2 genders. There are 2 different categories. That's a piece of qualitative data.

S2

Speaker 2

13:19

Another example might be, okay, we have a bunch of different nationalities, or maybe a nation or a location; that might also be an example of categorical data. Now, in both of these, there's no inherent order. It's not like we can rate the US 1, France 2, Japan 3, etc., right? There's not really any inherent order built into either of these categorical data sets. That's why we call this nominal data.

S2

Speaker 2

13:57

Now, for nominal data, the way that we want to feed it into our computer is using something called one-hot encoding. So let's say that I have a data set where some of the inputs might be from the US, some might be from India, then Canada, then France. Now, how do we get our computer to recognize that? We have to do something called one-hot encoding. And basically one-hot encoding is saying, okay, well, if it matches some category, make that a 1.

S2

Speaker 2

14:28

And if it doesn't, just make that a 0. So for example, if your input were from the US, you might have 1 0 0 0. For India, 0 1 0 0. For Canada, okay, well, the item representing Canada is 1, and for France, the item representing France is 1. And then you can see that the rest are zeros. That's one-hot encoding.
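
The encoding described here can be sketched in a few lines of plain Python. The helper name `one_hot` and the category order are assumptions chosen to match the example; libraries such as pandas offer their own versions of this.

```python
# Category order follows the example: US, India, Canada, France.
categories = ["US", "India", "Canada", "France"]


def one_hot(value, categories):
    """Return a list with a 1 in the slot matching `value` and 0 everywhere else."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]


for country in categories:
    print(country, one_hot(country, categories))
```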

S2

Speaker 2

14:54

Now, there's also a different type of qualitative feature. So here on the left, there are different age groups. There's babies, toddlers, teenagers, young adults, adults, and so on. Right.

S2

Speaker 2

15:12

And on the right hand side, we might have different ratings. So maybe bad, not so good, mediocre, good, and then like, great. Now, these are known as ordinal pieces of data, because they have some sort of inherent order. Right?

S2

Speaker 2

15:30

Like, being a toddler is a lot closer to being a baby than being an elderly person. Right? Or good is closer to great than it is to really bad. So these have some sort of inherent ordering system.

S2

Speaker 2

15:46

And so for these types of data sets, we can actually just mark them from, you know, 1 to 5, or we can just say, hey, for each of these, let's give it a number. And this makes sense because, for example, the thing that I just said, how good is closer to great than good is to not good at all: well, 4 is closer to 5 than 4 is to 1. So this actually kind of makes sense.
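
A short sketch of numbering ordinal categories, using the rating example. The dictionary and helper names are illustrative assumptions, not something from the video.

```python
# Ratings listed from worst to best, so position in the list carries the order.
rating_order = ["bad", "not so good", "mediocre", "good", "great"]

# Map each rating to 1..5 according to its position.
rating_to_number = {name: position + 1 for position, name in enumerate(rating_order)}


def encode_rating(rating):
    """Look up the number assigned to an ordinal rating."""
    return rating_to_number[rating]


good, great, bad = encode_rating("good"), encode_rating("great"), encode_rating("bad")
# The numeric distances mirror the intuition: good is nearer to great than to bad.
print(abs(good - great), abs(good - bad))
```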

S2

Speaker 2

16:13

And it'll make sense for the computer as well. All right, there are also quantitative pieces of data and quantitative pieces of data are numerical valued pieces of data. So this could be discrete, which means you know, they might be integers, or it could be continuous, which means all real numbers. So for example, the length of something is a quantitative piece of data.

S2

Speaker 2

16:40

It's a quantitative feature. The temperature of something is a quantitative feature. And then maybe how many Easter eggs I collected in my basket this Easter egg hunt, that is an example of a discrete quantitative feature. Okay, so these are continuous, and this over here is discrete.

S2

Speaker 2

17:01

So those are the things that go into our feature vector. Those are the features that we're feeding this model, because our computers are really, really good at understanding math, right, at understanding numbers. They're not so good at understanding things that humans might be able to understand. Well, what are the types of predictions that our model can output? So in supervised learning, there are some different tasks. There's, one, classification, and basically classification is just saying, okay, predict discrete classes.

S2

Speaker 2

17:40

And that might mean, you know, this is a hot dog. This is a pizza, and this is ice cream. Okay, so there are 3 distinct classes and any other pictures of hot dogs, pizza or ice cream, I can put under these labels. Hot dog, pizza, ice cream.

S2

Speaker 2

17:58

This is something known as multi-class classification. But there's also binary classification. In binary classification, you might have hot dog or not hot dog. So there are only two categories that you're working with: something that is something, and something that isn't. That's binary classification.

S2

Speaker 2

18:18

Okay, so yeah, other examples. So if something has positive or negative sentiment, that's binary classification. Maybe you're predicting whether your pictures are of cats or dogs. That's binary classification.

S2

Speaker 2

18:31

Maybe, you know, you are writing an email filter, and you're trying to figure out if an email is spam or not spam. So that's also binary classification. Now for multi-class classification, you might have, you know, cat, dog, lizard, dolphin, shark, rabbit, etc. We might have different types of fruits like orange, apple, pear, etc.

S2

Speaker 2

18:53

And then maybe different plant species. But multi class classification just means more than 2. Okay, and binary means we're predicting between 2 things. There's also something called regression when we talk about supervised learning.

S2

Speaker 2

19:08

And this just means we're trying to predict continuous values. So instead of just trying to predict different categories, we're trying to come up with a number that is on some sort of scale. So some examples might be the price of Ethereum tomorrow.

S2

Speaker 2

19:27

Or it might be okay, what is going to be the temperature? Or it might be what is the price of this house, right? So these things don't really fit into discrete classes. We're trying to predict a number that's as close to the true value as possible using different features of our data set.

S2

Speaker 2

19:49

So that's exactly what our model looks like in supervised learning. Now, let's talk about the model itself. How do we make this model learn? Or how can we tell whether or not it's even learning.

S2

Speaker 2

20:03

So before we talk about the models, let's talk about how we can actually evaluate these models. How can we tell whether something is a good model or a bad model? So let's take a look at this data set. This is from a Pima Indian diabetes data set.

S2

Speaker 2

20:25

And here we have different number of pregnancies, different glucose levels, blood pressure, skin thickness, insulin, BMI, age, and then the outcome, whether or not they have diabetes. 1 for they do, 0 for they don't. So here, all of these are quantitative features, right, because they're all on some scale. So each row is a different sample in the data.

S2

Speaker 2

20:52

So it's a different example. It's 1 person's data, and each row represents 1 person in this data set. Now this column, each column represents a different feature. So this 1 here is some measure of blood pressure levels and this 1 over here as we mentioned is the output label.

S2

Speaker 2

21:14

So this 1 is whether or not they have diabetes and as I mentioned this is what we would call a feature vector, because these are all of our features in 1 sample. And this is what's known as the target, or the output for that feature vector, that's what we're trying to predict. And all of these together is our features matrix x. And over here, this is our labels or targets vector y.

S2

Speaker 2

21:49

So I've condensed this to a chocolate bar to kind of talk about some of the other concepts in machine learning. So over here, we have our x, our features matrix. And over here, this is our label y. So each row of this will be fed into our model, right?

S2

Speaker 2

22:10

And our model will make some sort of prediction. And what we do is compare that prediction to the actual value of y that we have in our labeled data set, because the whole point of supervised learning is that we can compare what our model is outputting to, oh, what is the truth, actually? And then we can go back and adjust some things, so that in the next iteration we get closer to what the true value is. So that whole process here, the tinkering, that, okay, what's the difference?

S2

Speaker 2

22:42

Where do we go wrong? That's what's known as training the model. Alright, so take this whole, you know, chunk right here, do we want to really put our entire chocolate bar into the model to train our model? Not really, right?

S2

Speaker 2

22:59

Because if we did that, then how do we know that our model can do well on new data that it hasn't seen? Like, if I were to create a model to predict whether or not someone has diabetes, let's say that I just train on all my data, and I see that it does well on all my training data. I go to some hospital, and I'm like, here's my model, I think you can use this to predict if somebody has diabetes. Do we think that would be effective or not? Probably not, right?

S2

Speaker 2

23:34

Because we haven't assessed how well our model can generalize. Okay, it might do well after you know, our model has seen this data over and over and over again. But what about new data? Can our model handle new data?

S2

Speaker 2

23:51

Well, how do we how do we get our model to assess that? So we actually break up our whole data set that we have into 3 different types of data sets, we call it the training data set, the validation data set, and the testing data set. And you know, you might have 60% here 20% and 20%, or 80, 10, and 10. It really depends on how many statistics you have, I think either of those would be acceptable.

S2

Speaker 2

24:20

So what we do then is feed the training data set into our model. We come up with, you know, this might be a vector of predictions corresponding to each sample that we put into our model. We figure out, okay, what's the difference between our predictions and the true values? This is something known as loss. Loss is, you know, what's the difference here, as some numerical quantity, of course. And then we make adjustments. And that's what we call training. Okay.

S2

Speaker 2

24:54

So then, once, you know, we've made a bunch of adjustments, we can put our validation set through this model. And the validation set is kind of used as a reality check during or after training to ensure that the model can still handle unseen data. So every single time after we train one iteration, we might stick the validation set in and see, hey, what's the loss there? And then after our training is over, we can assess the validation set and ask, hey, what's the loss there?

S2

Speaker 2

25:27

But 1 key difference here is that we don't have that training step, this loss never gets fed back into the model, right, that feedback loop is not closed. Alright, so let's talk about loss really quickly. So here I have 4 different types of models, I have some sort of data that's being fed into the model, and then some output. Okay, so this output here is pretty far from you know, this truth that we want.

S2

Speaker 2

25:58

And so this loss is going to be high. In model B, again, this is pretty far from what we want. So this loss is also going to be high, let's give it 1.5. Now this 1 here, it's pretty close.

S2

Speaker 2

26:13

I mean, maybe not almost, but pretty close to this 1. So that might have a loss of 0.5. And then this 1 here is maybe further than this, but still better than these 2. So that loss might be 0.9.

S2

Speaker 2

26:28

Okay, so which of these models performs the best? Well, model C has the smallest loss, so it's probably model C. Okay, now let's take model C. After, you know, we've come up with all these models, and we've seen, okay, model C is probably the best model, we take model C, and we run our test set through this model.

S2

Speaker 2

26:52

And this test set is used as a final check to see how generalizable that chosen model is. So if I, you know, finished training my diabetes data set, then I could run it through some chunk of the data. And I can say, Oh, like, this is how we perform on data that it's never seen before at any point during the training process. Okay.

S2

Speaker 2

27:15

And that loss, that's the final reported performance of my test set, or this would be the final reported performance of my model. So let's talk about this thing called loss, because I think I kind of just glossed over it, right. So loss is the difference between your prediction and the actual, like label. So this would give a slightly higher loss than this.

S2

Speaker 2

27:48

And this would even give a higher loss because it's even more off. In computer science, we like formulas, right? We like formulaic ways of describing things. So here are some examples of loss functions and how we can actually come up with numbers.

S2

Speaker 2

28:05

This here is known as L1 loss. And basically, L1 loss just takes whatever your real value is, whatever the real output label is, subtracts the predicted value, and takes the absolute value of that. Okay, so the absolute value is a function that looks something like this. So the further off you are, the greater your loss is, right, in either direction.

S2

Speaker 2

28:37

So if your real value is off from your predicted value by 10, then your loss for that point would be 10. And then this sum here just means, hey, we're taking all the points in our data set, and we're trying to figure out the sum of how far off everything is. Now, we also have something called L2 loss.

S2

Speaker 2

28:58

So this loss function is quadratic, which means that if it's close, the penalty is very minimal. And if it's off by a lot, then the penalty is much, much higher. Okay. And here, instead of the absolute value, we just square the difference between the two.
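
The two loss functions just described can be sketched in plain Python. Summing over all points follows the description of L1; applying the same sum to L2 is an assumption here, since the slide's exact formula is not in the transcript.

```python
def l1_loss(y_true, y_pred):
    """Sum of absolute differences between real and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(real - predicted) for real, predicted in zip(y_true, y_pred))


def l2_loss(y_true, y_pred):
    """Sum of squared differences between real and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((real - predicted) ** 2 for real, predicted in zip(y_true, y_pred))


# Being off by 10 costs 10 under L1 but 100 under L2 (the quadratic penalty).
print(l1_loss([10], [0]), l2_loss([10], [0]))
```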

S2

Speaker 2

29:22

Now, there's also something called binary cross entropy loss. It looks something like this. And this is for binary classification, this, this might be the loss that we use. So this loss, you know, I'm not going to really go through it too much.

S2

Speaker 2

29:39

But you just need to know that loss decreases as the performance gets better. So there are some other measures of performance as well. So for example, accuracy: what is accuracy? So let's say that these are pictures that I'm feeding my model, okay.
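
The transcript does not show the formula, so the following is a sketch of the standard textbook form of binary cross entropy, averaged over the samples. The function name and the small `eps` clamp (which avoids taking the log of 0) are assumptions for illustration.

```python
import math


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average binary cross entropy for 0/1 labels and predicted probabilities."""
    if len(y_true) != len(p_pred):
        raise ValueError("y_true and p_pred must have the same length")
    total = 0.0
    for label, prob in zip(y_true, p_pred):
        prob = min(max(prob, eps), 1 - eps)  # keep log() away from 0
        total += label * math.log(prob) + (1 - label) * math.log(1 - prob)
    return -total / len(y_true)


# More confident correct predictions give a smaller loss.
good = binary_cross_entropy([1, 0], [0.9, 0.1])
worse = binary_cross_entropy([1, 0], [0.6, 0.4])
print(good, worse)
```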

S2

Speaker 2

30:01

And these predictions might be apple, orange, orange, apple. Okay, but the actual is apple, orange, apple, apple. So 3 of them were correct, and 1 of them was incorrect. So the accuracy of this model is 3 quarters or 75%.
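
The accuracy calculation from this example, as a minimal sketch (the function name is an assumption):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the actual labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(actual == predicted for actual, predicted in zip(y_true, y_pred))
    return correct / len(y_true)


predicted = ["apple", "orange", "orange", "apple"]
actual = ["apple", "orange", "apple", "apple"]
print(accuracy(actual, predicted))  # 0.75, i.e. 3 out of 4 correct
```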

S2

Speaker 2

30:20

All right, coming back to our Colab notebook, I'm going to close this a little bit. Again, we've imported stuff up here, and we've already created our data frame right here. And this is all of our data, this is what we're going to use to train our models.

S2

Speaker 2

30:38

So down here, again, if we now take a look at our data set, you'll see that our classes are now zeros and ones. So now this is all numerical, which is good, because our computer can now understand that. Okay. And you know, it would probably be a good idea to maybe kind of plot, hey, do these things have anything to do with the class.

S2

Speaker 2

31:04

So here, I'm going to go through all of the labels. So for label in the columns of this data frame, so this just gets me the list. Actually, we have the list, right? It's cols. So let's just use that, it might be less confusing, for everything up till the last thing, which is the class.

S2

Speaker 2

31:21

So I'm going to take all these 10 different features, and I'm going to plot them as a histogram. So basically, if I take that data frame, and I say, okay, for everything where the class is equal to 1, so these are all of our gammas, remember.

S2

Speaker 2

31:48

Now, for that portion of the data frame, if I look at this label, so now these, okay, what this part here is saying is, inside the data frame, get me everything where the class is equal to 1. So that's all of these would fit into that category, right. And now let's just look at the label column. So the first label would be f length, which would be this column.

S2

Speaker 2

32:15

So this command here is getting me all the different values that belong to class 1 for this specific label. And that's exactly what I'm going to put into the histogram. And now I'm just going to tell matplotlib: make the color blue, label this as gamma, and set alpha (why do I keep doing that?), alpha equal to 0.7. So that's just like the transparency.

S2

Speaker 2

32:43

And then I'm going to set density equal to True so that when we compare it to the hadrons here, we'll have a baseline for comparing them. Okay, so density being True just basically normalizes these distributions. So you know, if you have 200 of one type, and then 50 of another type, well, if you drew the histograms, it would be hard to compare because one of them would be a lot bigger than the other, right? But by normalizing them, we kind of are distributing them over how many samples there are.

S2

Speaker 2

33:21

All right, and then I'm just going to put a title on here and make that the label, the y label. So because it's density, the y label is probability. And the x label is just going to be the label. What is going on?

S2

Speaker 2

33:41

And I'm going to include a legend. And plt.show just means, okay, display the plot. So if I run that, oh, this should be up to the last item. We want a list, right? Not just the last item.

S2

Speaker 2

34:00

And now we can see that we're plotting all of these. So here we have the length. Oh, and I made this gamma. So this should be hadron.
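
A sketch of the plotting loop as described, including the fix of labeling the second histogram as hadron. The data frame here is randomly generated stand-in data with only two feature columns, so the shapes of the plots mean nothing; the `Agg` backend line is only there so the snippet runs outside a notebook and is not needed in Colab.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for running outside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Randomly generated stand-in data (illustrative only, not the real data set).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "fLength": rng.normal(50, 20, 300),
    "fAlpha": rng.uniform(0, 90, 300),
    "class": rng.integers(0, 2, 300),  # 1 = gamma, 0 = hadron
})
cols = list(df.columns)

plotted = []
for label in cols[:-1]:  # every column up to, but not including, "class"
    fig = plt.figure()
    # density=True normalizes each histogram so the two classes are comparable.
    plt.hist(df[df["class"] == 1][label], color="blue", label="gamma",
             alpha=0.7, density=True)
    plt.hist(df[df["class"] == 0][label], color="red", label="hadron",
             alpha=0.7, density=True)
    plt.title(label)
    plt.ylabel("Probability")
    plt.xlabel(label)
    plt.legend()
    plt.show()
    plt.close(fig)
    plotted.append(label)
```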

S2

Speaker 2

34:11

Okay, so the gammas are in blue, and the hadrons are in red. So here, we can already see that, you know, maybe if the length is smaller, it's probably more likely to be a gamma, right? And, you know, these all look somewhat similar. But here, okay, clearly, if there's more asymmetry, or if this asymmetry measure is larger, then it's probably a hadron.

S2

Speaker 2

34:41

Okay, oh, this one's a good one. So for fAlpha, it seems like hadrons are pretty evenly distributed, whereas if this is smaller, it looks like there are more gammas in that area. Okay, so this is kind of the data that we're working with, and we can kind of see what's going on.

S2

Speaker 2

35:02

Okay, so the next thing that we're going to do here is create our train, our validation, and our test data sets. I'm going to set train, valid, and test to be equal to this: np.split, so I'm just splitting up the data frame. And if I do this sample where I'm sampling everything, this will basically shuffle my data. Now, I want to pass in where exactly I'm splitting my data set, so the first split is going to be maybe at 60%.

S2

Speaker 2

35:43

So I'm just going to say 0.6 times the length of this data frame, and then cast that to an integer. That's going to be the first place where, you know, I cut it off, and that'll be my training data. Now, if I then go to 0.8, this basically means everything between 60% and 80% of the length of the data set will go towards validation.

S2

Speaker 2

36:05

And then, like everything from 80 to 100 is going to be my test data. So I can run that. And now, if we go up here, and we inspect this data, we'll see that these columns seem to have values in like the 100s. Whereas this 1 is 0.03.
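
A sketch of the 60/20/20 split on a made-up ten-row frame. The video calls `np.split` on the shuffled data frame; this sketch swaps in plain `.iloc` slicing at the same two cut points, which produces the same three pieces and may avoid deprecation warnings that calling `np.split` on a DataFrame can raise in some pandas versions. The `random_state` is an added assumption so the shuffle is repeatable.

```python
import numpy as np
import pandas as pd

# Made-up stand-in data: ten rows, one feature, alternating classes.
df = pd.DataFrame({"fLength": np.arange(10, dtype=float), "class": [1, 0] * 5})

# sample(frac=1) returns all rows in random order, i.e. a shuffle.
shuffled = df.sample(frac=1, random_state=0)

first_cut = int(0.6 * len(df))   # end of the training portion
second_cut = int(0.8 * len(df))  # end of the validation portion

train = shuffled.iloc[:first_cut]
valid = shuffled.iloc[first_cut:second_cut]
test = shuffled.iloc[second_cut:]

print(len(train), len(valid), len(test))
```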

S2

Speaker 2

36:25

Right, so the scale of all these numbers is way off. And sometimes that will affect our results. So one thing that we would want to do is scale these so that each value is now relative to maybe the mean and the standard deviation of that specific column. I'm going to create a function called scale_dataset.

S2

Speaker 2

36:55

And I'm going to pass in the data frame. And that's what I'll do for now. Okay, so the x values are going to be, you know, I take the data frame. And let's assume that the columns are going to be, you know, that the label will always be the last thing in the data frame.

S2

Speaker 2

37:18

So what I can do is say dataframe.columns all the way up to the last item, and get those values. Now for my y, well, it's the last column, so I can just index into that last column and then get those values. Now, I'm actually going to import something known as the StandardScaler from sklearn. So if I come up here, I can go to sklearn.preprocessing, and I'm going to import StandardScaler. I have to run that cell, and then I'm gonna come back down here.

S2

Speaker 2

38:04

And now I'm going to create a scaler and use that scaler. So StandardScaler. And with the scaler, what I can do is actually just fit and transform X. So here, I can say X is equal to scaler.fit_transform of X.

S2

Speaker 2

38:28

So what that's doing is saying, okay, take X and fit the standard scaler to X, and then transform all those values, and that's going to be our new X. All right. And then I'm also going to just create, you know, the whole data as one huge 2D NumPy array.
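As a small sketch of what fitting and transforming does, the example below scales two made-up columns whose values differ by orders of magnitude. The array values are hypothetical; `StandardScaler` and `fit_transform` are the scikit-learn names mentioned in the video.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales (hundreds vs. hundredths).
X = np.array([
    [100.0, 0.01],
    [200.0, 0.03],
    [300.0, 0.05],
])

scaler = StandardScaler()
# fit learns each column's mean and standard deviation; transform rescales with them.
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))  # each column is now centered near 0
print(X_scaled.std(axis=0))   # each column now has standard deviation near 1
```

After scaling, both columns are expressed relative to their own mean and standard deviation, so neither dominates just because its raw numbers are larger.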

S2

Speaker 2

38:48

And in order to do that, I'm going to call hstack. So hstack is saying, okay, take an array and another array and horizontally stack them together. That's what the h stands for. So horizontally stacking them together means just, like, put them side by side, okay, not on top of each other.

S2

Speaker 2

39:06

So what am I stacking? Well, I have to pass in something so that it can stack X and y. Okay, so NumPy is very particular about dimensions, right?

S2

Speaker 2

39:21

So in this specific case, our X is a 2-dimensional object, but y is only a 1-dimensional thing, it's only a vector of values. So in order to now reshape it into a 2D item, we have to call np.reshape. And we can pass in the dimensions to reshape it to. So if I pass in negative 1 comma 1, that just means, okay, make this a 2D array, where the negative 1 just means infer what this dimension value would be, which ends up being the length of y. This would be the same as literally doing this.

S2

Speaker 2

39:58

But the negative 1 is easier because we're making the computer do the hard work. So if I stack that, I'm going to then return the data, X, and y. Okay, so one more thing is that if we go into our training data set, okay, again, this is our training data set, and we get the length of the training data set.
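The reshape and stack steps can be sketched on a tiny made-up example. The arrays here are hypothetical; `np.reshape` and `np.hstack` are the NumPy calls described above.

```python
import numpy as np

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # 2D, shape (3, 2)
y = np.array([1, 0, 1])                              # 1D, shape (3,)

# -1 tells NumPy to infer that dimension, so this matches reshape(y, (len(y), 1)).
y_column = np.reshape(y, (-1, 1))

# Put X and the label column side by side in one 2D array.
data = np.hstack((X, y_column))

print(y_column.shape)
print(data.shape)
```

Without the reshape, `np.hstack` would be asked to combine a 2D array with a 1D one and would raise an error, which is why the label vector is turned into a single column first.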

S2

Speaker 2

40:24

But only where the training data set's class is 1. So remember that this is the gammas. And then if we print that, and we do the same thing, but with 0, we'll see that, you know, there's around 7000 of the gammas, but only around 4000 of the hadrons. So that might actually become an issue.

S2

Speaker 2

40:52

And instead, what we want to do is we want to oversample our training data set. So that means that we want to increase the number of these values, so that these kind of match better. And surprise, surprise, there is something that we can import that will help us do that. So I'm going to go to from imblearn.over_sampling.

S2

Speaker 2

41:22

And I'm going to import this RandomOverSampler, run that cell, and come back down here. So I will actually add in this parameter called oversample, and set that to False by default. And if I do want to oversample, then I'm going to create this ros and set it equal to this RandomOverSampler. And then for X and y, I'm just going to say, okay, just fit and resample X and y.

S2

Speaker 2

42:05

And what that's doing is saying, okay, take more of the smaller class. So take the smaller class and keep sampling from there to increase the size of our data set for that smaller class so that they now match. So if I do this, and I call scale data set, and I pass in the training data set where oversample is True. So this, let's say this is train, and then X train, y train.

S2

Speaker 2

42:41

Oops, what's going on? Oh, these should be columns. So basically, what I'm doing now is I'm just saying, Okay, what is the length of y train? Okay, now it's 14,800, whatever.

S2

Speaker 2

42:57

And now let's take a look at how many of these are type 1. So actually, we can just sum that up. And then we'll also see that if we instead switch the label and ask how many of them are the other type, it's the same value. So now these have been evenly, you know, rebalanced.
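To make the idea of random oversampling concrete, here is a minimal NumPy-only sketch. It is not the imblearn implementation used in the video; `random_oversample` is a hypothetical helper written for illustration, and the data is made up.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Repeat randomly chosen rows of the smaller classes until all classes match in size."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = [np.arange(len(y))]  # start with every original row
    for cls, count in zip(classes, counts):
        if count < target:
            candidates = np.flatnonzero(y == cls)
            # Sample with replacement from the smaller class to fill the gap.
            keep.append(rng.choice(candidates, size=target - count, replace=True))
    keep = np.concatenate(keep)
    return X[keep], y[keep]

# Hypothetical imbalanced data: 7 rows of class 1 and 4 rows of class 0.
X = np.arange(22, dtype=float).reshape(11, 2)
y = np.array([1] * 7 + [0] * 4)

X_resampled, y_resampled = random_oversample(X, y)
print(len(y_resampled), np.sum(y_resampled == 1), np.sum(y_resampled == 0))
```

Only the training set gets this treatment in the video; the duplicated rows add no new information, they just stop the larger class from dominating during training.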

S2

Speaker 2

43:22

Okay, well, okay. So here, I'm just going to make this the validation data set. And then the next 1, I'm going to make this the test data set. All right, and we're actually going to switch over sample here to false.

S2

Speaker 2

43:42

Now, the reason why I'm switching that to False is because my validation and my test sets are for the purpose of, you know, if I have data that I haven't seen yet, how does my model perform on those? And I don't want to oversample for that right now. Like, I don't care about balancing those. I want to know, if I have a random set of data that's unlabeled, can I trust my model?

S2

Speaker 2

44:09

Right? So that's why I'm not over sampling. I run that. And again, what is going on?

S2

Speaker 2

44:18

Oh, it's because we already have this train. So I have to come up here and split that data frame again. And now let's run these. Okay.

S2

Speaker 2

44:29

So now we have our data properly formatted. And we're going to move on to different models now. And I'm going to tell you guys a little bit about each of these models. And then I'm going to show you how we can do that in our code.

S2

Speaker 2

44:42

So the first model that we're going to learn about is KNN, or k nearest neighbors. Okay, so here I've already drawn a plot. On the y axis, I have the number of kids that a family might have. And then on the x axis, I have their income in terms of thousands per year. So, you know, if someone's making 40,000 a year, that's where this would be.

S2

Speaker 2

45:10

And if somebody's making 320, that's where that would be. If somebody has 0 kids, it'd be somewhere along this axis. If somebody has 5, it'd be somewhere over here. Okay.

S2

Speaker 2

45:20

And now I have these plus signs and these minus signs on here. So what I'm going to represent here is the plus sign means that they own a car. And the minus sign is going to represent no car. Okay.

S2

Speaker 2

45:45

So your initial thought should be, okay, I think this is binary classification, because all of our points, all of our samples have labels. So this is a sample with the plus label. And this here is another sample with the minus label. This is an abbreviation for "with" that I'll use.

S2

Speaker 2

46:15

All right, so we have this entire data set, and maybe around half the people own a car and maybe around half the people don't own a car. Okay, well, what if I had some new point? Let me choose a different color. I'll use this nice green.

S2

Speaker 2

46:34

Well, what if I have a new point over here? So let's say that somebody makes 40,000 a year and has 2 kids? What do we think that would be? Well, just logically looking at this plot, you might think, okay, it seems like they wouldn't have a car, right?

S2

Speaker 2

46:55

Because that kind of matches the pattern of everybody else around them. So that's the whole concept of this nearest neighbors: you look at, okay, what's around you. And then you're basically like, okay, I'm going to take the label of the majority that's around me. So the first thing we have to do is we have to define a distance function.

S2

Speaker 2

47:16

And a lot of times in, you know, 2d plots like this, our distance function is something known as Euclidean distance. And Euclidean distance is basically just this straight line distance like this. Okay. So this would be the Euclidean distance.

S2

Speaker 2

47:47

It seems like there's this point, there's this point, there's that point, et cetera. So the length of this line, this green line that I just drew, that is what's known as Euclidean distance. If we want to get technical with that, this exact formula is the distance, here let me zoom in, the distance is equal to the square root of one point's x minus the other point's x, squared, plus, extend that square root, the same thing for y. So y1 of one point minus y2 of the other, squared. That is, d = sqrt((x1 - x2)^2 + (y1 - y2)^2).

S2

Speaker 2

48:31

Okay, so we're basically trying to find the difference in x and the difference in y between the two points, and then square each of those, sum them up, and take the square root. Okay, so I'm going to erase this so it doesn't clutter my drawing. But anyways, now going back to this plot. So here in the nearest neighbor algorithm, we see that there is a k, right?
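The distance formula just described can be written as a short function. This is a minimal sketch; `euclidean_distance` is a hypothetical helper name, and it accepts points with any number of coordinates, not only two.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance: square each coordinate difference, sum them, take the square root."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# A 3-4-5 right triangle: the points differ by 3 in x and 4 in y.
print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```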

S2

Speaker 2

49:02

And this K is basically telling us, okay, how many neighbors do we use in order to judge what the label is. So usually we use a K of maybe, you know, 3 or 5, depends on how big our data set is. But here, I would say, maybe a logical number would be 3 or 5. So let's say that we take k to be equal to 3.

S2

Speaker 2

49:26

Okay, well, of this data point that I drew over here, let me use green to highlight this. Okay, so of this data point that I drew over here, it looks like the 3 closest points are definitely this one, this one, and then this one has a length of 4. And this one seems like it'd be a little bit further than 4. So actually, these would be our 3 points.

S2

Speaker 2

49:56

Well, all those points are blue. So chances are, my prediction for this point is going to be blue. It's going to be probably don't have a car. All right.

S2

Speaker 2

50:09

Now what if my point is somewhere over here? Let's say that a couple has 4 kids and they make 240,000 a year. All right, well now my closest points are this one, probably a little bit over that one, and then this one, right?

S2

Speaker 2

50:35

Okay, still all pluses. Well, this one is more than likely to be a plus. All right, now let me get rid of some of these just so that it looks a little bit more clear.

S2

Speaker 2

50:54

All right, let's go through one more. What about a point that might be right here? Okay, let's see. Well, definitely this is the closest, right?

S2

Speaker 2

51:09

This one's also closest. And then it's really close between the 2 of these. But if we actually do the mathematics, it seems like if we zoom in, this one is right here and this one is in between these 2. So this one here is actually shorter than this one, and that means that that top one is the one that we're going to take.

S2

Speaker 2

51:37

Now, what is the majority of the points that are close by? Well, we have 1 plus here, we have 1 plus here, and we have 1 minus here, which means that the pluses are the majority. And that means that this label is probably somebody with a car. Okay.

S2

Speaker 2

52:01

So this is how k nearest neighbors would work. It's that simple. And this can be extrapolated to higher dimensions. You know, here we have 2 different features: we have the income, and then we have the number of kids.

S2

Speaker 2

52:21

But let's say we have 10 different features. We can expand our distance function so that it includes all 10 of those dimensions: we sum up the squared differences and take the square root of everything. And then we figure out which ones are the closest to the point that we desire to classify. Okay. So that's k nearest neighbors.
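The whole procedure, measure distances, take the k closest points, and go with the majority label, fits in a few lines of plain Python. This is a sketch for intuition only: `knn_predict` is a hypothetical helper, and the incomes, kid counts, and labels below are invented to resemble the drawing.

```python
import math
from collections import Counter

def knn_predict(points, labels, query, k=3):
    """Return the majority label among the k points closest to the query."""
    # math.dist computes the Euclidean distance for any number of dimensions.
    by_distance = sorted(
        (math.dist(point, query), label) for point, label in zip(points, labels)
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical families: (income in thousands per year, number of kids).
points = [(30, 1), (45, 3), (50, 2), (200, 4), (250, 5), (260, 3)]
labels = ["-", "-", "-", "+", "+", "+"]  # "+" owns a car, "-" does not

print(knn_predict(points, labels, (40, 2)))   # neighbors are all "-"
print(knn_predict(points, labels, (240, 4)))  # neighbors are all "+"
```

Note that income is in the tens or hundreds while kids are single digits, so income dominates the distance here. That is the same scale problem the scaling step earlier is meant to address.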

S2

Speaker 2

52:41

So now we've learned about k nearest neighbors. Let's see how we would be able to do that within our code. So here, I'm going to label the section k nearest neighbors. And we're actually going to use a package from sklearn.

S2

Speaker 2

52:56

So the reason why we, you know, use these packages is so that we don't have to manually code all these things ourselves, because it would be really difficult. And chances are the way that we would code it either would have bugs, or it'd be really slow, or, I don't know, have a whole bunch of issues. So what we're going to do is hand it off to the pros. From here, I can say, okay, from sklearn, which is this package, dot neighbors, I'm going to import KNeighborsClassifier, because we're classifying.

S2

Speaker 2

53:27

Okay, so I run that. And our KNN model is going to be this KNeighborsClassifier. And we can pass in a parameter of how many neighbors, you know, we want to use. So first, let's see what happens if we just use 1.

S2

Speaker 2

53:45

So now if I do knn_model.fit, I can pass in my X train set and my y train data. Okay. So that effectively fits this model. And let's get all the predictions.

S2

Speaker 2

54:03

So y pred, I guess, yeah, let's do y predictions. And my y predictions are going to be knn_model.predict. So let's use the test set, X test. Okay.

S2

Speaker 2

54:21

All right, so if I call y pred, you'll see that we have those. But if I get my truth values for that test set, you'll see that this is what we actually have. So just looking at this, we got 5 out of 6 of them. Okay, great.

S2

Speaker 2

54:33

So let's actually take a look at something called the classification report that's offered by sklearn. So if I go to from sklearn.metrics import classification_report, what I can actually do is say, hey, print out this classification report for me. And let's check, you know, I'm giving you the y test and the y prediction. We run this and we see we get this whole entire chart.
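The fit, predict, and report steps can be sketched end to end as below. The data is synthetic and made up for this example, so the numbers in the printed report will not match the ones from the telescope data set; `KNeighborsClassifier` and `classification_report` are the scikit-learn names used in the video.

```python
import numpy as np
from sklearn.metrics import classification_report
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Hypothetical two-class data: class 0 is centered at (0, 0), class 1 at (3, 3).
X_train = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y_train = np.array([0] * 50 + [1] * 50)
X_test = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
y_test = np.array([0] * 20 + [1] * 20)

# n_neighbors is the k in k nearest neighbors.
knn_model = KNeighborsClassifier(n_neighbors=1)
knn_model.fit(X_train, y_train)

y_pred = knn_model.predict(X_test)

# Prints precision, recall, f1-score, and support per class, plus overall accuracy.
print(classification_report(y_test, y_pred))
```

Changing `n_neighbors` to 3 or 5 and rerunning is all it takes to compare different values of k, as is done later in this section.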

S2

Speaker 2

55:03

So I'm going to tell you guys a few things on this chart. All right, this accuracy is 82%, which is actually pretty good. That's just saying, hey, if we just look at, you know, what each of these new points is closest to, then we actually get an 82% accuracy, which means how many do we get right versus how many total are there. Now precision is saying, okay, you might see that we have it for class 0 and class 1.

S2

Speaker 2

55:33

What precision is saying, well, let's go to this Wikipedia diagram over here, because I actually kind of like this diagram. So here, this is our entire data set. And on the left over here, we have everything that we know is positive. So everything that is actually truly positive, that we've labeled positive in our original data set.

S2

Speaker 2

55:53

And over here, this is everything that's truly negative. Now in the circle, we have things that were labeled positive by our model. On the left here, we have things that are truly positive, because, you know, this side is the positive side and this side is the negative side. So these are truly positive.

S2

Speaker 2

56:14

Whereas all these ones out here, well, they should have been positive, but they were labeled as negative. And in here, these are the ones that we've labeled positive, but they're actually negative. And out here, these are truly negative. So precision is saying, okay, out of all the ones we've labeled as positive, how many of them are true positives.

S2

Speaker 2

56:37

And recall is saying, okay, out of all the ones that we know are truly positive, how many do we actually get right? Okay.

S2

Speaker 2

56:48

So going back to this over here, our precision score. So again, precision: out of all the ones that we've labeled as the specific class, how many of them are actually that class? It's 77% and 84%. Now recall: out of all the ones that are actually this class, how many of those did we get? This is 68% and 89%. All right, so not too shabby. We can clearly see that this recall and precision for class 0 is worse than for class 1.

S2

Speaker 2

57:22

Right. So that means it's worse for hadrons than for our gammas. This f1 score over here is kind of a combination of the precision and recall scores. So we're actually going to mostly look at this one because we have an unbalanced test data set.
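The three scores can be computed by hand from the counts in the diagram. This is a minimal sketch: `precision_recall_f1` is a hypothetical helper, the labels are made up, and the F1 score is taken as the harmonic mean of precision and recall, which is the standard definition.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for one class from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(t == positive and p == positive for t, p in pairs)
    false_pos = sum(t != positive and p == positive for t, p in pairs)
    false_neg = sum(t == positive and p != positive for t, p in pairs)

    # Precision: of everything labeled positive, how much is truly positive.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    # Recall: of everything truly positive, how much did we label positive.
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical labels: 4 true positives exist, the model finds 2 and adds 1 false alarm.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

Here precision is 2 out of 3 and recall is 2 out of 4, so the two scores disagree, and F1 lands between them.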

S2

Speaker 2

57:37

So here we have a measure of 72 and 87, or 0.72 and 0.87, which is not too shabby. All right. Well, what if we, you know, made this 3. So we actually see that.

S2

Speaker 2

57:56

Okay, so what was it originally with 1? We see that our f1 score, you know, was 0.72 and then 0.87, and then our accuracy was 82%. So if I change that to 3, all right, so we've kind of increased class 0 at the cost of class 1, and then our overall accuracy is 81%. So let's actually just make this 5.

S2

Speaker 2

58:25

Alright, so you know, again, very similar numbers, we have 82% accuracy, which is pretty decent for a model that's relatively simple. Okay, the next type of model that we're going to talk about is something known as naive Bayes. Now, in order to understand the concepts behind naive Bayes, we have to be able to understand conditional probability and Bayes rule. So let's say I have some sort of data set that's shown in this table right here.

S2

Speaker 2

58:57

People who have COVID are over here in this red row. And people who do not have COVID are down here in this green row. Now what about the COVID test? Well people who have tested positive are over here in this column and people who have tested negative are over here in this column.

S2

Speaker 2

59:20

Okay. Yeah. So basically, our categories are: people who have COVID and test positive; people who don't have COVID but test positive, so a false positive; people who have COVID and test negative, which is a false negative; and people who don't have COVID and test negative, which, good, means you don't have COVID.

S2

Speaker 2

59:41

Okay, so let's make this slightly more legible. And here in the margins, I've written down the sums of whatever it's referring to. So this here is the sum of this entire row. And this here might be the sum of this entire column.