
Algorithmic Trading – Machine Learning & Quant Strategies Course with Python

2 hours 59 minutes 19 seconds

English

S1

Speaker 1

00:00

In this comprehensive course on algorithmic trading, you will learn about 3 cutting-edge trading strategies to enhance your financial toolkit. Lachezar teaches this course. He is an experienced quantitative researcher and data scientist. In the first module, you'll explore the unsupervised learning trading strategy, utilizing S&P 500 stocks data to master features, indicators, and portfolio optimization.

S1

Speaker 1

00:28

Next, you'll leverage the power of social media with the Twitter sentiment investing strategy, ranking Nasdaq stocks based on engagement and evaluating performance against the QQQ return. Lastly, the intraday strategy will introduce you to the GARCH model, combining it with technical indicators to capture both daily and intraday signals for potentially lucrative positions.

S2

Speaker 2

00:53

Hello and welcome to this freeCodeCamp course on Algorithmic Trading, Machine Learning and Quant Strategies with Python. My name is Lachezar and I'll be your instructor throughout the course. And in this course we are going to develop 3 big quantitative projects from start to end.

S2

Speaker 2

01:11

And the course overview will be the following. First we are going to talk about algorithmic trading basics. Then we're going to talk about machine learning in trading and some obstacles and challenges we may face while using machine learning in trading. Then we are going to develop the first project, which will be an unsupervised learning trading strategy using stocks from the S&P 500. The next project will be Twitter sentiment, and it will be using data from Nasdaq 100 stocks, and the third one will be focusing on one asset.

S2

Speaker 2

01:44

It will be an intraday strategy using a GARCH model to predict the volatility. It will use simulated data. And after that we are gonna have a quick wrap-up and that will finish the tutorial. But before we continue I would like to mention that this tutorial should not be construed as financial advice.

S2

Speaker 2

02:05

It is for educational and entertainment purposes only. We are going to develop some concepts and come up with strategies in the end. But it's not financial advice and you shouldn't make any decisions based on it. As well, for this course I would assume that you have at least some Python knowledge and understanding, because down the road we'll deal with some complex problems, and if you are new to Python you may get bogged down.

S2

Speaker 2

02:30

Nevertheless, you'll learn some very interesting concepts, so stay along and let's get into it. Okay, algorithmic trading basics. So what is algorithmic trading? It is trading on a predefined set of rules which are combined into a strategy or system.

S2

Speaker 2

02:46

It is developed in a programming language and it is run by the computer. It can be used for both manual and automated trading. By manual what I mean is that you may have a screener which comes up with a set of stocks you want to trade on a given day, or you may have an algorithmic strategy which is developed into an alert system, and whenever the conditions are triggered you get an alert but you execute it manually. On the other hand, you may have a completely automated complex system which does a lot of calculation, comes up with positions and sizing, and then executes the trade automatically.

S2

Speaker 2

03:23

Okay, so what is the role of Python in algorithmic trading? Python is the most popular language used in algorithmic trading, quantitative finance, and data science. And this is mainly due to the vast amount of libraries that are developed in Python, as well as the ease of use of Python. It is mainly used for data pipelines, research, and backtesting strategies. It can also be used to automate strategies, but Python is a slow language, so it is best used to automate low-complexity systems.

S2

Speaker 2

03:55

If you have a really high-end system which is really complicated and it needs to execute trades really quickly, you would use Java or C++ for those strategies. Algorithmic trading is a great career opportunity. It's a huge industry. There are a lot of jobs with hedge funds, banks, prop shops, and I've just checked: the average yearly base salary for a quant researcher is around $173,000, and this is not including the yearly bonus.

S2

Speaker 2

04:27

It is a great career opportunity, and if you're interested in it, the main things you need to know are Python, how to backtest strategies, how to replicate papers, and machine learning in trading. If you're interested in it, I'd definitely advise you to go for it. Okay, so let's move on and talk a little bit about machine learning in trading and some use cases of machine learning. When we talk about supervised learning, we can use it for signal generation through prediction.

S2

Speaker 2

05:01

For example, we can come up with buy or sell signals on a given stock or a given asset based on predicting the return or the sign of the return of that asset. We can as well use it in risk management. For example, we may use a prediction to determine the position sizing or the weight of a given stock in our portfolio or to predict where exactly should our stop loss be. And with unsupervised learning, we can use it to extract insights from the data.

S2

Speaker 2

05:33

For example, we can discover patterns, relationships, or structures within the data. For example, clusters and use it in this way to help our decisions. What are some of the challenges that we may face while trying to apply machine learning in trading? And the first theoretical challenge we may face is the so-called reflexivity feedback loop.

S2

Speaker 2

05:56

And it refers to the following phenomenon: say we have a machine learning model predicting that a stock is going to go up each Friday, and we form a strategy around it to profit from that phenomenon. For example, we are going to buy each Thursday and then sell on Friday to capture that price increase. If we find this strategy through predictions and start trading it, with time other market participants are also going to find this market phenomenon and start exploiting it, which would cause the price to start going up on Thursday, because everybody is now buying on Thursday instead of Friday, and then this strategy is going to be arbitraged away. So it's this reflexivity feedback loop which makes predictions quite hard.

S2

Speaker 2

06:44

What is hardest when applying machine learning? The hardest thing is to predict returns and prices. The next quite hard thing to do is to predict return signs, or the direction of a given asset. Is it going to go up or down?

S2

Speaker 2

07:00

The next thing is to predict an economic indicator. For example, it is quite hard to predict non-farm payrolls or weekly jobless claims. And a thing which is not that hard or quite straightforward is to predict the volatility of a given asset. Furthermore, there are some technical challenges like overfitting a model or generalization.

S2

Speaker 2

07:23

Overfitting means the model has learned the training data too well and fails on the test data, and poor generalization means the model does not perform the same on real data. As well, we may have non-stationarity in our training data and regime shifts, which may ruin the performance of the model. And the last thing is that if we have a really complicated model or a neural network, it is like a black box and we are not able to interpret it correctly. What is the usual workflow process in algorithmic trading and machine learning?

S2

Speaker 2

07:56

That would be to collect and prepare the data, then develop a hypothesis for a strategy, then code the model and train the model, and then finally backtest the strategy. Okay guys, some key takeaways from this course. You'll learn high-level concepts in quantitative finance as well as practical machine learning in trading. You will develop a project from idea to backtested final results.

S2

Speaker 2

08:23

However, we will not automate or execute any trades. We'll just develop a strategy. This course is all about developing a strategy from start to end so you see the workflow. And this is just a purely research project for educational purposes.

S2

Speaker 2

08:41

I repeat it should not be considered as any financial advice whatsoever. And with that we can move to the first project and let's get into it. Okay so first project is about unsupervised machine learning trading strategy. We are going to use data from S&P 500 stocks.

S2

Speaker 2

09:00

Let's talk a little bit about unsupervised learning in trading. So it involves machine learning techniques to analyze financial data and discover patterns, relationships, and structures within this data without predefined labels or target variable. Unlike supervised learning where the model is trained to make predictions, unsupervised learning focuses on extracting insights from the data. Some use cases would be clustering, which we are going to use in the first project, dimensionality reduction, anomaly detection, market regime detection, and portfolio optimization.

S2

Speaker 2

09:35

In this project, what we are going to do is first download the price data for all S&P 500 stocks. Then we are going to calculate different technical indicators and features for each stock. Next, we are going to aggregate on a monthly level and filter only the top 150 most liquid stocks from the S&P 500 for each month. Next, we are going to calculate monthly returns for different time horizons to add to the features.

S2

Speaker 2

10:04

And the next step would be to download the Fama-French factors and calculate rolling factor betas for each stock, to add to the feature set as well. At this point we have enough features to fit a model and either make predictions or, in our case, fit the K-Means clustering algorithm, an unsupervised learning model, and use it to group similar stocks into clusters. We are going to analyze the clusters, select a particular cluster, and then for each month select the stocks within this cluster and form portfolios. However, these portfolios will be optimized: we are going to find the weights of the stocks within the portfolio by using the Efficient Frontier max Sharpe ratio portfolio weights. Then we are going to form the portfolio, hold for one month, rebalance at the end of the month, and form another max Sharpe ratio portfolio.
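The rebalancing step relies on max Sharpe ratio weights. As a minimal sketch of the underlying math only — the course itself uses PyPortfolioOpt's EfficientFrontier with weight bounds, which is a constrained solver — the unconstrained tangency portfolio can be written as w ∝ Σ⁻¹μ:

```python
import numpy as np

def max_sharpe_weights(mu, cov):
    """Unconstrained max-Sharpe (tangency) weights: w proportional to inv(cov) @ mu,
    normalized to sum to 1. PyPortfolioOpt's EfficientFrontier.max_sharpe() solves the
    same objective but with extra constraints such as per-stock weight bounds."""
    raw = np.linalg.solve(np.asarray(cov, dtype=float), np.asarray(mu, dtype=float))
    return raw / raw.sum()
```

With equal expected returns and an identity covariance this collapses to equal weights, which is a quick sanity check on the formula.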

S2

Speaker 2

11:09

In the end, we'll have the strategy returns for each day and we will be able to compare our portfolio strategy returns to the S&P 500 returns themselves. Actually, a not-so-small limitation is that we're going to use the most recent S&P 500 stocks list, which means that there may be survivorship bias in this list. This is a huge issue. In reality, you should always backtest strategies using survivorship-bias-free data.

S2

Speaker 2

11:38

What is survivorship bias? It is the condition where a stock which has actually dropped out of the S&P 500 because it was failing is currently not in the list, right? So last year, for example, there was a stock which was failing and going down, down, down. So at some point in December last year, they removed it from the S&P 500 and they included a new stock.

S2

Speaker 2

12:00

So if we had made the optimization last November, we could have ended up having this stock in our portfolio and actually affecting our portfolio results. But if we use the most recent S&P 500 symbols list, this stock would not be there. So that's survivorship bias. And for the given project, we are not going to deal with this survivorship bias, so the list we are going to work with most probably has survivorship bias. So yeah, that is a limitation you need to know.

S2

Speaker 2

12:30

In the second project we are going to develop a Twitter sentiment-based investing strategy. We are going to use the Nasdaq 100 stocks and Twitter sentiment data. What is sentiment investing? This approach focuses on analyzing how people feel about certain stocks, industries or the overall market.

S2

Speaker 2

12:49

It assumes that public sentiment can impact stock prices: for example, if many people are positive about a particular company on Twitter, it might indicate potential for that company's stock to perform well. What we are going to do first is load the Nasdaq stocks' Twitter sentiment data. Then we're going to calculate a quantitative feature, the engagement ratio on Twitter, for each stock. After that, we're going to rank all the stocks cross-sectionally for each month and create an equal-weight portfolio.

S2

Speaker 2

13:22

In the end, we're going to compare the return of this portfolio to the Nasdaq itself. The second strategy is much smaller than the first strategy. There is no machine learning modeling in this strategy, but the idea here is to show you how alternative or different data, in this case sentiment data, can help us to create a quant feature and then create potential strategy out of it. That's the idea of the second project.

S2

Speaker 2

13:46

And the third project is about an intraday strategy using a GARCH model. In this one we are going to focus on a single asset and we'll be using simulated data. Actually we'll have daily data and intraday five-minute data. But what does an intraday strategy mean?

S2

Speaker 2

14:03

This approach involves buying and selling financial assets within the same trading day to profit from the short-term price movements. Intraday traders usually use technical analysis, real-time data, and different risk management techniques to make decisions and profit from the strategies. What exactly we are going to do in this project? First, we are going to load the simulated daily data and the simulated five-minute data.

S2

Speaker 2

14:28

Then we are going to define a function which fits, in a rolling window, a GARCH model to predict one-day-ahead volatility of the asset. After we have that, we'll calculate the prediction premium. So we'll predict the volatility, calculate the prediction premium, and form a daily signal from it. After we have calculated that, we'll merge the daily data with the intraday data and calculate intraday technical indicators to form an intraday signal.

S2

Speaker 2

14:56

So we will have a daily signal and then on top an intraday signal, so two signals. And after we have that, we will generate a position entry and hold until the end of the day. In the end we are going to calculate the final strategy returns, and this is it for the third project. The idea is to show you how predicting volatility works in intraday strategies.

S2

Speaker 2

15:19

And yeah, with that, guys, we are ready to jump into the first project, so let's get into it. Okay, let's start with the first project, the unsupervised learning trading strategy. But before we continue with the coding, guys, the first step would be for you to pause the video and install all the needed packages for this project. I prepared a small list up here with all the packages we are going to need for the project. Those are pandas, numpy, matplotlib, statsmodels, pandas-datareader, datetime, yfinance, sklearn and PyPortfolioOpt. How can you install the packages? You can do it the following way.

S2

Speaker 2

15:58

You can open an Anaconda prompt like that, right? And then just write pip install and the package name — pip install pandas, for example — and do that for each package. That's one way. Another way to do it is through the notebook itself. You can type the following: pip install and then the package name, so pip install pandas. You press Ctrl+Shift and run the cell, and it will install the package. Pause the video, take your time, install all the packages, and let's continue. Okay, so the first step would be to download the S&P 500 constituents price data. But before that, we'll have to import all the packages we'll use throughout this project.

S2

Speaker 2

16:38

I've already prepared that. Those are all the packages we've just installed and now we are importing into the Jupyter notebook. The next step would be to download S&P 500 constituents data. To do that we can go to this link on Wikipedia and up here you can see that they have a table containing S&P 500 component stocks like the symbol, the name of the security sector, industry, date added as well.

S2

Speaker 2

17:05

So quite some data. We would like to load this table into our Jupyter notebook. How can we do that? We'll just call the pandas read_html function and see what it returns.

S2

Speaker 2

17:19

Okay, so this is returning a list containing 2 elements. It looks like 2 data frames. So, actually we are interested in the first data frame. Yeah, Exactly.

S2

Speaker 2

17:30

We will assign that to an object called sp500. And the next step would be to grab this symbol column and extract all the symbols into a list. But before that, I think we would have to do a little cleaning on some of the symbols, because I know that one or two of the symbols contain a dot, and this would give us an error when we download data from yfinance. So we actually have to replace all dots with a dash, and that would do the job. The next step would be to, you know, just grab a list with all the stocks.

S2

Speaker 2

18:11

So yeah, we can just use symbol.unique().tolist(). We'll assign that to an object called symbols_list. And yeah, that would be our symbols list of all S&P 500 stocks. As we talked about in the beginning, this list of stocks is not survivorship-bias free, so you need to know that it's quite some limitation.
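The steps just described — read the Wikipedia table, replace dots with dashes, collect the unique tickers — can be sketched like this; the cleaning is split into its own function so it can be checked without a network call, and `get_sp500_symbols` assumes the constituents table is the first table on the page:

```python
import pandas as pd

def clean_symbols(symbols: pd.Series) -> list:
    """yfinance expects '-' where the index list uses '.', e.g. BRK.B -> BRK-B."""
    return symbols.str.replace(".", "-", regex=False).unique().tolist()

def get_sp500_symbols() -> list:
    """Load the S&P 500 constituents table from Wikipedia (requires network access)."""
    url = "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"
    sp500 = pd.read_html(url)[0]  # the constituents table is the first on the page
    return clean_symbols(sp500["Symbol"])
```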

S2

Speaker 2

18:40

Alright, so we would like to download the data up to a few days ago. Let's define an object end_date, and we'll use 2023 September 27th. The start date would be... exactly 8 years before that, so we can just take the end date and subtract 8 years from it. How can we do that?

S2

Speaker 2

19:15

We just say pd.to_datetime — we'll convert the end date to datetime — and then we'll just subtract pd.DateOffset with 365 times 8. All right, so now we have the end date, which is the 27th of September, which is a string, and the start date, which is a timestamp. But that's all right, because we are going to download data from yfinance, and the function we'll use from yfinance is the download function, which takes tickers as an argument — that would be our symbols list — then start, which would be our start date, and end, which would be our end date. And now this will download all the S&P 500 constituents. If we run it, yeah, it will take some time. All right, so we've downloaded all the data for the S&P 500 stocks.
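In code, the date arithmetic and download look roughly like this; yfinance is imported inside the function because the download itself needs network access, while the date logic runs standalone:

```python
import pandas as pd

end_date = "2023-09-27"
# eight years of history: subtract 8 * 365 days from the end date
start_date = pd.to_datetime(end_date) - pd.DateOffset(365 * 8)

def download_prices(symbols_list):
    """Download daily OHLCV data for every symbol (requires network and yfinance)."""
    import yfinance as yf  # third-party: pip install yfinance
    return yf.download(tickers=symbols_list, start=start_date, end=end_date)
```

Note that 8 × 365 days is slightly less than 8 calendar years because of leap days, so the start lands a couple of days after the "exact" anniversary.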

S2

Speaker 2

20:10

We got it up here. So the next step would be to actually, yeah, we'll comment out everything and have this object printed out. As you can see up here, we have the column and then we have like a multi index column. So first we have the adjusted close, and then we have the adjusted close for each stock.

S2

Speaker 2

20:29

That's not really convenient to work with the whole data frame. At the moment we have 2012 rows and 3018 columns; that's really inefficient. How can we overcome that? We will just use the stack method, which now creates a multi-index. The first level would be the date, and then for each date we have the corresponding 500 stocks' adjusted close, close, high, low. So that's much more convenient: we have 6 columns and almost 1 million rows. So we'll have to change the data frame to be stacked — or actually we can move this method right up here, directly after the download — and yeah, now we have our data frame stacked.

S2

Speaker 2

21:09

Next step would be I would always when I have a multi index I would always want to have labels on both of the index labels. So we'll say index.names and we'll assign the new names. So date and ticker, that will be our new multi-index names. As you see, they changed up here.

S2

Speaker 2

21:29

Next step would be to fix the columns a little bit. I would like the column names to be not in title case but in lower-case letters. So that would be df.columns = df.columns.str.lower(), and that's pretty much it with downloading and fixing the data a little bit, before we move to the next step, which would be to start calculating technical indicators and features for all those 503 stocks. Okay, so in the second step we can start calculating the features and technical indicators for each stock.
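The stack-and-tidy steps above can be collected into one small helper (column and level names here follow the convention the video just set up):

```python
import pandas as pd

def tidy_prices(df: pd.DataFrame) -> pd.DataFrame:
    """Stack the (field, ticker) column MultiIndex into a (date, ticker) row MultiIndex,
    name both index levels, and lower-case the remaining field columns."""
    out = df.stack()
    out.index.names = ["date", "ticker"]
    out.columns = out.columns.str.lower()
    return out
```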

S2

Speaker 2

22:07

We are going to calculate the Garman-Klass volatility, RSI, Bollinger Bands, ATR, MACD and dollar volume for each stock. Let's first start with the Garman-Klass volatility. What is Garman-Klass volatility? It is a volatility measure usually used in forex trading, but it works for stocks as well.

S2

Speaker 2

22:25

It is an approximation to measure the intraday volatility of a given asset. And that is the formula right here. So what do we do? We define a new column called garman_klass_vol, and it will be the following.

S2

Speaker 2

22:41

So: log of the high minus log of the low — this whole thing is squared and then divided by 2. Then from this we subtract 2 times log 2 minus 1, multiplied by the difference of the log of the adjusted close minus the log of the open, again with the whole thing squared. Then we just close another bracket, and that should be it. That should be the Garman-Klass volatility. Yes, so we have now calculated the Garman-Klass volatility for each stock.
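Spelled out as code rather than dictation, the Garman-Klass column (assuming the lower-cased column names from the previous step) is:

```python
import numpy as np
import pandas as pd

def garman_klass_vol(df: pd.DataFrame) -> pd.Series:
    """((ln H - ln L)^2) / 2  -  (2 ln 2 - 1) * (ln C_adj - ln O)^2, computed row by row."""
    hl = np.log(df["high"]) - np.log(df["low"])
    co = np.log(df["adj close"]) - np.log(df["open"])
    return (hl ** 2) / 2 - (2 * np.log(2) - 1) * co ** 2
```

On a bar that opens and closes at the same price, only the high-low term contributes, which is a handy sanity check.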

S2

Speaker 2

23:27

It is calculated on a given row. We don't need to do any fancy calculations for this one. The next one is RSI. So how do we calculate the RSI for each stock?

S2

Speaker 2

23:39

What are we going to do? We are going to group by on the ticker level. So level 1, the multi-index has level 0 and level 1, which is the ticker. Level 0 is the date, level 1 is the ticker.

S2

Speaker 2

23:52

So we group by on level 1. Then we are selecting the column which would be adjusted close and apply the transform method. And within the transform method we'll just apply a lambda function which would be... So now we have grouped by on each ticker and what do we want to do?

S2

Speaker 2

24:08

We want to calculate the RSI. To calculate RSI we are going to use the pandas_ta package, which is the package to calculate pretty much all of the needed technical indicators. So we use the pandas_ta package, and from pandas_ta we use the rsi function. To the rsi function we have to supply the close price, which would be x, and the length, which would be 20. And yeah, if we do that, you see, now we have the RSI for each stock.
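The per-ticker transform pattern just described can be written as a small helper. The indicator function is pluggable, so the pattern can be checked with plain pandas even though the course passes `pandas_ta.rsi` (a third-party package) into it:

```python
import pandas as pd

def add_grouped_indicator(df: pd.DataFrame, column: str, func, out_name: str) -> pd.DataFrame:
    """Apply a one-column indicator to each ticker separately via groupby(level=1).transform,
    so values never leak across stocks in the interleaved (date, ticker) rows."""
    df[out_name] = df.groupby(level=1)[column].transform(func)
    return df

# With pandas_ta installed, the RSI column from the video would be:
#   import pandas_ta
#   add_grouped_indicator(df, "adj close", lambda x: pandas_ta.rsi(close=x, length=20), "rsi")
```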

S2

Speaker 2

24:42

How can we double-check our work? We select Apple, then RSI, and then we'll plot it. And yeah, as you can see, the RSI goes up and down, up and down, so we have worked correctly. The next indicator we would like to calculate is Bollinger Bands, and actually we want to have the lower band, the middle band and the upper band. But there is one specification: for each indicator from now on, we would like to normalize and scale the indicator itself. So for the Bollinger Bands we will supply the log of the close price.

S2

Speaker 2

25:26

The function we are going to use from pandas_ta is bbands, and it takes the close price — we'll supply the adjusted close — and the length will be 20. When you run this function, it returns 5 columns. So the first one is the lower band.

S2

Speaker 2

25:45

The second one is the middle band, and the third one is the upper band. We have to take this into account, and what we are going to do is pretty much, you know, define a new column, bb_low — Bollinger Bands low — and we'll use the same idea as for the RSI. We are going to group by each ticker, select the adjusted close column, and then use the transform method with a lambda function. In the lambda function we'll say pandas_ta.bbands, where close would be equal to x — actually it will be equal to the log of x, so np.log1p — and the length will again be 20. When we run that, it will return those 5 columns. What we actually want to do is assign to bb_low the first column, which we know is the lower Bollinger band. We can say iloc, all the rows and the first column.

S2

Speaker 2

26:40

We can repeat the same operation for the mid band and for the upper band. Okay, but this happens when I forget something, and I think I have forgotten the second curly bracket to close the curly brackets. Let's see what this will return.

S2

Speaker 2

26:56

Yeah guys, after you run and calculate something, you can just select all the code you've used and comment it out. Okay, we forgot to change the names: that will be bb_mid with index 1 and bb_high with index 2. Okay, let's run that again and see what we get. Awesome, now we have the lower Bollinger band, middle Bollinger band and upper Bollinger band, and we have the data scaled and normalized.
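pandas_ta.bbands returns the lower, middle, and upper bands as its first three columns. A plain-pandas stand-in on log prices (rolling mean ± 2 standard deviations) shows the same log1p scaling idea and runs without the third-party package:

```python
import numpy as np
import pandas as pd

def bollinger_bands(close: pd.Series, length: int = 20, num_std: float = 2.0) -> pd.DataFrame:
    """Lower, middle, and upper Bollinger bands on log1p prices — a plain-pandas
    stand-in for the first three columns returned by pandas_ta.bbands."""
    logp = np.log1p(close)
    mid = logp.rolling(length).mean()
    std = logp.rolling(length).std()
    return pd.DataFrame({"bb_low": mid - num_std * std,
                         "bb_mid": mid,
                         "bb_high": mid + num_std * std})
```

On a constant price series the rolling standard deviation is zero, so all three bands collapse onto the log of the price — an easy way to spot a wiring mistake.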

S2

Speaker 2

27:27

Next step is to calculate the ATR for each stock. However, the ATR function needs 3 inputs — 3 columns, not only 1 column. And the transform method in pandas only works when you select a single column; it would not work with 3 columns as input.

S2

Speaker 2

27:47

So we would have to use another approach; more specifically, that would be a groupby-apply. And to do that, we need to define our own custom function to calculate the ATR. We can double-check what the atr function from pandas_ta requires as input: it is the high, so we have to supply the high, then we have to supply the low and the close price. Furthermore, we can supply the length, for example 14. And yeah, it will mess up the data if we run it like that, because we have to select the data for a given stock, but you see that it requires 3 columns. So here we are going to define a function called compute_atr. It will take stock_data, and up here we will just calculate the ATR, which will be pandas_ta.atr: high will be the stock data's high, then the low, then we need the close, and the length would be 14.

S2

Speaker 2

28:47

Alright, and a little detail we are going to add up here: we are going to normalize the data while we calculate it. So that would be atr.sub — first we subtract the mean, and then we divide by the standard deviation, atr.std. And yeah, that would be our ATR indicator function. Now we can just create a new column called atr and group by level 1 again, so we are applying that for each stock. However, up here, when we use the groupby-apply, we need to add an additional argument to the groupby, which is group_keys=False, because if we don't do that it will double the date column — it will return another date column and we'll have a triple multi-index with 2 date columns. We don't want that, so we just say group_keys=False and then apply this function, and this will now calculate the ATR indicator, normalized, for each stock. The next indicator we are going to calculate is the MACD indicator, and for the MACD we are going to follow the same logic as for the ATR indicator.
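The custom function plus the groupby-apply call can be sketched as follows; the true-range calculation here is a plain-pandas stand-in for pandas_ta.atr, and the z-scoring matches the demean-and-divide normalization just described:

```python
import pandas as pd

def compute_atr(stock_data: pd.DataFrame, length: int = 14) -> pd.Series:
    """Rolling-mean ATR built from the true range, then z-scored (demeaned and divided
    by the std). A plain-pandas stand-in for pandas_ta.atr(high, low, close, length=14)."""
    prev_close = stock_data["close"].shift()
    tr = pd.concat([stock_data["high"] - stock_data["low"],
                    (stock_data["high"] - prev_close).abs(),
                    (stock_data["low"] - prev_close).abs()], axis=1).max(axis=1)
    atr = tr.rolling(length).mean()
    return (atr - atr.mean()) / atr.std()

# Per stock, without duplicating the date level in the index:
#   df["atr"] = df.groupby(level=1, group_keys=False).apply(compute_atr)
```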

S2

Speaker 2

29:59

We are going to define our own custom function to compute the MACD. It will be called compute_macd, it will take the close price, and up here we are going to say macd equals pandas_ta.macd, with close equal to close and length 20, and then we would like to get the first column which is returned. However, in the end we are again going to normalize the data: we are going to demean the series and then divide by the standard deviation. Why do we do that?

S2

Speaker 2

30:33

Right away, we're normalizing the data because we are going to use it into a machine learning model. We are going to cluster the data. We want to do that straight away and don't think about it later in the future. So here we are going to do again group by level 1 each ticker on each stock group keys equal to false and then apply compute MACD.

S2

Speaker 2

30:56

This would calculate the MACD indicator for each stock. All right we have an error. Why do we have this error? That's quite strange.

S2

Speaker 2

31:06

Oh, okay, I'm sorry, I'm sorry. I forgot to add the adjusted close column up here; this is causing the error. All right, that should be the case. All right, yeah, now we have the MACD calculated and normalized as well. As you can see, the data looks pretty good so far. The only indicator we are not going to normalize is the RSI, and there is a particular reason for that, but you'll understand more about it when we come to the clustering part. All the other indicators we are going to normalize.
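Following the same pattern as the ATR, the MACD function can be sketched like this. The MACD line here is built from the classic 12/26 EMA spans as a plain-pandas stand-in for the first column of pandas_ta.macd (the video passes a length argument whose exact mapping onto pandas_ta's fast/slow/signal parameters isn't shown), and the z-scoring is the same demean-and-divide normalization:

```python
import pandas as pd

def compute_macd(close: pd.Series, fast: int = 12, slow: int = 26) -> pd.Series:
    """MACD line (fast EMA minus slow EMA), then z-scored — a plain-pandas stand-in
    for the first column returned by pandas_ta.macd."""
    macd = close.ewm(span=fast, adjust=False).mean() - close.ewm(span=slow, adjust=False).mean()
    return (macd - macd.mean()) / macd.std()

# Applied per stock, as with the ATR:
#   df["macd"] = df.groupby(level=1, group_keys=False)["adj close"].apply(compute_macd)
```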

S2

Speaker 2

31:48

And the final one is the dollar volume. So we are going to create a new column, dollar_volume, which would be equal to the adjusted close multiplied by the volume. However, we may want to actually divide that by 1 million for each stock, because we know that millions of shares are traded each day, and this would make more sense. As you can see, now the data looks much better for the dollar volume, right?

S2

Speaker 2

32:14

And that's pretty much it with calculating the first batch of features, our technical indicators. Now we have a really beautiful data frame. For each day, we have all the 500 stocks. We have the close price, high, low, open, volume, Garman-Klass volatility, RSI, Bollinger Bands, ATR, MACD and the dollar volume for each stock.

S2

Speaker 2

32:36

And we are now ready to move to the next and the third step. In the third step what we want to do is to aggregate on a monthly level the data and filter the top 150 most liquid stocks for each month. Why do we do that? We do that to reduce training time for any potential machine learning model and experiment with features and strategies.

S2

Speaker 2

32:58

What is my idea here? I would like to aggregate all the indicators — so those five — by taking the last value of each month, and the same for the adjusted close price. And for the dollar volume, I would like to get the average dollar volume for the whole month for each stock.

S2

Speaker 2

33:19

What we can do first is take the data frame and unstack the ticker level. So we'll unstack ticker, and then we're going to select the dollar volume column. If we run that, we have the dollar volume for each day for each stock now, right? And what we can do is just resample to monthly and take the mean. This should resample to monthly. Now, as you can see, we have a monthly index — the end of each month — and we have the average dollar volume for the month. What we can do now is just stack it back into a multi-index like that, and we can say to_frame to make it a data frame with one column: to_frame('dollar_volume'). Beautiful. That would be the first step. However, for the indicators, what we can do is follow the same logic, but we need to select the exact columns. Actually, we may create a list of columns, last_cols, and this would be our list of columns for which we want to do the same operation.

S2

Speaker 2

34:37

However, we would use the last method up here instead of mean, and those columns would be c for c in df.columns.unique() if c is not in the following columns list: so if the column is not dollar volume, not volume, not open, and not high, low or close. Pretty much we want to do that only for the technical indicator columns, so we don't use those other columns in our aggregation.

S2

Speaker 2

35:23

We want just the features. We are creating the features data frame in the end, right? We would use the dollar volume to filter out the most liquid stocks. The dollar volume would not be a feature in our model.

S2

Speaker 2

35:33

And neither would the volume or the open, high, low, and close. After we have defined last_cols, we can proceed with the next aggregation. We have done the dollar volume aggregation; the next one would be for df[last_cols].

S2

Speaker 2

35:51

We are going to unstack first, then select those columns, then resample to monthly and just take the last value, like that.

S2

Speaker 2

36:10

And again, we're going to stack back into a multi-index. Voila. And now what we can do is concat those two together. We can say pandas concat with axis=1, and boom, we have the average dollar volume and the last value of the adjusted close, ATR, and the other technical indicators for each month.
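A sketch of the full monthly aggregation just described, with assumed column names: dollar volume gets the monthly mean, indicator and adjusted-close columns get the month's last value, and the two pieces are concatenated.

```python
import numpy as np
import pandas as pd

# Synthetic daily OHLC + indicator data on a (date, ticker) MultiIndex.
dates = pd.date_range("2020-01-01", "2020-02-28", freq="B")
tickers = ["AAPL", "MSFT"]
idx = pd.MultiIndex.from_product([dates, tickers], names=["date", "ticker"])
df = pd.DataFrame(np.random.rand(len(idx), 7), index=idx,
                  columns=["adj close", "open", "high", "low", "close",
                           "dollar_volume", "rsi"])

# Keep only indicator-style columns for the "last value" aggregation.
last_cols = [c for c in df.columns
             if c not in ["dollar_volume", "volume",
                          "open", "high", "low", "close"]]

data = pd.concat([
    # Dollar volume: monthly mean.
    df.unstack("ticker")["dollar_volume"].resample("M").mean()
      .stack("ticker").to_frame("dollar_volume"),
    # Indicators / adjusted close: last value of the month.
    df.unstack("ticker")[last_cols].resample("M").last().stack("ticker"),
], axis=1)
```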

S2

Speaker 2

36:42

Now we have aggregated the data to a monthly level for the features we would need. Actually, what we can add up here is a small dropna so it looks much cleaner, and we can call the result data. Let's visualize what we have. Yeah, that's our data. All right.

S2

Speaker 2

37:08

And that's the first part, aggregating to a monthly level. The next part would be to calculate the 5-year rolling average dollar volume for each stock and then use this aggregated dollar volume to keep only the top 150 most liquid stocks for each month. How do we approach that? First, we can start by selecting the dollar volume, like that.

S2

Speaker 2

37:39

So we select the dollar volume and unstack the ticker level. And now we can use a rolling function with a window of 5 times 12, so 5 years, and then calculate the mean. As you can see, now we have the rolling 5-year average dollar volume for each stock.

S2

Speaker 2

38:03

And again, we can just stack back. And that's pretty much our dollar volume column. What we can do now is assign it to the dollar volume column, updating it to the 5-year rolling average for each stock. Awesome!
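A sketch of this step on synthetic monthly data (column and level names are assumptions): the per-stock dollar volume is replaced by its 60-month rolling mean.

```python
import numpy as np
import pandas as pd

# Synthetic monthly data on a (date, ticker) MultiIndex.
dates = pd.date_range("2015-01-31", periods=72, freq="M")
tickers = ["AAPL", "MSFT"]
idx = pd.MultiIndex.from_product([dates, tickers], names=["date", "ticker"])
data = pd.DataFrame({"dollar_volume": np.random.rand(len(idx))}, index=idx)

# Unstack per ticker, take the 5-year (60-month) rolling mean, stack back,
# and overwrite the original column; the first 59 months per stock are NaN.
data["dollar_volume"] = (data["dollar_volume"]
                         .unstack("ticker")
                         .rolling(5 * 12)
                         .mean()
                         .stack())
```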

S2

Speaker 2

38:19

After we have that, the next step is to calculate the dollar volume rank cross-sectionally for each month. How do we do that? We can say data.groupby(level=0), which is the date. So we group by date.

S2

Speaker 2

38:35

For each month, we're going to select the dollar volume and just rank it with ascending=False. Let's see what this gives us. Now, as you can see, we have all the stocks ranked by dollar volume, and the ones with the smallest dollar volume have the highest rank number. We want the top 150, and from here we can very easily select the stocks whose rank is at most 150 for each month — those would be the top 150 most liquid stocks for each month. After we have selected them, we can just drop the two columns — the dollar volume and the dollar volume rank — because we are not going to need them anymore.
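A sketch of the cross-sectional liquidity filter, with N=2 standing in for the video's top 150 (the column names are assumptions):

```python
import numpy as np
import pandas as pd

# Synthetic monthly data: 4 stocks over 3 months.
dates = pd.date_range("2020-01-31", periods=3, freq="M")
tickers = ["AAPL", "MSFT", "TSLA", "NVDA"]
idx = pd.MultiIndex.from_product([dates, tickers], names=["date", "ticker"])
np.random.seed(0)
data = pd.DataFrame({"dollar_volume": np.random.rand(len(idx))}, index=idx)

N = 2  # the video uses 150
# Rank within each date (level 0); ascending=False gives rank 1 to the
# most liquid stock, so rank <= N keeps the top N.
data["dollar_vol_rank"] = (data.groupby(level=0)["dollar_volume"]
                           .rank(ascending=False))
data = (data[data["dollar_vol_rank"] <= N]
        .drop(["dollar_volume", "dollar_vol_rank"], axis=1))
```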

S2

Speaker 2

39:35

With axis=1. And that's pretty much it — we can assign that to data. And that's it for our third step, where we now have aggregated monthly data for all the features we would need, plus the adjusted close price. We can move on to the fourth step, which is calculating monthly returns for different time horizons and adding them as additional features to the ones we already have here. Okay, so let's move to that step.

S2

Speaker 2

40:10

I'll just cut those cells in the middle. All right, why do we want to calculate the monthly returns for different time horizons and add them to the feature set? Because we may want to capture time-series dynamics that reflect, for example, the momentum patterns of each stock. To do that, we can just use the pandas DataFrame method pct_change and supply the different lags. My approach would be to use lags of 1 month, 2 months, 3 months, 6 months, 9 months, and 12 months.

S2

Speaker 2

40:43

That's like 6 different lags to really capture the momentum patterns. How do we approach that? Let's first start by for example selecting the Apple stock. I just want to make a little example.

S2

Speaker 2

40:57

So we'll select the Apple stock, and let's see what we have here. All right. Now we would like to calculate the returns for the following lags: 1 month, 2 months, 3 months, 6 months, 9 months, and 12 months, right? We may also want to have an outlier cutoff, because we are dealing with a lot of stocks.

S2

Speaker 2

41:20

There will definitely be outlier values in the returns of those stocks. What do we want to do? We want to deal with them by clipping them. What clipping does is that all values above the outlier threshold are simply assigned the value at that percentile.

S2

Speaker 2

41:38

We may want to set the cutoff value to 0.005, which corresponds to the 99.5th percentile. That would be our outlier cutoff. And now what we do is, for each lag in lags, we create a column which would be the return for the given lag month, right? That will be the column.

S2

Speaker 2

42:04

Then we grab the adjusted close and just do pct_change(lag). So for each lag, we'll calculate a column holding the given return. Then we want to deal with the outliers, right? That would be .pipe(lambda x: ...). So far we have the adjusted close, then we calculate the return for a given lag, and then we pipe that into a lambda where we can clip. We'll clip the return, and we can set the lower bound.

S2

Speaker 2

42:37

For the lower cutoff we'll use x.quantile and supply just the outlier cutoff, and for the upper one we'll supply x.quantile(1 - outlier_cutoff). Then we add 1, raise to the power of 1 divided by the lag to normalize the multi-month return to a monthly rate, and subtract 1 at the end. And that should be it, guys.
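A sketch of the lagged-return features for a single stock, with clipping at the 0.5th/99.5th percentiles (the column names and the 0.005 cutoff are assumptions consistent with the narration):

```python
import numpy as np
import pandas as pd

# Synthetic monthly price series for one stock.
np.random.seed(0)
prices = 100 * np.exp(np.cumsum(np.random.normal(0, 0.05, 60)))
df = pd.DataFrame({"adj close": prices},
                  index=pd.date_range("2018-01-31", periods=60, freq="M"))

outlier_cutoff = 0.005  # clip at the 0.5th / 99.5th percentiles
for lag in [1, 2, 3, 6, 9, 12]:
    df[f"return_{lag}m"] = (df["adj close"]
        .pct_change(lag)
        .pipe(lambda x: x.clip(lower=x.quantile(outlier_cutoff),
                               upper=x.quantile(1 - outlier_cutoff)))
        .add(1)
        .pow(1 / lag)   # geometric normalization to a monthly rate
        .sub(1))
```

The `(1 + r) ** (1/lag) - 1` step converts each multi-month return into an equivalent average monthly return, so the columns are comparable across lags.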

S2

Speaker 2

43:05

Now we can see that for our Apple stock here, we have the 1-month return, 2-month return, 3 months, 6 months, 9 months, 12 months, etc. How can we extend that to all stocks? We'll just use the same approach we used for the ATR and MACD indicators.

S2

Speaker 2

43:21

We'll create our own custom function and then use the groupby-apply methodology of the pandas DataFrame. So, calculate_returns. It takes a df, for example — let me move that a little bit — and returns the df at the end. However, up here we have to change these references to df as well, and that's pretty much it, our function. And now we can say the following: data equals data.groupby(level=1), because we group on the ticker level. So we can say groupby ticker, or level 1.

S2

Speaker 2

43:58

Let's say level 1. Then we want group_keys to be False, so the group keys aren't added as an extra index level on the new data frame. So group_keys=False, then apply — we apply the calculate_returns function — and at the end we may want to drop the NA values.
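A sketch of the per-ticker groupby-apply just described (the function and column names are assumptions):

```python
import numpy as np
import pandas as pd

def calculate_returns(df, lags=(1, 2, 3, 6, 9, 12), outlier_cutoff=0.005):
    """Add clipped, lag-normalized monthly return columns to one stock's frame."""
    for lag in lags:
        df[f"return_{lag}m"] = (df["adj close"]
            .pct_change(lag)
            .pipe(lambda x: x.clip(lower=x.quantile(outlier_cutoff),
                                   upper=x.quantile(1 - outlier_cutoff)))
            .add(1).pow(1 / lag).sub(1))
    return df

# Synthetic monthly data for two tickers.
dates = pd.date_range("2018-01-31", periods=24, freq="M")
tickers = ["AAPL", "MSFT"]
idx = pd.MultiIndex.from_product([dates, tickers], names=["date", "ticker"])
np.random.seed(1)
data = pd.DataFrame({"adj close": 100 + np.random.rand(len(idx))}, index=idx)

# group_keys=False keeps the original (date, ticker) index; dropna removes
# the warm-up rows that lack a full 12-month history.
data = (data.groupby(level=1, group_keys=False)
            .apply(calculate_returns)
            .dropna())
```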

S2

Speaker 2

44:19

And yeah, this will take some time. And yeah, that's pretty much it. I mean, now we have added the return features as well, which we would use to capture momentum patterns for each stock. And that's pretty much it with the fourth step, calculating the monthly returns for different time horizons.

S2

Speaker 2

44:42

And we can move to the next step, which is really interesting — adding even more features to our dataset. We are going to download the Fama-French factors and calculate the rolling factor betas for each stock in our data. All right, let's move to it. Okay, so in this step we're going to download the Fama-French factors data and calculate the rolling factor betas for each stock in our current dataset.

S2

Speaker 2

45:09

We want to introduce the Fama-French data to estimate the exposure of our assets to commonly known risk factors, and we're going to do that using a rolling OLS regression model. The 5 Fama-French factors — namely market risk, size, value, profitability, and investment — have been shown empirically to explain asset returns and are commonly used in the asset management industry to assess the risk-return profile of different portfolios. So it makes sense to include them in our current feature dataset. How can we do that?

S2

Speaker 2

45:43

We can use the pandas-datareader package, which we imported as web. We can use this package to download the Fama-French factors. But before that, we might want to take a look at the data. So we can Google it: Fama French factors.

S2

Speaker 2

46:03

And up here you can find the Kenneth French Data Library. Just click on it, and this is the part we are interested in: the Fama/French 5 Research Factors (2x3) — those 5 factors. That's the data we are interested in. If you scroll down a little, you can find the daily data right here. You can download it as a txt or CSV file as well, and you can check the details. They have monthly returns and annual returns. We are interested in the monthly returns, as our data is already at a monthly level, right? So what we can do is call web.DataReader, and up here we have to supply the exact name of the factors file, which I have prepared already.

S2

Speaker 2

46:46

That's the name; then we say 'famafrench', then the start date — we want 2010. This returns a dictionary with two keys: the first one is the monthly factors and the second one is the annual factors. That's awesome.

S2

Speaker 2

47:00

So we want the monthly factors only. And that's the data — we have 164 months, up to August 2023. That's pretty good. However, even though the risk-free return is pretty high right now, I think we don't need the RF column.

S2

Speaker 2

47:17

It is out of the scope of this tutorial, so we'll just drop it: drop('RF', axis=1), and that's perfect. We can assign this to factor_data. That's our factor data.

S2

Speaker 2

47:37

Let's check the index. The index is monthly. All right, so I think we have to fix the index as well, so we can call pandas to_datetime and supply the index. Let's see if this works.

S2

Speaker 2

47:51

Okay, we have to use to_timestamp instead. All right. So now we have fixed the index as well. As you can see, it now has the year, the month, and the beginning-of-month date.

S2

Speaker 2

48:08

However, our data is end of month. That's one thing we have to fix. Another is that the factors are in percentages, so we have to divide them by 100. How can we fix that?

S2

Speaker 2

48:24

We can just resample to monthly and take the last value, which immediately fixes the issue with the beginning-of-month date. Next we can divide by 100, and that's perfect. We just assign that back to factor_data and comment the exploration out.

S2

Speaker 2

48:49

Next we want to fix the name of the index to be just 'date'. And the next step would be to join with the 1-month return. Why would we want to do that? Because at the beginning of each month we have the factors, and then we have the return of each stock at the end of the same month.
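A sketch of the download-and-cleanup steps just described. The actual download needs internet access and pandas-datareader, so it is shown as a comment; a synthetic frame with the same shape (monthly PeriodIndex, factor values quoted in percent) stands in for it so the cleanup can run offline:

```python
import numpy as np
import pandas as pd

# The real download would be:
#   import pandas_datareader.data as web
#   factor_data = web.DataReader("F-F_Research_Data_5_Factors_2x3",
#                                "famafrench", start="2010")[0].drop("RF", axis=1)
# Synthetic stand-in mimicking that output:
periods = pd.period_range("2020-01", "2020-06", freq="M")
np.random.seed(0)
factor_data = pd.DataFrame(np.random.randn(len(periods), 5), index=periods,
                           columns=["Mkt-RF", "SMB", "HML", "RMW", "CMA"])

factor_data.index = factor_data.index.to_timestamp()      # Period -> month start
factor_data = factor_data.resample("M").last().div(100)   # month-end, decimals
factor_data.index.name = "date"
```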

S2

Speaker 2

49:18

So now that we have fixed the date of the factors, we can just join them with the end-of-month return, and then we can regress them and take the beta, right? If the factor is predictive, we have it at the beginning of the month and we regress it against the return at the end of the month, so we get the beta. How can we do that?

S2

Speaker 2

49:38

factor_data.join — and from our data we select the 1-month return column. Let's see what this gives us. That's perfect. Now we can sort the index and just assign it back to factor_data.

S2

Speaker 2

50:01

What we may want to do here is to double check our work. We can select 2 stocks. For example, Apple and let's say Microsoft. And we can double check the return.

S2

Speaker 2

50:20

Yeah, the returns are different, but the factors stay the same. It looks like we've done this correctly, and we are ready to move to the next step. In this step, we are going to filter out stocks that have less than 10 months of data. Why are we doing that? Because we are going to use a rolling window for the regression of around 2 years, 24 months, and stocks that don't have enough data would actually break our function, so we have to remove them from the dataset.

S2

Speaker 2

50:56

How can we do that? We can say factor_data.groupby(level=1) and then just call the size method, and now we can see how many months of data we have for each stock. Okay guys, and I just realized that we made a small mistake somewhere, because we have only 23 months of data for each stock. I had to go back through the code and figure out that in this earlier step, this part was missing: the data.loc selecting all the rows and the dollar volume column. You can just rerun this cell, and you can see that here our first month was in 2015, November, and after that it jumped to 2020. When you add the .loc, selecting all the rows and the dollar volume column, you get the fixed data. Then we can rerun — as you can see, up here it was 2021 as the first date; if we rerun it, it should go much further back into the past. Yeah, so 2017, the 31st of October.

S2

Speaker 2

52:07

And we have to do that again for the factor data too. And if we run now, yeah, now we have 71 months, right? And the idea here is that we remove all the stocks that have less than 10 months of data. How do we do that?

S2

Speaker 2

52:26

We can just assign that to observations, and then we can save valid_stocks, which would be all the stocks that have more than 10 months of data — observations filtered on the count. Yeah, those are our valid stocks, and now we can just use that as a filter. So factor_data would be factor_data again, with factor_data.index.get_level_values — we'll get the ticker values. Maybe I can show you what this returns.

S2

Speaker 2

53:11

This returns, yeah, an object with all the stocks we have in the ticker part of our multi-index, and then we can just say isin(valid_stocks). And this would filter the data. Okay — so I think, oh yeah, I missed this part. It should be applied on our stock index.

S2

Speaker 2

54:01

Those are the stocks we are going to remove. So if we take those out — yes, as you can see, before we had 10,201 rows and now we have 10,150.
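A sketch of this history filter on synthetic data (the column and level names are assumptions): one stock has a full history, the other only 5 months, and the short one is dropped.

```python
import numpy as np
import pandas as pd

# Synthetic (date, ticker) data: AAPL has 15 months, NEWCO only 5.
dates = pd.date_range("2020-01-31", periods=15, freq="M")
idx = pd.MultiIndex.from_tuples(
    [(d, "AAPL") for d in dates] + [(d, "NEWCO") for d in dates[:5]],
    names=["date", "ticker"])
factor_data = pd.DataFrame({"return_1m": np.random.randn(len(idx))}, index=idx)

# Count months per stock, keep only stocks with enough observations,
# then filter the frame on the ticker index level.
observations = factor_data.groupby(level=1).size()
valid_stocks = observations[observations >= 10].index
factor_data = factor_data[
    factor_data.index.get_level_values("ticker").isin(valid_stocks)]
```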

S2

Speaker 2

54:18

We've removed around 51 rows. And yeah, with this step, we are now ready to calculate the rolling factor betas. We would like to do that simultaneously for all the stocks in our factor data, so we can use the same groupby-and-apply methodology. That would be factor_data.groupby, grouping by level 1, by ticker; group_keys should be False, and you can put that into brackets so we can continue on the next row. So, apply(lambda x: ...), and now we want to use the rolling regression — actually, we may want to look up RollingOLS in Python. It takes endog and exog, all right.

S2

Speaker 2

55:28

We have to supply endog and exog. The endog would be our return column, so x['return_1m'], and our exog would be everything else. What we can do here is say x.drop the return_1m column, which gives us all the other columns without the return. And we can add a constant here on the spot. So we add a constant, and I think next we have to supply the window.

S2

Speaker 2

56:12

We decided the window would be 2 years, right? So that would be 24 — 24 months. And there is also a min_nobs parameter.

S2

Speaker 2

56:23

That's the minimum number of observations required to estimate the model; we have to supply this one as well. This one is a little trickier: we have to supply at least the total number of columns plus 1.

S2

Speaker 2

56:38

So that would be len(x.columns) + 1, and this should be our model. Now we have to call fit and then params, and then we can drop the constant, because it will return a constant coefficient since we added a constant up here, right? And I think this should pretty much work, except that sometimes we will not have exactly 24 months of observations but still want to run the regression. So what we can do is use as the window the smaller of the two values: either 24 months or the number of rows we have for the given stock, right?

S2

Speaker 2

57:29

And we know that we only kept stocks with more than 10 months of data. So if one of the stocks has, say, 15 months of data, we just use 15 months as the window instead of 24. And yeah, I think that should pretty much be it, right? Okay — "Series object doesn't have fit".

S2

Speaker 2

57:52

Why? Maybe... Okay, so that's pretty much our rolling factor betas. We have calculated them.

S2

Speaker 2

58:15

Now we can assign them to betas. And that's our rolling factor betas, guys. In the next step, what we want to do is join them to our current features.

S2

Speaker 2

58:33

And with that, we'll have our full features dataset. But before we join them, we have to think about it a little. We now have the rolling factor betas, where we used the factor at the beginning of the month and the return at the end of the month. So this beta we would actually only know in the next month, right? At the end of the month we will be able to run the regressions and have the betas, but we'll have them in the next month.

S2

Speaker 2

59:06

So we cannot just blindly join them to the features data we have so far. What we have to do is shift them one month forward before we join them to the data, because we would not have known these values in the same month. We would know the rolling factor betas for, say, the end of October only in November. So we have to shift them one month forward on the ticker level.

S2

Speaker 2

59:41

So we shift for each ticker, not the whole data frame. If we just say betas.shift(), it will shift everything one row downward, and, for example, the value of Verizon will come up up here, right? Just see that — so Verizon
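A sketch of why the shift must happen per ticker (names are assumptions): a plain `betas.shift()` would move one ticker's last value into the next ticker's first row, while `groupby(level=1).shift()` stays within each stock.

```python
import numpy as np
import pandas as pd

# Tiny synthetic betas frame: 3 months x 2 tickers.
dates = pd.date_range("2020-01-31", periods=3, freq="M")
tickers = ["AAPL", "VZ"]
idx = pd.MultiIndex.from_product([dates, tickers], names=["date", "ticker"])
betas = pd.DataFrame({"Mkt-RF": np.arange(6, dtype=float)}, index=idx)

# Shift one month forward *within* each ticker, so a beta estimated at the
# end of month T is only used as a feature in month T+1.
shifted = betas.groupby(level=1).shift()
```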