1 hour 18 minutes 38 seconds
Speaker 1
00:00:00 - 00:00:36
So we did pre-processing, clustering with 1 sample, integration of multiple samples. We found marker genes, plotted marker genes. I showed you some methods and tricks to label cells. We counted the fraction of cells. We did differential expression between 2 different groups of cells, made some differential expression heat maps, did gene ontology and KEGG enrichment, we did comparisons between genes and different conditions including statistical testing, and we scored cells based on their expression of gene signatures and also plotted that.
Speaker 1
00:00:36 - 00:01:39
So I've basically more or less recreated a lot of the single cell analyses from this Nature paper, or at least given you the tools to do something similar. So this is gonna be a long one, but I think you'll learn a lot. I started off wanting to do a more introductory video for those who have never done single cell analysis before, but I actually ended up doing a really thorough processing and analysis. So I think this video is suitable for even advanced single cell users, but let's go ahead and start working our way through this recent Nature publication. So after the reads have been mapped to the genome and processed through whatever pipeline, all single-cell data really boils down to is a matrix of cells by genes, and the counts just represent the number of unique RNA molecules that map to a given gene from a given cell.
Speaker 1
00:01:39 - 00:02:18
So there are several different formats that your counts data can be in. This is the outs folder of a sample processed by 10x Cell Ranger, and we have 2 different matrix files we can open here. We're of course only interested in the filtered one, because the 10x pipeline filters out all the junk. And then you might also come across the matrix folder instead, which is the same thing but in a slightly different format. They have the 3 files separated, but any analysis software you use, if you just point to this folder, will open them up into one matrix.
Speaker 1
00:02:18 - 00:02:51
So either one you use is fine. In the example data that I downloaded there are actually just raw count CSVs, and this is similar to what I showed you earlier, where we have the cell barcodes and then the gene names. So we're going to be opening up these CSVs directly into Scanpy. The data I'm using for this tutorial came from this Nature publication, a molecular single-cell atlas of lethal COVID-19. So they have 19 samples from fatal COVID-19 lungs and 7 control samples.
Speaker 1
00:02:51 - 00:03:58
And so we're going to be redoing some of these analyses. So I'll do clustering and cell type identification, and we'll also do some cell quantification, and then we'll do some differential expression like they did here, and of course we'll make a heat map, because you can't do RNA-seq analysis without making a heat map, and we'll also do a bunch of other typical single cell analyses that they probably put in their supplement. For the data, just go down to the data availability section. I followed the link to the GEO and then I just downloaded the raw data tar file here. If you're on Windows you'll need to download something like 7-Zip to open this up; on Mac or Linux you should be able to use tar -xf. So we are going to be doing our analysis in a Jupyter notebook using Python. If you don't have access to a Jupyter notebook already, what I recommend is installing Miniconda and creating an environment for your single cell analysis.
Speaker 1
00:03:58 - 00:04:28
There are multiple guides for this online. I even have a quick video for Linux, but if you're using Windows the installation will be a little different; that's not the point of this video, so I'm not going to go into it in depth. Alright, so we're going to be using Scanpy as the workhorse for our data analysis. So if you don't have it, you would install it with pip like this. And once it's installed, you can just import it.
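A minimal sketch of that setup, assuming pip inside your conda environment (the leiden extra just pulls in the clustering dependency used later):

```python
# One-time install, from the command line inside your environment:
#   pip install "scanpy[leiden]"

import scanpy as sc

sc.settings.verbosity = 1   # optional: quieter logging
print(sc.__version__)       # quick sanity check that the import worked
```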
Speaker 1
00:04:28 - 00:04:47
And don't worry, I'll put the notebook on my GitHub, so you can just copy and paste lines of code directly from that. So of course the first thing we need to do is read in our data. So Scanpy has multiple different read functions. If you have an h5 file, you can read it in directly.
Speaker 1
00:04:48 - 00:05:32
If you're pointing to a 10x matrix file, or the matrix folder with the 3 files inside, call this and then point to that folder path. In our case the data I downloaded from that publication is just a CSV, so I just need to point to the path of that CSV. There are 26 different samples, so I'm just gonna start with the first control sample. If you're on Windows your paths will look a little different, with backslashes instead of forward slashes. And then Scanpy requires the columns to be genes and the rows to be cells, so we just have to append a .T to transpose it.
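A rough sketch of those read options; the CSV path below is a hypothetical placeholder, not the actual GEO file name:

```python
import scanpy as sc

# 10x outputs can be read directly:
#   adata = sc.read_10x_h5('outs/filtered_feature_bc_matrix.h5')
#   adata = sc.read_10x_mtx('outs/filtered_feature_bc_matrix/')

# This dataset ships as plain CSVs (genes x cells), so read the CSV
# and transpose it so rows become cells and columns become genes.
adata = sc.read_csv('raw_counts/GSM_control_01.csv').T   # hypothetical path
adata
```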
Speaker 1
00:05:32 - 00:06:03
And then let's just save this as adata. And in a Jupyter notebook, if you call the object as the last line in a cell, it shows that object. So now we have our single cell object with 6,099 cells and 34,000 genes. So there are only 3 components of this AnnData object so far. And that's the observation data frame, which is currently just the cell barcodes, because we haven't added any additional observations.
Speaker 1
00:06:06 - 00:06:39
The var data frame, or variables, which is the genes with no additional information. And then we have our counts saved under capital X, which is just a numpy array, and if we look at the shape it'll just be cells by the number of genes. So we have our single cell data and it's now in the AnnData format. So we're ready to start with the pre-processing. All right, so for the pre-processing we're going to start with doublet removal.
Speaker 1
00:06:40 - 00:07:10
So some people don't do this to their data, maybe because they don't know how, but it is definitely recommended, because when you're making single cell libraries sometimes 2 or more cells can end up in the same droplet. So for this we're going to be using the SOLO model within scvi-tools. So we're going to import scvi, and if you don't have it installed you can install it through pip like this. So we have our AnnData object. This is a single sample.
Speaker 1
00:07:10 - 00:07:36
Don't do this on an integrated sample; you need to do this on individual samples and then you can integrate them afterwards. So all we're going to do is filter down the genes and then train a model to predict whether a cell is a doublet or not. So let's start by narrowing down these 34,000 genes. The first thing we're going to do is only keep genes that are found in at least 10 of the cells. And if we look at adata again, you see we're down to 19,000 now.
Speaker 1
00:07:36 - 00:08:19
And then we're going to go ahead and only keep the 2,000 top variable genes. So I'll go into this a little more later on in the tutorial. But for the doublet removal, just know that I'm keeping the 2,000 genes that more or less describe the data the best, and of course if we look at adata this time we're down to 6,000 cells and only 2,000 genes. And now we want to set up an scvi model so that we can then predict the doublets. So for that we're just doing a very simple model setup using only default parameters, and then we're training this VAE model. So this will take a few minutes, especially if you don't have an NVIDIA GPU.
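A rough sketch of that doublet-detection setup, per sample, with default scvi-tools settings (the seurat_v3 flavor is one reasonable choice here since the counts are still raw; it needs scikit-misc installed):

```python
import scvi   # pip install scvi-tools

# Work on one sample at a time, never on an integrated object.
sc.pp.filter_genes(adata, min_cells=10)                        # drop genes seen in <10 cells
sc.pp.highly_variable_genes(adata, n_top_genes=2000,
                            subset=True, flavor='seurat_v3')   # keep the 2,000 most variable genes

scvi.model.SCVI.setup_anndata(adata)   # default setup, no covariates
vae = scvi.model.SCVI(adata)
vae.train()                            # a minute or two per sample; faster with a GPU
```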
Speaker 1
00:08:20 - 00:08:52
It'll take a little longer, but in my case it looks like this one sample is going to take about a minute and a half. So after the scvi model is trained, we can go ahead and train the SOLO model, which predicts doublets, and we just pass the VAE model we just trained. So once that's done we can get the predictions for whether a cell is a doublet or not. So you see we can return a data frame like that, where we have the cell barcode and the score for doublet and singlet.
Speaker 1
00:08:52 - 00:09:24
The higher of the scores is going to be the prediction, and then I'm also going to pass soft equals False to get a new column that is the predicted label. So now we have the values and the prediction. There's one final thing I want to do, which is to remove all these dash zeros that scvi adds on to the barcodes. And you'll see why I do that in just a minute. So for every index, or barcode, I'm just getting rid of the last 2 characters so we no longer have that dash 0.
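A sketch of that step, assuming the trained vae from above:

```python
# Train the doublet classifier on top of the trained SCVI model.
solo = scvi.external.SOLO.from_scvi_model(vae)
solo.train()

df = solo.predict()                          # per-cell doublet / singlet scores
df['prediction'] = solo.predict(soft=False)  # hard label: 'doublet' or 'singlet'

# SOLO appends a "-0" suffix to the barcodes; strip it so they
# match the original AnnData index again.
df.index = df.index.map(lambda x: x[:-2])
df
```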
Speaker 1
00:09:24 - 00:10:02
Then let's go ahead and count how many doublets and singlets we have. So we have around 1,200 doublets and 4,900 singlets. So about 20% of these data were labeled as doublets. So 20% is pretty high, but as you can see, the values range; for example, for this doublet the difference between the doublet and singlet scores is only about 1, and for this one it's closer to 4. So the doublet prediction here was much higher than the doublet prediction here.
Speaker 1
00:10:02 - 00:10:51
And in this case, I'm not sure I want to throw away 20% of the data, although you could, and it's very reasonable to do so. But I'm going to add a new column in this data frame, which is the difference between these 2 columns. So now we have the difference. I'm going to go ahead and plot that distribution just so we can look at it, and I'm going to be using another package called Seaborn. So I'm going to plot the distribution of this data frame, only this diff column I just created, and only from the cells that were predicted doublet. So you see we have a lot of cells predicted doublet where the doublet score was only marginally higher than the singlet score.
Speaker 1
00:10:51 - 00:11:27
So what I'm going to do is filter out the ones below 1. One is kind of an arbitrary selection. I'm going to make a new data frame called doublets, which is only the cells predicted doublet and with the difference above 1, so everything to the right of the 1 here on this distribution. So now we have 460 cells that were predicted doublets with high certainty. And since we have the barcodes, we can filter our adata object on that.
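Something like this, with 1 being the arbitrary cutoff just mentioned:

```python
import seaborn as sns

df.groupby('prediction').count()    # roughly 1,200 doublets vs 4,900 singlets here

# How confident was each doublet call? Look at the score difference.
df['dif'] = df.doublet - df.singlet
sns.displot(df[df.prediction == 'doublet'], x='dif')

# Keep only the confidently called doublets.
doublets = df[(df.prediction == 'doublet') & (df.dif > 1)]
doublets
```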
Speaker 1
00:11:29 - 00:12:06
So if we look at adata, you see we've already done a lot of processing to it just to do the doublet removal in the first place; I only have 2,000 genes, for example. So what I'm going to do is just reload the adata object using the same read function we used earlier. So we have our fresh adata, and if we look at obs it's just the list of barcodes. So we have this list of barcodes, and we have this other list of barcodes that are known doublets. So I'm going to add a new column in this observation data frame.
Speaker 1
00:12:06 - 00:12:31
I'm just going to call it doublet, and it's going to be true or false. So, is this barcode in this data frame here? Does it match one of the barcodes in this data frame's index? We'll run that, and now if we look at adata.obs again we just have true or false. So if it is a doublet it'll be labeled true.
Speaker 1
00:12:32 - 00:13:09
So let's go ahead and filter out all the cells that were labeled true and keep all the cells that were labeled false. So we can do that by filtering the adata on the observation column doublet. So now if we look at adata, instead of 6,099 cells we now have 5,600 cells. So now we have our fresh raw adata object with the doublets removed. All right, so now that we have that adata object with the doublets removed, we can do the typical pre-processing workflow.
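In code, roughly (same hypothetical CSV path as before):

```python
# Reload the untouched counts for this sample.
adata = sc.read_csv('raw_counts/GSM_control_01.csv').T   # hypothetical path

# Flag every barcode that shows up in the confident-doublet table, then drop them.
adata.obs['doublet'] = adata.obs.index.isin(doublets.index)
adata = adata[~adata.obs.doublet]
adata
```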
Speaker 1
00:13:09 - 00:13:35
So the first thing we want to do is label the genes that are mitochondrial genes. So remember, adata.var has all the gene names, and mitochondrial genes are typically annotated like this: the gene will start with MT dash. If it's mouse, sometimes it'll be lowercase mt, so just double check.
Speaker 1
00:13:36 - 00:14:25
But in this case it's human, which is almost always capital MT, so we can filter adata.var and find all the values that start with MT dash. So here are our 13 mitochondrial genes. I've come across some cases where they're not annotated with MT, but you know there are only 13 mitochondrial genes, so you can just find a list and then annotate them based on that list, like I will do with ribosomal genes in just a minute. But anyways, instead of just looking at these, let's actually annotate them. So I'm going to add true or false to every one of those 34,000 genes depending on whether they start with MT. So if we look at var now, we have this lowercase mt column that just has true or false.
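For human data that's just a string match on the gene names:

```python
# Human mitochondrial genes are prefixed with "MT-" (mouse data is usually "mt-").
adata.var['mt'] = adata.var.index.str.startswith('MT-')
adata.var[adata.var.mt]   # should list the 13 mitochondrial genes
```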
Speaker 1
00:14:25 - 00:14:54
And then now let's do the same thing for ribosomal genes, and that's a little more tricky. I'm going to be using a list of known ribosomal genes from the Broad Institute, which we can import directly into Pandas. So we've been using Pandas data frames through Scanpy, but let's explicitly import it now. And then I'm passing this ribo URL, which is this long URL, to get this data frame of ribosomal genes.
Speaker 1
00:14:54 - 00:15:31
Again, you can just copy this right from my GitHub. And then using that URL, we can read it directly with pandas, and we're just gonna skip the first 2 rows, which is junk. And then if we look at this once we import it, we just have 88 rows, or 88 ribosomal genes. So we have this list, which we can call like this to get just a list of these genes. Just like annotating mitochondrial genes, we're going to do the same thing here. So for each gene name in our var data frame: is it in this list here?
Speaker 1
00:15:31 - 00:16:09
So now if we look at adata.var, we have true or false for this ribo column. So we have this var data frame, and just to remind you, our observation data frame is basically empty except for this doublet column, which should all be false now. And we're going to go ahead and calculate QC metrics. And our qc_vars correspond to those columns that we made with the true or false labels. So if we run this, now if we look at the var data frame again, we have these statistics for each of the genes.
Speaker 1
00:16:09 - 00:16:56
You see the percent dropout is very high for a majority of the genes. So this gene was only in 4 cells, this gene was only in 1 cell, which we'll get to in a moment. And then if we look at the observation data frame, we now have these stats for each cell, which include the mitochondrial counts, the percent of mitochondrial reads, and the percent of ribosomal reads for that given cell, along with the number of genes that were positive in that cell, meaning the number of genes that had any counts, and then the total number of UMIs. So let's go ahead and look at this a little more in depth here. I'm going to sort by the number of cells that each gene was found in.
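Roughly like this; the ribosomal gene set URL is a placeholder here, so substitute the actual Broad/MSigDB link from the notebook:

```python
import pandas as pd

ribo_url = 'https://example.org/KEGG_RIBOSOME.txt'             # placeholder URL
ribo_genes = pd.read_table(ribo_url, skiprows=2, header=None)  # first 2 rows are header junk
adata.var['ribo'] = adata.var_names.isin(ribo_genes[0].values)

# Per-cell and per-gene QC stats, including % mitochondrial and % ribosomal counts.
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt', 'ribo'],
                           percent_top=None, log1p=False, inplace=True)
adata.obs[['n_genes_by_counts', 'total_counts',
           'pct_counts_mt', 'pct_counts_ribo']].head()
```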
Speaker 1
00:16:59 - 00:17:38
So again you see that some genes were in almost every cell, and then a lot of the genes were in 0 cells. Let me just bump this down. So we're gonna go ahead and filter out the genes that weren't in at least 3 cells. So if we run this chunk again, you see that every gene now has at least 3 cells it was found in, and instead of 34,000 genes we now have 24,000. And then typically here we would also filter on counts.
Speaker 1
00:17:38 - 00:18:35
Let's look at the observation data frame, specifically total counts. So if we sort by total counts, we see that the lowest is 401. So what likely happened is that the authors of this publication, when they processed these samples, probably got rid of any cell that had 400 counts or fewer. So they already did that, and we're not going to filter on a lower counts threshold, but if they hadn't, we would run something like this. We'd also filter the cells, in this case with a minimum of 200 genes. So they filtered on counts, but let's see if there are any cells with fewer than 200 genes. There are none, the lowest is 276, so we're not going to filter on it, but if your data wasn't filtered, 200 could be a place to start. Again, the exact cutoff is slightly arbitrary and depends on your data.
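The corresponding filters, with the thresholds just discussed:

```python
adata.var.sort_values('n_cells_by_counts')    # many genes are detected in only 0-4 cells

sc.pp.filter_genes(adata, min_cells=3)        # drop genes found in fewer than 3 cells
sc.pp.filter_cells(adata, min_genes=200)      # a no-op here (lowest cell has 276 genes), but a sane default

adata.obs.sort_values('total_counts').head()  # lowest is ~401, so a counts filter was already applied upstream
```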
Speaker 1
00:18:36 - 00:19:09
So let's go ahead and plot some of these QC metrics. So basically what we want to do is use these QC metrics to get rid of outliers. So if a cell has significantly more genes than the average, there's a chance that it's some artifact. Likewise with counts, but these two are very highly correlated, so we can filter on one. And then if there's a high mitochondrial percentage, it could be a sequencing artifact, or the cell could be dying.
Speaker 1
00:19:09 - 00:19:51
So in this case I don't see any above 10%. Usually people set a mitochondrial filter anywhere from 5 to 20 percent, and then with the ribosomal reads we see that most are around 0%, but there are a few scattered outliers up here, so let's just filter our adata object and get rid of some of these outliers. So we're starting with 5,639 cells; let's see what we end up with. So I'm going to use an additional package here, numpy, to get the 98th percentile value and then filter on the number of genes per cell instead of just picking a number.
Speaker 1
00:19:51 - 00:20:15
You could just pick something like 3,000 here, but I like to be a little more objective, so I'm picking a value that represents the 98th percentile. If you just wanted to pick, you could use a line of code like this. Let's see. So the 98th percentile ended up being 2,300, somewhere around here.
Speaker 1
00:20:17 - 00:20:53
If you remember what the obs data frame looked like, we had the number of genes, so we're gonna filter out all the cells that are above 2,305. So now we have 5,526 cells, and let's filter on mitochondrial and ribosomal. It doesn't look like I have to filter out any because they're all less than 10%, but I'll show the line of code anyway because you'll almost always use it. In this case I'm just gonna pick 20%. I like 20%; sometimes I use less depending on my data set.
Speaker 1
00:20:53 - 00:21:12
In this case we're gonna run it, but it's not gonna get rid of any cells. Then I'm just gonna get rid of the extreme outliers for the ribosomal percentage; I'm gonna pick 2. There's only a handful of cells actually above 2. Instead, I'm going to be regressing out differences in the data based on ribosomal reads later.
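A sketch of those filters (the 98th percentile, 20% mito, and 2% ribo cutoffs are the somewhat arbitrary choices just discussed):

```python
import numpy as np

sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts',
                     'pct_counts_mt', 'pct_counts_ribo'],
             jitter=0.4, multi_panel=True)

# Cap the number of detected genes at the 98th percentile instead of eyeballing a cutoff.
upper_lim = np.quantile(adata.obs.n_genes_by_counts.values, 0.98)
adata = adata[adata.obs.n_genes_by_counts < upper_lim]

adata = adata[adata.obs.pct_counts_mt < 20]    # mitochondrial cutoff; 5-20% is typical
adata = adata[adata.obs.pct_counts_ribo < 2]   # drop the extreme ribosomal outliers
adata
```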
Speaker 1
00:21:12 - 00:21:43
So I'll be correcting for them. So now we have our adata object that's been filtered of doublets and also of outliers. So now that the data is clean, we can start clustering and the eventual analysis. So normalization is an important step, because in single cell sequencing there's a lot of variation between cells, even between cells of the same type, just because of sequencing biases, etc. So we need to do normalization so we can compare cells and compare genes.
Speaker 1
00:21:44 - 00:22:57
And if we look at the adata counts table, each row is a cell. So just to show you, let's take the sum of each cell. Alright, so this first cell had 5,000 total counts; these cells only had around 400. So the first thing we're going to do is normalize the counts in each cell so that their total counts add up to the same value. We can do that with the normalize total function, and now if we look at the sum of each cell you see they all add up to 10,000, so each value got modified based on the starting number of counts for that cell. And then we're just going to convert those to log counts, and of course if we run this again they're not going to all be the same, because it's not a linear transformation, but they're still all comparable, or at least more comparable than before. And finally, this is an important step: you want to freeze the data as it is now, before we start filtering based on variable genes, regressing out data, and scaling data.
Speaker 1
00:22:57 - 00:23:15
So let's save the raw data. We're just saving the AnnData object as it is now into the raw slot. A lot of the functions for differential expression, etc. will actually use the raw data. Alright, so we can start the clustering process now.
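A minimal sketch of the normalization and the freeze:

```python
adata.X.sum(axis=1)                            # raw library sizes vary a lot between cells

sc.pp.normalize_total(adata, target_sum=1e4)   # every cell now sums to 10,000 counts
sc.pp.log1p(adata)                             # log-transform: log(1 + x)

adata.raw = adata   # freeze the log-normalized data; many downstream functions read from .raw
```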
Speaker 1
00:23:15 - 00:23:45
First thing we're going to do is find the 2,000 most variable genes. I know I kind of glossed over this earlier, but if we look at the var data frame now, we see that we've added additional columns. So is it highly variable, true or false? And also some statistics such as its mean and dispersion. If we were to look at those on a graph, we can see that the genes with higher dispersion were marked as variable genes.
Speaker 1
00:23:45 - 00:24:37
So this is a way to reduce the number of dimensions of the data set: instead of having 24,000 genes, we just did a tenfold reduction in the number of dimensions. So anyway, we can filter out the non-highly-variable genes, and none of this touches the raw data that we saved earlier, but if we look at adata now, we only have 2,000 genes, which are the highly variable genes. And now we're going to regress out the differences that arise due to the total number of counts, mitochondrial counts, and ribosomal counts. So this will get rid of some of the variation in the data that is due to processing, sample quality, sequencing artifacts, et cetera. And then we're gonna scale each gene to unit variance.
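In code, roughly:

```python
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
sc.pl.highly_variable_genes(adata)            # dispersion vs mean, variable genes highlighted
adata = adata[:, adata.var.highly_variable]   # keep only the 2,000 variable genes (.raw is untouched)

# Regress out technical covariates, then scale each gene to unit variance.
sc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt', 'pct_counts_ribo'])
sc.pp.scale(adata, max_value=10)
```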
Speaker 1
00:24:38 - 00:25:33
And now we're going to run principal component analysis to further reduce the dimensions of the data into just 30 or so principal components. By default this calculates 50 PCs, so let's plot how much these PCs actually contribute to the data. For lung data there are a bunch of different cell types, so I expect more significant PCs than in a data set that maybe only had a few cell types. But basically what we want to do is find the elbow of this plot, or the point where you don't really see a big difference as you increase the PC number. So here, around 30, you see that it really starts to flatten out, so 30 is probably a good number to pick. In actuality it's not going to make a big difference whether you pick 25 or 40, but we're just gonna go ahead with 30.
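Roughly:

```python
sc.tl.pca(adata, svd_solver='arpack')                # 50 PCs by default
sc.pl.pca_variance_ratio(adata, log=True, n_pcs=50)  # look for the elbow, here around 30 PCs
```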
Speaker 1
00:25:33 - 00:26:14
We're gonna go ahead and calculate the neighbors of the cells using the top 30 PCs, which we just selected. So let me show you what the neighbors are. If we look at the adata, you see we now have this obsp slot, which has distances and connectivities. If we look at connectivities, we see it's a 5,500 by 5,500 matrix, so it's a cell by cell matrix, and every pair of connected cells gets a value. And then if we look at the distances, each pair of connected cells also gets a distance. So these neighborhood matrices are what's going to be used to do the clustering.
Speaker 1
00:26:16 - 00:26:46
So we're going to go ahead and use UMAP to project the data from these 30 dimensions into 2 dimensions so that we can look at it. So now we can plot the UMAP. Each point is a single cell, but they haven't been assigned to clusters yet, so it's all just one color. To actually assign clusters we need to run the Leiden algorithm, so you might need to install that separately if you don't have it. And then we're gonna run the Leiden algorithm with a resolution of 0.5.
Speaker 1
00:26:46 - 00:27:12
This is something you'll adjust depending on your data; I just like to start with 0.5 and see how it turns out. So if we look at the adata.obs data frame after running the Leiden algorithm, all it did was add a new column with a Leiden label. So now let's replot that UMAP and color the cells based on this label. So now we have our first sample clustered.
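Put together, roughly:

```python
sc.pp.neighbors(adata, n_pcs=30)      # kNN graph on the top 30 PCs
sc.tl.umap(adata)                     # 2D embedding for visualization
sc.tl.leiden(adata, resolution=0.5)   # graph clustering; tune the resolution per dataset
sc.pl.umap(adata, color=['leiden'])
```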
Speaker 1
00:27:12 - 00:27:41
What I'm going to do next is integration of multiple samples. If you only have one sample, you can skip the integration and just pick up with finding markers later in the video. But a lot of the time you have more than one sample, and in my example tutorial case I have 26 samples. So I need to integrate them all into one AnnData object and adjust for differences based on batch.
Speaker 1
00:27:42 - 00:28:49
So the first thing I want to do for integration: since I have 26 samples, I don't want to have to write this code out for each individual sample, so I wrote a function that does almost exactly what we did above for each individual sample. I know it looks like a lot, but I basically just copied and pasted from what we already did. Here I predict the doublets, then I re-read the adata. The only thing I pass to this function is going to be the CSV path, for example the same CSV path we used earlier, and I'm also adding a new column in the observation data frame called sample. I'm just taking this CSV path, splitting it on the underscore, and then returning the second item, so I'll get the sample identifier from the path and save it in the sample column. I'm just filtering out the doublets, and then here there's one small difference from what we did earlier: I'm including a minimum genes threshold just in case some samples had fewer, and I also got rid of the gene filtering.
Speaker 1
00:28:49 - 00:29:24
I'm not filtering any genes out of any of the samples for now, until I combine them later on. But the rest is the same: calculating the mitochondrial statistics, the ribosomal statistics, and then filtering out the outliers. And then I'm just returning the now-filtered adata object. To loop over my 26 samples I'm going to use os, so I'm going to import os and list the directory. It just returns a string for every file in this raw counts folder that I saved them all in.
Speaker 1
00:29:24 - 00:30:16
So I'm just gonna do 'for file in' this list, and then I'm gonna run this pp function. And then we have to append the result to a list, which I'll just call out, and the CSV path is going to be the raw counts folder plus our file name. Of course this is all going to be a little different depending on your data and how you have it set up, but the idea is the same: if you have multiple samples, you don't want to have to write out hundreds of lines of code. So we'll go ahead and run this, and in the end we'll have a list of AnnData objects. Since I have 26 samples, it's gonna take me about an hour, probably.
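A condensed sketch of that per-sample function and the loop; the folder name, the file-naming convention, and the split on '_' for the sample label are all specific to how these files were saved, so adjust them to your own paths (ribo_genes is the ribosomal gene list loaded earlier):

```python
import os
import numpy as np
import scvi

def pp(csv_path):
    """Per-sample preprocessing: doublet removal plus QC filtering."""
    # --- doublet prediction on a temporary copy of the sample ---
    adata = sc.read_csv(csv_path).T
    sc.pp.filter_genes(adata, min_cells=10)
    sc.pp.highly_variable_genes(adata, n_top_genes=2000, subset=True, flavor='seurat_v3')
    scvi.model.SCVI.setup_anndata(adata)
    vae = scvi.model.SCVI(adata)
    vae.train()
    solo = scvi.external.SOLO.from_scvi_model(vae)
    solo.train()
    df = solo.predict()
    df['prediction'] = solo.predict(soft=False)
    df.index = df.index.map(lambda x: x[:-2])            # strip the "-0" suffix
    df['dif'] = df.doublet - df.singlet
    doublets = df[(df.prediction == 'doublet') & (df.dif > 1)]

    # --- reload the raw counts, drop doublets, run the QC filtering ---
    adata = sc.read_csv(csv_path).T
    adata.obs['sample'] = csv_path.split('_')[1]          # sample label pulled from the filename
    adata.obs['doublet'] = adata.obs.index.isin(doublets.index)
    adata = adata[~adata.obs.doublet]
    sc.pp.filter_cells(adata, min_genes=200)              # minimum-genes threshold; no gene filtering yet
    adata.var['mt'] = adata.var.index.str.startswith('MT-')
    adata.var['ribo'] = adata.var_names.isin(ribo_genes[0].values)
    sc.pp.calculate_qc_metrics(adata, qc_vars=['mt', 'ribo'],
                               percent_top=None, log1p=False, inplace=True)
    upper_lim = np.quantile(adata.obs.n_genes_by_counts.values, 0.98)
    adata = adata[adata.obs.n_genes_by_counts < upper_lim]
    adata = adata[adata.obs.pct_counts_mt < 20]
    adata = adata[adata.obs.pct_counts_ribo < 2]
    return adata

out = []
for file in os.listdir('raw_counts'):                     # hypothetical folder of per-sample CSVs
    out.append(pp('raw_counts/' + file))
```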
Speaker 1
00:30:18 - 00:30:40
Alright, so let's check the output now. If you look at any individual object in the out list, it should be an AnnData object, so it looks like it worked correctly. Let's combine all of the objects within the out list into one AnnData object. We should just be able to do sc.concat and pass out.
Speaker 1
00:30:42 - 00:31:12
So now let's look at this. We have around 99,000 cells and we still have our 34,000 genes. And then if we look at the obs, here's that sample column we added, with the respective sample names and the other data that was added when we did the pre-processing. So I want to do 2 things, and then I'm going to save this just in case. I'm going to get rid of the genes that aren't in any of the cells, or are only in a small number of cells.
Speaker 1
00:31:12 - 00:31:45
So I'm just going to keep the genes that are in at least 10 cells. So we went from 34,000 to 29,000. And then if we look at the X, we see that it's a dense matrix, so it takes a lot more memory than a sparse matrix, dense meaning that it stores an actual value for every entry. So I'm going to convert X to a sparse matrix to reduce the size of the file and the amount of memory it takes to use.
Speaker 1
00:31:45 - 00:32:29
Theoretically I could have done it on each individual sample before I concatenated them, inside the function, but I have enough memory to do it this way. So this step isn't necessary if you only have a few samples, but since I have 26 I'm gonna go ahead and do it. So I need to import this csr_matrix from SciPy, so you might have to install SciPy if you don't have it. We're going to take adata.X and overwrite it with a CSR matrix of adata.X, and now if we look at adata.X, it's this sparse matrix instead of that dense matrix. So I'm just going to go ahead and save it now.
Speaker 1
00:32:30 - 00:33:23
I'm going to save it as an h5ad, which is the Scanpy AnnData format. All right, so it's a good thing I saved that, because I took a little break; I just reloaded my modules, the same ones we used before, and re-read that AnnData object. So if we want to look at how many cells we have for each sample, we can do a group-by count, and we see that we have anywhere from 1,300 to around 6,000, and we still have 29,000 genes. I'm gonna go ahead and use a slightly stricter threshold for filtering out genes, just to narrow this number down a little bit. I'm gonna get rid of the genes that aren't in at least 100 cells, because now we have 90,000 cells.
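A sketch of combining, converting, and saving (the h5ad file name is a hypothetical placeholder):

```python
from scipy.sparse import csr_matrix

adata = sc.concat(out)                     # stack all per-sample AnnData objects
sc.pp.filter_genes(adata, min_cells=10)    # drop genes seen in almost no cells across samples
adata.X = csr_matrix(adata.X)              # dense -> sparse to save memory and disk space
adata.write_h5ad('combined.h5ad')          # hypothetical output name

# After reloading in a fresh session:
adata = sc.read_h5ad('combined.h5ad')
adata.obs.groupby('sample').count()        # roughly 1,300-6,000 cells per sample
sc.pp.filter_genes(adata, min_cells=100)   # stricter gene filter now that there are ~90,000 cells
```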
Speaker 1
00:33:23 - 00:33:51
If you only have a couple of samples you could lower this number. So now we have 20,000 genes, and now I'm going to do the actual integration. So the data is combined, but we need to correct for batch effects and other things like mitochondrial counts and ribosomal counts. So I'm going to use scvi to do the integration. This isn't the only way to do integration with a Scanpy AnnData object, but I've found it to be one of my favorites.
Speaker 1
00:33:51 - 00:34:39
So to prep this adata for integration with scvi, we first need to save the raw data as it is now. This data hasn't been normalized, it hasn't been converted to log or anything. We're just saving the raw counts into this layer called counts, and we won't touch it again except to read from it. The next thing we can do is the normal normalization, like we did earlier for the single sample: we're just going to normalize the counts to 10,000 in every cell, convert to log, and then save this log-normalized data into the raw slot. So scvi will use this counts layer, but a lot of other modules and functions will use the raw data.
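Roughly:

```python
adata.layers['counts'] = adata.X.copy()        # untouched counts, kept for scvi

sc.pp.normalize_total(adata, target_sum=1e4)   # normalize each cell to 10,000 counts
sc.pp.log1p(adata)
adata.raw = adata                              # log-normalized data for plotting and DE functions
```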
Speaker 1
00:34:40 - 00:35:11
So just to remind everybody what we have, we have this adata object, and we're going to be using 4 different annotations here to correct our data. The sample: there are 26 different samples, all with their own unique label in this column. And then we're also going to use total counts, mitochondrial counts, and ribosomal counts, or rather the percent counts. So one important point here depends on how many cells you have; my number of cells is much higher than my number of genes.
Speaker 1
00:35:12 - 00:35:32
So for scvi, I don't have to filter out any more genes. But let's say you had up to 20,000 cells. You'd want the number of genes to be no more than about half the number of cells. So here you would do something like this: we find the highly variable genes, only the top 3,000.
Speaker 1
00:35:32 - 00:35:52
You can pick more if you had more cells, but if you had 10,000 cells, 3,000 would be good. Make sure to point at that counts layer. I'm not gonna run this, but if you were to run it on your sample, those 20,000 genes would be reduced to 3,000. So I'm just going to comment this out. Instead I'm going to go ahead and set up my scvi model.
Speaker 1
00:35:52 - 00:36:35
So we did this with completely default settings earlier, when we removed doublets for each individual sample, but now we're going to build an actual model for this combined data set. So for this scvi setup of the AnnData, we're passing our adata, pointing at the counts layer, and our categorical covariate key is going to be sample. So if you had multiple batches from different runs you could add another variable here. If you had, let's say, samples from different sequencing technologies you could add another one. But we only have the sample category, because these were all done in the same study using the same technology.
Speaker 1
00:36:35 - 00:36:54
And then for our continuous covariate keys we're just passing percent counts mitochondrial, total counts, and percent counts ribosomal, which we calculated earlier. So I'll just go ahead and run that. I must have been messing with the source code at some point; that warning shouldn't come up when you do it. But we have the AnnData set up now.
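A sketch of that setup (the optional highly-variable-gene step is shown commented out, as discussed above):

```python
# Optional: with fewer cells, reduce to ~3,000 variable genes first, computed on the counts layer.
# sc.pp.highly_variable_genes(adata, n_top_genes=3000, subset=True,
#                             layer='counts', flavor='seurat_v3')

scvi.model.SCVI.setup_anndata(
    adata,
    layer='counts',
    categorical_covariate_keys=['sample'],
    continuous_covariate_keys=['pct_counts_mt', 'total_counts', 'pct_counts_ribo'],
)
```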
Speaker 1
00:36:54 - 00:37:16
And we're going to go ahead and initialize that model. And now we're going to train the model. So this is going to take probably about 30 minutes or so just because I have so many cells. And again if you don't have a GPU this could end up taking you a while if you have a lot of cells. But it's definitely worth it, you know, just queue it up, go eat lunch, or queue it up before bed.
Speaker 1
00:37:17 - 00:37:44
All right, so once the model's trained we want to get a couple of things from it. First we can use get_latent_representation, and if we look at it, it's just a numpy array where the number of rows is the number of cells and there are 10 columns. So this is the scvi latent representation of our data, and we're going to be using it for clustering and UMAP. So let's go ahead and save it to the adata object.
Speaker 1
00:37:45 - 00:38:15
We're going to save it under obsm, and I'm just calling it X_scVI. And we can also get the scvi normalized expression, which is a cell by gene data frame. We're going to save this as a layer instead of overwriting the default raw values, so we're saving it as scvi_normalized, another layer in the adata object. And of course, now that we have these saved, you can call them back at any time, and we'll be using both of them.
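Roughly:

```python
model = scvi.model.SCVI(adata)
model.train()   # ~30 minutes here; much slower without a GPU

adata.obsm['X_scVI'] = model.get_latent_representation()   # cells x 10 latent dimensions
adata.layers['scvi_normalized'] = model.get_normalized_expression(library_size=1e4)
```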
Speaker 1
00:38:15 - 00:38:56
So let's go ahead and do the clustering now. We're going to use the Scanpy neighbors function just like earlier, but this time we're passing use_rep with X_scVI, so we're using the latent representation from scvi to calculate the neighbors. Once we calculate the neighbors, we're just going to run UMAP, and since the neighbors were calculated using X_scVI, we don't have to specify anything here. And then again, likewise with Leiden, we're just using default settings except for a resolution of 0.5, which we might change later, and since we have 90,000 cells this might take a minute or 2.
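Roughly:

```python
sc.pp.neighbors(adata, use_rep='X_scVI')   # neighbors from the scvi latent space, not PCA
sc.tl.umap(adata)
sc.tl.leiden(adata, resolution=0.5)
sc.pl.umap(adata, color=['leiden', 'sample'], frameon=False)
```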
Speaker 1
00:38:57 - 00:39:27
So once that's done, let's go ahead and plot the UMAP. We're gonna plot 2 UMAPs, one with the clusters labeled and one with the samples labeled, so we can see how the integration went. So you see we have nice clustering with pretty distinct subpopulations, and from the looks of it a very good integration of the data. We have a bunch of different colors in all the different clusters. This cluster here seems to be specific to this sample.
Speaker 1
00:39:27 - 00:39:59
I'm not sure what happened here. It's possible that there was an additional cell type collected in that sample that wasn't in the others. I guess we'll have to figure it out when we actually label the cell types, but if we had used some other integration method, like reference mapping, we might have missed these unique populations. So I'm gonna go ahead and write the integrated adata object, and now we're ready to find markers and label cell types. So we have 20 Leiden clusters right now.
Speaker 1
00:39:59 - 00:40:29
I might change the resolution; I just like to start with 0.5, but let's find the markers for these 20 clusters to start with and then go from there. So I'm going to be using both the raw data that wasn't integrated with scvi and the scvi-trained model data. If you only had one sample, or you didn't train the model with scvi, which is perfectly fine if you had one sample, just ignore anything you see with scvi in it, like this for example.
Speaker 1
00:40:29 - 00:40:48
This is just base Scanpy. We're gonna get the marker genes based on the Leiden clusters and the raw data that we saved. It throws a lot of warnings; I'm just going to hide some of those. This adds the results to the AnnData object, which we can plot with this function here.
Speaker 1
00:40:48 - 00:41:33
So I'm going to plot the marker genes, or the rank genes groups, and just show the top 20 for each cluster. This may be a little hard to see in the video, but each of these is just a different gene symbol, and the higher and further to the left, the more significant. It's big, though, so unless I really want to look at it I'm gonna hide it for now. Instead we can use the Scanpy get rank_genes_groups_df function to actually get a data frame where the group is the Leiden cluster and we have the gene symbol, log fold changes, and the p-values. So I'm just going to save this as markers, and then I'm also going to keep only the rows with an adjusted p-value of less than 0.05 and a log fold change of at least 0.5.
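In code, roughly:

```python
# Scanpy marker genes on the log-normalized data in .raw, per Leiden cluster.
sc.tl.rank_genes_groups(adata, 'leiden')
sc.pl.rank_genes_groups(adata, n_genes=20, sharey=False)

markers = sc.get.rank_genes_groups_df(adata, None)   # all clusters in one data frame
markers = markers[(markers.pvals_adj < 0.05) & (markers.logfoldchanges > 0.5)]
```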
Speaker 1
00:41:35 - 00:42:01
And then we can use our scvi model. We're calling the differential expression function from our model, again grouping by Leiden, and this will take a minute or 2. So now that this is finished, we can filter it like we did with the other data frame. It has an is_de_fdr 0.05 column, so we're only going to keep the rows that are true in this column, and then filter on the log fold change mean as well.
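Roughly like this, using the column names that scvi-tools' differential_expression output provides:

```python
markers_scvi = model.differential_expression(groupby='leiden')   # one-vs-rest per cluster
markers_scvi = markers_scvi[(markers_scvi['is_de_fdr_0.05']) &
                            (markers_scvi.lfc_mean > 0.5)]
```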
Speaker 1
00:42:02 - 00:42:45
So that'll look something like this. Alright, so now we've calculated our marker genes; let's start figuring out what some of these clusters are. I'm gonna go ahead and plot the UMAP, but with the legend on it so I can see which number corresponds to which cluster. We're starting out with 20 clusters, so I'm gonna make a dictionary that goes from 0 to 20 and then just manually fill it out as I decide what each cluster is. There are other ways to automatically label clusters, but I find that doing it manually, while it takes longer, is more accurate, especially when you don't have a good reference data set to map it to.
Speaker 1
00:42:45 - 00:43:17
So I'm not going to show every single cluster, but I'm going to show a handful so you get the idea of how to do it. I'm going to start by just printing 20 different dictionary entries so that it's easier to fill out. So I'm just going to copy this. Instead of typing that, oops, should have gone to 21. So this is going to be our dictionary, and when we've decided what cluster 0 is, for example, we can put the cell name here, and then we'll use this to map the cell labels once we're done.
Speaker 1
00:43:18 - 00:44:02
Just gonna bump that up and get a little working space here. I find it's easiest to start with cell types you know are going to be in the data. Almost every data set, unless it's been filtered out or they've gone to great lengths to remove them, is going to have some blood cells in it, and I also like to use the blood cells as an indicator of whether I need to reduce or increase the clustering resolution. So I'm going to start with CD45, whose gene symbol is PTPRC, then CD3 for T cells, and then CD4 for CD4 positive T cells. So I'd like to start with CD4 and CD8 positive T cells.
Speaker 1
00:44:02 - 00:44:35
In this case I'm using the scvi normalized layer; we can go back and forth if we want to use the raw data. And in case you don't know immune cell markers, you can just Google it and find a chart like this. Just note that these protein symbols might not correspond to the gene symbols, so sometimes you have to translate between the 2, which is usually just a Google search away. So it's pretty clear that this is our T cell cluster, and within the T cell cluster we see that CD4 is predominantly in this bottom portion.
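For example (CD3E stands in for CD3, CD8A is the CD8 marker used next, and the vmax cap is optional; all are illustrative choices):

```python
sc.pl.umap(adata, color=['PTPRC', 'CD3E', 'CD4', 'CD8A'],
           layer='scvi_normalized', frameon=False, vmax=5)
```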
Speaker 1
00:44:35 - 00:45:15
Let's see what happens when we run CD8 instead, which is CD8A, and we do see that CD8 is predominantly in this section up here. So CD4 was down here, and if we look at the clusters this is all in cluster 2, so that means I need to increase the resolution so that we split the CD4 and CD8 positive T cells. That's just the benchmark I like to use when clustering; it doesn't always have to be the case. We could just label cluster 2 as T cells, which would be easier for us, because when we increase the resolution we're gonna have to label more clusters.
Speaker 1
00:45:16 - 00:45:35
But anyways, I'm just gonna scroll up to where we set the resolution, and I'm just going to run it again. Let's see how 0.8 does. So... of course now we have 24 clusters instead of 20.
Speaker 1
00:45:35 - 00:46:05
From what we saw below, I still remember there being a lot of CD8 here and these being predominantly CD4, so I'm gonna try a little higher. If we look at this now, it looks a lot more like what I was expecting. We have 28 different clusters, but likely some of them will collapse into the same cell type; I just really wanted to separate those CD4 and CD8 positive cells. So let's go ahead and rerun the lines of code that find the marker genes.
Speaker 1
00:46:06 - 00:46:51
Alright, so we're back here with the new clustering. Let's recheck the T cells, so CD4 and CD8. We do see now that CD4 corresponds well to this cluster down here, which is cluster 2, and then CD8 is more so up here, which corresponds to cluster 5. So let's just double check that in the actual marker data frame. If we look at CD4, we see that cluster 2 does have a significant increase in CD4 compared to the rest of the cells, and likewise if we do CD8 we see that cluster 5 has a very large increase in CD8, so I feel confident calling cluster 5 CD8 positive and cluster 2 CD4 positive T cells.
Speaker 1
00:46:52 - 00:47:28
So I've just added them to the dictionary we made earlier. Another thing I like to do, if you're not an expert in the tissue that you're working with, is find someone else who has published single cell data in the same tissue, even if it's a different organism like mouse, just because they'll have similar cell types. For example, it's kind of cheating because this is the same data, but there's a lot of lung single-cell data out there, so we could just go through these doing something similar. Obviously we know there are a lot of alveolar macrophages within the lung.
Speaker 1
00:47:28 - 00:47:50
Let me show you this other resource that is very useful for labeling cell clusters. This PanglaoDB is great. We can go to datasets, cell type markers, and if we search for lung, you see they have several different cell types. Let's go to alveolar macrophages, and here we see the different markers.
Speaker 1
00:47:50 - 00:48:19
This one's only in mouse, so we're going to use the next one, MRC1, and find which cluster is expressing that. So let's go back and plot just MRC1 and CD45, and let's try it without the scvi normalized layer. We see that this whole population right here is likely alveolar macrophages. But we also know that macrophages are similar to other myeloid cells like dendritic cells and monocytes.
Speaker 1
00:48:20 - 00:48:53
So let's start by labeling monocytes and dendritic cells and then label everything else in that cluster as macrophages. So if we go back to PanglaoDB, we go to cell type markers, then immune dendritic cells, and let's just choose this top cell marker and plot it here. Let's see if we can see it better with the normalized layer. Let's also go ahead and look for it in the marker data frame.
Speaker 1
00:48:55 - 00:49:12
24 looks to be the dendritic cells. For monocytes, let's again find a marker; I'm gonna try this one here. It's kind of hard to tell from the plot, so let's look at the data frame. 16.
Speaker 1
00:49:14 - 00:49:51
Yeah, so even though we can't see it on the plot, you see cluster 16 has a 6 log fold change. So 16 is monocytes, 24 is dendritic cells, and then the rest is likely macrophages. And again, if we look at the study, we see that they have DCs, monocytes, and macrophages in this one big cluster. So you see we've already made pretty good progress in a short period of time. I'm going to move on with the rest of the immune cells.
Speaker 1
00:49:52 - 00:50:23
Usually, you know, we'll have B cells, and if we look at the study they do have a population of B cells, and then we also expect some NK cells. So let's knock those out real quick using typical NK and B cell markers. So cluster 18 clearly expresses this B cell marker, but the NK cells are being a little trickier than I thought. Looks like we might have a mixed population of NK and T cells here. I'm gonna ignore that for now and might fix it at the end.
Speaker 1
00:50:23 - 00:50:56
So I'm gonna keep on working forward. I'm gonna do other cell types that are common to almost every tissue, such as endothelial cells and fibroblasts, which typically have really well-defined markers and are easy to label. For example, if you go to vasculature, PECAM1 is of course a really well-known endothelial marker, and we clearly see that cluster 6 is endothelial cells. So that's an easy one to label. If we go to connective tissue and fibroblasts, let's just pick VTN.
Speaker 1
00:50:57 - 00:51:16
So we get an error because it's not in the data frame. Let's just pick the next one, COL6A2. It looks like there's some coloring in cluster 1, but let's try the non-normalized data. So it looks like cluster 1 for sure. Let's look at the differential expression.
Speaker 1
00:51:17 - 00:51:55
Yeah, so cluster 1, and then maybe cluster 10 as well. We'll wait to label that one, but likely cluster 10 is fibroblasts as well. Typically smooth muscle can be similar in gene expression, so let's just label cluster 1 as fibroblasts for now, though it's likely that some of these clusters around 1 are also fibroblasts. Then we're just going to keep working our way down the different cell types in the lung; let's do AT1 and AT2. These seem pretty clear as well, AGER being for AT1 and SFTPC for AT2. I don't quite see it as well on the scvi normalized layer.
Speaker 1
00:51:55 - 00:52:22
I'm sure if I put a max on this we would see it better; I should have been doing this earlier. So if we use the vmax argument, let's see what happens. If we set the vmax really low, it's really apparent, even in the scvi normalized layer. So changing the vmax actually makes a big difference in these expression plots, because even after being normalized there's big variation.
Speaker 1
00:52:22 - 00:53:05
But anyways, it looks like cluster 4 is AT1, and cluster 3 and cluster 9 appear to be AT2. And of course you can always confirm this down here, or likewise in the scvi model as well; so cluster 4 has the higher log fold change. But when it's this apparent in the expression plots you don't really have to confirm it. So I'm just gonna work my way through doing this for all the different cell types, and once I'm done with the cell types I know should be in there, there are gonna be a few blank clusters, and then I'm going to dig down into those clusters a little more.
Speaker 1
00:53:07 - 00:53:32
So smooth muscle was a little tricky, because smooth muscle, pericytes, and fibroblasts all kind of cluster together. So alpha-SMA here is a marker for both smooth muscle and pericytes, and you can see both cluster 21 and 25 have high expression. But I saw that 21 and 25 are of course not overlapping at all, so they seem to be different populations of cells.
Speaker 1
00:53:32 - 00:54:05
So again, I can just go back to PanglaoDB, find pericytes, and we see the second marker after alpha-SMA, which is hopefully specific to pericytes. I ran that, and clearly 25 has much higher expression of it than 21. So it looks like 21 is smooth muscle and 25 is pericytes. Interestingly, in their publication they didn't call out pericytes separately, but obviously that wasn't the point of their study.
Speaker 1
00:54:05 - 00:54:48
Here's an opposite example. Let's say I didn't know there were any neurons in this data set and I can't label cluster 23. You could go and filter the markers, just look at the ones from cluster 23, look at the top genes, copy those, and inside PanglaoDB there's a search function where you can just paste the gene in. Don't always take this for granted, because sometimes the search can be way off; doing it the other way around is usually a lot better. But this is my first hint, and I see that the top hit is neuron, though I usually like to double check at least a couple.
Speaker 1
00:54:48 - 00:55:25
So let's go to the second gene, and we see more neurons among these top hits. A lot of these are some sort of neuron, but again, these take into account all cells, so even hematopoietic cells show up even though they are very dissimilar from neurons. So just keep that in mind. But those 2 data points would indicate to me that this is a neuronal cluster. And just as an example of why not to trust this gene search all the time: I searched the top hit of one of these clusters and enterocytes showed up.
Speaker 1
00:55:25 - 00:56:05
So obviously you don't have intestinal cells inside your lung; let's hope the people doing the autopsy didn't do that bad of a job. So, I've finished all of them except I'm having a little trouble with 20. I have some idea, but let me just show that. I'll just print out all these genes, copy them after I highlight them, and then you can just go to DAVID, start analysis, upload, pick official gene symbol, Homo sapiens, gene list, functional annotation tool, gene ontology.
Speaker 1
00:56:06 - 00:56:37
So this is still not 100% clear from this. I see some things that make me think it might be epithelial, but then we get some of this weird axon guidance stuff. Right now I'm leaning towards epithelial; I'll just have to look into it a little more. So I just picked the 2 top epithelial markers from PanglaoDB, and it does look like it's some sort of epithelial cell. It expresses both of these, and it gives me some comfort that it's in close proximity to other epithelial cells.
Speaker 1
00:56:38 - 00:57:03
So I'm going to go ahead and label it as epithelial. What's also interesting about this cluster is that it was unique to that one control sample. So maybe they cut out a bigger section and got some epithelial cells that weren't in other samples. It's hard to tell, but we have successfully labeled all these cells. I usually like to double check them, but for the sake of the tutorial I'm not going to, because I've already spent enough time doing this.
Speaker 1
00:57:03 - 00:57:22
So let's just save this as a dictionary. And then, if we remember adata.obs, we have this leiden column, and we can just map using that dictionary. I forgot to put leiden here. So you see it returns the cell type, and we can just save this as another column.
Speaker 1
00:57:24 - 00:57:41
We'll just call it cell type. Now you can plot this. I'm just going to take this off for now to make it a little prettier. And so this is what we have. I never got around to fixing the NK cells, but it wouldn't have been that hard to do.
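Roughly, with the dictionary keys being the Leiden cluster labels as strings and the names being whatever you settled on (the entries below are just illustrative):

```python
cell_type = {
    '2': 'CD4+ T cell',      # illustrative entries; fill in one per Leiden cluster
    '5': 'CD8+ T cell',
    '6': 'Endothelial cell',
    '16': 'Monocyte',
    '24': 'Dendritic cell',
    # ...
}

adata.obs['cell type'] = adata.obs.leiden.map(cell_type)
sc.pl.umap(adata, color=['cell type'], frameon=False)
```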
Speaker 1
00:57:41 - 00:58:20
It's hard to say if it's better or worse than what they did in this dataset, because we used different methods. scvi does a really good job though, so I think I personally like my integration better, and it looks like they used reference samples, which has limitations and might be why they don't have some of the same populations that we have. But anyways, let's save this, and we also want to save those markers data frames, especially the scvi ones, so we don't have to rerun that model. So I'm just going to save them to the unstructured (uns) slot in the adata, and then overwrite the file we wrote earlier.
Speaker 1
00:58:22 - 00:58:50
And actually I just went ahead and saved the model too, just in case we want to use it to do differential expression without waiting another 30 minutes. It's been a couple of days, but I'm ready to start on the analysis. So the first thing I'm going to do is recreate some of these cell frequency plots they have. Here they've just calculated the frequency of each cell type in all the 26 different samples and made a simple box plot. So if we look at adata.obs, we have this sample column.
Speaker 1
00:58:50 - 00:59:26
So let's just look at the unique. So we have the sample names, but we never made a column to say whether the sample was a control sample or a COVID sample. You can see here they're clearly defined as COV or CTR. Of course, you might not need to do this with your data set, but in this instance, I'll show you what I'm going to do here. So I'm just going to go ahead and write this simple little function that's going to take the sample label for each cell and if CoV is in it it's going to return COVID-19 and if CoV isn't in it it's going to return control.
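Something like this, assuming the sample names contain 'cov' or 'ctr' as they do in this dataset:

```python
def map_condition(x):
    # sample labels look something like 'cov01' / 'ctr05' here
    if 'cov' in x.lower():
        return 'COVID19'
    return 'control'

adata.obs['condition'] = adata.obs['sample'].map(map_condition)
adata.obs.condition.value_counts()
```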
Speaker 1
00:59:26 - 01:00:01
So I'm just going to map that function to the sample column in the observation data frame, and I'm making a new column called condition. So if we do that and then look at the obs data frame, now you see I've added a condition that's either COVID-19 or control. So for the frequency, we need to count the number of cells of each cell type and the total number of cells in each sample. Let's start by counting the total number in each sample. So if we do an obs group-by-sample count, we see that we have the counts for each sample.
Speaker 1
01:00:01 - 01:00:39
So all I'm going to do is take this data frame and zip the index column and any one of these count columns into a dictionary. So now we just have a dictionary of the sample and the total number of cells, which we can call whenever we need the total number of cells for any given sample. Now we can group the observation data by sample, condition, and cell type, and then just get rid of the 0 rows and break this index down. So we keep only rows with at least 1 cell.
Speaker 1
01:00:39 - 01:01:08
So now you see, for any given sample, any given condition, so control or COVID-19, and every cell type, we have the number of cells. Since all these columns are redundant, I'm just going to keep the first 4. And the first thing I'm going to do is add a new total cells column using that dictionary we made: here's the dictionary, and we're just mapping the samples. I also got an error and had to convert this into an int column.
Speaker 1
01:01:10 - 01:01:29
So we've added this column, and then we're going to add a frequency column, which is just this count column (I know it says doublet; it's not actually doublets, it's just the counts, and I'm too lazy to rename it) divided by the total cell number to get the frequency. So now we have the frequency, which we're going to plot.
Speaker 1
01:01:30 - 01:01:41
And we'd plot it... Oops. Need to import matplotlib. So we get something like this. I'm not going to spend time making this pretty.
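A sketch of the counting and the box plot (the 'doublet' column is just being used as a generic count column here, as noted above):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Total cells per sample, as a lookup dictionary.
num_tot_cells = adata.obs.groupby(['sample']).count()
num_tot_cells = dict(zip(num_tot_cells.index, num_tot_cells.doublet))

# Cells per (sample, condition, cell type), then convert to frequencies.
counts = adata.obs.groupby(['sample', 'condition', 'cell type']).count()
counts = counts[counts.sum(axis=1) > 0].reset_index()        # drop empty combinations
counts = counts[counts.columns[0:4]]                         # sample, condition, cell type, count
counts['total_cells'] = counts['sample'].map(num_tot_cells).astype(int)
counts['frequency'] = counts.doublet / counts.total_cells    # 'doublet' holds the per-group cell count

plt.figure(figsize=(10, 4))
sns.boxplot(data=counts, x='cell type', y='frequency', hue='condition')
plt.xticks(rotation=35, ha='right')
plt.show()
```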
Speaker 1
01:01:41 - 01:02:06
I have other videos that go over that in depth, but it's pretty much identical to what they show here, except they're also showing the points, which is easy to do and I cover in another video. They also did a non-parametric test between each group; just for the sake of time, I'm going to save that for later in the video. So let's move on to some differential expression. Alright, we're gonna recreate some of these analyses.
Speaker 1
01:02:06 - 01:02:49
So they did a subset on epithelial cells, and then they did differential gene expression between AT1 and AT2 cells and plotted the top differentially expressed genes in a heat map. They did some basic comparisons between some genes, and they computed this DATP signature based on a gene list. So we're gonna recreate all of those. We're not gonna redo the epithelial cell subclustering that they did here, because it's not necessary for doing the differential expression, but what you would do is just subset the adata, including the raw data, and redo the typical Scanpy processing or the scvi processing that we did. The first thing we're going to do is subset the data on cell type. So, is the cell type label in this list?
Speaker 1
01:02:49 - 01:03:05
So is it AT1 or AT2? We're going to modify it, so just make sure to pass copy. Now, the authors of this paper did a really basic differential expression: they just used the FindMarkers function in Seurat, which a lot of people do. It's not really the best approach.
Speaker 1
01:03:05 - 01:03:40
Doing something more advanced like pseudobulk is preferred. What we're gonna do is either use the scvi model to do the differential expression, or use this diffxpy package. I'm going to start with diffxpy for multiple reasons. One is that, depending on how many genes you had in the scvi model, if you only used 2,000 variable genes, for example, only those 2,000 are going to be in the differential expression analysis. Or, if you didn't even use scvi, you'd use diffxpy of course.
Speaker 1
01:03:40 - 01:03:55
In my case, since I have 20,000 genes in my scvi model, I would just use scvi differential expression. But anyway, I'm just going to show both; they're pretty simple but powerful. So we're going to import diffxpy, which you of course have to install.
Speaker 1
01:03:55 - 01:04:29
They have instructions if you just Google it. It needs a dense array, so we need to convert the X back into a dense array. And then if we look at how many genes we have, we still have the number of genes from our total adata object, which is 20,000. I'm gonna re-filter the subset, because since we subset just to AT1 and AT2, some of these genes might only be in a small number of cells, or even in none of the cells. You could of course lower this number, but we still have 17,000 cells, so 100 is still okay.
Speaker 1
01:04:31 - 01:04:54
So we're down to 12,000 genes now. One thing I wish I'd done earlier is, instead of naming this column 'cell type' with a space, I should have named it 'cell_type' with an underscore. You should always avoid spaces if you can; I usually do, I don't know what I was thinking, but diffxpy will give us an error. So I'm going to rename 'cell type' to 'cell_type'.
Speaker 1
01:04:54 - 01:05:18
Something I forgot to mention is that if you ran scale and regress_out on your data, you'll have to set X back to the unscaled data, i.e. the normalized and log-transformed values. But since we used scVI for all the downstream clustering, we didn't do that. So there are multiple different differential expression comparisons we can do now. In their paper, they did AT1 versus AT2, which is what I'm going to recreate.
Speaker 1
01:05:19 - 01:05:43
But if you wanted to, you could also test between COVID and non-COVID based on our condition column; if we look at subset.obs, 'condition' is the column name. We're going to do what they did in the paper, though, and pass cell type, so again the column name is 'cell_type'. We're just running this Wald test from diffxpy. If you want to get fancier with covariates and so on, you can check out their website.
Speaker 1
01:05:43 - 01:06:09
I'm basically doing a default Wald test. This will take maybe 5 minutes or so. Once that's done, we can call res.summary() to return a data frame. I'm just going to sort by log fold change and then reset the index. In this case you'll see some values like 283 or negative 297; that's because some genes are expressed in one group and not at all in the other, but it should only be a handful.
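A minimal version of that test call, including the column rename mentioned above (a sketch of a default Wald test; check the diffxpy docs for covariates and other options):

```python
import diffxpy.api as de

# Rename the obs column so the formula below doesn't contain a space
subset.obs = subset.obs.rename(columns={'cell type': 'cell_type'})

# Default Wald test: model each gene on cell_type and test that coefficient
res = de.test.wald(
    data=subset,
    formula_loc='~ 1 + cell_type',
    factor_loc_totest='cell_type',
)

dedf = res.summary()   # per-gene results as a pandas DataFrame
dedf = dedf.sort_values('log2fc', ascending=False).reset_index(drop=True)
dedf.head()
```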
Speaker 1
01:06:10 - 01:07:10
And then it's time to find out which group was the test group. I think it depends on the order in your data frame; this is one of the things I wish diffxpy would change, being able to easily specify which group comes first or which is the reference group. I think it's usually the second one that shows up, but since I'm paranoid, I just write this little function: it takes the most upregulated gene from the differential expression data frame we just made, finds its index in our AnnData, slices out the values for AT1 and then the same for AT2, and prints their means. In this case the most upregulated gene shows lower expression in AT1 than in AT2, so I'm just confirming what I said earlier: AT2 is indeed the test group here.
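A paranoid little sanity check along those lines (the helper name and arguments are mine, not a diffxpy API):

```python
import numpy as np

def check_direction(df, ad, gene_col='gene', group_col='cell_type'):
    """Print the mean expression of the most 'upregulated' gene in each group,
    so we can see which group diffxpy actually treated as the test group."""
    top_gene = df.iloc[0][gene_col]          # df is sorted by log2fc, descending
    gi = ad.var_names.get_loc(top_gene)      # column index of that gene
    for group in ad.obs[group_col].unique():
        mask = (ad.obs[group_col] == group).values
        print(group, float(np.mean(ad.X[mask, gi])))

check_direction(dedf, subset)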
Speaker 1
01:07:10 - 01:07:41
So this data frame is AT2 versus AT1. However, let's say I wanted AT1 versus AT2: all I'm going to do is multiply the log2 fold change by negative 1. Then I reorder based on the log2 fold change column and reset the index, so it's basically the same data frame, just flipped. Then let's filter out the insignificant rows, keeping a q-value less than 0.05 and an absolute log fold change greater than 0.5.
Speaker 1
01:07:41 - 01:08:08
We could change these thresholds a bit if we wanted to. As I mentioned earlier, the Wald test doesn't really get rid of the genes that are expressed in only a small number of cells, so we can also apply a mean-expression threshold here to remove those. That leaves around a thousand differentially expressed genes. What they did is plot the top 25 up-regulated and top 25 down-regulated genes, so we'll do something similar.
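The flip-and-filter step might look like this (q < 0.05 and |log2FC| > 0.5 are the thresholds above; the mean cutoff of 0.15 is just an illustrative value, not one from the paper):

```python
# Flip the comparison so positive log2fc means higher in AT1 than in AT2
dedf['log2fc'] = dedf['log2fc'] * -1
dedf = dedf.sort_values('log2fc', ascending=False).reset_index(drop=True)

# Keep significant, reasonably changed, reasonably expressed genes
deg = dedf[(dedf['qval'] < 0.05) &
           (dedf['log2fc'].abs() > 0.5) &
           (dedf['mean'] > 0.15)]        # arbitrary mean-expression cutoff
len(deg)
```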
Speaker 1
01:08:08 - 01:08:38
Since this is sorted, we can take the top 25 genes from the data frame and the bottom 25 genes and put them in a list. So the bottom 25 genes, the top 25 genes, and I'm just combining them into one list of genes. Then we can take that list and pass it to the scanpy heatmap function, passing our subset AnnData, this gene list we just made, and grouping by cell type, so AT1 or AT2.
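Continuing from the filtered data frame above, putting the gene list together and plotting it is just a couple of lines:

```python
import scanpy as sc

top25 = deg['gene'].head(25).tolist()      # most up in AT1
bottom25 = deg['gene'].tail(25).tolist()   # most up in AT2
genes = bottom25 + top25

# Heat map of those 50 genes, cells grouped by our renamed obs column
sc.pl.heatmap(subset, genes, groupby='cell_type', swap_axes=True)
```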
Speaker 1
01:08:39 - 01:08:55
And we get this heat map here. Of course we could make it a little prettier, etc.; I'm not going to do that right now. Alright, differential expression with scVI is really simple if we already have the trained model. We trained and saved it, so I'm just going to read it back in.
Speaker 1
01:08:55 - 01:09:22
Again, I was playing around with the source code, so don't worry about that; it's just our model that's already been trained. We're going to do the same differential expression comparison. Here, to specify which cells are in group 1 and group 2, we have to pass boolean arrays: if we look at one, it's just true or false for whether each cell is in that group. So it's the same expression you'd use to filter the AnnData, just without indexing the AnnData itself.
Speaker 1
01:09:23 - 01:09:55
So we're just comparing cell type AT1 versus cell type AT2. If we wanted to do something fancier, like compare condition, so all the AT1 and AT2 cells that are COVID versus all the AT1 and AT2 cells that are control, we could do something like this. But we're gonna stick with this, because it's what they did in the paper. So we get this scVI data frame that's similar to our markers data frame, and we're just going to filter it similarly to what we did with the diffxpy data frame.
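A sketch of the scVI route, assuming the model was saved earlier to a directory I'm calling 'model' here (the path and the obs column names are my assumptions):

```python
import scvi

# Reload the trained model; the directory name is whatever you used when saving
model = scvi.model.SCVI.load('model', adata=adata)

# Boolean masks selecting the two groups of cells to compare
idx1 = adata.obs['cell type'] == 'AT1'
idx2 = adata.obs['cell type'] == 'AT2'
# Fancier comparisons just combine masks, e.g. COVID vs. control within these cell types:
# idx1 = adata.obs['cell type'].isin(['AT1', 'AT2']) & (adata.obs['condition'] == 'COVID-19')
# idx2 = adata.obs['cell type'].isin(['AT1', 'AT2']) & (adata.obs['condition'] == 'control')

scvi_de = model.differential_expression(idx1=idx1, idx2=idx2)
scvi_de.head()
```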
Speaker 1
01:09:55 - 01:10:23
So: is the FDR less than 0.05, which is just this true/false column over here, and is the absolute value of the log fold change greater than 0.5; then I'm just sorting it. Now we have this sorted data frame of 3,800 genes that are significantly differentially expressed. In this data frame we also have raw_normalized_mean1 and raw_normalized_mean2 for the two different groups; you can see that in this example a gene is high in one group and low in the other.
Speaker 1
01:10:23 - 01:10:40
I'm just going to do one simple filter to get rid of the genes that are low in both groups: if a gene is lower than 0.5 in both groups, it gets filtered out here. That removes a decent number. I'm only doing this for plotting; if you're doing gene ontology, I would keep everything.
Speaker 1
01:10:40 - 01:11:20
And like we did earlier, I'm just gonna take the top 25 and the bottom 25 genes from this data frame, since it's sorted. Then when we plot the heat map, this time I'm going to use the scvi_normalized layer and pass log=True, but everything else is the same. And now we have this nice heat map of the top 25 upregulated and top 25 downregulated genes. They included some extra scores up here, which I'll get into in a minute when we actually calculate the score for this violin plot down here. But first, since we did differential expression, I want to show you how to do some gene ontology or other gene list enrichment.
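Roughly, the filtering and plotting of the scVI results might look like this (the column names 'is_de_fdr_0.05', 'lfc_mean', and 'raw_normalized_mean1/2' are the ones scVI produced in my run, and the 'scvi_normalized' layer is one added earlier; check your own output if your versions differ):

```python
# Keep significant genes with a decent fold change, sorted by effect size
sig = scvi_de[scvi_de['is_de_fdr_0.05'] &
              (scvi_de['lfc_mean'].abs() > 0.5)]
sig = sig.sort_values('lfc_mean', ascending=False)

# Drop genes that are lowly expressed in both groups (0.5 is an arbitrary cutoff)
sig = sig[(sig['raw_normalized_mean1'] > 0.5) |
          (sig['raw_normalized_mean2'] > 0.5)]

# Top and bottom 25 genes; the data frame is indexed by gene name
genes = sig.index[:25].tolist() + sig.index[-25:].tolist()

sc.pl.heatmap(adata[idx1 | idx2].copy(), genes, groupby='cell type',
              layer='scvi_normalized', log=True, swap_axes=True)
```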
Speaker 1
01:11:21 - 01:11:50
There's this super nice package called gseapy. It's super easy to use, and it gives us access to a bunch of different gene set libraries, all sorts of stuff. The ones you might be interested in are the gene ontology sets, so we can pick GO Biological Process, and if we wanted to we could look at KEGG pathways as well. There are a bunch of different ones to choose from, but these are probably the most common.
Speaker 1
01:11:51 - 01:12:19
For our background gene list we're just going to use the var names from the subset. For the genes we can pull them directly from the diffxpy differentially expressed data frame, but the concept is the same with the scVI one. So we have our data frame, and I'm going to take just the upregulated genes; of course, just flip the comparison here to get the downregulated ones. Then I take the gene column (it's the index in the scVI data frame) and convert it to a plain list.
Speaker 1
01:12:19 - 01:13:03
So we can take this list and pass it to the gseapy enrichr function, passing that background list, which is just the var names from the subset, and then we can put whatever gene set library we want here, or multiple libraries. It sends the query to their server, I think, and returns enr. We can take enr.results to get a pandas data frame, so now we have our pathways and our GO Biological Process enrichment. These are pretty easy to plot, but I'll just use one of the common seaborn plot types. Alright, now we're going to focus on some of these simple violin plots they made for two different genes and two different cell types.
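A sketch of the enrichment call (the library names below are real Enrichr gene set names, but check the list gseapy exposes for current versions, and note that background support depends on your gseapy version; the up/down split assumes the filtered diffxpy data frame from above):

```python
import gseapy as gp

# Upregulated genes only; flip the comparison for downregulated
up_genes = deg[deg['log2fc'] > 0]['gene'].tolist()
background = subset.var_names.tolist()

enr = gp.enrichr(
    gene_list=up_genes,
    gene_sets=['GO_Biological_Process_2021', 'KEGG_2021_Human'],
    background=background,
    outdir=None,          # don't write result files, just return the object
)
enr.results.head()        # pandas DataFrame of enriched terms
```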
Speaker 1
01:13:03 - 01:13:36
Really simple, but I'll also show you how to do the p-values. More involved is this DATP signature, where we'll calculate a per-cell score from a gene list; this works for any gene signature. We'll start by recreating that violin plot they made. They looked at this ETV5 gene in just AT2 cells and compared the condition, so COVID versus control. Again, I'm not going to go into making these super pretty, or this video would get too long, but they're really easy to modify.
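The violin plot itself is basically one line with scanpy (assuming 'condition' holds the COVID-19 vs. control labels and 'cell type' the cell type labels):

```python
import scanpy as sc

# ETV5 expression in AT2 cells only, split by condition
at2 = adata[adata.obs['cell type'] == 'AT2'].copy()
sc.pl.violin(at2, 'ETV5', groupby='condition')
```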
Speaker 1
01:13:36 - 01:14:25
Okay, so how do we find out if this is actually significant? We're going to import scipy.stats, which has a lot of different statistical tests. I'm going to take this subset and, just so I don't have to type as much, assign it to a new object called temp, so I can type temp instead of the whole long expression. Then for this gene, ETV5, I need to get its values, so I need to find its index, i.e. where it falls in var_names. So i is just a number, its position in var_names. We use that position to slice the gene's values out of the counts matrix, once for the COVID-19 cells and once for the control cells.
Speaker 1
01:14:25 - 01:14:41
So a and b are just arrays of numbers, and you can see by looking at them that the values are not normally distributed, so don't do a t-test; we're going to do a non-parametric test. In this case, I'm just going to do a Mann-Whitney U test. And we see this is very significant.
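Spelled out, that test might look like this (the 'counts' layer and the condition labels are my assumptions; use whichever matrix you consider appropriate for testing):

```python
import numpy as np
import scipy.stats as stats

# Shorter handle: AT2 cells only (the column was renamed to 'cell_type' earlier)
temp = subset[subset.obs['cell_type'] == 'AT2']
gi = temp.var_names.get_loc('ETV5')              # position of the gene in var_names

def gene_slice(ad, mask, gene_idx):
    """Pull one gene's values for the masked cells as a flat numpy array."""
    vals = ad.layers['counts'][mask, gene_idx]   # 'counts' layer is my assumption
    if hasattr(vals, 'todense'):                 # handle sparse storage
        vals = vals.todense()
    return np.ravel(np.asarray(vals))

a = gene_slice(temp, (temp.obs['condition'] == 'COVID-19').values, gi)
b = gene_slice(temp, (temp.obs['condition'] == 'control').values, gi)

# Non-parametric test, since the expression values are clearly not normal
stats.mannwhitneyu(a, b)
```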
Speaker 1
01:14:41 - 01:14:57
Again, I'm not going to show it here, but you could easily add this text to the plot; I have other videos that go over that. I won't go over the next one, it's the same exact thing except you filter on AT1 cells instead and pick the CAV1 gene. So you can do this for any gene and any cell type.
Speaker 1
01:14:57 - 01:15:15
Okay, let's calculate this DATP signature. I got this gene list from their supplemental data; it's just 163 genes. I put it in a text file, one gene per line, and I'm just opening up that list in Python with typical Python syntax.
Speaker 1
01:15:15 - 01:15:36
It doesn't matter where the list comes from, as long as you have a list of genes like this. We're going to stick with the subset we made earlier and use this score_genes function from scanpy. It's a super useful function; I use it all the time. I even like to use it sometimes when I'm trying to figure out what cell type a cluster is, by passing cell type markers, for example.
Speaker 1
01:15:36 - 01:15:56
But anyway, we're passing the list we just made for this DATP signature, and I'm just going to give the score the name DATP. For the rest I'll use the default settings; depending on your list size and a few other things, you might want to change some of the settings in score_genes, so just look at the manual. This isn't suitable for a very small gene list, maybe fewer than 10 genes; I wouldn't use score_genes in that case.
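A sketch of loading the list and scoring (the file name 'datp_genes.txt' is just what I called my one-gene-per-line file):

```python
import scanpy as sc

# Read the signature: one gene symbol per line (hypothetical file name)
with open('datp_genes.txt') as f:
    datp_genes = [line.strip() for line in f if line.strip()]

# Score each cell for this signature relative to a random background gene set
sc.tl.score_genes(subset, datp_genes, score_name='DATP')
subset.obs['DATP'].head()
```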
Speaker 1
01:15:56 - 01:16:35
But for anything above that, it works really well. So all it did was add a new column, DATP, in our observation data frame. The value is just the relative expression of the genes from the list compared to a background gene set, so the actual value is arbitrary; you can't compare between gene lists, but you can compare between cells, which is what we're going to do. So we can just use sc.pl.violin and pass DATP, and since DATP isn't a gene, it will pull it from the observation columns; we're grouping by condition, which again is COVID-19 or control.
Speaker 1
01:16:36 - 01:17:08
And so we get this violin plot, and in this case, if you want to do a statistical test, you just filter the observation data frame by condition, grab the column values, and run Mann-Whitney again, or whatever test you want to do. And that's it. We see that COVID-19 has much higher expression of this signature, just like they showed in their paper. You could do this again with any gene set. If you wanted to, you could even plot this on a UMAP like this.
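Roughly, the plot, the test, and the optional UMAP overlay look like this (same assumed condition labels as before):

```python
import scanpy as sc
import scipy.stats as stats

# Violin of the per-cell DATP score, split by condition
sc.pl.violin(subset, 'DATP', groupby='condition')

# Mann-Whitney U test on the score values from the obs data frame
covid = subset.obs.loc[subset.obs['condition'] == 'COVID-19', 'DATP'].values
ctrl = subset.obs.loc[subset.obs['condition'] == 'control', 'DATP'].values
print(stats.mannwhitneyu(covid, ctrl))

# Or colour the UMAP by the score
sc.pl.umap(subset, color='DATP')
```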
Speaker 1
01:17:08 - 01:17:32
Or whatever you want to do. If you go over to the scanpy core plotting functions page on Read the Docs, they have a bunch of examples of the different plotting functions you might be interested in, plus nice ways to make them look pretty like this. It's a great resource just for giving you ideas of what plots you can make. They didn't do too much more in the paper; they mostly did similar analyses for different cell types, etc.
Speaker 1
01:17:32 - 01:17:57
For example, fibroblasts. If you wanted to do that, what I would do is just take a subset of the fibroblasts, re-cluster, and redo everything. So: we did pre-processing, clustering with one sample, and integration of multiple samples, in this case 26 samples. We found marker genes and plotted marker genes. I showed you some methods and tricks to label cells, even if you're not an expert in that tissue type.
Speaker 1
01:17:57 - 01:18:37
We counted the fraction of cells in each sample and cell type. We did differential expression between two different groups of cells, made some differential expression heat maps, and did gene ontology and KEGG enrichment. We did comparisons between genes in different conditions, including statistical testing, and we scored cells based on their expression of gene signatures and also plotted that. So I've more or less recreated a lot of the single-cell analyses from this Nature paper, or at least given you the tools to do something similar. If you've made it through this whole video, I applaud you, and I hope you learned a lot.
Speaker 1
01:18:37 - 01:18:37
Thanks for watching.