1 hour 18 minutes 38 seconds
Speaker 1
00:00:00 - 00:00:36
So we did pre-processing, clustering with 1 sample, integration of multiple samples. We found marker genes, plotted marker genes. I showed you some methods and tricks to label cells. We counted the fraction of cells. We did differential expression between 2 different groups of cells, made some differential expression heat maps, did gene ontology and KEGG enrichment, we did comparisons between genes and different conditions including statistical testing, and we scored cells based on their expression of gene signatures and also plotted that.
Speaker 1
00:00:36 - 00:01:39
So I've basically more or less recreated a lot of the single cell analyses from this Nature paper, or at least given you the tools to do something similar. So this is gonna be a long one, but I think you'll learn a lot. I started off wanting to do a more introductory video for those who have never done single cell analysis before, but I actually ended up doing a really thorough processing and analysis. So I think this video is suitable for even advanced single cell users, but let's go ahead and start working our way through this recent Nature publication. So after the reads have been mapped to the genome and processed through whatever pipeline, all single-cell data really boils down to is a matrix of cells by genes, and the counts just represent the number of unique RNA molecules that map to a given gene from a given cell.
Speaker 1
00:01:39 - 00:02:18
So there are several different formats that your counts data can be in. This is the outs folder of a sample processed by 10x Cell Ranger, and we have 2 different matrix files we can open here. We're of course only interested in the filtered one, because the 10x pipeline filters out all the junk. And then you might also come across the matrix folder instead, which is the same thing but in a slightly different format. They have the 3 files separated, but any analysis software you use, if you just point to this folder, will open them up into one matrix.
Speaker 1
00:02:18 - 00:02:51
So either one you use is fine. In the example data that I downloaded there are actually just raw count CSVs, and this is similar to what I showed you earlier, where we have the cell barcodes and then the gene names. So we're going to be opening up these CSVs directly into Scanpy. The data I'm using for this tutorial came from this Nature publication, a molecular single-cell atlas of lethal COVID-19. So they have 19 samples from fatal COVID-19 lungs and 7 control samples.
Speaker 1
00:02:51 - 00:03:58
And so we're going to be redoing some of these analyses. So I'll do clustering and cell type identification, and we'll also do some cell quantification, and then we'll do some differential expression like they did here, and of course we'll make a heat map, because you can't do RNA-seq analysis without making a heat map, and we'll also do a bunch of other typical single cell analyses that they probably put in their supplement. For the data, just go down to the data availability section. I followed the link to the GEO and then I just downloaded the raw data tar file here. If you're on Windows you'll need to download something like 7-Zip to open this up; on Mac or Linux you should be able to use tar -xf. So we are going to be doing our analysis in a Jupyter notebook using Python. If you don't have access to a Jupyter notebook already, what I recommend is installing Miniconda and creating an environment for your single cell analysis.
Speaker 1
00:03:58 - 00:04:28
There are multiple guides for this online. I even have a quick video for Linux, but if you're using Windows the installation will be a little different; that's not the point of this video, so I'm not going to go into it in depth. Alright, so we're going to be using Scanpy as the workhorse for our data analysis. So if you don't have it, you would install it with pip like this. And once it's installed, you can just import it.
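A minimal sketch of that setup, assuming pip inside your conda environment (the leiden extra just pulls in the clustering dependency used later):

```python
# One-time install, from the command line inside your environment:
#   pip install "scanpy[leiden]"

import scanpy as sc

sc.settings.verbosity = 1   # optional: quieter logging
print(sc.__version__)       # quick sanity check that the import worked
```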
Speaker 1
00:04:28 - 00:04:47
And don't worry, I'll put the notebook on my GitHub, so you can just copy and paste lines of code directly from that. So of course the first thing we need to do is read in our data. So Scanpy has multiple different read functions. If you have an h5 file, you can read it in directly.
Speaker 1
00:04:48 - 00:05:32
If you're pointing to a 10x matrix file, or the matrix folder with the 3 files inside, call this and then point to that folder path. In our case the data I downloaded from that publication is just a CSV, so I just need to point to the path of that CSV. There are 26 different samples, so I'm just gonna start with the first control sample. If you're on Windows your paths will look a little different, with backslashes instead of forward slashes. And then Scanpy requires the columns to be genes and the rows to be cells, so we just have to append a .T to transpose it.
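A rough sketch of those read options; the CSV path below is a hypothetical placeholder, not the actual GEO file name:

```python
import scanpy as sc

# 10x outputs can be read directly:
#   adata = sc.read_10x_h5('outs/filtered_feature_bc_matrix.h5')
#   adata = sc.read_10x_mtx('outs/filtered_feature_bc_matrix/')

# This dataset ships as plain CSVs (genes x cells), so read the CSV
# and transpose it so rows become cells and columns become genes.
adata = sc.read_csv('raw_counts/GSM_control_01.csv').T   # hypothetical path
adata
```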
Speaker 1
00:05:32 - 00:06:03
And then let's just save this as adata. And in a Jupyter notebook, if you call the object as the last line in a cell, it shows that object. So now we have our single cell object with 6,099 cells and 34,000 genes. So there are only 3 components of this AnnData object so far. And that's the observation data frame, which is currently just the cell barcodes, because we haven't added any additional observations.
Speaker 1
00:06:06 - 00:06:39
The var data frame, or variables, which is the genes with no additional information. And then we have our counts saved under capital X, which is just a numpy array, and if we look at the shape it'll just be cells by the number of genes. So we have our single cell data and it's now in the AnnData format. So we're ready to start with the pre-processing. All right, so for the pre-processing we're going to start with doublet removal.
Speaker 1
00:06:40 - 00:07:10
So some people don't do this to their data, maybe because they don't know how, but it is definitely recommended, because when you're making single cell libraries sometimes 2 or more cells can end up in the same droplet. So for this we're going to be using the SOLO model within scvi-tools. So we're going to import scvi, and if you don't have it installed you can install it through pip like this. So we have our AnnData object. This is a single sample.
Speaker 1
00:07:10 - 00:07:36
Don't do this on an integrated sample; you need to do this on individual samples and then you can integrate them afterwards. So all we're going to do is filter down the genes and then train a model to predict whether a cell is a doublet or not. So let's start by narrowing down these 34,000 genes. The first thing we're going to do is only keep genes that are found in at least 10 of the cells. And if we look at adata again, you see we're down to 19,000 now.
Speaker 1
00:07:36 - 00:08:19
And then we're going to go ahead and only keep the 2,000 top variable genes. So I'll go into this a little more later on in the tutorial. But for the doublet removal, just know that I'm keeping the 2,000 genes that more or less describe the data the best, and of course if we look at adata this time we're down to 6,000 cells and only 2,000 genes. And now we want to set up an scvi model so that we can then predict the doublets. So for that we're just doing a very simple model setup using only default parameters, and then we're training this VAE model. So this will take a few minutes, especially if you don't have an NVIDIA GPU.
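A rough sketch of that doublet-detection setup, per sample, with default scvi-tools settings (the seurat_v3 flavor is one reasonable choice here since the counts are still raw; it needs scikit-misc installed):

```python
import scvi   # pip install scvi-tools

# Work on one sample at a time, never on an integrated object.
sc.pp.filter_genes(adata, min_cells=10)                        # drop genes seen in <10 cells
sc.pp.highly_variable_genes(adata, n_top_genes=2000,
                            subset=True, flavor='seurat_v3')   # keep the 2,000 most variable genes

scvi.model.SCVI.setup_anndata(adata)   # default setup, no covariates
vae = scvi.model.SCVI(adata)
vae.train()                            # a minute or two per sample; faster with a GPU
```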
Speaker 1
00:08:20 - 00:08:52
It'll take a little longer, but in my case it looks like this one sample is going to take about a minute and a half. So after the scvi model is trained, we can go ahead and train the SOLO model, which predicts doublets, and we just pass the VAE model we just trained. So once that's done we can get the predictions for whether a cell is a doublet or not. So you see we can return a data frame like that, where we have the cell barcode and the score for doublet and singlet.
Speaker 1
00:08:52 - 00:09:24
The higher of the scores is going to be the prediction, and then I'm also going to pass soft equals False to get a new column that is the predicted label. So now we have the values and the prediction. There's one final thing I want to do, which is to remove all these dash zeros that scvi adds on to the barcodes. And you'll see why I do that in just a minute. So for every index, or barcode, I'm just getting rid of the last 2 characters so we no longer have that dash 0.
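A sketch of that step, assuming the trained vae from above:

```python
# Train the doublet classifier on top of the trained SCVI model.
solo = scvi.external.SOLO.from_scvi_model(vae)
solo.train()

df = solo.predict()                          # per-cell doublet / singlet scores
df['prediction'] = solo.predict(soft=False)  # hard label: 'doublet' or 'singlet'

# SOLO appends a "-0" suffix to the barcodes; strip it so they
# match the original AnnData index again.
df.index = df.index.map(lambda x: x[:-2])
df
```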
Speaker 1
00:09:24 - 00:10:02
Then let's go ahead and count how many doublets and singlets we have. So we have around 1,200 doublets and 4,900 singlets. So about 20% of these data were labeled as doublets. So 20% is pretty high, but as you can see, the values range; for example, for this doublet the difference between the doublet and singlet scores is only about 1, and for this one it's closer to 4. So the doublet prediction here was much higher than the doublet prediction here.
Speaker 1
00:10:02 - 00:10:51
And in this case, I'm not sure I want to throw away 20% of the data, although you could, and it's very reasonable to do so. But I'm going to add a new column in this data frame, which is the difference between these 2 columns. So now we have the difference. I'm going to go ahead and plot that distribution just so we can look at it, and I'm going to be using another package called Seaborn. So I'm going to plot the distribution of this data frame, only this diff column I just created, and only from the cells that were predicted doublet. So you see we have a lot of cells predicted doublet where the doublet score was only marginally higher than the singlet score.
Speaker 1
00:10:51 - 00:11:27
So what I'm going to do is filter out the ones below 1. One is kind of an arbitrary selection. I'm going to make a new data frame called doublets, which is only the cells predicted doublet and with the difference above 1, so everything to the right of the 1 here on this distribution. So now we have 460 cells that were predicted doublets with high certainty. And since we have the barcodes, we can filter our adata object on that.
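Something like this, with 1 being the arbitrary cutoff just mentioned:

```python
import seaborn as sns

df.groupby('prediction').count()    # roughly 1,200 doublets vs 4,900 singlets here

# How confident was each doublet call? Look at the score difference.
df['dif'] = df.doublet - df.singlet
sns.displot(df[df.prediction == 'doublet'], x='dif')

# Keep only the confidently called doublets.
doublets = df[(df.prediction == 'doublet') & (df.dif > 1)]
doublets
```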
Speaker 1
00:11:29 - 00:12:06
So if we look at adata, you see we've already done a lot of processing to it just to do the doublet removal in the first place; I only have 2,000 genes, for example. So what I'm going to do is just reload the adata object using the same read function we used earlier. So we have our fresh adata, and if we look at obs it's just the list of barcodes. So we have this list of barcodes, and we have this other list of barcodes that are known doublets. So I'm going to add a new column in this observation data frame.
Speaker 1
00:12:06 - 00:12:31
I'm just going to call it doublet, and it's going to be true or false. So, is this barcode in this data frame here? Does it match one of the barcodes in this data frame's index? We'll run that, and now if we look at adata.obs again we just have true or false. So if it is a doublet it'll be labeled true.
Speaker 1
00:12:32 - 00:13:09
So let's go ahead and filter out all the cells that were labeled true and keep all the cells that were labeled false. So we can do that by filtering the adata on the observation column doublet. So now if we look at adata, instead of 6,099 cells we now have 5,600 cells. So now we have our fresh raw adata object with the doublets removed. All right, so now that we have that adata object with the doublets removed, we can do the typical pre-processing workflow.
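In code, roughly (same hypothetical CSV path as before):

```python
# Reload the untouched counts for this sample.
adata = sc.read_csv('raw_counts/GSM_control_01.csv').T   # hypothetical path

# Flag every barcode that shows up in the confident-doublet table, then drop them.
adata.obs['doublet'] = adata.obs.index.isin(doublets.index)
adata = adata[~adata.obs.doublet]
adata
```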
Speaker 1
00:13:09 - 00:13:35
So the first thing we want to do is label the genes that are mitochondrial genes. So remember, adata.var has all the gene names, and mitochondrial genes are typically annotated like this: the gene will start with MT dash. If it's mouse, sometimes it'll be lowercase mt, so just double check.
Speaker 1
00:13:36 - 00:14:25
But in this case it's human, which is almost always capital MT, so we can filter adata.var and find all the values that start with MT dash. So here are our 13 mitochondrial genes. I've come across some cases where they're not annotated with MT, but you know there are only 13 mitochondrial genes, so you can just find a list and then annotate them based on that list, like I will do with ribosomal genes in just a minute. But anyways, instead of just looking at these, let's actually annotate them. So I'm going to add true or false to every one of those 34,000 genes depending on whether they start with MT. So if we look at var now, we have this lowercase mt column that just has true or false.
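For human data that's just a string match on the gene names:

```python
# Human mitochondrial genes are prefixed with "MT-" (mouse data is usually "mt-").
adata.var['mt'] = adata.var.index.str.startswith('MT-')
adata.var[adata.var.mt]   # should list the 13 mitochondrial genes
```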
Speaker 1
00:14:25 - 00:14:54
And then now let's do the same thing for ribosomal genes, and that's a little more tricky. I'm going to be using a list of known ribosomal genes from the Broad Institute, which we can import directly into Pandas. So we've been using Pandas data frames through Scanpy, but let's explicitly import it now. And then I'm passing this ribo URL, which is this long URL, to get this data frame of ribosomal genes.
Speaker 1
00:14:54 - 00:15:31
Again, you can just copy this right from my GitHub. And then using that URL, we can read it directly with pandas, and we're just gonna skip the first 2 rows, which is junk. And then if we look at this once we import it, we just have 88 rows, or 88 ribosomal genes. So we have this list, which we can call like this to get just a list of these genes. Just like annotating mitochondrial genes, we're going to do the same thing here. So for each gene name in our var data frame: is it in this list here?
Speaker 1
00:15:31 - 00:16:09
So now if we look at adata.var, we have true or false for this ribo column. So we have this var data frame, and just to remind you, our observation data frame is basically empty except for this doublet column, which should all be false now. And we're going to go ahead and calculate QC metrics. And our qc_vars correspond to those columns that we made with the true or false labels. So if we run this, now if we look at the var data frame again, we have these statistics for each of the genes.
Speaker 1
00:16:09 - 00:16:56
You see the percent dropout is very high for a majority of the genes. So this gene was only in 4 cells, this gene was only in 1 cell, which we'll get to in a moment. And then if we look at the observation data frame, we now have these stats for each cell, which include the mitochondrial counts, the percent of mitochondrial reads, and the percent of ribosomal reads for that given cell, along with the number of genes that were positive in that cell, meaning the number of genes that had any counts, and then the total number of UMIs. So let's go ahead and look at this a little more in depth here. I'm going to sort by the number of cells that each gene was found in.
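Roughly like this; the ribosomal gene set URL is a placeholder here, so substitute the actual Broad/MSigDB link from the notebook:

```python
import pandas as pd

ribo_url = 'https://example.org/KEGG_RIBOSOME.txt'             # placeholder URL
ribo_genes = pd.read_table(ribo_url, skiprows=2, header=None)  # first 2 rows are header junk
adata.var['ribo'] = adata.var_names.isin(ribo_genes[0].values)

# Per-cell and per-gene QC stats, including % mitochondrial and % ribosomal counts.
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt', 'ribo'],
                           percent_top=None, log1p=False, inplace=True)
adata.obs[['n_genes_by_counts', 'total_counts',
           'pct_counts_mt', 'pct_counts_ribo']].head()
```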
Speaker 1
00:16:59 - 00:17:38
So again you see that some genes were in almost every cell, and then a lot of the genes were in 0 cells. Let me just bump this down. So we're gonna go ahead and filter out the genes that weren't in at least 3 cells. So if we run this chunk again, you see that every gene now has at least 3 cells it was found in, and instead of 34,000 genes we now have 24,000. And then typically here we would also filter on counts.
Speaker 1
00:17:38 - 00:18:35
Let's look at the observation data frame, specifically total counts. So if we sort by total counts, we see that the lowest is 401. So what likely happened is that the authors of this publication, when they processed these samples, probably got rid of any cell that had 400 counts or fewer. So they already did that, and we're not going to filter on a lower counts threshold, but if they hadn't, we would run something like this. We'd also filter the cells, in this case with a minimum of 200 genes. So they filtered on counts, but let's see if there are any cells with fewer than 200 genes. There are none, the lowest is 276, so we're not going to filter on it, but if your data wasn't filtered, 200 could be a place to start. Again, the exact cutoff is slightly arbitrary and depends on your data.
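The corresponding filters, with the thresholds just discussed:

```python
adata.var.sort_values('n_cells_by_counts')    # many genes are detected in only 0-4 cells

sc.pp.filter_genes(adata, min_cells=3)        # drop genes found in fewer than 3 cells
sc.pp.filter_cells(adata, min_genes=200)      # a no-op here (lowest cell has 276 genes), but a sane default

adata.obs.sort_values('total_counts').head()  # lowest is ~401, so a counts filter was already applied upstream
```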
Speaker 1
00:18:36 - 00:19:09
So let's go ahead and plot some of these QC metrics. So basically what we want to do is use these QC metrics to get rid of outliers. So if a cell has significantly more genes than the average, there's a chance that it's some artifact. Likewise with counts, but these two are very highly correlated, so we can filter on one. And then if there's a high mitochondrial percentage, it could be a sequencing artifact, or the cell could be dying.
Speaker 1
00:19:09 - 00:19:51
So in this case I don't see any above 10%. Usually people set a mitochondrial filter anywhere from 5 to 20 percent, and then with the ribosomal reads we see that most are around 0%, but there are a few scattered outliers up here, so let's just filter our adata object and get rid of some of these outliers. So we're starting with 5,639 cells; let's see what we end up with. So I'm going to use an additional package here, numpy, to get the 98th percentile value and then filter on the number of genes per cell instead of just picking a number.
Speaker 1
00:19:51 - 00:20:15
You could just pick something like 3,000 here, but I like to be a little more objective, so I'm picking a value that represents the 98th percentile. If you just wanted to pick, you could use a line of code like this. Let's see. So the 98th percentile ended up being 2,300, somewhere around here.
Speaker 1
00:20:17 - 00:20:53
If you remember what the obs data frame looked like, we had the number of genes, so we're gonna filter out all the cells that are above 2,305. So now we have 5,526 cells, and let's filter on mitochondrial and ribosomal. It doesn't look like I have to filter out any because they're all less than 10%, but I'll show the line of code anyway because you'll almost always use it. In this case I'm just gonna pick 20%. I like 20%; sometimes I use less depending on my data set.
Speaker 1
00:20:53 - 00:21:12
In this case we're gonna run it, but it's not gonna get rid of any cells. Then I'm just gonna get rid of the extreme outliers for the ribosomal percentage; I'm gonna pick 2. There's only a handful of cells actually above 2. Instead, I'm going to be regressing out differences in the data based on ribosomal reads later.
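A sketch of those filters (the 98th percentile, 20% mito, and 2% ribo cutoffs are the somewhat arbitrary choices just discussed):

```python
import numpy as np

sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts',
                     'pct_counts_mt', 'pct_counts_ribo'],
             jitter=0.4, multi_panel=True)

# Cap the number of detected genes at the 98th percentile instead of eyeballing a cutoff.
upper_lim = np.quantile(adata.obs.n_genes_by_counts.values, 0.98)
adata = adata[adata.obs.n_genes_by_counts < upper_lim]

adata = adata[adata.obs.pct_counts_mt < 20]    # mitochondrial cutoff; 5-20% is typical
adata = adata[adata.obs.pct_counts_ribo < 2]   # drop the extreme ribosomal outliers
adata
```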
Speaker 1
00:21:12 - 00:21:43
So I'll be correcting for them. So now we have our adata object that's been filtered of doublets and also of outliers. So now that the data is clean, we can start clustering and the eventual analysis. So normalization is an important step, because in single cell sequencing there's a lot of variation between cells, even between cells of the same type, just because of sequencing biases, etc. So we need to do normalization so we can compare cells and compare genes.
Speaker 1
00:21:44 - 00:22:57
And if we look at the adata counts table, each row is a cell. So just to show you, let's take the sum of each cell. Alright, so this first cell had 5,000 total counts; these cells only had around 400. So the first thing we're going to do is normalize the counts in each cell so that their total counts add up to the same value. We can do that with the normalize total function, and now if we look at the sum of each cell you see they all add up to 10,000, so each value got modified based on the starting number of counts for that cell. And then we're just going to convert those to log counts, and of course if we run this again they're not going to all be the same, because it's not a linear transformation, but they're still all comparable, or at least more comparable than before. And finally, this is an important step: you want to freeze the data as it is now, before we start filtering based on variable genes, regressing out data, and scaling data.
Speaker 1
00:22:57 - 00:23:15
So let's save the raw data. We're just saving the AnnData object as it is now into the raw slot. A lot of the functions for differential expression, etc. will actually use the raw data. Alright, so we can start the clustering process now.
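A minimal sketch of the normalization and the freeze:

```python
adata.X.sum(axis=1)                            # raw library sizes vary a lot between cells

sc.pp.normalize_total(adata, target_sum=1e4)   # every cell now sums to 10,000 counts
sc.pp.log1p(adata)                             # log-transform: log(1 + x)

adata.raw = adata   # freeze the log-normalized data; many downstream functions read from .raw
```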
Speaker 1
00:23:15 - 00:23:45
First thing we're going to do is find the 2,000 most variable genes. I know I kind of glossed over this earlier, but if we look at the var data frame now, we see that we've added additional columns. So is it highly variable, true or false? And also some statistics such as its mean and dispersion. If we were to look at those on a graph, we can see that the genes with higher dispersion were marked as variable genes.
Speaker 1
00:23:45 - 00:24:37
So this is a way to reduce the number of dimensions of the data set: instead of having 24,000 genes, we just did a tenfold reduction in the number of dimensions. So anyway, we can filter out the non-highly-variable genes, and none of this touches the raw data that we saved earlier, but if we look at adata now, we only have 2,000 genes, which are the highly variable genes. And now we're going to regress out the differences that arise due to the total number of counts, mitochondrial counts, and ribosomal counts. So this will get rid of some of the variation in the data that is due to processing, sample quality, sequencing artifacts, et cetera. And then we're gonna scale each gene to unit variance.
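In code, roughly:

```python
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
sc.pl.highly_variable_genes(adata)            # dispersion vs mean, variable genes highlighted
adata = adata[:, adata.var.highly_variable]   # keep only the 2,000 variable genes (.raw is untouched)

# Regress out technical covariates, then scale each gene to unit variance.
sc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt', 'pct_counts_ribo'])
sc.pp.scale(adata, max_value=10)
```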
Speaker 1
00:24:38 - 00:25:33
And now we're going to run principal component analysis to further reduce the dimensions of the data into just 30 or so principal components. By default this calculates 50 PCs, so let's plot how much these PCs actually contribute to the data. For lung data there are a bunch of different cell types, so I expect more significant PCs than in a data set that maybe only had a few cell types. But basically what we want to do is find the elbow of this plot, or the point where you don't really see a big difference as you increase the PC number. So here, around 30, you see that it really starts to flatten out, so 30 is probably a good number to pick. In actuality it's not going to make a big difference whether you pick 25 or 40, but we're just gonna go ahead with 30.
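Roughly:

```python
sc.tl.pca(adata, svd_solver='arpack')                # 50 PCs by default
sc.pl.pca_variance_ratio(adata, log=True, n_pcs=50)  # look for the elbow, here around 30 PCs
```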
Speaker 1
00:25:33 - 00:26:14
We're gonna go ahead and calculate the neighbors of the cells using the top 30 PCs, which we just selected. So let me show you what the neighbors are. If we look at the adata, you see we now have this obsp slot, which has distances and connectivities. If we look at connectivities, we see it's a 5,500 by 5,500 matrix, so it's a cell by cell matrix, and every pair of connected cells gets a value. And then if we look at the distances, each pair of connected cells also gets a distance. So these neighborhood matrices are what's going to be used to do the clustering.
Speaker 1
00:26:16 - 00:26:46
So we're going to go ahead and use UMAP to project the data from these 30 dimensions into 2 dimensions so that we can look at it. So now we can plot the UMAP. Each point is a single cell, but they haven't been assigned to clusters yet, so it's all just one color. To actually assign clusters we need to run the Leiden algorithm, so you might need to install that separately if you don't have it. And then we're gonna run the Leiden algorithm with a resolution of 0.5.
Speaker 1
00:26:46 - 00:27:12
This is something you'll adjust depending on your data; I just like to start with 0.5 and see how it turns out. So if we look at the adata.obs data frame after running the Leiden algorithm, all it did was add a new column with a Leiden label. So now let's replot that UMAP and color the cells based on this label. So now we have our first sample clustered.
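Put together, roughly:

```python
sc.pp.neighbors(adata, n_pcs=30)      # kNN graph on the top 30 PCs
sc.tl.umap(adata)                     # 2D embedding for visualization
sc.tl.leiden(adata, resolution=0.5)   # graph clustering; tune the resolution per dataset
sc.pl.umap(adata, color=['leiden'])
```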
Speaker 1
00:27:12 - 00:27:41
What I'm going to do next is integration of multiple samples. If you only have one sample, you can skip the integration and just pick up with finding markers later in the video. But a lot of the time you have more than one sample, and in my example tutorial case I have 26 samples. So I need to integrate them all into one AnnData object and adjust for differences based on batch.
Speaker 1
00:27:42 - 00:28:49
So the first thing I want to do for integration: since I have 26 samples, I don't want to have to write this code out for each individual sample, so I wrote a function that does almost exactly what we did above for each individual sample. I know it looks like a lot, but I basically just copied and pasted from what we already did. Here I predict the doublets, then I re-read the adata. The only thing I pass to this function is going to be the CSV path, for example the same CSV path we used earlier, and I'm also adding a new column in the observation data frame called sample. I'm just taking this CSV path, splitting it on the underscore, and then returning the second item, so I'll get the sample identifier from the path and save it in the sample column. I'm just filtering out the doublets, and then here there's one small difference from what we did earlier: I'm including a minimum genes threshold just in case some samples had fewer, and I also got rid of the gene filtering.
Speaker 1
00:28:49 - 00:29:24
I'm not filtering any genes out of any of the samples for now, until I combine them later on. But the rest is the same: calculating the mitochondrial statistics, the ribosomal statistics, and then filtering out the outliers. And then I'm just returning the now-filtered adata object. To loop over my 26 samples I'm going to use os, so I'm going to import os and list the directory. It just returns a string for every file in this raw counts folder that I saved them all in.
Speaker 1
00:29:24 - 00:30:16
So I'm just gonna do 'for file in' this list, and then I'm gonna run this pp function. And then we have to append the result to a list, which I'll just call out, and the CSV path is going to be the raw counts folder plus our file name. Of course this is all going to be a little different depending on your data and how you have it set up, but the idea is the same: if you have multiple samples, you don't want to have to write out hundreds of lines of code. So we'll go ahead and run this, and in the end we'll have a list of AnnData objects. Since I have 26 samples, it's gonna take me about an hour, probably.
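A condensed sketch of that per-sample function and the loop; the folder name, the file-naming convention, and the split on '_' for the sample label are all specific to how these files were saved, so adjust them to your own paths (ribo_genes is the ribosomal gene list loaded earlier):

```python
import os
import numpy as np
import scvi

def pp(csv_path):
    """Per-sample preprocessing: doublet removal plus QC filtering."""
    # --- doublet prediction on a temporary copy of the sample ---
    adata = sc.read_csv(csv_path).T
    sc.pp.filter_genes(adata, min_cells=10)
    sc.pp.highly_variable_genes(adata, n_top_genes=2000, subset=True, flavor='seurat_v3')
    scvi.model.SCVI.setup_anndata(adata)
    vae = scvi.model.SCVI(adata)
    vae.train()
    solo = scvi.external.SOLO.from_scvi_model(vae)
    solo.train()
    df = solo.predict()
    df['prediction'] = solo.predict(soft=False)
    df.index = df.index.map(lambda x: x[:-2])            # strip the "-0" suffix
    df['dif'] = df.doublet - df.singlet
    doublets = df[(df.prediction == 'doublet') & (df.dif > 1)]

    # --- reload the raw counts, drop doublets, run the QC filtering ---
    adata = sc.read_csv(csv_path).T
    adata.obs['sample'] = csv_path.split('_')[1]          # sample label pulled from the filename
    adata.obs['doublet'] = adata.obs.index.isin(doublets.index)
    adata = adata[~adata.obs.doublet]
    sc.pp.filter_cells(adata, min_genes=200)              # minimum-genes threshold; no gene filtering yet
    adata.var['mt'] = adata.var.index.str.startswith('MT-')
    adata.var['ribo'] = adata.var_names.isin(ribo_genes[0].values)
    sc.pp.calculate_qc_metrics(adata, qc_vars=['mt', 'ribo'],
                               percent_top=None, log1p=False, inplace=True)
    upper_lim = np.quantile(adata.obs.n_genes_by_counts.values, 0.98)
    adata = adata[adata.obs.n_genes_by_counts < upper_lim]
    adata = adata[adata.obs.pct_counts_mt < 20]
    adata = adata[adata.obs.pct_counts_ribo < 2]
    return adata

out = []
for file in os.listdir('raw_counts'):                     # hypothetical folder of per-sample CSVs
    out.append(pp('raw_counts/' + file))
```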
Speaker 1
00:30:18 - 00:30:40
Alright, so let's check the output now. If you look at any individual object in the out list, it should be an AnnData object, so it looks like it worked correctly. Let's combine all of the objects within the out list into one AnnData object. We should just be able to do sc.concat and pass out.
Speaker 1
00:30:42 - 00:31:12
So now let's look at this. We have around 99,000 cells and we still have our 34,000 genes. And then if we look at the obs, here's that sample column we added, with the respective sample names and the other data that was added when we did the pre-processing. So I want to do 2 things, and then I'm going to save this just in case. I'm going to get rid of the genes that aren't in any of the cells, or are only in a small number of cells.
Speaker 1
00:31:12 - 00:31:45
So I'm just going to keep the genes that are in at least 10 cells. So we went from 34,000 to 29,000. And then if we look at the X, we see that it's a dense matrix, so it takes a lot more memory than a sparse matrix, dense meaning that it stores an actual value for every entry. So I'm going to convert X to a sparse matrix to reduce the size of the file and the amount of memory it takes to use.
Speaker 1
00:31:45 - 00:32:29
Theoretically I could have done it on each individual sample before I concatenated them, inside the function, but I have enough memory to do it this way. So this step isn't necessary if you only have a few samples, but since I have 26 I'm gonna go ahead and do it. So I need to import this csr_matrix from SciPy, so you might have to install SciPy if you don't have it. We're going to take adata.X and overwrite it with a CSR matrix of adata.X, and now if we look at adata.X, it's this sparse matrix instead of that dense matrix. So I'm just going to go ahead and save it now.
Speaker 1
00:32:30 - 00:33:23
I'm going to save it as an h5ad, which is the Scanpy AnnData format. All right, so it's a good thing I saved that, because I took a little break; I just reloaded my modules, the same ones we used before, and re-read that AnnData object. So if we want to look at how many cells we have for each sample, we can do a group-by count, and we see that we have anywhere from 1,300 to around 6,000, and we still have 29,000 genes. I'm gonna go ahead and use a slightly stricter threshold for filtering out genes, just to narrow this number down a little bit. I'm gonna get rid of the genes that aren't in at least 100 cells, because now we have 90,000 cells.
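A sketch of combining, converting, and saving (the h5ad file name is a hypothetical placeholder):

```python
from scipy.sparse import csr_matrix

adata = sc.concat(out)                     # stack all per-sample AnnData objects
sc.pp.filter_genes(adata, min_cells=10)    # drop genes seen in almost no cells across samples
adata.X = csr_matrix(adata.X)              # dense -> sparse to save memory and disk space
adata.write_h5ad('combined.h5ad')          # hypothetical output name

# After reloading in a fresh session:
adata = sc.read_h5ad('combined.h5ad')
adata.obs.groupby('sample').count()        # roughly 1,300-6,000 cells per sample
sc.pp.filter_genes(adata, min_cells=100)   # stricter gene filter now that there are ~90,000 cells
```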
Speaker 1
00:33:23 - 00:33:51
If you only have a couple of samples you could lower this number. So now we have 20,000 genes, and now I'm going to do the actual integration. So the data is combined, but we need to correct for batch effects and other things like mitochondrial counts and ribosomal counts. So I'm going to use scvi to do the integration. This isn't the only way to do integration with a Scanpy AnnData object, but I've found it to be one of my favorites.
Speaker 1
00:33:51 - 00:34:39
So to prep this adata for integration with scvi, we first need to save the raw data as it is now. This data hasn't been normalized, it hasn't been converted to log or anything. We're just saving the raw counts into this layer called counts, and we won't touch it again except to read from it. The next thing we can do is the normal normalization, like we did earlier for the single sample: we're just going to normalize the counts to 10,000 in every cell, convert to log, and then save this log-normalized data into the raw slot. So scvi will use this counts layer, but a lot of other modules and functions will use the raw data.
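Roughly:

```python
adata.layers['counts'] = adata.X.copy()        # untouched counts, kept for scvi

sc.pp.normalize_total(adata, target_sum=1e4)   # normalize each cell to 10,000 counts
sc.pp.log1p(adata)
adata.raw = adata                              # log-normalized data for plotting and DE functions
```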
Speaker 1
00:34:40 - 00:35:11
So just to remind everybody what we have, we have this adata object, and we're going to be using 4 different annotations here to correct our data. The sample: there are 26 different samples, all with their own unique label in this column. And then we're also going to use total counts, mitochondrial counts, and ribosomal counts, or rather the percent counts. So one important point here depends on how many cells you have; my number of cells is much higher than my number of genes.
Speaker 1
00:35:12 - 00:35:32
So for scvi, I don't have to filter out any more genes. But let's say you had up to 20,000 cells. You'd want the number of genes to be no more than about half the number of cells. So here you would do something like this: we find the highly variable genes, only the top 3,000.
Speaker 1
00:35:32 - 00:35:52
You can pick more if you had more cells, but if you had 10,000 cells, 3,000 would be good. Make sure to point at that counts layer. I'm not gonna run this, but if you were to run it on your sample, those 20,000 genes would be reduced to 3,000. So I'm just going to comment this out. Instead I'm going to go ahead and set up my scvi model.
Speaker 1
00:35:52 - 00:36:35
So we did this with completely default settings earlier, when we removed doublets for each individual sample, but now we're going to build an actual model for this combined data set. So for this scvi setup of the AnnData, we're passing our adata, pointing at the counts layer, and our categorical covariate key is going to be sample. So if you had multiple batches from different runs you could add another variable here. If you had, let's say, samples from different sequencing technologies you could add another one. But we only have the sample category, because these were all done in the same study using the same technology.
Speaker 1
00:36:35 - 00:36:54
And then for our continuous covariate keys we're just passing percent counts mitochondrial, total counts, and percent counts ribosomal, which we calculated earlier. So I'll just go ahead and run that. I must have been messing with the source code at some point; that warning shouldn't come up when you do it. But we have the AnnData set up now.
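A sketch of that setup (the optional highly-variable-gene step is shown commented out, as discussed above):

```python
# Optional: with fewer cells, reduce to ~3,000 variable genes first, computed on the counts layer.
# sc.pp.highly_variable_genes(adata, n_top_genes=3000, subset=True,
#                             layer='counts', flavor='seurat_v3')

scvi.model.SCVI.setup_anndata(
    adata,
    layer='counts',
    categorical_covariate_keys=['sample'],
    continuous_covariate_keys=['pct_counts_mt', 'total_counts', 'pct_counts_ribo'],
)
```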
Speaker 1
00:36:54 - 00:37:16
And we're going to go ahead and initialize that model. And now we're going to train the model. So this is going to take probably about 30 minutes or so just because I have so many cells. And again if you don't have a GPU this could end up taking you a while if you have a lot of cells. But it's definitely worth it, you know, just queue it up, go eat lunch, or queue it up before bed.
Speaker 1
00:37:17 - 00:37:44
All right, so once the model's trained we want to get a couple of things from it. First we can use get_latent_representation, and if we look at it, it's just a numpy array where the number of rows is the number of cells and there are 10 columns. So this is the scvi latent representation of our data, and we're going to be using it for clustering and UMAP. So let's go ahead and save it to the adata object.
Speaker 1
00:37:45 - 00:38:15
We're going to save it under obsm, and I'm just calling it X_scVI. And we can also get the scvi normalized expression, which is a cell by gene data frame. We're going to save this as a layer instead of overwriting the default raw values, so we're saving it as scvi_normalized, another layer in the adata object. And of course, now that we have these saved, you can call them back at any time, and we'll be using both of them.
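Roughly:

```python
model = scvi.model.SCVI(adata)
model.train()   # ~30 minutes here; much slower without a GPU

adata.obsm['X_scVI'] = model.get_latent_representation()   # cells x 10 latent dimensions
adata.layers['scvi_normalized'] = model.get_normalized_expression(library_size=1e4)
```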
Speaker 1
00:38:15 - 00:38:56
So let's go ahead and do the clustering now. We're going to use the Scanpy neighbors function just like earlier, but this time we're passing use_rep with X_scVI, so we're using the latent representation from scvi to calculate the neighbors. Once we calculate the neighbors, we're just going to run UMAP, and since the neighbors were calculated using X_scVI, we don't have to specify anything here. And then again, likewise with Leiden, we're just using default settings except for a resolution of 0.5, which we might change later, and since we have 90,000 cells this might take a minute or 2.
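Roughly:

```python
sc.pp.neighbors(adata, use_rep='X_scVI')   # neighbors from the scvi latent space, not PCA
sc.tl.umap(adata)
sc.tl.leiden(adata, resolution=0.5)
sc.pl.umap(adata, color=['leiden', 'sample'], frameon=False)
```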
Speaker 1
00:38:57 - 00:39:27
So once that's done, let's go ahead and plot the UMAP. We're gonna plot 2 UMAPs, one with the clusters labeled and one with the samples labeled, so we can see how the integration went. So you see we have nice clustering with pretty distinct subpopulations, and from the looks of it a very good integration of the data. We have a bunch of different colors in all the different clusters. This cluster here seems to be specific to this sample.
Speaker 1
00:39:27 - 00:39:59
I'm not sure what happened here. It's possible that there was an additional cell type collected in that sample that wasn't in the others. I guess we'll have to figure it out when we actually label the cell types, but if we had used some other integration method, like reference mapping, we might have missed these unique populations. So I'm gonna go ahead and write the integrated adata object, and now we're ready to find markers and label cell types. So we have 20 Leiden clusters right now.
Speaker 1
00:39:59 - 00:40:29
I might change the resolution; I just like to start with 0.5, but let's find the markers for these 20 clusters to start with and then go from there. So I'm going to be using both the raw data that wasn't integrated with scvi and the scvi-trained model data. If you only had one sample, or you didn't train the model with scvi, which is perfectly fine if you had one sample, just ignore anything you see with scvi in it, like this for example.
Speaker 1
00:40:29 - 00:40:48
This is just base Scanpy. We're gonna get the marker genes based on the Leiden clusters and the raw data that we saved. It throws a lot of warnings; I'm just going to hide some of those. This adds the results to the AnnData object, which we can plot with this function here.
Speaker 1
00:40:48 - 00:41:33
So I'm going to plot the marker genes, or the rank genes groups, and just show the top 20 for each cluster. This may be a little hard to see in the video, but each of these is just a different gene symbol, and the higher and further to the left, the more significant. It's big, though, so unless I really want to look at it I'm gonna hide it for now. Instead we can use the Scanpy get rank_genes_groups_df function to actually get a data frame where the group is the Leiden cluster and we have the gene symbol, log fold changes, and the p-values. So I'm just going to save this as markers, and then I'm also going to keep only the rows with an adjusted p-value of less than 0.05 and a log fold change of at least 0.5.
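In code, roughly:

```python
# Scanpy marker genes on the log-normalized data in .raw, per Leiden cluster.
sc.tl.rank_genes_groups(adata, 'leiden')
sc.pl.rank_genes_groups(adata, n_genes=20, sharey=False)

markers = sc.get.rank_genes_groups_df(adata, None)   # all clusters in one data frame
markers = markers[(markers.pvals_adj < 0.05) & (markers.logfoldchanges > 0.5)]
```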
Speaker 1
00:41:35 - 00:42:01
And then we can use our scvi model. We're calling the differential expression function from our model, again grouping by Leiden, and this will take a minute or 2. So now that this is finished, we can filter it like we did with the other data frame. It has an is_de_fdr 0.05 column, so we're only going to keep the rows that are true in this column, and then filter on the log fold change mean as well.
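Roughly like this, using the column names that scvi-tools' differential_expression output provides:

```python
markers_scvi = model.differential_expression(groupby='leiden')   # one-vs-rest per cluster
markers_scvi = markers_scvi[(markers_scvi['is_de_fdr_0.05']) &
                            (markers_scvi.lfc_mean > 0.5)]
```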
Speaker 1
00:42:02 - 00:42:45
So that'll look something like this. Alright, so now we've calculated our marker genes; let's start figuring out what some of these clusters are. I'm gonna go ahead and plot the UMAP, but with the legend on it so I can see which number corresponds to which cluster. We're starting out with 20 clusters, so I'm gonna make a dictionary that goes from 0 to 20 and then just manually fill it out as I decide what each cluster is. There are other ways to automatically label clusters, but I find that doing it manually, while it takes longer, is more accurate, especially when you don't have a good reference data set to map it to.
Speaker 1
00:42:45 - 00:43:17
So I'm not going to show every single cluster, but I'm going to show a handful so you get the idea of how to do it. I'm going to start by just printing 20 different dictionary entries so that it's easier to fill out. So I'm just going to copy this. Instead of typing that, oops, should have gone to 21. So this is going to be our dictionary, and when we've decided what cluster 0 is, for example, we can put the cell name here, and then we'll use this to map the cell labels once we're done.
Speaker 1
00:43:18 - 00:44:02
Just gonna bump that up and get a little working space here. I find it's easiest to start with cell types you know are going to be in the data. Almost every data set, unless it's been filtered out or they've gone to great lengths to remove them, is going to have some blood cells in it, and I also like to use the blood cells as an indicator of whether I need to reduce or increase the clustering resolution. So I'm going to start with CD45, whose gene symbol is PTPRC, then CD3 for T cells, and then CD4 for CD4 positive T cells. So I'd like to start with CD4 and CD8 positive T cells.
Speaker 1
00:44:02 - 00:44:35
In this case I'm using the scvi normalized layer; we can go back and forth if we want to use the raw data. And in case you don't know immune cell markers, you can just Google it and find a chart like this. Just note that these protein symbols might not correspond to the gene symbols, so sometimes you have to translate between the 2, which is usually just a Google search away. So it's pretty clear that this is our T cell cluster, and within the T cell cluster we see that CD4 is predominantly in this bottom portion.
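For example (CD3E stands in for CD3, CD8A is the CD8 marker used next, and the vmax cap is optional; all are illustrative choices):

```python
sc.pl.umap(adata, color=['PTPRC', 'CD3E', 'CD4', 'CD8A'],
           layer='scvi_normalized', frameon=False, vmax=5)
```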
Speaker 1
00:44:35 - 00:45:15
Let's see what happens when we run CD8 instead, which is CD8A, and we do see that CD8 is predominantly in this section up here. So CD4 was down here, and if we look at the clusters this is all in cluster 2, so that means I need to increase the resolution so that we split the CD4 and CD8 positive T cells. That's just the benchmark I like to use when clustering; it doesn't always have to be the case. We could just label cluster 2 as T cells, which would be easier for us, because when we increase the resolution we're gonna have to label more clusters.
Speaker 1
00:45:16 - 00:45:35
But anyways, I'm just gonna scroll up to where we set the resolution, and I'm just going to run it again. Let's see how 0.8 does. So... of course now we have 24 clusters instead of 20.
Speaker 1
00:45:35 - 00:46:05
From what we saw below, I still remember there being a lot of CD8 here and these being predominantly CD4, so I'm gonna try a little higher. If we look at this now, it looks a lot more like what I was expecting. We have 28 different clusters, but likely some of them will collapse into the same cell type; I just really wanted to separate those CD4 and CD8 positive cells. So let's go ahead and rerun the lines of code that find the marker genes.
Speaker 1
00:46:06 - 00:46:51
Alright, so we're back here with the new clustering. Let's recheck the T cells, so CD4 and CD8. We do see now that CD4 corresponds well to this cluster down here, which is cluster 2, and then CD8 is more so up here, which corresponds to cluster 5. So let's just double check that in the actual marker data frame. If we look at CD4, we see that cluster 2 does have a significant increase in CD4 compared to the rest of the cells, and likewise if we do CD8 we see that cluster 5 has a very large increase in CD8, so I feel confident calling cluster 5 CD8 positive and cluster 2 CD4 positive T cells.
Speaker 1
00:46:52 - 00:47:28
So I've just added them to the dictionary we made earlier. Another thing I like to do, if you're not an expert in the tissue that you're working with, is find someone else who has published single cell data in the same tissue, even if it's a different organism like mouse, just because they'll have similar cell types. For example, it's kind of cheating because this is the same data, but there's a lot of lung single-cell data out there, so we could just go through these doing something similar. Obviously we know there are a lot of alveolar macrophages within the lung.
Speaker 1
00:47:28 - 00:47:50
Let me show you this other resource that is very useful for labeling cell clusters. This PanglaoDB is great. We can go to datasets, cell type markers, and if we search for lung, you see they have several different cell types. Let's go to alveolar macrophages, and here we see the different markers.
Speaker 1
00:47:50 - 00:48:19
This one's only in mouse, so we're going to use the next one, MRC1, and find which cluster is expressing that. So let's go back and plot just MRC1 and CD45, and let's try it without the scvi normalized layer. We see that this whole population right here is likely alveolar macrophages. But we also know that macrophages are similar to other myeloid cells like dendritic cells and monocytes.
Speaker 1
00:48:20 - 00:48:53
So let's start by labeling monocytes and dendritic cells and then label everything else in that cluster as macrophages. So if we go back to PanglaoDB, we go to cell type markers, then immune dendritic cells, and let's just choose this top cell marker and plot it here. Let's see if we can see it better with the normalized layer. Let's also go ahead and look for it in the marker data frame.
Speaker 1
00:48:55 - 00:49:12
24 looks to be the dendritic cells. For monocytes, let's again find a marker; I'm gonna try this one here. It's kind of hard to tell from the plot, so let's look at the data frame. 16.
Speaker 1
00:49:14 - 00:49:51
Yeah, so even though we can't see it on the plot, you see cluster 16 has a 6 log fold change. So 16 is monocytes, 24 is dendritic cells, and then the rest is likely macrophages. And again, if we look at the study, we see that they have DCs, monocytes, and macrophages in this one big cluster. So you see we've already made pretty good progress in a short period of time. I'm going to move on with the rest of the immune cells.
Speaker 1
00:49:52 - 00:50:23
Usually, you know, we'll have B cells, and if we look at the study they do have a population of B cells, and then we also expect some NK cells. So let's knock those out real quick using typical NK and B cell markers. So cluster 18 clearly expresses this B cell marker, but the NK cells are being a little trickier than I thought. Looks like we might have a mixed population of NK and T cells here. I'm gonna ignore that for now and might fix it at the end.
Speaker 1
00:50:23 - 00:50:56
So I'm gonna keep on working forward. I'm gonna do other cell types that are common to almost every tissue, such as endothelial cells and fibroblasts, which typically have really well-defined markers and are easy to label. For example, if you go to vasculature, PECAM1 is of course a really well-known endothelial marker, and we clearly see that cluster 6 is endothelial cells. So that's an easy one to label. If we go to connective tissue and fibroblasts, let's just pick VTN.
Speaker 1
00:50:57 - 00:51:16
So we get an error because it's not in the data frame. Let's just pick the next one, COL6A2. It looks like there's some coloring in cluster 1, but let's try the non-normalized data. So it looks like cluster 1 for sure. Let's look at the differential expression.
Speaker 1
00:51:17 - 00:51:55
Yeah, so cluster 1, and then maybe cluster 10 as well. We'll wait to label that one, but likely cluster 10 is fibroblasts as well. Typically smooth muscle can be similar in gene expression, so let's just label cluster 1 as fibroblasts for now, though it's likely that some of these clusters around 1 are also fibroblasts. Then we're just going to keep working our way down the different cell types in the lung; let's do AT1 and AT2. These seem pretty clear as well, AGER being for AT1 and SFTPC for AT2. I don't quite see it as well on the scvi normalized layer.
Speaker 1
00:51:55 - 00:52:22
I'm sure if I put a max on this we would see it better; I should have been doing this earlier. So if we use the vmax argument, let's see what happens. If we set the vmax really low, it's really apparent, even in the scvi normalized layer. So changing the vmax actually makes a big difference in these expression plots, because even after being normalized there's big variation.
Speaker 1
00:52:22 - 00:53:05
But anyways, it looks like cluster 4 is AT1, and cluster 3 and cluster 9 appear to be AT2. And of course you can always confirm this down here, or likewise in the scvi model as well; so cluster 4 has the higher log fold change. But when it's this apparent in the expression plots you don't really have to confirm it. So I'm just gonna work my way through doing this for all the different cell types, and once I'm done with the cell types I know should be in there, there are gonna be a few blank clusters, and then I'm going to dig down into those clusters a little more.
Speaker 1
00:53:07 - 00:53:32
So smooth muscle was a little tricky, because smooth muscle, pericytes, and fibroblasts all kind of cluster together. So alpha-SMA here is a marker for both smooth muscle and pericytes, and you can see both cluster 21 and 25 have high expression. But I saw that 21 and 25 are of course not overlapping at all, so they seem to be different populations of cells.
Speaker 1
00:53:32 - 00:54:05
So again, I can just go back to PanglaoDB, find pericytes, and we see the second marker after alpha-SMA, which is hopefully specific to pericytes. I ran that, and clearly 25 has much higher expression of it than 21. So it looks like 21 is smooth muscle and 25 is pericytes. Interestingly, in their publication they didn't call out pericytes separately, but obviously that wasn't the point of their study.
Speaker 1
00:54:05 - 00:54:48
Here's an opposite example. Let's say I didn't know there were any neurons in this data set and I can't label cluster 23. You could go and filter the markers, just look at the ones from cluster 23, look at the top genes, copy those, and inside PanglaoDB there's a search function where you can just paste the gene in. Don't always take this for granted, because sometimes the search can be way off; doing it the other way around is usually a lot better. But this is my first hint, and I see that the top hit is neuron, though I usually like to double check at least a couple.
Speaker 1
00:54:48 - 00:55:25
So let's go to the second gene, and we see more neurons among these top hits. A lot of these are some sort of neuron, but again, these take into account all cells, so even hematopoietic cells show up even though they are very dissimilar from neurons. So just keep that in mind. But those 2 data points would indicate to me that this is a neuronal cluster. And just as an example of why not to trust this gene search all the time: I searched the top hit of one of these clusters and enterocytes showed up.
Speaker 1
00:55:25 - 00:56:05
So obviously you don't have intestinal cells inside your lung; let's hope the people doing the autopsy didn't do that bad of a job. So, I've finished all of them except I'm having a little trouble with 20. I have some idea, but let me just show that. I'll just print out all these genes, copy them after I highlight them, and then you can just go to DAVID, start analysis, upload, pick official gene symbol, Homo sapiens, gene list, functional annotation tool, gene ontology.
Speaker 1
00:56:06 - 00:56:37
So this is still not 100% clear from this. I see some things that make me think it might be epithelial, but then we get some of this weird axon guidance stuff. Right now I'm leaning towards epithelial; I'll just have to look into it a little more. So I just picked the 2 top epithelial markers from PanglaoDB, and it does look like it's some sort of epithelial cell. It expresses both of these, and it gives me some comfort that it's in close proximity to other epithelial cells.
Speaker 1
00:56:38 - 00:57:03
So I'm going to go ahead and label it as epithelial. What's also interesting about this cluster is that it was unique to that one control sample. So maybe they cut out a bigger section and got some epithelial cells that weren't in other samples. It's hard to tell, but we have successfully labeled all these cells. I usually like to double check them, but for the sake of the tutorial I'm not going to, because I've already spent enough time doing this.
Speaker 1
00:57:03 - 00:57:22
So let's just save this as a dictionary. And then, if we remember adata.obs, we have this leiden column, and we can just map using that dictionary. I forgot to put leiden here. So you see it returns the cell type, and we can just save this as another column.
Speaker 1
00:57:24 - 00:57:41
We'll just call it cell type. Now you can plot this. I'm just going to take this off for now to make it a little prettier. And so this is what we have. I never got around to fixing the NK cells, but it wouldn't have been that hard to do.
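Roughly, with the dictionary keys being the Leiden cluster labels as strings and the names being whatever you settled on (the entries below are just illustrative):

```python
cell_type = {
    '2': 'CD4+ T cell',      # illustrative entries; fill in one per Leiden cluster
    '5': 'CD8+ T cell',
    '6': 'Endothelial cell',
    '16': 'Monocyte',
    '24': 'Dendritic cell',
    # ...
}

adata.obs['cell type'] = adata.obs.leiden.map(cell_type)
sc.pl.umap(adata, color=['cell type'], frameon=False)
```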
Speaker 1
00:57:41 - 00:58:20
It's hard to say if it's better or worse than what they did in this dataset, because we used different methods. scvi does a really good job though, so I think I personally like my integration better, and it looks like they used reference samples, which has limitations and might be why they don't have some of the same populations that we have. But anyways, let's save this, and we also want to save those markers data frames, especially the scvi ones, so we don't have to rerun that model. So I'm just going to save them to the unstructured (uns) slot in the adata, and then overwrite the file we wrote earlier.
Speaker 1
00:58:22 - 00:58:50
And actually I just went ahead and saved the model too, just in case we want to use it to do differential expression without waiting another 30 minutes. It's been a couple of days, but I'm ready to start on the analysis. So the first thing I'm going to do is recreate some of these cell frequency plots they have. Here they've just calculated the frequency of each cell type in all the 26 different samples and made a simple box plot. So if we look at adata.obs, we have this sample column.
Speaker 1
00:58:50 - 00:59:26
So let's just look at the unique. So we have the sample names, but we never made a column to say whether the sample was a control sample or a COVID sample. You can see here they're clearly defined as COV or CTR. Of course, you might not need to do this with your data set, but in this instance, I'll show you what I'm going to do here. So I'm just going to go ahead and write this simple little function that's going to take the sample label for each cell and if CoV is in it it's going to return COVID-19 and if CoV isn't in it it's going to return control.
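Something like this, assuming the sample names contain 'cov' or 'ctr' as they do in this dataset:

```python
def map_condition(x):
    # sample labels look something like 'cov01' / 'ctr05' here
    if 'cov' in x.lower():
        return 'COVID19'
    return 'control'

adata.obs['condition'] = adata.obs['sample'].map(map_condition)
adata.obs.condition.value_counts()
```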
Speaker 1
00:59:26 - 01:00:01
So I'm just going to map that function to the sample column in the observation data frame, and I'm making a new column called condition. So if we do that and then look at the obs data frame, now you see I've added a condition that's either COVID-19 or control. So for the frequency, we need to count the number of cells of each cell type and the total number of cells in each sample. Let's start by counting the total number in each sample. So if we do an obs group-by-sample count, we see that we have the counts for each sample.
Speaker 1
01:00:01 - 01:00:39
So all I'm going to do is take this data frame and zip the index column and any one of these count columns into a dictionary. So now we just have a dictionary of the sample and the total number of cells, which we can call whenever we need the total number of cells for any given sample. Now we can group the observation data by sample, condition, and cell type, and then just get rid of the 0 rows and break this index down. So we keep only rows with at least 1 cell.
Speaker 1
01:00:39 - 01:01:08
So now you see, for any given sample, any given condition, so control or COVID-19, and every cell type, we have the number of cells. Since all these columns are redundant, I'm just going to keep the first 4. And the first thing I'm going to do is add a new total cells column using that dictionary we made: here's the dictionary, and we're just mapping the samples. I also got an error and had to convert this into an int column.
Speaker 1
01:01:10 - 01:01:29
So we've added this column, and then we're going to add a frequency column, which is just this count column (I know it says doublet; it's not actually doublets, it's just the counts, and I'm too lazy to rename it) divided by the total cell number to get the frequency. So now we have the frequency, which we're going to plot.
Speaker 1
01:01:30 - 01:01:41
And we'd plot it... Oops. Need to import matplotlib. So we get something like this. I'm not going to spend time making this pretty.
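A sketch of the counting and the box plot (the 'doublet' column is just being used as a generic count column here, as noted above):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Total cells per sample, as a lookup dictionary.
num_tot_cells = adata.obs.groupby(['sample']).count()
num_tot_cells = dict(zip(num_tot_cells.index, num_tot_cells.doublet))

# Cells per (sample, condition, cell type), then convert to frequencies.
counts = adata.obs.groupby(['sample', 'condition', 'cell type']).count()
counts = counts[counts.sum(axis=1) > 0].reset_index()        # drop empty combinations
counts = counts[counts.columns[0:4]]                         # sample, condition, cell type, count
counts['total_cells'] = counts['sample'].map(num_tot_cells).astype(int)
counts['frequency'] = counts.doublet / counts.total_cells    # 'doublet' holds the per-group cell count

plt.figure(figsize=(10, 4))
sns.boxplot(data=counts, x='cell type', y='frequency', hue='condition')
plt.xticks(rotation=35, ha='right')
plt.show()
```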
Speaker 1
01:01:41 - 01:02:06
I have other videos that go over that in depth, but it's pretty much identical to what they show here, except they're also showing the points, which is easy to do and I cover in another video. They also did a non-parametric test between each group; just for the sake of time, I'm going to save that for later in the video. So let's move on to some differential expression. Alright, we're gonna recreate some of these analyses.
Speaker 1
01:02:06 - 01:02:49
So they did a subset on epithelial cells, and then they did differential gene expression between AT1 and AT2 cells and plotted the top differentially expressed genes in a heat map. They did some basic comparisons between some genes, and they computed this DATP signature based on a gene list. So we're gonna recreate all of those. We're not gonna redo the epithelial cell subclustering that they did here, because it's not necessary for doing the differential expression, but what you would do is just subset the adata, including the raw data, and redo the typical Scanpy processing or the scvi processing that we did. The first thing we're going to do is subset the data on cell type. So, is the cell type label in this list?
Speaker 1
01:02:49 - 01:03:05
So is it AT1 or AT2? We're going to modify it, so just make sure to pass copy. Now, the authors of this paper did a really basic differential expression: they just used the FindMarkers function in Seurat, which a lot of people do. It's not really the best approach.
Speaker 1
01:03:05 - 01:03:40
Doing something more advanced like pseudobulk is preferred. What we're gonna do is either use the scvi model to do the differential expression, or use this diffxpy package. I'm going to start with diffxpy for multiple reasons. One is that, depending on how many genes you had in the scvi model, if you only used 2,000 variable genes, for example, only those 2,000 are going to be in the differential expression analysis. Or, if you didn't even use scvi, you'd use diffxpy of course.
Speaker 1
01:03:40 - 01:03:55
In my case, since I have 20,000 genes in my scvi model, I would just use scvi differential expression. But anyway, I'm just going to show both; they're pretty simple but powerful. So we're going to import diffxpy, which you of course have to install.
Speaker 1
01:03:55 - 01:04:29
They have instructions if you just Google it. It needs a dense array, so we need to convert the X back into a dense array. And then if we look at how many genes we have, we still have the number of genes from our total adata object, which is 20,000. I'm gonna re-filter the subset, because since we subset just to AT1 and AT2, some of these genes might only be in a small number of cells, or even in none of the cells. You could of course lower this number, but we still have 17,000 cells, so 100 is still okay.
Speaker 1
01:04:31 - 01:04:54
So we're down to 12,000 genes now. One thing I wish I'd done earlier is, instead of naming this column 'cell type' with a space, I should have named it 'cell_type' with an underscore. You should always avoid spaces if you can; I usually do, I don't know what I was thinking, but diffxpy will give us an error. So I'm going to rename 'cell type' to 'cell_type'.
Speaker 1
01:04:54 - 01:05:18
Something I forgot to mention is that if you ran scale and regress_out on your data, you'll have to set X back to the unscaled data, i.e. the normalized and log-transformed values. But since we used scVI for all the downstream clustering, we didn't do that. So there are multiple different differential expression comparisons we can do now. In their paper, they did AT1 versus AT2, which is what I'm going to recreate.
Speaker 1
01:05:19 - 01:05:43
But if you wanted to, you could also test between COVID and non-COVID based on our condition column; if we look at subset.obs, 'condition' is the column name. We're going to do what they did in the paper, though, and pass cell type, so again the column name is 'cell_type'. We're just running this Wald test from diffxpy. If you want to get fancier with covariates and so on, you can check out their website.
Speaker 1
01:05:43 - 01:06:09
I'm basically doing a default Wald test. This will take maybe 5 minutes or so. Once that's done, we can call res.summary() to return a data frame. I'm just going to sort by log fold change and then reset the index. In this case you'll see some values like 283 or negative 297; that's because some genes are expressed in one group and not at all in the other, but it should only be a handful.
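A minimal version of that test call, including the column rename mentioned above (a sketch of a default Wald test; check the diffxpy docs for covariates and other options):

```python
import diffxpy.api as de

# Rename the obs column so the formula below doesn't contain a space
subset.obs = subset.obs.rename(columns={'cell type': 'cell_type'})

# Default Wald test: model each gene on cell_type and test that coefficient
res = de.test.wald(
    data=subset,
    formula_loc='~ 1 + cell_type',
    factor_loc_totest='cell_type',
)

dedf = res.summary()   # per-gene results as a pandas DataFrame
dedf = dedf.sort_values('log2fc', ascending=False).reset_index(drop=True)
dedf.head()
```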
Speaker 1
01:06:10 - 01:07:10
And then it's time to find out which group was the test group. I think it depends on the order in your data frame; this is one of the things I wish diffxpy would change, being able to easily specify which group comes first or which is the reference group. I think it's usually the second one that shows up, but since I'm paranoid, I just write this little function: it takes the most upregulated gene from the differential expression data frame we just made, finds its index in our AnnData, slices out the values for AT1 and then the same for AT2, and prints their means. In this case the most upregulated gene shows lower expression in AT1 than in AT2, so I'm just confirming what I said earlier: AT2 is indeed the test group here.
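A paranoid little sanity check along those lines (the helper name and arguments are mine, not a diffxpy API):

```python
import numpy as np

def check_direction(df, ad, gene_col='gene', group_col='cell_type'):
    """Print the mean expression of the most 'upregulated' gene in each group,
    so we can see which group diffxpy actually treated as the test group."""
    top_gene = df.iloc[0][gene_col]          # df is sorted by log2fc, descending
    gi = ad.var_names.get_loc(top_gene)      # column index of that gene
    for group in ad.obs[group_col].unique():
        mask = (ad.obs[group_col] == group).values
        print(group, float(np.mean(ad.X[mask, gi])))

check_direction(dedf, subset)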
Speaker 1
01:07:10 - 01:07:41
So this data frame is AT2 versus AT1. However, let's say I wanted AT1 versus AT2: all I'm going to do is multiply the log2 fold change by negative 1. Then I reorder based on the log2 fold change column and reset the index, so it's basically the same data frame, just flipped. Then let's filter out the insignificant rows, keeping a q-value less than 0.05 and an absolute log fold change greater than 0.5.
Speaker 1
01:07:41 - 01:08:08
We could change these thresholds a bit if we wanted to. As I mentioned earlier, the Wald test doesn't really get rid of the genes that are expressed in only a small number of cells, so we can also apply a mean-expression threshold here to remove those. That leaves around a thousand differentially expressed genes. What they did is plot the top 25 up-regulated and top 25 down-regulated genes, so we'll do something similar.
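The flip-and-filter step might look like this (q < 0.05 and |log2FC| > 0.5 are the thresholds above; the mean cutoff of 0.15 is just an illustrative value, not one from the paper):

```python
# Flip the comparison so positive log2fc means higher in AT1 than in AT2
dedf['log2fc'] = dedf['log2fc'] * -1
dedf = dedf.sort_values('log2fc', ascending=False).reset_index(drop=True)

# Keep significant, reasonably changed, reasonably expressed genes
deg = dedf[(dedf['qval'] < 0.05) &
           (dedf['log2fc'].abs() > 0.5) &
           (dedf['mean'] > 0.15)]        # arbitrary mean-expression cutoff
len(deg)
```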
Speaker 1
01:08:08 - 01:08:38
Since this is sorted, we can take the top 25 genes from the data frame and the bottom 25 genes and put them in a list. So the bottom 25 genes, the top 25 genes, and I'm just combining them into one list of genes. Then we can take that list and pass it to the scanpy heatmap function, passing our subset AnnData, this gene list we just made, and grouping by cell type, so AT1 or AT2.
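Continuing from the filtered data frame above, putting the gene list together and plotting it is just a couple of lines:

```python
import scanpy as sc

top25 = deg['gene'].head(25).tolist()      # most up in AT1
bottom25 = deg['gene'].tail(25).tolist()   # most up in AT2
genes = bottom25 + top25

# Heat map of those 50 genes, cells grouped by our renamed obs column
sc.pl.heatmap(subset, genes, groupby='cell_type', swap_axes=True)
```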
Speaker 1
01:08:39 - 01:08:55
And we get this heat map here. Of course we could make it a little prettier, etc.; I'm not going to do that right now. Alright, differential expression with scVI is really simple if we already have the trained model. We trained and saved it, so I'm just going to read it back in.
Speaker 1
01:08:55 - 01:09:22
Again, I was playing around with the source code, so don't worry about that; it's just our model that's already been trained. We're going to do the same differential expression comparison. Here, to specify which cells are in group 1 and group 2, we have to pass boolean arrays: if we look at one, it's just true or false for whether each cell is in that group. So it's the same expression you'd use to filter the AnnData, just without indexing the AnnData itself.
Speaker 1
01:09:23 - 01:09:55
So we're just comparing cell type AT1 versus cell type AT2. If we wanted to do something fancier, like compare condition, so all the AT1 and AT2 cells that are COVID versus all the AT1 and AT2 cells that are control, we could do something like this. But we're gonna stick with this, because it's what they did in the paper. So we get this scVI data frame that's similar to our markers data frame, and we're just going to filter it similarly to what we did with the diffxpy data frame.
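A sketch of the scVI route, assuming the model was saved earlier to a directory I'm calling 'model' here (the path and the obs column names are my assumptions):

```python
import scvi

# Reload the trained model; the directory name is whatever you used when saving
model = scvi.model.SCVI.load('model', adata=adata)

# Boolean masks selecting the two groups of cells to compare
idx1 = adata.obs['cell type'] == 'AT1'
idx2 = adata.obs['cell type'] == 'AT2'
# Fancier comparisons just combine masks, e.g. COVID vs. control within these cell types:
# idx1 = adata.obs['cell type'].isin(['AT1', 'AT2']) & (adata.obs['condition'] == 'COVID-19')
# idx2 = adata.obs['cell type'].isin(['AT1', 'AT2']) & (adata.obs['condition'] == 'control')

scvi_de = model.differential_expression(idx1=idx1, idx2=idx2)
scvi_de.head()
```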
Speaker 1
01:09:55 - 01:10:23
So: is the FDR less than 0.05, which is just this true/false column over here, and is the absolute value of the log fold change greater than 0.5; then I'm just sorting it. Now we have this sorted data frame of 3,800 genes that are significantly differentially expressed. In this data frame we also have raw_normalized_mean1 and raw_normalized_mean2 for the two different groups; you can see that in this example a gene is high in one group and low in the other.
Speaker 1
01:10:23 - 01:10:40
I'm just going to do one simple filter to get rid of the genes that are low in both groups: if a gene is lower than 0.5 in both groups, it gets filtered out here. That removes a decent number. I'm only doing this for plotting; if you're doing gene ontology, I would keep everything.
Speaker 1
01:10:40 - 01:11:20
And like we did earlier, I'm just gonna take the top 25 and the bottom 25 genes from this data frame, since it's sorted. Then when we plot the heat map, this time I'm going to use the scvi_normalized layer and pass log=True, but everything else is the same. And now we have this nice heat map of the top 25 upregulated and top 25 downregulated genes. They included some extra scores up here, which I'll get into in a minute when we actually calculate the score for this violin plot down here. But first, since we did differential expression, I want to show you how to do some gene ontology or other gene list enrichment.
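Roughly, the filtering and plotting of the scVI results might look like this (the column names 'is_de_fdr_0.05', 'lfc_mean', and 'raw_normalized_mean1/2' are the ones scVI produced in my run, and the 'scvi_normalized' layer is one added earlier; check your own output if your versions differ):

```python
# Keep significant genes with a decent fold change, sorted by effect size
sig = scvi_de[scvi_de['is_de_fdr_0.05'] &
              (scvi_de['lfc_mean'].abs() > 0.5)]
sig = sig.sort_values('lfc_mean', ascending=False)

# Drop genes that are lowly expressed in both groups (0.5 is an arbitrary cutoff)
sig = sig[(sig['raw_normalized_mean1'] > 0.5) |
          (sig['raw_normalized_mean2'] > 0.5)]

# Top and bottom 25 genes; the data frame is indexed by gene name
genes = sig.index[:25].tolist() + sig.index[-25:].tolist()

sc.pl.heatmap(adata[idx1 | idx2].copy(), genes, groupby='cell type',
              layer='scvi_normalized', log=True, swap_axes=True)
```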
Speaker 1
01:11:21 - 01:11:50
There's this super nice package called gseapy. It's super easy to use, and it gives us access to a bunch of different gene set libraries, all sorts of stuff. The ones you might be interested in are the gene ontology sets, so we can pick GO Biological Process, and if we wanted to we could look at KEGG pathways as well. There are a bunch of different ones to choose from, but these are probably the most common.
Speaker 1
01:11:51 - 01:12:19
For our background gene list we're just going to use the var names from the subset. For the genes we can pull them directly from the diffxpy differentially expressed data frame, but the concept is the same with the scVI one. So we have our data frame, and I'm going to take just the upregulated genes; of course, just flip the comparison here to get the downregulated ones. Then I take the gene column (it's the index in the scVI data frame) and convert it to a plain list.
Speaker 1
01:12:19 - 01:13:03
So we can take this list and pass it to the gseapy enrichr function, passing that background list, which is just the var names from the subset, and then we can put whatever gene set library we want here, or multiple libraries. It sends the query to their server, I think, and returns enr. We can take enr.results to get a pandas data frame, so now we have our pathways and our GO Biological Process enrichment. These are pretty easy to plot, but I'll just use one of the common seaborn plot types. Alright, now we're going to focus on some of these simple violin plots they made for two different genes and two different cell types.
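A sketch of the enrichment call (the library names below are real Enrichr gene set names, but check the list gseapy exposes for current versions, and note that background support depends on your gseapy version; the up/down split assumes the filtered diffxpy data frame from above):

```python
import gseapy as gp

# Upregulated genes only; flip the comparison for downregulated
up_genes = deg[deg['log2fc'] > 0]['gene'].tolist()
background = subset.var_names.tolist()

enr = gp.enrichr(
    gene_list=up_genes,
    gene_sets=['GO_Biological_Process_2021', 'KEGG_2021_Human'],
    background=background,
    outdir=None,          # don't write result files, just return the object
)
enr.results.head()        # pandas DataFrame of enriched terms
```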
Speaker 1
01:13:03 - 01:13:36
Really simple, but I'll also show you how to do the p-values. More involved is this DATP signature, where we'll calculate a per-cell score from a gene list; this works for any gene signature. We'll start by recreating that violin plot they made. They looked at this ETV5 gene in just AT2 cells and compared the condition, so COVID versus control. Again, I'm not going to go into making these super pretty, or this video would get too long, but they're really easy to modify.
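The violin plot itself is basically one line with scanpy (assuming 'condition' holds the COVID-19 vs. control labels and 'cell type' the cell type labels):

```python
import scanpy as sc

# ETV5 expression in AT2 cells only, split by condition
at2 = adata[adata.obs['cell type'] == 'AT2'].copy()
sc.pl.violin(at2, 'ETV5', groupby='condition')
```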
Speaker 1
01:13:36 - 01:14:25
Okay, so how do we find out if this is actually significant? We're going to import scipy.stats, which has a lot of different statistical tests. I'm going to take this subset and, just so I don't have to type as much, assign it to a new object called temp, so I can type temp instead of the whole long expression. Then for this gene, ETV5, I need to get its values, so I need to find its index, i.e. where it falls in var_names. So i is just a number, its position in var_names. We use that position to slice the gene's values out of the counts matrix, once for the COVID-19 cells and once for the control cells.
Speaker 1
01:14:25 - 01:14:41
So a and b are just arrays of numbers, and you can see by looking at them that the values are not normally distributed, so don't do a t-test; we're going to do a non-parametric test. In this case, I'm just going to do a Mann-Whitney U test. And we see this is very significant.
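Spelled out, that test might look like this (the 'counts' layer and the condition labels are my assumptions; use whichever matrix you consider appropriate for testing):

```python
import numpy as np
import scipy.stats as stats

# Shorter handle: AT2 cells only (the column was renamed to 'cell_type' earlier)
temp = subset[subset.obs['cell_type'] == 'AT2']
gi = temp.var_names.get_loc('ETV5')              # position of the gene in var_names

def gene_slice(ad, mask, gene_idx):
    """Pull one gene's values for the masked cells as a flat numpy array."""
    vals = ad.layers['counts'][mask, gene_idx]   # 'counts' layer is my assumption
    if hasattr(vals, 'todense'):                 # handle sparse storage
        vals = vals.todense()
    return np.ravel(np.asarray(vals))

a = gene_slice(temp, (temp.obs['condition'] == 'COVID-19').values, gi)
b = gene_slice(temp, (temp.obs['condition'] == 'control').values, gi)

# Non-parametric test, since the expression values are clearly not normal
stats.mannwhitneyu(a, b)
```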
Speaker 1
01:14:41 - 01:14:57
Again, I'm not going to show it here, but you could easily add this text to the plot; I have other videos that go over that. I won't go over the next one, it's the same exact thing except you filter on AT1 cells instead and pick the CAV1 gene. So you can do this for any gene and any cell type.
Speaker 1
01:14:57 - 01:15:15
Okay, let's calculate this DATP signature. I got this gene list from their supplemental data; it's just 163 genes. I put it in a text file, one gene per line, and I'm just opening up that list in Python with typical Python syntax.
Speaker 1
01:15:15 - 01:15:36
It doesn't matter where the list comes from, as long as you have a list of genes like this. We're going to stick with the subset we made earlier and use this score_genes function from scanpy. It's a super useful function; I use it all the time. I even like to use it sometimes when I'm trying to figure out what cell type a cluster is, by passing cell type markers, for example.
Speaker 1
01:15:36 - 01:15:56
But anyway, we're passing the list we just made for this DATP signature, and I'm just going to give the score the name DATP. For the rest I'll use the default settings; depending on your list size and a few other things, you might want to change some of the settings in score_genes, so just look at the manual. This isn't suitable for a very small gene list, maybe fewer than 10 genes; I wouldn't use score_genes in that case.
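A sketch of loading the list and scoring (the file name 'datp_genes.txt' is just what I called my one-gene-per-line file):

```python
import scanpy as sc

# Read the signature: one gene symbol per line (hypothetical file name)
with open('datp_genes.txt') as f:
    datp_genes = [line.strip() for line in f if line.strip()]

# Score each cell for this signature relative to a random background gene set
sc.tl.score_genes(subset, datp_genes, score_name='DATP')
subset.obs['DATP'].head()
```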
Speaker 1
01:15:56 - 01:16:35
But for anything above that, it works really well. So all it did was add a new column, DATP, in our observation data frame. The value is just the relative expression of the genes from the list compared to a background gene set, so the actual value is arbitrary; you can't compare between gene lists, but you can compare between cells, which is what we're going to do. So we can just use sc.pl.violin and pass DATP, and since DATP isn't a gene, it will pull it from the observation columns; we're grouping by condition, which again is COVID-19 or control.
Speaker 1
01:16:36 - 01:17:08
And so we get this violin plot, and in this case, if you want to do a statistical test, you just filter the observation data frame by condition, grab the column values, and run Mann-Whitney again, or whatever test you want to do. And that's it. We see that COVID-19 has much higher expression of this signature, just like they showed in their paper. You could do this again with any gene set. If you wanted to, you could even plot this on a UMAP like this.
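Roughly, the plot, the test, and the optional UMAP overlay look like this (same assumed condition labels as before):

```python
import scanpy as sc
import scipy.stats as stats

# Violin of the per-cell DATP score, split by condition
sc.pl.violin(subset, 'DATP', groupby='condition')

# Mann-Whitney U test on the score values from the obs data frame
covid = subset.obs.loc[subset.obs['condition'] == 'COVID-19', 'DATP'].values
ctrl = subset.obs.loc[subset.obs['condition'] == 'control', 'DATP'].values
print(stats.mannwhitneyu(covid, ctrl))

# Or colour the UMAP by the score
sc.pl.umap(subset, color='DATP')
```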
Speaker 1
01:17:08 - 01:17:32
Or whatever you want to do. If you go over to the scanpy core plotting functions page on Read the Docs, they have a bunch of examples of the different plotting functions you might be interested in, plus nice ways to make them look pretty like this. It's a great resource just for giving you ideas of what plots you can make. They didn't do too much more in the paper; they mostly did similar analyses for different cell types, etc.
Speaker 1
01:17:32 - 01:17:57
For example, fibroblasts. If you wanted to do that, what I would do is just take a subset of the fibroblasts, re-cluster, and redo everything. So: we did pre-processing, clustering with one sample, and integration of multiple samples, in this case 26 samples. We found marker genes and plotted marker genes. I showed you some methods and tricks to label cells, even if you're not an expert in that tissue type.
Speaker 1
01:17:57 - 01:18:37
We counted the fraction of cells in each sample and cell type. We did differential expression between two different groups of cells, made some differential expression heat maps, and did gene ontology and KEGG enrichment. We did comparisons between genes in different conditions, including statistical testing, and we scored cells based on their expression of gene signatures and also plotted that. So I've more or less recreated a lot of the single-cell analyses from this Nature paper, or at least given you the tools to do something similar. If you've made it through this whole video, I applaud you, and I hope you learned a lot.
Speaker 1
01:18:37 - 01:18:37
Thanks for watching.