Summary of joint analysis methods for 10X spatial transcriptomics and 10X single-cell data.

This is a function of the Seurat package, and I have shared its specific usage before. The paper uses Seurat's scoring function AddModuleScore; you can have a look. The article that uses this method for joint single-cell and spatial analysis is "Multimodal Analysis of Composition and Spatial Architecture in Human Squamous Cell Carcinoma", published in Cell. I have read this article in detail and covered it in an earlier post (the spatial transcriptome and single-cell article). Here, let us briefly summarize how the paper combines the two data types.

The spatial transcriptome data are clustered so that spots with similar expression fall into the same cluster.

This method is used in the paper "Spatiotemporal analysis of human intestinal development at single-cell resolution", published in Cell. The paper mainly studies intestinal development, and it uses this joint analysis method to look at how cell types change during intestinal development.

This method was published in Nature Biotechnology.

This requires strong background knowledge, especially for irregular samples. Dividing the tissue into regions needs solid biological expertise as support, and this first step is very difficult.

I won't talk about the algorithm here. You can read the article I shared earlier. This method is used less often.

I have shared this method before. The article is about cell2location, a joint analysis method for 10X single-cell and spatial data. The method is similar to earlier deconvolution methods for bulk transcriptomes. The paper comprehensively maps tissue and cell architecture by integrating single-cell and spatial transcriptome data. Let's take a brief look at the process:

Cell2location maps the spatial distribution of cell types by integrating single-cell RNAseq (scRNA-seq) and multi-cell spatial transcriptome data from a given tissue.

From the schematic, the single-cell data are used as a reference to assign cell types to spatial positions, and the direction cannot be reversed.

The first step is to use the model to estimate cell type expression signatures from the single-cell data, for example by using conventional clustering to identify cell types and subpopulations and then estimating the average gene expression profile of each cluster (as shown in the figure below).
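As a minimal sketch of the "average profile per cluster" idea, the reference signatures can be approximated by a plain per-cluster mean. This is a simplification and not cell2location's own estimation procedure; the toy counts, the cluster labels, and the helper name mean_signatures are all hypothetical.

```python
import numpy as np

# Toy single-cell count matrix: 6 cells x 4 genes (values are made up).
counts = np.array([
    [5, 0, 1, 0],
    [7, 0, 3, 0],
    [0, 4, 0, 2],
    [0, 6, 0, 4],
    [1, 1, 8, 0],
    [3, 1, 10, 0],
], dtype=float)

# Hypothetical cluster labels, e.g. obtained from conventional clustering.
labels = np.array(["T", "T", "B", "B", "Myeloid", "Myeloid"])


def mean_signatures(counts, labels):
    """Return (cell_types, F x G matrix of mean counts per cell type)."""
    cell_types = np.unique(labels)  # sorted unique labels
    sig = np.vstack([counts[labels == ct].mean(axis=0) for ct in cell_types])
    return cell_types, sig


cell_types, signatures = mean_signatures(counts, labels)
print(cell_types)
print(signatures)
```

Each row of the resulting matrix is one cell type and each column one gene, in linear (not log-transformed) count space.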

This needs to be analyzed step by step. Cell2location implements this estimation step with negative binomial regression, so it can robustly combine data across technologies and batches. (Mathematics again.)

Step 2: cell2location uses the reference signatures to decompose the mRNA counts in the spatial transcriptome data, thereby estimating the relative and absolute abundance of each cell type at each spatial location. (This is the deconvolution of the data.)
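To make the decomposition idea concrete, here is a minimal sketch that swaps in ordinary non-negative least squares (scipy.optimize.nnls) for cell2location's Bayesian model. It only illustrates how per-spot counts can be explained as a non-negative combination of reference signatures; the signature values, the noise-free spot, and the helper name deconvolve_spot are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import nnls

# Hypothetical reference signatures: 3 cell types x 4 genes (mean counts).
signatures = np.array([
    [0.0, 5.0, 0.0, 3.0],
    [2.0, 1.0, 9.0, 0.0],
    [6.0, 0.0, 2.0, 0.0],
])

# A noise-free spot made of 2 cells of type 0, none of type 1, 3 of type 2.
true_abundance = np.array([2.0, 0.0, 3.0])
spot_counts = true_abundance @ signatures


def deconvolve_spot(spot_counts, signatures):
    """Solve min ||signatures.T @ w - spot_counts|| subject to w >= 0."""
    w, _residual_norm = nnls(signatures.T, spot_counts)
    return w


abundance = deconvolve_spot(spot_counts, signatures)  # "absolute" abundance
fractions = abundance / abundance.sum()               # relative fractions
print(abundance, fractions)
```

Unlike this stand-in, the real model also handles uncertainty, sensitivity differences between technologies, and residual variation, as described in the next paragraph.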

Cell2location is implemented as an interpretable hierarchical Bayesian model, which (1) provides a principled way to account for model uncertainty, (2) accounts for linear dependencies in cell type abundances, (3) models differences in measurement sensitivity between technologies, and (4) accounts for unexplained/residual variation by using a flexible count-based error model. Finally, thanks to variational approximate inference and GPU acceleration, cell2location is computationally efficient. We will share and analyze these methods in the next article.

To validate cell2location, the authors first used simulated data reflecting different cell abundances and spatial patterns. (The authors simulated the spatial transcriptome data.)

What we need to pay attention to here is the Jensen-Shannon divergence (J-S divergence). The mathematics is explained below.

In short, the authors simulated a spatial transcriptome dataset with 2,000 locations. Using the reference cell type annotations from a mouse brain snRNA-seq reference dataset with 46 cell types, the multi-cell gene expression profile of each location was obtained by combining cells sampled from different reference cell types, with each cell type following one of four abundance patterns of varying density and sparsity that mimic the patterns observed in real data. Cell2location was then applied to obtain the results in the figure. The correlation is generally high, but there is a problem here: the simulated spatial data are assembled from the single-cell data themselves. If real spatial transcriptome data contain cell types that are missing from the single-cell data (for example because of technical limitations; 10X single-cell capture of neutrophils is very poor), the predictions are likely to be wrong. Let's see whether the authors address this problem later.
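The simulation idea can be sketched in a few lines. This is a much smaller toy version under my own assumptions (random gamma-distributed signatures, a single sparse abundance pattern, Poisson noise), not the authors' simulation procedure, and all sizes and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

n_spots, n_types, n_genes = 200, 5, 50

# Hypothetical reference signatures: mean count of each gene per cell type.
signatures = rng.gamma(shape=2.0, scale=1.0, size=(n_types, n_genes))

# Sparse abundance pattern: each cell type is present in about 30% of spots,
# with 1 to 5 cells where it is present.
present = rng.random((n_spots, n_types)) < 0.3
cells_per_spot = present * rng.integers(1, 6, size=(n_spots, n_types))

# Expected counts per spot are the abundance-weighted sum of signatures;
# Poisson sampling adds count noise.
expected = cells_per_spot @ signatures
sim_counts = rng.poisson(expected)

print(sim_counts.shape, cells_per_spot.sum(axis=1)[:10])
```

Because cells_per_spot is known, any deconvolution result on sim_counts can be compared with the ground truth, which is exactly why every simulated cell type also exists in the reference.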

Next, cell2location is compared with recently proposed alternative methods for inferring relative cell type abundance from spatial transcriptomes. As is usual in such papers, the authors' own software performed best, and the model also produced more accurate estimates of relative cell type abundance.

What needs attention here is the PR (precision-recall) curve; these mathematical points are explained below.

Cell2location not only provides an estimate of the relative cell type fractions but also estimates the absolute cell type abundance, which can be interpreted as the number of cells expressing the reference cell type signature at a given location. These estimates are also highly consistent with the simulated ground truth (which is also very important).

In short, the results support that cell2location can accurately estimate the abundance of different cell types.

The article then uses two examples to show how this software handles joint analysis. We will not discuss the specific cases here; we need to know more about the principle of the algorithm.

Let's deal with the J-S divergence and the PR curve first.

KL divergence is also called relative entropy, information divergence, or information gain. It is an asymmetric measure of the difference between two probability distributions P and Q.

KL divergence measures the average number of extra bits required to encode samples from P using a code based on Q. Usually, P represents the true distribution of the data, and Q represents a theoretical distribution, a model distribution, or an approximation of the data distribution.

For discrete distributions it is defined as follows:

D_KL(P || Q) = Σ_i P(i) log( P(i) / Q(i) )

Because the logarithm is concave (equivalently, -log is convex), Jensen's inequality implies that the KL divergence is non-negative.
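The J-S divergence is built from the KL divergence: with M = (P + Q) / 2, JS(P, Q) = 0.5 * KL(P || M) + 0.5 * KL(Q || M). Unlike KL, it is symmetric and always finite. The following is a minimal sketch for discrete distributions using base-2 logarithms; the function names and the example distributions are my own, chosen for illustration.

```python
import numpy as np


def kl_divergence(p, q):
    """KL(P || Q) in bits for discrete distributions given as arrays."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with P(i) = 0 contribute nothing
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (symmetric, between 0 and 1)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)  # mixture is positive wherever p or q is positive
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


p = [0.5, 0.5, 0.0]
q = [0.25, 0.25, 0.5]
print(kl_divergence(p, q))                      # KL is asymmetric
print(js_divergence(p, q), js_divergence(q, p))  # JS is symmetric
```

In the cell2location benchmark context, P and Q would be the true and the estimated cell type proportions at a location, so a smaller J-S divergence means a better estimate.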

Comparing the PR curve with the ROC curve helps understanding. For the ROC curve, you can refer to my earlier explanation in the post on the role of the R package AUCell in single-cell analysis.

And PR curve
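A PR curve plots precision (the fraction of predicted positives that are truly positive) against recall (the fraction of true positives that are recovered) as the decision threshold is lowered. Below is a minimal sketch that treats every sample's score as a threshold and ignores tied scores; the function name and the toy labels and scores are hypothetical. In the benchmark setting, a label could mean "this cell type is truly present at this location" and the score the estimated abundance.

```python
import numpy as np


def precision_recall_curve(y_true, scores):
    """Precision and recall after each of the top-k ranked predictions."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")  # highest score first
    hits = y_true[order]
    tp = np.cumsum(hits)                  # true positives among the top k
    k = np.arange(1, len(hits) + 1)       # number of predicted positives
    precision = tp / k
    recall = tp / hits.sum()
    return precision, recall


y_true = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
precision, recall = precision_recall_curve(y_true, scores)
print(np.round(precision, 3))
print(np.round(recall, 3))
```

Compared with the ROC curve, the PR curve does not use true negatives, so it is more informative when positives are rare, as when most cell types are absent from most locations.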

Brief introduction of the model

For the complete derivation of the cell2location model, please refer to the supplementary computational methods of the paper. Simply put, cell2location is a Bayesian model that estimates the absolute cell density of cell types by decomposing the mRNA counts d_{s,g} of each gene g = {1, ..., G} at each location s = {1, ..., S}. For 10X Visium data, the count matrix is produced directly by the 10X Space Ranger software and can be imported into the data format used by the popular Python package scanpy (scanpy can read 10X output, and Seurat can also be used for the analysis). d_{s,g} should be filtered to the set of genes expressed in the single-cell reference g_{f,g}. In other words, when the single-cell data are mapped onto the spatial transcriptome, both must use the same set of expressed genes. The graphical model of cell2location is as follows:

Let g_{f,g} denote the F×G matrix of reference cell type signatures. It consists of the gene expression profiles of f = {1, ..., F} cell types for g = {1, ..., G} genes, and represents the mRNA count of each gene in each cell type in linear mRNA count space. This matrix must be provided to cell2location and can be estimated from scRNA-seq profiles. Here we can see that the gene expression of each cell type is averaged to represent that cell type. Cell2location models the elements of d_{s,g} as negative binomial distributed, so here we will say a little about the negative binomial distribution.

The negative binomial distribution is a discrete probability distribution in statistics. It arises under the following conditions: the experiment consists of a series of independent trials, each trial has two outcomes (success and failure), the probability of success is constant, and the experiment continues until r failures have occurred, where r is a positive integer. The distribution describes the number of successes observed up to that point. You can refer to the negative binomial distribution entry in Baidu Encyclopedia. From here on, though, the background starts to involve deeper mathematics. Mathematics is not my strength, so I hope a math expert can share more on this.
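For count data, the negative binomial is often written in terms of a mean mu and a dispersion theta, with variance mu + mu^2 / theta, which is larger than the Poisson variance mu. The sketch below uses this mean/dispersion form as an assumption for illustration (it is not taken from the cell2location derivation) and converts it to the (n, p) form that NumPy's sampler expects; the helper name sample_nb is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def sample_nb(mu, theta, size):
    """Draw negative binomial counts with mean mu and dispersion theta."""
    # NumPy counts failures before n successes with success probability p;
    # n = theta and p = theta / (theta + mu) give mean mu.
    p = theta / (theta + mu)
    return rng.negative_binomial(theta, p, size=size)


mu, theta = 10.0, 2.0
x = sample_nb(mu, theta, 200_000)

print("sample mean:", x.mean())      # close to mu
print("sample variance:", x.var())   # close to mu + mu**2 / theta
```

The extra variance relative to a Poisson model is what makes the negative binomial a common choice for noisy mRNA counts.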

Finally, the results of the analysis are presented.

This method is in the early stage and needs more verification.

This method is also a non-negative deconvolution method, and it is an R package. So far it has not been cited by high-impact articles, but the method is not bad. For the SPOTlight algorithm, you can see SPOTlight and spotlight_github; the algorithm is not introduced here, as shown in the figure below:

There are other approaches as well, such as scanpy's joint analysis method, which we will not cover in detail. I hope this is helpful to everyone.

