An Explanation of DCA (the Deep Count Autoencoder)

Machine learning techniques have recently become more and more popular in bioinformatics, especially for analyzing DNA and RNA sequences. The ability to obtain single-cell RNA sequencing (scRNA-seq) data opened a larger window for downstream analysis, thanks to the high resolution of single-cell data compared to bulk data.

The measuring techniques used to generate scRNA-seq datasets, however, introduce noise that can affect further analysis: amplification bias, library-size differences, a low RNA capture rate, and so on. The low capture rate, for example, can produce false zero counts for genes that are actually expressed. Such values can be seen as missing values, and they need to be separated from the true zero counts, which genuinely represent a lack of gene expression in the cell. Ideally, the false zero counts should be identified and imputed to reconstruct a noise-free version of the scRNA-seq dataset.

1. DCA

The paper *Single-cell RNA-seq denoising using a deep count autoencoder* by Gökcen Eraslan et al. proposed a machine learning method, the deep count autoencoder (DCA), which learns a compressed latent representation of the input scRNA-seq data and denoises it. The method takes the non-linearity and sparsity of the data into account, as well as the scalability of the model.

The proposed DCA model is an autoencoder, a type of neural network consisting of an encoder and a decoder. The model learns an efficient compression of the input data into a significantly reduced dimension. This compression forces the model to learn only the representative part of the data and to ignore non-essential variation (noise), which is how the model achieves denoising.

The following is an example of a standard autoencoder:

An example model of a standard autoencoder
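As a toy illustration of this compress-then-reconstruct idea, here is a minimal linear autoencoder in plain NumPy, trained by gradient descent on synthetic data. This is only a sketch: the layer sizes, learning rate, and purely linear layers are illustrative choices, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 100 samples of 20 features driven by 2 underlying factors plus noise
Z_true = rng.normal(size=(100, 2))
X = Z_true @ rng.normal(size=(2, 20)) + 0.05 * rng.normal(size=(100, 20))

# Linear autoencoder: encoder compresses 20 -> 2, decoder reconstructs 2 -> 20
W_enc = rng.normal(scale=0.1, size=(20, 2))
W_dec = rng.normal(scale=0.1, size=(2, 20))

initial_mse = np.mean((X @ W_enc @ W_dec - X) ** 2)

lr = 0.01
for _ in range(2000):
    Z = X @ W_enc                 # encode: compressed 2-D representation
    X_hat = Z @ W_dec             # decode: reconstruction of the input
    err = X_hat - X
    # Gradient descent on the mean-squared reconstruction error
    grad_dec = Z.T @ err / len(X)
    grad_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

final_mse = np.mean((X @ W_enc @ W_dec - X) ** 2)
```

After training, `final_mse` drops well below `initial_mse`: the 2-dimensional bottleneck is enough to capture the two underlying factors, while most of the added per-feature noise lies outside that subspace and is discarded.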

For the proposed DCA model, the main idea is that the loss function should represent the noise model. That is, the DCA model should minimize the negative log-likelihood under the noise model's distribution instead of minimizing the difference between the output and the original data itself. Given the sparse structure of scRNA-seq data, the suggested noise model is the negative binomial distribution, either with zero-inflation (ZINB) or without it (NB):

NB(x; μ, θ) = Γ(x + θ) / (Γ(θ) · x!) · (θ / (θ + μ))^θ · (μ / (θ + μ))^x

ZINB(x; π, μ, θ) = π · δ₀(x) + (1 − π) · NB(x; μ, θ)

where μ and θ are the mean and dispersion parameters of the negative binomial component, δ₀ is the point mass at zero, and π is the mixture coefficient that represents the weight of the point mass.

To decide which of the two noise models to use, one can compare the likelihoods of the data under NB and ZINB (e.g., via a likelihood ratio); the model with the higher likelihood fits the data better.
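To make this concrete, the sketch below compares the two log-likelihoods on simulated zero-inflated counts. Its assumptions are simplifications: the NB is fitted by moment matching and the ZINB is evaluated at its generating parameters, rather than the full maximum-likelihood fits one would use in practice.

```python
import numpy as np
from scipy.special import gammaln

def nb_logpmf(x, mu, theta):
    """Log-pmf of the NB parameterized by mean mu and dispersion theta."""
    return (gammaln(x + theta) - gammaln(theta) - gammaln(x + 1)
            + theta * np.log(theta / (theta + mu))
            + x * np.log(mu / (theta + mu)))

def zinb_logpmf(x, mu, theta, pi):
    """Log-pmf of the ZINB: point mass pi at zero mixed with (1 - pi) * NB."""
    nb = nb_logpmf(x, mu, theta)
    return np.where(x == 0,
                    np.log(pi + (1 - pi) * np.exp(nb)),
                    np.log1p(-pi) + nb)

rng = np.random.default_rng(0)

# Simulate zero-inflated counts (parameters chosen for illustration only)
pi_true, mu_true, theta_true = 0.5, 8.0, 2.0
x = rng.negative_binomial(theta_true, theta_true / (theta_true + mu_true),
                          size=2000)
x = np.where(rng.random(2000) < pi_true, 0, x).astype(float)

# Plain NB fitted by moment matching vs ZINB at the generating parameters
m, v = x.mean(), x.var()
theta_mm = m**2 / max(v - m, 1e-6)
ll_nb = nb_logpmf(x, m, theta_mm).sum()
ll_zinb = zinb_logpmf(x, mu_true, theta_true, pi_true).sum()
```

With this much zero inflation, `ll_zinb` comes out clearly higher than `ll_nb`: the plain NB cannot put enough probability on zero without distorting the rest of the distribution.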

2. Evaluation

To evaluate the DCA model, the authors conducted two experiments: one with simulated data and one with real data. Only the first is demonstrated here, since it focuses more on the method itself.

The simulated data consists of two datasets:

  • dataset1: 2000 cells with 200 genes and two cell types (63% of data values were set to zero)
  • dataset2: 2000 cells with 200 genes and six cell types (35% of data values were set to zero)
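As a toy analogue of such a simulation (all numbers here are illustrative, not the paper's simulation settings), one can sample type-specific counts and then inject false zeros:

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_genes = 2000, 200

# Two cell types, each with its own per-gene mean expression profile
cell_type = rng.integers(0, 2, size=n_cells)
means = rng.gamma(shape=2.0, scale=2.0, size=(2, n_genes))
true_counts = rng.poisson(means[cell_type])

# Inject dropout: randomly zero out entries to mimic false zero counts
dropout = rng.random(true_counts.shape) < 0.45
noisy_counts = np.where(dropout, 0, true_counts)

zero_frac = (noisy_counts == 0).mean()   # fraction of zeros after dropout
```

The resulting `noisy_counts` matrix mixes true zeros (genes with low means) with false zeros from the dropout mask, which is exactly the distinction the denoising model has to recover.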

The noise model, which serves as the likelihood loss function of the DCA model, is ZINB. The DCA model is illustrated below:

A demonstration of DCA model with ZINB likelihood loss function. (number of output nodes = number of input nodes/genes)
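The sketch below shows how the decoder side can map its hidden activations to the three ZINB parameters per gene and evaluate the loss. The weights are random placeholders, and the exp/softplus/sigmoid link functions are common choices for keeping μ and θ positive and π in (0, 1), not necessarily the paper's exact implementation.

```python
import numpy as np
from scipy.special import gammaln

def softplus(z):
    """Numerically stable softplus: log(1 + exp(z))."""
    return np.log1p(np.exp(-np.abs(z))) + np.maximum(z, 0)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(1)
n_cells, n_genes, n_hidden = 8, 200, 32

counts = rng.poisson(2.0, size=(n_cells, n_genes)).astype(float)
H = rng.normal(size=(n_cells, n_hidden))   # last hidden-layer activations

# Three parallel output heads, one per ZINB parameter (placeholder weights)
W_mu, W_theta, W_pi = (rng.normal(scale=0.1, size=(n_hidden, n_genes))
                       for _ in range(3))

mu = np.exp(H @ W_mu)           # mean: positive via exp
theta = softplus(H @ W_theta)   # dispersion: positive via softplus
pi = sigmoid(H @ W_pi)          # dropout probability: in (0, 1)

def zinb_nll(x, mu, theta, pi, eps=1e-8):
    """Mean negative ZINB log-likelihood over all cells and genes."""
    nb = (gammaln(x + theta) - gammaln(theta) - gammaln(x + 1)
          + theta * np.log(theta / (theta + mu) + eps)
          + x * np.log(mu / (theta + mu) + eps))
    ll = np.where(x == 0,
                  np.log(pi + (1 - pi) * np.exp(nb) + eps),
                  np.log(1 - pi + eps) + nb)
    return -ll.mean()

loss = zinb_nll(counts, mu, theta, pi)
```

Training would then backpropagate through this loss, so the network learns μ, θ, and π jointly for every gene in every cell; the denoised output is the predicted mean μ.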

Compared with an MSE loss function (which tries to reconstruct the input data itself), the results are as follows:

Result for dataset1

From left to right, the pictures show: simulated dataset1 without noise (ground truth), simulated dataset1 with noise (a number of false zeros), the data denoised with the ZINB loss, and the data denoised with the MSE loss.

Result for dataset2

From left to right, the pictures show: simulated dataset2 without noise (ground truth), simulated dataset2 with noise (a number of false zeros), the data denoised with the ZINB loss, and the data denoised with the MSE loss.

The results clearly show that the DCA model with the ZINB loss denoises the data well enough for the classes to become separable again, and that it performs much better than the MSE model.

The second experiment was done on a real dataset of 68,579 peripheral blood mononuclear cells with 1,000 highly variable genes (92% zeros). Here the DCA model used the NB noise model, its latent layer consisted of 2 nodes (two-dimensional), and it was also shown to be effective. The paper additionally contains examples of downstream analysis, such as clustering, time-course modeling, differential expression, protein-RNA co-expression, and pseudotime analyses. The detailed results and related analyses are omitted here, since we mainly focus on the machine learning method. Anyone who is interested should read the original paper ().

3. Scalability

The DCA method also scales linearly with the number of cells. The following is a runtime comparison between DCA and other denoising methods:

Scalability test result

4. Applying DCA

To apply the method to your own data, it is important to know that the hyperparameters of the DCA model need to be validated on the dataset at hand: for example, the number of layers, the number of nodes in each layer, the optimization function, and the learning rate. When the dataset is small, it is especially important to use regularization methods such as dropout layers or L1/L2 regularization in the DCA neural network.
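As a minimal sketch of these two regularizers (the values of `l2_lambda` and `drop_rate` are hypothetical numbers one would tune): inverted dropout randomly zeroes inputs while rescaling the survivors so expected activations are unchanged, and the L2 penalty adds a weight-decay term to the loss and gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(50, 10))   # a weight matrix to regularize
X = rng.normal(size=(32, 50))              # a mini-batch of inputs

l2_lambda = 1e-3   # L2 penalty strength (hypothetical, tune per dataset)
drop_rate = 0.2    # dropout probability (hypothetical, tune per dataset)

# Inverted dropout: zero out a random subset, rescale the rest by 1/(1-p)
mask = (rng.random(X.shape) >= drop_rate) / (1 - drop_rate)
H = (X * mask) @ W

# Any reconstruction loss plus the L2 penalty on the weights
l2_penalty = l2_lambda * np.sum(W ** 2)
loss = np.mean(H ** 2) + l2_penalty

# The L2 term contributes a simple weight-decay gradient:
grad_l2 = 2 * l2_lambda * W
```

Both tricks shrink the effective capacity of the network, which is what keeps a flexible autoencoder from memorizing a small dataset.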

The DCA model is publicly available on [GitHub](https://github.com/theislab/dca). The method is also available through the [preprocessing package](https://scanpy.readthedocs.io/en/latest/generated/scanpy.external.pp.dca.html#) in Scanpy.