Abstract
An underlying question for virtually all singlecell RNA sequencing experiments is how to allocate the limited sequencing budget: deep sequencing of a few cells or shallow sequencing of many cells? Here we present a mathematical framework which reveals that, for estimating many important gene properties, the optimal allocation is to sequence at a depth of around one read per cell per gene. Interestingly, the corresponding optimal estimator is not the widelyused plugin estimator, but one developed via empirical Bayes.
Introduction
Singlecell RNA sequencing (scRNAseq) technologies have revolutionized biological research over the past few years by providing the tools to simultaneously interrogate the transcriptional states of thousands of cells in a single experiment. In contrast to bulk RNASeq, which probes the average gene expression in a cell population, singlecell RNAseq has unlocked the potential of extracting higherorder information, granting us access to the underlying gene expression distribution. Indeed, this unprecedented look into populationlevel heterogeneity has been vital in the success of scRNAseq, leading up to new biological discoveries^{1,2}.
Although early singlecell RNAseq assays were labor intensive and initially constrained by the small number of cells that could be processed in a single experiment, recent technological advances have allowed hundreds of thousands of cells to be assayed in parallel^{3}, eliminating the otherwise prohibitive per cell cost overhead. From a sequencing budget perspective, however, this seemingly unconstrained increase in the number of cells available for scRNAseq introduces a practical limitation in the total number of reads that can be sequenced per cell. More reads can significantly reduce the effect of the technical noise in estimating the true transcriptional state of a given cell, whereas more cells can provide us with a broader view of the biological variability in the cell population. A natural experimental design question arises (Fig. 1a): how many cells should we choose to profile for a given study, and at what sequencing depth?
The experimental design question has attracted a lot of attention in the literature^{4,5,6,7,8}, but as of now, there has not been a clear answer. Several studies provide evidence that a relatively shallow sequencing depth is sufficient for common tasks such as cell type identification and principal component analysis (PCA)^{9,10,11}, whereas others recommend deeper sequencing for accurate gene expression estimation^{12,13,14,15}. Despite the different recommendations, the approach to providing experimental design guidelines is shared among all: given a deeply sequenced dataset with a predefined number of cells, how much subsampling can a given method tolerate? An example of this conventional approach is also evident in the mathematical model used in a recent work^{11} to study the effect of sequencing depth on PCA. Although practically relevant, this line of work does not provide a comprehensive solution to the underlying experimental design question because of three reasons: (1) the number of cells is fixed and implicitly assumed to be enough for the biological question at hand; (2) the deeply sequenced dataset is considered to be the ground truth; (3) the corresponding estimation method is chosen a priori and is tied to the experiment.
In this work, we propose a mathematical framework for singlecell RNAseq that fixes not the number of cells but the total sequencing budget, and disentangles the biological ground truth from both the sequencing experiment as well as the method used to estimate it. In particular, we consider the output of the sequencing experiment as a noisy measurement of the true underlying gene expression and evaluate our fundamental ability to recover the gene expression distribution using the optimal estimator. The two design parameters in our proposed framework are the total number of cells to be sequenced \({n}_{\mathrm{cells}}\) and the sequencing depth in terms of the total number of reads per cell \({n}_{\mathrm{reads}}\), both affecting the optimal estimation error. Now, the experimental design tradeoff becomes apparent when these two quantities are tied together under a total sequencing budget constraint \(B={n}_{\mathrm{cells}}\times {n}_{\mathrm{reads}}\) (Fig. 1a, sequencing budget allocation problem). The sequencing budget \(B\) corresponds to the total number of reads that will be generated and is directly proportional to the sequencing cost of the experiment (see Methods).
More specifically, we consider a hierarchical model^{16,17,18} to analyze the tradeoff in the sequencing budget allocation problem (see Methods). At a high level, we assume an underlying highdimensional gene expression distribution \({P}_{{\bf{X}}}\) that carries the biological information of the cell population we are interested in and is independent of the sequencing process (Fig. 1a top). The cells \(1,2,\cdots \) in the experiment are described by gene expressions \({{\bf{X}}}_{1},{{\bf{X}}}_{2},\cdots \) sampled from \({P}_{{\bf{X}}}\), whereas we can only observe the read counts \({{\bf{Y}}}_{1},{{\bf{Y}}}_{2},\cdots \) that are generated from the corresponding gene expressions via sequencing (Fig. 1a bottom). In this context, it is clear that with many cells \({n}_{\mathrm{cells}}\) we can estimate the read count distribution \({P}_{{\bf{Y}}}\) accurately, whereas with more reads per cell \({n}_{\mathrm{reads}}\) we can make sure that the individual (normalized) observations \({{\bf{Y}}}_{1}/{n}_{\mathrm{reads},1},{{\bf{Y}}}_{2}/{n}_{\mathrm{reads},2},\cdots \) are much closer to the ground truth expressions \({{\bf{X}}}_{1},{{\bf{X}}}_{2},\cdots \) of the cells (here, \({n}_{\mathrm{reads},c}\) represents the total number of reads for cell \(c\) and the average of \({n}_{\mathrm{reads},c}\) over all cells is \({n}_{\mathrm{reads}}\)). The optimal tradeoff is then derived to reconcile the two.
Results
Model overview
The gene expression levels of each cell, denoted by \({{\bf{X}}}_{c}=[{X}_{c1},\cdots \ ,{X}_{cG}]\) for \(c=1, \ldots ,{n}_{\mathrm{cells}}\), can be viewed as independent samples from the gene expression distribution \({P}_{{\bf{X}}}\), where \(G\) denotes the number of genes. More specifically, we assume that \({X}_{cg}\) represents the true relative abundance of the mRNA molecules originating from a gene \(g\) in cell \(c\), so that \({\sum }_{g=1}^{G}{X}_{cg}=1\). To model the sequencing process, we assume that after a particular cell \({{\bf{X}}}_{c}\) has been sampled from \({P}_{{\bf{X}}}\), its corresponding gene counts \({{\bf{Y}}}_{c}=[{Y}_{c1},\cdots \ ,{Y}_{cG}]\) are generated via Poisson sampling of \({\gamma }_{c}\cdot {n}_{\mathrm{reads}}\) reads from \({{\bf{X}}}_{c}\), where \({\gamma }_{c}\) is a size factor that is cellspecific but not genespecific. Overall, our hierarchical model is given by (here, we simplified the model by fixing \({\gamma }_{c}\); see Eq. (2) in the Methods section for the complete model and Supplementary Note 1 for more details): for cells \(c=1,2, \cdots ,{n}_{\mathrm{cells}}\),
Under this framework, the ultimate goal of a singlecell RNAseq experiment would be to estimate quantities related to the (unknown) ground truth distribution \({P}_{{\bf{X}}}\) from the noisy measurements \({{\bf{Y}}}_{c}\). Fixing the total sequencing budget \(B={n}_{\mathrm{cells}}\times {n}_{\mathrm{reads}}\), we aim to characterize the optimal experimental design tradeoff between the number of cells \({n}_{\mathrm{cells}}\) and the number of reads per cell \({n}_{\mathrm{reads}}\) that can minimize the corresponding estimation error.
Although our framework is nonparametric—in the sense that no particular prior is assumed for the underlying gene distribution \({P}_{{\bf{X}}}\), it is instructive to illustrate the framework in the context of the widely used overdispersion model, where for each gene \(g\), the read counts \({Y}_{cg}\) are assumed to follow a negative binomial distribution^{19,20,21}. As the negative binomial distribution can be derived as a gamma–Poisson mixture, the resulting model can be viewed as a special case of Eq. (1) in which the underlying gene expression marginals follow gamma distributions. In that case, one would be interested in estimating the marginals \({X}_{g} \sim {\mathrm{Gamma}}({r}_{g},{\theta }_{g})\), effectively decoupling the true biological variability from the technical noise that was introduced during sequencing via Poisson sampling (see Relation to the overdispersion model in Supplementary Note 4).
As a technical remark, for assuming the gamma–Poisson mixture, we dropped two constraints without loss of generality, i.e., \({X}_{cg}\;<\;1\) and \({\sum }_{g=1}^{G}{X}_{cg}=1\). The former is because the relative expression \({X}_{cg}\) is of the order of \(1/G\), which is much smaller than 1. With a mean much smaller than 1, the truncated gamma distribution with truncation at 1 is very close to the corresponding untruncated distribution. The latter is because that the number of genes \(G\) is large, and therefore, \({\sum }_{g=1}^{G}{X}_{cg}\) concentrates around its mean, which is 1.
Optimal sequencing budget allocation
For our main results, we focused on 3’end sequencing technologies^{22,23,24} and used the above framework to study the experimental design tradeoff for estimating several important quantities of the underlying gene distribution, such as the CV and the Pearson correlation (see the Methods section). In the context of 3’end sequencing, \({P}_{{\bf{X}}}\) naturally models the unknown highdimensional distribution of mRNA abundances across cells, whereas the read counts for the cells, \({{\bf{Y}}}_{1},{{\bf{Y}}}_{2},\cdots \), correspond to the number of unique molecular identifiers (UMIs) observed via sequencing. Our main result states that the optimal budget allocation (i.e., the one that minimizes the estimation error) is achieved by maximizing the number of cells while making sure that at least ~1 UMI per cell will be observed on average for all genes of primary biological interest in the experiment.
As a demonstrating example, in Fig. 1b we consider the memory Tcell marker gene S100A4 to be of primary biological interest and evaluate the optimal tradeoff in the context of the overdispersion model for the total sequencing budget used to generate the 10x Genomics’ pbmc_4k dataset (4340 cells, total 41.7 k reads for S100A4); our analysis suggests that the optimal tradeoff would have been attained by sequencing 10 times shallower using 10 times more cells, reducing the error by twofolds. Of course, the recommended sequencing depth depends on the genes under consideration. For example, the sequencing depth of pbmc_4k dataset is optimal when the Bcell marker gene MS4A1 is considered, and it should be sequenced four times deeper with 1/4 cells when the Thelper marker gene CD4 is considered (Fig. 1c top, Supplementary Fig. 2a, b). The latter arguably has reached saturation for the 10x Genomics’ technology. Hence, the guidance there is to sequence until saturation, i.e., sequence until no more new UMIs are observed (see Experimental design in the Methods section as well as Supplementary Note 3).
As the example indicates, an important aspect of our framework is to allow flexible experimental design at a singlegene resolution. The researcher can thus design the experiment based on the mean expression level of a set of important genes related to the biological question, where the mean expression level can be obtained via pilot experiments or previous studies (see Experimental design in the Methods section). We illustrate the proposed experimental design procedure by considering peripheral blood mononuclear cells (PBMCs) with the corresponding marker genes (Fig. 1c). The goal is to ensure reliable estimation for all these genes that are above a certain expression level, say that of MS4A1. Hence, the expression level of the gene of interest, i.e., MS4A1, naturally defines the reliable detection limit \({p}^{* }\) at which we should guarantee observation of one average UMI per cell. Thus, given a budget \(B\), choosing \({n}_{\mathrm{reads}}^{* }=1/{p}^{* }\) and \({n}_{\mathrm{cells}}^{*}=B/{n}_{\mathrm{reads}}^{*}\) achieves the optimal tradeoff for reliable detection at \({p}^{* }\). In this example, MS4A1 will be sequenced ~1 UMI per cell on average. Interestingly, this approach suggests a slightly deeper sequencing for current 10x datasets (Supplementary Figs. 1 and 2).
Unlike estimating the gamma distribution parameters for the overdispersion model, we considered estimating other quantities in a nonparametric setting (see also a nonparametric interpretation of the overdispersion model in Relation to the overdispersion model in Supplementary Note 4). Although the exact optimal depth is taskdependent, our empirical evaluations have shown that the above recommendation is remarkably consistent across all quantities considered in this paper—typically lying in a narrow range between 0.2 and 1 (Fig. 2a, Supplementary Fig. 4). Last but not the least, our tradeoff analysis can also provide a post hoc guidance for reliable estimation for existing datasets, namely for certain quantities, to determine which genes can be reliably estimated and which cannot, based on their mean expression level (Fig. 2b, see also post hoc guidance for reliable estimation in Methods).
Optimal estimator
Another important result arising from our experimental design framework is the fundamental role of the estimator in the optimal tradeoff. A very common—almost routine—practice in the literature is to use the socalled plugin estimator, which, as a general recipe, blindly uses the scaled (relative) read counts \({{\bf{Y}}}_{1}/{n}_{\mathrm{reads},1},{{\bf{Y}}}_{2}/{n}_{\mathrm{reads},2},\cdots \) as a proxy for the true relative gene expression levels \({{\bf{X}}}_{1},{{\bf{X}}}_{2}\cdots \), effectively estimating the corresponding distributional quantities by pluggingin the observed values. For example, the plugin estimator naturally estimates the mean of the gene expression distribution \({P}_{{\bf{X}}}\) by that of \({P}_{{\bf{Y}}/{n}_{\mathrm{reads}}}\), the variance of \({P}_{{\bf{X}}}\) by that of \({P}_{{\bf{Y}}/{n}_{\mathrm{reads}}}\), etc. This approach, although very accurate for deeply sequenced datasets, becomes increasingly problematic in the limit of shallow sequencing; overdispersion and inflated dropout levels in lowly expressed genes, typically associated in the literature with scRNAseq, are some of the more pronounced consequences.
For the sequencing budget allocation problem, we did not restrict our results to any particular estimator; our analysis suggested that the optimal tradeoff cannot be achieved by the conventional plugin approach but with another class of estimators developed via empirical Bayes modeling^{16,17,18,25,26} (see Methods). Such estimators are inherently aware of the Poisson sampling noise introduced by sequencing, and therefore can adapt to various sequencing depths. As they estimate the prior gene distribution \({P}_{{\bf{X}}}\) in the hierarchical model (2) from the observed data \({{\bf{Y}}}_{c}\), sometimes by estimating the moments of the prior distribution \({P}_{{\bf{X}}}\), they are usually associated with the names empirical Bayes, moment matching, or density deconvolution. Here, we use the term EB to refer to them in general.
In Figs. 3 and 4 (also Supplementary Figs. 7–12), we provide a comprehensive evaluation of the performance of EB estimators in several key applications and show that they provide remarkably consistent estimates across varying sequencing depths and different datasets. Also, they are shown to be biologically meaningful (Fig. 4c, Supplementary Fig. 12). In contrast, the plugin approach, being sensitive to the sequencing depth, significantly overestimates the variability in gene expression (CV) owing to the inevitable zeroinflation occurring at shallow sequencing (Fig. 3a), and subsequently limits the performance of common downstream tasks such as PCA and gene network analysis (Methods, Fig. 3b, Fig. 4).
Validation against the gold standard smFISH
In order to further validate our results that the optimal sequencing depth is attained at ~1 average UMI per cell, and that the EB estimates are indeed close to the ground truth, we considered two additional datasets^{15,27}, accompanied by the singlemolecule fluorescent in situ hybridization (smFISH) data, which is regarded as the gold standard for measuring the number of mRNAs in a cell. The libraries for the two scRNAseq datasets were generated by Dropseq^{23} and CELseq^{28}, respectively, two UMIbased technologies.
We first compared the estimated CV and inactive probability against the smFISH estimates. The EB estimates agree well with the smFISH data while there is clear inflation for the plugin estimates (Fig. 5a, Supplementary Fig. 13). Furthermore, we investigated the optimal sequencing depth by fixing the budget \(B\) and varying the sequencing depth, as we did in Fig. 2a. The critical difference here is that the error is evaluated against the smFISH data, which serves as a proxy for the ground truth. Two genes, MITF and VGF, were considered that have relatively more UMIs to subsample. The tradeoff curves (Fig. 5b, Supplementary Fig. 14) are qualitatively similar to the simulation studies (Fig. 2a, Supplementary Fig. 4), showing an optimal depth between 0.1 and 0.6. This is consistent with the experimental design guidelines that we provided in our earlier analysis.
See also Supplementary Figs. 15–17 and the Details of the ERCC experiments subsection in Supplementary Note 6 for additional validations using datasets with ERCC synthetic spikein RNAs and pure RNA controls.
Discussion
A natural yet challenging experimental design question for singlecell RNAseq is how many cells should one choose to profile and at what sequencing depth to extract the maximum amount of information from the experiment. In this paper, we introduced the sequencing budget allocation problem to provide a precise answer to this question; given a fixed budget, sequencing as many cells as possible at approximately one read per cell per gene is optimal, both theoretically and experimentally.
Conceptually, there are three important aspects of our mathematical framework that enabled our theoretical analysis and led to the development of the corresponding sequencingdepthaware EB estimators. First, we explicitly incorporated the notion of an unknown ground truth distribution \({P}_{{\bf{X}}}\) that describes the underlying singlecell population of interest. From this perspective, a singlecell RNAseq experiment can be naturally seen as an attempt to extract information about this distribution. Second, we disentangled this biological ground truth not only from the sequencing process but also from the method used to estimate it. Considering the output of the sequencing experiment as a noisy measurement \({P}_{{\bf{Y}}}\) of the true underlying distribution, we were able to mathematically evaluate our fundamental ability to recover \({P}_{{\bf{X}}}\) and identify the corresponding tradeoffoptimal estimators for several quantities of interest by essentially optimizing over all possible methods and experimental design parameters. Finally, to provide practical experimental design guidelines, we considered how different biological questions could be incorporated within our framework. Assuming that a biological question can be defined in terms of a set of genes of interest (e.g., associated with a particular pathway), we were able to provide sequencing depth recommendations by minimizing the worstcase error within that set.
Our experimental results showed that the proposed EB estimators could achieve significantly better performance compared with the conventional plugin approach that is commonly used by existing singlecell analysis methods. Importantly, we demonstrated that the proposed estimators produce unbiased results across deep and shallow datasets obtained from the same underlying population of cells and validated their ability to produce estimates that are very close to the ground truth as measured by smFISH. We also provided post hoc guidance for reliable estimation by evaluating our results on multiple genes from different biological samples. Apart from providing costefficient data generation guidelines for future experiments, we believe that our results are also going to be useful in assessing the quality and statistical interpretability of existing datasets, particularly in the context of global collaborative initiatives such as the Human Cell Atlas^{29}.
Methods
Model
For a scRNAseq experiment, let \({n}_{\mathrm{cells}}\) be the number of cells and \({n}_{\mathrm{reads}}\) be the average UMIs per cell. The total number of UMIs \(B={n}_{\mathrm{cells}}\times {n}_{\mathrm{reads}}\) is used to denote the available budget for this experiment. Given a fixed budget, we are interested in the optimal allocation between \({n}_{\mathrm{cells}}\) and \({n}_{\mathrm{reads}}\) for estimating certain distributional quantities that are important to the scRNAseq analysis.
We adopt a hierarchical model for the analysis. Let G be the number of genes and for each cell \(c=1,\cdots \ ,{n}_{{\mathrm{cells}}}\), let \({{\bf{X}}}_{c}=[{X}_{c1},\cdots \ ,{X}_{cG}]\) be the relative gene expression level satisfying \({\sum }_{g=1}^{G}{X}_{cg}=1\). The relative expression levels are assumed to be drawn i.i.d. from some unknown cell distribution \({P}_{{\bf{X}}}\), which is defined with respect to the cell population under investigation—it may be cells from a certain tissue or some isolated cell subpopulations. This is quite a general model. For example, cells coming from several subpopulations can be modeled by letting \({P}_{{\bf{X}}}\) be a mixture distribution. The gene expression level \({{\bf{X}}}_{c}\) is measured by the observed UMIs \({{\bf{Y}}}_{c}\in {{\mathbb{N}}}^{G}\) via sequencing, of which the stochastic process is modeled using Poisson noise; such noise model has been extensively validated by previous works^{16,30}. In addition, we assume a size factor \({\gamma }_{c}\) for each cell that accounts for the variation in cell sizes. To summarize, for gene \(g=1,\cdots \ ,G\) in cell \(c=1,\cdots \ ,{n}_{{\mathrm{cells}}}\), we have assumed
Quantities to estimate
We study the optimal sequencing budget allocation for estimating the following distributional quantities of \({P}_{{\bf{X}}}\) that are commonly used in scRNAseq analysis. See Supplementary Note 2 for more details.
The marginal gene moments \({M}_{k,g}={\mathbb{E}}[{X}_{cg}^{k}]\), \(g=1,\cdots \ ,G,\,k=1,2,\cdots \). The marginal gene moments can be used to compute quantities like the mean expression, CV, the Fano factor, or the parameters for the overdispersion model (assuming that \({X}_{cg}\) follows a gamma distribution), which play an important role in data preprocessing, feature selection, and genetype identification^{31,32}.
The gene expression covariance matrix \(K\in {{\mathbb{R}}}^{G\times G}\), which also gives rise to the Pearson correlation matrix. Both quantities can be used to study the dependency structure of genes, e.g., via spectrum methods^{33,34,35} or gene network analysis^{36,37}.
The inactive probability of a gene \(g\) with the definition \({p}_{0,g}(\kappa )={\mathbb{E}}[\exp \left(\kappa {X}_{cg}\right)]\). It also has the interpretation of the proportion of zeroUMI cells for gene \(g\) when the cell population is sequenced \(\kappa /{n}_{{\rm{reads}}}\) times deeper. As special cases, \(\kappa\) = \(\infty\) corresponds to the probability that \({X}_{cg}\) is zero, whereas \(\kappa\) = \({n}_{{\mathrm{reads}}}\) corresponds to the proportion of cells whose observed counts \({Y}_{cg}\) are zero. The latter was also considered in a recent work^{38}.
The inactive probability of a gene pair \({g}_{1},{g}_{2}\) with the definition \({p}_{0,{g}_{1}{g}_{2}}(\kappa )={\mathbb{E}}[\exp (\kappa ({X}_{c{g}_{1}}+{X}_{c{g}_{2}}))]\) that quantifies the change that both genes are inactive. This quantity can be used to analyze the gene coexpression network^{38}.
The marginal gene distribution \({P}_{{X}_{g}}\) (also considered in a recent work^{16}).
Optimal sequencing budget allocation
We considered a single gene and derived the optimal budget allocation for estimating all the above quantities of its distribution \({P}_{{X}_{g}}\) (see Supplementary Note 5 for more details). As the mean relative expression level of a gene \({p}_{g}\) is relatively stable within a specific tissue/sample (see Experimental design subsection below for more details), one can safely estimate that for an experiment with budget \(B\), the total number of reads for gene \(g\) is around \({p}_{g}B\). Then the tradeoff with respect to gene \(g\) can be written as \({p}_{g}B={n}_{{\mathrm{reads}},g}\times {n}_{{\mathrm{cells}}}\), where \({n}_{{\mathrm{reads}},g}\) is the mean reads per cell for gene \(g\), satisfying the relation \({n}_{{\mathrm{reads}},g}={p}_{g}{n}_{{\mathrm{reads}}}\).
Theorem 1. (Optimal budget allocation, informal) For estimating moments, covariance matrix, inactive probability, pairwise inactive probability, and distribution, the optimal budget allocation is
The optimality is in the sense of minimizing the worstcase error over a family of distributions \({P}_{{X}_{g}}\) with mild assumptions and the optimal error rate is achieved by the EB estimators.
The expression \({n}_{{\mathrm{reads}},g}^{* } \sim 1\) in Theorem 1 implies that the optimal sequencing depth (mean reads per cell per gene) is given by some constant independent of the sequencing budget (see the formal statement in Supplementary Note 5). Therefore, for a scRNAseq experiment, we should aim at a certain sequencing depth; when the budget increases, we should keep the same depth and allocate the additional budget toward collecting more cells. In other words, after having achieved a certain sequencing depth, deeper sequencing does not help as much as having more cells. We also note that the actual value of this optimal sequencing depth may be different for estimating different quantities, which is further investigated in the following section. In addition, Theorem 1 suggests that the EB estimators should be used for optimal estimation, whose effectiveness is demonstrated in Figs. 3–4.
Experimental design
The exact values of the optimal sequencing depth \({n}_{{\mathrm{reads}},g}^{* }\) for estimating different quantities were investigated both theoretically and via simulations. First, the closedform expressions of the optimal depth \({n}_{{\mathrm{reads}},g}^{* }\) were derived for estimating the mean, the second moment, and the gamma parameters (of overdispersion model), which depend on the distribution \({P}_{{X}_{g}}\) but are nonetheless ~1 for typical cases (Supplementary Notes 3 and 5). Second, estimation errors under different budget splits were simulated by subsampling from a real dataset with deeply sequenced genes and many cells (top 72 genes of brain_1.3m, Fig. 2a). See details of the subsampling procedure in Subsampling experiment in Supplementary Note 6). Third, a more controlled simulation that assumes the Poisson model was conducted to provide a more comprehensive evaluation (Supplementary Fig. 4). Both simulations exhibit similar qualitative behaviors and imply that the optimal sequencing depths \({n}_{{\mathrm{reads}},g}^{* }\) for estimating different quantities are between 0.2 and 1. Therefore, we reached the conclusion that the optimal budget allocation for a single gene is to have ~1 UMI per cell on average.
When there are many genes of primary biological interest, the gene among them with the smallest relative mean expression level becomes the bottleneck, as it has the fewest number of reads on average (Fig. 1c, top). We call its relative mean expression level \({p}^{* }\) the reliable detection limit, below which the estimation performance cannot be guaranteed. The optimal sequencing depth for the entire experiment \({n}_{{\mathrm{reads}}}^{* }\) is chosen so that the gene at the reliable detection limit has one read per cell (in expectation), which minimizes the worstcase error for all genes of interest. Compared with this optimal allocation, deeper sequencing (green) gives a homogeneous error across genes but at a much higher level, whereas a shallower sequencing (blue) gives a small error for a few highly expressed genes but its performance quickly deteriorates (Fig. 1c, bottom).
The recommended budget allocation in general suggests a slightly deeper sequencing depth as compared with existing datasets, e.g., 7k UMIs per cell for the pbmc_4k dataset considering MS4A1 and 14k UMIs per cell for the brain_9k dataset considering S100a10 (Supplementary Fig. 2). Such a depth is feasible for the current 10x Genomics’ technology, which is estimated to be able to sequence 10–45k UMIs per cell where the actual values depend on different tissues (Feasibility of the recommended sequencing depth in Supplementary Note 3). In addition, under such a sequencing depth, all analyses are valid as the Poisson model is still a good approximation of the sequencing process. Regarding the rare genes, since the UMI efficiency for the 10x technology is estimated to be 10–15%, in order to achieve one read per cell, the gene needs to have at least 1/0.15\(\approx\)7 transcripts in the cell. The gene CD4 (Supplementary Fig. 2b) seems to be below this limit. For such genes, the recommendation should be sequencing until saturation.
The input parameter to the proposed experimental design approach, i.e., the detection limit \({p}^{* }\), corresponds to the smallest mean expression level among the list of genes of interest. Therefore, to carry out the proposed experimental design procedure, it is important to have an estimate of the mean expression levels for these genes. Such information may come from various sources whose data closely matched the system under study. First, researchers usually conduct pilot experiments before conducting the main experiment; the data from the pilot experiment can be used to provide such an estimate. Also, data from past studies or public databases, either scRNAseq or bulk RNASeq, can be used to provide the estimate. Some popular databases include Tabula Muris (scRNAseq)^{39} for different mouse tissues, Human Cell Atlas^{29} (scRNAseq) and GTEx^{40} (bulk RNASeq) for different human tissues, TCGA^{41} (bulk RNASeq) for human cancer data, and GEO^{42} (bulk/singlecell RNAseq) for past studies. One caveat here is that different datasets may have different covariate compositions, like sex, age, or demographic factors. To evaluate the sensitivity of using reference data to estimate \({p}^{* }\) for the proposed experimental design procedure, we consider four different types of reference data in Supplementary Figs. 18–19: insample bulk RNASeq or scRNAseq, where the corresponding reference data were obtained from the same biological sample as the data for the current study, and their outofsample counterparts obtained from independent biological replicates. Our results suggest that although all four types of reference data can be used to determine the optimal sequencing depth accurately, insample scRNAseq and outofsample bulk RNASeq should be considered as the most and least preferable sources of reference data respectively.
In practice, there is enough experimental flexibility to choose both the total sequencing budget \(B\) as well as the total number of cells \({n}_{{\mathrm{cells}}}\) to achieve the recommended allocation. The budget \(B\) is typically specified in terms of the total number of lanes that will be used for sequencing and is directly proportional to the sequencing cost of the experiment. For example, the 10x Genomics’ pbmc_4k dataset was sequenced on one Illumina Hiseq4000 lane yielding a total of ~350 million reads, whereas the brain_1.3m data set was sequenced on 88 Hiseq4000 lanes (11 flow cells) yielding ~30 billion reads. Sample multiplexing can also be utilized to achieve fractional lane occupancies for smaller experiments. Now, given a fixed budget \(B\), one can adjust the desired sequencing depth (\({n}_{{\mathrm{reads}}}=B/{n}_{{\mathrm{cells}}}\)) by selecting the total number of cells at the library preparation stage of the experiment. Although all singlecell RNAseq assays rely on the Illumina platform for sequencing, the library preparation stage (e.g., singlecell isolation, mRNA capture, and barcoding) is technologyspecific^{22,23,24,28,43}. Nevertheless, it is possible to accurately choose the total number of cells by adjusting the cell concentration (cells/\(\mu l\)) and the final cell suspension volume that is going to be used in the process. For example, the 10x Genomics Chromium platform can be adjusted to yield from 500 to 10K cells per lane in a single run (10× user manual: https://support.10xgenomics.com/permalink/3vzDu3zQjY0o2AqkkkI4CC). For larger experiments, multiple lanes can be used (e.g., the brain_1.3m dataset was prepared on 133 10x Genomics Chromium lanes, each optimized to capture ~10k cells). Even though the library preparation stage can incur additional costs for a singlecell RNAseq experiment, these costs are independent of the sequencing process, can vary significantly across different technologies^{44}, and are in general decreasing rapidly.
Empirical Bayes estimators
The EB estimators refer to the estimators that are aware of the noise model (which is Poisson here) and correct for the noise introduced by it. As they estimate the prior gene distribution \({P}_{{\bf{X}}}\) in the hierarchical model (2) from the observed data \({{\bf{Y}}}_{c}\), sometimes by estimating the moments of the prior distribution \({P}_{{\bf{X}}}\), they are usually associated with the names empirical Bayes, moment matching, or density deconvolution. Here, we use the term EB to refer to them in general.
As an illustrating example, consider a simplified model that for cell \(c\) and gene \(g\):
The plugin estimator estimates the gene variance by the sample variance of UMIs, i.e.,
where \({\overline{Y}}_{cg}=\frac{1}{{n}_{{\mathrm{cells}}}}{\sum }_{c=1}^{{n}_{{\mathrm{cells}}}}{Y}_{cg}\) is the empirical mean. However, the estimated value is usually overly variable owing to the presence of the Poisson noise. Indeed,
where the second term \({\mathbb{E}}[{X}_{cg}]\) corresponds to the technical variation introduced by the Poisson noise. Then, conceptually we can write:
from which we can see that the plugin estimate is inflated by the Poisson noise. In this case, this bias can be easily corrected by simply subtracting the mean, and the corresponding EB variance estimator can be written as
The EB estimators considered in the paper are listed in Table 1, along with the plugin estimators for comparison. In literature, they are designed in a casebycase fashion^{16,17,18,24,45,46,47,47,48,49,50} (more details in Supplementary Note 4).
Empirical evaluation of the tradeoff
We conducted two sets of simulations to evaluate the estimation error under different budget splits, which differ in how the data are generated. The first simulation (Fig. 2a) subsampled from a highbudget dataset consisting of the top 72 genes from the brain_1.3 m dataset. These genes were chosen because they have at least 10 reads per cell, providing a deep dataset to perform the subsample experiments. This simulation better matches the real data as the subsampling procedure does not assume the Poisson model (see Subsampling experiment in Supplementary Note 6). However, as we did not know the true gene distribution, the plugin estimates of the highbudget dataset that we subsample from were used as proxies of the ground truth, against which we evaluated the estimation error. The second simulation, corresponding to Supplementary Fig. 4), generated the data according to model (2), where the true gene distribution \({P}_{{\bf{X}}}\) was obtained by using the empirical distribution of the first 100 highly expressed genes in the pbmc_4k dataset. This setting better validates the theory as it assumes the same model. Moreover, the estimation error is exact as the ground truth is available. Both simulations include many genes to address the heterogeneity of the gene distribution, and the genes considered here, being top genes in the dataset, have similar mean expression levels so that the mean reads over all genes can well represent the mean reads for each gene. Both simulations exhibit qualitatively the same behavior, validating the theory that the optimal depth (mean reads per cell per gene) is a constant that does not depend on the budget.
Post hoc guidance for reliable estimation
The feasible region (top) and the post hoc table (bottom) were obtained via simulation, where we fixed the number of cells (1k, 5k, 10k, 30k, 70k) and studied how the error decreases as a function of the sequencing depth (Supplementary Figs. 5–6). The data were generated according to model (2) similar to the second tradeoff simulation, where the empirical distributions of the marker genes in pbmc_4k and brain_9k were used as the true gene distribution, respectively, to account for heterogeneity in different tissues. The true gene distribution was normalized so that each gene has the same mean expression level. As a result, the mean reads over all genes were the same as mean reads for each gene, providing a singlegene level error characterization. The post hoc table was obtained by finding the smallest sequencing depth such that the relative error was smaller than 0.1 (−2 in the log10 scale for the relative squared error and −1 for other errors, see Definition of errors in simulations in Supplementary Note 6). The results for both simulations were qualitatively the same. Hence, only the table for pbmc_4k was included.
Comparing the performance of plugin and EB estimators
Figure 3a demonstrates that the EB estimator is adaptive to different sequencing depths while the plugin estimator is not. The top panel shows the estimated CV using the plugin and EB estimators under different sequencing depths, where we can see clear inflation for the plugin estimates. The full data are from pbmc_4k, and the subsample rate ranges from 0.2 to 1 (1 corresponds to the full data). The experiment was repeated five times, and the 3std confidence interval was provided. The results for other genes, as well as for estimating the inactive probability, can be found in Supplementary Fig. 7. The middle panel compares the estimated CV from two datasets of the same tissue. Genes with at least 0.1 reads per cell were considered as our post hoc analysis showed that CVs of genes below this level could not be reliably estimated. The EB estimator may produce an invalid result when the plugin variance is smaller than the plugin mean of a gene, which was not accounted for by the Poisson model. Such cases were not common and were excluded while counting the number of genes that are above or below the red line. Hence, the total number of genes for the two panels may slightly differ. More results are in Supplementary Figs. 8–9. The bottom panel shows that the EB estimator can recover the gene distribution from shallow sequencing data. The shallow data were generated by subsampling to have 20% reads of the full data. For error evaluation, the recovered distribution was rescaled to have the same mean as the empirical distribution from the full data. See Supplementary Fig. 10 for more results.
Figure 3b investigates the common task where the most informative features (genes) were selected based on CV, and PCA was then performed on the selected features. The data were from pbmc_4k and was clipped at the 99th quantile to remove outliers. Such a procedure was also used in previous works on applying PCA to scRNAseq data^{11}. The top 500 genes with the highest CV were selected and the PCA scores were plotted for the 2nd and 3rd PC direction. The first direction was skipped because it corresponded to the variation in cell sizes. The results on the full data and the subsampled data (three times shallower) were compared, showing that the EB estimator is more consistent than the plugin estimator.
Figure 4a considers recovering gene functional groups using Pearson correlation. We used the pbmc_4k dataset here as the biological structure of the PBMCs is wellunderstood. The major cell populations identified in this dataset are T cells (IL7R, CD3D/E, LCK), NKcells (NKG7, PRF1, KLRD1, GZMA, HOPX, CST7), B cells (CD79A, BANK1, IGHD, LINC00926, MS4A1), myeloidderived cells (S100A8/9, MNDA, FGL2, CLEC7A, IFI30) and megakaryocytes/platelets (PF4, PPBP). The heatmap of the EBestimated Pearson correlation of those genes were visualized in Fig. 4a top, which shows that the EB estimator can well capture the gene functional groups. A subsample experiment was then conducted to investigate how well the estimators can recover the modules from the shallow sequencing data. The data were subsampled from the full data with rates 100% (full data), 25%, 10%, and 5%. The EB estimator can recover the module at a much shallower depth as compared to the plugin estimator.
Gene network analysis of the pbmc_4k dataset
The gene network (Fig. 4b) was constructed based on the EBestimated Pearson correlation using the pbmc_4k dataset. The genes were filtered to have the EBestimated variance larger than 0.1, resulting in 791 genes in total. A correlation larger than 0.8 was considered as a gene–gene edge. We found that varying the threshold from 0.4 to 0.95 did not significantly alter the result. The gene modules were identified based on knowledge of marker genes and gene pathways, as well as previous studies on PBMCs (see Gene module identification in Supplementary Note 6). We also note that the existence of megakaryocytes/platelets may be due to the imperfection of PBMC isolation, and since many genes were expressed in multiple cell populations (e.g., CD74, CD27), the resulting annotation only gives a rough picture of the underlying gene functional groups.
Next, we considered some important genes and plot their correlations with all other genes (Fig. 4c left, Supplementary Fig. 11). As a general phenomenon, the EBestimated values are more spread out and exhibit different modes corresponding to genes that interact differently with the gene of interest. The plugin estimated values are nonetheless much closer to zero even for genes that are known to be wellcorrelated.
Finally, we considered the gene pairs where the estimated values for the EB estimator and the plugin estimator differ significantly (>0.7). Out of 1054 such pairs, 91 were also annotated based on STRING^{51}, yielding a p value of 4.2e11 while testing against the null hypothesis that the gene pairs were selected at random based on a onesided hypergeometric distribution test (see Gene module identification in Supplementary Note 6). We plot the histograms of several such pairs and show that all of them have clear biological interpretations (Fig. 4c right, Supplementary Fig. 12). LY86 (also known as MD1) is a secreted protein that has been shown to have an important role in Tcell activation, whereas CD3E is expressed within T cells (see Gene module identification in Supplementary Note 6). These two genes are not coexpressed and hence, are negatively correlated. POMP encodes a chaperone for proteasome assembly, whereas PSMA7 is one of the 17 essential subunits for the complete assembly of the 20S proteasome complex. Hence, the two genes work together for proteasome assembly and should be positively correlated. The EBestimated correlation is one and is probably an overestimate owing to the randomness of the estimator. However, the actual Pearson correlation should not be much smaller than 1. In spite of the strong biological evidence, the plugin estimator gives very small values owing to the presence of sequencing noise (See also Supplementary Fig. 12).
smFISH experiments for validation
For validation, we considered two datasets, where both the scRNAseq and the smFISH data are available. smFISH can be regarded as the gold standard for measuring the number of mRNAs in a cell and was used as a proxy for the ground truth (see Details of the smFISH experiments in Supplementary Note 6 for more details).
In the first dataset, both Dropseq and smFISH were applied to the same melanoma cell line^{15}. A total of 5763 cells and 12,241 genes were kept for analysis from the Dropseq experiment, with a median of 1473 UMIs per cell. Of these genes, 24 were also profiled using smFISH. We further excluded genes with zeroUMI count in >97% of the cells and one more gene, FOSL1, owing to its abnormal behavior (FOSL1 was also excluded in a recent work^{16} analyzing the dataset). We considered two distributional quantities, CV and inactive probability (with \(\kappa =2.5{n}_{{\mathrm{reads}}}\)), where we note that the latter has the interpretation of the proportion of zeros when the data were sequenced 2.5 times deeper. We compared the plugin and the EBestimated results from the Dropseq data against the corresponding result from the smFISH data in Fig. 5a, where the smFISH estimates can be considered as the ground truth. Here, the gene VCL was omitted in the experiment of estimating the inactive probability because the corresponding smFISH data do not have enough cells to subsample from (4691, fewer than the number of cells captured by Dropseq, which is 5763). The consistency between the EB estimates and the smFISH result indicates that the EB estimates are close to the ground truth. Furthermore, we investigated the optimal sequencing depth (Figure 5b and Supplementary Fig. 14) by fixing the budget and varying the sequencing depth, where the error was evaluated against the gold standard smFISH result. As this was done by subsampling from the original dataset, to ensure a wide range, only two genes with relatively more reads (MITF and VGF) were considered. Figure 5b and Supplementary Fig. 14 are qualitatively similar to the simulation results in Fig. 2a and Supplementary Fig. 4, showing an optimal depth between 0.1 and 0.6. This is consistent with our previous experiments based on 10x Genomics’ data and the experimental design guidelines we provide in this work, i.e., that the optimal depth for estimating different quantities is 0.2–1 read per cell per gene.
In the second dataset, both CELseq and smFISH were applied to the same mESC cell line and culture conditions^{27} (smFISH data from D. Grün, personal communication). Again, the plugin and the EBestimated results from the CELseq data were compared against the corresponding result from the smFISH data in Supplementary Fig. 13 for nine genes measured by smFISH, where we observed a good consistency between the EB and the smFISH results. As there are only 80 cells, we did not perform the subsampling experiment for this dataset.
Overall, the comparisons between the scRNAseq and the smFISH results imply that our model matches the real data well, and the proposed EB estimator is able to provide estimates that are close to the ground truth. Also, the subsampling experiments in Fig. 5b and Supplementary Fig. 14 indicate that the optimal depth, evaluated using the smFISH data, is consistent with the main claim of the paper.
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Data availability
The 10× datasets were generated by 10x Genomics’ v2 chemistry^{22}. They are publicly available and can be downloaded via the following links:
pbmc_4k: https://support.10xgenomics.com/singlecellgeneexpression/datasets/2.1.0/pbmc4k
pbmc_8k: https://support.10xgenomics.com/singlecellgeneexpression/datasets/2.1.0/pbmc8k
brain_1k: https://support.10xgenomics.com/singlecellgeneexpression/datasets/2.1.0/neurons_900
brain_2k: https://support.10xgenomics.com/singlecellgeneexpression/datasets/2.1.0/neurons_2000
brain_9k: https://support.10xgenomics.com/singlecellgeneexpression/datasets/2.1.0/neuron_9k
brain_1.3m: https://support.10xgenomics.com/singlecellgeneexpression/datasets/1.3.0/1M_neurons
293T_1k, 3T3_1k: https://support.10xgenomics.com/singlecellgeneexpression/datasets/2.1.0/hgmm_1k
293T_6k, 3T3_6k: https://support.10xgenomics.com/singlecellgeneexpression/datasets/2.1.0/hgmm_6k
293T_12k, 3T3_12k: https://support.10xgenomics.com/singlecellgeneexpression/datasets/2.1.0/hgmm_12k
We note that pbmc_4k and pbmc_8k are from the same donor; brain_1k and brain_9k are also from the same donor. Also, the following pairs of datasets are sequenced together: 293T_1k and 3T3_1k, 293T_6k and 3T3_6k, 293T_12k and 3T3_12k. These six datasets are from the same biological sample.
The Dropseq dataset and the corresponding smFISH data can be found from the original paper^{15} or a recent paper that analyzed the dataset^{16}. The CELseq data can be found from the original paper^{27}. the smFISH data accompany the CELseq can be obtained by contacting the author. The three ERCC datasets (Zheng, Klein, Svensson) can be found in a recent paper that analyzed the data set^{16}, where we have used the 2 × (control RNA + ERCC) data in the Svensson et al.^{52} paper. The Klein dataset with the pure RNA controls (the Klein ERCC dataset being part of it) can be found from the original paper^{24}. The data for sensitivity analysis (Supplementary Figs. 18–19) can be found from the original paper^{53}.
Code availability
We developed the python package sceb (singlecell empirical Bayes) for the EB estimators used in this paper (available on PyPI). The code to reproduce all experiments and generate the figures presented in this paper can be found at https://github.com/martinjzhang/single_cell_eb.
References
 1.
Kolodziejczyk, A. A., Kim, J. K., Svensson, V., Marioni, J. C. & Teichmann, S. A. The technology and biology of singlecell RNA sequencing. Mol. Cell 58, 610–620 (2015).
 2.
Trapnell, C. Defining cell types and states with singlecell genomics. Genome Res. 25, 1491–1498 (2015).
 3.
Svensson, V., VentoTormo, R. & Teichmann, S. A. Exponential scaling of singlecell RNAseq in the past decade. Nat. Protoc. 13, 599 (2018).
 4.
Streets, A. M. & Huang, Y. How deep is enough in singlecell RNAseq? Nat. Biotechnol. 32, 1005 (2014).
 5.
Bacher, R. & Kendziorski, C. Design and computational analysis of singlecell RNAsequencing experiments. Genome Biol. 17, 63 (2016).
 6.
Haque, A., Engel, J., Teichmann, S. A. & Lönnberg, T. A practical guide to singlecell RNAsequencing for biomedical research and clinical applications. Genome Med. 9, 75 (2017).
 7.
Dal Molin, A. & Di Camillo, B., How to design a singlecell RNAsequencing experiment: pitfalls, challenges and perspectives. Brief. Bioinform. 20, 1384–1394 2018.
 8.
Ecker, J. R. et al. The brain initiative cell census consortium: lessons learned toward generating a comprehensive brain cell atlas. Neuron 96, 542–557 (2017).
 9.
Pollen, A. A. et al. Lowcoverage singlecell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex. Nat. Biotechnol. 32, 1053 (2014).
 10.
Jaitin, D. A. et al. Massively parallel singlecell RNAseq for markerfree decomposition of tissues into cell types. Science 343, 776–779 (2014).
 11.
Heimberg, G., Bhatnagar, R., ElSamad, H. & Thomson, M. Low dimensionality in gene expression data enables the accurate extraction of transcriptional programs from shallow sequencing. Cell Systems 2, 239–250 (2016).
 12.
Shalek, A. K. et al. Singlecell RNAseq reveals dynamic paracrine control of cellular variation. Nature 510, 363 (2014).
 13.
Tung, P.Y. et al. Batch effects and the effective design of singlecell gene expression studies. Sci. Rep. 7, 39921 (2017).
 14.
Rizzetto, S. et al. Impact of sequencing depth and read length on single cell RNA sequencing data of t cells. Sci. Rep. 7, 12781 (2017).
 15.
Torre, E. et al. Rare cell detection by singlecell RNA sequencing as guided by singlemolecule RNA fish. Cell Syst. 6, 171–179 (2018).
 16.
Wang, J. et al. Gene expression distribution deconvolution in singlecell RNA sequencing. Proc. Natl Acad. Sci. 115, E6437–E6446 (2018).
 17.
Efron, B. Two modeling strategies for empirical Bayes estimation. Stat. Sci. 29, 285 (2014).
 18.
Efron, B. Empirical Bayes deconvolution estimates. Biometrika 103, 1–20 (2016).
 19.
Robinson, M. D., McCarthy, D. J. & Smyth, G. K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139–140 (2010).
 20.
Robinson, M. D. & Smyth, G. K. Moderated statistical tests for assessing differences in tag abundance. Bioinformatics 23, 2881–2887 (2007).
 21.
Chen, W., Li, Y., Easton, J., Finkelstein, D., Wu, G. & Chen, X. UMIcount modeling and differential expression analysis for singlecell RNA sequencing. Genome Biol. 19, 70 (2018).
 22.
Zheng, G. X. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017).
 23.
Macosko, E. Z. et al. Highly parallel genomewide expression profiling of individual cells using nanoliter droplets. Cell 161, 1202–1214 (2015).
 24.
Klein, A. M. et al. Droplet barcoding for singlecell transcriptomics applied to embryonic stem cells. Cell 161, 1187–1201 (2015).
 25.
Efron, B., Largescale inference: empirical Bayes methods for estimation, testing, and prediction, 1. Cambridge University Press, 2012.
 26.
Huang, M. et al. Saver: gene expression recovery for singlecell RNA sequencing. Nat. Methods 15, 539 (2018).
 27.
Grün, D., Kester, L. & Van Oudenaarden, A. Validation of noise models for singlecell transcriptomics. Nat. Methods 11, 637 (2014).
 28.
Hashimshony, T., Wagner, F., Sher, N. & Yanai, I. Celseq: singlecell RNAseq by multiplexed linear amplification. Cell Rep. 2, 666–673 (2012).
 29.
Regev, A. et al. Science forum: the human cell atlas. Elife 6, e27041 (2017).
 30.
Kim, J. K., Kolodziejczyk, A. A., Ilicic, T., Teichmann, S. A. & Marioni, J. C. Characterizing noise structure in singlecell RNAseq distinguishes genuine from technical stochastic allelic expression. Nat. Commun. 6, 8687 (2015).
 31.
Wang, L., Feng, Z., Wang, X., Wang, X. & Zhang, X. DEGseq: an R package for identifying differentially expressed genes from RNAseq data. Bioinformatics 26, 136–138 (2009).
 32.
Korthauer, K. D. et al. A statistical approach for identifying differential distributions in singlecell RNAseq experiments. Genome Biol. 17, 222 (2016).
 33.
Jolliffe, I. T., Principal component analysis and factor analysis, in Principal component analysis, 115–128, Springer, 1986.
 34.
Abid, A., Zhang, M. J., Bagaria, V. K. & Zou, J. Exploring patterns enriched in a dataset with contrastive principal component analysis. Nat. Commun. 9, 2134 (2018).
 35.
Shi, J. & Malik, J. Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell 22, 888–905 (2000).
 36.
Friedman, J., Hastie, T. & Tibshirani, R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9, 432–441 (2008).
 37.
Zhang, B. & Horvath, S. A general framework for weighted gene coexpression network analysis. Stat. Appl. Genet. Mol. Biol. 4, Article17 (2005).
 38.
Mohammadi, S., DavilaVelderrain, J., Kellis, M. & Grama, A. DECODEing sparsity patterns in singlecell RNAseq, Preprint at https://doi.org/10.1101/241646v2 (2018).
 39.
Schaum, N. et al. Singlecell transcriptomics of 20 mouse organs creates a tabula muris: the tabula muris consortium. Nature 562, 367 (2018).
 40.
Consortium, G. et al. Genetic effects on gene expression across human tissues. Nature 550, 204 (2017).
 41.
Weinstein, J. N. et al. The cancer genome atlas pancancer analysis project. Nat. Genet. 45, 1113 (2013).
 42.
Edgar, R., Domrachev, M. & Lash, A. E. Gene expression omnibus: Ncbi gene expression and hybridization array data repository. Nucleic Acids Res. 30, 207–210 (2002).
 43.
Rosenberg, A. B. et al. Singlecell profiling of the developing mouse brain and spinal cord with splitpool barcoding. Science 360, 176–182 (2018).
 44.
AlJanahi, A. A., Danielsen, M. & Dunbar, C. E. An introduction to the analysis of singlecell rnasequencing data. Mol. Ther. Methods Clin. Dev. 10, 189–196 (2018).
 45.
Jiao, J., Venkat, K., Han, Y. & Weissman, T. Minimax estimation of functionals of discrete distributions. IEEE Transact. Inform. Theory 61, 2835–2885 (2015).
 46.
Yang, Y. Wu et al. Chebyshev polynomials, moment matching, and optimal estimation of the unseen. Ann. Stat. 47, 857–883 (2019).
 47.
Orlitsky, A., Suresh, A. T. & Wu, Y. Optimal prediction of the number of unseen species. Proc. Natl Acad Sci. 113, 13283–13288 (2016).
 48.
Kong, W. et al. Spectrum estimation from samples. Ann. Stat. 45, 2218–2247 (2017).
 49.
Good, I. & Toulmin, G. The number of new species, and the increase in population coverage, when a sample is increased. Biometrika 43, 45–63 (1956).
 50.
Efron, B. & Thisted, R. Estimating the number of unseen species: how many words did Shakespeare know? Biometrika 63, 435–447 (1976).
 51.
Szklarczyk, D. et al. String v10: proteinprotein interaction networks, integrated over the tree of life. Nucleic Acids Res. 43, D447–D452 (2014).
 52.
Svensson, V. et al. Power analysis of singlecell RNAsequencing experiments. Nat. Methods 14, 381 (2017).
 53.
Ding, J. et al., Systematic comparative analysis of single cell rnasequencing methods, Preprint at https://doi.org/10.1101/632216v2 (2019).
Acknowledgements
This research was in part motivated by discussions on the experimental design question in the Human Cell Atlas First Annual Jamboree meeting. We thank Lior Pachter for his valuable input and constructive suggestions throughout the course of this study; Jase Gehring, Wenying Pan, and Taibo Li for their helpful feedback; and Dominic Grï¿½n for providing the smFISH data corresponding to the CELseq data. Thanks also to Patrick Marks for very useful feedback on an earlier version of the paper. D.T. and M.J.Z. are supported in part by the Center of Science of Information, an NSF Science and Technology Center, under grant agreement CCF0939370 and in part by the National Human Genome Research Institute of the National Institutes of Health under award number R01HG008164. M.J.Z. is also supported by a Stanford Graduate Fellowship (Inventec Fellow). V.N. is supported in part by the Center for Science of Information and in part by a gift from Qualcomm Inc.
Author information
Affiliations
Contributions
M.J.Z. and V.N. conceived the idea and performed the empirical experiments. M.J.Z. performed the theoretical analysis. M.J.Z., V.N. and D.T. wrote the manuscript. D.T. supervised the research. All authors reviewed the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Peer review information Nature Communications thanks Jay West and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Zhang, M.J., Ntranos, V. & Tse, D. Determining sequencing depth in a singlecell RNAseq experiment. Nat Commun 11, 774 (2020). https://doi.org/10.1038/s4146702014482y
Received:
Accepted:
Published:
Further reading

Separating measurement and expression models clarifies confusion in singlecell RNA sequencing analysis
Nature Genetics (2021)

A coordinated progression of progenitor cell states initiates urinary tract development
Nature Communications (2021)

A multicenter study benchmarking singlecell RNA sequencing technologies using reference samples
Nature Biotechnology (2021)

Recent Advances in SingleCell Profiling and Multispecific Therapeutics: Paving the Way for a New Era of Precision Medicine Targeting Cardiac Fibroblasts
Current Cardiology Reports (2021)
Comments
By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.