Single-cell RNA sequencing (scRNA-seq) is a powerful tool for characterizing cell-to-cell variation and cellular dynamics in populations that otherwise appear homogeneous, in both basic and translational biological research.

However, significant challenges arise in the analysis of scRNA-seq data, including a low signal-to-noise ratio coupled with high data sparsity, potential batch effects, and scalability problems when hundreds of thousands of cells are to be analyzed, among others.

The inherent complexities of scRNA-seq data and the dynamic nature of cellular processes lead to suboptimal performance of many currently available algorithms, even for basic tasks such as identifying biologically meaningful heterogeneous subpopulations.

In this study, we developed Latent Cellular Analysis (LCA), a machine learning-based analytical pipeline that combines cosine-similarity measurement in the space of latent cellular states with a graph-based clustering algorithm. LCA provides heuristic solutions for population-number inference, dimension reduction, feature selection, and control of technical variations, without explicit gene filtering.

We show that LCA is robust, accurate, and powerful by comparison with multiple state-of-the-art computational methods when applied to large-scale real and simulated scRNA-seq data. Importantly, the ability of LCA to learn from representative subsets of the data provides scalability, thereby addressing a significant challenge posed by growing sample sizes in scRNA-seq data analysis.

Single-cell RNA sequencing (scRNA-seq) quantifies cell-to-cell variation in transcript abundance, leading to a deep understanding of the diversity of cell types and the dynamics of cell states at a scale of tens of thousands of single cells [1-3].

Although scRNA-seq offers enormous opportunities and has inspired a tremendous explosion of data-analysis methods for identifying heterogeneous subpopulations, significant challenges arise because of the inherently high noise associated with data sparsity and the ever-increasing number of cells sequenced. The current state-of-the-art algorithms have significant limitations.

The cell-to-cell similarity learned by most machine learning-based tools, such as Seurat [4], Monocle2 [5], SIMLR [6], and SC3 [7], is not always easy to interpret, and significant effort is required for a human scientist to make sense of the results and generate a hypothesis. Several methods require the user to provide an estimate of the number of clusters in the data, which may not be readily available and is often arbitrary.

Furthermore, many methods have a high computational cost that is prohibitive for datasets comprising large numbers of cells.

Most methods employ a variance-based, overdispersed gene-selection step before clustering analysis, based on the assumption that a small subset of highly variable genes is the most informative for revealing cellular diversity. Although this assumption may be valid in certain scenarios, the overall low signal-to-noise ratio in scRNA-seq data means that many non-informative genes (e.g., high-magnitude outliers and dropouts) can end up among those selected as most variable.

Consequently, this step can introduce additional challenges for downstream analysis when the informative genes are not the most variable ones, as happens when the differences among subpopulations are subtle or when a strong batch effect makes the most variable genes differ by batch.

Latent semantic indexing (LSI) is a machine-learning technique originally developed in information retrieval [13], where semantic embedding converts the sparse word vector of a text document to a low-dimensional vector space that represents the underlying concepts of those documents.

LCA is an accurate, robust, and scalable computational pipeline that facilitates a deep understanding of the transcriptomic states and dynamics of single cells in large-scale scRNA-seq datasets.

Seurat part 3 – Data normalization and PCA

LCA makes a robust inference of the number of populations directly from the data (a user can specify this with a priori information), rigorously models the contributions from potentially confounding factors, generates a biologically interpretable characterization of the cellular states, and recovers the underlying population structures.

Furthermore, LCA addresses the scalability problem by learning a model from a subset of the sample, after which a theoretical scheme is used to assign the remaining cells to identified populations.


We perform spectral clustering [19] on the resulting distance matrix to derive a set of candidate clustering models spanning a range of cluster numbers. When known technical variations were strongly associated with significant components, those PCs were aligned with the technical variations and discarded. Distance between cells was measured by the correlation distance of the significant components; when fewer than three PCs were retained, Euclidean distance was used instead.
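To make the description concrete, here is a hypothetical sketch of these two steps. This is not the authors' code: kernlab's specc stands in for the spectral-clustering implementation, and Z is a placeholder for the matrix of significant latent components.

```r
# Hypothetical sketch of the distance and clustering steps described above.
library(kernlab)

Z <- matrix(rnorm(200 * 5), nrow = 200)   # placeholder: cells x significant components
D <- 1 - cor(t(Z))                        # correlation distance between cells
if (ncol(Z) < 3) D <- as.matrix(dist(Z))  # Euclidean fallback when fewer than three PCs remain

S <- as.kernelMatrix(1 - D / max(D))      # convert distance to a similarity kernel
fits <- lapply(2:6, function(k) specc(S, centers = k))  # candidate models over a range of k
```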

features: Vector of features to use when computing the PCA to determine the weights. Only set this if you want a different set from those used in the anchor-finding process.

weight.reduction: Dimension reduction to use when calculating anchor weights. This can be a vector of strings, specifying the name of a dimension reduction to use for each object to be integrated.

Alternatively, it can be a vector of DimReduc objects, specifying the object to use for each object in the integration. Note that, if specified, the requested dimension reduction will only be used for calculating anchor weights in the first merge between reference and query, as the merged object will subsequently contain more cells than were in the query, and weights will need to be calculated for all cells in the object.

The main steps of this procedure are outlined below. For a more detailed description of the methodology, please see Stuart, Butler, et al. (Cell, 2019).

1. Construct a weights matrix that defines the association between each query cell and each anchor. These weights are computed as 1 minus the distance between the query cell and the anchor, divided by the distance of the query cell to the k.weight-th anchor.

We then apply a Gaussian kernel with a bandwidth defined by sd.weight.

2. Compute the anchor integration matrix as the difference between the two expression matrices for every pair of anchor cells.

3. Compute the transformation matrix as the product of the integration matrix and the weights matrix.
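As a rough numerical illustration of the weighting in step 1, here is a hypothetical helper; this is not Seurat's internal code, and the exact kernel form lives in Seurat's source.

```r
# Hypothetical illustration of the anchor-weighting scheme described above.
# d: sorted distances from one query cell to its k.weight nearest anchors.
anchor_weights <- function(d, sd.weight = 1) {
  w <- 1 - d / d[length(d)]                 # 1 - distance scaled by the k.weight-th anchor
  w <- exp(-(1 - w)^2 / (2 * sd.weight^2))  # Gaussian kernel with bandwidth sd.weight
  w / sum(w)                                # normalize across the k.weight anchors
}

anchor_weights(c(0.2, 0.5, 0.9, 1.3))  # closest anchor receives the largest weight
```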

Seurat - Combining Two 10X Runs

For multiple-dataset integration, we perform iterative pairwise integration. To determine the order of integration (if not specified via sample.tree), we define a distance between datasets as the total number of cells in the smaller dataset divided by the total number of anchors between the two datasets.
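A hypothetical helper mirroring this heuristic (illustrative only):

```r
# Distance between two datasets: size of the smaller dataset over shared anchors.
dataset_distance <- function(n_cells_a, n_cells_b, n_anchors) {
  min(n_cells_a, n_cells_b) / n_anchors  # fewer shared anchors => larger distance
}

dataset_distance(5000, 8000, 400)  # 12.5; the closest pair is integrated first
```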

Returns a Seurat object with a new integrated Assay; how the corrected values are stored depends on the normalization.method used. Reference: Stuart T, Butler A, et al. Comprehensive Integration of Single-Cell Data. Cell (2019).

How to Combine and Merge Data Sets in R

Description: Perform dataset integration using a pre-computed AnchorSet.
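The basic call shape looks like the following; the argument values are illustrative, and anchors is assumed to come from a prior FindIntegrationAnchors call.

```r
# Integrate datasets using a pre-computed AnchorSet.
integrated <- IntegrateData(
  anchorset      = anchors,       # AnchorSet from FindIntegrationAnchors()
  new.assay.name = "integrated",  # name of the Assay that will hold corrected values
  dims           = 1:30           # dimensions used when computing anchor weights
)
```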


In this tutorial we will look at different ways of filtering cells and exploring variability in the data. However, we will not go into depth in how to use the Seurat package, as this will be covered in other tutorials.

We will look at different ways of visualizing QC stats and exploring variation in the data. Please follow these instructions to load the conda environment before running the code below. If you want to access the datasets after the course, you can download them using the following commands:


First, create Seurat objects for each of the datasets, and then merge them into one large Seurat object. In the same manner, we will calculate the proportion of gene expression that comes from ribosomal proteins (see the sketch below). NOTE - add text on why! As you can see, the v2 chemistry gives lower gene detection but higher detection of ribosomal proteins. As the ribosomal proteins are highly expressed, they make up a larger proportion of the transcriptional landscape when fewer of the lowly expressed genes are detected.
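A minimal sketch of these steps; the per-sample object names (pbmc.v2, pbmc.v3, pbmc.p3) are hypothetical.

```r
# Merge per-sample Seurat objects into one, tagging cell barcodes by sample.
alldata <- merge(pbmc.v2, y = c(pbmc.v3, pbmc.p3),
                 add.cell.ids = c("v2", "v3", "p3"))

# Percentage of counts from mitochondrial and ribosomal protein genes.
alldata[["percent_mito"]] <- PercentageFeatureSet(alldata, pattern = "^MT-")
alldata[["percent_ribo"]] <- PercentageFeatureSet(alldata, pattern = "^RP[SL]")
```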

We have quite a lot of cells with a high proportion of mitochondrial reads. It could be wise to remove those cells, if we have enough cells left after filtering. Another option would be to remove all mitochondrial reads from the dataset and hope that the remaining genes still have enough biological signal.

A third option would be to regress out the percent.mito variable. Looking at the plots, make reasonable decisions on where to draw the cutoff.

As you can see, there is still quite a lot of variation in percent.mito, so it will have to be dealt with in the data analysis step. An extremely high number of detected genes could indicate doublets. However, depending on the cell type composition in your sample, you may have cells with a higher number of genes (and also higher counts) from one cell type. In these datasets, there is also a clear difference between the v2 and v3 10x chemistries with regard to gene detection, so it may not be fair to apply the same cutoffs to all of them.

Also, in the protein assay data there are a lot of cells with few detected genes, giving a bimodal distribution. This type of distribution is not seen in the other two datasets.

Considering that they are all PBMC datasets, it makes sense to regard this distribution as low-quality libraries. Filter the cells with high gene detection (putative doublets), with separate cutoffs for v3 and v2 chemistry (see the sketch below). Very similar QC plots and filtering of cells can be done with the scater package, but since we already filtered cells using Seurat, we will now just use scater to explore technical bias in the data. We can also include information on which genes are mitochondrial in the function call.
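A hedged sketch of such filtering; the thresholds below are placeholders to be read off your own QC plots.

```r
# Hypothetical per-chemistry gene cutoffs; adjust from the QC plots.
high.cut <- ifelse(alldata$orig.ident == "v2", 2000, 4100)
keep <- colnames(alldata)[alldata$nFeature_RNA < high.cut &
                          alldata$percent_mito < 25]
alldata.filt <- subset(alldata, cells = keep)
```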

This can be valuable for detecting genes that are overabundant and may be driving a lot of the variation. I would consider removing such a gene before further analysis and clustering.

Using single-cell -omics data, it is now possible to computationally order cells along trajectories, allowing the unbiased study of cellular dynamic processes.

In recent years, more than 50 trajectory inference methods have been developed, each with its own set of methodological characteristics. As a result, choosing a method to infer trajectories is often challenging, since a comprehensive assessment of the performance and robustness of each method is still lacking.

In order to facilitate the comparison of the results of these methods to each other and to a gold standard, we developed a global framework to benchmark trajectory inference tools. Using this framework, we compared the trajectories from a total of 29 trajectory inference methods, on a large collection of real and synthetic datasets.

We evaluated methods using several metrics, including accuracy of the inferred ordering, correctness of the network topology, code quality, and user friendliness. We found that some methods, including Slingshot (Street et al.), clearly outperformed the others. Based on our benchmarking results, we therefore developed a set of guidelines for method users. However, our analysis also indicated that there is still a lot of room for improvement, especially for methods detecting complex trajectory topologies.

Our evaluation pipeline can therefore be used to spearhead the development of new, scalable, and more accurate methods, and is available on GitHub.

References:
Saelens W, Cannoodt R, Todorov H, Saeys Y. A comparison of single-cell trajectory inference methods.
Haghverdi L, Buettner F, Theis FJ. Diffusion maps for high-dimensional single-cell analysis of differentiation data.
Haghverdi L, et al. Diffusion pseudotime robustly reconstructs lineage branching.
Ji Z, Ji H. TSCAN: pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis.
Street K, et al. Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics.

Seurat v3 introduces new methods for the integration of multiple single-cell datasets. These methods aim to identify shared cell states that are present across different datasets, even if they were collected from different individuals, experimental conditions, technologies, or even species.

These anchors represent pairwise correspondences between individual cells (one in each dataset) that we hypothesize originate from the same biological state.


Below, we demonstrate multiple applications of integrative analysis and also introduce new functionality beyond what was described in the manuscript. To help guide users, we briefly introduce these vignettes below. In this example workflow, we demonstrate two new methods we recently introduced in our paper, Comprehensive Integration of Single-Cell Data. We provide a combined raw data matrix and associated metadata file here to get started. The code for the new methodology is implemented in Seurat v3.

You can download and install it from CRAN with install.packages (as sketched below). In addition to new methods, Seurat v3 includes a number of improvements aiming to improve the Seurat object and user interaction. To help users familiarize themselves with these changes, we put together a command cheat sheet for common tasks.

Load in the expression matrix and metadata. The metadata file contains the technology (tech column) and cell type annotations (celltype column) for each cell in the four datasets. First, we split the combined object into a list, with each dataset as an element.
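A sketch of the setup steps; the file names are assumptions based on the description above.

```r
# Install and load Seurat; file names below are assumed.
install.packages("Seurat")
library(Seurat)

pancreas.data <- readRDS("pancreas_expression_matrix.rds")
metadata      <- readRDS("pancreas_metadata.rds")

pancreas <- CreateSeuratObject(counts = pancreas.data, meta.data = metadata)
pancreas.list <- SplitObject(pancreas, split.by = "tech")
```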


Prior to finding anchors, we perform standard preprocessing (log-normalization) and identify variable features individually for each dataset. Note that Seurat v3 implements an improved method for variable feature selection based on a variance-stabilizing transformation ("vst"). Next, we identify anchors using the FindIntegrationAnchors function, which takes a list of Seurat objects as input. Here, we integrate three of the objects into a reference; we will use the fourth later in this vignette.
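A sketch of these steps, continuing from the hypothetical pancreas.list above; the reference dataset names are assumptions.

```r
# Log-normalize and select variable features for each dataset independently.
for (i in seq_along(pancreas.list)) {
  pancreas.list[[i]] <- NormalizeData(pancreas.list[[i]], verbose = FALSE)
  pancreas.list[[i]] <- FindVariableFeatures(pancreas.list[[i]],
                                             selection.method = "vst",
                                             nfeatures = 2000, verbose = FALSE)
}

# Find anchors across the three datasets chosen as the reference (names assumed).
reference.list <- pancreas.list[c("celseq", "celseq2", "smartseq2")]
pancreas.anchors <- FindIntegrationAnchors(object.list = reference.list, dims = 1:30)
```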

We then pass these anchors to the IntegrateData function, which returns a Seurat object. After running IntegrateData, the Seurat object will contain a new Assay with the integrated expression matrix. We can then use this new integrated matrix for downstream analysis and visualization; the integrated datasets cluster by cell type instead of by technology.
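Continuing the sketch (assumed object names as above):

```r
pancreas.integrated <- IntegrateData(anchorset = pancreas.anchors, dims = 1:30)

# Downstream analysis now uses the corrected values in the new "integrated" assay.
DefaultAssay(pancreas.integrated) <- "integrated"
pancreas.integrated <- ScaleData(pancreas.integrated, verbose = FALSE)
pancreas.integrated <- RunPCA(pancreas.integrated, npcs = 30, verbose = FALSE)
pancreas.integrated <- RunUMAP(pancreas.integrated, reduction = "pca", dims = 1:30)
```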

Seurat v3 also supports the projection of reference data (or metadata) onto a query object. While many of the steps are conserved (both procedures begin by identifying anchors), there are two important distinctions between data transfer and integration. After finding anchors, we use the TransferData function to classify the query cells based on reference data (a vector of reference cell type labels). TransferData returns a matrix with predicted IDs and prediction scores, which we can add to the query metadata.
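A sketch of the transfer workflow; the held-out dataset name and the celltype metadata column are assumptions.

```r
pancreas.query <- pancreas.list[["fluidigmc1"]]  # assumed name of the held-out dataset
transfer.anchors <- FindTransferAnchors(reference = pancreas.integrated,
                                        query = pancreas.query, dims = 1:30)
predictions <- TransferData(anchorset = transfer.anchors,
                            refdata = pancreas.integrated$celltype, dims = 1:30)
pancreas.query <- AddMetaData(pancreas.query, metadata = predictions)
```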

It is not an issue with the quality of the cluster, as I can run this on very well-defined astrocyte clusters against neuronal clusters and still get this error. I doubt max.cells.per.ident is the cause. I also suspect the gene filter isn't working efficiently when downsampling genes, and the error still occurs when I increase the number of genes used.

The same thing happens if I reduce the number of cells. Is it possible that the function is asking for the full object's dense matrix before filtering the genes or cells? I wonder if my post is a manifestation of the same underlying issue(s).

I was wondering if the Seurat object itself was corrupted. Calling SubsetData to reduce each ident to fewer cells did not fix the issue described in the post above. Using Seurat 2.

Do you receive the same error with the default expression tests? It breaks with any test. I found FindMarkers is breaking at this point, which is independent of the test chosen. At this point, neither the cells nor the genes have been filtered, so there's no reason the full dense matrix should be needed.

Now that we have performed our initial cell-level QC and removed potential outliers, we can go ahead and normalize the data. Seurat calculates highly variable genes and focuses on these for downstream analysis.
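A minimal sketch using the Seurat v2-era API that this part of the tutorial describes; pbmc is a placeholder object and the cutoffs are illustrative.

```r
# Log-normalize, then select variable genes by dispersion (Seurat v2-era API).
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize",
                      scale.factor = 10000)
pbmc <- FindVariableGenes(pbmc, mean.function = ExpMean,
                          dispersion.function = LogVMR,
                          x.low.cutoff = 0.0125, x.high.cutoff = 3,
                          y.cutoff = 0.5)
```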

This helps control for the relationship between variability and average expression. We suggest that users set these parameters to mark visual outliers on the dispersion plot, but the exact parameter settings may vary based on the data type, heterogeneity in the sample, and normalization strategy.

This could include not only technical noise but also batch effects, or even biological sources of variation (such as cell cycle stage). To mitigate the effect of these signals, Seurat constructs linear models to predict gene expression based on user-defined variables. The scaled z-scored residuals of these models are stored in the scale.data slot.

We can regress out cell-cell variation in gene expression driven by batch (if applicable), cell alignment rate (as provided by Drop-seq tools for Drop-seq data), the number of detected molecules, and mitochondrial gene expression. In this simple example, for post-mitotic blood cells, we regress on the number of detected molecules per cell as well as the percentage of mitochondrial gene content; Seurat v2 implements this regression as part of the data scaling process. Next, we perform PCA on the scaled data. We have typically found that running dimensionality reduction on highly variable genes can improve performance.
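For example (v2-era API, placeholder object):

```r
# Regress out technical covariates during scaling, then run PCA on variable genes.
pbmc <- ScaleData(pbmc, vars.to.regress = c("nUMI", "percent.mito"))
pbmc <- RunPCA(pbmc, pc.genes = pbmc@var.genes, do.print = FALSE)
```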

However, with UMI data, particularly after regressing out technical variables, we often see that PCA returns similar (albeit slower) results when run on much larger subsets of genes, including the whole transcriptome. Both cells and genes are ordered according to their PCA scores.
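For instance, the per-PC heatmap with only the extreme cells plotted (v2-era API, illustrative values):

```r
# Plot extreme cells only, to speed up heatmap rendering on large datasets.
PCHeatmap(pbmc, pc.use = 1, cells.use = 500, do.balanced = TRUE,
          label.columns = FALSE)
```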

Setting cells.use to a number plots the "extreme" cells on both ends of the spectrum, which dramatically speeds plotting for large datasets. Though clearly a supervised analysis, we find this to be a valuable tool for exploring correlated gene sets. Determining how many PCs to include downstream is therefore an important step. In this case, it appears that the leading PCs are significant. A more ad hoc method for determining which PCs to use is to look at a plot of the standard deviations of the principal components and draw your cutoff where there is a clear elbow in the graph.

In this example, it looks like the elbow would fall around PC 9. We therefore suggest these three approaches to consider. The first is more supervised, exploring PCs to determine relevant sources of heterogeneity, and could be used in conjunction with GSEA for example. The second implements a statistical test based on a random null model, but is time-consuming for large datasets, and may not return a clear PC cutoff. The third is a heuristic that is commonly used, and can be calculated instantly.
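A sketch of the second and third approaches (v2-era API, placeholder object):

```r
# JackStraw: permutation-based significance test for PCs (slow on large data).
pbmc <- JackStraw(pbmc, num.replicate = 100)
JackStrawPlot(pbmc, PCs = 1:12)

# Elbow heuristic: standard deviation of each PC; cut at the elbow.
PCElbowPlot(pbmc)
```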

In this example, all three approaches yielded similar results, but we might have been justified in choosing any of several nearby PCs as a cutoff. We followed the JackStraw here, admittedly buoyed by seeing the PCHeatmap returning interpretable signals (including canonical dendritic cell markers) throughout these PCs.

