5 Integration of metacells

In this section, we will work with the Human Cell Lung Atlas core HLCA gathering around 580,000 cells from 107 individuals distributed in 166 samples.

The aim of this tutorial is to show how you can use metacells to analyze a very large dataset using a reasonable amount of time and memory. For this we will use here SuperCell via the MCAT command line tool.

We will first perform the integration in an unsupervised mode (5.1), i.e., without considering the single-cell annotation. Then, we demonstrate supervised integration (5.2).

5.1 Unsupervised integration

In this section, we will first perform the integration in an unsupervised mode, i.e., without considering the single-cell annotation. For a supervised version, please refer to section 5.2.

5.1.1 Data loading

Please follow the section 0.4.1 to retrieve the HLCA atlas, divide the atlas by dataset and save the splitted data in the following folder: “data/HLCA/”.

5.1.2 Setting up the environment

First we need to specify that we will work with the MetacellAnalysisToolkit conda environment (needed for anndata relying on reticulate and the MATK tool). To build the conda environment please follow the instructions on our MetacellAnalysisToolkit github repository. Please skip this step if you did not use conda in the requirements section.

library(reticulate)
conda_env <-  conda_list()[reticulate::conda_list()$name == "MetacellAnalysisToolkit","python"]

Sys.setenv(RETICULATE_PYTHON = conda_env)

#>           used  (Mb) gc trigger (Mb) max used (Mb)
#> Ncells 1912851 102.2    3088062  165  3088062  165
#> Vcells 3338476  25.5    8388608   64  5498779   42

library(Seurat)
#> The legacy packages maptools, rgdal, and rgeos, underpinning this package
#> will retire shortly. Please refer to R-spatial evolution reports on
#> https://r-spatial.org/r/2023/05/15/evolution4.html for details.
#> This package is now running under evolution status 0
#> Attaching SeuratObject
library(anndata)
library(SuperCell)
library(ggplot2)
wilcox.test <- "wilcox"
if(packageVersion("Seurat") >= 5) {
  options(Seurat.object.assay.version = "v4") 
  wilcox.test <- "wilcox_limma"
  print("you are using seurat v5 with assay option v4")}

color.celltypes  <- c('#E5D2DD', '#53A85F', '#F1BB72', '#F3B1A0', '#D6E7A3', '#57C3F3', '#476D87',
                      '#E95C59', '#E59CC4', '#AB3282', '#23452F', '#BD956A', '#8C549C', '#585658',
                      '#9FA3A8', '#E0D4CA', '#5F3D69', '#58A4C3', "#b20000", '#E4C755', '#F7F398',
                      '#AA9A59', '#E63863', '#E39A35', '#C1E6F3', '#6778AE', '#91D0BE', '#B53E2B',
                      '#712820', '#DCC1DD', '#CCE0F5', '#CCC9E6', '#625D9E', '#68A180', '#3A6963',
                      '#968175')

5.1.3 Building metacell

We build metacells with the MATK command line using SuperCell (-t SuperCell). To facilitate downstream analysis of the donors we build metacells for each sample in each dataset (-a sample). Here we will use 2000 highly variable genes (-f 2000) to compute the PCA from which we used 50 principal components (-m 50) to build a k = 30 (-k 30) nearest neighbor graph on which the metacells are identified using a graining level of 50 (-g 50). We use an adata .h5ad output format (-s adata) as it is faster to write and lighter to store than a Seurat .rds object.

This step takes around 20 min with multiple cores (-l 6). Be aware that parallel processing requires more memory (32 GB of memory required for 6 cores).

If you are limited in memory you should still be able to process the samples by reducing the number of cores (e.g. -l 3) or by sequentially processing the samples (just remove the -l) in a slightly longer time.

To call the MATK command line, please define your path to the gihub cloned repository optained from this github repository.

#git clone https://github.com/GfellerLab/MetacellAnalysisToolkit
MATK_path=MetacellAnalysisToolkit/
start=`date +%s`
for d in data/HLCA/datasets/*;
do ${MATK_path}cli/MATK -t SuperCell -i $d/sc_adata.h5ad -o $d -a sample -l 6 -n 50 -f 2000 -k 30 -g 50 -s adata
done
echo "Duration: $((($(date +%s)-$start)/60)) minutes"

5.1.4 Loading metacell objects

We load the .h5ad objects and directly convert them in Seurat objects to benefit from all the functions of this framework. To consider the datasets in the same order as the one used in this tutorial we run the following chunk before loading the metacell objects.

library(anndata)
adata <- read_h5ad("data/HLCA/local.h5ad",backed = "r")
adata$var_names <- adata$var$feature_name # We will use gene short name for downstream analyses
datasets <- unique(adata$obs$dat)
rm(adata)
gc()


metacell.files <- sapply(datasets, FUN = function(x){paste0("data/HLCA/datasets/",x,"/mc_adata.h5ad")})

metacell.objs <- lapply(X = metacell.files, function(X){
  adata <- read_h5ad(X)
  countMatrix <- Matrix::t(adata$X)
  colnames(countMatrix) <- adata$obs_names
  rownames(countMatrix) <- adata$var_names
  sobj <- Seurat::CreateSeuratObject(counts = countMatrix,meta.data = adata$obs)
  if(packageVersion("Seurat") >= 5){sobj[["RNA"]] <- as(object = sobj[["RNA"]], Class = "Assay")}
  sobj <- RenameCells(sobj, add.cell.id = unique(sobj$sample)) # we give unique name to metacells
  return(sobj)
})

5.1.5 Merging objects and basic quality control

Given the single-cell metadata, the MATK tool automatically assigns annotations to metacells and computes purities for all the categorical variables present in the metadata of the input single-cell object.

Thus, we can check the purity of our metacells at different levels of annotations, as well as their size (number of single cells they contain).

To do so we merge the object together and use the Seurat VlnPlot function.

unintegrated.mc <- merge(metacell.objs[[1]], metacell.objs[-1])

VlnPlot(unintegrated.mc, features = c("size", "ann_level_1_purity"), group.by = 'dataset', pt.size = 0.001, ncol = 2)

VlnPlot(unintegrated.mc, features = c("ann_level_2_purity", "ann_level_3_purity"), group.by = 'dataset', pt.size = 0.001, ncol = 2)

We can also use box plots.

ggplot(unintegrated.mc@meta.data,aes(x = dataset, y = ann_level_2_purity, fill = dataset)) + geom_boxplot() +
  scale_x_discrete(guide = guide_axis(angle = 45))


ggplot(unintegrated.mc@meta.data,aes(x = dataset, y = ann_level_3_purity, fill = dataset)) + geom_boxplot() +
  scale_x_discrete(guide = guide_axis(angle = 45))


ggplot(unintegrated.mc@meta.data,aes(x = dataset, y = ann_level_4_purity, fill = dataset)) + geom_boxplot() +
  scale_x_discrete(guide = guide_axis(angle = 45))


ggplot(unintegrated.mc@meta.data,aes(x = dataset, y = ann_finest_level_purity, fill = dataset)) + geom_boxplot() +
  scale_x_discrete(guide = guide_axis(angle = 45))

Overall metacells from the different datasets present a good purity until the third level of annotation.

5.1.6 Unintegrated analysis

Let’s first do a standard dimensionality reduction without batch correction.

DefaultAssay(unintegrated.mc) <- "RNA"
unintegrated.mc <- NormalizeData(unintegrated.mc)
unintegrated.mc <- FindVariableFeatures(unintegrated.mc)
unintegrated.mc <- ScaleData(unintegrated.mc)
unintegrated.mc <- RunPCA(unintegrated.mc)
unintegrated.mc <- RunUMAP(unintegrated.mc, dims = 1:30)

umap.unintegrated.datasets <- DimPlot(unintegrated.mc,reduction = "umap",group.by = "dataset") + NoLegend() + ggtitle("unintegrated datasets")
umap.unintegrated.types <- DimPlot(unintegrated.mc, reduction = "umap", group.by = "ann_level_2", label = T, repel = T, cols = color.celltypes)+ NoLegend() + ggtitle("unintegrated cell types")

umap.unintegrated.datasets + umap.unintegrated.types


remove(unintegrated.mc) # we won't use the unintegrated object anymore
gc()

You can see on the plots that a batch effect is clearly present at the metacell level with metacells clustering by datasets inside the major cell types. Let’s correct it.

5.1.7 Seurat integration

Here we will use the standard Seurat_v4 batch correction workflow. As in the original study, we use the dataset rather than the donor as the batch parameter. See method section “Data integration benchmarking” of the original study for more details.

This should take less than 5 minutes.


n.metacells <- sapply(metacell.objs, FUN = function(x){ncol(x)})
names(n.metacells) <- datasets
ref.names <- sort(n.metacells,decreasing = T)[1:5]
ref.index <- which(datasets %in% names(ref.names))

# normalize each dataset
metacell.objs <- lapply(X = metacell.objs, FUN = function(x) {
  DefaultAssay(x) <- "RNA";
  x <- RenameCells(x, add.cell.id = unique(x$sample)) # we give unique name to metacells
  x <- NormalizeData(x)
  x <- FindVariableFeatures(x, selection.method = "vst", nfeatures = 2000)
  return(x)})

features <- SelectIntegrationFeatures(object.list = metacell.objs)

metacell.objs <- lapply(X = metacell.objs, FUN = function(x) {
    x <- ScaleData(x, features = features, verbose = FALSE)
    x <- RunPCA(x, features = features, verbose = FALSE)
})

anchors <- FindIntegrationAnchors(object.list = metacell.objs,
                                       anchor.features = features,
                                       reduction = "rpca",
                                       reference = ref.index, # the 5 biggest datasets (in term of metacell number) are used as reference
                                       dims = 1:30)

remove(metacell.objs) # We don't need the object list anymore
gc()

combined.mc <- IntegrateData(anchorset = anchors,k.weight = 40) # we have to update the k.weight parameters because the smallest dataset contain less than 100 metacells

Check the obtained object.

combined.mc
#> An object of class Seurat 
#> 30024 features across 11706 samples within 2 assays 
#> Active assay: integrated (2000 features, 2000 variable features)
#>  1 other assay present: RNA

We can verify that the sum of metacell sizes corresponds to the original number of single-cells

sum(combined.mc$size)
#> [1] 584944

Seurat returns the slot "integrated" that we can use for the downstream analysis.

DefaultAssay(combined.mc) = "integrated"
combined.mc <- ScaleData(combined.mc, verbose = FALSE)
combined.mc <- RunPCA(combined.mc, npcs = 30, verbose = FALSE)
combined.mc <- RunUMAP(combined.mc, reduction = "pca", dims = 1:30, verbose = FALSE)
combined.mc <- RunUMAP(combined.mc, dims = 1:30,reduction =  "pca",reduction.name = "umap", verbose = FALSE)

Now we can make the plots and visually compare the results with the unintegrated analysis.

umap.integrated.datasets <- DimPlot(combined.mc,reduction = "umap",group.by = "dataset") + NoLegend() + ggtitle("integrated datasets")
umap.integrated.celltypes <- DimPlot(combined.mc,reduction = "umap",group.by = "ann_level_2",label = T,repel = T,cols = color.celltypes) + NoLegend() + ggtitle("integrated cell types")

umap.integrated.datasets + umap.integrated.celltypes + umap.unintegrated.datasets + umap.unintegrated.types

Seurat efficiently corrected the batch effect in the data while keeping the cell type separated, but other batch correction methods such as harmony would have also done the job.

Note that In the original study, datasets were integrated using SCANVI semi-supervised integration using partial annotation obtained for each dataset prior integration. If you are interested in such supervised approach at the metacell level in R you can have a look to our second example in section 5.2 using the STACAS package.

We can navigate in the different annotation levels.

library(ggplot2)

DimPlot(combined.mc,group.by = "ann_level_1",reduction = "umap",label = T,repel = T,cols= color.celltypes) + NoLegend()


DimPlot(combined.mc,group.by = "ann_level_2",reduction = "umap",label = T,repel = T,cols= color.celltypes) + NoLegend()


DimPlot(combined.mc,group.by = "ann_level_3",reduction = "umap",label = T, repel = T,cols= color.celltypes) + NoLegend()

5.1.8 Downstream analysis

5.1.8.1 Clustering

We cluster the metacells based on the corrected PCA space by Seurat.

DefaultAssay(combined.mc) <- "integrated"
combined.mc <- FindNeighbors(combined.mc, reduction = "pca", dims = 1:30)
combined.mc <- FindClusters(combined.mc, resolution = 0.5)
UMAPPlot(combined.mc, label = T) + NoLegend()

5.1.8.2 Deferentially expressed gene (DEG) analysis.

We can find which of the obtained clusters correspond to the B cells annotated in the original study (4th level of annotation)

b.clust <- names(which.max(table(combined.mc$ann_level_4, combined.mc$integrated_snn_res.0.5)["B cells", ]))
b.clust
#> [1] "23"

Now let’s found the markers of this cluster.

DefaultAssay(combined.mc) <- "RNA"

markersB <- FindMarkers(combined.mc, ident.1 = b.clust, only.pos = T, logfc.threshold = 0.25, test.use =  wilcox.test)
head(markersB)
#>       p_val avg_log2FC pct.1 pct.2 p_val_adj
#> TCL1A     0  1.3939363 0.695 0.020         0
#> FCRLA     0  1.0706003 0.962 0.022         0
#> BLK       0  1.2209493 0.990 0.058         0
#> FCRL5     0  0.7922142 0.886 0.027         0
#> PNOC      0  0.5425023 0.924 0.050         0
#> PAX5      0  0.6124840 0.895 0.023         0

This cluster clearly presents a B cell signature with marker genes such as CD19 and PAX5

genes <-c("CD19","PAX5") # knwon B cells markers
markersB[genes,]
#>              p_val avg_log2FC pct.1 pct.2     p_val_adj
#> CD19 6.500657e-208   1.104315 0.990 0.113 1.821744e-203
#> PAX5  0.000000e+00   0.612484 0.895 0.023  0.000000e+00
VlnPlot(combined.mc, genes, ncol = 1)

By looking at the metacell annotation (assigned from the original single-cell metadata by MATK), we can verify that we correctly retrieved the B cell lineage cluster.

DimPlot(combined.mc[, combined.mc$integrated_snn_res.0.5 == b.clust],
        group.by = c("ann_level_4","integrated_snn_res.0.5"),
        ncol = 2)

5.1.8.3 Cell type abundances analyses.

We can easily make analysis of cell type abundances for different clinical variables as we construct metacell by sample. We have to take metacell size into account for these analyses. For instance we can analyse the proportion of different epithelial cell types depending on the smoking status.

library(reshape2)
combined.mc.epith <- combined.mc[,combined.mc$ann_level_1 == "Epithelial"]
#combined.metacells$major_type <- droplevels(combined.metacells$major_type)
smpCounts <- aggregate(combined.mc.epith$size, by=list(sample = combined.mc.epith$sample,
                                                       major_type = combined.mc.epith$ann_level_3,
                                                       smoking_status = combined.mc.epith$smoking_status),
                       FUN = sum)

contingencyTable <- xtabs(x ~ major_type + smoking_status, data = smpCounts)
freqMatrix <- apply(contingencyTable, 1, FUN = function(x){x/colSums(contingencyTable)})
# res <- chisq.test(contingencyTable)
# Roe <- res$observed/res$expected
freqMatrix_df <- melt(freqMatrix)

remove(combined.mc.epith)
gc()

ggplot(freqMatrix_df, aes(y=value,x = smoking_status,fill=major_type))  + geom_bar(stat="identity", position="fill") + scale_fill_manual(values = color.celltypes) + scale_y_continuous(labels = scales::percent) + ylab("% epithelial cells")

Samples from smokers seem to present more AT2 cells but this quick analysis is for illustrative purposes only. In practice it’s far more complex to draw conclusion as we should have considered the variations between samples/donors as well as many other technical (tissue dissociation protocol, tissue sampling method, single-cell platform, … ) and biological (BMI, sex, Age, …) variables.

5.1.9 Conclusion

Overall we made a precise simplification of the original atlas using metacells built from each sample separately. By reducing the size of the original atlas by a factor of 50 we could load the data, make an integration to correct batch effect and recapitulate the main different cell types using a reasonable amount of time and memory. In contrast, simply loading the original single-cell data in R using Seurat is extremely time-consuming and challenging even for the most powerful computers.

In this first example we used a fully unsupervised workflow and did not use any prior biological knowledge. Authors of the original study made a remarkable work annotating the hundreds of thousands cells of the atlas. In the second example in section 5.2 we propose a supervised workflow using this annotation to guide both the metacell identification and the batch correction.

We can save the results for comparison with the second example.

saveRDS(combined.mc,"data/HLCA/combined.mc.unsup.rds")

5.2 Supervised integration

As for the unsupervised integration example, we will work with the Human Cell Lung Atlas core HLCA gathering around 580,000 cells from 107 individuals distributed in 166 samples.

Taking advantage of the single-cell annotation of the original study we will build metacells for each cell type in each sample and guide the integration with the cell type label using STACAS.

5.2.1 Data loading

Please follow the section 0.4.1 to retrieve the HLCA atlas, divide the atlas by dataset and save the splitted data in the following folder: “data/HLCA/”.

5.2.2 Setting up the environment

library(reticulate)
conda_env <-  conda_list()[reticulate::conda_list()$name == "MetacellAnalysisToolkit","python"]

Sys.setenv(RETICULATE_PYTHON = conda_env)

#>           used  (Mb) gc trigger    (Mb)   max used    (Mb)
#> Ncells 3607736 192.7    6649217   355.2    6649217   355.2
#> Vcells 6815916  52.1 1894377494 14453.0 2366896489 18058.0

library(Seurat)
library(anndata)
library(SuperCell)
library(ggplot2)

wilcox.test <- "wilcox"
if(packageVersion("Seurat") >= 5) {
  options(Seurat.object.assay.version = "v4") 
  wilcox.test <- "wilcox_limma"
  print("you are using seurat v5 with assay option v4")}

color.celltypes  <- c('#E5D2DD', '#53A85F', '#F1BB72', '#F3B1A0', '#D6E7A3', '#57C3F3', '#476D87',
                      '#E95C59', '#E59CC4', '#AB3282', '#23452F', '#BD956A', '#8C549C', '#585658',
                      '#9FA3A8', '#E0D4CA', '#5F3D69', '#58A4C3', "#b20000", '#E4C755', '#F7F398',
                      '#AA9A59', '#E63863', '#E39A35', '#C1E6F3', '#6778AE', '#91D0BE', '#B53E2B',
                      '#712820', '#DCC1DD', '#CCE0F5', '#CCC9E6', '#625D9E', '#68A180', '#3A6963',
                      '#968175')

5.2.3 Constructing supervised metacell

Sikkema et al. made a remarkable job in finely annotating hundreds thousands of cells. Within the framework of this re-analysis, let’s now try to use this prior knowledge to obtain slightly better results using a supervised workflow.

We added in section 0.4 a ann_sample column in the metadata of the single cell object. We now can use it to build metacell for each cell type in each sample.

This should take around 30 minutes.

To call the MATK command line, please define your path to the gihub cloned repository optained from this github repository.

#git clone https://github.com/GfellerLab/MetacellAnalysisToolkit
MATK_path=MetacellAnalysisToolkit/
for d in data/HLCA/datasets/*;
do ${MATK_path}cli/MATK -t SuperCell -i $d/sc_adata.h5ad -o $d/sup_mc -a ann_sample -l 6 -n 50 -f 2000 -k 30 -g 50 -s adata
done

5.2.4 Load metacell objects

library(anndata)
adata <- read_h5ad("data/HLCA/local.h5ad",backed = "r")
adata$var_names <- adata$var$feature_name # We will use gene short name for downstream analyses
datasets <- unique(adata$obs$dat)
rm(adata)
gc()


metacell.files <- sapply(datasets, FUN = function(x){paste0("data/HLCA/datasets/",x,"/sup_mc/mc_adata.h5ad")})

metacell.objs <- lapply(X = metacell.files, function(X){
  adata <- read_h5ad(X)
  countMatrix <- Matrix::t(adata$X)
  colnames(countMatrix) <- adata$obs_names
  rownames(countMatrix) <- adata$var_names
  sobj <- Seurat::CreateSeuratObject(counts = countMatrix,meta.data = adata$obs)
  if(packageVersion("Seurat") >= 5) {sobj[["RNA"]] <- as(object = sobj[["RNA"]], Class = "Assay")}
  sobj <- RenameCells(sobj, add.cell.id = unique(sobj$sample)) # we give unique name to metacells
  return(sobj)
})

5.2.5 Merging objects and basic quality control

Thus, let’s check the purity of our metacells at different level of annotations, as well as their size (number of single cells they contain).

To do so we merge the objects together and use Seurat VlnPlot function.

unintegrated.mc <- merge(metacell.objs[[1]], metacell.objs[-1])

VlnPlot(unintegrated.mc[, unintegrated.mc$ann_level_3 != "None"], features = c("size", "ann_level_2_purity"), group.by = 'dataset', pt.size = 0.001, ncol = 2)
#> Warning in SingleExIPlot(type = type, data = data[, x, drop = FALSE], idents =
#> idents, : All cells have the same value of ann_level_2_purity.

VlnPlot(unintegrated.mc[, unintegrated.mc$ann_level_3 != "None"], features = c("ann_level_3_purity", "ann_level_4_purity"), group.by = 'dataset', pt.size = 0.001,  ncol=2)
#> Warning in SingleExIPlot(type = type, data = data[, x, drop = FALSE], idents =
#> idents, : All cells have the same value of ann_level_3_purity.

We can also use box plots.

p_4 <- ggplot(unintegrated.mc@meta.data, aes(x = dataset, y = ann_level_4_purity, fill = dataset)) + geom_boxplot() +
  scale_x_discrete(guide = guide_axis(angle = 45)) + ggtitle("sup metacells level 4 purity") + NoLegend() + ylim(c(0,1))

p_finest <- ggplot(unintegrated.mc@meta.data, aes(x = dataset, y = ann_finest_level_purity, fill = dataset)) + geom_boxplot() +
  scale_x_discrete(guide = guide_axis(angle = 45))  + ggtitle("sup metacells finest level purity") + NoLegend() + ylim(c(0,1))

p_4 + p_finest

Overall using supervised metacells construction we obtain pure metacells until the 3rd level of annotation and improve metacell purities for finer levels compared to the unsupervised approach (see previous section 5.1).

meta.data.unsup <- readRDS("data/HLCA/combined.mc.unsup.rds")@meta.data

p_4_unsup <- ggplot(meta.data.unsup, aes(x = dataset, y = ann_level_4_purity, fill = dataset)) + geom_boxplot() +
    scale_x_discrete(guide = guide_axis(angle = 45)) + ggtitle("unsup metacells level 4 purity") + NoLegend() + ylim(c(0,1))

p_finest_unsup <- ggplot(meta.data.unsup, aes(x = dataset, y = ann_finest_level_purity, fill = dataset)) + geom_boxplot() +
  scale_x_discrete(guide = guide_axis(angle = 45))  + ggtitle("unsup metacells finest level purity") + NoLegend() + ylim(c(0,1))

p_4_unsup | p_4

p_finest_unsup + p_finest

5.2.6 Unintegrated analysis

Let’s first do a standard dimensionality reduction without batch correction.

DefaultAssay(unintegrated.mc) <- "RNA"
unintegrated.mc <- NormalizeData(unintegrated.mc)
unintegrated.mc <- FindVariableFeatures(unintegrated.mc)
unintegrated.mc <- ScaleData(unintegrated.mc)
unintegrated.mc <- RunPCA(unintegrated.mc)
unintegrated.mc <- RunUMAP(unintegrated.mc,dims = 1:30)

umap.unintegrated.datasets <- DimPlot(unintegrated.mc,reduction = "umap",group.by = "dataset") + NoLegend() + ggtitle("unintegrated datasets")
umap.unintegrated.types <- DimPlot(unintegrated.mc,reduction = "umap",group.by = "ann_level_2",label = T,repel = T,cols = color.celltypes)+ NoLegend() + ggtitle("unintegrated cell types")

umap.unintegrated.datasets + umap.unintegrated.types

You can see on the plots that a batch effect is clearly present at the metacell level. Let’s correct it using a supervised approach.

5.2.7 STACAS integration

In the original study, datasets were integrated using SCANVI semi-supervised integration using partial annotation obtained for each dataset prior integration. Here in this second example we propose to use a similar approach in R using STACAS. We will use the “ann” labels we used to construct the metacells (3rd level of annotation if available for the cell, otherwise 2nd level).

To be noted that, as in the original study, we use the dataset rather than the donor as the batch parameter. See method section Data integration benchmarking of the original study for more details.

# Install package if needed
if (!requireNamespace("STACAS")) remotes::install_github("carmonalab/STACAS", upgrade = "never")
library(STACAS)

t0_integration <- Sys.time()

n.metacells <- sapply(metacell.objs, FUN = function(x){ncol(x)})
names(n.metacells) <- datasets
ref.names <- sort(n.metacells,decreasing = T)[1:5]
ref.index <- which(datasets %in% names(ref.names))

# normalize and identify variable features for each dataset independently
metacell.objs <- lapply(X = metacell.objs, FUN = function(x) {
  DefaultAssay(x) <- "RNA";
  x <- RenameCells(x, add.cell.id = unique(x$sample)) # we give unique name to metacells
  x <- NormalizeData(x)
  return(x)})
gc()


# Perform a supervised integration of the dataset using STACAS
combined.mc <- Run.STACAS(object.list = metacell.objs,
                          anchor.features = 2000,
                          min.sample.size = 80,
                          k.weight = 80, #smallest dataset contains 86 metacells
                          cell.labels = "ann", # Note that by not you can use STACAS in its unsupervised mode
                          reference = ref.index, # the 5 biggest datasets are used as reference
                          dims = 1:30)

tf_integration <- Sys.time()

tf_integration - t0_integration

remove(metacell.objs) # We don't need the object list anymore
gc()

Check the obtained object:

combined.mc
#> An object of class Seurat 
#> 30024 features across 12914 samples within 2 assays 
#> Active assay: integrated (2000 features, 2000 variable features)
#>  1 other assay present: RNA
#>  1 dimensional reduction calculated: pca

We can verify that the sum of metacell sizes correspond to the original number of single-cells

sum(combined.mc$size)
#> [1] 584944

STACAS directly returns a pca for the slot "integrated" that we can use to make a UMAP of the corrected data.

DefaultAssay(combined.mc) = "integrated"

combined.mc <- RunUMAP(combined.mc, dims = 1:30, reduction =  "pca", reduction.name = "umap")

Now we can make the plots and visually compare the results with the unintegrated analysis.

umap.stacas.datasets <- DimPlot(combined.mc,reduction = "umap",group.by = "dataset") + NoLegend() + ggtitle("integrated datasets")
umap.stacas.celltypes <- DimPlot(combined.mc,reduction = "umap",group.by = "ann_level_2",label = T,repel = T,cols = color.celltypes) + NoLegend() + ggtitle("integrated cell types")

umap.stacas.datasets + umap.stacas.celltypes + umap.unintegrated.datasets + umap.unintegrated.types

STACAS efficiently corrected the batch effect in the data while keeping the cell type separated.

We can navigate in the different annotation levels.

library(ggplot2)

DimPlot(combined.mc,group.by = "ann_level_1",reduction = "umap",cols= color.celltypes)


DimPlot(combined.mc,group.by = "ann_level_2",reduction = "umap",label = T,repel = T,cols= color.celltypes)


DimPlot(combined.mc,group.by = "ann_level_3",reduction = "umap",label = T, repel = T,cols= color.celltypes) + NoLegend()

5.2.8 Comparison with unsupervised analysis

we can quickly visually compare these results with the unsupervised integration obtained with Seurat:

combined.mc.unsup <- readRDS("data/HLCA/combined.mc.unsup.rds")

combined.mc$ann_level_3 <- factor(combined.mc$ann_level_3)
matched.color.celltypes <- color.celltypes[1:length(levels(combined.mc$ann_level_3))]
names(matched.color.celltypes) <- levels(combined.mc$ann_level_3)

level3_sup <- DimPlot(combined.mc,group.by = "ann_level_3",reduction = "umap",label = T, repel = T,cols= matched.color.celltypes) + NoLegend() + ggtitle("Sup workflow")
level3_unsup <- DimPlot(combined.mc.unsup,group.by = "ann_level_3",reduction = "umap",label = T, repel = T,cols= matched.color.celltypes) + NoLegend() + ggtitle("Unsup workflow")

level3_sup + level3_unsup

Look at epithelial cells in particular

level3_sup <- DimPlot(combined.mc[,combined.mc$ann_level_1 == "Epithelial"],group.by = "ann_level_3",reduction = "umap",label = T, repel = T,cols= matched.color.celltypes) + NoLegend() + ggtitle("Sup workflow")
level3_unsup <- DimPlot(combined.mc.unsup[,combined.mc.unsup$ann_level_1 == "Epithelial"],group.by = "ann_level_3",reduction = "umap",label = T, repel = T,cols= matched.color.celltypes) + NoLegend() + ggtitle("Unsup workflow")

level3_sup + level3_unsup

5.2.9 Downstream analysis

You can try conduce the same downstream analyses as in the previous example 5.1 (clustering, cell type abundances, DEG …).

Here to show you the interest of supervised workflow with pure metacell we can zoom on the smooth muscle sub types. Despite the low metacell number for each cell type these different subtypes are separated on the UMAP, especially the rare FAM83D+ smooth muscles that were discovered in the original study.

combined.mc$ann <- factor(combined.mc$ann)
color.celltypes.ann <- color.celltypes[c(1:length(levels(combined.mc$ann)))]
names(color.celltypes.ann) <- levels(combined.mc$ann)

DimPlot(combined.mc[,combined.mc$ann_level_2 == "Smooth muscle"],group.by = "ann",cols = color.celltypes.ann)

Using a DEG analysis we can check if we retrieve their markers. MYH11 and CNN1 genes are canonical smooth muscle markers while FAM83D was found uniquely and consistently expressed by this rare cell type in the original study

DefaultAssay(combined.mc) <- "RNA"
Idents(combined.mc) <- "ann"
markersSmoothMuscle <- FindMarkers(combined.mc,ident.1 = "Smooth muscle FAM83D+",only.pos = T, logfc.threshold = 0.25, test.use =  wilcox.test)

head(markersSmoothMuscle)
#>               p_val avg_log2FC pct.1 pct.2     p_val_adj
#> MYOCD 4.887974e-176  1.3879478 0.758 0.022 1.369806e-171
#> NMRK2 1.092465e-129  0.4261093 0.273 0.003 3.061523e-125
#> PLN   4.060884e-124  3.1234083 0.879 0.044 1.138022e-119
#> HSPB3 7.779955e-121  1.0321301 0.545 0.016 2.180255e-116
#> CASQ2 9.830136e-117  1.1149385 0.636 0.023 2.754797e-112
#> ASB5  2.233460e-107  0.2845419 0.273 0.004 6.259049e-103

markersSmoothMuscle[c("MYH11","CNN1","FAM83D"),]
#>               p_val avg_log2FC pct.1 pct.2    p_val_adj
#> MYH11  2.489777e-32   4.250912 0.970 0.286 6.977351e-28
#> CNN1   6.519506e-71   4.627342 0.970 0.106 1.827026e-66
#> FAM83D 3.395734e-11   2.189551 0.636 0.284 9.516205e-07

# Many classical smooth muscles cells are not annotated at the 3rd level of annotation (labelled None)
VlnPlot(combined.mc,features = c("MYH11","CNN1","FAM83D"),group.by = "ann",ncol = 2,cols = color.celltypes.ann)

5.2.10 Conclusion

Taking advantage of the single cell annotation in a supervised workflow we could improve the precision of our metacell re-analysis. When cell annotations are given and of good quality, which is far from being the case every time, building metacells accordingly and use a supervised integration workflow should be preferred.

To be noted that we used an intermediary level of annotation to supervise our analysis, using a finer level for this data would have resulted in a longer time for metacell building. PLus, we would have obtained to few metacells per cell type in the different sample to be able to make an efficient supervised batch correction with STACAS.

To be more precise at the cost of computational efficiency one could also try to reduce the graining level of the analysis (using a graining level of 20 for instance),

To conclude, keep in mind that in one hand, for certain analysis such as rare cell type analysis, we will never achieve the same level of sensitivity with metacells compared to single-cells. On the other hand, you certainly won’t be able to analyze so many single-cells so easily, and you may not need extremely fine cell-type resolution for many analyses.

4 Metacell Analysis Toolkit (MATK)

References