在微生物組研究中我們常常需要根據某些感興趣的表型來找到與其相關的特征(比如菌群、OTU、基因家族等等)。但微生物組學的數據結構導致了這必然是一項相當艱巨的任務,因為他們:
?高維特征集(通常超過 100 到 10,000 個特征);?高度稀疏(許多特征僅在少數樣本中被發現);?特征間復雜的相關性結構;?計數的組成性(即,觀察到的計數受文庫大小的限制);?不同的文庫大小;?過度離散的計數值,等等。
那么應該如何選擇不同的差異分析方法呢?其實這個問題并沒有答案,(如果有時間的話)我一般都是嘗試一些對手頭數據來說看似合理的模型,然后優先考慮 overlap 的差異特征集。雖然這并不完美,但至少會證明一些結果的魯棒性,增加我們對結果的信心。
下面我將基于一個用 MetaPhlAn2 注釋的公共宏基因組數據,使用五種不同算法進行差異分析。這些方法也可以應用于(也許更適用于)擴增子測序得到的 ASV 或 OTU。選擇這些方法的標準如下:
?在一項或多項模擬研究中表現較好;?可以校正協變量,和多重假設檢驗;?包含多種標準化和建模方法;?應用相對廣泛;?封裝成 R 包。
符合以上條件的有很多算法,但這里我挑選了以下五個模型:
?Limma-Voom[1]?DESeq2[2]?corncob[3]?MaAsLin 2[4]?ANCOM-BC[5]
我們將使用由 curatedMetagenomicData[6] 包(關于這個包的教程可以參見我之前的筆記)提供的公共數據[7] 來識別從印度南部與印度中北部人群收集的糞便樣本中的差異菌群。
相關文章:D B Dhakan, A Maji, A K Sharma, R Saxena, J Pulikkan, T Grace, A Gomez, J Scaria, K R Amato, V K Sharma, The unique composition of Indian gut microbiome, gene catalogue, and associated fecal metabolome deciphered using multi-omics approaches, GigaScience, Volume 8, Issue 3, March 2019, giz004, https://doi.org/10.1093/gigascience/giz004
# if (!requireNamespace("BiocManager", quietly = TRUE))# install.packages("BiocManager")# # BiocManager::install("curatedMetagenomicData")# BiocManager::install("phyloseq")# BiocManager::install("edgeR")# BiocManager::install("limma")# BiocManager::install("DEFormats")# BiocManager::install("DESeq2")# BiocManager::install("apeglm")# BiocManager::install("ANCOMBC")# BiocManager::install("Maaslin2")# # install.packages("corncob")# install.packages("tidyverse")# install.packages("ggVennDiagram")library(curatedMetagenomicData)library(tidyverse)library(phyloseq)library(edgeR)library(limma)library(DEFormats)library(DESeq2)library(apeglm)library(corncob)library(ANCOMBC)library(Maaslin2)library(ggVennDiagram)
dhakan_eset <- DhakanDB_2019.metaphlan_bugs_list.stool()
## snapshotDate(): 2020-04-27
## see ?curatedMetagenomicData and browseVignettes('curatedMetagenomicData') for documentation
## loading from cache
(ps <- ExpressionSet2phyloseq(dhakan_eset, simplify = TRUE, relab = FALSE, phylogenetictree = FALSE))
## phyloseq-class experiment-level object## otu_table() OTU Table: [ 768 taxa and 110 samples ]## sample_data() Sample Data: [ 110 samples by 20 sample variables ]## tax_table() Taxonomy Table: [ 768 taxa by 8 taxonomic ranks ]
(ps <- subset_taxa(ps, !is.na(Species) & is.na(Strain)))
## phyloseq-class experiment-level object## otu_table() OTU Table: [ 286 taxa and 110 samples ]## sample_data() Sample Data: [ 110 samples by 20 sample variables ]## tax_table() Taxonomy Table: [ 286 taxa by 8 taxonomic ranks ]
刪除物種注釋中的 “s__” 占位符:
taxa_names(ps) <- gsub("s__", "", taxa_names(ps)) head(taxa_names(ps))
## [1] "Prevotella_copri" "Eubacterium_rectale" ## [3] "Butyrivibrio_crossotus" "Prevotella_stercorea" ## [5] "Faecalibacterium_prausnitzii" "Subdoligranulum_unclassified"
剔除僅在 <10% 樣本中出現的菌種:
(ps <- filter_taxa(ps, function(x) sum(x > 0) > (0.1*length(x)), TRUE))
## phyloseq-class experiment-level object## otu_table() OTU Table: [ 109 taxa and 110 samples ]## sample_data() Sample Data: [ 110 samples by 20 sample variables ]## tax_table() Taxonomy Table: [ 109 taxa by 8 taxonomic ranks ]
查看數據中包括的 metadata 信息:
sample_variables(ps)
## [1] "subjectID" "body_site" ## [3] "antibiotics_current_use" "study_condition" ## [5] "disease" "age" ## [7] "infant_age" "age_category" ## [9] "gender" "BMI" ## [11] "country" "location" ## [13] "non_westernized" "DNA_extraction_kit" ## [15] "number_reads" "number_bases" ## [17] "minimum_read_length" "median_read_length" ## [19] "curator" "NCBI_accession"
sample_data(ps)$location <- factor(sample_data(ps)$location, levels = c("Bhopal", "Kerala"))table(sample_data(ps)$location)
## ## Bhopal Kerala ## 53 57
在過濾了低流行率的分類群后,我們現在得到了一個包含 109 個菌種的 phyloseq
對象。我一般傾向于根據總數和流行率過濾掉僅在 10% 到 50% 的樣本中觀察到的特征,以更好地滿足模型假設,同時限制計算 power 時所付出的 FDR 懲罰。(這里總共 109 個菌種肯定是偏低的,但本文僅作示例)
常用于基因表達矩陣分析的 Limma 包也可用于菌群矩陣的差異分析。
dds <- phyloseq_to_deseq2(ps, ~ location) #convert to DESeq2 and DGEList objects
## converting counts to integer mode
dge <- as.DGEList(dds)# 計算 TMM 歸一化因子dge <- calcNormFactors(dge, method = "TMM")head(dge$samples$norm.factors)
## [1] 0.4529756 1.6721557 0.1509053 0.4569702 0.4646246 0.3362776
# 建立模型矩陣mm <- model.matrix(~ group, dge$samples)head(mm)
## (Intercept) groupKerala## DHAK_HAK 1 0## DHAK_HAJ 1 0## DHAK_HAI 1 0## DHAK_HAH 1 0## DHAK_HAG 1 0## DHAK_HAF 1 0
table(mm[, 2])
## ## 0 1 ## 53 57
# 得到 voom 權重y <- voom(dge, mm, plot = T)
voom 的過程:1.將計數轉換為 log2 CPM(counts per million reads),其中 “per million reads” 是根據我們之前計算的歸一化因子定義的;2.線性模型擬合每個特征的 log2 CPM,并計算殘差;3.基于平均表達量擬合平滑曲線(見上圖中的紅線);4.獲得每個特征和樣本的權重。
使用 lmFit
擬合線性模型:
fit <- lmFit(y, mm) #fit lm with limma# 經驗貝葉斯統計量計算(參見 https://www.degruyter.com/doi/10.2202/1544-6115.1027)fit <- eBayes(fit)head(coef(fit))
## (Intercept) groupKerala## Prevotella_copri 16.1383696 -4.99436530## Eubacterium_rectale 11.9161328 -0.08294369## Butyrivibrio_crossotus -0.2110634 -0.08650380## Prevotella_stercorea 6.8200020 -6.22106379## Faecalibacterium_prausnitzii 13.6093282 1.00607350## Subdoligranulum_unclassified 10.9786786 1.16771790
提取計算結果:
limma_res_df <- data.frame(topTable(fit, coef = "groupKerala", number = Inf)) #extract resultsfdr_limma <- limma_res_df %>% dplyr::filter(adj.P.Val < 0.05) %>% rownames_to_column(var = "Species")dim(fdr_limma)
## [1] 15 7
head(fdr_limma)
## Species logFC AveExpr t P.Value## 1 Megasphaera_unclassified -8.974360 6.4247739 -5.575994 1.059729e-07## 2 Bacteroides_coprocola -5.124762 -0.8291721 -4.484538 1.409614e-05## 3 Prevotella_stercorea -6.221064 3.2149583 -3.867134 1.613340e-04## 4 Lactobacillus_ruminis -5.711738 3.7967638 -3.833832 1.826516e-04## 5 Ruminococcus_obeum 4.659230 2.7788048 3.737572 2.603217e-04## 6 Mitsuokella_unclassified -5.237836 0.7101123 -3.732304 2.653693e-04## adj.P.Val B## 1 1.155105e-05 6.9755057## 2 7.682395e-04 2.8793949## 3 4.820876e-03 0.6994391## 4 4.820876e-03 0.5749799## 5 4.820876e-03 0.2933741## 6 4.820876e-03 0.2931618
ggplot(fdr_limma, aes(x = Species, y = logFC, color = Species)) + geom_point(size = 4) + labs(y = "nLog2 Fold-Change for ACVD vs. Controls", x = "") + theme(axis.text.x = element_text(color = "black", size = 12), axis.text.y = element_text(color = "black", size = 12), axis.title.y = element_text(size = 14), axis.title.x = element_text(size = 14), legend.text = element_text(size = 12), legend.title = element_text(size = 12), legend.position = "none") + coord_flip() + geom_hline(yintercept = 0, linetype="dotted")
可以看到根據校正后的 P 值 < 0.05,limma-voom 找到了 15 個差異菌。
DESeq2 將對原始計數進行建模,使用標準化因子(scale factor)來解釋庫深度的差異。然后估計每條 OTU 的離散度,并縮小這些估計值以生成更準確的離散度估計。最后,DESeq2 擬合負二項分布的模型,并使用 Wald 檢驗或似然比檢驗進行假設檢驗。
dds <- DESeq(dds, test = "Wald", fitType = "local", sfType = "poscounts")
estimating size factorsestimating dispersionsgene-wise dispersion estimatesmean-dispersion relationshipfinal dispersion estimatesfitting model and testing-- replacing outliers and refitting for 65 genes-- DESeq argument 'minReplicatesForReplace' = 7 -- original counts are preserved in counts(dds)estimating dispersionsfitting model and testing
plotDispEsts(dds)
使用 apeglm
對計數值的離散度進行校正:
res <- lfcShrink(dds, coef=2, type="apeglm")
## using 'apeglm' for LFC shrinkage. If used in published research, please cite:## Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for## sequence count data: removing the noise and preserving large differences.## Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
plotMA(dds)
deseq_res_df <- data.frame(res) %>% rownames_to_column(var = "Species") %>% dplyr::arrange(padj) fdr_deseq <- deseq_res_df %>% dplyr::filter(padj < 0.05)dim(fdr_deseq)
## [1] 5 6
head(fdr_deseq)
## Species basemean log2FoldChange lfcSE pvalue## 1 Bacteroides_massiliensis 2326.9291 0.1667126 0.9692252 9.311151e-32## 2 Clostridium_symbiosum 346.6204 0.5656621 1.2730261 4.669930e-32## 3 Clostridium_hathewayi 314.3198 0.4874255 1.2012990 1.764584e-28## 4 Parabacteroides_goldsteinii 256.9873 0.2564174 1.0225083 5.847951e-25## 5 Bacteroides_coprocola 19141.8394 -0.7224703 1.3784468 2.200809e-03## padj## 1 4.934910e-30## 2 4.934910e-30## 3 6.234864e-27## 4 1.549707e-23## 5 4.665715e-02
ggplot(fdr_deseq, aes(x = Species, y = log2FoldChange, color = Species)) + geom_point(size = 4) + labs(y = "nLog2 Fold-Change for ACVD vs. Controls", x = "") + theme(axis.text.x = element_text(color = "black", size = 12), axis.text.y = element_text(color = "black", size = 12), axis.title.y = element_text(size = 14), axis.title.x = element_text(size = 14), legend.text = element_text(size = 12), legend.title = element_text(size = 12), legend.position = "none") + coord_flip() + geom_hline(yintercept = 0, linetype="dotted")
根據校正后的 P 值 < 0.05,DESeq2 找到了 5 個差異菌。
Corncob 則是基于相對豐度進行建模并檢驗協變量對相對豐度的影響。
?GitHub 地址:https://github.com/bryandmartin/corncob/
corn_da <- differentialTest(formula = ~ location, phi.formula = ~ 1, formula_null = ~ 1, phi.formula_null = ~ 1, data = ps, test = "Wald", boot = FALSE, fdr_cutoff = 0.05)fdr_corncob <- corn_da$significant_taxadim(data.frame(fdr_corncob))
## [1] 28 1
head(sort(corn_da$p_fdr))
## Megasphaera_unclassified Ruminococcus_obeum ## 1.576463e-06 1.041461e-04 ## Prevotella_stercorea Bacteroides_thetaiotaomicron ## 3.697782e-04 3.965007e-04 ## Bacteroides_uniformis Lactobacillus_ruminis ## 4.262191e-04 5.608007e-04
Corncob 找到了 28 個差異特征。
MaAsLin2 是 MaAsLin(Microbiome Multivariable Association with Linear Models) 的升級版,主要基于線性模型進行多元關聯分析,分析表型和微生物特征之間的相關性。
?首頁:https://huttenhower.sph.harvard.edu/maaslin/?GitHub 地址:https://github.com/biobakery/Maaslin2
mas_1 <- Maaslin2( input_data = data.frame(otu_table(ps)), input_metadata = data.frame(sample_data(ps)), output = "./Maaslin2_default_output", min_abundance = 0.0, min_prevalence = 0.0, normalization = "TSS", transform = "LOG", analysis_method = "LM", max_significance = 0.05, fixed_effects = "location", correction = "BH", standardize = FALSE, cores = 1)
mas_res_df <- mas_1$resultsfdr_mas <- mas_res_df %>% dplyr::filter(qval < 0.05)dim(fdr_mas)
## [1] 27 10
head(fdr_mas)
## feature metadata value coef## locationKerala32 Megasphaera_unclassified location Kerala -0.7713398## locationKerala29 Ruminococcus_obeum location Kerala 0.6891505## locationKerala14 Ruminococcus_bromii location Kerala 0.8985006## locationKerala28 Bacteroides_thetaiotaomicron location Kerala 0.8813854## locationKerala70 Streptococcus_salivarius location Kerala 0.5409692## locationKerala3 Prevotella_stercorea location Kerala -1.0280049## stderr pval name qval N## locationKerala32 0.1583627 3.828477e-06 locationKerala 0.000417304 110## locationKerala29 0.1577677 2.887477e-05 locationKerala 0.001573675 110## locationKerala14 0.2122802 4.865394e-05 locationKerala 0.001767760 110## locationKerala28 0.2271755 1.801616e-04 locationKerala 0.004347755 110## locationKerala70 0.1404570 1.994383e-04 locationKerala 0.004347755 110## locationKerala3 0.2774162 3.343767e-04 locationKerala 0.005206723 110## N.not.zero## locationKerala32 0## locationKerala29 0## locationKerala14 0## locationKerala28 0## locationKerala70 0## locationKerala3 0
MaAsLin 2 找到了 27 個差異菌種。MaAsLin 2[8] 支持多種歸一化方法和模型,我們可以用它來快速擬合不同的模型,看看這些模型對結果的影響。
ANCOM-BC 引入了一種包含偏差校正的微生物組組成分析方法,該方法可以估計未知的抽樣比例,并校正由樣品之間的差異引起的偏差,絕對豐度數據使用線性回歸框架建模。
?GitHub 地址:https://github.com/FrederickHuangLin/ANCOM-BC-Code-Archive
ancom_da <- ancombc(phyloseq = ps, formula = "location", p_adj_method = "fdr", zero_cut = 0.90, lib_cut = 1000, group = "location", struc_zero = TRUE, neg_lb = TRUE, tol = 1e-5, max_iter = 100, conserve = TRUE, alpha = 0.05, global = FALSE)ancom_res_df <- data.frame( Species = row.names(ancom_da$res$beta), beta = unlist(ancom_da$res$beta), se = unlist(ancom_da$res$se), W = unlist(ancom_da$res$W), p_val = unlist(ancom_da$res$p_val), q_val = unlist(ancom_da$res$q_val), diff_abn = unlist(ancom_da$res$diff_abn))fdr_ancom <- ancom_res_df %>% dplyr::filter(q_val < 0.05)dim(fdr_ancom)
## [1] 22 7
head(fdr_ancom)
## Species beta se W## locationKerala4 Prevotella_stercorea -4.414224 1.1839905 -3.728259## locationKerala9 Mitsuokella_unclassified -3.467118 1.0176606 -3.406950## locationKerala15 Ruminococcus_bromii 3.682311 1.0497644 3.507750## locationKerala18 Lactobacillus_ruminis -3.608518 1.0441049 -3.456087## locationKerala22 Catenibacterium_mitsuokai -3.055131 0.9621105 -3.175447## locationKerala29 Bacteroides_thetaiotaomicron 3.143050 0.8682526 3.619973## p_val q_val locationKerala## locationKerala4 0.0001928069 0.006229387 TRUE## locationKerala9 0.0006569326 0.007160566 TRUE## locationKerala15 0.0004519128 0.007029239 TRUE## locationKerala18 0.0005480782 0.007029239 TRUE## locationKerala22 0.0014960575 0.013589189 TRUE## locationKerala29 0.0002946343 0.006423028 TRUE
ANCOM-BC 找到了 22 個差異物種。
x <- list(limma = fdr_limma$Species, deseq2 = fdr_deseq$Species, corncob = fdr_corncob, maaslin2 = fdr_mas$feature, ancom = fdr_ancom$Species) overlap_data <- process_region_data(Venn(x))overlap_data[overlap_data$id == 12345,]$item
## [[1]]## [1] "Bacteroides_coprocola"
ggVennDiagram(x,label_alpha=0) + scale_fill_gradient(low="gray100",high = "gray95",guide="none")
在這五種方法中,只有一種菌 Bacteroides coprocola 在所有結果中都顯著, FDR p 值 < 0.05。除了考慮到豐度差異外,我們還可以進一步考慮效應的大小(即倍數變化或系數的大小),看看這些被多種方法同時證實的結果是否合理,同時可進一步嘗試探究不同模型方法之間的結果差異是否有明確的原因(例如,數據是否過度稀疏等等)。比如,在圖中我們可以看到有 11 個菌被除 DESeq2 外的其余四種方法證實,這些菌或許就是下一步需要探究的方向。
Reference
?https://www.nicholas-ollberding.com/post/identifying-differentially-abundant-features-in-microbiome-data/
[1]
Limma-Voom: https://genomebiology.biomedcentral.com/articles/10.1186/gb-2014-15-2-r29[2]
DESeq2: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0550-8[3]
corncob: https://projecteuclid.org/journals/annals-of-applied-statistics/volume-14/issue-1/Modeling-microbial-abundances-and-dysbiosis-with-beta-binomial-regression/10.1214/19-AOAS1283.short[4]
MaAsLin 2: https://www.biorxiv.org/content/10.1101/2021.01.20.427420v1[5]
ANCOM-BC: https://www.nature.com/articles/s41467-020-17041-7[6]
curatedMetagenomicData: https://bioconductor.org/packages/release/data/experiment/html/curatedMetagenomicData.html[7]
數據: https://bioconductor.org/packages/release/data/experiment/html/curatedMetagenomicData.html[8]
MaAsLin 2: https://github.com/biobakery/Maaslin2
本文由 貴州做網站公司 整理發布,部分圖文來源于互聯網,如有侵權,請聯系我們刪除,謝謝!
網絡推廣與網站優化公司(網絡優化與推廣專家)作為數字營銷領域的核心服務提供方,其價值在于通過技術手段與策略規劃幫助企業提升線上曝光度、用戶轉化率及品牌影響力。這...
在當今數字化時代,公司網站已成為企業展示形象、傳遞信息和開展業務的重要平臺。然而,對于許多公司來說,網站建設的價格是一個關鍵考量因素。本文將圍繞“公司網站建設價...
在當今的數字化時代,企業網站已成為企業展示形象、吸引客戶和開展業務的重要平臺。然而,對于許多中小企業來說,高昂的網站建設費用可能會成為其發展的瓶頸。幸運的是,隨...
泰州大橋簡介 泰州有多少座長江大橋?泰州大橋全長幾公里? 泰州有四座長江大橋1、鎮江長江大橋懸索橋鎮江長江大橋。2、江陰長江公路大橋是我國首座跨度超公里的大型鋼箱梁懸索橋。3、泰州長江大橋全長62、088公里,工期5年半。4、潤陽大橋西距南京二橋約60公里,東距江陰長江大橋約110公里。 泰州大橋附近有哪些景點? 泰州大橋附近有 西林禪院、太平廟、江河健身廣場、福滿園農場、和園、祥泰公園、...
elle的包包是幾線品牌?Elle男包屬于中高檔。Elle品牌來自法國時尚雜志《ELLE PARIS》。因為雜志的流行,延伸到男女裝、童裝、皮鞋、手表、裝飾展品的誕生和發展?來自法國的Elle時尚品牌,通過對流行趨勢的精準分析和傳播,精選并專注于時尚產品。經過半個世界的沉淀,ELLE已經成為著名的時尚品牌。elle男包是什么檔次?高檔Elle男包屬于高端階層。雖然不是奢侈品,但在時尚界很有影響力...
手機撥號上網是什么意思?撥號設置是設置默認使用哪張手機卡打電話,上網是開通數據,使用默認使用手機卡的流量。手機如何撥號上網?先打開撥號選項;點擊手機的撥號盤,點擊右下角三個點的設置選項;點擊【設置】,部分手機點擊【更多】選項按鈕顯示【設置】;可以看到,除了1是語音郵件設置,還可以設置2到9的快速撥號鍵。點擊添加聯系人,根據自己的實際情況在通訊錄中選擇經常聯系的聯系人。按照彈出的提示一步一步操作。1...