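1. Visualizing multiple sets with UpSet plots
When the data fall into several groups (sets) and you want to show the mathematical or logical relationships among them, a Venn diagram or an UpSet plot can show which elements the sets share and which are unique to each set.
With five or fewer sets, a Venn diagram is the usual choice; with more than five sets, an UpSet plot is usually the better fit.
For Venn diagrams, the R package gplots is recommended (see the blog post on Venn diagrams); for UpSet plots, the R package UpSetR is recommended, and it is the subject of this post.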
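1.1. Use case
- Take the orthogroups that OrthoFinder identifies across species and draw a Venn diagram or an UpSet plot to see how many orthogroups the species share.
1.2. Data preparation
UpSetR takes a table as input. The first column holds the item identifiers (the category of each row), and every following column represents one set (for example, one species per column). Each cell is 1 or 0, indicating whether that item is present in that set.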
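As a rough illustration of this layout, the sketch below builds a tiny membership table in base R. The species names (sp_A, sp_B, sp_C) and orthogroup IDs are made up for the example and do not come from any real dataset.

```r
# Hypothetical example data: which orthogroups occur in which species.
sets <- list(
  sp_A = c("OG1", "OG2", "OG3"),
  sp_B = c("OG2", "OG3", "OG4"),
  sp_C = c("OG3", "OG5")
)

# All distinct items across the sets become the rows.
items <- sort(unique(unlist(sets)))

# One 0/1 column per set: 1 if the item is in that set, otherwise 0.
membership <- data.frame(
  Orthogroup = items,
  lapply(sets, function(s) as.integer(items %in% s)),
  stringsAsFactors = FALSE
)

print(membership)
```

A table shaped like this is what upset() expects. UpSetR also provides fromList(), which performs a similar conversion starting from a named list.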
1.2.1. Datasets bundled with UpSetR
movies <- read.csv(system.file("extdata", "movies.csv", package = "UpSetR"), header = TRUE, sep = ";")
mutations <- read.csv(system.file("extdata", "mutations.csv", package = "UpSetR"), header = TRUE, sep = ",")

head(movies)
View(movies)

head(mutations)
View(mutations)
> head(movies)
                                Name ReleaseDate Action Adventure Children Comedy Crime Documentary Drama Fantasy Noir Horror Musical Mystery Romance SciFi Thriller War Western AvgRating Watches
1                   Toy Story (1995)        1995      0         0        1      1     0           0     0       0    0      0       0       0       0     0        0   0       0      4.15    2077
2                     Jumanji (1995)        1995      0         1        1      0     0           0     0       1    0      0       0       0       0     0        0   0       0      3.20     701
3            Grumpier Old Men (1995)        1995      0         0        0      1     0           0     0       0    0      0       0       0       1     0        0   0       0      3.02     478
4           Waiting to Exhale (1995)        1995      0         0        0      1     0           0     1       0    0      0       0       0       0     0        0   0       0      2.73     170
5 Father of the Bride Part II (1995)        1995      0         0        0      1     0           0     0       0    0      0       0       0       0     0        0   0       0      3.01     296
6                        Heat (1995)        1995      1         0        0      0     1           0     0       0    0      0       0       0       0     0        1   0       0      3.88     940

> head(mutations)
  Identifier TTN PTEN TP53 EGFR MUC16 FLG RYR2 PCLO PIK3R1 PIK3CA NF1 MUC17 HMCN1 SPTA1 USH2A RB1 PKHD1 OBSCN AHNAK2 RYR3 RELN FRAS1 GPR98 DNAH5 ATRX APOB TCHH SYNE1 LRP2 KEL HRNR DNAH3 COL6A3 MUC5B LAMA1 DSP
1    02-0003   0    0    1    1     0   0    0    0      1      0   0     1     0     0     0   0     0     0      0    0    0     0     0     1    0    0    0     0    0   0    0     0      0     0     0   0
2    02-0033   0    0    1    0     0   0    0    0      0      1   1     0     0     0     0   1     0     0      0    0    0     0     0     0    0    0    0     0    0   0    0     0      0     0     0   0
3    02-0047   0    0    0    0     0   0    1    0      0      1   0     0     0     0     0   0     0     0      0    1    0     0     0     0    0    0    0     0    0   0    0     0      0     0     0   0
4    02-0055   1    1    1    0     0   0    0    0      0      0   0     0     0     0     0   0     0     1      0    0    0     0     0     0    0    0    0     0    0   0    0     0      1     0     0   0
5    02-2470   0    1    0    0     0   0    1    0      0      0   0     0     0     0     0   0     0     0      0    0    0     0     0     0    0    0    0     0    0   0    0     0      0     0     0   0
6    02-2483   0    0    1    0     0   0    0    1      0      0   0     0     0     0     0   0     0     0      0    0    0     0     0     0    1    1    0     0    0   0    0     0      0     0     0   0
  DNAH8 CNTNAP2 SDK1 NBPF10 DNAH2 NLRP5 MLL3 IDH1 HCN1 FCGBP DOCK5 RIMS2 PCDHA1 MXRA5 HEATR7B2 GRIN2A FGD5 TMEM132D STAG2 SEMA3C SCN9A PRDM9 POM121L12 PIK3CG PDGFRA GABRA6 FLG2 FBN3 FBN2 FAT2 DNAH11 DMD COL1A2
1     0       0    0      0     0     1    0    0    0     0     0     0      0     0        0      0    0        0     0      0     0     0         0      0      0      0    0    0    0    0      0   0      0
2     0       0    0      0     0     0    0    0    0     0     0     0      1     0        0      0    0        0     0      1     0     0         0      0      0      0    0    0    0    0      0   0      0
3     0       0    0      1     0     0    0    0    0     0     0     0      0     0        0      0    0        0     0      0     0     0         0      0      1      0    0    0    0    0      0   0      0
4     0       0    0      0     0     0    0    0    0     0     0     0      0     0        0      0    1        0     0      0     0     0         0      0      0      0    0    0    0    0      0   0      0
5     0       1    0      0     0     0    1    0    0     0     0     0      0     0        0      0    0        0     0      0     0     0         0      0      0      0    0    0    0    0      0   0      0
6     0       0    0      0     0     0    0    1    0     0     0     0      0     0        0      0    0        0     0      1     0     0         0      0      0      0    0    0    0    0      0   0      0
  ABCC9 XIRP2 TSHZ2 TEX15 SLIT3 RBM47 PIK3C2G PCDH11X MYH2 MACF1 KSR2 DNAH9 DCHS2 CSMD3 CDH18 BCOR AHNAK ZAN TRRAP THSD7B TAF1L SPAG17 SLCO5A1 SCN10A RYR1 RIMBP2 PLEKHG4B PCDHB7 NPTX2 NOS1 LZTR1
1     0     0     1     0     0     0       0       0    0     0    0     0     0     0     0    0     0   0     0      0     0      0       0      0    0      0        0      0     0    0     0
2     0     0     0     0     0     0       0       0    0     0    0     0     0     0     0    0     0   0     0      0     0      0       0      0    0      0        0      0     0    0     0
3     0     0     0     0     0     0       0       0    0     0    0     0     0     0     0    0     0   0     0      0     0      0       0      0    0      0        0      0     0    0     0
4     0     0     0     0     0     0       0       0    0     0    0     0     0     0     0    0     0   0     0      0     0      0       0      0    0      0        1      0     0    1     0
5     0     0     0     0     0     0       0       0    0     0    0     0     1     0     0    0     0   0     0      0     0      0       0      0    0      1        0      0     0    0     0
6     0     0     0     0     0     0       0       0    0     0    0     0     0     0     0    0     0   0     0      0     0      0       1      0    0      0        0      0     0    0     0
1.2.2. OrthoFinder data
The OrthoFinder result file Results_Aug14/Orthogroups/Orthogroups.GeneCount.tsv needs only light processing before it can serve as the input file for showing how orthogroups are shared among species.
sed -E "s/\t[1-9][0-9]*/\t1/g" Orthogroups.GeneCount.tsv |sed "s/\.pep//g" >orthogroups.upset
# Replace every non-zero count in Orthogroups.GeneCount.tsv with 1, and strip the .pep suffix.
In R, read orthogroups.upset with:
mutations <- read.csv("orthogroups.upset", header = TRUE, sep = "\t")
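The same presence/absence conversion can also be sketched in R. The miniature table below is invented for illustration, and the sketch assumes the gene-count file ends with a Total column, as recent OrthoFinder versions appear to write. If such a column is present, the sed command would turn it into a column of 1s that looks like an extra set, so the sketch drops it.

```r
# Hypothetical miniature of Orthogroups.GeneCount.tsv (tab-separated).
tsv <- "Orthogroup\tspA.pep\tspB.pep\tspC.pep\tTotal
OG0000000\t12\t0\t3\t15
OG0000001\t1\t1\t1\t3
OG0000002\t0\t4\t0\t4"

counts <- read.delim(text = tsv, header = TRUE, check.names = FALSE)

# Drop the Total column (assumed to exist) so it is not treated as a set.
counts <- counts[, names(counts) != "Total", drop = FALSE]

# Strip the .pep suffix from the species column names.
names(counts) <- sub("\\.pep$", "", names(counts))

# Convert gene counts to presence/absence: 1 if count > 0, else 0.
species <- setdiff(names(counts), "Orthogroup")
counts[species] <- lapply(counts[species], function(x) as.integer(x > 0))

print(counts)
```

With a real file, replace the `text = tsv` argument with the path to Orthogroups.GeneCount.tsv.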
1.3. Installing UpSetR
install.packages("UpSetR")  # install
library(UpSetR)             # load UpSetR
library(ggplot2); library(plyr); library(gridExtra); library(grid)  # load the supporting packages
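To avoid reinstalling packages that are already present, one option is to check first. The helper name missing_pkgs below is hypothetical, and the install line is left commented out so the sketch has no side effects.

```r
# Packages used in this post.
needed <- c("UpSetR", "ggplot2", "plyr", "gridExtra", "grid")

# Return the entries of `pkgs` that are not installed.
missing_pkgs <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_pkgs(needed)
if (length(to_install) > 0) {
  message("Not installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install from CRAN
}
```

Note that grid ships with R itself, so it should never appear in the list of missing packages.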
1.4. Using UpSetR
1.4.1. The upset function
upset(mutations)
This draws an UpSet plot with the default settings.
Adjust the parameters to control which data are displayed:
upset(mutations,
      sets = c("MUC16", "EGFR", "TP53", "TTN"),
      nsets = 4,
      nintersects = 20,
      mb.ratio = c(0.55, 0.45),
      order.by = c("degree", "freq"),
      keep.order = TRUE,
      decreasing = c(TRUE, FALSE),
      number.angles = 30,
      point.size = 2,
      line.size = 1,
      mainbar.y.label = "Intersection size of gene family",
      sets.x.label = "genome size",
      text.scale = c(1.3, 1.3, 1, 1, 1.5, 1))
Figure 1. movies upset
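To make the bar heights and the order.by options more concrete, the base-R sketch below tabulates the membership patterns of a small invented table. It rests on my understanding that each bar in an UpSet plot counts the rows belonging to exactly that combination of sets; the data and variable names are illustrative only.

```r
# Hypothetical 0/1 membership table (rows = items, columns = sets).
df <- data.frame(
  A = c(1, 1, 0, 1, 0, 1),
  B = c(1, 0, 1, 1, 0, 1),
  C = c(0, 0, 1, 1, 1, 0)
)
set_cols <- c("A", "B", "C")

# Label each row by the exact combination of sets it belongs to.
pattern <- apply(df[set_cols], 1, function(r) paste(set_cols[r == 1], collapse = "&"))

# Size of each exclusive intersection, largest first (compare order.by = "freq").
sizes <- sort(table(pattern), decreasing = TRUE)
print(sizes)

# Degree = number of sets in each combination (compare order.by = "degree").
degree <- lengths(strsplit(names(sizes), "&", fixed = TRUE))
print(degree)
```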
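1.4.2. The queries parameter
The upset function accepts a queries parameter, which highlights (colors) part of the data.
queries
queries is a list of one or more query entries; each query is itself a list that requests one lookup in the data and highlights the result.
query
Each query is a list made up of the query function (query) and further fields (params, color, active, query.name); query and params are required.
- query: the query function. UpSetR has built-in ones (such as intersects), and you can also define your own function and pass it here.
- params: a list specifying the data the query acts on.
- color: the highlight color; optional.
- active: how the result is drawn. TRUE/T overlays the bar with the color; FALSE/F places a triangle at the top of the bar.
- query.name: the name shown for the query in the legend.
Examples of the queries parameter
- Color the intersection of "EGFR" and "TP53" (samples mutated in both genes) blue, and color the "TTN"-only bar (samples mutated in TTN alone) red.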
upset(mutations,
      sets = c("MUC16", "EGFR", "TP53", "TTN"),
      queries = list(
        list(query = intersects, params = list("EGFR", "TP53"), color = "blue", active = TRUE, query.name = "share EGFR and TP53"),
        list(query = intersects, params = list("TTN"), color = "red", active = TRUE)))
Figure 2. mutations upset
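The sketch below shows, in base R, which rows I would expect an intersects query to pick out, assuming the highlighted bar counts rows that are in all the named sets and in none of the other displayed sets. The table and the helper exclusive_rows are hypothetical and are not part of UpSetR.

```r
# Hypothetical 0/1 table with the four gene sets used above.
muts <- data.frame(
  Identifier = paste0("s", 1:6),
  MUC16 = c(0, 0, 1, 0, 0, 0),
  EGFR  = c(1, 1, 0, 0, 1, 0),
  TP53  = c(1, 1, 0, 1, 1, 0),
  TTN   = c(0, 1, 0, 0, 0, 1),
  stringsAsFactors = FALSE
)
shown <- c("MUC16", "EGFR", "TP53", "TTN")

# Rows that are in every set of `target` and in none of the other shown sets.
exclusive_rows <- function(data, shown, target) {
  others  <- setdiff(shown, target)
  in_all  <- rowSums(data[target] == 1) == length(target)
  in_none <- rowSums(data[others] == 1) == 0
  data[in_all & in_none, , drop = FALSE]
}

both <- exclusive_rows(muts, shown, c("EGFR", "TP53"))$Identifier  # EGFR and TP53 only
ttn  <- exclusive_rows(muts, shown, "TTN")$Identifier              # TTN only
print(both)
print(ttn)
```

Sample s2 carries mutations in EGFR, TP53, and TTN, so under this assumption it belongs to a different bar and is not part of either highlighted group.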
- Highlight the movies that are both Drama and Thriller, and mark the movies released between 1970 and 1980 in red.
# Custom query function: TRUE for rows whose ReleaseDate lies strictly between min and max
between <- function(row, min, max) {
  newData <- (row["ReleaseDate"] < max) & (row["ReleaseDate"] > min)
}

upset(movies,
      sets = c("Drama", "Comedy", "Action", "Thriller", "Western", "Documentary"),
      queries = list(
        list(query = intersects, params = list("Drama", "Thriller")),
        list(query = between, params = list(1970, 1980), color = "red", active = TRUE)))
Figure 3. movies upset 2
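The custom query uses strict comparisons, so movies from exactly 1970 or 1980 are excluded. The sketch below checks this on an invented stand-in table. How UpSetR invokes the function internally is not covered here, so the row-by-row loop is only an assumption made for illustration.

```r
# Hypothetical stand-in for a few rows of the movies table.
films <- data.frame(
  Name = c("film_a", "film_b", "film_c", "film_d"),
  ReleaseDate = c(1965, 1972, 1979, 1980),
  stringsAsFactors = FALSE
)

# Same predicate as the custom query: TRUE when min < ReleaseDate < max.
between <- function(row, min, max) {
  (row["ReleaseDate"] < max) & (row["ReleaseDate"] > min)
}

# Evaluate the predicate one row at a time.
hits <- vapply(
  seq_len(nrow(films)),
  function(i) isTRUE(as.logical(between(films[i, ], 1970, 1980))),
  logical(1)
)

print(films$Name[hits])
```

To include the boundary years, the comparisons would need to be `<=` and `>=` instead.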
1.4.3. Adding attribute plots
- Add box plots
At most two box plots can be added at a time.
upset(movies, boxplot.summary = c("AvgRating", "ReleaseDate"))
Figure 4. movies upset boxplot
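1.4.3.1. The attribute.plots parameter
The attribute.plots parameter adds attribute plots; the built-in plot types include histograms, scatter plots, and heat maps.
To add a density plot, define a custom plot function and pass it in.
- Add a scatter plot and a histogram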
upset(movies,
      sets = c("Drama", "Comedy", "Action", "Thriller", "Western", "Documentary"),
      queries = list(
        list(query = intersects, params = list("Drama", "Thriller")),
        list(query = between, params = list(1970, 1980), color = "red", active = TRUE)),
      attribute.plots = list(
        gridrows = 60,
        plots = list(
          list(plot = scatter_plot, x = "ReleaseDate", y = "AvgRating", queries = TRUE),
          list(plot = histogram, x = "ReleaseDate", queries = FALSE)),
        ncols = 2),
      query.legend = "top")
Figure 5. movies upset scatter histograms
- Add a density plot
another.plot <- function(data, x, y) {
  data$decades <- round_any(as.integer(unlist(data[y])), 10, ceiling)
  data <- data[which(data$decades >= 1970), ]
  myplot <- (ggplot(data, aes_string(x = x)) +
    geom_density(aes(fill = factor(decades)), alpha = 0.4) +
    theme(plot.margin = unit(c(0, 0, 0, 0), "cm"),
          legend.key.size = unit(0.4, "cm")))
}
library(plyr)
upset(movies,
      main.bar.color = "black",
      mb.ratio = c(0.5, 0.5),
      queries = list(
        list(query = intersects, params = list("Drama"), color = "red", active = FALSE),
        list(query = intersects, params = list("Action", "Drama"), active = TRUE),
        list(query = intersects, params = list("Drama", "Comedy", "Action"), color = "orange", active = TRUE)),
      attribute.plots = list(
        gridrows = 50,
        plots = list(
          list(plot = histogram, x = "ReleaseDate", queries = FALSE),
          list(plot = scatter_plot, x = "ReleaseDate", y = "AvgRating", queries = TRUE),
          list(plot = another.plot, x = "AvgRating", y = "ReleaseDate", queries = FALSE)),
        ncols = 3))
Figure 6. movies upset density
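The custom plot function bins release years into decades with plyr's round_any(x, 10, ceiling). The sketch below reproduces that rounding in base R so the binning is easy to inspect; the helper name round_up_to and the sample years are made up for the example.

```r
# Base-R equivalent of plyr::round_any(x, 10, ceiling):
# round each value up to the next multiple of `accuracy`.
round_up_to <- function(x, accuracy = 10) {
  ceiling(x / accuracy) * accuracy
}

years <- c(1955, 1969, 1970, 1971, 1985)  # hypothetical release years
decades <- round_up_to(years)
print(decades)

# another.plot keeps only rows whose bin is 1970 or later.
kept <- years[decades >= 1970]
print(kept)
```

Because the rounding goes upward, a movie from 1969 lands in the 1970 bin and survives the filter, while a movie from 1971 is labelled 1980.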
2. References
- https://github.com/hms-dbmi/UpSetR
- https://www.jianshu.com/p/324aae3d5ea4
- https://zhuanlan.zhihu.com/p/35303590