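1. Displaying multiple sets with UpSet plots
1.1. Use case
- Take the orthogroups that OrthoFinder identifies across species and draw a Venn diagram or an UpSet plot to see how many orthogroups the species share.
1.2. Data preparation
1.2.1. Datasets bundled with UpSetR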
movies <- read.csv(system.file("extdata", "movies.csv", package = "UpSetR"), header = TRUE, sep = ";")
mutations <- read.csv(system.file("extdata", "mutations.csv", package = "UpSetR"), header = TRUE, sep = ",")

head(movies)
View(movies)

head(mutations)
View(mutations)
> head(movies)
                                Name ReleaseDate Action Adventure Children Comedy Crime Documentary Drama Fantasy Noir Horror Musical Mystery Romance SciFi Thriller War Western AvgRating Watches
1                   Toy Story (1995)        1995      0         0        1      1     0           0     0       0    0      0       0       0       0     0        0   0       0      4.15    2077
2                     Jumanji (1995)        1995      0         1        1      0     0           0     0       1    0      0       0       0       0     0        0   0       0      3.20     701
3            Grumpier Old Men (1995)        1995      0         0        0      1     0           0     0       0    0      0       0       0       1     0        0   0       0      3.02     478
4           Waiting to Exhale (1995)        1995      0         0        0      1     0           0     1       0    0      0       0       0       0     0        0   0       0      2.73     170
5 Father of the Bride Part II (1995)        1995      0         0        0      1     0           0     0       0    0      0       0       0       0     0        0   0       0      3.01     296
6                        Heat (1995)        1995      1         0        0      0     1           0     0       0    0      0       0       0       0     0        1   0       0      3.88     940
> head(mutations)
  Identifier TTN PTEN TP53 EGFR MUC16 FLG RYR2 PCLO PIK3R1 PIK3CA NF1 MUC17 HMCN1 SPTA1 USH2A RB1 PKHD1 OBSCN AHNAK2 RYR3 RELN FRAS1 GPR98 DNAH5 ATRX APOB TCHH SYNE1 LRP2 KEL HRNR DNAH3 COL6A3 MUC5B LAMA1 DSP
1    02-0003   0    0    1    1     0   0    0    0      1      0   0     1     0     0     0   0     0     0      0    0    0     0     0     1    0    0    0     0    0   0    0     0      0     0     0   0
2    02-0033   0    0    1    0     0   0    0    0      0      1   1     0     0     0     0   1     0     0      0    0    0     0     0     0    0    0    0     0    0   0    0     0      0     0     0   0
3    02-0047   0    0    0    0     0   0    1    0      0      1   0     0     0     0     0   0     0     0      0    1    0     0     0     0    0    0    0     0    0   0    0     0      0     0     0   0
4    02-0055   1    1    1    0     0   0    0    0      0      0   0     0     0     0     0   0     0     1      0    0    0     0     0     0    0    0    0     0    0   0    0     0      1     0     0   0
5    02-2470   0    1    0    0     0   0    1    0      0      0   0     0     0     0     0   0     0     0      0    0    0     0     0     0    0    0    0     0    0   0    0     0      0     0     0   0
6    02-2483   0    0    1    0     0   0    0    1      0      0   0     0     0     0     0   0     0     0      0    0    0     0     0     0    1    1    0     0    0   0    0     0      0     0     0   0
  DNAH8 CNTNAP2 SDK1 NBPF10 DNAH2 NLRP5 MLL3 IDH1 HCN1 FCGBP DOCK5 RIMS2 PCDHA1 MXRA5 HEATR7B2 GRIN2A FGD5 TMEM132D STAG2 SEMA3C SCN9A PRDM9 POM121L12 PIK3CG PDGFRA GABRA6 FLG2 FBN3 FBN2 FAT2 DNAH11 DMD COL1A2
1     0       0    0      0     0     1    0    0    0     0     0     0      0     0        0      0    0        0     0      0     0     0         0      0      0      0    0    0    0    0      0   0      0
2     0       0    0      0     0     0    0    0    0     0     0     0      1     0        0      0    0        0     0      1     0     0         0      0      0      0    0    0    0    0      0   0      0
3     0       0    0      1     0     0    0    0    0     0     0     0      0     0        0      0    0        0     0      0     0     0         0      0      1      0    0    0    0    0      0   0      0
4     0       0    0      0     0     0    0    0    0     0     0     0      0     0        0      0    1        0     0      0     0     0         0      0      0      0    0    0    0    0      0   0      0
5     0       1    0      0     0     0    1    0    0     0     0     0      0     0        0      0    0        0     0      0     0     0         0      0      0      0    0    0    0    0      0   0      0
6     0       0    0      0     0     0    0    1    0     0     0     0      0     0        0      0    0        0     0      1     0     0         0      0      0      0    0    0    0    0      0   0      0
  ABCC9 XIRP2 TSHZ2 TEX15 SLIT3 RBM47 PIK3C2G PCDH11X MYH2 MACF1 KSR2 DNAH9 DCHS2 CSMD3 CDH18 BCOR AHNAK ZAN TRRAP THSD7B TAF1L SPAG17 SLCO5A1 SCN10A RYR1 RIMBP2 PLEKHG4B PCDHB7 NPTX2 NOS1 LZTR1
1     0     0     1     0     0     0       0       0    0     0    0     0     0     0     0    0     0   0     0      0     0      0       0      0    0      0        0      0     0    0     0
2     0     0     0     0     0     0       0       0    0     0    0     0     0     0     0    0     0   0     0      0     0      0       0      0    0      0        0      0     0    0     0
3     0     0     0     0     0     0       0       0    0     0    0     0     0     0     0    0     0   0     0      0     0      0       0      0    0      0        0      0     0    0     0
4     0     0     0     0     0     0       0       0    0     0    0     0     0     0     0    0     0   0     0      0     0      0       0      0    0      0        1      0     0    1     0
5     0     0     0     0     0     0       0       0    0     0    0     0     1     0     0    0     0   0     0      0     0      0       0      0    0      1        0      0     0    0     0
6     0     0     0     0     0     0       0       0    0     0    0     0     0     0     0    0     0   0     0      0     0      0       1      0    0      0        0      0     0    0     0
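1.2.2. OrthoFinder data
sed -E "s/\t[1-9][0-9]*/\t1/g" Orthogroups.GeneCount.tsv | sed "s/\.pep//g" > orthogroups.upset
# Replace every non-zero count in Orthogroups.GeneCount.tsv with 1, then remove every ".pep" string. The \t escape is interpreted as a tab by GNU sed.
In R, load the result with mutations <- read.csv("orthogroups.upset", header = TRUE, sep = "\t")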
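The same presence/absence conversion can also be done inside R rather than with sed. What follows is a minimal sketch, assuming the table has an Orthogroup column, one count column per species, and a trailing Total column; the toy data frame, the species names, and the counts are invented for illustration.

```r
# Toy stand-in for Orthogroups.GeneCount.tsv (hypothetical species and counts).
counts <- data.frame(
  Orthogroup = c("OG0000000", "OG0000001", "OG0000002"),
  speciesA   = c(12, 0, 3),
  speciesB   = c(0, 1, 10),
  speciesC   = c(5, 2, 0),
  Total      = c(17, 3, 13)
)

# Assumption: a Total column is a row sum, not a set, so drop it if present.
counts$Total <- NULL

# Turn every non-zero gene count into 1 so that each species column
# becomes a 0/1 membership flag.
species <- setdiff(names(counts), "Orthogroup")
binary <- counts
binary[species] <- lapply(counts[species], function(x) as.integer(x > 0))

print(binary)
```

A data frame shaped like binary, with one 0/1 column per set, is the kind of input the upset() calls in this post work on.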
1.3. Installing the UpSetR package
install.packages("UpSetR")  # install
library(UpSetR)             # load UpSetR
library(ggplot2); library(plyr); library(gridExtra); library(grid)  # load the supporting packages
1.4. Using the UpSetR package
1.4.1. The upset function
upset(mutations,
      sets = c("MUC16", "EGFR", "TP53", "TTN"),
      nsets = 4, nintersects = 20,
      mb.ratio = c(0.55, 0.45),
      order.by = c("degree", "freq"),
      keep.order = TRUE,
      decreasing = c(TRUE, FALSE),
      number.angles = 30,
      point.size = 2, line.size = 1,
      mainbar.y.label = "Intersection size of gene family",
      sets.x.label = "genome size",
      text.scale = c(1.3, 1.3, 1, 1, 1.5, 1))
Figure 1. movies upset
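To make explicit what the vertical bars of such a plot count, here is a rough base-R sketch on a made-up 0/1 table. It assumes the usual UpSetR behaviour in which each row is assigned to exactly one (exclusive) combination of sets; the data frame toy and its columns A, B, C are invented for illustration and are not part of UpSetR.

```r
# Hypothetical membership table: one row per element, one 0/1 column per set.
toy <- data.frame(
  A = c(1, 1, 0, 1, 0, 1),
  B = c(1, 0, 1, 1, 0, 1),
  C = c(0, 0, 1, 1, 0, 1)
)

# Label each row with the sets it belongs to, e.g. "A&B".
membership <- apply(toy, 1, function(r) paste(names(toy)[r == 1], collapse = "&"))

# Rows that belong to no set are not part of any intersection.
membership <- membership[membership != ""]

# Size of each exclusive intersection, largest first.
sizes <- sort(table(membership), decreasing = TRUE)
print(sizes)
```

Each entry of sizes corresponds to one bar: the elements that are in exactly that combination of sets and in no other set.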
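1.4.2. The queries parameter
- query: the query function to apply. UpSetR ships built-in ones (such as intersects), and you can also define your own function and pass it in.
- params: a list giving the data the query operates on.
- color: the color used for the query; optional.
- active: display style. TRUE/T overlays the bar with the query color; FALSE/F draws a triangle at the top of the bar instead.
- query.name: the name shown for the query in the legend.
- Mark the mutations shared by the "EGFR" and "TP53" genes in blue, and mark the mutations unique to "TTN" in red on the bar chart.
upset(mutations,
      sets = c("MUC16", "EGFR", "TP53", "TTN"),
      queries = list(
        list(query = intersects, params = list("EGFR", "TP53"),
             color = "blue", active = TRUE, query.name = "share EGFR and TP53"),
        list(query = intersects, params = list("TTN"),
             color = "red", active = TRUE)))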
Figure 2. mutations upset
- Highlight the movies that are both Drama and Thriller, and mark in red the movies released between 1970 and 1980 (both bounds exclusive).
# Custom query function: TRUE for a row whose ReleaseDate lies strictly between min and max
between <- function(row, min, max) {
  (row["ReleaseDate"] < max) & (row["ReleaseDate"] > min)
}

upset(movies,
      sets = c("Drama", "Comedy", "Action", "Thriller", "Western", "Documentary"),
      queries = list(
        list(query = intersects, params = list("Drama", "Thriller")),
        list(query = between, params = list(1970, 1980), color = "red", active = TRUE)))
Figure 3. movies upset 2
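A custom query such as between receives one row of the data plus the values given in params, and is expected to yield TRUE or FALSE for that row. As a rough illustration of that contract outside of upset(), the sketch below applies a between-style function to a small made-up table. The names toy_movies and in_range are hypothetical, and how UpSetR invokes query functions internally is not reproduced here.

```r
# Hypothetical four-row table with the one column the query needs.
toy_movies <- data.frame(
  Name        = c("a", "b", "c", "d"),
  ReleaseDate = c(1965, 1970, 1975, 1985)
)

# Same logic as between(): TRUE when the year is strictly inside (min, max).
in_range <- function(row, min, max) {
  year <- as.numeric(row["ReleaseDate"])
  (year < max) & (year > min)
}

# Evaluate the query once per row.
hits <- apply(toy_movies, 1, in_range, min = 1970, max = 1980)
print(toy_movies[hits, ])
```

Because both comparisons are strict, a movie released in exactly 1970 or 1980 is not selected; switching to <= and >= would include the bounds.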
1.4.3. Adding attribute plots
- Add box plots
upset(movies, boxplot.summary = c("AvgRating", "ReleaseDate"))
Figure 4. movies upset boxplot
The attribute.plots parameter
- Add a histogram and a scatter plot
upset(movies,
      sets = c("Drama", "Comedy", "Action", "Thriller", "Western", "Documentary"),
      queries = list(
        list(query = intersects, params = list("Drama", "Thriller")),
        list(query = between, params = list(1970, 1980), color = "red", active = TRUE)),
      attribute.plots = list(
        gridrows = 60,
        plots = list(
          list(plot = scatter_plot, x = "ReleaseDate", y = "AvgRating", queries = TRUE),
          list(plot = histogram, x = "ReleaseDate", queries = FALSE)),
        ncols = 2),
      query.legend = "top")
Figure 5. movies upset scatter histograms
- Add a density plot
# Custom attribute plot: density of x, filled by release decade (1970 onward)
another.plot <- function(data, x, y) {
  data$decades <- round_any(as.integer(unlist(data[y])), 10, ceiling)
  data <- data[which(data$decades >= 1970), ]
  myplot <- (ggplot(data, aes_string(x = x)) +
    geom_density(aes(fill = factor(decades)), alpha = 0.4) +
    theme(plot.margin = unit(c(0, 0, 0, 0), "cm"),
          legend.key.size = unit(0.4, "cm")))
}

library(plyr)  # provides round_any(), used inside another.plot
upset(movies, main.bar.color = "black", mb.ratio = c(0.5, 0.5),
      queries = list(
        list(query = intersects, params = list("Drama"), color = "red", active = FALSE),
        list(query = intersects, params = list("Action", "Drama"), active = TRUE),
        list(query = intersects, params = list("Drama", "Comedy", "Action"), color = "orange", active = TRUE)),
      attribute.plots = list(
        gridrows = 50,
        plots = list(
          list(plot = histogram, x = "ReleaseDate", queries = FALSE),
          list(plot = scatter_plot, x = "ReleaseDate", y = "AvgRating", queries = TRUE),
          list(plot = another.plot, x = "AvgRating", y = "ReleaseDate", queries = FALSE)),
        ncols = 3))
Figure 6. movies upset density
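The decade grouping inside another.plot rounds each release year up to the next multiple of 10. A small base-R sketch of that step, assuming round_any(x, 10, ceiling) is equivalent to ceiling(x / 10) * 10, may make the grouping easier to read; the years vector is invented for illustration.

```r
# Hypothetical release years.
years <- c(1969, 1970, 1971, 1980, 1995)

# Round each year up to the next multiple of 10 (a year already on a
# multiple of 10 stays where it is).
decades <- ceiling(years / 10) * 10

# Same filter as in another.plot: keep decade labels of 1970 or later.
keep <- decades >= 1970

print(data.frame(year = years, decade = decades, kept = keep))
```

Because the rounding goes upward, the label is the upper end of the bin: 1971 is grouped under 1980, and 1969 is grouped under 1970 and therefore passes the filter.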
2. References
- https://github.com/hms-dbmi/UpSetR
- https://www.jianshu.com/p/324aae3d5ea4
- https://zhuanlan.zhihu.com/p/35303590