The purpose of this document is to generate, from the data included here, many of the figures, tables, and supplementary documents present in the StateHub / StatePaintR paper. All code is included to generate the figures and tables, see the “Code” button on the top right of the page to download the .Rmd
file that generates this document.
Genome annotation is critical to understand the function of disease variants, especially for clinical applications. To meet this need there are segmentations available from public consortia reflecting varying unsupervised approaches to functional annotation based on epigenetics data, but there remains a need for transparent, reproducible, and easily interpreted genomic maps of the functional biology of chromatin. We introduce here methods for defining chromatin state with a combinatorial epigenomic model using an annotation tool, StatePaintR and a website database, StateHub. Annotations are fully documented with change history and versioning, authorship information, and original source files. The tool calculates quantitative state scores based on genome-wide ranking, allowing prioritization and enrichment testing, facilitating quantitative analysis. StateHub hosts annotation tracks for major public consortia as a resource, and allows users to submit their own alternative models.
A preprint is available on bioRxiv
We begin by loading the required packages, not all of these are required to simply run StatePaintR, but they will be necessary for doing some of the analysis that follows.
library(GenomicRanges)
library(biomaRt)
library(rtracklayer)
library(RColorBrewer)
library(ggplot2)
library(stringr)
library(tidyr)
library(dplyr)
library(httr)
library(readr)
library(devtools)
And then we install StatePaintR, currently from GitHub.
install_github("Simon-Coetzee/StatePaintR")
library(StatePaintR)
Finally we download all of the data required to run this vignette:
download.file("https://s3-us-west-2.amazonaws.com/statehub-trackhub/statepaintr_data.tar.gz", "statepaintr_data.tar.gz")
trying URL 'https://s3-us-west-2.amazonaws.com/statehub-trackhub/statepaintr_data.tar.gz'
Content type 'application/gzip' length 115305878 bytes (110.0 MB)
==================================================
downloaded 110.0 MB
untar("statepaintr_data.tar.gz", compressed = "gzip")
data.frame(FILES = list.files("data", all.files = FALSE, recursive = TRUE))
With This complete, we begin analysis.
The process of running StatePaintR falls into three basic steps. ## Download the decision matrix We first download the decision matrix by indicating the model’s unique ID as indicated on the statehub website.
decisionmatrix <- get.decision.matrix(search = "5813b67f46e0fb06b493ceb0")
In order to determine a basis for true positive enhancers we use enhancers validated by the VISTA enhancer browser. We can begin by reading the VISTA data from a bed file. The original data comes from the ENCODE annotation file set ENCSR964TTF
vista.enhancers <- read_tsv("data/VISTA_invivo_tested.bed", col_names = FALSE)
vista.enhancers
This gives us the genomic locations of vista enhancers, in columns 1:3; then the name of the enhancer in 4. Column 5 is True - 1
or False 0
indicating if the enhancer was found to have activity. Columns 7 and 8 are redundant with 2 and 3, column 9 assigns a color based upon column 5. Column 10 indicates the tissues wherein the activity of the enhancer was observed. Since this is a little bit messy, we clean up the format, and covert it to a GenomicRanges
object for further analysis.
# remove commas and spaces
vista.validation <- stringr::str_replace(vista.enhancers$X10,
pattern = "ganglion, cranial",
replacement = "ganglion-cranial")
# split on the remaining commas in order retrieve all of the tissues
vista.validation <- stringr::str_split(vista.validation, ",")
# remove parenthesis and clarifiying terms
vista.validation <- lapply(vista.validation, function(x) {
x <- stringr::str_replace(x, " \\(.*\\)", "")
x <- stringr::str_trim(x, side = "both")
return(x)
})
# remove brackets and scores
vista.validation <- lapply(vista.validation, function(x) {
x <- stringr::str_replace(x, "\\[.*\\]", "")
})
# create an empty matrix with rows representing enhancers,
# columns representing tissues that they could be active in
vista.enhancers.matrix <- matrix(0L,
nrow = nrow(vista.enhancers),
ncol = length(unique(unlist(vista.validation))))
colnames(vista.enhancers.matrix) <- unique(unlist(vista.validation))
# if an enhancer was found to have activity by VISTA,
# indicate that with a 1 in the appropriate tissue
for (enhancer in seq_along(vista.validation)) {
vista.enhancers.matrix[enhancer, vista.validation[[enhancer]]] <- 1L
}
# create our Genomic Ranges object
vista.enhancers <- GRanges(seqnames = vista.enhancers$X1,
ranges = IRanges(start = vista.enhancers$X2,
end = vista.enhancers$X3),
name = vista.enhancers$X4,
validated = vista.enhancers$X5,
seqinfo = Seqinfo(genome = "mm10"))
mcols(vista.enhancers) <- cbind(as.data.frame(mcols(vista.enhancers)),
vista.enhancers.matrix)
vista.enhancers[, 1:4]
GRanges object with 2554 ranges and 4 metadata columns:
seqnames ranges strand | name validated negative neural tube
<Rle> <IRanges> <Rle> | <character> <integer> <integer> <integer>
[1] chr1 [ 5021600, 5024300] * | 912:2 0 1 0
[2] chr1 [ 6729232, 6730317] * | 698:1 1 0 1
[3] chr1 [ 9648223, 9650965] * | 1546:2 1 0 1
[4] chr1 [11026040, 11028544] * | 663:2 0 1 0
[5] chr1 [12614812, 12617322] * | 664:2 0 1 0
... ... ... ... . ... ... ... ...
[2550] chrX [109969648, 109970362] * | 871:1 0 1 0
[2551] chrX [110076960, 110081419] * | 426:1 1 0 0
[2552] chrX [110318086, 110319120] * | 554:1 0 1 0
[2553] chrX [110377387, 110378071] * | 585:1 0 1 0
[2554] chrX [110816983, 110818738] * | 1029:1 0 1 0
-------
seqinfo: 66 sequences (1 circular) from mm10 genome
We will also create a column that represents the amalgamation of several different tissues: ear, eye, branchial arch, nose, and facial mesenchyme, that we will be using as the validation set for ChIP-Seq conducted in the embryonic facial prominence.
face.vista <- c("ear", "eye", "branchial arch", "nose", "facial mesenchyme")
e.f.p <- rowSums(as.matrix(mcols(vista.enhancers)[, face.vista])) > 0
mcols(vista.enhancers)[, "embryonic facial prominence"] <- as.integer(e.f.p)
vista.enhancers[, 27]
GRanges object with 2554 ranges and 1 metadata column:
seqnames ranges strand | embryonic facial prominence
<Rle> <IRanges> <Rle> | <integer>
[1] chr1 [ 5021600, 5024300] * | 0
[2] chr1 [ 6729232, 6730317] * | 0
[3] chr1 [ 9648223, 9650965] * | 0
[4] chr1 [11026040, 11028544] * | 0
[5] chr1 [12614812, 12617322] * | 0
... ... ... ... . ...
[2550] chrX [109969648, 109970362] * | 0
[2551] chrX [110076960, 110081419] * | 0
[2552] chrX [110318086, 110319120] * | 0
[2553] chrX [110377387, 110378071] * | 0
[2554] chrX [110816983, 110818738] * | 0
-------
seqinfo: 66 sequences (1 circular) from mm10 genome
Using the data made available by the ENCODE project we can retrieve data on multiple Histone ChIP-Seq and sometimes DNase-seq for relevant tissues in order to make enhancer predictions.
Dataset | Accession Number |
---|---|
embryonic mouse neural tube (11.5 day) | ENCSR215ZYV |
embryonic mouse midbrain (11.5 day) | ENCSR843IAS |
embryonic mouse hindbrain (11.5 day) | ENCSR501OPC |
embryonic mouse limb (11.5 day) | ENCSR283NCE |
embryonic mouse heart (11.5 day) | ENCSR016LTR |
Narrowpeak calls for this data and our manifest are included with this document. In order to use DNase-seq when available, we used the IDR process to merge replicates.
manifest <- "data/mouse.idr/IDR.Manifest.txt"
read.table(manifest,
sep = "\t",
stringsAsFactors = FALSE,
header = TRUE)
We can begin our enhancer predictions by segmenting the genome for these samples. We have our manifest, now all we need is a decisionMatrix
from StateHub to define the rules for segmenting the genome.
dm <- get.decision.matrix("5813b67f46e0fb06b493ceb0")
dm
ID: 5813b67f46e0fb06b493ceb0
Name: Focused Poised Promoter Model
Author: Ben Berman - Cedars-Sinai Center for Bioinformatics and Functional Genomics
Revision: Oct 28, 2016 11:24:11 PM
Description: This model has focused poised regions that require narrow H3K27me3 regions to be present to be called.
Our decision matrix is downloaded, but in order to score our enhancers we make some modifications. We want to exclude the “Regulatory” mark from our scoring process, and we want keep DNase-seq peaks intact as the core of our features.
dm <- doNotScore(dm, "Regulatory")
dm <- doNotSplit(dm, "Core")
dm
ID: 5813b67f46e0fb06b493ceb0
Name: Focused Poised Promoter Model
Author: Ben Berman - Cedars-Sinai Center for Bioinformatics and Functional Genomics
Revision: Oct 28, 2016 11:24:11 PM
Description: This model has focused poised regions that require narrow H3K27me3 regions to be present to be called.
Not Scoring: Regulatory*
Not Splitting: [Core]
With preparation complete, we can now segment the genome for our samples defined in the manifest.
mouse.embryo.states <- PaintStates(manifest = manifest,
decisionMatrix = dm,
scoreStates = TRUE,
progress = FALSE)
names(mouse.embryo.states) <- str_replace(names(mouse.embryo.states), " embryo", "")
mouse.embryo.states$heart
GRanges object with 497638 ranges and 3 metadata columns:
seqnames ranges strand | name state score
<Rle> <IRanges> <Rle> | <character> <character> <numeric>
[1] chr1 [3008951, 3009184] * | heart embryo HET 964
[2] chr1 [3047354, 3047644] * | heart embryo HET 951
[3] chr1 [3120458, 3120776] * | heart embryo EWR 0
[4] chr1 [3212581, 3212786] * | heart embryo EWR 0
[5] chr1 [3301150, 3301344] * | heart embryo HET 951
... ... ... ... . ... ... ...
[497634] chrUn_JH584304 [65053, 66952] * | heart embryo HET 986
[497635] chrUn_JH584304 [67363, 68071] * | heart embryo HET 989
[497636] chrUn_JH584304 [68072, 68265] * | heart embryo TRS 756
[497637] chrUn_JH584304 [68266, 68579] * | heart embryo HET 989
[497638] chrUn_JH584304 [98007, 98405] * | heart embryo HET 945
-------
seqinfo: 66 sequences (1 circular) from mm10 genome
All of our model tuning was done on a selection of 100 valid enhancers for each tissue type, so we’ll exclude those 100 from our subsequent comparisons.
seed <- 42
train.enhancers <- list()
test.enhancers <- list()
set.seed(seed)
for (tissue in names(mouse.embryo.states)) {
vista.enhancers.train <- vista.enhancers[, tissue]
vista.enhancers.train <- sample(which(mcols(vista.enhancers.train)[, tissue] == 1L),
size = 100)
train.enhancers <- c(train.enhancers, list(vista.enhancers.train))
names(train.enhancers)[length(train.enhancers)] <- tissue
vista.enhancers.test <- c(1:length(vista.enhancers))[-vista.enhancers.train]
test.enhancers <- c(test.enhancers, list(vista.enhancers.test))
names(test.enhancers)[length(test.enhancers)] <- tissue
}
Evaluation of our models, and the external enhancer predictions against which we compare, is done with Precision-Recall-Gain Curves 1.
Here we generate the StatePaintR predictions, using the states EAR, EARC, AR, and ARC as our predicted enhancers. This also prepares the data for plotting in ggplot2 by extracting the precision gain and recall gain, and the convex hull. Additionally we look at the area under the precision recall gain curve to get an idea of the accuracy.
plot.data.sp <- PRG(states = mouse.embryo.states,
comparison = vista.enhancers,
state.select = c("EARC", "ARC", "AR", "EAR"),
comparison.select = test.enhancers)
plot.data.sp$auprg
heart hindbrain limb midbrain neural tube
0.8775036 0.7903943 0.8565165 0.8369033 0.8443555
We can so something similar for evaluating ENCODE candiate enhancer calls We used the following data from the ENCODE data portal:
Enhancer-like regions | Accession Number |
---|---|
using DNase and H3K27ac for neural tube (11.5 day) | ENCFF786KUB |
using DNase and H3K27ac for midbrain (11.5 day) | ENCFF733UJT |
using DNase and H3K27ac for hindbrain (11.5 day) | ENCFF324INM |
using DNase and H3K27ac for limb (11.5 day) | ENCFF520EGD |
using H3K27ac for heart (11.5 day) | ENCSR312DDF |
So we begin by downloading this data, and coverting it into GRanges objects
encode.enhancers <- c("neural tube" = "ENCFF786KUB",
midbrain = "ENCFF733UJT",
hindbrain = "ENCFF324INM",
limb = "ENCFF520EGD",
heart = "ENCFF435VGC")
encode.enhancers <- sapply(encode.enhancers, function(x) {
encode <- "https://www.encodeproject.org"
x <- content(GET(encode, path = x))$href
return(paste0(encode, x))
})
encode.enhancers <- lapply(encode.enhancers, function(x) {
x <- read_tsv(x, col_names = FALSE)
x$X5 <- nrow(x):1
x <- GRanges(seqnames = x$X1,
ranges = IRanges(start = x$X2,
end = x$X3),
score = x$X5,
seqinfo = Seqinfo(genome = "mm10"))
return(x)
})
encode.enhancers$heart
GRanges object with 33492 ranges and 1 metadata column:
seqnames ranges strand | score
<Rle> <IRanges> <Rle> | <integer>
[1] chr4 [155233365, 155237530] * | 33492
[2] chr15 [ 77034916, 77040989] * | 33491
[3] chr17 [ 46854219, 46859321] * | 33490
[4] chr3 [ 87149351, 87155573] * | 33489
[5] chr18 [ 75497565, 75507425] * | 33488
... ... ... ... . ...
[33488] chr17 [ 26123469, 26123919] * | 5
[33489] chr11 [ 18888501, 18888937] * | 4
[33490] chr5 [113650662, 113651045] * | 3
[33491] chr14 [ 33412655, 33413007] * | 2
[33492] chr17 [ 43568116, 43568477] * | 1
-------
seqinfo: 66 sequences (1 circular) from mm10 genome
Regulatory element prediction based on tissue-specific local epigenetic marks (REPTILE) is described in Improved regulatory element prediction based on tissue-specific local epigenomic signatures, and integrates histone modification and whole-genome cytosine DNA methylation profiles to identify the precise location of enhancers. This paper also includes results for DELTA, RFECS, and CSI-ANN which we will be compairing against.
reptile.enhancers <- c("neural tube" = "NT",
midbrain = "MB",
hindbrain = "HB",
limb = "LM",
heart = "HT")
reptile.enhancers <- lapply(reptile.enhancers, function(x) {
file.path <- file.path("data", "enhancer_predictions", "REPTILE", paste0("REPTILE_pred_E11_5_", x, ".bed"))
x <- read_tsv(file.path, col_names = FALSE)
x <- GRanges(seqnames = x$X1,
ranges = IRanges(start = x$X2,
end = x$X3),
score = x$X5,
enhancername = x$X4)
return(x)
})
reptile.enhancers$heart
GRanges object with 51551 ranges and 2 metadata columns:
seqnames ranges strand | score enhancername
<Rle> <IRanges> <Rle> | <numeric> <character>
[1] chr1 [3120334, 3120785] * | 0.5165 dmr_7
[2] chr1 [4412200, 4414200] * | 0.5075 bin_2671882
[3] chr1 [4432585, 4433294] * | 0.6330 dmr_117,dmr_118
[4] chr1 [4493100, 4495100] * | 0.5530 bin_2672691
[5] chr1 [4571100, 4573100] * | 0.5015 bin_2673471
... ... ... ... . ... ...
[51547] chrX [169910900, 169912900] * | 0.5300 bin_1699109
[51548] chrX [169935400, 169937400] * | 0.5985 bin_1699354
[51549] chrY [ 808500, 810500] * | 0.6395 bin_1718398
[51550] chrY [ 1010400, 1012400] * | 0.5320 bin_1720417
[51551] chrY [ 1284200, 1286200] * | 0.5055 bin_1723155
-------
seqinfo: 21 sequences from an unspecified genome; no seqlengths
DELTA (Distal Enhancer Locating Tool based on AdaBoost) is described in DELTA: A Distal Enhancer Locating Tool Based on AdaBoost Algorithm and Shape Features of Chromatin Modifications and defines a set of non-redundant shape features of histone modifications, which shows high consistency across cell types and can greatly reduce the dimensionality of feature vectors which is then integrated with a machine-learning algorithm AdaBoost to predict enhancers.
delta.enhancers <- c("neural tube" = "NT",
midbrain = "MB",
hindbrain = "HB",
limb = "LM",
heart = "HT")
delta.enhancers <- lapply(delta.enhancers, function(x) {
file.path <- file.path("data", "enhancer_predictions", "DELTA", paste0("DELTA_pred_E11_5_", x, ".bed"))
x <- read_tsv(file.path, col_names = FALSE)
x <- GRanges(seqnames = x$X1,
ranges = IRanges(start = x$X2,
end = x$X3),
score = x$X5,
enhancername = x$X4)
return(x)
})
delta.enhancers$heart
GRanges object with 43911 ranges and 2 metadata columns:
seqnames ranges strand | score enhancername
<Rle> <IRanges> <Rle> | <numeric> <character>
[1] chr1 [4426300, 4428400] * | 0.06853420 delta_44273
[2] chr1 [4614300, 4616400] * | 0.05295439 delta_46153
[3] chr1 [4857000, 4859100] * | 0.05446597 delta_48580
[4] chr1 [5220300, 5222400] * | 0.05524229 delta_52213
[5] chr1 [6213700, 6215800] * | 0.15449180 delta_62147
... ... ... ... . ... ...
[43907] chrX [169983200, 169985300] * | 0.34102207 delta_26327305
[43908] chrX [169988600, 169990700] * | 0.10003010 delta_26327359
[43909] chrX [170008800, 170010900] * | 0.08205701 delta_26327561
[43910] chrX [170017200, 170019300] * | 0.26976139 delta_26327645
[43911] chrY [ 90739300, 90741400] * | 0.21821156 delta_27245179
-------
seqinfo: 21 sequences from an unspecified genome; no seqlengths
RFECS (Random Forest based Enhancer identification from Chromatin States) is described in RFECS: A Random-Forest Based Algorithm for Enhancer Identification from Chromatin State and is used to predict genome-wide enhancers based on their similarity to the histone modification profiles of p300 binding sites.
rfecs.enhancers <- c("neural tube" = "NT",
midbrain = "MB",
hindbrain = "HB",
limb = "LM",
heart = "HT")
rfecs.enhancers <- lapply(rfecs.enhancers, function(x) {
file.path <- file.path("data", "enhancer_predictions", "RFECS", paste0("RFECS_pred_E11_5_", x, ".bed"))
x <- read_tsv(file.path, col_names = FALSE)
x <- GRanges(seqnames = x$X1,
ranges = IRanges(start = x$X2,
end = x$X3),
score = x$X5,
enhancername = x$X4)
return(x)
})
rfecs.enhancers$heart
GRanges object with 44039 ranges and 2 metadata columns:
seqnames ranges strand | score enhancername
<Rle> <IRanges> <Rle> | <numeric> <character>
[1] chr1 [4491500, 4493500] * | 0.246154 rfecs_18957
[2] chr1 [4496000, 4498000] * | 0.353846 rfecs_18958
[3] chr1 [4570700, 4572700] * | 0.415385 rfecs_18959
[4] chr1 [4622200, 4624200] * | 0.292308 rfecs_18960
[5] chr1 [4807400, 4809400] * | 0.261538 rfecs_18961
... ... ... ... . ... ...
[44035] chrX [170001400, 170003400] * | 0.400000 rfecs_44034
[44036] chrX [170004700, 170006700] * | 0.261538 rfecs_44035
[44037] chrX [170009000, 170011000] * | 0.584615 rfecs_44036
[44038] chrX [170672400, 170674400] * | 0.861538 rfecs_44037
[44039] chrX [170757400, 170759400] * | 0.953846 rfecs_44038
-------
seqinfo: 20 sequences from an unspecified genome; no seqlengths
CSI-ANN (chromatin signature identification by artificial neural network) is described in Discover regulatory DNA elements using chromatin signatures and artificial neural network and is a framework that consists of a data transformation and a feature extraction step followed by a classification step using time-delay neural network.
csiann.enhancers <- c("neural tube" = "NT",
midbrain = "MB",
hindbrain = "HB",
limb = "LM",
heart = "HT")
csiann.enhancers <- lapply(csiann.enhancers, function(x) {
file.path <- file.path("data", "enhancer_predictions", "CSIANN", paste0("CSIANN_pred_E11_5_", x, ".bed"))
x <- read_tsv(file.path, col_names = FALSE)
x <- GRanges(seqnames = x$X1,
ranges = IRanges(start = x$X2,
end = x$X3),
score = x$X5,
enhancername = x$X4)
return(x)
})
csiann.enhancers$heart
GRanges object with 63158 ranges and 2 metadata columns:
seqnames ranges strand | score enhancername
<Rle> <IRanges> <Rle> | <numeric> <character>
[1] chr1 [3670001, 3672001] * | 0.8655596 csiann_0
[2] chr1 [4412001, 4414001] * | 0.8594556 csiann_1
[3] chr1 [4491601, 4493601] * | 0.8655598 csiann_2
[4] chr1 [4571001, 4573001] * | 0.8655591 csiann_3
[5] chr1 [4622601, 4624601] * | 0.8651950 csiann_4
... ... ... ... . ... ...
[63154] chrX [170016801, 170018801] * | 0.8654488 csiann_63153
[63155] chrX [170018201, 170020201] * | 0.8655517 csiann_63154
[63156] chrY [ 1244401, 1246401] * | 0.8640187 csiann_63155
[63157] chrY [ 1285201, 1287201] * | 0.8629719 csiann_63156
[63158] chrY [ 90738801, 90740801] * | 0.8655591 csiann_63157
-------
seqinfo: 21 sequences from an unspecified genome; no seqlengths
Another system of enhancer prediction that we may compare against is EnhancerFinder, a two-step method for distinguishing developmental enhancers from the genomic background and then predicting their tissue specificity. Among the Supporting Information for the paper are the enhancer predictions made for brain, limb, and heart. These however are indicated in hg19 genomic coordinates, so we will have to get the hg19 coordinates for the VISTA database as well.
vista.human <- read_tsv("data/VISTA_invivo_tested_hg19.bed", col_names = FALSE)
vista.human.validation <- stringr::str_replace(vista.human$X10, "ganglion, cranial", "ganglion-cranial")
vista.human.validation <- stringr::str_split(vista.human.validation, ",")
vista.human.validation <- lapply(vista.human.validation, function(x) {
x <- stringr::str_replace(x, " \\(.*\\)", "")
x <- stringr::str_trim(x, side = "both")
return(x)
})
vista.human.validation <- lapply(vista.human.validation, function(x) {
x <- stringr::str_replace(x, "\\[.*\\]", "")
})
vista.human.matrix <- matrix(0L,
nrow = nrow(vista.human),
ncol = length(unique(unlist(vista.human.validation))))
colnames(vista.human.matrix) <- unique(unlist(vista.human.validation))
for(enhancer in seq_along(vista.human.validation)) {
vista.human.matrix[enhancer, vista.human.validation[[enhancer]]] <- 1L
}
library(GenomicRanges)
vista.human <- GRanges(seqnames = vista.human$X1,
ranges = IRanges(start = vista.human$X2,
end = vista.human$X3),
name = vista.human$X4,
validated = vista.human$X5,
seqinfo = Seqinfo(genome = "hg19"))
mcols(vista.human) <- cbind(as.data.frame(mcols(vista.human)), vista.human.matrix)
vista.human[, 1:4]
GRanges object with 2555 ranges and 4 metadata columns:
seqnames ranges strand | name validated heart forebrain
<Rle> <IRanges> <Rle> | <character> <integer> <integer> <integer>
[1] chr1 [2142309, 2144864] * | 318:2 1 1 0
[2] chr1 [2808410, 2809683] * | 1088:2 1 0 1
[3] chr1 [2883700, 2888173] * | 1492:2 1 0 0
[4] chr1 [3027197, 3029379] * | 1560:2 0 0 0
[5] chr1 [3190581, 3191428] * | 705:1 1 0 0
... ... ... ... . ... ... ... ...
[2551] chrX [139593502, 139594774] * | 770:1 0 0 0
[2552] chrX [139674499, 139675403] * | 588:1 0 0 0
[2553] chrX [147829016, 147830159] * | 856:1 0 0 0
[2554] chrX [150407692, 150409052] * | 1746:1 1 0 1
[2555] chrX [153593214, 153596578] * | 2491:1 0 0 0
-------
seqinfo: 93 sequences (1 circular) from hg19 genome
The enhancer predictions that EnhancerFinder generates only has a single category for brain so we will compare it to both hindbrain and midbrain, they also made no predictions for neural tube, so we will be excluding EnhancerFinder from that prediction. Beyond that, the procedure is similar to what we did for ENCODE. Starting with creating a GRanges object for the predictions
download.file("http://dx.doi.org/10.1371/journal.pcbi.1003677.s018", destfile = "data/enhancerfinder.predictions.zip")
trying URL 'http://dx.doi.org/10.1371/journal.pcbi.1003677.s018'
downloaded 1.9 MB
unzip("data/enhancerfinder.predictions.zip", exdir = file.path("data", "enhancer_predictions", "enhancerfinder"))
l.ef.enhancer <- read.delim(file.path("data", "enhancer_predictions", "enhancerfinder", "enhancerfinder_limb_hg19.bed"))
h.ef.enhancer <- read.delim(file.path("data", "enhancer_predictions", "enhancerfinder", "enhancerfinder_heart_hg19.bed"))
b.ef.enhancer <- read.delim(file.path("data", "enhancer_predictions", "enhancerfinder", "enhancerfinder_brain_hg19.bed"))
enhancerfinder.enhancers <- list(h.ef.enhancer, b.ef.enhancer, l.ef.enhancer, b.ef.enhancer)
enhancerfinder.enhancers <- lapply(enhancerfinder.enhancers, function(x) {
x <- GRanges(seqnames = x$X.chromosome,
ranges = IRanges(start = x$start,
end = x$end),
score = x$MKL_scores)
})
names(enhancerfinder.enhancers) <- c("heart", "hindbrain", "limb", "midbrain")
enhancerfinder.enhancers$heart
GRanges object with 19051 ranges and 1 metadata column:
seqnames ranges strand | score
<Rle> <IRanges> <Rle> | <numeric>
[1] chr1 [839000, 840500] * | 0.273
[2] chr1 [856000, 857500] * | 0.221
[3] chr1 [939000, 942500] * | 1.679
[4] chr1 [954000, 955500] * | 0.525
[5] chr1 [956000, 957500] * | 0.418
... ... ... ... . ...
[19047] chrX [153963280, 153964780] * | 0.293
[19048] chrX [153989280, 153990780] * | 0.297
[19049] chrY [ 10036000, 10037500] * | 0.271
[19050] chrY [ 13470000, 13471500] * | 0.299
[19051] chrY [ 13477000, 13479500] * | 0.401
-------
seqinfo: 24 sequences from an unspecified genome; no seqlengths
As with StatePaintR compare this data to the VISTA dataset.
plot.data.encode <- PRG(states = encode.enhancers,
comparison = vista.enhancers,
comparison.select = test.enhancers)
plot.data.encode$auprg
neural tube midbrain hindbrain limb heart
0.8160415 0.8228843 0.7969034 0.8530691 0.8815817
plot.data.reptile <- PRG(states = reptile.enhancers,
comparison = vista.enhancers,
comparison.select = test.enhancers)
plot.data.reptile$auprg
neural tube midbrain hindbrain limb heart
0.8622531 0.8710326 0.7628318 0.8937956 0.9175436
plot.data.delta <- PRG(states = delta.enhancers,
comparison = vista.enhancers,
comparison.select = test.enhancers)
plot.data.delta$auprg
neural tube midbrain hindbrain limb heart
0.8091653 0.8442490 0.7575233 0.8396032 0.9343298
plot.data.rfecs <- PRG(states = rfecs.enhancers,
comparison = vista.enhancers,
comparison.select = test.enhancers)
plot.data.rfecs$auprg
neural tube midbrain hindbrain limb heart
0.7918588 0.8462961 0.7762617 0.8544358 0.9222974
plot.data.csiann <- PRG(states = csiann.enhancers,
comparison = vista.enhancers,
comparison.select = test.enhancers)
plot.data.csiann$auprg
neural tube midbrain hindbrain limb heart
0.7187654 0.6819687 0.6236748 0.6873185 0.8400370
For enhancer finder - accounting for it being in hg19, rather than in mm10 like the previous comparisons
plot.data.enhancerfinder <- PRG(states = enhancerfinder.enhancers,
comparison = vista.human,
comparison.select = test.enhancers)
plot.data.enhancerfinder$auprg
heart hindbrain limb midbrain
0.8209410 0.6310684 0.6700646 0.5935917
We begin by doing a little bit of cleaning up the data to prepare for plotting with ggplot2
plot.data.encode$precision.recall.gain$SOURCE <- "ENCODE"
plot.data.sp$precision.recall.gain$SOURCE <- "StatePaintR"
plot.data.reptile$precision.recall.gain$SOURCE <- "REPTILE"
plot.data.delta$precision.recall.gain$SOURCE <- "DELTA"
plot.data.rfecs$precision.recall.gain$SOURCE <- "RFECS"
plot.data.csiann$precision.recall.gain$SOURCE <- "CSIANN"
plot.data.enhancerfinder$precision.recall.gain$SOURCE <- "EnhancerFinder"
precision.recall.gain <- rbind.data.frame(plot.data.encode$precision.recall.gain,
plot.data.sp$precision.recall.gain,
plot.data.reptile$precision.recall.gain,
plot.data.delta$precision.recall.gain,
plot.data.rfecs$precision.recall.gain,
plot.data.csiann$precision.recall.gain,
plot.data.enhancerfinder$precision.recall.gain)
plot.data.encode$convex.hull$SOURCE <- "ENCODE"
plot.data.sp$convex.hull$SOURCE <- "StatePaintR"
plot.data.reptile$convex.hull$SOURCE <- "REPTILE"
plot.data.delta$convex.hull$SOURCE <- "DELTA"
plot.data.rfecs$convex.hull$SOURCE <- "RFECS"
plot.data.csiann$convex.hull$SOURCE <- "CSIANN"
plot.data.enhancerfinder$convex.hull$SOURCE <- "EnhancerFinder"
convex.hull <- rbind.data.frame(plot.data.encode$convex.hull,
plot.data.sp$convex.hull,
plot.data.reptile$convex.hull,
plot.data.delta$convex.hull,
plot.data.rfecs$convex.hull,
plot.data.csiann$convex.hull,
plot.data.enhancerfinder$convex.hull)
comboplot <- ggplot(precision.recall.gain, aes(y = PRECISION, x = RECALL, group = SOURCE)) +
geom_line(aes(color = SOURCE)) +
geom_point(aes(color = SOURCE)) +
coord_cartesian(xlim=c(0,1), ylim = c(0.4,1)) +
scale_color_brewer(palette = "Dark2") +
geom_line(data = convex.hull, aes(y = PRECISION,
x = RECALL,
group = SOURCE,
color = SOURCE), linetype = 2) +
theme_grey() +
ylab("Precision Gain") + xlab("Recall Gain") + theme(aspect.ratio=1) +
facet_wrap( ~ TISSUE, ncol = 3)
comboplot
From all of this analysis we generate the Area Under the Precision-Recall-Gain Curve (AUPRG) which conveys an expected F1 score on a harmonic scale. Which is Table 3.
auprg <- data.frame(source = c("ENCODE", "StatePaintR",
"REPTILE", "DELTA", "RFECS",
"CSIANN", "EnhancerFinder"),
"neural tube" = NA,
midbrain = NA,
hindbrain = NA,
limb = NA,
heart = NA,
check.names = FALSE)
auprg[auprg$source == "ENCODE", names(plot.data.encode$auprg)] <- plot.data.encode$auprg
auprg[auprg$source == "REPTILE", names(plot.data.reptile$auprg)] <- plot.data.reptile$auprg
auprg[auprg$source == "DELTA", names(plot.data.delta$auprg)] <- plot.data.delta$auprg
auprg[auprg$source == "RFECS", names(plot.data.rfecs$auprg)] <- plot.data.rfecs$auprg
auprg[auprg$source == "CSIANN", names(plot.data.csiann$auprg)] <- plot.data.csiann$auprg
auprg[auprg$source == "StatePaintR", names(plot.data.sp$auprg)] <- plot.data.sp$auprg
auprg[auprg$source == "EnhancerFinder", names(plot.data.enhancerfinder$auprg)] <- plot.data.enhancerfinder$auprg
auprg$ave_auprg <- rowSums(auprg[, -1], na.rm = T)/5
average_rank <- sapply(auprg[, c("neural tube",
"midbrain",
"hindbrain",
"limb",
"heart")], function(x) {rank(1 - x)})
auprg$ave_rank <- sapply(split(average_rank, 1:nrow(average_rank)), mean)
pauprg <- auprg
pauprg[, -1] <- signif(pauprg[, -1], digits = 2)
pauprg
pd.enrichment <- read.delim("data/PD.enrichment.27461410.txt", sep = "\t", header = TRUE)
pd.enrichment
enrich.plot <- ggplot(pd.enrichment, aes(name, difference, group = color)) +
geom_pointrange(aes(ymin = lower, ymax = upper, color = color), fatten = 1, size = 1.2) +
theme_minimal() +
scale_color_identity() +
geom_hline(yintercept = 0, color = "#ececec", alpha = 0.2) +
theme(axis.text.x = element_blank(),
legend.position="none",
panel.grid.major.x = element_blank(), panel.grid.minor = element_blank(),
panel.background = element_blank(),
strip.text.y = element_text(angle = 0, hjust = 0, vjust = 0.5, size = rel(1.5)),
strip.text.x = element_text(angle = 270, hjust = 0.5, vjust = 1, size = rel(1.5))) +
scale_x_discrete(name = "Sample") +
scale_y_continuous(name = "difference", breaks = c(0, 0.5, 1)) +
ggtitle("Enrichment of PD GWAS variants") +
coord_cartesian(ylim = c(-0.2,1)) +
facet_grid(locus ~ type, scales = "free_x", space = "free_x", switch = "x") +
annotate("rect", xmin=-Inf, xmax=Inf, ymin=-Inf, ymax=0,alpha=0.1,fill="black")
enrich.plot
Enrichment calculations were done as in section above using either of two different state models (Model 1 and Model 2) from StateHub, “Default” and “Focused Poised Promoter”, which differ in the treatment of poised promoters. Each plot is made using the same y axis range for comparison and emphasizes that one model is clearly more selective than the other. Both models clearly detect enrichment of hypermethylated probes in the poised state. Model 2 is more selective than model 1 in its definition of poised promoter.
enrichment.out <- read_tsv("data/methylation.enrichment.txt")
enrichment.out <- enrichment.out[enrichment.out$type != "ENCODE2012", ]
enrichment.out
Click the buttons above to change models.
In Model 1, we assign any promoters lacking active marks to the poised state.
model1.plot <- ggplot(enrichment.out[enrichment.out$model == "hyper1", ], aes(name, oddsratio, group = color)) +
geom_pointrange(aes(ymin = odds.lower, ymax = odds.upper, color = color), fatten = 1, size = 1.2) +
theme_minimal() +
scale_color_identity() +
geom_hline(yintercept = 1, color = "#000000", alpha = 0.1) +
theme(axis.text.x = element_blank(),
legend.position="none",
panel.grid.major.x = element_blank(), panel.grid.minor = element_blank(), panel.background = element_blank(),
strip.text.y = element_text(angle = 0, hjust = 0, vjust = 0.5, size = rel(1.5)),
strip.text.x = element_text(angle = 270, hjust = 0.5, vjust = 1, size = rel(1.5))) +
scale_x_discrete(name = "Sample") +
scale_y_continuous(name = "Odds Ratio") +
coord_cartesian(ylim = c(0,12)) +
facet_grid(state ~ type, scales = "free_x", space = "free_x", switch = "x") +
annotate("rect", xmin=-Inf, xmax=Inf, ymin=-Inf, ymax=1,alpha=0.1,fill="black")
model1.plot
Click the buttons above to change models.
In this model, enhancers with H3K4me1 and promoters with H3K4me3 overlapping narrow regions of H3K27me3 are called poised (EPR and PPR), but those without H3K27me3 are called weak (EWR and PWR)
model2.plot <- ggplot(enrichment.out[enrichment.out$model == "hyper2", ], aes(name, oddsratio, group = color)) +
geom_pointrange(aes(ymin = odds.lower, ymax = odds.upper, color = color), fatten = 1, size = 1.2) +
theme_minimal() +
scale_color_identity() +
geom_hline(yintercept = 1, color = "#000000", alpha = 0.1) +
theme(axis.text.x = element_blank(),
legend.position="none",
panel.grid.major.x = element_blank(), panel.grid.minor = element_blank(), panel.background = element_blank(),
strip.text.y = element_text(angle = 0, hjust = 0, vjust = 0.5, size = rel(1.5)),
strip.text.x = element_text(angle = 270, hjust = 0.5, vjust = 1, size = rel(1.5))) +
scale_x_discrete(name = "Sample") +
scale_y_continuous(name = "Odds Ratio") +
coord_cartesian(ylim = c(0,12)) +
facet_grid(state ~ type, scales = "free_x", space = "free_x", switch = "x") +
annotate("rect", xmin=-Inf, xmax=Inf, ymin=-Inf, ymax=1,alpha=0.1,fill="black")
model2.plot
hmec.states <- PaintStates(manifest = "data/manifest.hmec.txt",
decisionMatrix = dm,
progress = FALSE)
ExportStatePaintR(states = hmec.states,
output.dir = "data/HMEC/")
# Active and Inactive Enhancers in HMEC
mart <- useMart(host = "grch37.ensembl.org",
biomart = "ENSEMBL_MART_FUNCGEN",
dataset = "hsapiens_regulatory_feature")
hmec.enhancers <- getBM(filters = c("regulatory_feature_type_name", "epigenome_name"),
values = list(c("Enhancer"), c("HMEC")),
attributes = c("chromosome_name", "chromosome_start",
"chromosome_end", "feature_type_name",
"regulatory_stable_id", "activity"),
mart = mart)
hmec.enhancers <- hmec.enhancers[order(hmec.enhancers$chromosome_name,
hmec.enhancers$chromosome_start), ]
hmec.enhancers$chromosome_name <- paste0("chr", hmec.enhancers$chromosome_name)
head(hmec.enhancers, n = 5)
hmec.enhancer.gff <- data.frame(chr = hmec.enhancers$chromosome_name,
source = "ensembl_hsapiens_regulatory_feature",
feature = paste("Enhancer", hmec.enhancers$activity, sep = "_"),
start = hmec.enhancers$chromosome_start,
end = hmec.enhancers$chromosome_end,
score = 1000,
strand = ".",
frame = ".",
group = hmec.enhancers$activity)
head(hmec.enhancer.gff, n = 5)
write.table(hmec.enhancer.gff[hmec.enhancer.gff$group == "ACTIVE", ],
file = "data/active.hmec.enhancers.gff",
sep = "\t",
quote = FALSE,
row.names = FALSE,
col.names = FALSE)
write.table(hmec.enhancer.gff[hmec.enhancer.gff$group == "INACTIVE", ],
file = "data/inactive.hmec.enhancers.gff",
sep = "\t",
quote = FALSE,
row.names = FALSE,
col.names = FALSE)
# Gencode Gene Models
download.file(url = "ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz",
destfile = "data/gencode.v19.annotation.gtf.gz", quiet = TRUE)
In order to generate the plots to show segmentation enrichment in other genomic regions, we use segtools
grep -v '#' data/HMEC/e117.7mark.segmentation.bed | \
sed '1d' > data/HMEC/e117.7mark.segmentation.noheader.bed
segtools-aggregation \
--noplot \
-o hmec.gene.statepaintr \
--mode gene \
--normalize \
data/HMEC/e117.7mark.segmentation.noheader.bed \
data/gencode.v19.annotation.gtf.gz
segtools-aggregation \
--noplot \
-o data/hmec.active.enhancer.statepaintr \
--group \
--mode region \
--normalize \
data/HMEC/e117.7mark.segmentation.noheader.bed \
data/active.hmec.enhancers.gff
segtools-aggregation \
--noplot \
-o data/hmec.inactive.enhancer.statepaintr \
--group \
--mode region \
--normalize \
data/HMEC/e117.7mark.segmentation.noheader.bed \
data/inactive.hmec.enhancers.gff
First we need to parse the output of segtools to prepare for plotting
segtools.files <- c("Active Enhancer" =
"data/hmec.active.enhancer.statepaintr/feature_aggregation.tab",
"Inactive Enhancer" =
"data/hmec.inactive.enhancer.statepaintr/feature_aggregation.tab",
"Gene Body" =
"data/hmec.gene.statepaintr/feature_aggregation.tab")
plots <- list()
for (file in seq_along(segtools.files)) {
# read the file
data <- read_tsv(segtools.files[file], comment = "#")
# set component order
component.order <- unique(data$component)
# parse header for normalization
# header contains metadata about segments
header <- readLines(segtools.files[file], n = 1)
header <- str_replace(header, "#", "") %>% str_split(pattern = " ", simplify = TRUE)
header <- header[ -1 ]
header.names <- sapply(str_split(header, pattern = "="), "[", 1)
header <- as.integer(sapply(str_split(header, pattern = "="), "[", 2))
# create normalized values
norm.values <- data.frame(Segment = header.names, Total = header, stringsAsFactors = FALSE)
num.features <- which(norm.values$Segment %in% "num_features")
norm.values <- norm.values[ -num.features, ]
norm.values$Random <- norm.values$Total / sum(norm.values$Total)
# Set up data for plotting
data <- gather(data, Segment, Value, 4:ncol(data))
data <- left_join(data, norm.values)
norm.values2 <- group_by(data, offset, component) %>% summarise(Offset.Total = sum(Value))
data <- left_join(data, norm.values2)
data <- mutate(data, Norm.Value = (Value / Offset.Total) + 1)
data <- mutate(data, Enrichment = log2(Norm.Value / (Random + 1)))
data[is.nan(data$Enrichment), "Enrichment"] <- 0
data$component <- factor(data$component, levels = component.order)
data$color <- "#998ec3"
data[ data$Enrichment > 0, "color"] <- "#f1a340"
if (names(segtools.files[file]) == "Gene Body") {
data <- data[data$component %in% c("5' flanking: 500 bp",
"initial exon (399 bp)",
"initial intron (13590 bp)",
"internal exons (154 bp)",
"internal introns (5794 bp)",
"terminal exon (886 bp)",
"terminal intron (6991 bp)",
"3' flanking: 500 bp"), ]
}
# create plots
segplot <- ggplot(data, aes(x = offset, fill = color)) +
geom_area(aes(y = Enrichment)) +
facet_grid(Segment ~ component, scales = "free_x") +
scale_fill_identity() +
geom_hline(yintercept = 0, size = 0.2, color = "grey50") +
scale_y_continuous(breaks=c(-0.5, 0, 0.5)) +
coord_cartesian(ylim = c(-1, 1)) +
theme_minimal() +
ggtitle(names(segtools.files[file])) +
theme(axis.text.x = element_blank(),
legend.position="none",
strip.background = element_rect(fill = "grey80", colour = "grey50", size = 0.2),
panel.grid = element_blank(),
panel.background = element_rect(color = "grey50"),
strip.text.y = element_text(angle = 0, hjust = 0, vjust = 0.5),
axis.title.x = element_blank())
plot <- list(segplot)
names(plot) <- names(segtools.files[file])
plots <- c(plots, plot)
}
plots$`Active Enhancer`
plots$`Inactive Enhancer`
plots$`Gene Body`
Using a benchmarking function, we can compare how StatePaintR performs using a variety of score combination methods, including finding the mean, the median, and the maximum of the scores of the features that make up the segmentations. We can compare these three score combination methods against 100 vista enhancers that were excluded above, in each category, when comparing our predictions, and those of other enhancer prediction algorithms to VISTA
dm <- get.decision.matrix("5813b67f46e0fb06b493ceb0")
dm <- doNotSplit(dm, "Core")
mouse.embryo.states <- StatePaintR:::PaintStatesBenchmark(manifest = manifest,
decisionMatrix = dm,
scoreStates = TRUE,
progress = FALSE)
plot.data <- list()
names(mouse.embryo.states) <- str_replace(names(mouse.embryo.states), " embryo", "")
for (tissue in names(mouse.embryo.states)) {
predicted.scores <- mouse.embryo.states[[tissue]]
vista.enhancers.test <- vista.enhancers
mcols(vista.enhancers.test) <- data.frame(FOUND = mcols(vista.enhancers.test)[, tissue])
vista.enhancers.test <- vista.enhancers.test[c(which(mcols(vista.enhancers.test)[, "FOUND"] != 1), train.enhancers[[tissue]]),]
predicted.scores <- predicted.scores[predicted.scores$state %in% c("EARC", "ARC", "AR", "EAR"), ]
for (scoring in c("median", "mean", "max")) {
predicted.scores <- predicted.scores[order(predicted.scores[, scoring], decreasing = TRUE), ]
olaps <- findOverlaps(vista.enhancers.test, predicted.scores, select = "first")
mcols(vista.enhancers.test)$score <- 0
mcols(vista.enhancers.test)[which(!is.na(olaps)), "score"] <- mcols(predicted.scores[olaps[!is.na(olaps)]])[, scoring]
prg_curve <- StatePaintR:::create_prg_curve(mcols(vista.enhancers.test)$FOUND, mcols(vista.enhancers.test)$score)
auprg = StatePaintR:::calc_auprg(prg_curve)
convex_hull = StatePaintR:::prg_convex_hull(prg_curve)
plot.tissue <- list(list(tissue = tissue, scoringm = scoring, curve = prg_curve, auprg = auprg, hull = convex_hull))
names(plot.tissue) <- tissue
plot.data <- c(plot.data, plot.tissue)
}
}
gg.tissue <- lapply(plot.data, function(y) {
y <- data.frame(TISSUE = y$tissue, METHOD = y$scoringm, PRECISION = y$curve$precision_gain, RECALL = y$curve$recall_gain)
return(y)
})
gg.tissue <- do.call("rbind", gg.tissue)
Using the scores generated above we can see the AUPRG for each scoring method / tissue combination
auprg <- sapply(plot.data, function(x) x[c("scoringm", "auprg")])
data.frame(tissue = colnames(auprg), scoring.method = unlist(auprg["scoringm", ]), auprg = unlist(auprg["auprg", ]))
We can also generate the plot, which shows that while relying on the maximum score does not necessarily give the best AUPRG, it does have good performance towards it’s tail at with high recall gain.
sfigure.2 <- ggplot(gg.tissue, aes(y = PRECISION, x = RECALL, group = METHOD)) +
geom_line(aes(color = METHOD)) +
geom_point(aes(color = METHOD)) +
coord_cartesian(xlim=c(0,1), ylim = c(0,1)) +
scale_color_brewer(palette = "Set2") +
theme_grey() + theme(aspect.ratio=1) +
ggtitle("Supplementary Figure 2 - Combining Scores") +
ylab("Precision Gain") + xlab("Recall Gain") +
facet_wrap( ~ TISSUE, ncol = 3)
sfigure.2
We can download the focused poised promoter model, and then modify the manner in which it scores segments. In the abstraction layer we can declare that a class of feature must not be split into sub-features, by surrounding it in brackets, i.e., [Core]
. Additionally we can specify that the scores from a specific feature should not be used in scoring segments, by following it with a star *
like so: Active*
or [Core]*
. Based on this we can generate four different version of scoring based upon the same segmentation rules. A version called noactive
does not use the score from the active mark, nocore
does not use the score from the core mark, noregulatory
ignores the score from the regulatory mark, and all
uses the score from all marks.
x <- get.decision.matrix("5813b67f46e0fb06b493ceb0")
x <- doNotSplit(x, "Core")
#onescore
noactive.x <- x
noactive.x <- doNotScore(noactive.x, "Active")
nocore.x <- x
nocore.x <- doNotScore(nocore.x, "Core")
noregulatory.x <- x
noregulatory.x <- doNotScore(noregulatory.x, "Regulatory")
#twoscores
all.x <- x
dms <- list(noactive.x, nocore.x, noregulatory.x, all.x)
names(dms) <- c("noactive", "nocore", "noregulatory", "all")
dms
$noactive
ID: 5813b67f46e0fb06b493ceb0
Name: Focused Poised Promoter Model
Author: Ben Berman - Cedars-Sinai Center for Bioinformatics and Functional Genomics
Revision: Oct 28, 2016 11:24:11 PM
Description: This model has focused poised regions that require narrow H3K27me3 regions to be present to be called.
Not Scoring: Active*
Not Splitting: [Core]
$nocore
ID: 5813b67f46e0fb06b493ceb0
Name: Focused Poised Promoter Model
Author: Ben Berman - Cedars-Sinai Center for Bioinformatics and Functional Genomics
Revision: Oct 28, 2016 11:24:11 PM
Description: This model has focused poised regions that require narrow H3K27me3 regions to be present to be called.
Not Scoring: [Core]*
Not Splitting: [Core]*
$noregulatory
ID: 5813b67f46e0fb06b493ceb0
Name: Focused Poised Promoter Model
Author: Ben Berman - Cedars-Sinai Center for Bioinformatics and Functional Genomics
Revision: Oct 28, 2016 11:24:11 PM
Description: This model has focused poised regions that require narrow H3K27me3 regions to be present to be called.
Not Scoring: Regulatory*
Not Splitting: [Core]
$all
ID: 5813b67f46e0fb06b493ceb0
Name: Focused Poised Promoter Model
Author: Ben Berman - Cedars-Sinai Center for Bioinformatics and Functional Genomics
Revision: Oct 28, 2016 11:24:11 PM
Description: This model has focused poised regions that require narrow H3K27me3 regions to be present to be called.
Not Splitting: [Core]
We can compare these four scoring models against the 100 vista enhancers that were excluded above, in each category, when comparing our predictions, and those of other enhancer prediction algorithms to VISTA
states <- dms
states <- lapply(states, function(x) {NA})
for (dm in seq_along(dms)) {
plot.data <- list()
mouse.embryo.states <- PaintStates(manifest = manifest,
decisionMatrix = dms[[dm]],
scoreStates = TRUE,
progress = FALSE)
names(mouse.embryo.states) <- str_replace(names(mouse.embryo.states), " embryo", "")
for (tissue in names(mouse.embryo.states)) {
predicted.scores <- mouse.embryo.states[[tissue]]
vista.enhancers.test <- vista.enhancers
mcols(vista.enhancers.test) <- data.frame(FOUND=mcols(vista.enhancers.test)[, tissue])
vista.enhancers.test <- vista.enhancers.test[c(which(mcols(vista.enhancers.test)[, "FOUND"] != 1), train.enhancers[[tissue]]),]
predicted.scores <- predicted.scores[predicted.scores$state %in% c("EARC", "ARC", "AR", "EAR"), ]
predicted.scores <- predicted.scores[order(predicted.scores$score, decreasing = TRUE), ]
olaps <- findOverlaps(vista.enhancers.test, predicted.scores, select = "first")
mcols(vista.enhancers.test)$score <- 0
mcols(vista.enhancers.test)[which(!is.na(olaps)), "score"] <- mcols(predicted.scores[olaps[!is.na(olaps)]])$score
prg_curve <- StatePaintR:::create_prg_curve(mcols(vista.enhancers.test)$FOUND, mcols(vista.enhancers.test)$score)
auprg = StatePaintR:::calc_auprg(prg_curve)
convex_hull = StatePaintR:::prg_convex_hull(prg_curve)
plot.tissue <- list(list(tissue = tissue, curve=prg_curve, auprg = auprg, hull = convex_hull))
names(plot.tissue) <- tissue
plot.data <- c(plot.data, plot.tissue)
}
sapply(plot.data, function(x) x$auprg)
gg.tissue <- lapply(plot.data, function(x) {
x <- data.frame(TISSUE = x$tissue, PRECISION = x$curve$precision_gain, RECALL = x$curve$recall_gain)
return(x)
})
gg.tissue.sp <- do.call("rbind", gg.tissue)
gg.tissue.sp$DM <- names(dms)[dm]
states[[dm]] <- gg.tissue.sp
}
gg.tissue <- do.call("rbind", states)
We can then plot these four scoring models across the tissue specific enhancers that we interrogated from VISTA to see which combination of scores serves to best predict enhancers.
sfigure3 <- ggplot(gg.tissue, aes(y = PRECISION, x = RECALL, group = DM)) +
geom_line(aes(color = DM)) +
geom_point(aes(color = DM)) +
coord_cartesian(xlim=c(0,1), ylim = c(0,1)) +
scale_color_brewer(palette = "Set1") +
theme_grey() + theme(aspect.ratio=1) +
ggtitle("Supplementary Figure 3 - Scoring Features") +
ylab("Precision Gain") + xlab("Recall Gain") +
facet_wrap( ~ TISSUE, ncol = 3)
sfigure3
Based upon this plot we decided to go forward without including the Regulatory mark in our scoring.
We considered which combinations of segments would provide the best description of an active enhancer as found in VISTA. We generated five groups of marks. Our active Active Enhancer group incorporated EAR
, AR
, EARC
, and ARC
, our Active Promoter group consists of PAR
and PARC
, the Silenced
group is SCR
and HET
, Poised and Weak Enhancer consists of EPR
, EPRC
, EWR
, EWRC
, and our Poised and Weak Promoter group is PPR
, PPRC
, PWR
, PWRC
, PPWR
, and PPWRC
. We evaluated each of these combinations against the 100 vista enhancers that were excluded above, in each category, when comparing our predictions, and those of other enhancer prediction algorithms to VISTA
dm <- get.decision.matrix("5813b67f46e0fb06b493ceb0")
dm <- doNotSplit(dm, "Core")
dm <- doNotScore(dm, "Regulatory")
dm <- doNotScore(dm, "Promoter")
segments <- list(A.Enhancer = c("EAR", "AR", "EARC", "ARC"),
A.Promoter = c("PAR", "PARC"),
Silenced = c("SCR", "HET"),
PW.Enhancer = c("EPR", "EPRC", "EWR", "EWRC"),
PW.Promoter = c("PPR", "PPRC", "PWR", "PWRC", "PPWR", "PPWRC"))
mouse.embryo.states <- PaintStates(manifest = manifest,
decisionMatrix = dm,
scoreStates = TRUE,
progress = FALSE)
names(mouse.embryo.states) <- str_replace(names(mouse.embryo.states), " embryo", "")
for (segment in seq_along(segments)) {
plot.data <- list()
for (tissue in names(mouse.embryo.states)) {
predicted.scores <- mouse.embryo.states[[tissue]]
vista.enhancers.test <- vista.enhancers
mcols(vista.enhancers.test) <- data.frame(FOUND=mcols(vista.enhancers.test)[, tissue])
vista.enhancers.test <- vista.enhancers.test[c(which(mcols(vista.enhancers.test)[, "FOUND"] != 1), train.enhancers[[tissue]]),]
predicted.scores <- predicted.scores[predicted.scores$state %in% segments[[segment]], ]
predicted.scores <- predicted.scores[order(predicted.scores$score, decreasing = TRUE), ]
olaps <- findOverlaps(vista.enhancers.test, predicted.scores, select = "first")
mcols(vista.enhancers.test)$score <- 0
mcols(vista.enhancers.test)[which(!is.na(olaps)), "score"] <- mcols(predicted.scores[olaps[!is.na(olaps)]])$score
prg_curve <- StatePaintR:::create_prg_curve(mcols(vista.enhancers.test)$FOUND, mcols(vista.enhancers.test)$score)
auprg = StatePaintR:::calc_auprg(prg_curve)
convex_hull = StatePaintR:::prg_convex_hull(prg_curve)
plot.tissue <- list(list(tissue = tissue, curve=prg_curve, auprg = auprg, hull = convex_hull))
names(plot.tissue) <- tissue
plot.data <- c(plot.data, plot.tissue)
}
sapply(plot.data, function(x) x$auprg)
gg.tissue <- lapply(plot.data, function(x) {
x <- data.frame(TISSUE = x$tissue, PRECISION = x$curve$precision_gain, RECALL = x$curve$recall_gain)
return(x)
})
gg.tissue <- do.call("rbind", gg.tissue)
gg.tissue$SEGMENTS <- names(segments)[segment]
segments[[segment]] <- gg.tissue
}
gg.tissue <- do.call("rbind", segments)
By Plotting the Precision Recall Gain curves, we see that, as expected our Active Enhancer group has the best accuracy across tissue types.
sfigure.4 <- ggplot(gg.tissue, aes(y = PRECISION, x = RECALL, group = SEGMENTS)) +
geom_line(aes(color = SEGMENTS)) +
geom_point(aes(color = SEGMENTS)) +
coord_cartesian(xlim=c(0,1), ylim = c(0,1)) +
scale_color_brewer(palette = "Paired") +
theme_grey() + theme(aspect.ratio = 1) +
ggtitle("Supplementary Figure 4 - Scoring Segments") +
ylab("Precision Gain") + xlab("Recall Gain") +
facet_wrap( ~ TISSUE, ncol = 3)
sfigure.4
From the abstract of the paper: “Precision-Recall analysis abounds in applications of binary classification where true negatives do not add value and hence should not affect assessment of the classifier’s performance. Perhaps inspired by the many advantages of receiver operating characteristic (ROC) curves and the area under such curves for accuracy-based performance assessment, many researchers have taken to report Precision-Recall (PR) curves and associated areas as performance metric. We demonstrate in this paper that this practice is fraught with difficulties, mainly because of incoherent scale assumptions – e.g., the area under a PR curve takes the arithmetic mean of precision values whereas the Fβ score applies the harmonic mean. We show how to fix this by plotting PR curves in a different coordinate system, and demonstrate that the new Precision-Recall-Gain curves inherit all key advantages of ROC curves. In particular, the area under Precision-Recall-Gain curves conveys an expected F1 score on a harmonic scale, and the convex hull of a Precision-Recall-Gain curve allows us to calibrate the classifier’s scores so as to determine, for each operating point on the convex hull, the interval of β values for which the point optimizes Fβ. We demonstrate experimentally that the area under traditional PR curves can easily favour models with lower expected F1 score than others, and so the use of Precision-Recall-Gain curves will result in better model selection.”↩