I have a big peak list in BED format, and I converted it to a GRanges object to use as input for the rGADEM package to find de novo motifs. But whenever I call the GADEM function I get the error below.
Could anybody who knows this package help me with this error?
This is a small example of my real file, with only 20 rows.
1 chr6 29723590 29723790
2 chr14 103334312 103334512
3 chr1 150579030 150579230
4 chr7 76358527 76358727
5 chr6 11537891 11538091
6 chr14 49893256 49893456
7 chr5 179623200 179623400
8 chr1 228082831 228083031
9 chr12 93441644 93441844
10 chr10 3784776 3784976
11 chr3 183635833 183636033
12 chr7 975301 975501
13 chr12 123364510 123364710
14 chr1 1615578 1615778
15 chr1 36156320 36156520
16 chr14 55051781 55051981
17 chr8 11867697 11867897
18 chr22 38706135 38706335
19 chr6 44265256 44265456
20 chr1 185316658 185316858
The code that I use is:
library(GenomicRanges)
library(rGADEM)
data = makeGRangesFromDataFrame(data, keep.extra.columns = TRUE)
data = reduce(data)
data = resize(data, width = 50, fix='center')
gadem<-GADEM(data,verbose=1,genome=Hsapiens)
plot(gadem)
and the error is:
Retrieving sequences...
Error in .Call2("C_solve_user_SEW", refwidths, start, end, width, translate.negative.coord:
solving row 136: 'allow.nonnarrowing' is FALSE and the supplied start (55134751) is > refwidth + 1
I should mention that when I try an example input file with fewer than 136 rows, it works and I get motifs.
Thanks in advance.
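The message says that the start of row 136 lies beyond the end of its reference sequence. That would be consistent with a peak that falls outside its chromosome in the genome passed to GADEM (for example, if the peaks were called on a different assembly), but the post does not confirm this. A minimal base R sketch of the check, using hypothetical chromosome lengths and peaks (with a GRanges object one would presumably compare end(data) against the seqlengths() of the genome object instead):

```r
# Hypothetical chromosome lengths; in practice take them from the genome
# object that is actually passed to GADEM.
chrom_len <- c(chr1 = 1000L, chr2 = 500L)

# Hypothetical peaks: the second runs past the end of chr2,
# the third starts beyond it (the situation the error describes).
peaks <- data.frame(
  chrom = c("chr1", "chr2", "chr2"),
  start = c(100L, 450L, 700L),
  end   = c(300L, 520L, 900L),
  stringsAsFactors = FALSE
)

# Length of the chromosome each peak sits on (NA if the name is unknown)
len <- unname(chrom_len[peaks$chrom])

# A peak is usable only if its chromosome is known and it lies inside it
in_bounds <- !is.na(len) & peaks$start >= 1L & peaks$end <= len

bad  <- peaks[!in_bounds, ]  # rows to inspect
good <- peaks[in_bounds, ]   # rows that are safe to fetch sequence for
```

Looking at the rows in bad (here rows 2 and 3) should show whether row 136 of the real data is such a case.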
Hi, I am working with GRanges and finding overlaps with the findOverlaps function. I get the hits showing which query and subject ranges overlap, but I also want the coordinates of the query and the subject where they overlap, so that I can retrieve the corresponding sequence.
How can I get the coordinates of both subject and query where they overlap? I am using the following code:
library(GenomicRanges)
library(regioneR) # toGRanges
# Example data (defined before it is used below)
df1 <- structure(list(df1c = c("chr2", "chr2", "chr2", "chr2"), df1c2 = c(2800,
3600, 3719, 3893), df1c3 = c(3270, 4152, 5092, 4547)), class = "data.frame", row.names = c(NA,
-4L))
df2 <- structure(list(df2c = c("chr2", "chr2", "chr2", "chr2", "chr2L"
), df2c2 = c(263, 342, 424, 846, 1030), df2c3 = c(20091, 17222,
2612, 4265, 11575)), class = "data.frame", row.names = c(NA,
-5L))
# Query ranges that lie within a subject range
fo <- findOverlaps(query = toGRanges(df1),subject = toGRanges(df2),type = "within")
The expected output should be like
chr CoDF1 CoDF2
1 100-200 90-210
1 150-280 100-285
CoDF1 = coordinates in df1 where it overlaps df2 reads
CoDF2 = coordinates in df2 where it overlaps df1 reads
You could use intersect():
> intersect(toGRanges(df1),toGRanges(df2))
GRanges object with 2 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr2 2800-3270 *
[2] chr2 3600-5092 *
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
But note that the column names of your data.frames are not the ones expected when creating a GRanges object; they should be seqnames/start/end.
EDITED :
To see all intersections of all coordinates:
intersection = findOverlaps(query = toGRanges(df1), subject = toGRanges(df2), type = "any")
df = data.frame(df1[queryHits(intersection),], df2[subjectHits(intersection),])
df
seqnames start end seqnames.1 start.1 end.1
1 chr2 2800 3270 chr2 263 20091
1.1 chr2 2800 3270 chr2 342 17222
1.2 chr2 2800 3270 chr2 846 4265
2 chr2 3600 4152 chr2 263 20091
2.1 chr2 3600 4152 chr2 342 17222
2.2 chr2 3600 4152 chr2 846 4265
3 chr2 3719 5092 chr2 263 20091
3.1 chr2 3719 5092 chr2 342 17222
3.2 chr2 3719 5092 chr2 846 4265
4 chr2 3893 4547 chr2 263 20091
4.1 chr2 3893 4547 chr2 342 17222
4.2 chr2 3893 4547 chr2 846 4265
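The table above lists each overlapping pair but not the overlapping stretch itself. For two closed intervals that overlap, the shared part runs from the larger of the two starts to the smaller of the two ends. A minimal base R sketch of that rule, using the question's coordinates with renamed columns (chr, start, end are my naming, not the original's); with GRanges, pintersect() on the paired query and subject ranges should give the same coordinates:

```r
df1 <- data.frame(chr = "chr2",
                  start = c(2800, 3600, 3719, 3893),
                  end   = c(3270, 4152, 5092, 4547),
                  stringsAsFactors = FALSE)
df2 <- data.frame(chr = c("chr2", "chr2", "chr2", "chr2", "chr2L"),
                  start = c(263, 342, 424, 846, 1030),
                  end   = c(20091, 17222, 2612, 4265, 11575),
                  stringsAsFactors = FALSE)

# All pairs of rows on the same chromosome (.q = query/df1, .s = subject/df2)
pairs <- merge(df1, df2, by = "chr", suffixes = c(".q", ".s"))

# Keep only pairs whose closed intervals overlap
hit <- pairs$start.q <= pairs$end.s & pairs$start.s <= pairs$end.q
pairs <- pairs[hit, ]

# Overlapping stretch: larger start to smaller end
pairs$ov_start <- pmax(pairs$start.q, pairs$start.s)
pairs$ov_end   <- pmin(pairs$end.q, pairs$end.s)
pairs
```

This all-against-all merge is only meant to show the arithmetic; for large inputs findOverlaps() is the better way to find the pairs.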
I have a GRanges object (coordinates of all exons of a gene); coding_pos gives the codon position of the first nucleotide of each exon (1 means that the first nucleotide in the exon is also the first nucleotide of a codon, and so on).
grTargetGene itself looks like this
> grTargetGene
GRanges object with 11 ranges and 7 metadata columns:
seqnames ranges strand | ensembl_ids gene_biotype prev_exons_length coding_pos
<Rle> <IRanges> <Rle> | <character> <character> <numeric> <numeric>
[1] chr2 [148602722, 148602776] + | ENSG00000121989 protein_coding 0 1
[2] chr2 [148653870, 148654077] + | ENSG00000121989 protein_coding 55 2
[3] chr2 [148657027, 148657136] + | ENSG00000121989 protein_coding 263 3
[4] chr2 [148657313, 148657467] + | ENSG00000121989 protein_coding 373 2
[5] chr2 [148672760, 148672903] + | ENSG00000121989 protein_coding 528 1
[6] chr2 [148674852, 148674995] + | ENSG00000121989 protein_coding 672 1
[7] chr2 [148676016, 148676161] + | ENSG00000121989 protein_coding 816 1
[8] chr2 [148677799, 148677913] + | ENSG00000121989 protein_coding 962 3
[9] chr2 [148680542, 148680680] + | ENSG00000121989 protein_coding 1077 1
[10] chr2 [148683600, 148683730] + | ENSG00000121989 protein_coding 1216 2
[11] chr2 [148684649, 148684843] + | ENSG00000121989 protein_coding 1347 1
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
I am interested in looking at coordinates separately for [1,2] positions in each codon and [3]. In other words, I would like to have 2 different GRanges objects that look approximately like this (here it is only the beginning)
> grTargetGene_Nonsynonym
GRanges object with X ranges and 7 metadata columns:
seqnames ranges strand | ensembl_ids gene_biotype
<Rle> <IRanges> <Rle> | <character> <character>
[1] chr2 [148602722, 148602723] + | ENSG00000121989 protein_coding
[2] chr2 [148602725, 148602726] + | ENSG00000121989 protein_coding
[3] chr2 [148602728, 148602729] + | ENSG00000121989 protein_coding
[4] chr2 [148602731, 148602732] + | ENSG00000121989 protein_coding
> grTargetGene_Synonym
GRanges object with X ranges and 7 metadata columns:
seqnames ranges strand | ensembl_ids gene_biotype
<Rle> <IRanges> <Rle> | <character> <character>
[1] chr2 [148602724, 148602724] + | ENSG00000121989 protein_coding
[2] chr2 [148602727, 148602727] + | ENSG00000121989 protein_coding
[3] chr2 [148602730, 148602730] + | ENSG00000121989 protein_coding
[4] chr2 [148602733, 148602733] + | ENSG00000121989 protein_coding
I was planning to do it through the loop that creates a set of granges for each exon according to coding_pos and strand, but I suspect there is a smarter way or maybe even a function that can do it already, but I couldn't find a simple solution.
Important: I do not need the sequence itself (the easiest way, in that case, would be to extract DNA first and then work with the sequence), but instead of doing this I only need the positions which I will use to overlap with some features.
> library("GenomicRanges")
> dput(grTargetGene)
new("GRanges"
, seqnames = new("Rle"
, values = structure(1L, .Label = "chr2", class = "factor")
, lengths = 6L
, elementMetadata = NULL
, metadata = list()
)
, ranges = new("IRanges"
, start = c(148602722L, 148653870L, 148657027L, 148657313L, 148672760L,
148674852L)
, width = c(55L, 208L, 110L, 155L, 144L, 144L)
, NAMES = NULL
, elementType = "integer"
, elementMetadata = NULL
, metadata = list()
)
, strand = new("Rle"
, values = structure(1L, .Label = c("+", "-", "*"), class = "factor")
, lengths = 6L
, elementMetadata = NULL
, metadata = list()
)
, elementMetadata = new("DataFrame"
, rownames = NULL
, nrows = 6L
, listData = structure(list(ensembl_ids =
c("ENSG00000121989","ENSG00000121989",
"ENSG00000121989", "ENSG00000121989", "ENSG00000121989", "ENSG00000121989"
), gene_biotype = c("protein_coding", "protein_coding", "protein_coding",
"protein_coding", "protein_coding", "protein_coding"), cds_length =
c(1542,1542, 1542, 1542, 1542, 1542), gene_start_position = c(148602086L,
148602086L, 148602086L, 148602086L, 148602086L, 148602086L),
gene_end_position = c(148688393L, 148688393L, 148688393L,
148688393L, 148688393L, 148688393L), prev_exons_length = c(0,
55, 263, 373, 528, 672), coding_pos = c(1, 2, 3, 2, 1, 1)), .Names =
c("ensembl_ids", "gene_biotype", "cds_length", "gene_start_position",
"gene_end_position",
"prev_exons_length", "coding_pos"))
, elementType = "ANY"
, elementMetadata = NULL
, metadata = list()
)
, seqinfo = new("Seqinfo"
, seqnames = "chr2"
, seqlengths = NA_integer_
, is_circular = NA
, genome = NA_character_
)
, metadata = list()
)
How about the following:
grl <- lapply(list(Nonsym = c(1, 2), Sym = c(3, 3)), function(x) {
ranges(grTargetGene) <- IRanges(
start = start(grTargetGene) + x[1] - 1,
end = start(grTargetGene) + x[2] - 1)
return(grTargetGene) })
grl
#$Nonsym
#GRanges object with 6 ranges and 7 metadata columns:
# seqnames ranges strand | ensembl_ids gene_biotype
# <Rle> <IRanges> <Rle> | <character> <character>
# [1] chr2 148602722-148602723 + | ENSG00000121989 protein_coding
# [2] chr2 148653870-148653871 + | ENSG00000121989 protein_coding
# [3] chr2 148657027-148657028 + | ENSG00000121989 protein_coding
# [4] chr2 148657313-148657314 + | ENSG00000121989 protein_coding
# [5] chr2 148672760-148672761 + | ENSG00000121989 protein_coding
# [6] chr2 148674852-148674853 + | ENSG00000121989 protein_coding
# cds_length gene_start_position gene_end_position prev_exons_length
# <numeric> <integer> <integer> <numeric>
# [1] 1542 148602086 148688393 0
# [2] 1542 148602086 148688393 55
# [3] 1542 148602086 148688393 263
# [4] 1542 148602086 148688393 373
# [5] 1542 148602086 148688393 528
# [6] 1542 148602086 148688393 672
# coding_pos
# <numeric>
# [1] 1
# [2] 2
# [3] 3
# [4] 2
# [5] 1
# [6] 1
# -------
# seqinfo: 1 sequence from an unspecified genome; no seqlengths
#
#$Sym
#GRanges object with 6 ranges and 7 metadata columns:
# seqnames ranges strand | ensembl_ids gene_biotype cds_length
# <Rle> <IRanges> <Rle> | <character> <character> <numeric>
# [1] chr2 148602724 + | ENSG00000121989 protein_coding 1542
# [2] chr2 148653872 + | ENSG00000121989 protein_coding 1542
# [3] chr2 148657029 + | ENSG00000121989 protein_coding 1542
# [4] chr2 148657315 + | ENSG00000121989 protein_coding 1542
# [5] chr2 148672762 + | ENSG00000121989 protein_coding 1542
# [6] chr2 148674854 + | ENSG00000121989 protein_coding 1542
# gene_start_position gene_end_position prev_exons_length coding_pos
# <integer> <integer> <numeric> <numeric>
# [1] 148602086 148688393 0 1
# [2] 148602086 148688393 55 2
# [3] 148602086 148688393 263 3
# [4] 148602086 148688393 373 2
# [5] 148602086 148688393 528 1
# [6] 148602086 148688393 672 1
# -------
# seqinfo: 1 sequence from an unspecified genome; no seqlengths
grl contains a list of two GRanges, one with ranges based on positions 1 and 2, and the other with ranges based on position 3.
I created a function that takes the strand into account and can process exons whose length is not divisible by 3 (and may even be shorter than 3):
CodonPosition_separation = function(grTargetGene) {
grTargetGene = sort(grTargetGene)
grTargetGene$prev_exons_length = c(0,width(grTargetGene)[seq_len(length(grTargetGene)-1)])
if (length(grTargetGene) >1) {
for (l in 2:length(grTargetGene)) {
grTargetGene$prev_exons_length[l] = grTargetGene$prev_exons_length[l]+grTargetGene$prev_exons_length[l-1]
}
}
grTargetGene$coding_pos = grTargetGene$prev_exons_length%%3+1
grTargetGene_N = GRanges()
grTargetGene_S = GRanges()
for (l in 1:length(grTargetGene)) {
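# rm(list = obj) removes the object whose name is stored in obj; rm(obj) would only remove the loop variable
for (obj in c("start_nonsyn","start_syn", "end_nonsyn", "end_syn","gr_nonsyn","gr_syn")) {if(exists(obj, inherits = FALSE)) {rm(list = obj)}}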
if (as.character(strand(grTargetGene)[1]) =="+"){
start_ns = start(grTargetGene[l])+1-grTargetGene$coding_pos[l]
end_ns = end(grTargetGene[l])
if (start_ns <=end_ns) {
start_nonsyn = seq(from = start(grTargetGene[l])+1-grTargetGene$coding_pos[l],to = end(grTargetGene[l]), by=3)
end_nonsyn = seq(from = start(grTargetGene[l])+2-grTargetGene$coding_pos[l],to = end(grTargetGene[l]), by=3)
}
start_s =start(grTargetGene[l])+3-grTargetGene$coding_pos[l]
end_s = end(grTargetGene[l])
if (start_s <=end_s) {
start_syn = seq(from = start(grTargetGene[l])+3-grTargetGene$coding_pos[l],to = end(grTargetGene[l]), by=3)
end_syn = start_syn
}
} else {
start_ns = end(grTargetGene[l])-1+grTargetGene$coding_pos[l]
end_ns = start(grTargetGene[l])
if (start_ns >=end_ns) {
start_nonsyn = seq(from = end(grTargetGene[l])-1+grTargetGene$coding_pos[l],to = start(grTargetGene[l]), by=-3)
end_nonsyn = seq(from = end(grTargetGene[l])-2+grTargetGene$coding_pos[l],to = start(grTargetGene[l]), by=-3)
}
start_s =end(grTargetGene[l])-3+grTargetGene$coding_pos[l]
end_s = start(grTargetGene[l])
if (start_s >=end_s) {
start_syn = seq(from = end(grTargetGene[l])-3+grTargetGene$coding_pos[l],to = start(grTargetGene[l]), by=-3)
end_syn = start_syn
}
}
if (exists("start_nonsyn")) {
length_nonsyn = length(start_nonsyn)+ length(end_nonsyn)
gr_nonsyn = GRanges(
seqnames = rep(seqnames(grTargetGene[l]), length_nonsyn),
strand = rep(strand(grTargetGene[l]), length_nonsyn),
ranges = IRanges(start = c(start_nonsyn, end_nonsyn), end = c(start_nonsyn, end_nonsyn))
)
gr_nonsyn = intersect(gr_nonsyn,grTargetGene[l])
grTargetGene_N = append(grTargetGene_N, gr_nonsyn)
}
if (exists("start_syn")) {
length_syn = length(start_syn)
gr_syn = GRanges(
seqnames = rep(seqnames(grTargetGene[l]), length_syn),
strand = rep(strand(grTargetGene[l]), length_syn),
ranges = IRanges(start = start_syn, end = end_syn)
)
gr_syn = intersect(gr_syn,grTargetGene[l])
grTargetGene_S = append(grTargetGene_S, gr_syn)
}
}
return(list("grTargetGene_S"=grTargetGene_S,"grTargetGene_N"=grTargetGene_N))
}
It works nicely:
> CodonPosition_separation(grTargetGene)
$grTargetGene_S
GRanges object with 514 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr2 [148602724, 148602724] +
[2] chr2 [148602727, 148602727] +
[3] chr2 [148602730, 148602730] +
[4] chr2 [148602733, 148602733] +
[5] chr2 [148602736, 148602736] +
... ... ... ...
[510] chr2 [148684831, 148684831] +
[511] chr2 [148684834, 148684834] +
[512] chr2 [148684837, 148684837] +
[513] chr2 [148684840, 148684840] +
[514] chr2 [148684843, 148684843] +
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
$grTargetGene_N
GRanges object with 517 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr2 [148602722, 148602723] +
[2] chr2 [148602725, 148602726] +
[3] chr2 [148602728, 148602729] +
[4] chr2 [148602731, 148602732] +
[5] chr2 [148602734, 148602735] +
... ... ... ...
[513] chr2 [148684829, 148684830] +
[514] chr2 [148684832, 148684833] +
[515] chr2 [148684835, 148684836] +
[516] chr2 [148684838, 148684839] +
[517] chr2 [148684841, 148684842] +
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
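The core idea behind the function can also be written without a loop over exons: list every exonic base in transcript order and take its offset into the coding sequence modulo 3. A minimal base R sketch under simplifying assumptions that are mine, not the post's: plus strand only, exons already sorted, the coding sequence starting at the first base of the first exon, and two made-up exons of widths 4 and 5:

```r
# Hypothetical exons on the + strand, in transcript order
exon_start <- c(101L, 201L)
exon_end   <- c(104L, 205L)

# Genomic position of every exonic base, in transcript order
pos <- unlist(Map(seq, exon_start, exon_end))

# Codon position (1, 2 or 3) from the 0-based offset into the coding sequence
codon_pos <- (seq_along(pos) - 1L) %% 3L + 1L

third        <- pos[codon_pos == 3L]  # third codon positions
first_second <- pos[codon_pos != 3L]  # first and second codon positions
```

Here third is 103, 202, 205: the codon that spans the exon boundary (104, 201, 202) is handled without any special case. For the minus strand, the positions would have to be listed in decreasing order before taking the offset.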
I have a data.table (A) that is over 100,000 rows long. There are 3 columns.
chrom start end
1: chr1 6484847 6484896
2: chr1 6484896 6484945
3: chr1 6484945 6484994
4: chr1 6484994 6485043
5: chr1 6485043 6485092
---
183569: chrX 106893605 106893654
183570: chrX 106893654 106893703
183571: chrX 106893703 106893752
183572: chrX 106893752 106893801
183573: chrX 106893801 106894256
I'd like to generate a new column named "gene" that provides a label for each row based on annotations from another data.table (B), which has ~90 rows. It is shown below:
chrom start end gene
1: chr1 6484847 6521004 ESPN
2: chr1 41249683 41306124 KCNQ4
3: chr1 55464616 55474465 BSND
42: chrX 82763268 82764775 POU3F4
43: chrX 100600643 100603957 TIMM8A
44: chrX 106871653 106894256 PRPS1
If the row start value in data.table A is within the row start and end values of data.table B I need the row in A to be labeled with the correct gene accordingly.
For example the resulting complete data.table A would be
chrom start end gene
1: chr1 6484847 6484896 ESPN
2: chr1 6484896 6484945 ESPN
3: chr1 6484945 6484994 ESPN
4: chr1 6484994 6485043 ESPN
5: chr1 6485043 6485092 ESPN
---
183569: chrX 106893605 106893654 PRPS1
183570: chrX 106893654 106893703 PRPS1
183571: chrX 106893703 106893752 PRPS1
183572: chrX 106893752 106893801 PRPS1
183573: chrX 106893801 106894256 PRPS1
I've attempted some nested loops to do this but that seems like it would take WAY too long. I think there must be a way to do this with the data.table package but I can't seem to figure it out.
Any and all suggestions would be greatly appreciated.
While it's certainly possible to do this in base R (or potentially using data.table), I would highly recommend using GenomicRanges; it's a very powerful and flexible R/Bioconductor library that has been designed for these kinds of tasks.
Here is an example using GenomicRanges::findOverlaps:
library(GenomicRanges)
# Sample data
df1 <- read.table(text =
"chrom start end
chr1 6484847 6484896
chr1 6484896 6484945
chr1 6484945 6484994
chr1 6484994 6485043
chr1 6485043 6485092", sep = "", header = TRUE, stringsAsFactors = FALSE);
df2 <- read.table(text =
"chrom start end gene
chr1 6484847 6521004 ESPN
chr1 41249683 41306124 KCNQ4
chr1 55464616 55474465 BSND
chrX 82763268 82764775 POU3F4
chrX 100600643 100603957 TIMM8A
chrX 106871653 106894256 PRPS1", sep = "", header = TRUE, stringsAsFactors = FALSE);
# Convert to GRanges objects
gr1 <- with(df1, GRanges(chrom, IRanges(start = start, end = end)));
gr2 <- with(df2, GRanges(chrom, IRanges(start = start, end = end), gene = gene));
# Find features from gr1 that overlap with gr2
m <- findOverlaps(gr1, gr2);
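# Add gene annotation as metadata to gr1
# (initialise the column first so that ranges without a hit stay NA)
mcols(gr1)$gene <- rep(NA_character_, length(gr1));
mcols(gr1)$gene[queryHits(m)] <- mcols(gr2)$gene[subjectHits(m)];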
gr1;
#GRanges object with 5 ranges and 1 metadata column:
# seqnames ranges strand | gene
# <Rle> <IRanges> <Rle> | <character>
# [1] chr1 [6484847, 6484896] * | ESPN
# [2] chr1 [6484896, 6484945] * | ESPN
# [3] chr1 [6484945, 6484994] * | ESPN
# [4] chr1 [6484994, 6485043] * | ESPN
# [5] chr1 [6485043, 6485092] * | ESPN
# -------
# seqinfo: 1 sequence from an unspecified genome; no seqlengths
Besides the GRanges/IRanges solution by Maurits Evers, there is an alternative data.table approach using non-equi join and update on join.
A[B, on = .(chrom, start >= start, start <= end), gene := i.gene][]
chrom start end gene
1: chr1 6484847 6484896 ESPN
2: chr1 6484896 6484945 ESPN
3: chr1 6484945 6484994 ESPN
4: chr1 6484994 6485043 ESPN
5: chr1 6485043 6485092 ESPN
6: chrX 106893605 106893654 PRPS1
7: chrX 106893654 106893703 PRPS1
8: chrX 106893703 106893752 PRPS1
9: chrX 106893752 106893801 PRPS1
10: chrX 106893801 106894256 PRPS1
According to the OP, A and B are already data.table objects. So, this approach avoids the coercion to GRanges objects.
Reproducible Data
library(data.table)
A <- fread("rn chrom start end
1: chr1 6484847 6484896
2: chr1 6484896 6484945
3: chr1 6484945 6484994
4: chr1 6484994 6485043
5: chr1 6485043 6485092
183569: chrX 106893605 106893654
183570: chrX 106893654 106893703
183571: chrX 106893703 106893752
183572: chrX 106893752 106893801
183573: chrX 106893801 106894256", drop = 1L)
B <- fread("rn chrom start end gene
1: chr1 6484847 6521004 ESPN
2: chr1 41249683 41306124 KCNQ4
3: chr1 55464616 55474465 BSND
42: chrX 82763268 82764775 POU3F4
43: chrX 100600643 100603957 TIMM8A
44: chrX 106871653 106894256 PRPS1", drop = 1L)
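For readers who want to check the labelling rule itself (a row of A gets the gene of the row of B on the same chromosome whose start and end enclose A's start), here is a minimal base R sketch. It is only a cross-check on a few rows, not a replacement for the join above; the helper label_one and the extra chr2 row without a match are illustrative additions of mine:

```r
A <- data.frame(
  chrom = c("chr1", "chr1", "chrX", "chrX", "chr2"),
  start = c(6484847, 6484896, 106893605, 106893801, 100),
  end   = c(6484896, 6484945, 106893654, 106894256, 200),
  stringsAsFactors = FALSE
)
B <- data.frame(
  chrom = c("chr1", "chr1", "chrX", "chrX"),
  start = c(6484847, 41249683, 100600643, 106871653),
  end   = c(6521004, 41306124, 100603957, 106894256),
  gene  = c("ESPN", "KCNQ4", "TIMM8A", "PRPS1"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: gene whose interval contains `start` on `chrom`
# (first match if several), or NA when there is none
label_one <- function(chrom, start) {
  hit <- which(B$chrom == chrom & B$start <= start & start <= B$end)
  if (length(hit) == 0L) NA_character_ else B$gene[hit[1L]]
}

A$gene <- unname(mapply(label_one, A$chrom, A$start))
A
```

The chrX rows come out as PRPS1, in agreement with the join result shown above, and the unmatched chr2 row stays NA.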
I have a BED file which is loaded into R as a data frame. The genomic coordinates look something like this:
chrom start end
chrX 400 600
chrX 800 1000
chrX 1000 1200
chrX 1200 1400
chrX 1600 1800
chrX 2000 2200
chrX 2200 2400
There's no need to keep all the rows and it would be nicer to compact it to something like this:
chrom start end
chrX 400 600
chrX 800 1400
chrX 1600 1800
chrX 2000 2400
How can I possibly do it?
I've tried to think of something with dplyr, but with no success. group_by wouldn't work because I don't know how to collapse a chunk of contiguous rows into one row that takes the start coordinate from the first row and the end coordinate from the last row, and there are many such chunks.
Using the GenomicRanges package from Bioconductor, built specifically for BED files and the like:
library(GenomicRanges)
# Example data
gr <- GRanges(
seqnames = Rle("chr1", 6),
ranges = IRanges(start = c(400 ,800, 1200, 1400, 1800, 2000),
end = c(600, 1000, 1400, 1600, 2000, 2200)))
gr
# GRanges object with 6 ranges and 0 metadata columns:
# seqnames ranges strand
# <Rle> <IRanges> <Rle>
# [1] chr1 [ 400, 600] *
# [2] chr1 [ 800, 1000] *
# [3] chr1 [1200, 1400] *
# [4] chr1 [1400, 1600] *
# [5] chr1 [1800, 2000] *
# [6] chr1 [2000, 2200] *
# -------
# seqinfo: 1 sequence from an unspecified genome; no seqlengths
# merge contiguous ranges into one using reduce:
reduce(gr)
# GRanges object with 4 ranges and 0 metadata columns:
# seqnames ranges strand
# <Rle> <IRanges> <Rle>
# [1] chr1 [ 400, 600] *
# [2] chr1 [ 800, 1000] *
# [3] chr1 [1200, 1600] *
# [4] chr1 [1800, 2200] *
# -------
# seqinfo: 1 sequence from an unspecified genome; no seqlengths
# EDIT: if the bed file is a data.frame (here called df) we can convert it to a GRanges object:
gr <- GRanges(seqnames = Rle(df$chrom),
ranges = IRanges(start = df$start,
end = df$end))
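To make the merging rule explicit, here is a minimal base R sketch on the question's own coordinates. It is not how reduce() is implemented; it only assumes that a row belongs to the current block when its start is not greater than the largest end seen so far on the same chromosome:

```r
bed <- data.frame(
  chrom = "chrX",
  start = c(400, 800, 1000, 1200, 1600, 2000, 2200),
  end   = c(600, 1000, 1200, 1400, 1800, 2200, 2400),
  stringsAsFactors = FALSE
)

# Sort so that rows that can be merged are next to each other
bed <- bed[order(bed$chrom, bed$start), ]

# Largest end among the previous rows of the same chromosome (-Inf for the first)
prev_end <- ave(bed$end, bed$chrom,
                FUN = function(e) c(-Inf, head(cummax(e), -1)))

# A row opens a new block when it starts after everything seen so far
new_block <- bed$start > prev_end
block <- cumsum(new_block)

merged <- data.frame(
  chrom = bed$chrom[new_block],
  start = bed$start[new_block],                    # first row of each block
  end   = as.vector(tapply(bed$end, block, max))   # furthest end in each block
)
merged
```

On this input the result has the four rows asked for in the question (400-600, 800-1400, 1600-1800, 2000-2400). For real data, reduce() on a GRanges object remains the more convenient route.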