How to intersect and add score to column? - r

I have two data sets. I would like to find the overlapping (common) regions between them and, where there is an overlap, extract the matching rows from each original table:
Data A:
chr start end
chr1 25 35
chr1 50 70
chr1 60 85
Data B:
chr start end score
chr1 10 15 24
chr1 55 75 14
chr1 76 82 10
Output tables:
Output 1: common regions
chr start end
chr1 55 70
chr1 70 75
chr1 76 82
Output 2: rows extracted from Data A:
chr start end
chr1 50 70
chr1 60 85
Output 3: rows extracted from Data B:
chr start end score
chr1 55 75 14
chr1 76 82 10
I have tried different ways but I do not know which one is the best:
library(GenomicRanges)
enhancer = with(dataA, GRanges(chr, IRanges(start=start, end=end)))
H3K4me1= with(dataB, GRanges(chr, IRanges(start=start, end=end)))
way 1:
hits <- findOverlaps(enhancer, H3K4me1)
ranges(enhancer)[queryHits(hits)] = ranges(H3K4me1)[subjectHits(hits)]
enhancer
H3K4me1
way 2:
over <- subsetByOverlaps(enhancer, H3K4me1)
way 3:
inter = intersect(enhancer, H3K4me1)
way 4:
library(data.table)
groupA <- data.table(dataA)
setkey(groupA, chr, start, end)
groupB <- data.table(dataB)
setkey(groupB, chr, start, end)
over <- foverlaps(groupA, groupB, nomatch = 0)
over2 <- data.table(
chr = over$chr,
start = over[, ifelse(start > i.start, start, i.start)],
end = over[, ifelse(end < i.end, end, i.end)])

I'm not sure if this is what you want. Would you mind creating a reproducible example?
DataA <- data.frame(chr = c("chr1", "chr1", "chr1"), start = c(25,50,60), end = c(35,70,85))
DataB <- data.frame(chr = c("chr1", "chr1", "chr1"), start = c(10,55,76), end = c(15,75,82), score = c(24,14,10))
# Expand every DataA range into its positions and remember the source row
luA <- Map(`:`, DataA$start, DataA$end)
luA <- data.frame(value = unlist(luA),
index = rep(seq_along(luA), lengths(luA)))
# For each DataB start, the first DataA row whose range contains it (NA if none)
hit <- luA$index[match(DataB$start, luA$value)]
DataA[hit[!is.na(hit)],]
DataB[!is.na(hit),]
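The lookup above only tests whether a Data B start falls inside a Data A range. As a hedged alternative, here is a minimal base R sketch that pairs rows on the same chromosome, keeps the pairs whose closed intervals overlap, and clips each pair to its shared region. The names pairs, common, fromA and fromB are illustrative, not from the question. Note that pairwise clipping gives 60-75 for the third Data A row against the second Data B row, whereas the question lists 70-75; the question does not say how that row was derived, so the sketch does not try to reproduce it.

```r
DataA <- data.frame(chr = "chr1", start = c(25, 50, 60), end = c(35, 70, 85))
DataB <- data.frame(chr = "chr1", start = c(10, 55, 76), end = c(15, 75, 82),
                    score = c(24, 14, 10))

# Tag rows with their original index, then pair every A row with every B row
# on the same chromosome
pairs <- merge(transform(DataA, a = seq_len(nrow(DataA))),
               transform(DataB, b = seq_len(nrow(DataB))),
               by = "chr", suffixes = c(".A", ".B"))

# Two closed intervals overlap when each starts no later than the other ends
pairs <- pairs[pairs$start.A <= pairs$end.B & pairs$start.B <= pairs$end.A, ]
pairs <- pairs[order(pairs$a, pairs$b), ]

# Output 1: the shared part of each overlapping pair
common <- data.frame(chr = pairs$chr,
                     start = pmax(pairs$start.A, pairs$start.B),
                     end = pmin(pairs$end.A, pairs$end.B))

# Outputs 2 and 3: original rows that take part in at least one overlap
fromA <- DataA[sort(unique(pairs$a)), ]
fromB <- DataB[sort(unique(pairs$b)), ]
```

Because merge() builds every same-chromosome pair before filtering, this is only practical for small tables; foverlaps() or findOverlaps() from the question scale better.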

Related

How to extract a set of rows in a dataframe from data in other dataframe in R?

I'm trying to extract lines from a data frame containing genome coverage information. I want to extract only the positions of specific genes, I have this information in another data frame.
The data frame with gene information, looks like:
df.1 <- data.frame(
"Gene" = c("UMAG_11067", "UMAG_03291"),
"Chr" = c(9, 14),
"Start" = c(18431, 35712),
"End" = c(21239, 39416),
"Function" = c("ROS", "ROS"))
Gene Chr Start End Function
UMAG_11067 9 18431 21239 ROS
UMAG_03291 14 35712 39416 ROS
Using pivot_longer, I reshaped the table so that Start and End end up in a single variable called Position:
library(tidyr)
df.1 <- df.1 %>%
pivot_longer(cols = Start:End,
names_to = "StartEnd",
values_to = "Position")
# A tibble: 4 x 5
Gene Chr Function StartEnd Position
<chr> <dbl> <chr> <chr> <dbl>
1 UMAG_11067 9 ROS Start 18431
2 UMAG_11067 9 ROS End 21239
3 UMAG_03291 14 ROS Start 35712
4 UMAG_03291 14 ROS End 39416
The data frame with the coverage information from the sequencing, looks like:
df.coverage <- data.frame(
"Chr" = c(rep(9, 25000), rep(14, 50000)),
"Position" = c(seq(1,25000), seq(1, 50000)),
"Coverage" = c(rep(370, 25000), rep(185,50000)))
Chr Position Coverage
9 1 370
9 2 370
9 3 370
14 1 185
14 2 185
14 3 185
My desired output is:
Chr Position Coverage
9 18431 370
9 18432 370
9 18433 370
9 18434 370 # to the end of the gene
9 21239 370
...
14 35712 185
14 35713 185
14 35714 185
14 35715 185 # to the end of the gene
14 39416 185
I've tried creating an ID column by concatenating the chr and position columns:
df.1$ID <- paste(df.1$Chr, df.1$Position, sep = "-")
df.coverage$ID <- paste(df.coverage$Chr, df.coverage$Position, sep= "-")
and then using a logical index:
df.coverage[df.coverage$ID %in% df.1$ID, ]
which produces the following output:
Chr Position Coverage ID
9 18431 370 9-18431
9 21239 370 9-21239
14 35712 185 14-35712
14 39416 185 14-39416
This code only identifies the rows where the ID matches, i.e. the start and end positions of each gene.
How can I subset the coverage data frame to extract the coverage of every nucleotide of the target genes?
Thank you in advance!
I would do this in two steps: first merge in all the positions for each Chr, then filter to those between Start and End. Note this uses your initial, non-pivoted version of df.1.
library(dplyr)
df.2 <- df.1 %>%
left_join(df.coverage, by = "Chr") %>%
filter(Position >= Start & Position <= End) %>%
select(Chr, Position, Coverage)
head(df.2)
Output:
Chr Position Coverage
1 9 18431 370
2 9 18432 370
3 9 18433 370
4 9 18434 370
5 9 18435 370
6 9 18436 370
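The join above creates one row for every (gene, position on that chromosome) pair before filtering. If that intermediate table gets too large, a hedged base R sketch is to filter the coverage table once per gene and stack the pieces. The data below is a small stand-in with hypothetical gene names and coordinates, not the question's full table, and gene_coverage is an illustrative name.

```r
# Small stand-in data (hypothetical values)
genes <- data.frame(Gene = c("geneA", "geneB"),
                    Chr = c(9, 14),
                    Start = c(3, 2),
                    End = c(5, 4))
coverage <- data.frame(Chr = c(rep(9, 6), rep(14, 6)),
                       Position = c(1:6, 1:6),
                       Coverage = c(rep(370, 6), rep(185, 6)))

# For each gene, keep the coverage rows on its chromosome between Start and End
per_gene <- lapply(seq_len(nrow(genes)), function(i) {
  keep <- coverage$Chr == genes$Chr[i] &
    coverage$Position >= genes$Start[i] &
    coverage$Position <= genes$End[i]
  out <- coverage[keep, ]
  out$Gene <- rep(genes$Gene[i], nrow(out))  # label rows with the gene they belong to
  out
})
gene_coverage <- do.call(rbind, per_gene)
```

This trades the single join for one pass over the coverage table per gene, which is reasonable for a short gene list but slower than a join when there are many genes.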

Finding overlaps between 2 ranges and their overlapped region lengths?

I need to find the length of the overlapping region between two groups of ranges (gp1 and gp2) on the same chromosome. (Similar questions on Stack Overflow differ from my aim, because I want the overlapping region itself, not a TRUE/FALSE answer.)
For example:
gp1:
chr start end id1
chr1 580 600 1
chr1 900 970 2
chr3 400 600 3
chr2 100 700 4
gp2:
chr start end id2
chr1 590 864 1
chr3 550 670 2
chr2 897 1987 3
I'm looking for a way to compare these two groups and get a result like this:
id1 id2 chr overlapped_length
1 1 chr1 10
3 2 chr3 50
Should point you in the right direction:
Load libraries
# install.packages("BiocManager")
# BiocManager::install("GenomicRanges")
library(GenomicRanges)
library(IRanges)
Generate data
gp1 <- read.table(text =
"
chr start end id1
chr1 580 600 1
chr1 900 970 2
chr3 400 600 3
chr2 100 700 4
", header = TRUE)
gp2 <- read.table(text =
"
chr start end id2
chr1 590 864 1
chr3 550 670 2
chr2 897 1987 3
", header = TRUE)
Calculate ranges
gr1 <- GenomicRanges::makeGRangesFromDataFrame(
gp1,
seqnames.field = "chr",
start.field = "start",
end.field = "end"
)
gr2 <- GenomicRanges::makeGRangesFromDataFrame(
gp2,
seqnames.field = "chr",
start.field = "start",
end.field = "end"
)
Calculate overlaps
hits <- findOverlaps(gr1, gr2)
p <- Pairs(gr1, gr2, hits = hits)
i <- pintersect(p)
Result
> as.data.frame(i)
seqnames start end width strand hit
1 chr1 590 600 11 * TRUE
2 chr3 550 600 51 * TRUE
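The widths reported above (11 and 51) count both end positions, while the question expects 10 and 50, which is end minus start. The result above also does not carry id1 and id2. A minimal base R sketch that keeps the ids and reports both conventions, assuming closed intervals and using only the question's data (overlap_start, overlap_end and inclusive_width are illustrative column names):

```r
gp1 <- data.frame(chr = c("chr1", "chr1", "chr3", "chr2"),
                  start = c(580, 900, 400, 100),
                  end = c(600, 970, 600, 700),
                  id1 = 1:4)
gp2 <- data.frame(chr = c("chr1", "chr3", "chr2"),
                  start = c(590, 550, 897),
                  end = c(864, 670, 1987),
                  id2 = 1:3)

# Pair ranges on the same chromosome and keep the pairs that overlap
m <- merge(gp1, gp2, by = "chr", suffixes = c(".1", ".2"))
m <- m[m$start.1 <= m$end.2 & m$start.2 <= m$end.1, ]

# Clip each pair to its shared region
m$overlap_start <- pmax(m$start.1, m$start.2)
m$overlap_end <- pmin(m$end.1, m$end.2)

# end - start, the convention used in the question's expected output
m$overlapped_length <- m$overlap_end - m$overlap_start
# Number of positions covered, the convention of the width column above
m$inclusive_width <- m$overlapped_length + 1

res <- m[order(m$id1), c("id1", "id2", "chr", "overlapped_length", "inclusive_width")]
```

With GenomicRanges, the same ids can be recovered by indexing gp1$id1 with queryHits(hits) and gp2$id2 with subjectHits(hits).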

Finding the overlap between two data frames in R, how can I make my code more efficient?

I have two dataframes in R. In the first one I have two columns one is called "chr" and the other "position"; in the second dataframe I have three columns one is again "chr", other "start" and another one "end". I want to select those rows in the first dataframe in which chr value is the same as the second data frame, but also whose "position" is in the interval start-end of the second data frame.
For that I have written a function in R that gives me the desired output but it is very slow when I run it with huge data frames.
# My DataFrames are:
bed <- data.frame(Chr = c(rep("chr1",4),rep("chr2",3),rep("chr3",1)),
x1 = c(5,20,44,67,5,20,44,20),
x3=c(12,43,64,94,12,43,64,63))
snv <- data.frame(Chr = c(rep("chr1",6),rep("chr3",6)),
position = c(5,18,46,60,80,90,21,60,75,80,84,87))
# My function is:
get_overlap <- function(df, position, chrom){
overlap <- FALSE
for (row in 1:nrow(df)){
chr = df[row, 1]
start = df[row, 2]
end = df[row, 3]
if(chr == chrom & position %in% seq(start, end)){
overlap <- TRUE
}
}
return(overlap)
}
# The code is:
overlap_vector = c()
for (row in 1:nrow(snv)){
chrom = snv[row, 1]
position = snv[row, 2]
overlap <- get_overlap(bed, position, chrom)
overlap_vector <- c(overlap_vector, overlap)
}
print(snv[overlap_vector,])
How can I make this more efficient? I have never worked with hash tables, can that be the answer?
I'm sure there's a more elegant data.table solution, but this works. First I load the package.
# Load package
library(data.table)
Then, I define the data tables
# Define data tables
bed <- data.table(Chr = c(rep("chr1",4),rep("chr2",3),rep("chr3",1)),
start = c(5,20,44,67,5,20,44,20),
end = c(12,43,64,94,12,43,64,63))
snv <- data.table(Chr = c(rep("chr1",6),rep("chr3",6)),
position = c(5,18,46,60,80,90,21,60,75,80,84,87))
Here, I do a non-equi join on position and start/end, and an equi join on Chr. I assume you want to keep all columns, so I specified them in the j argument and omitted the rows without matches.
na.omit(bed[snv,
.(Chr, start = x.start, end = x.end, position = i.position),
on = c("start <= position", "end >= position", "Chr == Chr")])
#> Chr start end position
#> 1: chr1 5 12 5
#> 2: chr1 44 64 46
#> 3: chr1 44 64 60
#> 4: chr1 67 94 80
#> 5: chr1 67 94 90
#> 6: chr3 20 63 21
#> 7: chr3 20 63 60
Created on 2019-08-21 by the reprex package (v0.3.0)
Edit
A quick benchmark shows that Nathan's solution is about twice as fast!
Unit: milliseconds
expr min lq mean median uq max neval
NathanWren() 1.684392 1.729557 1.819263 1.751520 1.787829 5.138546 100
Lyngbakr() 3.336902 3.395528 3.603376 3.441933 3.496131 7.720925 100
The data.table package is great for fast merging of tables. It also comes with a vectorized between function for just this type of task.
library(data.table)
# Convert the data.frames to data.tables
setDT(bed)
setDT(snv)
# Use the join syntax for data.table, then filter for the desired rows
overlap_dt <- bed[
snv,
on = "Chr",
allow.cartesian = TRUE # many-to-many matching
][
between(position, lower = x1, upper = x3)
]
overlap_dt
# Chr x1 x3 position
# 1: chr1 5 12 5
# 2: chr1 44 64 46
# 3: chr1 44 64 60
# 4: chr1 67 94 80
# 5: chr1 67 94 90
# 6: chr3 20 63 21
# 7: chr3 20 63 60
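Both data.table answers return the joined rows. If only the filtered snv rows are needed, as in the question's final print(), a hedged base R sketch is to replace the inner for loop with one vectorised comparison against all bed rows. The helper in_any_interval is a hypothetical name, and the sketch still makes one pass over bed per SNV, so it will not match the data.table joins on very large inputs.

```r
bed <- data.frame(Chr = c(rep("chr1", 4), rep("chr2", 3), "chr3"),
                  x1 = c(5, 20, 44, 67, 5, 20, 44, 20),
                  x3 = c(12, 43, 64, 94, 12, 43, 64, 63),
                  stringsAsFactors = FALSE)
snv <- data.frame(Chr = c(rep("chr1", 6), rep("chr3", 6)),
                  position = c(5, 18, 46, 60, 80, 90, 21, 60, 75, 80, 84, 87),
                  stringsAsFactors = FALSE)

# TRUE when the position lies inside at least one interval on that chromosome
in_any_interval <- function(chrom, pos) {
  any(bed$Chr == chrom & bed$x1 <= pos & pos <= bed$x3)
}

# One logical per SNV row, built without growing a vector inside a loop
keep <- mapply(in_any_interval, snv$Chr, snv$position, USE.NAMES = FALSE)
snv[keep, ]
```

The comparison x1 <= pos & pos <= x3 also avoids materialising seq(start, end) for every interval, which is one of the costs in the original get_overlap().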

Plotting genomic data using RCircos package

I am trying to use the RCircos package in R to visualize links between genomic positions. I am unfamiliar with this package and have been using the package documentation available from the CRAN repository from 2016.
I have attempted to format my data according to the package requirements. Here is what it looks like:
> head(pts3)
Chromosome ChromStart ChromEnd Chromosome.1 ChromStart.1 ChromEnd.1
1 chr1 33 34 chr1 216 217
2 chr1 33 34 chr1 789 790
3 chr1 33 34 chr1 1716 1717
4 chr1 33 34 chr1 1902 1903
5 chr1 33 34 chr2 2538 2539
6 chr1 33 34 chr2 4278 4279
Ultimately, I would like to produce a plot with tracks from ChromStart to ChromStart.1 and each gene labeled along the outside of the plot. I thought the script would look something like:
RCircos.Set.Core.Components(cyto.info = pts3,
chr.exclude = NULL,
tracks.inside = 1,
tracks.outside = 2)
RCircos.Set.Plot.Area()
RCircos.Chromosome.Ideogram.Plot()
RCircos.Link.Plot(link.data = pts3,
track.num = 3,
by.chromosome = FALSE)
It appears that to do so, I must first initialize with the RCircos.Set.Core.Components() function which requires positional information for each gene to pass to RCircos.Chromosome.Ideogram.Plot(). So, I created a second data frame containing the required information to pass to the function and this is the error that I get:
> head(genes)
Chromosome ChromStart ChromEnd GeneName Band Stain
1 chr1 0 2342 PB2 NA NA
2 chr2 2343 4683 PB1 NA NA
3 chr3 4684 6917 PA NA NA
4 chr4 6918 8710 HA NA NA
5 chr5 8711 10276 NP NA NA
6 chr6 10277 11735 NA NA NA
> RCircos.Set.Core.Components(cyto.info = genes,
+ chr.exclude = NULL,
+ tracks.inside = 1,
+ tracks.outside = 2)
Error in RCircos.Validate.Cyto.Info(cyto.info, chr.exclude) :
Cytoband start should be 0.
I don't actually have data for the Band or Stain columns and don't understand what they are for, but adding data to those columns (such as 1:8 or chr1, chr2, etc.) does not resolve the problem. Based on a recommendation from another forum, I also tried to reset the plot parameters for RCircos using the following functions, but it did not resolve the error:
core.chrom <- data.frame("Chromosome" = c("chr1", "chr2", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8"),
"ChromStart" = c(0, 2343, 4684, 6918, 8711, 10277, 11736, 12763),
"ChromEnd" = c(2342, 4683, 6917, 8710, 10276, 11735, 12762, 13666),
"startLoc" = c(0, 2343, 4684, 6918, 8711, 10277, 11736, 12763),
"endLoc" = c(2342, 4683, 6917, 8710, 10276, 11735, 12762, 13666),
"Band" = NA,
"Stain" = NA)
RCircos.Reset.Plot.Ideogram(chrom.ideo = core.chrom)
Any advice would be deeply appreciated!
I'm not sure if you figured this one out or moved on etc. I had the same problem and ended up resolving it by reformatting my start positions for each chromosome to 0 as opposed to a continuation of the previous chr. For you it would be:
Chromosome ChromStart ChromEnd GeneName Band Stain
1 chr1 0 2342 PB2 NA NA
2 chr2 0 2340 PB1 NA NA
3 chr3 0 2233 PA NA NA
...etc
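The reformatting described above can be sketched in base R: shift every segment so that it starts at 0 and its end becomes its own length. The coordinates are taken from the question's core.chrom table, and rebased is an illustrative name. The sketch only reshapes the data frame and does not call RCircos, so whether the result passes validation rests on the answer's report.

```r
core.chrom <- data.frame(
  Chromosome = paste0("chr", 1:8),
  ChromStart = c(0, 2343, 4684, 6918, 8711, 10277, 11736, 12763),
  ChromEnd = c(2342, 4683, 6917, 8710, 10276, 11735, 12762, 13666)
)

# Express each segment in its own coordinates instead of as a continuation
# of the previous one
rebased <- core.chrom
rebased$ChromEnd <- core.chrom$ChromEnd - core.chrom$ChromStart
rebased$ChromStart <- 0
```

Any link data plotted against these segments would need the same per-segment offset subtracted from its positions, otherwise the links would point outside the rebased ranges.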

Find overlapping ranges based on positions in R

I have two datasets:
chr1 25 85
chr1 2000 3000
chr2 345 2300
and the second:
chr1 34 45 1.2
chr1 100 1000
chr2 456 1500 1.3
This is my desired output,
chr1 25 85 1.2
chr2 345 2300 1.3
Below is my code:
sb <- NULL
rangesC <- NULL
sb$bin <- NULL
for(i in levels(df1$V1)){
s <- subset(df1, df1$V1 == i)
sb <- subset(df2, df2$V1 == i)
for(j in 1:nrow(sb)){
sb$bin[j] <-s$V4[(s$V2 <= sb$V2[j] & s$V3 >= sb$V3[j])]
}
rangesC <- try(rbind(rangesC, sb),silent = TRUE)
}
The error I get is:
replacement has length zero
or, when I use as.character, rangesC is empty.
I would like to get the corresponding V4 value if the positions overlap. What is going wrong?
The foverlaps() function from the data.table package does an overlap join of two data.tables:
library(data.table)
setDT(df1, key = names(df1))
setDT(df2, key = key(df1))
foverlaps(df2, df1, nomatch = 0L)[, -c("i.V2", "i.V3")]
V1 V2 V3 V4
1: chr1 25 85 1.2
2: chr2 345 2300 1.3
Data
library(data.table)
df1 <- fread(
"chr1 25 85
chr1 2000 3000
chr2 345 2300", header = FALSE
)
df2 <- fread(
"chr1 34 45 1.2
chr1 100 1000
chr2 456 1500 1.3", header = FALSE, fill = TRUE
)
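On the error itself: in R, assigning a zero-length value to a single vector element raises "replacement has length zero", which is presumably what happens in the question's inner loop whenever no range satisfies the condition. A hedged base R sketch of the same lookup that returns NA in that case is shown below. It assumes the missing score in the second data set is NA and that the first overlapping range is the one wanted; result is an illustrative name.

```r
df1 <- data.frame(V1 = c("chr1", "chr1", "chr2"),
                  V2 = c(25, 2000, 345),
                  V3 = c(85, 3000, 2300))
df2 <- data.frame(V1 = c("chr1", "chr1", "chr2"),
                  V2 = c(34, 100, 456),
                  V3 = c(45, 1000, 1500),
                  V4 = c(1.2, NA, 1.3))

# For each df1 range, take V4 of the first overlapping df2 range on the same
# chromosome, or NA when nothing overlaps (this guard avoids the error)
df1$V4 <- vapply(seq_len(nrow(df1)), function(i) {
  hit <- which(df2$V1 == df1$V1[i] &
                 df2$V2 <= df1$V3[i] &
                 df2$V3 >= df1$V2[i])
  if (length(hit) == 0L) NA_real_ else df2$V4[hit[1]]
}, numeric(1))

# Keep only the ranges that received a score
result <- df1[!is.na(df1$V4), ]
```

vapply() with numeric(1) also enforces that every iteration yields exactly one number, so an empty match cannot slip through silently.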
