Finding overlaps between 2 ranges and their overlapped region lengths in R

I need to find the length of the overlapping region on the same chromosome between two groups (gp1 and gp2). Similar questions on Stack Overflow differ from my goal, because I want the overlapping region itself, not a TRUE/FALSE answer.
For example:
gp1:
chr start end id1
chr1 580 600 1
chr1 900 970 2
chr3 400 600 3
chr2 100 700 4
gp2:
chr start end id2
chr1 590 864 1
chr3 550 670 2
chr2 897 1987 3
I'm looking for a way to compare these two groups and get a result like this:
id1 id2 chr overlapped_length
1 1 chr1 10
3 2 chr3 50

Should point you in the right direction:
Load libraries
# install.packages("BiocManager")
# BiocManager::install("GenomicRanges")
library(GenomicRanges)
library(IRanges)
Generate data
gp1 <- read.table(text =
"
chr start end id1
chr1 580 600 1
chr1 900 970 2
chr3 400 600 3
chr2 100 700 4
", header = TRUE)
gp2 <- read.table(text =
"
chr start end id2
chr1 590 864 1
chr3 550 670 2
chr2 897 1987 3
", header = TRUE)
Calculate ranges
gr1 <- GenomicRanges::makeGRangesFromDataFrame(
gp1,
seqnames.field = "chr",
start.field = "start",
end.field = "end"
)
gr2 <- GenomicRanges::makeGRangesFromDataFrame(
gp2,
seqnames.field = "chr",
start.field = "start",
end.field = "end"
)
Calculate overlaps
hits <- findOverlaps(gr1, gr2)
p <- Pairs(gr1, gr2, hits = hits)
i <- pintersect(p)
Result
> as.data.frame(i)
seqnames start end width strand hit
1 chr1 590 600 11 * TRUE
2 chr3 550 600 51 * TRUE
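Note that the widths above (11 and 51) count both endpoints, whereas the expected output in the question uses end - start (10 and 50) and also keeps id1 and id2. A minimal base R sketch of that variant follows; it assumes the same gp1 and gp2 as above, joins on chr, and treats intervals that merely touch (length 0) as non-overlapping. The column suffixes `.1` and `.2` are illustrative choices, not part of the original answer.

```r
gp1 <- data.frame(chr = c("chr1", "chr1", "chr3", "chr2"),
                  start = c(580, 900, 400, 100),
                  end = c(600, 970, 600, 700),
                  id1 = 1:4)
gp2 <- data.frame(chr = c("chr1", "chr3", "chr2"),
                  start = c(590, 550, 897),
                  end = c(864, 670, 1987),
                  id2 = 1:3)

# pair every gp1 range with every gp2 range on the same chromosome
m <- merge(gp1, gp2, by = "chr", suffixes = c(".1", ".2"))

# overlap length = smaller end minus larger start (negative means no overlap)
m$overlapped_length <- pmin(m$end.1, m$end.2) - pmax(m$start.1, m$start.2)

# keep real overlaps only and arrange columns as in the question
res <- m[m$overlapped_length > 0, c("id1", "id2", "chr", "overlapped_length")]
res <- res[order(res$id1), ]
rownames(res) <- NULL
res
```

For large inputs the per-chromosome cross join can become expensive; the GenomicRanges approach above is likely the better fit there.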

Related

Finding overlap in dataframe ranges in R

I have two bedfiles as dataframes in R, for which I want to map all overlapping regions to each other (similar to what bedtools closest would be able to do).
BedA:
chr start end
2 100 500
2 200 250
3 275 300
BedB:
chr start end
2 210 265
2 99 106
8 275 290
BedOut:
chr start.A end.A start.B end.B
2 100 500 210 265
2 100 500 99 106
2 200 250 210 265
Now, I found this very similar question, which suggests using IRanges. The proposed approach seems to work, but I have no idea how to turn the output into a data frame like "BedOut".
Another data.table option using foverlaps:
setkeyv(BedA, names(BedA))
setkeyv(BedB, names(BedB))
ans <- foverlaps(BedB, BedA, nomatch=0L)
setnames(ans, c("start","end","i.start","i.end"), c("start.A","end.A","start.B","end.B"))
output:
chr start.A end.A start.B end.B
1: 2 100 500 99 106
2: 2 100 500 210 265
3: 2 200 250 210 265
data:
library(data.table)
BedA <- fread("chr start end
2 100 500
2 200 250
3 275 300")
BedB <- fread("chr start end
2 210 265
2 99 106
8 275 290")
Here is a solution using the data.table package.
library(data.table)
chr = c(2,2,3)
start.A = c(100, 200, 275)
end.A = c(500, 250, 300)
df_A = data.table(chr, start.A, end.A)
chr = c(2,2,8)
start.B = c(210, 99, 275)
end.B = c(265, 106, 290)
df_B = data.table(chr, start.B, end.B)
First, inner join the data-tables on the key chr:
df_out = df_B[df_A, on="chr", nomatch=0]
Then filter the overlapping interval:
df_out = df_out[(start.A>=start.B & start.A<=end.B) | (start.B>=start.A & start.B<=end.A)]
setcolorder(df_out, c("chr", "start.A", "end.A", "start.B", "end.B"))
chr start.A end.A start.B end.B
1: 2 100 500 210 265
2: 2 100 500 99 106
3: 2 200 250 210 265
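The overlap filter above can also be written more compactly: two closed intervals overlap exactly when each one starts no later than the other ends. The sketch below, in base R, checks that this short form agrees with the longer condition on a small grid of valid intervals (start <= end). The function names `overlaps_short` and `overlaps_long` are hypothetical helpers introduced only for this illustration.

```r
# short form: each interval starts before (or when) the other ends
overlaps_short <- function(start_a, end_a, start_b, end_b) {
  start_a <= end_b & start_b <= end_a
}

# long form, as used in the filter above
overlaps_long <- function(start_a, end_a, start_b, end_b) {
  (start_a >= start_b & start_a <= end_b) |
    (start_b >= start_a & start_b <= end_a)
}

# all combinations of small integer intervals with start <= end
g <- expand.grid(sa = 1:6, ea = 1:6, sb = 1:6, eb = 1:6)
g <- g[g$sa <= g$ea & g$sb <= g$eb, ]
agree <- all(overlaps_short(g$sa, g$ea, g$sb, g$eb) ==
             overlaps_long(g$sa, g$ea, g$sb, g$eb))

overlaps_short(100, 500, 210, 265)  # BedA row 1 vs BedB row 1
overlaps_short(275, 300, 99, 106)   # disjoint ranges
```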

Cross joining for the computation of a new variable

I have a game data set and I observe the number of points of one player.
da = data.frame(points = c(144,186,220,410,433))
da
points
1 144
2 186
3 220
4 410
5 433
I also know which level the player was in, because I know the point ranges for the different levels.
ranges = data.frame(level = c(1,2,3,4,5), points_from = c(0,100,200,300,430), points_to = c(100,170,300,430,550))
ranges
level points_from points_to
1 1 0 100
2 2 100 170
3 3 200 300
4 4 300 430
5 5 430 550
Now I want to compute a new variable that indicates how far the player was from the next level. It is computed as da$points / ranges$points_to for the player's current level.
For example, if the player has 144 points and the next level is reached at 170 points, the level progress is 144/170.
Thus, the data set I want to have looks like this:
da_new = data.frame(points = c(144,186,220,410,433), points_to = c(170,300,300,430,550), level_progress = c(144/170,186/300,220/300,410/430,433/550))
da_new
points points_to level_progress
1 144 170 0.8471
2 186 300 0.6200
3 220 300 0.7333
4 410 430 0.9535
5 433 550 0.7873
How can I now compute this variable?
The main idea is to use merge(da, ranges, all = T) to do a "cross join" between the data. Then, we filter to where points is between points_from and points_to (meaning 186 is not in the final data).
library(dplyr)
merge(da, ranges, all = T) %>%
# keep only where points fall between points_from and points_to
filter(points >= points_from & points <= points_to) %>%
mutate(level_progress = points / points_to)
points level points_from points_to level_progress
1 144 2 100 170 0.8470588
2 220 3 200 300 0.7333333
3 410 4 300 430 0.9534884
4 433 5 430 550 0.7872727
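The same cross-join-then-filter idea can be sketched in base R only. This version assumes the da and ranges data frames from the question and uses `by = NULL`, which makes merge return the Cartesian product of the two inputs (5 x 5 = 25 rows here). The names `cross` and `hit` are illustrative.

```r
da <- data.frame(points = c(144, 186, 220, 410, 433))
ranges <- data.frame(level = 1:5,
                     points_from = c(0, 100, 200, 300, 430),
                     points_to = c(100, 170, 300, 430, 550))

# Cartesian product: every points value paired with every level
cross <- merge(da, ranges, by = NULL)

# keep pairs where the points value lies inside the level's range
hit <- cross[cross$points >= cross$points_from &
             cross$points <= cross$points_to, ]
hit$level_progress <- hit$points / hit$points_to
hit <- hit[order(hit$points), ]
hit
```

Because both bounds are inclusive, a value that sits exactly on a shared boundary (for example 300) would match two levels; whether that is acceptable depends on how the levels are defined.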
Another option is to filter where points <= points_to, and then pick the row where points is closest to points_to (this method keeps 186):
merge(da, ranges, all = T) %>%
filter(points <= points_to) %>%
group_by(points) %>%
slice(which.min(abs(points - points_to))) %>%
mutate(level_progress = points / points_to)
points level points_from points_to level_progress
<dbl> <dbl> <dbl> <dbl> <dbl>
1 144 2 100 170 0.847
2 186 3 200 300 0.62
3 220 3 200 300 0.733
4 410 4 300 430 0.953
5 433 5 430 550 0.787
Here is a base R solution using findInterval
da_new <- da
da_new$points_to <- ranges$points_to[findInterval(da_new$points,c(0,ranges$points_to))]
da_new$level_progress <- da_new$points/da_new$points_to
such that
> da_new
points points_to level_progress
1 144 170 0.8470588
2 186 300 0.6200000
3 220 300 0.7333333
4 410 430 0.9534884
5 433 550 0.7872727
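To see why this works, it can help to look at what findInterval returns on its own. In the sketch below, `breaks` is the vector c(0, ranges$points_to) written out by hand; each result is the number of breaks that are less than or equal to the value, which is also the row of ranges whose points_to is the next threshold. The variable names are illustrative.

```r
breaks <- c(0, 100, 170, 300, 430, 550)
pts <- c(144, 186, 220, 410, 433)

# index of the interval [breaks[i], breaks[i + 1]) containing each value
idx <- findInterval(pts, breaks)
idx

# the matching upper thresholds
points_to <- c(100, 170, 300, 430, 550)
points_to[idx]

# a value exactly on a threshold falls into the next interval by default;
# left.open = TRUE makes the intervals (lower, upper] instead
on_edge_default <- findInterval(170, breaks)
on_edge_left_open <- findInterval(170, breaks, left.open = TRUE)
```

Values at or above the last break (550 here) would index past the end of points_to and yield NA, so the approach assumes all points lie below the top threshold.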

Find overlapping ranges based on positions in R

I have two datasets:
chr1 25 85
chr1 2000 3000
chr2 345 2300
and the 2nd,
chr1 34 45 1.2
chr1 100 1000
chr2 456 1500 1.3
This is my desired output,
chr1 25 85 1.2
chr2 345 2300 1.3
Below is my code:
sb <- NULL
rangesC <- NULL
sb$bin <- NULL
for(i in levels(df1$V1)){
s <- subset(df1, df1$V1 == i)
sb <- subset(df2, df2$V1 == i)
for(j in 1:nrow(sb)){
sb$bin[j] <-s$V4[(s$V2 <= sb$V2[j] & s$V3 >= sb$V3[j])]
}
rangesC <- try(rbind(rangesC, sb),silent = TRUE)
}
The error I get is:
replacement has length zero. Alternatively, when I use as.character, rangesC is empty.
I would like to get the corresponding V4 value when the positions overlap. What is going wrong?
The foverlaps() function from the data.table package does an overlap join of two data.tables:
library(data.table)
setDT(df1, key = names(df1))
setDT(df2, key = key(df1))
foverlaps(df2, df1, nomatch = 0L)[, -c("i.V2", "i.V3")]
V1 V2 V3 V4
1: chr1 25 85 1.2
2: chr2 345 2300 1.3
Data
library(data.table)
df1 <- fread(
"chr1 25 85
chr1 2000 3000
chr2 345 2300", header = FALSE
)
df2 <- fread(
"chr1 34 45 1.2
chr1 100 1000
chr2 456 1500 1.3", header = FALSE
)
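As for what goes wrong in the loop from the question: assigning a zero-length result to a single element (which happens whenever the logical index selects no row) raises "replacement has length zero". A minimal base R sketch that guards against the empty case is shown below. It assumes the missing V4 value in the second data set is NA, uses an any-overlap test rather than containment, and takes the first hit when several ranges match; these are assumptions for illustration, not the asker's stated rules.

```r
df1 <- data.frame(V1 = c("chr1", "chr1", "chr2"),
                  V2 = c(25, 2000, 345),
                  V3 = c(85, 3000, 2300))
df2 <- data.frame(V1 = c("chr1", "chr1", "chr2"),
                  V2 = c(34, 100, 456),
                  V3 = c(45, 1000, 1500),
                  V4 = c(1.2, NA, 1.3))  # NA assumed for the missing value

df1$V4 <- NA_real_
for (j in seq_len(nrow(df1))) {
  # rows of df2 on the same chromosome whose range overlaps row j of df1
  hit <- which(df2$V1 == df1$V1[j] &
               df2$V2 <= df1$V3[j] &
               df2$V3 >= df1$V2[j])
  # only assign when something matched, avoiding a zero-length replacement
  if (length(hit) > 0) df1$V4[j] <- df2$V4[hit[1]]
}

res <- df1[!is.na(df1$V4), ]
res
```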

How to make new variable across conditions

I need to calculate a new variable (a new Pheno) from my data based on some conditions.
The data set is huge.
It has the columns Animal, Record, Days, Pheno:
A R D P
1 1 240 300
1 2 230 290
2 1 305 350
2 2 260 290
3 1 350 450
Conditions are:
The constant pheno per day is 2.
If a record's days value is more than 305, the old pheno should be kept.
If a record's days value is less than 305 but the animal has a later record, the old pheno should be kept.
If a record's days value is less than 305 and the animal has no later record, the new pheno should be calculated as (305 - days) * constant + pheno, e.g. (305 - 260) * 2 + 290 = 380.
For example, animal 1 has fewer than 305 days for both records. The first record keeps its pheno, but the second record is the last one and has fewer than 305 days, so it must be recalculated: (305 - 230) * 2 + 290 = 440.
The final data will look like this:
A R D P N_P
1 1 240 300 300
1 2 230 290 440
2 1 305 350 350
2 2 260 290 380
3 1 350 450 450
How can I do this in R or on Linux?
Here is a solution with base R
df <- read.table(header=TRUE, text=
"A R D P
1 1 240 300
1 2 230 290
2 1 305 350
2 2 260 290
3 1 350 450")
newP <- function(d) {
np <- numeric(nrow(d))
for (i in 1:nrow(d)) {
if (d$D[i] > 305) { np[i] <- d$P[i]; next }
if (d$D[i] <= 305 && i<nrow(d)) { np[i] <- d$P[i]; next }
np[i] <- (305-d$D[i])*2 + d$P[i]
}
d$N_P <- np
return(d)
}
D <- split(df, df$A)
D2 <- lapply(D, newP)
do.call(rbind, D2)
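Since the data set is described as huge, a vectorised variant of the same rule may be worth trying. The sketch below is base R only and assumes the rows are already sorted by record within each animal, so the last row per animal is its last record; `is_last` is an illustrative name.

```r
df <- read.table(header = TRUE, text =
"A R D P
1 1 240 300
1 2 230 290
2 1 305 350
2 2 260 290
3 1 350 450")

# TRUE for the last row of each animal (assumes rows are ordered by record)
is_last <- !duplicated(df$A, fromLast = TRUE)

# recalculate only short last records; keep the old pheno otherwise
df$N_P <- ifelse(is_last & df$D < 305, (305 - df$D) * 2 + df$P, df$P)
df
```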
Check this out (I assume R is the record number in sorted order, so if you have 10 records the last one will have R = 10):
library(dplyr)
df <- data.frame(A=c(1,1,2,2,3),
R=c(1,2,1,2,1),
D=c(240,230,305,260,350),
P=c(300,290,350,290,450))
df %>% group_by(A) %>%
mutate(N_P=ifelse(( D<305 & R==n()), # check if D<305 & Record is last record
((305-D)*2)+P # calculate new P
,P)) # Else : use old P
Source: local data frame [5 x 5]
Groups: A [3]
A R D P N_P
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 240 300 300
2 1 2 230 290 440
3 2 1 305 350 350
4 2 2 260 290 380
5 3 1 350 450 450
If you have predefined constants that depend on the R value in the df, for example:
const <- c(1,2,1.5,2.5,3)
you can replace the constant 2 in the code with const[R]:
df %>% group_by(A) %>%
mutate(N_P=ifelse(( D<305 & R==n()), # check if D<305 & Record is last record
((305-D)*const[R])+P # calculate new P
,P)) # Else : use old P

R: sum rows from column A until conditioned value in column B

I'm pretty new to R and can't figure out how to deal with what seems to be a relatively simple problem. I want to sum the rows of the column 'DURATION' per 'TRIAL_INDEX', but only the first rows where the values of 'X_POSITION' are increasing. In other words, I only want to sum the first run within a trial where X increases.
The first rows of a simplified dataframe:
TRIAL_INDEX DURATION X_POSITION
1 1 204 314.5
2 1 172 471.6
3 1 186 570.4
4 1 670 539.5
5 1 186 503.6
6 2 134 306.8
7 2 182 503.3
8 2 806 555.7
9 2 323 490.0
So, for TRIAL_INDEX 1, only the first three values of DURATION should be added (204+172+186), as this is where X has the highest value so far (going through the dataframe row by row).
The desired output should look something like:
TRIAL_INDEX DURATION X_POSITION FIRST_PASS_TIME
1 1 204 314.5 562
2 1 172 471.6 562
3 1 186 570.4 562
4 1 670 539.5 562
5 1 186 503.6 562
6 2 134 306.8 1122
7 2 182 503.3 1122
8 2 806 555.7 1122
9 2 323 490.0 1122
I tried to use dplyr, to generate a new dataframe that can be merged with my original dataframe.
However, the code doesn't work, and also I'm not sure on how to make sure it's only adding the first rows per trial that have increasing values for X_POSITION.
FirstPassRT = dat %>%
group_by(TRIAL_INDEX) %>%
filter(dplyr::lag(dat$X_POSITION,1) > dat$X_POSITION) %>%
summarise(FIRST_PASS_TIME=sum(DURATION))
Any help and suggestions are greatly appreciated!
library(data.table)
dt = as.data.table(df) # or setDT to convert in place
# find the rows that will be used for summing DURATION
idx = dt[, .I[1]:.I[min(.N, which(diff(X_POSITION) < 0), na.rm = T)], by = TRIAL_INDEX]$V1
# sum the DURATION for those rows
dt[idx, time := sum(DURATION), by = TRIAL_INDEX][, time := time[1], by = TRIAL_INDEX]
dt
# TRIAL_INDEX DURATION X_POSITION time
#1: 1 204 314.5 562
#2: 1 172 471.6 562
#3: 1 186 570.4 562
#4: 1 670 539.5 562
#5: 1 186 503.6 562
#6: 2 134 306.8 1122
#7: 2 182 503.3 1122
#8: 2 806 555.7 1122
#9: 2 323 490.0 1122
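Here is something you can try with the dplyr package:
library(dplyr)
dat %>% group_by(TRIAL_INDEX) %>%
# TRUE where X_POSITION rose relative to the previous row (first row counts as rising)
mutate(IncLogic = X_POSITION > lag(X_POSITION, default = -Inf)) %>%
# sum only the first run of rising rows, so a later rise is not added
mutate(FIRST_PASS_TIME = sum(DURATION[cumsum(!IncLogic) == 0])) %>%
select(-IncLogic)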
Source: local data frame [9 x 4]
Groups: TRIAL_INDEX [2]
TRIAL_INDEX DURATION X_POSITION FIRST_PASS_TIME
(int) (int) (dbl) (int)
1 1 204 314.5 562
2 1 172 471.6 562
3 1 186 570.4 562
4 1 670 539.5 562
5 1 186 503.6 562
6 2 134 306.8 1122
7 2 182 503.3 1122
8 2 806 555.7 1122
9 2 323 490.0 1122
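The question asks only for the first run of increasing X_POSITION within each trial, which matters when X rises again after a drop. A base R sketch of that rule is given below; it assumes rows are in time order within each trial, and the helper name `first_pass` is hypothetical.

```r
dat <- data.frame(
  TRIAL_INDEX = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
  DURATION = c(204, 172, 186, 670, 186, 134, 182, 806, 323),
  X_POSITION = c(314.5, 471.6, 570.4, 539.5, 503.6, 306.8, 503.3, 555.7, 490.0)
)

first_pass <- function(duration, x) {
  # the first row always counts; later rows count while x keeps rising
  rising <- c(TRUE, diff(x) > 0)
  # TRUE until the first non-rising row, FALSE from there on
  in_first_run <- cumsum(!rising) == 0
  sum(duration[in_first_run])
}

# one value per trial, then spread back onto every row of that trial
per_trial <- sapply(split(dat, dat$TRIAL_INDEX),
                    function(d) first_pass(d$DURATION, d$X_POSITION))
dat$FIRST_PASS_TIME <- unname(per_trial[as.character(dat$TRIAL_INDEX)])
dat

# a later rise (1 -> 5) is ignored once the first run has ended
later_rise <- first_pass(c(1, 2, 3, 4), c(1, 2, 1, 5))
```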
If you want to summarize it down to one row per trial, you can use summarize like this:
library(dplyr)
df <- tibble(TRIAL_INDEX = c(1,1,1,1,1,2,2,2,2),
DURATION = c(204,172,186,670, 186,134,182,806, 323),
X_POSITION = c(314.5, 471.6, 570.4, 539.5, 503.6, 306.8, 503.3, 555.7, 490.0))
res <- df %>%
group_by(TRIAL_INDEX) %>%
# the first row of each trial has no predecessor, so treat it as increasing
mutate(x.increasing = is.na(lag(X_POSITION)) | X_POSITION > lag(X_POSITION)) %>%
# keep only the first run of increasing rows
filter(cumsum(!x.increasing) == 0) %>%
# sum the durations (not the positions) of the rows that remain
summarize(FIRST_PASS_TIME = sum(DURATION))
res
#Source: local data frame [2 x 2]
#
# TRIAL_INDEX FIRST_PASS_TIME
# (dbl) (dbl)
#1 1 562
#2 2 1122
