Find overlapping ranges based on positions in R

I have two datasets:
chr1 25 85
chr1 2000 3000
chr2 345 2300
and the 2nd,
chr1 34 45 1.2
chr1 100 1000
chr2 456 1500 1.3
This is my desired output,
chr1 25 85 1.2
chr2 345 2300 1.3
Below is my code:
sb <- NULL
rangesC <- NULL
sb$bin <- NULL
for(i in levels(df1$V1)){
s <- subset(df1, df1$V1 == i)
sb <- subset(df2, df2$V1 == i)
for(j in 1:nrow(sb)){
sb$bin[j] <-s$V4[(s$V2 <= sb$V2[j] & s$V3 >= sb$V3[j])]
}
rangesC <- try(rbind(rangesC, sb),silent = TRUE)
}
The error I get is :
replacement has length zero OR when I use as.character rangesC is empty.
I would like to get the V4 corresponding if the positions overlap. What is going wrong?

The foverlaps() function from the data.table package does an overlap join of two data.tables:
library(data.table)
setDT(df1, key = names(df1))
setDT(df2, key = key(df1))
foverlaps(df2, df1, nomatch = 0L)[, -c("i.V2", "i.V3")]
V1 V2 V3 V4
1: chr1 25 85 1.2
2: chr2 345 2300 1.3
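As for the "replacement has length zero" message in the question: one likely cause (an assumption, since the real df1 and df2 are not shown) is that the logical filter inside the loop selects no element for some row, so the right-hand side of the assignment is empty. The sketch below reproduces that situation with made-up vectors; `starts`, `ends`, `scores` and `bin` are hypothetical names, not objects from the question.

```r
# Hypothetical stand-in for one pass of the loop: no stored range passes the filter.
starts <- c(25, 2000)
ends   <- c(85, 3000)
scores <- c(1.2, 0.7)            # made-up values for illustration

# No range fully contains [100, 1000], so the subset is empty.
picked <- scores[starts <= 100 & ends >= 1000]

bin <- rep(NA_real_, 3)
msg <- tryCatch({
  bin[2] <- picked               # assigning a zero-length vector is an error
  "no error"
}, error = function(e) conditionMessage(e))
print(msg)

# Guarded version: assign only when something matched.
if (length(picked) > 0) bin[2] <- picked[1]
```

Note also that the loop's condition tests containment (one range entirely inside the other), which is stricter than overlap.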
Data
library(data.table)
df1 <- fread(
"chr1 25 85
chr1 2000 3000
chr2 345 2300", header = FALSE
)
df2 <- fread(
"chr1 34 45 1.2
chr1 100 1000
chr2 456 1500 1.3", header = FALSE
)
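For readers without data.table, the same overlap join can be sketched in base R. This is not how foverlaps() works internally; it is a plain merge on the chromosome followed by the usual overlap test (start1 <= end2 and start2 <= end1). The column names `q_start` and `q_end` are introduced here only for illustration.

```r
df1 <- data.frame(V1 = c("chr1", "chr1", "chr2"),
                  V2 = c(25, 2000, 345),
                  V3 = c(85, 3000, 2300))
df2 <- data.frame(V1 = c("chr1", "chr1", "chr2"),
                  q_start = c(34, 100, 456),
                  q_end   = c(45, 1000, 1500),
                  V4 = c(1.2, NA, 1.3))

# All pairs of ranges that share a chromosome.
pairs <- merge(df1, df2, by = "V1")

# Two closed intervals overlap when each one starts before the other ends.
hit <- pairs$V2 <= pairs$q_end & pairs$q_start <= pairs$V3

res <- pairs[hit, c("V1", "V2", "V3", "V4")]
rownames(res) <- NULL
print(res)
```

The merge builds every same-chromosome pair, so this sketch is only reasonable for small inputs; foverlaps() is the better choice for large ones.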

Related

Finding overlaps between 2 ranges and their overlapped region lengths?

I need to find the length of the overlapped region on the same chromosome between two groups (gp1 and gp2). Similar questions on Stack Overflow do not cover this, because I want the overlapped region itself rather than a TRUE/FALSE answer.
For example:
gp1:
chr start end id1
chr1 580 600 1
chr1 900 970 2
chr3 400 600 3
chr2 100 700 4
gp2:
chr start end id2
chr1 590 864 1
chr3 550 670 2
chr2 897 1987 3
I'm looking for a way to compare these two groups and get results like this:
id1 id2 chr overlapped_length
1 1 chr1 10
3 2 chr3 50
Should point you in the right direction:
Load libraries
# install.packages("BiocManager")
# BiocManager::install("GenomicRanges")
library(GenomicRanges)
library(IRanges)
Generate data
gp1 <- read.table(text =
"
chr start end id1
chr1 580 600 1
chr1 900 970 2
chr3 400 600 3
chr2 100 700 4
", header = TRUE)
gp2 <- read.table(text =
"
chr start end id2
chr1 590 864 1
chr3 550 670 2
chr2 897 1987 3
", header = TRUE)
Calculate ranges
gr1 <- GenomicRanges::makeGRangesFromDataFrame(
gp1,
seqnames.field = "chr",
start.field = "start",
end.field = "end"
)
gr2 <- GenomicRanges::makeGRangesFromDataFrame(
gp2,
seqnames.field = "chr",
start.field = "start",
end.field = "end"
)
Calculate overlaps
hits <- findOverlaps(gr1, gr2)
p <- Pairs(gr1, gr2, hits = hits)
i <- pintersect(p)
Result
> as.data.frame(i)
seqnames start end width strand hit
1 chr1 590 600 11 * TRUE
2 chr3 550 600 51 * TRUE
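The widths above are 11 and 51 rather than the 10 and 50 asked for, because GRanges treats start and end as inclusive positions, so width = end - start + 1. A minimal base R sketch of the same calculation, assuming the question's convention (length = end - start), is shown below; the column names `ov_start`, `ov_end` and the `.1`/`.2` suffixes are illustrative choices.

```r
gp1 <- data.frame(chr = c("chr1", "chr1", "chr3", "chr2"),
                  start = c(580, 900, 400, 100),
                  end   = c(600, 970, 600, 700),
                  id1 = 1:4)
gp2 <- data.frame(chr = c("chr1", "chr3", "chr2"),
                  start = c(590, 550, 897),
                  end   = c(864, 670, 1987),
                  id2 = 1:3)

# Pair every gp1 range with every gp2 range on the same chromosome.
pairs <- merge(gp1, gp2, by = "chr", suffixes = c(".1", ".2"))

# The intersection runs from the later start to the earlier end.
pairs$ov_start <- pmax(pairs$start.1, pairs$start.2)
pairs$ov_end   <- pmin(pairs$end.1, pairs$end.2)

# Keep only pairs whose intersection is non-empty.
res <- pairs[pairs$ov_start <= pairs$ov_end, ]
res$overlapped_length <- res$ov_end - res$ov_start      # question's convention
res$inclusive_width   <- res$overlapped_length + 1      # GRanges convention
res <- res[, c("id1", "id2", "chr", "overlapped_length", "inclusive_width")]
rownames(res) <- NULL
print(res)
```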

Create multiple percentage columns based on existing columns in R

I want to create multiple columns that show the percentage of each element of col2, col3 and Total. The code I came up with only pastes the percentage into the existing columns instead of putting it in new columns.
I have searched on stack and google but I have not found the answer I was looking for.
Sample data :
data <- data.table(col1= c("A", "B", "C"),
col2= c(43,23,19),
col3= c(102,230,149))
data <- data[, Total := col2 + col3]
data <- janitor::adorn_totals(data)
Output :
col1 col2 col3 Total
A 43 102 145
B 23 230 253
C 19 149 168
Total 85 481 566
My percentage function :
add_percent <- function(dt, col_no_percent, col_percent){
dt <- dt[
, c(.SD[, col_no_percent, with=FALSE],
lapply(.SD[, col_percent, with=FALSE], function(x){
paste0(x, format(round(x / sum(x) * 100 * 2, 1), nsmall = 1, decimal.mark = "."))
}))
]
}
Data output with my function:
data <- add_percent(data, "col1", c("col2", "col3", "Total"))
col1 col2 col3 Total
A 43 50.6 102 21.2 145 25.6
B 23 27.1 230 47.8 253 44.7
C 19 22.4 149 31.0 168 29.7
Total 85 100.0 481 100.0 566 100.0
Data output I want :
col1 col2 col3 Total col2.x col3.x Total.x
A 43 102 145 50.6 21.2 25.6
B 23 230 253 27.1 47.8 44.7
C 19 149 168 22.4 31.0 29.7
Total 85 481 566 100.0 100.0 100.0
It is possible that my data will contain way more columns, so all the new columns will have to be created "automatically". So I would like to know how to generate those columns based on my percent function or even a more efficient way if possible.
Thank you.
Initial Data. Note I removed the janitor step. Will do that part at the end.
data <- data.table(col1= c("A", "B", "C"),
col2= c(43,23,19),
col3= c(102,230,149))
data <- data[, Total := col2 + col3]
Add percent columns for all numeric columns and add "Total" row
cols <- names(data)[sapply(data, is.numeric)]
data[, paste0(cols, '_pct') := lapply(.SD, function(x) 100*x/sum(x))
, .SDcols = cols]
janitor::adorn_totals(data)
# col1 col2 col3 Total col2_pct col3_pct Total_pct
# A 43 102 145 50.58824 21.20582 25.61837
# B 23 230 253 27.05882 47.81705 44.69965
# C 19 149 168 22.35294 30.97713 29.68198
# Total 85 481 566 100.00000 100.00000 100.00000
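If the one-decimal figures from the desired output are wanted, and data.table is not a requirement, the same idea can be sketched in base R. The `_pct` suffix and the rounding to one decimal are choices made here, and the "Total" row is left out.

```r
data <- data.frame(col1 = c("A", "B", "C"),
                   col2 = c(43, 23, 19),
                   col3 = c(102, 230, 149))
data$Total <- data$col2 + data$col3

# Every numeric column gets a companion percentage column.
num_cols <- names(data)[sapply(data, is.numeric)]
pct <- lapply(data[num_cols], function(x) round(100 * x / sum(x), 1))
names(pct) <- paste0(num_cols, "_pct")

out <- cbind(data, as.data.frame(pct))
print(out)
```

Because `num_cols` is derived from the column types, additional numeric columns are picked up without changing the code.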
I know it is a data.table question, but dplyr has a really nice way of doing this. So just to add it as one possible answer.
library(dplyr)
# this is your function (slightly changed)
as_perc <- function(x) {
paste0(format(100 * (round(x/ sum(x), 2)), nsmall = 1, decimal.mark = "."), "%")
}
data %>%
mutate_if(is.numeric, .funs = list(perc = ~ as_perc(.)))
col1 col2 col3 Total col2_perc col3_perc Total_perc
1 A 43 102 145 51.0% 21.0% 26.0%
2 B 23 230 253 27.0% 48.0% 45.0%
3 C 19 149 168 22.0% 31.0% 30.0%

how to intersect and add score to column?

I have two data sets. I would like to find the overlapping (common) regions between them and, where there is an overlap, extract the matching rows from each original table:
Data A:
chr start end
chr1 25 35
chr1 50 70
chr1 60 85
Data B:
chr start end score
chr1 10 15 24
chr1 55 75 14
chr1 76 82 10
Output tables:
Output 1: results of common regions
chr start end
chr1 55 70
chr1 70 75
chr1 76 82
Output 2: extract from data A:
chr start end
chr1 50 70
chr1 60 85
Output 3: extract from data B:
chr start end score
chr1 55 75 14
chr1 76 82 10
I have tried different ways but I do not know which one is the best:
library(GenomicRanges)
enhancer = with(dataA, GRanges(chr, IRanges(start=start, end=end)))
H3K4me1= with(dataB, GRanges(chr, IRanges(start=start, end=end)))
way 1:
hits <- findOverlaps(enhancer, H3K4me1)
ranges(enhancer)[queryHits(hits)] = ranges(H3K4me1)[subjectHits(hits)]
enhancer
H3K4me1
way 2:
over <- subsetByOverlaps(enhancer, H3K4me1)
way 3:
inter = intersect(enhancer, H3K4me1)
way 4:
groupA <- data.table(dataA)
setkey(groupA, chr, start, end)
groupB <- data.table(dataB)
setkey(groupB, chr, start, end)
over <- foverlaps(groupA, groupB, nomatch = 0)
over2 <- data.table(
chr = over$chr,
start = over[, ifelse(start > i.start, start, i.start)],
end = over[, ifelse(end < i.end, end, i.end)])
I'm not sure if this is what you want. Would you mind creating a reproducible example?
DataA <- data.frame(chr = c("chr1", "chr1", "chr1"), start = c(25,50,60), end = c(35,70,85))
DataB <- data.frame(chr = c("chr1", "chr1", "chr1"), start = c(10,55,76), end = c(15,75,82), score = c(24,14,10))
# lookup table: every integer position covered by a DataA range, with that range's row index
luA <- Map(`:`, DataA$start, DataA$end)
luA <- data.frame(value = unlist(luA),
index = rep(seq_along(luA), lengths(luA)))
# first DataA row containing each DataB start (NA if none; chr is not checked)
idx <- luA$index[match(DataB$start, luA$value)]
DataA[idx[!is.na(idx)],]
DataB[!is.na(idx),]
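A more general sketch, assuming that "overlap" means any shared position between a range in A and a range in B on the same chromosome, builds all three tables from a pairwise merge. The helper columns `a_row`, `b_row`, `ov_start` and `ov_end` are introduced only for this illustration.

```r
DataA <- data.frame(chr = "chr1", start = c(25, 50, 60), end = c(35, 70, 85))
DataB <- data.frame(chr = "chr1", start = c(10, 55, 76), end = c(15, 75, 82),
                    score = c(24, 14, 10))

# Remember the original row of each range so it can be extracted later.
DataA$a_row <- seq_len(nrow(DataA))
DataB$b_row <- seq_len(nrow(DataB))

# Every A/B pair on the same chromosome, then the intersection of each pair.
pairs <- merge(DataA, DataB, by = "chr", suffixes = c(".a", ".b"))
pairs$ov_start <- pmax(pairs$start.a, pairs$start.b)
pairs$ov_end   <- pmin(pairs$end.a, pairs$end.b)
pairs <- pairs[pairs$ov_start <= pairs$ov_end, ]

# Output 1: the common regions, one row per overlapping pair.
common <- pairs[order(pairs$ov_start, pairs$ov_end), c("chr", "ov_start", "ov_end")]
# Output 2 and 3: the original rows that took part in at least one overlap.
from_A <- DataA[sort(unique(pairs$a_row)), c("chr", "start", "end")]
from_B <- DataB[sort(unique(pairs$b_row)), c("chr", "start", "end", "score")]

print(common)
print(from_A)
print(from_B)
```

Under this definition the second common region is 60 to 75 (the intersection of 60-85 with 55-75), not 70 to 75 as listed in the question, so the intended definition may differ.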

Summing the counts in a data frame using sliding window

I am new to R. I have a data frame in R like following
df <- data.frame(ID=c(rep("A1",10),rep("A2",13),rep("A3",12)),
Values=c(10,2,4,23,10,5,20,15,13,21,15,9,19,5,14,25,18,19,31,26,4,21,4,6,7,12,15,18,25,20,16,29,21,19,10))
For every ID I would like to sum the counts in column "Values" in a sliding windows for every 3 positions. Following data frame is an excerpt from df which includes only the records corresponding to A1:
ID Values
A1 10
A1 2
A1 4
A1 23
A1 10
A1 5
A1 20
A1 15
A1 13
A1 21
I would like to take 3 entries at a time, sum them, and move on to the next 3 entries. When a sliding window can't accommodate 3 entries, I skip those values.
For example, window_1 starts from the first value (10), window_2 starts from the second value (2), and window_3 starts from the third value (4).
window_1 = [10+2+4] + [23+10+5] + [20+15+13] = 102
window_2 = [2+4+23] + [10+5+20] + [15+13+21] = 113
window_3 = [4+23+10] + [5+20+15] = 77
and report it in a data frame like following:
ID Window_1 Window_2 Window_3
A1 102 113 77
Likewise, I would like to sum the counts in column Values for every ID in the data frame "df" and report them in a data frame like the following:
ID window_1 window_2 window_3
A1 102 113 77
A2 206 195 161
A3 198 163 175
I tried the following code
sum_win_3=0
sum_win_2=0
sum_win_1=0
win_1_counts=0
win_2_counts=0
win_3_counts=0
for (i in seq(1,length(df$Values),3))
{
if((i+i+1+i+2) %% 3 == 0)
{
win_1_counts=df$Values[i]+df$Values[i+1]+df$Values[i+2]
win_1_counts[is.na(win_1_counts)]=0
#print(win_1_counts)
}
sum_win_1=sum_win_1+win_1_counts
}
#print(sum_win_1)
for (j in seq(2,length(df$Values),3))
{
if((j+j+1+j+2) %% 3 == 0)
{
win_2_counts=df$Values[j]+df$Values[j+1]+df$Values[j+2]
win_2_counts[is.na(win_2_counts)]=0
#print(win_2_counts)
}
sum_win_2=sum_win_2+win_2_counts
}
#print(sum_win_2)
for (k in seq(3,length(df$Values),3))
{
if((k+k+1+k+2) %% 3 == 0)
{
win_3_counts=df$Values[k]+df$Values[k+1]+df$Values[k+2]
win_3_counts[is.na(win_3_counts)]=0
#print(win_3_counts)
}
#sum_win_3=sum_win_3+win_3_counts
}
print(sum_win_3)
output=data.frame(ID=df[1],Window_1=sum_win_1,Window_2=sum_win_2,Window_3=sum_win_3)
The above code sums the counts for window_1, window_2 and window_3 by taking all the IDs together rather than working on every ID separately.
Kindly guide me in getting the output in the desired format stated above.
Thanks in advance
Using the data.table package, I would approach it as follows:
library(data.table)
setDT(df)[, .(w1 = sum(Values[1:(3*(.N%/%3))]),
w2 = sum(Values[2:(3*((.N-1)%/%3)+1)]),
w3 = sum(Values[3:(3*((.N-2)%/%3)+2)]))
, by = ID]
which gives:
ID w1 w2 w3
1: A1 102 113 77
2: A2 206 195 161
3: A3 198 163 175
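The index arithmetic in that call can be unpacked as follows: for a window starting at position i in a group of N values, (N - i + 1) %/% 3 full triples fit, so the last index used is i - 1 + 3 * k. Below is a base R sketch of the same rule; `window_sum` is a hypothetical helper written for this illustration, and it returns 0 when not even one triple fits (a case the one-line index expression does not guard against).

```r
df <- data.frame(ID = c(rep("A1", 10), rep("A2", 13), rep("A3", 12)),
                 Values = c(10,2,4,23,10,5,20,15,13,21,15,9,19,5,14,25,18,19,
                            31,26,4,21,4,6,7,12,15,18,25,20,16,29,21,19,10))

# Sum all complete blocks of width w that start at position `offset`.
window_sum <- function(x, offset, w = 3) {
  k <- (length(x) - offset + 1) %/% w     # number of complete blocks
  if (k < 1) return(0)                    # nothing fits: contribute zero
  sum(x[offset:(offset + w * k - 1)])
}

# One row per ID, one column per starting offset.
res <- t(sapply(split(df$Values, df$ID),
                function(v) sapply(1:3, function(i) window_sum(v, i))))
colnames(res) <- paste0("window_", 1:3)
print(res)
```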
Or to avoid the repetition (thanks to @Cath):
setDT(df)[, lapply(1:3, function(i) {sum(Values[i:(3*((.N-i+1)%/%3)+(i-1))])})
, by = ID]
If you want names other than the default V1, V2 & V3, you can rename the columns afterwards, but you can also name them directly:
cols <- c("w1","w2","w3")
setDT(df)[, setNames(lapply(1:3, function(i) {sum(Values[i:(3*((.N-i+1)%/%3)+(i-1))])}), cols)
, by = ID]
This could be done using tapply and aggregate
sumf <- function(x1){
sum(tapply(x1,
(seq_along(x1) -1) %/%3,
function(x) ifelse(length(x) == 3, sum(x), 0)))
}
aggregate(Values ~ ID, data = df,
FUN = function(y){
cbind(sumf(y), sumf(y[-1]), sumf(y[-c(1,2)]))
})
# ID Values.1 Values.2 Values.3
#1 A1 102 113 77
#2 A2 206 195 161
#3 A3 198 163 175
This can also be done using filter
sum.filter <- function(z) tapply(head(tail(as.numeric(
stats::filter(z, c(1,1,1))),-1), -1),
0:(length(z)-3) %% 3 +1, sum)
aggregate(Values ~ ID, data = df, FUN = function(y){ cbind(sum.filter(y) )})
This seems to work:
library(zoo)
wins = function(x, w)
rollapply(x, width = w*((length(x)-seq(w)+1) %/% w), align = "left", sum)
aggregate(Values ~ ID, df, wins, 3)
# ID Values.1 Values.2 Values.3
# 1 A1 102 113 77
# 2 A2 206 195 161
# 3 A3 198 163 175
This is the only answer so far to perform the calculation on a rolling basis, which is usually more efficient.

calculate reads per million mapped read using R

df1 <- read.table(text="
gene_id A1 A2 A3 A4 length Total
ENSMUSG00000000028 58 93 48 58 789 200
ENSMUSG00000000031 11 7 20 16 364 54
ENSMUSG00000000037 3 5 6 98 196 112
ENSMUSG00000000058 66 93 69 71 436 299
ENSMUSG00000000085 55 68 97 67 177 287", header=TRUE)
The table represents the read count in a gene in different samples (A1, A2..A4).
How can I calculate RPKM (reads per kilobase per million mapped reads) for these raw read counts using R?
RPKM = (number of reads in a gene * 1e6)/(Total*length)
out_put <- read.table(text="
gene_id A1 A2 A3 A4
ENSMUSG00000000028 367.5539 589.3536 304.1825 367.5539
ENSMUSG00000000031 559.6256 356.1254 1017.5010 814.0008
ENSMUSG00000000037 136.6618 227.7697 273.3236 4464.2857
ENSMUSG00000000058 506.2747 713.3871 529.2872 544.6289
ENSMUSG00000000085 1082.6985 1338.6090 1909.4864 1318.9236", header=TRUE)
One way to do this without writing lines or a loop is using melt and dcast:
library(reshape2)
m_df1 <- melt(df1, measure.vars=c("A1","A2","A3","A4"))
m_df1$RPKM <- with(m_df1, value*1e6 / (Total*length))
output <- dcast(gene_id~variable,value.var="RPKM",data=m_df1)
> output
gene_id A1 A2 A3 A4
1 ENSMUSG00000000028 367.5539 589.3536 304.1825 367.5539
2 ENSMUSG00000000031 559.6256 356.1254 1017.5010 814.0008
3 ENSMUSG00000000037 136.6618 227.7697 273.3236 4464.2857
4 ENSMUSG00000000058 506.2747 713.3871 529.2872 544.6289
5 ENSMUSG00000000085 1082.6985 1338.6090 1909.4864 1318.9236
A second way is to use sapply to create a matrix of estimates, which you can then either rename and add to your original data, or cbind to your gene_ids.
my_cols <- c("A1","A2","A3","A4")
RPKMs <- sapply(my_cols, function(x){
df1[,x]*1e6/(df1$Total*df1$length)
}
)
output <- data.frame(gene_id = df1$gene_id, RPKMs)
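Since arithmetic on a data frame is vectorised, the same calculation can also be sketched without sapply. The code follows the formula exactly as given in the question (a factor of 1e6, with Total and length taken per gene); the commonly cited RPKM definition instead scales by 1e9 when length is in base pairs and divides by each sample's library size, so treat the constant as the asker's choice.

```r
df1 <- read.table(text = "
gene_id A1 A2 A3 A4 length Total
ENSMUSG00000000028 58 93 48 58 789 200
ENSMUSG00000000031 11 7 20 16 364 54
ENSMUSG00000000037 3 5 6 98 196 112
ENSMUSG00000000058 66 93 69 71 436 299
ENSMUSG00000000085 55 68 97 67 177 287", header = TRUE)

my_cols <- c("A1", "A2", "A3", "A4")

# The per-gene denominator has one value per row, so it is applied down every column.
scaled <- df1[my_cols] * 1e6 / (df1$Total * df1$length)

output <- data.frame(gene_id = df1$gene_id, scaled)
print(output)
```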
You can achieve this also without reshaping. Using the data.table package:
library(data.table)
setDT(df1)[,indx:=.I][, lapply(.SD, function(x) (x * 1e6) / (Total * length)),
by=.(indx,gene_id,length,Total)]
this gives:
indx gene_id length Total A1 A2 A3 A4
1: 1 ENSMUSG00000000028 789 200 367.5539 589.3536 304.1825 367.5539
2: 2 ENSMUSG00000000031 364 54 559.6256 356.1254 1017.5010 814.0008
3: 3 ENSMUSG00000000037 196 112 136.6618 227.7697 273.3236 4464.2857
4: 4 ENSMUSG00000000058 436 299 506.2747 713.3871 529.2872 544.6289
5: 5 ENSMUSG00000000085 177 287 1082.6985 1338.6090 1909.4864 1318.9236
Explanation:
with setDT(df1) you convert the dataframe to a datatable
with [,indx:=.I] you create a unique identifier for each row
with by=.(indx,gene_id,length,Total) you determine the columns by which you want to group the data (these columns will not be transformed), by including the indx you make sure that each row is an unique group
with lapply(.SD, function(x) (x * 1e6) / (Total * length)) you apply the required calculation to each column which is not specified in the by statement
A similar solution with dplyr:
library(dplyr)
func <- function(x,y,z) (x * 1e6) / (y * z)
df1 %>% mutate(indx=seq(1,nrow(.))) %>%
group_by(indx,gene_id,length,Total) %>%
# summarise_each() and funs() are deprecated in current dplyr; across() is the replacement
summarise_each(funs(func(.,Total,length)))
which gives:
indx gene_id length Total A1 A2 A3 A4
(int) (fctr) (int) (int) (dbl) (dbl) (dbl) (dbl)
1 1 ENSMUSG00000000028 789 200 367.5539 589.3536 304.1825 367.5539
2 2 ENSMUSG00000000031 364 54 559.6256 356.1254 1017.5010 814.0008
3 3 ENSMUSG00000000037 196 112 136.6618 227.7697 273.3236 4464.2857
4 4 ENSMUSG00000000058 436 299 506.2747 713.3871 529.2872 544.6289
5 5 ENSMUSG00000000085 177 287 1082.6985 1338.6090 1909.4864 1318.9236
