Common genomic intervals in R - r

I would like to infer shared genomic interval between different samples.
My input:
sample chr start end
NE001 1 100 200
NE001 2 100 200
NE002 1 50 150
NE002 2 50 150
NE003 2 250 300
My expected output:
chr start end freq
1 100 150 2
2 100 150 2
Where the "freq" is the how many samples have contribuited to infer the shared region. In the above example freq = 2 (NE001 and NE002).
Cheers!

If your data is in a data.frame (see below), using the Bioconductor GenomicRanges package I create a GRanges instance, keeping the non-range columns too
library(GenomicRanges)
gr <- makeGRangesFromDataFrame(df, TRUE)
The discrete ranges represented by the data are given by the disjoin function, and the overlap between the disjoint ranges ('query') and your original ('subject') are
d <- disjoin(gr)
olaps <- findOverlaps(d, gr)
Split the sample information associated with each overlapping subject with the corresponding query, and associate it with the disjoint GRanges as
mcols(d) <- splitAsList(gr$sample[subjectHits(olaps)], queryHits(olaps))
leading to for instance
> d[elementLengths(d$value) > 1]
GRanges with 2 ranges and 1 metadata column:
seqnames ranges strand | value
<Rle> <IRanges> <Rle> | <CharacterList>
[1] 1 [100, 150] * | NE001,NE002
[2] 2 [100, 150] * | NE001,NE002
---
seqlengths:
1 2
NA NA
Here's how I input your data:
txt <- "sample chr start end
NE001 1 100 200
NE001 2 100 200
NE002 1 50 150
NE002 2 50 150
NE003 2 250 300"
df <- read.table(textConnection(txt), header=TRUE, stringsAsFactors=FALSE)

Given the context behind this question, I suspect it's going to be worthwhile your learning the GenomicRanges package from Bioconductor.
library(GenomicRanges)
gr <- GRanges(seqnames=df$chr, ranges=IRanges(start=df$start, end=df$end))
ov <- findOverlaps(gr,gr, type="any")
ov <- ov[queryHits(ov) != subjectHits(ov)]
between <- pintersect(gr[subjectHits(ov)], gr[queryHits(ov)])
The approach being: find all self-overlaps, remove the trivial ones where an interval is being compared to itself (4th line), and then finding the intersection between each pair of remaining intervals. You can then tabulate the results however you wish.

This is certainly very long (and likely very inefficient on large data.frames given the expand.grid.df, however, I hope it gives you a starting point. As a caveat, I have no background in genomics (which I'm sure comes through) so had no idea of common packages for this. Surely those are the best way to go. I just thought it would be fun to attempt a solution.
s<-"sample chr start end
NE001 1 100 200
NE001 2 100 200
NE002 1 50 150
NE002 2 50 150
NE003 2 250 300"
dat<-read.table(text=s, header=T)
library(plyr)
between<-function(x,y,z) x<=y & y<=z
dat$id<-seq_along(dat[,1])
expand.grid.df <- function(...) Reduce(function(...) merge(..., by=NULL), list(...))
expdat<-ddply(dat, .(chr), function(x) expand.grid.df(x,x))
expdat<-subset(expdat, id.x!=id.y)
expdat$betweenL<-with(expdat, between(start.y, start.x, end.y))
expdat$betweenR<-with(expdat, between(start.x, start.y, end.x))
expdat<-subset(expdat, betweenL | betweenR)
expdat$commonstart<-with(expdat,ifelse(betweenL,start.x,start.y))
expdat$commonend<-with(expdat, ifelse(betweenL, end.y, end.x))
res<-ddply(expdat, .(chr, commonstart, commonend),summarize, freq=length(sample.x))
> res
chr commonstart commonend freq
1 1 100 150 2
2 2 100 150 2

Related

how to find which rows are related by mathematical difference of x in R

i have a data frame with about 20k IDs of chemical compounds and the corresponding molecular weights, something like this:
ID <- c(1,2,3,4,5)
MASS <- c(324,162,508,675,670)
d <- data.frame(ID, MASS)
ID MASS
1 1 324
2 2 162
3 3 508
4 4 675
5 5 670
I would like to find a way to loop over the rows of the column MASS to find which masses are related by having a difference (positive or negative) of 162∓0.5. Then I would like to have a new column (d$DIFF) where the IDs that are linked by a MASS difference of 162∓0.5 are reported, while get 0 for those IDs when the condition is not met, in this example it would be something like this:
ID MASS DIFF
1 1 324 1&2
2 2 162 1&2
3 3 508 3&5
4 4 675 0
5 5 670 3&5
Thanks in advance for any help
Here's a base R solution using outer:
d$DIFF <- unlist(lapply(apply(outer(d$MASS, d$MASS,
function(x, y) abs((abs(x - y) - 162)) < 0.5), 1, which),
function(x) if(length(x) == 0)
return("0")
else
return(paste(x, collapse = " & "))))
This gives the result:
d
#> ID MASS DIFF
#> 1 1 324 2
#> 2 2 162 1
#> 3 3 508 5
#> 4 4 675 0
#> 5 5 670 3
Note that in your example data, there is at most a single match to other rows, but if you apply this technique to your real data you should get multiple hits for some rows separated by "&" as requested.
You should also note that whatever way you do this in your real data, you will have to make approximately 20K * 20K (400 million) comparisons, so it may take some time to complete, and may result in memory issues depending on your set-up.

Calculation within a pipe between different rows of a data frame

I have a tibble with a column of different numbers. I wish to calculate for every one of them how many others before them are within a certain range.
For example, let's say that range is 200 ; in the tibble below the result for the 5th number would be 2, that is the cardinality of the list {816, 705} whose numbers are above 872-1-200 = 671 but below 872.
I have thought of something along the lines of :
for every theRow of the tibble, do calculate the vector theTibble$number_list between(X,Y) ;
summing the boolean returned vector.
I have been told that using loops is less efficient.
Is there a clean way to do this within a pipe without using loops?
Not the way you asked for it, but you can use a bit of linear algebra. Should be more efficient and more simple than a loop.
number_list <- c(248,650,705,816,872,991,1156,1157,1180,1277)
m <- matrix(number_list, nrow = length(number_list), ncol = length(number_list))
d <- (t(m) - number_list)
cutoff <- 200
# I used setNames to name the result, but you do not need to
# We count inclusive of 0 in case of ties
setNames(colSums(d >= 0 & d < cutoff) - 1, number_list)
Which gives you the following named vector.
248 650 705 816 872 991 1156 1157 1180 1277
0 0 1 2 2 2 1 2 3 3
Here is another way that is pipe-able using rollapply().
library(zoo)
cutoff <- 200
df %>%
mutate(count = rollapply(number_list,
width = seq_along(number_list),
function(x) sum((tail(x, 1) - head(x, -1)) <= cutoff),
align = "right"))
Which gives you another column.
# A tibble: 10 x 2
number_list count
<int> <int>
1 248 0
2 650 0
3 705 1
4 816 2
5 872 2
6 991 2
7 1156 1
8 1157 2
9 1180 3
10 1277 3

How to resample data by clusters (block sampling) with replacement in R using Sampling package

This is my dummy data:
income <- as.data.frame.vector <- sample(1000:10000, 1000, replace=TRUE)
individuals <- as.data.frame.vector <- sample(1:50,1000,replace=TRUE)
datatest <- as.data.frame (cbind (income, individuals))
I know I can sample by individual rows with this code:
sample <- datatest[sample(nrow(datatest), replace=TRUE),]
Now, I want to extract random samples with replacement and equal probabilities of the dataset but sampling complete blocks of observations with the same individual code.
Note that there are 50 individuals, but 1000 observations. Some observations belong to the same individual, so I want to sample by individuals (clusters, in this case), not observations. I don't mind if the extracted samples differ slightly in the number of observations. How can I do that?
I have tried:
library(sampling)
samplecluster <- cluster (datatest, clustername=c("individuals"), size=50,
method="srswr")
But the outcome is not the sampled data. Am I missing something?
Well, it seems I was indeed missing something. After the cluster command you need to apply the getdata command (all from the Sampling Package). This way I do get the sample as I wanted, plus some additional columns.
samplecluster <- cluster (datatest, clustername=c("personid"), size=50, method="srswr")
Gives you:
head(samplecluster)
individuals ID_unit Replicates Prob
1 1 259 2 0.63583
2 1 178 2 0.63583
3 1 110 2 0.63583
4 1 153 2 0.63583
5 1 941 2 0.63583
6 1 667 2 0.63583
Then using getdata, I also get the original data on income sampled by whole clusters:
datasample <- getdata (datatest, samplecluster)
head(datasample)
income individuals ID_unit Replicates Prob
1 8567 1 259 2 0.63583
2 2701 1 178 2 0.63583
3 4998 1 110 2 0.63583
4 3556 1 153 2 0.63583
5 2893 1 941 2 0.63583
6 7581 1 667 2 0.63583
I am not sure if I am missing something. If you just want some of your individuals, you can create a smaller sample of them:
ind.sample <- sample(1:50, size = 10)
print(ind.sample)
# [1] 17 43 38 39 28 23 35 47 9 13
my.sample <- datatest[datatest$individuals %in% ind.sample) ,]
head(my.sample)
# income individuals
#21 9072 17
#97 5928 35
#122 9130 43
#252 4388 43
#285 8083 28
#287 1065 35
I guess a more generic approach would be to generate random indexes;
ind.unique <- unique(individuals)
ind.sample.index <- sample(1:length(ind.unique), size = 10)
ind.sample <- ind.unique[ind.sample.index]
print(ind.sample[order(ind.sample)])
my.sample <- datatest[datatest$individuals %in% ind.sample, ]
ind.counts <- aggregate(income ~ individuals, my.sample, FUN = length)
print(ind.counts)
I think its important to note that the dataset still needs to be expanded to include all the replicates.
sw<-data.frame(datasample[rep(seq_len(dim(datasample)[1]), datasample$Replicates),, drop = FALSE], row.names=NULL)
Might be helpful to someone

Create dataframe containing only matching data from 2 dataframes in R

I've seen several posts on similar topics to this but I can't seem to make it work for my needs. I have 2 data frames, df1 and df2. df1 is quite large, df 2 is small.
df1
Chr start end Count
1 0 50 20
1 51 100 40
2 0 50 100
2 51 100 30
2 101 150 7
df2
Chr coord Name
1 25 X
2 75 Y
What I would like is to return rows which contain only those that match Chr exactly (df1$Chr == df2$Chr) and where df2$coord falls in the range of df1 start and end (df2$coord >= df1$start & df2$coord <= df1$end)
The end result (ideally) should look like this:
Chr start end Count coord Name
1 0 50 20 25 X
2 51 100 30 75 Y
I know this is probably a basic problem but any help would be greatly appreciated.
This linked question by thelatemail gives the solution: Comparing multiple columns in different data sets to find values within range R That question is somewhat muddled and unclear.
This is a duplicate of that question, but this question is clearer and much more readable.
x <- merge(df1, df2)
with(x, x[coord >= start & coord <= end,])
## Chr start end Count coord Name
## 1 1 0 50 20 25 X
## 4 2 51 100 30 75 Y

Finding overlapping ranges between two interval data

I have one table with coordinates (start, end) of ca. 500000 fragments and another table with 60000 single coordinates that I would like to match with the former fragments. I.e., for each record from dtCoords table I need to search a record in dtFrags table having the same chr and start<=coord<=end (and retrieve the type from this record of dtFrags). Is it good idea at all to use R for this, or I should rather look to other languages?
Here is my example:
require(data.table)
dtFrags <- fread(
"id,chr,start,end,type
1,1,100,200,exon
2,2,300,500,intron
3,X,400,600,intron
4,2,250,600,exon
")
dtCoords <- fread(
"id,chr,coord
10,1,150
20,2,300
30,Y,500
")
At the end, I would like to have something like this:
"idC,chr,coord,idF,type
10, 1, 150, 1, exon
20, 2, 300, 2, intron
20, 2, 300, 4, exon
30, Y, 500, NA, NA
"
I can simplify a bit the task by splitting the table to subtables by chr, so I would concentrate only on coordinates
setkey(dtCoords, 'chr')
setkey(dtFrags, 'chr')
for (chr in unique(dtCoords$chr)) {
dtCoordsSub <- dtCoords[chr];
dtFragsSub <- dtFrags[chr];
dtCoordsSub[, {
# ????
}, by=id]
}
but it's still not clear for me how should I work inside... I would be very grateful for any hints.
UPD. just in case, I put my real table in the archive here. After unpacking to your working directory, tables can be loaded with the following code:
dtCoords <- fread("dtCoords.txt", sep="\t", header=TRUE)
dtFrags <- fread("dtFrags.txt", sep="\t", header=TRUE)
In general, it's very appropriate to use the bioconductor package IRanges to deal with problems related to intervals. It does so efficiently by implementing interval tree. GenomicRanges is another package that builds on top of IRanges, specifically for handling, well, "Genomic Ranges".
require(GenomicRanges)
gr1 = with(dtFrags, GRanges(Rle(factor(chr,
levels=c("1", "2", "X", "Y"))), IRanges(start, end)))
gr2 = with(dtCoords, GRanges(Rle(factor(chr,
levels=c("1", "2", "X", "Y"))), IRanges(coord, coord)))
olaps = findOverlaps(gr2, gr1)
dtCoords[, grp := seq_len(nrow(dtCoords))]
dtFrags[subjectHits(olaps), grp := queryHits(olaps)]
setkey(dtCoords, grp)
setkey(dtFrags, grp)
dtFrags[, list(grp, id, type)][dtCoords]
grp id type id.1 chr coord
1: 1 1 exon 10 1 150
2: 2 2 intron 20 2 300
3: 2 4 exon 20 2 300
4: 3 NA NA 30 Y 500
Does this work?
You can use merge first and then subset
kk<-merge(dtFrags,dtCoords,by="chr",all.x=TRUE)
> kk
chr id.x start end type id.y coord
1: 1 1 100 200 exon 10 150
2: 2 2 300 500 intron 20 300
3: 2 4 250 600 exon 20 300
4: X 3 400 600 intron NA NA
kk[coord>=start & coord<=end]
chr id.x start end type id.y coord
1: 1 1 100 200 exon 10 150
2: 2 4 250 600 exon 20 300

Resources