How to bypass a nested for loop? - r

So the situation is this:
I have one data frame with about 100,000 rows of data. I am interested in a particular column, POS, and I want to check whether each value of POS falls between two values, Start and End, in another data frame, and keep track of how many such instances there are.
E.g., in my first data frame, I have something like
ID POS
A 20
B 533
C 600
And in my other data frame, I have stuff like
START END
123 150
489 552
590 600
I want to know how many items in POS are in any of the START-END ranges. So in this case, there would be 2 items. Also, if possible, can I get the IDs of the ones with POS between Start and End, too?
How can I go about doing that without having to use a nested for loop?

This is a fairly common problem which might happen in the context of a database. Here is a solution using sqldf:
library(sqldf)
query <- "SELECT POS, ID FROM df1 INNER JOIN df2 "
query <- paste0(query, "ON df1.POS BETWEEN df2.START AND df2.END")
sqldf(query)
If the ranges in your second data frame might overlap, then the above query could return more than one result for a given POS value. In this case, replace SELECT POS with SELECT DISTINCT POS.

We can use a non-equi join with data.table
library(data.table)
setDT(df1)[df2, on = .(POS >= START, POS <= END)][, sum(!is.na(ID))]
#[1] 2

We can achieve the same using mapply in base-R as:
df1[mapply(function(x)any(x >= df2$START & x <= df2$END),df1$POS),]
# ID POS
#2 B 533
#3 C 600
Data
df1 <- read.table(text =
"ID POS
A 20
B 533
C 600", header = TRUE)
df2 <- read.table(text =
"START END
123 150
489 552
590 600", header = TRUE)
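For a vectorised base R take on the same idea, outer() can compare every POS against every range at once. This is only a sketch on the sample data; the names inside and hit are made up for illustration, and the intermediate matrices have nrow(df1) x nrow(df2) cells, so it may not suit very large inputs.

```r
# Sample data from the question
df1 <- data.frame(ID = c("A", "B", "C"), POS = c(20, 533, 600),
                  stringsAsFactors = FALSE)
df2 <- data.frame(START = c(123, 489, 590), END = c(150, 552, 600))

# Rows correspond to POS values, columns to ranges;
# TRUE where a POS lies inside a range (bounds inclusive)
inside <- outer(df1$POS, df2$START, ">=") & outer(df1$POS, df2$END, "<=")

# A POS counts once, even if it falls in several ranges
hit <- rowSums(inside) > 0

sum(hit)      # number of POS values inside any range: 2
df1$ID[hit]   # their IDs: "B" "C"
```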

Data frame: main
ID POS
A 20
B 533
C 600
Data frame: ran
START END
123 150
489 552
590 600
A simple sapply should suffice for your use case:
sapply(main$POS, function(x) { sum(x>=ran$START & x<=ran$END) })
will return:
[1] 0 1 1
You could bind this back to a new column in your main data frame:
main$Count <- sapply(main$POS, function(x) { sum(x>=ran$START & x<=ran$END) })
ID POS Count
1 A 20 0
2 B 533 1
3 C 600 1
This should also work with overlapping ranges.
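To illustrate the overlapping case, here is a small sketch with a made-up second data frame (ran_overlap is hypothetical, not from the question) in which two ranges overlap. It uses vapply instead of sapply only so that the result type is fixed; the counting logic is the same.

```r
main <- data.frame(ID = c("A", "B", "C"), POS = c(20, 533, 600),
                   stringsAsFactors = FALSE)

# Hypothetical ranges: 489-552 and 500-610 overlap
ran_overlap <- data.frame(START = c(123, 489, 500), END = c(150, 552, 610))

# Number of ranges that contain each POS
counts <- vapply(main$POS,
                 function(x) sum(x >= ran_overlap$START & x <= ran_overlap$END),
                 integer(1))

counts                  # 0 2 1
sum(counts > 0)         # POS values inside at least one range: 2
main$ID[counts > 0]     # "B" "C"
```

Comparing counts > 0 rather than summing counts avoids counting a POS twice when ranges overlap.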

Related

How to match rows within in a range of another dataset [duplicate]

I have a genetic dataset where I am matching chromosome positions in the genome of 1 file if they fit within chromosome position ranges given in another file.
There are similar questions to this that I have tried, mostly with time intervals, but they haven't worked due to me needing to make sure the chromosome number is also matching (so I don't match identical positions but on differing chromosomes)
My data looks like this:
#df1 - chromosome positions to find within df2 ranges:
Chromosome Position Start End
1 101 101 101
2 101 101 101
3 600 600 600
#df2 - genomic ranges
Chromosome Start End CpG
1 50 200 10
1 300 400 2
4 100 200 5
Expected matched output (also ultimately I am looking to find the matching CpG column for df1 data):
Chromosome Position Start End CpG
1 101 50 200 10 #only row of df1 that's within a range on df2 on the same chromosome
I am currently trying to do this with:
df <-df1 %>%
left_join(df2,
by = "Chromosome") %>%
filter(Position >= Start & Position <= End)
Error: Problem with `filter()` input `..1`.
x object 'Start' not found
i Input `..1` is `Position >= Start & Position <= End`.
I don't understand how I am getting this error, the Start and End columns exist in both files and are all integer data classes - is there something I'm missing or another way I can solve this?
My actual data is quite large so also if a data.table solution works for this I am also trying to find it - I've tried but I'm a novice and haven't got far:
df1[df2, on = .(Chromosome, Position > End, Position < Start ) ]
Edit: trying with foverlaps:
setkey(df1)
df2[, End := Start]
foverlaps(df2, df1, by.x = names(df2), type = "within", mult = "all", nomatch = 0L)
Error in foverlaps(df2, df1, by.x = names(df2), type = "within", mult = "all", :
length(by.x) != length(by.y). Columns specified in by.x should correspond to columns specified in by.y and should be of same lengths.
For a data.table solution, you should have looked at the second answer by Arun on non-equi joins in the link provided by @Henrik.
Overlap join with start and end positions
Based on that, we have
library(data.table)
df1 <- data.table(Chromosome=1:3,Position=c(101,101,600),
Start=c(101,101,600),End=c(101,101,600))
df2 <- data.table(Chromosome=c(1,1,4),
Start=c(50,300,100),End=c(200,400,200),CpG=c(10,2,5))
df1[df2,.(Chromosome,Position=x.Position,Start,End,CpG),
on=.(Chromosome,Position>=Start,Position<=End),nomatch=0L]
giving
Chromosome Position Start End CpG
1: 1 101 101 101 10
That's not quite right because it takes Start and End from df1 rather than df2. Why do you even have Start and End in df1?
One way to deal with that is to not include them in the join statement:
df1[,.(Chromosome,Position)][df2,
.(Chromosome,Position=x.Position,Start,End,CpG),
on=.(Chromosome,Position>=Start,Position<=End),nomatch=0L]
giving
Chromosome Position Start End CpG
1: 1 101 50 200 10
[EDIT to note that @Carles Sans Fuentes identified the same issue in his dplyr answer.]
As a check on cases with more matches, I added some more data:
df1 <- data.table(Chromosome=c(1,1:4),Position=c(350,101,101,600,200),
Start=c(350,101,101,600,200),End=c(350,101,101,600,200))
df1
Chromosome Position Start End
1: 1 350 350 350
2: 1 101 101 101
3: 2 101 101 101
4: 3 600 600 600
5: 4 200 200 200
df1[,.(Chromosome,Position)][df2,
.(Chromosome,Position=x.Position,Start,End,CpG),
on=.(Chromosome,Position>=Start,Position<=End),nomatch=0L]
Chromosome Position Start End CpG
1: 1 101 50 200 10
2: 1 350 300 400 2
3: 4 200 100 200 5
Which I guess to be what you'd want.
The problem comes from left_join(): both data frames have columns named Start and End, and a data frame cannot hold two columns with the same name, so the join renames them to Start.x and End.x (from df1) and Start.y and End.y (from df2). The filter() call then fails because no column called Start exists any more.
Therefore, you must either remove the Start and End columns from the first dataset:
library(data.table)
library(tidyr)
library(dplyr)
df1 <- fread("Chromosome Position Start End
1 101 101 101
2 101 101 101
3 600 600 600")
df2<- fread("Chromosome Start End CpG
1 50 200 10
1 300 400 2
4 100 200 5")
df <-df1 %>%select(Chromosome, Position)%>%
left_join(df2,
by = "Chromosome") %>%
filter(Position >= Start & Position <= End)
or refer to the real name of the columns and then remove the extra cols:
df <-df1 %>%
left_join(df2,
by = "Chromosome") %>%
filter(Position >= Start.y & Position <= End.y)
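Base R's merge() applies the same .x/.y suffixes, so the renaming can be seen without dplyr. The following is only a sketch on the sample data; joined and matched are illustrative names.

```r
df1 <- data.frame(Chromosome = c(1, 2, 3), Position = c(101, 101, 600),
                  Start = c(101, 101, 600), End = c(101, 101, 600))
df2 <- data.frame(Chromosome = c(1, 1, 4), Start = c(50, 300, 100),
                  End = c(200, 400, 200), CpG = c(10, 2, 5))

# Join on Chromosome only; the clashing Start/End columns get suffixes
joined <- merge(df1, df2, by = "Chromosome")
names(joined)
# "Chromosome" "Position" "Start.x" "End.x" "Start.y" "End.y" "CpG"

# Filter on the range columns that came from df2 (the .y ones)
matched <- joined[joined$Position >= joined$Start.y &
                  joined$Position <= joined$End.y, ]
matched[, c("Chromosome", "Position", "Start.y", "End.y", "CpG")]
```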
Cheers !

Finding patterns across rows of data.table in R

I am trying to find patterns across rows of a data.table while still maintaining the linkages of data across the rows. Here is a reduced example:
Row ID Value
1 C 1000
2 A 500
3 T -200
4 B 5000
5 T -900
6 A 300
I would like to search for all instances of "ATB" in successive rows and output the integers from the value column. Ideally, I want to bin the number of instances as well. The output table would look like this:
String Frequency Value1 Value2 Value 3
ATB 1 500 -200 5000
CAT 1 1000 500 -200
Since the data.table packages seems to be oriented towards providing operations on a column or row-wise basis I thought this should be possible. However, I haven't the slightest idea where to start. Any pointers in the right direction would be greatly appreciated.
Thanks!
library("plyr")
library("stringr")
df <- read.table(header = TRUE, text = "Row ID Value
1 C 1000
2 A 500
3 T -200
4 B 5000
5 T -900
6 A 300
7 C 200
8 A 700
9 T -500")
sought <- c("ATB", "CAT", "NOT")
ids <- paste(df$ID, collapse = "")
ldply(sought, function(id) {
found <- str_locate_all(ids, id)
if (nrow(found[[1]])) {
vals <- outer(found[[1]][,"start"], 0:2, function(x, y) df$Value[x + y])
} else {
vals <- as.list(rep(NA, 3))
}
data.frame(ID = id, Count = str_count(ids, id),
setNames(as.data.frame(vals), paste0("Value", 1:3)))
})
Here's a solution using stringr and plyr. The ids are collapsed into a single string, all instances of each target located and then a data frame constructed with the relevant columns.
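The same idea can be sketched in base R with gregexpr(), under the assumption that every ID is a single character (so string positions map directly to row numbers). The helper find_runs is hypothetical, and gregexpr() reports non-overlapping matches only.

```r
df <- data.frame(ID = c("C", "A", "T", "B", "T", "A", "C", "A", "T"),
                 Value = c(1000, 500, -200, 5000, -900, 300, 200, 700, -500),
                 stringsAsFactors = FALSE)

# Hypothetical helper: locate `pattern` in the collapsed ID string and
# return one row of values per match, or NULL when there is no match
find_runs <- function(df, pattern) {
  ids <- paste(df$ID, collapse = "")
  starts <- as.integer(gregexpr(pattern, ids, fixed = TRUE)[[1]])
  if (starts[1] == -1L) return(NULL)   # gregexpr signals "no match" with -1
  width <- nchar(pattern)
  # vapply gives a width x n matrix; transpose to one row per match
  vals <- t(vapply(starts, function(s) df$Value[s:(s + width - 1L)],
                   numeric(width)))
  colnames(vals) <- paste0("Value", seq_len(width))
  data.frame(String = pattern, Frequency = length(starts), vals)
}

find_runs(df, "CAT")   # two matches, starting at rows 1 and 7
find_runs(df, "ATB")   # one match, starting at row 2
find_runs(df, "NOT")   # NULL
```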

Eliminating rows from a data.frame

I have this example data.frame:
df <- data.frame(id=c("a","a,b,c","d,e","d","h","e","i","b","c"), start=c(100,100,400,400,800,500,900,200,300), end=c(150,350,550,450,850,550,950,250,350), level = c(1,5,2,3,6,4,2,1,1))
> df
id start end level
1 a 100 150 1
2 a,b,c 100 350 5
3 d,e 400 550 2
4 d 400 450 3
5 h 800 850 6
6 e 500 550 4
7 i 900 950 2
8 b 200 250 1
9 c 300 350 1
where each row is a linear interval.
As this example shows some rows are merged intervals (rows 2 and 3).
What I'd like to do is for each merged interval either eliminate all its individual parts from df if the df$level of the merged interval is greater than that of all its parts, or if the df$level of the merged interval is smaller than at least one of its parts eliminate the merged interval.
So for this example, the output should be:
> res.df
id start end level
1 a,b,c 100 350 5
2 d 400 450 3
3 h 800 850 6
4 e 500 550 4
5 i 900 950 2
Method 1 (ID values)
If we can assume that every "merged" group has an ID that is a comma-separated list of the individual groups, then we can tackle this problem by looking only at the IDs and ignoring the start/end information. Here is one such method.
First, find all the "merged" groups by finding the IDs with commas
groups<-Filter(function(x) length(x)>1,
setNames(strsplit(as.character(df$id),","),df$id))
Now, for each of those groups, determine who has the larger level, either the merged group or one of the individual groups. Then return the index of the rows to drop as a negative number
drops<-unlist(lapply(names(groups), function(g) {
mi<-which(df$id==g)
ii<-which(df$id %in% groups[[g]])
if(df[mi, "level"] > max(df[ii, "level"])) {
return(-ii)
} else {
return(-mi)
}
}))
And finally, drop those from the data.frame
df[drops,]
# id start end level
# 2 a,b,c 100 350 5
# 4 d 400 450 3
# 5 h 800 850 6
# 6 e 500 550 4
# 7 i 900 950 2
Method 2 (Start/End Graph)
I also wanted to try a method that ignored the (very useful) merged ID names and just looked at the start/end positions. I may have gone off in a bad direction, but this led me to think of it as a network/graph type problem, so I used the igraph library.
I created a graph where each vertex represented a start/end position. Each edge therefore represented a range. I used all the ranges from the sample data set and filled in any missing ranges to make the graph connected. I merged that data together to create an edge list. For each edge, I remember the "level" and "id" values from the original data set. Here's the code to do that
library(igraph)
poslist<-sort(unique(c(df$start, df$end)))
seq.el<-embed(rev(poslist),2)
class(seq.el)<-"character"
colnames(seq.el)<-c("start","end")
el<-rbind(df[,c("start","end","level", "id")],data.frame(seq.el, level=0, id=""))
el<-el[!duplicated(el[,1:2]),]
gg<-graph.data.frame(el)
And that creates the graph (plot not shown).
So basically we want to eliminate cycles in the graph by taking the path with the edge that has the maximum "level" value. Unfortunately, since this isn't a normal path-weighting scheme, I didn't find an easy way to do this with a default algorithm (maybe I missed it). So I had to write my own graph traversal function. It's not as pretty as I would have liked, but here it is.
findPaths <- function(gg, fromv, tov) {
if ((missing(tov) && length(incident(gg, fromv, "in"))>1) ||
(!missing(tov) && V(gg)[fromv]==V(gg)[tov])) {
return (list(level=0, path=numeric()))
}
es <- E(gg)[from(fromv)]
if (length(es)>1) {
pp <- lapply(get.edges(gg, es)[,2], function(v) {
edg <- E(gg)[fromv %--% v]
lvl <- edg$level
nxt <- findPaths(gg,v)
return (list(level=max(lvl, nxt$level), path=c(edg,nxt$path)))
})
lvl <- sapply(pp, `[[`, "level")
take <- pp[[which.max(lvl)]]
nxt <- findPaths(gg, get.edges(gg, tail(take$path,1))[,2], tov)
return (list(level=max(take$level, nxt$level), path=c(take$path, nxt$path)))
} else {
lvl <- E(gg)[es]$level
nv <- get.edges(gg,es)[,2]
nxt <- findPaths(gg, nv, tov)
return (list(level=max(lvl, nxt$level), path=c(es, nxt$path)))
}
}
This will find a path between two nodes that satisfies the property of having a maximal level when presented with a branch. We call that with this data set with
rr <- findPaths(gg, "100","950")$path
This will find the final path. Since each row in the original df data.frame is represented by an edge, we just need to extract the edges from the path that correspond to the final path. In the plot of the result (not shown), the chosen path is drawn in red. I can then subset df with
df[df$id %in% na.omit(E(gg)[rr]$id), ]
# id start end level
# 2 a,b,c 100 350 5
# 4 d 400 450 3
# 5 h 800 850 6
# 6 e 500 550 4
# 7 i 900 950 2
Method 3 (Overlap Matrix)
Here's another way to look at the start/stop positions. I create a matrix where columns correspond to ranges in the rows of the data.frame and the rows of the matrix correspond to positions. Each value in the matrix is true if a range overlaps a position. Here I use the between.R helper function
#find unique positions and create overlap matrix
un<-sort(unique(unlist(df[,2:3])))
cc<-sapply(1:nrow(df), function(i) between(un, df$start[i], df$end[i]))
#partition into non-overlapping sections
groups<-cumsum(c(F,rowSums(cc[-1,]& cc[-nrow(cc),])==0))
#find the IDs to keep from each section
keeps<-lapply(split.data.frame(cc, groups), function(m) {
lengths <- colSums(m)
mx <- which.max(lengths)
gx <- setdiff(which(lengths>0), mx)
if(length(gx)>0) {
if(df$level[mx] > max(df$level[gx])) {
mx
} else {
gx
}
} else {
mx
}
})
This will give a list of the IDs to keep from each group, and we can get the final data.set with
df[unlist(keeps),]
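The between.R helper is not reproduced in the answer. A minimal stand-in, assuming it tests whether values lie in a closed interval (both bounds inclusive), could look like the sketch below; this definition is an assumption, not the original helper.

```r
# Assumed stand-in for the between.R helper: TRUE where
# lower <= x <= upper, with both bounds inclusive
between <- function(x, lower, upper) {
  x >= lower & x <= upper
}

# Example: which of these positions fall inside the range 100-350?
positions <- c(100, 150, 350, 400)
between(positions, 100, 350)   # TRUE TRUE TRUE FALSE
```

The dplyr and data.table packages also export a function named between with inclusive bounds by default, so loading either of them would be another option.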
Method 4 (Open/Close Listing)
I have one last method. This one might be the most scalable. We basically melt the positions and keep track of opening and closing events to identify the groups. Then we split and see if the longest in each group has the max level or not. Ultimately we return the IDs. This method uses all standard base functions.
#create open/close listing
dd<-rbind(
cbind(df[,c(1,4)],pos=df[,2], evt=1),
cbind(df[,c(1,4)],pos=df[,3], evt=-1)
)
#annotate with useful info
dd<-dd[order(dd$pos, -dd$evt),]
dd$open <- cumsum(dd$evt)
dd$group <- cumsum(c(0,head(dd$open,-1)==0))
dd$width <- ave(dd$pos, dd$id, FUN=function(x) diff(range(x)))
#slim down
dd <- subset(dd, evt==1,select=c("id","level","width","group"))
#process each group
ids<-unlist(lapply(split(dd, dd$group), function(x) {
if(nrow(x)==1) return(x$id)
mw<-which.max(x$width)
ml<-which.max(x$level)
if(mw==ml) {
return(x$id[mw])
} else {
return(x$id[-mw])
}
}))
and finally subset
df[df$id %in% ids, ]
by now I think you know what this returns
Summary
So if your real data has the same type of IDs as the sample data, method 1 is obviously the better, more direct choice. I'm still hoping there is a way to simplify method 2 that I'm just missing. I've not done any testing on the efficiency or performance of these methods. I'm guessing method 4 might be the most efficient, since it should scale linearly.
I'll take a procedural approach; basically, sort descending by level,
and for each record, remove later records that have a matching id.
df <- data.frame(id=c("a","a,b,c","d,e","d","h","e","i","b","c"), start=c(100,100,400,400,800,500,900,200,300), end=c(150,350,550,450,850,550,950,250,350),
level = c(1,5,2,3,6,4,2,1,1), stringsAsFactors=FALSE)
#sort
ids <- df[order(df$level, decreasing=TRUE), "id"]
#split
ids <- sapply(ids, strsplit, ",")
i <- 1
while( i < length(ids)) {
current <- ids[[i]]
j <- i + 1
while(j <= length(ids)) {
if(any(ids[[j]] %in% current))
ids[[j]] <- NULL
else
j <- j + 1
}
i <- i + 1
}
And finally, only keep the ids that are left:
R> ids <- data.frame(id=names(ids), stringsAsFactors=FALSE)
R> merge(ids, df, sort=FALSE)
id start end level
1 h 800 850 6
2 a,b,c 100 350 5
3 e 500 550 4
4 d 400 450 3
5 i 900 950 2
This has ugly while loops because R only has for-each loops. Also note that stringsAsFactors=FALSE is necessary for splitting the ids. Deleting middle elements could be bad for performance, but that will depend on the underlying implementation R uses for lists (linked vs arrays).
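One way to avoid deleting list elements inside the loop is to walk the rows in descending level order and record which individual ids have already been claimed. This is a sketch of that variant, not the code above; keep and seen are illustrative names.

```r
df <- data.frame(id = c("a", "a,b,c", "d,e", "d", "h", "e", "i", "b", "c"),
                 start = c(100, 100, 400, 400, 800, 500, 900, 200, 300),
                 end = c(150, 350, 550, 450, 850, 550, 950, 250, 350),
                 level = c(1, 5, 2, 3, 6, 4, 2, 1, 1),
                 stringsAsFactors = FALSE)

parts <- strsplit(df$id, ",")                 # individual ids per row
ord <- order(df$level, decreasing = TRUE)     # visit highest level first

keep <- logical(nrow(df))
seen <- character(0)                          # ids claimed by kept rows
for (i in ord) {
  if (!any(parts[[i]] %in% seen)) {
    keep[i] <- TRUE
    seen <- c(seen, parts[[i]])
  }
}

df[keep, ]   # rows "a,b,c", "d", "h", "e", "i", in original row order
```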

Create a new data frame based on another dataframe

I am trying to use a huge dataframe (180000 x 400) to calculate another one that would be much smaller.
I have the following dataframe
df1=data.frame(LOCAT=c(1,2,3,4,5,6),START=c(120,345,765,1045,1347,1879),END=c(150,390,802,1120,1436,1935),CODE1=c(1,1,0,1,0,0),CODE2=c(1,0,0,0,-1,-1))
df1
LOCAT START END CODE1 CODE2
1 1 120 150 1 1
2 2 345 390 1 0
3 3 765 802 0 0
4 4 1045 1120 1 0
5 5 1347 1436 0 -1
6 6 1879 1935 0 -1
This is a sample dataframe. The rows continue until 180000 and the columns are over 400.
What I need to do is create a new dataframe for each column that tells me the size of each continuous run of "1" or "-1" and returns it with the location, size and value.
Something like this for CODE1:
LOCAT SIZE VALUE
1 1 to 2 270 POS
2 4 to 4 75 POS
And like this for CODE2:
LOCAT SIZE VALUE
1 1 to 1 30 POS
2 5 to 6 588 NEG
Unfortunately I still didn't figure out how to do this. I have been trying several lines of code to develop a function to do this automatically but start to get lost or stuck in loops and it seems that nothing works.
Any help would be appreciated.
Thanks in advance
Below is code that gives you the answer in the exact format that you wanted, except I split your "LOCAT" column into two columns entitled "Starts" and "Stops". This code will work for your entire data frame, no need to replicate it manually for each CODE (CODE1, CODE2, etc).
It assumes that the only non-CODE columns have the names "LOCAT", "START" and "END".
# need package "plyr"
library("plyr")
# test2 is the example data frame that you gave in the question
test2 <- data.frame(
"LOCAT"=1:6,
"START"=c(120,345,765, 1045, 1347, 1879),
"END"=c(150,390,802,1120,1436, 1935),
"CODE1"=c(1,1,0,1,0,0),
"CODE2"=c(1,0,0,0,-1,-1)
)
codeNames <- names(test2)[!names(test2)%in%c("LOCAT","START","END")] # the names of columns that correspond to different codes
test3 <- reshape(test2, varying=codeNames, direction="long", v.names="CodeValue", timevar="Code") # reshape so the different codes are variables grouped into the same column
test4 <- test3[,!names(test3)%in%"id"] #remove the "id" column
sss <- function(x){ # sss gives the starting points, stopping points, and sizes (sss) in a data frame
rleX <- rle(x[,"CodeValue"]) # rle() to get the size of consecutive values
stops <- cumsum(rleX$lengths) # cumulative sum to get the end-points for the indices (the second value in your LOCAT column)
starts <- c(1, head(stops,-1)+1) # the starts are the first value in your LOCAT column
ssX0 <- data.frame("Value"=rleX$values, "Starts"=starts, "Stops"=stops) #the starts and stops from X (ss from X)
ssX <- ssX0[ssX0[,"Value"]!=0,] # remove the rows that correspond to CODE_ values that are 0 (not POS or NEG)
# The next 3 lines calculate the equivalent of your SIZE column
sizeX1 <- x[ssX[,"Starts"],"START"]
sizeX2 <- x[ssX[,"Stops"],"END"]
sizeX <- sizeX2 - sizeX1
sssX <- data.frame(ssX, "Size"=sizeX) # Combine the Size to the ssX (start stop of X) data frame
return(sssX) #Added in EDIT
}
answer0 <- ddply(.data=test4, .variables="Code", .fun=sss) # use the function ddply() in the package "plyr" (apply the function to each CODE, why we reshaped)
answer <- answer0 # duplicate the original, new version will be reformatted
answer[,"Value"] <- c("NEG",NA,"POS")[answer0[,"Value"]+2] # reformat slightly so that we have POS/NEG instead of 1/-1
Hopefully this helps, good luck!
Use run-length encoding to determine groups where CODE1 takes the same value.
rle_of_CODE1 <- rle(df1$CODE1)
For convenience, find the points where the value is non-zero, and the lengths of the corresponding blocks.
CODE1_is_nonzero <- rle_of_CODE1$values != 0
n <- rle_of_CODE1$lengths[CODE1_is_nonzero]
Ignore the parts of df1 where CODE1 is zero.
df1_with_nonzero_CODE1 <- subset(df1, CODE1 != 0)
Define a group based on the contiguous blocks we found with rle.
df1_with_nonzero_CODE1$GROUP <- rep(seq_along(n), times = n)
Use ddply (from the plyr package, so load it with library(plyr) first) to get summary stats for each group.
summarised_by_CODE1 <- ddply(
df1_with_nonzero_CODE1,
.(GROUP),
summarise,
MinOfLOCAT = min(LOCAT),
MaxOfLOCAT = max(LOCAT),
SIZE = max(END) - min(START)
)
summarised_by_CODE1$VALUE <- ifelse(
rle_of_CODE1$values[CODE1_is_nonzero] == 1,
"POS",
"NEG"
)
summarised_by_CODE1
## GROUP MinOfLOCAT MaxOfLOCAT SIZE VALUE
## 1 1 1 2 270 POS
## 2 2 4 4 75 POS
Now repeat with CODE2.
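Rather than repeating the steps by hand for each of the 400 columns, the rle logic can be wrapped in a function and applied to every CODE column. This is a base R sketch; summarise_runs is a hypothetical name, and it assumes the code columns all start with "CODE" and that rows are already in LOCAT order.

```r
df1 <- data.frame(LOCAT = 1:6,
                  START = c(120, 345, 765, 1045, 1347, 1879),
                  END = c(150, 390, 802, 1120, 1436, 1935),
                  CODE1 = c(1, 1, 0, 1, 0, 0),
                  CODE2 = c(1, 0, 0, 0, -1, -1))

# Hypothetical helper: one output row per non-zero run in `code_col`
summarise_runs <- function(df, code_col) {
  runs <- rle(df[[code_col]])
  stops <- cumsum(runs$lengths)          # last row index of each run
  starts <- stops - runs$lengths + 1     # first row index of each run
  nonzero <- runs$values != 0
  data.frame(
    CODE = rep(code_col, sum(nonzero)),
    FIRST = df$LOCAT[starts[nonzero]],
    LAST = df$LOCAT[stops[nonzero]],
    SIZE = df$END[stops[nonzero]] - df$START[starts[nonzero]],
    VALUE = ifelse(runs$values[nonzero] > 0, "POS", "NEG"),
    stringsAsFactors = FALSE
  )
}

code_cols <- grep("^CODE", names(df1), value = TRUE)
result <- do.call(rbind, lapply(code_cols, summarise_runs, df = df1))
result
# CODE1: LOCAT 1 to 2, SIZE 270, POS; LOCAT 4 to 4, SIZE 75, POS
# CODE2: LOCAT 1 to 1, SIZE 30, POS;  LOCAT 5 to 6, SIZE 588, NEG
```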

Efficiently merging two data frames on a non-trivial criteria

Answering this question last night, I spent a good hour trying to find a solution that didn't grow a data.frame in a for loop, without any success, so I'm curious if there's a better way to go about this problem.
The general case of the problem boils down to this:
Merge two data.frames
Entries in either data.frame can have 0 or more matching entries in the other.
We only care about entries that have 1 or more matches across both.
The match function is complex involving multiple columns in both data.frames
For a concrete example I will use similar data to the linked question:
genes <- data.frame(gene = letters[1:5],
chromosome = c(2,1,2,1,3),
start = c(100, 100, 500, 350, 321),
end = c(200, 200, 600, 400, 567))
markers <- data.frame(marker = 1:10,
chromosome = c(1, 1, 2, 2, 1, 3, 4, 3, 1, 2),
position = c(105, 300, 96, 206, 150, 400, 25, 300, 120, 700))
And our complex matching function:
# matching criteria, applies to a single entry from each data.frame
isMatch <- function(marker, gene) {
return(
marker$chromosome == gene$chromosome &
marker$position >= (gene$start - 10) &
marker$position <= (gene$end + 10)
)
}
The output should look like an sql INNER JOIN of the two data.frames, for entries where isMatch is TRUE.
I've tried to construct the two data.frames so that there can be 0 or more matches in the other data.frame.
The solution I came up with is as follows:
joined <- data.frame()
for (i in 1:nrow(genes)) {
# This repeated subsetting returns the same results as `isMatch` applied across
# the `markers` data.frame for each entry in `genes`.
matches <- markers[which(markers$chromosome == genes[i, "chromosome"]),]
matches <- matches[which(matches$position >= (genes[i, "start"] - 10)),]
matches <- matches[which(matches$position <= (genes[i, "end"] + 10)),]
# matches may now be 0 or more rows, which we want to repeat the gene for:
if(nrow(matches) != 0) {
joined <- rbind(joined, cbind(genes[i,], matches[,c("marker", "position")]))
}
}
Giving the results:
gene chromosome start end marker position
1 a 2 100 200 3 96
2 a 2 100 200 4 206
3 b 1 100 200 1 105
4 b 1 100 200 5 150
5 b 1 100 200 9 120
51 e 3 321 567 6 400
This is quite an ugly and clunky solution, but anything else I tried was met with failure:
use of apply gave me a list where each element was a matrix, with no way to rbind them.
I can't specify the dimensions of joined first, because I don't know how many rows I will need in the end.
I'm sure I will come up with a problem of this general form in the future. So what's the correct way to solve this kind of problem?
A data table solution: a rolling join to fulfill the first inequality, followed by a vector scan to satisfy the second inequality. The join-on-first-inequality will have more rows than the final result (and therefore may run into memory issues), but it will be smaller than a straight-up merge in this answer.
require(data.table)
genes_start <- as.data.table(genes)
## create the start bound as a separate column to join to
genes_start[,`:=`(start_bound = start - 10)]
setkey(genes_start, chromosome, start_bound)
markers <- as.data.table(markers)
setkey(markers, chromosome, position)
new <- genes_start[
##join genes to markers
markers,
##rolling the last key column of genes_start (start_bound) forward
##to match the last key column of markers (position)
roll = Inf,
##inner join
nomatch = 0
##rolling join leaves positions column from markers
##with the column name from genes_start (start_bound)
##now vector scan to fulfill the other criterion
][start_bound <= end + 10]
##change names and column order to match desired result in question
setnames(new,"start_bound","position")
setcolorder(new,c("chromosome","gene","start","end","marker","position"))
# chromosome gene start end marker position
# 1: 1 b 100 200 1 105
# 2: 1 b 100 200 9 120
# 3: 1 b 100 200 5 150
# 4: 2 a 100 200 3 96
# 5: 2 a 100 200 4 206
# 6: 3 e 321 567 6 400
One could do a double join, but as it involves re-keying the data table before the second join, I don't think that it will be faster than the vector scan solution above.
##makes a copy of the genes object and keys it by end
genes_end <- as.data.table(genes)
genes_end[,`:=`(end_bound = end + 10, start = NULL, end = NULL)]
setkey(genes_end, chromosome, gene, end_bound)
## as before, wrapped in a similar join (but rolling backwards this time)
new_2 <- genes_end[
setkey(
genes_start[
markers,
roll = Inf,
nomatch = 0
], chromosome, gene, start_bound),
roll = -Inf,
nomatch = 0
]
setnames(new_2, "end_bound", "position")
I dealt with a very similar problem myself by doing the merge, and sorting out which rows satisfy the condition afterwards. I don't claim that this is a universal solution, if you're dealing with large datasets where there will be few entries that match the condition, this will likely be inefficient. But to adapt it to your data:
joined.raw <- merge(genes, markers)
joined <- joined.raw[joined.raw$position >= (joined.raw$start -10) & joined.raw$position <= (joined.raw$end + 10),]
joined
# chromosome gene start end marker position
# 1 1 b 100 200 1 105
# 2 1 b 100 200 5 150
# 4 1 b 100 200 9 120
# 10 2 a 100 200 4 206
# 11 2 a 100 200 3 96
# 16 3 e 321 567 6 400
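If the full merge is too large to hold in memory, one alternative that still avoids growing a data.frame inside a loop is to collect matching row indices first and subset once at the end. This is a sketch on the sample data; hits, gene_idx and marker_idx are illustrative names, and it has not been timed on large inputs.

```r
genes <- data.frame(gene = letters[1:5],
                    chromosome = c(2, 1, 2, 1, 3),
                    start = c(100, 100, 500, 350, 321),
                    end = c(200, 200, 600, 400, 567),
                    stringsAsFactors = FALSE)
markers <- data.frame(marker = 1:10,
                      chromosome = c(1, 1, 2, 2, 1, 3, 4, 3, 1, 2),
                      position = c(105, 300, 96, 206, 150, 400, 25, 300, 120, 700))

# For each gene, the row numbers of the markers that match it
hits <- lapply(seq_len(nrow(genes)), function(i) {
  which(markers$chromosome == genes$chromosome[i] &
        markers$position >= genes$start[i] - 10 &
        markers$position <= genes$end[i] + 10)
})

# Repeat each gene row once per matching marker, then bind columns once
gene_idx <- rep(seq_len(nrow(genes)), lengths(hits))
marker_idx <- unlist(hits)
joined <- cbind(genes[gene_idx, ], markers[marker_idx, c("marker", "position")])
rownames(joined) <- NULL
joined   # 6 rows: gene a (markers 3, 4), b (1, 5, 9), e (6)
```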
Another answer I've come up with using the sqldf package.
sqldf("SELECT gene, genes.chromosome, start, end, marker, position
FROM genes JOIN markers ON genes.chromosome = markers.chromosome
WHERE position >= (start - 10) AND position <= (end + 10)")
Using microbenchmark it performs comparably to @alexwhan's merge and [ method.
> microbenchmark(alexwhan, sql)
Unit: nanoseconds
expr min lq median uq max neval
alexwhan 435 462.5 468.0 485 2398 100
sql 422 456.5 466.5 498 1262 100
I've also attempted to test both functions on some real data of the same format I have lying around (35,000 rows for genes, 2,000,000 rows for markers, with the joined output coming to 480,000 rows).
Unfortunately merge seems unable to handle this much data, falling over at joined.raw <- merge(genes, markers) with an error (which I don't get if I reduce the number of rows):
Error in merge.data.frame(genes, markers) :
negative length vectors are not allowed
While the sqldf method runs successfully in 29 minutes.
Almost a year after you solved this problem for me, I spent some time on another way to do it, using awk:
awk 'FNR==NR{a[NR]=$0;next}{for (i in a){split(a[i],x," ");if (x[2]==$2 && x[3]-10 <=$3 && x[4]+10 >=$3)print x[1],x[2],x[3],x[4],$0}}' gene.txt makers.txt > genesnp.txt
which produces much the same results:
b 1 100 200 1 1 105
a 2 100 200 3 2 96
a 2 100 200 4 2 206
b 1 100 200 5 1 150
e 3 321 567 6 3 400
b 1 100 200 9 1 120
