How to match rows within a range of another dataset [duplicate] - r

This question already has answers here:
Overlap join with start and end positions
(5 answers)
Closed 2 years ago.
I have a genetic dataset where I am matching the chromosome positions in one file against the chromosome position ranges given in another file.
There are similar questions to this that I have tried, mostly involving time intervals, but their solutions haven't worked because I also need the chromosome number to match (so that I don't match identical positions that sit on different chromosomes).
My data looks like this:
#df1 - chromosome positions to find within df2 ranges:
Chromosome Position Start End
1 101 101 101
2 101 101 101
3 600 600 600
#df2 - genomic ranges
Chromosome Start End CpG
1 50 200 10
1 300 400 2
4 100 200 5
Expected matched output (ultimately I am looking to pull the matching CpG value into the df1 data):
Chromosome Position Start End CpG
1 101 50 200 10 #only row of df1 that's within a range on df2 on the same chromosome
I am currently trying to do this with:
df <- df1 %>%
  left_join(df2, by = "Chromosome") %>%
  filter(Position >= Start & Position <= End)
Error: Problem with `filter()` input `..1`.
✖ object 'Start' not found
ℹ Input `..1` is `Position >= Start & Position <= End`.
I don't understand how I am getting this error: the Start and End columns exist in both files and are all integer columns. Is there something I'm missing, or another way I can solve this?
My actual data is quite large, so a data.table solution would also be welcome - I've tried one, but I'm a novice and haven't got far:
df1[df2, on = .(Chromosome, Position > End, Position < Start ) ]
Edit: trying with foverlaps:
setkey(df1)
df2[, End := Start]
foverlaps(df2, df1, by.x = names(df2), type = "within", mult = "all", nomatch = 0L)
Error in foverlaps(df2, df1, by.x = names(df2), type = "within", mult = "all", :
length(by.x) != length(by.y). Columns specified in by.x should correspond to columns specified in by.y and should be of same lengths.

For a data.table solution, you should have looked at the second answer by Arun on non-equi joins in the link provided by @Henrik:
Overlap join with start and end positions
Based on that, we have
library(data.table)
df1 <- data.table(Chromosome = 1:3, Position = c(101, 101, 600),
                  Start = c(101, 101, 600), End = c(101, 101, 600))
df2 <- data.table(Chromosome = c(1, 1, 4),
                  Start = c(50, 300, 100), End = c(200, 400, 200), CpG = c(10, 2, 5))
df1[df2, .(Chromosome, Position = x.Position, Start, End, CpG),
    on = .(Chromosome, Position >= Start, Position <= End), nomatch = 0L]
giving
Chromosome Position Start End CpG
1: 1 101 101 101 10
That's not quite right because it takes Start and End from df1 rather than df2. Why do you even have Start and End in df1?
One way to deal with that is to not include them in the join statement:
df1[, .(Chromosome, Position)][df2,
    .(Chromosome, Position = x.Position, Start, End, CpG),
    on = .(Chromosome, Position >= Start, Position <= End), nomatch = 0L]
giving
Chromosome Position Start End CpG
1: 1 101 50 200 10
[EDIT to note that @Carles Sans Fuentes identified the same issue in his dplyr answer.]
As a check on cases with more matches, I added some more data:
df1 <- data.table(Chromosome = c(1, 1:4), Position = c(350, 101, 101, 600, 200),
                  Start = c(350, 101, 101, 600, 200), End = c(350, 101, 101, 600, 200))
df1
Chromosome Position Start End
1: 1 350 350 350
2: 1 101 101 101
3: 2 101 101 101
4: 3 600 600 600
5: 4 200 200 200
df1[, .(Chromosome, Position)][df2,
    .(Chromosome, Position = x.Position, Start, End, CpG),
    on = .(Chromosome, Position >= Start, Position <= End), nomatch = 0L]
Chromosome Position Start End CpG
1: 1 101 50 200 10
2: 1 350 300 400 2
3: 4 200 100 200 5
Which I guess is what you'd want.
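For completeness, since the question also attempted foverlaps(): that function wants interval (start, end) columns in both tables, with the y table keyed so that its last two key columns are the interval; a point can be represented as a zero-width interval. A minimal sketch on fresh copies of the example data (not a drop-in for the real files):
library(data.table)
dt1 <- data.table(Chromosome = 1:3, Position = c(101, 101, 600))
dt1[, `:=`(Start = Position, End = Position)]  # a point as a zero-width interval
dt2 <- data.table(Chromosome = c(1, 1, 4),
                  Start = c(50, 300, 100), End = c(200, 400, 200),
                  CpG = c(10, 2, 5))
setkey(dt2, Chromosome, Start, End)  # key of y must end in the interval columns
# dt1 points falling within dt2 ranges on the same chromosome
foverlaps(dt1, dt2, by.x = c("Chromosome", "Start", "End"),
          type = "within", nomatch = 0L)
The result carries dt2's Start/End, with dt1's interval appearing as i.Start/i.End alongside Position, so the matched position is recoverable from either side.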

The problem is caused by left_join(), which combines columns from two datasets that share names. Since two columns in one dataset cannot have the same name, the Start and End columns get renamed to Start.x/Start.y and End.x/End.y.
Therefore, you must either remove the Start and End columns from the first dataset:
library(data.table)
library(tidyr)
library(dplyr)
df1 <- fread("Chromosome Position Start End
1 101 101 101
2 101 101 101
3 600 600 600")
df2 <- fread("Chromosome Start End CpG
1 50 200 10
1 300 400 2
4 100 200 5")
df <- df1 %>%
  select(Chromosome, Position) %>%
  left_join(df2, by = "Chromosome") %>%
  filter(Position >= Start & Position <= End)
or refer to the renamed columns directly and drop the extra columns afterwards:
df <- df1 %>%
  left_join(df2, by = "Chromosome") %>%
  filter(Position >= Start.y & Position <= End.y)
Cheers!
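A side note: dplyr 1.1.0 introduced non-equi joins via join_by(), which avoids the renaming problem entirely because the range conditions go into the join itself. A minimal sketch, assuming a recent dplyr:
library(dplyr)  # >= 1.1.0 for join_by()
df <- df1 %>%
  select(Chromosome, Position) %>%
  inner_join(df2, by = join_by(Chromosome, Position >= Start, Position <= End))
Here inner_join drops the unmatched chromosomes, which is what the filter() step did.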

Related

new data frame out of long format depending on condition

I've got a dataframe in long form (90 rows for each participant). It contains one column with reaction times, and another column that is a factor with the two levels "match" and "mismatch".
I want to calculate an index that looks something like this: mean RT of the mismatches minus mean RT of the matches. I added na.rm=TRUE because there are RTs for every row, but not every row is a match or mismatch; some are NAs.
mean(dato$MMRT[dato$Matcheig == "mismatch"], na.rm = TRUE) - mean(dato$MMRT[dato$Matcheig == "match"], na.rm = TRUE)
How can I compute this index over the 90 rows of each participant? I would prefer a new data frame with one line per participant (VP) and a column with the index.
I tried my best with dplyr, but to be honest I only heard about it yesterday (I'm new here), so maybe someone can tell me if there is an easy solution:
datindex <- dato %>%
  group_by(VP) %>%
  mean(dato$MMRT[dato$Matcheig == "mismatch"], na.rm = TRUE) -
    mean(dato$MMRT[dato$Matcheig == "match"], na.rm = TRUE)
This gives the error "argument is not numeric or logical: returning NA" and a 1x1 data frame with NA
Example:
dato looks like this (with only 5 rows for each participant now):
VP MMRT Matcheig
1 868 match
1 640 match
1 683 mismatch
1 643 NA
1 904 mismatch
2 705 mismatch
2 634 match
2 819 match
2 700 mismatch
2 765 mismatch
Result should look like this:
VP index
1 39.5
2 -3.2
because the mean RT of the mismatches of participant 1 is 793.5 and of the matches 754
--> 793.5 - 754 = 39.5
and for participant 2: mean RT(mismatches) = 723.3 and mean RT(matches) = 726.5
--> 723.3 - 726.5 = -3.2
1) You should not use $ inside dplyr pipes; it is very rarely useful there.
2) The calculation should go inside the summarise or mutate functions in dplyr.
library(dplyr)
dato %>%
  group_by(VP) %>%
  summarise(calc = mean(MMRT[Matcheig == "mismatch"], na.rm = TRUE) -
                   mean(MMRT[Matcheig == "match"], na.rm = TRUE))
# A tibble: 2 x 2
# VP calc
# <int> <dbl>
#1 1 39.5
#2 2 -3.17
The same can also be done with data.table in similar fashion, which is useful for large datasets:
library(data.table)
setDT(dato)[, mean(MMRT[Matcheig == "mismatch"], na.rm = TRUE) -
              mean(MMRT[Matcheig == "match"], na.rm = TRUE), by = VP]
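To get a named column like the index in the desired output, the expression can be wrapped in .(); a sketch of the same call:
setDT(dato)[, .(index = mean(MMRT[Matcheig == "mismatch"], na.rm = TRUE) -
                mean(MMRT[Matcheig == "match"], na.rm = TRUE)), by = VP]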

How to bypass a nested for loop?

So the situation is this:
I basically have one data frame containing about 100,000 rows of data. I am interested in a particular column, POS, and I want to check whether each value of POS falls between two values, Start and End, of another data frame, and keep track of how many instances of those there are.
E.g., in my first data frame, I have something like
ID POS
A 20
B 533
C 600
And in my other data frame, I have stuff like
START END
123 150
489 552
590 600
I want to know how many items in POS are in any of the START-END ranges. So in this case, there'd be 2 items. Also, if possible, can I get the IDs of the ones with POS between Start and End, too?
How can I go about doing that without having to use a nested for loop?
This is a fairly common problem, one that also arises in database contexts. Here is a solution using sqldf:
library(sqldf)
query <- "SELECT POS, ID FROM df1 INNER JOIN df2 "
query <- paste0(query, "ON df1.POS BETWEEN df2.START AND df2.END")
sqldf(query)
If the ranges in your second data frame might overlap, then the above query could return more than one result for a given POS value. In this case, replace SELECT POS with SELECT DISTINCT POS.
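That variant would read (a sketch of the same query):
sqldf("SELECT DISTINCT POS, ID FROM df1 INNER JOIN df2
       ON df1.POS BETWEEN df2.START AND df2.END")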
We can use a non-equi join with data.table
library(data.table)
setDT(df1)[df2, on = .(POS > START, POS <= END)][, sum(!is.na(ID))]
#[1] 2
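Since the question also asks for the IDs, the same non-equi join can return them; a sketch (>= makes the lower bound inclusive, and x.POS recovers df1's original positions, which the join columns would otherwise display as the range bounds):
setDT(df1)[df2, on = .(POS >= START, POS <= END), nomatch = 0L,
           .(ID, POS = x.POS, START = i.START, END = i.END)]
#   ID POS START END
#1:  B 533   489 552
#2:  C 600   590 600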
We can achieve the same using mapply in base R:
df1[mapply(function(x) any(x >= df2$START & x <= df2$END), df1$POS), ]
# ID POS
#2 B 533
#3 C 600
Data
df1 <- read.table(text = "ID POS
A 20
B 533
C 600", header = TRUE)
df2 <- read.table(text = "START END
123 150
489 552
590 600", header = TRUE)
Data frame: main
ID POS
A 20
B 533
C 600
Data frame: ran
START END
123 150
489 552
590 600
A simple sapply should suffice for your use case:
sapply(main$POS, function(x) { sum(x >= ran$START & x <= ran$END) })
will return:
[1] 0 1 1
You could bind this back to a new column in your main data frame:
main$Count <- sapply(main$POS, function(x) { sum(x >= ran$START & x <= ran$END) })
ID POS Count
1 A 20 0
2 B 533 1
3 C 600 1
This should also work with overlapping ranges.
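An equivalent vectorised formulation builds one logical matrix with outer() instead of looping over POS; a sketch (memory use grows with nrow(main) * nrow(ran), so it suits modest range tables):
# TRUE where main$POS[i] falls inside ran[j, ] for each i, j pair
hits <- outer(main$POS, ran$START, ">=") & outer(main$POS, ran$END, "<=")
main$Count <- rowSums(hits)  # same counts as the sapply version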

Find consecutive values with sliding window

I have a dataframe. For simplicity, I am leaving out many columns and rows:
Distance Type
1 162 A
2 27182 A
3 212 C
4 89 B
5 11 C
I need to find 6 consecutive rows in the dataframe such that the average distance is 1000, and such that the only types considered are A or B. Just for clarification: one might think to filter out all Type C rows and then proceed, but rows that were not originally consecutive would become consecutive after filtering, and that's no good.
For example, if I filtered out rows 3 and 5 above, I would be left with 3 rows. And if I had provided more rows, that might produce a faulty result.
Maybe a solution with the data.table library?
For reproducibility, here is a data sample, based on what you wrote.
library(data.table)
# data orig (with row numbers...)
DO<-"Distance Type
1 162 A
2 27182 A
3 212 C
4 89 B
5 11 C
6 1234 A"
# data: separate fields with semicolons
DS <- gsub('[[:blank:]]+', ';', DO)
# data.frame
DF <- read.table(textConnection(DS), header = TRUE, sep = ';', stringsAsFactors = FALSE)
# data.table
DT <- as.data.table(DF)
Then, make a function that increments a counter each time a new run of identical values starts:
# function to assign sequential group numbers
mkGroupRep <- function(x) {
  cnt <- 1L
  grp <- 1L
  lx <- length(x)
  ne <- x[-lx] != x[-1L]  # next element not equal
  for (i in seq_along(ne)) { if (ne[i]) cnt <- cnt + 1L; grp[i + 1] <- cnt }
  grp
}
And use it with data.table's 'multiple assignment by reference':
# update DT: set group numbers based on runs of Type
DT[, grp := mkGroupRep(Type)]
# compute the mean distance and the number of items per group
DT[, `:=`(
  distMean = mean(Distance),
  grpLength = .N
), by = grp]
# filter what you want:
DT[Type != 'C' & distMean > 100 & grpLength == 2 | grpLength == 3]
Output:
Distance Type grp distMean grpLength
1: 162 A 1 13672 2
2: 27182 A 1 13672 2
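Note that the approach above summarises runs of identical Type rather than testing every 6-row window. For the literal sliding-window reading of the question (6 consecutive rows, all of type A or B, averaging a Distance of 1000), a rolling check is one option; below is a minimal sketch using the zoo package over the DT built above (the window width, the tolerance, and the %in% type filter are assumptions to adjust). As an aside, data.table's rleid() computes the same group numbers as mkGroupRep() in a single call.
library(zoo)

ok_type <- DT$Type %in% c("A", "B")  # rows whose Type is allowed
# slide a 6-row window over the row indices; TRUE where every row in the
# window is type A/B and the mean Distance is (numerically) 1000
win_ok <- rollapply(seq_len(nrow(DT)), width = 6, align = "left",
                    FUN = function(idx) all(ok_type[idx]) &&
                      isTRUE(all.equal(mean(DT$Distance[idx]), 1000)))
which(win_ok)  # starting row of each qualifying window (none in this small sample)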

Create a new data frame based on another dataframe

I am trying to use a huge dataframe (180000 x 400) to calculate another one that would be much smaller.
I have the following dataframe
df1 <- data.frame(LOCAT = c(1, 2, 3, 4, 5, 6),
                  START = c(120, 345, 765, 1045, 1347, 1879),
                  END = c(150, 390, 802, 1120, 1436, 1935),
                  CODE1 = c(1, 1, 0, 1, 0, 0),
                  CODE2 = c(1, 0, 0, 0, -1, -1))
df1
LOCAT START END CODE1 CODE2
1 1 120 150 1 1
2 2 345 390 1 0
3 3 765 802 0 0
4 4 1045 1120 1 0
5 5 1347 1436 0 -1
6 6 1879 1935 0 -1
This is a sample dataframe. The rows continue until 180000 and the columns are over 400.
What I need to do is create, for each CODE column, a new dataframe that tells me the size of each continuous run of "1" or "-1", together with its location, size and value.
Something like this for CODE1:
LOCAT SIZE VALUE
1 1 to 2 270 POS
2 4 to 4 75 POS
And like this for CODE2:
LOCAT SIZE VALUE
1 1 to 1 30 POS
2 5 to 6 588 NEG
Unfortunately I haven't figured out how to do this yet. I have been trying several lines of code to develop a function that does it automatically, but I keep getting lost or stuck in loops, and nothing seems to work.
Any help would be appreciated.
Thanks in advance
Below is code that gives you the answer in the exact format that you wanted, except I split your "LOCAT" column into two columns entitled "Starts" and "Stops". This code will work for your entire data frame, no need to replicate it manually for each CODE (CODE1, CODE2, etc).
It assumes that the only non-CODE columns are named "LOCAT", "START" and "END".
# needs the package "plyr"
library("plyr")
# test2 is the example data frame that you gave in the question
test2 <- data.frame(
  "LOCAT" = 1:6,
  "START" = c(120, 345, 765, 1045, 1347, 1879),
  "END" = c(150, 390, 802, 1120, 1436, 1935),
  "CODE1" = c(1, 1, 0, 1, 0, 0),
  "CODE2" = c(1, 0, 0, 0, -1, -1)
)
# the names of the columns that correspond to the different codes
codeNames <- names(test2)[!names(test2) %in% c("LOCAT", "START", "END")]
# reshape to long form so the different codes are grouped into one column
test3 <- reshape(test2, varying = codeNames, direction = "long",
                 v.names = "CodeValue", timevar = "Code")
test4 <- test3[, !names(test3) %in% "id"]  # remove the "id" column
# sss gives the starting points, stopping points, and sizes (sss) in a data frame
sss <- function(x) {
  rleX <- rle(x[, "CodeValue"])        # rle() gives the sizes of consecutive runs
  stops <- cumsum(rleX$lengths)        # end points of the runs (second value in your LOCAT column)
  starts <- c(1, head(stops, -1) + 1)  # start points of the runs (first value in your LOCAT column)
  ssX0 <- data.frame("Value" = rleX$values, "Starts" = starts, "Stops" = stops)
  ssX <- ssX0[ssX0[, "Value"] != 0, ]  # drop runs whose CODE value is 0 (neither POS nor NEG)
  # the next 3 lines calculate the equivalent of your SIZE column
  sizeX1 <- x[ssX[, "Starts"], "START"]
  sizeX2 <- x[ssX[, "Stops"], "END"]
  sizeX <- sizeX2 - sizeX1
  sssX <- data.frame(ssX, "Size" = sizeX)  # combine Size with the start/stop data frame
  return(sssX)
}
# apply sss() to each Code with ddply() from "plyr" (this is why we reshaped)
answer0 <- ddply(.data = test4, .variables = "Code", .fun = sss)
answer <- answer0  # duplicate the original; the new version will be reformatted
answer[, "Value"] <- c("NEG", NA, "POS")[answer0[, "Value"] + 2]  # POS/NEG instead of 1/-1
Hopefully this helps, good luck!
Use run-length encoding to determine groups where CODE1 takes the same value.
rle_of_CODE1 <- rle(df1$CODE1)
For convenience, find the points where the value is non-zero, and the lengths of the corresponding blocks.
CODE1_is_nonzero <- rle_of_CODE1$values != 0
n <- rle_of_CODE1$lengths[CODE1_is_nonzero]
Ignore the parts of df1 where CODE1 is zero.
df1_with_nonzero_CODE1 <- subset(df1, CODE1 != 0)
Define a group based on the contiguous blocks we found with rle.
df1_with_nonzero_CODE1$GROUP <- rep(seq_along(n), times = n)
Use ddply from the plyr package to get summary stats for each group.
library(plyr)
summarised_by_CODE1 <- ddply(
  df1_with_nonzero_CODE1,
  .(GROUP),
  summarise,
  MinOfLOCAT = min(LOCAT),
  MaxOfLOCAT = max(LOCAT),
  SIZE = max(END) - min(START)
)
summarised_by_CODE1$VALUE <- ifelse(
rle_of_CODE1$values[CODE1_is_nonzero] == 1,
"POS",
"NEG"
)
summarised_by_CODE1
## GROUP MinOfLOCAT MaxOfLOCAT SIZE VALUE
## 1 1 1 2 270 POS
## 2 3 4 4 75 POS
Now repeat with CODE2.
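To avoid repeating those steps for each of the 400 CODE columns, the same logic can be wrapped in a function and applied over all of them. A sketch, assuming the code columns are the ones named CODE1, CODE2, ... and plyr is loaded:
summarise_code <- function(code, df) {
  r <- rle(df[[code]])
  nonzero <- r$values != 0
  n <- r$lengths[nonzero]                  # lengths of the non-zero blocks
  d <- df[df[[code]] != 0, ]               # keep only rows inside non-zero blocks
  d$GROUP <- rep(seq_along(n), times = n)  # contiguous-block group ids
  out <- ddply(d, .(GROUP), summarise,
               MinOfLOCAT = min(LOCAT),
               MaxOfLOCAT = max(LOCAT),
               SIZE = max(END) - min(START))
  out$VALUE <- ifelse(r$values[nonzero] == 1, "POS", "NEG")
  out$CODE <- code
  out
}

code_cols <- grep("^CODE", names(df1), value = TRUE)
all_runs <- do.call(rbind, lapply(code_cols, summarise_code, df = df1))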

Efficiently merging two data frames on non-trivial criteria

Answering this question last night, I spent a good hour trying to find a solution that didn't grow a data.frame in a for loop, without any success, so I'm curious if there's a better way to go about this problem.
The general case of the problem boils down to this:
Merge two data.frames
Entries in either data.frame can have 0 or more matching entries in the other.
We only care about entries that have 1 or more matches across both.
The match function is complex, involving multiple columns in both data.frames.
For a concrete example I will use similar data to the linked question:
genes <- data.frame(gene = letters[1:5],
                    chromosome = c(2, 1, 2, 1, 3),
                    start = c(100, 100, 500, 350, 321),
                    end = c(200, 200, 600, 400, 567))
markers <- data.frame(marker = 1:10,
                      chromosome = c(1, 1, 2, 2, 1, 3, 4, 3, 1, 2),
                      position = c(105, 300, 96, 206, 150, 400, 25, 300, 120, 700))
And our complex matching function:
# matching criteria, applied to a single entry from each data.frame
isMatch <- function(marker, gene) {
  return(
    marker$chromosome == gene$chromosome &
      marker$position >= (gene$start - 10) &
      marker$position <= (gene$end + 10)
  )
}
The output should look like an sql INNER JOIN of the two data.frames, for entries where isMatch is TRUE.
I've tried to construct the two data.frames so that there can be 0 or more matches in the other data.frame.
The solution I came up with is as follows:
joined <- data.frame()
for (i in 1:nrow(genes)) {
  # This repeated subsetting returns the same results as `isMatch` applied
  # across the `markers` data.frame for each entry in `genes`.
  matches <- markers[which(markers$chromosome == genes[i, "chromosome"]), ]
  matches <- matches[which(matches$position >= (genes[i, "start"] - 10)), ]
  matches <- matches[which(matches$position <= (genes[i, "end"] + 10)), ]
  # matches may now be 0 or more rows, which we want to repeat the gene for:
  if (nrow(matches) != 0) {
    joined <- rbind(joined, cbind(genes[i, ], matches[, c("marker", "position")]))
  }
}
Giving the results:
gene chromosome start end marker position
1 a 2 100 200 3 96
2 a 2 100 200 4 206
3 b 1 100 200 1 105
4 b 1 100 200 5 150
5 b 1 100 200 9 120
51 e 3 321 567 6 400
This is quite an ugly and clunky solution, but anything else I tried was met with failure:
use of apply gave me a list where each element was a matrix,
with no way to rbind them.
I can't specify the dimensions of joined first, because I don't
know how many rows I will need in the end.
I'm sure I will come up with a problem of this general form in the future. So what's the correct way to solve this kind of problem?
A data.table solution: a rolling join to fulfill the first inequality, followed by a vector scan to satisfy the second inequality. The join on the first inequality will have more rows than the final result (and therefore may run into memory issues), but it will be smaller than the straight-up merge in another answer here.
require(data.table)
genes_start <- as.data.table(genes)
## create the start bound as a separate column to join on
genes_start[, `:=`(start_bound = start - 10)]
setkey(genes_start, chromosome, start_bound)
markers <- as.data.table(markers)
setkey(markers, chromosome, position)
new <- genes_start[
  ## join genes to markers,
  markers,
  ## rolling the last key column of genes_start (start_bound) forward
  ## to match the last key column of markers (position)
  roll = Inf,
  ## inner join
  nomatch = 0
  ## the rolling join leaves the position column from markers
  ## with the column name from genes_start (start_bound);
  ## now vector-scan to fulfill the other criterion
][start_bound <= end + 10]
## change names and column order to match the desired result in the question
setnames(new, "start_bound", "position")
setcolorder(new, c("chromosome", "gene", "start", "end", "marker", "position"))
# chromosome gene start end marker position
# 1: 1 b 100 200 1 105
# 2: 1 b 100 200 9 120
# 3: 1 b 100 200 5 150
# 4: 2 a 100 200 3 96
# 5: 2 a 100 200 4 206
# 6: 3 e 321 567 6 400
One could do a double join, but as it involves re-keying the data table before the second join, I don't think that it will be faster than the vector scan solution above.
## make a copy of the genes object and key it by end
genes_end <- as.data.table(genes)
genes_end[, `:=`(end_bound = end + 10, start = NULL, end = NULL)]
setkey(genes_end, chromosome, gene, end_bound)
## as before, wrapped in a similar join (but rolling backwards this time)
new_2 <- genes_end[
  setkey(
    genes_start[
      markers,
      roll = Inf,
      nomatch = 0
    ], chromosome, gene, start_bound),
  roll = -Inf,
  nomatch = 0
]
setnames(new_2, "end_bound", "position")
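These rolling joins predate data.table's non-equi joins (available since v1.9.8), which nowadays express both inequalities in a single join. A sketch over the same genes/markers data; the lo/hi helper columns exist only because join conditions cannot contain arithmetic expressions:
gdt <- as.data.table(genes)
mdt <- as.data.table(markers)
gdt[, `:=`(lo = start - 10, hi = end + 10)]  # the +/-10 window as real columns
gdt[mdt, on = .(chromosome, lo <= position, hi >= position), nomatch = 0L,
    .(chromosome, gene, start, end, marker, position = i.position)]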
I dealt with a very similar problem myself by doing the merge and sorting out which rows satisfy the condition afterwards. I don't claim that this is a universal solution; if you're dealing with large datasets where few entries match the condition, it will likely be inefficient. But to adapt it to your data:
joined.raw <- merge(genes, markers)
joined <- joined.raw[joined.raw$position >= (joined.raw$start - 10) &
                     joined.raw$position <= (joined.raw$end + 10), ]
joined
# chromosome gene start end marker position
# 1 1 b 100 200 1 105
# 2 1 b 100 200 5 150
# 4 1 b 100 200 9 120
# 10 2 a 100 200 4 206
# 11 2 a 100 200 3 96
# 16 3 e 321 567 6 400
Another answer I've come up with, using the sqldf package:
sqldf("SELECT gene, genes.chromosome, start, end, marker, position
FROM genes JOIN markers ON genes.chromosome = markers.chromosome
WHERE position >= (start - 10) AND position <= (end + 10)")
Using microbenchmark it performs comparably to @alexwhan's merge and [ method.
> microbenchmark(alexwhan, sql)
Unit: nanoseconds
expr min lq median uq max neval
alexwhan 435 462.5 468.0 485 2398 100
sql 422 456.5 466.5 498 1262 100
I've also attempted to test both functions on some real data of the same format I have lying around (35,000 rows for genes, 2,000,000 rows for markers, with the joined output coming to 480,000 rows).
Unfortunately merge seems unable to handle this much data, falling over at joined.raw <- merge(genes, markers) with an error (which I don't get if I reduce the number of rows):
Error in merge.data.frame(genes, markers) :
negative length vectors are not allowed
The sqldf method, meanwhile, runs successfully in 29 minutes.
Almost a year after this problem was solved for me, I spent some time tackling it another way, with awk:
awk 'FNR==NR{a[NR]=$0;next}{for (i in a){split(a[i],x," ");if (x[2]==$2 && x[3]-10 <=$3 && x[4]+10 >=$3)print x[1],x[2],x[3],x[4],$0}}' gene.txt makers.txt > genesnp.txt
which produces much the same results:
b 1 100 200 1 1 105
a 2 100 200 3 2 96
a 2 100 200 4 2 206
b 1 100 200 5 1 150
e 3 321 567 6 3 400
b 1 100 200 9 1 120