Efficiently merging two data frames on a non-trivial criterion - r

Answering this question last night, I spent a good hour trying to find a solution that didn't grow a data.frame in a for loop, without any success, so I'm curious if there's a better way to go about this problem.
The general case of the problem boils down to this:
Merge two data.frames
Entries in either data.frame can have 0 or more matching entries in the other.
We only care about entries that have 1 or more matches across both.
The match function is complex, involving multiple columns in both data.frames
For a concrete example I will use similar data to the linked question:
genes <- data.frame(gene = letters[1:5],
                    chromosome = c(2, 1, 2, 1, 3),
                    start = c(100, 100, 500, 350, 321),
                    end = c(200, 200, 600, 400, 567))
markers <- data.frame(marker = 1:10,
                      chromosome = c(1, 1, 2, 2, 1, 3, 4, 3, 1, 2),
                      position = c(105, 300, 96, 206, 150, 400, 25, 300, 120, 700))
And our complex matching function:
# matching criteria, applies to a single entry from each data.frame
isMatch <- function(marker, gene) {
  return(
    marker$chromosome == gene$chromosome &
      marker$position >= (gene$start - 10) &
      marker$position <= (gene$end + 10)
  )
}
The output should look like an sql INNER JOIN of the two data.frames, for entries where isMatch is TRUE.
I've tried to construct the two data.frames so that there can be 0 or more matches in the other data.frame.
The solution I came up with is as follows:
joined <- data.frame()
for (i in 1:nrow(genes)) {
  # This repeated subsetting returns the same results as `isMatch` applied across
  # the `markers` data.frame for each entry in `genes`.
  matches <- markers[which(markers$chromosome == genes[i, "chromosome"]), ]
  matches <- matches[which(matches$position >= (genes[i, "start"] - 10)), ]
  matches <- matches[which(matches$position <= (genes[i, "end"] + 10)), ]
  # matches may now be 0 or more rows, which we want to repeat the gene for:
  if (nrow(matches) != 0) {
    joined <- rbind(joined, cbind(genes[i, ], matches[, c("marker", "position")]))
  }
}
Giving the results:
gene chromosome start end marker position
1 a 2 100 200 3 96
2 a 2 100 200 4 206
3 b 1 100 200 1 105
4 b 1 100 200 5 150
5 b 1 100 200 9 120
51 e 3 321 567 6 400
This is quite an ugly and clunky solution, but anything else I tried was met with failure:
use of apply gave me a list where each element was a matrix, with no way to rbind them.
I can't specify the dimensions of joined up front, because I don't know how many rows I will need in the end.
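In hindsight, the list-of-pieces idea can probably be salvaged by binding once at the end with do.call() rather than growing joined row by row; a rough sketch of that idea, using the objects above:
## build one data.frame per gene, then bind them all in a single rbind
pieces <- lapply(seq_len(nrow(genes)), function(i) {
  m <- markers[markers$chromosome == genes$chromosome[i] &
                 markers$position >= genes$start[i] - 10 &
                 markers$position <= genes$end[i] + 10, ]
  if (nrow(m) == 0) return(NULL)   # genes with no match contribute nothing
  cbind(genes[rep(i, nrow(m)), ], m[, c("marker", "position")])
})
joined <- do.call(rbind, pieces)   # NULL entries are ignored by rbind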
I'm sure I will come up with a problem of this general form in the future. So what's the correct way to solve this kind of problem?

A data.table solution: a rolling join to fulfill the first inequality, followed by a vector scan to satisfy the second inequality. The join on the first inequality will have more rows than the final result (and therefore may run into memory issues), but it will be smaller than the straight-up merge in @alexwhan's answer.
require(data.table)
genes_start <- as.data.table(genes)
## create the start bound as a separate column to join to
genes_start[,`:=`(start_bound = start - 10)]
setkey(genes_start, chromosome, start_bound)
markers <- as.data.table(markers)
setkey(markers, chromosome, position)
new <- genes_start[
  ## join genes to markers
  markers,
  ## rolling the last key column of genes_start (start_bound) forward
  ## to match the last key column of markers (position)
  roll = Inf,
  ## inner join
  nomatch = 0
  ## rolling join leaves positions column from markers
  ## with the column name from genes_start (start_bound)
  ## now vector scan to fulfill the other criterion
][start_bound <= end + 10]
## change names and column order to match desired result in question
setnames(new, "start_bound", "position")
setcolorder(new, c("chromosome", "gene", "start", "end", "marker", "position"))
# chromosome gene start end marker position
# 1: 1 b 100 200 1 105
# 2: 1 b 100 200 9 120
# 3: 1 b 100 200 5 150
# 4: 2 a 100 200 3 96
# 5: 2 a 100 200 4 206
# 6: 3 e 321 567 6 400
One could do a double join, but as it involves re-keying the data table before the second join, I don't think that it will be faster than the vector scan solution above.
##makes a copy of the genes object and keys it by end
genes_end <- as.data.table(genes)
genes_end[,`:=`(end_bound = end + 10, start = NULL, end = NULL)]
setkey(genes_end, chromosome, gene, end_bound)
## as before, wrapped in a similar join (but rolling backwards this time)
new_2 <- genes_end[
  setkey(
    genes_start[
      markers,
      roll = Inf,
      nomatch = 0
    ], chromosome, gene, start_bound),
  roll = -Inf,
  nomatch = 0
]
setnames(new_2, "end_bound", "position")
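As an aside: data.table 1.9.8 and later support non-equi joins, which can express both inequalities in a single join, making the rolling trick unnecessary. A sketch under that assumption (lo and hi are helper columns introduced here, not part of the original data):
library(data.table)
gdt <- as.data.table(genes)
mdt <- as.data.table(markers)
## widen the gene interval by 10 on each side, as in isMatch
gdt[, `:=`(lo = start - 10, hi = end + 10)]
## non-equi inner join: chromosome equality plus both bounds at once
gdt[mdt,
    .(gene, chromosome, start = x.start, end = x.end,
      marker, position = i.position),
    on = .(chromosome, lo <= position, hi >= position),
    nomatch = 0L]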

I dealt with a very similar problem myself by doing the merge first and sorting out which rows satisfy the condition afterwards. I don't claim that this is a universal solution; if you're dealing with large datasets where few entries match the condition, this will likely be inefficient. But to adapt it to your data:
joined.raw <- merge(genes, markers)
joined <- joined.raw[joined.raw$position >= (joined.raw$start - 10) &
                       joined.raw$position <= (joined.raw$end + 10), ]
joined
# chromosome gene start end marker position
# 1 1 b 100 200 1 105
# 2 1 b 100 200 5 150
# 4 1 b 100 200 9 120
# 10 2 a 100 200 4 206
# 11 2 a 100 200 3 96
# 16 3 e 321 567 6 400

Another answer I've come up with, using the sqldf package:
sqldf("SELECT gene, genes.chromosome, start, end, marker, position
FROM genes JOIN markers ON genes.chromosome = markers.chromosome
WHERE position >= (start - 10) AND position <= (end + 10)")
Using microbenchmark it performs comparably to @alexwhan's merge and [ method.
> microbenchmark(alexwhan, sql)
Unit: nanoseconds
expr min lq median uq max neval
alexwhan 435 462.5 468.0 485 2398 100
sql 422 456.5 466.5 498 1262 100
I've also attempted to test both functions on some real data of the same format I have lying around (35,000 rows for genes, 2,000,000 rows for markers, with the joined output coming to 480,000 rows).
Unfortunately merge seems unable to handle this much data, falling over at joined.raw <- merge(genes, markers) with an error (which I don't get if I reduce the number of rows):
Error in merge.data.frame(genes, markers) :
negative length vectors are not allowed
The sqldf method, by contrast, runs successfully in 29 minutes.

Almost a year after this problem was solved for me, I spent some time tackling it another way, with awk:
awk 'FNR==NR{a[NR]=$0;next}{for (i in a){split(a[i],x," ");if (x[2]==$2 && x[3]-10 <=$3 && x[4]+10 >=$3)print x[1],x[2],x[3],x[4],$0}}' gene.txt markers.txt > genesnp.txt
which produces much the same results:
b 1 100 200 1 1 105
a 2 100 200 3 2 96
a 2 100 200 4 2 206
b 1 100 200 5 1 150
e 3 321 567 6 3 400
b 1 100 200 9 1 120

Related

How to match rows within a range of another dataset [duplicate]

This question was closed as a duplicate of: Overlap join with start and end positions
I have a genetic dataset where I am matching chromosome positions in the genome of 1 file if they fit within chromosome position ranges given in another file.
There are similar questions that I have tried to adapt, mostly involving time intervals, but they haven't worked because I also need the chromosome number to match (so that I don't match identical positions on differing chromosomes).
My data looks like this:
#df1 - chromosome positions to find within df2 ranges:
Chromosome Position Start End
1 101 101 101
2 101 101 101
3 600 600 600
#df2 - genomic ranges
Chromosome Start End CpG
1 50 200 10
1 300 400 2
4 100 200 5
Expected matched output (also ultimately I am looking to find the matching CpG column for df1 data):
Chromosome Position Start End CpG
1 101 50 200 10 #only row of df1 that's within a range on df2 on the same chromosome
I am currently trying to do this with:
df <- df1 %>%
  left_join(df2, by = "Chromosome") %>%
  filter(Position >= Start & Position <= End)
Error: Problem with `filter()` input `..1`.
x object 'Start' not found
i Input `..1` is `Position >= Start & Position <= End`.
I don't understand how I am getting this error: the Start and End columns exist in both files and are all integer columns. Is there something I'm missing, or another way I can solve this?
My actual data is quite large, so I'm also looking for a data.table solution if one works here - I've tried but I'm a novice and haven't got far:
df1[df2, on = .(Chromosome, Position > End, Position < Start ) ]
Edit: trying with foverlaps:
setkey(df1)
df2[, End := Start]
foverlaps(df2, df1, by.x = names(df2), type = "within", mult = "all", nomatch = 0L)
Error in foverlaps(df2, df1, by.x = names(df2), type = "within", mult = "all", :
length(by.x) != length(by.y). Columns specified in by.x should correspond to columns specified in by.y and should be of same lengths.
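(As an aside, the error comes from by.x and by.y resolving to different column sets. A minimal working foverlaps() call, assuming df1 and df2 are fresh data.tables of the posted data, might look like:)
setkey(df1, NULL)                     # clear the whole-table key set by setkey(df1) above
setkey(df2, Chromosome, Start, End)   # foverlaps() wants y keyed on its interval columns
## df1 already carries a zero-width interval (Start = End = Position)
foverlaps(df1, df2, type = "within", nomatch = 0L)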
For a data.table solution, you should have looked at the second answer by Arun on non-equi joins in the link provided by @Henrik.
Overlap join with start and end positions
Based on that, we have
library(data.table)
df1 <- data.table(Chromosome = 1:3, Position = c(101, 101, 600),
                  Start = c(101, 101, 600), End = c(101, 101, 600))
df2 <- data.table(Chromosome = c(1, 1, 4),
                  Start = c(50, 300, 100), End = c(200, 400, 200), CpG = c(10, 2, 5))
df1[df2, .(Chromosome, Position = x.Position, Start, End, CpG),
    on = .(Chromosome, Position >= Start, Position <= End), nomatch = 0L]
giving
Chromosome Position Start End CpG
1: 1 101 101 101 10
That's not quite right because it takes Start and End from df1 rather than df2. Why do you even have Start and End in df1?
One way to deal with that is to not include them in the join statement:
df1[, .(Chromosome, Position)][df2,
    .(Chromosome, Position = x.Position, Start, End, CpG),
    on = .(Chromosome, Position >= Start, Position <= End), nomatch = 0L]
giving
Chromosome Position Start End CpG
1: 1 101 50 200 10
[EDIT to note that @Carles Sans Fuentes identified the same issue in his dplyr answer.]
As a check on cases with more matches, I added some more data:
df1 <- data.table(Chromosome = c(1, 1:4), Position = c(350, 101, 101, 600, 200),
                  Start = c(350, 101, 101, 600, 200), End = c(350, 101, 101, 600, 200))
df1
Chromosome Position Start End
1: 1 350 350 350
2: 1 101 101 101
3: 2 101 101 101
4: 3 600 600 600
5: 4 200 200 200
df1[, .(Chromosome, Position)][df2,
    .(Chromosome, Position = x.Position, Start, End, CpG),
    on = .(Chromosome, Position >= Start, Position <= End), nomatch = 0L]
Chromosome Position Start End CpG
1: 1 101 50 200 10
2: 1 350 300 400 2
3: 4 200 100 200 5
which I guess is what you'd want.
The problem comes from left_join(), which stacks columns from the two datasets with the same name into one dataset. Since two columns in the same dataset cannot share a name, Start and End get renamed to Start.x/Start.y and End.x/End.y.
Therefore, you must either remove the Start and End columns from the first dataset:
library(data.table)
library(tidyr)
library(dplyr)
df1 <- fread("Chromosome Position Start End
1 101 101 101
2 101 101 101
3 600 600 600")
df2<- fread("Chromosome Start End CpG
1 50 200 10
1 300 400 2
4 100 200 5")
df <- df1 %>%
  select(Chromosome, Position) %>%
  left_join(df2, by = "Chromosome") %>%
  filter(Position >= Start & Position <= End)
or refer to the real name of the columns and then remove the extra cols:
df <- df1 %>%
  left_join(df2, by = "Chromosome") %>%
  filter(Position >= Start.y & Position <= End.y)
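To finish that second variant, the suffixed copies can then be dropped and the df2 names restored; a sketch (column names assumed from the join above):
df <- df %>%
  select(-Start.x, -End.x) %>%
  rename(Start = Start.y, End = End.y)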
Cheers !

How to bypass a nested for loop?

So the situation is this:
I basically have one data frame containing about 100,000 rows of data. I am interested in a particular column, POS, and I want to check whether each value of POS falls between the Start and End values of another data frame, and keep track of how many instances do.
E.g., in my first data frame, I have something like
ID POS
A 20
B 533
C 600
And in my other data frame, I have stuff like
START END
123 150
489 552
590 600
I want to know how many items in POS are in any of the START-END ranges. So in this case, there'd be 2 items. Also, if possible, can I get the IDs of the ones with POS between Start and End, too?
How can I go about doing that without having to use a nested for loop?
This is a fairly common problem which might happen in the context of a database. Here is a solution using sqldf:
library(sqldf)
query <- "SELECT POS, ID FROM df1 INNER JOIN df2 "
query <- paste0(query, "ON df1.POS BETWEEN df2.START AND df2.END")
sqldf(query)
If the ranges in your second data frame might overlap, then the above query could return more than one result for a given POS value. In this case, replace SELECT POS with SELECT DISTINCT POS.
We can use a non-equi join with data.table
library(data.table)
setDT(df1)[df2, on = .(POS > START, POS <= END)][, sum(!is.na(ID))]
#[1] 2
We can achieve the same using mapply in base R:
df1[mapply(function(x)any(x >= df2$START & x <= df2$END),df1$POS),]
# ID POS
#2 B 533
#3 C 600
Data
df1 <- read.table(text =
"ID POS
A 20
B 533
C 600", header = T)
df2 <- read.table(text =
"START END
123 150
489 552
590 600", header = TRUE)
Data frame: main
ID POS
A 20
B 533
C 600
Data frame: ran
START END
123 150
489 552
590 600
A simple sapply should suffice for your use case:
sapply(main$POS, function(x) { sum(x>=ran$START & x<=ran$END) })
will return:
[1] 0 1 1
You could bind this back to a new column in your main data frame:
main$Count <- sapply(main$POS, function(x) { sum(x >= ran$START & x <= ran$END) })
  ID POS Count
1  A  20     0
2  B 533     1
3  C 600     1
This should also work with overlapping ranges.
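For larger tables, data.table also ships inrange(), which performs exactly this point-in-any-interval test with optimized overlap machinery; a sketch, assuming the main/ran data frames above:
library(data.table)
## logical vector: does each POS fall inside any [START, END] interval?
hit <- inrange(main$POS, ran$START, ran$END)
main[hit, ]   # the rows (and IDs) whose POS is covered
sum(hit)      # the count, 2 for this data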

Finding patterns across rows of data.table in R

I am trying to find patterns across rows of a data.table while still maintaining the linkages of data across the rows. Here is a reduced example:
Row ID Value
1 C 1000
2 A 500
3 T -200
4 B 5000
5 T -900
6 A 300
I would like to search for all instances of "ATB" in successive rows and output the integers from the value column. Ideally, I want to bin the number of instances as well. The output table would look like this:
String Frequency Value1 Value2 Value3
ATB 1 500 -200 5000
CAT 1 1000 500 -200
Since the data.table package seems to be oriented towards providing operations on a column or row-wise basis, I thought this should be possible. However, I haven't the slightest idea where to start. Any pointers in the right direction would be greatly appreciated.
Thanks!
library("plyr")
library("stringr")
df <- read.table(header = TRUE, text = "Row ID Value
1 C 1000
2 A 500
3 T -200
4 B 5000
5 T -900
6 A 300
7 C 200
8 A 700
9 T -500")
sought <- c("ATB", "CAT", "NOT")
ids <- paste(df$ID, collapse = "")
ldply(sought, function(id) {
  found <- str_locate_all(ids, id)
  if (nrow(found[[1]])) {
    vals <- outer(found[[1]][, "start"], 0:2, function(x, y) df$Value[x + y])
  } else {
    vals <- as.list(rep(NA, 3))
  }
  data.frame(ID = id, Count = str_count(ids, id),
             setNames(as.data.frame(vals), paste0("Value", 1:3)))
})
This is a solution using stringr and plyr: the IDs are collapsed into a single string, all instances of each target are located, and then a data frame is constructed with the relevant columns.
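For reference, with the nine-row df above the call should give one output row per match (two for "CAT", which occurs twice in "CATBTACAT") and an all-NA row for the unmatched "NOT":
#    ID Count Value1 Value2 Value3
# 1 ATB     1    500   -200   5000
# 2 CAT     2   1000    500   -200
# 3 CAT     2    200    700   -500
# 4 NOT     0     NA     NA     NA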

Check if a value in one table (X) is between the values in two columns in another table (Y) with R data.table

Horrible title question, but this is what I am trying to achieve. For Table1 I want to add the Column "BETWEEN", verifying if the "POSITION" falls between any of the "START" and "STOP" values for the corresponding "BIN" in Table2.
Table1. BIN names (character) and POSITION in BIN (numeric):
BIN POSITION
1 12
1 52
1 86
7 6
7 22
X 112
X 139
MT 3
MT 26
Table2: BIN name (character) and START and STOP positions (numeric)
BIN START STOP
1 2 64
1 90 110
7 20 100
7 105 200
X 1 5
MT 1 1000
And the desired result - Table 1 with "BETWEEN":
BIN POSITION BETWEEN
1 12 TRUE
1 52 TRUE
1 86 FALSE
7 6 FALSE
7 22 TRUE
X 112 FALSE
X 139 FALSE
MT 3 TRUE
MT 26 TRUE
My Table 1 has about 4,000,000 rows, and Table 2 about 500,000 rows, and anything I came up with was very slow.
As an example of bigger tables, use the following:
positions <- seq(1,100000,10)
bins <- c("A","B","C","D","E","F","G","H","I","J")
tab1 <- data.table(bin = rep(bins,1,each=length(positions)), pos = rep(positions,10))
tab2 <- data.table(bin = rep(bins,1,each=2000), start = seq(5,100000,50), stop = start+25)
The desired output would be:
tab1
bin pos between
1: A 1 FALSE
2: A 11 TRUE
3: A 21 TRUE
4: A 31 FALSE
5: A 41 FALSE
The following method requires that, within a given bin, the ranges are mutually exclusive (e.g. you can't have bin A with bounds 1-5 and another bin A entry with bounds 4-8). Also, I modified your example a bit.
positions <- seq(1,100000,10)
bins <- c("A","B","C","D","E","F","G","H","I","J")
tab1 <- data.table(bin = rep(bins,1,each=length(positions)), pos = rep(positions,10))
setkey(tab1,"bin","pos")
tab2 <- data.table(bin = rep(bins,1,each=2000), start = seq(5,100000,50))
tab2[, end := start+25]
tab2[,pos:=start]
setkey(tab2,"bin","pos")
x<-tab2[tab1, roll=TRUE, nomatch=0]
tab2[,pos:=end]
setkey(tab2,"bin","pos")
y<-tab2[tab1, roll=-Inf, nomatch=0]
setkey(x,"bin","pos","start")
setkey(y,"bin","pos","start")
inBin<-x[y,nomatch=0]
inBin[, between:=TRUE]
setkey(tab1,"bin","pos")
setkey(inBin,"bin","pos")
result<-inBin[,list(bin,pos,between)][tab1]
result[is.na(between), between:=FALSE]
I don't have the time to explain my solution in depth right now. Instead I'll take the cheap way out and refer you to research the roll parameter of data.table. The basic methodology above is that I'm joining tab1 and tab2, rolling pos forward to the nearest end bound. Then I join tab1 and tab2, rolling pos backward to the nearest start bound. Then I do an inner join on those two sets, giving me all rows in tab1 which fall inside the bounds of a bin. From that point, it's just grunt work.
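On data.table 1.9.8 or later, a non-equi update join can replace the whole roll-forward/roll-backward dance; a sketch, assuming tab1 has (bin, pos) and tab2 has clean (bin, start, end) columns as above:
tab1[, between := FALSE]
## flag every tab1 row that falls inside any matching bin interval
tab1[tab2, on = .(bin, pos >= start, pos <= end), between := TRUE]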
The most straightforward approach is to nest the matching loops, I think. You may need to handle factors slightly differently. I haven't tested what happens if it does not find a bin match.
BIN <- c("1","1","1","7","7","X","X","MT","MT")
POSITION <- c(12,52,86,6,22,112,139,3,26)
npos <- length(POSITION)
BETWEEN <- vector(mode="logical",length=npos)
tab1 <- as.data.frame(cbind(BIN,POSITION))
BIN2 <- c("1","1","7","7","X","MT")
START <- c(2,90,20,105,1,1)
STOP <- c(64,110,100,200,5,1000)
tab2 <- as.data.frame(cbind(BIN2,START,STOP))
bins <- unique(tab1$BIN)
for (bin in bins) {
  #print(paste("bin=", bin))
  t1.bin.matches <- which(tab1$BIN == bin)
  t2.bin.compares <- which(tab2$BIN2 == bin)
  #print(t1.bin.matches)
  #print(t2.bin.compares)
  for (match in t1.bin.matches) {
    between = FALSE
    candidate = as.numeric(as.vector(tab1$POSITION)[match])
    for (compare in t2.bin.compares) {
      comp.start <- as.numeric(as.vector(tab2$START)[compare])
      comp.stop <- as.numeric(as.vector(tab2$STOP)[compare])
      if (candidate >= comp.start && candidate <= comp.stop) {
        between = TRUE
        break
      }
    }
    #print(paste(comp.start, candidate, comp.stop, between))
    BETWEEN[match] = between
  }
}
tab1 <- as.data.frame(cbind(tab1,BETWEEN))
tab1
Make sure your BIN columns are character, POSITION, START, END are numeric.
Table1$BIN = as.character(Table1$BIN)
Table1$POSITION = as.numeric(Table1$POSITION)
Table2$BIN = as.character(Table2$BIN)
Table2$START = as.numeric(Table2$START)
Table2$STOP = as.numeric(Table2$STOP)
Convert your data.frames to data.tables (library(data.table)), because the code below might otherwise be slow.
Table1 = as.data.table(Table1)
Table2 = as.data.table(Table2)
Generate desired output
z = apply(Table1, 1, function(x) {
  nrow(Table2[(as.numeric(x[2]) > START) & (as.numeric(x[2]) < STOP) &
                (BIN == as.character(x[1])), ]) > 0
})
cbind(Table1, z)
The old function is z(), the new one is y(). With the sample Table1 and Table2, the new function is 30% faster. I don't know how this advantage will scale as nrow increases, but I'm guessing it will grow. Let me know.
library(microbenchmark)
z = function(a) { apply(Table1, 1, function(x) {
  z = subset(Table2, Table2$BIN == as.character(x[1]))
  any(as.numeric(x[2]) > z$START & as.numeric(x[2]) < z$STOP)
}) }
y = function(a) { apply(Table1, 1, function(x) {
  nrow(Table2[(as.numeric(x[2]) > START) & (as.numeric(x[2]) < STOP) &
                (BIN == as.character(x[1])), ]) > 0
}) }
microbenchmark(z(), y(), times = 1000L)
expr min lq median uq max neval
z() 1168.283 1219.793 1237.791 1276.267 3481.576 1000
y() 809.575 848.052 863.257 885.909 1683.383 1000
Edit: you might need to muck with the as.numeric and as.character in the subsetting. I lost the data.table I created earlier and used the answer above's data.frame directly.

Create a new data frame based on another dataframe

I am trying to use a huge dataframe (180000 x 400) to calculate another one that would be much smaller.
I have the following dataframe
df1 <- data.frame(LOCAT = c(1, 2, 3, 4, 5, 6),
                  START = c(120, 345, 765, 1045, 1347, 1879),
                  END = c(150, 390, 802, 1120, 1436, 1935),
                  CODE1 = c(1, 1, 0, 1, 0, 0),
                  CODE2 = c(1, 0, 0, 0, -1, -1))
df1
LOCAT START END CODE1 CODE2
1 1 120 150 1 1
2 2 345 390 1 0
3 3 765 802 0 0
4 4 1045 1120 1 0
5 5 1347 1436 0 -1
6 6 1879 1935 0 -1
This is a sample dataframe. The rows continue until 180000 and the columns are over 400.
What I need to do is create a new data frame, based on each column, that tells me the size of each continuous run of "1" or "-1" and returns it with the location, size and value.
Something like this for CODE1:
LOCAT SIZE VALUE
1 1 to 2 270 POS
2 4 to 4 75 POS
And like this for CODE2:
LOCAT SIZE VALUE
1 1 to 1 30 POS
2 5 to 6 588 NEG
Unfortunately I still haven't figured out how to do this. I have been trying several lines of code to develop a function to do this automatically, but I keep getting lost or stuck in loops, and it seems that nothing works.
Any help would be appreciated.
Thanks in advance
Below is code that gives you the answer in the exact format that you wanted, except I split your "LOCAT" column into two columns entitled "Starts" and "Stops". This code will work for your entire data frame, no need to replicate it manually for each CODE (CODE1, CODE2, etc).
It assumes that the only non-CODE columns are named "LOCAT", "START" and "END".
# need package "plyr"
library("plyr")
# test2 is the example data frame that you gave in the question
test2 <- data.frame(
"LOCAT"=1:6,
"START"=c(120,345,765, 1045, 1347, 1879),
"END"=c(150,390,803,1120,1436, 1935),
"CODE1"=c(1,1,0,1,0,0),
"CODE2"=c(1,0,0,0,-1,-1)
)
codeNames <- names(test2)[!names(test2)%in%c("LOCAT","START","END")] # the names of columns that correspond to different codes
test3 <- reshape(test2, varying=codeNames, direction="long", v.names="CodeValue", timevar="Code") # reshape so the different codes are variables grouped into the same column
test4 <- test3[,!names(test3)%in%"id"] #remove the "id" column
sss <- function(x){ # sss gives the starting points, stopping points, and sizes (sss) in a data frame
rleX <- rle(x[,"CodeValue"]) # rle() to get the size of consecutive values
stops <- cumsum(rleX$lengths) # cumulative sum to get the end-points for the indices (the second value in your LOCAT column)
starts <- c(1, head(stops,-1)+1) # the starts are the first value in your LOCAT column
ssX0 <- data.frame("Value"=rleX$values, "Starts"=starts, "Stops"=stops) #the starts and stops from X (ss from X)
ssX <- ssX0[ssX0[,"Value"]!=0,] # remove the rows that correspond to CODE_ values that are 0 (not POS or NEG)
# The next 3 lines calculate the equivalent of your SIZE column
sizeX1 <- x[ssX[,"Starts"],"START"]
sizeX2 <- x[ssX[,"Stops"],"END"]
sizeX <- sizeX2 - sizeX1
sssX <- data.frame(ssX, "Size"=sizeX) # Combine the Size to the ssX (start stop of X) data frame
return(sssX) #Added in EDIT
}
answer0 <- ddply(.data=test4, .variables="Code", .fun=sss) # use the function ddply() in the package "plyr" (apply the function to each CODE, why we reshaped)
answer <- answer0 # duplicate the original, new version will be reformatted
answer[,"Value"] <- c("NEG",NA,"POS")[answer0[,"Value"]+2] # reformat slightly so that we have POS/NEG instead of 1/-1
Hopefully this helps, good luck!
Use run-length encoding to determine groups where CODE1 takes the same value.
rle_of_CODE1 <- rle(df1$CODE1)
For convenience, find the points where the value is non-zero, and the lengths of the corresponding blocks.
CODE1_is_nonzero <- rle_of_CODE1$values != 0
n <- rle_of_CODE1$lengths[CODE1_is_nonzero]
Ignore the parts of df1 where CODE1 is zero.
df1_with_nonzero_CODE1 <- subset(df1, CODE1 != 0)
Define a group based on the contiguous blocks we found with rle.
df1_with_nonzero_CODE1$GROUP <- rep(seq_along(n), times = n)
Use ddply() from the plyr package to get summary stats for each group.
library(plyr)
summarised_by_CODE1 <- ddply(
df1_with_nonzero_CODE1,
.(GROUP),
summarise,
MinOfLOCAT = min(LOCAT),
MaxOfLOCAT = max(LOCAT),
SIZE = max(END) - min(START)
)
summarised_by_CODE1$VALUE <- ifelse(
rle_of_CODE1$values[CODE1_is_nonzero] == 1,
"POS",
"NEG"
)
summarised_by_CODE1
## GROUP MinOfLOCAT MaxOfLOCAT SIZE VALUE
## 1 1 1 2 270 POS
## 2 3 4 4 75 POS
Now repeat with CODE2.
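A sketch that wraps the same rle logic into a reusable helper (summarise_code is a name introduced here), so that repeating it for CODE2 becomes a one-liner; it assumes df1 as defined above:
library(plyr)
summarise_code <- function(df, code_col) {
  r  <- rle(df[[code_col]])               # run-length encoding of the code column
  nz <- r$values != 0                     # which runs are non-zero (POS or NEG)
  n  <- r$lengths[nz]                     # lengths of those runs
  d  <- df[df[[code_col]] != 0, ]         # drop rows where the code is zero
  d$GROUP <- rep(seq_along(n), times = n) # group id per contiguous run
  out <- ddply(d, .(GROUP), summarise,
               LOCAT = paste(min(LOCAT), "to", max(LOCAT)),
               SIZE  = max(END) - min(START))
  out$VALUE <- ifelse(r$values[nz] == 1, "POS", "NEG")
  out
}
summarise_code(df1, "CODE1")
summarise_code(df1, "CODE2")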
