Identifying maximum number and longest set of time intervals - r

Say I have data that look like this:
   level   start     end
1      1 133.631 825.141
2      2 133.631 155.953
3      3 146.844 155.953
4      2 293.754 302.196
5      3 293.754 302.196
6      4 293.754 301.428
7      2 326.253 343.436
8      3 326.253 343.436
9      4 333.827 343.436
10     2 578.066 611.766
11     3 578.066 611.766
12     4 578.066 587.876
13     4 598.052 611.766
14     2 811.228 825.141
15     3 811.228 825.141
or this:
   level      start        end
1      1    3.60353 1112.62000
2      2    3.60353   20.35330
3      3    3.60353    8.77526
4      2   72.03720  143.60700
5      3   73.50530  101.13200
6      4   73.50530   81.64660
7      4   92.19030  101.13200
8      3  121.28500  143.60700
9      4  121.28500  128.25900
10     2  167.19700  185.04800
11     3  167.19700  183.44600
12     4  167.19700  182.84600
13     2  398.12300  418.64300
14     3  398.12300  418.64300
15     2  445.83600  454.54500
16     2  776.59400  798.34800
17     3  776.59400  796.64700
18     4  776.59400  795.91300
19     2  906.68800  915.89700
20     3  906.68800  915.89700
21     2 1099.44000 1112.62000
22     3 1099.44000 1112.62000
23     4 1100.14000 1112.62000
They produce the following graphs:
As you can see, there are several time intervals at different levels. The level-1 interval always spans the entire duration of interest; intervals at levels 2 and above are shorter.
What I would like to do is select the maximum number of non-overlapping time intervals that together cover the greatest total amount of time. I have marked in pink which ones those would be.
For small data frames it is possible to brute-force this, but there should be a more systematic way of doing it. I'm interested in hearing some ideas about what I should try.
EDIT:
I think one thing that could help here is the 'level' column. The results come from Kleinberg's burst detection algorithm (package 'bursts'). You will notice that the levels are hierarchically organized: intervals at the same level cannot overlap, but successively increasing levels (e.g. 2, 3, 4 in successive rows) can overlap, because higher-level bursts are nested inside lower-level ones.
In essence, I think the problem could be shortened to this. Take the levels produced, but remove level 1. This would be the vector for the 2nd example:
2 3 2 3 4 4 3 4 2 3 4 2 3 2 2 3 4 2 3 2 3 4
Then, look at the 2s... if there is at most one '3' between successive 2s, then that 2 is the longest interval. But if there are two or more 3s between successive 2s, then those 3s should be counted instead. Do this iteratively for each level. I think that should work...?
e.g.
library(magrittr) # for %>%
vec <- df$level %>% as.vector() %>% .[-1] # drop the level-1 row
vec
#[1] 2 3 2 3 4 4 3 4 2 3 4 2 3 2 2 3 4 2 3 2 3 4
max(vec) #4
vec3 <- vec # need to find two or more 4s between 3s
vec3[vec3 == 3] <- NA # blank out the 3s
names(vec3) <- cumsum(is.na(vec3)) # group label = number of 3s seen so far
vec3
# 0  1  1  2  2  2  3  3  3  4  4  4  5  5  5  6  6  6  7  7  8  8
# 2 NA  2 NA  4  4 NA  4  2 NA  4  2 NA  2  2 NA  4  2 NA  2 NA  4
vec3.res <- which(table(vec3, names(vec3))["4", ] > 1) # groups holding 2+ fours
which(names(vec3) %in% names(vec3.res) & vec3 == 4) #5 6
The above identifies positions 5 and 6 (which equate to rows 6 and 7 in the original df) as having two 4s that lie between 3s. Perhaps something using this sort of approach might work?
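The same bookkeeping can be written a little more directly with split (a hedged sketch of the idea above, shown for counting level-4 runs between 3s only):
gaps <- split(seq_along(vec), cumsum(vec == 3)) # positions grouped by how many 3s precede them
sapply(gaps, function(idx) sum(vec[idx] == 4))  # number of 4s in each gap between successive 3s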

OK here is a stab using your second data set to test. This might not be correct in all cases!!
library(data.table)
dat <- fread("data.csv")
dat[, use := "maybe"]
make.pass <- function(dat, low, high, the.level, use) {
  # only consider still-viable intervals at levels above the current one
  check <- dat[(use != "no" & level > the.level)]
  check[, contained.by.above := (low <= start & end <= high)]
  # flag contained intervals that directly follow another contained
  # interval of the same level
  check[, consecutive.contained.by.above :=
          (contained.by.above &
             !is.na(shift(contained.by.above, 1)) &
             shift(contained.by.above, 1)), by = level]
  if (!any(check[, consecutive.contained.by.above])) {
    # cause a side effect where we've learned we don't care:
    dat[check[(contained.by.above), rownum], use := "no"]
    print(check)
    return("yes")
  } else {
    return("no")
  }
}
dat[, rownum := .I]
dat[level == 1, use := make.pass(dat, start, end, level, use), by = rownum]
dat
dat[use == "maybe" & level == 2, use := make.pass(dat, start, end, level, use), by = rownum]
dat
dat[use == "maybe" & level == 3, use := make.pass(dat, start, end, level, use), by = rownum]
dat
# finally, correct for the last level
dat[use == "maybe" & level == 4, use := "yes"]
I wrote these last steps out so you can trace them in your own interactive session to see what's happening (see the print output to get an idea), but you can remove the print and condense the last steps into something like lapply(1:dat[,max(level)-1], function(the.level) dat[use=="maybe" & level==the.level,use:=make.pass......]). In response to your comment: if there are an arbitrary number of levels you will definitely want to use this formalism, and follow it with a final call to dat[use=="maybe" & level==max(level),use:="yes"]. A sketch of that condensed loop is shown below.
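For instance, the condensed iteration could look like this (a sketch only, written as a plain for loop since make.pass works by side effect; the "......" above stands for the same arguments as in the explicit calls):
for (the.level in 1:(dat[, max(level)] - 1)) {
  dat[use == "maybe" & level == the.level,
      use := make.pass(dat, start, end, level, use), by = rownum]
}
dat[use == "maybe" & level == max(level), use := "yes"] # leftovers at the top level are in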
Output:
> dat
    level      start        end use rownum
 1:     1    3.60353 1112.62000  no      1
 2:     2    3.60353   20.35330 yes      2
 3:     3    3.60353    8.77526  no      3
 4:     2   72.03720  143.60700  no      4
 5:     3   73.50530  101.13200  no      5
 6:     4   73.50530   81.64660 yes      6
 7:     4   92.19030  101.13200 yes      7
 8:     3  121.28500  143.60700 yes      8
 9:     4  121.28500  128.25900  no      9
10:     2  167.19700  185.04800 yes     10
11:     3  167.19700  183.44600  no     11
12:     4  167.19700  182.84600  no     12
13:     2  398.12300  418.64300 yes     13
14:     3  398.12300  418.64300  no     14
15:     2  445.83600  454.54500 yes     15
16:     2  776.59400  798.34800 yes     16
17:     3  776.59400  796.64700  no     17
18:     4  776.59400  795.91300  no     18
19:     2  906.68800  915.89700 yes     19
20:     3  906.68800  915.89700  no     20
21:     2 1099.44000 1112.62000 yes     21
22:     3 1099.44000 1112.62000  no     22
23:     4 1100.14000 1112.62000  no     23
    level      start        end use rownum
On the off chance this is correct, the algorithm can roughly be described as follows:
Mark all the intervals as possible.
Start with a given level. Pick a particular interval (by=rownum), call it X. With X in mind, subset a copy of the data to all higher-level intervals.
Mark any of these that are contained in X as "contained in X".
If consecutive intervals at the same level are contained in X, X is no good because it wastes intervals. In this case label X's "use" variable as "no" so we'll never think about X again. [Note: if it's possible that non-consecutive intervals are contained in X, or that containing multiple intervals across levels could ruin X's viability, then this logic might need to be changed to count contained intervals instead of finding consecutive ones; see the sketch after this list. I didn't think about this at all, it's just occurring to me now, so use at your own risk.]
On the other hand, if X passed the test, then we've already established it's good. Mark it as a "yes." But importantly, we also have to mark any single interval contained in X as "no," or else when we iterate the step it will forget that it was contained inside a good interval and mark itself as "yes" as well. This is the side-effect step.
Now, iterate, ignoring any results that we've already determined.
Finally any "maybe"s leftover at the highest level are automatically in.
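Here is roughly what that counting variant of make.pass might look like (my own hedged sketch, untested: it fails X whenever any single higher level contributes two or more contained intervals, consecutive or not):
make.pass.count <- function(dat, low, high, the.level, use) {
  check <- dat[(use != "no" & level > the.level)]
  check[, contained.by.above := (low <= start & end <= high)]
  # X fails if any single higher level has 2+ intervals contained in X
  bad <- check[contained.by.above == TRUE, .N, by = level][, any(N > 1)]
  if (!bad) {
    dat[check[(contained.by.above), rownum], use := "no"] # same side effect
    return("yes")
  }
  return("no")
}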
Let me know what you think of this--this is a rough draft and some aspects might not be correct.

Related

purrr map / lapply / sapply across groups of multiple (n > 1) elements at a time?

Suppose we have a vector; we can easily lapply, sapply, or map across one element at a time.
Is there a way to do the same across groups of more than one element of the vector?
Example
Suppose we are constructing API calls by appending comma-separated user_identifiers to the URL, like so:
user_identifiers <- c("0011399", "0011400", "0013581", "0013769", "0013770", "0018374",
"0018376", "0018400", "0018401", "0018410", "0018415", "0018417",
"0018419", "0018774", "0018775", "0018776", "0018777", "0018778",
"0018779", "0021627", "0023492", "0023508", "0023511", "0023512",
"0024120", "0025672", "0025673", "0025675", "0025676", "0028226",
"0028227", "0028266", "0028509", "0028510", "0028512", "0028515",
"0028518", "0028520", "0028523", "0029160", "0033141", "0034586",
"0035035", "0035310", "0035835", "0035841", "0035862", "0036503",
"0036580", "0036583", "0036587", "0037577", "0038582", "0038583",
"0038587", "0039727", "0039729", "0039731", "0044703", "0044726"
)
library(jsonlite) # for fromJSON
get_data <- function(user_identifier) {
  url <- paste0("https://www.myapi.com?userIdentifier=",
                paste0(user_identifier, collapse = ","))
  fromJSON(url)
}
In the above, get_data(user_identifiers) would return the API's response for all 60 user_identifiers in one single request.
But suppose the API accepts a maximum of 10 identifiers at a time (so we cannot do all 60 at once).
A simple solution would be to map/lapply/sapply over each element, e.g. sapply(user_identifiers, get_data). This would work fine; however, we would make 60 API calls when all we really need is 6. If we could map/lapply/sapply over groups of 10 at a time, that would be ideal.
Question
Is there an elegant way to map/lapply/sapply over groups of n elements at a time (where n>1)?
We can split user_identifiers into groups of 10 and use sapply/map/lapply on those:
sapply(split(user_identifiers, gl(length(user_identifiers)/10, 10)), get_data)
where gl creates 6 groups, each of length 10:
gl(length(user_identifiers)/10, 10)
# [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
# 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6
#Levels: 1 2 3 4 5 6
The same groups can be created with rep:
rep(1:ceiling(length(user_identifiers)/10), each = 10)
As @thelatemail mentioned, we can use cut and specify the number of groups to cut the data into:
sapply(split(user_identifiers, cut(seq_along(user_identifiers), 6)), get_data)
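A general-purpose variant (my own sketch; chunk is a hypothetical helper, not from any package): ceiling(seq_along(x)/n) builds the group index directly and also copes with lengths that are not a multiple of n.
chunk <- function(x, n) split(x, ceiling(seq_along(x) / n)) # chunks of at most n elements
batches <- chunk(user_identifiers, 10) # 6 batches of 10
# results <- lapply(batches, get_data) # 6 API calls instead of 60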

Frequency distribution using binCounts

I have a dataset of customer ages, and I want to make a frequency distribution with age bins spanning 9 years each.
Ages=c(83,51,66,61,82,65,54,56,92,60,65,87,68,64,51,
70,75,66,74,68,44,55,78,69,98,67,82,77,79,62,38,88,76,99,
84,47,60,42,66,74,91,71,83,80,68,65,51,56,73,55)
My desired outcome would be similar to the table shared below; variable names can differ (as you wish).
Could I use binCounts for this? If yes, could you help me with the code, as I am not sure about bx and idxs?
binCounts(x, idxs = NULL, bx, right = FALSE) ??
Age    Count
38-46      3
47-55      7
56-64      7
65-73     14
74-82     10
83-91      6
92-100     3
Much Appreciated!
I don't know about binCounts or even which package it is in, but here is a base R approach:
data.frame(table(cut(Ages, 0:7*9 + 37)))
      Var1 Freq
1  (37,46]    3
2  (46,55]    7
3  (55,64]    7
4  (64,73]   14
5  (73,82]   10
6  (82,91]    6
7 (91,100]    3
To exactly duplicate your results:
lowerlimit <- c(37, 46, 55, 64, 73, 82, 91, 100)
Labels <- paste(head(lowerlimit, -1) + 1, lowerlimit[-1], sep = "-") # add one to get 38, 47, etc.
group <- cut(Ages, lowerlimit, labels = Labels) # determine which bin each age belongs to
tab <- table(group) # form a frequency table
as.data.frame(tab) # transform the table into a data frame
   group Freq
1  38-46    3
2  47-55    7
3  56-64    7
4  65-73   14
5  74-82   10
6  83-91    6
7 92-100    3
All this can be combined as:
data.frame(table(cut(Ages, s <- 0:7*9 + 37, paste(head(s + 1, -1), s[-1], sep = "-"))))
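And to answer the binCounts question directly: a hedged sketch using matrixStats::binCounts (assuming that is the package you meant; bx takes the bin boundaries, and right = TRUE makes the bins right-closed, matching cut's default above):
library(matrixStats)
bx <- 0:7*9 + 37 # boundaries 37, 46, ..., 100
counts <- binCounts(Ages, bx = bx, right = TRUE)
data.frame(Age = paste(head(bx, -1) + 1, bx[-1], sep = "-"), Count = counts)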

Adding all values of a variable in R [duplicate]

This question already has answers here:
How to sum a variable by group
(18 answers)
Closed 7 years ago.
I don't know how to word the title exactly, so I will just do my best to explain below... Sorry in advance for the .csv format.
I have the following example dataset:
print(data)
   ID  Tag Flowers
1   1 6871       1
2   2 6750       1
3   3 6859       1
4   4 6767       1
5   5 6747       1
6   6 6261       1
7   7 6750       1
8   8 6767       1
9   9 6812       1
10 10 6746       1
11 11 6496       4
12 12 6497       1
13 13 6495       4
14 14 6481       1
15 15 6485       1
Notice that in rows 2 and 7, the tag 6750 appears twice. I observed one flower on plant number 6750 on two separate days, equaling two flowers in its lifetime. Basically, I want to add up every flower that occurs for tag 6750, tag 6767, etc., throughout ~100 rows. Each tag appears more than once, usually around 4 or 5 times.
I feel like I need to apply the unlist function here, but I'm a little lost as to how I should do so.
Without any extra packages, you can use the function aggregate():
res <- aggregate(data$Flowers, list(Tag = data$Tag), sum)
This calculates the sum of the Flowers column for every value in the Tag column; naming the list entry (Tag =) keeps a readable column name in the result.
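The formula interface does the same and keeps both original column names (just a base R note):
res <- aggregate(Flowers ~ Tag, data = data, FUN = sum) # one row per Tag, Flowers summed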

Short(er) notation of selecting a part of a data.frame or other objects in R

I always get angry at my R code when I have to process data frames, i.e. filter out certain rows. The code gets very illegible, as I tend to choose meaningful but long names for my objects. An example:
all.mutations.extra.large.name <- read.delim(filename)
head(all.mutations.extra.large.name)
             id        gene pos  aa      consequence V
ENSG00000105732 ZN574_HUMAN  81 x/N missense_variant 3
ENSG00000125879  OTOR_HUMAN   7 V/3 missense_variant 2
ENSG00000129194 SOX15_HUMAN  20 N/T missense_variant 3
ENSG00000099204 ABLM1_HUMAN  33 H/R missense_variant 2
ENSG00000103335 PIEZ1_HUMAN  11 Q/R missense_variant 3
ENSG00000171533  MAP6_HUMAN  39 A/G missense_variant 3
all.mutations.extra.large.name <- all.mutations.extra.large.name[which(all.mutations.extra.large.name$gene == "ZN574_HUMAN"), ]
So in order to kick out all the rows I am not interested in, I need to reference the object all.mutations.extra.large.name three times. And repeating this kind of step for different columns makes the code really difficult to understand.
Therefore my question: is there a way to filter out rows by a criterion without referencing the object three times? Something like this would be beautiful: myobj[, gene == "ZN574_HUMAN"]
You can use subset for that:
subset(all.mutations.extra.large.name, gene == "ZN574_HUMAN")
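If extra packages are an option, dplyr::filter gives the same single-reference style:
library(dplyr)
filter(all.mutations.extra.large.name, gene == "ZN574_HUMAN") # object named once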
Several options:
all.mutations.extra.large.name <- data.frame(a=1:5, b=2:6)
within(all.mutations.extra.large.name, a[a < 3] <- 0)
  a b
1 0 2
2 0 3
3 3 4
4 4 5
5 5 6
transform(all.mutations.extra.large.name, b = b^2)
  a  b
1 1  4
2 2  9
3 3 16
4 4 25
5 5 36
Also check ?attach if you would like to avoid repetitive typing like all.mutations.extra.large.name$foo, though be aware that attach is generally discouraged because it can silently mask other objects.

Insert incremental time between two rows determined by column entry

Sorry, me again. I will keep trying, but I want help in case I can't figure it out within the next hour.
My data looks like this:
B <- data.frame(ID = c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2),
                EVID = c(1,1,1,0,1,2,2,1,1,1,2,2,1,1,1),
                VALUE = seq(15))
B$TIME <- Sys.time() + 6*3600*(seq_len(nrow(B)) - 1) # rows six hours apart
In reality the times are more variable, and each ID may have multiple pairs of EVID=2 rows.
I want to insert one-hour increments between the times of each pair of EVID=2 rows, for as many hours as they are apart: starting from the first EVID=2 time, keep adding one hour until the time is within one hour of the second EVID=2 time in the pair, so I can get something like this
(VALUE and ID just duplicate the previous rows):
   ID EVID VALUE                TIME
1   1    1     1 2013-05-31 07:51:09
2   1    1     2 2013-05-31 13:51:09
3   1    1     3 2013-05-31 19:51:09
4   1    0     4 2013-06-01 01:51:09
5   1    1     5 2013-06-01 07:51:09
6   1    2     6 2013-06-01 13:51:09
6   1    2     6 2013-06-01 14:51:09
6   1    2     6 2013-06-01 15:51:09
6   1    2     6 2013-06-01 16:51:09
6   1    2     6 2013-06-01 17:51:09
6   1    2     6 2013-06-01 18:51:09
7   1    2     7 2013-06-01 19:51:09
8   1    1     8 2013-06-02 01:51:09
9   2    1     9 2013-06-02 07:51:09
10  2    1    10 2013-06-02 13:51:09
11  2    2    11 2013-06-02 19:51:09
11  2    2    11 2013-06-02 20:51:09
11  2    2    11 2013-06-02 21:51:09
11  2    2    11 2013-06-02 22:51:09
11  2    2    11 2013-06-02 23:51:09
11  2    2    11 2013-06-03 00:51:09
12  2    2    12 2013-06-03 01:51:09
13  2    1    13 2013-06-03 07:51:09
14  2    1    14 2013-06-03 13:51:09
15  2    1    15 2013-06-03 19:51:09
Below is my brainstorm/attempt:
library(data.table)
BDT <- data.table(row = 1:nrow(B), B, key = "ID")
BDT[, list(row, EVID, c(EVID) == 2)]
attach(B)
newB <- BDT[c(EVID) == 2,
            list(row = row + 1, ID = ID, EVID = EVID, VALUE = VALUE,
                 TIME = head(TIME + 3600, -1))]
finalB <- rbind(BDT, newB)[order(EVID, decreasing = TRUE)][order(row)][, -1, with = FALSE]
However, this adds just one row of TIME+1 hour after each EVID=2, which is not what I desired.
The next thing I tried duplicates every row after the first, which is not what I wanted either, but it has the advantage of sparing me from typing out all the column names (I have about 32):
newB <- B[c(1, rep(2:nrow(B), each = 2)), ]
## my wild guess -- (as.numeric(head(TIME)) - as.numeric(tail(TIME)))/3600 -- doesn't work;
## the subscript above says: keep row 1, then repeat each of rows 2 to n twice
newB[c(FALSE, TRUE), "EVID"] <- 2
newB[c(FALSE, TRUE), "TIME"] <- newB[c(FALSE, TRUE), "TIME"] + 3600
Thank you for any feedback.
=================================================================
eddie's code works well with my example, which I thought was a good representation, but my actual data keep producing
error in seq.int(...) wrong sign in 'by' argument
where (...) varies depending on what I was trying.
I have relatively large data, and the column I use as the ID in the example sits in the middle of the data table. Even with my small sample data I see that if I place the ID along with the other names in the list, R complains that item 2 of the rbind has n+1 columns compared to item 1. But if I don't include it in the list, so that I can use by=ID, R complains that the names are in a different order. And if I do not list one of the unimportant columns at the beginning of the data, R says item 2 has n-1 columns compared to item 1!
I thought that perhaps my error came from my times not being exactly hours apart, but test runs show that small differences are tolerated, and rounding, either to the hour or to integers, doesn't help.
I tried using length.out, ignoring the warning:
Warning message: In .rbind.data.table(...) : Argument 2 has names in a different order. Columns will be bound by name for consistency with base. Alternatively, you can drop names (by using an unnamed list) and the columns will then be joined by position. Or, set use.names=FALSE.
But then the code does not add rows between the 2s except at the end, where it adds too many!
What am I doing wrong? I've been pulling an all-nighter over this :(
OK, so when I rearrange the original data I can get rid of the warnings. However, the insertions still happen only at the end of the data, and there are too many of them.
This should work:
library(data.table)
dt <- data.table(B)
dt[, TIME := as.POSIXct(TIME)]
# for each ID, expand the pair of EVID == 2 rows into an hourly sequence,
# then drop the duplicated endpoints and restore time order
rbind(dt, dt[EVID == 2,
             list(EVID = EVID[1],
                  VALUE = VALUE[1],
                  TIME = seq.POSIXt(TIME[1], TIME[2], by = "hour")),
             by = ID])[!duplicated(paste(ID, EVID, TIME))][order(ID, TIME)]
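About the wrong sign in 'by' argument error from the edit above: seq.POSIXt throws exactly that when TIME[2] is earlier than TIME[1], which can happen if an ID's EVID==2 rows are not sorted by time or an ID has a stray single EVID==2 row. A defensive variant of the same expression (my own hedged sketch, untested on the real data):
expanded <- dt[EVID == 2,
               # skip groups without a strictly increasing pair of times
               if (.N >= 2 && TIME[2] > TIME[1])
                 list(EVID = EVID[1],
                      VALUE = VALUE[1],
                      TIME = seq.POSIXt(TIME[1], TIME[2], by = "hour")),
               by = ID]
rbind(dt, expanded)[!duplicated(paste(ID, EVID, TIME))][order(ID, TIME)]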
