How to find the last observation in the next 10 seconds compared to the current row?

I want to find the last observation in the next 10 seconds compared to the current row. Here is an example:
library(dplyr)
library(data.table)
data <- setDT(data.frame(price = c(4, 5, 3, 1, 0),
                         datetime = c(as.POSIXct("2015/06/12 12:10:25"),
                                      as.POSIXct("2015/06/12 12:10:27"),
                                      as.POSIXct("2015/06/12 12:10:32"),
                                      as.POSIXct("2015/06/12 12:10:36"),
                                      as.POSIXct("2015/06/12 12:10:38"))))
#price datetime
#1 4 2015/06/12 12:10:25
#2 5 2015/06/12 12:10:27
#3 3 2015/06/12 12:10:32
#4 1 2015/06/12 12:10:36
#5 0 2015/06/12 12:10:38
This is the desired output:
#price datetime next.price
#1 4 2015/06/12 12:10:25 3
#2 5 2015/06/12 12:10:27 1
#3 3 2015/06/12 12:10:32 0
#4 1 2015/06/12 12:10:36 0
#5 0 2015/06/12 12:10:38 NA
Here is my code, but it does not produce the desired output:
data.new <- data %>% rowwise() %>%
  do(data.frame(
    price = .$price,
    datetime = .$datetime,
    next.price = data[data$datetime <= .$datetime + 10, "price", with = FALSE, mult = "last"]
  ))
What do you generally do when you want to create a value for each row based on the subsequent rows, e.g. find the max/min over the subsequent rows? Do you use rowwise() from dplyr?

If I understand correctly, this can be done using rolling joins as well:
# data.table v1.9.6
data[, next.price := data[.(datetime = datetime + 10), price, roll = 10 - 1, on = "datetime"]]
# price datetime next.price
# 1: 4 2015-06-12 12:10:25 3
# 2: 5 2015-06-12 12:10:27 1
# 3: 3 2015-06-12 12:10:32 0
# 4: 1 2015-06-12 12:10:36 0
# 5: 0 2015-06-12 12:10:38 NA
This performs a last-observation-carried-forward (LOCF) rolling join. For each value in i = datetime + 10, it finds the row in x = data whose datetime is the largest one <= that value, extracts the corresponding price, and creates a new column from it. roll = 10 - 1 = 9 limits how far values are carried forward to 9 seconds, so a row never matches itself exactly 10 seconds later, which is what produces the NA in the last row.
From the next version on, you won't need the datetime =; it will work with .(datetime + 10) directly.
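A non-equi join with mult = "last" is another way to express the same lookup without the roll argument; here is a minimal sketch, assuming data.table >= 1.9.8 (the windows table and its start/end columns are introduced purely for illustration and are not part of the original answer):
# search windows: strictly after the current time, up to 10 seconds later
windows <- data[, .(start = datetime, end = datetime + 10)]
data[, next.price := data[windows, on = .(datetime > start, datetime <= end),
                          mult = "last", x.price]]
Because the lower bound is strict, a row never matches itself, which reproduces the NA in the last row.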

Related

Count prior occurrences of value one greater than current value

I am trying to create a column that contains the number of prior occurrences (sorted by date) of a value one greater than the current value. In the example provided here, I manually create the values I want in the column labeled “wanted”, which is equal to the count of the prior occurrences (sorted by "date") of RoundNo that are equal to one greater than the focal row's RoundNo. This needs to be computed separately by group for each individual InvestorID.
So the first row's "wanted" value is the count of prior RoundNo values for Investor 1 where RoundNo == 3 (i.e., one larger than the first row's RoundNo of 2); in this case that is 0. Similarly, the second row's "wanted" value is the count of prior RoundNo values for Investor 1 where RoundNo == 2 (one larger than the second row's RoundNo of 1); in this case that is 1. Would appreciate any help. Code example is below. Thanks!
library(data.table)
dt = as.data.table(cbind(c(rep(1, 7), rep(2, 7)),
                         c("2019-08-01","2019-09-01","2019-10-01","2019-11-01","2019-12-01","2021-04-01","2021-10-01",
                           "2019-01-01","2019-02-01","2019-04-01","2019-08-01","2019-09-01","2019-10-01","2019-11-01"),
                         c(2,1,2,2,1,3,2,1,2,3,2,1,3,1)))
names(dt) = c("InvestorID","date","RoundNo")
wanted = c(0,1,0,0,3,0,1,0,0,0,1,2,0,2)
dt$wanted = wanted
1) Define a function Count which counts how many elements of its vector input equal one plus its last element. Then use rollapplyr to apply it to successively larger leading subsequences of RoundNo.
library(zoo)
Count <- function(x) sum(x == tail(x, 1) + 1)
dt[, wanted := rollapplyr(as.numeric(RoundNo), 1:.N, Count), by = InvestorID]
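As a quick hand check of what one window looks like (not part of the original answer): for the 5th row of Investor 1 the window is the first five RoundNo values, and Count tallies how many of them equal the last value plus one.
Count(c(2, 1, 2, 2, 1))  # last value is 1, so count the 2s
# [1] 3                  # matches wanted[5] for Investor 1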
2) An alternative is a self left join in which one instance of dt, aliased a, is left joined to a second instance, aliased b, keeping those b rows that are from the same InvestorID and come at or before the a row. Group by the a row and take the appropriate sum over the b rows.
library(sqldf)
sqldf("select a.*, sum(a.RoundNo + 1 == b.RoundNo) wanted
from dt a
left join dt b on a.InvestorID = b.InvestorID and b.rowid <= a.rowid
group by a.rowid")
3) This alternative only uses data.table. Count is from (1).
dt[, wanted := sapply(1:.N, function(i) Count(as.numeric(RoundNo)[1:i])),
by = InvestorID]
Another data.table solution using Reduce:
dt[order(date), .(date,
                  result = lapply(Reduce('c', as.numeric(RoundNo), accumulate = TRUE),
                                  function(x) sum(x == last(x) + 1)),
                  wanted),
   by = InvestorID]
InvestorID date result wanted
1: 2 2019-01-01 0 0
2: 2 2019-02-01 0 0
3: 2 2019-04-01 0 0
4: 2 2019-08-01 1 1
5: 2 2019-09-01 2 2
6: 2 2019-10-01 0 0
7: 2 2019-11-01 2 2
8: 1 2019-08-01 0 0
9: 1 2019-09-01 1 1
10: 1 2019-10-01 0 0
11: 1 2019-11-01 0 0
12: 1 2019-12-01 3 3
13: 1 2021-04-01 0 0
14: 1 2021-10-01 1 1
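For readers unfamiliar with the accumulate step, a small illustration (not from the original answer) of what Reduce builds before the anonymous function is applied:
# growing prefixes of Investor 1's RoundNo vector
Reduce('c', c(2, 1, 2, 2, 1), accumulate = TRUE)
# a list of prefixes: 2; 2 1; 2 1 2; 2 1 2 2; 2 1 2 2 1
# sum(x == last(x) + 1) is then evaluated on each prefix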

Rolling function with variable width in R

I need to summarize some data using a rolling window of varying width and shift. In particular, I need to apply a function (e.g. sum) over values recorded at different intervals.
Here is an example data frame:
library(tibble)
df <- tibble(days  = c(0, 1, 2, 3, 1),
             value = c(5, 7, 3, 4, 2))
df
# A tibble: 5 x 2
   days value
  <dbl> <dbl>
1     0     5
2     1     7
3     2     3
4     3     4
5     1     2
The columns indicate:
days: how many days elapsed since the previous observation. The first value is 0 because there is no previous observation.
value: the value I need to aggregate.
Now, let's assume that I need to sum the field value over windows of 4 days, shifting one day at a time.
I need something along these lines:
days value roll_sum rows_to_sum
0 5 15 1,2,3
1 7 10 2,3
2 3 3 3
3 4 6 4,5
1 2 NA NA
The column rows_to_sum has been added to make it clear.
Here more details:
The first value (15) is the sum of the first 3 rows, because 0 + 1 + 2 = 3, which is no more than the reference value 4, while adding the next row (days = 3) would bring the total day count to 6, which is more than 4.
The second value (10) is the sum of rows 2 and 3. This is because, excluding the first row (since we are shifting by one), we only sum rows 2 and 3: including row 4 would bring the total day count to 1 + 2 + 3 = 6, which is more than 4.
...
How can I achieve this?
Thank you
Here is one way:
library(dplyr)
library(purrr)
df %>%
  mutate(roll_sum = map_dbl(row_number(), ~ {
    i <- max(which(cumsum(days[.x:n()]) <= 4))
    if (is.na(i)) NA else sum(value[.x:(.x + i - 1)])
  }))
# days value roll_sum
# <dbl> <dbl> <dbl>
#1 0 5 15
#2 1 7 10
#3 2 3 3
#4 3 4 6
#5 1 2 2
Performing this calculation in base R:
sapply(seq(nrow(df)), function(x) {
  i <- max(which(cumsum(df$days[x:nrow(df)]) <= 4))
  if (is.na(i)) NA else sum(df$value[x:(x + i - 1)])
})
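One caveat worth noting: when no row fits the window, which() returns an empty vector and max() then yields -Inf (with a warning) rather than NA, so the is.na() test never fires. A slightly more defensive variant of the base R version, as a hedged sketch (roll_sum_at and limit are illustrative names, not from the original answer):
roll_sum_at <- function(x, days, value, limit = 4) {
  # relative positions whose cumulative day count stays within the limit
  ok <- which(cumsum(days[x:length(days)]) <= limit)
  if (length(ok) == 0) return(NA_real_)  # nothing fits in the window
  sum(value[x:(x + max(ok) - 1)])
}
sapply(seq_len(nrow(df)), function(x) roll_sum_at(x, df$days, df$value))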

How can I test if a value in one row is unique compared with all previous rows by group, and count the number of different values

I am trying to test whether the value in one row is unique compared with all previous rows, by group.
For example, for ID = 1, I want to compare the DRUG of the current row to all previous rows (more precisely, to those with a DATE earlier than the current row) under ID = 1. E.g. in row 2, drug A is the same as in row 1, so EXIST_BEFORE is coded as 1; in row 4, C is unique compared with the previous rows (A, A, B), so it is coded as 0.
To add another question: how can I count the number of different drugs prior to the current date? For example, for ID = 1, prev_drug for row 4 is 2, because there are two drugs (A, B) different from drug C prior to the DATE of row 4.
ID DATE DRUG EXIST_BEFORE prev_drug
1 2001-01-01 A NA 0
1 2001-02-01 A 1 0
1 2001-03-15 B 0 1
1 2001-04-20 C 0 2
1 2001-05-29 A 1 2
1 2001-05-02 B 1 2
2 2001-03-02 A NA 0
2 2001-03-23 C 0 1
2 2001-04-04 D 0 2
2 2001-05-05 B 0 3
I only know how to compare with the single row above using lag(), but I have no idea how to compare with all earlier dates for each ID.
For this, try using dplyr. Basically we can just group on ID and DRUG. For that grouped combination, find the first DATE occurrence using min(). If a row's date is after that first occurrence, then it is a repeat.
library(dplyr)
mydata %>%
  group_by(ID, DRUG) %>%
  mutate(FIRST_OCCURANCE = min(DATE),
         EXIST_BEFORE = DATE > FIRST_OCCURANCE)
ID DATE DRUG EXIST_BEFORE FIRST_OCCURANCE
<int> <date> <chr> <lgl> <date>
1 1 2001-01-01 A FALSE 2001-01-01
2 1 2001-02-01 A TRUE 2001-01-01
3 1 2001-03-15 B FALSE 2001-03-15
4 1 2001-04-20 C FALSE 2001-04-20
5 1 2001-05-29 A TRUE 2001-01-01
6 1 2001-05-02 B TRUE 2001-03-15
7 2 2001-03-02 A FALSE 2001-03-02
8 2 2001-03-23 C FALSE 2001-03-23
9 2 2001-04-04 D FALSE 2001-04-04
10 2 2001-05-05 B FALSE 2001-05-05
I broke it into two variables to show what is going on, but you can also reduce the mutate() line simply to:
mutate(EXIST_BEFORE = DATE > min(DATE))
Alternatively, the rowid() function from the data.table package can be used:
library(data.table)
setDT(DT)[order(DATE), EXIST_BEFORE := pmin(1L, rowid(ID, DRUG) - 1L)]
DT
ID DATE DRUG EXIST_BEFORE
1: 1 2001-01-01 A 0
2: 1 2001-02-01 A 1
3: 1 2001-03-15 B 0
4: 1 2001-04-20 C 0
5: 1 2001-05-29 A 1
6: 1 2001-05-02 B 1
7: 2 2001-03-02 A 0
8: 2 2001-03-23 C 0
9: 2 2001-04-04 D 0
10: 2 2001-05-05 B 0
rowid(ID, DRUG) - 1L counts the occurrences of each ID/DRUG combination (a kind of implied grouping), starting at 0. pmin() caps values greater than 1. order(DATE) ensures the rows are processed in date order.
Or, as suggested in Sotos' comment:
setDT(DT)[order(DATE), EXIST_BEFORE := as.integer(duplicated(DRUG)), by = ID][]
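The second part of the question (the prev_drug count of distinct earlier drugs that differ from the current one) is not addressed by the answers above; a minimal sketch in the same data.table style, assuming the data sits in a data.table named DT as before:
setDT(DT)[order(DATE),
          prev_drug := sapply(seq_len(.N), function(i)
            # distinct drugs on earlier rows of this ID that differ from the current drug
            length(setdiff(DRUG[seq_len(i - 1L)], DRUG[i]))),
          by = ID]
For ID = 1, row 4 (drug C) this counts the distinct prior drugs A and B, giving 2, as in the example table.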

Look up observation data based on another table

I have 2 tibble data frames that I am trying to reconcile. The first tibble has over a million observations, the first few rows are as follows:
data
ID Time(Converted to number)
1 23160
1 23161
1 23162
1 23163
1 23164
1 23165
2 24251
2 24252
The second tibble is a lookup table (that has information of a particular event that has occurred), simplified version as follows:
lookup_table
ID Event_Time Event_Indicator Number_of_Cumulative_Events
1 23162 1 1
1 23164 1 2
2 24255 1 1
2 24280 0 1
I would like to create a third column in the first tibble that shows the cumulative number of events as of the time of each observation. For the example above, this third column would be:
ID Time(Converted to number) Number
1 23160 0
1 23161 0
1 23162 1
1 23163 1
1 23164 2
1 23165 2
2 24251 0
2 24252 0
I am trying to avoid having to loop through the millions of observations to compare each observation's time to the Event_Time in the lookup table because of computation time.
However, I am not sure how to go about this without a loop. The issue is that lookup_table contains some IDs multiple times; if each ID appeared in lookup_table only once, I could do:
data$Event_Time <- lookup_table[match(data$ID, lookup_table$ID),"Event_Time"]
data$Number <- data %>% mutate(ifelse(Time >= Event_Time,1,0))
Any ideas how I could avoid the use of a loop and yet apply the lookup conditions for each observation? Thank you.
Edit: I am not trying to join the tables, but rather to compare the time columns in lookup_table and data to obtain my desired column. For example, if I were to write an inefficient loop, it would be:
for (i in 1:nrow(data)) {
  data$Number[i] <- subset(lookup_table, ID == data$ID[i])[
    max(which(data$Time[i] >= lookup_table$Event_Time)),
    "Number_of_Cumulative_Events"]
}
A possible solution is to count the cumulative events after the join. Note that an update on join is used.
library(data.table)
setDT(data)[, new := 0L][
  setDT(lookup_table), on = .(ID, Time = Event_Time), new := Event_Indicator][
  , new := cumsum(new), by = ID][]
ID Time new
1: 1 23160 0
2: 1 23161 0
3: 1 23162 1
4: 1 23163 1
5: 1 23164 2
6: 1 23165 2
7: 2 24251 0
8: 2 24252 0
Alternatively,
setDT(data)[setDT(lookup_table), on = .(ID, Time = Event_Time), new := Event_Indicator][
  is.na(new), new := 0][
  , new := cumsum(new), by = ID][]
will set missing entries to zero after the join.
A completely different approach is to use a rolling join:
lookup_table[, !"Event_Indicator"][data, on = .(ID, Event_Time = Time), roll = TRUE]
ID Event_Time Number_of_Cumulative_Events
1: 1 23160 NA
2: 1 23161 NA
3: 1 23162 1
4: 1 23163 1
5: 1 23164 2
6: 1 23165 2
7: 2 24251 NA
8: 2 24252 NA
(NA's have been left untouched for illustration)
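If the zeros from the desired output are preferred over NA, a small hedged follow-up is to store the rolling-join result and update the missing entries (res is just an illustrative name):
res <- lookup_table[, !"Event_Indicator"][data, on = .(ID, Event_Time = Time), roll = TRUE]
res[is.na(Number_of_Cumulative_Events), Number_of_Cumulative_Events := 0]  # replace the NA rows with 0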

Subset dataframe based on non-sequential dates

I have data that looks like this
df <- data.frame(datecol = as.Date(c("2010-04-03","2010-04-04","2010-04-05","2010-04-06","2010-04-07",
                                     "2010-04-03","2010-04-04","2010-04-05","2010-04-06","2010-04-07",
                                     "2010-05-06","2010-05-07","2010-05-09","2010-06-06","2010-06-07")),
                 x = c(1,1,1,0,1,1,1,0,0,0,1,0,0,0,1),
                 type = c(rep("A",5), rep("B",5), rep("C",5)))
> df
datecol x type
1 2010-04-03 1 A
2 2010-04-04 1 A
3 2010-04-05 1 A
4 2010-04-06 0 A
5 2010-04-07 1 A
6 2010-04-03 1 B
7 2010-04-04 1 B
8 2010-04-05 0 B
9 2010-04-06 0 B
10 2010-04-07 0 B
11 2010-05-06 1 C
12 2010-05-07 0 C
13 2010-05-09 0 C
14 2010-06-06 0 C
15 2010-06-07 1 C
I need to subset this data frame by type, keeping only the "types" which have 2 or more different dates (among the rows where x == 1) and where those dates are more than 1 day apart. In the above example type A has 4 such dates, and type C has 2 such dates which are more than 1 day apart, so I want to save these two as a new data frame. Type B has 2 such dates, but they are only 1 day apart, so I don't want to keep it.
I was thinking of doing it in a loop: count how many unique dates there are within each type and keep everything which has more than 2 different dates. Then I would look at the ones which have only 2 different dates, calculate the distance between them, and keep only the ones where the distance is more than 1. But it seems there should be a more efficient way. Any ideas?
One solution with data.table:
#make sure datecol is Date
df$datecol <- as.Date(df$datecol)
library(data.table)
#x needs to be 1 and the date difference more than a day per type
#then in the second [] we select the TRUEs
setDT(df)[x == 1, diff(datecol) > 1, by = type][V1 == TRUE, type]
#[1] A C
#Levels: A B C
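Since the question asks for the subset itself rather than just the type labels, a hedged follow-up is to feed the selected types back into the data (keep is an illustrative name; any() is used here so each qualifying type appears only once):
keep <- setDT(df)[x == 1, any(diff(datecol) > 1), by = type][V1 == TRUE, type]
df[type %in% keep]  # rows of type A and C only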
