Conditionally update rows and then group - r

Let me start by providing my sample dataset:
ID Start Code End Days
1 2016-03-01 A 2016-03-14 14
1 2016-03-01 A 2016-03-14 14
1 2016-03-01 B 2016-04-01 30
2 2016-02-01 A 2016-03-01 28
I'd like to, for each ID, and within this group, for each Code, check if the End is larger dan Start in the next row (if df$End[i] > df$Start[i+1]) and if so, update Start of the next row to End en recompute End (which is Start + Days) for that row i+1. The results should thus be:
ID Start Code End Days
1 2016-03-01 A 2016-03-14 14
1 2016-03-14 A 2016-03-28 14
1 2016-03-01 B 2016-04-01 30
2 2016-02-01 A 2016-03-01 28
Afterwards, if for a ID, and a Code the difference between df$End[i] - df$Start[i+1] <= 7, I would like to combine the rows, using the smallest df$Start and the largest df$End for this subset. Making:
ID Start Code End Days
1 2016-03-01 A 2016-03-28 14
1 2016-03-01 B 2016-04-01 30
2 2016-02-01 A 2016-03-01 28
Since my dataset is over 100M rows, I'd like a fast solution. Unfortunately I am pretty new to dplyr, so help is highly appreciated!
update: larger example:
ID Start Code End Days
1 2012-04-01 A 2012-04-07 7
1 2016-03-01 B 2016-03-15 15
1 2016-03-01 B 2016-05-29 90
1 2016-06-01 B 2016-08-29 90
1 2016-09-01 B 2016-11-29 90
1 2016-12-01 B 2017-02-28 90
1 2017-03-01 B 2017-05-09 90
1 2017-08-01 B 2017-10-29 90
1 2017-12-01 B 2018-02-28 90
2 2016-04-01 B 2016-04-14 14
This results in:
ID Start Code End
1 2012-04-01 A 2012-04-07
1 2016-03-01 B 2017-02-28
1 2017-03-01 B 2017-05-29
1 2018-08-01 B 2017-12-05
2 2016-04-01 B 2016-04-14
Where I would expect row 2 and to be combined.
For the first step I tried:
grouped_df <-
df %>%
group_by(ID, Code) %>%
mutate_at(vars(Start, End), funs(as.Date)) %>%
mutate(new_start = as.Date(ifelse(lag(End > Start), lag(End), Start), origin="1970-01-01")) %>%
mutate(new_stop= new_disp + Days)
However, if a new_end has been computed, we should now compare new_end and not End with new_start (and not Start).

Related

How to calculate number of hours from a fixed start point that varies among levels of a variable

The dataframe df1 summarizes detections of different individuals (ID) through time (Datetime). As a short example:
library(lubridate)
df1<- data.frame(ID= c(1,2,1,2,1,2,1,2,1,2),
Datetime= ymd_hms(c("2016-08-21 00:00:00","2016-08-24 08:00:00","2016-08-23 12:00:00","2016-08-29 03:00:00","2016-08-27 23:00:00","2016-09-02 02:00:00","2016-09-01 12:00:00","2016-09-09 04:00:00","2016-09-01 12:00:00","2016-09-10 12:00:00")))
> df1
ID Datetime
1 1 2016-08-21 00:00:00
2 2 2016-08-24 08:00:00
3 1 2016-08-23 12:00:00
4 2 2016-08-29 03:00:00
5 1 2016-08-27 23:00:00
6 2 2016-09-02 02:00:00
7 1 2016-09-01 12:00:00
8 2 2016-09-09 04:00:00
9 1 2016-09-01 12:00:00
10 2 2016-09-10 12:00:00
I want to calculate for each row, the number of hours (Hours_since_begining) since the first time that the individual was detected.
I would expect something like that (It can contain some mistakes since I did the calculations by hand):
> df1
ID Datetime Hours_since_begining
1 1 2016-08-21 00:00:00 0
2 2 2016-08-24 08:00:00 0
3 1 2016-08-23 12:00:00 60 # Number of hours between "2016-08-21 00:00:00" (first time detected the Ind 1) and "2016-08-23 12:00:00"
4 2 2016-08-29 03:00:00 115
5 1 2016-08-27 23:00:00 167 # Number of hours between "2016-08-21 00:00:00" (first time detected the Ind 1) and "2016-08-27 23:00:00"
6 2 2016-09-02 02:00:00 210
7 1 2016-09-01 12:00:00 276
8 2 2016-09-09 04:00:00 380
9 1 2016-09-01 12:00:00 276
10 2 2016-09-10 12:00:00 412
Does anyone know how to do it?
Thanks in advance!
You can do this :
library(tidyverse)
# first get min datetime by ID
min_datetime_id <- df1 %>% group_by(ID) %>% summarise(min_datetime=min(Datetime))
# join with df1 and compute time difference
df1 <- df1 %>% left_join(min_datetime_id) %>% mutate(Hours_since_beginning= as.numeric(difftime(Datetime, min_datetime,units="hours")))

How to group time by every n minutes in R

I have a dataframe with a lot of time series:
1 0:03 B 1
2 0:05 A 1
3 0:05 A 1
4 0:05 B 1
5 0:10 A 1
6 0:10 B 1
7 0:14 B 1
8 0:18 A 1
9 0:20 A 1
10 0:23 B 1
11 0:30 A 1
I want to group the time series into every 6 minutes and count the frequency of A and B:
1 0:06 A 2
2 0:06 B 2
3 0:12 A 1
4 0:12 B 1
5 0:18 A 1
6 0:24 A 1
7 0:24 B 1
8 0:18 A 1
9 0:30 A 1
Also, the class of the time series is character. What should I do?
Here's an approach to convert times to POSIXct, cut the times by 6 minute intervals, then count.
First, you need to specify the year, month, day, hour, minute, and seconds of your data. This will help with scaling it to larger datasets.
library(tidyverse)
library(lubridate)
# sample data
d <- data.frame(t = paste0("2019-06-02 ",
c("0:03","0:06","0:09","0:12","0:15",
"0:18","0:21","0:24","0:27","0:30"),
":00"),
g = c("A","A","B","B","B"))
d$t <- ymd_hms(d$t) # convert to POSIXct with `lubridate::ymd_hms()`
If you check the class of your new date column, you will see it is "POSIXct".
> class(d$t)
[1] "POSIXct" "POSIXt"
Now that the data is in "POSIXct", you can cut it by minute intervals! We will add this new grouping factor as a new column called tc.
d$tc <- cut(d$t, breaks = "6 min")
d
t g tc
1 2019-06-02 00:03:00 A 2019-06-02 00:03:00
2 2019-06-02 00:06:00 A 2019-06-02 00:03:00
3 2019-06-02 00:09:00 B 2019-06-02 00:09:00
4 2019-06-02 00:12:00 B 2019-06-02 00:09:00
5 2019-06-02 00:15:00 B 2019-06-02 00:15:00
6 2019-06-02 00:18:00 A 2019-06-02 00:15:00
7 2019-06-02 00:21:00 A 2019-06-02 00:21:00
8 2019-06-02 00:24:00 B 2019-06-02 00:21:00
9 2019-06-02 00:27:00 B 2019-06-02 00:27:00
10 2019-06-02 00:30:00 B 2019-06-02 00:27:00
Now you can group_by this new interval (tc) and your grouping column (g), and count the frequency of occurences. Getting the frequency of observations in a group is a fairly common operation, so dplyr provides count for this:
count(d, g, tc)
# A tibble: 7 x 3
g tc n
<fct> <fct> <int>
1 A 2019-06-02 00:03:00 2
2 A 2019-06-02 00:15:00 1
3 A 2019-06-02 00:21:00 1
4 B 2019-06-02 00:09:00 2
5 B 2019-06-02 00:15:00 1
6 B 2019-06-02 00:21:00 1
7 B 2019-06-02 00:27:00 2
If you run ?dplyr::count() in the console, you'll see that count(d, tc) is simply a wrapper for group_by(d, g, tc) %>% summarise(n = n()).
According to the sample dataset, the time series is given as time-of-day, i.e., without date.
The data.table package has the ITime class which is a time-of-day class stored as the integer number of seconds in the day. With data.table, we can use a rolling join to map times to the upper limit of the 6 minutes intervals (right-closed intervals):
library(data.table)
# coerce from character to class ITime
setDT(ts)[, time := as.ITime(time)]
# create sequence of breaks
breaks <- as.ITime(seq(as.ITime("0:00"), as.ITime("23:59:59"), as.ITime("0:06")))
# rolling join and aggregate
ts[, CJ(breaks, group, unique = TRUE)
][ts, on = .(group, breaks = time), roll = -Inf, .(x.breaks, group)
][, .N, by = .(upper = x.breaks, group)]
which returns
upper group N
1: 00:06:00 B 2
2: 00:06:00 A 2
3: 00:12:00 A 1
4: 00:12:00 B 1
5: 00:18:00 B 1
6: 00:18:00 A 1
7: 00:24:00 A 1
8: 00:24:00 B 1
9: 00:30:00 A 1
Addendum
If the direction of the rolling join is changed (roll = +Inf instead of roll = -Inf) we get left-closed intervals
ts[, CJ(breaks, group, unique = TRUE)
][ts, on = .(group, breaks = time), roll = +Inf, .(x.breaks, group)
][, .N, by = .(lower = x.breaks, group)]
which changes the result significantly:
lower group N
1: 00:00:00 B 2
2: 00:00:00 A 2
3: 00:06:00 A 1
4: 00:06:00 B 1
5: 00:12:00 B 1
6: 00:18:00 A 2
7: 00:18:00 B 1
8: 00:30:00 A 1
Data
library(data.table)
ts <- fread("
1 0:03 B 1
2 0:05 A 1
3 0:05 A 1
4 0:05 B 1
5 0:10 A 1
6 0:10 B 1
7 0:14 B 1
8 0:18 A 1
9 0:20 A 1
10 0:23 B 1
11 0:30 A 1"
, header = FALSE
, col.names = c("rn", "time", "group", "value"))

R sum two columns with condition on third column

I have a data frame like:
user_name started_at session_time_min task_completed timediff
ABC 2018-03-02 18:00:00 1 3 NA
ABC 2018-03-02 19:00:00 1036 18 1
ABC 2018-03-03 12:00:00 6 10 17
ABC 2018-03-04 21:00:00 0 1 33
ABC 2018-03-05 16:00:00 143 61 19
ABC 2018-03-05 18:00:00 12 18 2
ABC 2018-03-05 19:00:00 60 94 1
ABC 2018-03-05 20:00:00 20 46 1
ABC 2018-03-09 15:00:00 0 1 91
I want to sum session_time_min and task_completed with previous row if timediff = 1
Want output like:
user_name started_at session_time_min task_completed
ABC 2018-03-02 18:00:00 1037 21
ABC 2018-03-03 12:00:00 6 10
ABC 2018-03-04 21:00:00 0 1
ABC 2018-03-05 16:00:00 143 61
ABC 2018-03-05 18:00:00 92 158
ABC 2018-03-09 15:00:00 0 1
Any help will highly be appricated.
You could use a for loop to help you out especially if you want to use base R.
for (i in 1:nrow(data)) {
if (is.na(data[i,5])){
data[i+1,3] <- data[i+1,3] + data[i,3]
data[i+1,4] <- data[i+1,4] + data[i,4]
} else {}
}
data <- na.omit(data)
This code runs through each row in your dataframe and checks if the value in column 5 (timediff) is a NA. If it is an NA it adds (for the 2 columns you want positioned at 3 and 4) it to the row below (which will be i+1)
Make a group counter using cumsum and then use that to subset the identifier columns and rowsum the value columns:
grp <- cumsum(!dat$timediff %in% 1)
#[1] 1 1 2 3 4 5 5 5 6
cbind(
dat[match(unique(grp), grp), c("user_name","started_at")],
rowsum(dat[c("session_time_min","task_completed")], grp)
)
# user_name started_at session_time_min task_completed
#1 ABC 2018-03-0218:00:00 1037 21
#3 ABC 2018-03-0312:00:00 6 10
#4 ABC 2018-03-0421:00:00 0 1
#5 ABC 2018-03-0516:00:00 143 61
#6 ABC 2018-03-0518:00:00 92 158
#9 ABC 2018-03-0915:00:00 0 1

Summations by conditions on another row dealing with time

I am looking to run a cumulative sum at every row for values that occur in two columns before and after that point. So in this case I have volume of 2 incident types at every given minute over two days. I want to create a column which adds all the incidents that occured before and after for each row by the type. Sumif from excel comes to mind but I'm not sure how to port that over to R:
EDIT: ADDED set.seed and easier numbers
I have the following data set:
set.seed(42)
master_min =
setDT(
data.frame(master_min = seq(
from=as.POSIXct("2016-1-1 0:00", tz="America/New_York"),
to=as.POSIXct("2016-1-2 23:00", tz="America/New_York"),
by="min"
))
)
incident1= round(runif(2821, min=0, max=10))
incident2= round(runif(2821, min=0, max=10))
master_min = head(cbind(master_min, incident1, incident2), 5)
How do I essentially compute the following logic:
for each row, sum all the incident1s that occured before that row's timestamp and all the incident2s that occured after that row's timestamp? It would be great to get a data table solution, if not a dplyr as I am working with a large dataset. Below is a before and after for the data`:
BEFORE:
master_min incident1 incident2
1: 2016-01-01 00:00:00 9 6
2: 2016-01-01 00:01:00 9 5
3: 2016-01-01 00:02:00 3 5
4: 2016-01-01 00:03:00 8 6
5: 2016-01-01 00:04:00 6 9
AFTER THE CALCULATION:
master_min incident1 incident2 new_column
1: 2016-01-01 00:00:00 9 6 25
2: 2016-01-01 00:01:00 9 5 29
3: 2016-01-01 00:02:00 3 5 33
4: 2016-01-01 00:03:00 8 6 30
5: 2016-01-01 00:04:00 6 9 29
If I understand correctly:
# Cumsum of incident1, without current row:
master_min$sum1 <- cumsum(master_min$incident1) - master_min$incident1
# Reverse cumsum of incident2, without current row:
master_min$sum2 <- rev(cumsum(rev(master_min$incident2))) - master_min$incident2
# Your new column:
master_min$new_column <- master_min$sum1 + master_min$sum2
*update
The following two lines can do the job
master_min$sum1 <- cumsum(master_min$incident1)
master_min$sum2 <- sum(master_min$incident2) - cumsum(master_min$incident2)
I rewrote the question a bit to show a bit more comprehensive structure
library(data.table)
master_min <-
setDT(
data.frame(master_min = seq(
from=as.POSIXct("2016-1-1 0:00", tz="America/New_York"),
to=as.POSIXct("2016-1-1 0:09", tz="America/New_York"),
by="min"
))
)
set.seed(2)
incident1= as.integer(runif(10, min=0, max=10))
incident2= as.integer(runif(10, min=0, max=10))
master_min = cbind(master_min, incident1, incident2)
Now master_min looks like this
> master_min
master_min incident1 incident2
1: 2016-01-01 00:00:00 1 5
2: 2016-01-01 00:01:00 7 2
3: 2016-01-01 00:02:00 5 7
4: 2016-01-01 00:03:00 1 1
5: 2016-01-01 00:04:00 9 4
6: 2016-01-01 00:05:00 9 8
7: 2016-01-01 00:06:00 1 9
8: 2016-01-01 00:07:00 8 2
9: 2016-01-01 00:08:00 4 4
10: 2016-01-01 00:09:00 5 0
Apply transformations
master_min$sum1 <- cumsum(master_min$incident1)
master_min$sum2 <- sum(master_min$incident2) - cumsum(master_min$incident2)
Results
> master_min
master_min incident1 incident2 sum1 sum2
1: 2016-01-01 00:00:00 1 5 1 37
2: 2016-01-01 00:01:00 7 2 8 35
3: 2016-01-01 00:02:00 5 7 13 28
4: 2016-01-01 00:03:00 1 1 14 27
5: 2016-01-01 00:04:00 9 4 23 23
6: 2016-01-01 00:05:00 9 8 32 15
7: 2016-01-01 00:06:00 1 9 33 6
8: 2016-01-01 00:07:00 8 2 41 4
9: 2016-01-01 00:08:00 4 4 45 0
10: 2016-01-01 00:09:00 5 0 50 0

Aggregate one data frame by time intervals from another data frame

I'm trying to aggregate two data frames (df1 and df2).
The first contains 3 variables: ID, Date1 and Date2.
df1
ID Date1 Date2
1 2016-03-01 2016-04-01
1 2016-04-01 2016-05-01
2 2016-03-14 2016-04-15
2 2016-04-15 2016-05-17
3 2016-05-01 2016-06-10
3 2016-06-10 2016-07-15
The second also contains 3 variables: ID, Date3 and Value.
df2
ID Date3 Value
1 2016-03-15 5
1 2016-04-04 7
1 2016-04-28 7
2 2016-03-18 3
2 2016-03-27 5
2 2016-04-08 9
2 2016-04-20 2
3 2016-05-05 6
3 2016-05-25 8
3 2016-06-13 3
The idea is to get, for each df1 row, the sum of df2$Value that have the same ID and for which Date3 is between Date1 and Date2:
ID Date1 Date2 SumValue
1 2016-03-01 2016-04-01 5
1 2016-04-01 2016-05-01 14
2 2016-03-14 2016-04-15 17
2 2016-04-15 2016-05-17 2
3 2016-05-01 2016-06-10 14
3 2016-06-10 2016-07-15 3
I know how to make a loop on this, but the data frames are huge! Does someone has an efficient solution? Exploring data.table, plyr and dplyr but could not find a solution.
A couple of data.table solutions that should scale well (and a good stop-gap until non-equi joins are implemented):
Do the comparison in J using by=EACHI.
library(data.table)
setDT(df1)
setDT(df2)
df1[, `:=`(Date1 = as.Date(Date1), Date2 = as.Date(Date2))]
df2[, Date3 := as.Date(Date3)]
df1[ df2,
{
idx = Date1 <= i.Date3 & i.Date3 <= Date2
.(Date1 = Date1[idx],
Date2 = Date2[idx],
Date3 = i.Date3,
Value = i.Value)
},
on=c("ID"),
by=.EACHI][, .(sumValue = sum(Value)), by=.(ID, Date1, Date2)]
# ID Date1 Date2 sumValue
# 1: 1 2016-03-01 2016-04-01 5
# 2: 1 2016-04-01 2016-05-01 14
# 3: 2 2016-03-14 2016-04-15 17
# 4: 2 2016-04-15 2016-05-17 2
# 5: 3 2016-05-01 2016-06-10 14
# 6: 3 2016-06-10 2016-07-15 3
foverlap join (as suggested in the comments)
library(data.table)
setDT(df1)
setDT(df2)
df1[, `:=`(Date1 = as.Date(Date1), Date2 = as.Date(Date2))]
df2[, Date3 := as.Date(Date3)]
df2[, Date4 := Date3]
setkey(df1, ID, Date1, Date2)
foverlaps(df2,
df1,
by.x=c("ID", "Date3", "Date4"),
type="within")[, .(sumValue = sum(Value)), by=.(ID, Date1, Date2)]
# ID Date1 Date2 sumValue
# 1: 1 2016-03-01 2016-04-01 5
# 2: 1 2016-04-01 2016-05-01 14
# 3: 2 2016-03-14 2016-04-15 17
# 4: 2 2016-04-15 2016-05-17 2
# 5: 3 2016-05-01 2016-06-10 14
# 6: 3 2016-06-10 2016-07-15 3
Further reading
Rolling join on data.table with duplicate keys
foverlap joins in data.table
With the recently implemented non-equi joins feature in the current development version of data.table, v1.9.7, this can be done as follows:
dt2[dt1, .(sum = sum(Value)), on=.(ID, Date3>=Date1, Date3<=Date2), by=.EACHI]
# ID Date3 Date3 sum
# 1: 1 2016-03-01 2016-04-01 5
# 2: 1 2016-04-01 2016-05-01 14
# 3: 2 2016-03-14 2016-04-15 17
# 4: 2 2016-04-15 2016-05-17 2
# 5: 3 2016-05-01 2016-06-10 14
# 6: 3 2016-06-10 2016-07-15 3
The column names needs some fixing.. will work on it later.
Here's a base R solution using sapply():
df1 <- data.frame(ID=c(1L,1L,2L,2L,3L,3L),Date1=as.Date(c('2016-03-01','2016-04-01','2016-03-14','2016-04-15','2016-05-01','2016-06-01')),Date2=as.Date(c('2016-04-01','2016-05-01','2016-04-15','2016-05-17','2016-06-15','2016-07-15')));
df2 <- data.frame(ID=c(1L,1L,1L,2L,2L,2L,2L,3L,3L,3L),Date3=as.Date(c('2016-03-15','2016-04-04','2016-04-28','2016-03-18','2016-03-27','2016-04-08','2016-04-20','2016-05-05','2016-05-25','2016-06-13')),Value=c(5L,7L,7L,3L,5L,9L,2L,6L,8L,3L));
cbind(df1,SumValue=sapply(seq_len(nrow(df1)),function(ri) sum(df2$Value[df1$ID[ri]==df2$ID & df1$Date1[ri]<=df2$Date3 & df1$Date2[ri]>df2$Date3])));
## ID Date1 Date2 SumValue
## 1 1 2016-03-01 2016-04-01 5
## 2 1 2016-04-01 2016-05-01 14
## 3 2 2016-03-14 2016-04-15 17
## 4 2 2016-04-15 2016-05-17 2
## 5 3 2016-05-01 2016-06-15 17
## 6 3 2016-06-01 2016-07-15 3
Note that your df1 and expected output have slightly different dates in some cases; I used the df1 dates.
Here's another approach that attempts to be more vectorized: Precompute a cartesian product of indexes into the two frames, then perform a single vectorized conditional expression using the index vectors to get matching pairs of indexes, and finally use the matching indexes to aggregate the desired result:
cbind(df1,SumValue=with(expand.grid(i1=seq_len(nrow(df1)),i2=seq_len(nrow(df2))),{
x <- df1$ID[i1]==df2$ID[i2] & df1$Date1[i1]<=df2$Date3[i2] & df1$Date2[i1]>df2$Date3[i2];
tapply(df2$Value[i2[x]],i1[x],sum);
}));
## ID Date1 Date2 SumValue
## 1 1 2016-03-01 2016-04-01 5
## 2 1 2016-04-01 2016-05-01 14
## 3 2 2016-03-14 2016-04-15 17
## 4 2 2016-04-15 2016-05-17 2
## 5 3 2016-05-01 2016-06-15 17
## 6 3 2016-06-01 2016-07-15 3

Resources