Cumulative sum across first element of groups in data.table - r

I have some time series temperature measurements taken at half-hour intervals. I want to calculate an average cumulative growing-degree-days style metric. I am using the variable "datetime", but leaving out actual datetimes for simplicity's sake. Also, don't worry about whether this is actually the right calculation for growing degree days; it isn't. The following toy data emulate the challenge.
library(data.table)
# generate some approximate data
dt <- data.table(datetime = seq(1, 10.9, by = 0.1),
                 date = rep(1:10, each = 10),
                 T = floor(runif(100, 1, 10)))
Now I calculate a 'daily' average:
dt[, T_mean_daily := mean(T), by = date]
Now what I want to do is calculate the cumulative sum of T_mean_daily and have it displayed in a new column but repeated for each 'datetime' on a date as with T_mean_daily. I am having some trouble visualizing that with cumsum. The final output would look like:
datetime date T T_mean_daily T_sum
1: 1.0 1 4 5.6 5.6
2: 1.1 1 6 5.6 5.6
3: 1.2 1 9 5.6 5.6
4: 1.3 1 7 5.6 5.6
5: 1.4 1 3 5.6 5.6
6: 1.5 1 8 5.6 5.6
7: 1.6 1 3 5.6 5.6
8: 1.7 1 7 5.6 5.6
9: 1.8 1 8 5.6 5.6
10: 1.9 1 1 5.6 5.6
11: 2.0 2 2 3.6 9.2
12: 2.1 2 5 3.6 9.2
13: 2.2 2 4 3.6 9.2
14: 2.3 2 1 3.6 9.2
15: 2.4 2 9 3.6 9.2
16: 2.5 2 5 3.6 9.2
17: 2.6 2 2 3.6 9.2
18: 2.7 2 5 3.6 9.2
19: 2.8 2 2 3.6 9.2
20: 2.9 2 1 3.6 9.2
21: 3.0 3 1 5.9 15.1
22: 3.1 3 4 5.9 15.1
I'm looking for a data.table solution. This is not the cumsum by group; I want the cumsum over the first row (or unique value) of each group, carried across all groups.

Here is another data.table approach...
setnafill(dt[!duplicated(date), T_sum := cumsum(T_mean_daily)], "locf", cols = "T_sum")
Explanation
Since we only need the first row of each date, we can select those rows using !duplicated(date) in the i of data.table. In the j, we then calculate the cumulative sum of T_mean_daily.
This leaves a column with the correct cumsum value on each first date-row and NAs in between, so we use setnafill to locf-fill the value over the NA rows in the T_sum column.
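Written out as two explicit steps (the same operations as the one-liner above, assuming dt already has the T_mean_daily column):
# step 1: cumsum on the first row of each date only -> every other row gets NA in T_sum
dt[!duplicated(date), T_sum := cumsum(T_mean_daily)]
# step 2: carry each value forward over the NA rows
setnafill(dt, type = "locf", cols = "T_sum")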
Benchmarks
set.seed(42)
dt <- data.table(datetime = seq(1, 10.9, by = 0.1),
                 date = rep(1:10, each = 10),
                 T = floor(runif(100, 1, 10)))
dt[, T_mean_daily := mean(T), by = date]
microbenchmark::microbenchmark(
  r2evans = {
    test <- copy(dt)
    test[test[, .SD[1, ], by = date][, T_mean_daily := cumsum(T_mean_daily)],
         T_sum := i.T_mean_daily, on = .(date)]
  },
  wimpel = {
    test <- copy(dt)
    setnafill(test[!duplicated(date), T_sum := cumsum(T_mean_daily)], "locf", cols = "T_sum")
  }
)
Unit: microseconds
expr min lq mean median uq max neval cld
r2evans 3287.9 3488.20 3662.044 3560.65 3758.85 4833.1 100 b
wimpel 425.4 437.45 465.313 451.75 485.35 608.3 100 a

If we do a temporary subset to just the first row of each date, we can then use cumsum and join it back into the original data.
set.seed(42)
dt <- data.table(datetime = seq(1, 10.9, by = 0.1),
                 date = rep(1:10, each = 10),
                 T = floor(runif(100, 1, 10)))
dt[, T_mean_daily := mean(T), by = date]
dt
# datetime date T T_mean_daily
# <num> <int> <num> <num>
# 1: 1.0 1 9 6.2
# 2: 1.1 1 9 6.2
# 3: 1.2 1 3 6.2
# 4: 1.3 1 8 6.2
# 5: 1.4 1 6 6.2
# 6: 1.5 1 5 6.2
# 7: 1.6 1 7 6.2
# 8: 1.7 1 2 6.2
# 9: 1.8 1 6 6.2
# 10: 1.9 1 7 6.2
# ---
# 91: 10.0 10 7 5.6
# 92: 10.1 10 1 5.6
# 93: 10.2 10 2 5.6
# 94: 10.3 10 9 5.6
# 95: 10.4 10 9 5.6
# 96: 10.5 10 7 5.6
# 97: 10.6 10 3 5.6
# 98: 10.7 10 5 5.6
# 99: 10.8 10 7 5.6
# 100: 10.9 10 6 5.6
The aggregation is simply:
dt[, .SD[1,], by = date][, T_mean_daily := cumsum(T_mean_daily)][]
# date datetime T T_mean_daily
# <int> <num> <num> <num>
# 1: 1 1 9 6.2
# 2: 2 2 5 12.2
# 3: 3 3 9 18.3
# 4: 4 4 7 23.6
# 5: 5 5 4 29.6
# 6: 6 6 4 34.1
# 7: 7 7 7 40.1
# 8: 8 8 1 43.1
# 9: 9 9 6 47.0
# 10: 10 10 7 52.6
which we can join back on the original data as:
dt[ dt[, .SD[1,], by = date][, T_mean_daily := cumsum(T_mean_daily)], T_sum := i.T_mean_daily, on = .(date)]
dt
# datetime date T T_mean_daily T_sum
# <num> <int> <num> <num> <num>
# 1: 1.0 1 9 6.2 6.2
# 2: 1.1 1 9 6.2 6.2
# 3: 1.2 1 3 6.2 6.2
# 4: 1.3 1 8 6.2 6.2
# 5: 1.4 1 6 6.2 6.2
# 6: 1.5 1 5 6.2 6.2
# 7: 1.6 1 7 6.2 6.2
# 8: 1.7 1 2 6.2 6.2
# 9: 1.8 1 6 6.2 6.2
# 10: 1.9 1 7 6.2 6.2
# ---
# 91: 10.0 10 7 5.6 52.6
# 92: 10.1 10 1 5.6 52.6
# 93: 10.2 10 2 5.6 52.6
# 94: 10.3 10 9 5.6 52.6
# 95: 10.4 10 9 5.6 52.6
# 96: 10.5 10 7 5.6 52.6
# 97: 10.6 10 3 5.6 52.6
# 98: 10.7 10 5 5.6 52.6
# 99: 10.8 10 7 5.6 52.6
# 100: 10.9 10 6 5.6 52.6
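For completeness, the same running total can also be written in a single expression by zeroing out the repeated daily means, so that only the first row of each date contributes to the cumulative sum (a sketch, not taken from either answer above):
# only the first occurrence of each date adds its daily mean to the running total
dt[, T_sum := cumsum(T_mean_daily * !duplicated(date))]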

Related

R data.table: Merge conditional summary of a data.table back to the original data.table in one statement

I have the following dummy data -
dummyData = data.table(A = c(2,2,2,3,4,2,3,2,4,3), B = c(1.2, 3.2, 4.3, 3.1, 5.4, 6.6, 3.5, 3.2, 4.2, 2.3), desired_Result = c(18.5, 18.5, 18.5, 18.5, 18.5, 18.5, 18.5, 18.5, 18.5, 18.5))
I want to add a new column to this data.table as follows -
Pick the values in column B where A == 2, add them up, and put that total in a new column C in the original data.table.
I do not want the sum of B grouped by A; the result should contain only the sum of column B where A == 2, repeated on every row. Following is the code I tried, but it gives me per-group sums instead:
dummyData[, actual_Result := sum(B), by = A]
Following should be the output
A B desired_Result actual_Result
1: 2 1.2 18.5 18.5
2: 2 3.2 18.5 18.5
3: 2 4.3 18.5 18.5
4: 3 3.1 18.5 18.5
5: 4 5.4 18.5 18.5
6: 2 6.6 18.5 18.5
7: 3 3.5 18.5 18.5
8: 2 3.2 18.5 18.5
9: 4 4.2 18.5 18.5
10: 3 2.3 18.5 18.5
The following code gives NA in rows where A is 3 or 4; I need all rows of actual_Result to have the value 18.5:
dummyData[A == 2, actual_Result := sum(B), by = A]
A B desired_Result actual_Result C
1: 2 1.2 18.5 18.5 18.5
2: 2 3.2 18.5 18.5 18.5
3: 2 4.3 18.5 18.5 18.5
4: 3 3.1 18.5 8.9 NA
5: 4 5.4 18.5 9.6 NA
6: 2 6.6 18.5 18.5 18.5
7: 3 3.5 18.5 8.9 NA
8: 2 3.2 18.5 18.5 18.5
9: 4 4.2 18.5 9.6 NA
10: 3 2.3 18.5 8.9 NA
You could do
library(data.table)
dummyData[, actual_Result := sum(B[A == 2])]
dummyData
# A B desired_Result actual_Result
# 1: 2 1.2 18.5 18.5
# 2: 2 3.2 18.5 18.5
# 3: 2 4.3 18.5 18.5
# 4: 3 3.1 18.5 18.5
# 5: 4 5.4 18.5 18.5
# 6: 2 6.6 18.5 18.5
# 7: 3 3.5 18.5 18.5
# 8: 2 3.2 18.5 18.5
# 9: 4 4.2 18.5 18.5
#10: 3 2.3 18.5 18.5
which using base R is
dummyData$actual_Result <- sum(dummyData$B[dummyData$A == 2])
In dplyr, we can use
library(dplyr)
dummyData %>%
  mutate(actual_Result = sum(B[A == 2]))

How to reorganize data with the function `gather` (or similar) to reduce four variables to two

I have the dataframe df1 that summarizes the mean number of animals per 6-hour interval and per zone (mean_A and mean_B). I also have the standard error of these means (Se_A and Se_B). As an example:
df1 <- data.frame(Hour = c(0, 6, 12, 18, 24),
                  mean_A = c(7.3, 6.8, 8.9, 3.4, 12.1),
                  mean_B = c(6.3, 8.2, 3.1, 4.8, 13.2),
                  Se_A = c(1.3, 2.1, 0.9, 3.2, 0.8),
                  Se_B = c(0.9, 0.3, 1.8, 1.1, 1.3))
> df1
Hour mean_A mean_B Se_A Se_B
1 0 7.3 6.3 1.3 0.9
2 6 6.8 8.2 2.1 0.3
3 12 8.9 3.1 0.9 1.8
4 18 3.4 4.8 3.2 1.1
5 24 12.1 13.2 0.8 1.3
For plotting reasons, I need to reorganize the dataframe. What I would need is this (or similar):
> df1
Hour meanType meanValue Se
1 0 mean_A 7.3 1.3
2 6 mean_A 6.8 2.1
3 12 mean_A 8.9 0.9
4 18 mean_A 3.4 3.2
5 24 mean_A 12.1 0.8
6 0 mean_B 6.3 0.9
7 6 mean_B 8.2 0.3
8 12 mean_B 3.1 1.8
9 18 mean_B 4.8 1.1
10 24 mean_B 13.2 1.3
Does anyone know how to do it?
Using reshape
reshape(df1, idvar = "Hour", varying = 2:5, direction = "long", sep = "_", timevar = "type")
# Hour type mean Se
#0.A 0 A 7.3 1.3
#6.A 6 A 6.8 2.1
#12.A 12 A 8.9 0.9
#18.A 18 A 3.4 3.2
#24.A 24 A 12.1 0.8
#0.B 0 B 6.3 0.9
#6.B 6 B 8.2 0.3
#12.B 12 B 3.1 1.8
#18.B 18 B 4.8 1.1
#24.B 24 B 13.2 1.3
We can also use tidyr's pivot_longer (version 0.8.3.9000)
library(tidyr)
pivot_longer(df1, cols = -Hour, names_to = c(".value", "Type"), names_sep = "_")
# A tibble: 10 x 4
# Hour Type mean Se
# <dbl> <chr> <dbl> <dbl>
# 1 0 A 7.3 1.3
# 2 0 B 6.3 0.9
# 3 6 A 6.8 2.1
# 4 6 B 8.2 0.3
# 5 12 A 8.9 0.9
# 6 12 B 3.1 1.8
# 7 18 A 3.4 3.2
# 8 18 B 4.8 1.1
# 9 24 A 12.1 0.8
#10 24 B 13.2 1.3
From the vignette:
Note the special variable name .value: this tells pivot_longer() that that component of the variable name defines the name of the output value column.
Here, .value picks up the "mean" and "Se" prefixes from the column names, which is why the result has two value columns (mean and Se).
We can use melt from data.table, which makes this easier because it has built-in support for multiple measure patterns, creating separate value columns when reshaping from 'wide' to 'long'. The last step in the chain simply relabels the default factor levels (1, 2) of meanType with the original column names mean_A and mean_B.
library(data.table)
melt(setDT(df1), measure = patterns("^mean", "^Se"),
variable.name = "meanType", value.name = c("meanValue", "Se"))[,
meanType := names(df1)[2:3][meanType]][]
# Hour meanType meanValue Se
# 1: 0 mean_A 7.3 1.3
# 2: 6 mean_A 6.8 2.1
# 3: 12 mean_A 8.9 0.9
# 4: 18 mean_A 3.4 3.2
# 5: 24 mean_A 12.1 0.8
# 6: 0 mean_B 6.3 0.9
# 7: 6 mean_B 8.2 0.3
# 8: 12 mean_B 3.1 1.8
# 9: 18 mean_B 4.8 1.1
#10: 24 mean_B 13.2 1.3
If we need a tidyverse approach
library(tidyverse)
gather(df1, meanType, val, -Hour) %>%
  separate(meanType, into = c("meanType1", "meanType")) %>%
  spread(meanType1, val) %>%
  mutate(meanType = str_c("mean_", meanType)) %>%
  arrange(meanType)
# Hour meanType mean Se
#1 0 mean_A 7.3 1.3
#2 6 mean_A 6.8 2.1
#3 12 mean_A 8.9 0.9
#4 18 mean_A 3.4 3.2
#5 24 mean_A 12.1 0.8
#6 0 mean_B 6.3 0.9
#7 6 mean_B 8.2 0.3
#8 12 mean_B 3.1 1.8
#9 18 mean_B 4.8 1.1
#10 24 mean_B 13.2 1.3
NOTE: gather also works here, but make sure to check the column types before gathering. As both value columns are numeric here, it is not an issue. When there are multiple types and we gather them into a single column, we may need type_convert (from readr) after the spread step.
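As a small illustration of that caveat (a hypothetical mixed-type example, not from the question; it assumes readr is installed):
library(dplyr)
library(tidyr)
library(readr)
# a character and a numeric column share the same suffix, so gathering coerces both to character
df_mixed <- data.frame(Hour = c(0, 6),
                       mean_A = c(7.3, 6.8),
                       label_A = c("x", "y"),
                       stringsAsFactors = FALSE)
df_mixed %>%
  gather(key, val, -Hour) %>%                  # val is character here
  separate(key, into = c("stat", "Type")) %>%
  spread(stat, val) %>%
  type_convert()                               # re-parses the mean column back to numeric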

Summarize function for dplyr doesn't output correct results by row for multiple columns

I have a dataset with 5 numeric columns, rachis1 to rachis5.
I have 100 rows of data, each with a name attached as a factor.
I want summary statistics for each row across all five columns.
head(rl)
name rachis1 rachis2 rachis3 rachis4 rachis5
1 R04-001 2.4 2.6 2.7 3.0 2.4
2 R04-002 7.0 7.4 7.7 6.8 7.4
3 R04-003 3.5 3.7 3.9 4.1 3.8
4 R04-004 9.5 9.1 7.8 8.8 8.2
5 R04-005 3.0 3.3 3.4 3.8 3.3
6 R04-006 9.2 9.8 9.5 9.4 10.1
My code for this is.
library(dplyr)
####Rachis
RL <- rl %>%
  group_by(name) %>%
  summarize(RL = mean(rachis1:rachis5), RLMAX = max(rachis1:rachis5),
            RLMIN = min(rachis1:rachis5), RLSTD = sd(rachis1:rachis5), na.rm = T)
head(RL)
tail(RL)
My resulting analysis comes out as...
head(RL)
# A tibble: 6 x 6
name RL RLMAX RLMIN RLSTD na.rm
<fctr> <dbl> <dbl> <dbl> <dbl> <lgl>
1 R04-001 2.4 2.4 2.4 NA TRUE
2 R04-002 7.0 7.0 7.0 NA TRUE
3 R04-003 3.5 3.5 3.5 NA TRUE
4 R04-004 9.0 9.5 8.5 0.7071068 TRUE
5 R04-005 3.0 3.0 3.0 NA TRUE
6 R04-006 9.2 9.2 9.2 NA TRUE
I was wondering why there is NA in RLSTD (the standard deviation) and why the min and max are not the min and max of the row.
Is there another way to gather my descriptive statistics?
I can't tell if you have duplicate row names among the 100 rows. If you do, and as you already have the data in this format and are using the tidyverse, perhaps this may work. Notice I have placed the na.rm argument within the individual statistic function calls.
RL <- rl %>%
  group_by(name) %>%
  summarise(RL    = mean(c(rachis1, rachis2, rachis3, rachis4, rachis5), na.rm = TRUE),
            RLMAX = max(c(rachis1, rachis2, rachis3, rachis4, rachis5), na.rm = TRUE),
            RLMIN = min(c(rachis1, rachis2, rachis3, rachis4, rachis5), na.rm = TRUE),
            RLSTD = sd(c(rachis1, rachis2, rachis3, rachis4, rachis5), na.rm = TRUE))
Here are the results from the summarise code with dplyr. It works great now.
name RL RLMAX RLMIN RLSTD
<fctr> <dbl> <dbl> <dbl> <dbl>
1 R04-001 2.62 3.0 2.4 0.2489980
2 R04-002 7.26 7.7 6.8 0.3577709
3 R04-003 3.80 4.1 3.5 0.2236068
4 R04-004 8.68 9.5 7.8 0.6833740
5 R04-005 3.36 3.8 3.0 0.2880972
6 R04-006 9.60 10.1 9.2 0.3535534
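An alternative sketch that does not depend on name identifying a single row is to compute the statistics row-wise with dplyr's rowwise() (not from the answer above, just another option):
library(dplyr)
RL <- rl %>%
  rowwise() %>%
  mutate(RL    = mean(c(rachis1, rachis2, rachis3, rachis4, rachis5), na.rm = TRUE),
         RLMAX = max(c(rachis1, rachis2, rachis3, rachis4, rachis5), na.rm = TRUE),
         RLMIN = min(c(rachis1, rachis2, rachis3, rachis4, rachis5), na.rm = TRUE),
         RLSTD = sd(c(rachis1, rachis2, rachis3, rachis4, rachis5), na.rm = TRUE)) %>%
  ungroup()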

Check overlap begin and end time by group in R

I want to check for overlaps in the data; here are the data:
ID <- c(rep(1,3), rep(3, 5), rep(4,4),rep(5,5))
Begin <- c(0,2.5,3,7,8,7,25,25,10,15,17,20,1,NA,10,11,13)
End <- c(1.5,3.5,6,12,8,11,29,35, 12,19,NA,28,5,20,30,20,25)
df <- data.frame(ID, Begin, End)
df
ID Begin End
1 1 0.0 1.5
2 1 2.5 3.5
3 1 3.0 6.0*
4 3 7.0 12.0
5 3 8.0 8.0*
6 3 7.0 11.0*
7 3 25.0 29.0
8 3 25.0 35.0*
9 4 10.0 12.0
10 4 15.0 19.0
11 4 17.0 NA*
12 4 20.0 28.0
13 5 1.0 5.0
14 5 NA 20.0
15 5 10.0 30.0
16 5 11.0 20.0*
17 5 13.0 25.0*
* means the row overlaps an earlier interval:
For row 3 (ID = 1), Begin = 3.0 is smaller than the previous End of 3.5, so we set Begin_New = 3.5.
For ID = 3 it is different: in row 5, Begin = 8.0 is smaller than the previous End of 12.0, so we set Begin_New = 12. And it keeps going: for row 6, comparing Begin = 7.0 only against the immediately preceding End of 8.0 would be wrong, because the largest End seen so far is still 12.
So here is my desired output:
ID Begin End Begin_New1
1 1 0.0 1.5 0.0
2 1 2.5 3.5 2.5
3 1 3.0 6.0 3.5*
4 3 7.0 12.0 7.0
5 3 8.0 8.0 12.0*
6 3 7.0 11.0 12.0*
7 3 25.0 29.0 25.0
8 3 25.0 35.0 29.0*
9 4 10.0 12.0 10.0
10 4 15.0 19.0 15.0
11 4 17.0 NA 19.0*
12 4 20.0 28.0 20.0
13 5 1.0 5.0 1.0
14 5 NA 20.0 NA
15 5 10.0 30.0 20.0*
16 5 11.0 20.0 30.0*
17 5 13.0 25.0 30.0*
When I use this code, I don't get the output I want; it only shifts by one row and compares row by row:
setDT(df)[, Begin_New := shift(End), by = ID][!which(Begin < Begin_New), Begin_New:= Begin]
ID Begin End Begin_New
1: 1 0.0 1.5 0.0
2: 1 2.5 3.5 2.5
3: 1 3.0 6.0 3.5
4: 3 7.0 12.0 7.0
5: 3 8.0 8.0 12.0
6: 3 7.0 11.0 8.0
7: 3 25.0 29.0 25.0
8: 3 25.0 35.0 29.0
9: 4 10.0 12.0 10.0
10: 4 15.0 19.0 15.0
11: 4 17.0 NA 19.0
12: 4 20.0 28.0 20.0
13: 5 1.0 5.0 1.0
14: 5 NA 20.0 NA
15: 5 10.0 30.0 20.0
16: 5 11.0 20.0 30.0
17: 5 13.0 25.0 20.0
This is not the output I want.
I think your code is pretty much right, you just need to use cummax:
df[, Begin_New := {
  # running maximum of End, lagged one row; the group's first Begin is used as the fill,
  # so the first row of each ID is always left unchanged
  high_so_far = shift(cummax(End), fill = Begin[1L])
  # rows whose Begin starts before the running maximum get pushed up to it
  w = which(Begin < high_so_far)
  Begin[w] = high_so_far[w]
  Begin
}, by = ID]
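One detail worth noting: rows where the comparison involves NA (a missing Begin, or a running maximum that has turned NA) are left untouched, because which() drops NA comparisons. A tiny standalone illustration (not from the answer):
c(1, NA, 10) < c(1, 5, 20)
# [1] FALSE    NA  TRUE
which(c(1, NA, 10) < c(1, 5, 20))
# [1] 3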

Shift one row by ID in R

I have a data frame and want to create a new variable "Begin1" with the following condition: if "Begin" in a row is smaller than the "End" of the previous row within the same ID, replace it with that "End" value, because the intervals overlap.
ID <- c(rep(1,3), rep(3, 5), rep(4,4))
Begin <- c(0,2.5,5, 7,8,7,25,25,10,15,17,20)
End <- c(1.5,3.5,6, 7.5,8,11,29,35, 12,19,21,28)
df <- data.frame(ID, Begin, End)
df
ID Begin End
1 1 0.0 1.5
2 1 2.5 3.5
3 1 5.0 6.0
4 3 7.0 7.5
5 3 8.0 8.0
6 3 7.0 11.0**
7 3 25.0 29.0
8 3 25.0 35.0**
9 4 10.0 12.0
10 4 15.0 19.0
11 4 17.0 21.0**
12 4 20.0 28.0**
If you look at the bolded rows (6, 8, 11, 12): starting with row 6 (ID 3), "Begin" = 7.0 is smaller than the "End" of the previous row, so we set "Begin1" = 8.0. For row 8 (ID 3), "Begin" = 25 is smaller than the previous "End" = 29, so we set "Begin1" = 29, and so on. Here is the desired output:
ID Begin Begin1 End
1 1 0.0 0.0 1.5
2 1 2.5 2.5 3.5
3 1 5.0 5.0 6.0
4 3 7.0 7.0 7.5
5 3 8.0 8.0 8.0
6 3 7.0 8.0 11.0**
7 3 25.0 25.0 29.0
8 3 25.0 29.0 35.0**
9 4 10.0 10.0 12.0
10 4 15.0 15.0 19.0
11 4 17.0 19.0 21.0**
12 4 20.0 21.0 28.0**
Thanks for your advice
Here is an update:
ID <- c(rep(1,3), rep(3, 5), rep(4,4))
Group <-c(1,1,2,1,1,1,2,2,1,1,1,2)
Begin <- c(0,2.5,5, 7,8,7,25,25,10,15,17,20)
End <- c(1.5,3.5,6, 7.5,8,11,29,35, 12,19,21,28)
df <- data.frame(ID,Group, Begin, End)
This time I want to group by ID and Group, but I got an error with data.table.
This is the output:
ID Group Begin End Begin1
1 1 1 0.0 1.5 0.0
2 1 1 2.5 3.5 2.5
3 1 2 5.0 6.0 5.0
4 3 1 7.0 7.5 7.0
5 3 1 8.0 8.0 8.0
6 3 1 7.0 11.0 8.0
7 3 2 25.0 29.0 25.0
8 3 2 25.0 35.0 29.0
9 4 1 10.0 12.0 35.0
10 4 1 15.0 19.0 15.0
11 4 1 17.0 21.0 19.0
12 4 2 20.0 28.0 20.0 **** right here it does not change because it's group 2
Here is the result from the dplyr package; it works, but my data.table attempt does not:
library(dplyr)
df %>%
group_by(ID, Group) %>%
mutate(Begin1 = pmax(Begin, lag(End), na.rm =TRUE))
Source: local data frame [12 x 5]
Groups: ID, Group [6]
ID Group Begin End Begin1
(dbl) (dbl) (dbl) (dbl) (dbl)
1 1 1 0.0 1.5 0.0
2 1 1 2.5 3.5 2.5
3 1 2 5.0 6.0 5.0
4 3 1 7.0 7.5 7.0
5 3 1 8.0 8.0 8.0
6 3 1 7.0 11.0 8.0
7 3 2 25.0 29.0 25.0
8 3 2 25.0 35.0 29.0
9 4 1 10.0 12.0 10.0
10 4 1 15.0 19.0 15.0
11 4 1 17.0 21.0 19.0
12 4 2 20.0 28.0 20.0**** It works
A different way using data.table. The keys are the following:
the by statement, which does the calculation by ID;
the shift function, which lags the End variable to compare with Begin;
the pmax function, which does an element-wise max calculation.
Here is the code:
library(data.table)
dt <- as.data.table(df)
dt[, Begin1 := pmax(Begin, shift(End, type = 'lag'), na.rm = TRUE), by = ID]
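If dt is rebuilt from the OP's updated df (the version that adds the Group column), the same idea extends by grouping on both variables (a sketch along the lines of the code above, not part of the original answer):
dt <- as.data.table(df)   # df here is the updated data frame with the Group column
# grouping on both ID and Group keeps the comparison from crossing a group boundary
dt[, Begin1 := pmax(Begin, shift(End, type = 'lag'), na.rm = TRUE), by = .(ID, Group)]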
Here's an approach with base R creating the column using ifelse based on the lag of the End column. (Two caveats: lag here behaves like dplyr::lag, which must be loaded, since base R's stats::lag does not shift a plain vector; and this version does not group by ID, which is why row 9 below shows 35.0 rather than the desired 10.0.)
df$Begin1 <- ifelse(df$Begin <= lag(df$End), lag(df$End), df$Begin)
df$Begin1[which(is.na(df$Begin1))] <- df$Begin[which(is.na(df$Begin1))]
> df
ID Begin End Begin1
1 1 0.0 1.5 0.0
2 1 2.5 3.5 2.5
3 1 5.0 6.0 5.0
4 3 7.0 7.5 7.0
5 3 8.0 8.0 8.0
6 3 7.0 11.0 8.0
7 3 25.0 29.0 25.0
8 3 25.0 35.0 29.0
9 4 10.0 12.0 35.0
10 4 15.0 19.0 15.0
11 4 17.0 21.0 19.0
12 4 20.0 28.0 21.0
We can do this using data.table
library(data.table)
setDT(df)[, Begin1 := Begin]
i1 <- df[, .I[Begin < shift(End, fill = Begin[1L])], by = ID]$V1
df$Begin1[i1] <- df$End[i1-1]
df
# ID Begin End Begin1
# 1: 1 0.0 1.5 0.0
# 2: 1 2.5 3.5 2.5
# 3: 1 5.0 6.0 5.0
# 4: 3 7.0 7.5 7.0
# 5: 3 8.0 8.0 8.0
# 6: 3 7.0 11.0 8.0
# 7: 3 25.0 29.0 25.0
# 8: 3 25.0 35.0 29.0
# 9: 4 10.0 12.0 10.0
#10: 4 15.0 19.0 15.0
#11: 4 17.0 21.0 19.0
#12: 4 20.0 28.0 21.0
Or another option is
setDT(df)[, Begin1 := shift(End), by = ID][!which(Begin < Begin1), Begin1:= Begin]
df
# ID Begin End Begin1
# 1: 1 0.0 1.5 0.0
# 2: 1 2.5 3.5 2.5
# 3: 1 5.0 6.0 5.0
# 4: 3 7.0 7.5 7.0
# 5: 3 8.0 8.0 8.0
# 6: 3 7.0 11.0 8.0
# 7: 3 25.0 29.0 25.0
# 8: 3 25.0 35.0 29.0
# 9: 4 10.0 12.0 10.0
#10: 4 15.0 19.0 15.0
#11: 4 17.0 21.0 19.0
#12: 4 20.0 28.0 21.0
Or using dplyr
library(dplyr)
df %>%
group_by(ID) %>%
mutate(Begin1 = pmax(Begin, lag(End), na.rm =TRUE))
# ID Begin End Begin1
# <dbl> <dbl> <dbl> <dbl>
#1 1 0.0 1.5 0.0
#2 1 2.5 3.5 2.5
#3 1 5.0 6.0 5.0
#4 3 7.0 7.5 7.0
#5 3 8.0 8.0 8.0
#6 3 7.0 11.0 8.0
#7 3 25.0 29.0 25.0
#8 3 25.0 35.0 29.0
#9 4 10.0 12.0 10.0
#10 4 15.0 19.0 15.0
#11 4 17.0 21.0 19.0
#12 4 20.0 28.0 21.0
Update
Based on the OP's new data
setDT(df)[, Begin1 := shift(End), by = .(ID, Group)][
!which(Begin < Begin1), Begin1 := Begin]
df
# ID Group Begin End Begin1
#1: 1 1 0.0 1.5 0.0
#2: 1 1 2.5 3.5 2.5
#3: 1 2 5.0 6.0 5.0
#4: 3 1 7.0 7.5 7.0
#5: 3 1 8.0 8.0 8.0
#6: 3 1 7.0 11.0 8.0
#7: 3 2 25.0 29.0 25.0
#8: 3 2 25.0 35.0 29.0
#9: 4 1 10.0 12.0 10.0
#10: 4 1 15.0 19.0 15.0
#11: 4 1 17.0 21.0 19.0
#12: 4 2 20.0 28.0 20.0
