Check overlap of begin and end times by group in R

I want to check my data for overlapping intervals. Here are the data:
ID <- c(rep(1,3), rep(3, 5), rep(4,4),rep(5,5))
Begin <- c(0,2.5,3,7,8,7,25,25,10,15,17,20,1,NA,10,11,13)
End <- c(1.5,3.5,6,12,8,11,29,35, 12,19,NA,28,5,20,30,20,25)
df <- data.frame(ID, Begin, End)
df
ID Begin End
1 1 0.0 1.5
2 1 2.5 3.5
3 1 3.0 6.0*
4 3 7.0 12.0
5 3 8.0 8.0*
6 3 7.0 11.0*
7 3 25.0 29.0
8 3 25.0 35.0*
9 4 10.0 12.0
10 4 15.0 19.0
11 4 17.0 NA*
12 4 20.0 28.0
13 5 1.0 5.0
14 5 NA 20.0
15 5 10.0 30.0
16 5 11.0 20.0*
17 5 13.0 25.0*
* means the row overlaps the previous one:
For row 3 (ID = 1), Begin = 3.0 is smaller than the previous End = 3.5, so set Begin_New = 3.5.
For ID = 3 it is different: in row 5, Begin = 8.0 is smaller than 12.0, so we set Begin_New = 12, and that has to carry forward. If in row 6 we only compare Begin = 7.0 with the previous End = 8.0, it is not correct, because the highest End seen so far in the group is already 12.
So here is my desired output:
ID Begin End Begin_New1
1 1 0.0 1.5 0.0
2 1 2.5 3.5 2.5
3 1 3.0 6.0 3.5*
4 3 7.0 12.0 7.0
5 3 8.0 8.0 12.0*
6 3 7.0 11.0 12.0*
7 3 25.0 29.0 25.0
8 3 25.0 35.0 29.0*
9 4 10.0 12.0 10.0
10 4 15.0 19.0 15.0
11 4 17.0 NA 19.0*
12 4 20.0 28.0 20.0
13 5 1.0 5.0 1.0
14 5 NA 20.0 NA
15 5 10.0 30.0 20.0*
16 5 11.0 20.0 30.0*
17 5 13.0 25.0 30.0*
When I use this code, I don't get the output I want; it only shifts by one row and compares each row with the immediately preceding one:
setDT(df)[, Begin_New := shift(End), by = ID][!which(Begin < Begin_New), Begin_New:= Begin]
ID Begin End Begin_New
1: 1 0.0 1.5 0.0
2: 1 2.5 3.5 2.5
3: 1 3.0 6.0 3.5
4: 3 7.0 12.0 7.0
5: 3 8.0 8.0 12.0
6: 3 7.0 11.0 8.0
7: 3 25.0 29.0 25.0
8: 3 25.0 35.0 29.0
9: 4 10.0 12.0 10.0
10: 4 15.0 19.0 15.0
11: 4 17.0 NA 19.0
12: 4 20.0 28.0 20.0
13: 5 1.0 5.0 1.0
14: 5 NA 20.0 NA
15: 5 10.0 30.0 20.0
16: 5 11.0 20.0 30.0
17: 5 13.0 25.0 20.0
This is not the output I want.

I think your code is pretty much right, you just need to use cummax:
df[, Begin_New := {
  high_so_far = shift(cummax(End), fill = Begin[1L])
  w = which(Begin < high_so_far)
  Begin[w] = high_so_far[w]
  Begin
}, by = ID]
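To see why cummax is the missing piece, here is a minimal check of the running-maximum logic for the ID = 3 group from the question (my own illustration; it assumes data.table is loaded for shift()):
library(data.table)
End_3 <- c(12, 8, 11, 29, 35)          # End values for ID = 3
cummax(End_3)                          # 12 12 12 29 35  -> highest End seen so far
shift(cummax(End_3), fill = 7)         # 7 12 12 12 29   -> lagged by one; 7 is Begin[1L]
# Each Begin is compared against the highest previous End in the group,
# not just against the End of the immediately preceding row.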

Related

Cumulative sum across first element of groups in data.table

I have some time series temperature measurements taken at half-hour intervals. I want to calculate an average cumulative growing-degree-days style metric. I am using the variable "datetime" but leaving out actual datetimes for simplicity's sake. Also, don't worry about whether this is actually the right calculation for growing degree days; it isn't. The following toy data emulate the challenge.
library(data.table)
#generate some approximate data.
dt <- data.table(datetime = seq(1, 10.9, by = 0.1),
                 date = rep(1:10, each = 10),
                 T = floor(runif(100, 1, 10)))
Now I calculate a 'daily' average:
dt[,T_mean_daily:=mean(T), by=date]
Now what I want to do is calculate the cumulative sum of T_mean_daily and have it displayed in a new column but repeated for each 'datetime' on a date as with T_mean_daily. I am having some trouble visualizing that with cumsum. The final output would look like:
datetime date T T_mean_daily T_sum
1: 1.0 1 4 5.6 5.6
2: 1.1 1 6 5.6 5.6
3: 1.2 1 9 5.6 5.6
4: 1.3 1 7 5.6 5.6
5: 1.4 1 3 5.6 5.6
6: 1.5 1 8 5.6 5.6
7: 1.6 1 3 5.6 5.6
8: 1.7 1 7 5.6 5.6
9: 1.8 1 8 5.6 5.6
10: 1.9 1 1 5.6 5.6
11: 2.0 2 2 3.6 9.2
12: 2.1 2 5 3.6 9.2
13: 2.2 2 4 3.6 9.2
14: 2.3 2 1 3.6 9.2
15: 2.4 2 9 3.6 9.2
16: 2.5 2 5 3.6 9.2
17: 2.6 2 2 3.6 9.2
18: 2.7 2 5 3.6 9.2
19: 2.8 2 2 3.6 9.2
20: 2.9 2 1 3.6 9.2
21: 3.0 3 1 5.9 15.1
22: 3.1 3 4 5.9 15.1
I'm looking for a data.table solution. This is not the cumsum by group; I want the cumsum of the first row (unique value) of each group, carried across all rows of the group.
Here is another data.table approach:
setnafill(dt[!duplicated(date), T_sum := cumsum(T_mean_daily)], "locf", cols = "T_sum")
Explanation
Since we only need the first row of each date, we can select those rows with !duplicated(date) in the i of data.table. In the j, we then calculate the cumulative sum of T_mean_daily.
We are now left with a column that has the correct cumsum value on each first date-row and NAs in between, so we use setnafill to locf-fill the value forward over the NA rows in the T_sum column.
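As a side note (my own sketch, not part of the answer above), the NA-fill step can be avoided by indexing the cumulated per-date means with the group number; this relies on the dates appearing in contiguous, ordered blocks, as they do here:
library(data.table)
# cumsum over the first mean of each date, expanded back out by group index
dt[, T_sum := cumsum(T_mean_daily[!duplicated(date)])[rleid(date)]]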
Benchmarks
set.seed(42)
dt <- data.table(datetime = seq(1, 10.9, by = 0.1),
                 date = rep(1:10, each = 10),
                 T = floor(runif(100, 1, 10)))
dt[, T_mean_daily := mean(T), by = date]
microbenchmark::microbenchmark(
  r2evans = {
    test <- copy(dt)
    test[ test[, .SD[1,], by = date][, T_mean_daily := cumsum(T_mean_daily)], T_sum := i.T_mean_daily, on = .(date)]
  },
  wimpel = {
    test <- copy(dt)
    setnafill(test[!duplicated(date), T_sum := cumsum(T_mean_daily)], "locf", cols = "T_sum")
  }
)
Unit: microseconds
    expr    min      lq     mean  median      uq    max neval cld
 r2evans 3287.9 3488.20 3662.044 3560.65 3758.85 4833.1   100   b
  wimpel  425.4  437.45  465.313  451.75  485.35  608.3   100  a
If we do a temporary subset to just the first row of each date, we can then use cumsum and join it back into the original data.
set.seed(42)
dt <- data.table(datetime = seq(1, 10.9, by = 0.1),
                 date = rep(1:10, each = 10),
                 T = floor(runif(100, 1, 10)))
dt[, T_mean_daily := mean(T), by = date]
dt
# datetime date T T_mean_daily
# <num> <int> <num> <num>
# 1: 1.0 1 9 6.2
# 2: 1.1 1 9 6.2
# 3: 1.2 1 3 6.2
# 4: 1.3 1 8 6.2
# 5: 1.4 1 6 6.2
# 6: 1.5 1 5 6.2
# 7: 1.6 1 7 6.2
# 8: 1.7 1 2 6.2
# 9: 1.8 1 6 6.2
# 10: 1.9 1 7 6.2
# ---
# 91: 10.0 10 7 5.6
# 92: 10.1 10 1 5.6
# 93: 10.2 10 2 5.6
# 94: 10.3 10 9 5.6
# 95: 10.4 10 9 5.6
# 96: 10.5 10 7 5.6
# 97: 10.6 10 3 5.6
# 98: 10.7 10 5 5.6
# 99: 10.8 10 7 5.6
# 100: 10.9 10 6 5.6
The aggregation is simply:
dt[, .SD[1,], by = date][, T_mean_daily := cumsum(T_mean_daily)][]
# date datetime T T_mean_daily T_sum
# <int> <num> <num> <num> <num>
# 1: 1 1 9 6.2 6.2
# 2: 2 2 5 12.2 12.2
# 3: 3 3 9 18.3 18.3
# 4: 4 4 7 23.6 23.6
# 5: 5 5 4 29.6 29.6
# 6: 6 6 4 34.1 34.1
# 7: 7 7 7 40.1 40.1
# 8: 8 8 1 43.1 43.1
# 9: 9 9 6 47.0 47.0
# 10: 10 10 7 52.6 52.6
which we can join back on the original data as:
dt[ dt[, .SD[1,], by = date][, T_mean_daily := cumsum(T_mean_daily)], T_sum := i.T_mean_daily, on = .(date)]
dt
# datetime date T T_mean_daily T_sum
# <num> <int> <num> <num> <num>
# 1: 1.0 1 9 6.2 6.2
# 2: 1.1 1 9 6.2 6.2
# 3: 1.2 1 3 6.2 6.2
# 4: 1.3 1 8 6.2 6.2
# 5: 1.4 1 6 6.2 6.2
# 6: 1.5 1 5 6.2 6.2
# 7: 1.6 1 7 6.2 6.2
# 8: 1.7 1 2 6.2 6.2
# 9: 1.8 1 6 6.2 6.2
# 10: 1.9 1 7 6.2 6.2
# ---
# 91: 10.0 10 7 5.6 52.6
# 92: 10.1 10 1 5.6 52.6
# 93: 10.2 10 2 5.6 52.6
# 94: 10.3 10 9 5.6 52.6
# 95: 10.4 10 9 5.6 52.6
# 96: 10.5 10 7 5.6 52.6
# 97: 10.6 10 3 5.6 52.6
# 98: 10.7 10 5 5.6 52.6
# 99: 10.8 10 7 5.6 52.6
# 100: 10.9 10 6 5.6 52.6
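A note on the update-join in that last line: the i. prefix refers to columns of the joined (inner) table, so T_sum := i.T_mean_daily copies the cumulated value onto every matching date row. A generic sketch of the pattern (table and column names here are made up for illustration):
library(data.table)
X <- data.table(grp = c("a", "a", "b"), val = 1:3)
Y <- data.table(grp = c("a", "b"), total = c(10, 20))
X[Y, new_col := i.total, on = .(grp)]   # rows of X with grp "a" get 10, grp "b" gets 20
X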

qtgrace/xmgrace non-overlapping data sets

I'm using qtgrace for macOS, and when I plotted two data sets in qtgrace I got something like this:
[image: overlapping data sets]
However, I would like to plot something like this:
[image: non-overlapping data sets]
My data 1:
0 14
0.1 6
0.2 14
0.3 14
0.4 14
0.5 14
0.6 14
0.7 14
0.8 6
0.9 6
1 6
1.1 6
1.2 6
1.3 6
1.4 6
1.5 6
1.6 6
1.7 6
1.8 6
1.9 6
2 6
2.1 6
2.2 6
2.3 6
2.4 6
2.5 6
2.6 6
2.7 6
2.8 6
2.9 6
3 6
3.1 6
3.2 6
3.3 6
3.4 6
3.5 6
3.6 6
3.7 6
3.8 6
3.9 6
4 6
4.1 6
4.2 6
4.3 6
4.4 6
4.5 6
4.6 6
4.7 6
4.8 6
4.9 6
5 6
5.1 6
5.2 6
5.3 6
5.4 6
5.5 6
5.6 6
5.7 6
5.8 6
5.9 6
6 6
6.1 6
6.2 6
6.3 6
6.4 6
6.5 6
6.6 6
6.7 6
6.8 6
6.9 6
7 6
7.1 6
7.2 6
7.3 2
7.4 6
7.5 2
7.6 2
7.7 2
7.8 2
7.9 6
8 2
8.1 6
8.2 2
8.3 2
8.4 6
8.5 6
8.6 6
8.7 2
8.8 6
8.9 19
9 19
9.1 6
9.2 6
9.3 6
9.4 2
9.5 2
9.6 2
9.7 2
9.8 2
9.9 2
10 2
10.1 2
10.2 2
10.3 2
10.4 2
10.5 2
10.6 2
10.7 2
10.8 2
10.9 2
11 2
11.1 2
11.2 2
11.3 2
11.4 2
11.5 2
11.6 2
11.7 2
11.8 2
11.9 2
12 2
12.1 2
12.2 2
12.3 2
12.4 2
12.5 2
12.6 2
12.7 2
12.8 2
12.9 2
13 2
13.1 2
13.2 2
13.3 2
13.4 2
13.5 2
13.6 2
13.7 2
13.8 2
13.9 2
14 2
14.1 2
14.2 2
14.3 2
14.4 2
14.5 2
14.6 2
14.7 2
14.8 2
14.9 2
15 2
15.1 2
15.2 2
15.3 2
15.4 2
15.5 2
15.6 2
15.7 2
15.8 2
15.9 2
16 2
16.1 2
16.2 2
16.3 2
16.4 2
16.5 2
16.6 2
16.7 2
16.8 2
16.9 2
17 2
17.1 2
17.2 2
17.3 2
17.4 2
17.5 2
17.6 2
17.7 2
17.8 2
17.9 2
18 2
18.1 2
18.2 2
18.3 2
18.4 2
18.5 2
18.6 2
18.7 2
18.8 2
18.9 2
19 2
19.1 2
19.2 2
19.3 2
19.4 2
19.5 2
19.6 2
19.7 2
19.8 2
19.9 2
20 2
20.1 2
20.2 2
20.3 2
20.4 2
20.5 2
20.6 2
20.7 2
20.8 2
20.9 2
21 2
21.1 2
21.2 2
21.3 2
21.4 2
21.5 2
21.6 2
21.7 2
21.8 7
21.9 2
22 2
22.1 2
22.2 2
22.3 7
22.4 7
22.5 7
22.6 7
22.7 7
22.8 2
22.9 2
23 7
23.1 7
23.2 7
23.3 7
23.4 7
23.5 2
23.6 2
23.7 2
23.8 2
23.9 2
24 2
24.1 2
24.2 2
24.3 2
24.4 2
24.5 2
24.6 2
24.7 2
24.8 2
24.9 2
25 2
. .
. .
. .
Data 2:
0 4
0.1 4
0.2 4
0.3 4
0.4 4
0.5 4
0.6 4
0.7 4
0.8 4
0.9 4
1 2
1.1 4
1.2 4
1.3 4
1.4 4
1.5 4
1.6 4
1.7 4
1.8 4
1.9 4
2 4
2.1 4
2.2 4
2.3 4
2.4 4
2.5 4
2.6 4
2.7 4
2.8 4
2.9 4
3 4
3.1 4
3.2 4
3.3 4
3.4 4
3.5 4
3.6 4
3.7 4
3.8 4
3.9 4
4 4
4.1 4
4.2 4
4.3 4
4.4 4
4.5 4
4.6 4
4.7 4
4.8 4
4.9 4
5 4
5.1 4
5.2 4
5.3 4
5.4 4
5.5 4
5.6 4
5.7 4
5.8 4
5.9 4
6 4
6.1 4
6.2 4
6.3 4
6.4 4
6.5 4
6.6 4
6.7 4
6.8 4
6.9 4
7 4
7.1 4
7.2 4
7.3 4
7.4 4
7.5 4
7.6 4
7.7 4
7.8 4
7.9 4
8 4
8.1 4
8.2 4
8.3 4
8.4 2
8.5 4
8.6 4
8.7 4
8.8 4
8.9 4
9 4
9.1 4
9.2 4
9.3 4
9.4 4
9.5 4
9.6 4
9.7 4
9.8 4
9.9 4
10 4
10.1 4
10.2 4
10.3 4
10.4 4
10.5 2
10.6 2
10.7 4
10.8 2
10.9 2
11 2
11.1 2
11.2 4
11.3 4
11.4 2
11.5 2
11.6 2
11.7 2
11.8 2
11.9 2
12 2
12.1 2
12.2 2
12.3 2
12.4 4
12.5 4
12.6 2
12.7 2
12.8 4
12.9 2
13 2
13.1 4
13.2 4
13.3 4
13.4 4
13.5 10
13.6 2
13.7 2
13.8 2
13.9 2
14 2
14.1 2
14.2 2
14.3 10
14.4 2
14.5 2
14.6 4
14.7 2
14.8 2
14.9 4
15 2
15.1 10
15.2 2
15.3 2
15.4 2
15.5 2
15.6 2
15.7 2
15.8 2
15.9 2
16 2
16.1 2
16.2 2
16.3 2
16.4 2
16.5 2
16.6 2
16.7 2
16.8 2
16.9 2
17 2
17.1 2
17.2 2
17.3 2
17.4 2
17.5 2
17.6 2
17.7 2
17.8 2
17.9 2
18 2
18.1 2
18.2 2
18.3 2
18.4 2
18.5 2
18.6 2
18.7 2
18.8 2
18.9 2
19 2
19.1 2
19.2 2
19.3 2
19.4 2
19.5 2
19.6 2
19.7 2
19.8 2
19.9 2
20 2
20.1 2
20.2 2
20.3 2
20.4 2
20.5 2
20.6 2
20.7 2
20.8 2
20.9 2
21 2
21.1 2
21.2 2
21.3 2
21.4 2
21.5 2
21.6 2
21.7 2
21.8 2
21.9 2
22 2
22.1 2
22.2 2
22.3 2
22.4 2
22.5 2
22.6 2
22.7 2
22.8 2
22.9 2
23 2
23.1 2
23.2 2
23.3 2
23.4 2
23.5 2
23.6 2
23.7 2
23.8 2
23.9 2
24 2
24.1 2
24.2 2
24.3 2
24.4 2
24.5 2
24.6 2
24.7 2
24.8 2
24.9 2
25 2
. .
. .
. .
The data are in two separate .xvg files from GROMACS cluster analysis. I want to plot five different sets in a way that lets me see all the data without superposition.
Thank you!
I think the best approach would be to write a script that takes the original files and writes out new files with shifted y values (a sketch of that script follows after these steps). However, since you have asked for a qtgrace/xmgrace solution, here is how you do it:
1. Load all the datasets into qtgrace.
2. Open the "Data -> Transformations -> Evaluate expression..." dialog.
3. Select a dataset in the left and right columns, enter the formula y = y + 0.1 in the text box below, and click "Apply". This will shift the dataset up by 0.1.
4. Select the next dataset in the same way and use the formula y = y + 0.2. Click "Apply".
5. Rinse and repeat for all the datasets (changing the shift accordingly).
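For completeness, here is a rough sketch of the scripted alternative mentioned at the top, written in R. The file names and the shift step are assumptions for illustration, and it assumes the files contain only the two numeric columns shown above (real GROMACS .xvg header lines would need to be skipped first):
library(data.table)

shift_file <- function(infile, outfile, offset) {
  d <- fread(infile, col.names = c("x", "y"))   # read the two-column data
  d[, y := y + offset]                          # shift the y values
  fwrite(d, outfile, sep = " ", col.names = FALSE)
}

files <- c("data1.xvg", "data2.xvg")            # hypothetical input files
for (i in seq_along(files)) {
  shift_file(files[i], sub("\\.xvg$", "_shifted.xvg", files[i]), offset = 0.1 * i)
}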

Aggregate/sum and N/A values

I have a problem with the way aggregate() handles NA values when summing.
I would like the sums per area.code from the following table:
test <- read.table(text = "
area.code A B C D
1 0 NA 0.00 NA NA
2 1 0.0 3.10 9.6 0.0
3 1 0.0 3.20 6.0 0.0
4 2 0.0 6.10 5.0 0.0
5 2 0.0 6.50 8.0 0.0
6 2 0.0 6.90 4.0 3.1
7 3 0.0 6.70 3.0 3.2
8 3 0.0 6.80 3.1 6.1
9 3 0.0 0.35 3.2 6.5
10 3 0.0 0.67 6.1 6.9
11 4 0.0 0.25 6.5 6.7
12 5 0.0 0.68 6.9 6.8
13 6 0.0 0.95 6.7 0.0
14 7 1.2 NA 6.8 0.0
")
So, seems pretty easy:
aggregate(.~area.code, test, sum)
area.code A B C D
1 1 0 6.30 15.6 0.0
2 2 0 19.50 17.0 3.1
3 3 0 14.52 15.4 22.7
4 4 0 0.25 6.5 6.7
5 5 0 0.68 6.9 6.8
6 6 0 0.95 6.7 0.0
Apparently it's not so simple, because area code 7 is completely omitted from the aggregate() output.
I would, however, like the NAs to be either completely ignored or treated as zero. Which na.* argument gives that option?
Replacing all NAs with 0 is an option if I only want the sum, but the mean then becomes problematic (it can no longer differentiate between 0 and NA).
If you are willing to consider an external package (data.table):
setDT(test)
test[, lapply(.SD, sum), area.code]
area.code A B C D
1: 0 NA 0.00 NA NA
2: 1 0.0 6.30 15.6 0.0
3: 2 0.0 19.50 17.0 3.1
4: 3 0.0 14.52 15.4 22.7
5: 4 0.0 0.25 6.5 6.7
6: 5 0.0 0.68 6.9 6.8
7: 6 0.0 0.95 6.7 0.0
8: 7 1.2 NA 6.8 0.0
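If you'd rather have the NAs ignored inside each sum (as asked in the question), the same call can pass na.rm = TRUE through lapply; a small variation on the above (my addition):
test[, lapply(.SD, sum, na.rm = TRUE), area.code]
# area.code 0 then shows 0 in columns A, C and D instead of NA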
One option is to create a function that returns NA when all the values are NA and otherwise uses sum. Along with that, use the na.action argument of aggregate(), since aggregate() otherwise removes a row if it contains at least one NA.
f1 <- function(x) if(all(is.na(x))) NA else sum(x, na.rm = TRUE)
aggregate(.~area.code, test, f1, na.action = na.pass)
# area.code A B C D
#1 0 NA 0.00 NA NA
#2 1 0.0 6.30 15.6 0.0
#3 2 0.0 19.50 17.0 3.1
#4 3 0.0 14.52 15.4 22.7
#5 4 0.0 0.25 6.5 6.7
#6 5 0.0 0.68 6.9 6.8
#7 6 0.0 0.95 6.7 0.0
#8 7 1.2 NA 6.8 0.0
When there are only NA elements and we use sum with na.rm = TRUE, it returns 0
sum(c(NA, NA), na.rm = TRUE)
#[1] 0
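By contrast, the mean over only NA elements returns NaN rather than 0, which is why pre-filling the NAs with zeros (as considered in the question) would distort the group means:
mean(c(NA, NA), na.rm = TRUE)
#[1] NaN
mean(c(0, 0))
#[1] 0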
Another solution is to use dplyr:
library(dplyr)
test %>%
  group_by(area.code) %>%
  summarise_all(sum, na.rm = TRUE)
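In newer dplyr releases summarise_all() is superseded; a roughly equivalent sketch using across() (assumes dplyr >= 1.0):
test %>%
  group_by(area.code) %>%
  summarise(across(everything(), ~ sum(.x, na.rm = TRUE)))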

The type is integer, yet it has a decimal point. Why?

When I check the type, it says "integer", but the values have decimal points. If I convert the column to numeric, the values turn into integers (no decimal points).
I want to plot a histogram, so x must be numeric, but if I convert it to numeric, all the data are wrong.
> typeof(data$fare_amount)
[1] "integer"
> data$fare_amount
[1] 5.5 6.5 8.0 13.5 5.5 9.5 7.5 8.0 16.0 8.0 5.5 7.0 8.0 5.0 9.5 23.0 5.0 6.0 17.5 12.0 8.5 13.0
[23] 6.5 4.5 52.0 14.5 7.5 4.5 9.0 10.0 15.0 11.5 6.0 12.5 7.5 8.0 6.5 7.5 31.5 10.0 10.0 10.0 4.0 8.5
[45] 24.0 8.5 5.5 14.0 11.0 4.5 9.0 7.5 22.0 8.5 24.0 36.5 15.0 10.5 9.5 17.0 4.5 6.0 6.5 11.5 16.0 6.5
[67] 7.0 20.0 13.5 30.0 8.0 11.0 6.5 11.5 6.5 37.0 5.5 12.5 8.5 58.5 13.5 8.5 9.0 6.0 6.5 9.0 38.0 4.5
[89] 10.0 9.0 44.5 11.0 12.0 4.5 14.5 8.5 32.0 9.5 4.5 6.0 6.5 6.0 31.5 52.0 10.5 12.0 5.5 24.5 7.0 5.5
[111] 16.5 5.0 5.5 6.5 3.5 11.5 13.0 6.0 14.0 3.5
42 Levels: 13.5 16.0 5.5 6.5 7.5 8.0 9.5 12.0 17.5 23.0 5.0 6.0 7.0 10.0 13.0 14.5 4.5 52.0 8.5 9.0 11.5 12.5 ... 3.5
> temp <- as.numeric(data$fare_amount)
> temp
[1] 3 4 6 1 3 7 5 6 2 6 3 13 6 11 7 10 11 12 9 8 19 15 4 17 18 16 5 17 20 14 23 21 12 22 5 6 4 5
[39] 24 14 14 14 28 19 27 19 3 26 25 17 20 5 31 19 27 32 23 29 7 30 17 12 4 21 2 4 13 33 1 34 6 25 4 21 4 35
[77] 3 22 19 36 1 19 20 12 4 20 37 17 14 20 39 25 8 17 16 19 38 7 17 12 4 12 24 18 29 8 3 40 13 3 41 11 3 4
[115] 42 21 15 12 26 42
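The "42 Levels:" line in the first printout is the giveaway: the column was read in as a factor, and as.numeric() on a factor returns the internal level codes rather than the labels. A common fix (a sketch, assuming data$fare_amount is indeed a factor) is to convert via character first:
temp <- as.numeric(as.character(data$fare_amount))
# or, a bit faster for long factors:
temp <- as.numeric(levels(data$fare_amount))[data$fare_amount]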

Shift one row by ID in R

I have a data frame and want to create a new variable "Begin1" with this condition: if "Begin" in a row is smaller than "End" of the previous row (within the same ID), replace it with that "End" value, because the intervals overlap.
ID <- c(rep(1,3), rep(3, 5), rep(4,4))
Begin <- c(0,2.5,5, 7,8,7,25,25,10,15,17,20)
End <- c(1.5,3.5,6, 7.5,8,11,29,35, 12,19,21,28)
df <- data.frame(ID, Begin, End)
df
ID Begin End
1 1 0.0 1.5
2 1 2.5 3.5
3 1 5.0 6.0
4 3 7.0 7.5
5 3 8.0 8.0
6 3 7.0 11.0**
7 3 25.0 29.0
8 3 25.0 35.0**
9 4 10.0 12.0
10 4 15.0 19.0
11 4 17.0 21.0**
12 4 20.0 28.0**
As you can see, the marked rows are 6, 8, 11 and 12. Starting with row 6 (ID 3), "Begin" = 7.0 is smaller than the "End" of the previous row, so we set "Begin1" = 8.0. For row 8 (ID 3), "Begin" = 25 is smaller than the previous "End" = 29, so we set "Begin1" = 29, and so on. Here is the desired output:
ID Begin Begin1 End
1 1 0.0 0.0 1.5
2 1 2.5 2.5 3.5
3 1 5.0 5.0 6.0
4 3 7.0 7.0 7.5
5 3 8.0 8.0 8.0
6 3 7.0 8.0 11.0**
7 3 25.0 25.0 29.0
8 3 25.0 29.0 35.0**
9 4 10.0 10.0 12.0
10 4 15.0 15.0 19.0
11 4 17.0 19.0 21.0**
12 4 20.0 21.0 28.0**
Thanks for your advice
Here is an update:
ID <- c(rep(1,3), rep(3, 5), rep(4,4))
Group <-c(1,1,2,1,1,1,2,2,1,1,1,2)
Begin <- c(0,2.5,5, 7,8,7,25,25,10,15,17,20)
End <- c(1.5,3.5,6, 7.5,8,11,29,35, 12,19,21,28)
df <- data.frame(ID,Group, Begin, End)
This time I want to group by both ID and Group, but I got an error from data.table.
This is the output:
ID Group Begin End Begin1
1 1 1 0.0 1.5 0.0
2 1 1 2.5 3.5 2.5
3 1 2 5.0 6.0 5.0
4 3 1 7.0 7.5 7.0
5 3 1 8.0 8.0 8.0
6 3 1 7.0 11.0 8.0
7 3 2 25.0 29.0 25.0
8 3 2 25.0 35.0 29.0
9 4 1 10.0 12.0 35.0
10 4 1 15.0 19.0 15.0
11 4 1 17.0 21.0 19.0
12 4 2 20.0 28.0 20.0 **** not changed here because it is in Group 2
Here is the result from the dplyr package; it works, but the data.table version does not:
library(dplyr)
df %>%
  group_by(ID, Group) %>%
  mutate(Begin1 = pmax(Begin, lag(End), na.rm = TRUE))
Source: local data frame [12 x 5]
Groups: ID, Group [6]
ID Group Begin End Begin1
(dbl) (dbl) (dbl) (dbl) (dbl)
1 1 1 0.0 1.5 0.0
2 1 1 2.5 3.5 2.5
3 1 2 5.0 6.0 5.0
4 3 1 7.0 7.5 7.0
5 3 1 8.0 8.0 8.0
6 3 1 7.0 11.0 8.0
7 3 2 25.0 29.0 25.0
8 3 2 25.0 35.0 29.0
9 4 1 10.0 12.0 10.0
10 4 1 15.0 19.0 15.0
11 4 1 17.0 21.0 19.0
12 4 2 20.0 28.0 20.0 **** it works here
A different way using data.table. The key pieces are the following:
The by statement, which does the calculation by ID
The shift function, which lags the End variable to compare with Begin
The pmax function, which does an element-wise max calculation
Here is the code:
library(data.table)
dt <- as.data.table(df)
dt[, Begin1 := pmax(Begin, shift(End, type = 'lag'), na.rm = TRUE), by = ID]
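For the OP's updated data with the extra Group column, the same idea presumably carries over by extending the by clause (my own guess at the adaptation, not verified against the OP's error):
dt[, Begin1 := pmax(Begin, shift(End, type = 'lag'), na.rm = TRUE), by = .(ID, Group)]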
Here's an approach creating the column with ifelse() based on the lag of the End column (note that this relies on dplyr::lag(), since base R's lag() does not shift plain vectors, and it does not group by ID, which is why row 9 below picks up 35.0 from the previous ID).
df$Begin1 <- ifelse(df$Begin <= lag(df$End), lag(df$End), df$Begin)
df$Begin1[which(is.na(df$Begin1))] <- df$Begin[which(is.na(df$Begin1))]
> df
ID Begin End Begin1
1 1 0.0 1.5 0.0
2 1 2.5 3.5 2.5
3 1 5.0 6.0 5.0
4 3 7.0 7.5 7.0
5 3 8.0 8.0 8.0
6 3 7.0 11.0 8.0
7 3 25.0 29.0 25.0
8 3 25.0 35.0 29.0
9 4 10.0 12.0 35.0
10 4 15.0 19.0 15.0
11 4 17.0 21.0 19.0
12 4 20.0 28.0 21.0
We can do this using data.table
library(data.table)
setDT(df)[, Begin1 := Begin]
i1 <- df[, .I[Begin < shift(End, fill = Begin[1L])], by = ID]$V1
df$Begin1[i1] <- df$End[i1-1]
df
# ID Begin End Begin1
# 1: 1 0.0 1.5 0.0
# 2: 1 2.5 3.5 2.5
# 3: 1 5.0 6.0 5.0
# 4: 3 7.0 7.5 7.0
# 5: 3 8.0 8.0 8.0
# 6: 3 7.0 11.0 8.0
# 7: 3 25.0 29.0 25.0
# 8: 3 25.0 35.0 29.0
# 9: 4 10.0 12.0 10.0
#10: 4 15.0 19.0 15.0
#11: 4 17.0 21.0 19.0
#12: 4 20.0 28.0 21.0
Or another option is
setDT(df)[, Begin1 := shift(End), by = ID][!which(Begin < Begin1), Begin1:= Begin]
df
# ID Begin End Begin1
# 1: 1 0.0 1.5 0.0
# 2: 1 2.5 3.5 2.5
# 3: 1 5.0 6.0 5.0
# 4: 3 7.0 7.5 7.0
# 5: 3 8.0 8.0 8.0
# 6: 3 7.0 11.0 8.0
# 7: 3 25.0 29.0 25.0
# 8: 3 25.0 35.0 29.0
# 9: 4 10.0 12.0 10.0
#10: 4 15.0 19.0 15.0
#11: 4 17.0 21.0 19.0
#12: 4 20.0 28.0 21.0
Or using dplyr
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(Begin1 = pmax(Begin, lag(End), na.rm = TRUE))
# ID Begin End Begin1
# <dbl> <dbl> <dbl> <dbl>
#1 1 0.0 1.5 0.0
#2 1 2.5 3.5 2.5
#3 1 5.0 6.0 5.0
#4 3 7.0 7.5 7.0
#5 3 8.0 8.0 8.0
#6 3 7.0 11.0 8.0
#7 3 25.0 29.0 25.0
#8 3 25.0 35.0 29.0
#9 4 10.0 12.0 10.0
#10 4 15.0 19.0 15.0
#11 4 17.0 21.0 19.0
#12 4 20.0 28.0 21.0
Update
Based on the OP's new data
setDT(df)[, Begin1 := shift(End), by = .(ID, Group)][
  !which(Begin < Begin1), Begin1 := Begin]
df
# ID Group Begin End Begin1
#1: 1 1 0.0 1.5 0.0
#2: 1 1 2.5 3.5 2.5
#3: 1 2 5.0 6.0 5.0
#4: 3 1 7.0 7.5 7.0
#5: 3 1 8.0 8.0 8.0
#6: 3 1 7.0 11.0 8.0
#7: 3 2 25.0 29.0 25.0
#8: 3 2 25.0 35.0 29.0
#9: 4 1 10.0 12.0 10.0
#10: 4 1 15.0 19.0 15.0
#11: 4 1 17.0 21.0 19.0
#12: 4 2 20.0 28.0 20.0
