I'm looking to aggregate some pedometer data, gathered in steps per minute, so I get a summed number of steps up until each EMA assessment. The EMA assessments happened four times per day. An example of the two data sets:
Pedometer Data
ID Steps Time
1 15 2/4/2020 8:32
1 23 2/4/2020 8:33
1 76 2/4/2020 8:34
1 32 2/4/2020 8:35
1 45 2/4/2020 8:36
...
2 16 2/4/2020 8:32
2 17 2/4/2020 8:33
2 0 2/4/2020 8:34
2 5 2/4/2020 8:35
2 8 2/4/2020 8:36
EMA Data
ID Time X Y
1 2/4/2020 8:36 3 4
1 2/4/2020 12:01 3 5
1 2/4/2020 3:30 4 5
1 2/4/2020 6:45 7 8
...
2 2/4/2020 8:35 4 6
2 2/4/2020 12:05 5 7
2 2/4/2020 3:39 1 3
2 2/4/2020 6:55 8 3
I'm looking to add the pedometer data to the EMA data as a new variable, where the number of steps taken are summed until the next EMA assessment. Ideally it would look something like:
Combined Data
ID Time X Y Steps
1 2/4/2020 8:36 3 4 191
1 2/4/2020 12:01 3 5 [Sum of steps taken from 8:37 until 12:01 on 2/4/2020]
1 2/4/2020 3:30 4 5 [Sum of steps taken from 12:02 until 3:30 on 2/4/2020]
1 2/4/2020 6:45 7 8 [Sum of steps taken from 3:31 until 6:45 on 2/4/2020]
...
2 2/4/2020 8:35 4 6 38
2 2/4/2020 12:05 5 7 [Sum of steps taken from 8:36 until 12:05 on 2/4/2020]
2 2/4/2020 3:39 1 3 [Sum of steps taken from 12:06 until 3:39 on 2/4/2020]
2 2/4/2020 6:55 8 3 [Sum of steps taken from 3:40 until 6:55 on 2/4/2020]
I then need the process to continue over the entire 21 day EMA period, so the same process for the 4 EMA assessment time points on 2/5/2020, 2/6/2020, etc.
This has pushed me to the limit of my R skills, so any pointers would be extremely helpful! I'm most familiar with the tidyverse but am comfortable using base R as well. Thanks in advance for all advice.
Here's a solution using rolling joins from data.table. The basic idea here is to roll each time from the pedometer data up to the next time in the EMA data (while still matching on ID). Once the next EMA time is found, all that's left is to isolate the X and Y values and sum up Steps.
Data creation and prep:
library(data.table)
pedometer <- data.table(ID = sort(rep(1:2, 500)),
                        Time = rep(seq.POSIXt(as.POSIXct("2020-02-04 09:35:00 EST"),
                                              as.POSIXct("2020-02-08 17:00:00 EST"),
                                              length.out = 500), 2),
                        Steps = rpois(1000, 25))
EMA <- data.table(ID = sort(rep(1:2, 4*5)),
                  Time = rep(seq.POSIXt(as.POSIXct("2020-02-04 05:00:00 EST"),
                                        as.POSIXct("2020-02-08 23:59:59 EST"), by = '6 hours'), 2),
                  X = sample(1:8, 2*4*5, replace = TRUE),
                  Y = sample(1:8, 2*4*5, replace = TRUE))
setkey(pedometer, Time)
setkey(EMA, Time)
EMA[, next_ema_time := Time]  # keep a copy of the EMA time that survives the rolling join
And now the actual join and summation:
joined <- EMA[pedometer,
              on = .(ID, Time),
              roll = -Inf,  # roll each pedometer time forward to the next EMA assessment
              j = .(ID, Time, Steps, next_ema_time, X, Y)]
result <- joined[, .(X = min(X),  # X and Y are constant within each assessment; min() just picks them
                     Y = min(Y),
                     Steps = sum(Steps)),
                 by = .(ID, next_ema_time)]
result
#> ID next_ema_time X Y Steps
#> 1: 1 2020-02-04 11:00:00 1 2 167
#> 2: 2 2020-02-04 11:00:00 8 5 169
#> 3: 1 2020-02-04 17:00:00 3 6 740
#> 4: 2 2020-02-04 17:00:00 4 6 747
#> 5: 1 2020-02-04 23:00:00 2 2 679
#> 6: 2 2020-02-04 23:00:00 3 2 732
#> 7: 1 2020-02-05 05:00:00 7 5 720
#> 8: 2 2020-02-05 05:00:00 6 8 692
#> 9: 1 2020-02-05 11:00:00 2 4 731
#> 10: 2 2020-02-05 11:00:00 4 5 773
#> 11: 1 2020-02-05 17:00:00 1 5 757
#> 12: 2 2020-02-05 17:00:00 3 5 743
#> 13: 1 2020-02-05 23:00:00 3 8 693
#> 14: 2 2020-02-05 23:00:00 1 8 740
#> 15: 1 2020-02-06 05:00:00 8 8 710
#> 16: 2 2020-02-06 05:00:00 3 2 760
#> 17: 1 2020-02-06 11:00:00 8 4 716
#> 18: 2 2020-02-06 11:00:00 1 2 688
#> 19: 1 2020-02-06 17:00:00 5 2 738
#> 20: 2 2020-02-06 17:00:00 4 6 724
#> 21: 1 2020-02-06 23:00:00 7 8 737
#> 22: 2 2020-02-06 23:00:00 6 3 672
#> 23: 1 2020-02-07 05:00:00 2 6 726
#> 24: 2 2020-02-07 05:00:00 7 7 759
#> 25: 1 2020-02-07 11:00:00 1 4 737
#> 26: 2 2020-02-07 11:00:00 5 2 737
#> 27: 1 2020-02-07 17:00:00 3 5 766
#> 28: 2 2020-02-07 17:00:00 4 4 745
#> 29: 1 2020-02-07 23:00:00 3 3 714
#> 30: 2 2020-02-07 23:00:00 2 1 741
#> 31: 1 2020-02-08 05:00:00 4 6 751
#> 32: 2 2020-02-08 05:00:00 8 2 723
#> 33: 1 2020-02-08 11:00:00 3 3 716
#> 34: 2 2020-02-08 11:00:00 3 6 735
#> 35: 1 2020-02-08 17:00:00 1 5 696
#> 36: 2 2020-02-08 17:00:00 7 7 741
#> ID next_ema_time X Y Steps
Created on 2020-02-04 by the reprex package (v0.3.0)
I would left_join ema_df on pedometer_df by ID and Time. This way you get
all lines of pedometer_df, with missing values for X and Y (which I assume are identifiers) whenever a row is not an EMA assessment time.
I fill those values using the next available ones (i.e. the next EMA assessment's X and Y),
and finally group by ID, X and Y and summarise to keep the datetime of the assessment (the max) and the sum of steps.
library(dplyr)
library(tidyr)
pedometer_df %>%
  left_join(ema_df, by = c("ID", "Time")) %>%
  fill(X, Y, .direction = "up") %>%
  group_by(ID, X, Y) %>%
  summarise(
    Time = max(Time),
    Steps = sum(Steps)
  )
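One caveat with grouping by ID, X and Y: if the same X/Y pair ever repeats across assessments for a person, distinct assessments would be merged. A sturdier variant (a sketch, not from the original answer; ema_time is a helper column introduced here, and rows are assumed ordered by Time within ID) fills the assessment time itself and groups on that:
library(dplyr)
library(tidyr)
pedometer_df %>%
  left_join(ema_df %>% mutate(ema_time = Time), by = c("ID", "Time")) %>%
  arrange(ID, Time) %>%
  group_by(ID) %>%
  fill(ema_time, X, Y, .direction = "up") %>%  # carry the next assessment back over its minutes
  filter(!is.na(ema_time)) %>%                 # drop steps recorded after the last assessment
  group_by(ID, ema_time) %>%
  summarise(X = first(X), Y = first(Y), Steps = sum(Steps), .groups = "drop")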
I have a df like this:
Line Sensor Day Time     Measurement
1    A      1   10:00:00 56
2    A      1   11:00:00 42
3    A      1   12:00:00 87
4    A      1   12:20:00 12
5    A      1   12:50:00 44
I would like to create some rows. Considering that measurements should be taken every 10 minutes, I would like to add a non-constant number of rows (i.e., counting up to and including the next original measurement, there should be 6 between lines 1 and 2, two rows between lines 3 and 4, and 3 rows between lines 4 and 5)
in order to get something similar to this:
Line Sensor Day Time     Measurement
1    A      1   10:00:00 56
2    A      1   10:10:00 54
3    A      1   10:20:00 35
4    A      1   10:30:00 11
5    A      1   10:40:00 45
6    A      1   10:50:00 56
7    A      1   11:00:00 90
...  ...    ... ...      ...
13   A      1   12:00:00 87
14   A      1   12:10:00 97
15   A      1   12:20:00 42
16   A      1   12:30:00 67
17   A      1   12:40:00 76
18   A      1   12:50:00 11
Any suggestions?
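A possible sketch using tidyr::complete() to insert the missing 10-minute rows. The inserted Measurement values are left as NA here, since it is unclear how the values in the expected output were obtained; zoo::na.approx() could interpolate them if that is the intent:
library(dplyr)
library(tidyr)
df <- data.frame(
  Line = 1:5,
  Sensor = "A",
  Day = 1,
  Time = as.POSIXct(c("10:00:00", "11:00:00", "12:00:00", "12:20:00", "12:50:00"),
                    format = "%H:%M:%S"),
  Measurement = c(56, 42, 87, 12, 44)
)
df %>%
  group_by(Sensor, Day) %>%
  # insert a row for every 10-minute step between the first and last timestamp;
  # the new rows get Measurement = NA
  complete(Time = seq(min(Time), max(Time), by = "10 min")) %>%
  arrange(Time, .by_group = TRUE) %>%
  mutate(Line = row_number()) %>%  # renumber all rows
  ungroup()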
I am trying to identify clusters in a dataframe, where a cluster is all events within 4 days of the cluster's first event. Additionally, we have a grouping variable.
Here is an example:
library(data.table)
library(lubridate)

startDate <- as.POSIXct("2022-10-01")
dt1 <- data.table(
  id = 1:20,
  timestamp = startDate + lubridate::days(rep(1:10, 2)) + lubridate::hours(1:20),
  group_id = rep(c("A", "B"), each = 10)
)
dt1[, t_diff := difftime(timestamp, min(timestamp), units = "days"), by = group_id]
id timestamp group_id t_diff
1: 1 2022-10-02 01:00:00 A 0.000000 days
2: 2 2022-10-03 02:00:00 A 1.041667 days
3: 3 2022-10-04 03:00:00 A 2.083333 days
4: 4 2022-10-05 04:00:00 A 3.125000 days
5: 5 2022-10-06 05:00:00 A 4.166667 days
6: 6 2022-10-07 06:00:00 A 5.208333 days
7: 7 2022-10-08 07:00:00 A 6.250000 days
8: 8 2022-10-09 08:00:00 A 7.291667 days
9: 9 2022-10-10 09:00:00 A 8.333333 days
10: 10 2022-10-11 10:00:00 A 9.375000 days
11: 11 2022-10-02 11:00:00 B 0.000000 days
12: 12 2022-10-03 12:00:00 B 1.041667 days
13: 13 2022-10-04 13:00:00 B 2.083333 days
14: 14 2022-10-05 14:00:00 B 3.125000 days
15: 15 2022-10-06 15:00:00 B 4.166667 days
16: 16 2022-10-07 16:00:00 B 5.208333 days
17: 17 2022-10-08 17:00:00 B 6.250000 days
18: 18 2022-10-09 18:00:00 B 7.291667 days
19: 19 2022-10-10 19:00:00 B 8.333333 days
20: 20 2022-10-11 20:00:00 B 9.375000 days
The result should look like this:
id timestamp group_id t_diff cluster_id
1: 1 2022-10-02 01:00:00 A 0.000000 days 1
2: 2 2022-10-03 02:00:00 A 1.041667 days 1
3: 3 2022-10-04 03:00:00 A 2.083333 days 1
4: 4 2022-10-05 04:00:00 A 3.125000 days 1
5: 5 2022-10-06 05:00:00 A 4.166667 days 2
6: 6 2022-10-07 06:00:00 A 5.208333 days 2
7: 7 2022-10-08 07:00:00 A 6.250000 days 2
8: 8 2022-10-09 08:00:00 A 7.291667 days 2
9: 9 2022-10-10 09:00:00 A 8.333333 days 3
10: 10 2022-10-11 10:00:00 A 9.375000 days 3
11: 11 2022-10-02 11:00:00 B 0.000000 days 4
12: 12 2022-10-03 12:00:00 B 1.041667 days 4
13: 13 2022-10-04 13:00:00 B 2.083333 days 4
14: 14 2022-10-05 14:00:00 B 3.125000 days 4
15: 15 2022-10-06 15:00:00 B 4.166667 days 5
16: 16 2022-10-07 16:00:00 B 5.208333 days 5
17: 17 2022-10-08 17:00:00 B 6.250000 days 5
18: 18 2022-10-09 18:00:00 B 7.291667 days 5
19: 19 2022-10-10 19:00:00 B 8.333333 days 6
20: 20 2022-10-11 20:00:00 B 9.375000 days 6
I have tried an approach with lapply, but the code is ugly and very slow. I am looking for a data.table approach, but I don't know how to dynamically refer to the "first" observation.
By first observation I mean the first observation of the 4-day interval.
You can use integer division.
Note that as.numeric(), run on a difftime object, has a units argument that converts the difference to the desired time unit.
startDate <- as.POSIXct("2022-10-01")
dt1 <- data.table::data.table(
  id = 1:20,
  timestamp = startDate + lubridate::days(rep(1:10, 2)) + lubridate::hours(1:20),
  group_id = rep(c("A", "B"), each = 10)
)

dt1[, GRP := as.numeric(timestamp - min(timestamp),
                        units = "days") %/% 4,
    by = group_id][]
#> id timestamp group_id GRP
#> 1: 1 2022-10-02 01:00:00 A 0
#> 2: 2 2022-10-03 02:00:00 A 0
#> 3: 3 2022-10-04 03:00:00 A 0
#> 4: 4 2022-10-05 04:00:00 A 0
#> 5: 5 2022-10-06 05:00:00 A 1
#> 6: 6 2022-10-07 06:00:00 A 1
#> 7: 7 2022-10-08 07:00:00 A 1
#> 8: 8 2022-10-09 08:00:00 A 1
#> 9: 9 2022-10-10 09:00:00 A 2
#> 10: 10 2022-10-11 10:00:00 A 2
#> 11: 11 2022-10-02 11:00:00 B 0
#> 12: 12 2022-10-03 12:00:00 B 0
#> 13: 13 2022-10-04 13:00:00 B 0
#> 14: 14 2022-10-05 14:00:00 B 0
#> 15: 15 2022-10-06 15:00:00 B 1
#> 16: 16 2022-10-07 16:00:00 B 1
#> 17: 17 2022-10-08 17:00:00 B 1
#> 18: 18 2022-10-09 18:00:00 B 1
#> 19: 19 2022-10-10 19:00:00 B 2
#> 20: 20 2022-10-11 20:00:00 B 2
# When you want a single cluster index across groups
# (alternatively, just use the combination of group_id and GRP in subsequent `by`s)
dt1[, cluster_id := .GRP, by = .(group_id, GRP)][]
#> id timestamp group_id GRP cluster_id
#> 1: 1 2022-10-02 01:00:00 A 0 1
#> 2: 2 2022-10-03 02:00:00 A 0 1
#> 3: 3 2022-10-04 03:00:00 A 0 1
#> 4: 4 2022-10-05 04:00:00 A 0 1
#> 5: 5 2022-10-06 05:00:00 A 1 2
#> 6: 6 2022-10-07 06:00:00 A 1 2
#> 7: 7 2022-10-08 07:00:00 A 1 2
#> 8: 8 2022-10-09 08:00:00 A 1 2
#> 9: 9 2022-10-10 09:00:00 A 2 3
#> 10: 10 2022-10-11 10:00:00 A 2 3
#> 11: 11 2022-10-02 11:00:00 B 0 4
#> 12: 12 2022-10-03 12:00:00 B 0 4
#> 13: 13 2022-10-04 13:00:00 B 0 4
#> 14: 14 2022-10-05 14:00:00 B 0 4
#> 15: 15 2022-10-06 15:00:00 B 1 5
#> 16: 16 2022-10-07 16:00:00 B 1 5
#> 17: 17 2022-10-08 17:00:00 B 1 5
#> 18: 18 2022-10-09 18:00:00 B 1 5
#> 19: 19 2022-10-10 19:00:00 B 2 6
#> 20: 20 2022-10-11 20:00:00 B 2 6
I have a dataframe :
> dsa[1:20]
Ordered.Item date Qty
1: 2011001FAM002025001 2019-06-01 19440.00
2: 2011001FAM002025001 2019-05-01 24455.53
3: 2011001FAM002025001 2019-04-01 16575.06
4: 2011001FAM002025001 2019-03-01 880.00
5: 2011001FAM002025001 2019-02-01 5000.00
6: 2011001FAM002035001 2019-04-01 175.00
7: 2011001FAM004025001 2019-06-01 2000.00
8: 2011001FAM004025001 2019-05-01 2500.00
9: 2011001FAM004025001 2019-04-01 3000.00
10: 2011001FAM012025001 2019-06-01 1200.00
11: 2011001FAM012025001 2019-04-01 1074.02
12: 2011001FAM022025001 2019-06-01 350.00
13: 2011001FAM022025001 2019-05-01 110.96
14: 2011001FAM022025001 2019-04-01 221.13
15: 2011001FAM022035001 2019-06-01 500.00
16: 2011001FAM022035001 2019-05-01 18.91
17: 2011001FAM027025001 2019-06-01 210.00
18: 2011001FAM028025001 2019-04-01 327.21
19: 2011001FBK005035001 2019-05-01 500.00
20: 2011001FBL001025001 2019-06-01 15350.00
> str(dsa)
Classes ‘data.table’ and 'data.frame': 830 obs. of 3 variables:
$ Ordered.Item: Factor w/ 435 levels "2011001FAM002025001",..: 1 1 1 1 1 2 3 3 3 4 ...
$ date : Date, format: "2019-06-01" "2019-05-01" "2019-04-01" ...
$ Qty : num 19440 24456 16575 880 5000 ...
- attr(*, ".internal.selfref")=<externalptr>
This data contains SKUs and their quantity sold per month.
Because I plan to use ARIMA forecasting, I am trying to convert the dataframe to a time series, but I get a weird output:
> timesr<-ts(data=dsa,start=c(12,2018),frequency = 12)
> head(timesr)
Ordered.Item date Qty
[1,] 1 18048 19440.00
[2,] 1 18017 24455.53
[3,] 1 17987 16575.06
[4,] 1 17956 880.00
[5,] 1 17928 5000.00
[6,] 2 17987 175.00
You might try something like this for your SKU ARIMA modeling. (The weird output above appears because ts() coerces the whole data frame to a numeric matrix, so the factor and Date columns are reduced to their underlying numeric codes.)
# Create dataframe
dsa = read.table(text = '
ID Ordered.Item date Qty
1 2011001FAM002025001 2019-06-01 19440.00
2 2011001FAM002025001 2019-05-01 24455.53
3 2011001FAM002025001 2019-04-01 16575.06
4 2011001FAM002025001 2019-03-01 880.00
5 2011001FAM002025001 2019-02-01 5000.00
6 2011001FAM002035001 2019-04-01 175.00
7 2011001FAM004025001 2019-06-01 2000.00
8 2011001FAM004025001 2019-05-01 2500.00
9 2011001FAM004025001 2019-04-01 3000.00
10 2011001FAM012025001 2019-06-01 1200.00
11 2011001FAM012025001 2019-04-01 1074.02
12 2011001FAM022025001 2019-06-01 350.00
13 2011001FAM022025001 2019-05-01 110.96
14 2011001FAM022025001 2019-04-01 221.13
15 2011001FAM022035001 2019-06-01 500.00
16 2011001FAM022035001 2019-05-01 18.91
17 2011001FAM027025001 2019-06-01 210.00
18 2011001FAM028025001 2019-04-01 327.21
19 2011001FBK005035001 2019-05-01 500.00
20 2011001FBL001025001 2019-06-01 15350.00
', header = T)
dsa$ID <- NULL
# Reshape
dsa2 <- reshape(data=dsa,idvar="date", v.names = "Qty", timevar = "Ordered.Item", direction="wide")
dsa2 <- dsa2[order(as.Date(dsa2$date, "%Y-%m-%d")),] # Sort by date
# Predict for sku 2011001FAM002025001
library(forecast)
fit <- auto.arima(ts(dsa2$Qty.2011001FAM002025001))
fcast <- forecast(fit, h=60) # forecast 60 periods ahead
plot(fcast)
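If forecasts are needed for every SKU rather than just one, the same idea could be looped over the wide columns. A sketch, assuming dsa2 from above; note that months with no sales come out of the reshape as NA, and treating them as zero demand is an assumption (with only a few monthly observations per SKU, these fits are illustrative at best):
library(forecast)
sku_cols <- setdiff(names(dsa2), "date")
fcasts <- lapply(sku_cols, function(col) {
  y <- ts(dsa2[[col]], frequency = 12)
  y[is.na(y)] <- 0  # assumption: a missing month means zero sales
  forecast(auto.arima(y), h = 12)
})
names(fcasts) <- sku_cols
plot(fcasts[["Qty.2011001FAM002025001"]])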
Working in R, I have a dataframe with three variables (ID, date-time and blood pressure) where each row is a measurement of a person's blood pressure with an associated measurement time. There are multiple rows per person.
I would like to be able to count the number of rows/measurements in the previous 60 minutes of the current row/measurement (per person).
Here are some example data
my_df<-data.frame(ID=c("A","A","A","A","A","A","B","B","B","C","C","C","C","C"),
Measured_DT_TM=as.POSIXct(c("2018-08-01 08:00:00","2018-08-01 08:20:00","2018-08-01 08:30:00","2018-08-01 08:35:00","2018-08-01 11:00:00","2018-08-01 11:30:00","2018-08-01 14:10:00","2018-08-01 15:40:00","2018-08-01 15:00:00","2018-08-01 13:00:00","2018-08-01 13:05:00","2018-08-01 13:30:00","2018-08-01 13:55:00","2018-08-01 14:40:00")),
blood_pressure=c(115,115,120,130,140,130,120,125,125,150,160,130,130,131))
To start, I have grouped my data by person and ordered by time. I have created (via mutate) a new variable that is the time from the first row/measurement to the current row/measurement (per person), and a variable that is the time from the previous measurement to the current measurement.
library(dplyr)
my_df_1 <- my_df %>%
  group_by(ID) %>%
  arrange(Measured_DT_TM, .by_group = TRUE) %>%
  mutate(time_since_first_measure = difftime(Measured_DT_TM, first(Measured_DT_TM), units = "mins"),
         time_since_prev_measure = difftime(Measured_DT_TM, lag(Measured_DT_TM, n = 1), units = "mins"))
my_df_1
ID Measured_DT_TM bp time_since_first_measure time_since_prev_measure
<fct> <dttm> <dbl> <drtn> <drtn>
1 A 2018-08-01 08:00:00 115 0 mins NA mins
2 A 2018-08-01 08:20:00 115 20 mins 20 mins
3 A 2018-08-01 08:30:00 120 30 mins 10 mins
4 A 2018-08-01 08:35:00 130 35 mins 5 mins
5 A 2018-08-01 11:00:00 140 180 mins 145 mins
6 A 2018-08-01 11:30:00 130 210 mins 30 mins
7 B 2018-08-01 14:10:00 120 0 mins NA mins
8 B 2018-08-01 15:00:00 125 50 mins 50 mins
9 B 2018-08-01 15:40:00 125 90 mins 40 mins
10 C 2018-08-01 13:00:00 150 0 mins NA mins
11 C 2018-08-01 13:05:00 160 5 mins 5 mins
12 C 2018-08-01 13:30:00 130 30 mins 25 mins
13 C 2018-08-01 13:55:00 130 55 mins 25 mins
14 C 2018-08-01 14:40:00 131 100 mins 45 mins
I am stuck on how to create/mutate a new variable that counts the number of rows in the previous 60 mins from the current row (per person). I would like to create the 'measures_in_prev_60m' variable/column as shown:
ID Measured_DT_TM bp time_since_first_measure time_since_prev_measure measures_in_prev_60m
<fct> <dttm> <dbl> <drtn> <drtn> <dbl>
1 A 2018-08-01 08:00:00 115 0 mins NA mins NA
2 A 2018-08-01 08:20:00 115 20 mins 20 mins 1
3 A 2018-08-01 08:30:00 120 30 mins 10 mins 2
4 A 2018-08-01 08:35:00 130 35 mins 5 mins 3
5 A 2018-08-01 11:00:00 140 180 mins 145 mins 0
6 A 2018-08-01 11:30:00 130 210 mins 30 mins 1
7 B 2018-08-01 14:10:00 120 0 mins NA mins NA
8 B 2018-08-01 15:00:00 125 50 mins 50 mins 1
9 B 2018-08-01 15:40:00 125 90 mins 40 mins 1
10 C 2018-08-01 13:00:00 150 0 mins NA mins NA
11 C 2018-08-01 13:05:00 160 5 mins 5 mins 1
12 C 2018-08-01 13:30:00 130 30 mins 25 mins 2
13 C 2018-08-01 13:55:00 130 55 mins 25 mins 3
14 C 2018-08-01 14:40:00 131 100 mins 45 mins 1
Can anyone offer advice/help?
Thanks
This is a good case to use list-columns, part of the tidyverse and purrr package.
I put all the durations for each ID into each line with mutate(y = list(x)), which creates a list-column. Then I create the criteria for each row (the cutoff). Then I test each duration if it qualifies (within the previous 60 min), using pmap which operates on each row and takes in multiple inputs (i.e. the set of durations and the cutoffs). At the same time, for each row, I add up the elements that qualify.
library(tidyverse, quietly = TRUE)
my_df<-data.frame(ID=c("A","A","A","A","A","A","B","B","B","C","C","C","C","C"),
Measured_DT_TM=as.POSIXct(c("2018-08-01 08:00:00","2018-08-01 08:20:00","2018-08-01 08:30:00","2018-08-01 08:35:00","2018-08-01 11:00:00","2018-08-01 11:30:00","2018-08-01 14:10:00","2018-08-01 15:40:00","2018-08-01 15:00:00","2018-08-01 13:00:00","2018-08-01 13:05:00","2018-08-01 13:30:00","2018-08-01 13:55:00","2018-08-01 14:40:00")),
blood_pressure=c(115,115,120,130,140,130,120,125,125,150,160,130,130,131)) %>%
group_by(ID) %>%
arrange(Measured_DT_TM, .by_group=TRUE) %>%
mutate(time_since_first_measure=difftime(Measured_DT_TM, first(Measured_DT_TM), units = c("mins")),
time_since_prev_measure=difftime(Measured_DT_TM, lag(Measured_DT_TM, n=1), units = c("mins")))
# steps broken out for readability
my_df %>%
mutate(all_measures_by_ID = list(time_since_first_measure),
cutoff_60 = time_since_first_measure - 60,
check_for_measures_within_prev_60m = pmap(list(all_measures_by_ID, time_since_first_measure, cutoff_60), ~(..1 < ..2 & ..1 >= ..3)),
no_measures_in_prev_60m = map(check_for_measures_within_prev_60m, sum)) %>%
View()
# results in one line and no extra columns
my_df %>%
mutate(no_measures_in_prev_60m = pmap(list(list(time_since_first_measure), time_since_first_measure, time_since_first_measure - 60),
~sum(..1 < ..2 & ..1 >= ..3))) %>%
unnest(no_measures_in_prev_60m) %>%
select(no_measures_in_prev_60m, everything())
#> # A tibble: 14 x 6
#> # Groups: ID [3]
#> no_measures_in_~ ID Measured_DT_TM blood_pressure
#> <int> <fct> <dttm> <dbl>
#> 1 0 A 2018-08-01 08:00:00 115
#> 2 1 A 2018-08-01 08:20:00 115
#> 3 2 A 2018-08-01 08:30:00 120
#> 4 3 A 2018-08-01 08:35:00 130
#> 5 0 A 2018-08-01 11:00:00 140
#> 6 1 A 2018-08-01 11:30:00 130
#> 7 0 B 2018-08-01 14:10:00 120
#> 8 1 B 2018-08-01 15:00:00 125
#> 9 1 B 2018-08-01 15:40:00 125
#> 10 0 C 2018-08-01 13:00:00 150
#> 11 1 C 2018-08-01 13:05:00 160
#> 12 2 C 2018-08-01 13:30:00 130
#> 13 3 C 2018-08-01 13:55:00 130
#> 14 1 C 2018-08-01 14:40:00 131
#> # ... with 2 more variables: time_since_first_measure <drtn>,
#> # time_since_prev_measure <drtn>
Created on 2019-07-21 by the reprex package (v0.3.0)
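For comparison, a leaner sketch that gets the same counts without list-columns, using sapply() over each row's timestamp within the person group (note the first row per person comes out as 0 here, whereas the expected output above shows NA):
library(dplyr)
my_df %>%
  group_by(ID) %>%
  arrange(Measured_DT_TM, .by_group = TRUE) %>%
  mutate(measures_in_prev_60m = sapply(Measured_DT_TM, function(t) {
    # count this person's measurements strictly before t and within the preceding 60 minutes
    sum(Measured_DT_TM < t & Measured_DT_TM >= t - 60 * 60)
  }))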
I have the following data frame:
Date.POSIXct Date WeekDay DayCategory Hour Holidays value
1 2018-05-01 00:00:00 2018-05-01 MA MA-MI-JU 0 0 30
2 2018-05-01 01:00:00 2018-05-01 MA MA-MI-JU 1 0 80
3 2018-05-01 02:00:00 2018-05-01 MA MA-MI-JU 2 0 42
4 2018-05-01 03:00:00 2018-05-01 MA MA-MI-JU 3 0 90
5 2018-05-01 04:00:00 2018-05-01 MA MA-MI-JU 4 0 95
6 2018-05-01 05:00:00 2018-05-01 MA MA-MI-JU 5 0 5
DayCategory groups days of the week in the following way: Mondays go to the LU DayCategory. Tuesdays, Wednesdays and Thursdays go to the MA-MI-JU DayCategory.
Fridays go to VI, Saturdays to SA and Sundays to DO, respectively.
I would like to find the value for the same hour on the previous day (Date) with the same DayCategory, while Holidays remains within the same group (e.g. if one instance has Holidays 0 but the previous day with the same DayCategory has 1, we should look for the previous one, etc.).
As an intermediate step, and to understand the process, I would like to add a column PreviousDaySameDayCategory with the Date of the previous day that has the same DayCategory as the corresponding row. Sometimes it will be just the same date minus seven days ("LU","VI","SA","DO") but on other days it will be just one day.
Reproducible data:
library(lubridate)
Date.POSIXct <- seq(as.POSIXct("2018-05-01"), as.POSIXct("2018-05-31"), "hour")
mydf <- as.data.frame(Date.POSIXct)
mydf$Date <- as.Date(substr(as.character(mydf$Date.POSIXct),1,10))
mydf$WeekDay <- substr(toupper((weekdays(mydf$Date))),1,2)
mydf$DayCategory <- as.factor(ifelse(mydf$WeekDay == "MA" | mydf$WeekDay == "MI" | mydf$WeekDay == "JU",
                                     "MA-MI-JU", mydf$WeekDay))
mydf$Hour <- hour(mydf$Date.POSIXct)
mydf$Holidays <- c(rep(0, 24*7),rep(1, 24*7), rep(0, 24*16+1))
set.seed(123)
mydf$myvalue <- sample.int(101,size=nrow(mydf),replace=TRUE)
I have manually worked out the first few days and created a vector of how the solution should look:
a <- rep(NA, 24)
b <- mydf$myvalue[1:24]
c <- mydf$myvalue[25:48]
d <- rep(NA, 24)
e <- rep(NA, 24)
f <- rep(NA, 24)
g <- rep(NA, 24)
h <- rep(NA, 24)
i <- mydf$myvalue[169:192]
solution <- c(a, b, c, d, e, f, g, h, i)
solution
I would appreciate any hints on the thinking process for solving this kind of problem, which I face with relative frequency.
Here is a data.table solution which uses a "grouped shift()" and multiple joins to copy value from the same hour of the PreviousDaySameDayCategory.
Create reproducible data
OP's code to create reproducible data was not fully reproducible because it uses the weekdays() function, which returns the weekday names in the current locale (which seems to be Spanish for the OP). To be independent of the current locale, I switched to format(Date, "%u"), which returns the numbers 1 to 7 for Monday to Sunday. Furthermore, fct_collapse() from the forcats package is used to collapse the days 2, 3, and 4 (Tuesday to Thursday) into one factor level.
library(data.table)
# note that package lubridate is not required
myDT <- data.table(Date.POSIXct = seq(as.POSIXct("2018-05-01"),
                                      as.POSIXct("2018-05-31"), "hour"))
myDT[, Date := as.Date(Date.POSIXct)]
myDT[, Weekday := format(Date, "%u")]
myDT[, DayCategory := forcats::fct_collapse(Weekday, "234" = c("2", "3", "4"))]
myDT[, hour := hour(Date.POSIXct)]
myDT[, Holidays := c(rep(0, 24 * 7), rep(1, 24 * 7), rep(0, 24 * 16 + 1))]
set.seed(123)
myDT[, myvalue := sample.int(101, size = nrow(myDT), replace = TRUE)]
Intermediate step: PreviousDaySameDayCategory
The sample data set consists of hourly data, but in order to determine the PreviousDaySameDayCategory we can work day-wise and thus only have to deal with the unique combinations of Date, DayCategory, and Holidays. The data is grouped by DayCategory and the Holidays indicator. For each group separately, the previous day is picked by lagging Date. As the result of shift() operations depends on the order of rows, the dataset has been ordered before shifting.
tmp <- unique(myDT[order(Date), .(Date, DayCategory, Holidays)])[
, .(Date, PreviousDaySameDayCategory = shift(Date)), by = .(DayCategory, Holidays)][
order(Date)]
tmp
DayCategory Holidays Date PreviousDaySameDayCategory
1: 234 0 2018-05-01 <NA>
2: 234 0 2018-05-02 2018-05-01
3: 234 0 2018-05-03 2018-05-02
4: 5 0 2018-05-04 <NA>
5: 6 0 2018-05-05 <NA>
6: 7 0 2018-05-06 <NA>
7: 1 0 2018-05-07 <NA>
8: 234 1 2018-05-08 <NA>
9: 234 1 2018-05-09 2018-05-08
10: 234 1 2018-05-10 2018-05-09
11: 5 1 2018-05-11 <NA>
12: 6 1 2018-05-12 <NA>
13: 7 1 2018-05-13 <NA>
14: 1 1 2018-05-14 <NA>
15: 234 0 2018-05-15 2018-05-03
16: 234 0 2018-05-16 2018-05-15
17: 234 0 2018-05-17 2018-05-16
18: 5 0 2018-05-18 2018-05-04
19: 6 0 2018-05-19 2018-05-05
20: 7 0 2018-05-20 2018-05-06
21: 1 0 2018-05-21 2018-05-07
22: 234 0 2018-05-22 2018-05-17
23: 234 0 2018-05-23 2018-05-22
24: 234 0 2018-05-24 2018-05-23
25: 5 0 2018-05-25 2018-05-18
26: 6 0 2018-05-26 2018-05-19
27: 7 0 2018-05-27 2018-05-20
28: 1 0 2018-05-28 2018-05-21
29: 234 0 2018-05-29 2018-05-24
30: 234 0 2018-05-30 2018-05-29
31: 234 0 2018-05-31 2018-05-30
DayCategory Holidays Date PreviousDaySameDayCategory
For days 3 and 4 (Wednesdays and Thursdays) the preceding Tuesday and Wednesday, respectively, of the same week are picked. For day 2 (Tuesday) the preceding Thursday of the preceding week is picked if both weeks have the same holiday indicator set. If the preceding week has a different holiday indicator, the most recent Thursday of the same holiday period is picked. This is why, e.g., 2018-05-03 is picked in row 15.
Copying value from matching PreviousDaySameDayCategory
This is done in two steps. First, the hourly values are picked from the matching PreviousDaySameDayCategory by joining with the matching days table tmp:
tmp2 <- myDT[tmp, on = .(Date = PreviousDaySameDayCategory), .(Date = i.Date, hour, myvalue), nomatch = 0L]
tmp2
Date hour myvalue
1: 2018-05-02 0 30
2: 2018-05-02 1 80
3: 2018-05-02 2 42
4: 2018-05-02 3 90
5: 2018-05-02 4 95
---
500: 2018-05-31 19 39
501: 2018-05-31 20 1
502: 2018-05-31 21 1
503: 2018-05-31 22 101
504: 2018-05-31 23 11
Second, a new column previousValue in myDT is created by updating in a join which contains the corresponding value from PreviousDaySameDayCategory:
myDT[tmp2, on = .(Date, hour), previousValue := i.myvalue]
Here, the first two days of the result are shown:
myDT[Date %between% c(as.Date("2018-05-01"), as.Date("2018-05-02"))]
Date.POSIXct Date Weekday DayCategory hour Holidays myvalue previousValue
1: 2018-05-01 00:00:00 2018-05-01 2 234 0 0 30 NA
2: 2018-05-01 01:00:00 2018-05-01 2 234 1 0 80 NA
3: 2018-05-01 02:00:00 2018-05-01 2 234 2 0 42 NA
4: 2018-05-01 03:00:00 2018-05-01 2 234 3 0 90 NA
5: 2018-05-01 04:00:00 2018-05-01 2 234 4 0 95 NA
6: 2018-05-01 05:00:00 2018-05-01 2 234 5 0 5 NA
7: 2018-05-01 06:00:00 2018-05-01 2 234 6 0 54 NA
8: 2018-05-01 07:00:00 2018-05-01 2 234 7 0 91 NA
9: 2018-05-01 08:00:00 2018-05-01 2 234 8 0 56 NA
10: 2018-05-01 09:00:00 2018-05-01 2 234 9 0 47 NA
11: 2018-05-01 10:00:00 2018-05-01 2 234 10 0 97 NA
12: 2018-05-01 11:00:00 2018-05-01 2 234 11 0 46 NA
13: 2018-05-01 12:00:00 2018-05-01 2 234 12 0 69 NA
14: 2018-05-01 13:00:00 2018-05-01 2 234 13 0 58 NA
15: 2018-05-01 14:00:00 2018-05-01 2 234 14 0 11 NA
16: 2018-05-01 15:00:00 2018-05-01 2 234 15 0 91 NA
17: 2018-05-01 16:00:00 2018-05-01 2 234 16 0 25 NA
18: 2018-05-01 17:00:00 2018-05-01 2 234 17 0 5 NA
19: 2018-05-01 18:00:00 2018-05-01 2 234 18 0 34 NA
20: 2018-05-01 19:00:00 2018-05-01 2 234 19 0 97 NA
21: 2018-05-01 20:00:00 2018-05-01 2 234 20 0 90 NA
22: 2018-05-01 21:00:00 2018-05-01 2 234 21 0 70 NA
23: 2018-05-01 22:00:00 2018-05-01 2 234 22 0 65 NA
24: 2018-05-01 23:00:00 2018-05-01 2 234 23 0 101 NA
25: 2018-05-02 00:00:00 2018-05-02 3 234 0 0 67 30
26: 2018-05-02 01:00:00 2018-05-02 3 234 1 0 72 80
27: 2018-05-02 02:00:00 2018-05-02 3 234 2 0 55 42
28: 2018-05-02 03:00:00 2018-05-02 3 234 3 0 61 90
29: 2018-05-02 04:00:00 2018-05-02 3 234 4 0 30 95
30: 2018-05-02 05:00:00 2018-05-02 3 234 5 0 15 5
31: 2018-05-02 06:00:00 2018-05-02 3 234 6 0 98 54
32: 2018-05-02 07:00:00 2018-05-02 3 234 7 0 92 91
33: 2018-05-02 08:00:00 2018-05-02 3 234 8 0 70 56
34: 2018-05-02 09:00:00 2018-05-02 3 234 9 0 81 47
35: 2018-05-02 10:00:00 2018-05-02 3 234 10 0 3 97
36: 2018-05-02 11:00:00 2018-05-02 3 234 11 0 49 46
37: 2018-05-02 12:00:00 2018-05-02 3 234 12 0 77 69
38: 2018-05-02 13:00:00 2018-05-02 3 234 13 0 22 58
39: 2018-05-02 14:00:00 2018-05-02 3 234 14 0 33 11
40: 2018-05-02 15:00:00 2018-05-02 3 234 15 0 24 91
41: 2018-05-02 16:00:00 2018-05-02 3 234 16 0 15 25
42: 2018-05-02 17:00:00 2018-05-02 3 234 17 0 42 5
43: 2018-05-02 18:00:00 2018-05-02 3 234 18 0 42 34
44: 2018-05-02 19:00:00 2018-05-02 3 234 19 0 38 97
45: 2018-05-02 20:00:00 2018-05-02 3 234 20 0 16 90
46: 2018-05-02 21:00:00 2018-05-02 3 234 21 0 15 70
47: 2018-05-02 22:00:00 2018-05-02 3 234 22 0 24 65
48: 2018-05-02 23:00:00 2018-05-02 3 234 23 0 48 101
Date.POSIXct Date Weekday DayCategory hour Holidays myvalue previousValue
Verification
The result is in line with OP's expectations
identical(myDT[, previousValue[seq_along(solution)]], solution)
[1] TRUE
OP has posted the same question in the Data Science section as well. I am including the same solution I gave there, in case it might help others.
It is similar to Uwe's solution but uses the dplyr library instead.
library(dplyr)
rankedDf <- mydf %>%
  group_by(DayCategory, Hour, Holidays) %>%
  arrange(Date) %>%
  mutate(rowRank = order(Date), previousRowRank = order(Date) - 1) %>%
  left_join(., ., by = c("previousRowRank" = "rowRank", "DayCategory", "Hour", "Holidays")) %>%
  select(
    Date.POSIXct = Date.POSIXct.x,
    Date = Date.x,
    WeekDay = WeekDay.x,
    DayCategory,
    Hour,
    Holidays,
    myvalue = myvalue.x,
    PreviousDaySameDayCategory = Date.y,
    PreviousValueSameDayCategory = myvalue.y
  )
print.data.frame(rankedDf)
P.S. I love the way Uwe changed the original sample code.