I have a series of observations of birds at different locations and times. The data frame looks like this:
birdID site ts
1 A 2013-04-15 09:29
1 A 2013-04-19 01:22
1 A 2013-04-20 23:13
1 A 2013-04-22 00:03
1 B 2013-04-22 14:02
1 B 2013-04-22 17:02
1 C 2013-04-22 14:04
1 C 2013-04-22 15:18
1 C 2013-04-23 00:54
1 A 2013-04-23 01:20
1 A 2013-04-24 23:07
1 A 2013-04-30 23:47
1 B 2013-04-30 03:51
1 B 2013-04-30 04:26
2 C 2013-04-30 04:29
2 C 2013-04-30 18:49
2 A 2013-05-01 01:03
2 A 2013-05-01 23:15
2 A 2013-05-02 00:09
2 C 2013-05-03 07:57
2 C 2013-05-04 07:21
2 C 2013-05-05 02:54
2 A 2013-05-05 03:27
2 A 2013-05-14 00:16
2 D 2013-05-14 10:00
2 D 2013-05-14 15:00
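For reference, here is a minimal reconstruction of this example as an R object (a sketch: column types are approximate, and ts is parsed to POSIXct with lubridate so the difftime() calls further down work):

library(dplyr)
library(lubridate)

# sketch of the example data shown above, not a dput() of the real data
df <- tribble(
  ~birdID, ~site, ~ts,
  1, "A", "2013-04-15 09:29",
  1, "A", "2013-04-19 01:22",
  1, "A", "2013-04-20 23:13",
  1, "A", "2013-04-22 00:03",
  1, "B", "2013-04-22 14:02",
  1, "B", "2013-04-22 17:02",
  1, "C", "2013-04-22 14:04",
  1, "C", "2013-04-22 15:18",
  1, "C", "2013-04-23 00:54",
  1, "A", "2013-04-23 01:20",
  1, "A", "2013-04-24 23:07",
  1, "A", "2013-04-30 23:47",
  1, "B", "2013-04-30 03:51",
  1, "B", "2013-04-30 04:26",
  2, "C", "2013-04-30 04:29",
  2, "C", "2013-04-30 18:49",
  2, "A", "2013-05-01 01:03",
  2, "A", "2013-05-01 23:15",
  2, "A", "2013-05-02 00:09",
  2, "C", "2013-05-03 07:57",
  2, "C", "2013-05-04 07:21",
  2, "C", "2013-05-05 02:54",
  2, "A", "2013-05-05 03:27",
  2, "A", "2013-05-14 00:16",
  2, "D", "2013-05-14 10:00",
  2, "D", "2013-05-14 15:00"
) %>%
  mutate(ts = ymd_hm(ts))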
I would like to summarize the data in a way that shows the first and last detection of each bird at each site, and the duration at each site, while preserving information about multiple visits to sites (i.e. if a bird went from site A > B > C > A > B, I would like to show each visit to sites A and B independently, not lump the visits together).
I am hoping to produce output like this, where the start (min_ts), end (max_ts), and duration (days) of each visit are preserved:
birdID site min_ts max_ts days
1 A 2013-04-15 09:29 2013-04-22 00:03 6.6
1 B 2013-04-22 14:02 2013-04-22 17:02 0.1
1 C 2013-04-22 14:04 2013-04-23 00:54 0.5
1 A 2013-04-23 01:20 2013-04-30 23:47 7.9
1 B 2013-04-30 03:51 2013-04-30 04:26 0.02
2 C 2013-04-30 04:29 2013-04-30 18:49 0.6
2 A 2013-05-01 01:03 2013-05-02 00:09 0.96
2 C 2013-05-03 07:57 2013-05-05 02:54 1.8
2 A 2013-05-05 03:27 2013-05-14 00:16 8.8
2 D 2013-05-14 10:00 2013-05-14 15:00 0.2
I have tried this code, which yields the correct variables but lumps all the information about a single site together, not preserving multiple visits:
df <- df %>%
  group_by(birdID, site) %>%
  summarise(min_ts = min(ts),
            max_ts = max(ts),
            days = difftime(max_ts, min_ts, units = "days")) %>%
  arrange(birdID, min_ts)
birdID site min_ts max_ts days
1 A 2013-04-15 09:29 2013-04-30 23:47 15.6
1 B 2013-04-22 14:02 2013-04-30 04:26 7.6
1 C 2013-04-22 14:04 2013-04-23 00:54 0.5
2 C 2013-04-30 04:29 2013-05-05 02:54 4.9
2 A 2013-05-01 01:03 2013-05-14 00:16 12.9
2 D 2013-05-14 10:00 2013-05-14 15:00 0.2
I realize grouping by site is a problem, but if I remove that as a grouping variable the data are summarised without site info. I have tried this. It doesn't run, but I feel it's close to the solution:
df <- df %>%
  group_by(birdID) %>%
  summarize(min_ts = if_else((birdID == lag(birdID) & site != lag(site)), min(ts), NA_real_),
            max_ts = if_else((birdID == lag(birdID) & site != lag(site)), max(ts), NA_real_),
            min_d = min(yday(ts)),
            max_d = max(yday(ts)),
            days = max_d - min_d))
One possibility could be:
df %>%
  group_by(birdID, site, rleid = with(rle(site), rep(seq_along(lengths), lengths))) %>%
  summarise(min_ts = min(ts),
            max_ts = max(ts),
            days = difftime(max_ts, min_ts, units = "days")) %>%
  ungroup() %>%
  select(-rleid) %>%
  arrange(birdID, min_ts)
birdID site min_ts max_ts days
<int> <chr> <dttm> <dttm> <drtn>
1 1 A 2013-04-15 09:29:00 2013-04-22 00:03:00 6.60694444 days
2 1 B 2013-04-22 14:02:00 2013-04-22 17:02:00 0.12500000 days
3 1 C 2013-04-22 14:04:00 2013-04-23 00:54:00 0.45138889 days
4 1 A 2013-04-23 01:20:00 2013-04-30 23:47:00 7.93541667 days
5 1 B 2013-04-30 03:51:00 2013-04-30 04:26:00 0.02430556 days
6 2 C 2013-04-30 04:29:00 2013-04-30 18:49:00 0.59722222 days
7 2 A 2013-05-01 01:03:00 2013-05-02 00:09:00 0.96250000 days
8 2 C 2013-05-03 07:57:00 2013-05-05 02:54:00 1.78958333 days
9 2 A 2013-05-05 03:27:00 2013-05-14 00:16:00 8.86736111 days
10 2 D 2013-05-14 10:00:00 2013-05-14 15:00:00 0.20833333 days
Here, an rleid()-like run id is created inside group_by(), so consecutive detections at the same site form one group, and the first/last detection and duration are then calculated per visit.
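To see what that run-length id looks like, here is a tiny illustration on a made-up site vector (not taken from the data above):

# hypothetical vector just to illustrate the run-length id
site <- c("A", "A", "B", "C", "C", "A")
with(rle(site), rep(seq_along(lengths), lengths))
#> [1] 1 1 2 3 3 4

Consecutive detections at the same site share an id, so a return visit to a site gets a new id and therefore its own group.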
Or the same using rleid() from data.table explicitly:
df %>%
  group_by(birdID, site, rleid = rleid(site)) %>%
  summarise(min_ts = min(ts),
            max_ts = max(ts),
            days = difftime(max_ts, min_ts, units = "days")) %>%
  ungroup() %>%
  select(-rleid) %>%
  arrange(birdID, min_ts)
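This variant assumes both packages are attached; a minimal sketch of the setup it relies on:

library(dplyr)
library(data.table)  # provides rleid()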
Another alternative is to use lag and cumsum to create a grouping variable.
library(dplyr)
df %>%
  group_by(birdID, group = cumsum(site != lag(site, default = first(site)))) %>%
  summarise(min_ts = min(ts),
            max_ts = max(ts),
            days = difftime(max_ts, min_ts, units = "days")) %>%
  ungroup() %>%
  select(-group)
# A tibble: 10 x 4
# birdID min_ts max_ts days
# <int> <dttm> <dttm> <drtn>
# 1 1 2013-04-15 09:29:00 2013-04-22 00:03:00 6.60694444 days
# 2 1 2013-04-22 14:02:00 2013-04-22 17:02:00 0.12500000 days
# 3 1 2013-04-22 14:04:00 2013-04-23 00:54:00 0.45138889 days
# 4 1 2013-04-23 01:20:00 2013-04-30 23:47:00 7.93541667 days
# 5 1 2013-04-30 03:51:00 2013-04-30 04:26:00 0.02430556 days
# 6 2 2013-04-30 04:29:00 2013-04-30 18:49:00 0.59722222 days
# 7 2 2013-05-01 01:03:00 2013-05-02 00:09:00 0.96250000 days
# 8 2 2013-05-03 07:57:00 2013-05-05 02:54:00 1.78958333 days
# 9 2 2013-05-05 03:27:00 2013-05-14 00:16:00 8.86736111 days
#10 2 2013-05-14 10:00:00 2013-05-14 15:00:00 0.20833333 days
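Note that site is dropped from the output in this variant because it is neither a grouping variable nor part of the summary. A small tweak (a sketch, not part of the original answer) records it for each run:

df %>%
  group_by(birdID, group = cumsum(site != lag(site, default = first(site)))) %>%
  summarise(site = first(site),   # keep the site of each run in the output
            min_ts = min(ts),
            max_ts = max(ts),
            days = difftime(max_ts, min_ts, units = "days")) %>%
  ungroup() %>%
  select(-group)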
The data frame currently looks like this:
# A tibble: 20 x 3
Badge `Effective Date` day_off
<dbl> <dttm> <int>
1 3162 2013-01-16 00:00:00 1
2 3162 2013-01-19 00:00:00 2
3 3162 2013-02-21 00:00:00 3
5 3585 2015-10-21 00:00:00 5
6 3586 2014-05-21 00:00:00 6
7 3586 2014-05-23 00:00:00 7
I would like to create a new row for each day for each badge number between consecutive effective dates, so that it looks something like this. The data frame is huge, so resource-intensive tidyverse functions like complete won't work.
# A tibble: 20 x 3
Badge `Effective Date` day_off
<dbl> <dttm> <int>
1 3162 2013-01-16 00:00:00 1
2 3162 2013-01-17 00:00:00 1
3 3162 2013-01-18 00:00:00 1
4 3162 2013-01-19 00:00:00 2
5 3162 2013-01-20 00:00:00 2
6 3162 2013-01-21 00:00:00 3
7 3585 2015-10-21 00:00:00 5
8 3586 2014-05-21 00:00:00 6
9 3586 2014-05-22 00:00:00 6
10 3586 2014-05-23 00:00:00 7
You can create a table where, for each Badge group, you have a sequence of datetimes from the first to the last. A rolling join of the original data onto this table then gives the desired output: with roll = TRUE, the day_off from the most recent preceding effective date is carried forward to each interpolated date.
library(data.table)
## Create reproducible example as an R object
# Please do this yourself next time using dput. See https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
df <- fread('
Badge , Effective_Date , day_off
3162 , 2013-01-16 00:00:00 , 1
3162 , 2013-01-19 00:00:00 , 2
3162 , 2013-01-21 00:00:00 , 3
3585 , 2015-10-21 00:00:00 , 5
3586 , 2014-05-21 00:00:00 , 6
3586 , 2014-05-23 00:00:00 , 7
')
df[, Effective_Date := as.POSIXct(Effective_Date)]
## Rolling join
setDT(df) # required if data wasn't originally a data.table as above
df[df[, .(Effective_Date = seq(min(Effective_Date), max(Effective_Date), by = '1 day')),
      by = .(Badge)],
   on = .(Badge, Effective_Date), roll = TRUE]
#> Badge Effective_Date day_off
#> 1: 3162 2013-01-16 1
#> 2: 3162 2013-01-17 1
#> 3: 3162 2013-01-18 1
#> 4: 3162 2013-01-19 2
#> 5: 3162 2013-01-20 2
#> 6: 3162 2013-01-21 3
#> 7: 3585 2015-10-21 5
#> 8: 3586 2014-05-21 6
#> 9: 3586 2014-05-22 6
#> 10: 3586 2014-05-23 7
Created on 2021-07-16 by the reprex package (v2.0.0)
A tidyverse way would be to use complete and fill:
library(dplyr)
library(tidyr)
df %>%
  group_by(Badge) %>%
  complete(Effective_Date = seq(min(Effective_Date),
                                max(Effective_Date), by = '1 day')) %>%
  fill(day_off) %>%
  ungroup
# Badge Effective_Date day_off
# <int> <dttm> <int>
# 1 3162 2013-01-16 00:00:00 1
# 2 3162 2013-01-17 00:00:00 1
# 3 3162 2013-01-18 00:00:00 1
# 4 3162 2013-01-19 00:00:00 2
# 5 3162 2013-01-20 00:00:00 2
# 6 3162 2013-01-21 00:00:00 3
# 7 3585 2015-10-21 00:00:00 5
# 8 3586 2014-05-21 00:00:00 6
# 9 3586 2014-05-22 00:00:00 6
#10 3586 2014-05-23 00:00:00 7
I'm looking to aggregate some pedometer data, gathered in steps per minute, so I get a summed number of steps up until an EMA assessment. The EMA assessments happened four times per day. An example of the two data sets are:
Pedometer Data
ID Steps Time
1 15 2/4/2020 8:32
1 23 2/4/2020 8:33
1 76 2/4/2020 8:34
1 32 2/4/2020 8:35
1 45 2/4/2020 8:36
...
2 16 2/4/2020 8:32
2 17 2/4/2020 8:33
2 0 2/4/2020 8:34
2 5 2/4/2020 8:35
2 8 2/4/2020 8:36
EMA Data
ID Time X Y
1 2/4/2020 8:36 3 4
1 2/4/2020 12:01 3 5
1 2/4/2020 3:30 4 5
1 2/4/2020 6:45 7 8
...
2 2/4/2020 8:35 4 6
2 2/4/2020 12:05 5 7
2 2/4/2020 3:39 1 3
2 2/4/2020 6:55 8 3
I'm looking to add the pedometer data to the EMA data as a new variable, where the number of steps taken are summed until the next EMA assessment. Ideally it would like something like:
Combined Data
ID Time X Y Steps
1 2/4/2020 8:36 3 4 191
1 2/4/2020 12:01 3 5 [Sum of steps taken from 8:37 until 12:01 on 2/4/2020]
1 2/4/2020 3:30 4 5 [Sum of steps taken from 12:02 until 3:30 on 2/4/2020]
1 2/4/2020 6:45 7 8 [Sum of steps taken from 3:31 until 6:45 on 2/4/2020]
...
2 2/4/2020 8:35 4 6 38
2 2/4/2020 12:05 5 7 [Sum of steps taken from 8:36 until 12:05 on 2/4/2020]
2 2/4/2020 3:39 1 3 [Sum of steps taken from 12:06 until 3:39 on 2/4/2020]
2 2/4/2020 6:55 8 3 [Sum of steps taken from 3:40 until 6:55 on 2/4/2020]
I then need the process to continue over the entire 21 day EMA period, so the same process for the 4 EMA assessment time points on 2/5/2020, 2/6/2020, etc.
This has pushed me to the limit of my R skills, so any pointers would be extremely helpful! I'm most familiar with the tidyverse but am comfortable using base R as well. Thanks in advance for all advice.
Here's a solution using rolling joins from data.table. The basic idea is to roll each time from the pedometer data up to the next time in the EMA data (while still matching on ID). Once the next EMA time is found, all that's left is to isolate the X and Y values and sum up Steps.
Data creation and prep:
library(data.table)
pedometer <- data.table(ID = sort(rep(1:2, 500)),
                        Time = rep(seq.POSIXt(as.POSIXct("2020-02-04 09:35:00 EST"),
                                              as.POSIXct("2020-02-08 17:00:00 EST"),
                                              length.out = 500), 2),
                        Steps = rpois(1000, 25))

EMA <- data.table(ID = sort(rep(1:2, 4*5)),
                  Time = rep(seq.POSIXt(as.POSIXct("2020-02-04 05:00:00 EST"),
                                        as.POSIXct("2020-02-08 23:59:59 EST"),
                                        by = '6 hours'), 2),
                  X = sample(1:8, 2*4*5, rep = T),
                  Y = sample(1:8, 2*4*5, rep = T))

setkey(pedometer, Time)
setkey(EMA, Time)
EMA[, next_ema_time := Time]
And now the actual join and summation:
joined <- EMA[pedometer,
              on = .(ID, Time),
              roll = -Inf,
              j = .(ID, Time, Steps, next_ema_time, X, Y)]

result <- joined[, .('X' = min(X),
                     'Y' = min(Y),
                     'Steps' = sum(Steps)),
                 .(ID, next_ema_time)]
result
#> ID next_ema_time X Y Steps
#> 1: 1 2020-02-04 11:00:00 1 2 167
#> 2: 2 2020-02-04 11:00:00 8 5 169
#> 3: 1 2020-02-04 17:00:00 3 6 740
#> 4: 2 2020-02-04 17:00:00 4 6 747
#> 5: 1 2020-02-04 23:00:00 2 2 679
#> 6: 2 2020-02-04 23:00:00 3 2 732
#> 7: 1 2020-02-05 05:00:00 7 5 720
#> 8: 2 2020-02-05 05:00:00 6 8 692
#> 9: 1 2020-02-05 11:00:00 2 4 731
#> 10: 2 2020-02-05 11:00:00 4 5 773
#> 11: 1 2020-02-05 17:00:00 1 5 757
#> 12: 2 2020-02-05 17:00:00 3 5 743
#> 13: 1 2020-02-05 23:00:00 3 8 693
#> 14: 2 2020-02-05 23:00:00 1 8 740
#> 15: 1 2020-02-06 05:00:00 8 8 710
#> 16: 2 2020-02-06 05:00:00 3 2 760
#> 17: 1 2020-02-06 11:00:00 8 4 716
#> 18: 2 2020-02-06 11:00:00 1 2 688
#> 19: 1 2020-02-06 17:00:00 5 2 738
#> 20: 2 2020-02-06 17:00:00 4 6 724
#> 21: 1 2020-02-06 23:00:00 7 8 737
#> 22: 2 2020-02-06 23:00:00 6 3 672
#> 23: 1 2020-02-07 05:00:00 2 6 726
#> 24: 2 2020-02-07 05:00:00 7 7 759
#> 25: 1 2020-02-07 11:00:00 1 4 737
#> 26: 2 2020-02-07 11:00:00 5 2 737
#> 27: 1 2020-02-07 17:00:00 3 5 766
#> 28: 2 2020-02-07 17:00:00 4 4 745
#> 29: 1 2020-02-07 23:00:00 3 3 714
#> 30: 2 2020-02-07 23:00:00 2 1 741
#> 31: 1 2020-02-08 05:00:00 4 6 751
#> 32: 2 2020-02-08 05:00:00 8 2 723
#> 33: 1 2020-02-08 11:00:00 3 3 716
#> 34: 2 2020-02-08 11:00:00 3 6 735
#> 35: 1 2020-02-08 17:00:00 1 5 696
#> 36: 2 2020-02-08 17:00:00 7 7 741
#> ID next_ema_time X Y Steps
Created on 2020-02-04 by the reprex package (v0.3.0)
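If the summed steps should end up back on the original EMA table, one option (a sketch reusing the objects above, not part of the original answer) is a data.table update join, since next_ema_time corresponds to the EMA Time:

# attach the summed Steps to the matching EMA rows by reference (sketch)
EMA[result, Steps := i.Steps, on = .(ID, Time = next_ema_time)]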
I would left_join ema_df onto pedometer_df by ID and Time. This way you get all rows of pedometer_df, with missing values for x and y (which I assume identify each assessment) whenever a row does not fall on an EMA assessment time. I then fill those values upwards with the next available ones (i.e. the x and y of the next EMA assessment), and finally group by ID, x and y and summarise to keep the datetime of the assessment (the max) and the sum of steps.
library(dplyr)
library(tidyr)
pedometer_df %>%
  left_join(ema_df, by = c("ID", "Time")) %>%
  fill(x, y, .direction = "up") %>%
  group_by(ID, x, y) %>%
  summarise(
    Time = max(Time),
    Steps = sum(Steps)
  )
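For this pipeline to line up, both frames need Time stored the same way (e.g. as POSIXct). A minimal preparation sketch, assuming the timestamps are still strings like in the question (mdy_hm() is from lubridate; note the example times omit AM/PM, so the real format may differ):

library(lubridate)

# hypothetical parsing step; mdy_hm() matches strings like "2/4/2020 8:36"
pedometer_df <- pedometer_df %>% mutate(Time = mdy_hm(Time))
ema_df <- ema_df %>% mutate(Time = mdy_hm(Time))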
The given dataset contains a timestamp in the third column, consisting of a date in mm/dd/yyyy format and a time in 24 hr format, for the month of January. I wish to find, in R, the difference in minutes between every row and the previous row, but only when both rows share the same patient value (e.g. "1", "2" or "3"). This also means that the first row of the dataset should get a value of 0 minutes, as there is nothing to compare it with. Thanks, and please help.
patient handling time
1 Registration 1/2/2017 11:41
1 Triage and Assessment 1/2/2017 12:40
1 Registration 1/2/2017 12:40
1 Triage and Assessment 1/2/2017 22:32
1 Blood test 1/5/2017 8:59
1 Blood test 1/5/2017 14:34
1 MRI SCAN 1/5/2017 21:37
2 X-Ray 1/7/2017 4:31
2 X-Ray 1/7/2017 7:57
2 Discuss Results 1/7/2017 14:45
2 Discuss Results 1/7/2017 17:55
2 Check-out 1/9/2017 17:09
2 Check-out 1/9/2017 19:14
3 Registration 1/4/2017 1:34
3 Registration 1/4/2017 6:36
3 Triage and Assessment 1/4/2017 17:49
3 Triage and Assessment 1/5/2017 8:59
3 Blood test 1/5/2017 21:37
3 Blood test 1/6/2017 3:53
If time is already of class POSIXct, and the data frame is already sorted by patient and time, the time difference in minutes can be appended using a streamlined version of SBista's answer:
library(dplyr)
DF %>%
  group_by(patient) %>%
  mutate(delta = difftime(time, lag(time, default = first(time)), units = "mins"))
# A tibble: 19 x 4
# Groups: patient [3]
patient handling time delta
<chr> <chr> <dttm> <time>
1 1 Registration 2017-01-02 11:41:00 0 mins
2 1 Triage and Assessment 2017-01-02 12:40:00 59 mins
3 1 Registration 2017-01-02 12:40:00 0 mins
4 1 Triage and Assessment 2017-01-02 22:32:00 592 mins
5 1 Blood test 2017-01-05 08:59:00 3507 mins
6 1 Blood test 2017-01-05 14:34:00 335 mins
7 1 MRI SCAN 2017-01-05 21:37:00 423 mins
8 2 X-Ray 2017-01-07 04:31:00 0 mins
9 2 X-Ray 2017-01-07 07:57:00 206 mins
10 2 Discuss Results 2017-01-07 14:45:00 408 mins
11 2 Discuss Results 2017-01-07 17:55:00 190 mins
12 2 Check-out 2017-01-09 17:09:00 2834 mins
13 2 Check-out 2017-01-09 19:14:00 125 mins
14 3 Registration 2017-01-04 01:34:00 0 mins
15 3 Registration 2017-01-04 06:36:00 302 mins
16 3 Triage and Assessment 2017-01-04 17:49:00 673 mins
17 3 Triage and Assessment 2017-01-05 08:59:00 910 mins
18 3 Blood test 2017-01-05 21:37:00 758 mins
19 3 Blood test 2017-01-06 03:53:00 376 mins
Another approach would be to compute the delta for all rows ignoring the grouping by patient and then to replace the first value for each patient by zero, as requested by the OP. Ignoring the groups in the first place might bring a performance gain (not verified).
Unfortunately, I am not proficient enough to implement this using dplyr syntax, so I use data.table with its update by reference:
library(data.table)
setDT(DF)[, delta := difftime(time, shift(time), units = "mins")][]
DF[DF[, first(.I), by = patient]$V1, delta := 0][]
patient handling time delta
1: 1 Registration 2017-01-02 11:41:00 0 mins
2: 1 Triage and Assessment 2017-01-02 12:40:00 59 mins
3: 1 Registration 2017-01-02 12:40:00 0 mins
4: 1 Triage and Assessment 2017-01-02 22:32:00 592 mins
5: 1 Blood test 2017-01-05 08:59:00 3507 mins
6: 1 Blood test 2017-01-05 14:34:00 335 mins
7: 1 MRI SCAN 2017-01-05 21:37:00 423 mins
8: 2 X-Ray 2017-01-07 04:31:00 0 mins
9: 2 X-Ray 2017-01-07 07:57:00 206 mins
10: 2 Discuss Results 2017-01-07 14:45:00 408 mins
11: 2 Discuss Results 2017-01-07 17:55:00 190 mins
12: 2 Check-out 2017-01-09 17:09:00 2834 mins
13: 2 Check-out 2017-01-09 19:14:00 125 mins
14: 3 Registration 2017-01-04 01:34:00 0 mins
15: 3 Registration 2017-01-04 06:36:00 302 mins
16: 3 Triage and Assessment 2017-01-04 17:49:00 673 mins
17: 3 Triage and Assessment 2017-01-05 08:59:00 910 mins
18: 3 Blood test 2017-01-05 21:37:00 758 mins
19: 3 Blood test 2017-01-06 03:53:00 376 mins
You can do the following:
data %>%
  group_by(patient) %>%
  mutate(diff_in_sec = as.POSIXct(time, format = "%m/%d/%Y %H:%M") -
           lag(as.POSIXct(time, format = "%m/%d/%Y %H:%M"),
               default = first(as.POSIXct(time, format = "%m/%d/%Y %H:%M")))) %>%
  mutate(diff_in_min = as.numeric(diff_in_sec/60))
You get the output as:
# A tibble: 19 x 5
# Groups: patient [3]
patient handling time diff_in_sec diff_in_min
<int> <chr> <chr> <time> <dbl>
1 1 Registration 1/2/2017 11:41 0 secs 0
2 1 Triage and Assessment 1/2/2017 12:40 3540 secs 59
3 1 Registration 1/2/2017 12:40 0 secs 0
4 1 Triage and Assessment 1/2/2017 22:32 35520 secs 592
5 1 Blood test 1/5/2017 8:59 210420 secs 3507
6 1 Blood test 1/5/2017 14:34 20100 secs 335
7 1 MRI SCAN 1/5/2017 21:37 25380 secs 423
8 2 X-Ray 1/7/2017 4:31 0 secs 0
9 2 X-Ray 1/7/2017 7:57 12360 secs 206
10 2 Discuss Results 1/7/2017 14:45 24480 secs 408
11 2 Discuss Results 1/7/2017 17:55 11400 secs 190
12 2 Check-out 1/9/2017 17:09 170040 secs 2834
13 2 Check-out 1/9/2017 19:14 7500 secs 125
14 3 Registration 1/4/2017 1:34 0 secs 0
15 3 Registration 1/4/2017 6:36 18120 secs 302
16 3 Triage and Assessment 1/4/2017 17:49 40380 secs 673
17 3 Triage and Assessment 1/5/2017 8:59 54600 secs 910
18 3 Blood test 1/5/2017 21:37 45480 secs 758
19 3 Blood test 1/6/2017 3:53 22560 secs 376
I am having some difficulty creating a time interval with 30 min breaks that begins either on the full hour (:00) or on the half hour (:30):
For instance:
library(reshape2)
library(dplyr)
# Given some data which resemble the original data
foo <- data.frame(start.time = c("2012-02-01 13:47:00",
                                 "2012-02-01 14:02:00",
                                 "2012-02-01 14:20:00",
                                 "2012-02-01 14:40:00",
                                 "2012-02-01 15:08:00",
                                 "2012-02-01 16:01:00",
                                 "2012-02-01 16:02:00",
                                 "2012-02-01 16:20:00",
                                 "2012-02-01 17:09:00",
                                 "2012-02-01 18:08:00",
                                 "2012-02-01 18:20:00",
                                 "2012-02-01 19:08:00"),
                  employee = c("mike", "john", "john", "steven", "mike", "mike",
                               "mike", "steven", "mike", "steven", "mike", "mike"))
start.time employee
#1 2012-02-01 13:47:00 mike
#2 2012-02-01 14:02:00 john
#3 2012-02-01 14:20:00 john
#4 2012-02-01 14:40:00 steven
#5 2012-02-01 15:08:00 mike
#6 2012-02-01 16:01:00 mike
#7 2012-02-01 16:02:00 mike
#8 2012-02-01 16:20:00 steven
#9 2012-02-01 17:09:00 mike
#10 2012-02-01 18:08:00 steven
#11 2012-02-01 18:20:00 mike
#12 2012-02-01 19:08:00 mike
# change factor to POSIXct
foo$start.time <- as.POSIXct(foo$start.time)

# long to wide
my_emp <- dcast(foo, start.time ~ employee, fun.aggregate = length)

# 30 min breaks
my_emp_ag <- my_emp %>%
  group_by(start.time = as.POSIXct(cut(start.time, breaks = "30 min"))) %>%
  summarize(john = sum(john), mike = sum(mike), steven = sum(steven))

# missing intervals
miss_interval <- data.frame(start.time = seq(from = min(as.POSIXct(my_emp$start.time)),
                                             to = max(as.POSIXct(my_emp$start.time)),
                                             by = "30 mins"))

# join old with new
substitited <- left_join(miss_interval, my_emp_ag, by = c('start.time'))

# change NA to zero
substitited[is.na(substitited)] <- 0
start.time john mike steven
1 2012-02-01 13:47:00 1 1 0
2 2012-02-01 14:17:00 1 0 1
3 2012-02-01 14:47:00 0 1 0
4 2012-02-01 15:17:00 0 0 0
5 2012-02-01 15:47:00 0 2 0
6 2012-02-01 16:17:00 0 0 1
7 2012-02-01 16:47:00 0 1 0
8 2012-02-01 17:17:00 0 0 0
9 2012-02-01 17:47:00 0 0 1
10 2012-02-01 18:17:00 0 1 0
11 2012-02-01 18:47:00 0 1 0
which is almost what I want; the intervals should instead start at 2012-02-01 13:30:00, 2012-02-01 14:00:00, and so on.
library(data.table)
library(lubridate)
setDT(foo)[, `:=` (
  round.time = {
    todate = ymd_hms(start.time)
    rounddate = floor_date(todate, "30 minutes")
  }
)]
start.time employee round.time
1: 2012-02-01 13:47:00 mike 2012-02-01 13:30:00
2: 2012-02-01 14:02:00 john 2012-02-01 14:00:00
3: 2012-02-01 14:20:00 john 2012-02-01 14:00:00
4: 2012-02-01 14:40:00 steven 2012-02-01 14:30:00
5: 2012-02-01 15:08:00 mike 2012-02-01 15:00:00
6: 2012-02-01 16:01:00 mike 2012-02-01 16:00:00
7: 2012-02-01 16:02:00 mike 2012-02-01 16:00:00
8: 2012-02-01 16:20:00 steven 2012-02-01 16:00:00
9: 2012-02-01 17:09:00 mike 2012-02-01 17:00:00
10: 2012-02-01 18:08:00 steven 2012-02-01 18:00:00
11: 2012-02-01 18:20:00 mike 2012-02-01 18:00:00
12: 2012-02-01 19:08:00 mike 2012-02-01 19:00:00
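To get from round.time to the wide per-employee counts the question is after, one possible continuation (a sketch, not part of the original answer) counts detections per 30-minute slot and joins onto a complete sequence of slots so that empty intervals appear as zeros:

# count per 30-minute slot and employee (foo is the data.table built above)
counts <- dcast(foo, round.time ~ employee, fun.aggregate = length, value.var = "start.time")

# full sequence of slots, so gaps such as 15:30 are not silently missing
all_slots <- data.table(round.time = seq(min(foo$round.time), max(foo$round.time), by = "30 min"))

result <- counts[all_slots, on = "round.time"]

# replace the NAs produced by the join with zero counts
for (emp in c("john", "mike", "steven")) {
  set(result, i = which(is.na(result[[emp]])), j = emp, value = 0L)
}
result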
I have the following dataframe:
> df
Time_Start Time_End Cut Plot Inlet_NH4N Outlet_NH4N Pump_reading Anemometer_reading
1 2016-05-05 11:19:00 2016-05-06 09:30:00 1 1 0.2336795 0.30786350 79846.9 6296343
2 2016-05-05 11:25:00 2016-05-06 09:35:00 1 3 1.0905045 0.50816024 78776.5 333116
3 2016-05-05 11:33:00 2016-05-06 09:39:00 1 6 1.3538576 0.34866469 79585.1 8970447
4 2016-05-05 11:37:00 2016-05-06 09:51:00 1 7 0.6862018 0.34124629 80043.1 8436546
5 2016-05-05 11:43:00 2016-05-06 09:43:00 1 9 0.2633531 0.73813056 79227.7 9007387
6 2016-05-05 11:48:00 2016-05-06 09:47:00 1 12 0.5934718 1.10905045 79121.5 8070785
7 2016-05-06 09:33:00 2013-05-07 10:13:00 1 1 0.5213904 2.46791444 88800.2 7807792
8 2016-05-06 09:38:00 2013-05-07 10:23:00 1 3 0.1684492 0.22905526 89123.0 14127
9 2016-05-06 09:42:00 2013-05-07 10:28:00 1 6 0.4393939 0.09001782 89157.6 9844162
10 2016-05-06 09:53:00 2013-05-07 10:34:00 1 7 0.1470588 1.03832442 88852.6 9143733
11 2016-05-06 09:45:00 2013-05-07 10:40:00 1 9 0.1114082 0.32531194 89635.6 10122720
12 2016-05-06 09:50:00 2013-05-07 10:43:00 1 12 0.6853832 2.51426025 89582.6 8924198
Here is the str:
> str(df)
'data.frame': 12 obs. of 8 variables:
$ Time_Start : POSIXct, format: "2016-05-05 11:19:00" "2016-05-05 11:25:00" "2016-05-05 11:33:00" ...
$ Time_End : POSIXct, format: "2016-05-06 09:30:00" "2016-05-06 09:35:00" "2016-05-06 09:39:00" ...
$ Cut : Factor w/ 1 level "1": 1 1 1 1 1 1 1 1 1 1 ...
$ Plot : Factor w/ 8 levels "1","3","6","7",..: 1 2 3 4 5 6 1 2 3 4 ...
$ Inlet_NH4N : num 0.234 1.091 1.354 0.686 0.263 ...
$ Outlet_NH4N : num 0.308 0.508 0.349 0.341 0.738 ...
$ Pump_reading : num 79847 78777 79585 80043 79228 ...
$ Anemometer_reading: int 6296343 333116 8970447 8436546 9007387 8070785 7807792 14127 9844162 9143733 ...
This is a small segment of a larger dataset.
I have a problem with these data in that the Anemometer_reading for plot "3" is always much lower than for the other plots. This is due to a mechanical problem. I want to remove this artifact and think that the best way to do this is to take an average of the Anemometer_reading for all the plots outwith plot "3". I want to calculate this average on a daily basis.
I can calculate the daily Anemometer_reading average, excluding plot "3" like this:
library(dplyr)
> df_avg <- df %>% filter(Plot != "3") %>% group_by(as.Date(Time_End)) %>% summarise(Anemometer_mean = mean(Anemometer_reading))
> df_avg
Source: local data frame [2 x 2]
as.Date(Time_End) Anemometer_mean
<date> <dbl>
1 2013-05-07 9168521
2 2016-05-06 8156302
I'm not sure how to go about using the resulting dataframe to replace the Anemometer_reading values from plot "3".
Can anyone point me in the right direction please?
Thanks
I would follow #roland's comment. However, if you care about how you would use dplyr to do what you asked:
result <- df %>% group_by(as.Date(Time_End)) %>%
  mutate(Anemometer_mean = mean(Anemometer_reading[Plot != "3"])) %>%
  mutate(Anemometer_reading = replace(Anemometer_reading, Plot == "3", first(Anemometer_mean))) %>%
  ungroup() %>% select(-`as.Date(Time_End)`, -Anemometer_mean)
print(result)
## A tibble: 12 x 8
## Time_Start Time_End Cut Plot Inlet_NH4N Outlet_NH4N Pump_reading Anemometer_reading
## <fctr> <fctr> <int> <int> <dbl> <dbl> <dbl> <dbl>
##1 2016-05-05 11:19:00 2016-05-06 09:30:00 1 1 0.2336795 0.30786350 79846.9 6296343
##2 2016-05-05 11:25:00 2016-05-06 09:35:00 1 3 1.0905045 0.50816024 78776.5 8156302
##3 2016-05-05 11:33:00 2016-05-06 09:39:00 1 6 1.3538576 0.34866469 79585.1 8970447
##4 2016-05-05 11:37:00 2016-05-06 09:51:00 1 7 0.6862018 0.34124629 80043.1 8436546
##5 2016-05-05 11:43:00 2016-05-06 09:43:00 1 9 0.2633531 0.73813056 79227.7 9007387
##6 2016-05-05 11:48:00 2016-05-06 09:47:00 1 12 0.5934718 1.10905045 79121.5 8070785
##7 2016-05-06 09:33:00 2013-05-07 10:13:00 1 1 0.5213904 2.46791444 88800.2 7807792
##8 2016-05-06 09:38:00 2013-05-07 10:23:00 1 3 0.1684492 0.22905526 89123.0 9168521
##9 2016-05-06 09:42:00 2013-05-07 10:28:00 1 6 0.4393939 0.09001782 89157.6 9844162
##10 2016-05-06 09:53:00 2013-05-07 10:34:00 1 7 0.1470588 1.03832442 88852.6 9143733
##11 2016-05-06 09:45:00 2013-05-07 10:40:00 1 9 0.1114082 0.32531194 89635.6 10122720
##12 2016-05-06 09:50:00 2013-05-07 10:43:00 1 12 0.6853832 2.51426025 89582.6 8924198
Instead of filter and summarise, use mutate to create a new column Anemometer_mean that holds the mean over all rows with Plot != "3". Then replace Anemometer_reading for the rows where Plot == "3" with this mean.
In fact, you can do all this with just one mutate:
result <- df %>% group_by(as.Date(Time_End)) %>%
  mutate(Anemometer_reading = replace(Anemometer_reading, Plot == "3",
                                      mean(Anemometer_reading[Plot != "3"]))) %>%
  ungroup() %>% select(-`as.Date(Time_End)`)
Hope this helps.