Indexing a group in R with dplyr

I have a dataset as below:
structure(AI_decs)
Horse Time RaceID dyLTO Value.LTO Draw.IV
1 Warne's Army 06/04/2021 13:00 1 56 3429 0.88
2 G For Gabrial 06/04/2021 13:00 1 57 3299 1.15
3 First Charge 06/04/2021 13:00 1 66 3429 1.06
4 Dream With Me 06/04/2021 13:00 1 62 2862 0.97
5 Qawamees 06/04/2021 13:00 1 61 4690 0.97
6 Glan Y Gors 06/04/2021 13:00 1 59 3429 1.50
7 The Dancing Poet 06/04/2021 13:00 1 42 4690 1.41
8 Finoah 06/04/2021 13:00 1 59 10260 0.97
9 Ravenscar 06/04/2021 13:30 2 58 5208 0.65
10 Arabescato 06/04/2021 13:30 2 57 2862 1.09
11 Thai Terrier 06/04/2021 13:30 2 58 7439 1.30
12 The Rutland Rebel 06/04/2021 13:30 2 55 3429 2.17
13 Red Tornado 06/04/2021 13:30 2 49 3340 0.43
14 Alfredo 06/04/2021 13:30 2 54 5208 1.30
15 Tynecastle Park 06/04/2021 13:30 2 72 7439 0.87
16 Waldkonig 06/04/2021 14:00 3 55 3493 1.35
17 Kaleidoscopic 06/04/2021 14:00 3 68 7439 1.64
18 Louganini 06/04/2021 14:00 3 75 56025 1.26
I have a list of columns with performance data values for horses in a race.
My dataset has many more rows and it contains a number of horse races on a given day.
Each horse race has a unique time and a different number of horses in each race.
Basically, I want to assign a raceId (index number) to each individual race.
I currently have to do this in Excel (see the RaceID column) by comparing the Time column and adding 1 to the RaceID value every time a new race is encountered. This has to be done manually each day before I import the data into R.
I hope there is a way to do this in R with dplyr.
I thought that if I group_by Time there might be a function, a bit like n() or row_number(), that would index the races for me, perhaps using case_when() and lag()/lead().
Thanks in advance for any help.
Graham

Try this. (Note: group_indices() was deprecated in dplyr 1.0.0; cur_group_id() is its replacement.)
library(dplyr)
df <- data.frame(time = rep(c("06/04/2021 13:00", "06/04/2021 13:30", "06/04/2021 14:00", "07/04/2021 14:00"), each = 3))
df %>%
  group_by(time) %>%
  mutate(race_id = cur_group_id())
#> # A tibble: 12 x 2
#> # Groups: time [4]
#> time race_id
#> <chr> <int>
#> 1 06/04/2021 13:00 1
#> 2 06/04/2021 13:00 1
#> 3 06/04/2021 13:00 1
#> 4 06/04/2021 13:30 2
#> 5 06/04/2021 13:30 2
#> 6 06/04/2021 13:30 2
#> 7 06/04/2021 14:00 3
#> 8 06/04/2021 14:00 3
#> 9 06/04/2021 14:00 3
#> 10 07/04/2021 14:00 4
#> 11 07/04/2021 14:00 4
#> 12 07/04/2021 14:00 4
Created on 2021-04-10 by the reprex package (v2.0.0)

You can group by data.table's function rleid (i.e., run length ID):
library(dplyr)
library(data.table)
df %>%
  group_by(race_id = rleid(time))
# A tibble: 12 x 2
# Groups: race_id [4]
time race_id
<chr> <int>
1 06/04/2021 13:00 1
2 06/04/2021 13:00 1
3 06/04/2021 13:00 1
4 06/04/2021 13:30 2
5 06/04/2021 13:30 2
6 06/04/2021 13:30 2
7 06/04/2021 14:00 3
8 06/04/2021 14:00 3
9 06/04/2021 14:00 3
10 07/04/2021 14:00 4
11 07/04/2021 14:00 4
12 07/04/2021 14:00 4
Data, from @Peter:
df <- data.frame(time = rep(c("06/04/2021 13:00", "06/04/2021 13:30", "06/04/2021 14:00", "07/04/2021 14:00"), each = 3))
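For completeness, a base R sketch of the same indexing, assuming (as in the sample) that each distinct time identifies one race:
# Index each distinct time in order of first appearance
df$race_id <- match(df$time, unique(df$time))
# Or, since races sit in consecutive blocks, count where each new block starts
df$race_id <- cumsum(!duplicated(df$time))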

Related

How to make a loop that changes values in specific rows from a dictionary

I'm new and can't figure out how to solve this problem.
I have a data.frame, schedule:
Week_number Start End
1           09:00 15:00
1           09:00 15:00
1           09:00 15:00
1           09:00 15:00
1           09:00 15:00
1           NA    NA
1           NA    NA
2           09:00 15:00
2           09:00 15:00
2           09:00 15:00
2           09:00 15:00
2           09:00 15:00
2           NA    NA
2           NA    NA
3           09:00 15:00
3           09:00 15:00
3           09:00 15:00
3           09:00 15:00
3           09:00 15:00
3           NA    NA
3           NA    NA
... (same pattern continues through week 52)
I have a shift dictionary:
start_vec <- c("06:00", "08:00", "14:00")
end_vec <- c("14:00", "16:00", "22:00")
My loop should go through all 52 weeks and replace 09:00 and 15:00 with values from the dictionary.
But the values should not repeat, i.e. each week should get a different shift.
For example, I start the year with 08:00 - 16:00; the year can start with any shift.
Please find an example of the desired result below:
Week_number Start End
1           08:00 16:00
1           08:00 16:00
1           08:00 16:00
1           08:00 16:00
1           08:00 16:00
1           NA    NA
1           NA    NA
2           14:00 22:00
2           14:00 22:00
2           14:00 22:00
2           14:00 22:00
2           14:00 22:00
2           NA    NA
2           NA    NA
3           06:00 14:00
3           06:00 14:00
3           06:00 14:00
3           06:00 14:00
3           06:00 14:00
3           NA    NA
3           NA    NA
... (rotation continues through week 52)
I tried to make a nested loop, or to make a week_number vector so I could replace all week-1 rows that are not NA with a specific value.
rd_dt <- data.frame()
for (i in 1:length(schedule$Week_number)){
  for (s in start_vec){
    for (e in end_vec){
      dt <- schedule[i,]
      if (schedule$Start == NA){
        next
      else {
Thanks in advance for any hint.
I think you do not need a loop for this. Here is one approach that may be helpful. ifelse() checks for NA: if the value is not NA, it looks up the substitute in start_vec or end_vec, using Week_number as the index into the vector. The %% modulus operator (with 3, the length of your vectors) makes the index wrap back to the beginning once it exceeds the length of the vectors.
library(dplyr)
schedule %>%
  mutate(Start = ifelse(is.na(Start), NA, start_vec[1 + Week_number %% 3]),
         End = ifelse(is.na(End), NA, end_vec[1 + Week_number %% 3]))
Output
Week_number Start End
1 1 08:00 16:00
2 1 08:00 16:00
3 1 08:00 16:00
4 1 08:00 16:00
5 1 08:00 16:00
6 1 <NA> <NA>
7 1 <NA> <NA>
8 2 14:00 22:00
9 2 14:00 22:00
10 2 14:00 22:00
11 2 14:00 22:00
12 2 14:00 22:00
13 2 <NA> <NA>
14 2 <NA> <NA>
15 3 06:00 14:00
16 3 06:00 14:00
17 3 06:00 14:00
18 3 06:00 14:00
19 3 06:00 14:00
20 3 <NA> <NA>
21 3 <NA> <NA>
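If the year should open with a different shift, a small offset on the index rotates the whole schedule. A sketch, where offset is a hypothetical knob (0, 1 or 2) choosing which shift week 1 receives:
offset <- 1  # hypothetical: 0, 1 or 2 picks the opening shift
schedule %>%
  mutate(Start = ifelse(is.na(Start), NA, start_vec[1 + (Week_number + offset) %% 3]),
         End = ifelse(is.na(End), NA, end_vec[1 + (Week_number + offset) %% 3]))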

Is there a way of converting four-digit numbers to time values in r?

When I try using as.POSIXlt or strptime I keep getting a single value of NA as a result.
What I need to do is transform 3- and 4-digit numbers, e.g. 2300 or 115, to 23:00 or 01:15 respectively, but I simply cannot get any code to work.
Basically, this data frame of outputs:
Time
1 2345
2 2300
3 2130
4 2400
5 115
6 2330
7 100
8 2300
9 1530
10 130
11 100
12 215
13 2245
14 145
15 2330
16 2400
17 2300
18 2230
19 2130
20 30
should look like this:
Time
1 23:45
2 23:00
3 21:30
4 24:00
5 01:15
6 23:30
7 01:00
8 23:00
9 15:30
10 01:30
11 01:00
12 02:15
13 22:45
14 01:45
15 23:30
16 24:00
17 23:00
18 22:30
19 21:30
20 00:30
I think you can use the following solution. Note, however, that this actually produces a character vector:
gsub("(\\d{2})(\\d{2})", "\\1:\\2", sprintf("%04d", df$Time)) |>
as.data.frame() |>
setNames("Time") |>
head()
Time
1 23:45
2 23:00
3 21:30
4 24:00
5 01:15
6 23:30
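If real time values are needed rather than strings, the same zero-padding also makes strptime() work; the likely cause of the NA in the question is unpadded input plus the value 2400, which is not a valid %H%M clock time. A sketch, assuming 2400 may be recoded as midnight:
padded <- sprintf("%04d", df$Time)
padded[padded == "2400"] <- "0000"  # assumption: 24:00 is treated as 00:00
df$Time_hm <- format(strptime(padded, format = "%H%M"), "%H:%M")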

Aggregate Data based on Two Different Assessment Methods in R

I'm looking to aggregate some pedometer data, gathered in steps per minute, so I get a summed number of steps up until an EMA assessment. The EMA assessments happened four times per day. An example of the two data sets are:
Pedometer Data
ID Steps Time
1 15 2/4/2020 8:32
1 23 2/4/2020 8:33
1 76 2/4/2020 8:34
1 32 2/4/2020 8:35
1 45 2/4/2020 8:36
...
2 16 2/4/2020 8:32
2 17 2/4/2020 8:33
2 0 2/4/2020 8:34
2 5 2/4/2020 8:35
2 8 2/4/2020 8:36
EMA Data
ID Time X Y
1 2/4/2020 8:36 3 4
1 2/4/2020 12:01 3 5
1 2/4/2020 3:30 4 5
1 2/4/2020 6:45 7 8
...
2 2/4/2020 8:35 4 6
2 2/4/2020 12:05 5 7
2 2/4/2020 3:39 1 3
2 2/4/2020 6:55 8 3
I'm looking to add the pedometer data to the EMA data as a new variable, where the number of steps taken is summed until the next EMA assessment. Ideally it would look something like:
Combined Data
ID Time X Y Steps
1 2/4/2020 8:36 3 4 191
1 2/4/2020 12:01 3 5 [Sum of steps taken from 8:37 until 12:01 on 2/4/2020]
1 2/4/2020 3:30 4 5 [Sum of steps taken from 12:02 until 3:30 on 2/4/2020]
1 2/4/2020 6:45 7 8 [Sum of steps taken from 3:31 until 6:45 on 2/4/2020]
...
2 2/4/2020 8:35 4 6 38
2 2/4/2020 12:05 5 7 [Sum of steps taken from 8:36 until 12:05 on 2/4/2020]
2 2/4/2020 3:39 1 3 [Sum of steps taken from 12:06 until 3:39 on 2/4/2020]
2 2/4/2020 6:55 8 3 [Sum of steps taken from 3:40 until 6:55 on 2/4/2020]
I then need the process to continue over the entire 21 day EMA period, so the same process for the 4 EMA assessment time points on 2/5/2020, 2/6/2020, etc.
This has pushed me the limit of my R skills, so any pointers would be extremely helpful! I'm most familiar with the tidyverse but am comfortable using base R as well. Thanks in advance for all advice.
Here's a solution using rolling joins from data.table. The basic idea is to roll each time in the pedometer data forward to the next time in the EMA data (while still matching on ID). Once the next EMA time is found, all that's left is to isolate the X and Y values and sum up Steps.
Data creation and prep:
library(data.table)
pedometer <- data.table(ID = sort(rep(1:2, 500)),
                        Time = rep(seq.POSIXt(as.POSIXct("2020-02-04 09:35:00 EST"),
                                              as.POSIXct("2020-02-08 17:00:00 EST"),
                                              length.out = 500), 2),
                        Steps = rpois(1000, 25))
EMA <- data.table(ID = sort(rep(1:2, 4*5)),
                  Time = rep(seq.POSIXt(as.POSIXct("2020-02-04 05:00:00 EST"),
                                        as.POSIXct("2020-02-08 23:59:59 EST"),
                                        by = '6 hours'), 2),
                  X = sample(1:8, 2*4*5, rep = T),
                  Y = sample(1:8, 2*4*5, rep = T))
setkey(pedometer, Time)
setkey(EMA, Time)
EMA[, next_ema_time := Time]
And now the actual join and summation:
joined <- EMA[pedometer,
              on = .(ID, Time),
              roll = -Inf,
              j = .(ID, Time, Steps, next_ema_time, X, Y)]
result <- joined[, .('X' = min(X),
                     'Y' = min(Y),
                     'Steps' = sum(Steps)),
                 .(ID, next_ema_time)]
result
#> ID next_ema_time X Y Steps
#> 1: 1 2020-02-04 11:00:00 1 2 167
#> 2: 2 2020-02-04 11:00:00 8 5 169
#> 3: 1 2020-02-04 17:00:00 3 6 740
#> 4: 2 2020-02-04 17:00:00 4 6 747
#> 5: 1 2020-02-04 23:00:00 2 2 679
#> 6: 2 2020-02-04 23:00:00 3 2 732
#> 7: 1 2020-02-05 05:00:00 7 5 720
#> 8: 2 2020-02-05 05:00:00 6 8 692
#> 9: 1 2020-02-05 11:00:00 2 4 731
#> 10: 2 2020-02-05 11:00:00 4 5 773
#> 11: 1 2020-02-05 17:00:00 1 5 757
#> 12: 2 2020-02-05 17:00:00 3 5 743
#> 13: 1 2020-02-05 23:00:00 3 8 693
#> 14: 2 2020-02-05 23:00:00 1 8 740
#> 15: 1 2020-02-06 05:00:00 8 8 710
#> 16: 2 2020-02-06 05:00:00 3 2 760
#> 17: 1 2020-02-06 11:00:00 8 4 716
#> 18: 2 2020-02-06 11:00:00 1 2 688
#> 19: 1 2020-02-06 17:00:00 5 2 738
#> 20: 2 2020-02-06 17:00:00 4 6 724
#> 21: 1 2020-02-06 23:00:00 7 8 737
#> 22: 2 2020-02-06 23:00:00 6 3 672
#> 23: 1 2020-02-07 05:00:00 2 6 726
#> 24: 2 2020-02-07 05:00:00 7 7 759
#> 25: 1 2020-02-07 11:00:00 1 4 737
#> 26: 2 2020-02-07 11:00:00 5 2 737
#> 27: 1 2020-02-07 17:00:00 3 5 766
#> 28: 2 2020-02-07 17:00:00 4 4 745
#> 29: 1 2020-02-07 23:00:00 3 3 714
#> 30: 2 2020-02-07 23:00:00 2 1 741
#> 31: 1 2020-02-08 05:00:00 4 6 751
#> 32: 2 2020-02-08 05:00:00 8 2 723
#> 33: 1 2020-02-08 11:00:00 3 3 716
#> 34: 2 2020-02-08 11:00:00 3 6 735
#> 35: 1 2020-02-08 17:00:00 1 5 696
#> 36: 2 2020-02-08 17:00:00 7 7 741
#> ID next_ema_time X Y Steps
Created on 2020-02-04 by the reprex package (v0.3.0)
I would left_join ema_df onto pedometer_df by ID and Time. This way you get all rows of pedometer_df, with missing values for X and Y (which I assume identify the assessments) wherever a row is not an EMA assessment time.
I fill those values upward using the next available assessment's X and Y,
and finally group by ID, X and Y and summarise to keep the datetime of the assessment (the max) and the sum of Steps.
library(dplyr)
library(tidyr)
pedometer_df %>%
  left_join(ema_df, by = c("ID", "Time")) %>%
  fill(X, Y, .direction = "up") %>%
  group_by(ID, X, Y) %>%
  summarise(
    Time = max(Time),
    Steps = sum(Steps)
  )
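Note that both approaches assume Time is a real date-time so that ordering and matching behave. A sketch of the parsing with lubridate, assuming month/day/year order and a 24-hour clock (the sample's "3:30" would need AM/PM handling upstream if it is a 12-hour clock):
library(lubridate)
pedometer_df$Time <- mdy_hm(pedometer_df$Time)
ema_df$Time <- mdy_hm(ema_df$Time)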

to remove duplicates and making separate column for entries whose date are not continuous

I have a data set as given in the example employee data set:
s.no e.name s.date e.date s.time e.time total.hrs
1 George 1-Jan-19 10-Jan-19 10:45 11:45 1
2 George 10-Jan-19 15-Jan-19 06:00 09:00 3
3 George 15-Jan-19 15-Jan-19 12:00 03:00 3
4 George 5-Feb-19 18-Feb-19 12:50 14:50 2
5 Jacob 2-Feb-19 20-Feb-19 15:50 16:50 1
6 Jacob 20-Feb-19 24-Feb-19 14:30 18:30 4
7 Jacob 3-Dec-19 25-Dec-19 06:40 11:40 5
8 Jacob 25-Dec-19 30-Dec-19 09:40 12:40 3
9 Mike 02-Jun-19 02-Jun-19 6:40 07:40 1
10 Mike 02-Jun-19 02-Jun-19 2:45 3:45 1
11 Mike 02-Jun-19 20-Jun-19 10:00 12:00 2
12 Mike 23-Jun-19 25-Jun-19 4:00 5:00 1
My desired output is:
s.no e.name s.date e.date s.time e.time total.hrs
1 George 1-Jan-19 15-Jan-19 10:45 03:00 7
2 George 5-Feb-19 18-Feb-19 12:50 14:50 2
3 Jacob 2-Feb-19 24-Feb-19 15:50 18:30 5
4 Jacob 3-Dec-19 30-Dec-19 06:40 12:40 8
5 Mike 2-Jun-19 20-Jun-19 6:40 12:00 4
6 Mike 23-Jun-19 25-Jun-19 4:00 5:00 1
I was using the dplyr library to summarise this, but I was having some problems with it. I was using the code below, but I'm not getting my desired output:
employee <- employee %>% group_by(e.name) %>% summarise(
s.date=first(s.date),
e.date=last(e.date),
s.time=first(s.time),
e.time=last(e.time),
total.hrs=sum(total.hrs))
From my code I'm getting the following result:
s.no e.name s.date e.date s.time e.time total.hrs
1 George 1-Jan-19 18-Jan-19 10:45 14:50 6
2 Jacob 2-Feb-19 30-Dec-19 15:50 12:40 12
but I want my result to match the desired output shown above.
With data.table...
library(data.table)
setDT(DT)
res = DT[, .(
  s.no = first(s.no),
  s.date = first(s.date),
  e.date = last(e.date),
  s.time = first(s.time),
  e.time = last(e.time),
  total.hrs = sum(total.hrs)
), by = .(e.name, .g = cumsum(s.date != shift(e.date, fill = first(s.date))))]
res[, .g := NULL]
e.name s.no s.date e.date s.time e.time total.hrs
1: George 1 1-Jan-19 15-Jan-19 10:45 03:00 7
2: George 4 5-Feb-19 18-Feb-19 12:50 14:50 2
3: Jacob 5 2-Feb-19 24-Feb-19 15:50 18:30 5
4: Jacob 7 3-Dec-19 30-Dec-19 06:40 12:40 8
5: Mike 9 02-Jun-19 20-Jun-19 6:40 12:00 4
6: Mike 12 23-Jun-19 25-Jun-19 4:00 5:00 1
Analogue in dplyr:
library(dplyr)
DT %>%
  group_by(e.name, .g = cumsum(s.date != lag(e.date, default = first(s.date)))) %>%
  summarise(
    s.no = first(s.no),
    s.date = first(s.date),
    e.date = last(e.date),
    s.time = first(s.time),
    e.time = last(e.time),
    total.hrs = sum(total.hrs)
  ) %>%
  select(-.g)
# A tibble: 6 x 7
# Groups: e.name [3]
e.name s.no s.date e.date s.time e.time total.hrs
<chr> <int> <chr> <chr> <chr> <chr> <int>
1 George 1 1-Jan-19 15-Jan-19 10:45 03:00 7
2 George 4 5-Feb-19 18-Feb-19 12:50 14:50 2
3 Jacob 5 2-Feb-19 24-Feb-19 15:50 18:30 5
4 Jacob 7 3-Dec-19 30-Dec-19 06:40 12:40 8
5 Mike 9 02-Jun-19 20-Jun-19 6:40 12:00 4
6 Mike 12 23-Jun-19 25-Jun-19 4:00 5:00 1
This answer:
- skips reading in the data
- skips conversion of dates and times to proper formats
- assumes the data is sorted
@Mouad's answer is more thorough in that it fixes these (as the OP should do with their real data). The approach there is also essentially the same: group by both e.name and the cumsum of flags marking where s.date differs from the previous/shifted/lagged e.date.
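A sketch of the conversions this answer skips, run before the grouping (using the DT created under Data below; assumes day-month-year dates like "1-Jan-19" and an English locale for %b):
DT[, `:=`(s.date = as.Date(s.date, format = "%d-%b-%y"),
          e.date = as.Date(e.date, format = "%d-%b-%y"))]
setorder(DT, e.name, s.date)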
Data
library(data.table)
DT = fread("s.no e.name s.date e.date s.time e.time total.hrs
1 George 1-Jan-19 10-Jan-19 10:45 11:45 1
2 George 10-Jan-19 15-Jan-19 06:00 09:00 3
3 George 15-Jan-19 15-Jan-19 12:00 03:00 3
4 George 5-Feb-19 18-Feb-19 12:50 14:50 2
5 Jacob 2-Feb-19 20-Feb-19 15:50 16:50 1
6 Jacob 20-Feb-19 24-Feb-19 14:30 18:30 4
7 Jacob 3-Dec-19 25-Dec-19 06:40 11:40 5
8 Jacob 25-Dec-19 30-Dec-19 09:40 12:40 3
9 Mike 02-Jun-19 02-Jun-19 6:40 07:40 1
10 Mike 02-Jun-19 02-Jun-19 2:45 3:45 1
11 Mike 02-Jun-19 20-Jun-19 10:00 12:00 2
12 Mike 23-Jun-19 25-Jun-19 4:00 5:00 1")
require(dplyr)
my_df <- read.table(text =
's.no e.name s.date e.date s.time e.time total.hrs
1 George 1-Jan-19 10-Jan-19 10:45 11:45 1
2 George 10-Jan-19 15-Jan-19 06:00 09:00 3
3 George 5-Feb-19 18-Feb-19 12:50 14:50 2
4 Jacob 2-Feb-19 20-Feb-19 15:50 16:50 1
5 Jacob 20-Feb-19 24-Feb-19 14:30 18:30 4
5 Jacob 3-Dec-19 25-Dec-19 06:40 11:40 5
6 Jacob 25-Dec-19 30-Dec-19 09:40 12:40 3',
header = TRUE, stringsAsFactors = FALSE) %>% as_tibble()
my_df <-
  my_df %>%
  mutate(s.date2 = as.Date(s.date, '%d-%B-%y'),
         e.date2 = as.Date(e.date, '%d-%B-%y')) %>%
  arrange(e.name, s.date2) %>%
  group_by(e.name) %>%
  mutate(lag_e.date2 = lag(e.date2, 1)) %>%
  ungroup %>%
  mutate(new_episode = as.numeric(is.na(lag_e.date2) | s.date2 != lag_e.date2)) %>%
  mutate(episode = cumsum(new_episode)) %>%
  group_by(episode) %>%
  mutate(asc_rank = rank(s.date2),
         desc_rank = rank(desc(s.date2)),
         sum_hours = sum(total.hrs)) %>%
  ungroup
then
my_df %>%
  mutate(s.date_new = if_else(asc_rank == 1, s.date2, as.Date('1900-01-01')),
         s.time_new = if_else(asc_rank == 1, s.time, '00:00'),
         e.date_new = if_else(desc_rank == 1, e.date2, as.Date('1900-01-01')),
         e.time_new = if_else(desc_rank == 1, e.time, '00:00')) %>%
  select(e.name, s.date_new, e.date_new, s.time_new, e.time_new, sum_hours, episode) %>%
  group_by(episode) %>%
  mutate(s.date = max(s.date_new, na.rm = TRUE),
         e.date = max(e.date_new, na.rm = TRUE),
         s.time = max(s.time_new, na.rm = TRUE),
         e.time = max(e.time_new, na.rm = TRUE),
         sum_hours = max(sum_hours)) %>%
  ungroup %>%
  select(e.name, s.date, e.date, s.time, e.time, sum_hours) %>%
  distinct
# # A tibble: 4 x 6
# e.name s.date e.date s.time e.time sum_hours
# <chr> <date> <date> <chr> <chr> <dbl>
# 1 George 2019-01-01 2019-01-15 10:45 09:00 4
# 2 George 2019-02-05 2019-02-18 12:50 14:50 2
# 3 Jacob 2019-02-02 2019-02-24 15:50 18:30 5
# 4 Jacob 2019-12-03 2019-12-30 06:40 12:40 8

moving average on different size data frames in R

I have a set of data taken every 5 minutes consisting of the following structure:
>df1
Date X1
01/01/2017 0:00 1
01/01/2017 0:30 32
01/01/2017 1:00 65
01/01/2017 1:30 14
01/01/2017 2:00 25
01/01/2017 2:30 14
01/01/2017 3:00 85
01/01/2017 3:30 74
01/01/2017 4:00 74
01/01/2017 4:30 52
01/01/2017 5:00 25
01/01/2017 5:30 74
01/01/2017 6:00 45
01/01/2017 6:30 52
01/01/2017 7:00 21
01/01/2017 7:30 41
01/01/2017 8:00 74
01/01/2017 8:30 11
01/01/2017 9:00 2
01/01/2017 9:30 52
Another vector is given consisting of only dates, but with a different time frequency:
>V1
Date2
1/1/2017 1:30:00
1/1/2017 3:30:00
1/1/2017 5:30:00
1/1/2017 9:30:00
I would like to calculate the moving average of X1 but at the end the only values I really need are the ones associated with the dates in V1 (but preserving the smoothing generated by the moving average)
Would you recommend calculating the moving average of X1, then associating the values with the corresponding dates in V1 and re-applying a moving average? Or do you know of a function in R that would help me achieve this?
Thank you, I really appreciate your help!
Sofía
stats::filter() is a convenient way to construct moving averages. (If dplyr is loaded, its filter() masks the stats one, so qualify the call.)
Assuming you want a simple arithmetic moving average, you'll need to decide how many elements you'd like to average together, and whether you'd like a one- or two-sided average. Arbitrarily, I'll pick 5 and one-sided.
elements <- 5
df1$x1.smooth <- stats::filter(df1$X1, filter = rep(1/elements, elements), sides = 1)
Note that the first elements - 1 values of x1.smooth are NA, because the moving average is undefined until there are elements items to average.
df1 is now
Date X1 x1.smooth
1 01/01/2017 0:00 1 NA
2 01/01/2017 0:30 32 NA
3 01/01/2017 1:00 65 NA
4 01/01/2017 1:30 14 NA
5 01/01/2017 2:00 25 27.4
6 01/01/2017 2:30 14 30.0
7 01/01/2017 3:00 85 40.6
8 01/01/2017 3:30 74 42.4
9 01/01/2017 4:00 74 54.4
10 01/01/2017 4:30 52 59.8
11 01/01/2017 5:00 25 62.0
12 01/01/2017 5:30 74 59.8
13 01/01/2017 6:00 45 54.0
14 01/01/2017 6:30 52 49.6
15 01/01/2017 7:00 21 43.4
16 01/01/2017 7:30 41 46.6
17 01/01/2017 8:00 74 46.6
18 01/01/2017 8:30 11 39.8
19 01/01/2017 9:00 2 29.8
20 01/01/2017 9:30 52 36.0
Now you need only merge the two data frames on Date = Date2, or else subset df1 to the rows where Date is %in% V1$Date2.
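A sketch of that last step, assuming day/month/year order and that both columns parse to POSIXct (the formats below are guesses from the samples shown):
df1$Date <- as.POSIXct(df1$Date, format = "%d/%m/%Y %H:%M")
V1$Date2 <- as.POSIXct(V1$Date2, format = "%d/%m/%Y %H:%M:%S")
smoothed_at_v1 <- df1[df1$Date %in% V1$Date2, ]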
Another option could be to use the zoo package. One can use rollapply() to calculate and add another column to the data frame that will hold the moving average of X1.
An implementation with a moving average of width 4 (i.e. 2 hours at these half-hourly readings) can be written as:
library(zoo)
# Add another column with the mean value
df1$mean <- rollapply(df1$X1, 4, mean, align = "right", fill = NA)
df1
# Date X1 mean
# 1 2017-01-01 00:00:00 1 NA
# 2 2017-01-01 00:30:00 32 NA
# 3 2017-01-01 01:00:00 65 NA
# 4 2017-01-01 01:30:00 14 28.00
# 5 2017-01-01 02:00:00 25 34.00
# 6 2017-01-01 02:30:00 14 29.50
# 7 2017-01-01 03:00:00 85 34.50
# 8 2017-01-01 03:30:00 74 49.50
# 9 2017-01-01 04:00:00 74 61.75
# 10 2017-01-01 04:30:00 52 71.25
# 11 2017-01-01 05:00:00 25 56.25
# 12 2017-01-01 05:30:00 74 56.25
# 13 2017-01-01 06:00:00 45 49.00
# 14 2017-01-01 06:30:00 52 49.00
# 15 2017-01-01 07:00:00 21 48.00
# 16 2017-01-01 07:30:00 41 39.75
# 17 2017-01-01 08:00:00 74 47.00
# 18 2017-01-01 08:30:00 11 36.75
# 19 2017-01-01 09:00:00 2 32.00
# 20 2017-01-01 09:30:00 52 34.75
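As with the approach above, the final step would be to keep only the V1 timestamps, e.g. (assuming Date and Date2 are parsed to POSIXct as sketched earlier):
result <- merge(V1, df1, by.x = "Date2", by.y = "Date")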
