Moving average on different size data frames in R

I have a set of data taken every 30 minutes, consisting of the following structure:
>df1
Date X1
01/01/2017 0:00 1
01/01/2017 0:30 32
01/01/2017 1:00 65
01/01/2017 1:30 14
01/01/2017 2:00 25
01/01/2017 2:30 14
01/01/2017 3:00 85
01/01/2017 3:30 74
01/01/2017 4:00 74
01/01/2017 4:30 52
01/01/2017 5:00 25
01/01/2017 5:30 74
01/01/2017 6:00 45
01/01/2017 6:30 52
01/01/2017 7:00 21
01/01/2017 7:30 41
01/01/2017 8:00 74
01/01/2017 8:30 11
01/01/2017 9:00 2
01/01/2017 9:30 52
Another vector is given consisting of only dates, but with a different time frequency:
>V1
Date2
1/1/2017 1:30:00
1/1/2017 3:30:00
1/1/2017 5:30:00
1/1/2017 9:30:00
I would like to calculate the moving average of X1, but in the end the only values I really need are the ones associated with the dates in V1 (while preserving the smoothing generated by the moving average).
Would you recommend calculating the moving average of X1, then associating each value with the corresponding date in V1 and re-applying a moving average? Or do you know of a function in R that would help me achieve this?
Thank you, I really appreciate your help!
Sofía

stats::filter is a convenient way to construct moving averages.
Assuming you want a simple arithmetic moving average, you'll need to decide how many elements you'd like to average together, and whether you'd like a one- or two-sided average. Arbitrarily, I'll pick 5 and one-sided.
elements <- 5
# sides = 1 gives a trailing average of the current value and the previous 4;
# qualifying the call as stats::filter avoids the clash with dplyr::filter
df1$x1.smooth <- stats::filter(df1$X1, filter = rep(1/elements, elements), sides = 1)
Note that "moving.average" will have elements-1 fewer elements than df1$X1 due to the moving average being undefined until there are elements items to average.
df1 is now
Date X1 x1.smooth
1 01/01/2017 0:00 1 NA
2 01/01/2017 0:30 32 NA
3 01/01/2017 1:00 65 NA
4 01/01/2017 1:30 14 NA
5 01/01/2017 2:00 25 27.4
6 01/01/2017 2:30 14 30.0
7 01/01/2017 3:00 85 40.6
8 01/01/2017 3:30 74 42.4
9 01/01/2017 4:00 74 54.4
10 01/01/2017 4:30 52 59.8
11 01/01/2017 5:00 25 62.0
12 01/01/2017 5:30 74 59.8
13 01/01/2017 6:00 45 54.0
14 01/01/2017 6:30 52 49.6
15 01/01/2017 7:00 21 43.4
16 01/01/2017 7:30 41 46.6
17 01/01/2017 8:00 74 46.6
18 01/01/2017 8:30 11 39.8
19 01/01/2017 9:00 2 29.8
20 01/01/2017 9:30 52 36.0
Now you need only merge the two data frames on Date = Date2, or else subset df1 to the rows where Date is %in% V1$Date2.
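For example, a minimal sketch of that last step (the date formats below are assumptions read off the samples shown; adjust them if your dates are month-first):
# Parse both columns to POSIXct so the comparison is reliable
df1$Date <- as.POSIXct(df1$Date, format = "%d/%m/%Y %H:%M")
V1$Date2 <- as.POSIXct(V1$Date2, format = "%d/%m/%Y %H:%M:%S")
# Keep only the smoothed rows whose Date appears in V1
df1[df1$Date %in% V1$Date2, ]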

Another option is the zoo package. rollapply can calculate the moving average of X1 and add it as another column of the data frame.
An implementation with a moving average of width 4 (i.e. every 2 hours) could be:
library(zoo)
# Add another column holding the rolling mean (right-aligned, NA-padded)
df1$mean <- rollapply(df1$X1, 4, mean, align = "right", fill = NA)
df1
# Date X1 mean
# 1 2017-01-01 00:00:00 1 NA
# 2 2017-01-01 00:30:00 32 NA
# 3 2017-01-01 01:00:00 65 NA
# 4 2017-01-01 01:30:00 14 28.00
# 5 2017-01-01 02:00:00 25 34.00
# 6 2017-01-01 02:30:00 14 29.50
# 7 2017-01-01 03:00:00 85 34.50
# 8 2017-01-01 03:30:00 74 49.50
# 9 2017-01-01 04:00:00 74 61.75
# 10 2017-01-01 04:30:00 52 71.25
# 11 2017-01-01 05:00:00 25 56.25
# 12 2017-01-01 05:30:00 74 56.25
# 13 2017-01-01 06:00:00 45 49.00
# 14 2017-01-01 06:30:00 52 49.00
# 15 2017-01-01 07:00:00 21 48.00
# 16 2017-01-01 07:30:00 41 39.75
# 17 2017-01-01 08:00:00 74 47.00
# 18 2017-01-01 08:30:00 11 36.75
# 19 2017-01-01 09:00:00 2 32.00
# 20 2017-01-01 09:30:00 52 34.75
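As an aside, zoo also provides rollmean()/rollmeanr(), which expresses the same computation a little more directly; a sketch equivalent to the rollapply call above:
# rollmeanr() is rollmean() with align = "right"; fill = NA pads the first rows
df1$mean <- rollmeanr(df1$X1, k = 4, fill = NA)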

Related

Is there a way of converting four-digit numbers to time values in R?

When I try using as.POSIXlt or strptime I keep getting a single value of 'NA' as a result.
What I need to do is transform 3- and 4-digit numbers, e.g. 2300 or 115, into 23:00 or 01:15 respectively, but I simply cannot get any code to work.
Basically, this data frame of outputs:
Time
1 2345
2 2300
3 2130
4 2400
5 115
6 2330
7 100
8 2300
9 1530
10 130
11 100
12 215
13 2245
14 145
15 2330
16 2400
17 2300
18 2230
19 2130
20 30
should look like this:
Time
1 23:45
2 23:00
3 21:30
4 24:00
5 01:15
6 23:30
7 01:00
8 23:00
9 15:30
10 01:30
11 01:00
12 02:15
13 22:45
14 01:45
15 23:30
16 24:00
17 23:00
18 22:30
19 21:30
20 00:30
I think you can use the following solution. Note, however, that this produces a character vector rather than a time class:
gsub("(\\d{2})(\\d{2})", "\\1:\\2", sprintf("%04d", df$Time)) |>
as.data.frame() |>
setNames("Time") |>
head()
Time
1 23:45
2 23:00
3 21:30
4 24:00
5 01:15
6 23:30
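A likely reason the as.POSIXlt/strptime attempts returned NA is that the values are not zero-padded and that 2400 is not a valid clock time; a quick check (sketch):
padded <- sprintf("%04d", df$Time)
# Most values parse once padded, but "2400" is not a valid %H%M time,
# so strptime() still returns NA for those rows
strptime(padded, format = "%H%M")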

Indexing a group in R with dplyr

I have a dataset as below:
structure(AI_decs)
Horse Time RaceID dyLTO Value.LTO Draw.IV
1 Warne's Army 06/04/2021 13:00 1 56 3429 0.88
2 G For Gabrial 06/04/2021 13:00 1 57 3299 1.15
3 First Charge 06/04/2021 13:00 1 66 3429 1.06
4 Dream With Me 06/04/2021 13:00 1 62 2862 0.97
5 Qawamees 06/04/2021 13:00 1 61 4690 0.97
6 Glan Y Gors 06/04/2021 13:00 1 59 3429 1.50
7 The Dancing Poet 06/04/2021 13:00 1 42 4690 1.41
8 Finoah 06/04/2021 13:00 1 59 10260 0.97
9 Ravenscar 06/04/2021 13:30 2 58 5208 0.65
10 Arabescato 06/04/2021 13:30 2 57 2862 1.09
11 Thai Terrier 06/04/2021 13:30 2 58 7439 1.30
12 The Rutland Rebel 06/04/2021 13:30 2 55 3429 2.17
13 Red Tornado 06/04/2021 13:30 2 49 3340 0.43
14 Alfredo 06/04/2021 13:30 2 54 5208 1.30
15 Tynecastle Park 06/04/2021 13:30 2 72 7439 0.87
16 Waldkonig 06/04/2021 14:00 3 55 3493 1.35
17 Kaleidoscopic 06/04/2021 14:00 3 68 7439 1.64
18 Louganini 06/04/2021 14:00 3 75 56025 1.26
I have a list of columns with performance data values for horses in a race.
My dataset has many more rows and it contains a number of horse races on a given day.
Each horse race has a unique time and a different number of horses in each race.
Basically, I want to assign a raceId (index number) to each individual race.
I am currently having to do this in Excel (see column RaceID) by comparing the Time column and adding 1 to the RaceID value every time we encounter a new race. This has to be done manually each day before I import into R.
I hope there is a way to do this in R with dplyr.
I thought that if I group_by Time there might be a function, a bit like n() or row_number(), that would index the races for me. Perhaps using case_when and lag/lead.
Thanks in advance for any help.
Graham
Try this (note that group_indices() was deprecated in dplyr 1.0.0; cur_group_id() is its replacement):
library(dplyr)
df <- data.frame(time = rep(c("06/04/2021 13:00", "06/04/2021 13:30", "06/04/2021 14:00", "07/04/2021 14:00"), each = 3))
df %>%
group_by(time) %>%
mutate(race_id = cur_group_id())
#> # A tibble: 12 x 2
#> # Groups: time [4]
#> time race_id
#> <chr> <int>
#> 1 06/04/2021 13:00 1
#> 2 06/04/2021 13:00 1
#> 3 06/04/2021 13:00 1
#> 4 06/04/2021 13:30 2
#> 5 06/04/2021 13:30 2
#> 6 06/04/2021 13:30 2
#> 7 06/04/2021 14:00 3
#> 8 06/04/2021 14:00 3
#> 9 06/04/2021 14:00 3
#> 10 07/04/2021 14:00 4
#> 11 07/04/2021 14:00 4
#> 12 07/04/2021 14:00 4
Created on 2021-04-10 by the reprex package (v2.0.0)
You can group by data.table's function rleid (i.e., run length ID):
library(dplyr)
library(data.table)
df %>%
group_by(race_id = rleid(time))
# A tibble: 12 x 2
# Groups: race_id [4]
time race_id
<chr> <int>
1 06/04/2021 13:00 1
2 06/04/2021 13:00 1
3 06/04/2021 13:00 1
4 06/04/2021 13:30 2
5 06/04/2021 13:30 2
6 06/04/2021 13:30 2
7 06/04/2021 14:00 3
8 06/04/2021 14:00 3
9 06/04/2021 14:00 3
10 07/04/2021 14:00 4
11 07/04/2021 14:00 4
12 07/04/2021 14:00 4
Data, from @Peter:
df <- data.frame(time = rep(c("06/04/2021 13:00", "06/04/2021 13:30", "06/04/2021 14:00", "07/04/2021 14:00"), each = 3))
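For completeness, the same index can be sketched in base R, again assuming that rows sharing a time belong to one race and that the data arrive ordered by race:
# Number each distinct time by order of first appearance
df$race_id <- match(df$time, unique(df$time))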

Subset dataframe based on rule - sequence

Subset dataframe w/ sequence of observations
I am experimenting with a large dataset. I would like to subset this data frame in intervals of Monday through Friday. However, I see that some weeks have missing days (one week is missing Thursday).
If a sequence of days (Monday to Friday) is incomplete, I would like to exclude that sequence from my sample.
Would this be possible?
week.nr <- data$week.nr[1:20]
week.day<- data$week.day[1:20]
date <- data$specific.date[1:20]
price <- data$price[1:20]
data.frame(date, week.nr, week.day, price)
date week.nr week.day price
1 2019-01-28 05 Monday 62.6
2 2019-01-25 04 Friday 63.8
3 2019-01-24 04 Thursday 64.2
4 2019-01-23 04 Wednesday 64.0
5 2019-01-22 04 Tuesday 64.0
6 2019-01-21 04 Monday 63.4
7 2019-01-18 03 Friday 62.6
8 2019-01-17 03 Thursday 62.6
9 2019-01-16 03 Wednesday 64.0
10 2019-01-15 03 Tuesday 64.4
11 2019-01-14 03 Monday 65.2
12 2019-01-11 02 Friday 66.4
13 2019-01-10 02 Thursday 66.2
14 2019-01-09 02 Wednesday 68.2
15 2019-01-08 02 Tuesday 68.8
16 2019-01-07 02 Monday 67.8
17 2019-01-04 01 Friday 67.4
18 2019-01-03 01 Thursday 68.0
19 2019-01-02 01 Wednesday 69.6
20 2018-12-28 52 Friday 71.0
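One possible dplyr sketch, under the assumption that a complete week means all five weekday names are present for a given week.nr (if week numbers can repeat across years, group by the year as well):
library(dplyr)
df <- data.frame(date, week.nr, week.day, price)
# Keep only the weeks in which all five weekdays are present
df %>%
  group_by(week.nr) %>%
  filter(n_distinct(week.day) == 5) %>%
  ungroup()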

Combine data sets based on date comparisons within groups

I have two data sets, 'Df_A' and 'Df_B':
Df_A
Date Info A Info B
9/19/18 23:00 36 48
9/18/18 23:00 47 30
9/17/18 23:00 51 3
8/14/18 23:00 45 16
8/6/18 23:00 37 13
8/5/18 23:00 42 66
7/11/18 23:00 42 53
7/4/18 23:00 38 10
Df_B
Released Info Event Value X
9/6/2018 22:30 Event A 51.8
8/6/2018 22:30 Event A 52
7/5/2018 22:30 Event A 50.6
6/6/2018 22:30 Event A 54
9/2/2018 22:30 Event C 48
7/31/2018 22:30 Event C 45
9/4/2018 22:30 Event D 58.7
8/2/2018 22:30 Event D 56.2
7/3/2018 22:30 Event D 57.3
6/4/2018 22:30 Event D 51.1
5/2/2018 22:30 Event D 54.2
4/4/2018 22:30 Event D 59.8
9/3/2018 1:30 Event E 61.8
8/6/2018 1:30 Event E 63
7/2/2018 1:30 Event E 65.2
Both 'Date' and 'Released.Info' are factors.
I have a vector 'Events' which contains the Events in 'Df_B' that I need to parse, e.g.
Events <- c("Event A", "Event D")
For each 'Event' in 'Df_B', I would like to check if 'Date' in 'Df_A' is greater than 'Released Info' in 'Df_B'. If so, I want to add the corresponding values of 'Event A' and 'Event D' to 'Df_A'.
The desired output:
Date Info A Info B Event A Event D
9/19/18 23:00 36 48 51.8 58.7
9/18/18 23:00 47 30 51.8 58.7
9/17/18 23:00 51 3 51.8 58.7
8/14/18 23:00 45 16 52 56.2
8/6/18 23:00 37 13 52 56.2
8/5/18 23:00 42 66 50.6 56.2
7/11/18 23:00 42 53 50.6 57.3
7/4/18 23:00 38 10 54 57.3
For example, for 9/19/18 23:00, 9/18/18 23:00 and 9/17/18 23:00 in 'Df_A', the closest prior date in 'Df_B' for the group 'Event A' is 9/6/2018 22:30. Thus, for these rows we pick the value 51.8 from 'Df_B'. And so on for all dates in 'Df_A', and for both 'Event A' and 'Event D' in 'Df_B'.
I would like to add n new columns to 'Df_A'; in this example 'Event A' and 'Event D', but there could be more.
For this, I have been trying to create dynamic variables for the varying number of events with something like this (as the Events come from a CSV as a matrix):
#To Create a variable for each Event
ListEvents <- as.list(as.vector(Events))
names(ListEvents) <- paste("Variable", 1:length(ListEvents), sep = "")
list2env(ListEvents,envir = .GlobalEnv)
After creating a variable for each Event, I was thinking of writing a loop to create a subset for each event, compare the Date (Df_A) with the release date (Df_B), and add the value as a column in Df_A. But I know this is an unnecessarily complex and inefficient approach. Could someone help me?
The following reproduces your expected output:
events <- c("Event A", "Event D")
library(tidyverse)
library(lubridate)
map(events, ~Df_A %>%
      mutate(Event := .x) %>%
      left_join(Df_B) %>%
      mutate(
        Date = mdy_hm(Date),
        Released.Info = mdy_hm(Released.Info)) %>%
      group_by(Date) %>%
      mutate(diff = difftime(Released.Info, Date, units = "days")) %>%
      filter(diff < 0) %>%
      filter(diff == max(diff)) %>%
      select(-Released.Info, -diff) %>%
      spread(Event, Value.X)) %>%
  reduce(left_join) %>%
  arrange(desc(Date))
## A tibble: 8 x 5
## Groups: Date [8]
# Date Info.A Info.B `Event A` `Event D`
# <dttm> <int> <int> <dbl> <dbl>
#1 2018-09-19 23:00:00 36 48 51.8 58.7
#2 2018-09-18 23:00:00 47 30 51.8 58.7
#3 2018-09-17 23:00:00 51 3 51.8 58.7
#4 2018-08-14 23:00:00 45 16 52 56.2
#5 2018-08-06 23:00:00 37 13 52 56.2
#6 2018-08-05 23:00:00 42 66 50.6 56.2
#7 2018-07-11 23:00:00 42 53 50.6 57.3
#8 2018-07-04 23:00:00 38 10 54 57.3
The idea is to add an Event column to Df_A with the entries given in the vector events; we then do a left join of Df_A and Df_B and keep only those rows with the shortest negative time difference between Released.Info and Date (that's the filter(diff < 0) and filter(diff == max(diff)) part). The rest is reshaping and rearranging to reproduce your expected output.
Sample data
Df_A <-read.table(text =
" Date 'Info A' 'Info B'
'9/19/18 23:00' 36 48
'9/18/18 23:00' 47 30
'9/17/18 23:00' 51 3
'8/14/18 23:00' 45 16
'8/6/18 23:00' 37 13
'8/5/18 23:00' 42 66
'7/11/18 23:00' 42 53
'7/4/18 23:00' 38 10", header = T)
Df_B <- read.table(text =
"'Released Info' Event 'Value X'
'9/6/2018 22:30' 'Event A' 51.8
'8/6/2018 22:30' 'Event A' 52
'7/5/2018 22:30' 'Event A' 50.6
'6/6/2018 22:30' 'Event A' 54
'9/2/2018 22:30' 'Event C' 48
'7/31/2018 22:30' 'Event C' 45
'9/4/2018 22:30' 'Event D' 58.7
'8/2/2018 22:30' 'Event D' 56.2
'7/3/2018 22:30' 'Event D' 57.3
'6/4/2018 22:30' 'Event D' 51.1
'5/2/2018 22:30' 'Event D' 54.2
'4/4/2018 22:30' 'Event D' 59.8
'9/3/2018 1:30' 'Event E' 61.8
'8/6/2018 1:30' 'Event E' 63
'7/2/2018 1:30' 'Event E' 65.2", header = T)
This can be done with a rolling join by group in data.table.
library(data.table)
# convert data to data.table
setDT(Df_A)
setDT(Df_B)
# convert times to POSIXct
Df_A[ , Date := as.POSIXct(Date, format = "%m/%d/%y %H:%M")]
Df_B[ , Released.Info := as.POSIXct(Released.Info, format = "%m/%d/%Y %H:%M")]
# select rows
db <- Df_B[Event %in% Events]
# rolling join: for each Event in db, join to Df_A by the nearest preceding time
d2 <- db[ , .SD[Df_A, on = c(Released.Info = "Date"), roll = Inf], by = Event]
# Event Released.Info Value.X Info.A Info.B
# 1: Event A 2018-09-19 23:00:00 51.8 36 48
# 2: Event A 2018-09-18 23:00:00 51.8 47 30
# [snip]
# 7: Event A 2018-07-11 23:00:00 50.6 42 53
# 8: Event A 2018-07-04 23:00:00 54.0 38 10
# 9: Event D 2018-09-19 23:00:00 58.7 36 48
# 10: Event D 2018-09-18 23:00:00 58.7 47 30
# [snip]
# 15: Event D 2018-07-11 23:00:00 57.3 42 53
# 16: Event D 2018-07-04 23:00:00 57.3 38 10
That's basically it. If desired, cast the 'Event' column to wide and join to 'Df_A':
dcast(d2[ , .(Event, Released.Info, Value.X)],
Released.Info ~ Event, value.var = "Value.X")[
Df_A, on = c(Released.Info = "Date")]
# Released.Info Event A Event D Info.A Info.B
# 1: 2018-09-19 23:00:00 51.8 58.7 36 48
# 2: 2018-09-18 23:00:00 51.8 58.7 47 30
# 3: 2018-09-17 23:00:00 51.8 58.7 51 3
# 4: 2018-08-14 23:00:00 52.0 56.2 45 16
# 5: 2018-08-06 23:00:00 52.0 56.2 37 13
# 6: 2018-08-05 23:00:00 50.6 56.2 42 66
# 7: 2018-07-11 23:00:00 50.6 57.3 42 53
# 8: 2018-07-04 23:00:00 54.0 57.3 38 10
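For what it's worth, roll = Inf rolls each Df_A date back to the most recent earlier Released.Info with no limit on how far back it may reach. If you only wanted matches within a window, data.table also accepts a numeric roll, measured in seconds for POSIXct keys; a hypothetical variant:
# Sketch: only roll up to 30 days back; Df_A rows with no release in that
# window get NA instead of a stale value
d2 <- db[ , .SD[Df_A, on = c(Released.Info = "Date"), roll = 30*24*60*60], by = Event]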

R Start Column & End Column Time Log

I have a data table with 3 columns (Start, Stop, & Type). Some of the original datetimes hand off from Stop to Start smoothly, but others have gaps. I want to create new rows with a Start datetime, an End datetime, and Type = 0 that fill the gaps where needed. Below is some sample data...
What I have...
LOG_START_DT LOG_END_DT Type
3/28/2018 9:30 3/28/2018 12:15 2
3/28/2018 13:30 3/28/2018 16:30 1
3/28/2018 17:15 3/28/2018 20:00 2
3/28/2018 21:15 3/29/2018 0:00 2
3/29/2018 0:00 3/29/2018 0:30 2
3/29/2018 1:30 3/29/2018 5:00 1
What I want...
LOG_START_DT LOG_END_DT Type
3/28/2018 9:30 3/28/2018 12:15 2
3/28/2018 12:16 3/28/2018 13:29 0
3/28/2018 13:30 3/28/2018 16:30 1
3/28/2018 16:31 3/28/2018 17:14 0
3/28/2018 17:15 3/28/2018 20:00 2
3/28/2018 20:01 3/28/2018 21:14 0
3/28/2018 21:15 3/29/2018 0:00 2
3/29/2018 0:00 3/29/2018 0:30 2
3/29/2018 0:31 3/29/2018 1:29 0
3/29/2018 1:30 3/29/2018 5:00 1
Also, it's important to note that whatever rows are added must not have times that overlap with the previous end or the next start datetime. My original data is about 500 rows, and I've tried combinations of for loops and if statements, but I either can't figure it out or it takes far too long to run through the data.
Thank you!
Let's get the data and convert to datetimes.
library(tidyverse)
library(lubridate)
foo <- read_table("LOG_START_DT LOG_END_DT Type
3/28/2018 9:30 3/28/2018 12:15 2
3/28/2018 13:30 3/28/2018 16:30 1
3/28/2018 17:15 3/28/2018 20:00 2
3/28/2018 21:15 3/29/2018 0:00 2
3/29/2018 0:00 3/29/2018 0:30 2
3/29/2018 1:30 3/29/2018 5:00 1")
foo <- foo %>%
mutate(LOG_START_DT = mdy_hm(LOG_START_DT), LOG_END_DT = mdy_hm(LOG_END_DT))
Let's make an auxiliary data frame with the ends as starts and starts as ends, all with Type of 0.
bar <- tibble(LOG_START_DT = foo$LOG_END_DT[-nrow(foo)],
              LOG_END_DT = foo$LOG_START_DT[-1],
              Type = 0L)
bar
#> # A tibble: 5 x 3
#> LOG_START_DT LOG_END_DT Type
#> <dttm> <dttm> <int>
#> 1 2018-03-28 12:15:00 2018-03-28 13:30:00 0
#> 2 2018-03-28 16:30:00 2018-03-28 17:15:00 0
#> 3 2018-03-28 20:00:00 2018-03-28 21:15:00 0
#> 4 2018-03-29 00:00:00 2018-03-29 00:00:00 0
#> 5 2018-03-29 00:30:00 2018-03-29 01:30:00 0
Then get rid of any rows that result from a "smooth hand-off" (which you don't define precisely, so I've taken it to mean "the next start is the same as the previous end"). After that (and this doesn't seem like a good idea, but it gives you what you want), add a minute to the start column and subtract a minute from the end column.
bar <- bar %>%
filter(LOG_START_DT != LOG_END_DT) %>%
mutate(LOG_START_DT = LOG_START_DT + minutes(1),
LOG_END_DT = LOG_END_DT - minutes(1))
I don't think the adjustment is a good idea, because it breaks if the original start and end happen to be only one minute (or less) apart. But that's up to you.
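If that edge case matters, one possible guard (my assumption, not part of the original approach) is to keep only gaps longer than two minutes before shifting, instead of the filter() above:
bar <- bar %>%
  filter(difftime(LOG_END_DT, LOG_START_DT, units = "mins") > 2) %>%
  mutate(LOG_START_DT = LOG_START_DT + minutes(1),
         LOG_END_DT = LOG_END_DT - minutes(1))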
Then just bind the two data frames together and sort it.
baz <- rbind(foo, bar) %>%
arrange(LOG_START_DT)
baz
#> # A tibble: 10 x 3
#> LOG_START_DT LOG_END_DT Type
#> <dttm> <dttm> <int>
#> 1 2018-03-28 09:30:00 2018-03-28 12:15:00 2
#> 2 2018-03-28 12:16:00 2018-03-28 13:29:00 0
#> 3 2018-03-28 13:30:00 2018-03-28 16:30:00 1
#> 4 2018-03-28 16:31:00 2018-03-28 17:14:00 0
#> 5 2018-03-28 17:15:00 2018-03-28 20:00:00 2
#> 6 2018-03-28 20:01:00 2018-03-28 21:14:00 0
#> 7 2018-03-28 21:15:00 2018-03-29 00:00:00 2
#> 8 2018-03-29 00:00:00 2018-03-29 00:30:00 2
#> 9 2018-03-29 00:31:00 2018-03-29 01:29:00 0
#> 10 2018-03-29 01:30:00 2018-03-29 05:00:00 1
And I suppose if you really wanted that awful date format back you could do this:
baz_FUGLY <- baz %>%
mutate_if(is.POSIXct, format, "%m/%d/%Y %H:%M")
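mutate_if() still works but is superseded in current dplyr; an across() sketch of the same step:
baz_FUGLY <- baz %>%
  mutate(across(where(is.POSIXct), ~ format(.x, "%m/%d/%Y %H:%M")))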
