I have a data table with 3 columns (Start, Stop, and Type). Some of the original datetimes hand off from Stop to Start smoothly, but others have gaps. I want to create new rows with a Start datetime, an End datetime, and Type = 0 to fill the gaps where needed. Below is some sample data...
What I have...
LOG_START_DT LOG_END_DT Type
3/28/2018 9:30 3/28/2018 12:15 2
3/28/2018 13:30 3/28/2018 16:30 1
3/28/2018 17:15 3/28/2018 20:00 2
3/28/2018 21:15 3/29/2018 0:00 2
3/29/2018 0:00 3/29/2018 0:30 2
3/29/2018 1:30 3/29/2018 5:00 1
What I want...
LOG_START_DT LOG_END_DT Type
3/28/2018 9:30 3/28/2018 12:15 2
3/28/2018 12:16 3/28/2018 13:29 0
3/28/2018 13:30 3/28/2018 16:30 1
3/28/2018 16:31 3/28/2018 17:14 0
3/28/2018 17:15 3/28/2018 20:00 2
3/28/2018 20:01 3/28/2018 21:14 0
3/28/2018 21:15 3/29/2018 0:00 2
3/29/2018 0:00 3/29/2018 0:30 2
3/29/2018 0:31 3/29/2018 1:29 0
3/29/2018 1:30 3/29/2018 5:00 1
Also, it's important that whatever rows are added do not overlap with the previous end or the next start datetime. My original data is about 500 rows; I've tried combinations of for loops and if statements, but I either can't figure it out or it takes far too long to run through the data.
Thank you!
Let's get the data and convert to datetimes.
library(tidyverse)
library(lubridate)
foo <- read_table("LOG_START_DT LOG_END_DT Type
3/28/2018 9:30 3/28/2018 12:15 2
3/28/2018 13:30 3/28/2018 16:30 1
3/28/2018 17:15 3/28/2018 20:00 2
3/28/2018 21:15 3/29/2018 0:00 2
3/29/2018 0:00 3/29/2018 0:30 2
3/29/2018 1:30 3/29/2018 5:00 1")
foo <- foo %>%
mutate(LOG_START_DT = mdy_hm(LOG_START_DT), LOG_END_DT = mdy_hm(LOG_END_DT))
Let's make an auxiliary data frame with the ends as starts and the starts as ends, all with a Type of 0.
bar <- tibble(LOG_START_DT = foo$LOG_END_DT[-nrow(foo)],
LOG_END_DT = foo$LOG_START_DT[-1],
Type = 0L)
bar
#> # A tibble: 5 x 3
#> LOG_START_DT LOG_END_DT Type
#> <dttm> <dttm> <int>
#> 1 2018-03-28 12:15:00 2018-03-28 13:30:00 0
#> 2 2018-03-28 16:30:00 2018-03-28 17:15:00 0
#> 3 2018-03-28 20:00:00 2018-03-28 21:15:00 0
#> 4 2018-03-29 00:00:00 2018-03-29 00:00:00 0
#> 5 2018-03-29 00:30:00 2018-03-29 01:30:00 0
Then get rid of any rows that result from a "smooth hand-off" (which you don't define precisely, so I've taken it to mean "the next start is the same as the previous end"). After that (and this doesn't seem like a good idea, but it gives you what you want), add a minute to the start column and subtract a minute from the end column.
bar <- bar %>%
filter(LOG_START_DT != LOG_END_DT) %>%
mutate(LOG_START_DT = LOG_START_DT + minutes(1),
LOG_END_DT = LOG_END_DT - minutes(1))
I don't think the adjustment is a good idea, because it breaks if an original start and end happen to be only one minute (or less) apart. But that's up to you.
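If you do keep the adjustment, one possible guard (just a sketch, assuming gaps of two minutes or less should simply be dropped rather than padded) is to filter on the gap length before shifting:
bar <- bar %>%
  filter(difftime(LOG_END_DT, LOG_START_DT, units = "mins") > 2) %>%  # assumption: slivers aren't worth padding
  mutate(LOG_START_DT = LOG_START_DT + minutes(1),
         LOG_END_DT = LOG_END_DT - minutes(1))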
Then just bind the two data frames together and sort it.
baz <- rbind(foo, bar) %>%
arrange(LOG_START_DT)
baz
#> # A tibble: 10 x 3
#> LOG_START_DT LOG_END_DT Type
#> <dttm> <dttm> <int>
#> 1 2018-03-28 09:30:00 2018-03-28 12:15:00 2
#> 2 2018-03-28 12:16:00 2018-03-28 13:29:00 0
#> 3 2018-03-28 13:30:00 2018-03-28 16:30:00 1
#> 4 2018-03-28 16:31:00 2018-03-28 17:14:00 0
#> 5 2018-03-28 17:15:00 2018-03-28 20:00:00 2
#> 6 2018-03-28 20:01:00 2018-03-28 21:14:00 0
#> 7 2018-03-28 21:15:00 2018-03-29 00:00:00 2
#> 8 2018-03-29 00:00:00 2018-03-29 00:30:00 2
#> 9 2018-03-29 00:31:00 2018-03-29 01:29:00 0
#> 10 2018-03-29 01:30:00 2018-03-29 05:00:00 1
And I suppose if you really wanted that awful date format back you could do this:
baz_FUGLY <- baz %>%
mutate_if(is.POSIXct, format, "%m/%d/%Y %H:%M")
When I try using as.POSIXlt or strptime I keep getting a single value of 'NA' as a result.
What I need to do is transform 3- and 4-digit numbers, e.g. 2300 or 115, to 23:00 or 01:15 respectively, but I simply cannot get any code to work.
Basically, this data frame of outputs:
Time
1 2345
2 2300
3 2130
4 2400
5 115
6 2330
7 100
8 2300
9 1530
10 130
11 100
12 215
13 2245
14 145
15 2330
16 2400
17 2300
18 2230
19 2130
20 30
should look like this:
Time
1 23:45
2 23:00
3 21:30
4 24:00
5 01:15
6 23:30
7 01:00
8 23:00
9 15:30
10 01:30
11 01:00
12 02:15
13 22:45
14 01:45
15 23:30
16 24:00
17 23:00
18 22:30
19 21:30
20 00:30
I think you can use the following solution. Note, however, that it actually produces a character vector:
gsub("(\\d{2})(\\d{2})", "\\1:\\2", sprintf("%04d", df$Time)) |>
as.data.frame() |>
setNames("Time") |>
head()
Time
1 23:45
2 23:00
3 21:30
4 24:00
5 01:15
6 23:30
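If you need an actual time-like class rather than strings, one option (a sketch, assuming lubridate is acceptable and that a value such as "24:00" should become a 24-hour period) is to parse the padded strings with hm():
library(lubridate)
# hm() turns "HH:MM" strings into Period objects and tolerates hours >= 24
hm(gsub("(\\d{2})(\\d{2})", "\\1:\\2", sprintf("%04d", df$Time)))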
I have a dataset as below:
structure(AI_decs)
Horse Time RaceID dyLTO Value.LTO Draw.IV
1 Warne's Army 06/04/2021 13:00 1 56 3429 0.88
2 G For Gabrial 06/04/2021 13:00 1 57 3299 1.15
3 First Charge 06/04/2021 13:00 1 66 3429 1.06
4 Dream With Me 06/04/2021 13:00 1 62 2862 0.97
5 Qawamees 06/04/2021 13:00 1 61 4690 0.97
6 Glan Y Gors 06/04/2021 13:00 1 59 3429 1.50
7 The Dancing Poet 06/04/2021 13:00 1 42 4690 1.41
8 Finoah 06/04/2021 13:00 1 59 10260 0.97
9 Ravenscar 06/04/2021 13:30 2 58 5208 0.65
10 Arabescato 06/04/2021 13:30 2 57 2862 1.09
11 Thai Terrier 06/04/2021 13:30 2 58 7439 1.30
12 The Rutland Rebel 06/04/2021 13:30 2 55 3429 2.17
13 Red Tornado 06/04/2021 13:30 2 49 3340 0.43
14 Alfredo 06/04/2021 13:30 2 54 5208 1.30
15 Tynecastle Park 06/04/2021 13:30 2 72 7439 0.87
16 Waldkonig 06/04/2021 14:00 3 55 3493 1.35
17 Kaleidoscopic 06/04/2021 14:00 3 68 7439 1.64
18 Louganini 06/04/2021 14:00 3 75 56025 1.26
I have a list of columns with performance data values for horses in a race.
My dataset has many more rows and it contains a number of horse races on a given day.
Each horse race has a unique time and a different number of horses in each race.
Basically, I want to assign a raceId (index number) to each individual race.
I am currently having to do this in Excel (see column RaceID) by comparing the Time column and adding 1 to the RaceID value every time we encounter a new race. This has to be done manually each day before I import the data into R.
I hope there is a way to do this in R with dplyr.
I thought if I used group_by(Time) there might be a function, a bit like n() or row_number(), that would index the races for me.
Perhaps using case_when() and lag()/lead().
Thanks in advance for any help.
Graham
Try this:
Note: group_indices() was deprecated in dplyr 1.0.0; cur_group_id() is its replacement and is used below.
library(dplyr)
df <- data.frame(time = rep(c("06/04/2021 13:00", "06/04/2021 13:30", "06/04/2021 14:00", "07/04/2021 14:00"), each = 3))
df %>%
group_by(time) %>%
mutate(race_id = cur_group_id())
#> # A tibble: 12 x 2
#> # Groups: time [4]
#> time race_id
#> <chr> <int>
#> 1 06/04/2021 13:00 1
#> 2 06/04/2021 13:00 1
#> 3 06/04/2021 13:00 1
#> 4 06/04/2021 13:30 2
#> 5 06/04/2021 13:30 2
#> 6 06/04/2021 13:30 2
#> 7 06/04/2021 14:00 3
#> 8 06/04/2021 14:00 3
#> 9 06/04/2021 14:00 3
#> 10 07/04/2021 14:00 4
#> 11 07/04/2021 14:00 4
#> 12 07/04/2021 14:00 4
Created on 2021-04-10 by the reprex package (v2.0.0)
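Note that cur_group_id() numbers groups in sorted order, which matches the order of appearance here only because the times are already sorted. If you want IDs strictly in order of first appearance without grouping, a base-R sketch:
df %>%
  mutate(race_id = match(time, unique(time)))  # index of each time's first appearance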
You can group by data.table's function rleid (i.e., run length ID):
library(dplyr)
library(data.table)
df %>%
group_by(race_id = rleid(time))
# A tibble: 12 x 2
# Groups: race_id [4]
time race_id
<chr> <int>
1 06/04/2021 13:00 1
2 06/04/2021 13:00 1
3 06/04/2021 13:00 1
4 06/04/2021 13:30 2
5 06/04/2021 13:30 2
6 06/04/2021 13:30 2
7 06/04/2021 14:00 3
8 06/04/2021 14:00 3
9 06/04/2021 14:00 3
10 07/04/2021 14:00 4
11 07/04/2021 14:00 4
12 07/04/2021 14:00 4
Data, from @Peter:
df <- data.frame(time = rep(c("06/04/2021 13:00", "06/04/2021 13:30", "06/04/2021 14:00", "07/04/2021 14:00"), each = 3))
I have a series of observations of birds at different locations and times. The data frame looks like this:
birdID site ts
1 A 2013-04-15 09:29
1 A 2013-04-19 01:22
1 A 2013-04-20 23:13
1 A 2013-04-22 00:03
1 B 2013-04-22 14:02
1 B 2013-04-22 17:02
1 C 2013-04-22 14:04
1 C 2013-04-22 15:18
1 C 2013-04-23 00:54
1 A 2013-04-23 01:20
1 A 2013-04-24 23:07
1 A 2013-04-30 23:47
1 B 2013-04-30 03:51
1 B 2013-04-30 04:26
2 C 2013-04-30 04:29
2 C 2013-04-30 18:49
2 A 2013-05-01 01:03
2 A 2013-05-01 23:15
2 A 2013-05-02 00:09
2 C 2013-05-03 07:57
2 C 2013-05-04 07:21
2 C 2013-05-05 02:54
2 A 2013-05-05 03:27
2 A 2013-05-14 00:16
2 D 2013-05-14 10:00
2 D 2013-05-14 15:00
I would like to summarize the data in a way that shows the first and last detection of each bird at each site, and the duration at each site, while preserving information about multiple visits to sites (i.e. if a bird went from site A > B > C > A > B, I would like to show each visit to sites A and B independently, not lump both visits together).
I am hoping to produce output like this, where the start (min_ts), end (max_ts), and duration (days) of each visit are preserved:
birdID site min_ts max_ts days
1 A 2013-04-15 09:29 2013-04-22 00:03 6.6
1 B 2013-04-22 14:02 2013-04-22 17:02 0.1
1 C 2013-04-22 14:04 2013-04-23 00:54 0.5
1 A 2013-04-23 01:20 2013-04-30 23:47 7.9
1 B 2013-04-30 03:51 2013-04-30 04:26 0.02
2 C 2013-04-30 4:29 2013-04-30 18:49 0.6
2 A 2013-05-01 01:03 2013-05-02 00:09 0.96
2 C 2013-05-03 07:57 2013-05-05 02:54 1.8
2 A 2013-05-05 03:27 2013-05-14 00:16 8.8
2 D 2013-05-14 10:00 2013-05-14 15:00 0.2
I have tried this code, which yields the correct variables but lumps all the information about a single site together, not preserving multiple visits:
df <- df %>%
group_by(birdID, site) %>%
summarise(min_ts = min(ts),
max_ts = max(ts),
days = difftime(max_ts, min_ts, units = "days")) %>%
arrange(birdID, min_ts)
birdID site min_ts max_ts days
1 A 2013-04-15 09:29 2013-04-30 23:47 15.6
1 B 2013-04-22 14:02 2013-04-30 4:26 7.6
1 C 2013-04-22 14:04 2013-04-23 0:54 0.5
2 C 2013-04-30 04:29 2013-05-05 2:54 4.9
2 A 2013-05-01 01:03 2013-05-14 0:16 12.9
2 D 2013-05-14 10:00 2013-05-14 15:00 0.2
I realize grouping by site is a problem, but if I remove that as a grouping variable the data are summarised without site info. I have tried this. It doesn't run, but I feel it's close to the solution:
df <- df %>%
group_by(birdID) %>%
summarize(min_ts = if_else((birdID == lag(birdID) & site != lag(site)), min(ts), NA_real_),
max_ts = if_else((birdID == lag(birdID) & site != lag(site)), max(ts), NA_real_),
min_d = min(yday(ts)),
max_d = max(yday(ts)),
days = max_d - min_d)
One possibility could be:
df %>%
group_by(birdID, site, rleid = with(rle(site), rep(seq_along(lengths), lengths))) %>%
summarise(min_ts = min(ts),
max_ts = max(ts),
days = difftime(max_ts, min_ts, units = "days")) %>%
ungroup() %>%
select(-rleid) %>%
arrange(birdID, min_ts)
birdID site min_ts max_ts days
<int> <chr> <dttm> <dttm> <drtn>
1 1 A 2013-04-15 09:29:00 2013-04-22 00:03:00 6.60694444 days
2 1 B 2013-04-22 14:02:00 2013-04-22 17:02:00 0.12500000 days
3 1 C 2013-04-22 14:04:00 2013-04-23 00:54:00 0.45138889 days
4 1 A 2013-04-23 01:20:00 2013-04-30 23:47:00 7.93541667 days
5 1 B 2013-04-30 03:51:00 2013-04-30 04:26:00 0.02430556 days
6 2 C 2013-04-30 04:29:00 2013-04-30 18:49:00 0.59722222 days
7 2 A 2013-05-01 01:03:00 2013-05-02 00:09:00 0.96250000 days
8 2 C 2013-05-03 07:57:00 2013-05-05 02:54:00 1.78958333 days
9 2 A 2013-05-05 03:27:00 2013-05-14 00:16:00 8.86736111 days
10 2 D 2013-05-14 10:00:00 2013-05-14 15:00:00 0.20833333 days
Here it creates a rleid()-like grouping variable and then calculates the difference.
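To see what that idiom does, here it is applied to the first bird's site sequence from the sample data:
site <- c("A", "A", "A", "A", "B", "B", "C", "C", "C", "A", "A", "A", "B", "B")
with(rle(site), rep(seq_along(lengths), lengths))
#> [1] 1 1 1 1 2 2 3 3 3 4 4 4 5 5
Each run of consecutive identical sites gets its own ID, so repeat visits to a site stay separate.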
Or the same using rleid() from data.table explicitly:
df %>%
group_by(birdID, site, rleid = rleid(site)) %>%
summarise(min_ts = min(ts),
max_ts = max(ts),
days = difftime(max_ts, min_ts, units = "days")) %>%
ungroup() %>%
select(-rleid) %>%
arrange(birdID, min_ts)
Another alternative is to use lag and cumsum to create a grouping variable.
library(dplyr)
df %>%
group_by(birdID, group = cumsum(site != lag(site, default = first(site)))) %>%
summarise(min_ts = min(ts),
max_ts = max(ts),
days = difftime(max_ts, min_ts, units = "days")) %>%
ungroup() %>%
select(-group)
# A tibble: 10 x 4
# birdID min_ts max_ts days
# <int> <dttm> <dttm> <drtn>
# 1 1 2013-04-15 09:29:00 2013-04-22 00:03:00 6.60694444 days
# 2 1 2013-04-22 14:02:00 2013-04-22 17:02:00 0.12500000 days
# 3 1 2013-04-22 14:04:00 2013-04-23 00:54:00 0.45138889 days
# 4 1 2013-04-23 01:20:00 2013-04-30 23:47:00 7.93541667 days
# 5 1 2013-04-30 03:51:00 2013-04-30 04:26:00 0.02430556 days
# 6 2 2013-04-30 04:29:00 2013-04-30 18:49:00 0.59722222 days
# 7 2 2013-05-01 01:03:00 2013-05-02 00:09:00 0.96250000 days
# 8 2 2013-05-03 07:57:00 2013-05-05 02:54:00 1.78958333 days
# 9 2 2013-05-05 03:27:00 2013-05-14 00:16:00 8.86736111 days
#10 2 2013-05-14 10:00:00 2013-05-14 15:00:00 0.20833333 days
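The days column above is a difftime (drtn). If you want the plain numbers shown in the desired output (an assumption that a bare numeric is acceptable downstream), append a coercion to either pipeline:
... %>% mutate(days = as.numeric(days))  # drops the difftime class, keeps the value in days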
I have a data frame df1 with a datetime column in UTC. I need to merge this data frame with the data frame df2 by the column datetime. My problem is that df2 is in the Europe/Paris time zone, and when I transform df2$datetime from Europe/Paris to UTC I lose or duplicate data at the moments of the clock change between either summer/winter or winter/summer. As an example:
df1<- data.frame(datetime=c("2016-10-29 22:00:00","2016-10-29 23:00:00","2016-10-30 00:00:00","2016-10-30 01:00:00","2016-10-30 02:00:00","2016-10-30 03:00:00","2016-10-30 04:00:00","2016-10-30 05:00:00","2017-03-25 22:00:00","2017-03-25 23:00:00","2017-03-26 00:00:00","2017-03-26 01:00:00","2017-03-26 02:00:00","2017-03-26 03:00:00","2017-03-26 04:00:00"), Var1= c(4, 56, 76, 54, 34, 3, 4, 6, 78, 23, 12, 3, 5, 6, 7))
df1$datetime<- as.POSIXct(df1$datetime, format = "%Y-%m-%d %H", tz= "UTC")
df2<- data.frame(datetime=c("2016-10-29 22:00:00","2016-10-29 23:00:00","2016-10-30 00:00:00","2016-10-30 01:00:00","2016-10-30 02:00:00","2016-10-30 03:00:00","2016-10-30 04:00:00","2016-10-30 05:00:00","2017-03-25 22:00:00","2017-03-25 23:00:00","2017-03-26 00:00:00","2017-03-26 01:00:00","2017-03-26 02:00:00","2017-03-26 03:00:00","2017-03-26 04:00:00"), Var2=c(56, 43, 23, 14, 51, 27, 89, 76, 56, 4, 35, 23, 4, 62, 84))
df2$datetime<- as.POSIXct(df2$datetime, format = "%Y-%m-%d %H", tz= "Europe/Paris")
df1
datetime Var1
1 2016-10-29 22:00:00 4
2 2016-10-29 23:00:00 56
3 2016-10-30 00:00:00 76
4 2016-10-30 01:00:00 54
5 2016-10-30 02:00:00 34
6 2016-10-30 03:00:00 3
7 2016-10-30 04:00:00 4
8 2016-10-30 05:00:00 6
9 2017-03-25 22:00:00 78
10 2017-03-25 23:00:00 23
11 2017-03-26 00:00:00 12
12 2017-03-26 01:00:00 3
13 2017-03-26 02:00:00 5
14 2017-03-26 03:00:00 6
15 2017-03-26 04:00:00 7
df2
datetime Var2
1 2016-10-29 22:00:00 56
2 2016-10-29 23:00:00 43
3 2016-10-30 00:00:00 23
4 2016-10-30 01:00:00 14
5 2016-10-30 02:00:00 51
6 2016-10-30 03:00:00 27
7 2016-10-30 04:00:00 89
8 2016-10-30 05:00:00 76
9 2017-03-25 22:00:00 56
10 2017-03-25 23:00:00 4
11 2017-03-26 00:00:00 35
12 2017-03-26 01:00:00 23
13 2017-03-26 02:00:00 4
14 2017-03-26 03:00:00 62
15 2017-03-26 04:00:00 84
When I change df2$datetime format from Europe/Paris to UTC, this happens:
library(lubridate)
df2$datetime<-with_tz(df2$datetime,"UTC")
df2
datetime Var2
1 2016-10-29 20:00:00 56
2 2016-10-29 21:00:00 43
3 2016-10-29 22:00:00 23
4 2016-10-29 23:00:00 14
5 2016-10-30 00:00:00 51
6 2016-10-30 02:00:00 27 # Data at 01:00:00 is missing
7 2016-10-30 03:00:00 89
8 2016-10-30 04:00:00 76
9 2017-03-25 21:00:00 56
10 2017-03-25 22:00:00 4
11 2017-03-25 23:00:00 35
12 2017-03-26 00:00:00 23
13 2017-03-26 00:00:00 4 # There is a duplicate at 00:00:00
14 2017-03-26 01:00:00 62
15 2017-03-26 02:00:00 84
16 2017-03-26 03:00:00 56
Is there another way to transform df2$datetime from Europe/Paris to UTC that lets me merge the two data frames without this problem of losing or duplicating data? I don't understand why I have to lose or duplicate info in df2.
Was the transformation I did on df2$datetime right for merging this data frame with df1? What I've done so far to solve this is to add a new row in df2 at 2016-10-30 01:00:00 that is the mean of 2016-10-30 00:00:00 and 2016-10-30 02:00:00, and to remove one of the rows on 2017-03-26 at 00:00:00.
Thanks for your help.
I found out that my original df2 should be like this:
df2
datetime Var1
1 2016-10-29 22:00:00 4 # This is time in format "GMT+2". It corresponds to 20:00 UTC
2 2016-10-29 23:00:00 56 # This is time in format "GMT+2". It corresponds to 21:00 UTC
3 2016-10-30 00:00:00 76 # This is time in format "GMT+2". It corresponds to 22:00 UTC
4 2016-10-30 01:00:00 54 # This is time in format "GMT+2". It corresponds to 23:00 UTC
5 2016-10-30 02:00:00 34 # This is time in format "GMT+2". It corresponds to 00:00 UTC
6 2016-10-30 02:00:00 3 # This is time in format "GMT+1". It corresponds to 01:00 UTC
7 2016-10-30 03:00:00 4 # This is time in format "GMT+1". It corresponds to 02:00 UTC
8 2016-10-30 04:00:00 6 # This is time in format "GMT+1". It corresponds to 03:00 UTC
9 2016-10-30 05:00:00 78 # This is time in format "GMT+1". It corresponds to 04:00 UTC
10 2017-03-25 22:00:00 23 # This is time in format "GMT+1". It corresponds to 21:00 UTC
11 2017-03-25 23:00:00 12 # This is time in format "GMT+1". It corresponds to 22:00 UTC
12 2017-03-26 00:00:00 3 # This is time in format "GMT+1". It corresponds to 23:00 UTC
13 2017-03-26 01:00:00 5 # This is time in format "GMT+1". It corresponds to 00:00 UTC
14 2017-03-26 03:00:00 6 # This is time in format "GMT+2". It corresponds to 01:00 UTC
15 2017-03-26 04:00:00 7 # This is time in format "GMT+2". It corresponds to 02:00 UTC
16 2017-03-26 05:00:00 76 # This is time in format "GMT+2". It corresponds to 03:00 UTC
However, my original df2 doesn't have duplicated or lost time data. It is like this:
df2
datetime Var1
1 2016-10-29 22:00:00 4
2 2016-10-29 23:00:00 56
3 2016-10-30 00:00:00 76
4 2016-10-30 01:00:00 54
5 2016-10-30 02:00:00 34
6 2016-10-30 03:00:00 3
7 2016-10-30 04:00:00 4
8 2016-10-30 05:00:00 6
9 2017-03-25 22:00:00 78
10 2017-03-25 23:00:00 23
11 2017-03-26 00:00:00 12
12 2017-03-26 01:00:00 3
13 2017-03-26 02:00:00 5
14 2017-03-26 03:00:00 6
15 2017-03-26 04:00:00 7
16 2017-03-26 05:00:00 76
When I applied the R code df2$datetime<-with_tz(df2$datetime,"UTC"), this happens:
df2
datetime Var1
1 2016-10-29 20:00:00 4
2 2016-10-29 21:00:00 56
3 2016-10-29 22:00:00 76
4 2016-10-29 23:00:00 54
5 2016-10-30 00:00:00 34
6 2016-10-30 02:00:00 3 # I have to manually add a new row between the times "00:00" and "02:00"
7 2016-10-30 03:00:00 4
8 2016-10-30 04:00:00 6
9 2017-03-25 21:00:00 78
10 2017-03-25 22:00:00 23
11 2017-03-25 23:00:00 12
12 2017-03-26 00:00:00 3
13 2017-03-26 01:00:00 5 # I have to manually remove one of the rows referring to the time "01:00".
14 2017-03-26 01:00:00 6
15 2017-03-26 02:00:00 7
16 2017-03-26 03:00:00 76
If my original df2 had one duplicate at "02:00:00" on 30th October and a gap on 26th March between "01:00" and "03:00", then with the R code df2$datetime<-with_tz(df2$datetime,"UTC") I would get this:
df2
datetime Var1
1 2016-10-29 20:00:00 4
2 2016-10-29 21:00:00 56
3 2016-10-29 22:00:00 76
4 2016-10-29 23:00:00 54
5 2016-10-30 00:00:00 34
6 2016-10-30 00:00:00 3 # I just have to change this "00:00" to "01:00"
7 2016-10-30 02:00:00 4
8 2016-10-30 03:00:00 6
9 2016-10-30 04:00:00 78
10 2017-03-25 21:00:00 23
11 2017-03-25 22:00:00 12
12 2017-03-25 23:00:00 3
13 2017-03-26 00:00:00 5
14 2017-03-26 01:00:00 6
15 2017-03-26 02:00:00 7
16 2017-03-26 03:00:00 76
# As there are several versions of df2, I use the one shown in the question
df2 <- read.table(text = "
datetime Var2
1 '2016-10-29 22:00:00' 56
2 '2016-10-29 23:00:00' 43
3 '2016-10-30 00:00:00' 23
4 '2016-10-30 01:00:00' 14
5 '2016-10-30 02:00:00' 51
6 '2016-10-30 03:00:00' 27
7 '2016-10-30 04:00:00' 89
8 '2016-10-30 05:00:00' 76
9 '2017-03-25 22:00:00' 56
10 '2017-03-25 23:00:00' 4
11 '2017-03-26 00:00:00' 35
12 '2017-03-26 01:00:00' 23
13 '2017-03-26 02:00:00' 4
14 '2017-03-26 03:00:00' 62
15 '2017-03-26 04:00:00' 84
", header = TRUE)
library(lubridate)
# When you assign the time zone here, the content of df2 is already changed
df2$datetimeEP <- as.POSIXct(df2$datetime, format = "%Y-%m-%d %H", tz= "Europe/Paris")
#df2[13,]
# datetime Var2 datetimeEP
#13 2017-03-26 02:00:00 4 2017-03-26 01:00:00
# To me it looks like your recorded times don't observe daylight saving time,
# so you have to use e.g. "Etc/GMT-1" instead of "Europe/Paris"
df2$datetimeG1 <- as.POSIXct(df2$datetime, format = "%Y-%m-%d %H", tz= "Etc/GMT-1")
data.frame(datetime=df2$datetime, utc=with_tz(df2$datetimeG1,"UTC"))
# datetime utc
#1 2016-10-29 22:00:00 2016-10-29 21:00:00
#2 2016-10-29 23:00:00 2016-10-29 22:00:00
#3 2016-10-30 00:00:00 2016-10-29 23:00:00
#4 2016-10-30 01:00:00 2016-10-30 00:00:00
#5 2016-10-30 02:00:00 2016-10-30 01:00:00
#6 2016-10-30 03:00:00 2016-10-30 02:00:00
#7 2016-10-30 04:00:00 2016-10-30 03:00:00
#8 2016-10-30 05:00:00 2016-10-30 04:00:00
#9 2017-03-25 22:00:00 2017-03-25 21:00:00
#10 2017-03-25 23:00:00 2017-03-25 22:00:00
#11 2017-03-26 00:00:00 2017-03-25 23:00:00
#12 2017-03-26 01:00:00 2017-03-26 00:00:00
#13 2017-03-26 02:00:00 2017-03-26 01:00:00
#14 2017-03-26 03:00:00 2017-03-26 02:00:00
#15 2017-03-26 04:00:00 2017-03-26 03:00:00
# You can use dst() to check whether a datetime in a given time zone is on daylight saving time
dst(df2$datetimeEP)
dst(df2$datetimeG1)
dst(with_tz(df2$datetimeEP,"UTC"))
dst(with_tz(df2$datetimeG1,"UTC"))
# If your recorded times do observe daylight saving time, then you HAVE a gap and an overlap.
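Once both tables carry true UTC instants, the merge itself is plain (a sketch, assuming the "Etc/GMT-1" reading is the right one for your recordings; datetime_utc is just a hypothetical helper column):
df2$datetime_utc <- with_tz(df2$datetimeG1, "UTC")  # hypothetical helper column holding the UTC instants
merged <- merge(df1, df2[, c("datetime_utc", "Var2")], by.x = "datetime", by.y = "datetime_utc")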
I have a set of data taken every 30 minutes consisting of the following structure:
>df1
Date X1
01/01/2017 0:00 1
01/01/2017 0:30 32
01/01/2017 1:00 65
01/01/2017 1:30 14
01/01/2017 2:00 25
01/01/2017 2:30 14
01/01/2017 3:00 85
01/01/2017 3:30 74
01/01/2017 4:00 74
01/01/2017 4:30 52
01/01/2017 5:00 25
01/01/2017 5:30 74
01/01/2017 6:00 45
01/01/2017 6:30 52
01/01/2017 7:00 21
01/01/2017 7:30 41
01/01/2017 8:00 74
01/01/2017 8:30 11
01/01/2017 9:00 2
01/01/2017 9:30 52
Another vector is given consisting of only dates, but with a different time frequency:
>V1
Date2
1/1/2017 1:30:00
1/1/2017 3:30:00
1/1/2017 5:30:00
1/1/2017 9:30:00
I would like to calculate the moving average of X1, but in the end the only values I really need are the ones associated with the dates in V1 (while preserving the smoothing generated by the moving average).
Would you recommend calculating the moving average of X1, then associating the values with the corresponding dates in V1 and re-applying a moving average? Or do you know a function in R that would help me achieve this?
Thank you, I really appreciate your help!
Sofía
filter (from base R's stats package) is a convenient way to construct moving averages.
Assuming you want a simple arithmetic moving average, you'll need to decide how many elements you'd like to average together, and whether you'd like a one- or two-sided average. Arbitrarily, I'll pick 5 and one-sided.
elements <- 5
df1$x1.smooth <- stats::filter(df1$X1, filter = rep(1/elements, elements), sides = 1)
Note that the first elements - 1 values of the result are NA, since the moving average is undefined until there are elements items to average. (stats::filter is spelled out because dplyr masks filter.)
df1 is now
Date X1 x1.smooth
1 01/01/2017 0:00 1 NA
2 01/01/2017 0:30 32 NA
3 01/01/2017 1:00 65 NA
4 01/01/2017 1:30 14 NA
5 01/01/2017 2:00 25 27.4
6 01/01/2017 2:30 14 30.0
7 01/01/2017 3:00 85 40.6
8 01/01/2017 3:30 74 42.4
9 01/01/2017 4:00 74 54.4
10 01/01/2017 4:30 52 59.8
11 01/01/2017 5:00 25 62.0
12 01/01/2017 5:30 74 59.8
13 01/01/2017 6:00 45 54.0
14 01/01/2017 6:30 52 49.6
15 01/01/2017 7:00 21 43.4
16 01/01/2017 7:30 41 46.6
17 01/01/2017 8:00 74 46.6
18 01/01/2017 8:30 11 39.8
19 01/01/2017 9:00 2 29.8
20 01/01/2017 9:30 52 36.0
Now you need only merge the two data frames on Date = Date2, or else subset df1 to the rows where Date is %in% V1$Date2.
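A sketch of that last step, assuming day/month/year ordering in the timestamps (swap in the mdy_* parsers if they are month-first):
library(lubridate)
df1$Date <- dmy_hm(df1$Date)    # assumption: day/month/year ordering
V1$Date2 <- dmy_hms(V1$Date2)
df1[df1$Date %in% V1$Date2, ]   # keep only the smoothed values at V1's dates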
Another option could be to use the zoo package. One can use rollapply to calculate and add another column to the data frame holding the moving average of X1.
An implementation with a moving average of width 4 (i.e. every 2 hours, given the 30-minute spacing) can be written as:
library(zoo)
# Add another column with the mean value
df$mean <- rollapply(df$X1, 4, mean, align = "right", fill = NA)
df
# Date X1 mean
# 1 2017-01-01 00:00:00 1 NA
# 2 2017-01-01 00:30:00 32 NA
# 3 2017-01-01 01:00:00 65 NA
# 4 2017-01-01 01:30:00 14 28.00
# 5 2017-01-01 02:00:00 25 34.00
# 6 2017-01-01 02:30:00 14 29.50
# 7 2017-01-01 03:00:00 85 34.50
# 8 2017-01-01 03:30:00 74 49.50
# 9 2017-01-01 04:00:00 74 61.75
# 10 2017-01-01 04:30:00 52 71.25
# 11 2017-01-01 05:00:00 25 56.25
# 12 2017-01-01 05:30:00 74 56.25
# 13 2017-01-01 06:00:00 45 49.00
# 14 2017-01-01 06:30:00 52 49.00
# 15 2017-01-01 07:00:00 21 48.00
# 16 2017-01-01 07:30:00 41 39.75
# 17 2017-01-01 08:00:00 74 47.00
# 18 2017-01-01 08:30:00 11 36.75
# 19 2017-01-01 09:00:00 2 32.00
# 20 2017-01-01 09:30:00 52 34.75
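As with the previous answer, the smoothed values at just the dates in V1 can then be pulled out with a simple subset (assuming df$Date and V1$Date2 are both parsed to POSIXct in the same time zone):
df[df$Date %in% V1$Date2, c("Date", "mean")]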