Aggregating hourly data into daily aggregates

I have hourly weather data in the following format:
Date,DBT
01/01/2000 01:00,30
01/01/2000 02:00,31
01/01/2000 03:00,33
...
...
12/31/2000 23:00,25
What I need is a daily aggregate of the max, min, and average, like this:
Date,MaxDBT,MinDBT,AveDBT
01/01/2000,36,23,28
01/02/2000,34,22,29
01/03/2000,32,25,30
...
...
12/31/2000,35,9,20
How to do this in R?

1) This can be done compactly using zoo. Because format = "%m/%d/%Y" parses only the date part of each timestamp, read.zoo indexes every reading by its day, and aggregate = stat then collapses the duplicate days:
L <- "Date,DBT
01/01/2000 01:00,30
01/01/2000 02:00,31
01/01/2000 03:00,33
12/31/2000 23:00,25"
library(zoo)
stat <- function(x) c(min = min(x), max = max(x), mean = mean(x))
z <- read.zoo(text = L, header = TRUE, sep = ",", format = "%m/%d/%Y", aggregate = stat)
This gives:
> z
min max mean
2000-01-01 30 33 31.33333
2000-12-31 25 25 25.00000
2) Here is a solution that uses only core R:
DF <- read.csv(text = L)
DF$Date <- as.Date(DF$Date, "%m/%d/%Y")
ag <- aggregate(DBT ~ Date, DF, stat) # same stat as in zoo solution
The last line gives:
> ag
Date DBT.min DBT.max DBT.mean
1 2000-01-01 30.00000 33.00000 31.33333
2 2000-12-31 25.00000 25.00000 25.00000
EDIT: (1) Since this answer first appeared, the text= argument to read.zoo has been added to the zoo package.
(2) Minor improvements.

Using strptime(), trunc(), and ddply() from the plyr package:
#Make the data
ZZ <- textConnection("Date,DBT
01/01/2000 01:00,30
01/01/2000 02:00,31
01/01/2000 03:00,33
12/31/2000 23:00,25")
dataframe <- read.csv(ZZ, header = TRUE)
close(ZZ)
# Do the calculations
dataframe$Date <- strptime(dataframe$Date, format = "%m/%d/%Y %H:%M")
dataframe$day <- trunc(dataframe$Date, "day")
require(plyr)
ddply(dataframe, .(day),
      summarize,
      aveDBT = mean(DBT),
      maxDBT = max(DBT),
      minDBT = min(DBT)
)
gives
day aveDBT maxDBT minDBT
1 2000-01-01 31.33333 33 30
2 2000-12-31 25.00000 25 25
To clarify:
strptime converts the character strings to date-times according to the given format. To see how you can specify the format, see ?strptime. trunc then truncates these date-times to the specified unit, which is day in this case.
ddply evaluates the function summarize within the dataframe after splitting it up according to day. Everything after summarize is passed as an argument to the function summarize.
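For example, a quick illustration of the truncation step (tz = "UTC" is set only to make the output reproducible):
x <- strptime("01/01/2000 13:45", format = "%m/%d/%Y %H:%M", tz = "UTC")
trunc(x, "day")
# [1] "2000-01-01 UTC"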

There is also a nice package called hydroTSM. It works on zoo objects and can convert between time aggregates.
The function for your case is subdaily2daily. You can choose whether the aggregation should be based on min / max / mean...
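A minimal sketch, assuming the hourly series has already been read into a zoo object z with a POSIXct index (e.g. read.zoo(text = L, header = TRUE, sep = ",", format = "%m/%d/%Y %H:%M", tz = "") so the time part is kept):
library(hydroTSM)
daily_max  <- subdaily2daily(z, FUN = max,  na.rm = TRUE)
daily_min  <- subdaily2daily(z, FUN = min,  na.rm = TRUE)
daily_mean <- subdaily2daily(z, FUN = mean, na.rm = TRUE)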

A couple of options:
1. Timetk
If you have a data frame (or tibble) then the summarize_by_time() function from timetk can be used:
library(tidyverse)
library(timetk)
# Collect Data
text <- "Date,DBT
01/01/2000 01:00,30
01/01/2000 02:00,31
01/01/2000 03:00,33
12/31/2000 23:00,25"
df <- read_csv(text, col_types = cols(Date = col_datetime("%m/%d/%Y %H:%M")))
df
#> # A tibble: 4 x 2
#> Date DBT
#> <dttm> <dbl>
#> 1 2000-01-01 01:00:00 30
#> 2 2000-01-01 02:00:00 31
#> 3 2000-01-01 03:00:00 33
#> 4 2000-12-31 23:00:00 25
# Summarize
df %>%
  summarise_by_time(
    .date_var = Date,
    .by = "day",
    min = min(DBT),
    max = max(DBT),
    mean = mean(DBT)
  )
#> # A tibble: 2 x 4
#> Date min max mean
#> <dttm> <dbl> <dbl> <dbl>
#> 1 2000-01-01 00:00:00 30 33 31.3
#> 2 2000-12-31 00:00:00 25 25 25
Created on 2021-05-21 by the reprex package (v2.0.0)
2. Tidyquant
You can use the tidyquant package for this. The process involves using the tq_transmute function to return a data frame modified by the xts aggregation function apply.daily. We'll apply a custom stat_fun, which returns the min, max, and mean; however, you can apply any vector function you'd like, such as quantile.
library(tidyquant)
df
#> # A tibble: 4 x 2
#> Date DBT
#> <dttm> <dbl>
#> 1 2000-01-01 01:00:00 30
#> 2 2000-01-01 02:00:00 31
#> 3 2000-01-01 03:00:00 33
#> 4 2000-12-31 23:00:00 25
stat_fun <- function(x) c(min = min(x), max = max(x), mean = mean(x))
df %>%
  tq_transmute(select = DBT,
               mutate_fun = apply.daily,
               FUN = stat_fun)
#> # A tibble: 2 x 4
#> Date min max mean
#> <dttm> <dbl> <dbl> <dbl>
#> 1 2000-01-01 03:00:00 30 33 31.33333
#> 2 2000-12-31 23:00:00 25 25 25.00000

Given timestamps in POSIXct format (or convertible with as.POSIXct()), all you need is cut() and aggregate().
Try this:
split_hour = cut(as.POSIXct(temp$time), breaks = "60 mins") # summarise at the given interval
temp$hour = split_hour # make an hourly grouping variable
ag = aggregate(. ~ hour, temp, mean)
In this case, temp looks like this:
temp
1 0.6 0.6 0.0 0.350 0.382 0.000 2020-04-13 18:30:42
2 0.0 0.5 0.5 0.000 0.304 0.292 2020-04-13 19:56:02
3 0.0 0.2 0.2 0.000 0.107 0.113 2020-04-13 20:09:10
4 0.6 0.0 0.6 0.356 0.000 0.376 2020-04-13 20:11:57
5 0.0 0.3 0.2 0.000 0.156 0.148 2020-04-13 20:12:07
6 0.0 0.4 0.4 0.000 0.218 0.210 2020-04-13 22:02:49
7 0.2 0.2 0.0 0.112 0.113 0.000 2020-04-13 22:31:43
8 0.3 0.0 0.3 0.155 0.000 0.168 2020-04-14 03:19:03
9 0.4 0.0 0.4 0.219 0.000 0.258 2020-04-14 03:55:58
10 0.2 0.0 0.0 0.118 0.000 0.000 2020-04-14 04:25:25
11 0.3 0.3 0.0 0.153 0.160 0.000 2020-04-14 05:38:20
12 0.0 0.7 0.8 0.000 0.436 0.493 2020-04-14 05:40:02
13 0.0 0.0 0.2 0.000 0.000 0.101 2020-04-14 05:40:44
14 0.3 0.0 0.3 0.195 0.000 0.198 2020-04-14 06:09:26
15 0.2 0.2 0.0 0.130 0.128 0.000 2020-04-14 06:17:15
16 0.2 0.0 0.0 0.144 0.000 0.000 2020-04-14 06:19:36
17 0.3 0.0 0.4 0.177 0.000 0.220 2020-04-14 06:23:43
18 0.2 0.0 0.0 0.110 0.000 0.000 2020-04-14 06:25:19
19 0.0 0.0 0.0 1.199 1.035 0.251 2020-04-14 07:05:24
20 0.2 0.2 0.0 0.125 0.107 0.000 2020-04-14 07:21:46
ag then looks like this:
ag
1 2020-04-13 18:30:00 0.60000000 0.6000000 0.0000000 0.3500000 0.38200000 0.00000000
2 2020-04-13 19:30:00 0.15000000 0.2500000 0.3750000 0.0890000 0.14175000 0.23225000
3 2020-04-13 21:30:00 0.00000000 0.4000000 0.4000000 0.0000000 0.21800000 0.21000000
4 2020-04-13 22:30:00 0.20000000 0.2000000 0.0000000 0.1120000 0.11300000 0.00000000
5 2020-04-14 02:30:00 0.30000000 0.0000000 0.3000000 0.1550000 0.00000000 0.16800000
6 2020-04-14 03:30:00 0.30000000 0.0000000 0.2000000 0.1685000 0.00000000 0.12900000
7 2020-04-14 05:30:00 0.18750000 0.1500000 0.2125000 0.1136250 0.09050000 0.12650000
8 2020-04-14 06:30:00 0.10000000 0.1000000 0.0000000 0.6620000 0.57100000 0.12550000
9 2020-04-14 07:30:00 0.00000000 0.3000000 0.2000000 0.0000000 0.16200000 0.11800000
10 2020-04-14 19:30:00 0.20000000 0.3000000 0.0000000 0.1460000 0.19000000 0.00000000
11 2020-04-14 20:30:00 0.06666667 0.2000000 0.2666667 0.0380000 0.11766667 0.17366667
12 2020-04-14 22:30:00 0.20000000 0.3000000 0.0000000 0.1353333 0.18533333 0.00000000
13 2020-04-14 23:30:00 0.00000000 0.5000000 0.5000000 0.0000000 0.28000000 0.32100000
14 2020-04-15 01:30:00 0.25000000 0.2000000 0.4500000 0.1355000 0.11450000 0.26100000
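The same idea gives the daily aggregates asked for in the original question: use breaks = "1 day" instead of "60 mins". A minimal sketch, assuming a data frame DF with a POSIXct Date column and a numeric DBT column:
DF$day <- cut(as.POSIXct(DF$Date), breaks = "1 day")
aggregate(DBT ~ day, DF, function(x) c(min = min(x), max = max(x), mean = mean(x)))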

Related

Fill in missing dates in a dataframe

I have two dataframes, interest rates and monthly standard deviations of price returns, that I have managed to merge together. However, the interest rate data has gaps in its dates where the markets were not open, i.e. weekends and holidays. The monthly returns all start on the first of the month, so where this lines up with a market closure the data doesn't merge correctly. An example of the dataframes is
Date Rollingstd
01/11/2014 0.00925
01/10/2014 0.01341
Date InterestRate
03/11/2014 2
31/10/2014 1.5
As you can see, there is no 01/11/2014 in the interest rate data, so merging gives me
Date InterestRate Rollingstd
03/11/2014 2 0.01341
31/10/2014 1.5 0.01341
I guess a fix for this would be to expand the interest rate dataframe so that it includes all dates, filling the interest rate data up so it looks like this:
Date InterestRate
03/11/2014 2
02/11/2014 1.5
01/11/2014 1.5
31/10/2014 1.5
This would ensure there are no missing dates in the dataframe. Any ideas on how I could do this?
Do you want this?
df2 <- read.table(text = 'Date InterestRate
03/11/2014 2
31/10/2014 1.5', header = T)
df1 <- read.table(text = 'Date Rollingstd
01/11/2014 0.00925
01/10/2014 0.01341', header = T)
library(tidyverse)
df1 %>%
  full_join(df2, by = 'Date') %>%
  mutate(Date = as.Date(Date, '%d/%m/%Y')) %>%
  arrange(Date) %>%
  complete(Date = seq.Date(min(Date), max(Date), 'days')) %>%
  fill(InterestRate, .direction = 'up') %>%
  as.data.frame()
#> Date Rollingstd InterestRate
#> 1 2014-10-01 0.01341 1.5
#> 2 2014-10-02 NA 1.5
#> 3 2014-10-03 NA 1.5
#> 4 2014-10-04 NA 1.5
#> 5 2014-10-05 NA 1.5
#> 6 2014-10-06 NA 1.5
#> 7 2014-10-07 NA 1.5
#> 8 2014-10-08 NA 1.5
#> 9 2014-10-09 NA 1.5
#> 10 2014-10-10 NA 1.5
#> 11 2014-10-11 NA 1.5
#> 12 2014-10-12 NA 1.5
#> 13 2014-10-13 NA 1.5
#> 14 2014-10-14 NA 1.5
#> 15 2014-10-15 NA 1.5
#> 16 2014-10-16 NA 1.5
#> 17 2014-10-17 NA 1.5
#> 18 2014-10-18 NA 1.5
#> 19 2014-10-19 NA 1.5
#> 20 2014-10-20 NA 1.5
#> 21 2014-10-21 NA 1.5
#> 22 2014-10-22 NA 1.5
#> 23 2014-10-23 NA 1.5
#> 24 2014-10-24 NA 1.5
#> 25 2014-10-25 NA 1.5
#> 26 2014-10-26 NA 1.5
#> 27 2014-10-27 NA 1.5
#> 28 2014-10-28 NA 1.5
#> 29 2014-10-29 NA 1.5
#> 30 2014-10-30 NA 1.5
#> 31 2014-10-31 NA 1.5
#> 32 2014-11-01 0.00925 2.0
#> 33 2014-11-02 NA 2.0
#> 34 2014-11-03 NA 2.0
Created on 2021-05-23 by the reprex package (v2.0.0)

Scale Function Returns: Error in FUN(x, aperm(array(STATS, dims[perm]), order(perm)), ...)

Trying to train a NeuralNet but I can't normalize my data.
Defining the maxs and mins for scaling works fine:
maxs <- apply(tour_weahter_data, 2, max)
mins <- apply(tour_weahter_data, 2, min)
Here is the data I'm trying to scale:
head(tour_weahter_data)
Start Time Starting Station ID Duration Distance Temperatur Humidity
1 2016-07-07 13:00:00 3063 12 578.7915 18 72
2 2016-07-07 13:00:00 3040 10 1262.4654 18 72
3 2016-07-07 13:00:00 3063 19 1660.0441 18 72
4 2016-07-07 13:00:00 3018 10 907.1427 18 72
5 2016-07-07 13:00:00 3076 10 1004.5161 18 72
6 2016-07-07 13:00:00 3034 4 448.0982 18 72
This is the call to the function:
scaled <- as.data.frame(scale(tour_weahter_data, center = mins, scale = maxs - mins))
This is the Error Message I get:
Error in FUN(x, aperm(array(STATS, dims[perm]), order(perm)), ...) :
non-numeric argument to binary operator
In addition: Warning message:
In scale.default(tour_weahter_data, center = mins, scale = maxs - :
NAs introduced by coercion
Is there a problem with my data or am I using the function incorrectly?
scale() works only with numeric variables, so you must exclude non-numeric columns (here, the POSIXct Start Time) before scaling.
Here's an approach using dplyr.
library(dplyr)
vars_scale <- tour_weahter_data %>%
  select_if(is.numeric) %>%
  colnames()
scale_min_max <- function(x) scale(x, center = min(x), scale = max(x) - min(x))
tour_weahter_data %>%
  mutate_at(vars_scale, scale_min_max)
## A tibble: 6 x 7
# Start[,1] Time_Starting Station_ID[,1] Duration[,1]
# <dbl> <dttm> <dbl> <dbl>
#1 0 2016-07-07 13:00:00 0.776 0.533
#2 0.2 2016-07-07 13:00:00 0.379 0.4
#3 0.4 2016-07-07 13:00:00 0.776 1
#4 0.6 2016-07-07 13:00:00 0 0.4
#5 0.8 2016-07-07 13:00:00 1 0.4
#6 1 2016-07-07 13:00:00 0.276 0
## ... with 3 more variables: Distance[,1] <dbl>,
## Temperatur[,1] <dbl>, Humidity[,1] <dbl>
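A base R alternative is to subset to the numeric columns before computing the mins and maxs (a sketch, using the tour_weahter_data shown in the question):
num_cols <- sapply(tour_weahter_data, is.numeric)
maxs <- apply(tour_weahter_data[num_cols], 2, max)
mins <- apply(tour_weahter_data[num_cols], 2, min)
scaled <- as.data.frame(scale(tour_weahter_data[num_cols],
                              center = mins, scale = maxs - mins))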

How to combine two columns of time in R?

I have two text files:
1-
> head(val)
V1 V2 V3
1 2015/03/31 00:00 0.134
2 2015/03/31 01:00 0.130
3 2015/03/31 02:00 0.133
4 2015/03/31 03:00 0.132
2-
> head(tes)
A B date
1 0.04 0.02 2015-03-31 02:18:56
What I need is to combine V1 (date) and V2 (hour) in val, search val for the date and time closest to date in tes, and then extract the corresponding V3 and put it in tes.
The desired output would be:
tes
A B date V3
1 0.04 0.02 2015-03-31 02:18:56 0.133
Updated answer based on OP's comments.
val$date <- with(val,as.POSIXct(paste(V1,V2), format="%Y/%m/%d %H:%M"))
val
# V1 V2 V3 date
# 1 2015/03/31 00:00 0.134 2015-03-31 00:00:00
# 2 2015/03/31 01:00 0.130 2015-03-31 01:00:00
# 3 2015/03/31 02:00 0.133 2015-03-31 02:00:00
# 4 2015/03/31 03:00 0.132 2015-03-31 03:00:00
# 5 2015/04/07 13:00 0.080 2015-04-07 13:00:00
# 6 2015/04/07 14:00 0.082 2015-04-07 14:00:00
tes$date <- as.POSIXct(tes$date)
tes
# A B date
# 1 0.04 0.02 2015-03-31 02:18:56
# 2 0.05 0.03 2015-03-31 03:30:56
# 3 0.06 0.04 2015-03-31 05:30:56
# 4 0.07 0.05 2015-04-07 13:42:56
f <- function(d) { # for a given tes$date, find the closest val row
  diff <- abs(difftime(val$date, d, units = "min"))
  if (min(diff) > 45) Inf else which.min(diff)
}
tes <- cbind(tes,val[sapply(tes$date,f),c("date","V3")])
tes
# A B date date V3
# 1 0.04 0.02 2015-03-31 02:18:56 2015-03-31 02:00:00 0.133
# 2 0.05 0.03 2015-03-31 03:30:56 2015-03-31 03:00:00 0.132
# 3 0.06 0.04 2015-03-31 05:30:56 <NA> NA
# 4 0.07 0.05 2015-04-07 13:42:56 2015-04-07 14:00:00 0.082
The function f(...) calculates the index into val (the row number) for which val$date is closest in time to the given tes$date, unless that difference is > 45 min, in which case Inf is returned. Using this function with sapply(...) as in:
sapply(tes$date, f)
returns a vector of row numbers in val matching your condition for each tes$date.
The reason we use Inf instead of NA for missing values is that indexing a data.frame using Inf always returns a single "row" containing NA, whereas indexing using NA returns nrow(...) rows all containing NA.
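A quick demonstration of that indexing difference on a throwaway data frame:
d <- data.frame(x = 1:3, y = letters[1:3])
d[Inf, ]  # a single row of NAs
d[NA, ]   # nrow(d) rows, all NA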
I added the extra rows into val and tes per your comment.

Adding missing dates to dataframe

I have a data frame which looks like this:
times values
1 2013-07-06 20:00:00 0.02
2 2013-07-07 20:00:00 0.03
3 2013-07-09 20:00:00 0.13
4 2013-07-10 20:00:00 0.12
5 2013-07-11 20:00:00 0.03
6 2013-07-14 20:00:00 0.06
7 2013-07-15 20:00:00 0.08
8 2013-07-16 20:00:00 0.07
9 2013-07-17 20:00:00 0.08
There are a few dates missing from the data, and I would like to insert them and to carry over the value from the previous day into these new rows, i.e. obtain this:
times values
1 2013-07-06 20:00:00 0.02
2 2013-07-07 20:00:00 0.03
3 2013-07-08 20:00:00 0.03
4 2013-07-09 20:00:00 0.13
5 2013-07-10 20:00:00 0.12
6 2013-07-11 20:00:00 0.03
7 2013-07-12 20:00:00 0.03
8 2013-07-13 20:00:00 0.03
9 2013-07-14 20:00:00 0.06
10 2013-07-15 20:00:00 0.08
11 2013-07-16 20:00:00 0.07
12 2013-07-17 20:00:00 0.08
...
I have been trying to use a vector of all the dates:
dates <- as.Date(1:length(df),origin = df$times[1])
I am stuck, and can't find a way to do it without a horrible for loop in which I'm getting lost...
Thank you for your help
Some test data (I am using Date, yours seems to be a different type, but this does not affect the algorithm):
data = data.frame(dates = as.Date(c("2011-12-15", "2011-12-17", "2011-12-19")),
                  values = as.double(1:3))
# Generate **all** timestamps at which you want to have your result.
# I use `seq`, but you may use any other method of generating those timestamps.
alldates = seq(min(data$dates), max(data$dates), 1)
# Filter out timestamps that are already present in your `data.frame`:
# Construct a `data.frame` to append with missing values:
dates0 = alldates[!(alldates %in% data$dates)]
data0 = data.frame(dates = dates0, values = NA_real_)
# Append this `data.frame` and resort in time:
data = rbind(data, data0)
data = data[order(data$dates),]
# Forward-fill the values.
# (I would recommend moving this code into a separate `ffill` function;
# it has proved to be very useful in general.)
current = NA_real_
data$values = sapply(data$values, function(x) {
  current <<- ifelse(is.na(x), current, x); current })
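The separate ffill helper suggested in the comment above could look like this (a sketch):
ffill <- function(x) {
  current <- NA_real_
  sapply(x, function(v) {
    current <<- ifelse(is.na(v), current, v)
    current
  })
}
data$values <- ffill(data$values)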
library(zoo)
g <- data.frame(dates = seq(min(data$dates), max(data$dates), 1))
na.locf(merge(g, data, by = "dates", all.x = TRUE))
or entirely with zoo:
z <- read.zoo(data)
gz <- zoo(, seq(min(time(z)), max(time(z)), "day")) # time grid in zoo
na.locf(merge(z, gz))
Using tidyr's complete and fill, assuming the times column is already of class POSIXct:
library(tidyr)
df %>%
  complete(times = seq(min(times), max(times), by = 'day')) %>%
  fill(values)
# A tibble: 12 x 2
# times values
# <dttm> <dbl>
# 1 2013-07-06 20:00:00 0.02
# 2 2013-07-07 20:00:00 0.03
# 3 2013-07-08 20:00:00 0.03
# 4 2013-07-09 20:00:00 0.13
# 5 2013-07-10 20:00:00 0.12
# 6 2013-07-11 20:00:00 0.03
# 7 2013-07-12 20:00:00 0.03
# 8 2013-07-13 20:00:00 0.03
# 9 2013-07-14 20:00:00 0.06
#10 2013-07-15 20:00:00 0.08
#11 2013-07-16 20:00:00 0.07
#12 2013-07-17 20:00:00 0.08
data
df <- structure(list(times = structure(c(1373140800, 1373227200, 1373400000,
1373486400, 1373572800, 1373832000, 1373918400, 1374004800, 1374091200
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), values = c(0.02,
0.03, 0.13, 0.12, 0.03, 0.06, 0.08, 0.07, 0.08)), row.names = c(NA,
-9L), class = "data.frame")
df2 <- data.frame(times = seq(min(df$times), max(df$times), by = "day"))
df3 <- merge(x = df2, y = df, by = "times", all.x = TRUE)
idx <- which(is.na(df3$values))
for (id in idx)
  df3$values[id] <- df3$values[id - 1]
df3
# times values
# 1 2013-07-06 20:00:00 0.02
# 2 2013-07-07 20:00:00 0.03
# 3 2013-07-08 20:00:00 0.03
# 4 2013-07-09 20:00:00 0.13
# 5 2013-07-10 20:00:00 0.12
# 6 2013-07-11 20:00:00 0.03
# 7 2013-07-12 20:00:00 0.03
# 8 2013-07-13 20:00:00 0.03
# 9 2013-07-14 20:00:00 0.06
# 10 2013-07-15 20:00:00 0.08
# 11 2013-07-16 20:00:00 0.07
# 12 2013-07-17 20:00:00 0.08
You can try this with a data.table rolling join (assuming NADayWiseOrders is a data.table with date, orders, amount, and guests columns):
library(data.table)
setkey(NADayWiseOrders, date)
all_dates <- seq(from = as.Date("2013-01-01"),
                 to = as.Date("2013-01-07"),
                 by = "days")
NADayWiseOrders[J(all_dates), roll = Inf] # roll the last observation forward
date orders amount guests
1: 2013-01-01 50 2272.55 149
2: 2013-01-02 3 64.04 4
3: 2013-01-03 3 64.04 4
4: 2013-01-04 1 18.81 0
5: 2013-01-05 2 77.62 0
6: 2013-01-06 2 77.62 0
7: 2013-01-07 2 35.82 2
