I'm trying to create a dataframe with the following columns: dt, depth, var1.
But I need four rows for each hour, running through a whole year, as I need to adjust var1 at certain depths:
dt                depth  var1
2008-01-01 00:00      2  0.01
2008-01-01 00:00     40  0.01
2008-01-01 00:00     45  0.01
2008-01-01 00:00    100  0.01
2008-01-01 01:00      2  0.01
2008-01-01 01:00     40  0.01
2008-01-01 01:00     45  0.01
2008-01-01 01:00    100  0.01
2008-01-01 02:00      2  0.01
2008-01-01 02:00     40  0.01
2008-01-01 02:00     45  0.01
2008-01-01 02:00    100  0.01
2008-01-01 03:00      2  0.01
How do I create the "dt" list for the first column?
Thank you!
You can use expand.grid:
expand.grid(
  dt = seq(as.POSIXct("2008-01-01 00:00:00", tz = "UTC"),
           as.POSIXct("2008-01-01 03:00:00", tz = "UTC"), by = "hour"),
  depth = c(2, 40, 45, 100),
  var1 = 0.01
) -> result
result
# dt depth var1
#1 2008-01-01 00:00:00 2 0.01
#2 2008-01-01 01:00:00 2 0.01
#3 2008-01-01 02:00:00 2 0.01
#4 2008-01-01 03:00:00 2 0.01
#5 2008-01-01 00:00:00 40 0.01
#6 2008-01-01 01:00:00 40 0.01
#7 2008-01-01 02:00:00 40 0.01
#8 2008-01-01 03:00:00 40 0.01
#9 2008-01-01 00:00:00 45 0.01
#10 2008-01-01 01:00:00 45 0.01
#11 2008-01-01 02:00:00 45 0.01
#12 2008-01-01 03:00:00 45 0.01
#13 2008-01-01 00:00:00 100 0.01
#14 2008-01-01 01:00:00 100 0.01
#15 2008-01-01 02:00:00 100 0.01
#16 2008-01-01 03:00:00 100 0.01
If you want the order shown in your question, you can arrange the above result or use tidyr::expand_grid:
tidyr::expand_grid(
  dt = seq(as.POSIXct("2008-01-01 00:00:00", tz = "UTC"),
           as.POSIXct("2008-01-01 03:00:00", tz = "UTC"), by = "hour"),
  depth = c(2, 40, 45, 100),
  var1 = 0.01
) -> result
result
# A tibble: 16 x 3
# dt depth var1
# <dttm> <dbl> <dbl>
# 1 2008-01-01 00:00:00 2 0.01
# 2 2008-01-01 00:00:00 40 0.01
# 3 2008-01-01 00:00:00 45 0.01
# 4 2008-01-01 00:00:00 100 0.01
# 5 2008-01-01 01:00:00 2 0.01
# 6 2008-01-01 01:00:00 40 0.01
# 7 2008-01-01 01:00:00 45 0.01
# 8 2008-01-01 01:00:00 100 0.01
# 9 2008-01-01 02:00:00 2 0.01
#10 2008-01-01 02:00:00 40 0.01
#11 2008-01-01 02:00:00 45 0.01
#12 2008-01-01 02:00:00 100 0.01
#13 2008-01-01 03:00:00 2 0.01
#14 2008-01-01 03:00:00 40 0.01
#15 2008-01-01 03:00:00 45 0.01
#16 2008-01-01 03:00:00 100 0.01
Try this:
start_date_time <- as.POSIXct("2008-01-01 00:00", format = "%Y-%m-%d %H:%M")
end_date_time <- as.POSIXct("2008-01-01 03:00", format = "%Y-%m-%d %H:%M")
df <- data.frame(dt = rep(seq(start_date_time, end_date_time, by = 3600), each = 4),
                 depth = c(2, 40, 45, 100),  # recycled to fill all 16 rows
                 var1 = 0.01)
df
#> dt depth var1
#> 1 2008-01-01 00:00:00 2 0.01
#> 2 2008-01-01 00:00:00 40 0.01
#> 3 2008-01-01 00:00:00 45 0.01
#> 4 2008-01-01 00:00:00 100 0.01
#> 5 2008-01-01 01:00:00 2 0.01
#> 6 2008-01-01 01:00:00 40 0.01
#> 7 2008-01-01 01:00:00 45 0.01
#> 8 2008-01-01 01:00:00 100 0.01
#> 9 2008-01-01 02:00:00 2 0.01
#> 10 2008-01-01 02:00:00 40 0.01
#> 11 2008-01-01 02:00:00 45 0.01
#> 12 2008-01-01 02:00:00 100 0.01
#> 13 2008-01-01 03:00:00 2 0.01
#> 14 2008-01-01 03:00:00 40 0.01
#> 15 2008-01-01 03:00:00 45 0.01
#> 16 2008-01-01 03:00:00 100 0.01
Created on 2021-04-22 by the reprex package (v2.0.0)
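Both answers stop at 03:00 for brevity; since the goal is a full year, here is a minimal sketch of the same idea extended to all of 2008 (hourly steps and UTC assumed; the names dt_year and df_year are just illustrative):
dt_year <- seq(as.POSIXct("2008-01-01 00:00", tz = "UTC"),
               as.POSIXct("2008-12-31 23:00", tz = "UTC"),
               by = "hour")
df_year <- data.frame(dt = rep(dt_year, each = 4),
                      depth = c(2, 40, 45, 100),  # recycled across all hours
                      var1 = 0.01)
nrow(df_year)  # 8784 hours (2008 is a leap year) * 4 depths = 35136 rows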
Related
I have two long time series to compare; however, their sampling is completely different. The first one has hourly sampling, the second one has irregular sampling.
I would like to compare Value1 and Value2, so I would like to select the 02:00 Value1 records from df1 on the dates given in df2. How can I solve this in R?
df1:
Date1                Value1
2014-01-01 01:00:00    0.16
2014-01-01 02:00:00    0.13
2014-01-01 03:00:00    0.6
2014-01-02 01:00:00    0.5
2014-01-02 02:00:00    0.22
2014-01-02 03:00:00    0.17
2014-01-19 01:00:00    0.2
2014-01-19 02:00:00    0.11
2014-01-19 03:00:00    0.15
2014-01-21 01:00:00    0.13
2014-01-21 02:00:00    0.33
2014-01-21 03:00:00    0.1
2014-01-23 01:00:00    0.09
2014-01-23 02:00:00    0.02
2014-01-23 03:00:00    0.16
df2:
Date2       Value2
2014-01-01      13
2014-01-19      76
2014-01-23       8
desired output:
df_fused:
Date1                Value1  Value2
2014-01-01 02:00:00    0.13      13
2014-01-19 02:00:00    0.11      76
2014-01-23 02:00:00    0.02       8
Here is a data.table approach:
library( data.table )
#sample data can also be setDT(df1);setDT(df2)
df1 <- fread("Date1 Value1
2014-01-01 01:00:00 0.16
2014-01-01 02:00:00 0.13
2014-01-01 03:00:00 0.6
2014-01-02 01:00:00 0.5
2014-01-02 02:00:00 0.22
2014-01-02 03:00:00 0.17
2014-01-19 01:00:00 0.2
2014-01-19 02:00:00 0.11
2014-01-19 03:00:00 0.15
2014-01-21 01:00:00 0.13
2014-01-21 02:00:00 0.33
2014-01-21 03:00:00 0.1
2014-01-23 01:00:00 0.09
2014-01-23 02:00:00 0.02
2014-01-23 03:00:00 0.16")
df2 <- fread("Date2 Value2
2014-01-01 13
2014-01-19 76
2014-01-23 8")
#set dates to posix
df1[, Date1 := as.POSIXct( Date1, format = "%Y-%m-%d %H:%M:%S", tz = "UTC" )]
#set df2 dates to 02:00:00 time
df2[, Date2 := as.POSIXct( paste( Date2, "02:00:00" ), format = "%Y-%m-%d %H:%M:%S", tz = "UTC" )]
#join
df2[ df1, Value1 := i.Value1, on = .(Date2 = Date1)][]
# Date2 Value2 Value1
# 1: 2014-01-01 02:00:00 13 0.13
# 2: 2014-01-19 02:00:00 76 0.11
# 3: 2014-01-23 02:00:00 8 0.02
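For comparison, the same join can be done in base R; a minimal sketch, assuming df1$Date1 and df2$Date2 still hold the plain character columns from the question (df2_at2 is an illustrative name):
df1$Date1 <- as.POSIXct(df1$Date1, tz = "UTC")
df2_at2 <- data.frame(Date1 = as.POSIXct(paste(df2$Date2, "02:00:00"), tz = "UTC"),
                      Value2 = df2$Value2)
merge(df1, df2_at2, by = "Date1")  # inner join keeps only the matching 02:00 rows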
This question already has answers here:
Insert rows for missing dates/times
(9 answers)
Closed 5 years ago.
I have a dataframe that contains hourly weather information. I would like to increase the granularity of the time measurements (5-minute intervals instead of 60-minute intervals) while copying the other columns' data into the new rows created:
Current Dataframe Structure:
Date Temperature Humidity
2015-01-01 00:00:00 25 0.67
2015-01-01 01:00:00 26 0.69
Target Dataframe Structure:
Date Temperature Humidity
2015-01-01 00:00:00 25 0.67
2015-01-01 00:05:00 25 0.67
2015-01-01 00:10:00 25 0.67
.
.
.
2015-01-01 00:55:00 25 0.67
2015-01-01 01:00:00 26 0.69
2015-01-01 01:05:00 26 0.69
2015-01-01 01:10:00 26 0.69
.
.
.
What I've Tried:
for(i in 1:nrow(df)) {
five.minutes <- seq(df$date[i], length = 12, by = "5 mins")
for(j in 1:length(five.minutes)) {
df$date[i]<-rbind(five.minutes[j])
}
}
Error I'm getting:
Error in as.POSIXct.numeric(value) : 'origin' must be supplied
One possible solution is to use fill from tidyr and right_join from dplyr.
The approach is to create a date/time series between the min and max+55min times of the dataframe, join the dataframe with that time series (which provides all the desired rows, but NA for Temperature and Humidity), and then use fill to populate the NA values with the previous valid values.
# Data
df <- read.table(text = "Date Temperature Humidity
'2015-01-01 00:00:00' 25 0.67
'2015-01-01 01:00:00' 26 0.69
'2015-01-01 02:00:00' 28 0.69
'2015-01-01 03:00:00' 25 0.69", header = T, stringsAsFactors = F)
df$Date <- as.POSIXct(df$Date, format = "%Y-%m-%d %H:%M:%S")
# Create a dataframe with all possible date/times at intervals of 5 minutes
Dates <- data.frame(Date = seq(min(df$Date), max(df$Date) + 3540, by = 5*60))
library(dplyr)
library(tidyr)
result <- df %>%
right_join(Dates, by="Date") %>%
fill(Temperature, Humidity)
result
# Date Temperature Humidity
#1 2015-01-01 00:00:00 25 0.67
#2 2015-01-01 00:05:00 25 0.67
#3 2015-01-01 00:10:00 25 0.67
#4 2015-01-01 00:15:00 25 0.67
#5 2015-01-01 00:20:00 25 0.67
#6 2015-01-01 00:25:00 25 0.67
#7 2015-01-01 00:30:00 25 0.67
#8 2015-01-01 00:35:00 25 0.67
#9 2015-01-01 00:40:00 25 0.67
#10 2015-01-01 00:45:00 25 0.67
#11 2015-01-01 00:50:00 25 0.67
#12 2015-01-01 00:55:00 25 0.67
#13 2015-01-01 01:00:00 26 0.69
#14 2015-01-01 01:05:00 26 0.69
#.....
#.....
#44 2015-01-01 03:35:00 25 0.69
#45 2015-01-01 03:40:00 25 0.69
#46 2015-01-01 03:45:00 25 0.69
#47 2015-01-01 03:50:00 25 0.69
#48 2015-01-01 03:55:00 25 0.69
I think this might do:
library(tibble)     # for tibble()
library(lubridate)  # for ymd_hms()
df <- tibble(DateTime = c("2015-01-01 00:00:00", "2015-01-01 01:00:00"),
             Temperature = c(25, 26), Humidity = c(.67, .69))
df$DateTime <- ymd_hms(df$DateTime)
DateTime=as.POSIXct((sapply(1:(nrow(df)-1),function(x) seq(from=df$DateTime[x],to=df$DateTime[x+1],by="5 min"))),
origin="1970-01-01", tz="UTC")
Temperature=c(sapply(1:(nrow(df)-1),function(x) rep(df$Temperature[x],12)),df$Temperature[nrow(df)])
Humidity=c(sapply(1:(nrow(df)-1),function(x) rep(df$Humidity[x],12)),df$Humidity[nrow(df)])
tibble(as.character(DateTime),Temperature,Humidity)
<chr> <dbl> <dbl>
1 2015-01-01 00:00:00 25.0 0.670
2 2015-01-01 00:05:00 25.0 0.670
3 2015-01-01 00:10:00 25.0 0.670
4 2015-01-01 00:15:00 25.0 0.670
5 2015-01-01 00:20:00 25.0 0.670
6 2015-01-01 00:25:00 25.0 0.670
7 2015-01-01 00:30:00 25.0 0.670
8 2015-01-01 00:35:00 25.0 0.670
9 2015-01-01 00:40:00 25.0 0.670
10 2015-01-01 00:45:00 25.0 0.670
11 2015-01-01 00:50:00 25.0 0.670
12 2015-01-01 00:55:00 25.0 0.670
13 2015-01-01 01:00:00 26.0 0.690
I have a dataframe that looks like this:
dat <- data.frame(time = seq(as.POSIXct("2010-01-01"),
as.POSIXct("2016-12-31") + 60*99,
by = 60*15),
radiation = sample(1:500, 245383, replace = TRUE))
So I have a measurement value every 15 minutes. The structure is:
> str(dat)
'data.frame': 245383 obs. of 2 variables:
$ time : POSIXct, format: "2010-01-01 00:00:00" "2010-01-01 00:15:00" "2010-01-01 00:30:00" "2010-01-01 00:45:00" ...
$ radiation: num 230 443 282 314 286 225 77 89 97 330 ...
Now I want to interpolate, so my aim is a dataframe with values for every minute.
I have searched and tried some methods with the zoo package, but I have some problems with the dataframe. Do I have to convert it to a text file? I have no idea how to do that.
Here is a tidyverse solution.
library('tidyverse')
dat <- data.frame(time = seq(as.POSIXct("2010-01-01"),
as.POSIXct("2016-12-31") + 60*99,
by = 60*15),
radiation = sample(1:500, 245383, replace = TRUE))
dat <- head(dat, 3)
dat
# time radiation
# 1 2010-01-01 00:00:00 241
# 2 2010-01-01 00:15:00 438
# 3 2010-01-01 00:30:00 457
You can create a data frame with all of the required times. Using full_join makes the missing radiation values NA.
approx then fills the NAs with a linear approximation.
dat %>%
full_join(data.frame(time = seq(
from = min(.$time),
to = max(.$time),
by = 'min'))) %>%
arrange(time) %>%
mutate(radiation = approx(radiation, n = n())$y)
# Joining, by = "time"
# time radiation
# 1 2010-01-01 00:00:00 241.0000
# 2 2010-01-01 00:01:00 254.1333
# 3 2010-01-01 00:02:00 267.2667
# 4 2010-01-01 00:03:00 280.4000
# 5 2010-01-01 00:04:00 293.5333
# 6 2010-01-01 00:05:00 306.6667
# 7 2010-01-01 00:06:00 319.8000
# 8 2010-01-01 00:07:00 332.9333
# 9 2010-01-01 00:08:00 346.0667
# 10 2010-01-01 00:09:00 359.2000
# 11 2010-01-01 00:10:00 372.3333
# 12 2010-01-01 00:11:00 385.4667
# 13 2010-01-01 00:12:00 398.6000
# 14 2010-01-01 00:13:00 411.7333
# 15 2010-01-01 00:14:00 424.8667
# 16 2010-01-01 00:15:00 438.0000
# 17 2010-01-01 00:16:00 439.2667
# 18 2010-01-01 00:17:00 440.5333
# 19 2010-01-01 00:18:00 441.8000
# 20 2010-01-01 00:19:00 443.0667
# 21 2010-01-01 00:20:00 444.3333
# 22 2010-01-01 00:21:00 445.6000
# 23 2010-01-01 00:22:00 446.8667
# 24 2010-01-01 00:23:00 448.1333
# 25 2010-01-01 00:24:00 449.4000
# 26 2010-01-01 00:25:00 450.6667
# 27 2010-01-01 00:26:00 451.9333
# 28 2010-01-01 00:27:00 453.2000
# 29 2010-01-01 00:28:00 454.4667
# 30 2010-01-01 00:29:00 455.7333
# 31 2010-01-01 00:30:00 457.0000
You can use the approx function like this:
dat <- data.frame(time = seq(as.POSIXct("2016-12-01"),
as.POSIXct("2016-12-31") + 60*99,
by = 60*15),
radiation = sample(1:500, 2887, replace = TRUE))
mins <- seq(as.POSIXct("2016-12-01"),
as.POSIXct("2016-12-31") + 60*99,
by = 60)
out <- approx(dat$time, dat$radiation, mins)
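approx() returns a list with components x (the requested time points) and y (the interpolated values), so a minimal sketch of putting the result back into a data frame might look like this (dat_min is just an illustrative name):
dat_min <- data.frame(time = mins, radiation = out$y)
head(dat_min)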
Here is a solution using pad from the padr package to fill the gaps in your time column. na.approx is used for interpolation.
library(padr)
library(zoo)
dat[1:2, ]
time radiation
#1 2010-01-01 00:00:00 133
#2 2010-01-01 00:15:00 187
dat_padded <- pad(dat[1:2, ], interval = "min")
dat_padded$radiation <- zoo::na.approx(dat_padded$radiation)
dat_padded
time radiation
#1 2010-01-01 00:00:00 133.0
#2 2010-01-01 00:01:00 136.6
#3 2010-01-01 00:02:00 140.2
#4 2010-01-01 00:03:00 143.8
#5 2010-01-01 00:04:00 147.4
#6 2010-01-01 00:05:00 151.0
#7 2010-01-01 00:06:00 154.6
#8 2010-01-01 00:07:00 158.2
#9 2010-01-01 00:08:00 161.8
#10 2010-01-01 00:09:00 165.4
#11 2010-01-01 00:10:00 169.0
#12 2010-01-01 00:11:00 172.6
#13 2010-01-01 00:12:00 176.2
#14 2010-01-01 00:13:00 179.8
#15 2010-01-01 00:14:00 183.4
#16 2010-01-01 00:15:00 187.0
data
set.seed(1)
dat <-
data.frame(
time = seq(
as.POSIXct("2010-01-01"),
as.POSIXct("2016-12-31") + 60 * 99,
by = 60 * 15
),
radiation = sample(1:500, 245383, replace = TRUE)
)
I have 2 tables.
dums:
start end 10min
2013-04-01 00:00:54 UTC 2013-04-01 01:00:10 UTC 0.05
2013-04-01 00:40:26 UTC 2013-04-01 01:00:00 UTC 0.1
2013-04-01 02:13:20 UTC 2013-04-01 04:53:42 UTC 0.15
2013-04-02 02:22:00 UTC 2013-04-01 04:33:12 UTC 0.2
2013-04-01 02:26:23 UTC 2013-04-01 04:05:12 UTC 0.25
2013-04-01 02:42:47 UTC 2013-04-01 04:34:33 UTC 0.3
2013-04-01 02:53:12 UTC 2013-04-03 05:27:05 UTC 0.35
2013-04-02 02:54:08 UTC 2013-04-02 05:31:15 UTC 0.4
2013-04-03 02:57:16 UTC 2013-04-03 05:29:32 UTC 0.45
map: start and end are 10-minute interval blocks spanning 2013-04-01 00:00:00 to 2013-04-04.
I want to add column 3 of dums to map as long as the start and end times are within the 10-minute blocks, and keep appending the columns.
ideally the output should be
start            end              10min
4/1/2013 0:00    4/1/2013 0:10    0.05  0
4/1/2013 0:10    4/1/2013 0:20    0.05  0
4/1/2013 0:20    4/1/2013 0:30    0.05  0
4/1/2013 0:30    4/1/2013 0:40    0.05  0
4/1/2013 0:40    4/1/2013 0:50    0.05  0.01
4/1/2013 0:50    4/1/2013 1:00    0.05  0.01
I tried
setkey(dums,start,end)
setkey(map,start,end)
foverlaps(map,dums,type="within",nomatch=0L)
I keep getting the error:
Error in foverlaps(map, dums, type = "within", nomatch = 0L) : All entries in column start should be <= corresponding entries in column end in data.table 'y'
Any pointers or alternative approaches?
Thanks
The error message
All entries in column start should be <= corresponding entries in column end in data.table 'y'
is probably caused by a typo in the dataset.
dums[start > end, which = TRUE]
returns 4 and row 4 of dums is:
start end min10
1: 2013-04-02 02:22:00 2013-04-01 04:33:12 0.2
After changing start to 2013-04-01 02:22:00, the OP's code runs fine.
However, to achieve the expected output the result of foverlaps() needs to be reshaped from long to wide format.
This can be done in two ways:
dcast(foverlaps(map, dums, nomatch = 0L), i.start + i.end ~ min10,
value.var = "min10")
i.start i.end 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45
1: 2013-04-01 00:00:00 2013-04-01 00:10:00 0.05 NA NA NA NA NA NA NA NA
2: 2013-04-01 00:10:00 2013-04-01 00:20:00 0.05 NA NA NA NA NA NA NA NA
3: 2013-04-01 00:20:00 2013-04-01 00:30:00 0.05 NA NA NA NA NA NA NA NA
4: 2013-04-01 00:30:00 2013-04-01 00:40:00 0.05 NA NA NA NA NA NA NA NA
5: 2013-04-01 00:40:00 2013-04-01 00:50:00 0.05 0.1 NA NA NA NA NA NA NA
---
311: 2013-04-03 04:40:00 2013-04-03 04:50:00 NA NA NA NA NA NA 0.35 NA 0.45
312: 2013-04-03 04:50:00 2013-04-03 05:00:00 NA NA NA NA NA NA 0.35 NA 0.45
313: 2013-04-03 05:00:00 2013-04-03 05:10:00 NA NA NA NA NA NA 0.35 NA 0.45
314: 2013-04-03 05:10:00 2013-04-03 05:20:00 NA NA NA NA NA NA 0.35 NA 0.45
315: 2013-04-03 05:20:00 2013-04-03 05:30:00 NA NA NA NA NA NA 0.35 NA 0.45
or, more in line with OP's expected result:
dcast(foverlaps(map, dums, nomatch = 0L), i.start + i.end ~ rowid(i.start),
value.var = "min10")
i.start i.end 1 2 3 4 5
1: 2013-04-01 00:00:00 2013-04-01 00:10:00 0.05 NA NA NA NA
2: 2013-04-01 00:10:00 2013-04-01 00:20:00 0.05 NA NA NA NA
3: 2013-04-01 00:20:00 2013-04-01 00:30:00 0.05 NA NA NA NA
4: 2013-04-01 00:30:00 2013-04-01 00:40:00 0.05 NA NA NA NA
5: 2013-04-01 00:40:00 2013-04-01 00:50:00 0.05 0.10 NA NA NA
---
311: 2013-04-03 04:40:00 2013-04-03 04:50:00 0.35 0.45 NA NA NA
312: 2013-04-03 04:50:00 2013-04-03 05:00:00 0.35 0.45 NA NA NA
313: 2013-04-03 05:00:00 2013-04-03 05:10:00 0.35 0.45 NA NA NA
314: 2013-04-03 05:10:00 2013-04-03 05:20:00 0.35 0.45 NA NA NA
315: 2013-04-03 05:20:00 2013-04-03 05:30:00 0.35 0.45 NA NA NA
Note that the parameter type = "within" has been skipped for brevity.
Data
# corrected
library(data.table)
dums <- fread(
" 2013-04-01 00:00:54 UTC 2013-04-01 01:00:10 UTC 0.05
2013-04-01 00:40:26 UTC 2013-04-01 01:00:00 UTC 0.1
2013-04-01 02:13:20 UTC 2013-04-01 04:53:42 UTC 0.15
2013-04-01 02:22:00 UTC 2013-04-01 04:33:12 UTC 0.2
2013-04-01 02:26:23 UTC 2013-04-01 04:05:12 UTC 0.25
2013-04-01 02:42:47 UTC 2013-04-01 04:34:33 UTC 0.3
2013-04-01 02:53:12 UTC 2013-04-03 05:27:05 UTC 0.35
2013-04-02 02:54:08 UTC 2013-04-02 05:31:15 UTC 0.4
2013-04-03 02:57:16 UTC 2013-04-03 05:29:32 UTC 0.45"
)
dums <- dums[, .(start = as.POSIXct(paste(V1, V2, V3)),
end = as.POSIXct(paste(V4, V5, V6)),
min10 = V7)]
setkey(dums, start, end)
ts <- seq(as.POSIXct("2013-04-01 00:00:00 UTC"),
as.POSIXct("2013-04-04 00:00:00 UTC"),
by = "10 min")
map <- data.table(start = head(ts, -1L), end = tail(ts, -1L),
key = c("start", "end"))
That's a good catch with the POSIXct time being off for 1 row. I feel super silly for having glossed over such an error in the input data.
The ultimate goal is to have 3 column variables: date (YYYY-MM-DD), start time (POSIXct), and end time (POSIXct),
the start and end times being 10-minute windows.
The number of days is 365, so effectively 365 * 144 (10-minute slices per day). The catch is that I have 450k rows of "dums" data, and min10 is not evenly spaced discrete intervals; it is continuous data. If I have to aggregate (sum, mean, sd, etc.), is there any way to combine dcast + aggregation + foverlaps (type = "within") + grouping? I could do it with a for loop, placing the min10 value from start to end, but that looks super time consuming and inefficient.
The output would be
5: 2013-04-01 00:40:00 2013-04-01 00:50:00 0.15
---
311: 2013-04-03 04:40:00 2013-04-03 04:50:00 0.80
map <- data.table(start = head(ts, -1L), end = tail(ts, -1L),
key = c("start", "end"))
# plus do something on the lines
dums[, .(count=.N, sum=sum(min10)), by = ID1]
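A minimal sketch of that aggregation, assuming the dums and map objects built in the answer above; the overlap join is grouped by 10-minute block instead of an ID column (untested against the full 450k-row data):
library(data.table)
ovl <- foverlaps(map, dums, nomatch = 0L)  # one row per (block, dums interval) overlap
ovl[, .(count = .N,
        sum   = sum(min10),
        mean  = mean(min10),
        sd    = sd(min10)),
    by = .(start = i.start, end = i.end)]  # one summary row per 10-minute block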
I have a data frame which looks like this:
times values
1 2013-07-06 20:00:00 0.02
2 2013-07-07 20:00:00 0.03
3 2013-07-09 20:00:00 0.13
4 2013-07-10 20:00:00 0.12
5 2013-07-11 20:00:00 0.03
6 2013-07-14 20:00:00 0.06
7 2013-07-15 20:00:00 0.08
8 2013-07-16 20:00:00 0.07
9 2013-07-17 20:00:00 0.08
There are a few dates missing from the data, and I would like to insert them and to carry over the value from the previous day into these new rows, i.e. obtain this:
times values
1 2013-07-06 20:00:00 0.02
2 2013-07-07 20:00:00 0.03
3 2013-07-08 20:00:00 0.03
4 2013-07-09 20:00:00 0.13
5 2013-07-10 20:00:00 0.12
6 2013-07-11 20:00:00 0.03
7 2013-07-12 20:00:00 0.03
8 2013-07-13 20:00:00 0.03
9 2013-07-14 20:00:00 0.06
10 2013-07-15 20:00:00 0.08
11 2013-07-16 20:00:00 0.07
12 2013-07-17 20:00:00 0.08
...
I have been trying to use a vector of all the dates:
dates <- as.Date(1:length(df),origin = df$times[1])
I am stuck, and can't find a way to do it without a horrible for loop in which I'm getting lost...
Thank you for your help
Some test data (I am using Date, yours seems to be a different type, but this does not affect the algorithm):
data = data.frame(dates = as.Date(c("2011-12-15", "2011-12-17", "2011-12-19")),
values = as.double(1:3))
# Generate **all** timestamps at which you want to have your result.
# I use `seq`, but you may use any other method of generating those timestamps.
alldates = seq(min(data$dates), max(data$dates), 1)
# Filter out timestamps that are already present in your `data.frame`:
# Construct a `data.frame` to append with missing values:
dates0 = alldates[!(alldates %in% data$dates)]
data0 = data.frame(dates = dates0, values = NA_real_)
# Append this `data.frame` and resort in time:
data = rbind(data, data0)
data = data[order(data$dates),]
# Forward-fill the values.
# (I would recommend moving this code into a separate `ffill` function;
# it has proved to be very useful in general.)
current = NA_real_
data$values = sapply(data$values, function(x) {
current <<- ifelse(is.na(x), current, x); current })
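For reference, a minimal sketch of such an ffill helper, wrapping the same sapply/<<- logic (the function name is only the one suggested in the comment above):
ffill <- function(x) {
  current <- NA_real_
  sapply(x, function(v) {
    current <<- ifelse(is.na(v), current, v)  # keep the last non-NA value seen
    current
  })
}
data$values <- ffill(data$values)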
library(zoo)
g <- data.frame(dates=seq(min(data$dates),max(data$dates),1))
na.locf(merge(g,data,by="dates",all.x=TRUE))
or entirely with zoo:
z <- read.zoo(data)
gz <- zoo(, seq(min(time(z)), max(time(z)), "day")) # time grid in zoo
na.locf(merge(z, gz))
Using tidyr's complete and fill, assuming the times column is already of class POSIXct.
library(tidyr)
df %>%
complete(times = seq(min(times), max(times), by = 'day')) %>%
fill(values)
# A tibble: 12 x 2
# times values
# <dttm> <dbl>
# 1 2013-07-06 20:00:00 0.02
# 2 2013-07-07 20:00:00 0.03
# 3 2013-07-08 20:00:00 0.03
# 4 2013-07-09 20:00:00 0.13
# 5 2013-07-10 20:00:00 0.12
# 6 2013-07-11 20:00:00 0.03
# 7 2013-07-12 20:00:00 0.03
# 8 2013-07-13 20:00:00 0.03
# 9 2013-07-14 20:00:00 0.06
#10 2013-07-15 20:00:00 0.08
#11 2013-07-16 20:00:00 0.07
#12 2013-07-17 20:00:00 0.08
data
df <- structure(list(times = structure(c(1373140800, 1373227200, 1373400000,
1373486400, 1373572800, 1373832000, 1373918400, 1374004800, 1374091200
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), values = c(0.02,
0.03, 0.13, 0.12, 0.03, 0.06, 0.08, 0.07, 0.08)), row.names = c(NA,
-9L), class = "data.frame")
df2 <- data.frame(times = seq(min(df$times), max(df$times), by = "day"))
df3 <- merge(x = df2, y = df, by = "times", all.x = TRUE)
idx <- which(is.na(df3$values))
for (id in idx)
  df3$values[id] <- df3$values[id - 1]
df3
# times values
# 1 2013-07-06 20:00:00 0.02
# 2 2013-07-07 20:00:00 0.03
# 3 2013-07-08 20:00:00 0.03
# 4 2013-07-09 20:00:00 0.13
# 5 2013-07-10 20:00:00 0.12
# 6 2013-07-11 20:00:00 0.03
# 7 2013-07-12 20:00:00 0.03
# 8 2013-07-13 20:00:00 0.03
# 9 2013-07-14 20:00:00 0.06
# 10 2013-07-15 20:00:00 0.08
# 11 2013-07-16 20:00:00 0.07
# 12 2013-07-17 20:00:00 0.08
You can try this:
library(data.table)
# NADayWiseOrders is assumed to be a data.table with columns date, orders, amount, guests
setkey(NADayWiseOrders, date)
all_dates <- seq(from = as.Date("2013-01-01"),
to = as.Date("2013-01-07"),
by = "days")
NADayWiseOrders[J(all_dates), roll=Inf]
date orders amount guests
1: 2013-01-01 50 2272.55 149
2: 2013-01-02 3 64.04 4
3: 2013-01-03 3 64.04 4
4: 2013-01-04 1 18.81 0
5: 2013-01-05 2 77.62 0
6: 2013-01-06 2 77.62 0
7: 2013-01-07 2 35.82 2
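The same rolling-join idea applied to the df from this question would look roughly like this (a hedged sketch; roll = Inf carries the last observed value forward into the inserted days, and dt/all_times are illustrative names):
library(data.table)
dt <- as.data.table(df)  # df: times (POSIXct), values (numeric)
all_times <- data.table(times = seq(min(dt$times), max(dt$times), by = "day"))
dt[all_times, on = "times", roll = Inf]  # LOCF fill for the missing days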