There are two time datasets. The first is data from a rain collector: a data frame of time intervals, ti, with start, end and p (total amount of rain per period, in mm):
ti <- data.frame(
start = c("2017-06-05 19:30:00", "2017-06-06 12:00:00"),
end = c("2017-06-05 23:30:00", "2017-06-06 14:00:00"),
p = c(16.4, 4.4)
)
ti[,1] <- as.POSIXct(ti[, 1])
ti[,2] <- as.POSIXct(ti[, 2])
The other dataset is a time series, ts, from a gauging station, with time and a parameter q, which is the water discharge (cubic metres per second):
ts <- data.frame(stringsAsFactors=FALSE,
time = c("2017-06-05 16:00:00", "2017-06-05 19:00:00",
"2017-06-05 21:00:00", "2017-06-05 23:00:00",
"2017-06-06 9:00:00", "2017-06-06 11:00:00", "2017-06-06 13:00:00",
"2017-06-06 16:00:00", "2017-06-06 17:00:00"),
q = c(0.78, 0.84, 0.9, 0.78, 0.78, 0.78, 0.78, 1.22, 1.25)
)
ts[,1] <- as.POSIXct(ts[,1])
I need to intersect the time series with the time intervals and create a new column in ts that is TRUE if the row falls within a rain interval and FALSE if it does not, like this:
time q rain
1 2017-06-05 16:00:00 0.78 FALSE
2 2017-06-05 19:00:00 0.84 FALSE
3 2017-06-05 21:00:00 0.90 TRUE # there was rain
4 2017-06-05 23:00:00 0.78 TRUE # there was rain
5 2017-06-06 9:00:00 0.78 FALSE
6 2017-06-06 11:00:00 0.78 FALSE
7 2017-06-06 13:00:00 0.78 TRUE # there was rain
8 2017-06-06 16:00:00 1.22 FALSE
9 2017-06-06 17:00:00 1.25 FALSE
Do you have any ideas on how to do such a simple operation?
With sqldf:
library(sqldf)
sqldf('select ts.*, case when ti.p is not null then 1 else 0 end as rain
from ts
left join ti
on start <= time and
time <= end')
Result:
time q rain
1 2017-06-05 16:00:00 0.78 0
2 2017-06-05 19:00:00 0.84 0
3 2017-06-05 21:00:00 0.90 1
4 2017-06-05 23:00:00 0.78 1
5 2017-06-06 9:00:00 0.78 0
6 2017-06-06 11:00:00 0.78 0
7 2017-06-06 13:00:00 0.78 1
8 2017-06-06 16:00:00 1.22 0
9 2017-06-06 17:00:00 1.25 0
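If a logical TRUE/FALSE column is wanted, as in the question, a base R alternative is to test each timestamp against all intervals. This is only a minimal sketch that rebuilds the question's data; the explicit tz = "UTC" is an assumption added here for reproducibility, and the intervals are treated as closed (start <= time <= end), as in the sqldf query.

```r
# Rain intervals (tz = "UTC" is an assumption, not part of the question)
ti <- data.frame(
  start = as.POSIXct(c("2017-06-05 19:30:00", "2017-06-06 12:00:00"), tz = "UTC"),
  end   = as.POSIXct(c("2017-06-05 23:30:00", "2017-06-06 14:00:00"), tz = "UTC"),
  p     = c(16.4, 4.4)
)

# Discharge time series
ts <- data.frame(
  time = as.POSIXct(c("2017-06-05 16:00:00", "2017-06-05 19:00:00",
                      "2017-06-05 21:00:00", "2017-06-05 23:00:00",
                      "2017-06-06 09:00:00", "2017-06-06 11:00:00",
                      "2017-06-06 13:00:00", "2017-06-06 16:00:00",
                      "2017-06-06 17:00:00"), tz = "UTC"),
  q = c(0.78, 0.84, 0.9, 0.78, 0.78, 0.78, 0.78, 1.22, 1.25)
)

# For every row of ts, check whether its time lies inside any interval of ti
ts$rain <- vapply(
  seq_len(nrow(ts)),
  function(i) any(ti$start <= ts$time[i] & ts$time[i] <= ti$end),
  logical(1)
)
ts
```

This loops over the rows of ts and compares against every interval, so it suits small to medium data; for very long series a non-equi join would probably scale better.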
I have two long time series to compare, but they are sampled completely differently: the first is hourly, the second is irregular.
I would like to compare Value1 and Value2, so I want to select the Value1 records from df1 at 02:00 on the dates that appear in df2. How can I do this in R?
df1:
Date1                Value1
2014-01-01 01:00:00  0.16
2014-01-01 02:00:00  0.13
2014-01-01 03:00:00  0.6
2014-01-02 01:00:00  0.5
2014-01-02 02:00:00  0.22
2014-01-02 03:00:00  0.17
2014-01-19 01:00:00  0.2
2014-01-19 02:00:00  0.11
2014-01-19 03:00:00  0.15
2014-01-21 01:00:00  0.13
2014-01-21 02:00:00  0.33
2014-01-21 03:00:00  0.1
2014-01-23 01:00:00  0.09
2014-01-23 02:00:00  0.02
2014-01-23 03:00:00  0.16
df2:
Date2       Value2
2014-01-01  13
2014-01-19  76
2014-01-23  8
desired output:
df_fused:
Date1                Value1  Value2
2014-01-01 02:00:00  0.13    13
2014-01-19 02:00:00  0.11    76
2014-01-23 02:00:00  0.02    8
Here is a data.table approach:
library( data.table )
# sample data; for existing data frames you can use setDT(df1); setDT(df2) instead
df1 <- fread("Date1 Value1
2014-01-01 01:00:00 0.16
2014-01-01 02:00:00 0.13
2014-01-01 03:00:00 0.6
2014-01-02 01:00:00 0.5
2014-01-02 02:00:00 0.22
2014-01-02 03:00:00 0.17
2014-01-19 01:00:00 0.2
2014-01-19 02:00:00 0.11
2014-01-19 03:00:00 0.15
2014-01-21 01:00:00 0.13
2014-01-21 02:00:00 0.33
2014-01-21 03:00:00 0.1
2014-01-23 01:00:00 0.09
2014-01-23 02:00:00 0.02
2014-01-23 03:00:00 0.16")
df2 <- fread("Date2 Value2
2014-01-01 13
2014-01-19 76
2014-01-23 8")
#set dates to posix
df1[, Date1 := as.POSIXct( Date1, format = "%Y-%m-%d %H:%M:%S", tz = "UTC" )]
#set df2 dates to 02:00:00 time (paste() inserts the space that the format string expects)
df2[, Date2 := as.POSIXct( paste( Date2, "02:00:00" ), format = "%Y-%m-%d %H:%M:%S", tz = "UTC" )]
#join
df2[ df1, Value1 := i.Value1, on = .(Date2 = Date1)][]
# Date2 Value2 Value1
# 1: 2014-01-01 02:00:00 13 0.13
# 2: 2014-01-19 02:00:00 76 0.11
# 3: 2014-01-23 02:00:00 8 0.02
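The same idea (move the df2 dates to 02:00 and join on the timestamp) can also be sketched in base R with merge. This is a hedged illustration rather than a drop-in replacement: the data frames are rebuilt by hand from the question, tz = "UTC" mirrors the answer above, and the helper vectors days and stamps are names introduced here.

```r
# Rebuild df1 from the question: five days, three hourly readings each
days   <- c("2014-01-01", "2014-01-02", "2014-01-19", "2014-01-21", "2014-01-23")
stamps <- paste(rep(days, each = 3), c("01:00:00", "02:00:00", "03:00:00"))
df1 <- data.frame(
  Date1  = as.POSIXct(stamps, tz = "UTC"),
  Value1 = c(0.16, 0.13, 0.6, 0.5, 0.22, 0.17, 0.2, 0.11, 0.15,
             0.13, 0.33, 0.1, 0.09, 0.02, 0.16)
)

# df2 holds dates only; attach the 02:00:00 time so the keys are comparable
df2 <- data.frame(
  Date2  = c("2014-01-01", "2014-01-19", "2014-01-23"),
  Value2 = c(13, 76, 8)
)
df2$Date1 <- as.POSIXct(paste(df2$Date2, "02:00:00"), tz = "UTC")

# Inner join on the timestamp keeps only the 02:00 rows for df2's dates
df_fused <- merge(df1, df2[, c("Date1", "Value2")], by = "Date1")
df_fused
```

merge performs an inner join by default, so days that are missing from df2 (2014-01-02 and 2014-01-21 here) drop out of the result.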
I have a dataframe that contains hourly weather information. I would like to increase the granularity of the time measurements (5 minute intervals instead of 60 minute intervals) while copying the other columns data into the new rows created:
Current Dataframe Structure:
Date Temperature Humidity
2015-01-01 00:00:00 25 0.67
2015-01-01 01:00:00 26 0.69
Target Dataframe Structure:
Date Temperature Humidity
2015-01-01 00:00:00 25 0.67
2015-01-01 00:05:00 25 0.67
2015-01-01 00:10:00 25 0.67
.
.
.
2015-01-01 00:55:00 25 0.67
2015-01-01 01:00:00 26 0.69
2015-01-01 01:05:00 26 0.69
2015-01-01 01:10:00 26 0.69
.
.
.
What I've Tried:
for(i in 1:nrow(df)) {
five.minutes <- seq(df$date[i], length = 12, by = "5 mins")
for(j in 1:length(five.minutes)) {
df$date[i]<-rbind(five.minutes[j])
}
}
Error I'm getting:
Error in as.POSIXct.numeric(value) : 'origin' must be supplied
One possible solution is to use fill from tidyr and right_join from dplyr.
The approach is to create a date/time series from the minimum time to the maximum time plus 55 minutes (the code adds 3540 seconds, so the 5-minute sequence ends at +55 minutes). Right-joining the dataframe to this series gives all the desired rows, with NA for Temperature and Humidity in the new ones. Then use fill to populate the NA values with the previous valid values.
library(dplyr)
library(tidyr)
# Data
df <- read.table(text = "Date Temperature Humidity
'2015-01-01 00:00:00' 25 0.67
'2015-01-01 01:00:00' 26 0.69
'2015-01-01 02:00:00' 28 0.69
'2015-01-01 03:00:00' 25 0.69", header = T, stringsAsFactors = F)
df$Date <- as.POSIXct(df$Date, format = "%Y-%m-%d %H:%M:%S")
# Create a dataframe with all possible date/times at intervals of 5 mins
Dates <- data.frame(Date = seq(min(df$Date), max(df$Date)+3540, by = 5*60))
result <- df %>%
right_join(Dates, by="Date") %>%
fill(Temperature, Humidity)
result
# Date Temperature Humidity
#1 2015-01-01 00:00:00 25 0.67
#2 2015-01-01 00:05:00 25 0.67
#3 2015-01-01 00:10:00 25 0.67
#4 2015-01-01 00:15:00 25 0.67
#5 2015-01-01 00:20:00 25 0.67
#6 2015-01-01 00:25:00 25 0.67
#7 2015-01-01 00:30:00 25 0.67
#8 2015-01-01 00:35:00 25 0.67
#9 2015-01-01 00:40:00 25 0.67
#10 2015-01-01 00:45:00 25 0.67
#11 2015-01-01 00:50:00 25 0.67
#12 2015-01-01 00:55:00 25 0.67
#13 2015-01-01 01:00:00 26 0.69
#14 2015-01-01 01:05:00 26 0.69
#.....
#.....
#44 2015-01-01 03:35:00 25 0.69
#45 2015-01-01 03:40:00 25 0.69
#46 2015-01-01 03:45:00 25 0.69
#47 2015-01-01 03:50:00 25 0.69
#48 2015-01-01 03:55:00 25 0.69
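When the values only need to be repeated, the join and fill steps can also be replaced by an index lookup in base R. The sketch below is an illustration under assumptions: it uses the two-row data from the question, a fixed tz = "UTC", and the names grid and idx are introduced here.

```r
# Hourly data from the question (tz = "UTC" is an assumption)
df <- data.frame(
  Date        = as.POSIXct(c("2015-01-01 00:00:00", "2015-01-01 01:00:00"), tz = "UTC"),
  Temperature = c(25, 26),
  Humidity    = c(0.67, 0.69)
)

# 5-minute grid covering the last hour as well (+55 minutes)
grid <- seq(min(df$Date), max(df$Date) + 55 * 60, by = "5 min")

# For each grid time, find the index of the latest hourly row at or before it
idx <- findInterval(as.numeric(grid), as.numeric(df$Date))

# Repeat the hourly values on the finer grid
result <- data.frame(Date = grid,
                     df[idx, c("Temperature", "Humidity")],
                     row.names = NULL)
head(result, 13)
```

findInterval requires the hourly timestamps to be sorted in increasing order, which is assumed here.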
I think this might do:
library(tibble)
library(lubridate)
df=tibble(DateTime=c("2015-01-01 00:00:00","2015-01-01 01:00:00"),Temperature=c(25,26),Humidity=c(.67,.69))
df$DateTime<-ymd_hms(df$DateTime)
DateTime=as.POSIXct((sapply(1:(nrow(df)-1),function(x) seq(from=df$DateTime[x],to=df$DateTime[x+1],by="5 min"))),
origin="1970-01-01", tz="UTC")
Temperature=c(sapply(1:(nrow(df)-1),function(x) rep(df$Temperature[x],12)),df$Temperature[nrow(df)])
Humidity=c(sapply(1:(nrow(df)-1),function(x) rep(df$Humidity[x],12)),df$Humidity[nrow(df)])
tibble(as.character(DateTime),Temperature,Humidity)
<chr> <dbl> <dbl>
1 2015-01-01 00:00:00 25.0 0.670
2 2015-01-01 00:05:00 25.0 0.670
3 2015-01-01 00:10:00 25.0 0.670
4 2015-01-01 00:15:00 25.0 0.670
5 2015-01-01 00:20:00 25.0 0.670
6 2015-01-01 00:25:00 25.0 0.670
7 2015-01-01 00:30:00 25.0 0.670
8 2015-01-01 00:35:00 25.0 0.670
9 2015-01-01 00:40:00 25.0 0.670
10 2015-01-01 00:45:00 25.0 0.670
11 2015-01-01 00:50:00 25.0 0.670
12 2015-01-01 00:55:00 25.0 0.670
13 2015-01-01 01:00:00 26.0 0.690
I have a dataframe that looks like this:
dat <- data.frame(time = seq(as.POSIXct("2010-01-01"),
as.POSIXct("2016-12-31") + 60*99,
by = 60*15),
radiation = sample(1:500, 245383, replace = TRUE))
So I have a measurement value every 15 minutes. The structure is:
> str(dat)
'data.frame': 245383 obs. of 2 variables:
$ time : POSIXct, format: "2010-01-01 00:00:00" "2010-01-01 00:15:00" "2010-01-01 00:30:00" "2010-01-01 00:45:00" ...
$ radiation: num 230 443 282 314 286 225 77 89 97 330 ...
Now I want to interpolate, so my aim is a dataframe with values for every minute.
I searched a few times and tried some methods with the zoo package, but I have some problems with the dataframe. Do I have to convert it to a text file, I guess? I have no idea how to do that.
Here is a tidyverse solution.
library('tidyverse')
dat <- data.frame(time = seq(as.POSIXct("2010-01-01"),
as.POSIXct("2016-12-31") + 60*99,
by = 60*15),
radiation = sample(1:500, 245383, replace = TRUE))
dat <- head(dat, 3)
dat
# time radiation
# 1 2010-01-01 00:00:00 241
# 2 2010-01-01 00:15:00 438
# 3 2010-01-01 00:30:00 457
You can create a data frame with all of the required times. Using full_join will make the missing radiation values be NA.
approx will fill the NAs with a linear approximation.
dat %>%
full_join(data.frame(time = seq(
from = min(.$time),
to = max(.$time),
by = 'min'))) %>%
arrange(time) %>%
mutate(radiation = approx(radiation, n = n())$y)
# Joining, by = "time"
# time radiation
# 1 2010-01-01 00:00:00 241.0000
# 2 2010-01-01 00:01:00 254.1333
# 3 2010-01-01 00:02:00 267.2667
# 4 2010-01-01 00:03:00 280.4000
# 5 2010-01-01 00:04:00 293.5333
# 6 2010-01-01 00:05:00 306.6667
# 7 2010-01-01 00:06:00 319.8000
# 8 2010-01-01 00:07:00 332.9333
# 9 2010-01-01 00:08:00 346.0667
# 10 2010-01-01 00:09:00 359.2000
# 11 2010-01-01 00:10:00 372.3333
# 12 2010-01-01 00:11:00 385.4667
# 13 2010-01-01 00:12:00 398.6000
# 14 2010-01-01 00:13:00 411.7333
# 15 2010-01-01 00:14:00 424.8667
# 16 2010-01-01 00:15:00 438.0000
# 17 2010-01-01 00:16:00 439.2667
# 18 2010-01-01 00:17:00 440.5333
# 19 2010-01-01 00:18:00 441.8000
# 20 2010-01-01 00:19:00 443.0667
# 21 2010-01-01 00:20:00 444.3333
# 22 2010-01-01 00:21:00 445.6000
# 23 2010-01-01 00:22:00 446.8667
# 24 2010-01-01 00:23:00 448.1333
# 25 2010-01-01 00:24:00 449.4000
# 26 2010-01-01 00:25:00 450.6667
# 27 2010-01-01 00:26:00 451.9333
# 28 2010-01-01 00:27:00 453.2000
# 29 2010-01-01 00:28:00 454.4667
# 30 2010-01-01 00:29:00 455.7333
# 31 2010-01-01 00:30:00 457.0000
You can use the approx function like this:
dat <- data.frame(time = seq(as.POSIXct("2016-12-01"),
as.POSIXct("2016-12-31") + 60*99,
by = 60*15),
radiation = sample(1:500, 2887, replace = TRUE))
mins <- seq(as.POSIXct("2016-12-01"),
as.POSIXct("2016-12-31") + 60*99,
by = 60)
out <- approx(dat$time, dat$radiation, mins)
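approx returns a list with components x and y rather than a data frame, so the result usually has to be reassembled. A small deterministic sketch follows; the three sample values are made up for illustration, and the timestamps are converted with as.numeric explicitly so that the interpolation works on seconds.

```r
# Three 15-minute readings (values are illustrative, not from the question)
dat <- data.frame(
  time      = as.POSIXct("2016-12-01 00:00:00", tz = "UTC") + c(0, 15, 30) * 60,
  radiation = c(100, 130, 100)
)

# One timestamp per minute over the observed range
mins <- seq(min(dat$time), max(dat$time), by = "min")

# Linear interpolation on the numeric (seconds) representation of the times
out <- approx(as.numeric(dat$time), dat$radiation, xout = as.numeric(mins))

# Put the interpolated values back next to proper POSIXct timestamps
res <- data.frame(time = mins, radiation = out$y)
head(res)
```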
Here is a solution using pad from the padr package to fill the gaps in your time column. na.approx is used for interpolation.
library(padr)
library(zoo)
dat[1:2, ]
time radiation
#1 2010-01-01 00:00:00 133
#2 2010-01-01 00:15:00 187
dat_padded <- pad(dat[1:2, ], interval = "min")
dat_padded$radiation <- zoo::na.approx(dat_padded$radiation)
dat_padded
time radiation
#1 2010-01-01 00:00:00 133.0
#2 2010-01-01 00:01:00 136.6
#3 2010-01-01 00:02:00 140.2
#4 2010-01-01 00:03:00 143.8
#5 2010-01-01 00:04:00 147.4
#6 2010-01-01 00:05:00 151.0
#7 2010-01-01 00:06:00 154.6
#8 2010-01-01 00:07:00 158.2
#9 2010-01-01 00:08:00 161.8
#10 2010-01-01 00:09:00 165.4
#11 2010-01-01 00:10:00 169.0
#12 2010-01-01 00:11:00 172.6
#13 2010-01-01 00:12:00 176.2
#14 2010-01-01 00:13:00 179.8
#15 2010-01-01 00:14:00 183.4
#16 2010-01-01 00:15:00 187.0
data
set.seed(1)
dat <-
data.frame(
time = seq(
as.POSIXct("2010-01-01"),
as.POSIXct("2016-12-31") + 60 * 99,
by = 60 * 15
),
radiation = sample(1:500, 245383, replace = TRUE)
)
I have the following dataframe (ts1):
D1 Value N
1 20/11/2014 16:00 0.00
2 20/11/2014 17:00 0.01 1
3 20/11/2014 19:00 0.05 2
4 20/11/2014 22:00 0.20 3
5 20/11/2014 23:00 0.03 4
I would like to insert (N-1) new rows wherever N > 1, so that the new ts1 will be:
D1 Value N
1 20/11/2014 16:00 0.00 1
2 20/11/2014 17:00 0.01 1
3 20/11/2014 18:00 0.03 1 <---
4 20/11/2014 19:00 0.05 1
5 20/11/2014 20:00 0.10 1 <---
6 20/11/2014 21:00 0.15 1 <---
7 20/11/2014 22:00 0.20 1
8 20/11/2014 23:00 0.03 1
As can be seen, rows 3, 5 and 6 were added because of the gaps in time (N > 1). The missing ts1$Value entries are filled in by splitting the difference between the neighbouring values evenly across the new rows. I would like to add the values as efficiently as possible, with a minimum number of passes over the dataframe.
Here is the complete solution. The linear interpolation in the last command (na.approx) solves the issue:
> Lines <- "D1,Value
+ 1,20/11/2014 16:00,0.00
+ 2,20/11/2014 17:00,0.01
+ 3,20/11/2014 19:00,0.05
+ 4,20/11/2014 22:00,0.20
+ 5,20/11/2014 23:00,0.03"
> ts1 <- read.csv(text = Lines, as.is = TRUE)
> library(zoo)
> z <- read.zoo(ts1, tz = "", format = "%d/%m/%Y %H:%M")
>
> z0 <- zoo(, seq(start(z), end(z), "hours"))
> zz <- merge(z, z0)
> interpolated <- na.approx(zz)
> interpolated
2014-11-20 16:00:00 2014-11-20 17:00:00 2014-11-20 18:00:00 2014-11-20 19:00:00 2014-11-20 20:00:00 2014-11-20 21:00:00
0.00 0.01 0.03 0.05 0.10 0.15
2014-11-20 22:00:00 2014-11-20 23:00:00
0.20 0.03
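For comparison, the same gap filling can be sketched without zoo, using seq to build the hourly grid and approx for the linear interpolation. This is an illustrative alternative, not the answer's method; tz = "UTC" and the names hours and filled are assumptions introduced here.

```r
# Data from the question, parsed with the day/month/year format
ts1 <- data.frame(
  D1 = as.POSIXct(c("20/11/2014 16:00", "20/11/2014 17:00", "20/11/2014 19:00",
                    "20/11/2014 22:00", "20/11/2014 23:00"),
                  format = "%d/%m/%Y %H:%M", tz = "UTC"),
  Value = c(0.00, 0.01, 0.05, 0.20, 0.03)
)

# Complete hourly grid between the first and last observation
hours <- seq(min(ts1$D1), max(ts1$D1), by = "hour")

# Interpolate linearly at every grid point (existing points keep their value)
filled <- data.frame(
  D1    = hours,
  Value = approx(as.numeric(ts1$D1), ts1$Value, xout = as.numeric(hours))$y
)
filled
```

The inserted rows at 18:00, 20:00 and 21:00 receive 0.03, 0.10 and 0.15, matching the desired output in the question.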
I have a data frame which looks like this:
times values
1 2013-07-06 20:00:00 0.02
2 2013-07-07 20:00:00 0.03
3 2013-07-09 20:00:00 0.13
4 2013-07-10 20:00:00 0.12
5 2013-07-11 20:00:00 0.03
6 2013-07-14 20:00:00 0.06
7 2013-07-15 20:00:00 0.08
8 2013-07-16 20:00:00 0.07
9 2013-07-17 20:00:00 0.08
There are a few dates missing from the data, and I would like to insert them and to carry over the value from the previous day into these new rows, i.e. obtain this:
times values
1 2013-07-06 20:00:00 0.02
2 2013-07-07 20:00:00 0.03
3 2013-07-08 20:00:00 0.03
4 2013-07-09 20:00:00 0.13
5 2013-07-10 20:00:00 0.12
6 2013-07-11 20:00:00 0.03
7 2013-07-12 20:00:00 0.03
8 2013-07-13 20:00:00 0.03
9 2013-07-14 20:00:00 0.06
10 2013-07-15 20:00:00 0.08
11 2013-07-16 20:00:00 0.07
12 2013-07-17 20:00:00 0.08
...
I have been trying to use a vector of all the dates:
dates <- as.Date(1:length(df),origin = df$times[1])
I am stuck, and can't find a way to do it without a horrible for loop in which I'm getting lost...
Thank you for your help
Some test data (I am using Date, yours seems to be a different type, but this does not affect the algorithm):
data = data.frame(dates = as.Date(c("2011-12-15", "2011-12-17", "2011-12-19")),
values = as.double(1:3))
# Generate **all** timestamps at which you want to have your result.
# I use `seq`, but you may use any other method of generating those timestamps.
alldates = seq(min(data$dates), max(data$dates), 1)
# Filter out timestamps that are already present in your `data.frame`:
dates0 = alldates[!(alldates %in% data$dates)]
# Construct a `data.frame` to append with missing values:
data0 = data.frame(dates = dates0, values = NA_real_)
# Append this `data.frame` and resort in time:
data = rbind(data, data0)
data = data[order(data$dates),]
# Forward fill the values.
# I would recommend moving this code into a separate `ffill` function
# (it has proved to be very useful in general):
current = NA_real_
data$values = sapply(data$values, function(x) {
current <<- ifelse(is.na(x), current, x); current })
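The separate `ffill` function recommended above could look like the following vectorised sketch. It is one possible implementation, not the answer's own code: it avoids the global assignment by computing, for every position, the index of the most recent non-NA value.

```r
# Forward fill: replace each NA with the last non-NA value before it.
# Leading NAs (with nothing before them) stay NA.
ffill <- function(x) {
  # Index of each non-NA element, 0 where the element is NA
  idx <- seq_along(x) * !is.na(x)
  # Running maximum = position of the latest non-NA value seen so far
  idx <- cummax(idx)
  # Positions before the first non-NA value have no source to copy from
  idx[idx == 0] <- NA
  x[idx]
}

ffill(c(NA, 1, NA, NA, 3, NA))
```

With the test data from this answer it would be used as data$values = ffill(data$values) after sorting by date.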
library(zoo)
g <- data.frame(dates=seq(min(data$dates),max(data$dates),1))
na.locf(merge(g,data,by="dates",all.x=TRUE))
or entirely with zoo:
z <- read.zoo(data)
gz <- zoo(, seq(min(time(z)), max(time(z)), "day")) # time grid in zoo
na.locf(merge(z, gz))
Using tidyr's complete and fill, assuming the times column is already of class POSIXct:
library(tidyr)
df %>%
complete(times = seq(min(times), max(times), by = 'day')) %>%
fill(values)
# A tibble: 12 x 2
# times values
# <dttm> <dbl>
# 1 2013-07-06 20:00:00 0.02
# 2 2013-07-07 20:00:00 0.03
# 3 2013-07-08 20:00:00 0.03
# 4 2013-07-09 20:00:00 0.13
# 5 2013-07-10 20:00:00 0.12
# 6 2013-07-11 20:00:00 0.03
# 7 2013-07-12 20:00:00 0.03
# 8 2013-07-13 20:00:00 0.03
# 9 2013-07-14 20:00:00 0.06
#10 2013-07-15 20:00:00 0.08
#11 2013-07-16 20:00:00 0.07
#12 2013-07-17 20:00:00 0.08
data
df <- structure(list(times = structure(c(1373140800, 1373227200, 1373400000,
1373486400, 1373572800, 1373832000, 1373918400, 1374004800, 1374091200
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), values = c(0.02,
0.03, 0.13, 0.12, 0.03, 0.06, 0.08, 0.07, 0.08)), row.names = c(NA,
-9L), class = "data.frame")
df2 <- data.frame(times=seq(min(df$times), max(df$times), by="day"))
df3 <- merge(x=df2, y=df, by="times", all.x=T)
idx <- which(is.na(df3$values))
for (id in idx)
df3$values[id] <- df3$values[id-1]
df3
# times values
# 1 2013-07-06 20:00:00 0.02
# 2 2013-07-07 20:00:00 0.03
# 3 2013-07-08 20:00:00 0.03
# 4 2013-07-09 20:00:00 0.13
# 5 2013-07-10 20:00:00 0.12
# 6 2013-07-11 20:00:00 0.03
# 7 2013-07-12 20:00:00 0.03
# 8 2013-07-13 20:00:00 0.03
# 9 2013-07-14 20:00:00 0.06
# 10 2013-07-15 20:00:00 0.08
# 11 2013-07-16 20:00:00 0.07
# 12 2013-07-17 20:00:00 0.08
You can try this:
setkey(NADayWiseOrders, date)
all_dates <- seq(from = as.Date("2013-01-01"),
to = as.Date("2013-01-07"),
by = "days")
NADayWiseOrders[J(all_dates), roll=Inf]
date orders amount guests
1: 2013-01-01 50 2272.55 149
2: 2013-01-02 3 64.04 4
3: 2013-01-03 3 64.04 4
4: 2013-01-04 1 18.81 0
5: 2013-01-05 2 77.62 0
6: 2013-01-06 2 77.62 0
7: 2013-01-07 2 35.82 2
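NADayWiseOrders is not defined in this answer, so the snippet cannot be run as it stands. The sketch below reconstructs a plausible input from the printed result (an assumption: the rows for 2013-01-03 and 2013-01-06 are taken to be the missing ones, since they repeat the previous day's values) and shows the rolling join end to end.

```r
library(data.table)

# Hypothetical input reconstructed from the output above (two days missing)
NADayWiseOrders <- data.table(
  date   = as.Date(c("2013-01-01", "2013-01-02", "2013-01-04",
                     "2013-01-05", "2013-01-07")),
  orders = c(50, 3, 1, 2, 2),
  amount = c(2272.55, 64.04, 18.81, 77.62, 35.82),
  guests = c(149, 4, 0, 0, 2)
)

# Key the table by date so it can be joined on that column
setkey(NADayWiseOrders, date)

# Every calendar day in the range of interest
all_dates <- seq(from = as.Date("2013-01-01"),
                 to   = as.Date("2013-01-07"),
                 by   = "days")

# roll = Inf carries the last observation forward to dates with no exact match
filled <- NADayWiseOrders[J(all_dates), roll = Inf]
filled
```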