Linear interpolation in R

I have this data.frame (12x2) called df_1, which contains monthly values:
month df_test
[1,] 1 -1.4408567
[2,] 2 -1.0007642
[3,] 3 2.1454113
[4,] 4 1.6935537
[5,] 5 0.1149219
[6,] 6 -1.3205144
[7,] 7 1.0277486
[8,] 8 1.0323482
[9,] 9 -0.1442319
[10,] 10 -0.2091197
[11,] 11 -0.6803158
[12,] 12 0.5965196
and this data.frame (8760x2) called df_2, where each row represents the value associated with a one-hour interval of a day. This data.frame contains hourly values for one year:
time df_time
1 2015-01-01 00:00:00 -0.4035650
2 2015-01-01 01:00:00 0.1800579
3 2015-01-01 02:00:00 -0.3770589
4 2015-01-01 03:00:00 0.2573456
5 2015-01-01 04:00:00 1.2000178
6 2015-01-01 05:00:00 -0.4276127
...........................................
time df_time
8755 2015-12-31 18:00:00 1.3540119
8756 2015-12-31 19:00:00 0.4852843
8757 2015-12-31 20:00:00 -0.9194670
8758 2015-12-31 21:00:00 -1.0751814
8759 2015-12-31 22:00:00 1.0097749
8760 2015-12-31 23:00:00 -0.1032468
I want to expand df_1 to each hour of each day. The problem is that not all months have the same number of days.
The result should be a data.frame called df_3 (8760x2) containing values interpolated between the values of df_1.
Thanks for the help!

Here's how it's done with zoo. I'm assuming that each monthly value is associated with a specific datetime stamp (the middle of the month, at midnight); you have to choose one. If you want a different datetime stamp, just change the value.
library(zoo)
library(dplyr)
library(tidyr)

df_3 <- df_1 %>%
  # anchor each monthly value on the 15th of its month, at midnight
  mutate(time = paste(2015, month, "15 00:00:00", sep = "-"),
         time = as.POSIXct(strptime(time, "%Y-%m-%d %H:%M:%S"))) %>%
  full_join(df_2, by = "time") %>%
  arrange(time) %>%
  # linear interpolation; rule = 2 carries the first/last value beyond the anchors
  mutate(df_test = na.approx(df_test, rule = 2))
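For reference, the same idea can be sketched in base R with approx(). This is only a sketch, assuming df_1 and df_2 are as shown above (df_2$time already POSIXct, time zones matching) and anchoring each monthly value on the 15th at midnight:
# hedged base-R sketch, not the zoo answer above
month_time <- as.POSIXct(sprintf("2015-%02d-15 00:00:00", as.integer(df_1$month)))
df_3 <- data.frame(
  time    = df_2$time,
  df_test = approx(x    = as.numeric(month_time),
                   y    = df_1$df_test,
                   xout = as.numeric(df_2$time),
                   rule = 2)$y)   # rule = 2: constant extension outside the anchor range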

Related

R Error in match.fun(FUN) : object 'Hour' not found after replacement inside column 'Hour'

I have the data frame (df) below, from ENTSO-E, showing German power prices. I created the "Hour" column with the lubridate function hour(df$date). The output was a range (1, 2, ..., 23, 0).
# to replace 0 with 24
df["Hour"][df["Hour"]=="0"]<- "24"
I need to work on an hourly basis, so I filtered each hour from 1 to 24, but I cannot filter the replaced hour, H24.
H1 <- df %>%
  filter(Hour == 1)
H24 <- df %>%
  filter(Hour == 24)
Error in match.fun(FUN) : object 'Hour' not found
The 24 values are still in the Hour column, and its class is numeric, but I cannot do any calculation with the Hour column.
class(df$Hour)
[1] "numeric"
mean(german_last_4$Hour)
[1] NA
I think the problem is with the replacement. Is there another way to produce a result that works for H24?
date                  price   Hour
2019-01-01 01:00:00   28.32      1
2019-01-01 02:00:00   10.07      2
2019-01-01 03:00:00   -4.08      3
2019-01-01 04:00:00   -9.91      4
2019-01-01 05:00:00   -7.41      5
2019-01-01 06:00:00  -12.55      6
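One hedged guess, offered as a sketch rather than a confirmed fix: assigning the character "24" through df["Hour"][...] can coerce the column, so doing the replacement numerically may keep filter() working:
# hedged sketch: replace 0 with 24 while keeping Hour numeric
library(dplyr)
df <- df %>% mutate(Hour = ifelse(Hour == 0, 24, Hour))
H24 <- df %>% filter(Hour == 24)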

Finding each time of daily max variable in climate data

I have a large dataset over many years with several variables, but the ones I am interested in are wind speed and dateTime. I want to find the time of the maximum wind speed for every day in the data set. I have hourly data in POSIXct format, with WS as a numeric with occasional NAs. Below is a short data set that should illustrate my point; my dateTime didn't work out to be hourly data, but it provides enough for a sample.
dateTime <- seq(as.POSIXct("2011-01-01 00:00:00", tz = "GMT"),
                as.POSIXct("2011-01-29 23:00:00", tz = "GMT"),
                by = 60*24)
WS <- sample(0:20, 1738, rep = TRUE)
WD <- sample(0:390, 1738, rep = TRUE)
Temp <- sample(0:40, 1738, rep = TRUE)
df <- data.frame(dateTime, WS, WD, Temp)
df$WS[WS > 15] <- NA
I have previously tried creating a new column with just a POSIX date (minus the time) to allow for day isolation; however, all the things I tried only returned a shortened data frame with date and WS (aggregate, splitting, xts). aggregate was the only one that didn't do this, but it gave me 23:00:00 as a constant time, which isn't correct.
I have looked at How to calculate daily means, medians, from weather variables data collected hourly in R?, https://stats.stackexchange.com/questions/7268/how-to-aggregate-by-minute-data-for-a-week-into-hourly-means and others, but none have answered this question, or the solutions have not returned an ideal result.
I need to compare the results of this analysis with another data frame, hence I need the actual time when the max wind speed occurred for each day in the dataset. I have a feeling there is a simple solution; however, this has me frustrated.
A dplyr solution may be:
library(dplyr)

df %>%
  mutate(date = as.Date(dateTime)) %>%
  left_join(
    df %>%
      mutate(date = as.Date(dateTime)) %>%
      group_by(date) %>%
      summarise(max_ws = max(WS, na.rm = TRUE)) %>%
      ungroup(),
    by = "date"
  ) %>%
  select(-date)
# dateTime WS WD Temp max_ws
# 1 2011-01-01 00:00:00 NA 313 2 15
# 2 2011-01-01 00:24:00 7 376 1 15
# 3 2011-01-01 00:48:00 3 28 28 15
# 4 2011-01-01 01:12:00 15 262 24 15
# 5 2011-01-01 01:36:00 1 149 34 15
# 6 2011-01-01 02:00:00 4 319 33 15
# 7 2011-01-01 02:24:00 15 280 22 15
# 8 2011-01-01 02:48:00 NA 110 23 15
# 9 2011-01-01 03:12:00 12 93 15 15
# 10 2011-01-01 03:36:00 3 5 0 15
Dee asked: "I want to find the time of the max wind speed for every day in the data set." Other answers have calculated the max(WS) for every day, but not the hour at which it occurred.
So I propose the following solution with dplyr:
library(dplyr)
library(lubridate)  # for hour()

set.seed(12345)
dateTime <- seq(as.POSIXct("2011-01-01 00:00:00", tz = "GMT"),
                as.POSIXct("2011-01-29 23:00:00", tz = "GMT"),
                by = 60*24)
WS <- sample(0:20, 1738, rep = TRUE)
WD <- sample(0:390, 1738, rep = TRUE)
Temp <- sample(0:40, 1738, rep = TRUE)
df <- data.frame(dateTime, WS, WD, Temp)
df$WS[WS > 15] <- NA

df %>%
  group_by(Date = as.Date(dateTime)) %>%
  mutate(Hour = hour(dateTime),
         Hour_with_max_ws = Hour[which.max(WS)])
I want to point out that if several hours share the same maximal wind speed (in the example below: 15), only the first hour with max(WS) is returned as the result, even though a wind speed of 15 was reached on that date at hours 0, 3, 4, 21 and 22! So you might need more specific logic.
For the sake of completeness (and because I like the concise code) here is a "one-liner" using data.table:
library(data.table)
setDT(df)[, max.ws := max(WS, na.rm = TRUE), by = as.IDate(dateTime)][]
dateTime WS WD Temp max.ws
1: 2011-01-01 00:00:00 NA 293 22 15
2: 2011-01-01 00:24:00 15 55 14 15
3: 2011-01-01 00:48:00 NA 186 24 15
4: 2011-01-01 01:12:00 4 300 22 15
5: 2011-01-01 01:36:00 0 120 36 15
---
1734: 2011-01-29 21:12:00 12 249 5 15
1735: 2011-01-29 21:36:00 9 282 21 15
1736: 2011-01-29 22:00:00 12 238 6 15
1737: 2011-01-29 22:24:00 10 127 21 15
1738: 2011-01-29 22:48:00 13 297 0 15
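If what's needed is the actual timestamp of each day's maximum (one row per day) rather than the daily maximum repeated on every row, a hedged dplyr sketch along these lines should work with the df built above (NAs dropped first, ties resolved by taking the first occurrence):
library(dplyr)
df %>%
  filter(!is.na(WS)) %>%                              # drop missing wind speeds
  group_by(date = as.Date(dateTime)) %>%
  summarise(time_of_max = dateTime[which.max(WS)],    # timestamp of the daily max
            max_ws      = max(WS))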

combine two data frame with daily record and hourly record

I have two data frames: A
y_m_d SNOW
1 2010-01-01 0.0
2 2010-01-02 0.0
3 2010-01-03 0.1
4 2010-01-04 0.0
5 2010-01-05 0.0
6 2010-01-06 2.3
B:
time temp
1 2010-01-01 00:00:00 20.00000
2 2010-01-01 01:00:00 18.33333
3 2010-01-01 02:00:00 17.00000
4 2010-01-01 03:00:00 25.33333
5 2010-01-01 04:00:00 23.33333
I want to combine the two data frames based on time. A is a daily record and B is an hourly record. I want to put the A value at the beginning of each day, at 00:00:00, and leave the rest of the day blank.
The result should be look like this:
time temp SNOW
1 2010-01-01 00:00:00 20.00000 0.0
2 2010-01-01 01:00:00 18.33333
3 2010-01-01 02:00:00 17.00000
4 2010-01-01 03:00:00 25.33333
5 2010-01-01 04:00:00 23.33333
6 2010-01-01 05:00:00 22.66667
Could you please give me some advice?
Thank you.
Here's a quick solution:
A$y_m_d <- as.Date(A$y_m_d)
B$SNOW <- sapply(as.Date(B$time), function(x) A[A$y_m_d==x, "SNOW"])
This might not be the most efficient way in the world to do it, but it is a solution. I attempted to create data with the exact same variable types and structure as yours.
# Create example data
y_m_d <- as.POSIXct(c("2010-01-01", "2010-01-02"), format="%Y-%m-%d")
SNOW <- c(0, 0.1)
time <- as.POSIXct(c("2010-01-01 00:00:00", "2010-01-01 01:00:00", "2010-01-01 02:00:00", "2010-01-02 00:00:00", "2010-01-02 01:00:00", "2010-01-02 02:00:00"), format="%Y-%m-%d %H:%M:%S")
temp <- rnorm(6, mean=20, sd=4)
A <- data.frame(y_m_d, SNOW)
B <- data.frame(time, temp)
# Check data
A
## y_m_d SNOW
## 1 2010-01-01 0.0
## 2 2010-01-02 0.1
B
## time temp
## 1 2010-01-01 00:00:00 17.52852
## 2 2010-01-01 01:00:00 12.42715
## 3 2010-01-01 02:00:00 21.79584
## 4 2010-01-02 00:00:00 19.90442
## 5 2010-01-02 01:00:00 16.40524
## 6 2010-01-02 02:00:00 16.86854
# Loop through days and construct new SNOW variable
days <- as.POSIXct(format(B$time, "%Y-%m-%d"), format="%Y-%m-%d")
SNOW_new <- c()
for (i in 1:nrow(A)) {
  # append the day's SNOW value, then NAs for the remaining rows of that day
  SNOW_new <- c(SNOW_new, A[i, "SNOW"], rep(NA, sum(days==A[i, "y_m_d"])-1))
}
# Create new data frame
C <- data.frame(B, SNOW_new)
## time temp SNOW_new
## 1 2010-01-01 00:00:00 17.52852 0.0
## 2 2010-01-01 01:00:00 12.42715 NA
## 3 2010-01-01 02:00:00 21.79584 NA
## 4 2010-01-02 00:00:00 19.90442 0.1
## 5 2010-01-02 01:00:00 16.40524 NA
## 6 2010-01-02 02:00:00 16.86854 NA
I put NA rather than a blank space because I assume you want the SNOW_new variable to be numeric, not character. But if you do want a blank space, you can just replace the NA in the rep function with a "".
Make sure the time variables are in the right format:
A$y_m_d <- as.POSIXct(A$y_m_d, format="%Y-%m-%d")
B$time <- as.POSIXct(B$time, format="%Y-%m-%d %H:%M:%S")
The xts package is well suited to merging time series data:
#install.packages("xts")
library(xts)
A <- xts(A[,-1], order.by = A$y_m_d)
B <- xts(B[,-1], order.by = B$time)
merge.xts(A, B)
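Alternatively, a hedged dplyr sketch of exactly the join the question asks for (an assumption-laden sketch: A$y_m_d is a Date, convert with as.Date() first if needed, and B$time is POSIXct), attaching SNOW only to the 00:00:00 row of each day and leaving NA elsewhere:
library(dplyr)
C <- B %>%
  mutate(y_m_d = as.Date(format(time, "%Y-%m-%d"))) %>%    # day key; format() avoids time-zone surprises
  left_join(A, by = "y_m_d") %>%
  mutate(SNOW = ifelse(format(time, "%H:%M:%S") == "00:00:00", SNOW, NA)) %>%
  select(-y_m_d)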

summarize by time interval not working

I have the following data, a list of POSIXct times spanning one month. Each of them represents a bike delivery. My aim is to find the average number of bike deliveries per ten-minute interval over a 24-hour period (producing a total of 144 rows). First all of the trips need to be summed and binned into intervals, then divided by the number of days. So far I've managed to write code that sums trips per 10-minute interval, but it produces incorrect values, and I am not sure where it went wrong.
The data looks like this:
head(start_times)
[1] "2014-10-21 16:58:13 EST" "2014-10-07 10:14:22 EST" "2014-10-20 01:45:11 EST"
[4] "2014-10-17 08:16:17 EST" "2014-10-07 17:46:36 EST" "2014-10-28 17:32:34 EST"
length(start_times)
[1] 1747
The code looks like this:
library(lubridate)
library(dplyr)
tripduration <- floor(runif(1747) * 1000)
time_bucket <- start_times - minutes(minute(start_times) %% 10) - seconds(second(start_times))
df <- data.frame(tripduration, start_times, time_bucket)

summarized <- df %>%
  group_by(time_bucket) %>%
  summarize(trip_count = n())
summarized <- as.data.frame(summarized)

out_buckets <- data.frame(out_buckets = seq(as.POSIXlt("2014-10-01 00:00:00"),
                                            as.POSIXct("2014-10-31 23:00:00"),
                                            by = 600))
out <- left_join(out_buckets, summarized, by = c("out_buckets" = "time_bucket"))
out$trip_count[is.na(out$trip_count)] <- 0
head(out)
out_buckets trip_count
1 2014-10-01 00:00:00 0
2 2014-10-01 00:10:00 0
3 2014-10-01 00:20:00 0
4 2014-10-01 00:30:00 0
5 2014-10-01 00:40:00 0
6 2014-10-01 00:50:00 0
dim(out)
[1] 4459 2
test <- format(out$out_buckets,"%H:%M:%S")
test2 <- out$trip_count
test <- cbind(test, test2)
colnames(test)[1] <- "interval"
colnames(test)[2] <- "count"
test <- as.data.frame(test)
test$count <- as.numeric(test$count)
test <- aggregate(count~interval, test, sum)
head(test, n = 20)
interval count
1 00:00:00 32
2 00:10:00 33
3 00:20:00 32
4 00:30:00 31
5 00:40:00 34
6 00:50:00 34
7 01:00:00 31
8 01:10:00 33
9 01:20:00 39
10 01:30:00 41
11 01:40:00 36
12 01:50:00 31
13 02:00:00 33
14 02:10:00 34
15 02:20:00 32
16 02:30:00 32
17 02:40:00 36
18 02:50:00 32
19 03:00:00 34
20 03:10:00 39
but this is impossible because when I sum the counts
sum(test$count)
[1] 7494
I get 7494, whereas the total should be 1747.
I'm not sure where I went wrong or how to simplify this code to get the correct result.
I've done what I can, but I can't reproduce your issue without your data.
library(dplyr)
I created the full sequence of 10 minute blocks:
blocks.of.10mins <- data.frame(out_buckets=seq(as.POSIXct("2014/10/01 00:00"), by="10 mins", length.out=30*24*6))
Then split the start_times into the same bins. Note: I created a baseline time of midnight to force the blocks to align to 10 minute intervals. Removing this later is an exercise for the reader. I also changed one of your data points so that there was at least one example of multiple records in the same bin.
start_times <- as.POSIXct(c("2014-10-01 00:00:00",   ## added
                            "2014-10-21 16:58:13",
                            "2014-10-07 10:14:22",
                            "2014-10-20 01:45:11",
                            "2014-10-17 08:16:17",
                            "2014-10-07 10:16:36",   ## modified
                            "2014-10-28 17:32:34"))

trip_times <- data.frame(start_times) %>%
  mutate(out_buckets = as.POSIXct(cut(start_times, breaks = "10 mins")))
The start_times and all the 10 minute intervals can then be merged
trips_merged <- merge(trip_times, blocks.of.10mins, by="out_buckets", all=TRUE)
These can then be grouped by 10 minute block and counted
trips_merged %>%
  filter(!is.na(start_times)) %>%
  group_by(out_buckets) %>%
  summarise(trip_count = n())
Source: local data frame [6 x 2]
out_buckets trip_count
(time) (int)
1 2014-10-01 00:00:00 1
2 2014-10-07 10:10:00 2
3 2014-10-17 08:10:00 1
4 2014-10-20 01:40:00 1
5 2014-10-21 16:50:00 1
6 2014-10-28 17:30:00 1
Instead, if we only consider time, not date
trips_merged2 <- trips_merged
trips_merged2$out_buckets <- format(trips_merged2$out_buckets, "%H:%M:%S")
trips_merged2 %>%
  filter(!is.na(start_times)) %>%
  group_by(out_buckets) %>%
  summarise(trip_count = n())
Source: local data frame [6 x 2]
out_buckets trip_count
(chr) (int)
1 00:00:00 1
2 01:40:00 1
3 08:10:00 1
4 10:10:00 2
5 16:50:00 1
6 17:30:00 1
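Not part of the answer above, but a hedged guess at the 7494-vs-1747 discrepancy in the question: cbind() on a character vector and a numeric vector produces a character matrix, and as.data.frame() (with the pre-R-4.0 default stringsAsFactors = TRUE) then turns count into a factor; as.numeric() on a factor returns the level codes rather than the original values, which would inflate the sums. Converting via character avoids that:
# hedged sketch: recover the numeric values from a factor/character column
test$count <- as.numeric(as.character(test$count))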

Aggregate 5 minute data to hourly sums with NA's

My problem is as follows: I've got a time series of 5-minute precipitation data like this:
Datum mm
1 2004-04-08 00:05:00 NA
2 2004-04-08 00:10:00 NA
3 2004-04-08 00:15:00 NA
4 2004-04-08 00:20:00 NA
5 2004-04-08 00:25:00 NA
6 2004-04-08 00:30:00 NA
With this structure:
'data.frame': 1098144 obs. of 2 variables:
$ Datum: POSIXlt, format: "2004-04-08 00:05:00" "2004-04-08 00:10:00" "2004-04-08 00:15:00" "2004-04-08 00:20:00" ...
$ mm : num NA NA NA NA NA NA NA NA NA NA ...
As you can see, the time series begins with a lot of NAs, but there is measured precipitation further down, although riddled with occasional single NAs due to malfunctions of the measuring station.
What I'm trying to achieve is summing the measured precipitation into hourly sums, ignoring NAs.
This is what I tried so far:
sums <- aggregate(precip["mm"],
                  list(cut(precip$Datum, "1 hour")), sum)
Even though the timestamps are correctly aggregated into hours, all the sums are 0 or NA; the sums are not calculated even where there is no NA at all.
Additionally, something to take into account:
Hourly precipitation sums in meteorology always describe the cumulative sum up to a certain hour: the amount of precipitation at 0:00 describes the sum from 23:00 the previous day until 0:00. So I always need to sum over the previous hour.
Reproducible Example
set.seed(1120)
s <- as.POSIXlt("2004-03-08 23:00:00")
r <- seq(s, s+1e4, "30 min")
precip <- data.frame(Datum=r, mm=sample(c(1:5,NA), 6, T))
Datum mm
2004-03-08 23:00:00 4
2004-03-08 23:30:00 1
2004-03-09 00:00:00 2
2004-03-09 00:30:00 4
2004-03-09 01:00:00 1
2004-03-09 01:30:00 4
With the above example, the result I am looking for is:
Datum mm
2004-03-09 00:00:00 5
2004-03-09 01:00:00 6
2004-03-09 02:00:00 5
Try adding na.rm=TRUE:
aggregate(precip['mm'], list(cut(precip$Datum, "1 hour")), sum, na.rm=TRUE)
# Group.1 mm
# 1 2004-04-08 00:00:00 26
# 2 2004-04-08 01:00:00 35
# 3 2004-04-08 02:00:00 25
Reproducible Example
set.seed(1120)
s <- as.POSIXlt("2004-04-08 00:05:00")
r <- seq(s, s+1e4, "5 min")
precip <- data.frame(Datum=r, mm=sample(c(1:5,NA), 34, T))
addendum
To your second question: if you would like measurements taken exactly on the hour to be counted with the preceding hour, add right=TRUE:
aggregate(precip['mm'], list(cut(precip$Datum, "1 hour", right=TRUE)), sum, na.rm=TRUE)
Further Explanation
Here is a more detailed explanation of how the solution works:
p <- c("2004-04-07 23:48:20", "2004-04-08 00:00:00", "2004-04-08 00:03:20")
ptime <- as.POSIXlt(p)
#[1] "2004-04-07 23:48:20 EDT" "2004-04-08 00:00:00 EDT" "2004-04-08 00:03:20 EDT"
We have three dates to separate into groups. If we use cut without any extra arguments, the second entry "2004-04-08 00:00:00 EDT" will be grouped with the third entry for hour "00:00":
cut(ptime, "1 hour")
#[1] 2004-04-07 23:00:00 2004-04-08 00:00:00 2004-04-08 00:00:00
But if we add the argument right=TRUE we can group it with the "23:00" hour:
cut(ptime, "1 hour", right=TRUE)
#[1] 2004-04-07 23:00:00 2004-04-07 23:00:00 2004-04-08 00:00:00
We can specify the behavior of edge cases.
edit
With your new data the original solution produces the desired output:
aggregate(precip['mm'], list(cut(precip$Datum, "1 hour")), sum, na.rm=TRUE)
Group.1 mm
1 2004-03-08 23:00:00 5
2 2004-03-09 00:00:00 6
3 2004-03-09 01:00:00 5
You can use dplyr to calculate the sums like this:
precip$hour <- strftime(precip$Datum, "%Y-%m-%d %H")

library(dplyr)
sum_hour <- precip %>%
  group_by(hour) %>%
  summarise(sum_hour = sum(mm, na.rm = TRUE))
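If the "label each sum with the hour it runs up to" convention from the question is needed, here is a hedged lubridate/dplyr sketch that reproduces the expected output for the reproducible example above (note that a reading stamped exactly on the hour falls into the following hour's bucket here; adjust if your convention differs):
library(dplyr)
library(lubridate)
precip %>%
  mutate(hour_end = floor_date(as.POSIXct(Datum), "hour") + hours(1)) %>%  # stamp with the end of the hour
  group_by(hour_end) %>%
  summarise(mm = sum(mm, na.rm = TRUE))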
