Aggregating hourly data into daily aggregates with missing values in R

I have a data frame "RH" with hourly data, and I want to convert it to daily maximum and minimum data. This code, from the question "Aggregating hourly data into daily aggregates", was very useful:
RH$Date <- strptime(RH$Date,format="%Y/%m/%d")
RH$day <- trunc(RH$Date,"day")
require(plyr)
x <- ddply(RH,.(Date),
summarize,
aveRH=mean(RH),
maxRH=max(RH),
minRH=min(RH)
)
But the first 5 years of my data are 3-hourly, not hourly, so I get no results for those years. Any suggestions? Thank you in advance.
'data.frame': 201600 obs. of 3 variables:
$ Date: chr "1985/01/01" "1985/01/01" "1985/01/01" "1985/01/01" ...
$ Hour: int 1 2 3 4 5 6 7 8 9 10 ...
$ RH : int NA NA 93 NA NA NA NA NA 79 NA ...

The link you provided is an old one. The code is still perfectly good and would work, but here's a more modern version using dplyr and lubridate. Note the na.rm=TRUE arguments, which make the summaries skip missing values:
df <- read.table(text='date_time value
"01/01/2000 01:00" 30
"01/01/2000 02:00" 31
"01/01/2000 03:00" 33
"12/31/2000 23:00" 25',header=TRUE,stringsAsFactors=FALSE)
library(dplyr);library(lubridate)
df %>%
mutate(date_time=as.POSIXct(date_time,format="%m/%d/%Y %H:%M")) %>%
group_by(date(date_time)) %>%
summarise(mean=mean(value,na.rm=TRUE),max=max(value,na.rm=TRUE),
min=min(value,na.rm=TRUE))
`date(date_time)` mean max min
<date> <dbl> <dbl> <dbl>
1 2000-01-01 31.33333 33 30
2 2000-12-31 25.00000 25 25
EDIT
Since there's already a date column, this should work:
RH %>%
group_by(Date) %>%
summarise(mean=mean(RH,na.rm=TRUE),max=max(RH,na.rm=TRUE),
min=min(RH,na.rm=TRUE))
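One caveat with na.rm=TRUE: if every reading on a given day is NA, max() and min() return -Inf and Inf with a warning, and mean() returns NaN. Below is a minimal base R sketch of one way to guard against that. The sample values are made up to mimic the structure of RH, and the helper name safe_stat is hypothetical.

```r
# Made-up sample shaped like RH: sparse readings padded with NA,
# plus one day with no valid readings at all
RH <- data.frame(
  Date = rep(c("1985/01/01", "1985/01/02"), each = 6),
  Hour = rep(1:6, times = 2),
  RH   = c(NA, NA, 93, NA, NA, 79, rep(NA, 6))
)

# Hypothetical helper: apply f to the non-missing values,
# or return NA when the day has no valid readings
safe_stat <- function(x, f) {
  x <- x[!is.na(x)]
  if (length(x) == 0) NA_real_ else f(x)
}

# One vector of readings per day
by_day <- split(RH$RH, RH$Date)

daily <- data.frame(
  Date  = names(by_day),
  n_obs = sapply(by_day, function(x) sum(!is.na(x))),  # valid readings per day
  aveRH = sapply(by_day, function(x) safe_stat(x, mean)),
  maxRH = sapply(by_day, function(x) safe_stat(x, max)),
  minRH = sapply(by_day, function(x) safe_stat(x, min)),
  row.names = NULL
)
daily
```

Keeping n_obs alongside the summaries makes it easy to see which days were built from 3-hourly rather than hourly readings.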

Related

How to subtract a column of date values from Sys.Date() using mutate (tidyverse/dplyr)?

I have this dataframe I am working with.
data <- data.frame(id = c(123,124,125,126,127,128,129,130),
date = c("10/7/2021","10/6/2021","9/13/2021","10/18/2021","8/12/2021","9/6/2021","10/29/2021","9/6/2021"))
My goal is to create a new column that tells me how many days have passed since the recorded date in each row. I'm trying to use this code, but I keep getting NA days in my new column.
data %>%
select(id,date) %>%
mutate("days_since" = as.Date(Sys.Date()) - as.Date(date,format="%Y-%m-%d"))
id date days_since
1 123 10/7/2021 NA days
2 124 10/6/2021 NA days
3 125 9/13/2021 NA days
4 126 10/18/2021 NA days
5 127 8/12/2021 NA days
6 128 9/6/2021 NA days
7 129 10/29/2021 NA days
8 130 9/6/2021 NA days
What am I doing wrong? Thank you for any feedback.
We can use the lubridate package, which makes type conversion and date arithmetic much easier.
In your code, the as.Date(date,format="%Y-%m-%d") step was the problem: the strings are in month/day/year order, so the format does not match and as.Date returns NA.
library(dplyr)
library(lubridate)
data %>% mutate("days_since" = Sys.Date() - mdy(date))
# output from a run on 2021-11-04
id date days_since
1 123 10/7/2021 28 days
2 124 10/6/2021 29 days
3 125 9/13/2021 52 days
4 126 10/18/2021 17 days
5 127 8/12/2021 84 days
6 128 9/6/2021 59 days
7 129 10/29/2021 6 days
8 130 9/6/2021 59 days
Thanks, @Karthik S, for the simplification.
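It is also easily done using base R and a simple "-", which gives back the difference in days. Note that the dates here are written in ISO format (year-month-day), so as.Date can parse them without a format argument: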
data <- data.frame(id = c(123,124,125,126,127,128,129,130),
date = c("2021-10-10","2021-10-06","2021-09-13","2021-10-18","2021-08-12","2021-09-06","2021-10-29","2021-09-06"))
data$date <- as.Date(data$date)
data$sys_date <- Sys.Date()
data$sysDate_to_date <- data$sys_date -data$date
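If the dates stay in the original month/day/year form, base R can still parse them as long as the format string matches. The sketch below assumes a fixed reference date (standing in for Sys.Date()) so that the result is reproducible; the name ref_date is only illustrative.

```r
data <- data.frame(
  id   = c(123, 124, 125),
  date = c("10/7/2021", "9/13/2021", "8/12/2021")
)

# Fixed reference date for reproducibility; use Sys.Date() in practice
ref_date <- as.Date("2021-11-04")

# "%m/%d/%Y" matches month/day/four-digit-year strings
parsed <- as.Date(data$date, format = "%m/%d/%Y")

# Subtracting two Dates gives a difftime in days;
# as.numeric() drops the unit and leaves a plain number
data$days_since <- as.numeric(ref_date - parsed)
data
```

A mismatched format does not raise an error; as.Date silently returns NA, which is why the original code produced "NA days" in every row.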

How to select the earliest date in a month from a Date series in R?

I have a database containing the values of different indices, with data at different frequencies (weekly, monthly, daily). I hope to calculate monthly returns by extracting the beginning-of-month value from each time series.
I have tried to use a loop to partition the time series month by month then use min() to get the earliest date in the month. However, I am wondering whether there is a more efficient way to speed up the calculation.
library(data.table)
df<-fread("statistic_date index_value funds_number
2013-1-1 1000.000 0
2013-1-4 996.096 21
2013-1-11 1011.141 21
2013-1-18 1057.344 21
2013-1-25 1073.376 21
2013-2-1 1150.479 22
2013-2-8 1150.288 19
2013-2-22 1112.993 18
2013-3-1 1148.826 20
2013-3-8 1093.515 18
2013-3-15 1092.352 17
2013-3-22 1138.346 18
2013-3-29 1107.440 17
2013-4-3 1101.897 17
2013-4-12 1093.344 17")
I expect to filter to get the rows of the earliest date of each month, such as:
2013-1-1 1000.000 0
2013-2-1 1150.479 22
2013-3-1 1148.826 20
2013-4-3 1101.897 17
Your help will be much appreciated!
Using the tidyverse and lubridate packages:
library(lubridate)
library(tidyverse)
df %>% mutate(statistic_date = ymd(statistic_date), # convert statistic_date to date format
month = month(statistic_date), #create month and year columns
year= year(statistic_date)) %>%
group_by(month,year) %>% # group by month and year
arrange(statistic_date) %>% # make sure the df is sorted by date
filter(row_number()==1) # select first row within each group
# A tibble: 4 x 5
# Groups: month, year [4]
# statistic_date index_value funds_number month year
# <date> <dbl> <int> <dbl> <dbl>
#1 2013-01-01 1000 0 1 2013
#2 2013-02-01 1150. 22 2 2013
#3 2013-03-01 1149. 20 3 2013
#4 2013-04-03 1102. 17 4 2013
First make statistic_date a Date:
df$statistic_date <- as.Date(df$statistic_date)
Then you can use nth_day to find the first day of every month in statistic_date.
library("datetimeutils")
dates <- nth_day(df$statistic_date, period = "month", n = "first")
## [1] "2013-01-01" "2013-02-01" "2013-03-01" "2013-04-03"
df[statistic_date %in% dates]
## statistic_date index_value funds_number
## 1: 2013-01-01 1000.000 0
## 2: 2013-02-01 1150.479 22
## 3: 2013-03-01 1148.826 20
## 4: 2013-04-03 1101.897 17
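The same filter can also be sketched in base R without extra packages: build a year-month key, find the earliest date within each key, and keep the rows that match. This is only one possible approach, shown on a shortened copy of the sample data; the variable names ym, first_in_month, and first_rows are illustrative.

```r
df <- data.frame(
  statistic_date = c("2013-1-1", "2013-1-4", "2013-1-11", "2013-2-1",
                     "2013-2-8", "2013-3-1", "2013-3-8", "2013-4-3",
                     "2013-4-12"),
  index_value = c(1000.000, 996.096, 1011.141, 1150.479, 1150.288,
                  1148.826, 1093.515, 1101.897, 1093.344)
)
df$statistic_date <- as.Date(df$statistic_date, format = "%Y-%m-%d")

# Year-month key such as "2013-01", so the same month
# in different years stays in separate groups
ym <- format(df$statistic_date, "%Y-%m")

# For every row, the earliest date (as a number of days) within its month
first_in_month <- ave(as.numeric(df$statistic_date), ym, FUN = min)

# Keep only the rows whose date is the earliest of their month
first_rows <- df[as.numeric(df$statistic_date) == first_in_month, ]
first_rows
```

Because ave() is vectorised over groups, this avoids the explicit month-by-month loop mentioned in the question.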

How to assign day of year values starting from an arbitrary date and take care of missing values?

I have an R dataframe df_demand with a date column (depdate) and a dependent variable column bookings. The duration is 365 days starting from 2017-11-02 and ending at 2018-11-01, sorted in ascending order.
We have booking data for only 279 days in the year.
dplyr::arrange(df_demand, depdate)
depdate bookings
1 2017-11-02 43
2 2017-11-03 27
3 2017-11-05 27
4 2017-11-06 22
5 2017-11-07 39
6 2017-11-08 48
.
.
279 2018-11-01 60
I want to introduce another column day_of_year in the following way:
depdate day_of_year bookings
1 2017-11-02 1 43
2 2017-11-03 2 27
3 2017-11-04 3 NA
4 2017-11-05 4 27
.
.
.
365 2018-11-01 365 60
I am trying to find the best possible way to do this.
In Python, I could use something like :
df_demand['day_of_year'] = df_demand['depdate'].sub(df_demand['depdate'].iat[0]).dt.days + 1
I wanted to know about an R equivalent of the same.
When I run
typeof(df_demand_2$depdate)
the output is
"double"
Am I missing something?
You can create a row for every date using the complete function from the tidyr package.
First, I'm creating a data frame with some sample data:
df <- data.frame(
depdate = as.Date(c('2017-11-02', '2017-11-03', '2017-11-05')),
bookings = c(43, 27, 27)
)
Next, I'm performing two operations. First, using tidyr::complete, I'm specifying all the dates I want in my analysis. I can do that using seq.Date, creating a sequence from the first to the last day.
Once that is done, the day_of_year column is simply equal to the row number.
df_complete <- tidyr::complete(df,
depdate = seq.Date(from = min(df$depdate), to = max(df$depdate), by = 1)
)
df_complete$day_of_year <- 1:nrow(df_complete)
> df_complete
#> # A tibble: 4 x 3
#> depdate bookings day_of_year
#> <date> <dbl> <int>
#> 1 2017-11-02 43 1
#> 2 2017-11-03 27 2
#> 3 2017-11-04 NA 3
#> 4 2017-11-05 27 4
An equivalent solution with the pipe operator from dplyr:
library(dplyr)
library(tidyr)
df %>%
  complete(depdate = seq.Date(from = min(df$depdate), to = max(df$depdate), by = 1)) %>%
  mutate(day_of_year = row_number())
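On the typeof() question: typeof() reports the underlying storage type, and Date objects are stored as doubles (days since 1970-01-01), so "double" is expected; class() is what reports "Date". A base R sketch of a direct counterpart to the pandas expression follows, assuming depdate is already a Date column. The names all_days and df_full are illustrative.

```r
df_demand <- data.frame(
  depdate  = as.Date(c("2017-11-02", "2017-11-03", "2017-11-05")),
  bookings = c(43, 27, 27)
)

typeof(df_demand$depdate)  # storage type of the column
class(df_demand$depdate)   # class used for printing and method dispatch

# Days elapsed since the first date, counted from 1
# (mirrors subtracting the first date and adding 1 in pandas)
df_demand$day_of_year <-
  as.integer(df_demand$depdate - df_demand$depdate[1]) + 1L

# To also get rows for the missing dates, join against the full sequence;
# bookings are NA on the days that had no data
all_days <- data.frame(
  depdate = seq(min(df_demand$depdate), max(df_demand$depdate), by = "day")
)
all_days$day_of_year <- seq_len(nrow(all_days))
df_full <- merge(all_days, df_demand[, c("depdate", "bookings")],
                 by = "depdate", all.x = TRUE)
df_full
```

The date subtraction gives the correct day number even when rows are missing, whereas a plain row number is only correct after the gaps have been filled.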

R - Find a value based on a criteria

I have a dataframe DF with numerous columns; one holds the dates and another the hour.
I need to find the PRICE (in the same dataframe) from 36 hours before. Not all of my days have 24 hours, so I can't just shift my data set.
My idea was to look for the day before in my dataset & 12 hours before.
This is what I wrote but this is not working:
for (i in 38:nrow(DF)){
RefDay=as.Date(DF$Date[i])
HourRef=DF$Hour[i]
DF$P24[i]=DF[which(DF$Date == (RefDay-1))& which(DF$Hour == (HourRef-36)),"PRICE"]
}
Here is my DF:
'data.frame': 20895 obs. of 45 variables:
$ Hour : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Date : POSIXct, format: "2016-07-01" "2016-07-01" "2016-07-01" "2016-07-01" ...
$ PRICE : num 29.4 24.7 23.4 21.9 20.2 ...
Here is a sample of my data:
DF.Hour DF.Date DF.PRICE
1 0 2016-07-01 29.36
2 1 2016-07-01 24.69
3 2 2016-07-01 23.42
4 3 2016-07-01 21.91
5 4 2016-07-01 20.19
6 5 2016-07-01 22.44
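Try to fill the data.frame so that every day has all 24 hours. You can do this with complete() from tidyr, which adds the missing Date/Hour combinations and fills their other columns with NA.
Once every day is complete and the rows are sorted by date and hour, the value from 36 hours before is simply the 36th row before, which you can get with, for example, lag(PRICE, 36).
library(dplyr); library(tidyr)
DF <- complete(DF, Date, Hour) %>% arrange(Date, Hour)
DF$P36 <- lag(DF$PRICE, 36)  # P36 (illustrative name): price 36 rows, i.e. 36 hours, earlier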
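An alternative that does not depend on row positions is to build a numeric timestamp from Date and Hour and look up the exact timestamp 36 hours earlier with match(). The sketch below is a base R illustration under a few assumptions: Hour is a factor whose labels are the hours 0 to 23 (as in the str() output), times are treated as UTC to sidestep daylight-saving shifts, and the sample prices and the column name P36 are made up.

```r
# Made-up sample: three days of hourly prices
DF <- data.frame(
  Date  = as.POSIXct(rep(c("2016-07-01", "2016-07-02", "2016-07-03"), each = 24),
                     tz = "UTC"),
  Hour  = factor(rep(0:23, times = 3)),
  PRICE = as.numeric(1:72)
)
DF <- DF[-30, ]  # drop one hour to mimic an incomplete day

# Convert the factor labels (not the internal codes) to hour numbers
hour_num <- as.integer(as.character(DF$Hour))

# Seconds since the epoch for each observation
ts <- as.numeric(DF$Date) + 3600 * hour_num

# Position of the row exactly 36 hours earlier, or NA if it does not exist
idx <- match(ts - 36 * 3600, ts)
DF$P36 <- DF$PRICE[idx]

head(DF[!is.na(DF$P36), ])
```

Rows whose 36-hours-earlier observation is missing simply get NA, so incomplete days do not shift later values.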

Averaging data over a week, with a week having between 0 and 2 values

I have a "weekly" xts object. The data refers to futures (front month futures). The weeks were calculated as follows:
week 1 ends on the last business day of the previous calendar month
week 2 ends on the 5th business day of the current calendar month
week 3 ends on the 10th business day of the current calendar month
week 4 ends on the day when the near contract expires
The data looks as follows (part of the data as example):
2005-09-30 0.0019094400
2005-10-07 0.0020219110
2005-10-14 0.0067063990
2005-10-20 0.0074893360
2005-10-31 0.0028761630
2005-11-07 0.0011331470
2005-11-14 0.0031749880
2005-11-18 0.0007342980
2005-11-30 0.0025730810
2005-12-07 -0.0003133450
2005-12-14 -0.0008288860
2005-12-20 0.0013468400
2005-12-30 0.0012742930
2006-01-09 -0.0007873670
2006-01-17 -0.0004193150
2006-01-20 -0.0005391370
2006-01-31 -0.0022229660
If I call str() on my dataset, I get the following (you can ignore X here, the important data is Date and Risk.Premium):
'data.frame': 484 obs. of 3 variables:
$ Date : num NA NA NA NA NA NA NA NA NA NA ...
$ Risk.Premium: num 0.00191 0.00202 0.00671 0.00749 0.00288 ...
$ X : logi NA NA NA NA NA NA ...
As you can see, there are between 0 and 2 values for each week in the data. I want to transform the data into "proper" weekly data, so I can compare it with other weekly data (i.e. if a week has 2 values, take the average; if a week has zero values, drop this week). Does anyone have an idea how to do this?
Creating sample data:
mydf <- data.frame(
date = c("2005-09-30", "2005-10-07", "2005-10-08", "2005-11-12"),
value = c(1, 2, 3, 4))
Then create a function that, given a week, finds the average. It will return NaN if given a week that doesn't have any values.
weekAverage <- function(week) {
vals <- mydf[which(format(as.Date(mydf$date), "%W") == week), 2]
mean(vals)
}
Then apply the function to each unique week in the data frame.
weeks <- unique(format(as.Date(mydf$date), "%W"))
weeklyAverages <- data.frame(
Week = weeks,
Average = sapply(weeks, weekAverage))
weeklyAverages
## Week Average
##39 39 1.0
##40 40 2.5
##45 45 4.0
This works if all your weeks are within the same year, because it only factors in the week of the year, not the year. If you want to include year, you can change the format string to "%W %Y" or similar.
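As a sketch of the year-aware variant, the key can combine year and week, and aggregate() can compute the mean per key in one step. The sample below extends the data above with one made-up 2006 row to show that different years end up in separate groups; the column name year_week is illustrative.

```r
mydf <- data.frame(
  date  = c("2005-09-30", "2005-10-07", "2005-10-08", "2005-11-12",
            "2006-10-05"),
  value = c(1, 2, 3, 4, 5)
)

# Key made of the year and the week of the year (weeks starting on Monday)
mydf$year_week <- format(as.Date(mydf$date), "%Y-%W")

# Mean of value within each year-week; only weeks that occur in the data appear
weekly <- aggregate(value ~ year_week, data = mydf, FUN = mean)
names(weekly)[2] <- "Average"
weekly
```

Weeks with no observations never show up in the result, so there is nothing to drop afterwards, and weeks with two values are averaged.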
