I have the following df with the Date column having hourly marks for an entire year:
Date TD RN D.RN Press Temp G.Temp. Rad
1 2018-01-01 00:00:00 154.0535 9.035156 1.416667 950.7833 7.000000 60.16667 11.27000
2 2018-01-01 01:00:00 154.5793 9.663900 1.896667 951.2000 6.766667 59.16667 11.23000
3 2018-01-01 01:59:59 154.5793 7.523438 2.591667 951.0000 6.066667 65.16667 11.23500
4 2018-01-01 02:59:59 154.0535 7.994792 2.993333 951.1833 5.733333 64.00000 11.16833
5 2018-01-01 03:59:59 154.4041 6.797526 3.150000 951.4833 5.766667 57.83333 11.13500
6 2018-01-01 04:59:59 155.1051 12.009766 3.823333 951.0833 5.216667 61.33333 11.22167
I want to add a factor column 'Quarters' that indicates each quarter according to the 'Date'.
As far as I understand I can do that by:
Radiation$Quarter <- cut(Radiation$Date, breaks = "quarters", labels = c("Q1", "Q2", "Q3", "Q4"))
But I also want to add a factor column 'Day/Night' which indicates whether it's day or night, having:
Day → 8am - 8pm
Night → 8pm - 8am
It seems like with the cut() function there's no way to indicate time ranges.
You can use an ifelse()/case_when() statement after extracting the hour from the timestamp.
library(dplyr)
library(lubridate)
df %>%
  mutate(hour = hour(Date),
         label = case_when(hour >= 8 & hour <= 19 ~ 'Day',
                           TRUE ~ 'Night'))
In base R :
df$hour = as.integer(format(df$Date, '%H'))
transform(df, label = ifelse(hour >= 8 & hour <= 19, 'Day', 'Night'))
We can also do
library(dplyr)
library(lubridate)
df %>%
  mutate(hour = hour(Date),
         label = case_when(between(hour, 8, 19) ~ "Day",
                           TRUE ~ "Night"))
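Combining the two steps above into one self-contained sketch (the `rad` toy data frame is an assumption standing in for `Radiation`; `lubridate::quarter()` replaces `cut()` so the Q1..Q4 labels do not require the data to span all four quarters):

```r
library(dplyr)
library(lubridate)

# Toy data standing in for the Radiation data frame above
rad <- data.frame(Date = as.POSIXct(c("2018-01-01 07:00:00",
                                      "2018-04-15 12:00:00",
                                      "2018-09-30 21:00:00"),
                                    tz = "UTC"))

rad <- rad %>%
  mutate(Quarter  = factor(paste0("Q", quarter(Date))),   # Q1..Q4 per row
         DayNight = factor(if_else(hour(Date) >= 8 & hour(Date) < 20,
                                   "Day", "Night")))      # Day = 8am-8pm
```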
I created a data frame with three columns date, ID and price(e5).
I want to get the mean price by day and hour.
> head(fuel_price, n = 5)
date station_uuid e5
1 2019-04-15 04:01:06+02 88149d2f-3258-445b-bfa4-60898e7fb186 1.529
2 2019-04-15 04:56:05+02 5c2d04fd-e464-4c96-b4a6-d996d0a8630c 1.539
3 2019-04-15 05:00:06+02 c8137d18-edad-4006-9746-18e876b14b1d 1.530
4 2019-04-16 05:00:06+02 6b2143cb-1cd8-4b4b-b2fb-2502f6ea8b35 1.542
5 2019-04-16 05:02:06+02 dbdb29f5-93aa-4ee4-a52b-7bff0e4ab75a 1.562
I think the main problem is that the date is not in the right format, but I am not able to change it because of the +02 timezone offset at the end.
price_2019$date <- mdy_hms(prices_2019$date)
If this were fixed, would it work with dplyr?
agg_price <- price_2019 %>%
  group_by(Date = floor_date(date, "hour")) %>%
  summarize(mean_price = mean(e5))
Could you help me out?
You can use lubridate::ymd_hms to convert the date variable to date-time, group by day and hour from it and take mean value of price for each hour.
library(dplyr)
prices_2019 %>%
  mutate(date = lubridate::ymd_hms(date),
         date_hour = format(date, "%Y-%m-%d %H")) %>%
  group_by(date_hour) %>%
  summarize(mean_price = mean(e5))
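If ymd_hms() struggles with the bare "+02" offset, a base-R parse with %z also works once the offset is padded to four digits (a sketch; the toy prices_2019 rows are copied from the question, and it assumes every offset looks like "+02"):

```r
library(dplyr)
library(lubridate)

prices_2019 <- data.frame(
  date = c("2019-04-15 04:01:06+02", "2019-04-15 04:56:05+02",
           "2019-04-15 05:00:06+02"),
  e5   = c(1.529, 1.539, 1.530)
)

agg_price <- prices_2019 %>%
  # pad "+02" to "+0200" so strptime's %z understands it
  mutate(date = as.POSIXct(paste0(date, "00"),
                           format = "%Y-%m-%d %H:%M:%S%z", tz = "UTC")) %>%
  group_by(hour = floor_date(date, "hour")) %>%
  summarize(mean_price = mean(e5))
```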
I have an R script that I run monthly. I'd like to subset my data frame to only show data within a 6 month time period, but each month I'd like the time period to move forward one month.
Original data frame from Sept.:
ID Name Date
1 John 1/1/2020
2 Adam 5/2/2020
3 Kate 9/30/2020
4 Jill 10/15/2020
After subsetting for only dates from May 1, 2020 - Sept. 30, 2020:
ID Name Date
2 Adam 5/2/2020
3 Kate 9/30/2020
The next month when I run my script, I'd like the dates it's subsetting to move forward by one month, so June 1, 2020 - Oct. 31, 2020:
ID Name Date
3 Kate 9/30/2020
4 Jill 10/15/2020
Right now, I'm changing this part of my script manually each month, i.e.:
subset(df, Date >= '2020-05-01' & Date <= '2020-09-30')
Is there a way to make this automatic, so that I don't have to manually move forward the date one month every time?
We can use between after converting the 'Date' to Date class
library(dplyr)
library(lubridate)
start <- as.Date("2020-05-01")
end <- as.Date("2020-09-30")
df1 %>%
  mutate(Date = mdy(Date)) %>%
  filter(between(Date, start, end))
# ID Name Date
#1 2 Adam 2020-05-02
#2 3 Kate 2020-09-30
The next month, we can shift 'start' and 'end' forward by one month:
start <- start %m+% months(1)
end <- ceiling_date(end %m+% months(1), 'months') - days(1)
start
#[1] "2020-06-01"
end
#[1] "2020-10-31"
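The shift above can also be computed directly from the run date, so nothing needs to be edited by hand. The exact rule is an assumption inferred from the example windows in the question (May 1 - Sep 30 for a September run, Jun 1 - Oct 31 for an October run): the window ends on the last day of the current month and starts four calendar months before that month.

```r
library(lubridate)

run_date <- as.Date("2020-09-15")   # stands in for Sys.Date()
# last day of the current month
end   <- ceiling_date(run_date, "month", change_on_boundary = TRUE) - days(1)
# first day of the month four months earlier
start <- floor_date(run_date, "month") %m-% months(4)
```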
Using base R, with no package dependency.
Data:
dt <- read.table(text = 'ID Name Date
1 John 1/1/2020
2 Adam 3/2/2021
3 Kate 12/30/2020
4 Jill 5/15/2021', header = TRUE, stringsAsFactors = FALSE)
Code:
date_format <- "%m/%d/%Y"
dt$Date <- as.Date(dt$Date, format = date_format)
today <- Sys.Date()
# first day of the current month
start <- as.Date(format(today, "%Y-%m-01"))
# last day of the month six months ahead; pasting "31" would give NA for
# 30-day months, so step to the 1st of the following month and subtract a day
end <- seq(start, by = "7 months", length.out = 2)[2] - 1
dt[with(dt, Date >= start & Date <= end), ]
# ID Name Date
# 2 2 Adam 2021-03-02
# 3 3 Kate 2020-12-30
# 4 4 Jill 2021-05-15
This is a very simple solution:
library(lubridate)
t <- today()                # automatic
t <- as.Date('2020-11-26')  # manual override (change it as you like)
start <- floor_date(t %m-% months(6), unit = "months")
end <- floor_date(t %m-% months(1), unit = "months") - 1
subset(df, Date >= start & Date <= end)
I would like to know if there is a way to transform dates like this:
"2016-01-08" into "20160101q", which means the first half of January 2016, or
"2016-01-27" into "20160102q", which means the second half of January 2016. Thank you in advance!
Here is a solution making use of the data.table and lubridate packages.
It uses the lubridate::days_in_month() function to determine the number of days in the month of each date. This is necessary since February (normally) has 28 days, so day 15 of February --> 02q, while January has 31 days, so day 15 of January --> 01q.
The logic for calculating the q-period is:
if day_number / number_of_days_in_month > 0.5 --> q_period = 02q,
else --> q_period = 01q.
Then a paste0() command is used to create the text for the q_date column. sprintf() is used to add a leading zero for single-digit month numbers.
library(data.table)
library(lubridate)
#sample data
data <- data.table( date = as.Date( c("2019-12-30", "2020-01-15", "2020-02-15", "2020-02-14") ) )
# date
# 1: 2019-12-30
# 2: 2020-01-15
# 3: 2020-02-15
# 4: 2020-02-14
#if the day / #days of month > 0.5, date is in q2, else q1
data[ lubridate::mday(date) / lubridate::days_in_month(date) > 0.5,
      q_date := paste0( lubridate::year(date), sprintf( "%02d", lubridate::month(date) ), "02q" ) ]
data[ is.na( q_date ),
      q_date := paste0( lubridate::year(date), sprintf( "%02d", lubridate::month(date) ), "01q" ) ]
# date q_date
# 1: 2019-12-30 20191202q
# 2: 2020-01-15 20200101q
# 3: 2020-02-15 20200202q
# 4: 2020-02-14 20200201q
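For comparison, the same half-month rule fits in one vectorized expression with lubridate alone (a sketch, not taken from the answer above):

```r
library(lubridate)

d <- as.Date(c("2019-12-30", "2020-01-15", "2020-02-15", "2020-02-14"))

# second half ("02q") when the day is past the midpoint of its own month
q <- paste0(format(d, "%Y%m"),
            ifelse(mday(d) / days_in_month(d) > 0.5, "02q", "01q"))
```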
You can try mutate() and paste0(): first decompose the date into day, month, and year; then create a variable that says whether we are in the first or second half of the month; then paste together the year, the zero-padded month, and the variable containing "01q" or "02q" depending on the period.
date <- c("2016-01-08", "2016-01-27")
id <- c(1, 2)
x <- data.frame(id, date)
library(tidyverse)
library(lubridate)
x <- x %>%
  mutate(date = ymd(date)) %>%
  mutate(year = year(date), month = month(date), day = day(date))
x$half <- "01q"
x$half[x$day > 15] <- "02q"
paste0(x$year, sprintf("%02d", x$month), x$half)
Here is an example of a subset data in .csv files. There are three columns with no header. The first column represents the date/time and the second column is load [kw] and the third column is 1= weekday, 0 = weekends/ holiday.
9/9/2010 3:00 153.94 1
9/9/2010 3:15 148.46 1
I would like to write R code that selects the first and second columns within the time range 10:00 to 20:00 for all weekdays (when the third column is 1) within the month of September, and I do not know the best and most efficient way to code this.
dt <- read.csv("file", header = FALSE, sep = ",")
# Select the column with weekday designation: 1 = weekday, 0 = weekend/holiday
y <- data.frame(dt[, 3])
# Select the columns with timestamps and loads
x <- data.frame(dt[, 1:2])
t <- data.frame(dt[, 1])
# Convert timestamps into readable format
s <- strptime("9/1/2010 0:00", format = "%m/%d/%Y %H:%M")
e <- strptime("9/30/2010 23:45", format = "%m/%d/%Y %H:%M")
range <- seq(s, e, by = "min")
df <- data.frame(range)
The OP asks for the "best and efficient way to code" this without showing any inefficient code, so @Justin is right.
It seems the OP is new to R (and it's officially the summer of love), so I'll give it a try with a solution (not sure about efficiency...):
index <- c("9/9/2010 19:00", "9/9/2010 21:15", "10/9/2010 11:00", "3/10/2010 10:30")
index <- as.POSIXct(index, format = "%d/%m/%Y %H:%M")
set.seed(1)
Data <- data.frame(Date = index, load = rnorm(4, mean = 120, sd = 10), weeks = c(0, 1, 1, 1))
## Data
## Date load weeks
## 1 2010-09-09 19:00:00 113.74 0
## 2 2010-09-09 21:15:00 121.84 1
## 3 2010-09-10 11:00:00 111.64 1
## 4 2010-10-03 10:30:00 135.95 1
cond <- expression(format(Date, "%H:%M") < "20:00" &
                   format(Date, "%H:%M") > "10:00" &
                   weeks == 1 &
                   format(Date, "%m") == "09")
subset(Data, eval(cond))
## Date load weeks
## 3 2010-09-10 11:00:00 111.64 1
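Applied back to the asker's three-column layout (the V1..V3 names are read.csv defaults and the sample rows are invented), the same filter can be written without expression():

```r
# Invented sample rows in the asker's layout: timestamp, load, weekday flag
dt <- data.frame(
  V1 = c("9/9/2010 3:00", "9/9/2010 12:15", "9/10/2010 15:00", "10/2/2010 12:00"),
  V2 = c(153.94, 148.46, 150.10, 149.00),
  V3 = c(1, 1, 0, 1)
)
dt$V1 <- as.POSIXct(strptime(dt$V1, format = "%m/%d/%Y %H:%M"))

hr  <- as.integer(format(dt$V1, "%H"))
sel <- dt$V3 == 1 & hr >= 10 & hr < 20 & format(dt$V1, "%m") == "09"
dt[sel, 1:2]   # weekday rows between 10:00 and 20:00 in September
```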
I have a dataset filled with the average windspeed per hour for multiple years. I would like to create an 'average year', in which for each hour the average windspeed for that hour over multiple years is calculated. How can I do this without looping endlessly through the dataset?
Ideally, I would like to just loop through the data once, extracting for each row the right month, day, and hour, and adding the windspeed from that row to the right row in a dataframe where the aggregates for each month, day, and hour are gathered. Is it possible to do this without extracting the month, day, and hour, and then looping over the complete average-year data.frame to find the right row?
Some example data:
data.multipleyears <- data.frame(
DATETIME = c("2001-01-01 01:00:00", "2001-05-03 09:00:00", "2007-01-01 01:00:00", "2008-02-29 12:00:00"),
Windspeed = c(10, 5, 8, 3)
)
Which I would like to aggregate in a dataframe like this:
average.year <- data.frame(
  DATETIME = c("01-01 00:00:00", "01-01 01:00:00", ..., "12-31 23:00:00"),
  Aggregate.Windspeed = c(100, 80, ...)
)
From there, I can go on calculating the averages, etc. I have probably overlooked some command, but what would be the right syntax for something like this (in pseudocode):
for (i in 1:nrow(data.multipleyears)) {
  average.year$Aggregate.Windspeed[
    where average.year$DATETIME(month, day, hour) == data.multipleyears$DATETIME[i](month, day, hour)
  ] <- average.year$Aggregate.Windspeed + data.multipleyears$Windspeed[i]
}
Or something like that. Help is appreciated!
I predict that ddply and the plyr package are going to be your best friends :). I created a 30-year dataset with hourly random windspeeds between 1 and 10 m/s:
begin_date = as.POSIXlt("1990-01-01", tz = "GMT")
# 30 year dataset
dat = data.frame(dt = begin_date + (0:(24*30*365)) * (3600))
dat = within(dat, {
  speed = runif(length(dt), 1, 10)
  unique_day = strftime(dt, "%d-%m")
})
> head(dat)
dt unique_day speed
1 1990-01-01 00:00:00 01-01 7.054124
2 1990-01-01 01:00:00 01-01 2.202591
3 1990-01-01 02:00:00 01-01 4.111633
4 1990-01-01 03:00:00 01-01 2.687808
5 1990-01-01 04:00:00 01-01 8.643168
6 1990-01-01 05:00:00 01-01 5.499421
To calculate the daily normals (30-year averages; the term is much used in meteorology) over this 30-year period:
library(plyr)
res = ddply(dat, .(unique_day),
summarise, mean_speed = mean(speed), .progress = "text")
> head(res)
unique_day mean_speed
1 01-01 5.314061
2 01-02 5.677753
3 01-03 5.395054
4 01-04 5.236488
5 01-05 5.436896
6 01-06 5.544966
This takes just a few seconds on my humble two-core AMD, so I suspect going through the data only once is not needed. Multiple of these ddply calls for different aggregations (month, season, etc.) can be done separately.
You can use substr to extract the part of the date you want,
and then use tapply or ddply to aggregate the data.
tapply(
  data.multipleyears$Windspeed,
  substr(data.multipleyears$DATETIME, 6, 19),
  mean
)
# 01-01 01:00:00 02-29 12:00:00 05-03 09:00:00
# 9 3 5
library(plyr)
ddply(
  data.multipleyears,
  .(when = substr(DATETIME, 6, 19)),
  summarize,
  Windspeed = mean(Windspeed)
)
# when Windspeed
# 1 01-01 01:00:00 9
# 2 02-29 12:00:00 3
# 3 05-03 09:00:00 5
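The same aggregation in dplyr, for readers who prefer it over the now-superseded plyr (a sketch using the question's example data):

```r
library(dplyr)

data.multipleyears <- data.frame(
  DATETIME  = c("2001-01-01 01:00:00", "2001-05-03 09:00:00",
                "2007-01-01 01:00:00", "2008-02-29 12:00:00"),
  Windspeed = c(10, 5, 8, 3)
)

# drop the year (characters 1-5) and average over the month-day-hour key
average.year <- data.multipleyears %>%
  group_by(when = substr(DATETIME, 6, 19)) %>%
  summarize(Aggregate.Windspeed = mean(Windspeed))
```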
This is a pretty old post, but I wanted to add that timeAverage from the openair package can also be used here. The package manual describes more options for the timeAverage function.