Group by day and hour - r

I created a data frame with three columns: date, ID (station_uuid) and price (e5).
I want to get the mean price by day and hour.
> head(fuel_price, n = 5)
date station_uuid e5
1 2019-04-15 04:01:06+02 88149d2f-3258-445b-bfa4-60898e7fb186 1.529
2 2019-04-15 04:56:05+02 5c2d04fd-e464-4c96-b4a6-d996d0a8630c 1.539
3 2019-04-15 05:00:06+02 c8137d18-edad-4006-9746-18e876b14b1d 1.530
4 2019-04-16 05:00:06+02 6b2143cb-1cd8-4b4b-b2fb-2502f6ea8b35 1.542
5 2019-04-16 05:02:06+02 dbdb29f5-93aa-4ee4-a52b-7bff0e4ab75a 1.562
I think the main problem is that the date is not in the right format, but I am not able to convert it because of the +02 timezone offset at the end.
price_2019$date <- mdy_hms(prices_2019$date)
If this were fixed, would the following work with dplyr?
agg_price <- price_2019 %>% group_by(Date=floor_date(date, "hour")) %>% summarize(mean_price = mean(price))
Could you help me out?

You can use lubridate::ymd_hms to convert the date variable to a date-time, build a day-and-hour label from it, group by that label and take the mean price for each hour.
library(dplyr)
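prices_2019 %>%
  # ymd_hms() parses the "+02" offset; the resulting date-times are in UTC
  mutate(date = lubridate::ymd_hms(date),
         date_hour = format(date, "%Y-%m-%d %H")) %>%
  group_by(date_hour) %>%
  # the price column is called e5 in the data shown in the question
  summarize(mean_price = mean(e5))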
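For comparison, here is a minimal base R sketch of the same idea, assuming the data looks like the five rows printed in the question. The sample data frame is retyped from that printout, and the trick of padding "+02" to "+0200" so that `%z` can read it is my assumption about the simplest way to handle the offset without lubridate. Hours are reported in UTC here.

```r
# Sample rows retyped from the question's head() output
fuel_price <- data.frame(
  date = c("2019-04-15 04:01:06+02", "2019-04-15 04:56:05+02",
           "2019-04-15 05:00:06+02", "2019-04-16 05:00:06+02",
           "2019-04-16 05:02:06+02"),
  e5 = c(1.529, 1.539, 1.530, 1.542, 1.562)
)

# %z expects an offset such as +0200, so pad the "+02" suffix with "00"
parsed <- as.POSIXct(paste0(fuel_price$date, "00"),
                     format = "%Y-%m-%d %H:%M:%S%z", tz = "UTC")

# Label every row with its day and hour (UTC), then average e5 per label
fuel_price$date_hour <- format(parsed, "%Y-%m-%d %H", tz = "UTC")
agg_price <- aggregate(e5 ~ date_hour, data = fuel_price, FUN = mean)
agg_price
```

To report local hours instead, pass a different `tz` to `format()`, for example `tz = "Europe/Berlin"` if that is where the stations are.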

Related

Add factor column for POSIXct Date format

I have the following df with the Date column having hourly marks for an entire year:
Date TD RN D.RN Press Temp G.Temp. Rad
1 2018-01-01 00:00:00 154.0535 9.035156 1.416667 950.7833 7.000000 60.16667 11.27000
2 2018-01-01 01:00:00 154.5793 9.663900 1.896667 951.2000 6.766667 59.16667 11.23000
3 2018-01-01 01:59:59 154.5793 7.523438 2.591667 951.0000 6.066667 65.16667 11.23500
4 2018-01-01 02:59:59 154.0535 7.994792 2.993333 951.1833 5.733333 64.00000 11.16833
5 2018-01-01 03:59:59 154.4041 6.797526 3.150000 951.4833 5.766667 57.83333 11.13500
6 2018-01-01 04:59:59 155.1051 12.009766 3.823333 951.0833 5.216667 61.33333 11.22167
I want to add a factor column 'Quarters' that indicates each quarter according to the 'Date'.
As far as I understand I can do that by:
Radiation$Quarter<-cut(Radiation$Date, breaks = "quarters", labels = c("Q1", "Q2", "Q3", "Q4"))
But I also want to add a factor column 'Day/Night' which indicates whether it's day or night, having:
Day → 8am - 8pm
Night → 8pm - 8am
It seems like with the cut() function there's no way to indicate time ranges.
You can use an ifelse/case_when statement after extracting hour from time.
library(dplyr)
library(lubridate)
df %>%
mutate(hour = hour(Date),
label = case_when(hour >= 8 & hour <= 19 ~ 'Day',
TRUE ~ 'Night'))
In base R :
df$hour = as.integer(format(df$Date, '%H'))
transform(df, label = ifelse(hour >= 8 & hour <= 19, 'Day', 'Night'))
We can also do
library(dplyr)
library(lubridate)
df %>%
mutate(hour = hour(Date),
label = case_when(between(hour, 8, 19) ~ "Day", TRUE ~ "Night"))
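Since the question asks for a factor column, the base R variant can be wrapped in `factor()`. The sketch below uses four made-up timestamps placed right at the 8am and 8pm boundaries (they are not from the original data) to check that hours 8 to 19 are labelled Day and everything else Night.

```r
# Hypothetical timestamps around the day/night boundaries
df <- data.frame(
  Date = as.POSIXct(c("2018-01-01 07:59:59", "2018-01-01 08:00:00",
                      "2018-01-01 19:59:59", "2018-01-01 20:00:00"),
                    tz = "UTC")
)

# Extract the hour (0-23) and label 08:00:00 up to 19:59:59 as Day
df$hour <- as.integer(format(df$Date, "%H"))
df$label <- factor(ifelse(df$hour >= 8 & df$hour <= 19, "Day", "Night"),
                   levels = c("Day", "Night"))
df
```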

Calculate mean days in R using dplyr, filter, group_by and summarise?

I want to create a table that shows the mean number of days by submitted_via (please see consumer_compliants.csv), where date_diff is date_sent minus date_received. The data should be filtered to keep only date_diff values greater than 0. All of this has to be done using dplyr, %>%, filter, group_by, summarise_at and knitr::kable().
I have tried this in R
date_received <- as.Date(mydata$date_received, "%m/%d/%Y")
date_sent <- as.Date(mydata$date_sent_to_company, "%m/%d/%Y")
date_diff <- (date_sent) - (date_received)
mydata %>%
filter(date_diff > 0) %>%
group_by(date_received, date_sent_to_company) %>%
summarise(
a = mean(date_diff))
Output:
Email 11.973214 days
Fax 7.057072 days
Phone 6.290040 days
Postal mail 9.627809 days
Referral 6.761684 days
Web 10.695773 days
Any suggestions please?
In base R, we can do in the following way :
#select the date columns
cols <- c("date_received", "date_sent_to_company")
#Change the columns to date class
consumer_complaints[cols] <- lapply(consumer_complaints[cols],as.Date,"%m/%d/%Y")
#Subtract date_received from date_sent_to_company to get date_diff
#Select rows where date_diff is greater than 0 and take the mean for each submitted_via
aggregate(date_diff~submitted_via, subset(transform(consumer_complaints,
date_diff = date_sent_to_company - date_received), date_diff > 0), mean)
# submitted_via date_diff
#1 Email 11.97
#2 Fax 7.06
#3 Phone 6.29
#4 Postal mail 9.63
#5 Referral 6.76
#6 Web 10.70
This might be something closer to what you want:
library(dplyr)
mydata %>%
mutate_at(vars(starts_with("date_")), as.Date, format = "%m/%d/%Y") %>%
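# In the sample data below date_received is the later date, so the difference is
# taken as received minus sent; swap the operands to match the question's definition
mutate(date_diff = date_received - date_sent) %>%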
filter(date_diff > 0) %>%
group_by(submitted_via) %>%
summarise(a = mean(date_diff))
Output
# A tibble: 3 x 2
submitted_via a
<fct> <drtn>
1 phone 22 days
2 Referral 27 days
3 web 4 days
Data
mydata <- read.table(
text =
"date_received date_sent submitted_via
9/30/2015 9/3/2015 Referral
9/3/2015 8/30/2015 web
9/25/2015 9/3/2015 phone
9/18/2015 9/18/2015 Referral", header = T
)
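As a self-contained check of the steps (convert, subtract, filter, group, average), here is a base R sketch that uses the question's direction, date sent minus date received. The four rows are invented for illustration and the column names follow the question; it does not use dplyr or knitr::kable, so treat it as a way to verify the numbers rather than as the requested pipeline.

```r
# Hypothetical complaints data with the question's column names
complaints <- data.frame(
  date_received        = c("09/03/2015", "08/30/2015", "09/03/2015", "09/18/2015"),
  date_sent_to_company = c("09/30/2015", "09/03/2015", "09/25/2015", "09/18/2015"),
  submitted_via        = c("Referral", "Web", "Phone", "Referral")
)

# Convert both columns to Date and compute the difference in days
complaints$date_received <- as.Date(complaints$date_received, "%m/%d/%Y")
complaints$date_sent_to_company <- as.Date(complaints$date_sent_to_company, "%m/%d/%Y")
complaints$date_diff <- as.numeric(complaints$date_sent_to_company -
                                     complaints$date_received)

# Keep only positive differences, then average per submission channel
positive <- subset(complaints, date_diff > 0)
mean_days <- aggregate(date_diff ~ submitted_via, data = positive, FUN = mean)
mean_days
```

The same-day Referral row has a difference of 0 and is dropped by the filter, so it does not pull the Referral mean down.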

Subtract month/year to get years

data = data.frame("start"= c("1/2000","8/2004","99/9999"),
"stop"=c("1/2001","2/2007","09/2010"),
"WANTYEARS"= c(1,2.5,NA))
I have dates in month/year format and want to subtract them to get the difference in years.
My attempt, a simple data$stop - data$start, did not yield the desired result. Thank you.
The yearmon class represents months and years as years and fraction of a year.
Using data shown in the Note at the end:
library(zoo)
transform(data, diff = as.yearmon(stop, "%m/%Y") - as.yearmon(start, "%m/%Y"))
giving:
start stop diff
1 1/2000 1/2001 1.0
2 8/2004 2/2007 2.5
3 99/9999 09/2010 NA
Note
data = data.frame(start= c("1/2000", "8/2004", "99/9999"),
stop = c("1/2001", "2/2007", "09/2010"))
Another option is difftime from base R. Prepend "01/" to the start and stop values to turn them into complete dates, subtract them with difftime using units = "weeks", and divide by the average number of weeks in a year (365.25 / 7) to get the difference in years:
round(difftime(as.Date(paste0("01/", data$stop), "%d/%m/%Y"),
as.Date(paste0("01/", data$start), "%d/%m/%Y"), units = "weeks")/(365.25/7), 2)
#[1] 1.0 2.5 NA
The same works with any other difftime unit, as long as we divide by the matching number of units per year, for example with "days":
round(difftime(as.Date(paste0("01/", data$stop), "%d/%m/%Y"),
as.Date(paste0("01/", data$start), "%d/%m/%Y"), units = "days")/365.25, 2)
#[1] 1.0 2.5 NA
One possibility involving dplyr and lubridate could be:
library(dplyr)
library(lubridate)
data %>%
mutate_at(vars(1:2), list(~ parse_date_time(., "my"))) %>%
mutate(WANTYEARS = round(time_length(stop - start, "years"), 1))
start stop WANTYEARS
1 2000-01-01 2001-01-01 1.0
2 2004-08-01 2007-02-01 2.5
3 <NA> 2010-09-01 NA
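The "years plus fraction of a year" representation mentioned for yearmon can also be computed by hand in base R, which avoids any dependence on the length of individual years. The helper name `to_year_fraction` is hypothetical, and treating a month outside 1 to 12 as NA is my assumption about how "99/9999" should be handled.

```r
# Hypothetical helper: "m/YYYY" -> year + (month - 1) / 12, NA for invalid months
to_year_fraction <- function(x) {
  parts <- strsplit(as.character(x), "/", fixed = TRUE)
  month <- as.integer(vapply(parts, `[`, character(1), 1))
  year  <- as.integer(vapply(parts, `[`, character(1), 2))
  month <- ifelse(month >= 1 & month <= 12, month, NA)  # "99" is not a month
  year + (month - 1) / 12
}

data <- data.frame(start = c("1/2000", "8/2004", "99/9999"),
                   stop  = c("1/2001", "2/2007", "09/2010"))

# Difference in years between stop and start
data$years <- to_year_fraction(data$stop) - to_year_fraction(data$start)
data
```

Because every month counts as exactly 1/12 of a year, 8/2004 to 2/2007 comes out as 2.5 without rounding.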

Aggregate ISO weeks into months with a dataset containing just ISO weeks

My data is in a dataframe which has a structure like this:
df2 <- data.frame(Year = c("2007"), Week = c(1:12), Measurement = c(rnorm(12, mean = 4, sd = 1)))
Unfortunately I do not have the complete date (e.g. days are missing) for each measurement, only the Year and the Weeks (these are ISO weeks).
Now I want to aggregate the Median of a Month's worth of measurements (e.g. the weekly measurements per month of the specific year) into a new column, Months. I did not find a convenient way to do this without having the exact day of the measurements available. Any inputs are much appreciated!
When it is necessary to allocate a week to a single month, the rule for first week of the year might be applied, although ISO 8601 does not consider this case. (Wikipedia)
For example, the 5th week of 2007 belongs to February, because the Thursday of the 5th week was the 1st of February.
I am using the data.table and ISOweek packages. The example below shows how to compute the month that each week belongs to; after that you can do any aggregation by month.
require(data.table)
require(ISOweek)
df2 <- data.table(Year = c("2007"), Week = c(1:12),
Measurement = c(rnorm(12, mean = 4, sd = 1)))
# Generate Thursday as year, week of the year, day of week according to ISO 8601
df2[, thursday_ISO := paste(Year, sprintf("W%02d", Week), 4, sep = "-")]
# Convert Thursday to date format
df2[, thursday_date := ISOweek2date(thursday_ISO)]
# Compute month
df2[, month := format(thursday_date, "%m")]
df2
Suggestion by Uwe to compute a year-month string.
# Compute year-month
df2[, yr_mon := format(ISOweek2date(sprintf("%s-W%02d-4", Year, Week)), "%Y-%m")]
df2
And finally you can do an aggregation to the new table or by adding median as a column.
df2[, median(Measurement), by = yr_mon]
df2[, median := median(Measurement), by = yr_mon]
df2
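If the ISOweek package is not available, the Thursday of an ISO week can be derived in base R from the fact that 4 January always falls in ISO week 1. The function name `iso_week_thursday` is hypothetical, and the sketch assumes valid week numbers for the given year.

```r
# Hypothetical helper: date of the Thursday in a given ISO year and week
iso_week_thursday <- function(year, week) {
  jan4 <- as.Date(sprintf("%d-01-04", as.integer(as.character(year))))
  weekday <- as.integer(format(jan4, "%u"))   # 1 = Monday ... 7 = Sunday
  week1_monday <- jan4 - (weekday - 1)        # Monday of ISO week 1
  week1_monday + (week - 1) * 7 + 3           # Thursday of the requested week
}

set.seed(1)
df2 <- data.frame(Year = "2007", Week = 1:12,
                  Measurement = rnorm(12, mean = 4, sd = 1))

# Assign each week to the month that contains its Thursday, then take medians
df2$Month <- format(iso_week_thursday(df2$Year, df2$Week), "%Y-%m")
monthly_median <- aggregate(Measurement ~ Month, data = df2, FUN = median)
monthly_median
```

For 2007 this puts weeks 1 to 4 in January, 5 to 8 in February and 9 to 12 in March, in line with the example of week 5 given above.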
If I understand correctly, you don't know the exact day, only the week number and year. My answer takes the first day of the year as a starting date and then computes one-week intervals from it. You can probably refine the answer.
Based on an answer by mnel, using the lubridate package.
library(lubridate)
# Prepare week, month, year information ready for the merge
# Make sure you have all the necessary dates
wmy <- data.frame(Day = seq(ymd('2007-01-01'),ymd('2007-04-01'),
by = 'weeks'))
wmy <- transform(wmy,
Week = isoweek(Day),
Month = month(Day),
Year = isoyear(Day))
# Merge this information with your data
merge(df2, wmy, by = c("Year", "Week"))
Year Week Measurement Day Month
1 2007 1 3.704887 2007-01-01 1
2 2007 10 1.974533 2007-03-05 3
3 2007 11 4.797286 2007-03-12 3
4 2007 12 4.291169 2007-03-19 3
5 2007 2 4.305010 2007-01-08 1
6 2007 3 3.374982 2007-01-15 1
7 2007 4 3.600008 2007-01-22 1
8 2007 5 4.315184 2007-01-29 1
9 2007 6 4.887142 2007-02-05 2
10 2007 7 4.155411 2007-02-12 2
11 2007 8 4.711943 2007-02-19 2
12 2007 9 2.465862 2007-02-26 2
Using dplyr you can try:
require(dplyr)
df2 %>% mutate(Date = as.Date(paste("1", Week, Year, sep = "-"), format = "%w-%W-%Y"),
Year_Mon = format(Date,"%Y-%m")) %>% group_by(Year_Mon) %>%
summarise(result = median(Measurement))
As @djhrio pointed out, Thursday is used to determine which month a week belongs to. So simply switch paste("1", to paste("4", in the code above.
This can be done relatively simply in dplyr.
library(dplyr)
df2 %>%
mutate(Month = rep(1:3, each = 4)) %>%
group_by(Month) %>%
summarise(MonthlyMedian = stats::median(Measurement))
Basically, add a new column to define your months. I'm presuming since you don't have days, you are going to allocate 4 weeks per month?
Then you just group by your Month variable and calculate the median. Very simple
Hope this helps

How to split a panel data record in R based on a threshold value for a variable?

I have data for hospitalisations that records date of admission and the number of days spent in the hospital:
ID date ndays
1 2005-06-01 15
2 2005-06-15 60
3 2005-12-25 20
4 2005-01-01 400
4 2006-06-04 15
I would like to create a dataset of days spent at the hospital per year, and therefore I need to deal with cases like ID 3, whose stay at the hospital goes over the end of the year, and ID 4, whose stay at the hospital is longer than one year. There is also the problem that some people do have a record in the next year, and I would like to add the `surplus' days to that record when this happens.
So far I have come up with this solution:
library(lubridate)
# Days from admission to 31 December of the admission year, capped at ndays
ndays_new <- ifelse((as.Date(paste(year(data$date), "12-31", sep = "-"),
                             format = "%Y-%m-%d") - data$date) < data$ndays,
                    as.Date(paste(year(data$date), "12-31", sep = "-"),
                            format = "%Y-%m-%d") - data$date,
                    data$ndays)
However, I can't think of a way to get those `surplus' days that go over the end of the year and assign them to a new record starting in the next year. Can anyone point me to a good solution? I use dplyr, so solutions with that package would be especially welcome, but I'm willing to try any other tool if needed.
My solution isn't compact, but it uses dplyr. I first renamed the columns for my own understanding. I then calculated a second date (date.2) by adding ndays to date.1. If the years of date.1 and date.2 match, the stay does not reach the following year; if they differ, it does, and ndays.2 holds the days that fall in the following year. Then I reshaped the data using do(). After filtering out the rows with NAs, I converted date to year and aggregated the data by ID and year.
library(dplyr)
rename(mydf, date.1 = date, ndays.1 = ndays) %>%
mutate(date.1 = as.POSIXct(date.1, format = "%Y-%m-%d"),
date.2 = date.1 + (60 * 60 * 24) * ndays.1,
ndays.2 = ifelse(as.character(format(date.1, "%Y")) == as.character(format(date.2, "%Y")), NA,
date.2 - as.POSIXct(paste0(as.character(format(date.2, "%Y")),"-01-01"), format = "%Y-%m-%d")),
ndays.1 = ifelse(ndays.2 %in% NA, ndays.1, ndays.1 - ndays.2)) %>%
do(data.frame(ID = .$ID, date = c(.$date.1, .$date.2), ndays = c(.$ndays.1, .$ndays.2))) %>%
filter(complete.cases(ndays)) %>%
mutate(date = as.numeric(format(date, "%Y"))) %>%
rename(year = date) %>%
group_by(ID, year) %>%
summarise(ndays = sum(ndays))
# ID year ndays
#1 1 2005 15
#2 2 2005 60
#3 3 2005 7
#4 3 2006 13
#5 4 2005 365
#6 4 2006 50
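The answer above splits a stay at one year boundary, so a stay that reaches into a third calendar year would leave the middle year empty. A more general base R sketch is to enumerate every day of each stay and count the days per calendar year. The helper name `split_stay_by_year` is hypothetical, and counting the admission date as day 1 is an assumption chosen to match the output above.

```r
# Hypothetical helper: one row per calendar year touched by a single stay
split_stay_by_year <- function(id, start, ndays) {
  days <- seq(start, by = "day", length.out = ndays)  # every day of the stay
  counts <- table(format(days, "%Y"))                  # days per calendar year
  data.frame(ID = id,
             year = as.integer(names(counts)),
             ndays = as.integer(counts))
}

# Data retyped from the question
stays <- data.frame(
  ID = c(1, 2, 3, 4, 4),
  date = as.Date(c("2005-06-01", "2005-06-15", "2005-12-25",
                   "2005-01-01", "2006-06-04")),
  ndays = c(15, 60, 20, 400, 15)
)

# Split each stay, stack the pieces, and sum per ID and year
pieces <- do.call(rbind, lapply(seq_len(nrow(stays)), function(i) {
  split_stay_by_year(stays$ID[i], stays$date[i], stays$ndays[i])
}))
per_year <- aggregate(ndays ~ ID + year, data = pieces, FUN = sum)
per_year <- per_year[order(per_year$ID, per_year$year), ]
per_year
```

Enumerating days is simple rather than fast; for very large datasets an arithmetic split per year boundary would avoid building the day sequences.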
