Aggregate hourly data for each month of the year - r

I've looked around for something similar, but couldn't find anything. I have an airport data set which looks something like this (I rounded the hours):
Date Arrival_Time Departure_Time ...
2017-01-01 13:00 14:00 ...
2017-01-01 16:00 17:00 ...
2017-01-01 17:00 18:00 ...
2017-01-01 11:00 12:00 ...
The problem is that for some months there isn't a flight at a specific time, which means I have missing data for some hours. How can I extract hourly arrivals for each hour of every month so that there are no missing values?
I've tried using dplyr and doing the following:
arrivals <- allFlights %>%
  group_by(month(Date), Arrival_Time) %>%
  summarise(n()) %>%
  na.omit()
but the problem clearly arises because group_by cannot fill in my missing data. I end up with data for every month, but no entries for some hours (e.g. no entry for month 1, hour 22:00).
I could currently get my answer by filtering each month into its own list and then fully merging it with a complete list of hours, but that's really slow as I have to do this 12 times. Ideally I'm trying to end up with something like this:
Hour Month January February March ... December
00:00 1 ### ### ### ... ###
01:00 1 ### ### ### ... ###
...
00:00 12 ### ### ### ... ###
23:00 12 ### ### ### ... ###
where ### is the number of flights for that hour of that month. Is there a nice way of doing this?
Note: I was thinking that if I could somehow join every month's hours with my complete list of hours, and replace all NAs with 0s, then that would work, but I couldn't figure out how to do it properly.
Hopefully the question makes sense. I'd gladly clarify if anything is unclear.
EDIT:
If you want to try it with the nycflights13 package, you could reproduce my attempt with the following code:
allFlights <- nycflights13::flights
allFlights$arr_time <- format(strptime(substr(as.POSIXct(sprintf("%04.0f", allFlights$arr_time), format="%H%M"), 12, 16), '%H:%M'), '%H:00')
arrivals <- allFlights %>% filter(carrier == "MQ") %>% group_by(month, arr_time) %>% summarise(n()) %>% na.omit()
Notice how arrivals doesn't have anything for month 1, hour 02:00, 03:00, etc. What I'm trying to do is have this be a complete data set with the missing hours filled in as 0.

I think you can use the code below to generate what you need.
library(dplyr)
library(stringr)
dim_month_hour <- expand.grid(hour = paste(str_pad(seq(0, 23, 1), 2, "left", "0"), "00", sep = ":"),
                              month = sort(unique(allFlights$month)),
                              stringsAsFactors = FALSE)
arrivals_full <- left_join(dim_month_hour, arrivals, by = c("hour" = "arr_time", "month" = "month"))
arrivals_full[is.na(arrivals_full$`n()`), "n()"] <- 0
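If you'd rather stay in the tidyverse for the NA-to-zero step, tidyr::complete() can do the expand-and-fill in one go. A sketch, assuming the count column is named n (i.e. summarise(n = n()) instead of summarise(n())):
library(dplyr)
library(tidyr)
all_hours <- sprintf("%02d:00", 0:23)
arrivals_full <- allFlights %>%
  group_by(month, arr_time) %>%
  summarise(n = n()) %>%                              # still grouped by month after summarise()
  filter(!is.na(arr_time)) %>%
  complete(arr_time = all_hours, fill = list(n = 0))  # add every missing hour as 0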

Is this what you're trying to do? I'm not sure if I'm aggregating exactly how you want, but the !is.na should do what you're looking for.
arrivals <- allFlights %>%
  group_by(month(Date), Arrival_Time) %>%
  summarise(n = sum(!is.na(Arrival_Time)))
Edit: I may not have been clear. Do you want a zero to show for hours where there is no data?
So I'm circling in on it. There's a cool package called padr that will "pad" the date/time entries with NAs for missing values. Because there is a time_hour field, you can use pad.
library(padr)
allFlightsPad <- allFlights %>% pad
Then you can summarize from there. See this page for info.
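For example, a sketch of that summarise step using the nycflights13 columns (time_hour is the built-in POSIXct field, and fill_by_value() is padr's helper for turning the padded NAs into zeros):
library(dplyr)
library(lubridate)
library(padr)
hourly <- allFlights %>%
  count(time_hour) %>%            # flights per hour
  pad(interval = "hour") %>%      # insert rows for hours with no flights
  fill_by_value(n, value = 0)     # padded rows get n = 0 instead of NA
hourly %>%
  group_by(month = month(time_hour), hour = hour(time_hour)) %>%
  summarise(flights = sum(n))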

Related

R:: Dplyr: How to get each Monday in a specific date range + count it

I have data on when an ATP Tennis tournament took place with two columns in the following format:
Tournament        Date Range
Australian Open   20.01.2020 - 02.02.2020
Now the goal is to predict the participation, but solely for each Monday (if the date range contains a Monday, of course). Also, since in tennis you are out once you lose, I am assuming that the participation in the second week is higher, since by then only the good players are left in the tournament. That is why we need to know whether it is week one or two of the tournament.
Hence for the above example we would need something like this at the end:
| Tournament      | Date       | Number of week |
| Australian Open | 20.01.2020 | 1              |
| Australian Open | 27.01.2020 | 2              |
I know that I can count in dplyr, but how would you get "only Mondays" in a way that is compatible with dplyr? Essentially the SQL "where DAYOFWEEK(Column) = 2".
I guess one would first need to separate the date range into two columns?
The search function didn't yield anything covering such a specific problem, so this could help someone in the future.
Cheers vie2bgd
#############################################
#############################################
Edit:
After working day and night I almost have it, but I'm missing the last step... Also, sorry, it's my first post; no need to ghost me or give me immediate minus points. Thanks @NirGraham for at least giving me some hints, much appreciated, and I will try to implement them. Technically I did share data up there in line with the instructions here on how to do it, simply separated by |; I just forgot some points.
Here is what I did so far:
library(tidyverse)
library(lubridate)

# first I separated the initial range into 2 columns
tennis.orf.2020.2 <- tennis.orf.2020 %>% separate(Datum, c("Start", "End"), sep = " - ")

x <- tennis.orf.2020.2 %>%
  mutate(across(c(Start, End), as.Date, "%d.%m.%Y")) %>%
  transmute(Tournament, date = map2(Start, End, seq, by = 'day')) %>%
  unnest(c(date)) %>%
  filter(wday(date) == 2) %>%
  count(Tournament, date)
This is what I get:
Tournament        Date         Number of week
Australian Open   2020-01-20   1
Australian Open   2020-01-27   1
This should be the result:
Tournament        Date         Number of week
Australian Open   2020-01-20   1
Australian Open   2020-01-27   2
If I group by tournament I lose a row :(
################################################
EDIT2:
Never mind, I finally got it, although it makes zero sense to me.
Hopefully this will help somebody out there and save someone some valuable time:
x %>%
  arrange(date) %>%
  group_by(Tournament) %>%
  mutate(dummy = 1) %>%
  mutate(times = cumsum(dummy)) %>%
  select(-dummy)
You can use the row_number() helper function for this:
x %>%
  arrange(date) %>%
  group_by(Tournament) %>%
  mutate(times = row_number())
This is a more concise equivalent of your code with the cumsum(dummy).

How can I obtain a tsibble from this tibble without using the as.Date function?

I need to convert several tibbles into tsibbles.
Here is a simple example:
require(tidyverse)
require(lubridate)
require(tsibble)
time_1 <- c(ymd_hms('20210101 000000'),
ymd_hms('20210101 080000'),
ymd_hms('20210101 160000'),
# ymd_hms('20210102 000000'),
ymd_hms('20210102 080000'),
ymd_hms('20210102 160000'))
df_1 <- tibble(time_1, y=rnorm(5))
df_1 %>%
as_tsibble(index=time_1)
This chunk of code works as expected.
But, if the dates are all midnights, this code throws an error:
time_2 <- c(ymd_hms('20210101 000000'),
ymd_hms('20210102 000000'),
ymd_hms('20210103 000000'),
# ymd_hms('20210104 000000'),
ymd_hms('20210105 000000'),
ymd_hms('20210106 000000'))
df_2 <- tibble(time_2, y=rnorm(5))
df_2 %>%
as_tsibble(index=time_2)
I don't want to solve this issue in this way because the as.Date function changes the column type.
df_2 %>%
mutate(time_2=as.Date(time_2)) %>%
as_tsibble(index=time_2)
I also don't want to fix the issue in this way, because after converting the tibble into a tsibble I need to apply the fill_gaps function, which doesn't create the ymd_hms('20210104 000000') entry in this second scenario.
df_2 %>%
as_tsibble(index=time_2, regular=FALSE)
Is this a bug?
Thanks.
This behaviour is explained in tsibble's FAQ.
Essentially subdaily data (ymd_hms()) measured at midnight each day doesn't necessarily have an interval of 1 day (24 hours). Consider that some days have shifts due to daylight savings in your time zone, and so the number of hours between midnight and midnight the next day may be 23 or 25 hours.
If you're working with data measured at a daily interval, you should use a date with ymd() precision. You can convert it back to a date-time using as_datetime() if you like.
Personally, I don't think this should produce an error; however, it is much simpler if it does. Perhaps the appropriate interval here is 1 hour or 30 minutes (or whatever is appropriate for timezone shifts in the specified time zone).
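A sketch of that suggestion: index by Date for daily data, fill the gap, and convert back with as_datetime() afterwards if you really need a date-time column.
library(dplyr)
library(lubridate)
library(tsibble)
df_2 %>%
  mutate(time_2 = as_date(time_2)) %>%  # daily data, so index by Date
  as_tsibble(index = time_2) %>%
  fill_gaps()                           # now inserts the missing 2021-01-04 row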

How to find probability of dataset in R

I have a dataset something like the one below (this is a small part of it). How can I calculate the probability of rain by month and show it with a barplot?
Date Rain Today
2020-01-01 Yes
2020-01-02 No
2020-01-03 Yes
2020-01-04 Yes
2020-01-05 No
... ...
2020-12-31 Yes
EDIT: Correct answer in the comments
I don't know why you would want to use a scatterplot for this, but, from this post, you can use dplyr pipelines to do something like this:
library(dplyr)
df %>%
  group_by(month = format(Date, "%Y-%m")) %>%
  summarise(probability = mean(`Rain Today` == 'Yes'))
This groups your data into months and works out how many days it has rained/not rained, then takes the mean of how many days it has rained.
Thank you everyone in the comments for pointing it out. I hope this helps.
The lubridate package has some great functions that help you deal with dates.
install.packages("lubridate")
df$month <- lubridate::month(df$Date)
tapply(df[,"Rain Today"]=="Yes", df$month, mean)
You may need to execute df$Date <- as.Date(df$Date) first if it's currently stored as character rather than a date.
If you don't want to have any dependencies, then I think you can get what you want like this:
df$month <- substr(df$Date, start=6, stop=7) #Get the 6th and 7th characters of your date strings, which correspond to the "month" part
tapply(df[,"Rain Today"]=="Yes", df$month, mean)
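And for the barplot part of the question, a minimal sketch that reuses the month column and the tapply() result from above:
monthly_prob <- tapply(df[, "Rain Today"] == "Yes", df$month, mean)
barplot(monthly_prob,
        xlab = "Month",
        ylab = "Probability of rain",
        main = "Probability of rain by month")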

How to group by timestamp in UTC by day in R

So I have this sample of UTC timestamps and a bunch of other data. I would like to group my data by date. This means I do not need hours/mins/secs and would like to have a new df which shows the number of actions grouped together.
I tried using lubridate to pull out the date, but I can't get the origin right.
DATA
hw0 <- read.table(text =
'ID timestamp action
4f.. 20160305195246 visitPage
75.. 20160305195302 visitPage
77.. 20160305195312 checkin
42.. 20160305195322 checkin
8f.. 20160305195332 searchResultPage
29.. 20160305195342 checkin', header = T)
Here's what I tried
library(dplyr)
library(lubridate) #this will allow us to extract the date
daily <- hw0 %>%
mutate(date=date(as.POSIXct(timestamp),origin='1970-01-01'))
daily <- daily %>%
group_by(date)
I am unsure what to use as an origin and my error says this value is incorrect. Ultimately, I expect the code to return a new df which features a variable (date) with a list of unique dates as well as how many of the different actions there are in each day.
Assuming the numbers at the end are 24-hour times, you can use:
daily = hw0 %>%
  mutate(date = as.POSIXct(as.character(timestamp), format = '%Y%m%d%H%M%S'))
You can use as.Date instead if you want to get rid of the hour times. You need to supply the origin when you give a numeric argument, which is interpreted as the number of days since the origin. In your case you should just give it a character vector and supply the date format.
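From there, the per-day counts the question asks for are one step away. A sketch, where action is the column from the sample data and as.Date() drops the time of day:
daily_counts <- hw0 %>%
  mutate(date = as.Date(as.character(timestamp), format = '%Y%m%d')) %>%
  count(date, action)   # number of each action per day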
lubridate also has the ymd_hms() function that can parse the timestamp, and the floor_date() function would then help truncate it to the day.
library(tidyverse)
library(lubridate)

daily <- hw0 %>%
  mutate(time = ymd_hms(timestamp, tz = 'UTC'),
         date = floor_date(time, unit = 'day'))
lubridate also has parse_date_time which seems to be a nice mix of the above two solutions.
library(tidyverse)
library(lubridate)

hw0 %>%
  mutate(timestamp = parse_date_time(timestamp, orders = "%Y%m%d%H%M%S"))
ID timestamp action
1 4f.. 2016-03-05 19:52:46 visitPage
2 75.. 2016-03-05 19:53:02 visitPage
3 77.. 2016-03-05 19:53:12 checkin
4 42.. 2016-03-05 19:53:22 checkin
5 8f.. 2016-03-05 19:53:32 searchResultPage
6 29.. 2016-03-05 19:53:42 checkin

Get aggregate sum of data by day and hour

The below is an example of the data I have.
date time size filename day.of.week
1 2015-01-16 5:36:12 1577 01162015053400.xml Friday
2 2015-01-16 5:38:09 2900 01162015053600.xml Friday
3 2015-01-16 5:40:09 3130 01162015053800.xml Friday
What I would like to do is sum up the size of the files for each hour.
I would like a resulting data table that looks like:
date hour size
2015-01-16 5 7607
2015-01-16 6 10000
So forth and so on.
But I can't quite seem to get the output I need.
I've tried ddply and aggregate, but I'm summing up the entire day; I'm not sure how to break it down by the hour in the time column.
And I've got multiple days worth of data. So it's not only for that one day. It's from that day, almost every day until yesterday.
Thanks!
The following should do the trick, assuming your example data are stored in a data frame called "test":
library(lubridate) # for hms and hour functions
test$time <- hms(test$time)
test$hour <- factor(hour(test$time))
library(dplyr)
test %>%
select(-time) %>% # dplyr doesn't like this column for some reason
group_by(date, hour) %>%
summarise(size=sum(size))
You can use data.table
library(data.table)
dt <- as.data.table(df)  # df is the data frame shown above

# Define a time stamp column.
dt[, timestamp := as.POSIXct(strptime(paste(date, time), format = "%Y-%m-%d %H:%M:%S"))]

# Aggregate by hour
dt[, .(size = sum(size)), by = .(hour = as.POSIXct(round(timestamp, "hour")))]
Benefit is that data.table is blazing fast!
Use a compound group_by(day, hour). That will do it.
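For instance, a sketch along those lines (assuming the example data are in a data frame called test and using lubridate to pull the hour out of the time column):
library(dplyr)
library(lubridate)
test %>%
  mutate(day = as.Date(date), hour = hour(hms(time))) %>%
  group_by(day, hour) %>%
  summarise(size = sum(size), .groups = "drop")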
If you convert your date and time columns into a single POSIXct column when (similar to a previous answer, i.e. df$when <- as.POSIXct(strptime(paste(df$date, df$time), format = "%Y-%m-%d %H:%M:%S"))), you could use:
aggregate(df[c("size")], FUN=sum, by=list(d=as.POSIXct(trunc(df$when, "hour"))))
