Summarising weather data by day (from package nycflights13 in R)

I would like to summarise hourly weather data by day (get the total precipitation and the maximum wind speed for each day). I found a code snippet on the web, but it returns only one observation for both variables instead of one observation per day.
How can I change this particular code? And what other ways exist to perform this task?
Thanks!
library(nycflights13)
library(dplyr)
precip <- weather %>%
  group_by(month, day) %>%
  filter(month < 13) %>%
  summarise(totprecip = sum(precip), maxwind = max(wind_speed))
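One common cause of this symptom, worth checking: if plyr is loaded after dplyr, plyr::summarise masks dplyr::summarise and ignores the grouping, collapsing everything into a single row; calling dplyr::summarise() explicitly avoids that, and adding na.rm = TRUE handles the missing precip values. As for other ways to do it, here is a base-R sketch on a small synthetic stand-in for weather (the wx data frame below is invented and only mimics the relevant columns):

```r
# Synthetic stand-in for nycflights13::weather with just the relevant columns
wx <- data.frame(month      = c(1, 1, 2),
                 day        = c(1, 1, 1),
                 precip     = c(0.1, 0.2, 0),
                 wind_speed = c(10, 12, 8))

# One row per (month, day): total precipitation and maximum wind speed
totprecip <- aggregate(precip ~ month + day, data = wx, FUN = sum)
maxwind   <- aggregate(wind_speed ~ month + day, data = wx, FUN = max)
daily     <- merge(totprecip, maxwind, by = c("month", "day"))
daily
#   month day precip wind_speed
# 1     1   1    0.3         12
# 2     2   1    0.0          8
```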

Related

keep columns after summarising using tidyverse in R

I have a dataset that consists of groups with year, month, and day values. I want to filter the groups using tidyverse in R, such that I locate the latest month in the time series. Here is some example code.
dat <- expand.grid(group = seq(1, 5), year = seq(2016, 2020), month = 1:12)
dat <- dat[order(dat$group, dat$year, dat$month), ]
dat$days <- sample(seq(0, 30), nrow(dat), replace = TRUE)
dat$year[dat$year == 2020 & dat$month == 12] <- NA
dat <- dat[complete.cases(dat), ]
In this example, there are 5 groups with monthly data from 2016 to 2020. However, December 2020 is missing, and some days are missing from the dataset as well.
I can grab December from 2019, but I'm not sure how to include the days in the summary and filter by the number of days in a month. For example,
a <- dat %>%
  group_by(group, month) %>%
  summarise(year = max(year))
gets the year, but I would like to attach the correct days for that month and year. Does anyone know how to keep the days column? I don't want to average it or take a minimum or anything.
We can use slice_max to return the full row based on the max value of 'year' for each grouping block:
library(dplyr)
dat %>%
  group_by(group, month) %>%
  slice_max(year)
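A self-contained illustration on invented data (the toy data frame below is mine, not the asker's): slice_max keeps every column of the winning row, which is exactly what keeps days around. Note that by default it returns all tied rows; pass with_ties = FALSE to keep exactly one per group.

```r
library(dplyr)

toy <- data.frame(group = c(1, 1, 1, 2, 2),
                  month = rep(12, 5),
                  year  = c(2018, 2019, 2019, 2019, 2017),
                  days  = c(30, 28, 31, 29, 30))

res <- toy %>%
  group_by(group, month) %>%
  slice_max(year, with_ties = FALSE)  # one row per group; `days` comes along
res
```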

Data Aggregation Using For Loops

I have a data set containing individual basketball players' statistics for each team over 17 years. In R, I am trying to turn these player-level observations into team-level observations (for each year) by using a for loop that iterates through year and team and then aggregates the top three scorers' individual statistics (points, assists, rebounds, etc.). How would you recommend I proceed? (Below you will find my current attempt; it only keeps the observations from the last team and year of the data set, and it can't pull other statistics such as assists and rebound numbers for the three top scorers.)
for (year in 2000:2017) {
  for (team in teams) {
    # ts3_points is overwritten on every iteration, so only the last
    # team/year combination survives the loop
    ts3_points <- top_n(select(filter(bball, Tm == team & Year == year), PPG), 3)
  }
}
It would be more helpful with your data, but I don't think you need two for loops; dplyr alone can do this. Below I used some dummy data to try to recreate your issue.
colname key:
month == years
carrier == team
origin == player
library(dplyr)
library(nycflights13)  # package with the dummy data

flights %>%
  group_by(month, carrier, origin) %>%
  summarise(hour_avg = mean(hour)) %>%  # create your summary stats
  arrange(desc(hour_avg)) %>%           # sort the data by a summary stat
  top_n(n = 3)                          # return the highest hour_avg
# returns the highest-hour_avg origin (player) for every month and carrier
# (year and team)
Hope this helps!
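As a side note (my addition, not part of the original answer): in current dplyr, top_n() is superseded, and slice_max() states the same intent more explicitly. A sketch of the same pipeline under that API:

```r
library(dplyr)
library(nycflights13)

top3 <- flights %>%
  group_by(month, carrier, origin) %>%
  summarise(hour_avg = mean(hour), .groups = "drop_last") %>%
  slice_max(hour_avg, n = 3, with_ties = FALSE)  # top 3 origins per month/carrier
top3
```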

How do I summarize values attributed to several variables in a data set?

First of all I have to describe my data set. It has three columns, where number 1 is country, number 2 is date (%Y-%m-%d), and number 3 is a value associated with each row (average hotel room prices). It continues like that in rows from 1990 to 2019. It works as such:
Country Date Value
France 2011-01-01 700
etc.
I'm trying to turn the dates into years instead of the full %Y-%m-%d format, so that the average values are aggregated for each country by year (instead of by month). How would I go about doing that?
I thought about summarizing the values manually for each country and each year, but that is hugely tedious and takes a long time (plus the code would look horrible). So I'm wondering if there is a better solution to this problem that I'm not seeing.
Here is the task at hand so far. My dataset priceOnly shows the average price for each month. I have also attributed it to show only values not equal to 0.
diffyear <- priceOnly %>%
  group_by(Country, Date) %>%
  summarize(averagePrice = mean(Value[which(Value != 0.0)]))
You can use the lubridate package to extract years and then summarise accordingly.
Something like this:
library(lubridate)

diffyear <- priceOnly %>%
  mutate(Year = year(Date)) %>%
  filter(Value > 0) %>%
  group_by(Country, Year) %>%
  summarize(averagePrice = mean(Value, na.rm = TRUE))
And in general, you should always provide a minimal reproducible example with your questions.
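In that spirit, here is a self-contained sketch of the same idea on invented data (this priceOnly is made up for illustration; only the column names match the question):

```r
library(dplyr)
library(lubridate)

priceOnly <- data.frame(
  Country = c("France", "France", "Spain"),
  Date    = as.Date(c("2011-01-15", "2011-02-15", "2011-01-15")),
  Value   = c(700, 800, 500)
)

diffyear <- priceOnly %>%
  mutate(Year = year(Date)) %>%        # 2011-01-15 becomes 2011
  filter(Value > 0) %>%
  group_by(Country, Year) %>%
  summarize(averagePrice = mean(Value, na.rm = TRUE), .groups = "drop")
diffyear
# France 2011 -> 750; Spain 2011 -> 500
```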

R Tidyverse calculating mean() of filtered subset

In R for Data Science, chapter 5.6.4, it is said:
# What proportion of flights are delayed by more than an hour?
not_cancelled %>%
  group_by(year, month, day) %>%
  summarise(hour_perc = mean(arr_delay > 60))
But is this really the proportion of flights delayed by more than an hour, relative to the total number of flights?
I ask because I am not totally sure how this mean() call works. Am I subsetting the data inside mean() by filtering out the data points with arr_delay <= 60, so that I exclude part of the data and take the mean of only those points with arr_delay > 60? Otherwise, would I have to count the subset of flights with arr_delay > 60 and divide by the total number of flights?
To recap, what I want to know is: does mean() take into account all flights, or only those that satisfy the condition? Because in the latter case it wouldn't be the proportion of flights delayed by more than an hour, just the mean of those flights that are delayed by more than an hour... or not?
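To see what mean() does here, consider a tiny made-up vector of arr_delay values (mine, for illustration): arr_delay > 60 is evaluated for every flight in the group, producing TRUE (counted as 1) or FALSE (counted as 0), so the mean is the proportion of all flights meeting the condition; nothing is filtered out.

```r
delays <- c(30, 90, 120, 10, 75)   # made-up arr_delay values for five flights
delays > 60                        # FALSE  TRUE  TRUE FALSE  TRUE
mean(delays > 60)                  # 3 TRUEs out of 5 flights -> 0.6
# Only missing values need special handling; na.rm drops them:
mean(c(delays, NA) > 60, na.rm = TRUE)  # still 0.6
```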

Calculate average daily value from large data set with R standard format date/times?

I have a dataframe of approximately 10 million rows spanning about 570 days. After using strptime to convert the dates and times, the data looks like this:
date X1
1 2004-01-01 07:43:00 1.2587
2 2004-01-01 07:47:52 1.2585
3 2004-01-01 17:46:14 1.2586
4 2004-01-01 17:56:08 1.2585
5 2004-01-01 17:56:15 1.2585
I would like to compute the average value on each day (as in days of the year, not days of the week) and then plot them. E.g. get all rows which have the day "2004-01-01", compute the average price, then do the same for "2004-01-02", and so on.
Similarly I would be interested in finding the average monthly value, or hourly price, but I imagine I can work these out once I know how to get average daily price.
My biggest difficulty here is extracting the day of the year from the date variable automatically. How can I cycle through all 365 days and compute the average value for each day, storing it in a list?
I was able to find the average value for day of the week using the weekdays() function, but I couldn't find anything similar for this.
Here's a solution using dplyr and lubridate. First, truncate each timestamp to its day with floor_date (credit to thelatemail's comment for this approach), then group_by date and calculate the mean value using summarize:
library(dplyr)
library(lubridate)
library(dplyr)
library(lubridate)

df %>%
  # floor_date() defaults to unit = "seconds", so specify "day" here
  mutate(date = floor_date(date, unit = "day")) %>%
  group_by(date) %>%
  summarize(mean_X1 = mean(X1))
Using the lubridate package, you can use a similar method to get the average by month, week, or hour. For example, to calculate the average by month:
df %>%
  mutate(date = month(date)) %>%
  group_by(date) %>%
  summarize(mean_X1 = mean(X1))
And by hour:
df %>%
  mutate(date = hour(date)) %>%
  group_by(date) %>%
  summarize(mean_X1 = mean(X1))
Day of year in lubridate is yday, as in
lubridate::yday(Sys.time())
Because the data is big, I recommend a data.table approach:
library(lubridate)
library(data.table)

df$ydate <- yday(df$date)
df <- data.table(df)
df[, mean(X1), by = ydate]
If you want different days for different years, so that 1 Jan 2004 and 1 Jan 2005 stay separate, group by the full calendar date instead (as_date, rather than ymd, handles POSIXct timestamps directly):
library(lubridate)
library(data.table)

df$ydate <- as_date(df$date)  # full calendar date rather than day-of-year
df <- data.table(df)
df[, mean(X1), by = ydate]
Note: instead of using strptime to convert the dates, you could just use the ymd_hms function from lubridate.
Just to contribute, here is how to do it for multiple columns in your data frame. It uses the same method as George's answer, with a little more added via summarise and across:
new_df <- df %>%
  mutate(date = hour(date)) %>%
  group_by(date) %>%
  summarise(across(.cols = where(is.numeric), .fns = ~ mean(.x, na.rm = TRUE)))
Here, .cols specifies that the operation is applied to all numeric columns (you can modify it to target specific columns), and .fns takes the operation you want to perform (mean, sd, etc.), including arguments such as na.rm.
Greetings!
