Data Aggregation Using For Loops - R

I have a data set that contains individual basketball players' statistics for each team over 17 years. In R, I am trying to turn these player-level observations into team-level observations (for each year) by using a for loop that iterates through year and team and then aggregates the top three scorers' individual statistics (points, assists, rebounds, etc.). How would you recommend I proceed? (Below is my current attempt; it only keeps the observations from the last team and year of the data set, and it can't pull other statistics such as assists and rebounds for the three top scorers.)
for (year in 2000:2017) {
  for (team in teams) {
    ts3_points = top_n(select(filter(bball, Tm == team & Year == year), PPG), 3)
  }
}

This would be easier to answer with your data, but I don't think you need two for loops; dplyr alone can do this. Below I used some dummy data to recreate your issue.
colname key:
month == years
carrier == team
origin == player
library(dplyr)
library(nycflights13) # package with dummy data

flights %>%
  group_by(month, carrier, origin) %>%
  summarise(hour_avg = mean(hour)) %>% # create your summary stats
  arrange(desc(hour_avg)) %>%          # sort the data by a summary stat
  top_n(n = 3)                         # return the highest hour_avg
# returns the highest hour_avg origin (player) for every month and carrier (year and team)
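Adapted to your basketball data, the same pipeline might look like the sketch below. Tm, Year, and PPG come from your snippet; APG and RPG are placeholder names for your assist and rebound columns. Because nothing is select()-ed away before summarising, the other statistics stay available:
library(dplyr)

bball %>%
  group_by(Year, Tm) %>%
  top_n(3, PPG) %>% # top three scorers per team-year (ties may return extra rows)
  summarise(
    top3_points   = sum(PPG),
    top3_assists  = sum(APG), # APG and RPG are assumed column names
    top3_rebounds = sum(RPG)
  )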
Hope this helps!

Related

R - Draw cases per 100k population

I am trying to draw a line graph of COVID cases for each date. I do not have the expected output; the lecturer gave only the questions. I solved the question, but my problem is the output: it looks weird. Here is the question:
"For the ten countries with the highest absolute number of total deaths, draw the following line graphs to visualize whether the epidemic has started to slow down and how the growth rate of new cases/deaths differs across those countries.
a) Number of new cases at each date (absolute number vs per 100.000 population)"
Here is my code:
library(utils)
library(dplyr)   # needed for %>%, group_by, etc.
library(ggplot2) # needed for ggplot

COVID_data <- read.csv("https://opendata.ecdc.europa.eu/covid19/nationalcasedeath_eueea_daily_ei/csv", na.strings = "", fileEncoding = "UTF-8-BOM")

# Finding the ten countries with the highest absolute total deaths
abs_total_deaths <- COVID_data %>%
  group_by(countriesAndTerritories) %>%
  summarise(abs_total_deaths = sum(deaths)) %>%
  arrange(desc(abs_total_deaths))

abs_ten_total_deaths <- c('Italy','France','Germany','Spain','Poland',
                          'Romania','Czechia','Hungary','Belgium','Bulgaria')

# Calculate new cases by dividing the absolute number by 100,000
# Draw a line for each country
COVID_data %>%
  filter(countriesAndTerritories %in% abs_ten_total_deaths) %>%
  filter(cases > 0) %>%
  mutate(new_cases = cases/100000) %>%
  ungroup() %>%
  ggplot() +
  geom_line(aes(x = dateRep, y = new_cases, color = countriesAndTerritories), size = 1) +
  labs(x = "Date",
       y = "New Cases",
       title = "New Cases per 100.000 population") +
  facet_wrap(~countriesAndTerritories) +
  theme_bw()
I will also add a picture of my output. I think my graph is not correct, because the output looks really weird. I can't see where I made a mistake. If you can help, I would appreciate it.
Here is the output:
Looking at Belgium, I get total deaths = 25051 from your data file, which tallies exactly with the data here.
It's obvious that the highest value (by far) for every country occurs "on" the earliest date for the country in the file. Amongst your top ten (I agree with your selection), this is 01Mar2021 for every country apart from Spain, and 28Feb2021 for Spain.
These two facts lead me to conclude that (1) your graphs correctly display the data you asked them to summarise, and (2) you have a data artefact: the first record for each country contains the cumulative total to date, whereas subsequent dates contain data reported "in the previous 24 hours". I use quotes because different countries have different reporting conventions. For example, in the UK (since August 2020), "COVID-related deaths" are deaths from any cause within 28 days of a positive COVID test.
Therefore, to get meaningful graphs, I think your only option is to discard the cumulative data contained in the first record for each country. Here's how I would do that:
library(utils)
library(tidyverse)

COVID_data <- read.csv("https://opendata.ecdc.europa.eu/covid19/nationalcasedeath_eueea_daily_ei/csv", na.strings = "", fileEncoding = "UTF-8-BOM")

# For better printing
COVID_data <- as_tibble(COVID_data)

# Which countries have the highest absolute death toll?
# [I get the same countries as you do.]
top10 <- COVID_data %>%
  group_by(countriesAndTerritories) %>%
  summarise(TotalDeaths = sum(deaths)) %>%
  slice_max(TotalDeaths, n = 10) %>%
  distinct(countriesAndTerritories) %>%
  pull(countriesAndTerritories)

COVID_data %>%
  filter(countriesAndTerritories %in% top10) %>%
  mutate(
    deathRate = 100000 * deaths / popData2020,
    caseRate  = 100000 * cases / popData2020,
    Date      = lubridate::dmy(dateRep)
  ) %>%
  arrange(countriesAndTerritories, Date) %>%
  group_by(countriesAndTerritories) %>%
  filter(row_number() > 1) %>%
  ggplot() +
  geom_line(aes(x = Date, y = deathRate)) +
  facet_wrap(~countriesAndTerritories)
The critical part that excludes the first data row for each country is
arrange(countriesAndTerritories, Date) %>%
  group_by(countriesAndTerritories) %>%
  filter(row_number() > 1) %>%
The call to arrange is necessary because the data are not in date order to begin with.
This gives the following plot, which is much more like what I (and, I suspect, you) would expect.
The sawtooth patterns you see are most likely also reporting artefacts: deaths that take place over the weekend (or on public holidays) are not reported until the following Monday (or next working day). This is certainly true in the UK.
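If you want to iron out those weekly reporting artefacts, one option (a sketch of my own, not part of the answer above) is a trailing 7-day rolling mean via zoo::rollmean:
library(zoo) # for rollmean

COVID_data %>%
  filter(countriesAndTerritories %in% top10) %>%
  mutate(
    deathRate = 100000 * deaths / popData2020,
    Date      = lubridate::dmy(dateRep)
  ) %>%
  arrange(countriesAndTerritories, Date) %>%
  group_by(countriesAndTerritories) %>%
  filter(row_number() > 1) %>% # drop the cumulative first record, as above
  mutate(deathRate7d = zoo::rollmean(deathRate, k = 7, fill = NA, align = "right")) %>%
  ggplot() +
  geom_line(aes(x = Date, y = deathRate7d)) +
  facet_wrap(~countriesAndTerritories)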

How do I summarize values attributed to several variables in a data set?

First, let me describe my data set. It has three columns: column 1 is country, column 2 is date (%Y-%m-%d), and column 3 is a value associated with each row (average hotel room prices). It continues like that in rows from 1990 to 2019, like so:
Country  Date        Value
France   2011-01-01  700
etc.
I'm trying to turn the date into years instead of the full %Y-%m-%d format, so that the values are aggregated for each country by year (instead of by month). How would I go about doing that?
I thought about summarizing the values manually for each country and each year, but that is hugely tedious and takes a long time (plus the code would look horrible). So I'm wondering if there is a better solution to this problem that I'm not seeing.
Here is my attempt so far. My dataset priceOnly shows the average price for each month. I have also restricted it to values not equal to 0.
diffyear <- priceOnly %>%
  group_by(Country, Date) %>%
  summarize(averagePrice = mean(Value[which(Value != 0.0)]))
You can use the lubridate package to extract years and then summarise accordingly.
Something like this:
library(dplyr)
library(lubridate)

diffyear <- priceOnly %>%
  mutate(Year = year(Date)) %>%
  filter(Value > 0) %>%
  group_by(Country, Year) %>%
  summarize(averagePrice = mean(Value, na.rm = TRUE))
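For example, year() simply extracts the year component, and if Date is stored as text rather than as a Date, lubridate can parse it first:
library(lubridate)

year(as.Date("2011-01-01")) # 2011
year(ymd("2011-01-01"))     # same result, parsing the string directly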
And in general, you should always provide a minimal reproducible example with your questions.

Zoo::yearmon class: removing records that do not contain all months for a given year

I have a data frame in R that includes city names, years, and months. It looks like the following:
[Sample data set with Month, Year, and City columns]
This table continues for thousands of records. Some cities do not have data for every month in the year (e.g. the 1920 data for Los Angeles in the example above only contains January and February), meaning they are incomplete. I want to extract out only those years which are complete for a given city (e.g. contains all 12 months for that year, like Toronto in the above example).
I have tried converting it into the zoo::yearmon class, but I do not know how to manipulate it to do what I have described above. I believe that a script could be written that looks at the year and city name, checks if it contains all 12 months, and then omits years that do not.
Here's a solution using the dplyr package:
df %>%
  group_by(City, Year) %>%
  filter(length(unique(Month)) == 12)
I group by City and Year and then filter for those with 12 unique months. (I assume your data frame is called df.)
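For instance, with a toy version of the example data (a complete 1920 for Toronto, only January and February for Los Angeles), only Toronto's rows survive:
library(dplyr)

df <- data.frame(
  City  = c(rep("Toronto", 12), rep("Los Angeles", 2)),
  Year  = 1920,
  Month = c(month.name, month.name[1:2])
)

df %>%
  group_by(City, Year) %>%
  filter(length(unique(Month)) == 12)
# keeps only Toronto's 12 rows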
Now, if you just want a particular city, say Toronto, you could use the following:
df %>%
  filter(City == "Toronto") %>%
  group_by(Year) %>%
  filter(length(unique(Month)) == 12)
Here is an option using data.table
library(data.table)
setDT(df)[, .SD[uniqueN(Month)==12], .(City, Year)]
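Here uniqueN(Month) counts the distinct months within each City-Year group, and .SD[...] keeps a group's rows only when that count is 12, making this the data.table analogue of the dplyr filter above.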

Summarising weather data by day ( from package nycflights13 in R)

I would like to summarise hourly weather data by day (get the total precipitation and maximum wind speed for each day). I found a code snippet on the web, but it results in only one observation for both variables instead of daily observations.
How can I change this particular code? And what other ways exist to perform this task?
Thanks!
library(nycflights13)
library(dplyr)

precip <- weather %>%
  group_by(month, day) %>%
  filter(month < 13) %>%
  summarise(totprecip = sum(precip), maxwind = max(wind_speed))
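One common cause of a grouped summarise collapsing to a single row is having plyr attached after dplyr, so that plyr::summarise (which ignores groups) masks dplyr::summarise. A sketch that namespaces the call explicitly and also handles the missing values in weather (the month < 13 filter is dropped here because months only run 1-12):
library(nycflights13)
library(dplyr)

daily <- weather %>%
  group_by(month, day) %>%
  dplyr::summarise( # explicit namespace sidesteps any plyr masking
    totprecip = sum(precip, na.rm = TRUE),
    maxwind   = max(wind_speed, na.rm = TRUE),
    .groups   = "drop"
  )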

Dplyr add column to data frame based on specific value of grouped data

I have a data frame containing number of page views per week for various users. It looks like this:
Userid  week  views
eerr    24    1
dd      24    2
dd      25    1
...
I want to plot average page views per week. However, I want to group users by the number of page views they had in the first week so that I can plot separate trajectories for users with different activity levels. I can get the first week for each user by doing
weekdf = df %>% group_by(Userid) %>% mutate(firstweek = min(week))
But I can't figure out how to group by the value of views in the row with that first week. I tried using a user-defined function within summarise, which seemed like it should work, but it never terminated, and I can see why: it has to recalculate everything many times.
getoffset <- function(week, Userid, minweekdf) {
  minweek <- minweekdf[minweekdf$Userid == Userid, 2]
  offsetweek <- week - minweek
  return(offsetweek)
}

offsetdf = df %>% group_by(Userid, week) %>% summarise(offsetweek = getoffset(week, Userid, minweek))
How can I do this, preferably in dplyr?
Something like this:
df %>% group_by(Userid) %>% arrange(week) %>% mutate(fv = first(views))
and then you can group by fv.
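To finish the thought, a sketch of that grouping and the plotting step (the cohort summarise and the ggplot2 call are my additions, with column names taken from the question):
library(dplyr)
library(ggplot2)

df %>%
  group_by(Userid) %>%
  arrange(week) %>%
  mutate(fv = first(views)) %>% # views in each user's first week
  ungroup() %>%
  group_by(fv, week) %>%
  summarise(avg_views = mean(views), .groups = "drop") %>%
  ggplot(aes(x = week, y = avg_views, color = factor(fv))) +
  geom_line() +
  labs(color = "First-week views")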
