Attaching Appropriate Subset Quartiles to Dataset - r
I'm trying to add two columns to an NBA player dataset. The first column will establish which quartile a player's age falls in among all players in the dataset. The second column will establish which quartile a player's age falls in among players at his position (e.g. Point Guard, Small Forward, Center, etc.). I'm able to use the dplyr package to calculate the subset age quartiles based on player position, but I don't know how to join them back to the original dataset, or whether this is even the correct approach to take.
I've used dplyr to calculate subset age quartiles based on position. I have attempted to use other packages like fuzzyjoin, but I don't feel that comfortable working with them.
#Incorporate necessary packages
library(ballr)
library(magrittr)
library(dplyr)
library(tidyverse)
#Establish full player table
players <- NBAPerGameAdvStatistics(season = 2018)
#Calculates Quartiles for Each Position
Pos_quartiles <- players %>%
  group_by(pos) %>%
  summarise(age = list(enframe(quantile(age, probs = c(0.25, 0.5, 0.75, 1.0))))) %>%
  unnest(age)
I expect to end up with the players dataset with 664 observations and 32 variables, the last two of which have been added on as a result of this procedure. The additional columns will show a player's age quartile among all players included, as well as the player's age quartile among players at his position.
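A hedged aside before the answer below: dplyr::ntile() can produce similar columns without any join at all. Note that ntile() bins rows by rank into four roughly equal-sized groups, which is not identical to binning by quantile() breakpoints when there are ties, so treat this as an approximation.

library(dplyr)
players %>%
  mutate(quar_all = ntile(age, 4)) %>%   # rank-based quartile among all players
  group_by(pos) %>%
  mutate(quar_pos = ntile(age, 4)) %>%   # rank-based quartile within each position
  ungroup()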
We can use base::cut with quantile to get the appropriate quartile for each player.
library(dplyr)
players %>%
  mutate(quar_all = cut(age,
                        breaks = c(0, quantile(age, probs = c(0.25, 0.5, 0.75, 1.0))),
                        labels = FALSE)) %>%
  group_by(pos) %>%
  mutate(quar_pos = cut(age,
                        breaks = unique(c(0, quantile(age, probs = c(0.25, 0.5, 0.75, 1.0)))),
                        labels = FALSE))
Please note that in quar_pos I used unique, as otherwise I got the error

Error in cut.default(age, breaks = quantile(age, probs = c(0.25, 0.5, :
  'breaks' are not unique

For a similar error, unique was proposed by Didzis Elferts here, so as Didzis mentioned, expect fewer quartiles for the affected groups.
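To see why unique is needed, here is a minimal sketch with toy ages (not the real data): heavily tied values make quantile() return duplicated breakpoints, which cut() rejects.

age <- c(25, 25, 25, 25, 30)   # toy ages with heavy ties, as might occur within one position
quantile(age, probs = c(0.25, 0.5, 0.75, 1.0))
# 25% 50% 75% 100% -> 25 25 25 30: duplicated breaks, so cut() would error
cut(age,
    breaks = unique(c(0, quantile(age, probs = c(0.25, 0.5, 0.75, 1.0)))),
    labels = FALSE)
# 1 1 1 1 2 -> the tied group collapses into fewer bins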
Related
Trying to use ddply to subset a dataframe by two column variables, then find the maximum of a third column in r?
I have a dataframe called data with variables for date, time, temperature, and a group number called Box #. I'm trying to subset the data to find the maximum temperature for each day, for each box, along with the time that temperature occurred at. Ideally I could place this data into a new dataframe with the date, box number, maximum temperature, and the time it occurred at.

I tried using ddply, but the code only returns one line of output:

ddply(data, .('Box #', 'Date'), summarize, max('Temp'))

I was able to find the maximum temperatures for each day using tapply on separate dataframes that only contain the values for individual groups:

mx_day_2 <- tapply(box2$Temp, box2$Date, max)

I was unable to apply this to the larger dataframe with all groups, and cannot figure out how to also get the time from this code. Is it possible to have ddply subset by both Box # and Date, then return two separate outputs of both maximum temperature and time, or do I need to use a different function here?

Edit: I managed to get the maximum temperatures using a version of the code in the answer below, but still haven't figured out how to find the time at which the max occurs in the same data. The code that worked for the first part was:

max_data <- data %>% group_by(data$'Box #', data$'Date')
max_values <- summarise(max_data, max_temp = max(Temp, na.rm = TRUE))
I would use dplyr/tidyverse instead of plyr; it's an updated version of the package. And clean the column names with janitor: a space is difficult to work with (it changes 'Box #' to box_number).

library(tidyverse)
library(janitor)
mx_day2 <- data %>%
  clean_names() %>%
  group_by(date, box_number) %>%
  summarise(max_temp = max(temp, na.rm = TRUE))
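As a tiny hedged illustration of what clean_names() does to the awkward column name (toy data, not the asker's):

library(janitor)
toy <- data.frame(`Box #` = 1:2, Date = c("d1", "d2"), check.names = FALSE)
names(clean_names(toy))
# "box_number" "date" -- the "#" becomes "number" and names are lower-cased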
I found a solution that pulls full rows from the initial dataframe into a new dataframe based only on max values. Full code for the solution below:

max_data_v2 <- data %>%
  group_by(data$'Box #', data$'Date') %>%
  filter(Temp == max(Temp, na.rm = TRUE))
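A hedged alternative sketch: dplyr's slice_max() (available in dplyr >= 1.0.0) keeps the entire row of the per-group maximum, so the time column comes along with the max Temp automatically.

library(dplyr)
max_rows <- data %>%
  group_by(`Box #`, Date) %>%
  slice_max(Temp, n = 1, with_ties = FALSE) %>%  # one full row per box/day at the max Temp
  ungroup()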
How do I take a column from a data set, and test it against the max of another column?
I am currently working with the package nycflights13, data set flights. There is a column for the name of a plane, and a column for how many times that plane flew. I want to know which plane flew the most times. I would also like to omit any missing values, i.e. any NAs. I know that I am going to have to use the summarise() function and the select() function with a - to omit the missing values. I'm just not sure how to do that exactly.
I used this code to tally the number of rows in flights with each value of tailnum.

library(magrittr)
library(nycflights13)
data(flights)
flights %>%
  dplyr::group_by(tailnum) %>%
  dplyr::tally() %>%
  dplyr::arrange(desc(n))

You can then ignore the top row when examining the results, as it is for NA values of tailnum.
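A small hedged variant of the same idea: drop the NA tail numbers up front instead of ignoring the top row afterwards, then keep just the most frequent plane.

library(dplyr)
library(nycflights13)
flights %>%
  filter(!is.na(tailnum)) %>%       # omit missing tail numbers
  count(tailnum, sort = TRUE) %>%   # count() wraps group_by() + tally()
  slice(1)                          # the single plane with the most flights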
How to group twice?
I'd like to know how to group twice in a data set. I must answer the following question: "For each state, which municipalities have the lowest and the highest infection and death rates?". This question is part of a homework (https://github.com/umbertomig/intro-prob-stat-FGV/blob/master/assignments/hw6.Rmd) and I don't know how to do it. I've tried to use top_n, but I am not sure if this is the best way. I wanted to generate a data set in which, for each state, there were four municipalities (two with the highest rates of infection and death from coronavirus and two with the smallest). This is what I have done so far:

library(tidyverse)
brazilcorona <- read.csv("https://raw.githubusercontent.com/umbertomig/intro-prob-stat-FGV/master/datasets/brazilcorona.csv")
brazilcorona_hl_rates <- select(brazilcorona, (estado:emAcompanhamentoNovos)) %>%
  filter(data >= "2020-05-15") %>%
  subset(!(coduf == 76)) %>%
  mutate(av_inf = (casosAcumulado/populacaoTCU2019)*100000,
         av_dth = (obitosAcumulado/populacaoTCU2019)*100000)
brazilcorona_hilow_rates <- brazilcorona_hl_rates %>%
  group_by(estado) %>%
  summarize(top_dth = top_n(1, av_dth))
In my example, I find the two cities per state with the maximum and minimum values of the column "obitosAcumulado", but to solve your problem you should simply change it to the column you want to extract the information from.

rm(list = ls())
brazilcorona <- read.csv("https://raw.githubusercontent.com/umbertomig/intro-prob-stat-FGV/master/datasets/brazilcorona.csv")
#remove NA's from municipio
brazilcorona <- brazilcorona[!is.na(brazilcorona$municipio),]
#here I am gonna use the column "obitosAcumulado" but you should use the one you want
brazilcorona$obitosAcumulado <- as.numeric(brazilcorona$obitosAcumulado)
states <- as.list(unique(brazilcorona$estado))
result <- lapply(states, FUN = function(x) {
  df <- brazilcorona[brazilcorona$estado == x,]
  df <- df[order(df$obitosAcumulado, decreasing = T),]
  return(c(paste(x),
           as.character(df[1:2, "municipio"]),
           as.character(df[(nrow(df)-1):nrow(df), "municipio"])))
})

I hope it helps you...
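A hedged dplyr sketch of the same idea, since the question uses the tidyverse: per state, keep the two highest and two lowest municipalities by a chosen rate column (av_dth here, computed as in the question). slice_max()/slice_min() require dplyr >= 1.0.0.

library(dplyr)
top2    <- brazilcorona_hl_rates %>% group_by(estado) %>% slice_max(av_dth, n = 2)
bottom2 <- brazilcorona_hl_rates %>% group_by(estado) %>% slice_min(av_dth, n = 2)
extremes <- bind_rows(top2, bottom2) %>% arrange(estado)  # four municipalities per state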
Aggregating two rows based on condition of different ID in R
I am dealing with a dataset of player statistics for a sport. There is an error in the data where, for one week, a player who doesn't exist has been attributed the data that belongs to a real player. I need to aggregate the two players' data and delete the instance of the false player's row. I need to adjust my preprocessing code to accommodate this, so that when I scrape future weeks' data I don't need to make manual adjustments.

df <- data.frame(Name = c("Bob", "Ben", "Bill"),
                 Team = c("Dogs", "Cats", "Birds"),
                 Runs = c(6, 4, 2))

I'd like to do something along the lines of aggregating the two rows based on their df$Name, e.g. when df$Name == "Bob" | df$Name == "Bill", aggregate columns [3:40] -- these are my columns with numeric statistics; [1:2] have df$Name and df$Team.
It would depend on the type of aggregation you are trying to do. This looks like a perfect use of group_by from the dplyr package. Consider the CO2 data set.

library(dplyr)
CO2 %>%
  group_by(Plant) %>%
  summarise(
    n = n(),                    # calculate the number of rows in each group
    meanUptake = mean(uptake)   # aggregate the data, taking the mean for each group
  ) %>%
  ungroup()

Here we take each group; in your case above the group would be Name. If you wish to include extra information (like Team), include it within the summarise.
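A hedged sketch closer to the asker's actual task, using the toy df from the question: recode the false player's name to the real one, then sum the numeric columns within each Name/Team. Treating "Bill"/"Birds" as the false entries is an assumption for illustration only.

library(dplyr)
df <- data.frame(Name = c("Bob", "Ben", "Bill"),
                 Team = c("Dogs", "Cats", "Birds"),
                 Runs = c(6, 4, 2))
df %>%
  mutate(Name = recode(Name, "Bill" = "Bob"),    # assumption: "Bill" is the false player
         Team = recode(Team, "Birds" = "Dogs")) %>%
  group_by(Name, Team) %>%
  summarise(across(where(is.numeric), sum), .groups = "drop")  # sum all stat columns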
Summing across rows conditional on groups with dplyr using select, group_by, and mutate
Problem: I'm making an aggregate market share variable in a car market with 286 distinct models sold and a total of 501 cars sold. The group share is based on only one car characteristic, cat = "compact", "midsize", "large", together with yr = 77, 78, 79, 80, 81, plus the share s, a small double variable; a total of 15 groups in the market.

Closest answer I've found: by mishabalyasin on community.rstudio: "Calculating rowwise totals and proportions using tidyeval?" (link to post on community.rstudio). Applying the principle of select-split-combine is the closest I've come to getting the correct answer, i.e. the 15 groups (15 x 3 (cat, yr, s)):

df <- blp %>%
  select(cat, yr, s) %>%
  group_by(cat, yr) %>%
  summarise(group_share = sum(s))

#in my actual data, this is what fills my group share to get what I want, but this isn't the desired pipeline-based answer
blp$group_share = 0 #initializing the group_share, the 50th col
for (i in 1:501) {
  for (j in 1:15) {
    if ((blp[i,31] == df[j,1]) && (blp[i,3] == df[j,2])) { #if(sameCat & sameYr){blpGS=dfGS}
      blp[i,50] = df[j,3]
    }
  }
}

This is great, but I know this can be done in one fell swoop... Hopefully the idea is clear from what I've described above. A simple fix may be a loop with conditions set on cat and yr, and that'd help, but I really am trying to get better at data wrangling with dplyr, so any insight along that line toward the pipelined answer would be wonderful.

Example for the site: the example below doesn't work with the code I provided, but this is the "look" of my data. There is a problem with the share being a factor.

#45 obs, 3 cats, 5 yrs
cat = c("compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large")
yr = c(77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81)
s = c(.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002)
blp = as.data.frame(cbind(unlist(lapply(cat, as.character, stringsAsFactors = FALSE)), as.numeric(yr), unlist(as.numeric(s))))
names(blp) <- c("cat","yr","s")
head(blp)
#note: one example of a group share would be summing the share from
#(group_share.blp.large.81.s = blp[cat == "large" & yr == 81,])

#works thanks to akrun: applying the code I provided leads to the 15 groups
df <- blp %>%
  select(cat, yr, s) %>%
  group_by(cat, yr) %>%
  summarise(group_share = sum(as.numeric(as.character(s))))

#manually filling doesn't work, but this is what I'd want if I didn't want pipelining
blp$group_share = 0
for (i in 1:45) {
  if ((blp[i,1] == df[j,1]) && (as.numeric(blp[i,2]) == as.numeric(df[j,2]))) { #if(sameCat & sameYr){blpGS=dfGS}
    blp[i,4] = df[j,3]
  }
}
If I understood your problem correctly, this should ideally help! The only difference here is that instead of using summarise, which will automatically result in only the grouped columns and the summarised one, you can use mutate to keep the original columns and add an aggregate one to them.

# Sample input
## 45 obs, 3 cats, 5 yrs
cat <- c("compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large")
yr <- c(77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81)
s <- c(.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002)

# Calculation
library(dplyr)
blp <- data.frame(cat, yr, s, stringsAsFactors = FALSE) %>% # create the dataframe
  group_by(cat, yr) %>%                                     # group by category and year
  mutate(group_share = sum(s, na.rm = TRUE)) %>%            # sum of share per category/year
  ungroup()

Expected output (screenshot in the original post)
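An equivalent hedged sketch that keeps the asker's original two-step structure, in case the summarised table is wanted on its own as well: summarise the shares, then left_join() them back onto the rows.

library(dplyr)
df_shares <- blp %>%
  group_by(cat, yr) %>%
  summarise(group_share = sum(s), .groups = "drop")  # the 15 group shares
blp_joined <- blp %>%
  left_join(df_shares, by = c("cat", "yr"))          # attach the share to every row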