Ranking based on grouped data - R - r

I would please like to get help with ranking data. So I have this dataframe (with many more columns and unique "raceId") with grouped data based on "raceId". What I am trying to do is giving a rank based on column "avg.position", for each unique raceId.
What I have tried to do is using dplyr and the code down below:
rank.3 <- arrange(raceId, constructorId) %>%
group_by(raceId) %>%
mutate(rank = rank(avg.position, na.last = TRUE)) %>%
ungroup()
The problem I am facing is that it returns a rank based on "avg.position" but NOT for each unique "raceId". So it does not give me a rank for each race, but instead based on the whole dataset.
Is there anyone who might have an solution for this?
Thanks in advance!

Related

Lead and lag issue using dplyr

I have a data frame with data that looks like this that has 365 rows reflecting the calendar year. I am trying to shift the county name columns up by one row. The data frame doesn't contain any missing values.
I tried using the following code to shift it, but the resulting table has values that are all NA.
covid_shift <- covid_pivot %>%
mutate(Maricopa = lag(Maricopa), Cook = lag(Cook), Harris = lag(Harris))
Does anyone know what might be the issue?
Since covid_pivot is grouped by date, and each of these groups has one row, the lead and lag functions return NA.
Try:
covid_shift <- covid_pivot %>%
ungroup() %>%
mutate(Maricopa = lag(Maricopa), Cook = lag(Cook), Harris = lag(Harris))
You might also consider using across()
covid_pivot %>%
ungroup() %>%
mutate(across(-date, ~lag(.x)))

How can I copy and append rows of data to fill in missing records in a defined sequence?

I have a sequence of numeric labels for records that can be shared by a variable number of records per label (labelsequence). I also have the records themselves, but unfortunately for some of the sequence values, all records have been lost (dataframe df). I need to identify when a numeric label from labelsequence does not appear in the label column of df, copy all records within df that are associated with the closest label value that is less than the missing value, and append these to a newly filled-in dataframe, say df2.
I am trying to accomplish this in R (a dplyr answer would be ideal), and have looked at answers to questions regarding filling in missing rows, such as Fill in missing rows in R and fill missing rows in a dataframe, and have a working solution below, was wondering if anyone has a better way of doing this.
Take , for instance, this example data:
labelsequence<-data.frame(label=c(1,2,3,4,5,6))
and
df<-data.frame(label=c(1,1,1,1,3,3,4,4,4),
place=c('vermont','kentucky',
'wisconsin','wyoming','nevada',
'california','utah','georgia','kentucky'),
animal=c('wolf','wolf','cougar','cougar','lamb',
'cougar','donkey','lamb','wolf'))
with desired result...
desired_df2<-data.frame(label=c(1,1,1,1,2,2,2,2,3,3,4,4,4,5,5,5,6,6,6),
place=c('vermont','kentucky',
'wisconsin','wyoming','vermont','kentucky',
'wisconsin','wyoming','nevada',
'california','utah','georgia','kentucky','utah',
'georgia','kentucky','utah','georgia','kentucky'),
animal=c('wolf','wolf','cougar','cougar','wolf',
'wolf','cougar','cougar','lamb','cougar',
'donkey','lamb','wolf','donkey','lamb','wolf',
'donkey','lamb','wolf'))
Is there a better (be it effiency of code, flexibility, or resource efficiency) way than the following?
df2<- df %>%
full_join(expand.grid(label=unique(df$label),newlabel=labelsequence$label)) %>%
mutate(missing = ifelse(newlabel %in% label,0,1))%>%
filter(label<newlabel)%>%
group_by(newlabel) %>%
filter(label==max(label) & missing ==1) %>%
ungroup()%>%
mutate(label=newlabel,missing=NULL,newlabel=NULL) %>%
bind_rows(df) %>%
arrange(label)

How do I aggregate certain columns from data frame by a Unique ID?

I have a list of statcast data, per day dating back to 2016. I am attempting to aggregate this data for finding the mean for each pitching ID.
I have the following code:
aggpitch <- aggregate(pitchingstat, by=list(pitchingstat$PitcherID),
FUN=mean, na.rm = TRUE)
This function aggregates every single column. I am looking to only aggregate a certain amount of columns.
How would I include only certain columns?
If you have more than one column that you'd like to summarize, you can use QAsena's approach and add summarise_at function like so:
pitchingstat %>%
group_by(PitcherID) %>%
summarise_at(vars(col1:coln), mean, na.rm = TRUE)
Check out link below for more examples:
https://dplyr.tidyverse.org/reference/summarise_all.html
Replace the first argument (pitchingstat) with the name of the column you want to aggregate (or a vector thereof)
How about?:
library(tidyverse)
aggpitch <- pitchingstat %>%
group_by(PitcherID) %>%
summarise(pitcher_mean = mean(variable)) #replace 'variable' with your variable of interest here
or
library(tidyverse)
aggpitch <- pitchingstat %>%
select(var_1, var_2)
group_by(PitcherID) %>%
summarise(pitcher_mean = mean(var_1),
pitcher_mean2 = mean(var_2))
I think this works but could use a dummy example of your data to play with.

In R, how to sum multiple columns based on value of another column?

In R, I have a dataframe, so that I have One Variable (the name of a country), a number of variables (Population, Number of cars, etc) and then a Column that represents region.
I would like to sum the variables (1, 2, ....) based on the value of the last region. I think this should be possible with dplyr and summarise each, but I cannot get it to work.
Would someone be able to help me please? Thanks a lot.
Reading the response (althought this may change if you can get some of your dataframe together...
library(dplyr)
summarized_df <- df %>%
group_by(region) %>%
summarise(var1=sum(variable1), var2=sum(variable2), var3=sum(variable3))
If this doesn't seem to work, maybe you can post your code and the errors even if you can't post the dataframe.

Percentile Ranking by Grouping in R

The column ID is a sequence of 1-63 in repetition. I wish to add two new columns Closepctl and Quantitypctl in which I can rank each entry from 1-63 i.e., on the basis of Close and Quantity column but grouped with respect to ID. Is there any way to do this in R?
I tried it in excel but failed to find any grouping option there. Excel approach is also appreciated.
See if the following is helpful...
library(dplyr)
data(iris)
df <- iris %>%
group_by(Species) %>%
mutate(RankSepal = percent_rank(Sepal.Length),
RankPetal = percent_rank(Petal.Length))

Resources