Aggregate with multiple duplicates and calculate their mean - r

Assume we have a DF with duplicated UserIDs but different names, which can of course be duplicated as well.
DF <- data.frame(ID = c(101, 101, 101, 101, 101, 102, 102, 102, 102),
                 Name = c("Ed", "Ed", "Hank", "Hank", "Hank", "Sandy", "Sandy", "Jessica", "Jessica"),
                 Class = c("Junior", "Junior", "Junior", "Junior", "Junior", "High", "High", "Mid", "Mid"),
                 Scoring = c(11, 15, 18, 18, 12, 20, 22, 25, 26),
                 Other_Scores = c(15, 9, 34, 23, 43, 23, 34, 23, 23))
The aim is to aggregate and calculate the mean and standard deviation for each UserID and Name combination. A desired output example:
UserID Name    Class  Scoring_mean Scoring_std
101    Ed      Junior 12.5         3
101    Hank    Junior 24.67        11.62
102    Sandy   High   24.75        6.29
102    Jessica Mid    24.25        1.5
Hence my question:
What are the options to aggregate the Names based on the UserID, without loss of information (Hank being coerced into Ed etc., as with summarise() or mutate())?
In my way of thinking, R has to check which Name corresponds to which UserID and, on a match, aggregate and calculate the mean and standard deviation, but I'm not able to get this working in R with dplyr.
I also couldn't find any other post that quite covers this question, as in:
How to calculate the mean of specific rows in R?
Subtract pairs of columns based on matching column
Calculating mean when 2 conditions need met in R
average between duplicated rows in R

Here's a tidyverse option that uses some reshaping to create one column of scores and then some grouping in order to get the summary stats:
DF <- data.frame(
  ID = c(101, 101, 101, 101, 101, 102, 102, 102, 102),
  Name = c("Ed", "Ed", "Hank", "Hank", "Hank", "Sandy", "Sandy", "Jessica", "Jessica"),
  Class = c("Junior", "Junior", "Junior", "Junior", "Junior", "High", "High", "Mid", "Mid"),
  Scoring = c(11, 15, 18, 18, 12, 20, 22, 25, 26),
  Other_Scores = c(15, 9, 34, 23, 43, 23, 34, 23, 23)
)

library(tidyverse)

DF %>%
  gather(score_type, score, Scoring, Other_Scores) %>% # reshape score columns into one
  group_by(ID, Name, Class) %>%                        # group by each combination
  summarise(scoring_mean = mean(score),                # get summary stats
            scoring_sd = sd(score)) %>%
  ungroup()                                            # forget the grouping
# # A tibble: 4 x 5
#      ID Name    Class  scoring_mean scoring_sd
#   <dbl> <fct>   <fct>         <dbl>      <dbl>
# 1  101. Ed      Junior         12.5       3.00
# 2  101. Hank    Junior         24.7      11.6
# 3  102. Jessica Mid            24.2       1.50
# 4  102. Sandy   High           24.8       6.29
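Note that gather() has since been superseded in the tidyverse. Assuming tidyr >= 1.0 and dplyr >= 1.0, a sketch of the same reshaping with pivot_longer() (grouping and summary unchanged):
DF %>%
  pivot_longer(c(Scoring, Other_Scores),
               names_to = "score_type", values_to = "score") %>% # reshape score columns into one
  group_by(ID, Name, Class) %>%
  summarise(scoring_mean = mean(score),
            scoring_sd = sd(score), .groups = "drop") # drop the grouping on the way out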

What about computing your summary stats and then joining the results back to your initial data frame? Like so:
DF <- data.frame(ID = c(101, 101, 101, 101, 101, 102, 102, 102, 102),
                 Name = c("Ed", "Ed", "Hank", "Hank", "Hank", "Sandy", "Sandy", "Jessica", "Jessica"),
                 Class = c("Junior", "Junior", "Junior", "Junior", "Junior", "High", "High", "Mid", "Mid"),
                 Scoring = c(11, 15, 18, 18, 12, 20, 22, 25, 26),
                 Other_Scores = c(15, 9, 34, 23, 43, 23, 34, 23, 23))

DF2 <- DF %>%
  group_by(Name) %>%
  summarise(scoring_mean = mean(Scoring), scoring_sd = sd(Scoring)) %>%
  left_join(DF[, c(1, 2, 3)], by = "Name")
Giving:
# A tibble: 9 x 5
  Name    scoring_mean scoring_sd    ID Class
  <fct>          <dbl>      <dbl> <dbl> <fct>
1 Ed              13.0      2.83   101. Junior
2 Ed              13.0      2.83   101. Junior
3 Hank            16.0      3.46   101. Junior
4 Hank            16.0      3.46   101. Junior
5 Hank            16.0      3.46   101. Junior
6 Jessica         25.5      0.707  102. Mid
7 Jessica         25.5      0.707  102. Mid
8 Sandy           21.0      1.41   102. High
9 Sandy           21.0      1.41   102. High
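Note that this keeps one row per original observation and summarises Scoring only. Since the repeated rows are exact duplicates, collapsing to one row per player is just a distinct() away:
DF2 %>% distinct()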

Related

Conditional statements of rows and columns

I am trying a rather complex conditional statement in R that I need help with. I have a massive dataset (~1,450,000 rows).
I am trying to shave the dataset down with a conditional statement: if multiple rows have the same value in column "date" AND the same value in column "PageName", THEN give the averages of columns "sst", "lat", and "long" for those rows.
The code I have Frankensteined together so far is:
Combines_averages <- if(Combine_12$date == Combine_12$PageName{aggregate(Combine_12$sst, Combine_12$`location-lat`, Combine_12$`location-long`)}
Data:
Example_Data
# A tibble: 5 x 7
animal_id `location-lat` `location-long` date sst month PageName
<chr> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
1 Alpha-626 30.5 -79.5 3/14/2020 22.2 3 ABCD
2 Bravo-522 30.5 -79.5 3/14/2020 22.6 3 ABCD
3 Charlie-389 30.5 -79.5 3/13/2020 22.4 3 BCAD
4 Delta-720 30.5 -79.5 3/16/2020 22.8 3 CADB
5 Echo-550 30.5 -79.5 3/14/2020 22.2 3 ABCD
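No answer was given here, but a minimal sketch of what the question describes would be a grouped summary over date and PageName (using the example column names above; dplyr assumed, and Combined_averages is just an illustrative name):
library(dplyr)

Combined_averages <- Example_Data %>%
  group_by(date, PageName) %>%       # rows sharing date AND PageName
  summarise(sst = mean(sst),         # average the measurements
            lat = mean(`location-lat`),
            long = mean(`location-long`),
            .groups = "drop")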

How do I find which country has the lowest average time to finish a marathon?

I have a data set with a column of countries and a column of the time (in hours) it took each runner to finish a marathon. I want to find out which 5 countries completed it in the shortest time on average. I am new to R so I only have basic knowledge. Example of the data: marathon$Country is a column with the nationality of each runner, and marathon$OverallHrs is the overall time it took each runner to complete the marathon.
I have tried
tapply(marathon$OverallHrs, marathon$Country, mean)
It hasn't worked in the way I want it to.
I am assuming that you are not referring to the trivial case where the "Country" column has no repeated countries. For a beginner in R, I would strongly encourage starting with the "tidyverse" package.
Below is a solution that allows repeated countries in the "Country" column.
library(tidyverse)
set.seed(123)

# Generate 10 countries, each appearing 5 times, in random order
A <- sample(rep(1:10, 5))
# Generate 50 random timings between 5 and 20 hours
B <- round(runif(50) * 15 + 5)

# Create a data frame with 50 rows and the columns Country and Timing
df <- data.frame(Country = paste0("Country", A),
                 Timing = B)

# The data frame looks like this:
#    Country Timing
# 1 Country5     15
# 2 Country4     17
# 3 Country4      5
# 4 Country3     12
# 5 Country5     16
# Calculate the average marathon timing per country
df_mean <- df %>%
  group_by(Country) %>%                                       # group by country
  summarise(Mean_Timing = mean(Timing), .groups = 'drop') %>% # calculate Mean_Timing
  arrange(Mean_Timing)                                        # fastest timing first

# df_mean:
# A tibble: 10 x 2
#    Country   Mean_Timing
#    <chr>           <dbl>
#  1 Country9         10.6
#  2 Country1         11.4
#  3 Country3         11.4
#  4 Country4         11.4
#  5 Country2         12.2
#  6 Country10        12.6
#  7 Country8         13.2
#  8 Country7         13.6
#  9 Country5         15
# 10 Country6         15.2

# The first 5 countries are then simply:
df_mean$Country[1:5]
# "Country9" "Country1" "Country3" "Country4" "Country2"
There is also the aggregate function in base R for calculating the mean per group. It is less code, but I still prefer the tidyverse method, as it becomes intuitive after a while and can be tweaked slightly to solve almost any data frame question.
Anyway, here is the solution using aggregate:
df_mean2 <- aggregate(df[, 2], list(df$Country), mean) # calculate mean per country
df_mean2[order(df_mean2$x), ]                          # sort ascending
     Group.1    x
10  Country9 10.6
1   Country1 11.4
4   Country3 11.4
5   Country4 11.4
3   Country2 12.2
2  Country10 12.6
9   Country8 13.2
8   Country7 13.6
6   Country5 15.0
7   Country6 15.2
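To pull the top 5 out of the aggregate result, the same indexing idea applies:
head(df_mean2[order(df_mean2$x), "Group.1"], 5) # 5 fastest countries
# "Country9" "Country1" "Country3" "Country4" "Country2"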

Mean Temperature by group month in R

I am trying to calculate the mean temperature per month from daily records between 1988 and 2020, using the following code:
(Temperature_year_month <- (na.omit(database_PE_na) %>%
   group_by(month) %>%
   summarise(mean_temp_monthYear = mean(Air.Temp.Mean))))
and I got the following results, which I checked in Excel and which seem correct:
# A tibble: 12 x 2
   month mean_temp_monthYear
   <dbl>               <dbl>
 1     1                11.4
 2     2                13.5
 3     3                17.2
 4     4                21.2
 5     5                26.0
 6     6                31.0
 7     7                33.3
 8     8                32.5
 9     9                29.1
10    10                22.4
11    11                15.4
12    12                10.7
However, when I do this only for the month of July (month = 7), I get a different result:
(Temperature_year_month <- (na.omit(database_PE_na) %>%
   group_by(month = 7) %>%
   summarise(mean_temp_monthYear = mean(Air.Temp.Mean))))
  month mean_temp_monthYear
  <dbl>               <dbl>
1     7                22.0
Could someone explain to me why this happens?
We can use data.table methods:
library(data.table)

setDT(database_PE_na)[month == 7,
                      .(mean_temp_monthYear = mean(Air.Temp.Mean, na.rm = TRUE))]
For comparison use == and not =. Inside group_by(), month = 7 does not compare anything: it overwrites the month column with the constant 7 for every row, so all rows fall into a single group and you get the overall mean across all months.
If you want the mean of one month, put the condition in filter instead of group_by.
Also, mean has an na.rm argument which can be set to TRUE to ignore NA values, instead of using na.omit and removing complete rows.
Use:
library(dplyr)

Temperature_year_month <- database_PE_na %>%
  filter(month == 7) %>%
  summarise(mean_temp_monthYear = mean(Air.Temp.Mean, na.rm = TRUE))
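A toy example (hypothetical data) makes the difference visible:
library(dplyr)

toy <- data.frame(month = c(6, 7, 7), temp = c(10, 20, 30))

toy %>% group_by(month = 7) %>% summarise(mean(temp)) # one group of all rows: 20
toy %>% filter(month == 7) %>% summarise(mean(temp))  # July rows only: 25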

Finding a Weighted Average Based on Years

I want to create a weighted average of the baseball statistic WAR from 2017 to 2019.
The weights would be as follows:
2019: 57.14%
2018: 28.57%
2017: 14.29%
However, some players only played in 2018 and 2019, and some played in 2019 and 2017.
If they've only played in two years it would be 67/33 (the most recent year weighted heaviest), and only one year would be 100%, obviously.
I was wondering if there is an easy way to do this.
My data set looks like this
Name Season G PA HR BB_pct K_pct ISO wOBA wRC_plus Def WAR
337 A.J. Pollock 2017 112 466 14 7.5 15.2 0.205 0.340 103 2.6 2.2
357 A.J. Pollock 2018 113 460 21 6.7 21.7 0.228 0.338 111 0.9 2.6
191 Aaron Altherr 2017 107 412 19 7.8 25.2 0.245 0.359 120 -7.9 1.4
162 Aaron Hicks 2017 88 361 15 14.1 18.6 0.209 0.363 128 6.4 3.4
186 Aaron Hicks 2018 137 581 27 15.5 19.1 0.219 0.360 129 2.3 5.0
464 Aaron Hicks 2019 59 255 12 12.2 28.2 0.208 0.325 102 1.3 1.1
The years vary from person to person, but I was wondering if anyone had a way to do this weighted average depending on the years they played. I also don't want any 2017-only players, if that makes sense.
There may be an easy way of doing your task; unfortunately my approach is a little more complex. I'm using dplyr and purrr.
First I put the weights into a list, ordered from oldest to most recent season so that the most recent season gets the heaviest weight (4/7 ≈ 57.14%, 2/7 ≈ 28.57%, 1/7 ≈ 14.29%):
one_year <- 1
two_years <- c(1/3, 2/3)
three_years <- c(1/7, 2/7, 4/7)
weights <- list(one_year, two_years, three_years)
Next I split the dataset into a list by the number of seasons each player took part in, arranging each player's rows in season order so they line up with the weights:
df %>%
  group_by(Name) %>%
  mutate(n = n()) %>%
  ungroup() %>%
  arrange(n, Name, Season) %>%
  group_split(n) -> my_list
Now I define a function that calculates the average using the weights:
WAR_average <- function(i) {
  my_list[[i]] %>%
    group_by(Name) %>%
    mutate(WAR_average = sum(WAR * weights[[i]]))
}
And finally I apply the function WAR_average to my_list and filter/select the data:
my_list %>%
  seq_along() %>%
  lapply(WAR_average) %>%             # apply the function
  reduce(rbind) %>%                   # bind the data frames into one df
  filter(Season != 2017 | n != 1) %>% # drop players only active in 2017
  select(Name, WAR_average) %>%       # select player and WAR_average
  distinct()                          # remove duplicates
This whole process returns
# A tibble: 2 x 2
# Groups: Name [2]
  Name         WAR_average
  <chr>              <dbl>
1 A.J. Pollock        2.47
2 Aaron Hicks         2.54
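Since the stated weights follow a doubling pattern (1:2 over two years, 1:2:4 over three), a shorter sketch with weighted.mean(), which normalizes the weights for you, should give the same result (assuming the same df as above):
library(dplyr)

df %>%
  group_by(Name) %>%
  filter(!all(Season == 2017)) %>%      # drop players who only appeared in 2017
  arrange(Season, .by_group = TRUE) %>% # oldest season first
  summarise(WAR_average = weighted.mean(WAR, 2^(seq_len(n()) - 1)))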

Time difference calculated from wide data with missing rows

There is a longitudinal data set in the wide format, from which I want to compute the time (in years and days) between the first observation date and the last date an individual was observed. Dates are in the format yyyy-mm-dd. The data set has four observation periods with missing dates; an example is as follows:
df1 <- data.frame(id = c(1:4),
                  adate = c("2011-06-18", "2011-06-18", "2011-04-09", "2011-05-20"),
                  bdate = c("2012-06-15", "2012-06-15", NA, "2012-05-23"),
                  cdate = c("2013-06-18", "2013-06-18", "2013-04-09", NA),
                  ddate = c("2014-06-15", NA, "2014-04-11", NA))
Here "adate" is the first date and the last date is the date an individual was last seen. To compute the time difference (lastdate-adate), I have tried using "lubridate" package, for example
lubridate::time_length(difftime(as.Date("2012-05-23"), as.Date("2011-05-20")),"years")
However, I'm challenged by the fact that the last date is not coming from one column. I'm looking for a way to automate the calculation in R. The expected output would look like
id years days
1 1 2.99 1093
2 2 2.00 731
3 3 3.01 1098
4 4 1.01 369
Years are rounded to 2 decimal places.
A tidyverse solution: convert the data to long format, remove the NA dates, and take the time difference between the last and the first date for each id.
library(dplyr)
library(tidyr)
library(lubridate)

df1 %>%
  pivot_longer(-id) %>%
  na.omit() %>%
  group_by(id) %>%
  mutate(value = as.Date(value)) %>%
  summarise(years = time_length(difftime(last(value), first(value)), "years"),
            days = as.numeric(difftime(last(value), first(value))))
#> # A tibble: 4 x 3
#> id years days
#> <int> <dbl> <dbl>
#> 1 1 2.99 1093
#> 2 2 2.00 731
#> 3 3 3.01 1098
#> 4 4 1.01 369
We could use pmap:
library(dplyr)
library(purrr)
library(tidyr)

df1 %>%
  mutate(out = pmap(.[-1], ~ {
    dates <- as.Date(na.omit(c(...)))
    tibble(years = lubridate::time_length(difftime(last(dates), first(dates)), "years"),
           days = lubridate::time_length(difftime(last(dates), first(dates)), "days"))
  })) %>%
  unnest_wider(out)
# A tibble: 4 x 7
# id adate bdate cdate ddate years days
# <int> <chr> <chr> <chr> <chr> <dbl> <dbl>
#1 1 2011-06-18 2012-06-15 2013-06-18 2014-06-15 2.99 1093
#2 2 2011-06-18 2012-06-15 2013-06-18 <NA> 2.00 731
#3 3 2011-04-09 <NA> 2013-04-09 2014-04-11 3.01 1098
#4 4 2011-05-20 2012-05-23 <NA> <NA> 1.01 369
Most of the functions introduced here may seem quite complex, and you should try to learn them if possible. Still, here is a base R approach; note that it needs the date columns converted to Date class first:
df1[-1] <- lapply(df1[-1], as.Date)                    # convert character columns to Dates
grp <- droplevels(interaction(df1[, 1], row(df1[-1]))) # create a grouping per id
days <- tapply(unlist(df1[-1]), grp, function(x) max(x, na.rm = TRUE) - x[1]) # get the difference
cbind(df1[1], days, years = round(days / 365, 2))      # create your table
    id days years
1.1  1 1093  2.99
2.2  2  731  2.00
3.3  3 1098  3.01
4.4  4  369  1.01
If you are comfortable with other higher-level functions, you could instead do:
dat <- aggregate(adate ~ id, reshape(df1, list(2:ncol(df1)), dir = "long"), function(x) max(x) - x[1])
transform(dat, year = round(adate / 365, 2))
  id adate year
1  1  1093 2.99
2  2   731 2.00
3  3  1098 3.01
4  4   369 1.01
Using base R apply:
df1[-1] <- lapply(df1[-1], as.Date)

df1[c('years', 'days')] <- t(apply(df1[-1], 1, function(x) {
  x <- na.omit(x)
  x1 <- difftime(x[length(x)], x[1], units = 'days')
  c(x1 / 365, x1)
}))

df1[c('id', 'years', 'days')]
#   id    years days
# 1  1 2.994521 1093
# 2  2 2.002740  731
# 3  3 3.008219 1098
# 4  4 1.010959  369
