Make table using mean of a column in R

I have the following data frame:
test <- data.frame(Gender = rep(c("M","F"),5), Death = c(1981:1985), Age = c(21:30))
and I wanted to know how I can reproduce the following results using the table command rather than ddply:
library(plyr)
ddply(test, c("Gender", "Death"), summarise, AgeMean = mean(Age))
Death AgeMean
1 1981 23.5
2 1982 24.5
3 1983 25.5
4 1984 26.5
5 1985 27.5

I think you mean aggregate...
aggregate(Age ~ Death, data = test, FUN = mean)
# Death Age
#1 1981 23.5
#2 1982 24.5
#3 1983 25.5
#4 1984 26.5
#5 1985 27.5
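If you literally want a table-like object rather than a data frame, tapply comes closest in base R. A quick sketch using the test data above:
# mean Age per Death year, returned as a named vector
tapply(test$Age, test$Death, mean)
# 1981 1982 1983 1984 1985
# 23.5 24.5 25.5 26.5 27.5
# or as a Gender x Death matrix, closer to the original ddply grouping
tapply(test$Age, list(test$Gender, test$Death), mean)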

Or you could also use summaryBy from the doBy package:
library(doBy)
summaryBy(Age ~ Death, data = test, FUN = mean)
Death Age.mean
1981 23.5
1982 24.5
1983 25.5
1984 26.5
1985 27.5
The variable(s) to the left of the ~ are the ones the function in FUN= (here mean) is applied to, and the variable(s) to the right of the ~ define the level of aggregation you want.
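For instance, grouping by both Gender and Death just means adding the extra variable on the right-hand side (a sketch, still using the test data above):
summaryBy(Age ~ Gender + Death, data = test, FUN = mean)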

You can also do this using dplyr:
library(dplyr)
test %>%
  group_by(Death) %>%
  summarise(Age.mean = mean(Age))
Source: local data frame [5 x 2]
  Death Age.mean
1  1981     23.5
2  1982     24.5
3  1983     25.5
4  1984     26.5
5  1985     27.5
I find dplyr's chaining syntax results in very readable code, but that's a personal preference.
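If you also want the Gender split from the original ddply call, just add it to group_by() (a sketch):
test %>%
  group_by(Gender, Death) %>%
  summarise(Age.mean = mean(Age))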

Related

Summing multiple observation rows in R

I have a dataset with 4 observations for 90 variables. The observations are answers to a questionnaire of the type "completely agree" to "completely disagree", expressed in percentages. I want to sum the two positive observations (completely and somewhat agree) and the two negative ones (completely and somewhat disagree) for all variables. Is there a way to do this in R?
My dataset looks like this:
Albania Andorra Azerbaijan etc.
1 13.3 18.0 14.9 ...
2 56.3 45.3 27.2 ...
3 21.3 27.2 28.0 ...
4 8.9 9.4 5.2 ...
And I want to sum rows 1+2 and 3+4 to look something like this:
Albania Andorra Azerbaijan etc.
1 69.6 63.3 42.1 ...
2 30.2 36.6 33.2 ...
I am really new to R, so I have no idea how to go about this. All answers to similar questions I found on this website and others either have character-type observations, multiple rows for the same observation (with missing data), or combine all the rows into just one row. My problem falls into none of these categories; I just want to collapse some of the observations.
Since you only have four rows, it's probably easiest to just add the first two rows together and the second two rows together. You can use rbind to stick the two resulting rows together into the desired data frame:
rbind(df[1,] + df[2, ], df[3,] + df[4,])
#> Albania Andorra Azerbaijan
#> 1 69.6 63.3 42.1
#> 3 30.2 36.6 33.2
Data taken from question
df <- structure(list(Albania = c(13.3, 56.3, 21.3, 8.9), Andorra = c(18,
45.3, 27.2, 9.4), Azerbaijan = c(14.9, 27.2, 28, 5.2)), class = "data.frame",
row.names = c("1", "2", "3", "4"))
Another option is to sum every 2 rows with rowsum, using gl with k = 2 to build the grouping factor:
rowsum(df, gl(n = nrow(df), k = 2, length = nrow(df)))
#> Albania Andorra Azerbaijan
#> 1 69.6 63.3 42.1
#> 2 30.2 36.6 33.2
Created on 2023-01-06 with reprex v2.0.2
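As a side note, the gl() call just builds the 1 1 2 2 grouping factor that rowsum collapses over; for this 4-row data frame it is equivalent to:
gl(n = 4, k = 2, length = 4)
# [1] 1 1 2 2
# Levels: 1 2 3 4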
Using dplyr
library(dplyr)
df %>%
  group_by(grp = gl(n(), 2, n())) %>%
  summarise(across(everything(), sum))
Output:
# A tibble: 2 × 4
grp Albania Andorra Azerbaijan
<fct> <dbl> <dbl> <dbl>
1 1 69.6 63.3 42.1
2 2 30.2 36.6 33.2

Calculating mean age by group in R

I have the following data: https://raw.githubusercontent.com/fivethirtyeight/data/master/congress-age/congress-terms.csv
I'm trying to determine how to calculate the mean age of members of Congress by year (termstart) for each party (Republican and Democrat).
I was hoping for some help on how to go about doing this. I am a beginner in R and I'm just playing around with the data.
Thanks!
Try this approach. Filter for the required parties and then summarise. After that you can reshape to wide in order to have both parties for each individual date. Here is the code using tidyverse functions:
library(dplyr)
library(tidyr)
#Data
df <- read.csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/congress-age/congress-terms.csv',stringsAsFactors = F)
#Code
newdf <- df %>%
  filter(party %in% c('R','D')) %>%
  group_by(termstart, party) %>%
  summarise(MeanAge = mean(age, na.rm = TRUE)) %>%
  pivot_wider(names_from = party, values_from = MeanAge)
Output:
# A tibble: 34 x 3
# Groups: termstart [34]
termstart D R
<chr> <dbl> <dbl>
1 1947-01-03 52.0 53.0
2 1949-01-03 51.4 54.6
3 1951-01-03 52.3 54.3
4 1953-01-03 52.3 54.1
5 1955-01-05 52.3 54.7
6 1957-01-03 53.2 55.4
7 1959-01-07 52.4 54.7
8 1961-01-03 53.4 53.9
9 1963-01-09 53.3 52.6
10 1965-01-04 52.3 52.2
# ... with 24 more rows
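If you prefer base R, a roughly equivalent sketch (assuming the same df read from the CSV above) is:
cong <- subset(df, party %in% c("D", "R"))
# termstart x party matrix of mean ages
tapply(cong$age, list(cong$termstart, cong$party), mean)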

Mean Temperature by group month in R

I am trying to calculate the mean temperature per month from daily records between 1988 and 2020 using the following code:
(Temperature_year_month <- na.omit(database_PE_na) %>%
   group_by(month) %>%
   summarise(mean_temp_monthYear = mean(Air.Temp.Mean)))
and I got the following results, which I checked in Excel and they seem correct:
# A tibble: 12 x 2
month mean_temp_monthYear
<dbl> <dbl>
1 1 11.4
2 2 13.5
3 3 17.2
4 4 21.2
5 5 26.0
6 6 31.0
7 7 33.3
8 8 32.5
9 9 29.1
10 10 22.4
11 11 15.4
12 12 10.7
However, when I do this only for the month of July (month = 7), I get a different result:
(Temperature_year_month <- na.omit(database_PE_na) %>%
   group_by(month = 7) %>%
   summarise(mean_temp_monthYear = mean(Air.Temp.Mean)))
month mean_temp_monthYear
<dbl> <dbl>
1 7 22.0
Could someone explain to me why this happens?
We can use data.table methods
library(data.table)
setDT(database_PE_na)[month == 7,
.(mean_temp_monthYear = mean(Air.Temp.Mean, na.rm = TRUE))]
For comparison use == and not =. Inside group_by(), month = 7 does not filter anything; it creates (or overwrites) a column named month with the constant value 7 for every row, so all rows end up in a single group and the mean is taken over the whole data set.
If you want the mean of one month, put the condition in filter() instead of group_by().
mean() has an na.rm argument which can be set to TRUE to ignore NA values, instead of removing complete rows with na.omit().
Use:
library(dplyr)
Temperature_year_month <- database_PE_na %>%
  filter(month == 7) %>%
  summarise(mean_temp_monthYear = mean(Air.Temp.Mean, na.rm = TRUE))
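To see the behaviour described above for yourself (a sketch, assuming database_PE_na is already loaded):
database_PE_na %>%
  group_by(month = 7) %>%   # creates/overwrites a column month = 7 for every row
  summarise(n_rows = n())   # a single group containing all rows, hence one overall mean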

Finding a Weighted Average Based on Years

I want to create a weighted average of the baseball statistic WAR from 2017 to 2019.
The averages would be as follows:
2019: 57.14%
2018: 28.57%
2017: 14.29%
However, some players only played in 2018 and 2019, and some only in 2017 and 2019.
If they've only played in two years it would be 67/33, and only one year would be 100%, obviously.
I was wondering if there was an easy way to do this.
My data set looks like this
Name Season G PA HR BB_pct K_pct ISO wOBA wRC_plus Def WAR
337 A.J. Pollock 2017 112 466 14 7.5 15.2 0.205 0.340 103 2.6 2.2
357 A.J. Pollock 2018 113 460 21 6.7 21.7 0.228 0.338 111 0.9 2.6
191 Aaron Altherr 2017 107 412 19 7.8 25.2 0.245 0.359 120 -7.9 1.4
162 Aaron Hicks 2017 88 361 15 14.1 18.6 0.209 0.363 128 6.4 3.4
186 Aaron Hicks 2018 137 581 27 15.5 19.1 0.219 0.360 129 2.3 5.0
464 Aaron Hicks 2019 59 255 12 12.2 28.2 0.208 0.325 102 1.3 1.1
The years vary from person to person, and I was wondering if anyone had a way to compute this weighted average depending on the years they played. I also don't want anyone who only played in 2017, if that makes sense.
I guess there is an easier way of doing this, but unfortunately my approach is a little more complex. I'm using dplyr and purrr.
First I put the weights into a list, ordered from oldest to most recent season so they line up with the row order within each player (the most recent season gets the largest weight, as in the percentages above):
library(dplyr)
library(purrr)

one_year <- 1
two_years <- c(1/3, 2/3)
three_years <- c(1/7, 2/7, 4/7)
weights <- list(one_year, two_years, three_years)
Next I split the dataset into a list by the number of seasons each player played:
df %>%
  group_by(Name) %>%
  mutate(n = n()) %>%
  arrange(n) %>%
  ungroup() %>%
  group_split(n) -> my_list
Now I define a function that calculates the average using the weights:
WAR_average <- function(i) {
  my_list[[i]] %>%
    group_by(Name) %>%
    # assumes the rows within each player are ordered oldest to most recent season
    mutate(WAR_average = sum(WAR * weights[[i]]))
}
And finally I apply the function WAR_average on my_list and filter/select the data:
my_list %>%
  seq_along() %>%
  lapply(WAR_average) %>%               # apply the function to each list element
  reduce(rbind) %>%                     # bind the data frames into one df
  filter(Season != 2017 | n != 1) %>%   # drop players only active in 2017
  select(Name, WAR_average) %>%         # keep player and WAR_average
  distinct()                            # remove duplicates
This whole process returns
# A tibble: 2 x 2
# Groups:   Name [2]
  Name         WAR_average
  <chr>              <dbl>
1 A.J. Pollock        2.47
2 Aaron Hicks         2.54
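For reference, a more compact dplyr-only sketch of the same idea, assuming rows are sorted by Season within each player and each more recent season's weight doubles the previous one (1; 1/3, 2/3; 1/7, 2/7, 4/7, matching the percentages in the question):
library(dplyr)

df %>%
  arrange(Name, Season) %>%
  group_by(Name) %>%
  filter(!(n() == 1 & Season == 2017)) %>%                            # drop pure 2017-ers
  mutate(w = 2^(row_number() - 1) / sum(2^(row_number() - 1))) %>%    # normalised weights
  summarise(WAR_average = sum(w * WAR))
This gives the same two WAR_average values as above.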

Aggregate with multiple duplicates and calculate their mean

Assume we have a data frame with duplicates in the respective UserIDs but with different names, which of course can be duplicated as well.
DF <- data.frame(ID=c(101,101,101,101,101,102,102,102,102),
Name=c("Ed","Ed","Hank","Hank","Hank","Sandy","Sandy","Jessica","Jessica"),
Class=c("Junior","Junior","Junior","Junior", "Junior","High","High","Mid","Mid"),
Scoring=c(11,15,18,18,12,20,22,25,26), Other_Scores=c(15,9,34,23,43,23,34,23,23))
The aim is to aggregate by UserID and Name and calculate the mean and standard deviation of the scores for each combination. A desired output example:
UserID Name Class Scoring_mean Scoring_std
101 Ed Junior 12.5 3
101 Hank Junior 24.67 11.62
102 Sandy High 24.75 6.29
102 Jessica Mid 24.25 1.5
Hence my question:
What are the options to aggregate the Names based on the UserID without loss of information (Hank being coerced into Ed etc., as happens with summarise() or mutate())?
In my way of thinking, R has to check which Name corresponds to the UserID and, if there is a match, aggregate and calculate the mean and standard deviation, but I'm not able to get this working in R with dplyr.
At the same time I couldn't find any other post that is somewhat related to this question, as in:
How to calculate the mean of specific rows in R?
Subtract pairs of columns based on matching column
Calculating mean when 2 conditions need met in R
average between duplicated rows in R
Here's a tidyverse option that uses some reshaping to create one column of scores and then some grouping in order to get the summary stats:
DF <- data.frame(
ID=c(101,101,101,101,101,102,102,102,102),
Name=c("Ed","Ed","Hank","Hank","Hank","Sandy","Sandy","Jessica","Jessica"),
Class=c("Junior","Junior","Junior","Junior", "Junior","High","High","Mid","Mid"),
Scoring=c(11,15,18,18,12,20,22,25,26),
Other_Scores=c(15,9,34,23,43,23,34,23,23)
)
library(tidyverse)
DF %>%
  gather(score_type, score, Scoring, Other_Scores) %>%   # reshape score columns
  group_by(ID, Name, Class) %>%                          # group by combinations
  summarise(scoring_mean = mean(score),                  # get summary stats
            scoring_sd = sd(score)) %>%
  ungroup()                                              # forget the grouping
# # A tibble: 4 x 5
# ID Name Class scoring_mean scoring_sd
# <dbl> <fct> <fct> <dbl> <dbl>
# 1 101. Ed Junior 12.5 3.00
# 2 101. Hank Junior 24.7 11.6
# 3 102. Jessica Mid 24.2 1.50
# 4 102. Sandy High 24.8 6.29
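gather() is superseded in current tidyr; the same reshape with pivot_longer() would look roughly like this:
DF %>%
  pivot_longer(c(Scoring, Other_Scores), names_to = "score_type", values_to = "score") %>%
  group_by(ID, Name, Class) %>%
  summarise(scoring_mean = mean(score),
            scoring_sd = sd(score),
            .groups = "drop")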
What about computing your summary stats and then joining the results to your initial data frame? Like so:
library(dplyr)

DF <- data.frame(ID = c(101,101,101,101,101,102,102,102,102),
                 Name = c("Ed","Ed","Hank","Hank","Hank","Sandy","Sandy","Jessica","Jessica"),
                 Class = c("Junior","Junior","Junior","Junior","Junior","High","High","Mid","Mid"),
                 Scoring = c(11,15,18,18,12,20,22,25,26),
                 Other_Scores = c(15,9,34,23,43,23,34,23,23))

DF2 <- DF %>%
  group_by(Name) %>%
  summarise(scoring_mean = mean(Scoring), scoring_sd = sd(Scoring)) %>%
  left_join(DF[, c(1, 2, 3)], by = "Name")
Giving:
# A tibble: 9 x 5
Name scoring_mean scoring_sd ID Class
<fct> <dbl> <dbl> <dbl> <fct>
1 Ed 13.0 2.83 101. Junior
2 Ed 13.0 2.83 101. Junior
3 Hank 16.0 3.46 101. Junior
4 Hank 16.0 3.46 101. Junior
5 Hank 16.0 3.46 101. Junior
6 Jessica 25.5 0.707 102. Mid
7 Jessica 25.5 0.707 102. Mid
8 Sandy 21.0 1.41 102. High
9 Sandy 21.0 1.41 102. High
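Note that this groups by Name only and uses just the Scoring column. To make the join-based approach line up with the desired output above (both score columns, grouped by ID and Name), a sketch would be:
stats <- DF %>%
  group_by(ID, Name, Class) %>%
  summarise(scoring_mean = mean(c(Scoring, Other_Scores)),
            scoring_sd = sd(c(Scoring, Other_Scores)),
            .groups = "drop")

DF %>% left_join(stats, by = c("ID", "Name", "Class"))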
