Calculating mean age by group in R

I have the following data: https://raw.githubusercontent.com/fivethirtyeight/data/master/congress-age/congress-terms.csv
I'm trying to determine how to calculate the mean age of members of Congress by year (termstart) for each party (Republican and Democrat).
I was hoping for some help on how to go about doing this. I am a beginner in R and I'm just playing around with the data.
Thanks!

Try this approach: filter for the required parties, then summarise. After that you can reshape to wide so that both parties appear on each date. Here is the code using tidyverse functions:
library(dplyr)
library(tidyr)
#Data
df <- read.csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/congress-age/congress-terms.csv',stringsAsFactors = F)
#Code
newdf <- df %>%
  filter(party %in% c('R', 'D')) %>%
  group_by(termstart, party) %>%
  summarise(MeanAge = mean(age, na.rm = TRUE)) %>%
  pivot_wider(names_from = party, values_from = MeanAge)
Output:
# A tibble: 34 x 3
# Groups: termstart [34]
termstart D R
<chr> <dbl> <dbl>
1 1947-01-03 52.0 53.0
2 1949-01-03 51.4 54.6
3 1951-01-03 52.3 54.3
4 1953-01-03 52.3 54.1
5 1955-01-05 52.3 54.7
6 1957-01-03 53.2 55.4
7 1959-01-07 52.4 54.7
8 1961-01-03 53.4 53.9
9 1963-01-09 53.3 52.6
10 1965-01-04 52.3 52.2
# ... with 24 more rows
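For comparison, here is a base-R sketch of the same computation, assuming the same df read above; subset/aggregate/reshape are standard base R, though the exact wide-column names (age.D, age.R) depend on reshape's defaults:

```r
# Keep only the two major parties, average age per term start and party,
# then spread the parties into columns
sub <- subset(df, party %in% c("R", "D"))
agg <- aggregate(age ~ termstart + party, data = sub, FUN = mean)
wide <- reshape(agg, idvar = "termstart", timevar = "party", direction = "wide")
head(wide)
```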

Related

Summing multiple observation rows in R

I have a dataset with 4 observations for 90 variables. The observations are answers to a questionnaire of the type "completely agree" to "completely disagree", expressed in percentages. I want to sum the two positive observations (completely and somewhat agree) and the two negative ones (completely and somewhat disagree) for all variables. Is there a way to do this in R?
My dataset looks like this:
Albania Andorra Azerbaijan etc.
1 13.3 18.0 14.9 ...
2 56.3 45.3 27.2 ...
3 21.3 27.2 28.0 ...
4 8.9 9.4 5.2 ...
And I want to sum rows 1+2 and 3+4 to look something like this:
Albania Andorra Azerbaijan etc.
1 69.6 63.3 42.1 ...
2 30.2 36.6 33.2 ...
I am really new to R so I have no idea how to go about this. All answers to similar questions I found on this website and others either have character type observations, multiple rows for the same observation (with missing data), or combine all the rows into just 1 row. My problem falls in none of these categories, I just want to collapse some of the observations.
Since you only have four rows, it's probably easiest to just add the first two rows together and the second two rows together. You can use rbind to stick the two resulting rows together into the desired data frame:
rbind(df[1,] + df[2, ], df[3,] + df[4,])
#> Albania Andorra Azerbaijan
#> 1 69.6 63.3 42.1
#> 3 30.2 36.6 33.2
Data taken from the question:
df <- structure(list(Albania = c(13.3, 56.3, 21.3, 8.9),
                     Andorra = c(18, 45.3, 27.2, 9.4),
                     Azerbaijan = c(14.9, 27.2, 28, 5.2)),
                class = "data.frame", row.names = c("1", "2", "3", "4"))
Another option is to sum every 2 rows with rowsum, using gl with k = 2 to build the grouping factor:
rowsum(df, gl(n = nrow(df), k = 2, length = nrow(df)))
#> Albania Andorra Azerbaijan
#> 1 69.6 63.3 42.1
#> 2 30.2 36.6 33.2
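To see what the gl call contributes here, it just builds the factor that pairs consecutive rows (levels beyond the second are generated but unused):

```r
# n = 4 levels, each repeated k = 2 times, truncated to length 4
gl(n = 4, k = 2, length = 4)
#> [1] 1 1 2 2
#> Levels: 1 2 3 4
```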
Created on 2023-01-06 with reprex v2.0.2
Using dplyr:
library(dplyr)
df %>%
  group_by(grp = gl(n(), 2, n())) %>%
  summarise(across(everything(), sum))
Output:
# A tibble: 2 × 4
grp Albania Andorra Azerbaijan
<fct> <dbl> <dbl> <dbl>
1 1 69.6 63.3 42.1
2 2 30.2 36.6 33.2

Create a temporary group in dplyr group_by

I would like to group all members of the same genus together for some summary statistics, but would like to maintain their full names in the original dataframe. I know that I could change their names or create a new column in the original dataframe, but I am looking for a more elegant solution. I would like to implement this in R and the dplyr package.
Example data here https://knb.ecoinformatics.org/knb/d1/mn/v2/object/urn%3Auuid%3Aeb176981-1909-4d6d-ac07-3406e4efc43f
I would like to group all clams of the genus Macoma as one group, "Macoma sp.", ideally creating this grouping within the following, perhaps before the group_by(site_code, species_scientific):
summary <- data %>%
  group_by(site_code, species_scientific) %>%
  summarize(mean_size = mean(width_mm))
Note that there are multiple Macoma xxx species and multiple other species that I want to group as is.
We may modify species_scientific by replacing the elements that contain the substring 'Macoma' (detected with str_detect) with 'Macoma', use that as the grouping column, and get the mean:
library(dplyr)
library(stringr)
data %>%
  mutate(species_scientific = replace(species_scientific,
                                      str_detect(species_scientific, "Macoma"),
                                      "Macoma")) %>%
  group_by(site_code, species_scientific) %>%
  summarise(mean_size = mean(width_mm, na.rm = TRUE), .groups = 'drop')
Output:
# A tibble: 97 × 3
site_code species_scientific mean_size
<chr> <chr> <dbl>
1 H_01_a Clinocardium nuttallii 33.9
2 H_01_a Macoma 41.0
3 H_01_a Protothaca staminea 37.3
4 H_01_a Saxidomus gigantea 56.0
5 H_01_a Tresus nuttallii 100.
6 H_02_a Clinocardium nuttallii 35.1
7 H_02_a Macoma 41.3
8 H_02_a Protothaca staminea 38.0
9 H_02_a Saxidomus gigantea 54.7
10 H_02_a Tresus nuttallii 50.5
# … with 87 more rows
If the intention is to keep only the first word in 'species_scientific':
data %>%
  group_by(genus = str_remove(species_scientific, "\\s+.*"), site_code) %>%
  summarise(mean_size = mean(width_mm, na.rm = TRUE), .groups = 'drop')
Output:
# A tibble: 97 × 3
genus site_code mean_size
<chr> <chr> <dbl>
1 Clinocardium H_01_a 33.9
2 Clinocardium H_02_a 35.1
3 Clinocardium H_03_a 37.5
4 Clinocardium H_04_a 48.2
5 Clinocardium H_05_a 37.6
6 Clinocardium H_06_a 38.7
7 Clinocardium H_07_a 40.2
8 Clinocardium L_01_a 44.4
9 Clinocardium L_02_a 54.8
10 Clinocardium L_03_a 61.1
# … with 87 more rows
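To see what that regular expression does on a single value (the species name here is just an example string):

```r
library(stringr)

# "\\s+.*" matches the first run of whitespace and everything after it,
# so removing it keeps only the genus
str_remove("Macoma nasuta", "\\s+.*")  # "Macoma"
```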

R: convert numbers saved as character in a tibble to double without NAs being introduced

I'd like to convert the relevant columns in the following tibble to numeric (double precision):
# A tibble: 6 x 6
Date Open High Low Close Shares
<chr> <chr> <chr> <chr> <chr> <chr>
1 16.04.2021 53,64 54,12 53,64 54,12 50
2 15.04.2021 53,19 53,19 53,19 53,19 -
3 14.04.2021 53,29 53,29 53,29 53,29 -
4 13.04.2021 52,86 52,86 52,86 52,86 -
5 12.04.2021 53,17 53,17 53,17 53,17 -
6 09.04.2021 53,18 53,18 53,18 53,18 -
However, if I apply as.numeric to the relevant columns, NAs are introduced.
What is the most efficient way to convert the entries in the relevant columns to double without generating the NAs?
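Coercing one of these strings directly shows the problem, since as.numeric does not understand the comma as a decimal mark:

```r
as.numeric("53,64")
#> [1] NA
#> Warning message: NAs introduced by coercion
```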
Reproducible sample data:
df <- tribble(
~Date, ~Open, ~High, ~Low, ~Close, ~Shares,
"16.04.2021", "53,64", "54,12", "53,64", "54,12", 50,
"15.04.2021", "53,19", "53,19", "53,19", "53,19", NA,
"14.04.2021", "53,29", "53,29", "53,29", "53,29", NA,
"13.04.2021", "52,86", "52,86", "52,86", "52,86", NA,
"12.04.2021", "53,17", "53,17", "53,17", "53,17", NA,
"09.04.2021", "53,18", "53,18", "53,18", "53,18", NA
)
You can replace the comma with a dot and convert to numeric. Use lapply to apply the function to multiple columns:
df[2:5] <- lapply(df[2:5], function(x) as.numeric(sub(',', '.', x)))
Using dplyr and readr:
library(dplyr)
library(readr)
df %>%
  mutate(across(Open:Close, ~ parse_number(.x, locale = locale(decimal_mark = ","))))
The reason you can't turn them into numeric values is the comma used as a decimal separator instead of a dot. So you can use the following code:
library(dplyr)
library(stringr)
df %>%
  mutate(across(Open:Close, ~ str_replace(.x, ",", ".")),
         across(Open:Close, as.numeric))
# A tibble: 6 x 6
Date Open High Low Close Shares
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 16.04.2021 53.6 54.1 53.6 54.1 50
2 15.04.2021 53.2 53.2 53.2 53.2 NA
3 14.04.2021 53.3 53.3 53.3 53.3 NA
4 13.04.2021 52.9 52.9 52.9 52.9 NA
5 12.04.2021 53.2 53.2 53.2 53.2 NA
6 09.04.2021 53.2 53.2 53.2 53.2 NA
First remove any dots used as thousands separators, remembering to escape the "." in the regular expression; then replace the commas with "." before converting to numeric:
df %>%
  mutate(across(2:5, ~ as.numeric(gsub(",", ".", gsub("\\.", "", .x)))))
Output:
Date Open High Low Close Shares
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 16.04.2021 53.6 54.1 53.6 54.1 50
2 15.04.2021 53.2 53.2 53.2 53.2 -
3 14.04.2021 53.3 53.3 53.3 53.3 -
4 13.04.2021 52.9 52.9 52.9 52.9 -
5 12.04.2021 53.2 53.2 53.2 53.2 -
6 09.04.2021 53.2 53.2 53.2 53.2 -

Finding a Weighted Average Based on Years

I want to create a weighted average of the baseball statistic WAR from 2017 to 2019.
The weights would be as follows:
2019: 57.14%
2018: 28.57%
2017: 14.29%
However, some players only played in 2018 and 2019, and some played in 2019 and 2017.
If they've only played in two years, the weights would be 67/33; a single year would be 100%, obviously.
I was wondering if there was an easy way to do this.
My data set looks like this
Name Season G PA HR BB_pct K_pct ISO wOBA wRC_plus Def WAR
337 A.J. Pollock 2017 112 466 14 7.5 15.2 0.205 0.340 103 2.6 2.2
357 A.J. Pollock 2018 113 460 21 6.7 21.7 0.228 0.338 111 0.9 2.6
191 Aaron Altherr 2017 107 412 19 7.8 25.2 0.245 0.359 120 -7.9 1.4
162 Aaron Hicks 2017 88 361 15 14.1 18.6 0.209 0.363 128 6.4 3.4
186 Aaron Hicks 2018 137 581 27 15.5 19.1 0.219 0.360 129 2.3 5.0
464 Aaron Hicks 2019 59 255 12 12.2 28.2 0.208 0.325 102 1.3 1.1
The years vary from person to person, but I was wondering if anyone had a way to do this weighted average dependent on the years they played. I also don't want any 2017-only players, if that makes sense.
I guess there is an easy way of doing your task; unfortunately my approach is a little more complex. I'm using dplyr and purrr.
First I put the weights into a list, most recent season first (note that 57.14%, 28.57% and 14.29% are 4/7, 2/7 and 1/7):
one_year <- 1
two_years <- c(2/3, 1/3)
three_years <- c(4/7, 2/7, 1/7)
weights <- list(one_year, two_years, three_years)
Next I split the dataset into a list by the number of seasons each player took part in, sorting each player's rows with the most recent season first so they line up with the weights:
library(dplyr)
library(purrr)
df %>%
  group_by(Name) %>%
  mutate(n = n()) %>%
  arrange(n, Name, desc(Season)) %>%
  ungroup() %>%
  group_split(n) -> my_list
Now I define a function that calculates the average using the weights:
WAR_average <- function(i) {
  my_list[[i]] %>%
    group_by(Name) %>%
    mutate(WAR_average = sum(WAR * weights[[i]]))
}
And finally I apply the function WAR_average to my_list and filter/select the data:
my_list %>%
  seq_along() %>%
  lapply(WAR_average) %>%                 # apply the function
  reduce(rbind) %>%                       # bind the data frames into one
  filter(Season != 2017 | n != 1) %>%     # drop players only active in 2017
  select(Name, WAR_average) %>%           # keep player and WAR_average
  distinct()                              # remove duplicates
This whole process returns
# A tibble: 2 x 2
# Groups: Name [2]
  Name         WAR_average
  <chr>              <dbl>
1 A.J. Pollock        2.47
2 Aaron Hicks         2.54
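A more compact sketch of the same idea is possible without splitting the data, keying the weight vectors by each player's season count. This is an alternative, not the answer's original code; the list w and its oldest-to-newest weight ordering are assumptions:

```r
library(dplyr)

# Weight vectors keyed by number of seasons played, oldest to newest
w <- list(`1` = 1, `2` = c(1/3, 2/3), `3` = c(1/7, 2/7, 4/7))

df %>%
  group_by(Name) %>%
  arrange(Season, .by_group = TRUE) %>%           # oldest season first
  filter(!(n() == 1 & all(Season == 2017))) %>%   # drop 2017-only players
  summarise(WAR_average = sum(WAR * w[[as.character(n())]]))
```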

Aggregating and adding a new column

I have a dataset with district name, household latitude, and longitude. The dataset has 2000 household locations. I want to calculate the mean of latitude and longitude based on district name. Next, I want to add two new columns (i.e. Lat_mean, Long_mean) in which the mean Lat and Long will be stored for each household.
I was just able to aggregate the mean values for latitude and longitude, but I don't know how to paste the summarised data back as a new column for each ID (see code):
id <- c(1,2,3,4,5,6)
district <- c("A", "B", "C", "A", "A", "B")
lat <- c(28.6, 30.2, 35.9, 27.5, 27.9, 31.5)
long <- c(77.5, 85.2, 66.5, 75.0, 79.2, 88.8)
df <- data.frame(id, district, lat, long)
df_group <- df %>% group_by(district) %>% summarise_at(vars(lat:long), mean)
I am expecting the following: Lat_mean and Long_mean columns added to df, with each ID getting values based on its district name.
We can use mutate_at instead of summarise_at. Within list(), name the function so that a new column is created with that name as a suffix:
library(dplyr)
df %>%
  group_by(district) %>%
  mutate_at(vars(lat, long), list(mean = mean))
# A tibble: 6 x 6
# Groups: district [3]
# id district lat long lat_mean long_mean
# <dbl> <fct> <dbl> <dbl> <dbl> <dbl>
#1 1 A 28.6 77.5 28 77.2
#2 2 B 30.2 85.2 30.8 87
#3 3 C 35.9 66.5 35.9 66.5
#4 4 A 27.5 75 28 77.2
#5 5 A 27.9 79.2 28 77.2
#6 6 B 31.5 88.8 30.8 87
Another option is ave from base R:
df %>%
  mutate(lat_mean = ave(lat, district, FUN = mean),
         lon_mean = ave(long, district, FUN = mean))
id district lat long lat_mean lon_mean
1 1 A 28.6 77.5 28.00 77.23333
2 2 B 30.2 85.2 30.85 87.00000
3 3 C 35.9 66.5 35.90 66.50000
4 4 A 27.5 75.0 28.00 77.23333
5 5 A 27.9 79.2 28.00 77.23333
6 6 B 31.5 88.8 30.85 87.00000
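For completeness, a fully base-R sketch (no pipes) that produces the same columns via aggregate and merge; note that merge reorders rows by district, and the column renaming inside cbind() in the formula is an assumption worth checking:

```r
# Per-district means, renamed inside the formula so merge adds them
# as lat_mean / long_mean rather than clashing with lat / long
means <- aggregate(cbind(lat_mean = lat, long_mean = long) ~ district,
                   data = df, FUN = mean)
merge(df, means, by = "district")
```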
