Assistance with DPLYR functions - r

I am trying to make my code display the mean attendance of a specified country in a new column. When I run the code below, I get a table that is not what I expected. Can anyone explain how to display only the rows for the specified country name, with the mean attendance in the new column, and explain what I am doing wrong? Thank you
EDIT: sorry I'm obviously new at this,
my code is
WorldCupMatches %>%
select(Home.Team.Name, Attendance) %>%
group_by(Home.Team.Name == "USA") %>%
mutate(AVG_Attendance = mean(Attendance, na.rm = T))
To explain more: WorldCupMatches is a data frame with columns named "Home.Team.Name" and "Attendance". I am trying to add a column with mutate, and I want the new column to show the mean attendance for a country. The country I am looking for in this particular situation is USA. I also want the output to display only the columns "Home.Team.Name" (with USA as the home team), "Attendance", and the new column holding the mean attendance.
Thank you all for the help, I got a lot of great answers!

First group by Home.Team.Name
Then you get the mean of each country in the table with summarise
If you just want USA then add filter(Home.Team.Name == "USA") at the end
WorldCupMatches %>%
select(Home.Team.Name, Attendance) %>%
group_by(Home.Team.Name) %>%
summarise(AVG_Attendance = mean(Attendance, na.rm = T)) %>%
filter(Home.Team.Name == "USA")
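For context on what went wrong in the question: group_by(Home.Team.Name == "USA") builds two groups (TRUE and FALSE), and mutate keeps every row, so all non-USA rows stay in the output with their own pooled mean. If the goal is to keep the individual USA rows with the mean repeated next to them, a minimal sketch could look like the following. The data frame matches is a made-up stand-in for WorldCupMatches, not the real data.

```r
library(dplyr)

# Hypothetical stand-in for WorldCupMatches (values invented for illustration)
matches <- data.frame(
  Home.Team.Name = c("USA", "France", "USA", "Brazil"),
  Attendance     = c(100, 200, 300, NA)
)

usa <- matches %>%
  select(Home.Team.Name, Attendance) %>%
  filter(Home.Team.Name == "USA") %>%                       # keep only USA home games
  mutate(AVG_Attendance = mean(Attendance, na.rm = TRUE))   # one mean, repeated on each row

usa
```

Filtering first means the mean is computed over USA rows only, so no group_by is needed in this case.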

If you want to have averages by group, just group_by and summarise:
library(dplyr)
df %>%
group_by(Hometeam) %>%
summarise(mean = mean(Attendance))
# A tibble: 3 x 2
Hometeam mean
* <chr> <dbl>
1 France 555
2 UK 500.
3 USA 373
If you're just interested in a specific group you can filter that group:
df %>%
filter(Hometeam=="USA") %>%
summarise(mean = mean(Attendance))
mean
1 373
Data:
df <- data.frame(
Hometeam = c("USA", "UK", "USA", "France", "UK", "USA"),
Attendance = c(120, 333, 222, 555, 666, 777)
)
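If the group mean should sit next to every original row instead of collapsing the table, base R's ave() is one option. This is only a sketch on the same example data; the column name mean_attendance is an arbitrary choice.

```r
df <- data.frame(
  Hometeam = c("USA", "UK", "USA", "France", "UK", "USA"),
  Attendance = c(120, 333, 222, 555, 666, 777)
)

# ave() returns a vector as long as its input, holding each row's group mean
# (its default FUN is mean)
df$mean_attendance <- ave(df$Attendance, df$Hometeam)

df
```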

Related

Compute a custom mean for each row over multiple columns, based on a set of conditions

I have a complex problem and I will be grateful if someone can help me out. I have a dataframe made up of appended survey data for different countries in different years. In the said dataframe, I also have air quality measures for the neighbourhoods where respondents were selected. The air quality data is from 1998 to 2016.
My problem is I want to compute the row mean (or cumulative mean exposures) for each person based on the respondents' age and the air quality data years. My data frame looks like this
dat <- data.frame(ID=c(1:2000), dob = sample(1990:2020, size=2000, replace=TRUE),
survey_year=rep(c(1998, 2006, 2008, 2014, 2019), times=80, each=5),
CNT = rep(c('AO', 'GH', 'NG', 'SL', 'UG'), times=80, each=5),
Ozone_1998=runif(2000), Ozone_1999=runif(2000), Ozone_2000=runif(2000),
Ozone_2001=runif(2000), Ozone_2002=runif(2000), Ozone_2003=runif(2000),
Ozone_2004=runif(2000), Ozone_2005=runif(2000), Ozone_2006=runif(2000),
Ozone_2007=runif(2000), Ozone_2008=runif(2000), Ozone_2009=runif(2000),
Ozone_2010=runif(2000), Ozone_2011=runif(2000), Ozone_2012=runif(2000),
Ozone_2013=runif(2000), Ozone_2014=runif(2000), Ozone_2015=runif(2000),
Ozone_2016=runif(2000))
In the example data frame above, all respondents in country AO will have their cumulative mean air quality exposures restricted to Ozone_1998, while respondents in country SL will have their mean calculated based on Ozone_1998 to Ozone_2014.
Next, for a person in country SL aged 15 years, I want their cumulative exposure to run from Ozone_2000 to Ozone_2014 (the 15-year period of their life, including their birth year). A person aged 16 will have their mean from Ozone_1999 to Ozone_2014, etc.
Is there a way to do this complex task in R?
NB: Although my question is similar to another I posted (see link below), this task is much more complex. I tried adapting the solution to my previous question, but my attempts did not work. For instance, I tried
dat$mean_exposure = dat %>% pivot_longer(starts_with("Ozone"), names_pattern = "(.*)_(.*)", names_to = c("type", "year")) %>%
mutate(year = as.integer(year)) %>% group_by(ID) %>%
summarize(mean_under5_ozone = mean(value[ between(year, survey_year,survey_year + 0) ]), .groups = "drop")
but got an error
Error: Problem with `summarise()` input `mean_under5_ozone`.
x `left` must be length 1
i Input `mean_under5_ozone` is `mean(value[between(year, survey_year, survey_year + 0)])`.
i The error occurred in group 1: ID = 1.
Link to the previous question
How to compute a custom mean for each row over multiple columns, based on a row-specific criterion?
Thank you
The tidying step from your last question works well:
tidy_data = dat %>%
pivot_longer(
starts_with("Ozone"),
names_pattern = "(.*)_(.*)",
names_to = c(NA, "year"),
values_to = "ozone"
) %>%
mutate(year = as.integer(year))
Now you can filter down to the years you want and get the mean exposure by country / birth year:
mean_lifetime_exposure = tidy_data %>%
group_by(CNT, dob) %>%
filter(year >= dob) %>%
summarise(mean(ozone))
PS I'm sorry I don't quite understand your first question about country AO.
Edit:
Does this do what you wanted? The logic is a bit convoluted but the code is straightforward.
tidy_data_filtered = tidy_data %>%
filter(
!(CNT == "AO" & year != 1998),
!(CNT == "SL" & !year %in% 1998:2014)
)
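One possible reading of the per-person requirement is "average ozone over the years from birth up to the survey year". The sketch below assumes that reading and uses a tiny made-up data frame (toy) in the same shape as dat. The error in the question says `left` must be length 1, which suggests between() was given a whole column as a bound; comparing row by row inside filter() sidesteps that.

```r
library(dplyr)
library(tidyr)

# Toy data in the same shape as `dat` (values invented for illustration)
toy <- data.frame(
  ID = 1:2,
  dob = c(1999, 2000),
  survey_year = c(2000, 2001),
  Ozone_1998 = c(1, 10),
  Ozone_1999 = c(2, 20),
  Ozone_2000 = c(3, 30),
  Ozone_2001 = c(4, 40)
)

exposure <- toy %>%
  pivot_longer(
    starts_with("Ozone"),
    names_pattern = "(.*)_(.*)",
    names_to = c(NA, "year"),
    values_to = "ozone"
  ) %>%
  mutate(year = as.integer(year)) %>%
  # Assumed window: birth year through survey year, compared per row
  filter(year >= dob, year <= survey_year) %>%
  group_by(ID) %>%
  summarise(mean_exposure = mean(ozone), .groups = "drop")

exposure
```

Respondents with no ozone year inside their window (for example, born after the survey year) drop out of the result, so a join back to the original table would be needed to keep them as NA.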

Assign a conditional value to new created column

My Data frame looks like this
Now, I want to add a new column which assigns one (!) specific value to each country. That means, there is only one value for Australia, one for Canada etc. for every year.
It should look like this:
Year Country R Ineq Adv NEW_COL
2018 Australia R1 Ineq1 1 x_Australia
2019 Australia R2 Ineq2 1 x_Australia
1972 Canada R1 Ineq1 1 x_Canada
...
Is there a smart way to do this?
Appreciate any help!
You can use merge.
x = data.frame(country = c("AUS","CAN","AUS","USA"),
val1 = c(1:4))
y = data.frame(country = c("AUS","CAN","USA"),
val2 = c("a","b","c"))
merge(x,y)
country val1 val2
1 AUS 1 a
2 AUS 3 a
3 CAN 2 b
4 USA 4 c
You just manually create the (probably significantly smaller!) reference table that then gets duplicated in the original table in the merge. As you can see, my 3 row table (with a,b,c) is correctly duplicated up to the original (4 row) table such that every AUS gets "a".
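The same lookup idea can also be sketched without a second data frame, using a named vector in base R. The names lookup and val2 here are illustrative only. The as.character() call matters on older R versions, where data.frame turns strings into factors and indexing by a factor would use its integer codes.

```r
x <- data.frame(country = c("AUS", "CAN", "AUS", "USA"),
                val1 = 1:4)

# One value per country, keyed by the country code
lookup <- c(AUS = "a", CAN = "b", USA = "c")

# Index the lookup by name; unname() drops the names carried over from lookup
x$val2 <- unname(lookup[as.character(x$country)])

x
```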
You may use mutate and case_when from the package dplyr:
library(dplyr)
data <- data.frame(country = rep(c("AUS", "CAN"), each = 2))
data <- mutate(data,
newcol = case_when(
country == "CAN" ~ 1,
country == "AUS" ~ 2))
print(data)
You can use mutate and group_indices:
library(dplyr)
Sample data:
sample.df <- data.frame(Year = sample(1971:2019, 10, replace = T),
Country = sample(c("AUS", "Can", "UK", "US"), 10, replace = T))
Create new variable called ID, and assign unique ID to each Country group:
sample.df <- sample.df %>%
mutate(ID = group_indices(., Country))
If you want it to appear as x_Country, you can use paste (as commented):
sample.df <- sample.df %>%
mutate(ID = paste(group_indices(., Country), Country, sep = "_"))
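As far as I know, newer dplyr releases warn that passing columns to group_indices() is deprecated. A base R sketch that produces the same kind of ID, assuming it is fine for IDs to follow the sorted order of the country names, is shown below. The data frame toy.df is made up for illustration.

```r
toy.df <- data.frame(Country = c("US", "AUS", "UK", "AUS"))

# factor() sorts the distinct countries; as.integer() turns the levels into 1, 2, 3, ...
toy.df$ID <- as.integer(factor(toy.df$Country))

# Combine the ID and the country name, as in the paste() example above
toy.df$label <- paste(toy.df$ID, toy.df$Country, sep = "_")

toy.df
```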

How to group rows with duplicate name in R?

I am quite new to R and struggling with subsetting datasets.
This is where the dataset came from and how I clean it.
board_game <- read.csv("https://raw.githubusercontent.com/bryandmartin/STAT302/master/docs/Projects/project1_bgdataviz/board_game_raw.csv")
#tidy up the column of mechanic and category with cSplit function
library(splitstackshape)
mechanic <- board_game$mechanic
board_game_tidy <- cSplit(board_game,splitCols=c("mechanic","category"), sep = ",", direction = "long")
Here's my code trying to extract two columns: category and average complexity.
summary_category <- summary(board_game_tidy$category)
top_5_category <- summary_category[1:5]
complexity_top_5_category <- board_game_tidy %>%
group_by(category) %>%
select(average_complexity) %>%
filter(category == c("Abstract Strategy Action / Dexterity", "Adventure", "Age of Reason","American Civil War "))
complexity_top_5_category
My final intent: create a data frame with only 2 columns: category and average complexities, and take a mean of the average complexities under the same category name.
What I encountered: I have 5 rows of category, but 30 rows of average complexities. What can I do to take a mean value of all the average complexities under the same category names? All help will be appreciated! Thank you!
Filter the values for the top 5 categories, then group_by category and take the mean of average_complexity.
library(dplyr)
board_game_tidy %>%
filter(category %in% names(top_5_category)) %>%
group_by(category) %>%
summarise(average_complexity = mean(average_complexity))
# category average_complexity
# <fct> <dbl>
#1 Abstract Strategy 0.844
#2 Action / Dexterity 0.469
#3 Adventure 1.25
#4 Age of Reason 1.95
#5 American Civil War 1.68
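The switch from == to %in% is worth spelling out. With ==, R recycles the shorter vector and compares position by position, so rows are kept only when they happen to line up with the matching name. A small base R illustration with made-up values:

```r
category <- c("Age of Reason", "Adventure", "Adventure", "Abstract Strategy")
wanted   <- c("Adventure", "Age of Reason")

# == recycles `wanted` to length 4 and compares element by element
by_eq <- category == wanted

# %in% asks, for each element of `category`, whether it appears anywhere in `wanted`
by_in <- category %in% wanted

by_eq
by_in
```

Here by_eq is TRUE only in the third position, while by_in is TRUE for the first three, which is the membership test a filter on several names needs.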
You are very close. You need dplyr::summarise()
complexity_top_5_category <- board_game_tidy %>%
group_by(category) %>%
dplyr::summarise(mean_average_complexity = mean(average_complexity, na.rm=TRUE)) %>%
top_n(5, mean_average_complexity)
#select(average_complexity) %>% # you don't need this
#filter(category == c("Abstract Strategy Action / Dexterity", "Adventure", "Age of Reason","American Civil War "))
complexity_top_5_category
You don't have to include dplyr:: before summarise(). However, some other common packages have their versions of summarise() so it's safer to be specific.
You can use top_n() to automatically select the top n categories, instead of using filter().

Match two dataframes with different column names and create new column with mean from the other

I have two dataframes. The first one only lists each School/Team once, something like this:
classA <- data.frame(School=c("Omaha South", "Millard North", "Elkhorn"))
The other dataframe is a table of basketball scores throughout a season, and a School/Team can be listed more than once in the same column:
scores <- data.frame('Away Score'=c(60,84,48,72),
'Away Team'=c("Omaha South", "Millard North", "Elkhorn","Elkhorn"),
'Home Score'=c(88,40,38,62),
'Home Team'=c("Elkhorn", "Omaha South", "Millard North","Omaha South"))
My goal is to create a new column called classA$'Away PPG' that averages all of the 'Away Scores' for each School in the first data frame. So, for Elkhorn, the new classA column would be 60, i.e. (48 + 72)/2.
One of the places I'm getting stuck is that the two dfs have different column names to match and I haven't found out how to deal with that aspect.
I got help previously on a somewhat related problem where I was looking for a count instead of an average, but I couldn't figure out how to modify it to work for this one. The solution for the count issue looked like this:
df2 %>%
right_join(df1, by = c('Winner' = 'School')) %>%
na.omit() %>%
count(Winner, name = "wins") %>%
right_join(df1, c('Winner' = 'School')) %>%
mutate(wins = replace(wins, is.na(wins), 0))
We can join classA with scores and then take mean of Away.Score for each School.
library(dplyr)
classA %>%
left_join(scores, by = c('School' = 'Away.Team')) %>%
group_by(School) %>%
summarise(AwayScore = mean(Away.Score, na.rm = TRUE))
# A tibble: 3 x 2
# School AwayScore
# <fct> <dbl>
#1 Elkhorn 60
#2 Millard North 84
#3 Omaha South 60
Similarly in base R
aggregate(Away.Score~School,
merge(classA, scores, by.x = 'School', by.y = 'Away.Team'),
mean, na.rm = TRUE)
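If the average should end up as a new column on classA itself, as the question asks, one base R sketch is to compute the per-team means with tapply() and look them up by school name. The column name Away.PPG is an assumption (a syntactic version of 'Away PPG').

```r
classA <- data.frame(School = c("Omaha South", "Millard North", "Elkhorn"))
scores <- data.frame('Away Score' = c(60, 84, 48, 72),
                     'Away Team' = c("Omaha South", "Millard North", "Elkhorn", "Elkhorn"),
                     'Home Score' = c(88, 40, 38, 62),
                     'Home Team' = c("Elkhorn", "Omaha South", "Millard North", "Omaha South"))

# data.frame() converts 'Away Score' to Away.Score, and so on
away_means <- tapply(scores$Away.Score, scores$Away.Team, mean)

# Look up each school's mean by name; schools with no away games would get NA
classA$Away.PPG <- as.numeric(away_means[as.character(classA$School)])

classA
```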

Using Mutate to rank specific columns

I'm a relative newbie to dplyr. I have a data.frame organized with each store name and source (made up of the results for 2018) making up the observations. The variables are total revenue, quantity, customer experience score, and a few others.
I'd like to rank each category in the data.frame and create new observations. All variables would be ranked in descending order, but customer experience and one additional column would be ranked in ascending order. I'd like the source for these new observations to be called "ranks".
store <- c("NYC", "Chicago", "Boston")
source <- c("2018", "2018", "2018")
revenue <- c(10000, 50000, 2000)
quantity <- c(100, 50, 20)
satisfaction <- c(3, 2, 5)
table <- cbind(store, source, revenue, quantity, satisfaction)
I was able to get what I needed using mutate, but I had to manually name each new column. I'm sure there is a more efficient way to rank these values out there!
Here is what I originally did:
table <- table %>%
mutate(revenue_rank = rank(-revenue), quantity_rank = rank(-quantity), satisfaction_rank = rank(satisfaction))
In general, if you're having to do something repeatedly in a data frame, such as calculating ranks, you probably want to reshape to long data. Also note that what you got from cbind is a matrix, not a data frame. That is probably not what you want, since it means numeric variables actually come through as characters. Instead of cbind, use data.frame or data_frame (for a tibble; current versions of tibble spell this tibble()).
What I did here is gathered into a long data frame, grouped by the measures (revenue, quantity, or satisfaction), then gave ranks based on the value, keeping in mind that you wanted different orders for satisfaction and the other measures.
library(tidyverse)
store <- c("NYC", "Chicago", "Boston")
source <- c("2018", "2018", "2018")
revenue <- c(10000, 50000, 2000)
quantity <- c(100, 50, 20)
satisfaction <- c(3, 2, 5)
df <- data_frame(store, source, revenue, quantity, satisfaction)
df %>%
gather(key = measure, value = value, revenue:satisfaction) %>%
group_by(measure) %>%
mutate(rank = ifelse(measure == "satisfaction", rank(value), rank(-value))) %>%
ungroup() %>%
select(-value) %>%
mutate(measure = paste(measure, "rank", sep = "_")) %>%
spread(key = measure, value = rank)
#> # A tibble: 3 x 5
#> store source quantity_rank revenue_rank satisfaction_rank
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 Boston 2018 3 3 3
#> 2 Chicago 2018 2 1 1
#> 3 NYC 2018 1 2 2
Created on 2018-05-04 by the reprex package (v0.2.0).
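For readers on dplyr 1.0.0 or later, a wide-format alternative is to apply the ranking to several columns at once with across(). This is a sketch under that version assumption, not part of the original answer, and it uses a plain data.frame for the example data.

```r
library(dplyr)

df <- data.frame(
  store = c("NYC", "Chicago", "Boston"),
  source = c("2018", "2018", "2018"),
  revenue = c(10000, 50000, 2000),
  quantity = c(100, 50, 20),
  satisfaction = c(3, 2, 5)
)

ranked <- df %>%
  mutate(
    # Descending ranks for the "bigger is better" measures
    across(c(revenue, quantity), ~ rank(-.x), .names = "{.col}_rank"),
    # Ascending rank for satisfaction, as requested in the question
    satisfaction_rank = rank(satisfaction)
  )

ranked
```

The .names argument builds the new column names from the originals, which removes the need to type each one by hand.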

Resources