Summarizing by group using dplyr not working as expected - r

I am trying to summarize the demographic information in a data frame and am running into some issues. For gender, participants can choose from 4 options (1, 2, 3, 4), and blanks (no response) are treated as NA values by R. I get the correct counts for each gender, but I run into problems when I try to obtain the mean age for each gender.
I'd like to keep the observations with NA values: those participants may not have answered the demographic questions, but they did answer other questions, so I do not want to simply remove those rows from the data frame.
Here is what I tried:
#df$q10: what is your gender
by_gender = df %>%
group_by(df$Q10) %>%
dplyr::summarize(count = n(),
AvgAge = mean(df$Q11_1_TEXT, na.rm = TRUE))
by_gender
This returns the same value for every gender, equal to the overall mean:
mean(df$Q11_1_TEXT, na.rm = TRUE)
Both the gender and age columns have NA values and I suspect this is where the issue may be? I tried adding na.rm = T but that does not seem to work. What else can I try?
Edit: Removing $ makes the function work as expected.

When you ask for mean(df$Q11_1_TEXT) it will calculate a mean from the original ungrouped vector, whereas if you use mean(Q11_1_TEXT) it will look for Q11_1_TEXT within the grouped data frame it received from the prior step.
Compare:
library(dplyr)
mtcars %>%
  group_by(gear) %>%
  summarize(wt_ttl = sum(wt),           # wt within the current group
            wt_ttl2 = sum(mtcars$wt))   # whole ungrouped column, same for every group
# A tibble: 3 × 3
gear wt_ttl wt_ttl2
<dbl> <dbl> <dbl>
1 3 58.4 103.
2 4 31.4 103.
3 5 13.2 103.
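Applied to the question's column names, the same fix might look like the sketch below. The data frame is a small made-up stand-in (the real df was not posted); only the names Q10 and Q11_1_TEXT come from the question. Note that group_by() keeps NA as its own group, so respondents who skipped the gender question are still counted rather than dropped.

```r
library(dplyr)

# Hypothetical stand-in for the asker's data; all values are invented
df <- data.frame(
  Q10        = c(1, 1, 2, 2, NA, 3),      # gender, NA = no response
  Q11_1_TEXT = c(20, 30, 40, NA, 50, 25)  # age, NA = no response
)

by_gender <- df %>%
  group_by(Q10) %>%                                   # bare column name, not df$Q10
  summarize(count  = n(),
            AvgAge = mean(Q11_1_TEXT, na.rm = TRUE))  # evaluated within each group

by_gender
```

Here na.rm = TRUE only drops missing ages inside each group; the rows with a missing gender stay in the data and form their own NA group.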

Related

How can I calculate a new dataframe only for one outcome type?

I'm working with some data from participants performing a cognitive task that records their outcome (Correct or Incorrect) and reaction time (RT); the entire dataset is called practice. For each participant, I want to create a new data frame with their average RT when they got the answer correct, and one for when they were incorrect. I've tried
practice %>%
mutate(correctRT = mean(practice$RT[practice$Outcome=="Correct"]))
using dplyr and the tidyverse, as well as
correctRT <- c(mean(practice$RT[practice$Outcome=="Correct"]))
(which I'm sure isn't the correct way to do it), and nothing seems to be working. I'm a complete novice working with this dataset to learn how to do stats with R, and I just can't find any answers.
In R you can "keep" multiple objects (e.g. data frames) in a single list. This saves you from storing every (sub)dataframe in a separate variable (e.g. by subsetting your data on Participant and Outcome and storing each piece). This comes in handy when you have "many" individuals and manually filtering and storing each (sub)dataframe becomes prohibitive.
Conceptually, your problem is to "subset" your data to the Participant and Outcome you aim for and calculate the mean on this group.
The following is based on {tidyverse}, i.e. {dplyr}.
data
As you have not provided a reproducible example, this is a quick mock-up of your data:
practice <- data.frame(
Participant = c("A","A","A","B","B","B","B","C","C","D"),
RT = c(10, 12, 14, 9, 12, 13, 17, 11, 13, 17),
Outcome = c("Incorrect","Correct", "Correct","Incorrect","Incorrect","Correct", "Correct","Incorrect","Correct", "Correct")
)
which looks like the following:
practice
Participant RT Outcome
1 A 10 Incorrect
2 A 12 Correct
3 A 14 Correct
4 B 9 Incorrect
5 B 12 Incorrect
6 B 13 Correct
7 B 17 Correct
8 C 11 Incorrect
9 C 13 Correct
10 D 17 Correct
splitting groups of a dataframe
The {tidyverse} provides some neat functions for the general data processing.
{dplyr} has a group_split() function that returns such a list.
library(dplyr)
practice %>% group_split(Participant, Outcome)
<list_of<
tbl_df<
Participant: character
RT : double
Outcome : character
>
>[7]>
[[1]]
# A tibble: 2 x 3
Participant RT Outcome
<chr> <dbl> <chr>
1 A 12 Correct
2 A 14 Correct
[[2]]
...
You can address the respective list-elements with the [[]] notation.
Store the list in a variable and try my_list_name[[3]] to extract the 3rd element.
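As a minimal sketch of that last step, using the mock-up data from above (the variable names by_group and third are made up for illustration):

```r
library(dplyr)

practice <- data.frame(
  Participant = c("A","A","A","B","B","B","B","C","C","D"),
  RT = c(10, 12, 14, 9, 12, 13, 17, 11, 13, 17),
  Outcome = c("Incorrect","Correct","Correct","Incorrect","Incorrect",
              "Correct","Correct","Incorrect","Correct","Correct")
)

# One list element per Participant/Outcome combination
by_group <- practice %>% group_split(Participant, Outcome)

length(by_group)        # number of (sub)dataframes in the list
third <- by_group[[3]]  # extract the 3rd element as an ordinary tibble
third
```

The elements follow the sorted order of the grouping columns, so with this data the third element holds participant B's correct trials.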
potential summary for your problem
If you do not need a list you could wrap this into a data summary.
If you want to split on Outcomes, you may want to filter your data in 2 sub-dataframes only holding the respective outcome (e.g. correct <- practice %>% filter(Outcome == "Correct")).
Group your data dependent on the summary you want to construct.
Use summarise() to summarise your groups into a 1-row summary.
Note that you can combine multiple operations. For example, in addition to the mean reaction time, the following counts the number of rows (i.e. attempts).
practice %>%
  group_by(Participant, Outcome) %>%
  ##--------- summarise each group into 1 row
  summarise(Mean_RT  = mean(RT),   # calculate mean reaction time
            Attempts = n())        # how many times
This yields:
# A tibble: 7 x 4
# Groups: Participant [4]
Participant Outcome Mean_RT Attempts
<chr> <chr> <dbl> <int>
1 A Correct 13 2
2 A Incorrect 10 1
3 B Correct 15 2
4 B Incorrect 10.5 2
5 C Correct 13 1
6 C Incorrect 11 1
7 D Correct 17 1
Please note that this is a grouped data frame. If you process the data further, you need to remove the grouping; otherwise any follow-up operation in the pipe will work at the group level.
For this you can either use summarise(...., .groups = "drop") or add ... %>% ungroup() to your pipe.
If you need to split the result, see group_split() above.
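Putting the filter step and the ungrouped summary together, a sketch for the two data frames asked for in the question could look like this. It reuses the mock-up data; the names correct_rt and incorrect_rt are made up for illustration.

```r
library(dplyr)

practice <- data.frame(
  Participant = c("A","A","A","B","B","B","B","C","C","D"),
  RT = c(10, 12, 14, 9, 12, 13, 17, 11, 13, 17),
  Outcome = c("Incorrect","Correct","Correct","Incorrect","Incorrect",
              "Correct","Correct","Incorrect","Correct","Correct")
)

# Mean RT per participant for correct answers only
correct_rt <- practice %>%
  filter(Outcome == "Correct") %>%
  group_by(Participant) %>%
  summarise(Mean_RT = mean(RT), .groups = "drop")  # drop the grouping afterwards

# Same summary for incorrect answers
incorrect_rt <- practice %>%
  filter(Outcome == "Incorrect") %>%
  group_by(Participant) %>%
  summarise(Mean_RT = mean(RT), .groups = "drop")
```

Participants without any rows for an outcome (D has no incorrect trials in the mock-up) simply do not appear in the corresponding result.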

plyr for same analysis across different subsets

I'm new to plyr and dplyr and seriously don't get it. I have managed my way around some functions, but I struggle with really basic stuff such as the following example.
Taking mtcars, I have different overlapping subsets, such as vs = 1 and am = 1
I now want to run the same analysis, in this case median() for one var over the different subsets, and another analysis, such as mean() for another var.
This should give me in the end the same result, such as the following code - just much shorter:
data_mt <- mtcars # has binary dummy vars for grouping
data_vs <- data_mt[ which(data_mt$vs == 1 ), ]
data_am <- data_mt[ which(data_mt$am == 1 ), ]
median(data_mt$mpg)
median(data_vs$mpg)
median(data_am$mpg)
mean(data_mt$cyl)
mean(data_vs$cyl)
mean(data_am$cyl)
In my real example, I have an analog to data_mt, so if you have a solution starting there, without data_vs etc. that would be great.
I'm sure this is very basic, but I can't wrap my head around it - and as I have some 1500 variables that I want to look at, I'd appreciate your help =)
It may well be that my answer is already out there, but with the terminology I know I didn't find it explained for Dummies ;D
Edit:
To have a better understanding of what I am doing and what I am looking for, I hereby post my original code (not the mtcars example).
I have a dataset ds with 402 observations of 553 variables
The dataset comes from a study with human participants, some of whom opted in for additional research: mys or obs or both.
ds$mys <- 0
ds$mys[ which(ds$staffmystery_p == "Yes" ) ] <- 1
ds$obs <- 0
ds$obs[ which( !is.na(ds$sales_time)) ] <- 1
The 553 variables are either integers (e.g. for age or years of experience) or factors (e.g. sex or yes/no answers). I now want to compare some descriptives of the full dataset with the descriptives for the subsets and ideally also do a t-test for the difference.
Currently I have just a very long list of code that reads more or less like the following (just much longer). This doesn't include t-tests.
describe(ds$age_b)
describe(dm$age_b)
describe(do$age_b)
prop.table(table(ds$sex_b))*100
prop.table(table(dm$sex_b))*100
prop.table(table(do$sex_b))*100
ds, dm and do are different datasets, but they are all just based on the above mentioned full dataset ds and the subsets ds$mys for dm and ds$obs for do
describe comes from the psych package and just lists descriptive statistics like mean or median etc. I don't need all of the metrics, mostly n, mean, median, sd and iqr.
The formula around 'prop.table' gives me a readout I can just copy into the excel tables I use for the final publications. I don't want automated output because I get asked all the time to add or change this, which is really just easier in excel than with automated output. (unless you know a much superior way ;)
Thank you so much!
Here is an option if we want to do this for different columns by group separately
library(dplyr)
library(purrr)
library(stringr)
map_dfc(c('vs', 'am'), ~
  mtcars %>%
    group_by(across(all_of(.x))) %>%
    summarise(!! str_c("Mean_cyl_", .x) := mean(cyl),
              !! str_c("Median_mpg_", .x) := median(mpg), .groups = 'drop')) %>%
  mutate(Mean_cyl_full = mean(mtcars$cyl), Median_mpg_full = median(mtcars$mpg))
-output
# A tibble: 2 x 8
# vs Mean_cyl_vs Median_mpg_vs am Mean_cyl_am Median_mpg_am Mean_cyl_full Median_mpg_full
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 0 7.44 15.6 0 6.95 17.3 6.19 19.2
#2 1 4.57 22.8 1 5.08 22.8 6.19 19.2
If the package version is old, we can replace the across with group_by_at
map_dfc(c('vs', 'am'), ~
  mtcars %>%
    group_by_at(vars(.x)) %>%
    summarise(!! str_c("Mean_cyl_", .x) := mean(cyl),
              !! str_c("Median_mpg_", .x) := median(mpg), .groups = 'drop')) %>%
  mutate(Mean_cyl_full = mean(mtcars$cyl), Median_mpg_full = median(mtcars$mpg))
Update
Based on the update, we could place the datasets in a list, do the transformations at once and return a list of descriptive statistics and the proportion table
library(psych)   # provides describe()
out <- map(dplyr::lst(dm, ds, do), ~ {
  dat <- .x %>%
    mutate(mys = as.integer(staffmystery_p == 'Yes'),
           obs = as.integer(!is.na(sales_time)))
  age_b_desc <- describe(dat$age_b)
  prop_table_out <- prop.table(table(dat$sex_b))*100
  return(dplyr::lst(age_b_desc, prop_table_out))
})
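If the goal is the literal comparison from the question (the full data versus the vs == 1 and am == 1 subsets, which may overlap), one possible sketch is to put the subsets in a named list and apply the same summary to each. The names subsets, result, n, median_mpg and mean_cyl are made up for illustration.

```r
library(dplyr)

# Named list of (possibly overlapping) subsets; the names become a column below
subsets <- list(
  full = mtcars,
  vs   = filter(mtcars, vs == 1),
  am   = filter(mtcars, am == 1)
)

# Apply the same one-row summary to every subset and stack the rows
result <- bind_rows(
  lapply(subsets, function(d) {
    summarise(d,
              n          = n(),
              median_mpg = median(mpg),
              mean_cyl   = mean(cyl))
  }),
  .id = "subset"
)

result
```

This yields one row per subset, which is a shape that is easy to copy into a spreadsheet. Adding another subset or statistic only requires one more list entry or one more line in summarise().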

Tidyr: pivot_wider error: Can't convert <double> to <list>

I have a dataframe which lists species observations across multiple survey plots (the data is here). I'm trying to use tidyr's pivot_wider to spread that abundance data across several columns, with the new columns being each of the observed species. Here's the line of code I'm trying to use to do that:
data %>% pivot_wider(names_from = Species, values_from = Total.Abundance, values_fill = 0)
However, this gives me two error messages:
Error: Can't convert <double> to <list>.
Values are not uniquely identified; output will contain list-cols.
I'm not sure what the issue is, because this has worked fine for several other dataframes that are (seemingly) identical to this one. I've tried googling the first error message and have not been able to find what conditions cause it—I don't know what double R is trying to convert to a list, nor why it's trying to convert to a list. The Total.Abundance column should be integers, but I wonder if somehow it's a double data type?
From what I've been able to find, the second error message appears when there are identical rows in the dataframe. However, the error persists when I modify my statement to
unique(data) %>% pivot_wider(names_from = Species, values_from = Total.Abundance, values_fill = 0)
Which I would have thought would remove duplicate rows.
Any help would be much appreciated!
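Expanding on my comment: there are duplicates in your data that cannot be removed by unique() or, in dplyr, distinct(). These rows share the same Plot.ID and Species but must differ in at least one other column, so they are not exact duplicates. Counting the rows per Plot.ID and Species after distinct() shows them: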
dat %>%
distinct() %>%
group_by(Plot.ID, Species) %>%
count()
# Plot.ID Species n
# <dbl> <chr> <int>
# 1 1 Calliopius 1
# 2 1 Idotea 2
# 3 1 Lacuna vincta 2
# 4 1 Mitrella lunata 2
# 5 1 Podoceropsis nitida 1
# 6 1 Unk. Amphipod 1
# 7 1 Unk. Bivalve 1
# 8 2 Calliopius 1
# 9 2 Caprella penantis 1
#10 2 Corophium insidiosum 1
You need to find out why you have duplicates like this and reconcile them, say by summing them up. The problem might stem from bugs in earlier data wrangling code, in which case summing is not necessarily suitable. Or perhaps you sampled the same plot twice, in which case you may want the mean instead of the sum to normalize for sampling effort, or an extra column indicating sampling effort. Nevertheless, this works:
dat %>%
  group_by(Plot.ID, Species) %>%
  summarise(abundance = sum(Total.Abundance)) %>%
  tidyr::pivot_wider(names_from = Species, values_from = abundance,
                     values_fill = 0)
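The whole workflow can be sketched on a tiny made-up data set (the column names come from the question, the values are invented). The first error message likely follows from the second: once a Plot.ID/Species pair has more than one value, pivot_wider() has to build list-columns, and a numeric values_fill = 0 cannot be placed in a list-column.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: plot 1 has two non-identical rows for Idotea
dat <- data.frame(
  Plot.ID         = c(1, 1, 1, 2),
  Species         = c("Idotea", "Idotea", "Calliopius", "Calliopius"),
  Total.Abundance = c(3, 5, 2, 4)
)

# Step 1: find the Plot.ID/Species pairs that occur more than once
dup_keys <- dat %>%
  count(Plot.ID, Species) %>%
  filter(n > 1)

# Step 2: collapse to one row per pair (summing is an assumption), then widen
wide <- dat %>%
  group_by(Plot.ID, Species) %>%
  summarise(abundance = sum(Total.Abundance), .groups = "drop") %>%
  pivot_wider(names_from = Species, values_from = abundance, values_fill = 0)

wide
```

Inspecting dup_keys first makes it easier to decide whether a sum, a mean, or a fix further upstream is the right way to reconcile the repeated pairs.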

dplyr is not working properly when using group by in it

I am calculating error rates for two different forecasting methods.
My basic approach is to group by nk, calculate the errors, and choose the method with the lower error rate. The issue is that MAP1E_arima_ds and MAPE_cagr_ds come out as the same value in every row; somehow group_by is not taking effect in the calculation.
Here is what I tried:
group_by(nk) %>%
mutate(MAP1E_arima_ds=sum(temp2$ABS_arima_error_ds)/nrow(temp2)) %>%
mutate(MAPE_cagr_ds=sum(temp2$ABS_cagr_error_ds)/nrow(temp2))
The expected result looks like this:
nk MAP1E_arima_ds MAPE_cagr_ds
1-G0175 value_x value_y
1-H0182 value_z value_a
so that I can compare the error rates and choose the forecasting method with the lower error rate.
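If I understand you correctly, the problem is that temp2$... refers to the whole ungrouped column and nrow(temp2) to the whole table, so the grouping is bypassed. Use the bare column names and n() instead; I think what you are looking for is this: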
library(dplyr)
df %>%
group_by(nk) %>%
summarise(MAP1E_arima_ds=sum(ABS_arima_error_ds)/n(),
MAPE_cagr_ds=sum(ABS_cagr_error_ds)/n())
# A tibble: 2 x 3
# nk MAP1E_arima_ds MAPE_cagr_ds
# <chr> <dbl> <dbl>
#1 1-G0175 14.7 3.38
#2 1-H0182 2.91 7.40
which is actually the mean:
df %>%
group_by(nk) %>%
summarise(MAP1E_arima_ds = mean(ABS_arima_error_ds),
MAPE_cagr_ds = mean(ABS_cagr_error_ds))
Moreover, after copying your dput, it seems that your data is already grouped by nk, so the following would also give the same result:
df %>%
summarise(MAP1E_arima_ds=mean(ABS_arima_error_ds),
MAPE_cagr_ds=mean(ABS_cagr_error_ds))
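To go one step further and pick the method with the lower error per nk, a sketch along these lines might work. The data frame is a made-up stand-in (the column names come from the question, the values are invented), and best_method is a hypothetical column name.

```r
library(dplyr)

# Hypothetical absolute errors for two nk groups
df <- data.frame(
  nk                 = c("1-G0175", "1-G0175", "1-H0182", "1-H0182"),
  ABS_arima_error_ds = c(10, 20, 2, 4),
  ABS_cagr_error_ds  = c(3, 5, 6, 8)
)

errors <- df %>%
  group_by(nk) %>%
  summarise(MAP1E_arima_ds = mean(ABS_arima_error_ds),
            MAPE_cagr_ds   = mean(ABS_cagr_error_ds),
            .groups = "drop") %>%
  # label the method with the smaller mean error in each row
  mutate(best_method = if_else(MAP1E_arima_ds < MAPE_cagr_ds, "arima", "cagr"))

errors
```

Ties go to "cagr" in this sketch because the comparison is a strict less-than; adjust that rule if ties need different handling.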

For loop compilation error in R

I have a data frame in R that lists monthly sales data by department for a store. Each record contains a month/year, a department name, and the total sales in that department for the month. I'm trying to calculate the mean sales by department, adding them to the vector avgs, but I seem to be having two problems: the total sales per department is not compiling at all (it's evaluating to zero) and avgs is compiling by record instead of by department. Here's what I have:
avgs = c()
for(dept in data$departmentName){
  total <- 0
  for(record in data){
    if(identical(data$departmentName, dept)){
      total <- total + data$ownerSales[record]
    }
  }
  avgs <- c(avgs, total/72)
}
Upon looking at avgs on completion of the loop, I find that it's returning a vector of zeroes the length of the data frame rather than a vector of 22 averages (there are 22 departments). I've been tweaking this forever and I'm sure it's a stupid mistake, but I can't figure out what it is. Any help would be appreciated.
Why not use dplyr?
library(dplyr)
data(iris)
iris %>% group_by(Species) %>% # or dept
summarise(total_plength = sum(Petal.Length), # total owner sales
weird_divby72 = total_plength/72) # total/72?
# A tibble: 3 × 3
Species total_plength weird_divby72
<fctr> <dbl> <dbl>
1 setosa 73.1 1.015278
2 versicolor 213.0 2.958333
3 virginica 277.6 3.855556
Your case would probably look like this:
data %>% group_by(departmentName) %>%
  summarise(total_sales = sum(ownerSales),
            monthly_sales = total_sales/72)
I like dplyr for its syntax and pipeability. I think it is a huge improvement over base R for ease of data wrangling. Here is a good cheat sheet to help you get rolling: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
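For reference, the original loop goes wrong in a few places: for(dept in data$departmentName) visits every record rather than every distinct department, for(record in data) iterates over the columns of the data frame rather than its rows, and identical(data$departmentName, dept) compares the whole column with a single name, which is FALSE whenever there is more than one record, so total never moves off zero. A base R sketch of a corrected loop, plus a one-line tapply() equivalent, is shown below. The data frame sales is a made-up stand-in (only the column names come from the question), and the sketch divides by the number of records per department instead of the hard-coded 72.

```r
# Hypothetical stand-in for the sales data; values are invented
sales <- data.frame(
  departmentName = c("Bakery", "Bakery", "Deli", "Deli", "Deli"),
  ownerSales     = c(10, 20, 5, 5, 20)
)

# Loop over each distinct department once
avgs <- c()
for (dept in unique(sales$departmentName)) {
  in_dept <- sales$departmentName == dept              # logical mask for this department
  avgs[dept] <- sum(sales$ownerSales[in_dept]) / sum(in_dept)
}
avgs

# Same result without an explicit loop
avgs_tapply <- tapply(sales$ownerSales, sales$departmentName, mean)
avgs_tapply
```

Both versions return one named value per department, so with 22 departments the result has length 22 rather than one entry per record.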

Resources