plyr for same analysis across different subsets - r

I'm new to plyr and dplyr and seriously don't get it. I have managed my way around some functions, but I struggle with really basic stuff such as the following example.
Taking mtcars, I have different overlapping subsets, such as vs = 1 and am = 1
I now want to run the same analysis, in this case median() for one var over the different subsets, and another analysis, such as mean() for another var.
This should give me in the end the same result, such as the following code - just much shorter:
data_mt <- mtcars # has binary dummy vars for grouping
data_vs <- data_mt[ which(data_mt$vs == 1 ), ]
data_am <- data_mt[ which(data_mt$am == 1 ), ]
median(data_mt$mpg)
median(data_vs$mpg)
median(data_am$mpg)
mean(data_mt$cyl)
mean(data_vs$cyl)
mean(data_am$cyl)
In my real example, I have an analog to data_mt, so if you have a solution starting there, without data_vs etc. that would be great.
I'm sure this is very basic, but I can't wrap my head around it - and as I have some 1500 variables that I want to look at, I'd appreciate your help =)
It may well be that my answer is already out there, but with the terminology I know, I didn't find it explained for Dummies ;D
Edit:
To have a better understanding of what I am doing and what I am looking for, I hereby post my original code (not the mtcars example).
I have a dataset ds with 402 observations of 553 variables
The dataset comes from a study with human participants, some of whom opted in to additional research: mys, obs, or both.
ds$mys <- 0
ds$mys[ which(ds$staffmystery_p == "Yes" ) ] <- 1
ds$obs <- 0
ds$obs[ which( !is.na(ds$sales_time)) ] <- 1
The 553 variables are either integers (e.g. for age or years of experience) or factors (e.g. sex or yes/no answers). I now want to compare some descriptives of the full dataset with the descriptives for the subsets and ideally also do a t-test for difference.
Currently I have just a very long list of code that reads more or less like the following (just much longer). This doesn't include t-tests.
describe(ds$age_b)
describe(dm$age_b)
describe(do$age_b)
prop.table(table(ds$sex_b))*100
prop.table(table(dm$sex_b))*100
prop.table(table(do$sex_b))*100
ds, dm and do are different datasets, but they are all based on the full dataset ds mentioned above: dm is the subset with ds$mys set, and do is the subset with ds$obs set.
describe comes from the psych package and just lists descriptive statistics like mean or median etc. I don't need all of the metrics, mostly n, mean, median, sd and iqr.
The formula around 'prop.table' gives me a readout I can just copy into the excel tables I use for the final publications. I don't want automated output because I get asked all the time to add or change this, which is really just easier in excel than with automated output. (unless you know a much superior way ;)
Thank you so much!

Here is an option if we want to do this for different columns by group separately
library(dplyr)
library(purrr)
library(stringr)
map_dfc(c('vs', 'am'), ~
    mtcars %>%
      group_by(across(all_of(.x))) %>%
      summarise(!! str_c("Mean_cyl_", .x) := mean(cyl),
                !! str_c("Median_mpg_", .x) := median(mpg), .groups = 'drop')) %>%
  mutate(Mean_cyl_full = mean(mtcars$cyl), Median_mpg_full = median(mtcars$mpg))
-output
# A tibble: 2 x 8
# vs Mean_cyl_vs Median_mpg_vs am Mean_cyl_am Median_mpg_am Mean_cyl_full Median_mpg_full
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 0 7.44 15.6 0 6.95 17.3 6.19 19.2
#2 1 4.57 22.8 1 5.08 22.8 6.19 19.2
If the package version is old, we can replace the across with group_by_at
map_dfc(c('vs', 'am'), ~
    mtcars %>%
      group_by_at(vars(.x)) %>%
      summarise(!! str_c("Mean_cyl_", .x) := mean(cyl),
                !! str_c("Median_mpg_", .x) := median(mpg), .groups = 'drop')) %>%
  mutate(Mean_cyl_full = mean(mtcars$cyl), Median_mpg_full = median(mtcars$mpg))
Update
Based on the update, we could place the datasets in a list, do the transformations at once and return a list of descriptive statistics and the proportion table
out <- map(dplyr::lst(dm, ds, do), ~ {
  dat <- .x %>%
    mutate(mys = as.integer(staffmystery_p == 'Yes'),
           obs = as.integer(!is.na(sales_time)))
  age_b_desc <- describe(dat$age_b)
  prop_table_out <- prop.table(table(dat$sex_b))*100
  return(dplyr::lst(age_b_desc, prop_table_out))
})
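If the subsets overlap (as vs == 1 and am == 1 do), another option is to describe each subset as a logical filter on the full data and loop over the filters. This is only a minimal base R sketch; the names subsets and res are made up for illustration, and in the real data the filters would presumably be ds$mys == 1 and ds$obs == 1.

```r
# Each subset is a logical row filter on the full data (names are illustrative)
subsets <- list(
  full = rep(TRUE, nrow(mtcars)),
  vs   = mtcars$vs == 1,
  am   = mtcars$am == 1
)

# One row per subset, one column per statistic
res <- t(sapply(subsets, function(keep) {
  c(n          = sum(keep),
    median_mpg = median(mtcars$mpg[keep]),
    mean_cyl   = mean(mtcars$cyl[keep]))
}))
res
```

Adding another statistic means adding one line inside the function, and the resulting matrix can be copied straight into a spreadsheet.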

Related

Summarizing by group using dplyr not working as expected

I am trying to summarize demographic information of a dataframe and I am running into some issues. Breaking it down by gender, there are 4 possible options that participants can choose from: 1, 2, 3, 4, with blanks (no response) being treated as NA values by R. I am getting the correct counts for each gender, but I run into issues when trying to obtain the mean age for each gender.
I'd like to keep the observations with NA values because while they may not have answered demographic information, they have answered other questions hence why I do not want to simply remove those rows from the dataframe.
Here is what I tried
#df$q10: what is your gender
by_gender = df %>%
group_by(df$Q10) %>%
dplyr::summarize(count = n(),
AvgAge = mean(df$Q11_1_TEXT, na.rm = TRUE))
by_gender
This returns the same value for all genders as
mean(df$Q11_1_TEXT, na.rm = TRUE)
Both the gender and age columns have NA values and I suspect this is where the issue may be? I tried adding na.rm = T but that does not seem to work. What else can I try?
Edit: Removing $ makes the function work as expected.
When you ask for mean(df$Q11_1_TEXT) it will calculate a mean from the original ungrouped vector, whereas if you use mean(Q11_1_TEXT) it will look for Q11_1_TEXT within the grouped data frame it received from the prior step.
Compare:
mtcars %>%
group_by(gear) %>%
summarize(wt_ttl = sum(wt),
wt_ttl2 = sum(mtcars$wt))
# A tibble: 3 × 3
gear wt_ttl wt_ttl2
<dbl> <dbl> <dbl>
1 3 58.4 103.
2 4 31.4 103.
3 5 13.2 103.
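Applied to the question, a minimal sketch could look like the following. The data frame here is a made-up stand-in with the question's column names; note that group_by() keeps NA as its own group, so those rows are not dropped.

```r
library(dplyr)

# Hypothetical survey data: gender codes and ages, both with missing values
df <- data.frame(
  Q10        = c(1, 1, 2, 2, NA, NA),
  Q11_1_TEXT = c(20, 30, 40, NA, 50, 70)
)

by_gender <- df %>%
  group_by(Q10) %>%                                   # bare column name, no df$
  summarize(count  = n(),
            AvgAge = mean(Q11_1_TEXT, na.rm = TRUE))  # mean within each group
by_gender
```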

How to divide a data frame into new data frames (like data1, data2, data3, and so on), so I can analyse each of them (e.g. with a t-test)

I have just started learning R for data analysis. Here is my problem.
I want to analyse the body weight (BW) difference between males and females in different species. (For example: in Sorex gracilliums, male and female body weight is significantly different. This is just an example, I don't know the answer. :)) At first I thought I could divide the data by Species into several groups. (This can indeed be done in Excel, but I have too many files, so I think R is the better choice.) Then I could use some simple code to test for a sex difference. But I don't know how to divide the data or how to make the new data frames.
I tried to use group_split. It did split the data, but it just gave me many tibbles,
as shown in the image.
What should I do?
Or maybe there is a better way of testing the difference?
I am not a native speaker, so there may be many grammar mistakes, but I would really appreciate your help!
Assuming your data is in a data.frame called df, with columns NO, SPECIES, SEX, BW:
set.seed(100)
df = data.frame(NO=1:100,
SPECIES=sample(LETTERS[1:4],100,replace=TRUE),
SEX=sample(c("M","F"),100,replace=TRUE),
BW = rnorm(100,80,2)
)
And we make Species D to have an effect:
df$BW[df$SPECIES=="D" & df$SEX=="M"] = df$BW[df$SPECIES=="D" & df$SEX=="M"] + 5
If we want to do it on one data frame, say Species A, we do
dat = subset(df,SPECIES=="A")
t.test(BW ~ SEX,data=dat)
And you get the relevant statistics and so forth. To do this systematically for all SPECIES, we can use broom, dplyr:
library(dplyr)
library(broom)
df %>% group_by(SPECIES) %>% do(tidy(t.test(BW ~ SEX,data=.)))
# A tibble: 4 x 11
# Groups: SPECIES [4]
SPECIES estimate estimate1 estimate2 statistic p.value parameter conf.low
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 0.883 80.4 79.6 0.936 3.65e-1 14.2 -1.14
2 B 0.259 80.2 79.9 0.377 7.12e-1 14.1 -1.21
3 C 0.170 80.1 79.9 0.359 7.23e-1 25.3 -0.807
4 D -5.55 79.7 85.2 -7.71 1.29e-7 21.4 -7.05
If you don't want to install any packages, this will give you all the test results:
by(df, df$SPECIES, function(x)t.test(BW ~ SEX,data=x))
And combining them into one data.frame:
func = function(x){
Nu=t.test(BW ~ SEX,data=x);
data.frame(estimate_1=Nu$estimate[1],estimate_2=Nu$estimate[2],p=Nu$p.value)}
do.call(rbind,by(df, df$SPECIES,func))
Here is an example of building multiple data frames from one. The example data set iris is a table of characters (traits) for 3 species.
First, store all the species of your data frame in a vector nspe. I then create a list of the same length.
The for loop goes through each element of this list and fills it with a data frame containing just that species.
At the end of this script, I compute the mean petal width of the setosa species. If this species had a character with two discrete values (such as sex), I could compare the two groups with a t.test as well. I ran a one-sample test here, but it's not really useful...
data("iris")
summary(iris)
nspe <- as.vector(unique(iris$Species))
spe <- list() ; length(spe) = length(nspe) ; names(spe) <- nspe
for(i in nspe){
spe[i][[1]] <- iris[which(iris$Species == i),]
}
mean(spe$setosa$Petal.Width)
# [1] 0.246
t.test(spe$setosa$Petal.Width)
Below is an example to show how you can run a t.test on one species. Note that you will surely have trouble with species names and spaces, so I think it's easier to set ID for species than keeping their full names.
In future questions, consider providing a small example dataset rather than pictures, it's easier to help you.
# NOT RUN
t.test(
spe$Sorex_gracilliums$BW[which(spe$Sorex_gracilliums$SEX == 'm')],
spe$Sorex_gracilliums$BW[which(spe$Sorex_gracilliums$SEX == 'f')]
)
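The splitting loop above can also be written with base R's split(), which returns a named list of data frames, one per species. This is a minimal sketch on simulated data: the column names BW, SEX and SPECIES follow the question, while the values and the names by_species and p_values are made up.

```r
# Simulated stand-in for the real data: 10 males and 10 females per species
set.seed(1)
df <- expand.grid(rep = 1:10, SEX = c("M", "F"), SPECIES = c("A", "B", "C"))
df$BW <- rnorm(nrow(df), mean = 80, sd = 2)

# One data frame per species, stored in a named list
by_species <- split(df, df$SPECIES)

# Welch t-test of body weight by sex within each species; keep the p-values
p_values <- sapply(by_species, function(d) t.test(BW ~ SEX, data = d)$p.value)
p_values
```

A single species is then available as by_species$A or by_species[["A"]], which also works when the species name contains spaces.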

dplyr is not working properly when using group by in it

I am calculating error rates for two different forecasting methods.
My basic approach is to group by nk, calculate the errors, and choose the method with the lower error rate. The issue is that MAP1E_arima_ds and MAPE_cagr_ds come out as the same value for every row; somehow the group_by function has no effect on the calculation.
Here is something I tried
group_by(nk) %>%
mutate(MAP1E_arima_ds=sum(temp2$ABS_arima_error_ds)/nrow(temp2)) %>%
mutate(MAPE_cagr_ds=sum(temp2$ABS_cagr_error_ds)/nrow(temp2))
The expected output looks like this:
nk MAP1E_arima_ds MAPE_cagr_ds
1-G0175 value_x value_y
1-H0182 value_z value_a
so that I can compare error rate and choose forecasting method with less error rate.
If I understand you correctly, I think what you are looking for is this
library(dplyr)
df %>%
group_by(nk) %>%
summarise(MAP1E_arima_ds=sum(ABS_arima_error_ds)/n(),
MAPE_cagr_ds=sum(ABS_cagr_error_ds)/n())
# A tibble: 2 x 3
# nk MAP1E_arima_ds MAPE_cagr_ds
# <chr> <dbl> <dbl>
#1 1-G0175 14.7 3.38
#2 1-H0182 2.91 7.40
which is actually just the mean:
df %>%
group_by(nk) %>%
summarise(MAP1E_arima_ds = mean(ABS_arima_error_ds),
MAPE_cagr_ds = mean(ABS_cagr_error_ds))
Moreover, after copying your dput it seems that your data is already grouped by nk, so the following would also give the same result
df %>%
summarise(MAP1E_arima_ds=mean(ABS_arima_error_ds),
MAPE_cagr_ds=mean(ABS_cagr_error_ds))
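To then pick the method with the lower error per nk, one more step can be added after the summary. The following is only a sketch: the data values are invented, and the column name best is an assumption that does not appear in the question.

```r
library(dplyr)

# Hypothetical stand-in for the asker's data
temp2 <- data.frame(
  nk                 = c("1-G0175", "1-G0175", "1-H0182", "1-H0182"),
  ABS_arima_error_ds = c(20, 10, 2, 4),
  ABS_cagr_error_ds  = c(3, 5, 6, 8)
)

res <- temp2 %>%
  group_by(nk) %>%
  summarise(MAP1E_arima_ds = mean(ABS_arima_error_ds),  # bare column names, no temp2$
            MAPE_cagr_ds   = mean(ABS_cagr_error_ds)) %>%
  mutate(best = if_else(MAP1E_arima_ds < MAPE_cagr_ds, "arima", "cagr"))
res
```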

How can I count and compare data over multiple columns in R?

I am working with a dataframe which contains text data which has been categorised and coded. Each numerical value from 1-12 represents a type of word.
I want to count how often each number (1 to 12) occurs across 6 columns (pre1 to pre6) so I know how many types of words have been used. Could anyone please advise on how to do this?
My df is structured as such:
Would something like that work for you?
library(dplyr)
df <- data.frame(pre1 = c(sample(1:12, 10)),
pre2 = c(sample(1:12, 10)),
pre3 = c(sample(1:12, 10)),
pre4 = c(sample(1:12, 10)),
pre5 = c(sample(1:12, 10)),
pre6 = c(sample(1:12, 10)))
count <- count(df, pre1, pre2, pre3, pre4, pre5, pre6)
One solution is this:
library(tidyverse)
mtcars %>%
select(cyl, am, gear) %>% # select the columns of interest
gather(column, number) %>% # reshape
count(column, number) # get counts of numbers for each column
# # A tibble: 8 x 3
# column number n
# <chr> <dbl> <int>
# 1 am 0 19
# 2 am 1 13
# 3 cyl 4 11
# 4 cyl 6 7
# 5 cyl 8 14
# 6 gear 3 15
# 7 gear 4 12
# 8 gear 5 5
In your case, column will take the values pre1, pre2, etc., number will take the values 1 - 12, and n will be the count of a specific number in a specific column.
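In current tidyr, gather() is superseded by pivot_longer(), so the same reshaping could be sketched as follows. The small data frame is invented and uses only three of the pre columns; the base R line at the end gives the overall counts across all columns.

```r
library(dplyr)
library(tidyr)

# Hypothetical coded data with three of the pre columns
df <- data.frame(pre1 = c(1, 2, 3),
                 pre2 = c(1, 1, 12),
                 pre3 = c(2, 2, 2))

# Counts of each number within each column
counts <- df %>%
  pivot_longer(everything(), names_to = "column", values_to = "number") %>%
  count(column, number)
counts

# Counts of each number over all columns together
overall <- table(unlist(df))
overall
```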
It is not entirely clear from the question whether you want frequency tables for all of these columns together or for each column separately. In possible further questions you should also make clear whether those numbers are coded as numerics, as characters or as factors (the result of str(pCat) is a good way to do that). For this particular question, it fortunately does not matter.
The answers I have already given in the comments are
table(unlist(pCat[,4:9]))
and
table(pCat$pre3)
as an extension for the latter, I shall also point to the comment by ANG , which boils down to
lapply(pCat[,4:9], table)
These are straightforward solutions with base R, without any further unnecessary packages. The answers by JonGrub and AntoniosK are based on the tidyverse. There is no obvious need to import dplyr or tidyverse for this problem, but I guess the authors load those packages anyway whenever they use R, so it does not really impose any cost on them. Other great packages to base good answers on are data.table and sqldf. Those are good packages, and many people use one of them to do a lot of things that could be done in base R. The packages promise to be clearer or faster, or to reuse knowledge you might already have. Nothing is wrong with that. However, I take your question as an indication that you are still learning R, and I would advise you to learn R first, before you become distracted by learning special packages and DSLs.
People have been using base R for decades and they will continue to do so. Learning base R will not distract you from R and the knowledge will continue to be worthwhile in decades. If the same can be said for the tidyverse or data.table, time will tell (although sqldf is probably also a solid investment in the future, maybe more so than R).

For loop compilation error in R

I have a data frame in R that lists monthly sales data by department for a store. Each record contains a month/year, a department name, and the total sales in that department for the month. I'm trying to calculate the mean sales by department, adding them to the vector avgs but I seem to be having two problems: the total sales per department is not compiling at all (its evaluating to zero) and avgs is compiling by record instead of by department. Here's what I have:
avgs = c()
for(dept in data$departmentName){
total <- 0
for(record in data){
if(identical(data$departmentName, dept)){
total <- total + data$ownerSales[record]
}
}
avgs <- c(avgs, total/72)
}
Upon looking at avgs on completion of the loop, I find that it's returning a vector of zeroes the length of the data frame rather than a vector of 22 averages (there are 22 departments). I've been tweaking this forever and I'm sure it's a stupid mistake, but I can't figure out what it is. Any help would be appreciated.
Why not use library(dplyr)?
library(dplyr)
data(iris)
iris %>% group_by(Species) %>% # or dept
summarise(total_plength = sum(Petal.Length), # total owner sales
weird_divby72 = total_plength/72) # total/72?
# A tibble: 3 × 3
Species total_plength weird_divby72
<fctr> <dbl> <dbl>
1 setosa 73.1 1.015278
2 versicolor 213.0 2.958333
3 virginica 277.6 3.855556
Your case would probably look like this:
data %>% group_by(departmentName) %>%
  summarise(total_sales = sum(ownerSales),
            monthly_sales = total_sales/72)
I like dplyr for its syntax and pipeability. I think it is a huge improvement over base R for ease of data wrangling. Here is a good cheat sheet to help you get rolling: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
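As for why the original loop returns zeroes: for(dept in data$departmentName) visits every row's value rather than each distinct department, for(record in data) iterates over the columns of the data frame, and identical(data$departmentName, dept) compares the whole column with a single value, so it is FALSE and total never changes. A corrected loop and a base R one-liner might look like this; the data are made up, and the mean is taken over the rows actually present rather than dividing by a fixed 72.

```r
# Hypothetical sales data using the question's column names
data <- data.frame(
  departmentName = c("toys", "toys", "books", "books"),
  ownerSales     = c(100, 200, 50, 150)
)

# Corrected loop: one pass per distinct department
avgs <- numeric(0)
for (dept in unique(data$departmentName)) {
  in_dept    <- data$departmentName == dept   # logical filter for this department
  avgs[dept] <- sum(data$ownerSales[in_dept]) / sum(in_dept)
}
avgs

# Same result without a loop
avgs_tapply <- tapply(data$ownerSales, data$departmentName, mean)
avgs_tapply
```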
