My coauthor asked me to add the sd for the factor variables that have more than two levels, but sd(as.numeric(df$factor)) gives me a single value instead of the sd for each. I imagined purrr::map could handle it, but df %>% select(factor) %>% as.numeric %>% map(~ sd(.)) fails with the error: Error in function_list[[i]](value) : 'list' object cannot be coerced to type 'double', even though df is not a list.
If you want the sd of another column for each level of the factor column, use the factor as a grouping variable:
library(dplyr)
df %>%
group_by(factor) %>%
summarise(SD = sd(anothercolumn, na.rm = TRUE))
Based on the description, if you instead need the sd of the factor variables that have more than two levels (computed on their integer codes):
df %>%
summarise(across(where(~ is.factor(.) && nlevels(.) >2),
~ sd(as.numeric(.))))
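The error in the question arises because select() returns a data frame, and a data frame is a list of columns, so as.numeric() cannot coerce it. A minimal sketch of the across() approach on made-up data follows; toy, f3, f2 and x are hypothetical names, not from the question. Note that as.numeric() on a factor returns the integer level codes, so the resulting sd depends on the order of the levels.

```r
library(dplyr)

# Hypothetical toy data: one 3-level factor, one 2-level factor, one numeric column
toy <- data.frame(
  f3 = factor(c("a", "b", "c", "a", "b", "c")),
  f2 = factor(c("u", "v", "u", "v", "u", "v")),
  x  = c(1, 2, 3, 4, 5, 6)
)

# Keep only factors with more than two levels,
# then take the sd of their integer level codes
res <- toy %>%
  summarise(across(where(~ is.factor(.x) && nlevels(.x) > 2),
                   ~ sd(as.numeric(.x))))

res
```

Only f3 survives the selection here; f2 has two levels and x is not a factor.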
I have a dataset that contains a logical variable ('verdad') and a group variable ('group') that splits the data into several groups. I would like to summarize the data and calculate the mean of the logical variable, to test the hypothesis that the occurrence of TRUE and FALSE values in the 'verdad' column differs across the groups. The code is as simple as this:
domy_nad_1000 %>%
filter(usable_area > 1000) %>%
group_by(group) %>%
mean(verdad, na.rm = TRUE)
The datatype of 'verdad' is logical but it is showing this error:
In mean.default(., verdad, na.rm = TRUE) :
argument is not numeric or logical: returning NA
Is there a way to fix it?
You simply need to wrap your mean in a summarize function.
domy_nad_1000 %>%
filter(usable_area > 1000) %>%
group_by(group) %>%
summarize(verdad_mean = mean(verdad, na.rm = TRUE))
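The original call failed because the pipe hands the whole grouped data frame to mean() as its first argument, so mean.default() receives a data frame rather than the column. Inside summarize(), mean() is evaluated once per group on the column itself. A small sketch with made-up data (houses is a hypothetical stand-in for domy_nad_1000):

```r
library(dplyr)

# Hypothetical toy data standing in for domy_nad_1000
houses <- data.frame(
  group  = c("a", "a", "b", "b", "b"),
  verdad = c(TRUE, FALSE, TRUE, TRUE, NA)
)

# summarise() evaluates mean() once per group, on the column itself;
# TRUE/FALSE are treated as 1/0, so the mean is the share of TRUE values
props <- houses %>%
  group_by(group) %>%
  summarise(verdad_mean = mean(verdad, na.rm = TRUE))

props
```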
I have a dataframe ("df") with a number of columns that I would like to estimate the weighted means of, weighting by population (df$Population), and grouping by commuting zone (df$cz).
This is the list of columns I would like to estimate the weighted means of:
vlist = c("Public_Welf_Total_Exp", "Welf_Cash_Total_Exp", "Welf_Cash_Cash_Assist", "Welf_Ins_Total_Exp","Total_Educ_Direct_Exp", "Higher_Ed_Total_Exp", "Welf_NEC_Cap_Outlay","Welf_NEC_Direct_Expend", "Welf_NEC_Total_Expend", "Total_Educ_Assist___Sub", "Health_Total_Expend", "Total_Hospital_Total_Exp", "Welf_Vend_Pmts_Medical","Hosp_Other_Total_Exp","Unemp_Comp_Total_Exp", "Unemp_Comp_Cash___Sec", "Total_Unemp_Rev", "Hous___Com_Total_Exp", "Hous___Com_Construct")
This is the code I have been using:
df = df %>% group_by(cz) %>% mutate_at(vlist, weighted.mean(., df$Population))
I have also tried:
df = df %>% group_by(cz) %>% mutate_at(vlist, function(x) weighted.mean(x, df$Population))
As well as tested the following code on only 2 columns:
df = df %>% group_by(cz) %>% mutate_at(vars(Public_Welf_Total_Exp, Welf_Cash_Total_Exp), weighted.mean(., df$Population))
However, everything I have tried gives me the following error, even though there are no NAs in any of my variables:
Error in weighted.mean.default(., df$Population) :
'x' and 'w' must have the same length
I understand that I could do the following estimation using lapply, but I don't know how to group by another variable using lapply. I would appreciate any suggestions!
There is a lot to unpack here...
Probably you mean summarise instead of mutate, because with mutate you would just replicate your result for each row.
mutate_at and summarise_at are superseded and you should use across instead.
The reason your code wasn't working is that you did not write your function as a formula (you did not add ~ at the beginning), and you were using df$Population instead of Population. When you write Population, summarise knows you mean the column Population, which at that point is grouped like the rest of the data frame. When you use df$Population, you are calling the column of the original data frame without grouping. Not only is that wrong, but you also get an error, because the length of the variable you are trying to average and the length of the weights provided by df$Population do not match.
Here is how you could do it:
library(dplyr)
df %>%
group_by(cz) %>%
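summarise(across(all_of(vlist), ~ weighted.mean(.x, Population)), # all_of() for a character vector of names; lambda instead of passing extra arguments through across()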
.groups = "drop")
If you really need to use summarise_at (and probably you are using an old version of dplyr [lower than 1.0.0]), then you could do:
df %>%
group_by(cz) %>%
summarise_at(vlist, ~weighted.mean(., Population)) %>%
ungroup()
I tested this with df and vlist defined as follows:
vlist <- c("Public_Welf_Total_Exp", "Welf_Cash_Total_Exp", "Welf_Cash_Cash_Assist", "Welf_Ins_Total_Exp","Total_Educ_Direct_Exp", "Higher_Ed_Total_Exp", "Welf_NEC_Cap_Outlay","Welf_NEC_Direct_Expend", "Welf_NEC_Total_Expend", "Total_Educ_Assist___Sub", "Health_Total_Expend", "Total_Hospital_Total_Exp", "Welf_Vend_Pmts_Medical","Hosp_Other_Total_Exp","Unemp_Comp_Total_Exp", "Unemp_Comp_Cash___Sec", "Total_Unemp_Rev", "Hous___Com_Total_Exp", "Hous___Com_Construct")
df <- as.data.frame(matrix(rnorm(length(vlist) * 100), ncol = length(vlist)))
names(df) <- vlist
df$cz <- rep(letters[1:10], each = 10)
df$Population <- runif(100)
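The question also asks how the grouping could be done with an apply-style function. As a hedged sketch on a tiny made-up data frame (toy and spend are hypothetical names), the grouped dplyr result can be checked against base R's split() plus sapply(), and the length mismatch caused by toy$Population can be reproduced:

```r
library(dplyr)

# Hypothetical toy data: two commuting zones with three rows each
toy <- data.frame(
  cz         = rep(c("a", "b"), each = 3),
  Population = c(1, 2, 3, 4, 5, 6),
  spend      = c(10, 20, 30, 40, 50, 60)
)

# Grouped weighted mean: Population refers to the current group's rows
by_dplyr <- toy %>%
  group_by(cz) %>%
  summarise(spend = weighted.mean(spend, Population), .groups = "drop")

# Base R equivalent: split by group, then apply per piece
by_base <- sapply(split(toy, toy$cz),
                  function(d) weighted.mean(d$spend, d$Population))

# toy$Population has 6 values but each group has 3 rows, so this errors;
# the message is captured instead of stopping the script
err <- tryCatch(
  toy %>%
    group_by(cz) %>%
    summarise(spend = weighted.mean(spend, toy$Population)),
  error = function(e) conditionMessage(e)
)

by_dplyr
by_base
```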
I need to check my data for outliers and I have 67 different variables, so I don't want to do it by hand. This is my code for checking a single variable by hand (I have three factors to group by: voiceID, gender and VP), but I don't know how to change it into a loop that iterates over columns.
features %>%
group_by(voiceID, gender, VP) %>%
identify_outliers(meanF0)
The values are all numbers. The output should tell me which rows for what factors are outliers.
Thanks for help
The output of identify_outliers is a tibble with multiple columns, and it takes a single variable at a time. The variable name can be either quoted or unquoted. So we can group_split the data by the grouping variables, then loop over the columns of interest and apply identify_outliers to each (shown here with a data frame demo.data and the columns score and score2 as stand-ins):
library(dplyr)
library(purrr)
library(rstatix)
nm1 <- c("score", "score2")
demo.data %>%
group_split(gender) %>%
map(~ map(nm1, function(x) .x %>%
identify_outliers(x)))
If we only want to count the outliers in each column per group:
features %>%
group_by(voiceID, gender, VP) %>%
summarise(across(everything(), ~ length(boxplot(., plot = FALSE)$out)))
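If the goal is to see which rows are outliers rather than how many there are, one possible sketch is to add a logical flag column per variable within each group. This uses base R's boxplot.stats() (the same 1.5 x IQR rule as boxplot()), so the flags may not match identify_outliers exactly; toy, g, f0 and f1 are hypothetical names.

```r
library(dplyr)

# Hypothetical toy data: 'g' stands in for the grouping factors,
# 'f0' and 'f1' for two of the numeric feature columns
toy <- data.frame(
  g  = rep(c("a", "b"), each = 6),
  f0 = c(1, 2, 3, 4, 5, 100,  1, 2, 3, 4, 5, 6),
  f1 = c(1, 2, 3, 4, 5, 6,  -50, 2, 3, 4, 5, 6)
)

# For every numeric column, add '<column>_outlier' marking values that
# boxplot.stats() reports as outliers within the row's group
flagged <- toy %>%
  group_by(g) %>%
  mutate(across(where(is.numeric),
                ~ .x %in% boxplot.stats(.x)$out,
                .names = "{.col}_outlier")) %>%
  ungroup()

# Rows flagged for at least one variable
filter(flagged, f0_outlier | f1_outlier)
```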
I tried to do a t-test comparing values between time1/2/3.. and threshold.
here is my data frame:
time.df1<-data.frame("condition" =c("A","B","C","A","C","B"),
"time1" = c(1,3,2,6,2,3) ,
"time2" = c(1,1,2,8,2,9) ,
"time3" = c(-2,12,4,1,0,6),
"time4" = c(-8,3,2,1,9,6),
"threshold" = c(-2,3,8,1,9,-3))
and I tried to compare each two values by:
time.df1%>%
select_if(is.numeric) %>%
purrr::map_df(~ broom::tidy(t.test(. ~ threshold)))
However, I got this error message
Error in eval(predvars, data, env) : object 'threshold' not found
So, I tried another way (maybe it is wrong)
time.df2<-time.df1%>%gather(TF,value,time1:time4)
time.df2%>% group_by(condition) %>% do(tidy(t.test(value~TF, data=.)))
Sadly, I got this error, even when I limited the condition to only two levels (A, B):
Error in t.test.formula(value ~ TF, data = .) : grouping factor must have exactly 2 levels
I wish to loop a t-test over each time column against the threshold column, per condition, and then use broom::tidy to get the results in a tidy format. My approaches apparently aren't working; any advice on improving my code is much appreciated.
An alternative route would be to define a function with the required options for t.test() up front, then create data frames for each pair of variables (i.e. each combination of 'time*' and 'threshold') and nesting them into list columns and use map() combined with relevant functions from 'broom' to simplify the outputs.
library(tidyverse)
library(broom)
ttestfn <- function(data, ...){
# amend this function to include required options for t.test
res = t.test(data$resp, data$threshold)
return(res)
}
df2 <-
time.df1 %>%
gather(time, "resp", - threshold, -condition) %>%
group_by(time) %>%
nest() %>%
mutate(ttests = map(data, ttestfn),
glances = map(ttests, glance))
# df2 has data frames, t-test objects and glance summaries
# as separate list columns
Now it's easy to query this object to extract what you want
df2 %>%
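select(time, glances) %>% # the .drop argument of unnest() is deprecated, so drop the other list columns explicitly
unnest(glances)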
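A more compact variant of the same idea, offered as a sketch rather than a replacement: reshape with pivot_longer() (the successor to gather()) and let summarise() splice in the one-row data frame returned by broom::tidy(). It assumes dplyr 1.0.0 or later and an unpaired Welch test, which is the t.test() default; whether that matches the study design is not something the question settles.

```r
library(dplyr)
library(tidyr)
library(broom)

# Data frame from the question
time.df1 <- data.frame(
  condition = c("A", "B", "C", "A", "C", "B"),
  time1     = c(1, 3, 2, 6, 2, 3),
  time2     = c(1, 1, 2, 8, 2, 9),
  time3     = c(-2, 12, 4, 1, 0, 6),
  time4     = c(-8, 3, 2, 1, 9, 6),
  threshold = c(-2, 3, 8, 1, 9, -3)
)

# One row per (original row, time column); threshold is repeated alongside
ttests <- time.df1 %>%
  pivot_longer(starts_with("time"), names_to = "time", values_to = "resp") %>%
  group_by(time) %>%
  # tidy() returns a one-row data frame whose columns are spliced in
  summarise(tidy(t.test(resp, threshold)), .groups = "drop")

ttests
```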
However, it's unclear to me what you want to do with 'condition', so I'm wondering if it is more straightforward to reframe the question in terms of a GLM (as camille suggested in the comments: ANOVA is part of the GLM family).
Reshape the data, define 'threshold' as the reference level of the 'time' factor and the default 'treatment' contrasts used by R will compare each time to 'threshold':
time.df2 <-
time.df1 %>%
gather(key = "time", value = "resp", -condition) %>%
mutate(time = fct_relevel(time, "threshold")) # define 'threshold' as baseline
fit.aov <- aov(resp ~ condition * time, data = time.df2)
summary(fit.aov)
summary.lm(fit.aov) # coefficients and p-values
Of course this assumes that all subjects are independent (i.e. there are no repeated measures). If not, then you'll need to move on to more complicated procedures. Anyway, moving to appropriate GLMs for the study design should help minimise the pitfalls of doing multiple t-tests on the same data set.
Alternatively, we could remove threshold from the selected columns and then pass it to t.test as the second sample. The formula interface (. ~ threshold) cannot be used here, because it expects a grouping factor with exactly two levels, which threshold is not:
library(tidyverse)
time.df1 %>%
select_if(is.numeric) %>%
select(-threshold) %>%
# two-sample (Welch) t-test of each time column against threshold
map_df(~ broom::tidy(t.test(.x, time.df1$threshold)), .id = "time")
I would like, when summarizing after grouping, to count the number of a specific level of another factor.
In the working example below, I would like to count the number of "male" levels in each group. I've tried many things with count, tally and so on but cannot find a straightforward and neat way to do it.
df <- data.frame(Group=replicate(20, sample(c("A","B"), 1)),
Value=rnorm(20),
Factor=replicate(20, sample(c("male","female"), 1)))
df %>%
group_by(Group) %>%
summarize(Value = mean(Value),
n_male = ???)
Thanks for your help!
We can use sum on a logical vector i.e. Factor == "male". The TRUE/FALSE will be coerced to 1/0 to get the frequency of 'male' elements when we do the sum
df %>%
group_by(Group) %>%
summarise(Value = mean(Value),
n_male = sum(Factor=="male"))
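The same trick works with mean() to get a proportion instead of a count. A small deterministic sketch (toy is a hypothetical data frame with fixed values, so the result can be checked by hand):

```r
library(dplyr)

# Hypothetical toy data with fixed values instead of random samples
toy <- data.frame(
  Group  = c("A", "A", "A", "B", "B"),
  Value  = c(1, 2, 3, 4, 6),
  Factor = c("male", "female", "male", "female", "female")
)

counts <- toy %>%
  group_by(Group) %>%
  summarise(Value     = mean(Value),
            n_male    = sum(Factor == "male"),   # TRUE counts as 1
            prop_male = mean(Factor == "male"))  # share of TRUE values

counts
```

If Factor can contain NA, both sum() and mean() would return NA for that group unless na.rm = TRUE is added.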