I need to turn a (working) bit of dplyr/broom code into a function, since I'll call it several (dozen) times.
I am stuck, and this likely has to do with non-standard evaluation (NSE) being mixed with standard evaluation (SE).
Here I take code directly from the vignette 'broom and dplyr':
library(tidyverse)
library(broom)
data(Orange)
Orange %>%
nest(-Tree) %>%
mutate(
test = map(data, ~ cor.test(.x$age, .x$circumference)),
tidied = map(test, tidy)
) %>%
unnest(tidied, .drop = TRUE)
This works:
Tree estimate statistic p.value parameter conf.low conf.high method alternative
1 1 0.9854675 12.97258 4.851902e-05 5 0.9012111 0.9979400 Pearson's product-moment correlation two.sided
2 2 0.9873624 13.93129 3.425041e-05 5 0.9136142 0.9982101 Pearson's product-moment correlation two.sided
3 3 0.9881766 14.41188 2.901046e-05 5 0.9189858 0.9983260 Pearson's product-moment correlation two.sided
4 4 0.9844610 12.53575 5.733090e-05 5 0.8946782 0.9977964 Pearson's product-moment correlation two.sided
5 5 0.9877376 14.14686 3.177093e-05 5 0.9160865 0.9982635 Pearson's product-moment correlation two.sided
Now the point is that I want to make a function out of it.
So if I try this:
afunction <- function(data, var) {
data %>%
nest(-Tree) %>%
mutate(
test = map(data, ~ cor.test(.x$age, .x$var)), # S3 list-col
tidied = map(test, tidy)
) %>%
unnest(tidied, .drop = TRUE)
}
It fails miserably.
Error in cor.test.default(.x$age, .x$var) : 'x' and 'y' must have the same length
I have tried to use NSE, quotation, and quasiquotation. I admit I tried somewhat at random, since I cannot find a proper tutorial on how to make NSE and SE play nicely together with the $ operator.
Any solutions -- especially one that would scale and teach me how to solve these issues once and for all? I am also happy for pointers to relevant books/tutorials.
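One workaround (a sketch added here, not part of the original question: it sidesteps NSE entirely by passing the column name as a string and subsetting with [[ instead of $):
afunction <- function(data, var) {
  data %>%
    nest(-Tree) %>%
    mutate(
      test = map(data, ~ cor.test(.x$age, .x[[var]])),  # [[ accepts a string
      tidied = map(test, tidy)
    ) %>%
    unnest(tidied, .drop = TRUE)
}
afunction(Orange, "circumference")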
There are two methods available to estimate confidence intervals for a gls model in R: the function confint and the function intervals. The results are not the same, and I want to know what causes the differences and which one is preferred for gls (and lme) models.
I will use the cats data set for this example, with four different approaches to estimating the mean difference (MD) of Hwt between sexes:
t-test (heterogeneous variance)
Linear model, using lm (homogeneous variance)
Linear model, using gls (homogeneous variance)
Heteroscedastic linear model, using gls (heterogeneous variance)
For the gls approaches, both confint and intervals are available for calculating confidence intervals.
Here is the code:
library(pacman)
p_load(tidyverse)
p_load(MASS)
p_load(nlme)
set.seed(150)
cats%>%ggplot(aes(x=Sex,y=Hwt))+
geom_boxplot()+theme_bw()
###different approaches for the same mean difference estimation
cats_ttest<-t.test(Hwt~Sex,data=cats)
cats$Sex<-relevel(cats$Sex,ref="M")
cats_lm<-lm(Hwt~Sex,data=cats)
cats_gls_hom<-gls(Hwt~Sex,data=cats)
cats_gls_het<-gls(Hwt~Sex,weights=varIdent(form=~1|Sex),data=cats)
###store estimations and CI's from different approaches
a<-rbind(confint(cats_lm),confint(cats_gls_hom),confint(cats_gls_het),
intervals(cats_gls_hom,which = "coef")$coef[,c(1,3)],
intervals(cats_gls_het,which = "coef")$coef[,c(1,3)]) %>% data.frame%>% {cbind(par=rownames(.),.)}
a$par<-a$par %>% str_remove_all("X.|.1|.2|.3|.4")
a$par<-factor(a$par,levels =c("Intercept.","SexF"),
labels =c("Intercept.","SexF") )
a$est<-c(rep(cats_lm %>% coef,3),
cats_gls_hom %>% coef,cats_gls_het %>% coef
)
a$mod<-c(rep("cats_lm_ci",2),rep("cats_gls_hom_ci",2),rep("cats_gls_het_ci",2),
rep("cats_gls_hom_int",2),rep("cats_gls_het_int",2)
)
colnames(a)[2:3]<-c("LCI","UCI")
a<-rbind(data.frame(par="SexF",LCI=cats_ttest$conf.int[1],
UCI=cats_ttest$conf.int[2],est=cats_ttest$estimate[1]-cats_ttest$estimate[2],
mod="ttest"),a)
a$mod<-factor(a$mod,levels =c("ttest","cats_lm_ci","cats_gls_hom_ci","cats_gls_het_ci","cats_gls_hom_int","cats_gls_het_int"))
a$diff<-a$UCI-a$LCI
rownames(a)<-NULL
###results
a[order(a$par,a$diff),]
#> par LCI UCI est mod diff
#> 4 Intercept. 10.879181 11.766179 11.322680 cats_gls_hom_ci 0.8869980
#> 2 Intercept. 10.875369 11.769992 11.322680 cats_lm_ci 0.8946223
#> 8 Intercept. 10.875369 11.769992 11.322680 cats_gls_hom_int 0.8946223
#> 6 Intercept. 10.816754 11.828606 11.322680 cats_gls_het_ci 1.0118521
#> 10 Intercept. 10.812406 11.832955 11.322680 cats_gls_het_int 1.0205495
#> 7 SexF -2.758218 -1.482888 -2.120553 cats_gls_het_ci 1.2753295
#> 11 SexF -2.763699 -1.477407 -2.120553 cats_gls_het_int 1.2862917
#> 1 SexF -2.763753 -1.477352 -2.120553 ttest 1.2864011
#> 5 SexF -2.896844 -1.344261 -2.120553 cats_gls_hom_ci 1.5525835
#> 3 SexF -2.903517 -1.337588 -2.120553 cats_lm_ci 1.5659288
#> 9 SexF -2.903517 -1.337588 -2.120553 cats_gls_hom_int 1.5659288
a %>% ggplot(aes(x=par,y=est,color=mod,group=mod))+geom_point(position=position_dodge(0.5))+
geom_errorbar(aes(ymin=LCI, ymax=UCI), width=.2,
position=position_dodge(0.5))+theme_bw()
Created on 2022-09-11 by the reprex package (v2.0.1)
As you can see, there are mild differences in CI widths across the different methods, and as expected, the methods that account for differences in variances produce the narrowest CIs for the mean difference (parameter SexF in data frame a).
So, why are two methods available to estimate confidence intervals for gls models, what are the differences between them, and which one is preferred for this kind of model?
tl;dr Use intervals(): it gives you CIs based on a Student-t rather than a Normal sampling distribution.
If you look at methods(class = "gls") you'll see that confint() is not listed. That means that when you call confint(gls_fit), R falls back to the default confint method. If we look at the code for stats::confint.default you'll see fac <- qnorm(a); ...; ci[] <- cf[parm] + ses %o% fac. In other words, confint.default is constructing CIs based on a Normal distribution.
In contrast, nlme:::intervals.gls uses
len <- -qt((1 - level)/2, dims$N - dims$p) * sqrt(diag(object$varBeta))
— i.e., an interval based on a t-distribution.
It makes very little difference in this case (CI interval width of 1.55 vs 1.56).
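To see the difference concretely, you can reproduce both intervals by hand from the fitted object (a sketch using the cats_gls_het model from above; varBeta is the estimated covariance matrix of the coefficients, and this model has 144 observations and 2 coefficients):
beta <- coef(cats_gls_het)["SexF"]
se   <- sqrt(diag(cats_gls_het$varBeta))["SexF"]
beta + c(-1, 1) * qnorm(0.975) * se                # Normal: matches confint()
beta + c(-1, 1) * qt(0.975, nrow(cats) - 2) * se   # Student-t: matches intervals()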
For what it's worth, you can streamline this kind of comparison a little bit using broom/broom.mixed (although this does not include the confint.default option for gls!)
library(broom)
library(broom.mixed)
library(purrr)   # for map_dfr (also attached with tidyverse above)
options(pillar.sigfig = 7)
(tibble::lst(cats_ttest, cats_lm, cats_gls_hom, cats_gls_het)
|> map_dfr(tidy, .id = "model", conf.int = TRUE)
## t-test doesn't have a "term" element
|> mutate(across(term, ~ ifelse(is.na(.x), "SexF", .x)))
|> select(model, term, estimate, lwr = conf.low, upr = conf.high)
|> mutate(width = upr - lwr)
|> arrange(term)
)
As a general rule, you should use the most specific method available. This usually happens automatically; it's sort of an accident that confint() works for gls objects at all (partly because the nlme package predates R itself, so it doesn't follow all of R's conventions ...)
I am a newbie in R.
Consider this situation (my actual situation is much more complex):
set.seed(100)
df = data.frame(SEX=sample(c("M","F"),100,replace=TRUE),BW = rnorm(100,80,2))
One column is SEX (male and female), the other is BW (body weight).
I want to test the normality of male body weight and female body weight separately. Then I can test the equality of variances, and finally run a t-test or whichever test fits the situation.
But shapiro.test can't be used directly here (something like shapiro.test(BW ~ SEX, data = df) doesn't work).
What should I do? I don't want to separate the data frame or make new subsets.
Thanks in advance~!
A "tidyverse" solution to this problem is described in detail here: Running a model on separate groups.
Briefly, using your data:
library(dplyr) # for mutate
library(tidyr) # for nest/unnest
library(purrr) # for map
library(broom) # for glance
df %>%
nest(data = c(BW)) %>%
mutate(model = map(data, ~ shapiro.test(.x$BW)),
g = map(model, glance)) %>%
unnest(g)
Result:
# A tibble: 2 x 6
SEX data model statistic p.value method
<fct> <list<df[,1]>> <list> <dbl> <dbl> <chr>
1 F [50 x 1] <htest> 0.982 0.639 Shapiro-Wilk normality test
2 M [50 x 1] <htest> 0.980 0.535 Shapiro-Wilk normality test
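The later steps the question mentions don't need this machinery (a short sketch; unlike shapiro.test, both var.test and t.test accept a formula interface directly):
var.test(BW ~ SEX, data = df)                  # F test for equality of variances
t.test(BW ~ SEX, data = df, var.equal = TRUE)  # pooled t-test if variances look equal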
Oh, I just figured it out by myself...
using this code
with(df, shapiro.test(BW[SEX == "M"]))
with(df, shapiro.test(BW[SEX == "F"]))
I am glad I can learn more!
I would like to write a function for a neat output table from a t-test, as I'm conducting numerous post-hoc t-tests. However, writing functions is not my strong suit, so I'd like some help. My current code looks like this:
library(tidyverse)
library(lsr)
library(broom)
t_table <- function(data$col, data$col) {
t.test(data$col, data$col) %>%
broom::tidy() %>%
mutate(Cohens_d = cohensD(data$col, data$col)) %>% # calc. cohen's d
mutate_at(vars(- c(p.value,method,alternative)), round, 2)
}
One of the errors is:
Error in data$col : object of type 'closure' is not subsettable.
I'm assuming data and col don't work as general placeholders for any data frame and column.
Essentially I'd like to be able to specify any data frame and column for each variable. I'm not even sure this is possible, as it is a very general function I'm trying to create, but any help would be much appreciated.
The input arguments of your function should (a) not share the same name and (b) not contain $. Other than that, your function works fine:
t_table <- function(col1, col2) {
t.test(col1, col2) %>%
broom::tidy() %>%
mutate(Cohens_d = cohensD(col1, col2)) %>% # calc. cohen's d
mutate_at(vars(- c(p.value,method,alternative)), round, 2)
}
set.seed(1)
t_table(rnorm(100), rnorm(100)+1/2)
estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high method alternative Cohens_d
1 -0.35 0.11 0.46 -2.69 0.007745151 197.19 -0.61 -0.09 Welch Two Sample t-test two.sided 0.38
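If you also want to call it with a data frame plus bare column names, as the question suggests, one option (a sketch; this wrapper and its use of dplyr's embracing operator {{ }} are my addition, not part of the original answer) is:
t_table_df <- function(data, col1, col2) {
  t_table(dplyr::pull(data, {{ col1 }}),   # pull() extracts a column as a vector
          dplyr::pull(data, {{ col2 }}))
}
# hypothetical usage, given a data frame dat with numeric columns x and y:
# t_table_df(dat, x, y)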
Chapter 1: mean age by gender
I work a lot with epidemiologists and statisticians who have very specific requirements for their statistical output, and I frequently fail to reproduce exactly the same thing in R (our epidemiologist works in Stata).
Let's start with an easy example, a Student's t-test. What we are interested in is the difference in mean age at first diagnosis and a confidence interval.
a) create some sample data in R
set.seed(41)
cohort <- data.frame(
id = seq(1,100),
gender = sample(c(rep(1,33), rep(2,67)),100),
age = sample(seq(0,50),100, replace=TRUE)
)
# save to import into Stata
# write.csv(cohort, "cohort.csv", row.names = FALSE)
b) import data and run t-test in Stata
import delimited "cohort.csv"
ttest age, by(gender)
What we want is the absolute difference in the means = 3.67 years and the combined confidence interval = 95% CI: 24.59 to 30.57.
c) run t-test in R
t.test(age~gender, data=cohort)
t.test(cohort$age[cohort$gender == 1])
t.test(cohort$age[cohort$gender == 2])
t.test(cohort$age)
There surely must be another way instead of running 4 t-tests in R!
You can try to put everything into one function with some tidyverse magic. The output can of course be edited to your needs. broom's tidy() is used for nice output.
foo <- function(df, x, y){
require(tidyverse)
require(broom)
a1 <- df %>%
select(ep=!!x, gr=!!y) %>%
mutate(gr=as.character(gr)) %>%
bind_rows(mutate(., gr="ALL")) %>%
split(.$gr) %>%
map(~tidy(t.test(.$ep))) %>%
bind_rows(.,.id = "gr") %>%
mutate_if(is.factor, as.character)
tidy(t.test(as.formula(paste(x," ~ ",y)), data=df)) %>%
mutate_if(is.factor, as.character) %>%
mutate(gr="vs") %>%
select(gr, estimate, statistic, p.value,parameter, conf.low, conf.high, method, alternative) %>%
bind_rows(a1, .)}
foo(cohort, "age", "gender")
gr estimate statistic p.value parameter conf.low conf.high method alternative
1 1 25.121212 9.545737 6.982763e-11 32.00000 19.76068 30.481745 One Sample t-test two.sided
2 2 28.791045 15.699854 5.700541e-24 66.00000 25.12966 32.452428 One Sample t-test two.sided
3 ALL 27.580000 18.301678 1.543834e-33 99.00000 24.58985 30.570147 One Sample t-test two.sided
4 vs -3.669833 -1.144108 2.568817e-01 63.37702 -10.07895 2.739284 Welch Two Sample t-test two.sided
I recommend starting from the beginning with this:
foo <- function(df){
a1 <- broom::tidy(t.test(age~gender, data=df))
a2 <- broom::tidy(t.test(df$age))
a3 <- broom::tidy(t.test(df$age[df$gender == 1]))
a4 <- broom::tidy(t.test(df$age[df$gender == 2]))
list(rbind(a2, a3, a4), a1)
}
foo(cohort)
[[1]]
estimate statistic p.value parameter conf.low conf.high method alternative
1 27.58000 18.301678 1.543834e-33 99 24.58985 30.57015 One Sample t-test two.sided
2 25.12121 9.545737 6.982763e-11 32 19.76068 30.48174 One Sample t-test two.sided
3 28.79104 15.699854 5.700541e-24 66 25.12966 32.45243 One Sample t-test two.sided
[[2]]
estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high method alternative
1 -3.669833 25.12121 28.79104 -1.144108 0.2568817 63.37702 -10.07895 2.739284 Welch Two Sample t-test two.sided
You can make your own function:
tlimits <- function(data, group){
  # half-width of the 95% CI around the overall mean
  error <- qt(0.975, df = length(data)-1)*sd(data)/(sqrt(length(data)))
  mean <- mean(data)
  # group means, for the absolute mean difference
  means <- tapply(data, group, mean)
  c(abs(means[1] - means[2]), mean - error, mean + error)
}
tlimits(cohort$age, cohort$gender)
1
3.669833 24.589853 30.570147
What we want is the absolute difference in the means = 3.67 years and the combined confidence interval = 95% CI: 24.59 to 30.57.
Notice that R's t.test does a t-test, whereas you want a mean difference and "combined confidence intervals" (which is CI around the mean ignoring the grouping variable). So you don't want a t-test but something else.
You can get the mean difference using, e.g.:
diff(with(cohort, tapply(age, gender, mean)))
# 3.669833
# no point in using something more complicated e.g., t-test or lm
... and the CI using, e.g.:
confint(lm(age~1, data=cohort))
# 2.5 % 97.5 %
# (Intercept) 24.58985 30.57015
And obviously, you can easily combine the two steps into one function if you need it often.
doit <- function(a,b) c(diff= diff(tapply(a,b,mean)), CI=confint(lm(a~1)))
with(cohort, doit(age,gender))
# diff.2 CI1 CI2
# 3.669833 24.589853 30.570147
I have a table with 3,000+ rows and 10+ variables. I am trying to run a linear regression using one variable as the predictor and another as the response, for each of 300 different groups. I need the slope, p-value, and r-squared for each of these regressions. Doing each regression individually and recording the summary values would take hours if not days.
I have used the following package to get the intercept and slope for each group, but I do not know how to also get the corresponding p-value and r-squared for each group:
library(lme4)
groupreg<-lmList(logpop ~ avgp | id, data=data)
groupreg
A sample of the resulting list is below, where "Adams #" is the id value. NAs exist because not all groups have multiple points to plot and compare:
Coefficients:
(Intercept) avgp
Adams 6 4.0073332 NA
Adams 7 6.5177389 -7.342443e+00
Adams 8 4.7449321 NA
Adams 9 NA NA
This table does not include any significance statistics, however; I still need the p-value and r-squared statistic. Code to do it all in one go for all groups, or to just pull the remaining values, would be helpful.
Also, is there a way to exponentiate the slope output for all groups? My outcome was log-transformed.
Thank you all!!
I think the easiest answer is still missing. You can use a combination of nesting and mapping. I'll show you how it works for linear regression; I think you'll be able to apply the same principle to models from the lme4 package.
Let's create a toy data set where we've measured the IQ score for three different groups at two different points in time.
library(tidyverse)
library(broom)
df <- tibble(
id = seq_len(90),
IQ = rnorm(90, 100, 15),
group = rep(c("A", "B", "C"), each = 30),
time = rep(c("T1", "T2"), 45)
)
If we want to build a regression model for each group, investigating the relation between the IQ score and the point in time, we only need five lines of code.
df %>%
nest(-group) %>%
mutate(fit = map(data, ~ lm(IQ ~ time, data = .)),
results = map(fit, glance)) %>%
unnest(results) %>%
select(group, r.squared, p.value)
Which will return
# A tibble: 3 x 3
group r.squared p.value
<chr> <dbl> <dbl>
1 A 0.0141 0.532
2 B 0.0681 0.164
3 C 0.00432 0.730
Here nest(-group) creates tibbles within your tibble for each group, containing the corresponding variables id, IQ, and time. Then with mutate() you add a new column fit, applying a regression model to each group, and a new column holding the results, which we unnest() shortly after to properly access the values glance() returned. In the last step we select() the three values of interest.
To get the slope you need to call tidy() in addition. Maybe it's possible to shorten the code somehow, but one solution would be:
df %>%
nest(-group) %>%
mutate(fit = map(data, ~ lm(IQ ~ time, data = .)),
results1 = map(fit, glance),
results2 = map(fit, tidy)) %>%
unnest(results1) %>%
unnest(results2) %>%
select(group, term, estimate, r.squared, p.value) %>%
mutate(estimate = exp(estimate))
To exponentiate the slope, you can just add another mutate() statement. Finally it returns
# A tibble: 6 x 5
group term estimate r.squared p.value
<chr> <chr> <dbl> <dbl> <dbl>
1 A (Intercept) 3.34e+46 0.0141 0.532
2 A timeT2 3.31e- 2 0.0141 0.532
3 B (Intercept) 1.17e+47 0.0681 0.164
4 B timeT2 1.34e- 3 0.0681 0.164
5 C (Intercept) 8.68e+43 0.00432 0.730
6 C timeT2 1.25e- 1 0.00432 0.730
Note that the estimates are exponentiated already. Without the exponentiation you can double check the slope and p value with base R calling
summary(lm(IQ ~ time, data = filter(df, group == "A")))
If you work with more complex models (lme4), there is a package called lmerTest which offers wrapper functions for lme4 that return p-values (at least for mixed models, which I have worked with).
A word of warning about using glance() for lme4 models: the maintainers of the broom package are trying a new concept where they outsource the summary statistics to the particular package developers responsible for each model.
If I am understanding your question correctly, you want to run multiple regressions over lots of groups. Here is an example of how to do so with the mtcars data.
library(dplyr)
mtcars %>% group_by(cyl) %>%
summarise_at(vars(disp:wt), funs(
r.sqr = summary(lm(mpg~.))$r.squared,
intercept = summary(lm(mpg~.))$coefficients[[1]],
slope = summary(lm(mpg~.))$coefficients[[2]],
p.value = summary(lm(mpg~.))$coefficients[[8]]
))
This will run a regression per group per variable and extract the info you asked for. If your formula is always the same, you could simplify as follows.
mtcars %>% group_by(cyl) %>%
summarise(
r.sqr = summary(lm(mpg~wt))$r.squared,
intercept = summary(lm(mpg~wt))$coefficients[[1]],
slope = summary(lm(mpg~wt))$coefficients[[2]],
p.value = summary(lm(mpg~wt))$coefficients[[8]]
)
This actually runs the regression 4 times (once per value of interest). If that takes too long for your real data, you could try this:
df <- mtcars %>% group_by(cyl) %>% summarise(model = list(summary(lm(mpg~wt))))
which simply runs the model once per group and then extracts the info you want. The problem is that extracting values this way can be a pain:
df$model[[1]]$coefficients[[1]]
[1] 39.5712
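A slightly friendlier way to dig numbers out of that list-column (a sketch building on the df above; summary.lm objects are lists, so purrr's map functions can index into them by name or with a lambda):
library(purrr)
df %>% mutate(r.sqr   = map_dbl(model, "r.squared"),
              slope   = map_dbl(model, ~ .x$coefficients[2, 1]),
              p.value = map_dbl(model, ~ .x$coefficients[2, 4]))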
While the code given by AndS will work, it runs the lm function 4 times for each group, which makes it a bit inefficient. You can use the following instead; I'll break it into simpler steps.
Assuming your data frame (df) has three variables, "Group", "Dep", and "Indep":
#Getting the unique list of groups
groups <- unique(df$Group)
#Creating a model summary list to combine the model summary of each model
model_summaries = list()
#Running the models
for(i in 1:length(groups)){
  model <- lm(Dep ~ Indep, df[df$Group == groups[i], c("Dep", "Indep")])
  model_summaries[[i]] <- summary(model)   # [[ ]] stores each summary as a list element
}
Each model summary contains the elements you need: r.squared, and coefficients (which includes the p-values and the intercept too).
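To collect those into one table (a sketch building on the loop above, assuming every group has enough points to estimate a slope; the coefficients element of each summary is a matrix whose second row holds the slope estimate and its p-value):
results <- data.frame(
  group   = groups,
  r.sqr   = sapply(model_summaries, function(s) s$r.squared),
  slope   = sapply(model_summaries, function(s) s$coefficients[2, 1]),
  p.value = sapply(model_summaries, function(s) s$coefficients[2, 4])
)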
Let me know if this helps.