Creating function with output table from t-test in R - r

I would like to write a function for a neat output table from a t-test as I'm conducting numerous post-hoc t-tests however, writing functions are not my strong suit so I'd like some help. My current code looks like this
library(tidyverse)
library(lsr)
library(broom)
t_table <- function(data$col, data$col) {
t.test(data$col, data$col) %>%
broom::tidy() %>%
mutate(Cohens_d = cohensD(data$col, data$col)) %>% # calc. cohen's d
mutate_at(vars(- c(p.value,method,alternative)), round, 2)
}
One of the errors is:
Error in data$col : object of type 'closure' is not subsettable.
I'm assuming data and col are not general names for any data frame and column.
Essentially I'd like to be able to specify any data frame and column for each variable. I'm not even sure if this is possible as it is a very general function I'm trying to create but any help would be much appreciated.

the input arguments for your function should a) not have the same name and b) should not contain $. Other than that your function works fine:
t_table <- function(col1, col2) {
t.test(col1, col2) %>%
broom::tidy() %>%
mutate(Cohens_d = cohensD(col1, col2)) %>% # calc. cohen's d
mutate_at(vars(- c(p.value,method,alternative)), round, 2)
}
set.seed(1)
t_table(rnorm(100), rnorm(100)+1/2)
estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high method alternative Cohens_d
1 -0.35 0.11 0.46 -2.69 0.007745151 197.19 -0.61 -0.09 Welch Two Sample t-test two.sided 0.38

Related

Functional programming: use broom nest->tidy->unnest and map within a function

I need to turn a (working) bit of dplyr/broom code into a function, since I'll call it several (dozen) times.
I am stuck -- and this has likely to do with Non Standard Evaluation being mixed with Standard Evaluation.
Here I take code directly from the vignette 'broom and dplyr'
library(tidyverse)
library(broom)
data(Orange)
Orange %>%
nest(-Tree) %>%
mutate(
test = map(data, ~ cor.test(.x$age, .x$circumference)),
tidied = map(test, tidy)
) %>%
unnest(tidied, .drop = TRUE)
This works:
Tree estimate statistic p.value parameter conf.low conf.high method alternative
1 1 0.9854675 12.97258 4.851902e-05 5 0.9012111 0.9979400 Pearson's product-moment correlation two.sided
2 2 0.9873624 13.93129 3.425041e-05 5 0.9136142 0.9982101 Pearson's product-moment correlation two.sided
3 3 0.9881766 14.41188 2.901046e-05 5 0.9189858 0.9983260 Pearson's product-moment correlation two.sided
4 4 0.9844610 12.53575 5.733090e-05 5 0.8946782 0.9977964 Pearson's product-moment correlation two.sided
5 5 0.9877376 14.14686 3.177093e-05 5 0.9160865 0.9982635 Pearson's product-moment correlation two.sided
Now the point is that I want to make a function out of it.
So if I try this:
afunction <- function(data, var) {
data %>%
nest(-Tree) %>%
mutate(
test = map(data, ~ cor.test(.x$age, .x$var)), # S3 list-col
tidied = map(test, tidy)
) %>%
unnest(tidied, .drop = TRUE)
}
It fails miserably.
Error in cor.test.default(.x$age, .x$var) : 'x' and 'y' must have the same length
I have tried to use NSE, quotation, semiquotation. I admit I tried a bit at random since I cannot find a proper tutorial of how to let NSE and SE play nicely together with the $ operator.
Any solutions -- especially one that would scale & teach me how to solve these issues once and for all? I am also happy for pointers at relevant books / tutorials.

Save R-squared from lm summary as a dataframe

I want to save the result of a lm model into a dataframe. I generated an empty dataframe (Startframe), where I want to save the results.
My dataframe containing the data is called testdata in this case. It contains the Date in the first column and then several Stations in the rest of the colums.
So far this code is working to get the Estimate, Std. Error, t value and Pr(>|t|).
for(i in 2:ncol(testdata)) {
x <- testdata[,1]
y <- testdata[,i]
mod <- lm(y ~ x)
summary(mod)
Startframe[i,] <- c(i,
summary(mod)[['coefficients']]['(Intercept)','Estimate'],
summary(mod)[['coefficients']]['x','Estimate'],
summary(mod)[['coefficients']]['x','Std. Error'],
summary(mod)[['coefficients']]['x','t value'],
summary(mod)[['coefficients']]['x','Pr(>|t|)'])
But how can I also extract the r.squared?
I tried to add summary(mod)[['r.squared']] to the list, but it gives me the wrong numbers.
I know str(summary(mod)) gives me an overview, but I cant figure out how to add it into my loop.
Thanks for your help.
Nice way to work with the same model on different datasets is to use the tidyverse approach using broom package.
In this example I'm using the diamonds dataset to test how carat and depth effects the diamonds' price in different diamond cuts.
require(tidyverse)
require(broom)
diamonds %>%
nest(-cut) %>%
mutate(model = purrr::map(data, function(x) {
lm(price ~ carat + depth, data = x)}),
values = purrr::map(model, glance),
r.squared = purrr::map_dbl(values, "r.squared"),
pvalue = purrr::map_dbl(values, "p.value")) %>%
select(-data, -model, -values)
cut r.squared pvalue
<ord> <dbl> <dbl>
1 Ideal 0.867 0
2 Premium 0.856 0
3 Good 0.851 0
4 Very Good 0.859 0
5 Fair 0.746 0

Keeping your statistician happy: Stata vs. R Student's t-test

Chapter 1: mean age by gender
I work a lot with epidemiologists and statisticians that have very specific requirements for their statistical output and I frequently fail to reproduce the exact same thing in R (our epidemiologst works in Stata).
Let's start with an easy example, a Student's t-test. What we are interested in is the difference in mean age at first diagnosis and a confidence interval.
1) create some sample data in R
set.seed(41)
cohort <- data.frame(
id = seq(1,100),
gender = sample(c(rep(1,33), rep(2,67)),100),
age = sample(seq(0,50),100, replace=TRUE)
)
# save to import into Stata
# write.csv(cohort, "cohort.csv", row.names = FALSE)
b) import data and run t-test in Stata
import delimited "cohort.csv"
ttest age, by(gender)
What we want is the absolute difference in the mean= 3.67 years and the combined confidence intervals = 95% CI: 24.59 - 30.57
b) run t-test in R
t.test(age~gender, data=cohort)
t.test(cohort$age[cohort$gender == 1])
t.test(cohort$age[cohort$gender == 2])
t.test(cohort$age)
There surely must be another way instead of running 4 t-tests in R!
You can try to put everything in one function and some tidyverse magic. The output can be edited as your needs are of course. boom's tidy will be used for nice output.
foo <- function(df, x, y){
require(tidyverse)
require(broom)
a1 <- df %>%
select(ep=!!x, gr=!!y) %>%
mutate(gr=as.character(gr)) %>%
bind_rows(mutate(., gr="ALL")) %>%
split(.$gr) %>%
map(~tidy(t.test(.$ep))) %>%
bind_rows(.,.id = "gr") %>%
mutate_if(is.factor, as.character)
tidy(t.test(as.formula(paste(x," ~ ",y)), data=df)) %>%
mutate_if(is.factor, as.character) %>%
mutate(gr="vs") %>%
select(gr, estimate, statistic, p.value,parameter, conf.low, conf.high, method, alternative) %>%
bind_rows(a1, .)}
foo(cohort, "age", "gender")
gr estimate statistic p.value parameter conf.low conf.high method alternative
1 1 25.121212 9.545737 6.982763e-11 32.00000 19.76068 30.481745 One Sample t-test two.sided
2 2 28.791045 15.699854 5.700541e-24 66.00000 25.12966 32.452428 One Sample t-test two.sided
3 ALL 27.580000 18.301678 1.543834e-33 99.00000 24.58985 30.570147 One Sample t-test two.sided
4 vs -3.669833 -1.144108 2.568817e-01 63.37702 -10.07895 2.739284 Welch Two Sample t-test two.sided
I recommend to start from the beginning using this
foo <- function(df){
a1 <- broom::tidy(t.test(age~gender, data=df))
a2 <- broom::tidy(t.test(df$age))
a3 <- broom::tidy(t.test(df$age[df$gender == 1]))
a4 <- broom::tidy(t.test(df$age[df$gender == 2]))
list(rbind(a2, a3, a4), a1)
}
foo(cohort)
[[1]]
estimate statistic p.value parameter conf.low conf.high method alternative
1 27.58000 18.301678 1.543834e-33 99 24.58985 30.57015 One Sample t-test two.sided
2 25.12121 9.545737 6.982763e-11 32 19.76068 30.48174 One Sample t-test two.sided
3 28.79104 15.699854 5.700541e-24 66 25.12966 32.45243 One Sample t-test two.sided
[[2]]
estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high method alternative
1 -3.669833 25.12121 28.79104 -1.144108 0.2568817 63.37702 -10.07895 2.739284 Welch Two Sample t-test two.sided
You can make your own function:
tlimits <- function(data, group){
error <- qt(0.975, df = length(data)-1)*sd(data)/(sqrt(length(data)))
mean <- mean(data)
means <- tapply(data, group, mean)
c(abs(means[1] - means[2]), mean - error, mean + error)
}
tlimits(cohort$age, cohort$gender)
1
3.669833 24.589853 30.570147
What we want is the absolute difference in the mean= 3.67 years and the combined confidence intervals = 95% CI: 24.59 - 30.57
Notice that R's t.test does a t-test, whereas you want a mean difference and "combined confidence intervals" (which is CI around the mean ignoring the grouping variable). So you don't want a t-test but something else.
You can get the mean difference using, e.g.:
diff(with(cohort, tapply(age, gender, mean)))
# 3.669833
# no point in using something more complicated e.g., t-test or lm
... and the CI using, e.g.:
confint(lm(age~1, data=cohort))
# 2.5 % 97.5 %
# (Intercept) 24.58985 30.57015
And obviously, you can easily combine the two steps into one function if you need it often.
doit <- function(a,b) c(diff= diff(tapply(a,b,mean)), CI=confint(lm(a~1)))
with(cohort, doit(age,gender))
# diff.2 CI1 CI2
# 3.669833 24.589853 30.570147

Hundreds of linear regressions that run by group in R [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 4 years ago.
Improve this question
I have a table with 3,000+ rows and 10+ variables. I am trying to run a linear regression using one variables as the predictor and another as the response for 300 different groups. I need the slope, p-value, and r-squared for each of these regressions. To do each regression individually and record the summary variables would take hours if not days.
I have used the following package to get the intercept and slope for each group, but I do not know how to also get the corresponding p-value and r-squared for each group:
library(lme4)
groupreg<-lmList(logpop ~ avgp | id, data=data)
groupreg
I achieved a list sample below, where "Adams #" is the id value. NAs exist because not all groups have multiple points to plot and compare:
Coefficients:
(Intercept) avgp
Adams 6 4.0073332 NA
Adams 7 6.5177389 -7.342443e+00
Adams 8 4.7449321 NA
Adams 9 NA NA
This table does not include any significance statistics, however. I still need the p-value and r-squared statistic. If there is a code to do it all in one go for all group values, or a code to just pull the remaining values, it would be helpful.
Is there are way also to exponentiate the slope output for all groups? My outcome was log-transformed.
Thank you all!!
I thinks the easiest answer is still missing. You can use a combination of nesting and mapping. I'll show you how it works for linear regression. I think you're able to apply the same principle to models of the lme4 package.
Lets create a toy data set, where we've measured the IQ score for three different groups at two different points of time.
library(tidyverse)
library(broom)
df <- tibble(
id = seq_len(90),
IQ = rnorm(90, 100, 15),
group = rep(c("A", "B", "C"), each = 30),
time = rep(c("T1", "T2"), 45)
)
If we want to build a regression model for each group, investigating the relation between the IQ score and the point of time, we only need five lines of code.
df %>%
nest(-group) %>%
mutate(fit = map(data, ~ lm(IQ ~ time, data = .)),
results = map(fit, glance)) %>%
unnest(results) %>%
select(group, r.squared, p.value)
Which will return
# A tibble: 3 x 3
group r.squared p.value
<chr> <dbl> <dbl>
1 A 0.0141 0.532
2 B 0.0681 0.164
3 C 0.00432 0.730
where nest(-group) creates tibbles within your tibble for each group, containing the corresponding variables of id, IQ and time. Then you add a new column fit with mutate() where you apply a regression model for each group and a new column containing the results, which we unnest() shortly after to access the values glance() returned properly. In the last step we select() the three values of interest.
To get the slope you need to call tidy() in addition. Maybe it's possible to shorten the code somehow, but one solution would be
df %>%
nest(-group) %>%
mutate(fit = map(data, ~ lm(IQ ~ time, data = .)),
results1 = map(fit, glance),
results2 = map(fit, tidy)) %>%
unnest(results1) %>%
unnest(results2) %>%
select(group, term, estimate, r.squared, p.value) %>%
mutate(estimate = exp(estimate))
To exponentiate the slope, you can just add another mutate() statement. Finally it returns
# A tibble: 6 x 5
group term estimate r.squared p.value
<chr> <chr> <dbl> <dbl> <dbl>
1 A (Intercept) 3.34e+46 0.0141 0.532
2 A timeT2 3.31e- 2 0.0141 0.532
3 B (Intercept) 1.17e+47 0.0681 0.164
4 B timeT2 1.34e- 3 0.0681 0.164
5 C (Intercept) 8.68e+43 0.00432 0.730
6 C timeT2 1.25e- 1 0.00432 0.730
Note that the estimates are exponentiated already. Without the exponentiation you can double check the slope and p value with base R calling
summary(lm(IQ ~ time, data = filter(df, group == "A")))
If you work with more complex models (lme4), there is a package called lmerTest which offers wrapper functions for lme4 which return p-values (at least for mixed models, with which I already worked with).
A word of warning towards using glance() for lme4 models should be spoken, because the maintainers of the broom package, will try a new concept where they outsource the summary statistics to the particular package developer responsible for the model.
If I am understanding your question correctly, you want to run multiple regressions over lots of groups. Here is an example of how to do so with the mtcars data.
library(dplyr)
mtcars %>% group_by(cyl) %>%
summarise_at(vars(disp:wt), funs(
r.sqr = summary(lm(mpg~.))$r.squared,
intercept = summary(lm(mpg~.))$coefficients[[1]],
slope = summary(lm(mpg~.))$coefficients[[2]],
p.value = summary(lm(mpg~.))$coefficients[[8]]
))
This will run a regression per group per variable and extract the info you asked for. If your formula is always the same, you could simplify as follows.
mtcars %>% group_by(cyl) %>%
summarise(
r.sqr = summary(lm(mpg~wt))$r.squared,
intercept = summary(lm(mpg~wt))$coefficients[[1]],
slope = summary(lm(mpg~wt))$coefficients[[2]],
p.value = summary(lm(mpg~wt))$coefficients[[8]]
)
This is actually running the regression 4 times(once per value of interest). If that takes too long for your real data, you could try this:
df <- mtcars %>% group_by(cyl) %>% summarise(model = list(summary(lm(mpg~wt))))
which simply runs the model once per group and then extract out the info you want. The problem is that extracting values this way can be a pain
df$model[[1]]$coefficients[[1]]
[1] 39.5712
While the code given by AndS will work, it will run lm function 4 times for each group which makes it a bit inefficient. You can use the following. I am trying to break it into simpler steps:
Assuming your data frame(df) has three variables: "Group", "Dep", "Indep":
#Getting the unique list of groups
groups <- unique(df$Group)
#Creating a model summary list to combine the model summary of each model
model_summaries = list()
#Running the models
for(i in 1:length(groups)){
model <- lm(Dep ~ Indep, df[df$Group == Groups[i], c("Dep", "Indep")])
model_summaries[i] <- summary(model)
}
In each model summary you have following elements RSQ, coefficients(contains p-values and intercept too)
Let me know if this helps.

lm(): loop through multiple linear models exporting p-value of F-statistic

I have a large data set for which I need to run a linear model comparing groups.
I need to find the p-values for group comparisons using a linear model. There are four groups (so I need 1~2, 1~3. 1~4, 2~3, 2~4, 3~4) and and there are 130 columns for which the data from these groups needs to be compared. Any help would be greatly appreciated!!
I have this, which gives me exactly what I need.
fit<-lm(variable~group, data=data)
summary(fit)
However, with all of the groups and columns, I have nearly 800 comparisons to make, so I want to avoid doing this manually. I tried writing a for loop, but it isn't working.
k<-data.frame()
for (i in 1:130){
[i,1]<-colnames(data)
fit<- lm(i~group, data=data)
[i,2] <- fit$p.value
}
But this has given me a variety of different errors. I really just need the p-values. Help would be greatly greatly appreciated!! Thank you!
(2016-06-18) Your question is not completely answerable at this stage. In the following, I shall point out several problems.
How to get p-value properly
I assume you want p-value of F-statistic for the model, as an indication of goodness of fit. Suppose your fitted model is fit, we should do this way:
fstatistic <- summary(fit)$fstatistic
p_value <- unname(1 - pf(fstatistic[1], fstatistic[2], fstatistic[3]))
As an example, I will use built-in dataset trees as an demonstration.
fit <- lm(Height ~ Girth, trees)
## truncated output of summary(fit)
# > summary(fit)
# Residual standard error: 5.538 on 29 degrees of freedom
# Multiple R-squared: 0.2697, Adjusted R-squared: 0.2445
F-statistic: 10.71 on 1 and 29 DF, p-value: 0.002758
fstatistic <- summary(fit)$fstatistic
p_value <- unname(1 - pf(fstatistic[1], fstatistic[2], fstatistic[3]))
## > p_value
# [1] 0.002757815
So, p_value agrees with the printed summary.
Your loop
I suggest you use vectors rather than data frame during computation/update.
variable <- character(130)
p.value <- numeric(130)
You can combine the results at the end to a data frame via:
k <- data.frame(var = variable, p.value = p.value)
Why? Because this is memory efficient! Now, after those correction, we arrive at:
variable <- character(130)
p.value <- numeric(130)
for (i in 1:130) {
variable[i] <- colnames(data)
fit <- lm(i~group, data=data)
fstatistic <- summary(fit)$fstatistic
p_value <- unname(1 - pf(fstatistic[1], fstatistic[2], fstatistic[3]))
p.value[i] <- p_value
}
k <- data.frame(var = variable, p.value = p.value)
Further problems
I still don't think the above code above will work. Because I am not sure whether the following is doing correct:
variable[i] <- colnames(data)
fit <- lm(i~group, data=data)
During the loop, data is not changed, so colnames(data) returns a vector, hence var[i] <- colnames(data) will trigger error.
i~group looks odd. Do you have i in your data?
I can't help you solve these issues. I have no idea of what your data looks like. But if you could put in a subset of your data, it would be OK.
Follow-up (2016-06-19)
Thank you. This has been extremely helpful. I don't have "i" in my data, but I was hoping that I could use that to represent the different column names, so that it goes through all of them. Is there a way to assign column names numbers so that this would work?
Yes, but I need to know what you have for each column.
Column 1 has a group number. The following columns have data for different factors I am looking at.
OK, so I think ncol(data) = 131, where the first column is group, and the remaining 130 columns are what you will test. Then this should work:
variable <- colnames(data)[-1]
p.value <- numeric(130)
for (i in 1:130) {
fit <- lm(paste(variable[i], "group", sep = "~"), data=data)
fstatistic <- summary(fit)$fstatistic
p_value <- unname(1 - pf(fstatistic[1], fstatistic[2], fstatistic[3]))
p.value[i] <- p_value
}
k <- data.frame(var = variable, p.value = p.value)
It is possible to use sapply() instead of the above for loop. But I think there is no performance difference, as loop overhead is so much tiny compared with lm() and summary().
I think this can get you started at least. It uses the dplyr and broom packages. The basic idea is to define all the formulas you want as characters then use lapply() to run them through lm().
library(dplyr)
library(broom)
# Generate a vector of wanted formulas
forms <- c("mpg ~ cyl", "mpg ~ wt")
# Function to apply formula
lmit <- function(form){
tidy(lm(as.formula(form), mtcars)) %>%
mutate(formula = form)
}
# Apply it and bind into a dataframe
results <- bind_rows(lapply(forms, lmit))

Resources