Keeping your statistician happy: Stata vs. R Student's t-test

Chapter 1: mean age by gender
I work a lot with epidemiologists and statisticians who have very specific requirements for their statistical output, and I frequently fail to reproduce the exact same thing in R (our epidemiologist works in Stata).
Let's start with an easy example, a Student's t-test. What we are interested in is the difference in mean age at first diagnosis and a confidence interval.
1) create some sample data in R
set.seed(41)
cohort <- data.frame(
  id = seq(1, 100),
  gender = sample(c(rep(1, 33), rep(2, 67)), 100),
  age = sample(seq(0, 50), 100, replace = TRUE)
)
# save to import into Stata
# write.csv(cohort, "cohort.csv", row.names = FALSE)
2) import the data and run the t-test in Stata
import delimited "cohort.csv"
ttest age, by(gender)
What we want is the absolute difference in means (3.67 years) and the combined confidence interval (95% CI: 24.59 - 30.57).
3) run the t-test in R
t.test(age~gender, data=cohort)
t.test(cohort$age[cohort$gender == 1])
t.test(cohort$age[cohort$gender == 2])
t.test(cohort$age)
Surely there must be a better way than running four t-tests in R!

You can try putting everything into one function with some tidyverse magic. The output can of course be edited to fit your needs; broom's tidy() is used for clean output.
foo <- function(df, x, y){
  require(tidyverse)
  require(broom)
  a1 <- df %>%
    select(ep = !!x, gr = !!y) %>%
    mutate(gr = as.character(gr)) %>%
    bind_rows(mutate(., gr = "ALL")) %>%
    split(.$gr) %>%
    map(~ tidy(t.test(.$ep))) %>%
    bind_rows(., .id = "gr") %>%
    mutate_if(is.factor, as.character)
  tidy(t.test(as.formula(paste(x, " ~ ", y)), data = df)) %>%
    mutate_if(is.factor, as.character) %>%
    mutate(gr = "vs") %>%
    select(gr, estimate, statistic, p.value, parameter, conf.low, conf.high, method, alternative) %>%
    bind_rows(a1, .)
}
foo(cohort, "age", "gender")
gr estimate statistic p.value parameter conf.low conf.high method alternative
1 1 25.121212 9.545737 6.982763e-11 32.00000 19.76068 30.481745 One Sample t-test two.sided
2 2 28.791045 15.699854 5.700541e-24 66.00000 25.12966 32.452428 One Sample t-test two.sided
3 ALL 27.580000 18.301678 1.543834e-33 99.00000 24.58985 30.570147 One Sample t-test two.sided
4 vs -3.669833 -1.144108 2.568817e-01 63.37702 -10.07895 2.739284 Welch Two Sample t-test two.sided
I recommend starting from the beginning with something like this:
foo <- function(df){
  a1 <- broom::tidy(t.test(age ~ gender, data = df))
  a2 <- broom::tidy(t.test(df$age))
  a3 <- broom::tidy(t.test(df$age[df$gender == 1]))
  a4 <- broom::tidy(t.test(df$age[df$gender == 2]))
  list(rbind(a2, a3, a4), a1)
}
foo(cohort)
[[1]]
estimate statistic p.value parameter conf.low conf.high method alternative
1 27.58000 18.301678 1.543834e-33 99 24.58985 30.57015 One Sample t-test two.sided
2 25.12121 9.545737 6.982763e-11 32 19.76068 30.48174 One Sample t-test two.sided
3 28.79104 15.699854 5.700541e-24 66 25.12966 32.45243 One Sample t-test two.sided
[[2]]
estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high method alternative
1 -3.669833 25.12121 28.79104 -1.144108 0.2568817 63.37702 -10.07895 2.739284 Welch Two Sample t-test two.sided

You can make your own function:
tlimits <- function(data, group){
  error <- qt(0.975, df = length(data) - 1) * sd(data) / sqrt(length(data))
  mean  <- mean(data)
  means <- tapply(data, group, mean)
  c(abs(means[1] - means[2]), mean - error, mean + error)
}
tlimits(cohort$age, cohort$gender)
1
3.669833 24.589853 30.570147

What we want is the absolute difference in means (3.67 years) and the combined confidence interval (95% CI: 24.59 - 30.57).
Notice that R's t.test() performs a t-test, whereas you want a mean difference and a "combined confidence interval" (i.e., the CI around the overall mean, ignoring the grouping variable). So you don't want a t-test, but something else.
You can get the mean difference using, e.g.:
diff(with(cohort, tapply(age, gender, mean)))
# 3.669833
# no point in using something more complicated e.g., t-test or lm
... and the CI using, e.g.:
confint(lm(age~1, data=cohort))
# 2.5 % 97.5 %
# (Intercept) 24.58985 30.57015
And obviously, you can easily combine the two steps into one function if you need it often.
doit <- function(a,b) c(diff= diff(tapply(a,b,mean)), CI=confint(lm(a~1)))
with(cohort, doit(age,gender))
# diff.2 CI1 CI2
# 3.669833 24.589853 30.570147

Related

Understanding the Output Coefficients from a Linear Model Regression in R

I'm reading a fairly simple hypothesis-testing textbook at the moment. It explains that the coefficients from a linear model, where the independent variables are two categorical variables with 2 and 3 levels respectively and the dependent variable is continuous, should be interpreted as the difference between the overall mean of the dependent variable (the mean across all categories) and the mean of the dependent variable within a given level of one of the categorical variables. I hope that's understandable.
However, when I try to reproduce the example in the book, I do not get the same coefficients, std. err., T- or P-values.
I created a reproducible example using the ToothGrowth dataset, where the same is the case:
library(tidyverse)
# Transforming Data to a Tibble and Change Variable 'dose' to a Factor:
tooth_growth_reprex <- ToothGrowth %>%
  as_tibble() %>%
  mutate(dose = as.factor(dose))
# Creating Linear Model of Variables in ToothGrowth (tg):
tg_lm <- lm(formula = len ~ supp * dose, data = tooth_growth_reprex)
# Extracting suppVC coefficient:
(coef_supp_vc <- tg_lm$coefficients["suppVC"])
#> suppVC
#> -5.25
# Calculating Mean Difference between Overall Mean and Supplement VC Mean:
## Overall Mean:
(overall_summary <- tooth_growth_reprex %>%
   summarise(Mean = mean(len)))
#> # A tibble: 1 x 1
#> Mean
#> <dbl>
#> 1 18.8
## Supp VC Mean:
(supp_vc_summary <- tooth_growth_reprex %>%
   group_by(supp) %>%
   summarise(Mean = mean(len))) %>%
  filter(supp == "VC")
#> # A tibble: 1 x 2
#> supp Mean
#> <fct> <dbl>
#> 1 VC 17.0
## Difference between Overall Mean and Supp VC Mean:
(mean_dif_overall_vc <- overall_summary$Mean - supp_vc_summary$Mean[2])
#> [1] 1.85
# Testing if supp_VC coefficient and difference between Overall Mean and Supp VC Mean is near identical:
near(coef_supp_vc, mean_dif_overall_vc)
#> suppVC
#> FALSE
Created on 2021-02-23 by the reprex package (v1.0.0)
My questions:
Am I understanding the interpretation of the coefficient values completely wrong?
What is the lm actually calculating regarding the coefficients?
Are there any functions in R that can calculate what I'm interested in, without me having to do it manually?
I hope this is enough information. If not, please don't hesitate to ask me!
The lm() function uses dummy (treatment) coding, so all the coefficients in your model are differences relative to the reference group's mean, not the overall mean. The reference group here is the first level of each of your factors, so supp = OJ and dose = 0.5.
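The verification below compares coefficients against a table of cell means, mean_table, which is not defined in this excerpt. A minimal sketch of how such a table could be built; the object name mean_table and the column name M are assumptions chosen so the verification code runs as written:
library(dplyr)
# Hypothetical helper: observed mean of len for every supp x dose cell
mean_table <- tooth_growth_reprex %>%
  group_by(supp, dose) %>%
  summarise(M = mean(len), .groups = "drop")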
You can then do this verification like so:
coef(tg_lm)["(Intercept)"] + coef(tg_lm)["suppVC"] == mean_table %>% filter(supp=='VC' & dose==0.5) %>% pull(M)
(coef(tg_lm)["(Intercept)"] + coef(tg_lm)["suppVC"] + coef(tg_lm)["dose1"] + coef(tg_lm)["suppVC:dose1"]) == mean_table %>% filter(supp=='VC' & dose==1) %>% pull(M)
You can read more about the different coding schemes here.

Functional programming: use broom nest->tidy->unnest and map within a function

I need to turn a (working) bit of dplyr/broom code into a function, since I'll call it several (dozen) times.
I am stuck, and this likely has to do with non-standard evaluation (NSE) being mixed with standard evaluation.
Here I take code directly from the vignette 'broom and dplyr'
library(tidyverse)
library(broom)
data(Orange)

Orange %>%
  nest(-Tree) %>%
  mutate(
    test = map(data, ~ cor.test(.x$age, .x$circumference)),
    tidied = map(test, tidy)
  ) %>%
  unnest(tidied, .drop = TRUE)
This works:
Tree estimate statistic p.value parameter conf.low conf.high method alternative
1 1 0.9854675 12.97258 4.851902e-05 5 0.9012111 0.9979400 Pearson's product-moment correlation two.sided
2 2 0.9873624 13.93129 3.425041e-05 5 0.9136142 0.9982101 Pearson's product-moment correlation two.sided
3 3 0.9881766 14.41188 2.901046e-05 5 0.9189858 0.9983260 Pearson's product-moment correlation two.sided
4 4 0.9844610 12.53575 5.733090e-05 5 0.8946782 0.9977964 Pearson's product-moment correlation two.sided
5 5 0.9877376 14.14686 3.177093e-05 5 0.9160865 0.9982635 Pearson's product-moment correlation two.sided
Now the point is that I want to make a function out of it.
So if I try this:
afunction <- function(data, var) {
  data %>%
    nest(-Tree) %>%
    mutate(
      test = map(data, ~ cor.test(.x$age, .x$var)), # S3 list-col
      tidied = map(test, tidy)
    ) %>%
    unnest(tidied, .drop = TRUE)
}
It fails miserably.
Error in cor.test.default(.x$age, .x$var) : 'x' and 'y' must have the same length
I have tried to use NSE, quotation, and quasiquotation. I admit I tried somewhat at random, since I cannot find a proper tutorial on how to let NSE and SE play nicely together with the $ operator.
Any solutions, especially one that would scale and teach me how to solve these issues once and for all? I am also happy for pointers to relevant books / tutorials.
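One possible way around this (a sketch, not from the original thread): pass the column name as a string and index with [[ ]] instead of $, which avoids the quoting problem entirely. Everything else mirrors the question's code.
afunction <- function(data, var) {
  data %>%
    nest(-Tree) %>%
    mutate(
      # .x[[var]] looks the column up by its (string) name, so no unquoting is needed
      test = map(data, ~ cor.test(.x$age, .x[[var]])),
      tidied = map(test, tidy)
    ) %>%
    unnest(tidied, .drop = TRUE)
}
afunction(Orange, "circumference")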

How to construct a table from a t-test in R

I was provided with three t-tests:
Two Sample t-test
data: cammol by gender
t = -3.8406, df = 175, p-value = 0.0001714
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.11460843 -0.03680225
sample estimates:
mean in group 1 mean in group 2
2.318132 2.393837
Welch Two Sample t-test
data: alkphos by gender
t = -2.9613, df = 145.68, p-value = 0.003578
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-22.351819 -4.458589
sample estimates:
mean in group 1 mean in group 2
85.81319 99.21839
Two Sample t-test
data: phosmol by gender
t = -3.4522, df = 175, p-value = 0.0006971
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.14029556 -0.03823242
sample estimates:
mean in group 1 mean in group 2
1.059341 1.148605
And I want to construct a table with these t-test results in R markdown like:
[image: wanted_table_format]
I've tried reading some instructions for using "knitr" and "kable" functions, but honestly, I do not know how to apply the t-test results to those functions.
What could I do?
Suppose your three t-tests are saved as t1, t2, and t3.
t1 <- t.test(rnorm(100), rnorm(100))
t2 <- t.test(rnorm(100), rnorm(100, 1))
t3 <- t.test(rnorm(100), rnorm(100, 2))
You could turn them into one data frame (that can then be printed as a table) with the broom and purrr packages:
library(broom)
library(purrr)
tab <- map_df(list(t1, t2, t3), tidy)
On the above data, this would become:
estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
1 0.07889713 -0.008136139 -0.08703327 0.535986 5.925840e-01 193.4152 -0.2114261 0.3692204
2 -0.84980010 0.132836627 0.98263673 -6.169076 3.913068e-09 194.2561 -1.1214809 -0.5781193
3 -1.95876967 -0.039048940 1.91972073 -13.270232 3.618929e-29 197.9963 -2.2498519 -1.6676875
method alternative
1 Welch Two Sample t-test two.sided
2 Welch Two Sample t-test two.sided
3 Welch Two Sample t-test two.sided
Some of the columns probably don't matter to you, so you could do something like this to get just the columns you want:
tab[c("estimate", "statistic", "p.value", "conf.low", "conf.high")]
As noted in the comments, you'd have to first do install.packages("broom") and install.packages("purrr").
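Since the question mentions knitr and kable, a minimal sketch of rendering the reduced table in an R Markdown document:
library(knitr)
# print the selected columns as a Markdown table (digits is optional)
kable(tab[c("estimate", "statistic", "p.value", "conf.low", "conf.high")],
      digits = 3)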

Pairwise t test using plyr

I would like to use the R package plyr to run a pairwise t test on a really large data frame, but I'm not sure how to do it. I recently learned how to do correlations using plyr, and I really like how you can specify which groups you want to compare and then plyr breaks down the data for you. For example, you could have plyr calculate the correlation between sepal length and sepal width for each species of iris in the iris dataset like this:
Correlations <- ddply(iris, "Species", function(x) cor(x$Sepal.Length, x$Sepal.Width))
I could break the data frame down myself by specifying that the data for the setosa species of iris are in rows 1:50 and so on, but plyr would be less likely than me to mess up and accidentally say rows 1:51, for example.
So how do I do something similar with a paired t-test? How can I specify which observations are the pairs? Here's some example data similar to what I'm working with; I'd like the pairs to be matched by Subject, and I'd like to break the data down by Pesticide:
Exposure <- data.frame("Subject" = rep(1:4, 6),
"Season" = rep(c(rep("summer", 4), rep("winter", 4)),3),
"Pesticide" = rep(c("atrazine", "metolachlor", "chlorpyrifos"), each=8),
"Exposure" = sample(1:100, size=24))
Exposure$Subject <- as.factor(Exposure$Subject)
In other words, the question I'd like to evaluate is whether there is a difference in pesticide exposure for each person during the winter versus during the summer, and I'd like to answer that question separately for each of the three pesticides.
Much thanks in advance!
An edit: To clarify, this is how to do an unpaired t test in plyr:
TTests <- dlply(Exposure, "Pesticide", function(x) t.test(x$Exposure ~ x$Season))
And if I add "paired=T" in there, plyr will do a paired t test, but it assumes that I always have the pairs in the same order. While I do have them all in the same order in the example data frame above, I don't in my real data because I sometimes have missing data.
Do you want this?
library(data.table)
# convert to data.table in place
setDT(Exposure)
# make sure data is sorted correctly
setkey(Exposure, Pesticide, Season, Subject)
Exposure[, list(res = list(t.test(Exposure[Season == "summer"],
                                  Exposure[Season == "winter"],
                                  paired = T))),
         by = Pesticide]$res
#[[1]]
#
# Paired t-test
#
#data: Exposure[Season == "summer"] and Exposure[Season == "winter"]
#t = -4.1295, df = 3, p-value = 0.02576
#alternative hypothesis: true difference in means is not equal to 0
#95 percent confidence interval:
# -31.871962 -4.128038
#sample estimates:
#mean of the differences
# -18
#
#
#[[2]]
#
# Paired t-test
#
#data: Exposure[Season == "summer"] and Exposure[Season == "winter"]
#t = -6.458, df = 3, p-value = 0.007532
#alternative hypothesis: true difference in means is not equal to 0
#95 percent confidence interval:
# -73.89299 -25.10701
#sample estimates:
#mean of the differences
# -49.5
#
#
#[[3]]
#
# Paired t-test
#
#data: Exposure[Season == "summer"] and Exposure[Season == "winter"]
#t = -2.5162, df = 3, p-value = 0.08646
#alternative hypothesis: true difference in means is not equal to 0
#95 percent confidence interval:
# -30.008282 3.508282
#sample estimates:
#mean of the differences
# -13.25
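If the pairs are not guaranteed to line up, or some subjects are missing in one season (the concern in the edit above), one possible sketch is to merge the summer and winter rows on Subject within each Pesticide before testing. Column names follow the example data, and this assumes Exposure is still the plain data.frame created in the question (wrap it in as.data.frame() if setDT() has already been run).
library(plyr)
PairedTTests <- dlply(Exposure, "Pesticide", function(x) {
  # keep only subjects observed in both seasons by merging on Subject
  wide <- merge(x[x$Season == "summer", c("Subject", "Exposure")],
                x[x$Season == "winter", c("Subject", "Exposure")],
                by = "Subject", suffixes = c(".summer", ".winter"))
  t.test(wide$Exposure.summer, wide$Exposure.winter, paired = TRUE)
})
PairedTTests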
I don't know ddply, but here's how I would do it using some base functions.
by(data = Exposure, INDICES = Exposure$Pesticide, FUN = function(x) {
  t.test(Exposure ~ Season, data = x)
})
Exposure$Pesticide: atrazine
Welch Two Sample t-test
data: Exposure by Season
t = -0.1468, df = 5.494, p-value = 0.8885
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-49.63477 44.13477
sample estimates:
mean in group summer mean in group winter
60.50 63.25
----------------------------------------------------------------------------------------------
Exposure$Pesticide: chlorpyrifos
Welch Two Sample t-test
data: Exposure by Season
t = -0.8932, df = 4.704, p-value = 0.4151
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-83.58274 41.08274
sample estimates:
mean in group summer mean in group winter
52.25 73.50
----------------------------------------------------------------------------------------------
Exposure$Pesticide: metolachlor
Welch Two Sample t-test
data: Exposure by Season
t = 0.8602, df = 5.561, p-value = 0.4252
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-39.8993 81.8993
sample estimates:
mean in group summer mean in group winter
62.5 41.5

How to do: Correlation with "blocks" (or - "repeated measures" ?!)?

I have the following setup to analyse:
We have about 150 subjects, and for each subject we performed a pair of tests (under different conditions) 18 times.
The 18 different conditions of the test are complementary, in such a way that if we were to average over the tests (for each subject), we would get no correlation between the tests (between subjects).
What we wish to know is the correlation (and p-value) between the tests within subjects, but across all the subjects.
The way I have done this so far was to compute the correlation for each subject, and then look at the distribution of the resulting correlations to see whether its mean is different from 0.
But I suspect there might be a better way for answering the same question (someone said to me something about "geographical correlation", but a shallow search didn't help).
p.s: I understand there might be a place here to do some sort of mixed model, but I would prefer to present a "correlation", and am not sure how to extract such an output from a mixed model.
Also, here is a short dummy code to give an idea of what I am talking about:
attach(longley)
N <- length(Unemployed)
block <- c(
  rep("a", N),
  rep("b", N),
  rep("c", N)
)
Unemployed.3 <- c(Unemployed + rnorm(1),
                  Unemployed + rnorm(1),
                  Unemployed + rnorm(1))
GNP.deflator.3 <- c(GNP.deflator + rnorm(1),
                    GNP.deflator + rnorm(1),
                    GNP.deflator + rnorm(1))
cor(Unemployed, GNP.deflator)
cor(Unemployed.3, GNP.deflator.3)
cor(Unemployed.3[block == "a"], GNP.deflator.3[block == "a"])
cor(Unemployed.3[block == "b"], GNP.deflator.3[block == "b"])
cor(Unemployed.3[block == "c"], GNP.deflator.3[block == "c"])
(I would like to somehow combine the last three correlations...)
Any ideas will be welcomed.
Best,
Tal
I agree with Tristan - you are looking for the ICC. The only difference from standard implementations is that the two raters (tests) evaluate each subject repeatedly. There might be an implementation that allows for that. In the meantime, here is another approach to get the correlation.
You can use "general linear models", which are generalizations of linear models that explicitly allow correlation between residuals. The code below implements this using the gls function of the nlme package. I am sure there are other ways as well. To use this function we have to first reshape the data into a "long" format. I also changed the variable names to x and y for simplicity. I also used +rnorm(N) instead of +rnorm(1) in your code, because that's what I think you meant.
library(reshape)
library(nlme)
dd <- data.frame(x=Unemployed.3, y=GNP.deflator.3, block=factor(block))
dd$occasion <- factor(rep(1:N, 3)) # variable denoting measurement occasions
dd2 <- melt(dd, id=c("block","occasion")) # reshape
# fit model with the values within a measurement occasion correlated
# and different variances allowed for the two variables
mod <- gls(value ~ variable + block, data=dd2,
           cor=corSymm(form=~1|block/occasion),
           weights=varIdent(form=~1|variable))
# extract correlation
mod$modelStruct$corStruct
In the modeling framework you can use a likelihood ratio test to get a p-value. nlme can also give you a confidence interval:
mod2 <- gls(value ~ variable + block, data=dd2,
            weights=varIdent(form=~1|variable))
anova(mod, mod2) # likelihood-ratio test for corr=0
intervals(mod)$corStruct # confidence interval for the correlation
If I understand your question correctly, you are interested in computing the intraclass correlation between multiple tests. There is an implementation in the psy package, although I have not used it.
If you want to perform inference on the correlation estimate, you could bootstrap the subjects. Just make sure to keep the tests together for each sample.
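A minimal sketch of that bootstrap, using the dummy data from the question (block stands in for "subject" here, so with only three blocks this is purely illustrative; the statistic is the mean of the per-subject correlations, matching the approach described in the question, and the names df3 and boot_cors are introduced just for the example):
df3 <- data.frame(block = block, x = Unemployed.3, y = GNP.deflator.3)
set.seed(1)
boot_cors <- replicate(2000, {
  # resample whole blocks (subjects) with replacement,
  # keeping each block's paired observations together
  resampled <- sample(unique(df3$block), replace = TRUE)
  mean(sapply(resampled, function(b) {
    d <- df3[df3$block == b, ]
    cor(d$x, d$y)
  }))
})
quantile(boot_cors, c(0.025, 0.975))  # percentile bootstrap interval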
I'm no expert, but this looks to me like what you want. It's automated, short to code, gives the same correlations as your example above, and produces p-values.
> df = data.frame(block=block, Unemployed=Unemployed.3,
+ GNP.deflator=GNP.deflator.3)
> require(plyr)
Loading required package: plyr
> ddply(df, "block", function(x){
+ as.data.frame(
+ with(x,cor.test(Unemployed, GNP.deflator))[c("p.value","estimate")]
+ )})
block p.value estimate
1 a 0.01030636 0.6206334
2 b 0.01030636 0.6206334
3 c 0.01030636 0.6206334
To see all the details, do this:
> dlply(df, "block", function(x){with(x,cor.test(Unemployed, GNP.deflator))})
$a
Pearson's product-moment correlation
data: Unemployed and GNP.deflator
t = 2.9616, df = 14, p-value = 0.01031
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.1804410 0.8536976
sample estimates:
cor
0.6206334
$b
Pearson's product-moment correlation
data: Unemployed and GNP.deflator
t = 2.9616, df = 14, p-value = 0.01031
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.1804410 0.8536976
sample estimates:
cor
0.6206334
$c
Pearson's product-moment correlation
data: Unemployed and GNP.deflator
t = 2.9616, df = 14, p-value = 0.01031
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.1804410 0.8536976
sample estimates:
cor
0.6206334
attr(,"split_type")
[1] "data.frame"
attr(,"split_labels")
block
1 a
2 b
3 c
