Grabbing certain results out of multiple t.test outputs to create a table - r

I have run 48 t-tests (coded by hand instead of writing a loop) and would like to pull certain results out of those t-tests to create a table of the things I'm most interested in.
Specifically, I would like to keep only the p-value, confidence interval, and the mean of x and mean of y for each of these 48 tests, and then build a table of the results.
Is there an elegant, quick way to do this beyond the top answer detailed here, wherein I would go in for all 48 tests and grab all three desired outputs with something along the lines of ttest$p.value? Perhaps a loop?
Below is a sample of the coded input for one t-test, followed by the output delivered by R.
# t.test comparing means of Change_Unemp for 2005 government employment (ix)
lowgov6 <- met_res[met_res$Gov_Emp_2005 <= 93310, "Change_Unemp"]
highgov6 <- met_res[met_res$Gov_Emp_2005 > 93310, "Change_Unemp"]
t.test(lowgov6,highgov6,pool.sd=FALSE,na.rm=TRUE)
Welch Two Sample t-test
data: lowgov6 and highgov6
t = 1.5896, df = 78.978, p-value = 0.1159
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.1813909 1.6198399
sample estimates:
mean of x mean of y
4.761224 4.042000

Save all of your t-tests into a list:
tests <- list()
tests[[1]] <- t.test(lowgov6, highgov6, pool.sd = FALSE, na.rm = TRUE)
# repeat for all tests
# there are probably faster ways than doing all of that by hand

# extract your values using `sapply`
sapply(tests, function(x) {
  c(x$estimate[1],
    x$estimate[2],
    ci.lower = x$conf.int[1],
    ci.upper = x$conf.int[2],
    p.value = x$p.value)
})
The output is something like the following:
                 [,1]        [,2]
mean of x  0.12095949  0.03029474
mean of y -0.05337072  0.07226999
ci.lower  -0.11448679 -0.31771191
ci.upper   0.46314721  0.23376141
p.value    0.23534905  0.76434012
But will have 48 columns. You can t() the result if you'd like it transposed.
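The list itself can also be built programmatically rather than by hand. A minimal sketch, assuming each of the 48 tests splits Change_Unemp on a cutoff applied to some column of met_res (the cutoffs vector below is hypothetical; fill in your actual 48 column/cutoff pairs):
cutoffs <- c(Gov_Emp_2005 = 93310)  # hypothetical: extend with all 48 split variables
tests <- lapply(names(cutoffs), function(v) {
  lo <- met_res[met_res[[v]] <= cutoffs[[v]], "Change_Unemp"]
  hi <- met_res[met_res[[v]] >  cutoffs[[v]], "Change_Unemp"]
  t.test(lo, hi)  # Welch test by default, matching the output above
})
The sapply() extraction above then works on this list unchanged.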

Related

Removing outliers: cannot run cor.test()

I am extracting outliers from a single column of a dataset, and then attempting to run cor.test() on that column plus another column. I am getting this error:
Error in cor.test.default(dep_delay_noout, distance) : 'x' and 'y' must have the same length
I assume this is because removing the outliers from one column made it a different-length vector than the other column, but I am not sure what to do about it. I have tried mutating the dataset by adding a new column that lacked outliers, but unfortunately ran into the same problem. Does anybody know what to do? Below is my code.
dep_delay <- flights$dep_delay
dep_delay_upper <- quantile(dep_delay, 0.997, na.rm = TRUE)
dep_delay_lower <- quantile(dep_delay, 0.003, na.rm = TRUE)
dep_delay_out <- which(dep_delay > dep_delay_upper | dep_delay < dep_delay_lower)
dep_delay_noout <- dep_delay[-dep_delay_out]
distance <- flights$distance
cor.test(dep_delay_noout, distance)
You were almost there: in cor.test you also want to subset distance. Additionally, for the preprocessing you could use a quantile vector of length 2 and mapply to do the comparison in one step. That is just a more concise way of writing it; your original code is otherwise fine.
data('flights', package='nycflights13')
nna <- !is.na(flights$dep_delay)
(q <- quantile(flights$dep_delay[nna], c(0.003, 0.997)))
# 0.3% 99.7%
# -14 270
# keep rows that lie above the 0.3% and below the 99.7% quantile
no_out <- rowSums(mapply(\(f, q) f(flights$dep_delay[nna], q), c(`>`, `<`), q)) == 2
# subset the NA-filtered rows, so the logical index lines up
with(flights[nna, ], cor.test(dep_delay[no_out], distance[no_out]))
# Pearson's product-moment correlation
#
# data: dep_delay[no_out] and distance[no_out]
# t = -12.409, df = 326171, p-value < 2.2e-16
# alternative hypothesis: true correlation is not equal to 0
# 95 percent confidence interval:
# -0.02515247 -0.01829207
# sample estimates:
# cor
# -0.02172252
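For completeness, the minimal fix to the original code is simply to subset distance with the same outlier index, so both vectors keep the same length. A sketch (cor.test() drops incomplete pairs itself, so any leftover NAs are handled):
dep_delay <- flights$dep_delay
distance <- flights$distance
dep_delay_upper <- quantile(dep_delay, 0.997, na.rm = TRUE)
dep_delay_lower <- quantile(dep_delay, 0.003, na.rm = TRUE)
dep_delay_out <- which(dep_delay > dep_delay_upper | dep_delay < dep_delay_lower)
# drop the same rows from both vectors
cor.test(dep_delay[-dep_delay_out], distance[-dep_delay_out])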

R - Error T-test For loop command between variables

I am currently in the process of writing a for loop that will calculate and print t-test results. I am testing for the difference in means of all variables (faminc, fatheduc, motheduc, white, cigtax, cigprice) between smokers and non-smokers ("smoke"; 0 = non-smoker, 1 = smoker).
Current code:
type <- c("faminc", "fatheduc", "motheduc", "white", "cigtax", "cigprice")
count <- 1
for (name in type) {
  temp <- subset(data, data[name] == 1)
  cat("For", name, "between smokers and non, the difference in means is: \n")
  print(t.test(temp$smoke))
  count <- count + 1
}
However, I feel that 'temp' doesn't belong here and when running the code I get:
For faminc between smokers and non, the difference in means is:
Error in t.test.default(temp$smoke) : not enough 'x' observations
The simple code of
t.test(faminc~smoke,data=data)
does what I need, but I'd like to get some practice/better understanding of for loops.
Here is a solution that generates the output requested in the OP, using lapply() with the mtcars data set.
data(mtcars)
varList <- c("wt", "disp", "mpg")
results <- lapply(varList, function(x) {
  t.test(mtcars[[x]] ~ mtcars$am)
})
names(results) <- varList
for (i in 1:length(results)) {
  message(paste("for variable:", names(results[i]), "difference between manual and automatic transmissions is:"))
  print(results[[i]])
}
...and the output:
for variable: wt difference between manual and automatic transmissions is:
Welch Two Sample t-test
data: mtcars[[x]] by mtcars$am
t = 5.4939, df = 29.234, p-value = 6.272e-06
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.8525632 1.8632262
sample estimates:
mean in group 0 mean in group 1
3.768895 2.411000
for variable: disp difference between manual and automatic transmissions is:
Welch Two Sample t-test
data: mtcars[[x]] by mtcars$am
t = 4.1977, df = 29.258, p-value = 0.00023
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
75.32779 218.36857
sample estimates:
mean in group 0 mean in group 1
290.3789 143.5308
for variable: mpg difference between manual and automatic transmissions is:
Welch Two Sample t-test
data: mtcars[[x]] by mtcars$am
t = -3.7671, df = 18.332, p-value = 0.001374
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-11.280194 -3.209684
sample estimates:
mean in group 0 mean in group 1
17.14737 24.39231
Compare your code that works...
t.test(faminc~smoke,data=data)
You are specifying a relationship between variables (faminc~smoke), which means that you think the mean of faminc differs between the values of smoke, and you wish to use the data dataset.
The equivalent line in your loop...
print(t.test(temp$smoke))
...only gives the single column of temp$smoke after having selected those who have the value 1 for each of faminc, fatheduc etc. So even if you wrote...
print(t.test(faminc~smoke, data=data))
...inside the loop, it would test the same variable (faminc) on every pass; the looping variable has to enter the formula itself. Furthermore, your count is doing nothing.
If you want to perform a range of tests in this manner, you could do the following:
type <- c("faminc", "fatheduc", "motheduc", "white", "cigtax", "cigprice")
for (name in type) {
  cat("For", name, "between smokers and non, the difference in means is: \n")
  # build the formula from the string, e.g. faminc ~ smoke; a bare `name`
  # inside a formula would not be substituted for its value
  print(t.test(as.formula(paste(name, "~ smoke")), data = data))
}
Whether this is what you want to do, though, isn't clear to me. Your variables suggest family income (faminc), father's education (fatheduc), mother's education (motheduc), ethnicity (white), tax (cigtax) and price (cigprice).
I can't think why you would want to compare the mean cigarette price or tax between smokers and non-smokers, because the latter are not going to have any value for these since they don't smoke!
Your code suggests these are perhaps binary variables though (since you are filtering on each value being 1), which to me suggests this isn't even what you want to do.
If you wish to look at subsets of data, then a tidier approach than explicit loops is to use purrr, as sketched below.
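A minimal sketch of that purrr approach (hypothetical, assuming data contains the listed columns plus the binary smoke column):
library(purrr)
type <- c("faminc", "fatheduc", "motheduc", "white", "cigtax", "cigprice")
# set_names() keeps the variable names on the result list
results <- map(set_names(type), ~ t.test(reformulate("smoke", response = .x), data = data))
results$faminc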
In future, when asking, consider providing a sample of data along with the full copied-and-pasted output, as advised in How to create a Minimal, Complete, and Verifiable example - Help Center - Stack Overflow, because this allows people to see in greater detail what you are doing (e.g. I've only guessed about your variables). With statistics it's also useful to state what your hypothesis is, to help people understand what you are trying to achieve overall.

Use lapply and show variable names

I am fairly new to R, though I've programmed a lot in Python and Java. I have searched the questions about using a for loop to run through a list of variables, and everyone keeps mentioning lapply. I have done that, and my code works in the sense that it gives me the answers, but it doesn't work in the sense that the answers hide important details. Here's my code and some of the output.
> bat <- read.csv(file="mlbTeam2016-B.csv", header=TRUE)
> varlist <- names(bat)[6:32]
> varlist
[1] "AB.B" "R.B" "H.B" "X2B.B" "X3B.B" "HR.B" "RBI.B"
[8] "BB.B" "SO.B" "SB.B" "CS.B" "AVG.B" "OBP.B" "SLG.B"
[15] "OPS.B" "IBB.B" "HBP.B" "SAC.B" "SF.B" "TB.B" "XBH.B"
[22] "GDP.B" "GO.B" "AO.B" "GO_AO.B" "NP.B" "PA.B"
> lapply(varlist, function(i){
+ var <- eval(parse(text=paste("bat$",i)))
+ cor.test(bat$W, var, alternative="two.sided", method="pearson")
+ })
[[1]]
Pearson's product-moment correlation
data: bat$W and var
t = 0.35067, df = 28, p-value = 0.7285
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.3013221 0.4164731
sample estimates:
cor
0.06612551
etc
The problem is that each output says data: bat$W and var without telling me which variable it is testing in this step. This is fine, except I have to go back and look up to see what variable this corresponds to. That is better than typing this code in dozens of times, but not ideal. I also know that using eval(parse( is bad, but I can't figure out another way to handle that line.
This is my desired output:
[[1]]
Pearson's product-moment correlation
data: bat$W and bat$AB.B
t = 0.35067, df = 28, p-value = 0.7285
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.3013221 0.4164731
sample estimates:
cor
0.06612551
I would suggest creating a correlation matrix rather than doing this using lapply.
This link will walk you through how to do that
http://www.sthda.com/english/wiki/correlation-matrix-a-quick-start-guide-to-analyze-format-and-visualize-a-correlation-matrix-using-r-software
You can select the variables you want using dplyr:
library(dplyr)
select(bat, one_of(varlist))
This should be a bit easier than the approach you are using.
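If you do want to keep the individual cor.test() printouts, here is a small sketch that avoids eval(parse(...)): bat[[i]] looks a column up by name directly, and naming the result list records which variable each test belongs to:
results <- lapply(varlist, function(i) {
  cor.test(bat$W, bat[[i]], alternative = "two.sided", method = "pearson")
})
names(results) <- varlist
results[["AB.B"]]  # the test for AB.B, now clearly labelled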

What is the ggplot2/plyr way to calculate statistical tests between two subgroups?

I am a rather novice user of R and have come to appreciate the elegance of ggplot2 and plyr. Right now, I am trying to analyze a large dataset that I cannot share here, but I have reconstructed my problem with the diamonds dataset (shortened for convenience).
Without further ado:
library(ggplot2)
diam <- diamonds[diamonds$cut == "Fair" | diamonds$cut == "Ideal", ]
boxplots <- ggplot(diam, aes(x = cut, y = price)) + geom_boxplot(aes(fill = cut)) + facet_wrap(~ color)
print(boxplots)
What the plot produces is a set of boxplots, comparing the price of the two cuts "Fair" and "Ideal".
I would now very much like to proceed by statistically comparing the two cuts for each color subgroup (D, E, F, ..., J) using either t.test or wilcox.test.
How would I implement this in a way that is as elegant as the ggplot2 syntax? I assume I would use ddply from the plyr package, but I couldn't figure out how to feed two subgroups into a function that calculates the appropriate statistics.
I think you're looking for:
library(plyr)
ddply(diam, "color",
      function(x) {
        w <- wilcox.test(price ~ cut, data = x)
        with(w, data.frame(statistic, p.value))
      })
(Substituting t.test for wilcox.test seems to work fine too.)
results:
  color statistic      p.value
1     D  339753.5 4.232833e-24
2     E  591104.5 6.789386e-19
3     F  731767.5 2.955504e-11
4     G  950008.0 1.176953e-12
5     H  611157.5 2.055857e-17
6     I  213019.0 3.299365e-04
7     J   56870.0 2.364026e-01
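As noted, the t.test variant is the same pattern, since t.test() also returns an htest object with statistic and p.value components:
ddply(diam, "color",
      function(x) {
        tt <- t.test(price ~ cut, data = x)
        with(tt, data.frame(statistic, p.value))
      })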
ddply returns a data frame as output and, assuming that I am reading your question properly, that isn't what you are looking for. I believe you would like to conduct a series of t-tests using a series of subsets of data, so the only real task is compiling a list of those subsets. Once you have them, you can use a function like lapply() to run a t-test for each subset in your list.
I am sure this isn't the most elegant solution, but one approach would be to create a list of unique pairs of your colors using a function like this:
get.pairs <- function(v){
  l <- length(v)
  n <- sum(1:l - 1)  # number of unique pairs, i.e. choose(l, 2)
  a <- vector("list", n)
  j <- 1
  k <- 2
  for (i in 1:n) {
    a[[i]] <- c(v[j], v[k])
    if (k < l) {
      k <- k + 1
    } else {
      j <- j + 1
      k <- j + 1
    }
  }
  return(a)
}
Now you can use that function to get your list of unique pairs of colors:
> (color.pairs <- get.pairs(levels(diam$color)))
[[1]]
[1] "D" "E"
[[2]]
[1] "D" "F"
...
[[21]]
[1] "I" "J"
Now you can use each of these lists to run a t.test (or whatever you would like) on your subset of your data frame, like so:
> t.test(price~cut,data=diam[diam$color %in% color.pairs[[1]],])
Welch Two Sample t-test
data: price by cut
t = 8.1594, df = 427.272, p-value = 3.801e-15
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
1008.014 1647.768
sample estimates:
mean in group Fair mean in group Ideal
3938.711 2610.820
Now use lapply() to run your test for each subset in your list of color pairs:
> lapply(color.pairs,function(x) t.test(price~cut,data=diam[diam$color %in% x,]))
[[1]]
Welch Two Sample t-test
data: price by cut
t = 8.1594, df = 427.272, p-value = 3.801e-15
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
1008.014 1647.768
sample estimates:
mean in group Fair mean in group Ideal
3938.711 2610.820
...
[[21]]
Welch Two Sample t-test
data: price by cut
t = 0.8813, df = 375.996, p-value = 0.3787
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-260.0170 682.3882
sample estimates:
mean in group Fair mean in group Ideal
4802.912 4591.726
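As a small extension, you can name the list by the color pairs so each printed result is self-describing:
tests <- lapply(color.pairs, function(x) t.test(price ~ cut, data = diam[diam$color %in% x, ]))
names(tests) <- sapply(color.pairs, paste, collapse = "-")
tests[["D-E"]]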

Bootstrapping to compare two groups

In the following code I use bootstrapping to calculate the C.I. and the p-value under the null hypothesis that two different fertilizers applied to tomato plants have no effect on plant yields (the alternative being that the "improved" fertilizer is better). The first random sample (x) comes from plants where a standard fertilizer has been used, while the second sample (y) comes from plants where an "improved" one has been used.
x <- c(11.4, 25.3, 29.9, 16.5, 21.1)
y <- c(23.7, 26.6, 28.5, 14.2, 17.9, 24.3)
total <- c(x, y)
library(boot)
# statistic: mean of the resampled y positions minus mean of the resampled x positions
diff <- function(x, i) mean(x[i[6:11]]) - mean(x[i[1:5]])
b <- boot(total, diff, R = 10000)
ci <- boot.ci(b)
p.value <- sum(b$t >= b$t0) / b$R
What I don't like about the code above is that resampling is done as if there was only one sample of 11 values (separating the first 5 as belonging to sample x leaving the rest to sample y).
Could you show me how this code should be modified in order to draw resamples of size 5 with replacement from the first sample and separate resamples of size 6 from the second sample, so that bootstrap resampling would mimic the “separate samples” design that produced the original data?
EDIT 2:
Hack deleted, as it was a wrong solution. Instead, one has to use the strata argument of the boot function:
total <- c(x,y)
id <- as.factor(c(rep("x",length(x)),rep("y",length(y))))
b <- boot(total, diff, strata=id, R = 10000)
...
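Put together, a runnable version of the stratified bootstrap, reusing the data and the diff statistic from the question:
library(boot)
x <- c(11.4, 25.3, 29.9, 16.5, 21.1)
y <- c(23.7, 26.6, 28.5, 14.2, 17.9, 24.3)
total <- c(x, y)
# with strata, positions 1:5 of each resample stay within x and 6:11 within y
diff <- function(x, i) mean(x[i[6:11]]) - mean(x[i[1:5]])
id <- as.factor(c(rep("x", length(x)), rep("y", length(y))))
b <- boot(total, diff, strata = id, R = 10000)
boot.ci(b)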
Be aware that you're not going to get even close to a correct estimate of your p-value:
x <- c(1.4,2.3,2.9,1.5,1.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
total <- c(x,y)
b <- boot(total, diff, strata=id, R = 10000)
ci <- boot.ci(b)
p.value <- sum(b$t>=b$t0)/b$R
> p.value
[1] 0.5162
How would you explain a p-value of 0.51 for two samples where all values of the second are higher than the highest value of the first?
The above code is fine for getting a (biased) estimate of the confidence interval, but the significance testing of the difference should be done by permutation over the complete dataset.
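A minimal sketch of such a permutation test, reusing x, y and total from above (one-sided, following the stated alternative that the improved fertilizer is better):
set.seed(42)  # for reproducibility
obs <- mean(y) - mean(x)  # observed difference in means
perm <- replicate(10000, {
  s <- sample(length(total), length(y))  # randomly relabel 6 of the 11 values as "y"
  mean(total[s]) - mean(total[-s])
})
p.value <- mean(perm >= obs)  # one-sided p-value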
Following John, I think the appropriate way to use the bootstrap to test whether the sums of these two different populations are significantly different is as follows:
library(boot)
x <- c(1.4, 2.3, 2.9, 1.5, 1.1)
y <- c(23.7, 26.6, 28.5, 14.2, 17.9, 24.3)
# boot's statistic must take the data and a vector of resample indices;
# passing `sum` directly would make boot call sum(data, indices)
sum_stat <- function(d, i) sum(d[i])
b_x <- boot(x, sum_stat, R = 10000)
b_y <- boot(y, sum_stat, R = 10000)
z <- (b_x$t0 - b_y$t0) / sqrt(var(b_x$t[, 1]) + var(b_y$t[, 1]))
pnorm(z)
So we can clearly reject the null hypothesis that they are the same population. I may have missed a degrees-of-freedom adjustment (I am not sure how bootstrapping works in that regard), but such an adjustment would not change your results drastically.
While the actual soil beds could be considered a stratified variable in some instances, this is not one of them. You only have the one manipulation, between the groups of plants. Therefore, your null hypothesis is that they really do come from the exact same population. Treating the items as if they're from a single set of 11 samples is the correct way to bootstrap in this case.
If you had two plots, and in each plot tried the different fertilizers over different seasons in a counterbalanced fashion, then the plots would be stratified samples and you'd want to treat them as such. But that isn't the case here.
