R - Error in t-test for loop between variables

I'm writing a for loop that calculates and prints t-test results, testing for the difference in means of each of several variables (faminc, fatheduc, motheduc, white, cigtax, cigprice) between smokers and non-smokers ("smoke": 0 = non-smoker, 1 = smoker).
Current code:
type <- c("faminc", "fatheduc", "motheduc", "white", "cigtax", "cigprice")
count <- 1
for (name in type) {
  temp <- subset(data, data[name] == 1)
  cat("For", name, "between smokers and non, the difference in means is: \n")
  print(t.test(temp$smoke))
  count <- count + 1
}
However, I feel that 'temp' doesn't belong here, and when running the code I get:
For faminc between smokers and non, the difference in means is:
Error in t.test.default(temp$smoke) : not enough 'x' observations
The simple code of
t.test(faminc~smoke,data=data)
does what I need, but I'd like to get some practice/better understanding of for loops.

Here is a solution that generates the output requested in the OP, using lapply() with the mtcars data set.
data(mtcars)
varList <- c("wt","disp","mpg")
results <- lapply(varList, function(x) {
  t.test(mtcars[[x]] ~ mtcars$am)
})
names(results) <- varList
for (i in 1:length(results)) {
  message(paste("for variable:", names(results[i]),
                "difference between manual and automatic transmissions is:"))
  print(results[[i]])
}
...and the output:
for variable: wt difference between manual and automatic transmissions is:
Welch Two Sample t-test
data: mtcars[[x]] by mtcars$am
t = 5.4939, df = 29.234, p-value = 6.272e-06
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.8525632 1.8632262
sample estimates:
mean in group 0 mean in group 1
3.768895 2.411000
for variable: disp difference between manual and automatic transmissions is:
Welch Two Sample t-test
data: mtcars[[x]] by mtcars$am
t = 4.1977, df = 29.258, p-value = 0.00023
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
75.32779 218.36857
sample estimates:
mean in group 0 mean in group 1
290.3789 143.5308
for variable: mpg difference between manual and automatic transmissions is:
Welch Two Sample t-test
data: mtcars[[x]] by mtcars$am
t = -3.7671, df = 18.332, p-value = 0.001374
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-11.280194 -3.209684
sample estimates:
mean in group 0 mean in group 1
17.14737 24.39231

Compare your code that works...
t.test(faminc~smoke,data=data)
You are specifying a relationship between variables (faminc~smoke), which means you think the mean of faminc differs between the values of smoke, and you wish to use the data dataset.
The equivalent line in your loop...
print(t.test(temp$smoke))
...only tests the single column temp$smoke after having selected those who have the value 1 for each of faminc, fatheduc, etc. So even if you wrote...
print(t.test(faminc~smoke, data=data))
...inside the loop, it would run the same test on faminc at every iteration rather than cycling through your variables. Further, your count is doing nothing.
If you want to perform a range of tests in this manner, you need to build the formula from the variable name on each iteration, since name is a character string rather than a column:
type <- c("faminc", "fatheduc", "motheduc", "white", "cigtax", "cigprice")
for (name in type) {
  cat("For", name, "between smokers and non, the difference in means is: \n")
  # construct e.g. faminc ~ smoke from the string held in `name`
  print(t.test(reformulate("smoke", response = name), data = data))
}
Whether this is what you want to do isn't clear to me, though; your variables suggest family income (faminc), father's education (fatheduc), mother's education (motheduc), ethnicity (white), tax (cigtax) and price (cigprice).
I can't think why you would want to compare the mean cigarette price or tax between smokers and non-smokers, because the latter are not going to have any value for these since they don't smoke!
Your code suggests these are perhaps binary variables, though (since you are filtering on each value being 1), which to me suggests this isn't even what you want to do.
If you wish to run tests across subsets of data, then a tidier approach than explicit loops is the purrr package.
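As a rough sketch (my addition, not part of the original answer), a purrr version of the loop above could look like the following; it assumes a data frame data containing the listed columns plus smoke:
library(purrr)
vars <- c("faminc", "fatheduc", "motheduc", "white", "cigtax", "cigprice")
# run each test once and keep the htest objects in a named list;
# no counter or temporary subset is needed
results <- map(vars, ~ t.test(reformulate("smoke", response = .x), data = data))
results <- set_names(results, vars)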
In future, when asking, consider providing a sample of data along with the full copied-and-pasted output, as advised in How to create a Minimal, Complete, and Verifiable example - Help Center - Stack Overflow. This allows people to see in greater detail what you are doing (e.g. I have only guessed at your variables). With statistics it's also useful to state what your hypothesis is, to help people understand what you are trying to achieve overall.

Related

Error when using the Benjamini-Hochberg false discovery rate in R after Wilcoxon Rank

I have carried out a Wilcoxon rank sum test to see if there is any significant difference in the expression of 598,019 genes between three disease samples and three control samples, working in R.
When I check how many genes have a p-value < 0.05, I get 41,913 altogether. I set the parameters of the Wilcoxon test as follows:
wilcox.test(currRow[4:6], currRow[1:3], paired=F, alternative="two.sided", exact=F, correct=F)$p.value
(This is within an apply function, and I can provide my full code if necessary; I was a little unsure whether alternative="two.sided" was correct.)
However, since I assumed that correcting for multiple comparisons using the Benjamini-Hochberg false discovery rate would lower this number, I adjusted the p-values via the following code:
pvaluesadjust1 <- p.adjust(pvalues_genes, method="BH")
Re-assessing which p-values are less than 0.05 via the code below, I get 0!
p_thresh1 <- 0.05
names(pvaluesadjust1) <- rownames(gene_analysis1)
output <- names(pvaluesadjust1)[pvaluesadjust1 < p_thresh1]
length(output)
I would be grateful if anybody could explain, or direct me to somewhere that can help me understand, what is going on!
Thank you.
(As an extra question: would a t-test be fine given the size of the data? The Anderson-Darling test showed that the underlying data are not normal, and I had far fewer genes below 0.05 using that test than with the Wilcoxon (around 2,000).)
Wilcoxon is a non-parametric test based on ranks. If you have only 6 samples, the best result you can get is tied ranks of 2,2,2 in disease versus 5,5,5 in control, or vice versa.
For example, try the parameters you used in your test on the random values below, and you will see that you get the same p-value of 0.02534732 both times.
wilcox.test(c(100,100,100),c(1,1,1),exact=F, correct=F)$p.value
wilcox.test(c(5,5,5),c(15,15,15),exact=F, correct=F)$p.value
So yes, with 598,019 genes you can get 41,913 raw p-values below 0.05, but these p-values are not low enough, and with FDR adjustment none will ever pass.
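A back-of-the-envelope calculation (my illustration, using the BH rule adjusted p ≈ p × n / rank) shows why: even if all 41,913 significant genes sat at the minimum achievable p-value, their adjusted values would land near 0.36.
# smallest possible BH-adjusted p-value if 41913 of 598019 tests
# all hit the minimum achievable raw p-value of 0.02534732
0.02534732 * 598019 / 41913
# ~0.36, far above 0.05, so no gene can survive the correction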
You are using the wrong test. To answer your second question, a t-test does not work well here because you don't have enough samples to estimate the standard deviation correctly. Below I show an example using DESeq2 to find differential genes.
library(zebrafishRNASeq)
data(zfGenes)
# remove spikeins
zfGenes = zfGenes[-grep("^ERCC", rownames(zfGenes)),]
head(zfGenes)
                   Ctl1 Ctl3 Ctl5 Trt9 Trt11 Trt13
ENSDARG00000000001  304  129  339  102    16   617
ENSDARG00000000002  605  637  406   82   230  1245
The first three are controls, the last three are treatment, like your dataset. To validate what I said before, you can see that if you run wilcox.test across all genes, the minimum value is 0.02534732:
all_pvalues = apply(zfGenes, 1, function(i){
  wilcox.test(i[1:3], i[4:6], exact=F, correct=F)$p.value
})
min(all_pvalues,na.rm=T)
# returns 0.02534732
So we proceed with DESeq2
library(DESeq2)
#create a data.frame to annotate your samples
DF = data.frame(id=colnames(zfGenes),type=rep(c("ctrl","treat"),each=3))
# run DESeq2
dds = DESeqDataSetFromMatrix(zfGenes,DF,~type)
dds = DESeq(dds)
summary(results(dds),alpha=0.05)
out of 25839 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up) : 69, 0.27%
LFC < 0 (down) : 47, 0.18%
outliers [1] : 1270, 4.9%
low counts [2] : 5930, 23%
(mean count < 7)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
So you do get hits that pass the FDR cutoff. Lastly, we can pull out the list of significant genes:
res = results(dds)
res[which(res$padj < 0.05),]

Change recursive vector to atomic vector for t-test

I'm new to R and am trying to run a t-test for two means. I keep getting the error is.atomic(x) is not TRUE. I know I need to make my data atomic, but I haven't found a way to do so online.
I've run code to check that the data is recursive, and tried as.data.frame(mydata).
titanic_summary <- data.frame(Outcome = c("Survived", "Died"),
                              Mean_Age = c(28.34369, 30.62618),
                              N = c(342, 549),
                              Total_Missing = c(52, 125))
titanic_summary
# Run a stats test (two-sample t-test)
str(titanic_summary)
as.data.frame(titanic_summary)
is.atomic(titanic_summary)
is.recursive(titanic_summary)
titanic_test <- titanic_summary %>%
  t.test(Outcome ~ Mean_Age)
Error in var(x) : is.atomic(x) is not TRUE
t.test does not work the way you seem to think. To avoid that particular error you could instead use something like titanic_test <- t.test(Mean_Age ~ Outcome, data = titanic_summary), but that would just give you different errors, which comes down to the real question:
You presumably want to see whether there may be a relationship between age and survival, i.e. whether the difference in average ages of 2.28249 is significant, but you will need the individual ages, or some other additional information about dispersion, for this.
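For instance (a hypothetical sketch, not part of the original answer), if the summary table also carried each group's standard deviation, Welch's statistic could be computed from the summaries alone; none of the argument names below come from the question:
welch_from_summary <- function(m1, m2, s1, s2, n1, n2) {
  # standard error of the difference in means
  se <- sqrt(s1^2 / n1 + s2^2 / n2)
  t  <- (m1 - m2) / se
  # Welch-Satterthwaite degrees of freedom
  df <- se^4 / ((s1^2 / n1)^2 / (n1 - 1) + (s2^2 / n2)^2 / (n2 - 1))
  c(t = t, df = df, p.value = 2 * pt(-abs(t), df))
}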
If you do use the detailed dataset then I suspect that what you really want is something like this:
library(titanic)
titanic_test <- t.test(Age ~ Survived, data = titanic_train)
which would give (for the Kaggle selected training set used in the titanic package)
> titanic_test
Welch Two Sample t-test
data: Age by Survived
t = 2.046, df = 598.84, p-value = 0.04119
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.09158472 4.47339446
sample estimates:
mean in group 0 mean in group 1
30.62618 28.34369

Is it possible to change Type I error threshold using t.test() function?

I am asked to compute a test statistic using the t.test() function, but I need to reduce the Type I error. My prof showed us how to change the confidence level for this function, but not the acceptable Type I error for null hypothesis testing. The goal is for the function to automatically compute a p-value based on a .01 error rate rather than the usual .05.
The R code below involves a data set that I have downloaded.
t.test(mid$log_radius_area, mu=8.456)
I feel like I've answered this somewhere, but can't seem to find it on SO or CrossValidated.
Similarly to this question, the answer is that t.test() doesn't specify any threshold for rejecting/failing to reject the null hypothesis; it reports a p-value, and you get to decide whether to reject or not. (The conf.level argument is for adjusting which confidence interval the output reports.)
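For example (my illustration, reusing the mid data set from the question), a 99% interval to match an alpha of 0.01 would be requested like this:
# only the reported confidence interval changes; the p-value is unaffected
t.test(mid$log_radius_area, mu = 8.456, conf.level = 0.99)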
From ?t.test:
t.test(1:10, y = c(7:20))
Welch Two Sample t-test
data: 1:10 and c(7:20)
t = -5.4349, df = 21.982, p-value = 1.855e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-11.052802 -4.947198
sample estimates:
mean of x mean of y
5.5 13.5
Here the p-value is reported as 1.855e-05, so the null hypothesis would be rejected for any (pre-specified) alpha level >1.855e-05. Note that the output doesn't say anywhere "the null hypothesis is rejected at alpha=0.05" or anything like that. You could write your own function to do that, using the $p.value element that is saved as part of the test results:
report_test <- function(tt, alpha = 0.05) {
  cat("the null hypothesis is ")
  if (tt$p.value > alpha) {
    cat("**NOT** ")
  }
  cat("rejected at alpha=", alpha, "\n")
}
tt <- t.test(1:10, y = c(7:20))
report_test(tt)
## the null hypothesis is rejected at alpha= 0.05
Most R package/function writers don't bother to do this, because they figure that it should be simple enough for users to do for themselves.

Grabbing certain results out of multiple t.test outputs to create a table

I have run 48 t-tests (coded by hand instead of writing a loop) and would like to splice out certain results of those t-tests to create a table of the things I'm most interested in.
Specifically, I would like to keep only the p-value, confidence interval, and the mean of x and mean of y for each of the 48 tests, and then build a table of the results.
Is there an elegant, quick way to do this beyond the top answer detailed here, wherein I would go in for all 48 tests and grab all three desired outputs with something along the lines of ttest$p.value? Perhaps a loop?
Below is a sample of the coded input for one t-test, followed by the output delivered by R.
# t.test comparing means of Change_Unemp for 2005 government employment (ix)
lowgov6 <- met_res[met_res$Gov_Emp_2005 <= 93310, "Change_Unemp"]
highgov6 <- met_res[met_res$Gov_Emp_2005 > 93310, "Change_Unemp"]
t.test(lowgov6,highgov6,pool.sd=FALSE,na.rm=TRUE)
Welch Two Sample t-test
data: lowgov6 and highgov6
t = 1.5896, df = 78.978, p-value = 0.1159
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.1813909 1.6198399
sample estimates:
mean of x mean of y
4.761224 4.042000
Save all of your t-tests into a list:
tests <- list()
tests[[1]] <- t.test(lowgov6,highgov6,pool.sd=FALSE,na.rm=TRUE)
# repeat for all tests
# there are probably faster ways than doing all of that by hand
# extract your values using `sapply`
sapply(tests, function(x) {
  c(x$estimate[1],
    x$estimate[2],
    ci.lower = x$conf.int[1],
    ci.upper = x$conf.int[2],
    p.value = x$p.value)
})
The output is something like the following:
[,1] [,2]
mean of x 0.12095949 0.03029474
mean of y -0.05337072 0.07226999
ci.lower -0.11448679 -0.31771191
ci.upper 0.46314721 0.23376141
p.value 0.23534905 0.76434012
But it will have 48 columns. You can t() the result if you'd like it transposed.
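An alternative sketch (my addition; it assumes the tests list built above) uses the broom package, whose tidy() flattens each htest object into a one-row data frame:
library(broom)
# stack the one-row summaries into a single table, one row per test
do.call(rbind, lapply(tests, tidy))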

Bootstrapping to compare two groups

In the following code I use bootstrapping to calculate the C.I. and the p-value under the null hypothesis that two different fertilizers applied to tomato plants have no effect on plant yields (the alternative being that the "improved" fertilizer is better). The first sample (x) comes from plants given a standard fertilizer, while the second sample (y) comes from plants given the "improved" one.
x <- c(11.4,25.3,29.9,16.5,21.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
total <- c(x,y)
library(boot)
diff <- function(x,i) mean(x[i[6:11]]) - mean(x[i[1:5]])
b <- boot(total, diff, R = 10000)
ci <- boot.ci(b)
p.value <- sum(b$t>=b$t0)/b$R
What I don't like about the code above is that resampling is done as if there was only one sample of 11 values (separating the first 5 as belonging to sample x leaving the rest to sample y).
Could you show me how this code should be modified in order to draw resamples of size 5 with replacement from the first sample and separate resamples of size 6 from the second sample, so that bootstrap resampling would mimic the “separate samples” design that produced the original data?
EDIT 2:
Hack deleted, as it was a wrong solution. Instead one has to use the strata argument of the boot function:
total <- c(x,y)
id <- as.factor(c(rep("x",length(x)),rep("y",length(y))))
b <- boot(total, diff, strata=id, R = 10000)
...
Be aware that you're not going to get even close to a correct estimate of your p-value:
x <- c(1.4,2.3,2.9,1.5,1.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
total <- c(x,y)
b <- boot(total, diff, strata=id, R = 10000)
ci <- boot.ci(b)
p.value <- sum(b$t>=b$t0)/b$R
> p.value
[1] 0.5162
How would you explain a p-value of 0.51 for two samples where all values of the second are higher than the highest value of the first?
The above code is fine for getting a (biased) estimate of the confidence interval, but the significance testing of the difference should be done by permutation over the complete dataset.
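A minimal permutation-test sketch of that idea (my illustration; it reuses x, y and total from above) shuffles the pooled values, reassigns groups of the original sizes, and counts how often a difference at least as large as the observed one arises:
set.seed(1)
obs <- mean(y) - mean(x)
perm <- replicate(10000, {
  s <- sample(total)            # permute labels over the complete dataset
  mean(s[6:11]) - mean(s[1:5])  # difference in means under the null
})
p.value <- mean(perm >= obs)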
Following John, I think the appropriate way to use the bootstrap to test whether the sums of these two different populations are significantly different is as follows:
x <- c(1.4,2.3,2.9,1.5,1.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
b_x <- boot(x, sum, R = 10000)
b_y <- boot(y, sum, R = 10000)
z<-(b_x$t0-b_y$t0)/sqrt(var(b_x$t[,1])+var(b_y$t[,1]))
pnorm(z)
So we can clearly reject the null hypothesis that they are the same population. I may have missed a degrees-of-freedom adjustment (I am not sure how bootstrapping works in that regard), but such an adjustment would not change your results drastically.
While the actual soil beds could be considered a stratified variable in some instances, this is not one of them. You only have the one manipulation, between the groups of plants. Therefore, your null hypothesis is that they really do come from the exact same population, and treating the items as if they're from a single set of 11 samples is the correct way to bootstrap in this case.
If you had two plots, and in each plot tried the different fertilizers over different seasons in a counterbalanced fashion, then the plots would be stratified samples and you'd want to treat them as such. But that isn't the case here.
