computing T-test with help of apply function

I have a matrix :
>data
A A A B B C
gene1 1 6 11 16 21 26
gene2 2 7 12 17 22 27
gene3 3 8 13 18 23 28
gene4 4 9 14 19 24 29
gene5 5 10 15 20 25 30
I want to test, for each gene (row), whether the mean value differs between groups, and I want to use a t-test for it. The function should take all columns belonging to group A, all columns belonging to group B, all columns belonging to group C, and so on, and calculate the t-test between each pair of groups for each gene. (Every group contains several columns.)
One implementation, which I got from an answer to my previous post, is:
Results <- combn(colnames(data), 2, function(x) t.test(data[,x]), simplify = FALSE)
sapply(Results, "[", c("statistic", "p.value"))
But it computes the test between all pairs of columns rather than between groups for every row. Can somebody help me modify this code to calculate the t-test between groups for data like mine?

Maybe this can be useful:
> Mat <- matrix(1:20, nrow=4, dimnames=list(NULL, letters[1:5]))
> # t.test
> Results <- combn(colnames(Mat), 2, function(x) t.test(Mat[,x]), simplify = FALSE)
> names(Results) <- combn(colnames(Mat), 2, paste0, collapse = "~") # label each result with its column pair
> Results # Only the first element of the `Results` is shown
$`a~b` # t.test applied to a and b
One Sample t-test
data: Mat[, x]
t = 5.1962, df = 7, p-value = 0.001258
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
2.452175 6.547825
sample estimates:
mean of x
4.5
...
A nicer output:
> sapply(Results, "[", c("statistic", "p.value"))
a~b a~c a~d a~e b~c b~d b~e c~d
statistic 5.196152 4.140643 3.684723 3.439126 9.814955 6.688732 5.41871 14.43376
p.value 0.00125832 0.004345666 0.007810816 0.01085005 2.41943e-05 0.0002803283 0.0009884764 1.825796e-06
c~e d~e
statistic 9.23682 19.05256
p.value 3.601564e-05 2.730801e-07

Almost there. With apply, extra arguments for the function are not given inside the function call but after it, as further arguments to apply:
data <- matrix(1:20, 4, 5)
Tscore <- apply(data, 2, t.test, alternative = "two.sided", mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95)
To check that this is what you wanted, compare with the t statistic for a single column:
t.test(data[,1], alternative = "two.sided", mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95)
I may have misunderstood the question, though: I just implemented your y = NULL case, a one-sample t-test on each column.
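For the grouped comparison the question actually asks about, one possible approach is to use the column names as group labels and run a two-sample t-test per row for every pair of groups. This is only a minimal sketch: the names `groups`, `usable`, `group_pairs` and `res` are illustrative, and it assumes that groups with a single column (C in the sample data) should be skipped, since t.test needs at least two observations on each side.

```r
# Sample matrix from the question: rows are genes, column names are group labels
data <- matrix(1:30, nrow = 5,
               dimnames = list(paste0("gene", 1:5),
                               c("A", "A", "A", "B", "B", "C")))
groups <- colnames(data)

# Keep only groups with at least two columns (assumption: others are skipped)
tab <- table(groups)
usable <- names(tab)[tab >= 2]
group_pairs <- combn(usable, 2, simplify = FALSE)

# For each pair of groups, run a Welch t-test on every row
res <- lapply(group_pairs, function(p) {
  t(apply(data, 1, function(row) {
    tt <- t.test(row[groups == p[1]], row[groups == p[2]])
    c(statistic = unname(tt$statistic), p.value = tt$p.value)
  }))
})
names(res) <- vapply(group_pairs, paste, character(1), collapse = "~")

res[["A~B"]]  # one row per gene, columns statistic and p.value
```

Each element of `res` is a genes-by-2 matrix, so the p-values for one comparison can be passed directly to p.adjust if a multiple-testing correction is wanted.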

Related

Need help to perform an automated unpaired t-test over different columns from CSV document in R

I would like to perform an automated paired t-test between columns 2 and 3, 4 and 5, 6 and 7, and so on. When I use the code below, I am able to perform a t-test, but not a paired t-test.
data:
patient weight_1 weight_2 BMI_1 BMI_2 chol_1 chol_2 gly_1 gly_2
1 A 86.0 97.0 34.44961 30.61482 86.0 97.0 34.44961 30.61482
2 B 111.0 55.5 33.51045 22.80572 111.0 55.5 33.51045 22.80572
3 C 92.4 70.0 28.51852 25.71166 92.4 70.0 28.51852 25.71166
code:
names <- colnames(dataframe)
for(i in seq(from = 2, to = 8, by = 2)){
print(names[i])
print(names[i+1])
print(t.test(dataframe[i], dataframe[i+1]))
}
output:
[1] "weight_1"
[1] "weight_2"
Welch Two Sample t-test
data: dataframe[i] and dataframe[i + 1]
t = 1.3183, df = 75.892, p-value = 0.1914
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.459965 12.090735
sample estimates:
mean of x mean of y
91.50256 86.68718
[1] "BMI_1"
[1] "BMI_2"
Welch Two Sample t-test
data: dataframe[i] and dataframe[i + 1]
t = 1.5851, df = 75.866, p-value = 0.1171
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.3817027 3.3571650
sample estimates:
mean of x mean of y
30.45167 28.96394
And so on. When I add paired=TRUE to the code:
names <- colnames(dataframe)
for(i in seq(from = 2, to = 8, by = 2)){
print(names[i])
print(names[i+1])
print(t.test(dataframe[i], dataframe[i+1]), paired=TRUE)
}
The results are exactly the same, as if it doesn't include the paired function. Could someone help me with this? Many thanks in advance.

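You have to change the indexing in the t.test call so that the columns are passed as vectors rather than as one-column data frames. Note also that in your second loop paired=TRUE is an argument of print, not of t.test, so it is silently ignored: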
e.g.:
df <- data.frame(a = runif(10), b=runif(10), c=runif(10))
t1 <- t.test(df[1], df[2])
t1$p.value
t2 <- t.test(df[1], df[2], paired=T)
t2$p.value
Error in `[.data.frame`(y, yok) : undefined columns selected
but
t2 <- t.test(df[,1], df[,2], paired=T)
t2$p.value
works. So in your code it should be
print(t.test(dataframe[,i], dataframe[,i+1], paired=TRUE))
for the paired t.test.
I would suggest using this form of indexing for the unpaired t-test as well, although the other form does not throw an error there.
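Putting the pieces together, the loop can also collect the paired results in one data frame instead of printing them. The following is a sketch only: the simulated `dataframe` stands in for the CSV from the question, and `first` and `paired_results` are illustrative names.

```r
# Simulated stand-in for the patient data (assumption: before/after columns
# sit side by side, starting at column 2)
set.seed(1)
dataframe <- data.frame(
  patient  = LETTERS[1:10],
  weight_1 = rnorm(10, 90, 10), weight_2 = rnorm(10, 85, 10),
  BMI_1    = rnorm(10, 30, 3),  BMI_2    = rnorm(10, 29, 3)
)

# Index of the first column of each pair: 2, 4, ...
first <- seq(2, ncol(dataframe) - 1, by = 2)

# One paired t-test per column pair, collected row by row
paired_results <- do.call(rbind, lapply(first, function(i) {
  tt <- t.test(dataframe[, i], dataframe[, i + 1], paired = TRUE)
  data.frame(x = names(dataframe)[i],
             y = names(dataframe)[i + 1],
             statistic = unname(tt$statistic),
             df = unname(tt$parameter),
             p.value = tt$p.value)
}))
paired_results
```

A quick way to confirm that the test really is paired is the df column: for a paired test it equals the number of patients minus one, whereas the Welch test reports a fractional value.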

Extracting statistical values from a list with multiple lists of results of statistical test

I did a Ljung-Box test for independence in R with 36 lags and stored the results in a list.
for (lag in c(1:36)){
box.test.list[[lag]] <- (Box.test(btcr, type = "Ljung", lag))
}
I want to extract the p-values as well as the test statistic (X-squared) and print them out to look something like:
X-squared = 100, p-value = 0.0001
I also want to pull out the p-values individually, but rather than just spitting out numbers, I want something like:
[1] p-value = 0.001
[2] p-value = 0.0001
and so on. Can this be done?

With the test data
set.seed(7)
btcr <- rnorm(100)
you can perform all your tests with
box.test.list <- lapply(1:36, function(i) Box.test(btcr, type = "Ljung", i))
and then put all the results in a data.frame with
results <- data.frame(
lag = 1:36,
xsquared = sapply(box.test.list, "[[", "statistic"),
pvalue = sapply(box.test.list, "[[", "p.value")
)
Then you can do what you like with the results
head(results)
# lag xsquared pvalue
# 1 1 3.659102 0.05576369
# 2 2 7.868083 0.01956444
# 3 3 8.822760 0.03174261
# 4 4 9.654935 0.04665920
# 5 5 11.190969 0.04772238
# 6 6 12.607454 0.04971085
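To get the labelled text the question asks for, the same list can be formatted with sprintf. This is a rough sketch using the test data from above; the number formats (%.4f, %.4g) and the names `stat_lines` and `pv_lines` are arbitrary choices, not anything prescribed by Box.test.

```r
set.seed(7)
btcr <- rnorm(100)
box.test.list <- lapply(1:36, function(i) Box.test(btcr, lag = i, type = "Ljung-Box"))

# One line per lag: "X-squared = ..., p-value = ..."
stat_lines <- vapply(box.test.list, function(bt) {
  sprintf("X-squared = %.4f, p-value = %.4g", bt$statistic, bt$p.value)
}, character(1))

# p-values only, prefixed with the lag index in brackets
pv <- vapply(box.test.list, `[[`, numeric(1), "p.value")
pv_lines <- sprintf("[%d] p-value = %.4g", seq_along(pv), pv)

cat(head(stat_lines, 3), sep = "\n")
cat(head(pv_lines, 3), sep = "\n")
```

Using cat rather than print avoids the quotes and the automatic [1] index that print adds to character vectors.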

T test in R over large data frame

I'm attempting to run a t-test over a large data frame. The data frame contains CpG sites in the columns and the case/control groups in the rows.
Sample of the data:
Type cg00000029 cg00000108 cg00000109 cg00000165 cg00000236 cg00000289
1 Normal.01 0.32605 0.89785 0.73910 0.30960 0.80654 0.60874
2 Normal.05 0.28981 0.89931 0.72506 0.29963 0.81649 0.62527
3 Normal.11 0.25767 0.90689 0.77163 0.27489 0.83556 0.66264
4 Normal.15 0.26599 0.89893 0.75909 0.30317 0.81778 0.71451
5 Normal.18 0.29924 0.89284 0.75974 0.33740 0.83017 0.69799
6 Normal.20 0.27242 0.90849 0.76260 0.27898 0.84248 0.68689
7 Normal.21 0.22222 0.89940 0.72887 0.25004 0.80569 0.69102
8 Normal.22 0.28861 0.89895 0.80707 0.42462 0.86252 0.61141
9 Normal.24 0.43764 0.89720 0.82701 0.35888 0.78328 0.65301
10 Normal.57 0.26827 0.91092 0.73839 0.30372 0.81349 0.66338
There are 10 "normal" types and 62 "case" types (normal = rows 1-10, case = rows 11-72).
I attempted to run the following t-test on the 16384 CpG sites, but it only returned 72 p-values:
t.result <- apply(data[1:72,], 2, function (x) t.test(x[1:10],x[11:72],paired=FALSE))
data$p_value <- unlist(lapply(t.result, function(x) x$p.value))
data$fdr <- p.adjust(data$p_value, method = "fdr")
Any help would be much appreciated.

Probably you want something like this:
set.seed(1)
data <- matrix(runif(72*16384), nrow=72) # some random data as surrogate for your original data
indices <- expand.grid(1:10, 11:72) # generate all indices of pairs for t-test
t.result <- apply(indices, 1, function (x) t.test(data[x[1],],data[x[2],],paired=FALSE))
p_values <- unlist(lapply(t.result, function(x) x$p.value))
p_fdr <- p.adjust(p_values, method = "fdr")
hist(p_fdr, col='red', xlim=c(0,1), xlab='p-value', main='Histogram of p-values')
hist(p_values, add=TRUE, col=rgb(0, 1, 0, 0.5))
legend('topleft', legend=c('fdr-adjusted', 'unadjusted'), col=c('red', rgb(0, 1, 0, 0.5)), lwd=2) # red = p_fdr, green = p_values
As expected, almost all of the false positives were eliminated with FDR adjusting of the p-values.
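If the goal is instead one normal-versus-case test per CpG site, as the question describes, the test can be applied column by column and the results kept in their own table. A data frame with 72 rows cannot hold one p-value per site as a new column, because there are as many p-values as columns, not rows. The sketch below makes several assumptions: random data replaces the real methylation values, only 50 sites are simulated, the Type column has already been dropped so the matrix is purely numeric, and `beta`, `is_normal` and `site_results` are illustrative names.

```r
set.seed(1)
n_sites <- 50                                    # assumption: small stand-in for 16384 sites
beta <- matrix(runif(72 * n_sites), nrow = 72,
               dimnames = list(NULL, sprintf("cg%08d", seq_len(n_sites))))

# Rows 1-10 are normal samples, rows 11-72 are cases
is_normal <- seq_len(nrow(beta)) <= 10

# One unpaired (Welch) t-test per column, i.e. per CpG site
p_value <- apply(beta, 2, function(x) t.test(x[is_normal], x[!is_normal])$p.value)

# Store per-site results separately from the per-sample data
site_results <- data.frame(site = colnames(beta),
                           p_value = p_value,
                           fdr = p.adjust(p_value, method = "fdr"))
head(site_results)
```

Keeping the results in `site_results` with one row per site avoids the length mismatch that arises when assigning the p-values back to the sample-by-site data frame.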

Store or print results for 't.test' in for loop

I am new to R and having a problem with printing the results of 'for' loop in R. Here is my code:
afile <- read.table(file = 'data.txt', head =T)##Has three columns Lab, Store and Batch
lab1 <- afile$Lab[afile$Batch == 1]
lab2 <- afile$Lab[afile$Batch == 2]
lab3 <- afile$Lab[afile$Batch == 3]
lab_list <- list(lab1,lab2,lab3)
for (i in 1:2){
x=lab_list[[i]]
y=lab_list[[i+1]]
t.test(x,y,alternative='two.sided',conf.level=0.95)
}
This code runs without any error but produces no output on screen. I tried taking results in a variable using 'assign' but that produces error:
for (i in 1:2){x=lab_list[[i]];y=lab_list[[i+1]];assign(paste(res,i,sep=''),t.test(x,y,alternative='two.sided',conf.level=0.95))}
Warning messages:
1: In assign(paste(res, i, sep = ""), t.test(x, y, alternative = "two.sided", :
only the first element is used as variable name
2: In assign(paste(res, i, sep = ""), t.test(x, y, alternative = "two.sided", :
only the first element is used as variable name
Please help me understand how I can perform t.test in a loop and get the results, i.e. print them on screen or save them in a variable.
AK

I would rewrite your code as follows.
I assume your data looks like this:
afile <- data.frame(Batch= sample(1:3,10,rep=TRUE),lab=rnorm(10))
afile
Batch lab
1 2 0.4075675
2 1 0.3006192
3 1 -0.4824655
4 3 1.0656481
5 1 0.1741648
6 2 -1.4911526
7 2 0.2216970
8 1 -0.3862147
9 1 -0.4578520
10 1 -0.6298040
Then using lapply you can store your result in a list :
lapply(1:2,function(i){
x <- subset(afile,Batch==i)
y <- subset(afile,Batch==i+1)
t.test(x,y,alternative='two.sided',conf.level=0.95)
})
[[1]]
Welch Two Sample t-test
data: x and y
t = -0.7829, df = 6.257, p-value = 0.4623
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.964637 1.005008
sample estimates:
mean of x mean of y
0.3765373 0.8563520
[[2]]
Welch Two Sample t-test
data: x and y
t = -1.0439, df = 1.797, p-value = 0.4165
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-6.588720 4.235776
sample estimates:
mean of x mean of y
0.856352 2.032824
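One caveat with the lapply call above: subset() returns a data frame, so the Batch column is handed to t.test together with lab. A variant that tests only the measurement column might look like the sketch below. It assumes simulated data with five observations per batch (the sample above has a single row for batch 3, which is too few for a t-test), uses the column name Lab from the question, and the names `lab_list` and `results` are illustrative.

```r
# Simulated stand-in for data.txt (assumption: columns Batch and Lab)
set.seed(42)
afile <- data.frame(Batch = rep(1:3, each = 5), Lab = rnorm(15))

# One numeric vector of Lab values per batch
lab_list <- split(afile$Lab, afile$Batch)

# Compare consecutive batches and keep the htest objects in a named list
results <- lapply(1:2, function(i) {
  t.test(lab_list[[i]], lab_list[[i + 1]],
         alternative = "two.sided", conf.level = 0.95)
})
names(results) <- c("1 vs 2", "2 vs 3")

# Compact summary: one column per comparison
sapply(results, function(tt) c(statistic = unname(tt$statistic),
                               p.value = tt$p.value))
```

Because lapply returns its values, the list prints automatically at the top level, which also sidesteps the missing output seen with the for loop.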

Inside a for loop, R does not automatically print the value of an expression, so you need to print your results explicitly. Try:
print(t.test(x,y,alternative='two.sided',conf.level=0.95))
or
print(summary(t.test(x,y,alternative='two.sided',conf.level=0.95)))

In addition to Hansons' solution of printing, the results can be saved in a list and printed afterwards:
result <- vector("list", length(lab_list) - 1) # one slot per pair of consecutive batches
for (i in seq_along(result)) {
result[[i]] <- t.test(lab_list[[i]], lab_list[[i+1]], alternative='two.sided', conf.level=0.95)
}
result
AK

R For loop to perform Fisher's test - Error message

My data frame looks like that:
595.00000 18696 984.00200 32185 Group1
935.00000 18356 1589.00000 31580 Group2
40.00010 19251 73.00000 33096 Group3
1058.00000 18233 1930.00000 31239 Group4
19.00000 19272 27.00000 33142 Group5
1225.00000 18066 2149.00000 31020 Group6
....
For every group I want to do Fisher exact test.
table <- matrix(c(595.00000, 984.00200, 18696, 32185), ncol=2, byrow=T)
Group1 <- fisher.test(table, alternative="greater")
Tried to loop over the data frame with:
for (i in 1:nrow(data.frame))
{
table= matrix(c(data.frame$V1, data.frame$V2, data.frame$V3, data.frame$V4), ncol=2, byrow=T)
fisher.test(table, alternative="greater")
}
But got error message
Error in fisher.test(table, alternative = "greater") :
FEXACT error 40.
Out of workspace.
In addition: Warning message:
In fisher.test(table, alternative = "greater") :
'x' has been rounded to integer: Mean relative difference: 2.123828e-06
How can I fix this problem or maybe do another way of looping over the data?

Your first error is: Out of workspace
?fisher.test
fisher.test(x, y = NULL, workspace = 200000, hybrid = FALSE,
control = list(), or = 1, alternative = "two.sided",
conf.int = TRUE, conf.level = 0.95,
simulate.p.value = FALSE, B = 2000)
You should try increasing the workspace (default = 2e5).
However, this happens in your case because you have really huge values. As a rule of thumb, if all elements of your matrix are > 5 (or in your case 10, because d.f. = 1), then you can safely approximate it with a chi-square test of independence using chisq.test. For your case, I think you should rather use chisq.test.
The warning message appears because your values are not integers (595.000 etc.). So, if you really want to apply fisher.test row by row, do this (assuming your data is in df and is a data.frame):
# fisher.test with bigger workspace
apply(as.matrix(df[,1:4]), 1, function(x)
fisher.test(matrix(round(x), ncol=2), workspace=1e9)$p.value)
Or, if you would rather substitute chisq.test (which I think you should for these huge values, for a performance gain without significant differences in the p-values):
apply(as.matrix(df[,1:4]), 1, function(x)
chisq.test(matrix(round(x), ncol=2))$p.value)
This will extract the p-values.
Edit 1: I just noticed that you use a one-sided Fisher's exact test. Maybe you should continue using Fisher's test with a bigger workspace, as I'm not sure there is a one-sided chi-square test of independence: its p-value is already calculated from the right-tail probability, and you cannot simply divide the p-values by 2 because the distribution is asymmetric.
Edit 2: Since you require the group name with the p-values and you already have a data.frame, I suggest you use data.table package as follows:
# example data
set.seed(45)
df <- as.data.frame(matrix(sample(10:200, 20), ncol=4))
df$grp <- paste0("group", 1:nrow(df))
# load package
require(data.table)
dt <- data.table(df, key="grp")
dt[, p.val := fisher.test(matrix(c(V1, V2, V3, V4), ncol=2),
workspace=1e9)$p.value, by=grp]
> dt
# V1 V2 V3 V4 grp p.val
# 1: 130 65 76 82 group1 5.086256e-04
# 2: 70 52 168 178 group2 1.139934e-01
# 3: 55 112 195 34 group3 7.161604e-27
# 4: 81 43 91 80 group4 4.229546e-02
# 5: 75 10 86 50 group5 4.212769e-05
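For completeness, the row-wise apply approach can be combined with the one-sided alternative from the question. The sketch below uses the first three rows of the data shown in the question; the column names V1 to V4 and grp are assumed, and the workspace argument is left at its default on the understanding that, per the fisher.test documentation, it only matters for tables larger than 2 by 2.

```r
# First three rows of the question's data (assumed column names V1..V4, grp)
df <- data.frame(
  V1  = c(595, 935, 40.0001),
  V2  = c(18696, 18356, 19251),
  V3  = c(984.002, 1589, 73),
  V4  = c(32185, 31580, 33096),
  grp = c("Group1", "Group2", "Group3")
)

# One 2x2 table per row; round() avoids the "rounded to integer" warning
df$p.val <- apply(as.matrix(df[, 1:4]), 1, function(x) {
  fisher.test(matrix(round(x), ncol = 2), alternative = "greater")$p.value
})
df[, c("grp", "p.val")]
```

Filling the matrix column-wise from V1, V2, V3, V4 gives the same 2 by 2 layout as the question's byrow = TRUE call with the values ordered V1, V3, V2, V4.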
