Perform a two-sample t-test and output the t and p values for two groups of matrices in R

I am trying to perform a two-sample t-test on matrix data.
The wt1, wt2, wt3, mut1, mut2, mut3 objects are 3x3 matrices. After running the t-tests, I would like to get t.stat and p.value matrices, in which
t.stat[i,j] <- the t value from t.test(c(wt1[i,j],wt2[i,j],wt3[i,j]),c(mut1[i,j],mut2[i,j],mut3[i,j]))
p.value[i,j] <- the p-value from t.test(c(wt1[i,j],wt2[i,j],wt3[i,j]),c(mut1[i,j],mut2[i,j],mut3[i,j]))
with i and j indicating the row and column indices.
Is there an efficient way to achieve this without a loop?
Thank you very much for the help, it works!
Now I have found that the values on the diagonals of my data are all 1, which results in Error in t.test.default(c(wt1[x], wt2[x], wt3[x]), c(mut1[x], mut2[x], :
data are essentially constant.
In order to get past those errors, I would like to output NA in t.stat and p.value. If the matrices have to contain the same type of values, 0 and 1 can be used for t.stat and p.value, respectively. It seems that tryCatch can do the job, but I am not sure how to combine it with sapply.

You can do something like this:
# Run one t-test per cell position; sapply collects the components of
# each htest object into the columns of a matrix
test <- sapply(1:9, function(x) t.test(c(wt1[x], wt2[x], wt3[x]),
                                       c(mut1[x], mut2[x], mut3[x])))
# Pull out the statistic and p-value rows and reshape them back to 3x3
t.stat  <- matrix(unlist(test["statistic", ]), nrow = 3)
p.value <- matrix(unlist(test["p.value", ]), nrow = 3)
For the second part of your question, I think using tryCatch inside sapply will help. Unfortunately, I couldn't think of a way to pre-allocate test and then create the two matrices while using tryCatch, so I am adapting Aaron's answer instead.
# Return NA for any cell where t.test() fails (e.g. constant data)
t.stat <- matrix(sapply(1:9, function(x)
  tryCatch(t.test(c(wt1[x], wt2[x], wt3[x]),
                  c(mut1[x], mut2[x], mut3[x]))$statistic,
           error = function(err) NA)), nrow = 3)
p.value <- matrix(sapply(1:9, function(x)
  tryCatch(t.test(c(wt1[x], wt2[x], wt3[x]),
                  c(mut1[x], mut2[x], mut3[x]))$p.value,
           error = function(err) NA)), nrow = 3)
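If you do need the 0/1 placeholders mentioned in the question rather than NA, a minimal follow-up sketch (assuming the t.stat and p.value matrices built above) is to substitute the NAs afterwards:
t.stat[is.na(t.stat)] <- 0   # placeholder t statistic for failed tests
p.value[is.na(p.value)] <- 1 # a p-value of 1 means no evidence of a difference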

Related

Calculate Errors using loop function in R

I have two data matrices, both with the same dimensions. I want to extract the same series of column vectors from each, treat each series as a vector, and then calculate different error measures, for example mean absolute error (mae), mean percentage error (mape), and root mean square error (rmse). My data matrices are quite large, so I will explain with an example and calculate these errors manually:
mat1 <- matrix(6:75, ncol = 10, byrow = TRUE)
mat2 <- matrix(30:99, ncol = 10, byrow = TRUE)
mat1_seri1 <- as.vector(mat1[, c(1 + (0:4)*2)]) # odd columns of mat1
mat1_seri2 <- as.vector(mat1[, c(2 + (0:4)*2)]) # even columns of mat1
mat2_seri1 <- as.vector(mat2[, c(1 + (0:4)*2)]) # odd columns of mat2
mat2_seri2 <- as.vector(mat2[, c(2 + (0:4)*2)]) # even columns of mat2
mae1 <- mean(abs(mat1_seri1 - mat2_seri1))
mae2 <- mean(abs(mat1_seri2 - mat2_seri2))
For mape:
mape1 <- mean(abs(mat1_seri1 - mat2_seri1)/mat1_seri1)*100
mape2 <- mean(abs(mat1_seri2 - mat2_seri2)/mat1_seri2)*100
Similarly, I calculate rmse from its formula. Since my real data matrices are large, doing this manually is quite time-consuming. Is it possible to do this using a loop which outputs the error terms (mae, mape, rmse) for each series separately?
I'm not sure if this is what you are looking for, but here is a function that could automate the process; maybe there is also a better way:
fn <- function(m1, m2) {
  stopifnot(dim(m1) == dim(m2))
  # Odd-numbered columns form series 1, even-numbered columns form series 2
  mat1_seri1 <- as.vector(m1[, (1:ncol(m1))[(1:ncol(m1)) %% 2 != 0]])
  mat1_seri2 <- as.vector(m1[, (1:ncol(m1))[!(1:ncol(m1)) %% 2]])
  mat2_seri1 <- as.vector(m2[, (1:ncol(m2))[(1:ncol(m2)) %% 2 != 0]])
  mat2_seri2 <- as.vector(m2[, (1:ncol(m2))[!(1:ncol(m2)) %% 2]])
  mae1 <- mean(abs(mat1_seri1 - mat2_seri1))
  mae2 <- mean(abs(mat1_seri2 - mat2_seri2))
  mape1 <- mean(abs(mat1_seri1 - mat2_seri1)/mat1_seri1)*100
  mape2 <- mean(abs(mat1_seri2 - mat2_seri2)/mat1_seri2)*100
  setNames(as.data.frame(matrix(c(mae1, mae2, mape1, mape2), ncol = 4)),
           c("mae1", "mae2", "mape1", "mape2"))
}
fn(mat1, mat2)
  mae1 mae2    mape1    mape2
1   24   24 92.62581 86.89572
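The question also asks for rmse, which can be bolted on inside fn with the same vectors; a sketch following the definitions above:
rmse1 <- sqrt(mean((mat1_seri1 - mat2_seri1)^2)) # root mean square error, series 1
rmse2 <- sqrt(mean((mat1_seri2 - mat2_seri2)^2)) # root mean square error, series 2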

Multiple p-values from shapiro.test with R

I want to get the p-values of multiple Shapiro tests. I want to test the normality of 20 columns (4 through 23) of a data frame called bladder, then get the p-value of each test programmatically and store it. I'm trying something like:
ttest20 <- apply(bladder[4:23], 2, shapiro.test)
pVals <- numeric()
for(i in 1:length(ttest20)){
  pVals <- ttest20[i]$p.value
}
but the last line doesn't store all the p-values.
Could someone help me? Thanks a lot.
I just found the answer; here it is:
ttest20 <- apply(bladder[, 4:23], 2, function(x) shapiro.test(x)$p.value)
This returns the p-values from all the shapiro.test calls.
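For what it's worth, the original loop fails because single brackets (ttest20[i]) return a one-element sublist rather than the htest object, and because pVals is overwritten on every iteration. If you have already stored the full test objects, you can still extract the p-values afterwards; a sketch:
pVals <- sapply(ttest20, function(tt) tt$p.value) # one p-value per column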

R: how to perform more complex calculations from a combn of a dataset?

Right now, I have a combn of the built-in dataset iris. So far, I have been guided into being able to find the coefficient of lm() for each pair of variables.
myPairs <- combn(names(iris[1:4]), 2)
formula <- apply(myPairs, MARGIN=2, FUN=paste, collapse="~")
model <- lapply(formula, function(x) lm(formula=x, data=iris)$coefficients[2])
model
However, I would like to go a few steps further and use the coefficient from lm() in further calculations. I would like to do something like this:
coefficient <- lm(formula=x, data=iris)$coefficients[2]
Spread <- myPairs[1] - coefficient*myPairs[2]
library(tseries)
adf.test(Spread)
The procedure itself is simple enough, but I haven't been able to find a way to do this for each combn pair in the dataset. (As a side note, adf.test would not normally be applied to such data; I'm just using the iris dataset for demonstration.)
I'm wondering, would it be better to write a loop for such a procedure?
You can do all of this within combn.
If you just wanted to run the regression over all combinations and extract the second coefficient, you could do
fun <- function(x) coef(lm(paste(x, collapse="~"), data=iris))[2]
combn(names(iris[1:4]), 2, fun)
You can then extend the function to calculate the spread and run the test:
library(tseries) # for adf.test
fun <- function(x) {
  est <- coef(lm(paste(x, collapse="~"), data=iris))[2]
  spread <- iris[, x[1]] - est*iris[, x[2]]
  adf.test(spread)
}
out <- combn(names(iris[1:4]), 2, fun, simplify=FALSE)
out[[1]]
# Augmented Dickey-Fuller Test
# data: spread
# Dickey-Fuller = -3.879, Lag order = 5, p-value = 0.01707
# alternative hypothesis: stationary
Compare results to running the first one manually
est <- coef(lm(Sepal.Length ~ Sepal.Width, data=iris))[2]
spread <- iris[,"Sepal.Length"] - est*iris[,"Sepal.Width"]
adf.test(spread)
# Augmented Dickey-Fuller Test
# data: spread
# Dickey-Fuller = -3.879, Lag order = 5, p-value = 0.01707
# alternative hypothesis: stationary
Sounds like you want to write your own function and call it in your apply over myPairs:
yourfun <- function(pair){
  fm <- paste(pair, collapse='~')
  coef <- lm(formula=fm, data=iris)$coefficients[2]
  Spread <- iris[, pair[1]] - coef*iris[, pair[2]]
  return(Spread)
}
Then you can call this function:
model <- apply(myPairs, 2, yourfun)
I think this is the cleanest way, but I don't know what exactly you want to do, so I made up the example for Spread. Note that you would get warning messages if a factor column such as Species were included in the pairs.
A few tips: I wouldn't give things the same names as built-in functions (model and formula come to mind in your original version).
Also, you can simplify the paste you are doing; see below.
Finally, a more general statement: don't feel like everything needs to be done in an *apply of some kind. Sometimes brevity and short code are actually harder to understand, and remember that the *apply functions offer, at best, marginal speed gains over a simple for loop. (This was not always the case with R, but it is at this point.)
# Get pairs
myPairs <- combn(x = names(iris[1:4]), m = 2)
# Just directly use paste() here
myFormulas <- paste(myPairs[1, ], myPairs[2, ], sep = "~")
# Store the models themselves in a list
# This lets you go back to the models later if you need something else
myModels <- lapply(X = myFormulas, FUN = lm, data = iris)
# If you use sapply() and this simple function, you get back a named vector
# This seems like it could be useful for what you want to do
myCoeffs <- sapply(X = myModels, FUN = function(x) x$coefficients[2])
# Now you can do this using vectorized operations; sweep() scales each
# column of the second block by the matching coefficient
iris[myPairs[1, ]] - sweep(iris[myPairs[2, ]], 2, myCoeffs, "*")
If I am understanding right, I believe the above will work. Note that the names on the output at present will be nonsensical; you would need to replace them with something of your own design (maybe the values of myFormulas).
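For example, capturing the result and labelling each column by the formula that produced it; a sketch:
spreads <- iris[myPairs[1, ]] - sweep(iris[myPairs[2, ]], 2, myCoeffs, "*")
colnames(spreads) <- myFormulas # e.g. "Sepal.Length~Sepal.Width"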

Running loop on groups of data frame in R

I need to code a loop that will run t-tests on small groups of a data frame. I think they recommended using a for loop.
There are 271 rows in the data frame. The first 260 rows need to be split into 13 groups of 20, and a t-test must be run on each of the 13 groups.
This is the code I used to run a t-test on the entire data frame:
t.test(a, c, alternative = "two.sided", mu = 0, paired = TRUE, var.equal = TRUE, conf.level = 0.95)
I'm a coding noob, please help! D:
First of all, I don't see a data frame here; a and c seem to be vectors. I assume that both vectors are of length 271 and you want to ignore the last 11 items, so you can throw those away first:
a2 <- a[1:260]
c2 <- c[1:260]
Now you can create a vector of length 260 determining the indices of the subsets. (There are many ways to do this, but I think this way is easy to understand.)
indices <- as.numeric(cut(1:260, 13)) # 13 groups of 20 consecutive items
indices # just to show the output
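An equivalent, arguably more direct way to build the same grouping vector:
indices <- rep(1:13, each = 20) # group 1 = items 1-20, group 2 = items 21-40, ...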
You probably have to store the output in a list. The following code is again not the most efficient, but easy to understand.
result <- list()
for (i in 1:13){
  result[[i]] <- t.test(a2[which(indices == i)], c2[which(indices == i)],
                        alternative = "two.sided",
                        mu = 0, paired = TRUE, var.equal = TRUE,
                        conf.level = 0.95)
}
result[[1]] # gives the results of the first t-test (items 1 to 20)
result[[2]] # ...
As an alternative to the for loop you could also use lapply, which is usually more efficient and a bit shorter (but that doesn't matter for 260 data points):
result2 <- lapply(1:13, function(i) t.test(a2[which(indices == i)],
                                           c2[which(indices == i)],
                                           alternative = "two.sided",
                                           mu = 0, paired = TRUE, var.equal = TRUE,
                                           conf.level = 0.95))
result2[[1]] # ...
I hope that answers your question.
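Since the results are stored in a list, pulling out just the p-values afterwards is a one-liner; a sketch:
pvals <- sapply(result, function(r) r$p.value) # one p-value per group of 20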

How to find significant correlations in a large dataset

I'm using R.
My dataset has about 40 different variables/vectors, and each has about 80 entries. I'm trying to find significant correlations: I want to pick one variable and let R calculate the correlations of that variable with the other 39 variables.
I tried to do this by using a linear model with one explanatory variable, that is: Y = a*X + b.
The lm() command then gives me an estimate for a and a p-value for that estimate. I would then use one of my other variables for X and try again, until I find a p-value that's really small.
I'm sure this is a common problem; is there some sort of package or function that can try all these possibilities (brute force), show them, and maybe even sort them by p-value?
You can use the function rcorr from the package Hmisc.
Using the same demo data from Richie:
m <- 40
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))
Then:
library(Hmisc)
correlations <- rcorr(as.matrix(the_data))
To access the p-values:
correlations$P
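The correlation coefficients themselves live in the r element of the same object:
correlations$r # matrix of pairwise correlation coefficients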
To visualize the matrix you can use the package corrgram:
library(corrgram)
corrgram(the_data)
This produces a corrgram plot of the pairwise correlations (figure not shown).
In order to print a list of the significant correlations (p < 0.05), you can use the following.
Using the same demo data from #Richie:
m <- 40
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))
Install Hmisc
install.packages("Hmisc")
Load the library and find the correlations (as in @Carlos's answer):
library(Hmisc)
correlations <- rcorr(as.matrix(the_data))
Loop over the values, printing the significant correlations:
for (i in 1:m){
  for (j in 1:m){
    if (!is.na(correlations$P[i, j])){
      if (correlations$P[i, j] < 0.05) {
        print(paste(rownames(correlations$P)[i], "-",
                    colnames(correlations$P)[j], ": ", correlations$P[i, j]))
      }
    }
  }
}
Warning
You should not use this to draw any serious conclusions; it is only useful for exploratory analysis and for formulating hypotheses. If you run enough tests, you increase the probability of finding some significant p-values by random chance: https://www.xkcd.com/882/. There are statistical methods that are more suitable for this and that make adjustments to compensate for running multiple tests, e.g. https://en.wikipedia.org/wiki/Bonferroni_correction.
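Base R's p.adjust implements several such corrections; a sketch, reusing the correlations object from above:
p_raw <- correlations$P[upper.tri(correlations$P)] # each pair counted once
p_adj <- p.adjust(p_raw, method = "bonferroni")    # or "holm", "BH", ...
sum(p_adj < 0.05) # how many pairs survive the correction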
Here's some sample data for reproducibility.
m <- 40
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))
You can calculate the correlation between two columns using cor. This code loops over all columns except the first one (which contains our response) and calculates the correlation between each of them and the first column.
correlations <- vapply(
  the_data[, -1],
  function(x) cor(the_data[, 1], x),
  numeric(1)
)
You can then find the column with the largest magnitude of correlation with y using:
correlations[which.max(abs(correlations))]
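And since the question mentions sorting, ordering the whole vector by magnitude is just as easy; a sketch:
correlations[order(abs(correlations), decreasing = TRUE)] # strongest first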
Knowing which variables are correlated with which other variables can be interesting, but please don't draw any big conclusions from this knowledge. You need to have a proper think about what you are trying to understand and which techniques you need to use. The folks over at Cross Validated can help.
If you are trying to predict y using only one variable, then you should take the one that is most strongly correlated with y.
To do this, just use the command which.max(abs(cor(x, y))). If you want to use more than one variable in your model, then you should consider something like the lasso estimator.
One option is to run a correlation matrix:
cor_result <- cor(data)
write.csv(cor_result, file = "cor_result.csv")
This correlates all the variables in the file against each other and outputs a matrix.
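To inspect that matrix without opening the CSV, one option is to pull out the strongest off-diagonal pairs; a sketch (the 0.8 threshold is an arbitrary choice):
diag(cor_result) <- NA # ignore self-correlations
idx <- which(abs(cor_result) > 0.8, arr.ind = TRUE)
data.frame(var1 = rownames(cor_result)[idx[, 1]],
           var2 = colnames(cor_result)[idx[, 2]],
           r = cor_result[idx])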
