Multiple p-values from shapiro.test in R

I want to get the p-values of multiple Shapiro tests. I want to test the normality of 20 columns (columns 4 to 23) of a data frame called bladder, and then get the p-value of each test programmatically and store it. I'm trying something like:
ttest20 <- apply(bladder[4:23], 2, shapiro.test)
pVals <- numeric()
for (i in 1:length(ttest20)) {
  pVals <- ttest20[i]$p.value
}
but the loop doesn't store all the p-values.
Could someone help me? Thanks a lot.

I just found the answer; here it is:
ttest20<-apply(bladder[,4:23], 2, function(x) shapiro.test(x)$p.value)
This returns the p-values from all the Shapiro tests.
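For completeness, the original loop can also be fixed: the two bugs were overwriting pVals on every iteration and indexing the result list with [ instead of [[. A minimal sketch, assuming bladder as in the question:
ttest20 <- apply(bladder[, 4:23], 2, shapiro.test)
pVals <- numeric(length(ttest20))
for (i in seq_along(ttest20)) {
  pVals[i] <- ttest20[[i]]$p.value  # [[i]] extracts the htest object itself
}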

Related

Creating a function in R to run a chi-squared test

The goal is to create a chi squared test function with arguments dat and res.type="pearson" that returns an R list containing the test statistic, p-value, expected counts, residual type, and the residuals of that type, where the expected counts and the residuals are stored in two r × c matrices.
I think I have figured out how to get the expected counts and the test statistic, but I can't figure out the res.type argument. The Pearson residual equation is (o_ij - e_ij)/sqrt(e_ij) and the standard residual equation is (o_ij - e_ij)/sqrt(e_ij * (1 - n_i./n_..) * (1 - n_.j/n_..)).
Here is what I have so far
chisquared <- function(dat, res.type) {
  expdata <- matrix(c((dat[4,1]*dat[1,4])/dat[4,4], (dat[4,2]*dat[1,4])/dat[4,4],
                      (dat[4,3]*dat[1,4])/dat[4,4], (dat[4,1]*dat[2,4])/dat[4,4],
                      (dat[4,2]*dat[2,4])/dat[4,4], (dat[4,3]*dat[2,4])/dat[4,4],
                      (dat[4,1]*dat[3,4])/dat[4,4], (dat[4,2]*dat[3,4])/dat[4,4],
                      (dat[4,3]*dat[3,4])/dat[4,4]),
                    nrow = 3, ncol = 3, byrow = TRUE)
  sqdist <- matrix(c((dat[1,1]-expdata[1,1])^2/expdata[1,1], (dat[1,2]-expdata[1,2])^2/expdata[1,2],
                     (dat[1,3]-expdata[1,3])^2/expdata[1,3], (dat[2,1]-expdata[2,1])^2/expdata[2,1],
                     (dat[2,2]-expdata[2,2])^2/expdata[2,2], (dat[2,3]-expdata[2,3])^2/expdata[2,3],
                     (dat[3,1]-expdata[3,1])^2/expdata[3,1], (dat[3,2]-expdata[3,2])^2/expdata[3,2],
                     (dat[3,3]-expdata[3,3])^2/expdata[3,3]),
                   nrow = 3, ncol = 3, byrow = TRUE)
  ts <- sum(sqdist)
}
I'm sure there is an easier way to do this, other than the chisq.test function, which I am not allowed to use, so if anyone could provide some advice on that as well it would be appreciated.
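Not from the original thread, but one way to generalize is to build the expected counts with outer() from the margin sums. A sketch, assuming dat is a plain r x c matrix of observed counts (no margin row or column, so the totals come from rowSums(), colSums() and sum()):
chisquared <- function(dat, res.type = "pearson") {
  n <- sum(dat)
  e <- outer(rowSums(dat), colSums(dat)) / n  # expected counts e_ij = n_i. * n_.j / n_..
  ts <- sum((dat - e)^2 / e)                  # chi-squared test statistic
  p <- pchisq(ts, df = (nrow(dat) - 1) * (ncol(dat) - 1), lower.tail = FALSE)
  res <- switch(res.type,
                pearson  = (dat - e) / sqrt(e),
                standard = (dat - e) / sqrt(e * outer(1 - rowSums(dat) / n,
                                                      1 - colSums(dat) / n)),
                stop("res.type must be 'pearson' or 'standard'"))
  list(statistic = ts, p.value = p, expected = e,
       res.type = res.type, residuals = res)
}
The switch() handles the res.type argument and falls through to an error for unrecognized types.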

How to use lapply with get.confusion_matrix() in R?

I am performing a PLS-DA analysis in R using the mixOmics package. I have one binary Y variable (presence or absence of wetland) and 21 continuous predictor variables (X) with values ranging from 1 to 100.
I have made the model with the data_training dataset and want to predict new outcomes with the data_validation dataset. These datasets have exactly the same structure.
My code looks like:
library(mixOmics)
model.plsda <- plsda(X, Y, ncomp = 10)
myPredictions <- predict(model.plsda, newdata = data_validation[,-1], dist = "max.dist")
I want to predict the outcome based on 10, 9, 8, ... to 2 principal components. By using the get.confusion_matrix function, I want to estimate the error rate for every number of principal components.
prediction <- myPredictions$class$max.dist[, 10]  # prediction based on 10 components
confusion.mat <- get.confusion_matrix(truth = data_validation[, 1], predicted = prediction)
get.BER(confusion.mat)
I can do this separately 10 times, but I want to do it a little faster. Therefore I was thinking of making a list with the prediction results for every number of components...
library(BBmisc)
prediction_test <- myPredictions$class$max.dist
predictions_components <- convertColsToList(prediction_test, name.list = T, name.vector = T, factors.as.char = T)
...and then using lapply with the get.confusion_matrix and get.BER function. But then I don't know how to do that. I have searched on the internet, but I can't find a solution that works. How can I do this?
Many thanks for your help!
Without a reproducible example there is no way to test this, but you need to convert the code you want to run each time into a function. Something like this:
confmat <- function(x) {
  prediction <- myPredictions$class$max.dist[, x]  # prediction based on x components
  confusion.mat <- get.confusion_matrix(truth = data_validation[, 1],
                                        predicted = prediction)
  get.BER(confusion.mat)
}
Now lapply:
results <- lapply(10:2, confmat)
That will return a list with the get.BER results for each number of PCs, so results[[1]] will be the results for 10 PCs. You will not get values for prediction or confusion.mat unless they are included in the value returned by get.BER. If you want all of that, replace the last line of the function with return(list(prediction, confusion.mat, get.BER(confusion.mat))). This will produce a list of lists, so that results[[1]][[1]] will be the prediction for 10 PCs, and results[[1]][[2]] and results[[1]][[3]] will be confusion.mat and get.BER(confusion.mat) respectively.
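Spelled out, such an extended function might look like this (a sketch using the objects from the question; the function name and list names are my own choice):
confmat_full <- function(x) {
  prediction <- myPredictions$class$max.dist[, x]  # prediction based on x components
  confusion.mat <- get.confusion_matrix(truth = data_validation[, 1],
                                        predicted = prediction)
  list(prediction = prediction,
       confusion.mat = confusion.mat,
       BER = get.BER(confusion.mat))
}
results <- lapply(10:2, confmat_full)
results[[1]]$BER  # BER for 10 components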

Perform two-sample t-test and output the t, p values for two groups of matrices in R

I am trying to perform two-sample t-tests across a set of matrices.
wt1, wt2, wt3, mut1, mut2, mut3 are 3x3 matrices. After running the t-tests, I would like to get t.stat and p.value matrices, in which
t.stat[i,j] <- the t value from t.test(c(wt1[i,j],wt2[i,j],wt3[i,j]),c(mut1[i,j],mut2[i,j],mut3[i,j]))
p.value[i,j] <- the p-value from t.test(c(wt1[i,j],wt2[i,j],wt3[i,j]),c(mut1[i,j],mut2[i,j],mut3[i,j]))
with i and j indicating the row and column indices.
Is there an efficient way to achieve this without a loop?
Thank you very much for the help, it works!
Now I found that my data in the diagonal direction are all 1, which results in: Error in t.test.default(c(wt1[x], wt2[x], wt3[x]), c(mut1[x], mut2[x], ... :
data are essentially constant.
In order to get past those errors, I would like to output NA in t.stat and p.value. If the matrices have to contain the same type of values, 0 and 1 can be used for t.stat and p.value, respectively. It seems that tryCatch can do the job, but I am not sure how to handle it with sapply.
You can do something like this:
test <- sapply(1:9, function(x) t.test(c(wt1[x], wt2[x], wt3[x]),
                                       c(mut1[x], mut2[x], mut3[x])))
t.stat <- matrix(test["statistic", ], nrow = 3)
p.value <- matrix(test["p.value", ], nrow = 3)
For the second part of your question, I think using tryCatch inside sapply will help. Unfortunately, I couldn't think of a way of pre-allocating test and then creating the two matrices while using tryCatch, so I am adapting Aaron's answer instead:
t.stat <- matrix(sapply(1:9, function(x)
  tryCatch(t.test(c(wt1[x], wt2[x], wt3[x]),
                  c(mut1[x], mut2[x], mut3[x]))$statistic,
           error = function(err) NA)), nrow = 3)
p.value <- matrix(sapply(1:9, function(x)
  tryCatch(t.test(c(wt1[x], wt2[x], wt3[x]),
                  c(mut1[x], mut2[x], mut3[x]))$p.value,
           error = function(err) NA)), nrow = 3)
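To try this out, here is hypothetical toy data with constant diagonals, which triggers the "data are essentially constant" error and should therefore leave NA on the diagonals of t.stat and p.value:
set.seed(1)
wt1 <- matrix(rnorm(9), 3); wt2 <- matrix(rnorm(9), 3); wt3 <- matrix(rnorm(9), 3)
mut1 <- matrix(rnorm(9), 3); mut2 <- matrix(rnorm(9), 3); mut3 <- matrix(rnorm(9), 3)
diag(wt1) <- diag(wt2) <- diag(wt3) <- 1    # constant cells, as in the question
diag(mut1) <- diag(mut2) <- diag(mut3) <- 1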

R: how to perform more complex calculations from a combn of a dataset?

Right now, I have a combn of the built-in dataset iris. So far, I have been guided to extracting the lm() coefficient for each pair of variables.
myPairs <- combn(names(iris[1:4]), 2)
formula <- apply(myPairs, MARGIN=2, FUN=paste, collapse="~")
model <- lapply(formula, function(x) lm(formula=x, data=iris)$coefficients[2])
model
However, I would like to go a few steps further and use the coefficient from lm() to be used in further calculations. I would like to do something like this:
coefficient <- lm(formula=x, data=iris)$coefficients[2]
Spread <- myPairs[1] - coefficient*myPairs[2]
library(tseries)
adf.test(Spread)
The procedure itself is simple enough, but I haven't been able to find a way to do this for each combn pair in the data set. (As a side note, adf.test would not normally be applied to such data; I'm just using the iris dataset for demonstration.)
I'm wondering, would it be better to write a loop for such a procedure?
You can do all of this within combn.
If you just wanted to run the regression over all combinations, and extract the second coefficient you could do
fun <- function(x) coef(lm(paste(x, collapse="~"), data=iris))[2]
combn(names(iris[1:4]), 2, fun)
You can then extend the function to calculate the spread
fun <- function(x) {
  est <- coef(lm(paste(x, collapse="~"), data=iris))[2]
  spread <- iris[, x[1]] - est*iris[, x[2]]
  adf.test(spread)
}
out <- combn(names(iris[1:4]), 2, fun, simplify=FALSE)
out[[1]]
# Augmented Dickey-Fuller Test
# data: spread
# Dickey-Fuller = -3.879, Lag order = 5, p-value = 0.01707
# alternative hypothesis: stationary
Compare results to running the first one manually
est <- coef(lm(Sepal.Length ~ Sepal.Width, data=iris))[2]
spread <- iris[,"Sepal.Length"] - est*iris[,"Sepal.Width"]
adf.test(spread)
# Augmented Dickey-Fuller Test
# data: spread
# Dickey-Fuller = -3.879, Lag order = 5, p-value = 0.01707
# alternative hypothesis: stationary
Sounds like you would want to write your own function and call it in your myPairs loop (apply):
yourfun <- function(pair) {
  fm <- paste(pair, collapse='~')
  coef <- lm(formula=fm, data=iris)$coefficients[2]
  Spread <- iris[, pair[1]] - coef*iris[, pair[2]]
  return(Spread)
}
Then you can call this function:
model <- apply(myPairs, 2, yourfun)
I think this is the cleanest way, but I don't know exactly what you want to do, so I made up the example for Spread. Note that you would get warning messages if a factor column such as Species were included in the pairs.
A few tips: I wouldn't give things the same names as built-in functions (model and formula come to mind in your original version).
Also, you can simplify the paste you are doing; see below.
Finally, a more general statement: don't feel like everything needs to be done in an *apply of some kind. Sometimes brevity and short code are actually harder to understand, and remember that the *apply functions offer, at best, marginal speed gains over a simple for loop. (This was not always the case with R, but it is at this point.)
# Get pairs
myPairs <- combn(x = names(iris[1:4]), m = 2)
# Just directly use paste() here
myFormulas <- paste(myPairs[1, ], myPairs[2, ], sep = "~")
# Store the models themselves in a list
# This lets you go back to the models later if you need something else
myModels <- lapply(X = myFormulas, FUN = lm, data = iris)
# If you use sapply() and this simple function, you get back a named vector
# This seems like it could be useful for what you want to do
myCoeffs <- sapply(X = myModels, FUN = function(x) x$coefficients[2])
# Now, you can do this with vectorized operations. sweep() multiplies each
# predictor column by its own coefficient (a plain `data.frame * vector`
# product would recycle the vector down the rows instead):
iris[myPairs[1, ]] - sweep(iris[myPairs[2, ]], 2, myCoeffs, `*`)
If I am understanding right, I believe the above will work. Note that the names on the output at present will be nonsensical; you would need to replace them with something of your own design (maybe the values of myFormulas).
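For instance (my suggestion), store the result and label the columns with the formulas:
spreads <- iris[myPairs[1, ]] - sweep(iris[myPairs[2, ]], 2, myCoeffs, `*`)
colnames(spreads) <- myFormulas
head(spreads, 3)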

Displaying only the p-values of multiple t.tests

I have
replicate(1000, t.test(rnorm(10)))
This draws a sample of size ten from a normal distribution, performs a t.test on it, and repeats that 1000 times.
But for my assignment I'm only interested in the p-value (the question is: how many times is the null hypothesis rejected?).
How do I get only the p-values, or can I add something that counts how many times the null hypothesis is rejected (how many times the p-value is smaller than 0.05)?
t.test returns an object of class htest, which is a list containing a number of components, including p.value (which is what you want).
You have a couple of options.
You can save the t.test results in a list and then extract the p.value component
# simplify = FALSE to avoid coercion to array
ttestlist <- replicate(1000, t.test(rnorm(10)), simplify = FALSE)
ttest.pval <- sapply(ttestlist, '[[', 'p.value')
Or you could save only that component of each t.test object:
pvals <- replicate(1000, t.test(rnorm(10))$p.value)
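With the p-values saved, counting rejections at the 5% level is then a one-liner:
sum(pvals < 0.05)   # number of times the null hypothesis is rejected
mean(pvals < 0.05)  # empirical rejection rate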
Here are the steps I'd use to solve your problem. Pay attention to how I broke it down into the smallest component parts and built it up step by step:
# Let's look at the structure of one t.test to see where the p-value is stored
str(t.test(rnorm(10)))
# It is named "p.value", so let's see if we can extract it
t.test(rnorm(10))[["p.value"]]
# Now let's test if it's less than your 0.05 value
ifelse(t.test(rnorm(10))[["p.value"]] < 0.05, 1, 0)
# That worked. Now let's place that code inside your replicate() call:
replicate(1000, ifelse(t.test(rnorm(10))[["p.value"]] < 0.05, 1, 0))
# That worked too; now we can just take the sum of that.
# Make it reproducible this time
set.seed(42)
sum(replicate(1000, ifelse(t.test(rnorm(10))[["p.value"]] < 0.05, 1, 0)))
Should yield this:
[1] 54
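That is close to what theory predicts: with the null hypothesis true, the p-value is uniformly distributed on [0, 1], so a 5% level test rejects in roughly 5% of the 1000 replicates, i.e. about 50 times.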
