Displaying only the p-value of multiple t.tests - r

I have
replicate(1000, t.test(rnorm(10)))
What it does is draw a sample of size ten from a normal distribution, perform a t.test on it, and repeat this 1000 times.
But for my assignment I'm only interested in the p-value (the question is: how many times is the null hypothesis rejected).
How do I get only the p-values, or can I add something that directly reports how many times the null hypothesis is rejected (how many times the p-value is smaller than 0.05)?

t.test returns an object of class htest, which is a list containing a number of components, including p.value (which is what you want).
You have a couple of options.
You can save the t.test results in a list and then extract the p.value component
# simplify = FALSE to avoid coercion to array
ttestlist <- replicate(1000, t.test(rnorm(10)), simplify = FALSE)
ttest.pval <- sapply(ttestlist, '[[', 'p.value')
Or you could simply save only that component of the t.test object
pvals <- replicate(1000, t.test(rnorm(10))$p.value)
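Once the p-values are in a plain numeric vector, counting rejections is a single comparison. A minimal sketch, assuming the 0.05 threshold from the question; the names n_rejected and prop_rejected are made up for illustration:

```r
set.seed(42)  # only so that the run is reproducible
pvals <- replicate(1000, t.test(rnorm(10))$p.value)

# TRUE counts as 1, so sum() gives the number of rejections
n_rejected <- sum(pvals < 0.05)
# mean() of the same logical vector gives the rejection rate
prop_rejected <- mean(pvals < 0.05)

n_rejected
prop_rejected
```

Because the null hypothesis is true in this simulation, the rejection rate should land near the nominal 5% level.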

Here are the steps I'd use to solve your problem. Pay attention to how I broke it down into the smallest component parts and built it up step by step:
#Let's look at the structure of one t.test to see where the p-value is stored
str(t.test(rnorm(10)))
#It is named "p.value", so let's see if we can extract it
t.test(rnorm(10))[["p.value"]]
#Now let's test if it's less than your 0.05 value
ifelse(t.test(rnorm(10))[["p.value"]]< 0.05,1,0)
#That worked. Now let's replace the code above in your replicate function:
replicate(1000, ifelse(t.test(rnorm(10))[["p.value"]]< 0.05,1,0))
#That worked too, now we can just take the sum of that:
#Make it reproducible this time
set.seed(42)
sum(replicate(1000, ifelse(t.test(rnorm(10))[["p.value"]]< 0.05,1,0)))
Should yield this:
[1] 54


Confusion about the matrix "B" returned by `quantreg::boot.rq`

When invoking boot.rq like this
b_10 = boot.rq(x, y, tau = .1, bsmethod = "xy", cov = TRUE, R = reps, mofn = mofn)
what does the B matrix (size R x p) in b_10 contain: bootstrapped coefficient estimates or bootstrapped standard errors?
The Value section in the documentation says:
A list consisting of two elements: A matrix B of dimension R by p is returned with the R resampled estimates of the vector of quantile regression parameters. [...]
So, it seems to be the coefficient estimates. But Description section says:
These functions can be used to construct standard errors, confidence intervals and tests of hypotheses regarding quantile regression models.
So it seems to be bootstrapped standard errors.
What is it really?
Edit:
I also wonder what difference the option cov = TRUE makes. Thanks!
It stores the bootstrap coefficients. Each row of B is a sample of coefficients, and you have R rows.
These samples are the basis of further inference. We can compute various statistics from them. For example, to compute bootstrap mean and standard error, we can do:
colMeans(B)
apply(B, 2, sd)
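The same column-wise pattern extends to percentile confidence intervals. A rough sketch: the matrix B below is a simulated stand-in (not real boot.rq output), and the column names are invented for illustration:

```r
set.seed(1)
# Stand-in for b_10$B: R = 500 bootstrap draws of p = 2 coefficients.
# These are simulated values, NOT produced by boot.rq.
B <- cbind(intercept = rnorm(500, mean = 1, sd = 0.2),
           slope     = rnorm(500, mean = 2, sd = 0.1))

boot_se <- apply(B, 2, sd)                                 # one SE per coefficient
boot_ci <- apply(B, 2, quantile, probs = c(0.025, 0.975))  # 2 x p matrix of limits

boot_se
boot_ci
```

Each column of boot_ci holds the lower and upper 95% percentile limits for one coefficient.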
Do you also happen to know what difference the option cov = TRUE makes?
Are you sure that cov = TRUE works? First of all, boot.rq itself has no such argument. It may be passed in via .... However, ... is forwarded to boot.rq.pxy (if bsmethod = "pxy") or boot.rq.pwxy (if bsmethod = "pwxy"), neither of which deals with a cov argument. Furthermore, you use bsmethod = "xy", so ... will be silently ignored. As far as I could see, cov = TRUE has no effect at all.
It works in the sense that R doesn't throw me an error.
That is what "silently ignored" means. You can pass whatever into .... They are just ignored.
The bootstrapped values are different depending on whether I use cov = TRUE or not. The code was written by someone else so I'm not sure why that option was put there.
Random sampling won't give identical results on different runs. I suggest you fix a random seed then do testing:
set.seed(0); ans1 <- boot.rq(..., cov = FALSE)
set.seed(0); ans2 <- boot.rq(..., cov = TRUE)
all.equal(ans1$B, ans2$B)
If you don't get TRUE, come back to me.
You're right. It's just because of the different seeds. Thanks!!
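Both points from this exchange (unused arguments swallowed by ..., and results that differ only because of the seed) can be reproduced without quantreg. A small sketch using a hypothetical function draw_mean that accepts ... but never looks at it:

```r
# Hypothetical function: accepts ... but never uses it
draw_mean <- function(n, ...) mean(rnorm(n))

set.seed(0); r1 <- draw_mean(10)
set.seed(0); r2 <- draw_mean(10, cov = TRUE)  # extra argument is swallowed by ...
identical(r1, r2)  # same seed, same result: the argument had no effect

r3 <- draw_mean(10, cov = TRUE)  # seed not reset, so this is a fresh draw
identical(r1, r3)
```

No error is raised for cov = TRUE, which is why an ignored argument can easily look as if it "works".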

How to use lapply with get.confusion_matrix() in R?

I am performing a PLS-DA analysis in R using the mixOmics package. I have one binary Y variable (presence or absence of wetland) and 21 continuous predictor variables (X) with values ranging from 1 to 100.
I have made the model with the data_training dataset and want to predict new outcomes with the data_validation dataset. These datasets have exactly the same structure.
My code looks like:
library(mixOmics)
model.plsda<-plsda(X,Y, ncomp = 10)
myPredictions <- predict(model.plsda, newdata = data_validation[,-1], dist = "max.dist")
I want to predict the outcome based on 10, 9, 8, ... to 2 principal components. By using the get.confusion_matrix function, I want to estimate the error rate for every number of principal components.
prediction <- myPredictions$class$max.dist[,10] #prediction based on 10 components
confusion.mat = get.confusion_matrix(truth = data_validation[,1], predicted = prediction)
get.BER(confusion.mat)
I can do this separately 10 times, but I want to do it a little faster. Therefore I was thinking of making a list with the prediction results for every number of components...
library(BBmisc)
prediction_test <- myPredictions$class$max.dist
predictions_components <- convertColsToList(prediction_test, name.list = T, name.vector = T, factors.as.char = T)
...and then using lapply with the get.confusion_matrix and get.BER function. But then I don't know how to do that. I have searched on the internet, but I can't find a solution that works. How can I do this?
Many thanks for your help!
Without a reproducible example there is no way to test this, but you need to wrap the code you want to run each time in a function. Something like this:
confmat <- function(x) {
  prediction <- myPredictions$class$max.dist[, x] # prediction based on x components
  confusion.mat <- get.confusion_matrix(truth = data_validation[, 1], predicted = prediction)
  get.BER(confusion.mat)
}
Now lapply:
results <- lapply(10:2, confmat)
That will return a list with the get.BER results for each number of PCs, so results[[1]] will be the result for 10 PCs. You will not get values for prediction or confusion.mat unless they are included in what get.BER returns. If you want all of that, replace the last line of the function with return(list(prediction, confusion.mat, get.BER(confusion.mat))). This will produce a list of lists, so that results[[1]][[1]] will be the prediction for 10 PCs, and results[[1]][[2]] and results[[1]][[3]] will be confusion.mat and get.BER(confusion.mat) respectively.
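Since no data were posted, here is a sketch of the same pattern that runs in base R only. Everything in it is a stand-in: truth and pred are simulated, and ber() is a hand-written balanced error rate (the mean of the per-class error rates), not the mixOmics functions. It uses sapply instead of lapply so the result is a named vector:

```r
set.seed(1)
classes <- c("absent", "present")
truth <- factor(sample(classes, 50, replace = TRUE), levels = classes)

# Simulated stand-in for myPredictions$class$max.dist: one column per number of components
pred <- sapply(1:10, function(k) sample(classes, 50, replace = TRUE))
colnames(pred) <- paste0("comp", 1:10)

# Hand-written balanced error rate: average of the per-class error rates
ber <- function(truth, predicted) {
  cm <- table(truth, factor(predicted, levels = levels(truth)))
  mean(1 - diag(cm) / rowSums(cm))
}

results <- sapply(2:10, function(k) ber(truth, pred[, k]))
names(results) <- paste0("ncomp_", 2:10)
results
```

With the real model, the body of the anonymous function would call get.confusion_matrix and get.BER on the k-th column instead.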

R: how to perform more complex calculations from a combn of a dataset?

Right now, I have a combn from the built in dataset iris. So far, I have been guided into being able to find the coefficient of lm() of the pair of values.
myPairs <- combn(names(iris[1:4]), 2)
formula <- apply(myPairs, MARGIN=2, FUN=paste, collapse="~")
model <- lapply(formula, function(x) lm(formula=x, data=iris)$coefficients[2])
model
However, I would like to go a few steps further and use the coefficient from lm() to be used in further calculations. I would like to do something like this:
Coefficient <- lm(formula=x, data=iris)$coefficients[2]
Spread <- myPairs[1] - coefficient*myPairs[2]
library(tseries)
adf.test(Spread)
The procedure itself is simple enough, but I haven't been able to find a way to do this for each combn in the data set. (As a sidenote, the adf.test would not be applied to such data, but I'm just using the iris dataset for demonstration).
I'm wondering, would it be better to write a loop for such a procedure?
You can do all of this within combn.
If you just wanted to run the regression over all combinations, and extract the second coefficient you could do
fun <- function(x) coef(lm(paste(x, collapse="~"), data=iris))[2]
combn(names(iris[1:4]), 2, fun)
You can then extend the function to calculate the spread
fun <- function(x) {
est <- coef(lm(paste(x, collapse="~"), data=iris))[2]
spread <- iris[,x[1]] - est*iris[,x[2]]
adf.test(spread)
}
out <- combn(names(iris[1:4]), 2, fun, simplify=FALSE)
out[[1]]
# Augmented Dickey-Fuller Test
#data: spread
#Dickey-Fuller = -3.879, Lag order = 5, p-value = 0.01707
#alternative hypothesis: stationary
Compare results to running the first one manually
est <- coef(lm(Sepal.Length ~ Sepal.Width, data=iris))[2]
spread <- iris[,"Sepal.Length"] - est*iris[,"Sepal.Width"]
adf.test(spread)
# Augmented Dickey-Fuller Test
# data: spread
# Dickey-Fuller = -3.879, Lag order = 5, p-value = 0.01707
# alternative hypothesis: stationary
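One thing combn does not do is label its output with the pair that produced each element. A small sketch (base R only, so without the adf.test step) that builds the labels with a second combn call; the names vars and slopes are illustrative:

```r
vars <- names(iris)[1:4]

# Slope of the first variable regressed on the second, for every pair
slopes <- combn(vars, 2, function(x) coef(lm(paste(x, collapse = "~"), data = iris))[2])

# combn visits the pairs in the same order each time, so the labels line up
names(slopes) <- combn(vars, 2, paste, collapse = "~")
slopes
```

The same names could be attached to the out list above, so that a test can be looked up by its formula rather than by position.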
Sounds like you would want to write your own function and call it in your myPairs loop (apply):
yourfun <- function(pair){
fm <- paste(pair, collapse='~')
coef <- lm(formula=fm, data=iris)$coefficients[2]
Spread <- iris[,pair[1]] - coef*iris[,pair[2]]
return(Spread)
}
Then you can call this function:
model <- apply(myPairs, 2, yourfun)
I think this is the cleanest way. But I don't know what exactly you want to do, so I was making up the example for Spread. Note that in my example you get warning messages, since column Species is a factor.
A few tips: I wouldn't give objects you create the same names as built-in functions (model and formula come to mind in your original version).
Also, you can simplify the paste you are doing - see the below.
Finally, a more general statement: don't feel like everything needs to be done in an *apply of some kind. Sometimes brevity makes code harder to understand, and remember that the *apply functions offer at best marginal speed gains over a simple for loop. (This was not always the case with R, but it is at this point.)
# Get pairs
myPairs <- combn(x = names(x = iris[1:4]),m = 2)
# Just directly use paste() here
myFormulas <- paste(myPairs[1,],myPairs[2,],sep = "~")
# Store the models themselves into a list
# This lets you go back to the models later if you need something else
myModels <- lapply(X = myFormulas,FUN = lm,data = iris)
# If you use sapply() and this simple function, you get back a named vector
# This seems like it could be useful to what you want to do
myCoeffs <- sapply(X = myModels,FUN = function (x) {return(x$coefficients[2])})
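# Now, you can do this using vectorized operations
# sweep() scales each predictor column by its own coefficient (matched by position)
as.matrix(iris[myPairs[1,]]) - sweep(as.matrix(iris[myPairs[2,]]), 2, myCoeffs, "*")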
If I am understanding right, I believe the above will work. Note that the names on the output at present will be nonsensical, you would need to replace them with something of your own design (maybe the values of myFormulas).
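To illustrate the remark about plain for loops, here is the same calculation written as a loop. This is only a sketch: spreads and the other variable names are made up, and the result is labelled by formula so the output is easy to read:

```r
vars <- names(iris)[1:4]
pairs <- combn(vars, 2)

spreads <- vector("list", ncol(pairs))
for (j in seq_len(ncol(pairs))) {
  y <- pairs[1, j]
  x <- pairs[2, j]
  fit <- lm(as.formula(paste(y, "~", x)), data = iris)
  # spread = response minus slope times predictor
  spreads[[j]] <- iris[[y]] - coef(fit)[[2]] * iris[[x]]
}
names(spreads) <- paste(pairs[1, ], pairs[2, ], sep = "~")

str(spreads, give.attr = FALSE)
```

Each element is one spread series of length 150, ready to be passed to whatever test comes next.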

cforest party unbalanced classes

I want to measure the features importance with the cforest function from the party library.
My output variable has something like 2000 samples in class 0 and 100 samples in class 1.
I think a good way to avoid bias due to class imbalance is to train each tree of the forest on a subsample in which the number of elements of class 1 equals the number of elements of class 0.
Is there any way to do that? I am thinking of an option like n_samples = c(20, 20)
EDIT:
An example of code
> iris.cf <- cforest(Species ~ ., data = iris,
+ control = cforest_unbiased(mtry = 2)) #<--- Here I would like to train the forest using a balanced subsample of the data
> varimp(object = iris.cf)
Sepal.Length Sepal.Width Petal.Length Petal.Width
0.048981818 0.002254545 0.305818182 0.271163636
>
EDIT:
Maybe my question is not clear enough.
A random forest is a set of decision trees. In general, each decision tree is constructed using only a random subsample of the data. I would like the subsample used to have the same number of elements in class 1 and in class 0.
EDIT:
The function that I am looking for is for sure available in the randomForest package
sampsize
Size(s) of sample to draw. For classification, if sampsize is a vector of the length the number of strata, then sampling is stratified by strata, and the elements of sampsize indicate the numbers to be drawn from the strata.
I need the same for the party package. Is there any way to get it?
I will assume you know what you want to accomplish, but don't know enough R to do that.
Not sure if the function provides balancing of data as an argument, but you can do it manually. Below is some code I quickly threw together. A more elegant solution might exist.
library(party)
# just in case
myData <- iris
# replicate everything *10* times. Replicate is just a "loop 10 times".
replicate(10,
  {
    # split dataset by class and add separate classes to list
    splitList <- split(myData, myData$Species)
    # sample *20* random rows from each data frame in the list
    sampledList <- lapply(splitList, function(dat) { dat[sample(nrow(dat), 20), ] })
    # combine sampled rows to a data.frame
    sampledData <- do.call(rbind, sampledList)
    # your code below
    res.cf <- cforest(Species ~ ., data = sampledData,
                      control = cforest_unbiased(mtry = 2)
    )
    varimp(object = res.cf)
  }
)
Hope you can take it from here.
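For the 2000-versus-100 case in the question, the sampling step can be pulled out into a helper. A sketch in base R, run on simulated data; balanced_sample and its arguments are hypothetical names, and it assumes every class has at least n_per_class rows:

```r
set.seed(1)
# Simulated unbalanced data: 2000 rows of class 0, 100 rows of class 1
dat <- data.frame(y = factor(rep(c(0, 1), times = c(2000, 100))),
                  x = rnorm(2100))

# Hypothetical helper: draw n_per_class rows without replacement from each class
balanced_sample <- function(data, class_col, n_per_class) {
  rows_by_class <- split(seq_len(nrow(data)), data[[class_col]])
  idx <- unlist(lapply(rows_by_class, function(i) sample(i, n_per_class)))
  data[idx, , drop = FALSE]
}

sub <- balanced_sample(dat, "y", 100)
table(sub$y)
```

Note that this balances the data once per forest, not once per tree as randomForest's sampsize does; repeating it inside replicate and averaging the importances is an approximation to that.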

Bootstrapping to compare two groups

In the following code I use bootstrapping to calculate the C.I. and the p-value under the null hypothesis that two different fertilizers applied to tomato plants have no effect on plant yields (the alternative being that the "improved" fertilizer is better). The first random sample (x) comes from plants where a standard fertilizer was used, while the second sample (y) comes from plants where an "improved" one was used.
x <- c(11.4,25.3,29.9,16.5,21.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
total <- c(x,y)
library(boot)
diff <- function(x,i) mean(x[i[6:11]]) - mean(x[i[1:5]])
b <- boot(total, diff, R = 10000)
ci <- boot.ci(b)
p.value <- sum(b$t>=b$t0)/b$R
What I don't like about the code above is that resampling is done as if there was only one sample of 11 values (separating the first 5 as belonging to sample x leaving the rest to sample y).
Could you show me how this code should be modified in order to draw resamples of size 5 with replacement from the first sample and separate resamples of size 6 from the second sample, so that bootstrap resampling would mimic the “separate samples” design that produced the original data?
EDIT2 :
Hack deleted as it was a wrong solution. Instead one has to use the argument strata of the boot function :
total <- c(x,y)
id <- as.factor(c(rep("x",length(x)),rep("y",length(y))))
b <- boot(total, diff, strata=id, R = 10000)
...
Be aware you're not going to get even close to a correct estimate of your p.value :
x <- c(1.4,2.3,2.9,1.5,1.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
total <- c(x,y)
b <- boot(total, diff, strata=id, R = 10000)
ci <- boot.ci(b)
p.value <- sum(b$t>=b$t0)/b$R
> p.value
[1] 0.5162
How would you explain a p-value of 0.51 for two samples where all values of the second are higher than the highest value of the first?
The above code is fine to get a -biased- estimate of the confidence interval, but the significance testing about the difference should be done by permutation over the complete dataset.
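A sketch of such a permutation test for the example above. With only 11 observations, every relabelling can be enumerated with combn, so no random sampling is needed; the names obs, perm_diffs and p_value are illustrative:

```r
x <- c(1.4, 2.3, 2.9, 1.5, 1.1)
y <- c(23.7, 26.6, 28.5, 14.2, 17.9, 24.3)
total <- c(x, y)

# Observed difference in means (y minus x)
obs <- mean(y) - mean(x)

# Every way of choosing which 5 of the 11 values are labelled "x"
perm_diffs <- combn(length(total), length(x),
                    function(ix) mean(total[-ix]) - mean(total[ix]))

# One-sided p-value: share of relabellings at least as extreme as observed
# (small tolerance guards against floating-point ties)
p_value <- mean(perm_diffs >= obs - 1e-12)
p_value  # 1/462, about 0.0022
```

Because every y value exceeds every x value, the observed labelling is the single most extreme of the choose(11, 5) = 462 possibilities, which is what the p-value reflects.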
Following John, I think the appropriate way to use bootstrap to test if the sums of these two different populations are significantly different is as follows:
x <- c(1.4,2.3,2.9,1.5,1.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
b_x <- boot(x, sum, R = 10000)
b_y <- boot(y, sum, R = 10000)
z<-(b_x$t0-b_y$t0)/sqrt(var(b_x$t[,1])+var(b_y$t[,1]))
pnorm(z)
So we can clearly reject the null that they are the same population. I may have missed a degree of freedom adjustment, I am not sure how bootstrapping works in that regard, but such an adjustment will not change your results drastically.
While the actual soil beds could be considered a stratified variable in some instances, this is not one of them. You only have the one manipulation, between the groups of plants. Therefore, your null hypothesis is that they really do come from the exact same population. Treating the items as if they're from a single set of 11 samples is the correct way to bootstrap in this case.
If you had two plots, and in each plot tried the different fertilizers over different seasons in a counterbalanced fashion, then the plots would be stratified samples and you'd want to treat them as such. But that isn't the case here.
