Chi-square two-sample test in R

Is there a function in R to perform a chi-square two-sample test? http://www.itl.nist.gov/div898/software/dataplot/refman1/auxillar/chi2samp.htm
For example, I want to test whether x = rnorm(100) and y = rnorm(100) come from the same distribution.
I tried to use the chisq.test function, but I think it is not correct because it gives me a very large p-value:
> chisq.test(rnorm(100),runif(100))
Pearson's Chi-squared test
data: rnorm(100) and runif(100)
X-squared = 9900, df = 9801, p-value = 0.239
thank you

I have similar data: binned length frequencies of fish from two different sampling methods. Because my data are binned, as I understand it, it is not appropriate to use the KS test, because that test requires continuous data. So I, too, came across the chi-square two-sample test offered by Dataplot at the link you provide: http://www.itl.nist.gov/div898/software/dataplot/refman1/auxillar/chi2samp.htm
I installed the Dataplot software, and communicated with the contact person for the test when I had problems. I've also run it in R, and gotten the same result (a good check). Here's what I coded in R:
col1=c(2,4,22,46,81,103,238,65,82,9,58,2)
col2=c(0,0,1,1,0,10,32,22,20,4,0,0)
chisq.test(cbind(col1,col2))
Pearson's Chi-squared test
data: cbind(col1, col2)
X-squared = 53.782, df = 11, p-value = 1.293e-07
NOTE: Dataplot did not mind if I had a bin that had zeros for both of my samples, but R does - after much frustration I realized this, removed the shared 0s, and R gave me the same test stat, df, and p value that I got with DataPlot. Hope this helps!
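For continuous samples like the rnorm/runif vectors in the question, chisq.test(x, y) cross-tabulates the two vectors as if they were factors, which is why it reports df = 9801 (99 x 99). A minimal sketch of the binned approach follows, assuming bin edges taken from the pooled quintiles (that choice is mine, not something the NIST page prescribes):

```r
set.seed(1)
x <- rnorm(100)
y <- rnorm(100)

# Common bin edges from the pooled data (assumed: 5 quantile bins)
breaks <- quantile(c(x, y), probs = seq(0, 1, by = 0.2))

# Count each sample in the same bins; rows = bins, columns = samples
counts <- cbind(
  x = as.vector(table(cut(x, breaks, include.lowest = TRUE))),
  y = as.vector(table(cut(y, breaks, include.lowest = TRUE)))
)

# Drop bins that are empty in both samples, as noted above
counts <- counts[rowSums(counts) > 0, , drop = FALSE]

res <- chisq.test(counts)
res
```

The degrees of freedom are then (number of bins - 1), which matches the df = 11 obtained for the 12 fish-length bins.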

Related

A small simulation study about normality tests in R

I am conducting a small simulation study to judge how good two normality tests really are. My plan is to generate a multitude of normal distribution samples of not too many observations and determine how often each test rejects the null hypothesis of normality.
The (incomplete) code I have so far is
library(nortest)
y<-replicate(10000,{
x<-rnorm(50)
ad.test(x)$p.value
ks.test(x,y=pnorm)$p.value
}
)
Now I would like to count the proportion of these p-values that are smaller than 0.05 for each test. Could you please tell me how I could do that? I apologise if this is a newbie question, but I myself am new to R.
Thank you.
library(nortest)
nsim <- 10000
nx <- 50
set.seed(101)
y <- replicate(nsim,{
    x<-rnorm(nx)
    c(ad=ad.test(x)$p.value,
      ks=ks.test(x,y=pnorm)$p.value)
}
)
apply(y<0.05,MARGIN=1,mean)
## ad ks
## 0.0534 0.0480
Using MARGIN=1 tells apply to take the mean across rows, rather than columns -- this is sensible given the ordering that replicate()'s built-in simplification produces.
For examples of this type, the type I error rates of any standard tests will be extremely close to their nominal value (0.05 in this example).
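To make the shape of the replicate() result concrete, here is a small base-R sketch. It swaps shapiro.test in for ad.test purely so that no extra package is needed, and uses a reduced nsim; both are assumptions for illustration.

```r
set.seed(101)
nsim <- 200   # kept small so the sketch runs quickly

p <- replicate(nsim, {
  x <- rnorm(50)
  c(sw = shapiro.test(x)$p.value,        # stand-in for ad.test()
    ks = ks.test(x, "pnorm")$p.value)
})

dim(p)   # one row per test, one column per simulated sample

# Rejection rate per test; equivalent to apply(p < 0.05, 1, mean)
rej <- rowMeans(p < 0.05)
rej
```

Because each row holds the p-values of one test, the row-wise mean of the logical matrix is the proportion of rejections for that test.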
If you run each test separately, you can simply count how many of the values stored in y are less than 0.05.
y<-replicate(1000,{
x<-rnorm(50)
ks.test(x,y=pnorm)$p.value})
length(which(y<0.05))
Your code only returns the last expression inside the braces (the KS p-value), so the AD p-values are discarded. You could do something like this:
library(nortest)
rep_test <- function(reps=10000) {
  p_ks <- rep(NA_real_, reps)
  p_ad <- rep(NA_real_, reps)
  for (i in seq_len(reps)) {
    x <- rnorm(50)
    p_ks[i] <- ks.test(x, y=pnorm)$p.value
    p_ad[i] <- ad.test(x)$p.value
  }
  data.frame(p_ks, p_ad)
}
test <- rep_test()
# proportion of p-values below 0.05 for each test
mean(test$p_ks < 0.05)
mean(test$p_ad < 0.05)

performing a chi square test across multiple variables and extracting the relevant p value in R

OK, straight to the question. I have a database with lots and lots of categorical variables.
A sample database with a few variables is below:
gender <- as.factor(sample( letters[6:7], 100, replace=TRUE, prob=c(0.2, 0.8) ))
smoking <- as.factor(sample(c(0,1),size=100,replace=T,prob=c(0.6,0.4)))
alcohol <- as.factor(sample(c(0,1),size=100,replace=T,prob=c(0.3,0.7)))
htn <- as.factor(sample(c(0,1),size=100,replace=T,prob=c(0.2,0.8)))
tertile <- as.factor(sample(c(1,2,3),size=100,replace=T,prob=c(0.3,0.3,0.4)))
df <- as.data.frame(cbind(gender,smoking,alcohol,htn,tertile))
I want to test the hypothesis, using a chi-square test, that there is a difference in the proportion of smokers, alcohol use, hypertension (htn) etc. by tertile (3 levels). I then want to extract the p-value for each variable.
I know I can test each individual variable using a 2 by 3 cross-tabulation, but is there more efficient code to derive the test statistic and p-value for all variables in one go and extract the p-value for each variable?
Thanks in advance
Anoop
If you want to do all the comparisons in one statement, you can do
mapply(function(x, y) chisq.test(x, y)$p.value, df[, -5], MoreArgs=list(df[,5]))
# gender smoking alcohol htn
# 0.4967724 0.8251178 0.5008898 0.3775083
Of course, running many tests this way inflates the overall type I error rate, so some multiple-testing correction is required to keep it at the intended level.
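As a sketch of such a correction, base R's p.adjust can be applied to the vector of p-values. The values below are copied from the output above; the choice of Holm's method is an assumption, not something the answer specifies.

```r
# p-values copied from the mapply() output above
p_raw <- c(gender = 0.4967724, smoking = 0.8251178,
           alcohol = 0.5008898, htn = 0.3775083)

# Holm's step-down correction (assumed choice; see ?p.adjust for others)
p_holm <- p.adjust(p_raw, method = "holm")
p_holm
```

The adjusted values keep the variable names, so they can be read off in the same way as the raw p-values.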
You can run the following code chunk if you want to get the test results in detail:
lapply(df[,-5], function(x) chisq.test(table(x,df$tertile), simulate.p.value = TRUE))
You can get just p-values:
lapply(df[,-5], function(x) chisq.test(table(x,df$tertile), simulate.p.value = TRUE)$p.value)
This gets the p-values in a data frame:
data.frame(lapply(df[,-5], function(x) chisq.test(table(x,df$tertile), simulate.p.value = TRUE)$p.value))
Thanks to RPubs for the inspiration:
http://www.rpubs.com/kaz_yos/1204
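If the statistic and degrees of freedom are wanted alongside the p-value, one option is to collect them into a single data frame. This is only a sketch: dat, vars and the seed are illustrative names and values, not part of the original question.

```r
set.seed(42)
n <- 100
dat <- data.frame(
  smoking = factor(sample(c(0, 1), n, replace = TRUE, prob = c(0.6, 0.4))),
  alcohol = factor(sample(c(0, 1), n, replace = TRUE, prob = c(0.3, 0.7))),
  tertile = factor(sample(1:3, n, replace = TRUE, prob = c(0.3, 0.3, 0.4)))
)

# Every column except the grouping variable gets its own test
vars <- setdiff(names(dat), "tertile")

res <- do.call(rbind, lapply(vars, function(v) {
  tst <- suppressWarnings(chisq.test(table(dat[[v]], dat$tertile)))
  data.frame(variable  = v,
             statistic = unname(tst$statistic),
             df        = unname(tst$parameter),
             p.value   = tst$p.value)
}))
res
```

Each row of res describes one 2 by 3 cross-tabulation against tertile.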

alternative for wilcox.test in R

I'm trying a significance test using wilcox.test in R. I want to basically test if a value x is significantly within/outside a distribution d.
I'm doing the following:
d = c(90,99,60,80,80,90,90,54,65,100,90,90,90,90,90)
wilcox.test(60,d)
Wilcoxon rank sum test with continuity correction
data: 60 and d
W = 4.5, p-value = 0.5347
alternative hypothesis: true location shift is not equal to 0
Warning message:
In wilcox.test.default(60, d) : cannot compute exact p-value with ties
and basically the p-value is the same for a big range of the numbers I test.
I've tried wilcox_test() from the coin package, but I can't get it to work when testing a value against a distribution.
Is there an alternative to this test that does the same and knows how to deal with ties?
How worried are you about the non-exact results? I would guess that the approximation is reasonable for a data set this size. (I did manage to get coin::wilcox_test working, and the results are not hugely different ...)
d <- c(90,99,60,80,80,90,90,54,65,100,90,90,90,90,90)
pfun <- function(x) {
    suppressWarnings(w <- wilcox.test(x,d)$p.value)
    return(w)
}
testvec <- 30:120
p1 <- sapply(testvec,pfun)
library("coin")
pfun2 <- function(x) {
    dd <- data.frame(y=c(x,d),f=factor(c(1,rep(2,length(d)))))
    return(pvalue(wilcox_test(y~f,data=dd)))
}
p2 <- sapply(testvec,pfun2)
library("exactRankTests")
pfun3 <- function(x) {wilcox.exact(x,d)$p.value}
p3 <- sapply(testvec,pfun3)
Picture:
par(las=1,bty="l")
matplot(testvec,cbind(p1,p2,p3),type="s",
xlab="value",ylab="p value of wilcoxon test",lty=1,
ylim=c(0,1),col=c(1,2,4))
legend("topright",c("stats::wilcox.test","coin::wilcox_test",
"exactRankTests::wilcox.exact"),
lty=1,col=c(1,2,4))
(exactRankTests added by request, but given that it's not maintained any more and recommends the coin package, I'm not sure how reliable it is. You're on your own for figuring out what the differences among these procedures are and which would be best to use ...)
The results make sense here -- the problem is just that your power is low. If your value is completely outside the range of the data, for n=15, that will be a probability of something like 2*(1/16)=0.125 [i.e. probability of your sample ending up as the first or the last element in a permutation], which is not quite the same as the minimum value here (wilcox.test: p=0.105, wilcox_test: p=0.08), but that might be an approximation issue, or I might have some detail wrong. Nevertheless, it's in the right ballpark.
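The 2*(1/16) figure can be checked directly when there are no ties, because wilcox.test then computes an exact p-value. A small sketch, using a hypothetical tie-free vector d_noties in place of d:

```r
d_noties <- 1:15   # hypothetical tie-free data, n = 15

# 0 lies below every value in d_noties, so it takes the lowest rank
p_out <- wilcox.test(0, d_noties)$p.value
p_out

# Permutation argument: the single value is first or last of 16 positions
2 * (1 / 16)
```

With ties, as in the original d, R falls back to the normal approximation, which is one reason the minimum p-value there differs from this figure.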
You can request the normal approximation explicitly, which avoids the warning (the p-value is the same as above):
wilcox.test(60,d, exact=FALSE)

How do I test GOF of a small sample against an arbitrary distribution in R

I’d like to do a GOF test on a small set of data (around 50 samples) against a defined distribution. Any suggestions would be appreciated.
Example distribution:
n <- 50
time.vec <- 1:n
alpha <- 0.6
test.dist.vec <- 1/(time.vec^alpha)
Example data to be fit against test.dist.vec:
my.jitter <- runif(n, min=-0.05, max=0.1)
test.data <- test.dist.vec + my.jitter
Given the above, how can I test the significance of test.data against test.dist.vec?
Additionally, given a sample test.data with a small n (around 50 – but what is min anyway?), how do I estimate alpha?
Would the Kolmogorov-Smirnov test work for you?
If yes:
ks.test(test.data,test.dist.vec)
Note that when both arguments are numeric vectors, ks.test performs a two-sample test. There are a number of other tests of this kind.
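For context, ks.test has two calling forms, and which one runs depends on the second argument. The sketch below uses a hypothetical exponential sample (obs, ref and the rate are made up for illustration) simply to show the difference:

```r
set.seed(1)
obs <- rexp(50, rate = 2)   # hypothetical sample of 50 values

# One-sample form: compare the sample with a fully specified CDF
one <- ks.test(obs, "pexp", rate = 2)

# Two-sample form: compare the sample with a second sample
ref <- rexp(50, rate = 2)
two <- ks.test(obs, ref)

one$method
two$method
```

The one-sample form is the usual goodness-of-fit test against a defined distribution, provided that distribution can be expressed as a cumulative distribution function.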

Bootstrapping to compare two groups

In the following code I use bootstrapping to calculate the C.I. and the p-value under the null hypothesis that two different fertilizers applied to tomato plants have no effect on plant yields (the alternative being that the "improved" fertilizer is better). The first random sample (x) comes from plants given a standard fertilizer, while the second sample (y) comes from plants given the "improved" one.
x <- c(11.4,25.3,29.9,16.5,21.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
total <- c(x,y)
library(boot)
diff <- function(x,i) mean(x[i[6:11]]) - mean(x[i[1:5]])
b <- boot(total, diff, R = 10000)
ci <- boot.ci(b)
p.value <- sum(b$t>=b$t0)/b$R
What I don't like about the code above is that resampling is done as if there was only one sample of 11 values (separating the first 5 as belonging to sample x leaving the rest to sample y).
Could you show me how this code should be modified in order to draw resamples of size 5 with replacement from the first sample and separate resamples of size 6 from the second sample, so that bootstrap resampling would mimic the “separate samples” design that produced the original data?
EDIT2:
Hack deleted as it was a wrong solution. Instead, one has to use the strata argument of the boot function:
total <- c(x,y)
id <- as.factor(c(rep("x",length(x)),rep("y",length(y))))
b <- boot(total, diff, strata=id, R = 10000)
...
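For reference, a self-contained version of the stratified call might look like the sketch below. It assumes the boot package is installed, and it replaces the hard-coded index ranges with a grouping column (dat and mean_diff are names introduced here for illustration).

```r
library(boot)

x <- c(11.4, 25.3, 29.9, 16.5, 21.1)
y <- c(23.7, 26.6, 28.5, 14.2, 17.9, 24.3)
dat <- data.frame(
  yield = c(x, y),
  group = factor(rep(c("x", "y"), c(length(x), length(y))))
)

# Statistic: difference in group means for the resampled rows
mean_diff <- function(d, i) {
  d <- d[i, ]
  mean(d$yield[d$group == "y"]) - mean(d$yield[d$group == "x"])
}

set.seed(1)
# strata makes boot() resample within each group separately,
# so every replicate has 5 "x" rows and 6 "y" rows
b <- boot(dat, mean_diff, R = 2000, strata = dat$group)
boot.ci(b, type = "perc")
```

Only the percentile interval is requested here to keep the sketch short; R = 2000 is likewise an arbitrary choice.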
Be aware that this will not get you even close to a correct estimate of your p-value:
x <- c(1.4,2.3,2.9,1.5,1.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
total <- c(x,y)
b <- boot(total, diff, strata=id, R = 10000)
ci <- boot.ci(b)
p.value <- sum(b$t>=b$t0)/b$R
> p.value
[1] 0.5162
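How would you explain a p-value of 0.51 for two samples where all values of the second are higher than the highest value of the first? The bootstrap replicates in b$t are centred on the observed statistic b$t0, not on the null value, so roughly half of them exceed b$t0 whatever the data look like.
The above code is fine for getting a (biased) estimate of the confidence interval, but significance testing of the difference should be done by permutation over the complete dataset.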
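A permutation test along those lines can be sketched as follows. With only 11 observations every split into groups of 5 and 6 can be enumerated, so no random resampling is needed; the variable names are illustrative.

```r
x <- c(1.4, 2.3, 2.9, 1.5, 1.1)
y <- c(23.7, 26.6, 28.5, 14.2, 17.9, 24.3)
total <- c(x, y)

# Observed difference in means (y minus x)
obs <- mean(y) - mean(x)

# All ways of choosing which 5 of the 11 values are labelled "x"
idx <- combn(length(total), length(x))

# Difference in means under each relabelling
perm <- apply(idx, 2, function(i) mean(total[-i]) - mean(total[i]))

# One-sided p-value: share of relabellings at least as extreme as observed
# (small tolerance guards against floating-point ties)
p_exact <- mean(perm >= obs - 1e-12)
p_exact
```

Here only the original labelling reaches the observed difference, so the p-value is 1 out of choose(11, 5) = 462 relabellings, in line with the complete separation of the two samples.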
Following John, I think the appropriate way to use bootstrap to test if the sums of these two different populations are significantly different is as follows:
x <- c(1.4,2.3,2.9,1.5,1.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
b_x <- boot(x, sum, R = 10000)
b_y <- boot(y, sum, R = 10000)
z<-(b_x$t0-b_y$t0)/sqrt(var(b_x$t[,1])+var(b_y$t[,1]))
pnorm(z)
So we can clearly reject the null that they are the same population. I may have missed a degrees-of-freedom adjustment (I am not sure how bootstrapping works in that regard), but such an adjustment will not change the results drastically.
While the actual soil beds could be considered a stratified variable in some instances this is not one of them. You only have the one manipulation, between the groups of plants. Therefore, your null hypothesis is that they really do come from the exact same population. Treating the items as if they're from a single set of 11 samples is the correct way to bootstrap in this case.
If you had two plots and tried the different fertilizers in each plot over different seasons in a counterbalanced fashion, then the plots would be stratified samples and you'd want to treat them as such. But that isn't the case here.
