Splitting a large vector into intervals in R [duplicate] - r

This question already has answers here:
Split a vector into chunks
(22 answers)
Closed 9 years ago.
I'm not too good with R. I ran a loop and ended up with a huge resulting vector of 11,303,044 values. I have another vector, from a different loop, of 1,681 values.
I want to run chisq.test to compare their distributions, but since they are of different lengths, it's not working.
I tried taking 1,681-sized samples from the 11,303,044-sized vector to match the length of the second vector, but I get different chisq.test results every time I run it.
I'm thinking of splitting the two vectors into an equal number of intervals.
Let's say
vector1:
temp.mat<-matrix((rnorm(11303044))^2, ncol=1)
head(temp.mat)
dim(temp.mat)
vector2:
temp.mat<-matrix((rnorm(1681))^2, ncol=1)
head(temp.mat)
dim(temp.mat)
How do I split them in equal intervals to result in same lengths vectors?
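One way to do that split, sketched below with smaller vectors: bin both samples with the same set of breakpoints, so each vector is reduced to counts over identical intervals. The deciles of the smaller sample are used as breakpoints here; that and the number of bins are illustrative choices, not the only option.

```r
set.seed(1)
v1 <- rnorm(100000)^2   # stands in for the 11,303,044-value vector
v2 <- rnorm(1681)^2     # the smaller vector

# One common set of interval boundaries for both vectors,
# here the deciles of the smaller sample
breaks <- quantile(v2, probs = seq(0, 1, by = 0.1))
breaks[1] <- 0                 # open the first bin down to zero
breaks[length(breaks)] <- Inf  # and the last bin up to infinity

t1 <- table(cut(v1, breaks))
t2 <- table(cut(v2, breaks))
length(t1) == length(t2)  # both vectors are now counts over the same 10 bins
```

With both vectors reduced to counts over the same bins, they have equal length and can be compared directly (e.g. as proportions).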

mat1<-matrix((rnorm(1130300))^2, ncol=1) # only one-tenth the size of your vector
smat=sample(mat1, 100000) #and take only one-tenth of that
mat2<-matrix((rnorm(1681))^2, ncol=1)
qqplot(smat,mat2) #and repeat the sampling a few times
What you see seems interesting from a statistical point of view. At the higher levels of "departure from the mean", the large sample always departs from a "good fit", which is not surprising, because it contains a far higher number of really extreme values.

chisq.test is Pearson's chi-squared test. It is designed for discrete data; given two input vectors, it coerces them to factors and tests for independence, not for equality of distribution. This means, for example, that the order of the data will make a difference.
> set.seed(123)
> x<-sample(5,10,T)
> y<-sample(5,10,T)
> chisq.test(x,y)
Pearson's Chi-squared test
data: x and y
X-squared = 18.3333, df = 16, p-value = 0.3047
Warning message:
In chisq.test(x, y) : Chi-squared approximation may be incorrect
> chisq.test(x,y[10:1])
Pearson's Chi-squared test
data: x and y[10:1]
X-squared = 16.5278, df = 16, p-value = 0.4168
Warning message:
In chisq.test(x, y[10:1]) : Chi-squared approximation may be incorrect
So I don't think that chisq.test is what you want, because it does not compare distributions. Maybe try something like ks.test, which works with vectors of different lengths and with continuous data.
> set.seed(123)
> x<-rnorm(2000)^2
> y<-rnorm(100000)^2
> ks.test(x,y)
Two-sample Kolmogorov-Smirnov test
data: x and y
D = 0.0139, p-value = 0.8425
alternative hypothesis: two-sided
> ks.test(sqrt(x),y)
Two-sample Kolmogorov-Smirnov test
data: sqrt(x) and y
D = 0.1847, p-value < 2.2e-16
alternative hypothesis: two-sided

Related

Removing outliers: can not run cor.test()

I am extracting outliers from a single column of a dataset. Then I am attempting to run cor.test() on that column plus another column. I am getting this error:
Error in cor.test.default(dep_delay_noout, distance) : 'x' and 'y' must have the same length
I assume this is because removing the outliers from one column made it a different length than the other column, but I am not sure what to do about it. I have tried mutating the dataset by adding a new column that lacked outliers, but unfortunately ran into the same problem. Does anybody know what to do? Below is my code.
dep_delay<-flights$dep_delay
dep_delay_upper<-quantile(dep_delay,0.997,na.rm=TRUE)
dep_delay_lower<-quantile(dep_delay,0.003,na.rm=TRUE)
dep_delay_out<-which(dep_delay>dep_delay_upper|dep_delay<dep_delay_lower)
dep_delay_noout<-dep_delay[-dep_delay_out]
distance<-flights$distance
cor.test(dep_delay_noout,distance)
You were almost there. In cor.test you also want to subset distance. Additionally, for the preprocessing you could use a quantile vector of length 2 and mapply to do the comparison in one step, just to write it more concisely; actually your code is fine.
data('flights', package='nycflights13')
nna <- !is.na(flights$dep_delay)
(q <- quantile(flights$dep_delay[nna], c(0.003, 0.997)))
# 0.3% 99.7%
# -14 270
nout <- rowSums(mapply(\(f, q) f(flights$dep_delay[nna], q), c(`>`, `<`), q)) == 2
with(flights[nna, ], cor.test(dep_delay[nout], distance[nout]))
# Pearson's product-moment correlation
#
# data: dep_delay[nout] and distance[nout]
# t = -12.409, df = 326171, p-value < 2.2e-16
# alternative hypothesis: true correlation is not equal to 0
# 95 percent confidence interval:
# -0.02515247 -0.01829207
# sample estimates:
# cor
# -0.02172252

chi-square - two sample test in r

Is there in R a function to perform a chi-square two-sample test? http://www.itl.nist.gov/div898/software/dataplot/refman1/auxillar/chi2samp.htm
For example, I want to test whether x = rnorm(100) and y = rnorm(100) come from the same distribution.
I tried to use the chisq.test function, but I think it is not correct, because it gives me a very large p-value even for samples from different distributions:
> chisq.test(rnorm(100),runif(100))
Pearson's Chi-squared test
data: rnorm(100) and runif(100)
X-squared = 9900, df = 9801, p-value = 0.239
thank you
I have similar data - binned length frequencies of fish from two different sampling methods. Because my data are binned, as I understand it, it is not appropriate to use the KS test, because that test requires continuous data. So, I too, came across the Chi Square two sample test offered by DataPlot at the link you provide: http://www.itl.nist.gov/div898/software/dataplot/refman1/auxillar/chi2samp.htm
I installed the Dataplot software, and communicated with the contact person for the test when I had problems. I've also run it in R, and gotten the same result (a good check). Here's what I coded in R:
col1=c(2,4,22,46,81,103,238,65,82,9,58,2)
col2=c(0,0,1,1,0,10,32,22,20,4,0,0)
chisq.test(cbind(col1,col2))
Pearson's Chi-squared test
data: cbind(col1, col2)
X-squared = 53.782, df = 11, p-value = 1.293e-07
NOTE: Dataplot did not mind if I had a bin with zeros for both of my samples, but R does. After much frustration I realized this, removed the shared zero bins, and R gave me the same test statistic, df, and p-value that I got with DataPlot. Hope this helps!
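That zero-bin cleanup can be sketched generically: drop any bin that is zero in both samples before calling chisq.test, since such rows have expected counts of zero. With the data above no bin is empty in both columns, so nothing is dropped and the result matches the one shown.

```r
col1 <- c(2, 4, 22, 46, 81, 103, 238, 65, 82, 9, 58, 2)
col2 <- c(0, 0, 1, 1, 0, 10, 32, 22, 20, 4, 0, 0)
tab <- cbind(col1, col2)

# Drop any bin that is zero in BOTH samples
tab <- tab[rowSums(tab) > 0, , drop = FALSE]

res <- chisq.test(tab)  # may still warn about small expected counts
res
```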

coin::wilcox_test versus wilcox.test in R

In trying to figure out which one is better to use I have come across two issues.
1) The W statistic given by wilcox.test is different from that of coin::wilcox_test. Here's my output:
wilcox_test:
Exact Wilcoxon Mann-Whitney Rank Sum Test
data: data$variableX by data$group (yes, no)
Z = -0.7636, p-value = 0.4489
alternative hypothesis: true mu is not equal to 0
wilcox.test:
Wilcoxon rank sum test with continuity correction
data: data$variable by data$group
W = 677.5, p-value = 0.448
alternative hypothesis: true location shift is not equal to 0
I'm aware that there's actually two values for W and that the smaller one is usually reported. When wilcox.test is used with comma instead of "~" I can get the other value, but this comes up as W = 834.5. From what I understand, coin::statistic() can return three different statistics using ("linear", "standarized", and "test") where "linear" is the normal W and "standardized" is just the W converted to a z-score. None of these match up to the W I get from wilcox.test though (linear = 1055.5, standardized = 0.7636288, test = -0.7636288). Any ideas what's going on?
2) I like the options in wilcox_test for "distribution" and "ties.method", but it seems that you can not apply a continuity correction like in wilcox.test. Am I right?
I encountered the same issue when trying to apply Wendt formula to compute effect sizes using the coin package, and obtained aberrant r values due to the fact that the linear statistic outputted by wilcox_test() is unadjusted.
A great explanation is already given here, and therefore I will simply address how to obtain adjusted U statistics with the wilcox_test() function. Let's use the following data frame:
d <- data.frame(
  x = c(rnorm(n = 60, mean = 10, sd = 5), rnorm(n = 30, mean = 16, sd = 5)),
  g = c(rep("a", times = 60), rep("b", times = 30))
)
We can perform identical tests with wilcox.test() and wilcox_test():
w1 <- wilcox.test( formula = x ~ g, data = d )
w2 <- wilcox_test( formula = x ~ g, data = d )
Which will output two distinct statistics:
> w1$statistic
W
321
> w2@statistic@linearstatistic
[1] 2151
The values are indeed totally different (albeit the tests are equivalent).
To obtain a U statistic identical to that of wilcox.test(), you need to subtract from wilcox_test()'s output statistic the minimal value that the sum of the ranks of the reference sample can take, which is n1(n1+1)/2.
Both commands take the first level in the factor of your grouping variable g as reference (which will by default be alphabetically ordered).
Then you can compute the smallest sum of the ranks possible for the reference sample:
n1 <- table(w2@statistic@x)[1]
And
w2@statistic@linearstatistic - n1*(n1+1)/2 == w1$statistic
should return TRUE.
Voilà.
It seems one is reporting the Mann-Whitney U statistic and the other the Wilcoxon rank-sum statistic, which is defined in many different ways in the literature. They are essentially equivalent; just look at the p-value. If you want the continuity correction in wilcox.test, use the argument correct = TRUE.
Check https://stats.stackexchange.com/questions/79843/is-the-w-statistic-outputted-by-wilcox-test-in-r-the-same-as-the-u-statistic
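The relationship behind that link can be checked directly in base R: wilcox.test's W is the rank sum of the first sample in the pooled data minus its minimum possible value n1(n1+1)/2 (a small sketch; the first argument to wilcox.test is the reference group).

```r
set.seed(42)
x <- rnorm(20)
y <- rnorm(25)
n1 <- length(x)

# Rank sum of the first sample in the pooled data, minus its minimum n1*(n1+1)/2
W_manual <- sum(rank(c(x, y))[1:n1]) - n1 * (n1 + 1) / 2
W_test   <- unname(wilcox.test(x, y)$statistic)
W_manual == W_test  # the two agree (continuous data, so no ties)
```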

correlation failure - Pearson

I want to write to datafile information about correlation as follows:
korelacja = cor(p2, d2, method = "pearson", use = "complete.obs")
korelacja2 = cor(p2, d2, method = "kendall", use = "complete.obs")
korelacja3 = cor(p2, d2, method = "spearman", use = "complete.obs")
dane = paste(korelacja, korelacja2, korelacja3, sep = ';')
write(dane, file = nazwa, append = TRUE)
The results look strange to me: the Pearson correlation is very high (always equal to one), but Kendall and Spearman are very low. I created scatterplots and I don't see a linear correlation.
It's not hard to replicate this pattern if you have some large outliers in your data that dominate the Pearson correlation but are relatively insignificant in the non-parametric (Kendall/Spearman) approaches. For example, here's a concocted data set with nothing going on except for one large outlier:
> set.seed(1001)
> x <- c(runif(1000),1e5)
> y <- c(runif(1000),1e5)
> cor(x,y,method="pearson")
[1] 1
> cor(x,y,method="kendall")
[1] -0.02216583
> cor(x,y,method="spearman")
[1] -0.03335352
This is consistent with your description so far, although you ought in this case to be able to see the outliers in your scatterplots ...
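To confirm the outlier explanation with the same concocted data, drop the single extreme point; all three measures should then fall to near zero:

```r
set.seed(1001)
x <- c(runif(1000), 1e5)
y <- c(runif(1000), 1e5)

# Remove the single shared outlier
x2 <- x[-1001]
y2 <- y[-1001]

c(pearson  = cor(x2, y2, method = "pearson"),
  kendall  = cor(x2, y2, method = "kendall"),
  spearman = cor(x2, y2, method = "spearman"))
# all three are now close to zero
```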

Bootstrapping to compare two groups

In the following code I use bootstrapping to calculate the C.I. and the p-value under the null hypothesis that two different fertilizers applied to tomato plants have no effect in plants yields (and the alternative being that the "improved" fertilizer is better). The first random sample (x) comes from plants where a standard fertilizer has been used, while an "improved" one has been used in the plants where the second sample (y) comes from.
x <- c(11.4,25.3,29.9,16.5,21.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
total <- c(x,y)
library(boot)
diff <- function(x,i) mean(x[i[6:11]]) - mean(x[i[1:5]])
b <- boot(total, diff, R = 10000)
ci <- boot.ci(b)
p.value <- sum(b$t>=b$t0)/b$R
What I don't like about the code above is that resampling is done as if there was only one sample of 11 values (separating the first 5 as belonging to sample x leaving the rest to sample y).
Could you show me how this code should be modified in order to draw resamples of size 5 with replacement from the first sample and separate resamples of size 6 from the second sample, so that bootstrap resampling would mimic the “separate samples” design that produced the original data?
EDIT2 :
Hack deleted as it was a wrong solution. Instead one has to use the argument strata of the boot function :
total <- c(x,y)
id <- as.factor(c(rep("x",length(x)),rep("y",length(y))))
b <- boot(total, diff, strata=id, R = 10000)
...
Be aware that you're not going to get even close to a correct estimate of your p-value:
x <- c(1.4,2.3,2.9,1.5,1.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
total <- c(x,y)
b <- boot(total, diff, strata=id, R = 10000)
ci <- boot.ci(b)
p.value <- sum(b$t>=b$t0)/b$R
> p.value
[1] 0.5162
How would you explain a p-value of 0.51 for two samples where all values of the second are higher than the highest value of the first?
The above code is fine for getting a (biased) estimate of the confidence interval, but the significance testing of the difference should be done by permutation over the complete dataset.
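Such a permutation test might be sketched as follows, using the fertilizer data from the question: shuffle the group labels over the pooled sample and compare the observed mean difference against the permutation distribution. The one-sided direction ("improved" fertilizer better) and R = 10000 are taken from the question; the +1 correction in the p-value is a common convention, not part of the original code.

```r
set.seed(1)
x <- c(11.4, 25.3, 29.9, 16.5, 21.1)
y <- c(23.7, 26.6, 28.5, 14.2, 17.9, 24.3)
total <- c(x, y)
obs <- mean(y) - mean(x)  # observed difference, "improved" minus standard

R <- 10000
perm <- replicate(R, {
  idx <- sample(length(total))  # shuffle the group labels
  mean(total[idx[6:11]]) - mean(total[idx[1:5]])
})
p.value <- (sum(perm >= obs) + 1) / (R + 1)  # one-sided
p.value
```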
Following John, I think the appropriate way to use bootstrap to test if the sums of these two different populations are significantly different is as follows:
x <- c(1.4,2.3,2.9,1.5,1.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
b_x <- boot(x, function(d, i) sum(d[i]), R = 10000)
b_y <- boot(y, function(d, i) sum(d[i]), R = 10000)
z<-(b_x$t0-b_y$t0)/sqrt(var(b_x$t[,1])+var(b_y$t[,1]))
pnorm(z)
So we can clearly reject the null that they are the same population. I may have missed a degree of freedom adjustment, I am not sure how bootstrapping works in that regard, but such an adjustment will not change your results drastically.
While the actual soil beds could be considered a stratified variable in some instances this is not one of them. You only have the one manipulation, between the groups of plants. Therefore, your null hypothesis is that they really do come from the exact same population. Treating the items as if they're from a single set of 11 samples is the correct way to bootstrap in this case.
If you had two plots, and in each plot tried the different fertilizers over different seasons in a counterbalanced fashion, then the plots would be stratified samples and you'd want to treat them as such. But that isn't the case here.
