How to create a matrix of p-values in ks.test - r

I am new to R and would appreciate some help. My problem is as follows. I have this code:
N <- 1000 # number of simulations
n <- 30
prob <- 0.001
Y2 <- rep(0, times=N)
for(i in 1:N) {
X <- rbinom(n, size=1, prob)
#binomial experiment
Y2[i] <- sum(X)
}
table(Y2)
library(dgof)
ks.test(Y2, "pnorm", mean=prob*n, sd=sqrt(n*prob*(1-prob)))
I want to run this code 10 times, so that I get 10 different p-values and D statistics. Also, the parameter n will not be fixed; it will be a vector of values (50, 100, 500, 1000, 5000). So I want a matrix of ks.test results like the one in the picture.
Thanks.
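One possible way to build the requested matrices, sketched under a few assumptions: base R's ks.test (from the stats package) is used instead of dgof, the inner loop is replaced by a single rbinom call (the sum of n Bernoulli draws is Binomial(n, prob)), and the names run_once, n_values, n_rep, p_matrix and D_matrix are made up for illustration. Warnings about ties are suppressed because the simulated counts are discrete.

```r
set.seed(1)
N <- 1000                                  # number of simulated sums per test
prob <- 0.001
n_values <- c(50, 100, 500, 1000, 5000)    # sample sizes to compare
n_rep <- 10                                # repetitions per sample size

# One simulation + test for a given n; returns the statistic D and the p-value
run_once <- function(n) {
  Y2 <- rbinom(N, size = n, prob = prob)
  res <- suppressWarnings(
    ks.test(Y2, "pnorm", mean = n * prob, sd = sqrt(n * prob * (1 - prob)))
  )
  c(D = unname(res$statistic), p.value = res$p.value)
}

# For each n: a 2 x n_rep matrix with rows "D" and "p.value"
results <- lapply(n_values, function(n) replicate(n_rep, run_once(n)))

# Rows are repetitions, columns are sample sizes
p_matrix <- sapply(results, function(r) r["p.value", ])
D_matrix <- sapply(results, function(r) r["D", ])
colnames(p_matrix) <- colnames(D_matrix) <- paste0("n=", n_values)
p_matrix
```

Each column of p_matrix and D_matrix corresponds to one value of n, and each row to one of the 10 repetitions.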

Related

how to find correlation coefficient in a for loop that is to be repeated 5000 times? and save the statistic

For 2 independent normally distributed variables x and y, generated using x = rnorm(50) and y = rnorm(50), calculate the correlation 5000 times and save the result each time. What is the likelihood that a correlation with absolute value greater than 0.3 is computed? (Use set.seed(42) and plot a histogram of the spread of the coefficients.)
This is what I have tried so far...
set.seed(42)
n <- 50 #length of random sequence
x_norm <- rnorm(n)
y_norm <- rnorm(n)
nrun <- 5000
corr <- numeric(nrun)
for (i in 1:nrun) {
corrxy <- cor(x_norm,y_norm)
corr[i] <- sum(abs(corrxy > 0.3)) / n #save statistic in the vector
}
hist(corr)
It is expected that I get 5000 different coefficients saved in corr, and that, when plotted with hist(), these coefficients approximately follow a normal distribution. But I do not understand how the for loop works or how to incorporate the condition that the coefficient is greater than 0.3.
I think you were nearly there. You just had to shift some code outside and inside the for loop.
You want new data for each run of the loop (otherwise you get the same correlation 5000 times) and you need to save the correlation each time the loop runs. This results in a vector of 5000 correlations which you can use to look at the proportion of correlations (divide by the number of runs, not the number of observations) that are higher than .3 outside of the for loop.
Edit: One final correction is needed in the bracketing of the absolute function. You want to find the absolute correlations > .3 not the absolute value of corrxy > .3.
set.seed(42)
n <- 50 #length of random sequence
nrun <- 5000
corrxy <- numeric(nrun) # The correlation is the statistic you want to save
for (i in 1:nrun) {
x_norm <- rnorm(n) # Compute a new dataset for each run (otherwise you get the same correlation)
y_norm <- rnorm(n)
corrxy[i] <- cor(x_norm,y_norm) # Calculate the correlation
}
hist(corrxy)
sum(abs(corrxy) > 0.3) / nrun # proportion of runs with |cor| > .3
Below is the resulting histogram of the 5000 correlations. The proportion of correlations with absolute value greater than .3 is 0.034 in this case.
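As a cross-check on the simulated proportion: for two independent normal samples of size n, the quantity r*sqrt(n-2)/sqrt(1-r^2) follows a t distribution with n-2 degrees of freedom, so the probability of |r| > 0.3 can also be computed directly. The names r0, t0 and p_theory below are illustrative only.

```r
n <- 50
r0 <- 0.3
t0 <- r0 * sqrt(n - 2) / sqrt(1 - r0^2)   # t statistic corresponding to |r| = 0.3
p_theory <- 2 * pt(-t0, df = n - 2)       # two-sided tail probability
p_theory
```

The value should be close to the proportion obtained from the simulation, with the remaining difference due to Monte Carlo error.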
Here's another way of doing this kind of simulations without explicitly calling a loop:
Define first your simulation:
my_sim <- function(n) { # n is the sample size
x <- rnorm(n)
y <- rnorm(n)
corrxy <- cor(x, y)
corrxy # return the correlation (single value)
}
Now we can call this function many times with replicate():
set.seed(123)
nrun <- 10
my_results <- replicate(nrun, my_sim(n=50))
#my_results
# [1] -0.0358698314 -0.0077403045 -0.0512509071 -0.0998484901 0.1230261286 0.1001124010 -0.0002023124
# [8] 0.2017120443 0.0644662387 0.0567232640
Now my_results holds the correlations from all the simulations (just 10 in this example).
And you can compute your statistics:
sum(abs(my_results)> 0.3) / nrun # nrun is 10
or plot:
hist(my_results)

MIC correlation between 2 matrices in R

The MINERVA package provides a function to compute the Maximal Information Coefficient (MIC). According to the package description, the function mine(x, y) works only with two matrices A and B of the same size.
Here, I would like to obtain the MIC values for two matrices A and B of different sizes: A is n by m and B is n by z, with n being the number of observations (rows).
In other words, my aim is to obtain an m x z matrix C whose entries are the MIC values for each pair of columns (and, if possible, the associated p-values, if any).
I provide an example of what I want with the Pearson correlation.
set.seed(1)
x <- matrix(rnorm(50), nrow=5, ncol=10)   # 5 x 10 = 50 values, so nothing is recycled
y <- matrix(rnorm(100), nrow=5, ncol=20)  # 5 x 20 = 100 values, so nothing is recycled
P <- cor(x, y=y)
I emailed one of the authors of the MINERVA package without success. Is there any way I can apply the mine function to obtain the desired m by z correlation matrix?
Let me answer my own post. In the code below I use nested for loops, which may not be the smartest or fastest way to do it, but it works as expected.
library(minerva)
set.seed(1)
x <- matrix(rnorm(50), nrow=5, ncol=10)   # 5 x 10 = 50 values, so nothing is recycled
y <- matrix(rnorm(100), nrow=5, ncol=20)  # 5 x 20 = 100 values, so nothing is recycled
Result <- matrix(NA, nrow = ncol(x), ncol = ncol(y))
for (i in 1:ncol(x)) {
  Thisvar <- x[, i]
  print(i)  # progress indicator
  for (k in 1:ncol(y)) {
    Thisvar2 <- y[, k]
    res <- mine(Thisvar, Thisvar2, master=TRUE, use="all.obs")
    Result[i, k] <- res$MIC
  }
}
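The same double loop can be wrapped in a small helper that accepts any function of two vectors. This is only a sketch: pairwise_stat is a made-up name, and the check uses Pearson correlation (base R) because the expected result is known there. The commented-out last line shows how the helper might be called with mine, assuming minerva is installed.

```r
# Apply f to every (column of x, column of y) pair; returns an ncol(x) x ncol(y) matrix
pairwise_stat <- function(x, y, f) {
  out <- matrix(NA_real_, nrow = ncol(x), ncol = ncol(y))
  for (i in seq_len(ncol(x))) {
    for (k in seq_len(ncol(y))) {
      out[i, k] <- f(x[, i], y[, k])
    }
  }
  out
}

set.seed(1)
x <- matrix(rnorm(50), nrow = 5, ncol = 10)
y <- matrix(rnorm(100), nrow = 5, ncol = 20)

# Sanity check against cor(x, y), which computes the same m x z matrix directly
P <- pairwise_stat(x, y, cor)
all.equal(P, cor(x, y), check.attributes = FALSE)

# With minerva installed, the helper could be used like this:
# M <- pairwise_stat(x, y, function(a, b) minerva::mine(a, b)$MIC)
```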

Matrix computation with for loop

I am a newcomer to R, having migrated from GAUSS because of license verification issues.
I want to speed up the following code, which creates an n×k matrix A. Given the n×1 vector x and the parameter vectors mu and sig (both k-dimensional), A is defined by A[i,j] = dnorm(x[i], mu[j], sig[j]). The code works fine for small sizes such as n=40, k=4, but slows down significantly when n is around 10^6 and k is roughly n^{1/3}.
I am running a simulation experiment to verify bootstrap validity, so I need to compute the matrix A repeatedly (#simulations × #bootstrap replications times), and this becomes rather time consuming because I want to experiment with many different values of n and k. I vectorized the code as much as I could (thanks to the vector arguments of dnorm), but is any further speed-up possible?
Preemptive thanks for any help.
x = rnorm(40)
mu = c(-1,0,4,5)
sig = c(2^2,0.5^2,2^2,3^2)
n = length(x)
k = length(mu)
A = matrix(NA,n,k)
for(j in 1:k){
A[,j]=dnorm(x,mu[j],sig[j])
}
Your method can be put into a function like this
A.fill <- function(x,mu,sig) {
k <- length(mu)
n <- length(x)
A <- matrix(NA,n,k)
for(j in 1:k) A[,j] <- dnorm(x,mu[j],sig[j])
A
}
and it's clear that you are filling the matrix A column by column.
R stores the entries of a matrix columnwise (just like Fortran).
This means that the matrix can be filled with a single call of dnorm using suitable repetitions of x, mu, and sig. The vector z will hold the columns of the desired matrix stacked on top of each other, and the matrix to be returned can then be formed from that vector just by specifying the number of rows and columns. See the following function:
B.fill <- function(x,mu,sig) {
k <- length(mu)
n <- length(x)
z <- dnorm(rep(x,times=k),rep(mu,each=n),rep(sig,each=n))
B <- matrix(z,nrow=n,ncol=k)
B
}
Let's make an example with your data and test this as follows:
N <- 40
set.seed(11)
x <- rnorm(N)
mu <- c(-1,0,4,5)
sig <- c(2^2,0.5^2,2^2,3^2)
A <- A.fill(x,mu,sig)
B <- B.fill(x,mu,sig)
all.equal(A,B)
# [1] TRUE
I'm assuming that n is an integer multiple of k.
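The column-major layout that B.fill relies on can be seen in a tiny example. The names v and m are illustrative only.

```r
v <- 1:6
m <- matrix(v, nrow = 2, ncol = 3)   # matrix() fills column by column
m[, 1]                               # the first column holds v[1:2]
identical(as.vector(m), v)           # flattening returns the original order
```

This is why a vector of stacked columns can be turned into the desired matrix without any reordering.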
Addition
As noted in the comments B.fill is quite slow for large values of n.
The reason lies in the construct rep(...,each=...).
So is there a way to speed up A.fill?
I tested this function:
C.fill <- function(x,mu,sig) {
k <- length(mu)
n <- length(x)
sapply(1:k,function(j) dnorm(x,mu[j],sig[j]), simplify=TRUE)
}
This function is about 20% faster than A.fill.
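Another variant worth timing is base R's outer(), which hands fully expanded argument vectors to a vectorised function in a single call. The sketch below only checks that it reproduces the loop result; no speed claim is made, and the names A and D are illustrative.

```r
set.seed(11)
x <- rnorm(1e4)
mu <- c(-1, 0, 4, 5)
sig <- c(2^2, 0.5^2, 2^2, 3^2)

# Loop version from the question, kept as the reference result
A <- matrix(NA_real_, length(x), length(mu))
for (j in seq_along(mu)) A[, j] <- dnorm(x, mu[j], sig[j])

# outer() calls the function once with x repeated for every column index j
D <- outer(x, seq_along(mu), function(xi, j) dnorm(xi, mu[j], sig[j]))

all.equal(A, D)
```

To compare speeds, each variant can be wrapped in system.time() for realistic values of n and k.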

Generating correlated ordinal data

I'm using the GenOrd package to generate correlated ordinal data. The basic idea is to generate ordinal data with correlation 0.5. Now I want to repeat the whole code 1000 times and save the resulting correlations to see how close I get to 0.5, and then change the sample size and the marginal probabilities and see what changes.
library(GenOrd)
R<-matrix(c(1,0.5,0.5,1),2,2)
Marginal<-list(c(0.2,0.5,0.7,0.9),c(0.1,0.3,0.4,0.5))
DataOrd<-ordsample(100,Marginal,R)
correlation<-cor(DataOrd)
correlation[1,2] # 0.5269
Here is a simple solution:
sim.cor <- function(R, Marginal, n, K)
{
res <- numeric(length = K)
for(i in 1:K)
res[i] <- cor(ordsample(n, Marginal, R))[1,2]
res
}
where n is the sample size and K is the number of times you want to repeat. So, in your example, you can call this function and save the result (a vector of size K with the correlations) in an object:
set.seed(1234)
correlations <- sim.cor(R = R, Marginal = Marginal, n = 100, K = 1000)
mean(correlations)
# [1] 0.5009389
A faster and more elegant solution is to use the replicate function as suggested by jaysunice3401:
set.seed(1234)
n <- 100
correlations <- replicate(n = 1000, expr = cor(ordsample(n, Marginal, R))[1,2])
mean(correlations)
# [1] 0.5009389
I hope this can help!
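To see what changes with the sample size, the replicate call can be wrapped in a function and applied over a grid of sizes. The sketch below is self-contained and therefore uses a stand-in generator (gen_pair, two correlated normal variables) rather than GenOrd. All names in it are hypothetical, and with GenOrd loaded one would pass gen = function(n) ordsample(n, Marginal, R) instead.

```r
# Stand-in generator (NOT GenOrd): two normal variables with correlation rho
gen_pair <- function(n, rho = 0.5) {
  x <- rnorm(n)
  y <- rho * x + sqrt(1 - rho^2) * rnorm(n)
  cbind(x, y)
}

# Repeat the simulation K times for one sample size and summarise the correlations
summarise_cor <- function(n, K, gen = gen_pair) {
  r <- replicate(K, cor(gen(n))[1, 2])
  c(n = n, mean = mean(r), sd = sd(r))
}

set.seed(1234)
sizes <- c(50, 100, 500)
summary_table <- t(sapply(sizes, summarise_cor, K = 1000))  # one row per sample size
summary_table
```

The sd column shows how the spread of the simulated correlations shrinks as the sample size grows.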

Computing linear regressions for every possible permutation of matrix columns

I have a (k x n) matrix. I have initially managed to linearly regress (using the lm function) column 1 with each and every other column and extracted only the coefficients.
fore.choose <- matrix(0, 1, NCOL(assets))
for(i in seq(1, NCOL(assets), 1))
{
abc <- lm(assets[,1]~assets[,i])$coefficients
fore.choose[1,i] <- abc[2:length(abc)]
}
The coefficients are placed in the fore.choose matrix.
What I now need to do is to linearly regress column 2 with each and every other column, and then column 3 and so on and so forth and extract only the coefficients.
The output will be a square matrix of OLS univariate coefficients. Kind of similar to a correlation matrix, but it is the beta coefficients I am interested in.
fore.choose <- matrix(0, 1, NCOL(assets))
will initially need to become
fore.choose <- matrix(0, NCOL(assets), NCOL(assets))
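I'd just compute the coefficients directly from the correlation matrix, using the fact that the slope from regressing y on x is beta = cor(x,y)*sd(y)/sd(x), so that entry [i,j] of the result is the slope of column i regressed on column j: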
# set up some sample data
set.seed(1)
d <- matrix(rnorm(50), ncol=5)
# get the coefficients
s <- apply(d, 2, sd)
cor(d)*outer(s, s, "/")
You could also use lsfit to get the coefficients of one term on all the others at once and then only have one loop to do:
sapply(1:ncol(d), function(i) {
coef(lsfit(d[,i], d))[2,]
})
I'm sure there must be a more elegant way than nested loops.
fore.choose <- matrix(NA, NCOL(assets), NCOL(assets))
abc <- NULL
for(i in seq_len(ncol(assets))){ # loop over "dependent" columns
for(j in seq_len(ncol(assets))){ # loop over "independent" columns
abc <- lm(assets[,i]~assets[,j])$coefficients
fore.choose[i,j] <- abc[-1]
}
}
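The correlation-based shortcut and the nested lm loop can be checked against each other on simulated data. This is a sketch with made-up names (d, beta_cor, beta_lm). It relies only on the identity that the slope of y regressed on x equals cor(x,y)*sd(y)/sd(x).

```r
set.seed(1)
d <- matrix(rnorm(50), ncol = 5)
p <- ncol(d)

# Closed form: entry [i, j] is the slope of column i regressed on column j
s <- apply(d, 2, sd)
beta_cor <- cor(d) * outer(s, s, "/")

# Reference: fit every pair of columns with lm() and keep the slope
beta_lm <- matrix(NA_real_, p, p)
for (i in seq_len(p)) {
  for (j in seq_len(p)) {
    beta_lm[i, j] <- coef(lm(d[, i] ~ d[, j]))[2]
  }
}

all.equal(beta_cor, beta_lm, check.attributes = FALSE)
```

The diagonal is 1 in both matrices, since each column regressed on itself has slope 1.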
