two sided ks test loop, get p.value - r

I have a column of data from which I am taking randomized subsamples of 50%.
I'm running a two-sided KS test to compare the distribution of 50% of the data against 100% of the data, to see whether the subsample still follows the same distribution.
To do this properly I want to run the test in a loop of, say, 1000 iterations and get an average p-value from 1000 randomized subsamples. This code gives me a single p-value for one random subset of 50% of my sample:
dat50=dat[sample(nrow(dat),replace=F,size=0.50*nrow(dat)),]
ks.test(dat[,1],dat50[,1], alternative="two.sided")
I need a line of code that will run this 1000 times saving the resulting (different) p value each time in a column which I can then average. The code I'm trying to get to work looks like this:
x <- numeric(100)
for (i in 1:100){
x<- ks.test(dat[,7],dat50[,7], alternative="two.sided")
x<-x$p.value
}
However, this does not store multiple p-values.
I also tried this:
get.p.value <- function(df1, df2) {
x <- rf(5, df1=df1, df2=df2)
p.value <- ks.test(dat[,6],dat50[,6], alternative="two.sided")$p.value
}
replicate (2000, get.p.value(df1 = 5, df2 = 10))
I hope that is clear and I would appreciate any help solving this so much!
Q

In your for loop you are overwriting x in each iteration, so only the p-value from the last iteration is kept. You also need to draw a new subsample inside the loop; otherwise every iteration tests the same dat50. Try this instead:
x <- numeric(100)
for (i in seq_along(x))
x[i] <- ks.test(dat[,7], dat[sample(nrow(dat), replace=F, size=0.5*nrow(dat)),7])$p.value
You can get the same result using replicate with:
replicate(100, ks.test(dat[,7], dat[sample(nrow(dat), replace=F, size=0.5*nrow(dat)),7])$p.value)
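To get the average the question asks for, the replicated p-values can be stored and passed to mean(). This is only a minimal sketch: the data frame `dat` is simulated here because the real data is not shown, and the helper name `ks_half` is made up for illustration.

```r
set.seed(1)  # only so that the sketch is reproducible

# Simulated stand-in for the asker's data (assumption); column 7 is tested
dat <- as.data.frame(matrix(rnorm(200 * 7), ncol = 7))

# Hypothetical helper: p-value for one random 50% subsample vs. the full column
ks_half <- function(d, col) {
  idx <- sample(nrow(d), size = floor(0.5 * nrow(d)), replace = FALSE)
  ks.test(d[, col], d[idx, col], alternative = "two.sided")$p.value
}

p_values <- replicate(1000, ks_half(dat, 7))  # one p-value per subsample
mean(p_values)                                # average over the 1000 subsamples
```

Depending on the R version and the sample size, ks.test may warn about ties, because every value in the subsample also occurs in the full column.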

Related

Calculate Errors using loop function in R

I have two data matrices with the same dimensions. I want to extract the same series of column vectors from each, treat each series as a vector, and then calculate different errors, for example mean absolute error (mae), mean absolute percentage error (mape) and root mean square error (rmse). My data matrices are quite large, so I will explain with a small example and calculate these errors manually:
mat1<- matrix(6:75,ncol=10,byrow=T)
mat2<- matrix(30:99,ncol=10,byrow=T)
mat1_seri1 <- as.vector(mat1[,c(1+(0:4)*2)])
mat1_seri2<- as.vector(mat1[,c(2+(0:4)*2)])
mat2_seri1 <- as.vector(mat2[,c(1+(0:4)*2)])
mat2_seri2<- as.vector(mat2[,c(2+(0:4)*2)])
mae1<-mean(abs(mat1_seri1-mat2_seri1))
mae2<-mean(abs(mat1_seri2-mat2_seri2))
For mape
mape1<- mean(abs(mat1_seri1-mat2_seri1)/mat1_seri1)*100
mape2<- mean(abs(mat1_seri2-mat2_seri2)/mat1_seri2)*100
Similarly, I calculate rmse from its formula. As I have large data matrices, doing this manually is quite time-consuming. Is it possible to do this with a loop that outputs the error terms (mae, mape, rmse) for each series separately?
I'm not sure if this is what you are looking for, but here is a function that could automate the process, maybe there is also a better way:
fn <- function(m1, m2) {
stopifnot(dim(m1) == dim(m2))
mat1_seri1 <- as.vector(m1[, (1:ncol(m1))[(1:ncol(m1))%%2 != 0]])
mat1_seri2 <- as.vector(m1[, (1:ncol(m1))[!(1:ncol(m1))%%2]])
mat2_seri1 <- as.vector(m2[, (1:ncol(m2))[(1:ncol(m2))%%2 != 0]])
mat2_seri2 <- as.vector(m2[, (1:ncol(m2))[!(1:ncol(m2))%%2]])
mae1 <- mean(abs(mat1_seri1-mat2_seri1))
mae2 <- mean(abs(mat1_seri2-mat2_seri2))
mape1 <- mean(abs(mat1_seri1-mat2_seri1)/mat1_seri1)*100
mape2 <- mean(abs(mat1_seri2-mat2_seri2)/mat1_seri2)*100
setNames(as.data.frame(matrix(c(mae1, mae2, mape1, mape2), ncol = 4)),
c("mae1", "mae2", "mape1", "mape2"))
}
fn(mat1, mat2)
mae1 mae2 mape1 mape2
1 24 24 92.62581 86.89572
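The function above covers mae and mape for two series. As a rough sketch of how it might be extended to rmse and to any number of series, the version below loops over the series with sapply. The names `err_measures` and `n_series` are invented for this example, and it assumes the columns alternate between series as in the question.

```r
# Hypothetical helper: error measures between two equally long vectors
err_measures <- function(actual, predicted) {
  c(mae  = mean(abs(actual - predicted)),
    mape = mean(abs((actual - predicted) / actual)) * 100,
    rmse = sqrt(mean((actual - predicted)^2)))
}

mat1 <- matrix(6:75, ncol = 10, byrow = TRUE)
mat2 <- matrix(30:99, ncol = 10, byrow = TRUE)

n_series <- 2  # assumption: columns alternate between the series

# Series k uses columns k, k + n_series, k + 2 * n_series, ...
errors <- sapply(seq_len(n_series), function(k) {
  cols <- seq(k, ncol(mat1), by = n_series)
  err_measures(as.vector(mat1[, cols]), as.vector(mat2[, cols]))
})
colnames(errors) <- paste0("series", seq_len(n_series))
errors  # one row per error measure, one column per series
```

In this toy example every element of mat2 exceeds the matching element of mat1 by 24, so mae and rmse both come out as 24 for each series.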

sampling random values each iteration

I have some simulated data, on top of the data I add some noise to see how the noise affects my data for further analyses. I created the following function
create.noise <- function(n, amount_needed, mean, sd){
set.seed(25)
values <- rnorm(n, mean, sd)
returned.values <- sample(values, size=amount_needed)
}
I call this function in the following loop:
dataframe.noises <- as.data.frame(noises) #i create here a dataframe dim 1x45 containing zeros
for(i in 1:100){
noises <- as.matrix(create.noise(100,45,0,1))
dataframe.noises[,i] <- noises
data_w_noise <- df.data_responses+noises
Estimators <- solve(transposed_schema %*% df.data_schema) %*% (transposed_schema %*% data_w_noise)
df.calculated_estimators[,i] <-Estimators
}
The code above always returns the same values. One workaround I tried is passing i as a parameter and calling set.seed(25+i) in each iteration (which I don't think is correct).
This gives me different values for each iteration, but as mentioned, I don't think this is the right way to go about it.
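The repeated values come from set.seed(25) being called inside create.noise: every call resets the random number generator to the same state, so rnorm and sample return the same draws. A common alternative is to set the seed once, before the loop. The sketch below shows only the noise part of the loop, and `noise_matrix` is a made-up container name.

```r
# Same function as in the question, but without set.seed() inside it
create.noise <- function(n, amount_needed, mean, sd) {
  values <- rnorm(n, mean, sd)
  sample(values, size = amount_needed)
}

set.seed(25)  # seed once, before the loop, so the whole run is reproducible

# Hypothetical container: one column of 45 noise values per iteration
noise_matrix <- matrix(NA_real_, nrow = 45, ncol = 100)
for (i in 1:100) {
  noise_matrix[, i] <- create.noise(100, 45, 0, 1)
}

# Check whether two iterations produced the same noise
identical(noise_matrix[, 1], noise_matrix[, 2])
```

Seeding once keeps the whole simulation reproducible while still letting each iteration draw fresh values.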

Generating n new datasets by randomly sampling existing data, and then applying a function to new datasets

For a paper I'm writing I have subsetted a larger dataset into 3 groups, because I thought the strength of correlations between 2 variables in those groups would differ (they did). I want to see if subsetting my data into random groupings would also significantly affect the strength of correlations (i.e., whether what I'm seeing is just an effect of subsetting, or if those groupings are actually significant).
To this end, I am trying to generate n new data frames by randomly sampling 150 rows from an existing dataset, and then want to calculate correlation coefficients for two variables in those n new data frames, saving the correlation coefficient and significance in a new file.
But, HOW?
I can do it manually, e.g., with dplyr, something like
newdata <- sample_n(Random_sample_data, 150)
output <- cor.test(newdata$x, newdata$y, method="kendall")
I'd obviously like to not type this out 1000 or 100000 times, and have been trying things with loops and lapply (see below) but they've not worked (undoubtedly due to something really obvious that I'm missing!).
Here I have tried to assign each row to a different group, with 10 groups in total, and then to do correlations between x and y by those groups:
Random_sample_data<-select(Range_corrected, x, y)
cat <- sample(1:10, 1229, replace=TRUE)
Random_sample_cats<-cbind(Random_sample_data,cat)
correlation <- function(c) {
c <- cor.test(x,y, method="kendall")
return(c)
}
b<- daply(Random_sample_cats, .(cat), correlation)
Error message:
Error in cor.test(x, y, method = "kendall") :
object 'x' not found
Once you have the code for what you want to do once, you can put it in replicate to do it n times. Here's a reproducible example on built-in data:
library(dplyr) # for sample_n()
result = replicate(n = 10, expr = {
newdata <- sample_n(mtcars, 10)
output <- cor.test(newdata$wt, newdata$qsec, method="kendall")
})
replicate will save the result of the last line of what you did (output <- ...) for each replication. It will attempt to simplify the result, in this case cor.test returns a list of length 8, so replicate will simplify the results to a matrix with 8 rows and 10 columns (1 column per replication).
You may want to clean up the results a little bit so that, e.g., you only save the p-value. Here, we store only the p-value, so the result is a vector with one p-value per replication, not a matrix:
result = replicate(n = 10, expr = {
newdata <- sample_n(mtcars, 10)
cor.test(newdata$wt, newdata$qsec, method="kendall")$p.value
})
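Since the question asks for both the correlation coefficient and its significance, each replication could also return a small named vector, with the results reshaped into a data frame afterwards. This is a sketch on mtcars using base R sampling instead of sample_n; the object names and the file name in the commented-out line are placeholders.

```r
set.seed(42)  # only so that the sketch is reproducible
n_rep <- 10

# Each replication returns a named numeric vector of length 2
res <- replicate(n_rep, {
  newdata <- mtcars[sample(nrow(mtcars), 10), ]  # base-R stand-in for sample_n()
  ct <- cor.test(newdata$wt, newdata$qsec, method = "kendall")
  c(tau = unname(ct$estimate), p.value = ct$p.value)
})

# replicate() gives a 2 x n_rep matrix; transpose so each row is one replication
result_df <- as.data.frame(t(res))
# write.csv(result_df, "kendall_results.csv", row.names = FALSE)  # placeholder file name
result_df
```

cor.test may warn that it cannot compute exact p-values when the sampled data contain ties.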

Simulation in R, for loop

I am trying to run a simulation 10 times in R, but I have not figured out how to do that. The code is shown below; you can run it in R straight away. When I run it, it gives me 5 values of "w" as output. I think this is only one simulation, but what I actually want is 10 different simulations of those 5 numbers.
I know I will need to write a for loop for it, but I could not get it to work. Could anyone help, please?
# simulate 10 times
# try N = 10, for loop?
# initial values W0 and E
W0=1000
E= 1000
data = c(-0.02343731, 0.045509474 ,0.076144158,0.09234636,0.0398257)
constant = exp(cumsum(data))
exp.cum = cumsum(1/constant)
w=constant*(W0 - exp.cum)- E
w
You'll want to generate new values of data in each simulation. Do this within the curly brackets that follow the for loop. Then, before closing the curly brackets, be sure to save your statistical output in the appropriate position in an object, such as a vector. For a simple example:
W0=1000
E= 1000
n_per_sim <- 5
num_sims <- 10
set.seed(12345) #seed is necessary for reproducibility
sim_output_1 <- rep(NA, times = num_sims) #This creates a vector of 10 NA values
for (sim_number in 1:num_sims){ #this starts your for loop
data <- rnorm(n=n_per_sim, mean=10, sd=2) #generate your data
average <- mean(data)
sim_output_1[sim_number] <- average #this is where you store your output for each simulation
}
sim_output_1 #Now you can see the average from each simulation
Note that if you want to save five values from each simulation, you can use a matrix object instead of a vector object, as shown here:
matrix_output <- matrix(NA, ncol=n_per_sim, nrow=num_sims) #This creates a 10x5 matrix
for (sim_number in 1:num_sims){ #this starts your for loop
data <- rnorm(n=n_per_sim, mean=10, sd=2) #generate your data
constant = exp(cumsum(data))
exp.cum = cumsum(1/constant)
w=constant*(W0 - exp.cum)- E
matrix_output[sim_number, ] <- w #this is where you store your output for each simulation
}
matrix_output #Now you can see the five w values from each simulation (one row per simulation)
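The same simulation can also be written with replicate instead of an explicit for loop. The sketch below wraps one simulation in a function (the name `simulate_w` is made up here) and keeps the parameter values from the answer above.

```r
W0 <- 1000
E <- 1000
n_per_sim <- 5
num_sims <- 10
set.seed(12345)

# Hypothetical helper: one simulation, returning the five w values
simulate_w <- function() {
  data <- rnorm(n = n_per_sim, mean = 10, sd = 2)
  constant <- exp(cumsum(data))
  exp.cum <- cumsum(1 / constant)
  constant * (W0 - exp.cum) - E
}

# replicate() returns one column per simulation, so transpose to get 10 x 5
w_matrix <- t(replicate(num_sims, simulate_w()))
dim(w_matrix)
```

Keeping one simulation inside its own function makes it easy to change the number of runs without touching the simulation code.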

A small simulation study about normality tests in R

I am conducting a small simulation study to judge how good two normality tests really are. My plan is to generate a multitude of normal distribution samples of not too many observations and determine how often each test rejects the null hypothesis of normality.
The (incomplete) code I have so far is
library(nortest)
y<-replicate(10000,{
x<-rnorm(50)
ad.test(x)$p.value
ks.test(x,y=pnorm)$p.value
}
)
Now I would like to count the proportion of these p-values that are smaller than 0.05 for each test. Could you please tell me how I could do that? I apologise if this is a newbie question, but I myself am new to R.
Thank you.
library(nortest)
nsim <- 10000
nx <- 50
set.seed(101)
y <- replicate(nsim,{
x<-rnorm(nx)
c(ad=ad.test(x)$p.value,
ks=ks.test(x,y=pnorm)$p.value)
}
)
apply(y<0.05,MARGIN=1,mean)
## ad ks
## 0.0534 0.0480
Using MARGIN=1 tells apply to take the mean across rows, rather than columns -- this is sensible given the ordering that replicate()'s built-in simplification produces.
For examples of this type, the type I error rates of any standard tests will be extremely close to their nominal value (0.05 in this example).
If you run each test separately, you can simply count how many of the values stored in y are less than 0.05; dividing that count by the number of replications gives the proportion.
y<-replicate(1000,{
x<-rnorm(50)
ks.test(x,y=pnorm)$p.value})
length(which(y<0.05))
Your code only returns the last p-value computed inside the braces (the ks.test one), so the ad.test results are lost. You could do something like this:
rep_test <- function(reps=10000) {
p_ks <- rep(NA, reps)
p_ad <- rep(NA, reps)
for (i in 1:reps) {
x <- rnorm(50)
p_ks[i] <- ks.test(x, y=pnorm)$p.value
p_ad[i] <- ad.test(x)$p.value
}
return(data.frame(cbind(p_ks, p_ad)))
}
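test <- rep_test() # run the simulation and keep the p-values
mean(test$p_ks<.05) # proportion of KS p-values below 0.05
mean(test$p_ad<.05) # proportion of AD p-values below 0.05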
