R problems looping through array - r

I am trying to loop through an array so that, in the end, I can draw a power curve showing the power as a function of the number of animals per treatment and the mean difference between the two treatments:
N=30 # number of maximum simulations per K
K=seq(10,30,1) # maximum number of animals per group
ES=seq(1,2,0.1) # mean difference compared to control
x=array(data=NA,
dim=c(N,length(K),length(ES)),
dimnames =list(paste("Sim",1:N, sep=""),
paste("Total Number of Animals=",min(K):max(K), sep=""),
ES)) # 3-dimensional matrix in which to store the values
for (q in ES){
for (j in K){
for (i in 1:N){
controle<-rnorm(j,popmeansum$V3, 1.490918)
new<-rnorm(j,popmeansum$V3-q, 1.490918)
fit<-t.test(controle, new, alternative ="greater")
x[i,j,q]<-fit$p.value
}
}
}
The error I get is:
Error in `[<-`(`*tmp*`, i, j, q, value = 0.00490665200011608) :
subscript out of bounds
My gut feeling says I am making a simple and stupid mistake. Unfortunately, those mistakes can take hours. I hope someone sees a quick and simple fix.

There are various problems here, but hopefully this will help solve some of them...
N=30 # number of maximum simulations per K
K=seq(10,30,1) # maximum number of animals per group
ES=seq(1,2,0.1) # mean difference compared to control
x=array(data=NA,
dim=c(N,length(K),length(ES)),
dimnames =list(paste("Sim",1:N, sep=""),
paste("Total Number of Animals=",min(K):max(K), sep=""),
ES)) # 3-dimensional matrix in which to store the values
#all OK to here
#in the next loop, popmeansum$V3 is not defined. If it is a single value, then I suggest...
pms <- popmeansum$V3[1]
for (q in 1:length(ES)){ #keep q as integers so that you can use it for indexing
for (j in 1:length(K)){ #to be consistent with your dimensions of x
for (i in 1:N){
controle<-rnorm(K[j],pms,1.490918)
new<-rnorm(K[j],pms-ES[q],1.490918)
fit<-t.test(controle, new, alternative ="greater")
x[i,j,q]<-fit$p.value
}
}
}
It is not the slickest piece of code but it does at least do something!
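For reference, the original loop fails because j takes the values 10 to 30 while the second dimension of x only has 21 slots, and q takes non-integer values such as 1.1, so neither can be used directly as an index. A minimal sketch of the corrected loop followed by the power calculation is given here; the control mean pms = 10 is a made-up placeholder for popmeansum$V3, and the 0.05 significance level is an assumption, not something stated in the question.

```r
set.seed(42)
N    <- 30                 # simulations per combination
K    <- seq(10, 30, 1)     # animals per group
ES   <- seq(1, 2, 0.1)     # mean difference compared to control
pms  <- 10                 # hypothetical control mean (stands in for popmeansum$V3)
sdev <- 1.490918

x <- array(NA_real_, dim = c(N, length(K), length(ES)))
for (q in seq_along(ES)) {        # index positions, not values
  for (j in seq_along(K)) {
    for (i in seq_len(N)) {
      controle <- rnorm(K[j], pms, sdev)
      new      <- rnorm(K[j], pms - ES[q], sdev)
      x[i, j, q] <- t.test(controle, new, alternative = "greater")$p.value
    }
  }
}

# Estimated power: share of simulations with p < 0.05 (assumed level)
power <- apply(x < 0.05, c(2, 3), mean)
dimnames(power) <- list(K = K, ES = ES)
```

Each row of power then corresponds to a group size and each column to an effect size, so something like matplot(K, power, type = "l") would draw one curve per effect size.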

Related

Mclust() - NAs in model selection

I recently tried to fit a GMM in R to a multivariate matrix (400 observations of 196 variables) whose observations belong to known categories. The Mclust() function (from the mclust package) gave very poor results (only around 30% of individuals were classified correctly, whereas k-means reaches more than 90%).
Here is my code:
library(mclust)
X <- read.csv("X.csv", sep = ",", header = TRUE)
y <- read.csv("y.csv", sep = ",")[[1]] # assumes the labels are in the first column
nclusters <- 5 # I want 5 clusters
gmm <- Mclust(X, G = nclusters)
cl_gmm <- gmm$classification
cl_gmm_lab <- cl_gmm
for (k in 1:nclusters){
ii = which(cl_gmm == k) # individuals of group k
counts=table(y[ii]) # number of occurrences for each label
imax = which.max(counts) # Majority label
maj_lab = names(counts)[imax]
print(paste("Group ",k,", majority label = ",maj_lab))
cl_gmm_lab[ii] = maj_lab
}
conf_mat_gmm <- table(y,cl_gmm_lab) # CONFUSION MATRIX
The problem seems to come from the fact that every model other than "EII" (spherical, equal volume) is "NA" when looking at gmm$BIC.
So far I have not found any solution to this problem. Are you familiar with this issue?
Here is the link for the data: https://drive.google.com/file/d/1j6lpqwQhUyv2qTpm7KbiMRO-0lXC3aKt/view?usp=sharing
Here is the link for the labels: https://docs.google.com/spreadsheets/d/1AVGgjS6h7v6diLFx4CxzxsvsiEm3EHG7/edit?usp=sharing&ouid=103045667565084056710&rtpof=true&sd=true
I finally found the answer. GMMs simply cannot fit every model when too many explanatory variables are involved. The right thing to do is to reduce the dimensionality first, choosing a number of dimensions small enough for GMMs to be fitted properly while preserving as much information about the data as possible.
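One way this could look is sketched below, assuming PCA (base R's prcomp()) as the reduction step and a 90% cumulative-variance cut-off; both choices are assumptions, since the answer does not say which method was used. The data are synthetic stand-ins for the 400 x 196 matrix, and the Mclust() call only runs if the mclust package is installed.

```r
set.seed(1)
n <- 400; p <- 196; g <- 5

# Synthetic stand-in data: 5 groups living in a 3-dimensional latent space
labels   <- rep(seq_len(g), each = n / g)
centers  <- matrix(rnorm(g * 3, sd = 4), nrow = g)
latent   <- centers[labels, ] + matrix(rnorm(n * 3), nrow = n)
loadings <- matrix(rnorm(3 * p), nrow = 3)
X <- latent %*% loadings + matrix(rnorm(n * p, sd = 0.5), nrow = n)

# Reduce dimensions first
pca     <- prcomp(X, center = TRUE, scale. = TRUE)
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
n_comp  <- which(cum_var >= 0.90)[1]          # 90% is an arbitrary threshold
X_red   <- pca$x[, seq_len(n_comp), drop = FALSE]

# Fit the GMM on the reduced data (skipped when mclust is unavailable)
if (requireNamespace("mclust", quietly = TRUE)) {
  suppressPackageStartupMessages(library(mclust))
  gmm <- Mclust(X_red, G = g)
  print(gmm$BIC)   # far fewer NA entries are expected than with 196 columns
}
```

The cut-off is a tuning choice: keeping more components preserves more information but brings back the estimation problems of the covariance-rich models.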

Stochastic parameter estimation

I have made a stochastic pandemic simulator which takes the probabilities of an infection, a recovery or neither, and uses a Gillespie algorithm with vectors to determine the number of people in each category at each time. I want to carry out a simulation study and use maximum likelihood estimation to get parameter estimates for my simulations. It worked perfectly for the SI model, but in this model I get the following warnings, which I cannot understand. When I run just the function MLE I get scalars, and I can even produce the vector J. But when I try to use optim, it tells me that the function PL isn't a scalar when I know it is. Any help would be greatly appreciated, thanks.
#SIR 100 DAYS WITH 10 INTERVALS A DAY
T<-100 #Setting the number of intervals
dt<-0.01 #Setting the interval lengths
B<-1.5 #Setting Beta
N<-50 #Setting population size
Y<-0.5 #Setting recovery rate
r<-function(i){runif(1,0,1)} #Random number generator
S<-c(1:T)
I<-c(1:T)
R<-c(1:T)
I1<-c(1:T)
I2<-c(1:T)
I3<-c(1:T)
It<-c(1:T)
Time<-c(1:T)
I[1]<-1
S[1]<-N-I[1]
R[1]<-0
It[1]<-I[1]
P1<-function(t){(B)*(I[t])*(S[t])*(dt)*(1/N)} #Creates first event interval(Infection)
P2<-function(t){(Y)*(I[t])*(dt)+(B)*(I[t])*(S[t])*(dt)*(1/N)} #Creates 2nd event interval(Recovery)
P3<-function(t){1} #Creates 3rd event interval (No transition)
PI1<-function(t){(I1[t])/I[t]} #Creates interval for recovery from first group
PI2<-function(t){((I1[t])/I[t])+((I2[t])/I[t])} #Creates interval for recovery from second group
PI3<-function(t){1} #Creates interval for recovery from third group
for(i in 2:T){
x<-r(i)
if(x<P1(i-1)){ #If an infection occurs
S[i]<-S[i-1]-1
I[i]<-I[i-1]+1
R[i]<-R[i-1]
It[i]<-It[i-1]+1
}
else if(x<P2(i-1)){ #If a recovery occurs
S[i]<-S[i-1]
I[i]<-I[i-1]-1
R[i]<-R[i-1]+1
It[i]<-It[i-1]}
else{ #If no transition occurs
S[i]<-S[i-1]
I[i]<-I[i-1]
R[i]<-R[i-1]
It[i]<-It[i-1]}
}
n<-c(1:T)
for(i in 1:T){
n[i]<-S[i]+I[i]+R[i]}
n
S
I
R
Data<-cbind.data.frame(Time,S,I,R,n,It) #Create a dataframe for ease of manipulations
Data$EventInfection<-0
Data$EventRecovery<-0
Data$EventNotransition<-0
for(i in 2:T){if(Data$It[i]>Data$It[i-1]){Data$EventInfection[i]<-1} #Event indicators to make the likelihood easier
else if(Data$R[i]>Data$R[i-1]){Data$EventRecovery[i]<-1}
else{Data$EventNotransition[i]<-1}}
PL<-function(i,b,y){((b*S[i]*I[i]*dt*(1/N))^Data$EventInfection[[i]])*((I[i]*(y)*dt)^Data$EventRecovery[[i]])*((1-(b*S[i]*I[i]*dt*(1/N))-((y)*I[i]*dt))^Data$EventNotransition[[i]])}
MLE<-function(b,y){
J<<-c(1:T)
for(i in 1:T){
J[i]<<-log(PL(i,b,y))}
return(sum(J))}
MLE(1,0.5)
optim(c(1,1), MLE, y=1)
Warning messages:
1: In J[i] <- log(PL(i, b, y)) :
number of items to replace is not a multiple of replacement length
2: In J[i] <- log(PL(i, b, y)) :
number of items to replace is not a multiple of replacement length
3: In J[i] <- log(PL(i, b, y)) :
number of items to replace is not a multiple of replacement length
4: In J[i] <- log(PL(i, b, y)) :
number of items to replace is not a multiple of replacement length
5: In J[i] <- log(PL(i, b, y)) :
number of items to replace is not a multiple of replacement length
MLE() takes two arguments, b and y, but optim() passes its whole parameter vector (here c(1,1), of length two) as the first argument. With y=1 supplied separately, b arrives as a vector of length two, so PL() returns two values instead of a scalar, which is what the warnings are about. If you want to optimize both b and y, for example, this will work:
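Ti <- 100 # number of intervals (T in the question; renamed because T also means TRUE)
MLE <- function(b){
J <<- numeric(Ti)
for(i in 1:Ti){
J[i] <<- log(PL(i, b[1], b[2]))
}
return(sum(J))
}
MLE(c(1, 0.5))
optim(c(1, 1), MLE, control = list(fnscale = -1)) # optim() minimises by default; fnscale = -1 makes it maximise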
Now b is b[1] and y is b[2]. I'm not sure if that's what you wanted to optimize, though.
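The same idea can be sketched in a self-contained form: one parameter vector goes into the log-likelihood, and optim() is told to maximise. Everything below is an illustrative assumption rather than the asker's code: the simulation is a simplified version of the SIR scheme, the names n_steps, event and loglik are hypothetical, and the -1e10 penalty is just one way to keep the default Nelder-Mead search away from invalid probabilities.

```r
set.seed(1)
n_steps <- 5000L; dt <- 0.01; N <- 50L
beta_true <- 1.5; gamma_true <- 0.5

# Simulate S and I plus the event that led to each step
S <- integer(n_steps); I <- integer(n_steps)
event <- integer(n_steps)          # 0 = none, 1 = infection, 2 = recovery
I[1] <- 5L; S[1] <- N - I[1]
for (t in 2:n_steps) {
  p_inf <- beta_true * S[t - 1] * I[t - 1] * dt / N
  p_rec <- gamma_true * I[t - 1] * dt
  u <- runif(1)
  event[t] <- if (u < p_inf) 1L else if (u < p_inf + p_rec) 2L else 0L
  S[t] <- S[t - 1] - (event[t] == 1L)
  I[t] <- I[t - 1] + (event[t] == 1L) - (event[t] == 2L)
}

# Log-likelihood of all transitions; par = c(beta, gamma) is ONE vector
loglik <- function(par) {
  s <- S[-n_steps]; i <- I[-n_steps]; e <- event[-1]
  p_inf <- par[1] * s * i * dt / N
  p_rec <- par[2] * i * dt
  p <- ifelse(e == 1L, p_inf, ifelse(e == 2L, p_rec, 1 - p_inf - p_rec))
  if (any(p <= 0)) return(-1e10)   # penalise invalid probabilities
  sum(log(p))                      # always a scalar
}

fit <- optim(c(1, 1), loglik, control = list(fnscale = -1))
fit$par   # estimates of beta and gamma
```

Because the log-likelihood is vectorised, there is no need for the global vector J or the <<- assignments, and the return value is a scalar by construction.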

Null distribution related question (edited)

Please help with the following question.
The experiment involved mice; feeding them two diets: high-fat diet and normal diet (control group). The data below contains the weights of all female mice (population) that received the normal diet. The data can be downloaded from GitHub running the following command lines in R:
library(downloader)
url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/femaleControlsPopulation.csv"
filename <- basename(url)
download(url, destfile = filename)
x <- unlist(read.csv(filename))
Here x represents the weights of the entire population.
So, the question is:
Set the seed at 1, then using a for-loop take a random sample of 5 mice 1,000 (one thousand) times. Save the averages.
What proportion of these 1,000 averages are more than 1 gram away from the average x?
Below is what I have tried using the ‘sum’ & ‘mean()’ function:
set.seed(1)
n <- 1000
sample1 <- vector("numeric", n)
for (i in 1: n) {
sample1[i] <- mean (sample (x, 5))
}
sum(sample1 > mean(x) / n)
mean(sample1 > mean(x)+1)
So this step is where I need the help…because I am not sure how to deal with ‘1 gram away from average of x’ statement in the question.
Thank you in advance for your help.
Looks like homework, so I'll give some hints:
In your second code block, the last two statements seem off.
n <- 1000
sample1 <- vector("numeric", n)
for (i in 1: n) {
sample1[i] <- mean (sample (x, 5))
}
sum(sample1 > mean(x) / n) #<- why dividing by n here?
mean(sample1 > mean(x)+1) #<- what are you trying to do here?
Why are you dividing the mean of the overall sample by n?
The call to mean does seem to make sense.
I don't think you need the second statement, mean(sample1 > mean(x)+1) to get your answer.
You need an inequality in the sum() statement that will be TRUE for every value that is outside the range of mean(x) - 1 to mean(x) + 1. Or, the number less than mean(x) -1 plus the number greater than mean(x) + 1.
Does that help?
You are doing the loop part correctly. For "What proportion of these 1,000 averages are more than 1 gram away from the average of x?", take the absolute value of the difference before comparing it with 1: sum(abs(sample1 - mean(x)) > 1)/n
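Both formulations from the answers can be checked against each other in a short sketch. The population here is synthetic (the rnorm() parameters are made up so the example runs without downloading the CSV), so the resulting proportion is not the answer to the exercise.

```r
set.seed(1)
# Synthetic stand-in for the mouse weights; NOT the real data
x <- rnorm(225, mean = 24, sd = 3.5)

n <- 1000
sample1 <- numeric(n)
for (i in seq_len(n)) {
  sample1[i] <- mean(sample(x, 5))
}

d <- sample1 - mean(x)              # distance of each average from mean(x)

# "More than 1 gram away" means either side of the mean
prop_abs   <- mean(abs(d) > 1)
prop_sides <- (sum(d < -1) + sum(d > 1)) / n
```

Taking mean() of a logical vector gives the proportion of TRUE values directly, which is why no division by n is needed in the first form.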

Faster way to generate large list of vectors from permuted datasets [R]

Setup: For the purposes of my simulation, I'm generating a list of B=2000 elements, with each element being the output of a permutation procedure in which I first permute the rows of a 200x8000 matrix and for each column, I calculate the Kolmogorov-Smirnov test statistic between the first and second 100 rows (you can think of the first 100 rows as data from one group and the second 100 rows as data from another group).
Question: This process takes a very long time (about 30-40 minutes) to generate the list. Is there a much faster way? In the future, I'd like to increase B to a larger value.
Code:
B=2000
n.row=200; n.col=8000
#Generate sample data
samp.dat = matrix(rnorm(n.row*n.col),nrow=n.row)
perm.KS.list = NULL
for (b in 1:B){
#permute the rows
perm.dat.tmp = samp.dat[sample(nrow(samp.dat)),]
#Compute the permutation-based test statistics
perm.KS.list[[b]]= apply(perm.dat.tmp,2,function(y) ks.test.stat(y[1:100],y[101:200]))
}
#Modified KS-test function (from base package)
ks.test.stat <- function(x,y){
x <- x[!is.na(x)]
n <- length(x)
y <- y[!is.na(y)]
n.x <- as.double(n)
n.y <- length(y)
w <- c(x, y)
z <- cumsum(ifelse(order(w) <= n.x, 1/n.x, -1/n.y))
z <- z[c(which(diff(sort(w)) != 0), n.x + n.y)] #exclude ties
STATISTIC <- max(abs(z))
return(STATISTIC)
}
The 1:B loop has several places to optimize, but I agree that the real consumer is that inner function. Because you're simulating your well-behaved bootstrap samples, you can make two simplifying assumptions that the general base function can't:
There aren't missing values. This obviates the is.na() adjustments.
The two sides (i.e., x & y) have the same number of elements, so you don't need to count them separately. Instead of splitting y in the loop and then joining the halves back together in the function (into w), just keep it together. The balanced sides also permit simplifications like removing the ifelse() clause: the comparison produces a bunch of 0/1s, which are rescaled to -1/1s with integer arithmetic.
The function is reduced, which saves about 25% of the time. I add integers instead of doubles inside cumsum().
ks.test.stat.balanced <- function(w){
n <- as.integer(length(w) * .5)
# z <- cumsum(ifelse(order(w) <= n, 1L, -1L)) / n
z <- cumsum((order(w) <= n)*2L - 1L) / n
# z <- z[c(which(diff(sort(w)) != 0), n + n)] #exclude ties
return( max(abs(z)) )
}
Ties shouldn't occur often with your gaussian rng, and the diff(sort(.)) is very expensive. If you're willing to remove that protection, the time is reduced by about 65%.
If you move the equation for z into abs(), it saves a little time over all those reps. I kept it separate above, so it's easier to read.
Edit: in case of an unbalanced simulation, I'd recommend that you:
still keep out the is.na,
still pass w,
still keep as much as possible in integer, not numeric, but
now include arguments n1 & n2 for the two group sizes.
Also, experiment w/ precalculating 1/n before cumsum() to avoid a lot of expensive divisions. Try to think of other math-y ways to extract calculations from an inner loop so it occurs less frequently.
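As a sanity check, the simplified statistic can be compared with the one reported by stats::ks.test() on a small permuted matrix. The sketch below uses 50 columns instead of 8000 to keep it quick, and ks.test.stat.unbalanced is a hypothetical name for the n1/n2 variant suggested in the edit; it is one possible reading of that advice, not the answerer's code.

```r
# Balanced version from the answer (no NA handling, no tie handling)
ks.test.stat.balanced <- function(w) {
  n <- as.integer(length(w) * .5)
  z <- cumsum((order(w) <= n) * 2L - 1L) / n
  max(abs(z))
}

# Hypothetical unbalanced variant: first n1 values vs. remaining n2 values
ks.test.stat.unbalanced <- function(w, n1, n2) {
  z <- cumsum(ifelse(order(w) <= n1, 1 / n1, -1 / n2))
  max(abs(z))
}

set.seed(1)
samp.dat <- matrix(rnorm(200 * 50), nrow = 200)
perm.dat <- samp.dat[sample(nrow(samp.dat)), ]   # permute the rows

fast <- apply(perm.dat, 2, ks.test.stat.balanced)
ref  <- apply(perm.dat, 2, function(y)
  unname(ks.test(y[1:100], y[101:200])$statistic))

fast_unbal <- apply(perm.dat, 2, ks.test.stat.unbalanced, n1 = 80, n2 = 120)
ref_unbal  <- apply(perm.dat, 2, function(y)
  unname(ks.test(y[1:80], y[81:200])$statistic))

all.equal(fast, ref)              # agreement up to floating-point error
all.equal(fast_unbal, ref_unbal)
```

The check relies on continuous data without ties, which is the same assumption the answer makes when it drops the diff(sort(.)) step.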

simulation of binomial distribution and storing value in matrix in r

set.seed(123)
for(m in 1:40)
{
u <- rbinom(1e3,40,0.30)
result[[m]]=u
}
result
for (m in 1:40) if (any(result[[m]] == 1)) break
m
m is the exit time for the company; changing the probability gives a different result. Using this m as the exit time, I have to find out whether there was a funding round in between, so I created a random binomial draw with some probability: getting a 1 means there is a funding round (j). If there is a funding round, I have to find the limit of the round using the random uniform distribution. I am not sure whether the rbinom code is right and whether it runs until m. I am also getting the y value for all 40 iterations, whereas I need it to go to the next loop when I get rbinom == 1. I am trying to store the values in a matrix, but they are not getting stored either. Please help me with that.
mat1<- matrix(0,nrow = 40,ncol = 2) #empty matrix
for(j in 1:m) {
k<- if(any(rbinom(1e3,40,0.42)==1)) #funding round
{
y<- runif(j, min = 0, max = 1) #lower and upper bound
mat1[l][0]<-j
mat1[l][1]<-y #matrix storing the value
}
}
resl
mat1
y
The answer to your first question:
result <- vector("list",40)
for(m in 1:40)
{
u <- rbinom(1e3,40,0.05)
print(u)
result[[m]]=u
}
u
The second question is not clear. Could you rephrase it?
To generate 40 vectors of random binomial numbers you don't need a loop at all, use ?replicate.
u <- replicate(40, rbinom(1e3, 40, 0.05))
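Note that replicate() simplifies its result to a 1000 x 40 matrix by default. A small sketch of keeping the list structure instead and locating the first vector that contains a 1 follows; the probability 0.30 is taken from the question, and has_one is a hypothetical helper name.

```r
set.seed(123)
# simplify = FALSE keeps a list of 40 vectors, as in the original loop
result <- replicate(40, rbinom(1e3, 40, 0.30), simplify = FALSE)

# For each vector, does it contain at least one value equal to 1?
has_one <- vapply(result, function(u) any(u == 1), logical(1))

# First index with a 1, or NA if no vector contains one
m <- which(has_one)[1]
```

With size 40 and probability 0.30 a draw of exactly 1 is rare, so m may well come out as NA; the code after it should handle that case.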
As for your second question, there are several problems with your code. I will try to address them; it will be up to you to say whether the proposed corrections are right.
The following does basically nothing
for(k in 1:40)
{
n<- (any(rbinom(1e3,40,0.05)==1)) # n is TRUE/FALSE
}
k # at this point, equal to 40
There are better ways of creating a T/F variable.
#matrix(0, nrow = 40,ncol = 2) # wrong, don't use list()
matrix(0, nrow = 40,ncol = 2) # or maybe NA
Then you set l=0 when indices in R start at 1. Anyway, I don't believe you'll need this variable l.
if(any(rbinom(1e3,40,0.30)==1)) # probably TRUE, left as an exercise
# in probability theory
Then, finally,
mat1[l][0]<-j # index `0` doesn't exist
Please revise your code, and tell us what you want to do, we're glad to help.
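In the meantime, here is a minimal sketch of how values are stored in a matrix with [row, column] indexing. The interpretation is an assumption, since the question is not fully clear: each period j up to the exit time gets a single Bernoulli draw with probability 0.42, and a uniform "limit" is stored only when that draw is 1. The exit time m = 10 and the column names are made up for illustration.

```r
set.seed(123)
m <- 10   # hypothetical exit time

mat1 <- matrix(NA_real_, nrow = m, ncol = 2,
               dimnames = list(NULL, c("round", "limit")))

for (j in seq_len(m)) {
  funded <- rbinom(1, 1, 0.42) == 1   # one draw per period (assumed)
  if (funded) {
    mat1[j, 1] <- j                   # row j, column 1; indices start at 1
    mat1[j, 2] <- runif(1, min = 0, max = 1)
  }
  # when not funded, the row stays NA and the loop moves on
}

# Keep only the periods in which a funding round occurred
funded_rounds <- mat1[!is.na(mat1[, "round"]), , drop = FALSE]
```

Writing mat1[j, 1] addresses one cell, whereas mat1[l][0] first extracts a single element and then tries to index position 0 of it, which selects nothing.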
