I am trying to simulate unequal sample sizes in a multilevel model. I have four groups with sample sizes of 100, 200, 300, and 400, respectively,
so the total sample size is 1000. The variables w, u0, and u1 are at level 2; x and r0 are at level 1; y is the outcome.
nSubWithinGroup <- c(100,200,300,400)###the sample size in each group
nGroup <-4 ## 4 groups
gamma00 <- 1
gamma01 <- 1 ## b0 = gamma00+gamma01*w+u0
gamma10 <- 1 ## b1 = gamma10+gamma11*w+u1
gamma11 <- 1
dataLevel1 <- mat.or.vec(sum(nSubWithinGroup),4)
colnames(dataLevel1) <- c("Group","X","W","Y")
rowIndex <- 0
for (group in 1:nGroup) {
u0 <- rnorm(1,mean=0,sd=1)
u1 <- rnorm(1,mean=0,sd=1)
w <- rnorm(1,mean=0,sd=1)
for(i in 1:length(nSubWithinGroup)){
for (j in 1:nSubWithinGroup[i]){
r0 <- rnorm(1,mean=0,sd=1)
x <- rnorm(1,mean=0,sd=1)
y <- (gamma00+gamma01*w+u0)+(gamma10+gamma11*w+u1)*x+r0
rowIndex <- rowIndex + 1
dataLevel1[rowIndex,] <- c(group,x,w,y)
}
}
}
I ran the code, and the "Group" column only contains the value 1, never 2, 3, or 4. It also throws this error:
"Error in [<-(*tmp*, rowIndex, , value = c(2, -1.94476463667851, -0.153516782293473, :
subscript out of bounds"
Your original issue was a bit hard to spot among all the for loops, but you were looping twice over your grouping level: once in 1:nGroup and then again in 1:length(nSubWithinGroup). This led to more iterations (4 * 1000 = 4000) than there are rows in your matrix (1000), hence the error. (If you want to check, run your loop without assigning to dataLevel1 and see what value rowIndex has at the end.)
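The check suggested above can be sketched as follows. This is a stripped-down version of the question's loops that only counts iterations (no random draws, no assignment into dataLevel1), so the mismatch between the counter and the number of rows is visible directly.

```r
# Same loop structure as in the question, but only counting iterations
nSubWithinGroup <- c(100, 200, 300, 400)
nGroup <- 4

rowIndex <- 0
for (group in 1:nGroup) {
  for (i in seq_along(nSubWithinGroup)) {
    for (j in seq_len(nSubWithinGroup[i])) {
      rowIndex <- rowIndex + 1
    }
  }
}

rowIndex              # 4000: every group walks through all 1000 subjects
sum(nSubWithinGroup)  # 1000: the number of rows the matrix actually has
```

The first 1000 iterations all belong to group 1, which is why only 1 appears in the "Group" column before the subscript runs out of bounds.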
However, generating data one value at a time in a loop is slow in R, and every function you call with n=1 can just as easily generate nTotal numbers at once. I have rewritten your code into something that's (hopefully) more readable, but also more vectorized.
#set seed; you can never reproduce your result if you don't do this
set.seed(289457)
#set constants
gamma00 <- 1
gamma01 <- 1 ## b0 = gamma00+gamma01*w+u0
gamma10 <- 1 ## b1 = gamma10+gamma11*w+u1
gamma11 <- 1
#set size parameters
nSubWithinGroup <- c(100,200,300,400)###the sample size in each group
nGroup <-4
nTotal <- sum(nSubWithinGroup)
#simulate group-level data
level2_data <- data.frame(group=1:nGroup,
size=nSubWithinGroup, #not really necessary here, but I like to have everything documented/accessible
u0 = rnorm(nGroup,mean=0,sd=1),
u1 = rnorm(nGroup,mean=0,sd=1),
w = rnorm(nGroup,mean=0,sd=1)
)
#simulate individual-level data (in the example code x and r0 were generated in the same way for each individual)
level1_data <- data.frame(id=1:nTotal,
group=rep(1:nGroup, nSubWithinGroup),
r0 = rnorm(nTotal,mean=0,sd=1),
x = rnorm(nTotal, mean=0,sd=1)
)
#several possibilities here, you can merge the two dataframes together or reference the level2data when calculating the outcome
#merging generates more data, but is also readable
combined_data <- merge(level1_data,level2_data,by="group",all.x=TRUE)
#calculate outcome. This can be shortened for instance by calculating some linear parts before
#merging but wanted to stay as close to original code as possible.
combined_data$y <- (gamma00+gamma01*combined_data$w+combined_data$u0)+
(gamma10+gamma11*combined_data$w+combined_data$u1)*combined_data$x+combined_data$r0
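The other possibility mentioned in the comments (referencing the level-2 values directly instead of merging) might look like the sketch below. The names g and sim are illustrative and not part of the original code; the model equation is the same one used above.

```r
set.seed(289457)

gamma00 <- 1; gamma01 <- 1; gamma10 <- 1; gamma11 <- 1
nSubWithinGroup <- c(100, 200, 300, 400)
nGroup <- length(nSubWithinGroup)
nTotal <- sum(nSubWithinGroup)

# Group-level draws: one value per group
u0 <- rnorm(nGroup)
u1 <- rnorm(nGroup)
w  <- rnorm(nGroup)

# g maps every individual row to its group (illustrative name)
g <- rep(seq_len(nGroup), times = nSubWithinGroup)

# Individual-level draws; w[g] expands the group value to each individual
sim <- data.frame(group = g, x = rnorm(nTotal), w = w[g])
r0 <- rnorm(nTotal)

sim$y <- (gamma00 + gamma01 * w[g] + u0[g]) +
  (gamma10 + gamma11 * w[g] + u1[g]) * sim$x + r0

table(sim$group)  # group sizes should match nSubWithinGroup
```

Indexing with w[g] avoids the merge and the extra columns, at the cost of keeping the group-level vectors and the data frame in sync by hand.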
So, I'm relatively new to R and have the following problem:
I want to run 1000 generations of a population of some organism. At each generation there is a certain probability to change from one environment to the other (there are just two different "environments").
Now, the code works just fine and I do get the desired results. However one small issue that still needs to be resolved is that for every run, the initial environment seems to be set at environment 1 even though I defined the initial environment to be randomly sampled (should be either environment 1 OR 2; you can find this in line 12 of the second block of code).
If anybody could help me resolve this issue, I would be very thankful.
simulate_one_gen_new <- function(K, N_total_init, N_wt, N_generalist, N_specialist, growth_wt, growth_generalist, growth_specialist, mut_rate) {
scaling <- min(K/(N_wt + N_generalist + N_specialist),1)
# draw offspring according to Poisson distribution
offsp_wt <- rpois(1, scaling * N_wt * growth_wt)
offsp_generalist <- rpois(1, scaling * N_generalist * growth_generalist)
offsp_specialist <- rpois(1, scaling * N_specialist * growth_specialist)
# draw new mutants according to Poisson distribution
mut_wt_to_generalist <- rpois(1, N_wt * (mut_rate/2))
mut_wt_to_specialist <- rpois(1, N_wt * (mut_rate/2))
# determine new population sizes of wild type and mutant
N_wt_new <- max(offsp_wt - mut_wt_to_specialist - mut_wt_to_generalist, 0)
N_generalist_new <- max(offsp_generalist + mut_wt_to_generalist,0)
N_specialist_new <- max(offsp_specialist + mut_wt_to_specialist,0)
N_total_new <- N_wt_new + N_generalist_new + N_specialist_new
return(c(N_total_new, N_wt_new, N_generalist_new, N_specialist_new))
}
# Test the function
print(simulate_one_gen_new(100,100,100,0,0,0.9,1.0,1.1,0.001))
The code block above is needed to simulate one single generation.
simulate_pop_new <- function(K, N_total_init,N_init_wt,
growth_vec1, growth_vec2, growth_vec3,
mut_rate, switch_prob) {
# determine that there are no mutants present at time 0
N_init_generalist <- 0
N_init_specialist <- 0
# Create the vector in which to save the results, including the index of the environment
pop_vector <- c(N_total_init,N_init_wt, N_init_generalist, N_init_specialist, 1)
# initiate the variables
pop_new <- c(N_total_init, N_init_wt, N_init_generalist, N_init_specialist)
# determine that the first environment is either 1 or 2
env_temp <- sample(1:2, size = 1, replace = T)
tmax <- 1000
j <- 0
# run the simulation until generation t_max
for (i in 1:tmax) {
# redefine the current population one generation later
pop_new <- c(simulate_one_gen_new(K,pop_new[1],pop_new[2],pop_new[3],pop_new[4], growth_vec1[env_temp],growth_vec2[env_temp], growth_vec3[env_temp],mut_rate),env_temp)
# add the new population sizes to the output vector
pop_vector <- rbind(pop_vector,pop_new)
# determine whether environmental switch occurs and make it happen
env_switch <- rbinom(1,1,switch_prob)
if (env_switch==1)
{
if(env_temp==1) env_temp <- 2
else env_temp <- 1
}
# condition to stop the simulation before t_max: either the population has only one of the two mutants left, or the whole population goes extinct
if ((pop_new[2] == 0 & pop_new[3] == 0) | (pop_new[2] == 0 & pop_new[4] == 0)){j=j+1}
if (j == 100) break #here we let it run 100 generations longer after the conditions above are met
}
# define the row and column names of the output vector
rownames(pop_vector) <- (0:length(pop_vector[1]))[1:length(pop_vector[,1])] # note that the vector has to be cut if the simulation stopped early
colnames(pop_vector) <- c("total","wt","generalist","specialist","env")
# return the result
return(pop_vector)
}
# Test the function and plot the result
# create your simulation data
output <- simulate_pop_new(1000,1000,1000,c(0.98,0.99),c(1.04,1.02),c(0.96,1.1),0.001,0.5)
# show the last few lines of the data table
print(tail(output))
# determine x axis range
x_range <- 0:(length(output[,1])-1)
# Create data frame from output (or just rename it)
df <- data.frame(output)
# Add a new column to our output that simply shows the Generations
df$Generation<-1:nrow(df)
# Manually create data frame where the genotypes are not separate but all in one column. Note that we need to repeat/ add together all other values since our "Genotype" column will be three times longer.
Genotype <- rep(c("wt", "generalist", "specialist"), each = length(output[,1]))
PopSize <- c(df$wt, df$generalist, df$specialist)
Generation <- rep(df$Generation, 3)
environment <- rep(df$env, 3)
# Let's also create a column solely for the total population
All_Genotypes <- df$generalist + df$wt + df$specialist
N_tot <- rep(All_Genotypes, 3)
# Create a new data frame containing the modified columns which we will be using for plotting
single_run <- data.frame(Generation, Genotype, PopSize, N_tot, environment)
print(tail(single_run))
Above is the second block of code which now simulates 1000 generations.
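One possible reading of the symptom, offered as an assumption rather than a confirmed diagnosis: the first row of pop_vector is built with a literal 1 as the environment index before env_temp is sampled, so generation 0 would always be recorded as environment 1 even though the simulation itself starts in a random environment. A minimal sketch of sampling first and recording afterwards (init_row is a hypothetical helper, not part of the original code):

```r
# Hypothetical helper: sample the starting environment first,
# then store it in the initial row instead of a hard-coded 1
init_row <- function(N_total_init, N_init_wt) {
  env_temp <- sample(1:2, size = 1)
  c(total = N_total_init, wt = N_init_wt,
    generalist = 0, specialist = 0, env = env_temp)
}

set.seed(1)
first_env <- replicate(200, init_row(1000, 1000)["env"])
table(first_env)  # both environments should show up as starting values
```

In simulate_pop_new this would correspond to moving the sample() call above the line that creates pop_vector and using env_temp as its fifth element.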
I need to repeat the sampling procedure of the loop below 1000 times using a second loop.
This is the simplified code I produced for reproducibility (the inner loop).
##Number of iterations
N = 8
##Store data from inner loop in vectors
PMSE <- rep(1 , N)
PolynomialDegree <- rep(1, N)
for (I in 1:N){
PolynomialDegree [I] <- I
PMSE [I] <- I*rnorm(1)
}
Now, using a second, outer loop, I want to repeat this "sampling procedure" 1000 times and store the data from all those vectors in a single data frame. I'm struggling to write the outer loop and was hoping for some assistance.
This is my attempt with non-reproducible code; I hope it is clear what I am attempting to do.
##Set number of iterations
N <- 8
M <- 1000
##Store data
OUTPUT <- rep(1,M)
##Outer loop starts
for (J in 1:M){
PMSE <- rep(1 , N)
PolynomialDegree <- rep(1, N)
sample <- sample(nrow(tempraindata), floor(nrow(tempraindata)*0.7))
training <- tempraindata[sample,]
testing <- tempraindata[-sample,]
##Inner loop starts
for (I in 1:N){
##Set up linear model with x polynomial of degree I x = year, y = temp
mymodel <- lm(tem ~ poly(Year, degree = I), data = training)
##fit model on testing set and save predictions
predictions <- predict(mymodel, newdata = testing, raw = FALSE)
##define and store PMSE
PMSE[I] <- (1/(nrow(tempraindata)- nrow(training)))*(sum(testing$tem-predictions))^2
PolynomialDegree [I] <- I
} ## End of inner loop
OUTPUT[J] <- ##THIS IS WHERE I WANT TO SAVE THE DATA
} ##End outer loop
I want to store all the data inside OUTPUT and make it a data frame; if done correctly, it should contain 8000 values of PMSE and 8000 values of PolynomialDegree.
Avoid the bookkeeping of initializing vectors and then assigning elements by index. Consider sapply (or vapply) over both iteration ranges to build a matrix of the 8,000 PMSE calculations in a 1000 x 8 structure. Every column is then a polynomial degree and every row a training/testing data pair.
## Set number of iterations
N <- 8
M <- 1000
## Defined method to generalize process
## (M: index of the training/testing split, N: polynomial degree; both scalars)
calc_PSME <- function(M, N) {
## Randomly build training/testing sets
set.seed(M) # TO REPRODUCE RANDOM SAMPLES: SAME SPLIT FOR EVERY DEGREE WITHIN A ROW
sample <- sample(nrow(tempraindata), floor(nrow(tempraindata)*0.7))
training <- tempraindata[sample,]
testing <- tempraindata[-sample,]
## Set up linear model with x polynomial of degree N; x = Year, y = tem
mymodel <- lm(tem ~ poly(Year, degree = N), data = training)
## Fit model on testing set and save predictions
predictions <- predict(mymodel, newdata = testing, raw = FALSE)
## Return single PSME value (formula kept as in the question; a conventional
## mean squared error would be mean((testing$tem - predictions)^2))
(
(1/(nrow(tempraindata)- nrow(training))) *
(sum(testing$tem-predictions)) ^ 2
)
}
# RETURN (1000 X 8) MATRIX WITH NAMED COLUMNS
# inner vapply runs over the M splits, outer sapply over the N degrees
PSME_matrix <- sapply(1:N, function(n) vapply(1:M, calc_PSME, numeric(1), N = n))
colnames(PSME_matrix) <- paste0("degree", 1:N)
Should you need an 8,000-row data frame of two columns, consider reshaping to long format:
long_df <- reshape(
data.frame(PSME_matrix),
varying = 1:8,
timevar = "PolynomialDegree",
v.names = "PSME",
ids = NULL,
new.row.names = 1:8000,
direction = "long"
)
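Because tempraindata is not available here, the whole pipeline can only be sketched on made-up data. The block below uses a simulated stand-in for tempraindata (an assumption, with the same column names Year and tem), smaller values of M and N to keep it quick, and a conventional mean squared prediction error rather than the formula from the question. The names one_split and PMSE_matrix are illustrative.

```r
set.seed(1)

# Simulated stand-in for tempraindata (assumption: columns Year and tem)
tempraindata <- data.frame(Year = 1901:2000)
tempraindata$tem <- 10 + 0.01 * (tempraindata$Year - 1900) + rnorm(100, sd = 0.5)

N <- 4   # highest polynomial degree
M <- 20  # number of random training/testing splits

# One split, evaluated for all degrees 1..N; m is only the replicate index
one_split <- function(m) {
  idx <- sample(nrow(tempraindata), floor(nrow(tempraindata) * 0.7))
  training <- tempraindata[idx, ]
  testing  <- tempraindata[-idx, ]
  vapply(seq_len(N), function(n) {
    fit <- lm(tem ~ poly(Year, degree = n), data = training)
    mean((testing$tem - predict(fit, newdata = testing))^2)
  }, numeric(1))
}

# vapply returns an N x M matrix; transpose to splits in rows, degrees in columns
PMSE_matrix <- t(vapply(seq_len(M), one_split, numeric(N)))

# Long format: as.vector() unrolls the matrix column by column
long_df <- data.frame(
  Replicate        = rep(seq_len(M), times = N),
  PolynomialDegree = rep(seq_len(N), each = M),
  PMSE             = as.vector(PMSE_matrix)
)
head(long_df)
```

Drawing the split once per replicate and then looping over degrees keeps every row of the matrix tied to a single training/testing pair without relying on reseeding.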
I have two vectors of sample sizes that I'm using to simulate the power of my study, but I keep getting this error in the for-loop section:
Error in pwr.2p2n.test(h, n1 = i, n2 = j, sig.level = 0.05) :
number of observations in the first group must be at least 2
I would be grateful for suggestions on how to get it working.
##sample code
grp1.n <- seq(30,150,5) ##group 1, N
grp2.n <- seq(30,150,5)-15 ## group 2, N - 15
h=0.85 #specify large effect size
grp1.length <- length(grp1.n)
grp2.length <- length(grp2.n)
power.holder <- array(numeric(grp1.length*grp2.length), dim=c(grp1.length,grp2.length),dimnames=list(grp1.n,grp2.n))
for (i in 1:grp1.length){
for (j in 1:grp2.length){
result.pwr.2p2n.test <- pwr.2p2n.test(h, n1=i, n2=j, sig.level=0.05)
power.holder[i,j] <- ceiling(result.pwr.2p2n.test$power)
return(result.pwr.2p2n.test)
}
}
I'm not entirely sure if this is what you want, but I think it is:
library(pwr) #provides pwr.2p2n.test
grp1.n <- seq(30,150,5) ##group 1, N
grp2.n <- seq(30,150,5)-15 ## group 2, N - 15
h=0.85 #specify large effect size
grp1.length <- length(grp1.n)
grp2.length <- length(grp2.n)
power.holder <- array(numeric(grp1.length*grp2.length), dim=c(grp1.length,grp2.length),dimnames=list(grp1.n,grp2.n))
for (i in 1:grp1.length){
for (j in 1:grp2.length){
result.pwr.2p2n.test <- pwr.2p2n.test(h, n1=grp1.n[i], n2=grp2.n[j], sig.level=0.05)
power.holder[i,j] <- ceiling(result.pwr.2p2n.test$power)
}
}
power.holder #inspect the filled matrix after both loops; return() only works inside a function
The changes are in the pwr.2p2n.test function as well as the object you want to return.
Old: pwr.2p2n.test(h, n1=i, n2=j, sig.level=0.05)
New: pwr.2p2n.test(h, n1=grp1.n[i], n2=grp2.n[j], sig.level=0.05)
Note there was also a missing } bracket in your code.
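To see why the original call failed, it helps to compare the loop index with the value it points to: on the first pass i is 1, so n1 = 1 is passed where 30 was intended. The sketch below shows that, and then fills the same grid without loops using outer(). The power function here is a hand-written normal approximation for a two-sided test (an assumption for illustration; it is not a call into pwr and may differ slightly from pwr.2p2n.test), and approx_power and power.grid are illustrative names.

```r
grp1.n <- seq(30, 150, 5)       # group 1 sample sizes
grp2.n <- seq(30, 150, 5) - 15  # group 2 sample sizes
h <- 0.85

# The loop index is a position, not a sample size
i <- 1
i          # 1  -> what n1 = i passes
grp1.n[i]  # 30 -> what n1 = grp1.n[i] passes

# Hand-written normal approximation to two-sided power (illustrative only)
approx_power <- function(n1, n2, h, sig.level = 0.05) {
  ncp   <- h * sqrt(n1 * n2 / (n1 + n2))
  zcrit <- qnorm(1 - sig.level / 2)
  pnorm(ncp - zcrit) + pnorm(-ncp - zcrit)
}

# outer() evaluates the function on every (n1, n2) combination at once
power.grid <- outer(grp1.n, grp2.n, approx_power, h = h)
dimnames(power.grid) <- list(grp1.n, grp2.n)
round(power.grid[1:3, 1:3], 3)
```

Storing the unrounded power keeps the grid informative; applying ceiling() would turn every entry into 1.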
I am working on a meta-analysis and a sensitivity analysis for missing data. I want to replace censored data with either 0 or 1 according to a predefined probability.
I have a dataset with column x (time points) and column y (events; 1 = event, 0 = censored). For the analysis I replaced some of the 0s with NAs. z is the indicator for the treatment arm. I want to replace the NAs with either 1 or 0 according to a predefined probability.
This is my code:
Just an example:
library(mice)
x <- c(1:10)
y <- c(1,1,1,NA,NA,NA,1,1,0,NA)
z <- rep(2,10)
data <- data.frame(x,y,z)
str(data)
md.pattern(data)
mice.impute.myfunct <- function(y, ry, x, ...)
{event <- sample(c(0:1), size = 1, replace=T, prob=c(0.5,0.5)); return(event)}
data.imp <- mice(data, me = c("","myfunct",""), m = 1)
data.comp <- complete(data.imp)
I would expect the NAs in y to be replaced with 0 (20% of cases) and 1 (80% of cases). But the NAs are replaced either only with 0 or only with 1.
I have to admit that I am quite a beginner with R and have not had to write my own functions before.
Thank you very much for your help!
Here is a possible solution that simply replaces the missing values with 0 or 1, with the probability of a 0 varying between 0.1 and 0.9:
for( i in seq(0.1,0.9,0.1)){
data[[paste0("y_imp",i)]] <- data$y
N <- sum(is.na( data$y))
data[[paste0("y_imp",i)]][is.na(data[[paste0("y_imp",i)]])] <- sample(c(0,1), size = N, replace=T, prob=c(i,1-i))
}
The line data[[paste0("y_imp",i)]] <- data$y creates a copy of y for each probability i; in that copy, every missing value is then replaced by 0 with probability i (and by 1 with probability 1-i).
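A likely reason the custom mice function in the question produced only 0s or only 1s is that it draws a single value (size = 1), which is then reused for every missing entry. The sketch below draws one value per missing entry instead, using the 20%/80% split the question asks for. It assumes mice calls a user-defined mice.impute.<name> function with y, ry, x (and possibly wy) and expects one imputed value per entry to fill; the function is exercised here in plain R without loading mice.

```r
# One draw per missing value; wy marks the entries to impute
# (assumed calling convention, falls back to !ry if wy is not supplied)
mice.impute.myfunct <- function(y, ry, x, wy = NULL, ...) {
  if (is.null(wy)) wy <- !ry
  sample(c(0, 1), size = sum(wy), replace = TRUE, prob = c(0.2, 0.8))
}

set.seed(123)
y  <- c(1, 1, 1, NA, NA, NA, 1, 1, 0, NA)
ry <- !is.na(y)  # TRUE where y is observed

y_imp <- y
y_imp[!ry] <- mice.impute.myfunct(y, ry, x = NULL)
y_imp
```

With only four missing values a single run cannot show the 20/80 split; that proportion only emerges over many imputations.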
I am working on spike trains. My code to generate spike trains like the one pictured, for 20 trials, is written below. The image is representative of 5 trials.
fr = 100
dt = 1/1000 #time step of 1 ms, expressed in seconds
duration = 2 #duration in s
nBins = 2000 #10msSpikeTrain
nTrials = 20 #NumberOfSimulations
MyPoissonSpikeTrain = function(p, fr= 100) {
p = runif(nBins)
q = ifelse(p < fr*dt, 1, 0)
return(q)
}
set.seed(1)
SpikeMat <- t(replicate(nTrials, MyPoissonSpikeTrain()))
plot(x=-1,y=-1, xlab="time (s)", ylab="Trial",
main="Spike trains",
ylim=c(0.5, nTrials+1), xlim=c(0, duration))
for (i in 1: nTrials)
{
clip(x1 = 0, x2= duration, y1= (i-0.2), y2= (i+0.4))
abline(h=i, lwd= 1/4)
abline(v= dt*which( SpikeMat[i,]== 1))
}
Each trial has spikes occurring at random time points. What I am working towards is drawing one random time point that is shared by all 20 trials and getting, for each trial, the length of the inter-spike interval that this point falls into. The code to get the vector of intervals between the spike times is:
A <- numeric()
for (i in 1: nTrials)
{
ISI <- function(i){
spike_times <- c(dt*which( SpikeMat[i, ]==1))
ISI1vec <- c(diff(spike_times))
A <- c(A, ISI1vec)
return(A)}
}
Then you call ISI(i) for whichever trial you wish to see the Interspike interval vector for. A visual representation of what I want is:
I want to get a vector that holds, for each trial, the length of the interval this point falls into. I want to figure out its distribution as well, but that's for later. Can anybody help me figure out how to code my way to this? Any help is appreciated, even if it's just about how to start or where to look.
Your data
set.seed(1)
SpikeMat <- t(replicate(nTrials, MyPoissonSpikeTrain()))
I suggest transforming your sparse matrix data into a list of indices where spikes occur
L <- lapply(seq_len(nrow(SpikeMat)), function(i) setNames(which(SpikeMat[i, ] == 1), seq_along(which(SpikeMat[i, ] == 1))))
Grab random timepoint
set.seed(1)
RT <- round(runif(1) * ncol(SpikeMat))
# 531
Result
distances contains the distances from RT to the two nearest spikes, one before and one after. Each element of the list is a named vector whose values are the distances to RT (in bins; multiply by dt for seconds) and whose names are the spikes' positions in that trial's spike vector. nearest_columns shows the original time point (column number in SpikeMat) of each of those spikes.
bookend_values <- function(vec) {
lower_val <- head(sort(vec[sign(vec) == 1]), 1)
upper_val <- head(sort(abs(vec[sign(vec) == -1])), 1)
return(c(lower_val, upper_val))
}
distances <- lapply(L, function(i) bookend_values(RT-i))
nearest_columns <- lapply(seq_along(distances), function(i) L[[i]][names(distances[[i]])])
Note that the inter-spike interval of the two nearest spikes that bookend RT can be obtained with
sapply(distances, sum)
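An alternative route to the same per-trial interval length, sketched under the assumption that a time point falling before the first or after the last spike of a trial should yield NA, is base R's findInterval(), which locates the spike immediately at or before RT. The name interval_length is illustrative, and RT is fixed at 531 here only so the example is self-contained.

```r
set.seed(1)
dt <- 1 / 1000
nBins <- 2000
nTrials <- 20
fr <- 100

# Same generating idea as in the question: one Bernoulli draw per 1 ms bin
SpikeMat <- t(replicate(nTrials, as.integer(runif(nBins) < fr * dt)))

RT <- 531  # random time point (bin index), fixed here for illustration

interval_length <- vapply(seq_len(nTrials), function(i) {
  s <- which(SpikeMat[i, ] == 1)  # spike positions in bins
  k <- findInterval(RT, s)        # index of the last spike at or before RT
  if (k == 0 || k == length(s)) return(NA_real_)  # RT not bracketed by spikes
  (s[k + 1] - s[k]) * dt          # interval length in seconds
}, numeric(1))

interval_length
```

Note one difference in convention: a spike falling exactly on RT counts as the left bookend here, whereas the sign-based approach above skips it.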