I have to code a simulation study in R. I have explanatory variables X1,...,X15 ~ N(0,1) and Y ~ N(2 + 2*X1 + 0.8*X2 - 1.2*X15, 1), and I need to simulate n = 100 values and repeat that for iter = 100 times. Then, for each linear model created, I have to calculate the AIC value and, finally, find the best model. The problem is that I can't figure out how to do that for iter = 100 times. I wrote the code for one simulation, which is the following:
set.seed(123)
n <- 100
p <- 15
iter <- 100
X <- matrix(rep(NA, n*p), ncol = p)
for (j in 1:p) {
  X[, j] <- rnorm(n = 100, mean = 0, sd = 1)
}
mu <- (2 + 2*X[, 1]) + (0.8*X[, 2]) - (1.2*X[, 15])
Y <- rnorm(n = 100, mean = mu, sd = 1)
sim <- data.frame(Y, X)
d <- lm(Y ~ X, data = sim)
But how do I do the rest, i.e. the 100 simulations and the calculation of the AIC values? I'm really new to R, so I am quite confused.
How about this:
nsim <- 100   # number of simulations
nobs <- 100   # observations per simulation
nvar <- 15    # number of explanatory variables
results <- lapply(1:nsim, function(i) {
  X <- matrix(rnorm(nobs*nvar), nrow = nobs)
  y <- rnorm(nobs, mean = 2 + X[, c(1, 2, 15)] %*% c(2, .8, -1.2))
  DF <- data.frame(y, X)
  lm(y ~ X, data = DF)
})
That should give you your simulations. Now find the "best" one:
findbest <- which.min(sapply(results, AIC))
results[[findbest]]
Since all data sets are simulated from the same underlying data-generating process and every model uses the same formula, any variation in AIC across iterations is essentially random variation.
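If the goal is to compare different specifications rather than 100 refits of the same model, a sketch along the following lines (reusing nsim, nobs and nvar from above) lets AIC do real work, contrasting the full 15-predictor model with an intercept-only model on each simulated data set:
aic_compare <- sapply(1:nsim, function(i) {
  X <- matrix(rnorm(nobs*nvar), nrow = nobs)
  y <- rnorm(nobs, mean = 2 + X[, c(1, 2, 15)] %*% c(2, .8, -1.2))
  DF <- data.frame(y, X)
  c(full = AIC(lm(y ~ ., data = DF)),   # all 15 predictors
    null = AIC(lm(y ~ 1, data = DF)))   # intercept only
})
rowMeans(aic_compare)  # the full model should have the lower average AIC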
I have run a multiple imputation (m = 45, 10 iterations) using the MICE package, and want to calculate Cronbach's alpha for a number of ordinal scales in the data. Is there a function in R that could assist me in calculating the alpha coefficient across the imputed datasets in a manner that would satisfy Rubin's rules for pooling estimates?
We may exploit pool.scalar from the mice package, which performs pooling of univariate estimates according to Rubin's rules.
Since you have not provided a reproducible example yourself, I will provide one.
set.seed(123)
# sample survey responses
df <- data.frame(
x1 = c(1,2,2,3,2,2,3,3,2,3,
1,2,2,3,2,2,3,3,2,3,
1,2,2,3,2,2,3,3,2,3),
x2 = c(1,1,1,2,3,3,2,3,3,3,
1,1,1,2,3,3,2,3,3,3,
1,2,2,3,2,2,3,3,2,3),
x3 = c(1,1,2,1,2,3,3,3,2,3,
1,1,2,1,2,3,3,3,2,3,
1,2,2,3,2,2,3,3,2,3)
)
# function to column-wise generate missing values (MCAR)
create_missings <- function(data, prob) {
  x <- replicate(ncol(data), rbinom(nrow(data), 1, prob))
  for (k in 1:ncol(data)) {
    data[, k] <- ifelse(x[, k] == 1, NA, data[, k])
  }
  data
}
df <- create_missings(df, prob = 0.2)
# multiple imputation ----------------------------------
library(mice)
imp <- mice(df, m = 10, maxit = 20)
# extract the completed data in long format
implong <- complete(imp, 'long')
We need a function to compute Cronbach's alpha and obtain an estimate of the standard error of alpha, which can be used in a call to pool.scalar() later on. Since there is no readily available formula to estimate the standard error of alpha analytically, we also need a bootstrap procedure to estimate this standard error.
The function cronbach_fun() takes the following arguments:
list_compl_data: a data frame with the completed data of a single imputation, taken from the long-format completed data of a mids object.
boot: a logical indicating whether a non-parametric bootstrap should be conducted.
B: an integer specifying the number of bootstrap samples to be taken.
ci: a logical indicating whether a confidence interval around alpha should be estimated.
cronbach_fun <- function(list_compl_data, boot = TRUE, B = 1e4, ci = FALSE) {
  n <- nrow(list_compl_data); p <- ncol(list_compl_data)
  total_variance <- var(rowSums(list_compl_data))
  item_variance <- sum(apply(list_compl_data, 2, sd)^2)
  alpha <- (p/(p - 1)) * (1 - (item_variance/total_variance))
  out <- list(alpha = alpha)
  boot_alpha <- numeric(B)
  if (boot) {
    for (i in seq_len(B)) {
      boot_dat <- list_compl_data[sample(seq_len(n), replace = TRUE), ]
      total_variance <- var(rowSums(boot_dat))
      item_variance <- sum(apply(boot_dat, 2, sd)^2)
      boot_alpha[i] <- (p/(p - 1)) * (1 - (item_variance/total_variance))
    }
    out$var <- var(boot_alpha)
  }
  if (ci) {
    out$ci <- quantile(boot_alpha, c(.025, .975))
  }
  return(out)
}
Now that we have our function to do the 'heavy lifting', we can run it on all m completed data sets, after which we can obtain Q and U (which are required for the pooling of the estimates). Consult ?pool.scalar for more information.
m <- length(unique(implong$.imp))
boot_alpha <- rep(list(NA), m)
for (i in seq_len(m)) {
  set.seed(i) # fix random number generator
  sub <- implong[implong$.imp == i, -c(1, 2)]
  boot_alpha[[i]] <- cronbach_fun(sub)
}
# obtain Q and U (see ?pool.scalar)
Q <- sapply(boot_alpha, function(x) x$alpha)
U <- sapply(boot_alpha, function(x) x$var)
# pooled estimates
pool_estimates <- function(x) {
  out <- c(
    alpha = x$qbar,
    lwr = x$qbar - qt(0.975, x$df) * sqrt(x$t),
    upr = x$qbar + qt(0.975, x$df) * sqrt(x$t)
  )
  return(out)
}
Output
# Pooled estimate of alpha (95% CI)
> pool_estimates(pool.scalar(Q, U))
alpha lwr upr
0.7809977 0.5776041 0.9843913
I'm working with the train() function from the caret package to fit multiple regression and ML models and test their fit. I'd like to write a function that iterates through all model types and enters the best fit into a dataframe. The biggest issue is that caret doesn't provide all the model fit statistics that I'd like, so they need to be derived from the raw output. Based on my exploration, there doesn't seem to be a standardized way caret outputs each model's fit.
Another post (sorry, I don't have a link) created this function, which pulls from fit$results and fit$bestTune to get the pre-calculated RMSE, R^2, etc.
get_best_result <- function(caret_fit) {
  best = which(rownames(caret_fit$results) == rownames(caret_fit$bestTune))
  best_result = caret_fit$results[best, ]
  rownames(best_result) = NULL
  best_result
}
One example of another fit statistic I need to calculate from the raw output is BIC. The two functions below do that. The residuals (y_actual - y_predicted) are needed, along with the number of x variables (k) and the number of rows used in the prediction (n). k and n must be derived from the output, not from the original dataset, because the models may drop x variables (feature selection) or rows (omitting NAs) depending on their algorithm.
calculate_MSE <- function(residuals){
  # residuals can be replaced with y_actual - y_predicted
  mse <- mean(residuals^2)
  return(mse)
}
calculate_BIC <- function(n, mse, k){
  BIC <- n*log(mse) + k*log(n)
  return(BIC)
}
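For illustration, a minimal usage sketch of these two helpers; the residual vector and the value of k here are hypothetical placeholders, not output from an actual caret fit:
set.seed(1)
resids <- rnorm(43)  # hypothetical residuals from some fitted model
calculate_BIC(n = length(resids),
              mse = calculate_MSE(resids),
              k = 3)  # hypothetical number of retained x variables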
The real question is: is there a standardized output of caret::train() for the x variables, or for y_actual, y_predicted, or the residuals?
I tried fit$finalModel$model and other methods but to no avail.
Here is a reproducible example along with the function I'm using. Please consider the functions above a part of this reproducible example.
library(caret)
library(rlist)
library(data.table)
# data
df <- data.frame(y1 = rnorm(50, 0, 1),
y2 = rnorm(50, .25, 1.5),
x1 = rnorm(50, .4, .9),
x2 = rnorm(50, 0, 1.1),
x3 = rnorm(50, 1, .75))
missing_index <- sample(1:50, 7, replace = F)
df[missing_index,] <- NA
# function to fit models and pull results
fitModels <- function(df, Ys, Xs, models){
  # empty list
  results <- list()
  # number of for loops
  loops_counter <- 0
  # for every y
  for(y in 1:length(Ys)){
    # for every model
    for(m in 1:length(models)){
      # track loops
      loops_counter <- loops_counter + 1
      # fit the model
      set.seed(1) # seed for reproducibility
      fit <- tryCatch(train(as.formula(paste(Ys[y], paste(Xs, collapse = ' + '),
                                             sep = ' ~ ')),
                            data = df,
                            method = models[m],
                            na.action = na.omit,
                            tuneLength = 10),
                      error = function(e) {return(NA)})
      # pull results
      results[[loops_counter]] <- c(Y = Ys[y],
                                    model = models[m],
                                    sample_size = nrow(fit$finalModel$model),
                                    RMSE = get_best_result(fit)[[2]],
                                    R2 = get_best_result(fit)[[3]],
                                    MAE = get_best_result(fit)[[4]],
                                    BIC = calculate_BIC(n = length(fit$finalModel),
                                                        mse = calculate_MSE(fit$finalModel$residuals),
                                                        k = length(fit$finalModel$xNames)))
    }
  }
  # list bind
  results_df <- list.rbind(results)
  return(results_df)
}
linear_models <- c('lm', 'glmnet', 'ridge', 'lars', 'enet')
fits <- fitModels(df, c('y1', 'y2'), c('x1', 'x2', 'x3'), linear_models)
I have a function which is extremely computationally heavy (takes about 10 minutes to run once), and this function has 4 parameters.
I define the function (call it f here) as follows:
f <- function(parameters){
  # ...calculate value...
  return(value)
}
It is definitely nonlinear, and these 4 parameters must be integer values. I want to run the optimization over the range from c(3, 100, 20000, 80000) to c(10, 150, 80000, 120000).
I realize that for a function this computationally heavy, with such a large search range, the analysis will take a long time.
I am looking for an efficient way of doing this. rgenoud seems to work for integer values, but after running it, I have no way to determine whether it is actually running; it stops at the line "Minimization problem."
Are there other such optimization algorithms (either minimization or maximization) which are extremely efficient and will give me some feedback whether it is running or not?
I am sorry I could not provide a reproducible example, as I cannot provide the data set (100 GB) and this is more of a theoretical question.
EDIT:
If this helps, some algorithms I have been looking at include genetic algorithms, particle swarm optimization, simulated annealing, and Markov chain Monte Carlo. I am open to other algorithms, as I am just looking for the most efficient way to optimize these discrete variables.
EDIT 2:
Simplified version of the function:
control1 <- sample(1:75, 3947398, replace=TRUE)
control2 <- sample(1:75, 28793, replace=TRUE)
control3 <- sample(1:100, 392733, replace=TRUE)
control4 <- sample(1:75, 858383, replace=TRUE)
patient1 <- sample(1:100, 28048, replace=TRUE)
patient2 <- sample(1:50, 80400, replace=TRUE)
patient3 <- sample(1:100, 48239, replace=TRUE)
control <- list(control1, control2, control3, control4)
patient <- list(patient1, patient2, patient3)
f <- function(parameter){
  s <- parameter[1]
  control_s <- list()
  patient_s <- list()
  for (i in 1:length(control))
    control_s[[i]] <- sample(control[[i]], s)
  for (i in 1:length(patient))
    patient_s[[i]] <- sample(patient[[i]], s)
  controlfreq <- list()
  for (i in 1:length(control_s)){
    controlfreq[[i]] <-
      as.data.frame(prop.table(table(factor(
        control_s[[i]], levels = 1:100
      ))))[, 2]
  }
  patientfreq <- list()
  for (i in 1:length(patient_s)){
    patientfreq[[i]] <-
      as.data.frame(prop.table(table(factor(
        patient_s[[i]], levels = 1:100
      ))))[, 2]
  }
  controlfreq <- t(as.data.frame(controlfreq))
  controltrainingset <- transform(controlfreq, status = "control")
  patientfreq <- t(as.data.frame(patientfreq))
  patienttrainingset <- transform(patientfreq, status = "patient")
  dataset <- rbind(controltrainingset, patienttrainingset)
  library(caret)
  fitControl <- trainControl(method = "LOOCV", classProbs = T, savePredictions = T)
  model <- train(status ~ ., data = dataset, method = "rf", trControl = fitControl)
  selectedIndices <- model$pred$mtry == 2
  confusionmatrix <- table(model$pred$obs[selectedIndices], model$pred$pred[selectedIndices])
  metric <- ((confusionmatrix[1,1]/length(control)) + (confusionmatrix[2,2]/length(patient)))/2
  return(-metric) # -metric if minimizing, metric if maximizing
}
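As a side note on the feedback issue: whichever optimizer is used, a simple way to see that the objective is actually being evaluated is to wrap it so that every call prints a counter and the current parameter vector. A sketch, assuming the objective is the f() defined above:
make_verbose <- function(fn) {
  calls <- 0L
  function(parameter) {
    calls <<- calls + 1L  # count evaluations across calls
    val <- fn(parameter)
    cat(sprintf("evaluation %d: parameters = %s, value = %.4f\n",
                calls, paste(parameter, collapse = ", "), val))
    val
  }
}
# e.g. pass make_verbose(f) to the optimizer instead of f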
I have written a small function that simulates data from a normal distribution, as is usual in linear models. My question is how to get a model where the p-value for sim[, 1] is 0.05. I want to show that if I add a random variable to the model, even one that is normally distributed around zero with small variance, N(0, 0.0023), the p-value of sim[, 1] changes. The code below shows the true model.
set.seed(37) # seed for reproducibility
simulation <- function(b_0, b_1, n, min_x_1, max_x_1, sd_e){
  mat <- NA
  x_1 <- runif(n = n, min = min_x_1, max = max_x_1)
  error <- rnorm(mean = 0, sd = sd_e, n = n)
  y <- b_0 + b_1*x_1 + error
  mat <- matrix(cbind(x_1, y), ncol = 2)
  return(mat)
  # plot(mat[,1], mat[,2])
}
sim <- simulation(10,-2,10000,-10,70,0.003)
summary(lm(sim[,2] ~ sim[,1] ))
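For what it's worth, a minimal sketch of the comparison described above, assuming N(0, 0.0023) refers to a variance of 0.0023 (so sd = sqrt(0.0023)): add the small-variance noise variable to the model and compare the p-value of sim[, 1] with the summary above.
z <- rnorm(n = nrow(sim), mean = 0, sd = sqrt(0.0023))  # small-variance noise variable
summary(lm(sim[, 2] ~ sim[, 1] + z))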
In a project where I'm performing mixed-effects modelling using lme, I'm trying to compare models with different correlation structures and equal fixed parts. As I'll be building a lot of these models (for different dependent variables), I tried to write a function to generate a list of models with different correlation structures, as in the example below (I really tried to keep it to a minimal working example).
If I run an anova() on the elements of this list, this works, but only if fixedPart is in my global environment. Why is this the case? Is there a way to circumvent this problem, so that I can just keep m and re-use/delete fixedPart?
I presume this problem is related to the (lexical) scoping in R, but I cannot find a way to actually fix it.
Thanks in advance!
#Dependencies
library(multilevel)
library(multcomp)
#Generate sample data
nVals = 100
sData = rnorm(nVals, mean = 1, sd = 1)
dF <- data.frame(nSubject = 1:nVals,
v1data = sData + rnorm(nVals, mean = 0, sd = 0.1),
v2data = sData + rnorm(nVals, mean = 0, sd = 0.1),
v3data = sData + rnorm(nVals, mean = 0, sd = 0.4))
dLongF = reshape(data=dF, varying=c("v1data","v2data","v3data"), v.names='data', direction="long", idvar="nSubject", times=1:3)
#Define function to assess different covariance structures
doAllCorrModels <- function(dataF, subjVarName, visitVarName, fixedPart){
  mList <- vector("list", 2)
  mList[[1]] <- lme(fixedPart, # random intercept, homogeneous variance
                    random = as.formula(paste("~1|", subjVarName)),
                    data = dataF,
                    weights = NULL)
  mList[[2]] <- lme(fixedPart, # random intercept, heterogeneous variance
                    random = as.formula(paste("~1|", subjVarName)),
                    data = dataF,
                    weights = varIdent(form = as.formula(paste("~1|", visitVarName))))
  mList
}
#Get different covariance structures
dataF <- dLongF
subjVarName <- "nSubject"
visitVarName <- "time"
fixedPart <- data ~ time
m <- doAllCorrModels(dataF, subjVarName, visitVarName, fixedPart)
#This works:
a1 <- anova(m[[1]], m[[2]])
#But this does not:
rm(fixedPart)
a2 <- anova(m[[1]], m[[2]])
You can avoid this by using do.call: since do.call evaluates its arguments before constructing the call, the fitted models store the actual formula object rather than the unevaluated name fixedPart, so anova() no longer needs to find fixedPart in the global environment.
doAllCorrModels <- function(dataF, subjVarName, visitVarName, fixedPart){
  mList <- vector("list", 2)
  mList[[1]] <- do.call(lme, list(fixed = fixedPart,
                                  random = as.formula(paste("~1|", subjVarName)),
                                  data = dataF,
                                  weights = NULL))
  mList[[2]] <- do.call(lme, list(fixed = fixedPart,
                                  random = as.formula(paste("~1|", subjVarName)),
                                  data = dataF,
                                  weights = varIdent(form = as.formula(paste("~1|", visitVarName)))))
  mList
}
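A quick check of the fix, reusing the objects from the question; with the do.call version the fitted models store the evaluated formula, so this should run even after fixedPart has been removed:
fixedPart <- data ~ time
m <- doAllCorrModels(dLongF, "nSubject", "time", fixedPart)
rm(fixedPart)
anova(m[[1]], m[[2]])  # no longer errors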