Parallel estimation of multiple nonparametric models using np and snowfall - r

I am trying to estimate multiple nonparametric models using snowfall. So far I had no problems, but now I run into a problem that I feel unable to resolve.
In the MWE below we simply estimate only one model on one node. In my application the structure is the same. When I try to plot the model results or use another function from the np package (like npsigtest()), I get the error
Error in is.data.frame(data) : ..1 used in an incorrect context, no
... to look in
Has anyone an idea what causes the problem? I am open to another approach concerning parallel estimation of several models.
MRE:
library(np)
library(snowfall)
df <- data.frame(Y = runif(100, 0, 10), X = rnorm(100))
models <- list(as.formula(Y ~ X))
sfInit(parallel = T, cpus = length(models))
sfExport("models")
sfExport("df")
sfLibrary(snowfall)
sfLibrary(np)
lcls <- sfLapply(models, fun = npregbw, data = df, regtype = "lc")
sfStop()
plot(lcls[[1]])

Related

An R function cannot work in local environment of other functions

I use Matchit package for propensity score matching. It can generate a matched data after matching using get_matches() function.
However, if I do not run the get_matches() function in the global environment but include it in any other function, the matched data cannot be found in the local environment. (These prove to be misleading information. There is nothing wrong with MatchIt's output. Answer by Noah explains my question better.)
For producing my data
dataGen <- function(b0,b1,n = 2000,cor = 0){
# covariate
sigma <- matrix(rep(cor,9),3,3)
diag(sigma) <- rep(1,3)
cov <- MASS::mvrnorm(n, rep(0,3), sigma)
# error
error <- rnorm(n,0,sqrt(18))
# treatment variable
logit <- b0+b1*cov[,1]+0.3*cov[,2]+cov[,3]
p <- 1/(1+exp(-logit))
treat <- rbinom(n,1,p)
# outcome variable
y <- error+treat+cov[,1]+cov[,2]
data <- as.data.frame(cbind(cov,treat,y))
return(data)
}
set.seed(1)
data <- dataGen(b0=-0.92, b1=0.8, 900)
It is like the following works. The est.m.WLS() can use the m.data.
fm1 <- treat ~ V1+V2+V3
m.out <- MatchIt::matchit(data = data, formula = fm1, link = "logit", m.order = "random", caliper = 0.2)
m.data <- MatchIt::get_matches(m.out,data=data)
est.m.WLS <- function(m.data, fm2){
model.1 <- lm(fm2, data = m.data, weights=(weights))
est <- model.1$coefficients["treat"]
## regular robust standard error ignoring pair membership
model.1.2 <- lmtest::coeftest(model.1,vcov. = sandwich::vcovHC)
CI.r <- confint(model.1.2,"treat",level=0.95)
## cluster robust standard error accounting for pair membership
model.2.2 <- lmtest::coeftest(model.1, vcov. = sandwich::vcovCL, cluster = ~subclass)
CI.cr <- confint(model.2.2,"treat",level=0.95)
return(c(est=est,CI.r,CI.cr))
}
fm2 <- y ~ treat+V1+V2+V3
est.m.WLS(m.data,fm2)
But the next syntax does not work. It will report
"object 'm.data' not found"
rm(m.data)
m.out <- MatchIt::matchit(data = data, formula = fm1, link = "logit", m.order = "random", caliper = 0.2)
est.m.WLS <- function(m.out, fm2){
m.data <- MatchIt::get_matches(m.out,data=data)
model.1 <- lm(fm2, data = m.data, weights=(weights))
est <- model.1$coefficients["treat"]
## regular robust standard error ignoring pair membership
model.1.2 <- lmtest::coeftest(model.1,vcov. = sandwich::vcovHC)
CI.r <- confint(model.1.2,"treat",level=0.95)
## cluster robust standard error accounting for pair membership
model.2.2 <- lmtest::coeftest(model.1, vcov. = sandwich::vcovCL, cluster = ~subclass)
CI.cr <- confint(model.2.2,"treat",level=0.95)
return(c(est=est,CI.r,CI.cr))
}
est.m.WLS(m.out,fm2)
Since I want to run parallel loops using the groundhog library for simulation purpose, the get_matches function also cannot work in foreach()%dopar%{...} environment.
res=foreach(s = 1:7,.combine="rbind")%dopar%{
m.out <- MatchIt::matchit(data = data, formula = fm.p, distance = data$logit, m.order = "random", caliper = 0.2)
m.data <- MatchIt::get_matches(m.out,data=data)
...
}
How should I fix the problem?
Any help would be appreciated. Thank you!
Using for() loop directly will not run into any problem since it just works in the global environment, but it is too slow... I really hope to do the thousand time simulations at once. Help!
This has nothing to do with MatchIt or get_matches(). Run debugonce(est.m.WLS) with your second implementation of est.m.WLS(). You will see that get_matches() works perfectly fine and returns m.data. The problem is when lmtest() runs with a formula argument for cluster.
This is due to a bug in R, outside any package, that I have already requested to be fixed. The problem is that expand.model.matrix(), a function that searches for the dataset that the variables supplied to cluster could be in, only searches the global environment for data, but m.data does not exist in the global environment. To get around this issue, don't supply a formula to cluster; use cluster = m.data["subclass"]. This should hopefully be resolved in an upcoming R release.

Is it possible to use lqmm with a mira object?

I am using the package lqmm, to run a linear quantile mixed model on an imputed object of class mira from the package mice. I tried to make a reproducible example:
library(lqmm)
library(mice)
summary(airquality)
imputed<-mice(airquality,m=5)
summary(imputed)
fit1<-lqmm(Ozone~Solar.R+Wind+Temp+Day,random=~1,
tau=0.5, group= Month, data=airquality,na.action=na.omit)
fit1
summary(fit1)
fit2<-with(imputed, lqmm(Ozone~Solar.R+Wind+Temp+Day,random=~1,
tau=0.5, group= Month, na.action=na.omit))
"Error in lqmm(Ozone ~ Solar.R + Wind + Temp + Day, random = ~1, tau = 0.5, :
`data' must be a data frame"
Yes, it is possible to get lqmm() to work in mice. Viewing the code for lqmm(), it turns out that it's a picky function. It requires that the data argument is supplied, and although it appears to check if the data exists in another environment, it doesn't seem to work in this context. Fortunately, all we have to do to get this to work is capture the data supplied from mice and give it to lqmm().
fit2 <- with(imputed,
lqmm(Ozone ~ Solar.R + Wind + Temp + Day,
data = data.frame(mget(ls())),
random = ~1, tau = 0.5, group = Month, na.action = na.omit))
The explanation is that ls() gets the names of the variables available, mget() gets those variables as a list, and data.frame() converts them into a data frame.
The next problem you're going to find is that mice::pool() requires there to be tidy() and glance() methods to properly pool the multiple imputations. It looks like neither broom nor broom.mixed have those defined for lqmm. I threw together a very quick and dirty implementation, which you could use if you can't find anything else.
To get pool(fit2) to run you'll need to create the function tidy.lqmm() as below. Then pool() will assume the sample size is infinite and perform the calculations accordingly. You can also create the glance.lqmm() function before running pool(fit2), which will tell pool() the residual degrees of freedom. Afterwards you can use summary(pooled) to find the p-values.
tidy.lqmm <- function(x, conf.int = FALSE, conf.level = 0.95, ...) {
broom:::as_tidy_tibble(data.frame(
estimate = coef(x),
std.error = sqrt(
diag(summary(x, covariance = TRUE,
R = 50)$Cov[names(coef(x)),
names(coef(x))]))))
}
glance.lqmm <- function(x, ...) {
broom:::as_glance_tibble(
logLik = as.numeric(stats::logLik(x)),
df.residual = summary(x, R = 2)$rdf,
nobs = stats::nobs(x),
na_types = "rii")
}
Note: lqmm uses bootstrapping to estimate the standard error. By default it uses R = 50 bootstrapping replicates, which I've copied in the tidy.lqmm() function. You can change that line to increase the number of replicates if you like.
WARNING: Use these functions and the results with caution. I know just enough to be dangerous. To me it looks like these functions work to give sensible results, but there are probably intricacies that I'm not aware of. If you can find a more authoritative source for similar functions that work, or someone who is familiar with lqmm or pooling mixed models, I'd trust them more than me.

Caret train function for muliple data frames as function

there has been a similar question to mine 6 years+ ago and it hasn't been solve (R -- Can I apply the train function in caret to a list of data frames?)
This is why I am bringing up this topic again.
I'm writing my own functions for my big R project at the moment and I'm wondering if there is an opportunity to sum up the model training function train() of the pakage caret for different dataframes with different predictors.
My function should look like this:
lda_ex <- function(data, predictor){
model <- train(predictor ~., data,
method = "lda",
trControl = trainControl(method = "none"),
preProc = c("center","scale"))
return(model)
}
Using it afterwards should work like this:
data_iris <- iris
predictor_iris <- "Species"
iris_res <- lda_ex(data = data_iris, predictor = predictor_iris)
Unfortunately the R formula is not able to deal with a variable as input as far as I tried.
Is there something I am missing?
Thank you in advance for helping me out!
Solving this would help me A LOT to keep my function sheet clean and safe work for sure.
By writing predictor_iris <- "Species", you are basically saving a string object in predictor_iris. Thus, when you run lda_ex, I guess you incur in some error concerning the formula object in train(), since you are trying to predict a string using vectors of covariates.
Indeed, I tried the following toy example:
X = rnorm(1000)
Y = runif(1000)
predictor = "Y"
lm(predictor ~ X)
which gives an error about differences in the lengths of variables.
Let me modify your function:
lda_ex <- function(data, formula){
model <- train(formula, data,
method = "lda",
trControl = trainControl(method = "none"),
preProc = c("center","scale"))
return(model)
}
The key difference is that now we must pass in the whole formula, instead of the predictor only. In that way, we avoid the string-related problem.
library(caret) # Recall to specify the packages needed to reproduce your examples!
data_iris <- iris
formula_iris = Species ~ . # Key difference!
iris_res <- lda_ex(data = data_iris, formula = formula_iris)

Can H2O deeplearning models in R be reproducible while remaining multithreaded?

I've been working on validating models developed using h2o.
Specificially I've been testing a neural net implemented using h2o.deeplearning. I've been attempting to generate consistent results by setting a seed in the H2O function, but even doing this I see correlation coefficients of between 0.6 and 0.85 between different versions of the same model, even ones with identical seeds.
I did some reading, and saw that I could force reproducibility by setting the reproducible flag to TRUE, but at a significant performance cost. The input to this model is too large for that to be a feasible method.
Has anyone else ever had to solve a similar problem/found a way to force H2O neural nets to be reproducible with less performance impact?
From the technical note on this topic
Why Deep learning results are not reproducible:
Motivation
H2O's Deep Learning uses a technique called HOGWILD! which greatly increases the speed of training, but is not reproducible by default.
Solution
In order to obtain reproducible results, you must set reproducible = TRUE and seed = 1 (for example, but you can use any seed as long as you use the same one each time). If you force reproducibility, it will slow down the training because this only works on a single thread. By default, H2O clusters are started with the same number of threads as number of cores (e.g. 8 is typical on a laptop).
The R example below demonstrates how to produce reproducible deep learning models:
library(h2o)
h2o.init(nthreads = -1)
# Import a sample binary outcome train/test set into R
train <- read.table("http://www.stat.berkeley.edu/~ledell/data/higgs_10k.csv", sep=",")
test <- read.table("http://www.stat.berkeley.edu/~ledell/data/higgs_test_5k.csv", sep=",")
# Convert R data.frames into H2O parsed data objects
training_frame <- as.h2o(train)
validation_frame <- as.h2o(test)
y <- "V1"
x <- setdiff(names(training_frame), y)
family <- "binomial"
training_frame[,c(y)] <- as.factor(training_frame[,c(y)]) #Force Binary classification
validation_frame[,c(y)] <- as.factor(validation_frame[,c(y)])
Now we will fit two models and show that the training AUC is the same both times (ie. reproducible).
fit <- h2o.deeplearning(x = x, y = y,
training_frame = training_frame,
reproducible = TRUE,
seed = 1)
h2o.auc(fit)
#[1] 0.8715931
fit2 <- h2o.deeplearning(x = x, y = y,
training_frame = training_frame,
reproducible = TRUE,
seed = 1)
h2o.auc(fit2)
#[1] 0.8715931

R bsts predictions are not consistent

Whenever I run the predict function multiple times on a bsts model using the same prediction data, I get different answers. So my question is, is there a way to return consistent answers given I keep my predictor dataset the same?
Example using the iris data set (I know it's not time series but it will illustrate my point)
iris_train <- iris[1:100,1:3]
iris_test <- iris[101:150,1:3]
ss <- AddLocalLinearTrend(list(), y = iris_train$Sepal.Length)
iris_bsts <- bsts(formula = Sepal.Length ~ ., data = iris_train,
state.specification = ss,
family = 'gaussian', seed = 1, niter = 500)
burn <- SuggestBurn(0.1,iris_bsts)
Now if I run this following line say, 10 times, each result is different:
iris_predict <- predict(iris_bsts, newdata = iris_test, burn = burn)
iris_predict$mean
I understand that it is running MCMC simulations, but I require consistent results and have therefore tried:
Setting the seed in bsts and before predict
Setting the state space standard deviation to near 0, which just creates unstable results.
And neither seem to work. Any help would be appreciated!
I encountered the same problem. To fix it, you need to set the random seed in the embedded C code. I forked the packaged and made the modifications here: BSTS.
For package installation only, download bsts_0.7.1.1.tar.gz in the build folder. If you already have bsts installed, replace it with this version via:
remove.packages("bsts")
# assumes working directory is whre file is located
install.packages("bsts_0.7.1.1.tar.gz", repos=NULL, tyype="source")
If you do not have bsts installed, please install it first to ensure all dependencies are there. (This may require installing Rtools, Boom, and BoomSpikeSlab individually.)
This package version only modifies the predict function from bsts, all code should work as is. It automatically sets the random seed to 1 each time predict is called. If you want predictions to vary, you'll need to explicitly set the predict parameter each time.
You can make a function to specify seed each time (set.seed was unnecessary...):
reproducible_predict <- function(S) {
iris_bsts <- bsts(formula = Sepal.Length ~ ., data = iris_train, state.specification = ss, seed = S, family = 'gaussian', niter = 500)
burn <- SuggestBurn(0.1,iris_bsts)
iris_predict <- predict(iris_bsts, newdata = iris_test, burn = burn)
return(iris_predict$mean)
}
reproducible_predict(1)
[1] 7.043592 6.212780 6.789205 6.563942 6.746156
reproducible_predict(1)
[1] 7.043592 6.212780 6.789205 6.563942 6.746156
reproducible_predict(200)
[1] 7.013679 6.173846 6.763944 6.567651 6.715257
reproducible_predict(200)
[1] 7.013679 6.173846 6.763944 6.567651 6.715257
I have come across the same issue.
The problem comes from setting the seed within the model definition only.
To solve your problem, you have to set a seed within the predict function such as:
iris_predict <- predict(iris_bsts, newdata = iris_test, burn = burn, seed=X)
Hope this helps.

Resources