How can I load custom functions into a foreach loop in R?

I am trying to run gls models with a custom spatial correlation structure, built by defining new functions in the global environment that extend the nlme package (following the answer to this post, which creates the corHaversine class and the methods that implement the correlation structure). Unfortunately, I cannot get this correlation structure to work when I run the model through a foreach loop:
#setup example data
data("mtcars")
mtcars$lon = runif(nrow(mtcars)) #include lon and lat for the new correlation structure
mtcars$lat = runif(nrow(mtcars))
mtcars$marker = c(rep(1, nrow(mtcars)/2), rep(2, nrow(mtcars)/2)) #values for iterations

#set up cluster
detectCores()
cl <- parallel::makeCluster(6, setup_strategy = "sequential")
doParallel::registerDoParallel(cl)

#run model
list_models <- foreach(i = 1:2, .packages = c('nlme'), .combine = cbind,
                       .export = ls(.GlobalEnv)) %dopar% {
  .GlobalEnv$i <- i
  model_trial <- gls(disp ~ wt,
                     correlation = corHaversine(form = ~ lon + lat,
                                                mimic = "corSpher"),
                     data = mtcars)
}
stopCluster(cl)
When I run this I get the error message:
Error in { :
task 1 failed - "do not know how to calculate correlation matrix of “corHaversine” object"
In addition: Warning message:
In e$fun(obj, substitute(ex), parent.frame(), e$data) :
already exporting variable(s): corHaversine, mtcars, path_df1
The model works fine with the added correlation structure
correlation = corHaversine(form=~lon+lat,mimic="corSpher")
in a normal loop. Any help would be appreciated!

I'm not sure why your foreach approach doesn't work, and I'm also not sure what you're actually calculating. Anyway, you may try this alternative approach using parallel::parLapply(), which seems to work:
First, I cleared the workspace using rm(list=ls()). Then I ran the entire first code block of the linked answer, which defines the corHaversine class (a corStruct subclass) and its methods, so that everything was in the workspace, along with the Data below, ready for clusterExport().
library(parallel)
cl <- makeCluster(detectCores() - 1)
clusterEvalQ(cl, library(nlme))  ## load nlme on each worker
clusterExport(cl, ls())          ## export data and the corHaversine definitions
r <- parLapply(cl = cl, X = 1:2, fun = function(i) {
  gls(disp ~ wt,
      correlation = corHaversine(form = ~ lon + lat, mimic = "corSpher"),
      data = mtcars)
})
stopCluster(cl)  ## stop cluster
r  ## result
# [[1]]
# Generalized least squares fit by REML
#   Model: disp ~ wt
#   Data: mtcars
#   Log-restricted-likelihood: -166.6083
#
# Coefficients:
# (Intercept)          wt
#   -122.4464    110.9652
#
# Correlation Structure: corHaversine
#  Formula: ~lon + lat
#  Parameter estimate(s):
#    range
# 10.24478
# Degrees of freedom: 32 total; 30 residual
# Residual standard error: 58.19052
#
# [[2]]
# Generalized least squares fit by REML
#   Model: disp ~ wt
#   Data: mtcars
#   Log-restricted-likelihood: -166.6083
#
# Coefficients:
# (Intercept)          wt
#   -122.4464    110.9652
#
# Correlation Structure: corHaversine
#  Formula: ~lon + lat
#  Parameter estimate(s):
#    range
# 10.24478
# Degrees of freedom: 32 total; 30 residual
# Residual standard error: 58.19052
Data:
set.seed(42) ## for sake of reproducibility
mtcars <- within(mtcars, {
  lon <- runif(nrow(mtcars))
  lat <- runif(nrow(mtcars))
  marker <- c(rep(1, nrow(mtcars)/2), rep(2, nrow(mtcars)/2))
})
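Why the foreach version fails is not certain, but a plausible explanation is S3 method dispatch: corHaversine works through methods such as corMatrix.corHaversine, and foreach's .export copies objects into the task's local evaluation environment, whereas UseMethod() looks for methods in the worker's global environment. clusterExport() assigns directly into the workers' global environments, which may be exactly why the parLapply() approach succeeds. If that is the cause, exporting the definitions the same way before the loop might rescue the original foreach approach; an untested sketch:
cl <- parallel::makeCluster(6, setup_strategy = "sequential")
doParallel::registerDoParallel(cl)
## untested sketch: put the corHaversine functions (and data) into each
## worker's global environment, where S3 method dispatch can find them
parallel::clusterExport(cl, ls(.GlobalEnv))
list_models <- foreach(i = 1:2, .packages = "nlme", .combine = cbind) %dopar% {
  gls(disp ~ wt,
      correlation = corHaversine(form = ~ lon + lat, mimic = "corSpher"),
      data = mtcars)
}
parallel::stopCluster(cl)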

Related

Use lapply with formula to estimate lm with different weights

Related but slightly different to How to use lapply with a formula? and Calling update within a lapply within a function, why isn't it working?:
I am trying to estimate models with replicate weights. For correct standard errors, I need to estimate the same regression model once with each version of the replicate weights. Since I need to estimate many different models and do not want to write a separate loop every time, I tried writing a function that takes the regression data, the regression formula, and the data with the replicate weights as inputs. The function works fine when the formula is specified explicitly inside the lapply() command in the function body rather than passed as an input (function trythis below), but as soon as I specify the regression formula as a function input (function trythis2 below), it breaks.
Here is a reproducible example:
library(tidyverse)
set.seed(123)
lm.dat <- data.frame(id = 1:500,
                     x1 = sample(1:100, replace = T, size = 500),
                     x2 = runif(n = 500, min = 0, max = 20)) %>%
  mutate(y = 0.2*x1 + 1.5*x2 + rnorm(n = 500, mean = 0, sd = 5))
repweights <- data.frame(id = 1:500)
set.seed(123)
for (i in 1:200) {
  repweights[, i+1] <- runif(n = 500, min = 0, max = 10)
  names(repweights)[i+1] <- paste0("hrwgt", i)
}
The two functions are defined as follows:
trythis <- function(data, weightsdata, weightsN){
  rep <- as.list(1:weightsN)
  res <- lapply(rep, function(x) lm(data = data, formula = y ~ x1 + x2, weights = weightsdata[, x]))
  return(res)
}
results1 <- trythis(data = lm.dat, weightsdata = repweights[-1], weightsN = 200)

trythis2 <- function(LMformula, data, weightsdata, weightsN){
  rep <- as.list(1:weightsN)
  res <- lapply(rep, function(x) lm(data = data, formula = LMformula, weights = weightsdata[, x]))
  return(res)
}
While the first function works, applying the second one results in an error:
trythis2(LMformula = y~x1+x2, data=lm.dat, weightsN=200, weightsdata = repweights[-1])
Error in eval(extras, data, env) : object 'weightsdata' not found
Formulas have an associated environment in which the variables they reference are looked up. In your case, the formula you pass in carries the environment of the calling frame, where weightsdata does not exist. To let lm() find the variables inside the function, reassign the formula's environment to the local frame so lookup happens there:
trythis3 <- function(LMformula, data, weightsdata, weightsN){
  rep <- as.list(1:weightsN)
  res <- lapply(rep, function(x) {
    environment(LMformula) <- sys.frames()[[length(sys.frames())]]
    lm(data = data, formula = LMformula, weights = weightsdata[, x])
  })
  return(res)
}
trythis3(LMformula = y ~ x1 + x2, data = lm.dat, weightsN = 200,
         weightsdata = repweights[-1])
Which results in
#> [[1]]
#>
#> Call:
#> lm(formula = LMformula, data = data, weights = weightsdata[, x])
#>
#> Coefficients:
#> (Intercept)           x1           x2
#>      1.2932       0.1874       1.4308
#>
#>
#> [[2]]
#>
#> Call:
#> lm(formula = LMformula, data = data, weights = weightsdata[, x])
#>
#> Coefficients:
#> (Intercept)           x1           x2
#>      1.2932       0.1874       1.4308
#>
#>
#> [[3]]
#>
#> Call:
#> lm(formula = LMformula, data = data, weights = weightsdata[, x])
#>
#> Coefficients:
#> (Intercept)           x1           x2
#>      1.2932       0.1874       1.4308
...etc
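A slightly simpler equivalent (a sketch of the same idea): environment() called inside the anonymous function returns the same innermost frame that sys.frames()[[length(sys.frames())]] picks out, and lexical scoping from that frame still reaches weightsdata in the enclosing function:
trythis4 <- function(LMformula, data, weightsdata, weightsN){
  lapply(seq_len(weightsN), function(x) {
    environment(LMformula) <- environment() # this frame sees x and, via its parent, weightsdata
    lm(data = data, formula = LMformula, weights = weightsdata[, x])
  })
}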

Custom function does not work properly unless the object is stored in the global environment in R

Context
I have a custom function myfun1 that fits a Cox model. Before fitting the model, I need to do a bit of processing on the data used to fit it. Specifically, I run two lines of code, dd = datadist(data) and options(datadist = 'dd').
If dd exists only in the environment inside the function, myfun1 reports an error. But when I write dd to the global environment, as myfun2 does, everything works fine.
Question
Why does this happen?
How can I get myfun1 to run properly while keeping dd inside the function?
Reproducible code
library(survival)
library(rms)
data(cancer)
myfun1 <- function(data, x){
  x = sym(x)
  dd = datadist(data)
  options(datadist = 'dd')
  fit = rlang::inject(cph(Surv(time, status) ~ rcs(!!x), data = data))
  fit
}
myfun1(data = lung, x = 'meal.cal')
# Error in Design(data, formula, specials = c("strat", "strata")) :
# dataset dd not found for options(datadist=)
myfun2 <- function(data, x){
  x = sym(x)
  dd <<- datadist(data) # Changed here compared to myfun1
  options(datadist = 'dd')
  fit = rlang::inject(cph(Surv(time, status) ~ rcs(!!x), data = data))
  fit
}
myfun2(data = lung, x = 'meal.cal')
# Frequencies of Missing Values Due to Each Variable
# Surv(time, status)           meal.cal
#                  0                 47
#
# Cox Proportional Hazards Model
#
# cph(formula = Surv(time, status) ~ rcs(meal.cal), data = data)
#
#
#                     Model Tests       Discrimination Indexes
# Obs      181     LR chi2      0.72    R2        0.004
# Events   134     d.f.            4    R2(4,181) 0.000
# Center -0.3714   Pr(> chi2) 0.9485    R2(4,134) 0.000
#                  Score chi2   0.76    Dxy       0.048
#                  Pr(> chi2) 0.9443
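The error message ("dataset dd not found for options(datadist=)") suggests that rms resolves the datadist option by name in an environment outside the function's frame. Under that assumption, one workaround, sketched here and untested, is to assign dd temporarily and clean up on exit:
myfun3 <- function(data, x){
  x = sym(x)
  assign("dd", datadist(data), envir = globalenv()) # where rms appears to look it up
  on.exit({ rm("dd", envir = globalenv()); options(datadist = NULL) }) # tidy up afterwards
  options(datadist = 'dd')
  fit = rlang::inject(cph(Surv(time, status) ~ rcs(!!x), data = data))
  fit
}
myfun3(data = lung, x = 'meal.cal')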

Non-linear fit with R

I am trying to obtain the first three coefficients of Cauchy's dispersion equation for silicon. Using a CSV containing the refractive index at various wavelengths (which you can find here), I try to fit the following model:
library(readr)
library(tidyverse)
library(magrittr)
library(modelr)
library(broom)
library(splines)
# CSV parsing
RefractiveIndexINFO <- read_csv("./silicon-index.csv")
# Cleaning the output of the csv-parsing
indlong = tibble(RefractiveIndexINFO$`Wavelength. µm`,RefractiveIndexINFO$n)
names(indlong) = c('w','n')
# Remove some wavelengths that might not fit
indlong_non_uv = indlong %>% filter(indlong$w >= 0.4)
# Renaming variables
w = indlong_non_uv$w
n = indlong_non_uv$n
# Creating the non-linear model
model = nls(n ~ a + b*ns(w,-2) + c*ns(w,-4), data = indlong_non_uv)
# Gathering information on the fitted model
cor(indlong_non_uv$n,predict(model))
tidy(model)
Which gives the following error :
Error in c * ns(w, -4) : non-numeric argument to binary operator
How can I circumvent this situation and get the three coefficients (a, b, c) in a row?
Obviously, using model = nls(n ~ a + b*ns(w,-2), data = indlong_non_uv) does not give an error.
Two things seem to be going on here (my reading of the error): ns() from splines is the natural-spline basis, not a power, and because no start values were supplied, nls() treats any name it can resolve as data, so c is picked up as the base function c(), and multiplying a function gives "non-numeric argument to binary operator". Write the powers as w^(-2) and w^(-4) and supply starting values. Try this:
library(readr)
library(tidyverse)
library(magrittr)
library(modelr)
library(broom)
library(splines)
# CSV parsing
RefractiveIndexINFO <- read_csv("aspnes.csv")
RefractiveIndexINFO <- RefractiveIndexINFO[1:46,]
RefractiveIndexINFO <- as.data.frame(apply(RefractiveIndexINFO,2,as.numeric))
names(RefractiveIndexINFO) <- c('w','n')
indlong_non_uv = RefractiveIndexINFO %>% filter(RefractiveIndexINFO$w >= 0.4)
# Creating the nonlinear model
model <- nls(n ~ a + b*w^(-2) + c*w^(-4), data = indlong_non_uv,
             start = list(a = 1, b = 1, c = 1))
# Gathering information on the fitted model
cor(indlong_non_uv$n, predict(model))
# [1] 0.9991006
tidy(model)
#   term    estimate   std.error statistic      p.value
# 1    a  3.65925186 0.039368851 92.947896 9.686805e-20
# 2    b -0.04981151 0.024099580 -2.066904 5.926046e-02
# 3    c  0.05282668 0.003306895 15.974707 6.334197e-10
Alternatively, since the model is linear in the parameters once w^(-2) and w^(-4) are treated as regressors, you can use ordinary linear regression:
model2 <- lm(n ~ I(w^(-2)) + I(w^(-4)), data = indlong_non_uv)
summary(model2)
# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)
# (Intercept)  3.659252   0.039369  92.948  < 2e-16 ***
# I(w^(-2))   -0.049812   0.024100  -2.067   0.0593 .
# I(w^(-4))    0.052827   0.003307  15.975 6.33e-10 ***
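Either way, since the goal was the three Cauchy coefficients (a, b, c) in a row, coef() pulls them out of either fitted object:
coef(model)  # nls fit: a, b, c
coef(model2) # lm fit: (Intercept), I(w^(-2)), I(w^(-4))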

Clustered standard errors with texreg?

I'm trying to reproduce this Stata example and move from stargazer to texreg. The data is available here.
To run the regression and get the se I run this code:
library(readstata13)
library(sandwich)
cluster_se <- function(model_result, data, cluster){
  model_variables <- intersect(colnames(data), c(colnames(model_result$model), cluster))
  model_rows <- as.integer(rownames(model_result$model))
  data <- data[model_rows, model_variables]
  cl <- data[[cluster]]
  M <- length(unique(cl))   # number of clusters
  N <- nrow(data)           # number of observations
  K <- model_result$rank    # number of estimated parameters
  dfc <- (M/(M-1))*((N-1)/(N-K))  # Stata-style finite-sample correction
  uj <- apply(estfun(model_result), 2, function(x) tapply(x, cl, sum))  # cluster sums of scores
  vcovCL <- dfc*sandwich(model_result, meat = crossprod(uj)/N)
  sqrt(diag(vcovCL))
}
elemapi2 <- read.dta13(file = 'elemapi2.dta')
lm1 <- lm(formula = api00 ~ acs_k3 + acs_46 + full + enroll, data = elemapi2)
se.lm1 <- cluster_se(model_result = lm1, data = elemapi2, cluster = "dnum")
stargazer::stargazer(lm1, type = "text", style = "aer", se = list(se.lm1))
==========================================================
                                 api00
----------------------------------------------------------
acs_k3                           6.954
                                (6.901)
acs_46                           5.966**
                                (2.531)
full                             4.668***
                                (0.703)
enroll                          -0.106**
                                (0.043)
Constant                        -5.200
                              (121.786)
Observations                      395
R2                               0.385
Adjusted R2                      0.379
Residual Std. Error        112.198 (df = 390)
F Statistic             61.006*** (df = 4; 390)
----------------------------------------------------------
Notes:           ***Significant at the 1 percent level.
                  **Significant at the 5 percent level.
                   *Significant at the 10 percent level.
texreg produces this:
texreg::screenreg(lm1, override.se = list(se.lm1))
========================
             Model 1
------------------------
(Intercept)   -5.20
             (121.79)
acs_k3         6.95
              (6.90)
acs_46         5.97 ***
              (2.53)
full           4.67 ***
              (0.70)
enroll        -0.11 ***
              (0.04)
------------------------
R^2            0.38
Adj. R^2       0.38
Num. obs.    395
RMSE         112.20
========================
How can I fix the p-values?
Robust Standard Errors with texreg are easy: just pass the coeftest directly!
This has become much easier since the question was last answered: you can now simply pass a coeftest object, computed with the desired variance-covariance matrix, directly to texreg. Downside: you lose the goodness-of-fit statistics (such as R^2 and the number of observations), but depending on your needs, this may not be a big problem.
How to include robust standard errors with texreg
> screenreg(list(reg1, coeftest(reg1, vcov = vcovHC(reg1, 'HC1'))),
            custom.model.names = c('Standard Standard Errors', 'Robust Standard Errors'))
=============================================================
             Standard Standard Errors  Robust Standard Errors
-------------------------------------------------------------
(Intercept)  -192.89 ***               -192.89 *
              (55.59)                   (75.38)
x               2.84 **                   2.84 **
               (0.96)                    (1.04)
-------------------------------------------------------------
R^2             0.08
Adj. R^2        0.07
Num. obs.     100
RMSE          275.88
=============================================================
*** p < 0.001, ** p < 0.01, * p < 0.05
To generate this example, I created a data frame with heteroskedasticity; the full runnable sample code is below:
require(sandwich)
require(texreg)
set.seed(1234)
df <- data.frame(x = 1:100)
df$y <- 1 + 0.5*df$x + 5*100:1*rnorm(100) # error sd shrinks as x grows
reg1 <- lm(y ~ x, data = df)
First, notice that your use of as.integer() is dangerous and likely to cause problems once you use data with non-numeric rownames. For instance, with the built-in dataset mtcars, whose rownames are car names, as.integer() coerces all rownames to NA and the function stops working.
To your actual question: you can provide custom p-values to texreg, which means you need to compute the p-values that go with the clustered standard errors. You could compute the variance-covariance matrix, derive the test statistics, and compute the p-values manually, or you can simply supply the variance-covariance matrix to e.g. coeftest() and extract the standard errors and p-values from there. Since I am unwilling to download any data, I use the mtcars data for the following:
library(sandwich)
library(lmtest)
library(texreg)

cluster_se <- function(model_result, data, cluster){
  model_variables <- intersect(colnames(data), c(colnames(model_result$model), cluster))
  model_rows <- rownames(model_result$model) # changed to be able to work with mtcars, not tested with other data
  data <- data[model_rows, model_variables]
  cl <- data[[cluster]]
  M <- length(unique(cl))
  N <- nrow(data)
  K <- model_result$rank
  dfc <- (M/(M-1))*((N-1)/(N-K))
  uj <- apply(estfun(model_result), 2, function(x) tapply(x, cl, sum))
  vcovCL <- dfc*sandwich(model_result, meat = crossprod(uj)/N)
}

lm1 <- lm(formula = mpg ~ cyl + disp, data = mtcars)
vcov.lm1 <- cluster_se(model_result = lm1, data = mtcars, cluster = "carb")
standard.errors <- coeftest(lm1, vcov. = vcov.lm1)[, 2]
p.values <- coeftest(lm1, vcov. = vcov.lm1)[, 4]
texreg::screenreg(lm1, override.se = standard.errors, override.p = p.values)
And just for completeness' sake, let's do it manually:
t.stats <- abs(coefficients(lm1) / sqrt(diag(vcov.lm1)))
t.stats
(Intercept)         cyl        disp
  38.681699    5.365107    3.745143
These are your t-statistics using the cluster-robust standard errors. The degrees of freedom are stored in lm1$df.residual, and using the built-in functions for the t-distribution (see e.g. ?pt), we get:
manual.p <- 2*pt(-t.stats, df = lm1$df.residual)
manual.p
 (Intercept)          cyl         disp
1.648628e-26 9.197470e-06 7.954759e-04
Here, pt() is the distribution function, and we want the probability of observing a statistic at least as extreme as the one we observed. Since we are testing two-sided and the density is symmetric, we take the left tail using the negative value and then double it. This is identical to using 2*(1-pt(t.stats, df=lm1$df.residual)). Now, just to check that this yields the same result as before:
all.equal(p.values, manual.p)
[1] TRUE
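Newer versions of the sandwich package (2.4 and later) also ship a built-in vcovCL(); as far as I know, type = "HC1" with the default cluster adjustment matches the Stata-style correction used above, though ?vcovCL is worth checking before relying on it. A sketch:
vcov.cl <- sandwich::vcovCL(lm1, cluster = ~ carb, type = "HC1") # built-in clustered vcov
texreg::screenreg(lm1,
                  override.se = coeftest(lm1, vcov. = vcov.cl)[, 2],
                  override.p  = coeftest(lm1, vcov. = vcov.cl)[, 4])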

Custom Bootstrapped Standard Error: numeric 'envir' arg not of length one

I am writing a custom script to bootstrap standard errors in a GLM in R and receive the following error:
Error in eval(predvars, data, env) : numeric 'envir' arg not of length one
Can someone explain what I am doing wrong? My code:
#Number of simulations
sims <- numbersimsdesired
#Set up place to store data
saved.se <- matrix(NA, sims, numberofcolumnsdesired)
y <- matrix(NA, realdata.rownumber)
x1 <- matrix(NA, realdata.rownumber)
x2 <- matrix(NA, realdata.rownumber)
#Resample entire dataset with replacement
for (sim in 1:sims) {
  fake.data <- sample(1:nrow(data5), nrow(data5), replace = TRUE)
  #Define variables for GLM using fake data
  y <- realdata$y[fake.data]
  x1 <- realdata$x1[fake.data]
  x2 <- realdata$x2[fake.data]
  #Run GLM on fake data, extract SEs, save SE into matrix
  glm.output <- glm(y ~ x1 + x2, family = "poisson", data = fake.data)
  saved.se[sim,] <- summary(glm.output)$coefficients[0,2]
}
An example: if we suppose sims = 1000 and we want 10 columns (say, x1 through x10 instead of x1 and x2), the goal is a dataset with 1,000 rows and 10 columns containing each explanatory variable's SE.
The immediate problem is data = fake.data: fake.data is a numeric vector of row indices, not a data frame, so glm() ends up treating it as an evaluation environment, hence "numeric 'envir' arg not of length one". (Also, coefficients[0,2] indexes row 0, which selects nothing.) But there isn't a reason to reinvent the wheel. Here is an example of bootstrapping the standard error of the intercept with the boot package:
set.seed(42)
counts <- c(18,17,15,20,10,20,25,13,12)
x1 <- 1:9
x2 <- sample(9)
DF <- data.frame(counts, x1, x2)

glm1 <- glm(counts ~ x1 + x2, family = poisson(), data = DF)
summary(glm1)$coef
#               Estimate Std. Error  z value     Pr(>|z|)
# (Intercept) 2.08416378 0.42561333 4.896848 9.738611e-07
# x1          0.04838210 0.04370521 1.107010 2.682897e-01
# x2          0.09418791 0.04446747 2.118131 3.416400e-02

library(boot)
intercept.se <- function(d, i) {
  glm1.b <- glm(counts ~ x1 + x2, family = poisson(), data = d[i,])
  summary(glm1.b)$coef[1,2]
}
set.seed(42)
boot.intercept.se <- boot(DF, intercept.se, R = 999)
#ORDINARY NONPARAMETRIC BOOTSTRAP
#
#
#Call:
#boot(data = DF, statistic = intercept.se, R = 999)
#
#
#Bootstrap Statistics :
# original bias std. error
#t1* 0.4256133 0.103114 0.2994377
Edit:
If you prefer doing it without a package:
n <- 999
set.seed(42)
ind <- matrix(sample(nrow(DF), nrow(DF)*n, replace = TRUE), nrow = n)
boot.values <- apply(ind, 1, function(...) {
  i <- c(...)
  intercept.se(DF, i)
})
sd(boot.values)
#[1] 0.2994377
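Since the stated goal was a matrix of SEs for every explanatory variable rather than just the intercept, a small change to the statistic returns the whole row of standard errors per replicate; boot handles vector-valued statistics (a sketch along the same lines):
all.se <- function(d, i) {
  fit <- glm(counts ~ x1 + x2, family = poisson(), data = d[i,])
  summary(fit)$coef[, 2] # SEs of all coefficients
}
set.seed(42)
boot.all <- boot(DF, all.se, R = 999)
head(boot.all$t) # 999 x 3 matrix: one column per coefficient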
