R for loop over randomForest - r

I have an R dataframe with 9 input variables and 1 output variable. I want to find the accuracy of randomForest using each individual input, and add them to a list. To do this, I need to loop over a list of formulas, as in the code below:
library(randomForest)
library(caret)
formulas = c(target ~ age, target ~ sex, target ~ cp,
target ~ trestbps, target ~ chol, target ~ fbs,
target ~ restecg, target ~ ca, target ~ thal)
test_idx = sample(dim(df)[1], 60)
test_data = df[test_idx, ]
train_data = df[-test_idx, ]
accuracies = rep(NA, 9)
for (i in 1:length(formulas)){
rf_model = randomForest(formulas[i], data=train_data)
prediction = predict(rf_model, newdata=test_data, type="response")
acc = confusionMatrix(test_data$target, prediction)$overall[1]
accuracies[i] = acc
}
I run into an error,
Error in if (n==0) stop("data (x) has 0 rows") : argument is of
length zero calls: ... eval -> eval -> randomForest -> randomForest.default
Execution halted
The error is related to the formulas[i] argument passed to randomForest, when I type the formula name as the argument (for example, rf_model = randomForest(target ~ age, data=train_data), there is no error.
Is there any other way to iterate over randomForest?
Thank you!

As you have not provided any data, I am using the iris dataset. You have to make 2 changes in your code to make it run. First, use list to store the formulas, and second, formulas[[i]] within for loop. You can use the following code
library(randomForest)
library(caret)
df <- iris
formulas = list(Species ~ Sepal.Length, Species ~ Petal.Length, Species ~ Petal.Width,
Species ~ Sepal.Width)
test_idx = sample(dim(df)[1], 60)
test_data = df[test_idx, ]
train_data = df[-test_idx, ]
accuracies = rep(NA, 4)
for (i in 1:length(formulas)){
rf_model = randomForest(formulas[[i]], data=train_data)
prediction = predict(rf_model, newdata=test_data, type="response")
acc = confusionMatrix(test_data$Species, prediction)$overall[1]
accuracies[i] = acc
}
#> 0.7000000 0.9166667 0.9166667 0.5000000

Related

R: Predicting with lmer, y ~ . formula error

Predicting values in new data from an lmer model throws an error when a period is used to represent predictors. Is there any way around this?
The answer to this similar question offers a way to automatically write out the full formula instead of using the period, but I'm curious if there's a way to get predictions from new data just using the period.
Here's a reproducible example:
mydata <- data.frame(
groups = rep(1:3, each = 100),
x = rnorm(300),
dv = rnorm(300)
)
train_subset <- sample(1:300, 300 * .8)
train <- mydata[train_subset,]
test <- mydata[-train_subset,]
# Returns an error
mod <- lmer(dv ~ . - groups + (1 | groups), data = train)
predict(mod, newdata = test)
predict(mod) # getting predictions for the original data works
# Writing the full formula without the period does not return an error, even though it's the exact same model
mod <- lmer(dv ~ x + (1 | groups), data = train)
predict(mod, newdata = test)
This should be fixed in the development branch of lme4 now. You can install from GitHub (see first line below) or wait a few weeks (early April-ish) for a new version to hit CRAN.
remotes::install_github("lme4/lme4") ## you will need compilers etc.
mydata <- data.frame(
groups = rep(1:3, each = 100),
x = rnorm(300),
dv = rnorm(300)
)
train_subset <- sample(1:300, 300 * .8)
train <- mydata[train_subset,]
test <- mydata[-train_subset,]
# Returns an error
mod <- lmer(dv ~ . - groups + (1 | groups), data = train)
p1 <- predict(mod, newdata = test)
mod2 <- lmer(dv ~ x + (1 | groups), data = train)
p2 <- predict(mod2, newdata = test)
identical(p1, p2) ## TRUE

Can't seem to figure out how to use lapply with gls() [duplicate]

I am trying to use lme function from nlme package inside a lapply loop. This works for lmer function from lme4 package, but produces an error message for lme. How can I loop lme functions similarly to the lmer function in the example below?
library("nlme")
library("lme4")
set.seed(1)
dt <- data.frame(Resp1 = rnorm(100, 50, 23), Resp2 = rnorm(100, 80, 15), Pred = rnorm(100,10,2), group = factor(rep(LETTERS[1:10], each = 10)))
## Syntax:
lmer(Resp1 ~ Pred + (1 |group), data = dt)
lme(Resp1 ~ Pred, random = ~1 | group, data = dt)
## Works for lme4
lapply(c("Resp1", "Resp2"), function(k) {
lmer(substitute(j ~ Pred + (1 | group), list(j = as.name(k))), data = dt)})
## Does not work for nlme
lapply(c("Resp1", "Resp2"), function(k) {
lme(substitute(j ~ Pred, list(j = as.name(k))), random = ~1 | group, data = dt)})
# Error in UseMethod("lme") :
# no applicable method for 'lme' applied to an object of class "call"
PS. I am aware that this solution exists, but I would like to use a method substituting response variable directly in the model function instead of subsetting data using an additional function.
Instead of fiddling around with substitute and eval you also could do the following:
lapply(c("Resp1", "Resp2"), function(r) {
f <- formula(paste(r, "Pred", sep = "~"))
m <- lme(fixed = f, random = ~ 1 | group, data = dt)
m$call$fixed <- f
m})
You could use the same trick if you want to provide different data sets to a modelling function:
makeModel <- function(dat) {
l <- lme(Resp1 ~ Pred, random = ~ 1 | group, data = dat)
l$call$data <- as.symbol(deparse(substitute(dat)))
l
}
I use this snippet quite a bit, when I want to generate a model from within a function and want to update it afterwards.
As #CarlWitthoft suggested, adding eval into the function will solve the issue:
lapply(c("Resp1", "Resp2"), function(k) {
lme(eval(substitute(j ~ Pred, list(j = as.name(k)))), random = ~1 | group, data = dt)})
Also see #thothal's alternative.

Removing completely separated observations from glm()

I'm doing a bit of exploratory data analysis using HMDA data from the AER package; however, the variables that I used to fit the model seem to contain some observations that perfectly determine the outcomes, an issue known as "separation." So I tried to remedy this using the solution recommended by this thread, yet when I tried to execute the first set of source code from glm.fit(), R returned an error message:
Error in family$family : object of type 'closure' is not subsettable
so I could not proceed any further to remove those fully determined observations from my data with this code. I am wondering if anyone could help me fix this?
My current code is provided at below for your reference.
# load the AER package and HMDA data
library(AER)
data(HMDA)
# fit a 2-degree olynomial probit model
probit.fit <- glm(deny ~ poly(hirat, 2), family = binomial, data = HMDA)
# using the revised source code from that stackexchage thread to find out observations that received a warning message
library(tidyverse)
library(dplyr)
library(broom)
eps <- 10 * .Machine$double.eps
if (family$family == "binomial") {
if (any(mu > 1 - eps) || any(mu < eps))
warning("glm.fit: fitted probabilities numerically 0 or 1 occurred",
call. = FALSE)
}
# this return the following error message
# Error in family$family : object of type 'closure' is not subsettable
probit.resids <- augment(probit.fit) %>%
mutate(p = 1 / (1 + exp(-.fitted)),
warning = p > 1-eps)
arrange(probit.resids, desc(.fitted)) %>%
select(2:5, p, warning) %>%
slice(1:10)
HMDA.nwarning <- filter(HMDA, !probit.resids$warning)
# using HMDA.nwarning should solve the problem...
probit.fit <- glm(deny ~ poly(hirat, 2), family = binomial, data = HMDA.nwarning)
This chunk of code
if (family$family == "binomial") {
if (any(mu > 1 - eps) || any(mu < eps))
warning("glm.fit: fitted probabilities numerically 0 or 1 occurred",
call. = FALSE)
}
there is a function, binomial() called when you run glm with family == "binomial". If you look under glm (just type glm):
if (is.character(family))
family <- get(family, mode = "function", envir = parent.frame())
if (is.function(family))
family <- family()
if (is.null(family$family)) {
print(family)
stop("'family' not recognized")
}
And the glm function checks binomial()$family during the fit, and if any of the predicted values differ from 1 or 0 by eps, it raises that warning.
You don't need to run that part, and yes, you need to set eps <- 10 * .Machine$double.eps . So let's run the code below, and if you run a probit, you need to specify link="probit" in binomial, otherwise the default is logit:
library(AER)
library(tidyverse)
library(dplyr)
library(broom)
data(HMDA)
probit.fit <- glm(deny ~ poly(hirat, 2), family = binomial(link="probit"), data = HMDA)
eps <- 10 * .Machine$double.eps
probit.resids <- augment(probit.fit) %>%
mutate(p = 1 / (1 + exp(-.fitted)),
warning = p > 1-eps)
The column warning indicates if the observations raises a warning, in this dataset, there's one:
table(probit.resids$warning)
FALSE TRUE
2379 1
We can use the next step to filter it
HMDA.nwarning <- filter(HMDA, !probit.resids$warning)
dim(HMDA.nwarning)
[1] 2379 14
And rerun the regression:
probit.fit <- glm(deny ~ poly(hirat, 2), family = binomial(link="probit"), data = HMDA.nwarning)
coefficients(probit.fit)
(Intercept) poly(hirat, 2)1 poly(hirat, 2)2
-1.191292 8.708494 6.884404

R: Clustered robust standard errors using miceadds lm.cluster - error with subset and weights

I am trying to use the lm.cluster function in the package miceadds to get robust clustered standard errors for a multiply imputed dataset.
I am able to get the standard version of it to run but I get the following error when I try to add a subset or weights:
Error in eval(substitute(subset), data, env) :
..1 used in an incorrect context, no ... to look in
Example that works without subset or weights:
require("mice")
require("miceadds")
data(data.ma01)
# imputation of the dataset: use six imputations
dat <- data.ma01[ , - c(1:2) ]
imp <- mice::mice( dat , maxit=3 , m=6 )
datlist <- miceadds::mids2datlist( imp )
# linear regression with cluster robust standard errors
mod <- lapply(datlist, FUN = function(data){miceadds::lm.cluster( data=data ,
formula=read ~ paredu+ female , cluster = data.ma01$idschool )} )
# extract parameters and covariance matrix
betas <- lapply( mod , FUN = function(rr){ coef(rr) } )
vars <- lapply( mod , FUN = function(rr){ vcov(rr) } )
# conduct statistical inference
summary(pool_mi( qhat = betas, u = vars ))
Example that breaks with subset:
mod <- lapply(datlist, FUN = function(data){miceadds::lm.cluster( data=data ,
formula=read ~ paredu+ female , cluster = data.ma01$idschool, subset=
(data.ma01$urban==1))} )
Error during wrapup: ..1 used in an incorrect context, no ... to look in
Example that breaks with weights:
mod <- lapply(datlist, FUN = function(data){miceadds::lm.cluster( data=data ,
formula=read ~ paredu+ female , cluster = data.ma01$idschool,
weights=data.ma01$studwgt)} )
Error during wrapup: ..1 used in an incorrect context, no ... to look in
From searching, I think I am encountering similar issues as others when passing these commands through an lm or glm wrapper (such as: Passing Argument to lm in R within Function or R : Pass argument to glm inside an R function or Passing the weights argument to a regression function inside an R function)
However, I am not sure how to address the issue with the imputed datasets & existing lm.cluster command.
Thanks
This works fine with the estimatr package which is on CRAN and the estimatr::lm_robust() function. Two notes: (1) you can change the type of standard errors using se_type = and (2) I keep idschool in the data because we like the clusters to be in the same data.frame as we fit the model on.
library(mice)
library(miceadds)
library(estimatr)
# imputation of the dataset: use six imputations
data(data.ma01)
dat <- data.ma01[, -c(1)] # note I keep idschool in data
imp <- mice::mice( dat , maxit = 3, m = 6)
datlist <- miceadds::mids2datlist(imp)
# linear regression with cluster robust standard errors
mod <- lapply(
datlist,
function (dat) {
estimatr::lm_robust(read ~ paredu + female, dat, clusters = idschool)
}
)
# subset
mod <- lapply(
datlist,
function (dat) {
estimatr::lm_robust(read ~ paredu + female, dat, clusters = idschool, subset = urban == 1)
}
)
# weights
mod <- lapply(
datlist,
function (dat) {
estimatr::lm_robust(read ~ paredu + female, dat, clusters = idschool, weights = studwgt)
}
)
# note that you can use the `se_type` argument of lm_robust()
# to change the vcov estimation
# extract parameters and covariance matrix
betas <- lapply(mod, coef)
vars <- lapply(mod, vcov)
# conduct statistical inference
summary(pool_mi( qhat = betas, u = vars ))
I'm no expert, but there is an issue with the passing of the weights to lm(). I know this is not an ideal situation, but I managed to get it to work by modifying the lm.cluster() function to hard code the weights pass and then just used my own.
lm.cluster <- function (data, formula, cluster, wgts=NULL, ...)
{
TAM::require_namespace_msg("multiwayvcov")
if(is.null(wgts)) {
mod <- stats::lm(data = data, formula = formula)
} else {
data$.weights <- wgts
mod <- stats::lm(data = data, formula = formula, weights=data$.weights)
}
if (length(cluster) > 1) {
v1 <- cluster
}
else {
v1 <- data[, cluster]
}
dfr <- data.frame(cluster = v1)
vcov2 <- multiwayvcov::cluster.vcov(model = mod, cluster = dfr)
res <- list(lm_res = mod, vcov = vcov2)
class(res) <- "lm.cluster"
return(res)
}

How to use substitute() to loop lme functions from nlme package?

I am trying to use lme function from nlme package inside a lapply loop. This works for lmer function from lme4 package, but produces an error message for lme. How can I loop lme functions similarly to the lmer function in the example below?
library("nlme")
library("lme4")
set.seed(1)
dt <- data.frame(Resp1 = rnorm(100, 50, 23), Resp2 = rnorm(100, 80, 15), Pred = rnorm(100,10,2), group = factor(rep(LETTERS[1:10], each = 10)))
## Syntax:
lmer(Resp1 ~ Pred + (1 |group), data = dt)
lme(Resp1 ~ Pred, random = ~1 | group, data = dt)
## Works for lme4
lapply(c("Resp1", "Resp2"), function(k) {
lmer(substitute(j ~ Pred + (1 | group), list(j = as.name(k))), data = dt)})
## Does not work for nlme
lapply(c("Resp1", "Resp2"), function(k) {
lme(substitute(j ~ Pred, list(j = as.name(k))), random = ~1 | group, data = dt)})
# Error in UseMethod("lme") :
# no applicable method for 'lme' applied to an object of class "call"
PS. I am aware that this solution exists, but I would like to use a method substituting response variable directly in the model function instead of subsetting data using an additional function.
Instead of fiddling around with substitute and eval you also could do the following:
lapply(c("Resp1", "Resp2"), function(r) {
f <- formula(paste(r, "Pred", sep = "~"))
m <- lme(fixed = f, random = ~ 1 | group, data = dt)
m$call$fixed <- f
m})
You could use the same trick if you want to provide different data sets to a modelling function:
makeModel <- function(dat) {
l <- lme(Resp1 ~ Pred, random = ~ 1 | group, data = dat)
l$call$data <- as.symbol(deparse(substitute(dat)))
l
}
I use this snippet quite a bit, when I want to generate a model from within a function and want to update it afterwards.
As #CarlWitthoft suggested, adding eval into the function will solve the issue:
lapply(c("Resp1", "Resp2"), function(k) {
lme(eval(substitute(j ~ Pred, list(j = as.name(k)))), random = ~1 | group, data = dt)})
Also see #thothal's alternative.

Resources