Removing completely separated observations from glm() - r

I'm doing a bit of exploratory data analysis using HMDA data from the AER package; however, the variables that I used to fit the model seem to contain some observations that perfectly determine the outcomes, an issue known as "separation." So I tried to remedy this using the solution recommended by this thread, yet when I tried to execute the first set of source code from glm.fit(), R returned an error message:
Error in family$family : object of type 'closure' is not subsettable
so I could not proceed any further to remove those fully determined observations from my data with this code. I am wondering if anyone could help me fix this?
My current code is provided below for reference.
# load the AER package and HMDA data
library(AER)
data(HMDA)
# fit a degree-2 polynomial probit model
probit.fit <- glm(deny ~ poly(hirat, 2), family = binomial, data = HMDA)
# use the revised source code from that Stack Exchange thread to find the observations that triggered a warning
library(tidyverse)
library(dplyr)
library(broom)
eps <- 10 * .Machine$double.eps
if (family$family == "binomial") {
if (any(mu > 1 - eps) || any(mu < eps))
warning("glm.fit: fitted probabilities numerically 0 or 1 occurred",
call. = FALSE)
}
# this returns the following error message
# Error in family$family : object of type 'closure' is not subsettable
probit.resids <- augment(probit.fit) %>%
mutate(p = 1 / (1 + exp(-.fitted)),
warning = p > 1-eps)
arrange(probit.resids, desc(.fitted)) %>%
select(2:5, p, warning) %>%
slice(1:10)
HMDA.nwarning <- filter(HMDA, !probit.resids$warning)
# using HMDA.nwarning should solve the problem...
probit.fit <- glm(deny ~ poly(hirat, 2), family = binomial, data = HMDA.nwarning)

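This chunk of code comes from inside glm.fit(), where family is a family object. At the top level, family is the function stats::family(), so family$family fails with the 'closure' error: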
if (family$family == "binomial") {
if (any(mu > 1 - eps) || any(mu < eps))
warning("glm.fit: fitted probabilities numerically 0 or 1 occurred",
call. = FALSE)
}
Inside glm(), family is a local object. When you call glm with family = binomial, the function binomial() is called to build it. If you look at the source of glm (just type glm):
if (is.character(family))
family <- get(family, mode = "function", envir = parent.frame())
if (is.function(family))
family <- family()
if (is.null(family$family)) {
print(family)
stop("'family' not recognized")
}
The fitting code then checks binomial()$family during the fit, and if any of the fitted probabilities is within eps of 0 or 1, it raises that warning.
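As a minimal sketch of why the chunk fails when pasted at the top level, the base R lines below first show that family is a function there, and then mimic the resolution steps glm() performs. The names top_level and fam are hypothetical and used only for illustration.

```r
# At the top level, `family` resolves to the generic function stats::family()
is.function(family)

# Subsetting a function with `$` fails; capture the message instead of stopping
top_level <- tryCatch(family$family, error = function(e) conditionMessage(e))
top_level

# Mimic what glm() does: turn a name or a function into a family object
fam <- "binomial"
if (is.character(fam)) fam <- get(fam, mode = "function")
if (is.function(fam)) fam <- fam()

fam$family  # the family name stored in the object
fam$link    # the default link for binomial()
```

Once fam is a family object, fam$family is an ordinary list lookup, which is what the check inside glm.fit() relies on.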
You don't need to run that part, but you do need to set eps <- 10 * .Machine$double.eps. Also, if you want a probit, you need to specify link="probit" in binomial; otherwise the default is logit. So let's run the code below:
library(AER)
library(tidyverse)
library(dplyr)
library(broom)
data(HMDA)
probit.fit <- glm(deny ~ poly(hirat, 2), family = binomial(link="probit"), data = HMDA)
eps <- 10 * .Machine$double.eps
probit.resids <- augment(probit.fit) %>%
mutate(p = 1 / (1 + exp(-.fitted)),
warning = p > 1-eps)
The column warning indicates whether an observation raises the warning. In this dataset, there is one such observation:
table(probit.resids$warning)
FALSE TRUE
2379 1
We can then filter it out:
HMDA.nwarning <- filter(HMDA, !probit.resids$warning)
dim(HMDA.nwarning)
[1] 2379 14
And rerun the regression:
probit.fit <- glm(deny ~ poly(hirat, 2), family = binomial(link="probit"), data = HMDA.nwarning)
coefficients(probit.fit)
(Intercept) poly(hirat, 2)1 poly(hirat, 2)2
-1.191292 8.708494 6.884404
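One caveat on the p column computed above: 1 / (1 + exp(-x)) is the inverse of the logit link, whereas the inverse of the probit link is pnorm(x). The sketch below uses only base R and made-up linear-predictor values (eta is a hypothetical name) to show the difference. For a fitted model, fitted(probit.fit) should return response-scale probabilities under either link, so it may be a safer way to build p.

```r
# Hypothetical linear-predictor values, for illustration only
eta <- c(-2, 0, 2)

# Inverse logit: identical to the hand-written 1 / (1 + exp(-eta))
p_logit <- plogis(eta)
all.equal(p_logit, 1 / (1 + exp(-eta)))

# Inverse probit: the standard normal CDF
p_probit <- pnorm(eta)

# The family object carries the matching inverse link
p_from_family <- binomial(link = "probit")$linkinv(eta)
all.equal(p_from_family, p_probit)

# The two links agree at eta = 0 but differ elsewhere
cbind(eta, p_logit, p_probit)
```

Because the two inverse links differ in the tails, the set of observations flagged by p > 1 - eps can depend on which one is used.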

Related

R for loop over randomForest

I have an R dataframe with 9 input variables and 1 output variable. I want to find the accuracy of randomForest using each individual input, and add them to a list. To do this, I need to loop over a list of formulas, as in the code below:
library(randomForest)
library(caret)
formulas = c(target ~ age, target ~ sex, target ~ cp,
target ~ trestbps, target ~ chol, target ~ fbs,
target ~ restecg, target ~ ca, target ~ thal)
test_idx = sample(dim(df)[1], 60)
test_data = df[test_idx, ]
train_data = df[-test_idx, ]
accuracies = rep(NA, 9)
for (i in 1:length(formulas)){
rf_model = randomForest(formulas[i], data=train_data)
prediction = predict(rf_model, newdata=test_data, type="response")
acc = confusionMatrix(test_data$target, prediction)$overall[1]
accuracies[i] = acc
}
I run into an error,
Error in if (n==0) stop("data (x) has 0 rows") : argument is of
length zero calls: ... eval -> eval -> randomForest -> randomForest.default
Execution halted
The error is related to the formulas[i] argument passed to randomForest. When I type the formula itself as the argument (for example, rf_model = randomForest(target ~ age, data=train_data)), there is no error.
Is there any other way to iterate over randomForest?
Thank you!
As you have not provided any data, I am using the iris dataset. You have to make 2 changes in your code to make it run. First, use list to store the formulas, and second, index with formulas[[i]] inside the for loop, because formulas[i] returns a one-element list rather than a formula. You can use the following code
library(randomForest)
library(caret)
df <- iris
formulas = list(Species ~ Sepal.Length, Species ~ Petal.Length, Species ~ Petal.Width,
Species ~ Sepal.Width)
test_idx = sample(dim(df)[1], 60)
test_data = df[test_idx, ]
train_data = df[-test_idx, ]
accuracies = rep(NA, 4)
for (i in 1:length(formulas)){
rf_model = randomForest(formulas[[i]], data=train_data)
prediction = predict(rf_model, newdata=test_data, type="response")
acc = confusionMatrix(test_data$Species, prediction)$overall[1]
accuracies[i] = acc
}
#> 0.7000000 0.9166667 0.9166667 0.5000000
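To see why the indexing matters, here is a small base R sketch (no randomForest needed) that inspects what single and double brackets return for a collection of formulas. It also shows one possible way to build the formulas from predictor names with reformulate(); the names predictors and formulas2 are hypothetical.

```r
# Combining formulas, with c() or list(), yields a list
formulas <- c(Species ~ Sepal.Length, Species ~ Petal.Length)
class(formulas)

# Single brackets keep the list wrapper; double brackets extract the formula
class(formulas[1])
class(formulas[[1]])

# Optional: generate one formula per predictor instead of typing them out
predictors <- c("Sepal.Length", "Petal.Length")
formulas2 <- lapply(predictors, function(p) reformulate(p, response = "Species"))
formulas2[[2]]
```

A modelling function that expects a formula then receives one when called with formulas[[i]] or formulas2[[i]].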

capturing convergence message from lme4 package in R

I was wondering if there is a way to write a logical test (TRUE/FALSE) to show whether a model from lme4 package has converged or not?
An example is shown below. I want to detect whether a model comes with the convergence warning message (i.e., Model failed to converge).
library(lme4)
dat <- read.csv('https://raw.githubusercontent.com/rnorouzian/e/master/nc.csv')
m <- lmer(math ~ ses*sector + (ses | sch.id), data = dat)
Warning message:
In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model failed to converge with max|grad| = 0.00279 (tol = 0.002, component 1)
> sm = summary(m)
> sm$optinfo$conv$lme4$messages
[1] "Model failed to converge with max|grad| = 0.0120186 (tol = 0.002, component 1)"
>
We can wrap the call in withCallingHandlers (inside tryCatch), taking inspiration from this post.
dat <- read.csv('https://raw.githubusercontent.com/rnorouzian/e/master/nc.csv')
m <- tryCatch({
  withCallingHandlers({
    error <- FALSE
    list(model = lmer(math ~ ses*sector + (ses | sch.id), data = dat),
         error = error)
  }, warning = function(w) {
    if (grepl('failed to converge', w$message)) error <<- TRUE
  })
})
m$model
#Linear mixed model fit by REML ['lmerMod']
#Formula: math ~ ses * sector + (ses | sch.id)
# Data: dat
#REML criterion at convergence: 37509.07
#Random effects:
# Groups Name Std.Dev. Corr
# sch.id (Intercept) 1.9053
# ses 0.8577 0.46
# Residual 3.1930
#Number of obs: 7185, groups: sch.id, 160
#Fixed Effects:
#(Intercept) ses sector ses:sector
# 11.902 2.399 1.677 -1.322
#convergence code 0; 0 optimizer warnings; 1 lme4 warnings
m$error
#[1] TRUE
The output m is a list with model and error elements.
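The same pattern can be wrapped in a small reusable helper. The sketch below is base R only and uses a toy function in place of lmer so that it runs without lme4; catch_warning and noisy_fit are hypothetical names, and muffling the matched warning is a design choice rather than something the code above does.

```r
# Evaluate `expr`, record whether a warning matching `pattern` was raised,
# and return both the value and the flag
catch_warning <- function(expr, pattern) {
  flagged <- FALSE
  value <- withCallingHandlers(
    expr,
    warning = function(w) {
      if (grepl(pattern, conditionMessage(w))) {
        flagged <<- TRUE
        invokeRestart("muffleWarning")  # suppress only the matched warning
      }
    }
  )
  list(value = value, flagged = flagged)
}

# Toy stand-in for a model fit that warns but still returns a result
noisy_fit <- function() {
  warning("Model failed to converge (toy example)")
  42
}

res_warn <- catch_warning(noisy_fit(), "failed to converge")
res_ok   <- catch_warning(sqrt(4), "failed to converge")
res_warn$flagged
res_ok$flagged
```

With lme4 loaded, the first argument would be the lmer(...) call itself, and value would hold the fitted model.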
If we need to test for the warning after the model has been created, we can use:
is_warning_generated <- function(m) {
df <- summary(m)
!is.null(df$optinfo$conv$lme4$messages) &&
grepl('failed to converge', df$optinfo$conv$lme4$messages)
}
m <- lmer(math ~ ses*sector + (ses | sch.id), data = dat)
is_warning_generated(m)
#[1] TRUE
We can also use safely from purrr. It captures any error and returns it as a list element. If there is no error, that element will be NULL.
library(purrr)
safelmer <- safely(lmer, otherwise = NA)
out <- safelmer(math ~ ses*sector + (ses | sch.id), data = dat)
I'm just going to say that @RonakShah's is_warning_generated could be made slightly more compact:
function(m) {
w <- m@optinfo$conv$lme4$messages
!is.null(w) && grepl('failed to converge', w)
}
I applied Ronak's solution to my own simulation data and found a problem.
The message may be a vector with multiple entries, so grepl() also returns multiple entries. However, the && operator only looks at the first entry, so further occurrences of 'failed to converge' go unnoticed. To avoid this behavior, I changed && to &.
A new problem then occurred when there was no message at all. In this case the !is.null() part correctly becomes FALSE (i.e., no warning generated), but the grepl() part becomes logical(0), so the function value becomes FALSE & logical(0), which is logical(0). By contrast, FALSE && logical(0) is FALSE (correct).
A solution that worked for me is
if(is.null(mess)) FALSE else grepl('failed to converge', mess)
which, when the model fails to converge, returns a vector with TRUE at the position of the warning. This vector can then be reduced, for example with sum(), which is then greater than 0, or with any(), which is then TRUE.
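A compact way to cover all three cases (no message, one message, several messages) is to reduce the grepl() result with any(). The sketch below is base R and operates on the message vector directly; has_convergence_warning is a hypothetical name, and the message vectors are made up for illustration. Note also that in R 4.3.0 and later, && with an operand longer than one is an error rather than silently using the first element.

```r
# TRUE if any entry of `mess` mentions a convergence failure;
# works for NULL because grepl() on NULL gives logical(0) and any(logical(0)) is FALSE
has_convergence_warning <- function(mess) {
  any(grepl("failed to converge", mess, fixed = TRUE))
}

# Made-up message vectors for illustration
none    <- NULL
single  <- "Model failed to converge with max|grad| = 0.0120186 (tol = 0.002, component 1)"
several <- c("boundary (singular) fit: see help('isSingular')",
             "Model failed to converge with max|grad| = 0.0120186 (tol = 0.002, component 1)")

has_convergence_warning(none)
has_convergence_warning(single)
has_convergence_warning(several)
```

For a fitted lme4 model m, the argument would be m@optinfo$conv$lme4$messages.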

Error in subset.default(data, Parameter == parm) : object 'Parameter' not found while running example in cenGAM package

I am running the example of Tobit I regression using GAM model with the cenGAM package.
However, running the example provided in the manual of the package gave me the following error:
Error in subset.default(data, Parameter == parm) :
object 'Parameter' not found
Here is the code I ran:
library(cenGAM)
# Generate random data
set.seed(1)
x <- matrix(2*rnorm(300), 100)
yn <- 2*x[,3] + 4*cos(x[,1]*2)
y <- yn + rnorm(100)
ycensored <- pmax(y, 0) # data left-censored at 0
ycensored <- pmin(ycensored, 4) # data right-censored at 4
par(mfrow = c(3,3))
# True model
plot(gam(y ~ s(x[,1]) + s(x[,2]) + s(x[, 3])), ylim=c(-5, 5), main = "True")
# Naive estimation
plot(gam(ycensored ~ s(x[,1]) + s(x[,2]) + s(x[, 3])), ylim=c(-5, 5), main = "Naive")
# Tobit I estimation
m <- gam(ycensored ~ s(x[,1]) + s(x[,2]) + s(x[, 3]), family = tobit1(left.threshold=0))
Everything worked fine until I ran the last line of the code to fit the Tobit 1 GAM model.
Does anyone know how to fix it?
I am using R 3.4.1

Function lipsitz.test {generalhoslem} is not working for object clm{ordinal}

I am now trying to test the goodness of fit of an ordinal model using lipsitz.test {generalhoslem}. According to the documentation, the function can deal with both polr and clm. However, when I try to use clm in the lipsitz.test function, an error occurs. Here is an example
library("ordinal")
library(generalhoslem)
data("wine")
fm1 <- clm(rating ~ temp * contact, data = wine)
lipsitz.test(fm1)
Error in names(LRstat) <- "LR statistic" :
'names' attribute [1] must be the same length as the vector [0]
In addition: Warning message:
In lipsitz.test(fm1) :
n/5c < 6. Running this test when n/5c < 6 is not recommended.
Is there any solution to fix this? Thanks a lot.
I'm not sure if this is off-topic and should be on CrossValidated. It's partly a problem with the coding of the test and partly about the statistics of the test itself.
There are two problems. I've just spotted a bug in the code when using clm and will push a fix to CRAN (corrected code below).
There does however appear to be a more fundamental problem with the example data. Basically, the Lipsitz test requires fitting a new model with dummy variables of the groupings. When fitting the new model with this example, the model fails and so some of the coefficients are not calculated. If using polr, the new model gets the warning that it is rank-deficient; if using clm, the new model gets a message that two coefficients are not fitted due to singularities. I think this example data set is just unsuitable for this kind of analysis.
The corrected code is below and I have used a larger example dataset on which the test runs.
lipsitz.test <- function (model, g = NULL) {
oldmodel <- model
if (class(oldmodel) == "polr") {
yhat <- as.data.frame(fitted(oldmodel))
} else if (class(oldmodel) == "clm") {
predprob <- oldmodel$model[, 2:ncol(oldmodel$model)]
yhat <- predict(oldmodel, newdata = predprob, type = "prob")$fit
} else warning("Model is not of class polr or clm. Test may fail.")
formula <- formula(oldmodel$terms)
DNAME <- paste("formula: ", deparse(formula))
METHOD <- "Lipsitz goodness of fit test for ordinal response models"
obs <- oldmodel$model[1]
if (is.null(g)) {
g <- round(nrow(obs)/(5 * ncol(yhat)))
if (g < 6)
warning("n/5c < 6. Running this test when n/5c < 6 is not recommended.")
}
qq <- unique(quantile(1 - yhat[, 1], probs = seq(0, 1, 1/g)))
cutyhats <- cut(1 - yhat[, 1], breaks = qq, include.lowest = TRUE)
dfobs <- data.frame(obs, cutyhats)
dfobsmelt <- melt(dfobs, id.vars = 2)
observed <- cast(dfobsmelt, cutyhats ~ value, length)
if (g != nrow(observed)) {
warning(paste("Not possible to compute", g, "rows. There might be too few observations."))
}
oldmodel$model <- cbind(oldmodel$model, cutyhats = dfobs$cutyhats)
oldmodel$model$grp <- as.factor(vapply(oldmodel$model$cutyhats,
function(x) which(observed[, 1] == x), 1))
newmodel <- update(oldmodel, . ~ . + grp, data = oldmodel$model)
if (class(oldmodel) == "polr") {
LRstat <- oldmodel$deviance - newmodel$deviance
} else if (class(oldmodel) == "clm") {
LRstat <- abs(-2 * (newmodel$logLik - oldmodel$logLik))
}
PARAMETER <- g - 1
PVAL <- 1 - pchisq(LRstat, PARAMETER)
names(LRstat) <- "LR statistic"
names(PARAMETER) <- "df"
structure(list(statistic = LRstat, parameter = PARAMETER,
p.value = PVAL, method = METHOD, data.name = DNAME, newmoddata = oldmodel$model,
predictedprobs = yhat), class = "htest")
}
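library(MASS)     # polr()
library(reshape)  # melt() and cast(), used inside lipsitz.test()
library(foreign)  # read.dta()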
dt <- read.dta("http://www.ats.ucla.edu/stat/data/hsbdemo.dta")
fm3 <- clm(ses ~ female + read + write, data = dt)
lipsitz.test(fm3)
fm4 <- polr(ses ~ female + read + write, data = dt)
lipsitz.test(fm4)

Can the boxTidwell function handle binary outcome variables?

I initially wanted to run a boxTidwell() (found in the "car" package) analysis on my prospective Logistic Regression model (BinaryOutcomeVar ~ ContinuousPredVar + ContinuousPredVar^2 + ContinuousPredVar^3). I ran into issues:
Error in x - xbar : non-numeric argument to binary operator
In addition: Warning message:
In mean.default(x) : argument is not numeric or logical: returning NA
So, I created a reproducible example for demonstrating the error:
Doesn't work:
boxTidwell(formula = Treatment ~ uptake, other.x = ~ poly(x = colnames(CO2)[c(1,2,4)], degree = 2), data = CO2)
boxTidwell(y = CO2$Treatment, x = CO2$uptake)
Works:
boxTidwell(formula = prestige ~ income + education, other.x = ~ poly(x = women , degree = 2), data = Prestige)
I've been goofing around with the other.x parameter and am guessing that's the issue.
Question
So, does anyone know (1) whether the boxTidwell() function works with binary outcome variables, and (2) the logic behind other.x? I can't get my dummy example to work either.
After further searching, it looks like car:::boxTidwell can't handle a binary outcome variable in the formula, but the procedure can be hand-coded:
require(MASS)
require(car)
d1 <- read.csv("path for your csv file", sep = ',', header = TRUE)
# replace the placeholders below with your own column names
x <- d1$explanatory_variable_name
y <- d1$dependent_variable_name
#FIT IS DONE USING THE glm FUNCTION
m1res <- glm(y ~ x,family=binomial(link = "logit"))
coeff1<- coefficients(summary(m1res))
lnx<-x*log(x)
m2res <- glm(y ~ x+lnx ,family=binomial(link = "logit"))
coeff2<- coefficients(summary(m2res))
alpha0<-1.0
pvalue<-coeff2[3,4]
pvalue
beta1<-coeff1[2,1]
beta2<-coeff2[3,1]
iter<-0
err<-1
while (pvalue<0.1) {
alpha <-(beta2/beta1)+alpha0
err<-abs(alpha-alpha0)
alpha0<-alpha
mx<-x^alpha
m1res <- glm(y ~ mx,family=binomial(link = "logit"))
coeff1<- coefficients(summary(m1res))
mlnx<-mx*log(x)
m2res <- glm(y ~ mx+mlnx ,family=binomial(link = "logit"))
coeff2<- coefficients(summary(m2res))
pvalue<-coeff2[3,4]
beta1<-coeff1[2,1]
beta2<-coeff2[3,1]
iter<- iter+1
}
# PRINT THE POWER TO CONSOLE
alpha
The above code is taken from:
https://sites.google.com/site/ayyalaprem/box-tidwelltransform
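The first step of the hand-coded procedure above (adding an x * log(x) term and checking its significance) can be tried on simulated data. The sketch below is base R only and rests on stated assumptions: the data are simulated with a relationship that is linear on the logit scale, x is strictly positive so that log(x) is defined, and all variable names are hypothetical.

```r
set.seed(42)

# Simulated data: strictly positive predictor, binary outcome
n <- 500
x <- runif(n, min = 0.5, max = 5)
y <- rbinom(n, size = 1, prob = plogis(-1 + 0.8 * x))

# Box-Tidwell style check: add x * log(x) to the logistic regression
xlogx <- x * log(x)
fit <- glm(y ~ x + xlogx, family = binomial(link = "logit"))

# A small p-value for the xlogx term would suggest non-linearity in the logit
p_value <- coef(summary(fit))["xlogx", "Pr(>|z|)"]
p_value
```

If the p-value is small, the iterative loop above could then be used to estimate a power transformation for x.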
