It seems that whenever I use any of the rsq package functions (pcor for partial correlations; rsq and rsq.partial for R-squared) on a binomial glm that uses the two-column response notation, I get an error - see below. The model itself is fine: the fit works perfectly and no data are missing.
Is there something I can do about it?
Reproducible example:
require(rsq)
data(esoph)
model1 <- glm(cbind(ncases, ncontrols) ~ agegp + tobgp * alcgp,
data = esoph, family = binomial)
pcor(model1)
Error in cbind(ncases, ncontrols) : object 'ncases' not found
rsq(model1)
Error in cbind(ncases, ncontrols) : object 'ncases' not found
rsq.partial(model1)
Error in cbind(ncases, ncontrols) : object 'ncases' not found
You have to attach(esoph) before applying the rsq functions to the model. Like this:
data(esoph)
model1 <- glm(cbind(ncases, ncontrols) ~ agegp + tobgp * alcgp,
data = esoph, family = binomial)
attach(esoph)
pcor(model1)
# $adjustment
# [1] FALSE
# $variable
# [1] "agegp" "tobgp" "alcgp" "tobgp:alcgp"
# $partial.cor
# [1] 0.8092124 0.0000000 0.0000000 0.3815876
# Warning message:
# In (nLevels > 1) & (varcls == "factor") :
#   longer object length is not a multiple of shorter object length
rsq(model1)
# [1] 0.826124
rsq.partial(model1)
# $adjustment
# [1] FALSE
# $variable
# [1] "agegp" "tobgp" "alcgp" "tobgp:alcgp"
# $partial.rsq
# [1] 6.548247e-01 -6.661338e-16 0.000000e+00 1.456091e-01
detach(esoph)
Alternatively, doing the cbind() beforehand and storing the result in the data frame works:
esoph$ncases.ncontrols <- with(esoph, cbind(ncases, ncontrols))
glm(ncases.ncontrols ~ agegp + tobgp * alcgp, data=esoph, family=binomial)
pcor() still emits a warning, though.
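For completeness, a minimal end-to-end sketch of that workaround (model2 is a name I introduce here; output omitted):

require(rsq)
data(esoph)
# store the two-column response as a matrix column in the data frame
esoph$ncases.ncontrols <- with(esoph, cbind(ncases, ncontrols))
model2 <- glm(ncases.ncontrols ~ agegp + tobgp * alcgp,
              data = esoph, family = binomial)
rsq(model2)          # works without attach()
rsq.partial(model2)
pcor(model2)         # works too, though it may still warn as noted above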
I was wondering if there is a way to write a logical test (TRUE/FALSE) that shows whether a model from the lme4 package has converged or not.
An example is shown below; I want to capture whether a model comes with the convergence warning (i.e., the "Model failed to converge" message).
library(lme4)
dat <- read.csv('https://raw.githubusercontent.com/rnorouzian/e/master/nc.csv')
m <- lmer(math ~ ses*sector + (ses | sch.id), data = dat)
Warning message:
In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model failed to converge with max|grad| = 0.00279 (tol = 0.002, component 1)
> sm <- summary(m)
> sm$optinfo$conv$lme4$messages
[1] "Model failed to converge with max|grad| = 0.0120186 (tol = 0.002, component 1)"
We can use tryCatch together with withCallingHandlers, taking inspiration from this post.
dat <- read.csv('https://raw.githubusercontent.com/rnorouzian/e/master/nc.csv')
m <- tryCatch({
  withCallingHandlers({
    error <- FALSE
    list(model = lmer(math ~ ses * sector + (ses | sch.id), data = dat),
         error = error)
  }, warning = function(w) {
    if (grepl('failed to converge', w$message)) error <<- TRUE
  })
})
m$model
#Linear mixed model fit by REML ['lmerMod']
#Formula: math ~ ses * sector + (ses | sch.id)
# Data: dat
#REML criterion at convergence: 37509.07
#Random effects:
# Groups Name Std.Dev. Corr
# sch.id (Intercept) 1.9053
# ses 0.8577 0.46
# Residual 3.1930
#Number of obs: 7185, groups: sch.id, 160
#Fixed Effects:
#(Intercept) ses sector ses:sector
# 11.902 2.399 1.677 -1.322
#convergence code 0; 0 optimizer warnings; 1 lme4 warnings
m$error
#[1] TRUE
The output m is a list with model and error elements.
If we need to test for a warning after the model has been created, we can use:
is_warning_generated <- function(m) {
df <- summary(m)
!is.null(df$optinfo$conv$lme4$messages) &&
grepl('failed to converge', df$optinfo$conv$lme4$messages)
}
m <- lmer(math ~ ses*sector + (ses | sch.id), data = dat)
is_warning_generated(m)
#[1] TRUE
We can use safely from purrr. It will also return the error as a list element, capturing any error raised. If there is no error, the error element will be NULL.
library(purrr)
safelmer <- safely(lmer, otherwise = NA)
out <- safelmer(math ~ ses*sector + (ses | sch.id), data = dat)
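Note that safely() captures errors rather than warnings, and a convergence failure is raised as a warning, so it will land in out$result with out$error still NULL. purrr::quietly() is the warning-capturing counterpart; a minimal sketch (quietlmer and out2 are names I introduce here):

library(purrr)
library(lme4)

quietlmer <- quietly(lmer)
out2 <- quietlmer(math ~ ses * sector + (ses | sch.id), data = dat)

# out2$result is the fitted model; out2$warnings is a character vector
# of any warning messages raised during fitting
any(grepl("failed to converge", out2$warnings))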
I'm just going to say that @RonakShah's is_warning_generated could be made slightly more compact:
function(m) {
w <- m@optinfo$conv$lme4$messages
!is.null(w) && grepl('failed to converge', w)
}
I applied Ronak's solution to my own simulation data and found a problem.
The message may be a vector with multiple entries, in which case grepl() also returns multiple entries. However, the && operator compares the string only to the first entry, so further occurrences of 'failed to converge' go unnoticed. To avoid this behavior, I changed && to &.
Then a problem occurred when there was no message at all. In this case the !is.null() part correctly becomes FALSE (i.e., no warning generated), but the grepl() part becomes logical(0), and the function value FALSE & logical(0) is logical(0). (It would in fact work with &&, since FALSE && logical(0) is FALSE.)
A solution that worked for me is
if(is.null(mess)) FALSE else grepl('failed to converge', mess)
which, in case of a failure to converge, provides a vector with TRUE at the entry where the warning was placed. This vector can then be reduced, for example with a numeric (or Boolean) sum, which becomes greater than 0 (or TRUE) when any entry matches.
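Putting both fixes together, a revised checker might look like this (a sketch; any() collapses the per-message vector to a single flag):

is_warning_generated <- function(m) {
  mess <- summary(m)$optinfo$conv$lme4$messages
  # no messages at all means no convergence warning
  if (is.null(mess)) FALSE else grepl('failed to converge', mess)
}

# reduce the per-message logical vector to one TRUE/FALSE
any(is_warning_generated(m))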
I'm doing a bit of exploratory data analysis using HMDA data from the AER package; however, the variables I used to fit the model seem to contain some observations that perfectly determine the outcome, an issue known as "separation". I tried to remedy this using the solution recommended by this thread, yet when I tried to execute the first chunk of source code from glm.fit(), R returned an error message:
Error in family$family : object of type 'closure' is not subsettable
so I could not proceed any further with removing those fully determined observations from my data using this code. I am wondering if anyone could help me fix this?
My current code is provided at below for your reference.
# load the AER package and HMDA data
library(AER)
data(HMDA)
# fit a degree-2 polynomial probit model
probit.fit <- glm(deny ~ poly(hirat, 2), family = binomial, data = HMDA)
# using the revised source code from that stackexchange thread to find observations that received a warning message
library(tidyverse)
library(dplyr)
library(broom)
eps <- 10 * .Machine$double.eps
if (family$family == "binomial") {
if (any(mu > 1 - eps) || any(mu < eps))
warning("glm.fit: fitted probabilities numerically 0 or 1 occurred",
call. = FALSE)
}
# this return the following error message
# Error in family$family : object of type 'closure' is not subsettable
probit.resids <- augment(probit.fit) %>%
mutate(p = 1 / (1 + exp(-.fitted)),
warning = p > 1-eps)
arrange(probit.resids, desc(.fitted)) %>%
select(2:5, p, warning) %>%
slice(1:10)
HMDA.nwarning <- filter(HMDA, !probit.resids$warning)
# using HMDA.nwarning should solve the problem...
probit.fit <- glm(deny ~ poly(hirat, 2), family = binomial, data = HMDA.nwarning)
This chunk of code
if (family$family == "binomial") {
if (any(mu > 1 - eps) || any(mu < eps))
warning("glm.fit: fitted probabilities numerically 0 or 1 occurred",
call. = FALSE)
}
comes from inside glm.fit(): there, family is the family object created by binomial(), which is called when you run glm with family = binomial. At the top level, however, the name family refers to the function stats::family (a closure), which is why family$family fails with "object of type 'closure' is not subsettable". If you look under glm (just type glm):
if (is.character(family))
family <- get(family, mode = "function", envir = parent.frame())
if (is.function(family))
family <- family()
if (is.null(family$family)) {
print(family)
stop("'family' not recognized")
}
And the glm function checks binomial()$family during the fit; if any of the fitted probabilities is within eps of 0 or 1, it raises that warning.
You don't need to run that part yourself, but you do need to set eps <- 10 * .Machine$double.eps. So let's run the code below. Note that if you want a probit, you need to specify link = "probit" in binomial(); otherwise the default is logit:
library(AER)
library(tidyverse)
library(dplyr)
library(broom)
data(HMDA)
probit.fit <- glm(deny ~ poly(hirat, 2), family = binomial(link="probit"), data = HMDA)
eps <- 10 * .Machine$double.eps
probit.resids <- augment(probit.fit) %>%
  # .fitted is on the link scale; this logistic transform approximates the
  # probability (for an exact probit probability, use pnorm(.fitted))
  mutate(p = 1 / (1 + exp(-.fitted)),
         warning = p > 1 - eps)
The column warning indicates whether an observation raises the warning; in this dataset, there is one:
table(probit.resids$warning)
FALSE TRUE
2379 1
We can then filter it out:
HMDA.nwarning <- filter(HMDA, !probit.resids$warning)
dim(HMDA.nwarning)
[1] 2379 14
And rerun the regression:
probit.fit <- glm(deny ~ poly(hirat, 2), family = binomial(link="probit"), data = HMDA.nwarning)
coefficients(probit.fit)
(Intercept) poly(hirat, 2)1 poly(hirat, 2)2
-1.191292 8.708494 6.884404
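As an aside, if you did want to reproduce that internal check at the top level, a minimal sketch (using the probit.fit object from above) pulls the family object and fitted probabilities from the fitted model instead of the variables that exist only inside glm.fit():

eps <- 10 * .Machine$double.eps
mu  <- fitted(probit.fit)  # fitted probabilities on the response scale

# probit.fit$family is a family *object*, not the stats::family closure,
# so subsetting it with $family works here
if (probit.fit$family$family == "binomial") {
  if (any(mu > 1 - eps) || any(mu < eps))
    warning("glm.fit: fitted probabilities numerically 0 or 1 occurred",
            call. = FALSE)
}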
With this data input:
A B C D
0.0513748973337 0.442624990365 0.044669941640565 12023787.0495
-0.047511808790502 0.199057057555 0.067542653775225 6674747.75598
0.250333519823608 0.0400359422093 -0.062361320324768 10836244.44
0.033600922318947 0.118359141703 0.048493523722074 7521473.94034
0.00492552770819 0.0851342003243 0.027123088894137 8742685.39098
0.02053037069955 0.0535545969759 0.06352586720282 8442677.4204
0.09050961131549 0.044871795257 0.049363888991624 7223126.70424
0.082789930841618 0.0230375009412 0.090676778601245 8974611.5623
0.06396481119371 0.0467280364963 0.128097065131764 8167179.81463
and this code:
library(plm);
mydata <- read.csv("reproduce_small.csv", sep = "\t");
plm(C ~ log(D), data = mydata, model = "pooling"); # works
plm(A ~ log(B), data = mydata, model = "pooling"); # error
the second plm call returns the following error:
Error in Math.factor(B) : ‘log’ not meaningful for factors
reproduce_small.csv contains the ten lines of data pasted above. Obviously, B is not a factor, it is clearly a numeric vector. This means that plm thinks it is a factor. The questions are "why?", but more importantly "how do I fix this?"
Things I've tried:
#1) mydata$B.log <- log(mydata$B) results in
Error in model.frame.default(formula = y ~ X - 1, drop.unused.levels = TRUE) :
variable lengths differ (found for 'X')
which is in itself weird, since A and B.log clearly have the same length.
#2) plm(A ~ log(D), data = mydata, model = "pooling"); results in the same error as #1.
#3) plm(C ~ log(B), data = mydata, model = "pooling"); results in the same original error (log not meaningful for factors).
#4) plm(A ~ log(B + 1), data = mydata, model = "pooling"); results in
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
In addition: Warning message:
In Ops.factor(B, 1) : ‘+’ not meaningful for factors
#5) plm(A ~ as.numeric(as.character(log(B))), data = mydata, model = "pooling"); results in the same original error (log not meaningful for factors).
EDIT: As suggested, I'm including the result of str(mydata):
> str(mydata)
'data.frame': 9 obs. of 4 variables:
$ A: num 0.05137 -0.04751 0.25033 0.0336 0.00493 ...
$ B: num 0.4426 0.1991 0.04 0.1184 0.0851 ...
$ C: num 0.0447 0.0675 -0.0624 0.0485 0.0271 ...
$ D: num 12023787 6674748 10836244 7521474 8742685 ...
Also, trying mydata <- read.csv("reproduce_small.csv", sep = "\t", stringsAsFactors = FALSE) didn't work.
Helix123 pointed out in the comments that the data.frame should be converted to a pdata.frame. So, for instance, a solution for this toy example would be:
mydata$E <- c("x", "x", "x", "x", "x", "y", "y", "y", "y"); # Create E as an "index"
mydata <- pdata.frame(mydata, index = "E"); # convert to pdata.frame
plm(A ~ log(B), data = mydata, model = "pooling"); # now it works!
EDIT:
As to "why" this happens, as Helix123 pointed out in comments, is that, when passed a data.frame instead of a pdata.frame, plm quietly assumes that the first two columns are indexes, and converts them to factor under the hood. Then plm will throw an unhelpful error, instead of launching a warning that the object passed is not of the correct type, or that it made an assumption at all.
I found other questions regarding this topic, such as this one; however, I keep getting the error message
Error in xy.coords(x, y, xlabel, ylabel, log) : 'x' and 'y' lengths differ
Below is the code I am using:
library(DAAG)
attach(ultrasonic)
g.poly = lm(UR ~ poly(MD, 3), data = ultrasonic)
cv.poly <- cv.lm(ultrasonic, g.poly ,m=3, plotit=TRUE, printit=TRUE, dots=FALSE, seed=29)
Of course, the lengths are the same:
> length(UR)
[1] 214
> length(MD)
[1] 214
Note that in the same script I perform another linear regression with cross-validation, which works:
library(DAAG)
g.lin = lm(log(UR) ~ MD, data = ultrasonic)
cv.lin <- cv.lm(ultrasonic, g.lin ,m=3, plotit=TRUE, printit=TRUE, dots=FALSE, seed=29)
Any idea why the polynomial regression cross-validation does not work?
EDIT
To get the data:
install.packages('nlsmsn')
library('nlsmsn')
data(Ultrasonic)
# names differ: I am using a local copy with a lower-case u (ultrasonic) and
# different column names, but the data are identical:
# UR = y
# MD = x
DAAG::cv.lm obviously does not support everything you can do with lm; for example, it does not support functions in the formula. You need to take an intermediate step.
mf <- as.data.frame(model.matrix(y ~ poly(x), data = Ultrasonic))
mf$y <- Ultrasonic$y
mf$`(Intercept)` <- NULL
#sanitize names
names(mf) <- make.names(names(mf))
#[1] "poly.x." "y"
g.poly.san <- lm(y ~ ., data = mf)
cv.poly <- cv.lm(mf, g.poly.san, m=3, plotit=TRUE, printit=TRUE, dots=FALSE, seed=29)
#works
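The same intermediate-step idea extends to the degree-3 fit from the question: materialize the polynomial basis as plain columns first (a sketch; mf3, g.poly3, and the x.poly names are mine):

# build the degree-3 orthogonal polynomial basis as ordinary columns
pb <- as.data.frame(poly(Ultrasonic$x, 3))
names(pb) <- paste0("x.poly", 1:3)
mf3 <- cbind(y = Ultrasonic$y, pb)

g.poly3 <- lm(y ~ ., data = mf3)
cv.poly3 <- cv.lm(mf3, g.poly3, m = 3, plotit = TRUE, printit = TRUE,
                  dots = FALSE, seed = 29)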
I initially wanted to run a boxTidwell() analysis (from the car package) on my prospective logistic regression model (BinaryOutcomeVar ~ ContinuousPredVar + ContinuousPredVar^2 + ContinuousPredVar^3). I ran into issues:
Error in x - xbar : non-numeric argument to binary operator
In addition: Warning message:
In mean.default(x) : argument is not numeric or logical: returning NA
So, I created a reproducible example demonstrating the error:
Doesn't work:
boxTidwell(formula = Treatment ~ uptake, other.x = ~ poly(x = colnames(CO2)[c(1,2,4)], degree = 2), data = CO2)
boxTidwell(y = CO2$Treatment, x = CO2$uptake)
Works:
boxTidwell(formula = prestige ~ income + education, other.x = ~ poly(x = women , degree = 2), data = Prestige)
I've been goofing around with the other.x parameter and am guessing that's the issue.
Question
So, does anyone know (1) whether the boxTidwell() function works with binary outcome variables, and (2) the logic behind other.x? I can't get my dummy example to work either.
After further searching, it looks like car::boxTidwell can't handle a binary outcome variable in the formula, but the transformation can be hand-coded:
require(MASS)
require(car)
d1 <- read.csv("path for your csv file", sep = ',', header = TRUE)
x <- d1$explanatory.variable.name  # replace with your explanatory variable column
y <- d1$dependent.variable.name    # replace with your binary dependent variable column
#FIT IS DONE USING THE glm FUNCTION
m1res <- glm(y ~ x,family=binomial(link = "logit"))
coeff1<- coefficients(summary(m1res))
lnx<-x*log(x)
m2res <- glm(y ~ x+lnx ,family=binomial(link = "logit"))
coeff2<- coefficients(summary(m2res))
alpha0<-1.0
pvalue<-coeff2[3,4]
pvalue
beta1<-coeff1[2,1]
beta2<-coeff2[3,1]
iter<-0
err<-1
while (pvalue < 0.1) {  # iterate while the added log term remains significant
alpha <-(beta2/beta1)+alpha0
err<-abs(alpha-alpha0)
alpha0<-alpha
mx<-x^alpha
m1res <- glm(y ~ mx,family=binomial(link = "logit"))
coeff1<- coefficients(summary(m1res))
mlnx<-mx*log(x)
m2res <- glm(y ~ mx+mlnx ,family=binomial(link = "logit"))
coeff2<- coefficients(summary(m2res))
pvalue<-coeff2[3,4]
beta1<-coeff1[2,1]
beta2<-coeff2[3,1]
iter<- iter+1
}
# PRINT THE POWER TO CONSOLE
alpha
The code above is taken from:
https://sites.google.com/site/ayyalaprem/box-tidwelltransform
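As a hypothetical usage sketch with the built-in CO2 data from the question (the 0/1 recoding of Treatment is my assumption, not part of the original post):

# hypothetical setup replacing the CSV read above
d1 <- datasets::CO2
y <- as.numeric(d1$Treatment == "chilled")  # recode binary outcome to 0/1 (assumption)
x <- d1$uptake                              # positive continuous predictor, so log(x) is defined
# ...then run the iteration above; on exit, alpha holds the estimated power transform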