R: Problems with wrapping polynomial regression in a function

When I was doing polynomial regression, I tried to put fits with different degrees of polynomial into a list, so I wrapped the glm call in a function:
library(MASS)
myglm <- function(dop) {
# dop: degree of polynomial
glm(nox ~ poly(dis, degree = dop), data = Boston)
}
However, I suspect there is a problem related to lazy evaluation: the stored call and the coefficient names refer to the parameter dop rather than to a specific number.
r$> myglm(2)
Call: glm(formula = nox ~ poly(dis, degree = dop), data = Boston)
Coefficients:
(Intercept) poly(dis, degree = dop)1 poly(dis, degree = dop)2
0.5547 -2.0031 0.8563
Degrees of Freedom: 505 Total (i.e. Null); 503 Residual
Null Deviance: 6.781
Residual Deviance: 2.035 AIC: -1347
When I do cross-validation using this model, an error occurs:
>>> cv.glm(Boston, myglm(2))
Error in poly(dis, degree = dop) : object 'dop' not found
So how can I solve this problem?

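Quosures, quasiquotation, and tidy evaluation are useful here. The !! operator inserts the value of dop into the glm call before it is evaluated, so the stored call contains the number itself rather than the name dop: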
library(MASS)
library(boot)
library(rlang)
myglm <- function(dop) {
eval_tidy(quo(glm(nox ~ poly(dis, degree = !! dop), data = Boston)))
}
cv.glm(Boston, myglm(2))
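A base-R sketch of the same idea, assuming you would rather not depend on rlang: bquote() substitutes the value of dop into the call before eval() runs it, so the call stored in the fit holds the number. The name myglm_bquote is made up for illustration.

```r
library(MASS)   # Boston data
library(boot)   # cv.glm

# Hypothetical helper: same model as myglm, built with base R only
myglm_bquote <- function(dop) {
  # .(dop) is replaced by its value, so glm() records the number, not the name dop
  eval(bquote(glm(nox ~ poly(dis, degree = .(dop)), data = Boston)))
}

# A list of fits of degree 1 to 3, as in the question
fits <- lapply(1:3, myglm_bquote)
fits[[2]]$call

# cv.glm re-evaluates the stored call, which no longer mentions dop
set.seed(1)
cv_err <- cv.glm(Boston, fits[[2]], K = 10)$delta
```

Because cv.glm refits the model by re-evaluating the stored call outside the function, anything in that call has to be resolvable there, which is why the value needs to be baked in.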

Related

Error using offset argument in an intercept only model in lrm function R, from rms package

I cannot fit an intercept only model with an offset term using the lrm function from rms. The equivalent model can be fit with glm. I cannot find anything about this on the internet. Reproducible example below. Any help appreciated!
### Generate data
###
## Load rms
#install.packages("rms")
library(rms)
## Set seed
set.seed(1)
## Generate event probabilities
p.true <- runif(1000, 0.2 , 0.8)
## Generate binary outcome
state1.bin <- rbinom(1000, 1, p.true)
## Create dataset with outcome, and the logit of the event probabilities
data.raw.temp <- data.frame("state1.bin" = state1.bin, "p.true.logit1" = log(p.true/(1-p.true)))
### Fit an intercept only model, using the logit of the event probabilities as an offset
###
## GLM attempt 1
glm(state1.bin ~ offset(p.true.logit1), data = data.raw.temp, family = binomial(link = "logit"))
## lrm attempt 1
lrm(state1.bin ~ offset(p.true.logit1), data = data.raw.temp)
## lrm attempt 2
lrm(state1.bin ~ 1, offset = p.true.logit1, data = data.raw.temp)
GLM attempt 1 successfully fits the desired model
Call: glm(formula = state1.bin ~ offset(p.true.logit1), family = binomial(link = "logit"),
data = data.raw.temp)
Coefficients:
(Intercept)
-0.02642
Degrees of Freedom: 999 Total (i.e. Null); 999 Residual
Null Deviance: 1270
Residual Deviance: 1270 AIC: 1272
lrm attempt 1 gives the following error
Error in lrm(state1.bin ~ offset(p.true.logit1), data = data.raw.temp) :
object 'nact' not found
lrm attempt 2 gives the following error
Error in fitter(X, Y, offset = offs, penalty.matrix = penalty.matrix, :
formal argument "offset" matched by multiple actual arguments

Weighted logistic regression in R

Given sample data of proportions of successes plus sample sizes and independent variable(s), I am attempting logistic regression in R.
The following code does what I want and seems to give sensible results, but does not look like a sensible approach; in effect it doubles the size of the data set
datf <- data.frame(prop = c(0.125, 0, 0.667, 1, 0.9),
cases = c(8, 1, 3, 3, 10),
x = c(11, 12, 15, 16, 18))
datf2 <- rbind(datf,datf)
datf2$success <- rep(c(1, 0), each=nrow(datf))
datf2$cases <- round(datf2$cases*ifelse(datf2$success,datf2$prop,1-datf2$prop))
fit2 <- glm(success ~ x, weight=cases, data=datf2, family="binomial")
datf$proppredicted <- 1 / (1 + exp(-predict(fit2, datf)))
plot(datf$x, datf$proppredicted, type="l", col="red", ylim=c(0,1))
points(datf$x, datf$prop, cex=sqrt(datf$cases))
producing a chart (not reproduced here) which looks reasonably sensible.
But I am not happy about the use of datf2 as a way of separating the successes and failures by duplicating the data. Is something like this necessary?
As a lesser question, is there a cleaner way of calculating the predicted proportions?
No need to construct artificial data like that; glm can fit your model from the dataset as given.
> glm(prop ~ x, family=binomial, data=datf, weights=cases)
Call: glm(formula = prop ~ x, family = binomial, data = datf, weights = cases)
Coefficients:
(Intercept) x
-9.3533 0.6714
Degrees of Freedom: 4 Total (i.e. Null); 3 Residual
Null Deviance: 17.3
Residual Deviance: 2.043 AIC: 11.43
You will get a warning about "non-integer #successes", but that is because glm is being silly. Compare to the model on your constructed dataset:
> fit2
Call: glm(formula = success ~ x, family = "binomial", data = datf2,
weights = cases)
Coefficients:
(Intercept) x
-9.3532 0.6713
Degrees of Freedom: 7 Total (i.e. Null); 6 Residual
Null Deviance: 33.65
Residual Deviance: 18.39 AIC: 22.39
The regression coefficients (and therefore predicted values) are basically equal. However your residual deviance and AIC are suspect because you've created artificial data points.
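On the lesser question about predicted proportions, here is a minimal sketch, assuming the rounded success counts are what you want to model. It uses the two-column cbind(successes, failures) response, which avoids the non-integer warning, and predict(..., type = "response"), which returns fitted probabilities directly. The column names succ and fail are made up for illustration.

```r
datf <- data.frame(prop  = c(0.125, 0, 0.667, 1, 0.9),
                   cases = c(8, 1, 3, 3, 10),
                   x     = c(11, 12, 15, 16, 18))

# Recover integer counts of successes and failures from the proportions
datf$succ <- round(datf$prop * datf$cases)
datf$fail <- datf$cases - datf$succ

# Binomial fit on counts: one row per original observation, no duplication
fit <- glm(cbind(succ, fail) ~ x, family = binomial, data = datf)

# type = "response" applies the inverse logit for you
datf$proppredicted <- predict(fit, newdata = datf, type = "response")
```

This replaces the manual 1 / (1 + exp(-predict(...))) step; plogis() would be another way to write that transformation by hand.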

R: loglikelihood of Saturated Model in GLM

Let LL = loglikelihood
Residual Deviance = 2(LL(Saturated Model) - LL(Proposed Model))
However, when I use glm function, it seems that
Residual Deviance = -2LL(Proposed Model)
For example,
mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
mydata$rank <- factor(mydata$rank)
mylogit <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial")
summary(mylogit)
###
Residual deviance: 458.52 on 394 degrees of freedom
AIC: 470.52
#Residual deviance
-2*logLik(mylogit)
##'log Lik.' 458.5175 (df=6)
#AIC
-2*logLik(mylogit)+2*(5+1)
##470.5175
Where is LL(Saturated Model) and how can I get its value in R?
Thank you.
I have got the answer: this only happens when the log likelihood of the saturated model is 0, which for discrete models implies that the probability of the observed data under the saturated model is 1. Binary data is pretty much the only case where this is true (because the individual fitted probabilities become either zero or one).
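To see this numerically, here is a small sketch on simulated data (the data and variable names are assumptions for illustration, not the admissions dataset). The saturated model sets each fitted mean equal to the observed value, so its log likelihood can be computed directly from the density function.

```r
set.seed(42)
n <- 200
x <- rnorm(n)

# Binary response: the saturated model predicts each 0/1 outcome with probability 1
y_bin   <- rbinom(n, 1, plogis(0.5 * x))
fit_bin <- glm(y_bin ~ x, family = binomial)
ll_sat_bin <- sum(dbinom(y_bin, size = 1, prob = y_bin, log = TRUE))  # sums log(1) terms
dev_bin <- 2 * (ll_sat_bin - as.numeric(logLik(fit_bin)))

# Count response: the saturated log likelihood is generally not zero
y_cnt   <- rpois(n, exp(0.3 + 0.4 * x))
fit_cnt <- glm(y_cnt ~ x, family = poisson)
ll_sat_cnt <- sum(dpois(y_cnt, lambda = y_cnt, log = TRUE))
dev_cnt <- 2 * (ll_sat_cnt - as.numeric(logLik(fit_cnt)))

# Both hand-computed values should agree with what glm reports
all.equal(dev_bin, deviance(fit_bin))
all.equal(dev_cnt, deviance(fit_cnt))
```

In the binary case the general formula collapses to -2 * logLik, which matches the output in the question; in the Poisson case the saturated term has to be kept.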

Logistic Unit Fixed Effect Model in R

I'm trying to estimate a logistic unit fixed effects model for panel data using R. My dependent variable is binary and measured daily over two years for 13 locations.
The goal of this model is to predict the value of y for a particular day and location based on x.
zero <- seq(from=0, to=1, by=1)
ids = data.frame(location=seq(from=1, to=13, by=1))
dates = data.frame(date = seq(as.Date("2015-01-01"), as.Date("2016-12-31"), by="days"))
data = merge(dates, ids)
data$y <- sample(zero, size=9503, replace=TRUE)
data$x <- sample(zero, size=9503, replace=TRUE)
While surveying the available packages to do so, I've read a number of ways to (apparently) do this, but I'm not confident I've understood the differences between packages and approaches.
From what I have read so far, glm(), survival::clogit() and pglm::pglm() can be used to do this, but I'm wondering if there are substantial differences between the packages and what those might be.
Here are the calls I've used:
fixed <- glm(y ~ x + factor(location), data=data)
fixed <- clogit(y ~ x + strata(location), data=data)
One of the reasons for this insecurity is the error I get when using pglm (also see this question) that pglm can't use the "within" model:
fixed <- pglm(y ~ x, data=data, index=c("location", "date"), model="within", family=binomial("logit"))
What distinguishes the "within" model of pglm from the approaches in glm() and clogit() and which of the three would be the correct one to take here when trying to predict y for a given date and unit?
I don't see that you have defined a proper hypothesis to test within the context of what you are calling "panel data", but as far as getting glm to give estimates for logistic coefficients within strata, it can be accomplished by adding family="binomial" and stratifying by your "unit" variable (location in your example data; strata() comes from the survival package):
> fixed <- glm(y ~ x + strata(unit), data=data, family="binomial")
> fixed
Call: glm(formula = y ~ x + strata(unit), family = "binomial", data = data)
Coefficients:
(Intercept) x strata(unit)unit=2 strata(unit)unit=3
0.10287 -0.05910 -0.08302 -0.03020
strata(unit)unit=4 strata(unit)unit=5 strata(unit)unit=6 strata(unit)unit=7
-0.06876 -0.05042 -0.10200 -0.09871
strata(unit)unit=8 strata(unit)unit=9 strata(unit)unit=10 strata(unit)unit=11
-0.09702 0.02742 -0.13246 -0.04816
strata(unit)unit=12 strata(unit)unit=13
-0.11449 -0.16986
Degrees of Freedom: 9502 Total (i.e. Null); 9489 Residual
Null Deviance: 13170
Residual Deviance: 13170 AIC: 13190
That will not take into account any date-ordering, which is what I would have expected to be the interest. But as I said above, there doesn't yet appear to be a hypothesis that is premised on any sequential ordering.
This would create a fixed effects model that includes a spline relationship of date to probability of a y-event. I chose to center and scale the date rather than leaving it as a very large integer:
library(splines)
fixed <- glm(y ~ x + ns(scale(date),3) + factor(unit), data=data, family="binomial")
fixed
#----------------------
Call: glm(formula = y ~ x + ns(scale(date), 3) + factor(unit), family = "binomial",
data = data)
Coefficients:
(Intercept) x ns(scale(date), 3)1 ns(scale(date), 3)2
0.13389 -0.05904 0.04431 -0.10727
ns(scale(date), 3)3 factor(unit)2 factor(unit)3 factor(unit)4
-0.03224 -0.08302 -0.03020 -0.06877
factor(unit)5 factor(unit)6 factor(unit)7 factor(unit)8
-0.05042 -0.10201 -0.09872 -0.09702
factor(unit)9 factor(unit)10 factor(unit)11 factor(unit)12
0.02742 -0.13246 -0.04816 -0.11450
factor(unit)13
-0.16987
Degrees of Freedom: 9502 Total (i.e. Null); 9486 Residual
Null Deviance: 13170
Residual Deviance: 13160 AIC: 13200
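For the prediction part of the question, here is a minimal sketch, assuming the dummy-variable glm approach and the column name location from the question's data. The data are regenerated so the block runs on its own; panel, new_obs and p_hat are made-up names.

```r
set.seed(1)
ids   <- data.frame(location = 1:13)
dates <- data.frame(date = seq(as.Date("2015-01-01"), as.Date("2016-12-31"), by = "days"))
panel <- merge(dates, ids)   # no shared columns, so this is a cross join: one row per day and location

panel$y <- sample(0:1, size = nrow(panel), replace = TRUE)
panel$x <- sample(0:1, size = nrow(panel), replace = TRUE)

# Logistic model with one intercept shift per location
fit <- glm(y ~ x + factor(location), data = panel, family = binomial)

# Predicted probability of y = 1 for x = 1 at location 5
new_obs <- data.frame(x = 1, location = 5)
p_hat   <- predict(fit, newdata = new_obs, type = "response")
```

Since this model has no date term, the prediction is the same for every day at a given location; adding a date term such as the spline above would make it date-specific.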

Pasting object names into the glm function in R

I have the following data
data.set <- data.frame("varA"=rnorm(50),"varB"=rnorm(50),
"varC"=rnorm(50), binary.outcome=sample(c(0,1),50,replace=T) )
exp.vars <- c("varA","varB","varC")
I then wish to fit a logistic model using all of the exp.vars as explanatory variables without hard coding them (I want to put this into a function so that different combinations of exp.vars can be tried). My attempt:
results <- glm( binary.outcome ~ get(paste(exp.vars, collapse="+")), family=binomial,
data=data.set )
How can I get this to work?
The . in the formula tells R to use all variables in the data.frame data.set (except the response, binary.outcome) as predictors. This should do it:
glm( binary.outcome ~ ., family=binomial,
data=data.set )
Call: glm(formula = binary.outcome ~ ., family = binomial, data = data.set)
Coefficients:
(Intercept) varA varB varC
-0.4820 0.1878 -0.3974 -0.4566
Degrees of Freedom: 49 Total (i.e. Null); 46 Residual
Null Deviance: 66.41
Residual Deviance: 62.06 AIC: 70.06
and from ?formula
There are two special interpretations of . in a formula. The usual one
is in the context of a data argument of model fitting functions and
means ‘all columns not otherwise in the formula’: see terms.formula.
In the context of update.formula, only, it means ‘what was previously
in this part of the formula’.
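If you need different subsets of exp.vars rather than every column, one option is to build the formula from the character vector with reformulate() from the stats package. This is a sketch; the wrapper name fit_subset is made up for illustration.

```r
set.seed(123)
data.set <- data.frame(varA = rnorm(50), varB = rnorm(50), varC = rnorm(50),
                       binary.outcome = sample(c(0, 1), 50, replace = TRUE))
exp.vars <- c("varA", "varB", "varC")

# Hypothetical wrapper: fit a logistic model on any subset of predictors
fit_subset <- function(vars, data) {
  # reformulate() turns the character vector into binary.outcome ~ varA + ...
  f <- reformulate(vars, response = "binary.outcome")
  glm(f, family = binomial, data = data)
}

fit_all <- fit_subset(exp.vars, data.set)
fit_ac  <- fit_subset(c("varA", "varC"), data.set)
names(coef(fit_ac))
```

as.formula(paste("binary.outcome ~", paste(vars, collapse = " + "))) would be an equivalent way to build the same formula.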
