How can I manipulate a GLM object in order to bypass this error? I would like for predict to treat the unseen levels as base cases (that is, give them a coefficient of zero.)
> master <- data.frame(x = factor(floor(runif(100,0,3)), labels=c("A","B","C")), y = rnorm(100))
> part.1 <- master[master$x == 'C',]
> part.2 <- master[master$x == 'A' | master$x == 'B',]
> model.2 <- glm(y ~ x, data=part.2)
> predict.1 <- predict(model.2, part.1)
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : factor 'x' has new level(s) C
I tried doing this:
> model.2$xlevels$x <- c(model.2$xlevels, "C")
> predict.1 <- predict(model.2, part.1)
But it's not scoring the model correctly:
> predict.1[1:5]
2 3 6 8 10
0.03701494 0.03701494 0.03701494 0.03701494 0.03701494
> summary(model.2)
Call:
glm(formula = y ~ x, data = part.2)
<snip>
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.12743 0.18021 0.707 0.482
xB -0.09042 0.23149 -0.391 0.697
predict.1 should only be 0.12743.
This is obviously just a trimmed-down version; my real model has 25 or so variables in it, so an answer of predict.1 <- rep(0.12743, nrow(part.1)) is not useful to me.
Thanks for any help!
If you know that observations where x=='C' behave exactly like x=='A', then you can just do:
> part.1$x <- factor(rep("A",nrow(part.1)),levels=c("A","B"))
> predict(model.2, part.1)
which will give you your pure intercept model.
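For a model with many factor predictors, the same recoding can be applied to every factor at once. The following is a minimal sketch of one way to do that. Note that recode_unseen is a hypothetical helper (it is not part of base R), and it assumes every entry of model$xlevels corresponds to a plain factor column in the new data. Mapping unseen levels to the reference level is a modelling assumption you have to be willing to make; the fitted model itself cannot justify it.

```r
set.seed(1)
master <- data.frame(x = factor(sample(c("A", "B", "C"), 100, replace = TRUE)),
                     y = rnorm(100))
part.1 <- master[master$x == "C", ]
part.2 <- master[master$x != "C", ]
model.2 <- glm(y ~ x, data = part.2)

# Hypothetical helper: map any level the model has not seen to the
# reference (first) level, so it contributes a coefficient of zero.
recode_unseen <- function(model, newdata) {
  for (v in names(model$xlevels)) {
    if (!v %in% names(newdata)) next          # skip terms such as factor(x)
    lev  <- model$xlevels[[v]]
    vals <- as.character(newdata[[v]])
    vals[!is.na(vals) & !(vals %in% lev)] <- lev[1]
    newdata[[v]] <- factor(vals, levels = lev)
  }
  newdata
}

pred <- predict(model.2, recode_unseen(model.2, part.1))
head(pred)   # every value equals the intercept of model.2
```

Because the helper only touches columns listed in model$xlevels, numeric predictors in a larger model are left alone.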
I disagree that you should expect any prediction. You fit the model on data in which x never takes the value "C", so you should not expect a prediction for that level. Your effort to produce predictions for 1:5 also should fail.
I want to estimate the parameters using glm() and compare them with the estimates from mle2() in the bbmle package.
Here's my code for GLM
d <- read.delim("http://dnett.github.io/S510/Disease.txt")
d$disease=factor(d$disease)
d$ses=factor(d$ses)
d$sector=factor(d$sector)
str(d)
glm2 <- glm(disease~ses+sector, family=binomial(link=logit), data=d)
summary(glm2)
And my code for mle2()
y<-as.numeric(as.character(d$disease))
x1<-as.numeric(as.character(d$age))
x2<-as.numeric(as.character(d$sector))
x3<-as.numeric(as.character(d$ses))
library(bbmle)
nlldbin=function(A,B,C,D){
eta<-A+B*(x3==2)+C*(x3==3)+D*(x2==2)
p<-1/(1+exp(-eta))
joint.pdf= (p^y)*((1-p)^(1-y))
-sum(joint.pdf, log=TRUE ,na.rm=TRUE)
}
st <- list(A=0.0001,B=0.0001,C=0.0001,D=0.0001)
est_mle2<-mle2(start=st,nlldbin,hessian=TRUE)
summary(est_mle2)
But the results are quite different. Please help me fix this, thank you!
> summary(est_mle2)
Maximum likelihood estimation
Call:
mle2(minuslogl = nlldbin, start = st, hessian.opts = TRUE)
Coefficients:
Estimate Std. Error z value Pr(z)
A -20.4999 5775.1484 -0.0035 0.9972
B -5.2499 120578.9515 0.0000 1.0000
C -7.9999 722637.2670 0.0000 1.0000
D -2.2499 39746.6639 -0.0001 1.0000
> summary(glm2)
Call:
glm(formula = disease ~ ses + sector, family = binomial(link = logit),
data = d)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.52001 0.33514 -4.535 5.75e-06 ***
ses2 -0.08525 0.41744 -0.204 0.838177
ses3 0.16086 0.39261 0.410 0.682019
sector2 1.28098 0.34140 3.752 0.000175 ***
I'm not sure your definition of eta is correct. I would use the model matrix.
X <- model.matrix(~ ses + sector, data = d)
nlldbin <- function(A,B,C,D){
eta <- X %*% c(A, B, C, D)
p <- 1/(1+exp(-eta))
logpdf <- y*log(p) + (1-y)*log(1-p)
-sum(logpdf)
}
This line
-sum(joint.pdf, log=TRUE ,na.rm=TRUE)
is wrong. sum doesn't have a special log argument; what you're doing is adding the value TRUE (which gets converted to 1) to the pdf.
What you want is
-sum(log(joint.pdf), na.rm=TRUE)
but this is also not very good for numerical reasons, as the pdf is likely to underflow. A better way of writing it would be
logpdf <- y*log(p) + (1-y)*log(1-p)
-sum(logpdf, na.rm=TRUE)
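As a rough check that a log-scale likelihood of this form recovers the glm estimates, here is a small self-contained sketch. It rests on two assumptions that are not in the question: the data are simulated (the variable names ses and sector are reused only for familiarity), and base R's optim() stands in for mle2() so that no extra package is needed. dbinom(..., log = TRUE) computes the same y*log(p) + (1-y)*log(1-p) term.

```r
set.seed(42)
n      <- 500
ses    <- factor(sample(1:3, n, replace = TRUE))
sector <- factor(sample(1:2, n, replace = TRUE))
X      <- model.matrix(~ ses + sector)   # columns: (Intercept), ses2, ses3, sector2

beta_true <- c(-1.5, -0.1, 0.2, 1.3)     # made-up values for the simulation
y <- rbinom(n, 1, plogis(drop(X %*% beta_true)))

# Negative log-likelihood, summed on the log scale
nll <- function(beta) {
  eta <- drop(X %*% beta)
  -sum(dbinom(y, size = 1, prob = plogis(eta), log = TRUE))
}

fit_optim <- optim(rep(0, 4), nll, method = "BFGS")
fit_glm   <- glm(y ~ ses + sector, family = binomial)

# The two columns should agree to a few decimal places
cbind(optim = fit_optim$par, glm = unname(coef(fit_glm)))

# For contrast: sum() has no log argument, so TRUE is simply added as 1
sum(c(0.5, 0.25), log = TRUE)   # 1.75
```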
I'm having a small issue with updating nlme models after using reformulate() in the fixed argument of lme().
Here is some data
set.seed(345)
A0 <- rnorm(4,2,.5)
B0 <- rnorm(4,2+3,.5)
A1 <- rnorm(4,6,.5)
B1 <- rnorm(4,6+2,.5)
A2 <- rnorm(4,10,.5)
B2 <- rnorm(4,10+1,.5)
A3 <- rnorm(4,14,.5)
B3 <- rnorm(4,14+0,.5)
score <- c(A0,B0,A1,B1,A2,B2,A3,B3)
id <- rep(1:8,times = 4, length = 32)
time <- factor(rep(0:3, each = 8, length = 32))
group <- factor(rep(c("A","B"), times =2, each = 4, length = 32))
df <- data.frame(id = id, group = group, time = time, score = score)
Now say I want to specify the variables as objects outside the lme function...
t <- "time"
g <- "group"
dv <- "score"
...and then reformulate them...
mod1 <- lme(fixed = reformulate(t, response = "score"),
random = ~1|id,
data = df)
summary(mod1)
Linear mixed-effects model fit by REML
Data: df
AIC BIC logLik
101.1173 109.1105 -44.55864
Random effects:
Formula: ~1 | id
(Intercept) Residual
StdDev: 0.5574872 0.9138857
Fixed effects: reformulate(t, response = "score")
Value Std.Error DF t-value p-value
(Intercept) 3.410345 0.3784804 21 9.010626 0
time1 3.771009 0.4569429 21 8.252693 0
time2 6.990972 0.4569429 21 15.299445 0
time3 10.469034 0.4569429 21 22.911036 0
Correlation:
(Intr) time1 time2
time1 -0.604
time2 -0.604 0.500
time3 -0.604 0.500 0.500
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-1.6284111 -0.5463271 0.1020036 0.5387158 2.1784156
Number of Observations: 32
Number of Groups: 8
So far so good. But what if we want to add terms to the fixed effects portion of the model using update()?
mod2 <- update(mod1, reformulate(paste(g,"*",t), response = "score"))
We get the error message
Error in reformulate(t, response = "score") :
'termlabels' must be a character vector of length at least one
Obviously I can write the model out again without using update() but I was just wondering if there is a way to make update work.
I gather the problem lies in the way that lme encodes the formula argument when using reformulate.
Any solution much appreciated.
The problem is that when you don't put a formula literal in the call to lme, certain types of functions don't work. In particular, the place where the error is coming from is
formula(mod1)
# Error in reformulate(t, response = "score") :
# 'termlabels' must be a character vector of length at least one
The nlme:::formula.lme tries to evaluate the parameter in the wrong environment. A different way to construct the first model would be
mod1 <- do.call("lme", list(
fixed = reformulate(t, response = "score"),
random = ~1|id,
data = quote(df)))
When you do this, this injects the formula into the call
formula(mod1)
# score ~ time
which will allow the update function to change the formula.
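Putting the pieces together, a minimal sketch of the whole workflow could look like this. The data here are simulated with a similar layout to the question's (8 ids, 2 groups, 4 time points) rather than copied from it, and the random id effect is an assumption added so that the model has something to estimate.

```r
library(nlme)

set.seed(345)
id_effect <- rnorm(8, sd = 0.5)          # assumed per-subject offsets
df <- data.frame(
  id    = rep(1:8, times = 4),
  group = factor(rep(rep(c("A", "B"), each = 4), times = 4)),
  time  = factor(rep(0:3, each = 8))
)
df$score <- c(2, 6, 10, 14)[as.integer(df$time)] +
  id_effect[df$id] + rnorm(32, sd = 0.5)

t <- "time"
g <- "group"

# do.call() evaluates reformulate() first, so the stored call contains
# the formula itself rather than the unevaluated reformulate(...) call
mod1 <- do.call("lme", list(
  fixed  = reformulate(t, response = "score"),
  random = ~ 1 | id,
  data   = quote(df)))

mod2 <- update(mod1, reformulate(paste(g, "*", t), response = "score"))
formula(mod2)
```

quote(df) keeps the data frame out of the stored call, so printing the model shows data = df instead of the full contents of the data.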
I'm encountering an issue with predictInterval() from merTools. The predictions seem to be out of order when compared to the data and midpoint predictions using the standard predict() method for lme4. I can't reproduce the problem with simulated data, so the best I can do is show the lmerMod object and some of my data.
> # display input data to the model
> head(inputData)
id y x z
1 calibration19 1.336 0.531 001
2 calibration20 1.336 0.433 001
3 calibration22 0.042 0.432 001
4 calibration23 0.042 0.423 001
5 calibration16 3.300 0.491 001
6 calibration17 3.300 0.465 001
> sapply(inputData, class)
id y x z
"factor" "numeric" "numeric" "factor"
>
> # fit mixed effects regression with random intercept on z
> lmeFit = lmer(y ~ x + (1 | z), inputData)
>
> # display lmerMod object
> lmeFit
Linear mixed model fit by REML ['lmerMod']
Formula: y ~ x + (1 | z)
Data: inputData
REML criterion at convergence: 444.245
Random effects:
Groups Name Std.Dev.
z (Intercept) 0.3097
Residual 0.9682
Number of obs: 157, groups: z, 17
Fixed Effects:
(Intercept) x
-0.4291 5.5638
>
> # display new data to predict in
> head(predData)
id x z
1 29999900108 0.343 001
2 29999900207 0.315 001
3 29999900306 0.336 001
4 29999900405 0.408 001
5 29999900504 0.369 001
6 29999900603 0.282 001
> sapply(predData, class)
id x z
"factor" "numeric" "factor"
>
> # estimate fitted values using predict()
> set.seed(1)
> preds_mid = predict(lmeFit, newdata=predData)
>
> # estimate fitted values using predictInterval()
> set.seed(1)
> preds_interval = predictInterval(lmeFit, newdata=predData, n.sims=1000) # wrong order
>
> # estimate fitted values just for the first observation to confirm that it should be similar to preds_mid
> set.seed(1)
> preds_interval_first_row = predictInterval(lmeFit, newdata=predData[1,], n.sims=1000)
>
> # display results
> head(preds_mid) # correct prediction
1 2 3 4 5 6
1.256860 1.101074 1.217913 1.618505 1.401518 0.917470
> head(preds_interval) # incorrect order
fit upr lwr
1 1.512410 2.694813 0.133571198
2 1.273143 2.521899 0.009878347
3 1.398273 2.785358 0.232501376
4 1.878165 3.188086 0.625161201
5 1.605049 2.813737 0.379167003
6 1.147415 2.417980 -0.108547846
> preds_interval_first_row # correct prediction
fit upr lwr
1 1.244366 2.537451 -0.04911808
> preds_interval[round(preds_interval$fit,3)==round(preds_interval_first_row$fit,3),] # the correct prediction ends up as observation 1033
fit upr lwr
1033 1.244261 2.457012 -0.0001299777
>
To put this into words, the first observation of my data frame predData should have a fitted value around 1.25 according to the predict() method, but it has a value around 1.5 using the predictInterval() method. This does not seem to be simply due to differences in the prediction approaches, because if I restrict the newdata argument to the first row of predData, the resulting fitted value is around 1.25, as expected.
The fact that I can't reproduce the problem with simulated data leads me to believe it has to do with an attribute of my input or prediction data. I've tried reclassifying the factor variable as character, enforcing the order of the rows prior to fitting the model, between fitting the model and predicting, but found no success.
Is this a known issue? What can I do to avoid it?
I have attempted to make a minimal reproducible example of this issue, but have been unsuccessful.
library(merTools)
d <- data.frame(x = rnorm(1000), z = sample(1:25L, 1000, replace=TRUE),
id = sample(LETTERS, 1000, replace = TRUE))
d$z <- as.factor(d$z)
d$id <- factor(d$id)
d$y <- simulate(~x+(1|z),family = gaussian,
newdata=d,
newparams=list(beta=c(2, -1.1), theta=c(.25),
sigma = c(.23)), seed =463)[[1]]
lmeFit <- lmer(y ~ x + (1|z), data = d)
predData <- data.frame(x = rnorm(25), z = sample(1:25L, 25, replace=TRUE),
id = sample(LETTERS, 25, replace = TRUE))
predData$z <- as.factor(predData$z)
predData$id <- factor(predData$id)
predict(lmeFit, predData)
predictInterval(lmeFit, predData)
predictInterval(lmeFit, predData[1, ])
But playing around with this code, I was not able to recreate the problem observed above. Can you see if you can create a synthetic example and post it?
Or can you first coerce the factors to characters and see whether the same re-ordering issue appears?
I am writing a custom script to bootstrap standard errors in a GLM in R and receive the following error:
Error in eval(predvars, data, env) : numeric 'envir' arg not of length one
Can someone explain what I am doing wrong? My code:
#Number of simulations
sims<-numbersimsdesired
#Set up place to store data
saved.se<-matrix(NA,sims,numberofcolumnsdesired)
y<-matrix(NA,realdata.rownumber)
x1<-matrix(NA,realdata.rownumber)
x2<-matrix(NA,realdata.rownumber)
#Resample entire dataset with replacement
for (sim in 1:sims) {
fake.data<-sample(1:nrow(data5),nrow(data5),replace=TRUE)
#Define variables for GLM using fake data
y<-realdata$y[fake.data]
x1<-realdata$x1[fake.data]
x2<-realdata$x2[fake.data]
#Run GLM on fake data, extract SEs, save SE into matrix
glm.output<-glm(y ~ x1 + x2, family = "poisson", data = fake.data)
saved.se[sim,]<-summary(glm.output)$coefficients[0,2]
}
An example: if we suppose sims = 1000 and we want 10 columns (suppose instead of x1 and x2, we have x1...x10) the goal is a dataset with 1,000 rows and 10 columns containing each explanatory variable's SEs.
There isn't a reason to reinvent the wheel. Here is an example of bootstrapping the standard error of the intercept with the boot package:
set.seed(42)
counts <- c(18,17,15,20,10,20,25,13,12)
x1 <- 1:9
x2 <- sample(9)
DF <- data.frame(counts, x1, x2)
glm1 <- glm(counts ~ x1 + x2, family = poisson(), data=DF)
summary(glm1)$coef
# Estimate Std. Error z value Pr(>|z|)
#(Intercept) 2.08416378 0.42561333 4.896848 9.738611e-07
#x1 0.04838210 0.04370521 1.107010 2.682897e-01
#x2 0.09418791 0.04446747 2.118131 3.416400e-02
library(boot)
intercept.se <- function(d, i) {
glm1.b <- glm(counts ~ x1 + x2, family = poisson(), data=d[i,])
summary(glm1.b)$coef[1,2]
}
set.seed(42)
boot.intercept.se <- boot(DF, intercept.se, R=999)
#ORDINARY NONPARAMETRIC BOOTSTRAP
#
#
#Call:
#boot(data = DF, statistic = intercept.se, R = 999)
#
#
#Bootstrap Statistics :
# original bias std. error
#t1* 0.4256133 0.103114 0.2994377
Edit:
If you prefer doing it without a package:
n <- 999
set.seed(42)
ind <- matrix(sample(nrow(DF), nrow(DF)*n, replace=TRUE), nrow=n)
boot.values <- apply(ind, 1, function(...) {
i <- c(...)
intercept.se(DF, i)
})
sd(boot.values)
#[1] 0.2994377
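If the goal is the matrix described in the question (one row per simulation, one column per coefficient), the loop can be written roughly as below. This is a sketch on the same toy data, not a drop-in fix for the original script. As far as I can tell, the error in the question comes from data = fake.data, which passes a vector of row indices where glm expects a data frame; the [0,2] index would also select zero rows.

```r
set.seed(42)
DF <- data.frame(counts = c(18, 17, 15, 20, 10, 20, 25, 13, 12),
                 x1 = 1:9,
                 x2 = sample(9))

sims <- 200
coef_names <- c("(Intercept)", "x1", "x2")
saved.se <- matrix(NA_real_, nrow = sims, ncol = length(coef_names),
                   dimnames = list(NULL, coef_names))

for (sim in seq_len(sims)) {
  idx <- sample(nrow(DF), replace = TRUE)          # resampled row indices
  fit <- glm(counts ~ x1 + x2, family = poisson(),
             data = DF[idx, ])                     # pass the rows, not the indices
  se  <- summary(fit)$coefficients[, "Std. Error"]
  saved.se[sim, names(se)] <- se                   # stays NA if a term was dropped
}

head(saved.se)
apply(saved.se, 2, sd, na.rm = TRUE)   # bootstrap variability of each SE
```

Filling by name rather than by position keeps the columns aligned when a resample happens to make one coefficient inestimable.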
Just a quick question on how to handle outputs of different lengths using ldply from the plyr package. Here is a simple version of the code I am using and the error I am getting:
# function to collect the coefficients from the regression models:
> SecreatWeapon <- dlply(merged1,~country.x, function(df) {
+ lm(log(child_mortality) ~ log(IHME_usd_gdppc)+ hiv_prev,data=df)
+ })
>
# functions to extract the output of interest
> extract.coefs <- function(mod) c(extract.coefs = summary(mod)$coefficients[,1])
> extract.se.coefs <- function(mod) c(extract.se.coefs = summary(mod)$coefficients[,2])
>
# function to combine the extracted output
> res <- ldply(SecreatWeapon, extract.coefs)
Error in list_to_dataframe(res, attr(.data, "split_labels")) :
Results do not have equal lengths
Here the error is due to the fact that some models will contain NA values so that:
> SecreatWeapon[[1]]
Call:
lm(formula = log(child_mortality) ~ log(IHME_usd_gdppc) + hiv_prev,
data = df)
Coefficients:
(Intercept) log(IHME_usd_gdppc) hiv_prev
-4.6811 0.5195 NA
and therefore the following output won't have the same length; for example:
> summary(SecreatWeapon[[1]])$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.6811000 0.6954918 -6.730633 6.494799e-08
log(IHME_usd_gdppc) 0.5194643 0.1224292 4.242977 1.417349e-04
but for the other one I get
> summary(SecreatWeapon[[10]])$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 18.612698 1.7505236 10.632646 1.176347e-12
log(IHME_usd_gdppc) -2.256465 0.1773498 -12.723244 6.919009e-15
hiv_prev -272.558951 160.3704493 -1.699558 9.784053e-02
Any easy fixes? Thank you very much,
Antonio Pedro.
For any lm object with an NA coefficient, summary(mod)$coefficients drops that row, whereas coef(mod) keeps it, so the two have different lengths. Would you be satisfied with using something like this:
coef.se <- function(mod) {
extract.coefs <- function(mod) coef(mod) # lengths all the same
extract.se.coefs <- function(mod) { summary(mod)$coefficients[,2]}
return( merge( extract.coefs(mod), extract.se.coefs(mod), by='row.names', all=TRUE) )
}
With Roland's example it gives:
> coef.se(fit)
Row.names x y
1 (Intercept) -0.3606557 0.1602034
2 x1 2.2131148 0.1419714
3 x2 NA NA
You could rename the x as coef and the y as se.coef
y <- c(1,2,3)
x1 <- c(0.6,1.1,1.5)
x2 <- c(1,1,1)
fit <- lm(y~x1+x2)
summary(fit)$coef
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) -0.3606557 0.1602034 -2.251236 0.26612016
#x1 2.2131148 0.1419714 15.588457 0.04078329
#function for full matrix, adjusted from getAnywhere(print.summary.lm)
full_coeffs <- function (fit) {
fit_sum <- summary(fit)
cn <- names(fit_sum$aliased)
coefs <- matrix(NA, length(fit_sum$aliased), 4,
dimnames = list(cn, colnames(fit_sum$coefficients)))
coefs[!fit_sum$aliased, ] <- fit_sum$coefficients
coefs
}
full_coeffs(fit)
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) -0.3606557 0.1602034 -2.251236 0.26612016
#x1 2.2131148 0.1419714 15.588457 0.04078329
#x2 NA NA NA NA
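To get back to the original per-country problem, the same idea can be applied across a list of models so that every result has the same length. The sketch below uses base R (split/lapply/rbind) instead of plyr and made-up data in which one group has a constant predictor; extract_se is a hypothetical helper name.

```r
set.seed(1)
dat <- data.frame(country = rep(c("a", "b"), each = 10),
                  x1 = rnorm(20),
                  x2 = c(rep(1, 10), rnorm(10)))   # x2 is constant in country "a"
dat$y <- 1 + 2 * dat$x1 + rnorm(20)

models <- lapply(split(dat, dat$country),
                 function(d) lm(y ~ x1 + x2, data = d))

# Hypothetical helper: standard errors padded with NA for aliased terms
extract_se <- function(mod) {
  se  <- setNames(rep(NA_real_, length(coef(mod))), names(coef(mod)))
  est <- summary(mod)$coefficients
  se[rownames(est)] <- est[, "Std. Error"]
  se
}

coefs <- do.call(rbind, lapply(models, coef))        # coef() keeps the NA entry
ses   <- do.call(rbind, lapply(models, extract_se))
coefs
ses
```

With plyr, passing extract_se (or coef) to ldply should avoid the unequal-length error for the same reason, since every model then returns a vector of the same length.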