Updating model factor levels within function not working properly - r

I'm running some GLM models in R on a some data related to feeding trials I am doing. I'm regressing my variables of interest on a two predictors: one factor with three levels and one continuous variable. I want to compare the intercepts for each level of the factor to one another to determine if they're different. To do this, I wrote a function (called interceptCompare in the reproducible code below) which relevels the factor and updates the model and then saves the results of each model. It's my quick way of doing all the pair-wise comparisons of the intercepts.
The problem is that when I run the function, it doesn't appear to properly update the model. Each item of the list returned is the same, when they should be changing so that each item has a different level of the factor as the "(Intercept)" that the other levels are being compared against. I suspect it has something to do with the environment of the function, but I'm not sure. I haven't been able to find a similar example on stackoverflow or google.
Here's what should be a reproducible example:
food <- as.factor(rep(c("a", "b", "c"), each = 20))
variable <- rbinom(60, 1, 0.7)
movement <- rgamma(60, 10, 2)
binomial.model <- glm(variable ~ food,
family = "binomial")
gamma.model <- glm(movement ~ food,
family = Gamma)
interceptCompare <- function(model, factor) {
results <- list() # empty list to store results
for (i in unique(factor)) {
factor <- relevel(factor, ref = i)
model <- update(model)
results[[i]] <- summary(model)$coefficients[1:3, ]
}
results <- lapply(results, function(x) round(x, 4))
return(results)
}
interceptCompare(binomial.model, food)
interceptCompare(gamma.model, food)

You will need to add one line, in order to change the data, and use it within the update:
interceptCompare <- function(model, factor) {
results <- list() # empty list to store results
s <- deparse(substitute(factor))#ADD THIS LINE
for (i in unique(factor)) {
factor <- relevel(factor, ref = i)
model[["model"]][[s]] <- factor #CHANGE THE DATA IN THE MODEL
model <- update(model,data=model[["model"]])# UPDATE THE MODEL
results[[i]] <- summary(model)$coefficients[1:3, ]
}
results <- lapply(results, function(x) round(x, 4))
return(results)
}
interceptCompare(binomial.model, food)
$a
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.3863 0.5590 2.4799 0.0131
foodb -0.7673 0.7296 -1.0516 0.2930
foodc -0.2877 0.7610 -0.3780 0.7054
$b
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.6190 0.4688 1.3205 0.1867
fooda 0.7673 0.7296 1.0516 0.2930
foodc 0.4796 0.6975 0.6876 0.4917
$c
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.0986 0.5164 2.1275 0.0334
foodb -0.4796 0.6975 -0.6876 0.4917
fooda 0.2877 0.7610 0.3780 0.7054
interceptCompare(gamma.model, food)
$a
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.2246 0.0156 14.3919 0.0000
foodb -0.0170 0.0213 -0.8022 0.4257
foodc -0.0057 0.0218 -0.2608 0.7952
$b
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.2076 0.0144 14.3919 0.0000
fooda 0.0170 0.0213 0.8022 0.4257
foodc 0.0114 0.0210 0.5421 0.5898
$c
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.2189 0.0152 14.3919 0.0000
foodb -0.0114 0.0210 -0.5421 0.5898
fooda 0.0057 0.0218 0.2608 0.7952

You need to be much more careful when trying to swap out symbols from formulas which is what you are doing. You need to put it in terms that the R language can understand. You want to pass in the name "food" not the values stored in the "food" vector as you are now. Here's an update that seems to do what you were trying to do
interceptCompare <- function(model, factor) {
sym <- substitute(factor)
results <- list() # empty list to store results
for (i in unique(factor)) {
change <- eval(bquote(~.-.(sym)+relevel(.(sym), ref=.(i))))
new_model <- update(model, change)
results[[i]] <- summary(new_model)$coefficients[1:3, ]
}
results <- lapply(results, function(x) round(x, 4))
return(results)
}
Here we capture the name "food" with the substitute. Then we use bquote() to build a new formula that will remove the original value, and relevel the factor variable with a particular reference. Then we save this to a new object so we don't keep updating the same model. For the binomial.model, this returns
$`a`
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.8473 0.4879 1.7364 0.0825
relevel(food, ref = "a")b 0.0000 0.6901 0.0000 1.0000
relevel(food, ref = "a")c -0.8473 0.6619 -1.2801 0.2005
$b
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.8473 0.4879 1.7364 0.0825
relevel(food, ref = "b")a 0.0000 0.6901 0.0000 1.0000
relevel(food, ref = "b")c -0.8473 0.6619 -1.2801 0.2005
$c
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.0000 0.4472 0.0000 1.0000
relevel(food, ref = "c")a 0.8473 0.6619 1.2801 0.2005
relevel(food, ref = "c")b 0.8473 0.6619 1.2801 0.2005
You can see how it changed the ref= at each iteration

Related

R: Different result from glm and mle2 package in R

So I want to find the estimate parameter using GLM and compare it with mle2 package.
Here's my code for GLM
d <- read.delim("http://dnett.github.io/S510/Disease.txt")
d$disease=factor(d$disease)
d$ses=factor(d$ses)
d$sector=factor(d$sector)
str(d)
glm2 <- glm(disease~ses+sector, family=binomial(link=logit), data=d)
summary(glm2)
And my code for mle2()
y<-as.numeric(as.character(d$disease))
x1<-as.numeric(as.character(d$age))
x2<-as.numeric(as.character(d$sector))
x3<-as.numeric(as.character(d$ses))
library(bbmle)
nlldbin=function(A,B,C,D){
eta<-A+B*(x3==2)+C*(x3==3)+D*(x2==2)
p<-1/(1+exp(-eta))
joint.pdf= (p^y)*((1-p)^(1-y))
-sum(joint.pdf, log=TRUE ,na.rm=TRUE)
}
st <- list(A=0.0001,B=0.0001,C=0.0001,D=0.0001)
est_mle2<-mle2(start=st,nlldbin,hessian=TRUE)
summary(est_mle2)
But the result is quiet different. Please help me to fix this, thank you!
> summary(est_mle2)
Maximum likelihood estimation
Call:
mle2(minuslogl = nlldbin, start = st, hessian.opts = TRUE)
Coefficients:
Estimate Std. Error z value Pr(z)
A -20.4999 5775.1484 -0.0035 0.9972
B -5.2499 120578.9515 0.0000 1.0000
C -7.9999 722637.2670 0.0000 1.0000
D -2.2499 39746.6639 -0.0001 1.0000
> summary(glm2)
Call:
glm(formula = disease ~ ses + sector, family = binomial(link = logit),
data = d)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.52001 0.33514 -4.535 5.75e-06 ***
ses2 -0.08525 0.41744 -0.204 0.838177
ses3 0.16086 0.39261 0.410 0.682019
sector2 1.28098 0.34140 3.752 0.000175 ***
I'm not sure your definition of eta is correct. I would use the model matrix.
X <- model.matrix(~ ses + sector, data = d)
nlldbin <- function(A,B,C,D){
eta <- X %*% c(A, B, C, D)
p <- 1/(1+exp(-eta))
logpdf <- y*log(p) + (1-y)*log(1-p)
-sum(logpdf)
}
This line
-sum(joint.pdf, log=TRUE ,na.rm=TRUE)
is wrong. sum doesn't have a special log argument; what you're doing is adding the value TRUE (which gets converted to 1) to the pdf.
What you want is
-sum(log(joint.pdf), na.rm=TRUE)
but this is also not very good for numerical reasons, as the pdf is likely to underflow. A better way of writing it would be
logpdf <- y*log(p) + (1-y)*log(1-p)
-sum(logpdf, na.rm=TRUE)

Change Y intercept in Poisson GLM R

Background: I have the following data that I run a glm function on:
location = c("DH", "Bos", "Beth")
count = c(166, 57, 38)
#make into df
df = data.frame(location, count)
#poisson
summary(glm(count ~ location, family=poisson))
Output:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.6376 0.1622 22.424 < 2e-16 ***
locationBos 0.4055 0.2094 1.936 0.0529 .
locationDH 1.4744 0.1798 8.199 2.43e-16 ***
Problem: I would like to change the (Intercept) so I can get all my values relative to Bos
I looked Change reference group using glm with binomial family and How to force R to use a specified factor level as reference in a regression?. I tried there method and it did not work, and I am not sure why.
Tried:
df1 <- within(df, location <- relevel(location, ref = 1))
#poisson
summary(glm(count ~ location, family=poisson, data = df1))
Desired Output:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) ...
locationBeth ...
locationDH ...
Question: How do I solve this problem?
I think your problem is that you are modifying the data frame, but in your model you are not using the data frame. Use the data argument in the model to use the data in the data frame.
location = c("DH", "Bos", "Beth")
count = c(166, 57, 38)
# make into df
df = data.frame(location, count)
Note that location by itself is a character vector. data.frame() coerces it to a factor by default in the data frame. After this conversion, we can use relevel to specify the reference level.
df$location = relevel(df$location, ref = "Bos") # set Bos as reference
summary(glm(count ~ location, family=poisson, data = df))
# Call:
# glm(formula = count ~ location, family = poisson, data = df)
# ...
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 4.0431 0.1325 30.524 < 2e-16 ***
# locationBeth -0.4055 0.2094 -1.936 0.0529 .
# locationDH 1.0689 0.1535 6.963 3.33e-12 ***
# ...

extract summary from matrix lm object

I'd like to get the coefficients from the summary section of an lm object, except I inputted a matrix and I am getting null for the summary part. Here is my code:
n=12
y=rnorm(n,23,1)
x1=rnorm(n,23,1)
x2=rnorm(n,15.5,1)
lm1=lm(y~x1+x2)
n2=10
b0=4;b1=2;b2=3
sim1<-function(){
randmat=matrix(rnorm(n*n2,0,8),n,n2)
x1mat=matrix(x1,n,n2)
x2mat=matrix(x2,n,n2)
return(b0+b1*x1mat+b2*x2mat+randmat)
}
sim1=sim1()
lm1=lm(sim1~x1+x2)
c2=summary(lm1)$coefficients
> c2
NULL
what I want is this (but repeated):
lm2=lm(sim1[,1]~x1+x2)
summary(lm2)$coefficients
Does anyone know how to extract these? Thanks
-Rik
Another way is to do the following after the end of the following line of your code.
lm1=lm(sim1~x1+x2) #this runs 10 models
All the coefficients will be stored in the list summary(lm1) as Response Y1 ... to Response Y10 (i.e. 10 models as many as ncol(sim1)).
In order to get the coefficients from each model back you could do:
all_coef <- lapply( paste0('Response Y', 1:ncol(sim1)),
function(x) summary(lm1)[[x]]$coefficients)
Or as #Rik mentions in the comment it will be faster if summary(lm1) is not repeated in the lapply loop in case you have a big matrix.
the_sum <- summary(lm1)
all_coef <- lapply( paste0('Response Y', 1:ncol(sim1)),
function(x) the_sum[[x]]$coefficients)
And the output would be:
> all_coef
[[1]]
Estimate Std. Error t value Pr(>|t|)
(Intercept) 135.242552 80.136427 1.687654 0.1257496
x1 -4.777486 2.953534 -1.617549 0.1402142
x2 4.464435 3.891641 1.147186 0.2808857
[[2]]
Estimate Std. Error t value Pr(>|t|)
(Intercept) 119.1772823 111.603046 1.06786765 0.3133851
x1 -0.1376013 4.113277 -0.03345297 0.9740435
x2 -1.2946027 5.419744 -0.23886785 0.8165585
[[3]]
Estimate Std. Error t value Pr(>|t|)
(Intercept) -51.329923 63.495202 -0.8084063 0.4397018
x1 3.721227 2.340199 1.5901325 0.1462682
x2 3.793981 3.083498 1.2304147 0.2497304
[[4]]
Estimate Std. Error t value Pr(>|t|)
(Intercept) 124.8606014 57.669842 2.16509352 0.05857967
x1 -1.2517705 2.125498 -0.58893044 0.57039201
x2 -0.1159803 2.800603 -0.04141263 0.96787111
#...and so on until 10
To get the individual coefficients for a model just do:
all_coef[[<the_number_you_want>]]

extracting linear model coefficients into a vector within a loop

I am trying to create sample of 200 linear model coefficients using a loop in R. As an end result, I want a vector containing the coefficients.
for (i in 1:200) {
smpl_5 <- population[sample(1:1000, 5), ]
model_5 <- summary(lm(y~x, data=smpl_5))
}
I can extract the coefficients easy enough, but I am having trouble outputting them into a vector within the loop. Any Suggestions?
You can use replicate for this if you like. In your case, because the number of coefficients is identical for all models, it'll return an array as shown in the example below:
d <- data.frame(x=runif(1000))
d$y <- d$x * 0.123 + rnorm(1000, 0, 0.01)
coefs <- replicate(3, {
xy <- d[sample(nrow(d), 100), ]
coef(summary(lm(y~x, data=xy)))
})
coefs
# , , 1
#
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.001361961 0.002091297 0.6512516 5.164083e-01
# x 0.121142447 0.003624717 33.4212114 2.235307e-55
#
# , , 2
#
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.003213314 0.001967050 1.63357 1.055579e-01
# x 0.118026828 0.003332906 35.41259 1.182027e-57
#
# , , 3
#
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.003366678 0.001990226 1.691606 9.389883e-02
# x 0.119408470 0.003370190 35.430783 1.128070e-57
Access particular elements with normal array indexing, e.g.:
coefs[, , 1] # return the coefs for the first model
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.001361961 0.002091297 0.6512516 5.164083e-01
# x 0.121142447 0.003624717 33.4212114 2.235307e-55
So, for your problem, you could use:
replicate(200, {
smpl_5 <- population[sample(1:1000, 5), ]
coef(summary(lm(y~x, data=smpl_5)))
})

handling outputs with different lengths using ldply

Just a quick question on how to handle outputs of different lengths using ldply from the plyr package. Here is a simple version of the code I am using and the error I am getting:
# function to collect the coefficients from the regression models:
> SecreatWeapon <- dlply(merged1,~country.x, function(df) {
+ lm(log(child_mortality) ~ log(IHME_usd_gdppc)+ hiv_prev,data=df)
+ })
>
# functions to extract the output of interest
> extract.coefs <- function(mod) c(extract.coefs = summary(mod)$coefficients[,1])
> extract.se.coefs <- function(mod) c(extract.se.coefs = summary(mod)$coefficients[,2])
>
# function to combine the extracted output
> res <- ldply(SecreatWeapon, extract.coefs)
Error in list_to_dataframe(res, attr(.data, "split_labels")) :
Results do not have equal lengths
Here the error is due to the fact that some models will contain NA values so that:
> SecreatWeapon[[1]]
Call:
lm(formula = log(child_mortality) ~ log(IHME_usd_gdppc) + hiv_prev,
data = df)
Coefficients:
(Intercept) log(IHME_usd_gdppc) hiv_prev
-4.6811 0.5195 NA
and therefore the following output won't have the same length; for example:
> summary(SecreatWeapon[[1]])$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.6811000 0.6954918 -6.730633 6.494799e-08
log(IHME_usd_gdppc) 0.5194643 0.1224292 4.242977 1.417349e-04
but for the other one I get
> summary(SecreatWeapon[[10]])$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 18.612698 1.7505236 10.632646 1.176347e-12
log(IHME_usd_gdppc) -2.256465 0.1773498 -12.723244 6.919009e-15
hiv_prev -272.558951 160.3704493 -1.699558 9.784053e-02
Any easy fixes? Thank you very much,
Antonio Pedro.
The summary.lm( . ) function accessed with $coefficients gives different output than the coef would with an lm argument for any lm-object with an NA "coefficient". Would you be satisfied with using something like this:
coef.se <- function(mod) {
extract.coefs <- function(mod) coef(mod) # lengths all the same
extract.se.coefs <- function(mod) { summary(mod)$coefficients[,2]}
return( merge( extract.coefs(mod), extract.se.coefs(mod), by='row.names', all=TRUE) )
}
With Roland's example it gives:
> coef.se(fit)
Row.names x y
1 (Intercept) -0.3606557 0.1602034
2 x1 2.2131148 0.1419714
3 x2 NA NA
You could rename the x as coef and the y as se.coef
y <- c(1,2,3)
x1 <- c(0.6,1.1,1.5)
x2 <- c(1,1,1)
fit <- lm(y~x1+x2)
summary(fit)$coef
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) -0.3606557 0.1602034 -2.251236 0.26612016
#x1 2.2131148 0.1419714 15.588457 0.04078329
#function for full matrix, adjusted from getAnywhere(print.summary.lm)
full_coeffs <- function (fit) {
fit_sum <- summary(fit)
cn <- names(fit_sum$aliased)
coefs <- matrix(NA, length(fit_sum$aliased), 4,
dimnames = list(cn, colnames(fit_sum$coefficients)))
coefs[!fit_sum$aliased, ] <- fit_sum$coefficients
coefs
}
full_coeffs(fit)
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) -0.3606557 0.1602034 -2.251236 0.26612016
#x1 2.2131148 0.1419714 15.588457 0.04078329
#x2 NA NA NA NA

Resources