I'm trying to build an AIC table for my candidate set of models in R, which were run using mlogit. I've used glm and glmer in the past, and have always used the package AICcmodavg and aictab to extract values and create a model selection table. This package doesn't seem to work for mlogit, so I'm wondering if there are any other ways of creating an AIC table in R besides manual calculation using the log-likelihood value?
Example of mlogit model output:
Call:
mlogit(formula = Case ~ Dist_boulder + Mesohabitat + Depth +
Size + Size^2 | -1, data = reach.dc, method = "nr")
Frequencies of alternatives:
0 1 2 3
1 0 0 0
nr method
5 iterations, 0h:0m:0s
g'(-H)^-1g = 1.19E-05
successive function values within tolerance limits
Coefficients :
Estimate Std. Error z-value Pr(>|z|)
Dist_boulder -0.052165 0.162047 -0.3219 0.74752
Mesohabitatriffle -1.400752 0.612329 -2.2876 0.02216 *
Mesohabitatrun 0.302697 0.420181 0.7204 0.47128
Depth 0.137524 0.162521 0.8462 0.39745
Size 0.336949 0.145036 2.3232 0.02017 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Log-Likelihood: -86.627
example of models run (from my candidate set of 14)
predation.reach<-mlogit(Case ~ Dist_boulder + Mesohabitat + Depth + Size + Size^2 | -1, data=reach.dc)
velocity.reach<-mlogit(Case ~ Mid_vel | -1, data=reach.dc)
spaces.reach<-mlogit(Case~ Embedded + Class | -1, data=reach.dc)
substrate.reach<-mlogit(Case ~ Class | -1, data=reach.dc)
defining candidate set list
cand.set.reach<-list(predation.reach, velocity.reach, spaces.reach, substrate.reach)
bbmle::AICtab() appears to work.
library("mlogit")
m1 <- mlogit(formula = mode ~ price + catch | income,
data = Fish,
alt.subset = c("charter", "pier", "beach"), method = "nr")
m2 <- update(m1, . ~ . - price)
bbmle::AICtab(m1,m2)
## dAIC df
## m1 0.0 6
## m2 412.1 5
By default bbmle::AICtab() gives only delta-AIC and the model degrees of freedom/number of parameters, but you can use optional arguments to get the absolute AIC, AIC weights, etc..
It also works with a list:
L <- list(m1,m2)
bbmle::AICtab(L)
In the tidyverse world,
library(broom)
L %>% purrr::map(augment) %>% bind_rows()
ought to work, but doesn't yet.
Related
I am trying to get the nonlinear formula of y0, a, and b for my curve so I cam plot it on my graph. the summary(nls.mod) output does not show the y0 that I will need to plot the curve and I am not sure why as I have tried everything. The code is below:
# BH this version of plot is used for diagnostic plots for
# BH residuals of a linear model, i.e. using lm.
plot(mdl3 <- lm(ETR ~ wp_Mpa + I(wp_Mpa^2) + I(wp_Mpa^3), data = dat3))
prd <- data.frame(x = seq(-4, 0, by = 0.5))
result <- prd
result$mdl3 <- predict(mdl3, newdat3 = prd)
# BH use nls to fit this model y0+a*exp(b*x)
nls.mod <- nls(ETR ~ y0 + a*exp(b*wp_Mpa), start=c(a = -4, b = 0), data=dat3.no_na)
summary(nls.mod)
and here is the output:
Formula: ETR ~ y0 + a * exp(b * wp_Mpa)
Parameters:
Estimate Std. Error t value Pr(>|t|)
a 85.85515 8.62005 9.960 <2e-16 ***
b 0.14157 0.07444 1.902 0.0593 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 58.49 on 134 degrees of freedom
Number of iterations to convergence: 8
Achieved convergence tolerance: 1.515e-06
as you can see for some reason only the a and b show up but the y0 is supposed to be above that
I tried to reassign the variables and that just continued to give me the same output contacted a stats consultant and they just said I needed to change the variables but it still didnt work
I am using the nls package in R to perform a nonlinear fit. I have specified my independent variable as follows:
t <- seq(1,7)
and my dependent variables as P <- c(0.0246, 0.2735, 0.5697, 0.6715, 0.8655, 0.9614, 1)
I then have tried:
m <- nls(P ~ 1 / (c + q*exp(-b*t))^(1/v)),
but every time I get:
"Error in c + q * exp(-b * t) : non-numeric argument to binary
operator"
Every one of my variables is numeric. Any ideas?
Thanks!
You have more than one problem in your script. The main issue is that you should never use names which are used by R: t is the matrix transpose, c is a common method to create vectors, and q is the quit instruction. nls() will not try to fit them, as they are already defined. I recommend using more meaningful and less dangerous variables such as Coef1, Coef2, …
The second problem is that you are trying to fit a model with 4 variables to a dataset with 7 data... This may yield singularities and other problems.
For the sake of the argument, I have reduced your model to three variables, and changed some names:
Time <- seq(1,7)
Prob <- c(0.0246, 0.2735, 0.5697, 0.6715, 0.8655, 0.9614, 1)
plot(Time, Prob)
And now we perform the nls() fit:
Fit <- nls(Prob ~ 1 / (Coef1 + Coef2 * exp(-Coef3 * Time)))
X <- data.frame(Time = seq(0, 7, length.out = 100))
Y <- predict(object = Fit, newdata = X)
lines(X$Time, Y)
And a summary of the results:
summary(Fit)
# Formula: Prob ~ 1/(Coef1 + Coef2 * exp(-Coef3 * Time))
#
# Parameters:
# Estimate Std. Error t value Pr(>|t|)
# Coef1 1.00778 0.06113 16.487 7.92e-05 ***
# Coef2 23.43349 14.42378 1.625 0.1796
# Coef3 1.04899 0.21892 4.792 0.0087 **
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.06644 on 4 degrees of freedom
#
# Number of iterations to convergence: 12
# Achieved convergence tolerance: 3.04e-06
I know it is not exactly what you wanted, but I hope it helps.
I am using NBA shot data and am attempting to create shot prediction models using different regression techniques. However, I am running into the following warning message when trying to use a logistic regression model: Warning message:
glm.fit: algorithm did not converge. Also, it seems that the predictions do not work at all (not changed from the original Y variable (make or miss)). I will provide my code below. I got the data from here: Shot Data.
nba_shots <- read.csv("shot_logs.csv")
library(dplyr)
library(ggplot2)
library(data.table)
library("caTools")
library(glmnet)
library(caret)
nba_shots_clean <- data.frame("game_id" = nba_shots$GAME_ID, "location" =
nba_shots$LOCATION, "shot_number" = nba_shots$SHOT_NUMBER,
"closest_defender" = nba_shots$CLOSEST_DEFENDER,
"defender_distance" = nba_shots$CLOSE_DEF_DIST, "points" = nba_shots$PTS,
"player_name" = nba_shots$player_name, "dribbles" = nba_shots$DRIBBLES,
"shot_clock" = nba_shots$SHOT_CLOCK, "quarter" = nba_shots$PERIOD,
"touch_time" = nba_shots$TOUCH_TIME, "game_result" = nba_shots$W
, "FGM" = nba_shots$FGM)
mean(nba_shots_clean$shot_clock) # NA
# this gave NA return which means that there are NAs in this column that we
# need to clean up
# if the shot clock was NA I assume that this means it was the end of a
# quarter and the shot clock was off.
# For now I'm going to just set all of these NAs equal to zero, so all zeros
# mean it is the end of a quarter
# checking the amount of NAs
last_shots <- nba_shots_clean[is.na(nba_shots_clean$shot_clock),]
nrow(last_shots) # this tells me there is 5567 shots taken when the shot
# clock was turned off at the end of a quarter
# setting these NAs equal to zero
nba_shots_clean[is.na(nba_shots_clean)] <- 0
# checking to see if it worked
nrow(nba_shots_clean[is.na(nba_shots_clean$shot_clock),]) # it worked
# create a test and train set
split = sample.split(nba_shots_clean, SplitRatio=0.75)
nbaTrain = subset(nba_shots_clean, split==TRUE)
nbaTest = subset(nba_shots_clean, split==FALSE)
# logistic regression
nbaLogitModel <- glm(FGM ~ location + shot_number + defender_distance +
points + dribbles + shot_clock + quarter + touch_time, data=nbaTrain,
family="binomial", na.action = na.omit)
nbaPredict = predict(nbaLogitModel, newdata=nbaTest, type="response")
cm = table(nbaTest$FGM, nbaPredict > 0.5)
print(cm)
This gives me the output of the following, which tells me the prediction didn't do anything, as it's the same as before.
FALSE TRUE
0 21428 0
1 0 17977
I would really appreciate any guidance.
The confusion matrix of your model (model prediction vs. nbaTest$FGM) tells you that your model has a 100% accuracy !
This is due to the points variable in your dataset which is perfectly associated to the dependent variable:
table(nba_shots_clean$points, nba_shots_clean$FGM)
0 1
0 87278 0
2 0 58692
3 0 15133
Try to delete points from your model:
# create a test and train set
set.seed(1234)
split = sample.split(nba_shots_clean, SplitRatio=0.75)
nbaTrain = subset(nba_shots_clean, split==TRUE)
nbaTest = subset(nba_shots_clean, split==FALSE)
# logistic regression
nbaLogitModel <- glm(FGM ~ location + shot_number + defender_distance +
dribbles + shot_clock + quarter + touch_time, data=nbaTrain,
family="binomial", na.action = na.omit)
summary(nbaLogitModel)
No warning messages now and the estimated model is:
Call:
glm(formula = FGM ~ location + shot_number + defender_distance +
dribbles + shot_clock + quarter + touch_time, family = "binomial",
data = nbaTrain, na.action = na.omit)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.8995 -1.1072 -0.9743 1.2284 1.6799
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.427688 0.025446 -16.808 < 2e-16 ***
locationH 0.037920 0.012091 3.136 0.00171 **
shot_number 0.007972 0.001722 4.630 0.000003656291 ***
defender_distance -0.006990 0.002242 -3.117 0.00182 **
dribbles 0.010582 0.004859 2.178 0.02941 *
shot_clock 0.032759 0.001083 30.244 < 2e-16 ***
quarter -0.043100 0.007045 -6.118 0.000000000946 ***
touch_time -0.038006 0.005700 -6.668 0.000000000026 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 153850 on 111532 degrees of freedom
Residual deviance: 152529 on 111525 degrees of freedom
AIC: 152545
Number of Fisher Scoring iterations: 4
The confusion matrix is:
nbaPredict = predict(nbaLogitModel, newdata=nbaTest, type="response")
cm = table(nbaTest$FGM, nbaPredict > 0.5)
print(cm)
FALSE TRUE
0 21554 5335
1 16726 5955
I want to see whether the fixed effect Group2 in my model is significant. The model is:
Response ~ Group1 + Group2 + Gender + Age + BMI + (1 | Subject)
To check the significance I create a null model not containing the effect Group2:
Resp.null = lmer(Response~Group1+Gender+Age+BMI+(1|Subject),
data=mydata,REML=FALSE)
and the full model containing the effect Group2:
Resp.model = lmer(Response~Group1+Group2+Gender+Age+BMI+(1|Subject),
data=mydata,REML=FALSE)
Then I use anova() to compare the two, but I get an error:
anova(Resp.null, Resp.model)
## Error in anova.merMod(Resp.null, Resp.model) :
## models were not all fitted to the same size of dataset
I think that the problem is that Group1 contains NaN, but I thought that linear mixed models were robust to missing data.
How can I solve this problem and compare the two models?
Do I have to delete the rows corresponding to NaN and fit Resp.null without these rows?
The data can be downloaded here.
Please note that you should replace "<undefined>" with NaN like this:
mydata = read.csv("mydata.csv")
mydata[mydata == "<undefined>"] <- NA
To avoid the "models were not all fitted to the same size of dataset" error in anova, you must fit both models on the exact same subset of data.
There are two simple ways to do this, and while this reproducible example uses lm and update, for lmer objects the same approach should work:
# 1st approach
# define a convenience wrapper
update_nested <- function(object, formula., ..., evaluate = TRUE){
update(object = object, formula. = formula., data = object$model, ..., evaluate = evaluate)
}
# prepare data with NAs
data(mtcars)
for(i in 1:ncol(mtcars)) mtcars[i,i] <- NA
xa <- lm(mpg~cyl+disp, mtcars)
xb <- update_nested(xa, .~.-cyl)
anova(xa, xb)
## Analysis of Variance Table
##
## Model 1: mpg ~ cyl + disp
## Model 2: mpg ~ disp
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 26 256.91
## 2 27 301.32 -1 -44.411 4.4945 0.04371 *
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 2nd approach
xc <- update(xa, .~.-cyl, data=na.omit(mtcars[ , all.vars(formula(xa))]))
anova(xa, xc)
## Analysis of Variance Table
##
## Model 1: mpg ~ cyl + disp
## Model 2: mpg ~ disp
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 26 256.91
## 2 27 301.32 -1 -44.411 4.4945 0.04371 *
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
If however you're only interested in testing a single variable (e.g. Group2), then perhaps the Anova() or linearHypothesis() in car would work as well for this usecase.
See also:
How to update `lm` or `glm` model on same subset of data?
R error which says "Models were not all fitted to the same size of dataset"
Fit Resp.model first, then use Resp.model#frame as data argument.
Resp.null = lmer(Response~Group1+Gender+Age+BMI+(1|Subject),
data=Resp.model#frame,REML=FALSE)
I want to do a MAM, but I'm having difficulty in removing some terms:
full.model<-glm(SSB_sq~Veg_height+Bare+Common+Birds_Foot+Average_March+Average_April+
Average_May+Average_June15+Average_June20+Average_June25+Average_July15
+Average_July20+Average_July25,family="poisson")
summary(full.model)
I believe I have to remove these terms to start the MAM like so:
model1<-update(full.model,~.-Veg_height:Bare:Common:Birds_Foot:Average_March
:Average_April:Average_May:Average_June15:Average_June20:Average_June25:
Average_July15:Average_July20:Average_July25,family="poisson")
summary(model1)
anova(model1,full.model,test="Chi")
But I get this output:
anova(model1,full.model,test="Chi")
Analysis of Deviance Table
Model 1: SSB_sq ~ Veg_height + Bare + Common + Birds_Foot + Average_March +
Average_April + Average_May + Average_June15 + Average_June20 +
Average_June25 + Average_July15 + Average_July20 + Average_July25
Model 2: SSB_sq ~ Veg_height + Bare + Common + Birds_Foot + Average_March +
Average_April + Average_May + Average_June15 + Average_June20 +
Average_June25 + Average_July15 + Average_July20 + Average_July25
Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1 213 237.87
2 213 237.87 0 0
I've tried putting plus signs in model1 instead of colons, as I was clutching at straws when reading my notes but the same thing happens.
Why are both my models the same? I've tried searching on Google but I'm not very good at the terminology so my searches aren't bringing up much.
If I read your intention correctly, are you trying to fit a null model with no terms in it? If so, a simpler way is just to use the SSB_sq ~ 1 as the formula, meaning a model with only an intercept.
fit <- lm(sr ~ ., data = LifeCycleSavings)
fit0 <- lm(sr ~ 1, data = LifeCycleSavings)
## or via an update:
fit01 <- update(fit, . ~ 1)
Which gives, for example:
> anova(fit)
Analysis of Variance Table
Response: sr
Df Sum Sq Mean Sq F value Pr(>F)
pop15 1 204.12 204.118 14.1157 0.0004922 ***
pop75 1 53.34 53.343 3.6889 0.0611255 .
dpi 1 12.40 12.401 0.8576 0.3593551
ddpi 1 63.05 63.054 4.3605 0.0424711 *
Residuals 45 650.71 14.460
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> anova(fit, fit0)
Analysis of Variance Table
Model 1: sr ~ pop15 + pop75 + dpi + ddpi
Model 2: sr ~ 1
Res.Df RSS Df Sum of Sq F Pr(>F)
1 45 650.71
2 49 983.63 -4 -332.92 5.7557 0.0007904 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
An explanation of the formulae I use:
The first model used the shortcut ., which means all remaining variables in argument data (in my model it meant all variables in LifeCycleSavings on the RHS of the formula, except for sr which is already on the LHS).
In the second model (fit0), we only include 1 on the RHS of the formula. In R, 1 means an intercept, so sr ~ 1 means fit an intercept-only model. By default, an intercept is assumed, hence we did not need 1 when specifying the first model fit.
If you want to suppress an intercept, add - 1 or + 0 to your formula.
For your data, the first model would be:
full.model <- glm(SSB_sq ~ ., data = FOO, family = "poisson")
where FOO is the data frame holding your variables - you are using a data frame, aren't you? The null, intercept-only model would be specified using one of:
null.model <- glm(SSB_sq ~ 1, data = FOO, family = "poisson")
or
null.model <- update(full.model, . ~ 1)