weights option in GAM - r

My dataset has many redundant observations (but each observation should still be counted), so I am considering the 'weights' argument in GAM because it significantly reduces computation time.
The gam function (in the mgcv package) states that the two are 'equivalent' (from ?gam, on the weights argument):
"Note that a weight of 2, for example, is equivalent to having made exactly the same observation twice."
But it does not seem right:
library(mgcv)
yy = c(5,2,8,9)
xx = 1:4
wgts = c(3,2,4,1)
yy2 = rep(yy, wgts)
xx2 = rep(xx, wgts)
mod1 = gam(yy2 ~ xx2)
mod2 = gam(yy ~ xx, weights = wgts)
mod3 = gam(yy ~ xx, weights = wgts / mean(wgts))
predict(mod1,data.frame(xx2=1:4))
predict(mod2,data.frame(xx=1:4))
predict(mod3,data.frame(xx=1:4))
The estimates are identical in all three models.
Standard errors are the same in models 2 and 3 but different in model 1.
GCV scores differ across all three models.
I understand the GCV scores can differ. But how can we say the models are equivalent if the standard errors differ? Is this an error, or is there a good explanation for it?

The issue you saw is not specific to GAM. You have used gam to fit a purely parametric model, in which case gam behaves almost the same as lm. To answer your questions, it is sufficient to focus on the linear regression case: whatever happens for a linear model also happens for GLMs and GAMs. Here is how we can reproduce the issue with lm:
yy <- c(5,2,8,9)
xx <- 1:4
wgts <- c(3,2,4,1)
yy2 <- rep(yy,wgts)
xx2 <- rep(xx,wgts)
fit1 <- lm(yy2 ~ xx2)
fit2 <- lm(yy ~ xx, weights = wgts)
fit3 <- lm(yy ~ xx, weights = wgts/mean(wgts))
summary1 <- summary(fit1)
summary2 <- summary(fit2)
summary3 <- summary(fit3)
pred1 <- predict(fit1, list(xx2 = xx), interval = "confidence", se.fit = TRUE)
pred2 <- predict(fit2, list(xx = xx), interval = "confidence", se.fit = TRUE)
pred3 <- predict(fit3, list(xx = xx), interval = "confidence", se.fit = TRUE)
All models have the same regression coefficients, but the other results may differ. You asked:
1. For the weighted regressions fit2 and fit3, why is almost everything the same except the residual standard error?
2. Why is a weighted regression (fit2 or fit3) not equivalent to an ordinary regression with ties?
Your first question is about the invariance of weighted least squares to a rescaling of the weights. Here is a brief summary:
If we rescale the weights by an arbitrary positive constant, only the residual standard error and the unscaled covariance matrix change. Such a change does not imply a different, non-equivalent model; everything related to prediction is unaffected. In weighted regression, do not look at sigma2 alone: it is only a marginal quantity. What really matters is the estimated variance of each observation, sigma2 / w_i. If you divide all your weights by 2, sigma2 is halved as well, so the ratio sigma2 / w_i is unchanged.
summary2$coef
summary3$coef
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 2.128713 3.128697 0.6803832 0.5664609
#xx 1.683168 1.246503 1.3503125 0.3094222
pred2
pred3
#$fit
# fit lwr upr
#1 3.811881 -5.0008685 12.62463
#2 5.495050 -0.1299942 11.12009
#3 7.178218 0.6095820 13.74685
#4 8.861386 -1.7302209 19.45299
#
#$se.fit
# 1 2 3 4
#2.048213 1.307343 1.526648 2.461646
#
#$df
#[1] 2
#
#$residual.scale ## for `pred2`
#[1] 3.961448
#
#$residual.scale ## for `pred3`
#[1] 2.50544
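We can verify this invariance directly with the objects above: the estimated variance of observation i, sigma2 / w_i, is identical under either scaling of the weights.
summary2$sigma^2 / wgts                    # fit2: weights as given
summary3$sigma^2 / (wgts / mean(wgts))     # fit3: rescaled weights -- identical values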
Your second question is about the meaning of weights. Weights are used to model a heteroscedastic response, something ordinary least squares regression cannot account for. Weights are proportional to the reciprocal of the variance: you give bigger weights to the observations with smaller expected errors. Weights can be non-integer, so they have no natural interpretation in terms of repeated data. Thus, what is written in the mgcv documentation is not rigorously correct.
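To make the intended use concrete, here is a minimal sketch (the variable names are illustrative, not from the question) where the response variance is known up to a constant and the weights are its reciprocal:
set.seed(1)
x <- 1:100
v <- x / 10                           # Var(y_i) is proportional to v_i
y <- 2 + 3 * x + rnorm(100, sd = sqrt(v))
fit_w <- lm(y ~ x, weights = 1 / v)   # weights = reciprocal variances
coef(fit_w)                           # close to the true values 2 and 3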
The real difference between fit1 and fit2 is the degrees of freedom. Check the output above for df = n - p: n is the number of observations, while p is the number of non-NA coefficients, so n - p is the residual degrees of freedom. For both models we have p = 2 (intercept and slope), but fit1 has n = 10 while fit2 has n = 4. This has a dramatic effect on inference, as the standard errors for coefficients and predictions (hence the confidence intervals) now differ. These two models are far from equivalent.
summary1$coef
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 2.128713 1.5643486 1.360766 0.21068210
#xx2 1.683168 0.6232514 2.700625 0.02704784
summary2$coef
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 2.128713 3.128697 0.6803832 0.5664609
#xx 1.683168 1.246503 1.3503125 0.3094222
pred1
#$fit
# fit lwr upr
#1 3.811881 1.450287 6.173475
#2 5.495050 3.987680 7.002419
#3 7.178218 5.417990 8.938446
#4 8.861386 6.023103 11.699669
#
#$se.fit
# 1 2 3 4
#1.0241066 0.6536716 0.7633240 1.2308229
#
#$df # note, this is `10 - 2 = 8`
#[1] 8
#
#$residual.scale
#[1] 1.980724
pred2
#$fit
# fit lwr upr
#1 3.811881 -5.0008685 12.62463
#2 5.495050 -0.1299942 11.12009
#3 7.178218 0.6095820 13.74685
#4 8.861386 -1.7302209 19.45299
#
#$se.fit
# 1 2 3 4
#2.048213 1.307343 1.526648 2.461646
#
#$df # note, this is `4 - 2 = 2`
#[1] 2
#
#$residual.scale ## for `pred2`
#[1] 3.961448
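As a quick check with the objects above, the coefficient standard errors differ exactly through the residual degrees of freedom: the (weighted) residual sum of squares is the same in both fits, but fit1 divides it by 10 - 2 = 8 while fit2 divides it by 4 - 2 = 2.
summary2$coef[, "Std. Error"] / summary1$coef[, "Std. Error"]
# both ratios equal sqrt(8 / 2) = 2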

Related

Find the parameter estimates for each random term in a binomial GLMM (lme4)?

Does anyone know how to extract the parameter estimates of a random term (including SE, t ratio and p value) when using the (1 | …) syntax in a glmer model? I'm only able to access the variances and standard deviations with the summary function.
Some background: I used cohort and period random terms (both factors), where period = each survey year and cohort = 8 birth cohorts. My empty model looks like this:
glmer(pid ~ age + age2 + (1 | cohort) + (1 | period))
There's a bit of a conceptual problem with what you are doing. The random effects do not have the same standing in statistical theory as the fixed effects. You are not really supposed to make inferences on their estimates, since you don't have a random sample from their overall population; you would need to make untested assumptions about their distribution. That said, there are apparently times when you might want to do it, taking care not to make unsupportable claims. See: https://stats.stackexchange.com/questions/392314/interpretation-of-fixed-effect-coefficients-from-glms-and-glmms .
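If all you want is to inspect the predicted random effects themselves, a minimal sketch (assuming m is a fitted glmer model like the one in the question) extracts the conditional modes together with their conditional standard deviations; with the caveat above, these are not standard errors in the usual frequentist sense.
library(lme4)
re <- ranef(m, condVar = TRUE)   # conditional modes of the random effects
as.data.frame(re)                # one row per level, with condval and condsd columns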
Dimitris Rizopoulos then responded to a request for the possibility of getting "an average" of the random effects conditional on the fixed effects (rather, the flipped version of mixed-model inference). He offers such a function in his GLMMadaptive package:
https://drizopoulos.github.io/GLMMadaptive/articles/Methods_MixMod.html#marginalized-coefficients
This is his example:
install.packages("GLMMadaptive"); library(GLMMadaptive)
set.seed(1234)
n <- 100 # number of subjects
K <- 8 # number of measurements per subject
t_max <- 15 # maximum follow-up time
# we construct a data frame with the design:
# everyone has a baseline measurement, and then measurements at random follow-up times
DF <- data.frame(id = rep(seq_len(n), each = K),
time = c(replicate(n, c(0, sort(runif(K - 1, 0, t_max))))),
sex = rep(gl(2, n/2, labels = c("male", "female")), each = K))
# design matrices for the fixed and random effects
X <- model.matrix(~ sex * time, data = DF)
Z <- model.matrix(~ time, data = DF)
betas <- c(-2.13, -0.25, 0.24, -0.05) # fixed effects coefficients
D11 <- 0.48 # variance of random intercepts
D22 <- 0.1 # variance of random slopes
# we simulate random effects
b <- cbind(rnorm(n, sd = sqrt(D11)), rnorm(n, sd = sqrt(D22)))
# linear predictor
eta_y <- as.vector(X %*% betas + rowSums(Z * b[DF$id, ]))
# we simulate binary longitudinal data
DF$y <- rbinom(n * K, 1, plogis(eta_y))
#We continue by fitting the mixed effects logistic regression for y assuming random intercepts and random slopes for the random-effects part.
fm <- mixed_model(fixed = y ~ sex * time, random = ~ time | id, data = DF,
family = binomial())
... and then the call to his marginal_coefs function:
marginal_coefs(fm, std_errors=TRUE)
Estimate Std.Err z-value p-value
(Intercept) -1.6025 0.2906 -5.5154 < 1e-04
sexfemale -1.0975 0.3676 -2.9859 0.0028277
time 0.1766 0.0337 5.2346 < 1e-04
sexfemale:time 0.0508 0.0366 1.3864 0.1656167
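For comparison, the conditional (subject-specific) coefficients of the same fit are returned by fixef(fm); the marginalized coefficients above are their population-averaged analogues.
fixef(fm)   # conditional coefficients, to contrast with the marginal ones above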

What is the scale of parameter estimates produced by nnet::multinom?

I'm using the multinom function from the nnet package to do multinomial logistic regression in R. When I fit the model, I expected to get parameter estimates on the logit scale. However, transforming the coefficients with the inverse logit doesn't give probability estimates that match the predictions; see the example below.
The help file states that "A log-linear model is fitted, with coefficients zero for the first class", but how do I transform parameter estimates to get predicted effects on the probability scale?
library("nnet")
set.seed(123)
# Simulate some simple fake data
groups <- t(rmultinom(500, 1, prob = c(0.05, 0.3, 0.65))) %*% c(1:3)
moddat <- data.frame(group = factor(groups))
# Fit the multinomial model
mod <- multinom(group ~ 1, moddat)
predict(mod, type = "probs")[1,] # predicted probabilities recover generating probs
# But transformed coefficients don't become probabilities
plogis(coef(mod)) # inverse logit
1/(1 + exp(-coef(mod))) # inverse logit
Using predict I can recover the generating probabilities:
1 2 3
0.06 0.30 0.64
But taking the inverse logit of the coefficients does not give probabilities:
(Intercept)
2 0.8333333
3 0.9142857
The inverse logit is the correct back-transformation for a binomial model. For a multinomial model, the appropriate back-transformation is the softmax function, p_k = exp(eta_k) / sum_j exp(eta_j), as described in this question.
The statement from the documentation that a "log-linear model is fitted with coefficient zero for the first class" essentially means that the linear predictor of the first (reference) class is fixed at 0 on the link scale.
To recover the probabilities manually from the example above:
library("nnet")
set.seed(123)
groups <- t(rmultinom(500, 1, prob = c(0.05, 0.3, 0.65))) %*% c(1:3)
moddat <- data.frame(group = factor(groups))
mod <- multinom(group ~ 1, moddat)
# weights: 6 (2 variable)
# initial value 549.306144
# final value 407.810115
# converged
predict(mod, type = "probs")[1,] # predicted probabilities recover generating probs
# 1 2 3
# 0.06 0.30 0.64
# Inverse logit is incorrect
1/(1 + exp(-coef(mod))) # inverse logit
# (Intercept)
# 2 0.8333333
# 3 0.9142857
# Use softmax transformation instead
softmax <- function(x){
expx <- exp(x)
return(expx/sum(expx))
}
# Add the reference category (0 on the link scale) and use the softmax transformation
all_coefs <- rbind("1" = 0, coef(mod))
softmax(all_coefs)
# (Intercept)
# 1 0.06
# 2 0.30
# 3 0.64
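Incidentally, softmax is invariant to adding the same constant to every input, which is why fixing the reference class at 0 loses no generality; a quick check with the objects above:
softmax(all_coefs + 5)   # identical probabilities to softmax(all_coefs)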

Zero-inflated two-part models in GLMMadaptive (R): anova on fixed effects zero-part?

I'm running a hurdle lognormal model using the GLMMadaptive package in R. Both the continuous part as well as the zero-part have categorical variables defined in the fixed effects. I would like to run an ANOVA on these categorical variables to detect if there is a main effect.
I've seen that using the glmmTMB package you are able to run an ANOVA on the conditional model and the zero-part model separately, as is demonstrated here.
Is there a similar strategy available for the GLMMadaptive package? (The glmmTMB does not support hurdle lognormal models as far as I understood). Perhaps using the joint_tests function from the emmeans package? If so, how do you define that you want to test the zero-part model? As emmeans::joint_tests(hurdlemodel) only gives the F-tests for the conditional part of the model.
Or, as an alternative method, could you compare the fit of a model that excludes the variable of interest against the full model, as is demonstrated for the relevance of random effects in this vignette?
Many thanks!
The suggestion by Russ Lenth in the comments is implemented below, using the data and model from the GLMMadaptive two-part model vignette:
library(GLMMadaptive)
library(emmeans)
# data generating code from the vignette:
{
set.seed(1234)
n <- 100 # number of subjects
K <- 8 # number of measurements per subject
t_max <- 5 # maximum follow-up time
# we construct a data frame with the design:
# everyone has a baseline measurement, and then measurements at random follow-up times
DF <- data.frame(id = rep(seq_len(n), each = K),
time = c(replicate(n, c(0, sort(runif(K - 1, 0, t_max))))),
sex = rep(gl(2, n/2, labels = c("male", "female")), each = K))
# design matrices for the fixed and random effects non-zero part
X <- model.matrix(~ sex * time, data = DF)
Z <- model.matrix(~ 1, data = DF)
# design matrices for the fixed and random effects zero part
X_zi <- model.matrix(~ sex, data = DF)
Z_zi <- model.matrix(~ 1, data = DF)
betas <- c(1.5, 0.05, 0.05, -0.03) # fixed effects coefficients non-zero part
shape <- 2 # shape/size parameter of the negative binomial distribution
gammas <- c(-1.5, 0.5) # fixed effects coefficients zero part
D11 <- 0.5 # variance of random intercepts non-zero part
D22 <- 0.4 # variance of random intercepts zero part
# we simulate random effects
b <- cbind(rnorm(n, sd = sqrt(D11)), rnorm(n, sd = sqrt(D22)))
# linear predictor non-zero part
eta_y <- as.vector(X %*% betas + rowSums(Z * b[DF$id, 1, drop = FALSE]))
# linear predictor zero part
eta_zi <- as.vector(X_zi %*% gammas + rowSums(Z_zi * b[DF$id, 2, drop = FALSE]))
# we simulate negative binomial longitudinal data
DF$y <- rnbinom(n * K, size = shape, mu = exp(eta_y))
# we set the extra zeros
DF$y[as.logical(rbinom(n * K, size = 1, prob = plogis(eta_zi)))] <- 0
}
#create categorical time variable
DF$time_categorical[DF$time<2.5] <- "early"
DF$time_categorical[DF$time>=2.5] <- "late"
DF$time_categorical <- as.factor(DF$time_categorical)
#model with interaction in fixed effects zero part and adding nesting in zero part as in model above
km3 <- mixed_model(y ~ sex * time_categorical, random = ~ 1 | id, data = DF,
family = hurdle.lognormal(), n_phis = 1,
zi_fixed = ~ sex * time_categorical, zi_random = ~ 1 | id)
#### ATTEMPT at QDRG function in emmeans ####
coef_zero_part <- fixef(km3, sub_model = "zero_part")
vcov_zero_part <- vcov(km3)[9:12, 9:12]  # the rows/columns belonging to the zero-part coefficients
qd_km3 <- emmeans::qdrg(formula = ~ sex * time_categorical, data = DF,
coef = coef_zero_part, vcov = vcov_zero_part)
Output:
> joint_tests(qd_km3)
model term df1 df2 F.ratio p.value
sex 1 Inf 11.509 0.0007
time_categorical 1 Inf 0.488 0.4848
sex:time_categorical 1 Inf 1.077 0.2993
> emmeans(qd_km3, pairwise ~ sex|time_categorical)
$emmeans
time_categorical = early:
sex emmean SE df asymp.LCL asymp.UCL
male -1.592 0.201 Inf -1.99 -1.198
female -1.035 0.187 Inf -1.40 -0.669
time_categorical = late:
sex emmean SE df asymp.LCL asymp.UCL
male -1.914 0.247 Inf -2.40 -1.429
female -0.972 0.188 Inf -1.34 -0.605
Confidence level used: 0.95
$contrasts
time_categorical = early:
contrast estimate SE df z.ratio p.value
male - female -0.557 0.270 Inf -2.064 0.0390
time_categorical = late:
contrast estimate SE df z.ratio p.value
male - female -0.942 0.306 Inf -3.079 0.0021
Checking if contrasts correspond with zero-part fixed effects:
> fixef(km3, sub_model = "zero_part")
(Intercept) sexfemale time_categoricallate sexfemale:time_categoricallate
-1.5920415 0.5568072 -0.3220390 0.3849780
> (-1.5920) - (-1.5920 + 0.5568)
[1] -0.5568 #matches contrast within "early" level of "time_categorical"
> (-1.5920 + -0.3220) - (-1.5920 + -0.3220 + 0.5568 + 0.3850)
[1] -0.9418 #matches contrast within "late" level of "time_categorical"
The function emmeans::qdrg() can sometimes be used to create the needed object for a model not directly supported by emmeans. See its documentation. In very simple models (e.g., those inheriting from lm), it may be enough to supply the object and data arguments.
That usually does not work for more sophisticated models. In that case, you will need to specify data, the fixed-effects formula for the conditional or zero part of the model, and the associated regression coefficients (coef) and variance-covariance matrix (vcov) for the part of the model in question. With a multi-component model like this one, you will likely have to pick out a subset of the coefficients and covariance matrix. These all must conform: the length of coef must equal both the number of rows and columns of vcov and the number of columns of the model matrix generated by formula (which may be checked via model.matrix(formula, data = data)).
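A quick sketch of that conformity check, reusing the objects created above:
X_check <- model.matrix(~ sex * time_categorical, data = DF)
stopifnot(length(coef_zero_part) == nrow(vcov_zero_part),
          length(coef_zero_part) == ncol(vcov_zero_part),
          length(coef_zero_part) == ncol(X_check))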
qdrg() will not work for a multivariate model -- or at least it's tricky -- because the implied model involves other factor(s) that delineate the levels of the multivariate response. If there are special provisions for, say, spline smoothing, that is another instance where qdrg() probably can't be made to work.
Once qdrg() actually runs and produces results, it is a good idea to use it to estimate some contrasts that correspond directly to the model parameterization. For example, suppose the model was fitted with the default contr.treatment contrasts. Then the regression coefficients are interpretable as comparisons with the first level as the reference level. Accordingly, if we compute rg <- qdrg(...), and one of the factors is "treat", look at contrast(rg, "trt.vs.ctrl1", simple = "treat"), and check whether the first set of estimated contrasts matches the main-effect estimates for treat.
I will illustrate all of this with a simple lm model, ignoring the fact that it is already supported by emmeans.
> warp.lm <- lm(breaks ~ wool * tension, data = warpbreaks)
Here is the reference grid
> rg <- qdrg(~ wool * tension, coef = coef(warp.lm), vcov = vcov(warp.lm),
+ df = df.residual(warp.lm), data = warpbreaks)
Here is a sanity check -- First, look at the model summary:
> summary(warp.lm)$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 44.55556 3.646761 12.217842 2.425903e-16
woolB -16.33333 5.157299 -3.167032 2.676803e-03
tensionM -20.55556 5.157299 -3.985721 2.280796e-04
tensionH -20.00000 5.157299 -3.877999 3.199282e-04
woolB:tensionM 21.11111 7.293523 2.894501 5.698287e-03
woolB:tensionH 10.55556 7.293523 1.447251 1.543266e-01
Second, look at selected contrasts:
> contrast(rg, "trt.vs.ctrl1", simple = "wool")
tension = L:
contrast estimate SE df t.ratio p.value
B - A -16.33 5.16 48 -3.167 0.0027
tension = M:
contrast estimate SE df t.ratio p.value
B - A 4.78 5.16 48 0.926 0.3589
tension = H:
contrast estimate SE df t.ratio p.value
B - A -5.78 5.16 48 -1.120 0.2682
> contrast(rg, "trt.vs.ctrl1", simple = "tension")
wool = A:
contrast estimate SE df t.ratio p.value
M - L -20.556 5.16 48 -3.986 0.0005
H - L -20.000 5.16 48 -3.878 0.0006
wool = B:
contrast estimate SE df t.ratio p.value
M - L 0.556 5.16 48 0.108 0.9863
H - L -9.444 5.16 48 -1.831 0.1338
P value adjustment: dunnettx method for 2 tests
Comparing with the regression coefficients, we do confirm that the first contrast for wool is estimated as -16.33, matching the regression coefficient for woolB. Also, the first set of contrasts for tension are estimated as -20.556 and -20.0, matching the regression coefficients for tensionM and tensionH. The SEs and t ratios match as well. (The P values for the second set do not match due to the multiplicity adjustment.)

Calculate α and β in Probit Model in R

I am facing the following issue: I want to calculate α and β for the following probit model in R, which is defined as:
Probability = F(α + β · sprd)
where sprd denotes the explanatory variable, α and β are constants, and F is the cumulative normal distribution function.
I can calculate probabilities for the entire dataset, the coefficients (see code below), etc., but I do not know how to get the constants α and β.
The purpose is to determine, in Excel, the spread that corresponds to a certain probability, e.g. which spread corresponds to 50%?
Thank you in advance!
Probit model coefficients
probit<- glm(Y ~ X, family=binomial (link="probit"))
summary(probit)
Call:
glm(formula = Y ~ X, family = binomial(link = "probit"))
Deviance Residuals:
Min 1Q Median 3Q Max
-1.4614 -0.6470 -0.3915 -0.2168 2.5730
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.3566755 0.0883634 -4.036 5.43e-05 ***
X -0.0058377 0.0007064 -8.264 < 2e-16 ***
From the help("glm") page you can see that the object returns a value named coefficients.
An object of class "glm" is a list containing at least the following
components:
coefficients a named vector of coefficients
So after you call glm() that object will be a list, and you can access each element using $name_element.
Reproducible example (not a Probit model, but it's the same):
counts <- c(18,17,15,20,10,20,25,13,12)
outcome <- gl(3,1,9)
treatment <- gl(3,3)
d.AD <- data.frame(treatment, outcome, counts)
# fit model
glm.D93 <- glm(counts ~ outcome + treatment, family = poisson())
Now glm.D93$coefficients will print the vector with all the coefficients:
glm.D93$coefficients
# (Intercept) outcome2 outcome3 treatment2 treatment3
#3.044522e+00 -4.542553e-01 -2.929871e-01 1.337909e-15 1.421085e-15
You can assign that and access each individually:
coef <- glm.D93$coefficients
coef[1] # your alpha
#(Intercept)
# 3.044522
coef[2] # your beta
# outcome2
#-0.4542553
I've seen in your deleted post that you are not convinced by @RLave's answer. Here are some simulations to convince you:
# (large) sample size
n <- 10000
# covariate
x <- (1:n)/n
# parameters
alpha <- -1
beta <- 1
# simulated data
set.seed(666)
y <- rbinom(n, 1, prob = pnorm(alpha + beta*x))
# fit the probit model
probit <- glm(y ~ x, family = binomial(link="probit"))
# get estimated parameters - very close to the true parameters -1 and 1
coef(probit)
# (Intercept) x
# -1.004236 1.029523
The estimated parameters are given by coef(probit), or probit$coefficients.
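To address the follow-up about which spread corresponds to a given probability: since Probability = F(α + β · sprd), the fitted model can be inverted with the normal quantile function. A minimal sketch reusing the probit object above (the helper name is illustrative):
est <- coef(probit)
spread_for_prob <- function(p) (qnorm(p) - est[1]) / est[2]
spread_for_prob(0.5)   # the spread at which the predicted probability is 50%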

incorrect logistic regression output

I'm doing logistic regression on the Boston data with a column high.medv (yes/no) which indicates whether the median house price given by the column medv is more than 25 or not.
Below is my code for logistic regression.
# apply the desired condition to medv, storing the result in a new variable
high.medv <- ifelse(Boston$medv > 25, "Y", "N")
ourBoston <- data.frame(Boston, high.medv)
ourBoston$high.medv <- as.factor(ourBoston$high.medv)
attach(Boston)
# 70% of data <- Train
train2<- subset(ourBoston,sample==TRUE)
# 30% will be Test
test2<- subset(ourBoston, sample==FALSE)
glm.fit <- glm (high.medv ~ lstat,data = train2, family = binomial)
summary(glm.fit)
The output is as follows:
Deviance Residuals:
[1] 0
Coefficients: (1 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -22.57 48196.14 0 1
lstat NA NA NA NA
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 0.0000e+00 on 0 degrees of freedom
Residual deviance: 3.1675e-10 on 0 degrees of freedom
AIC: 2
Number of Fisher Scoring iterations: 21
Also, I need the following:
I'm required to use the misclassification rate as the measure of error for two cases:
using lstat as the predictor, and
using all predictors except high.medv and medv.
But I am stuck at the regression itself.
With every classification algorithm, the art lies in choosing the threshold at which you determine whether a result is positive or negative.
When you predict the outcomes in the test data set, you estimate probabilities of the response variable being either 1 or 0. Therefore, you need to tell the model where to cut: the threshold at which a prediction becomes 1 rather than 0.
A high threshold is more conservative about labeling a case as positive, which makes it less likely to produce false positives and more likely to produce false negatives. The opposite happens for low thresholds.
The usual procedure is to plot the rates that interest you, e.g. true positives and false positives, against each other, and then choose the trade-off that is best for you.
set.seed(666)
# simulation of logistic data
x1 = rnorm(1000) # some continuous variables
z = 1 + 2*x1 # linear combination with a bias
pr = 1/(1 + exp(-z)) # pass through an inv-logit function
y = rbinom(1000, 1, pr)
df = data.frame(y = y, x1 = x1)
df$train = 0
df$train[sample(nrow(df), floor(2 * nrow(df) / 3))] = 1  # randomly assign two-thirds of the rows to the training set
df$new_y = NA
# modelling the response variable
mod = glm(y ~ x1, data = df[df$train == 1,], family = "binomial")
df$new_y[df$train == 0] = predict(mod, newdata = df[df$train == 0,], type = 'response') # predicted probabilities
dat = df[df$train==0,] # test data
To use the misclassification error to evaluate your model, first you need to choose a threshold. For that, you can use the roc function from the pROC package, which calculates the rates and provides the corresponding thresholds:
library(pROC)
rates = roc(dat$y, dat$new_y)
plot(rates)            # visualize the trade-off
rates$specificities    # the ratio of true negatives over all negatives
rates$thresholds       # the corresponding thresholds
dat$jj = as.numeric(dat$new_y > 0.7)  # using 0.7 as the threshold for predicting y = 1
table(dat$y, dat$jj)   # the misclassifications given the 0.7 threshold
     0   1
  0 86  20
  1 64 164
The accuracy of your model can be computed as the ratio of the number of observations you got right to the size of your sample.
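A minimal sketch of both quantities, reusing the objects above:
mean(dat$jj == dat$y)   # accuracy: (86 + 164) / 334
mean(dat$jj != dat$y)   # misclassification rate: (20 + 64) / 334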
