After setting up the model and training it with Gibbs sampling, I got samples for all of the predicted hidden values with:
jags <- jags.model('example.bug',
                   data = data,
                   n.chains = 4,
                   n.adapt = 100)
update(jags, 1000)
samples <- jags.samples(jags,
                        c('r','alpha','alpha_i','alpha_u','u','i'),
                        1000)
Here r is a vector of ratings, some of which are withheld to be predicted by the model. Suppose I can get the withheld ones with r[test], where test is a vector of integers giving the indices of the withheld ratings. But when I tried to get them this way:
summary(samples$r, mean)[test]
I just got this:
$drop.dims
iteration chain
1000 4
Could you please tell me how to get the expected value? Thank you in advance!
draw samples
Absent your data or model I'll demonstrate using the simple example here, modified so that jags monitors the predicted outcomes.
library(rjags)
# simulate some data
N <- 1000
x <- 1:N
epsilon <- rnorm(N, 0, 1)
y <- x + epsilon
# define a jags model
writeLines("
model {
  for (i in 1:N){
    y[i] ~ dnorm(y.hat[i], tau)
    y.hat[i] <- a + b * x[i]
  }
  a ~ dnorm(0, .0001)
  b ~ dnorm(0, .0001)
  tau <- pow(sigma, -2)
  sigma ~ dunif(0, 100)
}
", con = "example2_mod.jags")
# create a jags model object
jags <- jags.model("example2_mod.jags",
                   data = list('x' = x,
                               'y' = y,
                               'N' = N),
                   n.chains = 4,
                   n.adapt = 100)
# burn-in
update(jags, 1000)
# drawing samples gives mcarrays
samples <- jags.samples(jags, c('a', 'b', 'y.hat'), 1000)
str(samples)
# List of 3
#  $ a    : mcarray [1, 1:1000, 1:4] -0.0616 -0.0927 -0.0528 -0.0844 -0.06 ...
#   ..- attr(*, "varname")= chr "a"
#  $ b    : mcarray [1, 1:1000, 1:4] 1 1 1 1 1 ...
#   ..- attr(*, "varname")= chr "b"
#  $ y.hat: mcarray [1:1000, 1:1000, 1:4] ...
#   ..- attr(*, "varname")= chr "y.hat"
# NULL
extract predictions
Our result, samples, is a list of mcarray objects with dimensions nodes x iterations x chains (so 1 x 1000 x 4 for the scalar parameters a and b, and 1000 x 1000 x 4 for y.hat). You'd really want to run diagnostics at this point, but we'll jump to summarizing the posterior samples for your predictions. One approach is taking the mean over chains and iterations.
# extract posterior means from the mcarray object by marginalizing over
# chains and iterations (alternative: posterior modes)
posterior_means <- apply(samples$y.hat, 1, mean)
head(posterior_means)
# [1] 0.9201342 1.9202996 2.9204649 3.9206302 4.9207956 5.9209609
# reasonable?
head(predict(lm(y ~ x)))
# 1 2 3 4 5 6
# 0.9242663 1.9244255 2.9245847 3.9247439 4.9249031 5.9250622
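Applied to your original code, the same pattern should give you the posterior means of the withheld ratings (a sketch assuming r was monitored as in your jags.samples() call and is a vector node):
r_hat <- apply(samples$r, 1, mean)  # marginalize over iterations and chains
r_hat[test]                         # posterior means for the withheld indices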
out-of-sample predictions
Alternatively, here's how you could make out-of-sample predictions. I'll just reuse our existing feature vector x, but this could be test data instead.
# extract posterior means from the mcarray object by marginalizing over chains and iterations (alternative: posterior modes)
posterior_means <- lapply(samples, apply, 1, "mean")
str(posterior_means)
# List of 3
# $ a : num -0.08
# $ b : num 1
# $ y.hat: num [1:1000] 0.92 1.92 2.92 3.92 4.92 ...
# NULL
# create a model matrix from x
X <- cbind(1, x)
head(X)
# x
# [1,] 1 1
# [2,] 1 2
# [3,] 1 3
# [4,] 1 4
# [5,] 1 5
# [6,] 1 6
# take our posterior means
B <- as.matrix(unlist(posterior_means[c("a", "b")]))
# [,1]
# a -0.07530888
# b 1.00015874
Given the model, the predicted outcome is a + b * x[i] as we wrote in jags.
# predicted outcomes are the product of our model matrix and estimates
y_hat <- X %*% B
head(y_hat)
# [,1]
# [1,] 0.9248499
# [2,] 1.9250086
# [3,] 2.9251673
# [4,] 3.9253261
# [5,] 4.9254848
# [6,] 5.9256436
Related
How do you get the random intercept effects estimators from a lme4 result object?
set.seed(247)
# Create Data
n=1000
x = runif(n)
id = rep(NA,n)
for (i in 1:10) {
  id_s = (i-1)*100+1
  id_e = i*100
  id[id_s:id_e] = i
}
effects = rnorm(10)
lp = -0.5+0.5*x + effects[id]
probs = exp(lp)/(1+exp(lp))
Y2 = rbinom(n, 1, probs)
library(lme4)
fit_glmm2 = glmer(Y2 ~ x + (1|id), family = "binomial",control = glmerControl(calc.derivs = FALSE))
I thought maybe they are the u's but there's a slight difference between them:
yy = coef(fit_glmm2) # looking only at the intercept
fit_glmm2@u + fit_glmm2@beta[1]
If you want the random effects, ranef() is the best way to get them:
r <- ranef(fit_glmm2)
str(r)
## List of 1
## $ id:'data.frame': 10 obs. of 1 variable:
## ..$ (Intercept): num [1:10] -0.693 0.297 0.54 -0.467 0.755 ...
## ..- attr(*, "postVar")= num [1, 1, 1:10] 0.0463 0.0385 0.0392 0.0436 0.0409 ...
## - attr(*, "class")= chr "ranef.mer"
raw <- unname(unlist(ranef(fit_glmm2)$id))
identical(raw, fit_glmm2@u*fit_glmm2@theta) ## TRUE
As described in vignette("lmer", package = "lme4"), the @u values are the spherical random effects, i.e. they're iid N(0,1) and need to be transformed to get to the random effects b used in the formula X %*% beta + Z %*% b. In this case (an intercept-only RE), theta corresponds to the standard deviation of the random effect. u*theta won't work for more complicated cases; in that case you need getME(fit_glmm2, "Lambda") %*% getME(fit_glmm2, "u").
getME(., "b") will also work, but again for more complex models you'll have to work out how the b-vector is split into random intercepts, slopes, different RE terms, etc.
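For this intercept-only example you can verify that the general transformation agrees with getME(., "b") (a quick check; Lambda %*% u returns a sparse matrix, so coerce before comparing):
b_direct <- as.numeric(getME(fit_glmm2, "b"))
b_manual <- as.numeric(getME(fit_glmm2, "Lambda") %*% getME(fit_glmm2, "u"))
all.equal(b_direct, b_manual)  ## should be TRUE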
Turns out you can get them by multiplying the u parameter with the theta parameter, or by calling getME(.,"b"):
yy = coef(fit_glmm2) # looking only at the intercept
fit_glmm2@u*fit_glmm2@theta + fit_glmm2@beta[1] # or
# getME(fit_glmm2,"b") + fit_glmm2@beta[1]
I'd like to bootstrap a mixed zero-inflated GLM (m_F) using the glmmTMB package, but whether I use coef or fixef to extract the coefficients, I always get the error:
Error in bres[i, ] <- coef(bfit) :
incorrect number of subscripts on matrix
My example:
library(glmmTMB)
library(boot)
my.ds <- read.csv("https://raw.githubusercontent.com/Leprechault/trash/main/ds.desenvol.csv")
str(my.ds)
# 'data.frame': 400 obs. of 4 variables:
# $ temp : num 0 0 0 0 0 0 0 0 0 0 ...
# $ storage : int 5 5 5 5 5 5 5 5 5 5 ...
# $ rep : chr "r1" "r2" "r3" "r4" ...
# $ development: int 0 23 22 27 24 25 24 22 0 22 ...
# Fit a GLM mixed Hurdle (zero-inflated) log-link Gamma model
m_F <- glmmTMB(development ~ poly(temp,2) + (1 | storage), data = my.ds,
               family = ziGamma(link = "log"),
               ziformula = ~ 1)
summary(m_F)
# Create a bootstrap approach
nboot <- 1000
bres <- matrix(NA, nrow=nboot,
               ncol=length(coef(m_F)),
               dimnames=list(rep=seq(nboot),
                             coef=names(coef(m_F))))
set.seed(1000)
bootsize <- 100
for (i in seq(nboot)) {
  bdat <- my.ds[sample(nrow(my.ds), size=bootsize, replace=TRUE),]
  bfit <- update(m_F, data=bdat) ## refit with new data
  bres[i,] <- coef(bfit)
}
Please, any help with it?
My answer is somewhat similar to @RuiBarradas's, but closer to your original code. The main point is that coef() doesn't do what you think; (1) the convention (set originally by the nlme package) is that for mixed models coef() returns a matrix (or list of matrices) of group-level coefficients, while fixef() returns the fixed-effect (population-level) coefficients; (2) for glmmTMB, fixef() returns a list of fixed-effect vectors for the conditional, zero-inflation, and dispersion models (unlist() collapses this back to a vector with concatenated names).
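If it helps, here is the difference at a glance (output omitted; structure as described above):
fixef(m_F)          # list of fixed-effect vectors: $cond, $zi, $disp
unlist(fixef(m_F))  # one named numeric vector -- what we want for a row of bres
coef(m_F)           # group-level coefficients per level of storage, not what we want here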
The other point to keep in mind is that bootstrapping at the level of individual observations may not be sensible for a data set with grouping structure (you can bootstrap at the group level, or the within-group level, or both; you can bootstrap residuals (if you have a linear model - this won't work for GLMMs with count data); you can also use lme4::bootMer to do parametric bootstrapping, which is pretty much the only alternative when you have GLMMs with crossed random effects).
PS what is bootsize doing here? The standard approach to bootstrapping is to resample a data set the same size as the original with replacement. Resampling only a quarter of the data set (nrow(my.ds) == 400, bootsize == 100) is well-defined but very unusual — are you doing some particular non-standard kind of bootstrap on purpose ... ?
sum_fun <- function(fit) {
  unlist(fixef(fit))
}
bres <- matrix(NA,
               nrow=nboot,
               ncol=length(sum_fun(m_F)),
               dimnames=list(rep=seq(nboot),
                             coef=names(sum_fun(m_F))))
set.seed(1000)
bootsize <- 100
pb <- txtProgressBar(max = nboot, style = 3)
for (i in seq(nboot)) {
  setTxtProgressBar(pb, i)
  bdat <- my.ds[sample(nrow(my.ds), size=bootsize, replace=TRUE),]
  bfit <- update(m_F, data=bdat) ## refit with new data
  bres[i,] <- sum_fun(bfit)
}
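As noted above, resampling individual rows ignores the grouping structure. Here is a minimal sketch of a group-level (cluster) bootstrap instead, resampling whole levels of storage with replacement, as an alternative to the loop above (the names groups and boot_groups are my own illustration, not part of the original code):
groups <- unique(my.ds$storage)
for (i in seq(nboot)) {
  boot_groups <- sample(groups, size = length(groups), replace = TRUE)
  ## stack the rows belonging to each resampled group
  ## (note: duplicated groups keep their original storage id here;
  ##  relabelling duplicates as distinct clusters is often preferred)
  bdat <- do.call(rbind, lapply(boot_groups, function(g) my.ds[my.ds$storage == g, ]))
  bfit <- update(m_F, data = bdat)
  bres[i, ] <- sum_fun(bfit)
}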
To use package boot, you must define a function that bootstraps the data and then computes the statistic or vector of statistics from it. This is function ziboot below. Then call boot passing it the data, the function and the number of replicates.
The function fits the same kind of model as the question's code but must transform the model output into a vector of coefficients. That is what the lapply does.
library(glmmTMB)
library(boot)
my.ds <- read.csv("https://raw.githubusercontent.com/Leprechault/trash/main/ds.desenvol.csv")
# Create a bootstrap approach
# This function will be called by boot() below
ziboot <- function(data, i) {
  # this bootstraps the data
  d <- data[i, ]
  model <- glmmTMB(development ~ temp + (1 | storage), data = d,
                   family = ziGamma(link = "log"),
                   ziformula = ~ 1)
  cf <- coef(model)$cond$storage
  l <- as.list(cf)
  unlist(lapply(seq_along(l), \(i){
    x <- l[[i]]
    nms <- paste(names(l)[i], row.names(cf), sep = "_")
    setNames(x, nms)
  }))
}
set.seed(1000)
bootsize <- 100
b <- boot(my.ds, ziboot, R = bootsize)
colnames(b$t) <- names(b$t0)
head(b$t)
#> (Intercept)_5 (Intercept)_10 (Intercept)_15 (Intercept)_20 (Intercept)_30
#> [1,] 3.156717 3.153949 3.139001 3.147799 3.196308
#> [2,] 3.172563 3.157384 3.164663 3.143005 3.196966
#> [3,] 3.175124 3.154946 3.158715 3.129027 3.168753
#> [4,] 3.149817 3.143550 3.135256 3.141367 3.167679
#> [5,] 3.159183 3.179388 3.147193 3.148219 3.237395
#> [6,] 3.148815 3.168335 3.117576 3.126973 3.178377
#> temp_5 temp_10 temp_15 temp_20 temp_30
#> [1,] -0.004089067 -0.004089067 -0.004089067 -0.004089067 -0.004089067
#> [2,] -0.004404738 -0.004404738 -0.004404738 -0.004404738 -0.004404738
#> [3,] -0.003153053 -0.003153053 -0.003153053 -0.003153053 -0.003153053
#> [4,] -0.003547863 -0.003547863 -0.003547863 -0.003547863 -0.003547863
#> [5,] -0.003989763 -0.003989763 -0.003989763 -0.003989763 -0.003989763
#> [6,] -0.003137722 -0.003137722 -0.003137722 -0.003137722 -0.003137722
Created on 2022-07-05 by the reprex package (v2.0.1)
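If it's useful, bootstrap confidence intervals can then be pulled from the returned boot object with boot's own machinery, e.g. a percentile interval for the first statistic in b$t0:
boot.ci(b, type = "perc", index = 1)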
I'm using predict.lm(fit, newdata=newdata, interval="prediction") to get predictions and their prediction intervals (PI) for new observations. Now I would like to aggregate (sum and mean) these predictions and their PI's based on an additional variable (i.e. a spatial aggregation on the zip code level of predictions for single households).
I learned from StackExchange that you cannot aggregate the prediction intervals of single predictions just by aggregating the limits of those intervals. The post is very helpful for understanding why this can't be done, but I have a hard time translating it into actual code.
Here's a reproducible example:
library(dplyr)
set.seed(123)
data(iris)
#Split dataset in training and prediction set
smp_size <- floor(0.75 * nrow(iris))
train_ind <- sample(seq_len(nrow(iris)), size = smp_size)
train <- iris[train_ind, ]
pred <- iris[-train_ind, ]
#Fit regression model
fit1 <- lm(Petal.Width ~ Petal.Length, data=train)
#Fit multiple linear regression model
fit2 <- lm(Petal.Width ~ Petal.Length + Sepal.Width + Sepal.Length, data=train)
#Predict Pedal.Width for new data incl prediction intervals for each prediction
predictions1<-predict(fit1, newdata=pred, interval="prediction")
predictions2<-predict(fit2, newdata=pred, interval="prediction")
# Aggregate data by summing predictions for species
#NOT correct for prediction intervals
predictions_agg1 <- data.frame(predictions1, Species=pred$Species) %>%
  group_by(Species) %>%
  summarise_all(funs(sum, mean))
predictions_agg2 <- data.frame(predictions2, Species=pred$Species) %>%
  group_by(Species) %>%
  summarise_all(funs(sum, mean))
I couldn't find a good tutorial or package which describes how to properly aggregate predictions and their PI's in R when using predict.lm(). Is there something out there? Would highly appreciate if you could point me in the right direction on how to do this in R.
Your question is closely related to a thread I answered 2 years ago: linear model with `lm`: how to get prediction variance of sum of predicted values. It provides an R implementation of Glen_b's answer on Cross Validated. Thanks for quoting that Cross Validated thread; I didn't know it; perhaps I can leave a comment there linking the Stack Overflow thread.
I have polished my original answer, wrapping up line-by-line code cleanly into easy-to-use functions lm_predict and agg_pred. Solving your question is then simplified to applying those functions by group.
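For intuition about what lm_predict and agg_pred compute (as I understand the linked answer): the aggregated prediction is a weighted sum of the individual predictions, its confidence-interval variance is the quadratic form w' (Xp V Xp') w with V = vcov(fit2), and the prediction-interval variance adds sigma^2 * sum(w^2) for the independent new-observation errors. A minimal sketch of that computation, with nd standing for one group's new data (my own illustration, not the code from the linked Q & A):
Xp <- model.matrix(delete.response(terms(fit2)), nd)  # model matrix for the new data
pt <- drop(Xp %*% coef(fit2))                         # point predictions
V  <- Xp %*% vcov(fit2) %*% t(Xp)                     # covariance of the predicted means
w  <- rep(1, nrow(nd))                                # weights: all 1 for a sum
agg_mean   <- sum(w * pt)
agg_var_CI <- drop(t(w) %*% V %*% w)                  # variance used for the CI
agg_var_PI <- agg_var_CI + sigma(fit2)^2 * sum(w^2)   # extra residual variance for the PI
The intervals then come from t quantiles with df.residual(fit2) degrees of freedom.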
Consider the iris example in your question, and the second model fit2 for demonstration.
set.seed(123)
data(iris)
#Split dataset in training and prediction set
smp_size <- floor(0.75 * nrow(iris))
train_ind <- sample(seq_len(nrow(iris)), size = smp_size)
train <- iris[train_ind, ]
pred <- iris[-train_ind, ]
#Fit multiple linear regression model
fit2 <- lm(Petal.Width ~ Petal.Length + Sepal.Width + Sepal.Length, data=train)
We split pred by group Species, then apply lm_predict (with diag = FALSE) on all sub data frames.
oo <- lapply(split(pred, pred$Species), lm_predict, lmObject = fit2, diag = FALSE)
To use agg_pred we need to specify a weight vector whose length equals the number of observations in the group. We can determine this by consulting the length of fit in each oo[[i]]:
n <- lengths(lapply(oo, "[[", 1))
#setosa versicolor virginica
# 11 13 14
If the aggregation operation is a sum, we do
w <- lapply(n, rep.int, x = 1)
#List of 3
# $ setosa : num [1:11] 1 1 1 1 1 1 1 1 1 1 ...
# $ versicolor: num [1:13] 1 1 1 1 1 1 1 1 1 1 ...
# $ virginica : num [1:14] 1 1 1 1 1 1 1 1 1 1 ...
SUM <- Map(agg_pred, w, oo)
SUM[[1]] ## result for the first group, for example
#$mean
#[1] 2.499728
#
#$var
#[1] 0.1271554
#
#$CI
# lower upper
#1.792908 3.206549
#
#$PI
# lower upper
#0.999764 3.999693
sapply(SUM, "[[", "CI") ## some nice presentation for CI, for example
# setosa versicolor virginica
#lower 1.792908 16.41526 26.55839
#upper 3.206549 17.63953 28.10812
If the aggregation operation is an average, we rescale w by n and call agg_pred.
w <- mapply("/", w, n)
#List of 3
# $ setosa : num [1:11] 0.0909 0.0909 0.0909 0.0909 0.0909 ...
# $ versicolor: num [1:13] 0.0769 0.0769 0.0769 0.0769 0.0769 ...
# $ virginica : num [1:14] 0.0714 0.0714 0.0714 0.0714 0.0714 ...
AVE <- Map(agg_pred, w, oo)
AVE[[2]] ## result for the second group, for example
#$mean
#[1] 1.3098
#
#$var
#[1] 0.0005643196
#
#$CI
# lower upper
#1.262712 1.356887
#
#$PI
# lower upper
#1.189562 1.430037
sapply(AVE, "[[", "PI") ## some nice presentation for PI, for example
# setosa versicolor virginica
#lower 0.09088764 1.189562 1.832255
#upper 0.36360845 1.430037 2.072496
This is great! Thank you so much! There is one thing I forgot to mention: in my actual application I need to sum ~300,000 predictions which would create a full variance-covariance matrix which is about ~700GB in size. Do you have any idea if there is a computationally more efficient way to directly get to the sum of the variance-covariance matrix?
Use the fast_agg_pred function provided in the revision of the original Q & A. Let's start it all over.
set.seed(123)
data(iris)
#Split dataset in training and prediction set
smp_size <- floor(0.75 * nrow(iris))
train_ind <- sample(seq_len(nrow(iris)), size = smp_size)
train <- iris[train_ind, ]
pred <- iris[-train_ind, ]
#Fit multiple linear regression model
fit2 <- lm(Petal.Width ~ Petal.Length + Sepal.Width + Sepal.Length, data=train)
## list of new data
newdatlist <- split(pred, pred$Species)
n <- sapply(newdatlist, nrow)
#setosa versicolor virginica
# 11 13 14
If the aggregation operation is a sum, we do
w <- lapply(n, rep.int, x = 1)
SUM <- mapply(fast_agg_pred, w, newdatlist,
              MoreArgs = list(lmObject = fit2, alpha = 0.95),
              SIMPLIFY = FALSE)
If the aggregation operation is an average, we do
w <- mapply("/", w, n)
AVE <- mapply(fast_agg_pred, w, newdatlist,
              MoreArgs = list(lmObject = fit2, alpha = 0.95),
              SIMPLIFY = FALSE)
Note that we can't use Map in this case as we need to provide more arguments to fast_agg_pred. Use mapply in this situation, with MoreArgs and SIMPLIFY.
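The reason fast_agg_pred can avoid the memory blow-up (as I understand it): the quadratic form w' (Xp V Xp') w can be rewritten as u' V u with u = Xp' w, a vector whose length is the number of coefficients, so the n x n covariance matrix of the predictions is never formed. A minimal sketch of that rearrangement, my own illustration using the first group:
nd <- newdatlist[[1]]
w1 <- rep(1, nrow(nd))                                # weights for a sum over the group
Xp <- model.matrix(delete.response(terms(fit2)), nd)
u  <- crossprod(Xp, w1)                               # Xp' w, length = number of coefficients
agg_mean   <- sum(w1 * drop(Xp %*% coef(fit2)))
agg_var_CI <- drop(t(u) %*% vcov(fit2) %*% u)         # equals w' (Xp V Xp') w without building it
agg_var_PI <- agg_var_CI + sigma(fit2)^2 * sum(w1^2)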
Consider the following table:
DB <- data.frame(
  Y  = rnorm(6),
  X1 = c(T, T, F, T, F, F),
  X2 = c(T, F, T, F, T, T)
)
Y X1 X2
1 1.8376852 TRUE TRUE
2 -2.1173739 TRUE FALSE
3 1.3054450 FALSE TRUE
4 -0.3476706 TRUE FALSE
5 1.3219099 FALSE TRUE
6 0.6781750 FALSE TRUE
I'd like to explain my quantitative variable Y by two binary variables (TRUE or FALSE) without an intercept.
The reason for this choice is that, in my study, we can never observe X1=FALSE and X2=FALSE at the same time, so it doesn't make sense to have a mean other than 0 for this combination.
With intercept
m1 <- lm(Y~X1+X2, data=DB)
summary(m1)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.9684 1.0590 -1.859 0.1600
X1TRUE 0.7358 0.9032 0.815 0.4749
X2TRUE 3.0702 0.9579 3.205 0.0491 *
Without intercept
m0 <- lm(Y~0+X1+X2, data=DB)
summary(m0)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
X1FALSE -1.9684 1.0590 -1.859 0.1600
X1TRUE -1.2325 0.5531 -2.229 0.1122
X2TRUE 3.0702 0.9579 3.205 0.0491 *
I can't explain why two coefficients are estimated for the variable X1. The X1FALSE coefficient seems to be equivalent to the intercept coefficient in the model with intercept.
Same results
When we display the estimation for all the combinations of variables, the two models are the same.
DisplayLevel <- function(m){
  R <- outer(
    unique(DB$X1),
    unique(DB$X2),
    function(a, b) predict(m, data.frame(X1=a, X2=b))
  )
  colnames(R) <- paste0('X2:', unique(DB$X2))
  rownames(R) <- paste0('X1:', unique(DB$X1))
  return(R)
}
DisplayLevel(m1)
X2:TRUE X2:FALSE
X1:TRUE 1.837685 -1.232522
X1:FALSE 1.101843 -1.968364
DisplayLevel(m0)
X2:TRUE X2:FALSE
X1:TRUE 1.837685 -1.232522
X1:FALSE 1.101843 -1.968364
So the two models are equivalent.
Question
My question is: can we just estimate one coefficient for the first effect? Can we force R to assign a value of 0 to the combination X1=FALSE and X2=FALSE?
Yes, we can, by
DB <- as.data.frame(data.matrix(DB))
## or you can do:
## DB$X1 <- as.integer(DB$X1)
## DB$X2 <- as.integer(DB$X2)
# Y X1 X2
# 1 -0.5059575 1 1
# 2 1.3430388 1 0
# 3 -0.2145794 0 1
# 4 -0.1795565 1 0
# 5 -0.1001907 0 1
# 6 0.7126663 0 1
## a linear model without intercept
m0 <- lm(Y ~ 0 + X1 + X2, data = DB)
DisplayLevel(m0)
# X2:1 X2:0
# X1:1 0.15967744 0.2489237
# X1:0 -0.08924625 0.0000000
I have explicitly coerced your TRUE/FALSE binary into numeric 1/0, so that no contrast is handled by lm().
The data appearing in my answer are different from yours because you did not use set.seed() before rnorm() for reproducibility. But this is not an issue here.
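As a quick check that the coercion does what was asked (one coefficient per variable, and an implicit 0 for the X1 = 0, X2 = 0 cell):
coef(m0)
## only two coefficients, named X1 and X2; with no intercept the
## X1 = 0, X2 = 0 cell is fixed at 0, as DisplayLevel(m0) shows above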
QUESTIONS
(1) what is the classification formula from the fitted model in the example code below named 'model1'? (is it formula A, B, or neither?)
(2) how does 'model1' determine if class == 1 vs. 2?
Formula A:
class(Species{1:2}) = (-31.938998) + (-7.501714 * [PetalLength]) + (63.670583 * [PetalWidth])
Formula B:
class(Species{1:2}) = 1.346075e-14 + (5.521371e-04 * [PetalLength]) + (4.485211e+27 * [PetalWidth])
USE CASE
Use R to fit/train a binary classification model, then interpret the model for the purpose of manually calculating classifications in Excel, not R.
MODEL COEFFICIENTS
>coef(model1)
#(Intercept) PetalLength PetalWidth
#-31.938998 -7.501714 63.670583
>exp(coef(model1))
#(Intercept) PetalLength PetalWidth
#1.346075e-14 5.521371e-04 4.485211e+27
R CODE EXAMPLE
# Load data (using iris dataset from Google Drive because uci.edu link wasn't working for me today)
#iris <- read.csv(url("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"), header = FALSE)
iris <- read.csv(url("https://docs.google.com/spreadsheets/d/1ovz31Y6PrV5OwpqFI_wvNHlMTf9IiPfVy1c3fiQJMcg/pub?gid=811038462&single=true&output=csv"), header = FALSE)
dataSet <- iris
#assign column names
names(dataSet) <- c("SepalLength", "SepalWidth", "PetalLength", "PetalWidth", "Species")
#col names
dsColNames <- as.character(names(dataSet))
#num of columns and rows
dsColCount <- as.integer(ncol(dataSet))
dsRowCount <- as.integer(nrow(dataSet))
#class ordinality and name
classColumn <- 5
classColumnName <- dsColNames[classColumn]
y_col_pos <- classColumn
#features ordinality
x_col_start_pos <- 1
x_col_end_pos <- 4
# % of [dataset] reserved for training/test and validation
set.seed(10)
sampleAmt <- 0.25
mainSplit <- sample(2, dsRowCount, replace=TRUE, prob=c(sampleAmt, 1-sampleAmt))
#split [dataSet] into two sets
dsTrainingTest <- dataSet[mainSplit==1, 1:5]
dsValidation <- dataSet[mainSplit==2, 1:5]
nrow(dsTrainingTest);nrow(dsValidation);
# % of [dsTrainingTest] reserved for training
sampleAmt <- 0.5
secondarySplit <- sample(2, nrow(dsTrainingTest), replace=TRUE, prob=c(sampleAmt, 1-sampleAmt))
#split [dsTrainingTest] into two sets
dsTraining <- dsTrainingTest[secondarySplit==1, 1:5]
dsTest <- dsTrainingTest[secondarySplit==2, 1:5]
nrow(dsTraining);nrow(dsTest);
nrow(dataSet) == nrow(dsTrainingTest)+nrow(dsValidation)
nrow(dsTrainingTest) == nrow(dsTraining)+nrow(dsTest)
library(randomGLM)
dataSetEnum <- dsTraining[,1:5]
dataSetEnum[,5] <- as.character(dataSetEnum[,5])
dataSetEnum[,5][dataSetEnum[,5]=="Iris-setosa"] <- 1
dataSetEnum[,5][dataSetEnum[,5]=="Iris-versicolor"] <- 2
dataSetEnum[,5][dataSetEnum[,5]=="Iris-virginica"] <- 2
dataSetEnum[,5] <- as.integer(dataSetEnum[,5])
x <- as.matrix(dataSetEnum[,1:4])
y <- as.factor(dataSetEnum[,5:5])
# number of features
N <- ncol(x)
# define function misclassification.rate
if (exists("misclassification.rate") ) rm(misclassification.rate);
misclassification.rate <- function(tab){
  num1 <- sum(diag(tab))
  denom1 <- sum(tab)
  signif(1-num1/denom1, 3)
}
#Fit randomGLM model - Ensemble predictor comprised of individual generalized linear model predictors
RGLM <- randomGLM(x, y, classify=TRUE, keepModels=TRUE,randomSeed=1002)
RGLM$thresholdClassProb
tab1 <- table(y, RGLM$predictedOOB)
tab1
# y 1 2
# 1 2 0
# 2 0 12
# accuracy
1-misclassification.rate(tab1)
# variable importance measure
varImp = RGLM$timesSelectedByForwardRegression
sum(varImp>=0)
table(varImp)
# select most important features
impF = colnames(x)[varImp>=5]
impF
# build single GLM model with most important features
model1 = glm(y~., data=as.data.frame(x[, impF]), family = binomial(link='logit'))
coef(model1)
#(Intercept) PetalLength PetalWidth
#-31.938998 -7.501714 63.670583
exp(coef(model1))
#(Intercept) PetalLength PetalWidth
#1.346075e-14 5.521371e-04 4.485211e+27
confint.default(model1)
# 2.5 % 97.5 %
#(Intercept) -363922.5 363858.6
#PetalLength -360479.0 360464.0
#PetalWidth -916432.0 916559.4
Your model is defined as
model1 <- glm(y~., data=as.data.frame(x[, impF]), family=binomial(link='logit'))
The family=binomial(link='logit') bit is saying that the response y is a series of Bernoulli trials, i.e. a variable that takes values 1 or 0 depending on a parameter p, and that p = exp(m) / (1 + exp(m)), where m is a function of the data, called the linear predictor. (Because y here is a factor with levels 1 and 2, glm treats the first level as 0 and the second as 1, so the modelled probability is that of Species code 2.)
The formula y~. means that m = a + b PetalLength + c PetalWidth, where a, b, c are the model coefficients.
Therefore the probability of y = 1 is
> m <- model.matrix(model1) %*% coef(model1)
> exp(m) / (1+exp(m))
[,1]
20 3.448852e-11
50 1.253983e-13
65 1.000000e+00
66 1.000000e+00
87 1.000000e+00
105 1.000000e+00
106 1.000000e+00
107 1.000000e+00
111 1.000000e+00
112 1.000000e+00
116 1.000000e+00
118 1.000000e+00
129 1.000000e+00
130 1.000000e+00
We can check that this is the same as the output of fitted.values
> fitted.values(model1)
20 50 65 66 87 105
3.448852e-11 1.253983e-13 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
106 107 111 112 116 118
1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
129 130
1.000000e+00 1.000000e+00
Finally, the response can be classified in two categories depending on whether P(Y = 1) is above or below a certain threshold. For example,
> ifelse(fitted.values(model1) > 0.5, 1, 0)
20 50 65 66 87 105 106 107 111 112 116 118 129 130
0 0 1 1 1 1 1 1 1 1 1 1 1 1
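Tying this back to the question: formula A is the linear predictor m above (formula B, the exponentiated coefficients, gives odds ratios and is not a score you can sum this way). As a concrete check, plug in observation 50 of the fitted data (PetalLength = 1.4, PetalWidth = 0.2, an Iris-setosa, coded Species = 1):
m50 <- -31.938998 + (-7.501714 * 1.4) + (63.670583 * 0.2)
m50
# [1] -29.70728
exp(m50) / (1 + exp(m50))
# ~1.25e-13, matching fitted.values(model1)["50"] above; below 0.5, so class 1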
A GLM has a link function and a linear predictor; in your code the link is the logit.
Let Y = {0,1} and let X be an n x p matrix. (Using pseudo-LaTeX) the model is \phi(E[Y|X]) = X \hat B = \eta,
where
- \eta is the linear predictor
- \phi() is the link function
The linear predictor is just X %*% \hat B and the classification back to P(Y=1|X) = \phi^{-1}(\eta) -- ie the inverse link function. The inverse link function obviously depends on the choice of link. For a logit, you have the inverse logit P(Y=1|X) = exp(eta) / (1+ exp(eta))