estimate h2o glm coefficients by a categorical variable level - r

I would like to estimate a coefficient for a predictor at each level of a categorical variable in h2o glm. For example, if my data frame has product price (continuous variable) and product type (categorical variable), then I want to estimate a coefficient for price by product type. In SAS, you can easily accomplish this by specifying the model effect as price*type. How can I do the same in h2o or R?
There is an interactions() function, but it cannot handle an interaction between a continuous and a categorical variable. Any tips to get around this problem?
Many thanks,

set.seed(1234)
x1 = rnorm(100,0,1)
x2 = as.factor(rep(c("A","B","C","D"), each = 25))
y = as.factor(rep(0:1, each = 50))
data = data.frame(x1 = x1, x2 = x2, y = y)
Interactions can be specified using a ":" in the formula argument
# glm base example
fit <- glm(data = data, y ~ x1 + x2 + x1:x2, family = "binomial")
print(fit)
Using h2o.glm, pairwise interactions can be specified by passing column indices to the interactions argument:
# h2o.glm example
library("h2o")
h2o.init(nthreads = -1)
data.hex = as.h2o(data)
h2o_fit <- h2o.glm(x = 1:2, y = 3, training_frame = data.hex, family = "binomial", interactions = 1:2)
h2o_fit@model$coefficients_table
h2o.shutdown(prompt = F)
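To get one slope per level directly in base R (the "price by type" coefficients the question asks about), a common option is a nested formulation: drop the main effect of the continuous variable and keep only its interaction with the factor. Below is a minimal sketch on simulated data. The names toy, sales, price, type and the true slopes are made up for illustration, and it uses lm for simplicity; the same formula should carry over to glm with a family argument.

```r
# Simulated data: each product type has its own price slope (illustrative values)
set.seed(42)
n_per_type <- 200
type  <- factor(rep(c("A", "B", "C"), each = n_per_type))
price <- runif(length(type), min = 1, max = 10)
true_slope <- c(A = -0.5, B = 0.2, C = 1.0)
sales <- 3 + unname(true_slope[as.character(type)]) * price +
  rnorm(length(type), sd = 0.5)
toy <- data.frame(sales = sales, price = price, type = type)

# No main effect for price: type:price then yields one slope per level of type
fit_nested <- lm(sales ~ type + type:price, data = toy)

# Keep only the per-level slope coefficients (named like "typeA:price")
all_coefs <- coef(fit_nested)
slopes <- all_coefs[grep(":price", names(all_coefs), fixed = TRUE)]
print(slopes)
```

With price + type + price:type you would instead get the slope for the reference level plus differences from it; both parameterisations describe the same fitted model.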

Related

Plotting multiple logistic regression in R

I've built this logistic regression model which includes four predictors, optimized from a dataframe that includes ten predictors (I've uploaded the data here http://www.filedropper.com/df). This is my first time trying to plot a logistic model in R, and I'm not sure how to go about visualizing the results. Here's my code:
df <- read.csv("df.csv")
model <- glm(y ~ v1 + v6 + v9 + v3, data = df,
family = binomial, maxit = 100)
summary(model)
df$pred_res <- predict(model, df, type = "response")
ggplot(data = df, aes(x = v3, y = pred_res)) +
geom_line(mapping = aes(colour = "blue"), size = 1)
Can anyone suggest a better way to visualize/plot the predicted values of the model? In general, does anyone have any tips for visualizing a logit model with multiple predictors? Thank you!

What do you set grouping factor to when using glmer/lme4 and predictInterval?

Problem: Using multilevel (mixed-effects) model and not sure what to set grouping variable to in order to generate predicted probabilities for a measured group-level variable from glmer model using merTools' predictInterval function.
Goal: Generate predicted probabilities and SEs/CIs across a range of values from a "second level" group-level variable.
Seeking: advice on how to do this properly, or other recommendations for generating predicted probabilities and CIs across the range of values of a group-level variable from a glmer model.
library(lme4)
library(merTools)
library(ggplot2)
# data.frame() is base R; data_frame() would need dplyr/tibble, which is not loaded here
hier_data <- data.frame(pass = sample(c(0, 1), size = 1000, replace = TRUE),
                        wt = rnorm(1000),
                        ht = sample(1:5, size = 1000, replace = TRUE, prob = c(.1, .1, .6, .1, .1)),
                        school_funding = rnorm(1000),
                        school = rep(c("A", "B", "C", "D", "E"), each = 200))
mod <- glmer(pass ~ wt + ht + school_funding + (1 | school),
family = binomial("logit"), data = hier_data)
### Without school: error
ndata <- data.frame(wt = median(hier_data$wt),
ht = median(hier_data$ht),
school_funding = seq(from = min(hier_data$school_funding), to =max(hier_data$school_funding), length.out = 100))
pp <- cbind(ndata, predictInterval(merMod = mod,
newdata = ndata,
type = "probability"))
### Problem, when adding school variable: which school?
ndata <- data.frame(wt = median(hier_data$wt),
ht = median(hier_data$ht),
school_funding = seq(from = min(hier_data$school_funding), to =max(hier_data$school_funding), length.out = 100),
school = "A")
pp <- cbind(ndata, predictInterval(merMod = mod,
newdata = ndata,
type = "probability"))
ggplot(pp, aes(x = school_funding, y = fit)) +
geom_point() +
geom_errorbar(aes(ymin = lwr, ymax = upr))
It seems what you are trying to achieve is effects plots for your variables, with fast prediction intervals. Note first of all that predictInterval does not incorporate the uncertainty in the estimated values of the variance parameters, theta. If more accurate confidence intervals are needed, you should use the bootMer function as described in ?bootMer, which uses bootstrapping to estimate the uncertainty. However, bootstrapping may simply become infeasible as model size and complexity increase. Alternatively, the effects package can illustrate the effects of merMod objects (however the documentation is simply atrocious).
In general, when illustrating the effects of merMod objects, the question is "which effects?". Are you interested in the marginal effects or the conditional effects (such as variability in random effects)? If your model contains only random intercepts (no random slopes), and you are interested in the uncertainty of the fixed-effect coefficient or the effect on the conditional mean, you can get away with using any school and specifying which = "fixed" in predictInterval as
pp <- cbind(ndata, predictInterval(merMod = mod,
newdata = ndata, #<= any school chosen
type = "probability",
which = "fixed"))
Note that the size of the effect will depend on the chosen school and the remaining coefficients, as in standard models, and is thus not causal.
If you are interested in the marginal effect, there are multiple methods for approximating it. The optimal approach would be to bootstrap the predicted values of the marginal mean. Alternatively, if the number of independent groups in your grouping variable is "large" enough, you could (maybe) average the prediction intervals across groups as illustrated below
newData <- expand.grid(wt = median(hier_data$wt),
ht = median(hier_data$ht),
school = unique(hier_data$school), # school is a character column, so levels() would return NULL
school_funding = seq(from = min(hier_data$school_funding),
to = max(hier_data$school_funding),
length.out = 100))
pp <- predictInterval(merMod = mod,
newdata = newData,
type = "probability")
#Split predictions by every column but school
# And calculate estimated means
predictions <- do.call("rbind", lapply(split(as.data.frame(pp),
newData[, !names(newData) == "school"]),
colMeans))
rownames(predictions) <- 1:nrow(predictions)
#create a plot
ggplot(as.data.frame(cbind(predictions, funding = newData$school_funding[newData$school == "A"])),
aes(x = funding, y = fit, ymax = upr, ymin = lwr)) +
geom_point() +
geom_errorbar()
For this example the model is more often than not singular and contains very few groups, and as such the result is unlikely to be a great estimator for the marginal effect, but short of extracting the simulations from predictInterval it might suffice. It is likely to improve for models with more grouping levels in the random effect. predictInterval doesn't seem to incorporate a method for this situation directly.
An alternative for looking at marginal effects would be to assume a marginal mean of the form 1/(1+exp(-eta)) (which is often assumed for new groupings of the random effect). This isn't directly implemented in the predictInterval function, but can be achieved by subtracting the random effect from the linear predictor, and only estimating the randomness of the fixed effects, as below:
pp <- predictInterval(merMod = mod,
newdata = ndata, #<= any school chosen
type = "linear.prediction",
which = "fixed")
#remove random effects
pp <- sweep(pp, 1, predict(mod, newdata = ndata, random.only = TRUE), "-")
pp <- 1/(1+exp(-pp))
which could then be plotted using similar methods. For fewer groups this might be a better predictor for the marginal mean(?, someone might correct me here).
In either case, adding a bit of x-jitter might improve the plot.
In all cases there might be some golden nuggets in the references in the GLMM FAQ by Bolker and others.
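The back-transformation step used above (build the interval on the linear-predictor scale, then apply 1/(1+exp(-eta))) can be seen in isolation with a plain glm. This is only a sketch of that one step, not of predictInterval itself: there are no random effects here, the data are simulated, and the names toy_fit, grid and inv_logit are made up for the example.

```r
# Simulated binary outcome depending on a single continuous predictor
set.seed(1)
n <- 500
funding <- rnorm(n)
pass <- rbinom(n, size = 1, prob = plogis(-0.3 + 0.8 * funding))
toy_fit <- glm(pass ~ funding, family = binomial)

# Predict on the link (linear predictor) scale with standard errors
grid <- data.frame(funding = seq(min(funding), max(funding), length.out = 50))
link <- predict(toy_fit, newdata = grid, type = "link", se.fit = TRUE)

# Inverse logit, written out as in the text; equivalent to plogis()
inv_logit <- function(eta) 1 / (1 + exp(-eta))

# Approximate 95% Wald interval on the link scale, mapped to probabilities
grid$fit <- inv_logit(link$fit)
grid$lwr <- inv_logit(link$fit - 1.96 * link$se.fit)
grid$upr <- inv_logit(link$fit + 1.96 * link$se.fit)
head(grid)
```

Because the inverse logit is monotone, an interval built on the link scale always stays inside (0, 1) after transformation, which is why the subtraction of the random effect is done before applying 1/(1+exp(-pp)).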

brms package in R smoother

I have this data frame in R:
x = rep(seq(-10,10,1),each=5)
y = rep(0,length(x) )
weights = sample( seq(1,20,1) ,length(x), replace = TRUE)
weights = weights/sum(weights)
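groups = rep( letters[1:5], times =length(x)/5 )
# assumed step (not in the original post): combine the vectors into the `dat` used below
dat = data.frame(x = x, y = y, weights = weights, group = groups)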
and some data that looks like this:
library(ggplot2)
ggplot(data = dat, aes(x = x, y = y, color = group))+geom_point( aes(size = weights))+
ylab("outcome")+
xlab("predictor x1")+
geom_vline(xintercept = 0)+ geom_hline(yintercept = 0)
library(brms)
fit_brms = brm(y~ s(x)+(1|group), data = dat)
by_group = marginal_effects(fit_brms, conditions = data.frame(group = dat$group) ,
re_formula = NULL, method = "predict")
plot(by_group, ncol = 5, points = TRUE)
I'd like to make a hierarchical nonlinear model so that there is a different nonlinear fit for each group.
In brms, the code above does a spline fit on the x predictor with random intercepts on group. The fitted line is the same for all groups; the only difference is where the lines cross the y-axis. Is there a way to make the nonlinear fit different for each group's data points?
On page 13 here: https://cran.r-project.org/web/packages/brms/vignettes/brms_multilevel.pdf
It states: "As the smooth term itself cannot be modeled as varying by year in a multilevel manner, we add a basic varying intercept in an effort to account for variation between years"
So it appears the spline will be the same for all groups? The only difference in the plots is where the spline crosses the y-axis. That seems very restrictive. Can this be modified to make the spline unique to each group?
Use the formula: y ~ s(x, by = group) + (1|group)
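To see what by = group does without running a full Bayesian fit, here is a small sketch using mgcv directly instead of brms (as far as I know, brms builds its s() terms with mgcv, so the term structure should be the same). The data and the names toy and fit_by are invented for the example, and group has to be a factor for this to produce one smooth per level.

```r
library(mgcv)

# Two groups with clearly different nonlinear shapes
set.seed(7)
n_per_group <- 150
toy <- data.frame(
  x = runif(2 * n_per_group, min = -3, max = 3),
  group = factor(rep(c("a", "b"), each = n_per_group))
)
toy$y <- ifelse(toy$group == "a", sin(toy$x), 0.3 * toy$x^2) +
  rnorm(nrow(toy), sd = 0.2)

# Factor `by` variable: a separate smooth of x is estimated for each level.
# The parametric `group` term plays the role of the group-specific intercept.
fit_by <- gam(y ~ group + s(x, by = group), data = toy)

# One smooth object per group level
smooth_labels <- sapply(fit_by$smooth, function(s) s$label)
print(smooth_labels)
```

In the brms formula the (1|group) term takes the place of the parametric group effect used here.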

R: GAM with fit on subset of data

I fit a Generalized Additive Model using gam from the mgcv package. I have a data table containing my dependent variable Y, an independent variable X, other independent variables Oth and a two-level factor Fac. I would like to fit the following model
Y ~ s(X) + Oth
BUT with the additional constraint that the s(X) term is fit only on one of the two levels of the factor, say Fac==1. The other terms Oth should be fit with the whole data.
I tried exploring s(X,by=Fac) but this biases the fit for Oth. In other words, I would like to express the belief that X relates to Y only if Fac==1, otherwise it does not make sense to model X.
Cheap trick: use an auxiliary variable that is X if Fac == 1 and 0 elsewhere.
library("mgcv")
library("ggplot2")
# simulate data
N <- 1e3
dat <- data.frame(covariate = runif(N),
predictor = runif(N),
group = factor(sample(0:1, N, TRUE)))
dat$outcome <- rnorm(N,
1 * dat$covariate +
ifelse(dat$group == 1,
.5 * dat$predictor +
1.5 * sin(dat$predictor * pi),
0), .1)
# some plots
ggplot(dat, aes(x = predictor, y = outcome,
col = group, group = group)) +
geom_point()
ggplot(dat, aes(x = covariate, y = outcome,
col = group, group = group)) +
geom_point()
# create auxiliary variable
dat$aux <- ifelse(dat$group == 1,
dat$predictor,
0)
# fit the data
fit1 <- gam(outcome ~ covariate + s(predictor, by = group),
data = dat)
fit2 <- gam(outcome ~ covariate + s(aux, by = group),
data = dat)
# compare fits
summary(fit1)
summary(fit2)
If I understand it right, you're thinking about some model with interaction like this:
Y ~ Oth + (Fac==1)*s(X)
If you want to "express the belief that X relates to Y only if Fac==1", don't treat Fac as a factor, but as a numeric variable. In this case you will get a numeric interaction and only one set of coefficients (when it's a factor there would be two). This type of model is a varying coefficient model.
# some data
data <- data.frame(th = runif(100),
X = runif(100),
Y = runif(100),
Fac = sample(0:1, 100, TRUE))
data$Fac<-as.numeric(as.character(data$Fac)) #change to numeric
# then run model
gam(Y~s(X, by=Fac)+th,data=data)
See the description of the by argument in ?s
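A quick way to check that the numeric by variable behaves as described is to look at the per-term contributions: with a 0/1 numeric Fac the smooth is multiplied by Fac, so it should contribute exactly zero to rows where Fac is 0. This is a sketch on simulated data; toy, fit_vc and the data-generating formula are assumptions made for the example.

```r
library(mgcv)

# X only matters when Fac == 1; Oth matters for every row
set.seed(3)
n <- 400
toy <- data.frame(X = runif(n), Oth = runif(n), Fac = sample(0:1, n, TRUE))
toy$Y <- 1 * toy$Oth + toy$Fac * (0.5 + sin(2 * pi * toy$X)) +
  rnorm(n, sd = 0.1)

# Numeric `by`: the smooth of X is multiplied by Fac (varying coefficient model)
fit_vc <- gam(Y ~ Oth + s(X, by = Fac), data = toy)

# Contribution of each model term to the linear predictor, row by row
contrib <- predict(fit_vc, type = "terms")
smooth_col <- grep("s(X)", colnames(contrib), fixed = TRUE)
smooth_part <- contrib[, smooth_col]

# The smooth contributes nothing where Fac == 0
range(smooth_part[toy$Fac == 0])
```

The Oth coefficient is still estimated from all rows, which is the constraint the question asks for.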

Plot the results of a multivariate logistic regression model in R

I would like to plot the results of a multivariate logistic regression analysis (GLM) for a specific independent variables adjusted (i.e. independent of the confounders included in the model) relationship with the outcome (binary).
I have seen posts that recommend the following method using the predict command followed by curve, here's an example;
x <- data.frame(binary.outcome, cont.exposure)
model <- glm(binary.outcome ~ cont.exposure, family=binomial, data=x)
plot(cont.exposure, binary.outcome, xlab="Temperature",ylab="Probability of Response")
curve(predict(model, data.frame(cont.exposure=x), type="resp"), add=TRUE, col="red")
However this does not seem to work for multivariate regression models. I get the following error when I add 'age' (arbitrary - could be any variable of same length) as a confounding variable;
> x <- data.frame(binary.outcome, cont.exposure, age)
> model <- glm(binary.outcome ~ cont.exposure + age, family=binomial, data=x)
> plot(cont.exposure, binary.outcome, xlab="Temperature",ylab="Probability of Response")
> curve(predict(model, data.frame(cont.exposure=x), type="resp"), add=TRUE, col="red")
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
variable lengths differ (found for 'age')
In addition: Warning message:
'newdata' had 101 rows but variable(s) found have 698 rows
The above model is a simplified version of the models I'd like to run, but the principle is the same: I would like to plot the relationship between a binary outcome variable and a continuous exposure, independent of confounding factors.
It would be great to get either a workaround for the above, or an alternative way to view the relationship I am interested in. Many thanks.
set.seed(12345)
dataset <- expand.grid(Temp = rnorm(30), Age = runif(10))
dataset$Truth <- with(dataset, plogis(2 * Temp - 3 * Age))
dataset$Sample <- rbinom(nrow(dataset), size = 1, prob = dataset$Truth)
model <- glm(Sample ~ Temp + Age, data = dataset, family = binomial)
newdata <- expand.grid(
Temp = pretty(dataset$Temp, 20),
Age = pretty(dataset$Age, 5))
newdata$Sample <- predict(model, newdata = newdata, type = "response")
library(ggplot2)
ggplot(newdata, aes(x = Temp, y = Sample)) + geom_line() + facet_wrap(~Age)
ggplot(newdata, aes(x = Temp, y = Sample, colour = Age, group = Age)) +
geom_line()
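If you prefer to keep the plot/curve idiom from the question, the error can be avoided by supplying every predictor in newdata: predict() could not find age in the new data frame and fell back to the original 698-row variable. One option is to hold the confounder at a fixed value such as its median. The sketch below uses simulated stand-ins for the asker's variables (the data-generating values, the data frame name dat and the helper adjusted_curve are assumptions for illustration).

```r
# Simulated stand-ins for the variables in the question
set.seed(12345)
n <- 300
cont.exposure <- rnorm(n)
age <- runif(n, min = 20, max = 80)
binary.outcome <- rbinom(n, size = 1,
                         prob = plogis(1.5 * cont.exposure - 0.02 * (age - 50)))
dat <- data.frame(binary.outcome, cont.exposure, age)

model <- glm(binary.outcome ~ cont.exposure + age, family = binomial, data = dat)

# Predicted probability over exposure, with age held at its median
adjusted_curve <- function(exposure) {
  newdata <- data.frame(cont.exposure = exposure, age = median(dat$age))
  predict(model, newdata = newdata, type = "response")
}

plot(dat$cont.exposure, dat$binary.outcome,
     xlab = "Exposure", ylab = "Probability of Response")
curve(adjusted_curve(x), add = TRUE, col = "red")
```

The curve then shows the exposure effect for a subject of median age; other reference values for the confounder shift the curve but use the same fitted coefficients.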
