I have this data frame in R:
x <- rep(seq(-10, 10, 1), each = 5)
y <- rep(0, length(x))
weights <- sample(seq(1, 20, 1), length(x), replace = TRUE)
weights <- weights / sum(weights)
groups <- rep(letters[1:5], times = length(x) / 5)
dat <- data.frame(x, y, weights, group = groups)
and I plot the data like this:
library(ggplot2)
ggplot(data = dat, aes(x = x, y = y, color = group)) +
  geom_point(aes(size = weights)) +
  ylab("outcome") +
  xlab("predictor x1") +
  geom_vline(xintercept = 0) +
  geom_hline(yintercept = 0)
library(brms)
fit_brms <- brm(y ~ s(x) + (1 | group), data = dat)
# (marginal_effects() was renamed conditional_effects() in newer brms versions)
by_group <- marginal_effects(fit_brms,
                             conditions = data.frame(group = dat$group),
                             re_formula = NULL, method = "predict")
plot(by_group, ncol = 5, points = TRUE)
I'd like to fit a hierarchical nonlinear model so that there is a different nonlinear fit for each group.
The brms code above fits a spline on the predictor x with random intercepts on group, but the fitted line is the same for all groups; the only difference is where the lines cross the y axis. Is there a way to make the nonlinear fit different for each group's data points?
On page 13 here: https://cran.r-project.org/web/packages/brms/vignettes/brms_multilevel.pdf
it states: "As the smooth term itself cannot be modeled as varying by year in a multilevel manner, we add a basic varying intercept in an effort to account for variation between years."
So it appears the spline will be the same for all groups, and the only difference in the plots is where the spline crosses the y axis. That seems very restrictive. Can this be modified to make the spline unique to each group?
Use the formula: y ~ s(x, by = group) + (1|group)
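For example, a minimal sketch (assuming group is stored as a factor in dat; fit_by_group is just an illustrative name):
# Separate smooth of x for each level of group, plus a varying intercept.
# s(x, by = group) requires group to be a factor.
dat$group <- factor(dat$group)
fit_by_group <- brm(y ~ s(x, by = group) + (1 | group), data = dat)
A factor-smooth basis, s(x, group, bs = "fs"), is an alternative if you want the per-group smooths penalized toward a common shape.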
Problem: I am using a multilevel (mixed-effects) model and am not sure what to set the grouping variable to in order to generate predicted probabilities for a measured group-level variable from a glmer model using merTools' predictInterval function.
Goal: Generate predicted probabilities and SEs/CIs across a range of values of a "second-level" group-level variable.
Seeking: Advice on how to do this properly, or other recommendations for generating predicted probabilities and CIs across the range of values of a group-level variable from a glmer model.
library(lme4)
library(merTools)
library(ggplot2)
# data.frame (not tibble's data_frame, which is not loaded here);
# school is a factor so that levels() works below
hier_data <- data.frame(pass = sample(c(0, 1), size = 1000, replace = TRUE),
                        wt = rnorm(1000),
                        ht = sample(1:5, size = 1000, replace = TRUE,
                                    prob = c(.1, .1, .6, .1, .1)),
                        school_funding = rnorm(1000),
                        school = factor(rep(c("A", "B", "C", "D", "E"), each = 200)))
mod <- glmer(pass ~ wt + ht + school_funding + (1 | school),
             family = binomial("logit"), data = hier_data)
### Without school: error
ndata <- data.frame(wt = median(hier_data$wt),
                    ht = median(hier_data$ht),
                    school_funding = seq(from = min(hier_data$school_funding),
                                         to = max(hier_data$school_funding),
                                         length.out = 100))
pp <- cbind(ndata, predictInterval(merMod = mod,
                                   newdata = ndata,
                                   type = "probability"))
### Problem, when adding the school variable: which school?
ndata <- data.frame(wt = median(hier_data$wt),
                    ht = median(hier_data$ht),
                    school_funding = seq(from = min(hier_data$school_funding),
                                         to = max(hier_data$school_funding),
                                         length.out = 100),
                    school = "A")
pp <- cbind(ndata, predictInterval(merMod = mod,
                                   newdata = ndata,
                                   type = "probability"))
ggplot(pp, aes(x = school_funding, y = fit)) +
  geom_point() +
  geom_errorbar(aes(ymin = lwr, ymax = upr))
It seems what you are trying to achieve is an effects plot for your variables, with fast prediction intervals. Note first of all that predictInterval does not incorporate the uncertainty in the estimated variance parameters, theta. If more accurate confidence intervals are needed, you should use the bootMer function, as described in ?bootMer, which uses bootstrapping to estimate the uncertainty. However, this might simply be infeasible as the model size and complexity increase. Alternatively, the effects package can illustrate the effects of merMod objects (although its documentation is simply atrocious).
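For instance, a rough bootMer sketch (my assumptions: re.form = NA to get fixed-effects-only predictions, a deliberately small nsim, ndata as defined in the question, and pred_fun/boot_pp/boot_ci as illustrative names):
# Bootstrap fixed-effect predictions on the probability scale;
# increase nsim substantially for real use (the model is refit nsim times)
pred_fun <- function(m) predict(m, newdata = ndata, re.form = NA, type = "response")
boot_pp <- bootMer(mod, FUN = pred_fun, nsim = 100)
boot_ci <- t(apply(boot_pp$t, 2, quantile, probs = c(0.025, 0.975)))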
In general, when illustrating the effects of merMod objects, the question is "which effects?": are you interested in the marginal effects or the conditional effects (conditional on the random effects)? If your model contains only random intercepts (no random slopes) and you are interested in the uncertainty of the fixed-effect coefficients, i.e. the effect on the conditional mean, you can get away with using any school and specifying which = "fixed" in predictInterval, as
pp <- cbind(ndata, predictInterval(merMod = mod,
                                   newdata = ndata, # <= any school chosen
                                   type = "probability",
                                   which = "fixed"))
Note that the size of the effect will depend on the chosen school and the remaining coefficients, as in standard models, so the estimates are not causal.
If you are interested in the marginal effect, there are multiple methods for approximating it. The optimal one would be to bootstrap the predicted values of the marginal mean. Alternatively, if the number of independent groups in your grouping variable is "large" enough, you could (maybe) average prediction intervals across groups, as illustrated below
newData <- expand.grid(wt = median(hier_data$wt),
                       ht = median(hier_data$ht),
                       school = levels(hier_data$school),
                       school_funding = seq(from = min(hier_data$school_funding),
                                            to = max(hier_data$school_funding),
                                            length.out = 100))
pp <- predictInterval(merMod = mod,
                      newdata = newData,
                      type = "probability")
# Split predictions by every column but school,
# and calculate the estimated means
predictions <- do.call("rbind",
                       lapply(split(as.data.frame(pp),
                                    newData[, !names(newData) == "school"]),
                              colMeans))
rownames(predictions) <- 1:nrow(predictions)
# Create a plot
ggplot(as.data.frame(cbind(predictions,
                           funding = newData$school_funding[newData$school == "A"])),
       aes(x = funding, y = fit, ymax = upr, ymin = lwr)) +
  geom_point() +
  geom_errorbar()
For this example the model is more often than not singular and contains very few groups, so the result is unlikely to be a great estimator of the marginal effect, but short of extracting the simulations from predictInterval it might suffice. It is likely to improve for models with more grouping levels in the random effect. predictInterval does not seem to provide a method for this situation directly.
An alternative for looking at marginal effects would be to assume a marginal mean of the form 1/(1 + exp(-eta)) (which is often assumed for new groupings of the random effect). This is not implemented directly in the predictInterval function, but it can be achieved by subtracting the random effect from the linear predictor and estimating only the randomness of the fixed effects, as below:
pp <- predictInterval(merMod = mod,
                      newdata = ndata, # <= any school chosen
                      type = "linear.prediction",
                      which = "fixed")
# Remove the random effect from the linear predictor,
# then transform to the probability scale
pp <- sweep(pp, 1, predict(mod, newdata = ndata, random.only = TRUE), "-")
pp <- 1 / (1 + exp(-pp))
which could then be plotted using similar methods. For fewer groups this might be a better predictor of the marginal mean (someone might correct me here).
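For instance, a quick plotting sketch (assuming pp still has the fit/lwr/upr columns produced by predictInterval, and ndata is the version that includes school = "A"):
# Approximate marginal mean with its interval
ggplot(cbind(ndata, pp), aes(x = school_funding, y = fit)) +
  geom_line() +
  geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = .2)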
In either case, adding a bit of x-jitter might improve the plot.
In all cases there might be some golden nuggets in the GLMM FAQ by Bolker and others.
I am new to R and trying to learn. I am trying to plot the lift curves of multiple classifiers in one graph, but I can't figure out a way to do it. I know the two classifiers below are essentially the same, but they give different graphs and I just want to combine the two. Below is the code I tried. Could someone please point me in the right direction?
fullmod <- glm(Response ~ page_views_90d + win_visits + osx_visits + mc_1 + mc_2 +
                 mc_3 + mc_4 + mc_5 + mc_6 + store_page + orders + orderlines +
                 bookings + purchase,
               data = training, family = binomial)
summary(fullmod)
fullmod.results <- predict(fullmod, newdata = testing, type = 'response')
plotLift(fullmod.results, test_data_full$class, cumulative = TRUE, col = "orange", n.buckets = 5)
redmod1 <- glm(Response ~ win_visits + osx_visits + mc_2 + mc_4 + mc_6 + store_page +
                 orders + orderlines + bookings + purchase,
               data = training, family = binomial)
redmod1.results <- predict(redmod1, newdata = testing, type = 'response')
plotLift(redmod1.results, test_data_full$class, cumulative = TRUE, col = "orange", n.buckets = 5)
# Attempt to plot multiple classifiers (not valid syntax)
plotLift((redmod1.results, fullmod.results), test_data_full$class, cumulative = TRUE, col = "orange", n.buckets = 5)
Here is a way to plot multiple lift curves using the caret library. But first some data:
set.seed(1)
for_lift <- data.frame(Class = factor(rep(1:2, each = 50)),
                       model1 = sort(runif(100), decreasing = TRUE),
                       model2 = runif(100),
                       model3 = runif(100))
Here the Class column holds the true classes, model1 holds the predicted probabilities from the first model, and so on.
Now create a lift object from the data using:
library(caret)
lift_curve <- lift(Class ~ model1 + model2 + model3, data = for_lift)
and plot it
xyplot(lift_curve, auto.key = list(columns = 3))
If you would like to plot with ggplot:
library(ggplot2)
ggplot(lift_curve$data) +
  geom_line(aes(CumTestedPct, CumEventPct, color = liftModelVar)) +
  xlab("% Samples tested") +
  ylab("% Samples found") +
  scale_color_discrete(guide = guide_legend(title = "method")) +
  geom_polygon(data = data.frame(x = c(0, lift_curve$pct, 100, 0),
                                 y = c(0, 100, 100, 0)),
               aes(x = x, y = y), alpha = 0.1)
I need to graph the predicted probabilities of a logit regression in ggplot2. Essentially, I am trying to graph a glm by each treatment condition within the same graph. However, I am getting quite confused about how to do this, seeing that my treat variable (i.e. the x I am interested in) is categorical. This means that when I try to graph the treatment effects using ggplot I just get a bunch of points at 0, 1, and 2 but no lines.
My question is: how could I graph the logit prediction lines in this case? Thanks in advance!
set.seed(96)
df <- data.frame(
  vote = sample(0:1, 200, replace = TRUE),
  treat = sample(0:3, 200, replace = TRUE))
glm_output <- glm(vote ~ as.factor(treat), data = df, family = binomial(link = "logit"))
# Note: predict.glm has no 'interval' argument; se.fit = TRUE returns standard
# errors, and type = "response" gives predicted probabilities
predicted_vote <- predict(glm_output, newdata = df, type = "response", se.fit = TRUE)
df <- cbind(df, data.frame(predicted_vote))
df$predicted_prob <- df$fit
Since the explanatory variable treat is categorical, it makes more sense to use a boxplot instead, like the following:
ggplot(df, aes(x = treat, y = predicted_prob)) +
  geom_boxplot(aes(fill = factor(treat)), alpha = .2)
If you want to see the predicted probabilities across values of other explanatory variables, you may try something like this (note that gender and age are not in the simulated df; they are only for illustration):
ggplot(df, aes(x = treat, y = predicted_prob)) +
  geom_boxplot(aes(fill = factor(treat)), alpha = .2) +
  facet_wrap(~ gender)
# Create age groups, then facet by age group and gender
df$age_group <- cut(df$age, breaks = seq(0, 100, 20))
ggplot(df, aes(x = treat, y = predicted_prob)) +
  geom_boxplot(aes(fill = factor(treat)), alpha = .2) +
  facet_grid(age_group ~ gender)
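Another option is to plot the fitted probability per treatment level with a confidence interval. A sketch, under the assumption that a Wald interval on the link scale (back-transformed with plogis) is acceptable; pred and ci are illustrative names:
# Point estimates and approximate 95% CIs per treatment level
pred <- predict(glm_output, newdata = data.frame(treat = 0:3),
                type = "link", se.fit = TRUE)
ci <- data.frame(treat = factor(0:3),
                 prob = plogis(pred$fit),
                 lwr = plogis(pred$fit - 1.96 * pred$se.fit),
                 upr = plogis(pred$fit + 1.96 * pred$se.fit))
ggplot(ci, aes(x = treat, y = prob)) +
  geom_point() +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = .2)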
I have a multi-panel plot created with facet_grid in ggplot, and model outputs from the FlexParamCurve package.
FlexParamCurve provides a model fit for each set of data, i.e. each panel in the plot. I have found code elsewhere for plotting the same curve across all panels, and for plotting model curves for individual lines in each panel. But how can I plot the model curve for each specific panel?
Sample data;
DATA <- data.frame(Cond = rep(1:2, each = 60),
                   Site = rep(1:2, each = 30),
                   Survey = rep(1:5, each = 6),
                   Time = rep(1:6))
NUMBERS <- data.frame(Numbers = c(10,13,10,16,31,25,4,11,11,16,21,23,7,12,15,18,20,19,9,15,22,21,24,30,5,10,15,21,21,24,5,7,10,12,20,17,7,11,17,25,27,34,4,9,13,18,23,28,11,15,17,20,21,25,9,18,21,24,21,30,10,11,8,15,20,17,2,6,5,13,14,15,7,9,13,23,28,25,8,13,17,20,24,24,10,10,15,19,23,26,1,2,5,10,17,18,1,3,5,8,15,19,1,8,14,18,26,27,4,9,17,23,25,23,6,12,15,17,20,23))
GRAPH.DATA <- data.frame(DATA, NUMBERS)
ggplot code;
ggplot(GRAPH.DATA, aes(Time, Numbers, colour = factor(Survey))) +
  geom_line(aes(group = Survey), size = 1.5) +
  facet_grid(Cond ~ Site)
FlexParamCurve code and model outputs;
MODELS <- pn.mod.compare(GRAPH.DATA$Time, GRAPH.DATA$Numbers, GRAPH.DATA$Site, pn.options = "MOD.OUTPUTS")
Call:
Model: y ~ SSposnegRichards(x, Asym = Asym, K = K, Infl = Infl, M = M,
RAsym = RAsym, modno = 11, pn.options = "MOD.OUTPUTS") | grp
Data: userdata
Coefficients:
Asym K Infl M RAsym
1-1 24.32301 1.0579649 35.61585 17.961961 -8.1278653
1-2 25.15115 0.4167590 31.70722 6.473804 -14.2503569
2-1 23.69160 0.9179826 36.32567 15.839351 -11.2709848
2-2 24.75980 0.3559937 35.14258 4.499553 -14.4040608
The FlexParamCurve call is in the format (x, y, grouping variable, output_object).
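One possible way to overlay each panel's fitted curve, sketched under two assumptions: that coef(MODELS) returns the coefficient table printed above, and that its rows "1-1" ... "2-2" correspond to the Cond-Site combinations in that order:
# Evaluate the fitted curve for each Cond-Site panel from the coefficient table
cf <- as.data.frame(coef(MODELS))
cf$Cond <- c(1, 1, 2, 2)
cf$Site <- c(1, 2, 1, 2)
CURVES <- do.call(rbind, lapply(seq_len(nrow(cf)), function(i) {
  Time <- seq(1, 6, length.out = 50)
  data.frame(Cond = cf$Cond[i], Site = cf$Site[i], Time = Time,
             Numbers = SSposnegRichards(Time, Asym = cf$Asym[i], K = cf$K[i],
                                        Infl = cf$Infl[i], M = cf$M[i],
                                        RAsym = cf$RAsym[i],
                                        modno = 11, pn.options = "MOD.OUTPUTS"))
}))
# Add the curves as an extra layer; facetting places each one in its panel
ggplot(GRAPH.DATA, aes(Time, Numbers, colour = factor(Survey))) +
  geom_line(aes(group = Survey), size = 1.5) +
  geom_line(data = CURVES, aes(Time, Numbers), inherit.aes = FALSE) +
  facet_grid(Cond ~ Site)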
I just discovered that mgcv::s() permits supplying a matrix to its by argument, allowing one to smooth a continuous variable with separate smooths for each of a combination of variables (and their interactions, if desired). However, I'm having trouble getting sensible predictions from such models. For example:
library(mgcv) #for gam
library(ggplot2) #for plotting
#Generate some fake data
set.seed(1) #for replicability of this example
myData = expand.grid(
var1 = c(-1,1)
, var2 = c(-1,1)
, z = -10:10
)
myData$y = rnorm(nrow(myData)) + (myData$z^2 + myData$z*4) * myData$var1 +
(3*myData$z^2 + myData$z) * myData$var2
#note additive effects of var1 and var2
#plot the data
ggplot(
data = myData
, mapping = aes(
x = z
, y = y
, colour = factor(var1)
, linetype = factor(var2)
)
)+
geom_line(
alpha = .5
)
#reformat to matrices (one column per weighting variable)
zMat = matrix(rep(myData$z, times = 2), ncol = 2)
xMat = matrix(c(myData$var1, myData$var2), ncol = 2)
#get the fit
fit = gam(
  formula = myData$y ~ s(zMat, by = xMat, k = 5)
)
#get the predictions and plot them
predicted = myData
predicted$value = predict(fit)
ggplot(
data = predicted
, mapping = aes(
x = z
, y = value
, colour = factor(var1)
, linetype = factor(var2)
)
)+
geom_line(
alpha = .5
)
The first ggplot call yields a sensible plot of the input data, while the second yields an obviously awry plot of the predicted values.
Whereas replacing the gam fit above with:
fit = gam(
formula = y ~ s(z,by=var1,k=5) + s(z,by=var2,k=5)
, data = myData
)
but otherwise running the same code yields a reasonable plot of the predicted values.
What am I doing wrong here?
The use of vector-valued inputs to mgcv smooths is taken up here. It seems to me that you are misunderstanding these model types.
Your first formula
myData$y ~ s(zMat,by=xMat,k=5)
fits the model
y ~ f(z)*x_1 + f(z)*x_2
That is, mgcv estimates a single smooth function f(). This function is evaluated at each covariate, with the weightings supplied to the by argument.
Your second formula
y ~ s(z,by=var1,k=5) + s(z,by=var2,k=5)
fits the model
y ~ f_1(z)*x_1 + f_2(z)*x_2
where f_1() and f_2() are two different smooth functions. Your data model is essentially the second formula, so it is not surprising that it gives a more sensible-looking fit.
The first formula is useful when you want an additive model where a single function is evaluated on each variable, with given weightings.
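A quick way to see the distinction (a sketch: it simulates data where one shared function f() really is weighted by both variables, so the matrix-by form is the right model; the names zM, xM, and fit_shared are mine):
# Simulate y = f(z)*x1 + f(z)*x2 + noise with a single shared f()
set.seed(2)
shared <- expand.grid(var1 = c(-1, 1), var2 = c(-1, 1), z = -10:10)
f <- function(z) z^2 + 4 * z
shared$y <- rnorm(nrow(shared)) + f(shared$z) * shared$var1 +
  f(shared$z) * shared$var2
zM <- matrix(shared$z, nrow(shared), 2)  # the same z in both columns
xM <- cbind(shared$var1, shared$var2)    # the two weighting variables
fit_shared <- gam(shared$y ~ s(zM, by = xM, k = 5))
summary(fit_shared)  # a single smooth term, shared by both weightings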