Generate table with side-by-side node models of `partykit:mob()` object - r

Let's say I fit a model using partykit:mob(). Afterward, I would like to generate a side-by-side table with all the nodes (including the model fitted using the whole sample). Here I attempted to do it using stargazer(), but other ways are more than welcome.
Below an example and attempts to get the table.
library("partykit")
require("mlbench")
## Pima Indians diabetes data
data("PimaIndiansDiabetes", package = "mlbench")
## a simple basic fitting function (of type 1) for a logistic regression
logit <- function(y, x, start = NULL, weights = NULL, offset = NULL, ...) {
glm(y ~ 0 + x, family = binomial, start = start, ...)
}
## set up a logistic regression tree
pid_tree <- mob(diabetes ~ glucose | pregnant + pressure + triceps + insulin +
mass + pedigree + age, data = PimaIndiansDiabetes, fit = logit)
pid_tree
# Model-based recursive partitioning (logit)
#
# Model formula:
# diabetes ~ glucose | pregnant + pressure + triceps + insulin +
# mass + pedigree + age
#
# Fitted party:
# [1] root
# | [2] mass <= 26.3: n = 167
# | x(Intercept) xglucose
# | -9.95150963 0.05870786
# | [3] mass > 26.3
# | | [4] age <= 30: n = 304
# | | x(Intercept) xglucose
# | | -6.70558554 0.04683748
# | | [5] age > 30: n = 297
# | | x(Intercept) xglucose
# | | -2.77095386 0.02353582
#
# Number of inner nodes: 2
# Number of terminal nodes: 3
# Number of parameters per node: 2
# Objective function: 355.4578
1.- Extract summary(pid_tree, node = x) + stargazer().
## I want to replicate this table extracting the the nodes from partykit object.
library(stargazer)
m.glm<- glm(diabetes ~ glucose, family = binomial,data = PimaIndiansDiabetes)
typeof(m.glm)
## [1] "list"
class(m.glm)
## [1] "glm" "lm"
stargazer(m.glm)
## ommited output.
## Extracting summary from each node
summ_full_data <- summary(pid_tree, node = 1)
summ_node_2 <- summary(pid_tree, node = 2)
summ_node_4 <- summary(pid_tree, node = 4)
summ_node_5 <- summary(pid_tree, node = 5)
## trying to create stargazer table with coefficients
stargazer(m.glm,
summ_node_2,
summ_node_4,
summ_node_5,title="MOB Results")
##Error: $ operator is invalid for atomic vectors
2.- Extract pid_tree[x] + stargazer().
## Second Attempt (extracting modelparty objects instead)
node_2 <- pid_tree[2]
node_4 <- pid_tree[4]
node_5 <- pid_tree[5]
class(node_5)
##[1] "modelparty" "party"
stargazer(m.glm,
node_2,
node_4,
node_5,title="MOB Results")
# % Error: Unrecognized object type.
# % Error: Unrecognized object type.
# % Error: Unrecognized object type.
3.- Not really elegant, I know: Force class to emulate the glm object.
## Force class of object to emulate glm one
class(m.glm)
class(summ_node_2) <- c("glm", "lm")
stargazer(summ_node_2)
##Error in if (p > 0) { : argument is of length zero
A rather pragmatic solution would be just re-fit the model recovering the rules found by partykit:mob() and then use stargaze() on them, but for sure I am missing something here. Thanks in advance.

It's best to extract (or refit) the list of model objects per node and then apply the table package of choice. Personally, I don't like stargazer much and much rather use modelsummary instead or sometimes the good old memisc.
If the tree contains the model $objects in the $info (as for pid_tree) you can use nodeapply() for all nodeids() to extract these:
pid_models <- nodeapply(pid_tree, ids = nodeids(pid_tree), FUN = function(x) x$info$object)
If you just want to extract the fitted models for the terminal nodes (leaves) of the tree, then you can do so by setting ids = nodeids(pid_tree, terminal = TRUE).
Alternatively, especially when the model objects are not stored, you can easily refit them via:
pid_models <- refit.modelparty(pid_tree)
Here, you could also include node = nodeids(pid_tree, terminal = TRUE) to only refit the terminal node models.
In all cases you can subsequently use
msummary(pid_models)
to produce the model summary table. It supports a variety of output formats and of course you can tweak the list further to change the results, e.g., by changing their names etc. The default output looks like this:

My bad, it was a small difference that makes it work. Here a solution, not sure if the best way, but it does the work.-
library(stargazer)
obj_node_full_sample<- pid_tree[1]$node$info$object
obj_node_2<- pid_tree[2]$node$info$object
obj_node_4<- pid_tree[4]$node$info$object
obj_node_5<- pid_tree[5]$node$info$object
stargazer(obj_node_full_sample,
obj_node_2,
obj_node_4,
obj_node_5,title="Results", align=TRUE)

Related

Removing certain parts of modelsummary in R at specific statistics

I am using gamm4:gamm4 to model longitudinal change.
I am trying to use the modelsummary::modelsummary function to create an output table of the following results:
I would like to add t-values and std.error to the output of the fixed effects, and remove the empty tags values from the random effects
model_lmer <- gamm4(Y ~ Tract + s(Age, by = Tract, k = 10) + Sex,
data = (DF1),
random = ~ (0 + Tract | ID))
modelsummary(model_lmer$mer,
statistic = c("s.e. = {std.error}",
"t = {statistic}"))
But I am struggling to write the correct syntax to remove the "t" and "s.e." from the random effects output.
This is kind of tricky, actually. The issue is that modelsummary()
automatically drops empty rows when they are filled with NA or an
empty string "". However, since glue strings can include arbitrary
text, it is hard to think of a general way to figure out if the row is
empty or not, because modelsummary() cannot know ex ante what
constitutes an empty string.
If you have an idea on how this check could be implemented, please report it
on Github:
https://github.com/vincentarelbundock/modelsummary
In the meantime, you could use the powerful tidy_custom.CLASSNAME
mechanism
to customize the statistic and p.value statistics directly instead
of through a glue string:
library(gamm4)
library(modelsummary)
# simulate
x <- runif(100)
fac <- sample(1:20,100,replace=TRUE)
eta <- x^2*3 + fac/20; fac <- as.factor(fac)
y <- rpois(100,exp(eta))
# fit
mod <- gamm4(y~s(x),family=poisson,random=~(1|fac))
# customize
tidy_custom.glmerMod <- function(x) {
out <- parameters::parameters(x)
out <- insight::standardize_names(out, style = "broom")
out$statistic <- sprintf("t = %.3f", out$statistic)
out$p.value <- sprintf("p = %.3f", out$p.value)
out
}
# summarize
modelsummary(mod$mer,
statistic = c("{statistic}", "{p.value}"))
Model 1
X(Intercept)
1.550
t = 17.647
p = 0.000
Xs(x)Fx1
0.855
t = 4.445
p = 0.000
Num.Obs.
100
RMSE
2.49
Note that I used simple glue strings in statistic = "{p.value}", otherwise they would be wrapped up in parentheses, as is default for standard errors.

Storing results from simulated merMods in data frame

Updated:
I'm trying to examine the variability in my parameter estimates from a merMod object by simulating known data and running the model 100 times. I'd like the result to be a data frame that looks like the following:
| simulation | intercept | est.x1 | est.x2 |
| ---------- | --------- | ------ | ------ |
| sim_study1 |.09 |.75 |.25 |
| sim_study2 |.10 |.72 |.21 |
| sim_study3 |NA |NA |NA |
My code to generate multilevel data with a random intercept and 2 predictors is:
# note. this code block runs as expected, and if I run a lmer() call
# on a simulated data set I get values that one would expect.
gen_fake <- function(i, j){
school <- rep(1:j)
person <- rep(1:i) # students nested in schools
# parameters
mu_a_true <- 0.10 # real intercept
sigma_a_true <- 0.10 # varince of intercept
sigma_y_true <- 0.40
b1_true <- .75
b2_true <- .25
# random intercept for schools
a_true <- rnorm(j, mu_a_true, sigma_a_true)
# random data for predictors
x1 <- rnorm(i, 0, 1)
x2 <- rnorm(i, 0, 1)
# outcome
y <- rnorm(i, a_true[school] + b1_true*x1 + b2_true*x2, sigma_y_true)
return (data.frame(y, person, school, x1, x2))
}
I'm attempting to conduct a 100 simulations of a model, while generating new data each time. Note, I'm trying to implement tryCatch within the loop because with more complex models, where the model might not terminate normally, I'd like value returned in the table to be NA for parameters.
My code for this is as follows:
# create an empty data frame with names of parameters (there's probably
# a slicker way to do this within the loop where I can match names from
# the model call)
sim_results <- data.frame(matrix(nrow=100, ncol=3,
dimnames=list(c(),
c("intercept",
"est.x1", "est.x2"))),
stringsAsFactors=F)
# load library for analysis
library(lme4)
# conduct 100 simulations of the model generating fake data for each run
sim_study <- function (i, j, n.sims){
for (sim in 1:n.sims){
fake_dat <- gen_fake(i, j)
tryCatch({
lmer_sim <- lmer(y ~ x1 + x2 + (1|school), data = fake_dat)
}, error = function(e){
return(NA)
}) #return previous value of fm if error
estimates <- rbind(fixef(lmer_sim))
}
sim_results[sim,] <- estimates
}
# run the simulation study
sim_study (1000,5,100)
The issue I am having is that the function only returns 1 row and it isn’t populating the empty data frame I made:
(Intercept) x1 x2
[1,] 0.09659339 0.7746392 0.2325391
I'm unsure of the issue. Finally, any feedback you might have for how to make this work faster would also be appreciated, as I'd like to learn more about that issue. Thanks for any assistance.
This may be a bit of a forehead-slapper, but I think you just misplaced the loop brackets? This works for me:
sim_study <- function (i, j, n.sims){
for (sim in 1:n.sims) {
if (sim %% 10 == 0 ) cat(".\n") ## print progress
fake_dat <- gen_fake(i, j)
tryCatch({
lmer_sim <- lmer(y ~ x1 + x2 + (1|school),
data = fake_dat)
}, error = function(e){
return(rep(NA,3)) ## return vector of correct length
}) #return previous value of fm if error
estimates <- rbind(fixef(lmer_sim))
sim_results[sim,] <- estimates
}
return(sim_results)
}
A few more points:
I'm not sure whether the tryCatch() logic works, since I didn't hit any errors (but I think it ought to be modified to return an object with the current length, as above)
you could replace some of your gen_fake (not the generation of the predictors, but the generation of the response with the built-in ?simulate.merMod(), but I don't think it would actually work any better (or worse)
speeding this up significantly would be a bit of work/hacky. There is a refit() function that works quickly if only the predictor variable has changed, but it doesn't hold in this case. You could use the tricks specified here ...

retain several best models during model dredging in R

Is there a way to retain the best models, for example, within two Alkaike Information Criterion (AIC) units of the best fitting model, during a model dredging approach in R? I am using the glmulti package, which returns the AIC of the best models, but does not allow visualizing the models associated with those values.
Thanks in advance.
Here is my example (data here):
results <- read.csv("gameresults.csv")
require(glmulti)
M <- glmulti(result~speed*svl*tailsize*strategy,
data=results, name = "glmulti.analysis",
intercept = TRUE, marginality = FALSE,
level = 2, minsize = 0, maxsize = -1, minK = 0, maxK = -1,
fitfunction = Multinom, method = "h", crit = "aic",
confsetsize = 100,includeobjects=TRUE)
summary(M)
The function glmulti::glmulti returns a S4 class object that can be accessed like a list. All of your models, not just the best, could be accessed. Since I don't have your functions and some other optional inputs, I performed a simplified version of your model just as a demonstration:
results <- read.csv("gameresults.csv")
library(glmulti)
M <- glmulti(result~speed*svl*strategy, data=results, crit = "aic", plotty = TRUE)
Here are a list of all models, accessed by the # operator:
M#formulas
# [[1]]
# result ~ 1 + speed + svl:speed + strategy:speed
# <environment: 0x11a616750>
#
# [[2]]
# result ~ 1 + speed + svl + svl:speed + strategy:speed
# <environment: 0x11a616750>
#
# [[3]]
# result ~ 1 + strategy + speed + svl:speed + strategy:speed
# <environment: 0x11a616750>
#
## **I omitted the remaining 36-3=33 models**
You can plot them individually based on the formula, using the base graphic or any packages that support use of model formulas. For example, I randomly selected one from the list:
plot(result ~ 1 + speed + svl, data=results)
## Hit <Return> to see next plot:
## Hit <Return> to see next plot:

Get name in formula dynamically in R

When running a decision tree I use:
mod1 <- C5.0(Species ~ ., data = iris)
If I want to pass in a data frame and set the target feature name in the formula (something different than "Species") how would I do this?
For example,
mod1 <- C5.0(other_data[,target_column] ~ ., data = other_data)
which obviously doesn't work.
1) Paste together the formula:
fun <- function(resp, data) C5.0(as.formula(paste(resp, "~ .")), data = data)
# test
library(C50)
fun("Species", iris)
giving:
Call:
C5.0.formula(formula = as.formula(paste(resp, "~ .")), data = data)
Classification Tree
Number of samples: 150
Number of predictors: 4
Tree size: 4
Non-standard options: attempt to group attributes
2) Or this variation which gives nicer rendition of the call on the line after Call: in the output:
fun <- function(resp, data)
do.call(C5.0, list(as.formula(paste(resp, "~ .")), data = substitute(data)))
fun("Species", iris)
giving:
Call:
C5.0.formula(formula = Species ~ ., data = iris)
Classification Tree
Number of samples: 150
Number of predictors: 4
Tree size: 4
Here is a second test of this version of fun using the builtin data frame CO2:
fun("Plant", CO2)
giving:
Call:
C5.0.formula(formula = Plant ~ ., data = CO2)
Classification Tree
Number of samples: 84
Number of predictors: 4
Tree size: 7
Non-standard options: attempt to group attributes
The following allows for passing in arbitrary data and a target feature to the C50 method:
boosted_trees <- function(data_train, target_feature, iter_choice) {
target_index <- grep(target_feature, colnames(data_train))
model_boosted <- C5.0(x = data_train[, -target_index], y = data_train[[target_feature]], trial=iter_choice)
model_boosted$call$x <- data_train[, -target_index]
model_boosted$call$y <- data_train[[target_feature]]
return(model_boosted)
}
The trick is to rename the terms in the method call after building the model so that it can be plotted.
An alternative that may be preferable is to overwrite the symbol within the parse tree after creating the formula:
x <- Species~.;
x;
## Species ~ .
x[[2L]] <- as.symbol('Blah');
x;
## Blah ~ .
The above works because formulas are encoded as normal parse trees, with a top-level node that consists of a call (typeof 'language', mode 'call') of the `~`() function, and classed as 'formula':
(function(x) c(typeof(x),mode(x),class(x)))(.~.);
## [1] "language" "call" "formula"
All parse trees can be read and written as a recursive list structure. Here I'll demonstrate that using a nice little recursive function I originally wrote for this answer:
unwrap <- function(x) if (typeof(x) == 'language') lapply(as.list(x),unwrap) else x;
unwrap(Species~.);
## [[1]]
## `~`
##
## [[2]]
## Species
##
## [[3]]
## .
##
In other words, parse trees represent function calls with the function symbol as the first list component, and then all function arguments as the subsequent list components. The special case of a normal formula captures the LHS as the first function argument and the RHS as the second. Hence x[[2L]] represents the LHS symbol of your formula, which we can overwrite directly with a normal assignment to your preferred symbol.

How to retrieve correlation matrix from glm models in R

I am using the gls function from the nlme package. You can copy and paste the following code to reproduce my analysis.
library(nlme) # Needed for gls function
# Read in wide format
tlc = read.table("http://www.hsph.harvard.edu/fitzmaur/ala2e/tlc.dat",header=FALSE)
names(tlc) = c("id","trt","y0","y1","y4","y6")
tlc$trt = factor(tlc$trt, levels=c("P","A"), labels=c("Placebo","Succimer"))
# Convert to long format
tlc.long = reshape(tlc, idvar="id", varying=c("y0","y1","y4","y6"), v.names="y", timevar="time", direction="long")
# Create week numerical variable
tlc.long$week = tlc.long$time-1
tlc.long$week[tlc.long$week==2] = 4
tlc.long$week[tlc.long$week==3] = 6
tlc.long$week.f = factor(tlc.long$week, levels=c(0,1,4,6))
The real analysis starts from here:
# Including group main effect assuming unstructured covariance:
mod1 = gls(y ~ trt*week.f, corr=corSymm(, form= ~ time | id),
weights = varIdent(form = ~1 | time), method = "REML", data=tlc.long)
summary(mod1)
In the summary(mod1), the following parts of the results are of interest to me that I would love to retrieve.
Correlation Structure: General
Formula: ~time | id
Parameter estimate(s):
Correlation:
1 2 3
2 0.571
3 0.570 0.775
4 0.577 0.582 0.581
Variance function:
Structure: Different standard deviations per stratum
Formula: ~1 | time
Parameter estimates:
1 2 3 4
1.000000 1.325880 1.370442 1.524813
The closest I can get is to use the following method.
temp = mod1$modelStruct$varStruct
Variance function structure of class varIdent representing
1 2 3 4
1.000000 1.325880 1.370442 1.524813
However, whatever you stored with temp, I cannot get the five numbers out. I tried as.numeric(temp) and unclass(temp), but none of them works. There is no way I can just get the five numbers as a clean numeric vector.
Thanks in advance!
When you run mod1$modelStruct$varStruct in R console, R first inspects the class of it
> class(mod1$modelStruct$varStruct)
[1] "varIdent" "varFunc"
and then dispatch the corresponding print function. In this case, it is nlme:::print.varFunc. i.e., the actual command running is nlme:::print.varFunc(mod1$modelStruct$varStruct).
If you run nlme:::print.varFunc, you can see the function body of it
function (x, ...)
{
if (length(aux <- coef(x, uncons = FALSE, allCoef = TRUE)) >
0) {
cat("Variance function structure of class", class(x)[1],
"representing\n")
print(aux, ...)
}
else {
cat("Variance function structure of class", class(x)[1],
"with no parameters, or uninitialized\n")
}
invisible(x)
}
<bytecode: 0x7ff4bf688df0>
<environment: namespace:nlme>
What it does is evaluating the coef and print it, and the unevaluated x is returned invisibly.
Therefore, in order to get the cor/var, you need
coef(mod1$modelStruct$corStruct, uncons = FALSE, allCoef = TRUE)
coef(mod1$modelStruct$varStruct, uncons = FALSE, allCoef = TRUE)

Resources