I'm trying to use the step function in R with the Ames, Iowa data set, but I can't get it to do anything.
I'm currently doing this:
step(lm(SalePrice ~ 1), list = ~upper(as.numeric(Garage.Area) + Lot.Area
), k = 2, direction='forward')
and that gives me -
## Start: AIC=45082
## SalePrice ~ 1
##
## Call:
## lm(formula = SalePrice ~ 1)
##
## Coefficients:
## (Intercept)
## 180241
I then tried this:
step(lm(SalePrice ~ 1), list = ~upper(
factor(Neighborhood) * factor(Bedroom.AbvGr) * factor(Pool.QC) + Bsmt.Full.Bath + Bsmt.Half.Bath + Full.Bath + Half.Bath + factor(Heating.QC) + factor(Central.Air) + factor(Overall.Cond) + factor(Overall.Qual) + Gr.Liv.Area + as.numeric(Garage.Area) + as.numeric(Pool.Area)
), trace=FALSE, k = 2, direction='forward')
But that gives me the same thing.
Why?
If you define the saturated model (i.e. the model you want the forward stepping to reach), then you can use this:
step(yourmodel, scope = formula(saturatedmodel), direction = "forward")
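Applied to the call in the question, a minimal sketch might look like this (assuming the Ames data is in a data frame called ames; the point is that the upper model goes into scope rather than a list argument):
# fit the intercept-only model first, then let step() grow it forward
base <- lm(SalePrice ~ 1, data = ames)
step(base,
     scope = list(lower = ~ 1,
                  upper = ~ as.numeric(Garage.Area) + Lot.Area),
     direction = "forward", k = 2)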
In the partykit package print(i.ctree) provides the probability of the outcome at terminal nodes (classification tree). However, I would like to know the probability of the outcome at internal nodes as well.
I'm able to estimate the probability at the internal nodes when I create a barplot for my ctree with
plot(i.ctree, inner_panel = node_barplot), but what I would like is the exact probability at the internal nodes. For example, in the diagram below I would like to know the probability of the outcome at nodes 2 and 5.
Any ideas?
Using an example from https://cran.r-project.org/web/packages/partykit/vignettes/ctree.pdf:
[barplot of the ctree omitted]
There are various ways to extract the full data pertaining to a certain node and compute any quantity you are interested in. For the distribution of a classification tree, one way is to coerce to the simpleparty class, which stores the distribution in the info slot of each node.
Using the example from the vignette you mentioned, you can first fit the full constparty tree:
library("partykit")
data("GlaucomaM", package = "TH.data")
gtree <- ctree(Class ~ ., data = GlaucomaM)
And then coerce to simpleparty:
gtree <- as.simpleparty(gtree)
Then you can extract a list of distributions from each node, bind it into a table, and compute the proportions:
## extract the class distribution stored in each node's info slot
tab <- nodeapply(gtree, nodeids(gtree), function(node) node$info$distribution)
## bind into one table (one row per node) and convert counts to proportions
tab <- do.call(rbind, tab)
proportions(tab, 1)
## glaucoma normal
## 1 0.50000000 0.50000000
## 2 0.86206897 0.13793103
## 3 0.93670886 0.06329114
## 4 0.12500000 0.87500000
## 5 0.21100917 0.78899083
## 6 0.09230769 0.90769231
## 7 0.38636364 0.61363636
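To answer the original question directly, you can then index the rows of that table for the inner nodes of interest, e.g. nodes 2 and 5 (a small addition to the answer above; the rows are in the order of nodeids(gtree)):
p <- proportions(tab, 1)
p[c(2, 5), ]   # rows for inner nodes 2 and 5
##   glaucoma    normal
## 2 0.86206897 0.13793103
## 5 0.21100917 0.78899083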
You can also adapt the panel function for the printing, re-using the functions used in print.simpleparty:
simpleprint <- function(node) formatinfo_node(node,
FUN = partykit:::.make_formatinfo_simpleparty(gtree),
default = "*", prefix = ": ")
print(gtree, inner_panel = simpleprint)
## Model formula:
## Class ~ ag + at + as + an + ai + eag + eat + eas + ean + eai +
## abrg + abrt + abrs + abrn + abri + hic + mhcg + mhct + mhcs +
## mhcn + mhci + phcg + phct + phcs + phcn + phci + hvc + vbsg +
## vbst + vbss + vbsn + vbsi + vasg + vast + vass + vasn + vasi +
## vbrg + vbrt + vbrs + vbrn + vbri + varg + vart + vars + varn +
## vari + mdg + mdt + mds + mdn + mdi + tmg + tmt + tms + tmn +
## tmi + mr + rnf + mdic + emd + mv
##
## Fitted party:
## [1] root
## | [2] vari <= 0.059: glaucoma (n = 87, err = 13.8%)
## | | [3] vasg <= 0.066: glaucoma (n = 79, err = 6.3%)
## | | [4] vasg > 0.066: normal (n = 8, err = 12.5%)
## | [5] vari > 0.059: normal (n = 109, err = 21.1%)
## | | [6] tms <= -0.066: normal (n = 65, err = 9.2%)
## | | [7] tms > -0.066: normal (n = 44, err = 38.6%)
##
## Number of inner nodes: 3
## Number of terminal nodes: 4
I am struggling with looping nls fits. Here is an example data set for a single sample:
dat<-read.table(text="time y
1 4.62
2 13.55
3 30.82
6 93.97
12 145.93
24 179.93", header = TRUE)
plot(dat); lines(dat)
model <- nls(y ~ Max * (1-exp(-k * (time - Lag))),data=dat,start=list(Max = 200, k = 0.1, Lag = 0.5))
But what if I want to apply the model to multiple columns of samples? For example:
dat<-read.table(text="time gluc starch solka
1 6.32 7.51 1.95
2 20.11 25.49 6.43
3 36.03 47.53 10.39
6 107.52 166.31 27.01
12 259.28 305.19 113.72
24 283.40 342.56 251.14
48 297.55 353.66 314.22", header = TRUE)
How can I get R to solve for Max, k, and Lag for each sample (gluc, starch, solka)?
In all the alternatives below we use these values:
long <- tidyr::pivot_longer(dat, -1, values_to = "y")
long$name <- factor(long$name)
st0 <- list(Max = 200, k = 0.1, Lag = 0.5)
1) nls grouped data. Convert dat to long form and then use the grouped-data feature of nls. This solution is the most suitable of those presented here for testing whether certain parameters are common among the three names, since it is easy to simply remove the subscript on a parameter if it is to be common across names. The fitting itself does not use any packages, but we show ggplot2 and lattice package graphics for plotting.
# get better starting values
model0 <- nls(y ~ Max * (1-exp(-k * (time - Lag))), long, start = st0)
st <- with(as.list(coef(model0)),
list(Max = rep(Max, 3), k = rep(k, 3), Lag = rep(Lag, 3)))
model <- nls(y ~ Max[name] * (1-exp(-k[name] * (time - Lag[name]))),
long, start = st)
model
giving:
Nonlinear regression model
model: y ~ Max[name] * (1 - exp(-k[name] * (time - Lag[name])))
data: long
Max1 Max2 Max3 k1 k2 k3 Lag1 Lag2
306.48737 389.84657 361.82290 0.12214 0.03857 0.13747 1.38072 2.02205
Lag3
1.31770
residual sum-of-squares: 7167
Number of iterations to convergence: 8
Achieved convergence tolerance: 9.186e-06
ggplot2 graphics could be done like this.
library(ggplot2)
fitdf <- transform(long, fit = fitted(model))
ggplot(fitdf, aes(x = time, y = y, color = name)) +
geom_point() +
geom_line(aes(y = fit))
A slightly different-looking plot can be generated using lattice graphics, which comes with R, so the package does not have to be installed. The code is particularly compact.
library(lattice)
xyplot(fit + y ~ time | name, fitdf, type = c("l", "p"), auto.key = TRUE)
2) nlsList. If you don't need to investigate common settings of parameters across the names, then another possibility is to use nlsList in the nlme package (which comes with R, so you don't have to install it). long and st0 are from above.
library(nlme)
fit <- nlsList(y ~ Max * (1-exp(-k * (time - Lag))) | name, long, start = st0)
giving an nlsList object whose 3 components are the three nls objects obtained by running nls for each name.
> fit
Call:
Model: y ~ Max * (1 - exp(-k * (time - Lag))) | name
Data: long
Coefficients:
Max k Lag
gluc 306.4875 0.12214330 1.380713
solka 389.8449 0.03856544 2.022057
starch 361.8231 0.13747402 1.317698
Degrees of freedom: 21 total; 12 residual
Residual standard error: 24.43858
We can plot the data and fit:
levs <- levels(long$name)
col <- setNames(rainbow(length(levs)), levs)
plot(y ~ time, long, col = col[name], pch = 20, cex = 1.5)
for(lv in levs) lines(fitted(fit[[lv]]) ~ time, dat, col = col[lv])
legend("bottomright", leg = levs, col = col, pch = 20, cex = 1.5)
3) subset. An approach similar to (2) is to perform three nls runs using subset= to select the data. This returns a named list of nls objects. st0 and long are from above. No packages are used.
fit <- Map(function(nm) nls(y ~ Max * (1-exp(-k * (time - Lag))), data = long,
start = st0, subset = name == nm), levels(long$name))
The graphics code in (2) also works here.
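For a quick side-by-side comparison of the three fits (a small sketch, not part of the original approaches), the coefficients can be collected into one matrix:
sapply(fit, coef)  # one column per name, rows Max, k, Lag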
Build the formulas you want to use as strings:
outcomes = c("gluc", "starch", "solka")
my_formulas = paste(outcomes, "~ Max * (1-exp(-k * (time - Lag)))")
model_list = list()
for(i in seq_along(outcomes)) {
  model_list[[outcomes[i]]] = nls(
    as.formula(my_formulas[i]),
    data = dat,
    start = list(Max = 200, k = 0.1, Lag = 0.5)
  )
}
This will create a list of models, which you can then access with, e.g., summary(model_list[[1]]) or summary(model_list[["solka"]]).
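If you then want fitted curves from these models (a sketch, assuming the loop above ran without errors), predict() works on each list element:
# predict each fitted curve on a fine time grid and plot them together
newtime <- data.frame(time = seq(1, 48, by = 0.5))
pred <- sapply(model_list, predict, newdata = newtime)
matplot(newtime$time, pred, type = "l", lty = 1, xlab = "time", ylab = "fitted y")
legend("bottomright", legend = colnames(pred), lty = 1, col = 1:ncol(pred))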
With a Tweedie GLM, one can get the prediction from the link by doing exp(link). But to get the prediction from an xgboost Tweedie model, I have to take exp(link)/2. I don't understand why I need to divide by 2.
Minimal reproducible example below, inspired by the tweedie regression demo at https://github.com/dmlc/xgboost/blob/master/R-package/demo/tweedie_regression.R
library(xgboost)
library(data.table)
library(Matrix)  # for sparse.model.matrix
library(cplm) # for insurance data
library(statmod) # for tweedie glm
data(AutoClaim)
# auto insurance dataset analyzed by Yip and Yau (2005)
dt <- data.table(AutoClaim)
# exclude these columns from the model matrix
exclude <- c('POLICYNO', 'PLCYDATE', 'CLM_FREQ5', 'CLM_AMT5', 'CLM_FLAG', 'IN_YY')
# retains the missing values
# NOTE: this dataset comes ready out of the box
options(na.action = 'na.pass')
x <- sparse.model.matrix(~ . - 1, data = dt[, -exclude, with = F])
options(na.action = 'na.omit')
# response
y <- dt[, CLM_AMT5]
d_train <- xgb.DMatrix(data = x, label = y, missing = NA)
# the tweedie_variance_power parameter determines the shape of the
# distribution
# - closer to 1 is more Poisson-like and the mass is more
#   concentrated near zero
# - closer to 2 is more gamma-like and the mass spreads to the
#   right with less concentration near zero
params <- list(
objective = 'reg:tweedie',
eval_metric = 'rmse',
tweedie_variance_power = 1.4,
max_depth = 2,
eta = 1)
set.seed(42)
bst <- xgb.train(
data = d_train,
params = params,
maximize = FALSE,
watchlist = list(train = d_train),
nrounds = 3)
xgb.plot.tree(model = bst)
# Manually extract the values for the first record:
x[1,]
# travtime < 102, bluebook < 61645 --> tree #1 value = 2.49922585
# revolkedyes < -9.53674316e-07, npolicy < 5.5 --> tree #2 value = 2.48586464
# REVOLKEDYes < -9.53674316e-07, areaurban > -9.53674316e-07 --> tree #3 value = 2.36028123
link_gbm <- 2.49922585 + 2.48586464 + 2.36028123
link_gbm # 7.345372
# Take exp(link_gbm), divide by 2
exp(link_gbm ) / 2 # 774.5053
# Compare with getting prediction directly from GBM.
predict(bst, d_train)[1] # 774.5053
# Let's do the same with a GLM:
dt2 <- dt[, -exclude, with = F]
dt2$CLM_AMT5 <- dt$CLM_AMT5
tweedie_fit <-
glm(CLM_AMT5 ~ .,
family=tweedie(var.power=1.4, link.power=0),
data = dt2)
summary(tweedie_fit)
# Manually get the link value for the first record
dt2[1,]
link_glm <- tweedie_fit$coefficients["(Intercept)"] +
14 * tweedie_fit$coefficients["TRAVTIME"] +
14230 * tweedie_fit$coefficients["BLUEBOOK"] +
11 * tweedie_fit$coefficients["RETAINED"] +
1 * tweedie_fit$coefficients["NPOLICY"] +
1 * tweedie_fit$coefficients["CAR_TYPESedan"] +
1 * tweedie_fit$coefficients["RED_CARyes"] +
3 * tweedie_fit$coefficients["MVR_PTS"] +
60 * tweedie_fit$coefficients["AGE"] +
11 * tweedie_fit$coefficients["YOJ"] +
67349 * tweedie_fit$coefficients["INCOME"] +
1 * tweedie_fit$coefficients["GENDERM"] +
1 * tweedie_fit$coefficients["JOBCLASSProfessional"] +
1 * tweedie_fit$coefficients["MAX_EDUCPhD"] +
18 * tweedie_fit$coefficients["SAMEHOME"] +
1 * tweedie_fit$coefficients["AREAUrban"]
link_glm # 8.299899
# prediction is exp(link_glm)
exp(link_glm) # 4023.466
# compare with link and prediction from glm ... yes, it's identical
predict(tweedie_fit, type="link")[1]
predict(tweedie_fit, type="response")[1] # 4023.466
Is there a way to retain the best models, for example within two Akaike Information Criterion (AIC) units of the best-fitting model, during a model-dredging approach in R? I am using the glmulti package, which returns the AIC of the best models, but does not allow visualizing the models associated with those values.
Thanks in advance.
Here is my example (data here):
results <- read.csv("gameresults.csv")
require(glmulti)
M <- glmulti(result~speed*svl*tailsize*strategy,
data=results, name = "glmulti.analysis",
intercept = TRUE, marginality = FALSE,
level = 2, minsize = 0, maxsize = -1, minK = 0, maxK = -1,
fitfunction = Multinom, method = "h", crit = "aic",
confsetsize = 100,includeobjects=TRUE)
summary(M)
The function glmulti::glmulti returns an S4 object that can be accessed much like a list. All of your models, not just the best one, can be accessed. Since I don't have your functions and some other optional inputs, I fitted a simplified version of your model just as a demonstration:
results <- read.csv("gameresults.csv")
library(glmulti)
M <- glmulti(result~speed*svl*strategy, data=results, crit = "aic", plotty = TRUE)
Here is a list of all the models, accessed with the @ operator:
M@formulas
# [[1]]
# result ~ 1 + speed + svl:speed + strategy:speed
# <environment: 0x11a616750>
#
# [[2]]
# result ~ 1 + speed + svl + svl:speed + strategy:speed
# <environment: 0x11a616750>
#
# [[3]]
# result ~ 1 + strategy + speed + svl:speed + strategy:speed
# <environment: 0x11a616750>
#
## **I omitted the remaining 36-3=33 models**
You can plot them individually based on the formula, using base graphics or any package that supports model formulas. For example, I randomly selected one from the list:
plot(result ~ 1 + speed + svl, data=results)
## Hit <Return> to see next plot:
## Hit <Return> to see next plot:
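Coming back to the original goal of keeping the models within two AIC units of the best one: glmulti also provides the weightable() accessor which, as far as I recall, returns a data frame with one row per candidate model and its criterion value (the column is named after crit, here aic), so a filter along these lines should work:
# keep only the models within 2 AIC units of the best one (sketch)
w <- weightable(M)
best_set <- w[w$aic <= min(w$aic) + 2, ]
best_set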
This post follows this question: https://stackoverflow.com/questions/31234329/rpart-user-defined-implementation
I'm very interested in tools which can handle tree growing with customized criteria, so that I can test different models.
I tried to use the partykit R package to grow a tree for which the split rule is given by the negative log-likelihood of a Cox model (which is a log-quasi-likelihood in the case of the Cox model) and a Cox model is fitted in each leaf.
As I understand from the vignette about the mob() function, there are two ways to implement my own split criterion, namely to have the fit function return either a list or a model object.
I tried both solutions but failed to make them work.
Solution 1: return a list object:
I take as an example the "breast cancer dataset" as in the "mob" vignette.
I tried this :
cox1 = function(y,x, start = NULL, weights = NULL, offset = NULL, ...,
estfun = FALSE, object = TRUE){
res_cox = coxph(formula = y ~ x )
list(
coefficients = res_cox$coefficients,
objfun = - res_cox$loglik[2],
object = res_cox)
}
mob(formula = Surv(time, cens) ~ horTh + pnodes - 1 | age + tsize + tgrade + progrec +
estrec + menostat ,
data = GBSG2 ,
fit = cox1,
control = mob_control(alpha = 0.0001) )
There is a warning about the singularity of the X matrix, and mob() returns a tree with a single node (even with smaller values for alpha).
Note that there is no singularity problem with the X matrix when running the coxph function:
res_cox = coxph( formula = Surv(time, cens) ~ horTh + pnodes ,
data = GBSG2 )
Solution 2: return a coxph object:
I tried this :
cox2 = function(y,x, start = NULL, weights = NULL, offset = NULL, ... ){
res_cox = coxph(formula = y ~ x )
}
logLik.cox2 <- function(object, ...)
structure( - object$loglik[2], class = "logLik")
mob(formula = Surv(time, cens) ~ horTh + pnodes - 1 | age + tsize + tgrade + progrec +
estrec + menostat ,
data = GBSG2 ,
fit = cox2,
control = mob_control(alpha = 0.0001 ) )
So this time I get a split along the "progrec" variable:
Model-based recursive partitioning (cox2)
Model formula:
Surv(time, cens) ~ horTh + pnodes - 1 | age + tsize + tgrade +
progrec + estrec + menostat
Fitted party:
[1] root
| [2] progrec <= 21: n = 281
| xhorThno xhorThyes xpnodes
| 0.19306661 NA 0.07832756
| [3] progrec > 21: n = 405
| xhorThno xhorThyes xpnodes
| 0.64810352 NA 0.04482348
Number of inner nodes: 1
Number of terminal nodes: 2
Number of parameters per node: 3
Objective function: 1531.132
Warning message:
In coxph(formula = y ~ x) : X matrix deemed to be singular; variable 2
I would like to know what's wrong with my Solution 1.
I also tried a similar thing for a regression problem and got the same result, ending with a single leaf:
data("BostonHousing", package = "mlbench")
BostonHousing <- transform(BostonHousing,
chas = factor(chas, levels = 0:1, labels = c("no", "yes")),
rad = factor(rad, ordered = TRUE))
linear_reg = function(y,x, start = NULL, weights = NULL, offset = NULL, ...,
estfun = FALSE, object = TRUE){
res_lm = glm(formula = y ~ x , family = "gaussian")
list(
coefficients = res_lm$coefficients,
objfun = res_lm$deviance,
object = res_lm )
}
mob( formula = medv ~ log(lstat) + I(rm^2) | zn + indus + chas + nox +
+ age + dis + rad + tax + crim + b + ptratio,
data = BostonHousing ,
fit = linear_reg)
I would also like to know whether there is any problem with using a variable both to fit the model in a node and to make a split.
Thank you in advance.
I will probably have other questions about how partykit works.
The problem with the cox1() and linear_reg() functions you have set up is that you do not supply the estimating functions, aka score contributions. As these are the basis for the inference that selects the splitting variable, the algorithm does not split at all if they are not provided. See this recent answer for some discussion of this issue.
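(As an aside, here is a hedged sketch of what the list interface would need to return in order to fix cox1(). It is not from the original answer, it assumes the sandwich package, which, if I remember correctly, provides an estfun() method for coxph objects, and it does not address the intercept issue discussed below.)
library("survival")
library("sandwich")   # estfun() method for coxph objects

cox1 <- function(y, x, start = NULL, weights = NULL, offset = NULL, ...,
                 estfun = FALSE, object = TRUE) {
  res_cox <- coxph(formula = y ~ x)
  list(
    coefficients = coef(res_cox),
    objfun = -res_cox$loglik[2],
    # per-observation score contributions, needed for selecting the split variable
    estfun = if (estfun) sandwich::estfun(res_cox) else NULL,
    object = if (object) res_cox else NULL
  )
}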
But for coxph() objects (unlike the fitdistr() example in the discussion linked above) it is very easy to obtain these estimating functions or scores because there is an estfun() method available. So your cox2() approach is the easier route to go here.
The reason that the latter doesn't work correctly is due to the special handling of intercepts in coxph(). Internally, this always forces the intercept into the model but then omits the first column from the design matrix. When interfacing this through mob() you need to be careful not to mess this up because mob() sets up its own model matrix. And because you exclude the intercept, mob() thinks that it can estimate both levels of horTh. But this is not the case because the intercept is not identified in the Cox-PH model.
The best solution in this case (IMO) is the following: You let mob() set up an intercept but then exclude it again when passing the model matrix to coxph(). Because there are coef(), logLik(), and estfun() methods for the resulting objects, one can use the simple setup of your cox2() function.
Packages and data:
library("partykit")
library("survival")
data("GBSG2", package = "TH.data")
Fitting function:
cox <- function(y, x, start = NULL, weights = NULL, offset = NULL, ...) {
  ## drop the intercept column that mob() sets up; coxph() handles the intercept internally
  x <- x[, -1]
  coxph(formula = y ~ 0 + x)
}
Fitting of the MOB tree to the GBSG2 data:
mb <- mob(formula = Surv(time, cens) ~ horTh + pnodes | age + tsize + tgrade + progrec + estrec + menostat,
data = GBSG2, fit = cox)
mb
## Model-based recursive partitioning (cox)
##
## Model formula:
## Surv(time, cens) ~ horTh + pnodes | age + tsize + tgrade + progrec +
## estrec + menostat
##
## Fitted party:
## [1] root: n = 686
## xhorThyes xpnodes
## -0.35701115 0.05768026
##
## Number of inner nodes: 0
## Number of terminal nodes: 1
## Number of parameters per node: 2
## Objective function: 1758.86