custom split rule with partykit - r

this post follows this question : https://stackoverflow.com/questions/31234329/rpart-user-defined-implementation
I'm very interested in tools which could handle tree growing with customized criteria, such that I could test different model.
I tried to use the partykit R package to grow a tree for which the split rule is given by the negative log-likelihood of a Cox model (which is log-quasi-likelihood in case of the Cox model) and a Cox model is fitted in each leaf.
As I understood reading the vignette about the MOB function, there are two way to implement my own split criteria, namely to get the fit function return either a list or a model object.
For my purpose, I tried the two solutions but I failed to make it work.
Solution 1 : return a list object :
I take as an example the "breast cancer dataset" as in the "mob" vignette.
I tried this :
cox1 = function(y,x, start = NULL, weights = NULL, offset = NULL, ...,
estfun = FALSE, object = TRUE){
res_cox = coxph(formula = y ~ x )
list(
coefficients = res_cox$coefficients,
objfun = - res_cox$loglik[2],
object = res_cox)
}
mob(formula = Surv(time, cens) ~ horTh + pnodes - 1 | age + tsize + tgrade + progrec +
estrec + menostat ,
data = GBSG2 ,
fit = cox1,
control = mob_control(alpha = 0.0001) )
There is a warning about the singularity of the X matrix, and the mob function a tree with a single node (even with smaller values for alpha).
Note that there is no singularity problem with the X matrix when running the coxph function :
res_cox = coxph( formula = Surv(time, cens) ~ horTh + pnodes ,
data = GBSG2 )
Solution 2 : Return a coxph.object :
I tried this :
cox2 = function(y,x, start = NULL, weights = NULL, offset = NULL, ... ){
res_cox = coxph(formula = y ~ x )
}
logLik.cox2 <- function(object, ...)
structure( - object$loglik[2], class = "logLik")
mob(formula = Surv(time, cens) ~ horTh + pnodes - 1 | age + tsize + tgrade + progrec +
estrec + menostat ,
data = GBSG2 ,
fit = cox2,
control = mob_control(alpha = 0.0001 ) )
So this time I get a split along the "progrec" variable :
Model-based recursive partitioning (cox2)
Model formula:
Surv(time, cens) ~ horTh + pnodes - 1 | age + tsize + tgrade +
progrec + estrec + menostat
Fitted party:
[1] root
| [2] progrec <= 21: n = 281
| xhorThno xhorThyes xpnodes
| 0.19306661 NA 0.07832756
| [3] progrec > 21: n = 405
| xhorThno xhorThyes xpnodes
| 0.64810352 NA 0.04482348
Number of inner nodes: 1
Number of terminal nodes: 2
Number of parameters per node: 3
Objective function: 1531.132
Warning message:
In coxph(formula = y ~ x) : X matrix deemed to be singular; variable 2
I would like to know what's wrong with my Solution 1.
I also tried a similar thing for a regression problem and get the same result, ending with a single leaf :
data("BostonHousing", package = "mlbench")
BostonHousing <- transform(BostonHousing,
chas = factor(chas, levels = 0:1, labels = c("no", "yes")),
rad = factor(rad, ordered = TRUE))
linear_reg = function(y,x, start = NULL, weights = NULL, offset = NULL, ...,
estfun = FALSE, object = TRUE){
res_lm = glm(formula = y ~ x , family = "gaussian")
list(
coefficients = res_lm$coefficients,
objfun = res_lm$deviance,
object = res_lm )
}
mob( formula = medv ~ log(lstat) + I(rm^2) | zn + indus + chas + nox +
+ age + dis + rad + tax + crim + b + ptratio,
data = BostonHousing ,
fit = linear_reg)
Also I would like to know if there is no problem using a variable for both "fit the model in a node" and "make a split".
Thank you in advance.
I will probably have other questions about partykit functioning.

The problem with the cox1() and linear_reg() functions you have set up are that you do not supply the estimating functions aka score contributions. As these are the basis for the inference that selects the splitting variable, the algorithm does not split at all if these are not provided. See this recent answer for some discussion of this issues.
But for coxph() objects (unlike the fitdistr() example in the discussion linked above) it is very easy to obtain these estimating functions or scores because there is an estfun() method available. So your cox2() approach is the easier route to go here.
The reason that the latter doesn't work correctly is due to the special handling of intercepts in coxph(). Internally, this always forces the intercept into the model but then omits the first column from the design matrix. When interfacing this through mob() you need to be careful not to mess this up because mob() sets up its own model matrix. And because you exclude the intercept, mob() thinks that it can estimate both levels of horTh. But this is not the case because the intercept is not identified in the Cox-PH model.
The best solution in this case (IMO) is the following: You let mob() set up an intercept but then exclude it again when passing the model matrix to coxph(). Because there are coef(), logLik(), and estfun() methods for the resulting objects, one can use the simple setup of your cox2() function.
Packages and data:
library("partykit")
library("survival")
data("GBSG2", package = "TH.data")
Fitting function:
cox <- function(y, x, start = NULL, weights = NULL, offset = NULL, ... ) {
x <- x[, -1]
coxph(formula = y ~ 0 + x)
}
Fitting of the MOB tree to the GBSG2 data:
mb <- mob(formula = Surv(time, cens) ~ horTh + pnodes | age + tsize + tgrade + progrec + estrec + menostat,
data = GBSG2, fit = cox)
mb
## Model-based recursive partitioning (cox)
##
## Model formula:
## Surv(time, cens) ~ horTh + pnodes | age + tsize + tgrade + progrec +
## estrec + menostat
##
## Fitted party:
## [1] root: n = 686
## xhorThyes xpnodes
## -0.35701115 0.05768026
##
## Number of inner nodes: 0
## Number of terminal nodes: 1
## Number of parameters per node: 2
## Objective function: 1758.86

Related

Comparison between configural and metric model (MI) using semTools

I try to analyze for measurement invariance with multi groups (8 groups; group = "stort"). Below you can see my syntax. The wb variables are all ordinal, ranging from 1-5. There are 8 different groups: n1 =233, n2= 832 n3=67 n4=68 n5=530 n6=169 n7=139 n8=108
The following error occurs: If I try to compare the configural and the metric model using semTools (see syntax) I get an error message: Error in A %*% P.inv : requires numeric/complex matrix/vector arguments (also if I use lavTestLRT()). Other model comparisons run without any errors.
My question is, how can I fix this? I appreciate all comments :)
Here, you can find the dataset: https://drive.google.com/drive/folders/1h9hpFoRhz-zphJ3NSfXtDxbWoad3CmQI?usp=sharing
This is my model:
modelwb.1<- 'kaw =~ wb1.1 + wb2.1 + wb3.1invers + wb4.1invers
paw =~ wb5.1 + wb6.1 + wb7.1invers + wb8.1invers
saw =~ wb9.1 + wb10.1 + wb11.1invers + wb12.1invers
kaw ~ paw + saw
paw ~ saw
kaw ~~ paw + saw
paw ~~ saw
'
fitwb.1 <- cfa(modelwb.1, data=df_wide, estimator = "WLSMV")
summary(fitwb.1, fit.measures = TRUE, standardized = TRUE)
#### Configural model
fitwbcon.1 <- cfa(modelwb.1, data = df_wide, group = "stort", estimator = "WLSMV", missing = "pairwise")
summary(fitwbcon.1, fit.measures = TRUE, standardized = TRUE)
#### Metric model
fitwbmet.1 <- cfa(modelwb.1, data = df_wide, group = "stort", estimator = "WLSMV", missing = "pairwise", group.equal = "loadings")
summary(fitwbmet.1, fit.measures = TRUE, standardized = TRUE)
##### Model comparison: configural & metric
x1<- semTools::compareFit(fitwbcon.1, fitwbmet.1)
summary(x1)
##or:
lavTestLRT(fitwbcon.1, fitwbmet.1)

Modeling beta-binomial distributed data using glmmTBM

Im trying to fit a mixed effect model to asses for effects upon the rate of germinated polen grains. I started with a binomial distribution with a model structure like this:
glmer(cbind(NGG,NGNG) ~ RH3*Altitude + AbH + Date3 + (1 | Receptor/Code/Plant) +
(1 | Mountain/Community), data=database, family="binomial",
control = glmerControl(optimizer="bobyqa"))
Where NGG is the number of successes (germinated grains per stigma, can vary from 0 to e.g. 55), NGNG the number of failures (non-germinated grains 0 to e.g. 80). The issue is, after seeing the results, data seems to be over-dispersed, as indicated by the function (found in http://rstudio-pubs-static.s3.amazonaws.com/263877_d811720e434d47fb8430b8f0bb7f7da4.html):
overdisp_fun <- function(model) {
vpars <- function(m) {
nrow(m)*(nrow(m)+1)/2
}
model.df <- sum(sapply(VarCorr(model), vpars)) + length(fixef(model))
rdf <- nrow(model.frame(model))-model.df
rp <- residuals(model, type = "pearson") # computes pearson residuals
Pearson.chisq <- sum(rp^2)
prat <- Pearson.chisq/rdf
pval <- pchisq(Pearson.chisq, df = rdf, lower.tail = FALSE)
c(chisq = Pearson.chisq, ratio = prat, rdf = rdf, p = pval)
}
The output was:
chisq = 1.334567e+04, ratio = 1.656201e+00, rdf = 8.058000e+03, p = 3.845911e-268
So I decided to try a beta-binomial in glmmTMB as follows (its important to keep this hierarchical structure):
glmmTMB(cbind(NGG,NGNG) ~ RH3*Altitude + AbH + Date3 + (1 | Receptor/Code/Plant) +
(1 | Mountain/Community), data=database,
family=betabinomial(link = "logit"), na.action = na.omit, weights=NGT)
When I run it.. says:
Error in nlminb(start = par, objective = fn, gradient = gr, control = control$optCtrl) : (converted from warning) NA/NaN function evaluation
Is there something wrong in the model writing? I already checked for posible issues in (http://rstudio-pubs-static.s3.amazonaws.com/263877_d811720e434d47fb8430b8f0bb7f7da4.html) but did not find any solution yet.
thanks

Getting errors while using neuralnet function

I tried neural net in R on Boston data set available.
data("Boston",package="MASS")
data <- Boston
Retaining only those variable we want to use:
keeps <- c("crim", "indus", "nox", "rm" , "age", "dis", "tax" ,"ptratio", "lstat" ,"medv" )
data <- data[keeps]
In this case the formula is stored in an R object called f.
The response variable medv is to be “regressed” against the remaining nine attributes. I have done it as below:
f <- medv ~ crim + indus + nox + rm + age + dis + tax + ptratio + lstat
To set up train sample 400 of the 506 rows of data without replacement is collected using the sample method:
set.seed(2016)
n = nrow(data)
train <- sample(1:n, 400, FALSE)
neuralnet function of R is fitted.
library(neuralnet)
fit<- neuralnet(f, data = data[train ,], hidden=c(10 ,12 ,20),
algorithm = "rprop+", err.fct = "sse", act.fct = "logistic",
threshold =0.1, linear.output=TRUE)
But warning message is displayed as algorithm not converging.
Warning message:
algorithm did not converge in 1 of 1 repetition(s) within the stepmax
Tried Prediction using compute,
pred <- compute(fit,data[-train, 1:9])
Following error msg is displayed
Error in nrow[w] * ncol[w] : non-numeric argument to binary operator
In addition: Warning message:
In is.na(weights) : is.na() applied to non-(list or vector) of type 'NULL'
Why the error is coming up and how to recover from it for prediction. I want to use the neuralnet function on that data set.
When neuralnet doesn't converge, the resulting neural network is not complete. You can tell by calling attributes(fit)$names. When training converges, it will look like this:
[1] "call" "response" "covariate" "model.list" "err.fct"
[6] "act.fct" "linear.output" "data" "net.result" "weights"
[11] "startweights" "generalized.weights" "result.matrix"
When it doesn't, some attributes will not be defined:
[1] "call" "response" "covariate" "model.list" "err.fct" "act.fct" "linear.output"
[8] "data"
That explains why compute doesn't work.
When training doesn't converge, the first possible solution could be to increase stepmax (default 100000). You can also add lifesign = "full", to get better insight into the training process.
Also, looking at your code, I would say three layers with 10, 12 and 20 neurons is too much. I would start with one layer with the same number of neurons as the number of inputs, in your case 9.
EDIT:
With scaling (remember to scale both training and test data, and to 'de-scale' compute results), it converges much faster. Also note that I reduced the number of layers and neurons, and still lowered the error threshold.
data("Boston",package="MASS")
data <- Boston
keeps <- c("crim", "indus", "nox", "rm" , "age", "dis", "tax" ,"ptratio", "lstat" ,"medv" )
data <- data[keeps]
f <- medv ~ crim + indus + nox + rm + age + dis + tax + ptratio + lstat
set.seed(2016)
n = nrow(data)
train <- sample(1:n, 400, FALSE)
# Scale data. Scaling parameters are stored in this matrix for later.
scaledData <- scale(data)
fit<- neuralnet::neuralnet(f, data = scaledData[train ,], hidden=9,
algorithm = "rprop+", err.fct = "sse", act.fct = "logistic",
threshold = 0.01, linear.output=TRUE, lifesign = "full")
pred <- neuralnet::compute(fit,scaledData[-train, 1:9])
scaledResults <- pred$net.result * attr(scaledData, "scaled:scale")["medv"]
+ attr(scaledData, "scaled:center")["medv"]
cleanOutput <- data.frame(Actual = data$medv[-train],
Prediction = scaledResults,
diff = abs(scaledResults - data$medv[-train]))
# Show some results
summary(cleanOutput)
The problem seems to be in your argument linear.output = TRUE.
With your data, but changing the code a bit (not defining the formula and adding some explanatory comments):
library(neuralnet)
fit <- neuralnet(formula = medv ~ crim + indus + nox + rm + age + dis + tax + ptratio + lstat,
data = data[train,],
hidden=c(10, 12, 20), # number of vertices (neurons) in each hidden layer
algorithm = "rprop+", # resilient backprop with weight backtracking,
err.fct = "sse", # calculates error based on the sum of squared errors
act.fct = "logistic", # smoothing the cross product of neurons and weights with logistic function
threshold = 0.1, # of the partial derivatives for error function, stopping
linear.output=FALSE) # act.fct applied to output neurons
print(net)
Call: neuralnet(formula = medv ~ crim + indus + nox + rm + age + dis + tax + ptratio + lstat, data = data[train, ], hidden = c(10, 12, 20), threshold = 0.1, rep = 10, algorithm = "rprop+", err.fct = "sse", act.fct = "logistic", linear.output = FALSE)
10 repetitions were calculated.
Error Reached Threshold Steps
1 108955.0318 0.03436116236 4
5 108955.0339 0.01391790099 8
3 108955.0341 0.02193379592 3
9 108955.0371 0.01705056758 6
8 108955.0398 0.01983134293 8
4 108955.0450 0.02500006437 5
6 108955.0569 0.03689097762 5
7 108955.0677 0.04765829189 5
2 108955.0705 0.05052776877 5
10 108955.1103 0.09031966778 7
10 108955.1103 0.09031966778 7
# now compute will work
pred <- compute(fit, data[-train, 1:9])

correlation in multivariate mixed model in r

I am running multivariate mixed model in R by using nlme package. Suppose that x and y are responses variables for longitudinal data which assumed that the error within group is correlated. The residual error matrix is presented as:
So my question is how to involve the correlation into lme function?
I tried commands corr = corComSymm(from =~ 1 | x) or corr = corAR1(from =~ 1 | x) but did not work!
here en example:
# visiting time by months
time = rep(c(0,3,6,9),time = 4, 200)
# subjects
subject = rep(1:50, each = 4)
# first response variable "identity"
x = c(rep(0, 100), rep(1,100))
# second response variable "identity"
y = c(rep(1, 100), rep(0,100))
# values of both reponses variables (x_1, x_2)
value = c(rnorm(100,20,1),rnorm(100,48,1))
# variables refer to reponses variables (x_1, x_2)
variable = factor(c(rep(0,150),rep(1,50)), label=c("X","Y"))
df = data.frame(subject , time, x,y,value, variable)
library(nlme)
# fit the model that each response variable has intercept and slope (time) for each random and fixed effects
# as well as fixed effects slopes for sex and lesion, and each response has different variance
f= lme(value ~ -1 + x + y + x:time + y:time , random = ~ -1 + (x + y) + time:( x + y)|subject ,
weights = varIdent(form=~1| x),corr = corAR1(from = ~ 1|x), control=lmeControl(opt="optim"), data =df)
Error in corAR1(from = ~1 | x) : unused argument (from = ~1 | x)
Any suggestions?
I found this website (below) which helpful and useful, I posted here in case someone might has this problem in future.
https://rpubs.com/bbolker/3336

R Step function not selecting models

So I'm trying to use the step function in R with the Iowa, Ames data set, I can't get the step function to do anything.
I'm currently doing this -
step(lm(SalePrice ~ 1), list = ~upper(as.numeric(Garage.Area) + Lot.Area
), k = 2, direction='forward')
and that gives me -
## Start: AIC=45082
## SalePrice ~ 1
##
## Call:
## lm(formula = SalePrice ~ 1)
##
## Coefficients:
## (Intercept)
## 180241
I then tried this-
step(lm(SalePrice ~ 1), list = ~upper(
factor(Neighborhood) * factor(Bedroom.AbvGr) * factor(Pool.QC) + Bsmt.Full.Bath + Bsmt.Half.Bath + Full.Bath + Half.Bath + factor(Heating.QC) + factor(Central.Air) + factor(Overall.Cond) + factor(Overall.Qual) + Gr.Liv.Area + as.numeric(Garage.Area) + as.numeric(Pool.Area)
), trace=FALSE, k = 2, direction='forward')
But that gives me the same thing.
Why?
If you define the saturated model (i.e. the model you want the forward step method to go to) then you can use this:
step(yourmodel, scope = ~. saturatedmodel , direction="forward")

Resources