Grid search in random forest (randomForestSRC) - R

I am using randomForestSRC to fit a random forest regression model, and I want to perform a grid search over mtry, nodesize, ntrees, and nodedepth in combination, so that I can better visualize the optimization process.
I have tried the following:
mtry <- c(4,8,16)
nodesize <- c(50,150,300)
ntrees <- c(500,1000,2000)
nodedepth <- c(5,10)
frmodel <- rfsrc(mort_30 ~ variable1 + variable2 + variable3, # (etc.)
                 data = data.train, mtry = mtry, nodesize = nodesize,
                 ntrees = ntrees, nodedepth = nodedepth,
                 blocksize = 1, importance = TRUE, seed = 40)
But I keep getting this error:
In if (mtry < 1 | mtry > n.xvar) mtry <- max(1, min(mtry, n.xvar)) :
  the condition has length > 1 and only the first element will be used
It seems I won't be able to assign more than one value to these arguments. Is there another way to do this, short of manually fitting a forest for every single combination?

You can use tune() to search over mtry and nodesize, then perhaps just rerun it for different values of ntreeTry (note that rfsrc() itself takes ntree, not ntrees), for example:
nodesize <- c(5,10,20)
model <- tune(Ozone ~ ., data = airquality,
              mtryStart = 2,
              nodesizeTry = nodesize, ntreeTry = 100,
              blocksize = 1, importance = TRUE, seed = 40)
model$results
nodesize mtry err
1 5 1 0.5750139
2 5 2 0.4420183
3 5 3 0.3750303
4 5 4 0.3781430
5 5 5 0.3255283
6 10 1 0.6128187
7 10 2 0.4719501
8 10 3 0.3825911
9 10 4 0.3771207
10 10 5 0.3523660
11 20 1 0.6981993
12 20 2 0.5251094
13 20 3 0.4451690
14 20 4 0.4305362
15 20 5 0.4099460
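If you do want the full four-parameter grid (e.g. to visualize the error surface afterwards), a minimal sketch is a loop over expand.grid(), collecting the out-of-bag error of each fit. This assumes the mort_30 formula and data.train from your question, and uses rfsrc()'s actual argument name ntree:

library(randomForestSRC)

grid <- expand.grid(mtry = c(4, 8, 16),
                    nodesize = c(50, 150, 300),
                    ntree = c(500, 1000, 2000),
                    nodedepth = c(5, 10))

grid$oob_err <- sapply(seq_len(nrow(grid)), function(i) {
  fit <- rfsrc(mort_30 ~ variable1 + variable2 + variable3,
               data = data.train,
               mtry = grid$mtry[i], nodesize = grid$nodesize[i],
               ntree = grid$ntree[i], nodedepth = grid$nodedepth[i],
               seed = 40)
  tail(na.omit(fit$err.rate), 1)  # OOB error at the final tree
})

grid[which.min(grid$oob_err), ]  # best combination found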

Related

sandwich + mlogit: `Error in ef/X : non-conformable arrays` when using `vcovHC()` to compute robust/clustered standard errors

I am trying to compute robust/clustered standard errors after using mlogit() to fit a Multinomial Logit (MNL) model in a discrete choice problem. Unfortunately, I suspect I am having problems because I am using data in long format (a must in my case): I get the error Error in ef/X : non-conformable arrays after calling sandwich::vcovHC(mo, "HC0").
The Data
For illustration, please consider the following data. It represents 5 individuals (id_ind) that choose among 3 alternatives (altern). Each of the five individuals chose three times; hence we have 15 choice situations (id_choice). Each alternative is described by two generic attributes (x1 and x2), and the choices are recorded in y (1 if selected, 0 otherwise).
df <- read.table(header = TRUE, text = "
id_ind id_choice altern x1 x2 y
1 1 1 1 1.586788801 0.11887832 1
2 1 1 2 -0.937965347 1.15742493 0
3 1 1 3 -0.511504401 -1.90667519 0
4 1 2 1 1.079365680 -0.37267925 0
5 1 2 2 -0.009203032 1.65150370 1
6 1 2 3 0.870474033 -0.82558651 0
7 1 3 1 -0.638604013 -0.09459502 0
8 1 3 2 -0.071679538 1.56879334 0
9 1 3 3 0.398263302 1.45735788 1
10 2 4 1 0.291413453 -0.09107974 0
11 2 4 2 1.632831160 0.92925495 0
12 2 4 3 -1.193272276 0.77092623 1
13 2 5 1 1.967624379 -0.16373709 1
14 2 5 2 -0.479859282 -0.67042130 0
15 2 5 3 1.109780885 0.60348187 0
16 2 6 1 -0.025834772 -0.44004183 0
17 2 6 2 -1.255129594 1.10928280 0
18 2 6 3 1.309493274 1.84247199 1
19 3 7 1 1.593558740 -0.08952151 0
20 3 7 2 1.778701074 1.44483791 1
21 3 7 3 0.643191170 -0.24761157 0
22 3 8 1 1.738820924 -0.96793288 0
23 3 8 2 -1.151429915 -0.08581901 0
24 3 8 3 0.606695064 1.06524268 1
25 3 9 1 0.673866953 -0.26136206 0
26 3 9 2 1.176959443 0.85005871 1
27 3 9 3 -1.568225496 -0.40002252 0
28 4 10 1 0.516456176 -1.02081089 1
29 4 10 2 -1.752854918 -1.71728381 0
30 4 10 3 -1.176101700 -1.60213536 0
31 4 11 1 -1.497779616 -1.66301234 0
32 4 11 2 -0.931117325 1.50128532 1
33 4 11 3 -0.455543630 -0.64370825 0
34 4 12 1 0.894843784 -0.69859139 0
35 4 12 2 -0.354902281 1.02834859 0
36 4 12 3 1.283785176 -1.18923098 1
37 5 13 1 -1.293772990 -0.73491317 0
38 5 13 2 0.748091387 0.07453705 1
39 5 13 3 -0.463585127 0.64802031 0
40 5 14 1 -1.946438667 1.35776140 0
41 5 14 2 -0.470448172 -0.61326604 1
42 5 14 3 1.478763383 -0.66490028 0
43 5 15 1 0.588240775 0.84448489 1
44 5 15 2 1.131731049 -1.51323232 0
45 5 15 3 0.212145247 -1.01804594 0
")
The problem
Consequently, we can fit an MNL using mlogit() and try to extract its robust variance-covariance matrix as follows:
library(mlogit)
library(sandwich)
mo <- mlogit(formula = y ~ x1 + x2 | 0,
             method = "nr",
             data = df,
             idx = c("id_choice", "altern"))
sandwich::vcovHC(mo, "HC0")
#Error in ef/X : non-conformable arrays
As we can see, sandwich::vcovHC() produces an error saying that ef/X is non-conformable, where X <- model.matrix(x) and ef <- estfun(x, ...). After looking through the source code on the GitHub mirror, I spotted the problem: because the data are in long format, ef has dimensions 15 x 2 while X has dimensions 45 x 2.
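(You can verify this mismatch directly; this quick check just inspects the two objects that vcovHC() tries to combine, using the model mo fitted above:)

dim(sandwich::estfun(mo))  # 15 x 2: one row per choice situation
dim(model.matrix(mo))      # 45 x 2: one row per alternative (long format)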
My workaround
Given that the show must go on, I am computing the robust and clustered standard errors manually, using some functions that I borrowed from sandwich and adjusted to match Stata's output.
Robust Standard Errors
These lines are inspired by the sandwich::meat() function.
psi <- estfun(mo)
k <- NCOL(psi)
n <- NROW(psi)
rval <- (n/(n-1)) * crossprod(as.matrix(psi))
vcov(mo) %*% rval %*% vcov(mo)
# x1 x2
# x1 0.23050261 0.09840356
# x2 0.09840356 0.12765662
Stata Equivalent
qui clogit y x1 x2 ,group(id_choice) r
mat li e(V)
symmetric e(V)[2,2]
y: y:
x1 x2
y:x1 .23050262
y:x2 .09840356 .12765662
Clustered Standard Errors
Here, given that each individual answers 3 questions, it is highly likely that there is some correlation within individuals; hence cluster corrections should be preferred in such situations. Below I compute the cluster correction for this case and show the equivalence with the Stata output of clogit, cluster().
id_ind_collapsed <- df$id_ind[!duplicated(mo$model$idx$id_choice)]
psi_2 <- rowsum(psi, group = id_ind_collapsed)
k_cluster <- NCOL(psi_2)
n_cluster <- NROW(psi_2)
rval_cluster <- (n_cluster/(n_cluster-1)) * crossprod(as.matrix(psi_2))
vcov(mo) %*% rval_cluster %*% vcov(mo)
# x1 x2
# x1 0.1766707 0.1007703
# x2 0.1007703 0.1180004
Stata equivalent
qui clogit y x1 x2 ,group(id_choice) cluster(id_ind)
symmetric e(V)[2,2]
y: y:
x1 x2
y:x1 .17667075
y:x2 .1007703 .11800038
The Question:
I would like to accommodate my computations within the sandwich ecosystem, meaning not computing the matrices manually but actually using the sandwich functions. Is it possible to make this work with models in long format like the one described here? For example, by providing the meat and bread objects directly to perform the computations? Thanks in advance.
PS: I noticed that there is a dedicated bread() method in sandwich for mlogit, but I could not spot a meat() counterpart for mlogit; anyway, I am probably missing something here...
Why vcovHC does not work for mlogit
The class of HC covariance estimators can only be applied to models with a single linear predictor, where the score function, aka estimating function, is the product of so-called "working residuals" and a regressor matrix. This is explained in some detail in the Zeileis (2006) paper (see Equation 7), provided as vignette("sandwich-OOP", package = "sandwich"). The ?vcovHC manual page also pointed to this but did not explain it very well; I have now improved the documentation at http://sandwich.R-Forge.R-project.org/reference/vcovHC.html:
The function meatHC is the real work horse for estimating the meat of HC sandwich estimators - the default vcovHC method is a wrapper calling sandwich and bread. See Zeileis (2006) for more implementation details. The theoretical background, exemplified for the linear regression model, is described below and in Zeileis (2004). Analogous formulas are employed for other types of models, provided that they depend on a single linear predictor and the estimating functions can be represented as a product of “working residual” and regressor vector (Zeileis 2006, Equation 7).
This means that vcovHC() is not applicable to multinomial logit models as they generally use separate linear predictors for the separate response categories. Similarly, two-part or hurdle models etc. are not supported.
Basic "robust" sandwich covariance
Generally, for computing the basic Eicker-Huber-White sandwich covariance matrix estimator, the best strategy is to use the sandwich() function and not the vcovHC() function. The former works for any model with estfun() and bread() methods.
For linear models sandwich(..., adjust = FALSE) (default) and sandwich(..., adjust = TRUE) correspond to HC0 and HC1, respectively. In a model with n observations and k regression coefficients the former standardizes with 1/n and the latter with 1/(n-k).
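For example, with the model mo from the question (a quick illustration; the two results differ only in the finite-sample scaling):

sandwich(mo)                 # meat scaled with 1/n     (HC0-type)
sandwich(mo, adjust = TRUE)  # meat scaled with 1/(n-k) (HC1-type)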
Stata, however, standardizes with 1/(n-1) in logit models; see Different Robust Standard Errors of Logit Regression in Stata and R. To the best of my knowledge there is no clear theoretical reason for preferring one adjustment over the other, and already in moderately large samples the difference is negligible.
Remark: The adjustment with 1/(n-1) is not directly available in sandwich() as an option. However, coincidentally, it is the default in vcovCL() without specifying a cluster variable (i.e., treating each observation as a separate cluster). So this is a convenient "trick" if you want to get exactly the same results as Stata.
Clustered covariance
This can be computed "as usual" via vcovCL(..., cluster = ...). For mlogit models you just have to remember that the cluster variable needs to be provided only once per choice situation (as opposed to being stacked several times in the long-format data).
Replicating Stata results
With the data and model from your post:
vcovCL(mo)
## x1 x2
## x1 0.23050261 0.09840356
## x2 0.09840356 0.12765662
vcovCL(mo, cluster = df$id_choice[1:15])
## x1 x2
## x1 0.1766707 0.1007703
## x2 0.1007703 0.1180004

covariance structure for multilevel modelling

I have a multilevel repeated-measures dataset of around 300 patients, each with up to 10 repeated measures, predicting troponin rise. There are other variables in the dataset, but I haven't included them here.
I am trying to use nlme to create a random-slope, random-intercept model where effects vary between patients and the effect of time differs between patients. When I try to introduce a first-order covariance structure to allow for the correlation of measurements over time, I get the following error message.
Error in `coef<-.corARMA`(`*tmp*`, value = value[parMap[, i]]) : Coefficient matrix not invertible
I have included my code and a sample of the dataset, and I would be very grateful for any words of wisdom.
# baseline model includes only the intercept; the random intercept varies across patients
randomintercept <- lme(troponin ~ 1,
                       data = df, random = ~1|record_id, method = "ML",
                       na.action = na.exclude,
                       control = list(opt = "optim"))
# random intercept with time as a fixed effect
timeri <- update(randomintercept, . ~ . + day)
# random slopes and intercepts: the effect of time differs between patients
timers <- update(timeri, random = ~ day|record_id)
# covariance structure: corAR1() first-order autoregressive, time points equally spaced
armodel <- update(timers, correlation = corAR1(0, form = ~day|record_id))
Error in `coef<-.corARMA`(`*tmp*`, value = value[parMap[, i]]) :
  Coefficient matrix not invertible
Data:
record_id day troponin
1 1 32
2 0 NA
2 1 NA
2 2 NA
2 3 8
2 4 6
2 5 7
2 6 7
2 7 7
2 8 NA
2 9 9
3 0 14
3 1 1167
3 2 1935
4 0 19
4 1 16
4 2 29
5 0 NA
5 1 17
5 2 47
5 3 684
6 0 46
6 1 45440
6 2 47085
7 0 48
7 1 87
7 2 44
7 3 20
7 4 15
7 5 11
7 6 10
7 7 11
7 8 197
8 0 28
8 1 31
9 0 NA
9 1 204
10 0 NA
10 1 19
You can fit this if you change your optimizer to "nlminb" (or at least it works with the reduced data set you posted).
armodel <- update(timers,
                  correlation = corAR1(0, form = ~day|record_id),
                  control = list(opt = "nlminb"))
However, if you look at the fitted model, you'll see you have problems - the estimated AR1 parameter is -1 and the random intercept and slope terms are correlated with r=0.998.
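You can inspect these diagnostics directly (a sketch using the armodel object fitted above):

VarCorr(armodel)               # random-effect SDs and intercept-slope correlation
armodel$modelStruct$corStruct  # estimated AR(1) parameter (Phi)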
I think the problem is with the nature of the data. Most of the values are in the range 10-50, but there are excursions of one or two orders of magnitude (e.g. individual 6, up to about 45000). It may be hard to fit a model to data this spiky. I would strongly suggest log-transforming your data: the standard diagnostic plot, plot(randomintercept), makes the problem obvious, whereas fitting on the log scale
rlog <- update(randomintercept,log10(troponin) ~ .)
plot(rlog)
is somewhat more reasonable, although there is still some evidence of heteroscedasticity.
The AR+random-slopes model fits OK:
ar.rlog <- update(rlog,
random = ~day|record_id,
correlation = corAR1(0, form = ~day|record_id))
## Linear mixed-effects model fit by maximum likelihood
## ...
## Random effects:
## Formula: ~day | record_id
## Structure: General positive-definite, Log-Cholesky parametrization
## StdDev Corr
## (Intercept) 0.1772409 (Intr)
## day 0.6045765 0.992
## Residual 0.4771523
##
## Correlation Structure: ARMA(1,0)
## Formula: ~day | record_id
## Parameter estimate(s):
## Phi1
## 0.09181557
## ...
A quick glance at intervals(ar.rlog) shows that the confidence intervals on the autoregressive parameter are (-0.52,0.65), so it may not be worth keeping ...
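One way to check (a sketch, reusing the models defined above, both fitted by ML) is a likelihood-ratio comparison against the random-slopes model without the AR(1) term:

rs.rlog <- update(rlog, random = ~ day|record_id)  # random slopes, no AR(1)
anova(rs.rlog, ar.rlog)  # likelihood-ratio test for the AR(1) parameter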
With the random slopes in the model the heteroscedasticity no longer seems problematic ...
plot(rlog,sqrt(abs(resid(.)))~fitted(.),type=c("p","smooth"))

Binomial confidence intervals of means with R

I have got 4 different data.frames with observations that follow a binomial distribution, and for each one I need to calculate the confidence intervals related to the means of a second column (Flow).
The number of successes is reported in the column Success, and the total number of trials is 85.
How can I calculate these confidence intervals in R?
Here is an example of my data.frames:
df1 <- read.table(text = 'Flow Success
725.661 4
25.54 4
318.481 4
230.556 4
2.823 3
12.6 3
9.891 3
11.553 1', header = TRUE)
> mean(df1$Flow)
[1] 167.1381
df2 <- read.table(text = 'Flow Success
725.661 3
25.54 3
318.481 3
230.556 2
2.823 2
12.6 1', header = TRUE)
> mean(df2$Flow)
[1] 219.2768
df3 <- read.table(text = 'Flow Success
725.661 2
25.54 2
318.481 1', header = TRUE)
> mean(df3$Flow)
[1] 356.5607
df4 <- read.table(text = 'Flow Success
725.661 2
25.54 2', header = TRUE)
> mean(df4$Flow)
[1] 375.6005
I need to calculate the confidence intervals of the above means.
I can give more info about the data if needed.
Thanks to anyone who can help.
The binom package provides methods for calculating binomial confidence intervals. One can choose to use all available methods, or specify a single method.
x gives the number of successes, and n the number of Bernoulli trials.
library(binom)
binom.confint(x = 5, n = 10)
method x n mean lower upper
1 agresti-coull 5 10 0.5 0.2365931 0.7634069
2 asymptotic 5 10 0.5 0.1901025 0.8098975
3 bayes 5 10 0.5 0.2235287 0.7764713
4 cloglog 5 10 0.5 0.1836056 0.7531741
5 exact 5 10 0.5 0.1870860 0.8129140
6 logit 5 10 0.5 0.2245073 0.7754927
7 probit 5 10 0.5 0.2186390 0.7813610
8 profile 5 10 0.5 0.2176597 0.7823403
9 lrt 5 10 0.5 0.2176212 0.7823788
10 prop.test 5 10 0.5 0.2365931 0.7634069
11 wilson 5 10 0.5 0.2365931 0.7634069
binom.confint(x = 5, n = 10, method = "exact")
method x n mean lower upper
1 exact 5 10 0.5 0.187086 0.812914
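binom.confint() is vectorized over x, so (assuming the 85 trials apply to every row of your data) you can get an interval for each observation of df1 in a single call:

library(binom)
with(df1, binom.confint(x = Success, n = 85, method = "exact"))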

Using predict() to predict response variable in test dataset

Question: What R code should one use to predict a response variable in a completely separate test data set (not a test set drawn from the original data from which the training set was drawn) that doesn't have the response variable?
I have been stuck on this for two days and any help is highly appreciated!
My training set has 100 observations and 27 variables; "units" is the response variable. The test set has 6000 observations and 26 variables. I am showing only part of both data sets to keep the length of my question manageable.
I am using ISLR and MASS packages.
Training set:
age V1 V2 V3 V4 V5 V6 units
10 1 3 0 5 5 5 5828
7 4 5 4 4 1 2 2698
5 6 6 4 7 8 10 2578
4 4 5 4 4 1 3 2548
15 3 5 4 4 2 5 9922
5 2 4 4 5 1 3 6791
Test set:
age V1 V2 V3 V4 V5 V6
2 3 4 4 4 2 2
2 2 5 4 5 2 3
10 5 4 4 4 1 3
4 15 7 6 3 4 8
7 2 5 4 4 2 2
4 6 5 4 5 2 2
18 2 5 4 5 1 3
6 3 5 5 6 4 5
R Code:
library(ISLR)
library(MASS)
train = read.csv(".../train.csv", header = T)
train.pca = train[c(-27)] # drop the response column "units" before PCA
pr.out = prcomp(train.pca, scale = TRUE, center = TRUE, retx = TRUE) # conducting PCA
plot(pr.out, type = 'l')
summary(pr.out)
pred.tr = predict(pr.out, newdata = train) # predicting on the train data
dat.tr = cbind(train, pred.tr) # appending PCA output to the train data
glm.fit.pca = glm(units ~ PC2 + PC3 + PC4 + PC5 +
                          PC6 + PC7 + PC8 + PC9 + PC10 +
                          PC11 + PC12 + PC13 + PC14 + PC15,
                  data = dat.tr) # glm on train data with PCs
test = read.csv(".../test.csv", header = T) # reading in test data
pred.test = predict(pr.out, newdata = test, type = "response")
# With this code I get the following error message:
#   Error in predict.prcomp(pr.out, newdata = y, type = "response") :
#   'newdata' does not have named columns matching one or more of the original columns
# I understand why: the test set doesn't have the response variable.
So I tried the following:
pred.test = predict(pr.out, newdata = test) # this doesn't give any error
dat.test = cbind(test, pred.test) # appending PCA output to test data
I don't understand how I can run a glm on the test data the way I did on the training data, because the test data set doesn't have a response variable (i.e., "units"). I tried initializing the response variable in the test data set as follows:
dat.test$units = rep(0, nrow(dat.test))
Now when I try to run the glm model on the dat.test data set, I get all zeros. I can understand why, but I don't understand what changes I should make to my code to get predictions for the test data set.
Any guidance is highly appreciated! Thank you!
EDIT: I edited and reran the code based on the comment from @csgillespie. I still have the same issue. Thanks for catching the error!
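(A minimal sketch of the step that appears to be missing, using the objects defined above: the glm is already fitted on the training data, so you only need to predict from it; the test set never needs a "units" column.)

pred.units <- predict(glm.fit.pca, newdata = dat.test) # uses the appended PC scores
head(pred.units)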

How to get terminal nodes for a new observation from an rpart object?

Say I have
library(rpart) # provides rpart() and the kyphosis data
head(kyphosis)
inTrain <- sample(1:nrow(kyphosis), 45, replace = F)
TRAIN_KYPHOSIS <- kyphosis[inTrain, ]
TEST_KYPHOSIS <- kyphosis[-inTrain, ]
(kyph_tree <- rpart(Number ~ ., data = TRAIN_KYPHOSIS))
How do I get the terminal node from the fitted object for each observation in TEST_KYPHOSIS?
And how do I get a summary, such as the deviance and the predicted value, for the terminal node that each test observation maps to?
rpart actually has this functionality but it's not exposed (strangely enough, it's a rather obvious requirement).
predict_nodes <- function(object, newdata, na.action = na.pass) {
  where <-
    if (missing(newdata)) {
      object$where
    } else {
      if (is.null(attr(newdata, "terms"))) {
        Terms <- delete.response(object$terms)
        newdata <- model.frame(Terms, newdata, na.action = na.action,
                               xlev = attr(object, "xlevels"))
        if (!is.null(cl <- attr(Terms, "dataClasses")))
          .checkMFClasses(cl, newdata, TRUE)
      }
      rpart:::pred.rpart(object, rpart:::rpart.matrix(newdata))
    }
  # translate row positions in object$frame into node numbers
  as.integer(row.names(object$frame))[where]
}
And then:
> predict_nodes(kyph_tree, TEST_KYPHOSIS)
[1] 5 3 4 3 3 5 5 3 3 3 3 5 5 4 3 5 4 3 3 3 3 4 3 4 4 5 5 3 4 4 3 5 3 5 5 5
One option is to convert the rpart object to an object of class party from the partykit package. That provides a general toolkit for dealing with recursive partytions. The conversion is simple:
library("partykit")
(kyph_party <- as.party(kyph_tree))
Model formula:
Number ~ Kyphosis + Age + Start
Fitted party:
[1] root
| [2] Start >= 15.5: 2.933 (n = 15, err = 10.9)
| [3] Start < 15.5
| | [4] Age >= 112.5: 3.714 (n = 14, err = 18.9)
| | [5] Age < 112.5: 5.125 (n = 16, err = 29.8)
Number of inner nodes: 2
Number of terminal nodes: 3
(For exact reproducibility run the code from your question with set.seed(1) prior to running my code.)
For objects of this class there are somewhat more flexible methods for plot(), predict(), fitted(), etc. For example, plot(kyph_party) yields a more informative display than the default plot(kyph_tree). The fitted() method extracts a two-column data.frame with the fitted node numbers and the observed responses on the training data.
kyph_fit <- fitted(kyph_party)
head(kyph_fit, 3)
(fitted) (response)
1 5 6
2 2 2
3 4 3
With this you can easily compute any quantity you are interested in, e.g., the means, median, or residual sums of squares within each node.
tapply(kyph_fit[,2], kyph_fit[,1], mean)
2 4 5
2.933333 3.714286 5.125000
tapply(kyph_fit[,2], kyph_fit[,1], median)
2 4 5
3 4 5
tapply(kyph_fit[,2], kyph_fit[,1], function(x) sum((x - mean(x))^2))
2 4 5
10.93333 18.85714 29.75000
Instead of the simple tapply() you can use any other function of your choice to compute the tables of grouped statistics.
Now, to learn which node of the tree each observation from the test data TEST_KYPHOSIS is mapped to, you can simply use the predict(..., type = "node") method:
kyph_pred <- predict(kyph_party, newdata = TEST_KYPHOSIS, type = "node")
head(kyph_pred)
2 3 4 6 7 10
4 4 5 2 2 5
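To attach a per-node summary (e.g. the predicted mean and the within-node deviance) to each test observation, you can index the tapply() tables computed above by the predicted node numbers, for example:

node_mean <- tapply(kyph_fit[, 2], kyph_fit[, 1], mean)
node_dev <- tapply(kyph_fit[, 2], kyph_fit[, 1], function(x) sum((x - mean(x))^2))
data.frame(node = kyph_pred,
           predicted = node_mean[as.character(kyph_pred)],
           deviance = node_dev[as.character(kyph_pred)])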
