Predict multivariate model using rfsrc in R

I use the package "randomForestSRC" and I want to predict two variables at the same time. I have no issue predicting each of these variables alone, but I can't get the combination to work. Here's an MWE:
library(randomForestSRC)
# create dataset, only continuous variables
dat <- mtcars[,c("drat", "wt", "qsec")]
dat$drat_wt <- dat$drat*dat$wt
dat$drat_qsec <- dat$drat*dat$qsec
train <- sample(nrow(dat), round(nrow(dat)*0.8))
# train the algorithms with only one outcome variable
test_1 <- rfsrc(drat_wt ~ ., data = dat[train, ],
                importance = 'permute')
test_2 <- rfsrc(drat_qsec ~ ., data = dat[train, ],
                importance = 'permute')
# train the random forest with multiple outcome variables
test_3 <- rfsrc(cbind(drat_qsec, drat_wt) ~ ., data = dat[train, ],
                importance = 'permute')
test_4 <- rfsrc(Multivar(drat_qsec, drat_wt) ~ ., data = dat[train, ],
                importance = 'permute')
# predict all the random forest data
pred_1 <- predict(test_1, dat[-train,], na.action= "na.impute")
pred_2 <- predict(test_2, dat[-train,], na.action= "na.impute")
pred_3 <- predict(test_3, dat[-train,], na.action= "na.impute")
pred_4 <- predict(test_4, dat[-train,], na.action= "na.impute")
Now, for the random forests with only one outcome variable (pred_1 and pred_2), there is an element pred_1$predicted containing the predicted values; however, no such element exists for the multivariate forests.
Bonus: Is there any difference between the test_3 and test_4 formulations? I couldn't find anything about that in the documentation.

To answer your first question: the outputs pred_3 and pred_4 contain a list element, regrOutput, that is not present in the pred_1 and pred_2 objects. If you look inside this element, you will find the names of the two outcome (or target, or dependent, or ...) variables. For example,
names(pred_4$regrOutput)
[1] "drat_qsec" "drat_wt"
Now, inside each of these objects, you find the elements that are missing from the top level of the multivariate objects but present in names(pred_1). For example,
names(pred_4$regrOutput$drat_qsec)
[1] "predicted" "predicted.oob" "quantile" "quantile.oob" "err.rate"
and
setdiff(names(pred_1), names(pred_3))
[1] "predicted" "predicted.oob" "quantile" "quantile.oob" "err.rate"
I am not sure about the second question, but the objects have the same structure. One way to explore this is to look through the objects.
For example, set the seed for each forest and then look at the returned objects.
set.seed(1234)
test_3 <- rfsrc(cbind(drat_qsec, drat_wt) ~ ., data = dat[train, ],
                importance = 'permute')
set.seed(1234)
test_4 <- rfsrc(Multivar(drat_qsec, drat_wt) ~ ., data = dat[train, ],
                importance = 'permute')
Since the first element is the call, which we know is different, drop it and check for identity.
identical(test_3[-1], test_4[-1])
Now, look at each element in the returned objects.
mapply(identical, test_3, test_4)
Zero in on those that are not identical:
which(!mapply(identical, test_3, test_4))
      call     forest regrOutput
         1         19         36
Then further into the object...
which(!mapply(identical, test_3$forest, test_4$forest))
nativeArray
          2
and so on.
However, what might matter most is: do these produce the same result? Let's rerun the predictions and take a look at the predicted values.
pred_3 <- predict(test_3, dat[-train,], na.action= "na.impute")
pred_4 <- predict(test_4, dat[-train,], na.action= "na.impute")
Now, take a look
pred_3$regrOutput$drat_qsec$predicted
[1] 68.04419 55.95063 55.02067 56.38382 56.38382 75.33735
pred_4$regrOutput$drat_qsec$predicted
[1] 68.04419 55.95063 55.02067 56.38382 56.38382 75.33735
They look the same to me.
identical(pred_3$regrOutput$drat_qsec$predicted,
pred_4$regrOutput$drat_qsec$predicted)
[1] FALSE
They aren't identical, though. Let's use the recommended all.equal function, which compares floating point numbers up to a tolerance.
all.equal(pred_3$regrOutput$drat_qsec$predicted,
pred_4$regrOutput$drat_qsec$predicted)
[1] TRUE
Ah, it's probably just a matter of numerical precision.
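As a toy illustration of the difference between the two checks (plain base R, unrelated to the forests above):
x <- 0.1 + 0.2
identical(x, 0.3)   # FALSE: the doubles differ in the last few bits
all.equal(x, 0.3)   # TRUE: equal within the default tolerance (~1.5e-8)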

Related

Confusion Matrix in Logistic Regression in R

My confusion matrix created for a logistic regression model only has the values for Predicted-FALSE. Even though I adjusted my threshold, it does not do much to the matrix. What is wrong and how do I adjust the threshold? Below is the code for the training set and the result. "Retain" is my dependent variable with 1=retained 0=not retained, and all the independent variables are continuous variables. I have overall 170K records in the dataset (df). This matrix indicates that the model predicts that no one retained, which is odd, because in reality 45% retained.
model_1 <- glm(retain~ age_2010+cnt_total_funds+sum_MS_2010+tenure_2010, data=df, family="binomial")
res <- predict(model_1, training, retain="response")
(table(ActualValue=training$retain, PredictedValue=res>0.05))
           PredictedValue
ActualValue FALSE
          0 96006
          1 43676
You made a mistake inside the predict function: you want to use the type argument (there is no retain argument for this function).
I use sample data below to show a working example.
In your code, change retain="response" to type="response".
aa <- airquality
aa$retain <- aa$Ozone > 50
gg = glm(retain ~ Solar.R + Month, data = aa, family = "binomial")
range(predict(gg, aa, type = "response"), na.rm = TRUE)
#> [1] 0.05918388 0.48769632
Created on 2021-06-18 by the reprex package (v2.0.0)
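Applied to the code in the question (a sketch only; model_1 and training are the asker's objects, not reproduced here), the fix is:
res <- predict(model_1, training, type = "response")   # type, not retain
table(ActualValue = training$retain, PredictedValue = res > 0.05)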

`table` not showing in matrix format

I'm trying to generate a confusion table using the HMDA data from the AER package. So I ran a probit model, predicted on the testing set, and used the table() function to generate a 2 by 2 table, but R just returns a long list instead of the 2 by 2 matrix I wanted.
Could anyone tell me what's going on?
# load required packages and data (HMDA)
library(e1071)
library(caret)
library(AER)
library(plotROC)
data(HMDA)
# again, check variable columns
names(HMDA)
# convert dependent variables to numeric
HMDA$deny <- ifelse(HMDA$deny == "yes", 1, 0)
# subset needed columns
subset <- c("deny", "hirat", "lvrat", "mhist", "unemp")
# subset data
data <- HMDA[complete.cases(HMDA), subset]
# do a 75-25 train-test split
train_row_numbers <- createDataPartition(data$deny, p=0.75, list=FALSE)
training <- data[train_row_numbers, ]
testing <- data[-train_row_numbers, ]
# fit a probit model and predict on testing data
probit.fit <- glm(deny ~ ., family = binomial(link = "probit"), data = training)
probit.pred <- predict(probit.fit, testing)
confmat_probit <- table(Predicted = probit.pred,
                        Actual = testing$deny)
confmat_probit
You need to specify a threshold or cut-point for predicting a dichotomous outcome; predict returns continuous predicted values, not 0/1.
Also be careful with the predict function: for a glm, the default type is "link", which in your case is the probit scale. If you want predict to return probabilities, specify type="response".
probit.pred <- predict(probit.fit, testing, type="response")
Then choose a cut-point; any prediction above this value will be TRUE:
confmat_probit <- table(`Predicted>0.1` = probit.pred > 0.1 , Actual = testing$deny)
confmat_probit
             Actual
Predicted>0.1   0   1
        FALSE 248  21
        TRUE  273  53
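Since caret is loaded above, an alternative sketch is confusionMatrix(), which also reports accuracy, sensitivity, and so on; both arguments must be factors with matching levels:
pred_class <- factor(as.integer(probit.pred > 0.1), levels = c(0, 1))
actual     <- factor(testing$deny, levels = c(0, 1))
confusionMatrix(pred_class, actual, positive = "1")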

Comparing nested models with NAs in R

I am trying to compare nested regression models using the anova() function in R, but am running into problems because the level 1 and level 2 models differ in the number of observations due to missing cases. Here is a simple example:
# Create dataframe with multiple predictors with different number of NAs
dep <- c(45,46,45,48,49)
basevar <- c(10,12,10,16,17)
pred1 <- c(NA,20,NA,19,21)
dat <- data.frame(dep,basevar,pred1)
# Define level 1 of the nested models
basemodel <- lm(dep ~ basevar, data = dat)
# Add level 2
model1 <- lm(dep ~ basevar + pred1, data = dat)
# Compare the models (uh oh!)
anova(basemodel, model1)
I have seen two suggestions for similar problems, but both are problematic.
Suggestion 1: Impute the missing data. The problem with this is that the missing cases in my data were removed because they were outliers, and thus are not "missing at random," and imputing may overfit the data.
Suggestion 2: Make a separate data frame containing only the complete cases for the variable with missing cases, and use that for regressions. This is also problematic if you are creating multiple nested models sharing the same level 1 variable, but in which the level 2 variables differ in the number of missing cases. Here is an example of this:
# Create a new predictor variable with a different number of NAs from pred1
pred2 <- c(23,21,NA,10,11)
dat <- cbind(dat,pred2)
# Create dataframe containing only completed cases of pred1
nonadat1 <- subset(dat, subset = !is.na(pred1))
# Do the same for pred2
nonadat2 <- subset(dat, subset = !is.na(pred2))
# Define level 1 of the nested models within dataframe of pred1 complete cases
basemodel1 <- lm(dep ~ basevar, data = nonadat1)
# Check values of the model
summary(basemodel1)
# Add level 2
model1 <- lm(dep ~ basevar + pred1, data = nonadat1)
# Compare the models (yay it runs!)
anova(basemodel1, model1)
# Define level 1 of the nested models within dataframe of pred2 complete cases
basemodel2 <- lm(dep ~ basevar, data = nonadat2)
# Values are different from those in basemodel1
summary(basemodel2)
# Add level 2
model2 <- lm(dep ~ basevar + pred2, data = nonadat2)
# Compare the models
anova(basemodel2, model2)
As you can see, creating individual data frames creates differences at level 1 of the nested models, which makes interpretation problematic.
Does anyone know how I can compare these nested models while circumventing these problems?
Could this work? See here for more information. It doesn't exactly deal with the fact that models are fitted on different datasets, but it does allow for a comparison.
# Extract the log-likelihood of each model
A <- logLik(basemodel)
B <- logLik(model1)
# Likelihood ratio statistic: -2 * (logLik of the smaller model minus the larger)
(teststat <- -2 * (as.numeric(A) - as.numeric(B)))
# Compare to a chi-squared distribution with df = 1 (one extra parameter)
(p.val <- pchisq(teststat, df = 1, lower.tail = FALSE))
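A further minimal sketch (not from the linked discussion): refit the reduced model on exactly the rows the full model used, via its model frame, so that anova() compares models fitted to the same observations:
# model.frame(model1) holds only the complete cases model1 was fitted on
basemodel_cc <- update(model1, . ~ . - pred1, data = model.frame(model1))
anova(basemodel_cc, model1)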

caret dummy-vars exclude target

How can I use dummy vars in caret without destroying my target variable?
set.seed(5)
data <- ISLR::OJ
data<-na.omit(data)
dummies <- dummyVars( Purchase ~ ., data = data)
data2 <- predict(dummies, newdata = data)
split_factor = 0.5
n_samples = nrow(data2)
train_idx <- sample(seq_len(n_samples), size = floor(split_factor * n_samples))
train <- data2[train_idx, ]
test <- data2[-train_idx, ]
modelFit<- train(Purchase~ ., method='lda',preProcess=c('scale', 'center'), data=train)
will fail, as the Purchase variable is missing. If I instead replace it beforehand with data$Purchase <- ifelse(data$Purchase == "CH", 1, 0), caret complains that this is no longer a classification but a regression problem.
At least the example code seems to have a few issues, indicated in the comments below. To answer your questions:
The result of ifelse is an integer vector, not a factor, so the train function defaults to regression.
Passing the dummyVars output directly is done via train(x = ..., y = ..., ...) instead of a formula.
To avoid these problems, check the class of your objects carefully.
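For instance, a quick sanity check on the objects from the question:
class(data$Purchase)   # should be "factor" for a classification problem
class(data2)           # predict() on a dummyVars object returns a matrix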
Be aware that the preProcess option in train() will apply the preprocessing to all numeric variables, including the dummies. Option 2 below avoids this by standardizing the data before calling train().
set.seed(5)
data <- ISLR::OJ
data<-na.omit(data)
# Make sure that all variables that should be a factor are defined as such
newFactorIndex <- c("StoreID","SpecialCH","SpecialMM","STORE")
data[, newFactorIndex] <- lapply(data[,newFactorIndex], factor)
library(caret)
# See help for dummyVars. The function does not take a dependent variable and predict will give an error
# I don't include the target variable here, so predicting dummies on new data will drop unknown columns
# including the target variable
dummies <- dummyVars(~., data = data[,-1])
# I don't change the data yet to apply standardization to the numeric variables,
# before turning the categorical variables into dummies
split_factor = 0.5
n_samples = nrow(data)
train_idx <- sample(seq_len(n_samples), size = floor(split_factor * n_samples))
# Option 1 (as asked): Specify independent and dependent variables separately
# Note that dummy variables will be standardized by preProcess as per the original code
# Turn the categorical variables into (unstandardized) dummies
# The output of predict is a matrix, change it to data frame
data2 <- data.frame(predict(dummies, newdata = data))
modelFit<- train(y = data[train_idx, "Purchase"], x = data2[train_idx,], method='lda',preProcess=c('scale', 'center'))
# Option 2: Append dependent variable to the independent variables (needs to be a data frame to allow factor and numeric)
# Note that I also shift the preprocessing away from train() to
# avoid standardizing the dummy variables
train <- data[train_idx, ]
test <- data[-train_idx, ]
preprocessor <- preProcess(train[!sapply(train, is.factor)], method = c('center',"scale"))
train <- predict(preprocessor, train)
test <- predict(preprocessor, test)
# Turn the categorical variables into (unstandardized) dummies
# The output of predict is a matrix, change it to data frame
train <- data.frame(predict(dummies, newdata = train))
test <- data.frame(predict(dummies, newdata = test))
# Reattach the target variable to the training data that has been
# dropped by predict(dummies,...)
train$Purchase <- data$Purchase[train_idx]
modelFit<- train(Purchase ~., data = train, method='lda')
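A hedged usage sketch for the Option 2 fit, scoring the held-out set (the target is reattached to test just as it was for train):
test$Purchase <- data$Purchase[-train_idx]
pred <- predict(modelFit, newdata = test)
mean(pred == test$Purchase)   # simple holdout accuracy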

Regression Summaries in R

I've been using the glm function to do regression analysis, and it's treating me quite well. I'm wondering though, some of the things I want to regress involve a large amount of regression factors. I have two main questions:
Is it possible to supply a character vector of names for the regressors?
Can the p-value portion of summary(glm) be sorted at all? Preferably by the p-values of each regressor.
Ex.
A # sample data frame
names(A)
[1] Dog Cat Human Limbs Tail Height Weight Teeth.Count
a = names(A)[4:7]
glm( Dog ~ a, data = A, family = "binomial")
For your first question, see as.formula. Basically you want to do the following:
x <- names(A)[4:7]
regressors <- paste(x, collapse = " + ")
form <- as.formula(paste("Dog ~", regressors))
glm(form, data = A, family = "binomial")
If you want interaction terms in your model, you need to make the structure somewhat more complex by using different collapse= arguments. That argument specifies which symbols are placed between the elements of your vector. For instance, if you specify "*" in the code above, you will have a saturated model with all possible interactions. If you just need some interactions, but not all, you will want to create the part of the formula containing all interactions first (using "*" as collapse argument), and then add the remaining terms in the separate paste function (using "+" as collapse argument). All in all, you want to create a character string that is identical to your formula, and then convert it to the formula class.
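For illustration, a small sketch of those collapse= variants, using the names from the example above:
x <- names(A)[4:7]
# saturated model: all interactions among the four regressors
as.formula(paste("Dog ~", paste(x, collapse = " * ")))
# interactions among the first two only; the remaining terms enter additively
as.formula(paste("Dog ~", paste(x[1:2], collapse = " * "), "+", paste(x[3:4], collapse = " + ")))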
For your second question, you need to convert the output of summary to a data structure that can be sorted. For instance, a data frame. Let's say that the name of your glm model is model:
library(plyr)
# The "coefficients" element of the summary is the coefficient matrix
coef <- summary(model)["coefficients"]
coef.sort <- as.data.frame(coef)
names(coef.sort) <- c("Estimate", "SE", "Tval", "Pval")
arrange(coef.sort, Pval)
Assign the result of arrange() to a variable, and continue with it as you like.
An example data frame:
set.seed(42)
A <- data.frame(Dog = sample(0:1, 100, TRUE), b = rnorm(100), c = rnorm(100))
a <- names(A)[2:3]
Firstly, you can use the character vector a to create a model formula with reformulate:
# glm(Dog ~ a, data = A, family = "binomial") would fail: a is a character vector
form <- reformulate(a, "Dog")
# Dog ~ b + c
model <- glm(form, data = A, family = "binomial")
Secondly, this is a way to sort the model summary by the p-values:
modcoef <- summary(model)[["coefficients"]]
modcoef[order(modcoef[ , 4]), ]
# Estimate Std. Error z value Pr(>|z|)
# b 0.23902684 0.2212345 1.0804232 0.2799538
# (Intercept) 0.20855908 0.2025642 1.0295951 0.3032001
# c -0.09287769 0.2191231 -0.4238608 0.6716673
