Create a binary outcome with random forest - r

I have a dataset that looks like this:
TEAM1 TEAM2 EXPG1 EXPG2 Gewonnen
ADO Den Haag Groningen 1.5950 1.2672 1
I now try to predict the column Gewonnen based on EXPG1 and EXPG2. Therefore I created a training and test set and am building the following model (all using the caret package):
modFit <- train(Gewonnen~ EXPG1 + EXPG2, data=training, method="rf", prox=TRUE)
I can't make a confusion matrix now because my data has more levels than the reference. That's true, because when I do:
pred <- predict(modFit, testing)
head(pred)
It says: 0.5324000 0.7237333 0.2811333 0.8231000 0.8299333 0.9792000
Because I want to make a confusion matrix, I need to turn these into 0/1 values, but I have the feeling that there should be an option to do this in the model as well.
Any thoughts on what I should change in this model to create 0/1 values? I couldn't find it in the documentation:
modFit <- train(Gewonnen~ EXPG1 + EXPG2, data=training, method="rf", prox=TRUE)

First of all, as Tim Biegeleisen says, you should convert your Gewonnen variable to a factor (in both training & test sets), if it is not already:
training$Gewonnen <- as.factor(training$Gewonnen)
testing$Gewonnen <- as.factor(testing$Gewonnen)
After that, the type option in the caret function predict determines what type of response you get for a binary classification problem, i.e. class labels or probabilities. Here is a reproducible example from the caret documentation using the Sonar dataset from the package mlbench:
library(caret)
library(mlbench)
data(Sonar)
str(Sonar$Class)
# Factor w/ 2 levels "M","R": 2 2 2 2 2 2 2 2 2 2 ...
set.seed(998)
inTraining <- createDataPartition(Sonar$Class, p = .75, list = FALSE)
training <- Sonar[ inTraining,]
testing <- Sonar[-inTraining,]
modFit <- train(Class ~ ., data=training, method="rf", prox=TRUE)
pred <- predict(modFit, testing, type="prob") # for class probabilities
head(pred)
# M R
# 5 0.442 0.558
# 10 0.276 0.724
# 11 0.096 0.904
# 12 0.360 0.640
# 20 0.654 0.346
# 21 0.522 0.478
pred2 <- predict(modFit, testing, type="raw") # for class labels
head(pred2)
# [1] R R R R M M
# Levels: M R
For the confusion matrix, you will need class labels (i.e. pred2 above):
confusionMatrix(pred2, testing$Class)
# Confusion Matrix and Statistics
# Reference
# Prediction M R
# M 25 6
# R 2 18
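If you ever need hard labels from the probabilities themselves (say, to apply your own cutoff), here is a minimal sketch, assuming a 0.5 cutoff on the pred data frame from above:
# hypothetical manual thresholding of the class probabilities in pred
pred_lab <- factor(ifelse(pred$M >= 0.5, "M", "R"), levels = levels(testing$Class))
head(pred_lab)
confusionMatrix(pred_lab, testing$Class)  # same idea as using pred2 directly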

This answer is a bit speculative, as you omitted some critical details about your data set and I have not worked extensively with the caret package. That being said, it appears that you are running random forests in regression mode, which means you end up with a continuous response: predictions can be 0, 1, or anything in between. If your Gewonnen column only has values of 0 or 1, and you want the predicted values to behave the same way, then you can try turning Gewonnen into a categorical variable. As this article discusses, this should tell random forests to run in classification mode instead of regression mode.
Gewonnen <- as.factor(Gewonnen)
Then build the random forest as you did before, and you should get the responses you want.
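Put together with your own column names, a minimal (untested) sketch of that classification-mode workflow might look like this:
# hypothetical sketch: force classification mode, then build the confusion matrix
training$Gewonnen <- as.factor(training$Gewonnen)
testing$Gewonnen <- as.factor(testing$Gewonnen)
modFit <- train(Gewonnen ~ EXPG1 + EXPG2, data=training, method="rf", prox=TRUE)
pred <- predict(modFit, testing)  # class labels ("0"/"1") by default
confusionMatrix(pred, testing$Gewonnen)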

Related

kNN algorithm not working while using caret

I am trying to run LOOCV kNN on this dataset (104x182 where the first 62 samples are B and the following 42 are C). I first conducted a PCA on the standardized version of this dataset (giving me 104 PCs). I then try to perform LOOCV kNN for i = 3:98 where i refers to the number of PCs I will use for my kNN model. For each i I pull out the highest accuracy, which k it occurs at and store it within a data frame.
# required packages
library(MASS)
library(class)
library(tidyverse)
library(caret)
# reading in and cleaning data
data <- read.csv("chowdary.csv")
og_data <- data[, -1]
st_data <- as.data.frame(cbind(og_data[, 1], scale(og_data[, -1])))
colnames(st_data)[1] <- "tumour"
# PCA for dimension reduction
# on standardized data
pca_all <- prcomp(og_data[, -1], center=TRUE, scale=TRUE)
# creating data frame to store best k value for each number of PCs
kdf_pca_all_cc <- tibble(i=as.numeric(),             # this is for storing number of PCs used,
                         pca_all_k=as.numeric(),     # k value,
                         pca_all_acc=as.numeric(),   # accuracy value,
                         pca_all_kapp=as.numeric())  # and kappa value
# kNN
k_kNN <- 3:97 # number of PCs to use in each iteration of the model
train_control <- trainControl(method="LOOCV")
kNN_data <- as.data.frame(cbind(as.factor(st_data[, 1]), pca_all$x)) # data used in kNN model below
for (i in k_kNN){
  a111 <- train(V1~ .,
                method="knn",
                tuneGrid=expand.grid(k=1:25),
                trControl=train_control,
                metric="Accuracy",
                data=kNN_data[, 1:i])
  b111 <- a111$results[as.integer(a111$bestTune), ]  # this is to store the best accuracy rate, along with its k and kappa value
  kdf_pca_all_cc <- kdf_pca_all_cc %>%
    add_row(i=i-1,
            pca_all_k=b111[, 1],
            pca_all_acc=b111[, 2],
            pca_all_kapp=b111[, 3])
}
For example, for i = 5, the kNN model would be using the following data:
head(kNN_data[, 1:5])
V1 PC1 PC2 PC3 PC4
1 1 3.299844 0.2587487 -1.00501632 2.0273727
2 1 1.427856 -1.0455044 -1.79970790 2.5244021
3 1 3.087657 1.2563404 1.67591441 -1.4270431
4 1 3.107778 1.5893396 2.65871270 -2.8217264
5 1 3.244306 0.5982652 0.37011029 0.3642425
6 1 3.000098 0.5471276 -0.01178315 1.0857886
However, whenever I try to run the for-loop, I get the following error and warning message:
Error: Metric Accuracy not applicable for regression models
In addition: Warning message:
In train.default(x, y, weights = w, ...) :
You are trying to do regression and your outcome only has two possible values Are you trying to do classification? If so, use a 2 level factor as your outcome column.
I have no idea how to fix this. Any help would be much appreciated.
Also, as a side note, is there a faster way to run this for-loop? It takes quite a while but I have no idea how to make it more efficient. Thank you.
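For what it's worth, the warning quoted above points at the likely culprit: cbind() with a matrix coerces the factor back to numeric, so train() sees a two-valued numeric outcome and switches to regression. A minimal sketch of keeping the outcome as a factor (assuming the first column of st_data holds the class labels, here renamed tumour for illustration):
# hypothetical fix: keep the class label as a factor so caret runs classification
kNN_data <- data.frame(tumour = factor(st_data[, 1]), pca_all$x)
a111 <- train(tumour ~ ., data = kNN_data[, 1:5],  # e.g. the first 4 PCs, as in the example above
              method = "knn",
              tuneGrid = expand.grid(k = 1:25),
              trControl = train_control,
              metric = "Accuracy")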

R Caret Random Forest AUC too good to be true?

Relative newbie to predictive modeling--most of my training/experience is in inferential stats. I'm trying to predict student college graduation in 4 years.
Basic issue is that I've done data cleaning (imputing, centering, scaling); split that processed/transformed data into training (70%) and testing (30%) sets; balanced the data using two approaches, ROSE "both" and SMOTE (because the data was 65% = 0, 35% = 1, and while I've found inconsistent advice on what counts as unbalanced, one source suggested anything outside the 40/60 range); and ran random forests.
For the ROSE "BOTH" models I got 0.9242 accuracy on the training set and AUC of 0.9268 for the test set.
For the SMOTE model I got 0.9943 accuracy on the training set and AUC of 0.9971 on the test set.
More details on model performance are embedded in the code copied below.
This just seems too good to be true. But, from what I've been able to find, slightly improved performance on the test set would not indicate overfitting (it'd be the other way around). So, is this model's performance really that good, or is it too good to be true? I have not been able to find a direct answer to this question via SO searches.
Also, in a few weeks I'll have another cohort of data I can run this on. I suppose that could be another "test" set, correct? Then I can apply this to the newest cohort for which we are interested in knowing likelihood to graduate in 4 years.
Many thanks,
Brian
#Used for predictive modeling of 4-year graduation
#IMPORT DATA
library(haven)
grad4yr <- [file path]
#DETERMINE DATA BALANCE/UNBALANCE
prop.table(table(grad4yr$graduate_4_yrs))
# 0=0.6492, 1=0.3517
#convert to factor so next step doesn't impute outcome variable
grad4yr$graduate_4_yrs <- as.factor(grad4yr$graduate_4_yrs)
#Preprocess data, RANN package used
library('RANN')
#Create preprocessed values object which includes centering, scaling, and imputing missing values using KNN
Processed_Values <- preProcess(grad4yr, method = c("knnImpute","center","scale"))
#Create new dataset with imputed values and centering/scaling
#Confirmed this results in 0 cases with missing values
grad4yr_data_processed <- predict(Processed_Values, grad4yr)
#Confirm last step results in 0 cases with missing values
sum(is.na(grad4yr_data_processed))
#[1] 0
#Convert outcome variable to numeric to ensure dummify step (next) doesn't dummify outcome variable.
grad4yr_data_processed$graduate_4_yrs <- as.factor(grad4yr_data_processed$graduate_4_yrs)
#Convert all factor variables to dummy variables; fullrank used to omit one of new dummy vars in each
#set.
dmy <- dummyVars("~ .", data = grad4yr_data_processed, fullRank = TRUE)
#Create new dataset that has the data imputed AND transformed to have dummy variables for all variables that
#will go in models.
grad4yr_processed_transformed <- data.frame(predict(dmy,newdata = grad4yr_data_processed))
#Convert outcome variable back to binary/factor for predictive models and re-create the variable with the same name
#not entirely sure why the last step created a new version of the outcome var with ".1" at the end
grad4yr_processed_transformed$graduate_4_yrs.1 <- as.factor(grad4yr_processed_transformed$graduate_4_yrs.1)
grad4yr_processed_transformed$graduate_4_yrs <- as.factor(grad4yr_processed_transformed$graduate_4_yrs)
grad4yr_processed_transformed$graduate_4_yrs.1 <- NULL
#Split data into training and testing/validation datasets based on outcome at 70%/30%
index <- createDataPartition(grad4yr_processed_transformed$graduate_4_yrs, p=0.70, list=FALSE)
trainSet <- grad4yr_processed_transformed[index,]
testSet <- grad4yr_processed_transformed[-index,]
#load caret
library(caret)
#Feature selection using rfe in R Caret, used with profile/comparison
control <- rfeControl(functions = rfFuncs,
method = "repeatedcv",
repeats = 10,#using k=10 per Kuhn & Johnson pp70; and per James et al pp
#https://www-bcf.usc.edu/~gareth/ISL/ISLR%20First%20Printing.pdf
verbose = FALSE)
#create traincontrol using repeated cross-validation with 10 fold 5 times
fitControl <- trainControl(method = "repeatedcv",
number = 10,
repeats = 5,
search = "random")
#Set the outcome variable object
grad4yrs <- 'graduate_4_yrs'
#set predictor variables object
predictors <- names(trainSet[!names(trainSet) %in% grad4yrs])
#create predictor profile to see where prediction is best (by number of vars)
grad4yr_pred_profile <- rfe(trainSet[,predictors],trainSet[,grad4yrs],rfeControl = control)
# Recursive feature selection
#
# Outer resampling method: Cross-Validated (10 fold, repeated 5 times)
#
# Resampling performance over subset size:
#
# Variables Accuracy Kappa AccuracySD KappaSD Selected
# 4 0.6877 0.2875 0.03605 0.08618
# 8 0.7057 0.3078 0.03461 0.08465 *
# 16 0.7006 0.2993 0.03286 0.08036
# 40 0.6949 0.2710 0.03330 0.08157
#
# The top 5 variables (out of 8):
# Transfer_Credits, HS_RANK, Admit_Term_Credits_Taken, first_enroll, Admit_ReasonUT10
#see data structure
str(trainSet)
#not copying output here, but confirms outcome var is factor and everything else is numeric
#given the 65/35 split on the outcome var and what I can find about unbalanced data, I'm considering it unbalanced and taking steps to balance.
#using ROSE "both" and SMOTE to see how differently they perform. Also ran under/over with ROSE but they didn't perform nearly as
#well, so removed from this script.
#SMOTE to balance data on the processed/dummified dataset
library(DMwR)#https://www3.nd.edu/~dial/publications/chawla2005data.pdf for justification
train.SMOTE <- SMOTE(graduate_4_yrs ~ ., data=grad4yr_processed_transformed, perc.over=600, perc.under=100)
#see how balanced SMOTE resulting dataset is
prop.table(table(train.SMOTE$graduate_4_yrs))
#0 1
#0.4615385 0.5384615
#open ROSE package/library
library("ROSE")
#ROSE to balance data (using BOTH) on the processed/dummified dataset
train.both <- ovun.sample(graduate_4_yrs ~ ., data=grad4yr_processed_transformed, method = "both", p=.5,
N = 2346)$data
#see how balanced BOTH resulting dataset is
prop.table(table(train.both$graduate_4_yrs))
#0 1
#0.4987212 0.5012788
#ROSE to balance data (using BOTH) on the processed/dummified dataset
table(grad4yr_processed_transformed$graduate_4_yrs)
#0 1
#1144 618
library("caret")
#create random forests using balanced data from above
RF_model_both <- train(train.both[,predictors],train.both[, grad4yrs],method = 'rf', trControl = fitControl, ntree=1000, tuneLength = 10)
#print info on accuracy & kappa for "BOTH" training model
# print(RF_model_both)
# Random Forest
#
# 2346 samples
# 40 predictor
# 2 classes: '0', '1'
#
# No pre-processing
# Resampling: Cross-Validated (10 fold, repeated 5 times)
# Summary of sample sizes: 2112, 2111, 2111, 2112, 2111, 2112, ...
# Resampling results across tuning parameters:
#
# mtry Accuracy Kappa
# 8 0.9055406 0.8110631
# 11 0.9053719 0.8107246
# 12 0.9057981 0.8115770
# 13 0.9054584 0.8108965
# 14 0.9048602 0.8097018
# 20 0.9034992 0.8069796
# 26 0.9027307 0.8054427
# 30 0.9034152 0.8068113
# 38 0.9023899 0.8047622
# 40 0.9032428 0.8064672
# Accuracy was used to select the optimal model using the largest value.
# The final value used for the model was mtry = 12.
RF_model_SMOTE <- train(train.SMOTE[,predictors],train.SMOTE[, grad4yrs],method = 'rf', trControl = fitControl, ntree=1000, tuneLength = 10)
#print info on accuracy & kappa for "SMOTE" training model
# print(RF_model_SMOTE)
# Random Forest
#
# 8034 samples
# 40 predictor
# 2 classes: '0', '1'
#
# No pre-processing
# Resampling: Cross-Validated (10 fold, repeated 5 times)
# Summary of sample sizes: 7231, 7231, 7230, 7230, 7231, 7231, ...
# Resampling results across tuning parameters:
#
# mtry Accuracy Kappa
# 17 0.9449082 0.8899939
# 19 0.9458047 0.8917740
# 21 0.9458543 0.8918695
# 29 0.9470243 0.8941794
# 31 0.9468750 0.8938864
# 35 0.9468003 0.8937290
# 36 0.9463772 0.8928876
# 40 0.9463275 0.8927828
#
# Accuracy was used to select the optimal model using the largest value.
# The final value used for the model was mtry = 29.
#Given that both accuracy and kappa appear better in the "SMOTE" random forest it's looking like it's the better model.
#But, running ROC/AUC on both to see how they both perform on validation data.
#Create predictions based on random forests above
rf_both_predictions <- predict.train(object=RF_model_both,testSet[, predictors], type ="raw")
rf_SMOTE_predictions <- predict.train(object=RF_model_SMOTE,testSet[, predictors], type ="raw")
#Create predicted probabilities based on random forests above
rf_both_pred_prob <- predict.train(object=RF_model_both,testSet[, predictors], type ="prob")
rf_SMOTE_pred_prob <- predict.train(object=RF_model_SMOTE,testSet[, predictors], type ="prob")
#create Random Forest confusion matrix to evaluate random forests
confusionMatrix(rf_both_predictions,testSet[,grad4yrs], positive = "1")
#output copied here:
# Confusion Matrix and Statistics
#
# Reference
# Prediction 0 1
# 0 315 12
# 1 28 173
#
# Accuracy : 0.9242
# 95% CI : (0.8983, 0.9453)
# No Information Rate : 0.6496
# P-Value [Acc > NIR] : < 2e-16
#
# Kappa : 0.8368
# Mcnemar's Test P-Value : 0.01771
#
# Sensitivity : 0.9351
# Specificity : 0.9184
# Pos Pred Value : 0.8607
# Neg Pred Value : 0.9633
# Prevalence : 0.3504
# Detection Rate : 0.3277
# Detection Prevalence : 0.3807
# Balanced Accuracy : 0.9268
#
# 'Positive' Class : 1
# confusionMatrix(rf_under_predictions,testSet[,grad4yrs], positive = "1")
#output copied here:
#Accuracy : 0.8258
#only copied accuracy as it was well below the two other versions
confusionMatrix(rf_SMOTE_predictions,testSet[,grad4yrs], positive = "1")
#output copied here:
# Confusion Matrix and Statistics
#
# Reference
# Prediction 0 1
# 0 340 0
# 1 3 185
#
# Accuracy : 0.9943
# 95% CI : (0.9835, 0.9988)
# No Information Rate : 0.6496
# P-Value [Acc > NIR] : <2e-16
#
# Kappa : 0.9876
# Mcnemar's Test P-Value : 0.2482
#
# Sensitivity : 1.0000
# Specificity : 0.9913
# Pos Pred Value : 0.9840
# Neg Pred Value : 1.0000
# Prevalence : 0.3504
# Detection Rate : 0.3504
# Detection Prevalence : 0.3561
# Balanced Accuracy : 0.9956
#
# 'Positive' Class : 1
#put predictions in dataset
testSet$rf_both_pred <- rf_both_predictions# predictions (BOTH)
testSet$rf_SMOTE_pred <- rf_SMOTE_predictions# predictions (SMOTE)
testSet$rf_both_prob <- rf_both_pred_prob# probabilities (BOTH)
testSet$rf_SMOTE_prob <- rf_SMOTE_pred_prob# probabilities (SMOTE)
library(pROC)
#get AUC of the BOTH predictions
testSet$rf_both_pred <- as.numeric(testSet$rf_both_pred)
Both_ROC_Curve <- roc(response = testSet$graduate_4_yrs,
predictor = testSet$rf_both_pred,
levels = rev(levels(testSet$graduate_4_yrs)))
auc(Both_ROC_Curve)
# Area under the curve: 0.9268
#get AUC of the SMOTE predictions
testSet$rf_SMOTE_pred <- as.numeric(testSet$rf_SMOTE_pred)
SMOTE_ROC_Curve <- roc(response = testSet$graduate_4_yrs,
predictor = testSet$rf_SMOTE_pred,
levels = rev(levels(testSet$graduate_4_yrs)))
auc(SMOTE_ROC_Curve)
#Area under the curve: 0.9971
#So, the SMOTE balanced data performed very well on training data and near perfect on the validation/test data.
#But, it seems almost too good to be true.
#Is there anything I might have missed or performed incorrectly?
I'll post my comment as an answer, even if this might be migrated.
I really think that you're overfitting, because you have balanced on the whole dataset.
Instead you should balance only the train set.
Here is your code:
library(DMwR)
train.SMOTE <- SMOTE(graduate_4_yrs ~ ., data=grad4yr_processed_transformed,
perc.over=600, perc.under=100)
By doing so, your train.SMOTE now contains information from the test set too, so when you test on your testSet the model will have already seen part of the data, and this is likely the cause of your "too good" results.
It should be:
library(DMwR)
train.SMOTE <- SMOTE(graduate_4_yrs ~ ., data=trainSet, # use only the train set
perc.over=600, perc.under=100)
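A minimal sketch of the corrected flow, reusing the objects already created in your script (trainSet, testSet, predictors, grad4yrs and fitControl), might then look like this:
# hypothetical sketch: balance the training data only, then evaluate on the untouched test set
library(DMwR)
train.SMOTE <- SMOTE(graduate_4_yrs ~ ., data=trainSet, perc.over=600, perc.under=100)
RF_model_SMOTE <- train(train.SMOTE[, predictors], train.SMOTE[, grad4yrs],
                        method='rf', trControl=fitControl, ntree=1000, tuneLength=10)
rf_SMOTE_predictions <- predict(RF_model_SMOTE, testSet[, predictors])  # testSet never touched by SMOTE
confusionMatrix(rf_SMOTE_predictions, testSet[, grad4yrs], positive="1")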

calculate precision and recall for SVM model in R

I use census data to build a logistic regression model and an SVM model. First, I convert <=50K to 0 and >50K to 1 to make the data binomial. I try to calculate precision and recall for both models and compare which model performs better. But table(test$salary, pred1 > 0.5) for the SVM model only gives FALSE values and no TRUE values:
FALSE
0 26
1 8
Does anyone know what the problem is?
I am new to R, and I hope I can get help here. Thanks a lot, any help is welcome. I hope the question is clear enough.
#setwd("C:/Users/)
Censusdata <- read.csv(file="census-data.csv", header=TRUE, sep=",")
library("dplyr", lib.loc="~/R/win-library/3.4")
# convert <=50K to 0, >50K to 1
data = Censusdata
data$salary<-as.numeric(factor(data$salary))-1
library(lattice)
library(ggplot2)
library(caret)
data <- Censusdata
indexes <- sample(1:nrow(data),size=0.7*nrow(data))
test <- data[indexes,]
train <- data[-indexes,]
#logistic regression model fit
model <- glm(salary ~ education.num + hours.per.week,family = binomial,data = test)
pred <- predict(model,data=train)
summary(model)
# calculate precision and recall
table(test$salary,pred >0.5)
# I get
FALSE TRUE
0 26 0
1 6 2
# for SVM model
model1 <- svm(salary ~ education.num + hours.per.week,family = binomial, data=test)
pred1 <- predict(model1,data=train)
table(test$salary,pred1 >0.5)
# I get the following
FALSE
0 26
1 8
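For reference, once you do have a 2x2 table of actual versus predicted values, precision and recall follow directly from its cells. A minimal sketch, assuming the 0/1 coding is kept, that 1 (>50K) is the positive class, and that both FALSE and TRUE columns are present (rows = actual, columns = predicted):
# hypothetical sketch: precision and recall from a 2x2 table
tab <- table(actual = test$salary, predicted = pred > 0.5)
TP <- tab["1", "TRUE"]; FP <- tab["0", "TRUE"]; FN <- tab["1", "FALSE"]
precision <- TP / (TP + FP)  # of the cases predicted >50K, how many truly are
recall    <- TP / (TP + FN)  # of the true >50K cases, how many were found
If both the predictions and the reference are factors with the same levels, caret's confusionMatrix() can also report these directly via mode = "prec_recall".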

Multinomial logit in R: mlogit versus nnet

I want to run a multinomial logit in R and have used two libraries, nnet and mlogit, which produce different results and report different types of statistics. My questions are:
What is the source of the discrepancy between the coefficients and standard errors reported by nnet and those reported by mlogit?
I would like to report my results to a Latex file using stargazer. When doing so, there is a problematic tradeoff:
If I use the results from mlogit then I get the statistics I wish, such as pseudo R-squared; however, the output is in long format (see example below).
If I use the results from nnet then the format is as expected, but it reports statistics that I am not interested in, such as AIC, and does not include, for example, pseudo R-squared.
I would like to have the statistics reported by mlogit in the formatting of nnet when I use stargazer.
Here is a reproducible example, with three choice alternatives:
library(mlogit)
df = data.frame(c(0,1,1,2,0,1,0), c(1,6,7,4,2,2,1), c(683,276,756,487,776,100,982))
colnames(df) <- c('y', 'col1', 'col2')
mydata = df
mldata <- mlogit.data(mydata, choice="y", shape="wide")
mlogit.model1 <- mlogit(y ~ 1| col1+col2, data=mldata)
The tex output when compiled is in what I refer to as "long format", which I deem undesirable:
Now, using nnet:
library(nnet)
mlogit.model2 = multinom(y ~ 1 + col1+col2, data=mydata)
stargazer(mlogit.model2)
Gives the tex output:
which is of the "wide" format which I desire. Note the different coefficient and standard errors.
To my knowledge, there are three R packages that allow the estimation of the multinomial logistic regression model: mlogit, nnet and globaltest (from Bioconductor). I do not consider here the mnlogit package, a faster and more efficient implementation of mlogit.
All the above packages use different algorithms that, for small samples, give different results. These differences vanish for moderate sample sizes (try with n <- 100).
Consider the following data generating process taken from James Keirstead's blog:
n <- 40
set.seed(4321)
df1 <- data.frame(x1=runif(n,0,100), x2=runif(n,0,100))
df1 <- transform(df1, y=1+ifelse(100 - x1 - x2 + rnorm(n,sd=10) < 0, 0,
                                 ifelse(100 - 2*x2 + rnorm(n,sd=10) < 0, 1, 2)))
str(df1)
'data.frame': 40 obs. of 3 variables:
$ x1: num 33.48 90.91 41.15 4.38 76.35 ...
$ x2: num 68.6 42.6 49.9 36.1 49.6 ...
$ y : num 1 1 3 3 1 1 1 1 3 3 ...
table(df1$y)
1 2 3
19 8 13
The model parameters estimated by the three packages are respectively:
library(mlogit)
df2 <- mlogit.data(df1, choice="y", shape="wide")
mlogit.mod <- mlogit(y ~ 1 | x1+x2, data=df2)
(mlogit.cf <- coef(mlogit.mod))
2:(intercept) 3:(intercept) 2:x1 3:x1 2:x2 3:x2
42.7874653 80.9453734 -0.5158189 -0.6412020 -0.3972774 -1.0666809
#######
library(nnet)
nnet.mod <- multinom(y ~ x1 + x2, df1)
(nnet.cf <- coef(nnet.mod))
(Intercept) x1 x2
2 41.51697 -0.5005992 -0.3854199
3 77.57715 -0.6144179 -1.0213375
#######
library(globaltest)
glbtest.mod <- globaltest::mlogit(y ~ x1+x2, data=df1)
(cf <- glbtest.mod@coefficients)
1 2 3
(Intercept) -41.2442934 1.5431814 39.7011119
x1 0.3856738 -0.1301452 -0.2555285
x2 0.4879862 0.0907088 -0.5786950
The mlogit command of globaltest fits the model without using a reference outcome category, hence the usual parameters can be calculated as follows:
(glbtest.cf <- rbind(cf[,2]-cf[,1],cf[,3]-cf[,1]))
(Intercept) x1 x2
[1,] 42.78747 -0.5158190 -0.3972774
[2,] 80.94541 -0.6412023 -1.0666813
Concerning the estimation of the parameters in the three packages, the method used in mlogit::mlogit is explained in detail here.
In nnet::multinom the model is a neural network with no hidden layers, no bias nodes and a softmax output layer; in our case there are 3 input units and 3 output units:
nnet:::summary.nnet(nnet.mod)
a 3-0-3 network with 12 weights
options were - skip-layer connections softmax modelling
b->o1 i1->o1 i2->o1 i3->o1
0.00 0.00 0.00 0.00
b->o2 i1->o2 i2->o2 i3->o2
0.00 41.52 -0.50 -0.39
b->o3 i1->o3 i2->o3 i3->o3
0.00 77.58 -0.61 -1.02
Maximum conditional likelihood is the method used in multinom for model fitting.
The parameters of multinomial logit models are estimated in globaltest::mlogit using maximum likelihood and working with an equivalent log-linear model and the Poisson likelihood. The method is described here.
For models estimated by multinom the McFadden's pseudo R-squared can be easily calculated as follows:
nnet.mod.loglik <- nnet:::logLik.multinom(nnet.mod)
nnet.mod0 <- multinom(y ~ 1, df1)
nnet.mod0.loglik <- nnet:::logLik.multinom(nnet.mod0)
(nnet.mod.mfr2 <- as.numeric(1 - nnet.mod.loglik/nnet.mod0.loglik))
[1] 0.8483931
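The analogous quantity for the mlogit fit can be computed the same way (summary(mlogit.mod) also prints a McFadden R^2 directly); a quick sketch using an intercept-only model as the null:
# McFadden's pseudo R-squared for the mlogit fit, for comparison
mlogit.mod0 <- mlogit(y ~ 1, data=df2)   # intercept-only (null) model
(mlogit.mfr2 <- as.numeric(1 - logLik(mlogit.mod)/logLik(mlogit.mod0)))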
At this point, using stargazer, I generate a report for the model estimated by mlogit::mlogit which is as similar as possible to the report of multinom. The basic idea is to substitute the estimated coefficients and probabilities in the object created by multinom with the corresponding estimates of mlogit.
# Substitution of coefficients
nnet.mod2 <- nnet.mod
cf <- matrix(nnet.mod2$wts, nrow=4)
cf[2:nrow(cf), 2:ncol(cf)] <- t(matrix(mlogit.cf,nrow=2))
# Substitution of probabilities
nnet.mod2$wts <- c(cf)
nnet.mod2$fitted.values <- mlogit.mod$probabilities
Here is the result:
library(stargazer)
stargazer(nnet.mod2, type="text")
==============================================
Dependent variable:
----------------------------
2 3
(1) (2)
----------------------------------------------
x1 -0.516** -0.641**
(0.212) (0.305)
x2 -0.397** -1.067**
(0.176) (0.519)
Constant 42.787** 80.945**
(18.282) (38.161)
----------------------------------------------
Akaike Inf. Crit. 24.623 24.623
==============================================
Note: *p<0.1; **p<0.05; ***p<0.01
Now I am working on the last issue: how to visualize loglik, pseudo R2 and other information in the above stargazer output.
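One option worth trying (a sketch, not tested against every stargazer version) is stargazer's add.lines argument, which appends arbitrary rows to the summary-statistics block; for example, to display the pseudo R-squared computed above:
stargazer(nnet.mod2, type="text",
          add.lines=list(c("McFadden pseudo R2",
                           rep(format(round(nnet.mod.mfr2, 3)), 2))))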
If you are using stargazer you can use omit to remove unwanted rows or references. Here is a quick example; hopefully it will point you in the right direction.
N.B. My assumption is that you are using RStudio and rmarkdown with knitr.
```{r, echo=FALSE}
library(mlogit)
df = data.frame(c(0,1,1,2,0,1,0), c(1,6,7,4,2,2,1), c(683,276,756,487,776,100,982))
colnames(df) <- c('y', 'col1', 'col2')
mydata = df
mldata <- mlogit.data(mydata, choice = "y", shape="wide")
mlogit.model1 <- mlogit(y ~ 1| col1+col2, data=mldata)
mlogit.col1 <- mlogit(y ~ 1 | col1, data = mldata)
mlogit.col2 <- mlogit(y ~ 1 | col2, data = mldata)
```
# MLOGIT
```{r echo = FALSE, message = TRUE, error = TRUE, warning = FALSE, results = 'asis'}
library(stargazer)
stargazer(mlogit.model1, type = "html")
stargazer(mlogit.col1,
mlogit.col2,
type = "html",
omit=c("1:col1","2:col1","1:col2","2:col2"))
```
Result:
Note that the second image omits 1:col1, 2:col1, 1:col2 and 2:col2.

Why aren't these models produced in Caret identical?

I'm really trying to understand why two pieces of code don't produce identical models. To create the first neural network (NN1), I used cross-validation in the train function of the caret package (code below) to find the best parameters. Page 2 of the package's vignette suggests that it will "Fit the final model to all the training data using the optimal parameter set".
So in the code below I expect NN1 to reflect the full training set with the best parameters which happen to be size=5 and decay=0.1.
My plan was to use the parameters from this step to create a model to put into production using the combined training and test data. Before I created this production model, I wanted to make sure I was using the output from the train function properly.
So I created a second model (NN2) with the train function but without tuning. Instead I specified the parameters size=5 and decay=0.1. With the same data, same parameters (and same seed), I expected identical models, but they are not. Why aren't these models identical?
# Create some data
library(caret)
set.seed(2)
xy <- data.frame(Response=factor(sample(c("Y","N"),534,replace = TRUE,prob=c(0.5,0.5))),
                 GradeGroup=factor(sample(c("G1","G2","G3"),534,replace=TRUE,prob=c(0.4,0.3,0.3))),
                 Sibling=sample(c(TRUE,FALSE),534,replace=TRUE,prob=c(0.3,0.7)),
                 Dist=rnorm(534))
xyTrain <- xy[1:360,]
xyTest <- xy[361:534,]
# Create NN1 using cross-validation
tc <- trainControl(method="cv", number = 10, savePredictions = TRUE, classProbs = TRUE)
set.seed(2)
NN1 <- train(Response~.,data=xyTrain,
method="nnet",
trControl=tc,
verbose=FALSE,
metric="Accuracy")
# Create NN2 using parameters from NN1
fitControl <- trainControl(method="none", classProbs = TRUE)
set.seed(2)
NN2 <- train(Response~.,data=xyTrain,
method="nnet",
trControl=fitControl,
verbose=FALSE,
tuneGrid=data.frame(size=NN1$bestTune[[1]],decay=NN1$bestTune[[2]]),
metric="Accuracy")
Here are the results
> # Parameters of NN1
> NN1$bestTune
size decay
1 1 0
>
> # Code to show results of NN1 and NN2 differ
> testFitted <- data.frame(fitNN1=NN1$finalModel$fitted.values,
+ fitNN2=NN2$finalModel$fitted.values)
>
> testPred<-data.frame(predNN1=predict(NN1,xyTest,type="prob")$Y,
+ predNN2=predict(NN2,xyTest,type="prob")$Y)
> # Fitted values are different
> head(testFitted)
fitNN1 fitNN2
X1 0.4824096 0.4834579
X2 0.4673498 0.4705441
X3 0.4509407 0.4498603
X4 0.4510129 0.4498710
X5 0.4690963 0.4753655
X6 0.4509160 0.4498539
> # Predictions on test set are different
> head(testPred)
predNN1 predNN2
1 0.4763952 0.4784981
2 0.4509160 0.4498539
3 0.5281298 0.5276355
4 0.4512930 0.4498993
5 0.4741959 0.4804776
6 0.4509335 0.4498589
>
> # Accuracy of predictions are different
> sum(predict(NN1,xyTest,type="raw")==xyTest$Response)/nrow(xyTest)
[1] 0.4655172
> sum(predict(NN2,xyTest,type="raw")==xyTest$Response)/nrow(xyTest)
[1] 0.4597701
>
> # Summary of models
> summary(NN1)
a 4-1-1 network with 7 weights
options were - entropy fitting
b->h1 i1->h1 i2->h1 i3->h1 i4->h1
-8.38 6.58 5.51 -9.50 1.06
b->o h1->o
-0.20 1.39
> summary(NN2)
a 4-1-1 network with 7 weights
options were - entropy fitting
b->h1 i1->h1 i2->h1 i3->h1 i4->h1
10.94 -8.27 -7.36 8.50 -0.76
b->o h1->o
3.15 -3.35
I believe this has to do with the random seed. When you do cross-validation you are fitting many models starting from that seed (set.seed(2)). The final model is fit with the same parameters, but the RNG state at the moment the final model is fit after cross-validation is not the same as when you fit that final model with those parameters yourself. You see this here because the starting weights in each neural network call (nnet) are randomly generated each time.
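A quick way to see this, as a sketch using the xyTrain data from the question: fit nnet directly with the same parameters, once letting the RNG state drift and once resetting it, and compare the weights.
# hypothetical sketch: nnet's random starting weights depend only on the RNG state at fit time
library(nnet)
set.seed(2)
m1 <- nnet(Response ~ ., data = xyTrain, size = 1, decay = 0, trace = FALSE)
m2 <- nnet(Response ~ ., data = xyTrain, size = 1, decay = 0, trace = FALSE)  # RNG has moved on
set.seed(2)
m3 <- nnet(Response ~ ., data = xyTrain, size = 1, decay = 0, trace = FALSE)  # RNG reset first
all.equal(m1$wts, m2$wts)  # generally not TRUE: different random starting weights
all.equal(m1$wts, m3$wts)  # TRUE: identical RNG state, identical fit
If you need the whole train() run to be reproducible, trainControl() also has a seeds argument for controlling the seeds used at each resampling step.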
