I have a multinomial logit model with two individual specific variables (first and age).
I would like to conduct the hmftest to check if the IIA holds.
My dataset looks like this:
head(df)
mode choice first age
1 both 1 0 24
2 pre 1 1 23
3 both 1 2 53
4 post 1 3 43
5 no 1 1 55
6 both 1 2 63
I adjusted it for the mlogit to:
mode choice first age idx
1 TRUE 1 0 24 1:both
2 FALSE 1 0 24 1:no
3 FALSE 1 0 24 1:post
4 FALSE 1 0 24 1:pre
5 FALSE 1 1 23 2:both
6 FALSE 1 1 23 2:no
7 FALSE 1 1 23 2:post
8 TRUE 1 1 23 2:pre
9 TRUE 1 2 53 3:both
10 FALSE 1 2 53 3:no
~~~ indexes ~~~~
id1 id2
1 1 both
2 1 no
3 1 post
4 1 pre
5 2 both
6 2 no
7 2 post
8 2 pre
9 3 both
10 3 no
indexes: 1, 2
My original (full) model runs as follows:
full <- mlogit(mode ~ 0 | first + age, data = df_mlogit, reflevel = "no")
leading to the following result:
Call:
mlogit(formula = mode ~ 0 | first + age, data = df_mlogit, reflevel = "no",
method = "nr")
Frequencies of alternatives:choice
no both post pre
0.2 0.4 0.2 0.2
nr method
18 iterations, 0h:0m:0s
g'(-H)^-1g = 8.11E-07
gradient close to zero
Coefficients :
Estimate Std. Error z-value Pr(>|z|)
(Intercept):both 2.0077e+01 1.0441e+04 0.0019 0.9985
(Intercept):post -4.1283e-01 1.4771e+04 0.0000 1.0000
(Intercept):pre 5.3346e-01 1.4690e+04 0.0000 1.0000
first1:both -4.0237e+01 1.1059e+04 -0.0036 0.9971
first1:post -8.9168e-01 1.4771e+04 -0.0001 1.0000
first1:pre -6.6805e-01 1.4690e+04 0.0000 1.0000
first2:both -1.9674e+01 1.0441e+04 -0.0019 0.9985
first2:post -1.8975e+01 1.5683e+04 -0.0012 0.9990
first2:pre -1.8889e+01 1.5601e+04 -0.0012 0.9990
first3:both -2.1185e+01 1.1896e+04 -0.0018 0.9986
first3:post 1.9200e+01 1.5315e+04 0.0013 0.9990
first3:pre 1.9218e+01 1.5237e+04 0.0013 0.9990
age:both 2.1898e-02 2.9396e-02 0.7449 0.4563
age:post 9.3377e-03 2.3157e-02 0.4032 0.6868
age:pre -1.2338e-02 2.2812e-02 -0.5408 0.5886
Log-Likelihood: -61.044
McFadden R^2: 0.54178
Likelihood ratio test : chisq = 144.35 (p.value = < 2.22e-16)
To test for IIA, I exclude one alternative from the model (here "pre") and run the model as follows:
part <- mlogit(mode ~ 0 | first + age, data = df_mlogit, reflevel = "no",
alt.subset = c("no", "post", "both"))
leading to
Call:
mlogit(formula = mode ~ 0 | first + age, data = df_mlogit, alt.subset = c("no",
"post", "both"), reflevel = "no", method = "nr")
Frequencies of alternatives:choice
no both post
0.25 0.50 0.25
nr method
18 iterations, 0h:0m:0s
g'(-H)^-1g = 6.88E-07
gradient close to zero
Coefficients :
Estimate Std. Error z-value Pr(>|z|)
(Intercept):both 1.9136e+01 6.5223e+03 0.0029 0.9977
(Intercept):post -9.2040e-01 9.2734e+03 -0.0001 0.9999
first1:both -3.9410e+01 7.5835e+03 -0.0052 0.9959
first1:post -9.3119e-01 9.2734e+03 -0.0001 0.9999
first2:both -1.8733e+01 6.5223e+03 -0.0029 0.9977
first2:post -1.8094e+01 9.8569e+03 -0.0018 0.9985
first3:both -2.0191e+01 1.1049e+04 -0.0018 0.9985
first3:post 2.0119e+01 1.1188e+04 0.0018 0.9986
age:both 2.1898e-02 2.9396e-02 0.7449 0.4563
age:post 1.9879e-02 2.7872e-02 0.7132 0.4757
Log-Likelihood: -27.325
McFadden R^2: 0.67149
Likelihood ratio test : chisq = 111.71 (p.value = < 2.22e-16)
However when I want to codnuct the hmftest then the following error occurs:
> hmftest(full, part)
Error in solve.default(diff.var) :
system is computationally singular: reciprocal condition number = 4.34252e-21
Does anyone have an idea where the problem might be?
I believe the issue here could be that the hmftest checks if the probability ratio of two alternatives depends only on the characteristics of these alternatives.
Since there are only individual-level variables here, the test won't work in this case.
Working with R, I performed a KNN Algorithm knn <- train(x = x_train, y = y_train, method = "knn") with this dataframe:
1 0.35955056 0.62068966 0.34177215 0.27 0.7260274 0.22 MIT
2 0.59550562 0.56321839 0.35443038 0.15 0.7260274 0.22 MIT
3 0.52808989 0.35632184 0.45569620 0.13 0.7397260 0.22 NUC
4 0.34831461 0.35632184 0.34177215 0.54 0.6575342 0.22 MIT
5 0.44943820 0.31034483 0.44303797 0.17 0.6712329 0.22 CYT
6 0.43820225 0.47126437 0.34177215 0.65 0.7260274 0.22 MIT
7 0.41573034 0.36781609 0.48101266 0.20 0.7945205 0.34 NUC
8 0.49438202 0.42528736 0.56962025 0.36 0.6712329 0.22 MIT
9 0.32584270 0.29885057 0.49367089 0.15 0.7945205 0.30 CYT
10 0.35955056 0.29885057 0.41772152 0.21 0.7260274 0.27 NU
...
Obtaining this result:
k-Nearest Neighbors
945 samples
6 predictor
8 classes: 'CYT', 'ERL', 'EXC', 'ME', 'MIT', 'NUC', 'POX', 'VAC'
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 945, 945, 945, 945, 945, 945, ...
Resampling results across tuning parameters:
k Accuracy Kappa
5 0.5273630 0.3760233
7 0.5480598 0.4004283
9 0.5667651 0.4242597
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 9.
After that, I wanted to do a confusion matrix with this code:
predictions <- predict(knn, x_test)
results <- data.frame(real = y_test, predicted = predictions)
attach(results)
confusionMatrix(real, predicted)
And I got this results:
Confusion Matrix and Statistics
Reference
Prediction CYT ERL EXC ME MIT NUC POX VAC
CYT 73 0 0 3 7 44 0 0
ERL 0 0 0 1 0 0 0 0
EXC 2 0 6 3 1 0 0 0
ME 5 0 1 68 2 11 0 0
MIT 19 0 0 8 44 14 0 0
NUC 57 0 0 6 8 74 0 0
POX 3 0 0 0 1 2 0 0
VAC 3 0 2 2 1 1 0 0
Overall Statistics
Accuracy : 0.5614
95% CI : (0.5153, 0.6068)
No Information Rate : 0.3432
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.417
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: CYT Class: ERL Class: EXC Class: ME Class: MIT Class: NUC Class: POX Class: VAC
Sensitivity 0.4506 NA 0.66667 0.7473 0.68750 0.5068 NA NA
Specificity 0.8258 0.997881 0.98704 0.9501 0.89951 0.7822 0.98729 0.98093
Pos Pred Value 0.5748 NA 0.50000 0.7816 0.51765 0.5103 NA NA
Neg Pred Value 0.7420 NA 0.99348 0.9403 0.94832 0.7798 NA NA
Prevalence 0.3432 0.000000 0.01907 0.1928 0.13559 0.3093 0.00000 0.00000
Detection Rate 0.1547 0.000000 0.01271 0.1441 0.09322 0.1568 0.00000 0.00000
Detection Prevalence 0.2691 0.002119 0.02542 0.1843 0.18008 0.3072 0.01271 0.01907
Balanced Accuracy 0.6382 NA 0.82685 0.8487 0.79350 0.6445 NA NA
I would like to know why I have got this NAs in my sensibility in the class ERL for example.
Did I do something wrong ?
What is the reason of these NAs. I can provided the completed dataframe if necessary.
Based on the confusion matrix, your prediction set is lacking data with the classification of ERL, POX, and VOC which is leading to the NAs in the Statistics by Class.
Take a look at the Sensitivity of Class ERL for example. Sensitivity, also called the True Positive Rate, is calculated as the number of correct positive predictions divided by the total number of positives.
Positive ERL Predictions = 0
Actual ERL Classifications = 0
Sensitivity ERL = 0/0 which leads to the NA.
How can I get starting values?
I think there is a problem with the slope
I know the model is modified hill equation
logFC_max is upper bound(5) - lower bound(-1)
Smax is the maximum sensitivity(1/EC50) of the gene
tmax is the point in time with maximum sensitivity
Sdur represents a measure of the duration of the sensitivity interval.
head(ctr)
t c y logy logc logt
1 1 0 100 4.60517 -Inf 0.0000000
2 1 0 100 4.60517 -Inf 0.0000000
3 1 0 100 4.60517 -Inf 0.0000000
4 2 0 100 4.60517 -Inf 0.6931472
5 2 0 100 4.60517 -Inf 0.6931472
6 2 0 100 4.60517 -Inf 0.6931472
(time <- unique(ctr$t))
[1] 1 2 4 6 8 24 30 48 54 72
fo <- logy ~ logFC_max / (1+exp(-slope*(log(c)-log(1/(s_max*exp(-0.5*(log(t)-log(t_max))/s_dur)^2)))))
ctr_nls <- nls(fo,
data = ctr,
start = list(logFC_max = 6, slope = -10, s_max = 4.619, t_max = 72, s_dur = 8))
Error in numericDeriv(form[[3L]], names(ind), env) :
In addition: Warning message:
In log(1/(s_max * exp(-0.5 * ((log(t) - log(t_max))/s_dur)^2))) :
NaNs produced
I am running caret's neural network on the Bike Sharing dataset and I get the following error message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
: There were missing values in resampled performance measures.
I am not sure what the problem is. Can anyone help please?
The dataset is from:
https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset
Here is the coding:
library(caret)
library(bestNormalize)
data_hour = read.csv("hour.csv")
# Split dataset
set.seed(3)
split = createDataPartition(data_hour$casual, p=0.80, list=FALSE)
validation = data_hour[-split,]
dataset = data_hour[split,]
dataset = dataset[,c(-1,-2,-4)]
# View strucutre of data
str(dataset)
# 'data.frame': 13905 obs. of 14 variables:
# $ season : int 1 1 1 1 1 1 1 1 1 1 ...
# $ mnth : int 1 1 1 1 1 1 1 1 1 1 ...
# $ hr : int 1 2 3 5 8 10 11 12 14 15 ...
# $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
# $ weekday : int 6 6 6 6 6 6 6 6 6 6 ...
# $ workingday: int 0 0 0 0 0 0 0 0 0 0 ...
# $ weathersit: int 1 1 1 2 1 1 1 1 2 2 ...
# $ temp : num 0.22 0.22 0.24 0.24 0.24 0.38 0.36 0.42 0.46 0.44 ...
# $ atemp : num 0.273 0.273 0.288 0.258 0.288 ...
# $ hum : num 0.8 0.8 0.75 0.75 0.75 0.76 0.81 0.77 0.72 0.77 ...
# $ windspeed : num 0 0 0 0.0896 0 ...
# $ casual : int 8 5 3 0 1 12 26 29 35 40 ...
# $ registered: int 32 27 10 1 7 24 30 55 71 70 ...
# $ cnt : int 40 32 13 1 8 36 56 84 106 110 ...
## transform numeric data to Guassian
dataset_selected = dataset[,c(-13,-14)]
for (i in 8:12) { dataset_selected[,i] = predict(boxcox(dataset_selected[,i] +0.1))}
# View transformed dataset
str(dataset_selected)
#'data.frame': 13905 obs. of 12 variables:
#' $ season : int 1 1 1 1 1 1 1 1 1 1 ...
#' $ mnth : int 1 1 1 1 1 1 1 1 1 1 ...
#' $ hr : int 1 2 3 5 8 10 11 12 14 15 ...
#' $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
#' $ weekday : int 6 6 6 6 6 6 6 6 6 6 ...
#' $ workingday: int 0 0 0 0 0 0 0 0 0 0 ...
#' $ weathersit: int 1 1 1 2 1 1 1 1 2 2 ...
#' $ temp : num -1.47 -1.47 -1.35 -1.35 -1.35 ...
#' $ atemp : num -1.18 -1.18 -1.09 -1.27 -1.09 ...
#' $ hum : num 0.899 0.899 0.637 0.637 0.637 ...
#' $ windspeed : num -1.8 -1.8 -1.8 -0.787 -1.8 ...
#' $ casual : num -0.361 -0.588 -0.81 -1.867 -1.208 ...
# Train data with Neural Network model from caret
control = trainControl(method = 'repeatedcv', number = 10, repeats =3)
metric = 'RMSE'
set.seed(3)
fit = train(casual ~., data = dataset_selected, method = 'nnet', metric = metric, trControl = control, trace = FALSE)
Thanks for your help!
phivers comment is spot on, however I would still like to provide a more verbose answer on this concrete example.
In order to investigate what is going on in more detail one should add the argument savePredictions = "all" to trainControl:
control = trainControl(method = 'repeatedcv',
number = 10,
repeats = 3,
returnResamp = "all",
savePredictions = "all")
metric = 'RMSE'
set.seed(3)
fit = train(casual ~.,
data = dataset_selected,
method = 'nnet',
metric = metric,
trControl = control,
trace = FALSE,
form = "traditional")
now when running:
fit$results
#output
size decay RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 1 0e+00 0.9999205 NaN 0.8213177 0.009655872 NA 0.007919575
2 1 1e-04 0.9479487 0.1850270 0.7657225 0.074211541 0.20380571 0.079640883
3 1 1e-01 0.8801701 0.3516646 0.6937938 0.074484860 0.20787440 0.077960642
4 3 0e+00 0.9999205 NaN 0.8213177 0.009655872 NA 0.007919575
5 3 1e-04 0.9272942 0.2482794 0.7434689 0.091409600 0.24363651 0.098854133
6 3 1e-01 0.7943899 0.6193242 0.5944279 0.011560524 0.03299137 0.013002708
7 5 0e+00 0.9999205 NaN 0.8213177 0.009655872 NA 0.007919575
8 5 1e-04 0.8811411 0.3621494 0.6941335 0.092169810 0.22980560 0.098987058
9 5 1e-01 0.7896507 0.6431808 0.5870894 0.009947324 0.01063359 0.009121535
we notice the problem occurs when decay = 0.
lets filter the observations and predictions for decay = 0
library(tidyverse)
fit$pred %>%
filter(decay == 0) -> for_r2
var(for_r2$pred)
#output
0
we can observe that all of the predictions when decay == 0 are the same (have zero variance). The model exclusively predicts 0:
unique(for_r2$pred)
#output
0
So when the summary function tries to predict R squared:
caret::R2(for_r2$obs, for_r2$pred)
#output
[1] NA
Warning message:
In cor(obs, pred, use = ifelse(na.rm, "complete.obs", "everything")) :
the standard deviation is zero
Answer by #topepo (Caret package main developer). See detailed Github thread here.
It looks like it happens when you have one hidden unit and almost no
regularization. What is happening is that the model is predicting a
value very close to a constant (so that the RMSE is a little worse
than the basic st deviation of the outcome):
> ANN_cooling_fit$resample %>% dplyr::filter(is.na(Rsquared))
RMSE Rsquared MAE size decay Resample
1 8.414010 NA 6.704311 1 0e+00 Fold04.Rep01
2 8.421244 NA 6.844363 1 0e+00 Fold01.Rep03
3 7.855925 NA 6.372947 1 1e-04 Fold10.Rep07
4 7.963816 NA 6.428947 1 0e+00 Fold07.Rep09
5 8.492898 NA 6.901842 1 0e+00 Fold09.Rep09
6 7.892527 NA 6.479474 1 0e+00 Fold10.Rep10
> sd(mydata$V7)
[1] 7.962888
So it's nothing to really worry about; just some parameters that do very poorly.
The answer by #missuse is already very insightful to understand why this error happens.
So I just want to add some straightforward ways how to get rid of this error.
If in some cross-validation folds the predictions get zero variance, the model didn't converge. In such cases, you can try the neuralnet package which offers two parameters you can tune:
threshold : default value = 0.01. Set it to 0.3 and then try lower values 0.2, 0.1, 0.05.
stepmax : default value = 1e+05. Set it to 1e+08 and then try lower values 1e+07, 1e+06.
In most cases, it is sufficient to change the threshold parameter like this:
model.nn <- caret::train(formula1,
method = "neuralnet",
data = training.set[,],
# apply preProcess within cross-validation folds
preProcess = c("center", "scale"),
trControl = trainControl(method = "repeatedcv",
number = 10,
repeats = 3),
threshold = 0.3
)
I am attempting to fit an analytical model to my observed data on human somatic growth. The generalized model is a 3-parameter logarithmic growth curve where:
s= B0+B1*t+B2*log(t),
where s is a growth parameter whether it be a bone length or stature and t is an age.
I have attempted to run through various iterations of code to both form a likelihood function and and maximize teh returned parameters. To be completely honest, I am totally unsure if I am writing this correctly, but the most recent iteration of my attempts looks as such using a dataframe title "cedar":
cedar.f<-function(b){sum((cedar$FLEN~b[0]+b[1]*cedar$MINAGE+b[2]*log(cedar$MINAGE))^2)}
cedar.optimx<-optimx(c(0,0,0), fn = cedar.f, control = list(all.methods=TRUE, save.failures=TRUE, maxit=5000, maximize=TRUE))
cedar$MINAGE=c(2.5,0.5,6,0.01,0.01,0.01,1,0.01,12,0.01,0.01,1,12,1,4.5,1,4.5,0.01,7.8,11,4,7.5,9,0.25,0.01,0.01,0.01,0.1,1,1,0.01,0.01)
cedar$FLEN=c(167,150,300,54,60,78,152,72, 452,84,81,163,450,137,271,151,261,73,330,378,226,319,378,89,78,89,65,81,144,122, 50,55
Above, I have provided my attempt at the function and the optimization. I have received several errors in my attempts at doing this including:
argument "cedar" is missing ,no default
non-numeric argument to binary operator
Really, I am here to ask what recommendations anyone may have in writing this function so as to maximize the best fit of the data to the analytical human growth curve. If I am going about this all wrong / writing the function wrong, any help would be appreciated.
Thank you all.
cedar <- data.frame(MINAGE=c(2.5,0.5,6,0.01,0.01,0.01,1,0.01,12,0.01,0.01,1,12,1,4.5,1,4.5,0.01,7.8,11,4,7.5,9,0.25,0.01,0.01,0.01,0.1,1,1,0.01,0.01),
FLEN=c(167,150,300,54,60,78,152,72,452,84,81,163,450,137,271,151,261,73,330,378,226,319,378,89,78,89,65,81,144,122,50,55))
# Sum of squared errors
# Minus sign is for function minimization
cedar.f <- function(b) {
-sum( (cedar$FLEN - (b[1] + b[2]*cedar$MINAGE + b[3]*log(cedar$MINAGE)))^2 )
}
library(optimx)
cedar.optimx <- optimx( c(1,1,1), fn = cedar.f,
control = list(all.methods=TRUE, save.failures=TRUE, maxit=5000, maximize=TRUE))
# p1 p2 p3 value fevals gevals niter convcode kkt1 kkt2 xtimes
# BFGS 120.4565 24.41910 11.25419 -7.674935e+03 25 8 NA 0 TRUE TRUE 0.00
# CG 120.4565 24.41910 11.25419 -7.674935e+03 1072 298 NA 0 TRUE TRUE 0.15
# Nelder-Mead 120.4714 24.41647 11.25186 -7.674947e+03 258 NA NA 0 TRUE TRUE 0.02
# L-BFGS-B 120.4565 24.41910 11.25419 -7.674935e+03 17 17 NA 0 TRUE TRUE 0.01
# nlm 120.4564 24.41910 11.25417 -7.674935e+03 NA NA 12 0 TRUE TRUE 0.01
# nlminb 120.4565 24.41910 11.25419 -7.674935e+03 21 48 13 0 TRUE TRUE 0.02
# spg 120.4565 24.41910 11.25419 -7.674935e+03 99 NA 92 0 TRUE TRUE 0.06
# ucminf 120.4564 24.41910 11.25417 -7.674935e+03 10 10 NA 0 TRUE TRUE 0.00
# Rcgmin NA NA NA -8.988466e+307 NA NA NA 9999 NA NA 0.00
# Rvmmin NA NA NA -8.988466e+307 NA NA NA 9999 NA NA 0.00
# newuoa 120.4565 24.41910 11.25419 -7.674935e+03 118 NA NA 0 TRUE TRUE 0.01
# bobyqa 120.4565 24.41910 11.25419 -7.674935e+03 142 NA NA 0 TRUE TRUE 0.02
# nmkb 120.4566 24.41907 11.25421 -7.674935e+03 213 NA NA 0 TRUE TRUE 0.03
# hjkb 1.0000 1.00000 1.00000 -1.363103e+06 1 NA 0 9999 NA NA 0.00
Alternatively, model coefficients can be estimated using a simple linear model:
fitlm <- lm(FLEN~MINAGE+log(MINAGE), data=cedar)
coef(fitlm)
# Intercept) MINAGE log(MINAGE)
# 120.45654 24.41910 11.25419
The estimated function can be plotted as follows:
optpar <- as.matrix(cedar.optimx[1,1:3])
estim_fun <- function(x, b=optpar) {
b[1] + b[2]*x + b[3]*log(x)
}
curve(estim_fun, from=min(cedar$MINAGE), to=max(cedar$MINAGE))