When trying to fit models to predict the outcome "death", I am getting 100% accuracy, which is obviously wrong. Could someone tell me what I am missing?
library(caret)
set.seed(100)
intrain <- createDataPartition(riskFinal$death,p=0.6, list=FALSE)
training_Score <- riskFinal[intrain,]
testing_Score <- riskFinal[-intrain,]
control <- trainControl(method="repeatedcv", repeats=3, number=5)
#C5.0 decision tree
set.seed(100)
modelC50 <- train(death~., data=training_Score, method="C5.0",trControl=control)
summary(modelC50)
#Call:
#C5.0.default(x = structure(c(3, 4, 2, 30, 4, 12, 156, 0.0328767150640488, 36, 0.164383560419083, 22, ...  [long data dump truncated]
#C5.0 [Release 2.07 GPL Edition] Tue Aug 4 10:23:10 2015
#-------------------------------
#Class specified by attribute `outcome'
#Read 27875 cases (23 attributes) from undefined.data
#21 attributes winnowed
#Estimated importance of remaining attributes:
#-2147483648% no.subjective.fevernofever
#Rules:
#Rule 1: (26982, lift 1.0)
# no.subjective.fevernofever <= 0
# -> class no [1.000]
#Rule 2: (893, lift 31.2)
# no.subjective.fevernofever > 0
# -> class yes [0.999]
#Default class: no
#Evaluation on training data (27875 cases):
# Rules
# ----------------
# No Errors
# 2 0( 0.0%) <<
# (a) (b) <-classified as
# ---- ----
# 26982 (a): class no
# 893 (b): class yes
# Attribute usage:
# 100.00% no.subjective.fevernofever
#Time: 0.1 secs
predictC50 <- predict(modelC50, testing_Score)
confusionMatrix(predictC50, testing_Score$death)
#Confusion Matrix and Statistics
# Reference
#Prediction no yes
# no 17988 0
# yes 0 595
# Accuracy : 1
# 95% CI : (0.9998, 1)
# No Information Rate : 0.968
# P-Value [Acc > NIR] : < 2.2e-16
# Kappa : 1
# Mcnemar's Test P-Value : NA
# Sensitivity : 1.000
# Specificity : 1.000
# Pos Pred Value : 1.000
# Neg Pred Value : 1.000
# Prevalence : 0.968
# Detection Rate : 0.968
# Detection Prevalence : 0.968
# Balanced Accuracy : 1.000
# 'Positive' Class : no
For the Random Forest model:
set.seed(100)
modelRF <- train(death~., data=training_Score, method="rf", trControl=control)
predictRF <- predict(modelRF,testing_Score)
confusionMatrix(predictRF, testing_Score$death)
#Confusion Matrix and Statistics
#
# Reference
#Prediction no yes
# no 17988 0
# yes 0 595
# Accuracy : 1
# 95% CI : (0.9998, 1)
# No Information Rate : 0.968
# P-Value [Acc > NIR] : < 2.2e-16
# Kappa : 1
# Mcnemar's Test P-Value : NA
# Sensitivity : 1.000
# Specificity : 1.000
# Pos Pred Value : 1.000
# Neg Pred Value : 1.000
# Prevalence : 0.968
# Detection Rate : 0.968
# Detection Prevalence : 0.968
# Balanced Accuracy : 1.000
# 'Positive' Class : no
predictRFprobs <- predict(modelRF, testing_Score, type = "prob")
For the Logit model:
set.seed(100)
modelLOGIT <- train(death~., data=training_Score,method="glm",family="binomial", trControl=control)
summary(modelLOGIT)
#Call:
#NULL
#Deviance Residuals:
# Min 1Q Median 3Q Max
#-2.409e-06 -2.409e-06 -2.409e-06 -2.409e-06 2.409e-06
#Coefficients:
# Estimate Std. Error z value Pr(>|z|)
#(Intercept) -2.657e+01 7.144e+04 0.000 1.000
#age.in.months 3.554e-15 7.681e+01 0.000 1.000
#temp -1.916e-13 1.885e+03 0.000 1.000
#genderfemale 3.644e-14 4.290e+03 0.000 1.000
#no.subjective.fevernofever 5.313e+01 1.237e+04 0.004 0.997
#palloryes -1.156e-13 4.747e+03 0.000 1.000
#jaundiceyes -2.330e-12 1.142e+04 0.000 1.000
#vomitingyes 1.197e-13 4.791e+03 0.000 1.000
#diarrheayes -3.043e-13 4.841e+03 0.000 1.000
#dark.urineyes -6.958e-13 1.037e+04 0.000 1.000
#intercostal.retractionyes 2.851e-13 1.003e+04 0.000 1.000
#subcostal.retractionyes 7.414e-13 1.012e+04 0.000 1.000
#wheezingyes -1.756e-12 1.091e+04 0.000 1.000
#rhonchiyes -1.659e-12 1.074e+04 0.000 1.000
#difficulty.breathingyes 4.496e-13 6.504e+03 0.000 1.000
#deep.breathingyes 1.086e-12 7.075e+03 0.000 1.000
#convulsionsyes -1.294e-12 6.424e+03 0.000 1.000
#lethargyyes -4.338e-13 6.188e+03 0.000 1.000
#unable.to.sityes -4.284e-13 8.118e+03 0.000 1.000
#unable.to.drinkyes 7.297e-13 6.507e+03 0.000 1.000
#altered.consciousnessyes 2.907e-12 1.071e+04 0.000 1.000
#unconsciousnessyes 2.868e-11 1.505e+04 0.000 1.000
#meningeal.signsyes -1.177e-11 1.570e+04 0.000 1.000
#(Dispersion parameter for binomial family taken to be 1)
# Null deviance: 7.9025e+03 on 27874 degrees of freedom
#Residual deviance: 1.6172e-07 on 27852 degrees of freedom
#AIC: 46
#Number of Fisher Scoring iterations: 25
predictLOGIT <- predict(modelLOGIT, testing_Score)
confusionMatrix(predictLOGIT, testing_Score$death)
#Confusion Matrix and Statistics
# Reference
#Prediction no yes
# no 17988 0
# yes 0 595
# Accuracy : 1
# 95% CI : (0.9998, 1)
# No Information Rate : 0.968
# P-Value [Acc > NIR] : < 2.2e-16
# Kappa : 1
# Mcnemar's Test P-Value : NA
# Sensitivity : 1.000
# Specificity : 1.000
# Pos Pred Value : 1.000
# Neg Pred Value : 1.000
# Prevalence : 0.968
# Detection Rate : 0.968
# Detection Prevalence : 0.968
# Balanced Accuracy : 1.000
# 'Positive' Class : no
The data before slicing was:
str(riskFinal)
#'data.frame': 46458 obs. of 23 variables:
# $ age.in.months : num 3 3 4 2 1.16 ...
# $ temp : num 35.5 39.4 36.8 35.2 35 34.3 37.2 35.2 34.6 35.3 ...
# $ gender : Factor w/ 2 levels "male","female": 1 2 2 2 1 1 1 2 1 1 ...
# $ no.subjective.fever : Factor w/ 2 levels "fever","nofever": 1 1 2 2 1 1 2 2 2 1 ...
# $ pallor : Factor w/ 2 levels "no","yes": 2 2 1 1 2 2 2 1 2 2 ...
# $ jaundice : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 2 ...
# $ vomiting : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 2 1 1 ...
# $ diarrhea : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
# $ dark.urine : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 2 ...
# $ intercostal.retraction: Factor w/ 2 levels "no","yes": 2 2 2 1 2 2 2 2 1 2 ...
# $ subcostal.retraction : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 1 1 ...
# $ wheezing : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
# $ rhonchi : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
# $ difficulty.breathing : Factor w/ 2 levels "no","yes": 2 2 1 2 2 2 1 1 1 2 ...
# $ deep.breathing : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 1 2 ...
# $ convulsions : Factor w/ 2 levels "no","yes": 1 2 1 1 2 2 2 1 2 2 ...
# $ lethargy : Factor w/ 2 levels "no","yes": 2 2 2 1 2 2 2 2 2 2 ...
# $ unable.to.sit : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
# $ unable.to.drink : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
# $ altered.consciousness : Factor w/ 2 levels "no","yes": 2 2 2 1 2 2 2 2 2 2 ...
# $ unconsciousness : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
# $ meningeal.signs : Factor w/ 2 levels "no","yes": 1 2 2 1 1 2 1 2 2 1 ...
# $ death : Factor w/ 2 levels "no","yes": 1 1 2 2 1 1 2 2 2 1 ...
EDIT: Based on the comments, I realized that the no.subjective.fever variable had exactly the same values as the target variable death, so I excluded it from the model. Then I got even stranger results:
RANDOM FOREST
set.seed(100)
nmodelRF<- train(death~.-no.subjective.fever, data=training_Score, method="rf", trControl=control)
summary(nmodelRF)
npredictRF<-predict(nmodelRF,testing_Score)
> confusionMatrix(npredictRF, testing_Score$death)
# Confusion Matrix and Statistics
#
# Reference
# Prediction no yes
# no 17988 595
# yes 0 0
#
# Accuracy : 0.968
# 95% CI : (0.9653, 0.9705)
# No Information Rate : 0.968
# P-Value [Acc > NIR] : 0.5109
#
# Kappa : 0
# Mcnemar's Test P-Value : <2e-16
#
# Sensitivity : 1.000
# Specificity : 0.000
# Pos Pred Value : 0.968
# Neg Pred Value : NaN
# Prevalence : 0.968
# Detection Rate : 0.968
# Detection Prevalence : 1.000
# Balanced Accuracy : 0.500
#
# 'Positive' Class : no
Logit
set.seed(100)
nmodelLOGIT<- train(death~.-no.subjective.fever, data=training_Score,method="glm",family="binomial", trControl=control)
>summary(nmodelLOGIT)
# Call:
# NULL
#
# Deviance Residuals:
# Min 1Q Median 3Q Max
# -1.5113 -0.2525 -0.2041 -0.1676 3.1698
#
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 2.432065 1.084942 2.242 0.024984 *
#age.in.months -0.001047 0.001293 -0.810 0.417874
#temp -0.168704 0.028815 -5.855 4.78e-09 ***
#genderfemale -0.053306 0.070468 -0.756 0.449375
#palloryes 0.282123 0.076518 3.687 0.000227 ***
#jaundiceyes 0.323755 0.144607 2.239 0.025165 *
#vomitingyes -0.533661 0.082948 -6.434 1.25e-10 ***
#diarrheayes -0.040272 0.080417 -0.501 0.616520
#dark.urineyes -0.583666 0.168787 -3.458 0.000544 ***
#intercostal.retractionyes -0.021717 0.129607 -0.168 0.866926
#subcostal.retractionyes 0.269588 0.128772 2.094 0.036301 *
#wheezingyes -0.587940 0.150475 -3.907 9.34e-05 ***
#rhonchiyes -0.008565 0.140095 -0.061 0.951249
#difficulty.breathingyes 0.397394 0.087789 4.527 5.99e-06 ***
#deep.breathingyes 0.399302 0.098761 4.043 5.28e-05 ***
#convulsionsyes 0.132609 0.094038 1.410 0.158491
#lethargyyes 0.338599 0.089934 3.765 0.000167 ***
#unable.to.sityes 0.452111 0.104556 4.324 1.53e-05 ***
#unable.to.drinkyes 0.516878 0.089685 5.763 8.25e-09 ***
#altered.consciousnessyes 0.433672 0.123288 3.518 0.000436 ***
#unconsciousnessyes 0.754012 0.136105 5.540 3.03e-08 ***
#meningeal.signsyes 0.188823 0.161088 1.172 0.241130
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# (Dispersion parameter for binomial family taken to be 1)
#
# Null deviance: 7902.5 on 27874 degrees of freedom
# Residual deviance: 7148.5 on 27853 degrees of freedom
# AIC: 7192.5
#
# Number of Fisher Scoring iterations: 6
npredictLOGIT<-predict(nmodelLOGIT,testing_Score)
>confusionMatrix(npredictLOGIT, testing_Score$death)
# Confusion Matrix and Statistics
#
# Reference
# Prediction no yes
# no 17982 592
# yes 6 3
#
# Accuracy : 0.9678
# 95% CI : (0.9652, 0.9703)
# No Information Rate : 0.968
# P-Value [Acc > NIR] : 0.5605
#
# Kappa : 0.009
# Mcnemar's Test P-Value : <2e-16
#
# Sensitivity : 0.999666
# Specificity : 0.005042
# Pos Pred Value : 0.968127
# Neg Pred Value : 0.333333
# Prevalence : 0.967981
# Detection Rate : 0.967659
# Detection Prevalence : 0.999516
# Balanced Accuracy : 0.502354
#
# 'Positive' Class : no
The 100% accuracy results are almost certainly not correct. They are most likely due to the fact that the target variable (or another variable with essentially the same entries as the target, as pointed out in a comment by @ulfelder) is included in both the training set and the test set. Such columns need to be removed during model building and testing, since they encode the very classification you are trying to predict, whereas the train/test data should only contain information that (hopefully) leads to a correct classification according to the target variable.
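One quick way to screen for such duplicated predictors is to cross-tabulate each factor predictor against the outcome and flag any whose levels map one-to-one onto the outcome classes. A minimal sketch on toy data (the column names here are hypothetical, not taken from riskFinal):

```r
# Toy data frame; "leaky" duplicates the outcome exactly, "pallor" does not
df <- data.frame(
  death  = factor(c("no", "no", "yes", "yes", "no", "yes")),
  leaky  = factor(c("a",  "a",  "b",   "b",   "a",  "b")),
  pallor = factor(c("no", "yes", "no", "yes", "yes", "no"))
)

# For each factor predictor, check whether every level falls into a
# single outcome class (a perfect alias of the target)
preds <- setdiff(names(Filter(is.factor, df)), "death")
is_alias <- sapply(preds, function(p) {
  tab <- table(df[[p]], df$death)
  all(rowSums(tab > 0) == 1)
})
is_alias
# leaky: TRUE (perfect alias of the target), pallor: FALSE
```

Any predictor flagged TRUE deserves a close look before training.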
You could try the following:
target <- riskFinal$death
set.seed(100)
intrain <- createDataPartition(riskFinal$death, p=0.6, list=FALSE)
training_Score <- riskFinal[intrain,]
testing_Score <- riskFinal[-intrain,]
train_target <- training_Score$death
test_target <- testing_Score$death
training_Score <- training_Score[, -which(colnames(training_Score)=="death")]
testing_Score <- testing_Score[, -which(colnames(testing_Score)=="death")]
modelRF <- train(training_Score, train_target, method="rf", trControl=control)
Then you could proceed like you did before, noting that the target "death" is stored in the variables train_target and test_target.
Hope this helps.
I have run this regression without any problems, and I get a coefficient for each interaction between econ_sit and educ_cat. econ_sit is a continuous variable, and educ_cat is a categorical variable from 1 to 6. How can I plot only the coefficients for the interaction terms in a good way?
model_int_f <- felm(satis_gov_sc ~ econ_sit*factor(educ_cat) + factor(benefit) + econ_neth + age + gender + pol_sof
| factor(wave) + factor(id) # Respondent and time fixed effects
| 0
| id, # Cluster standard errors on each respondent
data = full1)
summary(model_int_f)
Call:
felm(formula = satis_gov_sc ~ econ_sit * factor(educ_cat) + factor(benefit) + econ_neth + age + gender + pol_sof | factor(wave) + factor(id) | 0 | id, data = full1)
Residuals:
Min 1Q Median 3Q Max
-0.58468 -0.04464 0.00000 0.04728 0.78470
Coefficients:
Estimate Cluster s.e. t value Pr(>|t|)
econ_sit 0.1411692 0.0603100 2.341 0.01928 *
factor(educ_cat)2 0.0525580 0.0450045 1.168 0.24292
factor(educ_cat)3 0.1229048 0.0576735 2.131 0.03313 *
factor(educ_cat)4 0.1244146 0.0486455 2.558 0.01057 *
factor(educ_cat)5 0.1245556 0.0520246 2.394 0.01669 *
factor(educ_cat)6 0.1570034 0.0577240 2.720 0.00655 **
factor(benefit)2 -0.0030380 0.0119970 -0.253 0.80010
factor(benefit)3 0.0026064 0.0072590 0.359 0.71957
econ_neth 0.0642726 0.0131940 4.871 1.14e-06 ***
age 0.0177453 0.0152661 1.162 0.24512
gender 0.1088780 0.0076137 14.300 < 2e-16 ***
pol_sof 0.0006003 0.0094504 0.064 0.94935
econ_sit:factor(educ_cat)2 -0.0804820 0.0653488 -1.232 0.21816
econ_sit:factor(educ_cat)3 -0.0950652 0.0793818 -1.198 0.23114
econ_sit:factor(educ_cat)4 -0.1259772 0.0692072 -1.820 0.06877 .
econ_sit:factor(educ_cat)5 -0.1469749 0.0654870 -2.244 0.02485 *
econ_sit:factor(educ_cat)6 -0.1166243 0.0693709 -1.681 0.09279 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1161 on 11159 degrees of freedom
(23983 observations deleted due to missingness)
Multiple R-squared(full model): 0.8119 Adjusted R-squared: 0.717
Multiple R-squared(proj model): 0.00657 Adjusted R-squared: -0.4946
F-statistic(full model, *iid*):8.557 on 5630 and 11159 DF, p-value: < 2.2e-16
F-statistic(proj model): 55.38 on 17 and 5609 DF, p-value: < 2.2e-16
This is what my data looks like:
$ id : num 1 1 1 1 2 2 2 2 3 3 3 3
$ wave : chr "2013" "2015" "2016" "2017" ...
$ satis_gov_sc: num 0.5 0.4 0.4 0.6 0.6 0.5 0.6 0.7 0.7 0.7 ...
$ econ_sit : num NA NA 0.708 0.75 0.708 ...
$ educ_cat : num 5 5 5 5 5 6 6 6 6 6 ...
$ benefit : num 3 3 3 3 3 3 3 3 3 3 ...
$ econ_neth : num NA 0.6 0.6 0.7 0.7 0.5 0.4 0.6 0.8 0.7 ...
$ age : num 58 60 61 62 63 51 53 54 55 56 ...
$ gender : num 1 1 1 1 1 1 1 1 1 1 ...
$ pol_sof : num 1 1 1 0.8 1 1 1 1 0.8 1 ...
I've tried to run af simple plot_model with the following code:
plot_model(model_int_f, type = "pred", terms = c("econ_sit", "educ_cat"))
However, I only get an error, because the felm function is not compatible with type = "pred":
Error in UseMethod("predict") :
no applicable method for 'predict' applied to an object of class "felm"
Any suggestions on how to plot the interaction terms?
Thanks in advance!
felm does not have a predict method, so it is not compatible with plot_model. You could use another fixed-effects library instead.
Here's an example using fixest. As you did not provide a sample of your data, I have used data(iris).
library(fixest); library(sjPlot)
res = feols(Sepal.Length ~ Sepal.Width + Petal.Length:Species | Species, cluster='Species', iris)
plot_model(res, type = "pred", terms = c("Petal.Length", "Species"))
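Alternatively, if you want to stay with felm, you can extract the interaction coefficients yourself and build a coefficient plot. A sketch using lm() on toy data (the same coef()/vcov() extraction idea should carry over to a felm fit, though I have not verified it on your model):

```r
# Sketch: plot only the interaction coefficients of a fitted model,
# demonstrated with lm() on toy data
set.seed(1)
d <- data.frame(x = rnorm(100),
                g = factor(sample(1:3, 100, replace = TRUE)),
                y = rnorm(100))
m <- lm(y ~ x * g, data = d)

est  <- coef(m)
se   <- sqrt(diag(vcov(m)))
keep <- grepl(":", names(est))        # interaction terms contain ":"

ci_lo <- est[keep] - 1.96 * se[keep]
ci_hi <- est[keep] + 1.96 * se[keep]

# Base-graphics coefficient plot with 95% confidence intervals
plot(seq_along(est[keep]), est[keep], ylim = range(ci_lo, ci_hi),
     xaxt = "n", xlab = "", ylab = "Estimate", pch = 19)
axis(1, at = seq_along(est[keep]), labels = names(est)[keep], las = 2)
arrows(seq_along(est[keep]), ci_lo, seq_along(est[keep]), ci_hi,
       angle = 90, code = 3, length = 0.05)
abline(h = 0, lty = 2)
```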
I have a multinomial logit model with two individual specific variables (first and age).
I would like to conduct the hmftest to check if the IIA holds.
My dataset looks like this:
head(df)
mode choice first age
1 both 1 0 24
2 pre 1 1 23
3 both 1 2 53
4 post 1 3 43
5 no 1 1 55
6 both 1 2 63
I adjusted it for the mlogit to:
mode choice first age idx
1 TRUE 1 0 24 1:both
2 FALSE 1 0 24 1:no
3 FALSE 1 0 24 1:post
4 FALSE 1 0 24 1:pre
5 FALSE 1 1 23 2:both
6 FALSE 1 1 23 2:no
7 FALSE 1 1 23 2:post
8 TRUE 1 1 23 2:pre
9 TRUE 1 2 53 3:both
10 FALSE 1 2 53 3:no
~~~ indexes ~~~~
id1 id2
1 1 both
2 1 no
3 1 post
4 1 pre
5 2 both
6 2 no
7 2 post
8 2 pre
9 3 both
10 3 no
indexes: 1, 2
My original (full) model runs as follows:
full <- mlogit(mode ~ 0 | first + age, data = df_mlogit, reflevel = "no")
leading to the following result:
Call:
mlogit(formula = mode ~ 0 | first + age, data = df_mlogit, reflevel = "no",
method = "nr")
Frequencies of alternatives:choice
no both post pre
0.2 0.4 0.2 0.2
nr method
18 iterations, 0h:0m:0s
g'(-H)^-1g = 8.11E-07
gradient close to zero
Coefficients :
Estimate Std. Error z-value Pr(>|z|)
(Intercept):both 2.0077e+01 1.0441e+04 0.0019 0.9985
(Intercept):post -4.1283e-01 1.4771e+04 0.0000 1.0000
(Intercept):pre 5.3346e-01 1.4690e+04 0.0000 1.0000
first1:both -4.0237e+01 1.1059e+04 -0.0036 0.9971
first1:post -8.9168e-01 1.4771e+04 -0.0001 1.0000
first1:pre -6.6805e-01 1.4690e+04 0.0000 1.0000
first2:both -1.9674e+01 1.0441e+04 -0.0019 0.9985
first2:post -1.8975e+01 1.5683e+04 -0.0012 0.9990
first2:pre -1.8889e+01 1.5601e+04 -0.0012 0.9990
first3:both -2.1185e+01 1.1896e+04 -0.0018 0.9986
first3:post 1.9200e+01 1.5315e+04 0.0013 0.9990
first3:pre 1.9218e+01 1.5237e+04 0.0013 0.9990
age:both 2.1898e-02 2.9396e-02 0.7449 0.4563
age:post 9.3377e-03 2.3157e-02 0.4032 0.6868
age:pre -1.2338e-02 2.2812e-02 -0.5408 0.5886
Log-Likelihood: -61.044
McFadden R^2: 0.54178
Likelihood ratio test : chisq = 144.35 (p.value = < 2.22e-16)
To test for IIA, I exclude one alternative from the model (here "pre") and run the model as follows:
part <- mlogit(mode ~ 0 | first + age, data = df_mlogit, reflevel = "no",
alt.subset = c("no", "post", "both"))
leading to
Call:
mlogit(formula = mode ~ 0 | first + age, data = df_mlogit, alt.subset = c("no",
"post", "both"), reflevel = "no", method = "nr")
Frequencies of alternatives:choice
no both post
0.25 0.50 0.25
nr method
18 iterations, 0h:0m:0s
g'(-H)^-1g = 6.88E-07
gradient close to zero
Coefficients :
Estimate Std. Error z-value Pr(>|z|)
(Intercept):both 1.9136e+01 6.5223e+03 0.0029 0.9977
(Intercept):post -9.2040e-01 9.2734e+03 -0.0001 0.9999
first1:both -3.9410e+01 7.5835e+03 -0.0052 0.9959
first1:post -9.3119e-01 9.2734e+03 -0.0001 0.9999
first2:both -1.8733e+01 6.5223e+03 -0.0029 0.9977
first2:post -1.8094e+01 9.8569e+03 -0.0018 0.9985
first3:both -2.0191e+01 1.1049e+04 -0.0018 0.9985
first3:post 2.0119e+01 1.1188e+04 0.0018 0.9986
age:both 2.1898e-02 2.9396e-02 0.7449 0.4563
age:post 1.9879e-02 2.7872e-02 0.7132 0.4757
Log-Likelihood: -27.325
McFadden R^2: 0.67149
Likelihood ratio test : chisq = 111.71 (p.value = < 2.22e-16)
However, when I want to conduct the hmftest, the following error occurs:
> hmftest(full, part)
Error in solve.default(diff.var) :
system is computationally singular: reciprocal condition number = 4.34252e-21
Does anyone have an idea where the problem might be?
I believe the issue here is that hmftest checks whether the probability ratio of two alternatives depends only on the characteristics of those alternatives.
Since your model contains only individual-specific variables, the test won't work in this case.
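For comparison, hmftest does run when the model contains alternative-specific covariates. A sketch using the Fishing data bundled with mlogit (assuming the mlogit package; function names and exact output may vary across mlogit versions):

```r
library(mlogit)

# Fishing data: mode choice with alternative-specific price and catch
data("Fishing", package = "mlogit")
Fish <- mlogit.data(Fishing, varying = 2:9, shape = "wide", choice = "mode")

# Full model with alternative-specific covariates
full <- mlogit(mode ~ price + catch, data = Fish)

# Restricted model estimated on a subset of alternatives
part <- mlogit(mode ~ price + catch, data = Fish,
               alt.subset = c("beach", "pier", "charter"))

# Hausman-McFadden test of IIA
hmftest(full, part)
```

With purely individual-specific regressors, as in your model, the coefficient differences that the test inverts become degenerate, which is consistent with the "computationally singular" error you see.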
I am running caret's neural network on the Bike Sharing dataset and I get the following error message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
: There were missing values in resampled performance measures.
I am not sure what the problem is. Can anyone help please?
The dataset is from:
https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset
Here is the coding:
library(caret)
library(bestNormalize)
data_hour = read.csv("hour.csv")
# Split dataset
set.seed(3)
split = createDataPartition(data_hour$casual, p=0.80, list=FALSE)
validation = data_hour[-split,]
dataset = data_hour[split,]
dataset = dataset[,c(-1,-2,-4)]
# View structure of data
str(dataset)
# 'data.frame': 13905 obs. of 14 variables:
# $ season : int 1 1 1 1 1 1 1 1 1 1 ...
# $ mnth : int 1 1 1 1 1 1 1 1 1 1 ...
# $ hr : int 1 2 3 5 8 10 11 12 14 15 ...
# $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
# $ weekday : int 6 6 6 6 6 6 6 6 6 6 ...
# $ workingday: int 0 0 0 0 0 0 0 0 0 0 ...
# $ weathersit: int 1 1 1 2 1 1 1 1 2 2 ...
# $ temp : num 0.22 0.22 0.24 0.24 0.24 0.38 0.36 0.42 0.46 0.44 ...
# $ atemp : num 0.273 0.273 0.288 0.258 0.288 ...
# $ hum : num 0.8 0.8 0.75 0.75 0.75 0.76 0.81 0.77 0.72 0.77 ...
# $ windspeed : num 0 0 0 0.0896 0 ...
# $ casual : int 8 5 3 0 1 12 26 29 35 40 ...
# $ registered: int 32 27 10 1 7 24 30 55 71 70 ...
# $ cnt : int 40 32 13 1 8 36 56 84 106 110 ...
## Transform numeric data toward Gaussian
dataset_selected = dataset[,c(-13,-14)]
for (i in 8:12) { dataset_selected[,i] = predict(boxcox(dataset_selected[,i] +0.1))}
# View transformed dataset
str(dataset_selected)
#'data.frame': 13905 obs. of 12 variables:
# $ season : int 1 1 1 1 1 1 1 1 1 1 ...
# $ mnth : int 1 1 1 1 1 1 1 1 1 1 ...
# $ hr : int 1 2 3 5 8 10 11 12 14 15 ...
# $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
# $ weekday : int 6 6 6 6 6 6 6 6 6 6 ...
# $ workingday: int 0 0 0 0 0 0 0 0 0 0 ...
# $ weathersit: int 1 1 1 2 1 1 1 1 2 2 ...
# $ temp : num -1.47 -1.47 -1.35 -1.35 -1.35 ...
# $ atemp : num -1.18 -1.18 -1.09 -1.27 -1.09 ...
# $ hum : num 0.899 0.899 0.637 0.637 0.637 ...
# $ windspeed : num -1.8 -1.8 -1.8 -0.787 -1.8 ...
# $ casual : num -0.361 -0.588 -0.81 -1.867 -1.208 ...
# Train data with Neural Network model from caret
control = trainControl(method = 'repeatedcv', number = 10, repeats =3)
metric = 'RMSE'
set.seed(3)
fit = train(casual ~., data = dataset_selected, method = 'nnet', metric = metric, trControl = control, trace = FALSE)
Thanks for your help!
@phivers' comment is spot on; however, I would still like to provide a more detailed answer for this concrete example.
In order to investigate what is going on in more detail one should add the argument savePredictions = "all" to trainControl:
control = trainControl(method = 'repeatedcv',
number = 10,
repeats = 3,
returnResamp = "all",
savePredictions = "all")
metric = 'RMSE'
set.seed(3)
fit = train(casual ~.,
data = dataset_selected,
method = 'nnet',
metric = metric,
trControl = control,
trace = FALSE,
form = "traditional")
Now, when running:
fit$results
#output
size decay RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 1 0e+00 0.9999205 NaN 0.8213177 0.009655872 NA 0.007919575
2 1 1e-04 0.9479487 0.1850270 0.7657225 0.074211541 0.20380571 0.079640883
3 1 1e-01 0.8801701 0.3516646 0.6937938 0.074484860 0.20787440 0.077960642
4 3 0e+00 0.9999205 NaN 0.8213177 0.009655872 NA 0.007919575
5 3 1e-04 0.9272942 0.2482794 0.7434689 0.091409600 0.24363651 0.098854133
6 3 1e-01 0.7943899 0.6193242 0.5944279 0.011560524 0.03299137 0.013002708
7 5 0e+00 0.9999205 NaN 0.8213177 0.009655872 NA 0.007919575
8 5 1e-04 0.8811411 0.3621494 0.6941335 0.092169810 0.22980560 0.098987058
9 5 1e-01 0.7896507 0.6431808 0.5870894 0.009947324 0.01063359 0.009121535
we notice the problem occurs when decay = 0.
Let's filter the observations and predictions for decay = 0:
library(tidyverse)
fit$pred %>%
filter(decay == 0) -> for_r2
var(for_r2$pred)
#output
0
We can observe that all of the predictions for decay == 0 are identical (they have zero variance). The model exclusively predicts 0:
unique(for_r2$pred)
#output
0
So when the summary function tries to compute R squared:
caret::R2(for_r2$obs, for_r2$pred)
#output
[1] NA
Warning message:
In cor(obs, pred, use = ifelse(na.rm, "complete.obs", "everything")) :
the standard deviation is zero
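The NA is inherited from cor(): the Pearson correlation underlying caret's R squared divides by the standard deviation of the predictions, which is zero here. A minimal standalone illustration:

```r
# Constant predictions have zero variance, so the Pearson correlation
# (and hence R squared) is undefined and cor() returns NA with a warning
obs  <- c(1, 2, 3, 4)
pred <- rep(2.5, 4)            # zero-variance predictions
r <- suppressWarnings(cor(obs, pred))
r
# NA
```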
Answer by @topepo (caret package main developer). See the detailed GitHub thread here.
It looks like it happens when you have one hidden unit and almost no
regularization. What is happening is that the model is predicting a
value very close to a constant (so that the RMSE is a little worse
than the basic st deviation of the outcome):
> ANN_cooling_fit$resample %>% dplyr::filter(is.na(Rsquared))
RMSE Rsquared MAE size decay Resample
1 8.414010 NA 6.704311 1 0e+00 Fold04.Rep01
2 8.421244 NA 6.844363 1 0e+00 Fold01.Rep03
3 7.855925 NA 6.372947 1 1e-04 Fold10.Rep07
4 7.963816 NA 6.428947 1 0e+00 Fold07.Rep09
5 8.492898 NA 6.901842 1 0e+00 Fold09.Rep09
6 7.892527 NA 6.479474 1 0e+00 Fold10.Rep10
> sd(mydata$V7)
[1] 7.962888
So it's nothing to really worry about; just some parameters that do very poorly.
The answer by @missuse is already very insightful for understanding why this error happens.
So I just want to add some straightforward ways to get rid of it.
If the predictions have zero variance in some cross-validation folds, the model did not converge there. In such cases, you can try the neuralnet package, which offers two parameters you can tune:
threshold : default value = 0.01. Set it to 0.3 and then try lower values 0.2, 0.1, 0.05.
stepmax : default value = 1e+05. Set it to 1e+08 and then try lower values 1e+07, 1e+06.
In most cases, it is sufficient to change the threshold parameter like this:
model.nn <- caret::train(formula1,
method = "neuralnet",
data = training.set[,],
# apply preProcess within cross-validation folds
preProcess = c("center", "scale"),
trControl = trainControl(method = "repeatedcv",
number = 10,
repeats = 3),
threshold = 0.3
)
I want to compute a structural equation model with the sem() function in R with the package lavaan.
There are two categorical variables, one latent exogenous and one latent endogenous, that I want to include in the final version of the model.
When I include one of the categorical variables in the model, however, R produces the following warning:
1: In estimateVCOV(lavaanModel, samplestats = lavaanSampleStats,
options = lavaanOptions, : lavaan WARNING: could not compute
standard errors!
2: In computeTestStatistic(lavaanModel, partable = lavaanParTable, :
lavaan WARNING: could not compute scaled test statistic
Code used:
model1 <- '
Wertschaetzung_Essen =~ abwechslungsreiche_M + schnell_zubereitbar + koche_sehr_gerne + koche_sehr_haeufig
Fleischverzicht =~ Ern_Index1
Fleischverzicht ~ Wertschaetzung_Essen
'
fit_model1 <- sem(model1, data=survey2_subset, ordered = c("Ern_Index1"))
Note: This is only a small version of the final model, in which I introduce only one categorical variable. The warning, however, is the same for more complex versions of the model.
Output
str(survey2_subset):
'data.frame': 3676 obs. of 116 variables:
$ abwechslungsreiche_M : num 4 2 3 4 3 3 4 3 3 3 ...
$ schnell_zubereitbar : num 0 3 2 0 0 1 3 2 1 1 ...
$ koche_sehr_gerne : num 1 3 3 1 3 1 4 4 4 3 ...
$ koche_sehr_haeufig : num 2 2 3 NA 3 2 2 4 3 3 ...
$ Ern_Index1 : num 1 1 1 1 0 0 1 0 1 0 ...
summary(fit_model1, fit.measures = TRUE, standardized=TRUE)
lavaan (0.5-15) converged normally after 31 iterations
Used Total
Number of observations 3469 3676
Estimator DWLS Robust
Minimum Function Test Statistic 13.716 NA
Degrees of freedom 4 4
P-value (Chi-square) 0.008 NA
Scaling correction factor NA
Shift parameter
for simple second-order correction (Mplus variant)
Model test baseline model:
Minimum Function Test Statistic 2176.159 1582.139
Degrees of freedom 10 10
P-value 0.000 0.000
User model versus baseline model:
Comparative Fit Index (CFI) 0.996 NA
Tucker-Lewis Index (TLI) 0.989 NA
Root Mean Square Error of Approximation:
RMSEA 0.026 NA
90 Percent Confidence Interval 0.012 0.042 NA NA
P-value RMSEA <= 0.05 0.994 NA
Parameter estimates:
Information Expected
Standard Errors Robust.sem
Estimate Std.err Z-value P(>|z|) Std.lv Std.all
Latent variables:
Wertschaetzung_Essen =~
abwchslngsr_M 1.000 0.363 0.436
schnll_zbrtbr 1.179 0.428 0.438
koche_shr_grn 2.549 0.925 0.846
koche_shr_hfg 2.530 0.918 0.775
Fleischverzicht =~
Ern_Index1 1.000 0.249 0.249
Regressions:
Fleischverzicht ~
Wrtschtzng_Es 0.302 0.440 0.440
Intercepts:
abwchslngsr_M 3.133 3.133 3.760
schnll_zbrtbr 1.701 1.701 1.741
koche_shr_grn 2.978 2.978 2.725
koche_shr_hfg 2.543 2.543 2.148
Wrtschtzng_Es 0.000 0.000 0.000
Fleischvrzcht 0.000 0.000 0.000
Thresholds:
Ern_Index1|t1 0.197 0.197 0.197
Variances:
abwchslngsr_M 0.562 0.562 0.810
schnll_zbrtbr 0.771 0.771 0.808
koche_shr_grn 0.339 0.339 0.284
koche_shr_hfg 0.559 0.559 0.399
Ern_Index1 0.938 0.938 0.938
Wrtschtzng_Es 0.132 1.000 1.000
Fleischvrzcht 0.050 0.806 0.806
Is the model not identified? There should be enough degrees of freedom and the loadings of the first manifest items are set to one.
How can I resolve this issue?
My first thought was:
You can't have missing values in the data frame, because with categorical variables WLSMV is used, and FIML (missing="ML") is only usable with ML estimation. Perhaps that's the problem.
Also: does lavaan automatically fix the residual variance of "Fleischverzicht" to 0 (or some other value)? A single-indicator latent variable would not be identified without such a constraint, I think.
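If lavaan does not fix it automatically, you can impose the constraint yourself in the model syntax. A sketch reusing the variable names from the question (untested against your data, so treat it as a starting point):

```r
# Single-indicator latent variable: fix the loading to 1 and the
# indicator's residual variance to 0 so the model is identified
model1b <- '
  Wertschaetzung_Essen =~ abwechslungsreiche_M + schnell_zubereitbar +
                          koche_sehr_gerne + koche_sehr_haeufig
  Fleischverzicht =~ 1*Ern_Index1
  Ern_Index1 ~~ 0*Ern_Index1
  Fleischverzicht ~ Wertschaetzung_Essen
'
fit_model1b <- sem(model1b, data = survey2_subset, ordered = c("Ern_Index1"))
```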
I am fitting a cox model to some data that is structured as such:
str(test)
'data.frame': 147 obs. of 8 variables:
$ AGE : int 71 69 90 78 61 74 78 78 81 45 ...
$ Gender : Factor w/ 2 levels "F","M": 2 1 2 1 2 1 2 1 2 1 ...
$ RACE : Factor w/ 5 levels "","BLACK","HISPANIC",..: 5 2 5 5 5 5 5 5 5 1 ...
$ SIDE : Factor w/ 2 levels "L","R": 1 1 2 1 2 1 1 1 2 1 ...
$ LESION.INDICATION: Factor w/ 12 levels "CLAUDICATION",..: 1 11 4 11 9 1 1 11 11 11 ...
$ RUTH.CLASS : int 3 5 4 5 4 3 3 5 5 5 ...
$ LESION.TYPE : Factor w/ 3 levels "","OCCLUSION",..: 3 3 2 3 3 3 2 3 3 3 ...
$ Primary : int 1190 1032 166 689 219 840 1063 115 810 157 ...
The RUTH.CLASS variable is actually a factor, and I've converted it as such:
> test$RUTH.CLASS <- as.factor(test$RUTH.CLASS)
> summary(test$RUTH.CLASS)
3 4 5 6
48 56 35 8
Great.
After fitting the model:
stent.surv <- Surv(test$Primary)
> cox.ruthclass <- coxph(stent.surv ~ RUTH.CLASS, data=test )
>
> summary(cox.ruthclass)
Call:
coxph(formula = stent.surv ~ RUTH.CLASS, data = test)
n= 147, number of events= 147
coef exp(coef) se(coef) z Pr(>|z|)
RUTH.CLASS4 0.1599 1.1734 0.1987 0.804 0.42111
RUTH.CLASS5 0.5848 1.7947 0.2263 2.585 0.00974 **
RUTH.CLASS6 0.3624 1.4368 0.3846 0.942 0.34599
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
RUTH.CLASS4 1.173 0.8522 0.7948 1.732
RUTH.CLASS5 1.795 0.5572 1.1518 2.796
RUTH.CLASS6 1.437 0.6960 0.6762 3.053
Concordance= 0.574 (se = 0.026 )
Rsquare= 0.045 (max possible= 1 )
Likelihood ratio test= 6.71 on 3 df, p=0.08156
Wald test = 7.09 on 3 df, p=0.06902
Score (logrank) test = 7.23 on 3 df, p=0.06478
> levels(test$RUTH.CLASS)
[1] "3" "4" "5" "6"
When I fit more variables in the model, similar things happen:
cox.fit <- coxph(stent.surv ~ RUTH.CLASS + LESION.INDICATION + LESION.TYPE, data=test )
>
> summary(cox.fit)
Call:
coxph(formula = stent.surv ~ RUTH.CLASS + LESION.INDICATION +
LESION.TYPE, data = test)
n= 147, number of events= 147
coef exp(coef) se(coef) z Pr(>|z|)
RUTH.CLASS4 -0.5854 0.5569 1.1852 -0.494 0.6214
RUTH.CLASS5 -0.1476 0.8627 1.0182 -0.145 0.8847
RUTH.CLASS6 -0.4509 0.6370 1.0998 -0.410 0.6818
LESION.INDICATIONEMBOLIC -0.4611 0.6306 1.5425 -0.299 0.7650
LESION.INDICATIONISCHEMIA 1.3794 3.9725 1.1541 1.195 0.2320
LESION.INDICATIONISCHEMIA/CLAUDICATION 0.2546 1.2899 1.0189 0.250 0.8027
LESION.INDICATIONREST PAIN 0.5302 1.6993 1.1853 0.447 0.6547
LESION.INDICATIONTISSUE LOSS 0.7793 2.1800 1.0254 0.760 0.4473
LESION.TYPEOCCLUSION -0.5886 0.5551 0.4360 -1.350 0.1770
LESION.TYPESTEN -0.7895 0.4541 0.4378 -1.803 0.0714 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
RUTH.CLASS4 0.5569 1.7956 0.05456 5.684
RUTH.CLASS5 0.8627 1.1591 0.11726 6.348
RUTH.CLASS6 0.6370 1.5698 0.07379 5.499
LESION.INDICATIONEMBOLIC 0.6306 1.5858 0.03067 12.964
LESION.INDICATIONISCHEMIA 3.9725 0.2517 0.41374 38.141
LESION.INDICATIONISCHEMIA/CLAUDICATION 1.2899 0.7752 0.17510 9.503
LESION.INDICATIONREST PAIN 1.6993 0.5885 0.16645 17.347
LESION.INDICATIONTISSUE LOSS 2.1800 0.4587 0.29216 16.266
LESION.TYPEOCCLUSION 0.5551 1.8015 0.23619 1.305
LESION.TYPESTEN 0.4541 2.2023 0.19250 1.071
Concordance= 0.619 (se = 0.028 )
Rsquare= 0.137 (max possible= 1 )
Likelihood ratio test= 21.6 on 10 df, p=0.01726
Wald test = 22.23 on 10 df, p=0.01398
Score (logrank) test = 23.46 on 10 df, p=0.009161
> levels(test$LESION.INDICATION)
[1] "CLAUDICATION" "EMBOLIC" "ISCHEMIA" "ISCHEMIA/CLAUDICATION"
[5] "REST PAIN" "TISSUE LOSS"
> levels(test$LESION.TYPE)
[1] "" "OCCLUSION" "STEN"
truncated output from model.matrix below:
> model.matrix(cox.fit)
RUTH.CLASS4 RUTH.CLASS5 RUTH.CLASS6 LESION.INDICATIONEMBOLIC LESION.INDICATIONISCHEMIA
1 0 0 0 0 0
2 0 1 0 0 0
We can see that the first level of each of these factors is being excluded from the model. Any input would be greatly appreciated. I noticed that for the LESION.TYPE variable, the blank level "" is not being included, but that is not by design; it should be an NA or something similar.
I'm confused and could use some help with this. Thanks.
Factors in any model are reported as coefficients relative to a base level (a contrast); by default, R uses treatment contrasts with the first level as the base. There is no point in estimating a coefficient for the dropped level, because the model already returns its prediction for that level when all the other factor dummies are 0 (the levels are complete and mutually exclusive for every observation). You can alter the default contrast by changing the contrasts in your options.
For your coefficients to be versus an average of all factors:
options(contrasts=c(unordered="contr.sum", ordered="contr.poly"))
For your coefficients to be versus a specific treatment (what you have above and your default):
options(contrasts=c(unordered="contr.treatment", ordered="contr.poly"))
As you can see, there are two types of factors in R: unordered (or categorical, e.g. red, green, blue) and ordered (e.g. strongly disagree, disagree, no opinion, agree, strongly agree).
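To see how the contrast choice changes the reported coefficients, here is a small self-contained comparison on toy data (base R only):

```r
# Toy data: three groups with means 1.5, 3.5 and 5.5
df <- data.frame(y = c(1, 2, 3, 4, 5, 6),
                 g = factor(c("a", "a", "b", "b", "c", "c")))

# Treatment contrasts (R's default for unordered factors):
# intercept = mean of the base level "a"; gb, gc = differences from "a"
ct <- coef(lm(y ~ g, data = df, contrasts = list(g = "contr.treatment")))
# (Intercept) = 1.5, gb = 2, gc = 4

# Sum contrasts: intercept = grand mean;
# g1, g2 = deviations of levels "a" and "b" from the grand mean
cs <- coef(lm(y ~ g, data = df, contrasts = list(g = "contr.sum")))
# (Intercept) = 3.5, g1 = -2, g2 = 0
```

Either way, the fitted group means are identical; only the parameterization of the coefficients changes.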