I'm trying to fit models for latent profile analysis (packages: tidyLPA and mclust). For model VVI (variances = varying, covariances = zero), I get many NA values for BIC when n_profiles > 5. I found that adding the argument prior = priorControl() can fix that, and indeed it does! But now BIC jumps sharply upward when n_profiles > 5, which would indicate much better model fit.
Does anyone have a guess as to what is going on here, or a recommendation for how to deal with it? Any thoughts are appreciated. I hope the code below and the attached plots illustrate the issue.
Many thanks!!
###not run
library(mclust)
library(tidyLPA)
library(dplyr)
# cluster_1 is an imputed subset of 9 variables
cluster_1 <- subset %>%
  single_imputation() %>%
  mutate_all(list(scale))
# mclustBIC without priorControl
set.seed(0408)
BIC_m12a <- mclustBIC(cluster_1, modelNames = c("EEI", "VVI"))
BIC_m12a
Bayesian Information Criterion (BIC):
EEI VVI
1 -127005.8 -127005.8
2 -122298.6 -121027.1
3 -120579.4 -119558.0
4 -119883.4 -118765.7
5 -119244.2 -118293.6
6 -119064.4 NA
7 -118771.5 NA
8 -118681.0 NA
9 -118440.2 NA
# mclustBIC with priorControl
set.seed(0408)
BIC_m12b <- mclustBIC(cluster_1, modelNames = c("EEI", "VVI"), prior = priorControl())
BIC_m12b
Bayesian Information Criterion (BIC):
EEI VVI
1 -127006.0 -127006.0
2 -122299.5 -121028.2
3 -120587.9 -119560.0
4 -119884.4 -118771.7
5 -119235.4 -118296.9
6 -118933.9 -112761.8
7 -118776.2 -112579.7
8 -118586.2 -112353.3
9 -118472.0 -112460.2
[Plots omitted: BIC by number of profiles for EEI and VVI, without and with priorControl]
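One way to investigate the jump (a hedged sketch, not a definitive diagnosis, assuming mclust's usual parameter layout): refit the suspicious G = 6 VVI solution directly with Mclust and inspect its parameters. Very small mixing proportions or near-zero variances would suggest a spurious or degenerate component rather than genuinely better fit.
# Sketch: refit the G = 6 VVI solution with the prior and inspect it
mod_vvi6 <- Mclust(cluster_1, G = 6, modelNames = "VVI",
                   prior = priorControl())
mod_vvi6$parameters$pro                             # mixing proportions
apply(mod_vvi6$parameters$variance$sigma, 3, diag)  # per-component variances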
I've trained a naive Bayes model with R using the tidymodels framework.
The whole model is saved in an rds file. Here's a snippet of the content of that model (the whole model has 181 or more such tables so I can't post it here):
══ Workflow [trained] ══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: naive_Bayes()
── Preprocessor ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
0 Recipe Steps
── Model ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
$apriori
grouping
A B C D E F
0.1666667 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667
$tables
$tables$var1
var
grouping 1 2 3 4
1 0.3173302108 0.3728337237 0.2304449649 0.0793911007
2 0.2104513064 0.3885985748 0.2923990499 0.1085510689
3 0.2561613144 0.5481702763 0.1784914115 0.0171769978
4 0.0038167939 0.1059160305 0.5477099237 0.3425572519
5 0.0009017133 0.0324616772 0.3841298467 0.5825067628
6 0.1474328780 0.4399434762 0.3655204899 0.0471031559
$tables$var2
var
grouping 1 2 3 4
1 0.2215456674 0.3592505855 0.2777517564 0.1414519906
2 0.1532066508 0.3446555819 0.3225653207 0.1795724466
3 0.1762509335 0.4458551158 0.3330843913 0.0448095594
4 0.0009541985 0.0324427481 0.4208015267 0.5458015267
5 0.0009017133 0.0189359784 0.2957619477 0.6844003607
6 0.1427225624 0.4371172869 0.3546867640 0.0654733867
$tables$var3
var
grouping 1 2 3 4 5
1 0.7679700304 0.1992507609 0.0320767970 0.0004682744 0.0002341372
2 0.3680835906 0.3526478271 0.2526715744 0.0256471147 0.0009498931
3 0.0432835821 0.2328358209 0.5201492537 0.1694029851 0.0343283582
4 0.0514775977 0.2278360343 0.4642516683 0.1954242135 0.0610104862
5 0.0117117117 0.0702702703 0.3144144144 0.3486486486 0.2549549550
6 0.0150659134 0.1012241055 0.4077212806 0.3436911488 0.1322975518
$tables$var4
var
grouping 1 2 3 4 5
1 0.6518379771 0.3289627722 0.0187309764 0.0002341372 0.0002341372
2 0.1260983139 0.2125385894 0.5079553550 0.1184991688 0.0349085728
3 0.3089552239 0.4783582090 0.2059701493 0.0037313433 0.0029850746
4 0.3441372736 0.4718779790 0.1811248808 0.0019065777 0.0009532888
5 0.0270270270 0.0360360360 0.3432432432 0.3612612613 0.2324324324
6 0.0127118644 0.0555555556 0.4119585687 0.3672316384 0.1525423729
I read that file into R, which works fine, and then want to use the model to predict some values for a new data set:
model <- readRDS(file.choose())
new_pred <- predict(model,
                    dat_new,
                    type = "prob")
For me, personally, this runs just fine. But when I sent it to a client of mine, they got the following error:
Error in blueprint$forge$clean(blueprint = blueprint, new_data = new_data, :
attempt to apply non-function
I know that with so little information it is very difficult to figure out what's going on, but I'm still trying. Maybe the tidymodels experts here know where such an error might come from.
Any ideas?
Update to show how the model is created:
library(tidymodels)
library(discrim)
model_recipe <- recipe(outcome_var ~ ., data = dat_train)
model_final <- naive_Bayes(Laplace = 1) |>
set_mode("classification") |>
set_engine("klaR", prior = rep((1/6), 6))
model_final_wf <- workflow() |>
add_recipe(model_recipe) |>
add_model(model_final)
full_fit <- model_final_wf |>
fit(data = dat_full)
saveRDS(full_fit, file = 'my_model.rds')
You are getting this error because your client is using too old a version of {hardhat}.
Version 1.1.0 of hardhat changed a lot of its internals. Among other things, the $clean object is no longer present, which is what causes the error you are seeing.
The recommended course of action is for both of you to use the same version of {hardhat}, preferably the most recent one, which at the time of writing is 1.2.0.
Additionally, when sharing models like this it is recommended that you also pass along the package versions to make sure everything is in sync, for example with renv, or by using a more dedicated model-deployment tool such as vetiver.
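A minimal sketch of how to check and pin versions (assuming renv is already installed; vetiver has its own workflow):
packageVersion("hardhat")  # run on both machines; the versions should match
renv::init()               # set up a project-local package library
renv::snapshot()           # record exact versions in renv.lock
renv::restore()            # on the client's side, reinstall those versions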
I have simulation data that is repeated 100 times, and I applied mclustBIC to each sample. I would then like to access the top result of this function, but I could not work out how to extract it.
Here is an example of the function:
library(mclust)
mclustBIC(iris[,-5])
The output is:
Bayesian Information Criterion (BIC):
EII VII EEI VEI EVI VVI EEE VEE EVE VVE EEV
1 -1804.0854 -1804.0854 -1522.1202 -1522.1202 -1522.1202 -1522.1202 -829.9782 -829.9782 -829.9782 -829.9782 -829.9782
2 -1123.4117 -1012.2352 -1042.9679 -956.2823 -1007.3082 -857.5515 -688.0972 -656.3270 -657.2263 -605.1841 -644.5997
3 -878.7650 -853.8144 -813.0504 -779.1566 -797.8342 -744.6382 -632.9647 -605.3982 -666.5491 -636.4259 -644.7810
4 -893.6140 -812.6048 -827.4036 -748.4529 -837.5452 -751.0198 -646.0258 -604.8371 -705.5435 -639.7078 -699.8684
5 -782.6441 -742.6083 -741.9185 -688.3463 -766.8158 -711.4502 -604.8131 NA -723.7199 -632.2056 -652.2959
6 -715.7136 -705.7811 -693.7908 -676.1697 -774.0673 -707.2901 -609.8543 -609.5584 -661.9497 -664.8224 -664.4537
7 -731.8821 -698.5413 -713.1823 -680.7377 -813.5220 -766.6500 -632.4947 NA -699.5102 -690.6108 -709.9530
8 -725.0805 -701.4806 -691.4133 -679.4640 -740.4068 -764.1969 -639.2640 -654.8237 -700.4277 -709.9392 -735.4463
9 -694.5205 -700.0276 -696.2607 -702.0143 -767.8044 -755.8290 -653.0878 NA -729.6651 -734.2997 -758.9348
VEV EVV VVV
1 -829.9782 -829.9782 -829.9782
2 -561.7285 -658.3306 -574.0178
3 -562.5522 -656.0359 -580.8396
4 -602.0104 -725.2925 -630.6000
5 -634.2890 NA -676.6061
6 -679.5116 NA -754.7938
7 -704.7699 -809.8276 -806.9277
8 -712.8788 -831.7520 -830.6373
9 -748.8237 -882.4391 -883.6931
Top 3 models based on the BIC criterion:
VEV,2 VEV,3 VVV,2
-561.7285 -562.5522 -574.0178
I want to access the last line and extract values from it (is that possible?)
Top 3 models based on the BIC criterion:
VEV,2 VEV,3 VVV,2
-561.7285 -562.5522 -574.0178
Update: using summary() helps to get to these values, but not to extract from them.
I tried to solve this another way, first extracting only the values:
res <- mclustBIC(iris[,-5])
res1 <- as.data.frame(res[,1:14])
res2 <- max(res1[[1]])
However, res2 only provides me with the maximum value for a specific model; I also need to know the corresponding number of clusters (from 1 to 9). I would like to have it like this:
"EII, 9, -694.5205" ## the last line of EII
A possible solution:
library(mclust)
m <- mclustBIC(iris[,-5])
BIC <- as.numeric(summary(m))
names(BIC) <- names(summary(m))
BIC
#> VEV,2 VEV,3 VVV,2
#> -561.7285 -562.5522 -574.0178
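To instead get the best row for each model, e.g. "EII, 9, -694.5205", a sketch like the following should work. It assumes the default G = 1:9, so row names correspond to the number of clusters, and relies on which.max() skipping NA entries:
best_per_model <- sapply(colnames(m), function(mod) {
  i <- which.max(m[, mod])  # row with the largest BIC for this model
  c(G = as.numeric(rownames(m)[i]), BIC = m[i, mod])
})
best_per_model[, "EII"]
#>         G       BIC
#>    9.0000 -694.5205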
I was given an example in which comp() should return p-values, but it ends up returning ranks, so let me ask:
Why is the comp() function from the survMisc package returning ranks instead of p-values?
Is there a way to change it?
test_drug <- survfit(Surv(N_Days,Cens) ~ Drug, data = df)
comp(ten(test_drug), p=c(0,1,1,0.5,0.5),q=c(1,0,1,0.5,2))
output:
Q Var Z pNorm
1 3.3457e+00 2.7643e+01 0.63634 4
n 3.2000e+02 1.0304e+06 0.31524 10
sqrtN 3.4634e+01 4.8218e+03 0.49877 9
S1 2.1524e+00 1.6867e+01 0.52410 7
S2 2.1294e+00 1.6650e+01 0.52185 8
FH_p=0_q=1 1.1647e+00 2.2356e+00 0.77898 3
FH_p=1_q=0 2.1809e+00 1.7056e+01 0.52809 6
FH_p=1_q=1 8.4412e-01 7.9005e-01 0.94968 1
FH_p=0.5_q=0.5 1.6895e+00 4.1759e+00 0.82678 2
FH_p=0.5_q=2 2.7491e-01 2.2027e-01 0.58575 5
maxAbsZ Var Q pSupBr
1 5.8550e+00 2.7643e+01 1.11361 5
n 9.7000e+02 1.0304e+06 0.95556 6
sqrtN 6.3636e+01 4.8218e+03 0.91643 7
S1 3.5891e+00 1.6867e+01 0.87391 9
S2 3.5737e+00 1.6650e+01 0.87581 8
FH_p=0_q=1 2.2539e+00 2.2356e+00 1.50743 2
FH_p=1_q=0 3.6025e+00 1.7056e+01 0.87230 10
FH_p=1_q=1 1.4726e+00 7.9005e-01 1.65678 1
FH_p=0.5_q=0.5 2.9457e+00 4.1759e+00 1.44148 3
FH_p=0.5_q=2 6.3430e-01 2.2027e-01 1.35150 4
According to the topic here:
https://github.com/dardisco/survMisc/issues/21
and information I got from the professor who solved the problem earlier, this is an issue with the R version, and an update to the function itself would be required from the authors or contributors.
It can be worked around using the attr() function with the 'tft' attribute, which stands for 'test for trend'. Code example here:
test_bilirubin <- survfit(Surv(N_Days,Cens) ~ Bilirubin_cat, data = df)
b=ten(test_bilirubin)
comp(b,p=c(0,1,1,0.5,0.5),q=c(1,0,1,0.5,2))
d=attr(b,"tft")
# "lrt" - the long-rank family of tests
#"sup" - Renyi test,
#"tft" - test for trend
cbind(d$tft$W,round(d$tft$pChisq,4))
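If the exact layout of those attributes is unclear, it is easy to inspect them directly; comp() stores each family of results as an attribute on the ten object:
names(attributes(b))  # should include "lrt", "sup" and "tft"
str(attr(b, "tft"))   # examine the structure before extracting from it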
I use the caret package with a multi-layer perceptron.
My dataset consists of a labelled output value, which can be A, B or C. The input vector consists of 4 variables.
I use the following lines of code to calculate the class probabilities for each input value:
fit <- train(device ~ ., data = dataframetrain[1:100, ], method = "mlp",
             trControl = trainControl(classProbs = TRUE))
p <- predict(fit, newdata = dataframetest, type = "prob")
I thought that the class probabilities for each record must sum up to one. But I get the following:
rowSums(p)
# 1 2 3 4 5 6 7 8
# 1.015291 1.015265 1.015291 1.015291 1.015291 1.014933 1.015011 1.015291
# 9 10 11 12 13 14 15 16
# 1.014933 1.015206 1.015291 1.015291 1.015291 1.015224 1.015011 1.015291
Can anybody help me? I don't know what I did wrong.
There's probably nothing wrong; it just seems that caret returns the values of the neurons in the output layer without converting them to probabilities (correct me if I'm wrong). When using the RSNNS::mlp function outside of caret, the rows of the predictions also don't sum to one.
Since all output neurons have the same activation function, the outputs can be converted to probabilities by dividing the predictions by the respective row sums; see this question.
This behavior seems to hold when using method = "mlp" or method = "mlpWeightDecay", but when using method = "nnet" the predictions do sum to one.
Example:
library(RSNNS)
data(iris)
# shuffle the rows of the data frame
iris <- iris[sample(nrow(iris)), ]
irisValues <- iris[,1:4]
irisTargets <- iris[,5]
irisTargetsDecoded <- decodeClassLabels(irisTargets)
iris2 <- splitForTrainingAndTest(irisValues, irisTargetsDecoded, ratio=0.15)
iris2 <- normTrainingAndTestSet(iris2)
set.seed(432)
model <- mlp(iris2$inputsTrain, iris2$targetsTrain,
size=5, learnFuncParams=c(0.1), maxit=50,
inputsTest=iris2$inputsTest, targetsTest=iris2$targetsTest)
predictions <- predict(model,iris2$inputsTest)
head(rowSums(predictions))
# 139 26 17 104 54 82
# 1.0227419 1.0770722 1.0642565 1.0764587 0.9952268 0.9988647
probs <- predictions / rowSums(predictions)
head(rowSums(probs))
# 139 26 17 104 54 82
# 1 1 1 1 1 1
# nnet example --------------------------------------
library(caret)
training <- sample(seq_along(irisTargets), size = 100, replace = F)
modelCaret <- train(y = irisTargets[training],
x = irisValues[training, ],
method = "nnet")
predictionsCaret <- predict(modelCaret,
newdata = irisValues[-training, ],
type = "prob")
head(rowSums(predictionsCaret))
# 122 100 89 134 30 86
# 1 1 1 1 1 1
I don't know how much flexibility the caret package offers in these choices, but the standard way to make a neural net produce outputs which sum to one is to use the softmax function as the activation function in the output layer.
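For illustration, a minimal sketch of the softmax function itself, applied post hoc to the predictions from the RSNNS example above (note this is not identical to training with a softmax output layer, and differs slightly from the divide-by-row-sum normalization):
softmax <- function(x) {
  z <- exp(x - max(x))  # subtract the max so exp() cannot overflow
  z / sum(z)
}
probs_sm <- t(apply(predictions, 1, softmax))
head(rowSums(probs_sm))  # all exactly 1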
I am doing a rolling regression on a huge database. The reference column used for rolling is called "Q", with values from 5 to 45 for each data block. At first I tried simple code, step by step, and it works very well:
fit <- as.formula(EB ~ EB1 + EB2 + EB3 + EB4)
# use 20 quarters of data (Q = 5 to 24) to fit the regression
model <- lm(fit, data = datapool[which(datapool$Q >= 5 & datapool$Q <= 24), ])
# use the model to forecast the value of the next quarter
pre <- predict(model, newdata = datapool[which(datapool$Q == 25), ])
# get the forecast error
error <- datapool[which(datapool$Q == 25), ]$EB - pre
The result of the code above is:
> head(t(t(error)))
[,1]
21 0.006202145
62 -0.003005097
103 -0.019273856
144 -0.016053012
185 -0.025608022
226 -0.004548264
The datapool has the structure below:
> head(datapool)
X Q Firm EB EB1 EB2 EB3
1 1 5 CMCSA US Equity 0.02118966 0.08608825 0.01688180 0.01826571
2 2 6 CMCSA US Equity 0.02331379 0.10506550 0.02118966 0.01688180
3 3 7 CMCSA US Equity 0.01844747 0.12961955 0.02331379 0.02118966
4 4 8 CMCSA US Equity NA NA 0.01844747 0.02331379
5 5 9 CMCSA US Equity 0.01262287 0.05622834 NA 0.01844747
6 6 10 CMCSA US Equity 0.01495291 0.06059339 0.01262287 NA
...
Firm B(also from Q5 to Q45)
...
Firm C(also from Q5 to Q45)
The errors produced above are all marked with the "X" value in datapool, so I can tell which firm each error comes from.
I need to run the regression 21 times (quarters 5-24, 6-25, ..., 25-44), so I do not want to do it manually, and came up with the following code:
fit <- as.formula(EB ~ EB1 + EB2 + EB3 + EB4)
for (i in 0:20) {
  model <- lm(fit, data = datapool[which(datapool$Q >= 5 + i & datapool$Q <= 24 + i), ])
  pre <- predict(model, newdata = datapool[which(datapool$Q == 25 + i), ])
  error <- datapool[which(datapool$Q == 25 + i), ]$EB - pre  # note: 25 + i, not 25
}
The code above runs and no errors come out, but I do not know how to compile the forecast errors produced by each regression into one object automatically. Can anyone help me with that?
(I say again: it is a really bad idea to use the name 'error' for a vector; it clashes with R's error-handling vocabulary and invites confusion.) This is how I would have attempted that task, using the subset argument and indexing rather than the tortured which() statements:
fit <- as.formula(EB ~ EB1 + EB2 + EB3 + EB4)
errset <- vector("list", 21)  # one element per rolling window
for (i in 0:20) {
  model <- lm(fit, data = datapool, subset = Q >= 5 + i & Q <= 24 + i)
  newdat <- datapool[datapool[["Q"]] == 25 + i, ]
  pre <- predict(model, newdata = newdat)
  errset[[i + 1]] <- newdat$EB - pre  # forecast errors for quarter 25 + i
}
errset
No guarantees this won't error out by running out of data at the beginning or end, since you have not offered either data or a comprehensive description of the data object.
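If a single object is wanted rather than a list, the windows can be flattened afterwards (a sketch; err_df is a hypothetical name):
# one row per firm/quarter combination; window j forecasts quarter 24 + j
err_df <- do.call(rbind, lapply(seq_along(errset), function(j) {
  data.frame(Q = 24 + j, error = errset[[j]])
}))
head(err_df)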