I have simulation data that is repeated 100 times, and I applied mclustBIC to each sample. I would then like to access the top result of this function, but I could not work out how.
Here is an example of the function:
library(mclust)
mclustBIC(iris[,-5])
The output is:
Bayesian Information Criterion (BIC):
EII VII EEI VEI EVI VVI EEE VEE EVE VVE EEV
1 -1804.0854 -1804.0854 -1522.1202 -1522.1202 -1522.1202 -1522.1202 -829.9782 -829.9782 -829.9782 -829.9782 -829.9782
2 -1123.4117 -1012.2352 -1042.9679 -956.2823 -1007.3082 -857.5515 -688.0972 -656.3270 -657.2263 -605.1841 -644.5997
3 -878.7650 -853.8144 -813.0504 -779.1566 -797.8342 -744.6382 -632.9647 -605.3982 -666.5491 -636.4259 -644.7810
4 -893.6140 -812.6048 -827.4036 -748.4529 -837.5452 -751.0198 -646.0258 -604.8371 -705.5435 -639.7078 -699.8684
5 -782.6441 -742.6083 -741.9185 -688.3463 -766.8158 -711.4502 -604.8131 NA -723.7199 -632.2056 -652.2959
6 -715.7136 -705.7811 -693.7908 -676.1697 -774.0673 -707.2901 -609.8543 -609.5584 -661.9497 -664.8224 -664.4537
7 -731.8821 -698.5413 -713.1823 -680.7377 -813.5220 -766.6500 -632.4947 NA -699.5102 -690.6108 -709.9530
8 -725.0805 -701.4806 -691.4133 -679.4640 -740.4068 -764.1969 -639.2640 -654.8237 -700.4277 -709.9392 -735.4463
9 -694.5205 -700.0276 -696.2607 -702.0143 -767.8044 -755.8290 -653.0878 NA -729.6651 -734.2997 -758.9348
VEV EVV VVV
1 -829.9782 -829.9782 -829.9782
2 -561.7285 -658.3306 -574.0178
3 -562.5522 -656.0359 -580.8396
4 -602.0104 -725.2925 -630.6000
5 -634.2890 NA -676.6061
6 -679.5116 NA -754.7938
7 -704.7699 -809.8276 -806.9277
8 -712.8788 -831.7520 -830.6373
9 -748.8237 -882.4391 -883.6931
Top 3 models based on the BIC criterion:
VEV,2 VEV,3 VVV,2
-561.7285 -562.5522 -574.0178
I want to access this last part and extract values from it (is that possible?):
Top 3 models based on the BIC criterion:
VEV,2 VEV,3 VVV,2
-561.7285 -562.5522 -574.0178
Update: using summary() helps to get to these values, but not to extract them.
I tried to solve this another way, by first extracting only the values, such that:
res <- mclustBIC(iris[, -5])
res1 <- as.data.frame(res[, 1:14])
res2 <- max(res1[[1]])
However, res2 only gives me the maximum value for one specific model, and in addition I need to know the number of clusters (from 1 to 9). I would like to have it like this:
"EII, 9, -694.5205"  ## the last row of EII.
A possible solution:
library(mclust)
m <- mclustBIC(iris[,-5])
BIC <- as.numeric(summary(m))
names(BIC) <- names(summary(m))
BIC
#> VEV,2 VEV,3 VVV,2
#> -561.7285 -562.5522 -574.0178
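To then split those names into the model and the number of clusters, one possible follow-up (building on the BIC vector above) is:
# split names such as "VEV,2" into model name and cluster count
parts <- do.call(rbind, strsplit(names(BIC), ",", fixed = TRUE))
data.frame(model = parts[, 1], G = as.integer(parts[, 2]), BIC = as.numeric(BIC))
#>   model G       BIC
#> 1   VEV 2 -561.7285
#> 2   VEV 3 -562.5522
#> 3   VVV 2 -574.0178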
I was given an example where comp() should return p-values, but it ends up returning ranks, so let me ask:
Why is the comp() function from the survMisc package returning ranks instead of p-values?
Is there a way to change it?
library(survival)
library(survMisc)
test_drug <- survfit(Surv(N_Days, Cens) ~ Drug, data = df)
comp(ten(test_drug), p = c(0, 1, 1, 0.5, 0.5), q = c(1, 0, 1, 0.5, 2))
output:
Q Var Z pNorm
1 3.3457e+00 2.7643e+01 0.63634 4
n 3.2000e+02 1.0304e+06 0.31524 10
sqrtN 3.4634e+01 4.8218e+03 0.49877 9
S1 2.1524e+00 1.6867e+01 0.52410 7
S2 2.1294e+00 1.6650e+01 0.52185 8
FH_p=0_q=1 1.1647e+00 2.2356e+00 0.77898 3
FH_p=1_q=0 2.1809e+00 1.7056e+01 0.52809 6
FH_p=1_q=1 8.4412e-01 7.9005e-01 0.94968 1
FH_p=0.5_q=0.5 1.6895e+00 4.1759e+00 0.82678 2
FH_p=0.5_q=2 2.7491e-01 2.2027e-01 0.58575 5
maxAbsZ Var Q pSupBr
1 5.8550e+00 2.7643e+01 1.11361 5
n 9.7000e+02 1.0304e+06 0.95556 6
sqrtN 6.3636e+01 4.8218e+03 0.91643 7
S1 3.5891e+00 1.6867e+01 0.87391 9
S2 3.5737e+00 1.6650e+01 0.87581 8
FH_p=0_q=1 2.2539e+00 2.2356e+00 1.50743 2
FH_p=1_q=0 3.6025e+00 1.7056e+01 0.87230 10
FH_p=1_q=1 1.4726e+00 7.9005e-01 1.65678 1
FH_p=0.5_q=0.5 2.9457e+00 4.1759e+00 1.44148 3
FH_p=0.5_q=2 6.3430e-01 2.2027e-01 1.35150 4
According to the discussion in this GitHub issue:
https://github.com/dardisco/survMisc/issues/21
and information I got from the professor who solved the problem earlier, this is an issue with the R version, and an update to the function itself is required from its authors or contributors.
It can be worked around using attr() with the 'tft' attribute, which stands for "test for trend". Code example:
test_bilirubin <- survfit(Surv(N_Days, Cens) ~ Bilirubin_cat, data = df)
b <- ten(test_bilirubin)
comp(b, p = c(0, 1, 1, 0.5, 0.5), q = c(1, 0, 1, 0.5, 2))
# comp() stores its results as attributes of the ten object:
# "lrt" - the log-rank family of tests
# "sup" - Renyi test
# "tft" - test for trend
d <- attr(b, "tft")
cbind(d$tft$W, round(d$tft$pChisq, 4))
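Alternatively, since the Q and Var columns in the comp() printout are intact (Z is just Q / sqrt(Var), as the drug output above shows), the two-sided normal-approximation p-value that pNorm should have contained can be recomputed by hand; for instance, for the plain log-rank row (Z = 0.63634):
# two-sided p-value from the standardised log-rank statistic
2 * pnorm(-abs(0.63634))
# roughly 0.52, i.e. no evidence against the null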
I'm trying to fit models for latent profile analysis (packages: tidyLPA and mclust). For model VVI (varying variances, covariances fixed to zero), I get many NA values for BIC when n_profiles > 5. I figured out that the argument prior = priorControl() can possibly fix that, and indeed it does! But now I get a huge bump in BIC when n_profiles > 5, which indicates better model fit.
Does anyone have a guess what is going on here, or a recommendation for how to deal with it? Any thoughts are appreciated. I hope the code below and the attached plots illustrate the issue.
Many thanks!!
###not run
library(mclust)
library(tidyLPA)
library(dplyr)
# cluster_1 is an imputed subset of 9 variables
cluster_1 <- subset %>%
  single_imputation() %>%
  mutate_all(list(scale))
# mclustBIC without priorControl
set.seed(0408)
BIC_m12a <- mclustBIC(cluster_1, modelNames = c("EEI", "VVI"))
BIC_m12a
Bayesian Information Criterion (BIC):
EEI VVI
1 -127005.8 -127005.8
2 -122298.6 -121027.1
3 -120579.4 -119558.0
4 -119883.4 -118765.7
5 -119244.2 -118293.6
6 -119064.4 NA
7 -118771.5 NA
8 -118681.0 NA
9 -118440.2 NA
# mclustBIC with priorControl
set.seed(0408)
BIC_m12b <- mclustBIC(cluster_1, modelNames = c("EEI", "VVI"), prior=priorControl())
BIC_m12b
Bayesian Information Criterion (BIC):
EEI VVI
1 -127006.0 -127006.0
2 -122299.5 -121028.2
3 -120587.9 -119560.0
4 -119884.4 -118771.7
5 -119235.4 -118296.9
6 -118933.9 -112761.8
7 -118776.2 -112579.7
8 -118586.2 -112353.3
9 -118472.0 -112460.2
[plots: BIC by number of profiles for the two runs, without and with priorControl()]
I have a data set that has multiple engines, and I want to create a for loop to run a linear regression for each engine and to extract the coefficients of each regression. So the first regression should run across the six weeks with engine = google, the second across the six weeks with engine = bing, and so on. A sample of the data set looks like this:
Engine Wk Imp Clicks lnSpend Actions CPA
google 1 100302 15791 10998 31 354.79
google 2 23893 4734 2866 16 179.18
google 3 318 16 37.83 11 3.44
google 4 7992 1980 1704.81 27 63.14
google 5 13206 3292 2732.13 26 105.08
google 6 10888 2966 2293.86 22 104.27
bing 1 23536 1808 1028.95 3 342.98
bing 2 86873 7196 2740.28 14 195.73
bing 3 54654 4398 1786.96 13 137.46
bing 4 45553 3353 1860.47 13 143.11
bing 5 41254 3322 1811.80 13 139.37
bing 6 38305 3117 1501.01 19 79.00
The regression equation is Actions ~ lnSpend, and this remains constant across all the engines.
This is the code that I have so far:
for (i in unique(mydata$engine)) {
  reg <- append(reg, lm(mydata$Actions ~ mydata$lnspend, data = mydata[mydata$engine == i, ]))
}
summary(reg)
However, when I do that, the regression runs on the full data set combining all of the engines together.
I also tried using a by function. The code that I have for that is
reg<- by(mydata$engine, function(mydata) lm(Actions~ lnspend, data=mydata))
sapply(reg, coef)
When I run that I get the following error:
"Error in unique.default(x, nmax = nmax)"
Any idea how to fix it?
Using data.table:
library(data.table)
mydata=data.table(mydata)
mydata[,as.list(lm(Actions~lnSpend)$coeff),by=Engine]
Engine (Intercept) lnSpend
1: google 17.632263 0.001318611
2: bing 4.735979 0.004341699
You can also make it work by adjusting your first for loop a bit:
Engine <- c('g','g','g','g','g','g','b','b','b','b','b','b')
Actions <- c(31,16,11,27,26,22,3,14,13,13,13,19)
lnSpend <- c(10998,2866,37.83,1704.81,2732,2293,1028,2740,1786,1860,1811,1501)
df <- data.frame(Engine,Actions,lnSpend)
reg <- c()
for (eng in unique(Engine)) {
  m <- lm(Actions ~ lnSpend, data = df[which(df$Engine == eng), ])
  reg <- append(reg, m$coeff)
}
reg
# > reg
# (Intercept) lnSpend (Intercept) lnSpend
# 17.632629162 0.001318568 4.734059476 0.004344177
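One design note on the loop above: both engines contribute elements named "(Intercept)" and "lnSpend", so the combined vector has duplicated names. A small variation (assuming the same df as above) that keeps the results distinguishable is to store one coefficient vector per engine in a named list:
reg <- list()
for (eng in as.character(unique(df$Engine))) {
  # key each coefficient vector by its engine name
  reg[[eng]] <- coef(lm(Actions ~ lnSpend, data = df[df$Engine == eng, ]))
}
reg[["g"]]
# (Intercept)      lnSpend
# 17.632629162  0.001318568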
Your for loop should work; you just did not filter your dependent and independent variables by engine type inside the loop, so R takes the full dataset.
Consider either explicitly referencing the subset filter in each variable:
for (i in unique(mydata$engine)) {
  reg <- append(reg, lm(mydata$Actions[mydata$engine == i] ~ mydata$lnspend[mydata$engine == i],
                        data = mydata[mydata$engine == i, ]))
}
Or leave the variables unqualified, so that the data argument dictates the structure:
for (i in unique(mydata$engine)) {
  reg <- append(reg, lm(Actions ~ lnspend,
                        data = mydata[mydata$engine == i, ]))
}
summary(reg)
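Equivalently, a split()/lapply() sketch (assuming the lower-case column names from the question's code) fits one model per engine and collects all the coefficients in one step:
# one lm() per engine, then a coefficient matrix with one column per engine
models <- lapply(split(mydata, mydata$engine),
                 function(d) lm(Actions ~ lnspend, data = d))
sapply(models, coef)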
I use the caret package with a multi-layer perceptron.
My dataset consists of a labelled output value, which can be either A, B or C. The input vector consists of 4 variables.
I use the following lines of code to calculate the class probabilities for each input value:
fit <- train(device ~ ., data = dataframetrain[1:100, ], method = "mlp",
             trControl = trainControl(classProbs = TRUE))
p <- predict(fit, newdata = dataframetest, type = "prob")
I thought that the class probabilities for each record must sum to one, but I get the following:
rowSums(p)
# 1 2 3 4 5 6 7 8
# 1.015291 1.015265 1.015291 1.015291 1.015291 1.014933 1.015011 1.015291
# 9 10 11 12 13 14 15 16
# 1.014933 1.015206 1.015291 1.015291 1.015291 1.015224 1.015011 1.015291
Can anybody help me? I don't know what I did wrong.
There's probably nothing wrong; it just seems that caret returns the values of the neurons in the output layer without converting them to probabilities (correct me if I'm wrong). When using the RSNNS::mlp function outside of caret, the rows of the predictions also don't sum to one.
Since all output neurons have the same activation function the outputs can be converted to probabilities by dividing the predictions by the respective row sum, see this question.
This behavior seems to be true when using method = "mlp" or method = "mlpWeightDecay" but when using method = "nnet" the predictions do sum to one.
Example:
library(RSNNS)
data(iris)
# shuffle the rows
iris <- iris[sample(nrow(iris)), ]
irisValues <- iris[, 1:4]
irisTargets <- iris[, 5]
irisTargetsDecoded <- decodeClassLabels(irisTargets)
iris2 <- splitForTrainingAndTest(irisValues, irisTargetsDecoded, ratio = 0.15)
iris2 <- normTrainingAndTestSet(iris2)
set.seed(432)
model <- mlp(iris2$inputsTrain, iris2$targetsTrain,
             size = 5, learnFuncParams = c(0.1), maxit = 50,
             inputsTest = iris2$inputsTest, targetsTest = iris2$targetsTest)
predictions <- predict(model,iris2$inputsTest)
head(rowSums(predictions))
# 139 26 17 104 54 82
# 1.0227419 1.0770722 1.0642565 1.0764587 0.9952268 0.9988647
probs <- predictions / rowSums(predictions)
head(rowSums(probs))
# 139 26 17 104 54 82
# 1 1 1 1 1 1
# nnet example --------------------------------------
library(caret)
training <- sample(seq_along(irisTargets), size = 100, replace = FALSE)
modelCaret <- train(y = irisTargets[training],
                    x = irisValues[training, ],
                    method = "nnet")
predictionsCaret <- predict(modelCaret,
                            newdata = irisValues[-training, ],
                            type = "prob")
head(rowSums(predictionsCaret))
# 122 100 89 134 30 86
# 1 1 1 1 1 1
I don't know how much flexibility the caret package offers in these choices, but the standard way to make a neural net produce outputs which sum to one is to use the softmax function as the activation function in the output layer.
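For reference, a minimal sketch of softmax itself (plain R, not tied to caret or RSNNS): it exponentiates the raw outputs and normalizes them so they always sum to one:
# softmax: map raw output-layer activations to probabilities
softmax <- function(x) exp(x) / sum(exp(x))
out <- softmax(c(2.0, 1.0, 0.1))
out        # roughly 0.66, 0.24, 0.10
sum(out)   # exactly 1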
I am doing a rolling regression with a huge database, and the reference column used for rolling is called "Q", with values from 5 to 45 for each data block. At first I tried it with simple code, step by step, and it works very well:
fit <- as.formula(EB ~ EB1 + EB2 + EB3 + EB4)
# use 20 quarters of data to fit the regression
model <- lm(fit, data = datapool[which(datapool$Q >= 5 & datapool$Q <= 24), ])
# use the model to forecast the value of the next quarter
pre <- predict(model, newdata = datapool[which(datapool$Q == 25), ])
# get the forecast error
error <- datapool[which(datapool$Q == 25), ]$EB - pre
The result of the code above is:
> head(t(t(error)))
[,1]
21 0.006202145
62 -0.003005097
103 -0.019273856
144 -0.016053012
185 -0.025608022
226 -0.004548264
The datapool has the structure below:
> head(datapool)
X Q Firm EB EB1 EB2 EB3
1 1 5 CMCSA US Equity 0.02118966 0.08608825 0.01688180 0.01826571
2 2 6 CMCSA US Equity 0.02331379 0.10506550 0.02118966 0.01688180
3 3 7 CMCSA US Equity 0.01844747 0.12961955 0.02331379 0.02118966
4 4 8 CMCSA US Equity NA NA 0.01844747 0.02331379
5 5 9 CMCSA US Equity 0.01262287 0.05622834 NA 0.01844747
6 6 10 CMCSA US Equity 0.01495291 0.06059339 0.01262287 NA
...
Firm B (also from Q5 to Q45)
...
Firm C (also from Q5 to Q45)
The errors produced above all carry the "X" values from datapool as names, so I can tell which firm each error comes from.
Since I need to run the regression 21 times (quarters 5-24, 6-25, ..., 25-44), I do not want to do it manually, and I have thought out the following code:
fit <- as.formula(EB ~ EB1 + EB2 + EB3 + EB4)
for (i in 0:20) {
  model <- lm(fit, data = datapool[which(datapool$Q >= 5 + i & datapool$Q <= 24 + i), ])
  pre <- predict(model, newdata = datapool[which(datapool$Q == 25 + i), ])
  error <- datapool[which(datapool$Q == 25 + i), ]$EB - pre
}
The code above runs without errors, but each pass overwrites error, and I do not know how to compile the errors produced by all the regressions into one data pool automatically. Can anyone help me with that?
(I say it again: it is a really bad idea to use the name error for a vector, as it clashes with a core function name.) This is how I would have attempted that task, using the subset parameter and indexing rather than the tortured which() statements:
fit <- as.formula(EB ~ EB1 + EB2 + EB3 + EB4)
errset <- vector("list", 21)
for (i in 0:20) {
  model <- lm(fit, data = datapool, subset = Q >= 5 + i & Q <= 24 + i)
  # forecast the next quarter; one prediction (and one error) per firm
  nextq <- datapool[datapool$Q == 25 + i, ]
  pre <- predict(model, newdata = nextq)
  errset[[i + 1]] <- nextq$EB - pre
}
errset
No guarantees this won't error out by running out of data at the beginning or end, since you have not offered either data or a comprehensive description of the data object.
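To compile everything into one object, one option (a sketch assuming the list structure built above, where window i + 1 forecasts quarter 25 + i) is to tag each window's errors with the quarter being forecast before flattening:
# one row per forecast error, tagged with the forecast quarter (25..45)
err_all <- data.frame(Q = rep(25:45, times = lengths(errset)),
                      error = unlist(errset))
head(err_all)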