Finding the best LCA model in poLCA R package - r

I am running an LCA with the poLCA R package, but the analysis has not finished after three days (it has not found the best model yet), and it occasionally gives the following error: "ALERT: iterations finished, MAXIMUM LIKELIHOOD NOT FOUND". So I cancelled the process at 35 latent classes. I am analyzing 16 variables (all of them categorical) and 36036 rows of data. When I tested variable importance for the 16 variables with the Boruta package, all 16 came out as important, so I used all 16 variables in the LCA analysis with poLCA. Which path should I follow? Should I use another clustering method, such as k-modes, for clustering the categorical variables in this dataset? I use maxiter=500 iterations and nrep=10 repetitions per model. The R script I use to find the best model in LCA, and one of the outputs, is as follows:
min_bic <- Inf  # start high so the first fitted model is kept
for(i in 2:50){
  lc <- poLCA(f, data, nclass=i, maxiter=500,
              tol=1e-5, na.rm=FALSE,
              nrep=10, verbose=TRUE, calc.se=TRUE)
  if(lc$bic < min_bic){
    min_bic <- lc$bic
    LCA_best_model <- lc
  }
}
=========================================================
Fit for 35 latent classes:
=========================================================
number of observations: 36036
number of estimated parameters: 2029
residual degrees of freedom: 34007
maximum log-likelihood: -482547.1
AIC(35): 969152.2
BIC(35): 986383
G^2(35): 233626.8 (Likelihood ratio/deviance statistic)
X^2(35): 906572555 (Chi-square goodness of fit)
ALERT: iterations finished, MAXIMUM LIKELIHOOD NOT FOUND

The script you are using sequentially tests every model from 2 to 50 classes and keeps the one with the lowest BIC. BIC is not the only criterion, nor necessarily the best one, for selecting "the best" model, but it is fair enough.
The problem is that you are estimating a LOT of parameters, especially in the later steps. The more classes you fit, the more time-consuming the process becomes. Also, in these cases convergence problems are to be expected because you are fitting so many classes. That is what the error message reports: it cannot find the maximum likelihood for a model with 35 classes.
I don't know what problem you are trying to solve, but models with over 10 classes are unusual in LCA. You do LCA to reduce the complexity of your data as much as possible. If you NEED to fit models with many (over 10) classes:
fit them one by one, so RAM consumption is less of a problem.
increase the nrep= argument in the call, so the probability of the model not finding the maximum likelihood by chance (bad random initial values) is reduced; this also increases computing time (see the sketch after this list).
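As a minimal sketch of the one-by-one strategy (reusing the f and data objects from the question; the class range, maxiter and nrep values are only illustrative, not a recommendation):
library(poLCA)
fits <- list()
for (k in 2:15) {                        # far fewer classes than 50
  fits[[k]] <- poLCA(f, data, nclass = k, maxiter = 3000,
                     nrep = 30,          # more random starts per model
                     verbose = FALSE, calc.se = FALSE)    # skip SEs until the final model
  saveRDS(fits[[k]], paste0("lca_", k, "_classes.rds"))   # keep each fit in case the session dies
}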
Alternatively, you can reduce computing time by running models in parallel. Almost every modern PC has 2 or more cores. The function acl() in the next block does this with foreach() and %dopar%, so it is OS independent.
library(poLCA)
library(foreach)
library(doParallel)

registerDoParallel(cores=2)  # use as many physical cores as available

acl <- function(datos,    # a data.frame with your data
                k,        # the maximum number of classes to fit
                formula) {
  foreach(i=1:k, .packages="poLCA") %dopar%
    poLCA(formula, datos, nclass=i)
}
acl() returns a list of models; you can pick "the best" later. The next function retrieves the quantities of interest from that list and creates a nicely formatted data.frame with useful information for selecting the right number of classes.
comparar_clases_acl <- function(modelo) {
  entropy <- function(p) sum(-p*log(p))  # to assess the quality of classification
  tabla_LCA <- data.frame(Modelo=0, BIC=0, Lik_ratio=0, Entropia=0, MenorClase=0)  # empty data.frame to preallocate memory
  for(i in 1:length(modelo)){
    tabla_LCA[i,1] <- paste("Modelo", i)
    tabla_LCA[i,2] <- modelo[[i]]$bic
    tabla_LCA[i,3] <- modelo[[i]]$Gsq
    error_prior <- entropy(modelo[[i]]$P)  # entropy of the prior class probabilities
    error_post  <- mean(apply(modelo[[i]]$posterior, 1, entropy), na.rm = TRUE)  # mean entropy of the posteriors
    tabla_LCA[i,4] <- round(((error_prior - error_post) / error_prior), 3)  # relative entropy reduction
    tabla_LCA[i,5] <- min(modelo[[i]]$P)*100  # size of the smallest class, in %
  }
  return(tabla_LCA)
}
It takes only one argument: an object with a list of LCA models, exactly what acl() returns.
This parallel approach should reduce computing time. Still, 50 classes is too many, and you are probably reaching the smallest BIC well before 50 classes. Remember, BIC penalizes models as the number of estimated parameters increases, helping you find the point of diminishing returns for an extra class in your model.
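A usage sketch tying the two helpers together (not part of the original answer; f and data are the formula and data.frame from the question, and k=10 is arbitrary):
modelos <- acl(datos = data, k = 10, formula = f)   # fit 1 to 10 classes in parallel
resumen <- comparar_clases_acl(modelos)
resumen                                             # inspect BIC, entropy and smallest-class size
mejor   <- modelos[[which.min(resumen$BIC)]]        # keep the lowest-BIC model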

Related

Computational speed of a complex Hierarchical GAM

I have a large dataset (3.5+ million observations) of a binary response variable that I am trying to compute a Hierarchical GAM with a global smoother with individual effects that have a Shared penalty (e.g. 'GS' in Pedersen et al. 2019). Specifically I am trying to estimate the following structure: Global > Geographic Zone (N=2) > Bioregion (N=20) > Season (N varies by bioregion). In total, I am trying to estimate 36 different nested parameters.
Here is the code I am currently using:
modGS <- bam(
outbreak ~
te(days_diff,NDVI_mean,bs=c("tp","tp"),k=c(5,5)) +
t2(days_diff, NDVI_mean, Zone, Bioregion, Season, bs=c("tp", "tp","re","re","re"),k=c(5, 5), m=2, full=TRUE) +
s(Latitude,Longitude,k=50),
family=binomial(),select = TRUE,data=dat)
My main issue is that it is taking a long time (5+ days) to construct the model. This nesting structure cannot be discretized, so I cannot compute it in parallel. Further, I have tried gamm4, but I ran into memory limit issues. Here is the gamm4 code:
modGS <- gamm4(
outbreak ~
t2(days_diff,NDVI_mean,bs=c("tp","tp"),k=c(5,5)) +
t2(days_diff, NDVI_mean, Zone, Bioregion, Season, bs=c("tp", "tp","re","re","re"),k=c(5, 5), m=2, full=TRUE) +
s(Latitude,Longitude,k=50),
family=binomial(),select = TRUE,data=dat)
What is the best/most computationally feasible way to run this model?
I cut down the computational time by reducing the number of bioregion levels and randomly sampling ca. 60% of the data. This actually allowed me to calculate OOB error for the model.
There is an article I read recently that has a specific section on decreasing computational time. The main things it highlights are:
Use the bam function with its fREML estimation, which refactorizes the model matrix to make the calculation faster. It seems you have already done that.
Adding the discrete = TRUE argument, which uses only a smaller, finite number of unique covariate values for estimation.
Setting nthreads in this function so it runs on more than one core in parallel on your computer.
As the authors caution, the second option can reduce the accuracy of your estimates. I fit some large models recently doing this and found that the results were not always the same as with the default bam call, so it's best to use this as a quick inspection rather than as the full result you are looking for. A sketch combining these options follows below.
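Putting those options together on the model from the question, a sketch might look like this (the discrete and nthreads settings are illustrative, and dat is the data.frame assumed in the question):
library(mgcv)
modGS_fast <- bam(
  outbreak ~
    te(days_diff, NDVI_mean, bs = c("tp", "tp"), k = c(5, 5)) +
    t2(days_diff, NDVI_mean, Zone, Bioregion, Season,
       bs = c("tp", "tp", "re", "re", "re"), k = c(5, 5), m = 2, full = TRUE) +
    s(Latitude, Longitude, k = 50),
  family = binomial(), select = TRUE, data = dat,
  method = "fREML",    # fast REML smoothness selection (bam's default)
  discrete = TRUE,     # discretise covariates: approximate but much faster
  nthreads = 4)        # spread the work over several cores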

Writing syntax for bivariate survival censored data to fit copula models in R

library(Sunclarco)
library(MASS)
library(survival)
library(SPREDA)
library(SurvCorr)
library(doBy)
#Dataset
data("diabetes")  # loads the diabetes data.frame (shipped with the SurvCorr package)
data1=subset(diabetes,select=c("LASER","TRT_EYE","AGE_DX","ADULT","TIME1","STATUS1"))
data2=subset(diabetes,select=c("LASER","TRT_EYE","AGE_DX","ADULT","TIME2","STATUS2"))
#Adding variable which identify cluster
data1$CLUSTER<- rep(1,197)
data2$CLUSTER<- rep(2,197)
#Renaming the variables so that we have uniformity in the common items in the data
names(data1)[5] <- "TIME"
names(data1)[6] <- "STATUS"
names(data2)[5] <- "TIME"
names(data2)[6] <- "STATUS"
#merge the files
Total_data=rbind(data1,data2)
# Re arranging the database
diabete_full=orderBy(~LASER+TRT_EYE+AGE_DX,data=Total_data)
diabete_full
#using Sunclarco package for Clayton and Gumbel
Clayton_1step <- SunclarcoModel(data=diabete_full,time="TIME",status="STATUS",
clusters="CLUSTER",covariates=c("LASER","TRT_EYE","ADULT"),
stage=1,copula="Clayton",marginal="Weibull")
summary(Clayton_1step)
# Estimates StandardErrors
#lambda 0.01072631 0.005818201
#rho 0.79887565 0.058942208
#theta 0.10224445 0.090585891
#beta_LASER 0.16780224 0.157652947
#beta_TRT_EYE 0.24580489 0.162333369
#beta_ADULT 0.09324001 0.158931463
# Estimate StandardError
#Kendall's Tau 0.04863585 0.04099436
Clayton_2step <- SunclarcoModel(data=diabete_full,time="TIME",status="STATUS",
clusters="CLUSTER",covariates=c("LASER","TRT_EYE","ADULT"),
stage=2,copula="Clayton",marginal="Weibull")
summary(Clayton_2step)
# Estimates StandardErrors
#lambda 0.01131751 0.003140733
#rho 0.79947406 0.012428824
#beta_LASER 0.14244235 0.041845100
#beta_TRT_EYE 0.27246433 0.298184235
#beta_ADULT 0.06151645 0.253617142
#theta 0.18393973 0.151048024
# Estimate StandardError
#Kendall's Tau 0.08422381 0.06333791
Gumbel_1step <- SunclarcoModel(data=diabete_full,time="TIME",status="STATUS",
                               clusters="CLUSTER",covariates=c("LASER","TRT_EYE","ADULT"),
                               stage=1,copula="GH",marginal="Weibull")
summary(Gumbel_1step)
# Estimates StandardErrors
#lambda 0.01794495 0.01594843
#rho 0.70636113 0.10313853
#theta 0.87030690 0.11085344
#beta_LASER 0.15191936 0.14187943
#beta_TRT_EYE 0.21469814 0.14736381
#beta_ADULT 0.08284557 0.14214373
# Estimate StandardError
#Kendall's Tau 0.1296931 0.1108534
Gumbel_2step <- SunclarcoModel(data=diabete_full,time="TIME",status="STATUS",
clusters="CLUSTER",covariates=c("LASER","TRT_EYE","ADULT"),
stage=2,copula="GH",marginal="Weibull")
I am required to fit copula models in R for different copula classes, particularly the Gaussian, FGM, Plackett and possibly Frank (if I still have time). The data I am using is the diabetes data available in R through the survival and SurvCorr packages.
This is for my thesis, and it is an exploratory study of how different copula classes serve different purposes, in that they lead to different results on the same data. I found the Sunclarco package, with which I was able to fit the Clayton and Gumbel copula classes, but it is not yet available for the other classes.
The challenge I am facing is that, since I have censored data which has to be incorporated in the likelihood estimation, it becomes harder for me to write the syntax, as I don't have a strong programming background. In addition, I have to incorporate the covariates and see whether or not they have an impact on the association. However, talking to my promoter, he gave me insights on how to approach writing the syntax for this puzzle, which go as follows:
• First of all, forget about the likelihood function. We only work with the log-likelihood function. In this way, you do not need to take the product of the contributions over each of the observations, but can take the sum of the log-contributions over the different observations.
• Next, since we have a balanced design, we can use the regular data frame structure in which we have only one row per cluster in the data frame. The different variables, such as the lifetimes, the censoring indicators and all the covariates, are the columns of this data frame.
• Due to the bivariate setting, there are only 4 possible ways to contribute to the log-likelihood function: both uncensored, both censored, first uncensored and second censored, or first censored and second uncensored. To create the log-likelihood function, you create a new variable in your data frame in which you put the correct log-likelihood contribution based on which individual in the couple is censored. When you take the sum of this variable, you have the value of the log-likelihood function.
• Since this function depends on parameters, you can use any optimizer, like optim or nlm, to get your optimal values. Be careful here: optim and nlm look for the minimum of a function, not the maximum. This is easily solved, since the minimum of -f is attained at the same point as the maximum of f.
• Since you have, for each copula function, the different expressions for the derivatives, it should be possible to build the likelihood functions now.
I am still struggling, because for each copula class the likelihood changes, as the generator function is unique to the respective copula and needs to be adapted during estimation. Lastly, I should run the analysis for both one-stage and two-stage copula estimation, as I will use them to compare results.
If someone could help me figure it out, I would be eternally grateful. Even if it is for just one copula class, e.g. the Gaussian, I will work out the rest based on that one, since I have tried everything and still have nothing to show for it, and I now feel time is running out to find the answers by myself.
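To make the promoter's outline concrete, here is a minimal sketch of such a log-likelihood, written for illustration only, using the Clayton copula with Weibull margins because its derivatives are short; for the Gaussian, Plackett, FGM or Frank copulas you would substitute their own C(u,v), dC/du and copula density. The parameterisation and the object names (t1, t2, d1, d2, X) are assumptions, not Sunclarco internals.
# data assumed in wide form: one row per cluster, with times t1, t2,
# status indicators d1, d2 and a covariate matrix X
neg_loglik_clayton <- function(par, t1, t2, d1, d2, X) {
  lambda <- exp(par[1])        # Weibull baseline scale (kept positive via log scale)
  rho    <- exp(par[2])        # Weibull shape
  theta  <- exp(par[3])        # Clayton dependence parameter (> 0)
  beta   <- par[-(1:3)]        # regression coefficients

  lp <- as.vector(X %*% beta)  # linear predictor, assumed common to both margins

  S1 <- exp(-lambda * exp(lp) * t1^rho)             # marginal survival functions
  S2 <- exp(-lambda * exp(lp) * t2^rho)
  f1 <- lambda * exp(lp) * rho * t1^(rho - 1) * S1  # marginal densities
  f2 <- lambda * exp(lp) * rho * t2^(rho - 1) * S2

  A <- S1^(-theta) + S2^(-theta) - 1                # Clayton kernel u^-theta + v^-theta - 1

  # the four censoring patterns, each with its own log-contribution
  ll <- ifelse(d1 == 1 & d2 == 1,   # both observed: copula density times both densities
               log(1 + theta) - (theta + 1) * (log(S1) + log(S2)) -
                 (1/theta + 2) * log(A) + log(f1) + log(f2),
        ifelse(d1 == 1 & d2 == 0,   # first observed, second censored: dC/du times f1
               -(theta + 1) * log(S1) - (1/theta + 1) * log(A) + log(f1),
        ifelse(d1 == 0 & d2 == 1,   # first censored, second observed: dC/dv times f2
               -(theta + 1) * log(S2) - (1/theta + 1) * log(A) + log(f2),
               -(1/theta) * log(A))))  # both censored: joint survival C(S1, S2)

  -sum(ll)   # optim() minimises, so return the negative log-likelihood
}

# hypothetical usage, once t1, t2, d1, d2 and X have been built from diabete_full:
# fit <- optim(par = c(0, 0, 0, rep(0, ncol(X))), fn = neg_loglik_clayton,
#              t1 = t1, t2 = t2, d1 = d1, d2 = d2, X = X,
#              method = "BFGS", hessian = TRUE)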

Resampling from multiple strata using boot()

I'm attempting to bootstrap a ZIP estimation while resampling from within specific populations. Each of the populations (clusters) is fundamentally different in some way, so I would like to represent them proportionally in the bootstrapping. The strata argument will do that.
I sometimes encounter the following error:
Error in solve.default(as.matrix(fit$hessian)) :
system is computationally singular: reciprocal condition number = 2.02001e-16
Here's a way to replicate the problem, and it should only take about a minute or so to run, depending on your computer:
#Load dependencies
library(AER)
library(boot)
library(pscl)
library(sampling)
#generate some fake data. The seed is set to make it replicable.
set.seed(1)
x1<-rpois(1000,1)
set.seed(1)
x2<-rnorm(1000,0,1)
set.seed(1)
e<-round(runif(1000,0,1)) #this should add some disruption and prevent any multicollinearity.
pop<-rep(1:10,length.out=1000) #there are 10 populations
y<-x1*abs(floor(x2*sqrt(pop)))+e #the populations each impact the y variable somewhat differently
fake_data<-as.data.frame(cbind(y,x1,x2,pop))
fake_data$pop<-factor(pop) #they are not actually simple scalars.
#Run ZIP process, confirm it works. I understand it's not a matching model.
system.time(zip<-zeroinfl(y ~ x1+x2+pop | x1+x2+pop, data=fake_data))
#storing estimates to speed up bootstrapping phase. General technique from http://www.ats.ucla.edu/stat/r/dae/zipoisson.htm
count_hold<-as.data.frame(dput(coef(zip, "count")))
count_short<-c(count_hold[,1])
zero_hold<-as.data.frame(dput(coef(zip, "zero")))
zero_short<-c(zero_hold[,1])
#bootstrapping
f <- function(fake_data, i) {
zip_boot<- zeroinfl(y ~ x1+x2+pop | x1+x2+pop, data=fake_data[i,], start=list(count=count_short, zero=zero_short))
return(coef(zip_boot))
} #defines function for R to repeat in bootstrapping phase.
set.seed(1)
system.time(res <- boot(fake_data, f, R =50, strata=fake_data$pop)) #adjust the number of cpus to match your computer.
There ought to be enough samples, considering that I have 900+ degrees of freedom, and at least 100 samples in each population to grab my resampling estimates from.
My questions:
1) What did I do that is causing this multicollinearity?

setting values for ntree and mtry for random forest regression model

I'm using R package randomForest to do a regression on some biological data. My training data size is 38772 X 201.
I just wondered: what would be a good value for the number of trees, ntree, and the number of variables tried at each split, mtry? Is there an approximate formula to find such parameter values?
Each row in my input data is a 200-character string representing an amino acid sequence, and I want to build a regression model that uses such sequences to predict the distances between the proteins.
The default for mtry is quite sensible so there is not really a need to muck with it. There is a function tuneRF for optimizing this parameter. However, be aware that it may cause bias.
There is no optimization for the number of bootstrap replicates. I often start with ntree=501 and then plot the random forest object. This will show you the error convergence based on the OOB error. You want enough trees to stabilize the error, but not so many that you over-correlate the ensemble, which leads to overfitting.
Here is the caveat: variable interactions stabilize at a slower rate than error, so if you have a large number of independent variables you need more replicates. I would keep ntree an odd number so ties can be broken.
For the dimensions of your problem I would start with ntree=1501. I would also recommend looking into one of the published variable selection approaches to reduce the number of independent variables. A sketch of the plotting check follows below.
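A minimal sketch of that check (the object names rf_data and dist_y are stand-ins for the questioner's 38772 x 200 predictor table and distance response; they are not from the original answer):
library(randomForest)
rf_fit <- randomForest(x = rf_data, y = dist_y, ntree = 1501)
plot(rf_fit)           # OOB MSE against the number of trees; look for where the curve flattens
which.min(rf_fit$mse)  # tree count at which the OOB error was smallest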
The short answer is no.
The randomForest function of course has default values for both ntree and mtry. The default for mtry is often (but not always) sensible, while generally people will want to increase ntree from its default of 500 quite a bit.
The "correct" value for ntree generally isn't much of a concern, as it will be quite apparent with a little tinkering that the predictions from the model won't change much after a certain number of trees.
You can spend (read: waste) a lot of time tinkering with things like mtry (and sampsize and maxnodes and nodesize etc.), probably to some benefit, but in my experience not a lot. However, every data set will be different. Sometimes you may see a big difference, sometimes none at all.
The caret package has a very general function train that allows you to do a simple grid search over parameter values like mtry for a wide variety of models. My only caution would be that doing this with fairly large data sets is likely to get time consuming fairly quickly, so watch out for that.
Also, somehow I forgot that the randomForest package itself has a tuneRF function that is specifically for searching for the "optimal" value of mtry.
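For completeness, a small caret sketch of that kind of grid search (a sketch only; x and y stand for the questioner's predictors and response, and the mtry values are arbitrary):
library(caret)
library(randomForest)
ctrl <- trainControl(method = "cv", number = 5)   # 5-fold cross-validation
grid <- expand.grid(mtry = c(5, 10, 20, 50))      # candidate mtry values
rf_tuned <- train(x = x, y = y, method = "rf",
                  ntree = 500, tuneGrid = grid, trControl = ctrl)
print(rf_tuned)   # resampled RMSE for each mtry; train keeps the best one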
Could this paper help?
Limiting the Number of Trees in Random Forests
Abstract. The aim of this paper is to propose a simple procedure that a priori determines a minimum number of classifiers to combine in order to obtain a prediction accuracy level similar to the one obtained with the combination of larger ensembles. The procedure is based on the McNemar non-parametric test of significance. Knowing a priori the minimum size of the classifier ensemble giving the best prediction accuracy constitutes a gain for time and memory costs, especially for huge data bases and real-time applications. Here we applied this procedure to four multiple classifier systems with C4.5 decision tree (Breiman's Bagging, Ho's Random subspaces, their combination we labeled 'Bagfs', and Breiman's Random forests) and five large benchmark data bases. It is worth noticing that the proposed procedure may easily be extended to other base learning algorithms than a decision tree as well. The experimental results showed that it is possible to limit significantly the number of trees. We also showed that the minimum number of trees required for obtaining the best prediction accuracy may vary from one classifier combination method to another.
They never use more than 200 trees.
One nice trick that I use is to start by taking the square root of the number of predictors and plugging that value in for mtry. It is usually around the same value that the tuneRF function in randomForest would pick.
I use the code below to check accuracy as I play around with ntree and mtry (change the parameters):
library(randomForest)
library(ROCR)  # provides the prediction() and performance() functions used below
results_df <- data.frame(matrix(ncol = 8))
colnames(results_df)[1]="No. of trees"
colnames(results_df)[2]="No. of variables"
colnames(results_df)[3]="Dev_AUC"
colnames(results_df)[4]="Dev_Hit_rate"
colnames(results_df)[5]="Dev_Coverage_rate"
colnames(results_df)[6]="Val_AUC"
colnames(results_df)[7]="Val_Hit_rate"
colnames(results_df)[8]="Val_Coverage_rate"
trees = c(50,100,150,250)
variables = c(8,10,15,20)
for(i in 1:length(trees))
{
ntree = trees[i]
for(j in 1:length(variables))
{
mtry = variables[j]
rf<-randomForest(x,y,ntree=ntree,mtry=mtry)
pred<-as.data.frame(predict(rf,type="class"))
class_rf<-cbind(dev$Target,pred)
colnames(class_rf)[1]<-"actual_values"
colnames(class_rf)[2]<-"predicted_values"
dev_hit_rate = nrow(subset(class_rf, actual_values ==1&predicted_values==1))/nrow(subset(class_rf, predicted_values ==1))
dev_coverage_rate = nrow(subset(class_rf, actual_values ==1&predicted_values==1))/nrow(subset(class_rf, actual_values ==1))
pred_prob<-as.data.frame(predict(rf,type="prob"))
prob_rf<-cbind(dev$Target,pred_prob)
colnames(prob_rf)[1]<-"target"
colnames(prob_rf)[2]<-"prob_0"
colnames(prob_rf)[3]<-"prob_1"
pred<-prediction(prob_rf$prob_1,prob_rf$target)
auc <- performance(pred,"auc")
dev_auc<-as.numeric(auc@y.values)
pred<-as.data.frame(predict(rf,val,type="class"))
class_rf<-cbind(val$Target,pred)
colnames(class_rf)[1]<-"actual_values"
colnames(class_rf)[2]<-"predicted_values"
val_hit_rate = nrow(subset(class_rf, actual_values ==1&predicted_values==1))/nrow(subset(class_rf, predicted_values ==1))
val_coverage_rate = nrow(subset(class_rf, actual_values ==1&predicted_values==1))/nrow(subset(class_rf, actual_values ==1))
pred_prob<-as.data.frame(predict(rf,val,type="prob"))
prob_rf<-cbind(val$Target,pred_prob)
colnames(prob_rf)[1]<-"target"
colnames(prob_rf)[2]<-"prob_0"
colnames(prob_rf)[3]<-"prob_1"
pred<-prediction(prob_rf$prob_1,prob_rf$target)
auc <- performance(pred,"auc")
val_auc<-as.numeric(auc@y.values)
results_df = rbind(results_df,c(ntree,mtry,dev_auc,dev_hit_rate,dev_coverage_rate,val_auc,val_hit_rate,val_coverage_rate))
}
}

RandomForest for Regression in R

I'm experimenting with R and the randomForest package; I have some experience with SVMs and neural nets.
My first test is to try to regress sin(x) + Gaussian noise.
With neural nets and SVM I obtain a "relatively" nice approximation of sin(x), so the noise is filtered out and the learning algorithm doesn't overfit (for decent parameters).
When doing the same with randomForest I get a completely overfitted solution.
I simply use (R 2.14.0, tried on 2.14.1 too, just in case):
library("randomForest")
x<-seq(-3.14,3.14,by=0.00628)
noise<-rnorm(1001)
y<-sin(x)+noise/4
mat<-matrix(c(x,y),ncol=2,dimnames=list(NULL,c("X","Y")))
plot(x,predict(randomForest(Y~.,data=mat),mat),col="green")
points(x,y)
I guess there is a magic option in randomForest to make it work correctly; I tried a few but did not find the right lever to pull...
You can use maxnodes to limit the size of the trees,
as in the examples in the manual.
r <- randomForest(Y~.,data=mat, maxnodes=10)
plot(x,predict(r,mat),col="green")
points(x,y)
You can do a lot better (RMSE ~ 0.04, $R^2$ > 0.99) by training individual trees on small samples, or "bites" as Breiman called them.
Since there is a significant amount of noise in the training data, this problem is really about smoothing rather than generalization. In general machine learning terms, this requires increasing regularization. For an ensemble learner this means trading strength for diversity.
Diversity of random forests can be increased by reducing the number of candidate features per split (mtry in R) or the training set of each tree (sampsize in R). Since there is only 1 input dimension, mtry does not help, leaving sampsize. This leads to a 3.5x improvement in RMSE over the default settings and a >6x improvement over the noisy training data itself. Since increased diversity means increased variance in the predictions of the individual learners, we also need to increase the number of trees to stabilize the ensemble prediction.
small bags, more trees :: rmse = 0.04:
> sd(predict(randomForest(Y~.,data=mat, sampsize=60, nodesize=2,
                          replace=FALSE, ntree=5000),
             mat)
     - sin(x))
[1] 0.03912643
default settings :: rmse = 0.14:
> sd(predict(randomForest(Y~.,data=mat),mat) - sin(x))
[1] 0.1413018
error due to noise in training set :: rmse = 0.25:
> sd(y - sin(x))
[1] 0.2548882
The error due to noise is of course evident from
noise<-rnorm(1001)
y<-sin(x)+noise/4
In the above the evaluation is being done against the training set, as it is in the original question. Since the issue is smoothing rather than generalization, this is not as egregious as it may seem, but it is reassuring to see that out of bag evaluation shows similar accuracy:
> sd(predict(randomForest(Y~.,data=mat, sampsize=60, nodesize=2,
                          replace=FALSE, ntree=5000))
     - sin(x))
[1] 0.04059679
My intuition is that:
if you had a simple decision tree to fit a 1-dimensional curve f(x), that would be equivalent to fitting it with a staircase function (not necessarily with equally spaced jumps)
with random forests you will make a linear combination of staircase functions
For a staircase function to be a good approximator of f(x), you want enough steps on the x axis, but each step should contain enough points so that their mean is a good approximation of f(x) and is less affected by noise.
So I suggest you tune the nodesize parameter. If you have 1 decision tree and N points, and nodesize=n, then your staircase function will have N/n steps. Too small an n leads to overfitting. I got nice results with n~30 (RMSE~0.07):
r <- randomForest(Y~.,data=mat, nodesize=30)
plot(x,predict(r,mat),col="green")
points(x,y)
Notice that RMSE gets smaller if you take N'=10*N and n'=10*n.
