poLCA - Latent Class: how to do the adjusted Lo-Mendell-Rubin (LMR) test with R

Good afternoon,
I am trying to perform Lo, Mendell and Rubin's (2001) adjusted test (LMR) in order to decide the optimal number of classes in LCA. I fitted the models with poLCA, but I could not find any command in the package to perform the test.
Is there someone that can help me?
Thank you very much!

Here is an example of an (ad-hoc adjusted) LMR test comparing an LCA with 3 groups (alternative model) against one with 2 groups (baseline model).
# load packages/install if needed
library(poLCA)
library(tidyLPA)
data("election")
# Fit LCA with 2 classes (NULL model)
mod_null <- poLCA(formula = cbind(MORALG, CARESG, KNOWG) ~ 1,
data = election, nclass = 2, verbose = F)
# store values baseline model
n <- mod_null$Nobs #number of observations (should be equal in both models)
null_ll <- mod_null$llik #log-likelihood
null_param <- mod_null$npar # number of parameters
null_classes <- length(mod_null$P) # number of classes
# Fit LCA with 3 classes (ALTERNATIVE model)
mod_alt <- poLCA(formula = cbind(MORALG, CARESG, KNOWG) ~ 1,
data = election, nclass = 3, verbose = F)
# Store values alternative model
alt_ll <- mod_alt$llik #log-likelihood
alt_param <- mod_alt$npar # number of parameters
alt_classes <- length(mod_alt$P) # number of classes
# use calc_lrt from tidyLPA package
calc_lrt(n, null_ll, null_param, null_classes, alt_ll, alt_param, alt_classes)
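For readers who want to see the arithmetic, here is a minimal base-R sketch of the ad-hoc adjusted likelihood-ratio statistic. It assumes the adjustment divides the usual statistic 2 * (alt_ll - null_ll) by 1 + 1 / ((p - q) * log(n)), where p - q is the difference in the number of free parameters. That is my reading of Lo, Mendell and Rubin (2001), and it may differ from the correction term calc_lrt uses internally. The function name adjusted_lr and the input numbers are hypothetical.

```r
# Sketch of an ad-hoc adjusted LR statistic (assumed formula, see text above).
adjusted_lr <- function(n, null_ll, alt_ll, null_param, alt_param) {
  lr     <- 2 * (alt_ll - null_ll)        # plain likelihood-ratio statistic
  p_diff <- alt_param - null_param        # difference in free parameters
  adj    <- lr / (1 + 1 / (p_diff * log(n)))
  c(lr = lr, adjusted_lr = adj, df = p_diff)
}

# Hypothetical inputs, for illustration only
res <- adjusted_lr(n = 500, null_ll = -1000, alt_ll = -990,
                   null_param = 7, alt_param = 11)
print(res)
```

The adjusted value is always a little smaller than the plain statistic, and the correction shrinks as n grows.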

Wow, really late to the game, but as I'm looking at similar things I'll leave this for the next person.
The Lo-Mendell-Rubin test involves a transformation of the data and then a chi-squared test to determine whether K classes is a better fit than K-1 classes... basically.
However, there is reasonable research out there suggesting that a better measure of this is the bootstrap likelihood ratio test.
The former is still in common use among Mplus users; the latter is far more common in LCA packages in R, e.g. mclust. Dunno about poLCA though...

Related

How to use lapply with get.confusion_matrix() in R?

I am performing a PLS-DA analysis in R using the mixOmics package. I have one binary Y variable (presence or absence of wetland) and 21 continuous predictor variables (X) with values ranging from 1 to 100.
I have made the model with the data_training dataset and want to predict new outcomes with the data_validation dataset. These datasets have exactly the same structure.
My code looks like:
library(mixOmics)
model.plsda<-plsda(X,Y, ncomp = 10)
myPredictions <- predict(model.plsda, newdata = data_validation[,-1], dist = "max.dist")
I want to predict the outcome based on 10, 9, 8, ... to 2 principal components. By using the get.confusion_matrix function, I want to estimate the error rate for every number of principal components.
prediction <- myPredictions$class$max.dist[,10] #prediction based on 10 components
confusion.mat = get.confusion_matrix(truth = data_validation[,1], predicted = prediction)
get.BER(confusion.mat)
I can do this separately 10 times, but I want to do it a little faster. Therefore I was thinking of making a list with the prediction results for every number of components...
library(BBmisc)
prediction_test <- myPredictions$class$max.dist
predictions_components <- convertColsToList(prediction_test, name.list = T, name.vector = T, factors.as.char = T)
...and then using lapply with the get.confusion_matrix and get.BER function. But then I don't know how to do that. I have searched on the internet, but I can't find a solution that works. How can I do this?
Many thanks for your help!
Without a reproducible example there is no way to test this, but you need to convert the code you want to run each time into a function. Something like this:
confmat <- function(x) {
prediction <- myPredictions$class$max.dist[,x] #prediction based on x components
confusion.mat = get.confusion_matrix(truth = data_validation[,1], predicted = prediction)
get.BER(confusion.mat)
}
Now lapply:
results <- lapply(10:2, confmat)
That will return a list with the get.BER results for each number of PCs, so results[[1]] will be the result for 10 PCs. You will not get values for prediction or confusion.mat unless they are included in the results returned by get.BER. If you want all of that, you need to replace the last line of the function with return(list(prediction, confusion.mat, get.BER(confusion.mat))). This will produce a list of lists, so that results[[1]][[1]] will be the prediction for 10 PCs, and results[[1]][[2]] and results[[1]][[3]] will be confusion.mat and get.BER(confusion.mat) respectively.
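Since the original data are not available, here is a self-contained sketch of the same pattern on made-up data. The matrix pred_mat is a hypothetical stand-in for myPredictions$class$max.dist (one column per number of components), and ber() is a hand-written balanced error rate built on table(), used here instead of the mixOmics functions so the example runs in base R.

```r
set.seed(1)
# Hypothetical truth labels and a stand-in prediction matrix (one column per ncomp)
truth <- factor(sample(c("absent", "present"), 40, replace = TRUE))
pred_mat <- sapply(1:10, function(k) sample(levels(truth), 40, replace = TRUE))
colnames(pred_mat) <- paste0("comp", 1:10)

# Balanced error rate: mean of the per-class error rates
ber <- function(truth, predicted) {
  predicted <- factor(predicted, levels = levels(truth))
  cm <- table(truth, predicted)          # rows = truth, columns = predicted
  mean(1 - diag(cm) / rowSums(cm))
}

# One BER per number of components, from 2 to 10
ber_by_comp <- sapply(2:10, function(k) ber(truth, pred_mat[, k]))
names(ber_by_comp) <- colnames(pred_mat)[2:10]
print(ber_by_comp)
```

Using sapply instead of lapply gives a named numeric vector, which is easier to plot or to pass to which.min() when picking the number of components.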

weird svm behavior in R (e1071)

I ran the following code for a binary classification task w/ an SVM in both R (first sample) and Python (second example).
Given randomly generated data (X) and response (Y), this code performs leave group out cross validation 1000 times. Each entry of Y is therefore the mean of the prediction across CV iterations.
Computing area under the curve should give ~0.5, since X and Y are completely random. However, this is not what we see. Area under the curve is frequently significantly higher than 0.5. The number of rows of X is very small, which can obviously cause problems.
Any idea what could be happening here? I know that I can either increase the number of rows of X or decrease the number of columns to mitigate the problem, but I am looking for other issues.
Y=as.factor(rep(c(1,2), times=14))
X=matrix(runif(length(Y)*100), nrow=length(Y))
library(e1071)
library(pROC)
colnames(X)=1:ncol(X)
iter=1000
ansMat=matrix(NA,length(Y),iter)
for(i in seq(iter)){
#get train
train=sample(seq(length(Y)),0.5*length(Y))
if(min(table(Y[train]))==0)
next
#test from train
test=seq(length(Y))[-train]
#train model
XX=X[train,]
YY=Y[train]
mod=svm(XX,YY,probability=FALSE)
XXX=X[test,]
predVec=predict(mod,XXX)
RFans=attr(predVec,'decision.values')
ansMat[test,i]=as.numeric(predVec)
}
ans=rowMeans(ansMat,na.rm=TRUE)
r=roc(Y,ans)$auc
print(r)
Similarly, when I implement the same thing in Python I get similar results.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import roc_curve, auc

Y = np.array([1, 2]*14)
X = np.random.uniform(size=[len(Y), 100])
n_iter = 1000
ansMat = np.full((len(Y), n_iter), np.nan)
for i in range(n_iter):
    # Get train/test index
    train = np.random.choice(range(len(Y)), size=int(0.5*len(Y)), replace=False, p=None)
    # skip splits whose training set contains only one class
    if len(np.unique(Y[train])) == 1:
        continue
    test = np.array([j for j in range(len(Y)) if j not in train])
    # train model
    mod = SVC(probability=False)
    mod.fit(X=X[train, :], y=Y[train])
    # predict and collect answer
    ansMat[test, i] = mod.predict(X[test, :])
ans = np.nanmean(ansMat, axis=1)
fpr, tpr, thresholds = roc_curve(Y, ans, pos_label=1)
print(auc(fpr, tpr))
You should consider each iteration of cross-validation to be an independent experiment, where you train using the training set, test using the testing set, and then calculate the model skill score (in this case, AUC).
So what you should actually do is calculate the AUC for each CV iteration. And then take the mean of the AUCs.
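A minimal sketch of that advice in base R: compute one AUC per split on the held-out rows only, then average the AUCs. To keep it free of package dependencies, a nearest-centroid classifier stands in for the SVM and auc_rank() is a hand-written rank-based (Mann-Whitney) AUC; both are assumptions of this sketch, not part of the original code.

```r
set.seed(123)
Y <- rep(c(1, 2), times = 14)
X <- matrix(runif(length(Y) * 100), nrow = length(Y))

# Rank-based (Mann-Whitney) AUC; class 2 is treated as the positive class
auc_rank <- function(score, label, positive = 2) {
  r <- rank(score)
  n_pos <- sum(label == positive)
  n_neg <- sum(label != positive)
  (sum(r[label == positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

n_iter <- 200
aucs <- rep(NA_real_, n_iter)
for (i in seq_len(n_iter)) {
  train <- sample(seq_along(Y), length(Y) / 2)
  test  <- setdiff(seq_along(Y), train)
  # both classes are needed in the training and in the test set
  if (length(unique(Y[train])) < 2 || length(unique(Y[test])) < 2) next
  # stand-in classifier: distance to each class centroid of the training set
  c1 <- colMeans(X[train[Y[train] == 1], , drop = FALSE])
  c2 <- colMeans(X[train[Y[train] == 2], , drop = FALSE])
  d1 <- sqrt(rowSums(sweep(X[test, , drop = FALSE], 2, c1)^2))
  d2 <- sqrt(rowSums(sweep(X[test, , drop = FALSE], 2, c2)^2))
  score <- d1 - d2                      # larger means closer to the class-2 centroid
  aucs[i] <- auc_rank(score, Y[test])   # AUC for this split only
}
mean_auc <- mean(aucs, na.rm = TRUE)
print(mean_auc)
```

Because each AUC only compares test rows scored by the same fitted model, a shift in the scores caused by an unbalanced training split cannot leak into the ranking across splits.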

Weighted Kmeans R

I want to do a Kmeans clustering on a dataset (namely, Sample_Data) with three variables (columns) such as below:
    A   B   C
1  12  10   1
2   8  11   2
3  14  10   1
.   .   .   .
.   .   .   .
.   .   .   .
In the typical way, after scaling the columns and determining the number of clusters, I would use this function in R:
Sample_Data <- scale(Sample_Data)
output_kmeans <- kmeans(Sample_Data, centers = 5, nstart = 50)
But what if some variables matter more than others? Suppose variable (column) A is more important than the other two variables.
How can I insert their weights into the model?
Thank you all
You have to use a kmeans weighted clustering, like the one presented in flexclust package:
https://cran.r-project.org/web/packages/flexclust/flexclust.pdf
The function
cclust(x, k, dist = "euclidean", method = "kmeans",
weights=NULL, control=NULL, group=NULL, simple=FALSE,
save.data=FALSE)
Perform k-means clustering, hard competitive learning or neural gas on a data matrix.
weights An optional vector of weights to be used in the fitting process. Works only in combination with hard competitive learning.
A toy example using iris data:
library(flexclust)
data(iris)
cl <- cclust(iris[,-5], k=3, save.data=TRUE,weights =c(1,0.5,1,0.1),method="hardcl")
cl
kcca object of family ‘kmeans’
call:
cclust(x = iris[, -5], k = 3, method = "hardcl", weights = c(1, 0.5, 1, 0.1), save.data = TRUE)
cluster sizes:
1 2 3
50 59 41
As you can see from the output of cclust, even when hard competitive learning is used the family is still reported as kmeans.
The difference is related to cluster assignment during training phase:
If method is "kmeans", the classic kmeans algorithm as given by
MacQueen (1967) is used, which works by repeatedly moving all cluster
centers to the mean of their respective Voronoi sets. If "hardcl",
on-line updates are used (AKA hard competitive learning), which work
by randomly drawing an observation from x and moving the closest
center towards that point (e.g., Ripley 1996).
The weights parameter is just a vector of numbers; in general I use values between 0.01 (minimum weight) and 1 (maximum weight).
I had the same problem and the answer here is not satisfying for me.
What we both wanted was an observation-weighted k-means clustering in R. A good readable example for our question is this link: https://towardsdatascience.com/clustering-the-us-population-observation-weighted-k-means-f4d58b370002
However, the solution using the flexclust package is not satisfying, simply b/c the algorithm used is not the "standard" k-means algorithm but the "hard competitive learning" algorithm. The differences are well described above and in the package description.
I looked through many sites and did not find any solution/package in R for performing a "standard" k-means algorithm with weighted observations. I was also wondering why the flexclust package explicitly does not support weights with the standard k-means algorithm. If anyone has an explanation for this, please feel free to share!
So basically you have two options: First, rewrite the flexclust-algorithm to enable weights within the standard approach. Or second, you can estimate weighted cluster centroids as starting centroids and perform a standard k-means algorithm with only one iteration, then compute new weighted cluster centroids and perform a k-means with one iteration and so on until you reach convergence.
I used the second alternative b/c it was the easier way for me. I used the data.table package, hope you are familiar with it.
rm(list=ls())
library(data.table)
### gen dataset with sample-weights
dataset <- data.table(iris)
dataset[, weights:= rep(c(1, 0.7, 0.3, 4, 5),30)]
dataset[, Species := NULL]
### initial hclust for estimating weighted centroids
clustering <- hclust(dist(dataset[, c(1:4)], method = 'euclidean'),
method = 'ward.D2')
no_of_clusters <- 4
### estimating starting centroids (weighted)
weighted_centroids <- matrix(NA, nrow = no_of_clusters,
ncol = ncol(dataset[, c(1:4)]))
for (i in (1:no_of_clusters))
{
weighted_centroids[i,] <- sapply(dataset[, c(1:4)][cutree(clustering, k =
no_of_clusters) == i,], weighted.mean, w = dataset[cutree(clustering, k = no_of_clusters) == i, weights])
}
### performing weighted k-means as explained in my post
iter <- 0
cluster_i <- 0
cluster_iminus1 <- 1
## while loop: if number of iteration is smaller than 50 and cluster_i (result of
## current iteration) is not identical to cluster_iminus1 (result of former
## iteration) then continue
while(identical(cluster_i, cluster_iminus1) == F && iter < 50){
# update iteration
iter <- iter + 1
# k-means with weighted centroids and one iteration (may generate warning messages
# as no convergence is reached)
cluster_kmeans <- kmeans(x = dataset[, c(1:4)], centers = weighted_centroids, iter.max = 1)$cluster
# estimating new weighted centroids from the current k-means assignment
# (not from the initial hclust cut, otherwise the centroids never change)
weighted_centroids <- matrix(NA, nrow = no_of_clusters,
ncol=ncol(dataset[,c(1:4)]))
for (i in (1:no_of_clusters))
{
weighted_centroids[i,] <- sapply(dataset[, c(1:4)][cluster_kmeans == i,],
weighted.mean, w = dataset[cluster_kmeans == i, weights])
}
# update cluster_i and cluster_iminus1
if(iter == 1) {cluster_iminus1 <- 0} else{cluster_iminus1 <- cluster_i}
cluster_i <- cluster_kmeans
}
## merge final clusters to data table
dataset[, cluster := cluster_i]
If you want to increase the weight of a variable (column), just multiply it with a constant c > 1.
It's trivial to show that this increases the weight in the SSQ optimization objective.
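A small sketch of that column-scaling trick, using iris as stand-in data. One detail worth spelling out: multiplying a column by c multiplies its contribution to the squared distance by c^2, so to give a variable weight w in the SSQ you multiply it by sqrt(w). The weights below are hypothetical.

```r
X <- scale(iris[, 1:4])                    # standardise first, as in the question
var_weights <- c(4, 1, 1, 1)               # hypothetical: column 1 counts 4x in the SSQ
X_w <- sweep(X, 2, sqrt(var_weights), FUN = "*")

set.seed(1)
fit <- kmeans(X_w, centers = 3, nstart = 25)
table(fit$cluster)
apply(X_w, 2, var)                         # column variances after weighting
```

After scaling, every column has variance 1, so the weighted columns have variances equal to var_weights, which is an easy check that the weights were applied as intended.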

cforest party unbalanced classes

I want to measure the features importance with the cforest function from the party library.
My output variable has something like 2000 samples in class 0 and 100 samples in class 1.
I think a good way to avoid bias due to class imbalance is to train each tree of the forest on a subsample in which the number of elements of class 1 equals the number of elements of class 0.
Is there any way to do that? I am thinking of an option like n_samples = c(20, 20)
EDIT:
An example of code
> iris.cf <- cforest(Species ~ ., data = iris,
+ control = cforest_unbiased(mtry = 2)) #<--- Here I would like to train the forest using a balanced subsample of the data
> varimp(object = iris.cf)
Sepal.Length Sepal.Width Petal.Length Petal.Width
0.048981818 0.002254545 0.305818182 0.271163636
>
EDIT:
Maybe my question is not clear enough.
Random forest is a set of decision trees. In general the decision trees are constructed using only a random subsample of the data. I would like that the used subsample has the same numbers of element in the class 1 and in the class 0.
EDIT:
The function that I am looking for is for sure available in the randomForest package
sampsize
Size(s) of sample to draw. For classification, if sampsize is a vector of the length the number of strata, then sampling is stratified by strata, and the elements of sampsize indicate the numbers to be drawn from the strata.
I need the same for the party package. Is there any way to get it?
I will assume you know what you want to accomplish, but don't know enough R to do that.
Not sure if the function provides balancing of data as an argument, but you can do it manually. Below is the code I quickly threw together. More elegant solution might exist.
# just in case
myData <- iris
# replicate everything *10* times. Replicate is just a "loop 10 times".
replicate(10,
{
# split dataset by class and add separate classes to list
splitList <- split(myData, myData$Species)
# sample *20* random rows from each data frame in the list
sampledList <- lapply(splitList, function(dat) { dat[sample(nrow(dat), 20),] })
# combine sampled rows to a data.frame
sampledData <- do.call(rbind, sampledList)
# your code below
res.cf <- cforest(Species ~ ., data = sampledData,
control = cforest_unbiased(mtry = 2)
)
varimp(object = res.cf)
}
)
Hope you can take it from here.
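To make the balancing step concrete for a 2000-versus-100 split like the one in the question, here is a base-R sketch on made-up data. The helper balanced_sample and the columns y, x1, x2 are hypothetical. Note that this balances the data before each forest is fitted; it does not change how cforest draws the subsample for each individual tree.

```r
set.seed(7)
# Hypothetical unbalanced data: 2000 rows of class 0, 100 rows of class 1
dat <- data.frame(y  = factor(rep(c(0, 1), times = c(2000, 100))),
                  x1 = rnorm(2100),
                  x2 = rnorm(2100))

# Draw the same number of rows from every class (without replacement)
balanced_sample <- function(data, class_col, n_per_class) {
  rows_by_class <- split(seq_len(nrow(data)), data[[class_col]])
  idx <- unlist(lapply(rows_by_class, function(rows) sample(rows, n_per_class)))
  data[idx, , drop = FALSE]
}

sub <- balanced_sample(dat, "y", 100)
table(sub$y)
```

Each balanced subsample can then be passed as the data argument of cforest, and the varimp results averaged over the repeats.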

Bootstrapping to compare two groups

In the following code I use bootstrapping to calculate the C.I. and the p-value under the null hypothesis that two different fertilizers applied to tomato plants have no effect on plant yields (the alternative being that the "improved" fertilizer is better). The first random sample (x) comes from plants where a standard fertilizer has been used, while an "improved" one has been used on the plants that the second sample (y) comes from.
x <- c(11.4,25.3,29.9,16.5,21.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
total <- c(x,y)
library(boot)
diff <- function(x,i) mean(x[i[6:11]]) - mean(x[i[1:5]])
b <- boot(total, diff, R = 10000)
ci <- boot.ci(b)
p.value <- sum(b$t>=b$t0)/b$R
What I don't like about the code above is that resampling is done as if there was only one sample of 11 values (separating the first 5 as belonging to sample x leaving the rest to sample y).
Could you show me how this code should be modified in order to draw resamples of size 5 with replacement from the first sample and separate resamples of size 6 from the second sample, so that bootstrap resampling would mimic the “separate samples” design that produced the original data?
EDIT2 :
Hack deleted as it was a wrong solution. Instead one has to use the argument strata of the boot function :
total <- c(x,y)
id <- as.factor(c(rep("x",length(x)),rep("y",length(y))))
b <- boot(total, diff, strata=id, R = 10000)
...
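To show what resampling within strata amounts to, here is a base-R sketch that resamples x and y separately, each with its own sample size, and builds a simple percentile interval for the difference in means. It does not use the boot package, and the percentile interval is just one of several interval types, chosen here for brevity.

```r
x <- c(11.4, 25.3, 29.9, 16.5, 21.1)
y <- c(23.7, 26.6, 28.5, 14.2, 17.9, 24.3)
observed <- mean(y) - mean(x)

set.seed(2024)
R <- 10000
# each replicate draws 6 values from y and 5 values from x, with replacement
boot_diff <- replicate(R, mean(sample(y, replace = TRUE)) -
                          mean(sample(x, replace = TRUE)))

ci <- quantile(boot_diff, c(0.025, 0.975))   # percentile 95% interval
print(observed)
print(ci)
```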
Be aware you're not going to get even close to a correct estimate of your p.value :
x <- c(1.4,2.3,2.9,1.5,1.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
total <- c(x,y)
b <- boot(total, diff, strata=id, R = 10000)
ci <- boot.ci(b)
p.value <- sum(b$t>=b$t0)/b$R
> p.value
[1] 0.5162
How would you explain a p-value of 0.51 for two samples where all values of the second are higher than the highest value of the first?
The above code is fine to get a -biased- estimate of the confidence interval, but the significance testing about the difference should be done by permutation over the complete dataset.
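A minimal sketch of such a permutation test in base R, using the extreme example data from this answer. Under the null hypothesis the group labels are exchangeable, so the pooled values are shuffled and re-split into groups of 5 and 6. The one-sided p-value uses the common (count + 1) / (R + 1) form so that it can never be exactly zero; that choice is mine and is not taken from the code above.

```r
x <- c(1.4, 2.3, 2.9, 1.5, 1.1)
y <- c(23.7, 26.6, 28.5, 14.2, 17.9, 24.3)
observed <- mean(y) - mean(x)

pooled <- c(x, y)
n_x <- length(x)

set.seed(42)
R <- 10000
perm_diff <- replicate(R, {
  shuffled <- sample(pooled)                 # random relabelling of all 11 values
  mean(shuffled[-seq_len(n_x)]) - mean(shuffled[seq_len(n_x)])
})

# one-sided: how often is a permuted difference at least as large as the observed one?
p_value <- (sum(perm_diff >= observed) + 1) / (R + 1)
print(p_value)
```

The bootstrap p-value of about 0.5 arises because b$t is centred on the observed statistic b$t0 rather than on the null value, so roughly half of the replicates exceed b$t0 whatever the data look like. The permutation distribution is centred on zero, which is what the null hypothesis requires.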
Following John, I think the appropriate way to use bootstrap to test if the sums of these two different populations are significantly different is as follows:
x <- c(1.4,2.3,2.9,1.5,1.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
b_x <- boot(x, sum, R = 10000)
b_y <- boot(y, sum, R = 10000)
z<-(b_x$t0-b_y$t0)/sqrt(var(b_x$t[,1])+var(b_y$t[,1]))
pnorm(z)
So we can clearly reject the null that they are the same population. I may have missed a degree of freedom adjustment, I am not sure how bootstrapping works in that regard, but such an adjustment will not change your results drastically.
While the actual soil beds could be considered a stratified variable in some instances, this is not one of them. You only have the one manipulation, between the groups of plants. Therefore, your null hypothesis is that they really do come from the exact same population. Treating the items as if they're from a single set of 11 samples is the correct way to bootstrap in this case.
If you had two plots, and in each plot had tried the different fertilizers over different seasons in a counterbalanced fashion, then the plots would be stratified samples and you'd want to treat them as such. But that isn't the case here.
