Topic models: cross-validation with log-likelihood or perplexity - R

I'm clustering documents using topic modeling. I need to come up with the optimal number of topics, so I decided to do ten-fold cross-validation with 10, 20, ..., 60 topics.
I have divided my corpus into ten batches and set aside one batch as a holdout set. I ran latent Dirichlet allocation (LDA) on the other nine batches (180 documents in total) with 10 to 60 topics. Now, I have to calculate the perplexity or log-likelihood for the holdout set.
I found this code in one of the discussions on Cross Validated (CV), but I don't understand several lines of it. I have a document-term matrix built from the holdout set (20 documents), but I don't know how to calculate its perplexity or log-likelihood.
Questions:
Can anybody explain to me what seq(2, 100, by = 1) means here? Also, what does AssociatedPress[21:30] mean? And what is function(k) doing here?
best.model <- lapply(seq(2, 100, by=1), function(k){ LDA(AssociatedPress[21:30,], k) })
If I want to calculate the perplexity or log-likelihood of the holdout set called dtm, is there better code? I know there are perplexity() and logLik() functions, but since I'm new I cannot figure out how to apply them to my holdout matrix.
How can I do ten-fold cross-validation with my corpus of 200 documents? Is there existing code that I can invoke? I found caret for this purpose, but I cannot figure that out either.

The accepted answer to this question is good as far as it goes, but it doesn't actually address how to estimate perplexity on a validation dataset and how to use cross-validation.
Using perplexity for simple validation
Perplexity is a measure of how well a probability model fits a new set of data. In the topicmodels R package it is simple to compute with the perplexity function, which takes as arguments a previously fitted topic model and a new set of data, and returns a single number. The lower the better.
For example, splitting the AssociatedPress data into a training set (75% of the rows) and a validation set (25% of the rows):
# load up some R packages including a few we'll need later
library(topicmodels)
library(doParallel)
library(ggplot2)
library(scales)
data("AssociatedPress", package = "topicmodels")
burnin = 1000
iter = 1000
keep = 50
full_data <- AssociatedPress
n <- nrow(full_data)
#-----------validation--------
k <- 5
splitter <- sample(1:n, round(n * 0.75))
train_set <- full_data[splitter, ]
valid_set <- full_data[-splitter, ]
fitted <- LDA(train_set, k = k, method = "Gibbs",
              control = list(burnin = burnin, iter = iter, keep = keep))
perplexity(fitted, newdata = train_set) # about 2700
perplexity(fitted, newdata = valid_set) # about 4300
The perplexity is higher for the validation set than for the training set, because the topics have been optimised based on the training set.
Using perplexity and cross-validation to determine a good number of topics
The extension of this idea to cross-validation is straightforward. Divide the data into subsets (say five); each subset gets one turn as the validation set and four turns as part of the training set. However, it's really computationally intensive, particularly when trying out the larger numbers of topics.
You might be able to use caret to do this, but I suspect it doesn't handle topic modelling yet. In any case, it's the sort of thing I prefer to do myself to be sure I understand what's going on.
The code below, even with parallel processing on 7 logical CPUs, took 3.5 hours to run on my laptop:
#----------------5-fold cross-validation, different numbers of topics----------------
# set up a cluster for parallel processing
cluster <- makeCluster(detectCores(logical = TRUE) - 1) # leave one CPU spare...
registerDoParallel(cluster)
# load up the needed R package on all the parallel sessions
clusterEvalQ(cluster, {
  library(topicmodels)
})
folds <- 5
splitfolds <- sample(1:folds, n, replace = TRUE)
candidate_k <- c(2, 3, 4, 5, 10, 20, 30, 40, 50, 75, 100, 200, 300) # candidates for how many topics
# export all the needed R objects to the parallel sessions
clusterExport(cluster, c("full_data", "burnin", "iter", "keep", "splitfolds", "folds", "candidate_k"))
# we parallelise over the different numbers of topics: each processor is allocated a
# value of k and does the cross-validation for that k serially. This is because it is
# assumed there are more candidate values of k than there are cross-validation folds,
# so this is the more efficient way to parallelise
system.time({
  results <- foreach(j = 1:length(candidate_k), .combine = rbind) %dopar% {
    k <- candidate_k[j]
    results_1k <- matrix(0, nrow = folds, ncol = 2)
    colnames(results_1k) <- c("k", "perplexity")
    for (i in 1:folds) {
      train_set <- full_data[splitfolds != i, ]
      valid_set <- full_data[splitfolds == i, ]
      fitted <- LDA(train_set, k = k, method = "Gibbs",
                    control = list(burnin = burnin, iter = iter, keep = keep))
      results_1k[i, ] <- c(k, perplexity(fitted, newdata = valid_set))
    }
    return(results_1k)
  }
})
stopCluster(cluster)
results_df <- as.data.frame(results)
ggplot(results_df, aes(x = k, y = perplexity)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  ggtitle("5-fold cross-validation of topic modelling with the 'Associated Press' dataset",
          "(ie five different models fit for each candidate number of topics)") +
  labs(x = "Candidate number of topics",
       y = "Perplexity when fitting the trained model to the hold-out set")
We see in the results that 200 topics is too many and has some over-fitting, and 50 is too few. Of the numbers of topics tried, 100 is the best, with the lowest average perplexity on the five different hold-out sets.
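If you'd rather read those averages off a table than the plot, aggregating results_df (as built above) is a quick check:
# mean perplexity across the five folds for each candidate number of topics
aggregate(perplexity ~ k, data = results_df, FUN = mean)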

I wrote the answer on CV that you refer to, here's a bit more detail:
seq(2, 100, by = 1) simply creates a number sequence from 2 to 100 in steps of one, so 2, 3, 4, 5, ..., 100. Those are the numbers of topics that I want to use in the models: one model with 2 topics, another with 3 topics, another with 4 topics and so on up to 100 topics.
AssociatedPress[21:30,] is simply a subset (documents 21 to 30, i.e. rows of the document-term matrix) of the built-in data in the topicmodels package. I just used a subset in that example so that it would run faster. function(k) is an anonymous function that lapply applies to each element of the sequence, so one LDA model is fitted per candidate number of topics.
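To see those pieces in isolation (just an illustration):
head(seq(2, 100, by = 1))     # 2 3 4 5 6 7
data("AssociatedPress", package = "topicmodels")
dim(AssociatedPress[21:30, ]) # 10 documents by the full vocabulary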
Regarding the general question of optimal topic numbers, I now follow the example of Martin Ponweiser on Model Selection by Harmonic Mean (section 4.3.3 of his thesis, which is here: http://epub.wu.ac.at/3558/1/main.pdf). Here's how I do it at the moment:
library(topicmodels)
#
# get some of the example data that's bundled with the package
#
data("AssociatedPress", package = "topicmodels")
harmonicMean <- function(logLikelihoods, precision = 2000L) {
  library("Rmpfr")
  llMed <- median(logLikelihoods)
  as.double(llMed - log(mean(exp(-mpfr(logLikelihoods, prec = precision) + llMed))))
}
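# Aside: the Rmpfr detour is needed because corpus log-likelihoods are large
# negative numbers that underflow to zero when exponentiated in double precision:
exp(-200000)                               # 0 in double precision
exp(Rmpfr::mpfr(-200000, precBits = 2000)) # still representable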
# The log-likelihood values are then determined by first fitting the model, for example:
k = 20
burnin = 1000
iter = 1000
keep = 50
fitted <- LDA(AssociatedPress[21:30,], k = k, method = "Gibbs",
              control = list(burnin = burnin, iter = iter, keep = keep))
# keep indicates that every keep-th iteration the log-likelihood is evaluated
# and stored. This returns all log-likelihood values, including those from the
# burnin; these need to be omitted before calculating the harmonic mean
# (assuming that burnin is a multiple of keep):
logLiks <- fitted@logLiks[-c(1:(burnin/keep))]
harmonicMean(logLiks)
So to do this over a sequence of topic models with different numbers of topics...
# generate numerous topic models with different numbers of topics
sequ <- seq(2, 50, 1) # in this case a sequence of numbers from 2 to 50, by ones
fitted_many <- lapply(sequ, function(k) LDA(AssociatedPress[21:30,], k = k, method = "Gibbs",
                                            control = list(burnin = burnin, iter = iter, keep = keep)))
# extract the log-likelihoods from each fitted model, dropping the burnin iterations
logLiks_many <- lapply(fitted_many, function(L) L@logLiks[-c(1:(burnin/keep))])
# compute harmonic means
hm_many <- sapply(logLiks_many, function(h) harmonicMean(h))
# inspect
plot(sequ, hm_many, type = "l")
# compute optimum number of topics
sequ[which.max(hm_many)]
## 6
Here's the output, with numbers of topics along the x-axis, indicating that 6 topics is optimum.
Cross-validation of topic models is pretty well documented in the docs that come with the package; see here for example: http://cran.r-project.org/web/packages/topicmodels/vignettes/topicmodels.pdf. Give that a try and then come back with a more specific question about coding CV with topic models.

Related

How to downsample within recursive feature elimination using caret?

Consider the data frame data created here:
set.seed(123)
num <- sample(5:20, 20, replace = TRUE)
id <- letters[1:20]
loc <- rep(id, num)
data <- data.frame(Location = loc)
data[paste0('var', 1:10)] <- rnorm(nrow(data) * 10)
Assume data is my training data. Each row represents measurements taken on randomly sampled individuals from populations identified by the grouping variable Location. I want to use recursive feature elimination to identify the best subset of predictors for predicting Location. Analogously, I want to understand how much variation each of the predictors explains in Location (i.e., which ones are most important, and how much more important they are). I have read how this can be done with the caret package, using something like this:
library(caret)
subsets <- 1:9
ctrl <- rfeControl(functions = lmFuncs, method = "repeatedcv", repeats = 10, verbose = FALSE)
lmProfile <- rfe(data[, 2:10], data[, 1], sizes = subsets, rfeControl = ctrl)
In my data example, considering the unbalanced number of samples in each Location, I want to use downsampling to ensure that the same number of samples is considered across the levels of Location in each iteration. Could someone demonstrate how I might do this? (A sketch of one possible approach follows.)
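One possible starting point (a sketch, not a full solution from the original thread): caret ships a downSample() helper that balances a data set by sampling every class down to the size of the rarest one. rfeControl() does not expose a sampling argument the way trainControl() does, so the simplest workaround is to downsample once before calling rfe(); note this balances the data up front rather than within each resampling iteration, which would require custom rfe functions. Since Location is a categorical outcome, rfFuncs (random-forest-based RFE) is also a more natural choice than lmFuncs:
library(caret)
set.seed(123)
# downSample() returns the predictors plus a "Class" column holding the labels
balanced <- downSample(x = data[, -1], y = as.factor(data$Location))
ctrl <- rfeControl(functions = rfFuncs, method = "repeatedcv", repeats = 10, verbose = FALSE)
rfProfile <- rfe(balanced[, -ncol(balanced)], balanced$Class,
                 sizes = 1:9, rfeControl = ctrl)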

How to calculate perplexity for LDA with Gibbs sampling

I am running an LDA topic model in R on a collection of 200+ documents (65k words in total). The documents have been preprocessed and are stored in the document-term matrix dtm. Theoretically, I should expect to find 5 distinct topics in the corpus, but I would like to calculate the perplexity score and see how the model fit changes with the number of topics. Below is the code I use. The problem is that it gives me an error when I try to calculate the perplexity score, and I am not sure how to fix it (I am new to R). The error occurs on the last line of code. I would appreciate any help.
library(topicmodels)
burnin <- 4000   # burn-in parameter
iter <- 2000     # number of iterations after burn-in
thin <- 500      # take every 500th iteration for further use, to avoid correlation between samples
seed <- list(2003, 10, 100, 10005, 765)
nstart <- 5      # use 5 different starting points
best <- TRUE     # return the results of the run with the highest posterior probability
# Number of topics (run the algorithm for different values of k and choose by inspecting the results)
k <- 5
# Run LDA using Gibbs sampling
ldaOut <- LDA(dtm, k, method = "Gibbs",
              control = list(nstart = nstart, seed = seed, best = best,
                             burnin = burnin, iter = iter, thin = thin))
perplexity(ldaOut, newdata = dtm)
Error in method(x, k, control, model, mycall, ...) : Need 1 seeds
It needs one more argument, estimate_theta; use the code below:
perplexity(ldaOut, newdata = dtm, estimate_theta = FALSE)
With estimate_theta = FALSE the document-topic proportions are not re-estimated for the new data, so perplexity() does not re-run the sampler with the control settings stored in the fitted model, which is what appears to trigger the "Need 1 seeds" error here (the stored control carries five seeds from nstart = 5).

Weighted Kmeans R

I want to do a k-means clustering on a dataset (namely, Sample_Data) with three variables (columns), as below:
A B C
1 12 10 1
2 8 11 2
3 14 10 1
. . . .
. . . .
. . . .
In a typical case, after scaling the columns and determining the number of clusters, I will use this function in R:
Sample_Data <- scale(Sample_Data)
output_kmeans <- kmeans(Sample_Data, centers = 5, nstart = 50)
But what if there is a preference among the variables? I mean, suppose variable (column) A is more important than the other two variables.
How can I insert their weights into the model?
Thank you all
You have to use weighted k-means clustering, like the one presented in the flexclust package:
https://cran.r-project.org/web/packages/flexclust/flexclust.pdf
The function
cclust(x, k, dist = "euclidean", method = "kmeans",
weights=NULL, control=NULL, group=NULL, simple=FALSE,
save.data=FALSE)
Perform k-means clustering, hard competitive learning or neural gas on a data matrix.
weights: an optional vector of weights to be used in the fitting process. Works only in combination with hard competitive learning.
A toy example using iris data:
library(flexclust)
data(iris)
cl <- cclust(iris[,-5], k=3, save.data=TRUE,weights =c(1,0.5,1,0.1),method="hardcl")
cl
kcca object of family ‘kmeans’
call:
cclust(x = iris[, -5], k = 3, method = "hardcl", weights = c(1, 0.5, 1, 0.1), save.data = TRUE)
cluster sizes:
1 2 3
50 59 41
As you can see from the output of cclust, even when using competitive learning the family is always kmeans.
The difference is related to the cluster assignment during the training phase:
If method is "kmeans", the classic kmeans algorithm as given by
MacQueen (1967) is used, which works by repeatedly moving all cluster
centers to the mean of their respective Voronoi sets. If "hardcl",
on-line updates are used (AKA hard competitive learning), which work
by randomly drawing an observation from x and moving the closest
center towards that point (e.g., Ripley 1996).
The weights parameter is just a sequence of numbers; in general I use numbers between 0.01 (minimum weight) and 1 (maximum weight).
I had the same problem and the answer here is not satisfying to me.
What we both wanted was an observation-weighted k-means clustering in R. A good readable example for our question is this link: https://towardsdatascience.com/clustering-the-us-population-observation-weighted-k-means-f4d58b370002
However, the solution of using the flexclust package is not satisfying, simply because the algorithm used is not the "standard" k-means algorithm but "hard competitive learning". The differences are well described above and in the package description.
I looked through many sites and did not find any solution/package in R for performing a "standard" k-means algorithm with weighted observations. I was also wondering why the flexclust package explicitly does not support weights with the standard k-means algorithm. If anyone has an explanation for this, please feel free to share!
So basically you have two options: first, rewrite the flexclust algorithm to enable weights within the standard approach. Or second, you can estimate weighted cluster centroids as starting centroids, perform a standard k-means algorithm with only one iteration, then compute new weighted cluster centroids and perform a k-means with one iteration again, and so on until you reach convergence.
I used the second alternative because it was the easier way for me. I used the data.table package; I hope you are familiar with it.
rm(list = ls())
library(data.table)
### gen dataset with sample weights
dataset <- data.table(iris)
dataset[, weights := rep(c(1, 0.7, 0.3, 4, 5), 30)]
dataset[, Species := NULL]
### initial hclust for estimating weighted centroids
clustering <- hclust(dist(dataset[, c(1:4)], method = 'euclidean'),
                     method = 'ward.D2')
no_of_clusters <- 4
### estimating starting centroids (weighted)
weighted_centroids <- matrix(NA, nrow = no_of_clusters,
                             ncol = ncol(dataset[, c(1:4)]))
for (i in (1:no_of_clusters))
{
  weighted_centroids[i, ] <- sapply(dataset[, c(1:4)][cutree(clustering, k = no_of_clusters) == i, ],
                                    weighted.mean,
                                    w = dataset[cutree(clustering, k = no_of_clusters) == i, weights])
}
### performing weighted k-means as explained in my post
iter <- 0
cluster_i <- 0
cluster_iminus1 <- 1
## while loop: as long as the number of iterations is smaller than 50 and cluster_i
## (result of the current iteration) is not identical to cluster_iminus1 (result of
## the former iteration), continue
while (identical(cluster_i, cluster_iminus1) == FALSE && iter < 50) {
  # update iteration
  iter <- iter + 1
  # k-means with the weighted centroids and one iteration (may generate warning
  # messages as no convergence is reached)
  cluster_kmeans <- kmeans(x = dataset[, c(1:4)], centers = weighted_centroids,
                           iter.max = 1)$cluster
  # estimate new weighted centroids from the current k-means assignment
  weighted_centroids <- matrix(NA, nrow = no_of_clusters,
                               ncol = ncol(dataset[, c(1:4)]))
  for (i in (1:no_of_clusters))
  {
    weighted_centroids[i, ] <- sapply(dataset[, c(1:4)][cluster_kmeans == i, ],
                                      weighted.mean,
                                      w = dataset[cluster_kmeans == i, weights])
  }
  # update cluster_i and cluster_iminus1
  if (iter == 1) {cluster_iminus1 <- 0} else {cluster_iminus1 <- cluster_i}
  cluster_i <- cluster_kmeans
}
## merge final clusters to data table
dataset[, cluster := cluster_i]
If you want to increase the weight of a variable (column), just multiply it by a constant c > 1.
It's trivial to show that this increases the weight in the SSQ optimization objective.
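As a minimal sketch of that idea (the weights here are invented for illustration): because k-means works with squared Euclidean distances, multiplying a column by c scales its contribution to the SSQ objective by c^2, so to give a variable weight w you rescale it by sqrt(w):
# hypothetical weights: A counts four times as much as B or C
w <- c(A = 4, B = 1, C = 1)
Sample_Data <- scale(Sample_Data)                    # standardise first
weighted_data <- sweep(Sample_Data, 2, sqrt(w), "*") # rescale each column by sqrt(weight)
output_kmeans <- kmeans(weighted_data, centers = 5, nstart = 50)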

stop xgboost based on eval_metric

I am trying to run xgboost for a problem with very noisy features, and I am interested in stopping the number of rounds based on a custom eval_metric that I have defined.
Based on domain knowledge, I know that when the eval_metric (evaluated on the training data) goes above a certain value, xgboost is overfitting. I would like to just take the fitted model at that specific number of rounds and not proceed further.
What would be the best way to achieve this?
It would be somewhat in line with the early-stopping criteria, but not exactly.
Alternatively, is there a possibility of getting the model from an intermediate round?
Here is an example to better explain my question (using the toy example that comes with the xgboost help docs and the default eval_metric):
library(xgboost)
data(agaricus.train, package='xgboost')
train <- agaricus.train
bstSparse <- xgboost(data = train$data, label = train$label, max.depth = 2,
                     eta = 1, nthread = 2, nround = 5, objective = "binary:logistic")
Here is the output
[0] train-error:0.046522
[1] train-error:0.022263
[2] train-error:0.007063
[3] train-error:0.015200
[4] train-error:0.007063
Now let's say that from domain knowledge I know that once the train error goes below 0.015 (the third round in this case), any further rounds only lead to overfitting. How would I stop the training process after the third round and get hold of the trained model, to use it for prediction on a different dataset?
I need to run the training process over many different datasets, and I have no sense of how many rounds it might take to get the error below a fixed number, so I can't set the nrounds argument to a predetermined value. The only intuition I have is that once the training error goes below a number, I should stop further training rounds.
In the absence of any code you have tried or any data you are using, try something like this:
require(xgboost)
library(Metrics) # for rmse to calculate errors
# Assume you have a training set db.train, some feature indices of interest,
# and a test set db.test
predz <- c(2, 4, 6, 8, 10, 12)
predictors <- names(db.train[, predz])
# you have some response you are interested in
outcomeName <- "myLabel"
# you may like to include for testing some other parameters like:
# eta, gamma, colsample_bytree, min_child_weight
# here we look at depths from 1 to 4 and rounds 1 to 100, but set your own values
smallestError <- 100 # set to some sensible value depending on your eval metric
for (depth in seq(1, 4, 1)) {
  for (rounds in seq(1, 100, 1)) {
    # train
    bst <- xgboost(data = as.matrix(db.train[, predictors]),
                   label = db.train[, outcomeName],
                   max.depth = depth,
                   nround = rounds,
                   eval_metric = "logloss",
                   objective = "binary:logistic",
                   verbose = TRUE)
    gc()
    # predict
    predictions <- as.numeric(predict(bst, as.matrix(db.test[, predictors]),
                                      outputmargin = TRUE))
    err <- rmse(as.numeric(db.test[, outcomeName]), as.numeric(predictions))
    if (err < smallestError) {
      smallestError <- err
      print(paste(depth, rounds, err))
    }
  }
}
You could adapt this code for your particular evaluation metric and print out whatever suits your situation. Similarly, you could introduce a break in the loop once the metric satisfies some condition you seek to achieve, as in the sketch below.
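For instance, a minimal sketch on the toy data from the question (the 0.015 cutoff is the one given above). xgb.train accepts an existing booster via its xgb_model argument, so you can keep training one round at a time and stop with the model already in hand:
library(xgboost)
data(agaricus.train, package = "xgboost")
train <- agaricus.train
dtrain <- xgb.DMatrix(train$data, label = train$label)
threshold <- 0.015  # domain-knowledge cutoff on the training error
bst <- NULL
for (r in 1:50) {
  # add one boosting round, continuing from the previous model
  bst <- xgb.train(params = list(max_depth = 2, eta = 1, objective = "binary:logistic"),
                   data = dtrain, nrounds = 1, xgb_model = bst)
  pred <- predict(bst, dtrain)                    # predicted probabilities
  train_err <- mean((pred > 0.5) != train$label)  # training classification error
  cat(sprintf("round %d: train-error %f\n", r, train_err))
  if (train_err < threshold) break  # bst is the model to keep
}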

knapsack case r implementation for multiple persons using genetic algorithm

I am trying to implement a genetic algorithm in R. I found out that R has the 'GA' and 'genalg' packages for genetic algorithm implementations. I encountered the example in the link http://www.r-bloggers.com/genetic-algorithms-a-simple-r-example/, where they try solving the knapsack problem. The problem can be briefly explained as:
"You are going to spend a month in the wilderness. You're taking a backpack with you; however, the maximum weight it can carry is 20 kilograms. You have a number of survival items available, each with its own number of 'survival points'. Your objective is to maximize the number of survival points."
The problem is easily solved using the 'genalg' package for a single person, and the output is a binary string. Now I have a doubt: let's say that instead of one person there are two or more, i.e., multiple persons, and we need to distribute the survival points. The weight constraint applies to each person. How can we solve this problem? Can we use the 'genalg' or 'GA' packages? If so, how can we apply them? Are there any examples of this solved in R or other software?
Thanks
The R package adagio (https://cran.r-project.org/web/packages/adagio/index.html) comes with two functions (knapsack and mknapsack) which solve this type of problem more efficiently by dynamic programming.
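For a single person on the item set used below, the call looks like this (a quick sketch; knapsack() takes the item weights, the profits, and the capacity, and returns among other things the indices of the selected items):
library(adagio)
w <- c(1, 5, 10, 1, 7, 5, 1)       # item weights from the dataset below
p <- c(10, 20, 15, 2, 30, 10, 30)  # survival points
knapsack(w, p, cap = 20)$indices   # items chosen for one person with a 20 kg limit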
A simple approach could be to have one chromosome containing all individuals in the group and have the evaluation function split this chromosome into multiple parts, one for each individual, and then have these parts evaluated. In the example below (based on the example in the question) I have assumed each individual has the same weight limit and that multiple individuals can bring the same item.
library(genalg)
# Set up the problem parameters
# how many people in the group
individual_count <- 3
# the weight limit for one individual
weightlimit <- 20
# the items with their survival points
dataset <- data.frame(item = c("pocketknife", "beans", "potatoes", "unions",
                               "sleeping bag", "rope", "compass"),
                      survivalpoints = c(10, 20, 15, 2, 30, 10, 30),
                      weight = c(1, 5, 10, 1, 7, 5, 1))
# Next, we choose the number of iterations, design and run the model.
iter <- 100
# Our chromosome has to be large enough to contain a bit for every item for every individual
chromosomesize <- individual_count * nrow(dataset)
# Function definitions
# A function to split vector x in n equal parts
split_vector <- function(x, n) split(x, cut(seq_along(x), n, labels = FALSE))
# Evaluate an individual (a part of the chromosome)
evalIndividual <- function(x) {
  current_solution_survivalpoints <- x %*% dataset$survivalpoints
  current_solution_weight <- x %*% dataset$weight
  if (current_solution_weight > weightlimit)
    return(0) else return(-current_solution_survivalpoints)
}
# Evaluate a chromosome
evalFunc <- function(x) {
  # First split the chromosome into a list of individuals, then evaluate each individual
  individuals <- split_vector(x, individual_count)
  # sapply the evalIndividual function to each element of individuals
  return(sum(sapply(individuals, evalIndividual)))
}
# Run the Genetic Algorithm
GAmodel <- rbga.bin(size = chromosomesize, popSize = 200, iters = iter,
                    mutationChance = 0.01, elitism = TRUE, evalFunc = evalFunc)
# First show a summary
summary(GAmodel, echo = TRUE)
# Then extract the best solution from the GAmodel (copy/paste from the source code of the summary function)
filter <- GAmodel$evaluations == min(GAmodel$evaluations)
bestSolution <- GAmodel$population[filter, , drop = FALSE][1, ]
# Now split the solution into the individuals
split_vector(bestSolution, individual_count)
