Getting the document-per-topic loading using the textmineR package when passing a term co-occurrence matrix - r

I am using the textmineR package to find the documents most similar to a given list of documents. I used the following code to generate a TCM (not a DTM):
tcm <- CreateTcm(doc_vec = text_df$Description,
                 skipgram_window = 20,
                 verbose = FALSE,
                 cpus = 2)
which is then used to fit an LDA model:
# note the number of topics is arbitrary here
# see extensions for more info
model <- FitLdaModel(dtm = tcm,
                     k = 25,
                     iterations = 200, # I usually recommend at least 500 iterations or more
                     burnin = 180,
                     alpha = 0.1,
                     beta = 0.05,
                     optimize_alpha = TRUE,
                     calc_likelihood = TRUE,
                     calc_coherence = TRUE,
                     calc_r2 = TRUE,
                     cpus = 2)
Because the model is fit on a TCM, theta here gives word-per-topic loadings rather than document-per-topic loadings. I want to get back to the documents, i.e. obtain a document-per-topic loading with the document numbers attached. How can I obtain the document-per-topic distribution from this model when passing a term co-occurrence matrix?
I have tried to work backwards to recover the document numbers from the document-per-topic loading, following the guidelines at https://cran.r-project.org/web/packages/textmineR/vignettes/d_text_embeddings.html, but without success.

11-month old question. But giving it a shot anyway.
Technically, theta with LDA embeddings gives you P(topic|word) and phi still gives you P(word|topic). If I understand you correctly, you want to embed whole documents under this model? If so, here's how you'd do it.
library(textmineR)
# create a tcm
tcm <- CreateTcm(nih_sample$ABSTRACT_TEXT, skipgram_window = 10)
# fit an LDA model
m <- FitLdaModel(dtm = tcm, k = 100, iterations = 100, burnin = 75)
# pull your documents into a dtm
d <- nih_sample_dtm
# get them predicted under the model
# I recommend using the "dot" method for prediction with embeddings as sparsity may
# result in underflow and throw an error using the default "gibbs" method
p <- predict(object = m, newdata = d, method = "dot")
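If the end goal is still to find the most similar documents, one way to use these predictions (my own hedged addition, not part of the original answer) is to compare rows of the predicted document-topic matrix. A minimal sketch in base R using cosine similarity, assuming p is the matrix returned by predict() above and that its row names are your document IDs:
# scale each document's topic distribution to unit length
p_norm <- p / sqrt(rowSums(p^2))
# cosine similarity between every pair of documents
sim <- p_norm %*% t(p_norm)
# the 5 documents most similar to one example document
doc_id <- rownames(sim)[1]   # hypothetical example document
neighbors <- sort(sim[doc_id, ], decreasing = TRUE)
head(neighbors[names(neighbors) != doc_id], 5)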

Related

extract_inner_fselect_results is NULL with mlr3 Nested Resampling

This question is an extension of the following question: No Model Stored with Mlr3.
I have been performing nested resampling to get an unbiased metric of model performance. If I don't specify store_models=TRUE then I get Error: No model stored at the end of the run. However, if I specify store_models=TRUE in both the at and resample calls then RStudio crashes due to RAM consumption.
I have now tried the following code in which I specified store_models=TRUE for just the at call:
library(mlr3verse)  # loads mlr3, mlr3fselect, mlr3learners, etc.
MSvCon <- read.csv("MS v Control Proteomics Final.csv", row.names = 1)
MSvCon$Status <- as.factor(MSvCon$Status)
MSvCon[, 2:4399] <- scale(MSvCon[, 2:4399], center = TRUE, scale = TRUE)
set.seed(123, "L'Ecuyer")
task = as_task_classif(MSvCon, target = "Status")
learner = lrn("classif.ranger", importance = "impurity", num.trees = 10000)
set_threads(learner, n = 8)
measure = msr("classif.fbeta", beta = 1, average = "micro")
terminator = trm("none")
resampling_inner = rsmp("repeated_cv", folds = 10, repeats = 10)
at = AutoFSelector$new(
  learner = learner,
  resampling = resampling_inner,
  measure = measure,
  terminator = terminator,
  fselect = fs("rfe", n_features = 1, feature_fraction = 0.5, recursive = FALSE),
  store_models = TRUE)
resampling_outer = rsmp("repeated_cv", folds = 10, repeats = 10)
rr = resample(task, at, resampling_outer)
After finishing, I am able to extract performance measures successfully. However, when I tried to use extract_inner_fselect_results and extract_inner_fselect_archives to check which features were selected and their importance measures, I received a NULL result.
Do you have any suggestions on what I would need to adjust in my code to see this information? I expect that adding store_models=TRUE to the resample call would do it, but the RAM consumption issue (even with 128GB on RStudio Workbench) prevents that. Is there a way around this?
The archives of the inner resampling are stored in the model slot of the AutoFSelector objects, i.e. without store_models = TRUE in resample() you cannot access the inner results and archives. I will write a workaround for you and answer in the other question.
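For reference, a minimal sketch of what that looks like once models are stored in resample(); this is my addition rather than part of the original answer, and it will of course run into the same memory pressure described in the question:
# store the fitted AutoFSelector models so the inner archives survive
rr = resample(task, at, resampling_outer, store_models = TRUE)
# inner feature-selection results and archives, one entry per outer resampling iteration
inner_results  = extract_inner_fselect_results(rr)
inner_archives = extract_inner_fselect_archives(rr)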

Multi-class classification with LightGBM

I am using the latest release of LightGBM to solve a multi-class classification problem. When I switch the objective to "multiclass", this error occurs:
Error in data$update_params(params) :
[LightGBM] [Fatal] Number of classes should be specified and greater than 1 for multiclass training
Here is a reproducible example that shows my approach:
catnames <- names(purrr::keep(train_x, is.factor))
dtrain <- lgb.Dataset(as.matrix(train_x), label = train_y, categorical_feature = catnames)
data_file <- tempfile(fileext = ".data")
lgb.Dataset.save(dtrain, data_file)
dtrain <- lgb.Dataset(data_file)
lgb.Dataset.construct(dtrain)
model <- lgb.train(data = dtrain,
                   objective = "multiclass",
                   alpha = 0.1,
                   nrounds = 1000,
                   learning_rate = .1)
I tried saving my target (train_y) as a factor, but nothing changed.
When using the multi-class objective in LightGBM, you need to pass another parameter that tells the learner the number of classes to predict.
So, it should probably look more like this:
model <- lgb.train(data = dtrain,
                   objective = "multiclass",
                   num_classes = INSERT NUMBER OF TARGET CLASSES HERE,
                   alpha = 0.1,
                   nrounds = 1000,
                   learning_rate = .1)
My experience is more with the Python API, so if this does not work it might be that you need to pass the num_class parameter inside a list given to the params argument of lgb.train.
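For completeness, a hedged sketch of that params-list style (my addition, not from the original answer; the value 3 is a placeholder for the real number of classes, and the labels are assumed to be integers 0, 1, ..., num_class - 1 as LightGBM expects for multiclass training):
library(lightgbm)
# core parameters collected in a single params list
params <- list(objective = "multiclass",
               num_class = 3,        # placeholder: set to your number of target classes
               learning_rate = 0.1)
model <- lgb.train(params = params,
                   data = dtrain,
                   nrounds = 1000)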

rpart giving same results for cross-validation and no CV

Like the title says, I'm trying to run a decision tree both with and without cross-validation using the rpart package in R. I'm doing this using the xval parameter, as described in the vignette (https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf)
Unfortunately, I'm getting the same tree with and without CV. I've compared the calculation time for each, and the CV model looks like it takes about 10 times as long, so it's apparently doing something; I just can't figure out what.
I've also redone the model a number of times with different complexity parameters, but it hasn't made any difference.
Here's sample code that shows my problem: the printcp outputs show the same results, and the predictions from both models on the training set and on a hold-out set are identical.
library(rpart)
library(caret)
library(dplyr)  # for slice()
abalone <- read.csv(file = 'https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data', header = FALSE)
names(abalone) <- c("sex", "length", "diameter", "height", "whole_weight", "shucked_weight", "viscera_weight", "shell_weight", "rings")
train_set <- createDataPartition(abalone$sex, times = 1, p = 0.8, list = FALSE)
abalone_train <- slice(abalone, train_set)
abalone_test <- slice(abalone, -train_set)
abalone_fit_noCV <- rpart(sex ~ .,
                          data = abalone_train,
                          method = "class",
                          parms = list(split = 'information'),
                          control = rpart.control(xval = 0,
                                                  cp = 0.005))
abalone_fit_CV <- rpart(sex ~ .,
                        data = abalone_train,
                        method = "class",
                        parms = list(split = 'information'),
                        control = rpart.control(xval = 10,
                                                cp = 0.005))
printcp(abalone_fit_noCV)
printcp(abalone_fit_CV)
CV_pred <- predict(abalone_fit_CV, type = "class")
noCV_pred <- predict(abalone_fit_noCV, type = "class")
confusionMatrix(CV_pred, noCV_pred)
CV_pred <- predict(abalone_fit_CV, abalone_test, type = "class")
noCV_pred <- predict(abalone_fit_noCV, abalone_test, type = "class")
confusionMatrix(CV_pred, noCV_pred)
In true beginner fashion, I figured this out shortly after posting.
For anybody else coming upon this issue, it is basically answered on Cross Validated:
The final tree that is returned is still the initial tree. You must use the prune function using the cross-validation plot to choose the best subtree.
This is clear if you read the full Pruning the tree section of the vignette, rather than just the cross-validation section.
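As a concrete illustration of that advice (a sketch added here, not part of the original answer), the usual pattern with rpart is to pick the complexity parameter with the lowest cross-validated error from the cptable and prune to it:
# the cross-validation results live in the cptable of the CV fit
cp_table <- abalone_fit_CV$cptable
# choose the cp value with the smallest cross-validated error (xerror)
best_cp <- cp_table[which.min(cp_table[, "xerror"]), "CP"]
# prune to that subtree; this is where cross-validation actually changes the tree
abalone_pruned <- prune(abalone_fit_CV, cp = best_cp)
printcp(abalone_pruned)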

How to reproduce topic modelling results with the lda package in R

I am using the lda package in R to perform Latent Dirichlet Allocation modelling. However, each time I run the program I get a different output.
Using set.seed() doesn't seem to help the way it does with the topicmodels package.
Assuming an identical input, is there a way to ensure that identical topics are found on subsequent executions of the code?
I execute the function as follows:
set.seed(11)
fit1 <- lda.collapsed.gibbs.sampler(documents = documents, K = topics, vocab = vocab,
                                    num.iterations = iterations, alpha = alpha,
                                    eta = eta, initial = NULL, burnin = 500,
                                    compute.log.likelihood = TRUE)

Predicting topics with LDA

I am trying to extract topic assignments from a fit I built with R's 'lda' package. I created a fit:
fit <- lda.collapsed.gibbs.sampler(documents = documents, K = K, vocab = vocab,
                                   num.iterations = G, alpha = alpha, eta = eta, initial = NULL,
                                   burnin = 0, compute.log.likelihood = TRUE)
...and would like to extract a probability for each topic-document assignment, or simply the most likely topic for each document. With the 'topicmodels' package I can just call
topics(fit)
to get that (as in LDA with topicmodels, how can I see which topics different documents belong to?)
How can I get the same with 'lda'?
I haven't used the 'lda' package in R, but I do use the 'topicmodels' package.
I can create the LDA fit for, let's say, 5 topics using
topic.fit <- LDA(dtm, k = 5)   # dtm is your document-term matrix
Now if you want to extract the probability of each topic-document assignment, use
topic.fit@gamma[1:5, ]   # gamma contains the document-topic matrix (rows are documents, columns are topics)
and to get the most likely topic for each document you can use
most.likely.topic <- topics(topic.fit, 1)
Hope this answers your question.
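Since the question was specifically about the 'lda' package, here is a hedged sketch of how the same information is usually recovered from a lda.collapsed.gibbs.sampler fit (my addition, not from the original answer): the fit's document_sums component is a K x D matrix of per-document topic counts, which can be normalized into proportions.
# document-topic proportions: normalize each document's topic counts (result is D x K)
theta <- t(apply(fit$document_sums, 2, function(x) x / sum(x)))
# most likely topic for each document
most_likely_topic <- apply(fit$document_sums, 2, which.max)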
