Need help diagnosing cause of "Covariate matrix is singular" when estimating effect in structural topic model (stm) in R

First things first. I've saved my workspace and you can load it with the following command:
load(url("https://dl.dropboxusercontent.com/s/06oz5j41nif7la5/example.RData?dl=0"))
I have a number of abstract texts and I'm attempting to estimate a structural topic model to measure topic prevalence over time. The data contains a document id, abstract text, and year of publication.
I want to generate trends in expected topic proportion over time, like the plots the authors of the STM vignette produce.
I'm able to create my topic model without issue, but when I attempt to run the estimateEffect() function from the stm package in R, I always get the "Covariate matrix is singular" warning from the title, and the resulting trend plots come out clearly wrong.
In the documentation, the authors note that
The function will automatically check whether the covariate matrix is singular which generally results from linearly dependent columns. Some common causes include a factor variable with an unobserved level, a spline with degrees of freedom that are too high, or a spline with a continuous variable where a gap in the support of the variable results in several empty basis functions.
I've tried a variety of different models, from a 2-topic solution all the way up to a 52-topic solution, always with the same result. If I remove the spline function from the "year" variable in my model and assume a linear fit, then estimateEffect() works just fine. So it must be an issue with the splined data; I just don't know what, exactly.
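One thing worth checking, given the "gap in the support" cause mentioned in the documentation: whether every year in the observed range actually appears in the metadata. A minimal sketch, assuming the out$meta object from the code below (or from the loaded workspace):
# Check for gaps in the support of year, which can leave empty spline basis functions
range(out$meta$year)
# Years inside the observed range that never occur in the data:
setdiff(seq(min(out$meta$year), max(out$meta$year)), unique(out$meta$year))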
Again, here's a link to my workspace:
load(url("https://dl.dropboxusercontent.com/s/06oz5j41nif7la5/example.RData?dl=0"))
And here is the code I'm using to get there:
library(udpipe) # lemmatization/annotation
library(dplyr) # data wrangling
library(readr) # import data
library(ggplot2) # viz
library(stm) # STM
library(tidytext) # Tf-idf
library(tm) # DTM stuff
library(quanteda) # For using ngrams in STM
rm(list = ls())
abstracts <- read_delim("Data/5528_demand_ta.txt",
delim = "\t", escape_double = FALSE,
col_names = TRUE, trim_ws = TRUE)
abstracts <- rename(abstracts, doc_id = cpid)
abstracts$doc_id <- as.character(abstracts$doc_id)
# Download english dictionary
ud_model <- udpipe_download_model(language = "english")
ud_model <- udpipe_load_model(ud_model$file_model)
# Interpret abstracts assuming English
x <- udpipe_annotate(ud_model, x = abstracts$abstract, doc_id = abstracts$doc_id)
x <- as.data.frame(x)
# Regroup terms
data <- paste.data.frame(x, term = "lemma", group = c("doc_id"))
data <- left_join(data, abstracts) %>%
rename(term = lemma) %>%
select(doc_id, term, year)
# Prepare text
processed <- textProcessor(documents = data$term,
metadata = data,
lowercase = TRUE,
removestopwords = TRUE,
removenumbers = TRUE,
removepunctuation = TRUE,
stem = FALSE)
out <- prepDocuments(processed$documents,
processed$vocab,
processed$meta,
lower.thresh = 20, # term must appear in at least n docs to matter
upper.thresh = 1000) # I've been using about 1/3 of documents as an upper thresh
# Build model allowing tSNE to pick k (should result in 52 topics)
stm_mod <- stm(documents = out$documents,
vocab = out$vocab,
K = 0,
init.type = "Spectral",
prevalence = ~ s(year),
data = out$meta,
max.em.its = 500, # Max number of EM iterations to attempt
seed = 831)
###################################################################################
########### If you loaded the workspace from my link, then you are here ###########
###################################################################################
# Estimate effect of year
prep <- estimateEffect(formula = 1:52 ~ s(year),
stmobj = stm_mod,
metadata = out$meta)
# Plot expected topic proportion
summary(prep, topics=1)
plot.estimateEffect(prep,
"year",
method = "continuous",
model = stm_mod,
topics = 5,
printlegend = TRUE,
xaxt = "n",
xlab = "Years")

A singular matrix simply means that you have linearly dependent rows or columns. The first thing you could do is check the determinant of the matrix: a singular matrix has a zero determinant, which means it can't be inverted.
The next thing would be to identify the linearly dependent rows (or columns). You can do so using Smisc::findDepMat(X, rows = TRUE, tol = 1e-10) for rows, and Smisc::findDepMat(X, rows = FALSE, tol = 1e-10) for columns. You MAY be able to alter the levels of tol in findDepMat() and etol in stm() to arrive at a solution; probably an unstable solution, but a solution.
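As a sketch of how that might look here, assuming estimateEffect() builds its design matrix from the s(year) term (stm's s() is a B-spline basis wrapper), so we can rebuild and inspect that matrix ourselves:
library(stm)
library(Smisc)
# Rebuild the covariate matrix implied by ~ s(year): intercept plus spline basis
X <- cbind(1, s(out$meta$year))
det(crossprod(X)) # at or near zero if the matrix is singular
# Flag linearly dependent columns, e.g. empty basis functions:
findDepMat(X, rows = FALSE, tol = 1e-10)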

Related

Is there a way to limit the number of features in IML package Shapley values?

I have an AutoML model created with the H2O package. Currently, H2O only calculates Shapley values on tree-based models, so I've used the iml package to calculate the values on the AutoML model. However, because I have a large number of features, the plot is too jumbled to read. I'm looking for a way to select/show only the top X features. I can't find anything in the iml CRAN reference manual nor in other documentation I've found by Googling.
#initiate h2o
h2o.init()
h2o.no_progress()
#create automl model (data cleaning and train/test split not shown)
set.seed(1911)
num_models <- 10
aml <- h2o.automl(y = label, x = features,
training_frame = train.hex,
nfolds = 5,
balance_classes = TRUE,
leaderboard_frame = test.hex,
sort_metric = 'AUCPR',
max_models = num_models,
verbosity = 'info',
exclude_algos = "DeepLearning", #exclude for reproducibility
seed = 27)
# 1. create a data frame with just the features
features_eval <- as.data.frame(test) %>% dplyr::select(-target)
# 2. Create a vector with the actual responses
response <- as.numeric(as.vector(test$target))
# 3. Create custom predict function that returns the predicted values as a
# vector (probability of purchasing in our example)
pred <- function(model, newdata) {
results <- as.data.frame(h2o.predict(model, as.h2o(newdata)))
return(results[[3L]])
}
# example of prediction output
pred(aml, features_eval) %>% head()
#create predictor needed
predictor.aml <- Predictor$new(
model = aml,
data = features_eval,
y = response,
predict.fun = pred,
class = "classification"
)
high <- predict(aml, test.hex) %>% .[,3] %>% as.vector() %>% which.max()
high_prob_ob <- features_eval[high, ]
shapley <- Shapley$new(predictor.aml, x.interest = high_prob_ob, sample.size = 200)
plot(shapley, sort = TRUE)
Any suggestions/help appreciated.
Thank you,
Brian
I can offer a hacky solution that utilizes the fact that iml uses ggplot2 to plot.
N <- 10 # number of features to show
# Capture the ggplot2 object
p <- plot(shapley, sort = TRUE)
# Modify it so it shows only top N features
print(p + scale_x_discrete(limits=rev(p$data$feature.value[order(-p$data$phi)][1:N])))
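This works because plot() on a Shapley object returns a ggplot object whose $data slot holds one row per feature with its contribution phi; ordering by -phi and restricting the discrete axis limits to the top N entries drops every other feature from the plot.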

How to Create a loop (when levels do not overlap the reference)

I have written some code in R. This code takes some data and splits it into a training set and a test set. Then, I fit a "survival random forest" model on the training set. After, I use the model to predict observations within the test set.
Due to the type of problem I am dealing with ("survival analysis"), a confusion matrix has to be made for each "unique time" (stored in "unique.death.times"). For each of these confusion matrices, I am interested in the corresponding "sensitivity" value (e.g. sensitivity_1001, sensitivity_2005, etc.). I am trying to collect all these sensitivity values: I would like to plot them against the unique death times and determine the average sensitivity value.
In order to do this, I need to repeatedly calculate the sensitivity for each time point in "unique.death.times". I tried doing this manually and it is taking a long time.
Could someone please show me how to do this with a "loop"?
I have posted my code below:
#load libraries
library(survival)
library(data.table)
library(pec)
library(ranger)
library(caret)
#load data
data(cost)
#split data into train and test
ind <- sample(1:nrow(cost),round(nrow(cost) * 0.7,0))
cost_train <- cost[ind,]
cost_test <- cost[-ind,]
#fit survival random forest model
ranger_fit <- ranger(Surv(time, status) ~ .,
data = cost_train,
mtry = 3,
verbose = TRUE,
write.forest=TRUE,
num.trees= 1000,
importance = 'permutation')
#optional: plot training results
plot(ranger_fit$unique.death.times, ranger_fit$survival[1,], type = 'l', col = 'red') # for first observation
lines(ranger_fit$unique.death.times, ranger_fit$survival[21,], type = 'l', col = 'blue') # for the twenty-first observation
#predict observations test set using the survival random forest model
ranger_preds <- predict(ranger_fit, cost_test, type = 'response')$survival
ranger_preds <- data.table(ranger_preds)
colnames(ranger_preds) <- as.character(ranger_fit$unique.death.times)
From here, another user (Justin Singh) from a previous post (R: how to repeatedly "loop" the results from a function?) suggested how to create a loop:
sensitivity <- list()
for (time in names(ranger_preds)) {
prediction <- ranger_preds[which(names(ranger_preds) == time)] > 0.5
real <- cost_test$time >= as.numeric(time)
confusion <- confusionMatrix(as.factor(prediction), as.factor(real), positive = 'TRUE')
sensitivity[time] <- confusion$byClass[1]
}
But for some of the time points used in this loop, I get the following error:
Error in confusionMatrix.default(as.factor(prediction), as.factor(real), :
The data must contain some levels that overlap the reference.
Does anyone know how to fix this?
Thanks
For certain time points, prediction and/or real contain only one unique value, so the two factors end up with non-overlapping levels. Make sure the levels of the factors are explicitly set to be the same:
sapply(names(ranger_preds), function(x) {
prediction <- factor(ranger_preds[[x]] > 0.5, levels = c(TRUE, FALSE))
real <- factor(cost_test$time >= as.numeric(x), levels = c(TRUE, FALSE))
confusion <- caret::confusionMatrix(prediction, real, positive = 'TRUE')
confusion$byClass[1]
}, USE.NAMES = FALSE) -> result
result
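Since the goal was a plot against the unique death times plus an average, a short follow-up sketch using the result vector from above:
times <- as.numeric(names(ranger_preds))
plot(times, result, type = 'l', xlab = 'Unique death time', ylab = 'Sensitivity')
mean(result, na.rm = TRUE) # average sensitivity across all time points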

Meta and Metafor R Packages

I am currently conducting a meta-analysis in R using the package "metafor". During my research I came across a different package for meta-analyses in R, namely "meta". I like the forest plot created by the latter package better (design-wise), but unfortunately some of the data is not the same as in the plot I created with metafor.
Specifically, the data differs only for I^2 and the pooled estimate.
meta_1 <- rma(yi=yi, vi=vi, measure="SMD", method="ML", slab=Citation, data=dat)
forest(meta_1)
meta_2 <- metagen(yi,vi^.5,data = dat,studlab = paste(Citation), comb.fixed = FALSE,
comb.random = TRUE, hakn = TRUE, method.tau = "ML", sm = "SMD")
forest(meta_2)
Does anyone know why those differences emerge?
So I was able to get the prediction interval to match across functions, but not the I^2 values (even though the difference is only about 2%). The two packages compute I^2 differently: metafor derives it from the estimated tau^2 relative to a "typical" within-study sampling variance, while meta derives it from Cochran's Q, so small discrepancies between them are expected.
Anyway, I hope this code helps point you in the right direction. To get the CIs to match, you also have to use the method.tau.ci parameter in metagen().
library(meta)
library(metafor)
study <- c(1:8) # eight studies, to match the effect sizes below
yi <- c(-0.48965031, 0.64970214, 0.11201680, 0.07945655, -0.70874645, -0.54922759, 0.66768916, -0.45523574)
vi <- c(0.10299697, 0.14036855, 0.05137812, 0.03255550, 0.34913525, 0.34971466, 0.07539957, 0.08428983)
dat <- cbind(study, yi, vi)
dat <- as.data.frame(dat)
meta_1 <- rma(yi=dat$yi, vi=dat$vi, measure="SMD", method="REML", slab=paste(study), data=dat)
forest(meta_1)
meta_2 <- meta::metagen(TE =dat$yi,seTE = dat$vi^.5, method.tau = 'REML',
method.tau.ci = 'BJ', comb.random = TRUE, comb.fixed = TRUE,
sm = 'SMD')
forest(meta_2)
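If you want to inspect the remaining discrepancy directly, a small sketch (field names per each package's documentation; note the two objects may store I^2 on different scales, percentage versus proportion, depending on version):
meta_1$I2 # metafor: I^2 derived from tau^2 and a typical sampling variance
meta_2$I2 # meta: I^2 derived from Cochran's Q
c(metafor = meta_1$tau2, meta = meta_2$tau2) # tau^2 itself should agree (both REML)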

Plotting the effect of document pruning on text corpus in R text2vec

Is it possible to check how many documents remain in the corpus after applying prune_vocabulary in the text2vec package?
Here is an example for getting a dataset in and pruning vocabulary
library(text2vec)
library(data.table)
library(tm)
#Load movie review dataset
data("movie_review")
setDT(movie_review)
setkey(movie_review, id)
set.seed(2016L)
#Tokenize
prep_fun = tolower
tok_fun = word_tokenizer
it_train = itoken(movie_review$review,
preprocessor = prep_fun,
tokenizer = tok_fun,
ids = movie_review$id,
progressbar = FALSE)
#Generate vocabulary
vocab = create_vocabulary(it_train
, stopwords = tm::stopwords())
#Prune vocabulary
#How do I ascertain how many documents got kicked out of my training set because of the pruning criteria?
pruned_vocab = prune_vocabulary(vocab,
term_count_min = 10,
doc_proportion_max = 0.5,
doc_proportion_min = 0.001)
# create document term matrix with new pruned vocabulary vectorizer
vectorizer = vocab_vectorizer(pruned_vocab)
dtm_train = create_dtm(it_train, vectorizer)
Is there an easy way to understand how aggressively the term_count_min and doc_proportion_min parameters act on my text corpus? I am trying to do something similar to what the stm package offers through its plotRemoved function, which plots the number of documents and words removed at different thresholds.
v$vocab (where v is the vocabulary returned by create_vocabulary()) is a data.table which contains a lot of statistics about your corpus. prune_vocabulary() with the term_count_min, doc_proportion_min parameters just filters this data.table. For example, here is how you can calculate the number of removed tokens:
total_tokens = sum(v$vocab$terms_counts)
total_tokens
# 1230342
# now lets prune
v2 = prune_vocabulary(v, term_count_min = 10)
total_tokens - sum(v2$vocab$terms_counts)
# 78037
# effectively this will remove 78037 tokens
On the other hand, you can create document-term matrices with different vocabularies and check different statistics with functions from the Matrix package: colMeans(), colSums(), rowMeans(), rowSums(), etc. With these you can obtain any of the metrics above.
For example here is how to find empty documents:
doc_word_count = Matrix::rowSums(dtm)
indices_empty_docs = which(doc_word_count == 0)
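This also answers the original question of how many documents remain after pruning, e.g. with the dtm_train matrix built from the pruned vocabulary in the question:
doc_word_count = Matrix::rowSums(dtm_train)
sum(doc_word_count == 0) # documents emptied out by pruning
nrow(dtm_train) - sum(doc_word_count == 0) # documents remaining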

How to use weights from survey package in TermDocumentMatrix

I work a lot with samples that I want to generalize to larger populations. Most of the time the samples are biased and need to be weighted with the survey package. However, I have not found a way to weight a TermDocumentMatrix with these kinds of weights. Consider this example:
library(tm)
library(wordcloud)
set.seed(123)
# Consider this example: I have performed a sample from a population and now have
# 1000 observations of text. In the data I also have information about gender.
# The sample
data <- rbind(data.frame(gender = "M",
words = sample(c("education", "money", "family",
"house", "debts"),
600, replace = TRUE)),
data.frame(gender = "F",
words = sample(c("career", "bank", "friends",
"drinks", "relax"),
400, replace = TRUE)))
# I create a simple wordcloud
text <- paste(data$words, collapse = " ")
matrix <- as.matrix(
TermDocumentMatrix(
VCorpus(
VectorSource(text)
)
)
)
This produces a wordcloud in which the terms mentioned by men are bigger, because they appear more often. However, I know the true distribution of this population, so this wordcloud is biased.
The true gender distribution
true_gender_dist <- data.frame(gender = c("M", "F"), freq = nrow(data) * c(0.49,0.51))
With the survey package I can weight the data with the rake function
library(survey)
rake_data <- rake(design = svydesign(ids = ~1, data = data),
sample.margins = list(~gender),
population.margins = list(true_gender_dist))
In order to use the weights in analysis, visualizations etc. (that are not included in the survey package) I add the weights to the original data.
data_weighted <- cbind(data, data.frame(weights = weights(rake_data)))
So far so good. However, I would like to make a wordcloud that takes these weights into consideration.
My first attempt would be to use the weights in making the Term Document Matrix.
text_corp <- VCorpus(VectorSource(text))
w_tdm <- TermDocumentMatrix(text_corp,
control = list(weighting = weights(rake_data)))
But then I get:
Error in .TermDocumentMatrix(m, weighting) : invalid weighting
Is this at all possible?
I can't comment yet, so I'll use the answer to comment your question:
You could be interested in the R package stm (structural topic models). It provides the possibility to infer latent topics with respect to meta variables (continuous and/or discrete).
You can generate different kinds of plots to check how metavariables influence
a) the prevalence of topics,
b) the preferred words inside a topic,
c) and some more :)
Some links, if you're interested:
Paper describing the R package
R documentation
Some more Papers <-- this is a really good collection, if you want to dive into the subject some more!
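Setting stm aside, a hedged sketch of a direct workaround for the wordcloud itself (tm's weighting control expects a weighting function, not a numeric vector, hence the error above, so we bypass the TermDocumentMatrix and sum the rake weights per word instead of counting raw occurrences), using data_weighted from the question:
library(dplyr)
library(wordcloud)
# Each observation contributes its survey weight instead of a count of 1
weighted_freq <- data_weighted %>%
group_by(words) %>%
summarise(freq = sum(weights))
wordcloud(words = weighted_freq$words, freq = weighted_freq$freq)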
