Need help diagnosing cause of "Covariate matrix is singular" when estimating effect in structural topic model (stm) in R

First things first. I've saved my workspace and you can load it with the following command:
load(url("https://dl.dropboxusercontent.com/s/06oz5j41nif7la5/example.RData?dl=0"))
I have a number of abstract texts and I'm attempting to estimate a structural topic model to measure topic prevalence over time. The data contains a document id, abstract text, and year of publication.
I want to generate trends in expected topic proportion over time, like the plots the authors of the STM vignette produce.
I'm able to create my topic model without issue, but when I attempt to run the estimateEffect() function from the stm package in R, I always get the "Covariate matrix is singular" warning from the title, and the resulting trend plots come out clearly wrong.
In the documentation, the authors note that
The function will automatically check whether the covariate matrix is singular which generally results from linearly dependent columns. Some common causes include a factor variable with an unobserved level, a spline with degrees of freedom that are too high, or a spline with a continuous variable where a gap in the support of the variable results in several empty basis functions.
I've tried a variety of different models, from a 2-topic solution all the way up to a 52-topic solution, always with the same result. If I remove the spline function from the "year" variable in my model and assume a linear fit, then estimateEffect() works just fine. So it must be an issue with the splined data; I just don't know what, exactly.
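One thing worth checking, given the "gap in the support" cause mentioned in the documentation: whether every year in the observed range actually appears in the metadata. A minimal sketch, assuming the out$meta object from the code below (or from the loaded workspace):
# Check for gaps in the support of year, which can leave empty spline basis functions
range(out$meta$year)
# Years inside the observed range that never occur in the data:
setdiff(seq(min(out$meta$year), max(out$meta$year)), unique(out$meta$year))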
Again, here's a link to my workspace:
load(url("https://dl.dropboxusercontent.com/s/06oz5j41nif7la5/example.RData?dl=0"))
And here is the code I'm using to get there:
library(udpipe) # lemmatization/annotation
library(dplyr) # data wrangling
library(readr) # import data
library(ggplot2) # viz
library(stm) # STM
library(tidytext) # Tf-idf
library(tm) # DTM stuff
library(quanteda) # For using ngrams in STM
rm(list = ls())
abstracts <- read_delim("Data/5528_demand_ta.txt",
delim = "\t", escape_double = FALSE,
col_names = TRUE, trim_ws = TRUE)
abstracts <- rename(abstracts, doc_id = cpid)
abstracts$doc_id <- as.character(abstracts$doc_id)
# Download english dictionary
ud_model <- udpipe_download_model(language = "english")
ud_model <- udpipe_load_model(ud_model$file_model)
# Interpret abstracts assuming English
x <- udpipe_annotate(ud_model, x = abstracts$abstract, doc_id = abstracts$doc_id)
x <- as.data.frame(x)
# Regroup terms
data <- paste.data.frame(x, term = "lemma", group = c("doc_id"))
data <- left_join(data, abstracts) %>%
rename(term = lemma) %>%
select(doc_id, term, year)
# Prepare text
processed <- textProcessor(documents = data$term,
metadata = data,
lowercase = TRUE,
removestopwords = TRUE,
removenumbers = TRUE,
removepunctuation = TRUE,
stem = FALSE)
out <- prepDocuments(processed$documents,
processed$vocab,
processed$meta,
lower.thresh = 20, # term must appear in at least n docs to matter
upper.thresh = 1000) # I've been using about 1/3 of documents as an upper thresh
# Build model allowing tSNE to pick k (should result in 52 topics)
stm_mod <- stm(documents = out$documents,
vocab = out$vocab,
K = 0,
init.type = "Spectral",
prevalence = ~ s(year),
data = out$meta,
max.em.its = 500, # Max number of EM iterations to attempt
seed = 831)
###################################################################################
########### If you loaded the workspace from my link, then you are here ###########
###################################################################################
# Estimate effect of year
prep <- estimateEffect(formula = 1:52 ~ s(year),
stmobj = stm_mod,
metadata = out$meta)
# Plot expected topic proportion
summary(prep, topics=1)
plot.estimateEffect(prep,
"year",
method = "continuous",
model = stm_mod,
topics = 5,
printlegend = TRUE,
xaxt = "n",
xlab = "Years")

A singular matrix simply means that you have linearly dependent rows or columns. The first thing you could do is check the determinant of the matrix: a singular matrix has a zero determinant, which means it can't be inverted.
The next thing would be to identify the linearly dependent rows (or columns). You can do so using Smisc::findDepMat(X, rows = TRUE, tol = 1e-10) for rows, and Smisc::findDepMat(X, rows = FALSE, tol = 1e-10) for columns. You MAY be able to alter the levels of tol in findDepMat() and etol in stm() to arrive at a solution; probably an unstable solution, but a solution.
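As a sketch of how that might look here, assuming estimateEffect() builds its design matrix from the s(year) term (stm's s() is a B-spline basis wrapper), so we can rebuild and inspect that matrix ourselves:
library(stm)
library(Smisc)
# Rebuild the covariate matrix implied by ~ s(year): intercept plus spline basis
X <- cbind(1, s(out$meta$year))
det(crossprod(X)) # at or near zero if the matrix is singular
# Flag linearly dependent columns, e.g. empty basis functions:
findDepMat(X, rows = FALSE, tol = 1e-10)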

Related

Is there a way to limit the number of features in IML package Shapley values?

I have an AutoML model created with the H2O package. Currently, H2O only calculates Shapley values on tree-based models, so I've used the iml package to calculate the values on the AutoML model. However, because I have a large number of features, the plot is too jumbled to read. I'm looking for a way to select/show only the top X features. I can't find anything in the iml CRAN reference manual nor in other documentation I've found by Googling.
#initiate h2o
h2o.init()
h2o.no_progress()
#create automl model (data cleaning and train/test split not shown)
set.seed(1911)
num_models <- 10
aml <- h2o.automl(y = label, x = features,
training_frame = train.hex,
nfolds = 5,
balance_classes = TRUE,
leaderboard_frame = test.hex,
sort_metric = 'AUCPR',
max_models = num_models,
verbosity = 'info',
exclude_algos = "DeepLearning", #exclude for reproducibility
seed = 27)
# 1. create a data frame with just the features
features_eval <- as.data.frame(test) %>% dplyr::select(-target)
# 2. Create a vector with the actual responses
response <- as.numeric(as.vector(test$target))
# 3. Create custom predict function that returns the predicted values as a
# vector (probability of purchasing in our example)
pred <- function(model, newdata) {
results <- as.data.frame(h2o.predict(model, as.h2o(newdata)))
return(results[[3L]])
}
# example of prediction output
pred(aml, features_eval) %>% head()
#create predictor needed
predictor.aml <- Predictor$new(
model = aml,
data = features_eval,
y = response,
predict.fun = pred,
class = "classification"
)
high <- predict(aml, test.hex) %>% .[,3] %>% as.vector() %>% which.max()
high_prob_ob <- features_eval[high, ]
shapley <- Shapley$new(predictor.aml, x.interest = high_prob_ob, sample.size = 200)
plot(shapley, sort = TRUE)
Any suggestions/help appreciated.
Thank you,
Brian
I can offer a hacky solution that utilizes the fact that iml uses ggplot2 to plot.
N <- 10 # number of features to show
# Capture the ggplot2 object
p <- plot(shapley, sort = TRUE)
# Modify it so it shows only top N features
print(p + scale_x_discrete(limits=rev(p$data$feature.value[order(-p$data$phi)][1:N])))
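This works because plot() on a Shapley object returns a ggplot object whose $data slot holds one row per feature with its contribution phi; ordering by -phi and restricting the discrete axis limits to the top N entries drops every other feature from the plot.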

How to Create a loop (when levels do not overlap the reference)

I have written some code in R. This code takes some data and splits it into a training set and a test set. Then, I fit a "survival random forest" model on the training set. After, I use the model to predict observations within the test set.
Due to the type of problem I am dealing with ("survival analysis"), a confusion matrix has to be made for each "unique time" (stored in "unique.death.times"). For each of these confusion matrices, I am interested in the corresponding "sensitivity" value (e.g. sensitivity_1001, sensitivity_2005, etc.). I am trying to collect all these sensitivity values: I would like to plot them against the unique death times and determine the average sensitivity value.
In order to do this, I need to repeatedly calculate the sensitivity for each time point in "unique.death.times". I tried doing this manually and it is taking a long time.
Could someone please show me how to do this with a "loop"?
I have posted my code below:
#load libraries
library(survival)
library(data.table)
library(pec)
library(ranger)
library(caret)
#load data
data(cost)
#split data into train and test
ind <- sample(1:nrow(cost),round(nrow(cost) * 0.7,0))
cost_train <- cost[ind,]
cost_test <- cost[-ind,]
#fit survival random forest model
ranger_fit <- ranger(Surv(time, status) ~ .,
data = cost_train,
mtry = 3,
verbose = TRUE,
write.forest=TRUE,
num.trees= 1000,
importance = 'permutation')
#optional: plot training results
plot(ranger_fit$unique.death.times, ranger_fit$survival[1,], type = 'l', col = 'red') # for first observation
lines(ranger_fit$unique.death.times, ranger_fit$survival[21,], type = 'l', col = 'blue') # for the twenty-first observation
#predict observations test set using the survival random forest model
ranger_preds <- predict(ranger_fit, cost_test, type = 'response')$survival
ranger_preds <- data.table(ranger_preds)
colnames(ranger_preds) <- as.character(ranger_fit$unique.death.times)
From here, another user (Justin Singh) from a previous post (R: how to repeatedly "loop" the results from a function?) suggested how to create a loop:
sensitivity <- list()
for (time in names(ranger_preds)) {
prediction <- ranger_preds[which(names(ranger_preds) == time)] > 0.5
real <- cost_test$time >= as.numeric(time)
confusion <- confusionMatrix(as.factor(prediction), as.factor(real), positive = 'TRUE')
sensitivity[time] <- confusion$byClass[1]
}
But for some of the time points used in this loop, I get the following error:
Error in confusionMatrix.default(as.factor(prediction), as.factor(real), :
The data must contain some levels that overlap the reference.
Does anyone know how to fix this?
Thanks
For certain time points, prediction and/or real contain only one unique value, so the two factors end up with non-overlapping levels. Make sure the levels of the factors are explicitly set to be the same:
sapply(names(ranger_preds), function(x) {
prediction <- factor(ranger_preds[[x]] > 0.5, levels = c(TRUE, FALSE))
real <- factor(cost_test$time >= as.numeric(x), levels = c(TRUE, FALSE))
confusion <- caret::confusionMatrix(prediction, real, positive = 'TRUE')
confusion$byClass[1]
}, USE.NAMES = FALSE) -> result
result
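Since the goal was a plot against the unique death times plus an average, a short follow-up sketch using the result vector from above:
times <- as.numeric(names(ranger_preds))
plot(times, result, type = 'l', xlab = 'Unique death time', ylab = 'Sensitivity')
mean(result, na.rm = TRUE) # average sensitivity across all time points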

Meta and Metafor R Packages

I am currently conducting a meta-analysis in R using the package "metafor". During my research I came across a different package for meta-analyses in R, namely "meta". I like the forest plot created by the latter package better (design-wise), but unfortunately some of the data is not the same as in the plot I created with metafor.
Specifically, the data differs only for I^2 and the pooled estimate.
meta_1 <- rma(yi=yi, vi=vi, measure="SMD", method="ML", slab=Citation, data=dat)
forest(meta_1)
meta_2 <- metagen(yi,vi^.5,data = dat,studlab = paste(Citation), comb.fixed = FALSE,
comb.random = TRUE, hakn = TRUE, method.tau = "ML", sm = "SMD")
forest(meta_2)
Does anyone know why those differences emerge?
So I was able to get the prediction interval to match across functions, but not the I^2 values (even though the difference is only about 2%). The two packages compute I^2 differently: metafor derives it from the estimated tau^2 relative to a "typical" within-study sampling variance, while meta derives it from Cochran's Q, so small discrepancies between them are expected.
Anyway, I hope this code helps point you in the right direction. To get the CIs to match, you also have to use the method.tau.ci parameter in metagen().
library(meta)
library(metafor)
study <- c(1:8) # eight studies, to match the effect sizes below
yi <- c(-0.48965031, 0.64970214, 0.11201680, 0.07945655, -0.70874645, -0.54922759, 0.66768916, -0.45523574)
vi <- c(0.10299697, 0.14036855, 0.05137812, 0.03255550, 0.34913525, 0.34971466, 0.07539957, 0.08428983)
dat <- cbind(study, yi, vi)
dat <- as.data.frame(dat)
meta_1 <- rma(yi=dat$yi, vi=dat$vi, measure="SMD", method="REML", slab=paste(study), data=dat)
forest(meta_1)
meta_2 <- meta::metagen(TE =dat$yi,seTE = dat$vi^.5, method.tau = 'REML',
method.tau.ci = 'BJ', comb.random = TRUE, comb.fixed = TRUE,
sm = 'SMD')
forest(meta_2)
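If you want to inspect the remaining discrepancy directly, a small sketch (field names per each package's documentation; note the two objects may store I^2 on different scales, percentage versus proportion, depending on version):
meta_1$I2 # metafor: I^2 derived from tau^2 and a typical sampling variance
meta_2$I2 # meta: I^2 derived from Cochran's Q
c(metafor = meta_1$tau2, meta = meta_2$tau2) # tau^2 itself should agree (both REML)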

Plotting the effect of document pruning on text corpus in R text2vec

Is it possible to check how many documents remain in the corpus after applying prune_vocabulary in the text2vec package?
Here is an example for getting a dataset in and pruning vocabulary
library(text2vec)
library(data.table)
library(tm)
#Load movie review dataset
data("movie_review")
setDT(movie_review)
setkey(movie_review, id)
set.seed(2016L)
#Tokenize
prep_fun = tolower
tok_fun = word_tokenizer
it_train = itoken(movie_review$review,
preprocessor = prep_fun,
tokenizer = tok_fun,
ids = movie_review$id,
progressbar = FALSE)
#Generate vocabulary
vocab = create_vocabulary(it_train
, stopwords = tm::stopwords())
#Prune vocabulary
#How do I ascertain how many documents got kicked out of my training set because of the pruning criteria?
pruned_vocab = prune_vocabulary(vocab,
term_count_min = 10,
doc_proportion_max = 0.5,
doc_proportion_min = 0.001)
# create document term matrix with new pruned vocabulary vectorizer
vectorizer = vocab_vectorizer(pruned_vocab)
dtm_train = create_dtm(it_train, vectorizer)
Is there an easy way to understand how aggressively the term_count_min and doc_proportion_min parameters act on my text corpus? I am trying to do something similar to what the stm package offers through its plotRemoved function, which plots the number of documents and words removed at different thresholds.
v$vocab (where v is the vocabulary returned by create_vocabulary()) is a data.table which contains a lot of statistics about your corpus. prune_vocabulary() with the term_count_min, doc_proportion_min parameters just filters this data.table. For example, here is how you can calculate the number of removed tokens:
total_tokens = sum(v$vocab$terms_counts)
total_tokens
# 1230342
# now lets prune
v2 = prune_vocabulary(v, term_count_min = 10)
total_tokens - sum(v2$vocab$terms_counts)
# 78037
# effectively this will remove 78037 tokens
On the other hand, you can create document-term matrices with different vocabularies and check different statistics with functions from the Matrix package: colMeans(), colSums(), rowMeans(), rowSums(), etc. With these you can obtain any of the metrics above.
For example here is how to find empty documents:
doc_word_count = Matrix::rowSums(dtm)
indices_empty_docs = which(doc_word_count == 0)
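This also answers the original question of how many documents remain after pruning, e.g. with the dtm_train matrix built from the pruned vocabulary in the question:
doc_word_count = Matrix::rowSums(dtm_train)
sum(doc_word_count == 0) # documents emptied out by pruning
nrow(dtm_train) - sum(doc_word_count == 0) # documents remaining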

How to use weights from survey package in TermDocumentMatrix

I work a lot with samples that I want to generalize to larger populations. Most of the time the samples are biased and need to be weighted with the survey package. However, I have not found a way to weight a TermDocumentMatrix with these kinds of weights. Consider this example:
library(tm)
library(wordcloud)
set.seed(123)
# Consider this example: I have performed a sample from a population and now have
# 1000 observations of text. In the data I also have information about gender.
# The sample
data <- rbind(data.frame(gender = "M",
words = sample(c("education", "money", "family",
"house", "debts"),
600, replace = TRUE)),
data.frame(gender = "F",
words = sample(c("career", "bank", "friends",
"drinks", "relax"),
400, replace = TRUE)))
# I create a simple wordcloud
text <- paste(data$words, collapse = " ")
matrix <- as.matrix(
TermDocumentMatrix(
VCorpus(
VectorSource(text)
)
)
)
This produces a wordcloud in which the terms mentioned by men are bigger, because they appear more often. However, I know the true distribution of this population, so this wordcloud is biased.
The true gender distribution
true_gender_dist <- data.frame(gender = c("M", "F"), freq = nrow(data) * c(0.49,0.51))
With the survey package I can weight the data with the rake function
library(survey)
rake_data <- rake(design = svydesign(ids = ~1, data = data),
sample.margins = list(~gender),
population.margins = list(true_gender_dist))
In order to use the weights in analysis, visualizations etc. (that are not included in the survey package) I add the weights to the original data.
data_weighted <- cbind(data, data.frame(weights = weights(rake_data)))
So far so good. However, I would like to make a wordcloud that takes these weights into consideration.
My first attempt would be to use the weights in making the Term Document Matrix.
text_corp <- VCorpus(VectorSource(text))
w_tdm <- TermDocumentMatrix(text_corp,
control = list(weighting = weights(rake_data)))
But then I get:
Error in .TermDocumentMatrix(m, weighting) : invalid weighting
Is this at all possible?
I can't comment yet, so I'll use the answer to comment your question:
You could be interested in the R package stm (structural topic models). It provides the possibility to infer latent topics with respect to meta variables (continuous and/or discrete).
You can generate different kinds of plots to check how metavariables influence
a) the prevalence of topics,
b) the preferred words inside a topic,
c) and some more :)
Some links, if you're interested:
Paper describing the R package
R documentation
Some more Papers <-- this is a really good collection, if you want to dive into the subject some more!
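Setting stm aside, a hedged sketch of a direct workaround for the wordcloud itself (tm's weighting control expects a weighting function, not a numeric vector, hence the error above, so we bypass the TermDocumentMatrix and sum the rake weights per word instead of counting raw occurrences), using data_weighted from the question:
library(dplyr)
library(wordcloud)
# Each observation contributes its survey weight instead of a count of 1
weighted_freq <- data_weighted %>%
group_by(words) %>%
summarise(freq = sum(weights))
wordcloud(words = weighted_freq$words, freq = weighted_freq$freq)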
