STM: estimating metadata/topic relationships when starting from a dfm

After running an STM model based on a Quanteda dfm, I want to estimate my covariates' effects on certain topics.
Running the STM model went fine and produced the topics as expected, but when using estimateEffect (the final step in the script below) the R session aborts with a 'fatal error'.
How can I estimate my covariates' effects when starting from a dfm? The STM manual advises on running an STM model from a dfm, but I couldn't find how to work with the covariates after this stage.
Here's the code:
# Packages needed: readtext for file import, quanteda for the dfm, stm for the model
library(readtext)
library(quanteda)
library(stm)
# Read texts with Quanteda
texts <- readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt",
                  docvarsfrom = "filenames", dvsep = "_",
                  docvarnames = c("Date of Publication", "Length LexisNexis", "source"),
                  encoding = "UTF-8-BOM")
mycorpus <- corpus(texts)
tokens <- tokens(mycorpus, remove_punct = TRUE, remove_numbers = TRUE, ngrams = 1)
mydfm <- dfm(tokens, remove = stopwords("english"), stem = TRUE)
# Run the STM model - metadata is passed with 'data = docvars(mycorpus)'
stm_from_dfm <- stm(mydfm, K = 10, prevalence = ~ Date.of.Publication + source,
                    gamma.prior = "L1", data = docvars(mycorpus))
# Estimate effects
prep <- estimateEffect(1:10 ~ Date.of.Publication + source, stm_from_dfm,
                       meta = docvars(mycorpus), uncertainty = "Global")
Alternatively, I made an STM corpus from my dfm, using STMcorpus <- asSTMCorpus(mydfm). But then I couldn't run the STM model, as it didn't recognize my metadata. Would it be better to follow this alternative strategy? (If so, I would need to associate the metadata with the STMcorpus in some way after running STMcorpus <- asSTMCorpus(mydfm).)

We worked through this by email, but I'll add the answer here for others who might encounter some form of the problem.
There is a bug in the matrixStats package which causes R to crash with large matrices on Windows only. The bug and solution are detailed here: https://github.com/HenrikBengtsson/matrixStats/issues/104. The issue contains both a simple test of the problem and instructions for installing the development version of matrixStats, which fixes it. This affects matrixStats 0.52.2 and will presumably be resolved by the next CRAN release.
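On the metadata question itself, one option worth noting (a sketch added here, not part of the original exchange; it assumes a quanteda version that provides convert()) is to convert the dfm with convert(x, to = "stm"), which returns documents, vocabulary, and metadata together, so the same meta object can feed both stm() and estimateEffect():
# Convert the dfm to stm's native input; meta is taken from the dfm's docvars
stm_input <- convert(mydfm, to = "stm")
stm_from_dfm <- stm(stm_input$documents, stm_input$vocab, K = 10,
                    prevalence = ~ Date.of.Publication + source,
                    gamma.prior = "L1", data = stm_input$meta)
prep <- estimateEffect(1:10 ~ Date.of.Publication + source, stm_from_dfm,
                       meta = stm_input$meta, uncertainty = "Global")
This sidesteps asSTMCorpus() entirely, since the docvars travel with the converted object.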

Related

Question about pre-processing Russian text for stm in R

I am trying to run a structural topic model in R using the stm package. The corpus is a collection of Russian-language speeches. The problem I am having is that the Russian words are not being pre-processed correctly. Here is the code I have written thus far:
library(stm)        # Package for structural topic modeling
library(igraph)     # Package for network analysis and visualisation
library(stmCorrViz) # Package for hierarchical correlation view of STMs
data <- read.csv("convocation4.csv") # Load data
stopwordsRU <- readLines("stop_words_russian.txt", encoding = "UTF-8") # Custom stopwords
processed <- textProcessor(
  data$text,
  metadata = data,
  lowercase = TRUE,
  removestopwords = TRUE,
  removenumbers = TRUE,
  removepunctuation = TRUE,
  stem = TRUE,
  language = "ru",
  customstopwords = stopwordsRU
)
# Prepare the documents (the original snippet referenced 'out' without defining it)
out <- prepDocuments(processed$documents, processed$vocab, processed$meta)
docs <- out$documents
vocab <- out$vocab
meta <- out$meta
fit <- stm(out$documents, out$vocab, K = 20, prevalence = ~party_id,
           max.em.its = 75, data = out$meta, init.type = "Spectral",
           seed = 8458159)
Here is an example of the problem. I have included one Russian stopword, уважаемый, in my list of custom stopwords. However, the word уважаемые, the plural form, comes out as one of the top words after running the model. Why is this happening? How might I solve it?
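A likely explanation (added here; the original thread's answer is not included above): textProcessor removes stopwords before stemming, so only exact matches such as уважаемый are dropped, while inflected forms like уважаемые survive that pass and are only stemmed afterwards. One workaround is to stem the tokens and the stopword list with the same stemmer and remove by stem; below is a sketch using quanteda and SnowballC (the workflow is an assumption, not a recommendation from the stm authors):
library(quanteda)
library(SnowballC)
corp <- corpus(data, text_field = "text")  # keeps the other columns as docvars
toks <- tokens(corp, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_wordstem(toks, language = "russian")  # stem the tokens first
# Stem the custom stopwords with the same stemmer so inflected forms collapse
stopstems <- unique(wordStem(stopwordsRU, language = "russian"))
toks <- tokens_remove(toks, stopstems)
out <- convert(dfm(toks), to = "stm")  # documents/vocab/meta for stm()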

How to train naive bayes classifier for tf-idf weighted dfm in R?

I'm new to text analysis, and I'm trying to train a naive Bayes classifier for a dataset from quanteda using the following code:
library("quanteda")
data(data_corpus_amicus, package = "quanteda.corpora")
# set training class
trainclass <- docvars(data_corpus_amicus, "trainclass")
amicus_train <- which(trainclass == "P" | trainclass == "R" )
# set test class
testclass <- docvars(data_corpus_amicus, "testclass")
amicus_test <- which(testclass == "AP" | testclass == "AR")
# create dfm from the data
amicus_dfm <- dfm(data_corpus_amicus, verbose = FALSE)
I wanted to train the classifier for a tf-idf weighed dfm so I tried the following:
amicus_dfm_weight <- dfm_tfidf(amicus_dfm, scheme_tf = "count", scheme_df = "inverse")
weight_nb <-textmodel_nb(amicus_dfm_weight[amicus_train,], docvars(data_corpus_amicus, "trainclass")[amicus_train])
The above code gives me the error Error: will not group a weighted dfm; use force = TRUE to override, so I also tried amicus_dfm_weight <- dfm_tfidf(amicus_dfm, scheme_tf = "count", scheme_df = "inverse", force = TRUE), but it still produces the same error.
Does anyone know what that error means and how to fix it?
Many thanks!
Yes: install the newest quanteda.textmodels package (>= 0.9.1), where textmodel_nb() now lives; this issue has been fixed there. In the future, please consider using reprex::reprex() for reproducible examples, and always include the package versions in the output or explanation.
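For reference, a minimal sketch of the fixed workflow (assuming quanteda.textmodels >= 0.9.1 is installed, and reusing the objects defined in the question):
library(quanteda.textmodels)  # textmodel_nb() moved here from quanteda
weight_nb <- textmodel_nb(amicus_dfm_weight[amicus_train, ],
                          docvars(data_corpus_amicus, "trainclass")[amicus_train])
predicted <- predict(weight_nb, newdata = amicus_dfm_weight[amicus_test, ])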
This would have been more appropriate as a GitHub issue than as a SO question.

Using R Package NNMAPSlite to get City Environmental vs Mortality Dataset

I have several questions for those who have worked with RStudio. Currently I need to work with the NMMAPSlite package. However, I found that there is an issue in the package itself when initialising the database connection to the remote DB that stores the NMMAPS city dataset.
In short, I need help to either:
resolve the problem with the old NMMAPSlite R package, or
find the NMMAPS dataset in CSV format.
BACKGROUND
As background, I'm using the NMMAPSlite package with the intent of reproducing a paper by Antonio Gasparrini. Attached at the bottom is the code base I would like to run. It requires:
require(dlnm);
require(NMMAPSlite)
The NMMAPSlite package seems to have been deprecated, so I managed to install the dependencies and the package from the archive. I elaborate below on the links required to get the dependencies for NMMAPS and DLNM as well.
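For reference, installing from the CRAN archive looks roughly like this (a sketch; substitute the actual archived version number, which is left unspecified here):
# Install an archived source package directly from the CRAN archive
install.packages("https://cran.r-project.org/src/contrib/Archive/NMMAPSlite/NMMAPSlite_<version>.tar.gz",
                 repos = NULL, type = "source")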
PROBLEM
The problem occurs when calling initDB(), which says it failed to create a remoteDB instance due to invalid object creation. I suspect, rather, that the error comes from the URL not being accepted. Here are the NMMAPS docs that describe the initDB() function. The DB initialisation is necessary to read the city dataset.
The following is the error from the R console when running initDB():
creating directory 'NMMAPS' for local storage
Error in validObject(.Object) :
invalid class “remoteDB” object: object needs a 'url' of type 'http://'
In addition: Warning message:
In grep("^http://", URL, fixed = TRUE, perl = TRUE) :
argument 'perl = TRUE' will be ignored
QUESTIONS
I know the NMMAPS packages are deprecated and perhaps too old, but I really want to reproduce/replicate Antonio Gasparrini's paper, Distributed lag non-linear models, for my undergraduate thesis project.
Hence,
I wonder if there is any way to get the NMMAPS dataset of city environmental data vs mortality. I visited the official NMMAPS database, but the link for downloading the data is either broken or the server is already down.
Or you can help me find out whether there is an equivalent to the NMMAPSlite package in R. I just need to download the city dataset containing the humidity trend, temperature trend, dew point, CO (carbon monoxide) trend, ozone (O3) trend, and deaths/mortality rate with respect to time for any particular city over 2+ years. The most important variables I need are the mortality rate and the ozone (O3) trend.
Or, as a last resort, could you suggest a dataset similar to the one used in his paper? Something from which I can analyze the time relationship to estimate the mortality rate given environmental and air pollution information?
APPENDIX
Definition of initDB
baseurl = "http://www.ihapss.jhsph.edu/NMMAPS/v0.1"
function (basedir = "NMMAPS")
{
if (!file.exists(basedir))
message(gettextf("creating directory '%s' for local storage",
basedir))
outcome <- new("remoteDB", url = paste(baseurl, "outcome",
sep = "/"), dir = file.path(basedir, "outcome"), name = "outcome")
exposure <- new("remoteDB", url = paste(baseurl, "exposure",
sep = "/"), dir = file.path(basedir, "exposure"), name = "exposure")
Meta <- new("remoteDB", url = paste(baseurl, "Meta", sep = "/"),
dir = file.path(basedir, "Meta"), name = "Meta")
assign("exposure", exposure, .dbEnv)
assign("outcome", outcome, .dbEnv)
assign("Meta", Meta, .dbEnv)
}
Code to run:
The error occurs at the initDB() call below.
require(dlnm);require(NMMAPSlite)
##############################
# LOAD AND PREPARE THE DATASET
##############################
initDB()
data <- readCity("ny", collapseAge = TRUE)
data <- data[,c("city", "date", "dow", "death", "tmpd", "dptp", "rhum", "o3tmean", "o3mtrend", "cotmean", "comtrend")]
# TEMPERATURE: CONVERSION TO CELSIUS
data$temp <- (data$tmpd-32)*5/9
# POLLUTION: O3 AND CO AT LAG-01
data$o3 <- data$o3tmean + data$o3mtrend
data$co <- data$cotmean + data$comtrend
data$o301 <- filter(data$o3, c(1,1)/2, sides = 1)
data$co01 <- filter(data$co, c(1,1)/2, sides = 1)
# DEW POINT TEMPERATURE AT LAG 0-1
data$dp01 <- filter(data$dptp, c(1,1)/2, sides = 1)
##############################
# CROSSBASIS SPECIFICATION
##############################
# FIXING THE KNOTS AT EQUALLY SPACED VALUES
range <- range(data$temp, na.rm = TRUE)
ktemp <- range[1] + (range[2] - range[1])/5 * 1:4
# CROSSBASIS MATRIX
ns.basis <- crossbasis(data$temp, varknots = ktemp, cenvalue = 21, lagdf = 5, maxlag = 30)
##############################
# MODEL FIT AND PREDICTION
##############################
ns <- glm(death ~ ns.basis + ns(dp01, df = 3) + dow + o301 + co01 +
          ns(date, df = 14*7), family = quasipoisson(), data)
ns.pred <- crosspred(ns.basis, ns, at = -16:33)
##############################
# RESULTS AND PLOTS
##############################
# 3-D PLOT (FIGURE 1)
crossplot(ns.pred,label="Temperature")
# SLICES (FIGURE 2, TOP)
percentiles <- round(quantile(data$temp,c(0.001,0.05,0.95,0.999)), 1)
ns.pred <- crosspred(ns.basis,ns,at=c(percentiles,-16:33))
crossplot(ns.pred,"slices",var=percentiles,lag=c(0,5,15,28), label="Temperature")
# OVERALL EFFECT (FIGURE 2, BELOW)
crossplot(ns.pred,"overall",label="Temperature", title="Overall effect of temperature on mortality
New York 1987–2000" )
# RR AT CHOSEN PERCENTILES VERSUS 21C (AND 95%CI)
ns.pred$allRRfit[as.character(percentiles)]
cbind(ns.pred$allRRlow,ns.pred$allRRhigh)[as.character(percentiles),]
##############################
# THE MOVING AVERAGE MODELS UP TO LAG x (DESCRIBED IN SECTION 5.2)
# CAN BE CREATED BY THE CROSSBASIS FUNCTION INCLUDING THE
# ARGUMENTS lagtype="strata", lagdf=1, maxlag=x
Resources for your context:
Distributed lag non-linear models: link
NMMAPSlite package docs: pdf download
DLNM package docs: pdf
Duplicate question from another forum: forum
How to install a package from a tar/archive: link
Meanwhile, I will contact the author of this package and see if I can get the dataset, preferably in CSV format.
It seems that your code is based on R < 3.0.0. You might find it difficult to reproduce the paper, as current R is typically > 4.0.0. You could try to install the Windows version of the NMMAPS database from the link given by 'Lil', but you will need to install an older version of R (2.9.2).
Or you could stay with the latest version of R and do a simple search on GitHub. In case you haven't found the NMMAPS database, you will find how to deal with it here.
You could try this link http://www.biostat.jhsph.edu/IHAPSS/data/NMMAPS/R/ to download the package. There you have the city data compressed, where you can choose New York manually if initDB() does not work.
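One further observation (added here, not from the original thread): the warning in the question hints at why the validity check fails on current R. remoteDB validates its url slot with a call of the form grep("^http://", URL, fixed = TRUE, perl = TRUE), and modern R ignores perl when fixed = TRUE, so the anchor ^ is matched as a literal character and no real URL can ever pass the check:
# With fixed = TRUE the pattern is taken literally, so "^http://" never
# matches an actual URL and the remoteDB validity check fails
grep("^http://", "http://www.ihapss.jhsph.edu/NMMAPS/v0.1",
     fixed = TRUE, perl = TRUE)
# Returns integer(0), with the warning "argument 'perl = TRUE' will be ignored"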

Use a pre trained model with text2vec?

I would like to use a pre-trained model with text2vec. My understanding is that the benefit here is that these models have already been trained on a huge volume of data, e.g. the Google News model.
Reading the text2vec documentation, it looks like the getting-started code reads in text data and then trains a model with it:
library(text2vec)
text8_file = "~/text8"
if (!file.exists(text8_file)) {
download.file("http://mattmahoney.net/dc/text8.zip", "~/text8.zip")
unzip("~/text8.zip", files = "text8", exdir = "~/")
}
wiki = readLines(text8_file, n = 1, warn = FALSE)
The documentation then proceeds to show how to create tokens and a vocabulary:
# Create iterator over tokens
tokens <- space_tokenizer(wiki)
# Create vocabulary. Terms will be unigrams (simple words).
it = itoken(tokens, progressbar = FALSE)
vocab <- create_vocabulary(it)
vocab <- prune_vocabulary(vocab, term_count_min = 5L)
# Use our filtered vocabulary
vectorizer <- vocab_vectorizer(vocab)
# use window of 5 for context words
tcm <- create_tcm(it, vectorizer, skip_grams_window = 5L)
Then, this looks like the step to fit the model:
glove = GlobalVectors$new(word_vectors_size = 50, vocabulary = vocab, x_max = 10)
glove$fit(tcm, n_iter = 20)
My question is: is the well-known Google pre-trained word2vec model usable here, without needing to rely on my own vocab or my own local device to train the model? If yes, how could I read it in and use it in R?
I think I'm misunderstanding or missing something here. Can I use text2vec for this task?
At the moment text2vec doesn't provide any functionality for downloading/manipulating pre-trained word embeddings.
I have a draft to add such utilities to the next release.
On the other hand, you can easily do it manually with just standard R tools. For example, here is how to read fastText vectors:
con = url("https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.af.300.vec.gz", "r")
con = gzcon(con)
wv = readLines(con, n = 10)
Then you just need to parse it; strsplit and rbind are your friends.
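For example, a minimal parsing sketch (an addition to the answer above; it assumes the usual .vec layout of a header line followed by one "word v1 v2 ..." line per word):
# Drop the header line, then split each remaining line into word + numbers
parts <- strsplit(trimws(wv[-1]), " ", fixed = TRUE)
words <- vapply(parts, function(p) p[1], character(1))
vectors <- do.call(rbind, lapply(parts, function(p) as.numeric(p[-1])))
rownames(vectors) <- words  # one row of the matrix per word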
This comes a bit late, but it might be of interest for other users. Taylor Van Anne provides a little tutorial on how to use pre-trained GloVe vector models with text2vec here:
https://gist.github.com/tjvananne/8b0e7df7dcad414e8e6d5bf3947439a9

In text2vec package in R, could not find function "create_vocab_corpus"

I was trying to understand the text2vec package from http://dsnotes.com/articles/text2vec,
but I got stuck at the following step:
Now we can construct the DTM. Again, since all functions related to corpus construction have a streaming API, we have to create an iterator and provide it to the create_vocab_corpus function:
it <- itoken(movie_review[['review']], preprocess_function = tolower,
tokenizer = word_tokenizer, chunks_number = 10, progessbar = F)
corpus <- create_vocab_corpus(it, vocabulary = vocab)
This code throws an error:
Error: could not find function "create_vocab_corpus"
Please see the tutorial for the latest version (0.3): https://cran.r-project.org/web/packages/text2vec/vignettes/text-vectorization.html. There were some API breaks in v0.3.
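For orientation, here is a rough v0.3-style equivalent of the snippet above (a sketch based on the newer API, not copied from the linked tutorial): create_vocab_corpus is replaced by a vectorizer plus create_dtm.
library(text2vec)
data("movie_review")
it <- itoken(movie_review$review, preprocessor = tolower,
             tokenizer = word_tokenizer, progressbar = FALSE)
vocab <- create_vocabulary(it)
vectorizer <- vocab_vectorizer(vocab)
dtm <- create_dtm(it, vectorizer)  # the DTM formerly built by create_vocab_corpus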
