Creating topic models on frequency lists in R

I've been using the topicmodels package to create LDA models in R.
require(tm)
require(topicmodels)
textvector <- c("this is one sentence", "this is another one",
"a third sentence appears")
#and more, read in through a file
dtm <- DocumentTermMatrix(Corpus(VectorSource(textvector)))
lda.model <- LDA(dtm, 5)
But the only format it accepts documents in is as actual, literal documents. I was wondering if there is a way to provide a map of frequencies
[word1: 4, word2: 9, word3: 25, word5: 3, ...]
This is obviously not a 'map' in R, but is there any data structure (data frame, table, list of vectors) representation that allows creation of topic models from word frequencies?
The reason I need this is that the topic models aren't being created on 'documents' and 'words' as such, but on analogous features in images, and a long-form representation needs far too much space.

You don't need to use tm's call to create the document-term matrix. You can create and send in your own, as long as the "documents" are in rows and the component "words" are in columns. However, you cannot simply supply frequency counts in a table, because LDA relies on knowing which words appear in which documents!
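For example, here is a minimal sketch (the counts, document names and term names are made up for illustration) of building a DocumentTermMatrix directly from a matrix of frequency counts, one row per "document" (image) and one column per "word" (feature), via the slam package that tm builds on:
library(slam)
library(tm)
library(topicmodels)

# One row per document/image, one column per word/feature (illustrative values)
counts <- matrix(c(4, 9, 25, 3,
                   0, 2,  7, 1),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("img1", "img2"),
                                 c("word1", "word2", "word3", "word5")))

# Convert the dense count matrix to tm's sparse document-term representation
dtm <- as.DocumentTermMatrix(as.simple_triplet_matrix(counts), weighting = weightTf)
lda.model <- LDA(dtm, k = 2)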

Related

How can I convert to a data frame, drawing in data labels rather than content?

I've switched recently from using CSV or XLSX outputs from Qualtrics surveys to using SAV-format data, which I can analyse in SPSS and then pass on to R. One key challenge for R is the way survey data can appear as numbered responses ("1", "2", "3", etc.) while those numbers also correspond to specific items in your codebook ("pizza", "hot dogs", "veggie sausages", etc.). The haven package, now part of the tidyverse, offers a really useful way to draw in a complete data set, using (I believe) labels to include the names of values, and rendering the data as a tibble:
data <- read_sav(here("gits", "survey", "data", "survey.sav")) %>% select(Q0:Q68)
For some tools, I need to reduce the data back down to a data frame and zap all the labels, and that is pretty straightforward to achieve:
as.data.frame(as_factor(haven_spss_data$Q6))
However, I'd also like to have a data frame consisting of the response text for a given item, i.e. the labels. How would I draw in that data as ordered factors (like above), but based on labels instead of content?
This should do it:
library(foreign)
df <- read.spss("yourfile.sav",to.data.frame=TRUE)
The foreign::read.spss function uses SPSS value labels instead of the raw values by default (its use.value.labels argument defaults to TRUE) wherever labels are available.
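As a quick usage sketch (reusing the file and column names from the question), with that default spelled out:
library(foreign)

# use.value.labels = TRUE (the default) maps SPSS value labels
# ("pizza", "hot dogs", ...) onto factor levels instead of the numeric codes
df <- read.spss("yourfile.sav", to.data.frame = TRUE, use.value.labels = TRUE)
str(df$Q6)  # a factor whose levels are the response texts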

How to sort lines of text alphabetically based on a part of each line?

I have a text file that contains abbreviations like so (simplified example):
\item[3D] Three-dimensional
\item[PCA] Principal Component Analysis
\item[RF] Random Forest
\item[ANN] Artificial Neural Networks
I want to manipulate these lines in R so that the abbreviations (e.g. ANN) are sorted in alphabetical order, and an abbreviation that starts with a number (e.g. 3D) comes after the last abbreviation that starts with a letter. The \item[] markup should be ignored for sorting but left unmodified, as the lines are going to be used in a LaTeX file.
My desired output is:
\item[ANN] Artificial Neural Networks
\item[PCA] Principal Component Analysis
\item[RF] Random Forest
\item[3D] Three-dimensional
I would be interested in solving this using tidyverse but any other solution will be useful too.
Here’s a ‘tidyverse’ solution:
library(tidyverse)

sorted_lines = readLines(your_file) %>%
    tibble(text = .) %>%
    extract(text, into = 'abbr', regex = r'(\\item\[([^]]*)\])', remove = FALSE) %>%
    arrange(abbr) %>%
    pull(text)
Result:
\item[3D] Three-dimensional
\item[ANN] Artificial Neural Networks
\item[PCA] Principal Component Analysis
\item[RF] Random Forest
However, there's really no need to use tidy data manipulation here. You can equivalently use (mostly¹) base R functions:
library(stringr)  # for str_match; see the footnote below

lines = readLines(your_file)
abbreviations = str_match(lines, r'(\\item\[([^\]]*)\])')[, 2L]
sorted_lines = lines[order(abbreviations)]
Note that both solutions produce a different ordering than in your question, because they will order “3D” before “ANN”, as is conventional. Are you sure you want to put numbers at the end?
In both cases, the code extracts the abbreviation from each line of text via the regular expression r'(\\item\[([^]]*)\])', and then sorts the lines by these abbreviations.
The regular expression uses R 4.0's new raw string literals: r"(…)". This allows us to use backslashes inside the string without having to escape them. Without raw string literals, the regular expression would have to be written as "\\\\item\\[([^\\]]*)\\]", which is just unnecessarily hard to read.
¹ I'm using str_match from 'stringr', since the pattern-extraction functions in base R are a pain to use.
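If you do want digit-initial abbreviations such as "3D" to sort after the letter-initial ones, a minimal variation on the base R version above (my addition, not part of the original answer) is to sort first on "starts with a digit" and then alphabetically:
library(stringr)

lines = readLines(your_file)
abbreviations = str_match(lines, r'(\\item\[([^\]]*)\])')[, 2L]
starts_with_digit = grepl('^[0-9]', abbreviations)
# FALSE (letter-initial) sorts before TRUE (digit-initial)
sorted_lines = lines[order(starts_with_digit, abbreviations)]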

Text Mining: Getting a Sentence-Term Matrix

I'm currently running into trouble finding anything relevant to creating a sentence-term matrix in R using text mining.
I'm using the tm package and the only thing that I can find is converting to a tdm or dtm.
I'm using a single Excel file, and I'm only interested in text mining one column of it. That column has about 1200 rows. I want to create a row (sentence)-term matrix, i.e. a matrix that tells me the frequency of words in each row (sentence).
I want to create a matrix of 1's and 0's that I can run a PCA analysis on later.
A dtm in my case is not helpful because, since I'm only using one file, the number of rows is 1 and the columns are the frequencies of words in that whole document.
Instead, I want to treat the sentences as documents, if that makes sense. From there, I want a matrix with the frequency of words in each sentence.
Thank you!
When using text2vec you just need to feed the content of your column as a character vector into the tokenizer function; see the example below.
Concerning your downstream analysis, I would not recommend running PCA on count data / integer values, as PCA is not designed for this kind of data. You should either apply normalization, tf-idf weighting, etc. to your dtm to turn it into continuous data before feeding it to PCA, or otherwise apply correspondence analysis instead.
library(text2vec)

docs <- c("the coffee is warm",
          "the coffee is cold",
          "the coffee is hot",
          "the coffee is warm",
          "the coffee is hot",
          "the coffee is perfect")

# Generate a document-term matrix (one row per sentence) with text2vec
tokens = word_tokenizer(docs)
it = itoken(tokens,
            ids = paste0("sent_", 1:length(docs)),
            progressbar = FALSE)
vocab = create_vocabulary(it)
vectorizer = vocab_vectorizer(vocab)
dtm = create_dtm(it, vectorizer, type = "dgTMatrix")
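As a follow-up to the weighting point above, here is a minimal sketch (my addition) of applying text2vec's TfIdf transformer to the resulting dtm before any PCA-style analysis:
# Re-weight the counts with TF-IDF so the values are continuous rather than raw integers
tfidf = TfIdf$new()
dtm_tfidf = fit_transform(dtm, tfidf)

# e.g. principal components on the dense version (fine for a small matrix)
pca = prcomp(as.matrix(dtm_tfidf))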
With the corpus library:
library(corpus)
library(Matrix)
corpus <- federalist # sample data
x <- term_matrix(text_split(corpus, "sentences"))
Although, in your case, it sounds like you already split the text into sentences. If that is true, then there is no need for the text_split call; just do
x <- term_matrix(data$your_column_with_sentences)
(replacing data$your_column_with_sentences with whatever is appropriate for your data).
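Since the question asks for a matrix of 1's and 0's for PCA, one small extra step (not part of the corpus answer) is to threshold the counts into a binary indicator matrix:
x_binary <- 1 * (as.matrix(x) > 0)  # 1 if the word occurs in the sentence, 0 otherwise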
Can't add comments so here's a suggestion:
# Read data from the file using fread (for .csv, from the data.table package)
library(data.table)
library(stringr)

dat <- fread(filename)  # add parameters as needed: col.names, nrows, etc.
counts <- sapply(row_start:row_end,
                 function(z) str_count(dat[z, selected_col_name], "the"))
This will give you all occurrences of "the" in the column of interest for the selected rows. You could also use apply if it's for all rows, or other nested functions for different variations. Bear in mind that you would need to check for lowercase/uppercase letters; you can use tolower to achieve that. Hope this is helpful!
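For the case-sensitivity point, a small variation (using the same hypothetical names as above) that lowercases the text before counting:
counts <- sapply(row_start:row_end,
                 function(z) str_count(tolower(dat[z, selected_col_name]), "the"))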

Defining synonyms within a corpus of Documents using R

I have a corpus of documents on a very specific topic (e.g. sports/athletics). Within that corpus, I would like to define synonyms myself. The reason I want to define synonyms myself is that sometimes, given two words, the synonyms() function in the WordNet package does not recognise them as synonyms, but within the text they can be interpreted as such (for example, "fit" and "strong").
My idea is to use word associations with bigrams and trigrams and define a synonym when words appear frequently in a phrase and have similar semantic content. For example, using the crude dataset within the tm package, I would do something like:
library(tm)
library(RWeka)

data(crude)
options(mc.cores = 1)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
crudetdm <- TermDocumentMatrix(crude, control = list(stripWhitespace = TRUE,
                                                     removePunctuation = TRUE,
                                                     removeNumbers = TRUE,
                                                     stopwords = TRUE,
                                                     removeSparseTerms = TRUE,
                                                     tokenize = BigramTokenizer))
ListAssoc <- lapply(crudetdm$dimnames$Terms, function(x) findAssocs(crudetdm, x, 0.9))
However this returns (as expected) bigrams associated with bigrams, while my idea is to find individual words associated with the bigrams in crudetdm$dimnames$Terms (the same exercise with trigrams would be the next step). For example, using bigrams and the crude dataset, the ideal scenario would be ending up with a data.frame like:
Bigram          Associated Words
oil companies   policies, marketing, prices, measures, market, revenue...
Then I would go through the table myself and manually select those words that I believe can be considered synonyms in my dataset (my dataset is not that big). I can think of some workarounds that define multiple data.frames of bigrams and trigrams and match common words. However, I am sure there is a more elegant and efficient way of doing this in R.
Overall, my question is: given a series of bigrams and trigrams, how can I find individual words that are associated with them?
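One possible sketch of that idea (an untested suggestion on my part, not an accepted approach): tokenize into both unigrams and bigrams within the same TermDocumentMatrix, so that findAssocs() on a bigram term also returns associated single words:
library(tm)
library(RWeka)

data(crude)
UniBigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 2))
mixedtdm <- TermDocumentMatrix(crude, control = list(removePunctuation = TRUE,
                                                     removeNumbers = TRUE,
                                                     stopwords = TRUE,
                                                     tokenize = UniBigramTokenizer))
# assuming "oil companies" survives preprocessing as a term
findAssocs(mixedtdm, "oil companies", 0.9)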

Text Retrieval using R

I have been using R's text mining package and it's really a great tool. I have not found retrieval support, or maybe there are functionalities I am missing.
How can a simple VSM model be implemented using R's text mining package?
# Sample R commands in support of my previous answer
require(fortunes)
require(tm)
sentences <- NULL
for (i in 1:10) sentences <- c(sentences, fortune(i)$quote)
d <- data.frame(textCol = sentences)
ds <- DataframeSource(d)
dsc <- Corpus(ds)
dtm <- DocumentTermMatrix(dsc, control = list(weighting = weightTf, stopwords = TRUE))
dictC <- Dictionary(dtm)
# The query below is created from words in fortune(1) and fortune(2)
newQry <- data.frame(textCol = "lets stand up and be counted seems to work undocumented")
newQryC <- Corpus(DataframeSource(newQry))
dtmNewQry <- DocumentTermMatrix(newQryC, control = list(weighting = weightTf, stopwords = TRUE, dictionary = dictC))
dictQry <- Dictionary(dtmNewQry)
# Below does a naive similarity (number of features in common)
apply(dtm, 1, function(x, y = dictQry) length(intersect(names(x)[x != 0], y)))
Assuming VSM = Vector Space Model, you can put together a simple retrieval system in the following manner (a short sketch follows the list):
Create a Document Term Matrix of your collection/corpus
Create a function for your similarity measure (Jaccard, Euclidean, etc.). There are packages available with these functions. RSiteSearch should help in finding them.
Convert your query to a Document Term Matrix (which will have 1 row and is mapped using the same dictionary as used for the first step)
Compute similarity with the query and the matrix from the first step.
Rank the results and choose the top n.
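Here is a minimal sketch of those steps (illustrative only, using cosine similarity as the measure), based on the dtm and dtmNewQry objects created in the code block above:
# Cosine similarity between the query and each document in the collection
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a * a)) * sqrt(sum(b * b)))

m      <- as.matrix(dtm)             # collection: one row per document
qry    <- as.matrix(dtmNewQry)[1, ]  # query: a single row built with the same dictionary
common <- intersect(colnames(m), names(qry))  # align the term columns
scores <- apply(m[, common, drop = FALSE], 1, cosine_sim, b = qry[common])
head(sort(scores, decreasing = TRUE), 5)      # rank and keep the top n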
A non-R method is to use a GIN index on a text column (rows are documents) of a table in PostgreSQL. Using the tsvector querying methods, you can have a very fast retrieval system.
