How to combine multiwords in a dfm?

I created a corpus of 233 rows and 3 columns (Date, Title, Article), where the last column, Article, contains the text (so I have 233 texts). The final aim is to apply topic models and, to do so, I need to convert my corpus into a dfm. First, though, I would like to combine words into bigrams and trigrams to make the analysis more rigorous.
The problem is that when I use textstat_collocations or tokens_compound, I am forced to tokenize the corpus and, in doing so, I lose the 233-document structure that is crucial for topic modelling. In fact, once I apply those functions, I just get one row of bigrams and trigrams, which is useless to me.
So my question is: do you know any other way to look for bigrams and trigrams in a dfm without necessarily tokenizing the corpus?
Or, in other words, what do you usually do to look for multiwords in a dfm?
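For reference, this is roughly the workflow I have been attempting (a sketch only: my_corpus stands for my corpus object, and in recent quanteda versions textstat_collocations comes from the quanteda.textstats package):
library(quanteda)
library(quanteda.textstats)
# tokenize the corpus (the step I would like to avoid)
toks <- tokens(my_corpus, remove_punct = TRUE)
# detect candidate bigrams and trigrams
colls <- textstat_collocations(toks, size = 2:3, min_count = 5)
# join the detected collocations into single tokens
toks_comp <- tokens_compound(toks, pattern = colls)
# convert to a dfm for topic modelling
my_dfm <- dfm(toks_comp)
It is the tokens() step here that seems to cost me the 233-document structure.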
Thanks a lot for your time!

Related

textProcessor changes the number of observations of my corpus (using with stm package in R)

I'm working with a dataset that has 439 observations for text analysis in stm. When I use textProcessor, the number of observations changes to 438 for some reason. This creates problems later on, for example when using the findThoughts() function.
##############################################
# PREPROCESSING
##############################################
# Process the data for analysis.
temp <- textProcessor(sovereigncredit$Content, sovereigncredit,
                      customstopwords = customstop, stem = FALSE)
meta  <- temp$meta
vocab <- temp$vocab
docs  <- temp$documents

length(docs)                     # QUESTION: why is this 438 instead of 439, like the original dataset?
length(sovereigncredit$Content)  # see, this original one is 439

out   <- prepDocuments(docs, vocab, meta)
docs  <- out$documents
vocab <- out$vocab
meta  <- out$meta
An example of this becoming a problem down the line is:
thoughts1 <- findThoughts(sovereigncredit1, texts = sovereigncredit$Content, n = 5, topics = 1)
For which the output is:
"Error in findThoughts(sovereigncredit1, texts = sovereigncredit$Content, :
Number of provided texts and number of documents modeled do not match"
In which "sovereigncredit1" is a topic model based on "out" from above.
If my interpretation is correct (and I'm not making another mistake), the problem seems to be this one-observation difference in the number of observations before and after textProcessor.
So far, I've looked at the original csv and made sure there are in fact 439 valid observations and no empty rows. I'm not sure what's up. Any help would be appreciated!
stm can't handle empty documents, so we simply drop them. textProcessor removes a lot of material from texts: custom stopwords, words shorter than 3 characters, numbers, etc. So what's happening here is that one of your documents (whichever one is dropped) is essentially losing all of its contents at some point during the various steps textProcessor performs.
You can work out which document it was and decide what you want to do about it in this instance. In general, if you want more control over the text manipulation, I would strongly recommend the quanteda package, which has much more fine-grained tools than stm for turning texts into a document-term matrix.
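For example, textProcessor reports the indices of the documents it dropped in its docs.removed component, so a sketch along these lines (object names follow your question) should identify the culprit and keep the texts aligned with the model:
temp <- textProcessor(sovereigncredit$Content, sovereigncredit,
                      customstopwords = customstop, stem = FALSE)
temp$docs.removed                 # indices of documents dropped for becoming empty
# subset the original texts so they line up with the modeled documents
kept <- if (length(temp$docs.removed)) sovereigncredit$Content[-temp$docs.removed] else sovereigncredit$Content
thoughts1 <- findThoughts(sovereigncredit1, texts = kept, n = 5, topics = 1)
Note that prepDocuments() can drop further documents (it reports them in its own docs.removed), so the same adjustment may be needed after that step as well.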

replace all rare words from the text (substitute very large number of strings in a large text)

I have a large text and want to replace all the words that have low frequency with some marker, for example "^rare^". My document is 1.7 million lines and, after cleaning it up, it has 482,932 unique words, of which more than 400 thousand occur fewer than 6 times; these are the ones that I want to replace.
The couple of ways I know of take longer than is practical. For instance, I just tried mgsub from the qdap package:
test <- mgsub(rare, "<UNK>", smtxt$text)
Where rare is a vector of all the rare words and smtxt$text is the vector that holds all the text, one sentence per row.
R is still processing it.
I think this is expected, since each word is being checked against each sentence. For now I am resigned to giving up on this approach, but I would like to hear from others whether there is another way. I have not looked into many options beyond what I know: gsub and mgsub. I have also tried turning the text into a corpus to see if it would process faster.
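For what it's worth, this is the kind of vectorized lookup I am considering instead of pattern substitution (a sketch only: rare and smtxt$text follow the objects above, whitespace tokenization and at least one token per row are assumptions about my cleaned text):
toks <- strsplit(smtxt$text, " ", fixed = TRUE)   # one vector of tokens per sentence
flat <- unlist(toks, use.names = FALSE)
flat[flat %in% rare] <- "<UNK>"                   # one hashed lookup instead of 400k regexes
lens <- lengths(toks)
smtxt$text <- vapply(split(flat, rep(seq_along(lens), lens)),
                     paste, character(1), collapse = " ")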
Thanks

How to remove/separate conjoint words from tweets

I am mining Twitter data, and one of the problems I come across while cleaning the text is being unable to remove/separate conjoint words, which usually come from hashtags. Upon removing special characters and symbols like '#', I am left with phrases that make no sense. For instance:
1) Meaningless words: I have words like 'spillwayjfleck', 'bowhunterva', etc., which make no sense and need to be removed from my corpus. Is there any function in R which can do this? (See the sketch after this list.)
2) Conjoint words: I need a method to separate joined words like 'flashfloodwarn' into:
'flash', 'flood', 'warn', in my corpus.
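For point (1), the closest I have come is filtering tokens against an English dictionary, for instance with the hunspell package (only a sketch: toks stands for a character vector of tokens from my corpus, and it does not address the splitting in point (2)):
library(hunspell)
# keep only tokens that a US English dictionary recognizes
keep <- hunspell_check(toks, dict = dictionary("en_US"))
clean_toks <- toks[keep]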
Any help would be appreciated.

matching specific character strings as word patterns in a corpus in R

I need your help. I have a corpus in R and I want to find specific words in it. The final line of my code is this:
sentences_with_args <- arg.match(c("because", "however", "therefore"), myCorpus, 0)
The problem is that, apart from the words mentioned above, R returns several words that derive from them, like "how" from "however" and "cause" from "because". How do I match the specific strings to their exact occurrences in the corpus? Thank you.
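For clarity, the behaviour I am after is whole-word matching, roughly like this base-R sketch (sentences is a placeholder for the corpus texts as a character vector; arg.match is my own helper):
targets <- c("because", "however", "therefore")
# \\b marks word boundaries, so "cause" or "how" inside longer words will not match
pattern <- paste0("\\b(", paste(targets, collapse = "|"), ")\\b")
sentences_with_args <- sentences[grepl(pattern, sentences, ignore.case = TRUE)]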

Stopwords eliminating and vector making

In text2vec, the only function I could find relating to stopwords is create_vocabulary. But in a text-mining task, we usually need to eliminate the stopwords in the source documents and then build the corpus or do other further processing. How do we use stopwords when building the corpus, dtm and tcm with text2vec?
I have used tm for text mining before. It has a function for analyzing PDF documents, but it reads one paper as several vectors (one line, one vector) rather than reading each document as a single vector, as I expected. Furthermore, the format-conversion functions in tm garble Chinese text. If I use text2vec to read documents, can it read one paper into a single vector? (That is, is a vector element large enough for a paper published in a journal?) And are the corpus and vectors built in text2vec compatible with those built in tm?
There are two ways to create a document-term matrix:
Using feature hashing
Using vocabulary
See the text-vectorization vignette for details.
You are interested in the second option. This means you should build a vocabulary - the set of words/ngrams which will be used in all downstream tasks. create_vocabulary creates the vocabulary object, and only terms from this object will be used in further steps. So if you provide stopwords to create_vocabulary, it will remove them from the set of all observed words in the corpus. As you can see, you only need to provide stopwords once; all the downstream tasks will work with the vocabulary.
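A minimal sketch of that vocabulary route, assuming docs is a character vector with one document per element and using a toy stopword list:
library(text2vec)
it <- itoken(docs, preprocessor = tolower, tokenizer = word_tokenizer)
# stopwords passed here are excluded from the vocabulary once and for all
v <- create_vocabulary(it, stopwords = c("the", "a", "an", "of", "and"))
v <- prune_vocabulary(v, term_count_min = 5)
vectorizer <- vocab_vectorizer(v)
dtm <- create_dtm(it, vectorizer)
tcm <- create_tcm(it, vectorizer, skip_grams_window = 5L)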
On the second question:
text2vec doesn't provide high-level functions for reading PDF documents. However, it allows the user to provide a custom reader function. All you need is to read full articles with some function and reshape them into a character vector where each element corresponds to the desired unit of information (full article, paragraph, etc.). For example, you can easily combine lines into a single element with the paste() function:
article = c("sentence 1.", "sentence 2")
full_article = paste(article, collapse = ' ')
# "sentence 1. sentence 2"
Hope this helps.
