Eliminating stopwords and building vectors - r

In text2vec, the only function I could find relating to stopwords is "create_vocabulary". But in a text mining task, we usually need to remove the stopwords from the source documents before building the corpus or doing any further processing. How do we use stopwords when building the corpus, DTM and TCM with text2vec?
I've used tm for text mining before. It has a function for reading PDF documents, but it reads one paper as several vectors (one line, one vector) rather than reading each document into a single vector, as I expected. Furthermore, tm's format conversion functions produce garbled text for Chinese. If I use text2vec to read documents, can it read one paper into a single vector (i.e. is a vector large enough to hold a paper published in a journal)? And are the corpus and vectors built in text2vec compatible with those built in tm?

There are two ways to create a document-term matrix:
Using feature hashing
Using vocabulary
See the text-vectorization vignette for details.
You are interested in the second option. This means you should build a vocabulary - the set of words/ngrams which will be used in all downstream tasks. create_vocabulary creates a vocabulary object, and only terms from this object will be used in further steps. So if you provide stopwords to create_vocabulary, it will remove them from the set of all words observed in the corpus. As you can see, you only need to provide stopwords once; all the downstream tasks will work with the vocabulary.
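A minimal sketch of that workflow (the documents and stopword list are made up, and the itoken() argument names follow recent text2vec releases):
library(text2vec)

docs = c("the quick brown fox", "jumps over the lazy dog")

# one reusable iterator over tokenized documents
it = itoken(docs, preprocessor = tolower, tokenizer = word_tokenizer)

# stopwords are removed here, at vocabulary-construction time
vocab = create_vocabulary(it, stopwords = c("the", "over"))
vectorizer = vocab_vectorizer(vocab)

# the vectorizer already excludes the stopwords, so nothing else is needed downstream
dtm = create_dtm(it, vectorizer)
tcm = create_tcm(it, vectorizer, skip_grams_window = 5L)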
Answer to the second question.
text2vec doesn't provide high-level functions for reading PDF documents. However, it allows the user to provide a custom reader function. All you need to do is read full articles with some function and reshape the result into a character vector where each element corresponds to the desired unit of information (full article, paragraph, etc.). For example, you can easily combine lines into a single element with the paste() function:
article = c("sentence 1.", "sentence 2")
full_article = paste(article, collapse = ' ')
# "sentence 1. sentence 2"
Hope this helps.

Related

From word vector to document vector [text2vec]

I'd like to use the GloVe word embedding implemented in text2vec to perform supervised regression/classification. I read the helpful tutorial on the text2vec homepage on how to generate the word vectors. However, I'm having trouble grasping how to proceed further, namely apply or transform these word vectors and attach them to each document in such a way that each document is represented by a vector (derived from its component words' vectors I assume), to be used as input in a classifier. I've run into some quick fixes online for short documents, but my documents are rather lengthy (movie subtitles) and there doesn't seem to be any guidance on how to proceed with such documents - or at least guidance matching my comprehension level; I have experience working with n-grams, dictionaries, and topic models, but word embeddings puzzle me.
Thank you!
If your goal is to classify documents, I doubt any doc2vec approach will beat bag-of-words/ngrams. If you still want to try, a common simple strategy for short documents (< 20 words) is to represent the document as a weighted sum/average of its word vectors.
You can obtain it by something like:
# keep only terms that have both a dtm column and a word vector
common_terms = intersect(colnames(dtm), rownames(word_vectors))
# l1-normalize rows so each document's term weights sum to 1
dtm_averaged = normalize(dtm[, common_terms], "l1")
# you can re-weight the dtm above with tf-idf instead of the "l1" norm
# weighted average of word vectors per document
sentence_vectors = dtm_averaged %*% word_vectors[common_terms, ]
I'm not aware of any universally established method for obtaining good document vectors for long documents.
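For completeness, a rough sketch of where dtm and word_vectors above might come from; it and vectorizer are assumed to be the usual itoken()/vocab_vectorizer() objects, and the GlobalVectors arguments follow recent text2vec releases (they have changed between versions):
library(text2vec)

# term-co-occurrence matrix built with the same vectorizer as the dtm
tcm = create_tcm(it, vectorizer, skip_grams_window = 5L)

# fit GloVe; 'rank' is the word-vector dimensionality
glove = GlobalVectors$new(rank = 50, x_max = 10)
wv_main = glove$fit_transform(tcm, n_iter = 20)
word_vectors = wv_main + t(glove$components)  # main + context vectors

# document-term matrix over the same vocabulary
dtm = create_dtm(it, vectorizer)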

TM, Quanteda, text2vec. Get strings on the left of term in wordlist according to regex pattern

I would like to analyse a big folder of texts for the presence of names, addresses and telephone numbers in several languages.
These will usually be preceded by a word such as "Address", "telephone number", "name", "company", "hospital", or "deliverer". I will have a dictionary of these words.
I am wondering if text mining tools would be a good fit for the job.
I would like to create a corpus from all these documents and then find text that meets specific criteria (I am thinking of regex criteria) to the right of, or below, the given dictionary entry.
Is there syntax for this in R's text mining packages, i.e. to get the strings to the right of or below the wordlist entry that match a specific pattern?
If not, what would be a more suitable tool in R for the job?
Two options with quanteda come to mind (both are sketched in code below):
Use kwic with your list of target patterns, with a window big enough to capture the text after the term that you want. This returns a data.frame whose keyword and post columns you can use for your analysis. You can also construct a corpus directly from this object (corpus(mykwic)) and then focus on the new post docvar, which will contain the text you want.
Use corpus_segment where you use the target word list to create a "tag" type, and anything following this tag, until the next tag, will be reshaped into a new document. This works well but is a bit trickier to configure, since you will need to get the regex right for the tag.
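A rough sketch of both options with a made-up document and keyword list (recent quanteda versions expect a tokens object for kwic()):
library(quanteda)

corp = corpus(c(doc1 = "Name: John Smith. Address: 12 Main St. Telephone number: 555-0100."))
keywords = c("name", "address", "telephone")   # illustrative dictionary entries

# Option 1: kwic, then use the keyword/post columns, or build a corpus from the result
kw = kwic(tokens(corp), pattern = keywords, window = 6)
kw$post                 # text following each matched keyword
corp_kw = corpus(kw)    # corpus built directly from the kwic object

# Option 2: segment at each tagged keyword; text up to the next tag becomes a document
segs = corpus_segment(corp, pattern = "(Name|Address|Telephone number):",
                      valuetype = "regex")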

Avoiding specific words in word stemming with tm package

A previous post addressed this issue here: Text-mining with the tm-package - word stemming
However I am still running into challenges with the tm package.
My goal is to stem a large corpus of words, however I wish to avoid stemming specific words.
For instance, I am looking to stem words in the corpus to the root form "indian" (stemmed from "indians", "indianspeak", "indianss", etc.). However, stemming also transforms words such as "Indianapolis" and "Indiana" to indian, which I do not want.
The post mentioned above addresses this challenge by substituting unique identifiers for specific words in the corpus, stemming it, and then re-substituting the unique identifiers with the actual words. The approach makes sense, however I am still encountering problems with the meta data when the stemming transformation is applied to the corpus. After doing research, I am finding that tm package v0.6 made it so that you can't operate on simple character values (R-Project no applicable method for 'meta' applied to an object of class "character")
However, the solutions posted are not solving the errors I am encountering.
Starting from the solution in the first link posted, I am still running into errors from step 5:
# Step 5: reverse -> sub the identifier keys with the words you want to retain
corpus.temp[seq_len(length(corpus.temp))] <- lapply(corpus.temp, mgsub, pattern=replace, replacement=retain)
Error in UseMethod("meta", x) :
no applicable method for 'meta' applied to an object of class "character"
In order to move forward with my larger more complex corpus, I would like to understand why this is happening, and if there is a solution.
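Not a definitive fix, but a minimal sketch of what usually resolves this error: since tm 0.6 the corpus elements are PlainTextDocument objects, so a function that returns a bare character vector (such as mgsub) has to be wrapped in content_transformer() and applied with tm_map(). Here replace and retain are assumed to be the identifier keys and the retained words from the earlier steps, and mgsub() is qdap::mgsub:
library(tm)

# Step 5 rewritten: swap the identifier keys back for the retained words
corpus.temp <- tm_map(
  corpus.temp,
  content_transformer(function(x) qdap::mgsub(replace, retain, x))
)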

How to search for huge list of search-terms from a corpus using custom function in tm package

I want to select and retain the gene names from a corpus of multiple text documents using the tm package. I have used a custom function to keep only the genes defined in "pattern" and remove everything else. Here is my code:
docs <- Corpus(DirSource("path of the directory containing text documents"))
f <- content_transformer(function(x, pattern)regmatches(x, gregexpr(pattern, x, ignore.case=TRUE)))
genes = "IL1|IL2|IL3|IL4|IL5|IL6|IL7|IL8|IL9|IL10|TNF|TGF|AP2|OLR1|OLR2"
docs <- tm_map(docs, f, genes)
The code works fine. However, if I need to match a larger number of genes (say > 5000), what is the best way to approach it? I don't want to put the genes in an array and loop over tm_map, to avoid huge run times and memory constraints.
If you simply want the fastest vectorized fixed-string regex, use the stringi package, not tm. Specifically, look at the stri_match* functions (you might find stringr even faster if you're only handling ASCII - look for Hadley's latest versions and comments).
But if the regex of gene names is fixed and known upfront, and you're going to be doing a lot of retrieval on those few strings, then you could tag each document for faster retrieval.
(You haven't fully told us your use-case. What % of your runtime is this retrieval task? 0.1%? 99%? Are you storing your genes as text strings? Why not tokenize them and convert once to factors at input-time?)
Either way, tm is not a very scalable or performant package, so look at other approaches.
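A minimal sketch of the stringi route (the gene list and documents are made up): collapse the genes into one alternation regex and extract all matches in a single vectorized call:
library(stringi)

genes = c("IL1", "IL2", "IL6", "TNF", "TGF", "OLR1")   # illustrative subset
texts = c("IL6 and TNF were elevated in patients", "no cytokines were mentioned")

pattern = stri_c(genes, collapse = "|")   # one regex instead of looping per gene

# one vectorized pass over all documents; returns a list of matches per document
hits = stri_extract_all_regex(texts, pattern, case_insensitive = TRUE)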

How to replace English abbreviated forms with their dictionary form

I'm working on a system to analyze texts in English: I use Stanford CoreNLP to split whole documents into sentences and sentences into tokens. I also use the MaxEnt tagger to get the tokens' POS tags.
Now, considering that I use this corpus to build a supervised classifier, it would be good if I could replace any word like 're, 's, havin, sayin', etc. with its standard form (are, is, having, saying). I've been searching for an English dictionary file, but I don't know how to use it. There are so many distinct cases to consider that I don't think it's an easy task: is there some similar work or existing project that I could use?
Ideas:
I) Use string edit distance on a subset of your text: for words that do not exist in the dictionary, match them against existing dictionary words by edit distance (a rough R sketch follows after these ideas).
II) The key feature of many of your examples is that they are only 1 character different from the correct spelling. So, for those words that you fail to match with a dictionary entry, try adding each English character to the front or back and look up the resulting word in the dictionary. This is very expensive in the beginning, but if you keep track of those misspellings in a lookup table (re -> are), at some point you will have 99.99% of the common misspellings (or whatever you call them) in your lookup table with their correct spelling.
III) Train a word-level 2-gram or 3-gram language model on proper, clean English text (e.g. newspaper articles), then run it over your entire corpus and, for the words the language model considers unknown (meaning it did not see them in the training phase), check which word is most probable according to the model. Most probably the language model's top-10 predictions will contain the correctly spelled word.
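A minimal sketch of idea I in R, using base utils::adist() for the edit distance (the dictionary and tokens are made up):
dictionary = c("are", "is", "having", "saying")   # words known to be correct
tokens     = c("'re", "havin", "sayin'")          # out-of-dictionary forms

d = adist(tokens, dictionary)                  # edit-distance matrix, tokens x dictionary
closest = dictionary[apply(d, 1, which.min)]   # nearest dictionary word for each token
setNames(closest, tokens)
# 're -> "are", havin -> "having", sayin' -> "saying"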
