Avoiding specific words in word stemming with tm package - r

A previous post addressed this issue here: Text-mining with the tm-package - word stemming
However, I am still running into challenges with the tm package.
My goal is to stem a large corpus of words; however, I wish to avoid stemming specific words.
For instance, I want words in the corpus stemmed to the root form "indian" (from "indians", "indianspeak", "indianss", etc.). However, stemming also transforms words such as "Indianapolis" and "Indiana" to "indian", which I do not want.
The post mentioned above addresses this challenge by substituting unique identifiers for the specific words in the corpus, stemming it, and then re-substituting the identifiers with the actual words. The approach makes sense, but I am still encountering problems with the metadata when the stemming transformation is applied to the corpus. From my research, tm package v0.6 changed things so that transformations can no longer operate on plain character values (R-Project no applicable method for 'meta' applied to an object of class "character").
However, the solutions posted are not solving the errors I am encountering.
Starting from the solution in the first link posted, I am still running into an error at step 5:
# Step 5: reverse -> sub the identifier keys with the words you want to retain
corpus.temp[seq_len(length(corpus.temp))] <- lapply(corpus.temp, mgsub, pattern=replace, replacement=retain)
Error in UseMethod("meta", x) :
no applicable method for 'meta' applied to an object of class "character"
In order to move forward with my larger, more complex corpus, I would like to understand why this is happening and whether there is a solution.
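(For reference, a minimal sketch of how the same substitute / stem / re-substitute steps can be wrapped in content_transformer() so that tm >= 0.6 keeps proper document objects rather than bare character vectors; the toy corpus, the identifier keys, and the use of qdap's mgsub() below are illustrative assumptions.)
library(tm)
library(qdap)  # provides mgsub()

retain  <- c("Indianapolis", "Indiana")                 # words to protect from stemming
replace <- paste0("xxprotectedxx", seq_along(retain))   # unique identifier keys

corpus <- VCorpus(VectorSource("Indians in Indianapolis, Indiana"))

# Swap the protected words for identifiers; content_transformer() keeps
# the result a tm document, which avoids the 'meta' error
corpus <- tm_map(corpus, content_transformer(function(x) mgsub(retain, replace, x)))

# Stem the corpus
corpus <- tm_map(corpus, stemDocument)

# Step 5: reverse the substitution, again via content_transformer()
corpus <- tm_map(corpus, content_transformer(function(x) mgsub(replace, retain, x)))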

Related

fasttext - some word vectors return 0

I am new to R and fastText.
I read on the fastText website that you should be able to retrieve word vectors for names like "New York" by typing "New_York", but it's not working for me. There are also other vectors that I'm not able to retrieve properly.
I thought it could be because of the OS (I work on Windows).
require(plyr)
require(proxy)
require(ggpubr)
require(jtools)
require(tidyverse)
require(reshape2)
require(fastTextR)
london_agg <- read.csv2("londra_latlong2.csv",header=T,sep=",",dec=".",fill = T)
model <- ft_load("fastText/cc.en.300.bin")
london_agg$Name <- as.character(london_agg$Name)
ccc <- ft_word_vectors(model,london_agg$Name)
A word-vector model will only have full-word vectors for a string like New_York if the training data had preprocessed the text to create such tokens. I'm not sure if the cc FastText models, specifically, have done that – their distribution page doesn't mention it. (Google's original GoogleNews vectors in plain word2vec had used a phrase-combining algorithm to create vectors for a lot of multiword tokens like New_York.)
Failing that, a FastText model will also synthesize guess-vectors for other tokens that weren't in the training-data, using substrings of your requested token.
The cc.en.300.bin vectors are reported (on the same page as above) as only having learned 5-character n-grams, so an unknown token (out-of-vocabulary with respect to the training tokens) with fewer than 5 characters can't get any vector from FastText.
But those with more characters should get at least junk vectors. (The method for matching n-grams is based on a collision-oblivious hashtable, so even if the token's 5-grams weren't in the training data, some random junk data should be returned.)
Perhaps there's a bug in the R FastText implementation you're using. Separate from looking up your specific geo-data tokens, could you expand your question with some examples of individual tokens, of different lengths, that either return credible vectors (every dimension non-zero) or absolutely nothing (all dimensions zero)? The pattern of lookup words that return all-zeros might give a further hint as to what, if anything, the problem is.
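For example, something along these lines (assuming ft_word_vectors() returns one row per requested token, as your code suggests; the probe tokens are made up):
library(fastTextR)

model <- ft_load("fastText/cc.en.300.bin")

# Probe tokens of different lengths and flag the ones that come back all-zero
probe <- c("New_York", "London", "cat", "xyzqj", "Westminster_Abbey")
vecs  <- ft_word_vectors(model, probe)

data.frame(token    = probe,
           n_chars  = nchar(probe),
           all_zero = apply(vecs, 1, function(v) all(v == 0)))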

Dictionary in quanteda with logical/context rules (cf Wordstat's "Proximity Rules")

Prior to using quanteda for text analysis I used Provalis Wordstat. When using dictionary methods, Wordstat allows the user to include in the dictionary both regular terms and proximity rules (e.g.: "Sudan" NOT NEAR "South_"; "Congo" NOT AFTER "Democratic_Republic_of_the_"). Is it possible to apply a similar feature in quanteda? Friends suggested that some exclusion rule via regex could work, but I wouldn't know how to implement it.
Though I am not very familiar with writing regex (or whichever option would make this feasible), my thoughts would be something along the lines of:
# example for dictionary with names of 2 african countries
africa_dict <- dictionary(list(algeria = "algeria",
                               republic_of_the_congo = c("republic_of_the_congo",
                                                         "congo_(NOT AFTER democratic_republic_of_the_)")))
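As far as I know, quanteda dictionaries have no direct equivalent of Wordstat's proximity rules, but one rough workaround is to compound the longer phrase first with tokens_compound(), so that the shorter term no longer matches inside it. A sketch with made-up text:
library(quanteda)

txt <- c("The Republic of the Congo borders the Democratic Republic of the Congo.",
         "Sudan and South Sudan share a long border.")

toks <- tokens_tolower(tokens(txt, remove_punct = TRUE))

# Compound the longer phrases first, so a bare "congo" / "sudan" token
# only survives where it is NOT part of the longer name
toks <- tokens_compound(toks, phrase(c("democratic republic of the congo",
                                       "south sudan")))

africa_dict <- dictionary(list(republic_of_the_congo = "congo",
                               sudan = "sudan"))

dfm(tokens_lookup(toks, africa_dict))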

Stopwords eliminating and vector making

In text2vec, the only function I could find that deals with stopwords is create_vocabulary. But in a text-mining task we usually need to remove stopwords from the source documents before building the corpus or doing further processing. How do I use stopwords when building the corpus, DTM and TCM with text2vec?
I've used tm for text mining before. It has a function for reading PDF documents, but it reads one paper as several elements (one line per element) rather than reading each document as a single element, as I expected. Furthermore, tm's format-conversion functions produce garbled characters with Chinese text. If I use text2vec to read documents, can it read one paper into a single element? (That is, is an element large enough to hold a full paper published in a journal?) Also, is the corpus/vector built in text2vec compatible with the one built in tm?
There are two ways to create a document-term matrix:
Using feature hashing
Using vocabulary
See the text-vectorization vignette for details.
You are interested in the second option. This means you should build a vocabulary - the set of words/n-grams that will be used in all downstream tasks. create_vocabulary creates a vocabulary object, and only terms from this object will be used in further steps. So if you provide stopwords to create_vocabulary, it will remove them from the set of all observed words in the corpus. As you can see, you only need to provide stopwords once; all the downstream tasks will work with the vocabulary.
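For example, a minimal sketch of that flow (the documents and stopword list below are made up; the pattern follows the text2vec vignette):
library(text2vec)

docs <- c("this is the first document", "and this is the second one")

it    <- itoken(docs, preprocessor = tolower, tokenizer = word_tokenizer)
vocab <- create_vocabulary(it, stopwords = c("this", "is", "the", "and"))

vectorizer <- vocab_vectorizer(vocab)

dtm <- create_dtm(it, vectorizer)                          # document-term matrix
tcm <- create_tcm(it, vectorizer, skip_grams_window = 5L)  # term co-occurrence matrix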
Answer to the second question:
text2vec doesn't provide high-level functions for reading PDF documents. However, it allows the user to provide a custom reader function. All you need is to read the full articles with some function and reshape them into a character vector where each element corresponds to the desired unit of information (full article, paragraph, etc.). For example, you can easily combine lines into a single element with the paste() function:
article = c("sentence 1.", "sentence 2")
full_article = paste(article, collapse = ' ')
# "sentence 1. sentence 2"
Hope this helps.

How to search for huge list of search-terms from a corpus using custom function in tm package

I want to select and retain the gene names from a corpus of multiple text documents using the tm package. I have used a custom function to keep only the genes defined in "pattern" and remove everything else. Here is my code:
docs <- Corpus(DirSource("path of the directory containing text documents"))
f <- content_transformer(function(x, pattern)regmatches(x, gregexpr(pattern, x, ignore.case=TRUE)))
genes = "IL1|IL2|IL3|IL4|IL5|IL6|IL7|IL8|IL9|IL10|TNF|TGF|AP2|OLR1|OLR2"
docs <- tm_map(docs, f, genes)
The code is working perfectly fine. However, if I need to match a larger number of genes (say > 5000 genes), what is the best way to approach it? I don't want to put the genes in an array and loop the tm_map function, to avoid a huge run time and memory constraints.
If you simply want the fastest vectorized fixed-string regex, use the stringi package, not tm. Specifically, look at the stri_match* functions (or you might find stringr even faster if you're only handling ASCII - look for Hadley's latest versions and comments).
But if the regex of gene names is fixed and known upfront, and you're going to be doing a lot of retrieval on those few strings, then you could tag each document for faster retrieval.
(You haven't fully told us your use-case. What % of your runtime is this retrieval task? 0.1%? 99%? Are you storing your genes as text strings? Why not tokenize them and convert once to factors at input-time?)
Either way, tm is not a very scalable or performant package, so look at other approaches.
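For example, a small stringi sketch of the vectorized approach (using stri_extract_all_regex() here rather than stri_match*; the gene list and documents are made up):
library(stringi)

genes <- c("IL1", "IL2", "IL6", "TNF", "TGF", "OLR1")   # imagine thousands here
docs  <- c("IL6 and TNF were elevated in all samples.",
           "No relevant genes were mentioned here.")

# One alternation pattern with word boundaries; for very large gene lists,
# building the pattern in chunks may be more robust
pattern <- paste0("\\b(?:", paste(genes, collapse = "|"), ")\\b")

hits <- stri_extract_all_regex(docs, pattern,
                               opts_regex = stri_opts_regex(case_insensitive = TRUE))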

How to replace English abbreviated forms with their dictionary form

I'm working on a system to analyze texts in English: I use Stanford CoreNLP to split whole documents into sentences and sentences into tokens. I also use the MaxEnt tagger to get the tokens' POS tags.
Now, since I use this corpus to build a supervised classifier, it would be good if I could replace any word like 're, 's, havin, sayin', etc. with its standard form (are, is, having, saying). I've been searching for an English dictionary file, but I don't know how to use it. There are so many distinct cases to consider that I don't think it's an easy task: is there some similar work or whole project that I could use?
Ideas:
I) Use string edit distance on a subset of your text: for words that do not exist in the dictionary, try to match them against existing dictionary words by edit distance.
II) The key feature of many of your examples is that they are only one character away from the correct spelling. So, for those words that you fail to match with a dictionary entry, try adding each English character to the front or back and look the resulting word up in the dictionary. This is very expensive at first, but if you keep track of those misspellings in a lookup table (re -> are), at some point you will have 99.99% of the common misspellings (or whatever you call them) in your lookup table with their correct spelling (a minimal sketch of such a lookup table follows after this list).
III) Train a word-level 2-gram or 3-gram language model on proper, clean English text (e.g. newspaper articles), then run it over your entire corpus and, for the words the language model considers unknown (i.e. it hasn't seen them in the training phase), check the most probable word according to the model. Most probably the language model's top-10 predictions will contain the correctly spelled word.
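A minimal sketch of the lookup-table part of idea II), with a tiny hand-made map (in practice the table would be grown as described above):
# Map from informal/contracted forms to their standard forms
normalize_map <- c("'re" = "are", "'s" = "is", "havin" = "having", "sayin'" = "saying")

tokens <- c("they", "'re", "sayin'", "it", "havin", "fun")

# Replace a token only if it appears in the lookup table
normalized <- unname(ifelse(tokens %in% names(normalize_map),
                            normalize_map[tokens], tokens))
# "they" "are" "saying" "it" "having" "fun"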
