tm, quanteda, text2vec: get strings to the left of a term in a word list according to a regex pattern (R)

I would like to analyse a big folder of texts for the presence of names, addresses and telephone numbers in several languages.
These will usually be preceded by a word such as "address", "telephone number", "name", "company", "hospital" or "deliverer". I will have a dictionary of these words.
I am wondering whether text mining tools would be suitable for the job.
I would like to create a corpus for all these documents and then find text that meets specific criteria (I am thinking of regex criteria) to the right of, or below, a given dictionary entry.
Is there such a syntax in the data mining packages in R, i.e. a way to get the strings to the right of, or below, a word-list entry that match a specific pattern?
If not, what would be a more suitable tool in R for the job?

Two options with quanteda come to mind (sketched in code below):
1. Use kwic with your list of target patterns, with a window large enough to capture the amount of text after the term that you want. This returns a data.frame whose keyword and post columns you can use for your analysis. You can also construct a corpus directly from this object (corpus(mykwic)) and then work with the new post docvar, which will contain the text you want.
2. Use corpus_segment, where the target word list defines a "tag" type and anything following a tag, up to the next tag, is reshaped into a new document. This works well but is a bit trickier to configure, since you need to get the regex for the tag right.
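A rough sketch of both options; object names such as my_corpus are placeholders, and the regex tag is only illustrative (check ?kwic and ?corpus_segment for the exact arguments in your quanteda version):

library(quanteda)

labels <- c("address", "telephone number", "name", "company", "hospital", "deliverer")

# option 1: keywords-in-context, keeping 10 tokens after each match
toks <- tokens(my_corpus)
kw   <- kwic(toks, pattern = phrase(labels), window = 10)
head(as.data.frame(kw)[, c("keyword", "post")])  # matched term + following words
kw_corpus <- corpus(kw)                          # "post" becomes a docvar

# option 2: segment on a regex "tag"; text after each tag becomes a new document
segs <- corpus_segment(my_corpus,
                       pattern = "(Address|Telephone number|Name|Company):",
                       valuetype = "regex")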

Related

Using variable input for str_extract_all in R

I am pretty green when it comes to R and coding in general. I've been working on a CS project recently for a linguistics course through which I'm finding the words that surround various natural landscape words in The Lord of the Rings. For instance, I'm interested in finding the descriptive words used around words like "stream", "mountain", etc.
Anyhow, to extract all of these words from the text, I've been working off of this post. When running this command by itself, it works:
stringr::str_extract_all(text, "([^\\s]+\\s){4}stream(\\s[^\\s]+){6}")
where "stream" is the specific word I'm going after. The numbers before and after specify how many words before and after I want to extract along with it.
However, I'm interested in combining this (and some other things) into a single function, where all you need to plug in is the text you want to search and the word you want context for. But as far as I've tinkered, I can't get anything other than a specific, literal word to work in the above code. Would there be a way, in the context of writing a function in R, to include the above code but with a variable input, for instance
stringr::str_extract_all(text, "([^\\s]+\\s){4}WORD(\\s[^\\s]+){6}")
where WORD is whatever you specify in the overall function:
function(text,WORD)
I apologize for the generally apparent newb-ness of this post. I am very new to all of this but would greatly appreciate any help you could offer.
This is what you are looking for, if I understood you correctly:
my_fun <- function(input_text, word) {
  stringr::str_extract_all(
    string = input_text,
    pattern = paste("([^\\s]+\\s){4}", word, "(\\s[^\\s]+){6}", sep = "")
  )
}
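For example, a hypothetical call (assuming lotr_text holds the novel as a single character string):
my_fun(lotr_text, "stream")
my_fun(lotr_text, "mountain")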
May the light of Eärendil ever shine upon you!

Omitting Words from Spellcheck in qdap

This is my first post on Stack Overflow; I apologize if I violate any rules.
I am working with the R package qdap on spellchecking very messy medical record text. The goal of this work is to identify misspellings of drug side effects in order to build a side-effect misspelling dictionary. The text I am working with contains many, many misspellings, abbreviations, and other things that make a simple spellcheck difficult. After I run a spellcheck on a small doctor's note, I get hundreds of words returned by the spellcheck program. This makes it difficult to search for the side-effect misspellings that I care about.
I attempted to use the following code to create a dictionary consisting only of correctly spelled side effects, so that qdap will flag closely misspelled words as belonging to this dictionary. The problem is that, with this, nearly every word in the text, properly or improperly spelled, is now returned as incorrect (e.g. "notable" is flagged as misspelled and "nausea" is the suggested replacement from my dictionary).
dictionary <- readLines("dictionary.txt")
check_spelling(text$NOTE_TEXT[3379], range = 0,
               dictionary = dictionary,
               assume.first.correct = FALSE)
Here "dictionary" is my self-built side-effects dictionary, and check_spelling is being run on text contained in a csv file. Is there any way to keep words that are very far away from every word in my dictionary from appearing in the check_spelling output (such as in my previous example)? That way I can cut down the number of words I am seeing in the output and identify only the misspelled side effects.
As a small note, changing assume.first.correct to TRUE will not change anything, because the dictionary does not run with it set that way.
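One way to realise the filtering described above (just a sketch, not an accepted answer) is to post-filter the check_spelling output with base R's adist(), keeping only flagged words within a few edits of some dictionary entry; the cutoff and object names are illustrative:

flagged <- check_spelling(text$NOTE_TEXT[3379], range = 0,
                          dictionary = dictionary,
                          assume.first.correct = FALSE)

# not.found is the column of flagged words in qdap's output; check
# names(flagged) if your version labels it differently
dists <- adist(as.character(flagged$not.found), dictionary)  # edit-distance matrix
flagged[apply(dists, 1, min) <= 3, ]                         # arbitrary 3-edit cutoff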

How to check if a paragraph is part of a text in R

I have one paragraph of text (a vector of words) and I would like to see if it is "part" of a longer text (a vector of words). However, I know that this paragraph does not appear in the text in its exact form, but with slight changes: a few words could be missing, the order could be slightly different, some words could be inserted as parenthetical elements, etc.
I am currently implementing solutions "by hand", such as checking whether most of the words of the paragraph are in the text, looking at the distance between these words, their order, etc.
I was wondering, however, whether there is a built-in method to do that?
I already checked the tm package, but it does not seem to do that...
Any idea?
I fear that you are stuck with hand-writing an approach, e.g. grep-ing some word groups and having some kind of matching threshold.
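A minimal sketch of that thresholding idea, assuming paragraph and long_text are plain character strings (the names and the 0.8 cutoff are illustrative):

shares_enough_words <- function(paragraph, long_text, threshold = 0.8) {
  p_words <- tolower(unlist(strsplit(paragraph, "\\W+")))
  p_words <- p_words[nzchar(p_words)]
  t_words <- tolower(unlist(strsplit(long_text, "\\W+")))
  mean(p_words %in% t_words) >= threshold  # share of paragraph words found in the text
}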

Stopwords eliminating and vector making

In text2vec, the only function I could find relating to stopwords is create_vocabulary. But in a text mining task we usually need to eliminate the stopwords from the source documents and then build the corpus or do other further processing. How do we use stopwords when building the corpus, DTM and TCM with text2vec?
I've used tm for text mining before. It has a function for analysing PDF documents, but it reads one paper as several vectors (one line, one vector) rather than reading each document into a single vector as I expected. Furthermore, the format-conversion functions in tm have encoding problems with Chinese. If I use text2vec to read documents, can it read one paper into a single vector? (That is, is a vector large enough for a paper published in a journal?) And are the corpus and vectors built with text2vec compatible with those built with tm?
There are two ways to create a document-term matrix:
1. using feature hashing;
2. using a vocabulary.
See the text-vectorization vignette for details.
You are interested in the second option. This means you should build a vocabulary: the set of words/n-grams that will be used in all downstream tasks. create_vocabulary creates a vocabulary object, and only terms from this object will be used in further steps. So if you provide stopwords to create_vocabulary, it will remove them from the set of all observed words in the corpus. As you can see, you only need to provide the stopwords once; all the downstream tasks will work with the vocabulary (see the sketch below).
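A minimal sketch of that pipeline, assuming docs is a character vector with one document per element (the stopword list here is only a placeholder for your own):

library(text2vec)

stop_words <- c("the", "a", "an", "and", "of", "in")  # substitute your full stopword list

it  <- itoken(docs, preprocessor = tolower, tokenizer = word_tokenizer)
voc <- create_vocabulary(it, stopwords = stop_words)  # stopwords removed here, once

vectorizer <- vocab_vectorizer(voc)
dtm <- create_dtm(it, vectorizer)                          # document-term matrix
tcm <- create_tcm(it, vectorizer, skip_grams_window = 5L)  # term-co-occurrence matrix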
As for the second question:
text2vec doesn't provide high-level functions for reading PDF documents. However, it allows the user to provide a custom reader function. All you need is to read the full articles with some function and reshape them into a character vector where each element corresponds to the desired unit of information (full article, paragraph, etc.). For example, you can easily combine lines into a single element with the paste() function:
article = c("sentence 1.", "sentence 2")
full_article = paste(article, collapse = ' ')
# "sentence 1. sentence 2"
Hope this helps.

How to replace english abbreviated form to their dictionary form

I'm working on a system to analyse texts in English: I use Stanford CoreNLP to split whole documents into sentences and sentences into tokens. I also use the MaxEnt tagger to get the tokens' POS tags.
Now, considering that I use this corpus to build a supervised classifier, it would be good if I could replace any word like 're, 's, havin, sayin', etc. with its standard form (are, is, having, saying). I've been searching for some English dictionary file, but I don't know how to use one. There are so many distinct cases to consider that I don't think it's an easy task to handle by hand: is there some similar work or whole project that I could use?
Ideas:
I) Use string edit distance on a subset of your text: try to match words that do not exist in the dictionary against existing dictionary words by edit distance.
II) The key feature of lots of the examples you give is that they are only one character away from the correct spelling. So, for the words that you fail to match with a dictionary entry, try adding each English character to the front or back and look up the resulting word in the dictionary. This is expensive in the beginning, but if you keep track of those misspellings in a lookup table (re -> are), at some point you will have 99.99% of the common misspellings (or whatever you call them) in your lookup table with their correct spelling (a lookup-table sketch follows below).
III) Train a word-level 2-gram or 3-gram language model on proper, clean English text (e.g. newspaper articles), then run it over your entire corpus, and for the words the language model treats as unknown (i.e. it has not seen them during training), check the most probable word according to the model. Most probably the model's top-10 predictions will include the correctly spelled word.
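A tiny sketch of the lookup table from idea II (the entries and function name are illustrative; the table would grow as new variants are discovered):

norm_table <- c("'re" = "are", "'s" = "is", "havin" = "having", "sayin'" = "saying")

normalize_tokens <- function(tokens) {
  hits <- tokens %in% names(norm_table)
  tokens[hits] <- unname(norm_table[tokens[hits]])  # replace known variants
  tokens
}

normalize_tokens(c("they", "'re", "havin", "fun"))
# [1] "they"   "are"    "having" "fun"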
