Stemming with R Text Analysis

I am doing a lot of analysis with the TM package. One of my biggest problems is related to stemming and stemming-like transformations.
Let's say I have several accounting-related terms (I am aware of the spelling issues).
After stemming we have:
accounts -> account
account -> account
accounting -> account
acounting -> acount
acount -> acount
acounts -> acount
accounnt -> accounnt
Result: 3 terms (account, acount, accounnt), where I would have liked 1 (account), as all of these relate to the same term.
1) To correct spelling is a possibility, but I have never attempted that in R. Is that even possible?
2) The other option is to make a reference list i.e. account = (accounts, account, accounting, acounting, acount, acounts, accounnt) and then replace all occurrences with the master term. How would I do this in R?
Once again, any help/suggestions would be greatly appreciated.

We could set up a list of synonyms and replace those values. For example
synonyms <- list(
  list(word = "account", syns = c("acount", "accounnt"))
)
This says we want to replace "acount" and "accounnt" with "account" (I'm assuming we're doing this after stemming). Now let's create test data.
raw <- c("accounts", "account", "accounting", "acounting",
         "acount", "acounts", "accounnt")
And now let's define a transformation function that will replace the words in our list with the primary synonym.
library(tm)
replaceSynonyms <- content_transformer(function(x, syn = NULL) {
  Reduce(function(a, b) {
    gsub(paste0("\\b(", paste(b$syns, collapse = "|"), ")\\b"), b$word, a)
  }, syn, x)
})
Here we use the content_transformer function to define a custom transformation. And basically we just do a gsub to replace each of the words. We can then use this on a corpus
tm <- Corpus(VectorSource(raw))
tm <- tm_map(tm, stemDocument)
tm <- tm_map(tm, replaceSynonyms, synonyms)
inspect(tm)
and we can see all these values are transformed into "account" as desired. To add other synonyms, just add additional lists to the main synonyms list. Each sub-list should have the names "word" and "syns".
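For example, a second, made-up entry (the "receivable" group here is invented purely for illustration) would look like this:
synonyms <- list(
  list(word = "account",    syns = c("acount", "accounnt")),
  list(word = "receivable", syns = c("recievable", "receivabl"))
)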

Mr. Flick has answered question #2; I am approaching this by answering question #1.
Here is an approach that uses a binary search of a known-word database (DICTIONARY from qdapDictionaries). A binary lookup is slow for sure, but we can speed it up by making some assumptions about the replacements (like restricting the range of differences in the number of characters). So here's the basic idea:
1. Turn the Corpus into a unique bag of words using qdap's bag_o_words
2. Look those words up in a dictionary (qdapDictionaries' DICTIONARY data set) to find words that are not recognized, using match
3. These misses from step 2 will be what we look up
4. Determine the number of characters for the words in the dictionary, to eliminate gross differences later using nchar
5. Run each element of misses through a loop (sapply) and do the following:
   a. stem each element from misses using tm::stemDocument
   b. determine the number of characters and eliminate dictionary words that are not within that range, using nchar
   c. use agrep with a max.distance to eliminate more words from the dictionary
   d. use a binary lookup (that reverse engineers agrep) to determine the word from the dictionary that is closest to the missed element [note this is a non-exported function from qdap called qdap:::Ldist]
6. The result is a named vector that we can use for gsubbing
7. Use tm_map with a custom tm-flavored gsub function to replace words
8. Do the stemming with tm_map and stemDocument
Here's the code. I made a fake Corpus using the words you provide and some random words to demonstrate how to do this from start to end. You can play with the range and max.distance that are supplied to sapply. The looser you are with these, the slower the search will be, but tightening them too much will make it more likely to make a mistake. This really isn't an answer for spelling correction in a general sense, but it works here because you were stemming anyway. There's an Aspell package, but I have never used it before.
terms <- c("accounts", "account", "accounting", "acounting", "acount", "acounts", "accounnt")
library(tm); library(qdap)
fake_text <- unlist(lapply(terms, function(x) {
  paste(sample(c(x, sample(DICTIONARY[[1]], sample(1:5, 1)))), collapse = " ")
}))
fake_text
myCorp <- Corpus(VectorSource(fake_text))
terms2 <- unique(bag_o_words(as.data.frame(myCorp)[[2]]))
misses <- terms2[is.na(match(terms2, DICTIONARY[[1]]))]
chars <- nchar(DICTIONARY[[1]])
replacements <- sapply(misses, function(x, range = 3, max.distance = .2) {
  x <- stemDocument(x)
  wchar <- nchar(x)
  dict <- DICTIONARY[[1]][chars >= (wchar - range) & chars <= (wchar + range)]
  dict <- dict[agrep(x, dict, max.distance = max.distance)]
  names(which.min(sapply(dict, qdap:::Ldist, x)))
})
replacer <- content_transformer(function(x) {
  mgsub(names(replacements), replacements, x, ignore.case = FALSE, fixed = FALSE)
})
myCorp <- tm_map(myCorp, replacer)
inspect(myCorp <- tm_map(myCorp, stemDocument))

This question inspired me to attempt to write a spell checker for the qdap package. There's an interactive version that may be useful here. It's available in qdap >= version 2.1.1. That means you'll need the dev version at the moment; here are the steps to install it:
library(devtools)
install_github("qdapDictionaries", "trinker")
install_github("qdap", "trinker")
library(tm); library(qdap)
## Recreate a Corpus like you describe.
terms <- c("accounts", "account", "accounting", "acounting", "acount", "acounts", "accounnt")
fake_text <- unlist(lapply(terms, function(x) {
  paste(sample(c(x, sample(DICTIONARY[[1]], sample(1:5, 1)))), collapse = " ")
}))
fake_text
inspect(myCorp <- Corpus(VectorSource(fake_text)))
## The interactive spell checker (check_spelling_interactive)
m <- check_spelling_interactive(as.data.frame(myCorp)[[2]])
preprocessed(m)
inspect(myCorp <- tm_map(myCorp, correct(m)))
The correct function merely grabs a closure function from the output of check_spelling_interactive and allows you to then apply the "correcting" to any new text string(s).
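For instance, a sketch of that reuse (the sample strings below are made up, and the exact corrections depend on the choices you made in the interactive session):
fix_spelling <- correct(m)                # closure that applies your chosen corrections
fix_spelling(c("the acount was flagged",  # hypothetical new strings
               "accounnt review pending"))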

Related

Error in dictionary replace function "more elements supplied than there are to replace"

I'm translating text in a dataset from one language to another. I have a dictionary for that language.
Here's how the dictionary (dict) looks.
id   Lang    English
 1   Juu     Up
 2   Chini   Down
The dictionary contains 365 rows and 3 columns. My data has 3000+ rows and ~100 columns.
Here's my code:
translated <- vector("double", ncol(dataset))
for (i in seq_along(dataset)) {
  # dict[match(dataset[[i]], dict$Lang), 2, drop=F]
  translated[[i]] <- with(dict, English[match(dataset[[i]], Lang)])
}
tr_dataset <- as.data.frame(translated)
Here's the error I keep getting
Error in output[[i]] <- with(dict, English[match(b[[i]], Lang)]) :
more elements supplied than there are to replace
You should not use with() in programming. Here is what the documentation says:
For interactive use this is very effective and nice to read. For programming however, i.e., in one's functions, more care is needed, and typically one should refrain from using with(), as, e.g., variables in data may accidentally override local variables, see the reference.
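A tiny illustration of that masking risk (the local variable here is invented for the example):
dict <- data.frame(Lang = c("Juu", "Chini"), English = c("Up", "Down"),
                   stringsAsFactors = FALSE)
English <- "some local object we actually meant"   # hypothetical local variable

with(dict, English)   # returns dict$English ("Up" "Down"), silently masking the local
dict$English          # explicit indexing leaves no doubt about which object is used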
I cannot test the approach because you did not provide a reproducible example, but this may work:
translated <- vector("list", ncol(dataset)) # a list, so each element can hold a whole translated column
for (i in seq_along(dataset)) {
  # dict[match(dataset[[i]], dict$Lang), 2, drop=F]
  translated[[i]] <- dict$English[match(dataset[[i]], dict$Lang)] # this row changed
}
tr_dataset <- as.data.frame(translated)
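Since match() is already vectorized, you can also drop the loop entirely; here is a minimal sketch with made-up toy data (the real dict and dataset of course have more rows and columns):
# toy stand-ins for the real objects, just for illustration
dict <- data.frame(id = 1:2, Lang = c("Juu", "Chini"), English = c("Up", "Down"),
                   stringsAsFactors = FALSE)
dataset <- data.frame(a = c("Juu", "Chini", "Juu"),
                      b = c("Chini", "Juu", "Chini"),
                      stringsAsFactors = FALSE)

# translate every column in one pass; match() does the lookup column by column
tr_dataset <- as.data.frame(lapply(dataset, function(col) dict$English[match(col, dict$Lang)]),
                            stringsAsFactors = FALSE)
tr_dataset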

R: Logical from 2 vectors on pattern match

I'm trying to clean up some dirty data (for work). My data frame has a column for customer information (for our example, let's say store and product) in a long, weird string, as well as a column for store and a column for product. I can parse the store and the product from the string. Here is where I arrive at my problem.
Let's say (consider these vectors part of a larger data frame, prefixed with data$ if that helps; I was just working with them as vectors, thinking it might speed up the code by not having to pull the whole data frame):
WeirdString <- c("fname: john; lname:smith; store:Amazon Inc.; product:Echo", "fname: cindy; lname:smith; store:BestBuy; product:Ps-4","fname: jon; lname:smith; store:WALMART; product:Pants")
so I parse this to be:
WS_Store <- c("Amazon Inc.", "BestBuy", "WALMART")
WS_Prod <- c("Echo", "Ps-4", "Pants")
What's in the tables (i.e. the non-parsed columns) is:
DB_Store <- c("Amazon", "BEST BUY", "Other")
DB_Prod <- c("ECHO", "PS4", "Jeans")
I am currently using a for loop over i to grepl the "true" string against the parsed string. This takes forever, and I know R was designed to use vectorized code. So my question is: how do I eliminate the loop and use something like lapply (which I tried, and failed at, because I'm not savvy enough with lapply), or some other vectorized approach?
My current code:
for(i in 1:nrow(data)){ # could be i in length(DB_Prod) or whatever; all vectors are the same length
  Diff_Store[i] <- !grepl(DB_Store[i], WS_Store[i], ignore.case = TRUE)
  Diff_Prod[i]  <- !grepl(DB_Prod[i],  WS_Prod[i],  ignore.case = TRUE)
}
I intend to append those columns back into the dataframe, as the true goal is to diagnose why the database has this problem.
If there's a better way than this, rather than trying to vectorize it, I'm open to it. The data in DB_Store is restricted to a specific number of "stores" (in the table it comes from), but in the string it seems to be open-ended, which is why I use the DB value as the pattern, not as the x. Product is similar but not as restricted; this is why some have dashes and some don't. I would love to match "close things" like Ps-4 vs. PS4, but I will probably just build a table of matches once I see how weird the string gets. To be fair, though, the string may not match at all, which is represented by the Pants/Jeans pair. The dataset is 2.5 million records, and there are many different "stores" and "products". I want to make sure they match on the same line, not just "is it in the database" (which is what previous questions seem to ask: can I see if a string is in a list of strings, rather than a 1:1 comparison; and the last question did end in a loop, which takes minutes or hours to run).
Thanks!
Please check if this works for you:
check <- function(vec_a, vec_b){
  mat <- cbind(vec_a, vec_b)
  diff <- apply(mat, 1, function(x) !grepl(pattern = x[1], x = x[2], ignore.case = TRUE))
  diff
}
Use your different vectors for stores (or products) in the arguments vec_a and vec_b, respectively (example: diff_stores <- check(DB_Store, WS_Store) ). This function will return a logical vector with TRUE values referring to items that weren't a match in the two original vectors. Is this what you wanted?
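If you prefer to skip the helper function, the same pairwise comparison can be written directly with mapply (a sketch using the example vectors above):
Diff_Store <- !mapply(grepl, DB_Store, WS_Store, MoreArgs = list(ignore.case = TRUE))
Diff_Prod  <- !mapply(grepl, DB_Prod,  WS_Prod,  MoreArgs = list(ignore.case = TRUE))
Diff_Store  # FALSE TRUE TRUE for the example vectors (only the Amazon row matches)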

Apply text2vec embeddings to new data

I used text2vec to generate custom word embeddings from a corpus of proprietary text data that contains a lot of industry-specific jargon (thus stock embeddings like those available from Google won't work). The analogies work great, but I'm having difficulty applying the embeddings to assess new data. I want to use the embeddings that I've already trained to understand relationships in new data. The approach I'm using (described below) seems convoluted, and it's painfully slow. Is there a better approach? Perhaps something already built into the package that I've simply missed?
Here's my approach (offered with the closest thing to reproducible code I can generate given that I'm using a proprietary data source):
d = a list containing the new data; each element is of class character
vecs = the word vectorizations obtained from text2vec's implementation of GloVe
new_vecs <- sapply(d, function(y){
  it <- itoken(word_tokenizer(y), progressbar = FALSE)          # for each statement, create an iterator
  voc <- create_vocabulary(it, stopwords = tm::stopwords())     # for each document, create a vocab
  vecs[rownames(vecs) %in% voc$vocab$terms, , drop = FALSE] %>% # subset vecs for the words in the new document, then
    colMeans                                                    # find the average vector for each document
}) %>% t # close the y function and sapply, then transpose to return a matrix with one row per statement
For my use case, I need to keep the results separate for each document, so anything that involves pasting-together the elements of d won't work, but surely there must be a better way than what I've cobbled together. I feel like I must be missing something rather obvious.
Any help will be greatly appreciated.
You need to do it in "batch" mode, using efficient linear-algebra matrix operations. The idea is to build a document-term matrix for the documents d. This matrix will contain information about how many times each word appears in each document. Then we just need to multiply the dtm by the matrix of embeddings:
library(text2vec)
# we are interested in words which are in the word embeddings
voc = create_vocabulary(rownames(vecs))
# now we will create a document-term matrix
vectorizer = vocab_vectorizer(voc)
dtm = itoken(d, tokenizer = word_tokenizer) %>%
  create_dtm(vectorizer)
# normalize - calculate term frequency, i.e. divide the count of each word
# in a document by the total number of words in the document.
# So at the end we will receive the average of the word vectors (not the sum!)
dtm = normalize(dtm)
# and now we can calculate vectors for each document (the average of its word vectors)
# using the dot product of dtm and the embeddings matrix
document_vecs = dtm %*% vecs
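Once you have document_vecs you can work with it like any other matrix; for example, a quick cosine-similarity check between the new documents using base matrix algebra (a sketch; it assumes every document contains at least one word from the embedding vocabulary):
dv <- as.matrix(document_vecs)
dv <- dv / sqrt(rowSums(dv^2))   # scale each document vector to unit length
doc_sim <- tcrossprod(dv)        # doc_sim[i, j] = cosine similarity of documents i and j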

Faster alternative methods to for-loop in R for pattern matching

I am working on a problem in which I have two data frames, data and abbreviations, and I would like to replace all the abbreviations present in data with their respective full forms. Until now I was using for-loops in the following manner:
abb <- c()
for(i in 1:length(data$text)){
  for(j in 1:length(AbbreviationList$Abb)){
    abb <- paste("(\\b", AbbreviationList$Abb[j], "\\b)", sep="")
    data$text[i] <- gsub(abb, AbbreviationList$Fullform[j], tolower(data$text[i]))
  }
}
The abbreviation data frame looks something like this and can be generated using the following code:
Abbreviation <- c(c("hru", "how are you"),
                  c("asap", "as soon as possible"),
                  c("bf", "boyfriend"),
                  c("ur", "your"),
                  c("u", "you"),
                  c("afk", "away from keyboard"))
Abbreviation <- data.frame(matrix(Abbreviation, ncol=2, byrow=T), row.names=NULL)
names(Abbreviation) <- c("abb","Fullform")
And data is merely a data frame with one column containing a text string in each row, which can also be generated using the following code:
data <- data.frame(unlist(c("its good to see you, hru doing?",
                            "I am near bridge come ASAP",
                            "Can u tell me the method u used for",
                            "afk so couldn't respond to ur mails",
                            "asmof I dont know who is your bf?")))
names(data) <- "text"
Initially, I had a data frame with around 1000 observations and around 100 abbreviations, so I was able to run the analysis. But now the data has grown to almost 50000 rows and I am facing difficulty processing it, as the two nested for-loops make the process very slow. Can you suggest some better alternatives to the for-loop and explain with an example how to use them in this situation? If this problem can be solved faster via vectorization, please suggest how to do that as well.
Thanks for the help!
This should be faster, and without side effects.
mapply(function(x, y){
  abb <- paste0("(\\b", x, "\\b)")
  gsub(abb, y, tolower(data$text))
}, Abbreviation$abb, Abbreviation$Fullform)
gsub is vectorized over x, so you can give it a character vector in which matches are sought; here I give it data$text.
I use mapply to avoid the side effects of the for loop.
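Note that each column of the mapply result applies just one replacement to the original text; if you want a single vector with every abbreviation expanded, a chained variant (sketch only) could be:
expanded <- Reduce(function(txt, j) {
  gsub(paste0("(\\b", Abbreviation$abb[j], "\\b)"),
       Abbreviation$Fullform[j], txt, ignore.case = TRUE)
}, seq_len(nrow(Abbreviation)), init = data$text)
expanded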
First of all, clearly there is no need to compile the regular expressions with each iteration of the loop. Also, there is no need to actually loop over data$text: in R, very often you can use a vector where a value could do -- and R will go through all the elements of the vector and return a vector of the same length.
Abbreviation$regex <- sprintf("(\\b%s\\b)", Abbreviation$abb)
for(j in 1:length(Abbreviation$abb)) {
  data$text <- gsub(Abbreviation$regex[j],
                    Abbreviation$Fullform[j], data$text,
                    ignore.case = TRUE)
}
The above code works with the example data.
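For reference, worked out by hand on the sample data, the replacements should come out roughly like this:
data$text
## roughly:
##  "its good to see you, how are you doing?"
##  "I am near bridge come as soon as possible"
##  "Can you tell me the method you used for"
##  "away from keyboard so couldn't respond to your mails"
##  "asmof I dont know who is your boyfriend?"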

Maximum reasonable size for stemCompletion in tm?

I have a corpus of 26 plain text files, each between 12 - 148kb, total of 1.2Mb. I'm using R on a Windows 7 laptop.
I did all the normal cleanup stuff (stopwords, custom stopwords, lower case, numbers) and want to do stem completion. I am using the original corpus as a dictionary as shown in the examples. I tried a couple of simple vectors to make sure it would work at all (with about 5 terms) and it did and very quickly.
exchanger <- function(x) stemCompletion(x, budget.orig)
budget <- tm_map(budget, exchanger)
It's been working since yesterday at 4pm! In RStudio, under diagnostics, the request log shows new requests with different request numbers. Task Manager shows it using some memory, but not a crazy amount. I don't want to stop it, because what if it's almost there? Any other ideas on how to check progress (it's a volatile corpus, unfortunately)? Ideas on how long it should take? I thought about using the dtm names vector as a dictionary, cut off at the most frequent (or highest tf-idf) terms, but I'm reluctant to kill this process.
This is a regular Windows 7 laptop with lots of other things running.
Is this corpus too big for stemCompletion? Short of moving to Python, is there a better way to do stemCompletion, or to lemmatize instead of stem? My web searching has not yielded any answers.
I can't give you a definite answer without data that reproduces your problem, but I would guess the bottleneck comes from the following line of the stemCompletion source code:
possibleCompletions <- lapply(x, function(w) grep(sprintf("^%s", w), dictionary, value = TRUE))
After which, given you've kept the completion heuristic on the default of "prevalent", this happens:
possibleCompletions <- lapply(possibleCompletions, function(x) sort(table(x), decreasing = TRUE))
structure(names(sapply(possibleCompletions, "[", 1)), names = x)
That first line loops through each word in your corpus and checks it against your dictionary for possible completions. I'm guessing you have many words that appear many times in your corpus. That means the function is being called many times only to give the same response. A possibly faster version (depending on how many words were repeats and how often they were repeated) would look something like this:
y <- unique(x)
possibleCompletions <- lapply(y, function(w) grep(sprintf("^%s", w), dictionary, value = TRUE))
possibleCompletions <- lapply(possibleCompletions, function(x) sort(table(x), decreasing = TRUE))
z <- structure(names(sapply(possibleCompletions, "[", 1)), names = y)
z[match(x, names(z))]
So it only loops through the unique values of x rather than every value of x. To create this revised version of the code, you would need to download the source from CRAN and modify the function (I found it in the completion.R in the R folder).
Or you may just want to use Python for this one.
Cristina, following Schaun I recommend you apply stemCompletion only to the unique words. It is much less work for your PC to complete the unique words than to complete everything in your corpus (with all the repetitions).
First of all, take the unique words from your corpus. For example:
uniq <- data.frame(text = unique(budget))
Then you can get the unique words from your original text:
unique_budget.orig <- unique(budget.orig)
Now you can apply stemCompletion to your unique words:
uniq$completion <- uniq$text %>% stemCompletion(dictionary = unique_budget.orig)
Now you have an object with all the words from your corpus and their completions. You just have to apply a join between your corpus and the object uniq. Make sure both objects use the same variable name for the words without the completion: that will be the key.
This will reduce the number of operations your PC has to do.
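Putting the two suggestions together, a minimal sketch of such a wrapper (the function name is made up, and depending on your tm version you may need to wrap it in content_transformer for tm_map):
library(tm)
# complete only the unique tokens, then map the completions back onto the
# full vector, so stemCompletion is called on far fewer words
stemCompletion_unique <- function(x, dictionary, type = "prevalent") {
  u <- unique(x)
  completed <- stemCompletion(u, dictionary = dictionary, type = type)
  unname(completed[match(x, u)])
}
# exchanger <- function(x) stemCompletion_unique(x, budget.orig)
# budget <- tm_map(budget, exchanger)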
