I have a corpus of 26 plain text files, each between 12 - 148kb, total of 1.2Mb. I'm using R on a Windows 7 laptop.
I did all the normal cleanup stuff (stopwords, custom stopwords, lower case, numbers) and want to do stem completion. I am using the original corpus as a dictionary as shown in the examples. I tried a couple of simple vectors to make sure it would work at all (with about 5 terms) and it did and very quickly.
exchanger <- function(x) stemCompletion(x, budget.orig)
budget <- tm_map(budget, exchanger)
It's been working since yesterday at 4pm! In R Studio under diagnostics, the request log shows new requests with different request numbers. Task manager shows it using some memory, but not a crazy amount. I don't want to stop it because what if it's almost there? Any other ideas of how to check progress - it's a volatile corpus, unfortunately? Ideas on how long it should take? I thought about using the dtm names vector as a dictionary, cut off at the most frequent (or high tf-idf), but I'm reluctant to kill this process.
This is a regular windows 7 laptop with lots of other things running.
Is this corpus too big for stemCompletion? Short of moving to Python, is there a better way to do stemCompletion or lemmatize vice stem - my web searching has not yielded any answers.
I can't give you a definite answer without data that reproduces your problem, but I would guess the bottleneck comes from the folllowing line from the stemCompletion source code:
possibleCompletions <- lapply(x, function(w) grep(sprintf("^%s", w), dictionary, value = TRUE))
After which, given you've kept the completion heuristic on the default of "prevalent", this happens:
possibleCompletions <- lapply(possibleCompletions, function(x) sort(table(x), decreasing = TRUE))
structure(names(sapply(possibleCompletions, "[", 1)), names = x)
That first line loops through each word in your corpus and checks it against your dictionary for possible completions. I'm guessing you have many words that appear many times in your corpus. That means the function is being called many times only to give the same response. A possibly faster version (depending on how many words were repeats and how often they were repeated) would look something like this:
y <- unique(x)
possibleCompletions <- lapply(y, function(w) grep(sprintf("^%s", w), dictionary, value = TRUE))
possibleCompletions <- lapply(possibleCompletions, function(x) sort(table(x), decreasing = TRUE))
z <- structure(names(sapply(possibleCompletions, "[", 1)), names = y)
z[match(x, names(z))]
So it only loops through the unique values of x rather than every value of x. To create this revised version of the code, you would need to download the source from CRAN and modify the function (I found it in the completion.R in the R folder).
Or you may just want to use Python for this one.
Cristina, following Schaun I recomend you use just unique word for apply stemcompletion. I mean, it is easy for your PC do completion in your unique words them do the completition in all your corpus (with all repetitions).
First of all, take the unique words from your corpus. For exemplo:
unique$text <- unique(budget)
Them, you can get the unique words from your original text
unique_budget.orig <- unique(budget.orig)
Now, you can apply the stemcomplection for your unieque words
unique$completition <- budget %>% stemCompletion (dictionary= unique_budget.orig)
Now you have an object with all words from your corpus and their completion. you just have to apply a join between your corpus and the object unique. Be sure both objects have the same variable name for words without the completition: this gonne be the key.
This gonna reduce the number of operation your PC have to do.
Related
Trying to clean up some dirty data (for work), my data frame has a column for customer information (for our example lets say store and product) in a long weird string, as well as a column for store and a column for product. I can parse the store and the product from the string. Here is where I arrive at my problem.
let's say (consider these vectors part of a larger dataframe, appended with data$ if that helps, I was just working with them as vectors thinking it may speed up the code not having to pull the whole dataframe):
WeirdString <- c("fname: john; lname:smith; store:Amazon Inc.; product:Echo", "fname: cindy; lname:smith; store:BestBuy; product:Ps-4","fname: jon; lname:smith; store:WALMART; product:Pants")
so I parse this to be:
WS_Store <- c("Amazon Inc.", "BestBuy", "WALMART")
WS_Prod <- c("Echo", "Ps-4", "Pants")
What's in the tables (i.e. the non-parsed columns) is:
DB_Store <- c("Amazon", "BEST BUY", "Other")
DB_Prod <- c("ECHO", "PS4", "Jeans")
I currently am using a for loop to loop through i to grepl the "true" string from the parsed string. This takes forever, and I know R was designed to use vectorized code, So my question is, how do I eliminate the loop and use something like lapply (which I tried, and failed at, because I'm not savvy enough with lapply), or some other vectorized thing?
My current code:
for(i in 1:nrow(data)){ # could be i in length(DB_prod) or whatever, all vectors are the same length)
Diff_Store[i] <- !grepl(DB_Store[i], WS_Store[i], ignore.case=T)
Diff_Prod[i] <- !grepl(DB_Prod[i] , WS_Prod[i] , ignore.case=T)
}
I intend to append those columns back into the dataframe, as the true goal is to diagnose why the database has this problem.
If there's a better way than this, rather than trying to vectorize it, I'm open to it. The data in the DB_Store is restricted to a specific number of "stores" (in the table it comes from) but in the string, it seems to be open, which is why I use the DB as the pattern, not the x. Product is similar, but not as restricted, this is why some have dashes and some don't. I would love to match "close things" like Ps-4 vs. PS4, but I will probably just build a table of matches once I see how weird the string gets. To be true though, the string may not match, which is represented by the Pants/Jeans thing. The dataset is 2.5 million records, and there are many different "stores" and "products", and I do want to make sure they match on the same line, not "is it in the database" (which is what previous questions seem to ask, can I see if a string is in a list of strings, rather than a 1:1 comparison, and the last question did end in a loop, which takes minutes and hours to run)
Thanks!
Please check if this works for you:
check <- function(vec_a, vec_b){
mat <- cbind(vec_a, vec_b)
diff <- apply(mat, 1, function(x) !grepl(pattern = x[1], x = x[2], ignore.case = TRUE))
diff
}
Use your different vectors for stores (or products) in the arguments vec_a and vec_b, respectively (example: diff_stores <- check(DB_Store, WS_Store) ). This function will return a logical vector with TRUE values referring to items that weren't a match in the two original vectors. Is this what you wanted?
I am doing a lot of analysis with the TM package. One of my biggest problems are related to stemming and stemming-like transformations.
Let's say I have several accounting related terms (I am aware of the spelling issues).
After stemming we have:
accounts -> account
account -> account
accounting -> account
acounting -> acount
acount -> acount
acounts -> acount
accounnt -> accounnt
Result: 3 Terms (account, acount, account) where I would have liked 1 (account) as all these relate to the same term.
1) To correct spelling is a possibility, but I have never attempted that in R. Is that even possible?
2) The other option is to make a reference list i.e. account = (accounts, account, accounting, acounting, acount, acounts, accounnt) and then replace all occurrences with the master term. How would I do this in R?
Once again, any help/suggestions would be greatly appreciated.
We could set up a list of synonyms and replace those values. For example
synonyms <- list(
list(word="account", syns=c("acount", "accounnt"))
)
This says we want to replace "acount" and "accounnt" with "account" (i'm assuming we're doing this after stemming). Now let's create test data.
raw<-c("accounts", "account", "accounting", "acounting",
"acount", "acounts", "accounnt")
And now let's define a transformation function that will replace the words in our list with the primary synonym.
library(tm)
replaceSynonyms <- content_transformer(function(x, syn=NULL) {
Reduce(function(a,b) {
gsub(paste0("\\b(", paste(b$syns, collapse="|"),")\\b"), b$word, a)}, syn, x)
})
Here we use the content_transformer function to define a custom transformation. And basically we just do a gsub to replace each of the words. We can then use this on a corpus
tm <- Corpus(VectorSource(raw))
tm <- tm_map(tm, stemDocument)
tm <- tm_map(tm, replaceSynonyms, synonyms)
inspect(tm)
and we can see all these values are transformed into "account" as desired. To add other synonyms, just add additional lists to the main synonyms list. Each sub-list should have the names "word" and "syns".
Mr. Flick has answered question #2. I am approaching via answering question #1.
Here is an approach the uses a binary search of a known word data base (DICTIONARY from qdapDictionaries). A binary lookup is slow for sure but if we make some assumptions about the replacing (like a range of differences in number of character). So here's the basic idea:
Turn the Corpus into a unique bag of words using qdap's bag_o_words
Look those words up in a dictionary (qdapDictionaries' DICTIONARY data set) to find words not recognize using match
These misses from step # 2 will be what we lookup
Determine number of characters for words in a dictionary to eliminate gross difference later using nchar
Run each element of misses through a loop (sapply) and do the following:
a. stem each element from misses using tm::stemDocument
b. determine number of characters and eliminate those from dictionary that are not within that range using nchar
c. use agrep with a max.distance to eliminate more words from the dictionary
d. use a binary lookup (that reverse engineers agrep) to determine the word from dictionary that is closest to the missed element [note this is a non-exported function from qdap called qdap:::Ldist]
The result is a named vector that we can use for gsubbing
Use tm_map with a custom tm flavored gsub function to replace words
Do the stemming with tm_map and stemDocument
Here's the code. I made a fake Corpus using the words you provide and some random words to demonstrate how to do this from start to end. You can play with range and max.distance that is supplied to sapply. The looser you are with these the slower the search will be but tightiening these too much will make it more likely to make a mistake. This really isn't an answer for spelling correction in a general sense but works here because you were stemming anyway. There's an Aspell package but I have never used it before.
terms <- c("accounts", "account", "accounting", "acounting", "acount", "acounts", "accounnt")
library(tm); library(qdap)
fake_text <- unlist(lapply(terms, function(x) {
paste(sample(c(x, sample(DICTIONARY[[1]], sample(1:5, 1)))), collapse=" ")
}))
fake_text
myCorp <- Corpus(VectorSource(fake_text))
terms2 <- unique(bag_o_words(as.data.frame(myCorp)[[2]]))
misses <- terms2[is.na(match(terms2, DICTIONARY[[1]]))]
chars <- nchar(DICTIONARY[[1]])
replacements <- sapply(misses, function(x, range = 3, max.distance = .2) {
x <- stemDocument(x)
wchar <- nchar(x)
dict <- DICTIONARY[[1]][chars >= (wchar - range) & chars <= (wchar + range)]
dict <- dict[agrep(x, dict, max.distance=max.distance)]
names(which.min(sapply(dict, qdap:::Ldist, x)))
})
replacer <- content_transformer(function(x) {
mgsub(names(replacements), replacements, x, ignore.case = FALSE, fixed = FALSE)
})
myCorp <- tm_map(myCorp, replacer)
inspect(myCorp <- tm_map(myCorp, stemDocument))
This question inspired me to attempt to write a spell check for the qdap package. There's an interactive version that may be useful here. It's available in qdap >= version 2.1.1. That means you'll need the dev version at the moment.. here are the steps to install:
library(devtools)
install_github("qdapDictionaries", "trinker")
install_github("qdap", "trinker")
library(tm); library(qdap)
## Recreate a Corpus like you describe.
terms <- c("accounts", "account", "accounting", "acounting", "acount", "acounts", "accounnt")
fake_text <- unlist(lapply(terms, function(x) {
paste(sample(c(x, sample(DICTIONARY[[1]], sample(1:5, 1)))), collapse=" ")
}))
fake_text
inspect(myCorp <- Corpus(VectorSource(fake_text)))
## The interactive spell checker (check_spelling_interactive)
m <- check_spelling_interactive(as.data.frame(myCorp)[[2]])
preprocessed(m)
inspect(myCorp <- tm_map(myCorp, correct(m)))
The correct function merely grabs a closure function from the output of check_spelling_interactive and allows you to then apply the "correcting" to any new text string(s).
I am working on a problem in which I have to two data frames data and abbreviations and I would like to replace all the abbreviations present in data to their respective full forms. Till now I was using for-loops in the following manner
abb <- c()
for(i in 1:length(data$text)){
for(j in 1:length(AbbreviationList$Abb)){
abb <- paste("(\\b", AbbreviationList$Abb[j], "\\b)", sep="")
data$text[i] <- gsub(abb, AbbreviationList$Fullform[j], tolower(data$text[i]))
}
}
The abbreviation data frame looks something like the image below and can be generated using the following code
Abbreviation <- c(c("hru", "how are you"),
c("asap", "as soon as possible"),
c("bf", "boyfriend"),
c("ur", "your"),
c("u", "you"),
c("afk", "away from keyboard"))
Abbreviation <- data.frame(matrix(Abbreviation, ncol=2, byrow=T), row.names=NULL)
names(Abbreviation) <- c("abb","Fullform")
And the data is merely a data frame with 1 columns having text strings in each rows which can also be generated using the following code.
data <- data.frame(unlist(c("its good to see you, hru doing?",
"I am near bridge come ASAP",
"Can u tell me the method u used for",
"afk so couldn't respond to ur mails",
"asmof I dont know who is your bf?")))
names(data) <- "text"
Initially, I had data frame with around 1000 observations and abbreviation of around 100. So, I was able to run the analysis. But now the data has increased to almost 50000 and I am facing difficulty in processing it as there are two for-loops which makes the process very slow. Can you suggest some better alternatives to for-loop and explain with an example how to use it in this situation. If this problem can be solved faster via vectorization method then please suggest how to do that as well.
Thanks for the help!
This should be faster, and without side effect.
mapply(function(x,y){
abb <- paste0("(\\b", x, "\\b)")
gsub(abb, y, tolower(data$text))
},abriv$Abb,abriv$Fullform)
gsub is vectorized so no you give it a character vector where matches are sought. Here I give it data$text
I use mapply to avoid the side effect of for.
First of all, clearly there is no need to compile the regular expressions with each iteration of the loop. Also, there is no need to actually loop over data$text: in R, very often you can use a vector where a value could do -- and R will go through all the elements of the vector and return a vector of the same length.
Abbreviation$regex <- sprintf( "(\\b%s\\b)", Abbreviation$abb )
for( j in 1:length( Abbreviation$abb ) ) {
data$text <- gsub( Abbreviation$regex[j],
Abbreviation$Fullform[j], data$text,
ignore.case= T )
}
The above code works with the example data.
I have an SQL database with 7 million+ records, each record containing some text. Within each record I want to perform text analysis, say count the occurences of specific words. I've tried R's tokenize function within the openNLP package which works great for small files, but 7 million records * between 1-100 words per record gets too large for R to hold in a data.frame. I thought about using R's bigmemory or ff packages, or even the mapReduce package. Do you guys have a preferred approach or package for this type of analysis?
Maybe approach it in parallel. I used parLapply b/c I believe it works on all three OS.
wc <- function(x) length(unlist(strsplit(x, "\\s+")))
wordcols <- rep("I like icecream alot.", 100000)
library(parallel)
cl <- makeCluster(mc <- getOption("cl.cores", detectCores()))
clusterExport(cl=cl, varlist=c("wc", "wordcols"), envir=environment())
output <- parLapply(cl, wordcols, function(x) {
wc(x)
}
)
stopCluster(cl)
sum(unlist(output))
On the SQL side you could extract also for each entry the len, then apply a replace(" yourWord ","") (with flanking spaces...) to it, calculate again the total string length and then take the differences between those two, that should do the trick. My SQL skills are not so well that I could present here easily an running example, sorry for that...
(I've tried asking this on BioStars, but for the slight chance that someone from text mining would think there is a better solution, I am also reposting this here)
The task I'm trying to achieve is to align several sequences.
I don't have a basic pattern to match to. All that I know is that the "True" pattern should be of length "30" and that the sequences I have had missing values introduced to them at random points.
Here is an example of such sequences, were on the left we see what is the real location of the missing values, and on the right we see the sequence that we will be able to observe.
My goal is to reconstruct the left column using only the sequences I've got on the right column (based on the fact that many of the letters in each position are the same)
Real_sequence The_sequence_we_see
1 CGCAATACTAAC-AGCTGACTTACGCACCG CGCAATACTAACAGCTGACTTACGCACCG
2 CGCAATACTAGC-AGGTGACTTCC-CT-CG CGCAATACTAGCAGGTGACTTCCCTCG
3 CGCAATGATCAC--GGTGGCTCCCGGTGCG CGCAATGATCACGGTGGCTCCCGGTGCG
4 CGCAATACTAACCA-CTAACT--CGCTGCG CGCAATACTAACCACTAACTCGCTGCG
5 CGCACGGGTAAGAACGTGA-TTACGCTCAG CGCACGGGTAAGAACGTGATTACGCTCAG
6 CGCTATACTAACAA-GTG-CTTAGGC-CTG CGCTATACTAACAAGTGCTTAGGCCTG
7 CCCA-C-CTAA-ACGGTGACTTACGCTCCG CCCACCTAAACGGTGACTTACGCTCCG
Here is an example code to reproduce the above example:
ATCG <- c("A","T","C","G")
set.seed(40)
original.seq <- sample(ATCG, 30, T)
seqS <- matrix(original.seq,200,30, T)
change.letters <- function(x, number.of.changes = 15, letters.to.change.with = ATCG)
{
number.of.changes <- sample(seq_len(number.of.changes), 1)
new.letters <- sample(letters.to.change.with , number.of.changes, T)
where.to.change.the.letters <- sample(seq_along(x) , number.of.changes, F)
x[where.to.change.the.letters] <- new.letters
return(x)
}
change.letters(original.seq)
insert.missing.values <- function(x) change.letters(x, 3, "-")
insert.missing.values(original.seq)
seqS2 <- t(apply(seqS, 1, change.letters))
seqS3 <- t(apply(seqS2, 1, insert.missing.values))
seqS4 <- apply(seqS3,1, function(x) {paste(x, collapse = "")})
require(stringr)
# library(help=stringr)
all.seqS <- str_replace(seqS4,"-" , "")
# how do we allign this?
data.frame(Real_sequence = seqS4, The_sequence_we_see = all.seqS)
I understand that if all I had was a string and a pattern I would be able to use
library(Biostrings)
pairwiseAlignment(...)
But in the case I present we are dealing with many sequences to align to one another (instead of aligning them to one pattern).
Is there a known method for doing this in R?
Writing an alignment algorithm in R looks like a bad idea to me, but there is an R interface to the MUSCLE algorithm in the bio3d package (function seqaln()). Be aware of the fact that you have to install this algorithm first.
Alternatively, you can use any of the available algorithms (eg ClustalW, MAFFT, T-COFFEE) and import the multiple sequence alignemts in R using bioconductor functionality. See eg here..
Though this is quite an old thread, I do not want to miss the opportunity to mention that, since Bioconductor 3.1, there is a package 'msa' that implements interfaces to three different multiple sequence alignment algorithms: ClustalW, ClustalOmega, and MUSCLE. The package runs on all major platforms (Linux/Unix, Mac OS, and Windows) and is self-contained in the sense that you need not install any external software. More information can be found on http://www.bioinf.jku.at/software/msa/ and http://www.bioconductor.org/packages/release/bioc/html/msa.html.
You can perform multiple alignment in R with the DECIPHER package.
Following your example, it would look something like:
library(DECIPHER)
dna <- DNAStringSet(all.seqS)
aligned_DNA <- AlignSeqs(dna)
It is fast and at least as accurate as the other methods listed here (see the paper). I hope that helps!
You are looking for a global alignment algorithm on multiple sequences.
Did you look at Wikipedia before asking ?
First learn what global alignment is, then look for multiple sequence alignment.
Wikipedia doesn't give a lot of details about algorithms, but this paper is better.