Force create terms using tm package - R

I have a corpus that contains words such as 5k, 50k, 7.5k, 75k, 10K, 100K.
When I create a TDM using the tm package, terms such as 10k and 100k are extracted separately. However, 5k and 7.5k are not extracted as separate terms.
Now, I understand that after punctuation removal "7.5k" might be falling under the "75k" terms, but what's going on with "5k"? Why is it not extracted as a term?
Basically, I would like to know if there is a way to force the tm package to look for specific words and extract them as key terms.
Any pointers would help!

Are you breaking words at punctuation? That is, is '.' a word-break character? If so, then the split of '7.5k' is ('7', '5k'), the second of which matches '5k'.
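A likely additional culprit for "5k" is tm's default minimum term length of 3 characters. The sketch below is only illustrative, assuming a recent tm and a plain whitespace tokenizer (so the decimal point in "7.5k" survives and punctuation removal is skipped); it lowers the length bound and passes an explicit dictionary so the wanted terms are forced into the matrix:

```r
library(tm)

corp <- VCorpus(VectorSource("5k 50k 7.5k 75k 10K 100K"))

tdm <- TermDocumentMatrix(
  corp,
  control = list(
    tokenize    = scan_tokenizer,   # whitespace tokenizer, so "7.5k" stays one token
    tolower     = TRUE,
    wordLengths = c(2, Inf),        # the default c(3, Inf) silently drops short terms like "5k"
    dictionary  = c("5k", "7.5k", "10k", "50k", "75k", "100k")
  )
)
inspect(tdm)
```

The `dictionary` control option restricts the resulting matrix to exactly the listed terms, which is the closest thing tm offers to "forcing" specific key terms.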

Related

How to remove/separate conjoint words from tweets

I am mining Twitter data, and one of the problems I come across while cleaning the text is being unable to remove/separate conjoint words, which are usually hashtag data. Upon removing special characters and symbols like '#', I am left with phrases that make no sense. For instance:
1) Meaningless words: I have words like 'spillwayjfleck', 'bowhunterva', etc., which make no sense and need to be removed from my corpus. Is there any function in R which can do this?
2) Conjoint words: I need a method to separate joined words like 'flashfloodwarn' into:
'flash', 'flood', 'warn' in my corpus.
Any help would be appreciated.
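For the first part, one possible approach (not from the original thread, just an illustrative sketch) is to drop tokens that fail a spell-checker's dictionary lookup, for example via the hunspell package:

```r
library(hunspell)

tokens <- c("flash", "spillwayjfleck", "bowhunterva", "warning")

# hunspell_check() returns TRUE for words found in its English dictionary;
# keeping only those discards the mashed-together hashtag fragments.
tokens[hunspell_check(tokens)]
#> [1] "flash"   "warning"
```

Splitting conjoint words like 'flashfloodwarn' is a harder word-segmentation problem and generally needs a wordlist- or language-model-based segmenter rather than a single base-R function.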

Matching specific character strings as word patterns in a corpus in R

I need your help. I have a corpus in R and I want to find specific words in it. The final line of my code is this:
sentences_with_args <- arg.match(c("because", "however", "therefore"), myCorpus, 0)
The problem is that, apart from the words mentioned above, R returns several words that derive from them, like "how" from "however" and "cause" from "because". How do I match the specific strings to their exact occurrences in the corpus? Thank you.
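arg.match() is not shown in the question, but the usual fix for this kind of partial matching is to anchor the patterns with word boundaries. A minimal base-R sketch (the sentences are made up):

```r
# Match whole words only: "\\b" word boundaries stop "how" from matching
# inside "however" and "cause" inside "because".
targets <- c("because", "however", "therefore")
pattern <- paste0("\\b(", paste(targets, collapse = "|"), ")\\b")

sentences <- c("It rained; therefore we stayed in.",
               "How did it happen, however?",
               "We know how this ends.")

sentences[grepl(pattern, sentences, ignore.case = TRUE)]
```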

How to search for huge list of search-terms from a corpus using custom function in tm package

I want to select and retain the gene names from a corpus of multiple text documents using the tm package. I have used a custom function to keep only the genes defined in "pattern" and remove everything else. Here is my code:
docs <- Corpus(DirSource("path of the directory containing text documents"))
# keep only the substrings matching "pattern", dropping everything else
f <- content_transformer(function(x, pattern) regmatches(x, gregexpr(pattern, x, ignore.case = TRUE)))
genes <- "IL1|IL2|IL3|IL4|IL5|IL6|IL7|IL8|IL9|IL10|TNF|TGF|AP2|OLR1|OLR2"
docs <- tm_map(docs, f, genes)
The code works perfectly fine. However, if I need to match a much larger number of genes (say, more than 5000), what is the best way to approach it? I don't want to put the genes in an array and loop the tm_map function, to avoid huge run time and memory constraints.
If you simply want the fastest vectorized fixed-string regex matching, use the stringi package, not tm. Specifically, look at the stri_match* functions (or you might find stringr even faster if you're only handling ASCII; look for Hadley's latest versions and comments).
But if the regex of gene names is fixed and known up front, and you're going to be doing a lot of retrieval on those few strings, then you could tag each document for faster retrieval.
(You haven't fully told us your use case. What % of your runtime is this retrieval task: 0.1%? 99%? Are you storing your genes as text strings? Why not tokenize them and convert them once to factors at input time?)
Either way, tm is not a very scalable or performant package, so look at other approaches.
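A rough illustration of the stringi route (the gene list and texts below are stand-ins, not from the question): build one alternation pattern and extract all matches per document in a single vectorized call, with no tm_map loop:

```r
library(stringi)

genes   <- c("IL1", "IL2", "IL6", "IL10", "TNF", "TGF", "OLR1")  # stand-in for the full list
pattern <- paste0("\\b(", paste(genes, collapse = "|"), ")\\b")

texts <- c("IL6 and TNF were elevated in the cohort.",
           "No cytokines were mentioned in this abstract.")

# One vectorized pass over all documents; returns a list of matches per text.
stri_extract_all_regex(texts, pattern,
                       opts_regex = stri_opts_regex(case_insensitive = TRUE))
```

Building the pattern once and letting stringi's ICU engine scan each document avoids the per-gene looping the asker wants to escape.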

How to replace English abbreviated forms with their dictionary forms

I'm working on a system to analyze texts in English: I use Stanford CoreNLP to split whole documents into sentences and sentences into tokens. I also use the MaxEnt tagger to get the tokens' POS tags.
Now, considering that I use this corpus to build a supervised classifier, it would be good if I could replace any word like 're, 's, havin, sayin', etc. with its standard form (are, is, having, saying). I've been searching for some English dictionary file, but I don't know how to use it. There are so many distinct cases to consider that I don't think it's an easy task to handle: is there some similar work or whole project that I could use?
Ideas:
I) Use string edit distance on a subset of your text: match words that do not exist in the dictionary against existing dictionary words using edit distance (see the sketch after this list).
II) The key feature of many of those examples is that they are only one character away from the correct spelling. So, for the words that you fail to match with a dictionary entry, try adding each English character to the front or back and look up the resulting word in the dictionary. This is very expensive in the beginning, but if you keep track of those misspellings in a lookup table (re -> are), at some point you will have 99.99% of the common misspellings (or whatever you call them) in your lookup table with their correct spelling.
III) Train a word-level 2-gram or 3-gram language model on proper, clean English text (e.g. newspaper articles), then run it over your entire corpus and, for the words that the language model treats as unknown (i.e. it hasn't seen them during training), check which word it considers most probable in that position. Most probably the language model's top-10 predictions will contain the correctly spelled word.
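A minimal sketch of idea I, written in R to stay consistent with the rest of this page (the asker's pipeline is Stanford CoreNLP in Java, and the word list and distance threshold below are made-up assumptions): map out-of-vocabulary tokens to their nearest dictionary word by edit distance.

```r
# Stand-in dictionary and tokens; a real system would load a full word list.
dictionary <- c("are", "is", "having", "saying", "because")
tokens     <- c("'re", "havin", "sayin'", "because")

normalize <- function(tok, dict, max_dist = 2) {
  if (tok %in% dict) return(tok)               # already a known word
  d <- adist(gsub("'", "", tok), dict)         # edit distance to each dictionary entry
  if (min(d) <= max_dist) dict[which.min(d)] else tok
}

sapply(tokens, normalize, dict = dictionary)
```

The max_dist cutoff keeps genuinely unknown words (names, hashtags) from being forced onto an unrelated dictionary entry.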

Term-document matrix or document-term matrix: which one is better?

I am working on counting the frequency of unique words in a text document in R 3.2.2. I have collapsed many articles into one single text document and framed it into a corpus using the tm package.
desc <- paste(column_input, collapse = " ")   # collapse all articles into one string
desrc <- VectorSource(desc)
decorp <- Corpus(desrc)
#dedtm <- DocumentTermMatrix(decorp)
#dedtm <- TermDocumentMatrix(decorp)
There are 12,000-odd terms in that one text document. To proceed with matrix operations, I am not quite sure which is the better method: term-document matrix or document-term matrix?
I suppose that depends upon the context. Is it better to use a term-document matrix rather than a document-term matrix in the case of fewer documents with more terms? I just want to understand the logic behind this, so I hope there is no need for a reproducible example. Any suggestions would be greatly appreciated.
Thanks in advance,
Bala
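Not from the original thread, but a quick way to see that the choice is mostly a matter of orientation: the two matrices hold the same counts and are transposes of each other, so pick whichever layout your downstream matrix operations expect (with a single collapsed document, both reduce to one row or column of term counts).

```r
library(tm)

corp <- VCorpus(VectorSource(c("alpha beta beta", "beta gamma")))

dtm <- DocumentTermMatrix(corp)   # rows = documents, columns = terms
tdm <- TermDocumentMatrix(corp)   # rows = terms, columns = documents

# Same counts, just transposed.
all(as.matrix(tdm) == t(as.matrix(dtm)))   # TRUE
```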
