Why does my Term Document Matrix have letters missing at the end of words? - r

I'm working on creating a word cloud. After building it, I see that many words are missing their last letters, e.g. Movie --> movi, become --> becom.
I've marked the affected words in yellow; the last one or two letters are missing.

For those who need the answer to this question: the last letters are missing in the TDM because stemming was applied to the data. A stemmer maps words that share the same root to one common stem, and that stem is not necessarily a dictionary word. This is why "Movie" shows up as "movi", "become" as "becom", and so on.

The missing letters at the end of the words are the result of a preprocessing step: stemming. Skip stemming before creating the DTM or TDM, and build the word cloud from the unstemmed terms.
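To see the effect in isolation, here is a minimal sketch that stems a few words directly. It assumes the SnowballC package is installed (as far as I know, tm's stemDocument relies on it); the example words are made up for illustration, and the block does nothing if the package is missing.

```r
# Assumption: SnowballC is available; otherwise the stemming step is skipped.
words <- c("movie", "movies", "become", "becomes")

if (requireNamespace("SnowballC", quietly = TRUE)) {
  # Snowball (Porter-style) stemmer for English
  stems <- SnowballC::wordStem(words, language = "english")
  # Side-by-side view of each word and its stem
  print(data.frame(word = words, stem = stems))
}
```

If you drop the stemming step from the preprocessing pipeline, the terms in the TDM keep their full spelling.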

Related

R: Regex for identifying numbers within HTML chunk

This is my first post on Stack Overflow, so please be indulgent if it is lacking in quality.
I want to learn some web scraping with R and started with a simple example: extracting a table from a Wikipedia page.
I managed to download the specific page and identified the HTML sections I am interested in:
<td style="text-align:right">511.000.000\n</td>
Now I want to extract the number from the table data using a regex. So I created a regex that, from my point of view, should match the structure of the number:
pattern<-"\\d*\\.\\d*\\.\\d*\\.\\d*\\."
I also tried other variations, but none of them found the number within the HTML code. I wanted to keep the pattern open, as the numbers might be in the hundreds, thousands, millions, or billions.
My questions:
The number is embedded in HTML code. Might it be necessary to include something in the pattern for the non-numeric code (which should not be extracted)?
What would be the correct version of the pattern to identify the number?
Thank you very much for your support!
So many stars implies a lot of backtracking.
Furthermore, \\d* would match more than 3 digits in any group and would also match a group with no digits at all.
Assuming your numbers are always integers, formatted using a . as thousand separator, you could use the following: \\d{1,3}(?:\\.\\d{3})* (note the usage of non-capturing group construct (?:...) - implying the use of perl = TRUE in arguments, as mentioned in Regular Expressions as used in R).
Look closely at your regex. You are assuming that the number will have 4 periods (\\.) in it, but in your own example there are only two periods. It's not going to match because while the asterisk marks \\d as optional (zero or more), the periods are not marked as optional. If you add a ? modifier after the 3rd and 4th period, you may find that your pattern starts matching.
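A minimal sketch of the suggested pattern applied to the HTML snippet from the question, using base R only. The variable names are made up for illustration, and it assumes the numbers are integers with . as the thousands separator.

```r
# The table cell from the question
html <- '<td style="text-align:right">511.000.000\n</td>'

# One to three digits, followed by any number of ".ddd" groups
pattern <- "\\d{1,3}(?:\\.\\d{3})*"

# regexpr() finds the first match; regmatches() extracts it
number_text <- regmatches(html, regexpr(pattern, html, perl = TRUE))
print(number_text)

# Drop the thousands separators and convert to a number
number_value <- as.numeric(gsub(".", "", number_text, fixed = TRUE))
```

The surrounding HTML does not need to be part of the pattern, because the regex only has to match the number itself. For a whole table, gregexpr() in place of regexpr() would return every match.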

matching specific character strings as word patterns in a corpus in R

I need your help. I have a corpus in R and I want to find specific words in it. The final line of my code is this:
sentences_with_args <- arg.match(c("because", "however", "therefore"), myCorpus, 0)
The problem is that, apart from the words mentioned above, R also returns words derived from them, like "how" from "however" and "cause" from "because". How do I match the specific strings to their exact occurrences in the corpus? Thank you.
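I don't know how arg.match is implemented, so this is only a sketch of one common way to get whole-word matches in base R: wrap the words in \\b word boundaries. The sample sentences are invented for illustration.

```r
# Hypothetical sample sentences standing in for the corpus
sentences <- c(
  "I stayed home because it rained.",
  "What was the cause?",
  "However, it cleared up later.",
  "How did it go?"
)
words <- c("because", "however", "therefore")

# \\b anchors the alternatives at word boundaries, so "cause" and "how"
# on their own do not count as matches
pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
hits <- grepl(pattern, sentences, ignore.case = TRUE, perl = TRUE)

print(sentences[hits])
```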

How to check if a paragraph is part of a text in R

I have one paragraph of text (a vector of words) and I would like to see if it is "part" of a long text (a vector of words). However, I know that this paragraph does not appear in the text in its exact form, but with slight changes: a few words could be missing, the order could be slightly different, some words could be inserted as parenthetical elements, etc.
I am currently implementing solutions "by hand", such as checking whether most of the words of the paragraph are in the text, looking at the distance between these words, their order, etc.
I was wondering, however, whether there is a built-in method to do that?
I already checked the tm package, but it does not seem to do that.
Any ideas?
I fear that you are stuck with hand-writing an approach, e.g. grep-ing some word groups and having some kind of matching threshold.
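A rough sketch of what such a hand-written approach could look like: slide a window over the text and measure which fraction of the paragraph's words occurs in each window. The function name window_overlap, the window size, and the 0.7 threshold are all assumptions chosen for illustration; word order is ignored here.

```r
# Fraction of the paragraph's distinct words found in each sliding window
window_overlap <- function(paragraph, text, window = 2 * length(paragraph)) {
  para_words <- unique(tolower(paragraph))
  text <- tolower(text)
  window <- min(window, length(text))
  starts <- seq_len(length(text) - window + 1)
  vapply(starts, function(i) {
    mean(para_words %in% text[i:(i + window - 1)])
  }, numeric(1))
}

text <- c("the", "quick", "brown", "fox", "jumps", "over", "the", "lazy",
          "dog", "and", "then", "it", "sleeps", "all", "day")
paragraph <- c("fox", "quickly", "jumps", "over", "dog")

scores <- window_overlap(paragraph, text, window = 8)
best <- which.max(scores)      # start position of the best window
threshold <- 0.7               # assumed cut-off, tune for your data
is_part <- scores[best] >= threshold
```

Here the paragraph counts as "part" of the text if at least 70% of its words show up close together somewhere in the text.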

Force create Terms using tm package

I have a corpus that has words such as 5k, 50k, 7.5k, 75k, 10K, 100K.
When I create a TDM using the tm package, terms such as 10k and 100k are extracted separately. However, 5k and 7.5k are not extracted as separate terms.
I understand that after punctuation removal "7.5k" might be counted under the "75k" term, but what's going on with "5k"? Why is it not extracted as a term?
Basically, I want to know if there is a way to FORCE the tm package to look for specific words and extract them as key terms.
Any pointers would help!
Are you breaking words at punctuation? That is, is '.' a word-break character? If so, then the split of '7.5k' is ('7', '5k'), the second of which matches '5k'.
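The two punctuation scenarios can be checked in base R without tm. The last step is a separate assumption on my part: as far as I know, tm's term counting by default only keeps terms of at least three characters (the wordLengths control option, default c(3, Inf)), which would also drop "5k"; passing something like control = list(wordLengths = c(1, Inf)) should keep short terms.

```r
terms <- c("5k", "50k", "7.5k", "75k", "10K", "100K")

# If '.' acts as a word break, "7.5k" falls apart into "7" and "5k"
pieces <- unlist(strsplit("7.5k", "[[:punct:]]"))

# If punctuation is simply removed, "7.5k" collapses into "75k"
collapsed <- gsub("[[:punct:]]", "", "7.5k")

# Assumed minimum term length of 3 characters: "5k" does not survive
kept <- terms[nchar(terms) >= 3]
print(kept)
```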

How to replace english abbreviated form to their dictionary form

I'm working on a system to analyze texts in English: I use stanford-core nlp to split whole documents into sentences and sentences into tokens. I also use the maxent tagger to get the tokens' POS tags.
Since I use this corpus to build a supervised classifier, it would be good if I could replace any word like 're, 's, havin, sayin', etc. with its standard form (are, is, having, saying). I've been searching for an English dictionary file, but I don't know how to use it. There are so many distinct cases to consider that I don't think it's an easy task: is there some similar work or a whole project that I could use?
Ideas:
I) Use string edit distance on a subset of your text: for words that do not exist in the dictionary, find the closest existing dictionary words by edit distance.
II) The key feature of many of your examples is that they are only 1 character away from the correct spelling. So, for words that fail to match a dictionary entry, I suggest adding each English letter to the front or back and looking up the resulting word in the dictionary. This is very expensive in the beginning, but if you keep track of those misspellings in a lookup table (re -> are), at some point the table will hold 99.99% of the common misspellings (or whatever you call them) with their correct spelling.
III) Train a word-level 2-gram or 3-gram language model on proper, clean English text (e.g. newspaper articles), then run it over your entire corpus. For the words the language model considers unknown (meaning it hasn't seen them in the training phase), check what the most probable word is according to the model. Most probably the correctly spelled word will be among the language model's top-10 predictions.
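Ideas I and II can be sketched together in a few lines of R, using the edit distance from utils::adist and caching the results in a lookup table. The tiny dictionary, the function name normalise, and the max_dist cut-off are assumptions for illustration; a real dictionary file would replace the hard-coded vector.

```r
# Toy dictionary; in practice this would be loaded from a word list
dictionary <- c("are", "is", "having", "saying", "have", "say")
tokens <- c("havin", "sayin", "re", "is")

# Map each token to the closest dictionary word within max_dist edits;
# tokens already in the dictionary, or too far from it, stay unchanged
normalise <- function(tokens, dictionary, max_dist = 1) {
  vapply(tokens, function(tok) {
    if (tok %in% dictionary) return(tok)
    d <- drop(utils::adist(tok, dictionary))
    best <- which.min(d)
    if (d[best] <= max_dist) dictionary[best] else tok
  }, character(1), USE.NAMES = FALSE)
}

# Lookup table (token -> standard form) that can be reused later
lookup <- normalise(tokens, dictionary)
names(lookup) <- tokens
print(lookup)
```

Once a token is in the lookup table, later occurrences can be resolved by name instead of recomputing distances.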

Resources