R - Tokenization - single and two letter words in a TermDocumentMatrix

I am currently doing a bit of text processing and I would like to get the one- and two-letter words into a TermDocumentMatrix.
The issue is that it seems to keep only words of three letters or more.
library(tm)
library(RWeka)
test<-'This is a test.'
testmyCorpus<-Corpus(VectorSource(test))
testTDF<-TermDocumentMatrix(testmyCorpus, control=list(tokenize=AlphabeticTokenizer))
inspect(testTDF)
Only the words "this" and "test" are displayed. Any ideas?
Thanks a lot for your help!
Robert

Here is the answer to your problem: in short, you should add the option control=list(wordLengths=c(1, Inf)) to TermDocumentMatrix, since the default minimum word length is 3.
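A minimal sketch of the fix, reusing the corpus from the question (the RWeka tokenizer is kept only because the question uses it; the wordLengths option is the part that matters):
library(tm)
library(RWeka)
test <- 'This is a test.'
testmyCorpus <- Corpus(VectorSource(test))
# The default control uses wordLengths = c(3, Inf); lowering the minimum to 1
# lets one- and two-letter terms into the matrix.
testTDF <- TermDocumentMatrix(testmyCorpus,
                              control = list(tokenize = AlphabeticTokenizer,
                                             wordLengths = c(1, Inf)))
inspect(testTDF)
# "a" and "is" should now appear alongside "test" and "this"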

Related

RemoveWords command not removing some weird words

I'm trying to remove some weird words (like <U+0001F399><U+FE0F>) from my text corpus to do some Twitter analysis.
There are many words like that which I just can't remove using tm_map(x, removeWords, ...).
I have plenty of tweets aggregated in a dataset. Then I use the following code:
corpus_tweets <- tm_map (corpus_tweets, removeWords, c("<U+0001F339>", "<U+0001F4CD>"))
If I try swapping those weird words for regular ones (like "life" or "animal") that also appear in my dataset, the regular ones get removed easily.
Any idea of how to solve this?
As these are Unicode characters, you need to figure out how to enter them properly in R.
The escape syntax for Unicode in R is not <U+xxxx>, but rather something like \Uxxxx; see the manual for details. (I don't use R myself; this is an example of the kind of inconsistency that annoys me, where the string is apparently printed differently from what R accepts as input.)
corpus_tweets <- tm_map (corpus_tweets, removeWords, c("\U0001F339", "\U0001F4CD","\uFE0F","\uFE0E"))
NOTE: you use a backslash and a lowercase u followed by 4 hex digits to specify a character from Unicode plane 0; you must use an uppercase U followed by 8 hex digits for the other planes (which is where emoji typically live, relevant since you are working with tweets).
BTW, see the question "Some emojis (e.g. ☁) have two unicode, u'\u2601' and u'\u2601\ufe0f'. What does u'\ufe0f' mean? Is it the same if I delete it?" for why you are getting the FE0F in there: it appears when the user chooses a variation of an emoji, e.g. to add colour. FE0E is its partner (saying you want the plain text glyph).
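A small illustration of the escape syntax (my own sketch, not part of the original answer): the \U escapes produce single code points, rather than the literal text <U+...> that gets printed.
rose <- "\U0001F339"   # the rose emoji that prints as <U+0001F339>
pin  <- "\U0001F4CD"   # the round pushpin that prints as <U+0001F4CD>
nchar(rose)            # 1 -- a single Unicode character
# Stripping the emoji and its variation selector with a plain gsub:
gsub("\U0001F339|\uFE0F", "", "Spring \U0001F339\uFE0F is lovely")
# [1] "Spring  is lovely"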

Extract numerical value before a string in R

I have been mucking around with regex strings and strsplit but can't figure out how to solve my problem.
I have a collection of html documents that will always contain the phrase "people own these". I want to extract the number immediately preceding this phrase. i.e. '732,234 people own these' - I'm hoping to capture the number 732,234 (including the comma, though I don't care if it's removed).
The number and phrase are always surrounded by a . I tried using XPath but that seemed even harder than a regex. Any help or advice is greatly appreciated!
example string: >742,811 people own these<
-> 742,811
Could you please try the following.
val <- "742,811 people own these"
gsub(' [a-zA-Z]+',"",val)
Output will be as follows.
[1] "742,811"
Explanation: this uses R's gsub (global substitution) function. The pattern matches every occurrence of a space followed by one or more lower- or upper-case letters and replaces it with the empty string in val, leaving only the leading number.
Try using str_extract_all from the stringr library:
str_extract_all(data, "\\d{1,3}(?:,\\d{3})*(?:\\.\\d+)?(?= people own these)")
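A quick check of that pattern on the sample value from the question:
library(stringr)
val <- "742,811 people own these"
str_extract_all(val, "\\d{1,3}(?:,\\d{3})*(?:\\.\\d+)?(?= people own these)")
# [[1]]
# [1] "742,811"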

Get all hashtags using regular expressions

I'm studying the recent hashtag #BalanceTonPorc in one of my classes. I'm trying to get all the occurrences of this hashtag appearing in tweets, but of course nobody uses the same format.
Some people use #BalanceTonPorc, some #balancetonporc, and so on and so forth.
Using gsub, I've so far done this:
df$hashtags <- gsub(".alance.on.orc", "BalanceTonPorc", df$hashtags)
Which does what I want, and all variations of this hashtag are stored under the same one. But there are A LOT of other variations. Some people used #BalanceTonPorc... or #BalanceTonPorc.
Is there a way to write a regex that matches everything containing .alance.on.orc with every possible character after the hashtag, except , (because the comma separates hashtags)? Here is a screenshot to illustrate what I mean.
I'm also having another issue, in my frequency table I have twice #BalanceTonPorc, so I guess R must consider them to be different. Can you spot the difference?
You may use [^,]* to match zero or more occurrences of any character other than ,:
gsub(".alance.on.orc[^,]*", "BalanceTonPorc", df$hashtags)
Or, to exactly match balancetonporc,
gsub("balancetonporc[^,]*", "BalanceTonPorc", df$hashtags, ignore.case=TRUE)
For example:
x <- c("#balancetonPorc#%$%#$%^","#balancetonporc#%$%, text")
gsub("balancetonporc[^,]*", "BalanceTonPorc", x, ignore.case=TRUE)
# => [1] "#BalanceTonPorc" "#BalanceTonPorc, text"

extract string using same pattern in a text using R

I'm trying to work with some text in R and here is my question.
From this source text
#Pray4Manchester# I hope that #ArianaGrande# will be better soon.
I want to extract Pray4Manchester and ArianaGrande using the pattern #.+#, but when I run
str_extract_all(text,pattern="#.+#")
I get
#Pray4Manchester# I hope that #ArianaGrande#
How to solve this? Thanks.
Your .+ is greedy, so it expands from the first # all the way to the last one. We can instead use lookarounds to capture only the word between a pair of #:
str_extract_all(text, "(?<=#)\\w*(?=#)")[[1]]
#[1] "Pray4Manchester" "ArianaGrande"
data
text <- "#Pray4Manchester# I hope that #ArianaGrande# will be better soon."
You could use a regex that looks for text between two hashes that contains no space character.
Something like this: ([#]{1}[^\s]+[#]{1})
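As an R string the backslash has to be doubled, so the pattern is written "[#]{1}[^\\s]+[#]{1}". A quick sketch on the text from the question (note that, unlike the lookaround answer above, this keeps the surrounding hashes):
library(stringr)
text <- "#Pray4Manchester# I hope that #ArianaGrande# will be better soon."
# [^\\s]+ cannot cross a space, so a match cannot run from the first '#' to the last.
str_extract_all(text, "[#]{1}[^\\s]+[#]{1}")[[1]]
# [1] "#Pray4Manchester#" "#ArianaGrande#"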

How to count the number of sentences in a text in R?

I read a text into R using the readChar() function. I aim to test the hypothesis that the sentences of the text contain as many occurrences of the letter "a" as of the letter "b". I recently discovered the {stringr} package, which helped me a great deal to do useful things with my text, such as counting the number of characters and the total number of occurrences of each letter in the entire text. Now I need to know the number of sentences in the whole text. Does R have a function that can help me do that? Thank you very much!
Thank you @gui11aume for your answer. A very good package I just found that can do the job is {openNLP}. Here is the code:
install.packages("openNLP") ## Installs the required natural language processing (NLP) package
install.packages("openNLPmodels.en") ## Installs the model files for the English language
library(openNLP) ## Loads the package for use in the task
library(openNLPmodels.en) ## Loads the model files for the English language
text = "Dr. Brown and Mrs. Theresa will be away from a very long time!!! I can't wait to see them again." ## This sentence has unusual punctuation, as suggested by @gui11aume
x = sentDetect(text, language = "en") ## sentDetect() is the function to use. It detects and separates sentences in a text. The first argument is the string vector (or text) and the second argument is the language.
x ## Displays the different sentences in the string vector (or text).
[1] "Dr. Brown and Mrs. Theresa will be away from a very long time!!! "
[2] "I can't wait to see them again."
length(x) ## Displays the number of sentences in the string vector (or text).
[1] 2
The {openNLP} package is really great for natural language processing in R, and its documentation provides a good, short introduction.
Three more languages are supported in the package. You just need to install and load the corresponding model files.
{openNLPmodels.es} for Spanish
{openNLPmodels.ge} for German
{openNLPmodels.th} for Thai
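For example, the Spanish setup would mirror the English one above (a sketch only; the model packages are hosted outside CRAN, so install.packages() may need to point at the repository that provides them):
install.packages("openNLPmodels.es")   # model packages may require a non-CRAN repository
library(openNLPmodels.es)
sentDetect("Hola. ¿Cómo estás? Muy bien.", language = "es")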
What you are looking for is sentence tokenization, and it is not as straightforward as it seems, even in English (a sentence like "I met Dr. Bennett, the ex-husband of Mrs. Johnson." contains full stops that do not end it).
R is definitely not the best choice for natural language processing. If you are Python proficient, I suggest you have a look at the nltk module, which covers this and many other topics. You can also copy the code from this blog post, which does sentence tokenization and word tokenization.
If you want to stick to R, I would suggest you count the end-of-sentence characters (., ?, !), since you are able to count characters. A way of doing it with a regular expression is like so:
text <- 'Hello world!! Here are two sentences for you...'
length(gregexpr('[[:alnum:] ][.!?]', text)[[1]])
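One caveat worth adding (my note, not part of the original answer): gregexpr() returns a single -1 when there is no match, so a text with no sentence-ending punctuation would still report a count of 1. A small guard:
count_sentences <- function(text) {
  # gregexpr() yields -1 positions when nothing matches, so keep only real hits
  hits <- gregexpr('[[:alnum:] ][.!?]', text)[[1]]
  sum(hits > 0)
}
count_sentences('Hello world!! Here are two sentences for you...')  # 2
count_sentences('no terminal punctuation at all')                   # 0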
