Matching specific character strings as whole words in a corpus in R - string-matching

I need your help. I have a corpus in R and I want to find specific words in it. The final line of my code is this:
sentences_with_args <- arg.match(c("because", "however", "therefore"), myCorpus, 0)
The problem is that, apart from the words listed above, R also returns words contained in them, such as "how" from "however" and "cause" from "because". How do I match these strings only where they occur as whole words in the corpus? Thank you.
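One common way to restrict matches to whole words is to wrap the pattern in word boundaries (\\b). The sketch below uses base R only; the sample sentences are made up for illustration, and since arg.match() is the asker's own function, it is not used here.

```r
# Hypothetical sample data standing in for the sentences of the corpus.
sentences <- c(
  "I stayed because it rained.",
  "How did this cause that?",
  "However, we left early.",
  "Therefore it works."
)
args <- c("because", "however", "therefore")

# Build \b(because|however|therefore)\b so only whole words match.
pattern <- paste0("\\b(", paste(args, collapse = "|"), ")\\b")

# ignore.case = TRUE lets "However" at the start of a sentence match too.
hits <- grepl(pattern, sentences, ignore.case = TRUE, perl = TRUE)
sentences_with_args <- sentences[hits]
print(sentences_with_args)
```

The second sentence contains "How" and "cause" but neither full word, so it is left out.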

Related

How to count number of sentences occurring in the built-in sentences vector ending with the words as “day”, “pay”, or “way”?

I read a text into R using the readChar() function. I recently discovered the stringr package, which helped me a great deal to do useful things with my text, such as counting the number of characters and the total number of occurrences of each letter in the entire text. Now I need to count the number of sentences in the built-in sentences vector that end with the words “day”, “pay”, or “way”. A sentence should not be counted if its last word is not exactly one of them (e.g. "away"). Does R have a function that can help me do that?
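A possible approach, sketched in base R: anchor the alternation to the end of the string, allow trailing punctuation, and require a word boundary before the word so that "away" is excluded. The sample_sentences vector is a made-up stand-in for the sentences vector.

```r
# Hypothetical sample; replace with the real sentences vector.
sample_sentences <- c(
  "It was a fine day.",
  "They walked away.",
  "He collected his pay.",
  "This is the way",
  "Pay day is here."
)

# \b before the word rejects "away"; [[:punct:]]*$ tolerates a final period.
pattern <- "\\b(day|pay|way)[[:punct:]]*$"
ends_with_target <- grepl(pattern, sample_sentences, perl = TRUE)

n_matching <- sum(ends_with_target)
print(n_matching)
```

With stringr loaded, the same pattern could presumably be passed to str_detect() and the result summed in the same way.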

How to combine multiwords in a dfm?

I created a corpus of 233 rows and 3 columns (Date, Title, Article), where the last column, Article, is text (so I have 233 texts). The final aim is to apply topic models and, to do so, I need to convert my corpus into a dfm. First, though, I would like to combine words into bigrams and trigrams to make the analysis more rigorous.
The problem is that when I use textstat_collocations or tokens_compound, I am forced to tokenize the corpus and, in doing so, I lose the structure (233 by 4) that is crucial for applying topic models. In fact, once I apply those functions, I just get one row of bigrams and trigrams, which is useless to me.
So my question is: do you know any other way to look for bigrams and trigrams in a dfm without necessarily tokenizing the corpus?
Or, in other words, what do you usually do to look for multiword expressions in a dfm?
Thanks a lot for your time!
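To illustrate the general idea that compounding can be done per document, so that the result still has one row per text, here is a minimal base R sketch. It does not use quanteda; the articles, the phrases vector, and the compound() helper are all hypothetical, and the phrase list is assumed to be known in advance.

```r
# Hypothetical documents and multiword expressions.
articles <- c(
  doc1 = "The central bank raised interest rates.",
  doc2 = "Interest rates fell after the central bank met.",
  doc3 = "No news today."
)
phrases <- c("central bank", "interest rates")

# Join each phrase with "_" inside every document (helper name is made up).
compound <- function(text, phrases) {
  text <- tolower(text)
  for (p in phrases) {
    joined <- gsub(" ", "_", p, fixed = TRUE)
    text <- gsub(paste0("\\b", p, "\\b"), joined, text, perl = TRUE)
  }
  text
}

compounded <- compound(articles, phrases)

# Tokenize per document: the result is a list with one element per text.
tokens_by_doc <- strsplit(gsub("[^a-z_ ]", "", compounded), " +")
names(tokens_by_doc) <- names(articles)

# Build a small document-term matrix with one row per document.
vocab <- sort(unique(unlist(tokens_by_doc)))
dtm <- t(sapply(tokens_by_doc, function(tk) table(factor(tk, levels = vocab))))
print(dim(dtm))
```

The point of the sketch is only that tokenizing and compounding within each document keeps the document dimension intact.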

Why my Term Document Matrix has letters missing at end?

I'm working on creating a word cloud. After creating it, I see that many words are missing their last letters, e.g. Movie --> movi, become --> becom.
I've marked the words in yellow; the last one or two letters are missing.
For those who need the answer to this question: the last letters are missing in the TDM because of stemming. The stemming function reduces all words that share the same root to that root, which is why "Movie" appears as "movi" and so on.
The missing letters at the end of the words are the result of a preprocessing step, stemming. Skip stemming before creating the DTM or TDM, and create the word cloud without it.
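The effect can be illustrated with a toy example in base R. The toy_stem() function below is a deliberately crude, hypothetical suffix stripper, not the stemmer the tm package uses; it only shows how several word forms collapse into one truncated term.

```r
# Toy suffix stripper (an assumption for illustration, not a real stemmer).
toy_stem <- function(w) sub("(ing|ed|es|e|s)$", "", w)

words <- c("movie", "movies", "become", "becomes", "becoming")
stems <- toy_stem(words)

# Without stemming: five distinct terms. With stemming: two truncated terms.
print(table(words))
print(table(stems))
```

A word cloud built from the unstemmed terms would show the full words, at the cost of counting each form separately.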

Force create Terms using tm package

I have a corpus that contains words such as 5k, 50k, 7.5k, 75k, 10K, 100K.
When I create a TDM using the tm package, terms such as 10k and 100k are extracted separately. However, 5k and 7.5k are not extracted as separate terms.
I understand that after punctuation removal "7.5k" might be counted under the "75k" term, but what's going on with "5k"? Why is it not extracted as a term?
Basically, I would like to know whether there is a way to force the tm package to look for specific words and extract them as key terms.
Any pointers would help!
Are you breaking words at punctuation? That is, is '.' a word-break character? If so, then '7.5k' is split into ('7', '5k'), the second of which matches '5k'.
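The two possible treatments of "7.5k" can be compared in base R. This sketch does not call tm; it only mimics splitting at '.' versus removing it. As a separate point, and as far as I know, tm's term extraction defaults to wordLengths = c(3, Inf), which drops terms shorter than three characters, so the last step applies that filter as a possible explanation for the missing "5k".

```r
terms <- tolower(c("5k", "50k", "7.5k", "75k", "10K", "100K"))

# Treating "." as a word break: "7.5k" becomes "7" and "5k".
split_terms <- unlist(strsplit(terms, ".", fixed = TRUE))

# Removing "." instead: "7.5k" becomes "75k" and merges with the real "75k".
stripped_terms <- gsub(".", "", terms, fixed = TRUE)

# Mimic a minimum term length of 3 characters (assumed tm default).
kept <- split_terms[nchar(split_terms) >= 3]

print(split_terms)
print(table(stripped_terms))
print(kept)
```

If the length filter is indeed the cause, lowering the minimum in the control list passed when building the TDM should bring "5k" back.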

Regular expressions that find all words that meet a condition

I'm trying to work through an exercise on regular expressions from the book R for Data Science.
There's one question I'm unable to solve:
Given the corpus of common words in stringr::words, create regular expressions that find all words that:
Start with “y”.
End with “x”.
Have seven letters or more.
Example:
sentence <- "I want to extract these - yandx,ynx and yrax,romanav "
# it would be helpful to find how to do these with stringr::str_view() function.
Also, please refer me to some good resources for learning regex in R.
This regex should work for you (assuming your words don't contain punctuation):
(?i)((?:^y[a-z]+x)|(?:^[a-z]{7}[a-z]*$))
You can also use Regex101 to validate your regex.
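Read as three separate conditions, the exercise can also be sketched with one short pattern each. The example below uses base R and a small made-up vector, common_words, in place of stringr::words.

```r
# Hypothetical stand-in for stringr::words.
common_words <- c("yes", "box", "yellow", "because", "tax", "absolute", "year")

# ^ anchors the start of the word, $ anchors the end.
starts_with_y <- common_words[grepl("^y", common_words)]
ends_with_x   <- common_words[grepl("x$", common_words)]

# .{7,} between the anchors means "at least seven characters".
seven_or_more <- common_words[grepl("^.{7,}$", common_words)]

print(starts_with_y)
print(ends_with_x)
print(seven_or_more)
```

The same three patterns should be usable with stringr::str_view() to highlight the matches, as asked in the question.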

Resources