I have a long text and I counted the letters in every single word of it; now I want to show the shortest and the longest words. I used:
words <- strsplit(text, " ")
nchar(words[[1]])
w <- factor(nchar(words[[1]]))
table(w)
and I got a table with the number of words of each length. Now, for example, I know that the longest word has 19 letters, but how can I find and show that one word from the whole text?
EDIT: and how do I show, for example, every 5-letter word?
Try which.max to find the longest word:
words[[1]][which.max(nchar(words[[1]]))]
If you want to find all 5-letter words, try the following:
words[[1]][nchar(words[[1]])==5]
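Putting the pieces together, a minimal self-contained sketch (the sample `text` is hypothetical):

```r
# Hypothetical sample text; substitute your own
text <- "The quick brown fox jumped over the lazy dog"
words <- strsplit(text, " ")[[1]]
lens <- nchar(words)

# which.max returns only the first maximum, so compare against max()
# to catch ties between equally long words
words[lens == max(lens)]   # "jumped"

# All 5-letter words
words[lens == 5]           # "quick" "brown"
```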
I have been following this example and was wondering if it is possible to draw the figure 4.4 for combinations of words that were within 10 words of the keyword instead of words that are right next to each other. For example, let's say I wanted to know which words were commonly within 10 words of "sir"?
Sorry, my company has disabled copying/pasting text on your website so I can't post the code.
I don't know about the 10-word window, but one option may be to calculate co-occurrences at the sentence level, for example with the udpipe::cooccurrence function.
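For the 10-word window itself, a hand-rolled base-R sketch may do; the `words_near` helper and the sample tokens below are my own invention, not from the original thread:

```r
# Tally the words occurring within `window` tokens of each hit of `keyword`
words_near <- function(tokens, keyword, window = 10) {
  hits <- which(tokens == keyword)
  idx <- unique(unlist(lapply(hits, function(i) {
    setdiff(max(1, i - window):min(length(tokens), i + window), i)
  })))
  sort(table(tokens[idx]), decreasing = TRUE)
}

tokens <- tolower(strsplit(
  "yes sir said the captain and the sir nodded slowly", " ")[[1]])
words_near(tokens, "sir", window = 3)
```

If I recall correctly, udpipe's cooccurrence also accepts a skipgram argument for exactly this kind of word window; check its documentation before relying on that.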
I read a text into R using the readChar() function. I recently discovered the stringr package, which helped me a great deal to do useful things with my text, such as counting the number of characters and the total number of occurrences of each letter in the entire text. Now I need to count the number of sentences in the built-in sentences vector that end with the word "day", "pay", or "way"; it should not count a sentence if its last word is not exactly one of them (e.g. "away"). Does R have a function that can help me do that?
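One way, sketched here with base R's grepl (stringr::str_detect would take the same pattern): the \\b word boundary is what keeps "away" from matching.

```r
# `sentences` ships with the datasets package, attached by default.
# The sentences end with a period, hence the optional \\.? before $
pattern <- "\\b(day|pay|way)\\.?$"
sum(grepl(pattern, sentences))

# "away" is rejected: the word boundary before "way" fails inside "away"
grepl(pattern, "He ran away.")   # FALSE
```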
I am trying to find short variations of texts in long text strings.
long.string = "A lot of irrelevant text that features some of the words from the relevant sentence, including decision, affirmed, and order. The result of this long decision process is affirmed, without any exceptions. It is order that the instructions be executed very slowly. Even more irrelevant text that features some of the words from the relevant sentence, including decision, affirmed, and order, as well as promptly."
variation = "result.*?affirmed.*?promptly"
The text of interest is the second and third sentences. The variation would be used in a grepl to tell me if that text is in the long string. However, I will still get a hit in this circumstance, even though the word "promptly" in the long string is outside of the two sentences of interest.
Assuming I want to look in a 2-3 sentence (ending with a period) radius, how do I construct my variation so that it does not return a hit when some words are outside of the radius?
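One sketch: replace each .*? with [^.]*, which cannot cross a period, and allow the match to span extra sentences with a counted (?:[^.]*\\.){0,n} group. The sample string below is abbreviated from the question:

```r
long.string <- paste(
  "Irrelevant text mentioning decision, affirmed, and order.",
  "The result of this long decision process is affirmed, without any exceptions.",
  "It is order that the instructions be executed very slowly.",
  "More irrelevant text mentioning decision and order, as well as promptly.")

# The original pattern crosses sentence boundaries freely: a false hit
grepl("result.*?affirmed.*?promptly", long.string)                       # TRUE

# [^.]* stays inside one sentence; {0,1} permits at most one extra
# sentence, so "promptly" two sentences away no longer matches
grepl("result[^.]*affirmed(?:[^.]*\\.){0,1}[^.]*promptly", long.string)  # FALSE
```

Raising the counter, e.g. {0,2}, widens the radius by one sentence at a time.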
I am using regex to search for two exact words in any order. I got the first part of the search working, but the second part is not.
REGEXP_SUBSTR('TWO WORDS ARE ONE', '(?:^|\W)WORDS(?:$|\W)') -- one-word search
how do I add one more word in the above search?
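To my knowledge, Oracle's REGEXP functions use POSIX-style patterns without lookaheads, so there the usual trick is to test both orders, e.g. (WORDS.*ONE|ONE.*WORDS). In a PCRE-capable engine such as R's grepl with perl = TRUE, lookaheads express "both words, in any order" directly; a sketch:

```r
# One zero-width lookahead per required word; order no longer matters
x <- c("TWO WORDS ARE ONE", "ONE OF THE WORDS", "JUST WORDS")
grepl("(?=.*\\bWORDS\\b)(?=.*\\bONE\\b)", x, perl = TRUE)
# TRUE TRUE FALSE
```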
I have one paragraph of text (a vector of words) and I would like to see if it is "part" of a long text (also a vector of words). However, I know that this paragraph does not appear in the text in its exact form, but with slight changes: a few words could be missing, the order could be slightly different, some words could be inserted as parenthetical elements, etc.
I am currently implementing solutions "by hand", such as checking whether most of the words of the paragraph are in the text, and looking at the distance between these words, their order, etc.
I was wondering, however, whether there is a built-in method to do that?
I have already checked the tm package, but it does not seem to do that...
Any idea?
I fear that you are stuck with hand-writing an approach, e.g. grep-ing for some word groups and applying some kind of matching threshold.
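A minimal sketch of that thresholded approach; the fuzzy_contains helper and the 0.8 default are my own invention, not a built-in:

```r
# Slide a paragraph-sized window over the text and score what fraction
# of the paragraph's words appears in each window
fuzzy_contains <- function(paragraph, text, threshold = 0.8) {
  p <- tolower(strsplit(paragraph, "\\s+")[[1]])
  t <- tolower(strsplit(text, "\\s+")[[1]])
  n <- length(p)
  if (length(t) < n) return(FALSE)
  scores <- vapply(seq_len(length(t) - n + 1), function(i) {
    mean(p %in% t[i:(i + n - 1)])   # word overlap, order ignored
  }, numeric(1))
  max(scores) >= threshold
}

# 3 of the 4 words survive in the best window, so the 0.75 score passes
# a 0.7 threshold but not the 0.8 default
fuzzy_contains("the quick brown fox", "yes the quick red fox jumped", 0.7)  # TRUE
fuzzy_contains("the quick brown fox", "yes the quick red fox jumped")       # FALSE
```

Base R's adist (edit distance) or the stringdist package could replace the overlap score if you need fuzzier, character-level matching.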