The corpus package supports custom stemming functions: given a term as input, the stemming function should return the stem of that term as output.
From Stemming Words I have taken the following example, which uses the hunspell dictionary to do the stemming.
First, I define the sentences on which to test this function:
sentences <- c("The color blue neutralizes orange yellow reflections.",
               "Zod stabbed me with blue Kryptonite.",
               "Because blue is your favourite colour.",
               "Red is wrong, blue is right.",
               "You and I are going to yellowstone.",
               "Van Gogh looked for some yellow at sunset.",
               "You ruined my beautiful green dress.",
               "You do not agree.",
               "There's nothing wrong with green.")
The custom stemming function is:
stem_hunspell <- function(term) {
  # look up the term in the dictionary
  stems <- hunspell::hunspell_stem(term)[[1]]
  if (length(stems) == 0) { # if there are no stems, use the original term
    stem <- term
  } else { # if there are multiple stems, use the last one
    stem <- stems[[length(stems)]]
  }
  stem
}
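Called on single terms, the function behaves like this (a quick check; it assumes the hunspell package with its default dictionary, and the outputs match the tokenized results shown below):
stem_hunspell("reflections")
#> [1] "reflection"
stem_hunspell("kryptonite")  # hunspell likely finds no stems here, so the term comes back unchanged
#> [1] "kryptonite"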
This code (text_tokens comes from the corpus package):
library(corpus)
sentences <- text_tokens(sentences, stemmer = stem_hunspell)
produces:
> sentences
[[1]]
[1] "the" "color" "blue" "neutralize" "orange" "yellow"
[7] "reflection" "."
[[2]]
[1] "zod" "stabbed" "me" "with" "blue" "kryptonite"
[7] "."
[[3]]
[1] "because" "blue" "i" "your" "favourite" "colour"
[7] "."
[[4]]
[1] "re" "i" "wrong" "," "blue" "i" "right" "."
[[5]]
[1] "you" "and" "i" "are" "go"
[6] "to" "yellowstone" "."
[[6]]
[1] "van" "gogh" "look" "for" "some" "yellow" "at" "sunset" "."
[[7]]
[1] "you" "ruin" "my" "beautiful" "green" "dress"
[7] "."
[[8]]
[1] "you" "do" "not" "agree" "."
[[9]]
[1] "there" "nothing" "wrong" "with" "green" "."
After stemming I would like to apply other operations to the text, e.g. removing stop words. However, when I applied the tm function:
removeWords(sentences, stopwords)
to my sentences, I obtained the following error:
Error in UseMethod("removeWords", x) :
no applicable method for 'removeWords' applied to an object of class "list"
If I use
unlist(sentences)
I don't get the desired result, because I end up with a character vector of 65 elements. The desired result should be (e.g. for the first sentence):
"the color blue neutralize orange yellow reflection."
If you want to remove stopwords from each sentence, you could use lapply:
library(tm)
lapply(sentences, removeWords, stopwords())
#[[1]]
#[1] "" "color" "blue" "neutralize" "orange" "yellow" "reflection" "."
#[[2]]
#[1] "zod" "stabbed" "" "" "blue" "kryptonite" "."
#...
#...
However, from your expected output it looks like you want to paste the text together.
lapply(sentences, paste0, collapse = " ")
#[[1]]
#[1] "the color blue neutralize orange yellow reflection ."
#[[2]]
#[1] "zod stabbed me with blue kryptonite ."
#....
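If you want both steps at once, removing the stopwords and collapsing each sentence back into a single string, one possible combination is the following (a sketch; it also drops the empty strings that removeWords leaves behind):
library(tm)
cleaned <- sapply(sentences, function(s) {
  s <- removeWords(s, stopwords())
  paste(s[s != ""], collapse = " ")  # drop the emptied tokens, then collapse
}, USE.NAMES = FALSE)
cleaned[1]
#[1] "color blue neutralize orange yellow reflection ."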
We can use map
library(tm)
library(purrr)
map(sentences, removeWords, stopwords())
#[[1]]
#[1] "" "color" "blue" "neutralize" "orange" "yellow" "reflection"
#[8] "."
#[[2]]
#[1] "zod" "stabbed" "" "" "blue" "kryptonite" "."
Let's say I have a string and I only want the unique words in the sentence as separate elements:
a = "an apple is an apple"
word <- function(a){
  words <- c(strsplit(a, split = " "))
  return(unique(words))
}
word(a)
This returns
[[1]]
[1] "an" "apple" "is" "an" "apple"
and the output I'm expecting is
'an', 'apple', 'is'
What am I doing wrong? I'd really appreciate any help.
Cheers!
The problem is that wrapping strsplit(.) in c(.) does not change the fact that it is still a list, and unique will be operating at the list-level, not the word-level.
c(strsplit(rep(a, 2), "\\s+"))
# [[1]]
# [1] "an" "apple" "is" "an" "apple"
# [[2]]
# [1] "an" "apple" "is" "an" "apple"
unique(c(strsplit(rep(a, 2), "\\s+")))
# [[1]]
# [1] "an" "apple" "is" "an" "apple"
Alternatives:
If length(a) is always 1, then perhaps
unique(strsplit(a, "\\s+")[[1]])
# [1] "an" "apple" "is"
If length(a) can be 2 or more and you want a list of unique words for each sentence, then
a2 <- c("an apple is an apple", "a pear is a pear", "an orange is an orange")
lapply(strsplit(a2, "\\s+"), unique)
# [[1]]
# [1] "an" "apple" "is"
# [[2]]
# [1] "a" "pear" "is"
# [[3]]
# [1] "an" "orange" "is"
(Note: this always returns a list, regardless of the number of sentences in the input.)
If length(a) can be 2 or more and you want the unique words across all sentences, then
unique(unlist(strsplit(a2, "\\s+")))
# [1] "an" "apple" "is" "a" "pear" "orange"
(Note: this method also works well when length(a) is 1.)
Another possible solution, based on stringr::str_split:
library(tidyverse)
a %>% str_split("\\s+") %>% unlist %>% unique
#> [1] "an" "apple" "is"
You could try
unique(unlist(strsplit(a, " ")))
I have sentences from spoken conversation and would like to identify the words that are repeated from sentence to sentence; here's some illustrative data (in reproducible format below).
df
# A tibble: 10 x 1
Orthographic
<chr>
1 "like I don't understand sorry like how old's your mom"
2 "eh sixty-one"
3 "yeah (...) yeah yeah like I mean she's not like in the risk age group but still"
4 "yeah"
5 "HH"
6 "I don't know"
7 "yeah I talked to my grandparents last night and last time I talked to them it was like two weeks…
8 "yeah"
9 "she said you should come home probably "
10 "no and like why would you go to the airport where people have corona sit in the plane where peop…
I have had some success extracting the repeated words using a for loop, but I also get some strange results. Here's what I've been doing so far:
# initialize pattern and new column `rept` in `df`:
pattern1 <- c()
df$rept <- NA
# for loop:
for (i in 2:nrow(df)) {
  pattern1[i-1] <- paste0("\\b(", paste0(unlist(str_split(df$Orthographic[i-1], " ")), collapse = "|"), ")\\b")
  df$rept[i] <- str_extract_all(df$Orthographic[i], pattern1[i-1])
}
The results are these; result #10 is strange/incorrect, as it should be character(0). How can the code be improved so that no such strange results are obtained?
df$rept
[[1]]
[1] NA
[[2]]
character(0)
[[3]]
character(0)
[[4]]
[1] "yeah"
[[5]]
character(0)
[[6]]
character(0)
[[7]]
[1] "I" "I" "don't" "I" "I" "don't" "I"
[[8]]
[1] "yeah"
[[9]]
character(0)
[[10]]
[1] "" "" "" "" "" "" "" "" "" "" "you" "" "" "" "" ""
[17] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[33] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[49] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[65] "" "" "" "" "" "" "" "" "" "" "" ""
Reproducible data:
structure(list(Orthographic = c("like I don't understand sorry like how old's your mom",
"eh sixty-one", "yeah (...) yeah yeah like I mean she's not like in the risk age group but still",
"yeah", "HH", "I don't know", "yeah I talked to my grandparents last night and last time I talked to them it was like two weeks ago and they at that time they were already like maybe you should just get on a plane and come home and like you can't just be here and and then last night they were like are you sure you don't wanna come home and I was I don't think I can and my mom said the same thing",
"yeah", "she said you should come home probably ", "no and like why would you go to the airport where people have corona sit in the plane where people have corona to get there where people have corona and then go and take it to your family"
)), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"
))
When debugging regex issues like this, involving dynamic patterns with word boundaries, there are several things to keep in mind (so as to understand how best to approach the whole issue).
First, check the patterns your original loop builds.
Here is the list of regexps:
[1] "\\b(like|I|don't|understand|sorry|like|how|old's|your|mom)\\b"
[1] "\\b(eh|sixty-one)\\b"
[1] "\\b(yeah|(...)|yeah|yeah|like|I|mean|she's|not|like|in|the|risk|age|group|but|still)\\b"
[1] "\\b(yeah)\\b"
[1] "\\b(HH)\\b"
[1] "\\b(I|don't|know)\\b"
[1] "\\b(yeah|I|talked|to|my|grandparents|last|night|and|last|time|I|talked|to|them|it|was|like|two|weeks|ago|and|they|at|that|time|they|were|already|like|maybe|you|should|just|get|on|a|plane|and|come|home|and|like|you|can't|just|be|here|and|and|then|last|night|they|were|like|are|you|sure|you|don't|wanna|come|home|and|I|was|I|don't|think|I|can|and|my|mom|said|the|same|thing)\\b"
[1] "\\b(yeah)\\b"
[1] "\\b(she|said|you|should|come|home|probably|)\\b"
Look at the second pattern: \b(eh|sixty-one)\b. What if the first word was sixty? A \b(sixty|sixty-one)\b regex would never match sixty-one, because sixty would match first and the other alternative would not even be considered. When you use word boundaries and know that alternatives can contain more than one word, or that one alternative can be a prefix of another, you need to sort the alternatives by length in descending order to ensure the longest alternative always matches first. Here, you do not need to sort the alternatives because you only have single-word alternatives; a quick sketch of the sorting is shown after this paragraph.
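For reference, a minimal way to do that sorting (a sketch, using hypothetical alternatives where one is a prefix of the other):
alts <- c("sixty", "sixty-one")  # hypothetical alternatives
alts <- alts[order(nchar(alts), decreasing = TRUE)]  # longest first
paste0("\\b(", paste(alts, collapse = "|"), ")\\b")
#> [1] "\\b(sixty-one|sixty)\\b"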
See the next pattern, which contains a |(...)| alternative. It matches any three chars other than line break chars and captures them into a group. However, the string contained a literal (...) substring, where the parentheses and dots are literal chars. To match them with a regex, you need to escape all special chars.
Next, you consider "words" to be non-whitespace chunks of chars, because you split with str_split(df$Orthographic[i-1], " "). This invalidates the \b approach altogether; you need whitespace boundaries instead, (?<!\S) at the start and (?!\S) at the end, in place of the \bs. Moreover, since you only split on a single space, you may get empty alternatives if there are two or more consecutive spaces in the input string; use the \s+ pattern to split on one or more whitespace chars.
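To see the difference (a minimal check using the don't token from the data: \b treats the apostrophe as a boundary, while the whitespace lookarounds do not):
regmatches("don't", gregexpr("\\bdon\\b", "don't", perl = TRUE))
#> [[1]]
#> [1] "don"
regmatches("don't", gregexpr("(?<!\\S)don(?!\\S)", "don't", perl = TRUE))
#> [[1]]
#> character(0)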
Next, there is a trailing space in the second-to-last string, and it creates an empty alternative. You need to trimws your input before splitting it into tokens/words.
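For example, trimming before splitting avoids the empty token (a quick check on the string from the data):
unlist(strsplit(trimws("she said you should come home probably "), "\\s+"))
#> [1] "she"      "said"     "you"      "should"   "come"     "home"     "probably"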
This is what you need to do with the regex solution: add the escape.for.regex function:
## Escape special regex metacharacters
escape.for.regex <- function(string) {
  gsub("([][{}()+*^${|\\\\?.])", "\\\\\\1", string)
}
and then use it to escape the tokens obtained by splitting the trimmed df$Orthographic[i-1] with the \s+ regex, apply unique to remove duplicates (making the pattern shorter and more efficient), and add the whitespace boundaries:
for (i in 2:nrow(df)) {
  pattern1[i-1] <- paste0("(?<!\\S)(?:", paste0(escape.for.regex(unique(unlist(str_split(trimws(df$Orthographic[i-1]), "\\s+")))), collapse = "|"), ")(?!\\S)")
  df$rept[i] <- str_extract_all(df$Orthographic[i], pattern1[i-1])
}
See the list of regexps:
[1] "(?<!\\S)(?:like|I|don't|understand|sorry|how|old's|your|mom)(?!\\S)"
[1] "(?<!\\S)(?:eh|sixty-one)(?!\\S)"
[1] "(?<!\\S)(?:yeah|\\(\\.\\.\\.\\)|like|I|mean|she's|not|in|the|risk|age|group|but|still)(?!\\S)"
[1] "(?<!\\S)(?:yeah)(?!\\S)"
[1] "(?<!\\S)(?:HH)(?!\\S)"
[1] "(?<!\\S)(?:I|don't|know)(?!\\S)"
[1] "(?<!\\S)(?:yeah|I|talked|to|my|grandparents|last|night|and|time|them|it|was|like|two|weeks|ago|they|at|that|were|already|maybe|you|should|just|get|on|a|plane|come|home|can't|be|here|then|are|sure|don't|wanna|think|can|mom|said|the|same|thing)(?!\\S)"
[1] "(?<!\\S)(?:yeah)(?!\\S)"
[1] "(?<!\\S)(?:she|said|you|should|come|home|probably)(?!\\S)"
Output:
> df$rept
[[1]]
NULL
[[2]]
character(0)
[[3]]
character(0)
[[4]]
[1] "yeah"
[[5]]
character(0)
[[6]]
character(0)
[[7]]
[1] "I" "I" "don't" "I" "I" "don't" "I"
[[8]]
[1] "yeah"
[[9]]
character(0)
[[10]]
[1] "you"
Depending on whether it is sufficient to identify the repeated words, or you also need their repeat frequencies, you might want to modify the function, but here is one approach using the dplyr::lead function:
library(stringr)
library(dplyr)
# general function that identifies intersecting words from multiple strings
getRpt <- function(...){
  l <- lapply(list(...), function(x) unlist(unique(
    str_split(as.character(x), pattern = boundary(type = "word")))))
  Reduce(intersect, l)
}
df$rept <- mapply(getRpt, df$Orthographic, lead(df$Orthographic), USE.NAMES = FALSE)
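Note that lead pairs each utterance with the one that follows it. If you instead want each row to list the words shared with the previous utterance (as in the for loop above), dplyr::lag could be swapped in (a sketch under that assumption):
df$rept_prev <- mapply(getRpt, df$Orthographic, lag(df$Orthographic), USE.NAMES = FALSE)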
R Package: stringr::words
I want to know the number of words that are exactly three letters long in the stringr::words file after applying the following regular expression:
x <- str_view(words, "^...$", match = TRUE)
While the code was able to extract words that are exactly three letters long, it does not tell me how many words there are. So, I thought the length function will be appropriate to find the number.
length(x)
The code returns 8, which cannot be right, as x clearly contains more than 8 words.
What is the proper syntax to calculate the number of words after matching with the regular expression, in this case, x?
Also, can anyone explain to me why length(x) returns 8 in the above example?
Thank you in advance.
str_view returns an HTML object which is used for viewing.
x <- str_view(words, "^...$", match = TRUE)
class(x)
#[1] "str_view" "htmlwidget"
The 8 components that you see are
names(x)
#[1] "x" "width" "height" "sizingPolicy" "dependencies"
#[6] "elementId" "preRenderHook" "jsHooks"
Instead of str_view, use str_subset:
library(stringr)
x <- str_subset(words, "^...$")
x
# [1] "act" "add" "age" "ago" "air" "all" "and" "any" "arm" "art" "ask" "bad" "bag"
# [14] "bar" "bed" "bet" "big" "bit" "box" "boy" "bus" "but" "buy" "can" "car" "cat"
# [27] "cup" "cut" "dad" "day" "die" "dog" "dry" "due" "eat" "egg" "end" "eye" "far"
# [40] "few" "fit" "fly" "for" "fun" "gas" "get" "god" "guy" "hit" "hot" "how" "job"
# [53] "key" "kid" "lad" "law" "lay" "leg" "let" "lie" "lot" "low" "man" "may" "mrs"
# [66] "new" "non" "not" "now" "odd" "off" "old" "one" "out" "own" "pay" "per" "put"
# [79] "red" "rid" "run" "say" "see" "set" "sex" "she" "sir" "sit" "six" "son" "sun"
# [92] "tax" "tea" "ten" "the" "tie" "too" "top" "try" "two" "use" "war" "way" "wee"
#[105] "who" "why" "win" "yes" "yet" "you"
length(x)
#[1] 110
Another option is str_count:
library(stringr)
sum(str_count(x, "^...$"))
[1] 3
Data:
x <- c("abc", "abcd", "ab", "abc", "abcsd", "edf")
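Applied to stringr::words itself (the dataset from the question), the same idea returns the count directly:
library(stringr)
sum(str_count(words, "^...$"))
# [1] 110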
I'd suggest using grep with length:
length(grep("^.{3}$", words))
# => [1] 110
With grep, you get the indices of the matching words (or the matching words themselves with value = TRUE), and length returns the count of the matches.
stringr::str_view renders an HTML view of the regular expression matches; it does not actually return the matches themselves. Besides grep, you may use stringr::str_subset.
I have strings like these:
turns <- c("does him good to stir him up now and again .",
"when , when I see him he w's on the settees .",
"yes it 's been eery for a long time .",
"blissful timing , indeed it was ")
What I'm trying to do is extract those words that have exactly one vowel. I do get the correct result with this:
library(stringr)
str_extract_all(turns, "\\b[b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*\\b")
[[1]]
[1] "him" "to" "stir" "him" "up" "now" "and"
[[2]]
[1] "when" "when" "i" "him" "he" "on" "the"
[[3]]
[1] "yes" "it" "for" "a" "long"
[[4]]
[1] "it" "was"
However, it feels cumbersome to define a consonant class. Is there a more elegant and more concise way?
We can use str_count on the words after splitting the 'turns' at the spaces
library(stringr)
lapply(strsplit(turns, "\\s+"), function(x) x[str_count(x, '[aeiou]') == 1])
Output:
#[[1]]
#[1] "him" "to" "stir" "him" "up" "now" "and"
#[[2]]
#[1] "when" "when" "him" "he" "on" "the"
#[[3]]
#[1] "yes" "it" "for" "a" "long"
#[[4]]
#[1] "it" "was"
You can use a PCRE regex with character classes containing double negation:
turns <- c("does him good to stir him up now and again .",
"when , when I see him he w's on the settees .",
"yes it 's been eery for a long time .",
"blissful timing , indeed it was ")
rx <- "\\b[^[:^alpha:]aeiou]*[aeiou][^[:^alpha:]aeiou]*\\b"
regmatches(turns, gregexpr(rx, turns, perl=TRUE, ignore.case=TRUE))
See the R demo online. The result is as in the question.
See the regex demo. Details:
\b - word boundary
[^[:^alpha:]aeiou]* - zero or more letters other than a, e, i, o, u (the class negates both non-letters and vowels, so it matches consonants)
[aeiou] - a vowel
[^[:^alpha:]aeiou]* - zero or more letters other than a, e, i, o, u
\b - word boundary.
An equivalent expression:
(?i)\b[^\P{L}aeiou]*[aeiou][^\P{L}aeiou]*\b
See this regex demo. \P{L} matches any char but a letter. (?i) is equivalent of ignore.case=TRUE.
Here is a base R option using strsplit + nchar + gsub
lapply(
  strsplit(turns, "\\s"),
  function(v) v[nchar(gsub("[^aeiou]", "", v)) == 1]
)
which gives
[[1]]
[1] "him" "to" "stir" "him" "up" "now" "and"
[[2]]
[1] "when" "when" "him" "he" "on" "the"
[[3]]
[1] "yes" "it" "for" "a" "long"
[[4]]
[1] "it" "was"
I created lists for each word to extract words from sentences, for example like this:
hello <- NULL
for (i in 1:length(text)){
  hello[i] <- as.character(regmatches(text[i], gregexpr("[H|h]ello?", text[i])))
}
But I have more than 25 word lists to extract, which makes for very long code. Is it possible to extract a group of characters (words) from text data?
Below is just a pseudo set.
words<-c("[H|h]ello","you","so","tea","egg")
text=c("Hello! How's you and how did saturday go?",
"hello, I was just texting to see if you'd decided to do anything later",
"U dun say so early.",
"WINNER!! As a valued network customer you have been selected" ,
"Lol you're always so convincing.",
"Did you catch the bus ? Are you frying an egg ? ",
"Did you make a tea and egg?"
)
subsets <- NULL
for (i in 1:length(text)){
  .....???
}
Expected output as below
[1] Hello you
[2] hello you
[3] you
[4] you so
[5] you you egg
[6] you tea egg
In base R, you could do:
regmatches(text,gregexpr(sprintf("\\b(%s)\\b",paste0(words,collapse = "|")),text))
[[1]]
[1] "Hello" "you"
[[2]]
[1] "hello" "you"
[[3]]
[1] "so"
[[4]]
[1] "you"
[[5]]
[1] "you" "so"
[[6]]
[1] "you" "you" "egg"
[[7]]
[1] "you" "tea" "egg"
Or, depending on how you want the results:
trimws(gsub(sprintf(".*?\\b(%s).*?|.*$",paste0(words,collapse = "|")),"\\1 ",text))
[1] "Hello you" "hello you" "so" "you" "you so" "you you egg"
[7] "you tea egg"
You say that you have a long list of word-sets. Here's a way to turn each wordset into a regex, apply it to a corpus (a list of sentences) and pull out the hits as character-vectors. It's case-insensitive, and it checks for word boundaries, so you don't pull age out of agent or rage.
wordsets <- c(
"oak dogs cheese age",
"fire open jail",
"act speed three product"
)
library(tidyverse)
harvSent <- read_table("SENTENCE
Oak is strong and also gives shade.
Cats and dogs each hate the other.
The pipe began to rust while new.
Open the crate but don't break the glass.
Add the sum to the product of these three.
Thieves who rob friends deserve jail.
The ripe taste of cheese improves with age.
Act on these orders with great speed.
The hog crawled under the high fence.
Move the vat over the hot fire.") %>%
pull(SENTENCE)
aWset builds the regexes from the wordsets and applies them to the sentences:
aWset <- function(harvSent, wordsets){
  # Turn out a vector of regexes like "(?ix) \\b (oak|dogs|cheese) \\b"
  regexS <- paste0("(?ix) \\b (",
                   str_replace_all(wordsets, " ", "|"),
                   ") \\b")
  # Apply each regex to the sentences
  map(regexS,
      ~ str_extract_all(harvSent, .x, simplify = TRUE) %>%
        # str_extract_all returns a character matrix of hits; paste it together by row
        apply(MARGIN = 1,
              FUN = function(x) str_trim(paste(x, collapse = " "))))
}
Giving us
aWset(harvSent , wordsets)
[[1]]
[1] "Oak" "dogs" "" "" "" "" "cheese age" ""
[9] "" ""
[[2]]
[1] "" "" "" "Open" "" "jail" "" "" "" "fire"
[[3]]
[1] "" "" "" "" "product three" "" ""