Get the word before exclamation mark in R tidyverse

I'm wondering how to get the words that occur before an exclamation mark. I have a data frame with a different string in each row. I have tried the following:
text %>%
str_match("!",lines)
I don't really get what I want and I'm a bit lost. Does anyone have advice?

You can str_extract_all the words before the ! using lookahead:
Data:
text <- c("Hello!", "This a test sentence", "That's another test sentence, yes! It is!", "And that's one more")
Solution:
library(stringr)
unlist(str_extract_all(text, "\\b\\w+\\b(?=!)"))
[1] "Hello" "yes" "is"
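To see what the lookahead contributes, here is a minimal base R sketch (the variable names with_mark and without_mark are just for illustration). It compares a pattern that consumes the ! with one that only looks ahead for it:

```r
text <- c("Hello!", "This a test sentence",
          "That's another test sentence, yes! It is!", "And that's one more")

# Without a lookahead the "!" is part of each match
with_mark <- unlist(regmatches(text, gregexpr("\\w+!", text, perl = TRUE)))

# With (?=!) the "!" must follow the word but is not included in the match
without_mark <- unlist(regmatches(text, gregexpr("\\w+(?=!)", text, perl = TRUE)))

with_mark     # "Hello!" "yes!" "is!"
without_mark  # "Hello" "yes" "is"
```

Lookaheads need perl = TRUE in base R; stringr supports them out of the box.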
If you seek a dplyr solution:
library(dplyr)
data.frame(text) %>%
  mutate(Word_before_excl = str_extract_all(text, "\\b\\w+\\b(?=!)"))
                                       text Word_before_excl
1                                    Hello!            Hello
2                      This a test sentence
3 That's another test sentence, yes! It is!          yes, is
4                       And that's one more

Maybe we can use regmatches
> sapply(regmatches(text, gregexpr("\\b\\w+\\b(?=!)", text, perl = TRUE)), toString)
[1] "Hello" "" "yes, is" ""

You could also use strsplit, provided the string consists only of words that are each followed by !:
> unlist(strsplit("Dog!Cat!", "!"))
[1] "Dog" "Cat"

Related

remove part of a word after apostrophe

I have
df<-c("That's","you're", "'am")
and I would like to remove the part of a word after and including the apostrophe which should return
c("That", "you", "")
A tidyverse solution, or one usable within a |> pipe, is preferable.
Replace ' and whatever follows it, using str_replace in stringr.
library(stringr)
str_replace(df, "'.*", "")
#[1] "That" "you" ""
Using R base sub
> sub("'.*", "", df)
[1] "That" "you" ""
Your example data only has one word per string. If you also need it to work for strings containing multiple words then use:
gsub("'\\w*\\b","",df)
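A quick sketch of the difference on a multi-word string (the example sentence s is made up for illustration; perl = TRUE is added here only so that \\b behaves predictably):

```r
s <- "That's what you're saying"

# sub() with "'.*" drops everything from the first apostrophe onwards
first_only <- sub("'.*", "", s)

# gsub() with "'\\w*\\b" drops only the apostrophe and the rest of that word,
# and does so for every word in the string
each_word <- gsub("'\\w*\\b", "", s, perl = TRUE)

first_only  # "That"
each_word   # "That what you saying"
```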
Using trimws in base R
trimws(df, whitespace = "'.*")
[1] "That" "you" ""

Regex in R to match a group of words and split by space

Split the text by space wherever none of the word groups matches.
If a group of words is matched, keep it as it is.
text <- c('considerate and helpful','not bad at all','this is helpful')
pattern <- c('considerate and helpful','not bad')
Output :
considerate and helpful, not bad, at, all, this, is, helpful
Thank you for the help!
Of course, just put the phrases in front of \w+ in the alternation, so they are tried before the single-word fallback:
library("stringr")
text <- c('considerate and helpful','not bad at all','this is helpful')
parts <- str_extract_all(text, "considerate and helpful|not bad|\\w+")
parts
Which yields
[[1]]
[1] "considerate and helpful"
[[2]]
[1] "not bad" "at" "all"
[[3]]
[1] "this" "is" "helpful"
It does not split on whitespaces but rather extracts the "words".
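If the phrases live in a vector, as in the question, the alternation can probably be built rather than typed out. A base R sketch, assuming the phrases contain no regex metacharacters (they would need escaping otherwise); the names regex and parts are just for illustration:

```r
text <- c('considerate and helpful', 'not bad at all', 'this is helpful')
pattern <- c('considerate and helpful', 'not bad')

# Put longer phrases first so a long phrase wins over a shorter one
# that happens to be its prefix
pattern <- pattern[order(nchar(pattern), decreasing = TRUE)]

# Phrases first, single words (\\w+) as the fallback alternative
regex <- paste(c(pattern, "\\w+"), collapse = "|")

# perl = TRUE makes the alternatives be tried strictly left to right
parts <- regmatches(text, gregexpr(regex, text, perl = TRUE))
toString(unlist(parts))
# "considerate and helpful, not bad, at, all, this, is, helpful"
```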

How to get the text between two words in R?

I am trying to get the text between two words in a sentence.
For example the sentence is -
x <- "This is my first sentence"
Now I want the text between This and first, which is "is my".
I have tried various R functions such as grep, grepl, pmatch and str_split. However, I could not get exactly what I want.
This is the closest I have come, using gsub.
gsub(".*This\\s*|first*", "", x)
The output it gives is
[1] "is my sentence"
In reality, what I need is only
[1] "is my"
Any help would be appreciated.
You need .* at the end to match zero or more characters after the 'first'
gsub('^.*This\\s*|\\s*first.*$', '', x)
#[1] "is my"
Another approach using rm_between from the qdapRegex package.
library(qdapRegex)
rm_between(x, 'This', 'first', extract=TRUE)[[1]]
# [1] "is my"
Since this question is used as a reference, I'll add some possible solutions to build a complete overview. Both are based on a look-ahead/look-behind regex pattern.
base R
regmatches( x, gregexpr("(?<=This ).*(?= first)", x, perl = TRUE ) )
stringr
stringr::str_extract_all( x, "(?<=This ).+(?= first)" )
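One thing to watch with the look-around patterns: .* is greedy, so if the end word occurs more than once the match runs up to its last occurrence. A small sketch of the difference, where x2 is a made-up example and between() is a hypothetical helper, not a function from any package:

```r
x  <- "This is my first sentence"
x2 <- "This is my first and not my first sentence"

# Hypothetical helper: text between `left` and `right`, up to the FIRST
# occurrence of `right` (lazy .*?). Assumes both are plain words.
between <- function(x, left, right) {
  rx <- paste0("(?<=", left, " ).*?(?= ", right, ")")
  regmatches(x, regexpr(rx, x, perl = TRUE))
}

greedy <- regmatches(x2, regexpr("(?<=This ).*(?= first)", x2, perl = TRUE))

between(x, "This", "first")   # "is my"
between(x2, "This", "first")  # "is my"
greedy                        # "is my first and not my"
```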

combine any word which comes after specific word

I want to join a specific word with the word that follows it. I have tried a bigram approach, which is too slow, and also gregexpr, but didn't find a good solution. For example:
text="This approach isnt good enough."
library(RWeka)  # provides NGramTokenizer() and Weka_control()
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
BigramTokenizer(text)
[1] "This approach" "approach isnt" "isnt good" "good enough"
What I really want is isnt_good as a single word in the text, i.e. to join isnt with the word that follows it.
text
"This approach isnt_good enough."
Is there an efficient way to do this, so that isnt_good becomes a single unigram? Thanks.
To extract all occurrences of the word "isnt" and the following word you can do this:
library(stringr)
pattern <- "isnt \\w+"
str_extract_all(text, pattern)
[[1]]
[1] "isnt good"
It essentially does the same thing as the example below (from the base package) but I find the stringr solution more elegant and readable.
> regmatches(text, regexpr(pattern, text))
[1] "isnt good"
Update
To replace the occurrences of isnt x with isnt_x you just need gsub of the base package.
gsub("isnt (\\w+)", "isnt_\\1", text)
[1] "This approach isnt_good enough."
What you do is to use a capturing group that copies whatever is found inside the parentheses to the \\1. See this page for a good introduction: http://www.regular-expressions.info/brackets.html
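The same idea should extend to several trigger words by capturing the trigger as well. A sketch, where the word list (isnt, not) and the sentence text2 are only assumed examples:

```r
text2 <- "This approach isnt good and not fast."

# \\1 is the trigger word, \\2 the word that follows it;
# \\b keeps "not" from matching inside a longer word such as "knot"
joined <- gsub("\\b(isnt|not) (\\w+)", "\\1_\\2", text2, perl = TRUE)

joined  # "This approach isnt_good and not_fast."
```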
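How about this function?
joinWords <- function(string, word){
  # Split the string at "<word> " and glue the two halves back with "<word>_".
  # Note: handles a single occurrence only; text after a second occurrence is
  # dropped, and the result ends in "_NA" if `word` is absent.
  y <- paste0(word, " ")
  x <- unlist(strsplit(string, y))
  paste0(x[1], word, "_", x[2])
}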
> text <- "This approach isnt good enough."
> joinWords(text, "isnt")
# [1] "This approach isnt_good enough."
> joinWords("This approach might work for you", "might")
# [1] "This approach might_work for you"

Extract last word in string in R

What's the most elegant way to extract the last word in a sentence string?
The sentence does not end with a "."
Words are separated by blanks.
sentence <- "The quick brown fox"
TheFunction(sentence)
should return: "fox"
I do not want to use a package if a simple solution is possible.
If a simple solution based on package exists, that is also fine.
Just for completeness: The library stringr contains a function for exactly this problem.
library(stringr)
sentence <- "The quick brown fox"
word(sentence,-1)
[1] "fox"
tail(strsplit('this is a sentence',split=" ")[[1]],1)
Basically as suggested by @Señor O.
x <- 'The quick brown fox'
sub('^.* ([[:alnum:]]+)$', '\\1', x)
That will catch the last run of letters and digits before the end of the string.
You can also use the regexec and regmatches functions, but I find sub cleaner:
m <- regexec('^.* ([[:alnum:]]+)$', x)
regmatches(x, m)
See ?regex and ?sub for more info.
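If you want the base R route wrapped up as a function that also works on a vector of sentences, something like this sketch might do (last_word is a hypothetical name; it assumes words are separated by single blanks, as in the question):

```r
# Hypothetical helper: last blank-separated word of each sentence
last_word <- function(s) {
  words <- strsplit(s, " ", fixed = TRUE)
  # Take the final piece of each split; NA for an empty sentence
  vapply(words,
         function(w) if (length(w)) w[length(w)] else NA_character_,
         character(1))
}

last_word("The quick brown fox")                           # "fox"
last_word(c("The quick brown fox", "this is a sentence"))  # "fox" "sentence"
```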
Another packaged option is stri_extract_last_words() from the stringi package
library(stringi)
stri_extract_last_words("The quick brown fox")
# [1] "fox"
The function also removes any punctuation that may be at the end of the sentence.
stri_extract_last_words("The quick brown fox? ...")
# [1] "fox"
Going in the package direction, this is the simplest answer I can think of:
library(stringr)
x <- 'The quick brown fox'
str_extract(x, '\\w+$')
#[1] "fox"
