I'm wondering how to get the words that occur before an exclamation mark. I have a data frame with a different string on each row. I have tried the following:
text %>%
str_match("!",lines)
I don't really get what I want and I'm a bit lost. Does anyone have advice?
You can use str_extract_all to pull out the words before the ! using a lookahead:
Data:
text <- c("Hello!", "This a test sentence", "That's another test sentence, yes! It is!", "And that's one more")
Solution:
library(stringr)
unlist(str_extract_all(text, "\\b\\w+\\b(?=!)"))
[1] "Hello" "yes" "is"
If you seek a dplyr solution:
library(dplyr)

data.frame(text) %>%
  mutate(Word_before_excl = str_extract_all(text, "\\b\\w+\\b(?=!)"))
text Word_before_excl
1 Hello! Hello
2 This a test sentence
3 That's another test sentence, yes! It is! yes, is
4 And that's one more
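Note that str_extract_all() gives you a list column here, since a row can contain zero or more matches. If you prefer a plain character column, one option (a sketch of my own, not part of the original answer) is to collapse each element with toString():
data.frame(text) %>%
  mutate(Word_before_excl = sapply(str_extract_all(text, "\\b\\w+\\b(?=!)"), toString))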
Maybe we can use regmatches
> sapply(regmatches(text, gregexpr("\\b\\w+\\b(?=!)", text, perl = TRUE)), toString)
[1] "Hello" "" "yes, is" ""
You could also use strsplit:
> unlist(strsplit("Dog!Cat!", "!"))
[1] "Dog" "Cat"
I have
df<-c("That's","you're", "'am")
and I would like to remove the part of each word after and including the apostrophe, which should return
c("That", "you", "")
A tidyverse solution, or one usable within a |> pipe structure, is preferred.
Replace ' and whatever follows it, using str_replace in stringr.
library(stringr)
str_replace(df, "'.*", "")
#[1] "That" "you" ""
Using sub in base R
> sub("'.*", "", df)
[1] "That" "you" ""
Your example data only has one word per string. If you also need it to work for strings containing multiple words then use:
gsub("'\\w*\\b","",df)
Using trimws in base R
trimws(df, whitespace = "'.*")
[1] "That" "you" ""
I want to split the text on spaces, except where a group of words is matched.
If a group of words is matched, it should be kept as it is.
text <- c('considerate and helpful','not bad at all','this is helpful')
pattern <- c('considerate and helpful','not bad')
Desired output:
considerate and helpful, not bad, at, all, this, is, helpful
Thank you for the help!
Of course, just put the word groups in front of \\w+ in the alternation:
library("stringr")
text <- c('considerate and helpful','not bad at all','this is helpful')
parts <- str_extract_all(text, "considerate and helpful|not bad|\\w+")
parts
Which yields
[[1]]
[1] "considerate and helpful"
[[2]]
[1] "not bad" "at" "all"
[[3]]
[1] "this" "is" "helpful"
It does not split on whitespace but rather extracts the "words".
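If you want the single comma-separated result shown in the question, a small addition of mine is to flatten the list and collapse it:
toString(unlist(parts))
[1] "considerate and helpful, not bad, at, all, this, is, helpful"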
I am trying to get the text between two words in a sentence.
For example the sentence is -
x <- "This is my first sentence"
Now I want the text between This and first, which is "is my".
I have tried various R functions such as grep, grepl, pmatch, and str_split. However, I could not get exactly what I want.
This is the closest I have gotten, using gsub.
gsub(".*This\\s*|first*", "", x)
The output it gives is
[1] "is my sentence"
In reality, what I need is only
[1] "is my"
Any help would be appreciated.
You need .* at the end to match zero or more characters after 'first':
gsub('^.*This\\s*|\\s*first.*$', '', x)
#[1] "is my"
Another approach using rm_between from the qdapRegex package.
library(qdapRegex)
rm_between(x, 'This', 'first', extract=TRUE)[[1]]
# [1] "is my"
Since this question is used as a reference, I'll add some possible solutions to build a complete overview. Both are based on a look-ahead/look-behind regex pattern.
base R
regmatches(x, gregexpr("(?<=This ).*(?= first)", x, perl = TRUE))
stringr
stringr::str_extract_all(x, "(?<=This ).+(?= first)")
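Both patterns are vectorised, so they also work across several sentences at once; here is a quick check on a second, made-up sentence (my own example):
x2 <- c("This is my first sentence", "This was their first attempt")
stringr::str_extract(x2, "(?<=This ).+(?= first)")
[1] "is my"     "was their"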
I want to combine a specific word with the word that comes after it. I have tried a bigram approach, which is too slow, and also tried gregexpr, but I didn't get a good solution. For example:
text="This approach isnt good enough."
library(RWeka)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
BigramTokenizer(text)
[1] "This approach" "approach isnt" "isnt good" "good enough"
What I really want is isnt_good as a single word in the text, i.e. to combine isnt with the word that comes after it.
text
"This approach isnt_good enough."
Is there an efficient approach to convert this into a unigram? Thanks.
To extract all occurrences of the word "isnt" and the following word, you can do this:
library(stringr)
pattern <- "isnt \\w+"
str_extract_all(text, pattern)
[[1]]
[1] "isnt good"
It essentially does the same thing as the example below (from the base package) but I find the stringr solution more elegant and readable.
> regmatches(text, regexpr(pattern, text))
[1] "isnt good"
Update
To replace the occurrences of isnt x with isnt_x, you just need gsub from base R.
gsub("isnt (\\w+)", "isnt_\\1", text)
[1] "This approach isnt_good enough."
The capturing group copies whatever is found inside the parentheses into \\1. See this page for a good introduction: http://www.regular-expressions.info/brackets.html
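The same capturing-group trick extends to several trigger words at once via alternation; this variation is my own, not part of the original question:
gsub("(isnt|not) (\\w+)", "\\1_\\2", "This approach isnt good and not fast.")
[1] "This approach isnt_good and not_fast."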
How about this function?
joinWords <- function(string, word){
  y <- paste0(word, " ")
  # split the string wherever "word " occurs
  x <- unlist(strsplit(string, y))
  # rejoin the first two pieces with "word_" in between
  paste0(x[1], word, "_", x[2])
}
> text <- "This approach isnt good enough."
> joinWords(text, "isnt")
# [1] "This approach isnt_good enough."
> joinWords("This approach might work for you", "might")
# [1] "This approach might_work for you"
What's the most elegant way to extract the last word in a sentence string?
The sentence does not end with a "."
Words are separated by blanks.
sentence <- "The quick brown fox"
TheFunction(sentence)
should return: "fox"
I do not want to use a package if a simple solution is possible.
If a simple package-based solution exists, that is also fine.
Just for completeness: the stringr package contains a function for exactly this problem.
library(stringr)
sentence <- "The quick brown fox"
word(sentence,-1)
[1] "fox"
tail(strsplit('this is a sentence', split = " ")[[1]], 1)
Basically as suggested by #Señor O.
x <- 'The quick brown fox'
sub('^.* ([[:alnum:]]+)$', '\\1', x)
That will capture the last run of letters and digits before the end of the string.
You can also use the regexec and regmatches functions, but I find sub cleaner:
m <- regexec('^.* ([[:alnum:]]+)$', x)
regmatches(x, m)
See ?regex and ?sub for more info.
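If you do go the regexec/regmatches route, each match contains the full match followed by the captured group, so picking out just the word could look like this (a sketch of my own):
sapply(regmatches(x, m), `[`, 2)
[1] "fox"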
Another packaged option is stri_extract_last_words() from the stringi package
library(stringi)
stri_extract_last_words("The quick brown fox")
# [1] "fox"
The function also removes any punctuation that may be at the end of the sentence.
stri_extract_last_words("The quick brown fox? ...")
# [1] "fox"
Going in the package direction, this is the simplest answer I can think of:
library(stringr)
x <- 'The quick brown fox'
str_extract(x, '\\w+$')
#[1] "fox"