What's the most elegant way to extract the last word in a sentence string?
The sentence does not end with a "."
Words are separated by blanks.
sentence <- "The quick brown fox"
TheFunction(sentence)
should return: "fox"
I do not want to use a package if a simple solution is possible.
If a simple solution based on a package exists, that is also fine.
Just for completeness: The library stringr contains a function for exactly this problem.
library(stringr)
sentence <- "The quick brown fox"
word(sentence,-1)
[1] "fox"
tail(strsplit('this is a sentence',split=" ")[[1]],1)
Basically as suggested by @Señor O.
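If you need this for a whole character vector rather than a single string, the same strsplit/tail idea can be wrapped in a small helper. This is only a sketch: the name last_word is made up here, and it assumes every element is a non-empty string with words separated by single blanks.

```r
# Hypothetical helper (not part of base R): last blank-separated word
# of each element of a character vector.
# Assumes no element is the empty string "".
last_word <- function(x) {
  pieces <- strsplit(x, " ", fixed = TRUE)            # one vector of words per element
  vapply(pieces, function(p) tail(p, 1), character(1))
}

last_word(c("The quick brown fox", "jumps over", "dog"))
```

Using fixed = TRUE treats the blank as a literal character instead of a regular expression, which is all that is needed here.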
x <- 'The quick brown fox'
sub('^.* ([[:alnum:]]+)$', '\\1', x)
That will catch the last run of alphanumeric characters before the end of the string.
You can also use the regexec and regmatches functions, but I find sub cleaner:
m <- regexec('^.* ([[:alnum:]]+)$', x)
regmatches(x, m)
See ?regex and ?sub for more info.
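Two details of the calls above may be worth seeing side by side: sub() hands back the input unchanged when the pattern does not match, and regexec()/regmatches() return a list whose elements hold the full match followed by the capture groups. A short sketch, where the variable names and the sentence with a trailing "." are made up for illustration:

```r
x <- "The quick brown fox"
pattern <- "^.* ([[:alnum:]]+)$"

# sub() returns its input unchanged if the pattern does not match,
# which happens here because the string ends in "." and not in [[:alnum:]].
with_dot <- sub(pattern, "\\1", "The quick brown fox.")

# regexec() + regmatches() give one character vector per input string:
# element 1 is the full match, element 2 is the first capture group.
m <- regexec(pattern, x)
parts <- regmatches(x, m)[[1]]
last <- parts[2]
```

So with regmatches() you still have to pick out the second element to get just the word.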
Another packaged option is stri_extract_last_words() from the stringi package
library(stringi)
stri_extract_last_words("The quick brown fox")
# [1] "fox"
The function also removes any punctuation that may be at the end of the sentence.
stri_extract_last_words("The quick brown fox? ...")
# [1] "fox"
Going in the package direction, this is the simplest answer I can think of:
library(stringr)
x <- 'The quick brown fox'
str_extract(x, '\\w+$')
#[1] "fox"
Related
I am working in R and would like to extract the two-digit code 38y from the following string:
"/Users/files/folder/file_number_23a_version_38y_Control.txt"
I know that _Control always comes after the 38y and that 38y is preceded by an underscore. How can I use strsplit or other R commands to extract the 38y?
You could use
regmatches(x, regexpr("[^_]+(?=_Control)", x, perl = TRUE))
# [1] "38y"
or equivalently
stringr::str_extract(x, "[^_]+(?=_Control)")
# [1] "38y"
Using gsub.
gsub('.*_(.*)_Control.*', '\\1', x)
# [1] "38y"
See demo with detailed explanation.
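Since the question mentions strsplit, here is a regex-free sketch along those lines. It assumes the file name is split by underscores and that exactly one piece starts with "Control", and that this piece is not the first one; the variable names are illustrative only.

```r
x <- "/Users/files/folder/file_number_23a_version_38y_Control.txt"

# Drop the directory part, then split the file name on literal underscores.
parts <- strsplit(basename(x), "_", fixed = TRUE)[[1]]

# Find the piece that starts with "Control" and take the piece just before it.
idx <- which(startsWith(parts, "Control"))[1]
code <- parts[idx - 1]
```

This is more verbose than the regex answers, but each step can be inspected on its own.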
A possible solution:
library(stringr)
text <- "/Users/files/folder/file_number_23a_version_38y_Control.txt"
str_extract(text, "(?<=_)\\d+\\D(?=_Control)")
#> [1] "38y"
You can find an explanation of the regex part at:
https://regex101.com/r/PQSZHX/1
I'm wondering how to get the words that occur before an exclamation mark. I have a data frame with a different string in each row. I have tried the following:
text %>%
str_match("!",lines)
I don't really get what I want and I'm a bit lost. Does anyone have advice?
You can str_extract_all the words before the ! using lookahead:
Data:
text <- c("Hello!", "This a test sentence", "That's another test sentence, yes! It is!", "And that's one more")
Solution:
library(stringr)
unlist(str_extract_all(text, "\\b\\w+\\b(?=!)"))
[1] "Hello" "yes" "is"
If you seek a dplyr solution:
library(dplyr)
data.frame(text) %>%
  mutate(Word_before_excl = str_extract_all(text, "\\b\\w+\\b(?=!)"))
                                       text Word_before_excl
1                                    Hello!            Hello
2                      This a test sentence
3 That's another test sentence, yes! It is!          yes, is
4                       And that's one more
Maybe we can use regmatches
> sapply(regmatches(text, gregexpr("\\b\\w+\\b(?=!)", text, perl = TRUE)), toString)
[1] "Hello" "" "yes, is" ""
You could also use :
> unlist(strsplit("Dog!Cat!", "!"))
[1] "Dog" "Cat"
I am trying to get the text between two words in a sentence.
For example the sentence is -
x <- "This is my first sentence"
Now I want the text between "This" and "first", which is "is my".
I have tried various R functions such as grep, grepl, pmatch and str_split, but could not get exactly what I want.
The closest I have come is with gsub:
gsub(".*This\\s*|first*", "", x)
The output it gives is
[1] "is my sentence"
In reality, what I need is only
[1] "is my"
Any help would be appreciated.
You need .* at the end to match zero or more characters after the 'first'
gsub('^.*This\\s*|\\s*first.*$', '', x)
#[1] "is my"
Another approach using rm_between from the qdapRegex package.
library(qdapRegex)
rm_between(x, 'This', 'first', extract=TRUE)[[1]]
# [1] "is my"
Since this question is used as a reference, I'll add some possible solutions to build a complete overview. Both are based on a look-ahead/look-behind regex pattern.
base R
regmatches( x, gregexpr("(?<=This ).*(?= first)", x, perl = TRUE ) )
stringr
stringr::str_extract_all( x, "(?<=This ).+(?= first)" )
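One thing to watch with the look-around patterns above: .* and .+ are greedy, so if the end word occurs more than once they run up to its last occurrence. A lazy quantifier (.*?) stops at the first one. A small sketch with a made-up sentence x2:

```r
# Made-up input in which " first" occurs twice.
x2 <- "This is my first sentence, not my first choice"

# Greedy: runs up to the LAST " first".
greedy <- regmatches(x2, regexpr("(?<=This ).*(?= first)", x2, perl = TRUE))

# Lazy: stops at the FIRST " first".
lazy <- regmatches(x2, regexpr("(?<=This ).*?(?= first)", x2, perl = TRUE))
```

Here greedy is "is my first sentence, not my" while lazy is "is my"; which one you want depends on your data.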
I am looking for a grep way to obtain the characters in a string prior to the first space.
I have hacked together the following function, as I could not figure out how to do it using the grep-type commands in R.
Could someone help with the grep solution - if there is one ...
beforeSpace <- function(inWords) {
vapply(inWords, function(L) strsplit(L, "[[:space:]]")[[1]][1], FUN.VALUE = 'character')
}
words <- c("the quick", "brown dogs were", "lazier than quick foxes")
beforeSpace(words)
R> the quick brown dogs were lazier than quick foxes
"the" "brown" "lazier"
And do let me know if there's a better way than grep (or my function, beforeSpace) to do this.
Or just sub, with credit to @flodel:
sub(" .*", "", words)
# and if the 'space' can also be a tab or other white-space:
sub("\\s.*","",words)
#[1] "the" "brown" "lazier"
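For a solution closer to the grep family, regexpr() with regmatches() extracts the match instead of deleting the rest. This is a sketch, not a drop-in replacement: unlike sub(), regmatches() silently drops elements that have no match, so the result can be shorter than the input. The second call uses a made-up input to show that.

```r
words <- c("the quick", "brown dogs were", "lazier than quick foxes")

# First run of non-whitespace characters at the start of each string.
first_word <- regmatches(words, regexpr("^[^[:space:]]+", words))

# Caveat: elements without a match (here the empty string) are dropped.
test_input <- c("one two", "")
shorter <- regmatches(test_input, regexpr("^[^[:space:]]+", test_input))
```

If you need one result per input element, sub() as above is the safer choice.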
You can use qdap's beg2char (beginning of the string to a particular character) as follows:
x <- c("the quick", "brown dogs were", "lazier than quick foxes")
library(qdap)
beg2char(x)
## [1] "the" "brown" "lazier"
Using stringi
library(stringi)
stri_extract_first(words, regex="\\w+")
#[1] "the" "brown" "lazier"
I want to join a specific word with the word that follows it. I have tried a bigram approach, which is too slow, and also gregexpr, but did not find a good solution. For example:
text="This approach isnt good enough."
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
BigramTokenizer(text)
[1] "This approach" "approach isnt" "isnt good" "good enough"
What I really want is isnt_good as a single word in the text, i.e. to join isnt with the word that comes after it:
text
"This approach isnt_good enough."
Is there an efficient approach that turns this into a unigram? Thanks.
To extract all occurrences of the word "isnt" and the word that follows it, you can do this:
library(stringr)
pattern <- "isnt \\w+"
str_extract_all(text, pattern)
[[1]]
[1] "isnt good"
It does essentially the same thing as the base R example below, but I find the stringr solution more elegant and readable. Note that regexpr returns only the first match in each string; use gregexpr to get all of them.
> regmatches(text, regexpr(pattern, text))
[1] "isnt good"
Update
To replace the occurrences of isnt x with isnt_x you just need gsub of the base package.
gsub("isnt (\\w+)", "isnt_\\1", text)
[1] "This approach isnt_good enough."
What you do is to use a capturing group that copies whatever is found inside the parentheses to the \\1. See this page for a good introduction: http://www.regular-expressions.info/brackets.html
How about this function?
joinWords <- function(string, word){
y <- paste0(word, " ")
x <- unlist(strsplit(string, y))
paste0(x[1], word, "_", x[2])
}
> text <- "This approach isnt good enough."
> joinWords(text, "isnt")
# [1] "This approach isnt_good enough."
> joinWords("This approach might work for you", "might")
# [1] "This approach might_work for you"
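Note that joinWords as written only handles a single occurrence of the word. A possible variant that builds the gsub pattern from the word and therefore handles repeated occurrences is sketched below. The name join_after is made up, and the sketch assumes word contains only letters and digits (no regex metacharacters).

```r
# Hypothetical helper: join every occurrence of `word` with the word after it.
# Assumes `word` is plain alphanumeric, so it is safe to paste into a regex.
join_after <- function(string, word) {
  pattern <- paste0("\\b", word, " (\\w+)")            # e.g. "\\bisnt (\\w+)"
  gsub(pattern, paste0(word, "_\\1"), string, perl = TRUE)
}

join_after("This approach isnt good enough.", "isnt")
join_after("It isnt good and isnt fast", "isnt")
```

The \\b keeps the pattern from matching inside a longer word that merely ends in the target word.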