I am looking for a grep-style way to get the characters in a string before the first space.
I hacked together the following function because I could not figure out how to do it with the grep family of commands in R.
Could someone help with a grep solution, if there is one?
beforeSpace <- function(inWords) {
  # split each string on whitespace and keep only the first piece
  vapply(inWords, function(L) strsplit(L, "[[:space:]]")[[1]][1], FUN.VALUE = character(1))
}
words <- c("the quick", "brown dogs were", "lazier than quick foxes")
beforeSpace(words)
              the quick         brown dogs were lazier than quick foxes
                  "the"                 "brown"                "lazier"
And do let me know if there's a better way than grep (or my function, beforeSpace) to do this.
Or just use sub, with credit to @flodel:
sub(" .*", "", words)
# and if the 'space' can also be a tab or other white-space:
sub("\\s.*","",words)
#[1] "the" "brown" "lazier"
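Since the question asked for something in the grep family: here is a minimal base-R sketch that extracts instead of deleting. regexpr locates the leading run of non-whitespace characters and regmatches pulls it out. The strings in `edge` are invented for illustration (they are not from the question) and only show where the two sub patterns above behave differently.

```r
words <- c("the quick", "brown dogs were", "lazier than quick foxes")

# Extract the leading run of non-whitespace characters.
# Caveat: regmatches() silently drops elements that have no match.
first_chunk <- regmatches(words, regexpr("^[^[:space:]]+", words))
first_chunk

# Hypothetical edge cases, made up for this sketch
edge <- c("single", " leading space", "tab\tseparated")
space_only <- sub(" .*", "", edge)    # only a literal space ends the word
any_space  <- sub("\\s.*", "", edge)  # any whitespace ends the word
space_only
any_space
```

A string that starts with a space comes back empty from both sub calls, and a tab-separated string is only cut by the \\s version.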
You can use qdap's beg2char (beginning of the string to a particular character) as follows:
x <- c("the quick", "brown dogs were", "lazier than quick foxes")
library(qdap)
beg2char(x)
## [1] "the" "brown" "lazier"
Using stringi
library(stringi)
stri_extract_first(words, regex="\\w+")
#[1] "the" "brown" "lazier"
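One thing to keep in mind: \\w+ grabs the first run of word characters (letters, digits, underscore), which is not always the same as "everything before the first space". A small sketch of the difference, using base regexpr/regmatches in place of stringi so it runs without extra packages; the sample strings are invented for illustration.

```r
x <- c("e-mail me", "don't stop", "the quick")

# First run of word characters: stops at "-" and "'"
first_word_chars <- regmatches(x, regexpr("\\w+", x, perl = TRUE))

# First run of non-whitespace characters: stops only at whitespace
first_non_space <- regmatches(x, regexpr("\\S+", x, perl = TRUE))

first_word_chars
first_non_space
```

If the tokens can contain hyphens or apostrophes, the \\S+ form is probably closer to what the question asks for.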
Related
I'm wondering how to get the words that occur before an exclamation mark. I have a data frame with a different string in each row. I have tried the following:
text %>%
str_match("!",lines)
I don't really get what I want and I'm a bit lost. Does anyone have advice?
You can str_extract_all the words before the ! using lookahead:
Data:
text <- c("Hello!", "This a test sentence", "That's another test sentence, yes! It is!", "And that's one more")
Solution:
library(stringr)
unlist(str_extract_all(text, "\\b\\w+\\b(?=!)"))
[1] "Hello" "yes" "is"
If you seek a dplyr solution:
library(dplyr)
data.frame(text) %>%
  mutate(Word_before_excl = str_extract_all(text, "\\b\\w+\\b(?=!)"))
                                       text Word_before_excl
1                                    Hello!            Hello
2                      This a test sentence
3 That's another test sentence, yes! It is!          yes, is
4                       And that's one more
In base R, regmatches also works (the lookahead needs perl = TRUE):
> sapply(regmatches(text, gregexpr("\\b\\w+\\b(?=!)", text, perl = TRUE)), toString)
[1] "Hello" "" "yes, is" ""
You could also split on the exclamation mark with strsplit:
> unlist(strsplit("Dog!Cat!", "!"))
[1] "Dog" "Cat"
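Note that strsplit returns the whole segment between exclamation marks, so on full sentences you still have to peel off the last word of each piece. A rough sketch, assuming words are separated by whitespace and the string ends with "!" (otherwise the trailing segment would be picked up as well):

```r
x <- "That's another test sentence, yes! It is!"

# Split on a literal "!"; a match at the very end leaves no empty piece
pieces <- strsplit(x, "!", fixed = TRUE)[[1]]

# Greedy ".*\\s" removes everything up to the last whitespace in each piece
last_words <- sub(".*\\s", "", pieces)

pieces
last_words
```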
I read about regex and came across word boundaries. I found a question about the difference between \b and \B. Using the code from that question does not give the output I expect. Here:
grep("\\bcat\\b", "The cat scattered his food all over the room.", value= TRUE)
# I expect "cat" but it returns the whole string.
grep("\\B-\\B", "Please enter the nine-digit id as it appears on your color - coded pass-key.", value= TRUE)
# I expect "-" but it returns the whole string.
I use the code as described in the question but with two backslashes as suggested here. Using one backslash does not work either. What am I doing wrong?
You can use regexpr and regmatches to get the match itself. grep only tells you which elements contain a match (with value = TRUE it returns those elements whole). You can also use sub.
x <- "The cat scattered his food all over the room."
regmatches(x, regexpr("\\bcat\\b", x))
#[1] "cat"
sub(".*(\\bcat\\b).*", "\\1", x)
#[1] "cat"
x <- "Please enter the nine-digit id as it appears on your color - coded pass-key."
regmatches(x, regexpr("\\B-\\B", x))
#[1] "-"
sub(".*(\\B-\\B).*", "\\1", x)
#[1] "-"
For more than 1 match use gregexpr:
x <- "1abc2"
regmatches(x, gregexpr("[0-9]", x))
#[[1]]
#[1] "1" "2"
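To make the difference between these functions concrete, here is a short side-by-side sketch on a two-element vector. The second string is made up so that one element has no match.

```r
x <- c("The cat scattered his food.", "No felines here.")
pat <- "\\bcat\\b"

idx   <- grep(pat, x)                # indices of the elements that match
hit   <- grepl(pat, x)               # one TRUE/FALSE per element
whole <- grep(pat, x, value = TRUE)  # the matching elements, unchanged
m     <- regexpr(pat, x)             # start of the first match, -1 if none
found <- regmatches(x, m)            # the matched text itself

idx
hit
whole
as.integer(m)
found
```

Only the regexpr/regmatches pair gives back the matched substring; everything in the grep/grepl group answers "does it match?" rather than "what matched?".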
grep returns the whole string because it just looks to see if the match is present in the string. If you want to extract cat, you need to use other functions such as str_extract from package stringr:
str_extract("The cat scattered his food all over the room.", "\\bcat\\b")
[1] "cat"
The difference between \b and \B is that \b marks a word boundary whereas \B is its negation. That is, \\bcat\\b matches only if cat stands alone as a word (bounded by non-word characters such as white space or punctuation, or by the start or end of the string), whereas \\Bcat\\B matches only if cat is inside a word. For example:
str_extract_all("The forgot his education and scattered his food all over the room.", "\\Bcat\\B")
[[1]]
[1] "cat" "cat"
These two matches are from education and scattered.
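A few quick checks can make the boundary rules easier to see. This is only a sketch: `has` is a throwaway helper defined here (not a library function), and perl = TRUE is used so the PCRE definitions of \b and \B apply.

```r
# Hypothetical helper: does the pattern match anywhere in x?
has <- function(pattern, x) grepl(pattern, x, perl = TRUE)

b_paren  <- has("\\bcat\\b", "(cat)")      # punctuation is a boundary too
b_alone  <- has("\\bcat\\b", "cat")        # start/end of string count as boundaries
b_inside <- has("\\bcat\\b", "scattered")  # no boundary around "cat" here
B_inside <- has("\\Bcat\\B", "scattered")  # "cat" sits inside a word
B_alone  <- has("\\Bcat\\B", "cat")        # boundaries on both sides
B_plural <- has("\\Bcat\\B", "cats")       # boundary before "c"

c(b_paren, b_alone, b_inside, B_inside, B_alone, B_plural)
```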
Split the text on spaces wherever a group of words is not matched by the pattern.
If a group of words is matched, keep it as it is.
text <- c('considerate and helpful','not bad at all','this is helpful')
pattern <- c('considerate and helpful','not bad')
Expected output:
considerate and helpful, not bad, at, all, this, is, helpful
Thank you for the help!
Of course, just put the phrases in front of \w+ in the alternation so they are tried first:
library("stringr")
text <- c('considerate and helpful','not bad at all','this is helpful')
parts <- str_extract_all(text, "considerate and helpful|not bad|\\w+")
parts
Which yields
[[1]]
[1] "considerate and helpful"
[[2]]
[1] "not bad" "at" "all"
[[3]]
[1] "this" "is" "helpful"
It does not split on whitespaces but rather extracts the "words".
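If the phrases live in a vector, as in the question, the alternation can be built rather than typed out. A minimal base-R sketch, assuming the phrases contain no regex metacharacters (they would need escaping otherwise); `phrases`, `rx` and `flat` are names chosen here for illustration.

```r
text <- c('considerate and helpful', 'not bad at all', 'this is helpful')
phrases <- c('considerate and helpful', 'not bad')

# Longest phrases first, so a longer phrase wins over a shorter one that
# starts at the same position (PCRE tries alternatives left to right).
phrases <- phrases[order(nchar(phrases), decreasing = TRUE)]
rx <- paste0(paste(phrases, collapse = "|"), "|\\w+")

# gregexpr finds every match per string; regmatches extracts them
parts <- regmatches(text, gregexpr(rx, text, perl = TRUE))
flat <- unlist(parts)
flat
```

regmatches with gregexpr is used here as a base-R stand-in for str_extract_all, so no package is needed.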
I want to join a specific word with the word that comes after it. I have tried a bigram approach, which is too slow, and I also tried gregexpr but did not get a good solution. For example:
text="This approach isnt good enough."
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
BigramTokenizer(text)
[1] "This approach" "approach isnt" "isnt good" "good enough"
What I really want is isnt_good as a single word in the text, i.e. to combine isnt with the word that follows it.
text
"This approach isnt_good enough."
Is there an efficient way to do this so the result can be treated as unigrams? Thanks.
To extract all occurrences of the word "isnt" and the following word you can do this:
library(stringr)
pattern <- "isnt \\w+"
str_extract_all(text, pattern)
[[1]]
[1] "isnt good"
It does essentially the same thing as the base R example below, but I find the stringr solution more elegant and readable. Note that regexpr returns only the first match per string; use gregexpr to get all of them.
> regmatches(text, regexpr(pattern, text))
[1] "isnt good"
Update
To replace the occurrences of isnt x with isnt_x you just need gsub from base R.
gsub("isnt (\\w+)", "isnt_\\1", text)
[1] "This approach isnt_good enough."
This uses a capturing group: whatever the parentheses match is available in the replacement as \\1. See this page for a good introduction: http://www.regular-expressions.info/brackets.html
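To illustrate how the group numbering works, here is a small sketch with two groups. The example strings and the list of trigger words (isnt, not, never) are assumptions made up for this illustration, not part of the question.

```r
# Groups are numbered by their opening parenthesis, left to right
swapped <- gsub("(\\w+) (\\w+)", "\\2 \\1", "hello world")

# Group 1 is the trigger word, group 2 the word that follows it
neg <- gsub("\\b(isnt|not|never) (\\w+)", "\\1_\\2",
            "This isnt good and not bad.", perl = TRUE)

swapped
neg
```

Because gsub replaces every occurrence, this handles several trigger words in one pass over the text.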
How about this function?
joinWords <- function(string, word){
y <- paste0(word, " ")
x <- unlist(strsplit(string, y))
paste0(x[1], word, "_", x[2])
}
> text <- "This approach isnt good enough."
> joinWords(text, "isnt")
# [1] "This approach isnt_good enough."
> joinWords("This approach might work for you", "might")
# [1] "This approach might_work for you"
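One caveat with joinWords: it keeps only the first two pieces of the split, so anything after a second occurrence of the word is lost, and it handles a single string at a time. A possible gsub-based variant is sketched below; `join_after` is a hypothetical name, and it assumes `word` contains no regex metacharacters or backslashes.

```r
# Hypothetical variant: join every occurrence of `word` with the next word
join_after <- function(string, word) {
  pattern <- paste0("\\b", word, "\\s+(\\w+)")
  gsub(pattern, paste0(word, "_\\1"), string, perl = TRUE)
}

twice <- join_after("isnt good and isnt bad", "isnt")
vec   <- join_after(c("This approach isnt good enough.", "no match here"), "isnt")

twice
vec
```

Strings without the word pass through unchanged, since gsub leaves non-matching input as it is.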
What's the most elegant way to extract the last word in a sentence string?
The sentence does not end with a "."
Words are separated by blanks.
sentence <- "The quick brown fox"
TheFunction(sentence)
should return: "fox"
I do not want to use a package if a simple solution is possible.
If a simple solution based on package exists, that is also fine.
Just for completeness: the stringr package contains a function for exactly this problem.
library(stringr)
sentence <- "The quick brown fox"
word(sentence,-1)
[1] "fox"
tail(strsplit('this is a sentence',split=" ")[[1]],1)
Basically as suggested by @Señor O.
x <- 'The quick brown fox'
sub('^.* ([[:alnum:]]+)$', '\\1', x)
That will catch the last run of letters and digits before the end of the string.
You can also use the regexec and regmatches functions, but I find sub cleaner:
m <- regexec('^.* ([[:alnum:]]+)$', x)
regmatches(x, m)
See ?regex and ?sub for more info.
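For a package-free version that works on a whole vector, the strsplit/tail idea can be wrapped up as below. This is only a sketch: `last_word` is a name invented here, and it assumes every input contains at least one non-space character. The last two lines show a limitation of the sub pattern above: when the string ends in punctuation the pattern does not match, so sub returns the input unchanged.

```r
# Hypothetical helper: last whitespace-separated token of each string
last_word <- function(x) {
  pieces <- strsplit(trimws(x), "[[:space:]]+")
  vapply(pieces, function(p) tail(p, 1), character(1))
}

res <- last_word(c("The quick brown fox", "fox", "ends with space "))
res

# The [[:alnum:]]+$ pattern needs the string to end in a letter or digit
unchanged <- sub('^.* ([[:alnum:]]+)$', '\\1', "The quick brown fox.")
unchanged
```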
Another packaged option is stri_extract_last_words() from the stringi package
library(stringi)
stri_extract_last_words("The quick brown fox")
# [1] "fox"
The function also removes any punctuation that may be at the end of the sentence.
stri_extract_last_words("The quick brown fox? ...")
# [1] "fox"
Going in the package direction, this is the simplest answer I can think of:
library(stringr)
x <- 'The quick brown fox'
str_extract(x, '\\w+$')
#[1] "fox"