How can I search multiple words in the same regex? - r

I've got a list of specific words to remove from a list of sentences. Do I have to loop over the list and apply a function to each regex, or can I somehow call them all at once? I've tried to do so with lapply, but I'm hoping to find a better way.
string <- 'This is a sample sentence from which to gather some cool
knowledge'
words <- c('a','from','some')
lapply(words,function(x){
string <- gsub(paste0('\\b',words,'\\b'),'',string)
})
My desired output is:
This is sample sentence which to gather cool knowledge.

You can collapse your character vector of words to remove with the regex OR operator ("|"), sometimes referred to as the "pipe" symbol.
gsub(paste0('\\b',words,'\\b', collapse="|"), '', string)
[1] "This is sample sentence which to gather cool \n knowledge"
Or:
gsub(paste0('\\b',words,'\\b\\s{0,1}', collapse="|"), '', string)
[1] "This is sample sentence which to gather cool \n knowledge"
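The collapsing idea above also applies to a whole vector of sentences at once, since gsub() is vectorised over its input. The following is a minimal sketch, not the answerer's code: the second sentence and the names `sentences`, `pattern` and `cleaned` are made up for illustration, and the extra whitespace clean-up is an optional step.

```r
# Illustrative input: the second sentence is not from the question.
sentences <- c('This is a sample sentence from which to gather some cool knowledge',
               'some words come from a list')
words <- c('a', 'from', 'some')

# One alternation, wrapped in a group so that \\b applies to every word.
pattern <- paste0('\\b(', paste(words, collapse = '|'), ')\\b')

# gsub() handles every sentence in one call, so no loop is needed.
cleaned <- gsub(pattern, '', sentences)

# Collapse the leftover runs of spaces and trim both ends.
cleaned <- trimws(gsub('\\s+', ' ', cleaned))
cleaned
# [1] "This is sample sentence which to gather cool knowledge"
# [2] "words come list"
```

Grouping the alternatives keeps the pattern short and makes it obvious that the word boundaries surround each word, not just the first and last one.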

string<-'This is a sample sentence from which to gather some cool knowledge'
words<-c('a', 'from', 'some')
library(tm)
string<-removeWords(string, words = words)
string
[1] "This is sample sentence which to gather cool knowledge"
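With the tm library you can use removeWords().
Or you can loop with gsub, wrapping each word in word boundaries so that letters inside other words (such as the "a" in "sample") are left alone:
string<-'This is a sample sentence from which to gather some cool knowledge'
words<-c('a', 'from', 'some')
for (i in seq_along(words)) {
  # \\b keeps the match to whole words; \\s? also drops one trailing space
  string <- gsub(pattern = paste0('\\b', words[i], '\\b\\s?'), replacement = '', x = string)
}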
string
[1] "This is sample sentence which to gather cool knowledge"
hope that helps.

You need to use "|" to express OR in the regex:
string2 <- gsub(paste(words,'|',collapse =""),'',string)
> string2
[1] "This is sample sentence which to gather cool knowledge"

Related

Remove first 4 words after a certain string pattern in R?

I am working with really long strings. How can I remove the first 4 words after a certain string pattern occurs? For example:
string <- "hello I am a user of stackoverflow and I am really happy with all the help the community usually offers when I'm in need of some coding expertise."
# remove the first 4 words after and including "stackoverflow"
result
"hello I am a user of happy with all the help the community usually offers when I'm in need of some coding expertise."
Solution with base R
A one line solution:
pattern <- "stackoverflow"
string <- "hello I am a user of stackoverflow and I am really happy with all the help the community usually offers when I'm in need of some coding expertise."
gsub(paste0(pattern, "(\\W+\\w+){0,4}\\W*"), "", string)
#> [1] "hello I am a user of happy with all the help the community usually offers when I'm in need of some coding expertise."
How it works
Create the pattern you want with a regex:
"stackoverflow" followed by up to 4 words.
Definitely, check out ?regex for more info about it.
Words are identified by \\w+ and separators are identified by \\W+ (capital W; it includes spaces and special characters like the apostrophe that you have in the sentence).
(...){0,4} means that the combination of word and separator may repeat up to 4 times.
\\W* needs to identify a possible final separator, so that the remaining two pieces of the sentence won't have two separators dividing them. Try it without, you'll see what I mean.
gsub locates the pattern you want and replaces it with "" (thus deleting it).
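The pieces described above can be wrapped in a small helper so that the number of words is a parameter. This is only a sketch: the function name `remove_after`, its argument `n` and the shortened example sentence are assumptions for illustration, and `pattern` is still interpreted as a regular expression.

```r
# Hypothetical helper: remove `pattern` plus up to `n` following words.
remove_after <- function(string, pattern, n = 4) {
  # word/separator pairs may repeat 0..n times; \\W* eats the final separator
  regex <- paste0(pattern, "(\\W+\\w+){0,", n, "}\\W*")
  gsub(regex, "", string)
}

x <- "hello I am a user of stackoverflow and I am really happy with the help"
remove_after(x, "stackoverflow")
#> [1] "hello I am a user of happy with the help"
remove_after(x, "stackoverflow", n = 2)
#> [1] "hello I am a user of am really happy with the help"
```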
Handle Exceptions
Note that it works even for particular cases:
# end of a sentence with fewer than 4 words after
string <- "hello I am a user of stackoverflow and I am"
gsub(paste0(pattern, "(\\W+\\w+){0,4}\\W*"), "", string)
#> [1] "hello I am a user of "
# beginning of a sentence
string <- "stackoverflow and I am really happy with all the help the community usually offers when I'm in need of some coding expertise."
gsub(paste0(pattern, "(\\W+\\w+){0,4}\\W*"), "", string)
#> [1] "happy with all the help the community usually offers when I'm in need of some coding expertise."
# pattern == string
string <- "stackoverflow and I am really"
gsub(paste0(pattern, "(\\W+\\w+){0,4}\\W*"), "", string)
#> [1] ""
A tidyverse solution
library(stringr)
# locate start and end position of pattern
tmp <- str_locate(string, paste0(pattern,"(\\W+\\w+){0,4}\\W*"))
# get positions: start_sentence-start_pattern and end_pattern-end_sentence
tmp <- invert_match(tmp)
# get the substrings
tmp <- str_sub(string, tmp[,1], tmp[,2])
# collapse substrings together
str_c(tmp, collapse = "")
#> [1] "hello I am a user of happy with all the help the community usually offers when I'm in need of some coding expertise."
Search for your pattern with additional spaces and words after it. Find the start and end positions of the match, split the string there and paste the two remaining pieces back together. At the end, gsub any double (or more) spaces.
string <- "hello I am a user of stackoverflow and I am really happy with all the help the community usually offers when I'm in need of some coding expertise."
pat <- "stackoverflow"
library(stringr)
tmp <- str_locate(
  string,
  paste0(
    pat,
    paste0(
      rep("\\s?[a-zA-Z]+", 4),
      collapse = ""
    )
  )
)
gsub("\\s{2,}", " ",
  paste0(
    substring(string, 1, tmp[1] - 1),
    substring(string, tmp[2] + 1)
  )
)
[1] "hello I am a user of happy with all the help the community usually offers when I'm in need of some coding expertise."
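Quick answer, I am sure you can have better code than that:
string <- "hello I am a user of stackoverflow and I am really happy with all the help the community usually offers when I'm in need of some coding expertise."
# strsplit() keeps every token as plain character
# (read.table() would guess column types and treat quote characters specially)
tokens <- strsplit(string, " ")[[1]]
string2 <- ''
j <- 0  # position of "stackoverflow", 0 while not yet seen
for (i in seq_along(tokens)) {
  if (tokens[i] == "stackoverflow") {
    j <- i
  } else if (j > 0) {
    # skip the 4 words that follow the pattern
    if (i - j > 4) {
      string2 <- paste0(string2, " ", tokens[i])
    }
  } else {
    if (i > 1) {
      string2 <- paste0(string2, " ", tokens[i])
    } else {
      string2 <- tokens[i]
    }
  }
}
print(string2)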

Extract First string in the sentence

I have a sentence from which I need to extract the first even word, that is, the first word with an even number of characters. For example:
df <- ("This is not the sentence")
For the above sentence, I need "This" to be extracted because it is the first even word
Another example is
df <- ("She is not going anywhere")
For the above sentence, I need "is" to be extracted because it is the first even word
We can write a function to do this. We split the string on whitespace, count the number of characters in each word and return the first word with an even count.
extract_first_even_word <- function(text) {
  all_words <- strsplit(text, "\\s+")[[1]]
  all_words[which.max(nchar(all_words) %% 2 == 0)]
}
extract_first_even_word("This is not the sentence")
#[1] "This"
extract_first_even_word("She is not going anywhere")
#[1] "is"
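One caveat with which.max() is that it returns 1 when no word has an even length, so the first word is returned even though it does not qualify. A possible variant, sketched here under the assumption that NA is an acceptable "not found" result (the name `first_even_word` is made up), uses match() instead:

```r
# Hypothetical variant: returns NA when no word has an even length.
first_even_word <- function(text) {
  all_words <- strsplit(text, "\\s+")[[1]]
  is_even <- nchar(all_words) %% 2 == 0
  # match() gives the index of the first TRUE, or NA if there is none
  all_words[match(TRUE, is_even)]
}

first_even_word("This is not the sentence")
#[1] "This"
first_even_word("She is not going anywhere")
#[1] "is"
first_even_word("one two three")
#[1] NA
```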

How to remove numbers if a string starts with numbers, but keep everything else (in r)?

I am working with data in R and have a string related question.
If I have a vector (say books),
books <- c('123 Book1 331','51 Book2','Book3 69','Book4')
I want to split strings that start with numbers and keep the rest, else leave it as it is.
I would like to extract info in a way as shown below:
[1] "Book1 331" "Book2" "Book3 69" "Book4"
What package do I have to use in R? And what function?
Here is another variant using sub which does not require a capture group:
books <- c('123 Book1 331','51 Book2','Book3 69','Book4')
sub("^\\d+\\s+", "", books)
[1] "Book1 331" "Book2" "Book3 69" "Book4"
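To see which elements the anchored pattern actually touches, the same regex can be passed to grepl() first. This is a small sketch for illustration; the names `has_prefix` and `stripped` are not from the answer.

```r
books <- c('123 Book1 331', '51 Book2', 'Book3 69', 'Book4')

# ^ anchors the match at the start, so only leading digits
# followed by whitespace count as a prefix.
has_prefix <- grepl("^\\d+\\s+", books)
has_prefix
# [1]  TRUE  TRUE FALSE FALSE

# sub() replaces the first (and, because of ^, only) match per element;
# elements without a match are returned unchanged.
stripped <- sub("^\\d+\\s+", "", books)
stripped
# [1] "Book1 331" "Book2"     "Book3 69"  "Book4"
```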
You could simply use gsub with your own regular expression. E.g.:
books <- c('123 Book1 331','51 Book2','Book3 69','Book4')
gsub("^.*?([a-zA-Z]+.+)", "\\1", books)
[1] "Book1 331" "Book2" "Book3 69" "Book4"

Looping through and replacing text in a data frame

I have a dataframe that consists of a variable with multiple words, such as:
variable
"hello my name is this"
"greetings friend"
And another dataframe that consists of two columns, one of which is words, the other of which is replacements for those words, such as:
word         replacement
"hello"      "hi"
"greetings"  "hi"
I'm trying to find an easy way to replace the words in "variable" with the replacement words, looping over both all the observations, and all the words in each observation. The desired result is:
variable
"hi my name is this"
"hi friend"
I've looked into some methods that use cSplit, but it's not feasible for my application (there are too many words in any given observation of "variable", so this creates too many columns). I'm not sure how I would use strsplit for this, but am guessing that is the correct option?
EDIT: From my understanding of this question, my question may be a repeat of a previously unanswered question: Replace strings in text based on dictionary
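stringr's str_replace_all would be handy in this case. Note that when pattern and replacement are passed as two vectors, they are recycled against the strings element by element, so the call below works because each word sits in the same row as the string that contains it:
df = data.frame(variable = c('hello my name is this','greetings friend'))
replacement <- data.frame(word = c('hello','greetings'), replacement = c('hi','hi'), stringsAsFactors = F)
stringr::str_replace_all(df$variable,replacement$word,replacement$replacement)
Output:
> stringr::str_replace_all(df$variable,replacement$word,replacement$replacement)
[1] "hi my name is this" "hi friend"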
This is similar to @amrrs's solution, but I am using a named vector instead of supplying two separate vectors. This also addresses the issue mentioned by OP in the comments:
library(dplyr)
library(stringr)
df2$word %>%
  paste0("\\b", ., "\\b") %>%
  setNames(df2$replacement, .) %>%
  str_replace_all(df1$variable, .)
# [1] "hi my name is this" "hi friend" "hi, hellomy is not a word"
# [4] "hi! my friend"
This is the named vector with regex as the names and string to replace with as elements:
df2$word %>%
  paste0("\\b", ., "\\b") %>%
  setNames(df2$replacement, .)
# \\bhello\\b \\bgreetings\\b
# "hi" "hi"
Data:
df1 = data.frame(variable = c('hello my name is this',
                              'greetings friend',
                              'hello, hellomy is not a word',
                              'greetings! my friend'))
df2 = data.frame(word = c('hello','greetings'),
                 replacement = c('hi','hi'),
                 stringsAsFactors = F)
Note:
In order to address the issue of root words also being converted, I wrapped the regex with word boundaries (\\b). This makes sure that I am not converting words that live inside another, like "helloguys".
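For readers without stringr, the same dictionary replacement with word boundaries can be sketched in base R by applying each word/replacement pair in turn. This is an illustrative alternative, not the answerer's method; the variable name `result` is made up, and the data frames repeat the data above so the block runs on its own.

```r
df1 <- data.frame(variable = c('hello my name is this',
                               'greetings friend',
                               'hello, hellomy is not a word',
                               'greetings! my friend'),
                  stringsAsFactors = FALSE)
df2 <- data.frame(word = c('hello', 'greetings'),
                  replacement = c('hi', 'hi'),
                  stringsAsFactors = FALSE)

result <- df1$variable
for (k in seq_len(nrow(df2))) {
  # \\b on both sides: "hello" matches, "hellomy" does not
  result <- gsub(paste0("\\b", df2$word[k], "\\b"), df2$replacement[k], result)
}
result
# [1] "hi my name is this"        "hi friend"
# [3] "hi, hellomy is not a word" "hi! my friend"
```

Every pair is applied to every string, so the result does not depend on the row order of the two data frames.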

How to get words that end with certain characters within each string r

I have a vector of strings that looks like:
str <- c("bills slashed for poor families today", "your calls are charged", "complaints dept awaiting refund")
I want to get all the words that end with the letter s and remove the s. I have tried:
gsub("s$","",str)
but it doesn't work because it tries to match with the strings that end with s instead of words. I'm trying to get an output that looks like:
[1] bill slashed for poor familie today
[2] your call are charged
[3] complaint dept awaiting refund
Any pointers as to how I can do this? Thanks
$ checks for the end of the string, not the end of a word.
To check for the word boundaries you should use \b
So:
gsub("s\\b", "", str)
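The difference between the two anchors, and one side effect of the \\b version, can be seen on a small made-up vector. This is only a sketch; `x` and the sentences in it are illustrative and not from the question.

```r
x <- c("bills and calls", "this bus is late")

# $ only looks at the end of each string
gsub("s$", "", x)
# [1] "bills and call"   "this bus is late"

# \\b looks at the end of every word
gsub("s\\b", "", x)
# [1] "bill and call" "thi bu i late"
```

As the second element shows, the pattern removes any word-final "s", so words such as "this", "bus" and "is" are shortened too. Whether that matters depends on the data.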
Here's a non base R solution:
library(rebus)
library(stringr)
plurals <- "s" %R% BOUNDARY
str_replace_all(str, pattern = plurals, replacement = "")
You could also use a positive lookahead assertion:
gsub(pattern = "s(?=\\s|$)", replacement = "", x = str, perl = TRUE)
I am no expert on regex, but I believe this expression looks for an "s" only if it is followed by whitespace or by the end of the string. The lookahead does not consume the character that follows, so only the "s" itself is replaced with the empty string. So, word-final "s" characters are removed.
