How to get the front part of some words in R?

I am trying to get the front part of some words in a sentence.
For example the sentence is -
x <- c("Ace Bayou Reannounces Recall of Bean Bag Chairs Due to Low Rate of Consumer Response; Two Child Deaths Previously Reported; Consumers Urged to Install Repair", "Panasonic Recalls Metal Cutter Saws Due to Laceration Hazard")
Now I want to get the part that comes before Recall or Recalls. I have tried various R functions such as grep, grepl, pmatch, and str_split, but I could not get exactly what I want.
What I need is only
"Ace Bayou Reannounces"
"Panasonic"
Any help would be appreciated.

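If you want to use regular expressions, you can capture everything before Recall or Recalls with a lookahead. The captured group keeps the space before the keyword, so wrap the call in trimws() to drop it:
trimws(gsub(pattern = "(.*(?=Recalls?))(.*)", replacement = "\\1", x = x, perl = TRUE))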
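As a simpler alternative, here is a minimal sketch that uses sub() to delete everything from Recall or Recalls onward, together with any whitespace just before it, so no trimming is needed. It assumes the keyword appears as a whole word and that only its first occurrence matters.

```r
x <- c("Ace Bayou Reannounces Recall of Bean Bag Chairs Due to Low Rate of Consumer Response; Two Child Deaths Previously Reported; Consumers Urged to Install Repair",
       "Panasonic Recalls Metal Cutter Saws Due to Laceration Hazard")

# \\s*       optional whitespace before the keyword
# Recalls?   "Recall" or "Recalls"
# \\b        word boundary, so "Recalled" would not match
# .*$        the rest of the string
front <- sub("\\s*Recalls?\\b.*$", "", x, perl = TRUE)
front
#> [1] "Ace Bayou Reannounces" "Panasonic"
```

Strings that do not contain the keyword are returned unchanged, so you may want to check for it with grepl() first.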

Related

How can I dynamically get words surrounding a keyword?

I have a sentence that may contain keywords. I search for them, and if one is found, I want the word before and the word after the keyword.
cont <- c("could not","would not","does not","will not","do not","were not","was not","did not")
text <- "this failed to increase incomes and production did not improve"
str_extract(text,"([^\\s]+\\s+){1}names(which(sapply(cont,grepl,text)))(\\s+[^\\s]+){1}")
This fails when I build the pattern dynamically with the names function, but if I type the keyword in directly:
str_extract(text,"([^\\s]+\\s+){1}did not(\\s+[^\\s]+){1}")
it correctly returns: production did not improve.
How can I get this to work without typing the keywords in directly?
Final note: I do not completely understand the regex syntax used to get the surrounding words. Basic R books have not covered this. Can someone explain, please?
You could use your cont vector to create a vector of regex strings:
targets <- paste0("([^\\s]+\\s+){1}", cont, "(\\s+[^\\s]+){1}")
Which you can feed into str_extract_all and then unlist:
unlist(stringr::str_extract_all(text, targets))
#> [1] "production did not improve"
If this is something you need to do quite frequently, you could wrap it in a function:
get_surrounding <- function(string, keywords) {
  targets <- paste0("([^\\s]+\\s+){1}", keywords, "(\\s+[^\\s]+){1}")
  unlist(stringr::str_extract_all(string, targets))
}
With which you can easily run the query on new strings:
new_text <- "The production did not increase because the manager would not allow it."
get_surrounding(new_text, cont)
#> [1] "manager would not allow" "production did not increase"
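On the syntax question: names(...) written inside the quotes is just literal text, so R never evaluates it, which is why the pattern has to be assembled with paste0(). The sketch below uses base R with perl = TRUE and spells out what each piece of the pattern matches. The names keyword, word_before and word_after are only illustrative, and the {1} quantifier from the original pattern is dropped because matching something exactly once is the default.

```r
text <- "this failed to increase incomes and production did not improve"
keyword <- "did not"

# [^\\s]+  one or more characters that are not whitespace (a "word")
# \\s+     one or more whitespace characters
word_before <- "[^\\s]+\\s+"   # a word followed by whitespace
word_after  <- "\\s+[^\\s]+"   # whitespace followed by a word

# Glue the pieces around the keyword to form the full pattern
pattern <- paste0(word_before, keyword, word_after)

result <- regmatches(text, regexpr(pattern, text, perl = TRUE))
result
#> [1] "production did not improve"
```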
Perhaps we can try this
> regmatches(text, gregexpr(sprintf("\\w+\\s(%s)\\s\\w+", paste0(cont, collapse = "|")), text))[[1]]
[1] "production did not improve"
Each match of the following regular expression will save the preceding and following words in capture groups 1 and 2, respectively.
\\b([a-z]+) +(?:could|would|does|will|do|were|was|did) +not +([a-z]+)\\b
You will of course have to form this expression programmatically, but that should be straightforward.
For the string
"she could not believe that production did not improve"
there are two matches. For the first ("she could not believe") "she" and "believe" are saved to capture groups 1 and 2, respectively. For the second ("production did not improve") "production" and "improve" are saved to capture groups 1 and 2, respectively.
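A minimal sketch of forming that expression programmatically and reading the two capture groups in base R follows. It assumes the keywords are the ones in the cont vector from the question and that the surrounding words are lowercase; the names hits, groups, before and after are illustrative.

```r
cont <- c("could not", "would not", "does not", "will not",
          "do not", "were not", "was not", "did not")
text <- "she could not believe that production did not improve"

# Build: \b([a-z]+) +(?:could not|would not|...) +([a-z]+)\b
pattern <- sprintf("\\b([a-z]+) +(?:%s) +([a-z]+)\\b",
                   paste(cont, collapse = "|"))

# Step 1: find every full match in the text
hits <- regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]

# Step 2: re-run the pattern on each match to get its capture groups;
# element 1 is the full match, elements 2 and 3 are groups 1 and 2
groups <- regmatches(hits, regexec(pattern, hits, perl = TRUE))
before <- vapply(groups, `[`, character(1), 2)
after  <- vapply(groups, `[`, character(1), 3)

before
#> [1] "she"        "production"
after
#> [1] "believe" "improve"
```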

Look around pattern doesn't occasionally work

I'm using regex in R on an echocardiographic dataset. I want to detect cases where a phenomenon called "SAM" is seen, and I obviously want to exclude cases like "no SAM",
so I wrote these lines:
pattern_sam <- regex("(?<!no )sam", ignore_case = TRUE)
str_view_all(echo_1_lvot$description_echo, pattern_sam, match = TRUE)
It effectively removes 99.9% of the cases with "no SAM", yet for some reason I still get 3 cases of "no SAM".
Now the weird thing is that if I simply copy and paste these strings into a new dataset, the problem goes away...
sam_test <- tibble(description_echo = c(
"There is asymmetric septal hypertrophy severe LVH in all myocardial segements with spared basal posterior wall with asymmetric septal hypertrophy(anteroseptal thickness=2cm,PWD=0.94cm),no SAM nor LVOT obstruction which is compatible with type III HCM",
"-Normal LV size with mild to moderate systolic dysfunction,EF=45%,severe LVH in all myocardial segements with spared basal posterior wall with asymmetric septal hypertrophy(anteroseptal thickness=2cm,PWD=0.94cm),no SAM nor LVOT obstruction which is compa"
))
str_view_all(sam_test$description_echo, pattern_sam)
The same thing happens when I try to detect other patterns.
Does anyone have any idea what the underlying problem is and how it can be fixed?
P.S.:
Here is the .xls file (I only included the problematic string), if you want to see for yourself.
The funny thing is that when I manually remove the "No SAM" from the .xls and retype it in the exact same place, the problem goes away. I still have no idea what is wrong; could it be the text format?
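The character between "no" and "SAM" in those rows is probably not an ordinary ASCII space. You can match any whitespace, including Unicode whitespace, with \s, since you are using the ICU regex flavor (it is used by all stringr/stringi regex functions):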
pattern_sam <- regex("(?<!no\\s)sam", ignore_case = TRUE)
To match any non-word chars including some non-printable chars, use
regex("(?<!no\\W)sam", ignore_case = TRUE)
Besides, if there can be several of them, you may use a constrained-width lookbehind (available in ICU and Java):
pattern_sam <- regex("(?<!no\\s{1,10})sam", ignore_case = TRUE)
pattern_sam <- regex("(?<!no\\W{1,10})sam", ignore_case = TRUE)
Here, from 1 to 10 chars can be between no and sam.
And if you need to match whole words, add \b, word boundary:
pattern_sam <- regex("(?<!\\bno\\s{1,10})sam\\b", ignore_case = TRUE)
pattern_sam <- regex("(?<!\\bno\\W{1,10})sam\\b", ignore_case = TRUE)
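To check whether a stray character really is the cause, you can inspect the code points of a problematic string. The sketch below uses base R and two made-up strings; it assumes the culprit is a no-break space (U+00A0), which is only a guess about the original data.

```r
# Hypothetical strings: the first has a no-break space (U+00A0) between
# "no" and "SAM", the second an ordinary space.
odd   <- "septal hypertrophy,no\u00a0SAM nor LVOT obstruction"
plain <- "septal hypertrophy,no SAM nor LVOT obstruction"

# A literal space in the lookbehind only excludes the ordinary space,
# so the string with the no-break space is still reported as a match
hits <- grepl("(?<!no )sam", c(odd, plain), ignore.case = TRUE, perl = TRUE)
hits
#> [1]  TRUE FALSE

# Code points above 127 reveal the non-ASCII character and its position
codes    <- utf8ToInt(odd)
bad_pos  <- which(codes > 127)
bad_code <- codes[bad_pos]
bad_code
#> [1] 160
```

Code point 160 is the no-break space. If your data shows a different code point, the same \s or \W lookbehind should still cover it as long as it is whitespace or a non-word character.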

grepl() in R using complex pattern with multiple AND, OR

Is it possible to use a pattern like this (see below) in grepl()?
(poverty OR poor) AND (eradicat OR end OR reduc OR alleviat) AND extreme
The goal is to determine if a sentence meets the pattern using
ifelse(grepl(pattern, x, ignore.case = TRUE),"Yes","No")
For example, if x = "end extreme poverty in the country", it will return "Yes", while if x = "end poverty in the country", it will return "No".
An earlier post here works only for single words, like poor AND eradicat AND extreme, but does not work for my case. Is there any way to achieve my goal?
I tried pattern = "(?=.*poverty|poor)(?=.*eradicat|end|reduce|alleviate)(?=.*extreme)", but it does not work. The error is 'Invalid regexp'.
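To combine all 3 assertions, group the alternatives inside each lookahead with a non-capturing group. Without the group, (?=.*poverty|poor) means ".*poverty" OR "poor", so the .* applies only to the first word.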
^(?=.*(?:poverty|poor))(?=.*extreme)(?=.*(?:eradicat|end|reduc|alleviat)).+
^ Start of string
(?=.*(?:poverty|poor)) Assert either poverty OR poor
(?=.*extreme) Assert extreme
(?=.*(?:eradicat|end|reduc|alleviat)) Assert either eradicat OR end OR reduc OR alleviat
.+ Match the whole line
For grepl, you have to set perl = TRUE to enable PCRE, because the default regex engine does not support lookarounds.
grepl('^(?=.*(?:poverty|poor))(?=.*extreme)(?=.*(?:eradicat|end|reduc|alleviat)).+', x, ignore.case = TRUE, perl = TRUE)
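As a quick check, here is a sketch that runs the lookahead pattern on the two sentences from the question, next to an alternative that avoids lookarounds by combining one grepl() call per group with &. The helper has_any is a hypothetical name introduced for this example.

```r
x <- c("end extreme poverty in the country",
       "end poverty in the country")

# Approach 1: one pattern with three lookaheads (needs perl = TRUE)
pattern <- "^(?=.*(?:poverty|poor))(?=.*(?:eradicat|end|reduc|alleviat))(?=.*extreme)"
res_lookahead <- ifelse(grepl(pattern, x, ignore.case = TRUE, perl = TRUE),
                        "Yes", "No")

# Approach 2: one grepl() per OR-group, combined with & for AND
has_any <- function(words, x) {
  grepl(paste(words, collapse = "|"), x, ignore.case = TRUE)
}
res_combined <- ifelse(
  has_any(c("poverty", "poor"), x) &
    has_any(c("eradicat", "end", "reduc", "alleviat"), x) &
    has_any("extreme", x),
  "Yes", "No"
)

res_lookahead
#> [1] "Yes" "No"
res_combined
#> [1] "Yes" "No"
```

Both approaches match substrings, so "end" also matches inside words such as "spend". Adding \\b before each word stem is one way to tighten that if it matters for your data.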

remove multiple patterns from text vector r

I want to remove multiple patterns from multiple character vectors. Currently I am going:
a.vector <- gsub("#\\w+", "", a.vector)
a.vector <- gsub("http\\w+", "", a.vector)
a.vector <- gsub("[[:punct:]]", "", a.vector)
etc etc.
This is painful. I was looking at this question & answer: R: gsub, pattern = vector and replacement = vector but it's not solving the problem.
Neither the mapply nor the mgsub are working. I made these vectors
remove <- c("#\\w+", "http\\w+", "[[:punct:]]")
substitute <- c("")
Neither mapply(gsub, remove, substitute, a.vector) nor mgsub(remove, substitute, a.vector) worked.
a.vector looks like this:
[4951] "#karakamen: Suicide amongst successful men is becoming rampant. Kudos for staing the conversation. #mental"
[4952] "#stiphan: you are phenomenal.. #mental #Writing. httptxjwufmfg"
I want:
[4951] "Suicide amongst successful men is becoming rampant Kudos for staing the conversation #mental"
[4952] "you are phenomenal #mental #Writing"
I know this answer is late on the scene, but it stems from my dislike of having to manually list the removal patterns inside the grep functions (see other solutions here). My idea is to set the patterns beforehand, retain them as a character vector, then paste them together when needed using the regex separator "|":
library(stringr)
remove <- c("#\\w+", "http\\w+", "[[:punct:]]")
a.vector <- str_remove_all(a.vector, paste(remove, collapse = "|"))
Yes, this does effectively do the same as some of the other answers here, but I think my solution allows you to retain the original "character removal vector" remove.
Try combining your subpatterns using |. For example:
> s <- "#karakamen: Suicide amongst successful men is becoming rampant. Kudos for staing the conversation. #mental"
> gsub("#\\w+|http\\w+|[[:punct:]]", "", s)
[1] " Suicide amongst successful men is becoming rampant Kudos for staing the conversation "
But this could become problematic if you have a large number of patterns, or if the result of applying one pattern creates matches to others.
Consider creating your remove vector as you suggested, then applying it in a loop
> s1 <- s
> remove<-c("#\\w+","http\\w+","[[:punct:]]")
> for (p in remove) s1 <- gsub(p, "", s1)
> s1
[1] " Suicide amongst successful men is becoming rampant Kudos for staing the conversation "
This approach will need to be expanded to apply it to the entire table or vector, of course. But if you put it into a function which returns the final string, you should be able to pass that to one of the apply variants.
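A minimal sketch of such a function follows. The name remove_patterns is hypothetical, and because gsub() is already vectorised over its input, the function can take the whole vector at once without an apply call. Note that the pattern "#\\w+" removes every hashtag, including trailing ones such as #mental.

```r
remove <- c("#\\w+", "http\\w+", "[[:punct:]]")

# Apply each pattern in turn to the whole character vector
remove_patterns <- function(x, patterns) {
  for (p in patterns) {
    x <- gsub(p, "", x)
  }
  x
}

a.vector <- c(
  "#karakamen: Suicide amongst successful men is becoming rampant. Kudos for staing the conversation. #mental",
  "#stiphan: you are phenomenal.. #mental #Writing. httptxjwufmfg"
)

# trimws() drops the leading and trailing spaces left behind
cleaned <- trimws(remove_patterns(a.vector, remove))
cleaned
#> [1] "Suicide amongst successful men is becoming rampant Kudos for staing the conversation"
#> [2] "you are phenomenal"
```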
In case the multiple patterns that you are looking for are fixed and don't change from case-to-case, you can consider creating a concatenated regex that combines all of the patterns into one uber regex pattern.
For the example you provided, you can try:
removePat <- "(#\\w+)|(http\\w+)|([[:punct:]])"
a.vector <- gsub(removePat, "", a.vector)
I had a vector with the statement "my final score" and I wanted to keep only the word final and remove the rest. This is what worked for me, based on Marian's suggestion:
str_remove_all("my final score", "my |score")
Note: "my final score" is just an example. I was dealing with a vector.

How to remove specific special characters in R

I have some sentences like this one.
c = "In Acid-base reaction (page[4]), why does it create water and not H+?"
I want to remove all special characters except for '?&+-/
I know that if I want to remove all special characters, I can simply use
gsub("[[:punct:]]", "", c)
"In Acidbase reaction page4 why does it create water and not H"
However, some special characters such as + - ? are also removed, which I intend to keep.
I tried to create a string of special characters that I can use in some code like this
gsub("[special_string]", "", c)
The best I can do is to come up with this
cat("!\"#$%()*,.:;<=>#[\\]^_`{|}~.")
However, the following code just won't work
gsub("[cat("!\"#$%()*,.:;<=>#[\\]^_`{|}~.")]", "", c)
What should I do to remove special characters, except for a few that I want to keep?
Thanks
gsub("[^[:alnum:][:blank:]+?&/\\-]", "", c)
# [1] "In Acid-base reaction page4 why does it create water and not H+?"
In order to get your method to work, you need to put the literal "]" immediately after the leading "["
gsub("[][!#$%()*,.:;<=>#^_`|~.{}]", "", c)
[1] "In Acid-base reaction page4 why does it create water and not H+?"
You can then put the inner "[" anywhere. If you needed to exclude minus, it would then need to be last. See the ?regex page, after the list of special pre-defined character classes.
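If the set of characters to keep changes from case to case, the character class can be built from a vector. The sketch below shows two ways of doing that in base R; the names s, keep, keep_class, out_negated and out_lookahead are illustrative, and s is used instead of c to avoid shadowing the base function c(). The hyphen is placed last in keep so that it is treated as a literal inside the brackets.

```r
s <- "In Acid-base reaction (page[4]), why does it create water and not H+?"

# Characters to keep; "-" goes last so it is not read as a range
keep <- c("'", "?", "&", "+", "/", "-")
keep_class <- paste(keep, collapse = "")

# Option 1: remove anything that is not alphanumeric, whitespace,
# or one of the kept characters
negated <- paste0("[^[:alnum:][:space:]", keep_class, "]")
out_negated <- gsub(negated, "", s)

# Option 2: remove punctuation unless the next character is a kept one
# (negative lookahead, so perl = TRUE is required)
lookahead <- paste0("(?![", keep_class, "])[[:punct:]]")
out_lookahead <- gsub(lookahead, "", s, perl = TRUE)

out_negated
#> [1] "In Acid-base reaction page4 why does it create water and not H+?"
```

The two options differ for characters that are neither alphanumeric, whitespace, nor ASCII punctuation: option 1 removes them, option 2 leaves them in place.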
I think you're after a regex solution. I'll give you a messy solution and a package add on solution (shameless self promotion).
There's likely a better regex:
x <- "In Acid-base reaction (page[4]), why does it create water and not H+?"
keeps <- c("+", "-", "?")
## Regex solution
gsub(paste0(".*?($|'|", paste(paste0("\\",
keeps), collapse = "|"), "|[^[:punct:]]).*?"), "\\1", x)
#qdap: addon package solution
library(qdap)
strip(x, keeps, lower = FALSE)
## [1] "In Acid-base reaction page why does it create water and not H+?"
