How to remove specific special characters in R

I have some sentences like this one.
c = "In Acid-base reaction (page[4]), why does it create water and not H+?"
I want to remove all special characters except for '?&+-/
I know that if I want to remove all special characters, I can simply use
gsub("[[:punct:]]", "", c)
"In Acidbase reaction page4 why does it create water and not H"
However, some special characters such as + - ? are also removed, which I intend to keep.
I tried to create a string of special characters that I can use in some code like this
gsub("[special_string]", "", c)
The best I can do is to come up with this
cat("!\"#$%()*,.:;<=>#[\\]^_`{|}~.")
However, the following code just won't work
gsub("[cat("!\"#$%()*,.:;<=>#[\\]^_`{|}~.")]", "", c)
What should I do to remove special characters, except for a few that I want to keep?
Thanks

gsub("[^[:alnum:][:blank:]+?&/\\-]", "", c)
# [1] "In Acid-base reaction page4 why does it create water and not H+?"
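The question also lists the apostrophe among the characters to keep. A minimal sketch of the same negated-class idea with the apostrophe added and the hyphen placed last instead of escaped; the vector x2 is an invented example, not data from the question:

```r
# x2 is a hypothetical test vector (not from the original post)
x2 <- c("Why doesn't H+ form (see page[4])?", "Acid/base & salt: 100%")

# Keep letters, digits, blanks and the characters ' ? & + / -
# The hyphen is written last so that it is read literally.
cleaned2 <- gsub("[^[:alnum:][:blank:]'?&+/-]", "", x2)
print(cleaned2)
```

Everything outside the listed characters is dropped, so brackets, colons and percent signs disappear while the contraction survives.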

In order to get your method to work, you need to put the literal "]" immediately after the leading "["
gsub("[][!#$%()*,.:;<=>#^_`|~.{}]", "", c)
[1] "In Acid-base reaction page4 why does it create water and not H+?"
You can then put the inner "[" anywhere. If you needed to exclude minus, it would then need to be last. See the ?regex help page, in the text just after the list of pre-defined character classes.
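A small sketch of those placement rules, assuming the default regex engine: "]" goes first inside the brackets and "-" goes last, so both are read as literal characters. The character set chosen here is only for illustration:

```r
x <- "In Acid-base reaction (page[4]), why does it create water and not H+?"

# "]" first, then "[", "(", ")", ",", and "-" last
no_brackets <- gsub("[][(),-]", "", x)
print(no_brackets)
```

Here the hyphen is removed on purpose, to show that a trailing "-" is treated as a plain character rather than as a range operator.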

I think you're after a regex solution. I'll give you a messy solution and a package add on solution (shameless self promotion).
There's likely a better regex:
x <- "In Acid-base reaction (page[4]), why does it create water and not H+?"
keeps <- c("+", "-", "?")
## Regex solution
gsub(paste0(".*?($|'|", paste(paste0("\\",
keeps), collapse = "|"), "|[^[:punct:]]).*?"), "\\1", x)
#qdap: addon package solution
library(qdap)
strip(x, keeps, lower = FALSE)
## [1] "In Acid-base reaction page why does it create water and not H+?"
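If escaping the kept characters inside a regex feels fragile, one regex-light alternative (a sketch, not part of either solution above) is to split the string into single characters and filter them against the keeps vector:

```r
x <- "In Acid-base reaction (page[4]), why does it create water and not H+?"
keeps <- c("+", "-", "?")

# Split into single characters, then drop punctuation that is not in keeps
chars <- strsplit(x, "")[[1]]
is_punct <- grepl("[[:punct:]]", chars)
cleaned <- paste(chars[!is_punct | chars %in% keeps], collapse = "")
print(cleaned)
```

Because the kept characters are compared with %in% rather than pasted into a pattern, none of them needs escaping. Unlike strip(), this keeps digits.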

Related

How to remove punctuation excluding negations?

Let's assume I have the following sentence:
s = c("I don't want to remove punctuation for negations. Instead, I want to remove only general punctuation. For example, keep I wouldn't like it but remove Inter's fan or Man city's fan.")
I would like to have the following outcome:
"I don't want to remove punctuation for negations Instead I want to remove only general punctuation For example keep I wouldn't like it but remove Inter fan or Man city fan."
At the moment if I simply use the code below, I remove both 's and ' in the negations.
s %>% str_replace_all("['']s\\b|[^[:alnum:][:blank:]#_]"," ")
"I don t want to remove punctuation for negations Instead I want to remove only general punctuation For example keep I wouldn t like it but remove Inter fan or Man city fan "
To sum up, I need to have a code that removes general punctuation, including "'s" except for negations that I want to keep in their raw format.
Can anyone help me ?
Thanks!
You can use a look ahead (?!t) testing that the [:punct:] is not followed by a t.
gsub("[[:punct:]](?!t)\\w?", "", s, perl=TRUE)
#[1] "I don't want to remove punctuation for negations Instead I want to remove only general punctuation For example keep I wouldn't like it but remove Inter fan or Man city fan"
In case you want to be more strict you can test in addition if there is no n before with (?<!n).
gsub("(?<!n)[[:punct:]](?!t)\\w?", "", s, perl=TRUE)
Or, to restrict it to only 't (thanks to @chris-ruehlemann):
gsub("(?!'t)[[:punct:]]\\w?", "", s, perl=TRUE)
Or remove every punctuation character except ', and also remove 's:
gsub("[^'[:^punct:]]|'s", "", s, perl = TRUE)
The same, but using a lookahead:
gsub("(?!')[[:punct:]]|'s", "", s, perl = TRUE)
We can do it in two steps, remove all punctuation excluding "'", then remove "'s" using fixed match:
gsub("'s", "", gsub("[^[:alnum:][:space:]']", "", s), fixed = TRUE)
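A sketch of the two-step idea on a shorter vector. The sample sentences in s2 are made up for illustration, and the second step uses "'s\\b" rather than a fixed match so that only a trailing 's is removed:

```r
# s2 is a hypothetical sample, not the sentence from the question
s2 <- c("I don't like Inter's fan.", "She wouldn't, but Man city's fan would.")

# Step 1: drop all punctuation except the apostrophe
step1 <- gsub("[^[:alnum:][:space:]']", "", s2)

# Step 2: drop possessive 's at the end of a word; n't is untouched
step2 <- gsub("'s\\b", "", step1, perl = TRUE)
print(step2)
```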

How to remove punctuation inside brackets in R

I have tried to split documents into sentences, but there are some strange outcomes due to punctuation inside brackets. So I'd like to remove any punctuation.
example input:
A <- c('How to remove all punctuations(like this?) in side it?')
wanted output:
"How to remove all punctuations(like this) in side it?"
Perhaps something like this using a positive lookahead?
gsub("[?!;,.](?=\\))", "", A, perl = T)
#[1] "How to remove all punctuations(like this) in side it?"
Or using POSIX character classes
gsub("[[:punct:]](?=\\))", "", A, perl = T)
Or if you need to match other types of closing brackets (e.g. curly, square)
gsub("[[:punct:]](?=[)\\]}])", "", A, perl = T)
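The patterns above only remove punctuation that sits directly before the closing bracket. A possible extension, assuming non-nested parentheses, is to let the lookahead skip over any non-bracket characters first; A2 is an invented example:

```r
A <- c('How to remove all punctuations(like this?) in side it?')
A2 <- "Remove these (yes, really?) but not this one?"  # hypothetical input

# Match punctuation that is followed, without crossing another bracket,
# by a closing parenthesis
inside <- gsub("[?!;,.](?=[^()]*\\))", "", c(A, A2), perl = TRUE)
print(inside)
```

The final question mark in each string is kept because no closing parenthesis follows it.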

regex - define boundary using characters & delimiters

I realize this is a rather simple question and I have searched throughout this site, but just can't seem to get my syntax right for the following regex challenges. I'm looking to do two things. First have the regex to pick up the first three characters and stop at a semicolon. For example, my string might look as follows:
Apt;House;Condo;Apts;
I'd like to get this:
Apartment;House;Condo;Apartment
I'd also like to create a regex to substitute a word in between delimiters, while keep others unchanged. For example, I'd like to go from this:
feline;labrador;bird;labrador retriever;labrador dog; lab dog;
To this:
feline;dog;bird;dog;dog;dog;
Below is the regex I'm working with. I know ^ denotes the beginning of the string and $ the end. I've tried many variations, and am making substitutions, but am not achieving my desired output. I'm also guessing one regex could work for both? Thanks for your help everyone.
df$variable <- gsub("^apt$;", "Apartment;", df$variable, ignore.case = TRUE)
Here is an approach that uses look behind (so you need perl=TRUE):
> tmp <- c("feline;labrador;bird;labrador retriever;labrador dog; lab dog;",
+ "lab;feline;labrador;bird;labrador retriever;labrador dog; lab dog")
> gsub( "(?<=;|^) *lab[^;]*", "dog", tmp, perl=TRUE)
[1] "feline;dog;bird;dog;dog;dog;"
[2] "dog;feline;dog;bird;dog;dog;dog"
The (?<=;|^) is the look behind: it says that any match must be preceded by either a semi-colon or the beginning of the string, but what it matches is not included in the part to be replaced. The space followed by * will match 0 or more spaces (since your example string had one case where there was a space between the semi-colon and the lab). It then matches a literal lab followed by 0 or more characters other than a semi-colon. Since * is greedy by default, this will match everything up to, but not including, the next semi-colon or the end of the string. You could also include a positive look ahead (?=;|$) to make sure it goes all the way to the next semi-colon or end of string, but in this case the greediness of * will take care of that.
You could also use the non-greedy modifier, then force the match to extend to the next semi-colon or the end of the string:
> gsub( "(?<=;|^) *lab.*?(?=;|$)", "dog", tmp, perl=TRUE)
[1] "feline;dog;bird;dog;dog;dog;"
[2] "dog;feline;dog;bird;dog;dog;dog"
The .*? will match 0 or more characters, but as few as it can get away with, stretching just until the next semi-colon or end of line.
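To make the greedy versus non-greedy difference concrete, here is a small sketch on an invented string (tmp3 is not from the question):

```r
tmp3 <- "a;labrador retriever;b"  # hypothetical input

# Greedy .* runs to the end of the string and swallows the last field
greedy <- gsub("lab.*", "dog", tmp3)

# Non-greedy .*? stops as soon as the lookahead sees ";" or the end
lazy <- gsub("lab.*?(?=;|$)", "dog", tmp3, perl = TRUE)

# A negated class gives the same field-limited match without perl = TRUE
negated <- gsub("lab[^;]*", "dog", tmp3)

print(c(greedy, lazy, negated))
```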
You can skip the look behind (and perl=TRUE) if you match the delimiter, then include it in the replacement:
> gsub("(;|^) *lab[^;]*", "\\1dog", tmp)
[1] "feline;dog;bird;dog;dog;dog;"
[2] "dog;feline;dog;bird;dog;dog;dog"
With this method you need to be careful to match the delimiter on only one side (the first in my example), since the match consumes the delimiter (which the look-ahead and look-behind do not). If you consume both delimiters, the next field will be skipped and only every other field will be considered for replacement.
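The "every other field" pitfall is easy to reproduce. A sketch on an invented string where every field should be replaced:

```r
tmp2 <- "lab;lab;lab;lab"  # hypothetical input

# Consuming the delimiter on both sides: the ";" that ends one match is
# no longer available to start the next one, so alternate fields are missed
both_sides <- gsub("(;|^)lab(;|$)", "\\1dog\\2", tmp2)

# Consuming the delimiter on the left only: every field is considered
one_side <- gsub("(;|^)lab[^;]*", "\\1dog", tmp2)

print(c(both_sides, one_side))
```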
I'd recommend doing this in two steps:
Split the string by the delimiters
Do the replacements
(optional, if that's what you gotta do) Smash the strings back together.
To split the string, I'd use the stringr library. But you can use base R too:
myString <- "Apt;House;Condo;Apts;"
# base R
splitString <- unlist(strsplit(myString, ";", fixed = T))
# with stringr
library(stringr)
splitString <- as.vector(str_split(myString, ";", simplify = T))
Once you've done that, THEN you can do the text substitution:
# base R
fixedApts <- gsub("^Apt$|^Apts$", "Apartment", splitString)
# with stringr
fixedApts <- str_replace(splitString, "^Apt$|^Apts$", "Apartment")
# then do the rest of your replacements
There's probably a better way to do the replacements than regular expressions (using switch(), maybe?)
Use paste0(fixedApts, collapse = ";") to collapse the vector into a single string at the end if that's what you need to do.
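One possible regex-free way to do the replacement step is a named lookup vector. This is only a sketch: the lookup table and its entries are assumptions made for illustration.

```r
myString <- "Apt;House;Condo;Apts;"

# 1. Split on the delimiter (strsplit drops the trailing empty field)
fields <- strsplit(myString, ";", fixed = TRUE)[[1]]

# 2. Replace exact matches using a hypothetical lookup table
lookup <- c(Apt = "Apartment", Apts = "Apartment")
hit <- fields %in% names(lookup)
fields[hit] <- unname(lookup[fields[hit]])

# 3. Join the fields back together with the delimiter
result <- paste(fields, collapse = ";")
print(result)
```

Adding another abbreviation then only means adding an entry to lookup, with no pattern to maintain.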

R: Trim whitespace within brackets

How can whitespace within brackets be trimmed?
x <- c("the li7(li7, p)b13 reaction")
In this particular case, it should only remove the whitespace between the comma and the p, but I'm looking for a general solution.
cases <-c(
"(a,b)",
"(a, b)",
"( a, b)",
"a(a, b)",
"a (a, b)",
"a (a, b) a(a,b) a(a,b )"
)
gsub("[[:space:]](?=[^()]*\\))", "", cases, perl = TRUE)
[1] "(a,b)" "(a,b)" "(a,b)"
[4] "a(a,b)" "a (a,b)" "a (a,b) a(a,b) a(a,b)"
The regex works as follows: when it finds a space, it looks ahead for a right parenthesis. If it meets any other parenthesis on the way, that space is left alone and the search moves on to the next space. It then replaces each matching space with an empty string.
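Applied to the string from the question, the same pattern behaves as sketched below. It assumes the parentheses are not nested:

```r
x <- c("the li7(li7, p)b13 reaction")

# Remove whitespace only when a ")" follows before any other parenthesis
trimmed <- gsub("[[:space:]](?=[^()]*\\))", "", x, perl = TRUE)
print(trimmed)
```

The spaces around "the" and "reaction" survive because the lookahead either hits "(" first or reaches the end of the string without finding ")".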
Alright, I found a solution using str_extract() in the stringr-package.
library(stringr)
gsub("\\(+.*[[:blank:]]+.*\\)+",
     gsub("[[:blank:]]", "",
          str_extract(x, "\\(+.*[[:blank:]]+.*\\)+")), x)
This uses gsub() to search for a string pattern with a whitespace within brackets, then uses another gsub to replace it with the extracted part without the whitespace.
Edit: If your pattern within the brackets consists of something not covered by the [[:graph:]]-family, you might need to change that part of the expression.
Edit of edit: switched the [[:graph:]] for ., so this should now work on pretty much anything.

remove multiple patterns from text vector r

I want to remove multiple patterns from multiple character vectors. Currently I am doing:
a.vector <- gsub("#\\w+", "", a.vector)
a.vector <- gsub("http\\w+", "", a.vector)
a.vector <- gsub("[[:punct:]]", "", a.vector)
etc etc.
This is painful. I was looking at this question & answer: R: gsub, pattern = vector and replacement = vector but it's not solving the problem.
Neither the mapply nor the mgsub are working. I made these vectors
remove <- c("#\\w+", "http\\w+", "[[:punct:]]")
substitute <- c("")
Neither mapply(gsub, remove, substitute, a.vector) nor mgsub(remove, substitute, a.vector) worked.
a.vector looks like this:
[4951] "#karakamen: Suicide amongst successful men is becoming rampant. Kudos for staing the conversation. #mental"
[4952] "#stiphan: you are phenomenal.. #mental #Writing. httptxjwufmfg"
I want:
[4951] "Suicide amongst successful men is becoming rampant Kudos for staing the conversation #mental"
[4952] "you are phenomenal #mental #Writing"
I know this answer is late on the scene but it stems from my dislike of having to manually list the removal patterns inside the grep functions (see other solutions here). My idea is to set the patterns beforehand, retain them as a character vector, then paste them (i.e. when "needed") using the regex separator "|":
library(stringr)
remove <- c("#\\w+", "http\\w+", "[[:punct:]]")
a.vector <- str_remove_all(a.vector, paste(remove, collapse = "|"))
Yes, this does effectively do the same as some of the other answers here, but I think my solution allows you to retain the original "character removal vector" remove.
Try combining your subpatterns using |. For example
> s <- "#karakamen: Suicide amongst successful men is becoming rampant. Kudos for staing the conversation. #mental"
> gsub("#\\w+|http\\w+|[[:punct:]]", "", s)
[1] " Suicide amongst successful men is becoming rampant Kudos for staing the conversation #mental"
But this could become problematic if you have a large number of patterns, or if the result of applying one pattern creates matches to others.
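A tiny sketch of that second caveat, with made-up patterns: a single combined pass and a sequential pass can give different results when removing one pattern creates a match for another.

```r
s_demo <- "a-b"            # hypothetical input
pats <- c("-", "ab")       # hypothetical patterns

# One combined pass: "ab" does not occur in the original string
combined <- gsub(paste(pats, collapse = "|"), "", s_demo)

# Sequential passes: removing "-" first creates "ab", which is then removed
sequential <- s_demo
for (p in pats) sequential <- gsub(p, "", sequential)

print(c(combined, sequential))
```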
Consider creating your remove vector as you suggested, then applying it in a loop
> s1 <- s
> remove<-c("#\\w+","http\\w+","[[:punct:]]")
> for (p in remove) s1 <- gsub(p, "", s1)
> s1
[1] " Suicide amongst successful men is becoming rampant Kudos for staing the conversation #mental"
This approach will need to be expanded to apply it to the entire table or vector, of course. But if you put it into a function which returns the final string, you should be able to pass that to one of the apply variants.
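A minimal sketch of such a function. The name remove_patterns and the sample vector are assumptions for illustration; since gsub() is already vectorised over its input, the function can take the whole vector directly:

```r
# remove_patterns is a hypothetical helper, not an existing function
remove_patterns <- function(x, patterns) {
  # Apply each pattern in turn to the whole character vector
  for (p in patterns) x <- gsub(p, "", x)
  trimws(x)  # drop leading/trailing whitespace left behind
}

remove <- c("#\\w+", "http\\w+", "[[:punct:]]")
demo <- c("#user: hello, world. httpabc", "no tags here!")  # made-up data
out <- remove_patterns(demo, remove)
print(out)
```

The order of the patterns matters here: the punctuation pattern goes last so that "#" is still present when the "#\\w+" pattern runs.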
In case the multiple patterns that you are looking for are fixed and don't change from case-to-case, you can consider creating a concatenated regex that combines all of the patterns into one uber regex pattern.
For the example you provided, you can try:
removePat <- "(#\\w+)|(http\\w+)|([[:punct:]])"
a.vector <- gsub(removePat, "", a.vector)
I had a vector with the statement "my final score" and I wanted to keep only the word final and remove the rest. This is what worked for me, based on Marian's suggestion:
str_remove_all("my final score", "my |score")
note: "my final score" is just an example. I was dealing with a vector.
