How can whitespace within brackets be trimmed?
x <- c("the li7(li7, p)b13 reaction")
In this particular case, it should only remove the whitespace between the comma and the p, but I'm looking for a general solution.
cases <-c(
"(a,b)",
"(a, b)",
"( a, b)",
"a(a, b)",
"a (a, b)",
"a (a, b) a(a,b) a(a,b )"
)
gsub("[[:space:]](?=[^()]*\\))", "", cases, perl = TRUE)
[1] "(a,b)" "(a,b)" "(a,b)"
[4] "a(a,b)" "a (a,b)" "a (a,b) a(a,b) a(a,b)"
The regex works as follows: when it finds a whitespace character, it looks ahead for a right parenthesis. If it meets any other parenthesis on the way, the lookahead fails and that whitespace is kept. Otherwise the whitespace is replaced with an empty string.
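As a quick check, the same pattern can be applied to the x from the question. This is only a sketch and assumes the brackets are balanced and not nested; the nested string below is made up to show where the lookahead stops helping.

```r
pattern <- "[[:space:]](?=[^()]*\\))"

# The string from the question: only the space after the comma sits
# directly before a ")" with no other parenthesis in between.
x <- "the li7(li7, p)b13 reaction"
trimmed <- gsub(pattern, "", x, perl = TRUE)
trimmed
# [1] "the li7(li7,p)b13 reaction"

# Hypothetical nested input: the first inner space is followed by "(",
# so the lookahead fails and that space is kept.
nested <- gsub(pattern, "", "(a (b) c)", perl = TRUE)
nested
# [1] "(a (b)c)"
```

So for nested brackets a different approach would be needed.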
Alright, I found a solution using str_extract() from the stringr package.
library(stringr)
gsub("\\(+.*[[:blank:]]+.*\\)+",
gsub("[[:blank:]]", "",
str_extract(x, "\\(+.*[[:blank:]]+.*\\)+")), x)
The outer gsub() matches a bracketed substring that contains whitespace and replaces it with the same substring as extracted by str_extract(), after the inner gsub() has stripped its whitespace. Note that str_extract() returns only the first match per string.
Edit: If your pattern within the brackets consists of something not covered by the [[:graph:]]-family, you might need to change that part of the expression.
Edit of edit: switched the [[:graph:]] for ., so this should now work on pretty much anything.
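For strings with several bracketed groups, a base R variant of the same extract-then-replace idea might look like the sketch below. The helper name trim_in_brackets is made up for illustration, and the pattern assumes non-nested round brackets.

```r
# Hypothetical helper: strip whitespace inside every "(...)" group.
trim_in_brackets <- function(x) {
  # Locate each innermost "(...)" group in every string.
  m <- gregexpr("\\([^()]*\\)", x)
  # Remove whitespace from the matched groups and write them back in place.
  regmatches(x, m) <- lapply(regmatches(x, m), function(g) gsub("[[:space:]]", "", g))
  x
}

res <- trim_in_brackets(c("the li7(li7, p)b13 reaction",
                          "a (a, b) a(a,b) a(a,b )",
                          "no brackets here"))
res
# [1] "the li7(li7,p)b13 reaction" "a (a,b) a(a,b) a(a,b)"
# [3] "no brackets here"
```

Text outside the brackets is never touched because only the matched groups are rewritten.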
Let's assume I have the following sentence:
s = c("I don't want to remove punctuation for negations. Instead, I want to remove only general punctuation. For example, keep I wouldn't like it but remove Inter's fan or Man city's fan.")
I would like to have the following outcome:
"I don't want to remove punctuation for negations Instead I want to remove only general punctuation For example keep I wouldn't like it but remove Inter fan or Man city fan."
At the moment if I simply use the code below, I remove both 's and ' in the negations.
s %>% str_replace_all("['']s\\b|[^[:alnum:][:blank:]#_]"," ")
"I don t want to remove punctuation for negations Instead I want to remove only general punctuation For example keep I wouldn t like it but remove Inter fan or Man city fan "
To sum up, I need code that removes general punctuation, including "'s", but keeps negations in their raw form.
Can anyone help me?
Thanks!
You can use a look ahead (?!t) testing that the [:punct:] is not followed by a t.
gsub("[[:punct:]](?!t)\\w?", "", s, perl=TRUE)
#[1] "I don't want to remove punctuation for negations Instead I want to remove only general punctuation For example keep I wouldn't like it but remove Inter fan or Man city fan"
If you want to be stricter, you can additionally test that there is no n before the punctuation with (?<!n).
gsub("(?<!n)[[:punct:]](?!t)\\w?", "", s, perl=TRUE)
Or, to restrict the exception to 't only (thanks to @chris-ruehlemann):
gsub("(?!'t)[[:punct:]]\\w?", "", s, perl=TRUE)
Or remove every punctuation character except ', and also remove 's:
gsub("[^'[:^punct:]]|'s", "", s, perl = TRUE)
The same, but using a lookahead:
gsub("(?!')[[:punct:]]|'s", "", s, perl = TRUE)
We can do it in two steps, remove all punctuation excluding "'", then remove "'s" using fixed match:
gsub("'s", "", gsub("[^[:alnum:][:space:]']", "", s), fixed = TRUE)
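One thing that may be worth checking before picking a variant: in the patterns that end in \\w?, that part also consumes one word character after any matched punctuation, not just the s of 's. The sentence below is made up to show the difference; it is a sketch, not a claim about which variant is right for your data.

```r
# Hypothetical test sentence with an abbreviation, a possessive and a negation.
s2 <- "The U.S. team's coach isn't here."

# Variant with \\w?: the "S" after the first "." is removed as well.
with_w <- gsub("(?!'t)[[:punct:]]\\w?", "", s2, perl = TRUE)
with_w
# [1] "The U team coach isn't here"

# Variant that removes only the punctuation itself, plus a literal 's.
punct_only <- gsub("(?!')[[:punct:]]|'s", "", s2, perl = TRUE)
punct_only
# [1] "The US team coach isn't here"
```

On the original s both give the same result, since there every punctuation mark other than 's is followed by a space or the end of the string.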
I have the following dataframe:
df<-c("red apples,(golden,red delicious),bananas,(cavendish,lady finger),golden pears","yellow pineapples,red tomatoes,(roma,vine),orange carrots")
I want to remove the word preceding a comma and parentheses so my output would yield:
[1] "golden,red delicious),cavendish,lady finger),golden pears" "yellow pineapples,roma,vine),orange carrots"
Ideally, the right parenthesis would be removed as well. But I can manage that delete with gsub.
I feel like a lookbehind might work but can't seem to code it correctly.
Thanks!
edit: I amended the dataframe so that the word I want deleted is a string of two words.
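We can use base R with gsub to remove the characters. We match a word (\\w+), followed by whitespace (\\s+), another word (\\w+), a comma (,) and an opening parenthesis (\\(), and replace the match with an empty string (""). Because this pattern requires two words, the single word bananas in the first string is not matched.
gsub("\\w+\\s+\\w+,\\(", "", df)
#[1] "golden,red delicious),bananas,(cavendish,lady finger),golden pears"
#[2] "yellow pineapples,roma,vine),orange carrots"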
Or, if the comma is what delimits the words, we can match any run of characters that are not a comma, followed by ,(
gsub("[^,]+,\\(", "", df)
#[1] "golden,red delicious),cavendish,lady finger),golden pears"
#[2] "yellow pineapples,roma,vine),orange carrots"
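Since the question also mentions removing the right parentheses, the two steps can be chained in base R. A minimal sketch, reusing the df from the question:

```r
df <- c("red apples,(golden,red delicious),bananas,(cavendish,lady finger),golden pears",
        "yellow pineapples,red tomatoes,(roma,vine),orange carrots")

# Step 1: drop the item that precedes each ",(" together with the ",(" itself.
step1 <- gsub("[^,]+,\\(", "", df)

# Step 2: drop the remaining right parentheses; fixed = TRUE treats ")" literally.
cleaned <- gsub(")", "", step1, fixed = TRUE)
cleaned
# [1] "golden,red delicious,cavendish,lady finger,golden pears"
# [2] "yellow pineapples,roma,vine,orange carrots"
```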
Using the tidyverse package stringr, I was able to make your data appear the way you'd want it with two function calls separated by a pipe. The pipe comes from the package magrittr which loads with dplyr and/or tidyverse.
I used stringr::str_replace_all to perform two substitutions which remove the words you wanted to take out. Note the syntax for multiple substitutions within this function.
str_replace_all(string, c("first string to get rid of" = "string to replace it with", "second string to get rid of" = "second replacement string"))
You might find it more intuitive to combine all the "get rid of strings" first followed by combining the replacement strings, but each element within the c() is the string to be replaced (in quotes) connected to its replacement (also in quotes) with "=". Each of those replaced=replacement pairs is separated by a comma.
Using str_replace_all, I first took out all text which starts with "," and ends with ",(" using the regular expression ",[a-z ]+,\\(", which refers to a comma, followed by any number of lowercase letters and spaces (allowing chunks with multiple words to be detected), followed by ",(". Note the escape for the "(". If you thought there might be capital letters you would use [a-zA-Z ] instead. In either case, note the space before the "]".
Because you wanted to take out the word, but not the comma preceding it, I replaced the removed text with ",".
This doesn't remove "red apples" in the first string because it doesn't follow a comma. The expression "^[a-z ]+,\\(" refers to any number of lowercase letters and spaces coming before ",(" at the beginning of the string (the ^ "anchors" your pattern to the beginning of the string). Therefore it removes "red apples" or any other example where the text you want to remove starts the string. For these cases, it makes sense to replace it with nothing ("") because you want the first character of the remaining string to appear at the beginning.
Together, the two substitutions remove the offending text whether it starts the string or sits in the middle of it, so in that sense the approach is more or less general.
str_remove_all("\\)") then removes the right parentheses throughout.
library(stringr)
library(magrittr)
df<-c("red apples,(golden,red delicious),bananas,(cavendish,lady finger),golden pears",
"yellow pineapples,red tomatoes,(roma,vine),orange carrots")
str_replace_all(df, c(",[a-z ]+,\\(" = ",",
"^[a-z ]+,\\(" = "")) %>%
str_remove_all("\\)")
[1] "golden,red delicious,cavendish,lady finger,golden pears"
[2] "yellow pineapples,roma,vine,orange carrots"
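If you would rather not load stringr, the pattern = replacement idea can be imitated in base R. This is a rough sketch: replace_all_seq is a made-up helper, and it assumes the pairs should be applied one after another, in the order given.

```r
# Hypothetical helper: apply each "pattern" = "replacement" pair in turn.
replace_all_seq <- function(x, pairs) {
  for (pat in names(pairs)) {
    x <- gsub(pat, pairs[[pat]], x, perl = TRUE)
  }
  x
}

df <- c("red apples,(golden,red delicious),bananas,(cavendish,lady finger),golden pears",
        "yellow pineapples,red tomatoes,(roma,vine),orange carrots")

out <- replace_all_seq(df, c(",[a-z ]+,\\(" = ",",   # item in the middle: keep one comma
                             "^[a-z ]+,\\(" = "",    # item at the start: drop it entirely
                             "\\)"          = ""))   # leftover right parentheses
out
# [1] "golden,red delicious,cavendish,lady finger,golden pears"
# [2] "yellow pineapples,roma,vine,orange carrots"
```

The order matters here: the middle-of-string pattern runs first, so the start-of-string pattern only sees what is left.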
I have text I am cleaning up in R. I want to use stringi, but am happy to use other packages.
Some of the words are broken over two lines. So I get a sub-string "halfword-\nsecondhalfword".
I also have strings that are just "----\nword" and " -\n" (and some others) that I do not want to replace.
What I want to do is identify all sub-strings "[a-z]-\n" and then keep the letter [a-z], but remove the -\n characters.
I do not want to remove all -\n , and I do not want to remove the letter [a-z].
Thanks!
You may make use of word boundaries to match -<LF> only in between word characters:
gsub("\\b-\n\\b", "", x)
gsub("(*UCP)\\b-\n\\b", "", x, perl=TRUE)
stringr::str_replace_all(x, "\\b-\n\\b", "")
The latter two support word boundaries between any Unicode word characters.
If you want to only remove -<LF> between letters you may use
gsub("([a-zA-Z])-\n([a-zA-Z])", "\\1\\2", x)
gsub("(\\p{L})-\n(\\p{L})", "\\1\\2", x, perl=TRUE)
stringr::str_replace_all(x, "(\\p{L})-\n(\\p{L})", "\\1\\2")
If you need to only support lowercase letters, remove A-Z in the first gsub and replace \p{L} with \p{Ll} in the latter two.
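To see how the letter-based pattern treats the strings from the question, here is a small sketch with the three example strings (lowercase letters only). The lookbehind variant is an alternative I am adding for the case where only the preceding letter matters.

```r
x <- c("halfword-\nsecondhalfword", "----\nword", " -\n")

# Remove "-\n" only when it sits between two lowercase letters;
# the captured letters are put back via \\1 and \\2.
joined <- gsub("([a-z])-\n([a-z])", "\\1\\2", x)
joined
# [1] "halfwordsecondhalfword" "----\nword"             " -\n"

# Alternative: only require a lowercase letter before "-\n".
# The lookbehind checks the letter without consuming it.
joined_lb <- gsub("(?<=[a-z])-\n", "", x, perl = TRUE)
```

The second and third strings are left alone because the -\n there is not preceded by a letter.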
I have tried to split documents into sentences, but there are some strange outcomes due to punctuation inside brackets. So I'd like to remove any punctuation inside the brackets.
example input:
A <- c('How to remove all punctuations(like this?) in side it?')
wanted output:
"How to remove all punctuations(like this) in side it?"
Perhaps something like this using a positive lookahead?
gsub("[?!;,.](?=\\))", "", A, perl = TRUE)
#[1] "How to remove all punctuations(like this) in side it?"
Or using POSIX character classes:
gsub("[[:punct:]](?=\\))", "", A, perl = TRUE)
Or if you need to match other types of closing brackets (e.g. curly, square):
gsub("[[:punct:]](?=[)\\]}])", "", A, perl = TRUE)
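The patterns above only remove a punctuation mark that sits directly before the closing bracket. If punctuation can appear anywhere inside the brackets, the lookahead can be widened to skip over non-bracket characters first. A sketch, assuming balanced, non-nested round brackets; the input string is made up:

```r
# Hypothetical input with punctuation in the middle of the bracketed part.
A2 <- "Split (e.g., like this?) here. Next one!"

# Match a punctuation mark only if a ")" follows with no other bracket in between.
inside_clean <- gsub("[?!;,.](?=[^()]*\\))", "", A2, perl = TRUE)
inside_clean
# [1] "Split (eg like this) here. Next one!"
```

Punctuation outside the brackets is kept, so sentence splitting can still rely on it.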
I have some sentences like this one.
c = "In Acid-base reaction (page[4]), why does it create water and not H+?"
I want to remove all special characters except for '?&+-/
I know that if I want to remove all special characters, I can simply use
gsub("[[:punct:]]", "", c)
"In Acidbase reaction page4 why does it create water and not H"
However, some special characters such as + - ? are also removed, which I intend to keep.
I tried to create a string of special characters that I can use in some code like this
gsub("[special_string]", "", c)
The best I can do is to come up with this
cat("!\"#$%()*,.:;<=>@[\\]^_`{|}~.")
However, the following code just won't work
gsub("[cat("!\"#$%()*,.:;<=>@[\\]^_`{|}~.")]", "", c)
What should I do to remove special characters, except for a few that I want to keep?
Thanks
gsub("[^[:alnum:][:blank:]+?&/\\-]", "", c)
# [1] "In Acid-base reaction page4 why does it create water and not H+?"
In order to get your method to work, you need to put the literal "]" immediately after the leading "[":
gsub("[][!#$%()*,.:;<=>@^_`|~.{}]", "", c)
[1] "In Acid-base reaction page4 why does it create water and not H+?"
You can then put the inner "[" anywhere. If you also needed to remove the minus sign, it would need to be last in the set. See the ?regex help page, in the text that follows the list of predefined character classes.
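Another way to get at the "string of characters to keep" idea is to build the pattern from that string with a negative lookahead, so the removal set never has to be spelled out. A sketch using perl = TRUE; the variable names are mine, and the keep string puts "-" first so it is read literally inside the brackets.

```r
txt <- "In Acid-base reaction (page[4]), why does it create water and not H+?"

# Characters to keep; "-" comes first so it is not read as a range.
keep <- "-+?&/'"

# Match any punctuation character that is not one of the kept ones.
pattern <- paste0("(?![", keep, "])[[:punct:]]")
kept <- gsub(pattern, "", txt, perl = TRUE)
kept
# [1] "In Acid-base reaction page4 why does it create water and not H+?"
```

If the keep set ever included "]", "^" or a backslash, those would need escaping inside the brackets.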
I think you're after a regex solution. I'll give you a messy solution and a package add on solution (shameless self promotion).
There's likely a better regex:
x <- "In Acid-base reaction (page[4]), why does it create water and not H+?"
keeps <- c("+", "-", "?")
## Regex solution
gsub(paste0(".*?($|'|", paste(paste0("\\",
keeps), collapse = "|"), "|[^[:punct:]]).*?"), "\\1", x)
#qdap: addon package solution
library(qdap)
strip(x, keeps, lower = FALSE)
## [1] "In Acid-base reaction page why does it create water and not H+?"