add characters before special characters in a string


I would like to add some characters to a string before the special character "(" and after the special character ")".
The positions of "(" and ")" change from one string to the next.
If it helps, here is what I tried. I can detect and extract the special character, but I don't know how to piece the string back together.
library(stringr)
a <- "a(b"
grepl("[[:punct:]]", a) # TRUE: a special character exists
x <- "[[:punct:]]"
image <- str_extract(a, x) # extract the special character
image
For example, given
"I want to go out (i.e. now). "
the result should look like:
"I want to go out again (i.e. now) thanks."
That is, I want to add "again" and "thanks" to the sentence.
Thank you for helping!

Use str_replace, once for each bracket:
library(stringr)
str_replace("I want to go out (i.e. now).", "\\(", "again (") %>%
str_replace("\\)", ") thanks")

We can use sub. Match the bracketed text, including the brackets, and capture it as a group; then replace the match with 'again', followed by the backreference to the captured group (\\1), followed by 'thanks'.
sub("(\\([^)]+\\))\\..*", "again \\1 thanks.", str1)
#[1] "I want to go out again (i.e. now) thanks."
Or using two capture groups
sub("(\\([^)]+\\))(.*)\\s+", "again \\1 thanks\\2", str1)
#[1] "I want to go out again (i.e. now) thanks."
data
str1 <- "I want to go out (i.e. now). "
NOTE: Using only base R
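The same idea can be wrapped in a small reusable function. The following is a minimal base R sketch and not part of either answer; the name wrap_parens and its arguments are hypothetical. It assumes each string holds at most one pair of brackets, since sub only replaces the first match.

```r
# Hypothetical helper: put `before` ahead of the first "(" and `after` behind the first ")".
wrap_parens <- function(x, before, after) {
  # fixed = TRUE treats "(" and ")" as literal characters, so no escaping is needed
  x <- sub("(", paste0(before, " ("), x, fixed = TRUE)
  sub(")", paste0(") ", after), x, fixed = TRUE)
}

str1 <- c("I want to go out (i.e. now).", "no brackets here")
wrap_parens(str1, "again", "thanks")
# [1] "I want to go out again (i.e. now) thanks." "no brackets here"
```

Strings without brackets pass through unchanged, which may or may not be what you want for your data.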

Related

strsplit returning nested list with backslashes and quotes added \"

I'm using R to split a messy string of gene names and as a first step am simply attempting to break the string into a list by spaces between characters using strsplit and regex but have been coming across this weird bug:
string <- ' " "KPNA2" "UBE2C" "CENPF" ## [4] "HMGB2"'
ccGenes <- strsplit(string, split = '\\s+')[[1]]
This returns what looks like a length 1 nested list containing an object of type "character [8]" (I'm not sure what type of object this indicates), with a backslash placed in front of each double quote (" -> \"). It looks like this when printed:
"" "\"" "\"KPNA2\"" "\"UBE2C\"" "\"CENPF\"" "##" "[4]" "\"HMGB2\""
What I want is a list that looks like this:
" "KPNA2" "UBE2C" "KPNA2" "UBE2C" etc...
Afterwards I will clean up the quotes and the non-gene items. I realize this is probably not the most efficient way to go about cleaning up this string; I'm still relatively new to programming and am more curious why the strsplit line I'm using is returning such weird output.
Thanks!

You can use a base R approach with
regmatches(string, gregexpr('(?<=")\\w+(?=")', string, perl=TRUE))[[1]]
# => [1] "KPNA2" "UBE2C" "CENPF" "HMGB2"
Mind the perl=TRUE argument: it is necessary because it enables the PCRE regex syntax, which supports the lookarounds used here.
Details:
(?<=") - a positive lookbehind that requires a " char to occur immediately to the left of the current position
\w+ - one or more letters, digits or underscores
(?=") - a positive lookahead that requires a " char to occur immediately to the right of the current position.
If you want to avoid matching underscores and lowercase letters, replace \\w+ with [A-Z0-9]+.

We may use str_extract_all to extract the alphanumeric characters after the " character: match one or more alphanumeric characters ([[:alnum:]]+) that follow a " (asserted with a regex lookbehind, (?<=")).
library(stringr)
str_extract_all(string, '(?<=")[[:alnum:]]+')[[1]]
[1] "KPNA2" "UBE2C" "CENPF" "HMGB2"
Also, if we want to use strsplit from base R, split not only on the spaces but also on the double quotes and the other unwanted characters (# and the bracketed index such as [4]), then drop the empty strings with setdiff:
setdiff(strsplit(string, split = '["# ]+|\\[\\d+\\]')[[1]], "")
[1] "KPNA2" "UBE2C" "CENPF" "HMGB2"
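On the original symptom: the backslashes are not added by strsplit. They are how print() escapes double quotes that are part of a string. A small sketch, reusing string from the question, that compares the printed form with the actual characters:

```r
string <- ' " "KPNA2" "UBE2C" "CENPF" ## [4] "HMGB2"'
parts <- strsplit(string, split = "\\s+")[[1]]

print(parts[3])       # printed form escapes the embedded quotes: "\"KPNA2\""
writeLines(parts[3])  # raw characters, no backslashes: "KPNA2"
n <- nchar(parts[3])  # 7: five letters plus the two quote characters
n
```

So the split itself already behaves as intended; only the quote characters remain to be removed.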

Remove characters prior to parentheses but after the preceding comma in R

I have the following data (a character vector that I call df):
df<-c("red apples,(golden,red delicious),bananas,(cavendish,lady finger),golden pears","yellow pineapples,red tomatoes,(roma,vine),orange carrots")
I want to remove the item that immediately precedes a comma and an opening parenthesis, so that my output would be:
[1] "golden,red delicious),cavendish,lady finger),golden pears" "yellow pineapples,roma,vine),orange carrots"
Ideally, the right parenthesis would be removed as well. But I can manage that delete with gsub.
I feel like a lookbehind might work but can't seem to code it correctly.
Thanks!
edit: I amended the dataframe so that the word I want deleted is a string of two words.

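We can use base R with gsub to remove the characters. We match a word (\\w+), optionally followed by whitespace and a second word ((\\s+\\w+)?), then a comma (,) and an opening parenthesis (\\(), and replace the match with an empty string (""). The second word is optional so that single-word items such as "bananas" are removed too.
gsub("\\w+(\\s+\\w+)?,\\(", "", df)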
#[1] "golden,red delicious),cavendish,lady finger),golden pears"
#[2] "yellow pineapples,roma,vine),orange carrots"
Or, if the comma is what delimits the items, we can build the pattern from characters that are not a comma ([^,]+), which handles items with any number of words:
gsub("[^,]+,\\(", "", df)
#[1] "golden,red delicious),cavendish,lady finger),golden pears"
#[2] "yellow pineapples,roma,vine),orange carrots"
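The question also mentions removing the right parenthesis. As a sketch under the assumption that the data look like the example (no stray ")" directly before an item that should be kept), both removals can be folded into one base R call with an alternation:

```r
df <- c("red apples,(golden,red delicious),bananas,(cavendish,lady finger),golden pears",
        "yellow pineapples,red tomatoes,(roma,vine),orange carrots")

# Alternation: either "<item>,(" or a lone ")" is replaced by nothing
cleaned <- gsub("[^,]+,\\(|\\)", "", df)
cleaned
# [1] "golden,red delicious,cavendish,lady finger,golden pears"
# [2] "yellow pineapples,roma,vine,orange carrots"
```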

Using the tidyverse package stringr, I was able to make your data appear the way you'd want it with two function calls separated by a pipe. The pipe comes from the package magrittr which loads with dplyr and/or tidyverse.
I used stringr::str_replace_all to perform two substitutions which remove the words you wanted to take out. Note the syntax for multiple substitutions within this function.
str_replace_all(string, c("first string to get rid of" = "string to replace it with", "second string to get rid of" = "second replacement string"))
You might find it more intuitive to combine all the "get rid of strings" first followed by combining the replacement strings, but each element within the c() is the string to be replaced (in quotes) connected to its replacement (also in quotes) with "=". Each of those replaced=replacement pairs is separated by a comma.
Using str_replace_all, I first took out all text which starts with "," and ends with ",(" using the regular expression ",[a-z ]+,\\(", which matches a comma, followed by one or more lowercase letters and spaces (so that chunks with multiple words are detected), followed by ",(". Note the escape for the "(". If you thought there might be capital letters you would use [a-zA-Z ] instead. In either case, note the space before the "]".
Because you wanted to take out the word, but not the comma preceding it, I replaced the removed text with ",".
This doesn't remove "red apples" in the first string because it doesn't follow a comma. The expression "^[a-z ]+,\\(" refers to any number of lowercase letters and spaces coming before ",(" at the beginning of the string (the ^ "anchors" your pattern to the beginning of the string). Therefore it removes "red apples" or any other example where the text you want to remove starts the string. For these cases, it makes sense to replace it with nothing ("") because you want the first character of the remaining string to appear at the beginning.
Together, the two substitutions remove the offending text whether it starts the string or is in the middle of it or ends it so in that sense it's more or less generalized.
str_remove_all("\\)") removes the right parentheses throughout
library(stringr)
library(magrittr)
df <- c("red apples,(golden,red delicious),bananas,(cavendish,lady finger),golden pears",
        "yellow pineapples,red tomatoes,(roma,vine),orange carrots")
str_replace_all(df, c(",[a-z ]+,\\(" = ",",
                      "^[a-z ]+,\\(" = "")) %>%
  str_remove_all("\\)")
[1] "golden,red delicious,cavendish,lady finger,golden pears"
[2] "yellow pineapples,roma,vine,orange carrots"

Using regular expressions (regex) to replace multiple patterns at the same time in R

I have a vector of strings and I want to remove -es from all strings (words) ending in either -ses or -ces at the same time. The reason I want to do it at the same time and not consecutively is that sometimes, after removing one ending, the other ending appears, and I don't want to apply this pattern to a single word twice.
I have no idea how to use two patterns at the same time, but this is the best I could come up with:
text <- gsub("[sc]+s$", "[sc]", text)
I know the replacement is not correct, but I wonder how can I show that I want to replace it with the letter I just detected (c or s in this case). Thank you in advance.

To remove es at the end of words, that is preceded with s or c, you may use
gsub("([sc])es\\b", "\\1", text)
gsub("(?<=[sc])es\\b", "", text, perl=TRUE)
To remove them at the end of strings, you can go on using your $ anchor:
gsub("([sc])es$", "\\1", text)
gsub("(?<=[sc])es$", "", text, perl=TRUE)
The first gsub TRE pattern is ([sc])es\b: a capturing group #1 that matches either s or c, and then es is matched, and then \b makes sure the next char is not a letter, digit or _. The \1 in the replacement is the backreference to the value stored in the capturing group #1 memory buffer.
In the second example with the PCRE regex (due to perl=TRUE), the (?<=[sc]) positive lookbehind is used instead of the ([sc]) capturing group. Lookbehinds do not consume text: the text they match does not become part of the match value, so there is no need to restore it. The replacement is an empty string.
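A quick check of both variants on a few made-up words (the vector below is hypothetical test data, not from the question). "ceses" is included because stripping its ending leaves "ces", which is the situation the question worries about:

```r
text <- c("glasses", "faces", "ceses", "boxes")

# Capturing group: put the matched s or c back with \\1
by_group <- gsub("([sc])es$", "\\1", text)

# Lookbehind: only "es" is matched, so nothing needs restoring
by_lookbehind <- gsub("(?<=[sc])es$", "", text, perl = TRUE)

by_group
# [1] "glass" "fac"   "ces"   "boxes"
identical(by_group, by_lookbehind)
# [1] TRUE
```

Because each string is scanned once, "ceses" becomes "ces" and is not shortened a second time.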

Strings ending with "ces" and "ses" follow the same pattern, i.e. "[cs]es$".
If I understand it correctly, then you don't need two patterns.
Example:
x = c("ces", "ses", "mes")
gsub(pattern = "([cs])es$", replacement = "\\1", x)
[1] "c" "s" "mes"
Hope it helps.
M

How to extract a string between a symbol and a space?

I am trying to extract usernames tagged in a text-chat, such as "#Jack #Marie Hi there!"
I am trying to do it on the combination of # and whitespace but I cannot get the regex to match non-greedy (or at least this is what I think is wrong):
library(stringr)
str_extract(string = '#This is what I want to extract', pattern = "(?<=#)(.*)(?=\\s+)")
[1] "This is what I want to"
What I would like to extract instead is only This.

You could make your regex non greedy:
(?<=#)(.*?)(?=\s+)
Or if you want to capture only "This" after the # sign, you could try it like this using only a positive lookbehind:
(?<=#)\w+
Explanation
(?<=#) - a positive lookbehind that asserts that what is directly behind the current position is a #
\w+ - match one or more word characters
In an R string literal the backslash must be doubled, e.g. "(?<=#)\\w+".

The central part of your regex, (.*), matches a sequence of any characters. Instead you should look for a sequence of characters other than whitespace (\S+) or of word characters (\w+).
Note also that I changed * to +, as you are probably not interested in an empty sequence of characters.
To also capture a name in the last position of the source string, the last part of your regex should match not only a sequence of whitespace characters but also the end of the string, so change (?=\\s+) to (?=\\s+|$).
One last remark: you don't actually need the parentheses around the central part.
To sum up, the whole regex can look like this:
(?<=#)\w+(?=\s+|$)
(with the global option, i.e. extracting all matches).
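In base R, the "global" behaviour can be obtained with gregexpr and regmatches. A minimal sketch, assuming input like the chat line from the question plus a second made-up line where the tag comes last:

```r
x <- c("#Jack #Marie Hi there!", "Bye #Tom")

# gregexpr finds every match (the "global" part); perl = TRUE enables lookarounds
m <- gregexpr("(?<=#)\\w+(?=\\s+|$)", x, perl = TRUE)
users <- regmatches(x, m)
users
# [[1]]
# [1] "Jack"  "Marie"
#
# [[2]]
# [1] "Tom"
```

The result is a list with one character vector per input string, so unlist(users) gives a flat vector if that is more convenient.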

Here is a non-regex approach, or rather a minimal-regex approach, since grep still runs the detection of # through the regex engine:
x <- '#This is what I want to extract'
grep('#', strsplit(x, ' ')[[1]], value = TRUE)
#[1] "#This"
Or, to avoid strsplit, we can use scan, i.e.
grep('#', scan(textConnection(x), " "), value=TRUE)
#Read 7 items
#[1] "#This"

regex - define boundary using characters & delimiters

I realize this is a rather simple question, and I have searched throughout this site, but I just can't seem to get my syntax right for the following regex challenges. I'm looking to do two things. First, I'd like a regex that picks up the first three characters and stops at a semicolon. For example, my string might look as follows:
Apt;House;Condo;Apts;
I'd like to end up with this:
Apartment;House;Condo;Apartment
I'd also like to create a regex that substitutes a word in between delimiters while keeping the others unchanged. For example, I'd like to go from this:
feline;labrador;bird;labrador retriever;labrador dog; lab dog;
To this:
feline;dog;bird;dog;dog;dog;
Below is the regex I'm working with. I know ^ denotes the beginning of the string and $ the end. I've tried many variations and am making substitutions, but I am not achieving my desired output. I'm also guessing one regex could work for both? Thanks for your help everyone.
df$variable <- gsub("^apt$;", "Apartment;", df$variable, ignore.case = TRUE)

Here is an approach that uses look behind (so you need perl=TRUE):
> tmp <- c("feline;labrador;bird;labrador retriever;labrador dog; lab dog;",
+ "lab;feline;labrador;bird;labrador retriever;labrador dog; lab dog")
> gsub( "(?<=;|^) *lab[^;]*", "dog", tmp, perl=TRUE)
[1] "feline;dog;bird;dog;dog;dog;"
[2] "dog;feline;dog;bird;dog;dog;dog"
The (?<=;|^) is the look behind: it says that any match must be preceded by either a semi-colon or the beginning of the string, but what it matches is not included in the part to be replaced. The " *" will match 0 or more spaces (since your example string had one case where there was a space between the semi-colon and the lab). It then matches a literal lab followed by 0 or more characters other than a semi-colon. Since * is greedy by default, this will match everything up to, but not including, the next semi-colon or the end of the string. You could also include a positive look ahead (?=;|$) to make sure it goes all the way to the next semi-colon or end of string, but in this case the greediness of * will take care of that.
You could also use the non-greedy modifier, then force to match to end of string or semi-colon:
> gsub( "(?<=;|^) *lab.*?(?=;|$)", "dog", tmp, perl=TRUE)
[1] "feline;dog;bird;dog;dog;dog;"
[2] "dog;feline;dog;bird;dog;dog;dog"
The .*? will match 0 or more characters, but as few as it can get away with, stretching just until the next semi-colon or end of line.
You can skip the look behind (and perl=TRUE) if you match the delimiter, then include it in the replacement:
> gsub("(;|^) *lab[^;]*", "\\1dog", tmp)
[1] "feline;dog;bird;dog;dog;dog;"
[2] "dog;feline;dog;bird;dog;dog;dog"
With this method you need to be careful to match the delimiter on one side only (the first in my example), since the match consumes the delimiter (which does not happen with the look-ahead or look-behind). If you consume both delimiters, the next field will be skipped and only every other field will be considered for replacement.
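To illustrate the pitfall, here is a small sketch on a made-up string of adjacent matching fields (the input is hypothetical). It compares matching the delimiter on one side with consuming it on both sides:

```r
tmp2 <- "cat;lab;lab;lab;lab"

# Delimiter matched on the left only: every field is considered
one_side <- gsub("(;|^)lab[^;]*", "\\1dog", tmp2)

# Delimiter consumed on both sides: the field right after each match
# has lost its leading ";" and is skipped
both_sides <- gsub("(;|^)lab[^;]*(;|$)", "\\1dog\\2", tmp2)

one_side
# [1] "cat;dog;dog;dog;dog"
both_sides
# [1] "cat;dog;lab;dog;lab"
```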

I'd recommend doing this in steps:
1. Split the string by the delimiters.
2. Do the replacements.
3. (Optional, if that's what you gotta do) Smash the strings back together.
To split the string, I'd use the stringr library. But you can use base R too:
myString <- "Apt;House;Condo;Apts;"
# base R
splitString <- unlist(strsplit(myString, ";", fixed = TRUE))
# with stringr
library(stringr)
splitString <- as.vector(str_split(myString, ";", simplify = TRUE))
Once you've done that, THEN you can do the text substitution:
# base R
fixedApts <- gsub("^Apt$|^Apts$", "Apartment", splitString)
# with stringr
fixedApts <- str_replace(splitString, "^Apt$|^Apts$", "Apartment")
# then do the rest of your replacements
There's probably a better way to do the replacements than regular expressions (using switch(), maybe?).
Use paste0(fixedApts, collapse = ";") to collapse the vector back into a single semicolon-delimited string at the end if that's what you need to do.
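One possible regex-free variant of the replacement step is a lookup table. The sketch below uses only base R; the lookup vector and its entries are assumptions made up for illustration, so extend it with whatever abbreviations your data contain.

```r
myString <- "Apt;House;Condo;Apts;"

# 1. Split on the delimiter (the trailing empty field is dropped by strsplit)
fields <- strsplit(myString, ";", fixed = TRUE)[[1]]

# 2. Replace via a named lookup vector (hypothetical mapping)
lookup <- c(Apt = "Apartment", Apts = "Apartment")
hit <- fields %in% names(lookup)
fields[hit] <- unname(lookup[fields[hit]])

# 3. Put the string back together
result <- paste(fields, collapse = ";")
result
# [1] "Apartment;House;Condo;Apartment"
```

Exact matching on whole fields avoids the anchoring issues from the question, at the cost of listing every variant explicitly.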
