Replacing string variable with punctuation in R without removing other string

In R, I am having trouble replacing a substring that contains punctuation. For example, within the string "r.Export", I am trying to replace "r." with "Report.". I've used gsub, and below is my code:
string <- "r.Export"
short <- "r."
replacement <- "Report."
gsub(short,replacement,string)
The desired output is "Report.Export"; however, gsub also seems to replace the second r, so that the output is:
Report.ExpoReport.
Using sub() instead is not a solution either, because I am doing multiple gsubs where sometimes the string to be replaced is:
short <- "o."
In that case the o in r.Export gets replaced anyway and it becomes a complete mess.

string <- "r.Export"
short <- "r\\."
replacement <- "Report."
gsub(short,replacement,string)
Returns:
[1] "Report.Export"
Or, using fixed=TRUE:
string <- "r.Export"
short <- "r."
replacement <- "Report."
gsub(short,replacement,string, fixed=TRUE)
Returns:
[1] "Report.Export"
Explanation: without the fixed=TRUE argument, gsub expects a regular expression as its first argument, and in regular expressions . is a placeholder for 'any character'. If you want a literal . (period) you have to either use \\. (i.e. escape the period) or pass the aforementioned fixed=TRUE.
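To see concretely what the unescaped pattern matches, here is a small sketch (reusing the same string as above); gregexpr() lists every match of "r.", and the second hit is the "rt" inside "Export", which is why the original gsub call mangles the word:
string <- "r.Export"
regmatches(string, gregexpr("r.", string))[[1]]
#[1] "r." "rt"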

Since your pattern contains a character (.) that has a special meaning in regex, use fixed = TRUE, which matches the string as is.
gsub(short,replacement,string, fixed = TRUE)
#[1] "Report.Export"

I might actually add word boundaries and lookaheads to the mix here, to ensure as targeted a match as possible:
string <- "r.Export"
replacement <- "Report."
output <- gsub("\\br\\.(?=\\w)", replacement, string, perl=TRUE)
output
[1] "Report.Export"
This approach ensures that we only match r. when the r is not preceded by another word character (for example, at the start of the string or after whitespace), and only when a word character follows the dot. Consider the sentence The project r.Export needed a programmer. We wouldn't want to replace the final r. in that case.
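As a quick sketch of that claim (on a made-up sentence, reusing the same pattern), only the r. that begins a word is rewritten:
sentence <- "The project r.Export needed a programmer."
gsub("\\br\\.(?=\\w)", "Report.", sentence, perl=TRUE)
#[1] "The project Report.Export needed a programmer."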

We can use sub:
sub(short,replacement,string, fixed = TRUE)
#[1] "Report.Export"

Related

Replace Strings in R with regular expression in it dynamically [duplicate]

I want to build a regex by substituting in some strings to search for, so these strings need to be escaped before I can put them in the regex; that way, if a searched-for string contains regex metacharacters, it still works.
Some languages have functions that will do this for you (e.g. python re.escape: https://stackoverflow.com/a/10013356/1900520). Does R have such a function?
For example (made up function):
x = "foo[bar]"
y = escape(x) # y should now be "foo\\[bar\\]"
I've written an R version of Perl's quotemeta function:
library(stringr)
quotemeta <- function(string) {
  str_replace_all(string, "(\\W)", "\\\\\\1")
}
I always use the perl flavor of regexps, so this works for me. I don't know whether it works for the "normal" regexps in R.
Edit: I found the source explaining why this works. It's in the Quoting Metacharacters section of the perlre manpage:
This was once used in a common idiom to disable or quote the special meanings of regular expression metacharacters in a string that you want to use for a pattern. Simply quote all non-"word" characters:
$pattern =~ s/(\W)/\\$1/g;
As you can see, the R code above is a direct translation of this same substitution (after a trip through backslash hell). The manpage also says (emphasis mine):
Unlike some other regular expression languages, there are no backslashed symbols that aren't alphanumeric.
which reinforces my point that this solution is only guaranteed for PCRE.
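A usage sketch (assuming stringr is loaded and the escaped pattern is then used with perl = TRUE, since, as noted, this is only vouched for with PCRE):
quotemeta("foo[bar]")
#[1] "foo\\[bar\\]"
grepl(quotemeta("foo[bar]"), "x <- foo[bar]", perl = TRUE)
#[1] TRUE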
Apparently there is a function called escapeRegex in the Hmisc package. The function itself has the following definition for an input value of 'string':
gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", string)
My previous answer:
I'm not sure if there is a built in function but you could make one to do what you want. This basically just creates a vector of the values you want to replace and a vector of what you want to replace them with and then loops through those making the necessary replacements.
re.escape <- function(strings){
  vals <- c("\\\\", "\\[", "\\]", "\\(", "\\)",
            "\\{", "\\}", "\\^", "\\$", "\\*",
            "\\+", "\\?", "\\.", "\\|")
  replace.vals <- paste0("\\\\", vals)
  for(i in seq_along(vals)){
    strings <- gsub(vals[i], replace.vals[i], strings)
  }
  strings
}
Some output
> test.strings <- c("What the $^&(){}.*|?", "foo[bar]")
> re.escape(test.strings)
[1] "What the \\$\\^&\\(\\)\\{\\}\\.\\*\\|\\?"
[2] "foo\\[bar\\]"
An easier way than @ryanthompson's function is to simply prepend \\Q and append \\E to your string. See the help file ?base::regex.
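A minimal sketch of that approach, shown with perl = TRUE (PCRE definitely honours \\Q...\\E; note that it breaks if the string itself happens to contain "\\E"):
x <- "foo[bar]"
pattern <- paste0("\\Q", x, "\\E")
grepl(pattern, "see foo[bar] here", perl = TRUE)
#[1] TRUE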
Use the rex package
These days, I write all my regular expressions using rex. For your specific example, rex does exactly what you want:
library(rex)
library(assertthat)
x = "foo[bar]"
y = rex(x)
assert_that(y == "foo\\[bar\\]")
But of course, rex does a lot more than that. The question mentions building a regex, and that's exactly what rex is designed for. For example, suppose we wanted to match the exact string in x, with nothing before or after:
x = "foo[bar]"
y = rex(start, x, end)
Now y is ^foo\[bar\]$ and will only match the exact string contained in x.
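A quick check of that anchored pattern (assuming rex is installed; the result is ultimately a character pattern, so the base functions accept it):
grepl(y, "foo[bar]")
#[1] TRUE
grepl(y, "prefix foo[bar] suffix")
#[1] FALSE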
According to ?regex:
The symbol \w matches a ‘word’ character (a synonym for [[:alnum:]_], an extension) and \W is its negation ([^[:alnum:]_]).
Therefore, using a capture group, (\\W), we can detect occurrences of non-word characters and escape them with the \\1 backreference syntax:
> gsub("(\\W)", "\\\\\\1", "[](){}.|^+$*?\\These are words")
[1] "\\[\\]\\(\\)\\{\\}\\.\\|\\^\\+\\$\\*\\?\\\\These\\ are\\ words"
Or similarly, replacing "([^[:alnum:]_])" for "(\\W)".

Ignore last "/" in R regex

Given the string "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275/", I need to generate a regex filter so that it ignores the last char if it is an "/" .
I tried the following regex "(http:////)?compras\\.dados\\.gov\\.br.*\\?.*(?<!//)" as in regexr.com/4om61, but it doesn't work when I run it in R as:
regex_exp_R <- "(http:////)?compras\\.dados\\.gov\\.br.*\\?.*(?<!//)"
grep(regex_exp_R, "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275/", perl = T, value = T)
I need this to work in pure regex and grep function, without using any string R package.
Thank you.
Simplified Case:
After important contributions from you all, one last issue remains.
Because I will use the regex as an input in another function, the solution must work with pure regex and grep.
The remaining point is a very basic one: given the strings "a1bc/" or "a1bc", the regex must return "a1bc". Building on the suggestions I received, I tried
grep(".*[^//]", "a1bc/", perl = T, value = T), but I still get "a1bc/" instead of "a1bc". Any hints? Thank you.
If you want to return the string without the last / you can do this several ways. Below are a couple options using base R:
Using a back-reference in gsub() (sub() would work too here):
gsub("(.*?)/*$", "\\1", x)
[1] "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275"
# or, adapting your original pattern
gsub("((http:////)?compras\\.dados\\.gov\\.br.*\\?.*?)/*$", "\\1", x)
[1] "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275"
By position, using ifelse() and substr() (this will probably be a little bit faster if scaling matters):
ifelse(substr(x, nchar(x), nchar(x)) == "/", substr(x, 1, nchar(x)-1), x)
[1] "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275"
Data:
x <- "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275/"
Use sub to remove a trailing /:
x <- c("a1bc/", "a2bc")
sub("/$", "", x)
This changes nothing on a string that does not end in /.
As others have pointed out, grep does not modify strings. It returns a numeric vector of indices of the matching elements or, with value = TRUE, the (unmodified) matching elements themselves. It's usually used to subset a character vector.
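A small sketch of that distinction, contrasting grep() (which selects) with sub() (which modifies), using the strings from above:
x <- c("a1bc/", "a2bc")
grep("/$", x)
#[1] 1
grep("/$", x, value = TRUE)
#[1] "a1bc/"
sub("/$", "", x)
#[1] "a1bc" "a2bc"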
You can use a negative look-behind at the end to ensure it doesn't end with the character you don't want (in this case, a /). The regex would then be:
.+(?<!\/)
You can view it here with your three input examples: https://regex101.com/r/XB9f7K/1/. If you only want it to match urls, then you would change the .+ part at the beginning to your url regex.
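If you do want to extract the part before the trailing slash in R with that look-behind, one sketch is regexpr() plus regmatches() (the forward slash needs no escaping in R's regex, so \/ can be written as just /):
x <- c("a1bc/", "a1bc")
regmatches(x, regexpr(".+(?<!/)", x, perl = TRUE))
#[1] "a1bc" "a1bc"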
How about trying gsub("(.*?)/+$","\\1",s)?

regex - define boundary using characters & delimiters

I realize this is a rather simple question, and I have searched throughout this site, but I just can't seem to get my syntax right for the following regex challenges. I'm looking to do two things. First, I want the regex to pick up the first three characters and stop at a semicolon. For example, my string might look as follows:
Apt;House;Condo;Apts;
I'd like it to become:
Apartment;House;Condo;Apartment
I'd also like to create a regex to substitute a word in between delimiters, while keeping the others unchanged. For example, I'd like to go from this:
feline;labrador;bird;labrador retriever;labrador dog; lab dog;
To this:
feline;dog;bird;dog;dog;dog;
Below is the regex I'm working with. I know ^ denotes the beginning of the string and $ the end. I've tried many variations and am making substitutions, but am not achieving my desired output. I'm also guessing one regex could work for both? Thanks for your help, everyone.
df$variable <- gsub("^apt$;", "Apartment;", df$variable, ignore.case = TRUE)
Here is an approach that uses look behind (so you need perl=TRUE):
> tmp <- c("feline;labrador;bird;labrador retriever;labrador dog; lab dog;",
+ "lab;feline;labrador;bird;labrador retriever;labrador dog; lab dog")
> gsub( "(?<=;|^) *lab[^;]*", "dog", tmp, perl=TRUE)
[1] "feline;dog;bird;dog;dog;dog;"
[2] "dog;feline;dog;bird;dog;dog;dog"
The (?<=;|^) is the look-behind; it says that any match must be preceded by either a semi-colon or the beginning of the string, but what it matches is not included in the part to be replaced. The * will match 0 or more spaces (since your example string had one case with a space between the semi-colon and the lab). The pattern then matches a literal lab followed by 0 or more characters other than a semi-colon. Since * is greedy by default, this will match everything up to, but not including, the next semi-colon or the end of the string. You could also include a positive look-ahead (?=;|$) to make sure it goes all the way to the next semi-colon or end of string, but in this case the greediness of * takes care of that.
You could also use the non-greedy modifier, then force to match to end of string or semi-colon:
> gsub( "(?<=;|^) *lab.*?(?=;|$)", "dog", tmp, perl=TRUE)
[1] "feline;dog;bird;dog;dog;dog;"
[2] "dog;feline;dog;bird;dog;dog;dog"
The .*? will match 0 or more characters, but as few as it can get away with, stretching just until the next semi-colon or end of line.
You can skip the look behind (and perl=TRUE) if you match the delimiter, then include it in the replacement:
> gsub("(;|^) *lab[^;]*", "\\1dog", tmp)
[1] "feline;dog;bird;dog;dog;dog;"
[2] "dog;feline;dog;bird;dog;dog;dog"
With this method you need to be careful to match the delimiter on only one side (the first, in my example), since the match consumes the delimiter (unlike with a look-ahead or look-behind). If you consume both delimiters, the next occurrence's leading delimiter will already have been used up, so only every other field will be considered for replacement.
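A quick demonstration of that pitfall on the first example string, using a deliberately flawed variant of the pattern above that matches the delimiter on both sides: it consumes the semi-colon that the next field needs, so "labrador dog" is skipped:
gsub("(;|^) *lab[^;]*(;|$)", "\\1dog\\2", tmp[1])
#[1] "feline;dog;bird;dog;labrador dog;dog;"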
I'd recommend doing this in two steps:
Split the string by the delimiters
Do the replacements
(optional, if that's what you gotta do) Smash the strings back together.
To split the string, I'd use the stringr library. But you can use base R too:
myString <- "Apt;House;Condo;Apts;"
# base R
splitString <- unlist(strsplit(myString, ";", fixed = T))
# with stringr
library(stringr)
splitString <- as.vector(str_split(myString, ";", simplify = T))
Once you've done that, THEN you can do the text substitution:
# base R
fixedApts <- gsub("^Apt$|^Apts$", "Apartment", splitString)
# with stringr
fixedApts <- str_replace(splitString, "^Apt$|^Apts$", "Apartment")
# then do the rest of your replacements
There's probably a better way to do the replacements than regular expressions (using switch(), maybe?).
Use paste0(fixedApts, collapse = ";") to collapse the vector back into a single semicolon-delimited string at the end, if that's what you need to do.
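Putting the pieces together, a sketch of the full round trip on the first example string (split, replace, re-join with the original ";" delimiter; "^Apts?$" is just a compact form of the ^Apt$|^Apts$ alternation above):
myString <- "Apt;House;Condo;Apts;"
parts <- unlist(strsplit(myString, ";", fixed = TRUE))
parts <- gsub("^Apts?$", "Apartment", parts)
paste(parts, collapse = ";")
#[1] "Apartment;House;Condo;Apartment"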

how to write a regular expression to extract a specific element from string in r

I have a string list as below:
df = read.table(text="AC1=60;AD=393,115;AF1=0.318816;BQB=0.508823;DP=1016;DP4=393
AC1=190;AD=2,747;AF1=1;BQB=0.0722892;DP=749;DP4=2,0,747,0;FQ=-43.6844
AC1=150;AD=1,5;AF1=0.787353;DP=6;DP4=1,0,5,0;VDB=0.00215942
AC1=47;AD=660,182;AF1=0.24862;BQB=0.680047;DP=1684;DP4=660,0,182,0
AC1=47;AD=659,183;AF1=0.248425;DP=842;DP4=0,659,0,183;FQ=999
AC1=78;AD=23,17;AF1=0.408247;BQB=1;DP=40;DP4=23,0,17,0", header=FALSE, stringsAsFactors=F)
Each element is separated by ";". I would like to extract only the "DP=[0-9]+" part. The expected result is:
DP=1016
DP=749
DP=6
DP=1684
DP=842
DP=40
I appreciate any help.
In base:
gsub(".*((?<=;)DP=[^;]+(?=;)).*", "\\1", df$V1, perl=TRUE)
#[1] "DP=1016" "DP=749" "DP=6" "DP=842" "DP=1684" "DP=40"
I was surprised when the resident genius on regex suggested the use of packages for text extraction. sub and gsub can get unruly when pulling out a specific string:
library(stringr)
str_extract_all(df$V1, "(?<=;)DP=[^;]+(?=;)")
Here is one regular expression that will work
gsub(".*;(DP=[0-9.]+);.*$", "\\1", df$V1)
If it's the case that the "DP=" substring contains multiple entries separated by commas, as substrings like "DP4=" do in some cases in the example data, then, as @pierre-lafortune notes in the comments below and in his answer, you might be better off with the [^;] character class:
gsub(".*;(DP=[^;]+);.*$", "\\1", df$V1)
Of course, you could just add the comma to the character class,
gsub(".*;(DP=[0-9.,]+);.*$", "\\1", df$V1)
but there may be other characters you want to keep as well. So [^;] would be the most inclusive approach.
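If you'd rather pull the field out directly than strip everything around it, a base R alternative is regexpr() plus regmatches(); this sketch relies on the literal "DP=" appearing only in the field of interest (the "DP4=" fields cannot match because the "=" does not immediately follow "DP"):
regmatches(df$V1, regexpr("DP=[^;]+", df$V1))
#[1] "DP=1016" "DP=749" "DP=6" "DP=1684" "DP=842" "DP=40"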

selective removal of characters following a pattern using R

How to selectively remove characters from a string following a pattern?
I wish to remove the 7 figures and the preceding colon.
For example:
"((Northern_b:0.005926,Tropical_b:0.000000)N19:0.002950"
should become
"((Northern_b,Tropical_b)N19"
x <- "((Northern_b:0.005926,Tropical_b:0.000000)N19:0.002950"
gsub("[:]\\d{1}[.]\\d{6}", "", x)
The gsub function does string replacement and replaces all matches found in the string (see ?gsub). An alternative, if you want something with a friendlier name, is str_replace_all from the stringr package.
The regular expression makes use of the \\d{n} search, which looks for digits. The integer indicates the number of digits to look for. So \\d{1} looks for a set of digits with length 1. \\d{6} looks for a set of digits of length 6.
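A quick illustration of why the exact counts matter: on a made-up string where one number does not have exactly six digits after the decimal point, the \\d{6} pattern leaves it untouched, which is why the answers below use open-ended quantifiers like + or {1,}:
y <- "(A:0.05,B:0.000000)"
gsub("[:]\\d{1}[.]\\d{6}", "", y)
#[1] "(A:0.05,B)"
gsub("[:][0-9.]+", "", y)
#[1] "(A,B)"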
gsub('[:][0-9.]+','',x)
[1] "((Northern_b,Tropical_b)N19"
Another approach to solve this problem
library(stringr)
str1 <- c("((Northern_b:0.005926,Tropical_b:0.000000)N19:0.002950")
str_replace_all(str1, regex("(:\\d{1,}\\.\\d{1,})"), "")
#[1] "((Northern_b,Tropical_b)N19"
