I am trying to replace each character in a string according to the following rules, using the stringr package:
library(stringr)
replace_characters <- function(x) {
  str_replace_all(x, c(
    "A" = "N", "B" = "O", "C" = "P", "D" = "Q", "E" = "R", "F" = "S", "G" = "T",
    "H" = "U", "I" = "V", "J" = "W", "K" = "X", "L" = "Y", "M" = "Z",
    "N" = "A", "O" = "B", "P" = "C", "Q" = "D", "R" = "E", "S" = "F", "T" = "G",
    "U" = "H", "V" = "I", "W" = "J", "X" = "K", "Y" = "L", "Z" = "M",
    "0" = "5", "1" = "6", "2" = "7", "3" = "8", "4" = "9",
    "5" = "0", "6" = "1", "7" = "2", "8" = "3", "9" = "4"
  ))
}
Then I tried the function with a random string:
replace_characters("HSNKSL584")
and I got:
"HFAKFL034"
As you can see, some of the characters (letters and digits) have been replaced as expected, but others remain unchanged.
Can anyone explain why?
Thanks!
Behind the scenes, stringr::str_replace_all calls stringi's stri_replace_all_* functions. If you use a named vector to describe multiple replacement patterns (which is the case here), the parameters passed to stri_replace_all_* include vectorize_all = FALSE.
From stri_replace_all_*'s help file:
However, for stri_replace_all*, if vectorize_all is FALSE, then each
substring matching any of the supplied patterns is replaced by a
corresponding replacement string. In such a case, the vectorization is
over str, and - independently - over pattern and replacement. In other
words, this is equivalent to something like for (i in 1:npatterns) str
<- stri_replace_all(str, pattern[i], replacement[i]...
As raymkchow & Sotos have noted in the comments, when the replacement patterns are applied one at a time in sequence, some characters are replaced more than once, and a later pattern reverses an earlier replacement. Here, for example, H first becomes U under 'H'='U' and is then turned back into H under 'U'='H'.
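The sequential behaviour described in the help file can be imitated in base R. The following is only a sketch: the helper name sequential_replace and the rules vector are made up for illustration, and gsub with fixed = TRUE stands in for the stringi call. It applies the same 36 pairs in the same order as the question.

```r
# Same mapping as in the question: A-Z rotated by 13, 0-9 rotated by 5
rules <- c(setNames(LETTERS[c(14:26, 1:13)], LETTERS),
           setNames(as.character(c(5:9, 0:4)), as.character(0:9)))

# Hypothetical helper: apply one pattern after another to the whole string
sequential_replace <- function(x, rules) {
  for (i in seq_along(rules)) {
    x <- gsub(names(rules)[i], rules[[i]], x, fixed = TRUE)
  }
  x
}

result <- sequential_replace("HSNKSL584", rules)
result
#> [1] "HFAKFL034"
```

Characters in the first half of each alphabet (H, K, L, 4) are swapped and then swapped back, while those in the second half (S, N, 5, 8) are only touched once, which reproduces the output in the question.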
We can do this with chartr from base R, which translates every character in a single pass, so no character is replaced twice:
chartr("HSNKL584", "UFAXY039", "HSNKSL584")
#[1] "UFAXFY039"
This can be made into a function
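replace_char_fun <- function(str1) {
  # Source alphabet: A-Z followed by 0-9
  old <- paste(c(LETTERS, 0:9), collapse = "")
  # Target alphabet: letters shifted by 13, digits shifted by 5
  new <- paste(c(LETTERS[14:26], LETTERS[1:13], 5:9, 0:4), collapse = "")
  chartr(old, new, str1)
}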
replace_char_fun( "HSNKSL584")
#[1] "UFAXFY039"
Related
I am trying to get to grips with the world of regular expressions in R.
I was wondering whether there was any simple way of combining the functionality of "grep" and "gsub"?
Specifically, I want to append some additional information to anything which matches a specific pattern.
For a generic example, let's say I have a character vector:
char_vec <- c("A","A B","123?")
Then let's say I want to append the following to every letter in each element of char_vec:
append <- "_APPEND"
Such that the result would be:
[1] "A_APPEND" "A_APPEND B_APPEND" "123?"
Clearly a gsub can replace the letters with append, but this does not keep the original expression (while grep would return the letters but not append!).
Thanks in advance for any / all help!
It seems you are not familiar with backreferences that you may use in the replacement patterns in (g)sub. Once you wrap a part of the pattern with a capturing group, you can later put this value back into the result of the replacement.
So, a mere gsub solution is possible:
char_vec <- c("A","A B","123?")
append <- "_APPEND"
gsub("([[:alpha:]])", paste0("\\1", append), char_vec)
## => [1] "A_APPEND" "A_APPEND B_APPEND" "123?"
See this R demo.
Here, ([[:alpha:]]) matches and captures into Group 1 any letter and \1 in the replacement reinserts this value into the result.
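As a further sketch of how capturing groups and backreferences behave, the example below uses a run of letters instead of a single letter, and a second made-up input ("key=value") to show two groups being reordered. Neither variation comes from the original answer.

```r
char_vec <- c("A", "A B", "123?")

# Group 1 captures each run of letters; "\\1" writes it back before the suffix
appended <- gsub("([[:alpha:]]+)", "\\1_APPEND", char_vec)
appended
#> [1] "A_APPEND"          "A_APPEND B_APPEND" "123?"

# Two groups can be referenced independently, for instance to swap them
swapped <- gsub("(\\w+)=(\\w+)", "\\2=\\1", "key=value")
swapped
#> [1] "value=key"
```

Elements without a match, such as "123?", are returned unchanged, which is what makes gsub behave like a combined grep and replace here.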
Definitely not as slick as @Wiktor Stribiżew's answer, but here is another method I developed.
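char_vars <- c('a', 'b', 'a b', '123')
grep('[A-Za-z]', char_vars)       # indices of the elements that contain a letter
gregexpr('[A-Za-z]', char_vars)   # positions of every letter in each element
matches <- regmatches(char_vars, gregexpr('[A-Za-z]', char_vars))
for (i in seq_along(matches)) {
  for (found in matches[[i]]) {
    # sub() replaces only the first occurrence of the letter, so this can
    # misfire if that letter also occurs in text appended earlier
    char_vars[i] <- sub(pattern = found,
                        replacement = paste(found, "_append", sep = ""),
                        x = char_vars[i])
  }
}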
In R, I am having trouble replacing a substring that contains punctuation. For example, within the string "r.Export", I am trying to replace "r." with "Report.". I've used gsub; my code is below:
string <- "r.Export"
short <- "r."
replacement <- "Report."
gsub(short,replacement,string)
The desired output is "Report.Export"; however, gsub also replaces the second r, so the output is:
Report.ExpoReport.
Using sub() instead is not a solution either because I am doing multiple gsubs where sometimes the string to be replaced is:
short <- "o."
So, then the o's in r.Export are replaced anyway and it becomes a complete mess.
Escape the period so that it is matched literally:
string <- "r.Export"
short <- "r\\."
replacement <- "Report."
gsub(short,replacement,string)
Returns:
[1] "Report.Export"
Or, using fixed=TRUE:
string <- "r.Export"
short <- "r."
replacement <- "Report."
gsub(short,replacement,string, fixed=TRUE)
Returns:
[1] "Report.Export"
Explanation: Without the fixed=TRUE argument, gsub expects a regular expression as its first argument. In a regular expression, . matches any single character, so the pattern r. matches not only "r." but also the "rt" at the end of "Export". If you want a literal . (period), you have to either escape it as \\. or use the aforementioned argument fixed=TRUE.
Since your pattern contains a character (.) that has a special meaning in regex, use fixed = TRUE, which matches the string as is.
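To see what the unescaped pattern actually matches, one can list the matches with regmatches and gregexpr. This is a small illustrative sketch; the variable names and the bracket-expression variant r[.] are additions, not part of the original answers.

```r
string <- "r.Export"

# With a regex, "." is a wildcard, so "r." matches two substrings
wildcard_hits <- regmatches(string, gregexpr("r.", string))[[1]]
wildcard_hits
#> [1] "r." "rt"

# A bracket expression is another way to write a literal period
bracketed <- gsub("r[.]", "Report.", string)
bracketed
#> [1] "Report.Export"
```

The second match, "rt", is what turns the end of "Export" into "ExpoReport." in the question.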
gsub(short,replacement,string, fixed = TRUE)
#[1] "Report.Export"
I might actually add word boundaries and lookaheads to the mix here, to ensure as targeted a match as possible:
string <- "r.Export"
replacement <- "Report."
output <- gsub("\\br\\.(?=\\w)", replacement, string, perl=TRUE)
output
[1] "Report.Export"
This approach ensures that we only match r. when the r starts a word (it is at the start of the string or preceded by a non-word character such as whitespace), and only when the dot is followed by a word character. Consider the sentence "The project r.Export needed a programmer." We wouldn't want to replace the final r. in this case.
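Applied to that sentence, the difference between the targeted pattern and a plain literal replacement can be sketched as follows (the variable names are only illustrative):

```r
sentence <- "The project r.Export needed a programmer."

# Word boundary before r, lookahead for a word character after the dot
targeted <- gsub("\\br\\.(?=\\w)", "Report.", sentence, perl = TRUE)
targeted
#> [1] "The project Report.Export needed a programmer."

# A literal match replaces every "r.", including the one ending the sentence
literal <- gsub("r.", "Report.", sentence, fixed = TRUE)
literal
#> [1] "The project Report.Export needed a programmeReport."
```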
We can use sub
sub(short,replacement,string, fixed = TRUE)
#[1] "Report.Export"
Given the string "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275/", I need to generate a regex filter that ignores the last character if it is a "/".
I tried the following regex, "(http:////)?compras\\.dados\\.gov\\.br.*\\?.*(?<!//)", as tested at regexr.com/4om61, but it doesn't work when I run it in R as:
regex_exp_R <- "(http:////)?compras\\.dados\\.gov\\.br.*\\?.*(?<!//)"
grep(regex_exp_R, "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275/", perl = T, value = T)
I need this to work in pure regex and grep function, without using any string R package.
Thank you.
Simplified Case:
After important contributions of you all, one last issue remains.
Because I will use the regex as an input to another function, the solution must work with pure regex and grep.
The remaining point is a very basic one: given the strings "a1bc/" or "a1bc", the regex must return "a1bc". Building on suggestions I received, I tried
grep(".*[^//]" ,"a1bc/", perl = T, value = T), but still get "a1bc/" instead of "a1bc". Any hints? Thank you.
If you want to return the string without the last / you can do this several ways. Below are a couple options using base R:
Using a back-reference in gsub() (sub() would work too here):
gsub("(.*?)/*$", "\\1", x)
[1] "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275"
# or, adapting your original pattern
gsub("((http:////)?compras\\.dados\\.gov\\.br.*\\?.*?)/*$", "\\1", x)
[1] "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275"
By position using ifelse() and substr() (this will probably be a little bit faster if scaling matters):
ifelse(substr(x, nchar(x), nchar(x)) == "/", substr(x, 1, nchar(x)-1), x)
[1] "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275"
Data:
x <- "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275/"
Use sub to remove a trailing /:
x <- c("a1bc/", "a2bc")
sub("/$", "", x)
This changes nothing on a string that does not end in /.
As others have pointed out, grep does not modify strings. It returns a numeric vector of indices of the matched strings or a vector of the (unmodified) matched items. It's usually used to subset a character vector.
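The difference between the two functions can be seen side by side in this short sketch, which reuses the x from above (the result variable names are made up):

```r
x <- c("a1bc/", "a2bc")

# grep only reports which elements match; it never changes them
idx <- grep("/$", x)
idx
#> [1] 1
kept <- grep("/$", x, value = TRUE)
kept
#> [1] "a1bc/"

# sub returns modified copies of all elements
trimmed <- sub("/$", "", x)
trimmed
#> [1] "a1bc" "a2bc"
```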
You can use a negative lookbehind at the end of the pattern to ensure that the match doesn't end with the character you don't want (in this case, a /). In R, lookbehind requires perl = TRUE. The regex would then be:
.+(?<!\/)
You can view it here with your three input examples: https://regex101.com/r/XB9f7K/1/. If you only want it to match urls, then you would change the .+ part at the beginning to your url regex.
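In R, the matched text itself (rather than the whole element, which is all grep can return) can be pulled out with regexpr and regmatches. The sketch below assumes that base R functions other than grep are acceptable, and writes the slash unescaped because "\/" is not a valid escape in an R string literal.

```r
x <- c("a1bc/", "a1bc",
       "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275/")

# Greedy .+ backtracks by one character when the match would end in "/"
m <- regexpr(".+(?<!/)", x, perl = TRUE)
without_slash <- regmatches(x, m)
without_slash
```

For these three inputs the result is "a1bc", "a1bc" and the URL without its final slash.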
How about trying gsub("(.*?)/+$","\\1",s)?
So I've run into a small bug/feature in R where the agrep function does not accept the "|" character as valid regular expression logic when it is used in the pattern argument (others have had this problem too).
I'm trying to do a fuzzy match of 30 different, relatively unique names in one character vector (ListofUniqueNames) against a list of over 380,000 different names in a data-frame column (MasterList$Names), and get an output of all the matching names. I was able to accomplish this fine for exact matches using grep via
grep(paste(ListofUniqueNames,collapse="|"),MasterList$Names, value=TRUE, ignore.case = TRUE)
However, this approach doesn't work for agrep due to the problem listed above. How can I accomplish this same task but with fuzzy matching?
You could call agrep one by one for each pattern,
and then combine the results:
unlist(lapply(ListofUniqueNames, function(x) agrep(x, MasterList$Names, value = TRUE, ignore.case = TRUE)))
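A minimal sketch of this approach on toy data follows. The names in ListofUniqueNames and MasterList are invented stand-ins for the real data, and wrapping the result in unique() is an addition here, so that a name matching several patterns is reported only once.

```r
# Hypothetical stand-ins for the real vectors
ListofUniqueNames <- c("Smith", "Garcia")
MasterList <- data.frame(
  Names = c("John Smith", "J. Smyth", "Mary Jones", "Ana Garcia"),
  stringsAsFactors = FALSE
)

# One agrep call per pattern, results combined into a single vector
fuzzy_hits <- unique(unlist(lapply(ListofUniqueNames, function(p) {
  agrep(p, MasterList$Names, max.distance = 0.1,
        value = TRUE, ignore.case = TRUE)
})))
fuzzy_hits
```

max.distance = 0.1 is agrep's default, written out here so that it is easy to tune. Larger values allow more edits per match, at the cost of more false positives on a list of 380,000 names.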