Removing brackets in a string without the content - r

I would like to rearrange the data I have. It consists only of names, but some of them contain brackets. I would like to get rid of the brackets but keep their content, so that each bracketed name ends up as 2 names (one with the bracketed letter and one without).
For example
df <- c ("Do(i)lfal", "Do(i)lferl", "Steff(l)", "Steffe", "Steffi")
I would like to have at the end
df <- c( "Doilfal", "Dolfal", "Doilferl", "Dolferl", "Steff", "Steffl", "Steffe", "Steffi")
I tried
sub("(.*)(\\([a-z]\\))(.*)$", "\\1\\2, \\1\\2\\3", df)
But it does not quite work.
Thank you very much

df = gsub("[\\(\\)]", "", df)

You made two small mistakes:
In the first name you want \1\2\3, because you want all the letters. It is in the second name that you want \1\3 (skipping the bracketed letter).
You placed the parentheses themselves, as in (i), inside the capture group, so they are captured too. The capture group must go only around the letter inside the parentheses.
A small change to your regex does it:
sub("(.*)\\(([a-z])\\)(.*)$", "\\1\\2\\3, \\1\\3", df)
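As a quick check, here is a minimal sketch that applies the corrected pattern to the sample vector from the question. The variable name fixed is only illustrative. Note that each bracketed name still comes back as one comma-separated string, so a further split is needed to get separate elements.

```r
df <- c("Do(i)lfal", "Do(i)lferl", "Steff(l)", "Steffe", "Steffi")

# Group 2 now holds only the letter, so the brackets themselves are dropped.
# Names without brackets do not match and are returned unchanged.
fixed <- sub("(.*)\\(([a-z])\\)(.*)$", "\\1\\2\\3, \\1\\3", df)
print(fixed)
```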

You can use
df <- c ("Do(i)lfal", "Do(i)lferl", "Steff(l)", "Steffe", "Steffi")
unlist(strsplit( paste(sub("(.*?)\\(([a-z])\\)(.*)", "\\1\\2\\3, \\1\\3", df), collapse=","), "\\s*,\\s*"))
# => [1] "Doilfal"  "Dolfal"   "Doilferl" "Dolferl"  "Steffl"   "Steff"    "Steffe"   "Steffi"
See the online R demo and the first regex demo. Details:
First, the sub is executed with the first regex, (.*?)\(([a-z])\)(.*) that matches
(.*?) - any zero or more chars as few as possible, captured into Group 1 (\1)
\( - a ( char
([a-z]) - Group 2 (\2): any ASCII lowercase letter
\) - a ) char
(.*) - any zero or more chars as many as possible, captured into Group 3 (\3)
Then, the results are pasted together with a , char as the collapsing char
Then, the resulting char vector is split with the \s*,\s* regex that matches a comma enclosed with zero or more whitespace chars.
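The three steps described above can be sketched one at a time, which makes the intermediate values easier to inspect. The names expanded, joined and result are illustrative and not part of the original answer.

```r
df <- c("Do(i)lfal", "Do(i)lferl", "Steff(l)", "Steffe", "Steffi")

# Step 1: turn each bracketed name into "with letter, without letter"
expanded <- sub("(.*?)\\(([a-z])\\)(.*)", "\\1\\2\\3, \\1\\3", df)

# Step 2: glue all elements into one comma-separated string
joined <- paste(expanded, collapse = ",")

# Step 3: split on commas, discarding any whitespace around them
result <- unlist(strsplit(joined, "\\s*,\\s*"))
print(result)
```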

Related

Remove one number at position n of the number in a string of numbers separated by slashes

I have a character column with this configuration:
data <- data.frame(
id = 1:3,
codes = c("08001301001", "08002401002 / 08002601003 / 17134604034", "08004701005 / 08005101001"))
I want to remove the 6th digit of any number within the string. The numbers are always 10 characters long.
My code works. However, I believe it could be done more easily with a regex, but I couldn't figure it out.
library(stringr)
remove_6_digit <- function(x){
idxs <- str_locate_all(x,"/")[[1]][,1]
for (idx in c(rev(idxs+7), 6)){
str_sub(x, idx, idx) <- ""
}
return(x)
}
result <- sapply(data$codes, remove_6_digit, USE.NAMES = F)
You can use
gsub("\\b(\\d{5})\\d", "\\1", data$codes)
See the regex demo. This will remove the 6th digit from the start of a digit sequence.
Details:
\b - word boundary
(\d{5}) - Capturing group 1 (\1): five digits
\d - a digit.
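Applied to the sample data from the question, the word-boundary pattern can be checked like this (a minimal sketch; the name cleaned is illustrative):

```r
data <- data.frame(
  id = 1:3,
  codes = c("08001301001",
            "08002401002 / 08002601003 / 17134604034",
            "08004701005 / 08005101001"))

# Keep the first five digits of every number (Group 1) and drop the sixth
cleaned <- gsub("\\b(\\d{5})\\d", "\\1", data$codes)
print(cleaned)
```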
While word boundary looks enough for the current scenario, a digit boundary is also an option in case the numbers are glued to word chars:
gsub("(?<!\\d)(\\d{5})\\d", "\\1", data$codes, perl=TRUE)
where perl=TRUE enables the PCRE regex syntax and (?<!\d) is a negative lookbehind that fails the match if there is a digit immediately to the left of the current location.
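To illustrate the difference, here is a small sketch with a made-up input in which the number is glued to a letter prefix (the "ID" prefix is an assumption for demonstration, it does not occur in the question's data):

```r
x <- "ID08001301001"  # hypothetical input: digits directly after letters

# No word boundary exists between "D" and "0", so nothing is replaced
with_word_boundary <- gsub("\\b(\\d{5})\\d", "\\1", x)

# The lookbehind only requires "no digit to the left", so the match succeeds
with_digit_boundary <- gsub("(?<!\\d)(\\d{5})\\d", "\\1", x, perl = TRUE)

print(with_word_boundary)
print(with_digit_boundary)
```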
And if you must only change numeric char sequences of 10 digits (no shorter and no longer) you can use
gsub("\\b(\\d{5})\\d(\\d{4})\\b", "\\1\\2", data$codes)
gsub("(?<!\\d)(\\d{5})\\d(?=\\d{4}(?!\\d))", "\\1", data$codes, perl=TRUE)
One remark though: your numbers consist of 11 digits, so you need to replace \\d{4} with \\d{5}, see this regex demo.
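A short sketch of that remark, using two of the question's values: the strict 10-digit pattern leaves the 11-digit numbers untouched, while the \\d{5} variant removes the 6th digit. The names ten and eleven are illustrative.

```r
codes <- c("08001301001", "08002401002 / 08002601003 / 17134604034")

# Strict pattern for 10-digit numbers: no match on 11-digit input
ten <- gsub("\\b(\\d{5})\\d(\\d{4})\\b", "\\1\\2", codes)

# Strict pattern for 11-digit numbers: the 6th digit is dropped
eleven <- gsub("\\b(\\d{5})\\d(\\d{5})\\b", "\\1\\2", codes)

print(ten)
print(eleven)
```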
Another possible solution, using stringr::str_replace_all and lookaround :
library(tidyverse)
data %>%
mutate(codes = str_replace_all(codes, "(?<=\\d{5})\\d(?=\\d{5})", ""))
#>   id                                codes
#> 1  1                           0800101001
#> 2  2 0800201002 / 0800201003 / 1713404034
#> 3  3              0800401005 / 0800501001
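If loading the tidyverse is not desired, the same lookaround pattern should also work in base R with perl=TRUE. This is a sketch of an equivalent, not part of the original answer:

```r
codes <- c("08001301001",
           "08002401002 / 08002601003 / 17134604034",
           "08004701005 / 08005101001")

# Delete a digit that has exactly-at-least five digits on each side;
# in an 11-digit number only the 6th digit satisfies both lookarounds
base_result <- gsub("(?<=\\d{5})\\d(?=\\d{5})", "", codes, perl = TRUE)
print(base_result)
```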

Find pattern and wrap between parenthesis the next letter

I have to find different patterns in a data frame column, once it is found, the next letter should be wrapped between parentheses:
Data:
a <- c('(acetyl)RKJOEQLKQ', 'LDFEION(acetyl)EFNEOW')
if the pattern is: '(acetyl)'
this is the output that I'd like to achieve:
Expected output:
b <- c('(R)KJOEQLKQ', 'LDFEION(E)FNEOW')
I know that I can find the pattern with gsub:
b <- gsub('(acetyl)', replacement = '', a)
However, I'm not sure how to approach the wrapping between the parenthesis of the next letter after the pattern is found.
Any help would be appreciated.
You can use
a <- c('(acetyl)RKJOEQLKQ', 'LDFEION(acetyl)EFNEOW')
gsub('\\(acetyl\\)(.)', '(\\1)', a)
## => [1] "(R)KJOEQLKQ" "LDFEION(E)FNEOW"
See the regex demo and the online R demo.
Details:
\(acetyl\) - matches a literal string (acetyl)
(.) - captures into Group 1 any single char
The (\1) replacement pattern replaces the matches with ( + Group 1 value + ).
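Since the question mentions several different patterns, one possible extension is a non-capturing alternation inside the escaped parentheses. In this sketch "(methyl)" is a made-up second tag used only for illustration:

```r
# "(methyl)" is hypothetical; only "(acetyl)" appears in the question
a <- c('(acetyl)RKJOEQLKQ', 'LDFEION(methyl)EFNEOW')

# (?:...) groups the alternatives without creating a capture group,
# so the following char is still Group 1
b <- gsub('\\((?:acetyl|methyl)\\)(.)', '(\\1)', a, perl = TRUE)
print(b)
```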

How to replace `qux$foo$bar` with `qux[["foo"]][["bar"]]`? (dollar subsetting to brackets subsetting)

In an R script (that I would read with readLines), I want to replace every occurrence of qux$foo$bar with qux[["foo"]][["bar"]]. But I'm not a regex master.
I started with this regex:
> gsub("(\\w*)(\\$)(\\w*)", '\\1[["\\3"]]', "qux$foo$bar; input$test$a$a") %>% cat
qux[["foo"]][["bar"]]; input[["test"]][["a"]][["a"]]
Nice. But I also want to handle the case of backticks. So I tried:
> gsub("(\\w*)(\\$)`{0,1}(\\w*)`{0,1}", '\\1[["\\3"]]', "qux$`foo`; bar$`baz`; x$uvw") %>% cat
qux[["foo"]]; bar[["baz"]]; x[["uvw"]]
Looks correct. But there could be a space between the backticks, and the previous approach does not work in that case. So I tried the following, which does not work either:
gsub("(\\w*)(\\$)`{0,1}(.*)`{0,1}", '\\1[["\\3"]]', "qux$`fo o`") %>% cat
qux[["fo o`"]]
Could you help to find the right regex pattern? It seems that instead of \\w I need something which means match a "word that can contain spaces".
You can use
gsub('(\\w*)(?|\\$`([^`]*)`|\\$([^\\s$]+))', '\\1[["\\2"]]', x, perl=TRUE)
## Or
gsub('\\$`([^`]*)`|\\$([^\\s$]+)', '[["\\1\\2"]]', x, perl=TRUE)
See the regex #1 demo and regex #2 demo. Details:
(\w*) - Group 1 (\1): zero or more word chars
(?|\$`([^`]*)`|\$([^\s$]+)) - a branch reset group matching either
\$`([^`]*)` - $, backtick, Group 2 (\2) capturing zero or more non-backtick chars, and a backtick.
| - or
\$([^\s$]+) - $, then Group 2 (\2) capturing one or more chars other than whitespace and $
See the R demo:
x <- c('qux$foo$bar','qux$foo$bar; input$test$a$a','qux$`foo`; bar$`baz`; x$uvw','qux$`fo o`', 'q_ux$f_o_o$b.a_r')
gsub('(\\w*)(?|\\$`([^`]*)`|\\$([^\\s$]+))', '\\1[["\\2"]]', x, perl=TRUE)
## Or
## gsub('\\$`([^`]*)`|\\$([^\\s$]+)', '[["\\1\\2"]]', x, perl=TRUE)
Output:
[1] "qux[[\"foo\"]][[\"bar\"]]"
[2] "qux[[\"foo\"]][[\"bar;\"]] input[[\"test\"]][[\"a\"]][[\"a\"]]"
[3] "qux[[\"foo\"]]; bar[[\"baz\"]]; x[[\"uvw\"]]"
[4] "qux[[\"fo o\"]]"
[5] "q_ux[[\"f_o_o\"]][[\"b.a_r\"]]"
Note: backslashes in the output are console artifacts to keep the double quoted strings valid string literals, they are not part of the plain text output.
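The two variants can be compared side by side. In the second one there is no branch reset, so each alternative has its own group, and the group that did not take part in the match contributes an empty string to the replacement. A minimal sketch (v1 and v2 are illustrative names); writeLines() shows the plain text without the console escapes:

```r
x <- c('qux$foo$bar', 'qux$`fo o`', 'q_ux$f_o_o$b.a_r')

# Variant 1: branch reset, both alternatives write into Group 2
v1 <- gsub('(\\w*)(?|\\$`([^`]*)`|\\$([^\\s$]+))', '\\1[["\\2"]]', x, perl = TRUE)

# Variant 2: separate Groups 1 and 2; only one of them is non-empty per match
v2 <- gsub('\\$`([^`]*)`|\\$([^\\s$]+)', '[["\\1\\2"]]', x, perl = TRUE)

writeLines(v1)
```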
You might repeat optional spaces before and after matching 1 or more word characters.
You don't need a capture group for the $ but instead you could use a capture group to pair up the backtick in case it is there or not using a backreference to group 2.
To repeat 0+ whitespace chars you can also use \s but that could also match a newline.
Note that \w* matches optional word chars, and {0,1} can be written as ?
(\w*)\$(`?)( *\w+(?: +\w+)* *)\2
The pattern matches:
(\w*) Capture group 1 Match optional word characters
\$ Match $
(`?) Capture group 2, optionally match a backtick
( *\w+(?: +\w+)* *) Capture group 3 Match repetitions of word characters between spaces
\2 Backreference to what is captured in group 2 (yes or no backtick)
gsub("(\\w*)\\$(`?)( *\\w+(?: +\\w+)* *)\\2", '\\1[["\\3"]]', "qux$fo o$bar", perl=TRUE)
Output
[1] "qux[[\"fo o\"]][[\"bar\"]]"
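The same pattern can also be tried on the backticked input from the question. This is a small sketch; the variable names pattern and out are illustrative:

```r
pattern <- "(\\w*)\\$(`?)( *\\w+(?: +\\w+)* *)\\2"
x <- c("qux$`fo o`; x$uvw", "qux$fo o$bar")

# Group 2 is either a backtick or empty, and \\2 demands the same thing
# again after the name, so an opening backtick must be closed
out <- gsub(pattern, '\\1[["\\3"]]', x, perl = TRUE)
writeLines(out)
```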

Keep only the first letter of each word after a comma

I have strings like Sacher, Franz Xaver or Nishikawa, Kiyoko.
Using R, I want to change them to Sacher, F. X. or Nishikawa, K..
In other words, the first letter of each word after the comma should be retained with a dot (and a whitespace if another word follows).
Here is a related response, but it cannot be applied to my case 1:1 as it does not have a comma in its strings; it seems that simply adding (?<=, ) does not work.
E.g. in the following attempts, gsub() replaces everything, while my str_replace_all()-attempt leads to an error:
TEST <- c("Sacher, Franz Xaver", "Nishikawa, Kiyoko", "Al-Assam, Muhammad")
# first attempt
# (resembles the response from the other thread)
gsub('\\b(\\pL)\\pL{2,}|.','\\U\\1', TEST, perl = TRUE)
# second attempt
# error: "Incorrect unicode property"
stringr::str_replace_all(TEST, '(?<=, )\\b(\\pL)\\pL{2,}|.','\\U\\1')
I would be grateful for your help!
You can use
gsub("(*UCP)^[^,]+(*SKIP)(*F)|\\b(\\p{L})\\p{L}*", "\\U\\1.", TEST, perl=TRUE)
See the regex demo. Details:
(*UCP) - the PCRE verb that will make \b Unicode aware
^[^,]+(*SKIP)(*F) - start of string and then one or more chars other than a comma, and then the match is failed and skipped, so the next match attempt starts at the location where the failure occurred
| - or
\b - word boundary
(\p{L}) - Group 1: any Unicode letter
\p{L}* - zero or more Unicode letters
See the R demo:
TEST <- c("Sacher, Franz Xaver", "Nishikawa, Kiyoko", "Al-Assam, Muhammad")
gsub("(*UCP)^[^,]+(*SKIP)(*F)|\\b(\\p{L})\\p{L}*", "\\U\\1.", TEST, perl=TRUE)
## => [1] "Sacher, F. X." "Nishikawa, K." "Al-Assam, M."
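For comparison, the effect of the (*SKIP)(*F) part (protecting everything before the comma) can be approximated without backtracking verbs by separating the surname first. This is only a sketch under the assumption that each string contains exactly one comma and that plain \w is sufficient for the names; all variable names are illustrative.

```r
TEST <- c("Sacher, Franz Xaver", "Nishikawa, Kiyoko", "Al-Assam, Muhammad")

surname <- sub(",.*$", "", TEST)         # everything before the comma
given   <- sub("^[^,]*,\\s*", "", TEST)  # everything after the comma

# Reduce each given name to its uppercased first letter plus a dot
initials <- gsub("\\b(\\w)\\w*", "\\U\\1.", given, perl = TRUE)

abbreviated <- paste0(surname, ", ", initials)
print(abbreviated)
```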
A crude approach splitting the string :
TEST <- c("Sacher, Franz Xaver", "Nishikawa, Kiyoko", "Al-Assam, Muhammad")
sapply(strsplit(TEST, '\\s+'), function(x)
paste0(x[1], paste0(substr(x[-1], 1, 1), collapse = '.'), '.'))
#[1] "Sacher,F.X." "Nishikawa,K." "Al-Assam,M."
An approach using multiple backreference:
gsub("(\\b\\w+,\\s)(\\b\\w).*(\\b\\w)*", "\\1\\2.\\3", TEST)
[1] "Sacher, F." "Nishikawa, K." "Al-Assam, M."
Here, we use three capturing groups that gsub's replacement argument refers to via backreferences:
(\\b\\w+,\\s): this first group captures the last name plus the comma followed by whitespace
(\\b\\w): this second group captures the initial of the first name
(\\b\\w)*: this third group is meant to capture the initial of the middle name; in the output above it stays empty for "Sacher, Franz Xaver", presumably because the preceding .* has already consumed the rest of the string

R Regex capture group?

I have a lot of strings like this:
2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0
I want to extract the substring that lies right after the last "/" and ends with "_":
556662
I have found out how to extract: /01/01/07/556662
by using the following regex: (\/)(.*?)(?=\_)
Please advise how can I capture the right group.
You may use
x <- "2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0"
regmatches(x, regexpr(".*/\\K[^_]+", x, perl=TRUE))
## [1] "556662"
See the regex and R demo.
Here, the regex matches and outputs the first substring that matches
.*/ - any 0+ chars as many as possible up to the last /
\K - omits this part from the match
[^_]+ - puts 1 or more chars other than _ into the match value.
Or, a sub solution:
sub(".*/([^_]+).*", "\\1", x)
See the regex demo.
Here, it is similar to the previous one, but the 1 or more chars other than _ are captured into Group 1 (\1 in the replacement pattern) and the trailing .* make sure the whole input is matched (and consumed, ready to be replaced).
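One practical difference between the two base R solutions shows up on input that does not match. In this sketch the second element is a made-up string without any slash:

```r
x <- c("2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0",
       "no-slash-here")  # hypothetical non-matching input

# regmatches() returns only the elements that matched, so the result is shorter
via_K <- regmatches(x, regexpr(".*/\\K[^_]+", x, perl = TRUE))

# sub() keeps the vector length and returns non-matching input unchanged
via_sub <- sub(".*/([^_]+).*", "\\1", x)

print(via_K)
print(via_sub)
```

Which behavior is preferable depends on whether the result has to stay aligned with the input, for example as a data frame column.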
Alternative non-base R solutions
If you can afford or prefer to work with stringi, you may use
library(stringi)
stri_match_last_regex("2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0", ".*/([^_]+)")[,2]
## [1] "556662"
This will match a string up to the last / and will capture into Group 1 (that you access in Column 2 using [,2]) 1 or more chars other than _.
Or
stri_extract_last_regex("2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0", "(?<=/)[^_/]+")
## => [1] "556662"
This will extract the last match of a string that consists of 1 or more chars other than _ and / after a /.
You could use a capturing group:
/([^_/]+)_[^/\s]*
Explanation
/ Match literally
([^_/]+) Capture in a group matching not an underscore or forward slash
_[^/\s]* Match _ and then 0+ times not a forward slash or a whitespace character
One option to get the capturing group might be to get the second column using str_match:
library(stringr)
str = c("2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0")
str_match(str, "/([^_/]+)_[^/\\s]*")[,2]
# [1] "556662"
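A base R counterpart to str_match is regexec combined with regmatches, which also returns the full match followed by the capture groups. A minimal sketch, assuming an R version in which regexec accepts perl = TRUE; the names m and id are illustrative:

```r
x <- "2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0"

# For each input string, regmatches() yields a character vector:
# element 1 is the whole match, element 2 is Group 1
m <- regmatches(x, regexec("/([^_/]+)_[^/\\s]*", x, perl = TRUE))
id <- m[[1]][2]
print(id)
```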
I changed the Regex rules according to the code of Wiktor Stribiżew.
x <- "2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0"
regmatches(x, regexpr(".*/([0-9]+)", x, perl=TRUE))
sub(".*/([0-9]+).*", "\\1", x)
Output
[1] "2019/01/01/07/556662"
[1] "556662"

Resources