Find pattern and wrap between parenthesis the next letter - r

I have to find different patterns in a data frame column, once it is found, the next letter should be wrapped between parentheses:
Data:
a <- c('(acetyl)RKJOEQLKQ', 'LDFEION(acetyl)EFNEOW')
if the pattern is: '(acetyl)'
this is the output that I'd like to achieve:
Expected output:
b <- c('(R)KJOEQLKQ', 'LDFEION(E)FNEOW')
I know how that I can find the pattern with gsub:
b <- gsub('(acetyl)', replacement = '', a)
However, I'm not sure how to approach the wrapping between the parenthesis of the next letter after the pattern is found.
Any help would be appreciated.

You can use
a <- c('(acetyl)RKJOEQLKQ', 'LDFEION(acetyl)EFNEOW')
gsub('\\(acetyl\\)(.)', '(\\1)', a)
## => [1] "(R)KJOEQLKQ" "LDFEION(E)FNEOW"
See the regex demo and the online R demo.
Details:
\(acetyl\) - matches a literal string (acetyl)
(.) - captures into Group 1 any single char
The (\1) replacement pattern replaces the matches with ( + Group 1 value + ).

Related

Removing brackets in a string without the content

I would like to rearrange the Data I have. It is composed just with names, but some are with brackets and I would like to get rid, to keep the content, and habe at the end 2 names.
For exemple
df <- c ("Do(i)lfal", "Do(i)lferl", "Steff(l)", "Steffe", "Steffi")
I would like to have at the end
df <- c( "Doilfal", "Dolfal", "Doilferl", "Dolferl", "Steff", "Steffl", "Steffe", "Steffi")
I tried
sub("(.*)(\\([a-z]\\))(.*)$", "\\1\\2, \\1\\2\\3", df)
But it is not very working
Thank you very much
df = gsub("[\\(\\)]", "", df)
You made two small mistakes:
In the first case you want \1\2\3, because you want all letter. It's in the second name that you want \1\3 (skipping the middle vowel).
You placed the parentheses themselves (i) inside the capture group. So it's also being capture. You must place the capture group only around the thing inside the parentheses.
A small change to your regex does it:
sub("(.*)\\(([a-z])\\)(.*)$", "\\1\\2\\3, \\1\\3", df)
You can use
df <- c ("Do(i)lfal", "Do(i)lferl", "Steff(l)", "Steffe", "Steffi")
unlist(strsplit( paste(sub("(.*?)\\(([a-z])\\)(.*)", "\\1\\2\\3, \\1\\3", df), collapse=","), "\\s*,\\s*"))
# => [1] "Doilfal"
# [2] "Dolfal"
# [3] "Doilferl"
# [4] "Dolferl"
# [5] "Steffl"
# [6] "Steff"
# [7] "Steffe"
# [8] "Steffi"
See the online R demo and the first regex demo. Details:
First, the sub is executed with the first regex, (.*?)\(([a-z])\)(.*) that matches
(.*?) - any zero or more chars as few as possible, captured into Group 1 (\1)
\( - a ( char
([a-z]) - Group 2 (\2): any ASCII lowercase letter
\) - a ) char
(.*) - any zero or more chars as many as possible, captured into Group 3 (\3)
Then, the results are pasted with a , char as a collpasing char
Then, the resulting char vector is split with the \s*,\s* regex that matches a comma enclosed with zero or more whitespace chars.

Using gsub or sub function to only get part of a string?

Col
WBU-ARGU*06:03:04
WBU-ARDU*08:01:01
WBU-ARFU*11:03:05
WBU-ARFU*03:456
I have a column which has 75 rows of variables such as the col above. I am not quite sure how to use gsub or sub in order to get up until the integers after the first colon.
Expected output:
Col
WBU-ARGU*06:03
WBU-ARDU*08:01
WBU-ARFU*11:03
WBU-ARFU*03:456
I tried this but it doesn't seem to work:
gsub("*..:","", df$col)
Following may help you here too.
sub("([^:]*):([^:]*).*","\\1:\\2",df$dat)
Output will be as follows.
> sub("([^:]*):([^:]*).*","\\1:\\2",df$dat)
[1] "WBU-ARGU*06:03" "WBU-ARDU*08:01" "WBU-ARFU*11:03" "WBU-ARFU*03:456b"
Where Input for data frame is as follows.
dat <- c("WBU-ARGU*06:03:04","WBU-ARDU*08:01:01","WBU-ARFU*11:03:05","WBU-ARFU*03:456b")
df <- data.frame(dat)
Explanation: Following is only for explanation purposes.
sub(" ##using sub for global subtitution function of R here.
([^:]*) ##By mentioning () we are keeping the matched values from vector's element into 1st place of memory(which we could use later), which is till next colon comes it will match everything.
: ##Mentioning letter colon(:) here.
([^:]*) ##By mentioning () making 2nd place in memory for matched values in vector's values which is till next colon comes it will match everything.
.*" ##Mentioning .* to match everything else now after 2nd colon comes in value.
,"\\1:\\2" ##Now mentioning the values of memory holds with whom we want to substitute the element values \\1 means 1st memory place \\2 is second memory place's value.
,df$dat) ##Mentioning df$dat dataframe's dat value.
You may use
df$col <- sub("(\\d:\\d+):\\d+$", "\\1", df$col)
See the regex demo
Details
(\\d:\\d+) - Capturing group 1 (its value will be accessible via \1 in the replacement pattern): a digit, a colon and 1+ digits.
: - a colon
\\d+ - 1+ digits
$ - end of string.
R Demo:
col <- c("WBU-ARGU*06:03:04","WBU-ARDU*08:01:01","WBU-ARFU*11:03:05","WBU-ARFU*03:456")
sub("(\\d:\\d+):\\d+$", "\\1", col)
## => [1] "WBU-ARGU*06:03" "WBU-ARDU*08:01" "WBU-ARFU*11:03" "WBU-ARFU*03:456"
Alternative approach:
df$col <- sub("^(.*?:\\d+).*", "\\1", df$col)
See the regex demo
Here,
^ - start of string
(.*?:\\d+) - Group 1: any 0+ chars, as few as possible (due to the lazy *? quantifier), then : and 1+ digits
.* - the rest of the string.
However, it should be used with the PCRE regex engine, pass perl=TRUE:
col <- c("WBU-ARGU*06:03:04","WBU-ARDU*08:01:01","WBU-ARFU*11:03:05","WBU-ARFU*03:456")
sub("^(.*?:\\d+).*", "\\1", col, perl=TRUE)
## => [1] "WBU-ARGU*06:03" "WBU-ARDU*08:01" "WBU-ARFU*11:03" "WBU-ARFU*03:456"
See the R online demo.
sub("(\\d+:\\d+):\\d+$", "\\1", df$Col)
[1] "WBU-ARGU*06:03" "WBU-ARDU*08:01" "WBU-ARFU*11:03" "WBU-ARFU*03:456"
Alternatively match what you want (instead of subbing out what you don't want) with stringi:
stringi::stri_extract_first(df$Col, regex = "[A-Z-\\*]+\\d+:\\d+")
Slightly more concise stringr:
stringr::str_extract(df$Col, "[A-Z-\\*]+\\d+:\\d+")
# or
stringr::str_extract(df$Col, "[\\w-*]+\\d+:\\d+")

Remove part of a string until a character is found R

I have a regex problem or somewhat regex related problem...
I have strings that look like this:
"..........))))..)))))))"
"....))))))))...)).))))..))"
"......))))...)))...)))))"
I want to remove the initial dot sequence, so that I only get the string starting by the first occurence of ")" symbol. Say, the output would be somthing like:
"))))..)))))))"
"))))))))...)).))))..))"
"))))...)))...)))))"
I assume it would be somewhat similar to a lookahead regex but cannot figure out the correct one...
Any help?
Thanks
We match for 0 or more dots (\\.*) from the start (^) of the string and replace it with blank
sub("^\\.*", "", v1)
#[1] "))))..)))))))" "))))))))...)).))))..))" "))))...)))...)))))"
If it needs to start from ), then as above match 0 or more dots till the first ) and replace with the )
sub("^\\.*\\)", ")", v1)
#[1] "))))..)))))))" "))))))))...)).))))..))" "))))...)))...)))))"
data
v1 <- c("..........))))..)))))))", "....))))))))...)).))))..))", "......))))...)))...)))))")
You can simply remove dots from the beginning of the line (marked in the regex by ^) until you reach a non-dot character:
a <- "..........))))..)))))))"
b <- "....))))))))...)).))))..))"
c <- "......))))...)))...)))))"
sub("^\\.*", "", a) # "))))..)))))))"
sub("^\\.*", "", b) # "))))))))...)).))))..))"
sub("^\\.*", "", c) # "))))...)))...)))))"
The way your question is worded, the goal isn't to remove just . from the beginning, but any symbol until the first ) is encountered. So this answer is a more general solution.
stringr::str_extract("..........))))..)))))))","\\).*$")
Alternatively, if you want to stick with base R, you could use sub/gsub like this:
gsub("[^\\)]*(\\).*$)","\\1","..........))))..)))))))")
sub("[^\\)]*","","..........))))..)))))))")

R gsub fixed pattern and non-fixed pattern at the same time

I have a column in an R data.table of data with text like this:
> my_table
descr
1: DESCRIPTIONA - JONES:4:2
2: DESCRIPTIONB - WILDER:6:7
---
253: DESCRIPTIONA - MANN:5:8
254: DESCRIPTIONB - ROBERTS:3:4
Notice there are two kinds of Descriptions: DESCRIPTIONA and DESCRIPTIONB. I want to replace the Whole description part including the names up to the first semi colon with A if it's DESCRIPTIONA and B if it's DESCRIPTIONB. That means I totally don't care about the name. The output should look something like this:
> my_table
descr
1: A:4:2
2: B:6:7
---
253: A:5:8
254: B:3:4
I'm trying to use gsub to accomplish this, but I can't get the regex to replace just the part (DESCRIPTIONA - JONES):4:2. It's difficult because each name is different, and is of a different length. Any ideas?
x = c(
"DESCRIPTIONA - JONES:4:2",
"DESCRIPTIONB - WILDER:6:7",
"DESCRIPTIONA - MANN:5:8",
"DESCRIPTIONB - ROBERTS:3:4"
)
gsub(pattern = "DESCRIPTION(.)[^:]*", replacement = "\\1", x)
# [1] "A:4:2" "B:6:7" "A:5:8" "B:3:4"
Explanation: "DESCRIPTION(.)[^:]*" matches the word DESCRIPTION, then a single character (.) which is "saved" as a capturing group by the parens (), then it continues to match non-colon characters [^:], as many as possible (*). It replaces the full match with the first (\\1) capturing group.
You can play with it here to understand better: https://regex101.com/r/Sc7oC1/1

r: regex for containing pattern with negation

Suppose I have the following two strings and want to use grep to see which match:
business_metric_one
business_metric_one_dk
business_metric_one_none
business_metric_two
business_metric_two_dk
business_metric_two_none
And so on for various other metrics. I want to only match the first one of each group (business_metric_one and business_metric_two and so on). They are not in an ordered list so I can't index and have to use grep. At first I thought to do:
.*metric.*[^_dk|^_none]$
But this doesn't seem to work. Any ideas?
You need to use a PCRE pattern to filter the character vector:
x <- c("business_metric_one","business_metric_one_dk","business_metric_one_none","business_metric_two","business_metric_two_dk","business_metric_two_none")
grep("metric(?!.*_(?:dk|none))", x, value=TRUE, perl=TRUE)
## => [1] "business_metric_one" "business_metric_two"
See the R demo
The metric(?!.*(?:_dk|_none)) pattern matches
metric - a metric substring
(?!.*_(?:dk|none)) - that is not followed with any 0+ chars other than line break chars followed with _ and then either dk or none.
See the regex demo.
NOTE: if you need to match only such values that contain metric and do not end with _dk or _none, use a variation, metric.*$(?<!_dk|_none) where the (?<!_dk|_none) negative lookbehind fails the match if the string ends with either _dk or _none.
You can also do something like this:
grep("^([[:alpha:]]+_){2}[[:alpha:]]+$", string, value = TRUE)
# [1] "business_metric_one" "business_metric_two"
or use grepl to match dk and none, then negate the logical when you're indexing the original string:
string[!grepl("(dk|none)", string)]
# [1] "business_metric_one" "business_metric_two"
more concisely:
string[!grepl("business_metric_[[:alpha:]]+_(dk|none)", string)]
# [1] "business_metric_one" "business_metric_two"
Data:
string = c("business_metric_one","business_metric_one_dk","business_metric_one_none","business_metric_two","business_metric_two_dk","business_metric_two_none")

Resources