I'd like to use stringr to add parentheses around groups of text separated by commas. That is, wherever a value consists of items separated by one or more commas, I'd like parentheses around the whole value. There will always be a "=" before this type of string begins, and it will be followed by either a space or nothing (the string ends). Is there a generalized way to do this? Here's a sample problem:
Sample:
a <- data.frame(Rule = c("A=0 & B=Grp1,Grp2", "A=0 & B=Grp1,Grp3,Grp4 & C=1"))
a
Rule
1 A=0 & B=Grp1,Grp2
2 A=0 & B=Grp1,Grp3,Grp4 & C=1
Desired Output:
Rule
1 A=0 & B=(Grp1,Grp2)
2 A=0 & B=(Grp1,Grp3,Grp4) & C=1
Here is another potential solution. I have altered the example input to show that it works with multiple "Grp's" per line:
library(stringr)
a <- data.frame(Rule = c("A=0 & B=Grp1,Grp2",
"A=0 & B=Grp1,Grp3,Grp4 & C=1 & D=Grp5,Grp6"))
str_replace_all(a$Rule, "=([^, &]+,[^ $]+)", "=(\\1)")
#> [1] "A=0 & B=(Grp1,Grp2)"
#> [2] "A=0 & B=(Grp1,Grp3,Grp4) & C=1 & D=(Grp5,Grp6)"
Created on 2022-11-23 by the reprex package (v2.0.1)
Explanation:
regex = "=([^, &]+,[^ $]+)", "=(\\1)"
=( starting with an equals sign, capture a group
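[^, &]+, with one or more characters that aren't ",", " " or "&", followed by a comma
[^ $]+) followed by one or more characters that aren't " " or a literal "$" (inside a character class "$" is not an end-of-line anchor); the match simply stops at the next space or at the end of the string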
=(\\1) then replace the equals sign and add parentheses around the captured group (e.g. the Grp1,Grp2)
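A variant of the same idea, sketched in base R in case stringr is not loaded. The pattern here is my own assumption rather than the answer's: it requires every comma-separated item to be free of commas, spaces and "&", so it does not rely on "$" inside the brackets. The rules vector is made up for illustration.

```r
# Hypothetical input; the third rule has no comma and should stay unchanged.
rules <- c("A=0 & B=Grp1,Grp2",
           "A=0 & B=Grp1,Grp3,Grp4 & C=1 & D=Grp5,Grp6",
           "A=0 & B=Grp1")

# "=" followed by one item and at least one more ",item";
# an item is a run of characters other than ",", " " and "&".
wrapped <- gsub("=([^, &]+(?:,[^, &]+)+)", "=(\\1)", rules, perl = TRUE)
print(wrapped)
```

Only values that actually contain a comma are wrapped, so "A=0", "C=1" and "B=Grp1" are left alone.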
This should work:
Find: (([A-Za-z\d]+,)+[A-Za-z\d]+)
Replace: ($1)
Explanation:
[A-Za-z\d] is any alphanumeric character.
The inner group looks for 1 or more copies of groups of alphanum characters separated by commas. (e.g. Abcd1,Abcd2,)
The outer group then looks for the closing alphanumeric group, which doesn't have a comma after it. (e.g. Abcd3)
These are concatenated then the whole group is captured.
Last thing to do is the replacement, which is pretty self explanatory.
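In R, the Find/Replace pair above could be applied with gsub, as in this minimal sketch. Two details are adaptations on my part: the replacement reference is written \\1 instead of $1, and perl = TRUE is set so that \d is understood inside the brackets. The input is the question's sample data.

```r
rules <- c("A=0 & B=Grp1,Grp2", "A=0 & B=Grp1,Grp3,Grp4 & C=1")

# Wrap every run of alphanumeric items that are joined by commas.
grouped <- gsub("(([A-Za-z\\d]+,)+[A-Za-z\\d]+)", "(\\1)", rules, perl = TRUE)
print(grouped)
```

Note that this pattern does not look for the "=", so it would also wrap a comma-separated run that appears elsewhere in the string.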
I would like to use R to remove all underscores except those between words. In other words, the code should remove underscores at the beginning or at the end of a word.
The result should be
'hello_world and hello_world'.
I want to use the pre-built character classes. Right now I have learned how to exempt particular characters with the following code, but I don't know how to use the word boundary sequences.
test<-"hello_world and _hello_world_"
gsub("[^_[:^punct:]]", "", test, perl=T)
You can use
gsub("[^_[:^punct:]]|_+\\b|\\b_+", "", test, perl=TRUE)
Details:
[^_[:^punct:]] - any punctuation except _
| - or
_+\b - one or more _ at the end of a word
| - or
\b_+ - one or more _ at the start of a word
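To see the three alternatives at work, here is a small sketch on a made-up string (messy is hypothetical, not from the question) that contains doubled underscores and other punctuation:

```r
# Hypothetical input with leading, trailing and inner underscores plus punctuation.
messy <- "__foo_bar__ baz_, _qux!"

# Same pattern as in the answer: punctuation other than "_",
# underscores at the end of a word, underscores at the start of a word.
cleaned <- gsub("[^_[:^punct:]]|_+\\b|\\b_+", "", messy, perl = TRUE)
print(cleaned)
```

The underscore inside foo_bar survives because "_" is itself a word character, so there is no word boundary on either side of it.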
One non-regex way is to split and use trimws by setting the whitespace argument to _, i.e.
paste(sapply(strsplit(test, ' '), function(i)trimws(i, whitespace = '_')), collapse = ' ')
#[1] "hello_world and hello_world"
We can remove every underscore that has a word boundary on either side. We use lookbehind and lookahead assertions to find such underscores. To remove the underscores at the start and end of the string we use trimws.
test<-"hello_world and _hello_world_"
gsub("(?<=\\b)_|_(?=\\b)", "", trimws(test, whitespace = '_'), perl = TRUE)
#[1] "hello_world and hello_world"
You could use:
test <- "hello_world and _hello_world_"
output <- gsub("(?<![^\\W])_|_(?![^\\W])", "", test, perl=TRUE)
output
[1] "hello_world and hello_world"
Explanation of regex:
(?<![^\\W]) assert that what precedes is a non word character OR the start of the input
_ match an underscore to remove
| OR
_ match an underscore to remove, followed by
(?![^\\W]) assert that what follows is a non word character OR the end of the input
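Since [^\W] is just another way of writing \w, the same assertions can be spelled a little more directly. The sketch below compares the two spellings on the question's input; the shorthand form is my rewording, not part of the answer.

```r
test <- "hello_world and _hello_world_"

# Pattern from the answer.
original <- gsub("(?<![^\\W])_|_(?![^\\W])", "", test, perl = TRUE)

# Equivalent spelling: "_" not preceded by a word character,
# or "_" not followed by a word character.
shorthand <- gsub("(?<!\\w)_|_(?!\\w)", "", test, perl = TRUE)

print(original)
print(identical(original, shorthand))
```

One caveat with either spelling: in a run of several underscores at a word edge only the outermost one is removed, because the next underscore has a word character (another "_") beside it. For example, "__a" becomes "_a".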
For a text field, I would like to expose the entries that contain invalid characters. The list of invalid characters is unknown; I only know the list of accepted ones.
For example, for the French language the accepted list is
A-z, 1-9, [:punct:], space, àéèçè, hyphen, etc.
The list of invalid characters is unknown, yet I want anything unusual to surface. For example, I would want
This is an 2-piece à-la-carte dessert to pass, while
'Ã this Øs an apple' pops up as an anomaly.
The 'not contain' notion in R does not behave as I would like. For example,
grep("[^(abc)]",c("abcdef", "defabc", "apple") )
(intended to find those that do not contain 'abc') matches all three, while
grep("(abc)",c("abcdef", "defabc", "apple") )
behaves correctly and matches only the first two. Am I missing something?
How can we do that in R? Also, how can we include the hyphen in the list of accepted characters?
[a-z1-9[:punct:] àâæçéèêëîïôœùûüÿ-]+
The above regex matches any of the following (one or more times). Note that the parameter ignore.case=T used in the code below allows the following to also match uppercase variants of the letters.
a-z Any lowercase ASCII letter
1-9 Any digit in the range from 1 to 9 (excludes 0)
[:punct:] Any punctuation character
" " The space character
àâæçéèêëîïôœùûüÿ Any valid French character with a diacritic mark
- The hyphen character
x <- c("This is an 2-piece à-la-carte dessert", "Ã this Øs an apple")
gsub("[a-z1-9[:punct:] àâæçéèêëîïôœùûüÿ-]+", "", x, ignore.case=T)
The code above replaces all valid characters with nothing. The result is all invalid characters that exist in the string. The following is the output:
[1] "" "ÃØ"
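If the goal is only to flag which entries contain something unexpected, a negated version of the class can be passed to grepl. This is a sketch under a few assumptions of mine: the accented letters are written as \u escapes to keep the source ASCII-only, perl = TRUE is used, the class lists A-Za-z and 0-9 explicitly instead of relying on ignore.case, and the names french, invalid_class and has_invalid are illustrative.

```r
# Accented letters from the answer, written as Unicode escapes:
# a-grave, a-circumflex, ae, c-cedilla, e-acute, e-grave, e-circumflex, e-diaeresis,
# i-circumflex, i-diaeresis, o-circumflex, oe, u-grave, u-circumflex, u-diaeresis, y-diaeresis
french <- paste0("\u00e0\u00e2\u00e6\u00e7\u00e9\u00e8\u00ea\u00eb",
                 "\u00ee\u00ef\u00f4\u0153\u00f9\u00fb\u00fc\u00ff")

# Negated class: matches any single character that is NOT accepted.
invalid_class <- paste0("[^A-Za-z0-9[:punct:] ", french, "-]")

x <- c("This is an 2-piece \u00e0-la-carte dessert",
       "\u00c3 this \u00d8s an apple")

# TRUE where at least one unaccepted character occurs.
has_invalid <- grepl(invalid_class, x, perl = TRUE)
print(has_invalid)
```

Because the class is case-sensitive here, uppercase accented letters would also be flagged unless they are added to the list.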
If by "expose the invalid characters" you mean delete the "accepted" ones, then a regex character class should be helpful. From the ?regex help page we can see that a hyphen is already part of the punctuation character vector;
[:punct:]
Punctuation characters:
! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
So the code could be:
x <- 'Ã this Øs an apple'
gsub("[A-z1-9[:punct:] àéèçè]+", "", x)
#[1] "ÃØ"
Note that regex has a predefined, locale-specific "[:alpha:]" named character class that would probably be both safer and more compact than the expression "[A-zàéèçè]" especially since the post from ctwheels suggests that you missed a few. The ?regex page indicates that "[0-9A-Za-z]" might be both locale- and encoding-specific.
If by "expose" you instead meant "identify the position within the string" then you could use the negation operator "^" within the character class formalism and apply gregexpr:
gregexpr("[^A-z1-9[:punct:] àéèçè]+", x)
[[1]]
[1] 1 8
attr(,"match.length")
[1] 1 1
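The question's grep("[^(abc)]", ...) puzzle can be explained with a short sketch: inside square brackets the parentheses are ordinary characters, so [^(abc)] means "any single character other than (, a, b, c or )", and all three strings contain such a character. To select the elements that do not contain "abc", one option is grep's invert argument or a negated grepl. The variable names are illustrative.

```r
x <- c("abcdef", "defabc", "apple")

# Matches every element that has at least one character outside "(", "a", "b", "c", ")".
any_other_char <- grep("[^(abc)]", x)

# Indices of the elements that do NOT contain the substring "abc".
without_abc <- grep("abc", x, invert = TRUE)

# The same test as a logical vector.
lacks_abc <- !grepl("abc", x, fixed = TRUE)

print(any_other_char)
print(without_abc)
print(lacks_abc)
```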
One column of my data.frame looks like the following:
c("BP_1_CSPP", "BP_2_GEGS", "BP_3_AEAG", "BP_4_KPAP", "BP_5_TAKP",
"BP_6_GGDR", "BP_7_MQQP", "BP_8_EEEE", "BP_9_RSDP", "BP_10_APAS",
"BP_11_KRGG", "BP_12_RSQQ", "BP_13_QQLS", "BP_14_EPEV", "BP_15_AAPS",
"BP_16_SDVT", "BP_17_GQQQ", "BP_18_AETP", "BP_19_PPSA", "BP_20_DATP",
"EpQ_1_AYAT", "EpQ_2_HEKL", "EpQ_3_SCSV", "EpQ_4_MAYV", "EpQ_5_LKDP",
"EpQ_6_ERCE", "EpQ_7_DNPA", "EpQ_8_YGIS", "EpQ_9_GMSS", "EpQ_10_AAKK",
"EpQ_11_NIRI", "EpQ_12_ERRR", "EpQ_13_MDRE", "EpQ_14_SRQM", "EpQ_15_DWSI",
"EpQ_16_VLVQ", "EpQ_17_GRTI", "EpQ_18_EKVR", "EpQ_19_PDVA", "EpQ_20_ADVT",
"LbT_1_RPGG", "LbT_2_TQGD", "LbT_3_EVKS", "LbT_4_VIEM", "LbT_5_GSAD",
"LbT_6_VRPI", "LbT_7_CELG", "LbT_8_APQQ", "LbT_9_SAEE", "LbT_10_GEAE",
"LbT_11_EELR", "LbT_12_EWAN", "LbT_13_IKEE", "LbT_14_VSDF", "LbT_15_WEDV",
"LbT_16_SGGA", "LbT_17_KATN", "LbT_18_EREG", "LbT_19_AWAS", "LbT_20_VDRD",
"abc_1_CVTQ", "abc_2_KEAP", "abc_3_TAYI", "abc_4_MITN", "abc_5_MPTV",
"abc_6_TRTG", "abc_7_KSTI", "abc_8_KEAI", "abc_9_HVYS", "abc_10_LGMG",
"abc_11_VAYQ", "abc_12_AGTG", "abc_13_TDSW", "abc_14_HKKS", "abc_15_YGLA",
"abc_16_WEEW", "abc_17_HSTI", "abc_18_EKCI", "abc_19_PAGI", "abc_20_TGTI",
"TcII")
Considering all the numbers < 10 that are located within the strings (e.g. "BP_1_CSPP", "BP_2_GEGS"), I want to add a leading zero to them, so that I would have:
"BP_01_CSPP", "BP_02_GEGS", "BP_03_AEAG", "BP_04_KPAP", "BP_05_TAKP",
"BP_06_GGDR"
and so on.
This question almost did the job, yet it did not work for my data because:
The "0" will not always be inserted at the same position: some strings have 3 characters before the 0 to be inserted (e.g. BP_1_CSPP) while others have 4 (e.g. EpQ_3_SCSV).
Characters still follow the inserted zero, i.e. the zero is inserted in the middle of the string.
We can use sub to match the pattern of _ followed by a single number (([0-9])) captured as a group (inside the brackets) followed by _ and replace it with _ followed by 0, the backreference of the capture group (\\1) followed by _.
v1 <- sub("_([0-9])_", "_0\\1_", v1)
v1
#[1] "BP_01_CSPP" "BP_02_GEGS" "BP_03_AEAG" "BP_04_KPAP" "BP_05_TAKP" "BP_06_GGDR" "BP_07_MQQP" "BP_08_EEEE" "BP_09_RSDP" "BP_10_APAS" "BP_11_KRGG"
#[12] "BP_12_RSQQ" "BP_13_QQLS" "BP_14_EPEV" "BP_15_AAPS" "BP_16_SDVT" "BP_17_GQQQ" "BP_18_AETP" "BP_19_PPSA" "BP_20_DATP" "EpQ_01_AYAT" "EpQ_02_HEKL"
#[23] "EpQ_03_SCSV" "EpQ_04_MAYV" "EpQ_05_LKDP" "EpQ_06_ERCE" "EpQ_07_DNPA" "EpQ_08_YGIS" "EpQ_09_GMSS" "EpQ_10_AAKK" "EpQ_11_NIRI" "EpQ_12_ERRR" "EpQ_13_MDRE"
#[34] "EpQ_14_SRQM" "EpQ_15_DWSI" "EpQ_16_VLVQ" "EpQ_17_GRTI" "EpQ_18_EKVR" "EpQ_19_PDVA" "EpQ_20_ADVT" "LbT_01_RPGG" "LbT_02_TQGD" "LbT_03_EVKS" "LbT_04_VIEM"
#[45] "LbT_05_GSAD" "LbT_06_VRPI" "LbT_07_CELG" "LbT_08_APQQ" "LbT_09_SAEE" "LbT_10_GEAE" "LbT_11_EELR" "LbT_12_EWAN" "LbT_13_IKEE" "LbT_14_VSDF" "LbT_15_WEDV"
#[56] "LbT_16_SGGA" "LbT_17_KATN" "LbT_18_EREG" "LbT_19_AWAS" "LbT_20_VDRD" "abc_01_CVTQ" "abc_02_KEAP" "abc_03_TAYI" "abc_04_MITN" "abc_05_MPTV" "abc_06_TRTG"
#[67] "abc_07_KSTI" "abc_08_KEAI" "abc_09_HVYS" "abc_10_LGMG" "abc_11_VAYQ" "abc_12_AGTG" "abc_13_TDSW" "abc_14_HKKS" "abc_15_YGLA" "abc_16_WEEW" "abc_17_HSTI"
#[78] "abc_18_EKCI" "abc_19_PAGI" "abc_20_TGTI" "TcII"
If we are using strsplit, another option is to split by _, reformat the number with sprintf and then paste the pieces back together:
sapply(strsplit(v1, "_"), function(x) {
  if (length(x) > 1) x[2] <- sprintf("%02d", as.numeric(x[2]))
  paste(x, collapse = "_")
})
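Both approaches can be checked side by side on a small subset of the data. In this sketch v1 is a hypothetical four-element stand-in for the full column, and vapply with as.integer is used in place of sapply and as.numeric purely as a stylistic choice.

```r
# Small stand-in for the column: one-digit, two-digit and no-number cases.
v1 <- c("BP_1_CSPP", "BP_10_APAS", "EpQ_3_SCSV", "TcII")

# Regex approach: pad a single digit that sits between two underscores.
by_regex <- sub("_([0-9])_", "_0\\1_", v1)

# Split approach: reformat the second piece when there is one.
by_split <- vapply(strsplit(v1, "_", fixed = TRUE), function(parts) {
  if (length(parts) > 1) parts[2] <- sprintf("%02d", as.integer(parts[2]))
  paste(parts, collapse = "_")
}, character(1))

print(by_regex)
print(identical(by_regex, by_split))
```

Strings without an underscore, such as "TcII", pass through both versions unchanged.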
I have a string format that I would like to select from a character vector. The form is
123 123 1234
where the two spaces can also be a hyphen. i.e. 3 digits followed by space or hyphen, followed by 3 digits, followed by space or hyphen, followed by 4 digits
I am trying to do this by the following:
grep("^([0-9]{3}[ -.])([0-9]{3}[ -.])([0-9]{4}$)",mytext)
however this yields:
integer(0)
What am I doing wrong?
Your string has a whitespace at the end, so you can either consider that white space, like so:
grep("^([0-9]{3}[ -.])([0-9]{3}[ -.])([0-9]{4} $)",mytext)
Or remove the end of line assertion "$", like so:
grep("^([0-9]{3}[ -.])([0-9]{3}[ -.])([0-9]{4})",mytext)
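Also, as pointed out by Wiktor Stribiżew, the character class [ -.] matches any character in the range from " " to "." (which also includes characters such as "!", "+" and ","). To match only "-", "." and " ", put the "-" at the end (or the start) of the class, like [ .-], or escape it, like [ \-.]; note that in R the escaped form only works with perl=TRUE.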
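The difference between the range and the literal hyphen can be seen in a short sketch. The mytext vector is invented for illustration (the question does not show its data), the strict class [ -] accepts only a space or a hyphen as the question describes, and perl = TRUE is set so that the range is interpreted by code point.

```r
# Hypothetical inputs, including one with a trailing space and one with "+".
mytext <- c("123 123 1234", "123-123-1234", "123.123.1234",
            "123 123 1234 ", "123+123+1234", "1234 123 1234")

# "-" placed last in the class is a literal hyphen.
strict <- grepl("^[0-9]{3}[ -][0-9]{3}[ -][0-9]{4}$", mytext, perl = TRUE)

# " -." is a range from space to ".", so "+" and others slip through.
loose <- grepl("^[0-9]{3}[ -.][0-9]{3}[ -.][0-9]{4}$", mytext, perl = TRUE)

# Trimming surrounding whitespace first lets the "$" anchor stay in place.
strict_trimmed <- grepl("^[0-9]{3}[ -][0-9]{3}[ -][0-9]{4}$", trimws(mytext), perl = TRUE)

print(strict)
print(loose)
print(strict_trimmed)
```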
I need a regular expression that excludes the sub-string "job corps"
and requires at least 1 upper case letter, 1 lower case letter, 1 number and 1 symbol except "#".
I have written something like the following:
^((?!job corps).)(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[!#$%^&*]).*$
I tested the above regular expression, but it does not work for special characters.
Can anyone guide me on this?
If I understand your requirements correctly, you can use this pattern:
^(?![^a-z]*$|[^A-Z]*$|[^0-9]*$|[^!#$%^&*]*$|.*?job corps)[^#]*$
If you only want to allow characters from [a-zA-Z0-9!$%^&*], change the pattern to:
^(?![^a-z]*$|[^A-Z]*$|[^0-9]*$|[^!#$%^&*]*$|.*?job corps)[a-zA-Z0-9!$%^&*]*$
details:
^ # start of the string
(?! # not followed by any of these cases
[^a-z]*$ # non lowercase letters until the end
|
[^A-Z]*$ # non uppercase letters until the end
|
[^0-9]*$ # non digits until the end
|
[^!#$%^&*]*$ # none of the listed symbols until the end
|
.*?job corps # any characters and "job corps"
)
[^#]* # characters that are not a #
$ # end of the string
Note: you can write the range #$%& as #-& to save a character.
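Although the question does not name a language, the first pattern can be tried out in R with perl = TRUE, which is needed for the lookahead. The candidate strings below are invented to exercise each rule; this is a quick check, not an exhaustive test.

```r
pattern <- "^(?![^a-z]*$|[^A-Z]*$|[^0-9]*$|[^!#$%^&*]*$|.*?job corps)[^#]*$"

candidates <- c("Passw0rd!",       # satisfies every rule
                "password1!",      # no upper case letter
                "Passw0rd#",       # contains the forbidden "#"
                "job corps Aa1!",  # contains the excluded substring
                "Passw0rd")        # no symbol

accepted <- grepl(pattern, candidates, perl = TRUE)
print(accepted)
```

Only the first candidate is accepted; each of the others trips exactly one branch of the pattern.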
stribizhev, your answer is correct
^(?!.*job corps)(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[!#$%^&*])(?!.*#).*$
You can verify the expression at the following URL:
http://www.freeformatter.com/regex-tester.html