Extract portion of string with punctuation - r

I have a string:
string <- "newdatat.scat == \"RDS16\" ~ \"Asthma\","
and I want to extract separately:
RDS16
Asthma
What I've tried so far is:
extract <- str_extract(string,'~."(.+)')
but I am only able to get:
~ \"Asthma\",
If you have an answer, can you also kindly explain the regex behind it? I'm having a hard time converting string patterns to regex.

If you just need to extract sections surrounded by ", then you can use something like the following. The pattern ".*?" matches first ", then .*? meaning as few characters as possible, before finally matching another ". This will get you the strings including the " double quotes; you then just have to remove the double quotes to clean up.
Note that str_extract_all is used to return all matches, and that it returns a list of character vectors so we need to index into the list before removing the double quotes.
library(stringr)
string <- "newdatat.scat == \"RDS16\" ~ \"Asthma\","
str_extract_all(string, '".*?"') %>%
`[[`(1) %>%
str_remove_all('"')
#> [1] "RDS16" "Asthma"
Created on 2021-06-21 by the reprex package (v1.0.0)
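To see why the lazy quantifier matters, here is a minimal base R sketch that runs the greedy and the lazy pattern on the same string. The variable names greedy and lazy are only illustrative, and perl = TRUE is my choice here rather than something the answer requires.

```r
string <- "newdatat.scat == \"RDS16\" ~ \"Asthma\","

# Greedy: .* runs on to the LAST double quote, so both values end up in one match
greedy <- regmatches(string, gregexpr('".*"', string, perl = TRUE))[[1]]

# Lazy: .*? stops at the NEXT double quote, giving one match per quoted value
lazy <- regmatches(string, gregexpr('".*?"', string, perl = TRUE))[[1]]

length(greedy)  # one long match
length(lazy)    # two separate matches

# Strip the surrounding quotes from the lazy matches
gsub('"', '', lazy, fixed = TRUE)
```

The greedy version is essentially what the attempt in the question ran into: `.+` kept consuming characters up to the end of the string.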

Base R solutions:
# Solution 1:
# Extract strings (still quoted):
# dirtyStrings => list of strings
dirtyStrings <- regmatches(string, gregexpr('".*?"', string))
# Iterate over the list and "clean" - unquote - each
# element, store as a vector: result => character vector
result <- c(
  vapply(
    dirtyStrings,
    function(x) {
      noquote(gsub('"', '', x))
    },
    character(lengths(dirtyStrings))
  )
)
# Solution 2:
# Same as above, less generic -- assumes all strings
# will follow the same pattern: result => character vector
result <- unlist(
  lapply(
    strsplit(gsub(".*\\=\\=", "", noquote(string)), "~"),
    function(x) {
      gsub("\\W+", "", noquote(x))
    }
  )
)

You can capture the two values in two separate columns.
In stringr use str_match -
string <- "newdatat.scat == \"RDS16\" ~ \"Asthma\","
stringr::str_match(string, '"(\\w+)" ~ "(\\w+)"')[, -1, drop = FALSE]
#      [,1]    [,2]
# [1,] "RDS16" "Asthma"
Or in base R use strcapture
strcapture('"(\\w+)" ~ "(\\w+)"', string,
proto = list(col1 = character(), col2 = character()))
#    col1   col2
# 1 RDS16 Asthma
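Since the question also asks how the regex works, the same pattern can be inspected with base R's regexec, which returns the whole match followed by each parenthesised group. This is only a sketch; the names pattern and m are mine.

```r
string <- "newdatat.scat == \"RDS16\" ~ \"Asthma\","

# "      a literal double quote
# (\\w+) capture group: one or more word characters (letters, digits, underscore)
# " ~ "  closing quote, space, tilde, space, opening quote, all literal
pattern <- '"(\\w+)" ~ "(\\w+)"'

m <- regmatches(string, regexec(pattern, string))[[1]]

m[1]   # the whole match, quotes and tilde included
m[-1]  # only the two captured groups
```

Dropping the first element (or the first column in the str_match version) is what removes the full match and leaves just the captured values.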

Related

Formatting UK Postcodes in R

I am trying to format UK postcodes that come in as a vector of different input in R.
For example, I have the following postcodes:
postcodes<-c("IV41 8PW","IV408BU","kY11..4hJ","KY1.1UU","KY4 9RW","G32-7EJ")
How do I write a generic code that would convert entries of the above vector into:
c("IV41 8PW","IV40 8BU","KY11 4HJ","KY1 1UU","KY4 9RW","G32 7EJ")
That is the first part of the postcode is separated from the second part of the postcode by one space and all letters are capitals.
EDIT: the second part of the postcode is always the 3 last characters (combination of a number followed by letters)
I couldn't come up with a smart regex solution so here is a split-apply-combine approach.
sapply(strsplit(sub('^(.*?)(...)$', '\\1:\\2', postcodes), ':', fixed = TRUE), function(x) {
paste0(toupper(trimws(x, whitespace = '[.\\s-]')), collapse = ' ')
})
#[1] "IV41 8PW" "IV40 8BU" "KY11 4HJ" "KY1 1UU" "KY4 9RW" "G32 7EJ"
The logic here is that we insert a : (or any character that is not in the data) in the string between the 1st and 2nd part. Split the string on :, remove unnecessary characters, get it in upper case and combine it in one string.
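The three stages described above can be run one at a time to see the intermediate values. This is a sketch of the same idea, not a different method; the names marked, parts and cleaned are illustrative, and perl = TRUE on sub() is my addition.

```r
postcodes <- c("IV41 8PW", "IV408BU", "kY11..4hJ", "KY1.1UU", "KY4 9RW", "G32-7EJ")

# Step 1: put a ":" marker in front of the last three characters
marked <- sub("^(.*?)(...)$", "\\1:\\2", postcodes, perl = TRUE)

# Step 2: split each string on the marker
parts <- strsplit(marked, ":", fixed = TRUE)

# Step 3: trim dots, whitespace and dashes, upper-case, rejoin with one space
# (the whitespace argument of trimws() needs R >= 3.6.0)
cleaned <- vapply(parts, function(x) {
  paste0(toupper(trimws(x, whitespace = "[.\\s-]")), collapse = " ")
}, character(1))

marked
cleaned
```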
One approach:
Convert to uppercase
extract the alphanumeric characters
Paste back together with a space before the last three characters
The code would then be:
library(stringr)
postcodes<-c("IV41 8PW","IV408BU","kY11..4hJ","KY1.1UU","KY4 9RW","G32-7EJ")
postcodes <- str_to_upper(postcodes)
sapply(str_extract_all(postcodes, "[:alnum:]"), function(x)paste(paste0(head(x,-3), collapse = ""), paste0(tail(x,3), collapse = "")))
# [1] "IV41 8PW" "IV40 8BU" "KY11 4HJ" "KY1 1UU" "KY4 9RW" "G32 7EJ"
You can remove everything that is not a word character with \\W (equivalent to [^[:alnum:]_]) and then insert a space before the last 3 characters with (.{3})$ and \\1.
sub("(.{3})$", " \\1", toupper(gsub("\\W+", "", postcodes)))
#sub("(...)$", " \\1", toupper(gsub("\\W+", "", postcodes))) #Alternative
#sub("(?=.{3}$)", " ", toupper(gsub("\\W+", "", postcodes)), perl=TRUE) #Alternative
#[1] "IV41 8PW" "IV40 8BU" "KY11 4HJ" "KY1 1UU" "KY4 9RW" "G32 7EJ"
# Option 1 using regex:
# (upper-case first; otherwise [[:upper:]] does not match input such as "kY11..4hJ")
res1 <- gsub(
  "(\\w+)(\\d[[:upper:]]\\w+$)",
  "\\1 \\2",
  gsub("\\W+", " ", toupper(postcodes))
)
# Option 2 using substrings:
# (upper-case first so that the result meets the all-capitals requirement)
res2 <- vapply(
  trimws(gsub("\\W+", " ", toupper(postcodes))),
  function(ir) {
    paste(
      trimws(substr(ir, 1, nchar(ir) - 3)),
      substr(ir, nchar(ir) - 2, nchar(ir))
    )
  },
  character(1),
  USE.NAMES = FALSE
)

R: show matched special character in a string

How can I show which special character was a match in each row of the single column dataframe?
Sample dataframe:
a <- data.frame(name=c("foo","bar'","ip_sum","four","%23","2_planet!","#abc!!"))
determining if the string has a special character:
a$name_cleansed <- gsub("([-./&,])|[[:punct:]]","\\1",a$name) #\\1 puts back the exception we define (dash and slash)
a <- a %>% mutate(has_special_char=if_else(name==name_cleansed,FALSE,TRUE))
You can use str_extract if you want only the first special character.
library(stringr)
str_extract(a$name,'[[:punct:]]')
#[1] NA "'" "_" NA "%" "_" "#"
If we need all of the special characters we can use str_extract_all.
sapply(str_extract_all(a$name,'[[:punct:]]'), function(x) toString(unique(x)))
#[1] "" "'" "_" "" "%" "_, !" "#, !"
To exclude certain symbols, we can use
exclude_symbol <- c('-', '.', '/', '&', ',')
sapply(str_extract_all(a$name,'[[:punct:]]'), function(x)
toString(setdiff(unique(x), exclude_symbol)))
We can use grepl here for a base R option:
a$has_special_char <- grepl("(?![-./&,])[[:punct:]]", a$name, perl=TRUE)
a$special_char <- ifelse(a$has_special_char, sub("^.*([[:punct:]]).*$", "\\1", a$name), NA)
a
       name has_special_char special_char
1       foo            FALSE         <NA>
2      bar'             TRUE            '
3    ip_sum             TRUE            _
4      four            FALSE         <NA>
5       %23             TRUE            %
6 2_planet!             TRUE            !
7    #abc!!             TRUE            !
Data:
a <- data.frame(name=c("foo","bar'","ip_sum","four","%23","2_planet!","#abc!!"))
The above logic returns the last symbol character, if present, in each name (the leading .* in the sub() pattern is greedy, so the capture group lands on the final punctuation mark), otherwise returning NA. It reuses the has_special_char column to determine if a symbol occurs in the name already.
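Which symbol the sub() call reports depends on the greedy .* at the start of the pattern: it consumes as much as it can, so the capture group ends up on the last punctuation character (row 6 shows ! rather than _). If the first one is wanted instead, regexpr() can be used, since it reports the leftmost match. A small sketch with illustrative names (x, last_punct, first_punct):

```r
x <- c("2_planet!", "#abc!!", "foo")

# Greedy .* pushes the capture group to the LAST punctuation character
last_punct <- sub("^.*([[:punct:]]).*$", "\\1", x[1:2])

# regexpr() returns the position of the leftmost match, or -1 if there is none
first_pos <- regexpr("[[:punct:]]", x)
first_punct <- ifelse(first_pos > 0, substr(x, first_pos, first_pos), NA_character_)

last_punct
first_punct
```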
Edit:
If you want a column which shows all special characters, then use:
a$all_special_char <- ifelse(a$has_special_char, gsub("[^[:punct:]]+", "", a$name), NA)
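Base R regex solution that deletes the excluded symbols and every non-punctuation character, so that only the special characters remain. Note that ^ negates only as the first character inside [...]; outside the brackets it anchors to the start of the string:
gsub("[-./&,]|[^[:punct:]]", "", a$name)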
Also if you want a data.frame returned:
within(a, {
  special_char <- gsub("[-./&,]|[^[:punct:]]", "", name)
  has_special_char <- special_char != ""
})
If you only want unique special characters per name as in @Ronak Shah's answer:
within(a, {
  special_char <- sapply(gsub("[-./&,]|[^[:punct:]]", "", name),
                         function(x) toString(unique(unlist(strsplit(x, "")))),
                         USE.NAMES = FALSE)
  has_special_char <- special_char != ""
})

find alphanumeric elements in vector

I have a vector
myVec <- c('1.2','asd','gkd','232','4343','1.3zyz','fva','3213','1232','dasd')
In this vector, I want to do two things:
Remove any numbers from an element that contains both numbers and letters and then
If a group of letters is followed by another group of letters, merge them into one.
So the above vector will look like this:
'1.2','asdgkd','232','4343','zyzfva','3213','1232','dasd'
I thought I will first find the alphanumeric elements and remove the numbers from them using gsub.
I tried this
gsub('[0-9]+', '', myVec[grepl("[A-Za-z]+$", myVec, perl = T)])
"asd" "gkd" ".zyz" "fva" "dasd"
i.e. it retains the . which I don't want.
This seems to return what you are after
myVec <- c('1.2','asd','gkd','232','4343','1.3zyz','fva','3213','1232','dasd')
clean <- function (x) {
is_char <- grepl("[[:alpha:]]", x)
has_number <- grepl("\\d", x)
mixed <- is_char & has_number
x[mixed] <- gsub("[\\d\\.]+","", x[mixed], perl=T)
grp <- cumsum(!is_char | (is_char & !c(FALSE, head(is_char, -1))))
unname(tapply(x, grp, paste, collapse=""))
}
clean(myVec)
# [1] "1.2" "asdgkd" "232" "4343" "zyzfva" "3213" "1232" "dasd"
Here we look for elements that mix numbers and letters and remove the numbers (and dots) from them. Then we define groups for collapsing: a letter-only element that follows another letter-only element goes into the same group as its predecessor, while every other element starts a new group. Finally we collapse all the values in the same group.
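The grouping step is the least obvious part, so here it is on its own with the intermediate vectors exposed. This is a sketch of the same cumsum() logic; prev_is_char and starts_group are names I introduced for readability.

```r
myVec <- c('1.2', 'asd', 'gkd', '232', '4343', '1.3zyz', 'fva', '3213', '1232', 'dasd')

is_char <- grepl("[[:alpha:]]", myVec)

# Was the previous element a letter element? (the first element has no predecessor)
prev_is_char <- c(FALSE, head(is_char, -1))

# A new group starts at every numeric element and at every letter element
# that does not directly follow another letter element
starts_group <- !is_char | (is_char & !prev_is_char)

# Counting the group starts gives each element its group id
grp <- cumsum(starts_group)
grp
```

Elements 2 and 3 share group 2 and elements 6 and 7 share group 5, which is why tapply() pastes exactly those pairs together.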
Here's my regex-only solution:
myVec <- c('1.2','asd','gkd','232','4343','1.3zyz','fva','3213','1232','dasd')
# find all elements containing letters
lettrs = grepl("[A-Za-z]", myVec)
# remove all non-letter characters
myVec[lettrs] = gsub("[^A-Za-z]" ,"", myVec[lettrs])
# paste all elements together, remove delimiter where delimiter is surrounded by letters and split string to new vector
unlist(strsplit(gsub("(?<=[A-Za-z])\\|(?=[A-Za-z])", "", paste(myVec, collapse="|"), perl=TRUE), split="\\|"))

regex to remove nested parenthesis brackets

How does one use regex expressions in R to replace the nested parenthesis in this example:
chf <- "(Mn,Ca,Zn)5(AsO4)2((AsO3)OH)24(H2O)(OH(AsO3))(OH(AsO3)OH)"
The desired output is
"(Mn,Ca,Zn)5(AsO4)2(AsO3OH)24(H2O)(OHAsO3)(OHAsO3OH)"
I'm trying this but I'm not able to exclude what's inside the nested brackets.
> str_replace_all(chf,"\\((\\w+)\\)","(gone)")
[1] "(Mn,Ca,Zn)5(gone)2((gone)OH)24(gone)(OH(gone))(OH(gone)OH)"
You may use
library(gsubfn)
chf <- "(Mn,Ca,Zn)5(AsO4)2((AsO3)OH)24(H2O)(OH(AsO3))(OH(AsO3)OH)"
gsubfn("\\((?:[^()]++|(?R))*\\)", ~ gsub("(^\\(|\\)$)|[()]", "\\1", x, perl=TRUE), chf, perl=TRUE, backref=0)
# => [1] "(Mn,Ca,Zn)5(AsO4)2(AsO3OH)24(H2O)(OHAsO3)(OHAsO3OH)"
The \((?:[^()]++|(?R))*\) regex is a known PCRE pattern to match nested parentheses. Once the match is found gsubfn takes the string and removes all non-initial and non-final parentheses using gsub("(^\\(|\\)$)|[()]", "\\1", x, perl=TRUE). Here, (^\\(|\\)$) matches and captures the first ( and last ) into Group 1 and then any ( and ) are matched with [()]. The replacement is the contents of Group 1.
A base R equivalent solution:
chf <- "(Mn,Ca,Zn)5(AsO4)2((AsO3)OH)24(H2O)(OH(AsO3))(OH(AsO3)OH)"
gre <- gregexpr("\\((?:[^()]++|(?R))*\\)", chf, perl=TRUE)
matches <- regmatches(chf, gre)
regmatches(chf, gre) <- lapply(matches, gsub, pattern="(^\\(|\\)$)|[()]", replacement="\\1")
> chf
# => "(Mn,Ca,Zn)5(AsO4)2(AsO3OH)24(H2O)(OHAsO3)(OHAsO3OH)"
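The two stages of this approach can be checked separately in base R: first what the recursive pattern matches, then what the inner gsub() does to a single match. A minimal sketch; the names outer_rx, inner_rx, groups and flattened are only for illustration.

```r
chf <- "(Mn,Ca,Zn)5(AsO4)2((AsO3)OH)24(H2O)(OH(AsO3))(OH(AsO3)OH)"

# Stage 1: (?R) recurses into the whole pattern, so each match is one
# balanced, outermost parenthesised group
outer_rx <- "\\((?:[^()]++|(?R))*\\)"
groups <- regmatches(chf, gregexpr(outer_rx, chf, perl = TRUE))[[1]]
groups

# Stage 2: inside one match, keep the first "(" and the last ")" via group 1
# and delete every other parenthesis (group 1 is empty for those)
inner_rx <- "(^\\(|\\)$)|[()]"
flattened <- gsub(inner_rx, "\\1", "(OH(AsO3)OH)", perl = TRUE)
flattened
```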

How can I split a string and ignore the delimiter if it's "quoted"

Say I have the following string:
params <- "var1 /* first, variable */, var2, var3 /* third, variable */"
I want to split it using , as a separator, then extract the "quoted substrings", so I get 2 vectors as follow :
params_clean <- c("var1","var2","var3")
params_def <- c("first, variable","","third, variable") # note the empty string as a second element.
I use the term "quoted" in a wide sense, with arbitrary strings, here /* and */, which protect substrings from being split.
I found a workaround based on read.table and the fact it allows quoted elements :
library(magrittr)
params %>%
gsub("/\\*","_temp_sep_ '",.) %>%
gsub("\\*/","'",.) %>%
read.table(text=.,strin=F,sep=",") %>%
unlist %>%
unname %>%
strsplit("_temp_sep_") %>%
lapply(trimws) %>%
lapply(`length<-`,2) %>%
do.call(rbind,.) %>%
inset(is.na(.),value="")
But it's quite ugly and hackish, what's a simpler way ? I'm thinking there must be a regex to feed to strsplit for this situation.
related to this question
You may use
library(stringr)
cmnt_rx <- "(\\w+)\\s*(/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/)?"
res <- str_match_all(params, cmnt_rx)
params_clean <- res[[1]][,2]
params_clean
## => [1] "var1" "var2" "var3"
params_def <- gsub("^/[*]\\s*|\\s*[*]/$", "", res[[1]][,3])
params_def[is.na(params_def)] <- ""
params_def
## => [1] "first, variable" "" "third, variable"
The main regex details (the pattern is actually (\w+)\s*(COMMENTS_REGEX)?):
(\w+) - Capturing group 1: one or more word chars
\s* - 0+ whitespace chars
( - Capturing group 2 start
/\* - match the comment start /*
[^*]*\*+ - match 0+ characters other than * followed with 1+ literal *
(?:[^/*][^*]*\*+)* - 0+ sequences of:
[^/*][^*]*\*+ - not a / or * (matched with [^/*]) followed with 0+ non-asterisk characters ([^*]*) followed with 1+ asterisks (\*+)
/ - closing /
)? - Capturing group 2 end, repeat 1 or 0 times (it means it is optional).
See the regex demo.
The "^/[*]\\s*|\\s*[*]/$" pattern in gsub removes /* and */ with adjoining spaces.
params_def[is.na(params_def)] <- "" part replaces NA with empty strings.
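The comment part of the pattern can be tried on its own to confirm that it copes with stray asterisks inside a comment. This is a base R sketch with a made-up test string; comment_rx and x are illustrative names, and the input is not from the question.

```r
# The comment sub-pattern from above, without the outer capture group
comment_rx <- "/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/"

# Hypothetical input: the first comment contains a lone "*" and ends in "**/"
x <- "var1 /* a * b, c **/, var2 /* d */"

comments <- regmatches(x, gregexpr(comment_rx, x, perl = TRUE))[[1]]
comments
```

Each run of asterisks is consumed by \*+, and the match only ends when such a run is followed directly by /.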
Here you are
library(stringr)
params <- "var1 /* first, variable */, var2, var3 /* third, variable */"
# Split by , which are not enclosed in your /*...*/
params_split <- str_split(params, ",(?=[^(/[*])]*(/[*]))")[[1]]
# Extract matches of /*...*/, only taking the contents
params_def <- str_match(params_split, "/[*] *(.*?) *[*]/")[,2]
params_def[is.na(params_def)] <- ""
# Remove traces of /*...*/
params_clean <- trimws(gsub("/[*] *(.*?) *[*]/", "", params_split))
You can wrap it in a function and use the (not well documented) (*SKIP)(*FAIL) mechanism in plain R:
getparams <- function(params) {
tmp <- unlist(strsplit(params, "/\\*.*?\\*/(*SKIP)(*FAIL)|,", perl = TRUE))
params_clean <- character(length(tmp))
params_def <- character(length(tmp))
for (i in seq_along(tmp)) {
# get params_def if available
match <- regmatches(tmp[i], regexec("/\\*(.*?)\\*/", tmp[i]))
params_def[i] <- ifelse(identical(match[[1]], character(0)), "", trimws(match[[1]][2]))
# params_clean
params_clean[i] <- trimws(gsub("/(.*)\\*.*?\\*/", "\\1", tmp[i]))
}
return(list(params_clean = params_clean, params_def = params_def))
}
params <- "var1 /* first, variable */, var2, var3 /* third, variable */"
getparams(params)
This splits the initial string using (*SKIP)(*FAIL) (see a demo on regex101.com) and analyzes the parts afterwards.
This yields a list:
$params_clean
[1] "var1" "var2" "var3"
$params_def
[1] "first, variable" "" "third, variable"
Or, shorter with sapply:
getparams <- function(params) {
tmp <- unlist(strsplit(params, "/\\*.*?\\*/(*SKIP)(*FAIL)|,", perl = TRUE))
(p <- sapply(tmp, function(x) {
match <- regmatches(x, regexec("/\\*(.*?)\\*/", x))
def <- ifelse(identical(match[[1]], character(0)), "", trimws(match[[1]][2]))
clean <- trimws(gsub("/(.*)\\*.*?\\*/", "\\1", x))
c(clean, def)
}, USE.NAMES = F))
}
Which will yield a matrix:
     [,1]              [,2]   [,3]
[1,] "var1"            "var2" "var3"
[2,] "first, variable" ""     "third, variable"
With the latter, you get the variable names with e.g. result[1,].
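To isolate what (*SKIP)(*FAIL) contributes, the split can be compared with a plain split on commas. In the pattern, the left alternative matches a whole /* ... */ comment and then deliberately fails while telling the engine not to retry inside it, so only commas outside comments are left for the right alternative. A sketch with illustrative names (naive, protected):

```r
params <- "var1 /* first, variable */, var2, var3 /* third, variable */"

# Plain split: commas inside the comments are treated as separators too
naive <- strsplit(params, ",", fixed = TRUE)[[1]]

# Protected split: comments are skipped, only the remaining commas split
protected <- strsplit(params, "/\\*.*?\\*/(*SKIP)(*FAIL)|,", perl = TRUE)[[1]]

length(naive)
length(protected)
trimws(protected)
```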

Resources