Regular expression to select 2 kinds of substrings - r

I have strings like:
\n A vs B \n
\n C vs D (EF) \n
\n GH ( I vs J) \n
in a vector called myData.
The following is myData.
c("\n A vs B \n", "\n C vs D (EF) \n", "\n GH ( I vs J)\n")
I want to select A vs B from 1, C vs D from 2 and I vs J from 3.
I have the following code:
loc = regexpr(".*vs.*|\\(.*vs.*\\)",myData,ignore.case=TRUE,perl=T)
end = loc + attr(loc,"match.length")-1
substr(myData,loc,end)
which gives three outputs:
[1] " A vs B " " C vs D (EF) " " GH ( I vs J)"
The last match is incorrect. How can I fix this?

We can use str_extract
library(stringr)
str_extract(str1, "[A-Za-z]\\s*vs\\s*[A-Za-z]")
#[1] "A vs B" "C vs D" "I vs J"
Or if there are other lower case characters in place of 'vs'
str_extract(str1, "[A-Z]\\s*[a-z]+\\s*[A-Z]")
#[1] "A vs B" "C vs D" "I vs J"
Or with sub from base R
sub(".*([A-Z]\\s*[a-z]+\\s*[A-Z]).*", "\\1", str1)
#[1] "A vs B" "C vs D" "I vs J"
data
str1 <- c("\n A vs B \n", "\n C vs D (EF) \n", "\n GH ( I vs J)\n")
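One difference between the options above is what happens when an element has no match: sub returns the input unchanged, while str_extract returns NA. The following is a minimal base R sketch that mimics the NA behaviour with regexpr and regmatches. The fourth input string and the variable names are made up for illustration.

```r
# Inputs from the question plus one hypothetical string without a match
str1 <- c("\n A vs B \n", "\n C vs D (EF) \n", "\n GH ( I vs J)\n", "no match here")
pattern <- "[A-Za-z]\\s*vs\\s*[A-Za-z]"

m <- regexpr(pattern, str1)            # -1 where the pattern is absent
matched <- as.integer(m) != -1L

# regmatches() silently drops non-matching elements, so fill a pre-sized vector
out <- rep(NA_character_, length(str1))
out[matched] <- regmatches(str1, m)
out
```

Pre-filling with NA keeps the result aligned with the input, which matters if the vector is a column in a data frame.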

You may use the base R regmatches / gregexpr solution using a PCRE regex like yours, but using lookarounds, changing . to [^()] (to avoid the overflow across parentheses) and placing the longer alternative before the smaller one:
> myData <- c("\n A vs B \n", "\n C vs D (EF) \n", "\n GH ( I vs J)\n")
> res <- regmatches(myData, gregexpr("(?<=\\()[^()]*vs[^()]*(?=\\))|[^()]*vs[^()]*", myData, perl=TRUE))
> trimws(res)
[1] "A vs B" "C vs D" "I vs J"
See the R online demo
Details:
(?<=\\() - positive lookbehind making sure there is a ( immediately to the left of the current location
[^()]* - 0+ chars other than ( and )
vs - a literal substring
[^()]* - 0+ chars other than ( and )
(?=\\)) - positive lookahead making sure there is a ) immediately to the right of the current location
| - or
[^()]*vs[^()]* - a vs enclosed with 0+ chars other than ( and )
NOTE: If you need to prevent the overflow across lines, you need to add \r\n to the [^()] -> [^()\r\n].
See this regex demo.
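To illustrate the note about line breaks, here is a small sketch comparing the pattern with and without \r\n in the negated class. The two-line input string and the names loose and strict are assumptions made for this example only.

```r
# Hypothetical input: a single string holding two lines
x <- "A vs B\nC vs D"

loose  <- "(?<=\\()[^()]*vs[^()]*(?=\\))|[^()]*vs[^()]*"
strict <- "(?<=\\()[^()\\r\\n]*vs[^()\\r\\n]*(?=\\))|[^()\\r\\n]*vs[^()\\r\\n]*"

loose_hits  <- regmatches(x, gregexpr(loose,  x, perl = TRUE))[[1]]
strict_hits <- regmatches(x, gregexpr(strict, x, perl = TRUE))[[1]]

length(loose_hits)   # a single match that runs across the line break
strict_hits          # one match per line
```

With the question's data each element holds only one candidate, so the difference only shows up when several lines share one string.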

Throwing a non-regex approach into the mix. Basically, we split at ' vs ' and paste the last character of the first element with the first character of the second element (x is the input vector, myData in the question).
sapply(strsplit(x, ' vs '), function(i)
  paste0(substr(i[1], nchar(i[1]), nchar(i[1])), ' Vs ', substr(i[2], 1, 1)))
#[1] "A Vs B" "C Vs D" "I Vs J"

Related

regex for replacement of specific character outside parenthesis only

I am looking for a regex (preferably in R) that replaces every occurrence of a specific character, say ;, with ;; but only when it is not inside parentheses () in the text string.
Note: 1. There may be more than one such character inside the parentheses too.
2. There are no nested parentheses in the data/vector.
Example
text;othertext to be replaced with text;;othertext
but text;other(texttt;some;someother);more to be replaced with text;;other(texttt;some;someother);;more. (i.e. ; only outside () to be replaced with replacement text)
If any clarification is still needed, I will try to explain.
in_vec <- c("abcd;ghi;dfsF(adffg;adfsasdf);dfg;(asd;fdsg);ag", "zvc;dfasdf;asdga;asd(asd;hsfd)", "adsg;(asdg;ASF;DFG;ASDF;);sdafdf", "asagf;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;sdfa")
in_vec
#> [1] "abcd;ghi;dfsF(adffg;adfsasdf);dfg;(asd;fdsg);ag"
#> [2] "zvc;dfasdf;asdga;asd(asd;hsfd)"
#> [3] "adsg;(asdg;ASF;DFG;ASDF;);sdafdf"
#> [4] "asagf;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;sdfa"
Expected output (calculated manually)
[1] "abcd;;ghi;;dfsF(adffg;adfsasdf);;dfg;;(asd;fdsg);;ag"
[2] "zvc;;dfasdf;;asdga;;asd(asd;hsfd)"
[3] "adsg;;(asdg;ASF;DFG;ASDF;);;sdafdf"
[4] "asagf;;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;;sdfa"
You can use gsub with ;(?![^(]*\\)):
gsub(";(?![^(]*\\))", ";;", in_vec, perl=TRUE)
#[1] "abcd;;ghi;;dfsF(adffg;adfsasdf);;dfg;;(asd;fdsg);;ag"
#[2] "zvc;;dfasdf;;asdga;;asd(asd;hsfd)"
#[3] "adsg;;(asdg;ASF;DFG;ASDF;);;sdafdf"
#[4] "asagf;;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;;sdfa"
; finds ;, (?!) .. negative lookahead (make the replacement when it does not match), [^(] .. any character except (, * .. repeat the previous 0 to n times, \\) .. followed by ).
Or
gsub(";(?=[^)]*($|\\())", ";;", in_vec, perl=TRUE)
#[1] "abcd;;ghi;;dfsF(adffg;adfsasdf);;dfg;;(asd;fdsg);;ag"
#[2] "zvc;;dfasdf;;asdga;;asd(asd;hsfd)"
#[3] "adsg;;(asdg;ASF;DFG;ASDF;);;sdafdf"
#[4] "asagf;;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;;sdfa"
; finds ;, (?=) .. positive lookahead (make the replacement when it does match), [^)] .. any character except ), * .. repeat the previous 0 to n times, ($|\\() .. match the end $ or (.
Or using gregexpr and regmatches extracting the part between ( and ) and making the replacement in the non-matched substrings:
x <- gregexpr("\\(.*?\\)", in_vec) #Find the part between ( and )
mapply(function(a, b) {
paste(matrix(c(gsub(";", ";;", b), a, ""), 2, byrow=TRUE), collapse = "")
}, regmatches(in_vec, x), regmatches(in_vec, x, TRUE))
#[1] "abcd;;ghi;;dfsF(adffg;adfsasdf);;dfg;;(asd;fdsg);;ag"
#[2] "zvc;;dfasdf;;asdga;;asd(asd;hsfd)"
#[3] "adsg;;(asdg;ASF;DFG;ASDF;);;sdafdf"
#[4] "asagf;;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;;sdfa"
But all of them will work only for simple open ( close ) combinations.
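A short sketch of that limitation, using a made-up input with nested parentheses: the first lookahead pattern doubles a semicolon that sits inside the outer pair.

```r
# Hypothetical input with nested parentheses (not part of the question's data)
nested <- "a;(b;(c;d);e);f"

lookahead_result <- gsub(";(?![^(]*\\))", ";;", nested, perl = TRUE)
lookahead_result
# The ; after b is inside the outer parentheses but is followed by "(",
# so the lookahead cannot reach a ")" and the ; is replaced anyway.
```

For data like the question's, where parentheses are never nested, this case does not arise.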
Though the problem can be tackled with regex, using a simple function might be more straightforward and easier to understand.
replace_semicolons_outside_parentheses <- function(raw_string) {
# Replace ; with ;; outside of parentheses
processed_string <- ""
n_open_parentheses <- 0
# Loops over characters in raw_string
for (char in strsplit(raw_string, "")[[1]]) {
# Update the net number of open parentheses
if (char == "(") {
n_open_parentheses <- n_open_parentheses + 1
} else if (char == ")") {
n_open_parentheses <- n_open_parentheses - 1
}
# Replace ; with ;; outside of parentheses
if (char == ";" && n_open_parentheses == 0) {
processed_string <- paste0(processed_string, ";;")
} else {
processed_string <- paste0(processed_string, char)
}
}
return(processed_string)
}
Note that the function above also works for nested parentheses: no semicolons inside nested parentheses are replaced! The desired output can be obtained in a single line:
out_vec <- lapply(in_vec, replace_semicolons_outside_parentheses)
# 1. 'abcd;;ghi;;dfsF(adffg;adfsasdf);;dfg;;(asd;fdsg);;ag'
# 2. 'zvc;;dfasdf;;asdga;;asd(asd;hsfd)'
# 3. 'adsg;;(asdg;ASF;DFG;ASDF;);;sdafdf'
# 4. 'asagf;;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;;sdfa'
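The same depth-counting idea can also be written without an explicit character loop. This is only a sketch: the helper name double_semicolons_outside and the second test string are invented here, and it assumes the parentheses are balanced.

```r
# Hypothetical helper: vectorised depth counting with cumsum()
double_semicolons_outside <- function(strings) {
  vapply(strings, function(s) {
    chars <- strsplit(s, "")[[1]]
    # Net number of open parentheses at each character position
    depth <- cumsum(chars == "(") - cumsum(chars == ")")
    # A ; is outside every pair exactly when the depth is zero
    outside <- chars == ";" & depth == 0
    chars[outside] <- ";;"
    paste(chars, collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

res <- double_semicolons_outside(c("text;other(texttt;some;someother);more",
                                   "a;(b;(c;d);e);f"))
res
```

Using vapply with character(1) returns a plain character vector rather than a list, which is usually easier to assign back into a data frame column.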
Use the following in case of no nested parentheses:
gsub("\\([^()]*\\)(*SKIP)(*FAIL)|;", ";;", in_vec, perl=TRUE)
See regex proof.
EXPLANATION
--------------------------------------------------------------------------------
\( '(' char
--------------------------------------------------------------------------------
[^()]* any character except: '(', ')' (0 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
\) ')' char
--------------------------------------------------------------------------------
(*SKIP)(*FAIL) skip current match, search for new one from here
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
; ';'
If there are nested parentheses:
gsub("(\\((?:[^()]++|(?1))*\\))(*SKIP)(*FAIL)|;", ";;", in_vec, perl=TRUE)
See regex proof.
EXPLANATION
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
\( '('
--------------------------------------------------------------------------------
(?: group, but do not capture:
--------------------------------------------------------------------------------
[^()]++ any character except: '(', ')' (1 or more times
(matching the most amount possible, no backtracking))
--------------------------------------------------------------------------------
| or
--------------------------------------------------------------------------------
(?1) recursing first group pattern
--------------------------------------------------------------------------------
) end of grouping
--------------------------------------------------------------------------------
\) ')'
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
(*SKIP)(*FAIL) skip the match, search for next
--------------------------------------------------------------------------------
| or
--------------------------------------------------------------------------------
; ';'
--------------------------------------------------------------------------------
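As a quick check of the recursive pattern, the sketch below runs it on the first element of in_vec from the question and on a made-up nested string.

```r
# First element of in_vec from the question, plus a hypothetical nested input
flat   <- "abcd;ghi;dfsF(adffg;adfsasdf);dfg;(asd;fdsg);ag"
nested <- "a;(b;(c;d);e);f"

recursive_pattern <- "(\\((?:[^()]++|(?1))*\\))(*SKIP)(*FAIL)|;"

flat_result   <- gsub(recursive_pattern, ";;", flat,   perl = TRUE)
nested_result <- gsub(recursive_pattern, ";;", nested, perl = TRUE)
flat_result
nested_result
```

Both (?1) recursion and the (*SKIP)(*FAIL) verbs are PCRE features, so perl = TRUE is required.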

Moving character at the beginning of the string maintaining the position of other characters intact in R

string <- paste(append(rep(" ", 7), append("A", append(rep(" ", 3), append("B", append(rep(" ", 17), "C"))))), collapse = "")
string
[1] "       A   B                 C"
How can I move A to the beginning of the string, keeping the positions of B and C the same?
You can use sub to get all spaces ( *) before a word (\\w+) and change their position \\2\\1.
sub("( *)(\\w+)", "\\2\\1", string)
#[1] "A          B                 C"
Or only for A:
sub("( *)A", "A\\1", string)
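A small sketch to confirm that the swap leaves B and C where they were. The string is rebuilt with strrep for readability, and the variable names are chosen for this example.

```r
# Same string as in the question: 7 spaces, A, 3 spaces, B, 17 spaces, C
string <- paste0(strrep(" ", 7), "A", strrep(" ", 3), "B", strrep(" ", 17), "C")
moved  <- sub("( *)(\\w+)", "\\2\\1", string)

# Total width and the positions of B and C are unchanged
same_width <- nchar(string) == nchar(moved)
pos_before <- c(B = as.integer(regexpr("B", string)), C = as.integer(regexpr("C", string)))
pos_after  <- c(B = as.integer(regexpr("B", moved)),  C = as.integer(regexpr("C", moved)))
same_width
pos_before
pos_after
```

Because the captured spaces are written back after the word, the string length is preserved, which is what keeps the later characters in place.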

replace special symbols on windows

Students commonly paste assignment questions from a pdf or word document into Rmarkdown. However, the pasted text often has non-ascii characters for bullets, quotes, etc. I have used gsub in the past as part of a function to replace such characters and that seemed to work fine but I'm running into problems now again.
The first line in each pair shown below works on macOS, Linux, and Windows. However, non-ascii characters are not allowed in code to be included in an R package. The 2nd line in each pair works on macOS and Linux but not on Windows.
It would be great to have a general approach to deal with these types of characters that does not involve simply deleting them.
gsub("•", "*", "A big dot •")
gsub("\xE2\x80\xA2", "*", "A big dot •")
gsub("…", "...", "Some small dots …")
gsub("\xE2\x80\xA6", "...", "Some small dots …")
gsub("–", "-", "A long-dash –")
gsub("\xE2\x80\x93", "-", "A long-dash –")
gsub("’", "'", "A curly single quote ’")
gsub("\xE2\x80\x99", "'", "A curly single quote ’")
gsub("‘", "'", "A curly single quote ‘")
gsub("\xE2\x80\x98", "'", "A curly single quote ‘")
gsub("”", '"', "A curly double quote ”")
gsub("\xE2\x80\x9D", '"', "A curly double quote ”")
gsub("“", '"', "A curly double quote “")
gsub("\xE2\x80\x9C", '"', "A curly double quote “")
We can check the hex encoding of a character using the Encoding function:
x <- c("•", "…", "–", "’", "‘", "”", "“")
y <- x
Encoding(y) <- "bytes"
> x
[1] "•" "…" "–" "’" "‘" "”" "“"
> cat(y)
\x95 \x85 \x96 \x92 \x91 \x94 \x93
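The single bytes printed above look like a Windows code page (probably Windows-1252) rather than UTF-8, which would explain why they differ from the \xE2\x80... sequences in the question. An encoding-independent way to inspect the characters is to look at their Unicode code points. A minimal sketch, with variable names chosen for this example:

```r
# The same seven characters, written as ASCII-only \u escapes
x <- c("\u2022", "\u2026", "\u2013", "\u2019", "\u2018", "\u201D", "\u201C")

# Unicode code point of each character, formatted as an R escape sequence
code_points <- vapply(x, utf8ToInt, integer(1), USE.NAMES = FALSE)
escapes <- sprintf("\\u%04X", code_points)
escapes

# UTF-8 bytes of the bullet, for comparison with the \x sequences
utf8_bytes <- charToRaw(x[1])
utf8_bytes
```

The code points are the same on every platform, whereas the raw bytes depend on how the string is encoded.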
You can then include the hex codes in your gsub calls:
gsub("•", "*", "A big dot •")
gsub("[\x95\xE2\x80\xA2]", "*", "A big dot •")
gsub("…", "...", "Some small dots …")
gsub("[\x85\xE2\x80\xA6]", "...", "Some small dots …")
gsub("–", "-", "A long-dash –")
gsub("[\x96\xE2\x80\x93]", "-", "A long-dash –")
gsub("’", "'", "A curly single quote ’")
gsub("[\x92\xE2\x80\x99]", "'", "A curly single quote ’")
gsub("‘", "'", "A curly single quote ‘")
gsub("[\x91\xE2\x80\x98]", "'", "A curly single quote ‘")
gsub("”", '"', "A curly double quote ”")
gsub("[\x94\xE2\x80\x9D]", '"', "A curly double quote ”")
gsub("“", '"', "A curly double quote “")
gsub("[\x93\xE2\x80\x9C]", '"', "A curly double quote “")
Also with stri_trans_general from stringi:
library(stringi)
stri_trans_general(x, "ascii")
# [1] "•" "..." "-" "'" "'" "\"" "\""
This seems to not work for "•", but works for the rest.
Note that I have only tested this solution on Windows and not other OS.
It seems that on systems with non-US language settings gsub("[\x95\xE2\x80\xA2]", "*", "A big dot •") can cause errors (see e.g., below).
> gsub("[\x95\xE2\x80\xA2]", "*", "A big dot •")
Error in gsub("[曗€", "*", "A big dot <U+2022>") :
invalid regular expression '[曗€', reason 'Missing ']''
The following, however, does work well.
gsub("\u2022", "*", "A big dot •")
gsub("\u2026", "...", "Some small dots …")
gsub("\u2013", "-", "A long-dash –")
gsub("\u2019", "'", "A curly single quote ’")
gsub("\u2018", "'", "A curly single quote ‘")
gsub("\u201D", '"', "A curly double quote ”")
gsub("\u201C", '"', "A curly double quote “")
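Since the question asks for a general approach, the \u replacements above could be collected into a lookup table and applied in a loop. This is a sketch only: the function name to_ascii and the from/to vectors are made up here, and the table covers just the seven characters from the question.

```r
# Hypothetical lookup table: source stays ASCII-only thanks to \u escapes
from <- c("\u2022", "\u2026", "\u2013", "\u2019", "\u2018", "\u201D", "\u201C")
to   <- c("*",      "...",    "-",      "'",      "'",      "\"",     "\"")

to_ascii <- function(x) {
  for (i in seq_along(from)) {
    # fixed = TRUE treats each character literally, not as a regex
    x <- gsub(from[i], to[i], x, fixed = TRUE)
  }
  x
}

cleaned <- to_ascii(c("A big dot \u2022", "Some small dots \u2026",
                      "\u201CQuoted\u201D \u2013 it\u2019s"))
cleaned
```

New characters can be supported by appending one entry to each vector, without touching the function.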
Also, stringi::stri_trans_general works well on systems with US language settings but on a system with Chinese language settings the code shown below does not return the desired result which is just 夹. Not sure what the solution is.
stringi::stri_trans_general("夹", "ascii")
> stringi::stri_trans_general("夹", "ascii")
[1] " 1/4D"

Exclude everything after the second occurrence of a certain string

I have the following string
string <- c('a - b - c - d',
'z - c - b',
'y',
'u - z')
I would like to subset it such that everything after the second occurrence of ' - ' is thrown away.
The result would be this:
> string
[1] "a - b" "z - c" "y" "u - z"
I used substr(x = string, 1, regexpr(string, pattern = '[^ - ]*$') - 4), but it excludes the last occurrence of ' - ', which is not what I want.
Note that you cannot use a negated character class to negate a sequence of characters. [^ - ]*$ matches any 0+ chars other than a space (yes, it matches -, too, because the - created a range between a space and a space) followed by the end of the string marker ($).
You may use a sub function with the following regex:
^(.*? - .*?) - .*
to replace with \1. See the regex demo.
R code:
> string <- c('a - b - c - d', 'z - c - b', 'y', 'u - z')
> sub("^(.*? - .*?) - .*", "\\1", string)
[1] "a - b" "z - c" "y" "u - z"
Details:
^ - start of a string
(.*? - .*?) - Group 1 (referred to with the \1 backreference in the replacement pattern) capturing any 0+ chars lazily up to the first space, hyphen, space and then again any 0+ chars up to the next leftmost occurrence of space, hyphen, space
- - a space, hyphen and a space
.* - any zero or more chars up to the end of the string.
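For comparison, the same result can be reached without a regex by splitting on the literal separator and keeping the first two pieces. A minimal sketch; the helper name keep_first_n and its arguments are made up for this example.

```r
string <- c('a - b - c - d', 'z - c - b', 'y', 'u - z')

# Hypothetical helper: keep the first n pieces around a literal separator
keep_first_n <- function(x, sep = " - ", n = 2) {
  pieces <- strsplit(x, sep, fixed = TRUE)
  vapply(pieces, function(p) paste(head(p, n), collapse = sep), character(1))
}

kept <- keep_first_n(string)
kept
```

Changing n generalises this to "everything before the n-th occurrence", which would otherwise need a differently built regex.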
Try this: (\w(?:\s+-\s+\w)?).*. For an explanation of the regex, see https://regex101.com/r/BbfsNQ/2.
That regex captures the first tuple if it exists, or just the first character if there is no tuple, in a capturing group. How you refer to captured groups depends on the language, but in plain regex it is \1 for the first group (\2 for the second, and so on). Look at the "Substitution" part on regex101 if you want a worked example.

How to remove strings with exceptions in R?

I have a string. I want to (a) keep "/" in fractions, (b) insert whitespace around "/" that are between words, and (c) remove all other "/".
s = "/// // / 1/2 111/222 a/b abc/abc a / b / // ///"
The result should be as follows.
s = "1/2 111/222 a b abc abc a b"
I have tried a few things. I cannot make everything right.
I'm not a regex expert, but this appears to work on your example.
s = "/// // / 1/2 111/222 a/b abc/abc a / b / // ///"
i <- gsub("/{2,}|/\\s", "", s)
i <- trimws(gsub("([[:alpha:]]{1,})(/)([[:alpha:]]{1,})", "\\1 \\3", i))
i <- gsub("\\s{2,}", " ", i)
identical(i, "1/2 111/222 a b abc abc a b")
[1] TRUE
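An alternative sketch that works token by token instead of chaining substitutions: split on whitespace, leave tokens that look like digits/digits alone, and turn every other "/" into a space. The function name clean_slashes is made up, and the rule that a fraction means "digits, slash, digits" is an assumption based on the example.

```r
# Hypothetical helper: treat only digits/digits tokens as fractions
clean_slashes <- function(s) {
  tokens <- strsplit(s, "\\s+")[[1]]
  is_fraction <- grepl("^[0-9]+/[0-9]+$", tokens)
  # Every slash outside a fraction becomes a space
  tokens[!is_fraction] <- gsub("/", " ", tokens[!is_fraction], fixed = TRUE)
  # Collapse the runs of spaces this leaves behind and trim the ends
  trimws(gsub("\\s+", " ", paste(tokens, collapse = " ")))
}

s <- "/// // / 1/2 111/222 a/b abc/abc a / b / // ///"
cleaned <- clean_slashes(s)
cleaned
```

Mixed tokens such as 1/a are not covered by the question; this sketch would split them, so adjust the fraction test if that is not wanted.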

Resources