How to insert a white space before open bracket - r

I have a string 3.4(2.5-4.7), I want to insert a white space before the open bracket "(" so that the string becomes 3.4 (2.5-4.7).
Any idea how this could be done in R?

x <- "3.4(2.5-4.7)"
sub("(.*)(?=\\()", "\\1 ", x, perl = T)
[1] "3.4 (2.5-4.7)"
This regex is based on a lookahead: it creates one capturing group that takes in everything up to the lookahead, namely the opening parenthesis (?=\\(), recalls it with \\1, and inserts one whitespace after it in the replacement argument to sub (which is enough unless you have more than one such substitution per string, in which case gsub is needed). The argument perl = T needs to be added to enable the lookahead.
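To illustrate the difference between sub and gsub mentioned here, the following is a minimal sketch on a hypothetical input (x2 is not from the question) that contains two bracketed ranges. It assumes a pure zero-width lookahead is acceptable as the gsub pattern, which is a simplification of the pattern used above.

```r
# Hypothetical input with two bracketed ranges in one string
x2 <- "3.4(2.5-4.7); 1.2(0.9-1.5)"

# sub() replaces a single match; because (.*) is greedy, that match
# ends just before the LAST opening bracket, so only it gets a space
one <- sub("(.*)(?=\\()", "\\1 ", x2, perl = TRUE)
one
# [1] "3.4(2.5-4.7); 1.2 (0.9-1.5)"

# gsub() with a zero-width lookahead inserts a space before every "("
all_brackets <- gsub("(?=\\()", " ", x2, perl = TRUE)
all_brackets
# [1] "3.4 (2.5-4.7); 1.2 (0.9-1.5)"
```

Note that the gsub version also adds a space where one already precedes the bracket, so it fits inputs like the one in the question, where no such space exists yet.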
EDIT:
If you have a string like this:
x <- "3.4(2.5to4.7)"
the regex gets slightly more complex; the underlying idea, though, remains the same: you divide the string into different capturing groups (...), which you then recall using the appropriate backreferences in the replacement argument while adding the sought spaces:
sub("(.*)(\\(\\d+\\.\\d+)(to)(\\d+\\.\\d+\\))", "\\1 \\2 \\3 \\4", x)
[1] "3.4 (2.5 to 4.7)"
EDIT2:
x <- '3.4(2.5,4.7)'
sub("(.*)(\\(\\d+\\.\\d+)(,)(\\d+\\.\\d+\\))", "\\1 \\2\\3 \\4", x)
[1] "3.4 (2.5, 4.7)"
EDIT3:
x <- '3(2,4)'
sub("(.*)(\\(\\d+)(,)(\\d+)", "\\1 \\2\\3 \\4", x)

A very short way uses sub, which will substitute the first open bracket ( with a space followed by an open bracket, i.e. " (".
x <- '3.4(2.5-4.7)'
sub("\\(", " (", x)
# [1] "3.4 (2.5-4.7)"
Alternatively, you can specify the argument fixed = TRUE which considers the pattern as fixed and not as a regular expression.
x <- '3.4(2.5-4.7)'
sub("(", " (", x, fixed = TRUE)
# [1] "3.4 (2.5-4.7)"

Try
gsub('(.*)(\\(.*\\))', '\\1 \\2', '3.4(2.5-4.7)')
#[1] "3.4 (2.5-4.7)"
The regex creates two groups. The first group, (.*), matches everything before the opening parenthesis, and the second group, (\\(.*\\)), matches the parenthesized part, from the opening to the closing parenthesis. Note that the parentheses must be escaped to be matched literally, hence \\( and \\). The replacement \\1 \\2 then joins the two groups with a space between them.

Related

Split string in parts by minus and plus in R

I want to split this string:
test = "-1x^2+3x^3-x^8+1-x"
...into parts by plus and minus characters in R. My goal would be to get:
"-1x^2" "+3x^3" "-x^8" "+1" "-x"
This didn't work:
strsplit(test, split = "-")
strsplit(test, split = "+")
We can provide a regular expression in strsplit, where we use ?= to lookahead to find the plus or minus sign, then split on that character. This will allow for the character itself to be retained rather than being dropped in the split.
strsplit(test, "(?<=.)(?=[+])|(?<=.)(?=[-])", perl = TRUE)
# [1] "-1x^2" "+3x^3" "-x^8" "+1" "-x"
Try
> strsplit(test, split = "(?<=.)(?=[+-])", perl = TRUE)[[1]]
[1] "-1x^2" "+3x^3" "-x^8" "+1" "-x"
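where (?<=.)(?=[+-]) matches the zero-width position that has any character behind it and a + or - directly ahead of it, so the string is split there without consuming a character.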
This uses gsub to search for any character followed by + or - and inserts a semicolon between the two characters. Then it splits on semicolon.
s <- "-1x^2+3x^3-x^8+1-x"
strsplit(gsub("(.)([+-])", "\\1;\\2", s), ";")[[1]]
## [1] "-1x^2" "+3x^3" "-x^8" "+1" "-x"
In your examples, you use strsplit with a plus and a minus sign which will split on every encounter.
You could assert that what is directly to the left is not either the start of the string or + or -, while asserting + and - directly to the right.
(?<!^|[+-])(?=[+-])
Explanation
(?<! Negative lookbehind assertion
^ Start of string
| Or
[+-] Match either + or - using a character class
) Close lookbehind
(?= Positive lookahead assertion
[+-] Match either + or -
) Close lookahead
As the pattern uses lookaround assertions, you have to use perl = T to use a perl style regex.
Example
test <- "-1x^2+3x^3-x^8+1-x"
strsplit(test, split = "(?<!^|[+-])(?=[+-])", perl = T)
Output
[[1]]
[1] "-1x^2" "+3x^3" "-x^8" "+1" "-x"
See an online R demo.
If a space directly to the left should also prevent a split, you can write the pattern as
(?<!^|[\\s+-])(?=[+-])
See a regex demo.
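As a complement to the splitting approaches above, here is a minimal sketch of a different technique that none of the answers use: matching the terms themselves with gregexpr and regmatches instead of splitting between them. It assumes every term consists of an optional leading sign followed by characters that are neither + nor -.

```r
test <- "-1x^2+3x^3-x^8+1-x"

# Match an optional sign followed by a run of non-sign characters;
# gregexpr() finds all matches and regmatches() extracts them
terms <- regmatches(test, gregexpr("[+-]?[^+-]+", test))[[1]]
terms
# [1] "-1x^2" "+3x^3" "-x^8"  "+1"    "-x"

# Pasting the pieces back together reproduces the input
identical(paste(terms, collapse = ""), test)
# [1] TRUE
```

This version needs no lookarounds, so it works with the default regex engine and does not require perl = TRUE.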

R Use Regular Expression to capture number when sometimes the capture is at the end of the string or not

I need to capture the numbers out of a string that come after a certain parameter name.
I have it working for most, but there is one parameter that is sometimes at the end of the string, but not always. When using the regular expression, it seems to matter.
I've tried different things, but nothing seems to work in both cases.
# Regular expression to capture the digit after the phrase "AppliedWhenID="
p <- ".*&AppliedWhenID=(.\\d*)"
# Tried this, but when at end, it just grabs a blank
#p <- ".*&AppliedWhenID=(.\\d*)&.*|.*&AppliedWhenID=(.\\d*)$"
testAtEnd <- "ReportType=233&ReportConID=171&MonthQuarterYear=0TimePeriodLabel=Year%202020&AppliedWhenID=2"
testNotAtEnd <- "ReportType=233&ReportConID=171&MonthQuarterYear=0TimePeriodLabel=Year%202020&AppliedWhenID=2&AgDateTypeID=1"
# What should be returned is "2"
gsub(p, "\\1", testAtEnd) # works
gsub(p, "\\1", testNotAtEnd) # doesn't work, it captures 2 + &AgDateTypeID=1
Note that sub and gsub replace the found text(s), thus, in order to extract a part of the input string with a capturing group + a backreference, you need to actually match (and consume) the whole string.
Hence, you need to match the string to the end by adding .* at the end of the pattern:
p <- ".*&AppliedWhenID=(\\d+).*"
sub(p, "\\1", testNotAtEnd)
# => [1] "2"
sub(p, "\\1", testAtEnd)
# => [1] "2"
See the regex demo and the R online demo.
Note that gsub replaces all occurrences, while you need a single one, so it makes sense to replace gsub with sub.
Regex details
.* - any zero or more chars as many as possible
&AppliedWhenID= - a &AppliedWhenID= string
(\d+) - Group 1 (\1): one or more digits
.* - any zero or more chars as many as possible.
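The pattern described above can be wrapped in a small helper. This is only a sketch: the function name extract_param and its arguments are hypothetical, it assumes the parameter is preceded by & as in the question, and it assumes the value consists of digits only. Since sub returns the input unchanged when nothing matches, the helper checks with grepl first and returns NA in that case.

```r
# Hypothetical helper: return the digits that follow "&<name>=" or NA
extract_param <- function(query, name) {
  pattern <- paste0(".*&", name, "=(\\d+).*")
  ifelse(grepl(pattern, query), sub(pattern, "\\1", query), NA_character_)
}

testAtEnd <- "ReportType=233&ReportConID=171&MonthQuarterYear=0TimePeriodLabel=Year%202020&AppliedWhenID=2"
testNotAtEnd <- "ReportType=233&ReportConID=171&MonthQuarterYear=0TimePeriodLabel=Year%202020&AppliedWhenID=2&AgDateTypeID=1"

extract_param(c(testAtEnd, testNotAtEnd), "AppliedWhenID")
# [1] "2" "2"
extract_param(testNotAtEnd, "AgDateTypeID")
# [1] "1"
extract_param(testAtEnd, "AgDateTypeID")
# [1] NA
```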
You could try using a positive lookbehind, "(?<=...)", and str_extract() from the stringr library.
testAtEnd <- "ReportType=233&ReportConID=171&MonthQuarterYear=0TimePeriodLabel=Year%202020&AppliedWhenID=2"
testNotAtEnd <- "ReportType=233&ReportConID=171&MonthQuarterYear=0TimePeriodLabel=Year%202020&AppliedWhenID=2&AgDateTypeID=1"
p <- "(?<=AppliedWhenID=)\\d+"
# What should be returned is "2"
library(stringr)
str_extract(testAtEnd, p)
str_extract(testNotAtEnd, p)
Or in base R
p <- ".*((?<=AppliedWhenID=)\\d+).*"
gsub(p, "\\1", testAtEnd, perl=TRUE)
gsub(p, "\\1", testNotAtEnd, perl=TRUE)
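The same lookbehind can also be used to extract the match directly in base R instead of replacing everything around it. The sketch below uses regexpr and regmatches on shortened, hypothetical query strings (not the ones from the question). One caveat: regmatches drops elements without a match, so the result can be shorter than the input.

```r
# Shortened hypothetical query strings; the last one has no match
queries <- c("a=1&AppliedWhenID=2",
             "a=1&AppliedWhenID=2&AgDateTypeID=1",
             "a=1")

# Position of the digits directly preceded by "AppliedWhenID="
m <- regexpr("(?<=AppliedWhenID=)\\d+", queries, perl = TRUE)

# m is -1 where nothing matched
m == -1
# [1] FALSE FALSE  TRUE

found <- regmatches(queries, m)
found
# [1] "2" "2"
```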

How would I remove the text before the initial period, the initial period itself and text after final period in a string?

I need to remove the text before the leading period (as well as the leading period) and the text following the last period from a string.
Given this string for example:
"ABCD.EF.GH.IJKL.MN"
I'd like to get the output:
[1] "IJKL"
I have tried the following:
split_string <- sub("^.*?\\.","", string)
split_string <- sub("^\\.+|\\.[^.]*$", "", string)
I believe I have it working for the period and text after for that string output I want. However, the first line needs to be executed multiple times to remove the text before that period in question e.g. '.I'.
One option in base R is to capture as a group ((...)) the word followed by the dot (\\.) and the word (\\w+) till the end ($) of the string. In the replacement, use the backreference (\\1) of the captured word
sub(".*\\.(\\w+)\\.\\w+$", "\\1", str1)
#[1] "IJKL"
Here, we match characters (.*) till the . (\\. - escaped to get the literal value because . is a metacharacter that will match any character if not escaped), followed by the captured word ((\\w+)), followed by a dot and another word at the end ($) of the string. The replacement is the backreference described above.
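To show how this pattern behaves when the number of dot-separated fields varies, here is a short sketch on a hypothetical vector (only the first element comes from the question). Strings with fewer than two dots do not match and are returned unchanged by sub.

```r
# Hypothetical inputs with 5, 3 and 2 dot-separated fields
strs <- c("ABCD.EF.GH.IJKL.MN", "A.B.C", "X.Y")

# The greedy .* pushes the captured word to the second-to-last field
res <- sub(".*\\.(\\w+)\\.\\w+$", "\\1", strs)
res
# [1] "IJKL" "B"    "X.Y"
```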
Or another option is regmatches/regexpr from base R
regmatches(str1, regexpr("\\w+(?=\\.\\w+$)", str1, perl = TRUE))
#[1] "IJKL"
Or another option is word from stringr
library(stringr)
word(str1, -2, sep="[.]")
#[1] "IJKL"
data
str1 <- "ABCD.EF.GH.IJKL.MN"
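Here is a janky dplyr version (separate itself comes from tidyr) in case the other values are of importance and you want to select them later on, just include them in the "select".
library(dplyr)
library(tidyr)
df <- data.frame(x = c("ABCD.EF.GH.IJKL.MN"))
# separate() splits on non-alphanumeric characters by default, here the dots
df2 <- df %>%
separate(x, into = c("var1", "var2", "var3", "var4", "var5")) %>%
select("var4")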
Split into groups at period and take the second one from last.
sapply(strsplit(str1, "\\."), function(x) x[length(x) - 1])
#[1] "IJKL"
Get indices of the periods and use substr to extract the relevant portion
sapply(str1, function(x){
ind = gregexpr("\\.", x)[[1]]
substr(x, ind[length(ind) - 1] + 1, ind[length(ind)] - 1)
}, USE.NAMES = FALSE)
#[1] "IJKL"
These alternatives all use no packages or regular expressions.
1) basename/dirname Assuming the test input s shown in the Note at the end, convert the dots to slashes and then use dirname and basename.
basename(dirname(chartr(".", "/", s)))
## [1] "IJKL" "IJKL"
2) strsplit Using strsplit split the strings at dot creating a list of character vectors, one vector per input string, and then for each such vector take the last 2 elements using tail and the first of those using indexing.
sapply(strsplit(s, ".", fixed = TRUE), function(x) tail(x, 2)[1])
## [1] "IJKL" "IJKL"
3) read.table It is not clear from the question what the general case is but if all the components of s have the same number of dot separated fields then we can use read.table to create a data.frame with one row per input string and one column per dot-separated component. Then take the column just before the last.
dd <- read.table(text = s, sep = ".", as.is = TRUE)
dd[[ncol(dd)-1]]
## [1] "IJKL" "IJKL"
4) substr Again, the general case is not clear but if the string of interest is always at character positions 12-15 then a simple solution is:
substr(s, 12, 15)
## [1] "IJKL" "IJKL"
Note
s <- c("ABCD.EF.GH.IJKL.MN", "ABCD.EF.GH.IJKL.MN")

Split and re-concatenate a string

I am trying to get the host of an IP address from a list of strings.
ips <- c('140.112.204.42', '132.212.14.139', '31.2.47.93', '7.112.221.238')
I want to get the first 2 digits from the ips. output:
ips <- c('140.112', '132.212', '31.2', '7.112')
This is the code that I wrote to convert them:
cat(unlist(strsplit(ips, "\\.", fixed = FALSE))[1:2], sep = ".")
When I check the type of individual ips in the end I get something like this:
140.112 NULL
Not sure what I am doing wrong. If you have some other ideas completely different from this that is completely fine too.
With sub:
ips <- c('140.112.204.42', '132.212.14.139', '31.2.47.93', '7.112.221.238')
sub('\\.\\d+\\.\\d+$', '', ips)
# [1] "140.112" "132.212" "31.2" "7.112"
With str_extract from stringr:
library(stringr)
str_extract(ips, '^\\d+\\.\\d+')
# [1] "140.112" "132.212" "31.2" "7.112"
With strsplit + sapply:
sapply(strsplit(ips, '\\.'), function(x) paste(x[1:2], collapse = '.'))
# [1] "140.112" "132.212" "31.2" "7.112"
With read.table + apply:
apply(read.table(textConnection(ips), sep='.')[1:2], 1, paste, collapse = '.')
#[1] "140.112" "132.212" "31.2" "7.112"
Notes:
sub('\\.\\d+\\.\\d+$', '', ips):
i. \\.\\d+\\.\\d+$ matches a literal dot, a digit one or more times, a literal dot again, and a digit one or more times at the end of the string
ii. sub removes the above match from the string
str_extract(ips, '^\\d+\\.\\d+'):
i. ^\\d+\\.\\d+ matches a digit one or more times, a literal dot and a digit one or more times in the beginning of the string
ii. str_extract extracts the above match from the string
sapply(strsplit(ips, '\\.'), function(x) paste(x[1:2], collapse = '.')):
i. strsplit(ips, '\\.') splits each ip using a literal dot as the delimiter. This returns a list of vectors after the split
ii. With sapply, paste(x[1:2], collapse = '.') is applied to every element of the list, thus taking only the first two numbers from each vector, and collapsing them with a dot as the separator. sapply then coerces the list to a vector, thus returning a vector of the desired ips.
apply(read.table(textConnection(ips), sep='.')[1:2], 1, paste, collapse = '.'):
i. read.table(textConnection(ips), sep='.')[1:2] treats ips as text input and reads it in with dot as a delimiter. Only taking the first two columns.
ii. apply enables paste to be operated on each row, and collapses with a dot.
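As background to the approaches above, this short sketch shows why the attempt in the question prints only 140.112 followed by NULL: unlist flattens the parts of all addresses into one vector, [1:2] then keeps just the first two parts overall, and cat prints its arguments but returns NULL.

```r
ips <- c('140.112.204.42', '132.212.14.139', '31.2.47.93', '7.112.221.238')

# unlist() merges the four per-address vectors into one of length 16
flat <- unlist(strsplit(ips, "\\."))
length(flat)
# [1] 16

# [1:2] therefore picks the first two parts of the first address only
flat[1:2]
# [1] "140" "112"

# cat() prints "140.112" as a side effect and returns NULL
res <- cat(flat[1:2], sep = ".")
is.null(res)
# [1] TRUE
```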
Could you please try following.
gsub("^([0-9]+\\.[0-9]+)(.*)", "\\1", ips)
Explanation: the regex matches digits, a literal dot (escaped as \\., since an unescaped . matches any character) and digits again from the start of the string and stores them in the first capturing group; .* matches everything after that in the second group. The whole match is then replaced with \\1, the contents of the first group, which are the first 2 fields.
One solution is the following:
vapply(strsplit(ips, ".", fixed = TRUE),
function(x) paste(x[1:2], collapse = "."),
character(1L))
vapply applies function(x) to each element of the output of strsplit
strsplit produces a list where each element of the list is the components of the IP addresses separated by "."; setting fixed = TRUE requests to split using the exact value of the splitting string (i.e., "."), not using regex
function(x) takes the first two elements (x[1:2]) of each item coming out of strsplit and pastes them together, separated by "."
character(1L) tells vapply that each element of the output (i.e., returned from function(x)) should be a string of length 1.
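One practical consequence of declaring the result type with character(1L) shows up on empty input. The sketch below is an illustration, not part of the original answer; the helper name first_two is hypothetical. With zero-length input, sapply falls back to an empty list, while vapply still returns a character vector.

```r
# Hypothetical helper holding the function used in the answers above
first_two <- function(x) paste(x[1:2], collapse = ".")

empty <- character(0)

# sapply() cannot infer a type from zero elements and returns list()
s_out <- sapply(strsplit(empty, ".", fixed = TRUE), first_two)
class(s_out)
# [1] "list"

# vapply() returns an empty vector of the declared type
v_out <- vapply(strsplit(empty, ".", fixed = TRUE), first_two, character(1L))
class(v_out)
# [1] "character"
```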
Edit: #useR posted this solution right before me (using sapply).
substr is vectorised on the stop argument, so you can use this with a vector of positions before the second dot. regexpr gives the position of the first match, so if you sub out the first dot you can match on the second, which will conveniently be one before its true position as needed (since you removed the first one).
substr(ips,1,regexpr("\\.",sub("\\.","",ips)))
[1] "140.112" "132.212" "31.2" "7.112"
We can convert the ip addresses to numeric_version class and then format using this base R one-liner that employs no regular expressions:
format(numeric_version(ips)[, 1:2])
[1] "140.112" "132.212" "31.2" "7.112"

Removing the second "|" on the last position

Here are some examples from my data:
a <-c("sp|Q9Y6W5|","sp|Q9HB90|,sp|Q9NQL2|","orf|NCBIAAYI_c_1_1023|",
"orf|NCBIACEN_c_10_906|,orf|NCBIACEO_c_5_1142|",
"orf|NCBIAAYI_c_258|,orf|aot172_c_6_302|,orf|aot180_c_2_405|")
For a: The individual strings can contain even more entries of "sp|" and "orf"
The results have to be like this:
[1] "sp|Q9Y6W5" "sp|Q9HB90,sp|Q9NQL2" "orf|NCBIAAYI_c_1_1023"
"orf|NCBIACEN_c_10_906,orf|NCBIACEO_c_5_1142"
"orf|NCBIAAYI_c_258,orf|aot172_c_6_302,orf|aot180_c_2_405"
So the aim is to remove the last "|" for each "sp|" and "orf|" entry. It seems that "|" is a special challenge because it is a metacharacter in regular expressions. Furthermore, the length and composition of the "orf|" entries varying a lot. The only things they have in common is "orf|" or "sp|" at the beginning and that "|" is on the last position. I tried different things with gsub() but also with the stringr package or regexpr() or [:punct:], but nothing really worked. Maybe it was just the wrong combination.
We can use gsub to match the | that is followed by a , or is at the end ($) of the string and replace with blank ("")
gsub("[|](?=(,|$))", "", a, perl = TRUE)
#[1] "sp|Q9Y6W5"
#[2] "sp|Q9HB90,sp|Q9NQL2"
#[3] "orf|NCBIAAYI_c_1_1023"
#[4] "orf|NCBIACEN_c_10_906,orf|NCBIACEO_c_5_1142"
#[5] "orf|NCBIAAYI_c_258,orf|aot172_c_6_302,orf|aot180_c_2_405"
Or we split by ',', remove the last character with substr, and paste the list elements together
sapply(strsplit(a, ","), function(x) paste(substr(x, 1, nchar(x)-1), collapse=","))
An easy alternative that might work. You need to escape the "|" using "\\|".
# Input
a <-c("sp|Q9Y6W5|","sp|Q9HB90|,sp|Q9NQL2|","orf|NCBIAAYI_c_1_1023|",
"orf|NCBIACEN_c_10_906|,orf|NCBIACEO_c_5_1142|",
"orf|NCBIAAYI_c_258|,orf|aot172_c_6_302|,orf|aot180_c_2_405|")
# Expected output
b <- c("sp|Q9Y6W5", "sp|Q9HB90,sp|Q9NQL2", "orf|NCBIAAYI_c_1_1023" ,
"orf|NCBIACEN_c_10_906,orf|NCBIACEO_c_5_1142" ,
"orf|NCBIAAYI_c_258,orf|aot172_c_6_302,orf|aot180_c_2_405")
res <- gsub("\\|,", ",", gsub("\\|$", "", a))
all(res == b)
#[1] TRUE
You could construct a single regex call to gsub, but this is simple and easy to understand. The inner gsub looks for | at the end of the string and removes it. The outer gsub looks for |, and replaces it with ,.
You do not have to use a PCRE regex here as all you need can be done with the default TRE regex (if you specify perl=TRUE, the pattern is compiled with a PCRE regex engine and is sometimes slower than TRE default regex engine).
Here is the single simple gsub call:
gsub("\\|(,|$)", "\\1", a)
See the online R demo. No lookarounds are really necessary, as you see.
Pattern details
\\| - a literal | symbol (because if you do not escape it or put into a bracket expression it will denote an alternation operator, see the line below)
(,|$) - a capturing group (referenced to with \1 from the replacement pattern) matching either of the two alternatives:
, - a comma
| - or (the alternation operator)
$ - end of string anchor.
The \1 in the replacement string tells the regex engine to insert the contents stored in the capturing group #1 back into the resulting string (so, the commas are restored that way where necessary).
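The pattern details mention that | can be made literal either by escaping it or by putting it into a bracket expression. A minimal sketch comparing the two spellings on a subset of the question's data:

```r
a <- c("sp|Q9Y6W5|", "sp|Q9HB90|,sp|Q9NQL2|")

# Escaped pipe, as in the answer above
escaped <- gsub("\\|(,|$)", "\\1", a)

# Same pattern with the pipe inside a bracket expression
bracketed <- gsub("[|](,|$)", "\\1", a)

identical(escaped, bracketed)
# [1] TRUE
escaped
# [1] "sp|Q9Y6W5"           "sp|Q9HB90,sp|Q9NQL2"
```

Which spelling to use is a matter of taste; the bracket expression avoids the double backslash that R string literals require.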
