How to use sub in R with numeric operations on the matches?

Let's say I want to change the string X0_Y1_Z2 into X0_Y1_Z1, i.e. to decrease the last number by one. I tried it by the following statement in R, which doesn't work:
sub("(\\S+_\\S+_)\\S(\\d)", paste0("\\1", as.numeric("\\2")-1), "X0_Y1_Z2", perl=T)
How can I do it?

If the string always has this format and only the single last digit needs to be decremented, a simple substring approach is enough:
> s <- "X0_Y1_Z2"
> paste0(substring(s, 1, nchar(s)-1), as.numeric(substring(s, nchar(s))) - 1)
[1] "X0_Y1_Z1"
To match the last chunk of digits in a string, use the regex [0-9]+$. To modify the matched value, use the gsubfn package, which accepts a function (or formula) as the replacement. Example code:
> library(gsubfn)
> s <- "X0_Y1_Z2"
> gsubfn('[0-9]+$', ~ as.numeric(x)-1, s)
[1] "X0_Y1_Z1"
If you need to validate the string the way you did, use more capture groups; the anchors ^ and $ then require the whole string to match the pattern (a "full string match"):
> p <- "^(\\S+_\\S+_\\S)(\\d+)$"
> gsubfn(p, function(x1,x2) paste0(x1, as.numeric(x2)-1), s)
[1] "X0_Y1_Z1"
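For completeness, the same decrement can be sketched in base R alone. The attempt in the question fails because as.numeric("\\2") is evaluated before sub() ever runs, so the arithmetic never sees the captured digit. The helper name decrement_last and the second input string below are made up for illustration; treat this as one possible approach rather than the canonical one.

```r
# Sketch: decrement the trailing number using only base R.
# decrement_last is a hypothetical helper name, not part of any package.
decrement_last <- function(s) {
  m <- regexpr("[0-9]+$", s)                    # locate the trailing run of digits
  new_vals <- as.numeric(regmatches(s, m)) - 1  # do the arithmetic on the matches
  regmatches(s, m) <- as.character(new_vals)    # write the new values back in place
  s
}

decrement_last(c("X0_Y1_Z2", "A_B_C10"))
# [1] "X0_Y1_Z1" "A_B_C9"
```

Unlike the substring approach, this also handles trailing numbers with more than one digit.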

Related

extract substring in R

Suppose I have a list of strings like "S[+229]EC[+57]VDSTDNSSK[+229]PSSEPTSHVAR" and need to get a vector of strings that contain only the bracketed numbers, e.g. [+229][+57].
Is there a convenient way in R to do this?
Using base R (with s holding the input string), try:
> unlist(regmatches(s,gregexpr("\\[\\+\\d+\\]",s)))
[1] "[+229]" "[+57]" "[+229]"
Or you can use
> gsub(".*?(\\[.*\\]).*","\\1",gsub("\\].*?\\[","] | [",s))
[1] "[+229] | [+57] | [+229]"
We can use str_extract_all from stringr
stringr::str_extract_all(x, "\\[\\+\\d+\\]")[[1]]
#[1] "[+229]" "[+57]" "[+229]"
Wrap it in unique if you need only unique values.
Similarly, in base R using regmatches and gregexpr
regmatches(x, gregexpr("\\[\\+\\d+\\]", x))[[1]]
data
x <- "S[+229]EC[+57]VDSTDNSSK[+229]PSSEPTSHVAR"
Seems like you want to remove the alphabetical characters, so
gsub("[[:alpha:]]", "", x)
where [:alpha:] is the class of alphabetical (lower-case and upper-case) characters, [[:alpha:]] says 'match any single alphabetical character', and gsub() says substitute, globally, any alphabetical character with the empty string "". This seems better than trying to match bracketed numbers, which requires figuring out which characters need to be escaped with a (double!) \\.
If the intention is to return the unique bracketed numbers, then the approach is to extract the matches (rather than remove the unwanted characters). Instead of using gsub() to substitute matches to a regular expression with another value, I'll use gregexpr() to identify the matches, and regmatches() to extract the matches. Since numbers always occur in [], I'll simplify the regular expression to match one or more (+) characters from the collection +[:digit:].
> xx <- regmatches(x, gregexpr("[+[:digit:]]+", x))
> xx
[[1]]
[1] "+229" "+57" "+229"
xx is a list of length equal to the length of x. I'll write a function that, for any element of this list, makes the values unique, surrounds the values with [ and ], and concatenates them
fun <- function(x)
    paste0("[", unique(x), "]", collapse = "")
This needs to be applied to each element of the list, and simplified to a vector, a task for sapply().
> sapply(xx, fun)
[1] "[+229][+57]"
A minor improvement is to use vapply(), so that the result is robust (always returning a character vector with length equal to x) to zero-length inputs
> x = character()
> xx <- regmatches(x, gregexpr("[+[:digit:]]+", x))
> sapply(xx, fun) # Hey, this returns a list :(
list()
> vapply(xx, fun, character(1)) # vapply() deals with 0-length inputs
character(0)
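Putting the extract, unique and concatenate steps together, a vectorised version might look like the sketch below. The function name extract_mods and the shortened test strings are assumptions for illustration. It uses the stricter pattern that includes the brackets, so an element with no bracketed numbers yields an empty string.

```r
# Sketch: unique bracketed numbers per element, always returning a character vector.
# extract_mods is a hypothetical helper name.
extract_mods <- function(x) {
  hits <- regmatches(x, gregexpr("\\[\\+[0-9]+\\]", x))  # one vector of matches per element
  vapply(hits, function(h) paste(unique(h), collapse = ""), character(1))
}

extract_mods(c("S[+229]EC[+57]K[+229]", "PEPTIDE"))
# [1] "[+229][+57]" ""
extract_mods(character())
# character(0)
```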

How to count "(" in the string? [duplicate]

I am trying to get the number of open brackets in a character string in R. I am using the str_count function from the stringr package
s<- "(hi),(bye),(hi)"
str_count(s,"(")
Error in stri_count_regex(string, pattern, opts_regex = attr(pattern, :
  Incorrectly nested parentheses in regexp pattern. (U_REGEX_MISMATCHED_PAREN)
I am hoping to get 3 for this example
( is a special character. You need to escape it:
str_count(s,"\\(")
# [1] 3
Alternatively, given that you're using stringr, you can use the coll function:
str_count(s,coll("("))
# [1] 3
You could also use gregexpr along with length in base R:
sum(gregexpr("(", s, fixed=TRUE)[[1]] > 0)
[1] 3
gregexpr takes a character vector and returns a list with the starting positions of each match. I added fixed=TRUE in order to match the literal character. length alone will not work because gregexpr returns -1 when no match is found.
If you have a character vector of length greater than one, you would need to feed the result to sapply:
# new example
s<- c("(hi),(bye),(hi)", "this (that) other", "what")
sapply((gregexpr("(", s, fixed=TRUE)), function(i) sum(i > 0))
[1] 3 1 0
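A variation on the same idea, sketched here with a made-up helper name count_open, extracts the matches with regmatches() and counts them with lengths(). Elements without a match give a zero-length vector, so the -1 sentinel never has to be handled explicitly.

```r
# Sketch: count literal "(" per element of a character vector.
# count_open is a hypothetical helper name.
count_open <- function(s) {
  hits <- regmatches(s, gregexpr("(", s, fixed = TRUE))  # list of matched substrings
  lengths(hits)                                          # 0 where nothing matched
}

s <- c("(hi),(bye),(hi)", "this (that) other", "what")
count_open(s)
# [1] 3 1 0
```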
If you want to do it in base R you can split into a vector of individual characters and count the "(" directly (without representing it as a regular expression):
> s<- "(hi),(bye),(hi)"
> chars <- unlist(strsplit(s,""))
> length(chars[chars == "("])
[1] 3

Regular expression to extract specific part of a URL

I have a vector of URLs and need to extract a certain part of it. I've tried using a regex tester to see if my attempts worked, but they were no good.
The URLs I have are in this format: https://www.baseball-reference.com/teams/MIL/1976.shtml
I need to extract the three letters after "teams/" (so for the example above, I need "MIL").
Does anyone have any idea how to get the correct regular expression to get this working? Thanks.
1) basename/dirname Try this:
u <- "https://www.baseball-reference.com/teams/MIL/1976.shtml" # input data
basename(dirname(u))
## [1] "MIL"
2) sub Or use sub with a regular expression:
sub(".*teams/(.*?)/.*", "\\1", u)
## [1] "MIL"
3) strsplit Split the string on / and take the second last component.
s <- strsplit(u, "/")[[1]]
s[length(s) - 1]
## [1] "MIL"
4) gsub Since the required substring is all upper case and no other characters in the input are, this gsub, which removes every character that is not an upper-case letter, would work:
gsub("[^A-Z]", "", u)
## [1] "MIL"
There are many ways to achieve this with regular expressions. Here's one:
url <- "https://www.baseball-reference.com/teams/MIL/1976.shtml"
gsub(".+teams/(\\w{3}).+$", "\\1", url)
#[1] "MIL"
Or
x <- c('https://www.baseball-reference.com/teams/MIL/1976.shtml')
pattern <- "/teams/([^/]+)"
m <- regexec(pattern, x)
res = regmatches(x, m)[[1]]
res[2]
which yields
[1] "MIL"
Consider using the stringr package to simplify your code when handling strings.
Use a regular expression with a positive lookbehind to catch the upper-case letters following the string "teams/":
stringr::str_extract(url, "(?<=teams\\/)[A-Z]*")
In your case, if the URLs literally all begin with the same string https://www.baseball-reference.com/teams/ then you can avoid regex entirely and use a simple substring to get the three-letter code which follows:
stringr::str_sub(url, 42, 44)
Here are the results:
> url <- "https://www.baseball-reference.com/teams/MIL/1976.shtml"
>
> stringr::str_extract(url, "(?<=teams\\/)[A-Z]*")
[1] "MIL"
>
> stringr::str_sub(url, 42, 44)
[1] "MIL"
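Since the question mentions a vector of URLs, here is a small sketch of the capture-group approach applied to several URLs at once. The second URL is invented purely for illustration, and the pattern assumes the team code is whatever sits between "/teams/" and the next slash.

```r
# Sketch: extract the path component after /teams/ from a vector of URLs.
urls <- c("https://www.baseball-reference.com/teams/MIL/1976.shtml",
          "https://www.baseball-reference.com/teams/NYY/1998.shtml")  # second URL is hypothetical

# [^/]+ captures everything up to the next slash; sub() is vectorised over urls
teams <- sub("^.*/teams/([^/]+)/.*$", "\\1", urls)
teams
# [1] "MIL" "NYY"
```

Note that sub() returns the input unchanged when the pattern does not match, so URLs in a different format would come back whole rather than as NA.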

Return number from string

I'm trying to extract the "Number" of "Humans" in the string below, for example:
string <- c("ProjectObjectives|Objectives_NA, PublishDate|PublishDate_NA, DeploymentID|DeploymentID_NA, Species|Human|Gender|Female, Species|Cat|Number|1, Species|Human|Number|1, Species|Human|Position|Left")
The position of the text in the string will constantly change, so I need R to search the string and find "Species|Human|Number|" and return 1.
Apologies if this is a duplicate of another thread, but I've looked here (extract a substring in R according to a pattern) and here (R extract part of string). But I'm not having any luck.
Any ideas?
Use a capturing approach - capture 1 or more digits (\d+) after the known substring (just escape the | symbols):
> string <- c("ProjectObjectives|Objectives_NA, PublishDate|PublishDate_NA, DeploymentID|DeploymentID_NA, Species|Human|Gender|Female, Species|Cat|Number|1, Species|Human|Number|1, Species|Human|Position|Left")
> pattern = "Species\\|Human\\|Number\\|(\\d+)"
> unlist(regmatches(string,regexec(pattern,string)))[2]
[1] "1"
A variation is to use a PCRE regex with regmatches/regexpr
> pattern="(?<=Species\\|Human\\|Number\\|)\\d+"
> regmatches(string,regexpr(pattern,string, perl=TRUE))
[1] "1"
Here, the left side context is put inside a non-consuming pattern, a positive lookbehind, (?<=...).
The same functionality can be achieved with \K operator:
> pattern="Species\\|Human\\|Number\\|\\K\\d+"
> regmatches(string,regexpr(pattern,string, perl=TRUE))
[1] "1"
Simplest way I can think of:
as.integer(gsub("^.+Species\\|Human\\|Number\\|(\\d+).+$", "\\1", string))
It will introduce NAs (with a coercion warning) where there is no mention of Species|Human|Number, because gsub then returns the input unchanged. Also, there will be artefacts if any of the strings is itself a number (but I assume that this won't be an issue).
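If the input is a vector where some elements lack the "Species|Human|Number|" entry, one way to get an explicit NA without relying on coercion warnings is sketched below. The helper name get_human_number and the two shortened test strings are assumptions for illustration; the pattern is the \K variant from above.

```r
# Sketch: return the number after "Species|Human|Number|", or NA when absent.
# get_human_number is a hypothetical helper name.
get_human_number <- function(x) {
  m <- regexpr("Species\\|Human\\|Number\\|\\K\\d+", x, perl = TRUE)
  out <- rep(NA_integer_, length(x))          # default: no match
  out[m > 0] <- as.integer(regmatches(x, m))  # fill in only the matched elements
  out
}

get_human_number(c("Species|Cat|Number|1, Species|Human|Number|12",
                   "Species|Human|Position|Left"))
# [1] 12 NA
```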

How to delete string or digits after certain pattern?

If there is a vector x that is,
x <- c('/name12/?ad_2','/name13/?ad_3','/name14/?ad_4')
Is there a way to delete the following numbers after 'ad_'?
so the converted x appears as
'/name12/?ad_' '/name13/?ad_' '/name14/?ad_'
I was trying to use the gsub function, but it didn't work because of the digits that follow 'name'.
You may use a regex with sub (since you perform a single search and replace, you do not need gsub) and use a pattern depending on what you need to include or exclude in the result.
You might use "(\\?ad_)[0-9]+$" to remove ?ad_ + digits and replace with "\\1" to restore the ?ad_ value, or just match the _ and then digits (and replace with _).
See demo code:
> x <- c('/name12/?ad_2','/name13/?ad_3','/name14/?ad_4')
> sub("(\\?ad_)[0-9]+$", "\\1", x)
[1] "/name12/?ad_" "/name13/?ad_" "/name14/?ad_"
> sub("_[0-9]+$", "_", x)
[1] "/name12/?ad_" "/name13/?ad_" "/name14/?ad_"
Pattern details:
_ - matches an underscore
[0-9]+ - matches 1 or more digits (the + quantifier matches one or more occurrences, as many as possible)
$ - the end of string.
Since the prefix is the same length for all of them:
x <- c('/name12/?ad_2','/name13/?ad_3','/name14/?ad_4')
substr(x,1,12)
[1] "/name12/?ad_" "/name13/?ad_" "/name14/?ad_"
Otherwise I would grep it.
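If the digits after 'ad_' are not guaranteed to sit at the very end of the string, dropping the $ anchor is one option. The second input below is a made-up example with a trailing parameter; this is only a sketch of that variation.

```r
# Sketch: remove the digits right after "?ad_" wherever they occur in the string.
x <- c('/name12/?ad_2', '/name13/?ad_34&ref=1')  # second element is hypothetical

# The capture group keeps "?ad_"; only the digits that follow it are dropped
sub("(\\?ad_)[0-9]+", "\\1", x)
# [1] "/name12/?ad_"       "/name13/?ad_&ref=1"
```

The digits in 'name12' are untouched because the pattern only matches digits that directly follow '?ad_'.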
