How to subset the string by escaping some characters - r

I want to extract two substrings from a string as follows:
select characters 5 to 20
select characters 5 to 21, but skip the 20th character
Example:
String: AGGTGAACGCCACGTCCAAAGTTAGGTGATGCATTCAAGTT
sub1: GAACGCCACGTCCAAA
sub2: GAACGCCACGTCCAAG

The ?substring function is also useful. Unlike ?substr, it can return several substrings of a single string in one call, because its result is as long as the longest of its arguments:
substring(str1, 5, 20)
#[1] "GAACGCCACGTCCAAA"
substring(str1, c(5,21), c(19,21))
#[1] "GAACGCCACGTCCAA" "G"
paste(substring(str1, c(5,21), c(19,21)), collapse="")
#[1] "GAACGCCACGTCCAAG"

We can use sub to match the first 4 characters (.{4}) from the start (^) of the string, followed by the next 16, which we capture as a group ((.{16})), followed by any remaining characters (.*), and replace the whole match with the backreference (\\1) to the captured group
sub("^.{4}(.{16}).*", "\\1", str1)
#[1] "GAACGCCACGTCCAAA"
We can get the first case with substr/substring
substr(str1, 5, 20)
#[1] "GAACGCCACGTCCAAA"
For the second case, instead of capturing 16 characters, capture 15 characters ((.{15})), skip one character (.), capture the next character as a second group ((.)), and replace with the backreferences (\\1\\2) to the two captured groups
sub("^.{4}(.{15}).(.).*", "\\1\\2", str1)
#[1] "GAACGCCACGTCCAAG"
Or with substr
sprintf("%s%s", substr(str1, 5, 19), substr(str1, 21, 21))
#[1] "GAACGCCACGTCCAAG"
data
str1 <- "AGGTGAACGCCACGTCCAAAGTTAGGTGATGCATTCAAGTT"
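If more than one position needs to be skipped, the same idea can be generalised by indexing individual characters. This is only a minimal sketch: extract_skipping is a hypothetical helper name, not a base R function, and it assumes a single input string with 1-based positions.

```r
# Hypothetical helper (not part of base R): keep the characters at positions
# from..to of a single string, dropping any positions listed in `skip`.
extract_skipping <- function(s, from, to, skip = integer(0)) {
  chars <- strsplit(s, "")[[1]]          # split into single characters
  keep <- setdiff(seq(from, to), skip)   # positions to keep, in order
  paste(chars[keep], collapse = "")
}

str1 <- "AGGTGAACGCCACGTCCAAAGTTAGGTGATGCATTCAAGTT"
sub1 <- extract_skipping(str1, 5, 20)
sub2 <- extract_skipping(str1, 5, 21, skip = 20)
sub1
#[1] "GAACGCCACGTCCAAA"
sub2
#[1] "GAACGCCACGTCCAAG"
```

For a single skipped character, the substr/substring versions above are shorter; the helper only pays off when the list of skipped positions varies.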

Related

Extract first X digits of N digit numbers

How to select first 2 digits of a number? I just need the name of the function
Example: 12455 turns into 12, 13655 into 13
Basically it's the equivalent of substring for integers.
If at the end you need again a numeric vector/element, you can use
as.numeric(substr(x, 1, 2))
This solution uses gsub, the anchor ^ signifying the start position of a string, \\d{2} for any two digits appearing at this position, wrapped in (...) to mark it as a capturing group, and the backreference \\1 in the replacement argument, which 'recalls' the capturing group:
x <- c(12455,13655)
gsub("(^\\d{2}).*", "\\1", x)
#[1] "12" "13"
Alternatively, use str_extract:
library(stringr)
str_extract(x, "^\\d{2}")
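If a numeric result is wanted without going through a regex, integer division is another possibility. The sketch below assumes positive whole numbers with at least two digits that as.character prints without scientific notation (so very large values are not covered); n_digits and first_two are just illustrative names.

```r
x <- c(12455, 13655)

# Count the digits of each number (assumes as.character gives plain digits)
n_digits <- nchar(as.character(x))

# Integer-divide by the power of ten that leaves only the first two digits
first_two <- x %/% 10^(n_digits - 2)
first_two
#[1] 12 13
```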

Regex: extract a number after a string that contains a number

Suppose I have a string:
str <- "England has 90 cases(1 discharged, 5 died); Scotland has 5 cases(2 discharged, 1 died)"
How can I grab the number of discharged cases in England?
I have tried
sub("(?i).*England has [\\d] cases(.*?(\\d+).*", "\\1", str),
It's returning the original string. Many Thanks!
We can use regmatches/gregexpr to match one or more digits (\\d+) that are followed by a space and 'discharged' (checked with the lookahead (?= discharged), so only the digits are returned) to extract the numbers of discharged cases
as.integer(regmatches(str, gregexpr("\\d+(?= discharged)", str, perl = TRUE))[[1]])
#[1] 1 2
If it is specific only to 'England', start with 'England' followed by characters that are not a ( ([^(]+) and then a literal (, capture the digits (\\d+) as a group, and in the replacement specify the backreference (\\1) to the captured group
sub("England[^(]+\\((\\d+).*", "\\1", str)
#[1] "1"
Or, following the OP's attempt, the ( after 'cases' must be escaped, because an unescaped ( is a metacharacter that opens a capture group. Also, \\d+ should be used without the square brackets so that it can match the two-digit 90
sub("(?i)England has \\d+ cases\\((\\d+).*", "\\1", str)
#[1] "1"
We can use str_match to capture number before "discharged".
stringr::str_match(str, "England.*?(\\d+) discharged")[, 2]
#[1] "1"
Alternatively, match the regex \d+(?= discharged) and take the first match, which is the one for England.
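If the figures for every country are needed rather than England alone, one possible approach is to split the string on the semicolons first and then apply sub to each piece. This is a sketch that assumes every entry follows the "<Country> has N cases(M discharged, ...)" layout of the example; parts, country and discharged are illustrative names.

```r
str <- "England has 90 cases(1 discharged, 5 died); Scotland has 5 cases(2 discharged, 1 died)"

# One element per country
parts <- strsplit(str, ";\\s*")[[1]]

# Country name: everything before " has"
country <- sub(" has.*", "", parts)

# Digits between the opening parenthesis and " discharged"
discharged <- as.integer(sub(".*\\((\\d+) discharged.*", "\\1", parts))
discharged <- setNames(discharged, country)

discharged[["England"]]
#[1] 1
```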

R Question: Extracting Numeric Characters from End of String

I have a data frame. One of the columns is in string format. Various letters and numbers, but always ending in a string of numbers. Sadly this string isn't always the same length.
I'd like to know how to write a bit of code to extract just the numbers at the end. So for example:
x <- c("AB ABC 19012301927 / XX - 4625",
"BC - AB / 827 / 9765",
"XXXX-9276"
)
And I'd like to get from this: (4625, 9765, 9276)
Is there any easy way to do this please?
Thank you.
We can use sub to capture one or more digits (\\d+) at the end ($) of the string that follows a non-digit ([^0-9]) and other characters (.*), in the replacement, specify the backreference (\\1) of the captured group
sub(".*[^0-9](\\d+)$", "\\1", x)
#[1] "4625" "9765" "9276"
Or with word from stringr
library(stringr)
word(x, -1, sep="[- ]")
#[1] "4625" "9765" "9276"
Or with stri_extract_last
library(stringi)
stri_extract_last_regex(x, "\\d+")
#[1] "4625" "9765" "9276"
Replace everything up to the last non-digit with a zero length string.
sub(".*\\D", "", x)
giving:
#[1] "4625" "9765" "9276"
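In a data frame column it can matter what happens to a value that does not end in digits: the sub calls above return either the unchanged string or an empty string in that case. A base R sketch that yields NA instead, assuming the trailing numbers fit in an integer (the fourth element and the name tail_num are added purely for illustration):

```r
x <- c("AB ABC 19012301927 / XX - 4625",
       "BC - AB / 827 / 9765",
       "XXXX-9276",
       "no digits here")   # extra element, added only to show the no-match case

# Position of a run of digits at the very end of each string (-1 if none)
m <- regexpr("[0-9]+$", x)

# regmatches() returns only the matches, so fill them into an NA vector
tail_num <- rep(NA_character_, length(x))
tail_num[m > 0] <- regmatches(x, m)

result <- as.integer(tail_num)
result
#[1] 4625 9765 9276   NA
```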

R Returning all characters after the first underscore

Sample DATA
x=c("AG.av08_binloop_v6","TL.av1_binloopv2")
Sample ATTEMPT
y=gsub(".*_","",x)
Sample DESIRED
WANT=c("binloop_v6","binloopv2")
Basically I aim to extract all the characters AFTER the first underscore value.
In the pattern, we can change .* (zero or more of any character; here . is a metacharacter that matches any character) to [^_]* (zero or more characters that are not a _), anchored at the start (^) of the string.
sub("^[^_]*_", "", x)
#[1] "binloop_v6" "binloopv2"
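If we don't specify it this way, the greedy .* matches up to the last _ in the string, and everything up to that point is removed, returning 'v6' and 'binloopv2'
Another option is word from stringr. Note that with a single position it returns only the second _-delimited field, so it matches the desired output only when the string contains one underscore; passing an end of -1, as in word(x, 2, -1, sep = "_"), should return everything from the second field onward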
library(stringr)
word(x, 2, sep = "_")
#[1] "binloop" "binloopv2"
regexpr gives the position of the first match (in this case, the first _). substring can then be used to extract the part of x from the position after it to the end (nchar(x))
substring(x, regexpr("_", x) + 1, nchar(x))
#[1] "binloop_v6" "binloopv2"

Extract Between Parts of a String

I have a string of names in the following format:
names <- c("Q-1234-1", "Q-1234-2", "Q-1234-1-8", "Q-1234-2-8")
I am trying to extract the single digit after the second hyphen. There are instances where there will be a third hyphen and an additional digit at the end of the name. The desired output is:
1, 2, 1, 2
I assume that I will need to use sub/gsub but am not sure where to start. Any suggestions?
We can use sub to match zero or more characters that are not a - ([^-]*) from the start (^) of the string, followed by a -, then again zero or more characters that are not a -, followed by a second -, and capture the digit that follows as a group ((\\d)). In the replacement, we use the backreference to the captured group (\\1)
as.integer(sub("^[^-]*-[^-]*-(\\d).*", "\\1", names))
#[1] 1 2 1 2
Or this can be modified to
as.integer(sub("^([^-]*-){2}(\\d).*", "\\2", names))
#[1] 1 2 1 2
Here's an alternative using stringr
library("stringr")
names <- c("Q-1234-1", "Q-1234-2", "Q-1234-1-8", "Q-1234-2-8")
output = str_split_fixed(names, pattern = "-", n = 4)[,3]
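The same split-and-pick idea also works without stringr. A base R sketch, assuming every name contains at least two hyphens so that a third field exists (parts and third are illustrative names):

```r
names <- c("Q-1234-1", "Q-1234-2", "Q-1234-1-8", "Q-1234-2-8")

# Split each name on literal hyphens
parts <- strsplit(names, "-", fixed = TRUE)

# Pick the third field of each name and convert it to integer
third <- as.integer(vapply(parts, `[`, character(1), 3))
third
#[1] 1 2 1 2
```

Unlike the sub versions, this returns the whole third field, so it would also handle a multi-digit value in that position.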
