Extract first X digits of N digit numbers - r

How to select first 2 digits of a number? I just need the name of the function
Example: 12455 turns into 12, 13655 into 13
Basically it's the equivalent of substring for integers.

If at the end you need again a numeric vector/element, you can use
as.numeric(substr(x, 1, 2))

This solution uses gsub, the anchor ^ signifiying the start position of a string, \\d{2} for any two digits appearing at this position, wrapped into (...) to mark it as a capturing group, and backreference \\1 in the replacement argument, which 'recalls' the capturing group:
x <- c(12455,13655)
gsub("(^\\d{2}).*", "\\1", x)
[1] "12" "13"
Alternatively, use str_extract:
library(stringr)
str_extract(x, "^\\d{2}")

Related

Regex: extract a number after a string that contains a number

Suppose I have a string:
str <- "England has 90 cases(1 discharged, 5 died); Scotland has 5 cases(2 discharged, 1 died)"
How can I grab the number of discharged cases in England?
I have tried
sub("(?i).*England has [\\d] cases(.*?(\\d+).*", "\\1", str),
It's returning the original string. Many Thanks!
We can use regmatches/gregexpr to match one or more digits (\\d+) followed by a space, 'discharged' to extract the number of discharges
as.integer(regmatches(str, gregexpr("\\d+(?= discharged)", str, perl = TRUE))[[1]])
#[1] 1 2
If it is specific only to 'England', start with the 'England' followed by characters tat are not a ( ([^(]+) and (, then capture the digits (\\d+) as a group, in the replacement specify the backreference (\\1) of the captured group
sub("England[^(]+\\((\\d+).*", "\\1", str)
#[1] "1"
Or if we go by the OP's option, the ( should be escaped as it is a metacharacter to capture group (after the cases). Also, \\d+ can be placed outside the square brackets
sub("(?i)England has \\d+ cases\\((\\d+).*", "\\1", str)
#[1] "1"
We can use str_match to capture number before "discharged".
stringr::str_match(str, "England.*?(\\d+) discharged")[, 2]
#[1] "1"
the regex is \d+(?= discharged) and get the first match

R Question: Extracting Numeric Characters from End of String

I have a data frame. One of the columns is in string format. Various letters and numbers, but always ending in a string of numbers. Sadly this string isn't always the same length.
I'd like to know how to write a bit of code to extract just the numbers at the end. So for example:
x <- c("AB ABC 19012301927 / XX - 4625",
"BC - AB / 827 / 9765",
"XXXX-9276"
)
And I'd like to get from this: (4625, 9765, 9276)
Is there any easy way to do this please?
Thank you.
A
We can use sub to capture one or more digits (\\d+) at the end ($) of the string that follows a non-digit ([^0-9]) and other characters (.*), in the replacement, specify the backreference (\\1) of the captured group
sub(".*[^0-9](\\d+)$", "\\1", x)
#[1] "4625" "9765" "9276"
Or with word from stringr
library(stringr)
word(x, -1, sep="[- ]")
#[1] "4625" "9765" "9276"
Or with stri_extract_last
library(stringi)
stri_extract_last_regex(x, "\\d+")
#[1] "4625" "9765" "9276"
Replace everything up to the last non-digit with a zero length string.
sub(".*\\D", "", x)
giving:
[1] "4625" "9765" "9276"

R Returning all characters after the first underscore

Sample DATA
x=c("AG.av08_binloop_v6","TL.av1_binloopv2")
Sample ATTEMPT
y=gsub(".*_","",x)
Sample DESIRED
WANT=c("binloop_v6","binloopv2")
Basically I aim to extract all the characters AFTER the first underscore value.
In the pattern, we can change the zero or more any characters (.* - here . is metacharacter that can match any character) to zero or more characters that is not a _ ([^_]*) from the start (^) of the string.
sub("^[^_]*_", "", x)
#[1] "binloop_v6" "binloopv2"
If we don't specify it as such, the _ will match till the last _ in the string and uptill that substring will be lost returning 'v6' and 'binloopv2'
An easier option would be word from stringr
library(stringr)
word(x, 2, sep = "_")
#[1] "binloop" "binloopv2"
regexpr gives the position of first match (in this case _). Then substring can be used to extract the part of x from relevant position to the end (nchar(x))
substring(x, regexpr("_", x) + 1, nchar(x))
#[1] "binloop_v6" "binloopv2"

Extract Between Parts of a String

I have a string of names in the following format:
names <- c("Q-1234-1", "Q-1234-2", "Q-1234-1-8", "Q-1234-2-8")
I am trying to extract the single digit after the second hyphen. There are instances where there will be a third hyphen and an additional digit at the end of the name. The desired output is:
1, 2, 1, 2
I assume that I will need to use sub/gsub but am not sure where to start. Any suggestions?
We can use sub to match the pattern of zero or more characters that are not a - ([^-]*) from the start (^) of the string followed by a - followed by zero or more characters that are not a - followed by a - and the number that follows being captured as a group. In the replacement, we use the backreference of the captured group (\\1)
as.integer(sub("^[^-]*-[^-]*-(\\d).*", "\\1", names))
#[1] 1 2 1 2
Or this can be modified to
as.integer(sub("^([^-]*-){2}(\\d).*", "\\2", names))
#[1] 1 2 1 2
Here's an alternative using stringr
library("stringr")
names <- c("Q-1234-1", "Q-1234-2", "Q-1234-1-8", "Q-1234-2-8")
output = str_split_fixed(names, pattern = "-", n = 4)[,3]

Replace element in vector based on first letter of character string

Consider the vectors below:
ID <- c("A1","B1","C1","A12","B2","C2","Av1")
names <- c("ALPHA","BRAVO","CHARLIE","AVOCADO")
I want to replace the first character of each element in vector ID with vector names based on the first letter of vector names. I also want to add a _0 before each number between 0:9.
Note that the elements Av1 and AVOCADO throw things off a bit, especially with the lowercase v in Av1.
The result should look like this:
res <- c("ALPHA_01","BRAVO_01","CHARLIE_01","ALPHA_12","BRAVO_02","CHARLIE_02", "AVOCADO_01")
I know it should be done with regex but I've been trying for 2 days now and haven't got anywhere.
We can use gsubfn.
library(gsubfn)
#remove the number part from 'ID' (using `sub`) and get the unique elements
nm1 <- unique(sub("\\d+", "", ID))
#using gsubfn, replace the non-numeric elements with the matching
#key/value pair in the replacement
#finally format to add the "_" with sub
sub("(\\d+)$", "_0\\1", gsubfn("(\\D+)", as.list(setNames(names, nm1)), ID))
#[1] "ALPHA_01" "BRAVO_01" "CHARLIE_01" "ALPHA_02"
#[5] "BRAVO_02" "CHARLIE_02" "AVOCADO_01"
The (\\d+) indicates one or more numeric elements, and (\\D+) is one or more non-numeric elements. We are wrapping it within the brackets to capture as a group and replace it with the backreference (\\1 - as it is the first backreference for the captured group).
Update
If the condition would be to append 0 only to those 'ID's that have numbers less than 10, then we can do this with a second gsubfn and sprintf
gsubfn("(\\d+)", ~sprintf("_%02d", as.numeric(x)),
gsubfn("(\\D+)", as.list(setNames(names, nm1)), ID))
#[1] "ALPHA_01" "BRAVO_01" "CHARLIE_01" "ALPHA_12"
#[5] "BRAVO_02" "CHARLIE_02" "AVOCADO_01"
Doing this via base R, we can search for second character being V (as in AVOCADO) and substring 2 characters if that's true or 1 character if not. This will capture both AVOCADO and ALPHA. We then match those substrings with the letters extracted from ID (also convert toupper to capture Av with AV). Finally paste _0 along with the number found in each ID
paste0(names[match(toupper(sub('\\d+', '', ID)),
ifelse(substr(names, 2, 2) == 'V', substr(names, 1, 2),
substr(names, 1, 1)))],'_0', sub('\\D+', '', ID))
#[1] "ALPHA_01" "BRAVO_01" "CHARLIE_01" "ALPHA_02" "BRAVO_02" "CHARLIE_02" "AVOCADO_01"

Resources