Extract Between Parts of a String - r

I have a string of names in the following format:
names <- c("Q-1234-1", "Q-1234-2", "Q-1234-1-8", "Q-1234-2-8")
I am trying to extract the single digit after the second hyphen. There are instances where there will be a third hyphen and an additional digit at the end of the name. The desired output is:
1, 2, 1, 2
I assume that I will need to use sub/gsub but am not sure where to start. Any suggestions?

We can use sub to match, from the start (^) of the string, zero or more characters that are not a - ([^-]*), followed by a -, then again zero or more non-hyphen characters and a -, and then capture the digit that follows as a group ((\\d)). The trailing .* consumes the rest of the string, so replacing the whole match with the backreference of the captured group (\\1) leaves only that digit.
as.integer(sub("^[^-]*-[^-]*-(\\d).*", "\\1", names))
#[1] 1 2 1 2
Or the repeated part can be written with a quantifier, in which case the digit is the second captured group (\\2):
as.integer(sub("^([^-]*-){2}(\\d).*", "\\2", names))
#[1] 1 2 1 2
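If a regex feels heavier than needed, a minimal base R sketch of the same extraction is to split on the hyphen and take the third piece. It assumes every name has at least three hyphen-separated parts; the variable name third is only illustrative.

```r
names <- c("Q-1234-1", "Q-1234-2", "Q-1234-1-8", "Q-1234-2-8")

# Split each name on the literal "-" and keep the third piece
third <- vapply(strsplit(names, "-", fixed = TRUE),
                function(parts) parts[3], character(1))

as.integer(third)
#[1] 1 2 1 2
```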

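Here's an alternative using stringr. str_split_fixed splits each name on - into a character matrix, and the third column holds the digit (wrap it in as.integer if you need numbers):
library("stringr")
names <- c("Q-1234-1", "Q-1234-2", "Q-1234-1-8", "Q-1234-2-8")
output = str_split_fixed(names, pattern = "-", n = 4)[,3]
output
#[1] "1" "2" "1" "2"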

Related

Remove one number at position n of the number in a string of numbers separated by slashes

I have a character column with this configuration:
data <- data.frame(
id = 1:3,
codes = c("08001301001", "08002401002 / 08002601003 / 17134604034", "08004701005 / 08005101001"))
I want to remove the 6th digit of any number within the string. The numbers are always 10 characters long.
My code works. However I believe it might be done easier using RegEx, but I couldn't figure it out.
library(stringr)
remove_6_digit <- function(x){
idxs <- str_locate_all(x,"/")[[1]][,1]
for (idx in c(rev(idxs+7), 6)){
str_sub(x, idx, idx) <- ""
}
return(x)
}
result <- sapply(data$codes, remove_6_digit, USE.NAMES = F)
You can use
gsub("\\b(\\d{5})\\d", "\\1", data$codes)
This will remove the 6th digit from the start of a digit sequence.
Details:
\b - word boundary
(\d{5}) - Capturing group 1 (\1): five digits
\d - a digit.
While a word boundary looks sufficient for the current scenario, a digit boundary is also an option in case the numbers are glued to word characters (letters, digits or underscores):
gsub("(?<!\\d)(\\d{5})\\d", "\\1", data$codes, perl=TRUE)
where perl=TRUE enables the PCRE regex syntax and (?<!\d) is a negative lookbehind that fails the match if there is a digit immediately to the left of the current location.
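A small sketch of where the two boundaries differ. The input string glued is a made-up example, not taken from the question; it puts letters directly in front of the digits.

```r
# Hypothetical input: digits glued to letters, so there is no word boundary
# between "D" and the first digit
glued <- "ID08001301001"

# Word boundary: no match, the string comes back unchanged
with_word_boundary <- gsub("\\b(\\d{5})\\d", "\\1", glued, perl = TRUE)
with_word_boundary
#[1] "ID08001301001"

# Digit boundary via negative lookbehind: the 6th digit is removed
with_digit_boundary <- gsub("(?<!\\d)(\\d{5})\\d", "\\1", glued, perl = TRUE)
with_digit_boundary
#[1] "ID0800101001"
```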
And if you must only change numeric char sequences of 10 digits (no shorter and no longer) you can use
gsub("\\b(\\d{5})\\d(\\d{4})\\b", "\\1\\2", data$codes)
gsub("(?<!\\d)(\\d{5})\\d(?=\\d{4}(?!\\d))", "\\1", data$codes, perl=TRUE)
One remark though: your numbers consist of 11 digits, not 10, so you need to replace \\d{4} with \\d{5} in the two patterns above.
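As a sketch, this is how the exact-length pattern might look when adjusted to the 11-digit numbers in the sample data. perl = TRUE is used here only to keep the regex flavour consistent with the lookaround variants; the name res is illustrative.

```r
data <- data.frame(
  id = 1:3,
  codes = c("08001301001",
            "08002401002 / 08002601003 / 17134604034",
            "08004701005 / 08005101001"))

# Keep digits 1-5 and 7-11 of every standalone 11-digit number, drop digit 6
res <- gsub("\\b(\\d{5})\\d(\\d{5})\\b", "\\1\\2", data$codes, perl = TRUE)
res
#[1] "0800101001"
#[2] "0800201002 / 0800201003 / 1713404034"
#[3] "0800401005 / 0800501001"
```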
Another possible solution, using stringr::str_replace_all and lookarounds (the digit to remove must have five digits before it and five digits after it):
library(tidyverse)
data %>%
mutate(codes = str_replace_all(codes, "(?<=\\d{5})\\d(?=\\d{5})", ""))
#> id codes
#> 1 1 0800101001
#> 2 2 0800201002 / 0800201003 / 1713404034
#> 3 3 0800401005 / 0800501001

R Question: Extracting Numeric Characters from End of String

I have a data frame. One of the columns is in string format. Various letters and numbers, but always ending in a string of numbers. Sadly this string isn't always the same length.
I'd like to know how to write a bit of code to extract just the numbers at the end. So for example:
x <- c("AB ABC 19012301927 / XX - 4625",
"BC - AB / 827 / 9765",
"XXXX-9276"
)
And I'd like to get from this: (4625, 9765, 9276)
Is there any easy way to do this please?
Thank you.
A
We can use sub to capture one or more digits (\\d+) at the end ($) of the string, preceded by a non-digit ([^0-9]) and any characters before it (.*). In the replacement, specify the backreference (\\1) of the captured group.
sub(".*[^0-9](\\d+)$", "\\1", x)
#[1] "4625" "9765" "9276"
Or with word from stringr
library(stringr)
word(x, -1, sep="[- ]")
#[1] "4625" "9765" "9276"
Or with stri_extract_last_regex from stringi
library(stringi)
stri_extract_last_regex(x, "\\d+")
#[1] "4625" "9765" "9276"
Replace everything up to the last non-digit with a zero length string.
sub(".*\\D", "", x)
giving:
[1] "4625" "9765" "9276"
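Another base R option, sketched here under the assumption that every element really does end in digits, is to extract the match directly instead of deleting everything around it. Note that regmatches silently drops elements with no match, so the result can be shorter than the input if that assumption fails.

```r
x <- c("AB ABC 19012301927 / XX - 4625",
       "BC - AB / 827 / 9765",
       "XXXX-9276")

# Locate the run of digits anchored at the end of each string
m <- regexpr("[0-9]+$", x)

# Pull out the matched text
tail_digits <- regmatches(x, m)
tail_digits
#[1] "4625" "9765" "9276"
```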

How to subset the string by escaping some characters

I want to extract two substrings from a string in the following way:
select characters 5 to 20
select characters 5 to 21, but skip the 20th character
Example:
String: AGGTGAACGCCACGTCCAAAGTTAGGTGATGCATTCAAGTT
sub1: GAACGCCACGTCCAAA
sub2: GAACGCCACGTCCAAG
The ?substring function is also useful. It differs from ?substr in that it is vectorised over the start and end positions, so it can extract several substrings in one call:
substring(str1, 5, 20)
#[1] "GAACGCCACGTCCAAA"
substring(str1, c(5,21), c(19,21))
#[1] "GAACGCCACGTCCAA" "G"
paste(substring(str1, c(5,21), c(19,21)), collapse="")
#[1] "GAACGCCACGTCCAAG"
We can use sub to match the first 4 characters (.{4}) from the start (^) of the string, followed by the next 16, which we capture as a group ((.{16})), followed by any other characters (.*), and replace the whole match with the backreference (\\1) of the captured group
sub("^.{4}(.{16}).*", "\\1", str1)
#[1] "GAACGCCACGTCCAAA"
We can get the first case with substr/substring
substr(str1, 5, 20)
#[1] "GAACGCCACGTCCAAA"
For the second case, instead of capturing 16 characters, capture 15 characters as the first group ((.{15})), skip one character (.), then capture the next character as a second group ((.)) and replace with the backreferences (\\1\\2) of the two captured groups
sub("^.{4}(.{15}).(.).*", "\\1\\2", str1)
#[1] "GAACGCCACGTCCAAG"
Or with substr
sprintf("%s%s", substr(str1, 5, 19), substr(str1, 21, 21))
#[1] "GAACGCCACGTCCAAG"
data
str1 <- "AGGTGAACGCCACGTCCAAAGTTAGGTGATGCATTCAAGTT"
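If the positions vary, the same idea can be wrapped in a small function. The helper substr_skip below is hypothetical (not part of base R or of the answers above); it is a sketch that indexes single characters and leaves out the unwanted positions.

```r
str1 <- "AGGTGAACGCCACGTCCAAAGTTAGGTGATGCATTCAAGTT"

# Hypothetical helper: characters start..stop of s, minus the positions in skip
substr_skip <- function(s, start, stop, skip = integer(0)) {
  chars <- strsplit(s, "")[[1]]          # split into single characters
  keep <- setdiff(start:stop, skip)      # positions to retain, in order
  paste(chars[keep], collapse = "")
}

substr_skip(str1, 5, 20)
#[1] "GAACGCCACGTCCAAA"
substr_skip(str1, 5, 21, skip = 20)
#[1] "GAACGCCACGTCCAAG"
```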

Combining gsub calls and remove character after last instance of a string

I have the following string:
time <- "2017-05-30T09:20:00-08:00"
I want to use gsub to produce this:
"2017-05-30 09:20:00"
Here is what I have so far:
time2 <- gsub("T", " ", time)
gsub("\\-.*", "", time2)
Two questions -
How do I remove all characters after the last instance of -?
How do I combine these two statements into one?
Use a single call to sub with a spelled-out regex that captures the parts you are interested in and simply matches everything else. Then use the backreferences \1 and \2 in the replacement pattern to re-insert the two captured parts:
^(\d{4}-\d{2}-\d{2})T(\d{2}:\d{2}:\d{2}).*
Details:
^ - start of a string
(\d{4}-\d{2}-\d{2}) - Group 1: 4 digits, -, 2 digits, - and then 2 digits
T - a T letter
(\d{2}:\d{2}:\d{2}) - Group 2: 2 digits, :, 2 digits, : and 2 digits
.* - any 0+ chars up to the string end.
R online demo:
time_s <- "2017-05-30T09:20:00-08:00"
sub("^(\\d{4}-\\d{2}-\\d{2})T(\\d{2}:\\d{2}:\\d{2}).*", "\\1 \\2", time_s)
## => [1] "2017-05-30 09:20:00"
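To answer the two questions more literally, here is a sketch that keeps the original two-step idea but nests the calls. The pattern -[^-]*$ matches the last - and everything after it, because [^-]* cannot cross another hyphen before the end of the string. The name cleaned is illustrative.

```r
time_s <- "2017-05-30T09:20:00-08:00"

# Inner sub: drop the last "-" and everything after it
# Outer sub: replace the literal "T" separator with a space
cleaned <- sub("T", " ", sub("-[^-]*$", "", time_s), fixed = TRUE)
cleaned
#[1] "2017-05-30 09:20:00"
```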
It may be better to use functions that convert to DateTime
library(anytime)
format(anytime(time), "%Y-%m-%d %H:%M:%S")
#[1] "2017-05-30 09:20:00"
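A base R sketch of the same conversion, for cases where adding a package is not wanted. It relies on strptime ignoring trailing characters once the format has been consumed, so the -08:00 offset is discarded rather than applied; tz = "UTC" is an assumption made here only so the printed result does not depend on the local time zone.

```r
time_s <- "2017-05-30T09:20:00-08:00"

# Parse the date and wall-clock time; the trailing "-08:00" is ignored
parsed <- as.POSIXct(time_s, format = "%Y-%m-%dT%H:%M:%S", tz = "UTC")

formatted <- format(parsed, "%Y-%m-%d %H:%M:%S")
formatted
#[1] "2017-05-30 09:20:00"
```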

Replace element in vector based on first letter of character string

Consider the vectors below:
ID <- c("A1","B1","C1","A12","B2","C2","Av1")
names <- c("ALPHA","BRAVO","CHARLIE","AVOCADO")
I want to replace the first character of each element in vector ID with vector names based on the first letter of vector names. I also want to add a _0 before each number between 0:9.
Note that the elements Av1 and AVOCADO throw things off a bit, especially with the lowercase v in Av1.
The result should look like this:
res <- c("ALPHA_01","BRAVO_01","CHARLIE_01","ALPHA_12","BRAVO_02","CHARLIE_02", "AVOCADO_01")
I know it should be done with regex but I've been trying for 2 days now and haven't got anywhere.
We can use gsubfn.
library(gsubfn)
#remove the number part from 'ID' (using `sub`) and get the unique elements
nm1 <- unique(sub("\\d+", "", ID))
#using gsubfn, replace the non-numeric elements with the matching
#key/value pair in the replacement
#finally format to add the "_" with sub
sub("(\\d+)$", "_0\\1", gsubfn("(\\D+)", as.list(setNames(names, nm1)), ID))
#[1] "ALPHA_01"   "BRAVO_01"   "CHARLIE_01" "ALPHA_012"
#[5] "BRAVO_02"   "CHARLIE_02" "AVOCADO_01"
#note that A12 becomes ALPHA_012 here, because _0 is inserted unconditionally
The (\\d+) indicates one or more numeric elements, and (\\D+) is one or more non-numeric elements. We are wrapping it within the brackets to capture as a group and replace it with the backreference (\\1 - as it is the first backreference for the captured group).
Update
If the condition would be to append 0 only to those 'ID's that have numbers less than 10, then we can do this with a second gsubfn and sprintf
gsubfn("(\\d+)", ~sprintf("_%02d", as.numeric(x)),
gsubfn("(\\D+)", as.list(setNames(names, nm1)), ID))
#[1] "ALPHA_01" "BRAVO_01" "CHARLIE_01" "ALPHA_12"
#[5] "BRAVO_02" "CHARLIE_02" "AVOCADO_01"
Doing this in base R, we can check whether the second character of each name is V (as in AVOCADO) and take the first 2 characters as the key if so, or the first character otherwise. This distinguishes AVOCADO from ALPHA. We then match the letters extracted from ID against those keys (converted with toupper so that Av matches AV). Finally, paste _0 along with the number found in each ID
paste0(names[match(toupper(sub('\\d+', '', ID)),
ifelse(substr(names, 2, 2) == 'V', substr(names, 1, 2),
substr(names, 1, 1)))],'_0', sub('\\D+', '', ID))
#[1] "ALPHA_01" "BRAVO_01" "CHARLIE_01" "ALPHA_012" "BRAVO_02" "CHARLIE_02" "AVOCADO_01"
#as above, _0 is pasted unconditionally, so A12 becomes ALPHA_012
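To get the zero padding only for single-digit numbers in base R as well, one possible sketch is to parse the number and let sprintf do the padding. The intermediate names prefix, number and key are illustrative; the V rule for the two-letter key is the same assumption as in the answer above.

```r
ID <- c("A1", "B1", "C1", "A12", "B2", "C2", "Av1")
names <- c("ALPHA", "BRAVO", "CHARLIE", "AVOCADO")

prefix <- toupper(sub("\\d+$", "", ID))       # letter part: "A" "B" "C" "A" "B" "C" "AV"
number <- as.integer(sub("^\\D+", "", ID))    # numeric part: 1 1 1 12 2 2 1

# Lookup key per name: two letters when the second letter is V, else one
key <- ifelse(substr(names, 2, 2) == "V", substr(names, 1, 2), substr(names, 1, 1))

# %02d pads to two digits, so 1 becomes 01 and 12 stays 12
res <- sprintf("%s_%02d", names[match(prefix, key)], number)
res
#[1] "ALPHA_01"   "BRAVO_01"   "CHARLIE_01" "ALPHA_12"   "BRAVO_02"   "CHARLIE_02" "AVOCADO_01"
```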
