R extract string between nth and ith instance of delimiter

I have a vector of strings, similar to this one, but with many more elements:
s <- c("CGA-DV-558_T_90.67.0_DV_1541_07", "TC-V-576_T_90.0_DV_151_0", "TCA-DV-X_T_6.0_D_A2_07", "T-V-Z_T_2_D_A_0", "CGA-DV-AW0_T.1_24.4.0_V_A6_7", "ACGA-DV-A4W0_T_274.46.0_DV_A266_07")
And I would like to use a function that extracts the string between the nth and ith instances of the delimiter "_". For example, the string between the 2nd (n = 2) and 3rd (i = 3) instances, to get this:
[1] "90.67.0" "90.0" "6.0" "2" "24.4.0" "274.46.0"
Or if n = 4 and i = 5:
[1] "1541" "151" "A2" "A" "A6" "A266"
Any suggestions? Thank you for your help!

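You can do this with sub and gsub. The first step has to use sub rather than gsub, because gsub would strip every consecutive group of n fields instead of only the first one.
n = 2
i = 3
# drop everything up to and including the nth "_"
pattern1 = paste0("^(.*?_){", n, "}")
temp = sub(pattern1, "", s)
# keep only the next (i - n) fields
pattern2 = paste0("((.*?_){", i-n, "}).*")
temp = gsub(pattern2, "\\1", temp)
# remove the trailing "_"
temp = gsub("_$", "", temp)
temp
[1] "90.67.0" "90.0" "6.0" "2" "24.4.0" "274.46.0"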
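The same idea can be collapsed into a single substitution. This is only a sketch and not part of the original answer: the function name between_underscores is made up for illustration, the delimiter is hard-coded to "_", and it assumes every element contains at least i underscores (elements with fewer are returned unchanged).

```r
# Hypothetical helper: text between the nth and ith "_" of each element.
# Assumes 1 <= n < i and at least i underscores per element.
between_underscores <- function(x, n, i) {
  pattern <- sprintf("^(?:[^_]*_){%d}((?:[^_]*_){%d}[^_]*).*$", n, i - n - 1)
  # skip n fields, capture the next (i - n) fields, discard the rest
  sub(pattern, "\\1", x, perl = TRUE)
}

s <- c("CGA-DV-558_T_90.67.0_DV_1541_07", "TC-V-576_T_90.0_DV_151_0")
between_underscores(s, 2, 3)
# [1] "90.67.0" "90.0"
between_underscores(s, 3, 5)
# [1] "DV_1541" "DV_151"
```

Matching fields as [^_]* rather than .*? avoids relying on lazy quantifiers, at the cost of fixing the delimiter inside the pattern.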

#FUNCTION
foo = function(x, n, i){
    do.call(c, lapply(x, function(X)
        paste(unlist(strsplit(X, "_"))[(n+1):(i)], collapse = "_")))
}
#USAGE
foo(x = s, n = 3, i = 5)
#[1] "DV_1541" "DV_151" "D_A2" "D_A" "V_A6" "DV_A266"

A third method, which uses gregexpr to find the delimiter positions and substring for the extraction, is:
# extract positions of "_" from each vector element, returns a list
spots <- gregexpr("_", s, fixed=TRUE)
# extract text in between third and fifth underscores
substring(s, sapply(spots, "[", 3) + 1, sapply(spots, "[", 5) - 1)
#[1] "DV_1541" "DV_151" "D_A2" "D_A" "V_A6" "DV_A266"

Related

How to split letters with bracket and numbers in R?

The string is s = '[12]B1[16]M5'
I want to split it as the following results with strsplit function in R:
let <- c('[12]B', '[16]M')
num <- c(1, 5)
Thanks a lot
You could use a regular expression for your task.
s = '[12]B1[16]M22'
grx <- gregexpr("\\[.+?\\][[:alpha:]]+", s)
let <- do.call(c, regmatches(s, grx))
#let
#[1] "[12]B" "[16]M"
If you want to get all chunks (let + num), you can tweak the pattern as below. This facilitates extracting the numeric part.
grx <- gregexpr("\\[.+?\\][[:alpha:]]+[[:digit:]]+", s)
out <- do.call(c, regmatches(s, grx))
num <- gsub(".+\\][[:alpha:]]+", "", out)
num
#[1] "1" "22"
Using the stringr package:
library(stringr)
x <- '[12]B1[16]M2'
let <- unlist(str_extract_all(x, "\\[[0-9]{2}\\][A-Z]"))
x <- gsub(pattern = "\\[[0-9]{2}\\][A-Z]",
replacement = "",
x)
num <- unlist(str_extract_all(x, "[0-9]"))
The regular expression "\\[[0-9]{2}\\][A-Z]" can be broken down as:
\\[ an opening bracket
[0-9]{2} a sequence of two consecutive digits
\\] a closing bracket
[A-Z] exactly one upper-case letter
1) strapply Create a regular expression, pat, which matches the two parts, and then extract each part separately using strapply. The first capture group (the first parenthesized portion of the regular expression) consists of a left square bracket "\\[", the smallest string ".*?" up to the right square bracket "\\]", followed by any single character ".". The second capture group consists of one or more digits "\\d+".
library(gsubfn)
pat <- "(\\[.*?\\].)(\\d+)"
let <- strapply(s, pat, simplify = c)
num <- strapply(s, pat, ~ as.numeric(..2), simplify = c)
let
## [1] "[12]B" "[16]M"
num
## [1] 1 5
1a) Variation
This could also be expressed as this mapply producing a 2 component list:
mapply(strapply, s, pat, c(~ ..1, ~ as.numeric(..2)), simplify = "c",
SIMPLIFY = FALSE, USE.NAMES = FALSE)
## [[1]]
## [1] "[12]B" "[16]M"
##
## [[2]]
## [1] 1 5
2) gsub/read.table This uses no packages, only gsub and read.table. pat is defined in (1). It returns a data frame with the results in two columns:
read.table(text = gsub(pat, "\\1 \\2\n", s), as.is = TRUE, col.names = c("let", "num"))
## let num
## 1 [12]B 1
## 2 [16]M 5
3) gsub/strsplit This is somewhat similar to (2) but uses strsplit rather than read.table. pat is from (1).
spl <- matrix(strsplit(gsub(pat, "\\1 \\2 ", s), " ")[[1]], 2)
let <- spl[1, ]
num <- as.numeric(spl[2, ])
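For comparison, here is a base-R-only sketch that first extracts each complete chunk and then separates the two parts with sub. It is not taken from any of the answers above, and it assumes every chunk has the shape bracketed digits, then letters, then trailing digits.

```r
s <- "[12]B1[16]M5"

# Each chunk: "[digits]", one or more letters, then the trailing number
chunks <- regmatches(s, gregexpr("\\[[0-9]+\\][A-Za-z]+[0-9]+", s))[[1]]

let <- sub("[0-9]+$", "", chunks)                # strip the trailing number
num <- as.numeric(sub("^.*[^0-9]", "", chunks))  # keep only the trailing number

let
# [1] "[12]B" "[16]M"
num
# [1] 1 5
```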

Combining elements in a string vector with defined element size and accounting for uneven sizes

Given a vector:
vec <- c(LETTERS[1:10])
I would like to be able to combine it in the following manner:
resA <- c("AB", "CD", "EF", "GH", "IJ")
resB <- c("ABCDEF","GHIJ")
where elements of the vector vec are merged together according to the desired size of a new element constituting the resulting vector. This is 2 in case of resA and 6 in case of resB.
Desired solution characteristics
The solution should allow for flexibility with respect to the element sizes, i.e. I may want to have vectors with elements of size 2 or 20
There may not be enough elements in the vector to match the desired chunk size; in that case the last element should be shortened accordingly (as shown)
This shouldn't make a difference, but the solution should work on words as well
Attempts
Initially, I was thinking of using something along the lines of:
c(
paste0(vec[1:2], collapse = ""),
paste0(vec[3:4], collapse = ""),
paste0(vec[5:6], collapse = "")
# ...
)
but this would have to be adapted to step through the remaining pairs (or bigger groups) of vec and to handle the last group, which would often be smaller.
Here is what I came up with. Using Harlan's idea in this question, you can split the vector in different number of chunks. You also want to use your paste0() idea in lapply() here. Finally, you unlist a list.
unlist(lapply(split(vec, ceiling(seq_along(vec)/2)), function(x){paste0(x, collapse = "")}))
# 1 2 3 4 5
#"AB" "CD" "EF" "GH" "IJ"
unlist(lapply(split(vec, ceiling(seq_along(vec)/5)), function(x){paste0(x, collapse = "")}))
# 1 2
#"ABCDE" "FGHIJ"
unlist(lapply(split(vec, ceiling(seq_along(vec)/3)), function(x){paste0(x, collapse = "")}))
# 1 2 3 4
#"ABC" "DEF" "GHI" "J"
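The split/ceiling idiom above can be wrapped in a small function so that the chunk size becomes an argument. This is just a sketch; the name chunk_paste is made up here and does not come from the answer.

```r
# Hypothetical wrapper: paste consecutive elements of x together in groups
# of `size`; the last group is simply shorter when length(x) is not a
# multiple of size.
chunk_paste <- function(x, size) {
  groups <- ceiling(seq_along(x) / size)   # 1,1,...,2,2,... group labels
  unname(vapply(split(x, groups), paste0, character(1), collapse = ""))
}

vec <- LETTERS[1:10]
chunk_paste(vec, 2)
# [1] "AB" "CD" "EF" "GH" "IJ"
chunk_paste(vec, 6)
# [1] "ABCDEF" "GHIJ"
chunk_paste(c("one", "two", "three"), 2)   # works on words as well
# [1] "onetwo" "three"
```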
vec <- c(LETTERS[1:10])
f1 <- function(x, n){
f <- function(x) paste0(x, collapse = '')
regmatches(f(x), gregexpr(f(rep('.', n)), f(x)))[[1]]
}
f1(vec, 2)
# [1] "AB" "CD" "EF" "GH" "IJ"
or
f2 <- function(x, n)
apply(matrix(x, nrow = n), 2, paste0, collapse = '')
f2(vec, 5)
# [1] "ABCDE" "FGHIJ"
or
f3 <- function(x, n) {
f <- function(x) paste0(x, collapse = '')
strsplit(gsub(sprintf('(%s)', f(rep('.', n))), '\\1 ', f(x)), '\\s+')[[1]]
}
f3(vec, 4)
# [1] "ABCD" "EFGH" "IJ"
I would say the last is the best of these, since for the others n must be a factor of length(x): otherwise f1 silently drops the incomplete final chunk and f2 recycles elements with a warning.
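To see the difference concretely, here are the three approaches applied with n = 3, which is not a factor of 10. The bodies of f1, f2 and f3 are inlined so the snippet runs on its own; the variable names r1, r2 and r3 are only for illustration.

```r
vec <- LETTERS[1:10]
f <- function(x) paste0(x, collapse = "")

# f1-style: the pattern "..." only matches complete chunks, so "J" is lost
r1 <- regmatches(f(vec), gregexpr(f(rep(".", 3)), f(vec)))[[1]]
r1
# [1] "ABC" "DEF" "GHI"

# f2-style: matrix() recycles vec to fill a 3 x 4 matrix (and warns)
r2 <- suppressWarnings(apply(matrix(vec, nrow = 3), 2, paste0, collapse = ""))
r2
# [1] "ABC" "DEF" "GHI" "JAB"

# f3-style: the remainder is kept as a shorter last element
r3 <- strsplit(gsub("(...)", "\\1 ", f(vec)), "\\s+")[[1]]
r3
# [1] "ABC" "DEF" "GHI" "J"
```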
edit - more
f4 <- function(x, n) {
f <- function(x) paste0(x, collapse = '')
Vectorize(substring, USE.NAMES = FALSE)(f(x), which((seq_along(x) %% n) == 1),
which((seq_along(x) %% n) == 0))
}
f4(vec, 2)
# [1] "AB" "CD" "EF" "GH" "IJ"
or
f5 <- function(x, n)
mapply(function(x) paste0(x, collapse = ''),
split(x, c(0, head(cumsum(rep_len(sequence(n), length(x)) %in% n), -1))),
USE.NAMES = FALSE)
f5(vec, 4)
# [1] "ABCD" "EFGH" "IJ"
Here is another way, working with the original array.
A side note: working with words is not straightforward, since there are at least two ways to understand it. You can either keep each word separate, or collapse them first and get individual characters. The next function can deal with both options.
vec <- c(LETTERS[1:10])
vec2 <- c("AB","CDE","F","GHIJ")
cuts <- function(x, n, bychar=FALSE) {
    if (bychar) x <- unlist(strsplit(paste0(x, collapse=""), ""))
    ii <- seq_along(x)
    li <- split(ii, ceiling(ii/n))
    return(sapply(li, function(y) paste0(x[y], collapse="")))
}
cuts(vec2,2,FALSE)
# 1 2
# "ABCDE" "FGHIJ"
cuts(vec2,2,TRUE)
# 1 2 3 4 5
# "AB" "CD" "EF" "GH" "IJ"

is there a way to extend LETTERS past 26 characters e.g., AA, AB, AC...?

I use LETTERS most of the time for my factors but today I tried to go beyond 26 characters:
LETTERS[1:32]
I was expecting an automatic extension to AA, AB, AC..., but was disappointed. Is this simply a limitation of LETTERS, or is there a way to get what I'm looking for using another function?
Would 702 be enough?
LETTERS702 <- c(LETTERS, sapply(LETTERS, function(x) paste0(x, LETTERS)))
If not, how about 18,278?
MOAR_LETTERS <- function(n=2) {
n <- as.integer(n[1L])
if(!is.finite(n) || n < 2)
stop("'n' must be a length-1 integer >= 2")
res <- vector("list", n)
res[[1]] <- LETTERS
for(i in 2:n)
res[[i]] <- c(sapply(res[[i-1L]], function(y) paste0(y, LETTERS)))
unlist(res)
}
ml <- MOAR_LETTERS(3)
str(ml)
# chr [1:18278] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" ...
This solution uses recursion. Usage is a bit different, in the sense that MORELETTERS is not a long vector that you have to store and possibly expand as your inputs get larger. Instead, it is a function that converts your numbers into the new base.
extend <- function(alphabet) function(i) {
base10toA <- function(n, A) {
stopifnot(n >= 0L)
N <- length(A)
j <- n %/% N
if (j == 0L) A[n + 1L] else paste0(Recall(j - 1L, A), A[n %% N + 1L])
}
vapply(i-1L, base10toA, character(1L), alphabet)
}
MORELETTERS <- extend(LETTERS)
MORELETTERS(1:1000)
# [1] "A" "B" ... "ALL"
MORELETTERS(c(1, 26, 27, 1000, 1e6, .Machine$integer.max))
# [1] "A" "Z" "AA" "ALL" "BDWGN" "FXSHRXW"
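As a sanity check on the numbering, the conversion can be run in the opposite direction: read each label as digits 1 to 26 and accumulate in base 26. This is a sketch for verification only; letters_to_index is a made-up name and is not part of the answer above.

```r
# Hypothetical inverse: map "A", "Z", "AA", ... back to 1, 26, 27, ...
# Assumes labels consist only of upper-case A-Z.
letters_to_index <- function(x) {
  vapply(strsplit(x, ""), function(ch) {
    digits <- as.numeric(match(ch, LETTERS))        # A = 1, ..., Z = 26
    Reduce(function(acc, d) acc * 26 + d, digits)   # e.g. "ALL": (1*26 + 12)*26 + 12
  }, numeric(1))
}

letters_to_index(c("A", "Z", "AA", "ALL", "BDWGN"))
# [1] 1e+00 2.6e+01 2.7e+01 1e+03 1e+06   (i.e. 1, 26, 27, 1000, 1000000)
```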
You can make what you want like this:
LETTERS2<-c(LETTERS[1:26], paste0("A",LETTERS[1:26]))
Another solution for excel style column names, generalized to any number of letters
#' Excel Style Column Names
#'
#' @param n maximum number of letters in column name
excel_style_colnames <- function(n){
unlist(Reduce(
function(x, y) as.vector(outer(x, y, 'paste0')),
lapply(1:n, function(x) LETTERS),
accumulate = TRUE
))
}
A variant on eipi10's method (ordered correctly) using data.table:
library(data.table)
BIG_LETTERS <- c(LETTERS,
do.call("paste0",CJ(LETTERS,LETTERS)),
do.call("paste0",CJ(LETTERS,LETTERS,LETTERS)))
Yet another option:
l2 = c(LETTERS, sort(do.call("paste0", expand.grid(LETTERS, LETTERS[1:3]))))
Adjust the two instances of LETTERS inside expand.grid to get the number of letter pairs you'd like.
A function to produce Excel-style column names, i.e.
# A, B, ..., Z, AA, AB, ..., AZ, BA, BB, ..., ..., ZZ, AAA, ...
letterwrap <- function(n, depth = 1) {
args <- lapply(1:depth, FUN = function(x) return(LETTERS))
x <- do.call(expand.grid, args = list(args, stringsAsFactors = F))
x <- x[, rev(names(x)), drop = F]
x <- do.call(paste0, x)
if (n <= length(x)) return(x[1:n])
return(c(x, letterwrap(n - length(x), depth = depth + 1)))
}
letterwrap(26^2 + 52) # through AAZ
## This will take a few seconds:
# x <- letterwrap(1e6)
It's probably not the fastest, but it extends indefinitely and is nicely predictable. Took about 20 seconds to produce through 1 million, BDWGN.
(For a few more details, see here: https://stackoverflow.com/a/21689613/903061)
A little late to the party, but I want to play too.
You can also use sub, and sprintf in place of paste0 and get a length 702 vector.
c(LETTERS, sapply(LETTERS, sub, pattern = " ", x = sprintf("%2s", LETTERS)))
Here's another addition to the list. This seems a bit faster than Gregor's (comparison done on my computer - using length.out = 1e6 his took 12.88 seconds, mine was 6.2), and can also be extended indefinitely. The flip side is that it's 2 functions, not just 1.
make.chars <- function(length.out, case, n.char = NULL) {
    if(is.null(n.char))
        n.char <- ceiling(log(length.out, 26))
    m <- sapply(n.char:1, function(x) {
        rep(rep(1:26, each = 26^(x-1)) , length.out = length.out)
    })
    # pick the alphabet according to `case`
    m.char <- switch(case,
        'lower' = letters[m],
        'upper' = LETTERS[m]
    )
    dim(m.char) <- dim(m)
    apply(m.char, 1, function(x) paste(x, collapse = ""))
}
get.letters <- function(length.out, case = 'upper'){
    max.char <- ceiling(log(length.out, 26))
    grp <- rep(1:max.char, 26^(1:max.char))[1:length.out]
    unlist(lapply(unique(grp), function(n) make.chars(length(grp[grp == n]), case = case, n.char = n)))
}
##
make.chars(5, "lower", 2)
#> [1] "aa" "ab" "ac" "ad" "ae"
make.chars(5, "lower")
#> [1] "a" "b" "c" "d" "e"
make.chars(5, "upper", 4)
#> [1] "AAAA" "AAAB" "AAAC" "AAAD" "AAAE"
tmp <- get.letters(800)
head(tmp)
#> [1] "A" "B" "C" "D" "E" "F"
tail(tmp)
#> [1] "ADO" "ADP" "ADQ" "ADR" "ADS" "ADT"
Created on 2019-03-22 by the reprex package (v0.2.1)

Pass a comma separated string as a list

Say I have a string such as "x = 1, y = 'cat', z = NULL". I want to obtain the list created by the code list(x = 1, y = 'cat', z = NULL). Here is my first attempt, which I am aware is horrible:
parse_text <- function(x) parse(text = x)[[1]]
strsplit2 <- function(x, ...) strsplit(x, ...)[[1]]
trim_whitespace <- function (x) gsub("^\\s+|\\s+$", "", x)
# take 1
x <- "nk = 1, ncross = 1, pmethod = 'backward'"
x <- strsplit2(x, ",")
xs <- lapply(x, strsplit2, "=")
keys <- lapply(xs, function(x) trim_whitespace(x[1]))
vals <- lapply(xs, function(x) parse_text(x[2]))
setNames(vals, keys)
This is what I imagined a more canonical approach to look like:
# take 2
x <- "nk = 1, ncross = 1, pmethod = 'backward'"
x <- strsplit2(x, ",")
xs <- lapply(x, parse_text)
do.call(list, xs)
But this loses the names of the list. Any help much appreciated! Cheers
You can first build a string containing the expression you want to execute, in this case "list( nk = 1, ncross = 1, pmethod = 'backward' )", by using paste to wrap your string in list( and ). Then parse the expression with the parse function and finally evaluate it with the eval function:
x <- "nk = 1, ncross = 1, pmethod = 'backward'" #your string
eval(parse(text=paste('list(',x,')'))) #create and returns the desired list
$nk
[1] 1
$ncross
[1] 1
$pmethod
[1] "backward"
As shown, this returns the correct named list.
I hope this will help you.
Here is another way, avoiding the dreaded parse & eval route (but IMHO entirely suitable for this use-case). It relies on the conformity of your tag=value pairings, delimited by ,.
x <- "nk = 1, ncross = 1, pmethod = 'backward'"
# Split into tag=value
vals <- strsplit( x , "," )[[1]]
# Split again and transform to matrix of tags and values
mat <- do.call( rbind , strsplit( vals , "=" ) )
# Return as a list
setNames( as.list( mat[,2] ) , mat[,1] )
#$`nk `
#[1] " 1"
#$` ncross `
#[1] " 1"
#$` pmethod `
#[1] " 'backward'"
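As the output shows, both the names and the values keep their surrounding whitespace, and every value stays a character string. A sketch of the same split with trimws applied is below; the variable names are illustrative, and it still assumes that no value itself contains "," or "=".

```r
x <- "nk = 1, ncross = 1, pmethod = 'backward'"

# Split into "tag = value" pieces, then into tag and value
pairs <- strsplit(strsplit(x, ",", fixed = TRUE)[[1]], "=", fixed = TRUE)
keys  <- trimws(vapply(pairs, `[`, character(1), 1))
vals  <- trimws(vapply(pairs, `[`, character(1), 2))

res <- setNames(as.list(vals), keys)
names(res)
# [1] "nk"     "ncross" "pmethod"
res$nk        # still a string, not the number 1
# [1] "1"
res$pmethod   # the quotes are part of the string
# [1] "'backward'"
```

Converting the values to their intended types would still need parsing of some kind, which is what the parse/eval answer does.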
Convert the commas to semicolons, source the string into environment e and convert e to a list (here s is the string from the question):
s <- "x = 1, y = 'cat', z = NULL"
source(textConnection(chartr(",", ";", s)), local = e <- new.env())
as.list(e)
giving:
$x
[1] 1
$y
[1] "cat"
$z
NULL

how to match longest matching string

I have a string and a character vector. I would like to find all strings in the character vector that match as many characters as possible from the beginning of the string.
For example:
s <- "abs"
vc <- c("ab","bb","abc","acbd","dert")
result <- c("ab","abc")
String s should be matched exactly up to its first K characters, and I want K to be as large as possible (at most the number of characters in s).
Here there is no match for "abs" (grep("abs",vc)), but for "ab" there are two matches (result <- grep("ab",vc)).
Another interpretation:
s <- "abs"
# Updated vc
vc <- c("ab","bb","abc","acbd","dert","abwabsabs")
st <- strsplit(s, "")[[1]]
mtc <- sapply(strsplit(substr(vc, 1, nchar(s)), ""),
function(i) {
m <- i == st[1:length(i)]
sum(m * cumsum(m))})
vc[mtc == max(mtc)]
#[1] "ab" "abc" "abwabsabs"
# Another vector vc
vc <- c("ab","bb","abc","acbd","dert","absq","abab")
....
vc[mtc == max(mtc)]
#[1] "absq"
Since we are considering only beginnings of strings, in the first case the longest match was "ab", even though there is "abwabsabs" which has "abs".
Edit: Here is a "single pattern" solution, possibly it could be more concise, but here we go...
vc <- c("ab","bb","abc","acbd","dert","abwabsabs")
(auxOne <- sapply((nchar(s)-1):1, function(i) substr(s, 1, i)))
#[1] "ab" "a"
(auxTwo <- sapply(nchar(s):2, function(i) substring(s, i)))
#[1] "s" "bs"
l <- attr(regexpr(
paste0("^((",s,")|",paste0("(",auxOne,"(?!",auxTwo,"))",collapse="|"),")"),
vc, perl = TRUE), "match.length")
vc[l == max(l)]
#[1] "ab" "abc" "abwabsabs"
Here's a function that uses grep and checks to see if a given string s matches the beginning of any string in vc, repeatedly removing a character from the end of s until there is a match:
myfun <- function(s, vc) {
    notDone <- TRUE
    maxChar <- max(nchar(vc)) # EDIT: these two lines truncate s to
    s <- substr(s, 1, maxChar) # the maximum number of chars in vc
    subN <- nchar(s)
    while(notDone && subN > 0){
        ss <- substr(s, 1, subN)
        ans <- grep(sprintf("^%s", ss), vc, value = TRUE)
        if(length(ans)) {
            notDone <- FALSE
        } else {
            subN <- subN - 1
        }
    }
    return(ans)
}
s <- "abs"
# Updated vc from @Julius's answer
vc <- c("ab","bb","abc","acbd","dert","absq","abab")
> myfun(s, vc)
[1] "absq"
# And there's no infinite loop if there's no match
> myfun("q", "a")
character(0)
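The same shrinking-prefix search can be written without building a regular expression, which also avoids trouble when s contains regex metacharacters. This is a sketch only: longest_prefix_match is a made-up name, and it relies on base R's startsWith (available since R 3.3.0).

```r
# Hypothetical alternative: try prefixes of s from longest to shortest and
# return the elements of vc that start with the first prefix that hits.
longest_prefix_match <- function(s, vc) {
  for (k in rev(seq_len(nchar(s)))) {
    hit <- startsWith(vc, substr(s, 1, k))
    if (any(hit)) return(vc[hit])
  }
  character(0)   # no element shares even the first character
}

longest_prefix_match("abs", c("ab", "bb", "abc", "acbd", "dert"))
# [1] "ab" "abc"
longest_prefix_match("abs", c("ab", "bb", "abc", "acbd", "dert", "absq", "abab"))
# [1] "absq"
longest_prefix_match("q", "a")
# character(0)
```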
Just a note, long after the fact, that the triebeard package now exists; it's very, very efficient and user-friendly for finding longest or partial matches.
