How to replace a string with another with interleaving characters in R

I have the following strings:
x <- "??????????DRHRTRHLAK??????????"
x2 <- "????????????????????TRCYHIDPHH"
x3 <- "FKDHKHIDVK????????????????????TRCYHIDPHH"
x4 <- "FKDHKHIDVK????????????????????"
What I want to do is replace all the ? characters with another string:
rep <- "ndqeegillkkkkfpssyvv"
Resulting in:
ndqeegillkDRHRTRHLAKkkkfpssyvv # x
ndqeegillkkkkfpssyvvTRCYHIDPHH # x2
FKDHKHIDVKndqeegillkkkkfpssyvvTRCYHIDPHH # x3
FKDHKHIDVKndqeegillkkkkfpssyvv # x4
Basically, keeping the order of rep in the replacement with the interleaving characters DRHRTRHLAK in x.
The length of rep is the same as the total number of ? characters in each string, 20 characters.
Note that I don't want to split rep manually again as an extra step.
I tried this but failed:
>gsub(pattern = "\\?+", replacement = rep, x = x)
[1] "ndqeegillkkkkfpssyvvDRHRTRHLAKndqeegillkkkkfpssyvv"

Example data:
x <- c(
"??????????DRHRTRHLAK??????????",
"????????????????????TRCYHIDPHH",
"FKDHKHIDVK????????????????????TRCYHIDPHH"
)
rep <- "ndqeegillkkkkfpssyvv"
Fix it up with regmatches<- replacements in a vectorised fashion:
gr <- gregexpr("\\?+", x)
csml <- lapply(gr, \(x) cumsum(attr(x, "match.length")) )
regmatches(x, gr) <- lapply(csml, \(x) substring(rep, c(1,x[-length(x)]+1), x))
#[1] "ndqeegillkDRHRTRHLAKkkkfpssyvv"
#[2] "ndqeegillkkkkfpssyvvTRCYHIDPHH"
#[3] "FKDHKHIDVKndqeegillkkkkfpssyvvTRCYHIDPHH"
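To make the cumsum step easier to reuse, here is a minimal sketch that wraps the same regmatches<- idea in a function. The name fill_runs and its pattern argument are made up for illustration, and it assumes that every element of x contains at least one run of ? and that rep is long enough to cover all of them.

```r
fill_runs <- function(x, rep, pattern = "\\?+") {
  gr <- gregexpr(pattern, x)
  # end position in rep of each run: cumulative sum of the run lengths
  ends <- lapply(gr, function(m) cumsum(attr(m, "match.length")))
  regmatches(x, gr) <- lapply(ends, function(e) {
    starts <- c(1, head(e, -1) + 1)  # each run starts right after the previous one ends
    substring(rep, starts, e)
  })
  x
}

filled <- fill_runs(
  c("??????????DRHRTRHLAK??????????", "FKDHKHIDVK????????????????????"),
  "ndqeegillkkkkfpssyvv"
)
filled
```

The first string has two runs of 10, so rep is cut at positions 1-10 and 11-20; the second has a single run of 20 and receives rep whole.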

String Split with substr():
x <- "??????????DRHRTRHLAK??????????"
rep <- "ndqeegillkkkkfpssyvv"
x<-gsub(pattern = "^\\?+", replacement = substr(rep, 1, 10), x = x)
x<-gsub(pattern = "\\?+$", replacement = substr(rep, 11, 20), x = x)
x
#[1] "ndqeegillkDRHRTRHLAKkkkfpssyvv"
Regex ^ matches start, and $ matches end.

You can count the number of ?'s and then cut rep based on that:
x <- "??????????DRHRTRHLAK??????????"
rep <- "ndqeegillkkkkfpssyvv"
pattern <- "(\\?+)(DRHRTRHLAK)(\\?+)"
n <- nchar(gsub(pattern, "\\1", x))
gsub(pattern, paste0(substr(rep, 1, n), "\\2", substr(rep, n+1, nchar(rep))), x)
#[1] "ndqeegillkDRHRTRHLAKkkkfpssyvv"
Edit: new examples:
A very verbose way is to use an if-else chain, checking where the ?'s are and substituting rep accordingly.
if(grepl("^\\?.+\\?$", x)){ #?'s on both ends
n <- nchar(sub("^(\\?+).*", "\\1", x)) # length of the leading run
sub("^\\?+([^?]+)\\?+$", paste0(substr(rep, 1, n), "\\1", substr(rep, n+1, nchar(rep))), x)
} else if(grepl("^\\?", x)){ #?'s only at the start
n <- nchar(sub("^(\\?+).*", "\\1", x))
sub("^\\?+", substr(rep, 1, n), x)
} else if(grepl("\\?$", x)){ #?'s only at the end
n <- nchar(sub("^[^?]*(\\?+)$", "\\1", x))
sub("\\?+$", substr(rep, 1, n), x)
} else if(grepl("^[A-Z]+\\?+[A-Z]+$", x)){ #?'s only in the middle
n <- nchar(sub("^[A-Z]+(\\?+)[A-Z]+$", "\\1", x))
sub("\\?+", substr(rep, 1, n), x)
}
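If the position checks get unwieldy, one alternative is to work character by character instead of by regex. This is only a sketch under the assumption that rep has at least as many characters as there are ?'s in each string; fill_chars is a hypothetical name, not something from the answers above.

```r
fill_chars <- function(x, rep) {
  rep_chars <- strsplit(rep, "")[[1]]
  vapply(strsplit(x, ""), function(chars) {
    is_q <- chars == "?"
    # the k-th "?" in the string receives the k-th character of rep
    chars[is_q] <- rep_chars[seq_len(sum(is_q))]
    paste(chars, collapse = "")
  }, character(1))
}

xs <- c(
  "??????????DRHRTRHLAK??????????",
  "????????????????????TRCYHIDPHH",
  "FKDHKHIDVK????????????????????TRCYHIDPHH",
  "FKDHKHIDVK????????????????????"
)
res <- fill_chars(xs, "ndqeegillkkkkfpssyvv")
res
```

Because it never looks at where the runs are, the same function handles ?'s at the start, the end, the middle, or both ends.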

Related

Use Recursion in R to Split a String Into Chunks

I am trying to wrap my head around the idea of recursion. However, when I apply my recursive R function, it does not split the string into the desired number of chunks; it only returns two chunks. My goal is to split a long string into multiple smaller strings of size n. I am sure there are other ways to do this, but I am trying to find a recursive solution. Any help is appreciated, thanks in advance.
# Sample dataset
x <- paste0(rep(letters, 10000), collapse = "")
split_group <- function(x, n = 10) {
if (nchar(x) < n) {
return(x)
} else {
beginning <- substring(x, 1, n)
remaining <- substring(x, (n + 1), (n + 1) + (n - 1))
c(beginning, split_group(remaining, n))
}
}
split_group(x = x, n = 10)
# Returns: "abcdefghij" "klmnopqrst" ""
Use <= instead of < and fix remaining.
split_group <- function(x, n = 10) {
if (nchar(x) <= n) x
else {
beginning <- substring(x, 1, n)
remaining <- substring(x, n + 1)
c(beginning, split_group(remaining, n))
}
}
x <- substring(paste(letters, collapse = ""), 1, 24)
split_group(x, 2)
## [1] "ab" "cd" "ef" "gh" "ij" "kl" "mn" "op" "qr" "st" "uv" "wx"
split_group(x, 5)
## [1] "abcde" "fghij" "klmno" "pqrst" "uvwx"
split_group(x, 6)
## [1] "abcdef" "ghijkl" "mnopqr" "stuvwx"
split_group(x, 10)
## [1] "abcdefghij" "klmnopqrst" "uvwx"
split_group(x, 23)
## [1] "abcdefghijklmnopqrstuvw" "x"
split_group(x, 24)
## [1] "abcdefghijklmnopqrstuvwx"
split_group(x, 25)
## [1] "abcdefghijklmnopqrstuvwx"
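One caveat with the recursive version: the call depth equals the number of chunks, so the 260,000-character sample string from the question with n = 10 needs about 26,000 nested calls, which may run into R's limit on nested expressions. A loop that does the same peeling step by step avoids that. This is a sketch, and split_group_iter is a name chosen here for illustration.

```r
split_group_iter <- function(x, n = 10) {
  out <- character(0)
  # peel off n characters at a time until at most n remain
  while (nchar(x) > n) {
    out <- c(out, substring(x, 1, n))
    x <- substring(x, n + 1)
  }
  c(out, x)
}

chunks <- split_group_iter("abcdefghijklmnopqrstuvwx", 10)
chunks
```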
2) Some approaches without recursion. The first is the shortest, but the second is the simplest and only uses base R. The third only uses base R as well, and the fourth uses zoo.
library(gsubfn)
strapply(x, "(.{1,10})", simplify = c)
## [1] "abcdefghij" "klmnopqrst" "uvwx"
ix <- seq(1, nchar(x), 10)
substring(x, ix, ix + 10 - 1)
## [1] "abcdefghij" "klmnopqrst" "uvwx"
sapply(seq(1, nchar(x), 10), function(i) substring(x, i, i + 10 - 1))
## [1] "abcdefghij" "klmnopqrst" "uvwx"
library(zoo)
s <- strsplit(x, "")[[1]]
rollapply(s, 10, by = 10, paste0, collapse = "", partial = TRUE, align = "left")
## [1] "abcdefghij" "klmnopqrst" "uvwx"
A base R option would be
x1 <- strsplit(x, "(?<=.{10})(?=.)", perl = TRUE)[[1]]
-output
> head(x1, 10)
[1] "abcdefghij" "klmnopqrst" "uvwxyzabcd" "efghijklmn" "opqrstuvwx" "yzabcdefgh" "ijklmnopqr" "stuvwxyzab" "cdefghijkl" "mnopqrstuv"

Add 1 to the first even digit in a string

I have a vector of numeric values in R
x <- c(4320, 5400, 6786)
For each of these values I want to get a new value in which 1 is added to the first non-zero even digit (starting from the right). The resulting vector should be:
[1] 4330 5500 6787
I haven't made any progress so far. For numbers with only four digits, as in the example, I guess this could be accomplished with stringr and ifelse statements, iterating through each digit. But I am looking for a more general solution.
EDIT
Additionally, I also want to convert all the digits to the right of the focal digit to 0. So I built on one of the solutions by @onyambu to get a slightly modified version.
x <- c(432095, 540100, 678507)
fun <- function(x){
y <- max(which(as.numeric(x) %%2 == 0 & x!='0'))
x[y]<- as.numeric(x[y]) + 1
if (y < length(x)) x[(y+1):length(x)] <- 0 # line added to convert digits to the right to 0 (skipped when y is the last digit)
as.numeric(paste0(x, collapse=''))
}
y = sapply(strsplit(as.character(x), ''), fun)
print(y)
[1] 433000 550000 679000
Using Recursion and only numerical operations:
fun <- function(x, ten_times = 0, rem=0 ){
if(floor(x/10) == x/10) # is divisible by 10? remove the zero
Recall(x/10, ten_times + 1, rem)
else if (x%%2 == 1) # is odd remove the odd and store it go to next digit
Recall(x%/%10, ten_times+1, rem + (x%%10)*10^ten_times)
else # add one to the even and also add back the remainder to the number
(x + 1) * 10^ten_times + rem
}
sapply(x, fun)
[1] 4330 5500 6787
Note that we could use a vectorized ifelse with the same logic to carry out the operation in a vectorized manner, though you might need to increase the recursion depth. It is probably better to stick with the non-vectorized version above and use sapply.
fun <- function(x, ten_times = 0, rem=0 ){
ifelse(floor(x/10) == x/10, Recall(x/10, ten_times + 1, rem),
ifelse(x%%2 == 1, Recall(x%/%10, ten_times+1, rem + (x%%10)*10^ten_times),
(x+1)*10^ten_times + rem))
}
fun(x)
[1] 4330 5500 6787
Note that this will throw an error if the number contains no non-zero even digit, e.g. fun(1111) will throw an error.
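If that error is a concern, a guard for the case where the digits run out can be added. The sketch below keeps the same recursion and returns NA when there is no non-zero even digit; returning NA (rather than, say, the unchanged input) is an assumption made here, and fun_safe is a hypothetical name.

```r
fun_safe <- function(x, ten_times = 0, rem = 0) {
  if (x == 0) {
    NA_real_                                   # ran out of digits: nothing to bump
  } else if (x %% 10 == 0) {
    fun_safe(x / 10, ten_times + 1, rem)       # drop a trailing zero
  } else if (x %% 2 == 1) {
    # odd last digit: strip it and remember it so it can be added back
    fun_safe(x %/% 10, ten_times + 1, rem + (x %% 10) * 10^ten_times)
  } else {
    (x + 1) * 10^ten_times + rem               # even last digit: add one, restore the rest
  }
}

safe_res <- sapply(c(4320, 5400, 6786, 1111), fun_safe)
safe_res
```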
EDIT:
If you need all the values after the even number to be zero, change this into:
fun <- function(x, ten_times = 0){
if(floor(x/10) == x/10) Recall(x/10, ten_times + 1)
else if (x%%2 == 1)Recall(x%/%10, ten_times+1)
else (x + 1) * 10^ten_times
}
sapply(x, fun)
[1] 433000 550000 679000
Also seems like a ceiling problem:
y <- sapply(strsplit(as.character(x),''),
\(x)max(which(!as.numeric(x) %% 2 & x!='0'))) - nchar(x)
ceiling(x * 10^y)/10^y
[1] 433000 550000 679000
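The ceiling idea works because multiplying by 10^y moves the decimal point to just after the focal digit, so rounding up adds one to it. When every digit to the right of the focal digit is already zero there is nothing to round up, so a variant that uses floor and then adds one explicitly may be safer. This is a sketch with an extra, made-up test value (432000) appended to the data.

```r
x <- c(432095, 540100, 678507, 432000)
digits <- strsplit(as.character(x), "")
# position (from the left) of the rightmost non-zero even digit
pos <- sapply(digits, function(d) max(which(as.numeric(d) %% 2 == 0 & d != "0")))
k <- lengths(digits) - pos          # how many digits sit to the right of it
# drop those k digits, add one to what is left, then shift back
bumped <- (floor(x / 10^k) + 1) * 10^k
bumped
```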
fun <- function(x){
y <- max(which(as.numeric(x) %%2 == 0 &x!='0'))
x[y]<- as.numeric(x[y]) + 1
as.numeric(paste0(x, collapse=''))
}
sapply(strsplit(as.character(x), ''), fun)
[1] 4330 5500 6787
Try this function
fn <- function(x) {
y <- x ; add <- 1
while(x != 0){
if(x %% 10 != 0 & x %% 2 == 0 ) {
y <- y + add
break
}
x <- floor(x/10)
add <- add * 10
}
y
}
fn <- Vectorize(fn)
fn(x)
#> [1] 4330 5500 6787
Another possible solution:
library(tidyverse)
str_split(x, "", simplify = T) %>%
type.convert(as.is = T) %>%
apply(1, \(x) {which.max(cumsum(x %% 2 == 0 & x != 0)) %>%
{x[.] <<- x[.] + 1}; x %>% str_c(collapse = "") %>% parse_integer})
#> [1] 4330 5500 6787
1) gsubfn Using gsubfn we can get a two-line solution. gsubfn is like gsub except that the second argument can be a function, possibly expressed in formula notation, instead of a replacement string. The match to each capture group (the portions in parentheses) in the regular expression is passed as a separate argument to the function, and the match is replaced with the output of the function.
In this case there are 3 capture groups which represent the prefix (p), the digit (d) and the suffix (s). The formula representation of the function is the body and the arguments are the free variables in the body in the order encountered.
library(gsubfn)
x1 <- c(4320, 5400, 6786)
f1 <- ~ paste0(p, as.numeric(d) + 1, s)
gsubfn("(.*)([2468])(.*)", f1, as.character(x1)) |> as.numeric()
## [1] 4330 5500 6787
To do that and also replace the characters after the transformed digit with zeros:
x2 <- c(432095, 540100, 678507)
f2 <- ~ paste0(p, as.numeric(d) + 1, gsub(".", 0, s))
gsubfn("(.*)([2468])(.*)", f2, as.character(x2)) |> as.numeric()
## [1] 433000 550000 679000
2) Base R This base R solution extracts the prefix, digit and suffix using sub and then transforms the digit and pastes them back together.
pat <- "(.*)([2468])(.*)"
as.numeric(paste0(
sub(pat, "\\1", x1),
as.numeric(sub(pat, "\\2", x1)) + 1,
sub(pat, "\\3", x1)
))
## [1] 4330 5500 6787
or performing the same operation and zeroing out the suffix:
pat <- "(.*)([2468])(.*)"
as.numeric(paste0(
sub(pat, "\\1", x2),
as.numeric(sub(pat, "\\2", x2)) + 1,
gsub(".", 0, sub(pat, "\\3", x2))
))
## [1] 433000 550000 679000

R: Filter vectors by 'two-way' partial match

With two vectors
x <- c("abc", "12")
y <- c("bc", "123", "nomatch")
is there a way to do a filter of both by 'two-way' partial matching (remove elements in one vector if they contain or are contained in any element in the other vector) so that the result are these two vectors:
x1 <- c()
y1 <- c("nomatch")
To explain - every element of x is either a substring or a superstring of one of the elements of y, hence x1 is empty. Update - it is not sufficient for a substring to match the initial chars - a substring might be found anywhere in the string it matches. Example above has been updated to reflect this.
I originally thought ?pmatch might be handy, but your edit clarifies you don't just want to match the start of items. Here's a function that should work:
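remover <- function(x,y) {
pmx <- lapply(x, grep, x=y) # for each element of x, the indices of y that contain it
pmy <- lapply(y, grep, x=x) # for each element of y, the indices of x that contain it
hitx <- c(which(lengths(pmx) > 0), unlist(pmy)) # x is a substring or a superstring
hity <- c(which(lengths(pmy) > 0), unlist(pmx)) # y is a substring or a superstring
list(
x[!(seq_along(x) %in% hitx)],
y[!(seq_along(y) %in% hity)]
)
}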
remover(x,y)
#[[1]]
#character(0)
#
#[[2]]
#[1] "nomatch"
It correctly does nothing when no match is found (thanks @Frank for picking up the earlier error):
remover("yo","nomatch")
#[[1]]
#[1] "yo"
#
#[[2]]
#[1] "nomatch"
We can do the following:
# Return data.frame of matches of a in b
m <- function(a, b) {
data.frame(sapply(a, function(w) grepl(w, b), simplify = F));
}
# Match x and y and remove
x0 <- x[!apply(m(x, y), 2, any)]
y0 <- y[!apply(m(x, y), 1, any)]
# Match y and x and remove
x1 <- x0[!apply(m(y0, x0), 1, any)]
y1 <- y0[!apply(m(y0, x0), 2, any)]
x1;
#character(0)
y1;
#[1] "nomatch"
I build a matrix of all possible matches in both directions, combine the two with | (a match in either direction counts as a match), and then use it to subset x and y:
x <- c("abc", "12")
y <- c("bc", "123", "nomatch")
bool_mat <- sapply(x,function(z) grepl(z,y)) | t(sapply(y,function(z) grepl(z,x)))
x1 <- x[!apply(bool_mat,2,any)] # character(0)
y1 <- y[!apply(bool_mat,1,any)] # [1] "nomatch"
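All of the answers above pass the strings to grep/grepl as regular expressions, so an element such as "a.c" would also match "abc". Assuming literal substring matching is what is wanted, a sketch using fixed = TRUE could look like this (two_way_filter and contains are hypothetical names):

```r
two_way_filter <- function(x, y) {
  # contains(a, b): does string a contain string b as a literal substring?
  contains <- function(a, b) grepl(b, a, fixed = TRUE)
  hit <- matrix(FALSE, nrow = length(y), ncol = length(x))
  for (i in seq_along(y)) {
    for (j in seq_along(x)) {
      hit[i, j] <- contains(y[i], x[j]) || contains(x[j], y[i])
    }
  }
  # keep only the elements that take part in no match at all
  list(x = x[colSums(hit) == 0], y = y[rowSums(hit) == 0])
}

res <- two_way_filter(c("abc", "12"), c("bc", "123", "nomatch"))
res
```

The explicit double loop is slower than the matrix one-liner but keeps the result a proper matrix even when x or y has a single element.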

How to convert numbers in Base 2 to Base 4 in R

For instance, how to convert the number '10010000110000011000011111011000' in Base2 to number in Base4 ?
Here is one approach that breaks up the string into units of length 2 and then looks up the corresponding base 4 for the pair:
convert <- c("00"="0","01"="1","10"="2","11"="3")
from2to4 <- function(s){
if(nchar(s) %% 2 == 1) s <- paste0('0',s)
n <- nchar(s)
bigrams <- sapply(seq(1,n,2),function(i) substr(s,i,i+1))
digits <- convert[bigrams]
paste0(digits, collapse = "")
}
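A one-liner approach, with the binary string stored in a (this relies on all four digit pairs occurring in the input, since the factor levels are derived from the data):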
> paste(as.numeric(factor(substring(a,seq(1,nchar(a),2),seq(2,nchar(a),2))))-1,collapse="")
[1] "2100300120133120"
There are multiple ways to split the string into 2 digits, see Chopping a string into a vector of fixed width character elements
Here are a couple inverses:
bin_to_base4 <- function(x){
x <- strsplit(x, '')
vapply(x, function(bits){
bits <- as.integer(bits)
paste(2 * bits[c(TRUE, FALSE)] + bits[c(FALSE, TRUE)], collapse = '')
}, character(1))
}
base4_to_bin <- function(x){
x <- strsplit(x, '')
vapply(x, function(quats){
quats <- as.integer(quats)
paste0(quats %/% 2, quats %% 2, collapse = '')
}, character(1))
}
x <- '10010000110000011000011111011000'
bin_to_base4(x)
#> [1] "2100300120133120"
base4_to_bin(bin_to_base4(x))
#> [1] "10010000110000011000011111011000"
...and they're vectorized!
base4_to_bin(bin_to_base4(c(x, x)))
#> [1] "10010000110000011000011111011000" "10010000110000011000011111011000"
For actual use, it would be a good idea to put in some sanity checks to ensure the input is actually in the appropriate base.
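A minimal sketch of such a sanity check, assuming the numbers are passed as character strings and the base is at most 10. The name check_base is made up for illustration; call it before bin_to_base4 or base4_to_bin.

```r
check_base <- function(x, base) {
  stopifnot(base >= 2, base <= 10)
  # only the digits 0 .. base-1 are allowed
  pattern <- paste0("^[0-", base - 1, "]+$")
  bad <- !grepl(pattern, x)
  if (any(bad)) {
    stop("not a base-", base, " string: ", paste(x[bad], collapse = ", "))
  }
  invisible(x)
}

check_base("10010000110000011000011111011000", 2)  # passes silently
ok4 <- check_base("2100300120133120", 4)            # returns its input invisibly
```

For bin_to_base4 it may also be worth checking that the input has an even number of digits, since the bits are consumed in pairs.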
Convert Base2 to Base10 first, then from Base10 to Base4
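A sketch of that two-step route in base R. The example number does not fit in R's 32-bit integer type, so the decimal value is accumulated in a double, which is exact only below 2^53; the function names are hypothetical, and leading zeros of the input are not preserved.

```r
bin_to_dec <- function(s) {
  bits <- as.integer(strsplit(s, "")[[1]])
  # weight each bit by its power of two, most significant bit first
  sum(bits * 2^(rev(seq_along(bits)) - 1))
}

dec_to_base4 <- function(n) {
  if (n == 0) return("0")
  digits <- numeric(0)
  while (n > 0) {
    digits <- c(n %% 4, digits)  # the remainder is the next digit, right to left
    n <- n %/% 4
  }
  paste(digits, collapse = "")
}

b4 <- dec_to_base4(bin_to_dec("10010000110000011000011111011000"))
b4
```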

Pass a comma separated string as a list

Say I have a string such as "x = 1, y = 'cat', z = NULL". I want to obtain the list created by the code list(x = 1, y = 'cat', z = NULL). Here is my first attempt, which I am aware is horrible:
parse_text <- function(x) parse(text = x)[[1]]
strsplit2 <- function(x, ...) strsplit(x, ...)[[1]]
trim_whitespace <- function (x) gsub("^\\s+|\\s+$", "", x)
# take 1
x <- "nk = 1, ncross = 1, pmethod = 'backward'"
x <- strsplit2(x, ",")
xs <- lapply(x, strsplit2, "=")
keys <- lapply(xs, function(x) trim_whitespace(x[1]))
vals <- lapply(xs, function(x) parse_text(x[2]))
setNames(vals, keys)
This is what I imagined a more canonical approach to look like:
# take 2
x <- "nk = 1, ncross = 1, pmethod = 'backward'"
x <- strsplit2(x, ",")
xs <- lapply(x, parse_text)
do.call(list, xs)
But this loses the names of the list. Any help much appreciated! Cheers
You can first build a string containing the expression you want to evaluate, in this case "list( nk = 1, ncross = 1, pmethod = 'backward' )", by using paste to wrap your string in list( and ). Then parse the expression with the parse function and finally evaluate it with the eval function:
x <- "nk = 1, ncross = 1, pmethod = 'backward'" #your string
eval(parse(text=paste('list(',x,')'))) #create and returns the desired list
$nk
[1] 1
$ncross
[1] 1
$pmethod
[1] "backward"
As shown, this returns the correct named list.
I hope this helps.
Here is another way, avoiding the dreaded parse & eval route (but IMHO entirely suitable for this use-case). It relies on the conformity of your tag=value pairings, delimited by ,.
x <- "nk = 1, ncross = 1, pmethod = 'backward'"
# Split into tag=value
vals <- strsplit( x , "," )[[1]]
# Split again and transform to matrix of tags and values
mat <- do.call( rbind , strsplit( vals , "=" ) )
# Return as a list
setNames( as.list( mat[,2] ) , mat[,1] )
#$`nk `
#[1] " 1"
#$` ncross `
#[1] " 1"
#$` pmethod `
#[1] " 'backward'"
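As the output shows, the names and values keep their surrounding whitespace and every value is still a string. A possible clean-up, sketched here under the assumption that no key or value itself contains a comma or an equals sign, is to trim both sides, strip the quotes and let type.convert guess the types:

```r
x <- "nk = 1, ncross = 1, pmethod = 'backward'"
pairs <- strsplit(strsplit(x, ",")[[1]], "=")
keys <- trimws(vapply(pairs, `[`, character(1), 1))
vals <- trimws(vapply(pairs, `[`, character(1), 2))
vals <- gsub("^['\"]|['\"]$", "", vals)   # drop surrounding quotes
# convert each value on its own so numbers and strings can coexist in the list
res <- setNames(lapply(vals, type.convert, as.is = TRUE), keys)
str(res)
```

Note that type.convert turns "1" into the integer 1L rather than the double 1, and a value such as NULL would stay the string "NULL" with this approach.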
Convert the commas to semicolons, source the string into environment e and convert e to a list:
s <- "x = 1, y = 'cat', z = NULL"
source(textConnection(chartr(",", ";", s)), local = e <- new.env())
as.list(e)
giving:
$x
[1] 1
$y
[1] "cat"
$z
NULL
