Replacing nth instance of a character string using sub/gsub in R

Replacing nth instance of a character string using sub/gsub in R - r

I am attempting to re-name some character strings given to me in a large list. The issue is that I only need to replace some of the characters not all of them.
exdata <- c("i_am_having_trouble_with_this_string",
"i_am_wishing_files_were_cleaner_for_me",
"any_help_would_be_greatly_appreciated")
From this list, for example, I would like to replace the third through the fifth instance of "_" with "-". I am having trouble understanding the regex coding for this, as most examples split strings up instead of keeping them intact.

Here are some alternative approaches. All of them can be generalized to arbitrary bounds by replacing 3 and 5 with other numbers.
1) strsplit Split the strings at underscore and use paste to collapse it back using the appropriate separators. No packages are used.
i <- 3
j <- 5
sapply(strsplit(exdata, "_"), function(x) {
g <- seq_along(x)
g[g < i] <- i
g[g > j + 1] <- j+1
paste(tapply(x, g, paste, collapse = "_"), collapse = "-")
})
giving:
[1] "i_am_having-trouble-with-this_string"
[2] "i_am_wishing-files-were-cleaner_for_me"
[3] "any_help_would-be-greatly-appreciated"
2) for loop This translates the first j occurrences of old to new in x and then translates the first i-1 occurrences of new back to old. No packages are used.
translate <- function(old, new, x, i = 1, j) {
if (i <= 1) {
if (j > 0) for(k in seq_len(j)) x <- sub(old, new, x, fixed = TRUE)
x
} else Recall(new, old, Recall(old, new, x, 1, j), 1, i-1)
}
translate("_", "-", exdata, 3, 5)
giving:
[1] "i_am_having-trouble-with-this_string"
[2] "i_am_wishing-files-were-cleaner_for_me"
[3] "any_help_would-be-greatly-appreciated"
3) gsubfn This uses a package but in return is substantially shorter than the others. gsubfn is like gsub except that the replacement string in gsub can be a string, list, function or proto object. In the case of a proto object the fun method of the proto object is invoked each time there is a match to the regular expression. Below the matching string is passed to fun as x while the output of fun replaces the match in the data. The proto object is automatically populated with a number of variables set by gsubfn and accessible by fun including count which is 1 for the first match, 2 for the second and so on. For more information see the gsubfn vignette -- section 4 discusses the use of proto objects.
library(gsubfn)
p <- proto(i = 3, j = 5,
fun = function(this, x) if (count >= i && count <= j) "-" else x)
gsubfn("_", p, exdata)
giving:
[1] "i_am_having-trouble-with-this_string"
[2] "i_am_wishing-files-were-cleaner_for_me"
[3] "any_help_would-be-greatly-appreciated"

> gsub('(.*_.*_.*?)_(.*?)_(.*?)_(.*)','\\1-\\2-\\3-\\4', exdata)
[1] "i_am_having-trouble-with-this_string" "i_am_wishing-files-were-cleaner_for_me" "any_help_would-be-greatly-appreciated"

Related

Regex to convert time equations to R date-time (POSIXct)

I'm reading in data from another platform where a combination of the strings listed below is used for expressing timestamps:
\* = current time
t = current day (00:00)
mo = month
d = days
h = hours
m = minutes
For example, *-3d is current time minus 3 days, t-3h is three hours before today morning (midnight yesterday).
I'd like to be able to ingest these equations into R and get the corresponding POSIXct value. I'm trying using regex in the below function but lose the numeric multiplier for each string:
strTimeConverter <- function(z){
ret <- stringi::stri_replace_all_regex(
str = z,
pattern = c('^\\*',
'^t',
'([[:digit:]]{1,})mo',
'([[:digit:]]{1,})d',
'([[:digit:]]{1,})h',
'([[:digit:]]{1,})m'),
replacement = c('Sys.time()',
'Sys.Date()',
'*lubridate::months(1)',
'*lubridate::days(1)',
'*lubridate::hours(1)',
'*lubridate::minutes(1)'),
vectorize_all = F
)
return(ret)
# return(eval(expr = parse(text = ret)))
}
> strTimeConverter('*-5mo+3d+4h+2m')
[1] "Sys.time()-*lubridate::months(1)+*lubridate::days(1)+*lubridate::hours(1)+*lubridate::minutes(1)"
> strTimeConverter('t-5mo+3d+4h+2m')
[1] "Sys.Date()-*lubridate::months(1)+*lubridate::days(1)+*lubridate::hours(1)+*lubridate::minutes(1)"
Expected output:
# *-5mo+3d+4h+2m
"Sys.time()-5*lubridate::months(1)+3*lubridate::days(1)+4*lubridate::hours(1)+4*lubridate::minutes(1)"
# t-5mo+3d+4h+2m
"Sys.Date()-5*lubridate::months(1)+3*lubridate::days(1)+4*lubridate::hours(1)+4*lubridate::minutes(1)"
I assumed that wrapping the [[:digit]]{1,} in parentheses () would preserve them but clearly that's not working. I defined the pattern like this else the code replaces repeat occurrences e.g. * gets converted to Sys.time() but then the m in Sys.time() gets replaced with *lubridate::minutes(1).
I plan on converting the (expected) output to R date-time using eval(parse(text = ...)) - currently commented out in the function.
I'm open to using other packages or approach.
Update
After tinkering around for a bit, I found the below version works - I'm replacing strings in the order such that newly replaced characters are not replaced again:
strTimeConverter <- function(z){
ret <- stringi::stri_replace_all_regex(
str = z,
pattern = c('y', 'd', 'h', 'mo', 'm', '^t', '^\\*'),
replacement = c('*years(1)',
'*days(1)',
'*hours(1)',
'*days(30)',
'*minutes(1)',
'Sys.Date()',
'Sys.time()'),
vectorize_all = F
)
ret <- gsub(pattern = '\\*', replacement = '*lubridate::', x = ret)
rdate <- (eval(expr = parse(text = ret)))
attr(rdate, 'tzone') <- 'UTC'
return(rdate)
}
sample_string <- '*-5mo+3d+4h+2m'
strTimeConverter(sample_string)
This works but is not very elegant and will likely fail as I'm forced to incorporate other expressions (e.g. yd for day of the year e.g. 124).

You can use backreferences in the replacements like this:
library(stringr)
x <- c("*-5mo+3d+4h+2m", "t-5mo+3d+4h+2m")
repl <- c('^\\*' = 'Sys.time()', '^t' = 'Sys.Date()', '(\\d+)mo' = '\\1*lubridate::months(1)', '(\\d+)d' = '\\1*lubridate::days(1)', '(\\d+)h' = '\\1*lubridate::hours(1)', '(\\d+)m' = '\\1*lubridate::minutes(1)')
stringr::str_replace_all(x, repl)
## => [1] "Sys.time()-5*lubridate::months(1)+3*lubridate::days(1)+4*lubridate::hours(1)+2*lubridate::minutes(1)"
## [2] "Sys.Date()-5*lubridate::months(1)+3*lubridate::days(1)+4*lubridate::hours(1)+2*lubridate::minutes(1)"
See the R demo online.
See, for example, '(\\d+)mo' = '\\1*lubridate::months(1)'. Here, (\d+)mo matches and captures into Group 1 one or more digits, and mo is just matched. Then, when the match is found, \1 in \1*lubridate::months(1) inserts the contents of Group 1 into the resulting string.
Note that it might make the replacements safer if you cap the time period match with a word boundary (\b) on the right:
repl <- c('^\\*' = 'Sys.time()', '^t' = 'Sys.Date()', '(\\d+)mo\\b' = '\\1*lubridate::months(1)', '(\\d+)d\\b' = '\\1*lubridate::days(1)', '(\\d+)h\\b' = '\\1*lubridate::hours(1)', '(\\d+)m\\b' = '\\1*lubridate::minutes(1)')
It won't work if the time spans are glued one to another without any non-word delimiters, but you have + in your example strings, so it is safe here.
Actually, you can make it work with the function you used, too. Just make sure the backreferences have the $n syntax:
x <- c("*-5mo+3d+4h+2m", "t-5mo+3d+4h+2m")
pattern = c('^\\*', '^t', '(\\d+)mo', '(\\d+)d', '(\\d+)h', '(\\d+)m')
replacement = c('Sys.time()', 'Sys.Date()', '$1*lubridate::months(1)', '$1*lubridate::days(1)', '$1*lubridate::hours(1)', '$1*lubridate::minutes(1)')
stringi::stri_replace_all_regex(x, pattern, replacement, vectorize_all=FALSE)
Output:
[1] "Sys.time()-5*lubridate::months(1)+3*lubridate::days(1)+4*lubridate::hours(1)+2*lubridate::minutes(1)"
[2] "Sys.Date()-5*lubridate::months(1)+3*lubridate::days(1)+4*lubridate::hours(1)+2*lubridate::minutes(1)"

Another option to produce the time directly, would be the following:
strTimeConvert <- function(base=Sys.time(), delta="-5mo+3d+4h+2m"){
mo <- gsub(".*([+-]\\d+)mo.*", "\\1", x)
ds <- gsub(".*([+-]\\d+)d.*", "\\1", x)
hs <- gsub(".*([+-]\\d+)h.*", "\\1", x)
ms <- gsub(".*([+-]\\d+)m.*", "\\1", x)
out <- base + months(as.numeric(mo)) + days(as.numeric(ds)) +
hours(as.numeric(hs)) + minutes(as.numeric(ms))
out
}
strTimeConvert()
# [1] "2020-07-21 20:32:19 EDT"

How to remove/replace specific parentheses from a string containing multiple parentheses in R

Given the following string of parentheses, I am trying to remove one specific parentheses,
where the position of one of its bracket is marked with 1.
((((((((((((((((((********))))))))))))))))))
00000000000000000000000000000000010000000000
So for the above example, the solution I am looking for is
((((((((((-(((((((********)))))))-))))))))))
00000000000000000000000000000000010000000000
I am tried using strsplit function from stringr to split and get the indexes of the bracket marked with 1. But I am not sure how I can get the index of its corresponding closing bracket.
Could anyone give some input on this..
What I did..
a = "((((((((((-(((((((********)))))))-))))))))))"
b = "00000000000000000000000000000000010000000000"
which(unlist(strsplit(b,"")) == 1)
#[1] 34
a_mod = unlist(strsplit(a,""))[-34]
here, I removed one bracket of the parentheses which I wanted to remove but I do not know how I can remove its corresponding opening bracket which is in 11th position in this example

Locate the 1 in b giving pos2 and also calculate the length of b giving n. Then replace positions pos2 and pos1 = n-pos2+1 with minus characters. See ?gregexpr and ?nchar and ?substr for more info. No packages are used.
pos2 <- regexpr(1, b)
n <- nchar(a)
pos1 <- n - pos2 + 1
substr(a, pos1, pos1) <- substr(a, pos2, pos2) <- "-"
a
## [1] "((((((((((-(((((((********)))))))-))))))))))"

Since the parentheses are paired the index of the close parentheses is just the length of the string minus the index of the open parentheses (they're equidistant from the string ends)
library(stringr)
string <- "((((((((((((((((((********))))))))))))))))))"
b <- "00000000000000000000000000000000010000000000"
location <- str_locate(b, "1")[1]
len <- str_length(string)
substr(string, location, location) <- "-"
substr(string, len-location, len-location) <- "-"
string
"(((((((((-((((((((********)))))))-))))))))))"

You should show what you have tried. One very simple way that would work for your example would be to do something like:
gsub("\\*){8}", "\\*)))))))-", "((((((((((((((((((********))))))))))))))))))")
#> [1] "((((((((((((((((((********)))))))-))))))))))"
Edit:
In response to your question: It depends what you mean by other similar examples.
If you go purely by position in the string, you already have an excellent answer from G. Grothendieck. If you want a solution where you want to replace the nth closing bracket, for example, you could do:
s <- "((((((((((((((((((********))))))))))))))))))"
replace_par <- function(n, string) {
sub(paste0("(!?\\))(\\)){", n, "}"),
paste0(paste(rep(")", (n-1)), collapse=""), "-"),
string, perl = TRUE)}
replace_par(8, s)
#> [1] "((((((((((((((((((********)))))))-)))))))))"
Created on 2020-05-21 by the reprex package (v0.3.0)

You could write a function that does the replacement the way you want:
strreplace <- function(x,y,val = "-")
{
regmatches(x,regexpr(1,y)) <- val
sub(".([(](?:[^()]|(?1))*+[)])(?=-)", paste0(val, "\\1"), x, perl = TRUE)
}
a <- "((((((((((((((((((********))))))))))))))))))"
b < -"00000000000000000000000000000000010000000000"
strreplace(a, b)
[1] "((((((((((-(((((((********)))))))-))))))))))"
# Nested paranthesis
a = "((((****))))((((((((((((((((((********))))))))))))))))))"
b = "00000000000000000000000000000000000000000000010000000000"
strreplace(a,b)
[1] "((((****))))((((((((((-(((((((********)))))))-))))))))))"

Substitute the ^ (power) symbol with C's pow syntax in mathematical expression

I have a math expression, for example:
((2-x+3)^2+(x-5+7)^10)^0.5
I need to replace the ^ symbol to pow function of C language. I think that regex is what I need, but I don't know a regex like a pro. So I ended up with this regex:
(\([^()]*)*(\s*\([^()]*\)\s*)+([^()]*\))*
I don't know how to improve this. Can you advice me something to solve that problem?
The expected output:
pow(pow(2-x+3,2)+pow(x-5+7,10),0.5)

One of the most fantastic things about R is that you can easily manipulate R expressions with R. Here, we recursively traverse your expression and replace all instances of ^ with pow:
f <- function(x) {
if(is.call(x)) {
if(identical(x[[1L]], as.name("^"))) x[[1L]] <- as.name("pow")
if(length(x) > 1L) x[2L:length(x)] <- lapply(x[2L:length(x)], f)
}
x
}
f(quote(((2-x+3)^2+(x-5+7)^10)^0.5))
## pow((pow((2 - x + 3), 2) + pow((x - 5 + 7), 10)), 0.5)
This should be more robust than the regex since you are relying on the natural interpretation of the R language rather than on text patterns that may or may not be comprehensive.
Details: Calls in R are stored in list like structures with the function / operator at the head of the list, and the arguments in following elements. For example, consider:
exp <- quote(x ^ 2)
exp
## x^2
is.call(exp)
## [1] TRUE
We can examine the underlying structure of the call with as.list:
str(as.list(exp))
## List of 3
## $ : symbol ^
## $ : symbol x
## $ : num 2
As you can see, the first element is the function/operator, and subsequent elements are the arguments to the function.
So, in our recursive function, we:
Check if an object is a call
If yes: check if it is a call to the ^ function/operator by looking at the first element in the call with identical(x[[1L]], as.name("^")
If yes: replace the first element with as.name("pow")
Then, irrespective of whether this was a call to ^ or anything else:
if the call has additional elements, cycle through them and apply this function (i.e. recurse) to each element, replacing the result back into the original call (x[2L:length(x)] <- lapply(x[2L:length(x)], f))
If no: just return the object unchanged
Note that calls often contain the names of functions as the first element. You can create those names with as.name. Names are also referenced as "symbols" in R (hence the output of str).

Here is a solution that follows the parse tree recursively and replaces ^:
#parse the expression
#alternatively you could create it with
#expression(((2-x+3)^2+(x-5+7)^10)^0.5)
e <- parse(text = "((2-x+3)^2+(x-5+7)^10)^0.5")
#a recursive function
fun <- function(e) {
#check if you are at the end of the tree's branch
if (is.name(e) || is.atomic(e)) {
#replace ^
if (e == quote(`^`)) return(quote(pow))
return(e)
}
#follow the tree with recursion
for (i in seq_along(e)) e[[i]] <- fun(e[[i]])
return(e)
}
#deparse to get a character string
deparse(fun(e)[[1]])
#[1] "pow((pow((2 - x + 3), 2) + pow((x - 5 + 7), 10)), 0.5)"
This would be much easier if rapply worked with expressions/calls.
Edit:
OP has asked regarding performance. It is very unlikely that performance is an issue for this task, but the regex solution is not faster.
library(microbenchmark)
microbenchmark(regex = {
v <- "((2-x+3)^2+(x-5+7)^10)^0.5"
x <- grepl("(\\(((?:[^()]++|(?1))*)\\))\\^(\\d*\\.?\\d+)", v, perl=TRUE)
while(x) {
v <- sub("(\\(((?:[^()]++|(?1))*)\\))\\^(\\d*\\.?\\d+)", "pow(\\2, \\3)", v, perl=TRUE);
x <- grepl("(\\(((?:[^()]++|(?1))*)\\))\\^(\\d*\\.?\\d+)", v, perl=TRUE)
}
},
BrodieG = {
deparse(f(parse(text = "((2-x+3)^2+(x-5+7)^10)^0.5")[[1]]))
},
Roland = {
deparse(fun(parse(text = "((2-x+3)^2+(x-5+7)^10)^0.5"))[[1]])
})
#Unit: microseconds
# expr min lq mean median uq max neval cld
# regex 321.629 323.934 335.6261 335.329 337.634 384.623 100 c
# BrodieG 238.405 246.087 255.5927 252.105 257.227 355.943 100 b
# Roland 211.518 225.089 231.7061 228.802 235.204 385.904 100 a
I haven't included the solution provided by #digEmAll, because it seems obvious that a solution with that many data.frame operations will be relatively slow.
Edit2:
Here is a version that also handles sqrt.
fun <- function(e) {
#check if you are at the end of the tree's branch
if (is.name(e) || is.atomic(e)) {
#replace ^
if (e == quote(`^`)) return(quote(pow))
return(e)
}
if (e[[1]] == quote(sqrt)) {
#replace sqrt
e[[1]] <- quote(pow)
#add the second argument
e[[3]] <- quote(0.5)
}
#follow the tree with recursion
for (i in seq_along(e)) e[[i]] <- fun(e[[i]])
return(e)
}
e <- parse(text = "sqrt((2-x+3)^2+(x-5+7)^10)")
deparse(fun(e)[[1]])
#[1] "pow(pow((2 - x + 3), 2) + pow((x - 5 + 7), 10), 0.5)"

DISCLAIMER: The answer was written with the OP original regex in mind, when the question sounded as "process the ^ preceded with balanced (nested) parentheses". Please do not use this solution for generic math expression parsing, only for educational purposes and only when you really need to process some text in the balanced parentheses context.
Since a PCRE regex can match nested parentheses, it is possible to achieve in R with a mere regex in a while loop checking the presence of ^ in the modified string with x <- grepl("(\\(((?:[^()]++|(?1))*)\\))\\^(\\d*\\.?\\d+)", v, perl=TRUE). Once there is no ^, there is nothing else to substitute.
The regex pattern is
(\(((?:[^()]++|(?1))*)\))\^(\d*\.?\d+)
See the regex demo
Details:
(\(((?:[^()]++|(?1))*)\)) - Group 1: a (...) substring with balanced parentheses capturing what is inside the outer parentheses into Group 2 (with ((?:[^()]++|(?1))*) subpattern) (explanation can be found at How can I match nested brackets using regex?), in short, \ matches an outer (, then (?:[^()]++|(?1))* matches zero or more sequences of 1+ chars other than ( and ) or the whole Group 1 subpattern ((?1) is a subroutine call) and then a ))
\^ - a ^ caret
(\d*\.?\d+) - Group 3: an int/float number (.5, 1.5, 345)
The replacement pattern contains a literal pow() and the \\2 and \\3 are backreferences to the substrings captured with Group 2 and 3.
R code:
v <- "((2-x+3)^2+(x-5+7)^10)^0.5"
x <- grepl("(\\(((?:[^()]++|(?1))*)\\))\\^(\\d*\\.?\\d+)", v, perl=TRUE)
while(x) {
v <- sub("(\\(((?:[^()]++|(?1))*)\\))\\^(\\d*\\.?\\d+)", "pow(\\2, \\3)", v, perl=TRUE);
x <- grepl("(\\(((?:[^()]++|(?1))*)\\))\\^(\\d*\\.?\\d+)", v, perl=TRUE)
}
v
## => [1] "pow(pow(2-x+3, 2)+pow(x-5+7, 10), 0.5)"
And to support ^(x-3) pows, you may use
v <- sub("(\\(((?:[^()]++|(?1))*)\\))\\^(?|()(\\d*\\.?\\d+)|(\\((‌(?:[^()]++|(?3))*)\\‌)))", "pow(\\2, \\4)", v, perl=TRUE);
and to check if there are any more values to replace:
x <- grepl("(\\(((?:[^()]++|(?1))*)\\))\\^(?|()(\\d*\\.?\\d+)|(\\((‌(?:[^()]++|(?3))*)\\‌)))", v, perl=TRUE)

Here's an example exploiting R parser (using getParseData function) :
# helper function which turns getParseData result back to a text expression
recreateExpr <- function(DF,parent=0){
elements <- DF[DF$parent == parent,]
s <- ""
for(i in 1:nrow(elements)){
element <- elements[i,]
if(element$terminal)
s <- paste0(s,element$text)
else
s <- paste0(s,recreateExpr(DF,element$id))
}
return(s)
}
expr <- "((2-x+3)^2+(x-5+7)^10)^0.5"
DF <- getParseData(parse(text=expr))[,c('id','parent','token','terminal','text')]
# let's find the parents of all '^' expressions
parentsOfPow <- unique(DF[DF$token == "'^'",'parent'])
# replace all the the 'x^y' expressions with 'pow(x,y)'
for(p in parentsOfPow){
idxs <- which(DF$parent == p)
if(length(idxs) != 3){ stop('expression with '^' is not correct') }
idxtok1 <- idxs[1]
idxtok2 <- idxs[2]
idxtok3 <- idxs[3]
# replace '^' token with 'pow'
DF[idxtok2,c('token','text')] <- c('pow','pow')
# move 'pow' token as first token in the expression
tmp <- DF[idxtok1,]
DF[idxtok1,] <- DF[idxtok2,]
DF[idxtok2,] <- tmp
# insert new terminals '(' ')' and ','
DF <- rbind(
DF[1:(idxtok2-1),],
data.frame(id=max(DF$id)+1,parent=p,token=',',terminal=TRUE,text='(',
stringsAsFactors=FALSE),
DF[idxtok2,],
data.frame(id=max(DF$id)+2,parent=p,token=',',terminal=TRUE,text=',',
stringsAsFactors=FALSE),
DF[(idxtok2+1):idxtok3,],
data.frame(id=max(DF$id)+3,parent=p,token=')',terminal=TRUE,text=')',
stringsAsFactors=FALSE),
if(idxtok3<nrow(DF)) DF[(idxtok3+1):nrow(DF),] else NULL
)
}
# print the new expression
recreateExpr(DF)
> [1] "pow((pow((2-x+3),2)+pow((x-5+7),10)),0.5)"

String splitting in R Programming

Currently the script below is splitting a combined item code into a specific item codes.
rule2 <- c("MR")
df_1 <- test[grep(paste("^",rule2,sep="",collapse = "|"),test$Name.y),]
SpaceName_1 <- function(s){
num <- str_extract(s,"[0-9]+")
if(nchar(num) >3){
former <- substring(s, 1, 4)
latter <- strsplit(substring(s,5,nchar(s)),"")
latter <- unlist(latter)
return(paste(former,latter,sep = "",collapse = ","))
}
else{
return (s)
}
}
df_1$Name.y <- sapply(df_1$Name.y, SpaceName_1)
Example,
Combined item code: Room 324-326 is splitting into MR324 MR325 MR326.
However for this particular Combined item code: Room 309-311 is splitting into MR309 MR300 MR301.
How should I amend the script to give me MR309 MR310 MR311?

You can try something along these lines:
range <- "324-326"
x <- as.numeric(unlist(strsplit(range, split="-")))
paste0("MR", seq(x[1], x[2]))
[1] "MR324" "MR325" "MR326"
I assume that you can obtain the numerical room sequence by some means, and then use the snippet I gave you above.
If your combined item codes always have the form Room xxx-yyy, then you can extract the range using gsub:
range <- gsub("Room ", "", "Room 324-326")
If your item codes were in a vector called codes, then you could obtain a vector of ranges using:
ranges <- sapply(codes, function(x) gsub("Room ", "", x))

We can also evaluate the string after replacing the - with : and then paste the prefix "MR".
paste0("MR", eval(parse(text=sub("\\S+\\s+(\\d+)-(\\d+)", "\\1:\\2", range))))
#[1] "MR324" "MR325" "MR326"
Wrap it as a function for convenience
fChange <- function(prefixStr, RangeStr){
paste0(prefixStr, eval(parse(text=sub("\\S+\\s+(\\d+)-(\\d+)",
"\\1:\\2", RangeStr))))
}
fChange("MR", range)
fChange("MR", range1)
#[1] "MR309" "MR310" "MR311"
For multiple elements, just loop over and apply the function
sapply(c(range, range1), fChange, prefixStr = "MR")
data
range <- "Room 324-326"
range1 <- "Room 309-311"

String split and expand the (vector) at the delimiter: R

I have this vector (it's big in size) myvec. I need to split them matching at / and create another result vector resvector. How can I get this done in R?
myvec<-c("IID:WE:G12D/V/A","GH:SQ:p.R172W/G", "HH:WG:p.S122F/H")
resvector
IID:WE:G12D, IID:WE:G12V,IID:WE:G12A,GH:SQ:p.R172W,GH:SQ:p.R172G,HH:WG:p.S122F,HH:WG:p.S122H

You can try this, using strsplit as mentioned by #Tensibai:
sp_vec <- strsplit(myvec, "/") # split the element of the vector by "/" : you will get a list where each element is the decomposition (vector) of one element of your vector, according to "/"
ts_vec <- lapply(sp_vec, # for each element of the previous list, do
function(x){
base <- sub("\\w$", "", x[1]) # get the common beginning of the column names (so first item of vector without the last letter)
x[-1] <- paste0(base, x[-1]) # paste this common beginning to the rest of the vector items (so the other letters)
x}) # return the vector
resvector <- unlist(ts_vec) # finally, unlist to get the needed vector
resvector
# [1] "IID:WE:G12D" "IID:WE:G12V" "IID:WE:G12A" "GH:SQ:p.R172W" "GH:SQ:p.R172G" "HH:WG:p.S122F" "HH:WG:p.S122H"

Here is a concise answer with regex and some functional programming:
x = gsub('[A-Z]/.+','',myvec)
y = strsplit(gsub('[^/]+(?=[A-Z]/.+)','',myvec, perl=T),'/')
unlist(Map(paste0, x, y))
# "IID:WE:G12D" "IID:WE:G12V" "IID:WE:G12A" "GH:SQ:p.R172W" "GH:SQ:p.R172G" "HH:WG:p.S122F" "HH:WG:p.S122H"

myvec<-c("IID:WE:G12D/V/A","GH:SQ:p.R172W/G", "HH:WG:p.S122F/H")
custmSplit <- function(str){
splitbysep <- strsplit(str, '/')[[1]]
splitbysep[-1] <- paste0(substr(splitbysep[1], 1, nchar(splitbysep[1])), splitbysep[-1])
return(splitbysep)
}
do.call('c', lapply(myvec, custmSplit))
# [1] "IID:WE:G12D" "IID:WE:G12DV" "IID:WE:G12DA" "GH:SQ:p.R172W" "GH:SQ:p.R172WG" "HH:WG:p.S122F" "HH:WG:p.S122FH"

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Replacing nth instance of a character string using sub/gsub in R - r

> gsub('(._._.?)_(.?)_(.?)_(.)','\\1-\\2-\\3-\\4', exdata) [1] "i_am_having-trouble-with-this_string" "i_am_wishing-files-were-cleaner_for_me" "any_help_would-be-greatly-appreciated"

Related

Regex to convert time equations to R date-time (POSIXct)

How to remove/replace specific parentheses from a string containing multiple parentheses in R

Substitute the ^ (power) symbol with C's pow syntax in mathematical expression

String splitting in R Programming

String split and expand the (vector) at the delimiter: R

Categories

Resources

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Replacing nth instance of a character string using sub/gsub in R - r

> gsub('(.*_.*_.*?)_(.*?)_(.*?)_(.*)','\\1-\\2-\\3-\\4', exdata) [1] "i_am_having-trouble-with-this_string" "i_am_wishing-files-were-cleaner_for_me" "any_help_would-be-greatly-appreciated"

Related

Regex to convert time equations to R date-time (POSIXct)

How to remove/replace specific parentheses from a string containing multiple parentheses in R

Substitute the ^ (power) symbol with C's pow syntax in mathematical expression

String splitting in R Programming

String split and expand the (vector) at the delimiter: R

Categories

Resources

> gsub('(._._.?)_(.?)_(.?)_(.)','\\1-\\2-\\3-\\4', exdata) [1] "i_am_having-trouble-with-this_string" "i_am_wishing-files-were-cleaner_for_me" "any_help_would-be-greatly-appreciated"