String split and expand the vector at the delimiter in R

I have this vector myvec (it's large). I need to split each element at / and expand the alternatives into another vector, resvector. How can I get this done in R?
myvec<-c("IID:WE:G12D/V/A","GH:SQ:p.R172W/G", "HH:WG:p.S122F/H")
resvector
IID:WE:G12D, IID:WE:G12V,IID:WE:G12A,GH:SQ:p.R172W,GH:SQ:p.R172G,HH:WG:p.S122F,HH:WG:p.S122H

You can try this, using strsplit as mentioned by @Tensibai:
sp_vec <- strsplit(myvec, "/") # split the element of the vector by "/" : you will get a list where each element is the decomposition (vector) of one element of your vector, according to "/"
ts_vec <- lapply(sp_vec, # for each element of the previous list, do
function(x){
base <- sub("\\w$", "", x[1]) # get the common beginning of the column names (so first item of vector without the last letter)
x[-1] <- paste0(base, x[-1]) # paste this common beginning to the rest of the vector items (so the other letters)
x}) # return the vector
resvector <- unlist(ts_vec) # finally, unlist to get the needed vector
resvector
# [1] "IID:WE:G12D" "IID:WE:G12V" "IID:WE:G12A" "GH:SQ:p.R172W" "GH:SQ:p.R172G" "HH:WG:p.S122F" "HH:WG:p.S122H"

Here is a concise answer with regex and some functional programming:
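x = gsub('[A-Z]/.+','',myvec) # common prefix: everything before the first alternative letter
y = strsplit(gsub('[^/]+(?=[A-Z]/.+)','',myvec, perl=TRUE),'/') # the alternative letters, one vector per element
unlist(Map(paste0, x, y), use.names = FALSE) # use.names = FALSE drops the names that Map() derives from x
# "IID:WE:G12D" "IID:WE:G12V" "IID:WE:G12A" "GH:SQ:p.R172W" "GH:SQ:p.R172G" "HH:WG:p.S122F" "HH:WG:p.S122H"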

myvec<-c("IID:WE:G12D/V/A","GH:SQ:p.R172W/G", "HH:WG:p.S122F/H")
custmSplit <- function(str){
splitbysep <- strsplit(str, '/')[[1]]
# drop the last character of the first piece to get the common prefix, then append each alternative
splitbysep[-1] <- paste0(substr(splitbysep[1], 1, nchar(splitbysep[1]) - 1), splitbysep[-1])
return(splitbysep)
}
do.call('c', lapply(myvec, custmSplit))
# [1] "IID:WE:G12D" "IID:WE:G12V" "IID:WE:G12A" "GH:SQ:p.R172W" "GH:SQ:p.R172G" "HH:WG:p.S122F" "HH:WG:p.S122H"
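The answers above share one idea: split at the delimiter, take the first piece minus its last character as the common stem, and paste that stem onto the remaining pieces. A minimal sketch of that idea as one reusable function follows. The name expand_alternatives is hypothetical, and the sketch assumes every alternative is a single trailing character, as in the question's data. Elements without a delimiter are returned unchanged.

```r
# Hypothetical helper (not from the original answers):
# expands "stemX/Y/Z" into "stemX", "stemY", "stemZ".
expand_alternatives <- function(v, sep = "/") {
  parts <- strsplit(v, sep, fixed = TRUE)
  out <- lapply(parts, function(p) {
    if (length(p) < 2) return(p)               # nothing to expand
    stem <- substr(p[1], 1, nchar(p[1]) - 1)   # first piece minus its last character
    c(p[1], paste0(stem, p[-1]))
  })
  unlist(out, use.names = FALSE)
}

myvec <- c("IID:WE:G12D/V/A", "GH:SQ:p.R172W/G", "HH:WG:p.S122F/H")
resvector <- expand_alternatives(myvec)
resvector
```

The explicit length check matters because paste0() treats a zero-length argument as an empty string, so pasting the stem onto "no alternatives" would otherwise add a spurious element.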

Related

How to remove/replace specific parentheses from a string containing multiple parentheses in R

Given the following string of parentheses, I am trying to remove one specific parentheses,
where the position of one of its bracket is marked with 1.
((((((((((((((((((********))))))))))))))))))
00000000000000000000000000000000010000000000
So for the above example, the solution I am looking for is
((((((((((-(((((((********)))))))-))))))))))
00000000000000000000000000000000010000000000
I tried using strsplit to split the strings and get the index of the bracket marked with 1, but I am not sure how to get the index of its matching bracket.
Could anyone give some input on this?
What I did:
a = "((((((((((((((((((********))))))))))))))))))"
b = "00000000000000000000000000000000010000000000"
which(unlist(strsplit(b,"")) == 1)
#[1] 34
a_mod = unlist(strsplit(a,""))[-34]
Here I removed one bracket of the pair I want to remove, but I do not know how to remove its corresponding opening bracket, which is at the 11th position in this example.
Locate the 1 in b giving pos2 and also calculate the length of b giving n. Then replace positions pos2 and pos1 = n-pos2+1 with minus characters. See ?gregexpr and ?nchar and ?substr for more info. No packages are used.
pos2 <- regexpr(1, b)
n <- nchar(a)
pos1 <- n - pos2 + 1
substr(a, pos1, pos1) <- substr(a, pos2, pos2) <- "-"
a
## [1] "((((((((((-(((((((********)))))))-))))))))))"
Since the parentheses are paired and nested symmetrically, the matching bracket sits at the mirrored position: the length of the string minus the index of the marked bracket, plus one (they're equidistant from the string ends).
library(stringr)
string <- "((((((((((((((((((********))))))))))))))))))"
b <- "00000000000000000000000000000000010000000000"
location <- str_locate(b, "1")[1]
len <- str_length(string)
substr(string, location, location) <- "-"
substr(string, len - location + 1, len - location + 1) <- "-"
string
## [1] "((((((((((-(((((((********)))))))-))))))))))"
You should show what you have tried. One very simple way that would work for your example would be to do something like:
gsub("\\*){8}", "\\*)))))))-", "((((((((((((((((((********))))))))))))))))))")
#> [1] "((((((((((((((((((********)))))))-))))))))))"
Edit:
In response to your question: It depends what you mean by other similar examples.
If you go purely by position in the string, you already have an excellent answer from G. Grothendieck. If you want a solution where you want to replace the nth closing bracket, for example, you could do:
s <- "((((((((((((((((((********))))))))))))))))))"
replace_par <- function(n, string) {
sub(paste0("(?<!\\))(\\)){", n, "}"), # negative lookbehind: start at the first closing bracket
paste0(paste(rep(")", (n-1)), collapse=""), "-"),
string, perl = TRUE)}
replace_par(8, s)
#> [1] "((((((((((((((((((********)))))))-))))))))))"
Created on 2020-05-21 by the reprex package (v0.3.0)
You could write a function that does the replacement the way you want:
strreplace <- function(x,y,val = "-")
{
regmatches(x,regexpr(1,y)) <- val
sub(".([(](?:[^()]|(?1))*+[)])(?=-)", paste0(val, "\\1"), x, perl = TRUE)
}
a <- "((((((((((((((((((********))))))))))))))))))"
b <- "00000000000000000000000000000000010000000000"
strreplace(a, b)
## [1] "((((((((((-(((((((********)))))))-))))))))))"
# Nested parentheses
a = "((((****))))((((((((((((((((((********))))))))))))))))))"
b = "00000000000000000000000000000000000000000000010000000000"
strreplace(a,b)
## [1] "((((****))))((((((((((-(((((((********)))))))-))))))))))"
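The position-based answers rely on the string being symmetric, while the recursive regex handles arbitrary nesting. Another way to find the partner of a marked bracket, sketched here in base R, is to track nesting depth with cumsum(). The function names match_paren and mark_pair are hypothetical, and the sketch assumes the brackets in the input are balanced.

```r
# Hypothetical helper: index of the bracket that pairs with the one at `pos`.
match_paren <- function(s, pos) {
  ch <- strsplit(s, "")[[1]]
  depth <- cumsum((ch == "(") - (ch == ")"))  # nesting depth after each character
  idx <- seq_along(ch)
  if (ch[pos] == ")") {
    # partner is the nearest earlier "(" that opened the level this ")" closes
    cand <- which(ch == "(" & depth == depth[pos] + 1 & idx < pos)
    mate <- if (length(cand)) max(cand) else NA_integer_
  } else if (ch[pos] == "(") {
    # partner is the nearest later ")" that returns to the enclosing level
    cand <- which(ch == ")" & depth == depth[pos] - 1 & idx > pos)
    mate <- if (length(cand)) min(cand) else NA_integer_
  } else {
    stop("character at 'pos' is not a bracket")
  }
  mate
}

# Hypothetical helper: replace the flagged bracket and its partner with `val`.
mark_pair <- function(s, flags, val = "-") {
  pos <- regexpr("1", flags, fixed = TRUE)[1]
  mate <- match_paren(s, pos)
  substr(s, pos, pos) <- val
  substr(s, mate, mate) <- val
  s
}

# The question's inputs, built with strrep() to avoid miscounting characters
a <- paste0(strrep("(", 18), strrep("*", 8), strrep(")", 18))
b <- paste0(strrep("0", 33), "1", strrep("0", 10))
match_paren(a, 34)
mark_pair(a, b)
```

Because the lookup depends only on depth, it also works when the marked pair is preceded by other balanced groups, as in the nested example.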

How to combine the character list vectors into one character vector while keeping the contents of each list unchanged [duplicate]

I have a list of named values:
myList <- list('A' = 1, 'B' = 2, 'C' = 3)
I want a vector with the value 1:3
I can't figure out how to extract the values without defining a function. Is there a simpler way that I'm unaware of?
library(plyr)
myvector <- laply(myList, function(x) x)
Is there something akin to myList$Values to strip the names and return it as a vector?
Use unlist with use.names = FALSE argument.
unlist(myList, use.names=FALSE)
purrr::flatten_*() is also a good option. The flatten_* functions add thin sanity checks and ensure type safety.
myList <- list('A'=1, 'B'=2, 'C'=3)
purrr::flatten_dbl(myList)
## [1] 1 2 3
This can be done by using unlist before as.vector.
The result is the same as using the parameter use.names=FALSE.
as.vector(unlist(myList))
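To make the difference between these options concrete, here is a small base R comparison. It is only an illustration using the question's myList; the vapply() line is an added assumption about what "type safety" could look like without purrr, since it errors if any element is not a single number.

```r
myList <- list(A = 1, B = 2, C = 3)

with_names    <- unlist(myList)                     # keeps the names A, B, C
without_names <- unlist(myList, use.names = FALSE)  # plain numeric vector
via_unname    <- unname(unlist(myList))             # same result, names dropped afterwards

# Type-checked variant: each element must be exactly one numeric value
checked <- vapply(myList, function(x) x, numeric(1), USE.NAMES = FALSE)

names(with_names)
without_names
```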

R: For loop works on list, not individual element

I'm trying to learn by writing a function. It should convert the UOM (unit of measure) into a fraction of the standard UOM. In this case, 1/10 or 0.1
I'm trying to loop through a list generated from strsplit, but I only get the whole list, not each element in the list. I can't figure out what I'm doing wrong. Is strsplit the wrong function? I don't think the problem is in strsplit, but I can't figure out what I'm doing wrong in the For loop:
qty<-0
convf<-0
uom <- "EA"
std <- "CA"
pack <-"1EA/10CA"
if(uom!=std){
s<-strsplit(pack,split = '/')
for (i in s){
print(i)
if(grep(uom,i)){
qty<- regmatches(i,regexpr('[0-9]+',i))
}
if(grep(std,i)){
convf<-regmatches(i, regexpr('[0-9]+',i))
}
} #end for
qty<-as.numeric(qty)
convf<-as.numeric(convf)
}
return(qty/convf)
Maybe it is a problem with the indexing of the list. Have you tried using [[1]] after the strsplit call?
Example:
string <- "Hello/World"
mylist <- strsplit(string, "/")
## [[1]]
## [1] "Hello" "World"
But if we explicitly ask for the first element of the list with [[1]], we get the character vector of pieces itself.
Example:
string <- "Hello/World"
mylist <- strsplit(string, "/")[[1]]
## [1] "Hello" "World"
Hope this can help you in your problem.
There are a few issues here. The main problem you are having is that s is a list of length 1. Within that list, the first (only) element is a vector of length 2. Consequently, you would need to loop over that vector, i.e. for (i in s[[1]]).
However, we can go one step further. Try the following code:
library(stringr)
lapply(strsplit(pack,split = '/'), # works within the list, can handle larger vectors for `pack`
function(x, uom, std) {
reg_expr <- paste(uom,std, sep = "|") # call this on its own, it's just searching for the text saved in uom or std
qty <- as.numeric(str_remove(x, reg_expr)) # removes that text and converts the string to a number
names(qty) <- str_extract(x, reg_expr) # extracts the text and uses it to name elements in qty
qty[uom] / qty[std] # your desired result.
},
uom = uom, # since these are part of the function call, we need to specify what they are. This is where you should change them.
std = std)
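For comparison, here is a minimal sketch of the original loop with only the necessary fixes, wrapped in a function so that return-style logic is valid. The name uom_fraction is hypothetical, and returning 1 when uom equals std is an assumption, not something the question specifies. Two changes matter: the loop iterates over s[[1]], and grepl() replaces grep(), because grep() returns integer(0) when nothing matches and if() then fails with "argument is of length zero".

```r
# Hypothetical wrapper around the question's loop
uom_fraction <- function(pack, uom, std) {
  if (uom == std) return(1)  # assumption: same unit means a ratio of 1
  qty <- NA_real_
  convf <- NA_real_
  for (piece in strsplit(pack, "/", fixed = TRUE)[[1]]) {
    # leading digits of this piece, e.g. "10" from "10CA"
    n <- as.numeric(regmatches(piece, regexpr("[0-9]+", piece)))
    if (grepl(uom, piece, fixed = TRUE)) qty <- n
    if (grepl(std, piece, fixed = TRUE)) convf <- n
  }
  qty / convf
}

uom_fraction("1EA/10CA", uom = "EA", std = "CA")
```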
I don't know if this is what you're trying to practice, but I'd avoid loops while extracting the digits from a string like "1EA/10CA". If it helps, the column lst is actually a list inside of a dataset.
library(magrittr)
ds <- data.frame(pack = c("1EA/10CA", "1EA/4CA", "2EA/2CA"))
pattern <- "^(\\d+)EA/(\\d+)CA$"
ds %>%
dplyr::mutate(
qty = as.numeric(sub(pattern, "\\1", pack)),
convf = as.numeric(sub(pattern, "\\2", pack)),
ratio = qty / convf,
lst = purrr::map2(qty, convf, ~list(qty=.x[[1]], convf=.y[[1]]))
)
Result:
#       pack qty convf ratio   lst
# 1 1EA/10CA   1    10  0.10 1, 10
# 2  1EA/4CA   1     4  0.25  1, 4
# 3  2EA/2CA   2     2  1.00  2, 2
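The same capture-group extraction can be sketched without dplyr or purrr. This base R version reuses the pattern from the answer above and leaves out the list column; the variable name ds_base is hypothetical.

```r
pack <- c("1EA/10CA", "1EA/4CA", "2EA/2CA")
pattern <- "^(\\d+)EA/(\\d+)CA$"

qty   <- as.numeric(sub(pattern, "\\1", pack))  # first capture group: digits before EA
convf <- as.numeric(sub(pattern, "\\2", pack))  # second capture group: digits before CA

ds_base <- data.frame(pack = pack, qty = qty, convf = convf, ratio = qty / convf)
ds_base
```

Note that sub() returns the input unchanged when the pattern does not match, so as.numeric() would yield NA (with a warning) for any pack string that does not follow the EA/CA layout.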

Finding the position of a character within a string

I am trying to find the equivalent of the ANYALPHA SAS function in R. This function searches a character string for an alphabetic character and returns the first position at which such a character is found.
Example: looking at the following string '123456789A', the ANYALPHA function would return 10 since first alphabetic character is at position 10 in the string. I would like to replicate this function in R but have not been able to figure it out. I need to search for any alphabetic character regardless of case (i.e. [:alpha:])
Thanks for any help you can offer!
Here's an anyalpha function. I added a few extra features. You can specify the maximum number of matches you want in the n argument, which defaults to 1. You can also ask for the matched characters instead of their positions with value=TRUE:
anyalpha <- function(txt, n=1, value=FALSE) {
txt <- as.character(txt)
indx <- gregexpr("[[:alpha:]]", txt)[[1]]
ret <- indx[1:(min(n, length(indx)))]
if(value) {
mapply(function(x,y) substr(txt, x, y), ret, ret)
} else {ret}
}
#test
x <- '123A56789BC'
anyalpha(x)
#[1] 4
anyalpha(x, 2)
#[1] 4 10
anyalpha(x, 2, value=TRUE)
#[1] "A" "B"
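The function above works on one string at a time. If only the first position is needed, a vectorized sketch can use regexpr() directly, which returns one position per input string. The name anyalpha_vec is hypothetical. Mapping "no match" to 0 follows the SAS convention for ANYALPHA as I understand it, whereas regexpr() itself reports -1.

```r
# Hypothetical vectorized variant: first alphabetic position per string, 0 if none
anyalpha_vec <- function(txt) {
  pos <- as.integer(regexpr("[[:alpha:]]", as.character(txt)))  # as.integer drops match attributes
  pos[which(pos == -1L)] <- 0L  # regexpr() uses -1 for "no match"
  pos
}

anyalpha_vec(c("123456789A", "123A56789BC", "12345"))
```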

Replace non-ascii chars with a defined string list without a loop in R

I want to replace non-ASCII characters (for now, only Spanish ones) with their ASCII equivalents. If I have "á", I want to replace it with "a", and so on.
I built this function (works fine), but I don't want to use a loop (including internal loops like sapply).
latin2ascii<-function(x) {
if(!is.character(x)) stop ("input must be a character object")
require(stringr)
mapL<-c("á","é","í","ó","ú","Á","É","Í","Ó","Ú","ñ","Ñ","ü","Ü")
mapA<-c("a","e","i","o","u","A","E","I","O","U","n","N","u","U")
for(y in 1:length(mapL)) {
x<-str_replace_all(x,mapL[y],mapA[y])
}
x
}
Is there an elegant way to solve this? Any help, suggestion or modification is appreciated.
gsubfn() in the package of the same name is really nice for this sort of thing:
library(gsubfn)
# Create a named list, in which:
# - the names are the strings to be looked up
# - the values are the replacement strings
mapL <- c("á","é","í","ó","ú","Á","É","Í","Ó","Ú","ñ","Ñ","ü","Ü")
mapA <- c("a","e","i","o","u","A","E","I","O","U","n","N","u","U")
# ll <- setNames(as.list(mapA), mapL) # An alternative to the 2 lines below
ll <- as.list(mapA)
names(ll) <- mapL
# Try it out
string <- "ÍÓáÚ"
gsubfn("[áéíóúÁÉÍÓÚñÑüÜ]", ll, string)
# [1] "IOaU"
Edit:
G. Grothendieck points out that base R also has a function for this:
A <- paste(mapA, collapse="")
L <- paste(mapL, collapse="")
chartr(L, A, "ÍÓáÚ")
# [1] "IOaU"
I like the version by Josh, but I thought I might add another 'vectorized' solution. It returns a vector of unaccented strings. It also only relies on the base functions.
x=c('íÁuÚ','uíÚÁ')
mapL<-c("á","é","í","ó","ú","Á","É","Í","Ó","Ú","ñ","Ñ","ü","Ü")
mapA<-c("a","e","i","o","u","A","E","I","O","U","n","N","u","U")
split=strsplit(x,split='')
m=lapply(split,match,mapL)
mapply(function(split,m) paste(ifelse(is.na(m),split,mapA[m]),collapse='') , split, m)
# "iAuU" "uiUA"
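Putting the chartr() suggestion back into the shape of the original function gives a loop-free latin2ascii. This is a sketch that keeps the question's input check and character map; it assumes the source file and session use an encoding (such as UTF-8) in which the accented characters are represented consistently.

```r
latin2ascii <- function(x) {
  if (!is.character(x)) stop("input must be a character object")
  # chartr() maps the i-th character of `old` to the i-th character of `new`
  # and is vectorized over x, so no explicit loop is needed
  chartr(old = "áéíóúÁÉÍÓÚñÑüÜ", new = "aeiouAEIOUnNuU", x)
}

latin2ascii(c("íÁuÚ", "Ñandú"))
```

chartr() only handles one-to-one character replacements, so mappings to longer strings (for example "ß" to "ss") would still need something like gsubfn().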

Resources