Finding the position of a character within a string

I am trying to find the equivalent of the SAS function ANYALPHA in R. This function searches a character string for an alphabetic character and returns the first position at which one is found.
Example: for the string '123456789A', ANYALPHA would return 10, since the first alphabetic character is at position 10 in the string. I would like to replicate this function in R but have not been able to figure it out. I need to search for any alphabetic character regardless of case (i.e. [:alpha:]).
Thanks for any help you can offer!

Here's an anyalpha function with a few extra features. The n argument sets the maximum number of matches to return (it defaults to 1), and value=TRUE returns the matched characters themselves instead of their positions:
anyalpha <- function(txt, n=1, value=FALSE) {
  txt <- as.character(txt)
  # positions of every alphabetic character in the string
  indx <- gregexpr("[[:alpha:]]", txt)[[1]]
  # keep at most n of them
  ret <- indx[1:(min(n, length(indx)))]
  if(value) {
    mapply(function(x,y) substr(txt, x, y), ret, ret)
  } else {ret}
}
#test
x <- '123A56789BC'
anyalpha(x)
#[1] 4
anyalpha(x, 2)
#[1] 4 10
anyalpha(x, 2, value=TRUE)
#[1] "A" "B"
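
If only the first position is needed, as with ANYALPHA, a vectorized variant built on base R's regexpr might look like the sketch below. The name first_alpha is made up for illustration, and returning 0 when no alphabetic character is present is an assumption about the desired behaviour, not something the answer above does.

```r
# Hypothetical helper (not part of the answer above): first alphabetic
# position in each element of a vector, 0 where there is none.
first_alpha <- function(txt) {
  pos <- as.integer(regexpr("[[:alpha:]]", as.character(txt)))
  pos[pos < 0] <- 0L   # regexpr() reports "no match" as -1
  pos
}

first_alpha(c("123456789A", "123A56789BC", "12345"))
#[1] 10  4  0
```

Because regexpr works on whole vectors, this avoids looping over the elements of a column.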

Related

How to remove/replace specific parentheses from a string containing multiple parentheses in R

Given the following string of parentheses, I am trying to remove one specific pair of parentheses,
where the position of one of its brackets is marked with a 1.
((((((((((((((((((********))))))))))))))))))
00000000000000000000000000000000010000000000
So for the above example, the solution I am looking for is
((((((((((-(((((((********)))))))-))))))))))
00000000000000000000000000000000010000000000
I tried using strsplit to split the string and get the index of the bracket marked with 1, but I am not sure how to get the index of its matching bracket.
Could anyone give some input on this?
What I did..
a = "((((((((((-(((((((********)))))))-))))))))))"
b = "00000000000000000000000000000000010000000000"
which(unlist(strsplit(b,"")) == 1)
#[1] 34
a_mod = unlist(strsplit(a,""))[-34]
Here I removed one bracket of the pair I wanted to remove, but I do not know how to remove its corresponding opening bracket, which is in the 11th position in this example.

Locate the 1 in b, giving pos2, and calculate the length of a, giving n. Then replace positions pos2 and pos1 = n-pos2+1 with minus characters. See ?regexpr, ?nchar and ?substr for more info. No packages are used.
pos2 <- regexpr(1, b)
n <- nchar(a)
pos1 <- n - pos2 + 1
substr(a, pos1, pos1) <- substr(a, pos2, pos2) <- "-"
a
## [1] "((((((((((-(((((((********)))))))-))))))))))"

Since the parentheses are paired, the two indices of a pair always add up to the length of the string plus one (they're equidistant from the string ends), so the partner of the bracket at location is at len - location + 1.
library(stringr)
string <- "((((((((((((((((((********))))))))))))))))))"
b <- "00000000000000000000000000000000010000000000"
location <- str_locate(b, "1")[1]
len <- str_length(string)
substr(string, location, location) <- "-"
substr(string, len - location + 1, len - location + 1) <- "-"
string
"((((((((((-(((((((********)))))))-))))))))))"

You should show what you have tried. One very simple way that would work for your example would be to do something like:
gsub("\\*){8}", "\\*)))))))-", "((((((((((((((((((********))))))))))))))))))")
#> [1] "((((((((((((((((((********)))))))-))))))))))"
Edit:
In response to your question: It depends what you mean by other similar examples.
If you go purely by position in the string, you already have an excellent answer from G. Grothendieck. If you want a solution where you want to replace the nth closing bracket, for example, you could do:
s <- "((((((((((((((((((********))))))))))))))))))"
replace_par <- function(n, string) {
  # match the first n closing brackets and put back n - 1 of them plus a "-"
  sub(paste0("\\){", n, "}"),
      paste0(paste(rep(")", (n-1)), collapse=""), "-"),
      string)
}
replace_par(8, s)
#> [1] "((((((((((((((((((********)))))))-))))))))))"
Created on 2020-05-21 by the reprex package (v0.3.0)

You could write a function that does the replacement the way you want:
strreplace <- function(x,y,val = "-")
{
  # put val in x at the position where y holds the 1
  regmatches(x,regexpr(1,y)) <- val
  # replace the bracket that opens the balanced group ending just before val
  sub(".([(](?:[^()]|(?1))*+[)])(?=-)", paste0(val, "\\1"), x, perl = TRUE)
}
a <- "((((((((((((((((((********))))))))))))))))))"
b <- "00000000000000000000000000000000010000000000"
strreplace(a, b)
[1] "((((((((((-(((((((********)))))))-))))))))))"
# Nested parentheses
a = "((((****))))((((((((((((((((((********))))))))))))))))))"
b = "00000000000000000000000000000000000000000000010000000000"
strreplace(a,b)
[1] "((((****))))((((((((((-(((((((********)))))))-))))))))))"
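
When the string is not symmetric, the partner of a bracket can also be found without a recursive regex by tracking nesting depth. The following is only a sketch of that idea in base R: match_paren is a made-up name, and it assumes the input contains just "(" and ")" as brackets.

```r
# Hypothetical helper: position of the bracket that pairs with the one at 'pos'.
match_paren <- function(s, pos) {
  ch <- strsplit(s, "")[[1]]
  step <- c("(" = 1L, ")" = -1L)[ch]   # NA for non-bracket characters
  step[is.na(step)] <- 0L
  if (step[pos] == 0L) stop("no bracket at position ", pos)
  # Scan right from "(" or left from ")" until the depth returns to 0.
  idx <- if (step[pos] == 1L) pos:length(ch) else pos:1
  depth <- cumsum(step[idx])
  hit <- which(depth == 0L)
  if (length(hit) == 0L) stop("unbalanced brackets")
  idx[hit[1]]
}

a <- "((((((((((((((((((********))))))))))))))))))"
b <- "00000000000000000000000000000000010000000000"
pos2 <- as.integer(regexpr("1", b, fixed = TRUE))
pos1 <- match_paren(a, pos2)
substr(a, pos1, pos1) <- "-"
substr(a, pos2, pos2) <- "-"
a
## [1] "((((((((((-(((((((********)))))))-))))))))))"
```

The scan stops at the first point where the running depth is back to zero, which is where the pair closes.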

Replace variable name in string with variable value [R]

I have a character in R, say "\\frac{A}{B}". And I have values for A and B, say 5 and 10. Is there a way that I can replace the A and B with the 5 and 10?
I tried the following.
words <- "\\frac{A}{B}"
numbers <- list(A=5, B=10)
output <- do.call("substitute", list(parse(text=words)[[1]], numbers))
But I get an error on the \. Is there a way that I can do this? I am trying to create equations with the actual variable values.

You could use the stringi function stri_replace_all_fixed()
stringi::stri_replace_all_fixed(
words, names(numbers), numbers, vectorize_all = FALSE
)
# [1] "\\frac{5}{10}"

Try this:
sprintf(gsub('\\{\\w\\}','\\{%d}',words),5,10)

I'm more familiar with gsub than substitute. The following works:
words <- "\\frac{A}{B}"
numbers <- list(A=5, B=10)
arglist = mapply(list, as.list(names(numbers)), numbers, SIMPLIFY=F)
for (i in 1:length(arglist)){
arglist[[i]]["x"] <- words
words <- do.call("gsub", arglist[[i]])
}
But of course this is unsafe because you're iterating over the substitutions. If, say, the first variable has value "B" and the second variable has name "B", you'll have problems. There's probably a cleaner way.
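
One way around the iteration problem described above is to replace all placeholders in a single pass, so that a substituted value is never scanned again. The following is a minimal base R sketch, assuming every variable appears as {NAME} in the template; the function name substitute_braces is made up for illustration.

```r
# Hypothetical helper: fill {NAME} placeholders from a named list in one pass.
substitute_braces <- function(template, values) {
  # Locate the names sitting between "{" and "}" (braces are not matched).
  m <- gregexpr("(?<=\\{)\\w+(?=\\})", template, perl = TRUE)
  keys <- regmatches(template, m)[[1]]
  known <- keys %in% names(values)
  repl <- keys                      # unknown names are left untouched
  repl[known] <- vapply(values[keys[known]], as.character, character(1))
  regmatches(template, m) <- list(repl)
  template
}

words <- "\\frac{A}{B}"
substitute_braces(words, list(A = 5, B = 10))
# [1] "\\frac{5}{10}"
substitute_braces(words, list(A = "B", B = 10))
# [1] "\\frac{B}{10}"
```

In the second call the value "B" is inserted for A but is not replaced again by the value of B, because all match positions are computed before anything is substituted.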

String split and expand the (vector) at the delimiter: R

I have this vector myvec (it's large). I need to split each element at / and build another vector, resvector, as shown below. How can I get this done in R?
myvec<-c("IID:WE:G12D/V/A","GH:SQ:p.R172W/G", "HH:WG:p.S122F/H")
resvector
IID:WE:G12D, IID:WE:G12V,IID:WE:G12A,GH:SQ:p.R172W,GH:SQ:p.R172G,HH:WG:p.S122F,HH:WG:p.S122H

You can try this, using strsplit as mentioned by @Tensibai:
sp_vec <- strsplit(myvec, "/") # split the element of the vector by "/" : you will get a list where each element is the decomposition (vector) of one element of your vector, according to "/"
ts_vec <- lapply(sp_vec, # for each element of the previous list, do
function(x){
base <- sub("\\w$", "", x[1]) # get the common beginning of the column names (so first item of vector without the last letter)
x[-1] <- paste0(base, x[-1]) # paste this common beginning to the rest of the vector items (so the other letters)
x}) # return the vector
resvector <- unlist(ts_vec) # finally, unlist to get the needed vector
resvector
# [1] "IID:WE:G12D" "IID:WE:G12V" "IID:WE:G12A" "GH:SQ:p.R172W" "GH:SQ:p.R172G" "HH:WG:p.S122F" "HH:WG:p.S122H"

Here is a concise answer with regex and some functional programming:
x = gsub('[A-Z]/.+','',myvec)
y = strsplit(gsub('[^/]+(?=[A-Z]/.+)','',myvec, perl=T),'/')
unlist(Map(paste0, x, y))
# "IID:WE:G12D" "IID:WE:G12V" "IID:WE:G12A" "GH:SQ:p.R172W" "GH:SQ:p.R172G" "HH:WG:p.S122F" "HH:WG:p.S122H"

myvec<-c("IID:WE:G12D/V/A","GH:SQ:p.R172W/G", "HH:WG:p.S122F/H")
custmSplit <- function(str){
  splitbysep <- strsplit(str, '/')[[1]]
  # drop the last letter of the first piece and prepend the rest to the alternatives
  splitbysep[-1] <- paste0(substr(splitbysep[1], 1, nchar(splitbysep[1]) - 1), splitbysep[-1])
  return(splitbysep)
}
do.call('c', lapply(myvec, custmSplit))
# [1] "IID:WE:G12D" "IID:WE:G12V" "IID:WE:G12A" "GH:SQ:p.R172W" "GH:SQ:p.R172G" "HH:WG:p.S122F" "HH:WG:p.S122H"

storing long strings (DNA sequence) in R

I have written a function that finds the indices of subsequences in a long DNA sequence. It works when my longer DNA sequence is < about 4000 characters. However, when I try to apply the same function to a much longer sequence, the console gives me a + instead of a >... which leads me to believe that it is the length of the string that is the problem.
for example: when the longer sequence is: "GATATATGCATATACTT", and the subsequence is: "ATAT", I get the indices "1, 3, 9" (0-based)
library(stringr)  # str_length() and str_sub() come from stringr
dnaMatch <- function(dna, sequence) {
  ret <- list()
  k <- str_length(sequence)
  c <- str_length(dna) - k
  for(i in 1:(c+1)) {
    ret[i] = str_sub(dna, i, i+k-1)
  }
  ret <- unlist(ret)
  TFret <- lapply (ret, identical, sequence)
  TFret <- which(unlist(TFret), arr.ind = TRUE) -1
  print(TFret)
}
Basically, my question is... is there any way around the character-limitation in the string class?

I can replicate nrussell's example, but this assigns correctly x<-paste0(rep("abcdef",1000),collapse="") -- a potential workaround is writing the character string to a .txt file and reading the .txt file into R directly:
test.txt is a 6,000 character long string.
test<-read.table('test.txt',stringsAsFactors = FALSE)
length(class(test[1,1]))
[1] 1
class(test[1,1])
[1] "character"
nchar(test[1,1])
[1] 6000

Rather than write your own function, why not use the function words.pos in package seqinr. It seems to work even for strings up to a million base pairs.
For example,
library(seqinr)
data(ec999)
myseq <- paste(ec999[[1]], collapse="")
myseq <- paste(rep(myseq,100), collapse="")
words.pos("atat", myseq)

Convert binary vector to decimal

I have a vector of a binary string:
a<-c(0,0,0,1,0,1)
I would like to convert this vector into decimal.
I tried using the compositions package and the unbinary() function, however, this solution and also most others that I have found on this site require g-adic string as input argument.
My question is how can I convert a vector rather than a string to decimal?
to illustrate the problem:
library(compositions)
unbinary("000101")
[1] 5
This gives the correct solution, but:
unbinary(a)
unbinary("a")
unbinary(toString(a))
all produce NA.

You could try this function
bitsToInt<-function(x) {
packBits(rev(c(rep(FALSE, 32-length(x)%%32), as.logical(x))), "integer")
}
a <- c(0,0,0,1,0,1)
bitsToInt(a)
# [1] 5
Here we skip the character conversion; this only uses base functions.
It is likely that
unbinary(paste(a, collapse=""))
would have worked, should you still want to use that function.

There is a one-liner solution:
Reduce(function(x,y) x*2+y, a)
Explanation:
Expanding the application of Reduce results in something like:
Reduce(function(x,y) x*2+y, c(0,1,0,1,0)) = (((0*2 + 1)*2 + 0)*2 + 1)*2 + 0 = 10
Each time a new bit arrives, we double the value accumulated so far and then add the new bit to it.
Please also see the description of Reduce() function.
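
To watch the doubling step by step on the vector from the question, Reduce can be asked to keep its intermediate results. A small sketch using the accumulate argument of base R's Reduce; the variable name running is made up here:

```r
a <- c(0, 0, 0, 1, 0, 1)
# accumulate = TRUE keeps every intermediate value instead of only the last.
running <- Reduce(function(x, y) x * 2 + y, a, accumulate = TRUE)
running
#[1] 0 0 0 1 2 5
```

The last element is the decimal value, and each earlier element is the value of the bits read so far.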

If you'd like to stick to using compositions, just convert your vector to a string:
library(compositions)
a <- c(0,0,0,1,0,1)
achar <- paste(a,collapse="")
unbinary(achar)
[1] 5

This function will do the trick.
bintodec <- function(y) {
  # find the decimal number corresponding to binary sequence 'y'
  if (! (all(y %in% c(0,1)))) stop("not a binary sequence")
  res <- sum(y*2^((length(y):1) - 1))
  return(res)
}
