String manipulation in R - r

I have a list of DNA sequences, for example, "AGAACCTTATTGGGTCAAGT". If I were wanting to create a list with all possible strings that could sequentially happen in the sequence of a given length (for example 4) how would this be done in R?
In this case, the first string would be "AGAA". The second would be "GAAC", the third, "AACC", and so forth.

x = "AGAACCTTATTGGGTCAAGT"
sapply(1:(nchar(x)-3), function(i) substr(x, i, i+3))
#[1] "AGAA" "GAAC" "AACC" "ACCT" "CCTT" "CTTA" "TTAT" "TATT" "ATTG" "TTGG" "TGGG" "GGGT" "GGTC" "GTCA" "TCAA" "CAAG" "AAGT"

Related

How can I extract matched part of multiple strings?

I have multiple strings, and I want to extract the part that matches.
In practice my strings are directories, and I need to choose where to write a file, which is the location that matches in all strings. For example, if you have a vector with three strings:
data.dir <- c("C:\\data\\files\\subset1\\", "C:\\data\\files\\subset3\\", "C:\\data\\files\\subset3\\")
...the part that matches in all strings is "C:\data\files\". How can I extract this?
strsplit and intersect the overlapping parts recursively using Reduce. You can then piece it back together by paste-ing.
paste(Reduce(intersect, strsplit(data.dir, "\\\\")), collapse="\\")
#[1] "C:\\data\\files"
As #g-grothendieck notes, this will fail in certain circumstances like:
data.dir <- c("C:\\a\\b\\c\\", "C:\\a\\X\\c\\")
An ugly hack might be something like:
tail(
Reduce(
intersect,
lapply(strsplit(data.dir, "\\\\"),
function(x) sapply(1:length(x), function(y) paste(x[1:y], collapse="\\") )
)
),
1)
...which will deal with either case.
Alternatively, use dirname if you only ever have one extra directory level:
unique(dirname(data.dir))
#[1] "C:/data/files"
g contains the character positions to successive backslashes in data.dir[1]. From this create a logical vector ok whose ith element is TRUE if the first g[i] characters of all elements in data.dir are the same, i.e. all elements of substr(data.dir, 1, g[i]) are the same. If ok[1] is TRUE then there is a non-zero length common prefix whose length is given by the first g[k] characters of data.dir[1] where k (which equals rle(ok)$lengths[1]) is the leading number of TRUE values in ok; otherwise, there is no common prefix so return "".
g <- gregexpr("\\", data.dir[1], fixed = TRUE)[[1]]
ok <- sapply(g, function(i) all(substr(data.dir[1], 1, i) == substr(data.dir, 1, i)))
if (ok[1]) substr(data.dir[1], 1, g[rle(ok)$lengths[1]]) else ""
For data.dir defined in the question the last line gives:
[1] "C:\\data\\files\\"

Replace variable name in string with variable value [R]

I have a character in R, say "\\frac{A}{B}". And I have values for A and B, say 5 and 10. Is there a way that I can replace the A and B with the 5 and 10?
I tried the following.
words <- "\\frac{A}{B}"
numbers <- list(A=5, B=10)
output <- do.call("substitute", list(parse(text=words)[[1]], numbers))
But I get an error on the \. Is there a way that I can do this? I an trying to create equations with the actual variable values.
You could use the stringi function stri_replace_all_fixed()
stringi::stri_replace_all_fixed(
words, names(numbers), numbers, vectorize_all = FALSE
)
# [1] "\\frac{5}{10}"
Try this:
sprintf(gsub('\\{\\w\\}','\\{%d}',words),5,10)
I'm more familiar with gsub than substitute. The following works:
words <- "\\frac{A}{B}"
numbers <- list(A=5, B=10)
arglist = mapply(list, as.list(names(numbers)), numbers, SIMPLIFY=F)
for (i in 1:length(arglist)){
arglist[[i]]["x"] <- words
words <- do.call("gsub", arglist[[i]])
}
But of course this is unsafe because you're iterating over the substitutions. If, say, the first variable has value "B" and the second variable has name "B", you'll have problems. There's probably a cleaner way.

Finding the position of a character within a string

I am trying to find the equivalent of the ANYALPHA SAS function in R. This function searches a character string for an alphabetic character, and returns the first position at which at which the character is found.
Example: looking at the following string '123456789A', the ANYALPHA function would return 10 since first alphabetic character is at position 10 in the string. I would like to replicate this function in R but have not been able to figure it out. I need to search for any alphabetic character regardless of case (i.e. [:alpha:])
Thanks for any help you can offer!
Here's an anyalpha function. I added a few extra features. You can specify the maximum amount of matches you want in the n argument, it defaults to 1. You can also specify if you want the position or the value itself with value=TRUE:
anyalpha <- function(txt, n=1, value=FALSE) {
txt <- as.character(txt)
indx <- gregexpr("[[:alpha:]]", txt)[[1]]
ret <- indx[1:(min(n, length(indx)))]
if(value) {
mapply(function(x,y) substr(txt, x, y), ret, ret)
} else {ret}
}
#test
x <- '123A56789BC'
anyalpha(x)
#[1] 4
anyalpha(x, 2)
#[1] 4 10
anyalpha(x, 2, value=TRUE)
#[1] "C" "A"

String split and expand the (vector) at the delimiter: R

I have this vector (it's big in size) myvec. I need to split them matching at / and create another result vector resvector. How can I get this done in R?
myvec<-c("IID:WE:G12D/V/A","GH:SQ:p.R172W/G", "HH:WG:p.S122F/H")
resvector
IID:WE:G12D, IID:WE:G12V,IID:WE:G12A,GH:SQ:p.R172W,GH:SQ:p.R172G,HH:WG:p.S122F,HH:WG:p.S122H
You can try this, using strsplit as mentioned by #Tensibai:
sp_vec <- strsplit(myvec, "/") # split the element of the vector by "/" : you will get a list where each element is the decomposition (vector) of one element of your vector, according to "/"
ts_vec <- lapply(sp_vec, # for each element of the previous list, do
function(x){
base <- sub("\\w$", "", x[1]) # get the common beginning of the column names (so first item of vector without the last letter)
x[-1] <- paste0(base, x[-1]) # paste this common beginning to the rest of the vector items (so the other letters)
x}) # return the vector
resvector <- unlist(ts_vec) # finally, unlist to get the needed vector
resvector
# [1] "IID:WE:G12D" "IID:WE:G12V" "IID:WE:G12A" "GH:SQ:p.R172W" "GH:SQ:p.R172G" "HH:WG:p.S122F" "HH:WG:p.S122H"
Here is a concise answer with regex and some functional programming:
x = gsub('[A-Z]/.+','',myvec)
y = strsplit(gsub('[^/]+(?=[A-Z]/.+)','',myvec, perl=T),'/')
unlist(Map(paste0, x, y))
# "IID:WE:G12D" "IID:WE:G12V" "IID:WE:G12A" "GH:SQ:p.R172W" "GH:SQ:p.R172G" "HH:WG:p.S122F" "HH:WG:p.S122H"
myvec<-c("IID:WE:G12D/V/A","GH:SQ:p.R172W/G", "HH:WG:p.S122F/H")
custmSplit <- function(str){
splitbysep <- strsplit(str, '/')[[1]]
splitbysep[-1] <- paste0(substr(splitbysep[1], 1, nchar(splitbysep[1])), splitbysep[-1])
return(splitbysep)
}
do.call('c', lapply(myvec, custmSplit))
# [1] "IID:WE:G12D" "IID:WE:G12DV" "IID:WE:G12DA" "GH:SQ:p.R172W" "GH:SQ:p.R172WG" "HH:WG:p.S122F" "HH:WG:p.S122FH"

storing long strings (DNA sequence) in R

I have written a function that finds the indices of subsequences in a long DNA sequence. It works when my longer DNA sequence is < about 4000 characters. However, when I try to apply the same function to a much longer sequence, the console gives me a + instead of a >... which leads me to believe that it is the length of the string that is the problem.
for example: when the longer sequence is: "GATATATGCATATACTT", and the subsequence is: "ATAT", I get the indices "1, 3, 9" (0-based)
dnaMatch <- function(dna, sequence) {
ret <- list()
k <- str_length(sequence)
c <- str_length(dna) - k
for(i in 1:(c+1)) {
ret[i] = str_sub(dna, i, i+k-1)
}
ret <- unlist(ret)
TFret <- lapply (ret, identical, sequence)
TFret <- which(unlist(TFret), arr.ind = TRUE) -1
print(TFret)
}
Basically, my question is... is there any way around the character-limitation in the string class?
I can replicate nrussell's example, but this assigns correctly x<-paste0(rep("abcdef",1000),collapse="") -- a potential workaround is writing the character string to a .txt file and reading the .txt file into R directly:
test.txt is a 6,000 character long string.
`test<-read.table('test.txt',stringsAsFactors = FALSE)
length(class(test[1,1]))
[1] 1
class(test[1,1])
[1] "character"
nchar(test[1,1])
[1] 6000`
Rather than write your own function, why not use the function words.pos in package seqinr. It seems to work even for strings up to a million base pairs.
For example,
library(seqinr)
data(ec999)
myseq <- paste(ec999[[1]], collapse="")
myseq <- paste(rep(myseq,100), collapse="")
words.pos("atat", myseq)

Resources