storing long strings (DNA sequence) in R - r

I have written a function that finds the indices of subsequences in a long DNA sequence. It works when my longer DNA sequence is < about 4000 characters. However, when I try to apply the same function to a much longer sequence, the console gives me a + instead of a >... which leads me to believe that it is the length of the string that is the problem.
For example, when the longer sequence is "GATATATGCATATACTT" and the subsequence is "ATAT", I get the indices 1, 3, 9 (0-based).
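library(stringr) # provides str_length() and str_sub()
dnaMatch <- function(dna, sequence) {
  ret <- list()
  k <- str_length(sequence)
  c <- str_length(dna) - k
  # collect every substring of length k
  for(i in 1:(c+1)) {
    ret[i] = str_sub(dna, i, i+k-1)
  }
  ret <- unlist(ret)
  # compare each substring with the target and report 0-based indices
  TFret <- lapply (ret, identical, sequence)
  TFret <- which(unlist(TFret), arr.ind = TRUE) -1
  print(TFret)
}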
Basically, my question is... is there any way around the character-limitation in the string class?

I can replicate nrussell's example, but the following assigns correctly: x <- paste0(rep("abcdef", 1000), collapse = ""). A potential workaround is to write the character string to a .txt file and read that file into R directly.
Here test.txt contains a single 6,000-character string:
test <- read.table('test.txt', stringsAsFactors = FALSE)
length(class(test[1,1]))
[1] 1
class(test[1,1])
[1] "character"
nchar(test[1,1])
[1] 6000
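The file-based workaround can be sketched with base R only. As far as I know, the `+` prompt just means the console is still waiting for the rest of an expression, and the R documentation notes that a single console input line is limited to roughly 4095 bytes, so a long pasted literal gets cut off while a string built or read programmatically does not. The temporary file and the repeated test sequence below are assumptions made for illustration.

```r
# Build a long sequence programmatically (no long literal typed at the console)
long_seq <- paste0(rep("GATATATGCATATACTT", 1000), collapse = "")

# Round-trip through a text file, as suggested above
tmp <- tempfile(fileext = ".txt")  # hypothetical file, stands in for test.txt
writeLines(long_seq, tmp)
dna <- readLines(tmp, n = 1)

nchar(dna)
# [1] 17000
```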

Rather than write your own function, why not use the function words.pos in package seqinr. It seems to work even for strings up to a million base pairs.
For example,
library(seqinr)
data(ec999)
myseq <- paste(ec999[[1]], collapse="")
myseq <- paste(rep(myseq,100), collapse="")
words.pos("atat", myseq)
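If you would rather stay in base R, one possible sketch uses a zero-width lookahead so that overlapping matches are reported too. The function name `find_overlapping` and its `zero_based` argument are made up for this example, and it assumes the pattern contains only plain letters (no regex metacharacters).

```r
# Find all (possibly overlapping) start positions of `pattern` in `dna`.
# The lookahead (?=...) matches without consuming characters, so a match
# starting inside a previous match is still found.
find_overlapping <- function(dna, pattern, zero_based = TRUE) {
  hits <- gregexpr(paste0("(?=", pattern, ")"), dna, perl = TRUE)[[1]]
  pos <- as.integer(hits)                       # drop match attributes
  if (length(pos) == 1L && pos == -1L) return(integer(0))  # no match
  if (zero_based) pos - 1L else pos
}

find_overlapping("GATATATGCATATACTT", "ATAT")
# [1] 1 3 9
```

The same call works on a string built with paste0(rep(...), collapse = ""), so the length of the sequence is not a concern here.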

Related

Replace variable name in string with variable value [R]

I have a character in R, say "\\frac{A}{B}". And I have values for A and B, say 5 and 10. Is there a way that I can replace the A and B with the 5 and 10?
I tried the following.
words <- "\\frac{A}{B}"
numbers <- list(A=5, B=10)
output <- do.call("substitute", list(parse(text=words)[[1]], numbers))
But I get an error on the \. Is there a way that I can do this? I am trying to create equations with the actual variable values.
You could use the stringi function stri_replace_all_fixed()
stringi::stri_replace_all_fixed(
words, names(numbers), numbers, vectorize_all = FALSE
)
# [1] "\\frac{5}{10}"
Try this:
sprintf(gsub('\\{\\w\\}','\\{%d}',words),5,10)
I'm more familiar with gsub than substitute. The following works:
words <- "\\frac{A}{B}"
numbers <- list(A=5, B=10)
arglist = mapply(list, as.list(names(numbers)), numbers, SIMPLIFY=F)
for (i in 1:length(arglist)){
arglist[[i]]["x"] <- words
words <- do.call("gsub", arglist[[i]])
}
But of course this is unsafe because you're iterating over the substitutions. If, say, the first variable has value "B" and the second variable has name "B", you'll have problems. There's probably a cleaner way.
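One way to avoid the ordering problem is to do all substitutions in a single pass, so that an inserted value is never scanned again. This is only a sketch: the helper name `substitute_once` is hypothetical, and it assumes variable names appear as word characters wrapped in braces, as in the question.

```r
# Replace every {NAME} token in one pass using regmatches<-.
substitute_once <- function(template, values) {
  # match the word between braces without consuming the braces themselves
  m <- gregexpr("(?<=\\{)\\w+(?=\\})", template, perl = TRUE)
  keys <- regmatches(template, m)[[1]]
  repl <- vapply(keys, function(k) {
    # unknown names are left unchanged
    if (k %in% names(values)) as.character(values[[k]]) else k
  }, character(1), USE.NAMES = FALSE)
  regmatches(template, m) <- list(repl)
  template
}

# A value of "B" for A is not substituted a second time
substitute_once("\\frac{A}{B}", list(A = "B", B = 10))
# [1] "\\frac{B}{10}"
```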

Finding the position of a character within a string

I am trying to find the equivalent of the ANYALPHA SAS function in R. This function searches a character string for an alphabetic character and returns the first position at which the character is found.
Example: looking at the following string '123456789A', the ANYALPHA function would return 10 since first alphabetic character is at position 10 in the string. I would like to replicate this function in R but have not been able to figure it out. I need to search for any alphabetic character regardless of case (i.e. [:alpha:])
Thanks for any help you can offer!
Here's an anyalpha function. I added a few extra features. You can specify the maximum amount of matches you want in the n argument, it defaults to 1. You can also specify if you want the position or the value itself with value=TRUE:
anyalpha <- function(txt, n=1, value=FALSE) {
txt <- as.character(txt)
indx <- gregexpr("[[:alpha:]]", txt)[[1]]
ret <- indx[1:(min(n, length(indx)))]
if(value) {
mapply(function(x,y) substr(txt, x, y), ret, ret)
} else {ret}
}
#test
x <- '123A56789BC'
anyalpha(x)
#[1] 4
anyalpha(x, 2)
#[1] 4 10
anyalpha(x, 2, value=TRUE)
#[1] "A" "B"
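If only the first position is needed, a shorter vectorized variant is possible with regexpr. This is a sketch rather than a drop-in replacement: the name `any_alpha` is made up, and returning 0 when no letter is found is an assumption meant to mirror how I understand the SAS function to behave.

```r
# First position of an alphabetic character in each element of `txt`;
# 0 when the element contains no alphabetic character.
any_alpha <- function(txt) {
  pos <- as.integer(regexpr("[[:alpha:]]", as.character(txt)))
  ifelse(pos == -1L, 0L, pos)
}

any_alpha(c("123456789A", "12345", "abc"))
# [1] 10  0  1
```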

Convert binary vector to decimal

I have a vector of a binary string:
a<-c(0,0,0,1,0,1)
I would like to convert this vector into decimal.
I tried using the compositions package and the unbinary() function, however, this solution and also most others that I have found on this site require g-adic string as input argument.
My question is how can I convert a vector rather than a string to decimal?
to illustrate the problem:
library(compositions)
unbinary("000101")
[1] 5
This gives the correct solution, but:
unbinary(a)
unbinary("a")
unbinary(toString(a))
produces NA.
You could try this function
bitsToInt<-function(x) {
packBits(rev(c(rep(FALSE, 32-length(x)%%32), as.logical(x))), "integer")
}
a <- c(0,0,0,1,0,1)
bitsToInt(a)
# [1] 5
Here we skip the character conversion. This only uses base functions.
It is likely that
unbinary(paste(a, collapse=""))
would have worked should you still want to use that function.
There is a one-liner solution:
Reduce(function(x,y) x*2+y, a)
Explanation:
Expanding the application of Reduce results in something like:
Reduce(function(x,y) x*2+y, c(0,1,0,1,0)) = (((0*2 + 1)*2 + 0)*2 + 1)*2 + 0 = 10
Each time a new bit arrives, we double the value accumulated so far and then add the new bit to it.
Please also see the description of Reduce() function.
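To watch the doubling happen step by step, Reduce() can return the intermediate values through its accumulate argument. A small sketch using the vector from the question (the variable name `steps` is just for illustration):

```r
a <- c(0, 0, 0, 1, 0, 1)

# accumulate = TRUE keeps every partial result instead of only the last one
steps <- Reduce(function(x, y) x * 2 + y, a, accumulate = TRUE)
steps
# [1] 0 0 0 1 2 5

# the final element is the decimal value
steps[length(steps)]
# [1] 5
```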
If you'd like to stick to using compositions, just convert your vector to a string:
library(compositions)
a <- c(0,0,0,1,0,1)
achar <- paste(a,collapse="")
unbinary(achar)
[1] 5
This function will do the trick.
bintodec <- function(y) {
# find the decimal number corresponding to binary sequence 'y'
if (! (all(y %in% c(0,1)))) stop("not a binary sequence")
res <- sum(y*2^((length(y):1) - 1))
return(res)
}

Numeric matrix is taking far more memory than it should - R

I am creating a document term matrix (dtm for short) for a Naive Bayes implementation (I know there is a function for this, but I have to code it myself for homework). I wrote a function that successfully creates the dtm, but the resulting matrix takes up too much memory. For example, a 100 x 32000 matrix (of 0's and 1's) is 24MB in size! This results in crashy behavior in R when trying to work with the full 10k documents. The functions follow, and a toy example is in the last 3 lines. Can anyone spot why the "sparser" function in particular is returning such memory-intensive results?
listAllWords <- function(docs)
{
str1 <- strsplit(x=docs, split="\\s", fixed=FALSE)
dictDupl <- unlist(str1)[!(unlist(str1) %in% stopWords)]
dictionary <- unique(dictDupl)
}
#function to create the sparse matrix of words as they appear in each article segment
sparser <- function (docs, dictionary)
{
num.docs <- length(docs) #dtm rows
num.words <- length(dictionary) #dtm columns
dtm <- mat.or.vec(num.docs,num.words) # Instantiate dtm of zeroes
for (i in 1:num.docs)
{
doc.temp <- unlist(strsplit(x=docs[i], split="\\s", fixed=FALSE)) #vectorize words
num.words.doc <- length(doc.temp)
for (j in 1:num.words.doc)
{
ind <- which(dictionary == doc.temp[j]) #loop over words and find index in dict.
dtm[i,ind] <- 1 #indicate this word is in this document
}
}
return(dtm)
}
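stopWords <- character(0) # assumption: stopWords is not defined in the question; an empty vector keeps every word
docs <- c("the first document contains words", "the second document is also made of words", "the third document is words and a number 4")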
dictionary <- listAllWords(docs)
dtm <- sparser(docs,dictionary)
If it makes any difference I am running this in R Studio in Mac OSX, 64 bit
Surely part of your problem is that you are not actually storing integers, but doubles. Note:
m <- mat.or.vec(100,32000)
m1 <- matrix(0L,100,32000)
> object.size(m)
25600200 bytes
> object.size(m1)
12800200 bytes
And note the lack of the "L" in the code for mat.or.vec:
> mat.or.vec
function (nr, nc)
if (nc == 1L) numeric(nr) else matrix(0, nr, nc)
<bytecode: 0x1089984d8>
<environment: namespace:base>
You will also want to explicitly assign 1L; otherwise, I think R will convert the whole matrix to doubles on the first assignment. You can verify this by assigning the double value 1 to a single element of m1 above and rechecking the object size.
I should probably also mention the function storage.mode which can help you to verify that you're using integers.
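A quick way to check this is to compare storage.mode before and after an assignment. The matrix is kept small here only so the example runs instantly; the names `m_int` and `m_dbl` are illustrative.

```r
m_int <- matrix(0L, 100, 320)
storage.mode(m_int)
# [1] "integer"

m_dbl <- m_int
m_dbl[1, 1] <- 1      # a double literal coerces the whole matrix
storage.mode(m_dbl)
# [1] "double"

m_int[1, 1] <- 1L     # an integer literal keeps integer storage
storage.mode(m_int)
# [1] "integer"

object.size(m_dbl) > object.size(m_int)
# [1] TRUE
```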
If you want to store 0/1 values economically, I would suggest raw type.
m8 <- matrix(0,100,32000)
m4 <- matrix(0L,100,32000)
m1 <- matrix(raw(1),100,32000)
The raw type takes just 1 byte per value:
> object.size(m8)
25600200 bytes
> object.size(m4)
12800200 bytes
> object.size(m1)
3200200 bytes
Here is how to operate with it:
> m1[2,2] = as.raw(1)
> m1[2,2]
[1] 01
> as.integer(m1[2,2])
[1] 1
If you really want to be economical look at the ff and bit packages.
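Putting the integer-storage advice together with the toy data from the question, a dtm builder might look roughly like this. It is a sketch, not the asker's function: `build_dtm` is a hypothetical name, stop-word filtering is left out, and match() replaces the inner loop over words.

```r
# Build a 0/1 document-term matrix stored as integers.
build_dtm <- function(docs, dictionary) {
  dtm <- matrix(0L, nrow = length(docs), ncol = length(dictionary),
                dimnames = list(NULL, dictionary))
  tokens <- strsplit(docs, "\\s+")
  for (i in seq_along(tokens)) {
    idx <- match(tokens[[i]], dictionary)  # column index of each word
    dtm[i, idx[!is.na(idx)]] <- 1L         # 1L keeps the matrix integer
  }
  dtm
}

docs <- c("the first document contains words",
          "the second document is also made of words",
          "the third document is words and a number 4")
dictionary <- unique(unlist(strsplit(docs, "\\s+")))
dtm <- build_dtm(docs, dictionary)
storage.mode(dtm)
# [1] "integer"
```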

Replace non-ascii chars with a defined string list without a loop in R

I want to replace non-ASCII characters (for now, only Spanish) with their ASCII equivalents. If I have "á", I want to replace it with "a", and so on.
I built this function (works fine), but I don't want to use a loop (including internal loops like sapply).
latin2ascii<-function(x) {
if(!is.character(x)) stop ("input must be a character object")
require(stringr)
mapL<-c("á","é","í","ó","ú","Á","É","Í","Ó","Ú","ñ","Ñ","ü","Ü")
mapA<-c("a","e","i","o","u","A","E","I","O","U","n","N","u","U")
for(y in 1:length(mapL)) {
x<-str_replace_all(x,mapL[y],mapA[y])
}
x
}
Is there an elegant way to solve it? Any help, suggestion or modification is appreciated.
gsubfn() in the package of the same name is really nice for this sort of thing:
library(gsubfn)
# Create a named list, in which:
# - the names are the strings to be looked up
# - the values are the replacement strings
mapL <- c("á","é","í","ó","ú","Á","É","Í","Ó","Ú","ñ","Ñ","ü","Ü")
mapA <- c("a","e","i","o","u","A","E","I","O","U","n","N","u","U")
# ll <- setNames(as.list(mapA), mapL) # An alternative to the 2 lines below
ll <- as.list(mapA)
names(ll) <- mapL
# Try it out
string <- "ÍÓáÚ"
gsubfn("[áéíóúÁÉÍÓÚñÑüÜ]", ll, string)
# [1] "IOaU"
Edit:
G. Grothendieck points out that base R also has a function for this:
A <- paste(mapA, collapse="")
L <- paste(mapL, collapse="")
chartr(L, A, "ÍÓáÚ")
# [1] "IOaU"
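Since chartr() is already vectorized over its input, the original latin2ascii function could be reduced to something like the sketch below. It assumes the session uses a UTF-8 locale (or another encoding that can represent these characters); the test strings are illustrative.

```r
# Loop-free transliteration of Spanish accented characters.
latin2ascii <- function(x) {
  if (!is.character(x)) stop("input must be a character object")
  # the i-th character of the first string maps to the i-th of the second
  chartr("áéíóúÁÉÍÓÚñÑüÜ", "aeiouAEIOUnNuU", x)
}

latin2ascii(c("íÁuÚ", "uíÚÁ"))
# [1] "iAuU" "uiUA"
```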
I like the version by Josh, but I thought I might add another 'vectorized' solution. It returns a vector of unaccented strings. It also only relies on the base functions.
x=c('íÁuÚ','uíÚÁ')
mapL<-c("á","é","í","ó","ú","Á","É","Í","Ó","Ú","ñ","Ñ","ü","Ü")
mapA<-c("a","e","i","o","u","A","E","I","O","U","n","N","u","U")
split=strsplit(x,split='')
m=lapply(split,match,mapL)
mapply(function(split,m) paste(ifelse(is.na(m),split,mapA[m]),collapse='') , split, m)
# "iAuU" "uiUA"
