Identifying mutations between two sequences - r

Given two sequences (such as DNA sequences) of equal length, I want to be able to find the mutations between them - both type and index within the sequence. As an example, if I fed in the sequences AGGCTAC and AGCCTTC, I want to get out a list like G3C, A6T.
I can get the number of differences just fine:
seqs <- Biostrings::readAAStringSet("PSE-1_Round20.fas")
nmut <- adist(seqs[1], seqs)
But I can't think of a more elegant way to get the positions than just looping, which seems very kludgy - and I'd like to take this as an opportunity to learn instead.
I'm working with the Biostrings package in R, but there don't seem to be any special tools in that package for what I want to do, and I think any solution that works for generic strings should also work for me. In fact, if there's a more elegant solution in Python or bash scripting, I'd also accept that.

There seem to be multiple packages that should do this. One is the findMutations function in the adegenet package.
As for the string comparison question, see this question. Here's a function that will work if the strings are the same length:
mutations <- function(str1, str2) {
  # Split each string into a vector of single characters
  str1vec <- unlist(strsplit(str1, ""))
  str2vec <- unlist(strsplit(str2, ""))
  # Positions at which the two sequences disagree
  iMut <- which(str1vec != str2vec)
  paste0(str1vec[iMut], iMut, str2vec[iMut])
}
> mutations("AGGCTAC", "AGCCTTC")
[1] "G3C" "A6T"

Related

Splitting strings into elements from a list

A function in a package gives me a character, where the original strings are merged together. I need to separate them, in other words I have to find the original elements. Here is an example and what I have tried:
orig<-c("answer1","answer2","answer3")
result<-"answer3answer2"
What I need as an outcome is:
c("answer2","answer3")
I have tried to split() result, but there is no separator string to base the split on, especially since I have no prior knowledge of what the answers will be.
I have tried to match() the result to the orig, but I would need to do that with all substrings.
There has to be an easy solution, but I haven't found it.
index <- gregexpr(paste(orig, collapse = '|'), result)[[1]]
starts <- as.numeric(index)
stops <- starts + attributes(index)$match.length - 1
substring(result, starts, stops)
This should work for well-defined and reversible inputs. Alternatively, is it possible to append some strings to the input of the function, such that it can be easily separated afterwards?
What you're describing seems to be exactly string matching, and for your strings, grepl seems to be just the thing, in particular:
FindSubstrings <- function(orig, result){
  orig[sapply(orig, grepl, result)]
}
In more detail: grepl takes a pattern argument and checks whether it occurs in your string (result in our case), returning a TRUE/FALSE value. We then subset the original values by that logical vector: does the value occur in the string?
Possible improvements:
fixed=TRUE may be a bright idea, because you don't need the full regex power for simple string matching (see the sketch after this list)
some match patterns may contain others, for example, "answer10" contains "answer1"
stringi may be faster for such tasks (just rumors floating around, haven't rigorously tested), so you may want to look into it if you do this a lot.
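A variant incorporating the fixed=TRUE suggestion (a sketch; identical behavior for literal strings, it just skips regex interpretation):

FindSubstrings <- function(orig, result){
  # fixed = TRUE treats each pattern as a literal string, not a regex
  orig[sapply(orig, grepl, result, fixed = TRUE)]
}
FindSubstrings(orig, result)
# [1] "answer2" "answer3"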

Convert characters or symbols to existing variables in R

I'm using R to compute the best fit among a sequence of initializations, which I named Initialization1, Initialization2, etc. I take the best fit to be the one with the largest results_probObs value, and I want to use that one, say Initialization1, again.
best_fit <- paste("Initialization", which.max(results_probObs), sep = "")
best_estimated <- somefunction(best_fit, string1)
However, best_fit here is a character and can't be used as the existing Initialization1 (which is a list). I've tried as.name() too; it gave me a symbol, which couldn't be used as a list either.
Thank you very much for helping.
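A minimal sketch of the usual base-R approach (not from this thread): get() returns the existing object whose name matches a string, so the character value in best_fit can be resolved back to the Initialization1 list (somefunction and string1 as in the question):

best_fit <- paste0("Initialization", which.max(results_probObs))
best_estimated <- somefunction(get(best_fit), string1)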

How to search for huge list of search-terms from a corpus using custom function in tm package

I want to select and retain the gene names from a corpus of multiple text documents using the tm package. I have used a custom function to keep only the genes defined in "pattern" and remove everything else. Here is my code:
library(tm)
docs <- Corpus(DirSource("path of the directory containing text documents"))
f <- content_transformer(function(x, pattern) regmatches(x, gregexpr(pattern, x, ignore.case = TRUE)))
genes <- "IL1|IL2|IL3|IL4|IL5|IL6|IL7|IL8|IL9|IL10|TNF|TGF|AP2|OLR1|OLR2"
docs <- tm_map(docs, f, genes)
The code works perfectly fine. However, if I need to match a much larger number of genes (say > 5000), what is the best way to approach it? I don't want to put the genes in an array and loop the tm_map function, to avoid huge run times and memory constraints.
If you simply want the fastest vectorized fixed-string regex, use the stringi package, not tm. Specifically, look at the stri_match* functions (or you might find stringr even faster if you're only handling ASCII - look for Hadley's latest versions and comments).
But if the regex of gene names is fixed and known upfront, and you're going to be doing a lot of retrieval on those few strings, then you could tag each document for faster retrieval.
(You haven't fully told us your use-case. What % of your runtime is this retrieval task? 0.1%? 99%? Are you storing your genes as text strings? Why not tokenize them and convert once to factors at input-time?)
Either way, tm is not a very scalable, performant package, so look at other approaches.
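For instance, a sketch of the stringi route (using stri_extract_all_regex rather than stri_match*, since the goal is to keep every gene mention; assumes the raw texts are pulled out of the corpus first):

library(tm)
library(stringi)
# Flatten each document to a single string (documents may span multiple lines)
texts <- vapply(docs, function(d) paste(as.character(d), collapse = "\n"), character(1))
genes <- "IL1|IL2|IL3|IL4|IL5|IL6|IL7|IL8|IL9|IL10|TNF|TGF|AP2|OLR1|OLR2"
hits <- stri_extract_all_regex(texts, genes,
                               opts_regex = stri_opts_regex(case_insensitive = TRUE))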

Using values from a dataframe to apply a function to a vector

I'll start off by admitting that I'm terrible at the apply functions, and function writing in general, in R. I am working on a course project to clean and model some text data, and I would like to include a step that cleans up contractions.
The qdapDictionaries package includes a contractions data frame with two columns, the first column is the contraction and the second is the expanded version. For example:
contraction expanded
5 aren't are not
I want to use the values in here to run a gsub function on my text, which I still have in a large character element. Something like gsub(contr,expd,text).
Here's an example vector that I am using to test things out:
vct <- c("I've got a problem","it shouldn't be that hard","I'm having trouble 'cause I'm dumb")
I'm stumped on how to loop through the data frame (without actually writing a loop, because it seems like the least efficient way to do it) so I can run all the gsubs that I need.
There's probably a simple answer, but here's what I tried: first, I created a function that would return the expanded version if passed a contraction:
expand <- function(contr) {
  contractions[which(contractions[1] == contr), 2]
}
I can use sapply with this and it does work, more or less; looping over the first column in contractions, sapply(contractions[,1],expand) returns a named vector of characters with the expanded phrases.
I can't figure out how to combine this vector with gsub though. I tried writing a second function gsub_expand and changing the expand function to return both the contraction and the expansion:
gsub_expand <- function(list, text) {
  text <- gsub(list[[1]], list[[2]], text)
  return(text)
}
When I ran gsub_expand(sapply(contractions[,1],expand),vct) it only corrected a portion of my vector.
[1] "I've got a problem" "it shouldn't be that hard" "I'm having trouble because I'm dumb"
The first entry in the contractions data frame is 'cause and because, so the interior sapply doesn't seem to actually be looping. I'm stuck in the logic of what I want to pass to what, and what I'm supposed to loop over.
Thanks for any help.
Two options:
stringr::str_replace_all
The stringr package does mostly the same things you can do with base regex functions, but sometimes in a dramatically simpler way. This is one of those times. You can pass str_replace_all a named list or character vector, and it will use the names as patterns and the values as replacements, so all you need is
library(stringr)
contractions <- c("I've" = 'I have', "shouldn't" = 'should not', "I'm" = 'I am')
str_replace_all(vct, contractions)
and you get
[1] "I have got a problem" "it should not be that hard"
[3] "I am having trouble 'cause I am dumb"
No muss, no fuss, just works.
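If the pairs live in the qdapDictionaries contractions data frame from the question, the named vector can be built from its two columns first (a sketch, assuming the columns contraction and expanded shown above):

library(stringr)
# Assumes the qdapDictionaries data frame with columns `contraction` and `expanded`
contr_df <- qdapDictionaries::contractions
repl <- setNames(contr_df$expanded, contr_df$contraction)
str_replace_all(vct, repl)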
lapply/mapply/Map and gsub
You can, of course, use lapply or a for loop to repeat gsub. You can formulate this call in a few ways, depending on how your data is stored, and how you want to get it out. Let's first make a copy of vct, because we're going to overwrite it:
vct2 <- vct
Now we can use any of these three:
lapply(seq_along(contractions),
       function(x){ vct2 <<- gsub(names(contractions)[x], contractions[x], vct2) })
# `mapply` is a multivariate version of `sapply`
mapply(function(x, y){ vct2 <<- gsub(x, y, vct2) }, names(contractions), contractions)
# `Map` is a multivariate version of `lapply`
Map(function(x, y){ vct2 <<- gsub(x, y, vct2) }, names(contractions), contractions)
each of which will return slightly different useless data, but will also save the changes to vct2, which now looks the same as the results of str_replace_all above.
These are a little complicated, mostly because you need to save the internal version of vct as you go with each change made. The vct2 <<- writes to the vct2 initialized outside the function's environment, allowing us to capture the successive changes. Be a little careful with <<-; it's powerful. See ?assignOps for more info.
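A side note: the same sequential overwriting can be written without <<- by folding the pairs over the text with Reduce (a sketch):

# Thread the text through one gsub() per contraction/expansion pair
vct3 <- Reduce(function(txt, i) gsub(names(contractions)[i], contractions[i], txt),
               seq_along(contractions), init = vct)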

Looking for an optimized way of replacing list patterns in long documents

Using the tm package, I have a corpus of 10,900 documents (docs).
docs = Corpus(VectorSource(abstracts$abstract))
I also have a list of terms (termslist) with all their synonyms and different spellings. I use it to transform each synonym or spelling into one canonical term.
Term, Synonyms
term1, synonym1
term1, synonym2
term1, synonym3
term2, synonym1
... etc
The way I'm doing it right now is to loop through all documents, with a nested loop through all terms, to find and replace.
for (s in 1:length(docs)){
  for (i in 1:nrow(termslist)){
    docs[[s]]$content <- gsub(termslist[i,2], termslist[i,1], docs[[s]]$content)
  }
  print(s)
}
Currently this takes about a second per document (with around 1,000 rows in termslist), which means 10,900 seconds - roughly 3 hours!
Is there a more optimized way of doing this within tm package or within R generally?
UPDATE:
mathematical.coffee's answer was really helpful. I had to re-create a table with the unique terms as rows and a second column holding their synonyms separated by '|', then just loop over that. Now it takes significantly less time than before.
[The messy] code for creating the new table:
newtermslist <- list()
authname <- unique(termslist[,1])
newtermslist <- cbind(newtermslist, authname)
syns <- list()
for (i in seq(authname)){
  syns <- rbind(syns,
                paste0('(',
                       paste(termslist[which(termslist[,1] == authname[i]), 2], collapse = '|'),
                       ')'))
}
newtermslist <- cbind(newtermslist, syns)
newtermslist <- cbind(unlist(newtermslist[,1]), unlist(newtermslist[,2]))
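A tidier base-R equivalent of that table-building step (a sketch; assumes termslist has the term in column 1 and one synonym per row in column 2):

newtermslist <- aggregate(termslist[,2], by = list(term = termslist[,1]),
                          FUN = function(s) paste0('(', paste(s, collapse = '|'), ')'))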
I think when you wish to perform many replacements, this may be the only way to do it (i.e. sequentially, saving the replaced output as the input for the next replacement).
However, you might gain some speed by trying the following (you will have to do some benchmarking to compare):
use fixed=TRUE (since your synonyms are not regexes but literal spellings) and useBytes=TRUE (see ?gsub - if you have a multibyte locale this may or may not be a good idea). Or
compress your termslist - if blue has synonyms cerulean, cobalt and sky, then your regex could be (cerulean|cobalt|sky) with replacement blue, so that all the synonyms for blue are replaced in one iteration rather than in 3 separate ones. To do this, you'd preprocess your termslist - e.g. with plyr, newtermslist <- ddply(terms, .(term), summarize, regex=paste0('(', paste(synonym, collapse='|'), ')')) and then do your current loop over this. Here you will have fixed=FALSE (the default, i.e. treat the patterns as regexes).
see also ?tm_map and ?content_transformer. I'm not sure if these will speed things up at all, but you could try.
(Re benchmarking - try library(rbenchmark); benchmark(expression1, expression2, ...), or good ol' system.time for timing, Rprof for profiling)
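For example, a minimal timing harness along those lines (a sketch; document 1 and term 1 just stand in for whatever you want to time):

library(rbenchmark)
benchmark(
  fixed = gsub(termslist[1,2], termslist[1,1], docs[[1]]$content, fixed = TRUE),
  regex = gsub(newtermslist[1,2], newtermslist[1,1], docs[[1]]$content),
  replications = 100
)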
Here I'm answering my own question after coming up with a parallelized solution. It should run faster, but I haven't compared the two solutions yet.
library(doParallel)
library(foreach)
cl <- makeCluster(detectCores())
registerDoParallel(cl)
# Assignments made inside %dopar% happen in worker copies of `docs` and would be
# lost, so each task returns its rewritten content for reassignment afterwards.
system.time({  # prints how long the expression takes to evaluate
  new_content <- foreach(s = 1:length(docs)) %dopar% {
    content <- docs[[s]]$content
    for (i in 1:nrow(newtermslist)) {
      content <- gsub(newtermslist[i,2], newtermslist[i,1], content)
    }
    content
  }
  for (s in 1:length(docs)) docs[[s]]$content <- new_content[[s]]
})
stopCluster(cl)
