I am using barcodes to tag mitochondrial DNA strands prior to PCR. The barcode sequences are not known, but they are 18 nucleotides long and directly follow a known sequence (either CATCAT or TACTAC). Each DNA molecule gets a unique barcode identifier. Once the molecules undergo PCR, I need to cluster the sequences based on their 18-nucleotide barcode, and then align the sequences within each barcode cluster.
To use an overly simple example, let's say I have 2 molecules that are going into a PCR reaction:
CATCATBARCODE1SEQUENCE1
TACTACBARCODE2SEQUENCE2
After amplification I have:
CATCATBARCODE1SEQUENCE1
CATCATBARCODE1SEQUENCE1
TACTACBARCODE2SEQUENCE2
TACTACBARCODE2SEQUENCE2
I then want to look at the window of sequence at positions 7-14 (1-based; the window directly after the 6-nucleotide known sequence) and cluster the reads based on that window alone, without changing the rest of the sequence, which would actually just look like what I have above. Then I could perform the alignment on the adjacent sequences within each cluster.
Any ideas on how I could accomplish this clustering of a window of sequence, without taking into account the rest of the sequence? Thanks.
Overly simplified R code, but it seems to do what you ask:
seqs <- c('CATCATBARCODE1SEQUENCE1',
          'CATCATBARCODE1SEQUENCE1',
          'TACTACBARCODE2SEQUENCE2',
          'TACTACBARCODE2SEQUENCE2')

clusters <- list()
for (seq in seqs) {
  barcode <- substr(seq, 7, 14)  # use substr(seq, 7, 24) for a real 18-nt barcode
  if (!is.null(clusters[[barcode]])) {
    clusters[[barcode]] <- append(clusters[[barcode]], seq)
  } else {
    clusters[[barcode]] <- c(seq)
  }
}
print(clusters)
prints:
$BARCODE1
[1] "CATCATBARCODE1SEQUENCE1" "CATCATBARCODE1SEQUENCE1"
$BARCODE2
[1] "TACTACBARCODE2SEQUENCE2" "TACTACBARCODE2SEQUENCE2"
Assuming you can already obtain sequences starting like [CATCATBARCODEX], what I would do is just process them in Python. If your sequence starts are not all the same, then you may need to search for CATCAT and discard reads where it appears to be in the wrong position. There may be some issues if the number of barcodes is very large, but for something on the order of 100,000, simple methods should work.
Anyway, once you find the CATCAT, I would just build up a dictionary of barcodes and start filtering. Then you can strip off this first part of the sequences and align using whatever method you like (I had a barcode project, and using a custom genome with bowtie was convenient).
Let's say you need to find this known sequence rather than assume the read starts with it; in Python, a solution would look like this:
my_dict = {}
trimmed = []
# seqs is the list of read strings, as in the example above
for seq in seqs:
    idx = seq.find("CATCAT")
    idx2 = seq.find("TACTAC")
    if idx == -1 and idx2 == -1:
        continue
    # here you still need to consider the locations of idx and idx2:
    # both may be present, the sequence needs to be long enough, etc.
    if idx == -1:
        idx = idx2
    barcode = seq[idx + 6 : idx + 6 + 18]
    # you may want to shorten the barcode or encode it as a string
    if barcode in my_dict:
        my_dict[barcode] += 1
    else:
        my_dict[barcode] = 1
    trimmed.append(seq[idx + 24:])  # strip the prefix and barcode
In addition to the counting, you can 1) append the sequences to a FASTA file per barcode, or 2) assign the barcode as an annotation in one large FASTA file.
Either way, you probably want to strip down the sequence to simplify the downstream analysis.
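As a minimal sketch of option 1), assuming the (barcode, sequence) pairs have already been extracted as above (the records variable, file names and output directory are just illustrative):

import os

def write_per_barcode_fasta(records, outdir="barcodes"):
    # records: list of (barcode, sequence) tuples from the extraction loop above
    os.makedirs(outdir, exist_ok=True)
    counts = {}
    for barcode, seq in records:
        counts[barcode] = counts.get(barcode, 0) + 1
        # append each read to a FASTA file named after its barcode
        path = os.path.join(outdir, barcode + ".fasta")
        with open(path, "a") as fh:
            fh.write(">%s_%d\n%s\n" % (barcode, counts[barcode], seq))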
I have a large character vector, around 300K elements, and I want to get a unique list of similar words from it. Clustering won't work, as the number of clusters will change from application to application.
Suppose the data looks like so:
x = as.vector(c('accuracy','accuracy','friendliness','friendliness','email','email_','email_hi','email_asdlk'))
As you can see, there are 3 clusters here: accuracy, friendliness, and email.
Using stringdistmatrix on a vector of size 300K takes way too long. What other options are there?
I'm using the R package TraMineR for some academic research on sequence analysis.
I want to find a pattern defined as someone being in the target company, then going out, then coming back to the target company.
(Simplified) I've defined state A as the target company, B as an outside-industry company, and C as an inside-industry company.
So what I want to do is find sequences with the specific patterns A-B-A or A-C-A.
After looking at this question (Strange number of subsequences?) and reading the user guide, especially the following passages:
4.3.3 Subsequences
A sequence u is a subsequence of x if all successive elements u_i of u appear in x in the same order, which we simply denote by u ⊆ x. According to this definition, unshared states can appear between those common to both sequences u and x. For example, u = (S, M) is a subsequence of x = (S, U, M, MC).
and
7.3.2 Finding sequences with a given subsequence
The seqpm() function counts the number of sequences that contain a given subsequence and collects their row index numbers. The function returns a list with two elements. The first element, MTab, is just a table with the number of occurrences of the given subsequence in the data. Note that only one occurrence is counted per sequence, even when the subsequence appears more than one time in the sequence. The second element of the list, MIndex, gives the row index numbers of the sequences containing the subsequence. These index numbers may be useful for accessing the concerned sequences (example below). Since it is easier to search a pattern in a character string, the function first translates the sequence data into this format using the seqconc function with the TRUE option.
I concluded that seqpm() was the function I needed to get the job done.
So I have sequences like:
A-A-A-A-A-B-B-B-B-B-A-A-A-A-A
From the definition of subsequences given in the sources mentioned above, I figured I could find that kind of sequence by using:
seqpm(sequence,"ABA")
But that does not happen. In order to find that example sequence, I need to input
seqpm(sequence,"ABBBBBA")
which is not very useful for what I need.
So do you see where I might have missed something?
How can I retrieve all the sequences that go from A to B and back to A?
Is there a way to find sequences that go from A to anything else and then back to A?
Thanks a lot!
The title of the seqpm help page is "Find substring patterns in sequences", and this is what the function actually does: it searches for sequences that contain a given substring (not a subsequence). It seems there is a formulation error in the user's guide.
A solution for finding the sequences that contain given subsequences is to convert the state sequences into event sequences with seqecreate, and then use the seqefsub and seqeapplysub functions. I illustrate using the actcal data that ships with TraMineR.
library(TraMineR)
data(actcal)
actcal.seq <- seqdef(actcal[,13:24])
## displaying the first state sequences
head(actcal.seq)
## transforming into event sequences
actcal.seqe <- seqecreate(actcal.seq, tevent = "state", use.labels=FALSE)
## displaying the first event sequences
head(actcal.seqe)
## now searching for the subsequences
subs <- seqefsub(actcal.seqe, strsubseq=c("(A)-(D)","(D)-(B)"))
## and identifying the sequences that contain the subsequences
subs.pres <- seqeapplysub(subs, method="presence")
head(subs.pres)
## we can now, for example, count the sequences that contain (A)-(D)
sum(subs.pres[,1])
## or list the sequences that contain (A)-(D)
rownames(subs.pres)[subs.pres[,1]==1]
Hope this helps.
I am designing a word filter that can filter out bad words (a 200-word list) from an article (about 2000 words). The problem is which data structure I should use to store this bad-word list, so that the program spends as little time as possible finding bad words in an article.
-- more details
If the size of the bad-word list is 2000, each article is 50,000 words, and the program will process about 1000 articles at a time, which data structure should I choose for a better-than-O(n^2) search?
You can use a hash table, because its average complexity is O(1) for insert and search, and your data is just 2000 words.
http://en.wikipedia.org/wiki/Hash_table
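A minimal sketch in Python, whose built-in set is a hash table (the word lists here are placeholders):

# Membership tests against a hash-based set are O(1) on average.
bad_words = {"badword1", "badword2"}       # placeholder for the 2000-word list
article = "some text with badword1 in it"  # placeholder article
flagged = [w for w in article.split() if w in bad_words]
print(flagged)                             # ['badword1']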
A dictionary usually is a mapping from one thing (a word in the 1st language) to another thing (a word in the 2nd language). You don't seem to need that mapping here, just a set of words.
Most languages provide a set data structure out of the box, with insert and membership-testing methods.
A small example in Python, comparing a list and a set:
import random
import string
import time

def create_word(min_len, max_len):
    # randint is inclusive on both ends, so this gives lengths min_len..max_len
    return "".join(random.choice(string.ascii_lowercase)
                   for _ in range(random.randint(min_len, max_len)))

def create_article(length):
    return [create_word(3, 10) for _ in range(length)]

wordlist = create_article(50000)
article = " ".join(wordlist)
good_words = []
bad_words_list = [random.choice(wordlist) for _ in range(2000)]

print("using list")
print(time.time())
for word in article.split(" "):
    if word in bad_words_list:
        continue
    good_words.append(word)
print(time.time())

good_words = []
bad_words_set = set(bad_words_list)
print("using set")
print(time.time())
for word in article.split(" "):
    if word in bad_words_set:
        continue
    good_words.append(word)
print(time.time())
This creates an "article" of 50000 randomly created "words" with a length between 3 and 10 letters, then picks 2000 of those words as "bad words".
First, they are put in a list and the "article" is scanned word by word if a word is in this list of bad words. In Python, the in operator tests for membership. For an unordered list, there's no better way than scanning the whole list.
The second approach uses the set datatype, initialized with the list of bad words. A set has no ordering, but much faster lookup (again using the in operator) for testing whether an element is contained. That seems to be all you need.
On my machine, the timings are:
using list
1421499228.707602
1421499232.764034
using set
1421499232.7644095
1421499232.785762
So it takes about 4 seconds with a list and about 2 hundredths of a second with a set.
I think the best structure you can use here is a set - http://en.wikipedia.org/wiki/Set_%28abstract_data_type%29
It takes O(log_2(n)) time to add an element to the structure (a one-time operation) and the same for every query. So with 200 elements in the data structure, your program will need only about 8 operations to check whether a word is in the set.
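This answer describes a tree-based (ordered) set. Python has no tree set built in, but a minimal sketch of the same O(log n) lookup is binary search on a sorted list via the standard bisect module (the word list is a placeholder):

import bisect

bad_words = sorted(["badword1", "badword2", "badword3"])  # placeholder list

def contains(sorted_words, word):
    # binary search: O(log n) comparisons
    i = bisect.bisect_left(sorted_words, word)
    return i < len(sorted_words) and sorted_words[i] == word

print(contains(bad_words, "badword2"))  # True
print(contains(bad_words, "goodword"))  # False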
You need a Bag data structure for this problem. In a Bag, elements have no order, but the structure is designed for fast lookup of an element; its time complexity is O(1). So for N words in an article, the overall complexity turns out to be O(N), which is the best you can achieve in this case. Java's Set is an example of a Bag-like implementation in Java.
I am asking this here because I couldn't find the answer I was looking for elsewhere, and I don't know where else I could ask. I hope someone can reply without saying that the question is irrelevant to the forum. I have a biology background and I am currently working in bioinformatics. I need to understand, in lay language, hash tables and suffix trees. Something simple; I don't get the O(n) concepts and all that stuff. I think they are both kind of the same thing: a way to store string data? But I would like to understand the differences better. This would help other people like me enormously; there are a lot of us in this field now!
Thanks in advance.
OK, let's use bioinformatics to help illustrate the differences.
Let's say you have several DNA sequences that are pretty long, and we want to store them in a data structure.
If we want to use a hashtable
A hashtable is a useful way to store a bunch of objects while still being able to search the structure very quickly to see whether it already contains a particular object.
One bioinformatics use case we can solve with a hashtable is de-duplicating a large sequence set. Let's say we have a huge dataset of next-gen sequencing data and we want to de-duplicate it before we assemble. We can use a hashtable to store the unique sequences: before inserting any sequence, we first check whether it already exists in the hashtable, and if it does we skip that read; only if it is not yet in the hashtable do we add it. When we are done, the elements in the hashtable are the unique sequences.
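A minimal sketch of that de-duplication in Python, whose set is hash-based (the reads are placeholders):

reads = ["ACGT", "ACTT", "ACGT"]  # placeholder reads
seen = set()
unique = []
for r in reads:
    if r in seen:          # already in the hash set? skip the duplicate
        continue
    seen.add(r)
    unique.append(r)
print(unique)              # ['ACGT', 'ACTT']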
Hashtables are basically an array of linked lists; each cell in the array is called a "bin". When we insert or search for something in the hashtable, we first have to know which bin it belongs in, and we determine that with a hash algorithm.
We have to come up with a hash algorithm: something that converts our sequence into a number. A requirement is that the same sequence must always evaluate to the same number. It's OK if different sequences evaluate to the same number (which is called a hash collision), since there is an infinite number of possible sequences but only a limited range of possible hash values.
A simple hash algorithm is to assign a value to each base (A = 1, G = 2, C = 3, T = 4; assume no ambiguities) and then just sum up the bases in the sequence. This means that any sequences with the same numbers of As, Cs, Gs and Ts will have the same hash value. If we wanted, we could use a more complicated algorithm that also takes position into account, so that two sequences get the same number only if they have the same bases in the same order.
Once we have our hash algorithm, we can make a hash table by binning the sequences by their hash values; the more bins we have in the table, the fewer hash values per bin. Lookup is very fast because, to see if a sequence is in the hashtable or to add a new one, we just compute its hash value to find its bin and then only look at the values inside that bin, ignoring the rest of the bins.
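A toy Python version of that base-value hash and the binning (the bin count and sequences are arbitrary choices for illustration):

BASE_VALUE = {"A": 1, "G": 2, "C": 3, "T": 4}

def dna_hash(seq):
    # same sequence -> same number; sequences with the same base
    # composition collide, which is allowed
    return sum(BASE_VALUE[b] for b in seq)

NUM_BINS = 16                        # arbitrary bin count
bins = [[] for _ in range(NUM_BINS)]

def insert(seq):
    b = bins[dna_hash(seq) % NUM_BINS]
    if seq not in b:                 # only scan this one bin, not the whole table
        b.append(seq)

for s in ["ACGT", "ACTT", "TGCA"]:   # ACGT and TGCA hash to the same bin
    insert(s)
print(bins)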
suffix tree
A suffix tree is a different data structure: a graph where each node is (in this case) a residue in our sequence, and edges point to the next node. For example, if our sequence is ACGT, the path in the graph is A->C->G->T->$. If we had another sequence ACTT, its path would be A->C->T->T->$.
We can combine consecutive nodes when there is only one path through them, so in the previous example, since both sequences start with AC, the paths become AC->G->T->$ and AC->T->T->$.
In bioinformatics this is really useful for substring matching (like finding repetitive regions or primer binding sites), since we can easily see whether there is a subpath in our graph that matches our motif.
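A minimal sketch in Python of an (uncompressed) suffix trie: inserting every suffix of the sequence is what makes arbitrary substring lookups work. A real suffix tree would also merge single-child runs of nodes, as described above:

def build_suffix_trie(seq):
    # insert every suffix; "$" marks the end of a suffix
    root = {}
    for i in range(len(seq)):
        node = root
        for ch in seq[i:] + "$":
            node = node.setdefault(ch, {})
    return root

def contains_substring(trie, motif):
    # every substring of seq is a prefix of some suffix,
    # i.e. a path starting at the root
    node = trie
    for ch in motif:
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = build_suffix_trie("ACGTACTT")
print(contains_substring(trie, "GTAC"))  # True
print(contains_substring(trie, "GGG"))   # False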
Hope that helps
I have a little problem understanding, conceptually, the structure of a random-writing program (one that takes input in the form of a text file) and uses a Markov model to create somewhat sensible output.
The data structure I am using has cases ranging from 0-10. At case 0, I count the number of times each letter/symbol/digit appears and base my new text on those counts to simulate the input. I have already implemented this with a Map that holds each unique letter in the input text and an array of how many times it occurs, so I can simply ask for the size of the array for a specific letter and generate output text easily.
But now I need to create case 1/2/3 and so on... Case 1 also tracks which letter is most likely to appear after any given letter. Do I need to create 10 separate arrays for these cases, or is there an easier way?
There are a lot of ways to model this. One approach is as you describe, with a multi-dimensional array indexed by the current character and the character that follows it, where the stored value is the count.
# Two character sample:
int counts[][] = new int[26][26]
# ... initialize all entries to zero
# 'a' => 0, 'b' => 1, ... 'z' => 25
# For example for the string 'apple'
# Note: I'm only writing it out like this to show the result; it should be in a
# loop or function ...
counts['a'-'a']['p'-'a']++
counts['p'-'a']['p'-'a']++
counts['p'-'a']['l'-'a']++
counts['l'-'a']['e'-'a']++
Then, to randomly generate text, you count the total number of outcomes for a given character (for example, 2 outcomes for 'p' in the example above) and pick a weighted random number to choose one of the possible outcomes.
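A minimal sketch of the counting and weighted sampling in Python (an order-1 chain over characters; the seed string is just for illustration):

import random

def build_counts(text):
    counts = {}                      # counts[a][b] = times b follows a
    for a, b in zip(text, text[1:]):
        counts.setdefault(a, {})
        counts[a][b] = counts[a].get(b, 0) + 1
    return counts

def next_char(counts, ch):
    # weighted pick among the characters observed to follow ch
    options = counts[ch]
    return random.choices(list(options), weights=list(options.values()))[0]

counts = build_counts("apple")
print(counts["p"])                   # {'p': 1, 'l': 1}
print(next_char(counts, "p"))        # 'p' or 'l', each with probability 1/2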
For smaller chain lengths (say up to 4 characters) that should work fine. For anything larger you may start to run into memory issues, since (assuming you're using a-z) you need 26^N entries for an N-length chain.
I wrote something like this a couple of years ago. I think I used random pages from Wikipedia as seed data to generate the weights.