Meaning of minSupport in TraMineR - r

I'm trying to analyse event subsequences in TraMineR, but I really don't know what "support" and "minSupport" stand for!
I looked for the meaning, but all I found was these sentences in the Details section of the seqefsub function:
There are two usages of this function. The first is for searching subsequences satisfying a support condition. By default, the support is counted per sequence and not per occurrence, i.e. when a sequence contains twice a same subsequence it is counted only once.

Support is a basic concept in data mining. Here, the support of a subsequence is the number (or proportion) of sequences that contain the subsequence. We could also consider the number of times the subsequence occurs among the considered sequences, and then there would be different ways of counting the occurrences when the same subsequence appears more than once in the same sequence.
In the seqefsub function of TraMineR you can control the counting method with the constraint argument, which should be set using seqeconstraint. For the available counting methods, see the help page of the latter function and the references given therein.
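For instance, here is a minimal sketch using the actcal data that ships with TraMineR (depending on your TraMineR version the arguments are minSupport/pMinSupport and countMethod, or min.support/pmin.support and count.method; the counting-method codes are listed on the seqeconstraint help page):
library(TraMineR)
data(actcal)
actcal.seq <- seqdef(actcal[,13:24])
actcal.seqe <- seqecreate(actcal.seq, tevent="state", use.labels=FALSE)
## default: support = number of sequences containing the subsequence;
## keep subsequences appearing in at least 10% of the sequences
fsub <- seqefsub(actcal.seqe, pMinSupport=0.1)
## count distinct occurrences instead (countMethod=2)
cst <- seqeconstraint(countMethod=2)
fsub.occ <- seqefsub(actcal.seqe, pMinSupport=0.1, constraint=cst)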
Hope this helps.

Related

Is it safe to use multiple symbols hits per row in the Aster nPath algorithm?

On page 77 of the Aster Analytics user's guide:
http://www.info.teradata.com/eDownload.cfm?itemid=122580002
it says:
“Note that the predicates for different symbols may overlap, and therefore multiple symbols may match the same row.”
Does anyone have any experience with practical use cases where you actually need multiple symbols per row?
My concern is that this could explode pretty quickly: with n symbols per row and m rows in a partition, the number of symbol combinations per partition is n^m.
E.g., for n=2 and m=50 this gives ~1e15 symbol combinations, which we certainly don't want to traverse.
Thx,
Francis
nPath in Aster will never match multiple symbols to the same row within a single match; rather, multiple symbols may match the same row, and nPath has to choose one of them when forming its result. There are many cases where this is possible, for example, when you have a general page-hit symbol (defined as TRUE) alongside special pages like START, FINISH, BASKET, etc.; the matching then depends on how the pattern is defined.

Identifying indices of sequences which contain frequent subsequences

Using TraMineR I can identify frequent subsequences in a dataset of sequences. However, it only gives me a count of how often such a subsequence occurs in the overall dataset, e.g. that it occurs in 21 of 22 sequences.
Is there any way of getting indices of exactly which sequences contain a specific frequent subsequence?
See the function seqeapplysub.
According to its help page, it "Checks occurrences of the subsequences subseq among the event sequences and returns the result according to the selected method."
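For example, here is a minimal sketch using the actcal data that ships with TraMineR (substitute your own event sequence object and support threshold; depending on your TraMineR version the threshold argument is pMinSupport or pmin.support):
library(TraMineR)
data(actcal)
actcal.seqe <- seqecreate(seqdef(actcal[,13:24]), tevent="state", use.labels=FALSE)
## frequent subsequences, support counted per sequence
fsub <- seqefsub(actcal.seqe, pMinSupport=0.1)
## presence matrix: one row per sequence, one column per subsequence
fsub.pres <- seqeapplysub(fsub, method="presence")
## indices of the sequences containing the first frequent subsequence
which(fsub.pres[,1] == 1)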

Find specific patterns in sequences

I'm using R package TraMineR to make some academic research on sequence analysis.
I want to find a pattern defined as someone being in the target company, then going out, then coming back to the target company.
(Simplified) I've defined state A as the target company, B as an outside-industry company, and C as an inside-industry company.
So what I want to do is find sequences with the specific patterns A-B-A or A-C-A.
After looking at this question (Strange number of subsequences?) and reading the user's guide, especially the following passages:
4.3.3 Subsequences
A sequence u is a subsequence of x if all successive elements u_i of u appear in x in the same order, which we simply denote by u ⪯ x. According to this definition, unshared states can appear between those common to both sequences u and x. For example, u = (S, M) is a subsequence of x = (S, U, M, MC).
and
7.3.2 Finding sequences with a given subsequence
The seqpm() function counts the number of sequences that contain a given subsequence and collects
their row index numbers. The function returns a list with two elements. The first element, MTab,
is just a table with the number of occurrences of the given subsequence in the data. Note that
only one occurrence is counted per sequence, even when the sub-sequence appears more than one
time in the sequence. The second element of the list, MIndex, gives the row index numbers of
the sequences containing the subsequence. These index numbers may be useful for accessing the
concerned sequences (example below). Since it is easier to search a pattern in a character string,
the function first translates the sequence data in this format when using the seqconc function with
the TRUE option.
I concluded that seqpm() was the function I needed to get the job done.
So I have sequences like:
A-A-A-A-A-B-B-B-B-B-A-A-A-A-A
And based on the definition of subsequences I found in the mentioned sources, I figured I could find that kind of sequence by using:
seqpm(sequence,"ABA")
But that does not happen. In order to find that example sequence I need to input
seqpm(sequence,"ABBBBBA")
which is not very useful for what I need.
So do you guys see where I might have missed something?
How can I retrieve all the sequences that go from A to B and back to A?
Is there a way for me to find sequences that go from A to anything else and then back to A?
Thanks a lot!
The title of the seqpm help page is "Find substring patterns in sequences", and this is what the function actually does: it searches for sequences that contain a given substring (not a subsequence). It seems there is a formulation error in the user's guide.
A solution for finding the sequences that contain given subsequences is to convert the state sequences into event sequences with seqecreate, and then use the seqefsub and seqeapplysub functions. I illustrate using the actcal data that ships with TraMineR.
library(TraMineR)
data(actcal)
actcal.seq <- seqdef(actcal[,13:24])
## displaying the first state sequences
head(actcal.seq)
## transforming into event sequences
actcal.seqe <- seqecreate(actcal.seq, tevent = "state", use.labels=FALSE)
## displaying the first event sequences
head(actcal.seqe)
## now searching for the subsequences
subs <- seqefsub(actcal.seqe, strsubseq=c("(A)-(D)","(D)-(B)"))
## and identifying the sequences that contain the subsequences
subs.pres <- seqeapplysub(subs, method="presence")
head(subs.pres)
## we can now, for example, count the sequences that contain (A)-(D)
sum(subs.pres[,1])
## or list the sequences that contain (A)-(D)
rownames(subs.pres)[subs.pres[,1]==1]
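For the original question's A-B-A and A-C-A patterns, the same approach applies. A sketch reusing actcal.seqe from above (your own data would use your own A/B/C state coding):
## search directly for the two target subsequences
subs.aba <- seqefsub(actcal.seqe, strsubseq=c("(A)-(B)-(A)","(A)-(C)-(A)"))
subs.aba.pres <- seqeapplysub(subs.aba, method="presence")
## row indices of the sequences that go from A to B and back to A
which(subs.aba.pres[,1] == 1)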
Hope this helps.

hash table suffix tree explanation

I am asking this here because I couldn't find the answer I was looking for elsewhere, and I don't know where else I could ask. I hope someone can reply without saying the question is off-topic. I have a biology background and currently work in bioinformatics. I need to understand hash tables and suffix trees in lay language. Something simple; I don't get the O(n) concepts and all that stuff. I think they are both kind of the same thing: a way to store string data? But I would like to understand the differences better. This would help other people like me enormously; there are a lot of us in this field now!
Thanks in advance.
OK, let's use bioinformatics to help illustrate the differences.
Let's say you have several DNA sequences that are pretty long, and we want to store these sequences in a data structure.
hash table
A hash table is a useful way to store a bunch of objects while still being able to search the data structure very quickly to see if it already contains a particular object.
One bioinformatics use case that we can solve with a hash table is de-duplicating a large sequence set. Let's say we have a huge dataset of next-gen sequencing data and we want to de-duplicate it before we assemble. We can use a hash table to store the unique sequences. Before inserting any sequence into the hash table, we first check whether it already exists there; if it does, we skip that read, and only if it is not yet in the hash table do we add it. When we are done, the elements in the hash table will be the unique sequences.
Hash tables are basically an array of linked lists; we will call each cell in the array a "bin". When we insert or search for something in the hash table, we first have to know which bin it is in. The way we determine which bin to use is a hash algorithm.
We have to come up with a hash algorithm: something that will convert our sequence into a number. A requirement of this function is that the same sequence must always evaluate to the same number. It's OK if different sequences evaluate to the same number (which is called a hash collision), since there is an infinite number of possible sequences and we only have a limited range of possible number values in our hash.
A simple hash algorithm is to assign a value to each base, A = 1, G = 2, C = 3, T = 4 (assume no ambiguities), and then just sum up the bases in our sequence. This means that any sequences with the same number of As, Cs, Gs and Ts will have the same hash value. If we wanted, we could also use a more complicated algorithm that takes position into account, so that two sequences get the same number only if they have the same bases in the same order.
Once we have our hash algorithm, we can build the hash table by binning the sequences by their hash values. The more bins we have in our table, the fewer hash values per bin. Lookup is very fast: to see if a sequence is in our hash table, or to add a new sequence, we just compute the sequence's hash value to find its bin, and then we only have to look at the values inside that bin and can ignore the rest.
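To make this concrete, here is a toy version in R of the scheme just described (purely illustrative; a real R program would simply use a named list, an environment, or unique()):
## hash: A=1, G=2, C=3, T=4, summed over the sequence, then binned
base_values <- c(A=1, G=2, C=3, T=4)
hash_seq <- function(seq, n_bins) {
  h <- sum(base_values[strsplit(seq, "")[[1]]])
  (h %% n_bins) + 1
}
n_bins <- 8
bins <- vector("list", n_bins)  ## each bin plays the role of a linked list
insert_if_new <- function(bins, seq, n_bins) {
  b <- hash_seq(seq, n_bins)
  if (!(seq %in% bins[[b]]))    ## scan only one bin, not the whole table
    bins[[b]] <- c(bins[[b]], seq)
  bins
}
bins <- insert_if_new(bins, "ACGT", n_bins)
bins <- insert_if_new(bins, "ACTT", n_bins)
bins <- insert_if_new(bins, "ACGT", n_bins)  ## duplicate, skipped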
suffix tree
A suffix tree is a different data structure: a tree where each node is (in this case) a residue in our sequence and edges point to the next node, so a path from the root spells out a stored string. (Strictly speaking, a suffix tree stores every suffix of the sequence, which is what makes substring search fast.) For example, if our sequence were ACGT, the path in the graph would be A->C->G->T->$. If we had another sequence ACTT, its path would be A->C->T->T->$.
We can combine consecutive nodes if there is only one path through them, so in the previous example, since both sequences start with AC, the paths become AC->G->T->$ and AC->T->T->$.
In bioinformatics this is really useful for substring matching (like finding repetitive regions or primer binding sites), since we can easily see whether there is a subpath in our graph that matches our motif.
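The key idea can be shown in a few lines of R: a motif occurs in a sequence exactly when it is a prefix of one of the sequence's suffixes, and a suffix tree just stores all those suffixes in a compressed, shared form (the sequence and motif here are made up for illustration):
seq <- "ACGTACTT"
## all suffixes of the sequence
suffixes <- sapply(seq_len(nchar(seq)), function(i) substr(seq, i, nchar(seq)))
motif <- "GTA"
## start positions of the motif = suffixes that begin with it
which(startsWith(suffixes, motif))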
Hope that helps

Faster R code for fuzzy name matching using agrep() for multiple patterns...?

I'm a bit of an R novice and have been experimenting with the agrep function in R. I have a large database of customers (1.5 million rows), and I'm sure there are many duplicates among them. Many of the duplicates, though, are not revealed by using table() to get the frequency of repeated exact names. Just eyeballing some of the rows, I have noticed many duplicates that count as "unique" because of a minor mis-key in the spelling of the name.
So far, to find all of the duplicates in my data set, I have been using agrep() to do the fuzzy name matching. I have been playing around with the max.distance argument in agrep() to return different approximate matches, and I think I have found a happy medium between returning false positives and missing out on true matches. As agrep() is limited to matching a single pattern at a time, I found an entry on Stack Overflow that helped me write a sapply call to match the data set against numerous patterns. Here is the code I am using to loop over numerous patterns as it combs through my data set for "duplicates":
dups4 <- data.frame(unlist(sapply(unique$name, agrep, value=T,
                                  max.distance=0.154, vf$name)))
unique$name is the unique index I developed that contains all of the "patterns" I wish to hunt for in my data set.
vf$name is the column in my data frame that contains all of my customer names.
This code works well on a small sample of 600 or so customers, and agrep works fine. My problem is when I attempt to use a unique index of 250K+ names and agrep it against my 1.5 million customers. As I type out this question, the code is still running in R and has not yet stopped (we are going on 20 minutes at this point).
Does anyone have any suggestions to speed this up or improve the code I have used? I have not yet tried anything from the plyr package; perhaps that might be faster, though I am a little unfamiliar with the ddply and llply functions.
Any suggestions would be greatly appreciated.
I'm so sorry, I missed this last request to post a solution. Here is how I solved my multiple-pattern agrep problem, and then how I sped things up using parallel processing.
What I am essentially doing is taking a whole vector of character strings and fuzzy matching them against themselves to find out whether there are any fuzzy-matched duplicate records in the vector.
Here I create a cluster of twenty workers to use in the parallel process run by parSapply:
library(parallel)  ## makeCluster() and parSapply() come from the parallel package
cl <- makeCluster(20)
So let's start with the innermost nesting of the code, the parSapply call. This is what allows me to run agrep() in parallel. The first argument is cl, the cluster object created above.
The second argument is the vector of patterns I wish to match against. The third argument is the function I wish to use to do the matching (in this case agrep). The subsequent arguments are all passed through to agrep(): I have specified that I want the actual character strings returned (not the positions of the strings) using value=T, and I have specified the max.distance I am willing to accept in a fuzzy match, in this case a cost of 2. The last argument is the full list of patterns to be matched against the first list of patterns (argument 2). As it happens, I am looking to identify duplicates, hence I match the vector against itself. The final output is a list, so I use unlist() and wrap it in a data frame to basically get a table of matches. From there, I can run a frequency table on the result to find out which fuzzy-matched character strings have a frequency greater than 1, which ultimately tells me that such a pattern matched itself and at least one other pattern in the vector.
truedupevf <- data.frame(unlist(parSapply(cl, s4dupe$fuzzydob, agrep,
                                          value=T, max.distance=2,
                                          s4dupe$fuzzydob)))
stopCluster(cl)  ## shut the workers down when finished
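The frequency step described above might then look like this (a sketch, assuming truedupevf holds a single column of matched strings):
match.freq <- table(truedupevf[,1])
## strings that matched themselves plus at least one other record
names(match.freq)[match.freq > 1]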
I hope this helps.
