Find specific patterns in sequences - r

I'm using the R package TraMineR to do some academic research on sequence analysis.
I want to find a pattern defined as someone being in the target company, then leaving, then coming back to the target company.
(Simplified) I've defined state A as the target company, B as a company outside the industry, and C as a company inside the industry.
So what I want to do is find sequences with the specific patterns A-B-A or A-C-A.
After looking at this question (Strange number of subsequences?) and reading the user guide, especially the following passages:
4.3.3 Subsequences
A sequence u is a subsequence of x if all successive elements ui of u appear in x in the same
order, which we simply denote by u ⊂ x. According to this definition, unshared states can appear
between those common to both sequences u and x. For example, u = S; M is a subsequence of
x = S; U; M; MC.
and
7.3.2 Finding sequences with a given subsequence
The seqpm() function counts the number of sequences that contain a given subsequence and collects
their row index numbers. The function returns a list with two elements. The first element, MTab,
is just a table with the number of occurrences of the given subsequence in the data. Note that
only one occurrence is counted per sequence, even when the sub-sequence appears more than one
time in the sequence. The second element of the list, MIndex, gives the row index numbers of
the sequences containing the subsequence. These index numbers may be useful for accessing the
concerned sequences (example below). Since it is easier to search a pattern in a character string,
the function first translates the sequence data in this format when using the seqconc function with
the TRUE option.
I concluded that seqpm() was the function I needed to get the job done.
So I have sequences like:
A-A-A-A-A-B-B-B-B-B-A-A-A-A-A
Based on the definition of subsequences that I found in the sources mentioned above, I figured I could find that kind of sequence by using:
seqpm(sequence,"ABA")
But that does not happen. In order to find that example sequence I need to input
seqpm(sequence,"ABBBBBA")
which is not very useful for what I need.
So do you guys see where I might've missed something?
How can I retrieve all the sequences that go from A to B and back to A?
Is there a way for me to find sequences that go from A to anything else and then back to A?
Thanks a lot !

The title of the seqpm help page is "Find substring patterns in sequences", and this is what the function actually does. It searches for sequences that contain a given substring (not a subsequence). Seems there is a formulation error in the user's guide.
A solution to find the sequences that contain given subsequences is to convert the state sequences into event sequences with seqecreate, and then use the seqefsub and seqeapplysub functions. I illustrate using the actcal data that ships with TraMineR.
library(TraMineR)
data(actcal)
actcal.seq <- seqdef(actcal[,13:24])
## displaying the first state sequences
head(actcal.seq)
## transforming into event sequences
actcal.seqe <- seqecreate(actcal.seq, tevent = "state", use.labels=FALSE)
## displaying the first event sequences
head(actcal.seqe)
## now searching for the subsequences
subs <- seqefsub(actcal.seqe, strsubseq=c("(A)-(D)","(D)-(B)"))
## and identifying the sequences that contain the subsequences
subs.pres <- seqeapplysub(subs, method="presence")
head(subs.pres)
## we can now, for example, count the sequences that contain (A)-(D)
sum(subs.pres[,1])
## or list the sequences that contain (A)-(D)
rownames(subs.pres)[subs.pres[,1]==1]
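Because seqpm matches substrings, another option for the "A, then anything else, then back to A" case is a regular expression on the concatenated sequences. The following is a minimal base-R sketch, not part of the TraMineR API: it assumes each sequence is already available as one string with a single character per state and no separator, and the vector seq_strings is made-up example data.

```r
# Hypothetical example data: one string per sequence, one character per state
seq_strings <- c(s1 = "AAAAABBBBBAAAAA",
                 s2 = "AAACCCAAA",
                 s3 = "AAABBBCCC",
                 s4 = "BBBAAABBB")

# A, then one or more B, then A again (A-B-A with spells of any length)
has_aba <- grepl("AB+A", seq_strings)

# A, then one or more non-A states, then A again (left and came back)
has_return <- grepl("A[^A]+A", seq_strings)

aba_ids    <- names(seq_strings)[has_aba]
return_ids <- names(seq_strings)[has_return]
```

Here aba_ids contains only s1, while return_ids contains s1 and s2. If the strings contain a separator between states, the patterns would need to account for it.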
Hope this helps.

Related

How can I access data in a nested R list?

I want to learn how to access data from a nested list in R. I am relatively new to the R programming language, so I am unsure how to proceed.
The data is a 'Large list' (947 elements, 654.9 Mb) and takes the form:
The numbers within the data list refer to station numbers, and when I click on one (in RStudio) it looks like this:
I want to know how I can access the data within 'doy', for example. I have tried:
data[[1]]
which returns all the data for the first element of the list (site, location, doy,ltm etc). So clearly the number used within the square brackets is interpreted as an index for the list, as opposed to an identifier for the elements/station in the list.
Then I tried:
data$1
but it returned the error:
Error: unexpected numeric constant in "data$1"
Then I tried:
data[data$1==doy]
But was returned this:
Error: unexpected numeric constant in "data[data$1"
So at this point, I realise that it is not construing the number of the station as a category/factor within the list. It's just reading it as a number. So I thought I'd put some quotes around it to see if that changed what happened:
data[data$"1"=="doy"]
This returned
named list()
But when I looked at it in the environment, it was a list of 0.
I looked at some of the similar questions here on Stack (like: accessing nested lists in R) and tried:
data[data$"1"=="doy",][[1]]
But just got:
Error in data[data$"1" == "doy", ] : incorrect number of dimensions
How can I access this data? It reminds me of a structure in Matlab, but it doesn't seem to be indexed in a similar fashion in R.
Let's look at some ways to do what you want:
data[[1]]
This returns the first element of the list, which is itself a list. You can use the $ subsetting shorthand, but the name of the first element is nonstandard. R prefers names that start with letters and include only alphanumeric characters, periods and underscores. You can escape this behavior with backticks:
data$`1`
If you want to access one of the elements of list 1 in your list of lists, you need to subset further. To get to doy, which is the third element of 1, you can use any of these four forms:
data[[1]][[3]]
data$`1`[[3]]
data[[1]]$doy
data$`1`$doy
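A minimal sketch to check that the four forms agree, using a small made-up list in place of the real data (the station names "1" and "17" and the element values are assumptions for illustration only):

```r
# Hypothetical stand-in for the real data: station IDs used as element names
data <- list(
  `1`  = list(site = "x", location = c(51.5, -0.1), doy = 1:3),
  `17` = list(site = "y", location = c(48.9, 2.4),  doy = 4:6)
)

# All four forms return the same vector for station "1"
doy_a <- data[[1]][[3]]
doy_b <- data$`1`[[3]]
doy_c <- data[[1]]$doy
doy_d <- data$`1`$doy

# Indexing by name selects the station called "17";
# data[[17]] would instead ask for the 17th element of the list
doy_17 <- data[["17"]]$doy

# Pull doy from every station at once
all_doy <- lapply(data, function(station) station$doy)
```

Indexing by name is usually the safer choice when the station number is known, because the position of a station in the list need not match its number.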
One way (in addition to what Ben Norris has shown):
our_list[[c("1", "doy")]]
Reproducible example data (please provide next time)
our_list <- list(`1` = list(site = "x", doy = 3))

How can I return a vector with a dataframe inside in R?

Here is a challenge for you: I was trying to make a tic-tac-toe game in R. First, the players enter their names, and the game should check whether each name already exists in a file called "Players.txt" (if the file does not exist, the game creates it); if the name exists, the game asks for a new one. The last part is that the game should record the players' scores (each piece played subtracts 5 points from the 100 that the player has at the beginning of the game). The problem is that when a player wins, the game shows the following error: "Error in table[location_name1, 3]: Incorrect number of dimension in R".
A vector can either be atomic or a list. Atomic vectors can only contain elements of one and the same data type. That means, you are "accidentally" creating a list with
vector=c(win,name1,name2,table)
with the result that each column of the data frame becomes a separate entry of that list.
You can solve it with
vector <- list(win, name1, name2, table)
vector is still a list but now it has the format I believe you want.
Having done that you still get errors. The reason is that these assignments fail.
location_name1=which(grepl(name1,table$gamers))
location_name2=which(grepl(name2,table$gamers))
They return an empty vector because earlier in the code you set win=vector[1]... table=vector[4]. Since vector is now a list, you have to subset it accordingly. That means you have to change the statements to table=vector[[4]].
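The difference between c() and list(), and between [ ] and [[ ]], can be seen in a small sketch. The objects below are made-up stand-ins for those in the question, and players_table is a hypothetical name used to avoid masking base R's table function:

```r
# Hypothetical stand-ins for the objects in the question
win   <- TRUE
name1 <- "ann"
name2 <- "bob"
players_table <- data.frame(gamers = c("ann", "bob"),
                            games  = c(1, 1),
                            scores = c(100, 100),
                            stringsAsFactors = FALSE)

flat   <- c(win, name1, name2, players_table)     # data frame is split into its columns
nested <- list(win, name1, name2, players_table)  # data frame is kept as one element

n_flat   <- length(flat)    # 6: three scalars plus three columns
n_nested <- length(nested)  # 4

wrapped   <- nested[4]    # single brackets: a list of length 1 holding the data frame
recovered <- nested[[4]]  # double brackets: the data frame itself

location_name1 <- which(recovered$gamers == name1)
```

With single brackets, wrapped$gamers is NULL, which is why the lookups on table$gamers return nothing; with double brackets the lookup finds the row.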
Now you are going to get another problem. The reason is that you treat the column table$scores as text. When you read the data you need to make sure that this column is not interpreted as text. You also have to eliminate all statements that coerce the column into text. Otherwise table[location_name1,3]=table[location_name1,3]+pointsx will fail because you cannot add a number to a string.
For example, you coerce the column into a character column with this statement:
name1 <- data.frame(gamers=name1,games="1",scores="100")
games and scores are strings, not numbers. Another example is the assignment after reading the table from the file. You can make sure that scores are numeric by doing this.
scores <- as.numeric(table[,3])
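As a minimal sketch of both fixes (converting an existing column, and forcing the type when reading the file), assuming a made-up one-row table and a temporary file in place of Players.txt:

```r
# Hypothetical player record with games and scores stored as strings, as in the question
players <- data.frame(gamers = "ann", games = "1", scores = "100",
                      stringsAsFactors = FALSE)
was_text <- is.character(players$scores)

# Convert the columns in place so that arithmetic works
players$games  <- as.numeric(players$games)
players$scores <- as.numeric(players$scores)
players[1, "scores"] <- players[1, "scores"] - 5

# When reading the file back, state the column types explicitly
tmp <- tempfile(fileext = ".txt")   # stand-in for "Players.txt"
write.table(players, tmp, row.names = FALSE)
reloaded <- read.table(tmp, header = TRUE,
                       colClasses = c("character", "numeric", "numeric"))
```

Creating new players with numeric values in the first place, for example games = 1 and scores = 100 without quotes, avoids the conversion step.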
Please get familiar with Rstudio debugging capabilities (https://support.rstudio.com/hc/en-us/articles/205612627-Debugging-with-RStudio). This way you can go through your code line by line and check consequences of each assignment to the data frame.

Cluster analysis of barcoded DNA

I am using barcodes to tag mitochondrial DNA strands prior to PCR. The barcode sequences are not known, but they are 18 nucleotides long and directly follow a known sequence (either CATCAT or TACTAC). Each DNA molecule will get a unique barcode identifier. Once the molecules undergo PCR, I need to cluster the sequences based on their 18-nucleotide barcode, and then align the sequences per barcode.
To use an overly simple example, let's say I have 2 molecules that are going into a PCR reaction:
CATCATBARCODE1SEQUENCE1
TACTACBARCODE2SEQUENCE2
After amplification I have:
CATCATBARCODE1SEQUENCE1
CATCATBARCODE1SEQUENCE1
TACTACBARCODE2SEQUENCE2
TACTACBARCODE2SEQUENCE2
I then want to search the section of sequence at position 6-13 and cluster them based on that window of sequence without changing the rest of the sequence, which would actually just look like what I have above. Then I could perform the alignment on the adjacent sequences.
Any ideas on how I could accomplish this clustering of a window of sequence, without taking into account the rest of the sequence? Thanks.
Overly simplified R code, but seems to do what you ask:
seqs <- c('CATCATBARCODE1SEQUENCE1',
          'CATCATBARCODE1SEQUENCE1',
          'TACTACBARCODE2SEQUENCE2',
          'TACTACBARCODE2SEQUENCE2')
clusters <- list()
for (seq in seqs) {
  barcode <- substr(seq, 7, 14)
  if (!is.null(clusters[[barcode]])) {
    clusters[[barcode]] <- append(clusters[[barcode]], seq)
  } else {
    clusters[[barcode]] <- c(seq)
  }
}
print(clusters)
prints:
$BARCODE1
[1] "CATCATBARCODE1SEQUENCE1" "CATCATBARCODE1SEQUENCE1"
$BARCODE2
[1] "TACTACBARCODE2SEQUENCE2" "TACTACBARCODE2SEQUENCE2"
Assuming you can already obtain sequences starting like [CATCATBARCODEX], what I would do is just process them in Python. If your sequence starts are not the same, then you may need to search for CATCAT and discard those that look to be in the wrong position. There may be some issue if the number of barcodes is very large, but I think for something on the order of 100,000, simple methods should work.
Anyway, once you find the CATCAT, what I would do is build up a dictionary of barcodes and start filtering. Then you can strip off this first part of the sequences and align using whatever method you prefer (I had a barcode project, and using a custom genome with bowtie was convenient).
Let's say you need to find this sequence instead of just starting with it. In Python, a solution would look like this:
my_dict = {}
for seq in seqs:
    idx = seq.find("CATCAT")
    idx2 = seq.find("TACTAC")
    if idx == -1 and idx2 == -1:
        continue
    # here you need to consider the location of idx and idx2: both may be present,
    # the sequence needs to be long enough, etc.
    # As a simple default, use whichever marker occurs first.
    start = min(i for i in (idx, idx2) if i != -1)
    barcode = seq[start + 6:start + 6 + 18]
    # you may want to shorten the barcode or encode it to a string
    if barcode in my_dict:
        my_dict[barcode] += 1
    else:
        my_dict[barcode] = 1
    # remainder of the read after the marker and the barcode
    seq = seq[start + 24:]
In addition to the counting you can 1) append sequences to a fasta file per barcode or 2) assign the barcode as annotation to a large fasta file.
Regardless you probably want to strip down the sequence to simplify the downstream analysis.
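The same idea (find the marker, take the 18 nucleotides after it as the barcode, group the remaining sequence by barcode) can also be sketched in base R. The reads below are made up for illustration, and taking the first marker found as the true one is a simplifying assumption:

```r
# Hypothetical reads: 6-nt marker, 18-nt barcode, then the sequence to align
reads <- c("CATCATAAAACCCCGGGGTTTTACGATTGACA",
           "CATCATAAAACCCCGGGGTTTTACGATTGACA",
           "TACTACGGGGTTTTAAAACCCCTTTGCAGGTC",
           "GGTACTACGGGGTTTTAAAACCCCTTTGCAGGTC",  # marker not at position 1
           "ACGTACGT")                            # no marker: dropped

# 1-based start of the first marker in each read, -1 if absent
marker_pos <- as.integer(regexpr("CATCAT|TACTAC", reads))

# Keep reads that have a marker and are long enough for marker plus barcode
keep  <- marker_pos > 0 & nchar(reads) >= marker_pos + 6 + 18 - 1
reads <- reads[keep]
start <- marker_pos[keep] + 6            # first barcode position

barcode   <- substr(reads, start, start + 17)
remainder <- substring(reads, start + 18)

# One element per barcode, holding the sequences to align for that barcode
clusters <- split(remainder, barcode)
```

split() does the grouping in one step, so no explicit loop or existence check is needed.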

Identifying indices of sequences which contain frequent subsequences

Using TraMineR I can identify frequent subsequences in a dataset of sequences. However, it only gives me a count of how often such a subsequence occurs in the overall dataset, for example that it occurs in 21/22 sequences.
Is there any way of getting indices of exactly which sequences contain a specific frequent subsequence?
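See the function seqeapplysub.
According to its help page, it "checks occurrences of the subsequences subseq among the event sequences and returns the result according to the selected method". With method="presence", the result has one row per sequence and one column per subsequence, so the rows equal to 1 in a given column identify the sequences that contain that subsequence.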

Markov Algorithm for Random Writing

I have a little problem understanding conceptually the structure of a random writing program that takes input in the form of a text file and uses the Markov algorithm to create somewhat sensible output.
The data structure I am using has cases ranging from 0 to 10. In case 0, I count the number of times each letter, symbol or digit appears and base my new text on this to simulate the input. I have already implemented this using a Map type that holds each unique letter in the input text and an array of how many there are in the text. So I can simply ask for the size of the array for the specific letter and create output text easily like this.
But now I need to create cases 1, 2, 3 and so on. Case 1 also holds which letter is most likely to appear after any given letter. Do I need to create 10 separate arrays for these cases, or is there an easier way?
There are a lot of ways to model this. One approach is as you describe, with a multi-dimensional array where each index is the next character in the chain and the stored value is the count.
// Two character sample:
int[][] counts = new int[26][26];
// ... all entries start at zero
// 'a' => 0, 'b' => 1, ... 'z' => 25
// For example for the string 'apple'
// Note: I'm only writing it like this to show what the result is; it should be in a
// loop or function ...
counts['a'-'a']['p'-'a']++;
counts['p'-'a']['p'-'a']++;
counts['p'-'a']['l'-'a']++;
counts['l'-'a']['e'-'a']++;
Then to randomly generate names you would count the total number of outcomes for a given character (e.g. 2 outcomes for 'p' in the previous example) and pick one of the possible outcomes at random, weighted by its count.
For smaller sizes (say up to 4 characters) that should work fine. For anything larger you may start to run into memory issues, since (assuming you're using A-Z) you need 26^N entries for an N-length chain.
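The counting and weighted-pick steps can be sketched for a chain of length one. This is a minimal R illustration under stated assumptions, not the original program: the input text "apple", the seed, and the 20-character cap are arbitrary choices, and the counts table only covers characters that actually occur in the input.

```r
# Hypothetical training text, split into single characters
input_text <- "apple"
chars <- strsplit(input_text, "")[[1]]

# counts[from, to] = how often character `to` directly follows character `from`
counts <- table(from = head(chars, -1), to = tail(chars, -1))

# Pick the next character with probability proportional to its count
next_char <- function(current) {
  weights <- counts[current, ]
  sample(names(weights), size = 1, prob = weights)
}

set.seed(1)
out <- "a"
# Stop at a length cap, or when the last character never had a successor
while (length(out) < 20 && tail(out, 1) %in% rownames(counts)) {
  out <- c(out, next_char(tail(out, 1)))
}
generated <- paste(out, collapse = "")
```

Every pair of adjacent characters in generated is a pair that occurs in the input. For longer chains, the row key would be the previous N characters pasted together rather than a single character, which keeps storage proportional to the contexts actually seen instead of 26^N.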
I wrote something like this a couple of years ago. I think I used random pages from Wikipedia as seed data to generate the weights.
