Markov Algorithm for Random Writing - markov

I have a little trouble understanding, conceptually, the structure of a random writing program that takes a text file as input and uses the Markov algorithm to create somewhat sensible output.
The data structure I am using is a set of cases ranging from 0 to 10. In case 0, I count the number of times each letter, symbol, or digit appears and base my new text on those counts to simulate the input. I have already implemented this using a Map type that holds each unique letter in the input text and an array of its occurrences in the text. So I can simply ask for the size of the array for a specific letter and easily create output text this way.
But now I need to create cases 1, 2, 3, and so on. Case 1 also holds which letter is most likely to appear after any given letter. Do I need to create 10 separate arrays for these cases, or is there an easier way?

There are a lot of ways to model this. One approach is as you describe, with a multi-dimensional array where each index is the following character in the chain and the final result is the count.
// Two-character sample:
int[][] counts = new int[26][26];
// ... initialize all entries to zero
// 'a' => 0, 'b' => 1, ... 'z' => 25
// For example, for the string "apple".
// Note: I'm only writing it out like this to show what the result is; it should be in a
// loop or function ...
counts['a'-'a']['p'-'a']++;  // a -> p
counts['p'-'a']['p'-'a']++;  // p -> p
counts['p'-'a']['l'-'a']++;  // p -> l
counts['l'-'a']['e'-'a']++;  // l -> e
Then, to randomly generate names, you would count the total number of outcomes for a given character (e.g., 2 outcomes for 'p' in the previous example) and pick a weighted random number for one of the possible outcomes.
For smaller sizes (say up to 4 characters) that should work fine. For anything larger you may start to run into memory issues, since (assuming you're using A-Z) you need 26^N entries for an N-length chain.
I wrote something like this a couple of years ago. I think I used random pages from Wikipedia as seed data to generate the weights.
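One way to avoid both the 26^N table and a separate array per case is to store only the contexts that actually occur in the input. The following is a minimal Python sketch of that idea, not the original implementation: the names `build_model` and `generate` are hypothetical, and the "order" parameter plays the role of the case number from the question (order 0 is plain letter frequency, order 1 looks at one preceding character, and so on).

```python
import random
from collections import Counter, defaultdict


def build_model(text, order):
    """Map each `order`-character context to a Counter of the characters that follow it."""
    model = defaultdict(Counter)
    for i in range(len(text) - order):
        context = text[i:i + order]          # "" when order == 0
        model[context][text[i + order]] += 1
    return model


def generate(model, order, length, seed_text, rng=random):
    """Extend seed_text by weighted random picks until `length` or a dead end is reached."""
    out = seed_text
    while len(out) < length:
        context = out[len(out) - order:]     # last `order` characters
        followers = model.get(context)
        if not followers:                    # context never seen in the input
            break
        chars = list(followers)
        weights = list(followers.values())
        out += rng.choices(chars, weights=weights)[0]
    return out


model = build_model("apple", 1)
print(dict(model["p"]))                      # followers of 'p' with their counts
print(generate(model, 1, 5, "a", random.Random(0)))
```

The same two functions cover every case from 0 to 10 by changing `order`, and memory grows with the number of distinct contexts in the input rather than with 26^N.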

Related

Handle a string return from R to Tableau and SPLIT it

I connect Tableau to R and execute an R function for recommending products. When R finishes, the return value is a string containing all product details, like below:
ID|Existing_Prod|Recommended_Prod\nC001|NA|PROD008\nC002|PROD003|NA\nF003|NA|PROD_ABC\nF004|NA|PROD_ABC1\nC005|PROD_ABC2|NA\nC005|PRODABC3|PRODABC4
(Each line separated by \n indicating end of line)
On Tableau, I display the calculated field which is as below:
ID|Existing_Prod|Recommended_Prod
C001|NA|PROD008
C002|PROD003|NA
F003|NA|PROD_ABC
F004|NA|PROD_ABC1
C005|PROD_ABC2|NA
C005|PRODABC3|PRODABC4
The above data reaches Tableau through a calculated field as a single string. I need to split it into three columns, using the pipe character ('|') as the separator.
I used Split function on the calculated field :
SPLIT([R_Calculated_Field],'|',1)
SPLIT([R_Calculated_Field],'|',2)
SPLIT([R_Calculated_Field],'|',3)
But the error says "SPLIT function cannot be applied on Table calculations", which is self-explanatory. Are there any alternatives to solve this? I googled for best practices on integrating R and Tableau, and all I could find was simple k-means clustering code.
Make sure you understand how partitioning and addressing work for table calcs. Table calcs pass vectors of arguments to the R script and receive a single vector in response. The cardinality of those vectors depends on the partitioning of the table calc. You can view that by editing the table calc and clicking Specific Dimensions. The fields that are not checked determine the partitioning, and thus the cardinality of the arguments you send to and receive from R.
This means it might be tricky to map your problem onto this infrastructure. Not necessarily impossible. It was designed to send a series of vector arguments with one cell per partitioning dimension, say, Manufacturer and get back one vector with one result per Manufacturer (or whatever combination of fields partition your data for the table calc). Sounds like you are expecting an arbitrary length list of recommendations. It shouldn’t be too hard to have your R script turn the string into a vector before returning, but the size of the vector has to make sense.
As an example of an approach that fits this model more easily, say you had a Tableau view that had one row per Product (and you had N products) - and some other aggregated measure fields in the view per Product. (In Tableau speak, the view’s level of detail is at the Product level.)
It would be straightforward to pass those measures as a series of argument vectors to R - each vector having N values, and then have R return a vector of reals of length N where the value returned at each location was a recommender score for the product at that position. (Which is why the ordering aka addressing of the vectors also matters)
Then you could filter out low scoring products from the view and visually distinguish highly recommended products.
So the first step to understanding R integration is to understand how table calcs operate with partitioning and addressing and to think in terms of vectors of fixed lengths passed in both directions.
If this model doesn’t support your use case well, you might be able to do something useful with URL actions or the JavaScript API.
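To make the "turn the string into a vector" step concrete, here is a small sketch of the reshaping itself, written in Python purely for illustration. It does not use any Tableau or R API; `split_columns` is a hypothetical helper, and the payload is the sample string from the question. It only shows how one delimited string becomes three equal-length columns, which is the shape the returned vectors would need.

```python
def split_columns(payload, sep="|"):
    """Split a newline/pipe-delimited string into a dict of equal-length columns."""
    lines = [line for line in payload.split("\n") if line]
    header = lines[0].split(sep)
    rows = [line.split(sep) for line in lines[1:]]
    # One list per header field, in row order
    return {name: [row[i] for row in rows] for i, name in enumerate(header)}


payload = (
    "ID|Existing_Prod|Recommended_Prod\n"
    "C001|NA|PROD008\nC002|PROD003|NA\nF003|NA|PROD_ABC\n"
    "F004|NA|PROD_ABC1\nC005|PROD_ABC2|NA\nC005|PRODABC3|PRODABC4"
)
columns = split_columns(payload)
for name, values in columns.items():
    print(name, values)
```

Each column here has six entries, so whichever one is returned would have to line up with six rows in the view's partition for the result to make sense.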

How do I export a custom list of numbers and letters to Excel from R?

To help with some regular label-making and printing I need to do, I am looking to write a script that allows me to enter a range of sequential numbers (some with string identifiers) that I can export with a specific format to Excel. For example, if I entered the range '1:16', I am looking for an output in Excel exactly as:
Example Excel Output
For each unique sequential number (i.e., 1 to 16) the first five rows must be labeled with a 'U', the next three rows with an 'F', and the last two rows must be the number alone. The final exported matrix will be n columns x 21 rows, where n will vary depending on the number range I enter.
My main problem is in writing to Excel. I can't find out how to customize this output and write to specific rows and columns as in the example above. I am limited to 'openxlsx' since I work on a corporate secure workstation. Here is what I have so far:
Example Code
Any help you may have would be very appreciated, thanks in advance!

Cluster analysis of barcoded DNA

I am using barcodes to tag mitochondrial DNA strands prior to PCR. The barcode sequences are not known, but they are 18 nucleotides long and directly follow a known sequence (either CATCAT or TACTAC). Each DNA molecule will get a unique barcode identifier. Once the molecules undergo PCR, I need to cluster the sequences based on their 18-nucleotide barcode, and then align the sequences per barcode.
To use an overly simple example, let's say I have 2 molecules that are going into a PCR reaction:
CATCATBARCODE1SEQUENCE1
TACTACBARCODE2SEQUENCE2
After amplification I have:
CATCATBARCODE1SEQUENCE1
CATCATBARCODE1SEQUENCE1
TACTACBARCODE2SEQUENCE2
TACTACBARCODE2SEQUENCE2
I then want to search the section of sequence at position 6-13 and cluster them based on that window of sequence without changing the rest of the sequence, which would actually just look like what I have above. Then I could perform the alignment on the adjacent sequences.
Any ideas on how I could accomplish this clustering of a window of sequence, without taking into account the rest of the sequence? Thanks.
Overly simplified R code, but seems to do what you ask:
seqs <- c('CATCATBARCODE1SEQUENCE1',
          'CATCATBARCODE1SEQUENCE1',
          'TACTACBARCODE2SEQUENCE2',
          'TACTACBARCODE2SEQUENCE2')
clusters <- list()
for (seq in seqs) {
  barcode <- substr(seq, 7, 14)
  if (!is.null(clusters[[barcode]])) {
    clusters[[barcode]] <- append(clusters[[barcode]], seq)
  } else {
    clusters[[barcode]] <- c(seq)
  }
}
print(clusters)
prints:
$BARCODE1
[1] "CATCATBARCODE1SEQUENCE1" "CATCATBARCODE1SEQUENCE1"
$BARCODE2
[1] "TACTACBARCODE2SEQUENCE2" "TACTACBARCODE2SEQUENCE2"
Assuming you can already obtain sequences that start like [CATCATBARCODEX], I would just process them in Python. If your sequence starts are not all the same, you may need to search for CATCAT and discard the sequences where it appears in the wrong position. There may be some issues if the number of barcodes is very large, but I think for something on the order of 100,000, simple methods should work.
Anyway, once you find the CATCAT, I would build up a dictionary of barcodes and start filtering. Then you can strip off this first part of the sequences and align using whatever method you like (I had a barcode project, and using a custom genome with bowtie was convenient).
Let's say you need to find this sequence instead of just starting with it. In Python, a solution would look like this:
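my_dict = {}
for seq in seqs:
    idx = seq.find("CATCAT")
    idx2 = seq.find("TACTAC")
    if idx == -1 and idx2 == -1:
        continue
    # here you need to consider the location of idx and idx2: both may be present,
    # the sequence needs to be long enough, etc.
    # simple choice for this sketch: use whichever marker occurs first
    start = min(i for i in (idx, idx2) if i != -1)
    barcode = seq[start + 6:start + 6 + 18]
    # you may want to shorten the barcode or encode it to a string
    if barcode in my_dict:
        my_dict[barcode] += 1
    else:
        my_dict[barcode] = 1
    # strip the 6-nt marker and the 18-nt barcode from the front
    seq = seq[start + 24:]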
In addition to the counting you can 1) append sequences to a fasta file per barcode or 2) assign the barcode as annotation to a large fasta file.
Regardless you probably want to strip down the sequence to simplify the downstream analysis.
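As a sketch of option 1 above, the counting dictionary can be swapped for one that collects the trimmed sequences per barcode, ready to be written out or aligned group by group. The function name `cluster_by_barcode` and the `barcode_len` parameter are assumptions made for this illustration, as is the choice to use whichever marker occurs first; the toy data uses 8-character barcodes to match the example in the question, whereas real data would use the default of 18.

```python
from collections import defaultdict

MARKERS = ("CATCAT", "TACTAC")
MARKER_LEN = 6


def cluster_by_barcode(seqs, barcode_len=18):
    """Group the part of each read after marker+barcode, keyed by the barcode."""
    clusters = defaultdict(list)
    for seq in seqs:
        hits = [pos for pos in (seq.find(m) for m in MARKERS) if pos != -1]
        if not hits:
            continue                          # no known marker in this read
        start = min(hits) + MARKER_LEN        # first base of the barcode
        barcode = seq[start:start + barcode_len]
        if len(barcode) < barcode_len:
            continue                          # read too short to hold a full barcode
        clusters[barcode].append(seq[start + barcode_len:])
    return dict(clusters)


seqs = [
    "CATCATBARCODE1SEQUENCE1",
    "CATCATBARCODE1SEQUENCE1",
    "TACTACBARCODE2SEQUENCE2",
    "TACTACBARCODE2SEQUENCE2",
]
clusters = cluster_by_barcode(seqs, barcode_len=8)
for barcode, reads in clusters.items():
    print(barcode, reads)
```

Each value in `clusters` is then a list of reads that share one barcode and can be handed to the aligner separately.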

Find specific patterns in sequences

I'm using the R package TraMineR to do some academic research on sequence analysis.
I want to find a pattern defined as someone being in the target company, then going out, then coming back to the target company.
(Simplified) I've defined state A as the target company, B as an outside-industry company, and C as an inside-industry company.
So what I want to do is find sequences with the specific patterns A-B-A or A-C-A.
After looking at this question (Strange number of subsequences?) and reading the user guide, especially the following passages:
4.3.3 Subsequences
A sequence u is a subsequence of x if all successive elements u_i of u appear in x in the same
order, which we simply denote by u ⊂ x. According to this definition, unshared states can appear
between those common to both sequences u and x. For example, u = S; M is a subsequence of
x = S; U; M; MC.
and
7.3.2 Finding sequences with a given subsequence
The seqpm() function counts the number of sequences that contain a given subsequence and collects
their row index numbers. The function returns a list with two elements. The first element, MTab,
is just a table with the number of occurrences of the given subsequence in the data. Note that
only one occurrence is counted per sequence, even when the sub-sequence appears more than one
time in the sequence. The second element of the list, MIndex, gives the row index numbers of
the sequences containing the subsequence. These index numbers may be useful for accessing the
concerned sequences (example below). Since it is easier to search a pattern in a character string,
the function first translates the sequence data in this format when using the seqconc function with
the TRUE option.
I concluded that seqpm() was the function I needed to get the job done.
So I have sequences like:
A-A-A-A-A-B-B-B-B-B-A-A-A-A-A
And from the definition of subsequences that I found in the mentioned sources, I figured I could find that kind of sequence by using:
seqpm(sequence,"ABA")
But that does not happen. In order to find that example sequence I need to input
seqpm(sequence,"ABBBBBA")
which is not very useful for what I need.
So do you see where I might have missed something?
How can I retrieve all the sequences that go from A to B and back to A?
Is there a way for me to find sequences that go from A to anything else and then back to A?
Thanks a lot!
The title of the seqpm help page is "Find substring patterns in sequences", and this is what the function actually does. It searches for sequences that contain a given substring (not a subsequence). Seems there is a formulation error in the user's guide.
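The difference between the two notions can be illustrated outside TraMineR. The short Python sketch below is only an illustration of the definitions, with hypothetical helper names (`is_subsequence`, `collapse_runs`); it is not how seqpm is implemented. It uses the example sequence from the question with the dashes removed.

```python
from itertools import groupby


def is_subsequence(pattern, sequence):
    """True if the states of `pattern` appear in `sequence` in order, gaps allowed."""
    remaining = iter(sequence)
    return all(state in remaining for state in pattern)


def collapse_runs(sequence):
    """Reduce each run of identical states to one state, e.g. 'AABBA' -> 'ABA'."""
    return "".join(state for state, _ in groupby(sequence))


sequence = "AAAAABBBBBAAAAA"

print("ABA" in sequence)                  # substring test: contiguous match required
print(is_subsequence("ABA", sequence))    # subsequence test: gaps allowed
print("ABA" in collapse_runs(sequence))   # substring test on the collapsed spells
```

The substring test fails on the raw sequence because "ABA" never appears contiguously, while the subsequence test succeeds. Collapsing repeated states first is another way to make a substring search behave as the question intended for the A-B-A case.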
One way to find the sequences that contain given subsequences is to convert the state sequences into event sequences with seqecreate, and then use the seqefsub and seqeapplysub functions. I illustrate using the actcal data that ships with TraMineR.
library(TraMineR)
data(actcal)
actcal.seq <- seqdef(actcal[,13:24])
## displaying the first state sequences
head(actcal.seq)
## transforming into event sequences
actcal.seqe <- seqecreate(actcal.seq, tevent = "state", use.labels=FALSE)
## displaying the first event sequences
head(actcal.seqe)
## now searching for the subsequences
subs <- seqefsub(actcal.seqe, strsubseq=c("(A)-(D)","(D)-(B)"))
## and identifying the sequences that contain the subsequences
subs.pres <- seqeapplysub(subs, method="presence")
head(subs.pres)
## we can now, for example, count the sequences that contain (A)-(D)
sum(subs.pres[,1])
## or list the sequences that contain (A)-(D)
rownames(subs.pres)[subs.pres[,1]==1]
Hope this helps.

Stopping a large number of zeros being printed (not scientific notation)

What I'm trying to achieve is to have all printed numbers display at maximum 7 digits. Here are examples of what I want printed:
0.000000 (versus the actual number which is 0.000000000029481.....)
0.299180 (versus the actual number which is 0.299180291884922.....)
I've had success with the latter types of numbers by using options(scipen=99999) and options(digits=6). However, the former example will always print a huge number of zeros followed by five non-zero digits. How do I stop this from occurring and achieve my desired result? I also do not want scientific notation.
I want this to apply to ALL printed numbers in EVERY context. For example if I have some matrix, call it A, and I print this matrix, I want every element to just be 6-7 digits. I want this to be automatic for every print in every context; just like using options(digits=6) and options(scipen=99999) makes it automatic for every context.
You can define a new print method for the type you wish to print. For example, if all your numbers are doubles, you can create
print.double <- function(x, ...) { print(sprintf("%.6f", x), quote = FALSE); invisible(x) }
Now, when you print a double (or a vector of doubles), the function print.double() will be called instead of print.default().
You may have to create similar functions print.integer(), print.complex(), etc., depending on the types you need to print.
To return to the default print method, simply delete the function print.double().
Are all your numbers < 1? You could try a simple sprintf( "%.6f", x ). Otherwise you could try wrapping things to sprintf based on the number of digits; check ?sprintf for other details.
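The "%.6f" format used in both answers follows the C printf convention, which several languages share. As a quick check of what it does to the two numbers from the question, here is a small Python sketch (Python is used only for illustration; `fixed6` is a hypothetical helper, and this does not change R's global print behaviour).

```python
def fixed6(x):
    """Format a number with exactly six digits after the decimal point, no exponent."""
    return f"{x:.6f}"


values = [0.000000000029481, 0.299180291884922]
formatted = [fixed6(v) for v in values]
print(formatted)  # ['0.000000', '0.299180']
```

Note that this fixes the number of decimals, not the total number of digits, so a large value such as 12345.6789 would still print with more than seven digits.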
