Number of possible paths - math

Why are there 2^k possible paths from S to T in the following graph?
Can anyone explain?
Note: all directions are from left to right in the figure.

For each pair of parallel paths you can select only one of them at a time.
There are k such pairs of parallel paths, so the total number of combinations is 2 * 2 * ... * 2 (k times) = 2^k.

In counting configurations, it is sometimes helpful to find a way to encode the configurations and then count the possible codes. This works as long as there is a 1-1 correspondence between the objects coded and the codes.
In this case you take either the top branch or the bottom branch at each stage. Code the paths as k-bit binary numbers by writing a 0 at a given bit position if the path takes the lower branch at that stage and a 1 if it takes the upper branch (shown as 'v' and '^' respectively in the listing below). This is clearly a 1-1 correspondence, hence the number of such paths is the number of k-bit numbers, which is 2^k. (Counting the number of k-bit numbers is essentially the same problem, so this isn't exactly a proof, but if you are already familiar with how k-bit numbers work, this should shed light on the path question.)
For example (with k = 4):
path code num
vvvv 0000 0
vvv^ 0001 1
vv^v 0010 2
vv^^ 0011 3
v^vv 0100 4
v^v^ 0101 5
v^^v 0110 6
v^^^ 0111 7
^vvv 1000 8
^vv^ 1001 9
^v^v 1010 10
^v^^ 1011 11
^^vv 1100 12
^^v^ 1101 13
^^^v 1110 14
^^^^ 1111 15
(Generated by the following Python 3 script, if you want to play around with it:)
def pathCodes(k):
    print('path code num')
    for n in range(2**k):
        b = bin(n)[2:]                  # binary string without the '0b' prefix
        b = ('0' * (k - len(b))) + b    # left-pad with zeros to k bits
        s = b.replace('0', 'v')         # 0 -> lower branch
        s = s.replace('1', '^')         # 1 -> upper branch
        print(s, b, n)

pathCodes(4)  # reproduces the table above

Related

How to create an interval file defined by values from another file - for circos imaging of WGS data

I am trying to depict whole-genome sequencing (WGS) data from my parasite using the Circos software.
One of the elements I would like to depict is the regions of the reference genome for which I do not have sequencing data from my parasite.
In order to do this, I used SAMtools to create an mpileup file, from which I extracted the positions where the sequencing depth = 0. I therefore have a file that looks like this:
$chromosome_name $chromosome_position $depth
chr_1 1 0
chr_1 2 0
chr_1 3 0
chr_2 67 0
chr_2 68 0
chr_2 1099 0
chr_2 1100 0
chr_2 1101 0
This means that there are 3 positions in chromosome 1 with no sequence data (depth = 0), namely positions 1, 2 and 3. For chromosome 2, the positions with no data are 67, 68, 1099, 1100 and 1101.
Because my files are enormous (up to 3 million lines) and a lot of the unsequenced positions come in runs, I would like to create an interval file from the above data. Circos also requires such an interval file in order to create tiles. I therefore need to create a new file from the above that looks like this:
$chromosome_name $start_pos $end_pos
chr_1 1 3
chr_2 67 68
chr_2 1099 1101
I have searched a bunch, but I have only found questions about grouping data by pre-defined intervals (e.g. grouping purchases occurring over a period of 6 months, or patients by age).
So if anybody can help me out, I will be extremely happy!
Sidsel
Consider using bedtools, specifically the bedtools merge sub-command:
http://bedtools.readthedocs.io/en/latest/content/tools/merge.html
From this page, it would seem to do what you want:
bedtools merge combines overlapping or "book-ended" features in an interval file into a single feature which spans all of the combined features.
Moreover, you can use the -d option to specify the maximum distance between features to merge:
-d  Maximum distance between features allowed for features to be merged. Default is 0. That is, overlapping and/or book-ended features are merged.
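If you would rather stay in R than convert the depth file to BED format for bedtools, the following is a minimal sketch (not from the original answer; the file names and column names are assumptions) that collapses runs of consecutive zero-depth positions into intervals:
depth <- read.table("zero_depth.txt",
                    col.names = c("chrom", "pos", "depth"))
intervals <- do.call(rbind, lapply(split(depth, depth$chrom), function(d) {
  d <- d[order(d$pos), ]
  run <- cumsum(c(1, diff(d$pos) != 1))  # start a new run whenever positions stop being consecutive
  data.frame(chrom = d$chrom[1],
             start = tapply(d$pos, run, min),
             end   = tapply(d$pos, run, max))
}))
write.table(intervals, "zero_depth_intervals.txt", quote = FALSE,
            sep = "\t", row.names = FALSE, col.names = FALSE)
On the example data above this should produce one line per interval in the chromosome/start/end layout you described.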

Optimizing the sum of a variable in R given a constraint

Using the following dataset:
ID=c(1:24)
COST=c(85,109,90,104,107,87,99,95,82,112,105,89,101,93,111,83,113,81,97,97,91,103,86,108)
POINTS=c(113,96,111,85,94,105,105,95,107,88,113,100,96,89,89,93,100,92,109,90,101,114,112,109)
mydata=data.frame(ID,COST,POINTS)
I need an R function that will consider all combinations of rows where the sum of COST is less than a fixed value - in this case, $500 - and make the optimal selection based on the summed POINTS.
Your help is appreciated.
Since this post is still open, I thought I would give my solution. These kinds of problems are always fun. You could try to brute-force the solution by checking all possible combinations (some 2^24, or over 16 million) one by one. This works because, for each combination, an ID is either in it or not. Thinking in binary, you could use the following code, which was inspired by this post:
#DO NOT RUN THIS CODE
sum_points <- numeric(2^24)  # pre-allocate
for(i in 1:2^24)
  sum_points[i] <- ifelse(sum(as.numeric(intToBits(i))[1:24] * mydata$COST) < 500,
                          sum(as.numeric(intToBits(i))[1:24] * mydata$POINTS),
                          0)
I estimate this would take many hours to run. Improvements could be made with parallelization, etc., but this is still a rather intense calculation. The method also does not scale well: adding one more ID (25 instead of 24) doubles the computation time. Another option is to cheat a little. For example, we know that we have to stay under $500. If we added up the n cheapest items, at what n would we definitely be over $500?
which(cumsum(sort(mydata$COST))>500)
[1] 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
So with any more than 5 IDs chosen we are definitely over $500. What else? Well, we can run a little code over part of the search space, take the max for that portion, and see what that tells us.
sum_points <- 1:10000
for(i in 1:10000)
  sum_points[i] <- ifelse(sum(as.numeric(intToBits(i))[1:24]) < 6,
                          ifelse(sum(as.numeric(intToBits(i))[1:24] * mydata$COST) < 500,
                                 sum(as.numeric(intToBits(i))[1:24] * mydata$POINTS),
                                 0),
                          0)
sum_points[which.max(sum_points)]
[1] 549
So we have to try to get over 549 points with the remaining 2^24 - 10000 choices. But:
which(cumsum(rev(sort(mydata$POINTS)))<549)
[1] 1 2 3 4
Even if we sum the 4 highest point values we still don't beat 549, so there is no reason to search selections of 4 or fewer IDs. Combined with the cost bound above, the number of IDs to choose must be greater than 4 but less than 6, so it must be exactly 5. Instead of looking at all 16 million choices, we can just look at all of the ways to choose 5 out of 24, which happens to be 24 choose 5:
num <- 1:choose(24,5)
combs <- combn(24,5)
sum_points <- 1:length(num)
for(i in num)
  sum_points[i] <- ifelse(sum(mydata[combs[,i],]$COST) < 500,
                          sum(mydata[combs[,i],]$POINTS),
                          0)
which.max(sum_points)
[1] 2582
sum_points[2582]
[1] 563
We have a new max on the 2582nd iteration. To retrieve the IDs:
mydata[combs[,2582],]$ID
[1] 1 3 11 22 23
And to verify that nothing went wrong:
sum(mydata[combs[,2582],]$COST)
[1] 469 #less than 500
sum(mydata[combs[,2582],]$POINTS)
[1] 563 #what we expected.
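As a cross-check (not part of the approach above), this selection problem is a 0/1 knapsack, so an integer-programming solver should reach the optimum directly. A rough sketch, assuming the lpSolve package is acceptable:
library(lpSolve)
res <- lp(direction = "max",
          objective.in = mydata$POINTS,               # maximize total points
          const.mat = matrix(mydata$COST, nrow = 1),  # single cost constraint
          const.dir = "<=",
          const.rhs = 500,
          all.bin = TRUE)                             # each ID is either in or out
mydata$ID[res$solution == 1]           # should recover the same five IDs
sum(mydata$POINTS[res$solution == 1])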

Interpreting results of sequence mining with SPADE

Please help me interpret the results of the SPADE frequent sequence mining algorithm (http://www.inside-r.org/packages/cran/arulesSequences/docs/cspade).
With support = 0.05:
s1 <- cspade(x, parameter = list(support = 0.05), control = list(verbose = TRUE))
I get, for example, these sequences:
4 <{C},{V}> 0.15644023
5 <{C,V}> 0.73127376
These look like the same sequence, don't they? How does <{C},{V}> semantically differ from <{C,V}>? Any real-life examples?
From the SPADE paper (M. J. Zaki. (2001). SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine Learning Journal, 42, 31-60):
"An input-sequence C is said to contain another sequence A, if A is a subsequence of the input-sequence C. The support or frequency of a sequence is the total number of input-sequences in the database D that contain A."
Then, for example, if:
sequence support
1 <{C}> 1.00000000
Does it mean that sequence <{C}> is contained in all sequences in database D, correct?
Complete output that I get from my data:
> as(s1, "data.frame")
sequence support
1 <{C}> 1.00000000
2 <{L}> 0.20468120
3 <{V}> 0.73127376
4 <{C},{V}> 0.15644023
5 <{C,V}> 0.73127376
6 <{L,V}> 0.07882027
7 <{V},{V}> 0.13343431
8 <{C,V},{V}> 0.13343431
9 <{C},{C},{V}> 0.05558572
10 <{C,L,V}> 0.07882027
11 <{V},{C,V}> 0.13343431
12 <{C},{C,V}> 0.15644023
13 <{C,V},{C,V}> 0.13343431
14 <{C},{C},{C,V}> 0.05558572
15 <{C},{L}> 0.05738619
16 <{C,L}> 0.20468120
17 <{C},{C,L}> 0.05738619
18 <{C},{C}> 0.22128547
19 <{L},{C}> 0.06233031
20 <{V},{C}> 0.16921494
21 <{V},{V},{C}> 0.05047012
22 <{V},{C},{C}> 0.06233031
23 <{C,V},{C}> 0.16921494
24 <{C},{V},{C}> 0.05781487
25 <{C,V},{V},{C}> 0.05047012
26 <{V},{C,V},{C}> 0.05047012
27 <{C},{C,V},{C}> 0.05781487
28 <{C,V},{C,V},{C}> 0.05047012
29 <{C,L},{C}> 0.06233031
30 <{C},{C},{C}> 0.07882027
31 <{C,V},{C},{C}> 0.06233031
> summary(s1)
set of 31 sequences with
most frequent items:
C V L (Other)
27 22 8 8
most frequent elements:
{C} {V} {C,V} {L} {C,L} (Other)
21 12 12 3 3 2
element (sequence) size distribution:
sizes
1 2 3
7 13 11
sequence length distribution:
lengths
1 2 3 4 5
3 9 12 6 1
summary of quality measures:
support
Min. :0.05047
1st Qu.:0.05760
Median :0.07882
Mean :0.17121
3rd Qu.:0.16283
Max. :1.00000
includes transaction ID lists: FALSE
mining info:
data ntransactions nsequences support
x 61000 34991 0.05
When using the SPADE algorithm, remember that you are also dealing with temporal data (i.e., you know the order or time of occurrence of each item).
These look like the same sequence, don't they? How does <{C},{V}>
semantically differ from <{C,V}>? Any real-life examples?
In your example, <{C},{V}> means that item C occurred first and then item V; <{C,V}> means that items C and V occurred at the same time.
Then, for example, if:
sequence support
1 <{C}> 1.00000000
Does it mean that sequence <{C}> is contained in all sequences in
database D, correct?
An item with a support value of 1 means that it appears in ALL input sequences in the database (in a market basket analysis example, every customer bought it at some point).
Hope this helps.
These look like the same sequence, don't they? How does <{C},{V}>
semantically differ from <{C,V}>? Any real-life examples?
As user2552108 pointed out, {C,V} implies that C and V occurred at the same time. In practice this can be used to encode multi-dimensional sequential data. For example, suppose that C was Canada and V was Vancouver. Now this could have been something like:
[{C,V,M,peanut,butter,maple_syrup}, ... , {}]
In this case, your frequent itemsets can contain not only single-element sets such as {C}, {V}, {U}, {W}, or {X}, but also sets of length > 1 (items that appeared simultaneously, i.e. at the same time).
For this reason, the elements of transactions/sequences are defined as sets and not single items.
Does it mean that sequence <{C}> is contained in all sequences in
database D, correct?
That's correct!
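If it helps to see the distinction on a toy input, here is a small sketch (the file name and values are made up) using the basket format that arulesSequences reads: customer 1 buys C and V in the same event, customer 2 buys C and then V in a later event.
library(arulesSequences)
# columns: sequenceID eventID SIZE items...
cat("1 10 2 C V\n",
    "2 10 1 C\n",
    "2 20 1 V\n",
    file = "toy.txt", sep = "")
toy <- read_baskets("toy.txt", info = c("sequenceID", "eventID", "SIZE"))
as(cspade(toy, parameter = list(support = 0.5)), "data.frame")
Here <{C,V}> should be supported only by customer 1 (C and V in the same event), while <{C},{V}> should be supported only by customer 2 (C strictly before V).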

Grouping words that are similar

CompanyName <- c('Kraft', 'Kraft Foods', 'Kfraft', 'nestle', 'nestle usa', 'GM', 'general motors', 'the dow chemical company', 'Dow')
I want to get either:
CompanyName2
Kraft
Kraft
Kraft
nestle
nestle
general motors
general motors
Dow
Dow
But would be absolutely fine with:
CompanyName2
1
1
1
2
2
3
3
I see algorithms for getting the distance between two words, so if I had just one weird name I would compare it to all other names and pick the one with the lowest distance. But I have thousands of names and want to group them all into clusters.
I do not know anything about elastic search, but would one of the functions in the elastic package or some other function help me out here?
I'm sorry there's no programming here. I know. But this is way out of my area of normal expertise.
Solution: use string distance
You're on the right track. Here is some R code to get you started:
install.packages("stringdist") # install this package
library("stringdist")
CompanyName <- c('Kraft', 'Kraft Foods', 'Kfraft', 'nestle', 'nestle usa', 'GM', 'general motors', 'the dow chemical company', 'Dow')
CompanyName = tolower(CompanyName) # otherwise case matters too much
# Calculate a string distance matrix; LCS is just one option
?"stringdist-metrics" # see others
sdm = stringdistmatrix(CompanyName, CompanyName, useNames=T, method="lcs")
Let's take a look. These are the calculated distances between strings, using the Longest Common Subsequence (LCS) metric (try others, e.g. cosine or Levenshtein). They all measure, in essence, how many characters the strings have in common. Their pros and cons are beyond this Q&A. You might look into something that gives a higher similarity value to two strings that contain the exact same substring (like "dow").
sdm[1:5,1:5]
kraft kraft foods kfraft nestle nestle usa
kraft 0 6 1 9 13
kraft foods 6 0 7 15 15
kfraft 1 7 0 10 14
nestle 9 15 10 0 4
nestle usa 13 15 14 4 0
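If you want to compare metrics, just recompute the matrix with another method; for example (the q-gram size here is only an illustrative choice):
sdm_cosine = stringdistmatrix(CompanyName, CompanyName, useNames=T, method="cosine", q=2)
sdm_cosine[1:5,1:5]  # cosine distance over character 2-grams; 0 means identical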
Some visualization
# Hierarchical clustering
sdm_dist = as.dist(sdm) # convert to a dist object (you essentially already have distances calculated)
plot(hclust(sdm_dist))
If you want to group them explicitly into k groups, use k-medoids.
library("cluster")
clusplot(pam(sdm_dist, 5), color=TRUE, shade=F, labels=2, lines=0)
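To turn either fit into the explicit group labels you asked for, something like this should work (k = 5 is just an assumed number of groups):
groups = cutree(hclust(sdm_dist), k=5)  # cut the dendrogram into 5 clusters
# or, from the k-medoids fit: groups = pam(sdm_dist, 5)$clustering
data.frame(CompanyName, group=groups)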

Testing recurrences and orders in strings matlab

I have observed nurses during 400 episodes of care and recorded the sequence of surface contacts in each.
I categorised the surfaces into 5 groups (1:5) and calculated the probability of touching any one of them (PDF):
PDF=[ 0.255202629 0.186199343 0.104052574 0.201533406 0.253012048]
I then predicted some 1000 sequences using:
for i = 1:1000  % 1000 different nurses
    seq(i,:) = randsample(1:5, max(observed_seq_length), true, PDF);
end
e.g.
seq = 1 5 2 3 4 2 5 5 2 5
stairs(1:max(observed_seq_length), seq)
hold all
I'd like to compare my empirical sequences with my predicted ones. What would you suggest as the best strategy or property to look at?
Regards,
EDIT: I added r as a tag, as this question may fall more naturally under that category due to its nature rather than the MATLAB code.
