Viterbi algorithm post-treatment - r

I am running scripts for a school project on Hidden Markov Models with 2 hidden states. At some point, I use the Viterbi algorithm to find the most likely sequence of hidden states. My output is a vector like this:
c("1","1","1","2","2","1", "1","1","1","2", "2","2")
I would like to count how many runs (consecutive subsequences) of each state there are, and also record their lengths and starting positions. The output would be, for example, a matrix like this:
State Length Starting_Position
1 3 1
2 2 4
1 4 6
2 3 10
Is there an R command or package that can do this easily?
Thank you.
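One possible approach, which is a sketch and not something taken from this thread, is base R's rle() (run-length encoding). It returns the value and length of each run, and the starting positions follow from a cumulative sum of the lengths. The names states, runs and result are only illustrative.

```r
# Decoded state sequence from the question
states <- c("1", "1", "1", "2", "2", "1", "1", "1", "1", "2", "2", "2")

# rle() collapses the vector into runs of identical consecutive values
runs <- rle(states)

# A run starts one position after the previous run ends, so the starts are
# the cumulative sum of 1 followed by all lengths except the last
result <- data.frame(
  State = runs$values,
  Length = runs$lengths,
  Starting_Position = cumsum(c(1L, head(runs$lengths, -1)))
)

print(result)

# Number of runs per state
print(table(result$State))
```

For this vector the data frame has four rows, matching the table in the question.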

Related

How do I simulate choosing a random player at first, and then repeating that sequence?

I am trying to simulate a game in R. For that I need to choose a random player out of n_players who begins in the first round. Then the other n_players follow in a random order in the first round. However, in the next rounds the same order of players as in the first round must be kept. Does anyone have an idea on how to do this?
Create a sequence of numbers, say n=10, from 1 up to n.
x<-1:10
Think of these as the tag numbers of the players. You can then use R's sample function (see ?sample) to create another sequence containing the same numbers in a randomly shuffled order.
y<-sample(x,10,replace=FALSE)
Now your y variable is the order in which your players are selected one by one.
You can access an individual player by indexing the vector; for example, y[1] is the player who begins.
Finally, reuse the same vector y in every subsequent round so that the order of players stays the same.
Test run:
x<-1:10
#[1] 1 2 3 4 5 6 7 8 9 10
y<-sample(x,10,replace=FALSE)
#[1] 2 4 1 8 9 7 5 6 10 3
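As a minimal sketch of how the shuffled order could be reused across rounds: the names n_players, n_rounds, turn_order and schedule are illustrative assumptions, and set.seed() is only there to make the run reproducible.

```r
set.seed(1)     # only for reproducibility of this example

n_players <- 5  # assumed number of players
n_rounds <- 3   # assumed number of rounds

# Draw the order once: the first element is the randomly chosen starting
# player, and the rest follow in random order
turn_order <- sample(n_players)

# Repeat the same order in every round (one row per round)
schedule <- matrix(rep(turn_order, times = n_rounds),
                   nrow = n_rounds, byrow = TRUE)

for (round in seq_len(n_rounds)) {
  cat("Round", round, ":", schedule[round, ], "\n")
}
```

Because sample() is called only once, every row of schedule is identical to turn_order.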

R : fast checking conditions in multiple rows dataframe for each group

I'm looking to optimise the following problem [simplified version here]:
I have two data frames, the first contains the information.
user_id game_id score ON
1 1 450 1
1 2 200 1
1 3 400 1
2 1 225 1
2 2 150 1
2 3 200 1
The second contains the conditions.
game_id game_id_ref req_score type
2 1 150 1
3 1 200 1
1 1 400 2
3 2 175 1
The conditions should be evaluated on the information data frame in the following way.
The conditions with type == 1 describe TURN ON conditions: a game can only TURN ON if the score on the game given by game_id_ref is >= req_score. The first row of the conditions should therefore be read as: the game with game_id == 2 can only TURN ON for user X when they have a score of 150 or higher on the game with game_id == 1.
The conditions with type == 2 describe TURN OFF conditions: a game must be TURNED OFF if the score on the game given by game_id_ref is >= req_score. The third row of the conditions should therefore be read as: for user X the game with game_id == 1 must be TURNED OFF when they have a score of 400 or higher on the game with game_id == 1.
In the information data frame I have a column ON which indicates whether a game is ON for a particular user. The default is 1 [the game is ON], but this is before evaluating the conditions. I am looking for the fastest way to evaluate the conditions for each user separately and return the same information data frame, now with ON = 0 wherever a user's game fails a type 1 condition or meets a type 2 condition.
So for this mock example, the required output would be:
user_id game_id score ON
1 1 450 0
1 2 200 1
1 3 400 1
2 1 225 1
2 2 150 1
2 3 200 0
My current solution has been to create a separate function in which I check this by applying a for_loop over all the rows of the conditions table [approx 100 conditions], and using this function in a group_map function, on the information data frame grouped by the user_ids [approx 350000 unique users]. While this works relatively ok [approx 10 min], I would like to know if someone has a much faster solution for this.
Thanks!
Probably you can fine-tune your solution to be a bit faster in R but without seeing your code it is hard to say. Your solution sounds quite reasonable to me already.
However, if you have so much data, this kind of problem can be solved faster with SQL. I assume you already use some data management system. SQL uses indexing to make JOIN very fast, which you can never achieve in R (unless you write a database management system in R, not recommended). After you join your information and condition data frame on the game_id column, you can check all the conditions which should be fast. That can also be done in SQL by the way.
Sorry if this is not the answer you expected. If you are not familiar with SQL and would rather not learn a new technology for a question like this, please post your code so far so we can see what could be improved.
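The join-then-check idea described above can also be sketched in base R with merge(), without a per-user loop. This is only an illustration on the mock data, not the asker's code. It assumes every user has a score for each referenced game; a user with no row for game_id_ref would simply produce no check for that condition. The names info, cond, ref_scores, checks and bad are illustrative.

```r
# Mock data from the question
info <- data.frame(
  user_id = c(1, 1, 1, 2, 2, 2),
  game_id = c(1, 2, 3, 1, 2, 3),
  score   = c(450, 200, 400, 225, 150, 200),
  ON      = 1
)
cond <- data.frame(
  game_id     = c(2, 3, 1, 3),
  game_id_ref = c(1, 1, 1, 2),
  req_score   = c(150, 200, 400, 175),
  type        = c(1, 1, 2, 1)
)

# Each user's scores, relabelled so they can be joined on the referenced game
ref_scores <- setNames(info[c("user_id", "game_id", "score")],
                       c("user_id", "game_id_ref", "ref_score"))

# One row per (condition, user) pair with the user's score on the reference game
checks <- merge(cond, ref_scores, by = "game_id_ref")

# Type 1 is violated when the score is too low, type 2 when it is high enough
met <- checks$ref_score >= checks$req_score
checks$violated <- ifelse(checks$type == 1, !met, met)

# Switch off every (user, game) pair with at least one violated condition
bad <- unique(checks[checks$violated, c("user_id", "game_id")])
off <- paste(info$user_id, info$game_id) %in% paste(bad$user_id, bad$game_id)
info$ON[off] <- 0

print(info)
```

All conditions are evaluated in one vectorised comparison, so the cost is dominated by the join. Whether this beats the grouped loop at 350000 users would need to be measured.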

x must be numeric while trying to create histogram in R

I am new to R and need to generate some graphs. I imported an Excel file and need to create a histogram of one column. My import code is:
file=read.xlsx('femalecommentcount.xlsx',1,header=FALSE)
col=file[2]
col looks like this (excerpt):
36961 1
36962 1
36963 7
36964 1
36965 2
36966 1
36967 1
36968 4
36969 1
36970 6
36971 3
36972 1
36973 6
36974 6
36975 2
36976 2
36977 8
36978 2
36979 1
36980 1
36981 1
The first column is the row number; I'm not sure how to remove it. The second column is the data I want a histogram of. The hist() function requires a vector, and I'm not sure how to convert to one.
If I simply call
hist(col)
it gives:
Error in hist.default(col) : 'x' must be numeric
I have tried a few commands found on the internet, but they didn't work.
My eventual goal is just to generate a good histogram (and maybe other charts) of that column, to get a good understanding of the spread of my data.
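It should be col=file[[2]] or col=file[, 2] (solution given in a comment). file[2] returns a one-column data frame, whereas file[[2]] and file[, 2] return the column itself as a vector, which is what hist() needs.
The data must also be imported so that the column is numeric; if it comes in as text or as a factor, hist() will give the same error.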
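The difference between the two kinds of indexing can be sketched without the Excel file. The data frame below is a made-up stand-in for the imported sheet, and the column names id and count are assumptions.

```r
# Stand-in for the imported sheet (hypothetical values)
file <- data.frame(id = 1:6, count = c(1, 7, 1, 2, 4, 6))

sub_df <- file[2]    # single brackets: still a data frame with one column
col    <- file[[2]]  # double brackets: the column as a plain vector

print(is.data.frame(sub_df))  # a data frame, which hist() rejects
print(is.numeric(col))        # a numeric vector, which hist() accepts

# plot = FALSE returns the bin counts without opening a graphics device;
# drop it to draw the histogram
h <- hist(col, plot = FALSE)
print(h$counts)
```

If the imported column turns out to be character or factor, as.numeric(as.character(col)) is one common way to convert it, provided all values really are numbers.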

Understanding an RLE coverage value

Using R and bioconductor.
I'm not sure how to understand an integer rle that you'd get from functions like coverage() such as this
integer-Rle of length 3312 with 246 runs
Lengths: 25 34 249 16 7 11 16 ... 2 32 2 26 34 49
Values : 0 1 0 1 2 3 2 ... 1 2 1 0 1 0
Okay, so I get that it represents coverage of one range by other ranges, in this case reads of an experiment over a given range. What do the 'runs' mean? What about the 'Lengths' and 'Values'? I thought that maybe Lengths represent a position and Values represent the number of times it's covered, but then why would there be multiples of the same position, such as 2 above? Why would they be out of order?
I ask because I'm using
sum(coverage)
to compare the coverage of one range to another of a different length and I was wondering if that was appropriate.
Probably it's better to ask about Bioconductor packages on the Bioconductor support site.
The interpretation is that there is a run of 25 nucleotides with 0 coverage, then a run of 34 nucleotides with coverage 1 (i.e., a single read), then another run of 249 nucleotides with no coverage; then things start to get interesting as multiple reads overlap positions. From the summary line at the top of the output, the coverage vector spans 3312 nucleotides, maybe a single transcript? If you were to
plot(as.integer(coverage))
you'd get a quick plot of how coverage varies along the length of the transcript.
Maybe sum(coverage) is appropriate; a more usual metric is to count reads rather than coverage, e.g., with GenomicRanges::summarizeOverlaps() illustrated in this DESeq2 work flow in the context of RNA-seq.
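To make the Lengths/Values reading concrete, here is a sketch using base R's "rle" class rather than the Bioconductor Rle class, with only the first seven runs shown in the question (the elided middle of the output is not reconstructed). The names run_lengths, run_values, cov_rle and per_base are illustrative.

```r
# First seven runs from the printed output
run_lengths <- c(25L, 34L, 249L, 16L, 7L, 11L, 16L)
run_values  <- c(0L, 1L, 0L, 1L, 2L, 3L, 2L)

# Build a base-R run-length encoding by hand
cov_rle <- structure(list(lengths = run_lengths, values = run_values),
                     class = "rle")

# Expanding it gives one coverage value per position
per_base <- inverse.rle(cov_rle)
print(length(per_base))  # number of positions described by these runs

# Summing per-position coverage is the same as summing lengths * values
total <- sum(cov_rle$lengths * cov_rle$values)
print(total)

# Dividing by the number of positions gives a mean depth, which does not
# grow just because a region is longer
print(total / length(per_base))
```

The sum grows with region length, so when comparing regions of different lengths the mean depth may be the more comparable quantity.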
This might help to understand the concept of RLE: https://www.youtube.com/watch?v=ypdNscvym_E
Here is an easy example:
> x <- IRanges(start=c(-2L, 1L, 3L),
+ width=c( 5L, 4L, 6L))
> x
IRanges of length 3
start end width
[1] -2 2 5
[2] 1 4 4
[3] 3 8 6
> coverage(x)
integer-Rle of length 8 with 2 runs
Lengths: 4 4
Values : 2 1
The output means that the first 4 positions (1 to 4) are each covered by 2 ranges, and the next 4 positions (5 to 8) are each covered by a single range. Positions at or below 0 are ignored.
The length is the total number of positions described, here 8.
The runs are the stretches of consecutive positions that share the same coverage value. Here there are two: a run of 4 positions with coverage 2, followed by a run of 4 positions with coverage 1.
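The same result can be checked by hand in base R, without IRanges, by counting for each position how many ranges contain it and then run-length encoding the counts. This is only a sketch for illustration; the names start, width, end, positions and depth are assumptions.

```r
# The three ranges from the example
start <- c(-2L, 1L, 3L)
width <- c(5L, 4L, 6L)
end   <- start + width - 1L  # 2, 4, 8

# Only positions from 1 upwards are considered, as in the output above
positions <- seq_len(max(end))

# For each position, count the ranges that contain it
depth <- sapply(positions, function(p) sum(start <= p & p <= end))
print(depth)

# Collapse identical neighbours into runs
print(rle(depth))
```

The per-position counts collapse into two runs of length 4, with values 2 and 1, in agreement with the coverage(x) output.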

Assembling a combinatorial tree in R

I'm trying to build a Combinatorial Tree model, where the root node is the first 6 digits. The 2nd level is all possible combinations of 5 digits taken from the parent's 6 digits. The 3rd level is then all possible combinations of 4 digits taken from its parent's digits. This pattern continues until the 6th level, which is composed of only single digits.
So my question is there a way to generate a tree in this fashion? I've been searching for examples of basic trees in R and have wound up empty handed. Any advice would be much appreciated. Thank you
You can get something like that using this:
f <- function(x) {
  if (length(x) == 1) return(c(value = x))
  c(list(value = x), child = lapply(seq(x), function(i) f(x[-i])))
}
Example:
> f(1:3)
$value
[1] 1 2 3
$child1
$child1$value
[1] 2 3
$child1$child1
value
3
$child1$child2
value
2
$child2
$child2$value
[1] 1 3
$child2$child1
value
3
$child2$child2
value
1
$child3
$child3$value
[1] 1 2
$child3$child1
value
2
$child3$child2
value
1
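As a sketch of how the resulting structure could be inspected, the helper below walks the nested list and counts its nodes. f is repeated from the answer so the block runs on its own; count_nodes is a hypothetical helper, not part of the original answer.

```r
# Tree builder from the answer above
f <- function(x) {
  if (length(x) == 1) return(c(value = x))
  c(list(value = x), child = lapply(seq(x), function(i) f(x[-i])))
}

# Count nodes: a leaf is a plain named vector, an inner node is a list whose
# elements other than "value" are its children
count_nodes <- function(node) {
  if (!is.list(node)) return(1L)
  children <- node[names(node) != "value"]
  1L + sum(vapply(children, count_nodes, integer(1)))
}

tree3 <- f(1:3)
print(count_nodes(tree3))   # 1 root + 3 pairs + 6 single digits

tree6 <- f(1:6)
print(names(tree6))         # "value" plus one child per dropped digit
print(count_nodes(tree6))
```

Note that the same combination can be reached along several paths (for example, 3 appears under both 2 3 and 1 3 above), so the tree holds repeated subsets and grows factorially with the number of digits.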

Resources