Hidden Markov Models methods for selecting optimal number of states - r

Package RHmm
I have a vector that I fit with an HMM in an attempt to select the optimal number of states for a hidden Markov model:
x<-c(-0.0961421466,-0.0375458485,0.0681121271,0.0259201028,0.0016780785,0.0311860542,
0.0067940299,0.0126520055,0.0357599812,0.0007679569,0.0409759326,0.0560839083,-0.0272581160,-0.0439501404,0.0321578353,0.0196158110,-0.0097262133,-0.0226182376,0.0119897380,-0.0099522863,-0.0359443106,-0.0039363349,-0.0476283592,-0.0383203835,-0.0518624079,0.0187455678,0.0950535435,0.0057115192,-0.0307805051,-0.0272725295,-0.0254645538,-0.0102565781,-0.0267986024,-0.0482906267,-0.0256826510,-0.0414746754,-0.0470666997,0.0284912760,0.1021992517,0.0875572274,0.0064152031,0.0200731787,-0.0091688456,-0.0575608699,-0.0442028942,-0.0277449185,-0.0115369429,0.0084710328,0.0745290085,0.0159369842,-0.0784550401,-0.0934970644,-0.0978390888,0.0160188869,0.0275268626,-0.0552651617,0.0033928140,0.0468507896,0.0374087653,0.0521167410,-0.0177752833,-0.0592673076,0.0514406681,0.0847486437,0.0738066194,-0.0098354049,-0.0572274292,0.0478305465,0.0096885221,-0.0445535022,-0.0153455265,-0.0105375508,0.0100704249,-0.0035215994,0.0243363762,0.0504443519,0.0570023276,0.0395103033,-0.0612817210,-0.0557737453,-0.0273657697,-0.0220077940,0.0083501817,0.0275081574,0.0323161331,0.0385741087,0.0175820844,-0.0410599399,-0.0071019642,0.0431060115,-0.0107360128,-0.0007280372,0.0360799385,-0.0061620858,0.0164458899,-0.0050461344,-0.0578381588,0.0097198169,0.0027277926,-0.0127642317,
-0.0037062560, -0.0045482803, 0.0367596953, 0.0021176710,-0.0319243533,-0.0194663776,0.0091915981,0.0061495737,-0.0090424506,0.0127655251,0.0161735008,0.0193814765,-0.0208605478,-0.0598025722,0.0022554035,0.0473633792,0.0247213549,-0.0063206694,-0.0201626938,0.0207952819,0.0379032576,0.0151612333,0.0038692090,0.0111271847,0.0497851603,0.0273431360,-0.0172488883,-0.0038909126,0.0264670631,-0.0065249612,-0.0467169856,-0.0255090099,0.0082489658, 0.0352569415,0.0272149172,0.0074228928,-0.0040191315,-0.0170611558,-0.0309531801,-0.0327952044,-0.0239372287,-0.0212792531,-0.0132712774,0.0086866983,-0.0007553260,0.0107026497,0.0065106253,-0.0321813990,-0.0081734233,0.0296845524,0.0268925281,-0.0025994962,-0.0038915206, -0.0126335449,0.0040244308,0.0227324065,0.0114903822,-0.0031516422,0.0031563335,0.0137143092,0.0026222849,0.0035802606,0.0111382363,-0.0008037881, -0.0282458124, 0.0056121633, 0.0254201390,0.0033781147,-0.0166139097,-0.0124559340,0.0088520417,0.0072600174, -0.0050320069,-0.0114740312,-0.0066160556, -0.0042080799, -0.0205501042,0.0027078715, 0.0122158472,-0.0206261771,-0.0267682015,-0.0107602258,0.0088477499,0.0165057256, 0.0106637013,0.0115216769,0.0278296526,0.0026376283,-0.0231543960,-0.0141964203)
library(RHmm)
# Random train/test partition (about 66% of the observations for training)
nhs <- c(2, 3, 4) # candidate numbers of hidden states
train <- runif(length(x)) <= 0.66
# Log-likelihood of the held-out observations under each fitted model
error <- numeric(length(nhs))
for (i in seq_along(nhs)) {
  fit <- HMMFit(x[train], dis = "NORMAL", nStates = nhs[i], asymptCov = FALSE)
  pred <- forwardBackward(fit, x[!train])
  error[i] <- pred$LLH
}
nhs[which.max(error)] # optimal number of hidden states (maximum log-likelihood)
Every time I run this to find the best number of states for the hidden Markov model, I get a different number of states, I believe because the model is trained on a newly drawn random subset each time. This does not happen if I just fit the model to all of the data.
# score proportional to the probability that a sequence is generated by a given model
nhs <- c(2, 3, 4)
error <- numeric(length(nhs))
for (i in seq_along(nhs)) {
  fit <- HMMFit(x, dis = "NORMAL", nStates = nhs[i], asymptCov = FALSE)
  VitPath <- viterbi(fit, x)
  error[i] <- fit[[3]]
}
error[is.na(error)] <- 10000
nhs[which.min(error)] # Optimal number of hidden states (method min AIC)
However, the results are very different. Which one is better? On the one hand, the first approach gives a model that I can test on new samples. On the other hand, the second gives the best fit on the samples already seen. With the first approach, the resulting number of states also changes whenever I repeat the test, because the training/test split is random. In that case, what training/test proportion should I use to be confident that the chosen number of states generalizes?
What additional methods could I use to select an optimal number of states?
Many thanks

Recurrence quantification analysis (RQA) is a method of nonlinear data analysis that quantifies the number and duration of recurrences of a dynamical system, as represented by its state space trajectory.
The RQA measures can be computed in windows moving along the main diagonal of the recurrence plot (RP). This makes it possible to study their time dependence and to detect transitions: vertical or horizontal structures indicate chaos-chaos transitions, and diagonal structures indicate chaos-order or order-chaos transitions. The lengths of the diagonal lines in an RP are directly related to the ratio of determinism or predictability inherent in the system.
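As a rough illustration of the quantities described above, the following base R sketch builds a recurrence matrix for a scalar series and computes the recurrence rate and a determinism ratio from the diagonal line lengths. The function names, the threshold `eps`, the minimum line length of 2 and the sine test series are all assumptions made for this example; no phase space embedding is applied, so it is a simplification rather than a full RQA implementation.

```r
# Recurrence matrix: TRUE where states i and j are closer than eps
recurrence_matrix <- function(x, eps) {
  abs(outer(x, x, "-")) <= eps
}

# Lengths of all diagonal lines above the main diagonal
diag_line_lengths <- function(R) {
  n <- nrow(R)
  lens <- integer(0)
  for (k in 1:(n - 1)) {
    d <- R[cbind(1:(n - k), (1 + k):n)]   # k-th upper diagonal
    r <- rle(d)
    lens <- c(lens, r$lengths[r$values])  # keep runs of recurrent points only
  }
  lens
}

x_demo <- sin(seq(0, 8 * pi, length.out = 200))   # illustrative periodic series
R <- recurrence_matrix(x_demo, eps = 0.1)

rr <- mean(R)                                     # recurrence rate
lens <- diag_line_lengths(R)
det <- sum(lens[lens >= 2]) / sum(lens)           # share of points on lines of length >= 2
c(recurrence_rate = rr, determinism = det)
```

A strongly periodic series like this one should give a determinism ratio close to 1, while a noisy series would give a lower value.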
Another possibility is to define these measures separately for each diagonal parallel to the main diagonal. This approach enables the study of time delays, unstable periodic orbits and, when applied to cross recurrence plots, the assessment of similarities between processes. Measures based on diagonal structures can detect chaos-order transitions, whereas measures based on vertical (horizontal) structures can detect chaos-chaos transitions.
The cross recurrence plot (CRP) is the equivalent of cross phase analysis in wavelets. A CRP is a graph that shows all the times at which a state in one dynamical system occurs simultaneously in a second dynamical system. In other words, the CRP reveals all the times when the phase space trajectory of the first system visits roughly the same area of phase space as the trajectory of the second system. It is this last analysis that can help determine the optimal number of hidden states.
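For the in-sample comparison in the question (choosing the number of states by an information criterion), it may help to see how AIC and BIC penalize the log-likelihood. The sketch below is plain R and does not call RHmm. The log-likelihood values and the series length are made up for illustration, and the parameter count assumes a K-state HMM with univariate Gaussian emissions and a free initial distribution.

```r
# Free parameters of a K-state HMM with univariate Gaussian emissions:
# K*(K-1) transition probabilities, K-1 initial probabilities, K means, K variances
n_params <- function(K) K * (K - 1) + (K - 1) + 2 * K

nhs   <- c(2, 3, 4)
llh   <- c(450, 460, 465)   # made-up log-likelihoods, for illustration only
n_obs <- 200                # assumed series length

aic <- -2 * llh + 2 * n_params(nhs)
bic <- -2 * llh + log(n_obs) * n_params(nhs)

data.frame(nStates = nhs, params = n_params(nhs), AIC = aic, BIC = bic)
nhs[which.min(aic)]   # number of states preferred by AIC
nhs[which.min(bic)]   # number of states preferred by BIC
```

With these invented numbers the two criteria disagree, because BIC penalizes each extra parameter more heavily than AIC once the series has more than a handful of observations.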


Code syntax in calculating posterior distribution in WinBUGS

Recently I read "The BUGS Book – A Practical Introduction to Bayesian Analysis" to learn WinBUGS. The way WinBUGS describes the derivation of the posterior distribution confuses me.
Let's take Example 4.1.1 in this book to illustrate:
Suppose we observe the number of deaths y in a given hospital for a
high-risk operation. Let n denote the total number of such
operations performed and suppose we wish to make inferences regarding
the underlying true mortality rate, $\theta$.
The code of WinBUGS is:
y <- 10 # the number of deaths
n <- 100 # the total number of such operations
#########################
y ~ dbin(theta,n) # likelihood, also a parametric sampling distribution
logit(theta) <- logit.theta # normal prior for the logistic transform of theta
logit.theta ~ dnorm(0,0.368) # precision = 1/2.71
The author said that:
The software knows how to derive the posterior distribution and
subsequently sample from it.
My question is:
Which part of the code tells WinBUGS which parameter I want the posterior distribution of?
This question may seem silly, but without reading the background first, I truly cannot tell from the code above which parameter is the focus (e.g., theta or y?).
Below are some of my thoughts (as a WinBUGS beginner):
I think the following three features of the WinBUGS code style confuse me:
(1) The code does not follow "a specific sequence". For example, why is logit.theta ~ dnorm(0,0.368) not placed before logit(theta) <- logit.theta?
(2) Repeated variables. For example, why were the last two lines not reduced to one line: logit(theta) ~ dnorm(0,0.368)?
(3) Variables are defined in more than one place. For example, y is defined twice: y <- 10 and y ~ dbin(theta, n). This is explained in Appendix A of the book (i.e., However, a check has been built in so that when finding a logical node that also features as a stochastic node, a stochastic node is created with the calculated values as fixed data), yet I still cannot grasp its meaning.
BUGS is a declarative language. For the most part, statements aren't executed in sequence, they define different parts of the model. BUGS works on models that can be represented by directed acyclic graphs, i.e. those where you put a prior on some components, then conditional distributions on other components given the earlier ones.
It's a fairly simple language, so I think logit(theta) ~ dnorm(0, 0.368) is just too complicated for it.
The language lets you define a complicated probability model and declare observations of certain components in it. Once you declare an observation, the model that BUGS samples from is the original full model conditioned on that observation. y <- 10 defines observed data. y ~ dbin(theta,n) is part of the model.
The statement n <- 100 could be either: for fixed constants like n, it doesn't really matter which way you think of it. Either the model says that n is always 100, or n has an undeclared prior distribution not depending on any other parameter, and an observed value of 100. These two statements are equivalent.
Finally, your big question: Nothing in the code above says which parameter you want to look at. BUGS will compute the joint posterior distribution of every parameter. n and y will take on their fixed values, theta and logit.theta will both be simulated from the posterior. In another part of your code (or by using the WinBUGS menus) you can decide which of those to look at.
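What "the full model conditioned on the observation" means can be checked numerically for this one-parameter example. The sketch below is plain R, not BUGS, and uses a grid approximation rather than the MCMC sampling that WinBUGS performs. The grid range and resolution are arbitrary choices for illustration. Note that BUGS parameterizes `dnorm` by precision, whereas R's `dnorm` takes a standard deviation.

```r
y <- 10    # observed number of deaths
n <- 100   # number of operations

# Grid over logit(theta); the prior is normal with precision 0.368
logit_theta <- seq(-8, 8, length.out = 4001)
theta       <- plogis(logit_theta)
prior       <- dnorm(logit_theta, mean = 0, sd = sqrt(1 / 0.368))

# Likelihood of the observed y at each grid value of theta
lik <- dbinom(y, size = n, prob = theta)

# Posterior is proportional to prior times likelihood; normalize over the grid
post <- prior * lik
post <- post / sum(post)

post_mean <- sum(theta * post)   # posterior mean of theta
post_mean
```

Both `theta` and `logit_theta` get a posterior here, which mirrors the point above: the model defines a joint posterior, and you choose afterwards which quantities to summarize.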

Three-step method LCA in R (poLCA). Posterior probabilities from inclusive LCA?

As recommended by Bray, Lanza and Tan (2015), I’d like to use the three-step method to classify individuals into classes using posterior probabilities from an inclusive LCA (an LCA that includes covariates). However, the inclusive model is very different from the non-inclusive model when I include all variables of interest.
The conditional probabilities are completely different, as is the number of cases per class. Therefore, the interpretation of the profiles or patterns changes completely from the non-inclusive model (step 1) when I use the posterior probabilities of the inclusive LCA to assign the cases.
My question is: am I doing something wrong? Is it normal to get these changes? Maybe the procedure isn't correct. The model itself stops making sense when I look at the item conditional probabilities of each class.
These are the steps I took:
First, I performed an LCA to study profiles of sexual risk behaviors (using 6 variables) and to analyze their association with different types of drug use, gender and age (the 4-class model seemed the best choice).
library(poLCA)
z <- cbind(sexrisk1, sexrisk2, sexrisk3, sexrisk4, sexrisk5, sexrisk6) ~ 1
lc4 <- poLCA(z, MyData, nclass = 4, nrep = 10)
Then I included all variables of interest as covariates for an “appropriate” posterior analysis (as recommended by Bray, Lanza and Tan (2015)).
f <- cbind(sexrisk1, sexrisk2, sexrisk3, sexrisk4, sexrisk5, sexrisk6)~ drug1+drug2+drug3+gender+age
lc4.cov <- poLCA(f, MyData, nclass = 4,nrep=10)
Once the inclusive model was fitted, I used the predicted classes and posterior probabilities to assign cases to classes (I think poLCA does this via maximum-probability assignment, but I am not sure).
table(lc4.cov$predclass)
write.csv(cbind(MyData$code, lc4.cov$posterior), 'new.data.csv')
(NOTE: increasing nrep for both models (inclusive and non-inclusive) reduced the differences in the posterior probabilities.)
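The maximum-probability (modal) assignment mentioned above can be reproduced by hand from a posterior matrix, which makes it easy to check what the predicted classes correspond to. The sketch below uses a small made-up posterior matrix in place of `lc4.cov$posterior`; the values are hypothetical and only illustrate the mechanics.

```r
# Hypothetical posterior membership probabilities: 4 individuals, 3 classes
posterior <- matrix(c(0.80, 0.15, 0.05,
                      0.10, 0.30, 0.60,
                      0.45, 0.50, 0.05,
                      0.20, 0.20, 0.60), ncol = 3, byrow = TRUE)

# Modal assignment: each individual goes to the class with the highest posterior
predclass <- apply(posterior, 1, which.max)

# Largest posterior per individual; low values flag uncertain assignments
max_prob <- apply(posterior, 1, max)

table(predclass)
mean(max_prob)   # average certainty of the assignments
```

Comparing `mean(max_prob)` between the inclusive and non-inclusive models is one informal way to see whether the class assignment has become less clear-cut.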

Preventing a Gillespie SSA Stochastic Model From Running Negative

I have produced a stochastic model of infection (parasitic worm) using a Gillespie SSA. The model uses the "GillespieSSA" package (https://cran.r-project.org/web/packages/GillespieSSA/index.html).
In short, the code models a population split into discrete compartments. Movement between compartments depends on user-defined rate equations. The SSA algorithm calculates the number of events produced by each rate equation in a given time step (tau) and updates the population accordingly, repeating the process up to a given time point. The problem is that the number of events is assumed to be Poisson distributed (Poisson(rate[i]*tau)), so an error occurs when a rate is negative, which happens when population numbers become negative.
library(GillespieSSA)
# Parameter Values
sir.parms <- c(deltaHinfinity=0.00299, CHi=0.00586, deltaH0=0.0854, aH=0.5,
muH=0.02, SigmaW=0.1, SigmaM =0.8, SigmaL=104, phi=1.15, f = 0.6674,
deltaVo=0.0166, CVo=0.0205, alphaVo=0.5968, beta=52, mbeta=7300 ,muV=52, g=0.0096, N=100)
# Initial Population Values
sir.x0 <- c(W=20,M=10,L=0.02)
# Rate Equations
sir.a <- c("((deltaH0+deltaHinfinity*CHi*mbeta*L)/(1+CHi*mbeta*L))*mbeta*L*N"
,"SigmaW*W*N", "muH*W*N", "((1/2)*phi*f)*W*N", "SigmaM*M*N", "muH*M*N",
"(deltaVo/(1+CVo*M))*beta*M*N", "SigmaL*L*N", "muV*L*N", "alphaVo*M*L*N", "(aH/g)*L*N")
# Population change for each event
sir.nu <- matrix(c(+0.01,0,0,
-0.01,0,0,
-0.01,0,0,
0,+0.01,0,
0,-0.01,0,
0,-0.01,0,
0,0,+0.01/230,
0,0,-0.01/230,
0,0,-0.01/230,
0,0,-0.01/230,
0,0,-0.01/32),nrow=3,ncol=11,byrow=FALSE)
runs <- 10
set.seed(1)
# Data Frame of output
sir.out <- data.frame(time=numeric(),W=numeric(),M=numeric(),L=numeric())
# Multiple runs and combining data and SSA methods
for(i in 1:runs){
sim <- ssa(sir.x0,sir.a,sir.nu,sir.parms, method="ETL", tau=1/12, tf=140, simName="SIR")
sim.out <- data.frame(time=sim$data[,1],W=sim$data[,2],M=sim$data[,3],L=sim$data[,4])
sim.out$run <- i
sir.out <- rbind(sir.out,sim.out)
}
Thus, the rates are computed and the model updates the population values at each time step, with the data stored in a data frame and then appended to the previous runs. However, when population levels get very low, the number of events that reduce a population can exceed the number in the compartment. One remedy is to make the time step very small, but this greatly increases the run time of the simulation.
My question: is there a way to augment the code so that, as the data are calculated at each time step, any negative population values are converted to 0?
I have tried working on this problem, but I can only come up with methods that alter the values once the simulation is complete, with the negative values still causing issues in the runs themselves.
E.g.
sir.out$L[sir.out$L < 0] <- 0
Any help would be appreciated.
I believe the problem is the method you set ("ETL") in the ssa function. The ETL method can produce negative numbers. You can try the "OTL" method instead, which is based on "Efficient step size selection for the tau-leaping simulation method". It has a few more parameters that you can tweak, but the basic command is:
ssa(sir.x0,sir.a,sir.nu,sir.parms, method="OTL", tf=140, simName="SIR")
Or the direct method, which will not produce negative numbers at all:
ssa(sir.x0,sir.a,sir.nu,sir.parms, method="D", tf=140, simName="SIR")
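To show what clamping inside the loop would look like, here is a hand-rolled explicit tau-leaping loop for a single compartment with one death process. It does not use or modify GillespieSSA; the function name `tau_leap_clamped`, the rate and the starting value are hypothetical and chosen only for illustration. Clamping at zero keeps every propensity non-negative, but it also discards events and so biases the dynamics near zero, which is why the step-size-controlled or direct methods above are usually preferable.

```r
set.seed(1)

# Explicit tau-leaping for dX = -1 per event, with propensity rate * X
tau_leap_clamped <- function(x0, rate, tau, tf) {
  times <- seq(0, tf, by = tau)
  x <- numeric(length(times))
  x[1] <- x0
  for (k in seq_along(times)[-1]) {
    # Number of death events in this step (Poisson with mean rate * X * tau)
    events <- rpois(1, rate * x[k - 1] * tau)
    # Clamp inside the loop so the next propensity is never negative
    x[k] <- max(x[k - 1] - events, 0)
  }
  data.frame(time = times, x = x)
}

out <- tau_leap_clamped(x0 = 20, rate = 2, tau = 1 / 12, tf = 10)
head(out)
min(out$x)   # never below zero
```

The same idea would apply to a multi-compartment model by taking `pmax(state, 0)` after each update.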

Preprocess data in R

I'm using R to create a logistic regression classifier.
Here is a code sample:
library(ROCR)
DATA_SET <- read.csv('E:/1.csv')
classOneCount= 4000
classZeroCount = 4000
sample.churn <- sample(which(DATA_SET$Class==1),classOneCount)
sample.nochurn <- sample(which(DATA_SET$Class==0),classZeroCount )
train.set <- DATA_SET[c(sample.churn,sample.nochurn),]
test.set <- DATA_SET[c(-sample.churn,-sample.nochurn),]
full.logit <- glm(Class~., data = train.set, family = binomial)
It works fine, but I would like to preprocess the data to see whether that improves the classification model.
What I would like to do is divide the continuous input variables into intervals. Let's say that one variable is height in centimeters, stored as a float.
Sample values of height:
183.23
173.43
163.53
153.63
193.27
and so on. I would like to split it into, let's say, 3 different intervals: small, medium, large.
I want to do this for all variables in my set; there are 32 variables.
What's more, at the end I would like to see the correlation between the values of the variables (these intervals) and the resulting class.
Is this clear?
Thank you very much in advance.
A classification model creates a decision boundary, and existing algorithms are rather good at estimating it. Let's assume that you have one variable, height, and a linear decision boundary. Your algorithm can then decide where to put the decision boundary by estimating the error on the training set. If you quantize the variable into a few intervals, your algorithm has fewer places to put the boundary (information is lost). It will likely perform worse on such a coarsened dataset than on the original one. Quantization could help if your learning algorithm suffers from high variance (is overfitting the data), but then you could also try getting more training examples, using a smaller subset of features, or using an algorithm with regularization and increasing the regularization parameter.
There are also many open questions about how to choose the number of intervals and how to divide the data into them: should all intervals be equally frequent, be of equal width, or contain values that are as similar as possible within each interval?
If you just want to experiment, you can use software such as the free version of RapidMiner Studio (it can read CSV and Excel files and has some quick quantization options) to convert your data.
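The equal-width and equal-frequency choices mentioned above can also be tried directly in base R with `cut()`. The sketch uses the five sample heights from the question; the class labels in `class_demo` are made up purely to show the cross-tabulation and do not come from the original data.

```r
height <- c(183.23, 173.43, 163.53, 153.63, 193.27)
labels3 <- c("small", "medium", "large")

# Equal-width intervals: the observed range is split into 3 bins of the same width
width_bins <- cut(height, breaks = 3, labels = labels3)

# Equal-frequency intervals: bin edges are the tertiles of the data
edges <- quantile(height, probs = c(0, 1 / 3, 2 / 3, 1))
freq_bins <- cut(height, breaks = edges, labels = labels3, include.lowest = TRUE)

# Hypothetical class labels, only to illustrate relating bins to the outcome
class_demo <- c(1, 0, 0, 1, 1)
table(width_bins, class_demo)
```

To apply the same binning to every numeric column of a data frame, one option is `lapply()` over the columns with the same `cut()` call.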

Determining optimum number of clusters for k-means with a large dataset

I have a matrix of 62 columns and 181408 rows that I am going to cluster using k-means. Ideally, I would like a method for identifying the optimum number of clusters. I have tried the gap statistic using clusGap from the cluster package (reproducible code below), but this produces several error messages relating to the size of the vector (122 GB) and memory.limit problems in Windows, and an "Error in dist(xs) : negative length vectors are not allowed" in OS X. Does anyone have any suggestions for techniques that will work for determining the optimum number of clusters with a large dataset? Or, alternatively, how can I make my code work (without taking several days to complete)? Thanks.
library(cluster)
inputdata<-matrix(rexp(11247296, rate=.1), ncol=62)
clustergap <- clusGap(inputdata, FUN=kmeans, K.max=12, B=10)
At 62 dimensions, the result will likely be meaningless due to the curse of dimensionality.
k-means does a minimum SSQ assignment, which technically equals minimizing the squared Euclidean distances. However, Euclidean distance is known to not work well for high dimensional data.
If you don't know the number of clusters k to pass to k-means, there are three ways to find it automatically:
G-means algorithm: it discovers the number of clusters automatically, using a statistical test to decide whether to split a k-means center into two. The algorithm takes a hierarchical approach to detecting the number of clusters, based on a statistical test of the hypothesis that a subset of the data follows a Gaussian distribution (a continuous function which approximates the exact binomial distribution of events); if it does not, the cluster is split. It starts with a small number of centers, say one cluster only (k=1), then splits it into two centers (k=2), and then splits each of these two centers again (k=4), giving four centers in total. If G-means does not accept these four centers, the answer is the previous step: two centers in this case (k=2). This is the number of clusters your dataset will be divided into. G-means is very useful when you have no estimate of the number of clusters you will get after grouping your instances. Note that a poor choice of the "k" parameter might give you wrong results. The parallel version of G-means is called p-means.
x-means: an algorithm that efficiently searches the space of cluster locations and numbers of clusters to optimize the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC). This version of k-means finds the number k and also accelerates k-means.
Online k-means or streaming k-means: it runs k-means by scanning the whole data set once, and it finds the optimal k automatically. Spark implements it.
This is from RBloggers.
https://www.r-bloggers.com/k-means-clustering-from-r-in-action/
You could do the following:
data(wine, package="rattle")
head(wine)
df <- scale(wine[-1])
wssplot <- function(data, nc=15, seed=1234){
  wss <- (nrow(data)-1)*sum(apply(data,2,var))
  for (i in 2:nc){
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers=i)$withinss)}
  plot(1:nc, wss, type="b", xlab="Number of Clusters",
       ylab="Within groups sum of squares")}
wssplot(df)
This will create a plot of the within-groups sum of squares against the number of clusters.
From this you can choose k to be either 3 or 4: there is a clear fall in the within-groups sum of squares when moving from 1 to 3 clusters. After three clusters the decrease levels off, suggesting that a 3-cluster solution may be a good fit to the data.
But as Anony-Mousse pointed out, the curse of dimensionality still applies, because k-means uses Euclidean distance.
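For a matrix as large as the one in the question, one pragmatic option is to compute the elbow curve on a random subsample of rows, which keeps memory use small. This is a sketch under assumptions rather than a tested recipe for that dataset: the stand-in matrix `big`, the subsample size of 2000 and the range of k are all illustrative choices, and a subsample can miss small clusters.

```r
set.seed(1234)

# Stand-in for the large matrix (smaller here so the example runs quickly)
big <- matrix(rexp(20000 * 10, rate = 0.1), ncol = 10)

# Random subsample of rows, standardized column by column
idx <- sample(nrow(big), 2000)
sub <- scale(big[idx, ])

# Total within-cluster sum of squares for k = 1..8 on the subsample
ks <- 1:8
wss <- sapply(ks, function(k) {
  kmeans(sub, centers = k, nstart = 5, iter.max = 50)$tot.withinss
})

plot(ks, wss, type = "b", xlab = "Number of Clusters",
     ylab = "Within groups sum of squares")
```

Repeating this over a few different subsamples gives a sense of how stable the elbow is before committing to a k for the full data.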
I hope this answer helps you to a certain extent.
