Markov chains: what condition on the transition matrix makes the stationary distribution independent of the initial state?

The stationary distribution is a property of a Markov chain that is widely used in many fields, e.g. PageRank.
However, the stationary distribution is determined by the transition matrix alone and has nothing to do with the initial state of the chain.
So what condition on the transition matrix guarantees that the initial state does not matter, i.e. that the chain converges to the stationary distribution after enough iterations?

Markov chains aren't guaranteed to have unique stationary distributions. For example, consider a two-state Markov chain where the transition matrix is the identity matrix. That means that whatever the initial state is, it never changes. So in that case there is no single stationary distribution that is reached independently of the initial state.
Where there is such a stationary distribution, unless the initial distribution already is the stationary distribution, it is only reached in the limit as n tends to infinity. So iteration n+1 will be closer to it than iteration n, but however large n is, it won't ever actually be the stationary distribution. However, for practical purposes (i.e. to the limit of the accuracy of floating point numbers in computers), the stationary distribution may well be reached after a handful of iterations.

You need the underlying graph to be strongly connected and aperiodic. If you want to find the stationary distribution of a periodic Markov chain just by running some chain, add "stay put" transitions with some constant probability to each node and scale the other transitions down appropriately.
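As a minimal sketch of that "stay put" trick (in Python/NumPy, with a made-up two-state periodic chain; the value alpha = 0.5 is arbitrary): the lazy chain alpha*I + (1-alpha)*P has the same stationary distribution as P but is aperiodic, so simple iteration converges from any starting distribution.

import numpy as np

P = np.array([[0.0, 1.0],
              [1.0, 0.0]])                     # periodic chain: it just flips state forever

alpha = 0.5                                    # "stay put" probability (an arbitrary choice)
P_lazy = alpha * np.eye(2) + (1 - alpha) * P   # strongly connected and aperiodic; same stationary distribution

pi = np.array([1.0, 0.0])                      # any initial distribution
for _ in range(100):
    pi = pi @ P_lazy                           # one step of the chain
print(pi)                                      # -> approximately [0.5, 0.5], regardless of the start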

Related

State space models/Kalman filter instability

I am using KFAS (in R) to estimate a local level state space model (smoothing, Gaussians).
In order to evaluate the sensitivity of the estimation, I am introducing a single local perturbation (e.g. a small block of 20 time points with a value of zero out of 1000 time points of the signal) to each of the control signals.
Although the small perturbations are detected in all cases, additional fluctuations in the resulting estimates are observed throughout the signal.
Is there a way to adjust the estimation to decrease the effect of the local perturbation on the smoothing of the whole signal?
A Kalman filter is inherently a linear filter. So it will always multiply your erroneous signal by some gain (determined by your covariances) and add it to your state. It is by far better to just skip filter updates when you have outliers in your measurement data.
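As a rough illustration of that idea (not KFAS; a hand-rolled one-dimensional local-level filter, with the 3-sigma gate and the noise variances q and r chosen arbitrarily), skipping the measurement update when the innovation is implausibly large looks like this:

import numpy as np

def filter_with_gate(ys, q=1e-4, r=1e-2, gate=3.0):
    # Local-level Kalman filter that ignores measurements failing a 3-sigma innovation test.
    x, p = 0.0, 1.0                            # state estimate and its variance (arbitrary init)
    estimates = []
    for y in ys:
        p = p + q                              # time update of the variance (local level model)
        s = p + r                              # innovation variance
        if abs(y - x) <= gate * np.sqrt(s):    # measurement update only for plausible data
            k = p / s                          # Kalman gain
            x = x + k * (y - x)
            p = (1 - k) * p
        estimates.append(x)                    # an outlier leaves the prediction untouched
    return np.array(estimates)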

Predicting future emissions from fitted HMM model

I've fitted an HMM to my data using the hmm.discnp package in R as follows:
library(hmm.discnp)
zs <- hmm(y=lis,K=5)
Now I want to predict the next K observations (emissions) from this model, but I am only able to get the most probable state sequence for the observations I already have, via the Viterbi algorithm.
I have t emissions already, i.e. (y(1), ..., y(t)).
I want the most probable future K emissions from the fitted HMM object, i.e. (y(t+1), ..., y(t+k)).
Is there a function to calculate this? If not, how do I calculate it manually?
Generating emissions from an HMM is pretty straightforward to do manually. I am not really familiar with R, but I'll explain here the steps to generate data as you ask.
The first thing to keep in mind is that, by its Markovian nature, the HMM has no memory. At any time, only the current state is known; what happened before is "forgotten". This means that the generation of the sample at time t+1 depends only on the state at time t.
If you have a sequence, the first thing you can do is find the most probable state sequence (with the Viterbi algorithm), as you did. Now you know the state that generated the last observation you have (the one you denote y(t)).
From this state, you know the probabilities of transitioning to each other state of the model, thanks to the transition matrix. This is a probability mass function (pmf), and you can draw a state number from it (not by hand! R should have a built-in function to draw a sample from a pmf). The state number you draw is the state your system is in at time t+1.
With this information, you can now draw a sample observation from the probability function that is assigned to this new state (same here, if it is a Gaussian distribution, use a Gaussian random generator that should exist in R).
From this state at time t+1, you can apply the same procedure to reach a state at time t+2, and so on.
Keep in mind that if you run this full procedure several times (to generate data samples from time t+1 to t+k), you will end up with different results. This is due to the probabilistic nature of the model. I am not sure what you mean by most probable future emissions, and I am not sure whether there are routines for that. You can compute the likelihood of the full sequence you obtain at the end (from 1 to t+k). It will in general be greater than the likelihood of the sequence up to t, as the last part has been generated from the model itself and thus "perfectly" fits it in some regard.
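A minimal sketch of this sampling procedure in Python (not hmm.discnp's API; the transition matrix A, the Gaussian emission parameters and last_state are placeholders you would read off the fitted model, and Gaussian emissions are assumed as in the example above):

import numpy as np

rng = np.random.default_rng(0)

A     = np.array([[0.9, 0.1],
                  [0.2, 0.8]])     # transition matrix (placeholder values)
means = np.array([0.0, 5.0])       # emission mean per state (placeholder values)
sds   = np.array([1.0, 1.0])       # emission sd per state (placeholder values)

def sample_future(last_state, k):
    # Sample k future emissions, starting from the last Viterbi state.
    state, ys = last_state, []
    for _ in range(k):
        state = rng.choice(len(A), p=A[state])           # draw the next state from its pmf
        ys.append(rng.normal(means[state], sds[state]))  # draw an emission from that state
    return ys

print(sample_future(last_state=0, k=5))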

Why use softmax as opposed to standard normalization?

In the output layer of a neural network, it is typical to use the softmax function to approximate a probability distribution over the raw outputs z: p_i = exp(z_i) / sum_j exp(z_j).
This is expensive to compute because of the exponents. Why not simply perform a Z transform so that all outputs are positive, and then normalise just by dividing all outputs by the sum of all outputs?
There is one nice attribute of Softmax as compared with standard normalisation.
It reacts to low stimulation (think of a blurry image) of your neural net with a rather uniform distribution, and to high stimulation (i.e. large numbers, think of a crisp image) with probabilities close to 0 and 1.
Standard normalisation, on the other hand, does not care as long as the proportions are the same.
Have a look at what happens when softmax gets a 10 times larger input, i.e. your neural net got a crisp image and a lot of neurons were activated:
>>> softmax([1,2]) # blurry image of a ferret
[0.26894142, 0.73105858] # it is a cat perhaps !?
>>> softmax([10,20]) # crisp image of a cat
[0.0000453978687, 0.999954602] # it is definitely a CAT !
And then compare it with standard normalisation
>>> std_norm([1,2]) # blurry image of a ferret
[0.3333333333333333, 0.6666666666666666] # it is a cat perhaps !?
>>> std_norm([10,20]) # crisp image of a cat
[0.3333333333333333, 0.6666666666666666] # it is a cat perhaps !?
I've had this question for months. It seems like we just cleverly guessed the softmax as an output function and then interpreted the input to the softmax as log-probabilities. As you said, why not simply normalize all outputs by dividing by their sum? I found the answer in the Deep Learning book by Goodfellow, Bengio and Courville (2016) in section 6.2.2.
Let's say our last hidden layer gives us z as an activation. Then the softmax is defined as softmax(z)_i = exp(z_i) / sum_j exp(z_j).
Very Short Explanation
The exp in the softmax function roughly cancels out the log in the cross-entropy loss, causing the loss to be roughly linear in z_i. This leads to a roughly constant gradient when the model is wrong, allowing it to correct itself quickly. Thus, a wrong, saturated softmax does not cause a vanishing gradient.
Short Explanation
The most popular method to train a neural network is Maximum Likelihood Estimation. We estimate the parameters theta in a way that maximizes the likelihood of the training data (of size m). Because the likelihood of the whole training dataset is a product of the likelihoods of each sample, it is easier to maximize the log-likelihood of the dataset, and thus the sum of the log-likelihoods of the samples indexed by k:
argmax_theta sum_{k=1..m} log P(y^(k) | x^(k); theta)
Now, we only focus on the softmax here with z already given, so we can replace P(y^(k) | x^(k); theta) with softmax(z^(k))_i, with i being the correct class of the kth sample. Now, we see that when we take the logarithm of the softmax, to calculate the sample's log-likelihood, we get:
log softmax(z)_i = z_i - log sum_j exp(z_j),
which for large differences in z roughly approximates to
z_i - max(z).
First, we see the linear component z_i here. Secondly, we can examine the behavior of max(z) for two cases:
If the model is correct, then max(z) will be z_i. Thus, the log-likelihood asymptotes zero (i.e. a likelihood of 1) with a growing difference between z_i and the other entries in z.
If the model is incorrect, then max(z) will be some other z_j > z_i. So, the addition of z_i does not fully cancel out -z_j and the log-likelihood is roughly (z_i - z_j). This clearly tells the model what to do to increase the log-likelihood: increase z_i and decrease z_j.
We see that the overall log-likelihood will be dominated by samples where the model is incorrect. Also, even if the model is really incorrect, which leads to a saturated softmax, the loss function does not saturate. It is approximately linear in z_j, meaning that we have a roughly constant gradient. This allows the model to correct itself quickly. Note that this is not the case for Mean Squared Error, for example.
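A quick numerical check of this argument, using the standard identity that the gradient of -log softmax(z)_i with respect to z is softmax(z) minus the one-hot vector for class i (the numbers below are made up):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())             # shift by max(z) for numerical stability
    return e / e.sum()

z = np.array([10.0, 0.0, 0.0])          # the model is confidently wrong
i = 1                                   # the true class
one_hot = np.eye(3)[i]

grad = softmax(z) - one_hot             # gradient of the cross-entropy loss w.r.t. z
print(grad)                             # ~ [ 1.0, -1.0, 0.0 ]: large and non-vanishing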
Long Explanation
If the softmax still seems like an arbitrary choice to you, you can take a look at the justification for using the sigmoid in logistic regression:
Why sigmoid function instead of anything else?
The softmax is the generalization of the sigmoid for multi-class problems justified analogously.
I have found the explanation here to be very good: CS231n: Convolutional Neural Networks for Visual Recognition.
On the surface, the softmax algorithm seems to be a simple non-linear normalization (we are spreading the data with an exponential). However, there is more to it than that.
Specifically there are a couple different views (same link as above):
Information Theory - from the perspective of information theory the softmax function can be seen as trying to minimize the cross-entropy between the predictions and the truth.
Probabilistic View - from this perspective we are in fact looking at log-probabilities; when we exponentiate them, we end up with the raw probabilities. In this case the softmax equation finds the MLE (Maximum Likelihood Estimate).
In summary, even though the softmax equation seems like it could be arbitrary, it is NOT. It is actually a rather principled way of normalizing the classifications to minimize the cross-entropy/negative log-likelihood between predictions and the truth.
The values of q_i are unbounded scores, sometimes interpreted as log-likelihoods. Under this interpretation, in order to recover the raw probability values, you must exponentiate them.
One reason that statistical algorithms often use log-likelihood loss functions is that they are more numerically stable: a product of probabilities may be represented by a very small floating point number. Using a log-likelihood loss function, a product of probabilities becomes a sum.
Another reason is that log-likelihoods occur naturally when deriving estimators for random variables that are assumed to be drawn from multivariate Gaussian distributions. See for example the Maximum Likelihood (ML) estimator and the way it is connected to least squares.
We are looking at a multiclass classification problem. That is, the predicted variable y can take one of k categories, where k > 2. In probability theory, this is usually modelled by a multinomial distribution. The multinomial distribution is a member of the exponential family. We can reconstruct the probability P(k=?|x) using properties of exponential family distributions, and it coincides with the softmax formula.
If you believe the problem can be modelled by another distribution, other than multinomial, then you could reach a conclusion that is different from softmax.
For further information and a formal derivation please refer to CS229 lecture notes (9.3 Softmax Regression).
Additionally, a useful trick usually applied to softmax: softmax(x) = softmax(x+c), i.e. softmax is invariant to constant offsets in the input.
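A minimal sketch of why that invariance matters in practice: subtracting max(x) before exponentiating gives the same output but avoids overflow in exp.

import numpy as np

def softmax(x):
    x = np.asarray(x, dtype=float)
    e = np.exp(x - x.max())             # same result as exp(x)/sum(exp(x)), but overflow-safe
    return e / e.sum()

print(softmax([1000, 1001]))            # naive exp(1000) would overflow to inf
print(softmax([0, 1]))                  # identical output: the input differs only by c = -1000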
The choice of the softmax function seems somehow arbitrary as there are many other possible normalizing functions. It is thus unclear why the log-softmax loss would perform better than other loss alternatives.
From "An Exploration of Softmax Alternatives Belonging to the Spherical Loss Family" https://arxiv.org/abs/1511.05042
The authors explored some other functions, among which are a Taylor expansion of exp and the so-called spherical softmax, and found that they can sometimes perform better than the usual softmax.
I think one of the reasons is to deal with negative numbers and division by zero, since exp(x) is always positive and greater than zero.
For example, for a = [-2, -1, 1, 2] the sum will be 0; we can use softmax to avoid division by zero.
Adding to Piotr Czapla's answer: the greater the input values, the greater the probability of the maximum input, for the same proportions relative to the other inputs.
Suppose we change the softmax function so the output activations are given by a^L_j = exp(c*z^L_j) / sum_k exp(c*z^L_k),
where c is a positive constant. Note that c=1 corresponds to the standard softmax function. But if we use a different value of c, we get a different function which is nonetheless qualitatively rather similar to the softmax. In particular, show that the output activations form a probability distribution, just as for the usual softmax. Suppose we allow c to become large, i.e., c → āˆž. What is the limiting value for the output activations a^L_j? After solving this problem, it should be clear why we think of the c=1 function as a "softened" version of the maximum function. This is the origin of the term "softmax". You can follow the details in this source (equation 83).
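A small numerical sketch of that exercise (with arbitrary scores z): as c grows, softmax(c*z) tends to a one-hot vector on the largest entry, i.e. a hard max.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([1.0, 2.0, 3.0])
for c in (1, 5, 50):
    print(c, softmax(c * z))
# c = 1  -> [0.09, 0.24, 0.67]   (soft)
# c = 50 -> [~0,   ~0,   ~1 ]    (essentially argmax)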
While it is indeed somewhat arbitrary, the softmax has desirable properties such as:
being easily differentiable (df/dx = f*(1-f))
when used as the output layer for a classification task, the in-fed scores are interpretable as log-odds

Do I need to normalize my probability distribution after observing evidence in a decision making issue?

Currently I'm working on a decision making system and I have the following elements:
Complex bayesian network
Decision CPD (Conditional probability distribution)
Utility factor (which maps weights to certain probability assignments; only the parents of U)
I'm calculating my utility value for the complete network without observing any evidence and that works fine.
However, now I have some evidence observed, which sets the probability of every illegal assignment to 0; legal assignments keep the same probability. When I then run my utility calculation function, it changes the decision CPD (because the illegal options are removed, i.e. multiplied by 0), but it still treats the evidence as occurring with its prior probability.
My question is: do I need to normalize the probabilities after observing the evidence, rather than accounting for the evidence occurring with a certain probability? This heavily impacts the outcome of the utility function and therefore the outcome of my decision.
The probabilities of all possible outcomes must always sum to 1. So if some outcomes are eliminated and hence assigned zero probability, then YES, you must re-normalise so that the remaining probabilities add up to 1.
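A minimal sketch of that renormalisation step, with toy numbers (your actual factors come from the Bayesian network, not a flat vector like this):

import numpy as np

prior   = np.array([0.2, 0.3, 0.5])       # distribution over three assignments
allowed = np.array([True, False, True])   # the evidence rules out the middle assignment

posterior = prior * allowed               # illegal assignments get probability 0
posterior = posterior / posterior.sum()   # rescale so the legal ones sum to 1
print(posterior)                          # [0.28571429 0.         0.71428571]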

Is a Markov chain the same as a finite state machine?

Is a finite state machine just an implementation of a Markov chain? What are the differences between the two?
Markov chains can be represented by finite state machines. The idea is that a Markov chain describes a process in which the transition to a state at time t+1 depends only on the state at time t. The main thing to keep in mind is that the transitions in a Markov chain are probabilistic rather than deterministic, which means that you can't always say with perfect certainty what will happen at time t+1.
The Wikipedia article on finite-state machines has a subsection on finite Markov-chain processes; I'd recommend reading that for more information. Also, the Wikipedia article on Markov chains has a brief sentence describing the use of finite state machines to represent a Markov chain. It states:
A finite state machine can be used as a representation of a Markov chain. Assuming a sequence of independent and identically distributed input signals (for example, symbols from a binary alphabet chosen by coin tosses), if the machine is in state y at time n, then the probability that it moves to state x at time n + 1 depends only on the current state.
Whilst a Markov chain is a finite state machine, it is distinguished by its transitions being stochastic, i.e. random, and described by probabilities.
The two are similar, but the other explanations here are slightly wrong. Only FINITE Markov chains can be represented by an FSM; Markov chains allow for an infinite state space. As was pointed out, the transitions of a Markov chain are described by probabilities, but it is also important to mention that the transition probabilities can only depend on the current state. Without this restriction, it would be called a "discrete time stochastic process".
I believe this should answer your question:
https://en.wikipedia.org/wiki/Probabilistic_automaton
And you are on to the right idea: they are almost the same; subsets, supersets, and modifications of one another, depending on what adjective describes the chain or the automaton. Automata typically take an input as well, but I am sure there have been papers using "Markov chains" with inputs.
Think Gaussian distribution vs. normal distribution: same idea, different fields. Automata belong to computer science, Markov chains belong to probability and statistics.
I think most of the answers are not appropriate. A Markov process is generated by a (probabilistic) finite state machine, but not every process generated by a probabilistic finite state machine is a Markov process. E.g. Hidden Markov Processes are basically the same as processes generated by probabilistic finite state machines, but not every Hidden Markov Process is a Markov Process.
Leaving the inner workings aside, a finite state machine is like a plain value, while a Markov chain is like a random variable (probability added on top of the plain value). So the answer to the original question is no, they are not the same. In the probabilistic sense, a Markov chain is an extension of a finite state machine.
