Getting the next observation from a HMM gaussian mixture distribution - r

I have a continuous univariate xts object of length 1000, which I have converted into a data.frame called x to be used by the package RHmm.
I have already chosen that there are going to be 5 states and 4 gaussian distributions in the mixed distribution.
What I'm after is the expected mean value for the next observation. How do I go about getting that?
So what I have so far is:
a transition matrix from running the HMMFit() function
a set of means and variances for each of the gaussian distributions in the mixture, along with their respective proportions, all of which were also generated from the HMMFit() function
a list of past hidden states relating to the input data when using the output of the HMMFit function and putting it into the viterbi function
How would I go about getting the next hidden state (i.e. the 1001st value) from what I've got, and then using it to get the weighted mean from the gaussian distributions?
I think I'm pretty close, just not too sure what the next part is. The last state is state 5, so do I use the 5th row in the transition matrix somehow to get the next state?
All I'm after is the weighted mean for what is to be expected in the next observation, so the next hidden state isn't even necessary. Do I multiply the probabilities in row 5 by each of the means, weighted by their proportions for each state, and then sum it all together?
here is the code I used.
# have used 2000 iterations to ensure convergence
a <- HMMFit(x, nStates=5, nMixt=4, dis="MIXTURE", control=list(iter=2000))
v <- viterbi(a,x)
a
v
As always any help would be greatly appreciated!

The next predicted value uses the last hidden state, last(v$states), to pick the row of the transition matrix, a$HMM$transMat[last(v$states),], which gives the probability of each state at the next step. For each state, the distribution means a$HMM$distribution$mean are weighted by the proportions a$HMM$distribution$proportion and summed to give that state's mean. The state means are then multiplied by the transition probabilities and summed. So in the above case it would be as follows:
sum(a$HMM$transMat[last(v$states),] * .colSums((matrix(unlist(a$HMM$distribution$mean), nrow=4,ncol=5)) * (matrix(unlist(a$HMM$distribution$proportion), nrow=4,ncol=5)), m=4,n=5))
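The same calculation can be sketched in base R without a fitted model. Everything below is made up for illustration (3 states, 2 mixture components, hypothetical numbers), and the list layout of means and proportions is only assumed to mirror what HMMFit() returns:

```r
# Hypothetical 3-state transition matrix (rows sum to 1)
trans_mat <- matrix(c(0.7, 0.2, 0.1,
                      0.3, 0.5, 0.2,
                      0.2, 0.3, 0.5), nrow = 3, byrow = TRUE)

# Hypothetical mixture means and proportions, one list element per state
comp_means <- list(c(-1, 1), c(0, 2), c(3, 5))
comp_props <- list(c(0.5, 0.5), c(0.25, 0.75), c(0.4, 0.6))

# Assume the Viterbi path ended in state 3
last_state <- 3

# Mean of each state's mixture: sum of component means times proportions
state_means <- mapply(function(m, p) sum(m * p), comp_means, comp_props)

# Expected next observation: transition probabilities times state means
expected_next <- sum(trans_mat[last_state, ] * state_means)
expected_next
```

Here the state means are 0, 1.5 and 4.2, so the expected next value is 0.2*0 + 0.3*1.5 + 0.5*4.2 = 2.55.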

Related

How to eliminate zeros in simulated data from rnorm function

I have a large set of high frequency data of wind. I use this data in a model to calculate gas exchange between atmosphere and water. I am using the average wind of a 10-day series of measurements to represent gas exchange at a given time. Since the wind is an average value from a 10-day series I want to apply the error to the output by adding the error to the input:
#fictional time series, manually created by me.
wind <- c(0,0,0,0,0,4,3,2,4,3,2,0,0,1,0,0,0,0,1,1,4,5,4,3,2,1,0,0,0,0,0)
I then create 100 values around the mean and sd of the wind vector:
df <- as.data.frame(mapply(rnorm,mean=mean(wind),sd=sd(wind),n=100))
Because of the standard deviation, some of the generated values are negative. If these are run in the gas exchange model I get a disproportionately large error, simply because wind speed can't be negative and the model is not built to run with negative wind measurements. It has been suggested that I log-transform the raw data, run rnorm() with the logged values, and then transform back. But since there are several zeros in the data (0 = no wind) I can't simply log the values. Hence I use the log(x+c) method:
wind.log <- log(wind+1)
df.log <- as.data.frame(mapply(rnorm,
mean=mean(wind.log),
sd=sd(wind.log),n=100))
However, I will need to convert values back to actual wind measurements before running them in the model.
This is where it gets problematic, since I will need to use exp(x)-c to convert the values back, and then I end up with negative values again.
Is there a way to work around this without truncating the 0's and screwing up the generated distribution around the mean?
Otherwise, my only alternative is to calculate gas exchange directly at every given time point and generate a distribution from that. Those values would never be negative or equal to 0 and can hence be log-transformed.
Suggestion: use a zero-inflated/altered model, where you generate some proportion of zero values and draw the rest from a log-normal distribution (to make sure you don't get negative values):
wind <- c(0,0,0,0,0,4,3,2,4,3,2,0,0,1,0,0,0,0,1,1,4,5,4,3,2,1,0,0,0,0,0)
prop_nonzero <- mean(wind>0)
lmean <- mean(log(wind[wind>0]))
lsd <- sd(log(wind[wind>0]))
n <- 500
vals <- rbinom(n, size=1,prob=prop_nonzero)*rlnorm(n,meanlog=lmean,sdlog=lsd)
Alternatively you could use a Tweedie distribution (as suggested by @aosmith), or fit a censored model to estimate the distribution of wind values that get measured as zero (assuming that the wind speed is never exactly zero, just too small to measure).
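As a rough sketch, the zero-inflated log-normal draw above can be wrapped in a small helper and checked. The function name r_zero_lognormal is made up here, and the seed is arbitrary:

```r
# Hypothetical helper: draw n values that are zero with the observed
# probability of zero, and log-normal otherwise
r_zero_lognormal <- function(n, x) {
  pos <- x[x > 0]
  is_nonzero <- rbinom(n, size = 1, prob = mean(x > 0))
  is_nonzero * rlnorm(n, meanlog = mean(log(pos)), sdlog = sd(log(pos)))
}

set.seed(123)
wind <- c(0,0,0,0,0,4,3,2,4,3,2,0,0,1,0,0,0,0,1,1,4,5,4,3,2,1,0,0,0,0,0)
vals <- r_zero_lognormal(500, wind)

# No negative values, and the share of zeros should be close to the data's
c(min = min(vals), prop_zero_sim = mean(vals == 0), prop_zero_obs = mean(wind == 0))
```

The simulated values can be passed to the model directly, with no back-transformation step.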

How to compute the mean survival time

I'm using the survival library. After computing the Kaplan-Meier estimator of a survival function:
km = survfit(Surv(time, flag) ~ 1)
I know how to compute percentiles:
quantile(km, probs = c(0.05,0.25,0.5,0.75,0.95))
But, how do I compute the mean survival time?
Calculate Mean Survival Time
The mean survival time will in general depend on what value is chosen for the maximum survival time. You can get the restricted mean survival time with print(km, print.rmean=TRUE). By default, this assumes that the longest survival time is equal to the longest survival time in the data. You can set this to a different value by adding an rmean argument (e.g., print(km, print.rmean=TRUE, rmean=250)).
Extract Value of Mean Survival Time and Store in an Object
In response to your comment: I initially figured one could extract the mean survival time by looking at the object returned by print(km, print.rmean=TRUE), but it turns out that print.survfit doesn't return a list object but just returns text to the console.
Instead, I looked through the code of print.survfit (you can see the code by typing getAnywhere(print.survfit) in the console) to see where the mean survival time is calculated. It turns out that a function called survmean takes care of this, but it's not an exported function, meaning R won't recognize the function when you try to run it like a "normal" function. So, to access the function, you need to run the code below (where you need to set rmean explicitly):
survival:::survmean(km, rmean=60)
You'll see that the function returns a list where the first element is a matrix with several named values, including the mean and the standard error of the mean. So, to extract, for example, the mean survival time, you would do:
survival:::survmean(km, rmean=60)[[1]]["*rmean"]
Details on How the Mean Survival Time is Calculated
The help for print.survfit provides details on the options and how the restricted mean is calculated:
?print.survfit
The mean and its variance are based on a truncated estimator. That is,
if the last observation(s) is not a death, then the survival curve
estimate does not go to zero and the mean is undefined. There are four
possible approaches to resolve this, which are selected by the rmean
option. The first is to set the upper limit to a constant,
e.g., rmean=365. In this case the reported mean would be the expected
number of days, out of the first 365, that would be experienced by
each group. This is useful if interest focuses on a fixed period.
Other options are "none" (no estimate), "common" and "individual". The
"common" option uses the maximum time for all curves in the object as
a common upper limit for the auc calculation. For the
"individual" options the mean is computed as the area under each curve,
over the range from 0 to the maximum observed time for that curve.
Since the end point is random, values for different curves are not
comparable and the printed standard errors are an underestimate as
they do not take into account this random variation. This option is
provided mainly for backwards compatability, as this estimate was the
default (only) one in earlier releases of the code. Note that SAS (as
of version 9.3) uses the integral up to the last event time of each
individual curve; we consider this the worst of the choices and do not
provide an option for that calculation.
Using the tail formula (and since our variable is non negative) you can calculate the mean as the integral from 0 to infinity of 1-CDF, which equals the integral of the Survival function.
If we replace a parametric Survival curve with a non parametric KM estimate, the survival curve goes only until the last time point in our dataset. From there on it "assumes" that the line continues straight. So we can use the tail formula in a "restricted" manner only until some cut-off point, which we can define (default is the last time point in our dataset).
You can calculate it using the print function, or manually:
print(km, print.rmean=TRUE) # print function
sum(diff(c(0,km$time))*c(1,km$surv[1:(length(km$surv)-1)])) # manually
I add 0 in the beginning of the time vector, and 1 at the beginning of the survival vector since they're not included. I only take the survival vector up to the last point, since that is the last chunk. This basically calculates the area-under the survival curve up to the last time point in your data.
If you set up a manual cut-off point after the last point, it will simply add that area; e.g., here:
print(km, print.rmean=TRUE, rmean=4) # gives out 1.247
print(km, print.rmean=TRUE, rmean=4+2) # gives out 1.560
1.247+2*min(km$surv) # gives out 1.560
If the cut-off value is below the last, it will only calculate the area-under the KM curve up to that point.
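To make the area-under-the-curve idea concrete, here is a small base R sketch that builds the Kaplan-Meier estimate by hand on toy data and sums the rectangles. The data and variable names are invented for illustration, and the survival package is deliberately not used:

```r
# Toy data: follow-up times and event indicators (1 = death, 0 = censored)
time   <- c(2, 3, 3, 5, 8)
status <- c(1, 1, 0, 1, 0)

# Kaplan-Meier estimate at each distinct event time
event_times <- sort(unique(time[status == 1]))
n_risk  <- sapply(event_times, function(s) sum(time >= s))
n_event <- sapply(event_times, function(s) sum(time == s & status == 1))
surv    <- cumprod(1 - n_event / n_risk)

# Restricted mean: area under the step function from 0 to the cut-off,
# here the largest observed time
cutoff  <- max(time)
grid    <- c(0, event_times, cutoff)
heights <- c(1, surv)          # curve height on each interval of the grid
rmst    <- sum(diff(grid) * heights)
rmst
```

With these numbers the curve steps through 1, 0.8, 0.6 and 0.3, so the area is 2*1 + 1*0.8 + 2*0.6 + 3*0.3 = 4.9.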
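There's no need to use the "hidden" survival:::survmean(km, rmean=60).
Just use summary(km)$table[,5:6], which gives you the RMST and its SE. With a single curve, as in the question, the table should be a named vector rather than a matrix, so drop the comma: summary(km)$table[5:6]. The CI can be calculated using the appropriate quantile of the normal distribution.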

Preventing a Gillespie SSA Stochastic Model From Running Negative

I have produced a stochastic model of infection (parasitic worm), using a Gillespie SSA. The model uses the "GillespieSSA" package (https://cran.r-project.org/web/packages/GillespieSSA/index.html).
In short, the code models a population of discrete compartments. Movement between compartments depends on user-defined rate equations. The SSA algorithm calculates the number of events produced by each rate equation for a given timestep (tau) and updates the population accordingly, and the process repeats up to a given time point. The problem is that the number of events is assumed to be Poisson distributed (Poisson(rate[i]*tau)), which produces an error when the rate is negative, including when population numbers become negative.
# Parameter Values
sir.parms <- c(deltaHinfinity=0.00299, CHi=0.00586, deltaH0=0.0854, aH=0.5,
muH=0.02, SigmaW=0.1, SigmaM =0.8, SigmaL=104, phi=1.15, f = 0.6674,
deltaVo=0.0166, CVo=0.0205, alphaVo=0.5968, beta=52, mbeta=7300 ,muV=52, g=0.0096, N=100)
# Initial Population Values
sir.x0 <- c(W=20,M=10,L=0.02)
# Rate Equations
sir.a <- c("((deltaH0+deltaHinfinity*CHi*mbeta*L)/(1+CHi*mbeta*L))*mbeta*L*N"
,"SigmaW*W*N", "muH*W*N", "((1/2)*phi*f)*W*N", "SigmaM*M*N", "muH*M*N",
"(deltaVo/(1+CVo*M))*beta*M*N", "SigmaL*L*N", "muV*L*N", "alphaVo*M*L*N", "(aH/g)*L*N")
# Population change for each event
sir.nu <- matrix(c(+0.01,0,0,
-0.01,0,0,
-0.01,0,0,
0,+0.01,0,
0,-0.01,0,
0,-0.01,0,
0,0,+0.01/230,
0,0,-0.01/230,
0,0,-0.01/230,
0,0,-0.01/230,
0,0,-0.01/32),nrow=3,ncol=11,byrow=FALSE)
runs <- 10
set.seed(1)
# Data Frame of output
sir.out <- data.frame(time=numeric(),W=numeric(),M=numeric(),L=numeric())
# Multiple runs and combining data and SSA methods
for(i in 1:runs){
sim <- ssa(sir.x0,sir.a,sir.nu,sir.parms, method="ETL", tau=1/12, tf=140, simName="SIR")
sim.out <- data.frame(time=sim$data[,1],W=sim$data[,2],M=sim$data[,3],L=sim$data[,4])
sim.out$run <- i
sir.out <- rbind(sir.out,sim.out)
}
Thus, rates are computed and the model updates the population values for each time step, with the data stored in a data frame and then appended to the previous runs. However, when population levels get very low, the number of events that reduce a population can exceed the number in the compartment. One remedy is to make the time step very small, but this makes the simulation take much longer.
My question is: is there a way to augment the code so that, as the data is calculated at each time step, any negative population values are converted to 0?
I have tried working on this problem, but only seem to be able to come up with methods that alter the values once the simulation is complete, with the negative values still causing issues in the runs themselves.
E.g.
sir.out$L[sir.out$L < 0] <- 0
Any help would be appreciated
I believe the problem is the method you set ("ETL") in the ssa function. The ETL method will eventually produce negative numbers. You can try the "OTL" method, based on the paper "Efficient step size selection for the tau-leaping simulation method", which has a few more parameters that you can tweak, but the basic command is:
ssa(sir.x0,sir.a,sir.nu,sir.parms, method="OTL", tf=140, simName="SIR")
Or the direct method, which will not produce negative numbers whatsoever:
ssa(sir.x0,sir.a,sir.nu,sir.parms, method="D", tf=140, simName="SIR")
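If clamping at zero inside each step is still wanted, one option is a hand-rolled explicit tau-leap loop. The sketch below is not GillespieSSA's API: the function tau_leap_clamped and the toy birth-death model are made up for illustration, and truncating at zero biases the dynamics, so treat it as a rough workaround rather than a substitute for the methods above.

```r
# Hypothetical explicit tau-leaping loop that clamps the state at zero
# after every step (base R only, not part of GillespieSSA)
tau_leap_clamped <- function(x0, rate_fun, nu, tau, tf) {
  n_steps <- ceiling(tf / tau)
  out <- matrix(NA_real_, nrow = n_steps + 1, ncol = length(x0),
                dimnames = list(NULL, names(x0)))
  out[1, ] <- x0
  x <- x0
  for (k in seq_len(n_steps)) {
    rates  <- pmax(rate_fun(x), 0)                  # no negative rates
    events <- rpois(length(rates), rates * tau)     # events per reaction
    x <- pmax(x + as.vector(nu %*% events), 0)      # update, then clamp at 0
    out[k + 1, ] <- x
  }
  cbind(time = (0:n_steps) * tau, out)
}

# Toy birth-death process: constant birth rate, death rate proportional to X
set.seed(1)
x0 <- c(X = 5)
nu <- matrix(c(1, -1), nrow = 1)                    # state change per event
rate_fun <- function(x) c(birth = 0.5, death = 1.0 * x[["X"]])
sim <- tau_leap_clamped(x0, rate_fun, nu, tau = 0.5, tf = 20)
min(sim[, "X"])
```

Because the clamp happens before the next set of rates is computed, the negative values never feed back into the run.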

R optimize linear function

I'm new to R and need a little help with a simple optimization.
I want to apply a functional transformation to a variable (sales_revenue) over time (24 monthly forecast values, 1 to 24). Basically I want to push sales revenue for products from later months into earlier months.
The functional transformation at time t is:
trans=D+(t/(A+B*t+C*t^2))
I will then want to solve:
1) sales_revenue=sales_revenue*trans
where total_sales_revenue=1,000,000 (or within +/- 2.5%)
total_sales_revenue is the sum of all sales_revenue over the 24 months forecast.
If trans has too many parameters I can fix most of them if required and leave B free to estimate.
I think the approach should be: fix all parameters except B, differentiate function (1) (not sure what to differentiate by) and solve for a non-zero minimum (using constraints to make sure it's the right minimum and non-zero), then run the optimization on that function with the constraint that the total sum of sales_revenue*trans is equal (or close) to 1,000,000.
@user2138362, did you mean "1) sales_revenue=total_sales_revenue*trans"?
I'm supposing your parameters A, C and D are fixed, and you want to find B such that the distance between your observed values and your predicted values is minimized.
Let's say your time is in months. So we can write a function to give you the squared distance:
dist <- function(B)
{
  t <- 1:length(sales_revenue)
  total_sales_revenue <- sum(sales_revenue)
  predicted <- total_sales_revenue * (D+(t/(A+B*t+C*t^2)))
  sum((sales_revenue-predicted)^2)
}
I'm also using the squared euclidean distance as a measure of distance. Make the appropriate changes if that is not the case.
Now, dist is the function you have to minimize. You can use optim, as pointed out by @iTech. But even at the minimum, dist probably won't be zero, as you have many (24) observations. But you can get the best fit, plot it, and see if it's nice.
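A minimal sketch of that minimization, on synthetic data: the values of A, C and the "true" B are invented, D is chosen so that the 24 shares sum to 1, and the function is renamed sq_dist here only to avoid masking stats::dist. With a single free parameter, optim needs method="Brent" and explicit bounds.

```r
# Hypothetical fixed parameters and a known B to recover
A <- 50; C <- 1; B_true <- 20
t <- 1:24
D <- (1 - sum(t / (A + B_true * t + C * t^2))) / length(t)

# Synthetic monthly revenue that sums to 1,000,000
sales_revenue <- 1e6 * (D + t / (A + B_true * t + C * t^2))

# Squared distance between observed and predicted revenue, as a function of B
sq_dist <- function(B) {
  t <- seq_along(sales_revenue)
  total_sales_revenue <- sum(sales_revenue)
  predicted <- total_sales_revenue * (D + t / (A + B * t + C * t^2))
  sum((sales_revenue - predicted)^2)
}

# One-dimensional minimization over an assumed plausible range for B
fit <- optim(par = 1, fn = sq_dist, method = "Brent", lower = 1, upper = 100)
fit$par
```

On real data the minimum will not be zero, so it is worth plotting the predicted values against sales_revenue to judge the fit.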

Is it possible to arrange a time series in such a way that a specific autocorrelation is created?

I have a file containing 2,500 random numbers. Is it possible to rearrange these saved numbers in such a way that a specific autocorrelation is created? Let's say an autocorrelation at lag 1 of 0.2, an autocorrelation at lag 2 of 0.4, etc.
Any help is greatly appreciated!
To be more specific:
The time series of a daily return in percent of an asset has the following characteristics that I am trying to recreate:
Leptokurtic, symmetric distribution, let's say centered at a daily return of zero
No significant autocorrelations (because the sign of a daily return is not predictable)
Significant autocorrelations if the time series is squared
The aim is to produce a random time series which satisfies all these three characteristics. The only two inputs should be the leptokurtic distribution (this I have already created) and the specific autocorrelation of the squared resulting time series (e.g. the final squared time series should have an autocorrelation at lag 1 of 0.2).
I only know how to produce random numbers out of my own mixed-distribution. Naturally if I would square this resulting time series, there would be no autocorrelation. I would like to find a way which takes this into account.
Generally the most straightforward way to create autocorrelated data is to generate the data so that it's autocorrelated. For example, you could create an auto correlated path by always using the value at p-1 as the mean for the random draw at time period p.
Rearranging is not only hard, but sort of odd conceptually. What are you really trying to do in the end? Giving some context might allow better answers.
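As a small sketch of generating the data so that it is autocorrelated, an AR(1) series with coefficient 0.2 has a theoretical lag-1 autocorrelation of 0.2. The coefficient, seed and series length (matching the 2,500 numbers in the question) are chosen here for illustration, using arima.sim() from the stats package:

```r
set.seed(1)
n   <- 2500
phi <- 0.2   # AR(1) coefficient, equal to the target lag-1 autocorrelation

# Each value is phi times the previous value plus fresh Gaussian noise
x <- arima.sim(model = list(ar = phi), n = n)

# Sample autocorrelation at lag 1 (element 1 of $acf is lag 0)
acf_lag1 <- acf(x, plot = FALSE)$acf[2]
acf_lag1
```

The sample value will fluctuate around 0.2 from run to run rather than match it exactly.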
There are functions for simulating correlated data: arima.sim() from the stats package and simulate.Arima() from the forecast package.
simulate.Arima() has the advantages that (1) it can simulate seasonal ARIMA models (sometimes called "SARIMA") and (2) it can simulate a continuation of an existing time series to which you have already fit an ARIMA model. To use simulate.Arima(), you do need to already have an Arima object.
UPDATE:
Type ?arima.sim, then scroll down to "Examples".
Alternatively:
install.packages("forecast")
library(forecast)
fit <- auto.arima(USAccDeaths)
plot(USAccDeaths,xlim=c(1973,1982))
lines(simulate(fit, 36),col="red")
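For the specific goal in the question (returns with no autocorrelation but autocorrelated squares), ARIMA models are not quite the right shape. One technique that was not proposed in the answers above, and is swapped in here as an assumption, is an ARCH(1) process: the squared series then behaves like an AR(1) with coefficient alpha, provided the fourth moment is finite. The sketch uses Gaussian shocks as a placeholder; draws from the asker's own leptokurtic distribution, scaled to unit variance, could be substituted, though the lag-1 relationship then only holds if the resulting fourth moment stays finite.

```r
set.seed(42)
n     <- 2500
alpha <- 0.2          # target lag-1 autocorrelation of the squared series
omega <- 1 - alpha    # makes the unconditional variance equal to 1

z <- rnorm(n)         # placeholder shocks; swap in your own unit-variance draws
r <- numeric(n)
sigma2 <- 1           # start at the unconditional variance
for (i in seq_len(n)) {
  r[i]   <- sqrt(sigma2) * z[i]
  sigma2 <- omega + alpha * r[i]^2   # today's shock sets tomorrow's variance
}

acf_r  <- acf(r,   plot = FALSE)$acf[2]   # lag-1 autocorrelation of returns
acf_r2 <- acf(r^2, plot = FALSE)$acf[2]   # lag-1 autocorrelation of squares
c(returns = acf_r, squared = acf_r2)
```

The returns themselves should show an autocorrelation near zero, while the squared series should show a clearly positive one.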

Resources