I would like to estimate this model by maximum likelihood (MLE) using MARSS (or another R package):
x_t=x_{t-1}+w_t , with w_t ~ N(0,q)
y_t= d1_t + \alpha d2_t + \beta (d3_t -x_{t-1}) + v_t, with v_t ~ N(0,6*q)
where the first line is the transition equation and the second is the observation equation.
I managed to write it in a form accepted by the MARSS R package, as below:
[x1_t,x2_{t-1}]= [1,0;1,0][x1_{t-1},x2_{t-2}]+[w1_t,w2_t], with w1_t ~ N(0,q) and w2_t ~ N(0,0)
y_t = D d_t + Z x_t + v_t, with v_t ~ N(0,6*q)
where
x_t=[x1_t,x2_{t-1}]
D=[1,\alpha,\beta]
Z=[0,\beta]
d_t=[d1_t,d2_t, d3_t]
The problem is that I couldn't make the constraint work properly. When I run this system, R treats the \beta in the Z matrix separately from the \beta in the D matrix. All the examples I have seen online show a linear restriction using the Z matrix only (or D only). The same problem occurs with the variances, which I would like to constrain to be multiples of each other.
Could anyone help me with this?
Here is some toy data:
library(MARSS)
B <- matrix(list(1,0,1,0),2,2,byrow=TRUE)
U <- matrix(0,2,1)
C <- matrix(0,2,1)
G <- matrix(list(1,0,0,0),2,2,byrow=TRUE)
Q <- matrix(list('d',0,0,0),2,2,byrow=TRUE)
Z <- matrix(list(0,'b'),1,2)
A <- matrix(0)
D <- matrix(list(1,'a','b'),1,3)
H <- matrix(1)
R <- matrix(list('6*d'))  # intended constraint R = 6*Q, but this 'd' is not tied to the 'd' in Q
dt<-matrix(rnorm(300),3,100)
y<-rnorm(100)
x0=matrix(list(0.094,0.094),2,1)
V0=matrix(list(0.001,0,0,0.001),2,2)
model.list = list(B=B, U=U, C=C, Q=Q, Z=Z, A=A, D=D, d=dt, H=H, R=R,x0=x0,V0=V0)
kemfit = MARSS(y, model=model.list, control=list(maxit=100,conv.test.slope.tol=0.1,abstol=0.1),method='kem')
The EM algorithm in MARSS only allows constraints (like setting values equal) within the same matrix. Setting constraints across A & D or U & C is easy, but across D & Z or R & Q requires rewriting your model in an awkward way in which your covariates (dt) appear as dummy states (x's). So you don't want to do that.
You can instead write a function that returns the negative log-likelihood of your state-space model and minimize it with optim(). I would do this with the KFAS package using the SSMcustom() function because that will be fast. However, here is how to do it with the MARSS package, just to show you the concept. As the author of MARSS, I can write this down immediately, whereas with the KFAS package (which I also use) I'd need to look up how to handle the covariates.
# Set up the parts that don't change
dt<-matrix(rnorm(300),3,100)
y<-rnorm(100)
x0=matrix(list(0.094,0.094),2,1)
V0=matrix(list(0.001,0,0,0.001),2,2)
B <- matrix(list(1,0,1,0),2,2,byrow=TRUE)
U <- A <- "zero"
# Put the parameters you will estimate into a vector
pars <- c(a=0.1624, b=-0.1, d=sqrt(0.2))
# Write a function to return the negative log-likelihood
negloglik <- function(pars){
  Q <- matrix(list(pars["d"]^2,0,0,0),2,2,byrow=TRUE)  # state variance q = d^2
  Z <- matrix(list(0, pars["b"]),1,2)                  # same beta as in D
  D <- matrix(list(1, pars["a"], pars["b"]),1,3)
  R <- matrix(6*pars["d"]^2)                           # observation variance constrained to 6*q
  model.list <- list(B=B, U=U, Q=Q, Z=Z, A=A, D=D, d=dt, R=R, x0=x0, V0=V0)
  -1*MARSS(y, model=model.list, control=list(maxit=100, conv.test.slope.tol=0.1, abstol=0.1),
           method="kem", silent=TRUE)$logLik
}
optim(pars, negloglik, method="BFGS")
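If you also want approximate standard errors, one option (a sketch, assuming the likelihood surface is smooth near the optimum) is to ask optim() for the numerical Hessian and invert it:
fit <- optim(pars, negloglik, method="BFGS", hessian=TRUE)
fit$par                         # estimates of a, b and d (d = sqrt(q))
sqrt(diag(solve(fit$hessian)))  # approximate standard errors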
Using the MARSS() function to get the logLik is a bit silly here, since it is a fitting function, but with all the parameters fixed it will just return the logLik without fitting.
If you want to see what your KFAS model should look like, you can do this:
kfas.model <- MARSSkfas(kemfit, return.kfas.model=TRUE, return.lag.one=FALSE)$kfas.model
Then
library(KFAS)
logLik(kfas.model)
will get you the log-likelihood. But how the covariates enter the KFAS model is a little non-intuitive: they appear in the kfas.model$Z element as a time-varying Z. I am sure the KFAS package has helper functions to construct models with covariates; I always construct KFAS models directly from matrices (no helper functions), so I am not familiar with those, but I know they exist.
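A quick way to confirm that (just an inspection; the exact dimensions depend on how MARSSkfas folds the covariates in):
dim(kfas.model$Z)  # a third dimension equal to the number of time steps means Z is time-varying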
I am working with the cumulative emergence of flies over time (taken at irregular intervals) over many summers (though first I am just trying to make one year work). The cumulative emergence follows a sigmoid pattern, and I want to fit a maximum likelihood estimate of a 3-parameter Weibull cumulative distribution function.

The three-parameter models I've been trying to use in the fitdistrplus package keep giving me an error. I think this must have something to do with how my data are structured, but I cannot figure it out. Obviously I want it to read each point as an x (degree days) and a y (emergence) value, but it seems unable to read two columns. The main error I'm getting says "Non-numeric argument to mathematical function" or (with slightly different code) "data must be a numeric vector of length greater than 1". Below is my code, including added columns in the df_dd_em data frame for cumulative emergence and percent emergence in case that is useful.
degree_days <- c(998.08,1039.66,1111.29,1165.89,1236.53,1293.71,
1347.66,1387.76,1445.47,1493.44,1553.23,1601.97,
1670.28,1737.29,1791.94,1849.20,1920.91,1967.25,
2036.64,2091.85,2152.89,2199.13,2199.13,2263.09,
2297.94,2352.39,2384.03,2442.44,2541.28,2663.90,
2707.36,2773.82,2816.39,2863.94)
emergence <- c(0,0,0,1,1,0,2,3,17,10,0,0,0,2,0,3,0,0,1,5,0,0,0,0,
0,0,0,0,1,0,0,0,0,0)
cum_em <- cumsum(emergence)
df_dd_em <- data.frame (degree_days, emergence, cum_em)
df_dd_em$percent <- ave(df_dd_em$emergence, FUN = function(df_dd_em) 100*(df_dd_em)/46)
df_dd_em$cum_per <- ave(df_dd_em$cum_em, FUN = function(df_dd_em) 100*(df_dd_em)/46)
library(fitdistrplus)
x <- pweibull(df_dd_em[c(1,3)],shape=5)
dframe2.mle <- fitdist(x, "weibull", method="mle")
Here's my best guess at what you're after:
Set up data:
dd <- data.frame(degree_days=c(998.08,1039.66,1111.29,1165.89,1236.53,1293.71,
1347.66,1387.76,1445.47,1493.44,1553.23,1601.97,
1670.28,1737.29,1791.94,1849.20,1920.91,1967.25,
2036.64,2091.85,2152.89,2199.13,2199.13,2263.09,
2297.94,2352.39,2384.03,2442.44,2541.28,2663.90,
2707.36,2773.82,2816.39,2863.94),
emergence=c(0,0,0,1,1,0,2,3,17,10,0,0,0,2,0,3,0,0,1,5,0,0,0,0,
0,0,0,0,1,0,0,0,0,0))
dd <- transform(dd,cum_em=cumsum(emergence))
We're actually going to fit an "interval-censored" model, i.e. the probability of emergence falling between successive degree-day observations. This version assumes that the first cell refers to emergence before the first degree-day observation; you could change it to refer to emergence after the last observation instead.
library(bbmle)
## y*log(p) allowing for 0/0 occurrences:
y_log_p <- function(y,p) ifelse(y==0 & p==0,0,y*log(p))
NLLfun <- function(scale,shape,x=dd$degree_days,y=dd$emergence) {
    prob <- pmax(diff(pweibull(c(-Inf,x),   ## or c(x,Inf)
                               shape=shape,scale=scale)),1e-6)
    ## negative multinomial log-likelihood (up to a constant)
    -sum(y_log_p(y,prob))
}
I should probably have used something more systematic like the method of moments (i.e. matching the mean and variance of a Weibull distribution with the mean and variance of the data), but I just hacked around a bit to find plausible starting values:
## preliminary look (method of moments would be better)
scvec <- 10^(seq(0,4,length=101))
plot(scvec,sapply(scvec,NLLfun,shape=1))
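For what it's worth, here is a rough sketch of that method-of-moments idea, assuming the degree-day of emergence is the underlying Weibull variable and weighting each interval by its emergence count:
## weighted sample moments of the degree-days at emergence
w  <- dd$emergence/sum(dd$emergence)
mu <- sum(w*dd$degree_days)
s2 <- sum(w*(dd$degree_days - mu)^2)
## the Weibull CV depends only on shape, so solve for shape first, then back out scale
cvfun  <- function(k) sqrt(gamma(1 + 2/k)/gamma(1 + 1/k)^2 - 1) - sqrt(s2)/mu
shape0 <- uniroot(cvfun, c(0.5, 50))$root
scale0 <- mu/gamma(1 + 1/shape0)
c(shape=shape0, scale=scale0)   # candidate starting values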
It's important to use parscale to let R know that the parameters are on very different scales:
startvals <- list(scale=1000,shape=1)
m1 <- mle2(NLLfun,start=startvals,
control=list(parscale=unlist(startvals)))
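To check that the optimization converged and to inspect the estimates, the usual mle2 accessors work:
summary(m1)            # coefficient table with approximate standard errors
m1@details$convergence # 0 means the optimizer reported successful convergence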
Now try with a three-parameter Weibull (as originally requested) -- requires only a slight modification of what we already have:
library(FAdist)
NLLfun2 <- function(scale,shape,thres,
                    x=dd$degree_days,y=dd$emergence) {
    prob <- pmax(diff(pweibull3(c(-Inf,x),shape=shape,scale=scale,thres=thres)),
                 1e-6)
    ## negative multinomial log-likelihood (up to a constant)
    -sum(y_log_p(y,prob))
}
startvals2 <- list(scale=1000,shape=1,thres=100)
m2 <- mle2(NLLfun2,start=startvals2,
control=list(parscale=unlist(startvals2)))
Looks like the three-parameter fit is much better:
AICtab(m1,m2)   ## AICtab() comes with bbmle, already loaded
## dAIC df
## m2 0.0 3
## m1 21.7 2
And here's the graphical summary:
with(dd,plot(cum_em~degree_days,cex=3))
with(as.list(coef(m1)),
     curve(sum(dd$emergence)*pweibull(x,shape=shape,scale=scale),
           col=2,add=TRUE))
with(as.list(coef(m2)),
     curve(sum(dd$emergence)*pweibull3(x,shape=shape,scale=scale,thres=thres),
           col=4,add=TRUE))
(could also do this more elegantly with ggplot2 ...)
These don't seem like spectacularly good fits, but they're sane. (You could in principle do a chi-squared goodness-of-fit test based on the expected number of emergences per interval, accounting for the fact that you've fitted a three-parameter model, although the expected counts are probably too small for the chi-squared approximation to be reliable ...)
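As a rough sketch of that check, you could compare the observed emergences per interval with the expected counts under the fitted three-parameter model (with so many near-empty intervals you would want to pool cells before trusting a chi-squared p-value):
pp <- with(as.list(coef(m2)),
           diff(pweibull3(c(-Inf, dd$degree_days),
                          shape=shape, scale=scale, thres=thres)))
expected <- sum(dd$emergence)*pp/sum(pp)
cbind(degree_days=dd$degree_days, observed=dd$emergence,
      expected=round(expected, 2))
## Pearson statistic, df = n.intervals - 1 - 3, after pooling small cells:
## X2 <- sum((dd$emergence - expected)^2/expected)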
Confidence intervals on the fit are a bit of a nuisance; your choices are (1) bootstrapping; (2) resampling parameters from a multivariate normal approximation to their sampling distribution; (3) the delta method.
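For example, a sketch of option (2): draw parameter sets from a multivariate normal based on coef(m2) and vcov(m2), compute the implied curves, and take pointwise quantiles (this assumes the normal approximation is reasonable, and adds dashed bands to the plot above):
library(MASS)   # for mvrnorm
set.seed(101)
xx     <- seq(min(dd$degree_days), max(dd$degree_days), length=101)
psim   <- mvrnorm(1000, mu=coef(m2), Sigma=vcov(m2))
curves <- apply(psim, 1, function(p)
    sum(dd$emergence)*pweibull3(xx, shape=p["shape"],
                                scale=p["scale"], thres=p["thres"]))
ci <- apply(curves, 1, quantile, c(0.025, 0.975), na.rm=TRUE)
matlines(xx, t(ci), lty=2, col=4)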
Using bbmle::mle2 makes it easy to do things like get profile confidence intervals:
confint(m1)
## 2.5 % 97.5 %
## scale 1576.685652 1777.437283
## shape 4.223867 6.318481
dd <- data.frame(degree_days=c(998.08,1039.66,1111.29,1165.89,1236.53,1293.71,
1347.66,1387.76,1445.47,1493.44,1553.23,1601.97,
1670.28,1737.29,1791.94,1849.20,1920.91,1967.25,
2036.64,2091.85,2152.89,2199.13,2199.13,2263.09,
2297.94,2352.39,2384.03,2442.44,2541.28,2663.90,
2707.36,2773.82,2816.39,2863.94),
emergence=c(0,0,0,1,1,0,2,3,17,10,0,0,0,2,0,3,0,0,1,5,0,0,0,0,
0,0,0,0,1,0,0,0,0,0))
dd$cum_em <- cumsum(dd$emergence)
dd$percent <- 100*dd$emergence/46   # 46 = total number of emerged flies
dd$cum_per <- 100*dd$cum_em/46
#start 3 parameter model
library(bbmle)    # for mle2
library(FAdist)   # for pweibull3
## y*log(p) allowing for 0/0 occurrences:
y_log_p <- function(y,p) ifelse(y==0 & p==0,0,y*log(p))
NLLfun2 <- function(scale,shape,thres,
                    x=dd$degree_days,y=dd$percent) {
    prob <- pmax(diff(pweibull3(c(-Inf,x),shape=shape,scale=scale,thres=thres)),
                 1e-6)
    ## negative multinomial log-likelihood (up to a constant)
    -sum(y_log_p(y,prob))
}
startvals2 <- list(scale=1000,shape=1,thres=100)
m2 <- mle2(NLLfun2,start=startvals2,
control=list(parscale=unlist(startvals2)))
summary(m2)
#graphical summary
windows(5,5)   # Windows-only; use dev.new(width=5,height=5) on other platforms
with(dd,plot(cum_per~degree_days,cex=3))
with(as.list(coef(m2)),
     curve(sum(dd$percent)*pweibull3(x,shape=shape,scale=scale,thres=thres),
           col=4,add=TRUE))
I am trying to determine whether there is a significant difference between two Gamma distributions. One distribution has (shape, scale) = (shapeRef, scaleRef) while the other has (shape, scale) = (shapeTarget, scaleTarget). I tried to do an analysis of variance with the following code:
n=10000
x=rgamma(n, shape=shapeRef, scale=scaleRef)
y=rgamma(n, shape=shapeTarget, scale=scaleTarget)
glmm1 <- gam(y~x,family=Gamma(link=log))
anova(glmm1)
The resulting p-values keep changing and can be anywhere from <0.1 to >0.9.
Am I going about this the wrong way?
Edit: I used the following code instead:
f <- gl(2, n)
x=rgamma(n, shape=shapeRef, scale=scaleRef)
y=rgamma(n, shape=shapeTarget, scale=scaleTarget)
xy <- c(x, y)
anova(glm(xy ~ f, family = Gamma(link = log)),test="F")
But, every time I run it I get a different p-value.
You will indeed get a different p-value every time you run this, if you pick different realizations every time. Just like your data values are random variables, which you'd expect to vary each time you ran an experiment, so is the p-value. If the null hypothesis is true (which was the case in your initial attempts), then the p-values will be uniformly distributed between 0 and 1.
Function to generate simulated data:
simfun <- function(n=100,shapeRef=2,shapeTarget=2,
                   scaleRef=1,scaleTarget=2) {
    f <- gl(2, n)
    x <- rgamma(n, shape=shapeRef, scale=scaleRef)
    y <- rgamma(n, shape=shapeTarget, scale=scaleTarget)
    xy <- c(x, y)
    data.frame(xy,f)
}
Function to run anova() and extract the p-value:
sumfun <- function(d) {
    aa <- anova(glm(xy ~ f, family = Gamma(link = log), data=d), test="F")
    aa["f","Pr(>F)"]
}
Try it out, 500 times:
set.seed(101)
r <- replicate(500,sumfun(simfun()))
The p-values are always very small (the difference in scale parameters is easily distinguishable), but they do vary:
par(las=1,bty="l") ## cosmetic
hist(log10(r),col="gray",breaks=50)
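To see the uniform-under-the-null behaviour described above, rerun the same simulation with identical parameters for the two groups (so the null hypothesis is true):
r0 <- replicate(500, sumfun(simfun(scaleTarget=1)))  # both groups are Gamma(shape=2, scale=1)
hist(r0, col="gray", breaks=20)                      # roughly uniform on [0,1]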