I'm building my own maximum likelihood estimator that estimates the parameters associated with the mean and standard deviation. On simulated data my function works when the true mean is a linear function and the standard deviation is constant. However, if the mean structure is polynomial my function cannot recover the true parameters. Can anybody point me to a solution?
I'm aware there are plenty of existing functions for estimating means and SDs. I'm not interested in them, I'm interested in why my function is not working.
Below is a reproducible example where my model does not recover the true standard deviation (true sd = 1.648, mysd = 4.184123)
*Edit: added library()
library(tidyverse)
my_poly_loglik <- function(pars,        #parameters
                           outcome,     #outcome variable
                           poly_mean){  #data frame of polynomials
  #modelling the mean - adding intercept column
  mean_mdl = cbind(1, poly_mean) %*% pars[1:(ncol(poly_mean) + 1)]
  #modelling the standard deviation on exponential scale
  sd_mdl = exp(pars[length(pars)])
  #computing log likelihood
  sum_log_likelihood <- sum(dnorm(outcome,
                                  mean = mean_mdl,
                                  sd = sd_mdl,
                                  log = TRUE),
                            na.rm = TRUE)
  #since optim() is minimizing we want the -log likelihood
  return(-sum_log_likelihood)
}
#Generate data
set.seed(103)
n <- 100000 #100k obs
z <- runif(n, min = 0.1, max = 40) #independent variable sampled uniformly
mean <- 10 + 0.2 * z + 0.4 * z^2 #mean structure
sd = exp(0.5) #constant SD
y <- rnorm(n,mean, sd)
#Visualizing simulated data
#plot(z,mean)
#plot(z,sd)
#plot(z,y)
mydf = data.frame(z,y)
#Defining polynomials
polymean = cbind(z, z^2)
#Initial values. 2 extra for mean_intercept and SD
pars = rep(0, ncol(polymean) + 2)
#Optimising my likelihood function
optim_res <- optim(pars,
                   fn = my_poly_loglik,
                   outcome = mydf$y,
                   poly_mean = polymean)
if (optim_res$convergence != 0) stop("optim_res value is not 0!")
#comparing my function to the real parameter
plot_df = data.frame("mymean" = optim_res$par[1] + (polymean %*% optim_res$par[2:3]),
"truemean" = mean,
"z" = z)
#my mean (black) and true mean (red)
plot_df %>%
ggplot(aes(x = z, y = mymean)) +
geom_line() +
geom_line(aes(y = truemean), color = "red")
#Works!
#my SD and true SD - PROBLEM!
sd #true sd
exp(optim_res$par[length(optim_res$par)]) #my sd
This is not a complete solution, but it might help others find the correct answer.
The code looks good overall, and the issue only emerges when the z values span a wide range. In fact, scaling them, or generating data from a considerably narrower range, leads to the correct solution. Furthermore, checking the Hessian shows that the covariance matrix of the estimates is not positive semidefinite, and slightly reducing the range results in correlations of the mean parameters close to 1. (This is a bit puzzling, since a normal linear model with the same parametrization does not suffer from the same issue -- I know it does not optimize the likelihood directly, but it is still a bit unintuitive to me.)
So a temporary solution might be rescaling the predictors or using an orthogonal parametrization, but that does not really explain the core of the issue.
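To make the scaling/starting-value issue concrete, here is a rough diagnostic (a sketch, not part of the original code): if optim() is started from values that are already on the right scale, taken from an ordinary lm() fit, the SD estimate should come out close to the true exp(0.5), which suggests the optimizer is stalling rather than the likelihood being wrong.
#Sketch: start the optimizer from lm() estimates instead of zeros
#(mean coefficients from OLS, log-SD from the OLS residual SD)
ols <- lm(y ~ z + I(z^2), data = mydf)
start_pars <- c(coef(ols), log(sd(resid(ols))))
optim_res2 <- optim(start_pars,
                    fn = my_poly_loglik,
                    outcome = mydf$y,
                    poly_mean = polymean,
                    method = "BFGS")
exp(optim_res2$par[length(optim_res2$par)]) #compare with the true exp(0.5)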
I'm working my way through learning about Sequential Importance Resampling (SIR) particle filters (starting with a relatively simple example), but am a bit stuck in my understanding.
I'm particularly interested in estimating a full posterior distribution (over time), rather than just point estimates. However, the spread (variance) of the distributions I'm getting seems to be only a function of the importance-weighting likelihood's assumed standard deviation (which is an input variable to the algorithm).
My question: is there a way that the SIR algorithm can learn the importance weighting likelihood's standard deviations from data? (Where the data's [observations'] variance potentially changes over time, as it does in my code and the plots below.) So that, I'd hope, the posterior will reflect not only changes in the observations' means, over time (which my code handles), but also changes in their variance (which it doesn't)?
My intuition is that the way to do this is to (somehow) model the observational likelihood's variance (the variable sdObs in my code, below) as, itself, a Markov chain, with its own prior and posterior. I tried doing this, but it didn't work.
In the first plot (link) below there are 100 timesteps with widely dispersed data ("observations"), followed by another 100 timesteps with narrowly dispersed data. There's nothing in the algorithm (as coded) that adjusts the model's standard deviations -- either the transition likelihood's SD (sdTrans) or the observation likelihood's (sdObs). So the Percentile Interval (PI, shaded in my plots) stays at pretty much a fixed width across all timesteps, regardless of the "observed" data, whereas I'd like it to narrow after timestep 100 to reflect the narrower distribution of the observed data from that point on.
(Apologies that I don't yet have enough "reputation points" to post images! But links are below.)
Image (ggplot): Particle Filter Results on data with a fixed mean and changing variance, with a wide observation variance in the model
If I narrow the observation likelihood's standard deviation, sdObs, from 4 to 1, the whole PI gets correspondingly narrower (without regard to the distribution of observations). Basically, I'm "selecting" the posterior's standard deviation via my choice of sdObs, which is barely influenced by the observed data's actual standard deviation. (It's fine to specify a prior, but I'm essentially just "picking" the posterior; not what I want!)
Image (ggplot): Particle Filter Results on data with a fixed mean and changing variance, with a narrow observation variance in the model
The filter does track changes in the distribution's means. Here's a plot with a change in the underlying mean.
Image (ggplot): Particle Filter Results on data with a changing mean and variance, with a narrow observation variance in the model
My code is based on this post.
library(ggplot2)
library(reshape2)
# Helper functions used below (base R has no logit()/inv_logit())
logit <- function(p) log(p / (1 - p))
inv_logit <- function(x) 1 / (1 + exp(-x))
############# Parameters ##########
# Generative params for the data (draws from beta distributions,
# specified by an (alpha, beta) pair, and which is bounded in [0, 1].)
alphaSet = c(5, 50)
betaSet = c(3, 30)
nSet = c(100, 100) # number of timesteps with each (alpha, beta) pair
# Particle filtering parameters
sdTrans = .1 # SD for transition model
sdObs = 4 # SD for observation model
N = 10000 # number of particles
# Percentile interval
piSpec = 0.8
set.seed(100)
############# Program Body ########
# Generate "observations" (draws from beta distributions)
samp = c(mapply(rbeta, nSet, shape1=alphaSet, shape2=betaSet))
# Convert to a logit scale so can model it with a normal distribution
sampLogit = logit(samp)
# Number of time steps (as specified by nSet)
T = sum(nSet)
### 1. Initialization (t = 0) ###
x <- matrix( nrow=N, ncol=T ) # Matrix of particles at each timestep
weights <- matrix( nrow=N, ncol=T )
x[, 1] <- rnorm(N, 0, sdTrans) # Draw particles for the 1st timestep
### 2. Importance Sampling Step (t = 0) ###
# Calculate weights, i.e. probability of evidence given sample from X
weights[, 1] <- dnorm(sampLogit[1], x[, 1], sdObs)
# Normalise weights
weights[, 1] <- weights[, 1]/sum(weights[, 1])
### 3. Selection Step (t = 0) ###
# Weighted resampling with replacement. This ensures that X will converge
# to the true distribution
x[, 1] <- sample(x[, 1], replace = TRUE, size = N, prob = weights[, 1])
for (t in seq(2, T)) {
  ### 2. Importance Sampling Step ###
  # Predict x_{t} from previous time step x_{t-1}
  # based on process (transition) model
  x[, t] <- rnorm(N, x[, t-1], sdTrans)
  # Calculate and normalise weights
  weights[, t] <- dnorm(sampLogit[t], x[, t], sdObs)
  weights[, t] <- weights[, t]/sum(weights[, t])
  ### 3. Selection Step ###
  # Weighted resampling with replacement
  x[, t] <- sample(x[, t], replace = TRUE, size = N, prob = weights[, t])
}
# Create a data frame of the particles
dfX = data.frame(t = rep(1:T, each = N), x = c(x)) # c(x) is column-major: one row per particle per timestep
############# Plot ################
# Convert back to [0, 1] space
# Calculate mean
particleMean = inv_logit( apply( x, 2, mean ) )
# Calculate Percentile Interval (PI)
piVec = c( (1-piSpec)/2, 1-((1-piSpec)/2) )
particlePI = inv_logit( apply( x, 2, quantile, piVec) )
# Create data frames for plotting, and plot
particleMeanDf = data.frame(time=1:T, data=samp,
mean=particleMean)
particleMeanDfMelt = melt(particleMeanDf, id.vars="time")
particlePIDf = data.frame( time=1:T,
low=particlePI[1,], high=particlePI[2,])
particlePIDfMelt = melt(particlePIDf, id.vars="time")
particlePlt = ggplot() +
  geom_point(data=particleMeanDfMelt, aes(x=time, y=value, color=variable),
             alpha=0.8, size=1) +
  geom_ribbon(data=particlePIDf, aes(x=time, ymin=low, ymax=high),
              fill="steelblue", alpha=0.1) +
  geom_line(data=particlePIDf, aes(x=time, y=low),
            color="steelblue", alpha=0.2) +
  geom_line(data=particlePIDf, aes(x=time, y=high),
            color="steelblue", alpha=0.2) +
  ylim(0,1) +
  theme_light()
plot(particlePlt)
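On the actual question of learning sdObs from the data: one common idea, sketched below under the assumption that a slowly varying log(sdObs) is acceptable, is to augment each particle with its own observation SD and let it evolve as a random walk, so that resampling favours particles whose SD matches the local spread of the observations. This is only a sketch of the state-augmentation idea, not a tested drop-in fix for the code above; sdLogSdWalk is a made-up tuning parameter.
# State augmentation sketch: each particle carries (x, log sdObs)
sdLogSdWalk <- 0.05                    # assumed random-walk SD for log(sdObs)
xAug  <- matrix(nrow = N, ncol = T)    # state particles
logSd <- matrix(nrow = N, ncol = T)    # log observation-SD particles
xAug[, 1]  <- rnorm(N, 0, sdTrans)
logSd[, 1] <- rnorm(N, 0, 1)           # diffuse prior on log(sdObs)
w <- dnorm(sampLogit[1], xAug[, 1], exp(logSd[, 1]))
idx <- sample.int(N, N, replace = TRUE, prob = w / sum(w))
xAug[, 1]  <- xAug[idx, 1]
logSd[, 1] <- logSd[idx, 1]
for (t in 2:T) {
  xAug[, t]  <- rnorm(N, xAug[, t - 1], sdTrans)
  logSd[, t] <- rnorm(N, logSd[, t - 1], sdLogSdWalk)
  w <- dnorm(sampLogit[t], xAug[, t], exp(logSd[, t]))
  idx <- sample.int(N, N, replace = TRUE, prob = w / sum(w))
  xAug[, t]  <- xAug[idx, t]
  logSd[, t] <- logSd[idx, t]
}
# The filtered observation SD over time:
plot(1:T, exp(apply(logSd, 2, median)), type = "l")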
I fitted a glmmTMB model using family = nbinom1. Now I would like to perform a simulation of data based on predicted values and the dispersion. However, from the help files, it looks like the go-to rnbinom function uses the family=nbinom2 parameterization where variance is equal to mu + mu^2/size.
1) Can anyone help me figure out how to simulate family=nbinom1 data (where variance is equal to mu + mu*size)?
2) Also, is my extraction / use of the dispersion value as size correct?
Thanks so much!
Current code (data not provided because it doesn't matter), using the stats::rnbinom function despite the mismatch in variance definition:
library(glmmTMB)
mod <- glmmTMB(y ~ x + (1 | ID), data = df, family = nbinom1)
preds <- predict(mod, type = "response")
size <- sigma(mod)
sim <- rnbinom(nrow(df), mu = preds, size = size)
We can try to simulate nbinom1 data, where the variance is mu + mu*k:
set.seed(111)
k = 2
x = runif(100,min=1,max=3)
y = rnbinom(100,mu=exp(2*x),size=exp(2*x)/k)
ID = sample(1:2,100,replace=TRUE)
df = data.frame(x,y,ID)
mod <- glmmTMB(y ~ x + (1 | ID), data = df, family = nbinom1)
sigma(mod)
[1] 1.750076
In the above, for every mean mu, I specified a size of mu / k so that the expected variance is mu + mu*k. This shows that as long as you parameterize rnbinom correctly, you get back nbinom1.
Now, if we need to simulate data from this model, we just use the same parameterization as above:
preds <- predict(mod, type = "response")
size <- sigma(mod)
sim <- rnbinom(nrow(df), mu = preds, size = preds/size)
plot(sim,df$y)
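As a quick, hedged sanity check of this parameterization (using the fitted dispersion from above and a made-up mean value mu0), the simulated variance at a fixed mean should be close to the nbinom1 form mu0 * (1 + dispersion):
mu0 <- 20
chk <- rnbinom(1e5, mu = mu0, size = mu0 / sigma(mod))
var(chk)               # empirical variance of the simulated counts
mu0 * (1 + sigma(mod)) # nbinom1 variance implied by the fitted dispersion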
There are a variety of issues here, including:
sigma(mod) gives the estimated standard deviation of the residuals; it is not a variance but the square-root of a variance, so you might want to square it.
there are many parametrisations of a negative binomial distribution beyond R's version, but in R's version, if the mean is mean(dat) and the variance var(dat), then you can estimate size with mean(dat)^2/(var(dat)-mean(dat)) and the probability prob with mean(dat)/var(dat) (a quick numeric check of these formulas appears at the end of this answer).
rnbinom() will tolerate size being non-integer or infinite despite this being theoretical nonsense; it will not tolerate size being negative, which can happen if var(dat) is less than mean(dat). It will also have problems if the mean is negative or if size is zero.
So perhaps you could consider adapting your simulation lines to something like
sizes <- ifelse(sigma(mod) ^ 2 > preds, preds ^ 2 / (sigma(mod) ^ 2 - preds), Inf)
sim <- ifelse(preds > 0, rnbinom(nrow(df), mu = preds, size = sizes), 0)
You might still get errors when sigma(mod) is less than or equal to preds, though.
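A quick numeric check of the moment formulas above (made-up mu and size values, purely illustrative):
dat <- rnbinom(1e5, mu = 5, size = 2)
m <- mean(dat); v <- var(dat)
m^2 / (v - m) # should be close to the true size, 2
m / v         # should be close to the true prob, size/(size + mu) = 2/7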
I have a list of data:
frequency: x1, x2, ..., xn, i.e. 10, 20, ..., 5000
intensity: y1, y2, ..., yn, i.e. 0, 0, ..., 50, 60, 50, ..., 0
where I want to fit a normal distribution to the data.
I found some websites online, such as http://www.di.fc.ul.pt/~jpn/r/distributions/fitting.html, which use a procedure like
my_data <- rnorm(250, mean=1, sd=0.45) # unknown distribution parameters
fit <- fitdistr(my_data, densfun="normal")
but obviously, those methods won't work.
How to fit the above data to a normal distribution?
You can use the maximum likelihood estimation function mle() from the stats4 package to solve this problem. Here is how you would do that:
library(stats4)
my_data <- rnorm(250, mean=1, sd=0.45) # unknown distribution parameters
logLik <- function(sigma, mu){
  ll <- vapply(my_data,
               function(x) dnorm(x, mean = mu, sd = sigma),
               FUN.VALUE = numeric(1))
  -sum(log(ll))
}
mle(logLik, start = list(sigma = 1, mu = 1))
mle() requires a negative log-likelihood function that it uses to determine the optimal parameters (which in the case of a normal distribution are mu (mean) and sigma (st. dev.)). It takes the negative sum of the log-likelihood, -sum(log(ll)), as part of a numerical procedure to find the best parameters for the distribution. It then returns the estimated parameters:
Call:
mle(minuslogl = logLik, start = list(sigma = 1, mu = 1))
Coefficients:
sigma mu
0.4595003 0.9724402
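The question's data are (frequency, intensity) pairs rather than raw draws, so the example above does not apply directly. One hypothetical adaptation (x_vals and intensity below are made-up placeholders for your vectors, not your actual data) is to weight each x value by its intensity when summing the log-likelihood:
library(stats4)
x_vals    <- seq(10, 5000, by = 10)               # assumed frequency grid
intensity <- 60 * exp(-((x_vals - 2500) / 600)^2) # assumed intensities, illustration only
wLogLik <- function(sigma, mu) {
  -sum(intensity * dnorm(x_vals, mean = mu, sd = sigma, log = TRUE))
}
mle(wLogLik, start = list(sigma = 500, mu = 2000))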
time = 1:100
head(y)
[1] 0.07841589 0.07686316 0.07534116 0.07384931 0.07238699 0.07095363
plot(time,y)
This is an exponential curve.
How can I fit a line to this curve without knowing the formula? I can't use nls() as the formula is unknown (only data points are given).
How can I get the equation for this curve and determine the constants in the equation?
I tried loess but it doesn't give the intercepts.
You need a model to fit to the data.
Without knowing the full details of your model, let's say that this is an
exponential growth model,
which one could write as: y = a * e^(r*t)
Where y is your measured variable, t is the time at which it was measured,
a is the value of y when t = 0 and r is the growth constant.
We want to estimate a and r.
This is a non-linear problem because we want to estimate the exponent, r.
However, in this case we can use some algebra and transform it into a linear equation by taking the log on both sides and solving (remember
logarithmic rules), resulting in:
log(y) = log(a) + r * t
We can visualise this with an example, by generating a curve from our model, assuming some values for a and r:
t <- 1:100 # these are your time points
a <- 10 # assume the size at t = 0 is 10
r <- 0.1 # assume a growth constant
y <- a*exp(r*t) # generate some y observations from our exponential model
# visualise
par(mfrow = c(1, 2))
plot(t, y) # on the original scale
plot(t, log(y)) # taking the log(y)
So, for this case, we could explore two possibilities:
Fit our non-linear model to the original data (for example using nls() function)
Fit our "linearised" model to the log-transformed data (for example using the lm() function)
Which option to choose (and there's more options), depends on what we think
(or assume) is the data-generating process behind our data.
Let's illustrate with some simulations that include added noise (sampled from
a normal distribution), to mimic real data. Please look at this
StackExchange post
for the reasoning behind this simulation (pointed out by Alejo Bernardin's comment).
set.seed(12) # for reproducible results
# errors constant across time - additive
y_add <- a*exp(r*t) + rnorm(length(t), sd = 5000) # or: rnorm(length(t), mean = a*exp(r*t), sd = 5000)
# errors grow as y grows - multiplicative (constant on the log-scale)
y_mult <- a*exp(r*t + rnorm(length(t), sd = 1)) # or: rlnorm(length(t), mean = log(a) + r*t, sd = 1)
# visualise
par(mfrow = c(1, 2))
plot(t, y_add, main = "additive error")
lines(t, a*exp(t*r), col = "red")
plot(t, y_mult, main = "multiplicative error")
lines(t, a*exp(t*r), col = "red")
For the additive model, we could use nls(), because the error is constant across
t. When using nls() we need to specify some starting values for the optimization algorithm (try to "guesstimate" what these are, because nls() often struggles to converge on a solution).
add_nls <- nls(y_add ~ a*exp(r*t),
start = list(a = 0.5, r = 0.2))
coef(add_nls)
# a r
# 11.30876845 0.09867135
Using the coef() function we can get the estimates for the two parameters.
This gives us OK estimates, close to what we simulated (a = 10 and r = 0.1).
You could see that the error variance is reasonably constant across the range of the data, by plotting the residuals of the model:
plot(t, resid(add_nls))
abline(h = 0, lty = 2)
For the multiplicative error case (our y_mult simulated values), we should use lm() on log-transformed data, because
the error is constant on that scale instead.
mult_lm <- lm(log(y_mult) ~ t)
coef(mult_lm)
# (Intercept) t
# 2.39448488 0.09837215
To interpret this output, remember again that our linearised model is log(y) = log(a) + r*t, which is equivalent to a linear model of the form Y = β0 + β1 * X, where β0 is our intercept and β1 our slope.
Therefore, in this output (Intercept) is equivalent to log(a) of our model and t is the coefficient for the time variable, so equivalent to our r.
To meaningfully interpret the (Intercept) we can take its exponential (exp(2.39448488)), giving us ~10.96, which is quite close to our simulated value.
It's worth noting what would happen if we'd fit data where the error is multiplicative
using the nls function instead:
mult_nls <- nls(y_mult ~ a*exp(r*t), start = list(a = 0.5, r = 0.2))
coef(mult_nls)
# a r
# 281.06913343 0.06955642
Now we over-estimate a and under-estimate r
(Mario Reutter
highlighted this in his comment). We can visualise the consequence of using the wrong approach to fit our model:
# get the model's coefficients
lm_coef <- coef(mult_lm)
nls_coef <- coef(mult_nls)
# make the plot
plot(t, y_mult)
lines(t, a*exp(r*t), col = "brown", lwd = 5)
lines(t, exp(lm_coef[1])*exp(lm_coef[2]*t), col = "dodgerblue", lwd = 2)
lines(t, nls_coef[1]*exp(nls_coef[2]*t), col = "orange2", lwd = 2)
legend("topleft", col = c("brown", "dodgerblue", "orange2"),
legend = c("known model", "nls fit", "lm fit"), lwd = 3)
We can see how the lm() fit to log-transformed data was substantially better than the nls() fit on the original data.
You can again plot the residuals of this model, to see that the variance is not constant across the range of the data (we can also see this in the graphs above, where the spread of the data increases for higher values of t):
plot(t, resid(mult_nls))
abline(h = 0, lty = 2)
Unfortunately, taking the logarithm and fitting a linear model is not optimal (in the least-squares sense on the original scale). The reason is that the errors for large y-values carry much more weight than those for small y-values when applying the exponential function to go back to the original model.
Here is one example:
f <- function(x){exp(0.3*x+5)}
squaredError <- function(a,b,x,y) {sum((exp(a*x+b)-f(x))^2)}
x <- 0:12
y <- f(x) * ( 1 + sample(-300:300,length(x),replace=TRUE)/10000 )
x
y
#--------------------------------------------------------------------
M <- lm(log(y)~x)
a <- unlist(M[1])[2]
b <- unlist(M[1])[1]
print(c(a,b))
squaredError(a,b,x,y)
approxPartAbl_a <- (squaredError(a+1e-8,b,x,y) - squaredError(a,b,x,y))/1e-8
for ( i in 0:10 )
{
eps <- -i*sign(approxPartAbl_a)*1e-5
print(c(eps,squaredError(a+eps,b,x,y)))
}
Result:
> f <- function(x){exp(0.3*x+5)}
> squaredError <- function(a,b,x,y) {sum((exp(a*x+b)-f(x))^2)}
> x <- 0:12
> y <- f(x) * ( 1 + sample(-300:300,length(x),replace=TRUE)/10000 )
> x
[1] 0 1 2 3 4 5 6 7 8 9 10 11 12
> y
[1] 151.2182 203.4020 278.3769 366.8992 503.5895 682.4353 880.1597 1186.5158 1630.9129 2238.1607 3035.8076 4094.6925 5559.3036
> #--------------------------------------------------------------------
>
> M <- lm(log(y)~x)
> a <- unlist(M[1])[2]
> b <- unlist(M[1])[1]
> print(c(a,b))
coefficients.x coefficients.(Intercept)
0.2995808 5.0135529
> squaredError(a,b,x,y)
[1] 5409.752
> approxPartAbl_a <- (squaredError(a+1e-8,b,x,y) - squaredError(a,b,x,y))/1e-8
> for ( i in 0:10 )
+ {
+ eps <- -i*sign(approxPartAbl_a)*1e-5
+ print(c(eps,squaredError(a+eps,b,x,y)))
+ }
[1] 0.000 5409.752
[1] -0.00001 5282.91927
[1] -0.00002 5157.68422
[1] -0.00003 5034.04589
[1] -0.00004 4912.00375
[1] -0.00005 4791.55728
[1] -0.00006 4672.70592
[1] -0.00007 4555.44917
[1] -0.00008 4439.78647
[1] -0.00009 4325.71730
[1] -0.0001 4213.2411
>
Perhaps one can try some numerical method, e.g. a gradient search, to find the minimum of the squared error function.
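For example, one could hand the squared-error objective (this time measured against the observed y rather than f(x)) to optim(), starting from the lm() estimates a and b computed above; nls() would be the more standard tool, but this makes the idea of the numerical search explicit:
sse <- function(p) sum((exp(p[1]*x + p[2]) - y)^2) # p[1] = slope, p[2] = intercept
fit <- optim(c(a, b), sse, method = "BFGS")
fit$par # least-squares estimates on the original scale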
If it really is exponential, you can try taking the logarithm of your variable and fitting a linear model to that.
This question probably stems from the fact that I don't fully understand what the predict() function is doing, but I'm wondering if there is a way to access the underlying prediction data so that I can get prediction intervals for a given unobserved value. Here's what I mean:
x <- rnorm(100,10)
y <- x+rnorm(100,5)
And making a linear model:
mod1 <- lm(y ~ x)
If I want the confidence intervals for the model estimates, I can do:
confint(mod1)
and get
                 2.5 %    97.5 %
(Intercept) -8.1864342 29.254714
x            0.7578651  1.132339
If I wanted to, I could plug these lower and upper bound estimates into a prediction equation to get a lower and upper confidence interval for some input of x.
What if I want to do the same, but with a prediction interval? Using
predict(mod1, interval = "prediction")
looks like it fits the model to the existing data with lower and upper bounds, but doesn't tell me which parameters those lower and upper bounds are based on so that I could use them for an unobserved value.
(I know I can technically put a value into the predict() command, but I just want the underlying parameters so that I don't necessarily have to do the prediction in R)
The predict() function accepts a newdata argument that computes the interval for unobserved values. Here is an example:
x <- rnorm(100, 10)
y <- x + rnorm(100, 5)
d <- data.frame(x = x, y = y)
mod <- lm(y ~ x, data = d)
d2 <- data.frame(x = c(0.3, 0.6, 0.2))
predict(mod, newdata = d2, interval = 'prediction')
I don't know what you mean by underlying parameters. The computation of prediction intervals involves a complex formula and you cannot reduce it to a few simple parameters.
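That said, for the single-predictor case above a by-hand sketch shows which quantities enter that formula (the residual standard error, the sample size, and the spread of x), which is why the coefficient confidence limits alone are not enough:
# By-hand 95% prediction interval for one new value x0 (simple regression only)
x0 <- 0.3
s  <- summary(mod)$sigma # residual standard error
n  <- nrow(d)
se_pred <- s * sqrt(1 + 1/n + (x0 - mean(d$x))^2 / sum((d$x - mean(d$x))^2))
fit0 <- unname(predict(mod, newdata = data.frame(x = x0)))
fit0 + c(-1, 1) * qt(0.975, df = n - 2) * se_pred # should match interval = "prediction"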