Simulate an AR(1) process with uniform innovations in R

I need to plot an AR(1) graph for the process
y[k] = 0.75 * y[k-1] + e[k], with y[0] = 1.
Assume that e[k] is uniformly distributed on the interval [-0.5, 0.5].
I am trying to use arima.sim:
library(tseries)
y.0 <- arima.sim(model=list(ar=.75), n=100)
plot(y.0)
It does not seem correct. Also, what parameters do I change if y[0] = 10?

The base R function arima.sim (in the stats package) handles this task; no extra libraries are required.
By default, arima.sim generates an ARIMA process with innovations ~ N(0, 1). To change this, we need to control the rand.gen or innov argument. For example, for innovations from the uniform distribution U[-0.5, 0.5], either of the following works:
arima.sim(model=list(ar=.75), n=100, rand.gen = runif, min = -0.5, max = 0.5)
arima.sim(model=list(ar=.75), n = 100, innov = runif(100, -0.5, 0.5))
Example
set.seed(0)
y <- arima.sim(model=list(ar=.75), n = 100, innov = runif(100, -0.5, 0.5))
ts.plot(y)
If we want explicit control over y[0], we can shift the above time series so that it starts at the desired value. Suppose y0 is our desired starting value; then
y <- y - y[1] + y0
For example, starting from y0 = 1:
y <- y - y[1] + 1
ts.plot(y)
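If the initial condition should actually enter the dynamics (the shift above only translates the series, it does not change its shape), you can simulate the recursion directly. A minimal sketch, assuming the same model as in the question:
# Direct simulation of y[k] = 0.75 * y[k-1] + e[k] with an explicit y[0].
set.seed(0)
n  <- 100
y0 <- 10                                # use 1 for the original question
e  <- runif(n, min = -0.5, max = 0.5)   # uniform innovations
y  <- numeric(n)
y[1] <- 0.75 * y0 + e[1]                # first step taken from y[0]
for (k in 2:n) y[k] <- 0.75 * y[k - 1] + e[k]
ts.plot(ts(c(y0, y)))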


Confidence intervals from model coefficients vs whole model

I'm trying to demonstrate that there is an important difference between two ways of making linear model predictions. The first way, which my heart tells me is more correct, uses predict.lm, which as I understand it preserves the correlations between coefficients. The second approach uses the parameters independently.
Is this the correct way to show the difference? The two approaches seem somewhat close.
Also, is the StdErr of the coefficients the same as the standard deviation of their distributions? Or have I misunderstood what the model table is saying?
Below is a quick reprex to show what I mean:
# fake dataset
xs <- runif(200, min = -1, max = 1)
true_inter <- -1.3
true_slope <- 3.1
ybar <- true_inter + true_slope*xs
ys <- rnorm(200, ybar, sd = 1)
model <- lm(ys~xs)
# predictions
coef_sterr <- summary(model)$coefficients
inters <- rnorm(500, mean = coef_sterr[1,1], sd = coef_sterr[1,2])
slopes <- rnorm(500, mean = coef_sterr[2,1], sd = coef_sterr[2,2])
newx <- seq(from = -1, to= 1, length.out = 20)
avg_predictions <- cbind(1, newx) %*% rbind(inters, slopes)
conf_predictions <- apply(avg_predictions, 1, quantile, probs = c(.025, .975), simplify = TRUE)
# from confint
conf_interval <- predict(model, newdata = data.frame(xs = newx),
                         interval = "confidence", level = 0.95)
# plot to visualize
plot(ys~xs)
# averages are exactly the same
abline(model)
abline(a = coef(model)[1], b = coef(model)[2], col = "red")
# from predict, using parameter covariance
matlines(newx, conf_interval[,2:3], col = "blue", lty=1, lwd = 3)
# from simulated lines, ignoring parameter covariance
matlines(newx, t(conf_predictions), col = "orange", lty = 1, lwd = 2)
In this case, they would be close because there is very little correlation between the model parameters, so drawing them from two independent normals versus a multivariate normal is not that different:
set.seed(519)
xs <- runif(200, min = -1, max = 1)
true_inter <- -1.3
true_slope <- 3.1
ybar <- true_inter + true_slope*xs
ys <- rnorm(200, ybar, sd = 1)
model <- lm(ys~xs)
cov2cor(vcov(model))
# (Intercept) xs
# (Intercept) 1.00000000 -0.08054106
# xs -0.08054106 1.00000000
Also, it is probably worth calculating both intervals the same way, though it shouldn't make much difference. That said, 500 draws may not be enough to get reliable estimates of the 2.5th and 97.5th percentiles of the distribution. Let's consider a slightly more complex example, in which the two X variables are correlated: the correlation of the parameters derives in part from the correlation of the columns of the design matrix, X.
set.seed(519)
X <- MASS::mvrnorm(200, c(0,0), matrix(c(1,.65,.65,1), ncol=2))
b <- c(-1.3, 3.1, 2.5)
ytrue <- cbind(1,X) %*% b
y <- ytrue + rnorm(200, 0, .5*sd(ytrue))
dat <- data.frame(y=y, x1=X[,1], x2=X[,2])
model <- lm(y ~ x1 + x2, data=dat)
cov2cor(vcov(model))
# (Intercept) x1 x2
# (Intercept) 1.00000000 0.02417386 -0.01515887
# x1 0.02417386 1.00000000 -0.73228003
# x2 -0.01515887 -0.73228003 1.00000000
In this example, the coefficients for x1 and x2 are correlated around -0.73. As you'll see, this still doesn't result in a huge difference. Let's calculate the relevant statistics.
First, we draw B1 using the multivariate method that you rightly suspect is correct. Then, we'll draw B2 from a bunch of independent normals (actually, I'm using a multivariate normal with a diagonal variance-covariance matrix, which is the same thing).
b_est <- coef(model)
v <- vcov(model)
B1 <- MASS::mvrnorm(2500, b_est, v, empirical=TRUE)
B2 <- MASS::mvrnorm(2500, b_est, diag(diag(v)), empirical = TRUE)
Now, let's make a hypothetical X matrix and generate the relevant predictions:
hypX <- data.frame(x1 = seq(-3, 3, length = 50),
                   x2 = mean(dat$x2))
yhat1 <- as.matrix(cbind(1, hypX)) %*% t(B1)
yhat2 <- as.matrix(cbind(1, hypX)) %*% t(B2)
Then we can calculate confidence intervals, etc...
yh1_ci <- t(apply(yhat1, 1, function(x)unname(quantile(x, c(.025,.975)))))
yh2_ci <- t(apply(yhat2, 1, function(x)unname(quantile(x, c(.025,.975)))))
yh1_ci <- as.data.frame(yh1_ci)
yh2_ci <- as.data.frame(yh2_ci)
names(yh1_ci) <- names(yh2_ci) <- c("lwr", "upr")
yh1_ci$fit <- c(as.matrix(cbind(1, hypX)) %*% b_est)
yh2_ci$fit <- c(as.matrix(cbind(1, hypX)) %*% b_est)
yh1_ci$method <- factor(1, c(1,2), labels=c("Multivariate", "Independent"))
yh2_ci$method <- factor(2, c(1,2), labels=c("Multivariate", "Independent"))
yh1_ci$x1 <- hypX[,1]
yh2_ci$x1 <- hypX[,1]
yh <- rbind(yh1_ci, yh2_ci)
We could then plot the two confidence intervals as you did.
library(ggplot2)
ggplot(yh, aes(x = x1, y = fit, ymin = lwr, ymax = upr, fill = method)) +
  geom_ribbon(colour = "transparent", alpha = .25) +
  geom_line() +
  theme_classic()
Perhaps a better visual would be to compare the widths of the intervals.
w1 <- yh1_ci$upr - yh1_ci$lwr
w2 <- yh2_ci$upr - yh2_ci$lwr
ggplot() +
  geom_point(aes(x = hypX[,1], y = w2 - w1)) +
  theme_classic() +
  labs(x = "x1", y = "Width (Independent) - Width (Multivariate)")
This shows that for small values of x1, the independent confidence intervals are wider than the multivariate ones. For values of x1 above 0, it's a more mixed bag.
This tells you that there is some difference, but you don't need the simulation to know which one is 'right'. The prediction is a linear combination of constants and random variables: here the b terms are the random variables and the x values are the constants. The variance of such a linear combination is
Var(x0*b0 + x1*b1 + x2*b2) = x0^2 * Var(b0) + x1^2 * Var(b1) + x2^2 * Var(b2) + 2*x0*x1*Cov(b0, b1) + 2*x0*x2*Cov(b0, b2) + 2*x1*x2*Cov(b1, b2),
or in matrix form x' V x, with V the coefficient covariance matrix. Drawing the coefficients independently drops the covariance terms, which is exactly the difference plotted above. All that is to say that your intuition is correct.
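To see the covariance terms at work numerically, here is a small check (my own addition, reusing model, v, and hypX from above): the diagonal of X V X' reproduces the standard errors that predict() uses, while the independent-draws approach effectively zeroes the off-diagonal of V.
# Analytic prediction variances: Var(x'b) = x' V x for each row x of the
# hypothetical design matrix; these match predict()'s se.fit exactly.
X_new <- cbind(1, hypX$x1, hypX$x2)
se_analytic <- sqrt(diag(X_new %*% v %*% t(X_new)))
se_predict  <- predict(model, newdata = hypX, se.fit = TRUE)$se.fit
all.equal(unname(se_analytic), unname(se_predict))   # TRUE
# What the independent draws implicitly compute: V with covariances dropped.
se_indep <- sqrt(diag(X_new %*% diag(diag(v)) %*% t(X_new)))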

Failing to simulate data for a negative binomial probability distribution

I am hoping someone can help me.
In a beginners workshop I attended, in the process of fitting a multiple regression model, the instructor initially established a prior predictive check using a Poisson distribution for the outcome. This was done in two steps. Initially, a function was created:
multiple_regression_poisson_dgp <- function(predictor1, predictor2,
                                            alpha_mean, alpha_sd,
                                            beta_predictor1_mean, beta_predictor1_sd,
                                            beta_predictor2_mean, beta_predictor2_sd) {
  N <- length(predictor1)
  alpha <- rnorm(1, mean = alpha_mean, sd = alpha_sd)
  beta_predictor1 <- rnorm(1, mean = beta_predictor1_mean, sd = beta_predictor1_sd)
  beta_predictor2 <- rnorm(1, mean = beta_predictor2_mean, sd = beta_predictor2_sd)
  outcome <- rpois(N, lambda = alpha + beta_predictor1 * predictor1 +
                        beta_predictor2 * predictor2)
  return(outcome)
}
After this function was created, the following priors were generated:
multiple_regression_poisson_dgp(dataset$predictor1,
                                dataset$predictor2,
                                alpha_mean = 1,
                                alpha_sd = 0.5,
                                beta_predictor1_mean = -0.25,
                                beta_predictor1_sd = 0.5,
                                beta_predictor2_mean = 0,
                                beta_predictor2_sd = 1)
This worked fine. The issue is that, further down the line, it was shown that the Poisson distribution was not adequate, and the negative binomial was suggested as the next step. Unfortunately, when I try to replicate the process for the negative binomial, I am unsuccessful. I have tried to replicate both of the steps shown above for the negative binomial. The first step was coded as:
multiple_regression_negative_binomial_dgp <- function(predictor1, predictor2,
                                                      alpha_mean, alpha_sd,
                                                      beta_predictor1_mean, beta_predictor1_sd,
                                                      beta_predictor2_mean, beta_predictor2_sd,
                                                      phi_mean, phi_sd) {
  N <- length(predictor1)
  alpha <- rnorm(1, mean = alpha_mean, sd = alpha_sd)
  beta_predictor1 <- rnorm(1, mean = beta_predictor1_mean, sd = beta_predictor1_sd)
  beta_predictor2 <- rnorm(1, mean = beta_predictor2_mean, sd = beta_predictor2_sd)
  phi <- rnorm(1, mean = phi_mean, sd = phi_sd)
  outcome <- rnbinom(N, size = mu + mu^2/phi,
                     mu = alpha + beta_predictor1 * predictor1 +
                          beta_predictor2 * predictor2)
  return(outcome)
}
Because there is a phi in the negative binomial, and given that it will be a parameter whose prior I will be calculating, I assumed it needed to be added to the equation. Additionally, given the documentation for rnbinom(), I thought I could treat mu as I treated lambda in the Poisson generation, feeding the regression equation into it.
The function is likely inadequate, but the errors only emerge after I create it and move on to the second step. The second step I coded as:
multiple_regression_negative_binomial_dgp(dataset$predictor1,
                                          dataset$predictor2,
                                          alpha_mean = 1,
                                          alpha_sd = 0.5,
                                          beta_predictor1_mean = -0.25,
                                          beta_predictor1_sd = 0.5,
                                          beta_predictor2_mean = 0,
                                          beta_predictor2_sd = 1,
                                          phi_mean = 0,
                                          phi_sd = 1)
However, as soon as I try to run this data generating process, I get the error:
Error in rnbinom(N, size = mu, mu = alpha + beta_predictor1 * predictor1 + beta_predictor2 * predictor2) : object 'mu' not found
Any help would be much appreciated. I realize that I am applying a rather mechanistic mindset in trying to replicate the Poisson data generating process for the negative binomial one, but I have been unable to find any clues as to how to solve this. Most examples I came across define a value for mu and for size, instead of 'feeding' them the formula.
In your multiple_regression_negative_binomial_dgp function, you call rnbinom. That function needs a size argument, and you assign mu + mu^2/phi to it, but mu is defined neither within the function nor passed to it. The fact that rnbinom has a mu argument, which you do provide (alpha + beta_predictor1 * predictor1 + beta_predictor2 * predictor2), doesn't take care of it, because rnbinom doesn't pass that information on to size. I would suggest you try:
multiple_regression_negative_binomial_dgp <- function(predictor1, predictor2,
                                                      alpha_mean, alpha_sd,
                                                      beta_predictor1_mean, beta_predictor1_sd,
                                                      beta_predictor2_mean, beta_predictor2_sd,
                                                      phi_mean, phi_sd) {
  N <- length(predictor1)
  alpha <- rnorm(1, mean = alpha_mean, sd = alpha_sd)
  beta_predictor1 <- rnorm(1, mean = beta_predictor1_mean, sd = beta_predictor1_sd)
  beta_predictor2 <- rnorm(1, mean = beta_predictor2_mean, sd = beta_predictor2_sd)
  phi <- rnorm(1, mean = phi_mean, sd = phi_sd)
  Mu <- alpha + beta_predictor1 * predictor1 + beta_predictor2 * predictor2
  outcome <- rnbinom(N, size = Mu + Mu^2/phi, mu = Mu)
  return(outcome)
}
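One caveat worth adding (my reading of ?rnbinom, not part of the original answer): size is the dispersion parameter itself, with Var(Y) = mu + mu^2/size. So if phi is meant to be the usual NB2 dispersion, you would pass size = phi directly rather than Mu + Mu^2/phi; and since size must be strictly positive, a plain normal prior on phi will occasionally produce NaNs. A sketch under those assumptions:
# Same DGP under the standard NB2 parameterization: rnbinom()'s size IS phi.
# abs() folds the normal draw into a half-normal so that size > 0. Note that
# with this identity link Mu can still go negative, in which case rnbinom()
# returns NaN; an exp() link on the linear predictor would avoid that.
multiple_regression_nb2_dgp <- function(predictor1, predictor2,
                                        alpha_mean, alpha_sd,
                                        beta_predictor1_mean, beta_predictor1_sd,
                                        beta_predictor2_mean, beta_predictor2_sd,
                                        phi_mean, phi_sd) {
  N <- length(predictor1)
  alpha <- rnorm(1, mean = alpha_mean, sd = alpha_sd)
  beta_predictor1 <- rnorm(1, mean = beta_predictor1_mean, sd = beta_predictor1_sd)
  beta_predictor2 <- rnorm(1, mean = beta_predictor2_mean, sd = beta_predictor2_sd)
  phi <- abs(rnorm(1, mean = phi_mean, sd = phi_sd))
  Mu <- alpha + beta_predictor1 * predictor1 + beta_predictor2 * predictor2
  rnbinom(N, size = phi, mu = Mu)
}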

How to calculate x-values of the convolution of two distributions?

(This question may be better suited for https://stats.stackexchange.com/, but I'm thinking that how to calculate what I want in R is really my question.)
I'm trying to add multiple distributions together, and then look at the resulting distribution. I'll illustrate my problem with a simple example using normally distributed random variables, p1 and p2.
set.seed(21)
N <- 1000
p1 <- rnorm(N, mean = 0, sd = 1)
p2 <- rnorm(N, mean = 10, sd = 1)
Which we can plot:
library(dplyr)    # for %>% and bind_rows() used below
library(tidyr)    # for gather()
library(ggplot2)
data.frame(p1, p2) %>%
  gather(key = "dist", value = "value") %>%
  ggplot(aes(value, color = dist)) + geom_density()
I can add these distributions together using convolve. Okay, so that's fine. But what I can't figure out is how to plot the sum of the distributions with the appropriate x-values. In the examples I've seen, the x-values are added manually in a way that doesn't seem "accurate", for lack of a better word. See this example.
I can "add" them together and plot:
pdf.c <- convolve(pdf1.y, pdf2.y, type = "open")
plot(pdf.c, type="l")
My question is how to get the corresponding x-values of the new distribution. I'm sure I'm missing something from a foundational statistics point of view.
Appendix for pdf1 and pdf2:
set.seed(21)
N <- 1000
p1 <- rnorm(N, mean = 0, sd = 1)
p2 <- rnorm(N, mean = 10, sd = 1)
pdf1.x <- density(p1)$x
pdf2.x <- density(p2)$x
pdf1.y <- density(p1)$y / sum(density(p1)$y)
pdf2.y <- density(p2)$y / sum(density(p2)$y)
df1 <- data.frame(pdf.x = pdf1.x, pdf.y = pdf1.y, dist = "1", stringsAsFactors = FALSE)
df2 <- data.frame(pdf.x = pdf2.x, pdf.y = pdf2.y, dist = "2", stringsAsFactors = FALSE)
df <- bind_rows(df1, df2)
Assume that p1 and p2 are discretized uniformly, with the same interval dx between successive x values. (I see that you have discretized p1 and p2 on two different grids via density(); that's not the same, and, without thinking about it some more, I don't have an answer for that case.) Let x1_k = x1_0 + (k - 1)*dx, k = 1, 2, ..., n1, be the points at which p1 is discretized, and x2_k = x2_0 + (k - 1)*dx, k = 1, 2, ..., n2, the points at which p2 is discretized.
Each point xi_k = xi_0 + (k - 1)*dx represents the center of a bar of width dx and height pi(xi_k), i = 1, 2. The mass of the bar is therefore dx * pi(xi_k), and the total mass over all bars approaches 1 as dx approaches 0. These masses are the values that are convolved. If the discretized masses are normalized to sum to 1, then their convolution will also sum to 1.
To be very careful, the range over which distribution i is discretized runs from xi_0 - dx/2 to xi_0 + (ni - 1)*dx + dx/2. After computing the convolution, the range of the result likewise extends dx/2 beyond its first and last points.
The convolution has n = n1 + n2 - 1 points, namely x1_0 + x2_0 + (k - 1)*dx, k = 1, 2, ..., n1 + n2 - 1. The first point is x1_0 + x2_0 (the first point of p1 plus the first point of p2) and the last is x1_0 + x2_0 + (n1 + n2 - 2)*dx = (x1_0 + (n1 - 1)*dx) + (x2_0 + (n2 - 1)*dx) (the last point of p1 plus the last point of p2). From this you can construct the x values corresponding to the convolution via seq or something like that.
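A minimal sketch of that construction (my own illustration, assuming both densities are evaluated on grids sharing a common spacing dx, which density() on two separate samples does not guarantee):
# Discretize both densities on explicit grids with the same spacing dx.
dx <- 0.05
x1 <- seq(-4, 4, by = dx)                  # grid for p1 ~ N(0, 1)
x2 <- seq(6, 14, by = dx)                  # grid for p2 ~ N(10, 1)
m1 <- dnorm(x1, mean = 0, sd = 1) * dx     # bar masses, each summing to ~1
m2 <- dnorm(x2, mean = 10, sd = 1) * dx
# convolve() with type = "open" needs one input reversed to give a true
# convolution (see the examples in ?convolve); the result has n1 + n2 - 1 masses.
mc <- convolve(m1, rev(m2), type = "open")
# The x-values start at x1[1] + x2[1] and advance in steps of dx, as described above.
xc <- seq(from = x1[1] + x2[1], by = dx, length.out = length(mc))
plot(xc, mc / dx, type = "l")              # divide by dx to recover a density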

Log-likelihood calculation given estimated parameters

In general: I want to calculate the (log) likelihood of data N given the estimated model parameters from data O.
More specifically, I want to know if my ll_given_modPars function below exists in one of the many R packages dealing with data modeling (lme4, glmm, etc.), as in this abstract example (not run):
library(lme4)
o_model <- lmer(observed ~ fixed.id + (1|random.id), data = O, REML = F)
n_logLik <- ll_given_modPars(model.estimates = o_model, data = N)
The fictional example above is a linear mixed model for simplicity, but I would eventually like to do this in a generalized linear mixed model for the Poisson family, or directly for the negative binomial (for lme4: glmer(..., family = "poisson") or glmer.nb).
From what I could see most packages deal with parameter estimation (great, I need that) but then compare models on the same data with different combinations of fixed and random effects using anova or something to that extent which is not what I want to do.
I want the log likelihood for the same parameters on different data.
The main attempts made:
After not finding a function that seems to do this, I thought of 'simply' tweaking the lme4 code to my purposes: it calculates the log-likelihood of parameters given the data, so I thought I could use the same framework but, instead of having it optimize over different parameters, isolate the likelihood-calculation function and just give it the parameters and the data. Unfortunately, the code is a bit above my current skills (https://github.com/lme4/lme4/blob/master/R/nbinom.R); I get a bit lost in how they use the objects over which they optimize.
I thought of doing the likelihood calculation myself, starting with a linear mixed model and working my way up to more involved ones. But already with this example I'm having a hard time following the math, and even when using the formula as specified, the log-likelihood I obtain is different (I don't know why; see code in appendix). I fear it will take me too long before I'm able to do it for the more involved models (such as Poisson or negative binomial).
At this point I'm not sure what avenue is best to pursue and would appreciate any input you might have.
Appendix: trying to calculate the log-likelihood (or finding a closed-form approximation) based on "How does lmer (from the R package lme4) compute log likelihood?". lmer (from lme4) gives a log-likelihood of -17.8, while I get -45.56.
library(lme4)
set.seed(7)
n <- 2 # number of groups
m <- 4 # number of instances per group
fixed.effect <- c(0, -2, -1, 1)
tau <- 5 # standard deviation of random effects
sigma <- 2 # standard deviation of error
random.effect <- rnorm(n, mean=0, sd=tau)
sim.data <- data.frame(GROUP.ID = as.factor(rep(1:n, each = m)),
                       GROUP.EFFECT = rep(random.effect, each = m),
                       INSTANCE.ID = as.factor(rep(1:m, times = n)),
                       INSTANCE.EFFECT = rep(fixed.effect, times = n))
# calculate expected Y value
sim.data$EXPECT.Y <- sim.data$GROUP.EFFECT + sim.data$INSTANCE.EFFECT
# now observe Y value, assuming normally distributed with fixed std. deviation
sim.data$OBS.Y <- rnorm(nrow(sim.data), mean=sim.data$EXPECT.Y, sigma)
model <- lmer(OBS.Y ~ INSTANCE.ID + (1|GROUP.ID), data = sim.data, REML=F)
summary(model)
toy.model.var <- VarCorr(model)
toy.model.sigma <- attr(toy.model.var, 'sc') # corresponds to the epsilon, residual standard deviation
toy.model.tau.squared <- toy.model.var[[1]][1] # corresponds to variance of random effects
toy.model.betas <- model@beta
# left product, spread within groups
toy.data <- rbind(sim.data$OBS.Y[1:4], sim.data$OBS.Y[5:8])
toy.mean.adj <- rbind(toy.data[1,] - mean(unlist(toy.data[1,])), toy.data[2,] - mean(unlist(toy.data[2,])))
toy.mean.adj.prod1 <- prod(dnorm(unlist(toy.mean.adj[1,]), mean = 0, sd = toy.model.sigma))
toy.mean.adj.prod2 <- prod(dnorm(unlist(toy.mean.adj[2,]), mean = 0, sd = toy.model.sigma))
toy.mean.adj.final.prod <- toy.mean.adj.prod1 * toy.mean.adj.prod2
# right product, spread between groups
toy.mean.beta.adj <- rbind(mean(unlist(toy.data[1,])) - toy.model.betas, mean(unlist(toy.data[2,])) - toy.model.betas)
toy.mean.beta.adj[1,] <- toy.mean.beta.adj[1,] - c(0, toy.model.betas[1], toy.model.betas[1], toy.model.betas[1])
toy.mean.beta.adj[2,] <- toy.mean.beta.adj[2,] - c(0, toy.model.betas[1], toy.model.betas[1], toy.model.betas[1])
toy.mean.beta.adj.prod1 <- prod(dnorm(unlist(toy.mean.beta.adj[1,]), mean = 0, sd = sqrt(toy.model.sigma^2/4 + toy.model.tau.squared)) * sqrt(2/4*pi*toy.model.sigma^2))
toy.mean.beta.adj.prod2 <- prod(dnorm(unlist(toy.mean.beta.adj[2,]), mean = 0, sd = sqrt(toy.model.sigma^2/4 + toy.model.tau.squared)) * sqrt(2/4*pi*toy.model.sigma^2))
toy.mean.beta.adj.final.prod <- toy.mean.beta.adj.prod1 * toy.mean.beta.adj.prod2
toy.total.prod <- toy.mean.adj.final.prod * toy.mean.beta.adj.final.prod
log(toy.total.prod)
EDIT: A helpful link was provided in the comments (https://stats.stackexchange.com/questions/271903/understand-marginal-likelihood-of-mixed-effects-models). Converting my example from above, I can replicate the log-likelihood:
library(mvtnorm)
z = getME(model, "Z")
zt = getME(model, "Zt")
psi = bdiag(replicate(2, toy.model.tau.squared, simplify=FALSE))
betw = z%*%psi%*%zt
err = Diagonal(8, sigma(model)^2)
v = betw + err
dmvnorm(sim.data$OBS.Y, predict(model, re.form=NA), as.matrix(v), log=TRUE)
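Applying the same formula to new data then only requires rebuilding the group indicator matrix and the fixed-effect means from that data. A hedged sketch of the ll_given_modPars function asked about at the top (my own helper, assuming a single random intercept and a hypothetical data frame N with the same columns as sim.data):
# Marginal log-likelihood of data N under the parameters estimated in `model`.
ll_given_modPars <- function(model, N, group = "GROUP.ID", response = "OBS.Y") {
  tau2 <- VarCorr(model)[[1]][1]                # random-intercept variance
  sig2 <- sigma(model)^2                        # residual variance
  g <- factor(N[[group]])
  Z <- model.matrix(~ 0 + g)                    # group indicator matrix
  V <- tau2 * Z %*% t(Z) + sig2 * diag(nrow(N))
  mu <- predict(model, newdata = N, re.form = NA)  # fixed effects only
  mvtnorm::dmvnorm(N[[response]], mean = mu, sigma = V, log = TRUE)
}
ll_given_modPars(model, sim.data)  # on the training data this matches logLik(model)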
While I did not manage to come up with a closed-form solution for all of them, I did manage to reproduce the log-likelihoods using numerical integration. Below are small examples of how this works in the LMM setting (assuming normally distributed residuals and random effects) as well as for the GLMM with Poisson and with negative binomial. Note that especially the latter tends to differ ever so slightly when you increase the sample size. My guess is that there is some rounding happening somewhere, but for my purposes the precision achieved here is good enough. For now I will accept my own answer, but if someone posts a closed form for the Poisson or the negative binomial, I will happily accept that answer :)
library(lme4)
library(mvtnorm)
################################################################################
# LMM numerical integration
set.seed(7)
n <- 2 # number of groups
m <- 4 # number of instances per group
fixed.effect <- c(0, -2, -1, 1)
tau <- 5 # standard deviation of random effects
sigma <- 2 # standard deviation of error
random.effect <- rnorm(n, mean=0, sd=tau)
normal.data <- data.frame(GROUP.ID = as.factor(rep(1:n, each = m)),
                          GROUP.EFFECT = rep(random.effect, each = m),
                          INSTANCE.ID = as.factor(rep(1:m, times = n)),
                          INSTANCE.EFFECT = rep(fixed.effect, times = n))
# calculate expected Y value
normal.data$EXPECT.Y <- normal.data$GROUP.EFFECT + normal.data$INSTANCE.EFFECT
# now observe Y value, assuming normally distributed with fixed std. deviation
normal.data$OBS.Y <- rnorm(nrow(normal.data), mean=normal.data$EXPECT.Y, sigma)
normal.model <- lmer(OBS.Y ~ INSTANCE.ID + (1|GROUP.ID), data = normal.data, REML=F)
summary(normal.model)
normal.model.var <- VarCorr(normal.model)
normal.model.sigma <- attr(normal.model.var, 'sc') # corresponds to the epsilon, residual standard deviation
normal.model.tau.squared <- normal.model.var[[1]][1] # corresponds to variance of random effects
normal.model.betas <- normal.model@beta
normal.group.tau <- sqrt(normal.model.tau.squared)
normal.group.sigma <- sigma(normal.model)
normal.group.beta <- predict(normal.model, re.form=NA)[1:4]
integrate_group1 <- function(x){
  p1 <- dnorm(normal.data$OBS.Y[1] - normal.group.beta[1] - x, mean = 0, sd = normal.group.sigma) * dnorm(x, mean = 0, sd = normal.group.tau)
  p2 <- dnorm(normal.data$OBS.Y[2] - normal.group.beta[2] - x, mean = 0, sd = normal.group.sigma)
  p3 <- dnorm(normal.data$OBS.Y[3] - normal.group.beta[3] - x, mean = 0, sd = normal.group.sigma)
  p4 <- dnorm(normal.data$OBS.Y[4] - normal.group.beta[4] - x, mean = 0, sd = normal.group.sigma)
  p_out <- p1 * p2 * p3 * p4
  p_out
}
normal.group1.integration <- integrate(integrate_group1, lower = -10*normal.group.tau, upper = 10*normal.group.tau, subdivisions = 10000L, rel.tol = 1e-10, abs.tol = 1e-50)$value[1]
integrate_group2 <- function(x){
  p1 <- dnorm(normal.data$OBS.Y[5] - normal.group.beta[1] - x, mean = 0, sd = normal.group.sigma) * dnorm(x, mean = 0, sd = normal.group.tau)
  p2 <- dnorm(normal.data$OBS.Y[6] - normal.group.beta[2] - x, mean = 0, sd = normal.group.sigma)
  p3 <- dnorm(normal.data$OBS.Y[7] - normal.group.beta[3] - x, mean = 0, sd = normal.group.sigma)
  p4 <- dnorm(normal.data$OBS.Y[8] - normal.group.beta[4] - x, mean = 0, sd = normal.group.sigma)
  p_out <- p1 * p2 * p3 * p4
  p_out
}
normal.group2.integration <- integrate(integrate_group2, lower = -10*normal.group.tau, upper = 10*normal.group.tau, subdivisions = 10000L, rel.tol = 1e-10, abs.tol = 1e-50)$value[1]
log(normal.group1.integration) + log(normal.group2.integration)
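As a quick check (my addition), this should agree with the value reported by lme4:
logLik(normal.model)   # matches the sum above up to integration error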
#################################
# Poisson numerical integration
set.seed(13) #13
n <- 2 # number of groups
m <- 4 # number of instances per group
# effect sizes are much smaller since they are exponentiated
fixed.effect <- c(0, -0.2, -0.1, 0.2)
tau <- 1.5 # standard deviation of random effects
# sigma <- 1.5 # standard deviation of error
random.effect <- rnorm(n, mean=0, sd=tau) # guide effect
poisson.data <- data.frame(GROUP.ID = as.factor(rep(1:n, each = m)),
                           GROUP.EFFECT = rep(random.effect, each = m),
                           INSTANCE.ID = as.factor(rep(1:m, times = n)),
                           INSTANCE.EFFECT = rep(fixed.effect, times = n))
# calculate expected Y value
poisson.data$EXPECT.Y <- exp(poisson.data$GROUP.EFFECT + poisson.data$INSTANCE.EFFECT)
# now observe Y value, Poisson distributed with the given mean
poisson.data$OBS.Y <- rpois(nrow(poisson.data), poisson.data$EXPECT.Y)
poisson.model <- glmer(OBS.Y ~ INSTANCE.ID + (1|GROUP.ID), data = poisson.data, family="poisson")
summary(poisson.model)
poisson.model.var <- VarCorr(poisson.model)
poisson.model.sigma <- attr(poisson.model.var, 'sc') # corresponds to the epsilon, residual standard deviation
poisson.model.tau.squared <- poisson.model.var[[1]][1] # corresponds to variance of random effects
poisson.model.betas <- poisson.model@beta
poisson.group.tau <- sqrt(poisson.model.tau.squared)
poisson.group.sigma <- sigma(poisson.model)
poisson.group.beta <- predict(poisson.model, re.form=NA)[1:4]
integrate_group1 <- function(x){
  p1 <- dpois(poisson.data$OBS.Y[1], lambda = exp(poisson.group.beta[1] + x)) * dnorm(x, mean = 0, sd = poisson.group.tau)
  p2 <- dpois(poisson.data$OBS.Y[2], lambda = exp(poisson.group.beta[2] + x))
  p3 <- dpois(poisson.data$OBS.Y[3], lambda = exp(poisson.group.beta[3] + x))
  p4 <- dpois(poisson.data$OBS.Y[4], lambda = exp(poisson.group.beta[4] + x))
  p_out <- p1 * p2 * p3 * p4
  p_out
}
poisson.group1.integration <- integrate(integrate_group1, lower = -10*poisson.group.tau, upper = 10*poisson.group.tau, subdivisions = 10000L, rel.tol = 1e-10, abs.tol = 1e-50)$value[1]
integrate_group2 <- function(x){
  p1 <- dpois(poisson.data$OBS.Y[5], lambda = exp(poisson.group.beta[1] + x)) * dnorm(x, mean = 0, sd = poisson.group.tau)
  p2 <- dpois(poisson.data$OBS.Y[6], lambda = exp(poisson.group.beta[2] + x))
  p3 <- dpois(poisson.data$OBS.Y[7], lambda = exp(poisson.group.beta[3] + x))
  p4 <- dpois(poisson.data$OBS.Y[8], lambda = exp(poisson.group.beta[4] + x))
  p_out <- p1 * p2 * p3 * p4
  p_out
}
poisson.group2.integration <- integrate(integrate_group2, lower = -10*poisson.group.tau, upper = 10*poisson.group.tau, subdivisions = 10000L, rel.tol = 1e-10, abs.tol = 1e-50)$value[1]
log(poisson.group1.integration) + log(poisson.group2.integration)
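Again, this can be compared against lme4 (my addition); glmer's logLik is based on the Laplace approximation, so small discrepancies are expected:
logLik(poisson.model)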
#############
# Negative-Binomial numerical integration
set.seed(13) #13
n <- 100 # number of groups
m <- 4 # number of instances per group
# effect sizes are much smaller since they are exponentiated
fixed.effect <- c(0, -0.2, -0.1, 0.2)
tau <- 1.5 # standard deviation of random effects
theta <- 0.5
# sigma <- 1.5 # standard deviation of error
random.effect <- rnorm(n, mean=0, sd=tau) # guide effect
nb.data <- data.frame(GROUP.ID = as.factor(rep(1:n, each = m)),
                      GROUP.EFFECT = rep(random.effect, each = m),
                      INSTANCE.ID = as.factor(rep(1:m, times = n)),
                      INSTANCE.EFFECT = rep(fixed.effect, times = n))
# calculate expected Y value
nb.data$EXPECT.Y <- exp(nb.data$GROUP.EFFECT + nb.data$INSTANCE.EFFECT)
# now observe Y value, negative-binomially distributed with the given mean and dispersion
nb.data$OBS.Y <- rnbinom(nrow(nb.data), mu = nb.data$EXPECT.Y, size = theta)
nb.model <- glmer.nb(OBS.Y ~ INSTANCE.ID + (1|GROUP.ID), data = nb.data)
summary(nb.model)
nb.model.var <- VarCorr(nb.model)
nb.model.sigma <- attr(nb.model.var, 'sc') # corresponds to the epsilon, residual standard deviation
nb.model.tau.squared <- nb.model.var[[1]][1] # corresponds to variance of random effects
nb.model.betas <- nb.model@beta
nb.group.tau <- sqrt(nb.model.tau.squared)
nb.group.beta <- predict(nb.model, re.form=NA)[1:4]
nb.group.dispersion <- getME(nb.model, "glmer.nb.theta")
integration_function_generator <- function(input.obs, input.beta, input.dispersion, input.tau){
  function(x){
    p1 <- dnbinom(input.obs[1], mu = exp(input.beta[1] + x), size = input.dispersion) * dnorm(x, mean = 0, sd = input.tau)
    p2 <- dnbinom(input.obs[2], mu = exp(input.beta[2] + x), size = input.dispersion)
    p3 <- dnbinom(input.obs[3], mu = exp(input.beta[3] + x), size = input.dispersion)
    p4 <- dnbinom(input.obs[4], mu = exp(input.beta[4] + x), size = input.dispersion)
    p_out <- p1 * p2 * p3 * p4
    p_out
  }
}
nb.all.group.integrations <- c()
for (i in 1:n) {
  temp.obs <- nb.data$OBS.Y[(1:4) + (i - 1) * 4]
  temp_integrate_function <- integration_function_generator(temp.obs, nb.group.beta, nb.group.dispersion, nb.group.tau)
  temp.integration <- integrate(temp_integrate_function, lower = -10*nb.group.tau, upper = 10*nb.group.tau, subdivisions = 10000L, rel.tol = 1e-10, abs.tol = 1e-50)$value[1]
  nb.all.group.integrations <- c(nb.all.group.integrations, temp.integration)
}
sum(log(nb.all.group.integrations))
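And likewise for the negative binomial (my addition); the Laplace approximation in glmer.nb explains the slight differences noted above:
logLik(nb.model)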

How to automatically fit data with several normal cumulative distribution functions in R

I have several data sets (hundreds of them actually), that I know can be fitted with the sum of several normal cumulative distributions (see here).
Here is one example of such data set, here with two cumulative distribution functions:
library(pracma)      # for erf()
library(minpack.lm)  # for nlsLM()
library(ggplot2)     # for the plots below
x <- seq(1, 1000, length.out = 50)
k1 <- 0.5
mu1 <- 500
sigma1 <- 100
y1 <- k1 * (1 + erf((x - mu1) / (sqrt(2) * sigma1)))
k2 <- 0.5
mu2 <- 300
sigma2 <- 50
y2 <- k2 * (1 + erf((x - mu2) / (sqrt(2) * sigma2)))
my.df <- data.frame(x, y = y1 + y2, type = "data")
ggplot(my.df, aes(x, y)) + geom_line()
Now I want to fit those curves, so I use nls to do so:
model <- nlsLM(y ~ k1 * (1 + erf((x - mu1) / (sqrt(2) * sigma1)))
                 + k2 * (1 + erf((x - mu2) / (sqrt(2) * sigma2))),
               start = c(mu1 = 500, sigma1 = 50, k1 = 0.5,
                         mu2 = 300, sigma2 = 50, k2 = 0.5),
               data = my.df,
               control = nls.lm.control(maxiter = 500))
tmp <- data.frame(x, y = predict(model), type = "fit")
combined <- rbind(my.df, tmp)
ggplot(combined, aes(x, y, colour = type, shape = type)) + geom_line() + geom_point()
Here is what I get (plot omitted): the fit is great. However, I helped nls a lot:
I gave it a perfect fitting curve as input, not raw data
I told it my curve was the sum of two functions (not one or three)
And I almost gave the solution by providing very close parameter values
To fix the first point, I compute three models with one, two, and three functions and choose the one with the minimum deviance.
For the second point, with my hundreds of data sets unfortunately, the parameters change quite a bit and I have disappointing results when I give the same starting parameters for all sets.
Is there a better way to select those starting values?
I heard of the mixtools package, but I'm not sure it works for CDFs (cumulative distribution functions).
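For what it's worth, here is a minimal sketch of the one/two/three-component comparison described above (my own illustration; fit_k_cdfs is a hypothetical helper, and the starting values are crude data-driven guesses rather than hand-picked ones):
# Fit sums of 1..3 normal CDFs and keep the fit with minimum deviance.
# Means start at quantiles of x, sigmas at a tenth of the x range, and the
# weights split the plateau max(y) evenly; tryCatch skips failed fits.
fit_k_cdfs <- function(df, k) {
  start <- c(setNames(quantile(df$x, probs = (1:k) / (k + 1)), paste0("mu", 1:k)),
             setNames(rep(diff(range(df$x)) / 10, k), paste0("sigma", 1:k)),
             setNames(rep(max(df$y) / (2 * k), k), paste0("k", 1:k)))
  rhs <- paste(sprintf("k%d * (1 + erf((x - mu%d) / (sqrt(2) * sigma%d)))",
                       1:k, 1:k, 1:k), collapse = " + ")
  nlsLM(as.formula(paste("y ~", rhs)), start = start, data = df,
        control = nls.lm.control(maxiter = 500))
}
fits <- lapply(1:3, function(k) tryCatch(fit_k_cdfs(my.df, k),
                                         error = function(e) NULL))
ok <- !vapply(fits, is.null, logical(1))
best <- fits[ok][[which.min(vapply(fits[ok], deviance, numeric(1)))]]
summary(best)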
