Normal distribution has higher probability for multimodality than unimodality - r

I am trying to analyze a mixture model using the mixtools package; in other words, I would like to determine whether my data follow a uni-, bi- or multimodal distribution.
For simplicity, here is an example:
library(mixtools)
#creating an artificial normal distribution
mydata <- rnorm(1000, 1750, 60)
#defining the cuts and preparing it for calculations
cutp <- seq(1600, 2300, by=25)
mult <- makemultdata(mydata, cuts = cutp)
comp <- multmixmodel.sel(mult, comps = 1:3, epsilon = 0.01)
#plotting the data (in this case 2 subpopulations)
mixmdl = normalmixEM(mydata, k=2, maxit=50000)
plot(mixmdl,which=2)
lines(density(mydata), lty=2, lwd=2)
Now as a result for 'comp', I get:
          1          2          3 Winner
AIC    -Inf  -94.04097 -124.04097      2
BIC    -Inf  -35.04097  -35.04097      2
CAIC   -Inf  -64.54097  -79.54097      2
ICL    -Inf  -35.04097  -35.04097      2
Loglik -Inf  -35.04097  -35.04097      2
With my very limited understanding of this kind of analysis, I expected to see 1 as the 'winner' (since I generated a single normal distribution).
However, as you can see, I get infinite values for 1, and identical values of BIC, ICL and Loglik for 2 and 3. This speaks against a single normal distribution and suggests an equal (or higher) probability that I am dealing with a bi- or multimodal distribution. Since I started from a normal distribution, I would expect the highest probability for 1 and at least some differences between 2 and 3. What confuses me most are the identical values for 2 and 3 in some of the criteria.
So my question is: why does my approach fail to recognize the distribution as Gaussian and instead classify it as bi-/multimodal?

I do not know a lot about the mixtools package. I gave what you did a quick try, and I did not come to the same conclusion as you.
When I fit a two-component multinomial mixture model (which is what you are doing with multmixmodel.sel), the second component is essentially non-existent; its posterior probability is almost zero.
set.seed(1)
mydata <- rnorm(1000, 1750, 60)
cutp <- seq(min(mydata), max(mydata), by=25)
mult <- makemultdata(mydata, cuts = cutp)
multmod2 <- multmixEM(mult, k=2)
multmod2$posterior
# comp.1 comp.2
# [1,] 1 1.980052e-226
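As an additional check (a small sketch of my own, not part of the original answer), the estimated mixing proportions tell the same story; multmixEM returns them as lambda:
# estimated mixing proportions; the weight of the second component should be
# essentially zero, matching the posterior above
multmod2$lambda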
When I fit mixture models to the original data, the single component is selected each time.
library(mclust)
fit <- Mclust(mydata)
fit
#'Mclust' model object:
# best model: univariate normal (X) with 1 components
library(EMMIX)
# Available from
#https://people.smp.uq.edu.au/GeoffMcLachlan/mix_soft/EMMIX_R/
fit_1 <- EMMIX(mydata, g=1)
fit_2 <- EMMIX(mydata, g=2)
c(fit_1$bic, fit_2$bic)
# [1] 11108.02 11128.67
#(BIC selects the one component model)
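Another check that stays within mixtools and base R (a sketch of my own, not from the original answer): compare the BIC of a single Gaussian fit with that of a two-component normalmixEM fit on the raw data. With truly Gaussian data, the single component should typically come out smaller.
# assumes mydata from the question and library(mixtools) already loaded
n <- length(mydata)
ll1 <- sum(dnorm(mydata, mean(mydata), sd(mydata), log=TRUE)) # 1 component: 2 parameters
bic1 <- -2*ll1 + 2*log(n)
fit2 <- normalmixEM(mydata, k=2, maxit=50000)                 # 2 components: 5 parameters
bic2 <- -2*fit2$loglik + 5*log(n)
c(bic1=bic1, bic2=bic2)  # smaller is better; bic1 should usually win here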

Related

Tail dependence simulation and estimate of a t-Copula

Once I've generated 500 observations of (T1, T2), with T1 ~ Exp(1) and T2 ~ Exp(0.5)
and their dependence given by a Clayton copula with theta = 2,
I have to 1) show by simulation that the coefficient of lower tail dependence is approximately 0.7 in this case, as calculated with the direct formula 2^(-1/theta) = 1/sqrt(2),
and 2) estimate a t-copula from these data, assuming exponential marginal distributions (estimating its degrees of freedom and correlation).
library(copula)
library(ghyp)
set.seed(123)
data <- rCopula(n = 500, copula = claytonCopula(2))
x1 <- qexp(data[,1], rate=1)    # T1 ~ Exp(1)
x2 <- qexp(data[,2], rate=0.5)  # T2 ~ Exp(0.5)
x <- cbind(x1,x2)
x.df <- data.frame(x)
To simulate the coefficient of lower tail dependence, I probably have to set a lower bound, but beyond that I don't know how to proceed.
For the t-copula I think I should use the function
fitCopula(tCopula(dim=2, dispstr="un"), pghyp(x.df), method="ml")
but it does not converge.
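For part 1), a possible sketch (my own illustration, using an enlarged sample purely to stabilise the estimate): tail dependence is a property of the copula and is unchanged by the monotone transformation to exponential margins, so you can estimate lambda_L = P(U2 <= u | U1 <= u) for a small u directly on the copula scale and compare it with the Clayton value 2^(-1/theta) = 1/sqrt(2) ≈ 0.707.
# empirical lower tail dependence for a Clayton copula with theta = 2
library(copula)
set.seed(1)
u <- rCopula(100000, claytonCopula(2))  # large sample for a stable estimate
q <- 0.01                               # small lower-tail threshold
lambdaL_hat <- mean(u[,1] <= q & u[,2] <= q) / q  # P(U1 <= q) = q for uniform margins
c(empirical = lambdaL_hat, theoretical = 2^(-1/2))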

Simulate data for mixed-effects model with predefined parameter

I'm trying to simulate data for a model expressed with the following formula:
lme4::lmer(y ~ a + b + (1|subject), data) but with a set of given parameters:
- a <- rnorm(), measured at the subject level (e.g. nSubjects = 50)
- y is measured at the observation level (e.g. nObs = 7 for each subject)
- b <- rnorm(), measured at the observation level and correlated at a given r with a
- the variance ratio of the random effects in lmer(y ~ 1 + (1 | subject), data) is fixed at, for example, 50/50 or 10/90 (and so on)
- some random noise is present (so that a full model does not explain all the variance)
- the effect size of the fixed effects can be set at a predefined level (e.g. dCohen = 0.5)
I played with various packages like powerlmm, simstudy, and simr, but I still fail to find a working solution that accommodates the number of parameters I'd like to define beforehand.
Also, for my own learning, I'd prefer a base-R approach over a package solution.
The closest example I found is a blog post by Ben Ogorek, "Hierarchical linear models and lmer", which looks great, but I can't figure out how to control the parameters listed above.
Any help would be appreciated.
Also, if there is a package I don't know of that can do this type of simulation, please let me know.
Some questions about the model definition:
- How do we specify a correlation between two random vectors of different lengths? I'm not sure, so I'll sample 350 values (nObs*nSubjects) and throw away most of them for the subject-level effect.
- I'm not sure about "variance ratio" here. By definition, the theta parameters (standard deviations of the random effects) are scaled by the residual standard deviation (sigma); e.g. if sigma=2 and theta=2, then the residual std dev is 2 and the among-subject std dev is 4.
Define parameter/experimental design values:
nSubjects <- 50
nObs <- 7
## means of a,b are 0 without loss of generality
sdvec <- c(a=1,b=1)
rho <- 0.5 ## correlation
betavec <- c(intercept=0,a=1,b=2)
beta_sc <- c(betavec[1], betavec[-1]*sdvec) ## keep the intercept; scale the slope values by the sds
theta <- 0.4 ## = 20/50
sigma <- 1
Set up data frame:
library(lme4)
set.seed(101)
## generate a, b variables
mm <- MASS::mvrnorm(nSubjects*nObs,
                    mu=c(0,0),
                    Sigma=matrix(c(1,rho,rho,1),2,2)*outer(sdvec,sdvec))
subj <- factor(rep(seq(nSubjects),each=nObs)) ## or ?gl
## sample every nObs'th value of a
avec <- mm[seq(1,nObs*nSubjects,by=nObs),"a"]
avec <- rep(avec,each=nObs) ## replicate
bvec <- mm[,"b"]
dd <- data.frame(a=avec,b=bvec,Subject=subj)
Simulate:
dd$y <- simulate(~a+b+(1|Subject),
                 newdata=dd,
                 newparams=list(beta=beta_sc, theta=theta, sigma=1),
                 family=gaussian)[[1]]
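As a quick sanity check (my addition, assuming the objects defined above), you can refit the model to the simulated data and compare the estimates with the parameters you set:
# refit and compare against the simulation parameters
fit <- lmer(y ~ a + b + (1|Subject), data=dd)
summary(fit)   # fixed effects should be close to beta_sc, residual sd close to sigma = 1
VarCorr(fit)   # among-subject sd should be close to theta*sigma = 0.4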

Confidence Interval in mixed effect models

library(lme4)
fm1 <- lmer(Reaction ~ Days + (Days|Subject), data = sleepstudy)
To generate a 95% CI, I can use the predictInterval() function from the package merTools.
library(merTools)
head(predictInterval(fm1, level = 0.95, seed = 123, n.sims = 100))
# fit upr lwr
# 1 255.4179 313.8781 184.1400
# 2 273.2944 333.2005 231.3584
# 3 291.8451 342.8701 240.8226
# 4 311.3562 359.2908 250.4980
# 5 330.3671 384.2520 270.7094
# 6 353.4378 409.9307 289.4760
In the documentation, it says about the predictInterval() function
This function provides a way to capture model uncertainty in predictions from multi-level models
fit with lme4. By drawing a sampling distribution for the random and the fixed effects and then
estimating the fitted value across that distribution, it is possible to generate a prediction interval for
fitted values that includes all variation in the model except for variation in the covariance parameters,
theta. This is a much faster alternative than bootstrapping for models fit to medium to large datasets.
My goal is to get all the fitted values instead of just the upper and lower CI, i.e. for each row I need the
original n simulations from which these 95% CIs are calculated. I checked the arguments in the documentation and
followed this:
head(predictInterval(fm1, n.sims = 100, returnSims = TRUE, seed = 123, level = 0.95))
# fit upr lwr
# 1 255.4179 313.8781 184.1400
# 2 273.2944 333.2005 231.3584
# 3 291.8451 342.8701 240.8226
# 4 311.3562 359.2908 250.4980
# 5 330.3671 384.2520 270.7094
# 6 353.4378 409.9307 289.4760
Instead of getting the 100 simulations, it still gives me the same output. What am I doing wrong here?
A second question, though I believe this one is more of a StatsExchange question:
"By drawing a sampling distribution for the random and the fixed effects and then..."
Could someone explain to me how it draws that sampling distribution?
You can get simulated values if you specify newdata in the predictInterval() function.
predInt <- predictInterval(fm1, newdata = sleepstudy, n.sims = 100,
                           returnSims = TRUE, seed = 123, level = 0.95)
simValues <- attr(predInt, "sim.results")
Details on how the sampling distributions of the parameters are created are given in the Details section of the help page. You can get the estimates of the fit and of the lower and upper boundaries as:
fit <- apply(simValues, 1, function(x){quantile(x, probs=0.500) } )
lwr <- apply(simValues, 1, function(x){quantile(x, probs=0.025) } )
upr <- apply(simValues, 1, function(x){quantile(x, probs=0.975) } )
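For orientation (a small check I am adding, assuming the sleepstudy fit above): the simulation matrix has one row per observation and one column per simulation, so you can inspect it directly.
dim(simValues)      # 180 rows (observations) x 100 columns (simulations)
simValues[1:3, 1:5] # first few simulated fitted values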

AIC in R: differences in manual vs. internal value when using weighted data

I am attempting to use R for model selection based on the AIC statistic. When comparing linear models with or without weighting, my code in R tells me that weighting is preferable to no weighting, and these results are confirmed in other software (GraphPad Prism). Here is sample code using real data from a standard curve:
#Linear Curve Fitting
a <- c(0.137, 0.412, 1.23, 3.7, 11.1 ,33.3)
b <- c(0.00198, 0.00359, 0.00816, 0.0220, 0.0582, 0.184)
m1 <- lm(b ~ poly(a,1))
m2 <- lm(b ~ poly(a,1), weight=1/a)
n1 <- 6 #Number of observations
k1 <- 2 #Number of parameters
When I calculate AIC using either the internal function in R or via manual calculation in which:
AIC = n + n log 2π + n log(RSS/n) + 2(k + 1) with n observations and k parameters
I get equivalent AIC values for the non-weighted model. When I analyze the effect of weighting, the manual AIC value is lower, however the end result is that both the internal and manual AIC suggest that weighting is preferred.
> AIC(m1); n1+(n1*log(2*pi))+n1*(log(deviance(m1)/n1))+(2*(k1+1))
[1] -54.83171
[1] -54.83171
> AIC(m2); n1+(n1*log(2*pi))+n1*(log(deviance(m2)/n1))+(2*(k1+1))
[1] -64.57691
[1] -69.13025
When I try the same analysis using a nonlinear model, the difference in AIC between the internal function and the manual calculation is more pronounced. Below is code for example Michaelis-Menten kinetic data:
c <- c(0.5, 1, 5, 10, 30, 100, 300)
d <- c(3, 5, 20, 50, 75, 200, 250)
m3 <- nls(d ~ (V * c)/(K + c), start=list(V=10, K=1))
m4 <- nls(d ~ (V * c)/(K + c), start=list(V=10, K=1), weight=1/d^2)
n2 <- 7
k2 <- 2
The AICs are calculated in the same way as for the first two models:
> AIC(m3); n2+(n2*log(2*pi))+n2*(log(deviance(m3)/n2))+(2*(k2+1))
[1] 58.48839
[1] 58.48839
> AIC(m4); n2+(n2*log(2*pi))+n2*(log(deviance(m4)/n2))+(2*(k2+1))
[1] 320.7105
[1] 0.1538546
Similar to the linear example, the internal and manual AIC values are the same when the data are not weighted (m3). The problem occurs with weighting (m4), as the manual AIC estimate is much lower. This situation is similar to what was asked in a related problem, AIC with weighted nonlinear regression (nls).
I mentioned GraphPad Prism earlier; for both models and datasets given above it showed lower AICs when weighting was used. My question, then, is: why is there such a difference between the internal and manual AIC estimates in R when weighting the data (and why does the outcome differ for the nonlinear model compared to the linear one)? Ultimately, should I regard the internal AIC value or the manual value as the more correct one, or am I using the wrong equation?
The discrepancy you are seeing comes from using the unweighted log-likelihood formula in the manual calculations for a weighted model. For example, you can replicate the AIC results for m2 and m4 with the following adjustments.
In the case of m2, you simply need to subtract sum(log(m2$weights)) from your calculation:
AIC(m2); n1+(n1*log(2*pi))+n1*(log(deviance(m2)/n1))+(2*(k1+1)) - sum(log(m2$weights))
[1] -64.57691
[1] -64.57691
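To see where the sum(log(weights)) term comes from, here is a sketch (my own, not part of the original answer) that recomputes logLik(m2) from the weighted Gaussian log-likelihood; each observation contributes an extra log(w_i)/2:
# manual weighted-Gaussian log-likelihood for m2
w <- m2$weights
n <- length(w)
rss_w <- deviance(m2)  # weighted residual sum of squares
ll <- 0.5*(sum(log(w)) - n*(log(2*pi) + 1 + log(rss_w/n)))
c(manual = ll, logLik = as.numeric(logLik(m2)))  # should agree
# AIC(m2) is then -2*ll + 2*(k1 + 1)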
In the case of m4, you have to swap the deviance call for a weighted residuals calculation and subtract n2 * sum(log(m4$weights)) from your result:
AIC(m4); n2+(n2*log(2*pi))+n2*(log(sum(m4$weights * m4$m$resid()^2)/n2))+(2*(k2+1)) - n2 * sum(log(m4$weights))
[1] 320.7105
[1] 320.7105
I believe the derivation of the formula used by logLik in m2 is pretty straightforward and correct, but I am not as sure about m4. From reading some other threads about logLik.nls() (example 1, example 2), it seems there is some confusion about the correct approach for the nls estimate. To summarize: I believe AIC is correct for m2; I was not able to verify the math for the weighted nls model and would lean towards using the m2 formula again in that case (but replacing the deviance calculation with weighted residuals), or (maybe better) not using AIC for the nls model at all.

P-value for polyserial correlation

I have some basic questions concerning the polyserial() function from the polycor package.
Does a p-value exist for rho, or can one be calculated?
For the assumption of bivariate normality, is the tested null hypothesis "yes, bivariate normal"? That is, do I want a high or a low p-value?
Thanks.
If you form the returned object with:
polS <- polyserial(x, y, ML=TRUE, std.err=TRUE) # ML estimate
... you should have no difficulty forming a p-value for the hypothesis rho == 0, using a z-statistic formed as the ratio of the parameter estimate to its standard error. But that is not the same as testing the assumption of bivariate normality. For that you need to examine the "chisq" component of polS. The print method for objects of class 'polycor' hands that to you in a nice little sentence. You interpret that result in the usual manner: low p-values are stronger evidence against the null hypothesis (in this case H0: bivariate normality). As a scientist, you do not "want" either result; you want to understand what the data are telling you.
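To make that concrete, a brief sketch of my own (assuming the fitted object polS from above, and that it stores the normality test in chisq and df components, as the printed output suggests):
# z-test for H0: rho == 0
z <- polS$rho / sqrt(polS$var[1, 1])
p_rho <- 2*pnorm(-abs(z))
# test of bivariate normality (H0: bivariate normal)
p_normal <- pchisq(polS$chisq, df = polS$df, lower.tail = FALSE)
c(p_rho = p_rho, p_normal = p_normal)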
I e-mailed the package author (because I had the same questions), and based on his clarifications I offer my answers:
First, the easy question: higher p-values (traditionally > 0.05) give you more confidence that the distribution is bivariate normal. Lower p-values indicate a non-normal distribution; BUT if the sample size is sufficiently large and you use the maximum likelihood estimate (option ML=TRUE), non-normality doesn't matter and the correlation is still reliable anyway.
Now, for the harder question: to calculate the p-value, you need to:
1. Execute polyserial() with the std.err=TRUE option to get access to more details.
2. From the resulting polyserial object, access the var[1, 1] element. var is the covariance matrix of the parameter estimates, and sqrt(var[1, 1]) is the standard error (which is displayed in parentheses in the output after the rho result).
3. From the standard error, calculate the p-value using the R code below.
Here's some copy-pasteable R code to illustrate this, based on the example in the polyserial documentation:
library(mvtnorm)
library(polycor)
set.seed(12345)
data <- rmvnorm(1000, c(0, 0), matrix(c(1, .5, .5, 1), 2, 2))
x <- data[,1]
y <- data[,2]
y <- cut(y, c(-Inf, -1, .5, 1.5, Inf))
# 2-step estimate
poly_2step <- polyserial(x, y, std.err=TRUE)
poly_2step
##
## Polyserial Correlation, 2-step est. = 0.5085 (0.02413)
## Test of bivariate normality: Chisquare = 8.604, df = 11, p = 0.6584
std.err_2step <- sqrt(poly_2step$var[1, 1])
std.err_2step
## [1] 0.02413489
p_value_2step <- 2 * pnorm(-abs(poly_2step$rho / std.err_2step))
p_value_2step
## [1] 1.529176e-98
# ML estimate
poly_ML <- polyserial(x, y, ML=TRUE, std.err=TRUE)
poly_ML
##
## Polyserial Correlation, ML est. = 0.5083 (0.02466)
## Test of bivariate normality: Chisquare = 8.548, df = 11, p = 0.6635
##
## 1 2 3
## Threshold -0.98560 0.4812 1.50700
## Std.Err. 0.04408 0.0379 0.05847
std.err_ML <- sqrt(poly_ML$var[1, 1])
std.err_ML
## [1] 0.02465517
p_value_ML <- 2 * pnorm(-abs(poly_ML$rho / std.err_ML))
p_value_ML
##
## 1.927146e-94
And to answer an important question that you didn't ask: you would always want to use the maximum likelihood version (ML=TRUE) because it is more accurate, except when you have a really slow computer, in which case the default 2-step approach is acceptable.
