How to normalize an lmer model?

The lmer call:
mixed.lmer6 <- lmer(Size ~ (Time+I(Time^2))*Country*STemperature +
(1|Country:Locality)+ (1|Locality:Individual)+(1|Batch)+
(1|Egg_masses), REML = FALSE, data = data_NoNA)
Residual diagnostics:
plot_model(mixed.lmer6, type = "diag")
I tried manual log, power, and sqrt transformations of the response in my formula, but with no improvement, and I also can't find a suitable automatic transformation function in R such as Box-Cox (which does not work for lmer models).
Any help or tips would be appreciated.

This might be better suited for CrossValidated ("what should I do?" is appropriate for CV; "how should I do it?" is best for Stack Overflow), but I'll take a crack.
The Q-Q plot is generally the last/least important diagnostic you should look at. The order should be approximately:
(1) check for significant bias/missed patterns in the mean (fitted vs. residuals, residuals vs. covariates);
(2) check for outliers/influential points (leverage, Cook's distance);
(3) check for heteroscedasticity (scale-location plot);
(4) check distributional assumptions (Q-Q plot).
The reason is that any of the "upstream" failures (e.g. missed patterns) will show up in the Q-Q plot as well; resolving them will often resolve the apparent non-Normality.
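A rough base-R sketch of those checks, reusing the model and data names from your question (untested; plot_model(..., type = "diag") covers much of this too):
plot(mixed.lmer6)                                        # (1) fitted vs. residuals
plot(data_NoNA$Time, resid(mixed.lmer6))                 # (1) residuals vs. a covariate
qqnorm(resid(mixed.lmer6)); qqline(resid(mixed.lmer6))   # (4) Q-Q plot, checked last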
If you can fix the distributional assumptions by fixing something else about the model (adding covariates/adding interactions/adding polynomial or spline terms/removing outliers), then do that.
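For example (just a sketch, reusing your variable names and assuming the splines package; mixed.lmer7 is a made-up name), swapping the quadratic in Time for a natural spline is one such structural change:
library(splines)
mixed.lmer7 <- lmer(Size ~ ns(Time, df = 3)*Country*STemperature +
                      (1|Country:Locality) + (1|Locality:Individual) +
                      (1|Batch) + (1|Egg_masses),
                    REML = FALSE, data = data_NoNA)
plot_model(mixed.lmer7, type = "diag")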
Otherwise, you could code your own brute-force Box-Cox, something like:
fitted_model <- lmer(..., data = mydata)
bcfun <- function(lambda, resp = "y") {
  y <- mydata[[resp]]
  ## Box-Cox transform the response (log transform when lambda == 0)
  mydata$newy <- if (lambda == 0) log(y) else (y^lambda - 1)/lambda
  ## https://stats.stackexchange.com/questions/261380/how-do-i-get-the-box-cox-log-likelihood-using-the-jacobian
  log_jac <- sum((lambda - 1)*log(y))
  newfit <- update(fitted_model, newy ~ ., data = mydata)
  ## -2 * (log-likelihood adjusted back to the original scale via the Jacobian)
  return(-2*(c(logLik(newfit)) + log_jac))
}
lambdavec <- seq(-2, 2, by = 0.2)
boxcox <- vapply(lambdavec, bcfun, FUN.VALUE = numeric(1))
plot(lambdavec, boxcox - min(boxcox))
(lightly tested! but feel free to let me know if it doesn't work)
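To read off the approximately best transformation you can then just take the minimizing value, e.g.:
best_lambda <- lambdavec[which.min(boxcox)]
best_lambda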
If you do need to fit a mixed model with a heavy-tailed residual distribution (e.g. Student t), the options are fairly limited. The brms package can fit such models (but takes you down the Bayesian/MCMC rabbit hole), and the heavy package (currently archived on CRAN) will work, but doesn't appear to handle crossed random effects.
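For example, a minimal (untested) brms sketch keeping your formula but swapping in a Student-t response would look something like:
library(brms)
fit_t <- brm(Size ~ (Time + I(Time^2))*Country*STemperature +
               (1|Country:Locality) + (1|Locality:Individual) +
               (1|Batch) + (1|Egg_masses),
             family = student(), data = data_NoNA)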

Related

How to fit a Gumbel distribution?

I want to find an R package to fit the generalized extreme value distribution (https://en.wikipedia.org/wiki/Generalized_extreme_value_distribution) with three unknown parameters mu, sigma, and xi.
I found two packages that can do the inference for these three parameters based on maximum likelihood estimation.
library(ismev)
gev.fit(data)
and
library(extRemes)
fevd(data)
The output is estimates of mu, sigma, and xi.
But what if I just want to fit a distribution with only the two parameters mu and sigma (i.e. the Gumbel distribution, where xi = 0)? How can I apply the above two packages to that case? Or are there other packages that can do inference for the Gumbel distribution?
The evd package has 2-parameter [dpqr]gumbel functions that you can combine with any general-purpose optimization method (optim() is one such possibility, as suggested in the comments, but there are some shortcuts as suggested below).
Load packages, simulate example:
library(evd)
library(MASS)   ## for fitdistr()
set.seed(101)
x <- rgumbel(1000, loc = 2, scale = 2.5)
Make a more robust wrapper for dgumbel() that won't throw an error if we hand it a non-positive scale value (there are other ways to deal with this problem, but this one works):
dg <- function(x, loc, scale, log) {
  r <- try(dgumbel(x, loc, scale, log), silent = TRUE)
  if (inherits(r, "try-error")) return(NA)
  return(r)
}
fitdistr(x, dg, start = list(loc = 1, scale = 1))
Results seem reasonable:
loc scale
2.09220866 2.48122956
(0.08261121) (0.06102183)
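For comparison, the plain optim() route mentioned above looks something like this (a sketch; the scale is optimized on the log scale to keep it positive):
nll <- function(par, x) -sum(dgumbel(x, loc = par[1], scale = exp(par[2]), log = TRUE))
opt <- optim(c(loc = 1, logscale = 0), nll, x = x)
c(loc = opt$par[["loc"]], scale = exp(opt$par[["logscale"]]))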
If you want more flexibility I would recommend the bbmle package (for possibly obvious reasons :-) )
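A minimal bbmle sketch (untested; again keeping the scale positive by optimizing its log) would be something like:
library(bbmle)
fit_bb <- mle2(x ~ dgumbel(loc = loc, scale = exp(logscale)),
               start = list(loc = 1, logscale = 0),
               data = data.frame(x = x))
coef(fit_bb)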

How to write a function to check model assumptions for a linear model in R?

I'm making a lot of models in R and trying to check the model assumptions for all of them. It would be awesome if I could write a function to do it all in one go, but it doesn't seem to be working.
I have:
assumptionfunction <- function(y, modelobject){
plot(x)
plot(y, x$residuals)
qqnorm(x$residuals)
}
And I'm getting lots of errors.
Instead of creating your own function, you can use an existing one. The beautiful check_model() function from the performance package does just that:
library(performance)
library(see)
model <- lm(mpg ~ wt * cyl + gear, data = mtcars)
check_model(model)
If you insist on using some objective tests, there is the gvlma package.
library(gvlma)
gvlma(model)
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Value p-value Decision
Global Stat 1.770046 0.7780 Assumptions acceptable.
Skewness 0.746520 0.3876 Assumptions acceptable.
Kurtosis 0.003654 0.9518 Assumptions acceptable.
Link Function 0.927065 0.3356 Assumptions acceptable.
Heteroscedasticity 0.092807 0.7606 Assumptions acceptable.
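If you would rather call a couple of individually named tests on the same model, base R and the lmtest package cover the common ones (a sketch, assuming lmtest is installed):
shapiro.test(residuals(model))  # Shapiro-Wilk test of residual normality
lmtest::bptest(model)           # Breusch-Pagan test for heteroscedasticity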
Now, if you don't like gvlma because it doesn't explicitly name the tests used, and gives skewness and kurtosis but not overall normality from, say, Shapiro-Wilk, I made a convenience function. It gets all test names and assumptions at once, with the total number of assumptions that are not respected. You can take it and modify it to suit your needs.
# Load the function:
source("https://raw.githubusercontent.com/RemPsyc/niceplots/master/niceAssFunction.R")
View(niceAss(model))
Interpretation: p values < .05 imply assumptions are not respected.
Diagnostic is how many assumptions are not respected for a given model or variable.
Applied to a list of models:
# Define our dependent variables
(DV <- names(mtcars[-1]))
# Make list of all formulas
(formulas <- paste(DV, "~ mpg"))
# Make list of all models
models.list <- sapply(X = formulas, FUN = lm, data = mtcars, simplify = FALSE, USE.NAMES = TRUE)
# Make diagnostic table
(ass.table <- do.call("rbind", lapply(models.list, niceAss)))
# Use the Viewer for better results
View(ass.table)

Weighted Portmanteau Test for Fitted GARCH process

I have fitted a GARCH process to a time series and analyzed the ACF of the squared and absolute residuals to check the model's goodness of fit. But I also want to do a formal test, and after searching the internet, the Weighted Portmanteau Test (originally by Li and Mak) seems to be the one.
It's from the WeightedPortTest package and is one of the few (perhaps the only one?) that properly tests the GARCH residuals.
While going through the instructions in various documents, I can't wrap my head around what the "h.t" argument wants. The R help says I need to supply "a numeric vector of the conditional variances". This may be simple to an experienced user, but I'm struggling to understand. What do I need to do, and preferably, how would I code it in R?
Thankful for any kind of help.
Taken directly from the documentation:
h.t: a numeric vector of the conditional variances
A little toy example using the fGarch package follows:
library(fGarch)
library(WeightedPortTest)
spec <- garchSpec(model = list(alpha = 0.6, beta = 0))
simGarch11 <- garchSim(spec, n = 300)
fit <- garchFit(formula = ~ garch(1, 0), data = simGarch11)
Weighted.LM.test(fit@residuals, fit@h.t, lag = 10)
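(In other words, h.t is the fitted conditional variance series, i.e. the squared conditional standard deviations; assuming the usual fGarch slots, the following should return TRUE:)
all.equal(fit@sigma.t^2, fit@h.t)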
And using garch() from the tseries package:
library(tseries)
fit2 <- garch(as.numeric(simGarch11), order = c(0, 1))
summary(fit2)
# comparison of fitted values:
tail(fit2$fitted.values[,1]^2)
tail(fit@h.t)
# comparison of residuals after unstandardizing:
unstd <- fit2$residuals*fit2$fitted.values[,1]
tail(unstd)
tail(fit@residuals)
Weighted.LM.test(unstd, fit2$fitted.values[,1]^2, lag = 10)

Error: step factor reduced below 0.001 without reducing pwrss when using nlmer

I think this could be more of a stats question than an R question, but I get the error Error: step factor reduced below 0.001 without reducing pwrss when trying to fit an nlmer model to data. My data are here: https://www.dropbox.com/s/cri5n7lewhc8j02/chweight.RData?dl=0
I'm trying to fit the model so that I can predict the weight of chicks based on time, for chicks on diet 1. I did the following:
cw1 <- subset(ChickWeight, Diet == 1)
m1 <- nlmer(weight ~ SSlogis(Time, Asym, xmid, scal) ~ Asym | Chick, cw1,
            start = c(Asym = 190, xmid = 730, scal = 350))
Could there be other ways to solve this error? I think the error has to do with Asym values but I'm not understanding well what it is doing, so any brief guidance would help.
I have been asked to improve my answer, so here is my attempt to do so.
This error is usually tripped because your start values aren't adequately close to the "true" values, so the optimizer fails to find any local improvements in fit by moving away from them. You need to try providing better starting guesses. This can sometimes be accomplished by algebraically solving the equation at a few points, as described in many places such as this article. Other times, you can plot the data and make educated guesses as to what the parameters might be, if you know what the parameters "do" within the non-linear function (that is, maybe parameter a represents an asymptote, b is a scale factor, c is the mean rate of change, etc.). That's hard for me personally because I have no math background, but I'm usually able to make a reasonable guess.
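For the SSlogis model in your question, for example, the self-start machinery itself can propose the guesses (a sketch using the built-in ChickWeight data; getInitial() and nls() are in base R's stats package):
cw1 <- subset(ChickWeight, Diet == 1)
## let the SSlogis self-start routine propose starting values from the data
getInitial(weight ~ SSlogis(Time, Asym, xmid, scal), data = cw1)
## or fit a fixed-effects-only nls() first and reuse its estimates in nlmer()
coef(nls(weight ~ SSlogis(Time, Asym, xmid, scal), data = cw1))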
To answer the question more directly, though, here is some reproducible code that should illustrate that the error in question comes from bad starting guesses.
#Create independent and dependent variables, X and Y, and a grouping variable Z.
xs = rep(1:10, times = 10)
ys = 3 + 2*exp(-0.5*xs)
zs = rep(1:10, each=10)
#Put random noise in X.
for (i in 1:100) {
xs[i] = rnorm(1, xs[i], 2)
}
df1 = data.frame(xs, ys, zs) #Assemble data into data frame.
require(lme4) #Turn on our package.
#Define our custom function--in this case, a three-parameter exponential model.
funct1 = deriv(~beta0 + beta1*exp(beta2*xs), namevec = c('beta0', 'beta1', 'beta2'),
               function.arg = c('xs', 'beta0', 'beta1', 'beta2'))
#This will return the exact same error because our starting guesses are way off.
test1 = nlmer(ys ~ funct1(xs, beta0, beta1, beta2) ~ (beta0|zs), data = df1,
start=c(beta0=-50,beta1=200,beta2=3))
#Our starting guesses are much better now, and so nlmer is able to converge this time.
test1 = nlmer(ys ~ funct1(xs, beta0, beta1, beta2) ~ (beta0|zs), data = df1,
start=c(beta0=3.2,beta1=1.8,beta2=-0.3))

How to extract info from package in R and use in function?

I apologize for the vague question title. What I want to do is run a regression in R using geeglm from the geepack R package, then use information from that to calculate a quasilikelihood information criterion (QIC; Pan 2001). I can do this fairly easily for single models, but I would like to write a general function that can do it for a variety of different types of models. I guess my real question is whether there is a better alternative than a long series of nested ifelse statements.
Here's my current code:
library(geepack)
data(dietox) #data from the geepack package
# Run gee regression
dietox$Cu <- as.factor(dietox$Cu)
mf <- formula(Weight ~ Cu * (Time + I(Time^2) + I(Time^3)))
gee1 <- geeglm(mf, data = dietox, id = Pig, family = gaussian, corstr = "ar1")
Then I can run a function to calculate the quasilikelihood:
QlogLik.normal <- function(model.R) {
  library(MASS)
  mu.R <- model.R$fitted.values
  y <- model.R$y
  # Quasi Likelihood for Normal
  quasi.R <- sum(((y - mu.R)^2)/-2)
  quasi.R
}
However, I would like to write a function that is more general because the quasilikelihood function is different for every distribution. The above function would work for gee1 because it had a gaussian (normal) distribution. If I wanted to generalize it for a variety of distributions I could use a series of nested ifelse statements (below), but I don't know if this is the best way to do this. Does anyone have other options or a better solution? This just doesn't seem very elegant to say the least (clearly I don't have much programming or R experience).
QlogLik <- function(model.R) {
  library(MASS)
  mu.R <- model.R$fitted.values
  y <- model.R$y
  ifelse(model.R$modelInfo$variance == "poisson",
         # Quasi Likelihood for Poisson
         quasi.R <- sum((y*log(mu.R)) - mu.R),
         ifelse(model.R$modelInfo$variance == "gaussian",
                # Quasi Likelihood for Normal
                quasi.R <- sum(((y - mu.R)^2)/-2),
                ifelse(model.R$modelInfo$variance == "binomial",
                       # Quasilikelihood for Binomial
                       quasi.R <- sum(y*log(mu.R/(1 - mu.R)) + log(1 - mu.R)),
                       quasi.R <- "Error: distribution not recognized")))
  quasi.R
}
In this example, I used the model output from geeglm to extract the type of distribution used to model the variance
model.R$modelInfo$variance
but there may be other ways to determine what distribution was used in the geeglm model. Any help would be appreciated.
You should be able to rewrite your function like this:
QlogLik <- function(model.R) {
  library(MASS)
  mu.R <- model.R$fitted.values
  y <- model.R$y
  type <- family(model.R)$family
  switch(type,
         poisson  = sum((y*log(mu.R)) - mu.R),
         gaussian = sum(((y - mu.R)^2)/-2),
         binomial = sum(y*log(mu.R/(1 - mu.R)) + log(1 - mu.R)),
         stop("Error: distribution not recognized"))
}
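Applied to the gee1 fit from your example, that is then just:
QlogLik(gee1)   # should match QlogLik.normal(gee1) for this gaussian fit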
As @baptise points out, switch() is useful in these cases. You can use family(model.R)$family to automatically detect which family type should be used with switch().
Also, if your commands for what to do in different cases run beyond one line, you can wrap the lines with curly brackets ({ do something here }) instead.
switch(type,
type1 = { something <- do(this)
thisis(something) },
type2 = do(that))
I hope this helps!
You may also use model.R$family$family, which gives the type of distribution used to model the variance, but I don't know whether you can eliminate those ifelse statements. The quasi.R in your code differs among distributions, so you have to define each of them separately.
BTW, it is a good question and thanks for posting it: I have had similar situations in the past, and hope to get some advice on how to write the code more efficiently.
