Probability (density) of new dataset under fitted model - r

Given a fitted model in R (i.e. an object of class 'lm', 'glm', 'merMod', etc), I am trying to figure out how to calculate the probability of a new dataset.
That is, I want the probability (density) of dataset B under the parameter estimates obtained by fitting a model to dataset A. I know how to do this by hand, but is there a simple pre-existing function in R that does it?
This question is very similar, but I want to do this in R.
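To make this concrete, here is the kind of calculation I have in mind for a plain lm, written out by hand with simulated data (dataA and dataB are made-up stand-ins for my real datasets):

set.seed(1)
dataA <- data.frame(x = runif(50)); dataA$y <- 1 + 2 * dataA$x + rnorm(50)
dataB <- data.frame(x = runif(20)); dataB$y <- 1 + 2 * dataB$x + rnorm(20)

fit <- lm(y ~ x, data = dataA)        # fit to dataset A
mu <- predict(fit, newdata = dataB)   # predicted means for dataset B
## log density of dataset B under the parameters estimated from dataset A
## (sigma(fit) uses the n - p divisor; the ML estimate would divide by n)
sum(dnorm(dataB$y, mean = mu, sd = sigma(fit), log = TRUE))

I am hoping there is a ready-made function that wraps up this kind of calculation.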

For a linear regression model (lm), you could use the following function to determine the likelihood, assuming the residuals of the linear model fit are normally distributed (these functions were adapted from this R-bloggers post; the rationale for the procedure can be found in this post):
log_lik <- function(beta0, beta1, mu, sigma) {
  ## x and y are the predictor and response vectors of the fitting data
  ## beta0 and beta1 require initial guesses
  R <- y - x * beta1 - beta0
  R <- dnorm(R, mu, sigma, log = TRUE)
  return(-sum(R))
}
library(stats4)
fit <- mle(log_lik, start=list(beta0=4, beta1=2, mu = 0, sigma=1))
summary(fit)
## mu and sigma are the estimated mean and standard deviation of the residuals
## logLik(fit) returns the maximised log-likelihood
For glm, this post in Cross Validated provides user-defined R functions for the likelihood.
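As a sketch of the same idea for a non-Gaussian model, say a Poisson glm, the manual calculation for new data would look like this (simulated data; dataA and dataB are made-up names):

set.seed(1)
dataA <- data.frame(x = runif(50)); dataA$y <- rpois(50, exp(0.5 + dataA$x))
dataB <- data.frame(x = runif(20)); dataB$y <- rpois(20, exp(0.5 + dataB$x))

fit <- glm(y ~ x, family = poisson, data = dataA)
lambda <- predict(fit, newdata = dataB, type = "response")  # fitted means for dataset B
## log probability of dataset B under the parameters estimated from dataset A
sum(dpois(dataB$y, lambda, log = TRUE))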
(P.S. It would be nice if you could provide a specific example involving one of lm, glm, etc. If you just want to understand these models in general, Cross Validated, Mathematics, or Data Science might be better places to ask.)

Related

Optimizing a GAM for Smoothness

I am currently trying to fit a generalized additive model in R using a response variable and three predictor variables. One of the predictors is linear, and the dataset consists of 298 observations.
I have run the following code to generate a basic GAM:
GAM <- gam(response ~ linearpredictor + s(predictor2) + s(predictor3), data = data[2:5])
This produces a model with 18 degrees of freedom and seems to substantially overfit the data. I'm wondering how I might generate a GAM that maximizes smoothness while minimizing predictive error. I realize that each of these goals comes at the expense of the other, but is there a good way to find an optimal model that doesn't overfit?
Additionally, I need to perform leave-one-out cross-validation (LOOCV), and I am not sure how to make gam() in the mgcv package do this. Any help on either of these problems would be greatly appreciated. Thank you.
I have also generated 1,000,000 GAMs with varying combinations of smoothing parameters and ranged the maximum degrees of freedom allowed from 10 (as shown in the code below) to 19. The variable "combinations2" is a list of all 1,000,000 combinations of smoothers I selected. This code is designed to try and balance degrees of freedom and AIC score. It does function, but I'm not sure that I'm actually going to be able to find the optimal model from this. I also cannot tell how to make sure that it uses LOOCV.
BestGAM <- gam(response ~ linearpredictor + predictor2 + predictor3, data = data[2:5])
for (i in 1:1000000) {
  PotentialGAM <- gam(response ~ linearpredictor + s(predictor2) + s(predictor3),
                      data = data[2:5],
                      sp = c(combinations2[i, ]$Var1, combinations2[i, ]$Var2))
  if (AIC(PotentialGAM, BestGAM)$df[1] <= 10 &
      AIC(PotentialGAM, BestGAM)$AIC[1] < AIC(PotentialGAM, BestGAM)$AIC[2]) {
    BestGAM <<- PotentialGAM
    listNumber <- i
  }
}
You are fitting your GAM using generalised cross-validation (GCV) smoothness selection. GCV is a way to get around the invariance problem of ordinary cross-validation (OCV; what you call LOOCV) when estimating GAMs. GCV is the same as OCV applied to a rotated version of the fitting problem (rotating y - Xβ by Q, any orthogonal matrix); {mgcv} doesn't actually need to perform the rotation when fitting with GCV, and the expected GCV score isn't affected by the choice of rotation, but in essence GCV is just OCV in a rotated basis (Wood, 2017, p. 260).
It has been shown that GCV can undersmooth (resulting in more wiggly models) because the objective function (the GCV profile) can become flat around the optimum. Instead, it is preferable to estimate GAMs (with penalized smooths) using REML or ML smoothness selection; add method = "REML" (or "ML") to your gam() call.
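In the notation of your model above, that would be (a sketch only, assuming the same data frame layout as in your code):

GAM <- gam(response ~ linearpredictor + s(predictor2) + s(predictor3),
           data = data[2:5], method = "REML")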
If the REML or ML fit is as wiggly as the GCV one with your data, then I'd be inclined to think gam() is not overfitting, but that there is something about your response data that hasn't been explained here (are the data ordered in time, for example?).
As to your question
how I might generate a GAM that maximizes smoothness while minimizing predictive error,
you are already doing that with GCV smoothness selection, for a particular definition of "smoothness" (here the squared second derivatives of the estimated smooths, integrated over the range of the covariates and summed over the smooths).
If you want GCV but smoother models, you can increase the gamma argument above 1; gamma = 1.4 is often used, for example, which means that each effective degree of freedom costs 40% more in the GCV criterion.
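For example, keeping GCV selection but making wiggliness more expensive (again assuming your data frame layout):

GAM <- gam(response ~ linearpredictor + s(predictor2) + s(predictor3),
           data = data[2:5], gamma = 1.4)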
FWIW, you can get the LOOCV (OCV) score for your model without actually fitting 298 GAMs, through the use of the diagonal of the influence (hat) matrix A. Here's a reproducible example using my {gratia} package:
library("gratia")
library("mgcv")
df <- data_sim("eg1", seed = 1)
m <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = df, method = "REML")
A <- influence(m)
r <- residuals(m, type = "response")
ocv_score <- mean(r^2 / (1 - A))

How to obtain R^2 for robust mixed effect model (rlmer command; robustlmm)?

I estimated a robust mixed effect model with the rlmer command from the robustlmm package. Is there a way to obtain the marginal and conditional R^2 values?
Just going to answer that myself. I could not find a package or function in R that is equivalent to e.g. r.squaredGLMM for lmerMod objects, but I found a quick workaround that works with rlmerMod objects. Basically you just have to extract the variance components for the fixed effects, random effects and residuals, and then manually calculate the marginal and conditional R^2 based on the formulas provided by Nakagawa & Schielzeth (2013).
library(robustlmm)
library(insight)
library(lme4)

data(Dyestuff, package = "lme4")

## robust mixed model (random intercept for Batch)
robust.model <- rlmer(Yield ~ 1|Batch, data = Dyestuff)

## extract the variance components with {insight}
var.fix <- get_variance_fixed(robust.model)     # fixed-effects variance
var.ran <- get_variance_random(robust.model)    # random-effects variance
var.res <- get_variance_residual(robust.model)  # residual variance

## Nakagawa & Schielzeth (2013)
R2m <- var.fix / (var.fix + var.ran + var.res)              # marginal R^2
R2c <- (var.fix + var.ran) / (var.fix + var.ran + var.res)  # conditional R^2
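As a rough cross-check (not part of the workaround itself), the same formulas are what MuMIn::r.squaredGLMM implements for ordinary lmer fits; it does not handle rlmerMod objects, which is why the manual calculation above is needed for the robust model:

library(MuMIn)
ordinary.model <- lmer(Yield ~ 1 + (1 | Batch), data = Dyestuff)
r.squaredGLMM(ordinary.model)  # R2m and R2c for the non-robust fit, for comparison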
Literature:
Nakagawa, S. and Schielzeth, H. (2013). A general and simple method for obtaining R2 from generalized linear mixed-effects models. Methods in Ecology and Evolution, 4: 133-142. doi:10.1111/j.2041-210x.2012.00261.x

How to calculate AIC, BIC and likelihoods of a fitted kalman filter using the DSE function in R

I would like to test the suitability of the dynamic linear model which I have fitted to a set of data. I have done this using the SS() function in the dse package in R. Are there any ways of testing the fit of the model in R using likelihoods and information criteria?
For illustrative purposes, assume that my model is a random walk. The theoretical form of the random walk is X(t) = X(t-1) + e(t), with e(t) ~ N(0,1), for the state evolution, and Y(t) = X(t) + w(t), with w(t) ~ N(0,1), for the observations. The code in R is:
kalman.filter <- dse::SS(F = matrix(1, 1, 1),
                         Q = matrix(1, 1, 1),
                         H = matrix(1, 1, 1),
                         R = matrix(1, 1, 1),
                         z0 = matrix(0, 1, 1),
                         P0 = matrix(0, 1, 1))
Assume that the actual observations were then:
simulate.kalman.filter=simulate(kalman.filter, start = 1, freq = 1, sampleT = 100)
Then assume we fit a model called "test":
test=l(kalman.filter, simulate.kalman.filter)
How can I test the fit of the data (simulate.kalman.filter) to the theoretical model in R? I am looking for functions such as the likelihood and the Bayesian Information Criterion.
I have figured out the answer to the question.
The function for doing this is informationTests(), in the same dse package. It returns the AIC, BIC, and negative log-likelihood of the model fitted to the data. In the example above, this is done by:
informationTests(test)
Remember that a model with a lower BIC is considered better. You can also compare two models (assume that you had a second model fitted to the data called test2) by adding the second model as a parameter:
informationTests(test, test2)
This tabulates the AIC, BIC and likelihoods against one another.
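For reference, the quantities in that table are linked by the usual definitions. The helper below is purely illustrative (it is not part of dse, and the numbers plugged in are made up):

## AIC and BIC from a log-likelihood, k estimated parameters and n observations
ic <- function(logLik, k, n) {
  c(AIC = 2 * k - 2 * logLik,
    BIC = k * log(n) - 2 * logLik)
}
ic(logLik = -150.3, k = 2, n = 100)  # made-up values, for illustration only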

Prewhitening / autocorrelation removal / AR(1) deterministic trend model

I am trying to prewhiten a time series using the methods described in a paper by K. Hamed found here: http://www.sciencedirect.com/science/article/pii/S0022169409000675
The idea is to fit an AR(1) model with a linear trend component to the data to remove the autocorrelation. The model I want to prewhiten with is given by
x_t = rho * x_(t-1) + alpha + beta * t + e_t
where x_t and x_(t-1) are observations of the time series, rho is the autocorrelation coefficient, alpha is the intercept, beta is the slope of the trend, and e_t is uncorrelated noise.
Anyway I have estimated the parameters to be rho = 0.02, alpha = 0.16 and beta = -0.00092
How do I fit an AR(1) model in R given these specific parameter values? I thought using init in the Arima function would allow me to specify them but it just uses the input as initial values.
fit <- Arima(x, order=c(1,0,0),init=c(0.02, 0.16))
Furthermore, how do you fit a generic ARIMA model with a linear trend? I tried the following
for (t in 2:length(x)) {
  fit <- Arima(x, order = c(1, 0, 0), init = c(0.02, 0.16)) - 0.00092 * t
}
but it returns the error "non-numeric argument to binary operator" and I am not sure how to work around this.
Thanks in advance.
From forecast::Arima documentation:
Arima(x, order=c(0,0,0), seasonal=c(0,0,0),
xreg=NULL, include.mean=TRUE, include.drift=FALSE,
include.constant, lambda=model$lambda, transform.pars=TRUE,
fixed=NULL, init=NULL, method=c("CSS-ML","ML","CSS"), n.cond,
optim.control=list(), kappa=1e6, model=NULL)
You have to include the external regressor (i.e. the linear trend) as a vector in the xreg argument:
xreg Optionally, a vector or matrix of external regressors, which must have the same number of rows as x.
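A minimal sketch of how that can look, using simulated data in place of your series (the fixed argument, visible in the signature above, pins coefficients at given values rather than just initialising them; NA entries are left free to be estimated, and whether you really want to fix them at your pre-estimated values is a separate question):

library(forecast)

set.seed(1)
n <- 100
trend <- seq_len(n)
## simulated stand-in for the observed series: AR(1) noise plus a linear trend
x <- 0.16 - 0.00092 * trend + arima.sim(list(ar = 0.02), n = n)

## AR(1) plus linear trend, with the trend supplied as an external regressor;
## fixed = c(ar1, intercept, trend slope); NA means "estimate this one"
fit <- Arima(x, order = c(1, 0, 0), xreg = trend,
             fixed = c(0.02, NA, -0.00092))
fit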

corr.bias parameter in Random forest regression model in R

I'm using the random forest regression model in R, and I found the corr.bias parameter, which according to the manual is "experimental". My data is non-linear, and I wonder whether setting this parameter to TRUE can improve the results. I also don't know exactly how it works for non-linear data, so I would really appreciate it if someone could explain how this bias correction works in the randomForest package and whether it can improve my regression model or not.
The short answer is that it performs a simple correction based on a linear regression of the actual values on the fitted values.
From regrf.c:
/* Do simple linear regression of y on yhat for bias correction. */
if (*biasCorr) simpleLinReg(nsample, yptr, y, coef, &errb, nout);
and the first few lines of that function are simply:
void simpleLinReg(int nsample, double *x, double *y, double *coef,
double *mse, int *hasPred) {
/* Compute simple linear regression of y on x, returning the coefficients,
the average squared residual, and the predicted values (overwriting y). */
So when you fit a regression random forest with corr.bias = TRUE, the returned model object will contain a coefs element, which is simply the two coefficients from that linear regression.
Then when you call predict.randomForest this happens:
## Apply bias correction if needed.
yhat <- rep(NA, length(rn))
names(yhat) <- rn
if (!is.null(object$coefs)) {
yhat[keep] <- object$coefs[1] + object$coefs[2] * ans$ypred
}
The non-linear nature of your data isn't necessarily the issue, but the bias correction may be very poor if the relationship between the fitted and actual values is far from linear.
You can always fit the model and then plot the fitted vs actual values yourself and see whether a correction based on a linear regression would help or not.
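A quick way to do exactly that on a toy example (illustrative only; the data below are made up):

library(randomForest)

set.seed(1)
## toy non-linear data
d <- data.frame(x1 = runif(300), x2 = runif(300))
d$y <- sin(2 * pi * d$x1) + d$x2^2 + rnorm(300, sd = 0.1)

rf <- randomForest(y ~ x1 + x2, data = d, corr.bias = TRUE)
rf$coefs  # the two coefficients of the bias-correcting linear regression

## out-of-bag fitted values vs actual values: if this looks roughly linear,
## a linear bias correction is at least not unreasonable
plot(predict(rf), d$y, xlab = "fitted (OOB)", ylab = "actual")
abline(0, 1, lty = 2)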
