How to specify random effect names in the newdata data frame used by the predict() function? - lme4 - r

I have a problem using the predict() function in lme4.
More precisely, I am not clear on how to declare the names of the random effects in the newdata data frame that I feed to the predict() function in order to get predictions.
I will try and describe my problem in detail.
Data
The data I am working with is longitudinal. I have 119 observations, and for each observation I have several (6-7) measurements representing the size of proteins that aggregate over time and grow bigger (let's call them LDL).
Model
The model used to describe this process is a Richards curve (generalized logistic function), which (with a fixed lower asymptote of 15, as in the code below) can be written as
LDL(t) = 15 + (alpha - 15) / (1 + exp((gamma - t) / delta))
Now, I fit a separate curve to the group of measurements of each observation, with the following fixed effects, random effects, and variables:
alpha_fix - a fixed effect for alpha
alpha|Obs - a random effect for alpha, which varies among observations
gamma_fix - a fixed effect for gamma
gamma|Obs - a random effect for gamma, which varies among observations
delta_f - a fixed effect for delta
Time - a continuous variable, time in hours
LDL - response variable, continuous, representing size of proteins at time point t.
Predictions
Once I fit the model, I want to use it to predict the value of LDL at a specific time point, for each observation. In order to do this, I need to use the predict() function and supply a data frame as newdata. Reading through the documentation here, it says the following:
If any random effects are included in re.form (see below), newdata
must contain columns corresponding to all of the grouping variables
and random effects used in the original model, even if not all are
used in prediction; however, they can be safely set to NA in this case
Now, the way I understand this, I need to have a data frame newdata, which in my case contains the following columns: "Time", "Obs", "alpha_f", "gamma_f", "delta_f", as well as two columns for the random effects of alpha and gamma, respectively. However, I am not sure how these two columns with random effects should be named, in order for the predict() function to understand them. I tried with "alpha|Obs" and "gamma|Obs", as well as "Obs$alpha", "Obs$gamma", but both throw the error
Error in FUN(X[[i]], ...) : random effects specified in re.form
that were not present in original model.
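For concreteness, the attempt looked roughly like this (the time point is just an example, and the random-effect column names are exactly the part I am unsure about):

```r
pred_time <- 2.5                                    # example time point (hours)
newdata <- data.frame(Time = pred_time, Obs = factor(1:119))
newdata[["alpha|Obs"]] <- NA                        # one of the naming attempts
newdata[["gamma|Obs"]] <- NA                        # that predict() rejects
# predict(fit, newdata = newdata)                   # -> error shown above
```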
I was wondering whether anyone has any idea what the correct way to do this is.
For completeness, the code used to fit the model is provided below:
ModelFunction = function (alpha, gamma, delta, Time) {
  15 + (alpha - 15) / (1 + exp((gamma - Time) / delta))
}
ModelGradient = deriv(body(ModelFunction)[[2]],
                      namevec = c("alpha", "gamma", "delta"),
                      function.arg = ModelFunction)
starting_conditions = c(alpha = 5000, gamma = 1.5, delta = 0.2)  # Based on visual observation
fit = nlmer(
  LDL ~ ModelGradient(alpha, gamma, delta, Time) ~ (gamma | Obs) + (alpha | Obs),
  start = starting_conditions,
  control = nlmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 100000)),
  data = ldlData)
I would really appreciate it if someone could give me some advice.
Thanks for your time!

Related

Syntax error when fitting a Bayesian logistic regression

I am attempting to model binary species traits, where presence is represented by 1 and absence by 0, as a function of some sampling variables. To accomplish this, I have constructed a brms model and added a phylogenetic structure to it. Here is the model I used:
model <- brms::brm(
  male_head | trials(1 + 0) ~ PC1 + PC2 + PC3 +
    (1 | gr(phylo, cov = covariance_matrix)),
  data = data,
  family = binomial(),
  prior = prior,
  data2 = list(covariance_matrix = covariance_matrix))
Each line of my df represents one observation with a binary outcome.
Initially, I was unsure about which arguments to use in the trials() function. Since my species are non-repeated and some have the traits I'm modeling while others do not, I thought that trials(1 + 0) might be appropriate. I recall seeing a vignette that suggested this, but I can't find it now. Is this syntax correct?
Furthermore, for some reason I'm unaware, the model is producing one estimate value for each line of my predictors. As my df has 362 lines, the model summary displays a lengthy list of 362 estimate values. I would prefer to have one estimate value for each sampling variable instead. Although I have managed to achieve this by making the treatment effect a random effect (i.e., (1|PC1) + (1|PC2) + (1|PC3)), I don't think this is the appropriate approach. I also tried bernoulli() but no success either. Do you have any suggestions for how I can address this issue?
EDIT:
For some reason the values of my sampling variables/principal components were being read as factors. The second part of this question was solved.
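For anyone hitting the same issue, the factor problem in the second part looks like this in miniature (toy vector, not my actual data):

```r
f <- factor(c("0.5", "-1.2", "3.4"))  # numeric values accidentally read as a factor
as.numeric(f)                         # returns the level codes, not the values
x <- as.numeric(as.character(f))      # recovers the original values
```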

LASSO-type regressions with a non-negative continuous dependent variable

I am using the "glmnet" package (in R), mostly to perform regularized linear regression.
However, I am wondering if it can perform LASSO-type regressions with a non-negative continuous (dependent) outcome variable.
I could use family = poisson, but the outcome variable is not specifically a "count" variable; it is just a continuous variable with a lower limit of 0.
I am aware of the "lower.limits" argument, but I guess it applies to the covariates (independent variables). (Please correct me if my understanding is not right.)
I look forward to hearing from you all! Thanks :-)
You are right that setting lower limits in glmnet is meant for the covariates. Poisson sets a lower limit of zero because you exponentiate the linear predictor to get back the "counts".
Going along those lines, it will most likely work if you transform your response variable. One quick way is to take the log of the response, do the fit, and transform the predictions back; this ensures they are always positive. You do, however, have to deal with zeros.
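A minimal base-R sketch of the log-transform idea (simulated data; lm() stands in for the glmnet fit, and with exact zeros in the response you would need a small offset, e.g. log(y + c)):

```r
set.seed(42)
x <- matrix(rnorm(200), ncol = 2)
y <- exp(1 + x[, 1] + rnorm(100, sd = 0.3))  # strictly positive response
fit0 <- lm(log(y) ~ x)                       # fit on the log scale
pred <- exp(fitted(fit0))                    # back-transform: always positive
```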
An alternative is a power transformation. There is a lot to think about here, and since you did not provide your data, I can only try a two-parameter Box-Cox fit on an example dataset:
library(glmnet)
library(mlbench)
library(geoR)
data(BostonHousing)
data = BostonHousing
data$chas=as.numeric(data$chas)
# change it to min 0 and max 1
data$medv = (data$medv-min(data$medv))/diff(range(data$medv))
Then here I use a quick approximation via pca (without fitting all the variables) to get the suitable lambda1 and lambda2 :
bcfit = boxcoxfit(object = data[, 14],
                  xmat = prcomp(data[, -14], scale = TRUE, center = TRUE)$x[, 1:2],
                  lambda2 = TRUE)
bcfit
Fitted parameters:
lambda lambda2 beta0 beta1 beta2 sigmasq
0.42696313 0.00001000 -0.83074178 -0.09876102 0.08970137 0.05655903
Convergence code returned by optim: 0
Check lambda2: it is the one that's critical for deciding whether you get a negative value. It should be rather small.
Create the functions to power transform:
bct = function(y,l1,l2){((y+l2)^l1 -1)/l1}
bctinverse = function(y,l1,l2){(y*l1+1)^(1/l1) -l2}
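A quick round-trip sanity check of the two helpers (restated here so the snippet runs on its own; the lambda values are illustrative, close to but not exactly the fitted ones):

```r
bct        = function(y, l1, l2) ((y + l2)^l1 - 1) / l1
bctinverse = function(y, l1, l2) (y * l1 + 1)^(1/l1) - l2

l1 <- 0.427; l2 <- 1e-5
y  <- c(0.01, 0.25, 0.5, 0.99)
all.equal(bctinverse(bct(y, l1, l2), l1, l2), y)  # should be TRUE
```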
Now we transform the response:
data$medv_trans = bct(data$medv,bcfit$lambda[1],bcfit$lambda[2])
And fit glmnet:
fit = glmnet(x=as.matrix(data[,1:13]),y=data$medv_trans,nlambda=500)
Get predictions over all lambdas, and you can see there's no negative predictions once you transform back:
pred = predict(fit,as.matrix(data[,1:13]))
range(bctinverse(pred,bcfit$lambda[1],bcfit$lambda[2]))
[1] 0.006690685 0.918473356
And let's say we do a fit with cv:
fit = cv.glmnet(x=as.matrix(data[,1:13]),y=data$medv_trans)
pred = predict(fit,as.matrix(data[,1:13]))
pred_transformed = bctinverse(pred, bcfit$lambda[1], bcfit$lambda[2])
plot(data$medv,pred_transformed,xlab="orig response",ylab="predictions")

What are the differences between directly plotting the fit function and plotting the predicted values (they have the same shape but different ranges)?

I am trying to learn gam() in R for a logistic regression using a spline on a predictor. The two methods of plotting in my code give the same shape but different ranges of response on the logit scale; it seems like an intercept is missing in one. Both are supposed to be correct, so why the difference in range?
library(ISLR)
attach(Wage)
library(gam)
gam.lr = gam(I(wage > 250) ~ s(age), family = binomial(link = "logit"), data = Wage)
agelims = range(age)
age.grid = seq(from = agelims[1], to = agelims[2])
pred = predict(gam.lr, newdata = list(age = age.grid), type = "link")
par(mfrow = c(2, 1))
plot(gam.lr)
plot(age.grid, pred)
I expected both methods to give exactly the same plot. plot(gam.lr) plots the additive effect of each component, and since there is only one component here, it should give the predicted logit function. The predict method also gives me estimates on the link scale. But the actual outputs are on different ranges: the minimum value of the first method is about -4, while that of the second is less than -7.
The first plot shows the estimated smooth function s(age) only. Smooths are subject to identifiability constraints: in the basis expansion used to parametrise the smooth, there is a function (or combination of functions) that is entirely confounded with the intercept. As such, you can't fit the smooth and an intercept in the same model, because you could subtract some value from the intercept, add it back to the smooth, and get the same fit with different coefficients. Since you can add and subtract an infinity of values, you would have an infinite supply of equivalent models, which isn't helpful.
Hence identifiability constraints are applied to the basis expansion, and the most useful one is to ensure that the smooth sums to zero over the range of the covariate. This centres the smooth at 0, with the intercept then representing the overall mean of the response.
So the first plot shows the smooth subject to this sum-to-zero constraint, which is why it straddles 0. The intercept in this model is:
> coef(gam.lr)[1]
(Intercept)
-4.7175
If you add the intercept to the values in the first plot, you get the values in the second plot, which is the full model, intercept + f(age), applied to the data you supplied.
This is all happening on the link scale (the log-odds scale), hence all the negative values.
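The decomposition can be illustrated numerically (made-up smooth values, not the actual Wage fit; only the intercept is taken from above):

```r
smooth_vals <- c(1.2, -0.3, 2.1, -3.0)  # centered smooth: sums to zero
intercept   <- -4.7175                  # coef(gam.lr)[1] from above
link_vals   <- intercept + smooth_vals  # what predict(type = "link") would return
range(smooth_vals)                      # straddles 0, like plot(gam.lr)
range(link_vals)                        # shifted down by the intercept
```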

Calculating the standard error of parameters in nlme

I am running a non-linear mixed model in nlme, and I am having trouble calculating the standard errors of the three parameters. Our final model is:
shortG.nlme9 <- update(shortG.nlme6,
fixed = Asym + xmid + scal ~ Treatment * Breed + Environment,
start = c(shortFix6[1:16], rep(0,2),
shortFix6[17:32], rep(0,2),
shortFix6[33:48], rep(0,2)),
control = nlmeControl(pnlsTol = 0.02, msVerbose = TRUE))
And when we run the summary statement, we can get the standard errors of each of the treatments, breeds, treatment*breed interactions, and environments. However, we are making growth curves for specific combinations (treatment1/breed1, treatment2/breed1, treatment3/breed1, etc.), so we need to combine the effects of treatment, breed, and environment for each parameter value, and combine their standard errors accordingly to get the SE of the full parameter. To do this, is there either a way to get R to compute the full SE on its own, or an easy way to have R give us the variance-covariance matrix so we can calculate the values by hand? When we simply run summary(shortG.nlme9), we are automatically given a correlation matrix, so is there something we could write to get a covariance matrix instead?
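A note on the last point, hedged since the data are not available: vcov() on an nlme fit returns the variance-covariance matrix of the fixed effects (summary() shows it rescaled as a correlation matrix), and the SE of a linear combination of coefficients is sqrt(t(L) %*% V %*% L). A minimal sketch with a made-up 2x2 matrix standing in for the relevant block of vcov(shortG.nlme9):

```r
V <- matrix(c(0.04, 0.01,
              0.01, 0.09), nrow = 2)  # stand-in for a block of vcov(shortG.nlme9)
L <- c(1, 1)                          # contrast: e.g. treatment effect + breed effect
se_combined <- sqrt(drop(t(L) %*% V %*% L))
se_combined                           # sqrt(0.04 + 0.09 + 2 * 0.01)
```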

obtain the probability equation represented by plotmo plots

I want to obtain the equations of the probability functions represented by plotmo (R), i.e. the equations of the model when varying one or two predictors while holding the other predictors constant at their mean values. I want an easy way to obtain the mathematical equation, because I have to build many models with different variables.
if my model is like this:
glm(formula = pres_aus ~ pH_sp + Annual_prec + I(pH_sp^2) + I(Annual_prec^2),
    family = binomial(link = "logit"), data = puntos_calibrado)
How can I do this?
No data example was provided, so no testing was done, but couldn't you just skip the construction of a symbolic expression and do something along the lines of:
model.matrix(data.frame(one=1, dat) ) %*% coef(mdl.fit)
# where mdl.fit is returned from glm()
In a sense this is the R matrix representation of the formula: sum(beta_i * X_i). If you want to hold a particular column at its mean, just pull that data frame apart and use only parts of it in the calculation. So, with the first column held at its mean:
model.matrix(data.frame(one = 1, mn1 = mean(dat[[1]]), dat[-1])) %*% coef(mdl.fit)
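A self-contained sketch of this approach with simulated data (variable names mirror the question; the coefficients used to simulate are made up):

```r
set.seed(1)
puntos_calibrado <- data.frame(pH_sp = rnorm(100, 7, 0.5),
                               Annual_prec = rnorm(100, 800, 100))
puntos_calibrado$pres_aus <- rbinom(100, 1, plogis(-21 + 3 * puntos_calibrado$pH_sp))
mdl.fit <- glm(pres_aus ~ pH_sp + Annual_prec + I(pH_sp^2) + I(Annual_prec^2),
               family = binomial(link = "logit"), data = puntos_calibrado)
# Linear predictor via the design matrix, holding Annual_prec at its mean:
newd <- transform(puntos_calibrado, Annual_prec = mean(Annual_prec))
eta  <- model.matrix(terms(mdl.fit), newd) %*% coef(mdl.fit)
prob <- plogis(eta)  # probability curve along pH_sp, other predictor at its mean
```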
