Fitting GLM (family = inverse.gaussian) on simulated AR(1)-data. - r

I am encountering quite an annoying and to me incomprehensible problem, and I hope some of you can help me. I am trying to estimate the autoregression (influence of previous measurements of variable X on current measurement of X) for 4 groups that have a positively skewed distribution to various degrees. The theory is that more positively skewed distributions have less variance, and since the relationship between 2 variables depends on the amount of shared variance, positively skewed distributions have a smaller autoregression that more normally distributed variables.
I use simulations to investigate this, and generate data as follows: I simulate data for n people with tp time points. I use a fixed autoregressive parameter, phi (at .3 so we have a stationary process). To generate positively skewed distributions I use a chi-square distributed error. Individuals differ in the degrees of freedom that is used for the chi2 distributed errors. In other words, degrees of freedom is a level 2 variable (and is in itself chi2(1)-distributed). Individuals with a very low df get a very skewed distribution whereas individuals with a higher df get a more normal distribution.
for(i in 1:n) { # Loop over persons.
chi[i, 1] <- rchisq(1, df[i]) # Set initial value.
for(t in 2:(tp + burn)) { # Loop over time points.
chi[i, t] <- phi[i] * chi[i, t - 1] + # Autoregressive effect.
rchisq(1, df[i]) # Chi-square distributed error.
} # End loop over time points.
} # End loop over persons.
Now that I have the outcome variable generated, I put it in long format, I create a lagged predictor, and I person mean center the predictor (or group mean center, or cluster mean center, all the same). I call this lagged and centered predictor chi.pred. I make the subgroups based on the degrees of freedom of individuals. The 25% with a lowest df goes in subgroup 1, 26% - 50% in subgroup 2, etc.
The problem is this: fitting a multilevel (i.e. mixed or random effects model) autoregressive(1) model with family = inverse.gaussian and link = 'identity', using glmer() from the lme4 package gives me quite a lot of warnings. E.g. "degenerate Hessian", "large eigen value/ratio", "failed to converge with max|grad", etc.. I just don't get why.
The model I fit are
# Random intercept, but fixed slope with subgroups as level 2 predictor of slope.
lmer(chi ~ chi.pred + chi.pred:factor(sub.df.noise) + (1|id), data = sim.data, control = lmerControl(optimizer = 'bobyqa'))
# Random intercept and slope.
lmer(chi ~ chi.pred + (1 + chi.pred|id), data = sim.data, control = lmerControl(optimizer = 'bobyqa'))
The reason I use inverse gaussian is because it is said to work better on skewed data.
Does anybody have any clue why I can't fit the models? I have tried increasing sample size and time points, different optimizers, I have double-double-double checked if lagging and centering the data is correct, increased the number of iterations, added some noise to the subgroups (since otherwise they are 1 on 1 related to degree of freedom) etc.

Related

GLS / GLM nested design with autocorrelation over time

Still fairly new to GLM and a bit confused about how to establish my model.
About my project:
I sampled the microbiome (and measured a diversity index value = Shannon) from the root system of a sample of 9 trees (=tree1_cat).
In each tree I sampled fine and thick roots (=rootpart) and each tree was sampled four times (=days) over the course of one season. Thus I have a nested design but have to keep time in mind for autocorrelation. Also not all values are present, thus I have a few missing values). So far I have tried and tested the following:
Model <- gls(Shannon ~ tree1_cat/rootpart + tree1_cat + days,
na.action = na.omit, data = psL.meta,
correlation = corAR1(form =~ 1|days),
weights = varIdent(form= ~ 1|days))
Furthermore I've tried to get more insight and used anova(Model) to get the p-values of those factors. Am I allowed to use those p-values? Also I've used emmeans(Model, specs = pairwise ~ rootpart)for pairwise comparisons but since rootpart was entered as nested factor it only gives me the paired interactions.
It all works, but I am not sure, whether this is the right model! Any help would be highly appreciated!
It would be helpful to know your scientific question, but let's suppose you're interested in differences in Shannon diversity between fine and thick roots and in time trends. A model you could use would be:
library(lmerTest)
lmer(Shannon ~ rootpart*days + (rootpart*days|tree1_cat), data = ...)
The fixed-effect component rootpart*days can be expanded into 1 + rootpart + days + rootpart:days (where 1 signifies the intercept)
intercept: SD in fine roots on day 0 (hopefully that's the beginning of the season)
rootpart: difference between fine and thick roots on day 0
days: change per day in SD in fine roots (slope)
rootpart:days difference in slope between thick roots and fine roots
The random-effect component (rootpart*days|tree1_cat) measures how all four of these effects vary across trees, and their correlations (e.g. do trees with a larger-than-average difference between fine and thick roots on day 0 also have a larger-than-average change over time in fine root SD?)
This 'maximal' random effects model is almost certainly too complex for your data; a rough rule of thumb says you should have 10-20 data points per parameter estimated, the fixed-effect model takes 4 parameters. A full model with 4 random effects requires the estimate of a 4×4 covariance matrix, which has (4*5)/2 = 10 parameters all by itself. I might just try (1+days|tree1_cat) (random slopes) or (rootpart|tree_cat) (among-tree difference in fine vs. thick differences), with a bias towards allowing for the variation in the effect that is your primary interest (e.g. if your primary question is about fine vs. thick then go with (rootpart|tree_cat).
I probably wouldn't worry about autocorrelation at all, nor heteroscedasticity by day (your varIdent(~1|days) term) unless those patterns are very strongly evident in the data.
If you want to allow for autocorrelation you'll need to fit the model with nlme::lme or glmmTMB (lmer still doesn't have machinery for autocorrelation models); something like
library(nlme)
lme(Shannon ~ rootpart*days,
random = ~days|tree1_cat,
data = ...,
correlation = corCAR1(form = ~days|tree1_cat)
)
You need to use corCAR1 (continuous-time autoregressive order-1) rather than the more common corAR1 for unevenly sampled data. Be aware that lme is more finicky/worse at dealing with singular models, so you may discover you need to simplify your model before you can actually get this model to run.

Why is my random forest regression predicting values not found in my training set? (R)

I have a linear regression random forest model predicting plant height from a set of variables.
training <- read.csv('/sers/me/Desktop/training_data.csv')
rf_model <- randomForest(height ~ EVI + NDVI + Annual_Mean_Temperature + Annual_Precipitation + Precipitation_of_Wettest_Month, data = training, importance=TRUE, na.action = na.roughfix)
But when I look at the predicted values I see some negative numbers, despite that there are no negative values in my training dataset for the dependent variable -- as I'm predicting plant height, a negative value is physically impossible.
> min(rf_model$predicted)
-4.433786671143025159836e-12
I've checked my training set and there are no negative values here, so how can this be / what should I do?
> min(training$height)
0
First, the negative number you listed there is extremely small and is equal to 0 even if you don't round until the 11th or 12th decimal place, so you probably could just treat that fitted value as 0.
Second, without some sort of transformation, the range for a response variable in linear regression is the entire real line. The coefficients are chosen based on what minimizes the loss function (sum of squares in the basic case), so the model doesn't really care if it produces fitted values that aren't in the exact same range as the original response.
Take this misspecified model for example. We know the data-generating process requires Y to be positive, but a simple linear model will create negative fitted values in an effort to draw the best line through the data:
set.seed(0)
n <- 1000
x <- rnorm(n)
y <- exp(x + rnorm(n))
data.frame(x, y) %>%
ggplot(aes(x, y)) +
geom_point() +
geom_smooth(method = 'lm')
In order to restrict the range of your response, you can transform it, which is the idea behind GLMs. For example, if you take the logarithm of your response variable and then fit the model, you will have to exponentiate the resulting fitted values to get them back on the original scale, which guarantees that they are positive.

How to specify random coefficients priors in rstanarm?

Suppose I have a following formula for a mixed effects model:
Performance ~ 1 + WorkingHours + Tenure + (1 + WorkingHours + Tenure || JobClass)
then I can specify priors for fixed slopes and fixed intercept as:
prior = normal(c(mu1,mu2), c(sd1,sd2), autoscale = FALSE)
prior_intercept = normal(mean, scale, autoscale = FALSE)
But how do I specify the priors for random slopes and intercept using
prior_covariance = decov(regularization, concentration, shape, scale)
(or)
lkj(regularization, scale, df)
if I know the variance between the slopes and intercepts and the correlation between them.
I am unable to understand how to specify the parameters for the above mixed effects formula.
Because you're working in a Bayesian model, you aren't going to specify the correlations or variances. You're going to specify a likelihood distribution of covariance matrices (by way of the correlation matrix and vector of variances) by giving the values for a few parameters.
The regularization parameter is a positive real value that determines how likely things are to be correlated. A value of 1 is sort of the "anything's possible" option (this is the default). Values greater than 1 mean that you believe there are few, if any, correlations. Values less than 1 mean you believe there is a lot of correlation.
The scale parameter is related to the sum of the variances. In particular, the scale parameter is equal to the square root of the average variance.
The concentration parameter is used to control how the total variance is distributed among the different variables. A value of 1 is saying you don't have an expectation. Larger values say that you believe that the variables have similar proportions of the total variance. Values between 0 and 1 mean that you think there are dissimilar contributions.
The shape parameter is used for a Gamma distribution that acts as a prior on the scale.
Then, finally, df is your prior degrees of freedom.
So, decov and lkj are each giving you a different way to express your expectations about properties of the covariance matrix, but they won't let you specify which specific variables you believe to be correlated with which other specific variables. It should decide that as part of the model fitting process.
This is all from the rstanarm documentation

how to calculate heritability from half-sib design

I'm trying to measure heritability of a trait, flowering time (FT), for a set of data collected from a half-sib design. The data includes FT for each mother plant and 2 half siblings from that mother plant for ~150 different maternal lines (ML). Paternity is unknown.
I've tried:
Estimating heritability with a regression of the maternal FT and the mean sibling FT, and doubling the slope. This worked fine, and produced an estimate of 0.14.
Running an ANOVA and using the between ML variation to estimate additive genetic variance. Got the idea from slide 25 of this powerpoint and from this thread on within and between variance calculation
fit = lm(FT ~ ML, her)
anova(fit)
her is the dataset, which, in this case, only includes the half sib FT values (I excluded the mother FT values for this attempt at heritability)
From the ANOVA output I have have used the "ML" term mean square as the between ML variation, which is also equal to 1/4 of the additive genetic variance because coefficient of relatedness between half-sibs is 0.25. This value turned out to be 0.098. Also, by multiplying this by 4 I could get the additive genetic variance.
I have used the "residuals" mean square as all variability save for that accounted for by the "ML" term. So, all of variance minus 1/4 of additive genetic variance. This turned out to be 1.342.
And then attempted to calculate heritabilty as Va/Vp = (4*0.098)/(1.342 + 0.098) = 0.39
This is quite different from my slope estimate, and I'm not sure if my reasoning is correct.
I've tried things with the sommer and heritability packages of R but haven't had success using either for a half-sib design and haven't found an example of a half-sib design with either package.
Any suggestions?

How to improve a Zero-Inflated Negative Binomial regression model?

everybody!
I have a response variable that counts sucessful days in a month and is distributed in a peculiar shape (see above). About 50% are zeros, and there is a heavy tail. Because of the overdispersion and the excess of zeros, I was advised to predict it with a Zero-Inflated Negative Binomial regression model.
However, no matter how significant a model I obtain, it reflects little of those distributing features (see below). For example, the peaks are always around 4, and no predictions fall beyond 20.
Is this usual in fitting overdispersed, heavy-tailed count data? Are there other ways to improve the fitting? Any suggestions would be appreciated. Thank you!
P. S.
I also tried logistic regression to predict zero/non-zero only. But none of the fitted models perform better than simply guessing zeros for all cases.
I suppose you did a histogram of the fitted values, so this will only reflect the fitted means, and possibly multiplied by the ratio of being zero depending on the model you use. It is not supposed to recreate that distribution because how spread your data can be is embedded in the dispersion parameter.
We can use an example from the pscl package:
library(pscl)
data("bioChemists")
fit <- hurdle(art ~ ., data = bioChemists,dist="negbin",zero.dist="binomial")
par(mfrow=c(1,2))
hist(fit$y,main="Observed")
hist(fit$fitted.values,main="Fitted")
As mentioned before, in this hurdle model, the fitted values you see, are the predicted means multiplied by the ratio of being zero (see more here):
head(fit$fitted.values)
1 2 3 4 5 6
1.9642025 1.2887343 1.3033753 1.3995826 2.4560884 0.8783207
head(predict(fit,type="zero")*predict(fit,type="count"))
1 2 3 4 5 6
1.9642025 1.2887343 1.3033753 1.3995826 2.4560884 0.8783207
To simulate the data based on the fitted model, we extract out the parameters:
Theta=fit$theta
Means=predict(fit,type="count")
Zero_p = predict(fit,type="prob")[,1]
Have function to simulate the counts:
simulateCounts = function(mu,theta,zero_p){
N = length(mu)
x = rnbinom(N,mu=mu,size=THETA)
x[runif(x)<zero_p] = 0
x
}
So run this simulation a number of times to get the spectrum of values:
set.seed(100)
simulated = replicate(10,simulateCounts(Means,Theta,Zero_p))
simulated = unlist(simulated)
par(mfrow=c(1,2))
hist(bioChemists$art,main="Observed")
hist(simulated,main="simulated")

Resources