Extracting linear term from a polynomial predictor in a GLM - r

I am relatively new to both R and Stack Overflow, so please bear with me. I am currently using GLMs to model ecological count data under a negative binomial distribution in brms. Here is my general model structure, which I have chosen based on fit, convergence, low LOOIC when compared to other models, etc:
My goal is to characterize population trends of study organisms over the study period. I have created marginal effects plots by using the model to predict on a new dataset where all covariates are constant except year (shaded areas are 80% and 95% credible intervals for posterior predicted means):
I am now hoping to extract trend magnitudes that I can report and compare across species (i.e. say a certain species declined or increased by x% (+/- y%) per year). Because I use poly() in the model, my understanding is that R uses orthogonal polynomials, and the resulting polynomial coefficients are not easily interpretable. I have tried generating raw polynomials (setting raw=TRUE in poly()), which I thought would produce the same fit and have directly interpretable coefficients. However, the resulting models don't really run (after 5 hours neither chain gets through even a single iteration, whereas the same model with raw=FALSE only takes a few minutes to run). Very simplified versions of the model (e.g. count ~ poly(year, 2, raw=TRUE)) do run, but take several orders of magnitude longer than setting raw=FALSE, and the resulting model also predicts different counts than the model with orthogonal polynomials. My questions are (1) what is going on here? and (2) more broadly, how can I feasibly extract the linear term of the quartic polynomial describing response to year, or otherwise get at a value corresponding to population trend?
I feel like this should be relatively simple and I apologize if I'm overlooking something obvious. Please let me know if there is further code that I should share for more clarity–I didn't want to make the initial post crazy long, but happy to show specific predictions from different models or anything else. Thank you for any help.
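A small base-R sketch of the raw versus orthogonal polynomial point, using glm() on simulated Poisson counts rather than brms. The data, the centred variable yr_c, and the coefficients are all made up for illustration. Both bases span the same polynomial space, so the fitted values should agree and only the coefficients differ. Centring year before taking raw powers keeps the columns on a moderate scale, which may be part of why the uncentred raw version samples so slowly. The last lines show one possible trend summary: an average annual percent change computed from the predicted means in the first and last year.

```r
set.seed(42)

# Simulated counts with a curved trend (illustrative values only)
year  <- 2000:2019
yr_c  <- year - mean(year)                       # centred year
mu    <- exp(3 + 0.05 * yr_c - 0.004 * yr_c^2)
count <- rpois(length(year), mu)

# Same model, two polynomial bases
fit_orth <- glm(count ~ poly(yr_c, 2), family = poisson)
fit_raw  <- glm(count ~ poly(yr_c, 2, raw = TRUE), family = poisson)

# Coefficients differ between the bases, fitted values should not
print(coef(fit_orth))
print(coef(fit_raw))
print(all.equal(unname(fitted(fit_orth)), unname(fitted(fit_raw)), tolerance = 1e-6))

# Trend summary from predictions rather than coefficients:
# average annual percent change between the first and last year
pred <- predict(fit_orth,
                newdata = data.frame(yr_c = range(yr_c)),
                type = "response")
n_years    <- diff(range(yr_c))
annual_pct <- ((pred[[2]] / pred[[1]])^(1 / n_years) - 1) * 100
print(annual_pct)
```

In a Bayesian fit, the same ratio could be computed for each posterior draw of the predicted means, which would give a credible interval for the annual change.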

Related

Type of regression for large dataset, nonlinear, skewed in R

I'm researching moth biomass in different biotopes, and I want to find a model that estimates the biomass. I have measured the length and width of the forewing, abdomen and thorax of 37088 specimens, and I have weighed them individually (dried).
First, I wanted to fit a simple linear regression of biomass on each variable. The problem is that none of the assumptions are met: the relationship is not linear, biomass (and some of the variables) are not normally distributed, there is heteroskedasticity, and there are a lot of outliers. I have tried transforming my data using log, x^2, 1/x, and Box-Cox, but none of them helped. I have also tried Theil-Sen regression (not possible because of too much data) and Siegel regression (biomass is not a vector). Is there some other form of non-parametric or median-based regression that I can try? I am really out of ideas.
Here is a frequency histogram for biomass:
[Figure: frequency histogram of dry biomass]
So what I actually want to do is build a model that accurately estimates the dry biomass, based on the measurements I performed. I have a power function (Rogers et al.) that is general for all insects, but there is a significant difference between this estimate and what I actually weighed. Therefore, I want to build a model with all significant variables. I am not very familiar with power functions, but maybe it is possible to build one myself? Can anyone recommend a method? Thanks in advance.
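To fit a power function, you could try nlsLM from the minpack.lm package:
library(minpack.lm)
# y = response (dry biomass), x = a body measurement; a and b are estimated.
# The starting values below are placeholders; choose ones on the scale of your data.
m <- nlsLM(y ~ a * x^b, data = your.data.here, start = list(a = 1, b = 1))
Then check whether it performs satisfactorily.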
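One way to get reasonable starting values for such a power fit is a log-log linear regression, whose intercept and slope map to log(a) and b. The sketch below uses base R's nls() instead of nlsLM() so that it runs without extra packages; the simulated data and the variable names x and y are hypothetical stand-ins for a wing measurement and dry mass.

```r
set.seed(7)

# Hypothetical data: y follows a power law in x with multiplicative noise
x   <- runif(200, 5, 30)
y   <- 0.02 * x^2.5 * exp(rnorm(200, sd = 0.2))
dat <- data.frame(x = x, y = y)

# Step 1: log-log regression, log(y) = log(a) + b * log(x)
ll    <- lm(log(y) ~ log(x), data = dat)
start <- list(a = exp(coef(ll)[[1]]), b = coef(ll)[[2]])

# Step 2: refine on the original scale with nonlinear least squares
m <- nls(y ~ a * x^b, data = dat, start = start)
print(coef(m))

# Residuals against fitted values help judge whether the fit is adequate
plot(fitted(m), resid(m), xlab = "Fitted", ylab = "Residual")
```

The same start list could be passed to nlsLM(), which takes its starting values in the same way.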

Function to produce a single metric to compare the shape of two distributions (predictions vs actuals)

I am assessing the accuracy of a model that predicts count data.
My actual data has quite an unusual distribution - although I have a large amount of data, the shape is unlike any standard distributions (poisson, normal, negative binomial etc.).
As part of my assessment, I want a metric for how well the distribution of the predictions matches the distribution of the actual data. I've tried using standard model performance metrics, such as MAE or RMSE, but they don't seem to capture how well the predictions match the expected distribution.
My initial idea was to split the predictions into deciles, and calculate what proportion fall in each decile. This would be a very rough indication of the underlying distribution. I would then calculate the same for my 'actuals' and sum the absolute differences between the proportions.
This works to some extent, but feels a bit clunky, and the split into deciles feels arbitrary. Is there a function in R to produce a single metric for how well two distributions match?
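A minimal base-R sketch of the decile idea described above, assuming the bins are the deciles of the actual values and both vectors are then counted in those bins. The simulated actual and pred vectors are placeholders. As one possible bin-free alternative (not something from the question), the last lines compute the maximum gap between the two empirical CDFs, which is the two-sample Kolmogorov-Smirnov distance.

```r
set.seed(99)

# Placeholder data: overdispersed "actuals" versus Poisson "predictions"
actual <- rnbinom(5000, size = 1.5, mu = 8)
pred   <- rpois(5000, lambda = 8)

# Decile-based metric: bin edges come from the actuals
breaks <- unique(quantile(actual, probs = seq(0, 1, by = 0.1)))
breaks[1] <- -Inf                      # catch values below the observed range
breaks[length(breaks)] <- Inf          # and above it

p_act  <- prop.table(table(cut(actual, breaks)))
p_pred <- prop.table(table(cut(pred, breaks)))
decile_gap <- sum(abs(p_act - p_pred)) # 0 = identical bin shares, 2 = no overlap
print(decile_gap)

# Bin-free alternative: largest vertical gap between the two ECDFs
grid    <- sort(unique(c(actual, pred)))
ks_dist <- max(abs(ecdf(actual)(grid) - ecdf(pred)(grid)))
print(ks_dist)
```

Both numbers are zero when the two samples have the same distribution over the bins or grid, and grow as the shapes diverge; neither says anything about whether individual predictions are paired with the right actuals.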

How to check and control for autocorrelation in a mixed effect model of longitudinal data?

I have behavioral data for many groups of birds over 10 days of observation. I wanted to investigate whether there is a temporal pattern in some behaviors (e.g. does mate competition increase over time?), and I was told that I had to account for autocorrelation in the data, since behavior on one day is unlikely to be independent of behavior on the previous days.
However I was wondering about two things:
Since I'm not interested in the differences in y among days but the trend of y over days, do I still need to correct for autocorrelation?
If yes, how do I control for the autocorrelation so that I'm left with only the signal (and noise of course)?
For the second question, keep in mind I will be analyzing the effect of time on behavior using mixed models in R (since there are random effects such as pseudo-replication), but I have not found any straightforward method of correcting for autocorrelation in the data when modeling the responses.
(1) Yes, you should check for/account for autocorrelation.
The first example here shows an example of estimating trends in a mixed model while accounting for autocorrelation.
You can fit these models with lme from the nlme package. Here's a mixed model without autocorrelation included:
library(nlme)
cmod_lme <- lme(GS.NEE ~ cYear,
                data = mc2, method = "REML",
                random = ~ 1 + cYear | Site)
and you can explore the autocorrelation by using plot(ACF(cmod_lme)).
(2) Add correlation to the model something like this:
cmod_lme_acor <- update(cmod_lme,
                        correlation = corAR1(form = ~ cYear | Site))
@JeffreyGirard notes that
to check the ACF after updating the model to include the correlation argument, you will need to use plot(ACF(cmod_lme_acor, resType = "normalized"))
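For a version that can be run without the mc2 data, here is a sketch on simulated data shaped like the question (several groups observed on 10 consecutive days). The data frame d, its columns, and the AR(1) coefficient of 0.6 are invented for illustration; only lme, corAR1, ACF and update from nlme are used.

```r
library(nlme)
set.seed(1)

# Simulated data: 20 groups, 10 days, AR(1) errors within each group
n_site <- 20
n_day  <- 10
d <- expand.grid(day = 1:n_day, site = factor(1:n_site))
ar_err <- unlist(lapply(seq_len(n_site), function(i) {
  as.numeric(arima.sim(list(ar = 0.6), n = n_day))
}))
d$y <- 2 + 0.3 * d$day + rep(rnorm(n_site), each = n_day) + ar_err

# Mixed model without, then with, an AR(1) correlation structure
m0 <- lme(y ~ day, random = ~ 1 | site, data = d, method = "REML")
m1 <- update(m0, correlation = corAR1(form = ~ day | site))

# Compare the two fits and inspect the estimated AR(1) parameter
print(anova(m0, m1))
phi <- coef(m1$modelStruct$corStruct, unconstrained = FALSE)
print(phi)

# Normalized residuals should show little remaining autocorrelation
plot(ACF(m1, resType = "normalized"))
```

The fixed effect for day in m1 is the trend estimate, with a standard error that takes the within-group correlation into account.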

Simulating data using existing data and probability

I have measured multiple attributes (height, species, crown width, condition etc) for about 1500 trees in a city. Using remote sensing techniques I also have the heights for the rest of the 9000 trees in the city. I want to simulate/generate/estimate the missing attributes for these unmeasured trees by using their heights.
From the measured data I can obtain the proportion of each species in the measured population (and thus a rough probability), height distributions for each species, height-crown width relationships for the species, species-condition relationships and so on. I want to use the height data for the unmeasured trees to first estimate the species and then estimate the rest of the attributes too, using probability theory. So for a height of, say, 25 m, it's more likely to be a cedar (height range 5-30 m) than a mulberry tree (height range 2-8 m), and more likely to be a cedar (50% of population) than an oak (same height range but 2% of population), and hence will have a crown width of 10 m and a health condition of 95% (based on the distributions for cedar trees in my measured data). But I also expect some of the other 25 m trees to be assigned oak, just less frequently than cedar, based on the proportion in the population.
Is there a way to do this using probability theory in R preferably utilising Bayesian or machine learning methods?
I'm not asking for someone to write the code for me - I am fairly experienced with R. I just want to be pointed in the right direction, i.e. a package that does this kind of thing neatly.
Thanks!
Because you want to predict a categorical variable, i.e. the species, you should consider using a classification tree, a method which can be found in the R packages rpart and randomForest. These models excel when you have a discrete number of categories and you need to slot your observations into those categories. I think those packages would work in your application. As a comparison, you can also look at multinomial regression (mnlogit, nnet, maxent), which can also predict categorical outcomes; unfortunately multinomial regression can get unwieldy with large numbers of outcomes and/or large datasets.
If you then want to predict the other attributes for individual trees, first fit a regression on the measured trees, with the attribute of interest (say crown width) as the response and species plus the other measured variables as predictors. Then take the species labels that you predicted and predict out-of-sample for the unmeasured trees, using those labels as predictors. That way the regression will predict the average value for that species (as a dummy variable), plus some error, while incorporating any other information you have on that out-of-sample tree.
If you want to use a Bayesian method, you could consider using a hierarchical regression to model these out-of-sample predictions. Sometimes hierarchical models do better at predicting, as they tend to be fairly conservative. Consider looking at the package rstanarm for some examples.
I suggest you look into Bayesian networks with table CPDs over your random variables. This is a generative model that can handle missing data and do inference over causal relationships between variables. The Bayesian network structure can be specified by hand or learned from data by an algorithm.
R has several implementations of Bayesian Networks with bnlearn being one of them: http://www.bnlearn.com/
Please see a tutorial on how to use it here: https://www.r-bloggers.com/bayesian-network-in-r-introduction/
For each species, the distribution of the other variables (height, width, condition) is probably a fairly simple bump. You can probably model the height and width as a joint Gaussian distribution; dunno about condition. Anyway with a joint distribution for variables other than species, you can construct a mixture distribution of all those per-species bumps, with mixing weights equal to the proportion of each species in the available data. Given the height, you can find the conditional distribution of the other variables conditional on height (and it will also be a mixture distribution). Given the conditional mixture, you can sample from it as usual: pick a bump with frequency equal to its mixing weight, and then sample from the selected bump.
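The mixture approach in the paragraph above can be sketched in base R as follows. All numbers in the sp table (species shares, means, standard deviations, height-width correlations) are hypothetical; in practice they would be estimated from the 1500 measured trees. The function simulate_tree is an illustrative helper, not part of any package.

```r
set.seed(123)

# Hypothetical per-species summaries from the measured trees
sp <- data.frame(
  species = c("cedar", "oak", "mulberry"),
  prior   = c(0.50, 0.02, 0.48),  # share of each species in the measured data
  h_mean  = c(18, 18, 5),         # height mean (m)
  h_sd    = c(6, 6, 1.5),         # height sd (m)
  w_mean  = c(8, 12, 4),          # crown width mean (m)
  w_sd    = c(2, 3, 1),           # crown width sd (m)
  rho     = c(0.7, 0.6, 0.5)      # height-width correlation
)

simulate_tree <- function(height, sp) {
  # Mixing weights given height: prior share times height density
  w    <- sp$prior * dnorm(height, sp$h_mean, sp$h_sd)
  post <- w / sum(w)
  # Pick a species ("bump") with probability equal to its weight
  k <- sample(nrow(sp), 1, prob = post)
  # Crown width from the bivariate normal, conditional on height
  cond_mean <- sp$w_mean[k] +
    sp$rho[k] * sp$w_sd[k] / sp$h_sd[k] * (height - sp$h_mean[k])
  cond_sd <- sp$w_sd[k] * sqrt(1 - sp$rho[k]^2)
  data.frame(height = height,
             species = sp$species[k],
             crown_width = rnorm(1, cond_mean, cond_sd))
}

# Simulate 1000 trees that are all 25 m tall
sims <- do.call(rbind, lapply(rep(25, 1000), simulate_tree, sp = sp))
print(table(sims$species))
```

With these made-up inputs, most 25 m trees come out as cedar and a small fraction as oak, which is the behaviour the question asks for. Condition could be added as another per-species draw once a suitable distribution for it is chosen.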
Sounds like a good problem. Good luck and have fun.

evaluate forecast by the terms of p-value and pearson correlation

I am using R to evaluate two different forecasting models. The basic idea of the evaluation is to compare the Pearson correlation and its corresponding p-value, obtained with cor.test(). The graph below shows the resulting correlation coefficients and p-values.
We suggest that the model with the lower correlation coefficient and a correspondingly lower p-value (less than 0.05) is better (or the one with a higher correlation coefficient but a fairly high p-value).
So in this case, overall, we would say that model1 is better than model2.
But the question is: is there any other specific statistical method to quantify the comparison?
Thanks a lot!
I assume you're working with time series data, since you mention a "forecast". I think what you're really looking for is backtesting of your forecast model. From Ruey S. Tsay's "An Introduction to Analysis of Financial Data with R", you might want to take a look at his backtest.R function.
backtest(m1,rt,orig,h,xre=NULL,fixed=NULL,inc.mean=TRUE)
# m1: is a time-series model object
# orig: is the starting forecast origin
# rt: the time series
# xre: the independent variables
# h: forecast horizon
# fixed: parameter constraint
# inc.mean: flag for constant term of the model.
Backtesting lets you see how well your models perform on past data, and Tsay's backtest.R reports RMSE and mean absolute error, which give you another perspective beyond correlation. A caution: depending on the size of your data and the complexity of your model, this can be a very slow test.
To compare models you'll normally look at RMSE which is essentially the standard deviation of the error of your model. Those two are directly comparable and smaller is better.
An even better alternative is to set up training, testing, and validation sets before you build your models. If you train two models on the same training / test data you can compare them against your validation set (which has never been seen by your models) to get a more accurate measurement of your model's performance measures.
One final alternative, if you have a "cost" associated with an inaccurate forecast, apply those costs to your predictions and add them up. If one model performs poorly on a more expensive segment of data, you may want to avoid using it.
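A minimal base-R sketch of comparing two forecasts by RMSE and MAE on a held-out stretch of the series. The simulated series, the 100/20 split, and the two candidate models (an AR(1) fit and a naive last-value forecast) are assumptions chosen only to make the example run; they are not the models from the question.

```r
set.seed(2024)

# Simulated series; the last 20 points are held out for evaluation
y     <- as.numeric(arima.sim(list(ar = 0.7), n = 120)) + 10
train <- y[1:100]
test  <- y[101:120]

# Model 1: AR(1) fitted on the training part
fit1 <- arima(train, order = c(1, 0, 0))
f1   <- as.numeric(predict(fit1, n.ahead = length(test))$pred)

# Model 2: naive forecast, last training value carried forward
f2 <- rep(tail(train, 1), length(test))

# Error metrics on the held-out data (smaller is better)
rmse <- function(actual, forecast) sqrt(mean((actual - forecast)^2))
mae  <- function(actual, forecast) mean(abs(actual - forecast))

scores <- data.frame(
  model = c("model1", "model2"),
  RMSE  = c(rmse(test, f1), rmse(test, f2)),
  MAE   = c(mae(test, f1), mae(test, f2))
)
print(scores)
```

If forecast errors carry different costs, the squared or absolute errors inside rmse and mae could be multiplied by a cost vector before averaging.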
As a side note, your interpretation of a p-value as "lower is better" is not quite right.
P-values address only one question: how likely are your data, assuming a true null hypothesis? They do not measure support for the alternative hypothesis.
