Regularised discriminant analysis (RDA) in R

I am trying to apply RDA to my data in R. After some research I found that there is an R package called "rda" which seems to do the job for me. However, I looked at the description of the rda function in that package and I'm a little confused now:
Usage given in R:
rda(x, y, xnew=NULL, ynew=NULL, prior=table(y)/length(y),alpha=seq(0, 0.99, len=10), delta=seq(0, 3, len=10), regularization="S", genelist=FALSE, trace=FALSE)
I'm not sure what "alpha" and "delta" stand for in this case. I was taught that in RDA there are two parameters, "lambda" and "sigma": lambda is a complexity parameter that dictates the balance between linear and quadratic discriminant analysis, and sigma is a further parameter that regularises the covariance matrix. Both of them are between 0 and 1.
But for this rda function in R, the default values of delta range from 0 to 3, which confuses me.
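For reference, the parametrisation I was taught is, I believe, Friedman's regularised discriminant analysis, in which the class covariance estimate is shrunk in two steps (writing gamma for the second parameter, which I called sigma above):
Sigma_k(lambda) = (1 - lambda) * Sigma_k + lambda * Sigma_pooled
Sigma_k(lambda, gamma) = (1 - gamma) * Sigma_k(lambda) + (gamma / p) * trace(Sigma_k(lambda)) * I
with both lambda and gamma in [0, 1]: lambda blends the class covariance with the pooled covariance (QDA towards LDA), and gamma shrinks the result towards a multiple of the identity.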
Could anyone explain this for me please? Thanks!

You can use the package klaR, which has a function rda with a parametrization of the regularization parameters similar to the one you described.
detach(package:rda)
require(klaR)
data(iris)
x <- rda(Species ~ ., data = iris, gamma = 0.05, lambda = 0.2)
predict(x, iris)
It is not a good idea to mix the two packages (there are namespace issues for some functions); it's better to detach rda if you want to use klaR (or the other way round).
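If you want to sanity-check the fit, something like the following should work (assuming, as I recall, that klaR's predict.rda returns the predicted classes in $class and the class posteriors in $posterior):
pred <- predict(x, iris)
table(predicted = pred$class, actual = iris$Species)  # resubstitution confusion matrix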

Related

depmix function to fit two state gamma distribution

I am using the depmixS4 package in R. I have data that look like a gamma distribution, and I am assuming that there are two states. I would like to fit a two-state gamma distribution to my data in R. The following is my code:
mod <- depmix(freq ~ 1, data = mod.data, nstates = 2, family = gamma()) # use gamma
fit.mod <- fit(mod)
However, I get an error that seems to be caused by not passing an argument in the family = gamma() part. It works fine if I just use family = gaussian(). Can someone help me with this problem? Thanks!
Fitting a gamma distribution can run into problems with unsuitable starting values, so be sure to supply suitable ones when using the gamma family.
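A minimal sketch of one way forward, assuming the immediate error comes from gamma() being base R's gamma function (which needs an argument) rather than the Gamma() family constructor that depmixS4 expects; the comments about starting values are the part that matters, and any concrete values would have to be adapted to your data:
library(depmixS4)
## Gamma() (capital G) is the GLM-style family constructor; gamma() is the
## mathematical gamma function and errors when called without an argument.
mod <- depmix(freq ~ 1, data = mod.data, nstates = 2, family = Gamma())
## If EM then stalls or returns degenerate estimates, supply rough starting
## values via the respstart / trstart / instart arguments (see ?depmix),
## e.g. based on quantiles of freq, before calling fit().
fit.mod <- fit(mod)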

Windmeijer correction for Arellano-Bond Test in R package plm

Let's suppose I have a simple AR(1) panel data model that I estimate with the pgmm command in R (data available below):
library(plm)
library(Ecdat)
data(Airline)
reg.gmm = pgmm(output ~ lag(output, 1)| lag(output, 2:99), data= Airline, Robust=TRUE)
With Robust=TRUE I use the Windmeijer (2005) correction to the variance-covariance matrix. Now I want to test for second-order autocorrelation using Arellano-Bond:
mtest(reg.gmm, order = 2, vcov = reg.gmm$vcov)
Am I using the Windmeijer-corrected variance-covariance matrix, as I intend to? If not, how can I implement it? The documentation is quite tight-lipped on that topic. Thanks for any help in advance!
Unfortunately, the example with the Airline data throws an error that seems to be related to too many instruments in your GMM formula. If you are using different data that does not exhibit this problem, you can get robust standard errors by passing vcovHC to mtest. In your example, the last call would then be:
mtest(reg.gmm, order = 2, vcov = vcovHC)

How to deal with spatially autocorrelated residuals in GLMM

I am conducting an analysis of where on the landscape a predator encounters potential prey. My response data is binary with an Encounter location = 1 and a Random location = 0 and my independent variables are continuous but have been rescaled.
I originally used a GLM structure
glm_global <- glm(Encounter ~ Dist_water_cs+coverMN_cs+I(coverMN_cs^2)+
Prey_bio_stand_cs+Prey_freq_stand_cs+Dist_centre_cs,
data=Data_scaled, family=binomial)
but realized that this failed to account for potential spatial autocorrelation in the data (a spline correlogram showed high residual correlation up to ~1000 m).
library(ncf)  # spline.correlog() comes from the ncf package
Correlog_glm_global <- spline.correlog(x = Data_scaled[, "Y"],
                                       y = Data_scaled[, "X"],
                                       z = residuals(glm_global, type = "pearson"),
                                       xmax = 1000)
I attempted to account for this by implementing a GLMM (in lme4) with the predator group as the random effect.
glmm_global <- glmer(Encounter ~ Dist_water_cs+coverMN_cs+I(coverMN_cs^2)+
Prey_bio_stand_cs+Prey_freq_stand_cs+Dist_centre_cs+(1|Group),
data=Data_scaled, family=binomial)
When comparing the AIC of the global GLMM (1144.7) to the global GLM (1149.2) I get a delta-AIC value > 2, which suggests that the GLMM fits the data better. However, I am still getting essentially the same correlation in the residuals, as shown by the spline correlogram for the GLMM model.
Correlog_glmm_global <- spline.correlog(x = Data_scaled[, "Y"],
                                        y = Data_scaled[, "X"],
                                        z = residuals(glmm_global, type = "pearson"),
                                        xmax = 10000)
I also tried explicitly including the Lat*Long of all the locations as an independent variable but results are the same.
After reading up on options, I tried running generalized estimating equations (GEEs) in geepack, thinking this would give me more flexibility in explicitly defining the correlation structure (as in GLS models for normally distributed response data) instead of being limited to compound symmetry (which is what we get with a GLMM). However, I realized that my data still demanded compound symmetry (or "exchangeable" in geepack), since I didn't have a temporal sequence in the data. When I ran the global model
gee_global <- geeglm(Encounter ~ Dist_water_cs+coverMN_cs+I(coverMN_cs^2)+
Prey_bio_stand_cs+Prey_freq_stand_cs+Dist_centre_cs,
id=Pride, corstr="exchangeable", data=Data_scaled, family=binomial)
(using scaled or unscaled data made no difference so this is with scaled data for consistency)
suddenly none of my covariates were significant. However, being a novice with GEE modelling I don’t know a) if this is a valid approach for this data or b) whether this has even accounted for the residual autocorrelation that has been evident throughout.
I would be most appreciative for some constructive feedback as to 1) which direction to go once I realized that the GLMM model (with predator group as a random effect) still showed spatially autocorrelated Pearson residuals (up to ~1000m), 2) if indeed GEE models make sense at this point and 3) if I have missed something in my GEE modelling. Many thanks.
Taking the spatial autocorrelation into account in your model can be done in many ways. I will restrict my response to the main R packages that deal with random effects.
First, you could go with the package nlme and specify a correlation structure for your residuals (many are available: corGaus, corLin, corSpher, ...). You should try several of them and keep the best model. In this case the spatial autocorrelation is treated as continuous and approximated by a global function.
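A rough sketch of that route for a binomial response like yours: glmmPQL from MASS fits the model through nlme and accepts its correlation structures (corExp is just one example here, and I'm assuming the coordinates are stored as X and Y in Data_scaled):
library(MASS)   # glmmPQL
library(nlme)   # correlation structures: corExp, corGaus, corSpher, ...
pql_global <- glmmPQL(Encounter ~ Dist_water_cs + coverMN_cs + I(coverMN_cs^2) +
                        Prey_bio_stand_cs + Prey_freq_stand_cs + Dist_centre_cs,
                      random = ~ 1 | Group,
                      correlation = corExp(form = ~ X + Y),  # spatial correlation of the residuals
                      family = binomial, data = Data_scaled)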
Second, you could go with the package mgcv and add a bivariate spline (over the spatial coordinates) to your model. This way you can capture a spatial pattern and even map it. Strictly speaking, this method doesn't model the spatial autocorrelation directly, but it may solve the problem. If space is discrete in your case, you could go with a Markov random field smooth instead. This website is very helpful for finding examples: https://www.fromthebottomoftheheap.net
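A minimal sketch of the mgcv route, assuming Group is a factor and X and Y are the coordinates in Data_scaled (the basis choices are only examples):
library(mgcv)
gam_global <- gam(Encounter ~ Dist_water_cs + coverMN_cs + I(coverMN_cs^2) +
                    Prey_bio_stand_cs + Prey_freq_stand_cs + Dist_centre_cs +
                    s(X, Y, bs = "gp") +   # bivariate spatial smooth ("tp" works too)
                    s(Group, bs = "re"),   # predator group as a random effect
                  data = Data_scaled, family = binomial, method = "REML")
## then re-check the spline correlogram on residuals(gam_global, type = "pearson")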
Third, you could go with the package brms, which allows you to specify very complex models with other correlation structures for the residuals (CAR and SAR). The package uses a Bayesian approach.
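For example, a sketch of an intrinsic CAR term in brms, assuming you have discretised space into units recorded in a (hypothetical) factor site and built a matching adjacency matrix W; the car()/data2 syntax is from recent brms versions, so check the documentation for yours:
library(brms)
brm_global <- brm(Encounter ~ Dist_water_cs + coverMN_cs + I(coverMN_cs^2) +
                    Prey_bio_stand_cs + Prey_freq_stand_cs + Dist_centre_cs +
                    (1 | Group) + car(W, gr = site, type = "icar"),
                  data = Data_scaled, data2 = list(W = W),
                  family = bernoulli())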
I hope this helps. Good luck!

polyval: from Matlab to R

I would like to use in R the following expression given in Matlab:
y1=polyval(p,end_v);
where p in Matlab is:
p = polyfit(Nodes_2,CInt_interp,3);
Right now in R I have:
p <- lm(Spectra_BIR$y ~ poly(Spectra_BIR$x,3, raw=TRUE))
But I do not know which command in R corresponds to the polyval from Matlab.
Many thanks!
R:
library(polynom)
predict(polynomial(1:3), c(5,7,9))
[1] 86 162 262
MATLAB (official example):
p = [3 2 1];
polyval(p,[5 7 9])
ans = 86 162 262
There is no exact equivalent in R of polyfit and polyval, as these MATLAB routines are so primitive compared with R's statistical toolbox.
In MATLAB, polyfit mainly returns polynomial regression coefficients (the covariance can be obtained if required, though). polyval takes the regression coefficients p and a set of new x values at which to evaluate the fitted polynomial.
In R, the fashion is: use lm to obtain a regression model (much broader, not restricted to polynomial regression); use summary.lm for the model summary, for example to obtain the covariance; use predict.lm for prediction.
So here is the way to go in R:
## don't use `$` in formula; use `data` argument
fit <- lm(y ~ poly(x,3, raw=TRUE), data = Spectra_BIR)
Note that fit contains not only the coefficients but also other components needed for prediction. If you want to extract the coefficients, do coef(fit), or unname(coef(fit)) if you don't want the coefficient names to be shown.
Now, to predict, we do:
x.new <- rnorm(5) ## some random new `x`
## note, `predict.lm` takes a "lm" model, not coefficients
predict.lm(fit, newdata = data.frame(x = x.new))
predict.lm is much more powerful than polyval. It can also return confidence intervals. Have a read of ?predict.lm.
There are a few subtle issues with the use of predict.lm. There have been countless questions and answers about them, and these are the canonical questions to which I often close such questions as duplicates:
Getting Warning: “ 'newdata' had 1 row but variables found have 32 rows” on predict.lm in R
Predict() - Maybe I'm not understanding it
So make sure you get into the good habit of using lm and predict at an early stage of learning R.
Extra
It is also not difficult to construct something equivalent to polyval in R. The function g in my answer Function for polynomials of arbitrary order does this; by setting nderiv we can also get derivatives of the polynomial.
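As a small illustration of that point, here is a minimal base-R stand-in (the name polyval_r is made up; it assumes MATLAB's convention of ordering coefficients from the highest power down to the constant):
## evaluate the polynomial by Horner's scheme: ((p1 * x + p2) * x + p3) * x + ...
polyval_r <- function(p, x) Reduce(function(acc, coef) acc * x + coef, p)
polyval_r(c(3, 2, 1), c(5, 7, 9))
[1] 86 162 262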

Variable selection methods

I have been doing variable selection for a modeling problem.
I have used trial and error for the selection: adding or removing a variable and keeping the change if the error decreased. However, as the number of variables grows into the hundreds, this manual variable selection can no longer be performed, since the model takes half an hour to compute, which renders the task impossible.
Would you happen to know of any packages other than regsubsets from the leaps package? (When tested with the same trial-and-error variables it produced a higher error: it did not include some variables which were linearly dependent, thereby excluding some valuable ones.)
You need a better (i.e. not flawed) approach to model selection. There are plenty of options, but one that should be easy to adapt to your situation would be using some form of regularization, such as the Lasso or the elastic net. These apply shrinkage to the sizes of the coefficients; if a coefficient is shrunk from its least squares solution to zero, that variable is removed from the model. The resulting model coefficients are slightly biased but they have lower variance than the selected OLS terms.
Take a look at the lars, glmnet, and penalized packages.
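A minimal glmnet sketch of the lasso idea with made-up data (x stands in for your numeric predictor matrix and y for your response); cv.glmnet chooses the penalty by cross-validation, and variables whose coefficients are shrunk to zero drop out of the model:
library(glmnet)
set.seed(1)
x <- matrix(rnorm(100 * 20), 100, 20)                # stand-in predictor matrix
y <- drop(x[, 1:3] %*% c(2, -1, 0.5)) + rnorm(100)   # response driven by 3 of the 20 variables
cvfit <- cv.glmnet(x, y, alpha = 1)                  # alpha = 1 is the lasso penalty
coef(cvfit, s = "lambda.1se")                        # zeroed coefficients = dropped variables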
Try using the stepAIC function of the MASS package.
Here is a really minimal example:
library(MASS)
data(swiss)
str(swiss)
lm <- lm(Fertility ~ ., data = swiss)
lm$coefficients
## (Intercept) Agriculture Examination Education Catholic
## 66.9151817 -0.1721140 -0.2580082 -0.8709401 0.1041153
## Infant.Mortality
## 1.0770481
st1 <- stepAIC(lm, direction = "both")
st2 <- stepAIC(lm, direction = "forward")
st3 <- stepAIC(lm, direction = "backward")
summary(st1)
summary(st2)
summary(st3)
You should try the three directions and check which model works better with your test data.
Read ?stepAIC and take a look at the examples.
EDIT
It's true that stepwise regression isn't the greatest method. As mentioned in GavinSimpson's answer, lasso regression is a better and much more efficient method. It's much faster than stepwise regression and will work with large datasets.
Check out the glmnet package vignette:
http://www.stanford.edu/~hastie/glmnet/glmnet_alpha.html
