I have been trying to convert a repeated measures model from SAS to R, since a collaborator will do the analysis but does not have SAS. We are dealing with 4 groups, 8 to 10 animals per group, and 5 time points for each animal. The mock data file is available here https://drive.google.com/file/d/0B-WfycVUQyhaVGU2MUpuQkg4Mk0/edit?usp=sharing as an RData file and here https://drive.google.com/file/d/0B-WfycVUQyhaR0JtZ0V4VjRkTk0/edit?usp=sharing as an Excel file.
The original SAS code (1) is :
proc mixed data=essai.data_test method=reml;
class group time mice;
model param = group time group*time / ddfm=kr;
repeated time / type=un subject=mice group=group;
run;
Which gives :
Type 3 Tests of Fixed Effects
Effect       Num DF   Den DF   F Value   Pr > F
group             3     15.8      1.58   0.2344
time              4     25.2     10.11   <.0001
group*time       12     13.6      1.66   0.1852
I know that R does not handle degrees of freedom in the same way as SAS does, so I am first trying to obtain results similar to (2) :
proc mixed data=essai.data_test method=reml;
class group time mice;
model param = group time group*time;
repeated time / type=un subject=mice group=group;
run;
I have found some hints here Converting Repeated Measures mixed model formula from SAS to R and when specifying a compound symmetry correlation matrix this works perfectly. However, I am not able to obtain the same thing for a general correlation matrix.
With (2) in SAS, I obtain the following results :
Type 3 Tests of Fixed Effects
Effect       Num DF   Den DF   F Value   Pr > F
group             3       32      1.71   0.1852
time              4      128     11.21   <.0001
group*time       12      128      2.73   0.0026
Using the following R code :
options(contrasts=c('contr.sum','contr.poly'))
mod <- lme(param~group*time, random=list(mice=pdDiag(form=~group-1)),
correlation = corSymm(form=~1|mice),
weights = varIdent(form=~1|group),
na.action = na.exclude, data = data, method = "REML")
anova(mod,type="marginal")
I obtain:
numDF denDF F-value p-value
(Intercept) 1 128 1373.8471 <.0001
group 3 32 1.5571 0.2189
time 4 128 10.0628 <.0001
group:time 12 128 1.6416 0.0880
The degrees of freedom are similar, but not the tests on fixed effects and I don’t know where this comes from. Would anyone have any idea of what I am doing wrong here?
Your R code differs from the SAS code in multiple ways. Some of them are fixable, but I was not able to fix all the aspects to reproduce the SAS analysis.
The R code fits a mixed effects model with a random mice effect, while the SAS code fits a general linear model with correlated residuals (a generalized least squares fit): there are no random effects because there is no RANDOM statement. The R equivalent is the gls function from the same nlme package.
In the R code all observations within the same group have the same variance, while in the SAS code you have an unstructured covariance matrix, that is each time-point within each group has its own variance. You can achieve the same effect by using weights=varIdent(form=~1|group*time).
In the R code the correlation matrix is the same for every mouse regardless of group. In the SAS code each group has its own correlation matrix. This is the part that I don't know how to reproduce in R.
I have to note that the R model seems to be more meaningful - SAS estimates way too many variances and correlations (which, by the way, you can see meaningfully arranged using the R and RCORR options to the repeated statement).
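The two fixable points above (gls instead of lme, and one variance per group-by-time cell) could be combined roughly as in the following sketch. The data frame d is simulated here purely for illustration and is smaller than the real design (2 groups, 20 mice per group, 3 time points). The fit still uses a single correlation matrix shared by all groups, so it is not expected to reproduce the SAS group= results.

```r
library(nlme)

# Sum-to-zero contrasts, as in the question, so marginal tests resemble Type 3 tests
options(contrasts = c("contr.sum", "contr.poly"))

set.seed(1)
# Simulated stand-in data (an assumption, not the question's data set);
# rows are ordered by time within mouse, which corSymm relies on
d <- expand.grid(time = factor(1:3), mice = factor(1:40))
d$group <- factor(ifelse(as.integer(d$mice) <= 20, "A", "B"))
mouse_effect <- rnorm(40)[as.integer(d$mice)]  # induces within-mouse correlation
d$param <- 10 + mouse_effect + rnorm(nrow(d))

# No random effects: correlated residuals within mouse (unstructured correlation)
# and a separate residual variance for every group-by-time combination
mod_gls <- gls(param ~ group * time,
               correlation = corSymm(form = ~ 1 | mice),
               weights = varIdent(form = ~ 1 | group * time),
               data = d, method = "REML")

anova(mod_gls, type = "marginal")
```

On the real data the same call would use data = data and the five observed time points; with that many variance and correlation parameters, convergence is not guaranteed.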
"In the R code the correlation matrix is the same for every mouse regardless of group. In the SAS code each group has its own correlation matrix. This is the part that I don't know how to reproduce in R." - Try: correlation=corSymm(~1|group*time)
I am working with count data (available here) that are zero-inflated and overdispersed and have random effects. The package best suited to this sort of data is glmmTMB (details here and troubleshooting here).
Before working with the data, I inspected it for normality (it is zero-inflated), homogeneity of variance, correlations, and outliers. The data had two outliers, which I removed from the dataset linked above. There are 351 observations from 18 locations (prop_id).
The data looks like this:
euc0 ea_grass ep_grass np_grass np_other_grass month year precip season prop_id quad
3 5.7 0.0 16.7 4.0 7 2006 526 Winter Barlow 1
0 6.7 0.0 28.3 0.0 7 2006 525 Winter Barlow 2
0 2.3 0.0 3.3 0.0 7 2006 524 Winter Barlow 3
0 1.7 0.0 13.3 0.0 7 2006 845 Winter Blaber 4
0 5.7 0.0 45.0 0.0 7 2006 817 Winter Blaber 5
0 11.7 1.7 46.7 0.0 7 2006 607 Winter DClark 3
The response variable is euc0 and the random effects are prop_id and quad_id. The rest of the variables are fixed effects (all representing the percent cover of different plant species).
The model I want to run:
library(glmmTMB)
seed0<-glmmTMB(euc0 ~ ea_grass + ep_grass + np_grass + np_other_grass + month + year*precip + season*precip + (1|prop_id) + (1|quad), data = euc, family=poisson(link=identity))
fit_zinbinom <- update(seed0,family=nbinom2) #allow variance increases quadratically
The error I get after running the seed0 code is:
Error in optimHess(par.fixed, obj$fn, obj$gr) : gradient in optim
evaluated to length 1 not 15 In addition: There were 50 or more
warnings (use warnings() to see the first 50)
warnings() gives:
1. In (function (start, objective, gradient = NULL, hessian = NULL, ... :
NA/NaN function evaluation
I also normally mean center and standardize my numerical variables, but this only removes the first error and keeps the NA/NaN error. I tried adding a glmmTMBControl statement like this OP, but it just opened a whole new world of errors.
How can I fix this? What am I doing wrong?
A detailed explanation would be appreciated so that I can learn how to troubleshoot this better myself in the future. Alternatively, I am open to a MCMCglmm solution as that function can also deal with this sort of data (despite taking longer to run).
An incomplete answer ...
identity-link models for limited-domain response distributions (e.g. Gamma or Poisson, where negative values are impossible) are computationally problematic; in my opinion they're often conceptually problematic as well, although there are some reasonable arguments in their favor. Do you have a good reason to do this?
This is a pretty small data set for the model you're trying to fit: 13 fixed-effect predictors and 2 random-effect predictors. The rule of thumb would be that you want about 10-20 times that many observations: that seems to fit in OK with your 345 or so observations, but ... only 40 of your observations are non-zero! That means your 'effective' number of observations/amount of information will be much smaller (see Frank Harrell's Regression Modeling Strategies for more discussion of this point).
That said, let me run through some of the things I tried and where I ended up.
GGally::ggpairs(euc, columns=2:10) doesn't detect anything obviously terrible about the data (I did throw out the data point with euc0==78)
In order to try to make the identity-link model work I added some code in glmmTMB. You should be able to install via remotes::install_github("glmmTMB/glmmTMB/glmmTMB#clamp") (note you will need compilers etc. installed to install this). This version takes negative predicted values and forces them to be non-negative, while adding a corresponding penalty to the negative log-likelihood.
Using the new version of glmmTMB I don't get an error, but I do get these warnings:
Warning messages:
1: In fitTMB(TMBStruc) :
Model convergence problem; non-positive-definite Hessian matrix. See vignette('troubleshooting')
2: In fitTMB(TMBStruc) :
Model convergence problem; false convergence (8). See vignette('troubleshooting')
The Hessian (second-derivative) matrix being non-positive-definite means there are some (still hard-to-troubleshoot) problems. heatmap(vcov(f2)$cond,Rowv=NA,Colv=NA) lets me look at the covariance matrix. (I also like corrplot::corrplot.mixed(cov2cor(vcov(f2)$cond),"ellipse","number"), but that doesn't work when vcov(.)$cond is non-positive definite. In a pinch you can use sfsmisc::posdefify() to force it to be positive definite ...)
Tried scaling:
eucsc <- dplyr::mutate_at(euc1,dplyr::vars(c(ea_grass:precip)), ~c(scale(.)))
This will help some - right now we're still doing a few silly things like treating year as a numeric variable without centering it (so the 'intercept' of the model is at year 0 of the Gregorian calendar ...)
But that still doesn't fix the problem.
Looking more closely at the ggpairs plot, it looks like season and year are confounded: with(eucsc,table(season,year)) shows that observations occur in Spring and Winter in one year and Autumn in the other year. season and month are also confounded: if we know the month, then we automatically know the season.
At this point I decided to give up on the identity link and see what happened. update(<previous_model>, family=poisson) (i.e. using a Poisson with a standard log link) worked! So did using family=nbinom2, which was much better.
I looked at the results and discovered that the CIs for the precip X season coefficients were crazy, so dropped the interaction term (update(f2S_noyr_logNB, . ~ . - precip:season)) at which point the results look sensible.
A few final notes:
the variance associated with quadrat is effectively zero
I don't think you necessarily need zero-inflation; low means and overdispersion (i.e. family=nbinom2) are probably sufficient.
the distribution of the residuals looks OK, but there still seems to be some model mis-fit (library(DHARMa); plot(simulateResiduals(f2S_noyr_logNB2))). I would spend some time plotting residuals and predicted values against various combinations of predictors to see if you can localize the problem.
PS A quicker way to see that there's something wrong with the fixed effects (multicollinearity):
X <- model.matrix(~ ea_grass + ep_grass +
np_grass + np_other_grass + month +
year*precip + season*precip,
data=euc)
ncol(X) ## 13
Matrix::rankMatrix(X) ## 11
lme4 has tests like this, and machinery for automatically dropping aliased columns, but they aren't implemented in glmmTMB at present.
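In the meantime, a pivoted QR decomposition in base R can point to which columns are aliased. The sketch below uses a small made-up data frame (toy, with season fully determined by month, mimicking the confounding described above), not the euc data.

```r
set.seed(1)
# Hypothetical toy data: season is a deterministic function of month
toy <- data.frame(month = factor(rep(c(3, 7, 8, 10), each = 5)))
toy$season <- factor(c("Autumn", "Winter", "Winter", "Spring")[as.integer(toy$month)])
toy$precip <- rnorm(nrow(toy))

X <- model.matrix(~ month + season + precip, data = toy)

# qr() pivots linearly dependent columns to the end;
# everything beyond the numerical rank is aliased
qrX <- qr(X)
aliased <- colnames(X)[qrX$pivot[-seq_len(qrX$rank)]]

ncol(X)    # number of columns in the design matrix
qrX$rank   # numerical rank; smaller than ncol(X) when columns are aliased
aliased    # names of the columns that could be dropped
```

Dropping the terms behind the aliased columns from the formula (here, season) before calling glmmTMB removes the rank deficiency.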
I am attempting to perform a linear discriminant analysis on geometric morphometric data. Because geometric morphometric data typically produce large numbers of variables and discriminant analyses require more data points than variables to accurately classify specimens, a common solution in the literature is to perform a principal component analysis and then use, as input for the LDA, the number of PCs that returns the highest reclassification rate while representing less than 99% of the cumulative variance.
Right now the way I am doing this is running the LDA in R (using the functions in the Morpho and MASS packages) under every possible number of PCs used and noting the classification accuracy by hand until I found the lowest number of PCs that returned the highest accuracy, but this is highly inefficient.
I was wondering if there was any way to write a function that would run an LDA for all possible numbers of the first N PCs (up to a certain, user defined level representing 99% of the cumulative variance) and return the percent reclassification rate for each level, producing something like the following:
PCs percent_accuracy
20 72.2
19 76.3
18 77.4
17 80.1
16 75.4
15 50.7
... ...
1 20.2
So row 1 would be the reclassification rate when the first 20 PCs are used, row 2 is the rate when the first 19 PCs are used, and so on and so forth.
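One way to automate this, as a minimal sketch: the function name lda_accuracy_by_pcs and its arguments are made up for illustration, iris stands in for the morphometric data, and leave-one-out classification from MASS::lda (CV = TRUE) is used as the reclassification rate rather than anything from Morpho.

```r
library(MASS)  # for lda()

# Hypothetical helper: leave-one-out LDA accuracy for the first k PCs,
# for every k up to the last PC below `max_var` cumulative variance
lda_accuracy_by_pcs <- function(x, groups, max_var = 0.99) {
  pca <- prcomp(x)
  cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
  n_max <- max(1, sum(cum_var < max_var))  # always keep at least one PC
  acc <- sapply(n_max:1, function(k) {
    fit <- lda(pca$x[, 1:k, drop = FALSE], grouping = groups, CV = TRUE)
    100 * mean(fit$class == groups)        # percent correctly reclassified
  })
  data.frame(PCs = n_max:1, percent_accuracy = round(acc, 1))
}

res <- lda_accuracy_by_pcs(iris[, 1:4], iris$Species)
res

# Lowest number of PCs that reaches the highest accuracy
best <- min(res$PCs[res$percent_accuracy == max(res$percent_accuracy)])
best
```

Note that the PCA here is computed once on all specimens, so the accuracies are somewhat optimistic; redoing the PCA inside each cross-validation fold would be stricter.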
I'm unable to find a way of performing cross validation on a regression random forest model that I'm trying to produce.
So I have a dataset containing 1664 explanatory variables (different chemical properties), with one response variable (retention time). I'm trying to produce a regression random forest model in order to be able to predict the chemical properties of something given its retention time.
ID RT (seconds) 1_MW 2_AMW 3_Sv 4_Se
4281 38 145.29 5.01 14.76 28.37
4952 40 132.19 6.29 11 21.28
4823 41 176.21 7.34 12.9 24.92
3840 41 174.24 6.7 13.99 26.48
3665 42 240.34 9.24 15.2 27.08
3591 42 161.23 6.2 13.71 26.27
3659 42 146.22 6.09 12.6 24.16
This is an example of the table that I have. I want to basically plot RT against 1_MW, etc (up to 1664 variables), so I can find which of these variables are of importance and which aren't.
I do:-
r = randomForest(RT..seconds.~., data = cadets, importance =TRUE, do.trace = 100)
varImpPlot(r)
which tells me which variables are of importance and what not, which is great. However, I want to be able to partition my dataset so that I can perform cross validation on it. I found an online tutorial that explained how to do it, but for a classification model rather than regression.
I understand you do:-
k = 10
n = floor(nrow(cadets)/k)
i = 1
s1 = ((i-1) * n+1)
s2 = (i * n)
subset = s1:s2
to define how many folds you want, the size of each fold, and the start and end index of the subset. However, I don't know what to do from here on. I was told to loop through the folds, but I honestly have no idea how to do this. Nor do I know how to then plot the validation set and the test set on the same graph to depict the level of accuracy/error.
If you could please help me with this I'd be ever so grateful, thanks!
From the source:
The out-of-bag (oob) error estimate
In random forests, there is no need for cross-validation or a separate
test set to get an unbiased estimate of the test set error. It is
estimated internally , during the run...
In particular, predict.randomForest returns the out-of-bag prediction if newdata is not given.
As topchef pointed out, cross-validation isn't necessary as a guard against over-fitting. This is a nice feature of the random forest algorithm.
It sounds like your goal is feature selection, and cross-validation is still useful for this purpose. Take a look at the rfcv() function within the randomForest package. The documentation specifies a data frame and a vector as input, so I'll start by creating those from your data.
set.seed(42)
x <- cadets
x$RT..seconds. <- NULL
y <- cadets$RT..seconds.
rf.cv <- rfcv(x, y, cv.fold=10)
with(rf.cv, plot(n.var, error.cv))
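If you do want the manual loop described in the question, the general shape is sketched below. To keep it runnable without extra packages, lm on the built-in mtcars data stands in for randomForest on cadets; both substitutions are assumptions for illustration, and the commented line shows where the random forest call would go. Folds are assigned at random rather than as contiguous blocks s1:s2, since the rows in the question look sorted by retention time.

```r
set.seed(42)
dat <- mtcars                        # stand-in for `cadets`
k <- 10

# Random fold label (1..k) for every row
fold <- sample(rep(seq_len(k), length.out = nrow(dat)))

pred <- numeric(nrow(dat))           # held-out prediction for each row
for (i in seq_len(k)) {
  test <- fold == i
  # With randomForest this line would be, e.g.:
  #   fit <- randomForest(RT..seconds. ~ ., data = dat[!test, ])
  fit <- lm(mpg ~ wt + hp, data = dat[!test, ])
  pred[test] <- predict(fit, newdata = dat[test, ])
}

# Cross-validated root mean squared error
cv_rmse <- sqrt(mean((dat$mpg - pred)^2))
cv_rmse

# Observed vs. cross-validated predictions; points near the line are well predicted
plot(dat$mpg, pred, xlab = "Observed", ylab = "Cross-validated prediction")
abline(0, 1)
```

Every row is predicted exactly once by a model that never saw it, so the plot and cv_rmse summarise out-of-sample error.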
I am trying to create a script in R for automatically assessing the predictive power of various possible linear models. As a quality indicator for each model I use the overall mean square from a cross-validation, computed with the function CVlm from package DAAG. My question is: how can I retrieve the overall mean square reported by CVlm in an automated way (without having to read the textual output of CVlm)?
For example the following code from http://maths-people.anu.edu.au/~johnm/r-book/3edn/scripts/reg1.R
houseprices.lm <- lm(sale.price ~ area, data=houseprices)
CVlm(houseprices, houseprices.lm, plotit=TRUE)
has an output in the form
fold 1
Observations in test set: ...
fold 2
Observations in test set: ...
Overall ms
2023
How can I access/store the value of ms (2023) of each run?
You have to store the result of CVlm in a variable and access the ms attribute:
houseprices.lm <- lm(sale.price ~ area, data=houseprices)
cv <- CVlm(houseprices, houseprices.lm, plotit=TRUE)
attr(cv, "ms")
# [1] 3934
I keep getting an error like this:
Error in `coef<-.corARMA`(`*tmp*`, value = c(18.3113452983211, -1.56626248550284, :
Coefficient matrix not invertible
or like this:
Error in gls(archlogfl ~ co2, correlation = corARMA(p = 3)) : false convergence (8)
with the gls function in nlme.
The former example was with the model gls(archlogflfornma~nma,correlation=corARMA(p=3)) where archlogflfornma is
[1] 2.611840 2.618454 2.503317 2.305531 2.180464 2.185764 2.221760 2.211320
and nma is
[1] 138 139 142 148 150 134 137 135
You can see the model in the latter, and archlogfl is
[1] 2.611840 2.618454 2.503317 2.305531 2.180464 2.185764 2.221760 2.211320
[9] 2.105556 2.176747
and co2 is
[1] 597.5778 917.9308 1101.0430 679.7803 886.5347 597.0668 873.4995
[8] 816.3483 1427.0190 423.8917
I have R 2.13.1.
Roland
@GavinSimpson's comment above, that trying to estimate a model with 5 parameters from 10 observations is very hopeful, is correct. The general rule of thumb is that you should have at least 10 times as many data points as parameters, and that's for standard fixed effect/regression parameters. (Generally variance structure parameters such as AR parameters are even a bit harder/require a bit more data than regression parameters to estimate.)
That said, in a perfect world one could hope to estimate parameters even from overfitted models. Let's just explore what happens though:
archlogfl <- c(2.611840,2.618454,2.503317,
2.305531,2.180464,2.185764,2.221760,2.211320,
2.105556,2.176747)
co2 <- c(597.5778,917.9308,1101.0430,679.7803,
886.5347,597.0668,873.4995,
816.3483,1427.0190,423.8917)
Take a look at the data,
plot(archlogfl~co2,type="b")
library(nlme)
g0 <- gls(archlogfl~co2)
plot(ACF(g0),alpha=0.05)
This is the autocorrelation function of the residuals, with 95% confidence intervals (note that these are pointwise confidence intervals, so we would expect about 1 in 20 points to fall outside these boundaries in any case).
So there is indeed some (graphical) evidence for some autocorrelation here. We'll fit an AR(1) model, with verbose output (to understand the scale on which these parameters are estimated, you'll probably have to dig around in Pinheiro and Bates 2000: what's presented in the printout are the unconstrained values of the parameters, what's printed in the summaries are the constrained values ...):
g1 <- gls(archlogfl ~co2,correlation=corARMA(p=1),
control=glsControl(msVerbose=TRUE))
Let's see what's left after we fit AR1:
plot(ACF(g1,resType="normalized"),alpha=0.05)
Now fit AR(2):
g2 <- gls(archlogfl ~co2,correlation=corARMA(p=2),
control=glsControl(msVerbose=TRUE))
plot(ACF(g2,resType="normalized"),alpha=0.05)
As you correctly state, trying to go to AR(3) fails.
gls(archlogfl ~co2,correlation=corARMA(p=3))
You can play with tolerances, starting conditions, etc., but I don't think it's going to help much.
gls(archlogfl ~co2,correlation=corARMA(p=3,value=c(0.9,-0.5,0)),
control=glsControl(tolerance=1e-4,msVerbose=TRUE),verbose=TRUE)
If I were absolutely desperate to get these values I would code my own generalized least-squares function, constructing the AR(3) correlation matrix from scratch, and try to run it with some slightly more robust optimizer, but I would really have to have a good reason to work that hard ...
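For what it's worth, a rough sketch of that do-it-yourself route is below. All function names (pacf_to_ar, ar_corr, gls_ar_nll) are made up for illustration. It assumes maximum likelihood rather than REML (so it will not match gls defaults), keeps the AR coefficients stationary by optimising over transformed partial autocorrelations, and uses Nelder-Mead from optim. With 10 observations the estimates should not be trusted.

```r
archlogfl <- c(2.611840, 2.618454, 2.503317, 2.305531, 2.180464,
               2.185764, 2.221760, 2.211320, 2.105556, 2.176747)
co2 <- c(597.5778, 917.9308, 1101.0430, 679.7803, 886.5347,
         597.0668, 873.4995, 816.3483, 1427.0190, 423.8917)

# Durbin-Levinson recursion: partial autocorrelations in (-1, 1) -> AR coefficients
pacf_to_ar <- function(pac) {
  phi <- numeric(0)
  for (k in seq_along(pac)) phi <- c(phi - pac[k] * rev(phi), pac[k])
  phi
}

# n x n correlation matrix of a stationary AR(p) process
ar_corr <- function(phi, n) {
  toeplitz(as.numeric(ARMAacf(ar = phi, lag.max = n - 1)))
}

# Profiled negative log-likelihood (up to a constant) of y = X beta + e,
# e ~ N(0, sigma^2 R(phi)), with beta and sigma^2 profiled out
gls_ar_nll <- function(theta, y, X) {
  n <- length(y)
  R <- ar_corr(pacf_to_ar(tanh(theta)), n)
  U <- tryCatch(chol(R), error = function(e) NULL)  # R = t(U) %*% U
  if (is.null(U)) return(1e10)                      # numerically singular R
  ys <- backsolve(U, y, transpose = TRUE)           # whitened response
  Xs <- backsolve(U, X, transpose = TRUE)           # whitened design matrix
  rss <- sum(qr.resid(qr(Xs), ys)^2)
  n / 2 * log(rss / n) + sum(log(diag(U)))
}

X <- cbind(1, co2)
fit <- optim(c(0, 0, 0), gls_ar_nll, y = archlogfl, X = X)
phi_hat <- pacf_to_ar(tanh(fit$par))  # estimated AR(3) coefficients
phi_hat
```

Starting at theta = 0 corresponds to ordinary least squares (no autocorrelation), so comparing fit$value with gls_ar_nll(c(0, 0, 0), archlogfl, X) gives a rough idea of how much the AR terms improve the likelihood.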
Another alternative would be to use arima to fit to the residuals from a gls or lm fit without autocorrelation: arima(residuals(g0),c(3,0,0)). (You can see that if you do this with arima(residuals(g0),c(2,0,0)) the answers are close to (but not quite equal to) the results from gls with corARMA(p=2).)