I have a GAM with many covariates and I would like to simplify it (find the minimal model)
I used the dsm function to model the density of a species across line transect segments as a function of covariates, and it works fine!
But this is the maximal model, with too many covariates, and I would like to reduce their number automatically. So I tried the gam::step.Gam function (I also used the gam.scope function to make sure I was doing everything correctly).
DSM code:
GamModel = dsm(
  ddf.obj = PreparedDdf,
  formula = D ~ x + y + Cov1 + Cov2 + ... + Covn + factor1 + factor2 + ... + factorn,
  family = gaussian(link = 'identity'),
  group = FALSE,
  engine = 'gam',
  convert.units = 1,
  segment.data = segment.df,
  observation.data = observation.df
)
step.Gam code:
GamScope = gam.scope(segment.df[, c(5:6, 11:16)], response = 1, smoother = "s", arg = NULL, form = TRUE)
MinModel = step.Gam(GamModel, GamScope, trace = TRUE, direction = "backward")
I hoped to get the minimal model; instead, it gives me the following error:
Error in gam(formula = D ~ x + Cov1 + Cov2 + Cov3, : invalid `method': REML
And I don't understand why this happens! I tried different methods (GACV.Cp, ML) and I get the same kind of error (invalid `method`: GACV.Cp, etc.).
Why is this happening? Is it because it's a gam model produced by the dsm function?
And, more importantly, how can I minimize the model automatically?
(When I use engine='glm' instead of 'gam' in the dsm function and try the stats::step function to find the minimal model, it works, but the results seem a bit sketchy... so I want to use the gam engine.)
The gam package doesn't fit models using REML or the other options you state. Those are options to the gam() function in the mgcv package.
The only allowed options for the method argument in gam::gam() are:
"glm.fit", which is the default, and
"model.frame", which doesn't really do anything as it instructs the function to just spit out the model frame resulting from the formula.
It is quite important to differentiate between these two packages that both provide a gam() function. They are very different approaches to the estimation of GAMs.
As you are using dsm(), you'll be fitting with mgcv::gam(), not gam::gam(), and in that case you cannot apply the gam::step.Gam() function to the model.
I believe that the authors of dsm() recommend that you use the select = TRUE argument to mgcv::gam(), which you can provide when calling dsm() and which will get passed on to gam(). This will add extra penalties to the smooth terms in the model so that they can be shrunk out of the model.
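For illustration, a minimal sketch of what that could look like, assuming dsm() forwards select = TRUE to mgcv::gam() as described above (covariate names as in the question; the s() smooths are illustrative, since the extra penalty acts on smooth terms):

GamModel <- dsm(
  ddf.obj = PreparedDdf,
  formula = D ~ s(x, y) + s(Cov1) + s(Cov2) + factor1 + factor2,
  family = gaussian(link = 'identity'),
  engine = 'gam',
  convert.units = 1,
  segment.data = segment.df,
  observation.data = observation.df,
  select = TRUE  # extra shrinkage penalty; smooths can be penalised to (effectively) zero
)
summary(GamModel)  # smooths with effective df near zero have effectively been dropped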
I had to transform a response variable (e.g. Variable1) to fulfil the assumptions of linear models fitted with lmer, using an approach suggested here https://www.r-bloggers.com/2020/01/a-guide-to-data-transformation/ for heavy-tailed data and demonstrated below:
TransformVariable1 <- sqrt(abs(Variable1 - median(Variable1)))
I then fit the data to the following example model:
fit <- lmer(TransformVariable1 ~ x + y + (1|z), data = dataframe)
Next, I update the reference grid to account for the transformation as suggested here Specifying that model is logit transformed to plot backtransformed trends:
rg <- update(ref_grid(fit), tran = "TransformVariable1")
Nevertheless, the emmeans are not back-transformed to the original scale after using the following command:
fitemm <- as.data.frame(emmeans(rg, ~ x + y, type = "response"))
My question is: How can I back transform the emmeans to the original scale?
Thank you in advance.
There are two major problems here.
The lesser of them is in specifying tran. You need to either specify one of a handful of known transformations, such as "log", or a list with the needed functions to undo the transformation and implement the delta method. See the help for make.link, make.tran, and vignette("transformations", "emmeans").
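For a monotone transformation, such a list might look like the following sketch (the names here are illustrative, not from the question; linkfun is the transformation, linkinv its inverse, and mu.eta the derivative of linkinv used for the delta method):

# Hypothetical custom transformation object for emmeans
sqrt_tran <- list(
  linkfun = function(mu) sqrt(mu),   # the transformation
  linkinv = function(eta) eta^2,     # its inverse
  mu.eta  = function(eta) 2 * eta,   # derivative of linkinv (delta method)
  name    = "square root"
)
rg <- update(ref_grid(fit), tran = sqrt_tran)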
The much more serious issue is that the transformation used here is not a monotone function, so it is impossible to back-transform the results. Each transformed response value corresponds to two possible values on either side of the median of the original variable. The model we have here does not estimate effects on the given variable, but rather effects on the dispersion of that variable. It's like trying to use the speedometer as a substitute for a navigation system.
I would suggest using a different model, or at least a different response variable.
A possible remedy
Looking again at this, I wonder if what was meant was the symmetric square-root transformation -- what is shown multiplied by sign(Variable1 - median(Variable1)). This transformation is available in emmeans::make.tran(). You will need to re-fit the model.
What I suggest is creating the transformation object first, then using it throughout:
require(lme4)
require(emmeans)
symsqrt <- make.tran("sympower", param = c(0.5, median(Variable1)))
fit <- with(symsqrt,
lmer(linkfun(Variable1) ~ x + y + (1|z), data = dataframe)
)
emmeans(fit, ~ x + y, type = "response")
symsqrt comprises a list of functions needed to implement the transformation. The transformation itself is symsqrt$linkfun, and the emmeans package knows to look for the other stuff when the response transformation is named linkfun.
BTW, please break the habit of wrapping emmeans() in as.data.frame(). That renders invisible some important annotations, and also disables the possibility of following up with contrasts and comparisons. If you think you want to see more precision than is shown, you can precede the call with emm_options(opt.digits = FALSE); but really, you are kidding yourself if you think those extra digits give you useful information.
I have already checked a couple of topics and also found some help regarding heteroskedasticity in panel regressions. But unfortunately, some questions have remained unsolved.
Following example (some repeated measures, data already in long format):
Panelregr <- plm(V1~ V2 + V3 + V4, data = XY, model ="random")
Then I checked for heteroskedasticity:
B.P.Test <- bptest(V1 ~ V2 + V3 + V4, data = XY, studentize = FALSE)
The test was highly significant --> heteroskedasticity
Then I read (link: https://www.princeton.edu/~otorres/Panel101R.pdf) about using a robust covariance matrix to account for heteroskedasticity. For the example above I used the code
coeftest(Panelregr, vcovHC)
summary(Panelregr, vcov = vcovHC)
and got the results. But I could also use
coeftest(Panelregr, vcovHC(Panelregr, type = "HC3"))
or the other types HC0 - HC4
Now some questions came up:
Which estimator of these five types do I receive when I use coeftest(Panelregr, vcovHC) instead of defining one particular HC..? Is it HC0?
How do I know which HC... fits my data? (I read some information, for example https://cran.r-project.org/web/packages/sandwich/vignettes/sandwich.pdf, page 4, but I'm still not sure how to decide.)
How do I describe the results when using one of these estimators? Example: "In order to account for heteroskedasticity, a robust covariance matrix was used. In detail, we used the HC... estimator as ... In the following table, the results of the HC... estimator are shown."
When I correct for heteroskedasticity, the results don't include values like R-squared. Is it correct to report the corrected coefficients (e.g. from coeftest(Panelregr, vcovHC)) and to report values like R-squared from the "original" panel regression (Panelregr <- plm(V1 ~ V2 + V3 + V4, data = XY, model = "random"))?
1) The default one (see ?vcovHC); for plm::vcovHC that is HC0, as it is the first value listed for the type argument.
3) HC0, HC1, ... are scaling factors for the variance-covariance matrix. It is good to mention that. You also want to mention the estimator, i.e. what is given by the method argument. A typical choice is the estimator by Arellano (1987), which is the default for plm::vcovHC.
4) The R^2 is not impacted by using a het.-consistent variance-covariance matrix. However, the F-statistic is. summary(Panelregr, vcov = vcovHC) gives you what you need.
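To make that concrete, a small sketch (not from the original answer) of how the estimator and the scaling type can be requested explicitly:

library(plm)
library(lmtest)

# Arellano (1987) estimator (plm's default method) with the HC3 scaling factor
coeftest(Panelregr, vcov = vcovHC(Panelregr, method = "arellano", type = "HC3"))

# The same choice inside summary(), which also adjusts the F statistic
summary(Panelregr, vcov = function(x) vcovHC(x, method = "arellano", type = "HC3"))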
I am unsuccessfully trying to do the Arellano and Bond (1991) estimation using pgmm from the plm package. To see whether the problem was in my data, I instead used the data supplied in the plm package, but got the same problem when using the summary command:
Error in t(y) %*% x : non-conformable arguments
The coefficients of the model can be obtained though.
My own data has T=3, N=290. As I understand it, T=3 is the minimum, but it should be sufficient. When using the Arellano and Bond data, I get the same error when T=4.
data("EmplUK", package = "plm")
library(sqldf)
UK<-sqldf("select * from EmplUK where year in ('1982','1981',
'1980','1979')")
z1 <- pgmm(log(emp) ~ lag(log(emp), 1) + log(wage) +
log(capital) + log(output) | lag(log(emp), 2),
data = UK, effect = "twoways", model = "twosteps")
summary(z1)
The way I understand the estimation method and the R formula, the left-hand term is the difference in the dependent variable, and the first right-hand term is the lagged difference. The latter term is instrumented by the level of the dependent variable at (t-2).
I have verified that the subset I use is a balanced panel with T=4. When I include more years, everything works out, so it must be the length of the panel that causes trouble.
Any help would be much appreciated.
A similar question is asked here. It is suggested that the error has to do with mtest, a serial correlation test performed by the pgmm summary method. Running the function separately seems to confirm this:
>mtest(z1, order = 2)
Error in t(y) %*% x : non-conformable arguments
T=3 is enough to estimate the model, but this only leaves you with an estimate for the last period. A second-order mtest requires the residuals to contain at least 3 periods, i.e. T=5 for your model.
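As a quick check, you can run the lower-order test by hand (a sketch along the lines of the linked suggestion); the first-order test may still be computable while the second-order one is not:

mtest(z1, order = 1)  # may work: needs fewer residual periods
mtest(z1, order = 2)  # fails: Error in t(y) %*% x : non-conformable arguments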
I ran a three way repeated measures ANOVA with ezANOVA.
anova_1 <- ezANOVA(data = main_data, dv = .(rt), wid = .(id),
                   within = .(A,B,C), type = 3, detailed = TRUE)
I'm trying to see what's going on with the residuals via a QQ plot, but I don't know how to get to them or whether they're even there. With my lme models I simply extract them from the model:
main_data$model_residuals <- as.numeric(residuals(model_1))
and plot them
residuals_qq <- ggplot(main_data, aes(sample = main_data$model_residuals)) +
  stat_qq(color = "black", alpha = 1, size = 2) +
  geom_abline(intercept = mean(main_data$model_residuals), slope = sd(main_data$model_residuals))
I'd like to use ggplot since I'd like to keep a sense of consistency in my graphing.
EDIT
Maybe I wasn't clear in what I'm trying to do. With lme models I can simply create a model_residuals variable in the main_data data.frame from the residuals object, which then contains the residuals I plot in ggplot. I want to know whether something similar is possible for the residuals in ezANOVA, or whether there is another way I can get hold of the residuals for my ANOVA.
I had the same trouble with ezANOVA. The solution I went for was to switch to ez.glm (from the afex package). Both ezANOVA and ez.glm wrap a function from a different package, so you should get the same results.
This would look like this for your example:
anova_1<-ez.glm("id", "rt", main_data, within=c("A","B","C"), return="full")
nice.anova(anova_1$Anova) # show the ANOVA table like ezANOVA does.
Then you can pull out the lm object and get your residuals in the usual way:
residuals(anova_1$lm)
Hope that helps.
Edit: a few changes to make it work with the latest version of afex:
anova_1 <- aov_ez("id", "rt", main_data, within = c("A","B","C"))
print(anova_1)
print(anova_1$Anova)
summary(anova_1$Anova)
summary(anova_1)
Then you can pull out the lm object and get your residuals in the usual way:
residuals(anova_1$lm)
Quite an old post, I know, but it's possible to use ggplot to plot the residuals after modelling your data with the ez package by using this function:
proj(ez_outcome$aov)[[3]][, "Residuals"]
then:
qplot(proj(ez_outcome$aov)[[3]][, "Residuals"])
Hope it helps.
Also potentially adding to an old post, but I ran up against this problem as well, and as this is the first thing that pops up when searching for this question, I thought I might add how I got around it.
I found that if you include the return_aov = TRUE argument in the ezANOVA call, then the residuals are in there, but ezANOVA partitions them across the main and interaction effects in the resulting list, similar to how base aov() does if you include an Error term for subject ID, as in this case.
These can be pulled out into their own list with purrr by mapping the residuals function over this aov sublist of the ezANOVA output, rather than over the main output. So, from the question example, it becomes:
anova_1 <- ezANOVA(data = main_data, dv = .(rt), wid = .(id),
within = .(A,B,C), type = 3, detailed = TRUE, return_aov = TRUE)
ezanova_residuals <- purrr::map(anova_1$aov, residuals)
This will produce a list where each entry is the residuals from a part of the ezANOVA model for effects and interactions, i.e. $(Intercept), $id, id:a, id:b, id:a:b etc.
I found it useful to then stitch these together into a tibble using enframe and unnest (as the list components will probably have different lengths) to very quickly get them into a long format that can then be plotted or tested:
library(tidyverse)  # for enframe(), unnest(), and %>%
ezanova_residuals_tbl <- enframe(ezanova_residuals) %>% unnest(cols = value)
hist(ezanova_residuals_tbl$value)
shapiro.test(ezanova_residuals_tbl$value)
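Since the original question asked for a QQ plot in ggplot, here is a short sketch of what that could look like with the pooled residuals (stat_qq_line() requires ggplot2 >= 3.0):

library(ggplot2)
ggplot(ezanova_residuals_tbl, aes(sample = value)) +
  stat_qq(color = "black", alpha = 1, size = 2) +
  stat_qq_line()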
I've not used this myself but the mapping idea also works for the coefficients and fitted.values functions to pull them out of the ezANOVA results, if needed. They might come out in some odd formats and need some extra manipulation afterwards though.
I'm trying to use the lmer function to create a minimum adequate model. My model is Mated ~ Size * Attempts * Status + (random factor).
as.logical(Mated)
as.numeric(Size)
as.factor(Attempts)
as.factor(Status)
(These have all worked on previous models)
So after all that I try running my model:
Model1<-lmer(Mated ~ Size*Status*Attempts + (1|FemaleID),data=mydata)
And it runs without fault. It's only when I try to apply this update that it goes wrong:
Model2<-update(Model1, REML=FALSE)
Here is the error message supplied:
Error in fn(x, ...) : Downdated VtV is not positive definite
If I make a third model without the interaction and do an ANOVA between that and model one, then it says the two are significantly different.
Model3 <- update(Model1, ~ . - Size:Status:Attempts)
anova(Model1,Model3)
What am I doing wrong? Is the three way interaction really significant or have I made some mistake?
Thank you
If Mated is binary, then you should probably be using glmer with a logit or probit link function instead, something like:
model <- glmer(Mated ~ Size * Status * Attempts + (1|FemaleID),
data = mydata, family = binomial)
It would help if you could let us know what your data looks like (head(mydata) might be fine, or see here for how to make a reproducible example).
Also, I would avoid making Mated logical (see this question and answer for how it can make your life more difficult). Instead, as.factor(Mated) will explicitly make your response variable discrete.
After that, you can compare your full and reduced models with anova().
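For instance, a hedged sketch of that comparison (variable names taken from the question):

# Full model with the three-way interaction
model_full <- glmer(Mated ~ Size * Status * Attempts + (1 | FemaleID),
                    data = mydata, family = binomial)

# Reduced model without the three-way interaction
model_reduced <- update(model_full, . ~ . - Size:Status:Attempts)

# Likelihood-ratio test for the three-way interaction
anova(model_reduced, model_full)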