ggplot2 residuals with ezANOVA - r

I ran a three-way repeated-measures ANOVA with ezANOVA:
anova_1 <- ezANOVA(data = main_data, dv = .(rt), wid = .(id),
                   within = .(A,B,C), type = 3, detailed = TRUE)
I'm trying to see what's going on with the residuals via a Q-Q plot, but I don't know how to get at them, or whether they're even there. With my lme models I simply extract them from the model
main_data$model_residuals <- as.numeric(residuals(model_1))
and plot them
residuals_qq <- ggplot(main_data, aes(sample = model_residuals)) +
  stat_qq(color = "black", alpha = 1, size = 2) +
  geom_abline(intercept = mean(main_data$model_residuals),
              slope = sd(main_data$model_residuals))
I'd like to use ggplot to keep a sense of consistency in my graphing.
EDIT
Maybe I wasn't clear about what I'm trying to do. With lme models I can simply create a model_residuals variable in the main_data data.frame from the model's residuals object, and that variable then contains the residuals I plot in ggplot. I want to know whether something similar is possible for ezANOVA, or whether there is another way to get hold of the residuals for my ANOVA.

I had the same trouble with ezANOVA. The solution I went for was to switch to ez.glm (from the afex package). Both ezANOVA and ez.glm wrap the same function from another package (car::Anova), so you should get the same results.
This would look like this for your example:
anova_1 <- ez.glm("id", "rt", main_data, within = c("A","B","C"), return = "full")
nice.anova(anova_1$Anova)  # show the ANOVA table like ezANOVA does
Then you can pull out the lm object and get your residuals in the usual way:
residuals(anova_1$lm)
Hope that helps.
Edit: a few changes to make it work with the current version of afex, where ez.glm() has been replaced by aov_ez():
anova_1 <- aov_ez("id", "rt", main_data, within = c("A","B","C"))
print(anova_1)
print(anova_1$Anova)
summary(anova_1$Anova)
summary(anova_1)
Then you can pull out the lm object and get your residuals in the usual way:
residuals(anova_1$lm)
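With that in hand, the asker's original ggplot workflow carries over almost unchanged. A minimal sketch, assuming anova_1 was fit with aov_ez() as above:
library(ggplot2)
# for a within-subjects design, residuals(anova_1$lm) is a matrix
# (one row per id, one column per within-cell), so flatten it into its
# own data frame rather than assuming it aligns with rows of main_data
res_df <- data.frame(model_residuals = as.numeric(residuals(anova_1$lm)))
residuals_qq <- ggplot(res_df, aes(sample = model_residuals)) +
  stat_qq(color = "black", alpha = 1, size = 2) +
  geom_abline(intercept = mean(res_df$model_residuals),
              slope = sd(res_df$model_residuals))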

Quite an old post, I know, but it's possible to plot the residuals after modeling your data with the ez package by extracting them like this:
proj(ez_outcome$aov)[[3]][, "Residuals"]
then:
qplot(proj(ez_outcome$aov)[[3]][, "Residuals"])
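If you want to stay consistent with ggplot proper (qplot() is deprecated in current ggplot2), a sketch of the equivalent, assuming the same ez_outcome object:
library(ggplot2)
ez_res <- proj(ez_outcome$aov)[[3]][, "Residuals"]
ggplot(data.frame(res = ez_res), aes(sample = res)) +
  stat_qq() +
  stat_qq_line()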
Hope it helps.

Also potentially adding to an old post, but I ran up against this problem as well, and since this is the first thing that pops up when searching for this question, I thought I might add how I got around it.
I found that if you include the return_aov = TRUE argument in the ezANOVA call, the residuals are in there, but ezANOVA partitions them up in the resulting list within each main and interaction effect, similar to how base aov() does if you include an Error term for subject ID, as in this case.
These can be pulled out into their own list with purrr by mapping the residuals function over the aov sublist of the ezANOVA output, rather than over the main output. So from the question example, it becomes:
anova_1 <- ezANOVA(data = main_data, dv = .(rt), wid = .(id),
                   within = .(A,B,C), type = 3, detailed = TRUE, return_aov = TRUE)
ezanova_residuals <- purrr::map(anova_1$aov, residuals)
This will produce a list where each entry is the residuals from one error stratum of the ezANOVA model, i.e. (Intercept), id, id:A, id:B, id:A:B, etc.
I found it useful to then stitch these together into a tibble using enframe and unnest (as the list components will probably have different lengths) to very quickly get them into a long format that can then be plotted or tested:
ezanova_residuals_tbl <- enframe(ezanova_residuals) %>% unnest(cols = value)
hist(ezanova_residuals_tbl$value)
shapiro.test(ezanova_residuals_tbl$value)
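Since the original question asked for ggplot specifically, the same long-format tibble plugs straight into a Q-Q plot; a small sketch:
library(ggplot2)
ggplot(ezanova_residuals_tbl, aes(sample = value)) +
  stat_qq() +
  stat_qq_line()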
I've not used this myself, but the same mapping idea also works with the coefficients and fitted.values functions to pull those out of the ezANOVA results, if needed. They might come out in some odd formats and need extra manipulation afterwards, though.

Related

How do I create a simple linear regression function in R that iterates over the entire dataframe?

I am working through ISLR and am stuck on a question. Basically, I am trying to create a function that iterates through an entire dataframe. It is question 3.7, 15a.
For each predictor, fit a simple linear regression model to predict the response. Describe your results. In which of the models is there a statistically significant association between the predictor and the response? Create some plots to back up your assertions.
So my thinking is like this:
y = Boston$crim
x = Boston[, -crim]
TestF1 = lm(y ~ x)
summary(TestF1)
But this is nowhere near the right answer. I was hoping to break it down by:
Iterate over the entire dataframe with crim as my response and the others as predictors
Extract the p values that are statistically significant (or extract the ones insignificant)
Move on to the next question (which is considerably easier)
But I am stuck. I've googled but can't find anything. I tried this combn(Boston) thing but it didn't work either. Please help, thank you.
If your problem is to iterate over a data frame, here is an example for mtcars (mpg is the target variable, and the rest are predictors, assuming models with a single predictor). The idea is to generate strings and convert them to formulas:
lms <- vector(mode = "list", length = ncol(mtcars) - 1)
for (i in seq_along(lms)) {
  lms[[i]] <- lm(as.formula(paste0("mpg ~ ", names(mtcars)[-1][i])), data = mtcars)
}
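To get at the asker's second step (extracting the significant p-values), you can then map over the fitted list; a sketch building on the lms loop above:
# pull the slope p-value out of each fit; [2, 4] is the Pr(>|t|)
# entry for the single predictor's coefficient
pvals <- sapply(lms, function(m) summary(m)$coefficients[2, 4])
names(pvals) <- names(mtcars)[-1]
pvals[pvals < 0.05]  # keep only the significant predictors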
If you want to look at each and every variable combination, start with a model with all variables and then eliminate non-significant predictors finding the best model.

How do I use combn for multiple regression (or an alternative)?

I want to get regression coefficients and fit statistics from one dependent regressed on all combinations of two other independent factors.
What I have is data like this (Note the NA):
H<-data.frame(replicate(10,sample(0:20,10,rep=TRUE)))
H[2,3]<-NA
names(H)<-c("dep",letters[1:9])
So I want to regress "dep" on all these pairwise combinations using lm. I can generate the combination labels like this:
apply(combn(names(H)[2:9],2), MARGIN=2, FUN=paste, collapse="*")
"axb" "axc" "axd" "axe" "axf" "axg" ... etc.
One at a time, I could get what I want like:
library(broom)  # for glance()
ab <- data.frame(ind = "a*b",
                 cbind(data.frame(glance(lm(data = H, dep ~ a*b))),
                       t(data.frame(unlist(lm(data = H, dep ~ a*b)[1])))))
names(ab)[13:16] <- c("int", "coef1", "coef2", "coefby")
ac <- data.frame(ind = "a*c",
                 cbind(data.frame(glance(lm(data = H, dep ~ a*c))),
                       t(data.frame(unlist(lm(data = H, dep ~ a*c)[1])))))
names(ac)[13:16] <- c("int", "coef1", "coef2", "coefby")
rbind(ab,ac)
What I want is either all these coefficients and statistics, or at least the model coefficients and r.squared.
Someone already showed how to do almost exactly the same thing using combn. But when I tried a modification of it using glance instead of coefs
fun <- function(x) glance(lm(dep~paste(x, collapse="*"), data=H))[[1]][1]
combn(names(H[2:10]), 2, fun)
I get an error. I thought maybe I needed to try "dep" repeated 36 times, one for each 2 factor combination, but that didn't do it.
Error in model.frame.default(formula = dep ~ paste(x, collapse = "*"), :
variable lengths differ (found for 'paste(x, collapse = "*")')
How do I get either one coefficient at a time or all of them, for all possible dep~x*y multiple regression combination (with "dep" always being my y dependent variable)? Thanks!
Posting as an answer since apparently it worked:
I'm not sure where you got the code dep~paste(x, collapse="*"); using paste inside a formula won't work, and I don't see that being done anywhere on the page you link to. You need to build the full formula as a string. Try something like this:
formula = as.formula(paste("dep ~", paste(x, collapse = "*")))
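Folding that back into the helper, the whole thing might look like this (a sketch, assuming broom is loaded for glance()):
library(broom)
fun <- function(x) {
  f <- as.formula(paste("dep ~", paste(x, collapse = "*")))
  glance(lm(f, data = H))$r.squared  # or return the full glance() row
}
combn(names(H)[2:10], 2, fun)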
Next time, please show the code you are using to call the function, not just the function itself.
You may also be interested in the leaps package if you just want the "best" model, not every model. ("Best" in quotes because this is a terrible way to do model selection in general, violating all sorts of statistical assumptions for multiple comparisons and the like. Check out the LASSO instead for a better way.)

Using glmnet to predict a continuous variable in a dataset

I have this data set.
wbh
I wanted to use the R package glmnet to determine which predictors would be useful in predicting fertility. However, I have been unable to do so, most likely due to not having a full understanding of the package. The fertility variable is SP.DYN.TFRT.IN. I want to see which predictors in the data set give the most predictive power for fertility. I wanted to use LASSO or ridge regression to shrink the number of coefficients, and I know this package can do that. I'm just having some trouble implementing it.
I know there are no code snippets, which I apologize for, but I am rather lost on how I would code this.
Any advice is appreciated.
Thank you for reading
Here is an example of how to run glmnet:
library(glmnet)
library(tidyverse)
df is the data set you provided.
select y variable:
y <- df$SP.DYN.TFRT.IN
select numerical variables:
df %>%
select(-SP.DYN.TFRT.IN, -region, -country.code) %>%
as.matrix() -> x
select factor variables and convert to dummy variables:
df %>%
select(region, country.code) %>%
model.matrix( ~ .-1, .) -> x_train
run the model(s); several parameters here can be tweaked, and I suggest checking the documentation. Here I just run 5-fold cross-validation to determine the best lambda:
cv_fit <- cv.glmnet(x, y, nfolds = 5) #just with numeric variables
cv_fit_2 <- cv.glmnet(cbind(x ,x_train), y, nfolds = 5) #both factor and numeric variables
par(mfrow = c(2,1))
plot(cv_fit)
plot(cv_fit_2)
best lambda:
cv_fit$lambda[which.min(cv_fit$cvm)]
coefficients at best lambda
coef(cv_fit, s = cv_fit$lambda[which.min(cv_fit$cvm)])
equivalent to:
coef(cv_fit, s = "lambda.min")
after running coef(cv_fit, s = "lambda.min"), all features showing a dot (.) in the resulting table are dropped from the model. This corresponds to the lambda marked by the left vertical dashed line on the plots.
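Since the stated goal is predicting fertility, you can also generate fitted values at the chosen lambda. A minimal sketch using the objects above (predicting back onto the training matrix, purely for illustration):
preds <- predict(cv_fit, newx = x, s = "lambda.min")
head(preds)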
I suggest reading the linked documentation; elastic nets are quite easy to grasp if you know a bit of linear regression, and the package is quite intuitive. I also suggest reading ISLR, at least the part on L1/L2 regularization, and these videos: 1, 2, 3, 4, 5, 6. The first three are about estimating model performance via test error and the last three are about the question at hand. This one shows how to implement these models in R. By the way, the guys in the videos invented the LASSO and wrote glmnet.
Also check the glmnetUtils package, which provides a formula interface and other nice things like built-in mixing parameter (alpha) selection. Here is the vignette.
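For completeness, a sketch of what the same cross-validated fit might look like through glmnetUtils' formula interface (assuming the package is installed; it handles the factor-to-dummy conversion for you):
library(glmnetUtils)
cv_fit_3 <- cv.glmnet(SP.DYN.TFRT.IN ~ ., data = df, nfolds = 5)
coef(cv_fit_3, s = "lambda.min")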

Fit and calibrate data frame via factors

First of all, I use RStudio.
I have a data frame (APD) and I would like to fit it with respect to the factor Serial_number. The fit is an lm fit. Then I would like to use this fit to do a calibration (calibrate() from the investr package).
Here's the data: Data
Currently I use the following lines to fit by Serial_number:
Coefficients <- APD %>%
  group_by(Serial_number) %>%
  do(tidy(fit <- lm(log(log(Amplification)) ~ Voltage_transformed, .)))
But here I cannot apply the calibrate() function: calibrate() needs an object that inherits from "lm", and tidy() only works on S3/S4 objects.
Do you have an idea?
In your posted code, you're trying to rbind the predicted values from each model, not the coefficients. The function for coefficients is just coefficients(object).
I would also suggest un-nesting your code, since deep nesting makes it hard to read and change later on. Here are two generalized functions (each makes assumptions, so edit as needed):
lm_by_variable <- function(data_, formula_, byvar) {
by(
data_,
data_[[byvar]],
FUN = lm,
formula = formula_,
simplify = FALSE
)
}
combine_coefficients <- function(fit_list) {
all_coefficients <- lapply(fit_list, coefficients)
do.call('rbind', all_coefficients)
}
lm_by_variable(...) should be pretty self-evident: group by byvar, use lm with the given formula on each subset, and don't simplify the result. Simplifying results is really only useful for interactive work. In a script, it's better to know exactly what will be returned. In this case, a list.
The next function, combine_coefficients(...) returns a matrix of the fitted coefficients. It assumes every fitted model in fit_list has the same terms. We could add logic to make it more robust, but that doesn't seem necessary in this case.
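A hedged usage sketch with the asker's APD data (column names taken from the question; the y0 value is purely illustrative):
fits <- lm_by_variable(APD, log(log(Amplification)) ~ Voltage_transformed, "Serial_number")
combine_coefficients(fits)
# each element of `fits` is a plain lm fit, so it inherits from "lm"
# and investr::calibrate() should accept it
library(investr)
calibrate(fits[[1]], y0 = 0.5)  # y0 chosen arbitrarily for illustration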

Why do variations of this t.test require different coding? (R)

New to R and trying to get my head around its coding (new to coding in general).
My question is: when running t-tests (paired and independent) I have to change the formula for it to recognise my columns. The following both work; however, the 'paired' code will not work if styled like the 'independent' code (with a data = argument).
Independent: t.test(Nicotine ~ Brand, data = nicotine, alternative='two.sided', conf.level=.95, var.equal=FALSE)
Paired: with(omega3, t.test(Before, After, paired = TRUE, alternative='greater', conf.level=.95))
Why does this happen? Ideally I'd prefer not to use with(), but I cannot understand why t.test() will not recognize "Before" and "After" when I add the argument data = omega3.
Any insight is greatly appreciated.
Thom
It has to do with the way the data are used by the function. When you're using a formula, you're telling R: "Use this variable as my predictor (independent var), and this other one as my outcome (dependent var)". In the case of the independent samples t-test, you'd have:
continuous.variable ~ dichotomous.variable
(outcome/dependent) (predictor/independent)
With paired-samples, you have no such thing as a "predictor" (or more largely speaking "explanatory variable"). You simply have two columns that you wish to compare against one another.
So you can see the formula notation as a nice feature of R, but one which you cannot use in every situation.
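One way to see this: a paired t-test is just a one-sample t-test on the within-pair differences, which is why there is no predictor for a formula to encode. A quick sketch with the omega3 columns from the question:
# the paired test ...
with(omega3, t.test(Before, After, paired = TRUE, alternative = "greater"))
# ... is numerically identical to a one-sample test on the differences
t.test(omega3$Before - omega3$After, mu = 0, alternative = "greater")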
Besides, there is an alternative to using the with() function:
t.test(omega3$Before, omega3$After, paired = TRUE, alternative = 'greater', conf.level = .95)
Note that t.test(Before, After, paired = TRUE, ..., data = omega3) will not work: only the formula method of t.test() accepts a data argument, so the default method has no way to look Before and After up inside omega3.
