Why do variations of this t.test require different coding? (R)

New to R and trying to get my head around its coding (new to coding in general).
My question: when running t-tests (paired and independent), I have to change the formula for R to recognise my columns. The following both work; however, the 'paired' code will not work if written like the 'independent' code (with data = ...).
Independent: t.test(Nicotine ~ Brand, data = nicotine, alternative='two.sided', conf.level=.95, var.equal=FALSE)
Paired: with(omega3, t.test(Before, After, paired = TRUE, alternative='greater', conf.level=.95))
Why does this happen? Ideally I'd prefer not to use with(), but I cannot understand why R will not recognize "Before" and "After" when I add the argument data = omega3.
Any insight is greatly appreciated.
Thom

It has to do with the way the data are used by the function. When you use a formula, you're telling R: "use this variable as my predictor (independent variable), and this other one as my outcome (dependent variable)". In the case of the independent-samples t-test, you'd have:
continuous.variable ~ dichotomous.variable
(outcome/dependent)   (predictor/independent)
With a paired-samples test, there is no such thing as a "predictor" (or, more broadly, an "explanatory variable"). You simply have two columns that you wish to compare against one another.
So you can see the formula notation as a nice feature of R, but one you cannot use in every situation.
Besides, there is an alternative to using the with() function: extract the columns with $. (Note that your data = omega3 attempt fails because the data argument only exists for the formula method of t.test; the default method, which takes two data vectors, does not have one.)
t.test(omega3$Before, omega3$After, paired = TRUE, alternative = 'greater', conf.level = .95)
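If you'd rather see why no formula is needed at all: a paired t-test is just a one-sample t-test on the within-pair differences, so the following is equivalent:
# paired test expressed as a one-sample test on the differences
t.test(omega3$Before - omega3$After, alternative = 'greater', conf.level = .95)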

How to run a multinomial logit regression with both individual and time fixed effects in R

Long story short:
I need to run a multinomial logit regression with both individual and time fixed effects in R.
I thought I could use the packages mlogit and survival for this purpose, but I cannot find a way to include fixed effects.
Now the long story:
I have found many questions on this topic on various Stack-related websites, but none of them provides an answer. I have also noticed a lot of confusion about what a multinomial logit regression with fixed effects is (people use different names for it) and about the R packages implementing it.
So I think it would be beneficial to provide some background before getting to the point.
Consider the following.
In a multiple-choice question, each respondent makes one choice.
Respondents are asked the same question every year. There is no a priori assumption about the extent to which the choice at time t is affected by the choice at t-1.
Now imagine a panel dataset recording these choices. The data would look like this:
set.seed(123)
# number of observations
n <- 100
# possible choices
possible_choice <- letters[1:4]
# number of years
years <- 3
# individual characteristics
x1 <- runif(n * years, 5.0, 70.5)
x2 <- sample(1:n^2, n * years, replace = FALSE)
# actual choice in each year
actual_choice_year_1 <- possible_choice[sample(1:4, n, replace = TRUE, prob = rep(1/4, 4))]
actual_choice_year_2 <- possible_choice[sample(1:4, n, replace = TRUE, prob = c(0.4, 0.3, 0.2, 0.1))]
actual_choice_year_3 <- possible_choice[sample(1:4, n, replace = TRUE, prob = c(0.2, 0.5, 0.2, 0.1))]
# create long dataset
df <- data.frame(choice = c(actual_choice_year_1, actual_choice_year_2, actual_choice_year_3),
                 x1 = x1, x2 = x2,
                 individual_fixed_effect = as.character(rep(1:n, years)),
                 time_fixed_effect = as.character(rep(1:years, each = n)),
                 stringsAsFactors = FALSE)
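As a quick sanity check of the panel structure (plain base R, using only the df created above):
nrow(df)                     # 300 rows = n * years
table(df$time_fixed_effect)  # 100 observations per year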
I am new to this kind of analysis, but if I understand correctly, to estimate the effects of respondents' characteristics on their choice I may use a multinomial logit regression.
To take advantage of the longitudinal structure of the data, I want to include individual and time fixed effects in my specification.
To the best of my knowledge, the multinomial logit regression with fixed effects was first proposed by Chamberlain (1980, Review of Economic Studies 47: 225–238). Recently, Stata users have been provided with the routines to implement this model (femlogit).
In the vignette of the femlogit package, the author refers to the R function clogit in the survival package.
According to its help page, clogit requires the data to be rearranged into a different format:
library(mlogit)
# reshape the data for mlogit (the input is in "wide" shape: one row per choice situation)
data_mlogit <- mlogit.data(df, id.var = "individual_fixed_effect",
                           group.var = "time_fixed_effect",
                           choice = "choice",
                           shape = "wide")
Now, if I understand correctly how clogit works, fixed effects can be passed in through the strata() function (see this tutorial for additional details). However, it is not clear to me how to use it, as no coefficient values are returned for the individual characteristic variables (I get only NAs).
library(survival)
fit <- clogit(choice ~ alt + x1 + x2 + strata(individual_fixed_effect, time_fixed_effect),
              data = as.data.frame(data_mlogit))
summary(fit)
Since I was not able to find a reason for this (there must be something I am missing about the way these functions are estimated), I looked for a solution using other R packages: e.g., glmnet, VGAM, nnet, globaltest, and mlogit.
Only the latter seems able to deal explicitly with panel structures using an appropriate estimation strategy. For this reason, I decided to give it a try. However, I was only able to run a multinomial logit regression without fixed effects.
# state the formula
formula_mlogit <- formula("choice ~ 1 | x1 + x2")
# run the multinomial regression
fit <- mlogit(formula_mlogit, data_mlogit)
summary(fit)
If I understand correctly how mlogit works, here's what I have done.
By using the function mlogit.data, I have created a dataset compatible with the function mlogit. Here, I have also specified the id of each individual (id.var = "individual_fixed_effect") and the group to which each individual belongs (group.var = "time_fixed_effect"). In my case, the group represents the observations recorded in the same year.
My formula specifies that there are no alternative-specific variables, i.e. variables tied to a specific choice (the part before the |). By contrast, choices are motivated only by individual characteristics (i.e., x1 and x2).
The help page of mlogit specifies that one can use the argument panel to apply panel techniques. Setting panel = TRUE is what I am after here.
The problem is that panel can be set to TRUE only if another argument of mlogit, rpar, is not NULL.
The argument rpar is used to specify the distribution of the random parameters, i.e. the variables before the |.
Since such variables do not exist in my case, I cannot use rpar and therefore cannot set panel = TRUE.
An interesting related question is here. A few suggestions were given, and one seems to go in my direction. Unfortunately, no replicable example is provided, and I do not understand how to follow that strategy to solve my problem.
Moreover, I am not particularly attached to mlogit; any efficient way to perform this task would be fine with me (e.g., I am OK with survival or other packages).
Do you know any solution to this problem?
Two caveats for those interested in answering:
I am interested in fixed effects, not random effects. However, if you believe there is no other way to take advantage of the longitudinal structure of my data in R (there is one in Stata, but I don't want to use it), please feel free to share your code.
I am not interested in going Bayesian. So, if possible, please do not suggest that approach.

How do I use combn for multiple regression (or an alternative)?

I want to get regression coefficients and fit statistics from one dependent variable regressed on all combinations of two other independent factors.
What I have is data like this (note the NA):
H<-data.frame(replicate(10,sample(0:20,10,rep=TRUE)))
H[2,3]<-NA
names(H)<-c("dep",letters[1:9])
So I want to regress "dep" on all two-way combinations of the other columns using lm. The combinations can be generated like this:
apply(combn(names(H)[2:9], 2), MARGIN = 2, FUN = paste, collapse = "*")
# "a*b" "a*c" "a*d" "a*e" "a*f" "a*g" ... etc.
One at a time, I can get what I want like this (glance comes from the broom package):
library(broom)
ab <- data.frame(ind = "a*b",
                 cbind(data.frame(glance(lm(dep ~ a*b, data = H))),
                       t(data.frame(unlist(lm(dep ~ a*b, data = H)[1])))))
names(ab)[13:16] <- c("int", "coef1", "coef2", "coefby")
ac <- data.frame(ind = "a*c",
                 cbind(data.frame(glance(lm(dep ~ a*c, data = H))),
                       t(data.frame(unlist(lm(dep ~ a*c, data = H)[1])))))
names(ac)[13:16] <- c("int", "coef1", "coef2", "coefby")
rbind(ab, ac)
What I want is either all these coefficients and statistics, or at least the model coefficients and r.squared.
Someone already showed how to do almost exactly the same thing using combn. But when I tried a modification of it, using glance instead of coef,
fun <- function(x) glance(lm(dep~paste(x, collapse="*"), data=H))[[1]][1]
combn(names(H[2:10]), 2, fun)
I get an error. I thought maybe I needed "dep" repeated 36 times, once for each two-factor combination, but that didn't fix it.
Error in model.frame.default(formula = dep ~ paste(x, collapse = "*"), :
variable lengths differ (found for 'paste(x, collapse = "*")')
How do I get either one coefficient at a time, or all of them at once, for every possible dep ~ x*y multiple regression combination (with "dep" always being my dependent variable)? Thanks!
Posting as an answer since apparently it worked:
I'm not sure where you got the code dep ~ paste(x, collapse="*"); using paste inside a formula won't work, and I don't see that being done anywhere on the page you link to. You need to build the full formula as a string. Try something like this:
formula = as.formula(paste("dep ~", paste(x, collapse = "*")))
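Putting it together, a minimal sketch of the fixed helper, assuming broom is loaded and H is defined as in the question (here extracting r.squared, as in your attempt):
library(broom)
fun <- function(x) {
  # build the formula string, e.g. "dep ~ a*b", then convert it
  f <- as.formula(paste("dep ~", paste(x, collapse = "*")))
  glance(lm(f, data = H))$r.squared
}
# r.squared for every dep ~ x*y combination
combn(names(H)[2:10], 2, fun)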
Next time, please show the code you are using to call the function, not just the function itself.
You may also be interested in the leaps package if you just want the "best" model, not every model. ("Best" in quotes because this is a terrible way to do model selection in general, violating all sorts of statistical assumptions for multiple comparisons and the like. Check out the LASSO instead for a better way.)

ggplot2 residuals with ezANOVA

I ran a three way repeated measures ANOVA with ezANOVA.
anova_1 <- ezANOVA(data = main_data, dv = .(rt), wid = .(id),
                   within = .(A, B, C), type = 3, detailed = TRUE)
I'm trying to see what's going on with the residuals via a Q-Q plot, but I don't know how to get at them, or whether they're even there. With my lme models I simply extract them from the model
main_data$model_residuals <- as.numeric(residuals(model_1))
and plot them
residuals_qq <- ggplot(main_data, aes(sample = model_residuals)) +
  stat_qq(color = "black", alpha = 1, size = 2) +
  geom_abline(intercept = mean(main_data$model_residuals),
              slope = sd(main_data$model_residuals))
I'd like to use ggplot since I'd like to keep a sense of consistency in my graphing.
EDIT
Maybe I wasn't clear about what I'm trying to do. With lme models I can simply add a model_residuals column to the main_data data.frame, taken from the residuals of the model, and then plot it in ggplot. I want to know whether something similar is possible for the residuals of an ezANOVA model, or whether there is another way to get hold of the residuals for my ANOVA.
I had the same trouble with ezANOVA. The solution I went for was to switch to ez.glm (from the afex package). Both ezANOVA and ez.glm wrap a function from a different package, so you should get the same results.
This would look like this for your example:
anova_1 <- ez.glm("id", "rt", main_data, within = c("A", "B", "C"), return = "full")
nice.anova(anova_1$Anova)  # show the ANOVA table the way ezANOVA does
Then you can pull out the lm object and get your residuals in the usual way:
residuals(anova_1$lm)
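To keep the ggplot consistency asked about in the question, a minimal Q-Q plot sketch (assuming ggplot2 is loaded; stat_qq_line requires ggplot2 >= 3.0.0):
library(ggplot2)
res <- residuals(anova_1$lm)
ggplot(data.frame(res = res), aes(sample = res)) +
  stat_qq(color = "black", alpha = 1, size = 2) +
  stat_qq_line()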
Hope that helps.
Edit: a few changes to make it work with the current version of afex, where ez.glm has been superseded by aov_ez:
anova_1 <- aov_ez("id", "rt", main_data, within = c("A", "B", "C"))
print(anova_1)
print(anova_1$Anova)
summary(anova_1$Anova)
summary(anova_1)
Then you can pull out the lm object and get your residuals in the usual way:
residuals(anova_1$lm)
A quite old post, I know, but it's possible to use ggplot to plot the residuals after modeling your data with the ez package, using this expression:
proj(ez_outcome$aov)[[3]][, "Residuals"]
then:
qplot(proj(ez_outcome$aov)[[3]][, "Residuals"])
Hope it helps.
Also potentially adding to an old post, but I ran into this problem as well, and since this is the first thing that pops up when searching for this question, I thought I might add how I got around it.
I found that if you include the return_aov = TRUE argument in the ezANOVA call, the residuals are in there, but ezANOVA partitions them in the resulting list within each main and interaction effect, similar to what base aov() does when you include an Error term for subject ID, as in this case.
These can be pulled out into their own list with purrr by mapping the residuals function over the aov sublist of the ezANOVA output, rather than over the main output. So, from the question's example, it becomes:
anova_1 <- ezANOVA(data = main_data, dv = .(rt), wid = .(id),
                   within = .(A, B, C), type = 3, detailed = TRUE, return_aov = TRUE)
ezanova_residuals <- purrr::map(anova_1$aov, residuals)
This produces a list in which each entry holds the residuals from one part of the ezANOVA model's effects and interactions, i.e. $(Intercept), $id, $id:A, $id:B, $id:A:B, etc.
I found it useful to then stitch these together into a tibble using enframe and unnest (the list components will probably have different lengths), which very quickly gets them into a long format that can be plotted or tested:
library(tibble)  # for enframe()
library(tidyr)   # for unnest() and the %>% pipe
ezanova_residuals_tbl <- enframe(ezanova_residuals) %>% unnest(cols = value)
hist(ezanova_residuals_tbl$value)
shapiro.test(ezanova_residuals_tbl$value)
I've not used this myself, but the mapping idea also works with the coefficients and fitted.values functions to pull those out of the ezANOVA results if needed. They might come out in some odd formats and need some extra manipulation afterwards, though.
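For example, a sketch of the same mapping with the other extractors (untested here, reusing the anova_1 object from the code above):
ezanova_coefs  <- purrr::map(anova_1$aov, coefficients)
ezanova_fitted <- purrr::map(anova_1$aov, fitted.values)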

lm and predict - agreement of data.frame names

Working in R to develop regression models, I have something akin to this:
c_lm = lm(trainingset$dependent ~ trainingset$independent)
c_pred = predict(c_lm, testset$independent)
and every single time I get a mysterious warning from R:
Warning message:
'newdata' had 34 rows but variables found have 142 rows
which essentially means that R cannot find the independent column of the testset data.frame. This happens because the exact name used on the right-hand side of the formula in lm must also be present in predict. To fix it, I can do this:
tempset = trainingset
c_lm = lm(trainingset$dependent ~ tempset$independent)
tempset = testset
c_pred = predict(c_lm, tempset$independent)
or some similar variation, but this is really sloppy, in my opinion.
Is there another way to clean up the translation between the two so that the independent variables' data frame does not have to have the exact same name in predict as it does in lm?
No, No, No, No, No, No! Do not use the formula interface in the way you are doing if you want all the other sugar that comes with model formulas. You wrote:
c_lm = lm(trainingset$dependent ~ trainingset$independent)
You repeat trainingset twice, which is a waste of fingers and time, redundant, and, not least, the cause of the problem you are hitting. When you call predict, it will look for a variable in testset with the name trainingset$independent, which of course doesn't exist. Instead, use the data argument in your call to lm(). For example, this fits the same model as your formula, but is efficient and also works properly with predict():
c_lm = lm(dependent ~ independent, data = trainingset)
Now when you call predict(c_lm, newdata = testset), you only need to have a data frame with a variable whose name is independent (or whatever you have in the model formula).
An additional reason to write formulas as I show them is legibility: keeping the object name out of the formula makes it easier to see what the model is.
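As a self-contained illustration of the pattern (using the built-in mtcars data, so the column names here are stand-ins for your dependent/independent):
set.seed(1)
idx <- sample(nrow(mtcars), 22)
trainingset <- mtcars[idx, ]   # 22 rows
testset     <- mtcars[-idx, ]  # 10 rows
# the formula uses bare column names; data= says where to find them
c_lm <- lm(mpg ~ wt, data = trainingset)
# testset only needs a column named 'wt'
c_pred <- predict(c_lm, newdata = testset)
length(c_pred)  # one prediction per row of testset: 10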
