lm() saving residuals with group_by in R - confused SPSS user

This is a complete re-edit of my original question.
Let's assume I'm working on RT data gathered in a repeated-measures experiment. As part of my usual routine, I always transform RTs to natural logarithms and then compute a Z score for each RT within each participant, adjusting for trial number. This is typically done with a simple regression in SPSS syntax:
split file by subject.
REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT rtLN
/METHOD=ENTER trial
/SAVE ZRESID.
split file off.
To reproduce the same procedure in R, generate the data:
# load libraries
library(dplyr); library(magrittr)
# generate data
ob <- factor(c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3))
trial <- c(1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6)
rt <- c(300,305,290,315,320,320,350,355,330,365,370,370,560,565,570,575,560,570)
cond <- c("first","first","first","snd","snd","snd","first","first","first","snd","snd","snd","first","first","first","snd","snd","snd")
# the following variable is what I would get after running the SPSS code above
ZreSPSS <- c(0.4207,0.44871,-1.7779,0.47787,0.47958,-0.04897,0.45954,0.45487,-1.7962,0.43034,0.41075,0.0407,-0.6037,0.0113,0.61928,1.22038,-1.32533,0.07806)
sym <- data.frame(ob, trial, rt, cond, ZreSPSS)
I could apply a formula (a blend of Mark's and Daniel's solutions) to compute residuals from an lm(log(rt) ~ trial) regression, but for some reason group_by is not working here:
sym %<>%
  group_by(ob) %>%
  mutate(z = residuals(lm(log(rt) ~ trial)),
         obM = mean(rt), obSd = sd(rt), zRev = z * obSd + obM)
The resulting values clearly show that the grouping hasn't kicked in.
Any idea why it didn't work?

Using dplyr and magrittr, you should be able to calculate z-scores within each individual with this code (it breaks the data into the groups you tell it to, then calculates within each group):
experiment %<>%
  group_by(subject) %>%
  mutate(rtLN = log(rt)
         , ZRE1 = scale(rtLN))
You should then be able to use that in your model. However, one thing that may help your shift to R thinking is that you can likely build your model directly, instead of having to create all of these columns ahead of time. For example, using lme4 to treat subject as a random effect:
library(lme4)
withRandVar <-
  lmer(log(rt) ~ cond + (1|as.factor(subject))
       , data = experiment)
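The residuals can then be pulled straight from that fit (resid() works on merMod objects just as it does on lm fits):
mixedResid <- resid(withRandVar)  # residuals on the log(rt) scale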
Those residuals should already be on the correct scale. Further, if you use the z-scores, you should probably also be plotting on that scale. I am not actually sure what running with the z-scores as the response gains you -- it seems like you would lose information about the degree of difference between the groups.
That is, if the groups are tight but the difference between them varies by subject, a z-score may always show them as a similar number of z-scores apart. Imagine, for example, that you have two subjects: one scores (1,1,1) on condition A and (3,3,3) on condition B, and a second scores (1,1,1) and (5,5,5). Both will give z-scores of (-.9,-.9,-.9) vs (.9,.9,.9), losing the information that the difference between A and B is larger for subject 2.
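You can verify those numbers directly with scale():
round(as.vector(scale(c(1,1,1,3,3,3))), 2)  # -0.91 -0.91 -0.91  0.91  0.91  0.91
round(as.vector(scale(c(1,1,1,5,5,5))), 2)  # identical z-scores despite the bigger gap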
If, however, you really want to convert back, you can use this to store the subject means and SDs, then multiply the residuals by subjSD and add subjMean:
experiment %<>%
  group_by(subject) %>%
  mutate(rtLN = log(rt)
         , ZRE1 = scale(rtLN)
         , subjMean = mean(rtLN)
         , subjSD = sd(rtLN))
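Following through on that conversion (a sketch; zModel and residOrig are hypothetical names, with zModel standing for whatever model you fit using the z-scores as the response):
experiment %<>%
  ungroup() %>%  # resid() returns one value per row of the full data
  mutate(residOrig = resid(zModel) * subjSD + subjMean)  # zModel is hypothetical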

mylm <- lm(x~y)
rstandard(mylm)
This returns the standardized residuals of the fitted model. To bind these to a variable you can do:
zresid <- rstandard(mylm)
EXAMPLE:
a <- rnorm(10, mean = 10)
b <- rnorm(10, mean = 10)
mylm <- lm(a ~ b)
mylm.zresid <- rstandard(mylm)
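Applied to the grouped problem in the question, this could look like the following (a sketch using the sym data frame defined above; note that SPSS's ZRESID divides by the residual standard error, while rstandard() also adjusts for leverage, so the values will differ slightly):
library(dplyr)
sym <- sym %>%
  group_by(ob) %>%
  mutate(zre = rstandard(lm(log(rt) ~ trial)))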
See also:
summary(mylm)
and
mylm$coefficients
mylm$fitted.values
mylm$xlevels
mylm$residuals
mylm$assign
mylm$call
mylm$effects
mylm$qr
mylm$terms
mylm$rank
mylm$df.residual
mylm$model
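As a quick sanity check on those components, fitted values plus residuals recompose the response:
all.equal(as.vector(mylm$fitted.values + mylm$residuals), a)  # TRUE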

Related

Multinomial logit: estimation on a subset of alternatives in R

As McFadden (1978) showed, if the number of alternatives in a multinomial logit model is so large that computation becomes impossible, it is still feasible to obtain consistent estimates by randomly subsetting the alternatives, so that the estimated probabilities for each individual are based on the chosen alternative and C other randomly selected alternatives. In this case, the size of the subset of alternatives is C+1 for each individual.
My question is about the implementation of this algorithm in R. Is it already embedded in any multinomial logit package? If not - which seems likely based on what I know so far - how would one go about including the procedure in pre-existing packages without recoding extensively?
Not sure whether the question is more about the sampling of alternatives or the estimation of MNL models after sampling. To my knowledge, no R packages do the sampling of alternatives (the former) so far, but the latter is possible with existing packages such as mlogit. I believe the reason is that the sampling process varies depending on how your data are organized, but it is relatively easy to do with a bit of your own code. Below is code adapted from what I used for this paper.
library(tidyverse)
# create artificial data
set.seed(6)
# data frame of chooser id and chosen alt_id
id_alt <- data.frame(
  id = 1:1000,
  alt_chosen = sample(1:30, 1000, replace = TRUE)
)
# data frame for the universal choice set, with an alt-specific attribute (alt_x2)
alts <- data.frame(
  alt_id = 1:30,
  alt_x2 = runif(30)
)
# conduct sampling of 9 non-chosen alternatives
id_alt <- id_alt %>%
  mutate(.alts_all = list(alts$alt_id),
         # use weights to avoid including the chosen alternative in the sample
         .alts_wtg = map2(.alts_all, alt_chosen, ~ifelse(.x == .y, 0, 1)),
         .alts_nonch = map2(.alts_all, .alts_wtg, ~sample(.x, size = 9, prob = .y)),
         # combine chosen & sampled non-chosen alts
         alt_id = map2(alt_chosen, .alts_nonch, c)
  )
# unnest the above data frame to create a long-format data frame
# with rows varying by chooser id and alt_id
id_alt_lf <- id_alt %>%
  select(-starts_with(".")) %>%
  unnest(alt_id)
# join the long-format df with alts to get alt-specific attributes
id_alt_lf <- id_alt_lf %>%
  left_join(alts, by = "alt_id") %>%
  mutate(chosen = ifelse(alt_chosen == alt_id, 1, 0))
require(mlogit)
# convert to an mlogit data frame before estimating
id_alt_mldf <- mlogit.data(id_alt_lf,
                           choice = "chosen",
                           chid.var = "id",
                           alt.var = "alt_id", shape = "long")
mlogit(chosen ~ 0 + alt_x2, id_alt_mldf) %>%
  summary()
It is, of course, possible to do this without the purrr::map functions, using apply variants or by looping through each row of id_alt.
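For example, the sampling step might be written as a plain loop along these lines (a rough sketch of the same idea, without purrr):
sampled_sets <- vector("list", nrow(id_alt))
for (i in seq_len(nrow(id_alt))) {
  # sample 9 alternatives from those not chosen by chooser i
  nonchosen <- setdiff(alts$alt_id, id_alt$alt_chosen[i])
  sampled_sets[[i]] <- c(id_alt$alt_chosen[i], sample(nonchosen, 9))
}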
Sampling of alternatives is not currently implemented in the mlogit package. As stated previously, the solution is to generate a data.frame with a subset of alternatives and then use mlogit (importantly, with a formula that contains no intercepts). Note that mlogit can deal with unbalanced data, i.e. the number of alternatives doesn't have to be the same for all choice situations.
My recommendation would be to review the mlogit package.
Vignette:
https://cran.r-project.org/web/packages/mlogit/vignettes/mlogit2.pdf
The package has a set of example exercises that (in my opinion) are worth looking at:
https://cran.r-project.org/web/packages/mlogit/vignettes/Exercises.pdf
You may also want to take a look at the gmnl package (I have not used it)
https://cran.r-project.org/web/packages/gmnl/index.html
Multinomial Logit Models with Continuous and Discrete Individual Heterogeneity in R: The gmnl Package
See also the author Mauricio Sarrias' gmnl web page.
Question: what specific problem(s) are you trying to apply a multinomial logit model to? Suitably intrigued.
That question aside, I hope the above points you in the right direction.

Basic - T-Test -> Grouping Factor Must have Exactly 2 Levels

I am relatively new to R. For my assignment I have to start by conducting a t-test looking at the effect of a politician's party (Conservative or Labour) on their real gross wealth and real net wealth. That is, I have to attempt to estimate the effect of serving in office on wealth using a simple t-test.
The dataset is called takehome.dta
Labour and Tory are binary where 1 indicates that they serve for that party and 0 otherwise.
The variables for wealth are lnrealgross and lnrealnet.
I have imported and attached the dataset, but when I attempt to conduct a simple t-test, I get the following message: "grouping factor must have exactly 2 levels." Not quite sure where I'm going wrong. Any assistance would be appreciated!
Are you doing this:
t.test(y ~ x)
when you mean to do this:
t.test(y, x)
In general, use the ~ when you have data like
y <- 1:10
x <- rep(letters[1:2], each = 5)
and the , when you have data like
y <- 1:5
x <- 6:10
I assume you're doing something like:
y <- 1:10
x <- rep(1, 10)
t.test(y ~ x) # instead of t.test(y, x)
because the error suggests you have no variation in the grouping factor x.
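A quick way to check is to tabulate the grouping variable before testing:
table(x)  # a formula-based t.test needs exactly two distinct values here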
The difference between ~ and , is the test you end up running: y ~ x expects a single response y and a grouping factor x with exactly two levels (an independent-samples comparison of the two groups), whereas y, x treats y and x as two separate samples (independent by default; add paired = TRUE for dependent samples such as before/after measurements). The two forms are not interchangeable.
I was having a similar problem and, given the size of my dataset, did not realize that one of my y's had no values for one of my levels. I had taken a series of gene readings for two groups, and one gene had readings only for group 2 and not group 1. I hadn't even noticed, but for some reason this presented with the same error I would get with too many levels. The solution is to remove that y (in my case, that gene) from the analysis, and the error is solved.
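One quick check for that situation is to count the non-missing observations per group before testing (a sketch, with y as the measurement and x as the grouping factor):
table(x[!is.na(y)])  # a zero count for either level reveals the empty group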

Meta-analysis from mean effect sizes for overlapping samples

I am running a meta-analysis and use the metafor package to calculate Fisher z-transformed values from correlations:
meta1 <- escalc(ri=TESTR, ni=N, measure="ZCOR", data=subdata2)
As some of the studies I include in my meta-analysis overlap in samples (i.e. in study XY, 5 effect sizes are reported for the same N), I need to calculate means of the standardized z-values. To indicate overlapping samples, I gave all effect sizes IDs (in Excel) which are equal if the samples overlap.
For the final meta-analysis, I would like R to group the standardized effect sizes by ID and calculate their means.
So the idea is:
If Effect_SIZE_ID (a variable) is the same in two lines of my df, then sum both effect sizes and divide by two (i.e. calculate the mean), and provide the result in a new column.
As I am a full newbie, please let me know if you require further specification!
Thank you so much in advance.
Leon
Have a look at the summaryBy command in the doBy package:
mymean <- summaryBy(SD_effect ~ ID, FUN = mean, data = data)
This should work in general (if you provide some sample data, it is easy to check whether it does what you need).
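For the metafor workflow in the question specifically, the follow-up could be sketched like this (assuming meta1 from escalc() above and an Effect_SIZE_ID column; note that simply averaging the sampling variances vi is a rough simplification):
# average effect sizes (and, roughly, their variances) within each sample ID
agg <- aggregate(cbind(yi, vi) ~ Effect_SIZE_ID, data = meta1, FUN = mean)
# run the meta-analysis on the aggregated effect sizes
library(metafor)
res <- rma(yi, vi, data = agg)
summary(res)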

Why can't I find my factor names when I extract residuals?

I'm working with some election data trying to separate it by "state" and "election."
I ran a regression with fixed effects for state and year (as you'll see below), got my summary data, and have been trying to use the resid() function to extract the residuals.
m5 <- lm(demVote ~ state*year, data=presidentialElections)
plot(resid(m5) ~ fitted(m5))
resid.m5 <- resid(m5)
I think it all worked perfectly above. However, here's where I'm lost: if I do summary(resid.m5) (on the extracted residuals, or so I thought), I can't seem to find my factor names anymore. If I want to see my residuals per state or per year (or an average of them by state/year, for example), how do I access that with the resid() function? Thanks!
Just as was said in the comments before, you have to realize that the residuals are returned in the same order as the observations in your data set.
Here is an example using the iris data set that comes with every R installation (and a probably quite nonsensical regression):
data(iris)
m5 <- lm(Sepal.Length ~ Species*Sepal.Width, data=iris)
resid.m5 <- resid(m5)
dta.complete <- data.frame(iris, r.m5=resid.m5)
Here, the residuals are combined with the original data. It is perhaps a little unorthodox, but why not keep things together? Now you can use all the classical subsetting as much as you like. For instance:
with(dta.complete, by(r.m5, Species, mean))
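For the election data in the question, the same pattern with two factors might look like this (a sketch, assuming the state and year columns from the original lm() call):
dta.m5 <- data.frame(presidentialElections, r.m5 = resid(m5))
# mean residual for each state/year combination
aggregate(r.m5 ~ state + year, data = dta.m5, FUN = mean)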
Good luck!

"Simulating" a large number of regressions with different predictor values

Let's say I have the following data and I'm interested in examining some counterfactuals. In particular, I want to examine whether there would be changes in predicted income given a change in one of the predictors (e.g., gpa). The best way I can think to do this is to write a loop that runs this regression 1:n times. However, how do I also make adjustments to the data frame while running through the loop? I'm really hoping there is a base R function, or something in a package, that someone can point me to.
df = data.frame(year = c(2000,2001,2002,2003,2004,2005,2006,2007,2009,2010),
                income = c(100,50,70,80,50,40,60,100,90,80),
                age = c(26,30,35,30,28,29,31,34,20,35),
                gpa = c(2.8,3.5,3.9,4.0,2.1,2.65,2.9,3.2,3.3,3.1))
df
mod = lm(income ~ age + gpa, data=df)
summary(mod)
Here are some counterfactuals that may be worth considering when looking at the relationship between age, gpa, and income.
# What if everyone in the class had a lower/higher gpa?
df$gpa2 = df$gpa + 0.55
# What if one person had a lower/higher gpa?
df$gpa2[3] = 1.6
# What if the most recent employee/person had a lower/higher gpa?
df[10,4] = 4.0
With or without looping, what would be the best way to "simulate" a large (1000+) number of regression models in order to examine various counterfactuals, and then save those results in some data structure? Is there a "counterfactual analysis" package that could save me a bit of work?
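Absent a dedicated package, the loop described above could be sketched like this (hypothetical code: it perturbs gpa on a copy of the data each iteration and collects the coefficients):
set.seed(1)
n_sims <- 1000
results <- vector("list", n_sims)
for (i in seq_len(n_sims)) {
  df_i <- df
  df_i$gpa <- df_i$gpa + rnorm(nrow(df_i), sd = 0.25)  # hypothetical counterfactual tweak
  results[[i]] <- coef(lm(income ~ age + gpa, data = df_i))
}
coef_mat <- do.call(rbind, results)  # one row of coefficients per simulation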
