Why can't I find my factor names when I extract residuals? - r

I'm working with some election data trying to separate it by "state" and "election."
I ran a regression with fixed effects for state and year (as you'll see below), got my summary data, and have been trying to use the resid() function to extract the residuals.
m5 <- lm(demVote ~ state*year, data=presidentialElections)
plot(resid(m5) ~ fitted(m5))
resid.m5 <- resid(m5)
I think it all worked above just perfectly. However, here's where I'm lost - if I do summary(resid.m5) (where I put the extracted residuals, or so I thought), I can't seem to find my factor names anymore. If I want to see my residuals per state or per year (or an average of them by state/year, for example) then how do I access that with the resid() function? Thanks!

Just as was said in the comments before, you have to realize that the residuals that are being returned are in the same order as your observations in the data set.
Here is an example using the iris data set that comes with every R installation (and a probably quite nonsensical regression):
data(iris)
m5 <- lm(Sepal.Length ~ Species*Sepal.Width, data=iris)
resid.m5 <- resid(m5)
dta.complete <- data.frame(iris, r.m5=resid.m5)
Here, the residuals are combined with the original data. It is perhaps a little unorthodox, but why not keep things together. Now you can use all the classical subsetting as much as you like. For instance:
with(dta.complete, by(r.m5, Species, mean))
Good luck!

Related

lm() saving residuals with group_by with R- confused spss user

This is complete reEdit of my orignal question
Let's assume I'm working on RT data gathered in a repeated measure experiment. As part of my usual routine I always transform RT to natural logarytms and then compute a Z score for each RT within each partipant adjusting for trial number. This is typically done with a simple regression in SPSS syntax:
split file by subject.
REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT rtLN
/METHOD=ENTER trial
/SAVE ZRESID.
split file off.
To reproduce same procedure in R generate data:
#load libraries
library(dplyr); library(magrittr)
#generate data
ob<-c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3)
ob<-factor(ob)
trial<-c(1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6)
rt<-c(300,305,290,315,320,320,350,355,330,365,370,370,560,565,570,575,560,570)
cond<-c("first","first","first","snd","snd","snd","first","first","first","snd","snd","snd","first","first","first","snd","snd","snd")
#Following variable is what I would get after using SPSS code
ZreSPSS<-c(0.4207,0.44871,-1.7779,0.47787,0.47958,-0.04897,0.45954,0.45487,-1.7962,0.43034,0.41075,0.0407,-0.6037,0.0113,0.61928,1.22038,-1.32533,0.07806)
sym<-data.frame(ob, trial, rt, cond, ZreSPSS)
I could apply a formula (blend of Mark's and Daniel's solution) to compute residuals from a lm(log(rt)~trial) regression but for some reason group_by is not working here
sym %<>%
group_by (ob) %>%
mutate(z=residuals(lm(log(rt)~trial)),
obM=mean(rt), obSd=sd(rt), zRev=z*obSd+obM)
Resulting values clearly show that grouping hasn't kicked in.
Any idea why it didn't work out?
Using dplyr and magrittr, you should be able to calculate z-scores within individual with this code (it breaks things into the groups you tell it to, then calculates within that group).
experiment %<>%
group_by(subject) %>%
mutate(rtLN = log(rt)
, ZRE1 = scale(rtLN))
You should then be able to do use that in your model. However, one thing that may help your shift to R thinking is that you can likely build your model directly, instead of having to make all of these columns ahead of time. For example, using lme4 to treat subject as a random variable:
withRandVar <-
lmer(log(rt) ~ cond + (1|as.factor(subject))
, data = experiment)
Then, the residuals should already be on the correct scale. Further, if you use the z-scores, you probably should be plotting on that scale. I am not actually sure what running with the z-scores as the response gains you -- it seems like you would lose information about the degree of difference between the groups.
That is, if the groups are tight, but the difference between them varies by subject, a z-score may always show them as a similar number of z-scores away. Imagine, for example, that you have two subjects, one scores (1,1,1) on condition A and (3,3,3) on condition B, and a second subject that scores (1,1,1) and (5,5,5) -- both will give z-scores of (-.9,-.9,-.9) vs (.9,.9,.9) -- losing the information that the difference between A and B is larger in subject 2.
If, however, you really want to convert back, you can probably use this to store the subject means and sds, then multiply the residuals by subjSD and add subjMean.
experiment %<>%
group_by(subject) %>%
mutate(rtLN = log(rt)
, ZRE1 = scale(rtLN)
, subjMean = mean(rtLN)
, subjSD = sd(rtLN))
mylm <- lm(x~y)
rstandard(mylm)
This returns the standardized residuals of the function. To bind these to a variable you can do:
zresid <- rstandard(mylm)
EXAMPLE:
a<-rnorm(1:10,10)
b<-rnorm(1:10,10)
mylm <- lm(a~b)
mylm.zresid<-rstandard(mylm)
See also:
summary(mylm)
and
mylm$coefficients
mylm$fitted.values
mylm$xlevels
mylm$residuals
mylm$assign
mylm$call
mylm$effects
mylm$qr
mylm$terms
mylm$rank
mylm$df.residual
mylm$model

Where does R store subset information in glm object?

I'm trying to do some post-processing of a large number of glm models that I am working with, but I need to extract information about the data subset from the glm objects.
As a toy example:
x <- rnorm(100)
y <- rnorm(100,x,0.5)
s<-sample(c(T,F),100,replace=T)
myGlm <- glm(y~x, subset= s)
From this, I need to know which of the 100 observations were used by getting the information out of myGlm. I thought that myGlm$data would have the subsetted data, but it actually has all 100 observations in it. I looked through str(myGlm) to no avail. However, it is quite clear that somewhere in the object, information about the subset s is stored.
This seems like it should be totally trivial!
Thanks in advance.
as.numeric(rownames(myGlm$model))

Many linear regressions

As part of my data analysis (on time series), I am checking for correlation between log-returns and realized volatility.
My data consists of time series spanning several years for around hundred different companies (large zoo object, ~2 MB filesize). To check for the above-mentioned correlation, I have used the following code to calculate several rolling variances (a.k.a. realized volatility):
rollvar5 <- sapply(returns, rollVar, n=5, na.rm=TRUE)
rollvar10 <- sapply(returns, rollVar, n=10, na.rm=TRUE)
using the simple fTrading function rollVar. I have then converted the rolling variances to zoo objects and added the date index (by exporting to the results to csv files and manually adding the date, and then using read.zoo - not very sophisticated but it works just fine).
Now I wish to create around 100 linear regression models, each linking the log-returns of a company to the realized volatility to the specified company. On an individual basis, this would look like the following:
lm_rollvar5 <- lm(returns$[5:1000,1] ~ rollvar5[5:1000,1])
lm_rollvar10 <- lm(returns$[10:1000,1] ~ rollvar10[10:1000,1])
This works without problems.
Now I wish to extend this to automatically create the linear regression models for all 100 companies. What I've tried was a simple for-loop:
NC <- ncol(returns)
for(i in 1:NC){
lm_rollvar5 <- lm(returns[5:1000],i] ~ rollvar5[5:1000,i])
summary(lm_rollvar5)
lm_rollvar10 <- lm(returns[10:1000],i] ~ rollvar10[10:1000,i])
summary(lm_rollvar10)
}
Is there any way I could optimize my approach? (i.e. how could I save all regression results in a simple way). Since now the for-loop just outputs hundreds of regression results, which is quite ineffective in analyzing the results.
I also tried to use the apply function but I am unsure how to use it in this case, since there are several timeseries objects (the returns and the rolling variances are saved in different objects as you can see).
As to your question how you could save all regression results in a simple way, this is a bit difficult to answer given that we don't know what you need to do, and what you consider "simple". However, you could define a list outside the loop and store each regression model in this list so that you can access the models without refitting them later. Try e.g.
NC <- ncol(returns)
lm_rollvar5 <- vector(mode="list", length=NC)
lm_rollvar10 <- vector(mode="list", length=NC)
for(i in 1:NC){
lm_rollvar5[[i]] <- lm(returns[5:1000],i] ~ rollvar5[5:1000,i])
lm_rollvar10[[i]] <- lm(returns[10:1000],i] ~ rollvar10[10:1000,i])
}
This gives you the fitted model for firm i at the i-th position in the list. In the same manner, you can also save the output of summary. Or you do sth like
my.summaries_5 <- lapply(lm_rollvar5, summary)
which gives you a list of summaries.

Regression coefficients by group in dataframe R

I have data of various companies' financial information organized by company ticker. I'd like to regress one of the columns' values against the others while keeping the company constant. Is there an easy way to write this out in lm() notation?
I've tried using:
reg <- lmList(lead2.dDA ~ paudit1 + abs.d.GINDEX + logcapx + logmkvalt +
logmkvalt2|pp, data=reg.df)
where pp is a vector of company names, but this returns coefficients as though I regressed all the data at once (and did not separate by company name).
A convenient and apparently little-known syntax for estimating separate regression coefficients by group in lm() involves using the nesting operator, /. In this case it would look like:
reg <- lm(lead2.dDA ~ 0 + pp/(paudit1 + abs.d.GINDEX + logcapx +
logmkvalt + logmkvalt2), data=reg.df)
Make sure that pp is a factor and not a numeric. Also notice that the overall intercept must be suppressed for this to work; in the new formulation, we have a different "intercept" for each group.
A couple comments:
Although the regression coefficients obtained this way will match those given by lmList(), it should be noted that with lm() we estimate only a single residual variance across all the groups, whereas lmList() would estimate separate residual variances for each group.
Like I mentioned in my earlier comment, the lmList() syntax that you gave looks like it should have worked. Since you say it didn't, this leads me to expect that really the problem is something else (although it's hard to tell what without a reproducible example), and so it seems likely that the solution I posted will fail for you as well, for the same unknown reasons. If you want more detailed guidance, please provide more information; help us help you.

coxph stratified by year

I think this should be something very easy, but I can't quite get my head around it.
I have the following code:
library(survival)
cox <- coxph(Surv(SURV, DEAD)~YEAR, data)
summary(cox)
but I would like to have the result split down into the individual years.
Here's what the SPSS syntax and solution would look like:
COXREG surv /STATUS=dead(1) /CONTRAST (year)=Indicator(1)
/METHOD=ENTER year /PRINT=CI(95)
/CRITERIA=PIN(.05) POUT(.10) ITERATE(20).
EXECUTE.
and the same thing in STATA:
xi: stcox i.year
Here's the output of
str(data)
You did not show us str(data) or how to construct a reproducible example the gave "data". I suspect that "YEAR" will turn out to be a numeric vector. If it had been a factor variable you would have seen an Intercept and n-1 coefficients. The Interecpt coefficient would then have been the same as the "year" and the other coefficients would have matched up to the year(n) values. You told the SPSS engine that "year" was an "INDICATOR" but you didn't offer the same courtesy to the R engine.
Try this:
data$year.ind <- factor(data$year) # equivalent of SPSS INDICATOR
# or SAS /CLASS
cox.mdl <- coxph(Surv(SURV, DEAD)~YEAR, data)
as.matrix(coef(coc.mdl)
summary(cox.mdl)
R often splits computing and display of results to allow more freedom. I assume you need the predict function of coxph (?predict.coxph).
There are examples at the bottom of the documentation page, most likely you want
predict(cox, type="terms")

Resources