coxph stratified by year - r

I think this should be something very easy, but I can't quite get my head around it.
I have the following code:
library(survival)
cox <- coxph(Surv(SURV, DEAD)~YEAR, data)
summary(cox)
but I would like to have the result split down into the individual years.
Here's what the SPSS syntax and solution would look like:
COXREG surv /STATUS=dead(1) /CONTRAST (year)=Indicator(1)
/METHOD=ENTER year /PRINT=CI(95)
/CRITERIA=PIN(.05) POUT(.10) ITERATE(20).
EXECUTE.
and the same thing in STATA:
xi: stcox i.year
Here's the output of
str(data)

You did not show us str(data) or how to construct a reproducible example that gave "data". I suspect that "YEAR" will turn out to be a numeric vector, in which case coxph fits a single slope for it. If it had been a factor variable you would have seen n-1 coefficients, one for each year compared against the reference (first) level; a Cox model has no intercept term, since the baseline hazard takes its place. You told the SPSS engine that "year" was an "INDICATOR" but you didn't offer the same courtesy to the R engine.
Try this:
data$year.ind <- factor(data$YEAR) # equivalent of SPSS INDICATOR
# or SAS /CLASS
cox.mdl <- coxph(Surv(SURV, DEAD) ~ year.ind, data = data) # use the factor, not the numeric YEAR
as.matrix(coef(cox.mdl))
summary(cox.mdl)
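To see the numeric-versus-factor difference without the original data, here is a minimal sketch on simulated data. The column names mirror the question, but the values are made up, so only the number and naming of the coefficients is meaningful, not their size.

```r
library(survival)

# Simulated stand-in for the question's data (an assumption; not the real data)
set.seed(42)
n <- 300
data <- data.frame(
  YEAR = sample(2001:2004, n, replace = TRUE),  # numeric year
  SURV = rexp(n, rate = 0.1),                   # follow-up time
  DEAD = rbinom(n, 1, 0.7)                      # 1 = event, 0 = censored
)

# Numeric YEAR: a single log-hazard slope per additional year
fit.num <- coxph(Surv(SURV, DEAD) ~ YEAR, data = data)

# Factor YEAR: one coefficient per year, each relative to the first level
data$year.ind <- factor(data$YEAR)
fit.fac <- coxph(Surv(SURV, DEAD) ~ year.ind, data = data)

length(coef(fit.num))   # number of coefficients with numeric YEAR
names(coef(fit.fac))    # coefficient names with factor YEAR
```

The factor fit is the one that corresponds to the SPSS Indicator contrast and to Stata's i.year.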

R often splits computing and display of results to allow more freedom. I assume you need the predict function of coxph (?predict.coxph).
There are examples at the bottom of the documentation page, most likely you want
predict(cox, type="terms")

Related

lm() saving residuals with group_by with R- confused spss user

This is a complete re-edit of my original question
Let's assume I'm working on RT data gathered in a repeated-measures experiment. As part of my usual routine I always transform RT to natural logarithms and then compute a Z score for each RT within each participant, adjusting for trial number. This is typically done with a simple regression in SPSS syntax:
split file by subject.
REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT rtLN
/METHOD=ENTER trial
/SAVE ZRESID.
split file off.
To reproduce same procedure in R generate data:
#load libraries
library(dplyr); library(magrittr)
#generate data
ob<-c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3)
ob<-factor(ob)
trial<-c(1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6)
rt<-c(300,305,290,315,320,320,350,355,330,365,370,370,560,565,570,575,560,570)
cond<-c("first","first","first","snd","snd","snd","first","first","first","snd","snd","snd","first","first","first","snd","snd","snd")
#Following variable is what I would get after using SPSS code
ZreSPSS<-c(0.4207,0.44871,-1.7779,0.47787,0.47958,-0.04897,0.45954,0.45487,-1.7962,0.43034,0.41075,0.0407,-0.6037,0.0113,0.61928,1.22038,-1.32533,0.07806)
sym<-data.frame(ob, trial, rt, cond, ZreSPSS)
I could apply a formula (blend of Mark's and Daniel's solution) to compute residuals from a lm(log(rt)~trial) regression but for some reason group_by is not working here
sym %<>%
group_by (ob) %>%
mutate(z=residuals(lm(log(rt)~trial)),
obM=mean(rt), obSd=sd(rt), zRev=z*obSd+obM)
Resulting values clearly show that grouping hasn't kicked in.
Any idea why it didn't work out?
Using dplyr and magrittr, you should be able to calculate z-scores within individual with this code (it breaks things into the groups you tell it to, then calculates within that group).
experiment %<>%
group_by(subject) %>%
mutate(rtLN = log(rt)
, ZRE1 = scale(rtLN))
You should then be able to use that in your model. However, one thing that may help your shift to R thinking is that you can likely build your model directly, instead of having to make all of these columns ahead of time. For example, using lme4 to treat subject as a random variable:
library(lme4)
withRandVar <-
lmer(log(rt) ~ cond + (1|as.factor(subject))
, data = experiment)
Then, the residuals should already be on the correct scale. Further, if you use the z-scores, you probably should be plotting on that scale. I am not actually sure what running with the z-scores as the response gains you -- it seems like you would lose information about the degree of difference between the groups.
That is, if the groups are tight, but the difference between them varies by subject, a z-score may always show them as a similar number of z-scores away. Imagine, for example, that you have two subjects, one scores (1,1,1) on condition A and (3,3,3) on condition B, and a second subject that scores (1,1,1) and (5,5,5) -- both will give z-scores of (-.9,-.9,-.9) vs (.9,.9,.9) -- losing the information that the difference between A and B is larger in subject 2.
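The two-subject example above can be checked directly. This is a small sketch using the scores from that paragraph; subj1 and subj2 are just illustrative names.

```r
# Scores from the example: three trials in condition A, then three in B
subj1 <- c(1, 1, 1, 3, 3, 3)   # A and B differ by 2
subj2 <- c(1, 1, 1, 5, 5, 5)   # A and B differ by 4

# z-score within each subject (scale() returns a matrix, so flatten it)
z1 <- as.vector(scale(subj1))
z2 <- as.vector(scale(subj2))

round(z1, 2)
round(z2, 2)
all.equal(z1, z2)   # the two subjects become indistinguishable
```

After scaling, the larger A-to-B gap in the second subject is no longer visible.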
If, however, you really want to convert back, you can probably use this to store the subject means and sds, then multiply the residuals by subjSD and add subjMean.
experiment %<>%
group_by(subject) %>%
mutate(rtLN = log(rt)
, ZRE1 = scale(rtLN)
, subjMean = mean(rtLN)
, subjSD = sd(rtLN))
mylm <- lm(x~y)
rstandard(mylm)
This returns the standardized residuals of the fitted model. To bind these to a variable you can do:
zresid <- rstandard(mylm)
EXAMPLE:
a <- rnorm(10, mean = 10)
b <- rnorm(10, mean = 10)
mylm <- lm(a~b)
mylm.zresid<-rstandard(mylm)
See also:
summary(mylm)
and
mylm$coefficients
mylm$fitted.values
mylm$xlevels
mylm$residuals
mylm$assign
mylm$call
mylm$effects
mylm$qr
mylm$terms
mylm$rank
mylm$df.residual
mylm$model
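For the per-subject case in the question, a base-R sketch might look like the following. It rests on one assumption: that SPSS's ZRESID is the raw residual divided by the residual standard error of the fit. rstandard() also adjusts each residual for its leverage, so it generally gives different numbers. zresid_one is a hypothetical helper name, and the data are copied from the question.

```r
# Data from the question
sym <- data.frame(
  ob    = factor(rep(1:3, each = 6)),
  trial = rep(1:6, times = 3),
  rt    = c(300, 305, 290, 315, 320, 320,
            350, 355, 330, 365, 370, 370,
            560, 565, 570, 575, 560, 570)
)
ZreSPSS <- c(0.4207, 0.44871, -1.7779, 0.47787, 0.47958, -0.04897,
             0.45954, 0.45487, -1.7962, 0.43034, 0.41075, 0.0407,
             -0.6037, 0.0113, 0.61928, 1.22038, -1.32533, 0.07806)

# Hypothetical helper: fit one subject's regression and return
# residual / residual standard error (assumed to match ZRESID)
zresid_one <- function(d) {
  fit <- lm(log(rt) ~ trial, data = d)
  residuals(fit) / sigma(fit)
}

# Fit separately per subject, then put the pieces back in row order
sym$z <- unsplit(lapply(split(sym, sym$ob), zresid_one), sym$ob)

# Largest disagreement with the SPSS column
max(abs(sym$z - ZreSPSS))
```

If the assumption about ZRESID holds, the largest disagreement should be at the level of rounding in the SPSS column. Splitting explicitly also sidesteps whatever kept group_by from taking effect.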

Repeated measures ANOVA - different results for SPSS versus R

I am trying to run a repeated-measures ANOVA using R and compare it to the SPSS output, and the results differ a lot! Maybe I made a mistake somewhere, but I cannot figure it out.
So some sample data:
id is the subject. Every subject makes one rating for three items (res_1, res_2 and res_3). I want to compare an overall effect of item.
id<-c(1,2,3,4,5,6)
res_1<-c(1,1,1,2,2,1)
res_2<-c(4,5,2,4,4,3)
res_3<-c(4,5,6,3,6,6)
## wide format for spss
table<-as.data.frame(cbind(id, res_1, res_2, res_3))
## reshape to long format
library(reshape2)
table<-melt(table, id.vars="id")
colnames(table)<-c("id", "item", "rating")
aov.out = aov(rating ~ item+ Error(id/item), data=table)
summary(aov.out)
And here is my SPSS code (from wide format data)
GLM item_1 item_2 item_3
/WSFACTOR=factor1 3 Polynomial
/METHOD=SSTYPE(3)
/PRINT=DESCRIPTIVE
/CRITERIA=ALPHA(.05)
/WSDESIGN=factor1.
The results I get from
R: p value 0.0526 (error:within)
and SPSS: p value 0.003 (test of within subject effect)
Does anyone have a suggestion that may explain the difference?
If I do a non-parametric Friedman test, I get the same results in SPSS and R.
Actually, when looking at my data, the summary(aov.out) is the same as SPSS's "test of within subjects contrast" (but I learned to look at the test of within subjects effect).
Thanks!
There's a lot of stuff out there; I am a bit surprised that your google for 'spss versus R anova' did not bring you to links explaining about the difference in sums-of-squares between SPSS (type-III) and R (type-I) as well as difference in how contrasts are handled.
These are the top two results that I found:
http://myowelt.blogspot.ca/2008/05/obtaining-same-anova-results-in-r-as-in.html and
https://stats.stackexchange.com/questions/40958/testing-anova-hypothesis-with-contrasts-in-r-and-spss
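Separately from the sums-of-squares point, one thing that may be worth checking (this is an assumption on my part, not something stated above) is that id is numeric in the question's data frame. Inside Error(), a numeric id is treated as a covariate rather than as a grouping factor. The sketch below rebuilds the same ratings in long format with base R and makes id a factor; aov.fac is an illustrative name.

```r
# Same ratings as in the question, built directly in long format
id     <- factor(rep(1:6, times = 3))             # subject as a factor
item   <- factor(rep(c("res_1", "res_2", "res_3"), each = 6))
rating <- c(1, 1, 1, 2, 2, 1,                     # res_1
            4, 5, 2, 4, 4, 3,                     # res_2
            4, 5, 6, 3, 6, 6)                     # res_3
long <- data.frame(id, item, rating)

# Within-subject effect of item, with subject as the error stratum
aov.fac <- aov(rating ~ item + Error(id/item), data = long)
summary(aov.fac)
```

With id as a factor, the item effect is tested against the item-by-subject residual on 2 and 10 degrees of freedom, which is the layout of a within-subjects effect test. Whether the p-value then matches SPSS exactly may still depend on sphericity corrections.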

How to interpolate and `predict` using mgcv::gam?

I've begun by mastering how to use splines to interpolate a 1-dimensional function.
model = spline(bdp[,4]~bdp[,1])
I could then use
predict(model, c(0))
to predict the function value at point 0.
Then I've searched the Internet to find something to spline 3-dimensional data and I came across an answer on stackoverflow suggesting that mgcv::gam is the best choice.
And so I tried:
model=gam(bdp[,4]~s(bdp[,1],bdp[,2],bdp[,3]))
and then I did:
predict(model, newdata=c(0,0,0), type="response")
hoping that it will return a value of spline interpolation for point (0,0,0).
It calculated for a while and returned lots of multidimensional data that I could not understand.
I must be doing something wrong. What do I do to receive a value for a single point from gam object? And, just to be sure, can you agree/disagree that gam is the right choice to interpolate splines for 3D data or would you suggest something else?
I'm adding a reproducible example.
This is a data file (please unpack in c:/r/) https://www.sendspace.com/file/b4mazl
# install.packages("mgcv")
library(mgcv)
bdp = read.table("c:/r/temp_bdp.csv")
bdg=gam(bdp[,4]~s(bdp[,1],bdp[,2],bdp[,3]))
#this returns lots of data, not just function value that I wanted.
predict(bdg, newdata=data.frame(0,0,0,0), type="response")
Minimal reproducible example:
tmp = t(matrix(runif(4*200),4))
tmpgam=gam(tmp[,4]~s(tmp[,1],tmp[,2],tmp[,3]))
predict(tmpgam, newdata=data.frame(0,0,0,0), type="response")
For
predict(bdg, newdata=data.frame(0,0,0,0), type="response")
it returns a lot of numbers and warns that newdata didn't have enough data
for
predict(bdg, c(0,0,0,0), type="response")
it returns nothing and also warns about the same.
So with nearly all types of models you fit, if you plan to use the predict function, it's best to use a "proper" formula with column names rather than using matrix/data.frame slices. The reason is that when predict runs, it matches the values in newdata to the model using the names in both, so they should match exactly. When you index the data.frame like that, it creates weird names in the model. So the best way to fit the model and predict is
bdg <- gam(V4~s(V1,V2,V3), data=bdp)
predict(bdg, newdata=data.frame(V1=0, V2=0, V3=0))
# 1
# 85431440244
That's assuming
names(bdp)
# [1] "V1" "V2" "V3" "V4"
So here we fit with "V1","V2","V3" and newdata has columns "V1","V2" and "V3"
So I've only focused on the R-coding part. The question of whether this is an appropriate analysis is better suited to https://stats.stackexchange.com/
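Since the data file from the question may not be available, here is a self-contained sketch of the same pattern on random data, along the lines of the question's minimal example. The values are simulated, so the prediction itself means nothing; the point is that a named formula plus a named one-row newdata returns exactly one value.

```r
library(mgcv)

# Simulated stand-in for bdp: 200 rows, columns named V1..V4
set.seed(1)
bdp <- as.data.frame(matrix(runif(4 * 200), ncol = 4))
names(bdp)

# Fit with column names in the formula, not bdp[, i] slices
bdg <- gam(V4 ~ s(V1, V2, V3), data = bdp)

# One row of newdata, with the same names as the smooth's covariates
pt <- data.frame(V1 = 0.5, V2 = 0.5, V3 = 0.5)
pred <- predict(bdg, newdata = pt, type = "response")
length(pred)
```

Only the covariates need to appear in newdata; the response column V4 is not required for prediction.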

Why can't I find my factor names when I extract residuals?

I'm working with some election data trying to separate it by "state" and "election."
I ran a regression with fixed effects for state and year (as you'll see below), got my summary data, and have been trying to use the resid() function to extract the residuals.
m5 <- lm(demVote ~ state*year, data=presidentialElections)
plot(resid(m5) ~ fitted(m5))
resid.m5 <- resid(m5)
I think it all worked above just perfectly. However, here's where I'm lost - if I do summary(resid.m5) (where I put the extracted residuals, or so I thought), I can't seem to find my factor names anymore. If I want to see my residuals per state or per year (or an average of them by state/year, for example) then how do I access that with the resid() function? Thanks!
Just as was said in the comments before, you have to realize that the residuals that are being returned are in the same order as your observations in the data set.
Here is an example using the iris data set that comes with every R installation (and a probably quite nonsensical regression):
data(iris)
m5 <- lm(Sepal.Length ~ Species*Sepal.Width, data=iris)
resid.m5 <- resid(m5)
dta.complete <- data.frame(iris, r.m5=resid.m5)
Here, the residuals are combined with the original data. It is perhaps a little unorthodox, but why not keep things together. Now you can use all the classical subsetting as much as you like. For instance:
with(dta.complete, by(r.m5, Species, mean))
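As a further sketch on the same iris model, aggregate() gives the per-group summaries as a data frame. Because this model includes a separate intercept for each Species, the mean residual within each species comes out as essentially zero, so the within-group spread is usually the more informative summary. by.species and spread are illustrative names.

```r
m5 <- lm(Sepal.Length ~ Species * Sepal.Width, data = iris)

# Residuals carry the row names of the data, in the same order
head(names(resid(m5)))

dta.complete <- data.frame(iris, r.m5 = resid(m5))

# Mean residual per species (near zero here, by construction of the model)
by.species <- aggregate(r.m5 ~ Species, data = dta.complete, FUN = mean)

# Spread of the residuals per species
spread <- aggregate(r.m5 ~ Species, data = dta.complete, FUN = sd)
spread
```

The same idea applies to the election model: bind resid(m5) to presidentialElections and aggregate by state or year.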
Good luck!

Using survfit object's formula in survdiff call

I'm doing some survival analysis in R, and looking to tidy up/simplify my code.
At the moment I'm doing several steps in my data analysis:
make a Surv object (time variable with indication as to whether each observation was censored);
fit this Surv object according to a categorical predictor, for plotting/estimation of median survival time processes; and
calculate a log-rank test to ask whether there is evidence of "significant" differences in survival between the groups.
As an example, here is a mock-up using the lung dataset in the survival package from R. So the following code is similar enough to what I want to do, but much simplified in terms of the predictor set (which is why I want to simplify the code, so I don't make inconsistent calls across models).
library(survival)
# Step 1: Make a survival object with time-to-event and censoring indicator.
# Following works with defaults as status = 2 = dead in this dataset.
# Create survival object
lung.Surv <- with(lung, Surv(time=time, event=status))
# Step 2: Fit survival curves to object based on patient sex, plot this.
lung.survfit <- survfit(lung.Surv ~ lung$sex)
print(lung.survfit)
plot(lung.survfit)
# Step 3: Calculate log-rank test for difference in survival objects
lung.survdiff <- survdiff(lung.Surv ~ lung$sex)
print(lung.survdiff)
Now this is all fine and dandy, and I can live with this but would like to do better.
So my question is around step 3. What I would like to do here is to be able to use information in the formula from the lung.survfit object to feed into the calculation of the differences in survival curves: i.e. in the call to survdiff. And this is where my domitable [sic] programming skills hit a wall. Below is my current attempt to do this: I'd appreciate any help that you can give! Once I can get this sorted out I should be able to wrap a solution up in a function.
lung.survdiff <- survdiff(parse(text=(lung.survfit$call$formula)))
## Which returns following:
# Error in survdiff(parse(text = (lung.survfit$call$formula))) :
# The 'formula' argument is not a formula
As I commented above, I actually sorted out the answer to this shortly after having written this question.
So step 3 above could be replaced by:
lung.survdiff <- survdiff(formula(lung.survfit$call$formula))
But as Ben Barnes points out in the comment to the question, the formula from the survfit object can be more directly extracted with
lung.survdiff <- survdiff(formula(lung.survfit))
Which is exactly what I wanted and hoped would be available -- thanks Ben!
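Wrapping this up in a function, as mentioned in the question, could look like the sketch below. It differs from the code above in one respect: the variables are passed through a data argument rather than as lung$sex, which keeps the formula usable inside a function. logrank_from_fit is a hypothetical helper name.

```r
library(survival)

# Fit the curves with a data argument so the formula holds bare column names
lung.survfit <- survfit(Surv(time, status) ~ sex, data = lung)

# Hypothetical helper: reuse the survfit formula for the log-rank test
logrank_from_fit <- function(fit, data) {
  survdiff(formula(fit), data = data)
}

lung.survdiff <- logrank_from_fit(lung.survfit, lung)
print(lung.survdiff)
```

Both calls now share a single formula, so the grouping cannot drift apart between the curves and the test.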
