How to set data for R survival analysis - r

I'm sorry if the question looks silly, but I have a small data set which I would like to manipulate with function "survfit" of R package "survival", and, well, I don't know to set a proper dataframe usable by "survfit"; data are as follows:
time number_at_risk number_death number_censored
1 25 10 0 2
2 28 8 1 0
3 33 7 1 0
4 37 6 0 1
5 41 5 1 0
6 43 4 0 1
7 48 3 0 3
And now, if I run the usual syntax survfit(Surv(time, number_censored) ~ 1, data = data), it gives me the warning In Surv(time, number_censored) : Invalid status value, converted to NA.
Obviously, the data are not properly organized. So, how should I set my dataframe?
Thanks.

time must be a vector with the times where an event happened and status an indicator if that event is a censorship or death (0/1).
In your example the data should look like this:
times = c(1,1,2,3,4,5,6,7,7,7)
status = c(0,0,1,1,0,1,0,0,0,0)
survfit(Surv(times,status)~1)

Related

Episode splitting in survival analysis by the timing of an event in R

Is it possible to split episode by a given variable in survival analysis in R, similar to in STATA using stsplit in the following way: stsplit var, at(0) after(time=time)?
I am aware that the survival package allows one to split episode by given cut points such as c(0,5,10,15) in survSplit, but if a variable, say time of divorce, differs by each individual, then providing cutpoints for each individual would be impossible, and the split would have to be based on the value of a variable (say graduation, or divorce, or job termination).
Is anyone aware of a package or know a resource I might be able to tap into?
Perhaps Epi package is what you are looking for. It offers multiple ways to cut/split the follow-up time using the Lesix objects. Here is the documentation of cutLesix().
After some poking around, I think tmerge() in the survival package can achieve what stsplit var can do, which is to split episodes not just by a given cut points (same for all observations), but by when an event occurs for an individual.
This is the only way I knew how to split data
id<-c(1,2,3)
age<-c(19,20,29)
job<-c(1,1,0)
time<-age-16 ## create time since age 16 ##
data<-data.frame(id,age,job,time)
id age job time
1 1 19 1 3
2 2 20 1 4
3 3 29 0 13
## simple split by time ##
## 0 to up 2 years, 2-5 years, 5+ years ##
data2<-survSplit(data,cut=c(0,2,5),end="time",start="start",
event="job")
id age start time job
1 1 19 0 2 0
2 1 19 2 3 1
3 2 20 0 2 0
4 2 20 2 4 1
5 3 29 0 2 0
6 3 29 2 5 0
7 3 29 5 13 0
However, if I want to split by a certain variable, such as when each individuals finished school, each person might have a different cut point (finished school at different ages).
## split by time dependent variable (age finished school) ##
d1<-data.frame(id,age,time,job)
scend<-c(17,21,24)-16
d2<-data.frame(id,scend)
## create start/stop time ##
base<-tmerge(d1,d1,id=id,tstop=time)
## create time-dependent covariate ##
s1<-tmerge(base,d2,id=id,
finish=tdc(scend))
id age time job tstart tstop finish
1 1 19 3 1 0 1 0
2 1 19 3 1 1 3 1
3 2 20 4 1 0 4 0
4 3 29 13 0 0 8 0
5 3 29 13 0 8 13 1
I think tmerge() is more or less comparable with stsplit function in STATA.

boxplot with filtered values

I am new to coding and want to create boxplots based on my data.
For that, I want to filter a boxplot by specific values:
My data structure is called "Auswertungen" and is structured like this:
Participant Donation Treatment Manipulation
1 0 1 passed
2 0.4 2 passed
3 0.2 2 failed
4 0 3 failed
5 0.3 3 passed
now I want to plot the Donations based on the Treatments, using a boxplot. I want to graphs, one with all data points and one without those who failed the manipulation.
I found something like
boxplot(Donation ~ Treatment)
with(subset(Auswertungen, Manipulation == "passed"), boxplot(Donation ~ Treatment))
but the second formula is exactly showing me the same boxplots as before, so I guess the subset is not working?
Got it, sorry.
boxplot(Donation ~ Treatment)
boxplot(Donation[Manipulation == "passed"] ~ Treatment[Manipulation == "passed"]
If your data is roughly structured like this:
set.seed(222)
Donation <- abs(rnorm(20))
Treatment <- sample(1:3, 20, replace = T)
Manipulation <- sample(c("passed", "failed"), 20, replace = T)
df <- data.frame(Donation, Treatment, Manipulation)
df
Donation Treatment Manipulation
1 1.487757090 3 passed
2 0.001891901 2 failed
3 1.381020790 1 failed
4 0.380213631 3 passed
5 0.184136230 1 failed
6 0.246895883 3 passed
7 1.215560910 3 failed
8 1.561405098 1 failed
9 0.427310197 2 passed
10 1.201023506 3 passed
11 1.052458495 2 passed
12 1.305063566 2 failed
13 0.692607634 3 failed
14 0.602648854 3 failed
15 0.197753074 2 failed
16 1.185874517 2 passed
17 2.005512989 3 passed
18 0.007509885 2 passed
19 0.519490356 2 failed
20 0.746295471 2 failed
And you want to have two boxplots, you can first define a two-panel layout:
par(mfrow = c(1,2))
And then fill your two boxplots into it, the first one unfiltered:
boxplot(df$Donation ~ df$Treatment)
and the second filtered on the condition that Manipulation=="passed":
boxplot((df$Donation[df$Manipulation=="passed"] ~ df$Treatment[df$Manipulation=="passed"]))
The result would be something like this:

How can I filter out rows from linear regression based on another linear regression

I would like to conduct a linear regression that will have three steps: 1) Running the regression on all data points 2) Taking out the 10 outiers as found by using the absolute distanse value of rstandard 3) Running the regression again on the new data frame.
I know how to do it manually but these is very awkwarding. Is there a way to do it automatically? Can it be done for taking out columns as well?
Here is my toy data frame and code (I'll take out 2 top outliers):
df <- read.table(text = "userid target birds wolfs
222 1 9 7
444 1 8 4
234 0 2 8
543 1 2 3
678 1 8 3
987 0 1 2
294 1 7 16
608 0 1 5
123 1 17 7
321 1 8 7
226 0 2 7
556 0 20 3
334 1 6 3
225 0 1 1
999 0 3 11
987 0 30 1 ",header = TRUE)
model<- lm(target~ birds+ wolfs,data=df)
rstandard <- abs(rstandard(model))
df<-cbind(df,rstandard)
g<-subset(df,rstandard > sort(unique(rstandard),decreasing=T)[3])
g
userid target birds wolfs rstandard
4 543 1 2 3 1.189858
13 334 1 6 3 1.122579
modelNew<- lm(target~ birds+ wolfs,data=df[-c(4,13),])
I don't see how you could do this without estimating two models, the first to identify the most influential cases and the second on the data without those cases. You could simplify your code and avoid cluttering the workspace, however, by doing it all in one shot, with the subsetting process embedded in the call to estimate the "final" model. Here's code that does this for the example you gave:
model <- lm(target ~ birds + wolfs,
data = df[-(as.numeric(names(sort(abs(rstandard(lm(target ~ birds + wolfs, data=df))), decreasing=TRUE)))[1:2]),])
Here, the initial model, evaluation of influence, and ensuing subsetting of the data are all built into the code that comes after the first data =.
Also, note that the resulting model will differ from the one your code produced. That's because your g did not correctly identify the two most influential cases, as you can see if you just eyeball the results of abs(rstandard(lm(target ~ birds + wolfs, data=df))). I think it has to do with your use of unique(), which seems unnecessary, but I'm not sure.

How to perform a repeated G.test in R?

I downloaded the R package RVAideMemoire in order to use the G.test.
> head(bio)
Date Trt Treated Control Dead DeadinC AliveinC
1 23Ap citol 1 3 1 0 13
2 23Ap cital 1 5 3 1 6
3 23Ap gerol 0 3 0 0 9
4 23Ap mix 0 5 0 0 8
5 23Ap cital 0 5 1 0 13
6 23Ap cella 0 5 0 1 4
So, I make subsets of the data to look at each treatment, because the G.test result will need to be pooled for each one.
datamix<-subset(bio, Trt=="mix")
head(datamix)
Date Trt Treated Control Dead DeadinC AliveinC
4 23Ap mix 0 5 0 0 8
8 23Ap mix 0 5 1 0 8
10 23Ap mix 0 2 3 0 5
20 23Ap mix 0 0 0 0 18
25 23Ap mix 0 2 1 0 15
28 23Ap mix 0 1 0 0 12
So for the G.test(x) to work if x is a matrix, it must be constructed as 2 columns containing numbers, with 1 row per population. If I use the apply() function I can run the G,test on each row if my data set contains only two columns of numbers. I want to look only at the treated and control for example, but I'm not sure how to omit columns so the G.test can ignore the headers, and other columns. I've tried using the following but I get an error:
apply(datamix, 1, G.test)
Error in match.fun(FUN) : object 'G.test' not found
I have also thought about trying to use something like this rather than creating subsets.
by(bio, Trt, rowG.test)
The G.test spits out this, when you compare two numbers.
G-test for given probabilities
data: counts
G = 0.6796, df = 1, p-value = 0.4097
My other question is, is there someway to add all the df and G values that I get for each row (once I'm able to get all these numbers) for each treatment? Is there also some way to have R report the G, df and p-values in a table to be summed rather than like above for each row?
Any help is hugely appreciated.
You're really close. This seems to work (hard to tell with such a small sample though).
by(bio,bio$Trt,function(x)G.test(as.matrix(x[,3:4])))
So first, the indices argument to by(...) (the second argument) is not evaluated in the context of bio, so you have to specify bio$Trt instead of just Trt.
Second, this will pass all the columns of bio, for each unique value of bio$Trt, to the function specified in the third argument. You need to extract only the two columns you want (columns 3 and 4).
Third, and this is a bit subtle, passing x[,3:4] to G.test(...) causes it to fail with an unintelligible error. Looking at the code, G.test(...) requires a matrix as it's first argument, whereas x[,3:4] in the code above is a data.frame. So you need to convert with as.matrix(...).

drawing multiple boxplots from imputed data in R

I have an imputed dataset that I'm analysing, and I'm trying to draw boxplots, but I can't wrap my head around the proper procedure.
my data (a sample, original has 20 observations per imputation and 13 vars per group, all values range from 0 to 25):
.imp .id FTE_RM FTE_PD OMZ_RM OMZ_PD
1 1 25 25 24 24
1 2 4 0 2 6
1 3 11 5 3 2
1 4 12 3 3 3
2 1 20 15 15 15
2 2 4 1 2 3
2 3 0 0 0 6
2 4 20 0 0 0
.imp signifies the imputation round, .id the identifer for each observartion.
I want to draw all the FTE_* variables in a single plot (and the `OMZ_* in another), but wonder what to do with all the imputations, can I just include all values? The imputated data now has 500 observations. With for instance an ANOVA I'd need to average the ANOVA results by 5 to get back to 20 observations. But is this needed for a boxplot as well, since I only deal with medians, means, max. and min.?
Such as:
data_melt <- melt(df[grep("^FTE_", colnames(df))])
ggplot(data_melt, aes(x=variable, y=value))+geom_boxplot()
I've played a couple of times with ggplot, but consider myself a complete newbie.
I assume you want to keep the identifier for .imp and .id after melting so rather put:
data_melt <- melt(df,c(".imp",".id"))
For completeness of the dataframe it probably helps to introduce a column that identifies the type - FTE vs. OMZ:
data_melt$type <- ifelse(grepl("FTE",data_melt$variable),"FTE","OMZ")
Having this data.frame you can, for example, facet on the type (alternatively you can just use a simple filter statement on data_melt to restrict to one type):
ggplot(data_melt, aes(x=variable, y=value))+geom_boxplot()+facet_wrap(~type,scales="free_x")
This would look like this.
EDIT: fixed the data mess-up

Resources