Kaplan-Meier Survival curve with 3 Plots - r

I am very new to RStudio, and I am currently learning how to do Kaplan-Meier survival curves.
Here are the columns that are needed and the information the columns contain:
Group: “normal”, “high”
Response: “responder” “non-responder”
Days: this is the time variable
Outcome: “0”, “1” (0 = censored & 1 = event)
I’m trying to plot 3 curves on one survival plot (Kaplan-Meier curve). I want to plot Normal (regardless of responder status) vs High Responder vs High Non-responder. Is there a way to do this?
I tried making subsets of the data (one including only those that are “normal” and another containing only those that are “high”) so that I could plot the normal group (1 curve, regardless of responder status) and use the other subset to plot high responder vs high non-responder (2 curves), but then I got stuck on how to combine them.

After you've created the first plot you can add the other lines over top of the first plot. Here's an example doing that with the built-in cancer dataset. I make the people aged 60 or older all in one group, but split the people under 60 into two groups based on their sex.
library(survival)
dat_60plus <- cancer[cancer$age >= 60,]
dat_under60 <- cancer[cancer$age < 60,]
mod_60plus <- survfit(Surv(time, status) ~ 1, data = dat_60plus)
mod_under60sex <- survfit(Surv(time, status) ~ sex, data = dat_under60)
plot(mod_60plus, conf.int = FALSE)
lines(mod_under60sex, conf.int = FALSE, col = c("blue", "red"))
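When overlaying curves this way, the axis range is fixed by the first plot() call, so it can help to set a common x-range up front and add a legend by hand. The following is a minimal sketch on the same cancer data; the legend labels assume the coding sex 1 = male, 2 = female used in this dataset, and the axis labels are my own.

```r
library(survival)

dat_60plus  <- cancer[cancer$age >= 60, ]
dat_under60 <- cancer[cancer$age < 60, ]
mod_60plus     <- survfit(Surv(time, status) ~ 1,   data = dat_60plus)
mod_under60sex <- survfit(Surv(time, status) ~ sex, data = dat_under60)

# Common x-range covering the follow-up times of both fits
x_max <- max(mod_60plus$time, mod_under60sex$time)

plot(mod_60plus, conf.int = FALSE, xlim = c(0, x_max),
     xlab = "Days", ylab = "Survival probability")
lines(mod_under60sex, conf.int = FALSE, col = c("blue", "red"))

# The legend has to be built manually because the curves come from two fits
legend("topright",
       legend = c("60 or older", "under 60, male", "under 60, female"),
       col = c("black", "blue", "red"), lty = 1, bty = "n")
```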
Alternatively you could create a new variable which identifies which group each individual is in and make a single model based on that variable. I think this is conceptually a bit simpler, so long as creating the variable isn't too complicated.
dat_all <- cancer
dat_all$myvar <- factor(ifelse(cancer$age >= 60,
"60plus",
ifelse(cancer$sex == 1,
"under60male",
"under60female")))
mod_all <- survfit(Surv(time, status) ~ myvar, data = dat_all)
plot(mod_all, conf.int = FALSE, col = c("black", "red", "blue"))
Created on 2023-01-25 with reprex v2.0.2
A potential downside of the first method is that if one of the groups has much longer survival time than the others, then you might have to do more adjusting of the graphical settings to make the plot look good. The second method should have more automatic checks for that.
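Applied to the columns described in the question (Group, Response, Days, Outcome), the second approach might look like the sketch below. The data frame here is simulated purely for illustration, and the name km_group is made up; only the column names and their codings come from the question.

```r
library(survival)

# Simulated stand-in for the asker's data (hypothetical values)
set.seed(1)
n <- 90
dat <- data.frame(
  Group    = sample(c("normal", "high"), n, replace = TRUE),
  Response = sample(c("responder", "non-responder"), n, replace = TRUE),
  Days     = round(rexp(n, rate = 1 / 300)) + 1,
  Outcome  = rbinom(n, 1, 0.7)   # 0 = censored, 1 = event
)

# One three-level factor: normal (any response), high responder, high non-responder
dat$km_group <- factor(
  ifelse(dat$Group == "normal", "normal",
         ifelse(dat$Response == "responder",
                "high responder", "high non-responder")),
  levels = c("normal", "high responder", "high non-responder")
)

fit <- survfit(Surv(Days, Outcome) ~ km_group, data = dat)
plot(fit, conf.int = FALSE, col = c("black", "blue", "red"),
     xlab = "Days", ylab = "Survival probability")
legend("topright", legend = levels(dat$km_group),
       col = c("black", "blue", "red"), lty = 1, bty = "n")
```

Setting the factor levels explicitly keeps the colour order and the legend order in sync.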

Related

how to change order of rows in ggforest without change reference

Consider the following made-up data set.
K<- c(2,2.2,2.4,2.6,2.8,3,3.5,3.8,4,4.2,4.4,4.8,5,5.2,5.4,5.6,5.8,6)
event <- c(1,1,1,0,1,1,1,1,1,0,1,1,1,1,1,0,1,1)
t<- c(8,10,25,10,8,22,30,16,32,30,32,20,8,12,14,22,10,6)
df<- data.frame(K,event,t)
I split the variable K (potassium) into a categorical variable with 3 levels (< 3, >= 3 and <5, >=5)
df$K_cut <- cut(K, c(0,3,5,6.5), right = F)
levels(df$K_cut) # [1] "[0,3)" "[3,5)" "[5,6.5)"
We perform a cox regression and represent it with ggforest
The reference category is potassium < 3
library(survival) # provides coxph() and Surv()
fit3 <- coxph(Surv(t, event) ~ K_cut, data = df)
fit3
library(survminer)
ggforest(fit3, data=df, fontsize=0.8)
We changed the reference category to be a normal potassium (3-5)
And when plotting it is now the correct reference, but it is plotted on the first line.
df$K_cut <- relevel(df$K_cut, ref = "[3,5)")
fit4<- coxph(Surv(t,event) ~ K_cut, data=df)
fit4
library(survminer)
ggforest(fit4, data=df, fontsize=0.8)
I would like to keep K 3-5 as the reference category but have it shown on the middle row, so that the plot reads, from top to bottom, K < 3, K between 3 and 5, and K >= 5.
The result should look like this (made by cutting, pasting, and retouching the figure ...)
Is there a way to do this with ggforest, or with another function or package, that is, to change the order of the rows and put the reference wherever you want?
In addition, can the spacing be changed so that the intervals and N= ... are not so close together, or can the displayed name of the variable be modified?
In the ggforest documentation, I have not seen any such options.
Regards and thanks
One option to achieve your desired result would be to relevel your factor after estimating your model and before calling ggforest:
library(survminer)
library(survival)
fit4 <- coxph(Surv(t, event) ~ K_cut, data = df)
df$K_cut <- factor(df$K_cut, levels = c("[0,3)", "[3,5)", "[5,6.5)"))
ggforest(fit4, data = df, fontsize = 0.8)
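As a quick check of why this can work, the sketch below refits the model from the question and confirms that the fitted coefficients still use [3,5) as the reference after the factor levels in df are put back in their natural order. It only inspects the model and the data. That ggforest takes its row order from the levels in the data is what the answer above relies on, and is treated as an assumption here.

```r
library(survival)

# Made-up data from the question
K     <- c(2, 2.2, 2.4, 2.6, 2.8, 3, 3.5, 3.8, 4, 4.2, 4.4, 4.8, 5, 5.2, 5.4, 5.6, 5.8, 6)
event <- c(1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1)
t     <- c(8, 10, 25, 10, 8, 22, 30, 16, 32, 30, 32, 20, 8, 12, 14, 22, 10, 6)
df <- data.frame(K, event, t)

# Fit with [3,5) as the reference category
df$K_cut <- relevel(cut(df$K, c(0, 3, 5, 6.5), right = FALSE), ref = "[3,5)")
fit4 <- coxph(Surv(t, event) ~ K_cut, data = df)

# Coefficients exist only for the non-reference levels
coef_names <- names(coef(fit4))
print(coef_names)

# Reorder the levels in the data only; the fitted model is not touched
df$K_cut <- factor(df$K_cut, levels = c("[0,3)", "[3,5)", "[5,6.5)"))
print(levels(df$K_cut))
```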

Visualising a three way interaction between two continuous variables and one categorical variable in R

I have a model in R that includes a significant three-way interaction between two continuous independent variables (IVContinuousA and IVContinuousB) and one categorical variable, IVCategorical (with two levels: Control and Treatment). The dependent variable is continuous (DV).
model <- lm(DV ~ IVContinuousA * IVContinuousB * IVCategorical)
You can find the data here
I am trying to find out a way to visualise this in R to ease my interpretation of it (perhaps in ggplot2?).
Somewhat inspired by this blog post, I thought that I could dichotomise IVContinuousB into high and low values (so it would be a two-level factor itself):
IVContinuousBHigh <- mean(IVContinuousB) + sd (IVContinuousB)
IVContinuousBLow <- mean(IVContinuousB) - sd (IVContinuousB)
I then planned to plot the relationship between DV and IV ContinuousA and fit lines representing the slopes of this relationship for different combinations of IVCategorical and my new dichotomised IVContinuousB:
IVCategoricalControl and IVContinuousBHigh
IVCategoricalControl and IVContinuousBLow
IVCategoricalTreatment and IVContinuousBHigh
IVCategoricalTreatment and IVContinuousBLow
My first question is does this sound like a viable solution to producing an interpretable plot of this three-way-interaction? I want to avoid 3D plots if possible as I don't find them intuitive... Or is there another way to go about it? Maybe facet plots for the different combinations above?
If it is an OK solution, my second question is how do I generate the data to predict the fit lines that represent the different combinations above?
Third question - does anyone have any advice as to how to code this up in ggplot2?
I posted a very similar question on Cross Validated but because it is more code related I thought I would try here instead (I will remove the CV post if this one is more relevant to the community :) )
Thanks so much in advance,
Sarah
Note that there are NAs (left as blanks) in the DV column and the design is unbalanced - with slightly different numbers of datapoints in the Control vs Treatment groups of the variable IVCategorical.
FYI I have the code for visualising a two-way interaction between IVContinuousA and IVCategorical:
A <- ggplot(data = data, aes(x = AOTAverage, y = SciconC, group = MisinfoCondition, shape = MisinfoCondition, col = MisinfoCondition)) + geom_point(size = 2) + geom_smooth(method = 'lm', formula = y ~ x)
But what I want is to plot this relationship conditional on IVContinuousB....
Here are a couple of options for visualizing the model output in two dimensions. I'm assuming that the goal here is to compare Treatment to Control.
library(tidyverse)
library(readxl) # provides read_excel(); not attached by library(tidyverse)
theme_set(theme_classic() +
theme(panel.background=element_rect(colour="grey40", fill=NA)))
dat = read_excel("Some Data.xlsx") # I downloaded your data file
mod <- lm(DV ~ IVContinuousA * IVContinuousB * IVCategorical, data=dat)
# Function to create prediction grid data frame
make_pred_dat = function(data=dat, nA=20, nB=5) {
nCat = length(unique(data$IVCategorical))
d = with(data,
data.frame(IVContinuousA=rep(seq(min(IVContinuousA), max(IVContinuousA), length=nA), nB*nCat),
IVContinuousB=rep(rep(seq(min(IVContinuousB), max(IVContinuousB), length=nB), each=nA), nCat),
IVCategorical=rep(unique(IVCategorical), each=nA*nB)))
d$DV = predict(mod, newdata=d)
return(d)
}
IVContinuousA vs. DV by levels of IVContinuousB
The roles of IVContinuousA and IVContinuousB can of course be switched here.
ggplot(make_pred_dat(), aes(x=IVContinuousA, y=DV, colour=IVCategorical)) +
geom_line() +
facet_grid(. ~ round(IVContinuousB,2)) +
ggtitle("IVContinuousA vs. DV, by Level of IVContinuousB") +
labs(colour="")
You can make a similar plot without faceting, but it gets difficult to interpret as the number of IVContinuousB levels increases:
ggplot(make_pred_dat(nB=3),
aes(x=IVContinuousA, y=DV, colour=IVCategorical, linetype=factor(round(IVContinuousB,2)))) +
geom_line() +
#facet_grid(. ~ round(IVContinuousB,2)) +
ggtitle("IVContinuousA vs. DV, by Level of IVContinuousB") +
labs(colour="", linetype="IVContinuousB") +
scale_linetype_manual(values=c("1434","11","62")) +
guides(linetype=guide_legend(reverse=TRUE))
Heat map of the model-predicted difference, DV treatment - DV control on a grid of IVContinuousA and IVContinuousB values
Below, we look at the difference between treatment and control at each pair of IVContinuousA and IVContinuousB.
ggplot(make_pred_dat(nA=100, nB=100) %>%
group_by(IVContinuousA, IVContinuousB) %>%
arrange(IVCategorical) %>%
summarise(DV = diff(DV)),
aes(x=IVContinuousA, y=IVContinuousB)) +
geom_tile(aes(fill=DV)) +
scale_fill_gradient2(low="red", mid="white", high="blue") +
labs(fill=expression(Delta*DV~(Treatment - Control)))
If you really want to avoid 3-d plotting, you could indeed turn one of the continuous variables into a categorical one for visualization purposes.
For the purpose of the answer, I used the Duncan data set from the package car, as it is of the same form as the one you described.
library(car)
library(ggplot2) # needed for the plot at the end
# the data
data("Duncan")
# the fitted model; education and income are continuous, type is categorical
lm0 <- lm(prestige ~ education * income * type, data = Duncan)
# turning education into high and low values (you can extend this to more
# levels)
edu_high <- mean(Duncan$education) + sd(Duncan$education)
edu_low <- mean(Duncan$education) - sd(Duncan$education)
# the values below should be used for predictions, each combination of the
# categories must be represented:
prediction_mat <- data.frame(income = Duncan$income,
education = rep(c(edu_high, edu_low),each =
nrow(Duncan)),
type = rep(levels(Duncan$type), each =
nrow(Duncan)*2))
predicted <- predict(lm0, newdata = prediction_mat)
# rearranging the fitted values and the values used for predictions
df <- data.frame(predicted,
income = Duncan$income,
edu_group =rep(c("edu_high", "edu_low"),each = nrow(Duncan)),
type = rep(levels(Duncan$type), each = nrow(Duncan)*2))
# plotting the fitted regression lines
ggplot(df, aes(x = income, y = predicted, group = type, col = type)) +
geom_line() +
facet_grid(. ~ edu_group)
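For the variable names used in the question, the same idea can be sketched with expand.grid(), which builds every combination of the predictor values without relying on vector recycling. The data below are simulated, since the original file is not reproduced here, and the names grid and B_level are made up for this example.

```r
library(ggplot2)

# Simulated stand-in for the asker's data (hypothetical values)
set.seed(42)
n <- 200
dat <- data.frame(
  IVContinuousA = rnorm(n),
  IVContinuousB = rnorm(n),
  IVCategorical = factor(sample(c("Control", "Treatment"), n, replace = TRUE))
)
dat$DV <- with(dat, 1 + 0.5 * IVContinuousA + 0.3 * IVContinuousB +
                 0.4 * IVContinuousA * IVContinuousB * (IVCategorical == "Treatment") +
                 rnorm(n))

mod <- lm(DV ~ IVContinuousA * IVContinuousB * IVCategorical, data = dat)

# Low and high values of IVContinuousB: mean -/+ 1 SD
b_mean <- mean(dat$IVContinuousB)
b_sd   <- sd(dat$IVContinuousB)

# Every combination of A values, the two B values, and the two groups
grid <- expand.grid(
  IVContinuousA = seq(min(dat$IVContinuousA), max(dat$IVContinuousA), length.out = 50),
  IVContinuousB = c(b_mean - b_sd, b_mean + b_sd),
  IVCategorical = levels(dat$IVCategorical)
)
grid$DV <- predict(mod, newdata = grid)
grid$B_level <- ifelse(grid$IVContinuousB > b_mean, "B high (+1 SD)", "B low (-1 SD)")

# One panel per B level, one line per group
ggplot(grid, aes(x = IVContinuousA, y = DV, colour = IVCategorical)) +
  geom_line() +
  facet_grid(. ~ B_level)
```

Because the lines are model predictions rather than separate fits to subsets, IVContinuousB stays continuous in the model and is only discretised for display.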

Combining 3 separate line plots into one line plot display

We are looking at pattern recognitions and making different variables
unusualsubjects <- rtaverages$subject_id[rtaverages$count < 5] # make a list of subjects without enough data.
rtaverages <- filter(rtaverages,!(subject_id %in% unusualsubjects)) # only include data from good subjects. ! = not. Put data from acceptable subjects right back in the same data frame
# Another example of filtering subjects: let's say we only wanted to analyze subjects with accuracies over 95%
accurateSubjects <- averages$subject_id[averages$accuracy > .95] #returns all of the subject_ids for subjects meeting an accuracy criterion
length(accurateSubjects) # tells us how many accurate subjects there are
goodSubjectdata<-filter(data,subject_id %in% accurateSubjects) # make a new data frame that contains only the data from accurate subjects
Code to conduct the actual ANOVA of the Response Times results
model<-ezANOVA(data=Data,dv=rt,within=c(set_size,target_presence,task),wid=subject_id) # You need to fill in the XXXs with the correct variable names within the variable containing all of the correct RTs. conduct a repeated measures ANOVA - dv = dependent variable. within = a list of all of the within subject variables. wid = variable that is used to group data by subject
model # show results of the ANOVA model
table1 <- tapply(X=Data$rt,INDEX=list(Data$task,Data$set_size),FUN=mean,trim=0.1)#find breakdown just of setsize and task - less broken down than the above tapply code, obtained just by deleting one item from the INDEX list "INDEX=list(rtaverages$target_presence,rtaverages$set_size,rtaverages$task)" above
table1 #show means so that one can begin to interpret the data. You'll break down rtaverages in different ways to get the different mean RTs that you need for your report
par(mar = c(4,4,4,0),mfrow=c(1, 2) ) # mfrow=c(1,2) creates two plots side by side
lineplot.CI(data=filter(rtaverages,task=="conjunctive"),x.factor=set_size,group=target_presence,x.cont=TRUE,response=rt,ylim=c(0,4000),x.leg=2,xlab="Conjunctive Set Size",ylab="RT") # produces a line graph with confidence intervals
lineplot.CI(data=filter(rtaverages,task=="disjunctive"),x.factor=set_size,group=target_presence,x.cont=TRUE,response=rt,ylim=c(0,4000),x.leg=2,xlab="Disjunctive Set Size",ylab="RT") # produces a line graph with confidence intervals
Currently attempting to put 3 lines onto one plot the following way:
# The next bit of code is to reproduce Treisman and Gelade's Figure 1, including best lines of fit
rtaverages$set_size_num<-sizes[rtaverages$set_size] # added a new column to rtaverage data frame which is the numeric/continuous version of the nominal/categorical set_size factor which will be useful for predicting RT from set_size
bySetSize<-group_by(rtaverages,set_size_num,task,target_presence) #collapse even more, so all subjects' data are combined together
collapsed<-summarize(bySetSize,rt=mean(rt,trim=0.1)) # make RT summary
collapsed # show what collapsed data look like. Note that there are now only 4 (set sizes) X 2 (tasks) X 2 (present/absent trials)=16 rows
cp<-filter(collapsed,task=="conjunctive" & target_presence=="present") # plot each of the four lines separated, filtering by the right type each time
cpf<-lm(data=cp,rt ~ set_size_num) # use a linear model to predict RT from set size. Use this to get out best fitting slope (estimate for set size) and intercept
summary(cpf) # make a summary of the linear regression model. cpf stands for: conjunctive, present fit
cp3<-filter(collapsed,task=="conjunctive" & target_presence=="absent")
caf<-lm(data=cp3,rt ~ set_size_num)
summary(caf)
cp1<-filter(collapsed,task=="disjunctive" & target_presence=="present")
dpf<-lm(data=cp1,rt ~ set_size_num)
summary(dpf)
cp2<-filter(collapsed,task=="disjunctive" & target_presence=="absent")
daf<-lm(data=cp2,rt ~ set_size_num)
summary(daf)
plot(cp$set_size_num,cp$rt,ylim=c(0,4000),xlim=c(0,30),pch=19,col="green",xlab="Set Size",ylab="Response Time (msec.)") # use a big enough range to capture all of the data
abline(cpf, col="green") # add the line with the slope and intercept derived from linear model
lines(cp3$set_size_num,cp3$rt,col="green")
abline(caf, col="green")
lines(cp1$set_size_num,cp1$rt,col="red")
abline(dpf, col="red")
lines(cp2$set_size_num,cp2$rt,col="red")
abline(daf, col="red")
legend(x=0,y=4000,pch=c(19,1,19,1),col=c("green","green","red","red"),cex=0.7,legend=c("Conjunctive present","Conjunctive absent","Disjunctive present","Disjunctive absent")) #Legend only should be plotted once, pch sets 4 symbols, and col sets 4 colors. cex < 1 so that legend box isn't too big
I got them to combine, but now the lines lost their format:
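One possible reading of the problem is that lines() only connects the values and draws no markers, so only the first series (drawn with plot()) keeps its symbols. The sketch below draws every series with points() and distinguishes the fitted lines by line type. The RT values are invented stand-ins for the collapsed summary, and the set sizes 1, 5, 15, 30 are an assumption.

```r
# Hypothetical mean RTs (ms) standing in for the `collapsed` summary
set_size <- c(1, 5, 15, 30)
series <- list(
  "Conjunctive present" = c(520, 640, 930, 1400),
  "Conjunctive absent"  = c(560, 820, 1500, 2500),
  "Disjunctive present" = c(480, 500, 510, 540),
  "Disjunctive absent"  = c(500, 530, 560, 600)
)
cols <- c("green", "green", "red", "red")
pchs <- c(19, 1, 19, 1)   # filled = present, open = absent
ltys <- c(1, 2, 1, 2)     # solid = present, dashed = absent

# Empty plot with a range big enough for all four series
plot(NULL, xlim = c(0, 30), ylim = c(0, 4000),
     xlab = "Set Size", ylab = "Response Time (msec.)")

for (i in seq_along(series)) {
  # Markers for the observed means, then the best-fitting line
  points(set_size, series[[i]], pch = pchs[i], col = cols[i])
  abline(lm(series[[i]] ~ set_size), col = cols[i], lty = ltys[i])
}

legend("topleft", legend = names(series),
       pch = pchs, lty = ltys, col = cols, cex = 0.7)
```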

R: plotting actual vs observed with mixed effects model

I have a data set:
date=c(56,54,112,230,250,134,114)
species=c("pink","blue","pink","green","black","orange","purple")
year=c(1901,2000,1958,1978,1992,1992,1994)
loc=c("forest","river","river","cloud","cloud")
peop=c(1.0,-6.2,1.55,0.45,-2.8,3.45,4.1)
per=c(1,5,63,9,45,1,2)
tem=c(12,65,14,35,26,24,22)
high=c(2500,3400,2600,2800,2546,2148,3654)
From this data set I created a mixed effects model:
model<-lmer(date~(1|species)+ high + year + tem*peop + per)
I need a graph that plots the actual observed values for date vs the predicted ones by the model. Thanks!
Construct data (it's better to have data in a data frame rather than lying around in your global workspace):
dd <-
data.frame(date=c(56,54,112,230,250,134,114),
species=c("pink","blue","pink","green","black",
"orange","purple"),
year=c(1901,2000,1958,1978,1992,1992,1994),
## I added a couple of values here since the
## length didn't match the other variables
loc=c("forest","river","river","cloud","cloud","forest","forest"),
peop=c(1.0,-6.2,1.55,0.45,-2.8,3.45,4.1),
per=c(1,5,63,9,45,1,2),
tem=c(12,65,14,35,26,24,22),
high=c(2500,3400,2600,2800,2546,2148,3654))
This model can't actually be fit with a data set this short, so I replicated it (still very artificial, but OK for illustration)
dd <- do.call(rbind,replicate(10,dd,simplify=FALSE))
library("lme4")
model<-lmer(date~(1|species)+ high + year + tem*peop + per,dd)
Plot expected (x-axis) vs observed (y axis):
plot(fitted(model),dd$date)
abline(a=0,b=1) ## 1-to-1 line
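Note that fitted() on an lmer model includes the estimated random effects. If predictions from the fixed effects alone are of interest, predict() with re.form = NA gives those instead. The sketch below shows both on lme4's built-in sleepstudy data rather than the toy data above, purely as an illustration.

```r
library(lme4)

# Random-intercept model on a dataset shipped with lme4
fit <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)

pred_cond <- fitted(fit)                  # includes the estimated random intercepts
pred_pop  <- predict(fit, re.form = NA)   # fixed effects only

plot(pred_cond, sleepstudy$Reaction,
     xlab = "Predicted", ylab = "Observed")
points(pred_pop, sleepstudy$Reaction, col = "red", pch = 3)
abline(a = 0, b = 1)   # 1-to-1 line
legend("topleft", legend = c("with random effects", "fixed effects only"),
       col = c("black", "red"), pch = c(1, 3), bty = "n")
```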

How to draw my function to plot with data in R

I have data on the response time of a web site as a function of the number of users hitting it at the same time.
For example:
10 users hitting at the same time have an (average) response time of 300 ms
20 users -> 450 ms, etc.
I imported the data into R and made a plot from the two data columns (users, response time).
I also used the loess function to draw a line through those points on the plot.
Here's the code that I have written:
users <- seq(5,250, by=5)
responseTime <- c(179.5,234.0,258.5,382.5,486.0,679.0,594.0,703.5,998.0,758.0,797.0,812.0,804.5,890.5,1148.5,1182.5,1298.0,1422.0,1413.5,1209.5,1488.0,1632.0,1715.0,1632.5,2046.5,1860.5,2910.0,2836.0,2851.5,3781.0,2725.0,3036.0,2862.0,3266.0,3175.0,3599.0,3563.0,3375.0,3110.0,2958.0,3407.0,3035.5,3040.0,3378.0,3493.0,3455.5,3268.0,3635.0,3453.0,3851.5)
data1 <- data.frame(users,responseTime)
data1
plot(data1, xlab="Users", ylab="Response Time (ms)")
lines(data1)
loess_fit <- loess(responseTime ~ users, data1)
lines(data1$users, predict(loess_fit), col = "green")
Here's my plot's image:
My questions are:
How can I draw my own nonlinear function on the same plot to compare it with the other lines?
Example: response_time (f(x)) = 30*users^2.
Also, how can I make predictions for the loess line and for my function and show them on the plot? For example, if I have data up to 250 users, predict up to 500 users.
If you know the equation of the line that you want to draw, then just define a variable for your prediction:
predictedResponseTime <- 30 * users ^ 2
lines(users, predictedResponseTime)
If the problem is that you want to fit a line, then you need to call a modelling function.
Since loess is a non-parametric model, it isn't appropriate to use it to make predictions outside of the range of your data.
In this case, a simple (ordinary least squares) linear regression using lm provides a reasonable fit.
model <- lm(responseTime ~ users)
prediction <- data.frame(users = 1:500)
prediction$responseTime <- predict(model, prediction)
with(prediction, lines(users, responseTime))
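Putting the pieces together, here is a self-contained sketch that widens the axes so the extrapolated part of the line is actually visible (the original plot only runs to 250 users). It also shows that, with default settings, predict() on a loess fit returns NA outside the observed range. The axis limits and the colour are my own choices.

```r
users <- seq(5, 250, by = 5)
responseTime <- c(179.5, 234.0, 258.5, 382.5, 486.0, 679.0, 594.0, 703.5, 998.0, 758.0,
                  797.0, 812.0, 804.5, 890.5, 1148.5, 1182.5, 1298.0, 1422.0, 1413.5, 1209.5,
                  1488.0, 1632.0, 1715.0, 1632.5, 2046.5, 1860.5, 2910.0, 2836.0, 2851.5, 3781.0,
                  2725.0, 3036.0, 2862.0, 3266.0, 3175.0, 3599.0, 3563.0, 3375.0, 3110.0, 2958.0,
                  3407.0, 3035.5, 3040.0, 3378.0, 3493.0, 3455.5, 3268.0, 3635.0, 3453.0, 3851.5)
data1 <- data.frame(users, responseTime)

# loess does not extrapolate by default: this is NA
loess_fit <- loess(responseTime ~ users, data1)
loess_at_300 <- predict(loess_fit, data.frame(users = 300))

# Linear model, predicted out to 500 users
model <- lm(responseTime ~ users, data = data1)
prediction <- data.frame(users = seq(5, 500, by = 5))
prediction$responseTime <- predict(model, newdata = prediction)

# Axes wide enough for both the data and the extrapolation
plot(data1, xlim = c(0, 500),
     ylim = range(c(data1$responseTime, prediction$responseTime)),
     xlab = "Users", ylab = "Response Time (ms)")
lines(data1$users, predict(loess_fit), col = "green")
lines(prediction$users, prediction$responseTime, col = "blue")
```

Whether a straight line remains a sensible description beyond 250 users is a modelling judgement that the data alone cannot settle.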
Another way to plot your curve when you know the underlying function is the curve function.
In your example of f(x)=30x^2:
plot(data1, xlab="Users", ylab="Response Time (ms)")
lines(data1)
lines(data1$users, predict(loess_fit), col = "green")
curve(30*x^2,col="red", add=TRUE) #Don't forget the add parameter.

Resources