Defining particular sub-groups in survival analysis in R - r

This is the code in R that produces a Kaplan-Meier plot of Overall Survival for a population broken down by Stage.
library(tidyverse)
library(forcats)
library(broom)
library(survival)
library(Hmisc)
library(gmodels)
library(lazyeval)
library(plotrix)
library(summariser)
library(magrittr)
library(survminer)
library(dplyr)
library(lattice)
library(Formula)
library(lubridate)
library(ggfortify)
library(readxl)
icccdata = read_excel("ICCC_All_20072016.xls")
head(icccdata)
km <- with(icccdata, Surv(Time, Status))
# STAGE specific OVERALL SURVIVAL
survival_object2 <- Surv(icccdata$Time, icccdata$CancerSurvival)
str(survival_object2)
my_survfit_STAGE_OS <- survfit(survival_object2 ~ Stage, data = icccdata)
print(my_survfit_STAGE_OS, print.rmean = TRUE)
dat_my_survfit_STAGE_OS <- fortify(my_survfit_STAGE_OS)
ggsurvplot(my_survfit_STAGE_OS, risk.table = TRUE, xlab = "Time (years)", censor = T)
The Stage data consists of the values 0, I, II, III, IV.
I want to be able to just show the values for Stage I, without having the Stage 0, II, III, or IV displayed. I'd appreciate some help with the code to separate out a single sub-group.
A

I would recommend building your own ggplot() instead of using ggsurvplot().
This can be done by using surv_summary() from the survminer package. This is also what ggsurvplot() is using behind the scenes.
E.g.:
df <- surv_summary(my_survfit_STAGE_OS)
df %>% filter(Stage == "I") %>%
ggplot(aes(x = time, y = surv, col = Stage)) +
geom_step()
If you want to plot Stages I and II, you could use %in%, like this:
df %>% filter(Stage %in% c("I", "II")) %>% ggplot(....)
Inside ggsurvplot(), a data frame from surv_summary() is passed to ggsurvplot_df(), which then builds the plot with ggplot() based on the user's options.
Check out the R source code here:
https://github.com/kassambara/survminer/blob/master/R/ggsurvplot_df.R
If you want an at-risk table, this can be created using ggrisktable().
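An alternative to filtering the plotting data is to subset the rows before fitting, which keeps ggsurvplot() and its risk table available. This is only a minimal sketch: it assumes the lung data set from the survival package as a stand-in for icccdata, with sex == 1 playing the role of Stage == "I". The names lung_male and fit_male are illustrative and not from the question.

```r
library(survival)
library(survminer)

# Stand-in data: lung from the survival package (sex == 1 plays the role of Stage == "I")
lung_male <- subset(lung, sex == 1)

# Fit a single Kaplan-Meier curve on the subset only
fit_male <- survfit(Surv(time, status) ~ 1, data = lung_male)

# ggsurvplot() now draws one curve, and the risk table covers only this group
ggsurvplot(fit_male, data = lung_male, risk.table = TRUE,
           xlab = "Time (days)", censor = TRUE)
```

With the original data, the equivalent would be to subset icccdata to Stage == "I" before calling survfit(). Because a Kaplan-Meier curve is estimated separately within each stratum, the curve for the subgroup should match the one from the stratified fit.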

Related

ggcorrplot(): Changing Variable Labels on X and Y Axis

I'm currently working on a correlation plot with gradient for a dataset involving social factors and outcomes such as grades.
The variable names are not that accessible, and I was wondering how to change, for example, "famrel" to "Family Relationship" on the axis.
I am using ggcorrplot() as well as ggplotly to add interactivity.
Any help would be much appreciated! I've been googling for two hours but cannot for the life of me find an applicable solution that doesn't involve altering the original dataframe.
df_corr <- select_if(df, is.numeric)
df_corr
corr <- round(cor(df_corr), 1)
p.mat <- cor_pmat(df_corr)
corr.plot <- ggcorrplot(corr,
hc.order = TRUE,
type = "lower",
)
ggplotly(corr.plot)
Above is my code; I have also attached a screenshot of the resulting chart.
I tried googling as well as searching stack overflow for the answer to my question, but was unsuccessful.
In addition to @Kat's option of directly editing the dataframe, you can also change the column names and row names directly in corr. Also be aware that select_if() has been superseded.
library(tidyverse)
library(ggcorrplot)
library(plotly)
data(mtcars)
df <- mtcars %>% select(mpg, gear, disp, drat)
df_corr <- df %>% select(where(is.numeric)) # changed here
corr <- round(cor(df_corr), 1)
colnames(corr) <- c("mpg_new", "gear_new", "disp_new", "drat_new")
rownames(corr) <- c("mpg_new", "gear_new", "disp_new", "drat_new")
p.mat <- cor_pmat(df_corr)
corr.plot <- ggcorrplot(corr, hc.order = TRUE, type = "lower")
ggplotly(corr.plot)
If you only want to change a single value on the axes (e.g., famrel), then you can edit a single rowname and a single column name like this:
colnames(corr)[colnames(corr) == "mpg"] <- "mpg_new"
rownames(corr)[rownames(corr) == "mpg"] <- "mpg_new"
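If several variables need readable labels, a named lookup vector avoids repeating the assignment for each one. The following is a sketch in base R on the same mtcars columns; the lookup axis_labels and the helper relabel() are hypothetical names introduced here for illustration.

```r
# Correlation matrix on a few mtcars columns (stand-in for the real data)
corr <- round(cor(mtcars[, c("mpg", "gear", "disp", "drat")]), 1)

# Hypothetical lookup: original column name -> label to show on the axes
axis_labels <- c(mpg = "Miles per gallon", disp = "Displacement")

# Replace the names found in the lookup and leave all others unchanged
relabel <- function(nm) {
  hit <- nm %in% names(axis_labels)
  nm[hit] <- unname(axis_labels[nm[hit]])
  nm
}

# Apply the same relabelling to both the row names and the column names
dimnames(corr) <- lapply(dimnames(corr), relabel)

rownames(corr)
```

The relabelled matrix can then be passed to ggcorrplot() as before; the original data frame is never modified.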

geom_qq() labeling each datapoint with the corresponding 'year'

library(fpp3)
library(hrbrthemes)
library(readr)
# load 'Fastenal' dataset and convert to tsibble
Fastenal <- read_csv('http://course1.winona.edu/bdeppa/FIN%20335/Datasets/Fastenal%20Sales%20(2004-2013).csv')
Fastenal %>% mutate(YearMonth = yearmonth('2004-01') + 0:119,
TotSales = Total.Sales/1000000) %>%
as_tsibble(index = YearMonth) -> Fastenal
# Fit AvSalesPD using SNAIVE model
Fastenal %>% model(SNAIVE(AvSalesPD/1000000)) -> fit
# Plot residual QQ plot, label each data points with corresponding 'Year'
fit %>% augment() %>%
ggplot(aes(sample = .resid/1000000)) +
geom_qq(aes(label = year(YearMonth)), geom = 'text') + geom_qq_line() + theme_ipsum()
I'm trying to draw a QQ plot using geom = 'text'.
When I run the code above, I get the following error message:
'Error: geom_text requires the following missing aesthetics: label'
I can't figure out what I did wrong.
I think the problem is that geom_qq() does not know how to use the label aesthetic you are providing. Specifying geom = "text" in the function call makes it draw text instead of points, but no label reaches geom_text for it to draw.
Thus we need to supply the labels. The following code should produce what you asked.
library(fpp3)
library(readr)
# load 'Fastenal' dataset and convert to tsibble
Fastenal <- read_csv('http://course1.winona.edu/bdeppa/FIN%20335/Datasets/Fastenal%20Sales%20(2004-2013).csv')
Fastenal %>% mutate(YearMonth = yearmonth('2004-01') + 0:119,
TotSales = Total.Sales/1000000) %>%
as_tsibble(index = YearMonth) -> Fastenal
# Fit AvSalesPD using SNAIVE model
Fastenal %>% model(SNAIVE(AvSalesPD/1000000)) -> fit
# Plot residual QQ plot, label each data points with corresponding 'Year'
fit <- fit %>% augment()
# here we define the labels to use.
labs <- year(fit$YearMonth)
labs <- labs[!is.na(fit$.resid)]
labs <- labs[order(na.omit(fit$.resid))] #EDIT: important! sort the labels for QQ-plots
ggplot(fit, aes(sample = .resid/1000000)) +
geom_qq(geom = "text", label = labs, angle = 90, size=1.5) +
geom_qq_line()
Note how we are manually creating the labels. We pass to geom_qq the same arguments as you passed, but outside the aes().
Since we are passing this information outside aesthetics, we have to take care that information in labs matches that of the aesthetics used.
Indeed, the points you are using to draw the QQ plot have some missing entries (12). These are omitted, so the resulting plot has 108 points (not 120). This is why labs must have exactly one entry for each non-missing point in your data (i.e. each .resid entry that is not NA).
If you do not filter NAs in the construction of labs you end up with an error and no plot.
The above code produces this plot:
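Another way to keep the labels aligned is to compute the QQ coordinates yourself, so each year stays attached to its row. This is a sketch on made-up residuals rather than the Fastenal data. It assumes normal theoretical quantiles from qnorm(ppoints(n)), which, as far as I know, is what geom_qq() uses by default. The data frames res and qq are illustrative names.

```r
library(ggplot2)

# Synthetic stand-in for the augmented model output (values are made up)
set.seed(42)
res <- data.frame(
  year  = rep(2004:2013, each = 12),
  resid = c(rep(NA, 12), rnorm(108))  # the first year has no residuals
)

# Drop missing residuals, sort them, and attach theoretical normal quantiles
qq <- res[!is.na(res$resid), ]
qq <- qq[order(qq$resid), ]
qq$theoretical <- qnorm(ppoints(nrow(qq)))

# The year travels with each row, so label can be an ordinary aesthetic
ggplot(qq) +
  geom_text(aes(x = theoretical, y = resid, label = year), size = 2, angle = 90) +
  geom_qq_line(aes(sample = resid))
```

Since the labels live in the same data frame as the coordinates, there is no separate vector that has to be filtered and sorted in step with the residuals.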

Change position of legend in plot of pec object

I am trying to plot the prediction error curve from the pec package, but I can't change the legend position and size. Here is an example from the pec package:
library(rms)
library(pec)
data(pbc)
pbc <- pbc[sample(1:NROW(pbc),size=100),]
f1 <- psm(Surv(time,status!=0)~edema+log(bili)+age+sex+albumin,data=pbc)
f2 <- coxph(Surv(time,status!=0)~edema+log(bili)+age+sex+albumin,data=pbc,x=TRUE,y=TRUE)
f3 <- cph(Surv(time,status!=0)~edema+log(bili)+age+sex+albumin,data=pbc,surv=TRUE)
brier <- pec(list("Weibull"=f1,"CoxPH"=f2,"CPH"=f3),data=pbc,formula=Surv(time,status!=0)~1)
print(brier)
plot(brier)
But this shows a big legend in the middle of the plot.
I also tried:
plot(brier, legend = "topright")
class(brier)
But this doesn't show a legend.
How can I change the position of the legend? Also, is it possible to plot this graph using ggplot?
I think I got what you want using ggplot2. The idea is to pick the elements of your brier object that contain the data for the plot, build a data frame from them, and plot it.
library(ggplot2)
# packages for the pipe and pivot_longer; you can do it with base functions, I just prefer these
library(tidyr)
library(dplyr)
df <- do.call(cbind, brier[["AppErr"]]) # contains y values for each model
df <- cbind(brier[["time"]], df) # values of the x axis
colnames(df)[1] <- "time"
df <- as.data.frame(df) %>% pivot_longer(cols = 2:last_col(), names_to = "models", values_to = "values") # pivot table to long format makes it easier to use ggplot
ggplot(data = df, aes(x = time, y = values, color = models)) +
geom_line() # I assume you know how to customize axis names etc.
Output:
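Once the curves are drawn with ggplot2, the legend position and size can be controlled through theme(). The sketch below uses a small made-up data frame (df_demo, hypothetical values) in the same long format as the pivoted data above, so it runs without the pec objects.

```r
library(ggplot2)

# Made-up long-format data in the same shape as the pivoted data frame
df_demo <- data.frame(
  time   = rep(0:4, times = 2),
  values = c(0, 0.05, 0.10, 0.14, 0.17, 0, 0.06, 0.12, 0.15, 0.19),
  models = rep(c("CoxPH", "Weibull"), each = 5)
)

p <- ggplot(df_demo, aes(x = time, y = values, color = models)) +
  geom_line() +
  labs(x = "Time", y = "Prediction error", color = "Model") +
  theme(
    legend.position = "top",                 # or "bottom", "left", "right", "none"
    legend.text     = element_text(size = 8), # smaller legend entries
    legend.title    = element_text(size = 9)
  )
p
```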

Plot 3D regression surface using plot_ly

I am trying to plot a regression model for a data set with measurements of "mue" (the friction coefficient for a braking train), train speed, and train temperature. I built a simple regression model using lm so I could test plotting with plot_ly. The attached plot shows the original data as blue markers, but the plotted surface doesn't look right. It should look more like a regression surface. I also plotted in 2D to make sure the regression actually works, and it does. I've posted the code below and am wondering if anyone here has any advice. I've been trying everything I can find online and none of it seems to work. I think the issue might have to do with building a grid? I've tried that a few times, but I always get error messages about vectors not matching up, etc. I'd be happy to post that as well if needed. Thank you!
3D Regression Plot
2D Regression Plot
set.seed(123) # seed the random number generator
training.samples <- avg.frame$avg.mue %>%
createDataPartition(p = 0.8, list = FALSE) # pick 80 percent of data
train.data <- avg.frame[training.samples, ] # 80 percent is training data
test.data <- avg.frame[-training.samples, ] # 20 percent is test data
model_2 <- lm(avg.mue ~ avg.speed + avg.temp, data = train.data)
vals <- predict(model_2, train.data)
avg.mue <- matrix(vals, nrow = length(test.data$avg.speed), ncol = length(test.data$avg.temp))
plane <- avg.mue
p <- plot_ly(data = train.data, z = ~avg.mue, x = ~avg.speed, y = ~avg.temp, opacity = 0.6) %>%
add_markers()
p %>% add_surface(z = ~plane, x = ~avg.speed, y = ~avg.temp, showscale = FALSE) %>%
layout(showlegend = FALSE)
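One possible cause of a surface that looks wrong is predicting at the observed rows instead of on a regular grid. The sketch below shows the grid approach on synthetic data, since avg.frame is not available. The data frame dat, its coefficients, and the names speed_seq and temp_seq are all assumptions made for illustration. It relies on plotly's surface convention that rows of z correspond to y and columns to x.

```r
library(plotly)

# Synthetic stand-in for avg.frame (the numbers are made up)
set.seed(123)
dat <- data.frame(avg.speed = runif(50, 20, 120),
                  avg.temp  = runif(50, 50, 400))
dat$avg.mue <- 0.45 - 0.001 * dat$avg.speed - 0.0002 * dat$avg.temp +
  rnorm(50, sd = 0.01)

model <- lm(avg.mue ~ avg.speed + avg.temp, data = dat)

# Regular grid spanning the observed range of both predictors
speed_seq <- seq(min(dat$avg.speed), max(dat$avg.speed), length.out = 30)
temp_seq  <- seq(min(dat$avg.temp),  max(dat$avg.temp),  length.out = 25)
grid <- expand.grid(avg.speed = speed_seq, avg.temp = temp_seq)
pred <- predict(model, newdata = grid)

# expand.grid() varies avg.speed fastest, so fill by row:
# rows index avg.temp (y), columns index avg.speed (x)
plane <- matrix(pred, nrow = length(temp_seq), ncol = length(speed_seq),
                byrow = TRUE)

p <- plot_ly() %>%
  add_markers(data = dat, x = ~avg.speed, y = ~avg.temp, z = ~avg.mue,
              opacity = 0.6) %>%
  add_surface(x = speed_seq, y = temp_seq, z = plane, showscale = FALSE) %>%
  layout(showlegend = FALSE)
p
```

The key difference from the code in the question is that the matrix dimensions come from the grid sequences, so the number of predictions always equals nrow times ncol and nothing is recycled.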

show only 0-90% or 0-95% percentile

Here is my code and the resulting plot. Due to some outliers, the x-axis is very long. Is there a simple way to filter df$foo to only the 0-90% or 0-95% percentile range in R, so that I can plot only the normal values? Thanks.
df <- read.csv('~/Downloads/foo.tsv', sep='\t', header=F, stringsAsFactors=FALSE)
names(df) <- c('a', 'foo', 'goo')
df$foo <- as.numeric(df$foo)
goodValue <- df$foo
summary(goodValue)
hist(goodValue,main="Distribution",xlab="foo",breaks=20)
Maybe this is what you're looking for?
a = c(rnorm(99), 50) #create some data
quant <- as.numeric(quantile(a, c(0, 0.9))) #get 0 and 0.9 quantile
hist(a[a >= quant[1] & a <= quant[2]]) #histogram only data within these bounds (inclusive, so the minimum is kept)
Suppose you wanted to examine the diamonds data set (I don't have your data).
library(ggplot2)
library(dplyr)
diamonds %>% ggplot() + geom_histogram(aes(x = price))
You might decide to examine the deciles of your data and, since the tail is not of interest to you, throw away the uppermost decile. You could do that as follows, with a free scale so that you can see what is happening within each decile.
diamonds %>% mutate(ntile = ntile(price, 10)) %>%
filter(ntile < 10) %>%
ggplot() + geom_histogram(aes(x = price)) +
facet_wrap(~ntile, scales = "free_x")
But be cautious: although seeing your data at a much finer granularity has its benefits, notice how you can now barely tell that your data is roughly exponentially distributed (with a heavy tail, as commodity price data often are).
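For the original question, trimming at the 95th percentile takes only a couple of lines in base R. This is a sketch on simulated values standing in for df$foo; the vector foo and the cutoff name upper are illustrative.

```r
# Simulated stand-in for df$foo: mostly small values plus a few large outliers
set.seed(1)
foo <- c(rexp(990), rexp(10, rate = 0.01))

# 95th percentile; na.rm = TRUE guards against values that failed as.numeric()
upper <- quantile(foo, 0.95, na.rm = TRUE)

# Keep only the non-missing values at or below the cutoff
trimmed <- foo[!is.na(foo) & foo <= upper]

hist(trimmed, main = "Distribution (0-95th percentile)", xlab = "foo", breaks = 20)
```

Changing 0.95 to 0.90 gives the 0-90% range. The outliers are only hidden from the plot, not removed from the underlying data.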

Resources