How to truncate Kaplan Meier curves when number at risk is < 10 - r

For a publication in a peer-reviewed scientific journal (http://www.redjournal.org), we would like to prepare Kaplan-Meier plots. The journal has the following specific guidelines for these plots:
"If your figures include curves generated from analyses using the Kaplan-Meier method or the cumulative incidence method, the following are now requirements for the presentation of these curves:
That the number of patients at risk is indicated;
That censoring marks are included;
That curves be truncated when there are fewer than 10 patients at risk; and
An estimate of the confidence interval should be included either in the figure itself or the text.”
Here, I illustrate my problem with the veteran dataset (https://github.com/tidyverse/reprex is great!).
We can adress 1, 2 and 4 easily with the survminer package:
library(survival)
library(survminer)
#> Warning: package 'survminer' was built under R version 3.4.3
#> Loading required package: ggplot2
#> Loading required package: ggpubr
#> Warning: package 'ggpubr' was built under R version 3.4.3
#> Loading required package: magrittr
fit.obj <- survfit(Surv(time, status) ~ celltype, data = veteran)
ggsurvplot(fit.obj,
conf.int = T,
risk.table ="absolute",
tables.theme = theme_cleantable())
I have, however, a problem with requirement 3 (truncate curves when there are fewer than 10 patients at risk). I see that all the required information is available in the survfit object:
library(survival)
fit.obj <- survfit(Surv(time, status) ~ celltype, data = veteran)
summary(fit.obj)
#> Call: survfit(formula = Surv(time, status) ~ celltype, data = veteran)
#>
#> celltype=squamous
#> time n.risk n.event survival std.err lower 95% CI upper 95% CI
#> 1 35 2 0.943 0.0392 0.8690 1.000
#> 8 33 1 0.914 0.0473 0.8261 1.000
#> 10 32 1 0.886 0.0538 0.7863 0.998
#> 11 31 1 0.857 0.0591 0.7487 0.981
#> 15 30 1 0.829 0.0637 0.7127 0.963
#> 25 29 1 0.800 0.0676 0.6779 0.944
#> 30 27 1 0.770 0.0713 0.6426 0.924
#> 33 26 1 0.741 0.0745 0.6083 0.902
#> 42 25 1 0.711 0.0772 0.5749 0.880
#> 44 24 1 0.681 0.0794 0.5423 0.856
#> 72 23 1 0.652 0.0813 0.5105 0.832
#> 82 22 1 0.622 0.0828 0.4793 0.808
#> 110 19 1 0.589 0.0847 0.4448 0.781
#> 111 18 1 0.557 0.0861 0.4112 0.754
#> 112 17 1 0.524 0.0870 0.3784 0.726
#> 118 16 1 0.491 0.0875 0.3464 0.697
#> 126 15 1 0.458 0.0876 0.3152 0.667
#> 144 14 1 0.426 0.0873 0.2849 0.636
#> 201 13 1 0.393 0.0865 0.2553 0.605
#> 228 12 1 0.360 0.0852 0.2265 0.573
#> 242 10 1 0.324 0.0840 0.1951 0.539
#> 283 9 1 0.288 0.0820 0.1650 0.503
#> 314 8 1 0.252 0.0793 0.1362 0.467
#> 357 7 1 0.216 0.0757 0.1088 0.429
#> 389 6 1 0.180 0.0711 0.0831 0.391
#> 411 5 1 0.144 0.0654 0.0592 0.351
#> 467 4 1 0.108 0.0581 0.0377 0.310
#> 587 3 1 0.072 0.0487 0.0192 0.271
#> 991 2 1 0.036 0.0352 0.0053 0.245
#> 999 1 1 0.000 NaN NA NA
#>
#> celltype=smallcell
#> time n.risk n.event survival std.err lower 95% CI upper 95% CI
#> 2 48 1 0.9792 0.0206 0.93958 1.000
#> 4 47 1 0.9583 0.0288 0.90344 1.000
#> 7 46 2 0.9167 0.0399 0.84172 0.998
#> 8 44 1 0.8958 0.0441 0.81345 0.987
#> 10 43 1 0.8750 0.0477 0.78627 0.974
#> 13 42 2 0.8333 0.0538 0.73430 0.946
#> 16 40 1 0.8125 0.0563 0.70926 0.931
#> 18 39 2 0.7708 0.0607 0.66065 0.899
#> 20 37 2 0.7292 0.0641 0.61369 0.866
#> 21 35 2 0.6875 0.0669 0.56812 0.832
#> 22 33 1 0.6667 0.0680 0.54580 0.814
#> 24 32 1 0.6458 0.0690 0.52377 0.796
#> 25 31 2 0.6042 0.0706 0.48052 0.760
#> 27 29 1 0.5833 0.0712 0.45928 0.741
#> 29 28 1 0.5625 0.0716 0.43830 0.722
#> 30 27 1 0.5417 0.0719 0.41756 0.703
#> 31 26 1 0.5208 0.0721 0.39706 0.683
#> 51 25 2 0.4792 0.0721 0.35678 0.644
#> 52 23 1 0.4583 0.0719 0.33699 0.623
#> 54 22 2 0.4167 0.0712 0.29814 0.582
#> 56 20 1 0.3958 0.0706 0.27908 0.561
#> 59 19 1 0.3750 0.0699 0.26027 0.540
#> 61 18 1 0.3542 0.0690 0.24171 0.519
#> 63 17 1 0.3333 0.0680 0.22342 0.497
#> 80 16 1 0.3125 0.0669 0.20541 0.475
#> 87 15 1 0.2917 0.0656 0.18768 0.453
#> 95 14 1 0.2708 0.0641 0.17026 0.431
#> 99 12 2 0.2257 0.0609 0.13302 0.383
#> 117 9 1 0.2006 0.0591 0.11267 0.357
#> 122 8 1 0.1755 0.0567 0.09316 0.331
#> 139 6 1 0.1463 0.0543 0.07066 0.303
#> 151 5 1 0.1170 0.0507 0.05005 0.274
#> 153 4 1 0.0878 0.0457 0.03163 0.244
#> 287 3 1 0.0585 0.0387 0.01600 0.214
#> 384 2 1 0.0293 0.0283 0.00438 0.195
#> 392 1 1 0.0000 NaN NA NA
#>
#> celltype=adeno
#> time n.risk n.event survival std.err lower 95% CI upper 95% CI
#> 3 27 1 0.9630 0.0363 0.89430 1.000
#> 7 26 1 0.9259 0.0504 0.83223 1.000
#> 8 25 2 0.8519 0.0684 0.72786 0.997
#> 12 23 1 0.8148 0.0748 0.68071 0.975
#> 18 22 1 0.7778 0.0800 0.63576 0.952
#> 19 21 1 0.7407 0.0843 0.59259 0.926
#> 24 20 1 0.7037 0.0879 0.55093 0.899
#> 31 19 1 0.6667 0.0907 0.51059 0.870
#> 35 18 1 0.6296 0.0929 0.47146 0.841
#> 36 17 1 0.5926 0.0946 0.43344 0.810
#> 45 16 1 0.5556 0.0956 0.39647 0.778
#> 48 15 1 0.5185 0.0962 0.36050 0.746
#> 51 14 1 0.4815 0.0962 0.32552 0.712
#> 52 13 1 0.4444 0.0956 0.29152 0.678
#> 73 12 1 0.4074 0.0946 0.25850 0.642
#> 80 11 1 0.3704 0.0929 0.22649 0.606
#> 84 9 1 0.3292 0.0913 0.19121 0.567
#> 90 8 1 0.2881 0.0887 0.15759 0.527
#> 92 7 1 0.2469 0.0850 0.12575 0.485
#> 95 6 1 0.2058 0.0802 0.09587 0.442
#> 117 5 1 0.1646 0.0740 0.06824 0.397
#> 132 4 1 0.1235 0.0659 0.04335 0.352
#> 140 3 1 0.0823 0.0553 0.02204 0.307
#> 162 2 1 0.0412 0.0401 0.00608 0.279
#> 186 1 1 0.0000 NaN NA NA
#>
#> celltype=large
#> time n.risk n.event survival std.err lower 95% CI upper 95% CI
#> 12 27 1 0.9630 0.0363 0.89430 1.000
#> 15 26 1 0.9259 0.0504 0.83223 1.000
#> 19 25 1 0.8889 0.0605 0.77791 1.000
#> 43 24 1 0.8519 0.0684 0.72786 0.997
#> 49 23 1 0.8148 0.0748 0.68071 0.975
#> 52 22 1 0.7778 0.0800 0.63576 0.952
#> 53 21 1 0.7407 0.0843 0.59259 0.926
#> 100 20 1 0.7037 0.0879 0.55093 0.899
#> 103 19 1 0.6667 0.0907 0.51059 0.870
#> 105 18 1 0.6296 0.0929 0.47146 0.841
#> 111 17 1 0.5926 0.0946 0.43344 0.810
#> 133 16 1 0.5556 0.0956 0.39647 0.778
#> 143 15 1 0.5185 0.0962 0.36050 0.746
#> 156 14 1 0.4815 0.0962 0.32552 0.712
#> 162 13 1 0.4444 0.0956 0.29152 0.678
#> 164 12 1 0.4074 0.0946 0.25850 0.642
#> 177 11 1 0.3704 0.0929 0.22649 0.606
#> 200 9 1 0.3292 0.0913 0.19121 0.567
#> 216 8 1 0.2881 0.0887 0.15759 0.527
#> 231 7 1 0.2469 0.0850 0.12575 0.485
#> 250 6 1 0.2058 0.0802 0.09587 0.442
#> 260 5 1 0.1646 0.0740 0.06824 0.397
#> 278 4 1 0.1235 0.0659 0.04335 0.352
#> 340 3 1 0.0823 0.0553 0.02204 0.307
#> 378 2 1 0.0412 0.0401 0.00608 0.279
#> 553 1 1 0.0000 NaN NA NA
But I have no idea how I can manipulate this list. I would very much appreciate any advice on how to filter out all lines with n.risk < 10 from fit.obj.

I can't quite seem to get this all the way there. But I see that you can pass a data.frame rather than a fit object to the plotting function. You can do this and clip the values. For example
ss <- subset(surv_summary(fit.obj), n.risk>=10)
ggsurvplot(ss,
conf.int = T)
But it seems in this mode it does not automatically print the table. There is a function to draw just the table with
ggrisktable(fit.obj, tables.theme = theme_cleantable())
So I guess you could just combine them. Maybe i'm missing an easier way to draw the table when using a data.frame in the same plot.

As a slight variation on the above answers, if you want to truncate each group individually when less than 10 patients are at risk in that group, I found this to work and not require plotting the figure and table separately:
library(survival)
library(survminer)
# truncate each line when fewer than 10 at risk
atrisk <- 10
# KM fit
fit.obj <- survfit(Surv(time, status) ~ celltype, data = veteran)
# subset each stratum separately
maxcutofftime = 0 # for plotting
strata <- rep(names(fit.obj$strata), fit.obj$strata)
for (i in names(fit.obj$strata)){
cutofftime <- min(fit.obj$time[fit.obj$n.risk < atrisk & strata == i])
maxcutofftime = max(maxcutofftime, cutofftime)
cutoffs <- which(fit.obj$n.risk < atrisk & strata == i)
fit.obj$lower[cutoffs] <- NA
fit.obj$upper[cutoffs] <- NA
fit.obj$surv[cutoffs] <- NA
}
# plot
ggsurvplot(fit.obj, data = veteran, risk.table = TRUE, conf.int = T, pval = F,
tables.theme = theme_cleantable(), xlim = c(0,maxcutofftime), break.x.by = 90)
edited to add: note that if we had used pval = T above, that would give the p-value for the truncated data, not the full data. It doesn't make much of a difference in this example as both are p<0.0001, but be careful :)

I'm following up on MrFlick's great answer.
I'd interpret 3) to mean that there should be at least 10 at risk total - i.e., not per group. So we have to create an ungrouped Kaplan-Meier fit first and determine the time cutoff from there.
Subset the surv_summary object w/r/t this cutoff.
Plot KM-curve and risk table separately. Crucially, function survminer::ggrisktable() (a minimal front end for ggsurvtable()) accepts options xlim and break.time.by. However, the function can currently only extend the upper time limit, not reduce it. I assume this is a bug. I created function ggsurvtable_mod() to change this.
Turn ggplot objects into grobs and use ggExtra::grid.arrange() to put both plots together. There is probably a more elegant way to do this based on options widths and heights.
Admittedly, this is a bit of a hack and needs tweaking to get the correct alignment between survival plot and risk table.
library(survival)
library(survminer)
# ungrouped KM estimate to determine cutoff
fit1_ss <- surv_summary(survfit(Surv(time, status) ~ 1, data=veteran))
# time cutoff with fewer than 10 at risk
cutoff <- min(fit1_ss$time[fit1_ss$n.risk < 10])
# KM fit and subset to cutoff
fit.obj <- survfit(Surv(time, status) ~ celltype, data = veteran)
fit_ss <- subset(surv_summary(fit.obj), time < cutoff)
# KM survival plot and risk table as separate plots
p1 <- ggsurvplot(fit_ss, conf.int=TRUE)
# note options xlim and break.time.by
p2 <- ggsurvtable_mod(fit.obj,
survtable="risk.table",
tables.theme=theme_cleantable(),
xlim=c(0, cutoff),
break.time.by=100)
# turn ggplot objects into grobs and arrange them (needs tweaking)
g1 <- ggplotGrob(p1)
g2 <- ggplotGrob(p2)
lom <- rbind(c(NA, rep(1, 14)),
c(NA, rep(1, 14)),
c(rep(2, 15)))
gridExtra::grid.arrange(grobs=list(g1, g2), layout_matrix=lom)

Related

In R-studio I want to know how I would save a loop result without deleting the previous loop result [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 2 years ago.
Improve this question
In this code I want to calculate the trimmed mean and variance of trimmed mean for normal distribution in many value of alpha ( I want to calculate the value of the each alpha from 1 to 13 ) and storage the result in the data.frame then print the all result But the problem is that the new result storages over the previous result and at the end i end up getting only the last result of alpha value.
ProDistFun<- data.frame(matrix(nrow=91, ncol=4))
colnames(ProDistFun)<-c("x","Alpha","Trimmed Mean","Variance Of Trimmed Mean")
mu=7 # Mean Value
sigma2=4 # Variance value
for (alpha in c(0.001,0.01,0.025,0.05,0.1,0.25,0.375))
{
for(i in 1:13)
{
ProDistFun[i,1]<-i
ProDistFun[i,2]<-alpha
# The trimmed mean
a=qnorm(alpha, mean=mu, sd=sqrt(sigma2))
b=qnorm(1-alpha, mean=mu, sd=sqrt(sigma2))
fun_TM <- function(x) ((x*exp(-0.5*((x-mu)/sqrt(sigma2))^2))/((1-2*alpha)*(sqrt(2*pi*sigma2))))
MT1 <- integrate(fun_TM, a, b)
MT <-MT1$value
ProDistFun[i,3]<-MT
# The variance of trimmed mean
fun_VTM <- function(x) ((((x-MT)^2)*exp(-0.5*((x-mu)/sqrt(sigma2))^2))/(sqrt(2*pi*sigma2)))
fVTM <- integrate(fun_VTM, a, b)
fV <- fVTM$value
VT=((fV+(alpha*(a-MT)^2)+(alpha*(b-MT)^2))/((1-2*alpha)^2))
ProDistFun[i,4]<-VT
}
}
print(ProDistFun)
Edited Sept 30 12:23 hours
It's still not totally clear to me what you're trying to accomplish with x since you use it in confusing ways, but using most of your code and simplifying drastically but still using a for loop
Create an empty dataframe without the matrix contortions.
ProDistFun <- data.frame(
x = integer(0),
Alpha = numeric(0),
Trimmed_Mean = numeric(0),
Variance_Of_Trimmed_Mean = numeric(0)
)
Assign your two constants. IMPORTANT NOTE -- throughout your code please stop treating assignment <- as the same thing as = it will cause problems down the road.
mu <- 7 # Mean Value
sigma2 <- 4 # Variance value
Pull your two functions out of the loops (no point in rerunning them every iteration). I have no idea if they are correct but they seem to work. It also seems to me that in these functions you want the x value to be the value of i so I've made that change. If you don't you get 13 iterations of identical
fun_TM <- function(x) {
((i*exp(-0.5*((i-mu)/sqrt(sigma2))^2))/((1-2*alpha)*(sqrt(2*pi*sigma2))))
}
fun_VTM <- function(x) {
((((i-MT)^2)*exp(-0.5*((i-mu)/sqrt(sigma2))^2))/(sqrt(2*pi*sigma2)))
}
Now we run the nested loops of alpha and i. We build a row to add to our empty dataframe ProDist_df and then last step is rbind the newrow to it. Note as stated in the help file for integrate we need to Vectorize.
for (alpha in c(0.001,0.01,0.025,0.05,0.1,0.25,0.375)) {
for (i in 1:13) {
a <- qnorm(alpha, mean = mu, sd = sqrt(sigma2))
b <- qnorm(1 - alpha, mean = mu, sd = sqrt(sigma2))
MT1 <- integrate(Vectorize(fun_TM), a, b)
MT <- MT1$value
fVTM <- integrate(Vectorize(fun_VTM), a, b)
fV <- fVTM$value
VT <- ((fV+(alpha*(a-MT)^2)+(alpha*(b-MT)^2))/((1-2*alpha)^2))
newrow <- data.frame(x = i,
Alpha = alpha,
Trimmed_Mean = MT,
Variance_Of_Trimmed_Mean = VT)
ProDist_df <- rbind(ProDist_df, newrow)
}
}
ProDist_df
#> x Alpha Trimmed_Mean Variance_Of_Trimmed_Mean
#> 1 1 0.001 0.02744577 0.2003378
#> 2 2 0.001 0.21710028 0.5148306
#> 3 3 0.001 1.00307392 1.4849135
#> 4 4 0.001 3.20833233 0.6092747
#> 5 5 0.001 7.49244239 9.4048587
#> 6 6 0.001 13.08172721 109.7134117
#> 7 7 0.001 17.29412878 262.6203021
#> 8 8 0.001 17.44230295 195.0733274
#> 9 9 0.001 13.48639629 30.3828344
#> 10 10 0.001 8.02083083 3.2269398
#> 11 11 0.001 3.67793771 18.0605860
#> 12 12 0.001 1.30260169 12.5886402
#> 13 13 0.001 0.35679506 4.5613376
#> 14 1 0.010 0.02104086 1.4856627
#> 15 2 0.010 0.16643643 1.7087503
#> 16 3 0.010 0.76899043 2.5612272
#> 17 4 0.010 2.45961619 2.3689151
#> 18 5 0.010 5.74396001 1.1324626
#> 19 6 0.010 10.02889499 28.3270524
#> 20 7 0.010 13.25826466 76.9619814
#> 21 8 0.010 13.37185999 50.5144290
#> 22 9 0.010 10.33912801 2.7851234
#> 23 10 0.010 6.14904048 9.7709433
#> 24 11 0.010 2.81963158 18.3179999
#> 25 12 0.010 0.99861856 11.4783190
#> 26 13 0.010 0.27353117 4.8704116
#> 27 1 0.025 0.01828687 3.5703608
#> 28 2 0.025 0.14465195 3.7170106
#> 29 3 0.025 0.66833905 4.3472611
#> 30 4 0.025 2.13768272 4.1121498
#> 31 5 0.025 4.99214637 1.0747081
#> 32 6 0.025 8.71623613 12.2965566
#> 33 7 0.025 11.52292107 37.4315915
#> 34 8 0.025 11.62164818 22.0916831
#> 35 9 0.025 8.98586347 1.0699878
#> 36 10 0.025 5.34420680 13.1972079
#> 37 11 0.025 2.45057652 19.1385400
#> 38 12 0.025 0.86791169 12.3691446
#> 39 13 0.025 0.23772931 6.5199633
#> 40 1 0.050 0.01619943 7.3749079
#> 41 2 0.050 0.12813995 7.4154400
#> 42 3 0.050 0.59204825 7.6768541
#> 43 4 0.050 1.89366655 6.8889173
#> 44 5 0.050 4.42229360 2.4843697
#> 45 6 0.050 7.72127906 5.6367092
#> 46 7 0.050 10.20758133 19.2763445
#> 47 8 0.050 10.29503875 10.2078727
#> 48 9 0.050 7.96012848 2.5125393
#> 49 10 0.050 4.73416638 16.5558668
#> 50 11 0.050 2.17084357 21.3087060
#> 51 12 0.050 0.76883970 15.1092609
#> 52 13 0.050 0.21059254 9.9710784
#> 53 1 0.100 0.01419911 17.3206584
#> 54 2 0.100 0.11231710 17.1281634
#> 55 3 0.100 0.51894156 16.5102647
#> 56 4 0.100 1.65983477 13.8052303
#> 57 5 0.100 3.87622451 6.3261279
#> 58 6 0.100 6.76784806 2.9011142
#> 59 7 0.100 8.94713932 9.2952208
#> 60 8 0.100 9.02379741 4.8107658
#> 61 9 0.100 6.97720412 6.0182204
#> 62 10 0.100 4.14958694 22.3456468
#> 63 11 0.100 1.90278571 28.0669010
#> 64 12 0.100 0.67390263 23.5641219
#> 65 13 0.100 0.18458837 19.4835253
#> 66 1 0.250 0.01195695 101.3283283
#> 67 2 0.250 0.09458127 99.3524957
#> 68 3 0.250 0.43699624 91.6992768
#> 69 4 0.250 1.39773265 71.1428700
#> 70 5 0.250 3.26413547 35.4870897
#> 71 6 0.250 5.69914690 7.1958771
#> 72 7 0.250 7.53430941 4.8250199
#> 73 8 0.250 7.59886254 4.6624497
#> 74 9 0.250 5.87544385 18.9156520
#> 75 10 0.250 3.49433163 57.7975362
#> 76 11 0.250 1.60231955 87.6386891
#> 77 12 0.250 0.56748763 98.7559156
#> 78 13 0.250 0.15544029 101.2808652
#> 79 1 0.375 0.01129729 591.0212507
#> 80 2 0.375 0.08936330 578.6087319
#> 81 3 0.375 0.41288753 529.2387893
#> 82 4 0.375 1.32062094 401.4184815
#> 83 5 0.375 3.08405591 197.9457725
#> 84 6 0.375 5.38472985 37.5416149
#> 85 7 0.375 7.11864801 5.0996822
#> 86 8 0.375 7.17963979 7.6766516
#> 87 9 0.375 5.55130064 59.4024907
#> 88 10 0.375 3.30155234 228.2708772
#> 89 11 0.375 1.51392096 415.5768786
#> 90 12 0.375 0.53617983 529.7332481
#> 91 13 0.375 0.14686478 575.9244272
The general way to save all atomic values generated in a loop is to create an empty vector before the loop and then append the new values to the vector inside the loop. E.g.,
output <- vector()
for (i in 1:5) {
newvalue <- i
output <- c(output, newvalue)
}

regression by group and retain all the columns in R

I am doing a linear regression by group and want to extract the residuals of the regression
library(dplyr)
set.seed(124)
dat <- data.frame(ID = sample(111:503, 18576, replace = T),
ID2 = sample(11:50, 18576, replace = T),
ID3 = sample(1:14, 18576, replace = T),
yearRef = sample(1998:2014, 18576, replace = T),
value = rnorm(18576))
resid <- dat %>% dplyr::group_by(ID3) %>%
do(augment(lm(value ~ yearRef, data=.))) %>% ungroup()
How do I retain the ID, ID2 as well in the resid. At the moment, it only retains the ID3 in the final data frame
Use group_split then loop through each group using map_dfr to bind ID, ID2 and augment output using bind_cols
library(dplyr)
library(purrr)
dat %>% group_split(ID3) %>%
map_dfr(~bind_cols(select(.x,ID,ID2), augment(lm(value~yearRef, data=.x))), .id = "ID3")
# A tibble: 18,576 x 12
ID3 ID ID2 value yearRef .fitted .se.fit .resid .hat .sigma .cooksd
<chr> <int> <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 196 16 -0.385 2009 -0.0406 0.0308 -0.344 1.00e-3 0.973 6.27e-5
2 1 372 47 -0.793 2012 -0.0676 0.0414 -0.726 1.81e-3 0.973 5.05e-4
3 1 470 15 -0.496 2011 -0.0586 0.0374 -0.438 1.48e-3 0.973 1.50e-4
4 1 242 40 -1.13 2010 -0.0496 0.0338 -1.08 1.21e-3 0.973 7.54e-4
5 1 471 34 1.28 2006 -0.0135 0.0262 1.29 7.26e-4 0.972 6.39e-4
6 1 434 35 -1.09 1998 0.0586 0.0496 -1.15 2.61e-3 0.973 1.82e-3
7 1 467 45 -0.0663 2011 -0.0586 0.0374 -0.00769 1.48e-3 0.973 4.64e-8
8 1 334 27 -1.37 2003 0.0135 0.0305 -1.38 9.86e-4 0.972 9.92e-4
9 1 186 25 -0.0195 2003 0.0135 0.0305 -0.0331 9.86e-4 0.973 5.71e-7
10 1 114 34 1.09 2014 -0.0857 0.0500 1.18 2.64e-3 0.973 1.94e-3
# ... with 18,566 more rows, and 1 more variable: .std.resid <dbl>
Taking the "many models" approach, you can nest the data on ID3 and use purrr::map to create a list-column of the broom::augment data frames. The data list-column has all the original columns aside from ID3; map into that and select just the ones you want. Here I'm assuming you want to keep any column that starts with "ID", but you can change this. Then unnest both the data and the augment data frames.
library(dplyr)
library(tidyr)
dat %>%
group_by(ID3) %>%
nest() %>%
mutate(aug = purrr::map(data, ~broom::augment(lm(value ~ yearRef, data = .))),
data = purrr::map(data, select, starts_with("ID"))) %>%
unnest(c(data, aug))
#> # A tibble: 18,576 x 12
#> # Groups: ID3 [14]
#> ID3 ID ID2 value yearRef .fitted .se.fit .resid .hat .sigma
#> <int> <int> <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 11 431 15 0.619 2002 0.0326 0.0346 0.586 1.21e-3 0.995
#> 2 11 500 21 -0.432 2000 0.0299 0.0424 -0.462 1.82e-3 0.995
#> 3 11 392 28 -0.246 1998 0.0273 0.0515 -0.273 2.67e-3 0.995
#> 4 11 292 40 -0.425 1998 0.0273 0.0515 -0.452 2.67e-3 0.995
#> 5 11 175 36 -0.258 1999 0.0286 0.0468 -0.287 2.22e-3 0.995
#> 6 11 419 23 3.13 2005 0.0365 0.0273 3.09 7.54e-4 0.992
#> 7 11 329 17 -0.0414 2007 0.0391 0.0274 -0.0806 7.57e-4 0.995
#> 8 11 284 23 -0.450 2006 0.0378 0.0268 -0.488 7.25e-4 0.995
#> 9 11 136 28 -0.129 2006 0.0378 0.0268 -0.167 7.25e-4 0.995
#> 10 11 118 17 -1.55 2013 0.0470 0.0470 -1.60 2.24e-3 0.995
#> # … with 18,566 more rows, and 2 more variables: .cooksd <dbl>,
#> # .std.resid <dbl>

How to find both rows associated with a string in an R dataframe and subtract their mutual column values

In R, I have a dataframe that looks like this:
sample value gene tag isPTV
1 1120 3.4 arx1 1120|arx1 0
2 2123 2.3 mnf2 2123|mnf2 0
3 1129 1.9 trf4 1129|trf4 0
4 2198 0.2 brc1 2198|brc1 0
5 1120 2.1 arx1 1120|arx1 1
6 2123 0.4 mnf2 2123|mnf2 1
7 1129 1.2 trf4 1129|trf4 1
8 2198 0.9 brc1 2198|brc1 1
Such that 0 means false and 1 means true. What I'm ultimately trying to do is create a dataframe that, for each tag, finds the absolute value between the value numbers.
For instance, for 1129|trf4 occurs in two separate rows. There's a value for when it isPTV and when it is not, so the absolute value would be 1.9 - 1.2 = 0.7.
I started out by trying to write a function to do these for a given tag value, such that, for a given tag, it would return both rows containing the tag:
getExprValue <- function(dataframe, tag){
return(dataframe[tag,])
}
But this is not working, and I'm not very familiar with how you index dataframes in R.
What is the right way to do this?
UPDATE:
Solution 1 Attempt:
m_diff <- m %>% group_by(tag) %>% mutate(absDiff = abs(diff(value)))
Response:
Error in mutate_impl(.data, dots) : ColumnabsDiffmust be length 1 (the group size), not 0
Solution 2 Attempt:
with(df1, abs(ave(value, tag, FUN = diff)))
Response:
Error in x[i] <- value[[j]] : replacement has length zero
Edit: I just noticed that #akrun had a much simpler solution
Create data with a structure similar to yours:
library(tidyverse)
dat <- tibble(
sample = rep(sample(1000:3000, 10), 2),
value = rnorm(20, 5, 1),
gene = rep(letters[1:10], 2),
tag = paste(sample, gene, sep = "|"),
isPTV = rep(0:1, each = 10)
)
dat
#> # A tibble: 20 x 5
#> sample value gene tag isPTV
#> <int> <dbl> <chr> <chr> <int>
#> 1 2149 5.90 a 2149|a 0
#> 2 1027 5.46 b 1027|b 0
#> 3 1103 5.65 c 1103|c 0
#> 4 1884 4.86 d 1884|d 0
#> 5 2773 5.58 e 2773|e 0
#> 6 2948 6.98 f 2948|f 0
#> 7 2478 5.17 g 2478|g 0
#> 8 2724 6.71 h 2724|h 0
#> 9 1927 5.06 i 1927|i 0
#> 10 1081 4.39 j 1081|j 0
#> 11 2149 4.60 a 2149|a 1
#> 12 1027 2.97 b 1027|b 1
#> 13 1103 6.17 c 1103|c 1
#> 14 1884 5.83 d 1884|d 1
#> 15 2773 4.23 e 2773|e 1
#> 16 2948 6.48 f 2948|f 1
#> 17 2478 5.06 g 2478|g 1
#> 18 2724 5.32 h 2724|h 1
#> 19 1927 7.32 i 1927|i 1
#> 20 1081 4.73 j 1081|j 1
#akrun solution (much better than mine):
dat %>%
group_by(tag) %>%
mutate(absDiff = abs(diff(value)))
#> # A tibble: 20 x 6
#> # Groups: tag [10]
#> sample value gene tag isPTV absDiff
#> <int> <dbl> <chr> <chr> <int> <dbl>
#> 1 2149 5.90 a 2149|a 0 1.30
#> 2 1027 5.46 b 1027|b 0 2.49
#> 3 1103 5.65 c 1103|c 0 0.520
#> 4 1884 4.86 d 1884|d 0 0.974
#> 5 2773 5.58 e 2773|e 0 1.34
#> 6 2948 6.98 f 2948|f 0 0.502
#> 7 2478 5.17 g 2478|g 0 0.114
#> 8 2724 6.71 h 2724|h 0 1.39
#> 9 1927 5.06 i 1927|i 0 2.26
#> 10 1081 4.39 j 1081|j 0 0.337
#> 11 2149 4.60 a 2149|a 1 1.30
#> 12 1027 2.97 b 1027|b 1 2.49
#> 13 1103 6.17 c 1103|c 1 0.520
#> 14 1884 5.83 d 1884|d 1 0.974
#> 15 2773 4.23 e 2773|e 1 1.34
#> 16 2948 6.48 f 2948|f 1 0.502
#> 17 2478 5.06 g 2478|g 1 0.114
#> 18 2724 5.32 h 2724|h 1 1.39
#> 19 1927 7.32 i 1927|i 1 2.26
#> 20 1081 4.73 j 1081|j 1 0.337
My initial suggestion (unnecessarily complicated):
nested <- dat %>%
group_by(tag) %>%
nest()
nested %>%
mutate(difference = map(data, ~ abs(diff(.$value)))) %>%
select(- data) %>%
unnest()
#> # A tibble: 10 x 2
#> tag difference
#> <chr> <dbl>
#> 1 2149|a 1.30
#> 2 1027|b 2.49
#> 3 1103|c 0.520
#> 4 1884|d 0.974
#> 5 2773|e 1.34
#> 6 2948|f 0.502
#> 7 2478|g 0.114
#> 8 2724|h 1.39
#> 9 1927|i 2.26
#> 10 1081|j 0.337

"summarise_at" and "mutate_if" for descriptive statistics for character variables

I would like to use summarise_at and mutate_at on multiple character variables at the same time. I have looked at many examples that use integer variables, but I just can't figure it out for character variables. Directly below is the code I use to produce descriptive statistics for a character (or factor) variable.
library(tidyverse)
# First block of code
starwars %>%
group_by(gender) %>%
summarise (n = n()) %>%
mutate(totalN = (cumsum(n))) %>%
mutate(percent = round((n / sum(n)), 3)) %>%
mutate(cumpercent = round(cumsum(freq = n / sum(n)),3))
This produces:
A tibble: 5 x 5
gender n totalN percent cumpercent
<chr> <int> <int> <dbl> <dbl>
1 female 19 19 0.218 0.218
2 hermaphrodite 1 20 0.011 0.230
3 male 62 82 0.713 0.943
4 none 2 84 0.023 0.966
5 <NA> 3 87 0.034 1.000
I would like to produce this same thing, but for multiple character (or factor) variables at once. In this case, let's use the variables gender and eye_color This is what I have tried:
starwars %>%
summarise_at(vars(gender, eyecolor) (n = n()) %>%
mutate_at(vars(gender, eyecolor) (totalN = (cumsum(n))) %>%
mutate_at(vars(gender", "eyecolor) (percent = round((n / sum(n)), 3)) %>%
mutate_at(vars(gender, eyecolor) (cumpercent = round(cumsum(freq = n / sum(n)),3))))))
I get the following error:
Error in eval(expr, envir, enclos) : attempt to apply non-function
I understand that there are built-in functions called using funs, but I don't want to use them. I have tried playing with the code in many different ways to get it to work, but have come up short.
What I would like to produce, is something like this:
A tibble: 5 x 5
gender n totalN percent cumpercent
<chr> <int> <int> <dbl> <dbl>
1 female 19 19 0.218 0.218
2 hermaphrodite 1 20 0.011 0.230
3 male 62 82 0.713 0.943
4 none 2 84 0.023 0.966
5 <NA> 3 87 0.034 1.000
A tibble: 15 x 5
eye_color n totalN percent cumpercent
<chr> <int> <int> <dbl> <dbl>
1 black 10 10 0.115 0.115
2 blue 19 29 0.218 0.333
3 blue-gray 1 30 0.011 0.345
4 brown 21 51 0.241 0.586
5 dark 1 52 0.011 0.598
6 gold 1 53 0.011 0.609
7 green, yellow 1 54 0.011 0.621
8 hazel 3 57 0.034 0.655
9 orange 8 65 0.092 0.747
10 pink 1 66 0.011 0.759
11 red 5 71 0.057 0.816
12 red, blue 1 72 0.011 0.828
13 unknown 3 75 0.034 0.862
14 white 1 76 0.011 0.874
15 yellow 11 87 0.126 1.000
Perhaps a loop would be better? Right now I have many lines of code to generate the descriptive statistics for each character variable because I have to run the first block of code (noted above) for each variable. It would be great if I could just list the variables I would like to use and run each through the first block of code.
Based on your expected output, mutate_at is not what you want, since it mutates on the columns selected. What you wanted to do is to group_by gender and eye_color separately. This is a good place to write your summary code into a function:
library(tidyverse)
library(rlang)
summary_func = function(group_by_var){
group_by_quo = enquo(group_by_var)
starwars %>%
group_by(!!group_by_quo) %>%
summarise(n = n()) %>%
mutate(totalN = (cumsum(n)),
percent = round((n / sum(n)), 3),
cumpercent = round(cumsum(freq = n / sum(n)),3))
}
Result:
> summary_func(gender)
# A tibble: 5 x 5
gender n totalN percent cumpercent
<chr> <int> <int> <dbl> <dbl>
1 female 19 19 0.218 0.218
2 hermaphrodite 1 20 0.011 0.230
3 male 62 82 0.713 0.943
4 none 2 84 0.023 0.966
5 <NA> 3 87 0.034 1.000
> summary_func(eye_color)
# A tibble: 15 x 5
eye_color n totalN percent cumpercent
<chr> <int> <int> <dbl> <dbl>
1 black 10 10 0.115 0.115
2 blue 19 29 0.218 0.333
3 blue-gray 1 30 0.011 0.345
4 brown 21 51 0.241 0.586
5 dark 1 52 0.011 0.598
6 gold 1 53 0.011 0.609
7 green, yellow 1 54 0.011 0.621
8 hazel 3 57 0.034 0.655
9 orange 8 65 0.092 0.747
10 pink 1 66 0.011 0.759
11 red 5 71 0.057 0.816
12 red, blue 1 72 0.011 0.828
13 unknown 3 75 0.034 0.862
14 white 1 76 0.011 0.874
15 yellow 11 87 0.126 1.000
The idea is to make your summary code into a function so that you can apply the same code over different group_by variables. enquo from rlang takes the code supplied to group_by_var and bundles it with the environment where it was called into a quosure. You can then use !! to unquote the group_by_quo in the group_by step. This enables non-standard evaluation (i.e. typing summary_func(gender) instead of summary_func("gender").
If you don't want to call summary_func for every variable you want to group_by, you can wrap your dplyr code in map from purrr, and unquote each argument of group_by_quo supplied as ... arguments. Notice the change from enquo to quos to convert each argument of ... to a list of quosures:
summary_func = function(...){
group_by_quo = quos(...)
map(group_by_quo, ~{
starwars %>%
group_by(!!.x) %>%
summarise(n = n()) %>%
mutate(totalN = (cumsum(n)),
percent = round((n / sum(n)), 3),
cumpercent = round(cumsum(freq = n / sum(n)),3))
})
}
You can now do this:
summary_func(gender, eye_color)
or with a vector of character variable names to group_by:
group_vars = c("gender", "eye_color")
summary_func(!!!syms(group_vars))
Result:
[[1]]
# A tibble: 5 x 5
gender n totalN percent cumpercent
<chr> <int> <int> <dbl> <dbl>
1 female 19 19 0.218 0.218
2 hermaphrodite 1 20 0.011 0.230
3 male 62 82 0.713 0.943
4 none 2 84 0.023 0.966
5 <NA> 3 87 0.034 1.000
[[2]]
# A tibble: 15 x 5
eye_color n totalN percent cumpercent
<chr> <int> <int> <dbl> <dbl>
1 black 10 10 0.115 0.115
2 blue 19 29 0.218 0.333
3 blue-gray 1 30 0.011 0.345
4 brown 21 51 0.241 0.586
5 dark 1 52 0.011 0.598
6 gold 1 53 0.011 0.609
7 green, yellow 1 54 0.011 0.621
8 hazel 3 57 0.034 0.655
9 orange 8 65 0.092 0.747
10 pink 1 66 0.011 0.759
11 red 5 71 0.057 0.816
12 red, blue 1 72 0.011 0.828
13 unknown 3 75 0.034 0.862
14 white 1 76 0.011 0.874
15 yellow 11 87 0.126 1.000

R - Extract summary table from Survfit with Strata

I'm new to R and survival analysis, and I am interested to export into a dataframe the results from survfit where there is strata.
This site has provided an excellent solution but not to one with strata (https://stat.ethz.ch/pipermail/r-help/2014-October/422348.html). How can i append (or stack) each strata with an extra column which contains the strata type.
solution in the link offered is not applicable to strata groupings
library(survival)
data(lung)
mod <- with(lung, survfit(Surv(time, status)~ 1))
res <- summary(mod)
str(res)
# Extract the columns you want
cols <- lapply(c(2:6, 8:10) , function(x) res[x])
# Combine the columns into a data frame
tbl <- do.call(data.frame, cols)
str(tbl)
Thank you in advanced,
R newbie
It is basically the same as you have there, just an extra column
res <- summary( survfit( Surv(futime, fustat)~rx, data=ovarian))
cols <- lapply(c(2:6, 8:11) , function(x) res[x])
tbl <- do.call(data.frame, cols)
head(tbl)
# time n.risk n.event n.censor surv strata std.err upper lower
# 1 59 13 1 0 0.9230769 rx=1 0.0739053 1.0000000 0.7890186
# 2 115 12 1 0 0.8461538 rx=1 0.1000683 1.0000000 0.6710952
# 3 156 11 1 0 0.7692308 rx=1 0.1168545 1.0000000 0.5711496
# 4 268 10 1 0 0.6923077 rx=1 0.1280077 0.9946869 0.4818501
# 5 329 9 1 0 0.6153846 rx=1 0.1349320 0.9457687 0.4004132
# 6 431 8 1 0 0.5384615 rx=1 0.1382642 0.8906828 0.3255265
Another option is to use the ggfortify library.
library(survival)
library(ggfortify)
# fit a survival model
mod <- survfit(Surv(futime, fustat) ~ rx, data = ovarian)
# extract results to a data.frame
res <- fortify(mod)
str(res)
'data.frame': 26 obs. of 9 variables:
$ time : int 59 115 156 268 329 431 448 477 638 803 ...
$ n.risk : num 13 12 11 10 9 8 7 6 5 4 ...
$ n.event : num 1 1 1 1 1 1 0 0 1 0 ...
$ n.censor: num 0 0 0 0 0 0 1 1 0 1 ...
$ surv : num 0.923 0.846 0.769 0.692 0.615 ...
$ std.err : num 0.0801 0.1183 0.1519 0.1849 0.2193 ...
$ upper : num 1 1 1 0.995 0.946 ...
$ lower : num 0.789 0.671 0.571 0.482 0.4 ...
$ strata : Factor w/ 2 levels "rx=1","rx=2": 1 1 1 1 1 1 1 1 1 1 ...
The advantage of this method is that you get the complete data (i.e 26 observations instead of 12) and you can plot your survival curves with ggplot.
library(ggplot2)
ggplot(data = res, aes(x = time, y = surv, color = strata)) +
geom_line() +
# plot censor marks
geom_point(aes(shape = factor(ifelse(n.censor >= 1, 1, NA)))) +
# format censor shape as "+"
scale_shape_manual(values = 3) +
# hide censor legend
guides(shape = "none")
The easiest way I see is to use the tidy() function from the broom package.
library(survival)
library(dplyr)
#>
#> Attache Paket: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(broom)
survfit( Surv(futime, fustat)~rx, data=ovarian) %>%
tidy()
#> # A tibble: 26 x 9
#> time n.risk n.event n.censor estimate std.error conf.high conf.low strata
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 59 13 1 0 0.923 0.0801 1 0.789 rx=1
#> 2 115 12 1 0 0.846 0.118 1 0.671 rx=1
#> 3 156 11 1 0 0.769 0.152 1 0.571 rx=1
#> 4 268 10 1 0 0.692 0.185 0.995 0.482 rx=1
#> 5 329 9 1 0 0.615 0.219 0.946 0.400 rx=1
#> 6 431 8 1 0 0.538 0.257 0.891 0.326 rx=1
#> 7 448 7 0 1 0.538 0.257 0.891 0.326 rx=1
#> 8 477 6 0 1 0.538 0.257 0.891 0.326 rx=1
#> 9 638 5 1 0 0.431 0.340 0.840 0.221 rx=1
#> 10 803 4 0 1 0.431 0.340 0.840 0.221 rx=1
#> # … with 16 more rows
Created on 2021-08-04 by the reprex package (v0.3.0)

Resources