Creating an Identification variable to identify observations - r

Hello, I have a data set called "Sample" that looks like this:
Sample
A tibble: 221,088 x 7
gvkey two_digit_sic fyear part1 part2 part3 part4
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 001003 57 1987 0.0317 0.0686 0.0380 0.157
2 001003 57 1988 -0.358 0.0623 -0.338 0.162
3 001003 57 1989 -0.155 0.0614 -0.784 0.140
4 001004 50 1988 0.0868 0.00351 0.108 0.300
5 001004 50 1989 0.0176 0.00281 0.113 0.296
6 001004 50 1990 -0.0569 0.00257 0.0618 0.291
7 001004 50 1991 0.00317 0.00263 -0.112 0.314
8 001004 50 1992 -0.0418 0.00253 -0.0479 0.300
9 001004 50 1993 0.00763 0.00274 0.0216 0.334
10 001004 50 1994 -0.0115 0.00239 0.0459 0.307
# ... with 221,078 more rows
count(Sample, gvkey)
# A tibble: 23,978 x 2
gvkey n
<chr> <int>
1 001003 3
2 001004 30
3 001009 7
4 001010 16
5 001011 7
6 001012 2
7 001013 23
8 001014 5
9 001017 8
10 001019 14
# ... with 23,968 more rows
count(Sample, two_digit_sic)
# A tibble: 73 x 2
two_digit_sic n
<chr> <int>
1 01 527
2 02 111
3 07 105
4 08 120
5 09 24
6 10 8860
7 12 477
8 13 11200
9 14 811
10 15 858
# ... with 63 more rows
Then I run the following model:
library(dplyr)
library(broom)
mjones_1991 <- Sample %>%
  group_by(two_digit_sic, fyear) %>%
  filter(n() >= 10) %>%
  do(augment(lm(part1 ~ part2 + part3 + part4, data = .))) %>%
  ungroup()
Then I get the following results:
mjones_1991
# A tibble: 219,587 x 13
two_digit_sic fyear part1 part2 part3 part4 .fitted .se.fit .resid
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 01 1988 -0.0478 2.36e-2 0.147 1.01 -0.119 0.0576 0.0714
2 01 1988 -0.174 4.29e-2 0.327 0.810 0.00104 0.0560 -0.175
3 01 1988 0.0250 6.15e-4 0.422 0.619 0.0534 0.0711 -0.0284
4 01 1988 -0.0974 2.55e-2 -0.0134 0.292 -0.0847 0.0586 -0.0127
5 01 1988 -0.142 1.15e-3 0.0233 0.677 -0.137 0.0489 -0.0058
6 01 1988 -0.479 2.46e-1 -0.0552 0.538 -0.0393 0.0635 -0.439
7 01 1988 0.00861 2.78e-1 0.251 1.58 -0.0407 0.122 0.0493
8 01 1988 -0.154 2.94e-2 -0.348 0.619 -0.284 0.0984 0.131
9 01 1988 -0.0526 8.96e-4 0.172 0.602 -0.0580 0.0452 0.0053
10 01 1988 -0.0574 2.15e-2 0.0535 0.316 -0.0596 0.0540 0.0021
# ... with 219,577 more rows, and 4 more variables: .hat <dbl>, .sigma <dbl>,
# .cooksd <dbl>, .std.resid <dbl>
The problem is that I lost gvkey; therefore, I cannot tell which gvkey each .fitted, .se.fit, or .resid value belongs to.
Here is the result of filtering for two_digit_sic == "01" and fyear == "1988":
# A tibble: 18 x 7
gvkey two_digit_sic fyear part1 part2 part3 part4
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 001266 01 1988 -0.0478 0.0236 0.147 1.01
2 002249 01 1988 -0.174 0.0429 0.327 0.810
3 002812 01 1988 0.0250 0.000615 0.422 0.619
4 003702 01 1988 -0.0974 0.0255 -0.0134 0.292
5 008596 01 1988 -0.142 0.00115 0.0233 0.677
6 009062 01 1988 -0.479 0.246 -0.0552 0.538
7 009391 01 1988 0.00861 0.278 0.251 1.58
8 010390 01 1988 -0.154 0.0294 -0.348 0.619
9 010884 01 1988 -0.0526 0.000896 0.172 0.602
10 012349 01 1988 -0.0574 0.0215 0.0535 0.316
11 012750 01 1988 0.0577 0.0157 0.0794 0.422
12 013155 01 1988 0.117 0.124 0.370 0.829
13 013462 01 1988 0.255 0.0828 0.529 0.270
14 013468 01 1988 -0.0774 0.0445 0.129 0.191
15 013550 01 1988 -0.0219 0.0204 0.0375 0.879
16 013743 01 1988 -0.0911 0.228 0.0870 0.739
17 014400 01 1988 0.415 0.546 0.0710 0.0437
18 014881 01 1988 -0.134 0.00380 0.0211 0.666
You can see I have 18 observations for two_digit_sic == "01" and fyear == "1988".
In the mjones_1991 data set I have the same observations, but I lose the identifier (gvkey). Do you have any idea how I can keep gvkey?
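As a hedged sketch of one possible fix (not from the original thread): broom::augment() also accepts a data argument, and when the group's full data frame is passed there, the diagnostics are bound onto every original column, including gvkey.

```r
library(dplyr)
library(broom)

# Sketch, assuming the Sample tibble shown above: passing each group's data
# to augment() via its `data` argument keeps all original columns
# (including gvkey) alongside .fitted, .se.fit, .resid, etc.
mjones_1991 <- Sample %>%
  group_by(two_digit_sic, fyear) %>%
  filter(n() >= 10) %>%
  do(augment(lm(part1 ~ part2 + part3 + part4, data = .), data = .)) %>%
  ungroup()
```

The resulting tibble should then have the same rows as before plus the gvkey column carried through from the input.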

Related

Impute missing data of a single variable using the slope of a linear regression of another variable in R

Here is an extract of my dataset (df8), which contains time series from 2000 to 2018 for 194 countries.
iso3 year anc4 median
<chr> <dbl> <dbl> <dbl>
1 BIH 2000 NA 0.739
2 BIH 2001 NA 0.746
3 BIH 2002 NA 0.763
4 BIH 2003 NA 0.778
5 BIH 2004 NA 0.842
6 BIH 2005 NA 0.801
7 BIH 2006 NA 0.819
8 BIH 2007 NA 0.841
9 BIH 2008 NA 0.845
10 BIH 2009 NA 0.840
11 BIH 2010 0.842 0.856
12 BIH 2011 NA 0.873
13 BIH 2012 NA 0.867
14 BIH 2013 NA 0.889
15 BIH 2014 NA 0.879
16 BIH 2015 NA 0.883
17 BIH 2016 NA 0.854
18 BIH 2017 NA 0.891
19 BIH 2018 NA 0.920
20 BWA 2000 NA 0.739
21 BWA 2001 NA 0.746
22 BWA 2002 NA 0.763
23 BWA 2003 NA 0.778
24 BWA 2004 NA 0.842
25 BWA 2005 NA 0.801
26 BWA 2006 0.733 0.819
27 BWA 2007 NA 0.841
28 BWA 2008 NA 0.845
29 BWA 2009 NA 0.840
30 BWA 2010 NA 0.856
31 BWA 2011 NA 0.873
32 BWA 2012 NA 0.867
33 BWA 2013 NA 0.889
34 BWA 2014 NA 0.879
35 BWA 2015 NA 0.883
36 BWA 2016 NA 0.854
37 BWA 2017 NA 0.891
38 BWA 2018 NA 0.920
What I would like to do is to impute missing data for the variable anc4, using the slope of a linear regression based on regional medians (median). I would like to do that at the country level, as countries do not all pertain to the same region.
This is what I have tried:
df_model <- df8
predictions <- vector()
for (i in unique(df_model$iso3)) {
  temp <- df_model[df_model[, 2] == i, ]
  predictions <- c(predictions,
                   predict(lm(median ~ year, temp),
                           df8[is.na(df8$anc4) & df8$iso3 == i, ]))
}
df8[is.na(df8$anc4), ]$anc4 <- predictions
I took the code I have been using to impute missing anc4 data with a linear regression on the observed anc4 data points and tried to adapt it to use the medians, but it did not quite work!
Thank you so much!
Your last comment made your question clear: you get the slope from the linear regression on the medians and you get the intercept from the only non-missing value.
However, there was a rather serious flaw in your code: you should never grow a vector inside a for loop. Use *apply functions, or even better, use the *map functions from the purrr package. If you have a very good reason to use a for loop, at least preallocate its size.
Since you get the intercept from outside the model, you cannot use predict here. Fortunately, it is rather simple to predict manually with a linear model.
Here is my solution using the dplyr syntax. If you are not familiar with it, I urge you to read about it.
x <- df_model %>%
  group_by(iso3) %>%
  mutate(
    slope = lm(median ~ year)$coefficients["year"],
    intercept = anc4[!is.na(anc4)] - slope * year[!is.na(anc4)],
    anc4_imput = intercept + year * slope,
    anc4_error = anc4 - anc4_imput
  )
x
#> # A tibble: 38 x 8
#> # Groups: iso3 [2]
#> iso3 year anc4 median slope intercept anc4_imput anc4_error
#> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 BIH 2000 NA 0.739 0.00844 -16.1 0.758 NA
#> 2 BIH 2001 NA 0.746 0.00844 -16.1 0.766 NA
#> 3 BIH 2002 NA 0.763 0.00844 -16.1 0.774 NA
#> 4 BIH 2003 NA 0.778 0.00844 -16.1 0.783 NA
#> 5 BIH 2004 NA 0.842 0.00844 -16.1 0.791 NA
#> 6 BIH 2005 NA 0.801 0.00844 -16.1 0.800 NA
#> 7 BIH 2006 NA 0.819 0.00844 -16.1 0.808 NA
#> 8 BIH 2007 NA 0.841 0.00844 -16.1 0.817 NA
#> 9 BIH 2008 NA 0.845 0.00844 -16.1 0.825 NA
#> 10 BIH 2009 NA 0.84 0.00844 -16.1 0.834 NA
#> # ... with 28 more rows
#error is negligible
x %>% filter(!is.na(anc4))
#> # A tibble: 2 x 8
#> # Groups: iso3 [2]
#> iso3 year anc4 median slope intercept anc4_imput anc4_error
#> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 BIH 2010 0.842 0.856 0.00844 -16.1 0.842 1.22e-15
#> 2 BWA 2006 10.7 10.8 0.00844 -6.20 10.7 0.
#Created on 2020-06-12 by the reprex package (v0.3.0)

regression by group and retain all the columns in R

I am doing a linear regression by group and want to extract the residuals of the regression.
library(dplyr)
library(broom)  # needed for augment()
set.seed(124)
dat <- data.frame(ID = sample(111:503, 18576, replace = TRUE),
                  ID2 = sample(11:50, 18576, replace = TRUE),
                  ID3 = sample(1:14, 18576, replace = TRUE),
                  yearRef = sample(1998:2014, 18576, replace = TRUE),
                  value = rnorm(18576))
resid <- dat %>%
  dplyr::group_by(ID3) %>%
  do(augment(lm(value ~ yearRef, data = .))) %>%
  ungroup()
How do I retain ID and ID2 as well in resid? At the moment, only ID3 is retained in the final data frame.
Use group_split, then loop through each group using map_dfr, binding ID and ID2 to the augment output with bind_cols:
library(dplyr)
library(purrr)
library(broom)
dat %>%
  group_split(ID3) %>%
  map_dfr(~ bind_cols(select(.x, ID, ID2),
                      augment(lm(value ~ yearRef, data = .x))),
          .id = "ID3")
# A tibble: 18,576 x 12
ID3 ID ID2 value yearRef .fitted .se.fit .resid .hat .sigma .cooksd
<chr> <int> <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 196 16 -0.385 2009 -0.0406 0.0308 -0.344 1.00e-3 0.973 6.27e-5
2 1 372 47 -0.793 2012 -0.0676 0.0414 -0.726 1.81e-3 0.973 5.05e-4
3 1 470 15 -0.496 2011 -0.0586 0.0374 -0.438 1.48e-3 0.973 1.50e-4
4 1 242 40 -1.13 2010 -0.0496 0.0338 -1.08 1.21e-3 0.973 7.54e-4
5 1 471 34 1.28 2006 -0.0135 0.0262 1.29 7.26e-4 0.972 6.39e-4
6 1 434 35 -1.09 1998 0.0586 0.0496 -1.15 2.61e-3 0.973 1.82e-3
7 1 467 45 -0.0663 2011 -0.0586 0.0374 -0.00769 1.48e-3 0.973 4.64e-8
8 1 334 27 -1.37 2003 0.0135 0.0305 -1.38 9.86e-4 0.972 9.92e-4
9 1 186 25 -0.0195 2003 0.0135 0.0305 -0.0331 9.86e-4 0.973 5.71e-7
10 1 114 34 1.09 2014 -0.0857 0.0500 1.18 2.64e-3 0.973 1.94e-3
# ... with 18,566 more rows, and 1 more variable: .std.resid <dbl>
Taking the "many models" approach, you can nest the data on ID3 and use purrr::map to create a list-column of the broom::augment data frames. The data list-column has all the original columns aside from ID3; map into that and select just the ones you want. Here I'm assuming you want to keep any column that starts with "ID", but you can change this. Then unnest both the data and the augment data frames.
library(dplyr)
library(tidyr)
dat %>%
  group_by(ID3) %>%
  nest() %>%
  mutate(aug = purrr::map(data, ~ broom::augment(lm(value ~ yearRef, data = .))),
         data = purrr::map(data, select, starts_with("ID"))) %>%
  unnest(c(data, aug))
#> # A tibble: 18,576 x 12
#> # Groups: ID3 [14]
#> ID3 ID ID2 value yearRef .fitted .se.fit .resid .hat .sigma
#> <int> <int> <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 11 431 15 0.619 2002 0.0326 0.0346 0.586 1.21e-3 0.995
#> 2 11 500 21 -0.432 2000 0.0299 0.0424 -0.462 1.82e-3 0.995
#> 3 11 392 28 -0.246 1998 0.0273 0.0515 -0.273 2.67e-3 0.995
#> 4 11 292 40 -0.425 1998 0.0273 0.0515 -0.452 2.67e-3 0.995
#> 5 11 175 36 -0.258 1999 0.0286 0.0468 -0.287 2.22e-3 0.995
#> 6 11 419 23 3.13 2005 0.0365 0.0273 3.09 7.54e-4 0.992
#> 7 11 329 17 -0.0414 2007 0.0391 0.0274 -0.0806 7.57e-4 0.995
#> 8 11 284 23 -0.450 2006 0.0378 0.0268 -0.488 7.25e-4 0.995
#> 9 11 136 28 -0.129 2006 0.0378 0.0268 -0.167 7.25e-4 0.995
#> 10 11 118 17 -1.55 2013 0.0470 0.0470 -1.60 2.24e-3 0.995
#> # … with 18,566 more rows, and 2 more variables: .cooksd <dbl>,
#> # .std.resid <dbl>

How to remove the variable names from the diagonal and put them on axes in R function scatterplotMatrix?

I am trying to reproduce a matrix plot from a book. Here is the plot in the book:
Here is my code:
y <- read.table("T1-2.dat")
colnames(y) <- c("density", "mach-dir", "cross-dir")
library(car)
scatterplotMatrix(y, smooth = FALSE, regLine = FALSE,
                  var.labels = colnames(y),
                  diagonal = list(method = "boxplot"))
And this is what it looks like right now:
How can I remove the names from the diagonal and put them on the sides of the matrix, just like the plot in the book?
Thanks in advance.
Data:
> y
density mach-dir cross-dir
1 0.801 121.41 70.42
2 0.824 127.70 72.47
3 0.841 129.20 78.20
4 0.816 131.80 74.89
5 0.840 135.10 71.21
6 0.842 131.50 78.39
7 0.820 126.70 69.02
8 0.802 115.10 73.10
9 0.828 130.80 79.28
10 0.819 124.60 76.48
11 0.826 118.31 70.25
12 0.802 114.20 72.88
13 0.810 120.30 68.23
14 0.802 115.70 68.12
15 0.832 117.51 71.62
16 0.796 109.81 53.10
17 0.759 109.10 50.85
18 0.770 115.10 51.68
19 0.759 118.31 50.60
20 0.772 112.60 53.51
21 0.806 116.20 56.53
22 0.803 118.00 70.70
23 0.845 131.00 74.35
24 0.822 125.70 68.29
25 0.971 126.10 72.10
26 0.816 125.80 70.64
27 0.836 125.50 76.33
28 0.815 127.80 76.75
29 0.822 130.50 80.33
30 0.822 127.90 75.68
31 0.843 123.90 78.54
32 0.824 124.10 71.91
33 0.788 120.80 68.22
34 0.782 107.40 54.42
35 0.795 120.70 70.41
36 0.805 121.91 73.68
37 0.836 122.31 74.93
38 0.788 110.60 53.52
39 0.772 103.51 48.93
40 0.776 110.71 53.67
41 0.758 113.80 52.42
And by the way, can we also display "Max, Med, Min" and the corresponding values on the diagonal as well? Thanks.
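As a hedged sketch of one alternative, assuming the GGally package is an acceptable substitute for car: GGally::ggpairs() draws the variable names in strips along the top and right of the matrix rather than on the diagonal panels. (This sketch does not address the Max/Med/Min follow-up question.)

```r
# Hedged alternative sketch (not using car): ggpairs() labels the variables
# in outer strips instead of on the diagonal, and the diagonal panels can be
# left blank with "blankDiag".
library(GGally)
ggpairs(y,
        upper = list(continuous = "points"),
        lower = list(continuous = "points"),
        diag  = list(continuous = "blankDiag"))
```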

How to truncate Kaplan Meier curves when number at risk is < 10

For a publication in a peer-reviewed scientific journal (http://www.redjournal.org), we would like to prepare Kaplan-Meier plots. The journal has the following specific guidelines for these plots:
"If your figures include curves generated from analyses using the Kaplan-Meier method or the cumulative incidence method, the following are now requirements for the presentation of these curves:
That the number of patients at risk is indicated;
That censoring marks are included;
That curves be truncated when there are fewer than 10 patients at risk; and
An estimate of the confidence interval should be included either in the figure itself or the text."
Here, I illustrate my problem with the veteran dataset (https://github.com/tidyverse/reprex is great!).
We can address 1, 2, and 4 easily with the survminer package:
library(survival)
library(survminer)
#> Warning: package 'survminer' was built under R version 3.4.3
#> Loading required package: ggplot2
#> Loading required package: ggpubr
#> Warning: package 'ggpubr' was built under R version 3.4.3
#> Loading required package: magrittr
fit.obj <- survfit(Surv(time, status) ~ celltype, data = veteran)
ggsurvplot(fit.obj,
           conf.int = TRUE,
           risk.table = "absolute",
           tables.theme = theme_cleantable())
I have, however, a problem with requirement 3 (truncate curves when there are fewer than 10 patients at risk). I see that all the required information is available in the survfit object:
library(survival)
fit.obj <- survfit(Surv(time, status) ~ celltype, data = veteran)
summary(fit.obj)
#> Call: survfit(formula = Surv(time, status) ~ celltype, data = veteran)
#>
#> celltype=squamous
#> time n.risk n.event survival std.err lower 95% CI upper 95% CI
#> 1 35 2 0.943 0.0392 0.8690 1.000
#> 8 33 1 0.914 0.0473 0.8261 1.000
#> 10 32 1 0.886 0.0538 0.7863 0.998
#> 11 31 1 0.857 0.0591 0.7487 0.981
#> 15 30 1 0.829 0.0637 0.7127 0.963
#> 25 29 1 0.800 0.0676 0.6779 0.944
#> 30 27 1 0.770 0.0713 0.6426 0.924
#> 33 26 1 0.741 0.0745 0.6083 0.902
#> 42 25 1 0.711 0.0772 0.5749 0.880
#> 44 24 1 0.681 0.0794 0.5423 0.856
#> 72 23 1 0.652 0.0813 0.5105 0.832
#> 82 22 1 0.622 0.0828 0.4793 0.808
#> 110 19 1 0.589 0.0847 0.4448 0.781
#> 111 18 1 0.557 0.0861 0.4112 0.754
#> 112 17 1 0.524 0.0870 0.3784 0.726
#> 118 16 1 0.491 0.0875 0.3464 0.697
#> 126 15 1 0.458 0.0876 0.3152 0.667
#> 144 14 1 0.426 0.0873 0.2849 0.636
#> 201 13 1 0.393 0.0865 0.2553 0.605
#> 228 12 1 0.360 0.0852 0.2265 0.573
#> 242 10 1 0.324 0.0840 0.1951 0.539
#> 283 9 1 0.288 0.0820 0.1650 0.503
#> 314 8 1 0.252 0.0793 0.1362 0.467
#> 357 7 1 0.216 0.0757 0.1088 0.429
#> 389 6 1 0.180 0.0711 0.0831 0.391
#> 411 5 1 0.144 0.0654 0.0592 0.351
#> 467 4 1 0.108 0.0581 0.0377 0.310
#> 587 3 1 0.072 0.0487 0.0192 0.271
#> 991 2 1 0.036 0.0352 0.0053 0.245
#> 999 1 1 0.000 NaN NA NA
#>
#> celltype=smallcell
#> time n.risk n.event survival std.err lower 95% CI upper 95% CI
#> 2 48 1 0.9792 0.0206 0.93958 1.000
#> 4 47 1 0.9583 0.0288 0.90344 1.000
#> 7 46 2 0.9167 0.0399 0.84172 0.998
#> 8 44 1 0.8958 0.0441 0.81345 0.987
#> 10 43 1 0.8750 0.0477 0.78627 0.974
#> 13 42 2 0.8333 0.0538 0.73430 0.946
#> 16 40 1 0.8125 0.0563 0.70926 0.931
#> 18 39 2 0.7708 0.0607 0.66065 0.899
#> 20 37 2 0.7292 0.0641 0.61369 0.866
#> 21 35 2 0.6875 0.0669 0.56812 0.832
#> 22 33 1 0.6667 0.0680 0.54580 0.814
#> 24 32 1 0.6458 0.0690 0.52377 0.796
#> 25 31 2 0.6042 0.0706 0.48052 0.760
#> 27 29 1 0.5833 0.0712 0.45928 0.741
#> 29 28 1 0.5625 0.0716 0.43830 0.722
#> 30 27 1 0.5417 0.0719 0.41756 0.703
#> 31 26 1 0.5208 0.0721 0.39706 0.683
#> 51 25 2 0.4792 0.0721 0.35678 0.644
#> 52 23 1 0.4583 0.0719 0.33699 0.623
#> 54 22 2 0.4167 0.0712 0.29814 0.582
#> 56 20 1 0.3958 0.0706 0.27908 0.561
#> 59 19 1 0.3750 0.0699 0.26027 0.540
#> 61 18 1 0.3542 0.0690 0.24171 0.519
#> 63 17 1 0.3333 0.0680 0.22342 0.497
#> 80 16 1 0.3125 0.0669 0.20541 0.475
#> 87 15 1 0.2917 0.0656 0.18768 0.453
#> 95 14 1 0.2708 0.0641 0.17026 0.431
#> 99 12 2 0.2257 0.0609 0.13302 0.383
#> 117 9 1 0.2006 0.0591 0.11267 0.357
#> 122 8 1 0.1755 0.0567 0.09316 0.331
#> 139 6 1 0.1463 0.0543 0.07066 0.303
#> 151 5 1 0.1170 0.0507 0.05005 0.274
#> 153 4 1 0.0878 0.0457 0.03163 0.244
#> 287 3 1 0.0585 0.0387 0.01600 0.214
#> 384 2 1 0.0293 0.0283 0.00438 0.195
#> 392 1 1 0.0000 NaN NA NA
#>
#> celltype=adeno
#> time n.risk n.event survival std.err lower 95% CI upper 95% CI
#> 3 27 1 0.9630 0.0363 0.89430 1.000
#> 7 26 1 0.9259 0.0504 0.83223 1.000
#> 8 25 2 0.8519 0.0684 0.72786 0.997
#> 12 23 1 0.8148 0.0748 0.68071 0.975
#> 18 22 1 0.7778 0.0800 0.63576 0.952
#> 19 21 1 0.7407 0.0843 0.59259 0.926
#> 24 20 1 0.7037 0.0879 0.55093 0.899
#> 31 19 1 0.6667 0.0907 0.51059 0.870
#> 35 18 1 0.6296 0.0929 0.47146 0.841
#> 36 17 1 0.5926 0.0946 0.43344 0.810
#> 45 16 1 0.5556 0.0956 0.39647 0.778
#> 48 15 1 0.5185 0.0962 0.36050 0.746
#> 51 14 1 0.4815 0.0962 0.32552 0.712
#> 52 13 1 0.4444 0.0956 0.29152 0.678
#> 73 12 1 0.4074 0.0946 0.25850 0.642
#> 80 11 1 0.3704 0.0929 0.22649 0.606
#> 84 9 1 0.3292 0.0913 0.19121 0.567
#> 90 8 1 0.2881 0.0887 0.15759 0.527
#> 92 7 1 0.2469 0.0850 0.12575 0.485
#> 95 6 1 0.2058 0.0802 0.09587 0.442
#> 117 5 1 0.1646 0.0740 0.06824 0.397
#> 132 4 1 0.1235 0.0659 0.04335 0.352
#> 140 3 1 0.0823 0.0553 0.02204 0.307
#> 162 2 1 0.0412 0.0401 0.00608 0.279
#> 186 1 1 0.0000 NaN NA NA
#>
#> celltype=large
#> time n.risk n.event survival std.err lower 95% CI upper 95% CI
#> 12 27 1 0.9630 0.0363 0.89430 1.000
#> 15 26 1 0.9259 0.0504 0.83223 1.000
#> 19 25 1 0.8889 0.0605 0.77791 1.000
#> 43 24 1 0.8519 0.0684 0.72786 0.997
#> 49 23 1 0.8148 0.0748 0.68071 0.975
#> 52 22 1 0.7778 0.0800 0.63576 0.952
#> 53 21 1 0.7407 0.0843 0.59259 0.926
#> 100 20 1 0.7037 0.0879 0.55093 0.899
#> 103 19 1 0.6667 0.0907 0.51059 0.870
#> 105 18 1 0.6296 0.0929 0.47146 0.841
#> 111 17 1 0.5926 0.0946 0.43344 0.810
#> 133 16 1 0.5556 0.0956 0.39647 0.778
#> 143 15 1 0.5185 0.0962 0.36050 0.746
#> 156 14 1 0.4815 0.0962 0.32552 0.712
#> 162 13 1 0.4444 0.0956 0.29152 0.678
#> 164 12 1 0.4074 0.0946 0.25850 0.642
#> 177 11 1 0.3704 0.0929 0.22649 0.606
#> 200 9 1 0.3292 0.0913 0.19121 0.567
#> 216 8 1 0.2881 0.0887 0.15759 0.527
#> 231 7 1 0.2469 0.0850 0.12575 0.485
#> 250 6 1 0.2058 0.0802 0.09587 0.442
#> 260 5 1 0.1646 0.0740 0.06824 0.397
#> 278 4 1 0.1235 0.0659 0.04335 0.352
#> 340 3 1 0.0823 0.0553 0.02204 0.307
#> 378 2 1 0.0412 0.0401 0.00608 0.279
#> 553 1 1 0.0000 NaN NA NA
But I have no idea how I can manipulate this list. I would very much appreciate any advice on how to filter out all lines with n.risk < 10 from fit.obj.
I can't quite seem to get this all the way there. But I see that you can pass a data.frame rather than a fit object to the plotting function. You can do this and clip the values. For example
ss <- subset(surv_summary(fit.obj), n.risk >= 10)
ggsurvplot(ss, conf.int = TRUE)
But it seems in this mode it does not automatically print the table. There is a function to draw just the table with
ggrisktable(fit.obj, tables.theme = theme_cleantable())
So I guess you could just combine them. Maybe I'm missing an easier way to draw the table when using a data.frame in the same plot.
As a slight variation on the above answer, if you want to truncate each group individually when fewer than 10 patients are at risk in that group, I found this approach to work without requiring the figure and table to be plotted separately:
library(survival)
library(survminer)

# truncate each curve when fewer than 10 patients are at risk
atrisk <- 10

# KM fit
fit.obj <- survfit(Surv(time, status) ~ celltype, data = veteran)

# subset each stratum separately
maxcutofftime <- 0  # for plotting
strata <- rep(names(fit.obj$strata), fit.obj$strata)
for (i in names(fit.obj$strata)) {
  cutofftime <- min(fit.obj$time[fit.obj$n.risk < atrisk & strata == i])
  maxcutofftime <- max(maxcutofftime, cutofftime)
  cutoffs <- which(fit.obj$n.risk < atrisk & strata == i)
  fit.obj$lower[cutoffs] <- NA
  fit.obj$upper[cutoffs] <- NA
  fit.obj$surv[cutoffs] <- NA
}

# plot
ggsurvplot(fit.obj, data = veteran, risk.table = TRUE, conf.int = TRUE,
           pval = FALSE, tables.theme = theme_cleantable(),
           xlim = c(0, maxcutofftime), break.x.by = 90)
Edited to add: note that if we had used pval = TRUE above, it would give the p-value for the truncated data, not the full data. It doesn't make much of a difference in this example, as both are p < 0.0001, but be careful :)
I'm following up on MrFlick's great answer.
I'd interpret 3) to mean that there should be at least 10 at risk in total, i.e., not per group. So:
1. Create an ungrouped Kaplan-Meier fit first and determine the time cutoff from there.
2. Subset the surv_summary object with respect to this cutoff.
3. Plot the KM curve and risk table separately. Crucially, the function survminer::ggrisktable() (a minimal front end for ggsurvtable()) accepts the options xlim and break.time.by. However, the function can currently only extend the upper time limit, not reduce it. I assume this is a bug. I created the function ggsurvtable_mod() to change this.
4. Turn the ggplot objects into grobs and use gridExtra::grid.arrange() to put both plots together. There is probably a more elegant way to do this based on the options widths and heights.
Admittedly, this is a bit of a hack and needs tweaking to get the correct alignment between the survival plot and the risk table.
library(survival)
library(survminer)

# ungrouped KM estimate to determine cutoff
fit1_ss <- surv_summary(survfit(Surv(time, status) ~ 1, data = veteran))

# time cutoff with fewer than 10 at risk
cutoff <- min(fit1_ss$time[fit1_ss$n.risk < 10])

# KM fit and subset to cutoff
fit.obj <- survfit(Surv(time, status) ~ celltype, data = veteran)
fit_ss <- subset(surv_summary(fit.obj), time < cutoff)

# KM survival plot and risk table as separate plots
p1 <- ggsurvplot(fit_ss, conf.int = TRUE)
# note options xlim and break.time.by
p2 <- ggsurvtable_mod(fit.obj,
                      survtable = "risk.table",
                      tables.theme = theme_cleantable(),
                      xlim = c(0, cutoff),
                      break.time.by = 100)

# turn ggplot objects into grobs and arrange them (needs tweaking)
g1 <- ggplotGrob(p1)
g2 <- ggplotGrob(p2)
lom <- rbind(c(NA, rep(1, 14)),
             c(NA, rep(1, 14)),
             c(rep(2, 15)))
gridExtra::grid.arrange(grobs = list(g1, g2), layout_matrix = lom)

implement apply and t.test functions same time

I am using R. I have a couple of conditions with three replicates each, and I want to apply t.test to each element of the conditions (the rows). For this I want to use the apply function on the dataset (143,554 rows) containing all the info, retrieving the p-value obtained from the t.test.
Columns 4, 6, 8 are the replicates for the first condition (the main argument of apply), and columns 10, 12, 14 are the replicates for the second condition (example data at the end). I thought that something like this could do the work:
t.test.10x <- apply(MT.10x[, c(4, 6, 8)], 1, function(x)
  t.test(x, MT.10x[, c(10, 12, 14)])$p.value)
However, this syntax is wrong: providing the whole table for the second condition means t.test does not go row by row; instead, this approach compares each row of columns 4, 6, 8 against all the rows of columns 10, 12, 14.
I don't want to use a for loop, but if it is absolutely required... well...
Thank you!!
Dataset example:
Chr Start End wt1_R wt1_T wt2_R wt2_T wt3_R wt3_T ko1_R ko1_T ko2_R ko2_T ko3_R ko3_T
chr1 3060417 3060419 0.0698 43 0.25 28 0.172 29 0.188 32 0.156 45 0.119 42
chr1 3060431 3060433 0.786 28 0.818 22 0.526 19 0.895 19 0.833 36 0.784 37
chr1 3168805 3168807 0.688 16 1 19 0.769 13 0.929 14 0.933 15 0.9 10
chr1 3228992 3228994 0.7 10 1 11 0.786 14 1 14 0.938 16 0.923 13
chr1 3233065 3233067 0.857 14 0.917 12 1 17 0.846 13 0.857 21 0.952 21
chr1 3265234 3265236 0.84 25 0.727 11 0.909 22 0.968 31 0.895 19 0.905 21
chr1 3265322 3265324 0.111 27 0.25 28 0.55 20 0.385 13 0.467 15 0.462 13
chr1 3265345 3265347 0.806 31 0.857 35 0.733 30 0.9 30 0.8 25 1 17
chr1 3265357 3265359 1 30 0.759 29 0.758 33 0.867 30 0.903 31 1 18
chr1 3265486 3265488 1 15 0.545 22 1 13 0.8 10 0.917 12 1 24
chr1 3265512 3265514 0.857 28 0.75 20 0.583 24 0.714 21 0.882 17 0.839 31
chr1 3265540 3265542 0.757 37 0.966 29 0.969 32 0.774 31 0.955 22 0.971 34
chr1 3265771 3265773 0.741 27 0.864 22 0.963 27 1 20 0.864 22 0.962 26
chr1 3265776 3265778 1 20 1 21 1 26 0.722 18 1 24 0.852 27
chr1 3265803 3265805 0.611 18 0.96 25 1 17 1 18 0.895 19 0.828 29
chr1 3760225 3760227 0.278 36 0.0741 27 0.417 24 0.158 19 0.4 40 0.136 22
chr1 3760285 3760287 0.851 47 0.711 38 0.867 15 0.81 21 0.914 35 0.893 28
chr1 3761299 3761301 0.786 14 0.885 26 1 11 0.929 14 0.771 35 0.75 24
chr1 3761414 3761416 0.706 17 1 17 0.545 22 0.857 14 0.818 11 0.8 15
chr1 3838606 3838608 0.806 31 0.692 13 0.611 18 1 11 1 23 1 11
chr1 3838611 3838613 0.767 30 1 13 0.947 19 0.818 11 1 20 1 11
chr1 4182108 4182110 0.231 13 0.5 14 0.143 21 0.0667 15 0.235 17 0.353 17
chr1 4547434 4547436 0.9 10 1 13 1 17 1 14 0.909 11 0.909 11
chr1 4547456 4547458 1 18 1 10 0.895 19 0.833 12 1 12 1 12
chr1 4547496 4547498 0.812 16 0.917 12 0.75 16 0.923 13 0.818 11 0.9 10
chr1 4547509 4547511 1 14 1 12 1 15 0.9 10 0.833 12 1 11
chr1 4547512 4547514 0.923 13 1 12 1 14 0.909 11 0.833 12 0.909 11
chr1 4765732 4765734 0 11 0 12 0 11 0 13 0 13 0.1 10
chr1 5185343 5185345 0.818 22 0.909 22 0.963 27 1 15 0.923 13 1 16
chr1 5185567 5185569 0.885 52 0.781 32 0.984 63 1 37 0.844 45 1 29
I think you are looking for mapply:
mapply(function(x, y) t.test(x, y)$p.value,
       MT.10x[, c(4, 6, 8)], MT.10x[, c(10, 12, 14)])
## wt1_R wt2_R wt3_R
## 0.4790554 0.8289961 0.5204527
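Note that mapply pairs the columns (wt1 vs ko1, and so on), giving one p-value per replicate pair. If a truly row-by-row comparison is what the question asks for, a hedged sketch (assuming the replicate columns are numeric) is:

```r
# Hedged row-wise sketch: within each row, test columns 4, 6, 8 against
# columns 10, 12, 14. Extracting numeric sub-matrices first avoids the
# type-coercion issues of apply() on a mixed-type data frame.
wt <- as.matrix(MT.10x[, c(4, 6, 8)])
ko <- as.matrix(MT.10x[, c(10, 12, 14)])
pvals <- vapply(seq_len(nrow(MT.10x)),
                function(i) t.test(wt[i, ], ko[i, ])$p.value,
                numeric(1))
```

With only three values per group the per-row tests have very low power, so the p-values should be interpreted cautiously.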
