survdiff() output fields in R

My question is about the output structure of the survdiff() function from the 'survival' package in R. Namely, I have a data frame containing survival data
> dat
ID Time Treatment Gender Censored
1 E002 2.7597536 IND F 0
2 E003 4.2710472 Control M 0
3 E005 1.4784394 IND F 0
4 E006 6.8993840 Control F 1
5 E008 9.5934292 IND M 0
6 E009 2.9897331 Control F 0
7 E014 1.3470226 IND F 1
8 E016 2.1683778 Control F 1
9 E018 2.7597536 IND F 1
10 E022 1.3798768 IND F 0
11 E023 0.7227926 IND M 1
12 E024 5.5195072 IND F 0
13 E025 2.4640657 Control F 0
14 E028 7.4579055 Control M 1
15 E029 5.5195072 Control F 1
16 E030 2.7926078 IND M 0
17 E031 4.9938398 Control F 0
18 E032 2.7268994 IND M 0
19 E033 0.1642710 IND M 1
20 E034 4.1396304 Control F 0
and a model
> diff = survdiff(Surv(Time, Censored) ~ Treatment+Gender, data = dat)
> diff
Call:
survdiff(formula = Surv(Time, Censored) ~ Treatment + Gender,
data = dat)
N Observed Expected (O-E)^2/E (O-E)^2/V
Treatment=Control, Gender=M 2 1 1.65 0.255876 0.360905
Treatment=Control, Gender=F 7 3 2.72 0.027970 0.046119
Treatment=IND, Gender=M 5 2 2.03 0.000365 0.000519
Treatment=IND, Gender=F 6 2 1.60 0.100494 0.139041
Chisq= 0.5 on 3 degrees of freedom, p= 0.924
I'm wondering which field of the output object contains the values from the rightmost column, (O-E)^2/V. I'd like to use them further, but I can't obtain them from diff$obs, diff$exp, diff$var, or any combination of these that I've tried.
Any help would be much appreciated.

For (O-E)^2/V, note that with no strata() term in the call, diff$obs and diff$exp are plain vectors, so try
(diff$obs - diff$exp)^2 / diag(diff$var)
while for (O-E)^2/E try
(diff$obs - diff$exp)^2 / diff$exp
If the call had included a strata() term, diff$obs and diff$exp would instead be matrices with one column per stratum, and you would sum across strata first:
rowSums(diff$obs - diff$exp)^2 / diag(diff$var)


Binary representation of breast cancer wisconsin database

I want to produce a binary representation of the well-known breast cancer Wisconsin database.
The initial data set has 31 numerical variables, and one categorical variable.
id_number diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave_points_mean symmetry_mean
1 842302 M 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419
2 842517 M 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812
3 84300903 M 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069
4 84348301 M 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597
5 84358402 M 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809
I want to produce a binary representation of this data frame by:
transforming the diagnosis column (levels M, B) into two columns, diagnosis_M and diagnosis_B, putting 1 or 0 in each row depending on the value in the initial column (M or B);
splitting each numerical column into two columns around its mean: e.g. for radius_mean, a column radius_mean_great that gets 1 if the value > mean and 0 otherwise, and a column radius_mean_low with the inverse. (The original text says "median", but the example and the code below use the mean.)
library(RCurl)  # getURL() comes from RCurl; mlbench and curl are not needed here
UCI_data_URL <- getURL('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data')
names <- c('id_number', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean','concave_points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave_points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave_points_worst', 'symmetry_worst', 'fractal_dimension_worst')
breast.cancer.fr <- read.table(textConnection(UCI_data_URL), sep = ',', col.names = names)
Well, there are several ways to binarize the data; here is one approach that I hope helps:
df <- breast.cancer.fr[,3:32]
df2 <- matrix(NA, ncol = 2*ncol(df), nrow = nrow(df))
for(i in 1:ncol(df)){
  df2[, 2*i - 1] <- as.numeric(df[, i] > mean(df[, i]))
  df2[, 2*i]     <- as.numeric(df[, i] <= mean(df[, i]))
}
colnames(df2) <- c(rbind(paste0(names(df),"_great"),paste0(names(df),"_low")))
library(dplyr)
df3 <- select(breast.cancer.fr,id_number,diagnosis) %>% mutate(diagnosis_M = as.numeric(diagnosis == "M")) %>%
mutate(diagnosis_B = as.numeric(diagnosis == "B"))
df <- cbind(df3[,-2],df2)
df[1:10,1:7]
id_number diagnosis_M diagnosis_B radius_mean_great radius_mean_low texture_mean_great texture_mean_low
1 842302 1 0 1 0 0 1
2 842517 1 0 1 0 0 1
3 84300903 1 0 1 0 1 0
4 84348301 1 0 0 1 1 0
5 84358402 1 0 1 0 0 1
6 843786 1 0 0 1 0 1
7 844359 1 0 1 0 1 0
8 84458202 1 0 0 1 1 0
9 844981 1 0 0 1 1 0
10 84501001 1 0 0 1 1 0
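The same indicator matrix can be built more compactly with a single sapply() pass instead of an explicit loop. A minimal sketch of the idea, shown on a small made-up data frame rather than the full breast.cancer.fr:

```r
# compact alternative: vectorize over columns with sapply
# (small made-up data frame standing in for breast.cancer.fr[, 3:32])
df <- data.frame(radius_mean  = c(17.99, 20.57, 11.42),
                 texture_mean = c(10.38, 17.77, 20.38))

great <- sapply(df, function(x) as.numeric(x > mean(x)))
low   <- 1 - great                       # complement: value <= column mean
colnames(great) <- paste0(names(df), "_great")
colnames(low)   <- paste0(names(df), "_low")

# interleave the _great/_low pairs in the original column order
df2 <- cbind(great, low)[, order(c(2*seq_along(df) - 1, 2*seq_along(df)))]
```

Since each value is either above or at-or-below its column mean, the paired columns always sum to 1, which is a quick sanity check on the result.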

R is not ordering data correctly - skips E values

I am trying to order data by the column weightFisher. However, it is as if R does not treat the e-notation values as small numbers, because all the e values are skipped when I try to order from smallest to greatest.
Code:
resultTable_bon <- GenTable(GOdata_bon,
weightFisher = resultFisher_bon,
weightKS = resultKS_bon,
topNodes = 15136,
ranksOf = 'weightFisher'
)
head(resultTable_bon)
#create Fisher ordered df
indF <- order(resultTable_bon$weightFisher)
resultTable_bonF <- resultTable_bon[indF, ]
what resultTable_bon looks like:
GO.ID Term Annotated Significant Expected Rank in weightFisher
1 GO:0019373 epoxygenase P450 pathway 19 13 1.12 1
2 GO:0097267 omega-hydroxylase P450 pathway 9 7 0.53 2
3 GO:0042738 exogenous drug catabolic process 10 7 0.59 3
weightFisher weightKS
1 1.9e-12 0.79744
2 7.9e-08 0.96752
3 2.5e-07 0.96336
what "ordered" resultTable_bonF looks like:
GO.ID Term Annotated Significant Expected Rank in weightFisher
17 GO:0014075 response to amine 33 7 1.95 17
18 GO:0034372 very-low-density lipoprotein particle re... 11 5 0.65 18
19 GO:0060710 chorio-allantoic fusion 6 4 0.35 19
weightFisher weightKS
17 0.00014 0.96387
18 0.00016 0.83624
19 0.00016 0.92286
As @bhas says, it appears to be working precisely as you want it to. Maybe it's the use of head() that's confusing you?
To put your mind at ease, try it with something simpler
dtf <- data.frame(a=c(1, 8, 6, 2)^-10, b=c(7, 2, 1, 6))
dtf
# a b
# 1 1.000000e+00 7
# 2 9.313226e-10 2
# 3 1.653817e-08 1
# 4 9.765625e-04 6
dtf[order(dtf$a), ]
# a b
# 2 9.313226e-10 2
# 3 1.653817e-08 1
# 4 9.765625e-04 6
# 1 1.000000e+00 7
Try the following:
resultTable_bon$weightFisher <- as.numeric(as.character(resultTable_bon$weightFisher))
(going through as.character() matters if the column is a factor; as.numeric() alone would return the factor level codes rather than the values). Then:
resultTable_bonF <- resultTable_bon[order(resultTable_bon$weightFisher), ]
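The reason as.numeric() is needed at all is usually that the column came back as character or factor, in which case order() sorts lexicographically and e-notation values land in the wrong place. A small demonstration of the factor pitfall:

```r
# p-values stored as a factor (as GenTable output sometimes is)
x <- factor(c("1.9e-12", "0.00016", "7.9e-08"))

as.numeric(x)                # internal factor level codes, not the values
as.numeric(as.character(x))  # the actual numbers: 1.9e-12 0.00016 7.9e-08
```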

Removing outlier from excel using R code

The following data come from an Excel file:
Part A B C D E F G H I J K L
XXX 0 1 1 2 0 1 2 3 1 2 1 0
YYY 0 1 2 2 0 30 1 1 0 1 10 0
....
So, I want to display the parts that contain outliers, defined by the interval
[median – t * MAD, median + t * MAD]
How can I code this in R, as a function, for a large amount of data?
You would want to calculate robust Z-scores based on the median and MAD (median absolute deviation) instead of the non-robust mean and SD. Then assess your data using Z: Z = 0 means at the median, Z = 1 means one (scaled) MAD out, etc.
Let's assume we have the following data, where one set is outliers:
df <- rbind( data.frame(tag='normal', res=rnorm(1000)*2.71), data.frame(tag='outlier', res=rnorm(20)*42))
then Z it:
df$z <- with(df, (res - median(res))/mad(res))
that gives us something like this:
> head(df)
tag res z
1 normal -3.097 -1.0532
2 normal -0.650 -0.1890
3 normal 1.200 0.4645
4 normal 1.866 0.6996
5 normal -6.280 -2.1774
6 normal 1.682 0.6346
Then cut it into Z-bands, eg.
df$band <- cut(df$z, breaks=c(-99,-3,-1,1,3,99))
That can be analyzed in a straightforward way:
> addmargins(xtabs(~band+tag, df))
tag
band normal outlier Sum
(-99,-3] 1 9 10
(-3,-1] 137 0 137
(-1,1] 719 2 721
(1,3] 143 1 144
(3,99] 0 8 8
Sum 1000 20 1020
As can be seen, the observations with the largest |Z| values (those in the (-99,-3] and (3,99] bands) come almost entirely from the outlier group.
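To answer the original question directly, the [median − t·MAD, median + t·MAD] rule can be wrapped in a small reusable function (t = 3 is a common but arbitrary choice), here applied to the question's row YYY:

```r
# flag values outside [median - t*MAD, median + t*MAD]
is_outlier <- function(x, t = 3) {
  m <- median(x)
  s <- mad(x)   # scaled MAD, consistent with the SD under normality
  x < m - t * s | x > m + t * s
}

x <- c(0, 1, 2, 2, 0, 30, 1, 1, 0, 1, 10, 0)  # the question's row YYY
x[is_outlier(x)]                               # the flagged values: 30 and 10
```

For a whole data frame of parts, apply(mat, 1, function(r) any(is_outlier(r))) gives a logical index of the rows (parts) that contain at least one outlier.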

How to calculate survival probabilities in R?

I am trying to fit a parametric survival model. I think I managed to do so. However, I could not succeed in calculating the survival probabilities:
library(survival)
zaman <- c(65,156,100,134,16,108,121,4,39,143,56,26,22,1,1,5,65,
56,65,17,7,16,22,3,4,2,3,8,4,3,30,4,43)
test <- c(rep(1,17),rep(0,16))
WBC <- c(2.3,0.75,4.3,2.6,6,10.5,10,17,5.4,7,9.4,32,35,100,
100,52,100,4.4,3,4,1.5,9,5.3,10,19,27,28,31,26,21,79,100,100)
status <- c(rep(1,33))
data <- data.frame(zaman,test,WBC)
surv3 <- Surv(zaman[test==1], status[test==1])
fit3 <- survreg( surv3 ~ log(WBC[test==1]),dist="w")
On the other hand, I had no problem at all calculating survival probabilities with the Kaplan-Meier estimator:
fit2 <- survfit(Surv(zaman[test==0], status[test==0]) ~ 1)
summary(fit2)$surv
Any idea why?
You can get predictions from a survreg object with predict; note that by default these are on the response scale, i.e. predicted survival times rather than probabilities:
predict(fit3)
If you're interested in combining this with the original data, and also in the residual and standard errors of the predictions, you can use the augment function in my broom package:
library(broom)
augment(fit3)
A full analysis might look something like:
library(survival)
library(broom)
data <- data.frame(zaman, test, WBC, status)
subdata <- data[data$test == 1, ]
fit3 <- survreg( Surv(zaman, status) ~ log(WBC), subdata, dist="w")
augment(fit3, subdata)
With the output:
zaman test WBC status .fitted .se.fit .resid
1 65 1 2.30 1 115.46728 43.913188 -50.467281
2 156 1 0.75 1 197.05852 108.389586 -41.058516
3 100 1 4.30 1 85.67236 26.043277 14.327641
4 134 1 2.60 1 108.90836 39.624106 25.091636
5 16 1 6.00 1 73.08498 20.029707 -57.084979
6 108 1 10.50 1 55.96298 13.989099 52.037022
7 121 1 10.00 1 57.28065 14.350609 63.719348
8 4 1 17.00 1 44.47189 11.607368 -40.471888
9 39 1 5.40 1 76.85181 21.708514 -37.851810
10 143 1 7.00 1 67.90395 17.911170 75.096054
11 56 1 9.40 1 58.99643 14.848751 -2.996434
12 26 1 32.00 1 32.88935 10.333303 -6.889346
13 22 1 35.00 1 31.51314 10.219871 -9.513136
14 1 1 100.00 1 19.09922 8.963022 -18.099216
15 1 1 100.00 1 19.09922 8.963022 -18.099216
16 5 1 52.00 1 26.09034 9.763728 -21.090343
17 65 1 100.00 1 19.09922 8.963022 45.900784
In this case, the .fitted column contains the predictions on the response scale, that is, predicted survival times rather than survival probabilities.
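If what you want are actual survival probabilities S(t) at a chosen time, one way is to use the standard survreg Weibull parameterization, under which each subject's survival time is Weibull with shape 1/scale and scale exp(linear predictor). A sketch using the built-in ovarian data (the time point t0 is an arbitrary choice for illustration):

```r
library(survival)

# Weibull AFT fit on the built-in ovarian data (illustrative)
fit <- survreg(Surv(futime, fustat) ~ age, data = ovarian, dist = "weibull")
lp  <- predict(fit, type = "lp")   # per-subject linear predictors

# under survreg's Weibull parameterization:
#   S(t | x) = 1 - pweibull(t, shape = 1/fit$scale, scale = exp(lp))
t0   <- 365   # hypothetical time of interest, in days
S_t0 <- 1 - pweibull(t0, shape = 1/fit$scale, scale = exp(lp))
head(S_t0)    # estimated P(T > 365) for each subject
```

predict(fit, type = "quantile", p = ...) inverts the same relationship, returning the time by which a given fraction of subjects is expected to have had the event.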

how would I get the below answer using melt/cast from reshape2 package

I have two data frames, x and y, and rbind them to get z. Then I use the reshape function (from base R, not the reshape package) to get the result below.
set.seed(1234)
x <- data.frame(rp=c(1:5),dmg=1000*runif(5), loss=500*runif(5), model="m1")
y <- data.frame(rp=c(1:5),dmg=1000*runif(5), loss=500*runif(5), model="m2")
z <- rbind(x, y)
> z
rp dmg loss model
1 113.7 320.2 m1
2 622.3 4.7 m1
3 609.3 116.3 m1
4 623.4 333.0 m1
5 860.9 257.1 m1
1 693.6 418.6 m2
2 545.0 143.1 m2
3 282.7 133.4 m2
4 923.4 93.4 m2
5 292.3 116.1 m2
> reshape(z, idvar="rp", timevar="model", direction="wide")
rp dmg.m1 loss.m1 dmg.m2 loss.m2
1 113.7 320.2 693.6 418.6
2 622.3 4.7 545.0 143.1
3 609.3 116.3 282.7 133.4
4 623.4 333.0 923.4 93.4
5 860.9 257.1 292.3 116.1
How would I get the same result using cast/melt combination in reshape2?
> dcast(melt(z, c("rp", "model")), rp ~ variable + model)
rp dmg_m1 dmg_m2 loss_m1 loss_m2
1 1 113.7034 693.5913 320.155303 418.64781
2 2 622.2994 544.9748 4.747878 143.11164
3 3 609.2747 282.7336 116.275253 133.41039
4 4 623.3794 923.4335 333.041879 93.36139
5 5 860.9154 292.3158 257.125571 116.11296
Breaking that down: first you use melt to put it in long form. However, you don't want to melt rp and model since these will serve to identify rows and columns later on.
> my.df <- melt(z, c("rp", "model"))
> my.df
rp model variable value
1 1 m1 dmg 113.703411
2 2 m1 dmg 622.299405
3 3 m1 dmg 609.274733
4 4 m1 dmg 623.379442
5 5 m1 dmg 860.915384
6 1 m2 dmg 693.591292
7 2 m2 dmg 544.974836
8 3 m2 dmg 282.733584
9 4 m2 dmg 923.433484
10 5 m2 dmg 292.315840
11 1 m1 loss 320.155303
12 2 m1 loss 4.747878
13 3 m1 loss 116.275253
14 4 m1 loss 333.041879
15 5 m1 loss 257.125571
16 1 m2 loss 418.647814
17 2 m2 loss 143.111642
18 3 m2 loss 133.410390
19 4 m2 loss 93.361395
20 5 m2 loss 116.112955
Then you cast it into a data frame using dcast. You want rp to identify rows and variable and model both to identify columns, and you express this with a formula.
> dcast(my.df, rp ~ variable + model)
rp dmg_m1 dmg_m2 loss_m1 loss_m2
1 1 113.7034 693.5913 320.155303 418.64781
2 2 622.2994 544.9748 4.747878 143.11164
3 3 609.2747 282.7336 116.275253 133.41039
4 4 623.3794 923.4335 333.041879 93.36139
5 5 860.9154 292.3158 257.125571 116.11296
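The only remaining differences from the reshape() output are cosmetic: dcast names the columns variable_model (underscore) and groups them by variable, whereas reshape() used variable.model and grouped by model. The values are identical, and the columns can simply be reordered if the reshape() layout is wanted:

```r
library(reshape2)

set.seed(1234)
x <- data.frame(rp = 1:5, dmg = 1000*runif(5), loss = 500*runif(5), model = "m1")
y <- data.frame(rp = 1:5, dmg = 1000*runif(5), loss = 500*runif(5), model = "m2")
z <- rbind(x, y)

wide <- dcast(melt(z, id.vars = c("rp", "model")), rp ~ variable + model)

# reorder columns to match reshape()'s model-within-variable grouping
wide <- wide[, c("rp", "dmg_m1", "loss_m1", "dmg_m2", "loss_m2")]
```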
