R: Using Log Rank Test (survdiff)

OK, so I have a dataframe that looks like this:
head(exprs, 21)
sample expr ID X_OS
1 BIX high TCGA_DM_A28E_01 26
2 BIX high TCGA_AY_6197_01 88
3 BIX high TCGA_HB_KH8H_01 553
4 BIX low TCGA_K4_6303_01 256
5 BIX low TCGA_F4_6703_01 491
6 BIX low TCGA_Y7_PIK2_01 177
7 BIX low TCGA_A6_5657_01 732
8 HEF high TCGA_DM_A28E_01 26
9 HEF high TCGA_AY_6197_01 88
10 HEF high TCGA_F4_6703_01 491
11 HEF high TCGA_HB_KH8H_01 553
12 HEF low TCGA_K4_6303_01 256
13 HEF low TCGA_Y7_PIK2_01 177
14 HEF low TCGA_A6_5657_01 732
15 TUR high TCGA_DM_A28E_01 26
16 TUR high TCGA_F4_6703_01 491
17 TUR high TCGA_Y7_PIK2_01 177
18 TUR low TCGA_K4_6303_01 256
19 TUR low TCGA_AY_6197_01 88
20 TUR low TCGA_HB_KH8H_01 553
21 TUR low TCGA_A6_5657_01 732
Simply, for each sample there are 7 patients, each with a survival time (X_OS) and an expression level, high or low (expr). In the code below, I wish to take each sample and run it through the survdiff function, with the outputs going to dfx. However, I'm new to survival analysis and I'm not sure how to use the parameters of the survdiff function. I wish to compare the high and low expression groups for each sample. How can I edit the function expfun to yield the survdiff output I need? In addition, ideally I'd love to get the p-values out of it, but I can work on that in a later step. Thank you!
expfun = function(x) {
survdiff(Surv(x$X_OS, x$expr))
}
dfx <- pblapply(split(exprs[c("expr", "X_OS")], exprs$sample), expfun)

Try this. I added a proper Surv() call, because you only had times and no status argument, and I made it into a formula (with the predictor on the other side of the tilde), because Surv() expects status as its second argument and survdiff() expects a formula as its first argument. That means you need to use the regular R regression calling convention, where column names are used as the formula tokens and the dataframe is given to the data argument. If you had a censoring variable, it would go in as the second Surv() argument in place of the 1's I have in there now.
library(survival)  # provides Surv() and survdiff()
expfun = function(x) {
  # all events observed, so status is a vector of 1's
  survdiff(Surv(X_OS, rep(1, nrow(x))) ~ expr, data = x)
}
dfx <- lapply(split(exprs[c("expr", "X_OS")], exprs$sample), expfun)
This is the result from print.survdiff:
> dfx
$BIX
Call:
survdiff(formula = Surv(X_OS, rep(1, nrow(x))) ~ expr, data = x)
N Observed Expected (O-E)^2/E (O-E)^2/V
expr=high 3 3 2.05 0.446 0.708
expr=low 4 4 4.95 0.184 0.708
Chisq= 0.7 on 1 degrees of freedom, p= 0.4
$HEF
Call:
survdiff(formula = Surv(X_OS, rep(1, nrow(x))) ~ expr, data = x)
N Observed Expected (O-E)^2/E (O-E)^2/V
expr=high 4 4 3.14 0.237 0.51
expr=low 3 3 3.86 0.192 0.51
Chisq= 0.5 on 1 degrees of freedom, p= 0.475
$TUR
Call:
survdiff(formula = Surv(X_OS, rep(1, nrow(x))) ~ expr, data = x)
N Observed Expected (O-E)^2/E (O-E)^2/V
expr=high 3 3 1.75 0.902 1.41
expr=low 4 4 5.25 0.300 1.41
Chisq= 1.4 on 1 degrees of freedom, p= 0.235
Note that you can see the code to produce the print output with:
getAnywhere(print.survdiff)
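As for the p-values you mentioned: a survdiff result stores the log-rank chi-square statistic in $chisq, so you can compute them yourself with pchisq(). A minimal sketch (the degrees of freedom are the number of groups minus one):

# p-value from the log-rank chi-square; df = number of groups - 1
get_pval <- function(sd) pchisq(sd$chisq, df = length(sd$n) - 1, lower.tail = FALSE)
sapply(dfx, get_pval)

With two expression groups per sample this uses df = 1, matching the print output above.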

Related

Creating and plotting confidence intervals

I have fitted a GLM to my data, and I now wish to create 95% CIs and plot them alongside my data. I'm having a couple of issues when plotting, as I can't get the intervals to capture my data; they just seem to plot the same line as the model without capturing the data points. I'm also unsure that I've created my CIs for the mean the correct way here. I've entered my data and code below if anyone knows how to fix this.
data used
aids
cases quarter date
1 2 1 83.00
2 6 2 83.25
3 10 3 83.50
4 8 4 83.75
5 12 1 84.00
6 9 2 84.25
7 28 3 84.50
8 28 4 84.75
9 36 1 85.00
10 32 2 85.25
11 46 3 85.50
12 47 4 85.75
13 50 1 86.00
14 61 2 86.25
15 99 3 86.50
16 95 4 86.75
17 150 1 87.00
18 143 2 87.25
19 197 3 87.50
20 159 4 87.75
21 204 1 88.00
22 168 2 88.25
23 196 3 88.50
24 194 4 88.75
25 210 1 89.00
26 180 2 89.25
27 277 3 89.50
28 181 4 89.75
29 327 1 90.00
30 276 2 90.25
31 365 3 90.50
32 300 4 90.75
33 356 1 91.00
34 304 2 91.25
35 307 3 91.50
36 386 4 91.75
37 331 1 92.00
38 368 2 92.25
39 416 3 92.50
40 374 4 92.75
41 412 1 93.00
42 358 2 93.25
43 416 3 93.50
44 414 4 93.75
45 496 1 94.00
My code used to create the model and intervals before plotting:
#creating the model
model3 = glm(cases ~ date,
data = aids,
family = poisson(link='log'))
#now to add approx. 95% confidence envelope around this line
#predict again but at the linear predictor level along with standard errors
my_preds <- predict(model3, newdata=data.frame(aids), se.fit=T, type="link")
#calculate CI limit since linear predictor is approx. Gaussian
upper <- my_preds$fit+1.96*my_preds$se.fit #this might be logit not log
lower <- my_preds$fit-1.96*my_preds$se.fit
#transform the CI limit to get one at the level of the mean
upper <- exp(upper)/(1+exp(upper))
lower <- exp(lower)/(1+exp(lower))
#plotting data
plot(aids$date, aids$cases,
xlab = 'Date', ylab = 'Cases', pch = 20)
#adding CI lines
plot(aids$date, exp(my_preds$fit), type = "link",
xlab = 'Date', ylab = 'Cases') #add title
lines(aids$date,exp(my_preds$fit+1.96*my_preds$se.fit),lwd=2,lty=2)
lines(aids$date,exp(my_preds$fit-1.96*my_preds$se.fit),lwd=2,lty=2)
This is the outcome I currently get: no data points are shown, and while the fitted line is correct, the CIs aren't, so I think they are being constructed incorrectly somewhere.
Edit: response to the OP providing the full data set.
This started out as a question about plotting data and models on the same graph, but has morphed considerably. You seem to have an answer to the original question. Below is one way to address the rest.
Looking at your (and my) plots, it seems clear that a Poisson GLM is just not a good model. To say it differently, the number of cases may vary with date, but it is also influenced by other things not in your model (external regressors).
Plotting just your data suggests strongly that you have at least two and perhaps more regimes: time frames where the growth in cases follows different models.
ggplot(aids, aes(x=date)) + geom_point(aes(y=cases))
This suggests segmented regression. As with most things in R, there is a package for that (more than one actually). The code below uses the segmented package to build successive poisson glm using 1 breakpoint (two regimes).
library(data.table)
library(ggplot2)
library(segmented)
setDT(aids) # convert aids to a data.table
aids[, pred :=
  predict(
    segmented(glm(cases ~ date, .SD, family = poisson), seg.Z = ~date, npsi = 1),
    type = 'response', se.fit = TRUE)$fit]
ggplot(aids, aes(x=date))+ geom_line(aes(y=pred))+ geom_point(aes(y=cases))
Note that we need to tell segmented the count of breakpoints, but not where they are; the algorithm figures that out for you. So here we see a regime prior to 3Q87 which is well modeled by a Poisson GLM, and a regime after that which is not. This is a fancy way of saying that "something happened" around 3Q87 which changed the course of the disease (at least in this data).
The code below does the same thing but for between 1 and 4 breakpoints.
get.pred <- \(p.n, p.DT) {
  fit <- glm(cases ~ date, p.DT, family = poisson)
  seg.fit <- segmented(fit, seg.Z = ~date, npsi = p.n)
  predict(seg.fit, type = 'response', se.fit = TRUE)[c('fit', 'se.fit')]
}
gg.dt <- rbindlist(lapply(1:4, \(x) { copy(aids)[, c('pred', 'se'):=get.pred(x, .SD)][, npsi:=x] } ))
ggplot(gg.dt, aes(x=date))+
geom_ribbon(aes(ymin=pred-1.96*se, ymax=pred+1.96*se), fill='grey80')+
geom_line(aes(y=pred))+
geom_point(aes(y=cases))+
facet_wrap(~npsi)
Note that the location of the first breakpoint does not seem to change, and also that, notwithstanding the use of the Poisson GLM, the growth appears linear in all but the first regime.
There are goodness-of-fit metrics described in the package documentation which can help you decide how many break points are most consistent with your data.
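For instance, since the fits returned by segmented() are refitted glm objects, one rough way to compare breakpoint counts is an information criterion; a sketch (treat this as a quick screen, not a substitute for the diagnostics in the package docs):

# BIC for 1-4 breakpoints; lower is better
sapply(1:4, \(n) BIC(segmented(glm(cases ~ date, aids, family = poisson),
                               seg.Z = ~date, npsi = n)))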
Finally, there is also the mcp package which is a bit more powerful but also a bit more complex to use.
Original Response: Here is one way that builds the model predictions and std. error in a data.table, then plots using ggplot.
library(data.table)
library(ggplot2)
setDT(aids) # convert aids to a data.table
aids[, c('pred', 'se', 'resid.scale'):=predict(glm(cases~date, data=.SD, family=poisson), type='response', se.fit=TRUE)]
ggplot(aids, aes(x=date))+
geom_ribbon(aes(ymin=pred-1.96*se, ymax=pred+1.96*se), fill='grey80')+
geom_line(aes(y=pred))+
geom_point(aes(y=cases))
Or, you could let ggplot do all the work for you.
ggplot(aids, aes(x=date, y=cases))+
stat_smooth(method = glm, method.args=list(family=poisson))+
geom_point()
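Coming back to the OP's original base-R attempt: the bug flagged in the "#this might be logit not log" comment is real. With a log link the inverse link is exp(), not exp(x)/(1+exp(x)) (that is the inverse logit, which belongs to binomial models). A minimal corrected sketch, assuming aids is the data frame shown in the question:

model3 <- glm(cases ~ date, data = aids, family = poisson(link = 'log'))
my_preds <- predict(model3, se.fit = TRUE, type = "link") # linear-predictor scale
upper <- exp(my_preds$fit + 1.96 * my_preds$se.fit)       # back-transform with exp()
lower <- exp(my_preds$fit - 1.96 * my_preds$se.fit)
plot(aids$date, aids$cases, xlab = 'Date', ylab = 'Cases', pch = 20) # data points
lines(aids$date, exp(my_preds$fit), lwd = 2)                         # fitted mean
lines(aids$date, upper, lwd = 2, lty = 2)                            # upper CI limit
lines(aids$date, lower, lwd = 2, lty = 2)                            # lower CI limit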

Writing a function to compare differences of a series of numeric variables

I am working on a problem set and absolutely cannot figure this one out. I think I've fried my brain to the point where it doesn't even make sense anymore.
Here is a look at the data ...
sex age chol tg ht wt sbp dbp vldl hdl ldl bmi
<chr> <int> <int> <int> <dbl> <dbl> <int> <int> <int> <int> <int> <dbl>
1 M 60 137 50 68.2 112. 110 70 10 53 74 2.40
2 M 26 154 202 82.8 185. 88 64 34 31 92 2.70
3 M 33 198 108 64.2 147 120 80 22 34 132 3.56
4 F 27 154 47 63.2 129 110 76 9 57 88 3.22
5 M 36 212 79 67.5 176. 130 100 16 37 159 3.87
6 F 31 197 90 64.5 121 122 78 18 58 111 2.91
7 M 28 178 163 66.5 167 118 68 19 30 135 3.78
8 F 28 146 60 63 105. 120 80 12 46 88 2.64
9 F 25 231 165 64 126 130 72 23 70 137 3.08
10 M 22 163 30 68.8 173 112 70 6 50 107 3.66
# … with 182 more rows
I must write a function, myTtest, to perform the following task:
Perform two-sample t-tests to compare the differences of a series of numeric variables between the levels of a classification variable
The first argument, dat, is a data frame
The second argument, classVar, is a character vector of length 1. It is the name of the classification variable, such as 'sex.'
The third argument, numVar, is a character vector that contains the name of the numeric variables, such as c("age", "chol", "tg"). This means I need to perform three t-tests to compare the difference of those between males and females.
The function should return a data frame with the following variables: Varname, F.mean, M.mean, t (for t-statistics), df (for degrees of freedom), and p (for p-value).
I should be able to run this ...
myTtest(dat = chol, classVar = "sex", numVar = c("age", "chol", "tg"))
... and then get the data frame to appear.
Any help is greatly appreciated. I am pulling my hair out over this one! As well, as noted in my comment below, this has to be done without Tidyverse ... which is why I'm having so much trouble to begin with.
The intuition for this solution is that you can loop over your dependent variables, and call t.test() in each loop. Then save the results from each DV and stack them together in one big data frame.
I'll leave out some bits for you to fill in, but here's the gist:
First, some example data:
set.seed(123)
n <- 20
grp <- sample(c("m", "f"), n, replace = TRUE)
df <- data.frame(grp = grp, age = rnorm(n), chol = rnorm(n), tg = rnorm(n))
df
grp age chol tg
1 m 1.2240818 0.42646422 0.25331851
2 m 0.3598138 -0.29507148 -0.02854676
3 m 0.4007715 0.89512566 -0.04287046
4 f 0.1106827 0.87813349 1.36860228
5 m -0.5558411 0.82158108 -0.22577099
6 f 1.7869131 0.68864025 1.51647060
7 f 0.4978505 0.55391765 -1.54875280
8 f -1.9666172 -0.06191171 0.58461375
9 m 0.7013559 -0.30596266 0.12385424
10 m -0.4727914 -0.38047100 0.21594157
Now make a container that each of the model outputs will go into:
fits_df <- data.frame()
Loop over each DV and append the model output to fits_df each time with rbind:
for (dv in c("age", "chol", "tg")) {
  frml <- as.formula(paste0(dv, " ~ grp")) # make a model formula: dv ~ grp
  fit <- t.test(frml, alternative = "two.sided", data = df) # perform the t-test
  # hint: use str(fit) to figure out how to pull out each value you care about
  fit_df <- data.frame(
    dv = dv,
    f_mean = xxx,
    m_mean = xxx,
    t = xxx,
    df = xxx,
    p = xxx
  )
  fits_df <- rbind(fits_df, fit_df)
}
Your output will look like this:
fits_df
dv f_mean m_mean t df p
1 age -0.18558068 -0.04446755 -0.297 15.679 0.7704954
2 chol 0.07731514 0.22158672 -0.375 17.828 0.7119400
3 tg 0.09349567 0.23693052 -0.345 14.284 0.7352112
One note: When you're pulling out values from fit, you may get odd row names in your output data frame. This is due to the names property of the various fit attributes. You can get rid of these by using as.numeric() or as.character() wrappers around the values you pull from fit (for example, fit$statistic can be cleaned up with as.character(round(fit$statistic, 3))).
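For reference, the fields those xxx placeholders would draw on are standard components of the htest object that t.test() returns; a sketch using the example data above (field names come straight from str(fit)):

fit <- t.test(age ~ grp, data = df)
as.numeric(fit$estimate[1])          # mean of the first factor level ("f")
as.numeric(fit$estimate[2])          # mean of the second factor level ("m")
as.numeric(round(fit$statistic, 3))  # t-statistic
as.numeric(round(fit$parameter, 3))  # Welch degrees of freedom
fit$p.value                          # p-value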

Clustering biological sequences based on numeric values

I am trying to cluster several amino acid sequences of a fixed length (13) into K clusters based on the Atchley factors (5 numbers which represent each amino acid).
For example, I have an input vector of strings like the following:
key <- HDMD::AAMetric.Atchley
sequences <- sapply(1:10000, function(x) paste(sapply(1:13, function (X) sample(rownames(key), 1)), collapse = ""))
However, my actual list of sequences is over 10^5 long (hence the need for computational efficiency).
I then convert these sequences into numeric vectors by the following:
key <- HDMD::AAMetric.Atchley
m1 <- key[strsplit(paste(sequences, collapse = ""), "")[[1]], ]
p = 13
output <-
do.call(cbind, lapply(1:p, function(i)
m1[seq(i, nrow(m1), by = p), ]))
I want to cluster the output (which is now a set of 65-dimensional vectors) in an efficient way.
I was originally using mini-batch k-means, but I noticed the results were very inconsistent when I repeated it. I need a consistent clustering approach.
I was also concerned about the curse of dimensionality: at 65 dimensions, Euclidean distance doesn't work well.
Many high-dimensional clustering algorithms I saw assume that outliers and noise exist in the data, but as these are biological sequences converted to numeric values, there is no noise or outliers.
In addition, feature selection will not work, as each amino acid and each of its properties is relevant in the biological context.
How would you recommend clustering these vectors?
I think self-organizing maps can be of help here; at least the implementation is quite fast, so you will know soon enough whether it is helpful or not.
Using the data from the OP, along with:
rownames(output) <- 1:nrow(output)
colnames(output) <- make.names(colnames(output), unique = TRUE)
library(SOMbrero)
You define the number of clusters in advance:
fit <- trainSOM(x.data=output , dimension = c(5, 5), nb.save = 10, maxit = 2000,
scaling="none", radius.type = "gaussian")
nb.save stores intermediate steps, for further exploration of how the training developed during the iterations:
plot(fit, what = "energy")
It seems like more iterations are in order.
Check the frequency of clusters:
table(fit$clustering)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
428 417 439 393 505 458 382 406 271 299 390 303 336 358 365 372 332 268 437 464 541 381 569 419 467
Predict clusters based on new data:
predict(fit, output[1:20,])
#output
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
19 12 11 8 9 1 11 13 14 5 18 2 22 21 23 22 4 14 24 12
Check which variables were important for clustering:
summary(fit)
#part of output
Summary
Class : somRes
Self-Organizing Map object...
online learning, type: numeric
5 x 5 grid with square topology
neighbourhood type: gaussian
distance type: euclidean
Final energy : 44.93509
Topographic error: 0.0053
ANOVA :
Degrees of freedom : 24
F pvalue significativity
pah 1.343 0.12156074
pss 1.300 0.14868987
ms 16.401 0.00000000 ***
cc 1.695 0.01827619 *
ec 17.853 0.00000000 ***
Find the optimal number of clusters:
plot(superClass(fit))
fit1 <- superClass(fit, k = 4)
summary(fit1)
#part of output
SOM Super Classes
Initial number of clusters : 25
Number of super clusters : 4
Frequency table
1 2 3 4
6 9 4 6
Clustering
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
1 1 2 2 2 1 1 2 2 2 1 1 2 2 2 3 3 4 4 4 3 3 4 4 4
ANOVA
Degrees of freedom : 3
F pvalue significativity
pah 1.393 0.24277933
pss 3.071 0.02664661 *
ms 19.007 0.00000000 ***
cc 2.906 0.03332672 *
ec 23.103 0.00000000 ***
Much more in this vignette
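If you then want each original sequence labeled with its super class, something along these lines should work; a sketch, assuming (as the Clustering table above suggests) that superClass() stores the prototype-to-super-class map in $cluster:

# super class of each observation = super class of its winning prototype
obs_super <- fit1$cluster[fit$clustering]
head(data.frame(sequence = sequences, super.cluster = obs_super))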

Using glmer.nb(), the error message "(maxstephalfit) PIRLS step-halvings failed to reduce deviance in pwrssUpdate" is returned

When using glmer.nb, we just get the error message:
> glm1 <- glmer.nb(Jul ~ scale(A7)+ Maylg+(1|Year), data=bph.df)
Error: (maxstephalfit) PIRLS step-halvings failed to reduce deviance in pwrssUpdate
In addition: Warning message:
In theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace = control$trace > :
iteration limit reached
Can anyone help me? Thanks very much!
My data listed below.
Year Jul A7 Maylg L7b
331 1978 1948 6 1.322219 4
343 1979 8140 32 2.678518 2
355 1980 106896 26 2.267172 2
367 1981 36227 25 4.028205 2
379 1982 19085 18 2.752816 2
391 1983 26010 32 2.086360 3
403 1984 1959 1 2.506505 4
415 1985 8025 18 2.656098 0
427 1986 9780 20 1.939519 0
439 1987 48235 29 4.093912 1
451 1988 7473 30 2.974972 2
463 1989 2850 25 2.107210 2
475 1990 10555 18 2.557507 3
487 1991 70217 30 4.843563 0
499 1992 2350 31 1.886491 2
511 1993 3363 32 2.956649 4
523 1994 5140 37 1.934498 4
535 1995 14210 36 2.492760 1
547 1996 3644 27 1.886491 1
559 1997 9828 29 1.653213 1
571 1998 3119 41 2.535294 4
583 1999 5382 10 2.472756 3
595 2000 690 5 1.886491 2
607 2001 871 13 NA 2
619 2002 12394 27 0.845098 5
631 2003 4473 36 1.342423 2
You're going to have a lot of problems with this data set, among other things, because you have an observation-level random effect (you only have one data point per Year) and are trying to fit a negative binomial model. That essentially means you're trying to fit the overdispersion in two different ways at the same time.
If you fit the Poisson model, you can see that the results are strongly underdispersed (for a Poisson model, the residual deviance should be approximately equal to the residual degrees of freedom).
library("lme4")
glm0 <- glmer(Jul ~ scale(A7)+ Maylg+(1|Year), data=bph.df,
family="poisson")
print(glm0)
Generalized linear mixed model fit by maximum likelihood (Laplace
Approximation) [glmerMod]
Family: poisson ( log )
Formula: Jul ~ scale(A7) + Maylg + (1 | Year)
Data: bph.df
AIC BIC logLik deviance df.resid
526.4904 531.3659 -259.2452 518.4904 21
Random effects:
Groups Name Std.Dev.
Year (Intercept) 0.9555
Number of obs: 25, groups: Year, 25
Fixed Effects:
(Intercept) scale(A7) Maylg
7.3471 0.3363 0.6732
deviance(glm0)/df.residual(glm0)
## [1] 0.0003479596
Or alternatively:
library("aods3")
gof(glm0)
## D = 0.0073, df = 21, P(>D) = 1
## X2 = 0.0073, df = 21, P(>X2) = 1
glmmADMB does manage to fit it, but I don't know how far I would trust the results (the dispersion parameter is very large, indicating that the model has basically converged to a Poisson distribution anyway).
library("glmmADMB")
bph.df <- na.omit(transform(bph.df, Year=factor(Year)))
glmmadmb(Jul ~ scale(A7)+ Maylg+(1|Year), data=bph.df,
         family="nbinom")
GLMM's in R powered by AD Model Builder:
Family: nbinom
alpha = 403.43
link = log
Fixed effects:
Log-likelihood: -259.25
AIC: 528.5
Formula: Jul ~ scale(A7) + Maylg + (1 | Year)
(Intercept) scale(A7) Maylg
7.3628472 0.3348105 0.6731953
Random effects:
Structure: Diagonal matrix
Group=Year
Variance StdDev
(Intercept) 0.9105 0.9542
Number of observations: total=25, Year=25
The results are essentially identical to the Poisson model from lme4 above.
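If a negative binomial fit is really what's wanted, one way to avoid modeling the overdispersion twice is to drop the observation-level random effect entirely and fit a plain negative binomial GLM; a sketch using MASS::glm.nb:

library("MASS")
# no (1|Year) term: with one observation per Year, the NB dispersion
# parameter already absorbs any extra-Poisson variation
nb0 <- glm.nb(Jul ~ scale(A7) + Maylg, data = na.omit(bph.df))
summary(nb0)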

Plotting different columns on the same file using boxes

I have a file that looks like
$ cat myfile.dat
1 8 32 19230 1.186 3.985
1 8 64 9620 0.600 7.877
1 8 128 4810 0.312 15.136
1 8 256 2410 0.226 20.927
1 8 512 1210 0.172 27.708
1 8 1024 610 0.135 35.582
1 8 2048 310 0.121 40.172
1 8 4096 160 0.117 43.141
1 8 8192 80 0.112 44.770
.....
2 8 16384 300 0.692 6.816
2 8 32768 150 0.686 6.877
2 8 65536 80 0.853 5.904
2 10 320 7830 1.041 4.575
2 10 640 3920 0.919 5.189
2 10 1280 1960 0.828 5.757
2 10 2560 980 0.773 6.167
2 10 5120 490 0.746 6.391
2 10 10240 250 0.748 6.507
2 10 20480 130 0.770 6.567
....
3 18 8192 10 1.311 12.759
3 20 32 650 1.631 3.978
3 20 64 330 0.838 7.863
3 20 128 170 0.483 14.046
3 20 256 90 0.508 14.160
3 20 512 50 0.559 14.283
3 20 1024 30 0.665 14.405
3 20 2048 20 0.865 14.782
3 20 4096 10 0.856 14.932
3 20 8192 10 1.704 14.998
As you can see, there are many ways of plotting this information, depending on the column we want as the x axis. One of the ways I would like to plot it is the 6th column against the 1st:
p "myfile.dat" u 1:6
My main question is whether there is a way to plot those bars as solid boxes, since we are only interested in the peak value achieved and not in the frequency or density region of the dots.
Gnuplot has the smooth option, which can be used e.g. as smooth frequency to sum all y-values for the same x-value. Unfortunately there is no smooth maximum, which is what you would need here, but one can emulate it with a bit of trickery in the using statement.
reset
xval = -1000
max(x, y) = (x > y ? x : y)
maxval = 0
colnum = 6
set boxwidth 0.2
plot 'myfile.dat' using (val = column(colnum), $1):\
(maxval_prev = (xval == $1 ? maxval : 0), \
maxval = (xval == $1 ? max(maxval, val) : val),\
xval = $1, \
(maxval > maxval_prev ? maxval-maxval_prev : 0)\
) \
smooth frequency lw 3 with boxes t 'maximum values'
Every using entry can consist of several assignments, which are separated by commas.
If a new x value appears, the variables are initialized. This works because the data is made monotonic in x by smooth frequency.
If the current value is bigger than the stored maximum value, the difference between the stored maximum value and the current value is added. Potentially this could result in numerical errors due to repeated adding and subtracting, but judging from your sample data and given the resolution of the plot, this shouldn't be a problem.
This is the result for your data:
You could search for the maximum and plot only that, but the following is probably easier, even if it draws lots of boxes one over another:
plot "myfile.dat" using 1:6:(.1) with boxes fillstyle solid
