Plot of Kaplan-Meier curve inconsistent with generated statistics - r

I am plotting a Kaplan-Meier (KM) curve for the readmission data, which is available in the R package frailtypack. I used this simple code, which stratifies the curve by the sex variable:
library(survival)
library(frailtypack)
data(readmission)
readmission
sobj<-Surv(readmission$time,readmission$event==1)
km.plot <- survfit(sobj ~readmission$sex, data = readmission)
km.plot
plot(km.plot,lty=c(1,2),lwd=2)
legend(x="bottomleft",lty=c(1,2),lwd=2, legend=c("Male","Female"))
The data are on recurrent events (i.e. subjects have multiple failure times). The output from "km.plot" tells me that there are a substantial number of censored event times for both males and females. Given this, I expect the KM curves to level off at non-zero survival probabilities, but the female curve drops to zero. I still get this when I produce the plot for only the first event times, ignoring subsequent ones.
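(For the first-events-only version, I subset along these lines, using the enum column, which I believe indexes the event number within each subject:)
first.events <- readmission[readmission$enum == 1, ]
sobj1 <- Surv(first.events$time, first.events$event == 1)
plot(survfit(sobj1 ~ first.events$sex, data = first.events), lty = c(1, 2), lwd = 2)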
I think something is probably wrong with my code, but I am finding it hard to figure out what. I would greatly appreciate any help on this.

Do NOT create Surv objects outside the regression arguments, and DO use bare column names rather than dataframe$column references. Violating those two practices, as your code does, prevents the 'predict' and 'plot' methods from knowing where to find the data elements via the terms attributes in the model objects.
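A minimal corrected sketch along those lines (assuming the same readmission columns as in the question):
library(survival)
library(frailtypack)
data(readmission)
# Build the Surv object inside the formula and refer to columns by bare
# name, so the predict/plot methods can locate the data via the terms
km.fit <- survfit(Surv(time, event == 1) ~ sex, data = readmission)
plot(km.fit, lty = c(1, 2), lwd = 2)
# Check levels(readmission$sex) so the legend labels match the curve order
legend("bottomleft", lty = c(1, 2), lwd = 2, legend = levels(readmission$sex))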
As for the shape of the curve: if the last observed event time in a group is a death, the K-M curve for that group will drop to zero.
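One quick way to check this in the readmission data (a sketch):
# For each sex, inspect the status at the largest observed time;
# event == 1 there means that group's curve falls to zero
by(readmission, readmission$sex, function(d) d[which.max(d$time), c("time", "event")])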
And please explain why you think a K-M curve would be meaningful with “repeated events”?


Interpreting results from emmeans comparison

I have a glm model with two fixed effects, Treatment and Date, to estimate Temperature from data collected in a time series. Within Treatment there are three categories: Fucus, Terrycloth, and Control; temperature is measured beneath those canopies. The model is created like so:
mod1 <- glm(Temp ~ Treatment * Date, data = aveTerry.df)
I am trying to tell if Terrycloth has a similar effect as Fucus canopy (i.e. replicates it).
I found the emmeans package and believe it could help me compare these levels within Treatment using my model. I used it like so to find the estimated marginal means:
terry.emmeans <- emmeans(modAllTerry, poly ~ Treatment | Date)
and plotted the comparisons via:
plot(terry.emmeans.average, comparison = TRUE) + theme_bw()
This gives me the output linked here.
I am looking for some help understanding this graphical output, especially what exactly the comparisons are (shown by the red arrows). I somewhat understand that the blue boxes are the confidence intervals for the mean temperature for each treatment on a given day (based on the model), but I am wondering: how is the comparison made? And why do some days have only a one-sided arrow?
As described in the documentation for plot.emmGrid, the comparison arrows are created in such a way that two arrows are disjoint if and only if their respective means are significantly different at the stated level.
The lowest mean in the set has only a right-pointing arrow because that mean will not be compared with anything smaller, obviating the need for a left-pointing arrow. For similar reasons, the highest mean has only a left-pointing arrow. These arrows do not define intervals; their only purpose is depicting comparisons.
In situations where the SEs of pairwise comparisons vary widely, it may not be possible to construct comparison arrows. If that happens, an error message is displayed.
Confidence intervals are available as well, but those CIs should not be used for comparing means.
More information and examples may be found via vignette("comparisons", "emmeans"). Details of how the arrows are actually constructed are given in vignette("xplanations", "emmeans").
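A self-contained illustration of these arrows, using the pigs dataset that ships with emmeans (the model here is just for demonstration):
library(emmeans)
# Fit a simple model to the pigs dataset bundled with emmeans
pigs.lm <- lm(log(conc) ~ source + factor(percent), data = pigs)
# Estimated marginal means for source, plotted with comparison arrows
pigs.emm <- emmeans(pigs.lm, "source")
plot(pigs.emm, comparisons = TRUE)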

How can I achieve hierarchical clustering with p-values for a large dataset?

I am trying to carry out hierarchical cluster analysis (based on Ward's method) on a large dataset (thousands of records and 13 variables) representing multi-species observations of marine predators, to identify possible significant clusters in species composition.
Each record has date, time etc and presence/absence data (0 / 1) for each species.
I attempted hierarchical clustering with the function pvclust. I transposed the data (pvclust works on transposed tables), then ran pvclust selecting Jaccard distance ("binary" in R) as the distance measure (suitable for species presence/absence data) and Ward's method ("ward.D2"). I used parallel = TRUE to reduce computation time. However, with the default nboot = 1000, my computer was not able to finish the computation in hours and I finally got an error, so I tried a lower nboot (100).
I cannot provide my dataset here, and I do not think it makes sense to provide a small test dataset, as one of the main issues here seems to be the size itself of the dataset. However, I am providing the lines of code I used for the transposition, clustering and plotting:
library(pvclust)
# Transpose: pvclust clusters the columns of the input matrix
tdata <- t(data)
# Ward clustering on Jaccard ("binary") distances, 100 bootstrap replicates
cluster <- pvclust(tdata, method.hclust = "ward.D2", method.dist = "binary",
                   nboot = 100, parallel = TRUE)
plot(cluster, labels = FALSE)
This is the dendrogram I obtained (never mind the confusion at the lower levels due to overlap of branches).
As you can see, the p-values for the upper branches of the dendrogram all seem to be 0.
Now, I understand that my data may not be perfect, but I still think something is wrong with the method I am using, as I would not expect all these values to be zero even if the clusters had very low significance.
So my questions are:
is there anything I got wrong in the pvclust call itself?
could my low nboot (due to a "weak" computer) be a reason for the non-significance of my results?
are there other functions in R I could try for hierarchical clustering that also deliver p-values?
Thanks in advance!
UPDATE:
I have tried running the same code on a subset of 500 records with nboot = 1000. This worked in a reasonable computation time, but the output is still not very satisfying; see dendrogram 2 (the dendrogram obtained for a subset of 500 records with nboot = 1000).

How to find differentially methylated regions (for example with probe lasso in ChAMP) based on a regression of continuous variable ~ beta (with CpGassoc)

I performed 450K Illumina methylation chips on human samples and want to search for the association between a continuous variable and beta, adjusted for other covariates. For this, I used the CpGassoc package in R. I would also like to search for differentially methylated regions (DMRs) based on the significant CpG sites. However, the probe lasso function in the ChAMP package, and the other packages for 450K DMR analyses, always assume 2 groups between which DMRs need to be found. I do not have 2 groups, but this continuous variable. Is there a way to load my output from CpGassoc into the probe lasso function from ChAMP? Or into another bump hunter package? I'm an MD, not a bioinformatician, so comb-p etc. would not be possible for me.
Thank you very much for your help.
Kind regards,
Line
I have not worked with methylation data before, so take what I say with a grain of salt. Also, don't use acronyms without describing them; I'm guessing most people on this site don't know what a DMR is.
You could use the lasso from the glmnet package on your data. Say your continuous variable is age, and meth.dt is your methylation data.table, with columns giving the amount of methylation at each site and rows corresponding to subjects. I'm not sure whether methylation data are considered Poisson-distributed (I know RNA-seq data are), and I can't get too specific, but the following code should work after adjusting the column range to your number of columns.
#load libraries
library(data.table)
library(glmnet)
#read in data (the path here is a placeholder)
meth.dt <- fread("/data")
#lasso: all methylation columns as predictors, age as the response
AgeLasso <- glmnet(as.matrix(meth.dt[, 1:70999, with = F]), meth.dt$Age, family = "poisson")
#cross-validate to choose the penalty
cv.AgeLasso <- cv.glmnet(as.matrix(meth.dt[, 1:70999, with = F]), meth.dt$Age, family = "poisson")
#keep only the coefficients that are non-zero at lambda.1se
coefTranscripts <- coef(cv.AgeLasso, s = "lambda.1se")[, 1][coef(cv.AgeLasso, s = "lambda.1se")[, 1] != 0]
This will give you the methylation sites that are the best predictors of your continuous variable using a parsimonious model. For additional info about glmnet see http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html
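As a follow-up sketch (continuing from the code above), the selected sites are simply the names of the non-zero coefficients:
#the non-zero coefficients name the selected sites; drop the intercept
selected.sites <- setdiff(names(coefTranscripts), "(Intercept)")
head(selected.sites)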
You might also want to ask the people over at Cross Validated; they may have better answers: http://stats.stackexchange.com
What is your continuous variable, just out of curiosity?
Let me know how you ended up solving it if you don't use this method.

Using survfit object's formula in survdiff call

I'm doing some survival analysis in R, and looking to tidy up/simplify my code.
At the moment I'm doing several steps in my data analysis:
make a Surv object (time variable with indication as to whether each observation was censored);
fit this Surv object according to a categorical predictor, for plotting and estimating median survival times; and
calculate a log-rank test to ask whether there is evidence of "significant" differences in survival between the groups.
As an example, here is a mock-up using the lung dataset from the survival package in R. The following code is close enough to what I want to do, but much simplified in terms of the predictor set (which is why I want to simplify the code, so I don't make inconsistent calls across models).
library(survival)
# Step 1: Make a survival object with time-to-event and censoring indicator.
# Following works with defaults as status = 2 = dead in this dataset.
# Create survival object
lung.Surv <- with(lung, Surv(time=time, event=status))
# Step 2: Fit survival curves to object based on patient sex, plot this.
lung.survfit <- survfit(lung.Surv ~ lung$sex)
print(lung.survfit)
plot(lung.survfit)
# Step 3: Calculate log-rank test for difference in survival objects
lung.survdiff <- survdiff(lung.Surv ~ lung$sex)
print(lung.survdiff)
Now this is all fine and dandy, and I can live with this but would like to do better.
So my question is about step 3. What I would like to do is use the information in the formula from the lung.survfit object to feed into the calculation of the differences in survival curves, i.e. into the call to survdiff. And this is where my domitable [sic] programming skills hit a wall. Below is my current attempt; I'd appreciate any help you can give! Once I get this sorted out I should be able to wrap a solution up in a function.
lung.survdiff <- survdiff(parse(text=(lung.survfit$call$formula)))
## Which returns following:
# Error in survdiff(parse(text = (lung.survfit$call$formula))) :
# The 'formula' argument is not a formula
As I commented above, I actually sorted out the answer to this shortly after having written this question.
So step 3 above could be replaced by:
lung.survdiff <- survdiff(formula(lung.survfit$call$formula))
But as Ben Barnes points out in the comment to the question, the formula from the survfit object can be more directly extracted with
lung.survdiff <- survdiff(formula(lung.survfit))
Which is exactly what I wanted and hoped would be available -- thanks Ben!
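To close the loop on the original goal of wrapping this in a function, here is a minimal sketch (km_with_logrank is a hypothetical helper name, not part of survival): fit the curves and run the log-rank test from a single formula, so the two calls can never disagree.
library(survival)
km_with_logrank <- function(fml, data) {
  # Fit the survival curves and the log-rank test from the same formula
  fit <- survfit(fml, data = data)
  test <- survdiff(fml, data = data)
  list(fit = fit, logrank = test)
}
res <- km_with_logrank(Surv(time, status) ~ sex, data = lung)
print(res$fit)
print(res$logrank)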

Is it possible to arrange a time series in such a way that a specific autocorrelation is created?

I have a file containing 2,500 random numbers. Is it possible to rearrange these saved numbers in such a way that a specific autocorrelation is created? Let's say, an autocorrelation at lag 1 of 0.2, an autocorrelation at lag 2 of 0.4, and so on.
Any help is greatly appreciated!
To be more specific:
The time series of a daily return in percent of an asset has the following characteristics that I am trying to recreate:
Leptokurtic, symmetric distribution, let's say centered at a daily return of zero
No significant autocorrelations (because the sign of a daily return is not predictable)
Significant autocorrelations if the time series is squared
The aim is to produce a random time series which satisfies all three characteristics. The only two inputs should be the leptokurtic distribution (which I have already created) and the specific autocorrelation of the squared resulting time series (e.g. the final squared series should have an autocorrelation at lag 1 of 0.2).
I only know how to produce random numbers from my own mixed distribution. Naturally, if I squared such a series, there would be no autocorrelation. I would like to find a way that takes this into account.
Generally the most straightforward way to create autocorrelated data is to generate the data so that it is autocorrelated. For example, you could create an autocorrelated path by always using the value at time p-1 as the mean for the random draw at time period p.
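A minimal sketch of that idea, an AR(1)-style path (for a long AR(1) series the lag-1 autocorrelation is approximately the coefficient phi):
set.seed(1)
n <- 2500
phi <- 0.2 # rough target for the lag-1 autocorrelation
x <- numeric(n)
x[1] <- rnorm(1)
for (p in 2:n) {
  # each draw is centered on phi times the previous value
  x[p] <- rnorm(1, mean = phi * x[p - 1])
}
acf(x, lag.max = 5) # inspect the induced autocorrelations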
Rearranging is not only hard, but sort of odd conceptually. What are you really trying to do in the end? Giving some context might allow better answers.
There are functions for simulating autocorrelated data: arima.sim() from the stats package and simulate.Arima() from the forecast package.
simulate.Arima() has the advantages that (1) it can simulate seasonal ARIMA models (sometimes called "SARIMA") and (2) it can simulate a continuation of an existing time series to which you have already fit an ARIMA model. To use simulate.Arima(), you need to already have an Arima object.
UPDATE:
Type ?arima.sim, then scroll down to "Examples".
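For instance, a minimal sketch (the AR coefficients here are illustrative; note they are not the autocorrelations themselves, e.g. the implied lag-1 autocorrelation is phi1/(1 - phi2) = 0.2/0.6 ≈ 0.33):
set.seed(123)
# Simulate 2,500 points from a stationary AR(2) process
sim <- arima.sim(model = list(ar = c(0.2, 0.4)), n = 2500)
acf(sim, lag.max = 5)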
Alternatively:
install.packages("forecast")
library(forecast)
fit <- auto.arima(USAccDeaths)
plot(USAccDeaths,xlim=c(1973,1982))
lines(simulate(fit, 36),col="red")
