Simulate Cox Proportional Hazard with geom_hex

I am interested in replicating an experiment from a paper [1] I came across. The idea is that I need to simulate a Cox proportional hazards model that depends on the first two covariates in the data frame. I am trying to make a plot similar to the one in the paper, but as a "hex" version of it. The problem is that I can't seem to get the "z-axis" correct.
set.seed(42) # this makes the example exactly reproducible
#50,000 random uniforms
obs <- runif(50000,min = -1, max = .999)
#make uniforms a matrix
obs <- matrix(data = obs, nrow = 5000, ncol = 10)
#make is_censored
is_censored <- sample(0:1,5000,TRUE,prob=c(0.40,0.60))
#hazard function
const <- 1
time <- rexp(n = 5000, const*exp(-(obs[,1]+2*obs[,2])))
#dataset
df <- cbind(obs, is_censored, time)
#names for covariates
names = letters[1:10]
colnames(df)[1:10] <- names
#truth data
x <- df[,1]; y <- df[,2]
library(tibble)   # needed for tibble(); dplyr would also work
true <- tibble(x, y, time)
# install.packages("hexbin")   # geom_hex() needs the hexbin package installed
library(hexbin)
library(ggplot2)
ggplot(true, aes(x, y)) +
  geom_hex(bins = 30)
I thought that if I added time for the z-axis I would get the correct gradient, but instead I got:
ggplot(true,aes(x,y,fill=time))+
geom_hex(bins = 30)
How can I get the proper gradient?
[1] Deep Survival: A Deep Cox Proportional Hazards Network
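One possible fix (not from the original thread, so treat it as a sketch): geom_hex always fills by the count of points in each hexagon, whereas stat_summary_hex() lets you pass time as a z aesthetic and summarise it per hexagon, e.g. by its mean, reusing the true tibble built above:
# colour each hexagon by the mean simulated survival time of the points falling in it
ggplot(true, aes(x, y, z = time)) +
  stat_summary_hex(fun = mean, bins = 30)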

Related

Simulate in R the number of samples needed in order to achieve the true standard deviation

I want to recreate in R a figure that simulates the number of samples needed in order to achieve the true standard deviation.
How can I do it in R?
I suppose that the distribution is a t-distribution or a normal distribution, so I have to generate numbers from these distributions, increase the sample size each time, and plot the results in order to recreate the figure.
Any help?
set.seed(123)
x <- list(v1=rnorm(1,0,12),v2=rnorm(10,0,11),
v3=rnorm(20,0,10),v4=rnorm(30,0,9),
v5=rnorm(40,0,8),v6=rnorm(50,0,7),
v7=rnorm(60,0,6),v8=rnorm(70,0,5),
v9=rnorm(80,0,4),v10=rnorm(90,0,3),
v11=rnorm(100,0,2),v12=rnorm(110,0,2))
g = lapply(x,sd)
g
g1 = unlist(g)
plot(g1,type="l")
First, start with a random uniform distribution of suitable size, and select the sample sizes for which you want to compute the standard error of the mean.
set.seed(123)
x <- runif(1e6, 0, 1)
sample_size <- 5:120
You can define a function to compute this sigma_m: draw a sample of size n from x with replacement, take its standard deviation, and divide by sqrt(n).
calc_sigma_m <- function(n, x) {
sd(sample(x, n, replace = TRUE))/sqrt(n)
}
A data frame can neatly store the sample sizes and sigma_m values for plotting:
df <- data.frame(sample_size,
sigma_m = sapply(sample_size, calc_sigma_m, x))
Your initial plot will look like this:
library(ggplot2)
ggplot(df, aes(sample_size, sigma_m)) +
geom_line()
As expected, this is not smooth especially at smaller sample sizes.
If you want a smooth curve for demonstration purposes, you can repeat the sampling process and the sigma_m calculation many times and take the mean.
calc_sigma_m_mean <- function(n, x) {
mean(replicate(1000, sd(sample(x, n, replace = TRUE))/sqrt(n)))
}
df <- data.frame(sample_size, sigma_m = sapply(sample_size, calc_sigma_m_mean, x))
Then you will get a smoother curve:
ggplot(df, aes(sample_size, sigma_m)) +
geom_line()

Plotting Forecast and Real values in one plot using a Rolling Window

I have code that takes the Yield Spread (dependent variable) and the Forward Rates (independent variable) as input and runs auto.arima to get the orders. Afterwards, I forecast the next 25 dates (forc.horizon). My training data are the first 600 observations (training). Then I move the time window forward by 25 dates, i.e. I use observations 26 to 625, estimate auto.arima again, forecast observations 626 to 650, and so on. My data sets have 2298 rows (dates) and 30 columns (maturities).
I want to store all of the forecasts and then plot the forecasted and real values in the same plot.
This is the code I have, but it doesn't store the forecasts in a way that lets me plot them later.
forecast.func <- function(NS.spread, ind.v, maturity, training, forc.horizon){
  NS.spread <- NS.spread/100
  forc <- c()
  j <- 0
  for(i in 1:floor((nrow(NS.spread)-training)/forc.horizon)){
    # training data
    y <- NS.spread[(1+j):(training+j) , maturity]
    f <- ind.v[(1+j):(training+j) , maturity]
    # auto.arima
    c <- auto.arima(y, xreg = f, test = "adf")
    # forecast
    e <- ind.v[(training+j+1):(training+j+forc.horizon) , maturity]
    h <- forecast(c, xreg = lagmatrix(e, -1))
    forc <- c(forc, list(h))
    j <- j + forc.horizon
  }
  return(forc)
}
a <- forecast.func(spread.NS.JPM, Forward.rate.JPM, 10, 600, 25)
lapply(a, plot)
Here's a link to my two datasets:
https://drive.google.com/drive/folders/1goCxllYHQo3QJ0IdidKbdmfR-DZgrezN?usp=sharing
LOOK AT THE END for a fully functional example of how to handle an AUTO.ARIMA MODEL with DAILY DATA using XREG and FOURIER SERIES with ROLLING STARTING TIMES and cross-validated training and test sets.
Without a reproducible example no one can help you, because they can't run your code. You need to provide data. :-(
Even though discussing statistical matters isn't really what Stack Overflow is for: why don't you fit auto.arima with xreg instead of lm + auto.arima on the residuals? Especially considering how you forecast at the end, that training method looks really wrong. Consider using:
fit <- auto.arima(y, xreg = lagmatrix(f, -1))
h <- forecast(fit, xreg = lagmatrix(e, -1))
auto.arima will automatically find the best parameters by maximum likelihood.
On to your coding question:
forc <- c() should be outside of the for loop, otherwise at every iteration you delete your previous results.
Same for j <- 0: at every iteration you would set it back to 0. Put it outside the loop, since its value needs to change across iterations.
The output of forecast is an object of class forecast, which is actually a type of list. Therefore, you can't use cbind effectively.
In my opinion, you should create forc in this way: forc <- list()
And create a list of your final results in this way:
forc <- c(forc, list(h)) # instead of forc <- cbind(forc, h)
This will create a list of objects of class forecast.
You can then plot them either with a for loop, accessing each object in turn, or with lapply.
lapply(output_of_your_function, plot)
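The for-loop version (assuming a is the list returned by forecast.func, as in your call above) would simply be:
# plot each stored forecast object one at a time
for (h in a) plot(h)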
This is as far as I can go without a reproducible example.
FINAL EDIT
FULL FUNCTIONAL EXAMPLE
Here I try to sum up a conclusion out of the million comments we wrote.
With the data you provided, I built a code that can handle everything you need.
It covers everything from the training and test split to the model, the forecasts, and finally the plots, which have time on the X axis as requested in one of your comments.
I removed the for loop. lapply is much better for your case.
You can leave the Fourier series in if you want to; that's how Professor Hyndman suggests handling daily time series.
Functions and libraries needed:
# libraries ---------------------------
library(forecast)
library(lubridate)
# run model -------------------------------------
.daily_arima_forecast <- function(init, training, horizon, tt, ..., K = 10){
  # create training and test
  tt_trn <- window(tt, start = time(tt)[init], end = time(tt)[init + training - 1])
  tt_tst <- window(tt, start = time(tt)[init + training], end = time(tt)[init + training + horizon - 1])
  # add fourier series [if you want to. Otherwise, cancel this part]
  fr  <- fourier(tt_trn[,1], K = K)
  frf <- fourier(tt_trn[,1], K = K, h = horizon)
  tsp(fr)  <- tsp(tt_trn)
  tsp(frf) <- tsp(tt_tst)
  tt_trn <- ts.intersect(tt_trn, fr)
  tt_tst <- ts.intersect(tt_tst, frf)
  colnames(tt_tst) <- colnames(tt_trn) <- c("y", "s", paste0("k", seq_len(ncol(fr))))
  # run model and forecast
  aa <- auto.arima(tt_trn[,1], xreg = tt_trn[,-1])
  fcst <- forecast(aa, xreg = tt_tst[,-1])
  # add actual values to plot them later!
  fcst$test.values <- tt_tst[,1]
  # NOTE: since I modified the structure of the class forecast I should create a new class,
  # but I didn't want to complicate your code
  fcst
}
daily_arima_forecast <- function(y, x, training, horizon, ...){
  # set up x and y together
  tt <- ts.intersect(y, x)
  # set up all starting points of the training sets [named so you can recognize them later]
  inits <- setNames(nm = seq(1, length(y) - training, by = horizon))
  # remove the last one because you wouldn't have enough data in front of it
  inits <- inits[-length(inits)]
  # run the model and return a list of all your models
  lapply(inits, .daily_arima_forecast, training = training, horizon = horizon, tt = tt, ...)
}
# plot ------------------------------------------
plot_daily_forecast <- function(x){
  autoplot(x) + autolayer(x$test.values)
}
Reproducible Example on how to use the previous functions
# create a sample data
tsp(EuStockMarkets) <- c(1991, 1991 + (1860-1)/365.25, 365.25)
# model
models <- daily_arima_forecast(y = EuStockMarkets[,1],
x = EuStockMarkets[,2],
training = 600,
horizon = 25,
K = 5)
# plot
plots <- lapply(models, plot_daily_forecast)
plots[[1]]
Example for the author of the post
# your data
load("BVIS0157_Forward.rda")
load("BVIS0157_NS.spread.rda")
spread.NS.JPM <- spread.NS.JPM / 100
# pre-work [out of function!!!]
set_up_ts <- function(m){
  start <- min(row.names(m))
  end <- max(row.names(m))
  # daily sequence
  inds <- seq(as.Date(start), as.Date(end), by = "day")
  ts(m, start = c(year(start), as.numeric(format(inds[1], "%j"))), frequency = 365.25)
}
mts_spread.NS.JPM <- set_up_ts(spread.NS.JPM)
mts_Forward.rate.JPM <- set_up_ts(Forward.rate.JPM)
# model
col <- 10
models <- daily_arima_forecast(y = mts_spread.NS.JPM[, col],
x = stats::lag(mts_Forward.rate.JPM[, col], -1),
training = 600,
horizon = 25,
K = 5) # notice that K falls into the ... and goes directly to the inner function
# plot
plots <- lapply(models, plot_daily_forecast)
plots[[5]]

Is it possible to analyse a spatial point pattern given another, underlying, spatial point pattern in R

I want to analyse the type of spatial pattern shown by an animal (i.e. random, clustered, uniform) while taking into consideration the underlying spatial pattern of its available habitat. The animals in question roost in trees, so a standard analysis of the animal locations will always show a clustered distribution (i.e. clustering around trees), but I want to test whether there is clustering between trees versus whether the animals distribute randomly among the trees. To provide a visual, I want to be able to differentiate between the following scenarios in the image:
https://imgur.com/a/iE3nAoh (image not allowed because I'm new to stack overflow, but it's available through the link)
Here is a reproducible data frame. The scenario here is of uniform habitat (25 areas of habitat) and uniform animals (16 animals per habitat):
library(spatstat)
data <- data.frame(matrix(ncol = 4, nrow = 25))
x <- c("habitat", "x", "y", "animalcount")
colnames(data) <- x
data$habitat <- 1:25
data$x <- seq(from=2, to=20, by=4)
data$y[1:5] <- 2
data$y[6:10] <- 6
data$y[11:15] <- 10
data$y[16:20] <- 14
data$y[21:25] <- 18
data$animalcount <- 16
Set up conditions for the spatial analysis:
plot.win <- owin(c(0,20), c(0,20)) # set the plot window as 20x20m
nS <- 499 # number of simulations
cd <- 5 # cluster distance
ed <- 50 # envelope distance
incr.dist <- 0.5 # increment distance for envelopes
Create the point pattern for the habitat:
habitat <- ppp(x = data$x, y = data$y, window = plot.win)
Create the point pattern for the animals. To do this, first make a new data frame in which each row is repeated animalcount times, so that each point represents an individual animal. Jitter x/y so that the coordinates are not exactly identical:
data <-data[which(data$animalcount>0),]
duplicate_rows <- function(habitat, x, y, animalcount) {
  expanded <- paste0("animal-", 1:animalcount)
  repeated_rows <- data.frame("habitat" = habitat, "x" = x, "y" = y, "animalcount" = expanded)
  repeated_rows
}
expanded_rows <- Map(f = duplicate_rows, data$habitat, data$x, data$y, data$animalcount)
animal_data <- do.call(rbind, expanded_rows)
animal_data$xan <- jitter(animal_data$x)
animal_data$yan <- jitter(animal_data$y)
animal <- ppp(x = animal_data$xan, y = animal_data$yan, window = plot.win)
Now test Complete Spatial Randomness of animals regardless of habitat. This should come out as clustered:
an.csr <- envelope(animal, Kest, nsim = nS, savepatterns = TRUE, r = seq(0, ed, incr.dist), correction = c("Ripley"), verbose = FALSE) # CSR fit with nS simulations
an.dclf <- dclf.test(an.csr, rinterval = c(0,cd), verbose = FALSE) #calculate the summary statistics of the CSR null model fit (dclf.test)
plot(an.csr, sqrt(./pi)-r~r, ylab="L(r)-r", xlab="r (meters)", xlim=c(0,ed), legend="NULL", main=paste("Animal - CSR", sep = "")) #plot 0-centered fit with the confidence bounds
clarkevans(animal)[2] #R > 1 suggests ordering, < 1 suggests clustering
clarkevans.test(animal, "Donnelly")$p
Now test Complete Spatial Randomness of the animals given the available habitat. This should come out as not clustered, but simply adding habitat as a covariate is clearly not the appropriate way to do it:
an.csr <- envelope(animal, covariates = animal_data[,2:3], Kest, nsim = nS, savepatterns = TRUE, r = seq(0, ed, incr.dist), correction = c("Ripley"), verbose = FALSE)
an.dclf <- dclf.test(an.csr, rinterval = c(0,cd), verbose = FALSE)
plot(an.csr, sqrt(./pi)-r~r, ylab="L(r)-r", xlab="r (meters)", xlim=c(0,ed), legend="NULL", main=paste("Animal - CSR", sep = ""))
clarkevans(animal)[2]
clarkevans.test(animal, "Donnelly")$p
I also tried running the test of Complete Spatial Randomness on a fitted point process model, where the animal point pattern is predicted by x and y, but this did not change the outcome either:
animalppm <- ppm(animal ~ x + y)
an.csr <- envelope(animalppm, Kest, nsim = nS, savepatterns = TRUE, r = seq(0, ed, incr.dist), correction = c("Ripley"), verbose = FALSE)
an.dclf <- dclf.test(an.csr, rinterval = c(0,cd), verbose = FALSE)
plot(an.csr, sqrt(./pi)-r~r, ylab="L(r)-r", xlab="r (meters)", xlim=c(0,ed), legend="NULL", main=paste("Animal - CSR", sep = ""))
clarkevans(animalppm)[2] #R > 1 suggests ordering, < 1 suggests clustering
clarkevans.test(animalppm, "Donnelly")$p
From there I would run tests of aggregation models, but the logic of adding the second point pattern should be similar.
I would appreciate any suggestions on ways to deal with this. I cannot think of an effective way to google this, and am coming up short on clever coding solutions in R. Thanks in advance!
You can model the intensity as depending on the distance to the habitat pattern. Here is a simple example where the animals follow a Poisson point process with an intensity function that decays log-linearly with distance to the habitat:
library(spatstat)
data <- expand.grid(x = seq(2, 18, by=4), y = seq(2, 18, by=4))
data$animalcount <- 16
plot.win <- owin(c(0,20), c(0,20)) # set the plot window as 20x20m
habitat <- ppp(x = data$x, y = data$y, window = plot.win)
d <- distmap(habitat)
plot(d)
lam <- exp(3-2*d)
plot(lam)
animal <- rpoispp(lam)
plot(animal)
fit <- ppm(animal ~ d)
fit
#> Nonstationary Poisson process
#>
#> Log intensity: ~d
#>
#> Fitted trend coefficients:
#> (Intercept) d
#> 2.952048 -1.974381
#>
#> Estimate S.E. CI95.lo CI95.hi Ztest Zval
#> (Intercept) 2.952048 0.07265533 2.809646 3.094450 *** 40.63085
#> d -1.974381 0.07055831 -2.112673 -1.836089 *** -27.98226
Taking the underlying non-homogeneous intensity into account, there is no sign of departure from the Poisson model in the (inhomogeneous) K-function:
plot(Kinhom(animal, lambda = fit))
#> Warning: The behaviour of Kinhom when lambda is a ppm object has changed
#> (in spatstat 1.37-0 and later). See help(Kinhom)
You don't have to assume a simple log-linear dependence on distance. You could also make a threshold model where you have one intensity within, e.g., distance 1 of the habitat and another intensity beyond that distance. You can make all kinds of derived covariates from, e.g., the distance for use in your model.
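For instance, continuing the simulated example above, a minimal sketch of such a threshold covariate could be (the cut-off of 1 unit is arbitrary, purely for illustration):
# logical image: TRUE within distance 1 of the habitat pattern, FALSE beyond it
near <- eval.im(d < 1)
fit_thresh <- ppm(animal ~ near)
fit_thresh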
If animals is the point pattern of animals, and trees is the point pattern of trees (both objects of class "ppp" in spatstat) then you could do
d <- distfun(trees)
f <- rhohat(animals, d)
plot(f)
to get an idea of how the concentration of animals depends on distance to nearest tree. You can use
berman.test(animals, d)
to perform a hypothesis test of dependence on the trees.
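Applied to the simulated example above, with habitat playing the role of trees, that would be something like:
dtree <- distfun(habitat)     # distance to the nearest "tree"
plot(rhohat(animal, dtree))   # estimated intensity as a function of that distance
berman.test(animal, dtree)    # Berman's test of dependence on the covariate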

Cumulative Area-Under-Curve above a cut-off value

I am a non-computing/non-math student who is really new to R and I require some help. I have provided a dummy dataset and example to illustrate my problems.
Note: AUC = area under curve; ICP = intracranial pressure; cumAUC = cumulative AUC;
To put things into context, any ICP value > 20 is not clinically ideal. A method of calculating this physiological insult is via a measure known as the ICP-times-Time burden, which can be represented by the AUC of the ICP-Time curve above a y-cutoff of 20.
Hence I would like to calculate the cumulative AUC of the ICP-time curve above a cut-off value of ICP = 20. However, my code is not giving me the desired output: ideally, the cumAUC should remain constant while ICP < 20 rather than keep increasing. Here is a dummy dataset and the code that I have written:
require(MASS) #Using the area function in MASS
require(dplyr)
require(ggplot2)
df <- data.frame(time=seq(1,20,by=1),
ICP=c(7,9,15,14,16,20,25,23,26,27,18,15,10,9,7,13,22,24,26,20))
ggplot(data=df, mapping=aes(x=time,y=ICP)) + geom_line() + geom_hline(yintercept = 20)
func_test <- approxfun(df$time, df$ICP, method="linear", rule=2)
area_single <- function(x) {area(func_test,0,x)}
area_multiple <- Vectorize(area_single)
area_cutoff <- function(x, level = 20){
  tmp <- area_multiple(x) - lag(area_multiple(x)) - level*(x - lag(x))
  tmp_test <- tmp > 0
  tmp_test <- ifelse(is.na(tmp), FALSE, tmp_test)
  out <- ifelse(tmp_test, tmp, 0)
  return(out)
}
df_auc <- df %>%
mutate(cumAUC = cumsum(area_cutoff(time)))
Could you all kindly point out any possible errors in my code? Any alternative suggestions would also be great! :) Thank you so much for your help; much appreciated!
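One possible alternative (my own sketch, not from the original thread): clip the curve at the cut-off first, so values below 20 contribute zero area, and then accumulate trapezoids between successive time points. Note that intervals where ICP crosses 20 are only approximated, because the clipped values are treated as linear between samples.
# excess of ICP above the cut-off; zero whenever ICP <= 20
excess <- pmax(df$ICP - 20, 0)
# trapezoidal area of each time interval, then the running total
seg_auc <- diff(df$time) * (head(excess, -1) + tail(excess, -1)) / 2
df$cumAUC <- c(0, cumsum(seg_auc))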

linear interpolation of points in R

This may seem a really simple question, but here goes:
I have a data frame:
test_df <- data.frame(x1 = c(277422033,24118536.4,2096819.0,
182293.4,15905,1330,105,16,1),
x2 = c(2.496e-3,2.495e-2,2.496e-1,
2.496e0,2.47e1,2.48e2,2.456e3,
3.7978e4,3.781e5))
and I would like to linearly interpolate this to increase the number of points. The variables are linearly related on a log scale, i.e.
plot(log10(test_df[,1]),log10(test_df[,2]))
So, my question is, how do I linearly interpolate these to increase the number of values?
Here is my attempt using a linear model (as opposed to the approx function):
I have defined a linear model as:
test.lm <- lm(log10(x1) ~ log10(x2), data = test_df)
and then define a new variable for the new points:
ss <- seq(min(test_df$x2),max(test_df$x2),length.out = 100) # new x2 values
then predict the new values and plot the points
newY <- predict(test.lm, newdata = data.frame(x2 = ss)) # interpolated values
test_df2 <- data.frame(x1 = 10^newY,
x2 = ss)
points(newY,log10(ss),col = "red")
This works as I expect, i.e. the resulting graph is as I expected.
I would like to increase the number of points in test_df2 which can be done by increasing length.out e.g.
ss <- seq(min(test_df$x2),max(test_df$x2),length.out = 10000000)
but this makes the running time very long on my machine, to the point that I have to restart R.
Is there a way to linearly interpolate at a modest number of points that are evenly distributed on the plot and still span the full range covered by ss?
Just use
ss <- 10^seq(log10(min(test_df$x2)),log10(max(test_df$x2)),length.out = 1000)
to have your new data evenly distributed on the log scale.
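Plugging that ss back into the code from the question (just a sketch reusing the same fitted model and plot):
newY <- predict(test.lm, newdata = data.frame(x2 = ss))  # predicted log10(x1) values
test_df2 <- data.frame(x1 = 10^newY, x2 = ss)
points(newY, log10(ss), col = "red")  # the red points are now evenly spaced on the log-log plot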
