Cumulative Area-Under-Curve above a cut-off value - r

I am a non-computing/math student who is really new to R and needs some help. I have provided a dummy dataset and example to illustrate my problem.
Note: AUC = area under curve; ICP = intracranial pressure; cumAUC = cumulative AUC.
To put things into context, any ICP value > 20 is not clinically ideal. One way of quantifying this physiological insult is a measure known as the ICP-times-time burden, which can be represented by the AUC of the ICP-time curve above a y-cutoff of 20.
Hence I would like to calculate the cumulative AUC of the ICP-time curve above a cut-off value of ICP = 20. However, my code is not giving me the desired output: ideally, the cumAUC should remain constant while ICP < 20, not keep increasing. Here's a dummy dataset and the code I have written:
require(MASS) #Using the area function in MASS
require(dplyr)
require(ggplot2)
df <- data.frame(time = seq(1, 20, by = 1),
                 ICP = c(7,9,15,14,16,20,25,23,26,27,18,15,10,9,7,13,22,24,26,20))
ggplot(data=df, mapping=aes(x=time,y=ICP)) + geom_line() + geom_hline(yintercept = 20)
func_test <- approxfun(df$time, df$ICP, method="linear", rule=2)
area_single <- function(x) {area(func_test,0,x)}
area_multiple <- Vectorize(area_single)
area_cutoff <- function(x, level = 20) {
  tmp <- area_multiple(x) - lag(area_multiple(x)) - level * (x - lag(x))
  tmp_test <- tmp > 0
  tmp_test <- ifelse(is.na(tmp), FALSE, tmp_test)
  out <- ifelse(tmp_test, tmp, 0)
  return(out)
}
df_auc <- df %>%
  mutate(cumAUC = cumsum(area_cutoff(time)))
Could you kindly point out any possible errors in my code? Any alternative suggestions would also be great! :) Thank you so much for your help; much appreciated!
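For illustration, here is a minimal sketch of an alternative approach (using the df defined above, not part of the original question): clip the ICP values at the cut-off and accumulate trapezoidal segments, so the cumulative AUC stays flat whenever ICP is below 20. It only approximates intervals that cross the cut-off, but the overall behaviour matches the description.
# Sketch: integrate only the portion of the curve above the cut-off
cutoff <- 20
excess <- pmax(df$ICP - cutoff, 0)                                 # height above the cut-off
seg <- diff(df$time) * (head(excess, -1) + tail(excess, -1)) / 2   # per-interval trapezoids
df$cumAUC_clip <- c(0, cumsum(seg))                                # cumulative ICP-times-time burden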

Related

Simulate in R the number of samples needed in order to achieve the true standard deviation

I want to recreate in R the figure above, which simulates the number of samples needed in order to achieve the true standard deviation. How can I do it in R?
I suppose the distribution is a t-distribution or a normal, so I have to generate numbers from these distributions, increasing the sample size each time, and plot the results to recreate the figure. Any help?
set.seed(123)
x <- list(v1 = rnorm(1, 0, 12),   v2 = rnorm(10, 0, 11),
          v3 = rnorm(20, 0, 10),  v4 = rnorm(30, 0, 9),
          v5 = rnorm(40, 0, 8),   v6 = rnorm(50, 0, 7),
          v7 = rnorm(60, 0, 6),   v8 = rnorm(70, 0, 5),
          v9 = rnorm(80, 0, 4),   v10 = rnorm(90, 0, 3),
          v11 = rnorm(100, 0, 2), v12 = rnorm(110, 0, 2))
g = lapply(x,sd)
g
g1 = unlist(g)
plot(g1,type="l")
First, start with a random uniform distribution of suitable size, and choose the sample sizes at which you want to compute the standard error of the mean.
set.seed(123)
x <- runif(1e6, 0, 1)
sample_size <- 5:120
You can define a function to compute this sigma_m. Here you draw a sample of size n from x with replacement, take its standard deviation, and divide by sqrt(n).
calc_sigma_m <- function(n, x) {
  sd(sample(x, n, replace = TRUE)) / sqrt(n)
}
A data frame can neatly store the sample sizes and sigma_m values for plotting:
df <- data.frame(sample_size,
                 sigma_m = sapply(sample_size, calc_sigma_m, x))
Your initial plot will look like this:
library(ggplot2)
ggplot(df, aes(sample_size, sigma_m)) +
  geom_line()
As expected, this is not smooth especially at smaller sample sizes.
If you want a smooth curve for demonstration, you can repeat the sampling process and the sigma_m calculation many times and take the mean.
calc_sigma_m_mean <- function(n, x) {
  mean(replicate(1000, sd(sample(x, n, replace = TRUE)) / sqrt(n)))
}
df <- data.frame(sample_size, sigma_m = sapply(sample_size, calc_sigma_m_mean, x))
Then you will get a smoother curve:
ggplot(df, aes(sample_size, sigma_m)) +
  geom_line()
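If a reference curve helps (an extra sketch, not part of the original answer), you can overlay the theoretical standard error of the mean, sd(x)/sqrt(n), on the same plot:
# Overlay the theoretical curve sd(x)/sqrt(n) as a reference line
theory <- data.frame(sample_size, sigma_m = sd(x) / sqrt(sample_size))
ggplot(df, aes(sample_size, sigma_m)) +
  geom_line() +
  geom_line(data = theory, colour = "red")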

How to accomplish a replicated calculation and plot on subsets of a dataset?

I have simulated data created like this:
library(MASS)   # mvrnorm() comes from MASS
average_vector = c(0, 0, 25)
sigma_matrix = matrix(c(4,1,0,1,8,0,0,0,9), nrow = 3, ncol = 3)
set.seed(12345)
data0 = as.data.frame(mvrnorm(n =20000, mu = average_vector, Sigma=sigma_matrix))
names(data0)=c("hard","smartness","age")
set.seed(13579)
data0$final=0.5*data0$hard+0.2*data0$smartness+(-0.1)*data0$age+rnorm(n=dim(data0)[1],mean=90,sd=6)
Now, I want to randomly sample 50 students 1,000 times (1,000 sets of 50 people). I used this code:
datsub<-(replicate(1000, sample(1:nrow(data0),50)))
After that step, I encountered an issue: I want to run a regression model on the 50 selected people (1,000 times) and record/store the point estimates of "hard" from model 4, which is given like this:
model4 = lm(formula = final ~ hard + smartness + age, data = data0)
and then plot the variation around the line of 0.5 (the true value). Is there any way I can achieve that? Thanks a lot!
I would highly suggest looking into either caret or the newer (and still maintained) TidyModels if you're just getting into R modelling. Either of these will make your life easier, once you get used to the dplyr-like syntax.
What you're trying to do is bootstrapping. Here is the manual approach using only base functions.
n <- nrow(data0)
k <- 1000
ns <- 50
samples <- replicate(k, sample(seq_len(n), ns))
params <- vector('list', k)
for (i in seq_len(k)) {   # loop over the k bootstrap samples, not the n rows
  params[[i]] <- coef(lm(formula = final ~ hard + smartness + age, data = data0[samples[, i], ]))
}
# merge params into columns
params <- do.call(rbind, params)
# Create plot from here.
plot(x = seq_len(k), y = params[, "hard"])
abline(h = 0.5)
Note the above may have a few typos as your example is not reproducible.
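If a distributional view is preferred (a small sketch added here, not from the original answer), a histogram of the bootstrapped coefficients shows the spread around the true value of 0.5:
# Histogram of the bootstrap estimates for "hard", with the true value marked
hist(params[, "hard"], breaks = 30, main = "Bootstrap estimates of 'hard'", xlab = "coefficient")
abline(v = 0.5, col = "red", lwd = 2)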

Problems with calculating the likelihood of an outcome based on a cumulative probability function in R

I am new to R and looking to estimate the likelihood of an outcome >= 100 using a probability density function (the outcome in my example is the size of an outbreak). I believe the code is correct, but something doesn't feel right about the answer when I look at the plot.
This is my code (it's based on the output of a stochastic model of an outbreak). I'd very much appreciate pointers. I think the error is in the likelihood calculation....
Thank you!
total_cases.dist <- dlnorm(sample.range, mean = total_cases.mean, sd = total_cases.sd)
total_cases.df <- data.frame("total_cases" = sample.range, "Density" = total_cases.dist)
library(ggplot2)
ggplot(total_cases.df, aes(x = total_cases, y = Density)) + geom_point()
pp <- function(x) {
  print(paste0(round(x * 100, 3), "%"))
}
# likelihood of n_cases >= 100
pp(sum(total_cases.df$Density[total_cases.df$total_cases >= 100]))
You are using dlnorm, which is the log-normal distribution; its meanlog and sdlog arguments are the mean of log(values) and the sd of log(values), not the mean and sd of the values themselves. For example:
# we call the standard rlnorm
X = rlnorm(1000, 0, 1)
# mean(X) and sd(X) are NOT 0 and 1
c(mean(X), sd(X))
# the log of the values recovers what we simulated (mean 0, sd 1)
c(mean(log(X)), sd(log(X)))
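For reference (an added note, with mu and sigma denoting the meanlog and sdlog parameters), the closed-form relationship is mean(X) = exp(mu + sigma^2/2) and var(X) = (exp(sigma^2) - 1) * exp(2*mu + sigma^2):
mu <- 0; sigma <- 1
exp(mu + sigma^2 / 2)                               # theoretical mean of X, about 1.65
sqrt((exp(sigma^2) - 1) * exp(2 * mu + sigma^2))    # theoretical sd of X, about 2.16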
We now simulate some data using a known Poisson distribution, where mean = variance, and model it with the log-normal:
set.seed(100)
X <- rpois(500,lambda=1310)
# we need to log values first
total_cases.mean <- mean(log(X))
total_cases.sd <- sd(log(X))
and you can see it works well
sample.range <- 1200:1400
hist(X,br=50,freq=FALSE)
lines(sample.range,
      dlnorm(sample.range, meanlog = total_cases.mean, sdlog = total_cases.sd),
      col = "navyblue")
For your example, you can get probability of values > 1200 (see histogram):
plnorm(1200,total_cases.mean,total_cases.sd,lower.tail=FALSE)
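As a quick sanity check (an addition, not in the original answer), this closed-form tail probability should be close to the empirical proportion in the simulated Poisson data:
mean(X > 1200)                                                        # empirical proportion
plnorm(1200, total_cases.mean, total_cases.sd, lower.tail = FALSE)    # log-normal tail probability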
Now for your data: if it is true that mean = 1310.198 and total_cases.sd = 31615.26, that makes the variance roughly 760,000 times your mean! I am not sure the log-normal distribution is appropriate for modelling that kind of data.

R bootstrap through all columns in a data frame

I'm trying to get bootstrapped descriptive statistics for the columns of a data frame using boot() in R. I can't figure out how to write the "statistic" function required by boot so that it goes through all the columns and returns four statistics for each.
I'm basically trying to modify code from a book chapter introducing bootstrapping, but the example there uses dlply to group the numeric values in one column by the group names in another. I don't need that step; I just need to do the same thing to every column. I reproduce the code from the textbook below:
library(boot)
library(plyr)
library(moments)
DescStat <- function(data, i) {
  temp <- data[i, ]
  desc <- dlply(temp, "GroupName", summarize,
                mean = mean(ValueColumn), sd = sd(ValueColumn),
                skew = skewness(ValueColumn), kurt = kurtosis(ValueColumn))
  l.desc <- unlist(desc)
  return(l.desc)
}
DasBoot <- boot(dataframe, DescStat, 1000)
print(DasBoot)
That returns a table with each of those statistics, plus its bias and SE, on its own line: Group1's mean with its bias and SE, then Group1's SD with its bias and SE on the next line, and so on down the list, four lines for each group.
I would like to do exactly the same thing, but for each column in the data frame instead of for groups of cases.
I have kind of been floundering. It seems like this should not be that difficult to do, but all the examples and tutorials online are either for one column (which I'm able to do just fine by specifying it in the statistic function), or for trickier manipulations such as in the above.
Any help would be greatly appreciated.
I was able to figure it out using either apply() or purrr::map(). Here's the statistic function using apply(), which doesn't require any extra (tidyverse) packages to be installed:
ApStat <- function(data, i) {
  temp <- data[i, ]
  desc <- apply(temp, 2, FUN = function(x) {
    list(mean = mean(x, 0.2), sd = sd(x),
         skew = skewness(x), kurt = kurtosis(x))
  })
  l.desc <- unlist(desc)
  return(l.desc)
}
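For completeness, the boot() call then mirrors the textbook example above (a sketch; dataframe stands for your all-numeric data frame, as in the question):
library(boot)
library(moments)   # for skewness() and kurtosis()
DasBoot <- boot(dataframe, ApStat, R = 1000)
print(DasBoot)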
Thanks to the guys/gals at r/rstats!

Simulate Cox Proportional Hazard with geom_hex

I am interested in replicating an experiment in a paper [1] I came across. The idea is that I need to simulate a Cox proportional hazards model that depends on the first two covariates in the data frame. I am trying to make a plot similar to this:
But I am trying to make a "hex" version of it. The problem is that I can't seem to get the "z-axis" correct.
set.seed(42) # this makes the example exactly reproducible
#50,000 random uniforms
obs <- runif(50000,min = -1, max = .999)
#make uniforms a matrix
obs <- matrix(data = obs, nrow = 5000, ncol = 10)
#make is_censored
is_censored <- sample(0:1,5000,TRUE,prob=c(0.40,0.60))
#hazard function
const <- 1
time <- rexp(n = 5000, const*exp(-(obs[,1]+2*obs[,2])))
#dataset
df <- cbind(obs, is_censored, time)
#names for covariates
names = letters[1:10]
colnames(df)[1:10] <- names
#truth data
library(ggplot2)
library(tibble)
library(hexbin)   # install.packages("hexbin") if not already installed; geom_hex() needs it
x <- df[,1]; y <- df[,2]
true <- tibble(x, y, time)
ggplot(true, aes(x, y)) +
  geom_hex(bins = 30)
I thought that if I added time for the z-axis I would get the correct gradient, but instead I got:
ggplot(true, aes(x, y, fill = time)) +
  geom_hex(bins = 30)
How can I get the proper gradient?
[1] Deep Survival: A Deep Cox Proportional Hazards Network
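One possible way forward (a sketch, not an answer from the original thread): geom_hex() always maps fill to the bin count, so to colour each hexagon by survival time you can supply a z aesthetic and summarise it with stat_summary_hex():
# Colour each hexagon by the mean simulated survival time of the points inside it
ggplot(true, aes(x, y, z = time)) +
  stat_summary_hex(fun = mean, bins = 30) +
  labs(fill = "mean time")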
