I tried to run the code below to find a TD- and D-optimal design,
but it always says my length is wrong. I do not understand how to specify "weights"
and Delta. Can someone help me out here?
Here is my code:
library(DoseFinding)
doses <- c(0,5, 25, 125, 200)
fmodels <- Mods(linear = NULL, emax = 14,
doses = doses, placEff=-0.17, maxEff=-1.4)
weights <- rep(1/5, 5)
desTD <- optDesign(fmodels, probs=1, designCrit="TD&Dopt",Delta=0.5)
plot(fmodels, plotTD = TRUE, Delta = 0.2)
When I used the package example, it showed weights for all the doses:
data(IBScovars)
doses <- c(0, 10, 25, 50, 100, 150)
fmodels <- Mods(linear = NULL, emax = 25, exponential = 85,
logistic = c(50, 10.8811),
doses = doses, placEff=0, maxEff=0.4)
plot(fmodels, plotTD = TRUE, Delta = 0.2)
weights <- rep(1/4, 4)
desTD <- optDesign(fmodels, weights, Delta=0.2, designCrit="TD")
Calculated TD - optimal design:
0 10 25 50 100 150
0.34960 0.09252 0.00366 0.26760 0.13342 0.15319
But for mine, only three doses show up... does that mean the
other doses are not important?
Well, from the help page, you need to have the same number of weights as doses. The models retain info about the doses used. What you have looks fine, but you could also do
ds <- attr(fmodels, "doses")
weights <- rep.int(1, length(ds))/length(ds)
to extract the information from the fmodels object.
Also, when running your optDesign call, I had problems with the probs and designCrit parameters you specified. The length of probs should correspond to the number of models in fmodels. An internal calculation does this to find the total number:
Reduce("+",lapply(fmodels, function(x) {
if (is.vector(x))
return(1)
if (is.matrix(x))
return(nrow(x))
}))
# [1] 2
so there should be two probabilities. Also, I think designCrit="TD&Dopt" should be designCrit="Dopt&TD", so the following seems to run without error:
desTD <- optDesign(fmodels, probs=c(.5,.5),
weights=weights, designCrit="Dopt&TD",Delta=0.5)
It's unclear exactly what your question about Delta is. According to the help page, it is the clinically relevant effect: the target improvement over placebo that defines the target dose.
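As an aside, you can ask the package directly which dose a given Delta implies under each model via the TD() function; a minimal sketch (direction = "decreasing" because your maxEff is negative):
# smallest dose achieving an improvement of Delta over placebo, per model
TD(fmodels, Delta = 0.5, direction = "decreasing")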
I am trying to simulate how replacement/reassignment of values in random samples affects the predictions conveyed by the AUC.
I have a tumor classification in a dataframe column, df$who, which has levels 1, 2, 3 corresponding to the severity of the tumor lesion.
Intro to the question
Let's say the baseline data look like this:
set.seed(1)
df <- data.frame(
who = as.factor(sample(1:3, size = 6000, replace = TRUE, prob = c(0.8, 0.15, 0.05))),
age = round(runif(n = 6000, min = 18, max = 95), digits = 1),
gender = sample(c("m", "f"), size = 6000, replace = TRUE, prob = c(1/3, 2/3)),
event.time = runif(n = 6000, min = 8, max = 120),
event = as.factor(sample(0:2, size = 6000, replace = TRUE, prob = c(0.25, 0.2, 0.55)))
)
And a standard cause-specific Cox regression looks like:
library(survival)
a_baseline <- coxph(Surv(event.time, event == 1) ~ who + age + gender, data = df, x = TRUE)
From this, the AUC can be obtained as a measure of predictive performance; here, a leave-one-out bootstrap is used for 5-year prediction of df$event == 1.
library(riskRegression)
u <- Score(list("baseline" = a_baseline),
Surv(event.time, event == 1) ~ 1,
data = df,
times = 60,
plots = "cal",
B = 50,
split.method = "loob",
metrics = c("auc", "brier")
)
# The AUC is then obtained
u$AUC$score$AUC[2]
Question
I want to simulate how re-classifying a random 5% of df$who == 1 to df$who == 2 affects the 5-year prediction of df$event == 1.
I want to create 10 separate simulated copies of the baseline data df, each with a random 5% of df$who == 1 reassigned to df$who == 2. Then I want to use each of these 10 simulated datasets to predict the 5-year risk of df$event == 1.
I have applied a for loop to this. The expected output is a dataframe that tells me which of the 10 simulated datasets yielded the highest and lowest u$AUC$score$AUC[2] (i.e., the best and worst predictions).
I am new to for loops, but here is my attempt (which obviously did not work):
all_auc <- data.frame() ## create a dataframe to fill in AUC from all 10 simulated sub-datasets
for(i in 1:10){ #1:10 represent the simulated datasets from 1 to 10
df[i] <- df #allocating baseline data to each of the 10 datasets
df[i]$who[sample(which(df[i]$who==1), round(0.05*length(which(df[i]$who==1))))]=2 #create the random 5% allocation of who==1 to who==2 in the i'th simulated dataset
ith_cox <- coxph(Surv(event.time, event == 1) ~ who + age + gender, data = df[i], x = TRUE) #create the i'th Cox regression based on the i'th dataset
# create the predictions based on the i'th Cox
u[i] <- Score(list("baseline" = ith_cox),
Surv(event.time, event == 1) ~ 1,
data = df[i],
times = 60,
plots = "cal",
B = 50,
split.method = "loob",
metrics = c("auc", "brier")
)
# summarize all AUC from all 10 sub-datasets
all_auc <- u[i]$AUC$score$AUC[2]
}
(1) I could not get this for loop to work as described, and
(2) the final dataframe all_auc should indicate only which of the 10 datasets yielded the worst and best predictions (I will then use these two datasets for further analysis).
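For what it's worth, here is a minimal sketch (untested) of one way to restructure the loop: keep the simulated datasets in a list and the AUCs in a numeric vector instead of indexing df itself; the Score() call simply mirrors the one above.
library(survival)
library(riskRegression)
n_sims <- 10
all_auc <- numeric(n_sims)     # one AUC per simulated dataset
sims <- vector("list", n_sims) # keep each simulated dataset for later use
for (i in seq_len(n_sims)) {
  df_i <- df                                        # copy the baseline data
  idx1 <- which(df_i$who == 1)
  flip <- sample(idx1, round(0.05 * length(idx1)))  # random 5% of who == 1
  df_i$who[flip] <- "2"                             # reassign them to level 2
  sims[[i]] <- df_i
  fit_i <- coxph(Surv(event.time, event == 1) ~ who + age + gender,
                 data = df_i, x = TRUE)
  u_i <- Score(list(sim = fit_i), Surv(event.time, event == 1) ~ 1,
               data = df_i, times = 60, B = 50, split.method = "loob",
               metrics = c("auc", "brier"))
  all_auc[i] <- u_i$AUC$score$AUC[2] # same extraction as in the question
}
which.max(all_auc) # index of the best dataset, recovered via sims[[which.max(all_auc)]]
which.min(all_auc) # index of the worst dataset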
A final note
This is only a reproducible example. The for loop will be applied to 10,000 simulated datasets in our analysis. I do not know if this affects the answer, but it underlines the importance of the result: a dataframe (or vector?) that simply tells me which simulated dataset yielded the best vs. worst predictions, so that I can subsequently use those two dataframes, e.g. df2930 and df8939, for further analysis.
I need help calculating bootstrap-based credible intervals for the quantity qtt.ci, derived from my regression coefficients coef.def.
So far my attempts have resulted in:
Error in quantile.default(s, c(0.025, 0.25, 0.5, 0.75, 0.975)) :
missing values and NaN's not allowed if 'na.rm' is FALSE
preceded by:
Warning message: In bayesboot(dat, boot_fn) : The sample from
bayesboot contains either NAs, NaNs or NULLs. Make sure that your
statistic function only return actual values.
Here are my sample data:
dat <- data.frame(
  A = c(1, 1, 0, 0), B = c(1, 0, 1, 0),
  Pass = c(278, 100, 153, 79), Fail = c(743, 581, 1232, 1731)
)
Below is my regression. The quantity for which I want the bootstrap-based 95% credible intervals is qtt.ci:
boot_fn <- function(dat) {
coef.def = unname(coef(glm(cbind(Pass, Fail) ~ A * B, binomial,
dat)))
}
qtt.ci <- exp(sum(coef.def[2:4])) - exp(coef.def[2]) - exp(coef.def[3]) + 1
Here is my attempt:
bb_ci <- bayesboot(dat, boot_fn)
summary(bb_ci)
Not certain how to get the bootstrap-based confidence intervals for qtt.ci.
Thank you in advance.
EDIT:
Following the answer by @RuiBarradas, I tried using the bootstrap to get the 95% CI for the quantity qtt.ci, but without success:
library(bayesboot)
boot_fn <- function(dat) {
coef.def <- unname(coef(glm(cbind(Pass, Fail) ~ A * B, binomial, dat)))
qtt<- (exp(sum(coef.def[2:4])) - exp(coef.def[2]) - exp(coef.def[3]) + 1)
if(all(!is.na(qtt))) qtt else NULL
}
Runs <- 1e2
qtt.ci <- bayesboot(dat, boot_fn, R = Runs, R2 = Runs)
summary(qtt.ci)
Quantiles:
statistic q2.5% q25% median q75% q97.5%
V1 2.705878 2.705878 2.705878 2.705878 2.705878
Therefore, this does not give the CI for qtt.ci. The output is simply the point estimate for qtt:
qtt<-(exp(sum(coef.def[2:4])) - exp(coef.def[2]) - exp(coef.def[3]) + 1)
qtt
[1] 2.705878
Any help would be much appreciated.
The following solves the warning issue. I have tested it with far fewer runs: 100 instead of 4000.
library(bayesboot)
boot_fn <- function(dat) {
fit <- glm(cbind(Pass, Fail) ~ A * B, binomial, dat)
coef.def <- unname(coef(fit))
if(all(!is.na(coef.def))) coef.def else NULL
}
Runs <- 1e2
bb_ci <- bayesboot(dat, boot_fn, R = Runs, R2 = Runs)
summary(bb_ci)
Edit.
According to the formula in the question and the dialogue in the comments with the OP, to get the bootstrap-based CI, compute qtt for each posterior draw and then take quantiles of those draws:
qtt <- exp(rowSums(bb_ci[, 2:4])) - exp(bb_ci[[2]]) - exp(bb_ci[[3]]) + 1
quantile(qtt, c(0.025, 0.975))  # bootstrap-based 95% interval
I'm trying out the Keras package in R by working through this tutorial on forecasting temperature. However, the tutorial gives no explanation of how to predict with the trained RNN model, and I wonder how to do this. To train a model I used the following code, copied from the tutorial:
dir.create("~/Downloads/jena_climate", recursive = TRUE)
download.file(
"https://s3.amazonaws.com/keras-datasets/jena_climate_2009_2016.csv.zip",
"~/Downloads/jena_climate/jena_climate_2009_2016.csv.zip"
)
unzip(
"~/Downloads/jena_climate/jena_climate_2009_2016.csv.zip",
exdir = "~/Downloads/jena_climate"
)
library(readr)
data_dir <- "~/Downloads/jena_climate"
fname <- file.path(data_dir, "jena_climate_2009_2016.csv")
data <- read_csv(fname)
data <- data.matrix(data[,-1])
train_data <- data[1:200000,]
mean <- apply(train_data, 2, mean)
std <- apply(train_data, 2, sd)
data <- scale(data, center = mean, scale = std)
generator <- function(data, lookback, delay, min_index, max_index,
shuffle = FALSE, batch_size = 128, step = 6) {
if (is.null(max_index))
max_index <- nrow(data) - delay - 1
i <- min_index + lookback
function() {
if (shuffle) {
rows <- sample(c((min_index+lookback):max_index), size = batch_size)
} else {
if (i + batch_size >= max_index)
i <<- min_index + lookback
rows <- c(i:min(i+batch_size, max_index))
i <<- i + length(rows)
}
samples <- array(0, dim = c(length(rows),
lookback / step,
dim(data)[[-1]]))
targets <- array(0, dim = c(length(rows)))
for (j in 1:length(rows)) {
indices <- seq(rows[[j]] - lookback, rows[[j]],
length.out = dim(samples)[[2]])
samples[j,,] <- data[indices,]
targets[[j]] <- data[rows[[j]] + delay,2]
}
list(samples, targets)
}
}
lookback <- 1440
step <- 6
delay <- 144
batch_size <- 128
train_gen <- generator(
data,
lookback = lookback,
delay = delay,
min_index = 1,
max_index = 200000,
shuffle = TRUE,
step = step,
batch_size = batch_size
)
val_gen = generator(
data,
lookback = lookback,
delay = delay,
min_index = 200001,
max_index = 300000,
step = step,
batch_size = batch_size
)
test_gen <- generator(
data,
lookback = lookback,
delay = delay,
min_index = 300001,
max_index = NULL,
step = step,
batch_size = batch_size
)
# How many steps to draw from val_gen in order to see the entire validation set
val_steps <- (300000 - 200001 - lookback) / batch_size
# How many steps to draw from test_gen in order to see the entire test set
test_steps <- (nrow(data) - 300001 - lookback) / batch_size
library(keras)
model <- keras_model_sequential() %>%
layer_flatten(input_shape = c(lookback / step, dim(data)[-1])) %>%
layer_dense(units = 32, activation = "relu") %>%
layer_dense(units = 1)
model %>% compile(
optimizer = optimizer_rmsprop(),
loss = "mae"
)
history <- model %>% fit_generator(
train_gen,
steps_per_epoch = 500,
epochs = 20,
validation_data = val_gen,
validation_steps = val_steps
)
I tried to predict the temperature with the code below. If I am correct, this should give me the normalized predicted temperature for every batch. So when I denormalize the values and average them, I get the predicted temperatures. Is this correct, and if so, for which time is the prediction made (latest observation time + delay)?
prediction.set <- test_gen()[[1]]
prediction <- predict(model, prediction.set)
Also, what is the correct way to use keras::predict_generator() and the test_gen() function? If I use the following code:
model %>% predict_generator(generator = test_gen,
steps = test_steps)
it gives this error:
error in py_call_impl(callable, dots$args, dots$keywords) :
ValueError: Error when checking model input: the list of Numpy
arrays that you are passing to your model is not the size the model expected.
Expected to see 1 array(s), but instead got the following list of 2 arrays:
[array([[[ 0.50394005, 0.6441838 , 0.5990761 , ..., 0.22060473,
0.2018686 , -1.7336458 ],
[ 0.5475698 , 0.63853574, 0.5890239 , ..., -0.45618412,
-0.45030192, -1.724062...
Note: my familiarity with R syntax is very limited, so unfortunately I can't give you an answer in R. Instead, I am using Python in my answer. I hope you can easily translate my words, at least, back to R.
... If I am correct, this should give me the normalized predicted
temperature for every batch.
Yes, that's right. The predictions would be normalized since you have trained it with normalized labels:
data <- scale(data, center = mean, scale = std)
Therefore, you would need to denormalize the values using the computed mean and std to find the real predictions:
pred = model.predict(test_data)
denorm_pred = pred * std + mean
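In R, with the prediction object from the question, the equivalent would be something like this (a sketch; it assumes the temperature is column 2 of data, which is the column the generator uses as the target):
# undo the scaling with the temperature column's training mean and sd
denorm_prediction <- prediction * std[2] + mean[2]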
... for which time is then predicted (latest observation time +
delay?)
That's right. Concretely, since in this particular dataset a new observation is recorded every ten minutes and you have set delay = 144, the predicted value is the temperature 24 hours (i.e. 144 * 10 = 1440 minutes = 24 hours) after the last given observation.
Also, what is the correct way to use keras::predict_generator() and
the test_gen() function?
predict_generator takes a generator that outputs only test samples, not labels (we don't need labels when performing prediction; labels are needed when training, i.e. fit_generator(), and when evaluating the model, i.e. evaluate_generator()). That's why the error says it expected one array but got a list of two arrays. So you need to define a generator that gives only test samples; one alternative, in Python, is to wrap your existing generator inside another function that yields only the input samples (I don't know whether you can do this in R or not):
def pred_generator(gen):
for data, labels in gen:
yield data # discards labels
preds = model.predict_generator(pred_generator(test_generator), number_of_steps)
You also need to provide the number of steps for the generator to cover all the samples in the test data: num_steps = total_number_of_samples / batch_size. For example, if you have 1000 samples and the generator generates 10 samples per call, you need to run the generator for 1000 / 10 = 100 steps.
Bonus: To see how good your model performs you can use evaluate_generator using the existing test generator (i.e. test_gen):
loss = model.evaluate_generator(test_gen, number_of_steps)
The resulting loss is also normalized; to denormalize it (and get a better sense of the prediction error) you just need to multiply it by std (you don't need to add mean, since you are using mae, i.e. mean absolute error, as the loss function):
denorm_loss = loss * std
This tells you how far off your predictions are on average. For example, if you are predicting the temperature, a denorm_loss of 5 means that the predictions are on average 5 degrees off (either above or below the actual value).
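In R that is just (a sketch, assuming you assign the result of evaluate_generator to loss):
denorm_loss <- loss * std[2]  # average absolute error in degrees Celsius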
Update: For prediction, you can define a new generator using an existing generator in R like this:
pred_generator <- function(gen) {
function() { # wrap it in a function to make it callable
gen()[1] # call the given generator and get the first element (i.e. samples)
}
}
preds <- model %>%
predict_generator(
generator = pred_generator(test_gen), # pass test_gen directly to pred_generator without calling it
steps = test_steps
)
evaluate_generator(model, test_gen, test_steps)
I'm working on an STM model (topic modeling) and I'd like to evaluate and verify the model, but I'm not sure how to do it. My code is:
Corpus.STM <- readCorpus(dtm, type = "slam")
Model choice:
BestM1. <- searchK(Corpus.STM$documents, Corpus.STM$vocab, K=c(10,20, 30, 40, 50, 60), proportion = .4, heldout.seed = 1, prevalence=~ cvJahr+ cvDienstgrad+ cvLand, data=Jahr.Land )
BestM2. <- searchK(Corpus.STM$documents, Corpus.STM$vocab, K=c(85,110), proportion = .4, heldout.seed = 1, prevalence=~ cvJahr+ cvDienstgrad+ cvLand, data=Jahr.Land )
BestM3. <- searchK(Corpus.STM$documents, Corpus.STM$vocab, K=c(20,21,22,23,24,25,26,27,28,29,30), proportion = .4, heldout.seed = 1, prevalence=~ cvJahr+ cvDienstgrad+ cvLand, data=Jahr.Land )
str(BestM1.)
plot.searchK(BestM1.)
plot.searchK(BestM2.)
plot.searchK(BestM3.)
#27 seems to be a good choice
#Heldout
set.seed(1)
heldout<- make.heldout(Corpus.STM$documents, Corpus.STM$vocab, proportion = .5,seed = 1)
stm.mod1 <- stm(heldout$documents, heldout$vocab, K =27, seed = 1, init.type = "Spectral", max.em.its = 100 )
heldout.evaluation <- eval.heldout(stm.mod1, heldout$missing)
heldout.evaluation
#evaluation heldout
labelTopics(stm.mod1)
plot.STM(stm.mod1, type="labels", n=5, frexweight = 0.25)
cloud(stm.mod1, topic=5)
plot.STM(stm.mod1, type="summary", labeltype="frex", topics=c(1:5), n=8)
I'm not sure how to interpret the output of "eval.heldout". Additionally, I want to make sure that the model doesn't overfit, but I'm not sure how to check that.
eval.heldout() calculates the held-out log-likelihood using document completion. The number you want is heldout.evaluation$expected.heldout, which is the average of the held-out log-likelihood values for each document. Unfortunately, there is no unambiguous measure of whether or not the model is "overfit." The plot.searchK() call you have will give you a plot of the held-out log-likelihood over different values of K, and if that number decreases as K goes up, one explanation is certainly overfitting.
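For example, to compare the held-out likelihood across the values of K you searched (a sketch; searchK stores one row per K in its results data frame):
# higher held-out log-likelihood is better
BestM1.$results[, c("K", "heldout")]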
Sorry not to have a clearer answer, but unfortunately there are no hard and fast rules here.
I want to find the lethal dose (LD50) with its confidence interval in R. Other software such as Minitab, SPSS, and SAS provides three different versions of such confidence intervals. I could not find these intervals in any R package (I also used the findFn function from the sos package).
How can I find such intervals? I coded one type of interval based on the delta method (though I am not sure of its correctness), but I would like to use an established function from an R package. Thanks.
MWE:
dose <- c(10.2, 7.7, 5.1, 3.8, 2.6, 0)
total <- c(50, 49, 46, 48, 50, 49)
affected <- c(44, 42, 24, 16, 6, 0)
finney71 <- data.frame(dose, total, affected)
fm1 <- glm(cbind(affected, total-affected) ~ log(dose),
family=binomial(link = logit), data=finney71[finney71$dose != 0, ])
summary(fm1)$coef
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.886912 0.6429272 -7.601035 2.937717e-14
log(dose) 3.103545 0.3877178 8.004650 1.198070e-15
library(MASS)
xp <- dose.p(fm1, p=c(0.50, 0.90, 0.95)) # from MASS
xp.ci <- xp + attr(xp, "SE") %*% matrix(qnorm(1 - 0.05/2)*c(-1,1), nrow=1)
zp.est <- exp(cbind(xp, attr(xp, "SE"), xp.ci[,1], xp.ci[,2]))
dimnames(zp.est)[[2]] <- c("LD", "SE", "LCL","UCL")
zp.est
LD SE LCL UCL
p = 0.50: 4.828918 1.053044 4.363708 5.343724
p = 0.90: 9.802082 1.104050 8.073495 11.900771
p = 0.95: 12.470382 1.133880 9.748334 15.952512
From the package drc, you can get the ED50 (same calculation), along with confidence intervals.
library(drc) # Directly borrowed from the drc manual
mod <- drm(affected/total ~ dose, weights = total,
data = finney71[finney71$dose != 0, ], fct = LL2.2(), type = "binomial")
#intervals on log scale
ED(mod, c(50, 90, 95), interval = "fls", reference = "control")
Estimated effective doses
(Back-transformed from log scale-based confidence interval(s))
Estimate Lower Upper
1:50 4.8289 4.3637 5.3437
1:90 9.8021 8.0735 11.9008
1:95 12.4704 9.7483 15.9525
Which matches the manual output.
The "finney71" data is included in this package, and your calculation of confidence intervals exactly matches the example given by the drc folks, down to the "# from MASS" comment. You should give credit to them, rather than claiming you wrote the code.
There are a few other ways to figure this out. One is a nonparametric (case-resampling) bootstrap, which is conveniently available through the boot package.
First, we'll refit the model.
library(boot)
finney71 <- finney71[finney71$dose != 0,] # pre-clean data
fm1 <- glm(cbind(affected, total-affected) ~ log(dose),
family=binomial(link = logit),
data=finney71)
And for illustration, we can figure out the LD50 and LD75.
statfun <- function(dat, ind) {
mod <- update(fm1, data = dat[ind,])
coefs <- coef(mod)
c(exp(-coefs[1]/coefs[2]),
exp((log(0.75/0.25) - coefs[1])/coefs[2]))
}
boot_out <- boot(data = finney71, statistic = statfun, R = 1000)
The boot.ci function can work out a variety of confidence intervals for us, using this object.
boot.ci(boot_out, index = 1, type = c('basic', 'perc', 'norm'))
##BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
##Based on 999 bootstrap replicates
##
##CALL :
##boot.ci(boot.out = boot_out, type = c("basic", "perc", "norm"),
## index = 1)
##Intervals :
##Level Normal Basic Percentile
##95% ( 3.976, 5.764 ) ( 4.593, 5.051 ) ( 4.607, 5.065 )
The confidence intervals using the normal approximation are thrown off quite a bit by a few extreme values, which the basic and percentile-based intervals are more robust to.
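You can see those extreme replicates directly by inspecting the replicates stored in the boot object:
# LD50 replicates are in the first column of boot_out$t
summary(boot_out$t[, 1])
hist(boot_out$t[, 1], breaks = 50, main = "Bootstrap LD50 replicates")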
One interesting thing to note: if the sign of the slope is sufficiently unclear, we can get some rather extreme values (simulated as in this answer, and discussed more thoroughly in this blog post by Andrew Gelman).
set.seed(1)
x <- rnorm(100)
z = 0.05 + 0.1*x + rnorm(100, 0, 0.05) # small slope and more noise
pr = 1/(1+exp(-z))
y = rbinom(100, 1, pr)
sim_dat <- data.frame(x, y)
sim_mod <- glm(y ~ x, data = sim_dat, family = 'binomial')
statfun <- function(dat, ind) {
mod <- update(sim_mod, data = dat[ind,])
-coef(mod)[1]/coef(mod)[2]
}
sim_boot <- boot(data = sim_dat, statistic = statfun, R = 1000)
hist(sim_boot$t[,1], breaks = 100,
main = "Bootstrap of simulated model")
The delta method above gives us mean = 6.448, lower CI = -36.22, and upper CI = 49.12, and all of the bootstrap CIs give us similarly extreme estimates.
##Level Normal Basic Percentile
##95% (-232.19, 247.76 ) ( -20.17, 45.13 ) ( -32.23, 33.06 )
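(For reference, the delta-method interval quoted above can be reproduced with MASS::dose.p on the simulated model, in the same way as earlier; a sketch:)
library(MASS)
xp <- dose.p(sim_mod, p = 0.5)  # x value where the predicted probability is 0.5
xp + attr(xp, "SE") %*% matrix(qnorm(0.975) * c(-1, 1), nrow = 1)  # 95% CI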