r: for loop to simulate predictions when random sampling is applied

r: for loop to simulate predictions when random sampling is applied - r

I am trying to simulate how replacement/reassignment of values on random samples affect predictions conveyed by AUC.
I have a tumor classification in a dataframe denoted df$who which has levels 1, 2, 3 corresponding to the severity of the tumor lesion.
Intro to the question
Lets say the baseline data looks like this:
set.seed(1)
df <- data.frame(
who = as.factor(sample(1:3, size = 6000, replace = TRUE, prob = c(0.8, 0.15, 0.05))),
age = round(runif(n = 6000, min = 18, max = 95), digits = 1),
gender = sample(c("m", "f"), size = 6000, replace = TRUE, prob = c(1/3, 2/3)),
event.time = runif(n = 6000, min = 8, max = 120),
event = as.factor(sample(0:2, size = 6000, replace = TRUE, prob = c(0.25, 0.2, 0.55)))
)
And a standard cause-specific Cox regression looks like:
library(survival)
a_baseline <- coxph(Surv(event.time, event == 1) ~ who + age + gender, data = df, x = TRUE)
From which AUC can be obtained as a measure of predictive performance. Here, leave-one-out bootstrap on 5-year prediction on df$event == 1.
library(riskRegression)
u <- Score(list("baseline" = a_baseline),
Surv(event.time, event == 1) ~ 1,
data = df,
times = 60,
plots = "cal",
B = 50,
split.method = "loob",
metrics = c("auc", "brier")
)
# The AUC is then obtained
u$AUC$score$AUC[2]
Question
I want to simulate how re-classifying a random 5% of df$who == 1 to dfwho == 2 affect the 5-year prediction on df$event == 1
I want to create 10 separate and simulated subsets of the baseline data df, but each containing a random allocation of 5% df$who == 1 to .. == 2. Then, I want to apply each of these 10 separate and simulated subsets to predict the 5-year risk of df$event == 1.
I have applied a for loop to this. The expected output is dataframe that tells me which of the 10 simulated datasets yielded the highest and lowest u$AUC$score$AUC[2] (i.e., the best and worst prediction).
I am new to for loop, but here is my go (that obviously did not work).
all_auc <- data.frame() ## create a dataframe to fill in AUC from all 10 simulated sub-datasets
for(i in 1:10){ #1:10 represent the simulated datasets from 1 to 10
df[i] <- df #allocating baseline data to each of the 10 datasets
df[i]$who[sample(which(df[i]$who==1), round(0.05*length(which(df[i]$who==1))))]=2 #create the random 5% allocation of who==1 to who==2 in the i'th simulated dataset
ith_cox <- coxph(Surv(event.time, event == 1) ~ who + age + gender, data = df[i], x = TRUE) #create the i'th Cox regression based on the i´th dataset
# create the predictions based on the i´th Cox
u[i] <- Score(list("baseline" = ith_cox),
Surv(event.time, event == 1) ~ 1,
data = df[i],
times = 60,
plots = "cal",
B = 50,
split.method = "loob",
metrics = c("auc", "brier")
)
# summarize all AUC from all 10 sub-datasets
all_auc <- u[i]$AUC$score$AUC[2]
}
(1) I could not get this for loop to work as described, and
(2) the final dataframe all_auc should provide only which of the 10 datasets yielded the worst and best predictions (I will then use these two data sets for further analysis).
A final note
This is only a reproducible example. The for loop will be applied to 10.000 simulated datasets in our analysis. I do not know if this could affect the answer - but, it illustrates the importance of the result: a dataframe (or vector?) that simply tells me which simulated dataset yielded the best vs worst predictions, and that I subsequently will be able to use these two dataframes for furter analysis, eg df2930 and df8939.

Related

Is there way a way to cbind() unimputed data with an imputationList from jomo imputation, similar to mice?

I am trying to impute a large dataset of covariates, and then cbind it with outcome data for different cohorts/ages of children (so I don't want to impute those outcomes across the whole large set of covariates, because different ages of children got different measures). Here is my syntax
#jomo imputation
set.seed(90291)
TchParImput$icept <- 1
l1miss <- c("TCR_CLD_Q3A", "PAR_QA2D", "PAR_QH1", "TCR_CLD_Q2A", "TCR_CLD_Q2B", "PAR_QI6_YEARS_DERIVED", "PAR_SQ8_DERIVED", "PAR_QB6",
"PAR_QB16_RELCHILD_DERIVED", "PAR_QB16_RELADULT_DERIVED","PAR_QC2B2"
)
l2miss <- c("T_QA17D", "T_QB1B2", "TA_QA17D", "TA_QB1B2")
l1complete <- c("icept")
l2complete <- c("icept")
impdata <- jomo(TchParImput[l1miss], Y2 = TchParImput[l2miss], X = TchParImput[l1complete],
X2 = TchParImput[l2complete], clus = TchParImput$CLASSROOM_ID,
nburn = 2000, nbetween = 2000, nimp = 20, meth = "random")
imp.list <- imputationList(split(imp, imp$Imputation)[-1])
With mice, I've used:
Preschool_data_imp <- filter(imputedmidsobject, CHILDAGE_MONTHS >= 36)
data_imp <- cbind(Preschool_data_imp, data.frame(outcomes))
to then add in the unimputed, complete outcome data, but the imp.list format of jomo is different than mids, and so I am not sure how to proceed. Feedback appreciated

Kfold CV in brms

I am trying to use kfold CV as a means of evaluating a model run using brms and I feel like I'm missing something. As a reproducible example, my data are structured as a binary response (0, 1) dependent on the length of an individual. Here is some code to generate and plot data similar to those I am working with:
library(brms)
library(tidyverse)
library(loo)
length <- seq(0, 100, by = 1)
n_fish_per_length <- 10
a0 <- -48
a1 <- 2
a2 <- -0.02
prob <- plogis(a0 + a1 * length + a2 * length^2)
plot(length, prob , type = 'l')
sim_data <-
expand_grid(fish_id = seq_len(n_fish_per_length),
length = length) %>%
mutate(prob_use = plogis(a0 + a1 * length + a2 * length^2)) %>%
mutate(is_carp = rbinom(n = n(), size = 1, prob= prob_use))
ggplot(sim_data, aes(x = length, y = is_carp)) +
geom_jitter(width = 0, height = 0.05) +
geom_smooth(method = "glm", formula = y ~ x + I(x^2),
method.args = list(family = binomial(link = "logit")))
I then use brms to run my model.
Bayes_Model_Binary <- brm(formula = is_carp ~ length + I(length^2),
data=sim_data,
family = bernoulli(link = "logit"),
warmup = 2500,
iter = 5000,
chains = 4,
inits= "0",
cores=4,
seed = 123)
summary(Bayes_Model_Binary)
I'd like to use kfold CV to evaluate the model. I can use something like this:
kfold(Bayes_Model_Binary, K = 10, chains = 1, save_fits = T)
but the response in my data is highly imbalanced (~18% = 1, ~82% = 0) and my reading suggests that I need to used stratified kfold cv to account for this. If I use:
sim_data$fold <- kfold_split_stratified(K = 10, x = sim_data$is_carp)
the data are split the way I would expect but I'm not sure what the best way is to move forward with the CV process from here. I saw this post https://mc-stan.org/loo/articles/loo2-elpd.html, but I'm not sure how to modify this to work with a brmsfit object. Alternatively, it appears that I should be able to use:
kfold(Bayes_Model_Binary, K = 10, folds = 'stratified', group = sim_data$is_carp)
but this throws an error. Likely because is_carp is the response rather than a predictor in the model. What would my group be in this context? Am I missing/misinterpreting something here? I'm assuming that there is a very simple solution here that I am overlooking but appreciate any thoughts.

After some additional digging and learning how to access information about each fold in the analysis, I was able to determine that the structure of the data (proportion of 0s and 1s in the response) is maintained using the default settings in the kfold() function. To do this I used the following code.
First, save the kfold CV analysis as an object.
kfold1 <- kfold(Bayes_Model_Binary, K = 10, save_fits = T)
kfold1$fits is a list of the model fitting results and the observations used in the test data set (omitted) for each fold.
From this information, I created a loop to print the proportion of observations in each training data set where is_carp = 1 (could also do this for each test data set) with the following code.
for(i in 1:10){
print(length(which(sim_data$is_carp[-kfold1$fits[i, ]$omitted] == 1)) /
nrow(sim_data[-kfold1$fits[i, ]$omitted, ]))
}
[1] 0.1859186
[1] 0.1925193
[1] 0.1991199
[1] 0.1914191
[1] 0.1881188
[1] 0.1848185
[1] 0.1936194
[1] 0.1980198
[1] 0.190319
[1] 0.1870187
and it's easy to then compare these proportions with the proportion of observations where is_carp = 1 from the original data set.
length(which(sim_data$is_carp == 1)) / nrow(sim_data)
[1] 0.1910891

Power simulation in R for a linear mixed model

I need to do a linear mixed model simulation to get power for varying sample sizes.
My model is:
Ratings = y
Fixed effect, x = Ring
Random effect = participants
The code I tried is below. It only returns 'Based on 100 simulations, (0 warnings, 100 errors)
alpha = 0.05, nrow = 2000' ....
Thank you!!
#create a dataframe
library(lmerTest)
library(simr)
library(tidyverse)
Ring = c('Ring', 'NoRing')
#from 1 to 10 (11 is not included).
Ring = rep(Ring, times = 1000)
attractiveness = floor(runif(10, min=1, max=11)) #this creates random numbers
#from 1 to 10 (11 is not included).
participants<-rep(factor(1:100),each=20)
targetID = rep(c(1,2,3,4,5,6,7,8,9,10), each= 2)
targetImage= rep(targetID, times= 100)
Ratings = rep(attractiveness, times = 200)
data<-data.frame(participants, Ring, targetImage, Ratings)
#parameters for the model:
## Intercept and slopes for ring
fixed <- c(3, 0.5)
## Random intercepts for participants
rand <- 0.5
## residual variance
res <- 2
model <- makeLmer(Ratings ~ Ring + (1|participants), fixef=fixed,
VarCorr=rand, sigma=res, data=data)
sim_treat <- powerSim(model, nsim=100, test = fcompare(Ratings~Ring))
sim_treat

How to generate a latent variable from a set of different kinds of variables with R?

For a n number of observations, I want to generate a latent variable (unobserved), I can assume or not that this variable has a specific distribution or not, from a set of other variables that proxy this latent variable. For my specific case, I want to generate latent ability, from a set of variables that proxy ability (observed ability). One variable is discrete, and exhibit normality, another is binary but very skewed, and the last one is an ordered categorical variable. This looks like my data, and I would like to estimate a response for each observation.
set.seed(123877)
# number of units
n <- 1000L
# age
age <- sample(rnorm(n, 25, 10))
# cum laude
hon <- sample(0L:1L, n, TRUE, prob = c(.9, .1) )
# prestige of university
pres <- factor(sample(1L:25L, n, TRUE), labels = 25L:1L, ordered = T)
dat <- data.frame(id=1L:n, age, hon, pres)

I found a solution, using the ltm package, here is the code:
set.seed(123877)
u.latent <- vector()
class(u.latent) <- 'try-error'
library('ltm')
while (class(u.latent)=='try-error') {
# numer of units
n <- 1000L
# age
age <- round(rnorm(n, 25, 10))
# cum laude
hon <- sample(0L:1L, n, TRUE, prob = c(.9, .1) )
# prestige of university
pres <- sample(1L:10L, n, TRUE)
# pres <-factor(pres, levels = 1L:25L, ordered = TRUE)
dat <- data.frame(age, hon, pres)
# latent variable
u.latent <- try(gpcm(dat))
}
We can test if the model fits the data:
GoF.gpcm(u.latent)
#H0 the model fits the data
#Ha: the model does not fit the data
The estimates of the latent variables are straight forward:
u.estimates <-factor.scores(u.latent)
hist(u.estimates$score.dat$z1)

Learning hidden markov model in R

A hidden Markov model (HMM) is one in which you observe a sequence of observations, but do not know the sequence of states the model went through to generate the observations. Analyses of hidden Markov models seek to recover the sequence of hidden states from the observed data.
I have data with both observations and hidden states (observations are of continuous values) where the hidden states were tagged by an expert. I would like to train a HMM that would be able - based on a (previously unseen) sequence of observations - to recover the corresponding hidden states.
Is there any R package to do that? Studying the existing packages (depmixS4, HMM, seqHMM - for categorical data only) allows you to specify a number of hidden states only.
EDIT:
Example:
data.tagged.by.expert = data.frame(
hidden.state = c("Wake", "REM", "REM", "NonREM1", "NonREM2", "REM", "REM", "Wake"),
sensor1 = c(1,1.2,1.2,1.3,4,2,1.78,0.65),
sensor2 = c(7.2,5.3,5.1,1.2,2.3,7.5,7.8,2.1),
sensor3 = c(0.01,0.02,0.08,0.8,0.03,0.01,0.15,0.45)
)
data.newly.measured = data.frame(
sensor1 = c(2,3,4,5,2,1,2,4,5,8,4,6,1,2,5,3,2,1,4),
sensor2 = c(2.1,2.3,2.2,4.2,4.2,2.2,2.2,5.3,2.4,1.0,2.5,2.4,1.2,8.4,5.2,5.5,5.2,4.3,7.8),
sensor3 = c(0.23,0.25,0.23,0.54,0.36,0.85,0.01,0.52,0.09,0.12,0.85,0.45,0.26,0.08,0.01,0.55,0.67,0.82,0.35)
)
I would like to create a HMM with discrete time t whrere random variable x(t) represents the hidden state at time t, x(t) {"Wake", "REM", "NonREM1", "NonREM2"}, and 3 continuous random variables sensor1(t), sensor2(t), sensor3(t) representing the observations at time t.
model.hmm = learn.model(data.tagged.by.user)
Then I would like to use the created model to estimate hidden states responsible for newly measured observations
hidden.states = estimate.hidden.states(model.hmm, data.newly.measured)

Data (training/testing)
To be able to run learning methods for Naive Bayes classifier, we need longer data set
states = c("NonREM1", "NonREM2", "NonREM3", "REM", "Wake")
artificial.hypnogram = rep(c(5,4,1,2,3,4,5), times = c(40,150,200,300,50,90,30))
data.tagged.by.expert = data.frame(
hidden.state = states[artificial.hypnogram],
sensor1 = log(artificial.hypnogram) + runif(n = length(artificial.hypnogram), min = 0.2, max = 0.5),
sensor2 = 10*artificial.hypnogram + sample(c(-8:8), size = length(artificial.hypnogram), replace = T),
sensor3 = sample(1:100, size = length(artificial.hypnogram), replace = T)
)
hidden.hypnogram = rep(c(5,4,1,2,4,5), times = c(10,10,15,10,10,3))
data.newly.measured = data.frame(
sensor1 = log(hidden.hypnogram) + runif(n = length(hidden.hypnogram), min = 0.2, max = 0.5),
sensor2 = 10*hidden.hypnogram + sample(c(-8:8), size = length(hidden.hypnogram), replace = T),
sensor3 = sample(1:100, size = length(hidden.hypnogram), replace = T)
)
Solution
In the solution, we used Viterbi algorithm - combined with Naive Bayes classifier.
At each clock time t, a Hidden Markov Model consist of
an unobserved state (denoted as hidden.state in this case) taking a finite number of states
states = c("NonREM1", "NonREM2", "NonREM3", "REM", "Wake")
a set of observed variables (sensor1, sensor2, sensor3 in this case)
Transition matrix
A new state is entered based upon a transition probability distribution
(transition matrix). This can be easily computed from data.tagged.by.expert e.g. using
library(markovchain)
emit_p <- markovchainFit(data.tagged.by.expert$hidden.state)$estimate
Emission matrix
After each transition is made, an observation (sensor_i) is produced according to a conditional probability distribution (emission matrix) which depends on the current state H of hidden.state only. We will replace emmision matrices by Naive Bayes classifier.
library(caret)
library(klaR)
library(e1071)
model = train(hidden.state ~ .,
data = data.tagged.by.expert,
method = 'nb',
trControl=trainControl(method='cv',number=10)
)
Viterbi algorithm
To solve the problem, we use Viterbi algorithm with the initial probability of 1 for "Wake" state and 0 otherwise. (We expect the patient to be awake in the beginning of the experiment)
# we expect the patient to be awake in the beginning
start_p = c(NonREM1 = 0,NonREM2 = 0,NonREM3 = 0, REM = 0, Wake = 1)
# Naive Bayes model
model_nb = model$finalModel
# the observations
observations = data.newly.measured
nObs <- nrow(observations) # number of observations
nStates <- length(states) # number of states
# T1, T2 initialization
T1 <- matrix(0, nrow = nStates, ncol = nObs) #define two 2-dimensional tables
row.names(T1) <- states
T2 <- T1
Byj <- predict(model_nb, newdata = observations[1,])$posterior
# init first column of T1
for(s in states)
T1[s,1] = start_p[s] * Byj[1,s]
# fill T1 and T2 tables
for(j in 2:nObs) {
Byj <- predict(model_nb, newdata = observations[j,])$posterior
for(s in states) {
res <- (T1[,j-1] * emit_p[,s]) * Byj[1,s]
T2[s,j] <- states[which.max(res)]
T1[s,j] <- max(res)
}
}
# backtract best path
result <- rep("", times = nObs)
result[nObs] <- names(which.max(T1[,nObs]))
for (j in nObs:2) {
result[j-1] <- T2[result[j], j]
}
# show the result
result
# show the original artificial data
states[hidden.hypnogram]
References
To read more about the problem, see Vomlel Jiří, Kratochvíl Václav : Dynamic Bayesian Networks for the Classification of Sleep Stages , Proceedings of the 11th Workshop on Uncertainty Processing (WUPES’18), p. 205-215 , Eds: Kratochvíl Václav, Vejnarová Jiřina, Workshop on Uncertainty Processing (WUPES’18), (Třeboň, CZ, 2018/06/06) [2018] Download

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

r: for loop to simulate predictions when random sampling is applied - r

Related

Is there way a way to cbind() unimputed data with an imputationList from jomo imputation, similar to mice?

Kfold CV in brms

Power simulation in R for a linear mixed model

How to generate a latent variable from a set of different kinds of variables with R?

Learning hidden markov model in R

Categories

Resources