I am trying to simulate how replacement/reassignment of values in random samples affects the predictions conveyed by AUC.
I have a tumor classification variable in a data frame, df$who, with levels 1, 2, 3 corresponding to the severity of the tumor lesion.
Intro to the question
Let's say the baseline data look like this:
set.seed(1)
df <- data.frame(
  who = as.factor(sample(1:3, size = 6000, replace = TRUE, prob = c(0.8, 0.15, 0.05))),
  age = round(runif(n = 6000, min = 18, max = 95), digits = 1),
  gender = sample(c("m", "f"), size = 6000, replace = TRUE, prob = c(1/3, 2/3)),
  event.time = runif(n = 6000, min = 8, max = 120),
  event = as.factor(sample(0:2, size = 6000, replace = TRUE, prob = c(0.25, 0.2, 0.55)))
)
And a standard cause-specific Cox regression looks like:
library(survival)
a_baseline <- coxph(Surv(event.time, event == 1) ~ who + age + gender, data = df, x = TRUE)
From this, the AUC can be obtained as a measure of predictive performance; here, via leave-one-out bootstrap for the 5-year prediction of df$event == 1.
library(riskRegression)
u <- Score(list("baseline" = a_baseline),
           Surv(event.time, event == 1) ~ 1,
           data = df,
           times = 60,
           plots = "cal",
           B = 50,
           split.method = "loob",
           metrics = c("auc", "brier"))
# The AUC is then obtained
u$AUC$score$AUC[2]
Question
I want to simulate how re-classifying a random 5% of df$who == 1 to df$who == 2 affects the 5-year prediction of df$event == 1.
I want to create 10 separate simulated copies of the baseline data df, each with a random 5% of df$who == 1 reassigned to df$who == 2. Then I want to use each of these 10 simulated data sets to predict the 5-year risk of df$event == 1.
I have tried a for loop for this. The expected output is a data frame that tells me which of the 10 simulated data sets yielded the highest and the lowest u$AUC$score$AUC[2] (i.e., the best and the worst prediction).
I am new to for loops, but here is my attempt (which obviously did not work):
all_auc <- data.frame() ## a data frame to collect the AUC from all 10 simulated data sets
for(i in 1:10){ # 1:10 represents the simulated data sets 1 to 10
  df[i] <- df # allocate the baseline data to the i'th data set
  df[i]$who[sample(which(df[i]$who == 1), round(0.05*length(which(df[i]$who == 1))))] = 2 # randomly reassign 5% of who == 1 to who == 2 in the i'th simulated data set
  ith_cox <- coxph(Surv(event.time, event == 1) ~ who + age + gender, data = df[i], x = TRUE) # the i'th Cox regression, based on the i'th data set
  # predictions based on the i'th Cox model
  u[i] <- Score(list("baseline" = ith_cox),
                Surv(event.time, event == 1) ~ 1,
                data = df[i],
                times = 60,
                plots = "cal",
                B = 50,
                split.method = "loob",
                metrics = c("auc", "brier"))
  # collect the AUC from all 10 sub-data sets
  all_auc <- u[i]$AUC$score$AUC[2]
}
(1) I could not get this for loop to work as described, and
(2) the final data frame all_auc should indicate only which of the 10 data sets yielded the worst and the best predictions (I will then use these two data sets for further analysis). A sketch of the structure I am aiming for is shown below.
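Untested, but here is that sketch (the object names sim_dfs and auc_vec are just placeholders; the Score() call reuses the settings from above):
library(survival)
library(riskRegression)

n_sim   <- 10
sim_dfs <- vector("list", n_sim) # keep each simulated data set
auc_vec <- numeric(n_sim)        # one AUC per simulated data set

for (i in seq_len(n_sim)) {
  df_i <- df                     # fresh copy of the baseline data
  idx1 <- which(df_i$who == 1)
  df_i$who[sample(idx1, round(0.05 * length(idx1)))] <- 2 # reassign a random 5% of who == 1 to who == 2
  fit_i <- coxph(Surv(event.time, event == 1) ~ who + age + gender,
                 data = df_i, x = TRUE)
  u_i <- Score(list("sim" = fit_i),
               Surv(event.time, event == 1) ~ 1,
               data = df_i, times = 60, B = 50,
               split.method = "loob", metrics = c("auc", "brier"))
  sim_dfs[[i]] <- df_i
  auc_vec[i]   <- u_i$AUC$score$AUC[2]
}

which.max(auc_vec) # index of the best-performing simulated data set
which.min(auc_vec) # index of the worst-performing simulated data set
best_df  <- sim_dfs[[which.max(auc_vec)]] # for further analysis
worst_df <- sim_dfs[[which.min(auc_vec)]]
With 10,000 simulated data sets, storing full copies in a list may be memory-prohibitive; storing only the sampled row indices (or the RNG seed) for each iteration and regenerating the two chosen data sets afterwards would be a lighter alternative.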
A final note
This is only a reproducible example. The for loop will be applied to 10,000 simulated data sets in our analysis. I do not know whether this affects the answer, but it illustrates the importance of the result: a data frame (or vector?) that simply tells me which simulated data sets yielded the best vs. the worst predictions, so that I can subsequently use those two data frames, e.g., df2930 and df8939, for further analysis.
I'm trying to integrate multiple types of abundance data in one model using JAGS in R. I'm struggling to fit a binomial N-mixture model to count data as well as abundance-class data via interval censoring.
Data set 1 consists of abundance-class data from a single year (year 1, nested within years 1-5 of data set 2). Data set 2 comprises full-count abundances from several sites, survey rounds, and multiple years (years 1-5).
A simplified data set can be simulated roughly following Ch. 10.3-10.5 in Marc Kéry & J. Andy Royle (2020), Applied Hierarchical Modeling in Ecology: Modeling Distribution, Abundance and Species Richness Using R and BUGS. Volume 2: Dynamic and Advanced Models:
library(AHMbook)
library(jagsUI)
library(berryFunctions)
## Simulate two data sets
# Constants for simulation function
M1 <- 500 # Number of sites with abundance-class surveys
J1 <- 1 # Number of surveys in abundance-class survey
M2 <- 100 # Number of sites with full-count surveys
J2 <- 4 # Number of surveys in full-count surveys
mean.lam1 <- 6.5 # Abundance intercept
mean.lam2 <- 6 # Abundance intercept
mean.lam3 <- 5.5 # Abundance intercept
mean.lam4 <- 5 # Abundance intercept
mean.lam5 <- 4.5 # Abundance intercept
beta.lam <- 0.5 # Coefficients of a site covariate in abundance
mean.p1tot <- 1-(1-0.4)^4 # Detection intercept in abundance-class survey: Total detection probability after 4 surveys with each p=0.4
mean.p2 <- 0.4 # Detection intercept in full-count survey
# Simulate data set 1 for abundance-class surveys (surveyed in year 1)
set.seed(1)
str(data1 <- simNmix(nsite = M1, nvisit = J1, mean.lam = mean.lam1,
                     mean.p = mean.p1tot, beta2.lam = beta.lam,
                     show.plot = FALSE))
# Simulate data sets 2a-e for full-count surveys: 5 data sets for 5 years (years 1-5) with varying mean.lam (decreasing mean abundance over time)
str(data2a <- simNmix(nsite = M2, nvisit = J2, mean.lam = mean.lam1,
                      mean.p = mean.p2, beta2.lam = beta.lam,
                      show.plot = FALSE))
str(data2b <- simNmix(nsite = M2, nvisit = J2, mean.lam = mean.lam2,
                      mean.p = mean.p2, beta2.lam = beta.lam,
                      show.plot = FALSE))
str(data2c <- simNmix(nsite = M2, nvisit = J2, mean.lam = mean.lam3,
                      mean.p = mean.p2, beta2.lam = beta.lam,
                      show.plot = FALSE))
str(data2d <- simNmix(nsite = M2, nvisit = J2, mean.lam = mean.lam4,
                      mean.p = mean.p2, beta2.lam = beta.lam,
                      show.plot = FALSE))
str(data2e <- simNmix(nsite = M2, nvisit = J2, mean.lam = mean.lam5,
                      mean.p = mean.p2, beta2.lam = beta.lam,
                      show.plot = FALSE))
# Get lower and upper class boundary for every binned count
breaks <- c(0, 1, 2, 3, 7, 20, 50, 150, 400) # Up to and including
Cclass <- classify(c(data1$C), method = 'custom', breaks = breaks)$index
Aclass <- matrix(Cclass, byrow = FALSE, ncol = data1$nvisit)
n <- length(Cclass)
limits <- array(NA, c(n, 2), list(1:n, c('Lower', 'Upper')))
for(i in 1:n){
  limits[i, 1:2] <- c(breaks[Cclass[i]], breaks[Cclass[i]+1])
}
# Add NAs for unsurveyed years (years 2-5) to the data of the surveyed year (year 1)
limitsNA <- array(NA, c(500, 5, 2), list(1:500, 1:5, c('Lower', 'Upper')))
for(i in 1:500){
  limitsNA[i, 1, 1:2] <- c(limits[i,1], limits[i,2])
}
# Vectorize the environmental covariate
X22 <- rowMeans(array(c(data2a$site.cov[,2], data2b$site.cov[,2], data2c$site.cov[,2],
                        data2d$site.cov[,2], data2e$site.cov[,2]), dim = c(100, 5)))
# Using the mean, as my original data set includes a site covariate that is constant across the survey years
# Response for interval censoring: simply a vector of ones!
y <- rep(1, 500)
# Bundle data
str(bdata <- list(y = cbind(y, y, y, y, y), M1 = nrow(Aclass), X21 = data1$site.cov[,2],
                  limits = limitsNA,
                  C2 = array(c(data2a$C, data2b$C, data2c$C, data2d$C, data2e$C), dim = c(100, 4, 5)),
                  M2 = nrow(data2a$C), J2 = ncol(data2a$C), T2 = 5,
                  X22 = X22, year2 = as.vector(scale(c(2005:2009)))))
My goal is to use the full-count data (data set 2) to estimate a detection probability and the covariates shared in the likelihood part of the model, and to refine those shared covariates with the larger but coarser abundance-class data (data set 1). I managed to do this.
But I would also like to use the trend information (included via year covariates) provided by the full-count data to extrapolate the abundances of the abundance-class sites. In the end, I would like to obtain N estimates for the data set 1 sites for the missing years 2-5, which are covered only by data set 2.
cat(file = "model2.txt", "
model {
  # Priors for the parameters shared by both data sets
  alpha.lam ~ dnorm(0, 0.01) # Abundance intercept
  beta.lam ~ dnorm(0, 0.1)   # Abundance slope on site cov X2
  beta.lam1 ~ dnorm(0, 0.1)  # Abundance slope on year
  for (i in 1:M2){
    for (j in 1:J2){
      p2[i,j] ~ dunif(0,1)
  }}
  # Likelihood
  # Model for latent abundance in both data sets
  # Note: identical parameter names mean the parameters are shared
  for (i in 1:M1){ # Data set 1 (abundance classes)
    for (t in 1:T2){
      N1[i,t] ~ dpois(lambda1[i,t])
      log(lambda1[i,t]) <- alpha.lam + beta.lam * X21[i] + beta.lam1 * year2[t] # more complex covariates included in the original model...
  }}
  for (i in 1:M2){ # Data set 2 (full counts)
    for (t in 1:T2){
      N2[i,t] ~ dpois(lambda2[i,t])
      log(lambda2[i,t]) <- alpha.lam + beta.lam * X22[i] + beta.lam1 * year2[t]
  }}
  # Observation model for observed counts and for detection
  # Observation model for data set 2 (full counts)
  for (i in 1:M2){
    for (j in 1:J2){
      for (t in 1:T2){
        C2[i,j,t] ~ dbin(p2[i,j], N2[i,t])
  }}}
  for (i in 1:M2){
    ptot[i] <- 1-(1-p2[i,1])*(1-p2[i,2])*(1-p2[i,3])*(1-p2[i,4]) # Total detection probability after 4 survey dates per site
  }
  ptotmean <- mean(ptot) # Mean total detection probability after four surveys
  # Observation model for data set 1 (binned counts)
  for (i in 1:M1){
    for (t in 1:T2){
      y[i,t] ~ dinterval(C1[i,t], limits[i,t,]) # interval censoring
      C1[i,t] ~ dbin(ptotmean, N1[i,t]) # using the average detection probability from observation model 2
  }}
  # Derived quantities
  for (t in 1:T2){
    Ntotal1[t] <- sum(N1[,t]) # Total abundance in data set 1
    Ntotal2[t] <- sum(N2[,t]) # Total abundance in data set 2
    Nmean1[t] <- mean(N1[,t]) # Mean abundance in data set 1
    Nmean2[t] <- mean(N2[,t]) # Mean abundance in data set 2
  }
  for (j in 1:J2){
    pmean[j] <- mean(p2[,j])
  }
}
")
# Initial values
Nst1 <- limitsNA[,,2]+5
Nst1[is.na(Nst1)] <- round(mean(Nst1, na.rm=T))+5
Cst <- limitsNA[,,1]+1
Cst[is.na(Cst)] <- round(mean(Cst, na.rm=T))+1
Nst2 <- apply(bdata$C2, c(1,3), max)+5
inits <- function() list(N1 = Nst1, C1 = Cst, N2 = Nst2)
# Parameters monitored
params <- c("alpha.lam", "beta.lam", "beta.lam1", "Ntotal1", "Ntotal2",
            "N1", "C1", "N2", "Nmean1", "Nmean2", "pmean", "ptot", "ptotmean")
# MCMC settings
# na <- 1000 ; nc <- 3 ; ni <- 50000 ; nb <- 10000 ; nt <- 40
na <- 1000 ; nc <- 3 ; ni <- 5000 ; nb <- 1000 ; nt <- 4 # ~~~~ for testing, 2 mins
# Call JAGS (ART 22 min), gauge convergence and summarize posteriors
out2 <- jags(bdata, inits, params, "model2.txt", n.adapt = na, n.chains = nc,
             n.thin = nt, n.iter = ni, n.burnin = nb, parallel = TRUE)
R returns the following error:
Error in checkForRemoteErrors(val) :
3 nodes produced errors; first error: RUNTIME ERROR:
Unable to resolve the following parameters:
limits[1,2,1:2] (line 40)
limits[1,3,1:2] (line 40)
limits[1,4,1:2] (line 40)
limits[1,5,1:2] (line 40)
[...]
dinterval seems to be missing its limits for the unsurveyed years. So far, I have not been able to find a solution to my problem, as this application of dinterval is not very common. Any thoughts on how I could extrapolate the abundance-class estimates to the time span of data set 2, while sharing covariates between the model parts? One idea I have been considering is sketched below.
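The idea (untested, and my own assumption rather than anything from the book): apply the interval-censoring likelihood only to the surveyed year, so that no limits are needed for years 2-5, while the shared abundance model still defines N1[i,t] for all years. In model2.txt, the observation block for data set 1 would become:
for (i in 1:M1){
  y[i] ~ dinterval(C1[i,1], limits[i,1:2]) # year 1 only
  C1[i,1] ~ dbin(ptotmean, N1[i,1])
}
Here y would be passed in bdata as the plain length-M1 vector of ones and limits as the original M1 x 2 matrix, so the NA-padded limitsNA array would no longer be needed; N1[i, 2:5] would then be monitored as the extrapolated abundances.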
Another, minor problem I encountered: unlike the simulated data, my real abundance-class data also include absences. I failed to include these absences and have therefore excluded them from my data set for now. I suppose dinterval cannot handle the absence class with the limits (0, 0). I would appreciate any ideas on how to include these sites in my model; one possibility I have been wondering about is sketched after this paragraph.
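A possible workaround (again untested and purely an assumption on my part): treat a recorded absence as an exact count of zero rather than as an interval, supplying C1 = 0 as data for those sites so that the dbin term is their likelihood directly, and applying dinterval only to the sites with a non-empty class via nested indexing:
# idx.class and n.class (hypothetical) index the sites with a recorded abundance class
for (k in 1:n.class){
  y[k] ~ dinterval(C1[idx.class[k],1], limits[idx.class[k],1:2])
}
# For absence sites, C1[i,1] = 0 is supplied as data, so
# C1[i,1] ~ dbin(ptotmean, N1[i,1]) serves as their likelihood without dinterval.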
I'd be happy about any helpful suggestions!
Thanks, Jo
(This post was also posted on sourceforge: https://sourceforge.net/p/mcmc-jags/discussion/610037/thread/224919442a/)
I'm trying to estimate an Okun's law equation with a dynamic linear model using the dlm package in R. I can estimate the non-time-varying model using nls as follows:
const_coef <- nls(formula = dur ~ b1*dur_lag1 + b2*(d2lgdp - b0) + b3*d2lrulc_lag2,
                  start = list(b0 = 0.1, b1 = 0.1, b2 = 0.1, b3 = 0.1),
                  data = mod_data)
The dlm model I want to estimate allows b1 and b0 above to follow random walks. I can do this in Eviews by declaring the measurement equation and appending the states (below is some code provided by the authors of the original paper, which I can replicate):
'==========================
' SPECIFY THE KALMAN FILTER
'==========================
'Priors on state variables
vector(2) mprior
mprior(1) = 4 'Prior on starting value for trend GDP growth (annual average GDP growth over 1950s)
mprior(2) = 0 'Prior on starting value for lagged dependent variable
sym(2) vprior
vprior(1,1) = 5 'Prior on variance of trend GDP growth (variance of annual GDP growth over 1950s)
vprior(2,2) = 1 'Prior on variance of lagged dependent variable
'Specify coefficient vector
coef(8) ckf
'Declare state space
sspace ss1
ss1.append dur = lag*dur(-1) + ckf(2)*(d2lgdp-trend)+ckf(3)*D2LRULC(-2)+[var=exp(ckf(4))] 'Measurement equation
ss1.append #state trend = 1*trend(-1) + [var = exp(ckf(5))] 'State equation for trend GDP growth (random walk)
ss1.append #state lag = 1*lag(-1) + [var = exp(ckf(6))] 'State equation for lagged dependent variable (random walk)
'Apply priors to state space
ss1.append #mprior mprior
ss1.append #vprior vprior
'Set parameter starting values
param ckf(2) -0.0495 ckf(3) 0.01942 ckf(4) -2.8913 ckf(5) -4.1757 ckf(6) -6.2466 'starting values for parameters
'=====================
' ESTIMATE THE MODEL
'=====================
'Estimate state space
smpl %estsd %ested 'Estimation sample
ss1.ml(m=500,showopts) 'Estimate Kalman filter by maximum likelihood
freeze(mytab) ss1.stats
I'm really not sure how to do this with the dlm package. I've tried the following:
buildSS <- function(v){
  dV <- exp(v[1])    # variance of the measurement equation (ckf4)
  dW <- c(exp(v[2]), # variance of the lagged dep (ckf6)
          0,         # variance of the coef on d2lgdp, ckf(2), set to 0
          0,         # variance of the coef on d2lrulc, ckf(3), set to 0
          exp(v[3])  # variance of the random-walk intercept (ckf5)
  )
  beta.vec <- c(1, v[4], v[5], 1) # params ckf(2), ckf(3)
  dlmModReg(mod_data.tvp[,-1], addInt = TRUE, dV = dV, dW = dW, m0 = beta.vec)
}
# Set parameter starting values
ckf4Guess <- -2.8913
ckf2guess <- -0.0495
ckf3guess <- 0.01942
ckf5guess <- -4.1757
ckf6guess <- -6.2466
params <- c(ckf4Guess,
ckf5guess,
ckf6guess,
ckf2guess,
ckf3guess)
tvp_mod.mle <- dlmMLE(mod_data.tvp[,"dur"], parm = params, build = buildSS)
tvp_mod <- buildSS(tvp_mod.mle$par)
tvp_filter <- dlmFilter(mod_data$dur, tvp_mod)
The above code runs, but the outputs are not correct; I am not specifying the states properly. Does anyone have experience building DLMs with multivariate regression in R?
I think I have arrived at a solution: I've managed to recreate the estimates in the paper, which estimates this model using Eviews (I also checked this against Eviews).
#--------------------------------------------------------------------------------------------------------------------------
# tvp model full model - dur = alpha*dur(-1)+ beta(dgdp-potential) + gamma*wages
#--------------------------------------------------------------------------------------------------------------------------
# Construct DLM
OkunsDLMfm <- dlm(
FF = matrix(c(1,1,1,1),ncol = 4, byrow = TRUE),
V = matrix(1),
GG = matrix(c(1,0,0,0,
0,1,0,0,
0,0,1,0,
0,0,0,1), ncol = 4, byrow = TRUE),
W = matrix(c(1,0,0,0,
0,1,0,0,
0,0,1,0,
0,0,0,1), ncol = 4, byrow = TRUE),
JFF = matrix(c(1,2,3,0),ncol = 4, byrow = TRUE),
X = cbind(mod_data$dur_lag1,mod_data$d2lgdp, mod_data$d2lrulc_lag2), # lagged dep var, dgdp, wages.
m0 = c(0,0,0,0),
C0 = matrix(c(1e+07,0,0,0,
0,1e+07,0,0,
0,0,1e+07,0,
0,0,0,1e+07), ncol = 4, byrow = TRUE)
)
buildOkunsFM <- function(p){
  V(OkunsDLMfm) <- exp(p[2])
  GG(OkunsDLMfm)[1,1] <- 1
  GG(OkunsDLMfm)[2,2] <- 1
  GG(OkunsDLMfm)[3,3] <- 1
  GG(OkunsDLMfm)[4,4] <- 1
  W(OkunsDLMfm)[1,1] <- exp(p[3])
  W(OkunsDLMfm)[2,2] <- 0
  W(OkunsDLMfm)[3,3] <- 0
  W(OkunsDLMfm)[4,4] <- exp(p[4])
  m0(OkunsDLMfm) <- c(0, 0, 0, p[1]*4)
  C0(OkunsDLMfm)[1,1] <- 1
  C0(OkunsDLMfm)[4,4] <- 5
  return(OkunsDLMfm)
}
okuns.estfm <- dlmMLE(y = mod_data$dur, parm = c(-0.049,-1.4,-6,-5), build = buildOkunsFM)
OkunsDLM1fm <- buildOkunsFM(okuns.estfm$par)
The time-varying level, the estimate of potential output, is derived by dividing the 4th element of the state vector by the second and multiplying by negative one.
I'm not sure this is the best way to specify the DLM, but the results from the model are very close (within 0.01) to those reported in the paper from Eviews. That said, I'm very open to other specifications.
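For concreteness, extracting that series would look something like this (a sketch; dlmSmooth and dropFirst are from the dlm package, and the object names follow the code above):
library(dlm)
okuns_smooth <- dlmSmooth(mod_data$dur, OkunsDLM1fm) # smoothed states
states <- dropFirst(okuns_smooth$s)                  # drop the time-0 prior state
potential <- -states[, 4] / states[, 2]              # 4th state / 2nd state, times -1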
I am trying to understand how to use linear mixed models to analyse my data by simulating from a model, but I can't recover the input parameters. What am I missing?
I want to start by simulating a model with a random intercept for each subject. Here is the formula of what I want to simulate and then recover:
y_ij = gamma00 + gamma01 * w_j + u0_j + beta1 * x_ij + e_ij, with u0_j ~ N(0, 0.1^2) and e_ij ~ N(0, 0.1^2)
If beta1 is small (< 11) I find gamma00 as the intercept in the fixed-effects section, but I am completely unable to retrieve the slope (beta1). Also, the linear effect is not significant. Where is my conceptual mistake?
library(lmerTest)
# Generating data set
# General values and variables
numObj <- 20
numSub <- 100
e <- rnorm(numObj * numSub, mean = 0, sd = 0.1)
x <- scale(runif(numObj * numSub, min = -100, max = 100))
y <- c()
index <- 1
# Coefficients
gamma00 <- 18
gamma01 <- 0.5
beta1 <- -100
w <- runif(numSub, min = -3, max = 3)
uo <- rnorm(numSub, mean = 0, sd = 0.1)
meanBeta0 <- mean(gamma00 + gamma01*w + uo) # I should be able to retrieve that parameter.
for(j in 1:numSub){
  for(i in 1:numObj){
    y[index] <- gamma00 + gamma01*w[j] + uo[j] + beta1*x[i] + e[index]
    index <- index + 1
  }
}
dataFrame2 <- data.frame(y = y, x = x, subNo = factor(rep(1:numSub, each = numObj)), objNum = factor(rep(1:numObj, numSub)))
model2 <- lmer(y ~ x + (1 | subNo), data = dataFrame2)
summary(model2)
anova(model2)
No conceptual mistake here, just a mixed-up index: you should be using index rather than i to index x in your data-generation loop, as shown below.
Basically, due to the mix-up you were using the first subject's x values to generate the data for all subjects, while using the individual x values in the model.
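Concretely, only the marked line changes (remember to reset index and y before rerunning):
index <- 1
y <- c()
for(j in 1:numSub){
  for(i in 1:numObj){
    y[index] <- gamma00 + gamma01*w[j] + uo[j] + beta1*x[index] + e[index] # x[index], not x[i]
    index <- index + 1
  }
}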
I have written a small function that simulates data from a normal distribution, as is usual in linear models. My question is how to get a model in which the p-value of sim[, 1] is 0.05. I want to show that if I add a random variable, even one normally distributed around zero with small variance, N(0, 0.0023), the p-value of sim[, 1] changes. The code below shows the true model.
set.seed(37) # seed for reproducibility
simulation <- function(b_0, b_1, n, min_x_1, max_x_1, sd_e){
  x_1 <- runif(n = n, min = min_x_1, max = max_x_1)
  error <- rnorm(mean = 0, sd = sd_e, n = n)
  y <- b_0 + b_1*x_1 + error
  mat <- matrix(cbind(x_1, y), ncol = 2)
  # plot(mat[,1], mat[,2])
  return(mat)
}
sim <- simulation(10,-2,10000,-10,70,0.003)
summary(lm(sim[,2] ~ sim[,1] ))
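Concretely, the comparison I have in mind looks something like this (a sketch; z is the added noise variable, and reading N(0, 0.0023) as variance 0.0023, i.e. sd = sqrt(0.0023), is my assumption):
z <- rnorm(n = nrow(sim), mean = 0, sd = sqrt(0.0023)) # noise variable, unrelated to y
fit0 <- lm(sim[, 2] ~ sim[, 1])     # the true model
fit1 <- lm(sim[, 2] ~ sim[, 1] + z) # true model plus the noise variable
summary(fit0)$coefficients["sim[, 1]", "Pr(>|t|)"] # p-value of sim[, 1] without z
summary(fit1)$coefficients["sim[, 1]", "Pr(>|t|)"] # p-value of sim[, 1] with z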