I want to compare a multinomial logit model and a random forest using a grouped brier score within cross validation. The theoretical foundation of this approach is: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3702649/pdf/nihms461154.pdf
My dependent variable has three outcomes and my data-set compremises life-time data, where the lifetime lies between 0-5.
To make things reproducable, my dataset looks like:
library(data.table)
N <- 1000
X1 <- rnorm(N, 175, 7)
X2 <- rnorm(N, 30, 8)
length <- sample(0:5,N,T)
Ycont <- 0.5*X1 - 0.3*X2 + 10 + rnorm(N, 0, 6)
Ycateg <- ntile(Ycont,3)
df <- data.frame(id=1:N,length,X1, X2, Ycateg)
df$Ycateg=ifelse(df$Ycateg==1,"current",ifelse(df$Ycateg==2,"default","prepaid"))
df=setDT(df)[,.SD[rep(1L,length)],by = id]
df=df[ , time := 1:.N , by=id]
df=df[,-c("length")]
head(df)
id X1 X2 Ycateg time
1: 1 178.0645 10.84313 1 1
2: 2 169.4208 34.39831 1 1
3: 2 169.4208 34.39831 1 2
4: 2 169.4208 34.39831 1 3
5: 2 169.4208 34.39831 1 4
6: 2 169.4208 34.39831 1 5
What I did so far is:
library(caret)
fitControl <- trainControl(method = 'cv',number=5)
cv=train(as.factor(Ycateg)~.,
data = df,
method = "multinom",
maxit=150,
trControl = fitControl)
cv
Since the models are used to predict probabilities at each time point, I want to compute the following for each fold:
Brier Score for each category of the dependent variable: BS_i=(Y_it,k - p_it,k)² - where i denotes observation i of the test-fold,t the time and k the class k of the dependent variable
Summarise for this one fold 1. by computing 1/n_t (BS_i) where n_t are the number of observations which do have an observed time t - so a grouped computation
So in the end, what I want to report - for example for a 3 fold CV & knowing that time ranges from 0-5 - is an output like this:
fold time Brier_0 Brier_1 Brier_2
1 1 0 0.39758714 0.11703814 0.8711775
2 1 1 0.99461281 0.95051037 0.1503217
3 1 2 0.01791559 0.83653814 0.1553521
4 1 3 0.92067849 0.55275340 0.6466206
5 1 4 0.73112563 0.07603891 0.5769286
6 1 5 0.29500600 0.66219814 0.7590742
7 2 0 0.24691469 0.06736522 0.8612998
8 2 1 0.13629191 0.55973431 0.5617303
9 2 2 0.48006915 0.01357407 0.4515544
10 2 3 0.01257112 0.40250469 0.1814620
. . . . . .
I know that I have to set up a customized version of the summaryFunction, but I'm really lost on how to do this. So my main aim is not to tune a model but to validate it.
There is one thing that should be remarked: the summaryFunction can only return a single numeric vector - correct me if I'm wrong. Futher, the data-parameter of the summaryFunction contains a column rowIndex which can be used to extract additional variables form the original data set.
customSummary <- function (data, lev = NULL, model = NULL) { # for training on a next-period return
#browser() #essential for debugging
dat=dim(data)
# get observed dummy
Y_obs = model.matrix( ~ data[, "obs"] - 1) # create dummy - for each level of the outcome
# get predicted probabilities
Y_pre=as.data.frame(data[ , c("current","default","prepaid")])
# get rownumbers
rows=data[,"rowIndex"]
# get time of each obs
time=df[rows,]$time
# put it all together
df_temp=data.frame(Y_obs,Y_pre,time)
names(df_temp)=c("Y_cur","Y_def","Y_pre","p_cur","p_def","p_pre","time")
# group by time and compute crier score
out=df_temp %>% group_by(time) %>% summarise(BS_cur=1/n()*sum((Y_cur-p_cur)^2),BS_def=1/n()*sum((Y_def-p_def)^2),BS_pre=1/n()*sum((Y_pre-p_pre)^2))
# name
names(out)=c("time","BS_cur","BS_def","BS_pre")
# now create one line of return - caret seems to be able to hande only one
out=as.data.frame(out)
out_stack=stack(out)
out_stack=out_stack[(max(out$time)):length(out_stack[,1]),]
out_stack=out_stack[-1,]
out_stack$ind=paste(out_stack$ind,out$time,sep = "_")
# recall, the return type must be simply numeric
out_final=(t(out_stack[,1]))
names(out_final)=(out_stack[,2])
return(out_final)
}
# which type of cross validation to do
fitControl <- trainControl(method = 'cv',number=5,classProbs=TRUE,summaryFunction=customSummary, selectionFunction = "best", savePredictions = TRUE)
grid <- expand.grid(decay = 0 )
cv=train(as.factor(Ycateg)~.,
data = df,
method = "multinom",
maxit=150,
trControl = fitControl,
tuneGrid = grid
)
cv$resample
BS_cur_1 BS_cur_2 BS_cur_3 BS_cur_4 BS_cur_5 BS_def_1 BS_def_2 BS_def_3 BS_def_4 BS_def_5 BS_pre_1 BS_pre_2 BS_pre_3 BS_pre_4 BS_pre_5
1 0.1657623 0.1542842 0.1366912 0.1398001 0.2056348 0.1915512 0.2256758 0.2291467 0.2448737 0.2698545 0.1586134 0.2101389 0.1432483 0.2076886 0.1663780
2 0.1776843 0.1919503 0.1615440 0.1654297 0.1200515 0.2108787 0.2185783 0.2209958 0.2467931 0.2199898 0.1580643 0.1595971 0.2015860 0.1826029 0.1947144
3 0.1675981 0.1818885 0.1893253 0.1402550 0.1400997 0.2358501 0.2342476 0.2079819 0.1870549 0.2065355 0.2055711 0.1586077 0.1453172 0.1638555 0.2106146
4 0.1796041 0.1573086 0.1500860 0.1738538 0.1171626 0.2247850 0.2168341 0.2031590 0.1807209 0.2616180 0.1677508 0.1965577 0.1873078 0.1859176 0.1344115
5 0.1909324 0.1640292 0.1556209 0.1371598 0.1566207 0.2314311 0.1991000 0.2255612 0.2195158 0.2071910 0.1976272 0.1777507 0.1843828 0.1453439 0.1736540
Resample
1 Fold1
2 Fold2
3 Fold3
4 Fold4
5 Fold5
Related
I am trying to simulate two-level data, with level 1 interaction term.
For example,
I have two level-2 variables, two level -1 variable, and interaction variable.
CN<- 40 # number of cluster
nj<- 2 # cluster size, in this case, dyadic data.
l2v1 <- rep(rnorm(CN, mean=0, sd=1), each=nj) # level 2 variable 1
l2v2 <- rep(sample(rep(c(-1, 1), CN),each=nj) # level 2 variable 2, which is binary variable
l1v1 <- rnorm(CN*ng, 0, 1) # level 1 variable 2
l1v2 <- rnorm(sample(rep(c(-1,1),CN*nj) # level 2 variable 2, which is binary
error2 <- rep(rnorm(CN, 0,1) each = nj)) # error for level 2
error1 <- rnorm(CN*nj) # error for level 1
## putting together
y <- coef1*l1v1 + coef2*l1v2 + coef3*l2v1 + coef4*l2v2 + coef5 * l1v1* l1v2 + error2 + error1
In this case, how can I control ICC?
For example, I want to simulate this data with ICC of 0.3
ICC = between variance / total variance
ICC= {coef3^2 + coef4^2 + 1}/{coef3^2 + coef4^2 + 1 + 1+ coef1^2 + coef2^2 + coeff5^2}
and coef1 = coef3, coef 2= coef 4 due to some research questions.
so I plugged in the arbitrary numbers in coef 1, 2, 3, 4 and tried to set coeff 5 so that I can have data with targeted ICC. However, it seems like it does not work.
Did I miss something?
I'm trying to plot roc curve from lasso logistic regression result. so I used predict() using type="response" to get a probability. however, the result was opposite of when I put type = "class"
first of all, this is my dataset. my predictor has 2 levels
selected_data$danger <- factor(selected_data$danger, levels = c(1,0))
lasso_data<-selected_data
str(lasso_data$danger)
# Factor w/ 2 levels "1","0": 1 1 1 1 1 1 1 1 1 1 ...
# partition
input_train <- createDataPartition(y=lasso_data$danger, p=0.8, list=FALSE)
train_dataset <- lasso_data[input_train,]
test_dataset <- lasso_data[-input_train,]
dim(train_dataset)
# [1] 768 62
dim(test_dataset)
# [1] 192 62
I did run both cases(type = class, response) to compare.
lasso_model <- cv.glmnet( x=data.matrix(train_dataset[,-length(train_dataset)]), y = train_dataset[,length(train_dataset)],
family = "binomial" , type.measure = "auc",alpha=1, nfolds=5)
lasso_pred <- predict(lasso_model, newx=data.matrix(test_dataset[,-length(test_dataset)]),
s=lasso_model$lambda.min, type= "class", levels=c(1,0))
lasso_pred_resp <- predict(lasso_model, s="lambda.1se", newx=data.matrix(test_dataset[,-length(test_dataset)]), type="response", levels=c(1,0))
threshold <- 0.5 # or whatever threshold you use
pred <- ifelse(lasso_pred_resp>threshold, 1, 0)
table(lasso_pred, pred)
# pred
# lasso_pred 0 1
# 0 11 95
# 1 76 10
I have no idea why this is happening...
Any help would be greatly appreciated.
For logistic regression in R, the probability or "response" always refers to the probability of being the 2nd level, in your case it is "0".
So you predictions should be:
pred <- ifelse(lasso_pred_resp>threshold, 0, 1)
To avoid confusions, you can also do:
lvl <- levels(lasso_data$danger)
pred <- ifelse(lasso_pred_resp>threshold,lvl[2],lvl[1])
Background:
McElearth (2016) in his rethinking book pages 158-159, uses an index variable instead of dummy coding for a 3-category variable called "clade" to predict "kcal.per.g" (linear regression).
Question: I was wondering if we could apply the same approach in "rstanarm"? I have provided data and R code for a possible demonstration below.
library("rethinking") # A github package not on CRAN
data(milk)
d <- milk
d$clade_id <- coerce_index(d$clade) # Index variable maker
#[1] 4 4 4 4 4 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 # index variable
# Model Specification:
fit1 <- map(
alist(
kcal.per.g ~ dnorm( mu , sigma ) ,
mu <- a[clade_id] ,
a[clade_id] ~ dnorm( 0.6 , 10 ) ,
sigma ~ dunif( 0 , 10 )
) ,
data = d )
The most analogous way to do this using the rstanarm package is with
library(rstanarm)
fit1 <- stan_glmer(kcal.per.g ~ 1 + (1 | clade_id), data = milk,
prior_intercept = normal(0.6, 1, autoscale = FALSE),
prior_aux = exponential(rate = 1/5, autoscale = FALSE),
prior_covariance = decov(shape = 10, scale = 1))
However, this is not exactly the same for the following reasons:
Bounded uniform priors on sigma are not implemented because they are not a good idea, so I have used an exponential distribution with an expectation of 5 instead
Fixing the standard deviation on a is not implemented either so I have used a gamma distribution with an expectation of 10
Hierarchical models in rstanarm (and lme4) are parameterized with deviations from common parameters, so rather than using an expectation of 0.6 for a, I have used an expectation of 0.6 for the global intercept and the prior on a is normal with an expectation of zero. This means you need to call coef(fit1) rather than ranef(fit1) to see the "intercepts" as they are parameterized in the original model.
I have the following data frame:
lm mean resids sd resids resid 1 resid 2 resid 3 intercept beta
1 0.000000e+00 6.2806844 -3.6261548 7.2523096 -3.6261548 103.62615 24.989340
2 -2.960595e-16 8.7515899 -5.0527328 10.1054656 -5.0527328 141.96786 -1.047323
3 -2.960595e-16 5.9138984 -3.4143908 6.8287817 -3.4143908 206.29046 -26.448694
4 3.700743e-17 0.5110845 0.2950748 -0.5901495 0.2950748 240.89801 -35.806642
5 7.401487e-16 6.6260504 3.8255520 -7.6511040 3.8255520 187.03479 -23.444762
6 5.921189e-16 8.7217431 5.0355007 -10.0710014 5.0355007 41.43239 3.138396
7 0.000000e+00 5.5269434 3.1909823 -6.3819645 3.1909823 -119.90628 27.817845
8 -1.480297e-16 1.0204260 -0.5891432 1.1782864 -0.5891432 -180.33773 35.623363
9 -5.921189e-16 6.9488186 -4.0119023 8.0238046 -4.0119023 -64.72245 21.820226
10 -8.881784e-16 8.6621512 -5.0010953 10.0021906 -5.0010953 191.65339 -5.218767
Each row represents an estimated linear model with window length 3. I used rollapply on a separate dataframe with the function lm(y~t) to extract the coefficients and intercepts into a new dataframe, which I have combined with the residuals from the same model and their corresponding means and residuals.
Since the window length is 3, it implies that there are 3 residuals as shown, per model, in resid 1, resid 2 and resid 3. The mean and sd of these are included accordingly.
I am seeking to predict the next observation, in essence, k+1, where k is the window length, using the intercept and beta.
Recall that lm1 takes observations 1,2,3 to estimate the intercept and the beta, and lm2 takes 2,3,4, lm3 takes 3,4,5, etc. The function for the prediction should be:
predict_lm1 = intercept_lm1 + beta_lm1*(k+1)
Where k+1 = 4. For lm2:
predict_lm2 = intercept_lm2 + beta_lm2*(k+1)
Where k+1 = 5.
Clearly, k increases by 1 every time I move down one row in the dataset. This is because the explanatory variable is time, t, which is a sequence increasing by one per observation.
Should I use a for loop, or an apply function here?
How can I make a function that iterates down the rows and calculates the predictions accordingly with the information found in that row?
Thanks.
EDIT:
I managed to find a possible solution by writing the following:
n=nrow(dataset)
for(i in n){
predictions = dataset$Intercept + dataset$beta*(k+1)
}
However, k does not increase by 1 per iteration. Thus, k+1 is always = 4.
How can I make sure k increases by 1 accordingly?
EDIT 2
I managed to add 1 to k by writing the following:
n=nrow(dataset)
for(i in n){
x = 0
x[i] = k + 1
preds = dataset$`(Intercept)` + dataset$t*(x[i])
}
However, the first prediction is overestimated. It should be 203, whereas it is estimated as 228, implying that it sets the explanatory variable as 1 too high.
Yet, the second prediction is correct. I am not sure what I am doing wrong. Any advice?
EDIT 3
I managed to find a solution as follows:
n=nrow(dataset)
for(i in n){
x = k + 1
preds = dataset$`(Intercept)` + dataset$t*(x)
x = x + 1
}
Your loop is not iterating:
dataset <- read.table(text="lm meanresids sdresids resid1 resid2 resid3 intercept beta
1 0.000000e+00 6.2806844 -3.6261548 7.2523096 -3.6261548 103.62615 24.989340
2 -2.960595e-16 8.7515899 -5.0527328 10.1054656 -5.0527328 141.96786 -1.047323
3 -2.960595e-16 5.9138984 -3.4143908 6.8287817 -3.4143908 206.29046 -26.448694
4 3.700743e-17 0.5110845 0.2950748 -0.5901495 0.2950748 240.89801 -35.806642
5 7.401487e-16 6.6260504 3.8255520 -7.6511040 3.8255520 187.03479 -23.444762
6 5.921189e-16 8.7217431 5.0355007 -10.0710014 5.0355007 41.43239 3.138396
7 0.000000e+00 5.5269434 3.1909823 -6.3819645 3.1909823 -119.90628 27.817845
8 -1.480297e-16 1.0204260 -0.5891432 1.1782864 -0.5891432 -180.33773 35.623363
9 -5.921189e-16 6.9488186 -4.0119023 8.0238046 -4.0119023 -64.72245 21.820226
10 -8.881784e-16 8.6621512 -5.0010953 10.0021906 -5.0010953 191.65339 -5.218767", header=T)
n <- nrow(dataset)
predictions <- data.frame()
for(i in 1:n){
k <- i ##not sure where k is coming from but put it here
predictions <- rbind(predictions, dataset$intercept[i] + dataset$beta[i]*(k+1))
}
predictions
I am trying to simulate three small datasets, which contains x1,x2,x3,x4, trt and IND.
However, when I try to split simulated data by IND using "split" in R I get Warning messages and outputs are correct. Could someone please give me a hint what I did wrong in my R code?
# Step 2: simulate data
Alpha = 0.05
S = 3 # number of replicates
x = 8 # number of covariates
G = 3 # number of treatment groups
N = 50 # number of subjects per dataset
tot = S*N # total subjects for a simulation run
# True parameters
alpha = c(0.5, 0.8) # intercepts
b1 = c(0.1,0.2,0.3,0.4) # for pi_1 of trt A
b2 = c(0.15,0.25,0.35,0.45) # for pi_2 of trt B
b = c(1.1,1.2,1.3,1.4);
##############################################################################
# Scenario 1: all covariates are independent standard normally distributed #
##############################################################################
set.seed(12)
x1 = rnorm(n=tot, mean=0, sd=1);x2 = rnorm(n=tot, mean=0, sd=1);
x3 = rnorm(n=tot, mean=0, sd=1);x4 = rnorm(n=tot, mean=0, sd=1);
###############################################################################
p1 = exp(alpha[1]+b1[1]*x1+b1[2]*x2+b1[3]*x3+b1[4]*x4)/
(1+exp(alpha[1]+b1[1]*x1+b1[2]*x2+b1[3]*x3+b1[4]*x4) +
exp(alpha[2]+b2[1]*x1+b2[2]*x2+b2[3]*x3+b2[4]*x4))
p2 = exp(alpha[2]+b2[1]*x1+b2[2]*x2+b2[3]*x3+b2[4]*x4)/
(1+exp(alpha[1]+b1[1]*x1+b1[2]*x2+b1[3]*x3+b1[4]*x4) +
exp(alpha[2]+b2[1]*x1+b2[2]*x2+b2[3]*x3+b2[4]*x4))
p3 = 1/(1+exp(alpha[1]+b1[1]*x1+b1[2]*x2+b1[3]*x3+b1[4]*x4) +
exp(alpha[2]+b2[1]*x1+b2[2]*x2+b2[3]*x3+b2[4]*x4))
# To assign subjects to one of treatment groups based on response probabilities
tmp = function(x){sample(c("A","B","C"), 1, prob=x, replace=TRUE)}
trt = apply(cbind(p1,p2,p3),1,tmp)
IND=rep(1:S,each=N) #create an indicator for split simulated data
sim=data.frame(x1,x2,x3,x4,trt, IND)
Aset = subset(sim, trt=="A")
Bset = subset(sim, trt=="B")
Cset = subset(sim, trt=="C")
Anew = split(Aset, f = IND)
Bnew = split(Bset, f = IND)
Cnew = split(Cset, f = IND)
The warning message:
> Anew = split(Aset, f = IND)
Warning message:
In split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...) :
data length is not a multiple of split variable
and the output becomes
$`2`
x1 x2 x3 x4 trt IND
141 1.0894068 0.09765185 -0.46702047 0.4049424 A 3
145 -1.2953113 -1.94291045 0.09926239 -0.5338715 A 3
148 0.0274979 0.72971804 0.47194731 -0.1963896 A 3
$`3`
[1] x1 x2 x3 x4 trt IND
<0 rows> (or 0-length row.names)
I have checked my R code several times however, I can't figure out what I did wrong. Many thanks in advance
IND is the global variable for the full data, sim. You want to use the specific one for the subset, eg
Anew <- split(Aset, f = Aset$IND)
It's a warning, not an error, which means split executed successfully, but may not have done what you wanted to do.
From the "details" section of the help file:
f is recycled as necessary and if the length of x is not a multiple of
the length of f a warning is printed. Any missing values in f are
dropped together with the corresponding values of x.
Try checking the length of your IND against the size of your dataframe, maybe.
Not sure what your goal is once you have your data split, but this sounds like a good candidate for the plyr package.
> library(plyr)
> ddply(sim, .(trt,IND), summarise, x1mean=mean(x1), x2sum=sum(x2), x3min=min(x3), x4max=max(x4))
trt IND x1mean x2sum x3min x4max
1 A 1 -0.49356448 -1.5650528 -1.016615 2.0027822
2 A 2 0.05908053 5.1680463 -1.514854 0.8184445
3 A 3 0.22898716 1.8584443 -1.934188 1.6326763
4 B 1 0.01531230 1.1005720 -2.002830 2.6674931
5 B 2 0.17875088 0.2526760 -1.546043 1.2021935
6 B 3 0.13398967 -4.8739380 -1.565945 1.7887837
7 C 1 -0.16993037 -0.5445507 -1.954848 0.6222546
8 C 2 -0.04581149 -6.3230167 -1.491114 0.8714535
9 C 3 -0.41610973 0.9085831 -1.797661 2.1174894
>
Where you can substitute summarise and its following arguments for any function that returns a data.frame or something that can be coerced to one. If lists are the target, ldply is your friend.