Unique intercepts approach for categorical variables in the "rstanarm" package in R

Background:
McElreath (2016), on pages 158-159 of his book Statistical Rethinking, uses an index variable instead of dummy coding for a 3-category variable called "clade" to predict "kcal.per.g" (linear regression).
Question: I was wondering whether we could apply the same approach in "rstanarm". I have provided data and R code for a possible demonstration below.
library("rethinking") # A github package not on CRAN
data(milk)
d <- milk
d$clade_id <- coerce_index(d$clade) # Index variable maker
#[1] 4 4 4 4 4 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 # index variable
# Model Specification:
fit1 <- map(
  alist(
    kcal.per.g ~ dnorm(mu, sigma),
    mu <- a[clade_id],
    a[clade_id] ~ dnorm(0.6, 10),
    sigma ~ dunif(0, 10)
  ),
  data = d
)
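For reference, the per-clade intercepts from this fit can then be inspected with precis; depth = 2 is needed to display vector-valued parameters:
precis(fit1, depth = 2) # posterior summaries of a[1]..a[4] and sigma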

The most analogous way to do this using the rstanarm package is with
library(rstanarm)
fit1 <- stan_glmer(kcal.per.g ~ 1 + (1 | clade_id), data = milk,
                   prior_intercept = normal(0.6, 1, autoscale = FALSE),
                   prior_aux = exponential(rate = 1/5, autoscale = FALSE),
                   prior_covariance = decov(shape = 10, scale = 1))
However, this is not exactly the same, for the following reasons:
- Bounded uniform priors on sigma are not implemented (because they are not a good idea), so I have used an exponential distribution with an expectation of 5 instead.
- Fixing the standard deviation on a is not implemented either, so I have used a gamma distribution with an expectation of 10.
- Hierarchical models in rstanarm (and lme4) are parameterized as deviations from common parameters, so rather than using an expectation of 0.6 for a, I have used an expectation of 0.6 for the global intercept, and the prior on a is normal with an expectation of zero. This means you need to call coef(fit1) rather than ranef(fit1) to see the "intercepts" as they are parameterized in the original model, as sketched below.
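A minimal sketch of the difference between the two extractors, assuming the fit1 from the stan_glmer call above:
fixef(fit1) # the global intercept
ranef(fit1) # per-clade deviations from the global intercept
coef(fit1)  # global intercept + deviation, i.e. the per-clade intercepts,
            # comparable to the a[clade_id] parameters in the original model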

Related

Multinomial regression: how to show all coefficients without L, Q and C?

I have this dataframe to which I applied the multinom function:
df <- data.frame(x = c('a','a','b','b','c','c','d','d','d','e','e','f','f',
                       'f','f','g','g','g','h','h','h','h','i','i','j','j'),
                 y = c(1,2,1,3,1,2,1,4,5,1,2,2,3,4,5,1,1,2,1,2,2,3,2,2,3,4))
df$y <- factor(df$y, ordered = TRUE)
nnet::multinom(y ~ x, data = df)
When checking the output, I get all the variables with their coefficients (meaning everything is fine):
Coefficients:
(Intercept) xb xc xd xe xf
2 -6.045294e-05 -31.83512 3.800915e-05 -36.67053 3.800915e-05 25.00515
3 -1.613311e+01 16.13310 -1.725649e+01 -21.06832 -1.725649e+01 41.13825
4 -1.692352e+01 -14.71119 -1.428100e+01 16.92351 -1.428100e+01 41.92865
5 -2.129358e+01 -10.49359 -1.002518e+01 21.29353 -1.002518e+01 46.29867
xg xh xi xj
2 -0.6931261 0.6932295 40.499799 -25.311410
3 -24.0387863 16.1332150 -8.876562 45.191730
4 -20.2673490 -16.0884760 -6.394423 45.982129
5 -15.1755064 -11.8589447 -4.563793 -6.953942
But for my original dataframe (I will share only the output), which is coded like the dependent and independent variables of the df dataframe above (i.e. as ordinal factors), the analysis runs fine; when it comes to interpretation, however, I get this output:
Coefficients:
(Intercept) FIES_R.L FIES_R.Q FIES_R.C FIES_R^4 FIES_R^5
2 -0.09594409 -1.303256 0.03325169 -0.1753022 -0.46026668 -0.282463422
3 -0.18587599 -1.469957 0.42005569 -0.2977628 0.00508412 0.003068678
4 -0.58189239 -2.875183 0.33128994 -0.6787992 0.11145099 0.239368520
5 -2.68727952 -10.178604 -5.12515249 -5.8454920 -3.13775961 -1.820629143
FIES_R^6 FIES_R^7 FIES_R^8
2 -0.2179067 -0.1000471 -0.1489342
3 0.1915476 -0.5483707 -0.2565626
4 0.2585801 0.3821566 -0.2679774
5 -0.5562958 -0.6335412 -0.7205215
I don't want FIES_R.L, FIES_R.Q and FIES_R.C; I want them named FIES_R_1, FIES_R_2, FIES_R_3, FIES_R_4, FIES_R_5, FIES_R_6, FIES_R_7, FIES_R_8.
Why do I get such an output, given that both dataframes contain ordinal categorical variables, and that both the x variable and the FIES variable have many categories? Thanks.
I just figured it out: it is because the independent variable is an ordinal factor, i.e. FIES in my dataset is an ordinal factor. When I used the argument ordered = FALSE, the problem was solved.
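In code, that fix looks like the following sketch (mydata stands in for the original data frame, which was not shared; FIES_R is the ordinal predictor from the output above):
# dropping the ordering makes R use treatment contrasts (dummy coding)
# instead of orthogonal polynomial contrasts (.L, .Q, .C, ^4, ...)
mydata$FIES_R <- factor(mydata$FIES_R, ordered = FALSE)
mod <- nnet::multinom(y ~ FIES_R, data = mydata)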
You can change the coefnames "by hand" (note the [-1], which drops the intercept from the renaming so that the predictor names line up with the suffixes):
mod <- nnet::multinom(y ~ x, data = df)
mod$vcoefnames <- c("(Intercept)",
                    paste0(substr(mod$vcoefnames[-1], 1, 6), "_",
                           seq_along(mod$vcoefnames[-1])))
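Since nnet stores these labels in vcoefnames and coef.multinom reuses them as column names, coef(mod) and the printed model will then show the renamed coefficients.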

Problem in creating a model.matrix of quantitative predictors in R

I must do a lasso regression with the package glmnet, and I have problems generating my x model.matrix.
My data.frame: 108 observations, Y response variable, 24 predictors, here is an overview:
CONVENTIONAL_HUmin CONVENTIONAL_HUmean CONVENTIONAL_HUstd CONVENTIONAL_HUmax
1 37.9400539686119 63.4903779286635 11.7592095845857 85.2375439991287
2 23.8400539686119 80.5903779286635 15.0592095845857 125.837543999129
3 19.3035945249441 73.2764716205565 12.8816244173147 130.24141901586
CONVENTIONAL_HUQ1 CONVENTIONAL_HUQ2 CONVENTIONAL_HUQ3 HISTO_Skewness HISTO_Kurtosis
1 54.9938390994964 65.4873070322704 72.8863025473031 -0.203420585259268 2.25208159159488
2 70.8938390994964 80.3873070322704 91.4863025473031 -0.117420585259268 2.91208159159488
3 64.4689755423307 73.8666609177099 81.7351818199415 -0.0908104900456161 2.8751327713366
HISTO_ExcessKurtosis HISTO_Entropy_log10 HISTO_Entropy_log2 HISTO_Energy...Uniformity.
1 -0.751917020142877 0.701345471328916 2.32782599847774 0.219781577333287
2 -0.0887170201428774 0.793345471328916 2.63782599847774 0.184781577333287
3 -0.127231561113029 0.738530858918985 2.45445652190669 0.206887426065656
GLZLM_SZE GLZLM_LZE GLZLM_LGZE GLZLM_HGZE GLZLM_SZLGE
1 0.366581916604228 35.7249100350856 8.7285612359045e-05 11497.6407737833 3.22615226279017e-05
2 0.693581916604228 984.424910035086 8.5685612359045e-05 11697.6407737833 5.98615226279017e-05
3 0.622711792823853 1103.10288991619 8.5573088970709e-05 11571.7421733917 5.33303855950858e-05
GLZLM_SZHGE GLZLM_LZLGE GLZLM_LZHGE GLZLM_GLNU GLZLM_ZLNU
1 4164.91570215061 0.00314512237564268 405585.990838764 2.66964898745512 2.47759091065361
2 8064.91570215061 0.0835651223756427 11581585.9908388 12.9796489874551 38.5375909106536
3 7295.45317481887 0.0949686480587339 12926109.9421091 15.0930512668698 37.6083347285291
GLZLM_ZP Y
1 0.219643444043173 1
2 0.112643444043173 0
3 0.104031438564764 0
My code for the model.matrix:
x <- model.matrix(Y ~ ., data = data.det)
It generates a very large model.matrix with 244,728 elements! It seems to have duplicated each of the 24 predictors a hundred times!
Here's an overview of the model.matrix:
(Intercept) CONVENTIONAL_HUmin-10.5599460313881
CONVENTIONAL_HUmin-117.359946031388 CONVENTIONAL_HUmin-13.0599460313881
CONVENTIONAL_HUmin-154.359946031388 CONVENTIONAL_HUmin-17.6599460313881
CONVENTIONAL_HUmin-18.3599460313881 CONVENTIONAL_HUmin-2.87994603138811
CONVENTIONAL_HUmin-21.281710504529 CONVENTIONAL_HUmin-28.3599460313881
CONVENTIONAL_HUmin-3.44994603138811 CONVENTIONAL_HUmin-3.89640547505594
CONVENTIONAL_HUmin-67.0599460313881 CONVENTIONAL_HUmin-682.359946031388
CONVENTIONAL_HUmin-9.08171050452898 CONVENTIONAL_HUmin1.04428949547101
CONVENTIONAL_HUmin1.63928949547101 CONVENTIONAL_HUmin10.8400539686119
CONVENTIONAL_HUmin10.968289495471 CONVENTIONAL_HUmin11.5400539686119
CONVENTIONAL_HUmin11.618289495471 CONVENTIONAL_HUmin11.6400539686119
CONVENTIONAL_HUmin12.518289495471 CONVENTIONAL_HUmin12.5400539686119
CONVENTIONAL_HUmin13.4400539686119 CONVENTIONAL_HUmin13.6400539686119
CONVENTIONAL_HUmin13.7400539686119 CONVENTIONAL_HUmin13.818289495471
CONVENTIONAL_HUmin14.5400539686119 CONVENTIONAL_HUmin14.6693017607572
CONVENTIONAL_HUmin14.8400539686119 CONVENTIONAL_HUmin16.9400539686119
CONVENTIONAL_HUmin17.0400539686119 CONVENTIONAL_HUmin17.618289495471
CONVENTIONAL_HUmin18.2400539686119 CONVENTIONAL_HUmin18.8400539686119
CONVENTIONAL_HUmin19.3035945249441 CONVENTIONAL_HUmin20.0400539686119
CONVENTIONAL_HUmin20.818289495471 CONVENTIONAL_HUmin21.0400539686119
CONVENTIONAL_HUmin21.118289495471 CONVENTIONAL_HUmin21.3400539686119
CONVENTIONAL_HUmin21.5400539686119 CONVENTIONAL_HUmin21.9400539686119
...
attr(,"contrasts")$CONVENTIONAL_HUmin
[1] "contr.treatment"
This is not convenient at all, because I end up with far more predictors in the input x for the lasso regression, which makes the selection of predictors even more hazardous.
Do you have any idea of the source of the problem? Any suggestion to fix it?
Try this; you want a plain matrix, not a model matrix:
library(glmnet)
# make a matrix of your predictors minus your outcome (Y is column 25)
x <- as.matrix(data.detect[-25])
# put the y column in a vector
y <- data.detect$Y
# run it
fit.lasso <- glmnet(x, y, family = "binomial", alpha = 1)
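As an aside, the attr(,"contrasts") line in the question's output shows that CONVENTIONAL_HUmin was stored as a factor, so model.matrix created one dummy per distinct value; that is where the blow-up comes from. If all 24 predictors are really numeric, converting them back would also fix the model.matrix route. A sketch, assuming the data.detect frame from above:
# convert factor columns back to numeric; go via character so we recover
# the printed values rather than the internal level codes
is_fac <- sapply(data.detect, is.factor)
data.detect[is_fac] <- lapply(data.detect[is_fac], function(z) as.numeric(as.character(z)))
x <- model.matrix(Y ~ ., data = data.detect)[, -1] # drop the intercept column for glmnet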

Modify summaryFunction in caret to compute grouped Brier-Score

I want to compare a multinomial logit model and a random forest using a grouped brier score within cross validation. The theoretical foundation of this approach is: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3702649/pdf/nihms461154.pdf
My dependent variable has three outcomes, and my dataset comprises lifetime data, where the lifetime lies between 0 and 5.
To make things reproducible, my dataset looks like this:
library(data.table)
library(dplyr) # for ntile()
N <- 1000
X1 <- rnorm(N, 175, 7)
X2 <- rnorm(N, 30, 8)
length <- sample(0:5, N, TRUE)
Ycont <- 0.5*X1 - 0.3*X2 + 10 + rnorm(N, 0, 6)
Ycateg <- ntile(Ycont, 3)
df <- data.frame(id = 1:N, length, X1, X2, Ycateg)
df$Ycateg <- ifelse(df$Ycateg == 1, "current", ifelse(df$Ycateg == 2, "default", "prepaid"))
df <- setDT(df)[, .SD[rep(1L, length)], by = id]
df <- df[, time := 1:.N, by = id]
df <- df[, -c("length")]
head(df)
id X1 X2 Ycateg time
1: 1 178.0645 10.84313 1 1
2: 2 169.4208 34.39831 1 1
3: 2 169.4208 34.39831 1 2
4: 2 169.4208 34.39831 1 3
5: 2 169.4208 34.39831 1 4
6: 2 169.4208 34.39831 1 5
What I did so far is:
library(caret)
fitControl <- trainControl(method = 'cv', number = 5)
cv <- train(as.factor(Ycateg) ~ .,
            data = df,
            method = "multinom",
            maxit = 150,
            trControl = fitControl)
cv
Since the models are used to predict probabilities at each time point, I want to compute the following for each fold:
The Brier score for each category of the dependent variable: BS_{it,k} = (Y_{it,k} - p_{it,k})^2, where i denotes observation i of the test fold, t the time, and k the class of the dependent variable.
A summary of that fold, grouped by time: (1/n_t) * Σ_i BS_{it,k}, where n_t is the number of observations that have an observed time t.
So in the end, what I want to report, for example for a 3-fold CV and knowing that time ranges from 0 to 5, is an output like this:
fold time Brier_0 Brier_1 Brier_2
1 1 0 0.39758714 0.11703814 0.8711775
2 1 1 0.99461281 0.95051037 0.1503217
3 1 2 0.01791559 0.83653814 0.1553521
4 1 3 0.92067849 0.55275340 0.6466206
5 1 4 0.73112563 0.07603891 0.5769286
6 1 5 0.29500600 0.66219814 0.7590742
7 2 0 0.24691469 0.06736522 0.8612998
8 2 1 0.13629191 0.55973431 0.5617303
9 2 2 0.48006915 0.01357407 0.4515544
10 2 3 0.01257112 0.40250469 0.1814620
. . . . . .
I know that I have to set up a customized version of the summaryFunction, but I'm really lost on how to do this. My main aim is not to tune a model but to validate it.
One thing should be remarked: the summaryFunction can only return a single numeric vector (correct me if I'm wrong). Further, the data parameter of the summaryFunction contains a column rowIndex, which can be used to extract additional variables from the original data set.
library(dplyr) # for group_by(), summarise() and the pipe

customSummary <- function(data, lev = NULL, model = NULL) { # for training on a next-period return
  # browser() # essential for debugging
  # get observed dummies - one column for each level of the outcome
  Y_obs <- model.matrix(~ data[, "obs"] - 1)
  # get predicted probabilities
  Y_pre <- as.data.frame(data[, c("current", "default", "prepaid")])
  # get row numbers
  rows <- data[, "rowIndex"]
  # get the time of each observation
  time <- df[rows, ]$time
  # put it all together
  df_temp <- data.frame(Y_obs, Y_pre, time)
  names(df_temp) <- c("Y_cur", "Y_def", "Y_pre", "p_cur", "p_def", "p_pre", "time")
  # group by time and compute the Brier score
  out <- df_temp %>%
    group_by(time) %>%
    summarise(BS_cur = 1/n() * sum((Y_cur - p_cur)^2),
              BS_def = 1/n() * sum((Y_def - p_def)^2),
              BS_pre = 1/n() * sum((Y_pre - p_pre)^2))
  names(out) <- c("time", "BS_cur", "BS_def", "BS_pre")
  # now create one line of return - caret seems to be able to handle only one
  out <- as.data.frame(out)
  out_stack <- stack(out)
  out_stack <- out_stack[(max(out$time)):length(out_stack[, 1]), ]
  out_stack <- out_stack[-1, ]
  out_stack$ind <- paste(out_stack$ind, out$time, sep = "_")
  # recall, the return type must be simply numeric
  out_final <- t(out_stack[, 1])
  names(out_final) <- out_stack[, 2]
  return(out_final)
}
# which type of cross-validation to do
fitControl <- trainControl(method = 'cv', number = 5, classProbs = TRUE,
                           summaryFunction = customSummary,
                           selectionFunction = "best", savePredictions = TRUE)
grid <- expand.grid(decay = 0)
cv <- train(as.factor(Ycateg) ~ .,
            data = df,
            method = "multinom",
            maxit = 150,
            trControl = fitControl,
            tuneGrid = grid)
cv$resample
BS_cur_1 BS_cur_2 BS_cur_3 BS_cur_4 BS_cur_5 BS_def_1 BS_def_2 BS_def_3 BS_def_4 BS_def_5 BS_pre_1 BS_pre_2 BS_pre_3 BS_pre_4 BS_pre_5
1 0.1657623 0.1542842 0.1366912 0.1398001 0.2056348 0.1915512 0.2256758 0.2291467 0.2448737 0.2698545 0.1586134 0.2101389 0.1432483 0.2076886 0.1663780
2 0.1776843 0.1919503 0.1615440 0.1654297 0.1200515 0.2108787 0.2185783 0.2209958 0.2467931 0.2199898 0.1580643 0.1595971 0.2015860 0.1826029 0.1947144
3 0.1675981 0.1818885 0.1893253 0.1402550 0.1400997 0.2358501 0.2342476 0.2079819 0.1870549 0.2065355 0.2055711 0.1586077 0.1453172 0.1638555 0.2106146
4 0.1796041 0.1573086 0.1500860 0.1738538 0.1171626 0.2247850 0.2168341 0.2031590 0.1807209 0.2616180 0.1677508 0.1965577 0.1873078 0.1859176 0.1344115
5 0.1909324 0.1640292 0.1556209 0.1371598 0.1566207 0.2314311 0.1991000 0.2255612 0.2195158 0.2071910 0.1976272 0.1777507 0.1843828 0.1453439 0.1736540
Resample
1 Fold1
2 Fold2
3 Fold3
4 Fold4
5 Fold5
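If you additionally want one grouped Brier score per time point, averaged over the folds, a small follow-up sketch (assuming the cv object from above):
# average each time-specific Brier score across the five folds
colMeans(cv$resample[, setdiff(names(cv$resample), "Resample")])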

Modelling for zero using glm function in R

I am trying to build a logistic regression model using the glm function in R. My dependent variable is binary, with 0 and 1 only: 0 = non-return, 1 = return.
I want to model the non-returns (0s), but R's glm function by default models the 1s. In SAS, which by default models the lower value, we can use the descending option in proc logistic to change the order; do we have something similar in glm?
One option is to swap 0 and 1 in my raw data, but I don't want to change the raw data.
Please help me or guide me on how to do the same thing in R. Thanks in advance.
Just specify 1 - y as the DV:
set.seed(42)
y <- sample(c(0, 1), 10, TRUE)
#[1] 1 1 0 1 1 1 1 0 1 1
fit <- glm(y ~ 1, family = binomial)
coef(fit)
# (Intercept)
# 1.386294
log(mean(y) / (1 - mean(y)))
#[1] 1.386294
1 - y
#[1] 0 0 1 0 0 0 0 1 0 0
fit1 <- glm(1 - y ~ 1, family = binomial)
coef(fit1)
#(Intercept)
#-1.386294
log(mean(1 - y) / (1 - mean(1 - y)))
#[1] -1.386294
Alternatively, you can temporarily transform your data by using... transform (note that for a 0/1 response the flip is y = 1 - y):
glm(y ~ 1, data = transform(data.frame(y), y = 1 - y), family = binomial)
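Another route, closer in spirit to SAS's descending option (an addition to the answer above, not part of it): if the response is passed as a factor, glm treats the first level as failure and models the probability of the second level, so reordering the levels makes glm model the 0s:
yf <- factor(y, levels = c(1, 0)) # "0" is now the second (modelled) level
fit2 <- glm(yf ~ 1, family = binomial)
coef(fit2)
#(Intercept)
#  -1.386294   # should match fit1 above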

R- Partial eta squared for repeated measures ANOVA (car package)

I have a 2-way repeated measures design (3 x 2), and I would like to figure out how to calculate effect sizes (partial eta squared).
I have a matrix with data in it (called a), like so (repeated measures):
A.a A.b B.a B.b C.a C.b
1 514.0479 483.4246 541.1342 516.4149 595.5404 588.8000
2 569.0741 550.0809 569.7574 599.1509 621.4725 656.8136
3 738.2037 660.3058 812.2970 735.8543 767.0683 738.7920
4 627.1101 638.1338 641.2478 682.7028 694.3569 761.6241
5 599.3417 637.2846 599.4951 632.5684 626.4102 677.2634
6 655.1394 600.9598 729.3096 669.4189 728.8995 716.4605
idata =
Caps Lower
A a
A b
B a
B b
C a
C b
I know how to do a repeated measures ANOVA with the car package (type 3 SS is standard in my field, although I know that it results in a logical error; if somebody wants to explain that to me like I'm 5, I would love to understand it):
summary(Anova(lm(a ~ 1),
              idata = idata, type = 3,
              idesign = ~Caps*Lower),
        multivariate = FALSE)
I think what I want to do is take this part of the summary print out:
Univariate Type III Repeated-Measures ANOVA Assuming Sphericity
SS num Df Error SS den Df F Pr(>F)
(Intercept) 14920141 1 153687 5 485.4072 3.577e-06 ***
Caps 33782 2 8770 10 19.2589 0.000372 ***
Lower 195 1 13887 5 0.0703 0.801451
Caps:Lower 2481 2 907 10 13.6740 0.001376 **
And use it to calculate partial eta squared. So, if I'm not mistaken, I need to take the SS from the first column and divide it by (itself + the Error SS for that row) for each effect. Is this the correct way to go about it? If so, how do I do it? I can't figure out how to reference values from the summary printout.
The partial eta squared can be calculated with the etasq function in the heplots package:
library(car)
mod <- Anova(lm(a ~ 1),
             idata = idata,
             type = 3,
             idesign = ~Caps*Lower)
mod
library(heplots)
etasq(mod, anova = TRUE)
Since you are asking about the calculations:
From ?etasq: 'For univariate linear models, classical η^2 = SSH / SST and partial η^2 = SSH / (SSH + SSE). These are identical in one-way designs.'
If you wish to inspect the code for the calculations of η^2 for a model with a class as in the example, you may use getS3method(f = "etasq", class = "Anova.mlm").
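And for the "how do I reference values from the summary printout" part: the univariate table is stored inside the summary object, so the division can be done by hand. A sketch, assuming the mod object from above (component and column names may differ across car versions; newer versions label the first column "Sum Sq"):
s <- summary(mod, multivariate = FALSE)
tab <- s$univariate.tests # the table printed as "Univariate Type III Repeated-Measures ANOVA"
partial_eta_sq <- tab[, "SS"] / (tab[, "SS"] + tab[, "Error SS"])
partial_eta_sq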
