Output from multinomial logistic regression in Stan

I'm running a multinomial logistic regression. The outcome has four categories, and there are three predictors: an indicator for sex (Male = 1), a measure of the number of books in the home on a 5-point scale, and a continuous measure of motivation to read. Here are the essential aspects of the code. I'm following the "K" parameterization in https://mc-stan.org/docs/stan-users-guide/multi-logit.html. Thanks.
Canadareg2 <- data.frame(Canadareg2)
Canadareg2$DOWELL <- as.factor(Canadareg2$DOWELL)  # outcome as a factor
n <- nrow(Canadareg2)
f <- as.formula("DOWELL ~ Male + booksHome + motivRead")
m <- model.matrix(f, Canadareg2)
data.list <- list(n = nrow(Canadareg2),
                  k = length(unique(Canadareg2[, 1])),  # number of outcome categories (DOWELL is column 1)
                  d = ncol(m),                          # number of design-matrix columns
                  x = m,
                  Male = Canadareg2$Male,
                  booksHome = Canadareg2$booksHome,
                  motivRead = Canadareg2$motivRead,
                  DOWELL = as.numeric(Canadareg2[, 1]))
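For reference, the column order of the design matrix is what maps predictors to rows of beta; a quick sanity check (the names below are what I'd expect, assuming all three predictors enter as numeric columns):
colnames(m)
# expected: "(Intercept)" "Male" "booksHome" "motivRead"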
ReadMultiNom <- "
data {
  int<lower=2> k;                    // number of outcome categories (at least two)
  int<lower=1> d;                    // number of design-matrix columns
  int<lower=0> n;                    // number of observations
  vector[n] Male;                    // declared but never used: x already contains it
  vector[n] booksHome;               // (likewise unused)
  vector[n] motivRead;               // (likewise unused)
  int<lower=1, upper=k> DOWELL[n];   // outcome
  matrix[n, d] x;                    // design matrix
}
parameters {
  matrix[d, k] beta;                 // one column of coefficients per outcome category
}
transformed parameters {
  matrix[n, k] x_beta = x * beta;
}
model {
  to_vector(beta) ~ normal(0, 2);    // vectorizes beta and assigns the same prior
  for (i in 1:n) {
    DOWELL[i] ~ categorical_logit(x_beta[i]');
  }
}
generated quantities {
  int<lower=1, upper=k> DOWELL_rep[n];
  vector[n] log_lik;
  for (i in 1:n) {
    DOWELL_rep[i] = categorical_logit_rng(x_beta[i]');
    log_lik[i] = categorical_logit_lpmf(DOWELL[i] | x_beta[i]');
  }
}
"
nChains <- 4
nIter <- 10000
thinSteps <- 10
burnInSteps <- floor(nIter / 2)
DOWELL <- data.list$DOWELL  # likely redundant: DOWELL is already passed in via data.list
MultiNomRegFit <- stan(data = data.list, model_code = ReadMultiNom,
                       chains = nChains, control = list(adapt_delta = 0.99),
                       iter = nIter, warmup = burnInSteps, thin = thinSteps)
Everything runs beautifully and all convergence criteria are met. However, I am struggling to interpret the betas. In particular, I'm not sure where the Male effects are located: the output seems to cover only the two other predictors, and even then one of them is a 5-point scale. It would seem to me that each beta should have three elements, e.g. beta(1,1,1), beta(1,2,1), etc. Here is the output. I'm just unsure how to interpret the betas.
Inference for Stan model: 7be33c603bd35d82ad7f6b200ccee16f.
## 4 chains, each with iter=10000; warmup=5000; thin=10;
## post-warmup draws per chain=500, total post-warmup draws=2000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## beta[1,1] -2.07 0.03 1.05 -4.16 -2.80 -2.08 -1.36 0.00 1705 1
## beta[1,2] 1.16 0.03 1.04 -0.97 0.47 1.18 1.84 3.25 1652 1
## beta[1,3] 1.06 0.03 1.07 -1.03 0.36 1.06 1.76 3.15 1703 1
## beta[1,4] -0.01 0.03 1.07 -2.17 -0.70 -0.01 0.68 2.06 1657 1
## beta[2,1] 0.51 0.03 1.04 -1.52 -0.18 0.52 1.20 2.51 1618 1
## beta[2,2] -0.01 0.03 1.04 -2.08 -0.72 -0.02 0.67 2.07 1636 1
## beta[2,3] -0.31 0.03 1.03 -2.31 -1.00 -0.31 0.38 1.70 1617 1
## beta[2,4] -0.26 0.03 1.04 -2.28 -0.94 -0.26 0.44 1.72 1648 1
## beta[3,1] 0.26 0.03 1.01 -1.73 -0.40 0.27 0.95 2.17 1525 1
## beta[3,2] 0.03 0.03 1.01 -2.01 -0.65 0.05 0.70 1.93 1525 1
## beta[3,3] 0.02 0.03 1.01 -2.04 -0.64 0.04 0.70 1.95 1522 1
## beta[3,4] -0.30 0.03 1.01 -2.36 -0.99 -0.29 0.38 1.59 1528 1
## beta[4,1] 0.18 0.02 0.98 -1.70 -0.49 0.14 0.85 2.17 1561 1
## beta[4,2] -0.09 0.02 0.98 -1.98 -0.77 -0.14 0.58 1.89 1568 1
## beta[4,3] -0.12 0.02 0.98 -2.01 -0.80 -0.15 0.56 1.85 1566 1
## beta[4,4] 0.11 0.02 0.99 -1.79 -0.56 0.07 0.79 2.10 1559 1
##
## Samples were drawn using NUTS(diag_e) at Tue Oct 18 09:58:48 2022.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
I'm not quite sure what is going on, so any advice would be appreciated.
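In case it helps others reading along: row r of beta holds the coefficients of the r-th design-matrix column (here intercept, Male, booksHome, motivRead, assuming numeric predictors) across the k = 4 outcome categories, so the Male effects are beta[2,1] through beta[2,4]. In this full-K parameterization only differences between columns are identified, and category probabilities come from a softmax of x * beta. A minimal sketch of turning the draws into probabilities, using the MultiNomRegFit object from above and a hypothetical covariate profile:
post <- rstan::extract(MultiNomRegFit, pars = "beta")$beta  # draws x d x k array
x_new <- c(1, 1, 3, 0.5)  # hypothetical profile: intercept, Male = 1, booksHome = 3, motivRead = 0.5
eta <- apply(post, 1, function(b) x_new %*% b)              # k x draws matrix of logits
probs <- apply(eta, 2, function(e) exp(e) / sum(exp(e)))    # softmax, draw by draw
rowMeans(probs)  # posterior mean probability of each DOWELL category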

Related

Filter rows of dataframe based on combinations of conditions

Let's say we have df1 with p values:
Symbol p1 p2 p3 p4 p5
AABT 0.01 0.12 0.23 0.02 0.32
ABC1 0.13 0.01 0.01 0.12 0.02
ACDC 0.15 0.01 0.34 0.24 0.01
BAM1 0.01 0.02 0.04 0.01 0.02
BCR 0.01 0.36 0.02 0.07 0.04
BDSM 0.02 0.43 0.01 0.03 0.41
BGL 0.27 0.77 0.01 0.04 0.02
and df2 with Fold Changes:
Symbol FC1 FC2 FC3 FC4 FC5
AABT 1.21 -0.32 0.23 -0.72 0.45
ABC1 0.13 0.93 -1.61 0.12 1.03
ACDC 0.23 1.31 0.42 -0.39 1.50
BAM1 -1.33 -1.27 -0.89 1.22 -1.03
BCR 1.43 -0.25 1.29 0.54 0.97
BDSM 1.20 0.23 -1.98 -1.09 -0.31
BGL 0.33 0.12 -1.33 -1.14 -1.23
I would like to do the following in df2:
Keep rows that, in df1, have values < 0.05 in at least 3 of the 5 columns
Eliminate rows whose FCs have discordant signs; an FC should be taken into consideration only when the respective p from df1 is lower than 0.05 (i.e. significant)
Sort the resulting data in an intuitive order, so as to separate rows having positive FCs from rows having negative FCs and, if possible, to distinguish rows whose significant FCs arise sequentially (e.g. FC3 FC4 FC5) from others that don't (e.g. FC1 FC3 FC5)
For example, step 1 would result in:
Symbol FC1 FC2 FC3 FC4 FC5
ABC1 0.13 0.93 -1.61 0.12 1.03
BAM1 -1.33 -1.27 -0.89 1.22 -1.03
BCR 1.43 -0.25 1.29 0.54 0.97
BDSM 1.20 0.23 -1.98 -1.09 -0.31
BGL 0.33 0.12 -1.33 -1.14 -1.23
and step 2, in:
Symbol FC1 FC2 FC3 FC4 FC5
BCR 1.43 -0.25 1.29 0.54 0.97
BGL 0.33 0.12 -1.33 -1.14 -1.23
How can this be achieved? I imagine using a for loop and the count function would do the job for step 1, but steps 2 and 3 look somewhat complicated to me. Thank you in advance for your elegant solutions.
data
df1:
df1 <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "Symbol p1 p2 p3 p4 p5
AABT 0.01 0.12 0.23 0.02 0.32
ABC1 0.13 0.01 0.01 0.12 0.02
ACDC 0.15 0.01 0.34 0.24 0.01
BAM1 0.01 0.02 0.04 0.01 0.02
BCR 0.01 0.36 0.02 0.07 0.04
BDSM 0.02 0.43 0.01 0.03 0.41
BGL 0.27 0.77 0.01 0.04 0.02")
df2:
df2 <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "Symbol FC1 FC2 FC3 FC4 FC5
AABT 1.21 -0.32 0.23 -0.72 0.45
ABC1 0.13 0.93 -1.61 0.12 1.03
ACDC 0.23 1.31 0.42 -0.39 1.50
BAM1 -1.33 -1.27 -0.89 1.22 -1.03
BCR 1.43 -0.25 1.29 0.54 0.97
BDSM 1.20 0.23 -1.98 -1.09 -0.31
BGL 0.33 0.12 -1.33 -1.14 -1.23")
I'm not sure how elegant this is, but you can get the result you requested using apply and sapply with subsetting, like this:
# Create logical matrix telling us whether p values are significant
sig <- apply(df1[-1], 2, function(x) x < 0.05)
# Create numeric matrix of the sign of each FC (will be either -1 or 1)
sign <- apply(df2[-1], 2, function(x) sign(x))
# Create a vector telling us whether there were 3 or more p < 0.05 in each row
ss1 <- apply(sig, 1, function(x) length(which(x)) > 2)
# Create a vector telling us whether all FC signs match excluding p = ns
ss2 <- sapply(seq(nrow(df1)), function(i) length(table(sign[i,][sig[i,]])) == 1)
# Subset the data frames accordingly:
df1[ss1, ]
#> Symbol p1 p2 p3 p4 p5
#> 2 ABC1 0.13 0.01 0.01 0.12 0.02
#> 4 BAM1 0.01 0.02 0.04 0.01 0.02
#> 5 BCR 0.01 0.36 0.02 0.07 0.04
#> 6 BDSM 0.02 0.43 0.01 0.03 0.41
#> 7 BGL 0.27 0.77 0.01 0.04 0.02
df2[ss1 & ss2, ]
#> Symbol FC1 FC2 FC3 FC4 FC5
#> 5 BCR 1.43 -0.25 1.29 0.54 0.97
#> 7 BGL 0.33 0.12 -1.33 -1.14 -1.23
Created on 2020-07-10 by the reprex package (v0.3.0)
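The above covers steps 1 and 2; for step 3, here is a possible sketch reusing the sig matrix from the answer: put positive-FC rows before negative ones, and within each sign put rows whose significant columns are consecutive first.
keep <- ss1 & ss2
fc <- as.matrix(df2[-1])
# Sign shared by the significant FCs of each kept row (step 2 guarantees exactly one sign)
row_sign <- sapply(which(keep), function(i) sign(fc[i, sig[i, ]])[1])
# TRUE when the significant columns are consecutive, e.g. FC3 FC4 FC5
seq_sig <- sapply(which(keep), function(i) all(diff(which(sig[i, ])) == 1))
# Positive rows first; within each sign, sequential rows first
df2[which(keep)[order(-row_sign, -seq_sig)], ]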

How to retrieve observation scores for each Principal Component in R using the principal function

pc_unrotate <- principal(correlate1, nfactors = 4, rotate = "none")
print(pc_unrotate)
output:
Principal Components Analysis
Call: principal(r = correlate1, nfactors = 4, rotate = "none")
Standardized loadings (pattern matrix) based upon correlation matrix
PC1 PC2 PC3 PC4 h2 u2 com
ProdQual 0.25 -0.50 -0.08 0.67 0.77 0.232 2.2
Ecom 0.31 0.71 0.31 0.28 0.78 0.223 2.1
TechSup 0.29 -0.37 0.79 -0.20 0.89 0.107 1.9
CompRes 0.87 0.03 -0.27 -0.22 0.88 0.119 1.3
Advertising 0.34 0.58 0.11 0.33 0.58 0.424 2.4
ProdLine 0.72 -0.45 -0.15 0.21 0.79 0.213 2.0
SalesFImage 0.38 0.75 0.31 0.23 0.86 0.141 2.1
ComPricing -0.28 0.66 -0.07 -0.35 0.64 0.359 1.9
WartyClaim 0.39 -0.31 0.78 -0.19 0.89 0.108 2.0
OrdBilling 0.81 0.04 -0.22 -0.25 0.77 0.234 1.3
DelSpeed 0.88 0.12 -0.30 -0.21 0.91 0.086 1.4
PC1 PC2 PC3 PC4
SS loadings 3.43 2.55 1.69 1.09
Proportion Var 0.31 0.23 0.15 0.10
Cumulative Var 0.31 0.54 0.70 0.80
Proportion Explained 0.39 0.29 0.19 0.12
Cumulative Proportion 0.39 0.68 0.88 1.00
Mean item complexity = 1.9
Test of the hypothesis that 4 components are sufficient.
The root mean square of the residuals (RMSR) is 0.06
Fit based upon off diagonal values = 0.97
Now I need to get the scores. I tried pc_unrotate$scores, but it returns NULL, and names(pc_unrotate) confirms that the scores element is missing. What can I do to get the PCA scores?
Add argument scores=TRUE to the principal() function call: https://www.rdocumentation.org/packages/psych/versions/1.9.12.31/topics/principal
pc_unrotate <- principal(correlate1, nfactors = 4, rotate = "none", scores = TRUE)
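One caveat, in case correlate1 is a correlation matrix (as its name suggests): component scores can only be computed from observation-level data, so principal() needs the raw data to return them. A hypothetical sketch, where raw_data stands in for the data the correlations were computed from:
pc_unrotate <- principal(raw_data, nfactors = 4, rotate = "none", scores = TRUE)
head(pc_unrotate$scores)  # one row of component scores per observation
# alternatively, score an existing solution against the raw data:
psych::factor.scores(raw_data, pc_unrotate)$scores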

`psych::alpha` - detailed interpretation of the output

I am aware that Cronbach's alpha has been extensively discussed here and elsewhere, but I cannot find a detailed interpretation of the output table.
psych::alpha(questionaire)
Reliability analysis
Call: psych::alpha(x = diagnostic_test)
raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
0.69 0.73 1 0.14 2.7 0.026 0.6 0.18 0.12
lower alpha upper 95% confidence boundaries
0.64 0.69 0.74
Reliability if an item is dropped:
raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
Score1 0.69 0.73 0.86 0.14 2.7 0.027 0.0136 0.12
Score2 0.68 0.73 0.87 0.14 2.7 0.027 0.0136 0.12
Score3 0.69 0.73 0.87 0.14 2.7 0.027 0.0136 0.12
Score4 0.67 0.72 0.86 0.14 2.5 0.028 0.0136 0.11
Score5 0.68 0.73 0.87 0.14 2.7 0.027 0.0134 0.12
Score6 0.69 0.73 0.91 0.15 2.7 0.027 0.0138 0.12
Score7 0.69 0.73 0.85 0.15 2.7 0.027 0.0135 0.12
Score8 0.68 0.72 0.86 0.14 2.6 0.028 0.0138 0.12
Score9 0.68 0.73 0.92 0.14 2.7 0.027 0.0141 0.12
Score10 0.68 0.72 0.90 0.14 2.6 0.027 0.0137 0.12
Score11 0.67 0.72 0.86 0.14 2.5 0.028 0.0134 0.11
Score12 0.67 0.71 0.87 0.13 2.5 0.029 0.0135 0.11
Score13 0.67 0.72 0.86 0.14 2.6 0.028 0.0138 0.11
Score14 0.68 0.72 0.86 0.14 2.6 0.028 0.0138 0.11
Score15 0.67 0.72 0.86 0.14 2.5 0.028 0.0134 0.11
Score16 0.68 0.72 0.88 0.14 2.6 0.028 0.0135 0.12
score 0.65 0.65 0.66 0.10 1.8 0.030 0.0041 0.11
Item statistics
n raw.r std.r r.cor r.drop mean sd
Score1 286 0.36 0.35 0.35 0.21 0.43 0.50
Score2 286 0.37 0.36 0.36 0.23 0.71 0.45
Score3 286 0.34 0.34 0.34 0.20 0.73 0.44
Score4 286 0.46 0.46 0.46 0.33 0.35 0.48
Score5 286 0.36 0.36 0.36 0.23 0.73 0.44
Score6 286 0.29 0.32 0.32 0.18 0.87 0.34
Score7 286 0.33 0.32 0.32 0.18 0.52 0.50
Score8 286 0.42 0.41 0.41 0.28 0.36 0.48
Score9 286 0.32 0.36 0.36 0.22 0.90 0.31
Score10 286 0.37 0.40 0.40 0.26 0.83 0.37
Score11 286 0.48 0.47 0.47 0.34 0.65 0.48
Score12 286 0.49 0.49 0.49 0.37 0.71 0.46
Score13 286 0.46 0.44 0.44 0.31 0.44 0.50
Score14 286 0.44 0.43 0.43 0.30 0.43 0.50
Score15 286 0.48 0.47 0.47 0.35 0.61 0.49
Score16 286 0.39 0.39 0.39 0.26 0.25 0.43
score 286 1.00 1.00 1.00 1.00 0.60 0.18
Warning messages:
1: In cor.smooth(r) : Matrix was not positive definite, smoothing was done
2: In cor.smooth(R) : Matrix was not positive definite, smoothing was done
3: In cor.smooth(R) : Matrix was not positive definite, smoothing was done
As far as I know, r.cor stands for the item-total correlation, or biserial correlation. I have seen that this is usually interpreted together with the corresponding p-value.
1. What is the exact interpretation of r.cor and r.drop?
2. How can the p-value be calculated?
1. Although this is more of a question for Cross Validated, here is a detailed explanation of the ‘Item statistics’ section:
raw.r: correlation between the item and the total score from the scale (i.e., item-total correlations); there is a problem with raw.r, that is, the item itself is included in the total—this means we’re correlating the item with itself, so of course it will correlate (r.cor and r.drop solve this problem; see ?alpha for details)
r.drop: item-total correlation without that item itself (i.e., item-rest correlation or corrected item-total correlation); low item-total correlations indicate that that item doesn’t correlate well with the scale overall
r.cor: item-total correlation corrected for item overlap and scale reliability
mean and sd: mean and sd of each item (the final ‘score’ row gives the same statistics for the total scale)
2. You should not use the p-values corresponding to these correlation coefficients to guide your decisions. I would suggest not bothering to calculate them.
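To make r.drop concrete, here is a minimal check, assuming questionaire is the data frame of item scores passed to alpha() and that no items required reverse keying:
item <- questionaire[[1]]            # first item
rest <- rowSums(questionaire[-1])    # total score excluding that item
cor(item, rest)                      # should approximately match r.drop for Score1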

Calculating omega for factor analysis: NA result

I am trying to compute omega estimates after exploratory factor analysis to estimate the reliability of the components I've found. Using the omega() function from the psych package I get this output:
Alpha: 0.8
G.6: 0.86
Omega Hierarchical: 0.37
Omega H asymptotic: 0.43
Omega Total 0.86
Schmid Leiman Factor loadings greater than 0.2
g F1* F2* F3* h2 u2 p2
EMS1 0.30 0.71 0.59 0.41 0.15
EMS3 -0.21 0.64 0.53 0.47 0.05
EMS4 0.62 0.41 0.59 0.04
EMS7 0.34 0.62 0.50 0.50 0.23
EMS8 0.36 0.42 0.32 0.68 0.40
EMS9 0.57 0.33 0.67 0.00
EMS10 0.39 0.20 0.80 0.11
EMS11 0.72 0.51 0.49 0.02
EMS12 0.68 0.46 0.54 0.02
EMS15 0.54 -0.24 0.41 0.59 0.02
EMS16 0.22 0.77 0.63 0.37 0.08
EMS19 0.65 0.52 0.48 0.01
EMS20 0.27 0.53 0.36 0.64 0.21
EMS21 0.62 0.40 0.60 0.04
EMS23 0.63 0.42 0.58 0.07
EMS24 0.68 0.45 0.55 1.02
EMS25 0.73 0.56 0.44 0.95
EMS27 0.45 0.20 0.25 0.75 0.83
EMS28 0.78 0.59 0.41 1.02
EMS34 0.26 0.31 0.48 0.34 0.66 0.20
With eigenvalues of:
g F1* F2* F3*
2.5 3.4 2.9 0.0
general/max 0.73 max/min = Inf
mean percent general = 0.27 with sd = 0.36 and cv of 1.33
Explained Common Variance of the general factor = 0.28
The degrees of freedom are 133 and the fit is 0.8
The number of observations was 601 with Chi Square = 471.81 with prob < 1.9e-39
The root mean square of the residuals is 0.04
The df corrected root mean square of the residuals is 0.05
RMSEA index = 0.066 and the 10 % confidence intervals are 0.059 0.072
BIC = -379.21
Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 170 and the fit is 5.4
The number of observations was 601 with Chi Square = 3195.63 with prob < 0
The root mean square of the residuals is 0.22
The df corrected root mean square of the residuals is 0.24
RMSEA index = 0.173 and the 10 % confidence intervals are 0.167 0.177
BIC = 2107.87
Measures of factor score adequacy
g F1* F2* F3*
Correlation of scores with factors 0.9 0.94 0.93 0
Multiple R square of scores with factors 0.8 0.89 0.86 0
Minimum correlation of factor score estimates 0.6 0.78 0.73 -1
Total, General and Subset omega for each subset
g F1* F2* F3*
Omega total for total scores and subscales 0.86 0.82 0.85 NA
Omega general for total scores and subscales 0.37 0.08 0.34 NA
Omega group for total scores and subscales 0.58 0.75 0.51 NA
Warning messages:
1: In fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, :
A loading greater than abs(1) was detected. Examine the loadings carefully.
2: In fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, :
An ultra-Heywood case was detected. Examine the results carefully
3: In cov2cor(t(w) %*% r %*% w) :
diag(.) had 0 or NA entries; non-finite result is doubtful
This is how I am calling the function:
omega(df[,items],nfactors=3)
After searching for guidance, I could not find out why omega was not computed for the 3rd factor. I am not sure whether it is related to one of the warning messages above.
This could be because omega is calculated by fitting a CFA model, and in your case with 3 factors, factor number 3 is used for identification purposes, so you wouldn't expect omega to be calculated for it. That is consistent with the output above, where F3* has an eigenvalue of 0 and no salient loadings.

How to automate zero-inflated beta regressions reporting results in a table?

I have a matrix with 11 columns, named "env", with 1 response variable bounded between 0 and 1 ("R1") and 10 possible predictors ("P1" to "P10"). I would like to use zero-inflated beta regression (the gamlss package and function) to assess the individual effect of each predictor on my response variable, summarizing the AIC, estimate, and probability for each predictor in a table. The table should have the predictors as rows and the model statistics (AIC, estimate, and probability) as columns. This process must be repeated individually for the three parameters of the zero-inflated beta distribution (mu, nu, and sigma).
Here is a subset of my data matrix (sorry for not being able to simulate it following the guidelines):
P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 R1
600 243.89 2.68 180.32 1753.62 5.15 16.11 46.59 1.52 0.96 0.04
674 259.43 1.49 174.06 1230.71 5.50 19.42 45.65 1.62 0.88 0.28
231 156.19 0.00 151.68 1002.93 5.22 12.76 50.38 1.63 1.00 0.00
624 256.53 8.58 181.32 1194.07 5.35 25.33 58.74 1.33 0.94 0.36
773 346.91 15.59 180.17 1665.10 4.99 26.65 39.74 1.13 0.93 0.21
468 186.84 6.13 172.11 1570.75 5.34 18.72 38.52 1.55 0.97 0.10
340 478.28 14.68 169.06 1685.20 4.81 14.17 112.48 1.65 0.98 0.00
719 401.34 14.57 180.84 1824.18 4.74 13.46 129.70 1.67 0.98 0.00
603 831.58 7.79 158.69 1675.99 5.49 35.08 109.76 1.40 0.62 0.00
355 463.96 3.39 174.65 1987.08 4.26 25.69 85.57 1.56 0.98 0.03
527 560.11 32.18 175.29 2661.40 4.69 50.79 84.67 1.30 0.92 0.14
603 313.94 20.98 163.86 3211.07 4.60 45.15 86.36 1.36 0.93 0.02
508 571.58 40.62 118.69 2842.65 5.11 53.89 57.88 1.31 0.99 0.13
270 191.50 0.35 176.33 3280.57 4.75 51.99 127.10 1.29 0.51 0.12
353 770.72 0.05 173.76 2079.63 5.46 39.12 141.40 1.26 0.51 0.16
166 488.43 12.40 164.20 2692.61 4.55 41.28 107.06 1.40 0.91 0.13
881 316.41 32.43 156.37 2883.55 4.15 29.20 71.32 1.59 0.89 0.21
013 734.83 20.08 156.98 2044.81 4.72 49.62 113.42 1.35 0.98 0.20
526 452.85 33.85 164.58 1795.64 5.01 26.16 87.38 1.54 0.95 0.06
Below is the syntax for the gamlss function with the 3 parameters (note the argument for the sigma model is sigma.formula; gamlss has no si.formula argument):
m1.mu <- gamlss(formula = R1 ~ P1, family = BEZI, data = env, method = RS(100))
m1.nu <- gamlss(formula = R1 ~ P1, nu.formula = R1 ~ P1, family = BEZI, data = env, method = RS(100))
m1.si <- gamlss(formula = R1 ~ P1, sigma.formula = R1 ~ P1, family = BEZI, data = env, method = RS(100))
and here is the structure of the table I am trying to get:
AIC.mu EST.mu PROB.mu AIC.nu EST.nu PROB.nu AIC.si EST.si PROB.si
P1
P2
P3
…
P10
I assume I should be able to automate this with a combination of a for loop, lapply or apply, and cbind, but unfortunately I cannot manage to get it working. It would be great if you could give me a hand. Many thanks!
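Here is a rough sketch of one way to do it, looping over the predictors with sapply and stacking rows. It assumes env is a data.frame with columns R1 and P1..P10, that summary() on a gamlss fit invisibly returns its coefficient table (Estimate, Std. Error, t value, Pr(>|t|)) as recent gamlss versions do, and that the slope rows sit where the mu/sigma/nu stacking of BEZI puts them; both assumptions are worth checking against your installed version:
library(gamlss)

preds <- paste0("P", 1:10)
res <- t(sapply(preds, function(p) {
  f <- as.formula(paste("R1 ~", p))
  m.mu <- gamlss(f, family = BEZI, data = env, method = RS(100))
  m.nu <- gamlss(f, nu.formula = f, family = BEZI, data = env, method = RS(100))
  m.si <- gamlss(f, sigma.formula = f, family = BEZI, data = env, method = RS(100))
  c.mu <- summary(m.mu); c.nu <- summary(m.nu); c.si <- summary(m.si)
  # Coefficient rows stack as mu, sigma, nu: the predictor's slope is row 2
  # in the mu-only model, row 4 in the sigma model, and row 5 in the nu model
  c(AIC.mu = AIC(m.mu), EST.mu = c.mu[2, 1], PROB.mu = c.mu[2, 4],
    AIC.nu = AIC(m.nu), EST.nu = c.nu[5, 1], PROB.nu = c.nu[5, 4],
    AIC.si = AIC(m.si), EST.si = c.si[4, 1], PROB.si = c.si[4, 4])
}))
res  # predictors as rows, the nine AIC/EST/PROB columns as in the target table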
