Working with the "prostate" dataset in "ElemStatLearn" package.
library(caret)
set.seed(3434)
fit.lm = train(lpsa ~ ., data = trainset, method = "lm")
fit.ridge = train(lpsa ~ ., data = trainset, method = "ridge")
fit.lasso = train(lpsa ~ ., data = trainset, method = "lasso")
Comparing RMSE (at bestTune for ridge and lasso):
fit.lm$results[,"RMSE"]
[1] 0.7895572
fit.ridge$results[fit.ridge$results[,"lambda"]==fit.ridge$bestTune$lambda,"RMSE"]
[1] 0.8231873
fit.lasso$results[fit.lasso$results[,"fraction"]==fit.lasso$bestTune$fraction,"RMSE"]
[1] 0.7779534
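(As an aside, a small sketch assuming caret is attached: getTrainPerf() reports the resampled performance at bestTune directly, which avoids the manual subsetting above.)
getTrainPerf(fit.lm)     # one-row data frame with the resampled RMSE at bestTune
getTrainPerf(fit.ridge)
getTrainPerf(fit.lasso)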
Comparing the absolute values of the coefficients:
abs(round(fit.lm$finalModel$coefficients,2))
(Intercept) lcavol lweight age lbph svi lcp gleason pgg45
0.43 0.58 0.61 0.02 0.14 0.74 0.21 0.03 0.01
abs(round(predict(fit.ridge$finalModel, type = "coef", mode = "norm")$coefficients[8,],2))
lcavol lweight age lbph svi lcp gleason pgg45
0.49 0.62 0.01 0.14 0.65 0.05 0.00 0.01
abs(round(predict(fit.lasso$finalModel, type = "coef", mode = "norm")$coefficients[8,],2))
lcavol lweight age lbph svi lcp gleason pgg45
0.56 0.61 0.02 0.14 0.72 0.18 0.00 0.01
My question is: how can the "ridge" RMSE be higher than that of plain "lm"? Doesn't that defeat the very purpose of penalized regression versus plain "lm"?
Also, how can the absolute value of the "lweight" coefficient actually be higher in ridge (0.62) than in lm (0.61)? Both coefficients are positive before taking abs().
I was expecting ridge to behave like lasso, which not only reduced the RMSE but also shrank the coefficients relative to plain "lm".
Thank you!
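For what it's worth, here is a hedged sketch of how the three fits could share identical resamples so that their RMSE values are directly comparable; the fold object and the .cv names are illustrative, not part of the original code.
set.seed(3434)
folds <- createFolds(trainset$lpsa, k = 10, returnTrain = TRUE)
ctrl <- trainControl(method = "cv", index = folds)   # every model sees the same folds
fit.lm.cv <- train(lpsa ~ ., data = trainset, method = "lm", trControl = ctrl)
fit.ridge.cv <- train(lpsa ~ ., data = trainset, method = "ridge", trControl = ctrl)
fit.lasso.cv <- train(lpsa ~ ., data = trainset, method = "lasso", trControl = ctrl)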
I am trying to make an ROC curve using pROC with the two columns below (the list goes on to over 300 entries):
Actual_Findings_%   Predicted_Finding_Prob
0.23                0.6
0.48                0.3
0.26                0.62
0.23                0.6
0.48                0.3
0.47                0.3
0.23                0.6
0.6868              0.25
0.77                0.15
0.31                0.55
The code I tried to use is:
roccurve<- plot(roc(response = data$Actual_Findings_% <0.4, predictor = data$Predicted_Finding_Prob >0.5),
legacy.axes = TRUE, print.auc=TRUE, main = "ROC Curve", col = colors)
Where the threshold for positive findings is
Actual_Findings_% <0.4
AND
Predicted_Finding_Prob >0.5
(i.e. to be a TRUE POSITIVE, Actual_Findings_% would be LESS than 0.4 AND Predicted_Finding_Prob would be GREATER than 0.5)
but when I try to plot this roc curve, I get the error:
"Setting levels: control = FALSE, case = TRUE
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 'plot': Predictor must be numeric or ordered."
Any help would be much appreciated!
This should work:
data <- read.table( text=
"Actual_Findings_% Predicted_Finding_Prob
0.23 0.6
0.48 0.3
0.26 0.62
0.23 0.6
0.48 0.3
0.47 0.3
0.23 0.6
0.6868 0.25
0.77 0.15
0.31 0.55
", header=TRUE, check.names=FALSE )
library(pROC)
roccurve <- plot(
roc(
response = data$"Actual_Findings_%" <0.4,
predictor = data$"Predicted_Finding_Prob"
),
legacy.axes = TRUE, print.auc=TRUE, main = "ROC Curve"
)
Now, importantly, the ROC curve is there to show you what happens as you vary your classification threshold. So one thing you did wrong was to enforce a threshold yourself by dichotomizing the predictor at 0.5.
This does, however, give perfect separation, which is nice, I guess (though bad for educational purposes).
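As a small follow-up sketch (reusing the data frame built above, with pROC already attached), coords() and auc() show what the curve itself suggests instead of a hand-picked cutoff:
r <- roc(response = data$"Actual_Findings_%" < 0.4,
         predictor = data$"Predicted_Finding_Prob")
coords(r, "best", ret = c("threshold", "sensitivity", "specificity"))   # data-driven cutoff
auc(r)   # 1 here, because this toy data is perfectly separated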
I am new to Stan and rstan.
I recently ran into what looks like a weird issue while working with Markov chain Monte Carlo (MCMC). In short: suppose the data have 10 observations, say IDs 1 to 10. Now I permute them by moving the 10th row between the original first and second rows, giving the order 1, 10, 2 to 9. The two orderings of the data give different estimates, even though I fix the same random seed.
To illustrate the issue more simply, I wrote the following R scripts.
##TEST 01
# generate data
N <- 100
set.seed(123)
Y <- rnorm(N, 1.6, 0.2)
stan_code1 <- "
data {
int <lower=0> N; //number of data
real Y[N]; //data in an (C++) array
}
parameters {
real mu; //mean parameter of a normal distribution
real <lower=0> sigma; //standard deviation parameter of a normal distribution
}
model {
//prior distributions for parameters
mu ~ normal(1.7, 0.3);
sigma ~ cauchy(0, 1);
//likelihood of Y given parameters
for (i in 1:N) {
Y[i] ~ normal(mu, sigma);
}
}
"
# compile model
library(rstan)
model1 <- stan_model(model_code = stan_code1) #usually, take half a minute to run
# pass data to stan and run model
set.seed(123)
fit <- sampling(model1, list(N=N, Y=Y), iter=200, chains=4)
print(fit)
# mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
# mu 1.62 0.00 0.02 1.58 1.61 1.62 1.63 1.66 473 1.00
# sigma 0.18 0.00 0.01 0.16 0.18 0.18 0.19 0.21 141 1.02
# lp__ 117.84 0.07 0.85 115.77 117.37 118.07 118.51 118.78 169 1.01
Yp <- Y[c(1,100,2:99)]
set.seed(123)
fit2 <- sampling(model1, list(N=N, Y=Yp), iter=200, chains=4)
print(fit2)
# mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
# mu 1.62 0.00 0.02 1.59 1.61 1.62 1.63 1.66 480 0.99
# sigma 0.18 0.00 0.01 0.16 0.18 0.18 0.19 0.21 139 1.02
# lp__ 117.79 0.09 0.95 115.72 117.35 118.05 118.49 118.77 124 1.01
As we can see from this simple case, the two results, fit and fit2, are different.
And, even stranger, if I write the likelihood before the priors in the model block (previously, the priors came before the likelihood), the same random seed and the same data still give different estimates.
##TEST 01'
# generate data
#N <- 100
set.seed(123)
Y <- rnorm(N, 1.6, 0.2)
stan_code11 <- "
data {
int <lower=0> N; //number of data
real Y[N]; //data in an (C++) array
}
parameters {
real mu; //mean parameter of a normal distribution
real <lower=0> sigma; //standard deviation parameter of a normal distribution
}
model {
//likelihood of Y given parameters
for (i in 1:N) {
Y[i] ~ normal(mu, sigma);
}
//prior distributions for parameters
mu ~ normal(1.7, 0.3);
sigma ~ cauchy(0, 1);
}
"
# compile model
#library(rstan)
model11 <- stan_model(model_code = stan_code11) #usually, take half a minute to run
# pass data to stan and run model
set.seed(123)
fit11 <- sampling(model11, list(N=N, Y=Y), iter=200, chains=4)
print(fit11)
# mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
# mu 1.62 0.00 0.02 1.58 1.61 1.62 1.63 1.66 455 0.99
# sigma 0.19 0.00 0.01 0.16 0.18 0.18 0.20 0.21 94 1.04
# lp__ 117.68 0.08 0.93 115.24 117.18 117.90 118.45 118.77 149 1.01
##TEST01 was
# mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
# mu 1.62 0.00 0.02 1.58 1.61 1.62 1.63 1.66 473 1.00
# sigma 0.18 0.00 0.01 0.16 0.18 0.18 0.19 0.21 141 1.02
# lp__ 117.84 0.07 0.85 115.77 117.37 118.07 118.51 118.78 169 1.01
Stan does not use the same pseudo-random-number generator as R. Thus, calling set.seed(123) only makes Y reproducible; it does not make the MCMC sampling reproducible. To accomplish the latter, you need to pass an integer as the seed argument to the stan (or sampling) function in the rstan package, like
sampling(model11, list(N = N, Y = Y), seed = 1234)
Even then, I could imagine that permuting the observations could result in different realizations of the draws from the posterior distribution for floating-point reasons. But none of this really matters (unless you run too few iterations or get warning messages), because the posterior distribution is the same even if a finite set of realizations from it comes out as different numbers.
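For instance, a minimal sketch reusing model1, Y, and Yp from the question (the seed value is arbitrary):
# fix Stan's own seed so both runs start from the same pseudo-random stream
fit_a <- sampling(model1, list(N = N, Y = Y), iter = 200, chains = 4, seed = 1234)
fit_b <- sampling(model1, list(N = N, Y = Yp), iter = 200, chains = 4, seed = 1234)
print(fit_a, pars = c("mu", "sigma"))
print(fit_b, pars = c("mu", "sigma"))
# the summaries may still differ slightly, since permuting the data changes the
# floating-point summation order, but both runs target the same posterior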
I'm using the psych package for factor analysis. I want to specify the labels of the latent factors, either in the fa() object, or when graphing with fa.diagram().
For example, with toy data:
require(psych)
n <- 100
choices <- 1:5
df <- data.frame(a=sample(choices, replace=TRUE, size=n),
b=sample(choices, replace=TRUE, size=n),
c=sample(choices, replace=TRUE, size=n),
d=sample(choices, replace=TRUE, size=n))
model <- fa(df, nfactors=2, fm="pa", rotate="promax")
model
Factor Analysis using method = pa
Call: fa(r = df, nfactors = 2, rotate = "promax", fm = "pa")
Standardized loadings (pattern matrix) based upon correlation matrix
PA1 PA2 h2 u2 com
a 0.45 -0.49 0.47 0.53 2.0
b 0.22 0.36 0.17 0.83 1.6
c -0.02 0.20 0.04 0.96 1.0
d 0.66 0.07 0.43 0.57 1.0
I want to change PA1 and PA2 to FactorA and FactorB, either by changing the model object itself or by adjusting the labels in the output of fa.diagram().
The docs for fa.diagram have a labels argument, but no examples, and the experimentation I've done so far hasn't been fruitful. Any help much appreciated!
With str(model) I found the $loadings attribute, which fa.diagram() uses to render the diagram. Modifying colnames() of model$loadings did the trick.
colnames(model$loadings) <- c("FactorA", "FactorB")
fa.diagram(model)
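If you would rather not overwrite the fitted object, the same idea works on a copy (a minor variation, nothing more):
model2 <- model
colnames(model2$loadings) <- c("FactorA", "FactorB")
fa.diagram(model2)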
I tried to create a mixed-effects logistic regression model using the glmer() function; however, the model does not converge. First, I converted the categorical variables from plain vectors to factors.
schwa_completed_2$Outcome <- as.factor(schwa_completed_2$Outcome)
schwa_completed_2$frequency_grouped <- as.factor(schwa_completed_2$frequency_grouped)
schwa_completed_2$sonority_grouped <- as.factor(schwa_completed_2$sonority_grouped)
schwa_completed_2$participant_gender <- as.factor(schwa_completed_2$participant_gender)
schwa_completed_2$participant_age_group <- as.factor(schwa_completed_2$participant_age_group)
schwa_completed_2$Speaker <- as.factor(schwa_completed_2$Speaker)
There is also one more continuous variable. Then I created the model:
model <- glmer(Outcome ~ frequency_grouped + sonority_grouped + syl_sec_EN +
participant_gender + participant_age_group + 1|Speaker,
data = schwa_completed_2, family = binomial, optimizer = "bobyqa")
Unfortunately, the model does not converge. If I drop the "Speaker" effect the model works just fine; however, the results are then probably skewed.
Warning messages:
1: In commonArgs(par, fn, control, environment()) :
maxfun < 10 * length(par)^2 is not recommended.
2: In optwrap(optimizer, devfun, start, rho$lower, control = control, :
convergence code 1 from bobyqa: bobyqa -- maximum number of function
evaluations exceeded
3: In (function (fn, par, lower = rep.int(-Inf, n), upper = rep.int(Inf, :
failure to converge in 10000 evaluations
4: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model failed to converge with max|grad| = 0.0785481 (tol = 0.001, component 1)
Generalized linear mixed model fit by maximum likelihood (Laplace
Approximation) ['glmerMod']
Family: binomial ( logit )
Formula: Outcome ~ frequency_grouped + sonority_grouped + syl_sec_EN +
participant_gender + participant_age_group + 1 | Speaker
Data: schwa_completed_2
AIC BIC logLik deviance df.resid
1820.8 2066.1 -864.4 1728.8 1486
Scaled residuals:
Min 1Q Median 3Q Max
-2.5957 -0.6255 -0.3987 0.7714 3.4432
Random effects:
Groups Name Variance Std.Dev. Corr
Speaker (Intercept) 2.08476 1.4439
frequency_groupedmoderately_frequent 0.78914 0.8883 -0.15
frequency_groupedvery_frequent 3.07514 1.7536 -0.90 0.35
sonority_groupedsonorants 1.33795 1.1567 0.82 -0.44 -0.91
sonority_groupedstops 1.76849 1.3298 0.02 -0.42 -0.36 0.51
sonority_groupedvowels 2.97690 1.7254 0.23 0.02 -0.32 0.55 0.77
syl_sec_EN 0.03217 0.1794 -0.62 -0.42 0.32 -0.44 0.11 -0.52
participant_genderM 0.41458 0.6439 -0.86 -0.18 0.77 -0.77 -0.24 -0.62 0.82
participant_age_groupY 0.52428 0.7241 0.46 0.80 -0.20 0.06 -0.44 0.08 -0.73 -0.63
Number of obs: 1532, groups: Speaker, 40
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.7650 0.1862 -4.108 3.99e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
convergence code: 0
Model failed to converge with max|grad| = 0.0785481 (tol = 0.001, component 1)
failure to converge in 10000 evaluations
Is it because the model is too complicated, or because my laptop is not powerful enough? I don't know what I should do at this point. Is there anything I can do to fix this?
OK, so what helped me was grouping the speakers with group_by and then scaling the syl_sec_EN variable.
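A minimal sketch of that scaling step, assuming the same data frame as in the question; the column name syl_sec_EN_z and model_z are illustrative, and the random intercept is written as (1 | Speaker) with explicit parentheses (without them, R's formula precedence puts the whole fixed part on the left of |, which is what produces the large random-effects table above):
schwa_completed_2$syl_sec_EN_z <- scale(schwa_completed_2$syl_sec_EN)[, 1]   # centre and scale
library(lme4)
model_z <- glmer(Outcome ~ frequency_grouped + sonority_grouped + syl_sec_EN_z +
                   participant_gender + participant_age_group + (1 | Speaker),
                 data = schwa_completed_2, family = binomial,
                 control = glmerControl(optimizer = "bobyqa"))
summary(model_z)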
I have a logistic regression model, for which I have been using the rms package. The model fits best using a log term for tn1, and for clinical interpretation I’m using log2. I ran the model using lrm from the rms package, and then to double check, I ran it using glm. The initial coefficients are the same:
h <- lrm(formula = outcomehosp ~ I(log2(tn1 + 0.001)) + apscore_ad +
emsurg + corrapiidiag, data = d, x = TRUE, y = TRUE)
Coef S.E. Wald Z Pr(>|Z|)
Intercept -3.4570 0.3832 -9.02 <0.0001
tn1 0.0469 0.0180 2.60 0.0093
apscore_ad 0.1449 0.0127 11.44 <0.0001
emsurg 0.0731 0.3228 0.23 0.8208
f <- glm(formula = outcomehosp ~ apscore_ad + emsurg + corrapiidiag +
I(log2(tn1 + 0.001)), family = binomial(), data = tn24)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.45699 0.38315 -9.023 < 2e-16
I(log2(tn1 + 0.001)) 0.04690 0.01804 2.600 0.00932
apscore_ad 0.14487 0.01267 11.438 < 2e-16
emsurg 0.07310 0.32277 0.226 0.82082
However, when I try to get the odds ratios, they are noticeably different for tn1 between the two models, and this doesn't seem to be just the log2 transformation.
summary(h)
Effects Response : outcomehosp
Factor Low High Diff. Effect S.E. Lower 0.95 Upper 0.95
tn1 0 0.21 0.21 0.362120 0.15417 6.5300e-02 0.673990
Odds Ratio 0 0.21 0.21 1.436400 NA 1.0675e+00 1.962100
apscore_ad 14 25.00 11.00 1.593600 0.15631 1.3605e+00 1.961000
Odds Ratio 14 25.00 11.00 4.921400 NA 3.8981e+00 7.106600
emsurg 0 1.00 1.00 0.073103 0.33051 -5.8224e-01 0.734860
Odds Ratio 0 1.00 1.00 1.075800 NA 5.5865e-01 2.085200
exp(f$coefficients)
(Intercept) 0.03152467
apscore_ad 1.15589222
emsurg 1.07584115
I(log2(tn1 + 0.001)) 1.04802
Would anyone be able to explain what the rms package is calculating the odds ratio of? Many thanks.
The tn1 effect from summary(h) is the effect, on the log odds ratio scale, of tn1 going from 0 to 0.21 (the interquartile range). See ?summary.rms.
So the effect in the first row of summary(h) is 0.36212 = (log2(0.211) - log2(0.001)) * 0.0469.
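In code, a quick check of that arithmetic (all numbers taken from the outputs above):
b_tn1 <- 0.0469                                     # coefficient of I(log2(tn1 + 0.001))
effect <- (log2(0.21 + 0.001) - log2(0 + 0.001)) * b_tn1
effect        # ~0.362, the 'Effect' row of summary(h)
exp(effect)   # ~1.44, the 'Odds Ratio' row of summary(h)
exp(b_tn1)    # ~1.048, what exp(f$coefficients) reports: the odds ratio per doubling of tn1 + 0.001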