Error when predicting with DirichletReg package in R

I am trying to make predictions on a test set using the DirichReg function from the DirichletReg package. When I run the model with only a few predictors it works fine, but when I use more than ~5 predictors I get an error that I can't figure out. The code below creates an MWE that reproduces the error.
library(DirichletReg)
set.seed(1)
# create dataset
predictor1 <- rnorm(n = 1000, mean = 5, sd = 1)
predictor2 <- rnorm(n = 1000, mean = 5, sd = 1)
predictor3 <- rnorm(n = 1000, mean = 5, sd = 1)
predictor4 <- rnorm(n = 1000, mean = 5, sd = 1)
predictor5 <- rnorm(n = 1000, mean = 5, sd = 1)
predictor6 <- rnorm(n = 1000, mean = 5, sd = 1)
prob_A <- runif(n = 1000, min = 0, max = 0.5)
prob_B <- runif(n = 1000, min = 0, max = 0.5)
prob_C <- 1 - prob_A - prob_B
dat <- data.frame(predictor1, predictor2, predictor3, predictor4, predictor5,
predictor6, prob_A, prob_B, prob_C)
# split data into training and test sets
train_vec <- sample(c(0, 1), size = nrow(dat), replace = T, prob = c(0.2, 0.8))
train_dat <- dat[train_vec == 1, ]
test_dat <- dat[train_vec == 0, ]
# run model
train_dat$prob <- DR_data(train_dat[, c('prob_A', 'prob_B', 'prob_C')])
mod <- DirichReg(prob ~ predictor1 + predictor2 + predictor3 + predictor4 +
                   predictor5 + predictor6,
                 data = train_dat, model = 'common')
# run predictions
test_dat$prob <- DR_data(test_dat[, c('prob_A', 'prob_B', 'prob_C')])
preds <- predict(object = mod, newdata = test_dat)
Here's the error that I'm getting:
Error in parse(text = x, keep.source = FALSE) :
<text>:1:74: unexpected '|'
1: prob ~ predictor1 + predictor2 + predictor3 + predictor4 + predictor5 + |
^
I would appreciate any help. I haven't been able to google the error or find it in the package documentation.

This seems to be a bug in the package. I recommend that you contact the package maintainer to report it.
A possible workaround is to explicitly list the separate parts of the regression specification instead of relying on the package to internally replicate the regressors for all parts.
mod2 <- DirichReg(prob ~
                    predictor1 + predictor2 + predictor3 + predictor4 + predictor5 + predictor6 |
                    predictor1 + predictor2 + predictor3 + predictor4 + predictor5 + predictor6 |
                    predictor1 + predictor2 + predictor3 + predictor4 + predictor5 + predictor6,
                  data = train_dat, model = "common")
all.equal(coef(mod), coef(mod2))
## [1] TRUE
predict(mod2, newdata = test_dat)
## [,1] [,2] [,3]
## [1,] 0.2436493 0.2715895 0.4847612
## [2,] 0.2541715 0.2252292 0.5205993
## [3,] 0.2618741 0.2345063 0.5036196
## ...
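For what it's worth, the truncated formula in the error message suggests (this is an assumption, not something confirmed against the package source) that the package deparses the model formula and ends up using only the first line of the result; base R's deparse() splits long expressions at roughly 60 characters, which is about where a formula with six ten-character predictor names gets cut off. A quick illustration:
# deparse() returns more than one string once an expression exceeds ~60 characters,
# which matches the truncated "... + predictor5 + |" text in the error above.
f_long  <- prob ~ predictor1 + predictor2 + predictor3 + predictor4 + predictor5 + predictor6
f_short <- prob ~ predictor1 + predictor2
length(deparse(f_long))   # > 1 for the long formula
length(deparse(f_short))  # 1 for the short one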

Related

Adaptive LASSO in bayesQR

I have been playing with the bayesQR package, and want to apply it to an application that calls for variable selection using LASSO. As far as I understand, this is possible in bayesQR, but I haven't been able to get any variables dropped. My toy example is below, where the extraneous variables (c and d) are dropped by glmnet, but not by bayesQR.
Is there something fundamental that I am missing? How can I get model4 below to actually drop variables?
library(data.table)
library(bayestestR)
library(bayesQR)
library(glmnet)
# Generate data
n = 250
seed = 22
noise_sd = 5
set.seed(seed)
dt = data.table(a = runif(n, min = 0, max = 10),
                b = runif(n, min = 0, max = 10),
                c = runif(n, min = 0, max = 10),
                d = rnorm(n, sd = .01 * noise_sd))
dt[, y := (a + rnorm(n, sd = 1 * noise_sd)) * 2 + (b + rnorm(n, sd = 2 * noise_sd)) + rnorm(n, sd = 2 * noise_sd)]
formula = y ~ a + b + c + d
# Just with GLM
model1 = glm(data = dt,
formula = formula)
# LASSO with glmnet
x = as.matrix(dt[, .(a,b,c,d)])
y = dt$y
cv_model <- cv.glmnet(x, y, alpha = 1)
best_lambda <- cv_model$lambda.min
model2 <- glmnet(x, y, alpha = 1, lambda = best_lambda)
print(coef(model2))
# Quantile regression with bayesQR
model3 = bayesQR(data = dt,
                 formula = formula,
                 ndraw = 5000,
                 seed = seed)
# Quantile regression with bayesQR using adaptive lasso
model4 = bayesQR(data = dt,
                 formula = formula,
                 ndraw = 5000,
                 alasso = TRUE,
                 seed = seed)
message('GLM')
print(summary(model1))
message('glmnet with LASSO')
print(coef(model2))
message('bayesQR')
print(summary(model3))
message('bayesQR with ALASSO')
print(summary(model4))
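One thing worth keeping in mind when comparing these outputs (offered as a note, not a full answer): the Bayesian adaptive lasso shrinks coefficients toward zero through the prior, but posterior draws are essentially never exactly zero, so bayesQR will not literally drop c and d the way glmnet does. A hedged sketch of one way to judge which coefficients are effectively zero, assuming the fitted object stores its posterior draws per quantile in an element named betadraw (check str(model4) to confirm before relying on this):
# Assumed structure: model4[[1]]$betadraw is a draws-by-coefficients matrix.
draws <- model4[[1]]$betadraw
colnames(draws) <- c("(Intercept)", "a", "b", "c", "d")
# 95% credible intervals after discarding the first 1000 draws as burn-in
ci <- t(apply(draws[-(1:1000), ], 2, quantile, probs = c(0.025, 0.975)))
print(ci)
# Coefficients whose interval covers zero are candidates for removal
print(ci[, 1] < 0 & ci[, 2] > 0)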

How long should a Monte Carlo bootstrap power analysis simulation in R take? Is it potentially hours? (1000 reps, 1000 bootstraps)

I am using a Monte Carlo simulation to run a power analysis for a longitudinal mediation model, using the power.boot function from the bmem package (which builds on lavaan).
I checked the code with only 5 reps / 5 bootstraps to make sure it worked, and it did.
Then I ran the code with 1000 reps and 1000 bootstraps, as the package documentation recommends.
It's been over an hour now and it's still running - is this normal? How long is too long?
powermodel1 <-'
x2 ~ start(.6)*x1 + x*x1
x3 ~ start(.6)*x2 + x*x2
m2 ~ start(.15)*x1 + a*x1 + start(.3)*m1 + m*m1
m3 ~ start(.15)*x2 + a*x2 + start(.3)*m2 + m*m2
y2 ~ start(.5)*m1 + b*m2 + start(.3)*y1 + y*y1
y3 ~ start(.5)*m2 + b*m2 + start(.3)*y2 + y*y2 + start(0.05)*x1 + c*x1
x1 ~~ start(.15)*m1
x1 ~~ start(.15)*y1
y1 ~~ start(.5)*m1
'
indirect <- 'ab:=a*b'
N<-200
system.time(bootstrap<-power.boot(powermodel1, indirect, N, nrep=1000, nboot=1000, parallel = 'multicore'))
summary(bootstrap)
Unfortunately, extrapolating from the smaller runs below, it looks like it will take a while: roughly 8 hours on my system.
library(bmem)
powermodel1 <-'
x2 ~ start(.6)*x1 + x*x1
x3 ~ start(.6)*x2 + x*x2
m2 ~ start(.15)*x1 + a*x1 + start(.3)*m1 + m*m1
m3 ~ start(.15)*x2 + a*x2 + start(.3)*m2 + m*m2
y2 ~ start(.5)*m1 + b*m2 + start(.3)*y1 + y*y1
y3 ~ start(.5)*m2 + b*m2 + start(.3)*y2 + y*y2 + start(0.05)*x1 + c*x1
x1 ~~ start(.15)*m1
x1 ~~ start(.15)*y1
y1 ~~ start(.5)*m1
'
indirect <- 'ab:=a*b'
N<-200
system.time(bootstrap<-bmem::power.boot(powermodel1, indirect, N, nrep = 10, nboot = 10, parallel = 'multicore'))
system.time(bootstrap<-bmem::power.boot(powermodel1, indirect, N, nrep = 30, nboot = 30, parallel = 'multicore'))
system.time(bootstrap<-bmem::power.boot(powermodel1, indirect, N, nrep = 60, nboot = 60, parallel = 'multicore'))
system.time(bootstrap<-bmem::power.boot(powermodel1, indirect, N, nrep = 100, nboot = 100, parallel = 'multicore'))
library(tidyverse)
# Load the times from above into a dataframe
benchmark <- tibble(bootstraps = c(10, 30, 60, 100),
times = c(4.021, 30.122, 121.103, 311.236))
# Plot the points and fit a curve
ggplot(benchmark, aes(x = bootstraps, y = times)) +
  geom_point() +
  geom_smooth(se = FALSE, span = 5)
# Fit a model
fit <- lm(data = benchmark, times ~ poly(bootstraps, 2, raw = TRUE))
newtimes <- data.frame(bootstraps = seq(100, 1000, length = 4))
# Predict the time it will take for larger bootstrap/rep values
predict(fit, newdata = newtimes)
>          1          2          3          4
>   311.6829  4568.3812 13789.6754 27975.5655
# Convert from seconds to hours
print(27975.5655/60/60)
> [1] 7.77099
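As a sanity check on that extrapolation, if run time scales roughly with nrep * nboot (a simplifying assumption), the 10 x 10 timing alone gives an estimate in the same ballpark:
# Rough scaling estimate (assumes time is proportional to nrep * nboot):
# the 10 x 10 run took ~4 seconds, so scale up by (1000*1000)/(10*10).
4.021 * (1000 * 1000) / (10 * 10) / 3600  # about 11 hours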

geom_contour error: "Not possible to generate contour data all z values are equal"

I was trying to draw a decision boundary from a training data set onto a testing data set with geom_contour() from ggplot2. The decision boundary came from the training results of a support vector classifier. But it kept producing this error message:
Not possible to generate contour data
all z values are equal
Could someone help me with this? Below is my code.
#libraries
library(tidyverse)
library(caret)
library(kernlab)
set.seed(6758)
The data sets were set up so that they would have a non-linear but clear decision boundary.
#data sets
df_train <- tibble(X1 = runif(100, min = 0, max = 20),
                   X2 = runif(100, min = 0, max = 20),
                   Y = X1 * X2) %>%
  mutate(result = ifelse(Y <= 90, "fail", "success"))
df_test <- tibble(X1 = runif(100, min = 0, max = 20),
                  X2 = runif(100, min = 0, max = 20),
                  Y = X1 * X2) %>%
  mutate(result = ifelse(Y <= 90, "fail", "success"))
#for train()
X1_train <- df_train %>% dplyr::select(X1, X2)
Y1_train <- df_train$result
As I said, I was fitting an SVM with train() from caret:
svm_radial <- train(
  x = X1_train,
  y = Y1_train,
  method = "svmRadial",
  trControl = trainControl(method = "cv", number = 10)
)
#training results
predr_train <- cbind(df_train, pred = predict(svm_radial, newdata = X1_train))
df_test %>%
  ggplot(aes(x = X1, y = X2, color = result)) +
  geom_point() +
  geom_contour(data = predr_train,
               aes(z = as.numeric(pred)))
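For reference, geom_contour() expects z to be evaluated on a regular grid of x/y values rather than on scattered training points. A minimal sketch of the usual approach, predicting the fitted model over a grid and contouring the predicted class (the 200 x 200 grid resolution is an arbitrary choice):
# Predict the trained classifier on a regular grid over the X1/X2 plane,
# then draw the boundary between the two predicted classes.
grid <- expand.grid(X1 = seq(0, 20, length.out = 200),
                    X2 = seq(0, 20, length.out = 200))
grid$pred <- predict(svm_radial, newdata = grid)
df_test %>%
  ggplot(aes(x = X1, y = X2, color = result)) +
  geom_point() +
  geom_contour(data = grid,
               aes(x = X1, y = X2, z = as.numeric(pred)),
               breaks = 1.5, colour = "black", inherit.aes = FALSE)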

Visualization of predict glm using multiple variables in R

I would like to use the following dataset to fit a glm and visualize its predictions from predict().
y=c(-18.948,-19.007,-18.899,-19.022,-20.599,-19.778,-17.344,-20.265,-20.258,-19.886,-18.05,-19.824,-20.1,-20.508,-20.455,-16.573,-20.249,-20.205,-20.436,-16.358,-17.717,-19.794,-20.372,-19.944,-20.072,-19.889,-20.139,-19.132,-20.275,-19.953,-19.769,-20.2,-19.638,-17.419,-19.086,-18.347,-18.73,-18.872,-18.956,-19.28,-18.176,-19.036,-18.084,-20.11,-19.641,-19.656,-19.25,-18.68,-19.089,-18.969,-18.161,-17.603,-20.37,-19.233,-18.961,-19.083,-20.118,-19.795,-17.154,-16.75)
x1=c(9.698,9.583,9.356,9.326,9.438,9.733,8.803,8.973,9.141,9.044,8.788,9.377,9.26,10.186,9.035,9.569,9.431,9.09,8.776,9.117,9.393,9.408,9.307,8.868,8.398,8.407,9.364,9.074,8.444,9.122,10.11,7.81,9.777,6.472,9.521,8.92,9.341,9.446,9.08,8.071,8.047,8.019,7.419,9.022,9.981,9.337,9.989,10.013,9.31,10.843,8.337,9.103,6.438,9.372,9.071,8.749,9.016,8.181,9.284,8.44)
x2=c('S03','S03','S03','S03','S03','S03','S03','S03','S03','S03','S03','S03','S03','S03','S03','S03','S03','S03','S03','S03','S04','S04','S04','S04','S04','S04','S06','S06','S06','S06','S06','S06','S06','S06','S07','S07','S07','S07','S07','S07','S07','S07','S07','S08','S08','S09','S09','S09','S09','S09','S09','S09','S10','S03','S03','S03','S04','S04','S07','S07')
x3=c('A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','P1','P1','P1','P1','P1','P1','P1')
mydata <- data.frame(y,x1,x2,x3)
Fit glm Model:
myglm <- glm(y ~ x1+x2+x3+x1:x2, family="gaussian", data= mydata)
Prediction:
1). Extract the range of x1
min <- min(mydata$x1)
max <- max(mydata$x1)
2). Create a new data frame, new.x
Here comes the question:
How should I include x2 and x3 in the new.x?
new.x <- data.frame(
  x1 = seq(min, max, length = 60),
  x2 = ???,
  x3 = ???)
Then predict new.y with myglm:
new.y = predict(myglm, newdata=new.x, se.fit=TRUE)
Combine new.x and new.y:
addThese <- data.frame(new.x, new.y)
Compute the confidence interval:
addThese <- mutate(addThese,
                   d15N = exp(fit),
                   lwr = exp(fit - 1.96 * se.fit),
                   upr = exp(fit + 1.96 * se.fit))
3). Visualize the original data points with the glm prediction smooth line added:
ggplot(addThese, aes(x1, fit)) +
  geom_point(shape = 21, size = 3) +
  geom_smooth(data = addThese,
              aes(ymin = lwr, ymax = upr),
              stat = 'identity')
I'm not sure whether this is the right way to create new.data, but I'll give it a try. So, with your data, slightly modifying your code:
myglm <- glm(y ~ x1 + x2 + x3 + x1:x2, family = gaussian, data = mydata)
minx <- min(mydata$x1)
maxx <- max(mydata$x1)
# create data with all combinations of x1, x2, x3
new.data <- expand.grid(x1 = seq(minx, maxx, length.out = 60),
                        x2 = unique(mydata$x2),
                        x3 = unique(mydata$x3))
# visualize data
data.frame(predict(myglm, newdata = new.data, se.fit = T)[1:2]) %>%
  bind_cols(new.data) %>%
  mutate(d15N = exp(fit), lwr = fit - 1.96 * se.fit, upr = fit + 1.96 * se.fit) %>%
  ggplot(aes(x = x1, y = fit, colour = interaction(x2, x3))) +
  geom_point(size = 1, alpha = .75, pch = 19, position = "jitter") +
  geom_smooth(aes(ymin = lwr, ymax = upr), stat = "identity", alpha = .5) +
  facet_wrap(~interaction(x2, x3, sep = " : "), nrow = 5) +
  ggthemes::theme_few() +
  labs(y = "Predicted value", x = bquote(x[1])) +
  theme(legend.position = "none")
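One possible refinement (a sketch, not part of the answer above): expand.grid() also generates x2/x3 combinations that never occur in mydata, so the corresponding facets show pure extrapolation. If only observed combinations are wanted, the grid can be filtered against the data first:
# Keep only the x2/x3 combinations actually present in mydata
# (new.data.obs is a hypothetical name; requires dplyr, loaded via tidyverse).
new.data.obs <- new.data %>%
  semi_join(distinct(mydata, x2, x3), by = c("x2", "x3"))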

Curve function in R with expr different from x

I have found a few examples online, but none of them could help me solve my problem. I know that expr needs x, but I couldn't find a way to plot these two functions using the curve function. I am able to plot them when the model has only one independent variable, but not with more than one. Here is the code:
n <- 50
x1 <- runif(n = n, min = 0, max = 1)
x2 <- rnorm(n,mean = -50,1)
x3 <- rnorm(n=n,mean =50,sd=8)
z <- 3 - 4.2*x1 - x2 - x3
pr <- 1/(1+exp(-z))
y <- rbinom(n=n,1,pr)
y
# create dataframe
df = data.frame(y=y,x1=x1,x2=x2,x3=x3)
m <- glm( y~.,data=df,family=binomial(link = "logit"))
summary(m)
beta.hat <- m$coefficients
z.hat <- beta.hat[1] + beta.hat[2]*x1 + beta.hat[3]*x2 + beta.hat[4]*x3
curve(expr = exp(z) / (1 + exp(z)), xlim = c(0,1), ylab = expression(pi), n = 1000, lwd = 3, xlab = expression(z/hat(z)))
curve(expr = exp(z.hat)/(1 + exp(z.hat)), xlim = c(0,1), add = TRUE, col = "red", n = 1000)
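Since curve() only evaluates an expression in x, one workaround (a sketch, not the only option) is to let x play the role of the linear predictor and plot the logistic function over the range of z.hat, then overlay the fitted values:
# Plot pi = exp(x)/(1+exp(x)) with x standing in for the linear predictor z.hat,
# over the range actually observed in the data, and add the fitted points.
curve(exp(x) / (1 + exp(x)),
      from = min(z.hat), to = max(z.hat),
      n = 1000, lwd = 3,
      xlab = expression(hat(z)), ylab = expression(pi))
points(z.hat, fitted(m), col = "red", pch = 16)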
