Error when predicting with DirichletReg package in R - r
I am trying to make predictions on a test set using the DirichReg function from the DirichletReg package. When I run the model with only a few predictors it works fine, but when I use more than ~5 predictors I get an error that I can't figure out. The code below creates an MWE that reproduces the error.
library(DirichletReg)
set.seed(1)
# create dataset
predictor1 <- rnorm(n = 1000, mean = 5, sd = 1)
predictor2 <- rnorm(n = 1000, mean = 5, sd = 1)
predictor3 <- rnorm(n = 1000, mean = 5, sd = 1)
predictor4 <- rnorm(n = 1000, mean = 5, sd = 1)
predictor5 <- rnorm(n = 1000, mean = 5, sd = 1)
predictor6 <- rnorm(n = 1000, mean = 5, sd = 1)
prob_A <- runif(n = 1000, min = 0, max = 0.5)
prob_B <- runif(n = 1000, min = 0, max = 0.5)
prob_C <- 1 - prob_A - prob_B
dat <- data.frame(predictor1, predictor2, predictor3, predictor4, predictor5,
predictor6, prob_A, prob_B, prob_C)
# split data into training and test sets
train_vec <- sample(c(0, 1), size = nrow(dat), replace = TRUE, prob = c(0.2, 0.8))
train_dat <- dat[train_vec == 1, ]
test_dat <- dat[train_vec == 0, ]
# run model
train_dat$prob <- DR_data(train_dat[, c('prob_A', 'prob_B', 'prob_C')])
mod <- DirichReg(prob ~ predictor1 + predictor2 + predictor3 + predictor4 +
predictor5 + predictor6,
data = train_dat, model = 'common')
# run predictions
test_dat$prob <- DR_data(test_dat[, c('prob_A', 'prob_B', 'prob_C')])
preds <- predict(object = mod, newdata = test_dat)
Here's the error that I'm getting:
Error in parse(text = x, keep.source = FALSE) :
<text>:1:74: unexpected '|'
1: prob ~ predictor1 + predictor2 + predictor3 + predictor4 + predictor5 + |
^
I would appreciate any help. I haven't been able to google the error or find it in the package documentation.
This seems to be a bug in the package. I recommend that you contact the package maintainer to report it.
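One plausible reading of the error, which I have not verified against the package source, is that the formula is turned into text with deparse() somewhere inside predict(). In base R, deparse() splits long expressions into several strings (the default width.cutoff is 60), so a formula with six predictors no longer fits in one piece. The sketch below only demonstrates that base R behavior; the step of appending "|" to a piece is my assumption about what goes wrong, not the package's actual code.

```r
# The formula from the question, as a plain formula object
f <- prob ~ predictor1 + predictor2 + predictor3 + predictor4 +
  predictor5 + predictor6

# deparse() returns a character vector; long formulas are split into pieces
parts <- deparse(f)
length(parts)   # more than one piece for this formula
parts

# Asking for a wider cutoff and collapsing gives a single string
one_line <- paste(deparse(f, width.cutoff = 500L), collapse = " ")
length(one_line)

# Assumption: if "|" is pasted onto an incomplete piece, parse() fails
msg <- tryCatch(parse(text = paste(parts[1], "|")),
                error = function(e) conditionMessage(e))
msg
```

This would also explain why models with fewer predictors work: their formulas deparse to a single string.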
A possible workaround is to explicitly list the separate parts of the regression specification instead of relying on the package to internally replicate the regressors for all parts.
mod2 <- DirichReg(prob ~
predictor1 + predictor2 + predictor3 + predictor4 + predictor5 + predictor6 |
predictor1 + predictor2 + predictor3 + predictor4 + predictor5 + predictor6 |
predictor1 + predictor2 + predictor3 + predictor4 + predictor5 + predictor6,
data = train_dat, model = "common")
all.equal(coef(mod), coef(mod2))
## [1] TRUE
predict(mod2, newdata = test_dat)
## [,1] [,2] [,3]
## [1,] 0.2436493 0.2715895 0.4847612
## [2,] 0.2541715 0.2252292 0.5205993
## [3,] 0.2618741 0.2345063 0.5036196
## ...
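Typing the predictor list three times is error-prone, so the explicit formula can also be built as a string. This is a minimal sketch in base R; the names predictors, rhs and f are my own, and the repeat count of 3 assumes the three response components (prob_A, prob_B, prob_C) from the question.

```r
# Build "p1 + p2 + ... + p6" once
predictors <- paste0("predictor", 1:6)
rhs <- paste(predictors, collapse = " + ")

# Repeat the right-hand side once per response component, separated by "|"
f <- as.formula(paste("prob ~", paste(rep(rhs, 3), collapse = " | ")))
f

# Untested assumption: the formula object can be passed on directly, e.g.
# mod2 <- DirichReg(f, data = train_dat, model = "common")
```

If the number of predictors changes, only the predictors vector needs to be edited.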
Related
Adaptive LASSO in bayesQR
I have been playing with the bayesQR package, and want to apply it to an application that calls for variable selection using LASSO. As far as I understand, this is possible in bayesQR, but I haven't been able to get any variables dropped. My toy example is below, where the extraneous variables (c and d) are dropped by glmnet, but not by bayesQR. Is there something fundamental that I am missing? How can I get model4 below to have eliminated variables?
library(data.table)
library(bayestestR)
library(bayesQR)
library(glmnet)
# Generate data
n = 250
seed = 22
noise_sd = 5
set.seed(seed)
dt = data.table(a = runif(n, min = 0, max = 10),
                b = runif(n, min = 0, max = 10),
                c = runif(n, min = 0, max = 10),
                d = rnorm(n, sd = .01 * noise_sd))
dt[, y := (a + rnorm(n, sd = 1 * noise_sd)) * 2 + (b + rnorm(n, sd = 2 * noise_sd)) + rnorm(n, sd = 2 * noise_sd)]
formula = y ~ a + b + c + d
# Just with GLM
model1 = glm(data = dt, formula = formula)
# LASSO with glmnet
x = as.matrix(dt[, .(a,b,c,d)])
y = dt$y
cv_model <- cv.glmnet(x, y, alpha = 1)
best_lambda <- cv_model$lambda.min
model2 <- glmnet(x, y, alpha = 1, lambda = best_lambda)
print(coef(model2))
# Quantile regression with bayesQR
model3 = bayesQR(data = dt, formula = formula, ndraw = 5000, seed = seed)
# Quantile regression with bayesQR using adaptive lasso
model4 = bayesQR(data = dt, formula = formula, ndraw = 5000, alasso = TRUE, seed = seed)
message('GLM')
print(summary(model1))
message('glmnet with LASSO')
print(coef(model2))
message('bayesQR')
print(summary(model3))
message('bayesQR with ALASSO')
print(summary(model4))
How long should a Monte Carlo bootstrap power analysis simulation in R take? Is it potentially hours? (1000 reps, 1000 bootstraps)
I am using a Monte Carlo simulation to run a power analysis for a longitudinal mediation model. I'm using the power.boot function from the bmem package (lavaan). I checked the code with only 5 reps/5 bootstrap to make sure it worked and it did. Then I ran the code with 1000 reps, 1000 bootstrap as the package documentation recommends. It's been over an hour now and it's still running - is this normal? How long is too long?
powermodel1 <-'
x2 ~ start(.6)*x1 + x*x1
x3 ~ start(.6)*x2 + x*x2
m2 ~ start(.15)*x1 + a*x1 + start(.3)*m1 + m*m1
m3 ~ start(.15)*x2 + a*x2 + start(.3)*m2 + m*m2
y2 ~ start(.5)*m1 + b*m2 + start(.3)*y1 + y*y1
y3 ~ start(.5)*m2 + b*m2 + start(.3)*y2 + y*y2 + start(0.05)*x1 + c*x1
x1 ~~ start(.15)*m1
x1 ~~ start(.15)*y1
y1 ~~ start(.5)*m1
'
indirect <- 'ab:=a*b'
N<-200
system.time(bootstrap<-power.boot(powermodel1, indirect, N, nrep=1000, nboot=1000, parallel = 'multicore'))
summary(bootstrap)
Unfortunately it looks like it will take a while; ~8hrs on my system:
library(bmem)
powermodel1 <-'
x2 ~ start(.6)*x1 + x*x1
x3 ~ start(.6)*x2 + x*x2
m2 ~ start(.15)*x1 + a*x1 + start(.3)*m1 + m*m1
m3 ~ start(.15)*x2 + a*x2 + start(.3)*m2 + m*m2
y2 ~ start(.5)*m1 + b*m2 + start(.3)*y1 + y*y1
y3 ~ start(.5)*m2 + b*m2 + start(.3)*y2 + y*y2 + start(0.05)*x1 + c*x1
x1 ~~ start(.15)*m1
x1 ~~ start(.15)*y1
y1 ~~ start(.5)*m1
'
indirect <- 'ab:=a*b'
N<-200
system.time(bootstrap<-bmem::power.boot(powermodel1, indirect, N, nrep = 10, nboot = 10, parallel = 'multicore'))
system.time(bootstrap<-bmem::power.boot(powermodel1, indirect, N, nrep = 30, nboot = 30, parallel = 'multicore'))
system.time(bootstrap<-bmem::power.boot(powermodel1, indirect, N, nrep = 60, nboot = 60, parallel = 'multicore'))
system.time(bootstrap<-bmem::power.boot(powermodel1, indirect, N, nrep = 100, nboot = 100, parallel = 'multicore'))
library(tidyverse)
# Load the times from above into a dataframe
benchmark <- tibble(bootstraps = c(10, 30, 60, 100),
                    times = c(4.021, 30.122, 121.103, 311.236))
# Plot the points and fit a curve
ggplot(benchmark, aes(x = bootstraps, y = times)) +
  geom_point() +
  geom_smooth(se = FALSE, span = 5)
# Fit a model
fit <- lm(data = benchmark, times~poly(bootstraps, 2, raw=TRUE))
newtimes <- data.frame(bootstraps = seq(100, 1000, length = 4))
# Predict the time it will take for larger bootstrap/rep values
predict(fit, newdata = newtimes)
>          1          2          3          4
>   311.6829  4568.3812 13789.6754 27975.5655
# Convert from seconds to hours
print(27975.5655/60/60)
>[1] 7.77099
geom_contour error: "Not possible to generate contour data all z values are equal"
I was trying to draw a decision boundary from a training data set onto a testing data set with geom_contour() from ggplot. The decision boundary was from the training results of a support vector classifier. But it kept producing this error message:
Not possible to generate contour data all z values are equal
Could someone help me with this? Below is my code.
#libraries
library(tidyverse)
library(caret)
library(kernlab)
set.seed(6758)
The data sets were set up so that they would have a non-linear but clear decision boundary.
#data sets
df_train <- tibble(X1 = runif(100, min = 0, max = 20),
                   X2 = runif(100, min = 0, max = 20),
                   Y = X1 * X2) %>%
  mutate(result = ifelse(Y <= 90, "fail", "success"))
df_test <- tibble(X1 = runif(100, min = 0, max = 20),
                  X2 = runif(100, min = 0, max = 20),
                  Y = X1 * X2) %>%
  mutate(result = ifelse(Y <= 90, "fail", "success"))
#for train()
X1_train <- df_train %>% dplyr::select(X1, X2)
Y1_train <- df_train$result
As I said, I was fitting an SVM, with train from caret
svm_radial <- train(
  x = X1_train,
  y = Y1_train,
  method = "svmRadial",
  trControl = trainControl(method = "cv", number = 10)
)
#training results
predr_train <- cbind(df_train, pred = predict(svm_radial, newdata = X1_train))
df_test %>%
  ggplot(aes(x = X1, y = X2, color = result)) +
  geom_point() +
  geom_contour(data = predr_train, aes(z = as.numeric(pred)))
Visualization of predict glm using multiple variables in R
I would like to use the following dataset to fit a glm and visualize the output of predict().
y=c(-18.948,-19.007,-18.899,-19.022,-20.599,-19.778,-17.344,-20.265,-20.258,-19.886,-18.05,-19.824,-20.1,-20.508,-20.455,-16.573,-20.249,-20.205,-20.436,-16.358,-17.717,-19.794,-20.372,-19.944,-20.072,-19.889,-20.139,-19.132,-20.275,-19.953,-19.769,-20.2,-19.638,-17.419,-19.086,-18.347,-18.73,-18.872,-18.956,-19.28,-18.176,-19.036,-18.084,-20.11,-19.641,-19.656,-19.25,-18.68,-19.089,-18.969,-18.161,-17.603,-20.37,-19.233,-18.961,-19.083,-20.118,-19.795,-17.154,-16.75)
x1=c(9.698,9.583,9.356,9.326,9.438,9.733,8.803,8.973,9.141,9.044,8.788,9.377,9.26,10.186,9.035,9.569,9.431,9.09,8.776,9.117,9.393,9.408,9.307,8.868,8.398,8.407,9.364,9.074,8.444,9.122,10.11,7.81,9.777,6.472,9.521,8.92,9.341,9.446,9.08,8.071,8.047,8.019,7.419,9.022,9.981,9.337,9.989,10.013,9.31,10.843,8.337,9.103,6.438,9.372,9.071,8.749,9.016,8.181,9.284,8.44)
x2=c('S03','S03','S03','S03','S03','S03','S03','S03','S03','S03','S03','S03','S03','S03','S03','S03','S03','S03','S03','S03','S04','S04','S04','S04','S04','S04','S06','S06','S06','S06','S06','S06','S06','S06','S07','S07','S07','S07','S07','S07','S07','S07','S07','S08','S08','S09','S09','S09','S09','S09','S09','S09','S10','S03','S03','S03','S04','S04','S07','S07')
x3=c('A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','P1','P1','P1','P1','P1','P1','P1')
mydata <- data.frame(y,x1,x2,x3)
Fit glm Model:
myglm <- glm(y ~ x1+x2+x3+x1:x2, family="gaussian", data= mydata)
Prediction:
1). Extract the range of x1
min <- min(mydata$x1)
max <- max(mydata$x1)
2). Create a new data frame, new.x
Here comes the question: How should I include x2 and x3 in new.x?
new.x <- data.frame(
  x1=seq(min, max, length=60),
  x2= ???
  x3= ???)
Then predict new.y with myglm:
new.y = predict(myglm, newdata=new.x, se.fit=TRUE)
Combine new.x and new.y:
addThese <- data.frame(new.x, new.y)
Interval:
addThese <- mutate(addThese, d15N=exp(fit),
                   lwr=exp(fit-1.96*se.fit),
                   upr=exp(fit+1.96*se.fit))
3). Visualization of the original data points, with the glm prediction added as a smooth line:
ggplot(addThese, aes(x1, fit))+
  geom_point(shape=21, size=3)+
  geom_smooth(data=addThese, aes(ymin=lwr, ymax=upr), stat='identity')
I'm still not sure this is the right way to create new.data, but I'll give it a try. So with your data, slightly modifying your code:
library(dplyr)
library(ggplot2)
myglm <- glm(y ~ x1 + x2 + x3 + x1:x2, family = gaussian, data = mydata)
minx <- min(mydata$x1)
maxx <- max(mydata$x1)
# create data with all combinations of x1, x2, x3
new.data <- expand.grid(x1 = seq(minx, maxx, length.out = 60),
                        x2 = unique(mydata$x2),
                        x3 = unique(mydata$x3)
)
# visualize data
data.frame(predict(myglm, newdata = new.data, se.fit = TRUE)[1:2]) %>%
  bind_cols(new.data) %>%
  mutate(d15N = exp(fit), lwr = fit - 1.96 * se.fit, upr = fit + 1.96 * se.fit) %>%
  ggplot(aes(x = x1, y = fit, colour = interaction(x2, x3))) +
  geom_point(size = 1, alpha = .75, pch = 19, position = "jitter") +
  geom_smooth(aes(ymin = lwr, ymax = upr), stat = "identity", alpha = .5) +
  facet_wrap(~interaction(x2, x3, sep = " : "), nrow = 5) +
  ggthemes::theme_few() +
  labs(y = "Predicted value", x = bquote(x[1])) +
  theme(legend.position = "none")
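To make the expand.grid() step concrete, here is a small base R sketch on made-up values (obs, grid and observed_only are hypothetical names, and the data are not the question's). It shows that expand.grid() produces every combination, including pairs of x2 and x3 that never occur together in the data, and one possible way to keep only the observed pairs. Predictions for unobserved pairs are extrapolations, so whether to drop them is a judgment call.

```r
# Made-up (x2, x3) pairs standing in for the observed data
obs <- data.frame(x2 = c("S03", "S03", "S04", "S10"),
                  x3 = c("A1",  "P1",  "A1",  "A1"))

# Every combination of 3 x1 values, 3 sites and 2 x3 levels
grid <- expand.grid(x1 = seq(6, 11, length.out = 3),
                    x2 = unique(obs$x2),
                    x3 = unique(obs$x3),
                    stringsAsFactors = FALSE)
nrow(grid)            # 3 * 3 * 2 = 18

# Keep only the (x2, x3) pairs that actually occur in obs
observed_only <- merge(grid, unique(obs), by = c("x2", "x3"))
nrow(observed_only)   # 4 observed pairs * 3 x1 values = 12
```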
Curve function in r with expr different from x
I have found a few code examples online, but none of them could help me solve my problem. I know the expr needs x, but I couldn't find a way to plot these two functions using the curve function. I am able to plot them when the model has only one independent variable, but not more than one. Here is the code:
n <- 50
x1 <- runif(n = n, min = 0, max = 1)
x2 <- rnorm(n,mean = -50,1)
x3 <- rnorm(n=n,mean =50,sd=8)
z <- 3 - 4.2*x1 - x2 - x3
pr <- 1/(1+exp(-z))
y <- rbinom(n=n,1,pr)
y
# create dataframe
df = data.frame(y=y,x1=x1,x2=x2,x3=x3)
m <- glm( y~.,data=df,family=binomial(link = "logit"))
summary(m)
beta.hat <- m$coefficients
z.hat <- beta.hat[1] + beta.hat[2]*x1 + beta.hat[3]*x2 + beta.hat[4]*x3
curve(expr = exp(z) / (1 + exp(z)), xlim = c(0,1), ylab = expression(pi), n = 1000, lwd = 3, xlab = expression(z/hat(z)))
curve(expr = exp(z.hat)/(1 + exp(z.hat)), xlim = c(0,1), add = TRUE, col = "red", n = 1000)
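One possible approach, sketched here under my own assumptions rather than as a definitive answer: curve() evaluates an expression in x over a grid, so with several predictors the natural x-axis is the linear predictor itself. The inverse logit is then a single curve, plogis(x), and the true and fitted values can be added as points. The data below are hypothetical and smaller in scale than the question's simulation, chosen so that the fit is well behaved.

```r
# Hypothetical data, not the question's simulation
set.seed(42)
n  <- 200
x1 <- runif(n)
x2 <- rnorm(n)
z  <- 0.5 + 1.2 * x1 - 0.8 * x2                 # true linear predictor
y  <- rbinom(n, size = 1, prob = plogis(z))

m      <- glm(y ~ x1 + x2, family = binomial(link = "logit"))
z.hat  <- predict(m, type = "link")             # fitted linear predictor
pi.hat <- fitted(m)                             # fitted probabilities

if (!interactive()) pdf(NULL)                   # no plot file in batch runs

# The inverse logit as a function of the linear predictor
curve(plogis(x), from = min(z, z.hat), to = max(z, z.hat), n = 1000,
      lwd = 3, xlab = "linear predictor", ylab = expression(pi))
points(z, plogis(z), pch = 1)                   # true values
points(z.hat, pi.hat, col = "red", pch = 3)     # fitted values
```

Both sets of points fall on the same curve, because the fitted probabilities are the inverse logit of z.hat; the true and fitted models differ only in where each observation lands along the x-axis.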