I have been playing with the bayesQR package, and want to apply it to an application that calls for variable selection using LASSO. As far as I understand, this is possible in bayesQR, but I haven't been able to get any variables dropped. My toy example is below, where the extraneous variables (c and d) are dropped by glmnet, but not by bayesQR.
Is there something fundamental that I am missing? How can I get model4 below to eliminate variables?
library(data.table)
library(bayestestR)
library(bayesQR)
library(glmnet)
# Generate data
n = 250
seed = 22
noise_sd = 5
set.seed(seed)
dt = data.table(a = runif(n, min = 0, max = 10),
                b = runif(n, min = 0, max = 10),
                c = runif(n, min = 0, max = 10),
                d = rnorm(n, sd = .01 * noise_sd))
dt[, y := (a + rnorm(n, sd = 1 * noise_sd)) * 2 + (b + rnorm(n, sd = 2 * noise_sd)) + rnorm(n, sd = 2 * noise_sd)]
formula = y ~ a + b + c + d
# Just with GLM
model1 = glm(data = dt,
             formula = formula)
# LASSO with glmnet
x = as.matrix(dt[, .(a,b,c,d)])
y = dt$y
cv_model <- cv.glmnet(x, y, alpha = 1)
best_lambda <- cv_model$lambda.min
model2 <- glmnet(x, y, alpha = 1, lambda = best_lambda)
print(coef(model2))
# Quantile regression with bayesQR
model3 = bayesQR(data = dt,
                 formula = formula,
                 ndraw = 5000,
                 seed = seed)
# Quantile regression with bayesQR using adaptive lasso
model4 = bayesQR(data = dt,
                 formula = formula,
                 ndraw = 5000,
                 alasso = TRUE,
                 seed = seed)
message('GLM')
print(summary(model1))
message('glmnet with LASSO')
print(coef(model2))
message('bayesQR')
print(summary(model3))
message('bayesQR with ALASSO')
print(summary(model4))
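Note (my addition, not part of the original question): a shrinkage prior such as the adaptive lasso pulls the posterior toward zero, but individual draws are essentially never exactly zero, so summary(model4) will not report dropped variables the way coef(model2) does; selection is usually read off the posterior instead. A minimal sketch, assuming (as in the bayesQR examples) that the draws for the fitted quantile are stored in model4[[1]]$betadraw with coefficient names in model4[[1]]$names:
# Sketch: treat a variable as "selected" when its 95% credible interval excludes zero
draws <- model4[[1]]$betadraw            # MCMC draws, one column per coefficient
colnames(draws) <- model4[[1]]$names
burnin <- 1000                           # discard warm-up draws
ci <- apply(draws[-seq_len(burnin), ], 2, quantile, probs = c(0.025, 0.975))
print(ci)
keep <- ci[1, ] > 0 | ci[2, ] < 0        # interval excludes zero
print(names(keep)[keep])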
I tried to search this online, but couldn't exactly figure out what my issue was. Here is my code:
n = 10000
x1 <- runif(n,0,100)
x2 <- runif(n,0,100)
y1 <- 10*sin(x1/10) + 10 + rnorm(n, sd = 1)
y2 <- x2 * cos(x2) - 2 * rnorm(n, sd = 2)
x <- c(x1, x2)
y <- c(y1, y2)
start1 = list(a = 10, b = 5)
start2 = list(a = 30, b = 5)
library(flexmix)
library(flexmixNL)
modelNL <- flexmix(y ~ x, k = 2,
                   model = FLXMRnlm(formula = y ~ a*x/(b+x),
                                    family = "gaussian",
                                    start = list(start1, start2)))
plot(x, y, col = clusters(modelNL))
and before the plot, it gives me this error:
Error in matrix(1, nrow = sum(groups$groupfirst)) : data is too long
I searched Google for similar errors, but I don't quite understand what is wrong with my own code that results in this error.
As you can already tell, I am very new to R, so please explain it in the most layman's terms possible. Thank you in advance.
Ironically (in the context of an error message saying the data is "too long"), I think the proximate cause of that error is the missing data argument. If you give it the data in the form of a dataframe, you still get an error, but it's not the same one you are experiencing. When you plot the data, you get a rather bizarre set of values, at least from a statistical-distribution standpoint, and it's not clear why you are trying to model them with this formula. Nonetheless, with those starting values and a dataframe passed to data, one sees results.
> modelNL <- flexmix(y~x, k =2, data=data.frame(x=x,y=y),
+ model = FLXMRnlm(formula = y ~ a*x/(b+x),
+ family = "gaussian",
+ start = list(start1, start2)))
> modelNL
Call:
flexmix(formula = y ~ x, data = data.frame(x = x, y = y), k = 2, model = FLXMRnlm(formula = y ~
a * x/(b + x), family = "gaussian", start = list(start1, start2)))
Cluster sizes:
1 2
6664 13336
convergence after 20 iterations
> summary(modelNL)
Call:
flexmix(formula = y ~ x, data = data.frame(x = x, y = y), k = 2, model = FLXMRnlm(formula = y ~
a * x/(b + x), family = "gaussian", start = list(start1, start2)))
prior size post>0 ratio
Comp.1 0.436 6664 20000 0.333
Comp.2 0.564 13336 16306 0.818
'log Lik.' -91417.03 (df=7)
AIC: 182848.1 BIC: 182903.4
Most R regression functions first look for the names in the formula within the data= argument. Apparently this function fails when it needs to go out to the global environment to match formula tokens.
I tried a formula suggested by the plot of the data and get convergent results:
> modelNL <- flexmix(y~x, k =2, data=data.frame(x=x,y=y),
+ model = FLXMRnlm(formula = y ~ a*x*cos(x+b),
+ family = "gaussian",
+ start = list(start1, start2)))
> modelNL
Call:
flexmix(formula = y ~ x, data = data.frame(x = x, y = y), k = 2, model = FLXMRnlm(formula = y ~
a * x * cos(x + b), family = "gaussian", start = list(start1, start2)))
Cluster sizes:
1 2
9395 10605
convergence after 17 iterations
> summary(modelNL)
Call:
flexmix(formula = y ~ x, data = data.frame(x = x, y = y), k = 2, model = FLXMRnlm(formula = y ~
a * x * cos(x + b), family = "gaussian", start = list(start1, start2)))
prior size post>0 ratio
Comp.1 0.521 9395 18009 0.522
Comp.2 0.479 10605 13378 0.793
'log Lik.' -78659.85 (df=7)
AIC: 157333.7 BIC: 157389
The reduction in AIC seems huge compared to the first formula.
I'm trying to utilize effectPlotData as described here: https://cran.r-project.org/web/packages/GLMMadaptive/vignettes/Methods_MixMod.html
But I'm trying to apply it to a model (a two-part mixed model for zero-inflated semi-continuous data) that includes random/fixed effects for both a linear and a logistic portion (hurdle lognormal). I get the following error:
'Error in Qs[1, ] : incorrect number of dimensions'
I think this comes from having more than one set of random/fixed-effect outcomes, but if anyone has come across this error or can advise, it would be appreciated! I've tried changing the terms in the new data frame and tried a couple of different options with length.out (I attempted this as the number of subjects, and then as the total number of observations across all subjects), but I get the same error each time.
The code below specifies the model as m and the new data frame as nDF:
m = mixed_model(Y ~ X, random = ~ 1 | Subject,
                data = data_combined_temp_Fix_Num3,
                family = hurdle.lognormal,
                n_phis = 1, zi_fixed = ~ X, zi_random = ~ 1 | Subject,
                na.action = na.exclude)
nDF <- with(data_combined_temp_Fix_Num3,
            expand.grid(X = seq(min(X), max(X), length.out = 908),
                        Y = levels(Y)))
effectPlotData(m, nDF)
It seems to work with the following example:
library("GLMMadaptive")
set.seed(1234)
n <- 100 # number of subjects
K <- 8 # number of measurements per subject
t_max <- 5 # maximum follow-up time
# we construct a data frame with the design:
# everyone has a baseline measurement, and then measurements at random follow-up times
DF <- data.frame(id = rep(seq_len(n), each = K),
                 time = c(replicate(n, c(0, sort(runif(K - 1, 0, t_max))))),
                 sex = rep(gl(2, n/2, labels = c("male", "female")), each = K))
# design matrices for the fixed and random effects non-zero part
X <- model.matrix(~ sex * time, data = DF)
Z <- model.matrix(~ time, data = DF)
# design matrices for the fixed and random effects zero part
X_zi <- model.matrix(~ sex, data = DF)
Z_zi <- model.matrix(~ 1, data = DF)
betas <- c(-2.13, -0.25, 0.24, -0.05) # fixed effects coefficients non-zero part
sigma <- 0.5 # standard deviation error terms non-zero part
gammas <- c(-1.5, 0.5) # fixed effects coefficients zero part
D11 <- 0.5 # variance of random intercepts non-zero part
D22 <- 0.1 # variance of random slopes non-zero part
D33 <- 0.4 # variance of random intercepts zero part
# we simulate random effects
b <- cbind(rnorm(n, sd = sqrt(D11)), rnorm(n, sd = sqrt(D22)), rnorm(n, sd = sqrt(D33)))
# linear predictor non-zero part
eta_y <- as.vector(X %*% betas + rowSums(Z * b[DF$id, 1:2, drop = FALSE]))
# linear predictor zero part
eta_zi <- as.vector(X_zi %*% gammas + rowSums(Z_zi * b[DF$id, 3, drop = FALSE]))
# we simulate log-normal longitudinal data
DF$y <- exp(rnorm(n * K, mean = eta_y, sd = sigma))
# we set the zeros from the logistic regression
DF$y[as.logical(rbinom(n * K, size = 1, prob = plogis(eta_zi)))] <- 0
###############################################################################
km1 <- mixed_model(y ~ sex * time, random = ~ 1 | id, data = DF,
                   family = hurdle.lognormal(),
                   zi_fixed = ~ sex)
km1
nDF <- with(DF, expand.grid(time = seq(min(time), max(time), length.out = 15),
                            sex = levels(sex)))
plot_data <- effectPlotData(km1, nDF)
library("lattice")
xyplot(pred + low + upp ~ time | sex, data = plot_data,
       type = "l", lty = c(1, 2, 2), col = c(2, 1, 1), lwd = 2,
       xlab = "Follow-up time", ylab = "")
local({
    km1$Funs$mu_fun <- function (eta) {
        pmax(exp(eta + 0.5 * exp(2 * km1$phis)), .Machine$double.eps)
    }
    km1$family$linkfun <- function (mu) log(mu)
    plot_data <- effectPlotData(km1, nDF)
    xyplot(exp(pred) + exp(low) + exp(upp) ~ time | sex, data = plot_data,
           type = "l", lty = c(1, 2, 2), col = c(2, 1, 1), lwd = 2,
           xlab = "Follow-up time", ylab = "")
})
In case someone comes across the same error: I was filtering data from my data frame within the model call, which caused the dimensions of the model and of the variable from the data frame not to match. I applied the same filtering to the new data frame. (I've also moved forward with a completely new data frame that only includes trials actually used by the model, so that no filtering has to be done at any step; see the sketch after the code.)
m = mixed_model(Y ~ X, random = ~ 1 | Subject,
                data = data_combined_temp_Fix_Num3[data_combined_temp_Fix_Num3$Z >= 4 &
                                                   data_combined_temp_Fix_Num3$ZZ >= 4, ],
                family = hurdle.lognormal,
                n_phis = 1, zi_fixed = ~ X, zi_random = ~ 1 | Subject,
                na.action = na.exclude)
nDF <- with(data_combined_temp_Fix_Num3,
            expand.grid(X = seq(min(X[data_combined_temp_Fix_Num3$Z >= 4 &
                                      data_combined_temp_Fix_Num3$ZZ >= 4]),
                                max(X[data_combined_temp_Fix_Num3$Z >= 4 &
                                      data_combined_temp_Fix_Num3$ZZ >= 4]),
                                length.out = 908),
                        Y = levels(Y)))
effectPlotData(m, nDF)
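For reference, a sketch of the cleaner "filter once up front" route mentioned above, so the model and the new data frame are guaranteed to see the same rows (column names Z and ZZ are taken from the post; as in the vignette example earlier, nDF only needs the model covariates):
# Sketch only: filter the data once, then both the model and nDF use the same subset
data_filtered <- data_combined_temp_Fix_Num3[data_combined_temp_Fix_Num3$Z >= 4 &
                                             data_combined_temp_Fix_Num3$ZZ >= 4, ]
m <- mixed_model(Y ~ X, random = ~ 1 | Subject, data = data_filtered,
                 family = hurdle.lognormal, n_phis = 1,
                 zi_fixed = ~ X, zi_random = ~ 1 | Subject,
                 na.action = na.exclude)
nDF <- with(data_filtered, expand.grid(X = seq(min(X), max(X), length.out = 908)))
effectPlotData(m, nDF)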
I was trying to draw a decision boundary from a training data set onto a testing data set with geom_contour() from ggplot. The decision boundary was from the training results of a support vector classifier. But it kept producing this error message:
Not possible to generate contour data
all z values are equal
Could someone help me with this? Below is my code.
#libraries
library(tidyverse)
library(caret)
library(kernlab)
set.seed(6758)
The data sets were set up so that they would have a non-linear but clear decision boundary.
#data sets
df_train <- tibble(X1 = runif(100, min = 0, max = 20),
                   X2 = runif(100, min = 0, max = 20),
                   Y = X1 * X2) %>%
  mutate(result = ifelse(Y <= 90, "fail", "success"))
df_test <- tibble(X1 = runif(100, min = 0, max = 20),
                  X2 = runif(100, min = 0, max = 20),
                  Y = X1 * X2) %>%
  mutate(result = ifelse(Y <= 90, "fail", "success"))
#for train()
X1_train <- df_train %>% dplyr::select(X1, X2)
Y1_train <- df_train$result
As I said, I was fitting an SVM with train() from caret:
svm_radial <- train(
  x = X1_train,
  y = Y1_train,
  method = "svmRadial",
  trControl = trainControl(method = "cv", number = 10)
)
#training results
predr_train <- cbind(df_train, pred = predict(svm_radial, newdata = X1_train))
df_test %>%
  ggplot(aes(x = X1, y = X2, color = result)) +
  geom_point() +
  geom_contour(data = predr_train,
               aes(z = as.numeric(pred)))
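For what it's worth, a sketch of the usual fix (my addition, not from the original post): geom_contour() needs z values on a regular grid, while predr_train only has predictions at the scattered training points, and the inherited colour aesthetic additionally splits the contour data into groups whose z values can all be equal. Predicting over an expand.grid() lattice and keeping the colour aesthetic out of the contour layer avoids both problems:
# Sketch, reusing svm_radial from above
grid <- expand.grid(X1 = seq(0, 20, length.out = 200),
                    X2 = seq(0, 20, length.out = 200))
grid$pred <- as.numeric(predict(svm_radial, newdata = grid))  # 1 = "fail", 2 = "success"
df_test %>%
  ggplot(aes(x = X1, y = X2)) +
  geom_point(aes(color = result)) +
  geom_contour(data = grid, aes(z = pred), breaks = 1.5, color = "black")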
I have found a few examples online, but none of them could help me solve my problem. I know that expr needs x, but I couldn't find a way to plot these two functions using the curve function. I am able to plot them when the model has only one independent variable, but not when it has more than one. Here is the code:
n <- 50
x1 <- runif(n = n, min = 0, max = 1)
x2 <- rnorm(n,mean = -50,1)
x3 <- rnorm(n=n,mean =50,sd=8)
z <- 3 - 4.2*x1 - x2 - x3
pr <- 1/(1+exp(-z))
y <- rbinom(n=n,1,pr)
y
# create dataframe
df = data.frame(y = y, x1 = x1, x2 = x2, x3 = x3)
m <- glm(y ~ ., data = df, family = binomial(link = "logit"))
summary(m)
beta.hat <- m$coefficients
z.hat <- beta.hat[1] + beta.hat[2]*x1 + beta.hat[3]*x2 + beta.hat[4]*x3
curve(expr = exp(z) / (1 + exp(z)), xlim = c(0,1), ylab = expression(pi), n = 1000, lwd = 3, xlab = expression(z/hat(z)))
curve(expr = exp(z.hat)/(1 + exp(z.hat)), xlim = c(0,1), add = TRUE, col = "red", n = 1000)
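For completeness, one way around this (a sketch, my addition): with several predictors there is no single covariate to put on the x-axis, so plot the fitted curve against the linear predictor itself; the fitted probabilities then lie exactly on that curve:
# plot pi = plogis(z) over the observed range of the linear predictor,
# then overlay the fitted values at the estimated z.hat
curve(expr = plogis(x), xlim = range(z.hat), n = 1000, lwd = 3,
      xlab = expression(hat(z)), ylab = expression(pi))
points(z.hat, plogis(z.hat), col = "red", pch = 16)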
I'm currently working on a randomForest model. While tuning, I realized that high nodesize values were the configuration that performed best under cross-validation.
But then I noticed something strange.
Here is a reproducible piece of code:
For nodesize = nrow(Data) + 4:
library(randomForest)
library(data.table)
set.seed(1)
n = 10
sigma = 0.4
X = runif(n)
Y = runif(n)
Z = X^2+X*Y+Y+sigma*rnorm(n)
Data = data.table(X,Y,Z)
model = randomForest(formula = as.formula('Z ~ X + Y'), data = Data, mtry = 1,
                     ntree = 500, nodesize = n + 4, do.trace = TRUE)
pred = predict(model,Data)
print(pred)
1.041549 1.036075 1.266310 1.324197 1.308377 1.480041 1.691081 1.752463 1.203856 1.306943
For nodesize = nrow(Data) + 5:
library(randomForest)
library(data.table)
set.seed(1)
n = 10
sigma = 0.4
X = runif(n)
Y = runif(n)
Z = X^2+X*Y+Y+sigma*rnorm(n)
Data = data.table(X,Y,Z)
model = randomForest(formula = as.formula('Z ~ X + Y'), data = Data, mtry = 1,
                     ntree = 500, nodesize = n + 5, do.trace = TRUE)
pred = predict(model,Data)
print(pred)
1.330427 1.330427 1.330427 1.330427 1.330427 1.330427 1.330427 1.330427 1.330427 1.330427
The predictions are the same for every observation from nodesize = n+5 on.
This is the case for any value of n (n = 20000 as well).
To me, nodesize is the minimal number of records in a node required for a split to be performed. So this would mean that if we have n records, the trees are grown by sampling n+4 observations with replacement, and that for nodesize > n+4 no split is performed and the trees return the global mean: that's why the same prediction is attributed to every observation. Does that make sense? Is there a parameter to indicate how many samples are drawn from the original dataset?
Thanks in advance
@Morgan is right that there was a misunderstanding regarding what nodesize does: it's the minimum size of any terminal node. But still, the behavior of randomForest seems unexpected (bug?). It does actually create branches when n <= nodesize <= n+4; that's what you've found.
If nodesize is 10 (the size of the sample), there should not be any splits, but randomForest still makes a split that splits off several observations. It also does that when nodesize is one of 11:14 (not shown here):
library(randomForest)
library(ggplot2)

n = 10
sigma = 0.4
set.seed(100)
X = runif(n)
set.seed(200)
Y = runif(n)
set.seed(1)
Z = X^2 + X*Y + Y + sigma * rnorm(n)
Data = data.frame(X, Y, Z)
# mtry = p, replace = F and sampsize = n to eliminate randomness
model = randomForest(formula = Z ~ X + Y, data = Data,
                     mtry = 2, ntree = 50, nodesize = 10, replace = F, sampsize = n)
grid <- expand.grid(X = seq(from = min(Data$X), to = max(Data$X), length.out = 100),
                    Y = seq(from = min(Data$Y), to = max(Data$Y), length.out = 100))
grid$grid_preds <- predict(model, grid)
ggplot(grid, aes(x = X, y = Y)) +
  geom_point(aes(color = grid_preds)) +
  geom_point(data = Data, aes(x = X, y = Y, size = 4), color = "blue") +
  theme(legend.position = "none")
If you set nodesize to n + 5 or higher, randomForest does not make any splits anymore, as expected:
model = randomForest(formula = Z ~ X + Y, data = Data,
                     mtry = 2, ntree = 50, nodesize = 15, replace = F, sampsize = n)
grid <- expand.grid(X = seq(from = min(Data$X), to = max(Data$X), length.out = 100),
                    Y = seq(from = min(Data$Y), to = max(Data$Y), length.out = 100))
grid$grid_preds <- predict(model, grid)
ggplot(grid, aes(x = X, y = Y, color = grid_preds)) + geom_point()
As a comparison, ranger shows the expected behavior and does not attempt any splits if min.node.size >= n:
library(ranger)
rang = ranger(Z ~ X + Y, data = Data, write.forest = T,
              replace = F, sample.fraction = 1,
              mtry = 2, num.trees = 50, min.node.size = 10)
grid$grid_preds <- predict(rang, grid)$predictions
ggplot(grid, aes(x = X, y = Y, color = grid_preds)) + geom_point()
By the way, the n+5 rule in randomForest also holds for n other than 10. I'm wondering too what's going on there.
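A quick sketch to probe that (my addition), using the number of distinct in-sample predictions as a proxy for whether any split happened:
# Sketch: compare nodesize = n + 4 (splits expected) vs n + 5 (no splits expected)
library(randomForest)
for (n in c(10, 50, 200)) {
  set.seed(1)
  d <- data.frame(X = runif(n), Y = runif(n))
  d$Z <- d$X^2 + d$X * d$Y + d$Y + 0.4 * rnorm(n)
  for (ns in c(n + 4, n + 5)) {
    m <- randomForest(Z ~ X + Y, data = d, mtry = 2, ntree = 50,
                      nodesize = ns, replace = FALSE, sampsize = n)
    cat("n =", n, " nodesize =", ns,
        " distinct predictions:", length(unique(predict(m, d))), "\n")
  }
}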