Misuse predict.rq in the package quantreg? - r

I am using quantreg package to predict new data based on training set. However, I noticed a discrepancy between predict.rq or predict and doing it manually. Here is an example:
The quantile regression setting is
N = 10000
tauList = seq(1:11/12)/12
y = rchisq(N,2)
X = matrix( rnorm(3*N) ,nrow = N, ncol = 3 )
fit <- rq( y ~ X-1, tau = tauList, method = "fn")
The new data set I want to predict is
newdata <- matrix( rbeta((3*N),2,2) ,nrow = N,ncol=3 )
I use predict.rq or predict to predict newdata. Both return the same result:
fit_use_predict <- predict.rq( fit, newdata = as.data.frame(newdata) )
Also I manually do the prediction based on the coefficients matrix:
coef_mat <- coef(fit)
fit_use_multiplication <- newdata %*% coef_mat
I expect both are numerically identical, but they are not:
diff <- fit_use_predict - fit_use_multiplication
print(diff)
Their difference cannot be negligible.
However, predicting the original data set X, both return the same result, i.e.,
predict(fit, newdata = data.frame(X)) = X %*% coef_mat ## True
Do I miss something when using the function? Thanks!

A more serious problem here, before we get to prediction is that the model is forcing all of the fitted quantile functions through the origin of design space and since the covariates are centered at the origin all of the quantile functions are forced to cross there. Even if the X's all lie in the positive orthant it is quite a strong assumption to say that the distribution of the response is degenerate at the origin.

I think you just have to retain the 'X' name in your data as it was in the training data.
library(quantreg)
N = 10000
tauList = seq(1:11/12)/12
y = rchisq(N,2)
X = matrix( rnorm(3*N) ,nrow = N, ncol = 3 )
fit <- rq( y ~ X-1, tau = tauList, method = "fn")
newdata <- matrix( rbeta((3*N),2,2) ,nrow = N,ncol=3 )
fit_use_predict <- predict.rq( fit, newdata = data.frame(X=I(newdata)) )
coef_mat <- coef(fit)
fit_use_multiplication <- newdata %*% coef_mat
diff <- fit_use_predict - fit_use_multiplication
max( abs(diff) )
Output is 0

Related

Does caret::train() in r have a standardized output across different fit methods/models?

I'm working with the train() function from the caret package to fit multiple regression and ML models to test their fit. I'd like to write a function that iterates through all model types and enters the best fit into a dataframe. Biggest issue is that caret doesn't provide all the model fit statistics that I'd like so they need to be derived from the raw output. Based on my exploration there doesn't seem to be a standardized way caret outputs each models fit.
Another post (sorry don't have a link) created this function which pulls from fit$results and fit$bestTune to get pre calculated RMSE, R^2, etc.
get_best_result <- function(caret_fit) {
best = which(rownames(caret_fit$results) == rownames(caret_fit$bestTune))
best_result = caret_fit$results[best, ]
rownames(best_result) = NULL
best_result
}
One example of another fit statistic I need to calculate using raw output is BIC. The two functions below do that. The residuals (y_actual - y_predicted) are needed along with the number of x variables (k) and the number of rows used in the prediction (n). k and n must be derived from the output not the original dataset due to the models dropping x variables (feature selection) or rows (omitting NAs) based on its algorithm.
calculate_MSE <- function(residuals){
# residuals can be replaced with y_actual-y_predicted
mse <- mean(residuals^2)
return(mse)
}
calculate_BIC <- function(n, mse, k){
BIC <- n*log(mse)+k*log(n)
return(BIC)
}
The real question is is there a standardized output of caret::train() for x variables or either y_actual, y_predicted, or residuals?
I tried fit$finalModel$model and other methods but to no avail.
Here is a reproducible example along with the function I'm using. Please consider the functions above a part of this reproducible example.
library(rlist)
library(data.table)
# data
df <- data.frame(y1 = rnorm(50, 0, 1),
y2 = rnorm(50, .25, 1.5),
x1 = rnorm(50, .4, .9),
x2 = rnorm(50, 0, 1.1),
x3 = rnorm(50, 1, .75))
missing_index <- sample(1:50, 7, replace = F)
df[missing_index,] <- NA
# function to fit models and pull results
fitModels <- function(df, Ys, Xs, models){
# empty list
results <- list()
# number of for loops
loops_counter <- 0
# for every y
for(y in 1:length(Ys)){
# for every model
for(m in 1:length(models)){
# track loops
loops_counter <- loops_counter + 1
# fit the model
set.seed(1) # seed for reproducability
fit <- tryCatch(train(as.formula(paste(Ys[y], paste(Xs, collapse = ' + '),
sep = ' ~ ')),
data = df,
method = models[m],
na.action = na.omit,
tuneLength = 10),
error = function(e) {return(NA)})
# pull results
results[[loops_counter]] <- c(Y = Ys[y],
model = models[m],
sample_size = nrow(fit$finalModel$model),
RMSE = get_best_result(fit)[[2]],
R2 = get_best_result(fit)[[3]],
MAE = get_best_result(fit)[[4]],
BIC = calculate_BIC(n = length(fit$finalModel),
mse = calculate_MSE(fit$finalModel$residuals),
k = length(fit$finalModel$xNames)))
}
}
# list bind
results_df <- list.rbind(results)
return(results_df)
}
linear_models <- c('lm', 'glmnet', 'ridge', 'lars', 'enet')
fits <- fitModels(df, c(y1, y2), c(x1,x2,x3), linear_models)

How to Use Parametric Bootstrap to Calculate Covariance Matrix?

Using the following dataset "data": load(url("https://www.math.ntnu.no/emner/TMA4315/2020h/hoge-veluwe.Rdata"))
I have fit a poisson GLM model = glm(y~t + I(t^2), family = poisson, data). I now want to estimate the covariance matrix of the β coefficients obtained from the GLM regression by using parametric bootstrapping with 1000 simulations. My code so far is:
ysim = simulate(mod_quad, 1)
betahat = matrix(0,nrow = 1000, ncol = 3)
for (i in 1:1000){
sim_data = cbind(ysim, data$t)
betahat[i, ]= glm(ysim ~ data$t + I(data$t^2),
family = poisson,
data = sim_data )$coefficients
ysim = simulate(glm(ysim ~ data$t + I(data$t^2)family = poisson, data = sim_data ), 1)
}
var(betahat[1000,])
The betahat matrix just comes out to a zero matrix each time so I'm not sure what is missing in my approach?
This is really more of a coding than a statistical question. Streamlining, I would do this:
## set vals to NA (not 0) to make detection of problems easier
betahat <- matrix(NA, nrow = 1000, ncol = 3)
for (i in 1:1000) {
## replace response with parametric simulation
sim_data <- transform(data, y=simulate(mod_quad, 1)[[1]])
## refit model with new data
newfit <- update(mod_quad, data=sim_data)
## store new coefficients
betahat[i, ] <- coef(newfit)
}
## compute variance
var(betahat)

Unexpected result from cross validation

I would like to perform 10-fold cross validation manually using prostate data to learn how to do it manually. I utilise the elasticnet package for code. I estimated the parameters by glmnet package (of course, it can perform cross validation too, but I would like to do that manually). After the analysis, It seems to me that I need a different criterion to choose tuning parameter other than minimum of cv.error because this gives the almost null model, if not so "where is my mistake?". (According to the original paper of Tibshirani, optimum model has three variables)
Here is the code
library(ElemStatLearn)
library(glmnet)
x <- scale(prostate[,1:8],T,T)
y <- scale(prostate[,9],T,F)
lambda = seq(0,1,0.02)
cv.folds <- function(n, folds = 10){
split(sample(1:n), rep(1:folds, length = n))
}
c.val <- function(x, y, K = 10, lambda, plot.it = TRUE){
n <- nrow(x)
all.folds <- cv.folds(length(y), K)
residmat <- matrix(0, length(lambda), K)
for(i in seq(K)) {
omit <- all.folds[[i]]
xk <- as.matrix(x[-omit, ])
yk <- as.vector(y[-omit])
xg <- x[omit, ]
yg <- y[omit]
fit <- glmnet(xk, yk, family="gaussian",
alpha=1, lambda=lambda,standardize = FALSE, intercept = FALSE)
fit <- predict(fit,newx=xg,lambda=lambda)
if(length(omit)==1){fit<-matrix(fit,nrow=1)}
residmat[, i] <- apply((yg - fit)^2, 2, mean)
}
cv <- apply(residmat, 1, mean)
cv.error <- sqrt(apply(residmat, 1, var)/K)
object<-list(lambda = lambda, cv = cv, cv.error = cv.error)
if(plot.it) {
plot(lambda, cv, type = "b", xlab="lambda", ylim = range(cv, cv + cv.error, cv - cv.error))
invisible(object)
}
}
result <- c.val(x,y,K = 10,lambda = lambda)
lambda.opt <- lambda[which.min(result$cv.error)]
fit <- glmnet(x, y, family="gaussian",
alpha=1, lambda=lambda.opt,standardize = FALSE, intercept = FALSE)
coef(fit)
Result:
> coef(fit)
9 x 1 sparse Matrix of class "dgCMatrix"
s0
(Intercept) .
lcavol 0.01926724
lweight .
age .
lbph .
svi .
lcp .
Edit:
Model generated directly from glmnet.
fit.lasso <- glmnet(x, y, family="gaussian", alpha=1,
standardize = FALSE, intercept = FALSE)
fit.lasso.cv <- cv.glmnet(x, y, type.measure="mse", alpha=1,
family="gaussian",standardize = FALSE, intercept = FALSE)
coef.lambda.min <- coef(fit.lasso.cv,s=fit.lasso.cv$lambda.min)
coef.lambda.1se <- coef(fit.lasso.cv,s=fit.lasso.cv$lambda.1se)
cbind(coef.lambda.min,coef.lambda.1se)
Result:
9 x 2 sparse Matrix of class "dgCMatrix"
1 1
(Intercept) . .
lcavol 0.59892674 0.5286355
lweight 0.23669159 0.1201279
age -0.06979581 .
lbph 0.09392021 .
svi 0.24620007 0.1400748
lcp . .
gleason 0.00346421 .
pgg45 0.06631013 .
The second column shows the correct (lambda.1se) result.
Your "mistake" is very hard to spot: it comes from the fact that glmnet will not use the order of your own lambda vector to sort the vector of results.
Example with the data you used:
res <- glmnet(x, y, lambda=lambda)
res$lambda
So when you call the command lambda[which.min(result$cv.error)] at the end of your procedure, you will not get the value corresponding to the minimum of the cross-validated error. Also, it explains why your graph looks strange.
An easy fix would be to declare lambda at the beginning of the script as a decreasing vector:
lambda = seq(1, 0, 0.02)
Final remark: be careful when using a single lambda.

R: simulating data with lmer and predict (or else)

I am fitting the following model
fit<- lmer(y ~ a + b + (1|c) + (1|a:d) , data=inputdata)
to real observations collected in "inputdata".
Now I want to generate various (1000) modelled datasets for a simulation based on the model parameters and the determined errors. I can use
pred <- predict(fit, newdata=list(a=val_a1, b=val_b1, c=val_c1, d = val_d1),
allow.new.levels = TRUE)
but this always provides the same (the most likely, mean value). Is there a way to get a distribution of values, meaning to draw from a predicted distribution?
As asked by #Adam Quek a reproducable example:
#creating dataset
a <- as.factor(sort(rep(1:4,5 )))
b <- rep(1:2,10)+0.5
c <- as.factor(c( sort(rep(1:2,5)),sort(rep(1:2,5)) ))
d <- as.factor(rep(1:5,4 ))
a <- c(a,a,a)
b <- c(b,b,b)
c <- c(c,c,c)
d <- c(d,d,d)
y <- rnorm(60)
inputdata = data.frame(y,a,b,c,d)
# fitting the model
fit<- lmer(y ~ a + b + (1|c) + (1|a:d) , data=inputdata)
# making specific predictions for a parameter set
val_a1 = 1
val_b1 = 2
val_c1 = 1
val_d1 = 4
pred <- predict(fit, newdata=list(a=val_a1, b=val_b1, c=val_c1, d = val_d1),
allow.new.levels = TRUE)
pred
what I obtain is:
0.2394255
If I do it again
pred <- predict(fit, newdata=list(a=val_a1, b=val_b1, c=val_c1, d = val_d1),
allow.new.levels = TRUE)
pred
I get of course:
0.2394255
but what I am searching for is a R function or routine that easily provides a suite of predictions that follow the distribution of my input values. Something like
for (i in 1:1000){
pred[i] <- predict(fit, newdata=list(a=val_a1, b=val_b1, c=val_c1, d =
val_d1),allow.new.levels = TRUE)
}
and mean(pred) = 0.2394255 but sd(pred) != 0
Thanks to #Alex W! bootMer does the job. Below for those who are interested the solution for the example:
m1 <- function(.) {
predict(., newdata=inputdata, re.form=NULL)
}
boot1 <- lme4::bootMer(fit, m1, nsim=1000, use.u=FALSE, type="parametric")
boot1$t[,1]
where boot1$t[,1]now contains the 1000 predictions when using the parameter values defined in inputdata[1,].
https://cran.r-project.org/web/packages/merTools/vignettes/Using_predictInterval.html
was a helpful link.

Predict Logistf

I'm using a R package called logistf to make a Logistc Regression and I saw that there's no predict function for new data in this package and predict package does not work with this, so I found a code that show how making this with new data:
fit<-logistf(Tax ~ L20+L24+L28+L29+L31+L32+L33+L36+S10+S15+S16+S17+S20, data=trainData)
betas <- coef(fit)
X <- model.matrix(fit, data=testData)
probs <- 1 / (1 + exp(-X %*% betas))
I want to make a cross validation version with this using fit$predict and the probabilities that probs generate for me. Has anyone ever done something like this before?
Other thing that I want to know is about fit$predict I'm making a binary logistic regression, and this function returns many values, are these values from class 0 or 1, how can I know this? Thanks
While the code that you wrote works perfectly, there is a concise way of getting the same results seemingly:
brglm_model <- brglm(formula = response ~ predictor , family = "binomial", data = train )
brglm_pred <- predict(object = brglm_model, newdata = test , type = "response")
About the CV, you have to write a few lines of code I guess:
#Setting the number of folds, and number of instances in each fold
n_folds <- 5
fold_size <- nrow(dataset) %/% 5
residual <- nrow(dataset) %% 5
#label the instances based on the number of folds
cv_labels <- c(rep(1,fold_size),rep(2,fold_size), rep(3,fold_size), rep(4,fold_size), rep(5,fold_size), rep(5,residual))
# the error term would differ based on each threshold value
t_seq <- seq(0.1,0.9,by = 0.1)
index_mat <- matrix(ncol = (n_folds+1) , nrow = length(t_seq))
index_mat[,1] <- t_seq
# the main loop for calculation of the CV error on each fold
for (i in 1:5){
train <- dataset %>% filter(cv_labels != i)
test <- dataset %>% filter(cv_labels == i )
brglm_cv_model <- brglm(formula = response_var ~ . , family = "binomial", data = train )
brglm_cv_pred <- predict(object = brglm_model, newdata = test , type = "response")
# error formula that you want, e.g. misclassification
counter <- 0
for (treshold in t_seq ) {
counter <- counter + 1
conf_mat <- table( factor(test$response_var) , factor(brglm_cv_pred>treshold, levels = c("FALSE","TRUE") ))
sen <- conf_mat[2,2]/sum(conf_mat[2,])
# other indices can be computed as follows
#spec <- conf_mat[1,1]/sum(conf_mat[1,])
#prec <- conf_mat[2,2]/sum(conf_mat[,2])
#F1 <- (2*prec * sen)/(prec+sen)
#accuracy <- (conf_mat[1,1]+conf_mat[2,2])/sum(conf_mat)
#here I am only interested in sensitivity
index_mat[counter,(i+1)] <- sen
}
}
# final data.frame would be the mean of sensitivity over each threshold value
final_mat <- matrix(nrow = length(t_seq), ncol = 2 )
final_mat[,1] <- t_seq
final_mat[,2] <- apply(X = index_mat[,-1] , MARGIN = 1 , FUN = mean)
final_mat <- data.frame(final_mat)
colnames(final_mat) <- c("treshold","sensitivity")
#why not having a look at the CV-sensitivity of the model over threshold values?
ggplot(data = final_mat) +
geom_line(aes(x = treshold, y = sensitivity ), color = "blue")

Resources