Unexpected result from cross validation


I would like to perform 10-fold cross-validation manually on the prostate data in order to learn how it works. The code is adapted from the elasticnet package, and I estimate the parameters with the glmnet package (which can of course perform cross-validation itself, but I want to do it by hand). After the analysis, it seems to me that I need a criterion other than the minimum of cv.error to choose the tuning parameter, because the minimum gives an almost null model. If that is not the case, where is my mistake? (According to Tibshirani's original paper, the optimal model has three variables.)
Here is the code:
library(ElemStatLearn)
library(glmnet)
x <- scale(prostate[,1:8],T,T)
y <- scale(prostate[,9],T,F)
lambda = seq(0,1,0.02)
cv.folds <- function(n, folds = 10){
split(sample(1:n), rep(1:folds, length = n))
}
c.val <- function(x, y, K = 10, lambda, plot.it = TRUE){
n <- nrow(x)
all.folds <- cv.folds(length(y), K)
residmat <- matrix(0, length(lambda), K)
for(i in seq(K)) {
omit <- all.folds[[i]]
xk <- as.matrix(x[-omit, ])
yk <- as.vector(y[-omit])
xg <- x[omit, ]
yg <- y[omit]
fit <- glmnet(xk, yk, family="gaussian",
alpha=1, lambda=lambda,standardize = FALSE, intercept = FALSE)
fit <- predict(fit,newx=xg,lambda=lambda)
if(length(omit)==1){fit<-matrix(fit,nrow=1)}
residmat[, i] <- apply((yg - fit)^2, 2, mean)
}
cv <- apply(residmat, 1, mean)
cv.error <- sqrt(apply(residmat, 1, var)/K)
object<-list(lambda = lambda, cv = cv, cv.error = cv.error)
if(plot.it) {
plot(lambda, cv, type = "b", xlab="lambda", ylim = range(cv, cv + cv.error, cv - cv.error))
}
# return the result whether or not the plot is drawn
invisible(object)
}
result <- c.val(x,y,K = 10,lambda = lambda)
lambda.opt <- lambda[which.min(result$cv.error)]
fit <- glmnet(x, y, family="gaussian",
alpha=1, lambda=lambda.opt,standardize = FALSE, intercept = FALSE)
coef(fit)
Result:
> coef(fit)
9 x 1 sparse Matrix of class "dgCMatrix"
s0
(Intercept) .
lcavol 0.01926724
lweight .
age .
lbph .
svi .
lcp .
Edit:
Model generated directly from glmnet.
fit.lasso <- glmnet(x, y, family="gaussian", alpha=1,
standardize = FALSE, intercept = FALSE)
fit.lasso.cv <- cv.glmnet(x, y, type.measure="mse", alpha=1,
family="gaussian",standardize = FALSE, intercept = FALSE)
coef.lambda.min <- coef(fit.lasso.cv,s=fit.lasso.cv$lambda.min)
coef.lambda.1se <- coef(fit.lasso.cv,s=fit.lasso.cv$lambda.1se)
cbind(coef.lambda.min,coef.lambda.1se)
Result:
9 x 2 sparse Matrix of class "dgCMatrix"
1 1
(Intercept) . .
lcavol 0.59892674 0.5286355
lweight 0.23669159 0.1201279
age -0.06979581 .
lbph 0.09392021 .
svi 0.24620007 0.1400748
lcp . .
gleason 0.00346421 .
pgg45 0.06631013 .
The second column shows the correct (lambda.1se) result.

Your "mistake" is very hard to spot: glmnet does not keep the order of the lambda vector you supply. It sorts lambda in decreasing order internally and returns its results in that order.
Example with the data you used:
res <- glmnet(x, y, lambda=lambda)
res$lambda
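So when you call lambda[which.min(result$cv.error)] at the end of your procedure, the index refers to glmnet's decreasing order while your lambda vector is increasing, and you do not get the value corresponding to the minimum of the cross-validated error. It also explains why your graph looks strange. Note also that in your function the mean cross-validated error is stored in cv, whereas cv.error holds its standard error, so the minimum should be taken over result$cv.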
An easy fix would be to declare lambda at the beginning of the script as a decreasing vector (note the negative step, which seq requires when going from 1 down to 0):
lambda = seq(1, 0, -0.02)
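Final remark: be careful when calling glmnet with a single lambda, as you do for the final fit. The glmnet documentation advises against supplying a single value; it is safer to fit the whole path and extract the coefficients with coef(fit, s = lambda.opt).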
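To make the selection step concrete, here is a minimal base-R sketch of manual K-fold cross-validation over a lambda grid. It is not the glmnet lasso: to keep it dependency-free it swaps in closed-form ridge regression on simulated data, and every name in it (ridge_coef, residmat, lambda.min, lambda.1se) is an assumption made for illustration. The point is only that the mean CV error and its standard error are indexed by the same positions as lambda, and that the one-standard-error rule picks the largest lambda whose mean error is within one standard error of the minimum.

```r
set.seed(42)
n <- 100; p <- 5
x <- scale(matrix(rnorm(n * p), n, p))            # standardized predictors
y <- drop(x %*% c(2, -1, 0, 0, 0) + rnorm(n))
y <- y - mean(y)                                  # centered response, no intercept

lambda <- seq(0, 50, by = 1)                      # penalty grid (hypothetical)
K <- 10
folds <- split(sample(seq_len(n)), rep(seq_len(K), length.out = n))

# closed-form ridge coefficients (stand-in for the lasso fit)
ridge_coef <- function(x, y, lam) {
  solve(crossprod(x) + lam * diag(ncol(x)), crossprod(x, y))
}

# residmat[j, k] = test MSE for lambda[j] on fold k
residmat <- matrix(NA_real_, length(lambda), K)
for (k in seq_len(K)) {
  omit <- folds[[k]]
  for (j in seq_along(lambda)) {
    b <- ridge_coef(x[-omit, , drop = FALSE], y[-omit], lambda[j])
    residmat[j, k] <- mean((y[omit] - x[omit, , drop = FALSE] %*% b)^2)
  }
}

cv    <- rowMeans(residmat)                       # mean CV error per lambda
cv.se <- apply(residmat, 1, sd) / sqrt(K)         # its standard error

i.min      <- which.min(cv)                       # minimise the mean, not the SE
lambda.min <- lambda[i.min]
lambda.1se <- max(lambda[cv <= cv[i.min] + cv.se[i.min]])
c(lambda.min = lambda.min, lambda.1se = lambda.1se)
```

Because the inner loop indexes lambda by position, the order of the grid does not matter here. With glmnet the rows come back in decreasing-lambda order, so the grid used for the lookup has to be sorted the same way.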

Related

Does caret::train() in r have a standardized output across different fit methods/models?

I'm working with the train() function from the caret package to fit multiple regression and ML models and compare their fit. I'd like to write a function that iterates through all model types and enters the best fit into a data frame. The biggest issue is that caret doesn't provide all the model fit statistics I'd like, so they need to be derived from the raw output. Based on my exploration, there doesn't seem to be a standardized way in which caret outputs each model's fit.
Another post (sorry, I don't have a link) provided this function, which pulls from fit$results and fit$bestTune to get the precalculated RMSE, R^2, etc.
get_best_result <- function(caret_fit) {
best = which(rownames(caret_fit$results) == rownames(caret_fit$bestTune))
best_result = caret_fit$results[best, ]
rownames(best_result) = NULL
best_result
}
One example of a fit statistic I need to calculate from the raw output is BIC. The two functions below do that. They need the residuals (y_actual - y_predicted), the number of x variables (k), and the number of rows used in the fit (n). k and n must be derived from the output rather than the original dataset, because a model may drop x variables (feature selection) or rows (omitting NAs) depending on its algorithm.
calculate_MSE <- function(residuals){
# residuals can be replaced with y_actual-y_predicted
mse <- mean(residuals^2)
return(mse)
}
calculate_BIC <- function(n, mse, k){
BIC <- n*log(mse)+k*log(n)
return(BIC)
}
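As a sanity check on the formula above, the following sketch (base R only, simulated data, all object names hypothetical) compares n*log(mse) + k*log(n) with stats::BIC for an ordinary lm fit. It assumes that k counts every estimated parameter, that is, the coefficients including the intercept plus the error variance. Under that assumption the two values should differ only by the constant n*(log(2*pi) + 1), which does not affect comparisons between models fitted to the same rows.

```r
set.seed(1)
n <- 60
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 0.5 * dat$x1 - 0.3 * dat$x2 + rnorm(n)

fit_lm <- lm(y ~ x1 + x2, data = dat)

res    <- residuals(fit_lm)          # y_actual - y_predicted
n_used <- length(res)                # rows actually used in the fit
mse    <- mean(res^2)
k      <- length(coef(fit_lm)) + 1   # coefficients + error variance (assumption)

bic_manual <- n_used * log(mse) + k * log(n_used)
bic_stats  <- BIC(fit_lm)

# the Gaussian log-likelihood contributes this constant
const <- n_used * (log(2 * pi) + 1)
all.equal(bic_manual + const, bic_stats)
```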
The real question is: is there a standardized output of caret::train() for the x variables and for y_actual, y_predicted, or the residuals?
I tried fit$finalModel$model and other methods but to no avail.
Here is a reproducible example along with the function I'm using. Please consider the functions above a part of this reproducible example.
library(caret) # train()
library(rlist) # list.rbind()
library(data.table)
# data
df <- data.frame(y1 = rnorm(50, 0, 1),
y2 = rnorm(50, .25, 1.5),
x1 = rnorm(50, .4, .9),
x2 = rnorm(50, 0, 1.1),
x3 = rnorm(50, 1, .75))
missing_index <- sample(1:50, 7, replace = F)
df[missing_index,] <- NA
# function to fit models and pull results
fitModels <- function(df, Ys, Xs, models){
# empty list
results <- list()
# number of for loops
loops_counter <- 0
# for every y
for(y in 1:length(Ys)){
# for every model
for(m in 1:length(models)){
# track loops
loops_counter <- loops_counter + 1
# fit the model
set.seed(1) # seed for reproducibility
fit <- tryCatch(train(as.formula(paste(Ys[y], paste(Xs, collapse = ' + '),
sep = ' ~ ')),
data = df,
method = models[m],
na.action = na.omit,
tuneLength = 10),
error = function(e) {return(NA)})
# pull results
results[[loops_counter]] <- c(Y = Ys[y],
model = models[m],
sample_size = nrow(fit$finalModel$model),
RMSE = get_best_result(fit)[[2]],
R2 = get_best_result(fit)[[3]],
MAE = get_best_result(fit)[[4]],
BIC = calculate_BIC(n = length(fit$finalModel),
mse = calculate_MSE(fit$finalModel$residuals),
k = length(fit$finalModel$xNames)))
}
}
# list bind
results_df <- list.rbind(results)
return(results_df)
}
linear_models <- c('lm', 'glmnet', 'ridge', 'lars', 'enet')
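# column names must be passed as character strings, since they are pasted into a formula
fits <- fitModels(df, c('y1', 'y2'), c('x1', 'x2', 'x3'), linear_models)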

Misuse predict.rq in the package quantreg?

I am using the quantreg package to predict new data from a model fitted on a training set. However, I noticed a discrepancy between the output of predict.rq (or predict) and a manual calculation. Here is an example.
The quantile regression setting is
N = 10000
tauList = (1:11)/12 # same values as seq(1:11/12)/12: 1/12, 2/12, ..., 11/12
y = rchisq(N,2)
X = matrix( rnorm(3*N) ,nrow = N, ncol = 3 )
fit <- rq( y ~ X-1, tau = tauList, method = "fn")
The new data set I want to predict is
newdata <- matrix( rbeta((3*N),2,2) ,nrow = N,ncol=3 )
I use predict.rq or predict to predict newdata. Both return the same result:
fit_use_predict <- predict.rq( fit, newdata = as.data.frame(newdata) )
Also I manually do the prediction based on the coefficients matrix:
coef_mat <- coef(fit)
fit_use_multiplication <- newdata %*% coef_mat
I expected the two to be numerically identical, but they are not:
diff <- fit_use_predict - fit_use_multiplication
print(diff)
The difference is not negligible.
However, when predicting the original data set X, both approaches return the same result, i.e.,
predict(fit, newdata = data.frame(X)) = X %*% coef_mat ## True
Am I missing something when using the function? Thanks!

A more serious problem here, before we get to prediction, is that the model forces all of the fitted quantile functions through the origin of the design space, and since the covariates are centered at the origin, all of the quantile functions are forced to cross there. Even if the X's all lay in the positive orthant, it would be quite a strong assumption to say that the distribution of the response is degenerate at the origin.

I think you just have to retain the 'X' name in your data as it was in the training data.
library(quantreg)
N = 10000
tauList = (1:11)/12 # same values as seq(1:11/12)/12: 1/12, 2/12, ..., 11/12
y = rchisq(N,2)
X = matrix( rnorm(3*N) ,nrow = N, ncol = 3 )
fit <- rq( y ~ X-1, tau = tauList, method = "fn")
newdata <- matrix( rbeta((3*N),2,2) ,nrow = N,ncol=3 )
fit_use_predict <- predict.rq( fit, newdata = data.frame(X=I(newdata)) )
coef_mat <- coef(fit)
fit_use_multiplication <- newdata %*% coef_mat
diff <- fit_use_predict - fit_use_multiplication
max( abs(diff) )
Output is 0
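The reason the name matters can be reproduced without quantreg. The sketch below uses lm from base R as a stand-in, on the assumption that predict.rq resolves formula variables the same way (through model.frame); the data and object names are made up for illustration. as.data.frame(newdata) produces columns V1, V2, V3, so the term X in the formula is not found in newdata and is silently taken from the workspace, which means the "predictions" are just the fitted values for the original X.

```r
set.seed(1)
n <- 20
X <- matrix(rnorm(3 * n), nrow = n, ncol = 3)
y <- rnorm(n)
fit <- lm(y ~ X - 1)                      # matrix term named X, as in the question

newdata <- matrix(runif(3 * n), nrow = n, ncol = 3)
manual  <- drop(newdata %*% coef(fit))    # prediction by hand

# columns are V1..V3, so X is looked up in the workspace instead
wrong <- predict(fit, newdata = as.data.frame(newdata))

# a single matrix column literally named X matches the formula term
right <- predict(fit, newdata = data.frame(X = I(newdata)))

all.equal(unname(wrong), unname(fitted(fit)))   # predictions for the OLD X
all.equal(unname(right), manual)                # predictions for the new data
```

Because newdata has the same number of rows as X in this example, no warning is raised, which is what makes the problem easy to miss.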

Fitting probit model in R

For my thesis I have to fit some GLMs by maximum likelihood that R doesn't provide. I was doing fine with the models that have a closed form, but now I have to use the Gaussian CDF, so I decided to fit a simple probit model.
This is the code:
Data:
set.seed(123)
x <-matrix( rnorm(50,2,4),50,1)
m <- matrix(runif(50,2,4),50,1)
t <- matrix(rpois(50,0.5),50,1)
z <- (1+exp(-((x-mean(x)/sd(x)))))^-1 + runif(50)
y <- ifelse(z < 1.186228, 0, 1)
data1 <- as.data.frame(cbind(y,x,m,t))
myprobit <- function (formula, data)
{
mf <- model.frame(formula, data)
y <- model.response(mf, "numeric")
X <- model.matrix(formula, data = data)
if (any(is.na(cbind(y, X))))
stop("Some data are missing.")
loglik <- function(betas, X, y, sigma) { #loglikelihood
p <- length(betas)
beta <- betas[-p]
eta <- X %*% beta
sigma <- 1 #because of identification, sigma must be equal to 1
G <- pnorm(y, mean = eta,sd=sigma)
sum( y*log(G) + (1-y)*log(1-G))
}
ls.reg <- lm(y ~ X - 1)#starting values using ols, indicating that this model already has a constant
start <- coef(ls.reg)
fit <- optim(start, loglik, X = X, y = y, control = list(fnscale = -1), method = "BFGS", hessian = TRUE) # optimize (fnscale = -1 makes optim maximize)
if (fit$convergence > 0) {
print(fit)
stop("optim failed to converge!") #verify convergence
}
return(fit)
}
myprobit(y ~ x + m + t,data = data1)
And I get: Error in X %*% beta : non-conformable arguments. If I replace start <- coef(ls.reg) with start <- c(coef(ls.reg), 1), I get wrong estimates compared with:
probit <- glm(y ~ x + m + t,data = data1 , family = binomial(link = "probit"))
What am I doing wrong?
Is it possible to fit this model correctly using pnorm? If not, what algorithm should I use to approximate the Gaussian CDF? Thanks!!

The line of code responsible for your error is the following:
eta <- X %*% beta
Note that "%*%" is the matrix multiplication operator. By reproducing your code I noticed that X is a matrix with 50 rows and 4 columns. Hence, for matrix multiplication to be possible, your "beta" needs to have 4 elements. But when you run "betas[-p]" you subset the betas vector by removing its last element, leaving only three elements instead of the four you need for the matrix multiplication to be defined. If you remove [-p], the multiplication is conformable and the code will run.
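Separately from the dimension error, the likelihood itself deserves a look: in the standard probit model the success probability is pnorm(eta), not pnorm(y, mean = eta). Below is a minimal sketch of that textbook log-likelihood maximised with optim and compared against glm. It uses simulated data rather than the data from the question, and the names negloglik, fit and glm_fit are only illustrative.

```r
set.seed(123)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, pnorm(-0.3 + 0.8 * x))   # simulated probit data
X <- cbind(1, x)                           # design matrix with intercept

# negative probit log-likelihood; log.p = TRUE is more stable than log(pnorm())
negloglik <- function(beta, X, y) {
  eta <- drop(X %*% beta)
  -sum(y * pnorm(eta, log.p = TRUE) +
       (1 - y) * pnorm(eta, lower.tail = FALSE, log.p = TRUE))
}

fit     <- optim(c(0, 0), negloglik, X = X, y = y, method = "BFGS")
glm_fit <- glm(y ~ x, family = binomial(link = "probit"))

# the two sets of estimates should agree up to optimiser tolerance
cbind(optim = fit$par, glm = coef(glm_fit))
```

There is no sigma to estimate here: it is fixed at 1 for identification, so the parameter vector has exactly as many elements as X has columns.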

Predict Logistf

I'm using an R package called logistf to fit a logistic regression. I saw that there is no predict function for new data in this package, and the generic predict function does not work with it, so I found some code that shows how to do this with new data:
fit<-logistf(Tax ~ L20+L24+L28+L29+L31+L32+L33+L36+S10+S15+S16+S17+S20, data=trainData)
betas <- coef(fit)
X <- model.matrix(fit, data=testData)
probs <- 1 / (1 + exp(-X %*% betas))
I want to build a cross-validation version of this using fit$predict and the probabilities that probs gives me. Has anyone done something like this before?
The other thing I want to know is about fit$predict. I'm fitting a binary logistic regression and this component returns many values. Are these values for class 0 or class 1, and how can I tell? Thanks

The code you wrote works perfectly, but there seems to be a more concise way of getting the same results with the brglm package:
library(brglm)
brglm_model <- brglm(formula = response ~ predictor , family = "binomial", data = train )
brglm_pred <- predict(object = brglm_model, newdata = test , type = "response")
For the CV, you have to write a few lines of code, I guess:
library(dplyr) # for %>% and filter()
library(ggplot2) # for the plot at the end
#Setting the number of folds, and number of instances in each fold
n_folds <- 5
fold_size <- nrow(dataset) %/% 5
residual <- nrow(dataset) %% 5
#label the instances based on the number of folds
cv_labels <- c(rep(1,fold_size),rep(2,fold_size), rep(3,fold_size), rep(4,fold_size), rep(5,fold_size), rep(5,residual))
# the error term would differ based on each threshold value
t_seq <- seq(0.1,0.9,by = 0.1)
index_mat <- matrix(ncol = (n_folds+1) , nrow = length(t_seq))
index_mat[,1] <- t_seq
# the main loop for calculation of the CV error on each fold
for (i in 1:5){
train <- dataset %>% filter(cv_labels != i)
test <- dataset %>% filter(cv_labels == i )
brglm_cv_model <- brglm(formula = response_var ~ . , family = "binomial", data = train )
# predict with the model fitted on this fold's training data, not the earlier brglm_model
brglm_cv_pred <- predict(object = brglm_cv_model, newdata = test , type = "response")
# error formula that you want, e.g. misclassification
counter <- 0
for (threshold in t_seq ) {
counter <- counter + 1
conf_mat <- table( factor(test$response_var) , factor(brglm_cv_pred>threshold, levels = c("FALSE","TRUE") ))
sen <- conf_mat[2,2]/sum(conf_mat[2,])
# other indices can be computed as follows
#spec <- conf_mat[1,1]/sum(conf_mat[1,])
#prec <- conf_mat[2,2]/sum(conf_mat[,2])
#F1 <- (2*prec * sen)/(prec+sen)
#accuracy <- (conf_mat[1,1]+conf_mat[2,2])/sum(conf_mat)
#here I am only interested in sensitivity
index_mat[counter,(i+1)] <- sen
}
}
# final data.frame would be the mean of sensitivity over each threshold value
final_mat <- matrix(nrow = length(t_seq), ncol = 2 )
final_mat[,1] <- t_seq
final_mat[,2] <- apply(X = index_mat[,-1] , MARGIN = 1 , FUN = mean)
final_mat <- data.frame(final_mat)
colnames(final_mat) <- c("threshold","sensitivity")
#why not have a look at the CV-sensitivity of the model over threshold values?
ggplot(data = final_mat) +
geom_line(aes(x = threshold, y = sensitivity ), color = "blue")
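One thing to watch with the fold labels above: they are assigned in row order, so if the dataset happens to be sorted (for example by the response), the folds will not be comparable. A possible variant, sketched here in base R with a hypothetical row count, shuffles the labels and also works for any number of folds without writing out each rep() by hand:

```r
set.seed(2024)
n       <- 23    # hypothetical number of rows, i.e. nrow(dataset)
n_folds <- 5

# recycle 1..n_folds to length n, then shuffle so fold membership is random
cv_labels <- sample(rep(seq_len(n_folds), length.out = n))

table(cv_labels)                 # fold sizes differ by at most one
test_rows <- which(cv_labels == 1)   # rows held out in the first fold
```

The loop can then use cv_labels exactly as before.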

How to estimate the Kalman Filter with 'KFAS' R package, with an AR(1) transition equation?

I am using 'KFAS' package from R to estimate a state-space model with the Kalman filter. My measurement and transition equations are:
y_t = Z_t * x_t + \eps_t (measurement)
x_t = T_t * x_{t-1} + R_t * \eta_t (transition),
with \eps_t ~ N(0,H_t) and \eta_t ~ N(0,Q_t).
So, I want to estimate the variances H_t and Q_t, but also T_t, the AR(1) coefficient. My code is as follows:
library(KFAS)
set.seed(100)
eps <- rt(200, 4, 1)
meas <- as.matrix((arima.sim(n=200, list(ar=0.6), innov = rnorm(200)*sqrt(0.5)) + eps),
ncol=1)
Zt <- 1
Ht <- matrix(NA)
Tt <- matrix(NA)
Rt <- 1
Qt <- matrix(NA)
ss_model <- SSModel(meas ~ -1 + SSMcustom(Z = Zt, T = Tt, R = Rt,
Q = Qt), H = Ht)
fit <- fitSSM(ss_model, inits = c(0,0.6,0), method = 'L-BFGS-B')
But it returns: "Error in is.SSModel(do.call(updatefn, args = c(list(inits, model), update_args)),: System matrices (excluding Z) contain NA or infinite values, covariance matrices contain values larger than 1e+07"
The NA definitions for the variances work well, as documented in the package's paper. However, it seems this cannot be done for the AR coefficient. Does anyone know how I can do this?
Note that I am aware of the SSMarima function, which eases the definition of the transition equation as ARIMA models. Although I am able to estimate the AR(1) coef. and Q_t this way, I still cannot estimate the \eps_t variance (H_t). Moreover, I am migrating my Kalman filter codes from EViews to R, so I need to learn SSMcustom for other models that are more complicated.
Thanks!

It seems that you are missing something in your example, as your error message comes from the function fitSSM. If you want to use fitSSM for estimating general state space models, you need to provide your own model updating function. The default behaviour can only handle NA's in covariance matrices H and Q. The main goal of fitSSM is just to get started with simple stuff. For complex models and/or large data, I would recommend using your self-written objective function (with help of logLik method) and your favourite numerical optimization routines manually for maximum performance. Something like this:
library(KFAS)
set.seed(100)
eps <- rt(200, 4, 1)
meas <- as.matrix((arima.sim(n=200, list(ar=0.6), innov = rnorm(200)*sqrt(0.5)) + eps),
ncol=1)
Zt <- 1
Ht <- matrix(NA)
Tt <- matrix(NA)
Rt <- 1
Qt <- matrix(NA)
ss_model <- SSModel(meas ~ -1 + SSMcustom(Z = Zt, T = Tt, R = Rt,
Q = Qt), H = Ht)
objf <- function(pars, model, estimate = TRUE) {
model$H[1] <- pars[1]
model$T[1] <- pars[2]
model$Q[1] <- pars[3]
if (estimate) {
-logLik(model)
} else {
model
}
}
opt <- optim(c(1, 0.5, 1), objf, method = "L-BFGS-B",
lower = c(0, -0.99, 0), upper = c(100, 0.99, 100), model = ss_model)
ss_model_opt <- objf(opt$par, ss_model, estimate = FALSE)
Same with fitSSM:
updatefn <- function(pars, model) {
model$H[1] <- pars[1]
model$T[1] <- pars[2]
model$Q[1] <- pars[3]
model
}
fit <- fitSSM(ss_model, c(1, 0.5, 1), updatefn, method = "L-BFGS-B",
lower = c(0, -0.99, 0), upper = c(100, 0.99, 100))
identical(ss_model_opt, fit$model)
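If it helps to see what such an objective function computes, here is a rough base-R sketch of the same idea for this particular model, with a hand-written scalar Kalman filter in place of KFAS's logLik. This is an illustration under assumptions, not KFAS's implementation: the state is initialised from the stationary distribution of the AR(1) process, the data are simulated with Gaussian measurement noise, and kf_negloglik is a made-up name.

```r
set.seed(100)
n     <- 300
state <- as.numeric(arima.sim(model = list(ar = 0.6), n = n, sd = sqrt(0.5)))
y     <- state + rnorm(n, sd = 1)          # measurement = AR(1) state + noise

# negative Gaussian log-likelihood of y_t = x_t + eps_t, x_t = phi * x_{t-1} + eta_t
# pars = c(H, phi, Q): measurement variance, AR coefficient, state variance
kf_negloglik <- function(pars, y) {
  H <- pars[1]; phi <- pars[2]; Q <- pars[3]
  a <- 0                                   # predicted state mean
  P <- Q / (1 - phi^2)                     # predicted state variance (stationary prior)
  ll <- 0
  for (t in seq_along(y)) {
    v  <- y[t] - a                         # one-step prediction error
    Fv <- P + H                            # its variance
    ll <- ll - 0.5 * (log(2 * pi) + log(Fv) + v^2 / Fv)
    K  <- P / Fv                           # Kalman gain
    a  <- phi * (a + K * v)                # update, then predict the next state
    P  <- phi^2 * P * (1 - K) + Q
  }
  -ll
}

lower <- c(1e-6, -0.99, 1e-6)
upper <- c(100, 0.99, 100)
opt <- optim(c(1, 0.5, 1), kf_negloglik, y = y, method = "L-BFGS-B",
             lower = lower, upper = upper)
setNames(opt$par, c("H", "phi", "Q"))
```

The structure mirrors objf above: map the parameter vector onto the system matrices, run the filter, and return the negative log-likelihood for the optimiser.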