R biglasso results don't match hdm or glmnet

I've been experimenting with the R package 'biglasso' for high-dimensional data. However, the results I'm getting don't match the results from the LASSO functions in 'hdm' or 'glmnet'. The documentation for 'biglasso' is also really poor.
In the example below, the results from hdm and glmnet are very close but not exact, which is expected. However, biglasso doesn't drop the 'share' variable. I've tried all the different screen settings, and it doesn't make a difference. Any thoughts on how to get biglasso to be more consistent with the others? Thanks!
EDIT: for a given value of lambda, the results are highly similar, but each method seems to select a different lambda, which for hdm makes sense, given that it's intended for causal inference and isn't concerned with out-of-sample prediction. hdm uses a different penalization approach, following Belloni et al. (2012), but I'm not sure why cv.biglasso and cv.glmnet would differ so much. If I run biglasso without a screening rule, they should be minimizing the same objective function, just with random differences in the CV folds, no?
EDIT 2: I've edited the code below to include F. Privé's code to make glmnet use an algorithm similar to biglasso, and some additional code to make biglasso mimic glmnet.
##########
## PREP ##
##########
## Load required libraries
library(hdm)
library(biglasso)
library(glmnet)
## Read automobile dataset
data(BLP)
df <- BLP[[1]]
## Extract outcome
Y <- scale(df$mpg)
## Rescale variables
df$price <- scale(df$price)
df$mpd <- scale(df$mpd)
df$space <- scale(df$space)
df$hpwt <- scale(df$hpwt)
df$outshr <- scale(df$outshr)
## Limit to variables I want
df <- df[,names(df) %in% c("price","mpd","space","hpwt","share","outshr","air")]
## Convert to matrix
df.mat <- data.matrix(df)
df.bm <- as.big.matrix(df.mat)
#########
## HDM ##
#########
## Set seed for reproducibility
set.seed(1233)
## Run LASSO
fit.hdm <- rlasso(x=df.mat, y=Y, post=FALSE, intercept=TRUE)
## Check results
coef(fit.hdm)
############
## GLMNET ##
############
## Set seed for reproducibility
set.seed(1233)
## LASSO with 10-fold cross-validation
fit.glmnet <- cv.glmnet(df.mat, Y, alpha=1, family="gaussian")
## Check default results
coef(fit.glmnet)
## Try to mimic results of biglasso
coef(fit.glmnet, s = "lambda.min")
##############
## BIGLASSO ##
##############
## LASSO with 10-fold cross-validation
fit.bl <- cv.biglasso(df.bm, Y, penalty = "lasso", eval.metric = "default",
                      family = "gaussian", screen = "None",
                      seed = 1233, nfolds = 10)
## Check default results
coef(fit.bl)
## Try to mimic results of glmnet
## Calculate threshold for CV error (minimum + 1 standard error)
thresh <- min(fit.bl$cve) + sd(fit.bl$cve)/sqrt(100)
## Identify highest lambda with CVE at or below threshold
max.lambda <- max(fit.bl$lambda[fit.bl$cve <= thresh])
## Check results for the given lambda
coef(fit.bl$fit)[,which(fit.bl$fit$lambda==max.lambda)]
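As a sanity check, here is a minimal sketch comparing the two CV fits at the same lambda (this assumes coef() on a cv.biglasso object returns coefficients at lambda.min, matching its default output above); the coefficients should line up, so only the lambda selection rule differs:
## Sketch: compare the two CV fits at the same lambda value
lam <- fit.bl$lambda.min
cbind(biglasso = as.numeric(coef(fit.bl)),              # coefficients at biglasso's lambda.min
      glmnet   = as.numeric(coef(fit.glmnet, s = lam))) # glmnet evaluated at the same lambda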

There are basically two ways to choose the "best" lambda after CV:
1. The lambda that minimizes the CV error (the default in {biglasso}, i.e. lambda.min).
2. The most parsimonious lambda (the largest one) whose CV error is within one standard error of the minimum (the default in {glmnet}, i.e. lambda.1se).
Try coef(fit.glmnet, s = "lambda.min") to use the minimum.
Also, to ensure reproducibility, set the CV folds yourself instead of relying on a seed: there is a foldid parameter in cv.glmnet() and a cv.ind parameter in cv.biglasso().
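For example, a minimal sketch of shared folds (object names are just illustrative): build one fold assignment and pass it to both functions so the CV partitions are identical.
## Sketch: share one fold assignment between cv.glmnet and cv.biglasso
set.seed(1233)
folds <- sample(rep(1:10, length.out = nrow(df.mat)))   # one fold id per observation
fit.glmnet2 <- cv.glmnet(df.mat, Y, alpha = 1, family = "gaussian", foldid = folds)
fit.bl2 <- cv.biglasso(df.bm, Y, penalty = "lasso", family = "gaussian",
                       screen = "None", cv.ind = folds)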

Related

How to predict in kknn function?

I'm trying to use kknn plus a loop to create leave-one-out cross-validation for a model, and to compare that with train.kknn.
I have split the data into two parts: training (80% of the data) and test (20%). Within the training data, I exclude one point at a time in the loop to create the LOOCV manually.
I think something goes wrong in predict(knn.fit, data.test). I have tried to find out how to predict with kknn from the package documentation and online, but all the examples use "summary(model)" and "table(validation...)" rather than predicting on separate test data. predict(model, dataset) works for models from train.kknn, so I thought I could use similar arguments with kknn.
I am not sure whether such a prediction function exists for kknn. If it does, what arguments should I give it?
I look forward to your suggestions. Thank you.
library(kknn)
for (i in 1:nrow(data.train)) {
  train.data <- data.train[-i, ]
  validation.data <- data.train[i, ]
  knn.fit <- kknn(as.factor(R1) ~ ., train.data, validation.data, k = 40,
                  kernel = "rectangular", scale = TRUE)
  # train.data + validation.data is the 80% data I split.
}
pred.knn <- predict(knn.fit, data.test) # data.test is 20% data.
Here is the error message:
Error in switch(type, raw = object$fit, prob = object$prob,
stop("invalid type for prediction")) : EXPR must be a length 1
vector
What I actually want is to compare train.kknn and kknn + loop on the leave-one-out CV results. I have a few more questions:
1) In kknn: is it possible to use another set of data as test data to see the knn.fit prediction?
2) In train.kknn: I split the data, use 80% of it for training, and intend to use the remaining 20% for prediction. Is that correct common practice?
3) Or should I just use the whole data set for train.kknn, and create a loop with data[-i,] for training and data[i,] for validation in kknn, so that the two are exact counterparts?
I find that if I fit train.kknn on the training data and then predict on the test set, the best k and kernel are selected internally and used directly to generate predictions for the test data.
In contrast, if I use the kknn function inside a loop over different k values, the model produces predictions for the test set each time k changes, and the best k is then chosen by the actual prediction accuracy on the test data. In short, the best k selected by train.kknn may not work best on the test data.
Thank you.
For objects returned by kknn, predict gives the predicted value or the predicted probabilities of R1 for the single row contained in validation.data:
predict(knn.fit)
predict(knn.fit, type="prob")
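If you want predictions for a separate test set (question 1), note that kknn takes the test data at fitting time as its third argument; a hedged sketch, reusing the objects above (the object names here are just illustrative):
## Sketch: kknn takes the test data at fitting time, so pass data.test directly
knn.fit.test <- kknn(as.factor(R1) ~ ., data.train, data.test, k = 40,
                     kernel = "rectangular", scale = TRUE)
pred.knn.test <- predict(knn.fit.test)   # predictions for every row of data.test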
The predict command also works on objects returned by train.kknn.
For example:
train.kknn.fit <- train.kknn(as.factor(R1) ~ ., data.train, ks = 10,
                             kernel = "rectangular", scale = TRUE)
class(train.kknn.fit)
# [1] "train.kknn" "kknn"
pred.train.kknn <- predict(train.kknn.fit, data.test)
table(pred.train.kknn, as.factor(data.test$R1))
The train.kknn command implements a leave-one-out method very close to the loop developed by @vcai01. See the following example:
set.seed(43210)
n <- 500
data.train <- data.frame(R1=rbinom(n,1,0.5), matrix(rnorm(n*10), ncol=10))
library(kknn)
pred.kknn <- array(0, nrow(data.train))
for (i in 1:nrow(data.train)) {
  train.data <- data.train[-i, ]
  validation.data <- data.train[i, ]
  knn.fit <- kknn(as.factor(R1) ~ ., train.data, validation.data, k = 40,
                  kernel = "rectangular", scale = TRUE)
  pred.kknn[i] <- predict(knn.fit)
}
knn.fit <- train.kknn(as.factor(R1) ~ ., data.train, ks = 40,
                      kernel = "rectangular", scale = TRUE)
pred.train.kknn <- predict(knn.fit, data.train)
table(pred.train.kknn, pred.kknn)
# pred.kknn
# pred.train.kknn 1 2
# 0 374 14
# 1 9 103

How to export a 'flexmix' model (in R) to TeX?

I have used the R package 'flexmix' to create some regression models. I now want to export the results to TeX.
Unlike conventional models created with lm(), the flexmix models are not saved as named numerics but as FLXRoptim objects.
When I use the normal syntax from the 'texreg' package to create TeX code from the model results, I get an error message:
"unable to find an inherited method for function ‘extract’ for signature ‘"FLXRoptim"’"
The coefficients have to be accessed directly; they are stored as 'Coefmat' objects, and I did not manage to make these usable for texreg().
library(flexmix)
library(texreg)
data("patent")
## 1. Flexmix model ##
flex.model <- flexmix(formula = Patents ~ lgRD, data = patent, k = 3,
                      model = FLXMRglm(family = "poisson"),
                      concomitant = FLXPmultinom(~RDS))
re.flex.model <- refit(flex.model)
## 2. Approach of results extraction ##
comp1.flex <- re.flex.model@components[[1]][["Comp.1"]]
## 3. Not working: Tex Export ##
texreg(comp1.flex)
Do you guys have an idea how to make these model results usable for Tex export?
I have now found a workaround: 'texreg' lets us create texreg objects from manually specified columns.
createTexreg(coef.names, coef, se, pvalues)
Using the example from above:
## Take estimates, SEs, and p-values for Comp1 ##
est1 <- re.flex.model@components[[1]][["Comp.1"]][,1]
se1 <- re.flex.model@components[[1]][["Comp.1"]][,2]
pval1 <- re.flex.model@components[[1]][["Comp.1"]][,4]
## Take estimates, SEs, and p-values for Comp2 ##
est2 <- re.flex.model@components[[1]][["Comp.2"]][,1]
se2 <- re.flex.model@components[[1]][["Comp.2"]][,2]
pval2 <- re.flex.model@components[[1]][["Comp.2"]][,4]
## Create Texreg objects and export into Tex ##
mymodel1 <- createTexreg(row.names(comp1.flex), est1, se1, pval1)
mymodel2 <- createTexreg(row.names(comp1.flex), est2, se2, pval2)
models.flex = list(mymodel1, mymodel2)
texreg(models.flex)
That's probably the most practical way to turn such specialized models into conventional TeX output.
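Alternatively, a hedged sketch of a more reusable route via texreg's S4 'extract' extension point; the helper name and the Coefmat column positions (estimate, SE, and p-value in columns 1, 2, and 4) are assumptions based on the extraction above:
## Sketch: register the manual extraction as a texreg 'extract' method
extract.FLXRoptim <- function(model, component = "Comp.1", ...) {
  cm <- model@components[[1]][[component]]   # Coefmat: estimate, SE, z value, p-value
  createTexreg(coef.names = rownames(cm),
               coef = cm[, 1], se = cm[, 2], pvalues = cm[, 4])
}
setMethod("extract", signature = "FLXRoptim", definition = extract.FLXRoptim)
texreg(list(extract.FLXRoptim(re.flex.model, "Comp.1"),
            extract.FLXRoptim(re.flex.model, "Comp.2")))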

Variable importance for support vector machine and naive Bayes classifiers in R

I’m working on building predictive classifiers in R on a cancer dataset.
I’m using random forest, support vector machine, and naive Bayes classifiers. I’m unable to calculate variable importance for the SVM and NB models; I end up with the following error:
Error in UseMethod("varImp") :
no applicable method for 'varImp' applied to an object of class "c('svm.formula', 'svm')"
I would greatly appreciate it if anyone could help me.
Given
library(e1071)
model <- svm(Species ~ ., data = iris)
class(model)
# [1] "svm.formula" "svm"
library(caret)
varImp(model)
# Error in UseMethod("varImp") :
# no applicable method for 'varImp' applied to an object of class "c('svm.formula', 'svm')"
methods(varImp)
# [1] varImp.bagEarth varImp.bagFDA varImp.C5.0* varImp.classbagg*
# [5] varImp.cubist* varImp.dsa* varImp.earth* varImp.fda*
# [9] varImp.gafs* varImp.gam* varImp.gbm* varImp.glm*
# [13] varImp.glmnet* varImp.JRip* varImp.lm* varImp.multinom*
# [17] varImp.mvr* varImp.nnet* varImp.pamrtrained* varImp.PART*
# [21] varImp.plsda varImp.randomForest* varImp.RandomForest* varImp.regbagg*
# [25] varImp.rfe* varImp.rpart* varImp.RRF* varImp.safs*
# [29] varImp.sbf* varImp.train*
There is no varImp.svm method in methods(varImp), hence the error. You might also want to have a look at this related post on Cross Validated.
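As a workaround, here is a minimal hedged sketch of model-agnostic permutation importance for the e1071 fit above (not a caret::varImp method; shuffling a predictor and measuring the drop in training accuracy is purely for illustration):
## Sketch: permutation importance for the svm() fit on iris shown above
library(e1071)
set.seed(1)
model <- svm(Species ~ ., data = iris)
base_acc <- mean(predict(model, iris) == iris$Species)
perm_imp <- sapply(setdiff(names(iris), "Species"), function(v) {
  shuffled <- iris
  shuffled[[v]] <- sample(shuffled[[v]])                        # permute one predictor
  base_acc - mean(predict(model, shuffled) == shuffled$Species) # accuracy drop
})
sort(perm_imp, decreasing = TRUE)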
In R, variable importance can also be calculated with the Importance() function in the rminer package. Here is some sample code:
library(rminer)
M <- fit(y~., data=train, model="svm", kpar=list(sigma=0.10), C=2)
svm.imp <- Importance(M, data=train)
For details, see the package manual: https://cran.r-project.org/web/packages/rminer/rminer.pdf
I have created a loop that removes one predictor at a time and captures various performance measures derived from the confusion matrix in a data frame. It is not meant as a one-size-fits-all solution, but it should not be difficult to modify.
Make sure that the predicted variable is last in the data frame.
I mainly needed specificity values from the models; by removing one predictor at a time, I can evaluate the importance of each predictor: if removing predictor i produces the smallest specificity, then predictor i is the most important. You need to decide which metric you will use to attribute importance.
You can also add another for loop inside to switch between kernels (linear, polynomial, radial), but you may have to account for the other parameters, e.g. gamma. Replace "label_fake" with your target variable and df_final with your data frame.
SVM version:
set.seed(1)
varimp_df <- NULL                 # data frame collecting the results
ptm1 <- proc.time()               # start the clock
for (i in 1:(ncol(df_final) - 1)) {          # the last column is the dependent variable, hence the -1
  smp_size <- floor(0.70 * nrow(df_final))   # 70/30 split
  train_ind <- sample(seq_len(nrow(df_final)), size = smp_size)
  training <- df_final[train_ind, -c(i)]     # all of the data minus predictor i
  testing  <- df_final[-train_ind, -c(i)]
  tune.out.linear <- tune(svm, label_fake ~ .,
                          data = training,
                          kernel = "linear",
                          ranges = list(cost = 10^seq(1, 3, by = 0.5)))  # any range you see fit
  svm.linear <- svm(label_fake ~ .,
                    kernel = "linear",
                    data = training,
                    cost = tune.out.linear[["best.parameters"]][["cost"]])
  train.pred.linear <- predict(svm.linear, testing)
  testing_y <- as.factor(testing$label_fake)
  conf.matrix.svm.linear <- caret::confusionMatrix(train.pred.linear, testing_y)
  varimp_df <- rbind(varimp_df, data.frame(
    var_no      = i,
    variable    = colnames(df_final)[i],     # name of the removed predictor
    cost_param  = tune.out.linear[["best.parameters"]][["cost"]],
    accuracy    = conf.matrix.svm.linear[["overall"]][["Accuracy"]],
    kappa       = conf.matrix.svm.linear[["overall"]][["Kappa"]],
    sensitivity = conf.matrix.svm.linear[["byClass"]][["Sensitivity"]],
    specificity = conf.matrix.svm.linear[["byClass"]][["Specificity"]]))
  runtime1 <- as.data.frame(t(data.matrix(proc.time() - ptm1)))$elapsed  # elapsed time so far
  runtime1   # divide by 60 for minutes, by 3600 for hours
}
Naive Bayes version:
varimp_nb_df <- NULL
ptm1 <- proc.time()               # start the clock
for (i in 1:(ncol(df_final) - 1)) {
  smp_size <- floor(0.70 * nrow(df_final))
  train_ind <- sample(seq_len(nrow(df_final)), size = smp_size)
  training <- df_final[train_ind, -c(i)]
  testing  <- df_final[-train_ind, -c(i)]
  x <- training[, names(training) != "label_fake"]
  y <- training$label_fake
  # 'ctrl' is assumed to be a caret::trainControl() object defined beforehand (requires library(caret))
  model_nb_var <- train(x, y, 'nb', trControl = ctrl)
  predict_nb_var <- predict(model_nb_var, newdata = testing)
  confusion_matrix_nb_1 <- caret::confusionMatrix(predict_nb_var, testing$label_fake)
  varimp_nb_df <- rbind(varimp_nb_df, data.frame(
    var_no      = i,
    variable    = colnames(df_final)[i],     # name of the removed predictor
    accuracy    = confusion_matrix_nb_1[["overall"]][["Accuracy"]],
    kappa       = confusion_matrix_nb_1[["overall"]][["Kappa"]],
    sensitivity = confusion_matrix_nb_1[["byClass"]][["Sensitivity"]],
    specificity = confusion_matrix_nb_1[["byClass"]][["Specificity"]]))
  runtime1 <- as.data.frame(t(data.matrix(proc.time() - ptm1)))$elapsed  # elapsed time so far
  runtime1   # divide by 60 for minutes, by 3600 for hours
}
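As a small hedged usage note, you can then order the collected data frame by the chosen metric to get an importance ranking, e.g. for the SVM run:
## Sketch: rank the removed predictors by the resulting specificity
varimp_df[order(varimp_df$specificity), c("var_no", "variable", "specificity")]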
Have fun!

Why doesn't `predict.ksvm` predict the class labels?

I'm trying to figure out the difference in performance between ksvm (kernlab) and svm (e1071).
I'm using the spam database in the kernlab package.
Minimal working example
library("kernlab")
data(spam)
##replacing "spam" labels with 1 or -1
type= ifelse(spam[,58]=="spam",1,-1)
spam <- spam[,-58]
spam <- cbind(spam, type)
## create test and training set
spam <- spam[sample(1:4601),] #random permutation
selection <- 1:2300
training <- spam[selection,-58]
training.truth <- spam[selection,58]
test <- spam[-selection, -58]
test.truth <- spam[-selection, 58]
## train a support vector machine
filter <- ksvm(training.truth~.,
               data=training,
               kernel="rbfdot",
               class="C-svc",
               kpar=list(sigma=0.05), C=5)
## predict mail type on the test set
mailtype <- predict(filter,test)
mailtype[1,] ## returns -0.2459927
Why does this return -0.2459927 instead of the label 1 or -1?
I tried adjusting some options, but none seem to work.
You just need to define training.truth as a factor, so that ksvm fits a classifier and predict() returns class labels by default (the default type is "response"). That is:
training.truth <- as.factor(spam[selection,58])
with the rest of the code untouched, and then,
mailtype <- predict(filter,test)
mailtype[1]
[1] -1
Levels: -1 1
Hope it helps.
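For the ksvm vs. e1071::svm comparison mentioned at the top, a hedged sketch (this assumes filter has been refit with the factor training.truth as above, and uses e1071's gamma where kernlab uses sigma):
## Sketch: fit e1071::svm on the same split and compare test accuracy
library(e1071)
test.truth <- as.factor(spam[-selection, 58])
fit.e1071 <- svm(training.truth ~ ., data = training,
                 kernel = "radial", gamma = 0.05, cost = 5)
mean(predict(fit.e1071, test) == test.truth)   # e1071 test accuracy
mean(predict(filter, test) == test.truth)      # kernlab test accuracy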

trouble using foreach from doParallel with gamm4

I am trying to use foreach to parallelize a complete-subsets regression problem. The goal is to fit a complete list of models using the gamm4 package, with a binomial family where the response is supplied as a proportion and the weights argument supplies the number of trials.
The code works fine when run with %do% but fails under %dopar% (it returns only NAs for AIC and BIC). Strangely, the code does work under %dopar% if the weights argument is left out of the call, but obviously that is not a viable solution. I have used similar code with no issues for a Gaussian family and for a binomial family where the response is entered as 0/1 (so no weights argument is needed).
I am using Windows 7 64-bit with R version 3.1.2, and I have updated all the relevant packages. A reproducible (but toy) example:
set.seed(666)
# generate a random factor with a random offset effect
random.factor=factor(sort(rep(1:10,10)))
random.effect=sort(rep(rnorm(10),10))
# generate some random predictor variables
X1 = rnorm(100)
X2 = rnorm(100)
X3 = rnorm(100)
X4 = rep(0,100) # make it so one variable fails (just to check the "try" if statement)
#X4 = rnorm(100)
X5 = rnorm(100)
# calculate a response variable based on some of the predictors
z = 1 + 2*X1 + 3*X2 + 2*X3^2 # linear combination with a bias
pr = 1/(1+exp(-(z+random.effect))) # pass through an inv-logit function
y = rbinom(n=100, size=100, pr)/100 # binomial response expressed as a proportion
# Note that the response variable is a proportion of successes out of 100 trials.
# We want to feed the number of trials as a "weights" argument to gamm4/glmer.
# now make a data frame of predictors
pred.dat=data.frame(X1=X1,X2=X2,X3=X3,X4=X4,X5=X5)
pred.vars=colnames(pred.dat)
# make a dataframe for passing to gamm
use.dat = data.frame(random.factor=random.factor,y=y,pred.dat)
# now set up the models to run
# this includes all combinations of variables, but only up to a total of two in
# any one model
model.fits.test = c(combn(1:ncol(pred.dat), 1, simplify = F),
                    combn(1:ncol(pred.dat), 2, simplify = F))
models.use=list(1,2,3,4,5)
n.models=length(model.fits.test)
require(lme4)
require(doParallel)
registerDoParallel(cores=4)
# if I run this using do, it works fine (with error values from the try argument
# returned for models that fail)
out.dat <- foreach(l = 1:n.models, .combine = rbind,
                   .packages = c("lme4", "gamm4")) %do% {
  vars.vec <- model.fits.test[[l]]
  formula.l <- as.formula(paste("y~",
    paste(colnames(pred.dat)[vars.vec], collapse = "+"), "+(1|random.factor)", sep = ""))
  model.fit <- try(glmer(formula.l,
                         data = use.dat,
                         family = "binomial",
                         weights = rep(100, nrow(use.dat))))
  success <- class(model.fit)[[1]] != "try-error"
  out.vec <- c(rep(NA, 2), rep(NA, ncol(pred.dat)))
  names(out.vec) <- c("AIC", "BIC", colnames(pred.dat))
  out.vec[which(match(names(out.vec), pred.vars[vars.vec]) > 0)] <- 1
  if (success) {
    out.vec["AIC"] <- AIC(model.fit)
    out.vec["BIC"] <- BIC(model.fit)
  }
  return(out.vec)
}
out.dat
# if I run using dopar, nothing is returned.
out.dat <- foreach(l = 1:n.models, .combine = rbind,
                   .packages = c("lme4", "gamm4")) %dopar% {
  vars.vec <- model.fits.test[[l]]
  formula.l <- as.formula(paste("y~",
    paste(colnames(pred.dat)[vars.vec], collapse = "+"), "+(1|random.factor)", sep = ""))
  model.fit <- try(glmer(formula.l,
                         data = use.dat,
                         family = "binomial",
                         weights = rep(100, nrow(use.dat))))
  success <- class(model.fit)[[1]] != "try-error"
  out.vec <- c(rep(NA, 2), rep(NA, ncol(pred.dat)))
  names(out.vec) <- c("AIC", "BIC", colnames(pred.dat))
  out.vec[which(match(names(out.vec), pred.vars[vars.vec]) > 0)] <- 1
  if (success) {
    out.vec["AIC"] <- AIC(model.fit)
    out.vec["BIC"] <- BIC(model.fit)
  }
  return(out.vec)
}
out.dat
# Now run dopar without the weights argument (not really appropriate,
# but for the sake of demonstration). I get results again, but it doesn't
# really make sense to do this. Also, my real example fails unless I can supply
# weights.
out.dat <- foreach(l = 1:n.models, .combine = rbind,
                   .packages = c("lme4", "gamm4")) %dopar% {
  vars.vec <- model.fits.test[[l]]
  formula.l <- as.formula(paste("y~1+",
    paste("s(", colnames(pred.dat)[vars.vec], ")", collapse = "+"), sep = ""))
  model.fit <- try(gamm4(formula.l, random = ~(1|random.factor),
                         data = use.dat, family = "binomial"))
  success <- class(model.fit)[[1]] != "try-error"
  out.vec <- c(rep(NA, 2), rep(NA, ncol(pred.dat)))
  names(out.vec) <- c("AIC", "BIC", colnames(pred.dat))
  out.vec[which(match(names(out.vec), pred.vars[vars.vec]) > 0)] <- 1
  if (success) {
    out.vec["AIC"] <- AIC(model.fit$mer)
    out.vec["BIC"] <- BIC(model.fit$mer)
  }
  return(out.vec)
}
out.dat
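One way to narrow this down is a minimal sketch that has each worker return the error text from try() instead of silently leaving NAs, so the reason the weighted glmer call fails under %dopar% becomes visible:
## Sketch: surface worker-side errors instead of silent NAs
errs <- foreach(l = 1:n.models, .combine = c,
                .packages = c("lme4", "gamm4")) %dopar% {
  vars.vec <- model.fits.test[[l]]
  formula.l <- as.formula(paste("y~",
    paste(colnames(pred.dat)[vars.vec], collapse = "+"), "+(1|random.factor)", sep = ""))
  model.fit <- try(glmer(formula.l, data = use.dat, family = "binomial",
                         weights = rep(100, nrow(use.dat))), silent = TRUE)
  if (inherits(model.fit, "try-error")) as.character(model.fit) else "OK"
}
errs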
