predict.glm() on blind test data - r

I'm using regularized logistic regression for a classification problem using the glmnet package. In the development process, everything is working fine, but I have a problem when it comes to making predictions on blind test data.
Because I don't know the class label, my data frame for testing has a column less than the one I used for training. This seems to be a problem for predict.glm(), because it expects matching dimensions - I can "fix" it by adding a column with some arbitrary labels in the test data, but this seems like a bad idea. I hope this example will illustrate the problem:
library(glmnet)
example = data.frame(rnorm(20))
colnames(example) = "A"
example$B = rnorm(20)
example$class = ((example$A + example$B) > 0)*1
testframe = data.frame(rnorm(20))
colnames(testframe) = "A"
testframe$B = rnorm(20)
x = model.matrix(class ~ ., data = example)
y = data.matrix(example$class)
# this is similar to the situation I have with my data
# the class labels are ommited on the blind test set
So if I just proceed like this, I get an error:
x.test = as.matrix(testframe)
ridge = glmnet(x,y, alpha = 0, family = "binomial", lambda = 0.01789997)
ridge.pred = predict(ridge, newx = x.test, s = 0.01789997, type = "class")
Error in cbind2(1, newx) %*% nbeta:
Cholmod error 'X and/or Y have
wrong dimensions' at file ../MatrixOps/cholmod_sdmult.c, line 90
I can "fix" the problem by adding a class column to my test data:
testframe$class = 0
x.test = model.matrix(class ~ ., data = testframe)
ridge.pred2 = predict(ridge, newx = x.test, s = 0.01789997, type = "class")
So I have a couple of questions about this:
a) Is this workaround with adding a column safe to do? It feels very wrong/dangerous to do this, because I don't know if the predict method will use it (why would it require this column to be there otherwise?
b) What's a better / "the correct" way to do this?
Thanks in advance!

Answer
When you create the matrix x, drop the (Intercept) column (which is always the first column). Then your predict function will work without the workaround. Specifically, use this line to create x.
x = model.matrix(class ~ ., data = example)[,-1]
Explanation
You are getting an error because model.matrix is creating a column for an intercept in the model, which is not on your x.test matrix.
colnames(x)
# [1] "(Intercept)" "A" "B"
colnames(x.test)
# [1] "A" "B"
Unless you set intercept=FALSE, glmnet will add an intercept to the model for you. Thus, the simplest thing to do is exclude the intercept column from both the x and x.test matrices.

Related

Make a model matrix if missing the response variable and where matrix multiplication recreates the predict function

I want to create a model matrix for a test dataset which is missing the response variable, and where I can perfectly replicate the results of calling predict() on the model if building predictions using matrix multiplication. See code below for example.
I have code which can do this (again, see below for example), but it requires that I create a placeholder response variable in my test data. This doesn't seem very clean, and I'm wondering if there's a way to get the code to work without this workaround.
# Make data, fit model
set.seed(1); df_train = data.frame(y = rnorm(10), x = rnorm(10), z = rnorm(10))
set.seed(2); df_test = data.frame(x = rnorm(10), z = rnorm(10))
fit = lm(y ~ poly(x) + poly(z), data = df_train)
# Make model matrices. Get error for the test data as 'y' isnt found
mm_train = model.matrix(terms(fit), df_train)
mm_test = model.matrix(terms(fit), df_test) #"Error in eval(predvars, data, env) : object 'y' not found"
# Make fake y variable for test data then build model matrix. I want to know if there's a less hacky way to do this
df_test$y = 1
mm_test = model.matrix(terms(fit), df_test)
# Check predict and matrix multiplication give identical results on test data. NB this is not the case if contstructing the model matrix using (e.g.) mm_test = model.matrix(formula(fit)[-2], df_test) for the reason outlined here https://stackoverflow.com/questions/59462820/why-are-predict-lm-and-matrix-multiplication-giving-different-predictions.
preds_1 = round(predict(fit, df_test), 5)
preds_2 = round(mm_test %*% fit$coefficients, 5)
all(preds_1 == preds_2) #TRUE

R: how to make predictions using gamboost

library(mboost)
### a simple two-dimensional example: cars data
cars.gb <- gamboost(dist ~ speed, data = cars, dfbase = 4,
control = boost_control(mstop = 50))
set.seed(1)
cars_new <- cars + rnorm(nrow(cars))
> predict(cars.gb, newdata = cars_new$speed)
Error in check_newdata(newdata, blg, mf) :
‘newdata’ must contain all predictor variables, which were used to specify the model.
I fit a model using the example on the help(gamboost) page. I want to use this model to predict on a new dataset, cars_new, but encountered the above error. How can I fix this?
predict function looks for a variable called speed but when you subset it with $ sign it has no name anymore.
so, this variant of prediction works;
predict(cars.gb, newdata = data.frame(speed = cars_new$speed))
or keep the original name as is;
predict(cars.gb, newdata = cars_new['speed'])

Error with svyglm function in survey package in R: "all variables must be in design=argument"

New to stackoverflow. I'm working on a project with NHIS data, but I cannot get the svyglm function to work even for a simple, unadjusted logistic regression with a binary predictor and binary outcome variable (ultimately I'd like to use multiple categorical predictors, but one step at a time).
El_under_glm<-svyglm(ElUnder~SO2, design=SAMPdesign, subset=NULL, family=binomial(link="logit"), rescale=FALSE, correlation=TRUE)
Error in eval(extras, data, env) :
object '.survey.prob.weights' not found
I changed the variables to 0 and 1 instead:
Under_narm$SO2REG<-ifelse(Under_narm$SO2=="Heterosexual", 0, 1)
Under_narm$ElUnderREG<-ifelse(Under_narm$ElUnder=="No", 0, 1)
But then get a different issue:
El_under_glm<-svyglm(ElUnderREG~SO2REG, design=SAMPdesign, subset=NULL, family=binomial(link="logit"), rescale=FALSE, correlation=TRUE)
Error in svyglm.survey.design(ElUnderREG ~ SO2REG, design = SAMPdesign, :
all variables must be in design= argument
This is the design I'm using to account for the weights -- I'm pretty sure it's correct:
SAMPdesign=svydesign(data=Under_narm, id= ~NHISPID, weight= ~SAMPWEIGHT)
Any and all assistance appreciated! I've got a good grasp of stats but am a slow coder. Let me know if I can provide any other information.
Using some make-believe sample data I was able to get your model to run by setting rescale = TRUE. The documentation states
Rescaling of weights, to improve numerical stability. The default
rescales weights to sum to the sample size. Use FALSE to not rescale
weights.
So, one solution maybe is just to set rescale = TRUE.
library(survey)
# sample data
Under_narm <- data.frame(SO2 = factor(rep(1:2, 1000)),
ElUnder = sample(0:1, 1000, replace = TRUE),
NHISPID = paste0("id", 1:1000),
SAMPWEIGHT = sample(c(0.5, 2), 1000, replace = TRUE))
# with 'rescale' = TRUE
SAMPdesign=svydesign(ids = ~NHISPID,
data=Under_narm,
weights = ~SAMPWEIGHT)
El_under_glm<-svyglm(formula = ElUnder~SO2,
design=SAMPdesign,
family=quasibinomial(), # this family avoids warnings
rescale=TRUE) # Weights rescaled to the sum of the sample size.
summary(El_under_glm, correlation = TRUE) # use correlation with summary()
Otherwise, looking code for this function's method with 'survey:::svyglm.survey.design', it seems like there may be a bug. I could be wrong, but by my read when 'rescale' is FALSE, .survey.prob.weights does not appear to get assigned a value.
if (is.null(g$weights))
g$weights <- quote(.survey.prob.weights)
else g$weights <- bquote(.survey.prob.weights * .(g$weights)) # bug?
g$data <- quote(data)
g[[1]] <- quote(glm)
if (rescale)
data$.survey.prob.weights <- (1/design$prob)/mean(1/design$prob)
There may be a work around if you assign a vector of numeric values to .survey.prob.weights in the global environment. No idea what these values should be, but your error goes away if you do something like the following. (.survey.prob.weights needs to be double the length of the data.)
SAMPdesign=svydesign(ids = ~NHISPID,
data=Under_narm,
weights = ~SAMPWEIGHT)
.survey.prob.weights <- rep(1, 2000)
El_under_glm<-svyglm(formula = ElUnder~SO2,
design=SAMPdesign,
family=quasibinomial(),
rescale=FALSE)
summary(El_under_glm, correlation = TRUE)

R object is not a matrix

I am new to R and trying to save my svm model in R and have read the documentation but still do not understand what is wrong.
I am getting the error "object is not a matrix" which would seem to mean that my data is not a matrix, but it is... so something is missing.
My data is defined as:
data = read.table("data.csv")
trainSet = as.data.frame(data[,1:(ncol(data)-1)])
Where the last line is my label
I am trying to define my model as:
svm.model <- svm(type ~ ., data=trainSet, type='C-classification', kernel='polynomial',scale=FALSE)
This seems like it should be correct but I am having trouble finding other examples.
Here is my code so far:
# load libraries
require(e1071)
require(pracma)
require(kernlab)
options(warn=-1)
# load dataset
SVMtimes = 1
KERNEL="polynomial"
DEGREE = 2
data = read.table("head.csv")
results10foldAll=c()
# Cross Fold for training and validation datasets
for(timesRun in 1:SVMtimes) {
cat("Running SVM = ",timesRun," result = ")
trainSet = as.data.frame(data[,1:(ncol(data)-1)])
trainClasses = as.factor(data[,ncol(data)])
model = svm(trainSet, trainClasses, type="C-classification",
kernel = KERNEL, degree = DEGREE, coef0=1, cost=1,
cachesize = 10000, cross = 10)
accAll = model$accuracies
cat(mean(accAll), "/", sd(accAll),"\n")
results10foldAll = rbind(results10foldAll, c(mean(accAll),sd(accAll)))
}
# create model
svm.model <- svm(type ~ ., data = trainSet, type='C-classification', kernel='polynomial',scale=FALSE)
An example of one of my samples would be:
10.135338 7.214543 5.758917 6.361316 0.000000 18.455875 14.082668 31
Here, trainSet is a data frame but in the svm.model function it expects data to be a matrix(where you are assigning trainSet to data). Hence, set data = as.matrix(trainSet). This should work fine.
Indeed as pointed out by #user5196900 you need a matrix to run the svm(). However beware that matrix object means all columns have same datatypes, all numeric or all categorical/factors. If this is true for your data as.matrix() may be fine.
In practice more than often people want to model.matrix() or sparse.model.matrix() (from package Matrix) which gives dummy columns for categorical variables, while having single column for numerical variables. But a matrix indeed.

Change SMOTE parameters inside CARET k-fold cross-validation classification

I have a classification problem with a very skewed class to predict (e.g. 90% / 10% unbalanced binary variable to predict).
In order to deal with that issue, I want to use the SMOTE method to oversample this class variable. However, as I read here (http://www.marcoaltini.com/blog/dealing-with-imbalanced-data-undersampling-oversampling-and-proper-cross-validation) it is best practice to use SMOTE inside the k-fold loop to avoid overfitting.
As I'm using the caret package to perform my analysis, I'm referring to this link (http://topepo.github.io/caret/sampling.html). I undestand everything perfectly but the last part where it explains how to change the SMOTE parameters:
smotest <- list(name = "SMOTE with more neighbors!",
func = function (x, y) {
library(DMwR)
dat <- if (is.data.frame(x)) x else as.data.frame(x)
dat$.y <- y
dat <- SMOTE(.y ~ ., data = dat, k = 10)
list(x = dat[, !grepl(".y", colnames(dat), fixed = TRUE)],
y = dat$.y)
},
first = TRUE)
I simply don't understand this. Someone care to explain? Let's say I want to include the SMOTE parameters perc.over, k and perc.under, how would I do that?
Thank you very much.
EDIT:
Actually I realized I could probably just add these parameters inside the "SMOTE" expression in the above function, this would for instance give something like:
smotest <- list(name = "SMOTE with more neighbors!",
func = function (x, y) {
library(DMwR)
dat <- if (is.data.frame(x)) x else as.data.frame(x)
dat$.y <- y
dat <- SMOTE(.y ~ ., data = dat, k = 10, perc.over = 1200, perc.under = 100)
list(x = dat[, !grepl(".y", colnames(dat), fixed = TRUE)],
y = dat$.y)
},
first = TRUE)
I am not sure to have understood what you do not understand but here is an attempt to clarify what is done in this piece of code.
The smotest object is created as list because it is the way the argument sampling of trainControl function must be represented. The first element of this list is a name used only for display purposes. The second, func, is the actual sampling function. The third, first, is a logical value indicating whether samplin must be done before or after the pre-processing step.
The element func is here only a wrapper of SMOTE function. In this wrapper, line 3 is here because only a data.frame can be passed to SMOTE function. Line 4 is added because a formula combined to a data.frame is used in SMOTE rather than a couple x y. Line 6 is here to ensure that the appropriate format is returned to trainControl.
And, to answer you last question: yes, you can do what you have proposed to set additional parameters to SMOTE.

Resources