Changing glmnet from binomial to multinomial gives error - r

I'm trying to adapt code for a binomial glmnet to make it work for a multinomial problem, but for some reason I keep getting an error code.
Here's the original code for the binomial model that works perfectly:
traininglasso <- stratified(sp_lasso, group = "Cat",
select = list(Cat = c("A","B", "C")),
size = c(86), replace=FALSE)
traininglasso[,Cat:=factor(Cat, labels = c("B", "B", "C") )]
check_lasso <- anti_join(sp_lasso, traininglasso, by=c("Accepted Symbol"))
check_lasso[,Cat:=factor(Cat, labels = c("B", "B", "C") )]
use_for_lasso <- within(for_lasso, Cat <- relevel(Cat, ref="C"))
lassod <- model.matrix(Cat~., use_for_lasso)[,-1]
cv.lassod <- cv.glmnet(lassod, use_for_lasso$Cat, alpha =1, family= "binomial")
lambdad <- cv.lassod
lasso_modeld <- glmnet(lassod, use_for_lasso$Cat, alpha =1, family = "binomial",
lambda = lambdad$lambda.1se)
coefd <- coef(lasso_modeld)
check_lasso_matrix <- model.matrix(Cat~., check_lasso)[,-1]
probslasso4 <- as.data.frame(predict.glmnet(lasso_modeld, type="response", newx = check_lasso_matrix))
Sorry that it's so wordy, but basically my steps are this:
1. Conduct stratified random sample of original dataset to get 86 observations of each of three categories (Cat): "A", "B", and "C".
2. Join categories A and B together so that the outcome is binary (two categories, just B and C).
3. Assemble all observations not in the random sample to use for checking model accuracy at the end and recategorize those as well.
4. Run the steps for a LASSO glm as recommended.
5. Then, in the last line, generate predictions for checking the accuracy of the model using the non-training data.
Again, all of this works perfectly fine. However, when I leave my data as three categories and change the family to multinomial (those are quite literally the only changes I've made in the code below, everything else including the data is the same) I get this error message:
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 'as.matrix': requires numeric/complex matrix/vector arguments
I've read about other people getting this error and simply needing to reformat their matrices, but I suspect that's not my issue since the binomial code works with the matrix I used for that.
Here's the code that I've tried for the multinomial version that isn't working. I ran the entire code chunk above again, but I'm only including here the 4 lines that I edited to go from binomial to multinomial:
traininglasso[,Cat:=factor(Cat, labels = c("A", "B", "C") )]
check_lasso[,Cat:=factor(Cat, labels = c("A", "B", "C") )]
cv.lassod<- cv.glmnet(lassod, use_for_lasso$Cat, alpha =1, family= "multinomial")
lasso_modeld <- glmnet(lassod, use_for_lasso$Cat, alpha =1, family = "multinomial",
lambda = lambdad$lambda.1se)

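Figured it out. When you fit a multinomial model with glmnet, you need to call the generic predict() function instead of calling predict.glmnet() directly. The likely reason is S3 dispatch: a multinomial fit carries its own class ("multnet") in front of "glmnet", so predict() routes it to the method written for that class, whereas calling predict.glmnet() by name skips dispatch and applies code that does not expect multinomial coefficients. Using the same model matrix for both multinomial and binomial models works fine, so the error actually has nothing to do with the matrix format.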
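To illustrate why the generic matters, here is a minimal base-R sketch of S3 dispatch. The classes "basefit" and "multifit" and their methods are hypothetical stand-ins invented for this example; they are not glmnet's actual code, and they only mimic the situation where a subclass stores its coefficients as a list.

```r
# Toy S3 classes (hypothetical, for illustration only; not glmnet internals)
fit_multi <- structure(
  list(coef = list(a = c(1, 2), b = c(3, 4))),  # one coefficient vector per class
  class = c("multifit", "basefit")
)

# Parent-class method: assumes a single numeric coefficient vector
predict.basefit <- function(object, newx, ...) {
  drop(newx %*% object$coef)
}

# Subclass method: knows the coefficients are stored as a list
predict.multifit <- function(object, newx, ...) {
  sapply(object$coef, function(b) drop(newx %*% b))
}

newx <- matrix(1:4, nrow = 2)

# The generic dispatches on the first class and picks predict.multifit
pred_ok <- predict(fit_multi, newx)

# Calling the parent method by name skips dispatch and fails on the list
pred_bad <- tryCatch(predict.basefit(fit_multi, newx),
                     error = function(e) conditionMessage(e))
```

Here pred_ok is a 2 x 2 matrix with one column per class, while pred_bad holds the error message raised by the matrix product on a list.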

Related

Error with svyglm function in survey package in R: "all variables must be in design=argument"

New to stackoverflow. I'm working on a project with NHIS data, but I cannot get the svyglm function to work even for a simple, unadjusted logistic regression with a binary predictor and binary outcome variable (ultimately I'd like to use multiple categorical predictors, but one step at a time).
El_under_glm<-svyglm(ElUnder~SO2, design=SAMPdesign, subset=NULL, family=binomial(link="logit"), rescale=FALSE, correlation=TRUE)
Error in eval(extras, data, env) :
object '.survey.prob.weights' not found
I changed the variables to 0 and 1 instead:
Under_narm$SO2REG<-ifelse(Under_narm$SO2=="Heterosexual", 0, 1)
Under_narm$ElUnderREG<-ifelse(Under_narm$ElUnder=="No", 0, 1)
But then get a different issue:
El_under_glm<-svyglm(ElUnderREG~SO2REG, design=SAMPdesign, subset=NULL, family=binomial(link="logit"), rescale=FALSE, correlation=TRUE)
Error in svyglm.survey.design(ElUnderREG ~ SO2REG, design = SAMPdesign, :
all variables must be in design= argument
This is the design I'm using to account for the weights -- I'm pretty sure it's correct:
SAMPdesign=svydesign(data=Under_narm, id= ~NHISPID, weight= ~SAMPWEIGHT)
Any and all assistance appreciated! I've got a good grasp of stats but am a slow coder. Let me know if I can provide any other information.
Using some make-believe sample data I was able to get your model to run by setting rescale = TRUE. The documentation states
Rescaling of weights, to improve numerical stability. The default
rescales weights to sum to the sample size. Use FALSE to not rescale
weights.
So one solution may be simply to set rescale = TRUE.
library(survey)
# sample data
Under_narm <- data.frame(SO2 = factor(rep(1:2, 1000)),
ElUnder = sample(0:1, 1000, replace = TRUE),
NHISPID = paste0("id", 1:1000),
SAMPWEIGHT = sample(c(0.5, 2), 1000, replace = TRUE))
# with 'rescale' = TRUE
SAMPdesign=svydesign(ids = ~NHISPID,
data=Under_narm,
weights = ~SAMPWEIGHT)
El_under_glm<-svyglm(formula = ElUnder~SO2,
design=SAMPdesign,
family=quasibinomial(), # this family avoids warnings
rescale=TRUE) # weights are rescaled to sum to the sample size
summary(El_under_glm, correlation = TRUE) # use correlation with summary()
Otherwise, looking at the code for this function's method with 'survey:::svyglm.survey.design', it seems like there may be a bug. I could be wrong, but by my read, when 'rescale' is FALSE, .survey.prob.weights does not appear to get assigned a value.
if (is.null(g$weights))
g$weights <- quote(.survey.prob.weights)
else g$weights <- bquote(.survey.prob.weights * .(g$weights)) # bug?
g$data <- quote(data)
g[[1]] <- quote(glm)
if (rescale)
data$.survey.prob.weights <- (1/design$prob)/mean(1/design$prob)
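The rescaling line in the excerpt above can be checked by hand in base R. This is a small sketch with made-up numbers, assuming that design$prob holds inclusion probabilities (the reciprocals of the sampling weights); the vector prob below is hypothetical.

```r
# Hypothetical inclusion probabilities for four sampled units
prob <- c(0.5, 0.25, 0.5, 0.125)

raw_w    <- 1 / prob               # raw sampling weights
scaled_w <- raw_w / mean(raw_w)    # same formula as in the excerpt

# After rescaling, the weights average to 1 and sum to the sample size
total <- sum(scaled_w)
```

With four units, total comes out equal to 4, which matches the documentation's statement that the default rescales weights to sum to the sample size.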
There may be a workaround if you assign a vector of numeric values to .survey.prob.weights in the global environment. No idea what these values should be, but your error goes away if you do something like the following. (.survey.prob.weights needs one value per row of the data. The sample data has 2000 rows, because rep(1:2, 1000) has length 2000 and the shorter columns are recycled.)
SAMPdesign=svydesign(ids = ~NHISPID,
data=Under_narm,
weights = ~SAMPWEIGHT)
.survey.prob.weights <- rep(1, nrow(Under_narm))
El_under_glm<-svyglm(formula = ElUnder~SO2,
design=SAMPdesign,
family=quasibinomial(),
rescale=FALSE)
summary(El_under_glm, correlation = TRUE)

How to load a csv file into R as a factor for use with glmnet and logistic regression

I have a csv file (single column, numeric values) called "y" that consists of zeros and ones where the rows with the value 1 indicate the target variable for logistic regression, and another file called "x" with the same number of rows and with columns of numeric predictor values. How do I load these so that I can then use cv.glmnet, i.e.
x <- read.csv('x',header=FALSE,sep=",")
y <- read.csv('y',header=FALSE )
is throwing an error
Error in y %*% rep(1, nc) :
requires numeric/complex matrix/vector arguments
when I call
cvfit = cv.glmnet(x, y, family = "binomial")
I know that "y" should be loaded as a "factor," but how do I do this? My online searches have found all sorts of approaches that have just confused me. What is the simple one-liner to just load this data ready for glmnet?
cv.glmnet requires x as a numeric matrix and y as a vector or factor, but read.csv returns data frames. You can convert them with the following code
xmat = as.matrix(x)
yvec = as.factor(y[[1]])  # y is a one-column data frame; as.vector(y) would not flatten it
Then use
cvfit = cv.glmnet(xmat, yvec, family = "binomial")
If you can provide your data in dput() format, I can give it a try.
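The conversion can be tried without the real files. The sketch below uses inline text as a hypothetical stand-in for the "x" and "y" CSV files and only checks the shapes that cv.glmnet expects; it does not fit a model.

```r
# Hypothetical stand-ins for the "x" and "y" files, read from inline text
x <- read.csv(text = "1.2,0.5\n0.3,1.1\n2.4,0.9\n0.7,1.8", header = FALSE)
y <- read.csv(text = "0\n1\n1\n0", header = FALSE)

# read.csv always returns a data frame, even for a single column
y_is_df <- is.data.frame(y)

xmat <- as.matrix(x)     # numeric matrix of predictors
yfac <- factor(y[[1]])   # extract the only column, then make it a factor

# as.vector() on a data frame does not give a plain numeric vector
y_flat_is_numeric <- is.numeric(as.vector(y))
```

Extracting the column with y[[1]] (or y$V1, the default name read.csv assigns when header = FALSE) is what turns the one-column data frame into something cv.glmnet can use as a response.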

How can I get the R IML FeatureImp() function to work?

I am trying to get the FeatureImp function from the IML package to work, but it keeps throwing an error. Below is an example from the diamonds dataset, on which I train a random forest model.
library(iml)
library(caret)
library(randomForest)
data(diamonds)
# create some binary classification target (without specific meaning)
diamonds$target <- as.factor(ifelse(diamonds$color %in% c("D", "E", "F"), "X", "Y"))
# drop categorical variables (to keep it simple for demonstration purposes)
diamonds <- subset(diamonds, select = -c(color, clarity, cut))
# train model
mdl_diamonds <- train(target ~ ., method = "rf", data = diamonds)
# create iml predictor
x_pred <- Predictor$new(model = mdl_diamonds, data = diamonds[, 1:7], y = diamonds$target, type = "prob")
# calculate feature importance
x_imp <- FeatureImp$new(x_pred, loss = "mae")
This ends with the following error:
Error in if (self$original.error == 0) { :
missing value where TRUE/FALSE needed
In addition: Warning message:
In Ops.factor(actual, predicted) : ‘-’ not meaningful for factors
I don't understand what I'm doing wrong. Can anyone give me a clue?
I'm working on R version 3.5.1, iml package version 0.9.0.
I have found the problem. I was using "mae" as the loss function, which is - I could have known - not applicable for a classification target. Using "ce" or "f1" returns output as expected.
Because this is a classification model, a regression loss like "mae" does not apply, so try loss = 'ce'.
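The warning in the question ('-' not meaningful for factors) can be reproduced in base R. This is a minimal sketch with made-up labels, not iml's internal code: it shows that an absolute-error style loss has nothing to subtract when the target is a factor, whereas a classification error only needs to compare labels.

```r
# Made-up true and predicted class labels
actual    <- factor(c("X", "Y", "X", "Y"))
predicted <- factor(c("X", "X", "X", "Y"), levels = levels(actual))

# Subtracting factors is undefined: R warns and returns NA for every element
diff_attempt <- suppressWarnings(actual - predicted)

# Classification error: share of observations with a wrong label
ce <- mean(actual != predicted)
```

Because every element of diff_attempt is NA, any loss built on it is NA too, which is consistent with the "missing value where TRUE/FALSE needed" error in the question.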

Calculating prediction accuracy of a tree using rpart's predict method

I have constructed a decision tree using rpart for a dataset.
I have then divided the data into 2 parts - a training dataset and a test dataset. A tree has been constructed for the dataset using the training data. I want to calculate the accuracy of the predictions based on the model that was created.
My code is shown below:
library(rpart)
#reading the data
data = read.table("source")
names(data) <- c("a", "b", "c", "d", "class")
#generating test and train data - Data selected randomly with a 80/20 split
trainIndex <- sample(1:nrow(data), 0.8 * nrow(data))
train <- data[trainIndex,]
test <- data[-trainIndex,]
#tree construction based on information gain
tree = rpart(class ~ a + b + c + d, data = train, method = 'class', parms = list(split = "information"))
I now want to calculate the accuracy of the predictions generated by the model by comparing the results with the actual values in the train and test data. However, I am facing an error while doing so.
My code is shown below:
t_pred = predict(tree,test,type="class")
t = test['class']
accuracy = sum(t_pred == t)/length(t)
print(accuracy)
I get an error message that states -
Error in t_pred == t : comparison of these types is not implemented In
addition: Warning message: Incompatible methods ("Ops.factor",
"Ops.data.frame") for "=="
On checking the type of t_pred, I found that it is of type integer. However, the documentation
(https://stat.ethz.ch/R-manual/R-devel/library/rpart/html/predict.rpart.html)
states that the predict() method must return a vector.
I am unable to understand why the type of the variable is an integer and not a list. Where have I made the mistake, and how can I fix it?
Try calculating the confusion matrix first:
confMat <- table(test$class,t_pred)
Now you can calculate the accuracy by dividing the sum of the diagonal of the matrix (the correct predictions) by the total sum of the matrix:
accuracy <- sum(diag(confMat))/sum(confMat)
My response is very similar to @mtoto's but a bit simpler. I hope it also helps.
mean(test$class == t_pred)
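Both answers work because they use test$class, which extracts the column as a factor, while test['class'] in the question keeps a one-column data frame that cannot be compared with a factor. A factor is also stored internally as integer codes, which is why its type is reported as integer. The base-R sketch below uses made-up labels in place of real rpart predictions to show the difference and to check that the two accuracy formulas agree.

```r
# Made-up test labels and predictions standing in for predict(tree, test, type = "class")
test   <- data.frame(class = factor(c("a", "b", "a", "b", "a")))
t_pred <- factor(c("a", "b", "b", "b", "a"), levels = levels(test$class))

bracket_is_df  <- is.data.frame(test["class"])  # single brackets keep a data frame
dollar_is_fac  <- is.factor(test$class)         # $ extracts the column itself
pred_storage   <- typeof(t_pred)                # factors are stored as integer codes

# Accuracy from the confusion matrix and from a direct comparison
conf_mat  <- table(actual = test$class, predicted = t_pred)
acc_table <- sum(diag(conf_mat)) / sum(conf_mat)
acc_mean  <- mean(test$class == t_pred)
```

Four of the five made-up predictions match, so both acc_table and acc_mean come out at 0.8.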

Odd errors when using party package "contrasts cannot be applied ...." and "object of type closure...."

I am using the party package.
When I run:
tree1 <- mob(incarcerated~priors+opens+concrearr+postrearr+anyrearr+postconvfel+postconvmis+
ag_vfo+ag_cla2+in_custody |PRIOR_FELONY_ARREST ,
data = jamaal,
control = ctrl,
model = glinearModel,
family = binomial)
I get the error
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
But I checked and every factor variable has at least 2 levels.
I then tried a much simpler tree
treetest <- mob(incarcerated~priors|in_custody,
data = jamaal,
control = ctrl,
model = glinearModel,
family = binomial)
and got one of the infamous R error messages
Error: object of type 'closure' is not subsettable
Any help appreciated
UPDATE
I found the source of the first error (it was a problem with how I was using factor()) but not the second. Also, rpart works on the same data with no problem.
The data are confidential, but I will check with the client if posting a small subset is OK
FURTHER UPDATE
Here is a small example with made-up data:
priors <- c(rep('Y', 5), rep('N', 5))
incarcerated <- rep(c('Y', 'N'), 5)
in_custody <- rep(c(rep('Y', 3), rep('N', 2)),2)
testdata <- data.frame(cbind(priors, incarcerated, in_custody))
treetest <- mob(incarcerated~priors|in_custody, data = testdata,
model = glinearModel, family = binomial)
gives the same error.
party is looking for the results of a binomial() call, rather than the function binomial or the string "binomial". (In my opinion the glm() function in base R has made things very confusing by accepting any of these three as acceptable variants.)
priors <- c(rep('Y', 5), rep('N', 5))
incarcerated <- rep(c('Y', 'N'), 5)
in_custody <- rep(c(rep('Y', 3), rep('N', 2)),2)
testdata <- data.frame(cbind(priors, incarcerated, in_custody))
library(party)
treetest <- mob(incarcerated~priors|in_custody, data = testdata,
model = glinearModel, family = binomial())
In hindsight, this error message is at least somewhat informative: it says that a function (a "closure") was passed somewhere that R expected an object with elements that can be extracted, which is a hint to look for a function name supplied without its parentheses.
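The difference between binomial and binomial() can be inspected in base R without party. This short sketch only looks at the two objects; it makes no claim about how mob() uses the family internally.

```r
# binomial is a function (a closure); binomial() is the family object it returns
fn_class  <- class(binomial)      # "function"
fam       <- binomial()
fam_class <- class(fam)           # "family"
fam_name  <- fam$family           # elements can be extracted from the object
fam_link  <- fam$link

# Trying to extract an element from the function itself raises an error
err <- tryCatch(binomial$family, error = function(e) conditionMessage(e))
```

Code that expects a family object and does something like family$family will fail in this way when it is handed the bare function.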
