Lasso Regression glmnet - error regarding the input data - r

I am trying to fit a Lasso regression model using glmnet(). As I have never worked with Lasso regression before, I tried to follow tutorials, but when applying the model it always fails with the following error:
Error in lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs,:
one multinomial or binomial class has 1 or 0 observations; not allowed
Working with the dataset from this question (https://stats.stackexchange.com/questions/72251/an-example-lasso-regression-using-glmnet-for-binary-outcome), it seems that the dependent variable y has to consist only of 0s and 1s. Whenever I set one of the observation values of y to 2 or anything other than 0 or 1, it results in this error.
This is my code:
lambdas_to_try <- 10^seq(-3, 5, length.out = 100)
x_vars <- as.matrix(data.frame(data$x1, data$x2, data$x3))
lasso_cv <- cv.glmnet(x_vars, y=as.factor(data$y), alpha = 1, lambda = lambdas_to_try, family = "binomial", nfolds = 10)
x_vars_2 <- model.matrix(data$y ~ data$x1 + data$x2 + data$x3)[, -1]
lasso_cv_2 <- cv.glmnet(x_vars_2, y = as.factor(data$y), alpha = 1, lambda = lambdas_to_try, family = "binomial", nfolds = 10)
And this is what my dataset looks like:
The problem is that in my data the y variable represents the number of crimes, so it takes integer values between 0 and 1000. I cannot restrict it to 0 and 1 only. How can I use these data to fit a Lasso regression?

As @Gregor noted, what you have is count data, so this should be a regression rather than a classification problem. Using an example dataset, this is how you can implement it:
library(MASS)
library(glmnet)
data(Insurance)
Your response variable should be numeric:
str(Insurance)
'data.frame': 64 obs. of 5 variables:
$ District: Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
$ Group : Ord.factor w/ 4 levels "<1l"<"1-1.5l"<..: 1 1 1 1 2 2 2 2 3 3 ...
$ Age : Ord.factor w/ 4 levels "<25"<"25-29"<..: 1 2 3 4 1 2 3 4 1 2 ...
$ Holders : int 197 264 246 1680 284 536 696 3582 133 286 ...
$ Claims : int 38 35 20 156 63 84 89 400 19 52 ...
Now we set the predictors and response variables:
y = Insurance$Claims
X = model.matrix(Claims ~ .,data=Insurance)
Run a cv to find the best lambda (if you don't know your L1 norm):
fit = cv.glmnet(x=X,y=y,family="poisson")
pred = predict(fit,X,s=fit$lambda.1se)
The predictions are on the log scale, so to compare them with your actual values:
plot(log(y),pred,xlab="log (actual)",ylab="log (predicted)")
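If you would rather compare on the count scale directly, here is a small follow-up sketch (not part of the original answer): glmnet can return fitted means via type = "response", which for a Poisson family is exp of the linear predictor.
# predicted counts on the response scale (fitted Poisson means)
pred_counts <- predict(fit, newx = X, s = "lambda.1se", type = "response")
plot(y, pred_counts, xlab = "actual claims", ylab = "predicted claims")
abline(0, 1, lty = 2)  # reference line for perfect prediction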

Related

R Error in model.frame.default(formula, data) : variable lengths differ (found for 'age')

Using my training data for the SVM model works fine, but when I try to tune the hyperparameters it throws the error below.
I am new to ML and R, so please help me with it. Thank you so much!
#svm model
svm_model= svm(svm_train$y ~.,data = svm_train,kernel = "radial",cost = 1,gamma = 1/ncol(svm_train),type="C-classification")
str(svm_model)
summary(svm_model)
# Parameters:
#    SVM-Type:  C-classification
#  SVM-Kernel:  radial
#        cost:  1
# Number of Support Vectors:  1844
#  ( 889 955 )
# Number of Classes:  2
# Levels:
#  1 2
#tune
tObj<-tune.svm(svm_train$y ~.,data = svm_train,gamma = c(0.1,0.5,1,5,2,3,4,5),cost = c(0.5,1,5,10,100,1000),type="C-classification",kernel = "radial")
Error in model.frame.default(formula, data) :
variable lengths differ (found for 'age')
My svm_train is a non-empty data frame, shown below:
> svm_train[1,]
age job marital education default housing loan contact month day_of_week duration campaign pdays previous poutcome emp.var.rate cons.price.idx cons.conf.idx
24020 40 10 1 4 2 1 2 1 10 2 144 1 999 1 1 -0.1 93.798 -40.4
euribor3m nr.employed y
24020 4.968 5195.8 1
>
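A likely cause (a hedged guess, since the formula mixes svm_train$y with data = svm_train): when tune.svm subsets the data for cross-validation, svm_train$y in the formula still refers to the full-length column, so the variable lengths no longer match. A minimal sketch of the usual fix, assuming y is the response column of svm_train:
library(e1071)
svm_train$y <- as.factor(svm_train$y)  # classification needs a factor response
tObj <- tune.svm(y ~ ., data = svm_train,
                 gamma = c(0.1, 0.5, 1, 2, 3, 4, 5),
                 cost = c(0.5, 1, 5, 10, 100, 1000),
                 kernel = "radial")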

How to use knn classification (class package) using training and test datasets

Df_census is the original data frame. I am trying to use Sex, EducYears and Age to predict whether a person's Income is "<=50K" or ">50K".
There are 20,000 rows in x_train_auto (training set) and 12,561 in x_test_auto (test set).
My classification variable (training set) has 15,124 <=50k and 4876 >50k.
Here is my code:
predictions = knn(train = x_train_auto, # response
                  test = x_test_auto,   # response
                  cl = Df_census$Income[in_train_census], # prediction
                  k = 25)
table(predictions)
#<=50K
#12561
As you can see, all 12,561 test samples were predicted to have an Income of "<=50K".
This doesn't make sense. I am not sure where I am going wrong.
P.S.: I have Sex one-hot encoded as 0 for male and 1 for female, and I have scaled EducYears and Age. I then added the one-hot encoded Sex variable back into the scaled test and train data.
identifying the problem
Your provided x_test_auto.csv data suggests that you passed logical vectors of TRUEs and FALSEs (which define the indices of training and test samples rather than the actual data) to the train and test arguments of class::knn.
the solution
Rather, use the logical vector in x_train_auto (which I believe corresponds to in_train_census in your example) to define two separate data.frames, each containing all your desired predictors. These are then the training and the test set.
p <- c("Age","EducYears","Sex")
Df_train <- Df_census[in_train_census,p]
Df_test <- Df_census[!in_train_census,p]
In the knn function, pass the training set to the train argument, and the test set to the test argument, and further pass the outcome / target variable of the training set (as a factor) to cl.
The output (see ?class::knn) will be the predicted outcome for the test set.
Here is a complete and reproducible workflow using your data.
the data
library(class)
# read data from Dropbox
x_train_auto <- read.csv("https://dropbox.com/s/6kupkp4u4qyizy7/x_test_auto.csv?dl=1", row.names = 1)
Df_census <- read.csv("https://dropbox.com/s/ccvck8ajnatmpv0/Df_census.csv?dl=1", row.names = 1, stringsAsFactors = TRUE)
table(x_train_auto) # TRUE are training, FALSE are test set
#> x_train_auto
#> FALSE TRUE
#> 12561 20000
str(Df_census) # Income as factor, Sex is binary, Age and EducYears are numeric
#> 'data.frame': 32561 obs. of 15 variables:
#> $ Age : int 39 50 38 53 28 37 49 52 31 42 ...
#> $ Work : Factor w/ 9 levels "?","Federal-gov",..: 8 7 5 5 5 5 5 7 5 5 ...
#> $ Fnlwgt : int 77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
#> $ Education : Factor w/ 16 levels "10th","11th",..: 10 10 12 2 10 13 7 12 13 10 ...
#> $ EducYears : int 13 13 9 7 13 14 5 9 14 13 ...
#> $ MaritalStatus: Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
#> $ Occupation : Factor w/ 15 levels "?","Adm-clerical",..: 2 5 7 7 11 5 9 5 11 5 ...
#> $ Relationship : Factor w/ 6 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
#> $ Race : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
#> $ Sex : int 1 1 1 1 0 0 0 1 0 1 ...
#> $ CapitalGain : int 2174 0 0 0 0 0 0 0 14084 5178 ...
#> $ CapitalLoss : int 0 0 0 0 0 0 0 0 0 0 ...
#> $ HoursPerWeek : int 40 13 40 40 40 40 16 45 50 40 ...
#> $ NativeCountry: Factor w/ 42 levels "?","Cambodia",..: 40 40 40 40 6 40 24 40 40 40 ...
#> $ Income : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 2 2 ...
# predictors and response
p <- c("Age","EducYears","Sex")
y <- "Income"
# create data partition
in_train_census <- x_train_auto$x
Df_train <- Df_census[in_train_census,]
Df_test <- Df_census[!in_train_census,]
# check
dim(Df_train)
#> [1] 20000 15
dim(Df_test)
#> [1] 12561 15
table(Df_train$Income)
#>
#> <=50K >50K
#> 15124 4876
using class::knn
The knn (k-nearest-neighbors) algorithm can perform better or worse depending on the choice of the hyperparameter k. It's often difficult to know which k value is best for the classification of a particular dataset. In a machine learning setting, you'd want to try out different values of k to find a value that gives the highest performance on your test dataset (i.e., data which was not used for model fitting).
It's always important to strike a good balance between overfitting (model is too complex, and will give good results on the training data, but less accurate or even rubbish results on new test data) and underfitting (model is too trivial to explain the actual patterns in the data). In the case of knn, using a larger k value would probably better safeguard against overfitting, according to the explanations here.
# apply knn for various k using the given training / test set
r <- data.frame(array(NA, dim = c(0, 2), dimnames = list(NULL, c("k", "accuracy"))))
for (k in 1:30) {
  # cat("k =", k, "\n")
  # fit model on training set, predict test set data
  set.seed(60402) # to be reproducible
  predictions <- knn(train = Df_train[, p],
                     test = Df_test[, p],
                     cl = Df_train[, y],
                     k = k)
  # confusion matrix on test set
  t <- table(pred = predictions, ref = Df_test[, y])
  # accuracy
  a <- sum(diag(t)) / sum(t)
  # bind
  r <- rbind(r, data.frame(k = k, accuracy = a))
}
visualize model assessment
# find best k
r[which.max(r$accuracy),]
#> k accuracy
#> 17 17 0.8007324
(k.best <- r[which.max(r$accuracy),"k"])
#> [1] 17
# plot
with(r, plot(k, accuracy, type = "l"))
abline(v = k.best, lty = 2)
Created on 2021-09-23 by the reprex package (v2.0.1)
interpretation
The loop results suggest that your optimal value of k for this particular training and test set is between 12 and 17 (see plot above), but the accuracy gain is very small compared to using k = 1 (it's at around 80% regardless of k).
additional thoughts
Given that high income is rarer compared to lower income, accuracy might not be the desired performance metric. Sensitivity might be equally or more important, and you could modify the example code to calculate and assess other performance metrics instead.
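For instance, a minimal sketch (assuming ">50K" is treated as the positive class) that pulls sensitivity and specificity from the last confusion matrix t computed in the loop above:
sens <- t[">50K", ">50K"] / sum(t[, ">50K"])     # true positives / all actual >50K
spec <- t["<=50K", "<=50K"] / sum(t[, "<=50K"])  # true negatives / all actual <=50K
c(sensitivity = sens, specificity = spec)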
In addition to pure prediction, you might want to explore whether other variables could be informative predictors of the Income class, by adding them to the p vector and comparing the resulting accuracies.
Here, we base our conclusions on a particular realization of training and test data. Better machine learning practice would be to split your data into two sets (as here), but then repeatedly split the training set again to fit and assess many more models, using e.g. (repeated) k-fold cross-validation. Good packages for this in R are e.g. caret or tidymodels.
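A rough sketch of that idea with caret (an assumption on my part; tidymodels would work similarly): the training set is resampled repeatedly to choose k, and the untouched test set is only used once at the end.
library(caret)
set.seed(60402)
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
knn_fit <- train(x = Df_train[, p], y = Df_train[, y],
                 method = "knn",
                 tuneGrid = data.frame(k = seq(1, 30, by = 2)),
                 trControl = ctrl)
knn_fit$bestTune  # k selected by repeated cross-validation
mean(predict(knn_fit, Df_test[, p]) == Df_test[, y])  # accuracy on the held-out test set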
To gain a better understanding regarding which variables are the best predictors of Income class, I would also carry out a logistic regression on various uncorrelated predictors.
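A minimal sketch of that last suggestion, using the same three predictors (add further uncorrelated predictors as desired):
logit_fit <- glm(Income ~ Age + EducYears + Sex, data = Df_train, family = binomial)
summary(logit_fit)  # coefficient signs and sizes indicate each predictor's contribution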

Why does coxph() return some of the coefficients as NA when using survSplit() in R?

I'm working with survival data and using survSplit() to deal with non-proportionality via time-dependent coefficients. The model is based on the article by Terry Therneau et al. (2020) (https://cran.r-project.org/web/packages/survival/vignettes/timedep.pdf).
I have a factor variable with 6 levels representing different types of knee prostheses. When I apply survSplit() with any cutpoints, the coefficient for the reference level of this time-adjusted factor appears as NA in the results. There is no collinearity, and the problem can be reproduced with other factor variables in the data. Also, if I change the reference level by reordering the factor levels, the new reference level is NA anyway.
The problem is reproduced below with the factor variable celltype in the veteran data:
str(veteran)
'data.frame': 137 obs. of 8 variables:
$ trt : num 1 1 1 1 1 1 1 1 1 1 ...
$ celltype: Factor w/ 4 levels "squamous","smallcell",..: 1 1 1 1 1 1 1 1 1 1 ...
$ time : num 72 411 228 126 118 10 82 110 314 100 ...
$ status : num 1 1 1 1 1 1 1 1 1 0 ...
$ karno : num 60 70 60 60 70 20 40 80 50 70 ...
$ diagtime: num 7 5 3 9 11 5 10 29 18 6 ...
$ age : num 69 64 38 63 65 49 69 68 43 70 ...
$ prior : num 0 10 0 10 10 0 10 0 0 0 ...
library(tidyverse)
library(survival)
library(survminer)
df <- veteran
cox <- coxph(Surv(time, status) ~ celltype + age + prior, data = df)
cox.zph(cox, terms=F)
cox_tdc <- survSplit(Surv(time, status) ~ .,
                     data = df,
                     cut = c(150),
                     zero = 0,
                     episode = "tgroup",
                     id = "id") %>%
  dplyr::select(id, tstart, time, status, tgroup, celltype, age, prior)
coxph(Surv(tstart, time, status) ~
        celltype:strata(tgroup) + age + prior,
      data = cox_tdc)
Call:
coxph(formula = Surv(tstart, time, status) ~ celltype:strata(tgroup) +
age + prior, data = cox_tdc)
coef exp(coef) se(coef) z p
age 0.005686 1.005702 0.009494 0.599 0.549262
prior 0.008592 1.008629 0.020661 0.416 0.677516
celltypesquamous:strata(tgroup)tgroup=1 0.300732 1.350848 0.360243 0.835 0.403828
celltypesmallcell:strata(tgroup)tgroup=1 1.172992 3.231649 0.325177 3.607 0.000309
celltypeadeno:strata(tgroup)tgroup=1 1.232753 3.430660 0.352423 3.498 0.000469
celltypelarge:strata(tgroup)tgroup=1 NA NA 0.000000 NA NA
celltypesquamous:strata(tgroup)tgroup=2 -1.160625 0.313290 0.450989 -2.574 0.010067
celltypesmallcell:strata(tgroup)tgroup=2 -0.238994 0.787420 0.542002 -0.441 0.659252
celltypeadeno:strata(tgroup)tgroup=2 1.455195 4.285319 0.837621 1.737 0.082335
celltypelarge:strata(tgroup)tgroup=2 NA NA 0.000000 NA NA
Likelihood ratio test=34.54 on 8 df, p=3.238e-05
n= 171, number of events= 128
The problem with this is that I cannot test the Schoenfeld residuals, because cox.zph() returns an error due to the NAs: "Error in solve.default(imat, u) :
system is computationally singular: reciprocal condition number = 5.09342e-19".
The problem with NAs does not occur if I don't use time-dependent coefficients (:strata(tgroup)).
Has anyone dealt with this problem before? Why are some of the coefficients NA? I really appreciate your help with this!
EDIT: example was changed to include reproducible data.
EDIT2: Fixed the time cutpoint in the example, which had resulted in biased coefficients.
EDIT3: I asked Terry Therneau about this problem and the email conversation is below:
Terry Therneau:
There are two issues here.
The model.matrix routine is used by lm, glm, coxph and a host of other routines to create the X matrix for regression. It tries to be intelligent so as not to create redundant columns in X; those columns would end up with an NA coefficient. It is pretty good, but not perfect. Your case of a model with a strata(tgroup):factor term is one where it leaves in too many. The extra NAs in the printout are a nuisance, but not something that I can fix.
The cox.zph routine, on the other hand, is my problem. It should ignore those NA columns, and does not. There is actually code to check for the NA, but your example shows that it is incomplete. I will add an NA case like yours to my test suite and repair the problem. (The NA case worked once, but some update broke it.)
Me:
When I get results with NA rows as in this case, are the coefficients still correct, and does the NA represent the reference level?
Terry:
Yes, all the coefficients are correct. SAS, for instance, does not try to 'pre-eliminate' columns, and models with factors always have some missing values in the coefficient vector. Since they use "." instead of "NA" for printing, the missings don't jump off the page as much. Numerically, there is no penalty for doing it one way or the other.
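A hedged illustration of the mechanism described above (not part of the email exchange): build the design matrix that the interaction-only formula implies and compare its column count with its rank; if the rank is smaller, the surplus columns are aliased and show up as NA coefficients.
mm <- model.matrix(~ celltype:factor(tgroup) + age + prior, data = cox_tdc)
ncol(mm)      # number of columns the coding produces
qr(mm)$rank   # actual rank; a smaller value means some columns are redundant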

How to find missing values in Regression tree for Veteran Status

I don't know why but I seem to be missing the nodes having to do with veteran status in my regression tree. Perhaps I am missing something? Suggestions welcome!
> str(d1)
'data.frame': 185390 obs. of 5 variables:
 $ Total.Individual.Income   : int 18899 0 15440 10859 25000 20000 8400 0 56002 50012 ...
 $ Race                      : Factor w/ 2 levels "Black, American India, Hispanic, Other",..: 2 2 1 1 2 2 2 2 2 2 ...
 $ Sex                       : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 1 1 1 2 1 ...
 $ Veteran                   : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 1 1 1 1 1 ...
 $ Educational.Level.Achieved: Factor w/ 2 levels "Associated Degree and Up",..: 2 2 2 1 2 2 2 2 1 2 ...
> m1 <- rpart(Total.Individual.Income ~ ., data = d1, method = "anova")
> m1
n= 185390
node), split, n, deviance, yval
      * denotes terminal node
1) root 185390 6.806020e+14 31892.14
  2) Educational.Level.Achieved=No Degree 130563 1.891821e+14 17617.89 *
  3) Educational.Level.Achieved=Associated Degree and Up 54827 4.014663e+14 65884.32
    6) Sex=Female 29910 1.266138e+14 49292.16 *
    7) Sex=Male 24917 2.567340e+14 85801.30 *
My goal with this code is to create a regression tree of the predictors above with respect to total individual income.
Notice that Race doesn't appear either... that may be because neither Race nor Veteran is really useful for splitting your data, given the outcome you're looking at (Total.Individual.Income).
Anyway, it's difficult to tell in the absence of a reproducible example.
See the results to this:
require(rpart)
m1 <- rpart(mpg ~ ., data = mtcars)
> m1
n= 32
node), split, n, deviance, yval
* denotes terminal node
1) root 32 1126.04700 20.09062
2) cyl>=5 21 198.47240 16.64762
4) hp>=192.5 7 28.82857 13.41429 *
5) hp< 192.5 14 59.87214 18.26429 *
3) cyl< 5 11 203.38550 26.66364 *
Notice that only two variables (cyl and hp) show up as splitting variables, even though there are 10 candidate predictors. Yet, if we exclude cyl and hp, we get totally different results:
m2 <- rpart(mpg ~ ., data = mtcars[,c(1, 3, 5:11)])
then the result changes:
> m2
n= 32
node), split, n, deviance, yval
* denotes terminal node
1) root 32 1126.04700 20.09062
2) wt>=2.3925 25 320.44640 17.58800
4) disp>=266.9 14 85.20000 15.10000 *
5) disp< 266.9 11 38.28727 20.75455 *
3) wt< 2.3925 7 89.81429 29.02857 *
showing us now weight (wt) and displacement (disp) as predictors.
So nothing is wrong with your code; it seems all you need is a better understanding of what rpart is doing under the hood. ?rpart may be a good start.
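As a hedged follow-up (assuming the fitted object m1 from the question): even when a variable never appears as a primary split, rpart still records how useful it was, including as a surrogate, so you can check whether Veteran carried any splitting value at all.
m1$variable.importance  # named vector; variables missing here contributed nothing to any split
summary(m1)             # also lists competing and surrogate splits at each node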

predict() error: what can I do if one variable exists in training data but not in prediction data?

I have a training data set with the below variables
str(PairsTrain)
'data.frame': 1495698 obs. of 4 variables:
$ itemID_1 : int 1 4 8 12 15 19 20 20 22 26 ...
$ itemID_2 : int 4112648 1223296 2161930 5637025 113701
$ isDuplicate : int 1 0 1 0 0 0 0 0 1 0 ...
$ generationMethod: int 1 1 1 1 1 1 1 1 1 1 ...
I fitted a logistic regression to this dataset using the glm() function:
mod1 <- glm(isDuplicate ~., data = PairsTrain, family = binomial)
Below is the structure of my test dataset:
str(Test)
'data.frame': 1044196 obs. of 3 variables:
$ id : int 0 1 2 3 4 5 6 7 8 9 ...
$ itemID_1: int 5 5 6 11 23 23 30 31 36 47 ...
$ itemID_2: int 4670875 787210 1705280 3020777 5316130 3394969 2922567
I am trying to make predictions on my test data set like below
PredTest <- predict(mod1, newdata = Test, type = "response")
Error in eval(expr, envir, enclos) : object 'generationMethod' not found
I get the above error. I think the reason is that the set of features in my test dataset doesn't match the one in my training dataset.
I am not sure if I am correct; I am stuck here and don't know how to deal with this situation.
OK, this is all you need:
Test$generationMethod <- 0
You must have the variable generationMethod in your test set! It was used when building the model, hence it is required by predict when you make predictions. Since you don't have this variable in Test, use the line above to create it. Because the column is all 0, it contributes nothing to the linear predictor, but it gets you past the variable check done by predict.
Alternatively, you might consider removing variable generationMethod from your model development. Try:
mod2 <- glm(isDuplicate ~ itemID_1 + itemID_2, data = PairsTrain,
family = binomial)
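An equivalent way to drop the variable (a small sketch, not from the original answer) is to update the existing model instead of retyping the formula:
mod2 <- update(mod1, . ~ . - generationMethod)
PredTest <- predict(mod2, newdata = Test, type = "response")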
