R: model.matrix for cv.glmnet drops rows from the data frame

I am having an issue with creating a matrix of explanatory variables for running ridge and lasso regression using cv.glmnet.
My original data frame has dimensions 1460*81 and consists of several numeric and factor variables. To run glmnet, I am attempting to create a matrix of predictors using model.matrix.
However, when creating model.matrix on my original dataset, some of the rows are being dropped and my response variable and predictors are not of the same length.
Here's the code:
str(train1)
'data.frame': 1460 obs. of 80 variables:
$ MSSubClass : int 60 20 60 70 60 50 20 60 50 190 ...
$ MSZoning : Factor w/ 5 levels "C (all)","FV",..: 4 4 4 4 4 4 4 4 5 4 ...
$ LotFrontage : num 65 80 68 60 84 85 75 69 51 50 ...
$ LotArea : int 8450 9600 11250 9550 14260 14115 10084 10382 6120 7420
$ Street : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
$ Alley : Factor w/ 3 levels "Grvl","None",..: 2 2 2 2 2 2 2 2 2 2 ...
$ LotShape : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 1 1 1 1 4 1 4 4
$ LandContour : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 4 4 4
$ Utilities : Factor w/ 2 levels "AllPub","NoSeWa": 1 1 1 1 1 1 1 1 1 1 ...
And now I am passing the data frame to model.matrix to create a matrix.
x = model.matrix(SalePrice ~., data = train1)
dim(x)
[1] 1370 260
Notice how the 1460 * 80 data frame is transformed into a 1370 * 260 matrix. This causes a mismatch between the number of rows in my predictor matrix and the length of my response variable when I try to run ridge regression.
cv.ridge <- glmnet(x, y, alpha = 0)
Error in glmnet(x, y, alpha = 0) :
number of observations in y (1460) not equal to the number of rows of x (1370)
Any ideas on where to look to ensure that the number of rows of the matrix (x) equals the length of the response (y)?
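A likely cause, assuming some columns of train1 contain NA values (the str output above does not settle this either way): model.matrix builds its model frame with the default na.action, na.omit, which silently drops every row that has a missing value. The sketch below uses a made-up toy data frame (toy and its values are hypothetical, not the real data) to show the row loss and two ways to keep x and y aligned.

```r
# Hypothetical stand-in for train1, with one NA in a predictor
toy <- data.frame(
  SalePrice   = c(200, 180, 220, 140),
  LotFrontage = c(65, NA, 68, 60),
  Street      = factor(c("Pave", "Pave", "Grvl", "Pave"))
)

# Default behaviour: the row containing NA is dropped without warning
x_dropped <- model.matrix(SalePrice ~ ., data = toy)
n_dropped <- nrow(x_dropped)

# Option 1: restrict to complete rows and take y from the same rows
keep <- complete.cases(toy)
x <- model.matrix(SalePrice ~ ., data = toy[keep, ])
y <- toy$SalePrice[keep]

# Option 2: keep every row by building the model frame with na.pass;
# the NA values remain in the matrix and must be imputed before modelling
mf <- model.frame(SalePrice ~ ., data = toy, na.action = na.pass)
x_all <- model.matrix(SalePrice ~ ., data = mf)
```

With the first option x and y come from the same rows, so their lengths match by construction. With the second option the matrix keeps all rows but still contains NA, so some imputation step is needed before fitting.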

Related

randomForest using factor variables as continuous?

I am using the package randomForest to produce habitat suitability models for species. I thought everything was working as it should until I started looking at individual trees with getTree(). The documentation (see page 4 of the randomForest vignette) states that for categorical variables, the split point will be an integer, which makes sense. However, in the trees I have looked at for my results, this is not the case.
The data frame I used to build the model was formatted with categorical variables as factors:
> str(df.full)
'data.frame': 27087 obs. of 23 variables:
$ sciname : Factor w/ 2 levels "Laterallus jamaicensis",..: 1 1 1 1 1 1 1 1 1 1 ...
$ estid : Factor w/ 2 levels "7694","psabs": 1 1 1 1 1 1 1 1 1 1 ...
$ pres : Factor w/ 2 levels "1","0": 1 1 1 1 1 1 1 1 1 1 ...
$ stratum : Factor w/ 89 levels "poly_0","poly_1",..: 1 1 1 1 1 1 1 1 1 1 ...
$ ra : Factor w/ 3 levels "high","low","medium": 3 3 3 3 3 3 3 3 3 3 ...
$ eoid : Factor w/ 2 levels "0","psabs": 1 1 1 1 1 1 1 1 1 1 ...
$ avd3200 : num 0.1167 0.0953 0.349 0.1024 0.3765 ...
$ biocl05 : num 330 330 330 330 330 ...
$ biocl06 : num 66 65.8 66 65.8 66 ...
$ biocl08 : num 277 277 277 277 277 ...
$ biocl09 : num 170 170 170 170 170 ...
$ biocl13 : num 186 186 185 186 185 ...
$ cti : num 19.7 19 10.4 16.4 14.7 ...
$ dtnhdwat : num 168 240 39 206 309 ...
$ dtwtlnd : num 0 0 0 0 0 0 0 0 0 0 ...
$ e2em1n99 : num 0 0 0 0 0 0 0 0 0 0 ...
$ ems30_53 : Factor w/ 53 levels "0","602","2206",..: 19 4 17 4 19 19 4 4 19 19 ...
$ ems5607_46: num 0 0 1 0 0.4 ...
$ ksat : num 0.21 0.21 0.21 0.21 0.21 ...
$ lfevh_53 : Factor w/ 53 levels "0","11","16",..: 38 38 38 38 38 38 38 38 38 38 ...
$ ned : num 1.46 1.48 1.54 1.48 1.47 ...
$ soilec : num 14.8 14.8 19.7 14.8 14.8 ...
$ wtlnd_53 : Factor w/ 50 levels "0","3","7","11",..: 4 31 7 31 7 31 7 7 31 31 ...
This was the function call:
# rfStratum and sampSizeVec were previously defined
> rf.full$call
randomForest(x = df.full[, c(7:23)], y = df.full[, 3],
ntree = 2000, mtry = 7, replace = TRUE, strata = rfStratum,
sampsize = sampSizeVec, importance = TRUE, norm.votes = TRUE)
Here are the first 15 lines of an example tree (note that the variables in lines 1, 5, and 15 should be categorical, i.e., they should have integer split values):
> tree100
left daughter right daughter split var split point status prediction
1 2 3 ems30_53 9.007198e+15 1 <NA>
2 4 5 biocl08 2.753206e+02 1 <NA>
3 6 7 biocl06 6.110518e+01 1 <NA>
4 8 9 biocl06 1.002722e+02 1 <NA>
5 10 11 lfevh_53 9.006718e+15 1 <NA>
6 0 0 <NA> 0.000000e+00 -1 0
7 12 13 biocl05 3.310025e+02 1 <NA>
8 14 15 ned 2.814818e+00 1 <NA>
9 0 0 <NA> 0.000000e+00 -1 1
10 16 17 avd3200 4.199712e-01 1 <NA>
11 18 19 e2em1n99 1.724138e-02 1 <NA>
12 20 21 biocl09 1.738916e+02 1 <NA>
13 22 23 ned 8.837864e-01 1 <NA>
14 24 25 biocl05 3.442437e+02 1 <NA>
15 26 27 lfevh_53 9.007199e+15 1 <NA>
Additional information: I encountered this because I was investigating an error I was getting when predicting the results back onto the study area, stating that the types of predictors in the new data did not match those of the training data. I have done 6 other iterations of this model using the same data frame and scripts (just with different subsets of predictors) and have never gotten this message before. The only difference I could find between the randomForest object in this run and that in the other runs is that the rf.full$forest$ncat components are stored as double instead of integer.
> for(i in 1:length(rf.full$forest$ncat)){
+ cat(names(rf.full$forest$ncat)[[i]], ": ", class(rf.full$forest$ncat[[i]]), "\n")
+ }
avd12800 : numeric
cti : numeric
dtnhdwat : numeric
dtwtlnd : numeric
ems2207_99 : numeric
ems30_53 : numeric
ems5807_99 : numeric
hydgrp : numeric
ksat : numeric
lfevh_53 : numeric
ned : numeric
soilec : numeric
wtlnd_53 : numeric
>
> rf.full$forest$ncat
avd12800 cti dtnhdwat dtwtlnd ems2207_99 ems30_53 ems5807_99 hydgrp ksat lfevh_53
1 1 1 1 1 53 1 1 1 53
ned soilec wtlnd_53
1 1 50
However, xlevels (which appears to be a list of the predictor variables used and their types) are all showing the correct datatype for each predictor.
> for(i in 1:length(rf.full$forest$xlevels)){
+ cat(names(rf.full$forest$xlevels)[[i]], ": ", class(rf.full$forest$xlevels[[i]]),"\n")
+ }
avd12800 : numeric
cti : numeric
dtnhdwat : numeric
dtwtlnd : numeric
ems2207_99 : numeric
ems30_53 : character
ems5807_99 : numeric
hydgrp : character
ksat : numeric
lfevh_53 : character
ned : numeric
soilec : numeric
wtlnd_53 : character
# example continuous predictor
> rf.full$forest$xlevels$avd12800
[1] 0
# example categorical predictor
> rf.full$forest$xlevels$ems30_53
[1] "0" "602" "2206" "2207" "4504" "4507" "4702" "4704" "4705" "4706" "4707" "4717" "5207" "5307" "5600"
[16] "5605" "5607" "5616" "5617" "5707" "5717" "5807" "5907" "6306" "6307" "6507" "6600" "7002" "7004" "9107"
[31] "9116" "9214" "9307" "9410" "9411" "9600" "4607" "4703" "6402" "6405" "6407" "6610" "7005" "7102" "7104"
[46] "7107" "9000" "9104" "9106" "9124" "9187" "9301" "9505"
The ncat component is simply a vector of the number of categories per variable with 1 for continuous variables (as noted here), so it doesn't seem like it should matter if that is stored as an integer or a double, but it seems like this might all be related.
Questions
1) Shouldn't the split point for categorical predictors in any given tree of a randomForest forest be an integer, and if yes, any thoughts as to why factors in the data frame used as input to the randomForest call here are not being used as such?
2) Does the number type (double vs integer) of the ncat component of a randomForest object matter in any way related to model building, and any thoughts as to what could cause this to switch from integer in the first 6 runs to double in this last run (with each run containing different subsets of the same data)?
The randomForest::randomForest function encodes low-cardinality (up to 32 categories) and high-cardinality (32 to 64? categories) categorical splits differently. Note that all of your "problematic" features belong to the latter class and are encoded using 64-bit floating-point values.
Although the console output is hard for a human to interpret, the randomForest model object itself is correct (i.e., it treats those variables as categorical) and makes correct predictions.
If you want to investigate the structure of decision tree and decision tree ensemble models, you might consider exporting them to the PMML data format. For example, you can use the r2pmml package for this:
library("r2pmml")
r2pmml(rf.full, "MyRandomForest.pmml")
Then open MyRandomForest.pmml in a text editor, and you will have a clear overview of the internals of your model (branches, split conditions, leaf values, etc.).
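For reading a categorical split directly from getTree() output, here is a minimal sketch. It assumes, as I understand the getTree help page, that the split point of a categorical predictor is an integer whose binary expansion marks which categories are sent to the left daughter (bit k set means category k goes left). The helper decode_split is a hypothetical name, not part of randomForest.

```r
# Decode a categorical split point into the category indices sent left.
# Uses plain arithmetic rather than bitwAnd() so that values above 2^31,
# like the ~9e15 split points shown in the tree, can still be handled.
decode_split <- function(split_point, n_cat) {
  bits <- logical(n_cat)
  remaining <- split_point
  for (k in seq_len(n_cat)) {
    bits[k] <- (remaining %% 2) == 1   # lowest remaining bit
    remaining <- remaining %/% 2       # shift right by one bit
  }
  which(bits)
}

# Four categories, split point 13 = 1*2^0 + 0*2^1 + 1*2^2 + 1*2^3
left_categories <- decode_split(13, 4)
```

The returned indices refer to positions in the factor's level vector, so they can be mapped back to labels with something like rf.full$forest$xlevels$ems30_53[left_categories].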

How to reduce a data set after subsetting it with gsub

I have a large data set, which I reduced applying gsub multiple times, basically in this form:
levels(Orders$Im) <- gsub("Osp", "OsProf", levels(Orders$Im))
I also used rbind:
DI_Reduced <- rbind(CX, OsP)
I need to apply function "tree" to the resulting data.frame, but I get an error:
tree.model <- tree(line ~ CountryCode + OrderType + Support, data=train.set)
The error is:
Error in tree(line ~ CountryCode + OrderType + Support, :
factor predictors must have at most 32 levels
Strange thing: if I export the train.set with write.csv and then I re-import it with read.csv, the error disappears and the tree is built.
I investigated the structure of the train.set and this is the difference before and after exporting/importing it:
$ CustomerNumber: Factor w/ 4616 levels "0","101959","210285",..: 3070 3069 4539 3732 2573 3086 2973 3817 4056 2956 ...
$ CountryCode : Factor w/ 4 levels "OtherCountry",..: 3 3 4 4 3 3 3 4 4 3 ...
$ OrderType : Factor w/ 5 levels "Order","NewOrder",..: 5 5 5 5 5 5 5 5 5 5 ...
$ Support : Factor w/ 5 levels "#N/A","BN",..: 4 4 4 4 2 4 4 4 4 4 ...
$ Manuf : Factor w/ 163 levels "<Generic>","6gi",..: 52 52 52 52 52 52 52 52 52 52 ...
$ line : Factor w/ 623 levels "\"Generic\" Skews",..: 400 35 400 400 400 400 400 400 400 400 ...
________________________________________________________________
$ CustomerNumber: Factor w/ 692 levels "201500","20202",..: 361 360 680 499 138 367 315 523 592 304 ...
$ CountryCode : Factor w/ 2 levels "JP","US": 1 1 2 2 1 1 1 2 2 1 ...
$ OrderType : Factor w/ 1 level "Online": 1 1 1 1 1 1 1 1 1 1 ...
$ Support : Factor w/ 4 levels "BN","MC",..: 3 3 3 3 1 3 3 3 3 3 ...
$ Manuf : Factor w/ 1 level "DY": 1 1 1 1 1 1 1 1 1 1 ...
$ line : Factor w/ 2 levels "CX","OTX": 2 1 2 2 2 2 2 2 2 2 ...
It seems to me that gsub does not really subset the original data.frame, and the hidden values stay in train.set until I export and re-import it. Is there another way to do this operation and obtain a tree?
As the error says, your dependent variable line has more than 32 levels. As per your train.set structure, line is a Factor w/ 623 levels.
Try using another tree library, such as rpart.
Re-creating the factors after subsetting drops the unused levels and might help:
train.set[] <- lapply(train.set, function(x) if (is.factor(x)) factor(x) else x)
Also, gsub is not usually used for subsetting; it is a global substitution function. You should share the pre-processing steps you followed as well, so that others can help you with this better.
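To illustrate why the export/import round trip makes the error disappear, here is a small sketch on made-up data (orders and its values are hypothetical). Subsetting a data frame keeps every original factor level, even those with no remaining rows; base R's droplevels() removes them without going through a CSV file.

```r
# Hypothetical data: three product lines, one of which is filtered out
orders <- data.frame(
  line = factor(c("CX", "OTX", "AB", "CX")),
  n    = 1:4
)

# Subsetting removes the rows but not the level "AB"
sub <- orders[orders$line != "AB", ]
levels_before <- nlevels(sub$line)

# droplevels() discards unused levels in every factor column
sub <- droplevels(sub)
levels_after <- nlevels(sub$line)
```

read.csv has the same effect only because it rebuilds each factor from the values actually present in the file.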

R - Random Forest and more than 53 categories

I know that randomForest cannot handle factors with more than 53 categories. Sadly, I have to analyze data in which one column has 165 levels, and I want to use randomForest for a classification.
My problem is that I cannot remove this column, since it is known to be a valuable predictor.
This predictor has 165 levels and is a factor.
Are there any tips on how I can handle this? Since the column is the film genre, I have no idea.
Are there alternative packages for big data, or a special workaround? Something like that.
Switching to Python is no option. We have too many R scripts here.
Thanks a lot and all the best
The str(data) looks like this:
'data.frame': 481696 obs. of 18 variables:
$ SENDERNR : int 432 1612 735 721 436 436 1321 721 721 434 ...
$ SENDER : Factor w/ 14 levels "ARD Das Erste",..: 6 3 4 9 12 12 10 9 9 7 ...
$ GEPLANTE_SENDUNG_N: Factor w/ 12563 levels "-- nicht bekannt --",..: 7070 808 5579 9584 4922 4922 12492 1933 9584 4533 ...
$ U_N_PROGRAMMCODE : Factor w/ 14 levels "Bühne/Aufführung",..: 9 4 8 4 8 8 12 8 4 2 ...
$ U_N_PROGRAMMSPARTE: Factor w/ 6 levels "Anderes","Fiction",..: 5 3 2 3 2 2 5 2 3 3 ...
$ U_N_SENDUNGSFORMAT: Factor w/ 29 levels "Bühne / Aufführung",..: 20 9 19 4 19 19 24 19 4 16 ...
$ U_N_GENRE : Factor w/ 163 levels "Action / Abenteuer",..: 119 147 115 4 158 158 163 61 4 84 ...
$ U_N_PRODUKTIONSART: Factor w/ 5 levels "Eigen-, Co-, Auftragsproduktion, Cofinanzierung",..: 1 1 3 1 3 3 1 3 1 1 ...
$ U_N_HERKUNFTSLAND : Factor w/ 25 levels "afrikanische Länder",..: 16 16 25 16 15 15 16 25 16 16 ...
$ GEPLANTE_SENDUNG_V: Factor w/ 12191 levels "-- nicht bekannt --",..: 6932 800 5470 9382 1518 9318 12119 1829 9382 4432 ...
$ U_V_PROGRAMMCODE : Factor w/ 13 levels "Bühne/Aufführung",..: 9 4 8 4 8 8 12 8 4 2 ...
$ U_V_PROGRAMMSPARTE: Factor w/ 6 levels "Anderes","Fiction",..: 5 3 2 3 2 2 5 2 3 3 ...
$ U_V_SENDUNGSFORMAT: Factor w/ 28 levels "Bühne / Aufführung",..: 20 9 19 4 19 19 24 19 4 16 ...
$ U_V_GENRE : Factor w/ 165 levels "Action / Abenteuer",..: 119 148 115 4 160 19 165 61 4 84 ...
$ U_V_PRODUKTIONSART: Factor w/ 5 levels "Eigen-, Co-, Auftragsproduktion, Cofinanzierung",..: 1 1 3 1 3 3 1 3 1 1 ...
$ U_V_HERKUNFTSLAND : Factor w/ 25 levels "afrikanische Länder",..: 16 16 25 16 15 9 16 25 16 16 ...
$ ABGELEHNT : int 0 0 0 0 0 0 0 0 0 0 ...
$ AKZEPTIERT : Factor w/ 2 levels "0","1": 2 1 2 2 2 2 1 2 2 2 ...
Having faced the same issue, here are some tips I can list.
Switch to another algorithm, for instance gradient boosting from the gbm package, which can handle up to 1024 categorical levels. If your predictor discriminates well between classes, you should also consider probabilistic approaches such as naiveBayes.
Transform your predictor into dummy variables, which can be done using model.matrix. You can then fit a random forest on this matrix.
Reduce the number of levels in your factor. OK, that may sound like silly advice, but is it really relevant to look at a factor at such a fine granularity? Is it possible for you to aggregate some levels into broader groups?
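The last tip can be sketched as follows. This is only one possible approach, assuming that rare levels can reasonably be pooled; lump_rare, the threshold, and the toy genre vector are all hypothetical names and values chosen for illustration.

```r
# Pool every level that occurs fewer than min_count times into "Other"
lump_rare <- function(f, min_count) {
  counts <- table(f)
  rare <- names(counts)[counts < min_count]
  f <- as.character(f)
  f[f %in% rare] <- "Other"
  factor(f)
}

# Toy genre column: two frequent levels and two that occur only once
genre <- factor(c("Drama", "Drama", "Drama", "Comedy", "Comedy",
                  "Western", "Opera"))
lumped <- lump_rare(genre, min_count = 2)
```

Raising min_count until nlevels(lumped) falls to 53 or below would bring the column under the limit mentioned in the question, at the cost of losing the distinction between the rare genres.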
EDIT TO ADD MODEL.MATRIX EXAMPLE
As mentioned, here is an example on how to use model.matrix to transform your column into dummy variables.
mydf <- data.frame(var1 = factor(c("A", "A", "A", "B", "B", "C")),
                   var2 = factor(c("X", "Y", "X", "Y", "X", "Z")),
                   target = c(1, 1, 1, 2, 2, 2))
# set contrasts.arg so that every level gets its own dummy column
dummyMat <- model.matrix(target ~ var1 + var2, mydf,
                         contrasts.arg = list(var1 = contrasts(mydf$var1, contrasts = FALSE),
                                              var2 = contrasts(mydf$var2, contrasts = FALSE)))
# drop the intercept column before binding the dummies to the data
mydf2 <- cbind(mydf, dummyMat[, 2:ncol(dummyMat)])
Use the caret package :
random_forest <- train("***dependent variable name***" ~ .,
data = "***your training data set***",
method = "ranger")
This can handle more than 53 categories.

Error in Cross Validation in GLMNET package R for Binomial Target Variable

This is in reference to https://stats.stackexchange.com/questions/72251/an-example-lasso-regression-using-glmnet-for-binary-outcome. I am trying to use cross-validation in glmnet (i.e., cv.glmnet) for a binomial target variable. glmnet works fine, but cv.glmnet throws an error. Here is the error log:
Error in storage.mode(y) = "double" : invalid to change the storage mode of a factor
In addition: Warning messages:
1: In Ops.factor(x, w) : ‘*’ not meaningful for factors
2: In Ops.factor(y, ybar) : ‘-’ not meaningful for factors
Data Types:
'data.frame': 490 obs. of 13 variables:
$ loan_id : Factor w/ 614 levels "LP001002","LP001003",..: 190 381 259 310 432 156 179 24 429 408 ...
$ gender : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 1 ...
$ married : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 2 2 2 1 ...
$ dependents : Factor w/ 4 levels "0","1","2","3+": 1 1 1 3 1 4 2 3 1 1 ...
$ education : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 1 2 1 1 1 2 1 2 ...
$ self_employed : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
$ applicantincome : int 9328 3333 14683 7667 6500 39999 3750 3365 2920 2213 ...
$ coapplicantincome: num 0 2500 2100 0 0 ...
$ loanamount : int 188 128 304 185 105 600 116 112 87 66 ...
$ loan_amount_term : Factor w/ 10 levels "12","36","60",..: 6 9 9 9 9 6 9 9 9 9 ...
$ credit_history : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ property_area : Factor w/ 3 levels "Rural","Semiurban",..: 1 2 1 1 1 2 2 1 1 1 ...
$ loan_status : Factor w/ 2 levels "0","1": 2 2 1 2 1 2 2 1 2 2 ...
Code used:
xfactors<-model.matrix(loan_status ~ gender+married+dependents+education+self_employed+loan_amount_term+credit_history+property_area,data=data_train)[,-1]
x<-as.matrix(data.frame(applicantincome,coapplicantincome,loanamount,xfactors))
glmmod<-glmnet(x,y=as.factor(loan_status),alpha=1,family='binomial')
plot(glmmod,xvar="lambda")
grid()
cv.glmmod <- cv.glmnet(x,y=loan_status,alpha=1) #This Is Where It Throws The Error
The credit for the answer goes to @user20650.
I suspect you need to add the family argument to cv.glmnet as well. An example:
library(glmnet)
x <- model.matrix(am ~ 0 + . , data=mtcars)
cv.glmnet(x, y=factor(mtcars$am), alpha=1)  # reproduces the error: family defaults to "gaussian"
cv.glmnet(x, y=factor(mtcars$am), alpha=1, family="binomial")  # works

must a dataset contain all factors in SVM in R

I'm trying to find class probabilities of new input vectors with support vector machines in R.
Training the model shows no errors.
fit <-svm(device~.,data=dataframetrain,
kernel="polynomial",probability=TRUE)
But predicting some input vector shows some errors.
predict(fit,dataframetest,probability=prob)
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
dataframetrain looks like:
> str(dataframetrain)
'data.frame': 24577 obs. of 5 variables:
$ device : Factor w/ 3 levels "mob","pc","tab": 1 1 1 1 1 1 1 1 1 1 ...
$ geslacht : Factor w/ 2 levels "M","V": 1 1 1 1 1 1 1 1 1 1 ...
$ leeftijd : num 77 67 67 66 64 64 63 61 61 58 ...
$ invultijd: num 12 12 12 12 12 12 12 12 12 12 ...
$ type : Factor w/ 8 levels "A","B","C","D",..: 5 5 5 5 5 5 5 5 5 5 ...
and dataframetest looks like:
> str(dataframetest)
'data.frame': 8 obs. of 4 variables:
$ geslacht : Factor w/ 1 level "M": 1 1 1 1 1 1 1 1
$ leeftijd : num 20 60 30 25 36 52 145 25
$ invultijd: num 6 12 2 5 6 8 69 7
$ type : Factor w/ 8 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8
I trained the model with 2 levels for 'geslacht', but sometimes I have to predict on data that contains only 1 level of 'geslacht'.
Is it possible to predict the class probabilities for a test set with only 1 level of 'geslacht'?
I hope someone can help me!!
Add another level (but not data) to geslacht.
x <- factor(c("A", "A"), levels = c("A", "B"))
x
[1] A A
Levels: A B
or
x <- factor(c("A", "A"))
levels(x) <- c("A", "B")
x
[1] A A
Levels: A B
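Applied to the data in the question, the same idea can be sketched by copying the level set from the training data instead of typing it by hand. The two small data frames below are hypothetical stand-ins for dataframetrain and dataframetest.

```r
# Hypothetical stand-ins: training data has both levels, test data only "M"
train <- data.frame(geslacht = factor(c("M", "V", "M")))
test  <- data.frame(geslacht = factor(c("M", "M")))

# Rebuild the test factor with the training levels; values are unchanged
test$geslacht <- factor(test$geslacht, levels = levels(train$geslacht))
```

After this step the test factor carries both levels even though only one is observed, so the contrasts used at training time can be applied to it.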

Resources