I'm using R and the breastCancer data frame. I want to use the train function from the caret package, but it fails with the error below. However, when I use another data frame, the function works.
library(mlbench)
library(caret)
data("breastCancer")
BC = na.omit(breastCancer[,-1])
a = train(Class~., data = as.matrix(BC), method = "svmRadial")
This is the error:
error : In .local(x, ...) : Variable(s) `' constant. Cannot scale data.
We can start with the data you have:
library(mlbench)
library(caret)
data(BreastCancer)
BC = na.omit(BreastCancer[,-1])
str(BC)
'data.frame': 683 obs. of 10 variables:
$ Cl.thickness : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 5 5 3 6 4 8 1 2 2 4 ...
$ Cell.size : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 4 1 8 1 10 1 1 1 2 ...
$ Cell.shape : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 4 1 8 1 10 1 2 1 1 ...
$ Marg.adhesion : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 5 1 1 3 8 1 1 1 1 ...
$ Epith.c.size : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 2 7 2 3 2 7 2 2 2 2 ...
$ Bare.nuclei : Factor w/ 10 levels "1","2","3","4",..: 1 10 2 4 1 10 10 1 1 1 ...
$ Bl.cromatin : Factor w/ 10 levels "1","2","3","4",..: 3 3 3 3 3 9 3 3 1 2 ...
$ Normal.nucleoli: Factor w/ 10 levels "1","2","3","4",..: 1 2 1 7 1 7 1 1 1 1 ...
$ Mitoses : Factor w/ 9 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 5 1 ...
$ Class : Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 1 1 1 ...
BC is a data.frame, and you can see that all your predictors are categorical or ordinal. You are fitting svmRadial, i.e. an SVM with a radial basis function kernel. Euclidean distance between categorical features is not trivial to compute, and if you look at the distribution of your categories:
sapply(BC,table)
$Cl.thickness
1 2 3 4 5 6 7 8 9 10
139 50 104 79 128 33 23 44 14 69
$Cell.size
1 2 3 4 5 6 7 8 9 10
373 45 52 38 30 25 19 28 6 67
$Cell.shape
1 2 3 4 5 6 7 8 9 10
346 58 53 43 32 29 30 27 7 58
$Marg.adhesion
1 2 3 4 5 6 7 8 9 10
393 58 58 33 23 21 13 25 4 55
By default, train resamples with the bootstrap, so some of the resampled training sets will be missing the sparsely represented levels, for example category 9 of Marg.adhesion in the table above. The corresponding variable then becomes all zero (constant) in that resample, which triggers the error. This most likely doesn't affect the overall result much, since these levels are rare.
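A minimal sketch of the mechanism, using a made-up unordered factor rather than the real BC columns. The counts 679 and 4 are illustrative, and the sketch assumes plain dummy coding via model.matrix:

```r
# Hypothetical factor: 683 rows, rare level "9" appears only 4 times
f <- factor(c(rep("1", 679), rep("9", 4)), levels = c("1", "9"))
n <- length(f)

# Chance that one bootstrap resample of size n contains none of the 4 rare rows
p_miss <- (1 - 4 / n)^n

# Chance that at least one of 25 bootstrap resamples misses the rare level
p_any <- 1 - (1 - p_miss)^25

# A resample without the rare rows: the level is still declared,
# so its dummy column exists but is constant (all zero)
resample <- data.frame(f = f[f != "9"])
mm <- model.matrix(~ f, data = resample)
constant_col <- all(mm[, "f9"] == 0)

print(c(p_miss = p_miss, p_any = p_any))
print(constant_col)
```

A single resample rarely loses the level, but across 25 resamples it becomes fairly likely, and a constant column cannot be centered and scaled, which appears to be what the message is about.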
One solution is to use cross-validation instead (it is unlikely that all the rare observations end up in the held-out fold). Note that you should never convert a data.frame containing factors or characters into a matrix with as.matrix(). caret can handle a data.frame like this:
train(Class ~.,data=BC,method="svmRadial",trControl=trainControl(method="cv"))
Support Vector Machines with Radial Basis Function Kernel
683 samples
9 predictor
2 classes: 'benign', 'malignant'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 614, 615, 615, 615, 616, 615, ...
Resampling results across tuning parameters:
C Accuracy Kappa
0.25 0.9575654 0.9101995
0.50 0.9619346 0.9190284
1.00 0.9633838 0.9220161
Tuning parameter 'sigma' was held constant at a value of 0.01841092
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were sigma = 0.01841092 and C = 1.
The other option, if you want to keep bootstrap resampling, is to either omit the observations with these rare levels or combine the rare levels with others.
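Combining rare levels could be sketched as follows. lump_rare is a hypothetical helper (not part of caret), the threshold of 10 is arbitrary, and the result is an unordered factor, so any ordering of an ordinal variable is lost:

```r
# Hypothetical helper: merge all levels seen fewer than min_count times
# into a single catch-all level
lump_rare <- function(x, min_count = 10, other = "other") {
  x <- as.character(x)
  counts <- table(x)
  rare <- names(counts)[counts < min_count]
  x[x %in% rare] <- other
  factor(x)
}

# Made-up data: levels "8" and "9" are rare
x <- factor(c(rep("1", 50), rep("2", 30), rep("9", 4), rep("8", 3)))
y <- lump_rare(x)
print(table(y))
```

Applied to BC, this would be run on each predictor column before calling train; whether merging is acceptable depends on what the levels mean.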
Your code contains some typos: the package name is caret, not caren, and the dataset name is BreastCancer, not breastCancer. You can use the following code to get rid of the errors:
library(mlbench)
library(caret)
data(BreastCancer)
BC = na.omit(BreastCancer[,-1])
a = train(Class~., data = as.matrix(BC), method = "svmRadial")
It returns:
#> Support Vector Machines with Radial Basis Function Kernel
#>
#> 683 samples
#> 9 predictor
#> 2 classes: 'benign', 'malignant'
#>
#> No pre-processing
#> Resampling: Bootstrapped (25 reps)
#> Summary of sample sizes: 683, 683, 683, 683, 683, 683, ...
#> Resampling results across tuning parameters:
#>
#> C Accuracy Kappa
#> 0.25 0.9550137 0.9034390
#> 0.50 0.9585504 0.9107666
#> 1.00 0.9611485 0.9161541
#>
#> Tuning parameter 'sigma' was held constant at a value of 0.02349173
#> Accuracy was used to select the optimal model using the largest value.
#> The final values used for the model were sigma = 0.02349173 and C = 1.
Related
I have the following dataset as seen beneath:
'data.frame': 42172 obs. of 3 variables:
$ Product: Factor w/ 811 levels "P1","P2","P3",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Week : Factor w/ 52 levels "W1","W2","W3",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Sales : int 11 7 7 12 8 3 4 8 14 22 ...
Subsequently, I pre-process my data and exclude the labels "Product" and "Week" from my data set to run clustering analyses:
df1_labels <- df1[, c("Product", "Week")]
df1$Product <- NULL
df1$Week <- NULL
str(df1)
'data.frame': 42172 obs. of 1 variable:
$ Sales: int 11 7 7 12 8 3 4 8 14 22 ...
sc_df <- as.data.frame(scale(df1))
summary(sc_df)
Sales
Min. :-0.7416
1st Qu.:-0.7416
Median :-0.4082
Mean : 0.0000
3rd Qu.: 0.2584
Max. : 5.3418
Finally, when I attempt to compute and plot the clustering, I run into the following issue:
dist_df <- dist(sc_df, method = 'euclidean')
hclust_avg <- hclust(dist_df, method = "average")
plot(hclust_avg)
Output:
Error: cannot allocate vector of size 6.6 Gb
Seemingly the dataset is too big. What would I need to do under such circumstances?
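For context, the size in the error message is consistent with what dist() has to store: one double for every pair of rows. A small arithmetic sketch, assuming 8-byte doubles and only the lower triangle of the distance matrix:

```r
n <- 42172                  # number of observations in the question
n_pairs <- n * (n - 1) / 2  # pairwise distances kept by dist()
bytes <- n_pairs * 8        # each distance is an 8-byte double
gib <- bytes / 2^30         # size in units of 2^30 bytes
print(round(gib, 1))
```

The requirement grows quadratically with the number of rows. If the goal is to cluster products rather than individual product-week rows (an assumption about the intent), working with the 811 products instead would need only 811 * 810 / 2 distances.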
I am trying to use random forest to predict price with the data frame below:
'data.frame': 10682 obs. of 9 variables:
Airline : Factor w/ 12 levels "Air Asia","Air India",..: 4 2 5 4 4 9 5 5 5 7 ...
Source : Factor w/ 5 levels "Banglore","Chennai",..: 1 4 3 4 1 4 1 1 1 3 ...
Destination : Factor w/ 6 levels "Banglore","Cochin",..: 6 1 2 1 6 1 6 6 6 2 ...
Route : Factor w/ 132 levels "BLR → AMD → DEL",..: 19 88 123 96 30 68 6 6 6 109 ...
Additional_Info: Factor w/ 10 levels "1 Long layover",..: 8 8 8 8 8 8 6 8 6 8 ...
Duration_Num : num 1.04 2 2.94 1.69 1.56 ...
Total_Stops_Num: num 0 2 2 1 1 0 1 1 1 1 ...
Departure_Num : POSIXct, format: "2019-03-24 22:20:00" "2019-05-01 05:50:00" ...
Price : num 8.27 8.94 9.54 8.74 9.5 ...
Initially I tried multiple linear regression, so I log-transformed the dependent variable (Price).
All the non-numeric variables were character before, so I converted them into factor and date-time.
The variable Route has 132 levels. I tried one-hot encoding, but the results were not as good.
How should I preprocess this variable with 100+ levels, given that random forest fails every time?
I am using the package randomForest to produce habitat suitability models for species. I thought everything was working as it should until I started looking at individual trees with getTree(). The documentation (see page 4 of the randomForest vignette) states that for categorical variables, the split point will be an integer, which makes sense. However, in the trees I have looked at for my results, this is not the case.
The data frame I used to build the model was formatted with categorical variables as factors:
> str(df.full)
'data.frame': 27087 obs. of 23 variables:
$ sciname : Factor w/ 2 levels "Laterallus jamaicensis",..: 1 1 1 1 1 1 1 1 1 1 ...
$ estid : Factor w/ 2 levels "7694","psabs": 1 1 1 1 1 1 1 1 1 1 ...
$ pres : Factor w/ 2 levels "1","0": 1 1 1 1 1 1 1 1 1 1 ...
$ stratum : Factor w/ 89 levels "poly_0","poly_1",..: 1 1 1 1 1 1 1 1 1 1 ...
$ ra : Factor w/ 3 levels "high","low","medium": 3 3 3 3 3 3 3 3 3 3 ...
$ eoid : Factor w/ 2 levels "0","psabs": 1 1 1 1 1 1 1 1 1 1 ...
$ avd3200 : num 0.1167 0.0953 0.349 0.1024 0.3765 ...
$ biocl05 : num 330 330 330 330 330 ...
$ biocl06 : num 66 65.8 66 65.8 66 ...
$ biocl08 : num 277 277 277 277 277 ...
$ biocl09 : num 170 170 170 170 170 ...
$ biocl13 : num 186 186 185 186 185 ...
$ cti : num 19.7 19 10.4 16.4 14.7 ...
$ dtnhdwat : num 168 240 39 206 309 ...
$ dtwtlnd : num 0 0 0 0 0 0 0 0 0 0 ...
$ e2em1n99 : num 0 0 0 0 0 0 0 0 0 0 ...
$ ems30_53 : Factor w/ 53 levels "0","602","2206",..: 19 4 17 4 19 19 4 4 19 19 ...
$ ems5607_46: num 0 0 1 0 0.4 ...
$ ksat : num 0.21 0.21 0.21 0.21 0.21 ...
$ lfevh_53 : Factor w/ 53 levels "0","11","16",..: 38 38 38 38 38 38 38 38 38 38 ...
$ ned : num 1.46 1.48 1.54 1.48 1.47 ...
$ soilec : num 14.8 14.8 19.7 14.8 14.8 ...
$ wtlnd_53 : Factor w/ 50 levels "0","3","7","11",..: 4 31 7 31 7 31 7 7 31 31 ...
This was the function call:
# rfStratum and sampSizeVec were previously defined
> rf.full$call
randomForest(x = df.full[, c(7:23)], y = df.full[, 3],
ntree = 2000, mtry = 7, replace = TRUE, strata = rfStratum,
sampsize = sampSizeVec, importance = TRUE, norm.votes = TRUE)
Here are the first 15 lines of an example tree (note that the variables in lines 1, 5, and 15 should be categorical, i.e., they should have integer split values):
> tree100
left daughter right daughter split var split point status prediction
1 2 3 ems30_53 9.007198e+15 1 <NA>
2 4 5 biocl08 2.753206e+02 1 <NA>
3 6 7 biocl06 6.110518e+01 1 <NA>
4 8 9 biocl06 1.002722e+02 1 <NA>
5 10 11 lfevh_53 9.006718e+15 1 <NA>
6 0 0 <NA> 0.000000e+00 -1 0
7 12 13 biocl05 3.310025e+02 1 <NA>
8 14 15 ned 2.814818e+00 1 <NA>
9 0 0 <NA> 0.000000e+00 -1 1
10 16 17 avd3200 4.199712e-01 1 <NA>
11 18 19 e2em1n99 1.724138e-02 1 <NA>
12 20 21 biocl09 1.738916e+02 1 <NA>
13 22 23 ned 8.837864e-01 1 <NA>
14 24 25 biocl05 3.442437e+02 1 <NA>
15 26 27 lfevh_53 9.007199e+15 1 <NA>
Additional information: I encountered this because I was investigating an error I was getting when predicting the results back onto the study area, stating that the types of predictors in the new data did not match those of the training data. I have done 6 other iterations of this model using the same data frame and scripts (just with different subsets of predictors) and have never gotten this message before. The only difference I could find between the randomForest object in this run and those in the other runs is that the rf.full$forest$ncat components are stored as double instead of integer:
> for(i in 1:length(rf.full$forest$ncat)){
+ cat(names(rf.full$forest$ncat)[[i]], ": ", class(rf.full$forest$ncat[[i]]), "\n")
+ }
avd12800 : numeric
cti : numeric
dtnhdwat : numeric
dtwtlnd : numeric
ems2207_99 : numeric
ems30_53 : numeric
ems5807_99 : numeric
hydgrp : numeric
ksat : numeric
lfevh_53 : numeric
ned : numeric
soilec : numeric
wtlnd_53 : numeric
>
> rf.full$forest$ncat
avd12800 cti dtnhdwat dtwtlnd ems2207_99 ems30_53 ems5807_99 hydgrp ksat lfevh_53
1 1 1 1 1 53 1 1 1 53
ned soilec wtlnd_53
1 1 50
However, xlevels (which appears to be a list of the predictor variables used and their types) are all showing the correct datatype for each predictor.
> for(i in 1:length(rf.full$forest$xlevels)){
+ cat(names(rf.full$forest$xlevels)[[i]], ": ", class(rf.full$forest$xlevels[[i]]),"\n")
+ }
avd12800 : numeric
cti : numeric
dtnhdwat : numeric
dtwtlnd : numeric
ems2207_99 : numeric
ems30_53 : character
ems5807_99 : numeric
hydgrp : character
ksat : numeric
lfevh_53 : character
ned : numeric
soilec : numeric
wtlnd_53 : character
# example continuous predictor
> rf.full$forest$xlevels$avd12800
[1] 0
# example categorical predictor
> rf.full$forest$xlevels$ems30_53
[1] "0" "602" "2206" "2207" "4504" "4507" "4702" "4704" "4705" "4706" "4707" "4717" "5207" "5307" "5600"
[16] "5605" "5607" "5616" "5617" "5707" "5717" "5807" "5907" "6306" "6307" "6507" "6600" "7002" "7004" "9107"
[31] "9116" "9214" "9307" "9410" "9411" "9600" "4607" "4703" "6402" "6405" "6407" "6610" "7005" "7102" "7104"
[46] "7107" "9000" "9104" "9106" "9124" "9187" "9301" "9505"
The ncat component is simply a vector of the number of categories per variable with 1 for continuous variables (as noted here), so it doesn't seem like it should matter if that is stored as an integer or a double, but it seems like this might all be related.
Questions
1) Shouldn't the split point for categorical predictors in any given tree of a randomForest forest be an integer, and if yes, any thoughts as to why factors in the data frame used as input to the randomForest call here are not being used as such?
2) Does the number type (double vs integer) of the ncat component of a randomForest object matter in any way related to model building, and any thoughts as to what could cause this to switch from integer in the first 6 runs to double in this last run (with each run containing different subsets of the same data)?
The randomForest::randomForest algorithm encodes low-cardinality (up to 32 categories) and high-cardinality (32 to 64? categories) categorical splits differently. Note that all your "problematic" features belong to the latter class and are encoded using 64-bit floating point values.
While the console output doesn't make sense for the human observer, the randomForest model object/algorithm itself is correct (ie. treats those variables as categorical), and is making correct predictions.
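According to the getTree documentation, the split point of a categorical predictor is an integer whose binary expansion says which categories are sent to the left daughter. With around 50 levels that integer can be close to 2^53, which R prints in scientific notation (2^53 is about 9.007199e+15), consistent with the values in the tree above. A minimal sketch of the decoding, where decode_split is a hypothetical helper and not part of randomForest:

```r
# Hypothetical helper: return the indices of the factor levels that go left.
# Bit k (counting from 0) of split_point corresponds to level k + 1.
decode_split <- function(split_point, n_levels) {
  bits <- (split_point %/% 2^(0:(n_levels - 1))) %% 2
  which(bits == 1)
}

# Example from the getTree help page: 4 categories, split point 13
# 13 = 1*2^0 + 0*2^1 + 1*2^2 + 1*2^3
left_levels <- decode_split(13, 4)
print(left_levels)

# Large split points are still whole numbers, just printed compactly
print(format(2^53, scientific = FALSE))
```

The returned indices refer to positions in the factor's levels, so they would be looked up in rf.full$forest$xlevels for the variable in question.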
If you want to investigate the structure of decision tree and decision tree ensemble models, you might consider exporting them to the PMML data format. For example, you can use the r2pmml package for this:
library("r2pmml")
r2pmml(rf.full, "MyRandomForest.pmml")
Then, open the MyRandomForest.pmml in a text editor, and you shall have a nice overview about the internals of your model (branches, split conditions, leaf values, etc).
I am having an issue with creating a matrix of explanatory variables for running ridge and lasso regression using cv.glmnet.
My original data frame is of dimension 1460*81 and consists of several numeric and factor variables. In order to run glmnet, I am attempting to create a matrix of predictors using model.matrix.
However, when creating model.matrix on my original dataset, some of the rows are being dropped and my response variable and predictors are not of the same length.
Here's the code:
str(train1)
'data.frame': 1460 obs. of 80 variables:
$ MSSubClass : int 60 20 60 70 60 50 20 60 50 190 ...
$ MSZoning : Factor w/ 5 levels "C (all)","FV",..: 4 4 4 4 4 4 4 4 5 4 ...
$ LotFrontage : num 65 80 68 60 84 85 75 69 51 50 ...
$ LotArea : int 8450 9600 11250 9550 14260 14115 10084 10382 6120 7420
$ Street : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
$ Alley : Factor w/ 3 levels "Grvl","None",..: 2 2 2 2 2 2 2 2 2 2 ...
$ LotShape : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 1 1 1 1 4 1 4 4
$ LandContour : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 4 4 4
$ Utilities : Factor w/ 2 levels "AllPub","NoSeWa": 1 1 1 1 1 1 1 1 1 1 ...
And now I am passing the data frame to model.matrix to create a matrix.
x = model.matrix(SalePrice ~., data = train1)
dim(x)
[1] 1370 260
Notice how the data goes from 1460 rows (1460 * 80) to a 1370 * 260 matrix. This causes a mismatch between the number of rows of my predictor matrix and the length of my response variable when I try to run ridge regression.
cv.ridge <- glmnet(x, y, alpha = 0)
Error in glmnet(x, y, alpha = 0) :
number of observations in y (1460) not equal to the number of rows of x (1370)
Any ideas on where to look to ensure that the number of rows of the matrix (x) equals the length of y?
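A likely cause, assuming the default na.action setting, is that model.matrix silently drops rows containing NA values. A minimal sketch on a made-up data frame (df, a and b are illustrative names, not columns of train1):

```r
# Toy data with one missing predictor value
df <- data.frame(y = c(1, 2, 3, 4),
                 a = c(1.5, NA, 2.5, 3.0),
                 b = factor(c("u", "v", "u", "v")))

# Default behaviour: the row with NA is dropped
x_dropped <- model.matrix(y ~ ., data = df)
print(nrow(x_dropped))

# Option 1: drop the same rows from the response so x and y stay aligned
keep <- complete.cases(df)
x_keep <- model.matrix(y ~ ., data = df[keep, ])
y_keep <- df$y[keep]

# Option 2: keep all rows (NA values are carried through into the matrix)
mf <- model.frame(y ~ ., data = df, na.action = na.pass)
x_full <- model.matrix(y ~ ., data = mf)
print(nrow(x_full))
```

With the second option the missing values still have to be imputed before fitting, so the first option or an explicit imputation step is usually the simpler route.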
I'm trying to find class probabilities of new input vectors with support vector machines in R.
Training the model shows no errors.
fit <-svm(device~.,data=dataframetrain,
kernel="polynomial",probability=TRUE)
But predicting some input vector shows some errors.
predict(fit,dataframetest,probability=prob)
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
dataframetrain looks like:
> str(dataframetrain)
'data.frame': 24577 obs. of 5 variables:
$ device : Factor w/ 3 levels "mob","pc","tab": 1 1 1 1 1 1 1 1 1 1 ...
$ geslacht : Factor w/ 2 levels "M","V": 1 1 1 1 1 1 1 1 1 1 ...
$ leeftijd : num 77 67 67 66 64 64 63 61 61 58 ...
$ invultijd: num 12 12 12 12 12 12 12 12 12 12 ...
$ type : Factor w/ 8 levels "A","B","C","D",..: 5 5 5 5 5 5 5 5 5 5 ...
and dataframetest looks like:
> str(dataframetest)
'data.frame': 8 obs. of 4 variables:
$ geslacht : Factor w/ 1 level "M": 1 1 1 1 1 1 1 1
$ leeftijd : num 20 60 30 25 36 52 145 25
$ invultijd: num 6 12 2 5 6 8 69 7
$ type : Factor w/ 8 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8
I trained the model with 2 levels for 'geslacht', but sometimes I have to predict on data that contains only 1 level of 'geslacht'.
Is it possible to predict class probabilities for a test set with only 1 level of 'geslacht'?
I hope someone can help me!
Add another level (but not data) to geslacht.
x <- factor(c("A", "A"), levels = c("A", "B"))
x
[1] A A
Levels: A B
or
x <- factor(c("A", "A"))
levels(x) <- c("A", "B")
x
[1] A A
Levels: A B
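Applied to the data frames in the question, a sketch could look like this. The small train_df and test_df below are stand-ins for dataframetrain and dataframetest, and taking the levels from the training data is an assumption about what the model expects:

```r
# Stand-in data: training data has both levels, test data only "M"
train_df <- data.frame(geslacht = factor(c("M", "V", "M")))
test_df <- data.frame(geslacht = factor(c("M", "M")))

# Re-declare the test factor with the training levels; values are unchanged
test_df$geslacht <- factor(test_df$geslacht,
                           levels = levels(train_df$geslacht))

print(levels(test_df$geslacht))
print(table(test_df$geslacht))
```

Copying the levels from the training data, rather than typing them by hand, keeps the level order identical between the two data frames.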