Error when I try to predict class probabilities in R - caret - r

I've built a model using caret. When the training completed, I got the following warning:
Warning message:
In train.default(x, y, weights = w, ...) :
At least one of the class levels are not valid R variables names; This may cause errors if class probabilities are generated because the variables names will be converted to: X0, X1
The names of the variables are:
str(train)
'data.frame': 7395 obs. of 30 variables:
$ alchemy_category : Factor w/ 13 levels "arts_entertainment",..: 2 8 6 6 11 6 1 6 3 8 ...
$ alchemy_category_score : num 3737 2052 4801 3816 3179 ...
$ avglinksize : num 2.06 3.68 2.38 1.54 2.68 ...
$ commonlinkratio_1 : num 0.676 0.508 0.562 0.4 0.5 ...
$ commonlinkratio_2 : num 0.206 0.289 0.322 0.1 0.222 ...
$ commonlinkratio_3 : num 0.0471 0.2139 0.1202 0.0167 0.1235 ...
$ commonlinkratio_4 : num 0.0235 0.1444 0.0426 0 0.0432 ...
$ compression_ratio : num 0.444 0.469 0.525 0.481 0.446 ...
$ embed_ratio : num 0 0 0 0 0 0 0 0 0 0 ...
$ frameTagRatio : num 0.0908 0.0987 0.0724 0.0959 0.0249 ...
$ hasDomainLink : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ html_ratio : num 0.246 0.203 0.226 0.266 0.229 ...
$ image_ratio : num 0.00388 0.08865 0.12054 0.03534 0.05047 ...
$ is_news : Factor w/ 2 levels "0","1": 2 2 2 2 2 1 2 1 2 1 ...
$ lengthyLinkDomain : Factor w/ 2 levels "0","1": 2 2 2 1 2 1 1 1 1 2 ...
$ linkwordscore : num 24 40 55 24 14 12 21 5 17 14 ...
$ news_front_page : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ non_markup_alphanum_characters: num 5424 4973 2240 2737 12032 ...
$ numberOfLinks : num 170 187 258 120 162 55 93 132 194 326 ...
$ numwords_in_url : num 8 9 11 5 10 3 3 4 7 4 ...
$ parametrizedLinkRatio : num 0.1529 0.1818 0.1667 0.0417 0.0988 ...
$ spelling_errors_ratio : num 0.0791 0.1254 0.0576 0.1009 0.0826 ...
$ label : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 2 1 2 2 ...
$ isVideo : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 1 1 ...
$ isFashion : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 2 1 2 1 ...
$ isFood : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ hasComments : Factor w/ 2 levels "0","1": 1 2 2 2 2 1 2 2 1 2 ...
$ hasGoogleAnalytics : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 2 2 2 1 ...
$ hasInlineCSS : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 2 1 2 2 ...
$ noOfMetaTags : num 10 12 6 10 13 2 6 6 9 5 ...
My code is the following:
ctrl <- trainControl(method = "CV",
                     number = 10,
                     classProbs = TRUE,
                     allowParallel = TRUE,
                     summaryFunction = twoClassSummary)
set.seed(476)
rfFit <- train(formula,
               data = train,
               method = "rf",
               tuneGrid = expand.grid(.mtry = seq(4, 20, by = 2)),
               ntree = 1000,  # randomForest's argument is ntree, not ntrees
               importance = TRUE,
               metric = "ROC",
               trControl = ctrl)
pred <- predict(rfFit, newdata = test, type = "prob")
I get the error:
Error in `[.data.frame`(out, , obsLevels, drop = FALSE) :
  undefined columns selected
The variable names on the test data set are:
str(test)
'data.frame': 3171 obs. of 29 variables:
$ alchemy_category : Factor w/ 13 levels "arts_entertainment",..: 8 4 12 4 10 12 12 8 1 2 ...
$ alchemy_category_score : num 5307 4825 1 6708 5416 ...
$ avglinksize : num 2.56 3.77 2.27 2.52 1.85 ...
$ commonlinkratio_1 : num 0.39 0.462 0.496 0.706 0.471 ...
$ commonlinkratio_2 : num 0.257 0.205 0.385 0.346 0.161 ...
$ commonlinkratio_3 : num 0.0441 0.0513 0.1709 0.123 0.0323 ...
$ commonlinkratio_4 : num 0.0221 0 0.1709 0.0906 0 ...
$ compression_ratio : num 0.49 0.782 1.25 0.449 0.454 ...
$ embed_ratio : num 0 0 0 0 0 0 0 0 0 0 ...
$ frameTagRatio : num 0.0671 0.0429 0.0588 0.0581 0.093 ...
$ hasDomainLink : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ html_ratio : num 0.23 0.366 0.162 0.147 0.244 ...
$ image_ratio : num 0.19944 0.08 10 0.00596 0.03571 ...
$ is_news : Factor w/ 2 levels "0","1": 2 1 1 2 2 1 1 2 1 1 ...
$ lengthyLinkDomain : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 2 1 1 1 ...
$ linkwordscore : num 15 62 42 41 34 35 15 22 41 7 ...
$ news_front_page : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ non_markup_alphanum_characters: num 5643 382 2420 5559 2209 ...
$ numberOfLinks : num 136 39 117 309 155 266 55 145 110 1 ...
$ numwords_in_url : num 3 2 1 10 10 7 1 9 5 0 ...
$ parametrizedLinkRatio : num 0.2426 0.1282 0.5812 0.0388 0.0968 ...
$ spelling_errors_ratio : num 0.0806 0.1765 0.125 0.0631 0.0653 ...
$ isVideo : Factor w/ 2 levels "0","1": 1 2 1 2 2 2 1 1 2 2 ...
$ isFashion : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 1 1 1 ...
$ isFood : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ hasComments : Factor w/ 2 levels "0","1": 2 1 1 2 2 2 1 2 2 1 ...
$ hasGoogleAnalytics : Factor w/ 2 levels "0","1": 1 2 2 2 2 1 1 2 1 1 ...
$ hasInlineCSS : Factor w/ 2 levels "0","1": 2 2 2 1 1 2 2 2 1 1 ...
$ noOfMetaTags : num 3 6 5 9 16 22 6 9 7 0 ...
If I omit the type="prob" part, I get no error.
Any ideas?
Could it be the length of the variable name "alchemy_category", which gets the respective factor levels appended inside the model (e.g. "alchemy_categoryarts_entertainment")?

The answer is in bold at the top of your post =]
What are you modeling? Is it alchemy_category? The code only says formula and we can't see it.
When you ask for class probabilities, the model predictions are a data frame with a separate column for each class/level. If the outcome (alchemy_category, if that is what you are modeling) doesn't have levels that are valid column names, data.frame converts them to valid names. That creates a problem because the code is looking for a specific name but the data frame has a different (but valid) name.
For example, if I had
> test <- factor(c("level1", "level 2"))
> levels(test)
[1] "level 2" "level1"
> make.names(levels(test))
[1] "level.2" "level1"
the code would be looking for "level 2" but there is only "level.2".

As stated above, the class values must be factors and their levels must be valid names. Another way to ensure this is:
levels(all.dat$target) <- make.names(levels(factor(all.dat$target)))
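
To see where the X0 and X1 in the warning come from, here is a minimal sketch on made-up data. The data frame all.dat and its target column are stand-ins for illustration, not the asker's data. make.names() prepends an X to any name that starts with a digit:

```r
# Toy outcome with the same "0"/"1" levels as in the question (hypothetical data)
all.dat <- data.frame(target = factor(c("0", "1", "1", "0")))

# Levels that start with a digit are not syntactically valid R names
valid_names <- make.names(levels(all.dat$target))
print(valid_names)

# Rename the levels in place; each observation keeps its class membership
levels(all.dat$target) <- valid_names
print(table(all.dat$target))
```

Because only the level labels change, the class counts are the same before and after the renaming.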

I read through the answers above while facing a similar problem. A more general solution is to apply the fix to every factor in both the train and test datasets. Make sure you include the response variable in feature.names too.
feature.names <- names(train)
for (f in feature.names) {
  if (is.factor(train[[f]])) {
    # Rename the existing levels in place so each label stays matched to its level
    levels(train[[f]]) <- make.names(levels(train[[f]]))
  }
}
This creates syntactically correct labels for all factors.
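
One way to apply the same renaming to both datasets is to wrap the loop in a function. The sketch below uses a hypothetical helper, sanitize_levels(), and two made-up data frames (train_toy and test_toy); none of these names come from the original post:

```r
# Hypothetical helper: make the levels of every factor column valid R names
sanitize_levels <- function(df) {
  for (f in names(df)) {
    if (is.factor(df[[f]])) {
      levels(df[[f]]) <- make.names(levels(df[[f]]))
    }
  }
  df
}

# Made-up train/test data; the test set has no label column, as in the question
train_toy <- data.frame(label = factor(c("0", "1", "0")),
                        cat   = factor(c("a b", "c", "a b")),
                        x     = c(1.5, 2.0, 3.0))
test_toy  <- data.frame(cat = factor(c("c", "a b")),
                        x   = c(0.1, 0.2))

train_toy <- sanitize_levels(train_toy)
test_toy  <- sanitize_levels(test_toy)

print(levels(train_toy$label))
print(levels(train_toy$cat))
print(levels(test_toy$cat))
```

Note that make.names() can map two distinct levels to the same name (for example "a b" and "a.b"), in which case assigning the levels would merge them, so it is worth checking the result on real data.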

As @Sam Firke already pointed out in the comments (I had overlooked it), the levels TRUE/FALSE don't work either, so I converted them to yes/no.
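
A likely reason is that TRUE and FALSE are reserved words in R, so they are not valid variable names either. The following sketch, on a made-up logical vector named flag, shows what make.names() does with them and one way to recode to yes/no:

```r
# Hypothetical logical outcome
flag <- c(TRUE, FALSE, TRUE)

# Reserved words are not valid names; make.names() alters them
reserved <- make.names(c("TRUE", "FALSE"))
print(reserved)

# Recode explicitly so the level order and labels are under your control
outcome <- factor(flag, levels = c(FALSE, TRUE), labels = c("no", "yes"))
print(outcome)
```

Passing levels and labels together avoids relying on the order in which values happen to appear in the data.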

As in the example above, refactoring the outcome variable will usually fix the problem. It's better to make the change in the original dataset, before partitioning it into training and test datasets:
data$outcome <- factor(data$outcome)
levels(data$outcome) <- make.names(levels(data$outcome))
As others pointed out earlier, this problem only occurs when classProbs = TRUE, which causes the train function to generate additional statistics related to the outcome class.

Related

Error when trying to use one_hot encoding

I know this may be a potential duplicate question, but I found other answers didn't work in my situation.
I am using the following dataset:
> str(total_data)
'data.frame': 32260 obs. of 13 variables:
$ age : int 40 42 44 32 25 31 30 30 27 28 ...
$ workclass : Factor w/ 4 levels "Other-Unknown",..: 3 2 2 1 2 2 2 3 2 3 ...
$ education : Ord.factor w/ 7 levels "1"<"2"<"3"<"4"<..: 2 3 2 2 2 3 2 2 2 2 ...
$ marital.status : Factor w/ 5 levels "Divorced","Married",..: 2 1 2 3 3 3 3 2 2 3 ...
$ occupation : Factor w/ 6 levels "Blue-Collar",..: 5 3 6 2 1 6 6 1 1 6 ...
$ race : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 1 5 1 1 5 5 5 5 5 5 ...
$ sex : Factor w/ 2 levels "Female","Male": 2 2 2 1 2 2 2 2 1 1 ...
$ hours.per.week : int 84 40 40 38 40 38 48 70 35 38 ...
$ naitive.country: Factor w/ 41 levels "?","Cambodia",..: 39 39 39 39 39 39 39 12 39 39 ...
$ classifier : chr "<=50K" "<=50K" ">50K" "<=50K" ...
$ class_num : Factor w/ 2 levels "1","2": 1 1 2 1 1 1 1 2 1 1 ...
$ age_norm : num 0.315 0.342 0.37 0.205 0.11 ...
$ hours_norm : num 0.847 0.398 0.398 0.378 0.398 ...
I'm trying to encode the factors into binary using one_hot() but receive the following error message:
encoded_data <- one_hot(total_data, dropCols = FALSE)
ERROR MESSAGE:
Error in `[.data.frame`(dt, , cols, with = FALSE) :
unused argument (with = FALSE)
I'm not sure what the "with" argument is as I don't see it in the R documentation.
I also saw that someone suggested to use model.matrix. However, when I use that, my ordered factor gets encoded as well, which is what I'm trying to avoid.
This is what happens to my ordered factor variable:
education.L education.Q education.C education^4 education^5 education^6
-3.779645e-01 9.690821e-17 4.082483e-01 -0.5640761 4.364358e-01 -0.19738551
-1.889822e-01 -3.273268e-01 4.082483e-01 0.0805823 -5.455447e-01 0.49346377
I'm also not sure why there are sometimes letters or numbers after the attribute name, e.g. education.L vs education^5.

Convert the data.frame into a data.table and it should work fine.
library(data.table)
library(mltools)  # assumption: the one_hot() used here is mltools::one_hot
dt <- as.data.table(total_data)
one_hot(dt)
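
On the side question about the .L, .Q, .C and ^4 suffixes: these are the column names of the polynomial contrasts (linear, quadratic, cubic, 4th power, ...) that R uses by default for ordered factors in model.matrix(). A base-R sketch on made-up data (the data frame d is hypothetical) shows the names and one way to keep an ordered factor as a single integer column instead:

```r
# Column names of the default contrasts for an ordered factor with 7 levels
poly_names <- colnames(contr.poly(7))
print(poly_names)

# Made-up data with one unordered and one ordered factor
d <- data.frame(
  sex       = factor(c("Female", "Male", "Male", "Female")),
  education = factor(c("1", "3", "2", "3"),
                     levels = c("1", "2", "3"), ordered = TRUE)
)

# Replace the ordered factor by its integer codes so it is not expanded
d$education <- as.integer(d$education)

# Dropping the intercept gives one indicator column per level of the first factor
X <- model.matrix(~ . - 1, data = d)
print(colnames(X))
```

With several unordered factors, only the first one gets a full set of indicator columns under this formula; the others are encoded with treatment contrasts unless contrasts.arg is supplied.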

Issues producing a ROC curve with a KNN Model - undefined columns

I apologize if this is rudimentary; however, I have gone through the traceback and tried googling to no real avail. Every time I run my code to produce a ROC curve I get
Error in `[.data.frame`(data, , class) : undefined columns selected
I have checked the data and they are single-column character vectors (as required).
library(cutpointr)
Temp1 <- predict(KnnModel, newdata=TestData, type="prob")
KnnProbs <- predict(object = KnnModel, newdata = TestData, type = "prob")
KnnProbs <- as.character(KnnProbs$`0`)
clch <- as.character(TrainData$loan_status)
KnnROC <- roc(data = TestData$loan_status, x = KnnProbs, class = clch)
plot(KnnROC, print.auc = T)
Any ideas as to what I am doing wrong and how to fix this?
EDIT: The TrainData is as follows:
'data.frame': 1500 obs. of 13 variables:
$ loan_amnt : num 6000 17625 8500 5000 10000 ...
$ loan_status : Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 1 1 1 ...
$ int_rate : num 13.33 15.61 6.68 6.92 14.98 ...
$ term : num 1 1 1 1 1 1 2 1 1 1 ...
$ installment : num 203 616 261 154 347 ...
$ grade : num 3 4 1 1 3 4 4 3 2 1 ...
$ emp_length : num 10 11 3 3 8 3 3 3 2 1 ...
$ annual_inc : num 30000 49000 53100 60000 37000 ...
$ dti : num 25.5 12.2 26.2 27.8 31.4 ...
$ sub_grade : num 13 16 3 4 13 16 18 12 8 4 ...
$ verification_status: num 1 2 2 3 3 3 3 1 3 1 ...
$ home_ownership : Factor w/ 6 levels "ANY","MORTGAGE",..: 6 6 6 5 2 2 2 2 6 2 ...
$ pymnt_plan : Factor w/ 2 levels "n","y": 1 1 1 1 1 1 1 1 1 1 ...

R - geeglm Error: contrasts can be applied only to factors with 2 or more levels

I have applied GEE to the following dataset (str as below). Everything is fine.
> str(cd4.5m2)
'data.frame': 1300 obs. of 7 variables:
$ id : Factor w/ 260 levels "1","5","29","32",..: 1 1 1 1 1 2 2 2 2 2 ...
$ Treatment: Factor w/ 4 levels "Alternating",..: 2 2 2 2 2 1 1 1 1 1 ...
$ Age : num 36.4 36.4 36.4 36.4 36.4 ...
$ Gender : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
$ logcd4 : num 3.14 3.04 2.77 2.83 3.22 ...
$ Week : num 0 7.57 15.57 23.57 32.57 ...
$ Time : int 0 1 2 3 4 0 1 2 3 4 ...
I then transformed the outcome variable, reason being we want to monitor the change over time. So the str of the transformed data looks like below, which is almost exactly the same as the previous one (other than some name changes).
> str(cd4.5m1)
'data.frame': 1300 obs. of 6 variables:
$ id : Factor w/ 260 levels "1","5","29","32",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Treatment : Factor w/ 4 levels "Alternating",..: 2 1 4 1 3 3 1 4 1 3 ...
$ Age : num 36.4 35.9 47.5 37.3 42.7 ...
$ Gender : Factor w/ 2 levels "Female","Male": 2 2 2 1 2 2 2 2 2 2 ...
$ Week : num 1 1 1 1 1 1 1 1 1 1 ...
$ cd4.change.norm: num 0.572 0.572 0.572 0.572 0.572 ...
I then run the GEE again and it gives me the error.
> gee1.default <- geeglm(cd4.change.norm ~ Treatment, data=cd4.5m1, id=id, family=gaussian, corstr="unstructured")
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
I also tested all variables in the data, they all contain multiple values. So I'm completely lost here. I also saw a lot of posts on this Error, but none seem to be able to address my issue here. Help appreciated!
I changed the correlation structure to AR1, and it worked. I did test the correlation (decreased over time) and AR1 is the correct structure to use.
But normally unstructured should be the safe option?
I just reordered my data and it works. I'd suggest you try reordering your data, e.g. cd4.5m1 <- cd4.5m1[order(cd4.5m1$id), ]. Credit: KDG
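
The reordering suggestion fits the str() output above, where the ids run 1, 2, 3, ... within a single Week, so rows from the same subject are not adjacent. As far as I know, geeglm expects the rows of each cluster to be contiguous. A base-R sketch on made-up data (the data frame d and the helper is_contiguous() are hypothetical) shows how to check and fix this:

```r
# Made-up long-format data sorted by week, so each id is split across rows
d <- data.frame(
  id   = factor(c(1, 2, 3, 1, 2, 3)),
  Week = c(1, 1, 1, 2, 2, 2),
  y    = c(0.5, 0.7, 0.2, 0.6, 0.9, 0.1)
)

# Hypothetical helper: TRUE if every id occupies one unbroken run of rows
is_contiguous <- function(id) {
  runs <- rle(as.character(id))
  anyDuplicated(runs$values) == 0
}

before <- is_contiguous(d$id)

# Sort by cluster first, then by time within cluster
d <- d[order(d$id, d$Week), ]
after <- is_contiguous(d$id)

print(c(before = before, after = after))
```

Sorting by time within id as well matters for time-dependent correlation structures such as AR1.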

R nnet multinom (multinomial logistic regression models) - assign penalties to avoid misclassification

I am using multinom from the nnet package to fit a logistic regression model to data consisting of 3 classes; however, the prevalence of the classes is not balanced. I would like to assign weights/penalties in order to tell the model to avoid misclassification for a certain class.
Here is my code and a slice of my data:
mnm <- multinom(formula = cut.rank ~ ., data = training.logist, trace = FALSE, maxit = 1000, weights=c(10,5,1))
> str(head(training.logist))
'data.frame': 6 obs. of 15 variables:
$ is_top_rated_listing : Factor w/ 2 levels "0","1": 1 1 1 2 2 2
$ seller_is_top_rated_seller : int 1 1 1 1 1 1
$ is_auto_pay : Factor w/ 2 levels "0","1": 2 2 2 2 2 2
$ is_returns_accepted : Factor w/ 2 levels "0","1": 2 2 2 2 2 2
$ seller_feedback_rating_star : Factor w/ 11 levels "Blue","Green",..: 7 7 7 9 9 9
$ keywords_title_assoc : num 1 1 1 1 1 1
$ normalized.price_shipping : num 0 0 0.00871 0.01853 0.01853 ...
$ normalized.seller_feedback_score : num 0.7117 0.8791 0.0966 0.095 0.095 ...
$ normalized.seller_positive_feedback_percent: num 0.7117 0.8791 0.0966 0.095 0.095 ...
$ item_condition : Factor w/ 2 levels "New","New other (see details)": 1 1 1 1 1 1
$ listing_type : Factor w/ 2 levels "FixedPrice","StoreInventory": 2 2 2 1 1 1
$ best_offer_enabled : Factor w/ 2 levels "0","1": 1 1 1 1 1 1
$ shipping_handling_time : int 10 10 10 1 1 1
$ shipping_locations : Factor w/ 7 levels "AU,Americas,Europe,Asia",..: 5 5 5 5 5 5
$ cut.rank : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1
>
Does anyone have an idea how to assign misclassification penalties? Specifically, I would like to assign a penalty ratio of 10:5:1 (corresponding to classes 1, 2, 3), meaning I really want to be accurate on class 1.
The distribution of my target variable cut.rank is ~ 0.04,0.08,0.88.
Because class 1 has a low prevalence the model sensitivity for that class is low.
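
One thing to note about the call above is that the weights argument of multinom takes case weights, i.e. one value per row, not one value per class. A minimal sketch of turning the 10:5:1 ratio into per-row weights follows; the data frame toy, its predictors x1 and x2, and the weight column w are made up for illustration. Case weights rescale each row's contribution to the likelihood, which is not the same thing as a true misclassification cost matrix:

```r
library(nnet)  # multinom() lives in nnet, a recommended package shipped with R

set.seed(1)
# Hypothetical imbalanced 3-class data standing in for training.logist
toy <- data.frame(
  cut.rank = factor(rep(c("1", "2", "3"), times = c(12, 24, 264))),
  x1 = rnorm(300),
  x2 = rnorm(300)
)

# One weight per row, looked up from the row's class (10:5:1 as in the question)
class_w <- c("1" = 10, "2" = 5, "3" = 1)
toy$w <- unname(class_w[as.character(toy$cut.rank)])

# Fit with per-observation case weights
mnm <- multinom(cut.rank ~ x1 + x2, data = toy, weights = w,
                trace = FALSE, maxit = 1000)
print(table(weight = toy$w, class = toy$cut.rank))
```

Whether this improves sensitivity for class 1 enough would need to be checked on held-out data.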

Contrasts can be applied only to factor

I have a question about R.
I am using levene.test to test for homogeneity of variance.
I know that you need a factor variable with at least two levels in order for this to work. And from what I see, I do have at least two levels for the factor variable that I am using. But somehow I keep getting the error of:
> nocorlevene <- levene.test(geno1rs11809462$SIF1, geno1rs11809462$k, correction.method = "correction.factor")
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
I even tried generating a variable from a binomial distribution:
k <- rbinom(1304, 1, 0.5)
and then used that as a factor, but it still did not work.
Lastly, I created a variable with 3 levels:
k <- sample(c(1, 0, 2), 1304, replace = TRUE)
but somehow it still does not work, and I get the same error:
nocorlevene <- levene.test(geno1rs11809462$SIF1, geno1rs11809462$k, correction.method="zero.removal")
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
This is the output of the type of the variable in the data:
> str(geno1rs11809462)
'data.frame': 1304 obs. of 16 variables:
$ id : chr "WG0012669-DNA_A03_K05743" "WG0012669-DNA_A04_K05752" "WG0012669-DNA_A05_K05761" "WG0012669-DNA_A06_K05785" ...
$ rs11809462 : Factor w/ 2 levels "2/1","2/2": 2 2 2 2 2 2 2 2 2 2 ...
..- attr(*, "names")= chr "WG0012669-DNA_A03_K05743" "WG0012669-DNA_A04_K05752" "WG0012669-DNA_A05_K05761" "WG0012669-DNA_A06_K05785" ...
$ FID : chr "9370" "9024" "14291" "4126" ...
$ AGE_CALC : num 61 47 NA 62.5 55.6 59.7 46.6 41.2 NA 46.6 ...
$ MREFSUM : num 185 325 NA 211 212 ...
$ NORSOUTH : Factor w/ 3 levels "0","1","NA": 1 1 3 1 1 1 1 1 3 1 ...
$ smoke1 : Factor w/ 3 levels "0","1","NA": 2 2 3 1 1 1 2 1 3 1 ...
$ smoke2 : Factor w/ 3 levels "0","1","NA": 1 1 3 2 2 2 1 2 3 2 ...
$ ANYCG60 : num 0 0 NA 1 0 0 0 0 NA 1 ...
$ DCCT_HBA_MEAN: num 7.39 6.93 NA 7.37 7.56 7.86 6.22 8.88 NA 8.94 ...
$ EDIC_HBA : num 7.17 7.63 NA 8.66 9.68 7.74 6.59 9.34 NA 7.86 ...
$ HBAEL : num 7.3 8.82 NA 9.1 9.3 ...
$ ELDTED_HBA : num 7.23 7.76 NA 8.36 9.21 7.92 6.64 9.64 NA 9.09 ...
$ SIF1 : num 19.6 17 NA 23.8 24.1 ...
$ sex : Factor w/ 2 levels "0","1": 1 1 2 2 2 2 1 1 1 1 ...
$ k : Factor w/ 3 levels "0","1","2": 1 1 2 3 1 3 3 3 1 2 ...
As you can see, the variables k and sex have 3 and 2 levels respectively, but somehow I still get that error message.
> head(geno1rs11809462)
id rs11809462 FID AGE_CALC MREFSUM NORSOUTH smoke1 smoke2 ANYCG60
1 WG0012669-DNA_A03_K05743 2/2 9370 61.0 184.5925 0 1 0 0
2 WG0012669-DNA_A04_K05752 2/2 9024 47.0 325.0047 0 1 0 0
3 WG0012669-DNA_A05_K05761 2/2 14291 NA NA NA NA NA NA
4 WG0012669-DNA_A06_K05785 2/2 4126 62.5 211.2557 0 0 1 1
5 WG0012669-DNA_A08_K05802 2/2 11280 55.6 212.2922 0 0 1 0
6 WG0012669-DNA_A09_K05811 2/2 11009 59.7 261.0116 0 0 1 0
DCCT_HBA_MEAN EDIC_HBA HBAEL ELDTED_HBA SIF1 sex k
1 7.39 7.17 7.30 7.23 19.6136 0 0
2 6.93 7.63 8.82 7.76 17.0375 0 0
3 NA NA NA NA NA 1 1
4 7.37 8.66 9.10 8.36 23.8333 1 2
5 7.56 9.68 9.30 9.21 24.1338 1 0
6 7.86 7.74 8.53 7.92 25.7272 1 2
If anyone can give me some hints as to why this is happening, it would be great. I just don't know why the variables k and sex, despite having multiple levels, give me an error when I run the test.
Thank you.

I think I may have solved the problem. I believe it is due to NA values in the data, because after I removed the NAs using, say,
x <- na.omit(original_data)
and then applied the Levene test to x, the error message disappeared.
Hopefully this is the cause of the problem.
If your factor has only one level, you will get this error. To check to see the levels of your factor variables, use lapply(df, levels). It will return nothing for non-factor variables, but will easily let you identify which variable is the offender. This is especially helpful if, like me, you have hundreds of variables.
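
The two observations above can be connected: a factor may have several levels in the full data but only one among the rows that survive NA removal. This is one plausible mechanism, not a confirmed diagnosis of the asker's data. A base-R sketch on a made-up data frame df shows how to check the effective number of levels:

```r
# Made-up data: y is missing exactly where g equals "b"
df <- data.frame(
  y = c(1.2, NA, 3.4, 2.2, NA),
  g = factor(c("a", "b", "a", "a", "b")),
  h = factor(c("u", "v", "u", "v", "u"))
)

# Declared levels per factor column
declared <- sapply(Filter(is.factor, df), nlevels)

# Levels actually present once incomplete rows are dropped
used <- df[complete.cases(df), ]
effective <- sapply(Filter(is.factor, used),
                    function(f) nlevels(droplevels(f)))

print(rbind(declared, effective))
```

Any factor whose effective count falls below 2 is a candidate for the contrasts error.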
You need to actually convert your variable to a factor. Just having three (or any finite number of) distinct values does not necessarily make it a factor.
Use x <- factor(x) to convert.
When you look at the output of str(), it shows you the type of each variable:
<..cropped..>
$ SIF1 : num 19.6 17 NA 23.8 24.1 ...
$ sex : Factor w/ 2 levels "0","1": 1 1 2 2 2 2 1 1 1 1 ...
$ k : Factor w/ 3 levels "0","1","2": 1 1 2 3 1 3 3 3 1 2 ...
notice that $k is a factor but SIF1 is not
Thus, use
geno1rs11809462$SIF1 <- factor(geno1rs11809462$SIF1)
