R - predict command error "undefined columns selected"

I’m a newbie to R, and I’m having trouble with an R predict command.
I receive this error:
Error in `[.data.frame`(newdata, , as.character(object$formula[[2]])) :
undefined columns selected
when I execute this command:
model.predict <- predict.boosting(model,newdata=test)
Here is my model:
model <- boosting(Y~x1+x2+x3+x4+x5+x6+x7, data=train)
And here is the structure of my test data:
str(test)
'data.frame': 343 obs. of 7 variables:
$ x1: Factor w/ 4 levels "Americas","Asia_Pac",..: 4 2 4 2 4 3 3 3 4 1 ...
$ x2: Factor w/ 5 levels "Fifth","First",..: 3 3 2 2 4 2 4 4 1 1 ...
$ x3: Factor w/ 3 levels "Best","Better",..: 2 3 1 1 3 2 2 1 3 3 ...
$ x4: Factor w/ 2 levels "Female","Male": 1 1 2 1 1 2 1 2 2 2 ...
$ x5: int 82 55 47 31 6 53 77 68 76 86 ...
$ x6: num 22.8 14.6 25.5 38.3 7.9 32.8 4.6 34.2 36.7 21.7 ...
$ x7: num 0.679 0.925 0.897 0.684 0.195 ...
And the structure of my training data:
$ RecordID: int 1 2 3 4 5 6 7 8 9 10 ...
$ x1 : Factor w/ 4 levels "Americas","Asia_Pac",..: 1 2 2 3 1 1 1 2 2 4 ...
$ x2 : Factor w/ 5 levels "Fifth","First",..: 5 5 3 2 5 5 5 4 3 2 ...
$ x3 : Factor w/ 3 levels "Best","Better",..: 2 3 2 2 3 1 2 3 1 1 ...
$ x4 : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 2 1 1 ...
$ x5 : int 1 67 75 51 84 33 21 80 48 5 ...
$ x6 : num 21 13.8 30.3 11.9 1.7 13.2 33.9 17 3.4 19.5 ...
$ x7 : num 0.35 0.85 0.73 0.39 0.47 0.13 0.2 0.12 0.64 0.11 ...
$ Y : Factor w/ 2 levels "Green","Yellow": 2 2 1 2 2 2 1 2 2 2 ..
I think there’s a problem with the structure of the test data, but I can’t find it, or else I misunderstand how the “predict” command works. Note that if I run the predict command on the training data, it works. Any suggestions as to where to look?
Thanks!

predict.boosting() expects to be given the actual labels for the test data, so it can calculate how well it did (as in the confusion matrix shown below).
library(adabag)
data(iris)
iris.adaboost <- boosting(Species~Sepal.Length+Sepal.Width+Petal.Length+
Petal.Width, data=iris, boos=TRUE, mfinal=10)
# make a 'test' dataframe without the classes, as in the question
iris2 <- iris
iris2$Species <- NULL
# replicates the error
irispred <- predict.boosting(iris.adaboost, newdata=iris2)
#Error in `[.data.frame`(newdata, , as.character(object$formula[[2]])) :
# undefined columns selected
Here's a working example, drawn largely from the help file, which also demonstrates the confusion matrix.
# first create subsets of iris data for training and testing
sub <- c(sample(1:50, 25), sample(51:100, 25), sample(101:150, 25))
iris3 <- iris[sub,]
iris4 <- iris[-sub,]
iris.adaboost <- boosting(Species ~ ., data=iris3, mfinal=10)
# works
iris.predboosting<- predict.boosting(iris.adaboost, newdata=iris4)
iris.predboosting$confusion
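#                Observed Class
# Predicted Class setosa versicolor virginica
#      setosa         50          0         0
#      versicolor      0         50         0
#      virginica       0          0        50
# Note: iris4 above holds 25 rows per class, so a real run shows counts
# that sum to 25 per column, not 50.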
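When the test data genuinely has no labels, one possible workaround is to add a placeholder response column before predicting. The sketch below is an illustration rather than the package's recommended approach: the names train, test and test_padded are made up for the example, and the placeholder labels are dummies, so any confusion matrix or error rate computed from them is meaningless. The adabag call is guarded so the rest of the snippet runs even when the package is not installed.

```r
# Split iris as in the example above, then drop the labels from the test set.
set.seed(42)
sub   <- c(sample(1:50, 25), sample(51:100, 25), sample(101:150, 25))
train <- iris[sub, ]
test  <- iris[-sub, names(iris) != "Species"]   # unlabeled, as in the question

# Add a dummy response with the same levels as the training response.
# The values are placeholders only; they carry no information.
test_padded <- test
test_padded$Species <- factor(rep(levels(train$Species)[1], nrow(test)),
                              levels = levels(train$Species))

# Optional: fit and predict if adabag is available.
if (requireNamespace("adabag", quietly = TRUE)) {
  library(adabag)
  model <- boosting(Species ~ ., data = train, mfinal = 10)
  pred  <- predict(model, newdata = test_padded)
  print(head(pred$class))   # predicted classes; ignore pred$confusion and pred$error
}
```

Only the predicted classes are usable here; the accuracy figures reflect the dummy labels, not the model.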

This error can appear when your y is a factor; try using as.vector(y)~. in the formula instead.

The column names of the data that you use to predict should be exactly the same as the column names of the training data.
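A quick way to check this is to compare the variables named in the model formula against the columns of the new data. The following is a minimal sketch in base R; the tiny data frames and the name model_formula are stand-ins invented for the example, not the asker's actual objects.

```r
# Hypothetical miniature versions of the question's data frames.
train <- data.frame(RecordID = 1:3,
                    x1 = c(0.35, 0.85, 0.73),
                    Y  = factor(c("Yellow", "Green", "Yellow")))
test  <- data.frame(x1 = c(0.68, 0.93))

model_formula <- Y ~ x1

# Every variable named in the formula, response included.
needed <- all.vars(model_formula)

# Variables the formula needs that the new data does not provide.
missing_cols <- setdiff(needed, names(test))
missing_cols   # "Y"
```

In the question, the same comparison would flag Y, which is the column the error message is trying to select from newdata.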

Related

Categorical variable with 132 levels in a prediction problem

I am trying to use random forest to make a prediction for price with below data frame
'data.frame': 10682 obs. of 9 variables:
Airline : Factor w/ 12 levels "Air Asia","Air India",..: 4 2 5 4 4 9 5 5 5 7 ...
Source : Factor w/ 5 levels "Banglore","Chennai",..: 1 4 3 4 1 4 1 1 1 3 ...
Destination : Factor w/ 6 levels "Banglore","Cochin",..: 6 1 2 1 6 1 6 6 6 2 ...
Route : Factor w/ 132 levels "BLR → AMD → DEL",..: 19 88 123 96 30 68 6 6 6 109 ...
Additional_Info: Factor w/ 10 levels "1 Long layover",..: 8 8 8 8 8 8 6 8 6 8 ...
Duration_Num : num 1.04 2 2.94 1.69 1.56 ...
Total_Stops_Num: num 0 2 2 1 1 0 1 1 1 1 ...
Departure_Num : POSIXct, format: "2019-03-24 22:20:00" "2019-05-01 05:50:00" ...
Price : num 8.27 8.94 9.54 8.74 9.5 ...
Initially I tried multiple linear regression, so I log-transformed the dependent variable (Price).
All the non-numeric variables were character before, so I converted them to factor and date-time.
The variable Route has 132 levels. I tried one-hot encoding, but the results were not as good.
How should I preprocess this variable with 100+ levels? Random forest fails every time.
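The question does not say which random forest implementation is used. If it is the randomForest package, that package rejects categorical predictors with more than 53 categories, which would explain the failure. One common preprocessing option is to keep the most frequent levels and merge the rest into a single level. The sketch below shows the idea in base R; lump_rare_levels is a hypothetical helper written for this example, and the toy routes are made up.

```r
# Hypothetical helper: keep the `keep` most frequent levels of a factor and
# merge all remaining levels into one level named by `other`.
lump_rare_levels <- function(f, keep = 50, other = "Other") {
  f <- as.factor(f)
  counts <- sort(table(f), decreasing = TRUE)
  top <- names(counts)[seq_len(min(keep, length(counts)))]
  out <- as.character(f)
  out[!is.na(out) & !(out %in% top)] <- other
  factor(out)
}

# Toy data standing in for the Route column.
route <- factor(c("BLR-DEL", "BLR-DEL", "BLR-DEL", "CCU-BLR", "CCU-BLR",
                  "DEL-COK", "MAA-CCU"))
lumped <- lump_rare_levels(route, keep = 2)
table(lumped)
```

How many levels to keep is a modelling choice; whether the merged level hurts or helps accuracy would need to be checked on held-out data.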

Issues producing a ROC curve with a KNN Model - undefined columns

I do apologize if this is rudimentary; however, I have run through the traceback and tried googling to no real avail. Every time I run my code to produce a ROC curve, I get:
Error in `[.data.frame`(data, , class) : undefined columns selected
I have checked the data and they are single-column character vectors (as required).
library(cutpointr)
Temp1 <- predict(KnnModel, newdata=TestData, type="prob")
KnnProbs <- predict(object = KnnModel, newdata = TestData, type = "prob")
KnnProbs <- as.character(KnnProbs$`0`)
clch <- as.character(TrainData$loan_status)
KnnROC <- roc(data = TestData$loan_status, x = KnnProbs, class = clch)
plot(KnnROC, print.auc = T)
Any ideas as to what I am doing wrong and how to fix this?
EDIT: TrainData has the following structure:
'data.frame': 1500 obs. of 13 variables:
$ loan_amnt : num 6000 17625 8500 5000 10000 ...
$ loan_status : Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 1 1 1 ...
$ int_rate : num 13.33 15.61 6.68 6.92 14.98 ...
$ term : num 1 1 1 1 1 1 2 1 1 1 ...
$ installment : num 203 616 261 154 347 ...
$ grade : num 3 4 1 1 3 4 4 3 2 1 ...
$ emp_length : num 10 11 3 3 8 3 3 3 2 1 ...
$ annual_inc : num 30000 49000 53100 60000 37000 ...
$ dti : num 25.5 12.2 26.2 27.8 31.4 ...
$ sub_grade : num 13 16 3 4 13 16 18 12 8 4 ...
$ verification_status: num 1 2 2 3 3 3 3 1 3 1 ...
$ home_ownership : Factor w/ 6 levels "ANY","MORTGAGE",..: 6 6 6 5 2 2 2 2 6 2 ...
$ pymnt_plan : Factor w/ 2 levels "n","y": 1 1 1 1 1 1 1 1 1 1 ...
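The error text shows data being indexed with class as a column selector. In the call above, class is a vector of label values ("0" and "1") rather than a column name, and those labels come from TrainData while the probabilities come from TestData. The base R sketch below reproduces that indexing failure on made-up data and shows the shape of input that avoids it; roc_input and its columns are hypothetical names, and the snippet deliberately does not call cutpointr.

```r
# Toy stand-in for the test-set results (hypothetical values).
roc_input <- data.frame(
  prob_0      = c(0.9, 0.2, 0.7, 0.4),           # kept numeric, not character
  loan_status = factor(c("0", "1", "0", "1"))    # labels from the same rows
)

# Selecting columns by the label values instead of by a column name
# triggers the same kind of error as in the question.
labels_as_names <- as.character(roc_input$loan_status)
res <- tryCatch(roc_input[, labels_as_names],
                error = function(e) conditionMessage(e))
res

# Selecting by the column's name works.
roc_input[, "loan_status"]
```

With a data frame like this, the roc() call would presumably receive the data frame as data and refer to the two columns by name; whether the names are quoted depends on the installed cutpointr version, so check its help page.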

R - geeglm Error: contrasts can be applied only to factors with 2 or more levels

I have applied GEE to the following dataset (str as below). Everything is fine.
> str(cd4.5m2)
'data.frame': 1300 obs. of 7 variables:
$ id : Factor w/ 260 levels "1","5","29","32",..: 1 1 1 1 1 2 2 2 2 2 ...
$ Treatment: Factor w/ 4 levels "Alternating",..: 2 2 2 2 2 1 1 1 1 1 ...
$ Age : num 36.4 36.4 36.4 36.4 36.4 ...
$ Gender : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
$ logcd4 : num 3.14 3.04 2.77 2.83 3.22 ...
$ Week : num 0 7.57 15.57 23.57 32.57 ...
$ Time : int 0 1 2 3 4 0 1 2 3 4 ...
I then transformed the outcome variable because we want to monitor the change over time. The str of the transformed data is shown below; it is almost exactly the same as the previous one (other than some name changes).
> str(cd4.5m1)
'data.frame': 1300 obs. of 6 variables:
$ id : Factor w/ 260 levels "1","5","29","32",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Treatment : Factor w/ 4 levels "Alternating",..: 2 1 4 1 3 3 1 4 1 3 ...
$ Age : num 36.4 35.9 47.5 37.3 42.7 ...
$ Gender : Factor w/ 2 levels "Female","Male": 2 2 2 1 2 2 2 2 2 2 ...
$ Week : num 1 1 1 1 1 1 1 1 1 1 ...
$ cd4.change.norm: num 0.572 0.572 0.572 0.572 0.572 ...
I then run the GEE again and it gives me the error.
> gee1.default <- geeglm(cd4.change.norm ~ Treatment, data=cd4.5m1, id=id, family=gaussian, corstr="unstructured")
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
I also tested all variables in the data, they all contain multiple values. So I'm completely lost here. I also saw a lot of posts on this Error, but none seem to be able to address my issue here. Help appreciated!
I changed the correlation structure to AR1, and it worked. I did test the correlation (decreased over time) and AR1 is the correct structure to use.
But isn't unstructured normally the safe option?
I just reordered my data and it worked. I suggest you try reordering your data by id, e.g. cd4.5m1<-cd4.5m1[order(cd4.5m1$id),]. Credit: KDG
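The str output in the question is consistent with this: in cd4.5m2 the ids run 1 1 1 1 1 2 2 ..., while in cd4.5m1 they run 1 2 3 4 5 ..., so each subject's rows are scattered. geeglm assumes that the rows of each cluster are contiguous. The sketch below, on made-up data, checks contiguity before and after sorting; is_contiguous is a hypothetical helper for the example.

```r
# Toy long-format data with interleaved ids (hypothetical values).
d <- data.frame(
  id   = factor(c("1", "5", "29", "1", "5", "29")),
  Week = c(1, 1, 1, 2, 2, 2),
  y    = c(0.57, 0.31, 0.44, 0.60, 0.35, 0.41)
)

# TRUE if every id occupies exactly one contiguous run of rows.
is_contiguous <- function(id) {
  runs <- rle(as.character(id))
  !any(duplicated(runs$values))
}

is_contiguous(d$id)          # FALSE: ids are interleaved

# Sort by id, then by time within id.
d_sorted <- d[order(d$id, d$Week), ]
is_contiguous(d_sorted$id)   # TRUE
```

Sorting by time within id as well keeps the within-subject order meaningful for time-dependent correlation structures.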

lmer Error: number of levels of each grouping factor must be < number of observations

I would like to run an ANOVA to find out where there are significant differences. I have already searched for an answer to my problem but can't find the mistake.
names:
[1] "Tier_ID" "species" "Klima" "Ressource" "Datum" "Gewicht" "IngestionRate"
data frame:
'data.frame': 70 obs. of 7 variables:
$ Tier_ID : Factor w/ 70 levels "Raupe1","Raupe10",..: 1 12 23 34 45 56 67 69 70 2 ...
$ species : Factor w/ 1 level "Agrotis exclamationis": 1 1 1 1 1 1 1 1 1 1 ...
$ Klima : Factor w/ 1 level "BL4": 1 1 1 1 1 1 1 1 1 1 ...
$ Ressource : Factor w/ 3 levels "Kontrolle","N",..: 2 2 2 2 2 2 2 2 2 2 ...
$ Datum : Factor w/ 1 level "06.08.2015": 1 1 1 1 1 1 1 1 1 1 ...
$ Gewicht : num 7.8 4.1 10.8 51.2 33.3 17.9 40.6 11.7 35.1 7.1 ...
$ IngestionRate: num 0.385 1.057 1.598 0.164 0.396 ...
I did subsets like this:
K_NPK<-subset(Agro,Agro$Ressource!="N")
my model:
mod4 <- lmer(IngestionRate~Ressource+(1|Gewicht), data=K_NPK)
It responds:
Error: number of levels of each grouping factor must be < number of observations
But if I do this subset
N_K<-subset(Agro,Agro$Ressource!="NPK")
and this model
mod4 <- lmer(IngestionRate~Ressource+(1|Gewicht), data=N_K)
this runs without error.
I hope you understand what I am trying to do.
Can anybody tell me what's wrong?
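The error message states the condition directly, so it can be checked by hand. Gewicht is a numeric weight used as the grouping factor; in a subset where every weight is distinct, each observation forms its own group. That the other subset passes presumably means some weights repeat there, which is an assumption since the full data is not shown. A minimal sketch on made-up values:

```r
# Toy subset in which every animal has a distinct weight (hypothetical values).
K_NPK <- data.frame(
  Ressource     = factor(c("Kontrolle", "Kontrolle", "NPK", "NPK")),
  Gewicht       = c(7.8, 4.1, 10.8, 51.2),
  IngestionRate = c(0.385, 1.057, 1.598, 0.164)
)

n_levels <- length(unique(K_NPK$Gewicht))   # groups implied by (1 | Gewicht)
n_obs    <- nrow(K_NPK)

n_levels < n_obs   # FALSE: the condition in the error message is violated
```

A random intercept needs a variable that groups several observations together; a continuous measurement taken once per animal usually fits better as a fixed covariate.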

Contrasts can be applied only to factor

I have a question about R.
I am using levene.test to test for homogeneity of variance.
I know that the grouping variable must be a factor with at least two levels for this to work, and as far as I can see, the factor I am using does have at least two levels. But I keep getting this error:
> nocorlevene <- levene.test(geno1rs11809462$SIF1, geno1rs11809462$k, correction.method = "correction.factor")
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
I even tried generating a variable from a binomial distribution:
k<-rbinom(1304, 1, 0.5)
and then using that as a factor, but it still does not work.
Lastly, I created a variable with 3 levels:
k<-sample(c(1,0,2), 1304, replace=TRUE)
but that still does not work, and I get the same error:
nocorlevene <- levene.test(geno1rs11809462$SIF1, geno1rs11809462$k, correction.method="zero.removal")
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
This is the output of the type of the variable in the data:
> str(geno1rs11809462)
'data.frame': 1304 obs. of 16 variables:
$ id : chr "WG0012669-DNA_A03_K05743" "WG0012669-DNA_A04_K05752" "WG0012669-DNA_A05_K05761" "WG0012669-DNA_A06_K05785" ...
$ rs11809462 : Factor w/ 2 levels "2/1","2/2": 2 2 2 2 2 2 2 2 2 2 ...
..- attr(*, "names")= chr "WG0012669-DNA_A03_K05743" "WG0012669-DNA_A04_K05752" "WG0012669-DNA_A05_K05761" "WG0012669-DNA_A06_K05785" ...
$ FID : chr "9370" "9024" "14291" "4126" ...
$ AGE_CALC : num 61 47 NA 62.5 55.6 59.7 46.6 41.2 NA 46.6 ...
$ MREFSUM : num 185 325 NA 211 212 ...
$ NORSOUTH : Factor w/ 3 levels "0","1","NA": 1 1 3 1 1 1 1 1 3 1 ...
$ smoke1 : Factor w/ 3 levels "0","1","NA": 2 2 3 1 1 1 2 1 3 1 ...
$ smoke2 : Factor w/ 3 levels "0","1","NA": 1 1 3 2 2 2 1 2 3 2 ...
$ ANYCG60 : num 0 0 NA 1 0 0 0 0 NA 1 ...
$ DCCT_HBA_MEAN: num 7.39 6.93 NA 7.37 7.56 7.86 6.22 8.88 NA 8.94 ...
$ EDIC_HBA : num 7.17 7.63 NA 8.66 9.68 7.74 6.59 9.34 NA 7.86 ...
$ HBAEL : num 7.3 8.82 NA 9.1 9.3 ...
$ ELDTED_HBA : num 7.23 7.76 NA 8.36 9.21 7.92 6.64 9.64 NA 9.09 ...
$ SIF1 : num 19.6 17 NA 23.8 24.1 ...
$ sex : Factor w/ 2 levels "0","1": 1 1 2 2 2 2 1 1 1 1 ...
$ k : Factor w/ 3 levels "0","1","2": 1 1 2 3 1 3 3 3 1 2 ...
As you can see, the variables k and sex have 3 and 2 levels respectively, but I still get that error message.
> head(geno1rs11809462)
id rs11809462 FID AGE_CALC MREFSUM NORSOUTH smoke1 smoke2 ANYCG60
1 WG0012669-DNA_A03_K05743 2/2 9370 61.0 184.5925 0 1 0 0
2 WG0012669-DNA_A04_K05752 2/2 9024 47.0 325.0047 0 1 0 0
3 WG0012669-DNA_A05_K05761 2/2 14291 NA NA NA NA NA NA
4 WG0012669-DNA_A06_K05785 2/2 4126 62.5 211.2557 0 0 1 1
5 WG0012669-DNA_A08_K05802 2/2 11280 55.6 212.2922 0 0 1 0
6 WG0012669-DNA_A09_K05811 2/2 11009 59.7 261.0116 0 0 1 0
DCCT_HBA_MEAN EDIC_HBA HBAEL ELDTED_HBA SIF1 sex k
1 7.39 7.17 7.30 7.23 19.6136 0 0
2 6.93 7.63 8.82 7.76 17.0375 0 0
3 NA NA NA NA NA 1 1
4 7.37 8.66 9.10 8.36 23.8333 1 2
5 7.56 9.68 9.30 9.21 24.1338 1 0
6 7.86 7.74 8.53 7.92 25.7272 1 2
If anyone can give me some hints as to why this is happening, that would be great. I just don't understand why k or sex, which have more than one level, produce this error when I run the test.
Thank you.
I think I may have solved the problem. I believe it is due to NA values in the data, because after I removed them with
x<-na.omit(original_data)
and applied the Levene test to x, the error message disappeared.
Hopefully this is the cause of the problem.
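A small sketch of that fix on made-up data: na.omit() drops every row containing an NA, and droplevels() then removes factor levels that no longer occur, which makes it easy to confirm that the grouping factor still has at least two observed levels. The data frame d and its values are invented for the illustration.

```r
# Toy data with missing responses (hypothetical values).
d <- data.frame(
  SIF1 = c(19.6, 17.0, NA, 23.8, 24.1, NA),
  k    = factor(c("0", "0", "1", "2", "0", "1"))
)

complete <- na.omit(d)
nrow(complete)                        # 4 rows remain

# Levels that still have observations among the complete rows.
table(complete$k)
complete$k <- droplevels(complete$k)
levels(complete$k)                    # "0" "2"
```

Note that na.omit() only removes real missing values; a factor level spelled "NA", as in NORSOUTH above, is an ordinary level and is left in place.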
If your factor has only one level, you will get this error. To check the levels of your factor variables, use lapply(df, levels). It returns NULL for non-factor variables, but will easily let you identify which variable is the offender. This is especially helpful if, like me, you have hundreds of variables.
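With many columns, it can be easier to count levels than to read them. The following base R sketch lists the factor columns that have fewer than two levels; the data frame is made up for the example, borrowing column names from the lmer question above.

```r
# Toy data: one single-level factor, one multi-level factor, one numeric.
df <- data.frame(
  species   = factor(rep("Agrotis exclamationis", 4)),
  Ressource = factor(c("Kontrolle", "N", "NPK", "N")),
  Gewicht   = c(7.8, 4.1, 10.8, 51.2)
)

is_fac    <- vapply(df, is.factor, logical(1))     # which columns are factors
n_lev     <- vapply(df[is_fac], nlevels, integer(1))
offenders <- names(n_lev)[n_lev < 2]
offenders   # "species"
```

Run this on the data after any subsetting or NA removal, since those steps can leave a factor with a single observed value even when its levels attribute still lists several.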
You need to actually convert your variable to a factor. Having only three (or any finite number of) distinct values does not by itself make it a factor.
Use x <- factor(x) to convert it.
When you look at the output of str(), it shows you the type of each variable:
<..cropped..>
$ SIF1 : num 19.6 17 NA 23.8 24.1 ...
$ sex : Factor w/ 2 levels "0","1": 1 1 2 2 2 2 1 1 1 1 ...
$ k : Factor w/ 3 levels "0","1","2": 1 1 2 3 1 3 3 3 1 2 ...
Notice that k is a factor but SIF1 is not.
Thus, use
geno1rs11809462$SIF1 <- factor(geno1rs11809462$SIF1)
