I apologize if this is rudimentary, but I have gone through the traceback and tried googling to no avail. Every time I run my code to produce a ROC curve, I get the following error:
Error in `[.data.frame`(data, , class) : undefined columns selected
I have checked the data and they are single-column character vectors (as required).
library(cutpointr)
Temp1 <- predict(KnnModel, newdata=TestData, type="prob")
KnnProbs <- predict(object = KnnModel, newdata = TestData, type = "prob")
KnnProbs <- as.character(KnnProbs$`0`)
clch <- as.character(TrainData$loan_status)
KnnROC <- roc(data = TestData$loan_status, x = KnnProbs, class = clch)
plot(KnnROC, print.auc = T)
Any ideas as to what I am doing wrong and how to fix this?
EDIT: The structure of TrainData is as follows:
'data.frame': 1500 obs. of 13 variables:
$ loan_amnt : num 6000 17625 8500 5000 10000 ...
$ loan_status : Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 1 1 1 ...
$ int_rate : num 13.33 15.61 6.68 6.92 14.98 ...
$ term : num 1 1 1 1 1 1 2 1 1 1 ...
$ installment : num 203 616 261 154 347 ...
$ grade : num 3 4 1 1 3 4 4 3 2 1 ...
$ emp_length : num 10 11 3 3 8 3 3 3 2 1 ...
$ annual_inc : num 30000 49000 53100 60000 37000 ...
$ dti : num 25.5 12.2 26.2 27.8 31.4 ...
$ sub_grade : num 13 16 3 4 13 16 18 12 8 4 ...
$ verification_status: num 1 2 2 3 3 3 3 1 3 1 ...
$ home_ownership : Factor w/ 6 levels "ANY","MORTGAGE",..: 6 6 6 5 2 2 2 2 6 2 ...
$ pymnt_plan : Factor w/ 2 levels "n","y": 1 1 1 1 1 1 1 1 1 1 ...
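For what it's worth, the message itself comes from base R's data frame indexing: `[.data.frame`(data, , class) fails when `class` holds values rather than the name of a column in `data`. The sketch below reproduces that with a made-up data frame `toy` (hypothetical, not the asker's data). Whether `roc()` here expects a data frame plus column names is an assumption worth checking against the cutpointr documentation.

```r
# Toy data frame (hypothetical, stands in for TestData)
toy <- data.frame(loan_status = factor(c("0", "1", "0")),
                  prob = c(0.2, 0.9, 0.4))

# Selecting by an existing column name works
ok <- toy[, "loan_status"]

# Selecting by the column's *values* fails: "0" and "1" are not column names
msg <- tryCatch(
  toy[, as.character(toy$loan_status)],
  error = function(e) conditionMessage(e)
)
print(msg)
```

If that is what is happening, passing the data frame itself as `data` and naming the predictor and outcome columns would be the thing to try.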
I've got a CSV file from the link "Hearthstone Arena Card Pickup probability".
It's just a list of vectors now, and I want to convert it into a 9-column data frame, so it may look like:
My current code is as follows, but it's not working at all.
hsd <- read.csv("hearthstonedraw.csv", header = TRUE)
hsd1 <- as.data.frame(hsd,ncol = 9)
hsd1
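Credit for this answer goes to Maurits Evers and Adam Sampson.
read.csv can read directly from the address you indicate. It works out the number of columns from the file and, by default in R versions before 4.0.0, converts character columns into factors (from R 4.0.0 onward, pass stringsAsFactors = TRUE to get the factor output shown below).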
hsd1 <- read.csv("https://bnetcmsus-a.akamaihd.net/cms/gallery/LN4X4GN4W59R1532566073433.csv", header = TRUE)
str(hsd1)
# 'data.frame': 3931 obs. of 9 variables:
# $ Draft.Class : Factor w/ 9 levels "Druid","Hunter",..: 1 1 1 1 1 1 1 1 1 1 ...
# $ Card.Name : Factor w/ 995 levels "Abominable Bowman",..: 716 813 646 500 263 964 549 186 509 984 ...
# $ Rarity : Factor w/ 5 levels "basic","common",..: 1 1 2 2 2 1 1 2 5 2 ...
# $ Type : Factor w/ 3 levels "Minion","Spell",..: 2 2 2 2 1 2 2 1 2 2 ...
# $ Card.Class : Factor w/ 10 levels "druid","hunter",..: 1 1 1 1 1 1 1 1 1 1 ...
# $ Average : num 1.47 1.45 1.44 1.17 1.03 ...
# $ P.1.or.more.: num 0.78 0.776 0.774 0.696 0.649 ...
# $ P.2.or.more.: num 0.436 0.431 0.428 0.327 0.273 ...
# $ P.3.or.more.: num 0.1784 0.1757 0.1724 0.1081 0.0819 ...
ncol(hsd1)
# [1] 9
# There are 9 columns in the data frame
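Since the factor conversion depends on the R version, here is a small sketch that makes the choice explicit. It uses inline text in place of the real file (the three rows are invented for illustration) and shows that the column count is inferred from the header either way.

```r
# Inline CSV standing in for the real file (rows are invented for illustration)
csv_text <- "Draft.Class,Card.Name,Average
Druid,Card A,1.47
Druid,Card B,1.45
Hunter,Card C,1.03"

# Explicitly request factors, as older R versions did by default
as_factors <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = TRUE)

# Keep strings as character vectors (the default since R 4.0.0)
as_chars <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = FALSE)

ncol(as_factors)               # number of columns taken from the header
class(as_factors$Draft.Class)  # "factor"
class(as_chars$Draft.Class)    # "character"
```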
I have applied GEE to the following dataset (str as below). Everything is fine.
> str(cd4.5m2)
'data.frame': 1300 obs. of 7 variables:
$ id : Factor w/ 260 levels "1","5","29","32",..: 1 1 1 1 1 2 2 2 2 2 ...
$ Treatment: Factor w/ 4 levels "Alternating",..: 2 2 2 2 2 1 1 1 1 1 ...
$ Age : num 36.4 36.4 36.4 36.4 36.4 ...
$ Gender : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
$ logcd4 : num 3.14 3.04 2.77 2.83 3.22 ...
$ Week : num 0 7.57 15.57 23.57 32.57 ...
$ Time : int 0 1 2 3 4 0 1 2 3 4 ...
I then transformed the outcome variable, reason being we want to monitor the change over time. So the str of the transformed data looks like below, which is almost exactly the same as the previous one (other than some name changes).
> str(cd4.5m1)
'data.frame': 1300 obs. of 6 variables:
$ id : Factor w/ 260 levels "1","5","29","32",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Treatment : Factor w/ 4 levels "Alternating",..: 2 1 4 1 3 3 1 4 1 3 ...
$ Age : num 36.4 35.9 47.5 37.3 42.7 ...
$ Gender : Factor w/ 2 levels "Female","Male": 2 2 2 1 2 2 2 2 2 2 ...
$ Week : num 1 1 1 1 1 1 1 1 1 1 ...
$ cd4.change.norm: num 0.572 0.572 0.572 0.572 0.572 ...
I then run the GEE again and it gives me the error.
> gee1.default <- geeglm(cd4.change.norm ~ Treatment, data=cd4.5m1, id=id, family=gaussian, corstr="unstructured")
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
I also tested all variables in the data, they all contain multiple values. So I'm completely lost here. I also saw a lot of posts on this Error, but none seem to be able to address my issue here. Help appreciated!
I changed the correlation structure to AR1, and it worked. I did test the correlation (decreased over time) and AR1 is the correct structure to use.
But normally unstructured should be the safe option?
I just reordered my data and it works. I'd suggest you try reordering your data by id, like cd4.5m1<-cd4.5m1[order(cd4.5m1$id),]. Credits: KDG
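The reordering suggestion fits with the geepack documentation, which asks for data sorted so that each cluster's rows are contiguous. In the str output above, cd4.5m1$id starts 1 2 3 4 ..., whereas cd4.5m2$id starts 1 1 1 1 1 2 ..., which suggests the transformed data lost that ordering. A minimal base-R sketch on a made-up data frame `toy` (hypothetical; only the column name `id` mirrors the question, and `is_contiguous` is a helper written for this example) shows how to check for and restore contiguity:

```r
# Made-up long-format data where rows of the same id are interleaved
toy <- data.frame(id   = factor(c(1, 5, 29, 1, 5, 29)),
                  Week = c(1, 1, 1, 2, 2, 2))

# A set of clusters is contiguous if each id appears in exactly one run
is_contiguous <- function(ids) {
  runs <- rle(as.character(ids))
  !any(duplicated(runs$values))
}

is_contiguous(toy$id)          # FALSE: ids are interleaved

# Sort by id so each subject's rows sit together
toy_sorted <- toy[order(toy$id), ]
is_contiguous(toy_sorted$id)   # TRUE
```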
I would like to run an ANOVA to find out where there are significant differences. I have already searched for an answer to my problem but could not find the mistake.
names:
[1] "Tier_ID" "species" "Klima" "Ressource" "Datum" "Gewicht" "IngestionRate"
data frame:
'data.frame': 70 obs. of 7 variables:
$ Tier_ID : Factor w/ 70 levels "Raupe1","Raupe10",..: 1 12 23 34 45 56 67 69 70 2 ...
$ species : Factor w/ 1 level "Agrotis exclamationis": 1 1 1 1 1 1 1 1 1 1 ...
$ Klima : Factor w/ 1 level "BL4": 1 1 1 1 1 1 1 1 1 1 ...
$ Ressource : Factor w/ 3 levels "Kontrolle","N",..: 2 2 2 2 2 2 2 2 2 2 ...
$ Datum : Factor w/ 1 level "06.08.2015": 1 1 1 1 1 1 1 1 1 1 ...
$ Gewicht : num 7.8 4.1 10.8 51.2 33.3 17.9 40.6 11.7 35.1 7.1 ...
$ IngestionRate: num 0.385 1.057 1.598 0.164 0.396 ...
I did subsets like this:
K_NPK<-subset(Agro,Agro$Ressource!="N")
my model:
mod4 <- lmer(IngestionRate~Ressource+(1|Gewicht), data=K_NPK)
it answers:
Error: number of levels of each grouping factor must be < number of observations
But if I do this subset
N_K<-subset(Agro,Agro$Ressource!="NPK")
and this model
mod4 <- lmer(IngestionRate~Ressource+(1|Gewicht), data=N_K)
it runs without any error.
I hope you understand what I am trying to do.
Can anybody tell me what's wrong?
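One thing worth checking, as a guess rather than a confirmed diagnosis: Gewicht is a numeric weight, so using it as a grouping factor in (1|Gewicht) creates one level per distinct weight. If every weight in a subset is unique, the number of levels equals the number of observations, which is what the message complains about. A base-R sketch with invented numbers (the data frame `toy` is hypothetical) shows the comparison to run on each subset:

```r
# Invented data: every weight is distinct, as can happen with a continuous measure
toy <- data.frame(Gewicht = c(7.8, 4.1, 10.8, 51.2, 33.3),
                  IngestionRate = c(0.385, 1.057, 1.598, 0.164, 0.396))

n_levels <- length(unique(toy$Gewicht))  # levels the grouping factor would have
n_obs    <- nrow(toy)

# The error message asks for fewer levels than observations
n_levels < n_obs   # FALSE here, so a random intercept per weight cannot be fitted
```

Running the same two lines on K_NPK and N_K would show whether the subsets differ in this respect.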
Here are the steps I'm following to do a Multinomial Linear Regression.
> z<-read.table("2008 Racedata.txt", header=TRUE, sep="\t", row.names=NULL)
> head(z)
datekey raceno horseno place winner draw winodds log_odds jwt hwt
1 2008091501 1 8 1 1 2 12.0 2.484907 128 1170
2 2008091501 1 11 2 0 3 8.6 2.151762 123 1135
3 2008091501 1 6 3 0 5 7.0 1.945910 127 1114
4 2008091501 1 12 4 0 10 23.0 3.135494 123 1018
5 2008091501 1 14 5 0 4 11.0 2.397895 113 1027
6 2008091501 1 5 6 0 14 50.0 3.912023 131 972
> x<-mlogit.data(z,choice="winner",shape="long",id.var="datekey",alt.var="horseno")
Error in `row.names<-.data.frame`(`*tmp*`, value = c("1.8", "1.11", "1.6", :
duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ‘10.2’, ‘10.4’, ‘10.8’,
‘100.7’, ‘101.12’, ‘102.1’, ‘102.3’, ‘103.2’, ‘103.4’,
‘103.6’, ‘104.12’, ‘104.3’, ‘104.9’, ‘105.1’, ‘105.5’,
‘105.6’, ‘105.8’, ‘106.11’, ‘106.12’, ‘106.13’, ‘106.7’,
‘107.10’, ‘107.14’, ‘107.3’, ‘108.12’, ‘108.2’, ‘108.6’,
‘108.9’, ‘109.1’, ‘109.14’, ‘109.7’, ‘11.12’, ‘11.5’,
‘11.9’, ‘110.2’, ‘110.3’, ‘110.4’, ‘110.9’, ‘111.1’,
‘111.7’, ‘112.12’, ‘112.3’, ‘112.6’, ‘112.8’, ‘113.10’,
‘113.13’, ‘113.7’, ‘114.12’, ‘114.2’, ‘114.9’, ‘115.10’,
‘115.13’, ‘115.5’, ‘116.11’, ‘116.6’, ‘117.14’, ‘117.3’,
‘117.7’, ‘118.1’, ‘118.13’, ‘118.2’, ‘118.9’, ‘119.10’,
‘119.5’, ‘119.6’, ‘119.8’, ‘12.1’, ‘12.10’, ‘12.3’,
‚Äò12.6‚Äô, ‚Äò120.2‚Äô, ‚Äò120.4‚Äô, ‚Äò120.7‚ [... truncated]
>
What step am I missing here? Why the duplicates in row.names?
Thanks,
Walt
Two problems.
You seem to have some problem with encoding, since we are seeing lots of umlauts and accent marks in that error message. Furthermore, I am wondering whether that datekey column got converted into a factor.
In this case it is referring to an error in the construction of the row.names attribute of the new object, x. If you do:
with( z, table( datekey, horseno) )
... you may see a horse with multiple entries on the same day.
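A compact way to run the same check without scanning a large table, sketched on a made-up data frame `toy` that only borrows the column names datekey and horseno from the question:

```r
# Made-up rows; the last one repeats an existing datekey/horseno pair
toy <- data.frame(datekey = c(2008091501, 2008091501, 2008091502, 2008091501),
                  horseno = c(8, 11, 8, 8))

# Index of the first duplicated pair, or 0 if every pair is unique
first_dup <- anyDuplicated(toy[, c("datekey", "horseno")])

# Pairs that occur more than once
counts <- as.data.frame(table(datekey = toy$datekey, horseno = toy$horseno))
repeated <- counts[counts$Freq > 1, ]
print(repeated)
```

A result of 0 from anyDuplicated() means there are no repeated combinations.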
Actually there were no duplicate datekey x horseno combos. Converting horseno and datekey to character and then switching the shape argument from "long" to "wide" produces an error-free result:
z$datekey <- as.character(z$datekey)
z$horseno <- as.character(z$horseno)
x<-mlogit.data(z,choice="winner",shape="wide",id.var="datekey",alt.var="horseno")
str(x)
#----------
Classes ‘mlogit.data’ and 'data.frame': 18312 obs. of 11 variables:
$ datekey : Factor w/ 733 levels "2008091501","2008091502",..: 1 1 1 1 1 1 1 1 1 1 ...
$ raceno : int 1 1 1 1 1 1 1 1 1 1 ...
$ horseno : chr "0" "1" "0" "1" ...
$ place : int 1 1 2 2 3 3 4 4 5 5 ...
$ winner : logi FALSE TRUE TRUE FALSE TRUE FALSE ...
$ draw : int 2 2 3 3 5 5 10 10 4 4 ...
$ winodds : num 12 12 8.6 8.6 7 7 23 23 11 11 ...
$ log_odds: num 2.48 2.48 2.15 2.15 1.95 ...
$ jwt : int 128 128 123 123 127 127 123 123 113 113 ...
$ hwt : int 1170 1170 1135 1135 1114 1114 1018 1018 1027 1027 ...
$ chid : num 1 1 2 2 3 3 4 4 5 5 ...
- attr(*, "index")='data.frame': 18312 obs. of 3 variables:
..$ chid: Factor w/ 9156 levels "1","2","3","4",..: 1 1 2 2 3 3 4 4 5 5 ...
..$ alt : Factor w/ 2 levels "0","1": 1 2 1 2 1 2 1 2 1 2 ...
..$ id : Factor w/ 733 levels "2008091501","2008091502",..: 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, "choice")= chr "winner"
I've built a model using caret. When the training was completed, I got the following warning:
Warning message:
In train.default(x, y, weights = w, ...) :
At least one of the class levels are not valid R variables names; This may cause errors if class probabilities are generated because the variables names will be converted to: X0, X1
The names of the variables are:
str(train)
'data.frame': 7395 obs. of 30 variables:
$ alchemy_category : Factor w/ 13 levels "arts_entertainment",..: 2 8 6 6 11 6 1 6 3 8 ...
$ alchemy_category_score : num 3737 2052 4801 3816 3179 ...
$ avglinksize : num 2.06 3.68 2.38 1.54 2.68 ...
$ commonlinkratio_1 : num 0.676 0.508 0.562 0.4 0.5 ...
$ commonlinkratio_2 : num 0.206 0.289 0.322 0.1 0.222 ...
$ commonlinkratio_3 : num 0.0471 0.2139 0.1202 0.0167 0.1235 ...
$ commonlinkratio_4 : num 0.0235 0.1444 0.0426 0 0.0432 ...
$ compression_ratio : num 0.444 0.469 0.525 0.481 0.446 ...
$ embed_ratio : num 0 0 0 0 0 0 0 0 0 0 ...
$ frameTagRatio : num 0.0908 0.0987 0.0724 0.0959 0.0249 ...
$ hasDomainLink : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ html_ratio : num 0.246 0.203 0.226 0.266 0.229 ...
$ image_ratio : num 0.00388 0.08865 0.12054 0.03534 0.05047 ...
$ is_news : Factor w/ 2 levels "0","1": 2 2 2 2 2 1 2 1 2 1 ...
$ lengthyLinkDomain : Factor w/ 2 levels "0","1": 2 2 2 1 2 1 1 1 1 2 ...
$ linkwordscore : num 24 40 55 24 14 12 21 5 17 14 ...
$ news_front_page : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ non_markup_alphanum_characters: num 5424 4973 2240 2737 12032 ...
$ numberOfLinks : num 170 187 258 120 162 55 93 132 194 326 ...
$ numwords_in_url : num 8 9 11 5 10 3 3 4 7 4 ...
$ parametrizedLinkRatio : num 0.1529 0.1818 0.1667 0.0417 0.0988 ...
$ spelling_errors_ratio : num 0.0791 0.1254 0.0576 0.1009 0.0826 ...
$ label : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 2 1 2 2 ...
$ isVideo : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 1 1 ...
$ isFashion : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 2 1 2 1 ...
$ isFood : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ hasComments : Factor w/ 2 levels "0","1": 1 2 2 2 2 1 2 2 1 2 ...
$ hasGoogleAnalytics : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 2 2 2 1 ...
$ hasInlineCSS : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 2 1 2 2 ...
$ noOfMetaTags : num 10 12 6 10 13 2 6 6 9 5 ...
My code is the following:
ctrl <- trainControl(method = "CV",
number=10,
classProbs = TRUE,
allowParallel = TRUE,
summaryFunction = twoClassSummary)
set.seed(476)
rfFit <- train(formula,
data=train,
method = "rf",
tuneGrid = expand.grid(.mtry = seq(4,20,by=2)),
ntree=1000,
importance = TRUE,
metric = "ROC",
trControl = ctrl)
pred <- predict.train(rfFit, newdata = test, type = "prob")
I get the error:
Error in `[.data.frame`(out, , obsLevels, drop = FALSE) :
undefined columns selected
The variable names on the test data set are:
str(test)
'data.frame': 3171 obs. of 29 variables:
$ alchemy_category : Factor w/ 13 levels "arts_entertainment",..: 8 4 12 4 10 12 12 8 1 2 ...
$ alchemy_category_score : num 5307 4825 1 6708 5416 ...
$ avglinksize : num 2.56 3.77 2.27 2.52 1.85 ...
$ commonlinkratio_1 : num 0.39 0.462 0.496 0.706 0.471 ...
$ commonlinkratio_2 : num 0.257 0.205 0.385 0.346 0.161 ...
$ commonlinkratio_3 : num 0.0441 0.0513 0.1709 0.123 0.0323 ...
$ commonlinkratio_4 : num 0.0221 0 0.1709 0.0906 0 ...
$ compression_ratio : num 0.49 0.782 1.25 0.449 0.454 ...
$ embed_ratio : num 0 0 0 0 0 0 0 0 0 0 ...
$ frameTagRatio : num 0.0671 0.0429 0.0588 0.0581 0.093 ...
$ hasDomainLink : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ html_ratio : num 0.23 0.366 0.162 0.147 0.244 ...
$ image_ratio : num 0.19944 0.08 10 0.00596 0.03571 ...
$ is_news : Factor w/ 2 levels "0","1": 2 1 1 2 2 1 1 2 1 1 ...
$ lengthyLinkDomain : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 2 1 1 1 ...
$ linkwordscore : num 15 62 42 41 34 35 15 22 41 7 ...
$ news_front_page : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ non_markup_alphanum_characters: num 5643 382 2420 5559 2209 ...
$ numberOfLinks : num 136 39 117 309 155 266 55 145 110 1 ...
$ numwords_in_url : num 3 2 1 10 10 7 1 9 5 0 ...
$ parametrizedLinkRatio : num 0.2426 0.1282 0.5812 0.0388 0.0968 ...
$ spelling_errors_ratio : num 0.0806 0.1765 0.125 0.0631 0.0653 ...
$ isVideo : Factor w/ 2 levels "0","1": 1 2 1 2 2 2 1 1 2 2 ...
$ isFashion : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 1 1 1 ...
$ isFood : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ hasComments : Factor w/ 2 levels "0","1": 2 1 1 2 2 2 1 2 2 1 ...
$ hasGoogleAnalytics : Factor w/ 2 levels "0","1": 1 2 2 2 2 1 1 2 1 1 ...
$ hasInlineCSS : Factor w/ 2 levels "0","1": 2 2 2 1 1 2 2 2 1 1 ...
$ noOfMetaTags : num 3 6 5 9 16 22 6 9 7 0 ...
If I omit the type="prob" part, I get no error.
Any ideas?
Could it be the length of the variable "alchemy_category" which is appended with the respective factor levels e.g. "alchemy_categoryarts_entertainment" inside the model??
The answer is in bold at the top of your post =]
What are you modeling? Is it alchemy_category? The code only says formula and we can't see it.
When you ask for class probabilities, model predictions are a data frame with a separate column for each class/level. If the outcome (alchemy_category, if that is what you are modeling) doesn't have levels that are valid column names, data.frame converts them to valid names. That creates a problem because the code is looking for a specific name but the data frame has a different (but valid) name.
For example, if I had
> test <- factor(c("level1", "level 2"))
> levels(test)
[1] "level 2" "level1"
> make.names(levels(test))
[1] "level.2" "level1"
the code would be looking for "level 2" but there is only "level.2".
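Applied to levels like those in the question, a base-R sketch shows the same mismatch with "0" and "1". The `probs` data frame is a hypothetical stand-in for what a prediction routine might build, not caret's actual internals.

```r
lv <- c("0", "1")

# Names starting with a digit are not syntactically valid, so they get an X prefix
valid <- make.names(lv)
print(valid)   # "X0" "X1"

# data.frame() repairs invalid column names by default (check.names = TRUE)
probs <- data.frame(`0` = c(0.7, 0.2), `1` = c(0.3, 0.8))
names(probs)   # "X0" "X1"

# Looking up the original level names then fails
msg <- tryCatch(probs[, lv, drop = FALSE],
                error = function(e) conditionMessage(e))
print(msg)
```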
As stated above, the class values must be factors and must be valid names. Another way to ensure this is:
levels(all.dat$target) <- make.names(levels(factor(all.dat$target)))
I have read through the answers above while facing a similar problem. A more systematic solution is to do this on both the train and test datasets. Make sure you include the response variable in feature.names too.
feature.names <- names(train)
for (f in feature.names) {
  if (is.factor(train[[f]])) {
    # Rename the levels in place so every observation keeps its own class
    levels(train[[f]]) <- make.names(levels(train[[f]]))
  }
}
This creates syntactically correct labels for all factors.
As @Sam Firke already pointed out in the comments (but I overlooked it), the levels TRUE/FALSE don't work either. So I converted them to yes/no.
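A minimal sketch of that recoding in base R, assuming the outcome is stored as a logical vector (the `outcome` vector here is made up):

```r
# Logical outcome as it might come out of a comparison (invented values)
outcome <- c(TRUE, FALSE, TRUE, TRUE)

# make.names() has to alter reserved words such as TRUE and FALSE
changed <- make.names(c("TRUE", "FALSE"))
print(changed)

# Recode explicitly to ordinary labels instead
outcome_f <- factor(outcome, levels = c(FALSE, TRUE), labels = c("no", "yes"))
levels(outcome_f)   # "no" "yes"
```

Spelling out both levels and labels keeps the mapping explicit, so TRUE always becomes "yes" regardless of which value appears first in the data.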
As per the above example, re-labelling the levels of the outcome variable will usually fix the problem. It's better to make the change in the original dataset before partitioning it into training and test datasets.
data$outcome <- factor(data$outcome)
# Rename levels in place; each new label stays matched to the level it replaces
levels(data$outcome) <- make.names(levels(data$outcome))
As others pointed out earlier, this problem only occurs when classProbs=TRUE, which causes the train function to generate additional statistics related to the outcome class.