I am practicing with this dataset: http://archive.ics.uci.edu/ml/datasets/Census+Income
I loaded training & testing data.
# Downloading train and test data
trainFile = "adult.data"; testFile = "adult.test"
if (!file.exists (trainFile))
download.file (url = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
destfile = trainFile)
if (!file.exists (testFile))
download.file (url = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test",
destfile = testFile)
# Assigning column names
colNames = c ("age", "workclass", "fnlwgt", "education",
"educationnum", "maritalstatus", "occupation",
"relationship", "race", "sex", "capitalgain",
"capitalloss", "hoursperweek", "nativecountry",
"incomelevel")
# Reading training data
training = read.table (trainFile, header = FALSE, sep = ",",
strip.white = TRUE, col.names = colNames,
na.strings = "?", stringsAsFactors = TRUE)
# Load the testing data set
testing = read.table (testFile, header = FALSE, sep = ",",
strip.white = TRUE, col.names = colNames,
na.strings = "?", fill = TRUE, stringsAsFactors = TRUE)
I need to combine the two into one, but there is a problem: the two data frames do not have the same structure.
Display structure of the training data
> str (training)
'data.frame': 32561 obs. of 15 variables:
$ age : int 39 50 38 53 28 37 49 52 31 42 ...
$ workclass : Factor w/ 8 levels "Federal-gov",..: 7 6 4 4 4 4 4 6 4 4 ...
$ fnlwgt : int 77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
$ education : Factor w/ 16 levels "10th","11th",..: 10 10 12 2 10 13 7 12 13 10 ...
$ educationnum : int 13 13 9 7 13 14 5 9 14 13 ...
$ maritalstatus: Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
$ occupation : Factor w/ 14 levels "Adm-clerical",..: 1 4 6 6 10 4 8 4 10 4 ...
$ relationship : Factor w/ 6 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
$ race : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
$ sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
$ capitalgain : int 2174 0 0 0 0 0 0 0 14084 5178 ...
$ capitalloss : int 0 0 0 0 0 0 0 0 0 0 ...
$ hoursperweek : int 40 13 40 40 40 40 16 45 50 40 ...
$ nativecountry: Factor w/ 41 levels "Cambodia","Canada",..: 39 39 39 39 5 39 23 39 39 39 ...
$ incomelevel : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 2 2 ...
Display structure of the testing data
> str (testing)
'data.frame': 16282 obs. of 15 variables:
$ age : Factor w/ 74 levels "|1x3 Cross validator",..: 1 10 23 13 29 3 19 14 48 9 ...
$ workclass : Factor w/ 9 levels "","Federal-gov",..: 1 5 5 3 5 NA 5 NA 7 5 ...
$ fnlwgt : int NA 226802 89814 336951 160323 103497 198693 227026 104626 369667 ...
$ education : Factor w/ 17 levels "","10th","11th",..: 1 3 13 9 17 17 2 13 16 17 ...
$ educationnum : int NA 7 9 12 10 10 6 9 15 10 ...
$ maritalstatus: Factor w/ 8 levels "","Divorced",..: 1 6 4 4 4 6 6 6 4 6 ...
$ occupation : Factor w/ 15 levels "","Adm-clerical",..: 1 8 6 12 8 NA 9 NA 11 9 ...
$ relationship : Factor w/ 7 levels "","Husband","Not-in-family",..: 1 5 2 2 2 5 3 6 2 6 ...
$ race : Factor w/ 6 levels "","Amer-Indian-Eskimo",..: 1 4 6 6 4 6 6 4 6 6 ...
$ sex : Factor w/ 3 levels "","Female","Male": 1 3 3 3 3 2 3 3 3 2 ...
$ capitalgain : int NA 0 0 0 7688 0 0 0 3103 0 ...
$ capitalloss : int NA 0 0 0 0 0 0 0 0 0 ...
$ hoursperweek : int NA 40 50 40 40 30 30 40 32 40 ...
$ nativecountry: Factor w/ 41 levels "","Cambodia",..: 1 39 39 39 39 39 39 39 39 39 ...
$ incomelevel : Factor w/ 3 levels "","<=50K.",">50K.": 1 2 2 3 3 2 2 2 3 2 ...
Problem 1:
age has become a factor in testing, and the other factors in testing have one more level than the same factors in training. This is because the first row of testing is an unnecessary row:
|1x3 Cross validator
I tried to get rid of this by re-assigning testing:
testing = testing[-1,]
but, after running str() command again, I don't see any change.
Problem 2:
As I said before, I need to combine the two data frames into one, so I ran this:
combined <- rbind(training , testing)
Besides problem 1, I can see a new problem after running str():
> str(combined)
'data.frame': 48842 obs. of 15 variables:
$ age : chr "39" "50" "38" "53" ...
$ workclass : Factor w/ 9 levels "Federal-gov",..: 7 6 4 4 4 4 4 6 4 4 ...
$ fnlwgt : int 77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
$ education : Factor w/ 17 levels "10th","11th",..: 10 10 12 2 10 13 7 12 13 10 ...
$ educationnum : int 13 13 9 7 13 14 5 9 14 13 ...
$ maritalstatus: Factor w/ 8 levels "Divorced","Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
$ occupation : Factor w/ 15 levels "Adm-clerical",..: 1 4 6 6 10 4 8 4 10 4 ...
$ relationship : Factor w/ 7 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
$ race : Factor w/ 6 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
$ sex : Factor w/ 3 levels "Female","Male",..: 2 2 2 2 1 1 1 2 1 2 ...
$ capitalgain : int 2174 0 0 0 0 0 0 0 14084 5178 ...
$ capitalloss : int 0 0 0 0 0 0 0 0 0 0 ...
$ hoursperweek : int 40 13 40 40 40 40 16 45 50 40 ...
$ nativecountry: Factor w/ 42 levels "Cambodia","Canada",..: 39 39 39 39 5 39 23 39 39 39 ...
$ incomelevel : Factor w/ 5 levels "<=50K",">50K",..: 1 1 1 1 1 1 1 2 2 2 ...
The target variable (incomelevel) has 5 factor levels in the combined data frame, whereas it has 2 (which is correct) in the training data frame and 3 (one extra because of problem 1) in the testing data frame. This is because every incomelevel value in the testing data frame ends with a . (dot): <=50K., <=50K., >50K., ... So I need to remove that dot, but I have no idea how. Is there a function for this?
I am very new to data analysis and R, which is why I am running into basic issues like these. Can you please help me solve them?
I think you can skip the first line of the test file. It looks like a header, and skipping it solves the problem of age being read as a factor:
head(readLines(testFile))
[1] "|1x3 Cross validator"
[2] "25, Private, 226802, 11th, 7, Never-married, Machine-op-inspct, Own-child, Black, Male, 0, 0, 40, United-States, <=50K."
[3] "38, Private, 89814, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 50, United-States, <=50K."
Reusing your code, we can read both files with read.csv, passing skip=1 for the test file:
colNames = c ("age", "workclass", "fnlwgt", "education",
"educationnum", "maritalstatus", "occupation",
"relationship", "race", "sex", "capitalgain",
"capitalloss", "hoursperweek", "nativecountry",
"incomelevel")
# Reading training data
training = read.csv (trainFile, header = FALSE, col.names = colNames,stringsAsFactors = TRUE,na.strings = "?",strip.white = TRUE)
testing = read.csv (testFile, header = FALSE, col.names = colNames,na.strings = "?",stringsAsFactors = TRUE,skip=1,strip.white = TRUE)
Unfortunately, the income level has to be corrected manually. It's a good thing you checked:
testing$incomelevel = factor(gsub("\\.","",as.character(testing$incomelevel)))
Checking the levels, the only difference is in native country:
all.equal(sapply(testing,levels) ,sapply(training,levels))
[1] "Component “nativecountry”: Lengths (40, 41) differ (string compare on first 40)"
[2] "Component “nativecountry”: 26 string mismatches"
I don't think there is much you can do about that one. You may have to remove it before or after joining:
setdiff(levels(training$nativecountry),levels(testing$nativecountry))
[1] "Holand-Netherlands"
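Putting the pieces together, here is a minimal sketch of the clean-up and the join on two tiny stand-in data frames (the values are made up, not read from the UCI files). It assumes the only mismatch left in incomelevel is the trailing dot. As far as I know, rbind on data frames extends factor levels as needed, so a level that exists in only one of the two frames, like Holand-Netherlands, should not stop the join.

```r
# Tiny stand-ins for training/testing; the values are made up for illustration
training <- data.frame(
  age = c(39L, 50L, 38L),
  nativecountry = factor(c("United-States", "Cuba", "Holand-Netherlands")),
  incomelevel = factor(c("<=50K", "<=50K", ">50K"))
)
testing <- data.frame(
  age = c(25L, 38L),
  nativecountry = factor(c("United-States", "Cuba")),
  incomelevel = factor(c("<=50K.", ">50K."))
)

# Drop the trailing dot and reuse the training levels so both frames agree
testing$incomelevel <- factor(
  sub("\\.$", "", as.character(testing$incomelevel)),
  levels = levels(training$incomelevel)
)

# Factor levels are merged column by column; age stays an integer
combined <- rbind(training, testing)
str(combined)
```

Passing levels = levels(training$incomelevel) is a deliberate choice: if any test value fails to match after the dot is removed, it shows up as NA instead of silently becoming a new level.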
I've used aregImpute to impute the missing values, then I used the impute.transcan function to try to get a complete dataset, using the following code.
impute_arg <- aregImpute(~ age + job + marital + education + default +
balance + housing + loan + contact + day + month + duration + campaign +
pdays + previous + poutcome + y , data = mov.miss, n.impute = 10 , nk =0)
imputed <- impute.transcan(impute_arg, imputation=1, data=mov.miss, list.out=TRUE, pr=FALSE, check=FALSE)
y <- completed[names(imputed)]
When I run str(y) I do get a data frame, but it still contains NAs, as if nothing had been imputed. My question is: how do I get a complete dataset without NAs after imputation?
str(y)
'data.frame': 4521 obs. of 17 variables:
$ age : int 30 NA 35 30 NA 35 36 39 41 43 ...
$ job : Factor w/ 12 levels "admin.","blue-collar",..: 11 8 5 5 2 5 7 10 3 8 ...
$ marital : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 3 2 2 2 2 ...
$ education: Factor w/ 4 levels "primary","secondary",..: 1 2 3 3 2 3 NA 2 3 1 ...
$ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 NA 1 1 1 ...
$ balance : int NA 4789 1350 1476 0 747 307 147 NA -88 ...
$ housing : Factor w/ 2 levels "no","yes": NA 2 2 2 NA 1 2 2 2 2 ...
$ loan : Factor w/ 2 levels "no","yes": 1 2 1 2 NA 1 1 NA 1 2 ...
$ contact : Factor w/ 3 levels "cellular","telephone",..: 1 1 1 3 3 1 1 1 NA 1 ...
$ day : int 19 NA 16 3 5 23 14 6 14 NA ...
$ month : Factor w/ 12 levels "apr","aug","dec",..: 11 9 1 7 9 4 NA 9 9 1 ...
$ duration : int 79 220 185 199 226 141 341 151 57 313 ...
$ campaign : int 1 1 1 4 1 2 1 2 2 NA ...
$ pdays : int -1 339 330 NA -1 176 330 -1 -1 NA ...
$ previous : int 0 4 NA 0 NA 3 2 0 0 2 ...
$ poutcome : Factor w/ 4 levels "failure","other",..: 4 1 1 4 4 1 2 4 4 1 ...
$ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
I have tested your code myself, and it works just fine, except for the last line:
y <- completed[names(imputed)]
I believe there's a typo in the above line. Also, you do not even need the completed function.
Besides, if you want to get a data.frame from the impute.transcan function, then wrap it with as.data.frame:
imputed <- as.data.frame(impute.transcan(impute_arg, imputation=1, data=mov.miss, list.out=TRUE, pr=FALSE, check=FALSE))
Moreover, if you need to test your missing data pattern, you can also use the md.pattern function provided by the mice package.
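For what it's worth, the step that usually follows impute.transcan(..., list.out=TRUE) is to copy the original data and overwrite its columns with the imputed ones. Below is a minimal sketch of just that step in base R. Both mov.miss and imputed are small hand-made stand-ins here (imputed is not real aregImpute output), so that the snippet runs without Hmisc.

```r
# Stand-in for the data with missing values (made-up numbers)
mov.miss <- data.frame(
  age     = c(30, NA, 35),
  balance = c(NA, 4789, 1350)
)

# Stand-in for the named list returned with list.out = TRUE:
# one complete vector per imputed variable (values invented for illustration)
imputed <- list(
  age     = c(30, 33, 35),
  balance = c(1200, 4789, 1350)
)

# Start from the original data, then replace the imputed columns by name
completed <- mov.miss
completed[names(imputed)] <- imputed

anyNA(completed)
```

Note the direction of the assignment: the question's line reads from completed, whereas this pattern writes into it.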
I have this data: http://www.unige.ch/ses/spo/static/simonhug/madi/Mitchell_et_al_1984.csv
> str(dataset)
'data.frame': 135 obs. of 13 variables:
$ CCode : int 2 20 40 41 42 51 52 70 90 91 ...
$ StateAbb : Factor w/ 130 levels "AFG","ALB","ALG",..: 124 19 28 52 33 62 117 75 49 53 ...
$ StateNme : Factor w/ 130 levels "Afghanistan",..: 122 20 27 51 33 62 116 76 47 52 ...
$ prison_score : Factor w/ 5 levels "never","often",..: 1 1 2 4 5 1 NA 4 5 4 ...
$ torture_score : Factor w/ 5 levels "never","often",..: 1 3 1 4 2 1 NA 2 5 2 ...
$ ht_colonial : Factor w/ 10 levels "0. Never colonized by a Western overseas colonial power",..: 1 1 4 8 4 7 7 4 4 4 ...
$ british : int NA NA 0 0 0 1 1 0 0 0 ...
$ british_colony : Factor w/ 2 levels "no","yes": NA NA 1 1 1 2 2 1 1 1 ...
$ continent : Factor w/ 5 levels "Africa","Americas",..: 2 2 2 2 2 2 2 2 2 2 ...
$ region_wb : Factor w/ 19 levels "Australia and New Zealand",..: 10 10 2 2 2 2 2 3 3 3 ...
$ gdppc_l1 : num 25839 23550 10095 1846 4758 ...
$ colonialExperience: chr NA NA "Other Colonial Background" "Other Colonial Background" ...
I have to produce a similar result.
With this code:
# Copy the torture_score in a new col
dataset$torture_score_new = dataset$torture_score
# Add a level to the factor torture_score_new so we can t
levels(dataset$torture_score_new) = c(levels(dataset$torture_score_new), "rarely or never")
### Recode variables
# Torture score
dataset$torture_score_new[dataset$torture_score == "rarely"] = "rarely or never"
dataset$torture_score_new[dataset$torture_score == "never"] = "rarely or never"
dataset$torture_score_new = droplevels(dataset$torture_score_new)
dataset$torture_score_new = ordered(dataset$torture_score_new, levels =c("rarely or never", "somtimes", "often", "very often"))
### Text
dataset$colonialExperience = ifelse(dataset$british_colony == "yes",
"Former British Colony",
"Other Colonial Background")
useOfTortureByColonialExperience = table(dataset$torture_score_new, dataset$colonialExperience)
addmargins(round(prop.table(useOfTortureByColonialExperience)*100,2),1)
I get this result:
Former British Colony Other Colonial Background
rarely or never 9.76 20.73
somtimes 10.98 15.85
often 6.10 18.29
very often 10.98 7.32
Sum 37.82 62.19
But I don't understand how to get the conditional statistics and the chi-square test.
(I'm a programmer, but a total newbie to R)
OK, this is what I ended up doing.
useOfTortureByColonialExperience = table(dataset$torture_score_new, dataset$colonialExperience)
# Get the number of observation
addmargins(useOfTortureByColonialExperience,1);
# Contingency table with conditional probability
useOfTortureByColonialExperienceProp = prop.table(useOfTortureByColonialExperience,2)
print(addmargins(useOfTortureByColonialExperienceProp*100,1),3)
## Chisq
chisq.test(useOfTortureByColonialExperience)
# cramersV() is not part of base R; it comes from an add-on package such as lsr
cramersV(useOfTortureByColonialExperience)
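The same steps can be tried in base R alone. This is a minimal sketch on hypothetical counts (they are not the real Mitchell et al. numbers), and Cramér's V is computed by hand from the usual formula instead of calling cramersV, so no extra package is assumed.

```r
# Hypothetical counts, NOT the real data
counts <- as.table(matrix(
  c(8, 9, 5, 9,      # Former British Colony
    17, 13, 15, 6),  # Other Colonial Background
  nrow = 4,
  dimnames = list(
    torture = c("rarely or never", "sometimes", "often", "very often"),
    colony  = c("Former British Colony", "Other Colonial Background")
  )
))

# Conditional distribution of the torture score within each colonial background:
# margin = 2 makes every column sum to 1
col_prop <- prop.table(counts, margin = 2)
round(100 * col_prop, 1)

# Chi-square test of independence, run on the raw counts (not on percentages)
res <- chisq.test(counts)
res

# Cramer's V by hand: sqrt(X^2 / (n * (min(rows, cols) - 1)))
v <- sqrt(unname(res$statistic) / (sum(counts) * (min(dim(counts)) - 1)))
v
```

The point of margin = 2 is that each column is read as "given this colonial background, how are the torture scores distributed", which is what makes the percentages conditional.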
I have a list of data frames:
str(df.list)
List of 34
$ :'data.frame': 506 obs. of 7 variables:
..$ Protocol : Factor w/ 5 levels "P1","P2","P3",..: 1 1 1 1 1 1 1 1 1 1 ...
..$ Time : num [1:506] 0 2 3 0.5 6 1 24 24 24 24 ...
..$ SampleID : Factor w/ 40 levels "P1T0","P1T0.5",..: 1 5 7 2 8 3 6 6 6 6 ...
..$ VolunteerID: Factor w/ 15 levels "ID-02","ID-03",..: 10 10 10 10 10 10 10 11 13 14 ...
..$ Assay : Factor w/ 1 level "ALAT": 1 1 1 1 1 1 1 1 1 1 ...
..$ ResultAssay: int [1:506] 23 23 23 24 25 24 20 34 28 17 ...
..$ Index : Factor w/ 502 levels "P1T0.5VID-02",..: 8 31 37 2 43 19 25 26 28 29 ...
$ :'data.frame': 505 obs. of 7 variables:
..$ Protocol : Factor w/ 5 levels "P1","P2","P3",..: 1 1 1 1 1 1 1 1 1 1 ...
..$ Time : num [1:505] 0 2 3 0.5 6 1 24 24 24 24 ...
..$ SampleID : Factor w/ 40 levels "P1T0","P1T0.5",..: 1 5 7 2 8 3 6 6 6 6 ...
..$ VolunteerID: Factor w/ 15 levels "ID-02","ID-03",..: 10 10 10 10 10 10 10 11 13 14 ...
..$ Assay : Factor w/ 1 level "ALB": 1 1 1 1 1 1 1 1 1 1 ...
..$ ResultAssay: int [1:505] 45 46 47 47 49 47 46 46 44 43 ...
..$ Index : Factor w/ 501 levels "P1T0.5VID-02",..: 8 31 37 2 43 19 25 26 28 29 ..
The list contains 34 data frames with equal variable names. The variables Time and ResultAssay are of the wrong type: I would like to have Time as factor and ResultAssay as numerical.
I am trying to write a function to use together with lapply to convert the variable types in this list of 34 data frames in one go, but so far I have been unsuccessful.
I have tried things along the lines of:
ChangeType <- function(DF){
DF[,2] <- as.factor(DF[,2])
DF[, "ResultAssay"] <- as.numeric(DF[, c("ResultAssay")])
}
lapply(df.list, ChangeType)
What you have tried is nearly correct, but the function also needs to return the modified data.frame, and you need to assign the result back to your existing variable, like so:
ChangeType <- function(DF){
DF[,2] <- as.factor(DF[,2])
DF[, "ResultAssay"] <- as.numeric(DF[, c("ResultAssay")])
DF #return the data.frame
}
# store the returned value to df.list,
# thus updating your existing data.frame
df.list <- lapply(df.list, ChangeType)
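To see why the explicit return matters, here is a small self-contained sketch on a made-up two-element list (only the Time and ResultAssay columns are reproduced, with invented values). NoReturn is a hypothetical name for the version without the final DF line.

```r
# Made-up miniature of df.list with only the two columns of interest
df.list <- list(
  data.frame(Time = c(0, 2, 0.5), ResultAssay = c(23L, 23L, 24L)),
  data.frame(Time = c(0, 24), ResultAssay = c(45L, 46L))
)

# Without a final DF line, the function's value is the value of the last
# assignment, i.e. the numeric vector, not the data.frame
NoReturn <- function(DF) {
  DF[, "ResultAssay"] <- as.numeric(DF[, "ResultAssay"])
}
class(NoReturn(df.list[[1]]))

# With the data.frame as the last expression, lapply gets the whole frame back
ChangeType <- function(DF) {
  DF[, "Time"] <- as.factor(DF[, "Time"])
  DF[, "ResultAssay"] <- as.numeric(DF[, "ResultAssay"])
  DF
}
df.list <- lapply(df.list, ChangeType)

# Column classes per data frame after the conversion
sapply(df.list, function(DF) sapply(DF, class))
```

Selecting the columns by name instead of by position is my choice here; it keeps working if the column order ever differs between the data frames.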
I've built a model using caret. When the training completed I got the following warning:
Warning message:
In train.default(x, y, weights = w, ...) :
At least one of the class levels are not valid R variables names; This may cause errors if class probabilities are generated because the variables names will be converted to: X0, X1
The names of the variables are:
str(train)
'data.frame': 7395 obs. of 30 variables:
$ alchemy_category : Factor w/ 13 levels "arts_entertainment",..: 2 8 6 6 11 6 1 6 3 8 ...
$ alchemy_category_score : num 3737 2052 4801 3816 3179 ...
$ avglinksize : num 2.06 3.68 2.38 1.54 2.68 ...
$ commonlinkratio_1 : num 0.676 0.508 0.562 0.4 0.5 ...
$ commonlinkratio_2 : num 0.206 0.289 0.322 0.1 0.222 ...
$ commonlinkratio_3 : num 0.0471 0.2139 0.1202 0.0167 0.1235 ...
$ commonlinkratio_4 : num 0.0235 0.1444 0.0426 0 0.0432 ...
$ compression_ratio : num 0.444 0.469 0.525 0.481 0.446 ...
$ embed_ratio : num 0 0 0 0 0 0 0 0 0 0 ...
$ frameTagRatio : num 0.0908 0.0987 0.0724 0.0959 0.0249 ...
$ hasDomainLink : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ html_ratio : num 0.246 0.203 0.226 0.266 0.229 ...
$ image_ratio : num 0.00388 0.08865 0.12054 0.03534 0.05047 ...
$ is_news : Factor w/ 2 levels "0","1": 2 2 2 2 2 1 2 1 2 1 ...
$ lengthyLinkDomain : Factor w/ 2 levels "0","1": 2 2 2 1 2 1 1 1 1 2 ...
$ linkwordscore : num 24 40 55 24 14 12 21 5 17 14 ...
$ news_front_page : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ non_markup_alphanum_characters: num 5424 4973 2240 2737 12032 ...
$ numberOfLinks : num 170 187 258 120 162 55 93 132 194 326 ...
$ numwords_in_url : num 8 9 11 5 10 3 3 4 7 4 ...
$ parametrizedLinkRatio : num 0.1529 0.1818 0.1667 0.0417 0.0988 ...
$ spelling_errors_ratio : num 0.0791 0.1254 0.0576 0.1009 0.0826 ...
$ label : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 2 1 2 2 ...
$ isVideo : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 1 1 ...
$ isFashion : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 2 1 2 1 ...
$ isFood : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ hasComments : Factor w/ 2 levels "0","1": 1 2 2 2 2 1 2 2 1 2 ...
$ hasGoogleAnalytics : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 2 2 2 1 ...
$ hasInlineCSS : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 2 1 2 2 ...
$ noOfMetaTags : num 10 12 6 10 13 2 6 6 9 5 ...
My code is the following:
ctrl <- trainControl(method = "CV",
number=10,
classProbs = TRUE,
allowParallel = TRUE,
summaryFunction = twoClassSummary)
set.seed(476)
rfFit <- train(formula,
data=train,
method = "rf",
tuneGrid = expand.grid(.mtry = seq(4,20,by=2)),
ntree=1000,
importance = TRUE,
metric = "ROC",
trControl = ctrl)
pred <- predict.train(rfFit, newdata = test, type = "prob")
I get the error: Error in [.data.frame(out, , obsLevels, drop = FALSE) :
undefined columns selected
The variable names on the test data set are:
str(test)
'data.frame': 3171 obs. of 29 variables:
$ alchemy_category : Factor w/ 13 levels "arts_entertainment",..: 8 4 12 4 10 12 12 8 1 2 ...
$ alchemy_category_score : num 5307 4825 1 6708 5416 ...
$ avglinksize : num 2.56 3.77 2.27 2.52 1.85 ...
$ commonlinkratio_1 : num 0.39 0.462 0.496 0.706 0.471 ...
$ commonlinkratio_2 : num 0.257 0.205 0.385 0.346 0.161 ...
$ commonlinkratio_3 : num 0.0441 0.0513 0.1709 0.123 0.0323 ...
$ commonlinkratio_4 : num 0.0221 0 0.1709 0.0906 0 ...
$ compression_ratio : num 0.49 0.782 1.25 0.449 0.454 ...
$ embed_ratio : num 0 0 0 0 0 0 0 0 0 0 ...
$ frameTagRatio : num 0.0671 0.0429 0.0588 0.0581 0.093 ...
$ hasDomainLink : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ html_ratio : num 0.23 0.366 0.162 0.147 0.244 ...
$ image_ratio : num 0.19944 0.08 10 0.00596 0.03571 ...
$ is_news : Factor w/ 2 levels "0","1": 2 1 1 2 2 1 1 2 1 1 ...
$ lengthyLinkDomain : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 2 1 1 1 ...
$ linkwordscore : num 15 62 42 41 34 35 15 22 41 7 ...
$ news_front_page : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ non_markup_alphanum_characters: num 5643 382 2420 5559 2209 ...
$ numberOfLinks : num 136 39 117 309 155 266 55 145 110 1 ...
$ numwords_in_url : num 3 2 1 10 10 7 1 9 5 0 ...
$ parametrizedLinkRatio : num 0.2426 0.1282 0.5812 0.0388 0.0968 ...
$ spelling_errors_ratio : num 0.0806 0.1765 0.125 0.0631 0.0653 ...
$ isVideo : Factor w/ 2 levels "0","1": 1 2 1 2 2 2 1 1 2 2 ...
$ isFashion : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 1 1 1 ...
$ isFood : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ hasComments : Factor w/ 2 levels "0","1": 2 1 1 2 2 2 1 2 2 1 ...
$ hasGoogleAnalytics : Factor w/ 2 levels "0","1": 1 2 2 2 2 1 1 2 1 1 ...
$ hasInlineCSS : Factor w/ 2 levels "0","1": 2 2 2 1 1 2 2 2 1 1 ...
$ noOfMetaTags : num 3 6 5 9 16 22 6 9 7 0 ...
If I omit the type="prob" part, I get no error.
Any ideas?
Could it be the length of the variable "alchemy_category" which is appended with the respective factor levels e.g. "alchemy_categoryarts_entertainment" inside the model??
The answer is in bold at the top of your post =]
What are you modeling? Is it alchemy_category? The code only says formula and we can't see it.
When you ask for class probabilities, model predictions are a data frame with a separate column for each class/level. If alchemy_category doesn't have levels that are valid column names, data.frame converts them to valid names. That creates a problem because the code is looking for a specific name but the data frame has a different (but valid) name.
For example, if I had
> test <- factor(c("level1", "level 2"))
> levels(test)
[1] "level 2" "level1"
> make.names(levels(test))
[1] "level.2" "level1"
the code would be looking for "level 2" but there is only "level.2".
As stated above, the class values must be factors and their levels must be valid names. Another way to ensure this is:
levels(all.dat$target) <- make.names(levels(factor(all.dat$target)))
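To make the warning concrete, here is a minimal sketch, assuming the outcome is a factor with levels "0" and "1" like label in the question. The replacement names no and yes are an arbitrary choice of mine, not something caret requires.

```r
# Made-up outcome with the same levels as label in the question
label <- factor(c("0", "1", "1", "0"))

# "0" and "1" are not syntactically valid names, so they get an X prefix;
# these are the X0 and X1 mentioned in the warning
make.names(levels(label))

# Renaming the levels in place keeps every observation in its class
levels(label) <- c("no", "yes")
table(label)
```

Whatever names are chosen, the renaming has to happen before train() is called, so that the class probability columns match the level names.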
I have read through the answers above while facing a similar problem. A formal solution is to do this on the train and test datasets. Make sure you include the response variable in the feature.names too.
feature.names=names(train)
for (f in feature.names) {
if (is.factor(train[[f]])) {
# rename the levels in place so that each value keeps its own, now valid, label
levels(train[[f]]) <- make.names(levels(train[[f]]))
}
}
This creates syntactically correct labels for all factors.
As @Sam Firke already pointed out in the comments (but I overlooked it), the levels TRUE/FALSE don't work either. So I converted them to yes/no.
As per the above example, refactoring the outcome variable will usually fix the problem. It's better to make the change in the original dataset before partitioning it into training and test datasets:
# sort the unique values so that the labels line up with the factor's sorted levels
levels <- sort(unique(data$outcome))
data$outcome <- factor(data$outcome, labels=make.names(levels))
As others pointed out earlier, this problem only occurs when classProbs=TRUE, which causes the train function to generate additional statistics related to the outcome class.
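One thing to watch when relabelling with factor(x, labels=...): the labels are matched to the sorted levels, not to the order in which the values first appear. A minimal sketch with a made-up numeric outcome (the names swapped and ok are only for illustration):

```r
# Hypothetical outcome; the first value is deliberately not the smallest
outcome <- c(1, 0, 1, 1)

# unique() gives order of appearance (1, 0), but the levels are sorted (0, 1),
# so the labels end up attached to the wrong values
swapped <- factor(outcome, labels = make.names(unique(outcome)))
as.character(swapped)

# Sorting first keeps each label attached to its own value
ok <- factor(outcome, labels = make.names(sort(unique(outcome))))
as.character(ok)
```

Comparing table(outcome) with the table of the relabelled factor is a quick way to check that no classes were swapped.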