I'm working on a binary classification problem: recommending films.
I've got a training data set of 50 rows (movies) and 6 columns (5 movie attributes and a consensus on the film).
I then have a test data set of 20 films with the same columns.
I then run
pred <- predict(svm_model, test)
and receive
Error in predict.svm(svm_model, test) : test data does not match model !.
From similar posts, it seems the error occurs because the factor levels don't match between the training and test datasets. This is true, and I've verified it by comparing str(test) and str(train). However, both datasets come from randomly selected films and will always have different levels for their categorical attributes. Doing
levels(test$Attr1) <- levels(train$Attr1)
changes the actual column data in test (the values get relabeled, not just the level set), thus rendering the predictions incorrect. Does anyone know how to solve this issue?
The first half-dozen rows of my training set are at the following link:
https://justpaste.it/1ifsx
You could do something like this, assuming Attr1 is a character:
1) Build a vector of levels containing the unique values of Attr1 from both train and test.
2) Convert Attr1 in both train and test to a factor with all the levels found in step 1, as shown below.
levels <- unique(c(train$Attr1, test$Attr1))
test$Attr1  <- factor(test$Attr1, levels = levels)
train$Attr1 <- factor(train$Attr1, levels = levels)
If you do not want factors, wrap the code in as.integer and you will get integer codes instead of factors. That is sometimes handier in models like xgboost and saves you one-hot encoding.
as.integer(factor(test$Attr1, levels=levels))
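For completeness, here is a minimal sketch of how this could slot back into the original workflow, assuming e1071's svm, attribute columns named Attr1 through Attr5, and a target column called consensus (the latter two names are guesses from the question):
library(e1071)
# Harmonize every categorical column across train and test, then refit
# so the model is built against the full level set.
for (col in c("Attr1", "Attr2", "Attr3", "Attr4", "Attr5")) {  # assumed names
  lv <- unique(c(as.character(train[[col]]), as.character(test[[col]])))
  train[[col]] <- factor(train[[col]], levels = lv)
  test[[col]]  <- factor(test[[col]],  levels = lv)
}
svm_model <- svm(consensus ~ ., data = train)  # refit on the re-leveled data
pred <- predict(svm_model, test)               # no more level-mismatch error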
I am attempting to conduct a t-test in R to determine whether there is a statistically significant difference in salary between US-born and foreign-born workers in the Western US. I have two different data frames for the two groups based on nativity, and want to compare the salary column, titled "adj_SALARY". For simplicity, say there are 3 observations in the US_born_West data frame and 5 in the Immigrant_West data frame.
US_born_West$adj_SALARY <- c(30000, 25000, 22000)
Immigrant_West$adj_SALARY <- c(14000, 20000, 12000, 16000, 15000)
# Here is what I attempted to run:
t.test(US_born_West$adj_SALARY~Immigrant_West$adj_SALARY, alternative="greater",conf.level = .95)
However, I received this error message: "Error in model.frame.default(formula = US_born_West$adj_SALARY ~ Immigrant_West$adj_SALARY) : variable lengths differ (found for 'Immigrant_West$adj_SALARY')"
Any ideas on how I can fix this? Thank you!
US_born_West$adj_SALARY and Immigrant_West$adj_SALARY are of unequal length, so the formula interface of t.test errors with exactly that message. We can pass them as two individual vectors instead.
t.test(US_born_West$adj_SALARY, Immigrant_West$adj_SALARY,
       alternative = "greater", conf.level = 0.95)
I am working on a classification task with a categorical dependent variable that has 99 levels (each corresponding to a country).
I am using a decision tree, and it looks like I cannot have more than 32 levels, so I need to reduce their number. I was thinking of clustering countries by similarity, so that ones that are similar across the 200 variables I have (v1, v2, v3, ...) would be grouped together.
I was thinking of using UMAP to reduce the dimensionality of the dataset and then group the countries (for instance Norway+Sweden, Laos+Cambodia, or whatever), but I am having a hard time doing so. Here is what I have so far (working on a subsample); I tried to plot it, but it doesn't make much sense to me:
library(dplyr)  # for sample_n
library(umap)

data <- sample_n(surveydata, 15000)                # work on a subsample
cluster.data <- data[, grep("v", colnames(data))]  # keep the v1, v2, ... columns
data.umap <- umap(cluster.data)
plot(data.umap$layout, col = data$Nationality)
(Nationality is the categorical variable with 99 levels that I have to predict.)
Do you know any method I can use to reduce the levels to fewer than 32?
Thank you in advance for your help!
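One way to finish the grouping step the question describes, sketched under the assumption that the code above has already run (aggregate and kmeans are base R; 31 clusters keeps the new factor under the 32-level limit):
# Average the UMAP coordinates per country to get one point per country
layout_df <- as.data.frame(data.umap$layout)
centroids <- aggregate(layout_df, by = list(Nationality = data$Nationality), FUN = mean)
set.seed(42)
km <- kmeans(centroids[, -1], centers = 31)
# Map each row's country to its cluster: a 31-level factor a tree can handle
data$Nationality_grp <- factor(km$cluster[match(data$Nationality, centroids$Nationality)])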
I have a dataframe df, and I am building a machine learning model (a C5.0 decision tree) to predict the class of the loan_approved column:
Structure (not real data):
id occupation income loan_approved
1 business 4214214 yes
2 business 32134 yes
3 business 43255 no
4 sailor 5642 yes
5 teacher 53335 no
6 teacher 6342 no
Process:
I randomly split the data frame into train and test and trained on the train set (rows 1, 2, 3, 5, 6 as train, row 4 as test).
To account for new categorical levels in one or more columns, I wrapped predict in tryCatch.
Function:
error_free_predict <- function(x) {
  output <- tryCatch({
    predict(C50_model, newdata = test[x, ], type = "class")
  }, error = function(e) {
    "no"
  })
  return(output)
}
Applied the predict function:
test <- mutate(test, predicted_class = error_free_predict(1:NROW(test)))
Problem:
id occupation income loan_approved predicted_class
1 business 4214214 yes no
2 business 32134 yes no
3 business 43255 no no
4 sailor 5642 yes no
5 teacher 53335 no no
6 teacher 6342 no no
Question:
I know this is because the test data frame had a new level that was not present in the train data, but shouldn't my function work in all cases except that one?
P.S.: I did not use sapply because it was too slow.
There are two parts to this problem.
The first part arises while training the model: with random splitting, categorical levels are not guaranteed to be spread between train and test. In your case, say you have only one record with occupation "sailor"; it may well end up in the test set after a random split. A model built on the training data has then never seen the occupation "sailor", so it throws an error. More generally, any level of any categorical variable can land entirely in the test set after random splitting.
So instead of splitting the data randomly between train and test, you can use stratified sampling. Code using data.table for a 70:30 split:
ind   <- total_data[, sample(.I, round(0.3 * .N), FALSE), by = "occupation"]$V1
train <- total_data[-ind, ]
test  <- total_data[ind, ]
This makes sure every level is represented in both the train and test datasets (in roughly 70:30 proportion), so you will not get a "new" categorical level in the test set, which can happen with random splitting.
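A quick sanity check of the stratification (same data.table names as above; note that a level with very few rows can still land only in train, since round(0.3 * .N) may be 0):
# Every occupation should appear in both splits in roughly 70:30 proportion
train[, .N, by = occupation]
test[, .N, by = occupation]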
The second part of the problem arises when the model is in production and encounters an altogether new level that was present in neither the training nor the test set. To tackle this, you can maintain a list of all levels of all categorical variables using
lvl_cat_var1 <- unique(cat_var1), lvl_cat_var2 <- unique(cat_var2), etc. Then, before predicting, check for new levels and filter:
new_lvl_data <- total_data[!(var1 %in% lvl_cat_var1 & var2 %in% lvl_cat_var2)]
pred_data <- total_data[(var1 %in% lvl_cat_var1 & var2 %in% lvl_cat_var2)]
Then assign the default prediction to the rows containing unseen levels:
new_lvl_data$predicted_class <- "no"
and run the full model prediction on pred_data.
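A minimal sketch of completing that last step (C50_model is the question's model; rbind simply recombines the two subsets):
# Full prediction for rows whose levels were all seen during training
pred_data$predicted_class <- predict(C50_model, newdata = pred_data, type = "class")
scored <- rbind(pred_data, new_lvl_data)  # recombine if one table is needed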
I generally handle this with a loop that recodes any level not seen in train as NA. Here train is the data used for training the model, and test is the data that will be used for prediction.
for (i in 1:ncol(train)) {
  if (is.factor(train[, i])) {
    # Re-level test to train's level set; levels unseen in train become NA
    test[, i] <- factor(test[, i], levels = levels(train[, i]))
  }
}
tryCatch is an error-handling mechanism, i.e. it reacts only after an error has been encountered, so it is the wrong tool unless you want to do something different once the error occurs. Note also that your function is called once on the whole test set, so a single problematic row makes tryCatch return "no" for every row. If you still want to run the model on the rows it can handle, the loop above takes care of the new levels.
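After the loop, a single predict call should go through; a short sketch (C5.0 can handle missing values internally, but do check how the NA rows come back on your data):
# Unseen levels are now NA rather than unknown levels, so predict runs
test$predicted_class <- predict(C50_model, newdata = test, type = "class")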
I'm working with a data set that has a lot of NAs. I know that the first 6 columns do NOT have any NAs. Since the first column is an ID column, I'm omitting it.
I run the following code to select only rows that have a value in the response column:
sub1 <- TrainingData[which(!is.na(TrainingData[,70])),]
I then use sub1 as the data set in a randomForest using this code:
set.seed(448)
RF <- randomForest(sub1[, 2:6], sub1[, 70],
                   do.trace = TRUE, importance = TRUE, ntree = 10,
                   keep.forest = TRUE)
Then I run this code to check the output for NAs:
> length(which(is.na(RF$predicted)))
[1] 65
I can't figure out why I'd be getting NAs if the data going in is clean.
Any suggestions?
I think you should use more trees. The predicted values are predictions on the out-of-bag cases, and if the number of trees is very small, some cases are never in the out-of-bag set (the bootstrap samples are drawn randomly), so their predictions come back as NA.
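To see this, here is the question's call with more trees; the chance a row is never out-of-bag is roughly 0.632^ntree, about 1% at ntree = 10 but negligible at 500:
library(randomForest)
set.seed(448)
RF <- randomForest(sub1[, 2:6], sub1[, 70],
                   importance = TRUE, ntree = 500, keep.forest = TRUE)
sum(is.na(RF$predicted))  # expect 0 with this many trees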
Hi, I'm a beginner in the R programming language. I wrote code for a regression tree using the rpart package. In my data, some of the independent variables have more than 100 levels. After running the rpart function,
I get the following warning: "More than 52 levels in a predicting factor, truncated for printout", and my tree is displayed in a very strange way. Say, for example, my tree splits on location, which has around 70 distinct levels; when the label is displayed in the tree it shows "ZZZZZZZZZZZZZZZZ..........." even though I don't have any location called "ZZZZZZZZ".
Please help me.
Thanks in advance.
Many of the functions in R have limits on the number of levels a factor variable can have (e.g. randomForest limits factors to 32 levels).
One way that I've seen it dealt with, especially in data mining competitions, is to:
1) Determine the maximum number of levels allowed for the given function (call this X).
2) Use table() to count the occurrences of each level of the factor and rank them from most to least frequent.
3) Leave the top X - 1 levels as they are.
4) Collapse all remaining levels into a single catch-all level that marks them as low-occurrence levels.
Here's an example that's a bit long but hopefully helps:
# Generate 1000 random numbers between 0 and 100.
vars1 <- data.frame(values1=(round(runif(1000) * 100,0)))
# Changes values to factor variable.
vars1$values1 <- factor(vars1$values1)
# Show top 6 rows of data frame.
head(vars1)
# Show the number of unique factor levels
length(unique(vars1$values1))
# Create a table showing the frequency of each level's occurrence.
table1 <- data.frame(table(vars1))
# Orders the table in descending order of frequency.
table1 <- table1[order(-table1$Freq),]
head(table1)
# Assuming we want to use CART, we choose the top 51 levels to leave
# unchanged (51 kept plus 1 catch-all stays within the 52-level limit).
# Get the values of the top 51 occurring levels.
noChange <- table1$vars1[1:51]
# We use '-1000' as the catch-all level to avoid overlap with existing
# levels (i.e. in case '52' was actually one of the levels).
# ifelse() checks whether each value is among the top 51 levels. If so,
# as.character() keeps the label as is (without it, ifelse would return
# the factor's underlying integer codes); if not, it becomes '-1000'.
vars1$newFactor <- ifelse(vars1$values1 %in% noChange, as.character(vars1$values1), "-1000")
# Show the number of levels of the new factor column.
length(unique(vars1$newFactor))
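As a possible last step, convert the new column to a factor and confirm it fits under the limit before refitting the tree:
# The collapsed column has at most 52 levels (51 kept + the '-1000' catch-all)
vars1$newFactor <- factor(vars1$newFactor)
nlevels(vars1$newFactor)  # should be <= 52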
Finally, you may want to consider using truncated variable names in rpart, as the tree display gets very busy when there are a large number of variables or they have long names.