I am using a random forest (RF) for the first time. I am trying to predict the genre of a game from the factors below.
data <- read.csv("appstore_games.csv")
data <- data %>% drop_na()
data <- data %>% select(Average.User.Rating, User.Rating.Count, Price, Age.Rating, Genres)
data <- data %>% separate(Genres, c("Main Genre","Genre1","Genre2","Genre3"), extra = "drop" )
data1 <- data %>% select(Genre1 , Average.User.Rating, User.Rating.Count, Price )
str(data1)
data1$Genre1 <- as.factor(data1$Genre1)
set.seed(123)
sample <- sample(2 , nrow(data1),replace = TRUE, prob = c(0.7,0.3))
train_data <- data1[sample == 1,]
test_data <- data1[sample == 2,]
library(randomForest)
set.seed(1)
rf <- randomForest(train_data$Genre1 ~., data = train_data , proximity = TRUE, ntree = 200, importance = TRUE)
It throws an error at this point:
Error in randomForest.default(m, y, ...) : Can't have empty classes in y.
What is wrong here?
Thanks
The genres have names such as Strategy, Entertainment, etc.
I am not completely sure, but I think this can happen when not all levels of your Y are represented in the training data. You may want to check this.
My other idea is that one of the classes in Y is "None".
train_data <- droplevels(train_data)
Try using this before you pass data to the model
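A minimal sketch of why droplevels() helps, using a made-up genre vector rather than the App Store data (the values and the split below are assumptions for illustration). Subsetting a factor keeps every original level, including levels that no longer occur, and randomForest rejects a response that has such empty classes.

```r
# Hypothetical genres; not taken from appstore_games.csv
genre <- factor(c("Strategy", "Puzzle", "Strategy", "Entertainment"))

# Suppose the training split happens to miss the only "Puzzle" row
train_genre <- genre[c(1, 3, 4)]

# The unused level is still attached and has a count of zero
empty_before <- names(which(table(train_genre) == 0))

# droplevels() removes levels that have no observations
train_genre <- droplevels(train_genre)
empty_after <- names(which(table(train_genre) == 0))

print(empty_before)
print(levels(train_genre))
```

Separately, it is probably worth writing the formula as Genre1 ~ . rather than train_data$Genre1 ~ ., because with the latter the dot also expands to the Genre1 column, so the response ends up among the predictors.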
I am trying to run a two-way repeated measures ANOVA using the rstatix package. However, I keep getting the following error message, which I don't know how to interpret, although I suspect it has something to do with dat$id.
Error in `spread()`:
! Each row of output must be identified by a unique combination of keys.
Keys are shared for 72 rows:
The data I am using has two locations with three measurements per Location for each Date. Any idea how to avoid this error?
Example Data
library(dplyr)
set.seed(321)
dat <- data.frame(matrix(ncol = 3, nrow = 72))
colnames(dat)[1:3] <- c("Date","Location","Value")
dat$Value <- round(rnorm(72, 100,50),0)
dat$Location <- rep(c("Location 1","Location 2"), each = 36)
st <- as.Date("2020-01-01")
en <- as.Date("2020-12-31")
dat$Date <- rep(seq.Date(st,en,by = '1 month'),3)
dat <- dat %>% mutate(id = dense_rank(Date))
dat$Date <- as.factor(dat$Date)
View(dat)
Two-way repeated measures ANOVA that throws the error
library(rstatix)
resaov <- anova_test(
data = dat, dv = Value, wid = id,
within = c(Location, Date))
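A rough diagnostic sketch, not a fix: the spread() message suggests that each combination of id, Location and Date is expected to identify exactly one row. The base R code below rebuilds data with the same shape as the example (as.integer(factor(Date)) stands in for dense_rank(Date); the names key and rows_per_key are made up here) and counts the rows per combination.

```r
set.seed(321)
dates <- seq.Date(as.Date("2020-01-01"), as.Date("2020-12-31"), by = "1 month")

# Same layout as the example data: 2 locations x 12 dates x 3 measurements
dat <- data.frame(
  Date     = rep(dates, 6),
  Location = rep(c("Location 1", "Location 2"), each = 36),
  Value    = round(rnorm(72, 100, 50), 0)
)
dat$id <- as.integer(factor(dat$Date))  # same ranking as dense_rank(Date)

# Count how many rows share each id / Location / Date combination
key <- paste(dat$id, dat$Location, dat$Date, sep = " | ")
rows_per_key <- table(key)

print(length(rows_per_key))  # number of distinct key combinations
print(range(rows_per_key))   # rows per combination
```

Every combination occurs more than once, which matches "Keys are shared for 72 rows". How the repeated measurements should be identified or summarised depends on the study design, so that part is left open here.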
I want to build a regression model.
I have data about job seekers, academics, non-academics, women and men.
The data is separated into 7 areas.
I tried the code below, but I'm not sure that it is the right model for my data.
I would be happy to get help with this.
cleanData is the CSV file; it contains data from 01/2010 to 04/2020.
I deleted the MONTH, CITY and LOCALITY.NAME columns (month, city and the name of the city).
library(fastDummies)   # dummy_cols()
library(MASS)          # stepAIC()
library(equatiomatic)  # extract_eq()
regressionSubset <- cleanData
regressionSubset$MONTH <- NULL
regressionSubset$CITY <- NULL
regressionSubset$LOCALITY.NAME <- NULL
regressionData <- dummy_cols(regressionSubset, remove_first_dummy = TRUE, remove_selected_columns = TRUE)
regressionData <- data.frame(scale(regressionData))
set.seed(42)
model <- lm(ESTIMATED.CITY.UNEMPLOYMENT ~ ., data = regressionData)
step_model <- stepAIC(model, trace = FALSE)
summary(step_model)
extract_eq(step_model, use_coefs = TRUE, wrap=TRUE, terms_per_line = 1)
I am getting the following error when using the modelr add_predictions function.
modelr add_predictions error: in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels): fe.lead.surgeon has new levels ....
In my understanding, it is a common issue that arises when you are making the prediction model using a train dataset and applying the model to a test dataset since the factor levels that existed in a train dataset may not be present in a test dataset. However, I am using the same sample for creating the model and getting the predicted values, and still getting this error.
Here is the code I am using. I would appreciate any insight into why this error occurs and how to solve it.
# indep is a vector of independent variable names
# dep is a vector of dependent variable names
# id.case is the id variable
# sample is my dataset.
eq <-
paste(indep, collapse = ' + ') %>%
paste(dep, ., sep = ' ~ ') %>%
as.formula
s <-
lm(eq, data = sample %>% select(-id.case))
pred <-
sample %>%
modelr::add_predictions(s) %>%
select(id.case, pred)
As requested by @SimoneBianchi, I am providing a reproducible example here.
Reproducible example
library(tidyverse)
library(tibble)
library(data.table)
rename <- dplyr::rename
select <- dplyr::select
set.seed(10002)
id <- sample(1:1000, 1000, replace=F)
set.seed(10003)
fe1 <- sample(c('A','B','C'), 1000, replace=T)
set.seed(10001)
fe2 <- sample(c('a','b','c'), 1000, replace=T)
set.seed(10001)
cont1 <- sample(1:300, 1000, replace=T)
set.seed(10004)
value <- sample(1:30, 1000, replace=T)
sample <-
data.frame(id, fe1, fe2, cont1, value)
dep <- 'value'
indep <-
c('fe1','fe2', 'cont1')
eq <-
paste(indep, collapse = ' + ') %>%
paste(dep, ., sep = ' ~ ') %>%
as.formula
s <-
lm(eq, data = sample %>% select(-id))
pred <-
sample %>%
modelr::add_predictions(s) %>%
select(id, pred)
Update and Workaround
One workaround I found is to use the fitted() function instead of the modelr function. However, I would still like to learn why the regression automatically drops some factor levels from a factor variable. If anyone knows, please leave a comment.
pred <-
sample %>%
cbind(pred = fitted(s))
Closing: Problem found with the dataset
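I found that some observations contained NA values, and those rows carried levels of the corresponding factor variable that appear in no other row. lm() drops rows with NA when fitting, so the model never saw those levels, and add_predictions() then reported them as new levels: hence the error. After I fixed the NAs, the original code worked fine. So it was a problem with the dataset rather than the code!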
Thank you all for trying to help me out.
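The mechanism described in the closing note can be reproduced with a tiny made-up data frame (all values below are assumptions for illustration, and base predict() stands in for modelr::add_predictions()). The level "C" occurs only in a row whose response is NA, so lm() never sees it.

```r
# Toy data: level "C" of fe appears only in the row where y is NA
df <- data.frame(
  y  = c(1, 2, 3, 4, 5, NA),
  fe = c("A", "A", "B", "B", "A", "C"),
  x  = c(1, 2, 3, 4, 6, 5)
)

# lm() drops the incomplete row by default (na.action = na.omit)
fit <- lm(y ~ fe + x, data = df)
print(fit$xlevels$fe)  # levels the model knows about

# Predicting on the full data meets the unseen level and fails
res <- tryCatch(
  predict(fit, newdata = df),
  error = function(e) conditionMessage(e)
)
print(res)

# Predicting only on complete rows works
complete_rows <- df[complete.cases(df), ]
pred <- predict(fit, newdata = complete_rows)
```

This also explains why fitted() works as a workaround: it returns values only for the rows that were actually used in the fit.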
I'm trying to build a decision tree, but this error comes up when I create the confusion matrix in the last line:
Error : `data` and `reference` should be factors with the same levels
Here's my code:
library(rpart)
library(caret)
library(dplyr)
library(rpart.plot)
library(xlsx)
library(caTools)
library(data.tree)
library(e1071)
#Loading the Excel File
library(readxl)
FINALDATA <- read_excel("Desktop/FINALDATA.xlsm")
View(FINALDATA)
df <- FINALDATA
View(df)
#Selecting the meaningful columns for prediction
#df <- select(df, City, df$`Customer type`, Gender, Quantity, Total, Date, Time, Payment, Rating)
df <- select(df, City, `Customer type`, Gender, Quantity, Total, Date, Time, Payment, Rating)
#making sure the data is in the right format
df <- mutate(df, City= as.character(City), `Customer type`= as.character(`Customer type`), Gender= as.character(Gender), Quantity= as.numeric(Quantity), Total= as.numeric(Total), Time= as.numeric(Time), Payment = as.character(Payment), Rating= as.numeric(Rating))
#Splitting into training and testing data
set.seed(123)
sample = sample.split('Customer type', SplitRatio = .70)
train = subset(df, sample==TRUE)
test = subset(df, sample == FALSE)
#Training the Decision Tree Classifier
tree <- rpart(df$`Customer type` ~., data = train)
#Predictions
tree.customertype.predicted <- predict(tree, test, type= 'class')
#confusion Matrix for evaluating the model
confusionMatrix(tree.customertype.predicted, test$`Customer type`)
So I've tried to do this as said in another topic:
confusionMatrix(table(tree.customertype.predicted, test$`Customer type`))
But I still have an error:
Error in !all.equal(nrow(data), ncol(data)) : argument type is invalid
I made a toy data set and examined your code. There were a few issues:
1. R has an easier time with variable names that follow a consistent style. Your 'Customer type' variable has a space in it, and code is generally easier to write when names contain no spaces, so I renamed it 'Customer_type'. For your data.frame you could rename the column in the source file, or use names(df) <- gsub("Customer type", "Customer_type", names(df)).
2. I coded 'Customer_type' as a factor. For you this will look like df$Customer_type <- factor(df$Customer_type). confusionMatrix() needs both the predictions and the reference to be factors with the same levels.
3. The documentation for sample.split() says the first argument, Y, should be a vector of labels, that is, one label per row of the data. Your code passed the string 'Customer type' instead, so the split was computed on that single value rather than on the column. Pass the column itself, as in sample.split(df$Customer_type, SplitRatio = .70).
4. Adjust the rpart() call as shown below: refer to the response by its column name so that it is taken from train, not from the full df.
With these adjustments, your code might be OK.
# toy data
set.seed(123)
df <- data.frame(City = factor(sample(c("Paris", "Tokyo", "Miami"), 100, replace = TRUE)),
                 Customer_type = factor(sample(c("High", "Med", "Low"), 100, replace = TRUE)),
                 Gender = factor(sample(c("Female", "Male"), 100, replace = TRUE)),
                 Quantity = sample(1:10, 100, replace = TRUE),
                 Total = sample(1:10, 100, replace = TRUE),
                 Date = sample(seq(as.Date('2020/01/01'), as.Date('2020/12/31'), by = "day"), 100),
                 Rating = factor(sample(1:5, 100, replace = TRUE)))
library(rpart)
library(caret)
library(dplyr)
library(caTools)
library(data.tree)
library(e1071)
#Splitting into training and testing data
set.seed(123)
# Pass the label column itself; the result has one TRUE/FALSE per row
is_train <- sample.split(df$Customer_type, SplitRatio = .70) # ADJUST TO MATCH YOUR COLUMN NAME
train <- subset(df, is_train)
test <- subset(df, !is_train)
#Training the Decision Tree Classifier
tree <- rpart(Customer_type ~ ., data = train) # ADJUST YOUR CODE SO IT'S LIKE THIS
#Predictions
tree.customertype.predicted <- predict(tree, test, type = 'class')
#confusion Matrix for evaluating the model
confusionMatrix(tree.customertype.predicted, test$Customer_type)
Try to keep the factor levels of train and test the same as in df:
train$`Customer type` <- factor(train$`Customer type`, unique(df$`Customer type`))
test$`Customer type` <- factor(test$`Customer type`, unique(df$`Customer type`))
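A small base R sketch of the idea, with hypothetical label values ("Member" and "Normal" are assumptions, not taken from FINALDATA): giving factor() an explicit levels argument keeps the level set fixed even when one class is missing from a vector.

```r
# Hypothetical full set of labels, e.g. from unique(df$`Customer type`)
all_levels <- c("Member", "Normal")

reference <- factor(c("Member", "Normal", "Member"), levels = all_levels)

# Predictions that happen to contain only one class
pred_raw     <- c("Member", "Member", "Member")
pred_default <- factor(pred_raw)                       # levels inferred from the data
pred_aligned <- factor(pred_raw, levels = all_levels)  # levels fixed explicitly

same_default <- identical(levels(pred_default), levels(reference))
same_aligned <- identical(levels(pred_aligned), levels(reference))

# With aligned levels the cross-table is square, which a confusion matrix needs
tab <- table(pred_aligned, reference)
print(dim(tab))
```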
I'm trying to use LIME to explain a binary classification model that I've trained using XGboost. I run into an error when calling the explain() function from LIME, which implies that I have columns that aren't matching in my model (or explainer) and the new data I'm trying to explain predictions for.
This vignette for LIME does demonstrate a version with xgboost, however it's a text problem which is a little different to my tabular data. This question seems to be encountering the same error, but also for a document term matrix, which seems to obscure the solution for my case. I've worked up a minimal example with mtcars which produced exactly the same errors I get in my own larger dataset.
library(pacman)
p_load(tidyverse)
p_load(xgboost)
p_load(Matrix)
p_load(lime)
### Prepare data with partition
df <- mtcars %>% rownames_to_column()
length <- df %>% nrow()
df_train <- df %>% select(-rowname) %>% head((length-10))
df_test <- df %>% select(-rowname) %>% tail(10)
### Transform data into matrix objects for XGboost
train <- list(sparse.model.matrix(~., data = df_train %>% select(-vs)), (df_train$vs %>% as.factor()))
names(train) <- c("data", "label")
test <- list(sparse.model.matrix(~., data = df_test %>% select(-vs)), (df_test$vs %>% as.factor()))
names(test) <- c("data", "label")
dtrain <- xgb.DMatrix(data = train$data, label=train$label)
dtest <- xgb.DMatrix(data = test$data, label=test$label)
### Train model
watchlist <- list(train=dtrain, test=dtest)
mod_xgb_tree <- xgb.train(data = dtrain, booster = "gbtree", eta = .1, nrounds = 15, watchlist = watchlist)
### Check prediction works
output <- predict(mod_xgb_tree, test$data) %>% tibble()
### attempt lime explanation
explainer <- df_train %>% select(-vs) %>% lime(model = mod_xgb_tree) ### works, no error or warning
explanation <- df_test %>% select(-vs) %>% explain(explainer, n_features = 4) ### error, Features stored names in `object` and `newdata` are different!
names_test <- test$data@Dimnames[[2]] ### 10 names
names_mod <- mod_xgb_tree$feature_names ### 11 names
names_explainer <- explainer$feature_type %>% enframe() %>% pull(name) ### 11 names
### see whether pre-processing helps
my_preprocess <- function(df){
data <- df %>% select(-vs)
label <- df$vs
test <<- list(sparse.model.matrix( ~ ., data = data), label)
names(test) <<- c("data", "label")
dtest <- xgb.DMatrix(data = test$data, label=test$label)
dtest
}
explanation <- df_test %>% explain(explainer, preprocess = my_preprocess(), n_features = 4) ### Error in feature_distribution[[i]] : subscript out of bounds
### check that the preprocessing is working ok
dtest_check <- df_test %>% my_preprocess()
output_check <- predict(mod_xgb_tree, dtest_check)
I assume the problem is that the explainer only has the names of the original predictor columns, whereas the test data in its transformed state also has an (Intercept) column. I just haven't figured out a neat way of preventing this. Any help would be much appreciated; I assume there must be a neat solution.
If you look at this page (https://rdrr.io/cran/xgboost/src/R/xgb.Booster.R), you will see that some R users are likely to get the following error message: "Feature names stored in object and newdata are different!".
Here is the code from this page related to the error message:
predict.xgb.Booster <- function(object, newdata, missing = NA, outputmargin = FALSE, ntreelimit = NULL,
                                predleaf = FALSE, predcontrib = FALSE, approxcontrib = FALSE, predinteraction = FALSE,
                                reshape = FALSE, ...) {
  object <- xgb.Booster.complete(object, saveraw = FALSE)
  if (!inherits(newdata, "xgb.DMatrix"))
    newdata <- xgb.DMatrix(newdata, missing = missing)
  if (!is.null(object[["feature_names"]]) &&
      !is.null(colnames(newdata)) &&
      !identical(object[["feature_names"]], colnames(newdata)))
    stop("Feature names stored in `object` and `newdata` are different!")
  # (rest of the function omitted in this excerpt)
}
The relevant condition is !identical(object[["feature_names"]], colnames(newdata)): if the column names stored in object (i.e. your model, based on your training set) are not identical to the column names of newdata (i.e. your test set), you will get the error message. Note that identical() also requires the names to be in the same order.
For more details:
train_matrix <- xgb.DMatrix(as.matrix(training %>% select(-target)), label = training$target, missing = NaN)
object <- xgb.train(data=train_matrix, params=..., nthread=2, nrounds=..., prediction = T)
newdata <- xgb.DMatrix(as.matrix(test %>% select(-target)), missing = NaN)
If you build object and newdata from your own data as in the code above, you can probably fix the issue by looking at the differences between object[["feature_names"]] and colnames(newdata). Most likely some columns are missing or do not appear in the same order.
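One way to look at those differences, sketched with plain character vectors (the vectors model_names and new_names are made-up stand-ins for object[["feature_names"]] and colnames(newdata)):

```r
# Stand-ins for object[["feature_names"]] and colnames(newdata)
model_names <- c("(Intercept)", "mpg", "cyl", "disp")
new_names   <- c("cyl", "mpg", "disp")

# Names the model expects but the new data lacks, and the reverse
only_in_model <- setdiff(model_names, new_names)
only_in_new   <- setdiff(new_names, model_names)

# Even with the same set of names, a different order fails identical()
shared      <- intersect(model_names, new_names)
same_order  <- identical(shared, new_names)

print(only_in_model)
print(only_in_new)
print(same_order)
```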
Try this on your new dataset:
colnames(test)<- make.names(colnames(test))
newdataset<- test %>% mutate_all(as.numeric)
newdataset<- as.matrix(newdataset)
nwtest<-xgb.DMatrix(newdataset)
I had the same problem but the columns weren't in alphabetical order. To fix this, I matched the order of the column names in the df_test to df_train so that the column names were in the same order.
Create list of df_test column numbers in same order as df_train:
idx<- match(colnames(df_train), colnames(df_test))
Create new df_test file using this column order:
df_test_match <- df_test[,idx]
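The same reordering on two tiny made-up data frames (the column names a, b, c are assumptions), with a guard in case the test data lacks a training column, since match() returns NA for names it cannot find:

```r
# Toy frames with the same columns in a different order
df_train <- data.frame(a = 1:2, b = 3:4, c = 5:6)
df_test  <- data.frame(c = 7:8, a = 9:10, b = 11:12)

# Position of each training column inside df_test
idx <- match(colnames(df_train), colnames(df_test))

# Stop early if a training column is missing from the test data
stopifnot(!anyNA(idx))

df_test_match <- df_test[, idx]
print(colnames(df_test_match))
```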
To prevent the (Intercept) column showing up, you need to change your code slightly when creating the sparse matrix for your test data.
Change the line:
test <- list(sparse.model.matrix( ~ ., data = data), label)
to:
test <- list(sparse.model.matrix( ~ .-1, data = data), label)
Hope this helps
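A quick way to see the effect of the - 1, sketched with base model.matrix() as a stand-in for sparse.model.matrix() (which takes the same kind of formula) and two numeric mtcars columns:

```r
# Two numeric predictors from the built-in mtcars data
predictors <- mtcars[, c("mpg", "cyl")]

# Default formula: an "(Intercept)" column is added
with_intercept <- colnames(model.matrix(~ ., data = predictors))

# "- 1" removes the intercept column
without_intercept <- colnames(model.matrix(~ . - 1, data = predictors))

print(with_intercept)
print(without_intercept)
```

If the training matrix was built with the intercept, the same formula presumably needs to be used there as well, so that the feature names stored in the model match the new data.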