Regression model in R using dummy_cols - r

I want to build a regression model.
I have data about job seekers: academics, non-academics, women and men.
The data is separated into 7 areas.
I tried this code, but I'm not sure it is the right model for my data...
cleanData is the CSV file; it contains data from 01/2010 to 04/2020.
I deleted the month, city, and city-name columns:
library(fastDummies)   # dummy_cols()
library(MASS)          # stepAIC()
library(equatiomatic)  # extract_eq()
regressionSubset <- cleanData
regressionSubset$MONTH <- NULL
regressionSubset$CITY <- NULL
regressionSubset$LOCALITY.NAME <- NULL
# one-hot encode the categorical columns, dropping the first level of each
regressionData <- dummy_cols(regressionSubset, remove_first_dummy = TRUE, remove_selected_columns = TRUE)
regressionData <- data.frame(scale(regressionData))
set.seed(42)
model <- lm(ESTIMATED.CITY.UNEMPLOYMENT ~ ., data = regressionData)
step_model <- stepAIC(model, trace = FALSE)  # stepwise selection by AIC
summary(step_model)
extract_eq(step_model, use_coefs = TRUE, wrap = TRUE, terms_per_line = 1)
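For reference, here is a minimal sketch of what dummy_cols() does with categorical columns (the toy data below is invented for illustration, not the asker's file):
library(fastDummies)
# invented stand-ins for the job-seeker categories
toy <- data.frame(SECTOR = c("academic", "non-academic", "academic"),
                  GENDER = c("woman", "man", "man"))
# remove_first_dummy = TRUE drops one level per variable, which avoids
# perfect collinearity among the dummies in lm();
# remove_selected_columns = TRUE drops the original character columns
dummy_cols(toy, remove_first_dummy = TRUE, remove_selected_columns = TRUE)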

Related

Error with rpart function in R: Error in terms.formula(formula, data = data) : duplicated name 'X.' in data frame using '.'

I was running some classification models for Twitter sentiment analysis and came across this error when using the rpart function:
Error in terms.formula(formula, data = data) : duplicated name 'X.' in data frame using '.'
I hope someone can help me solve it.
Here is all the code I have used so far (apart from loading the libraries):
setwd("D:/ProjectTwitterTrial")
ustweets <- read.csv('tweets.csv', header = T)
str(ustweets)
set.seed(123)
tweets <- ustweets[sample(nrow(ustweets)),]
tweets <- tweets %>%
select(airline_sentiment, negativereason, airline, text, tweet_location)
tweets %>% distinct() # Keep only unique/distinct rows from the data frame
tweets$language = textcat(tweets$text)
tweets = subset(tweets, language =='english')
datadtm = DocumentTermMatrix(corpus)
datadtm = removeSparseTerms(datadtm, 0.999)
dataset <- as.data.frame(as.matrix(datadtm))
colnames(dataset) <- make.names(colnames(dataset))
dataset$airline_sentiment <- tweets$airline_sentiment
str(dataset$airline_sentiment)
dataset$airline_sentiment <- as.factor(dataset$airline_sentiment)
set.seed(222)
split = sample(2,nrow(dataset),prob = c(0.8,0.2),replace = TRUE)
train_set = dataset[split == 1,]
test_set = dataset[split == 2,]
train_set[4:6,57:59]
test_set[4:6,57:59]
rf_classifier = randomForest(airline_sentiment ~., data=train_set, ntree = 20)
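One thing worth checking here: make.names() does not deduplicate by default, so two distinct terms in the document-term matrix can collapse to the same syntactic name such as 'X.'. A minimal sketch of a possible fix, reusing the dataset object from the code above:
# make.names() maps invalid characters to "." but leaves duplicates in place;
# unique = TRUE appends suffixes so every column name is distinct
colnames(dataset) <- make.names(colnames(dataset), unique = TRUE)
any(duplicated(colnames(dataset)))  # should now be FALSE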

'factors with the same levels' in Confusion Matrix

I'm trying to build a decision tree, but this error comes up when I create the confusion matrix in the last line:
Error : `data` and `reference` should be factors with the same levels
Here's my code:
library(rpart)
library(caret)
library(dplyr)
library(rpart.plot)
library(xlsx)
library(caTools)
library(data.tree)
library(e1071)
#Loading the Excel File
library(readxl)
FINALDATA <- read_excel("Desktop/FINALDATA.xlsm")
View(FINALDATA)
df <- FINALDATA
View(df)
#Selecting the meaningful columns for prediction
#df <- select(df, City, df$`Customer type`, Gender, Quantity, Total, Date, Time, Payment, Rating)
df <- select(df, City, `Customer type`, Gender, Quantity, Total, Date, Time, Payment, Rating)
#making sure the data is in the right format
df <- mutate(df, City= as.character(City), `Customer type`= as.character(`Customer type`), Gender= as.character(Gender), Quantity= as.numeric(Quantity), Total= as.numeric(Total), Time= as.numeric(Time), Payment = as.character(Payment), Rating= as.numeric(Rating))
#Splitting into training and testing data
set.seed(123)
sample = sample.split('Customer type', SplitRatio = .70)
train = subset(df, sample==TRUE)
test = subset(df, sample == FALSE)
#Training the Decision Tree Classifier
tree <- rpart(df$`Customer type` ~., data = train)
#Predictions
tree.customertype.predicted <- predict(tree, test, type= 'class')
#confusion Matrix for evaluating the model
confusionMatrix(tree.customertype.predicted, test$`Customer type`)
So I tried what was suggested in another topic:
confusionMatrix(table(tree.customertype.predicted, test$`Customer type`))
But I still get an error:
Error in !all.equal(nrow(data), ncol(data)) : argument type is invalid
I made a toy data set and examined your code. There were a couple of issues:
R has an easier time with variable names that follow a certain style. Your `Customer type` variable has a space in it, and in general coding is easier when you avoid spaces, so I renamed it `Customer_type`. For your data frame you could simply edit the source file, or use names(df) <- gsub("Customer type", "Customer_type", names(df)).
I coded Customer_type as a factor. For you this will look like df$Customer_type <- factor(df$Customer_type).
The documentation for sample.split() says the first argument Y should be a vector of labels, i.e. the outcome column itself, not its name as a string. In my example the factor levels are High, Med and Low; to see yours, use levels(df$Customer_type). Pass the whole factor column to sample.split(), as below.
Adjust the rpart() call as shown below.
With these adjustments, your code should be OK.
# toy data
df <- data.frame(City = factor(sample(c("Paris", "Tokyo", "Miami"), 100, replace = T)),
                 Customer_type = factor(sample(c("High", "Med", "Low"), 100, replace = T)),
                 Gender = factor(sample(c("Female", "Male"), 100, replace = T)),
                 Quantity = sample(1:10, 100, replace = T),
                 Total = sample(1:10, 100, replace = T),
                 Date = sample(seq(as.Date('2020/01/01'), as.Date('2020/12/31'), by = "day"), 100),
                 Rating = factor(sample(1:5, 100, replace = T)))
library(rpart)
library(caret)
library(dplyr)
library(caTools)
library(data.tree)
library(e1071)
#Splitting into training and testing data
set.seed(123)
sample = sample.split(df$Customer_type, SplitRatio = .70) # ADJUST YOUR CODE: pass the label column itself
train = subset(df, sample==TRUE)
test = subset(df, sample == FALSE)
#Training the Decision Tree Classifier
tree <- rpart(Customer_type ~., data = train) # ADJUST YOUR CODE SO IT'S LIKE THIS
#Predictions
tree.customertype.predicted <- predict(tree, test, type= 'class')
#confusion Matrix for evaluating the model
confusionMatrix(tree.customertype.predicted, test$Customer_type)
Try keeping the factor levels of train and test the same as in df:
train$`Customer type` <- factor(train$`Customer type`, unique(df$`Customer type`))
test$`Customer type` <- factor(test$`Customer type`, unique(df$`Customer type`))

Random Forest Tree for classification

I am trying random forest for the first time. I am trying to predict the genre of a game based on the following factors:
library(dplyr)  # select()
library(tidyr)  # drop_na(), separate()
data <- read.csv("appstore_games.csv")
data <- data %>% drop_na()
data <- data %>% select(Average.User.Rating, User.Rating.Count, Price, Age.Rating, Genres)
data <- data %>% separate(Genres, c("Main Genre", "Genre1", "Genre2", "Genre3"), extra = "drop")
data1 <- data %>% select(Genre1, Average.User.Rating, User.Rating.Count, Price)
str(data1)
data1$Genre1 <- as.factor(data1$Genre1)
set.seed(123)
sample <- sample(2, nrow(data1), replace = TRUE, prob = c(0.7, 0.3))
train_data <- data1[sample == 1,]
test_data <- data1[sample == 2,]
library(randomForest)
set.seed(1)
rf <- randomForest(train_data$Genre1 ~ ., data = train_data, proximity = TRUE, ntree = 200, importance = TRUE)
It throws an error at this point:
Error in randomForest.default(m, y, ...) : Can't have empty classes in y.
Can anyone tell me what is wrong here?
Thanks.
The genre column has values such as Strategy, Entertainment, etc.
I am not completely sure, but I think that can happen if not all levels of your Y are represented in the training data. Maybe check this.
My other idea is that one of your classes in Y is "None".
Try using this before you pass the data to the model:
train_data <- droplevels(train_data)
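A quick way to confirm the diagnosis first, using the objects from the question:
# zero counts reveal factor levels with no rows in the training split;
# these are the "empty classes" randomForest complains about
table(train_data$Genre1)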

Feature names stored in `object` and `newdata` are different! when using LIME package to explain xgboost model in R

I'm trying to use LIME to explain a binary classification model that I've trained using XGBoost. I run into an error when calling the explain() function from LIME, which implies that the columns in my model (or explainer) don't match the new data I'm trying to explain predictions for.
This vignette for LIME does demonstrate a version with xgboost, but it's a text problem, which is a little different from my tabular data. This question seems to hit the same error, but again for a document-term matrix, which obscures the solution for my case. I've worked up a minimal example with mtcars that produces exactly the same errors I get with my own larger dataset.
library(pacman)
p_load(tidyverse)
p_load(xgboost)
p_load(Matrix)
p_load(lime)
### Prepare data with partition
df <- mtcars %>% rownames_to_column()
length <- df %>% nrow()
df_train <- df %>% select(-rowname) %>% head((length-10))
df_test <- df %>% select(-rowname) %>% tail(10)
### Transform data into matrix objects for XGboost
train <- list(sparse.model.matrix(~., data = df_train %>% select(-vs)), (df_train$vs %>% as.factor()))
names(train) <- c("data", "label")
test <- list(sparse.model.matrix(~., data = df_test %>% select(-vs)), (df_test$vs %>% as.factor()))
names(test) <- c("data", "label")
dtrain <- xgb.DMatrix(data = train$data, label=train$label)
dtest <- xgb.DMatrix(data = test$data, label=test$label)
### Train model
watchlist <- list(train=dtrain, test=dtest)
mod_xgb_tree <- xgb.train(data = dtrain, booster = "gbtree", eta = .1, nrounds = 15, watchlist = watchlist)
### Check prediction works
output <- predict(mod_xgb_tree, test$data) %>% tibble()
### attempt lime explanation
explainer <- df_train %>% select(-vs) %>% lime(model = mod_xgb_tree) ### works, no error or warning
explanation <- df_test %>% select(-vs) %>% explain(explainer, n_features = 4) ### error, Features stored names in `object` and `newdata` are different!
names_test <- test$data@Dimnames[[2]] ### 10 names
names_mod <- mod_xgb_tree$feature_names ### 11 names
names_explainer <- explainer$feature_type %>% enframe() %>% pull(name) ### 11 names
### see whether pre-processing helps
my_preprocess <- function(df){
  data <- df %>% select(-vs)
  label <- df$vs
  test <<- list(sparse.model.matrix( ~ ., data = data), label)
  names(test) <<- c("data", "label")
  dtest <- xgb.DMatrix(data = test$data, label = test$label)
  dtest
}
explanation <- df_test %>% explain(explainer, preprocess = my_preprocess(), n_features = 4) ### Error in feature_distribution[[i]] : subscript out of bounds
### check that the preprocessing is working ok
dtest_check <- df_test %>% my_preprocess()
output_check <- predict(mod_xgb_tree, dtest_check)
I assume the problem is that the explainer only has the names of the original predictor columns, while the test data in its transformed state also has an (Intercept) column. I just haven't figured out a neat way of preventing this from occurring. Any help would be much appreciated; I assume there must be a neat solution.
If you look at this page (https://rdrr.io/cran/xgboost/src/R/xgb.Booster.R), you will see that some R users are likely to get the following error message: "Feature names stored in object and newdata are different!".
Here is the code from this page related to the error message:
predict.xgb.Booster <- function(object, newdata, missing = NA, outputmargin = FALSE,
                                ntreelimit = NULL, predleaf = FALSE, predcontrib = FALSE,
                                approxcontrib = FALSE, predinteraction = FALSE,
                                reshape = FALSE, ...) {
  object <- xgb.Booster.complete(object, saveraw = FALSE)
  if (!inherits(newdata, "xgb.DMatrix"))
    newdata <- xgb.DMatrix(newdata, missing = missing)
  if (!is.null(object[["feature_names"]]) &&
      !is.null(colnames(newdata)) &&
      !identical(object[["feature_names"]], colnames(newdata)))
    stop("Feature names stored in `object` and `newdata` are different!")
  # ... (rest of the function omitted)
identical(object[["feature_names"]], colnames(newdata)): if the column names of object (i.e. your model trained on the training set) are not identical to the column names of newdata (i.e. your test set), you will get the error message.
For more details:
train_matrix <- xgb.DMatrix(as.matrix(training %>% select(-target)), label = training$target, missing = NaN)
object <- xgb.train(data=train_matrix, params=..., nthread=2, nrounds=..., prediction = T)
newdata <- xgb.DMatrix(as.matrix(test %>% select(-target)), missing = NaN)
By setting up object and newdata yourself with your data, as in the code above, you can probably fix the issue by looking at the differences between object[["feature_names"]] and colnames(newdata). Probably some columns don't appear in the same order, or something similar.
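For example, a quick way to inspect the mismatch, with object and newdata set up as above:
# columns the model expects but newdata lacks, and vice versa;
# if both are empty, the names differ only in order
setdiff(object[["feature_names"]], colnames(newdata))
setdiff(colnames(newdata), object[["feature_names"]])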
Try this on your new dataset:
library(dplyr)  # mutate_all()
colnames(test) <- make.names(colnames(test))
newdataset <- test %>% mutate_all(as.numeric)
newdataset <- as.matrix(newdataset)
nwtest <- xgb.DMatrix(newdataset)
I had the same problem, but my columns weren't in alphabetical order. To fix it, I matched the column order of df_test to df_train so that the column names were in the same order.
Create a vector of df_test column numbers in the same order as df_train:
idx <- match(colnames(df_train), colnames(df_test))
Create a new df_test using this column order:
df_test_match <- df_test[,idx]
To prevent the (Intercept) column from showing up, you need to change your code slightly when creating the sparse matrix for your test data.
Change the line:
test <- list(sparse.model.matrix( ~ ., data = data), label)
to:
test <- list(sparse.model.matrix( ~ .-1, data = data), label)
Hope this helps
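A quick sanity check after rebuilding test with the ~ .-1 formula (same objects as in the question):
# the (Intercept) column should be gone from the test matrix now
"(Intercept)" %in% colnames(test$data)  # expect FALSE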

SVM a dataframe based on the last column

I've been trying for hours to fit an SVM to a data frame using the last column as the class.
I have this data frame
# Fill the data frame
df = read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data",
                sep = ",",
                col.names = c("buying", "maint", "doors", "persons", "lug_boot", "safety", ""),
                fill = TRUE,
                strip.white = TRUE)
lastColName <- colnames(df)[ncol(df)]
...
model <- svm(lastColName ~ .,
             data = df,
             kernel = "polynomial",
             degree = degree,
             type = "C-classification",
             cost = cost)
I'm getting either NULL or Error in model.frame.default(formula = str(lastColName) ~ ., data = df1, : invalid type (NULL) for variable 'str(lastColName)'. I understand the NULL appears when the column has no name. I don't understand the other error, since lastColName is the last column's name.
Any ideas?
You have to use as.formula() when you want to use a dynamic variable in a formula. For details see ?as.formula.
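A minimal illustration of the idea, using mtcars instead of the asker's data:
# build a formula from a string; lm() is just a stand-in model here
response <- "mpg"
f <- as.formula(paste(response, "~ ."))
lm(f, data = mtcars)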
The following code works fine:
library(e1071)
df_1 = read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data",
                  sep = ",",
                  col.names = c("buying", "maint", "doors", "persons", "lug_boot", "safety", ""),
                  fill = TRUE,
                  strip.white = TRUE)
lastColName <- colnames(df_1)[ncol(df_1)]
model <- svm(as.formula(paste(lastColName, "~ .", sep = " ")),
             data = df_1,
             kernel = "polynomial",
             degree = 3,
             type = "C-classification",
             cost = 1)
# to predict on the data remove the last column
prediction <- predict(model, df_1[,-ncol(df_1)])
# The output
table(prediction)
# The output is:
prediction
  acc  good unacc vgood
    0     0  1728     0
# Since this is a highly unbalanced classification problem, the model is not doing a very good job
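One common mitigation for the imbalance is class weighting; here is a sketch using e1071's class.weights argument (the inverse-frequency weights are purely illustrative and untuned):
# rare classes become more expensive to misclassify
tab <- table(df_1[[lastColName]])
wts <- max(tab) / as.vector(tab)
names(wts) <- names(tab)
model_w <- svm(as.formula(paste(lastColName, "~ .")),
               data = df_1,
               kernel = "polynomial",
               degree = 3,
               type = "C-classification",
               cost = 1,
               class.weights = wts)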
