How to solve the "object is not a matrix" error in R

# data splitting
set.seed(12345)
train <- sample(1:nrow(student.mat.pass.or.fail),
                size = ceiling(0.80 * nrow(student.mat.pass.or.fail)),
                replace = FALSE)
# training set
students_train <- student.mat.pass.or.fail[train,]
# test set
students_test <- student.mat.pass.or.fail[-train,]
# penalty matrix
penalty.matrix <- matrix(c(0,1,10,0), byrow=TRUE, nrow=2)
# building the classification tree with rpart
library(rpart)
tree <- rpart(class ~ .,
              data = students_train,   # as.matrix(students_train)
              parms = list(loss = penalty.matrix),
              method = "class")
This throws the error "object is not a matrix". Can someone help me? I'm new to R. I also tried as.matrix(students_train), but it still shows the same problem.
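Not the original data, but a hedged sanity check: the same call pattern runs cleanly on the kyphosis data frame that ships with rpart, so if it fails on your data it is worth confirming that student.mat.pass.or.fail is a plain data.frame, that class is a factor, and that penalty.matrix really is a 2x2 matrix whose rows and columns match the factor levels.
# minimal sketch on rpart's built-in kyphosis data (binary factor outcome)
library(rpart)
loss.mat <- matrix(c(0, 1, 10, 0), byrow = TRUE, nrow = 2)   # 2x2, zero diagonal
demo.tree <- rpart(Kyphosis ~ ., data = kyphosis,
                   parms = list(loss = loss.mat),
                   method = "class")
# if this runs but your call does not, inspect str(student.mat.pass.or.fail)
# and class(penalty.matrix)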

Related

R caret extractPrediction with random forest model: Error: $ operator is invalid for atomic vectors

I want to extract the predictions for new, unseen data using the function caret::extractPrediction with a random forest model, but I cannot figure out why my code throws the error Error: $ operator is invalid for atomic vectors. How should the input parameters be structured to use this function?
Here is my reproducible code:
library(caret)
dat <- as.data.frame(ChickWeight)
# create column set
dat$set <- rep("train", nrow(dat))
# split into train and validation set
set.seed(1)
dat[sample(nrow(dat), 50), which(colnames(dat) == "set")] <- "validation"
# predictors and response
all_preds <- dat[which(dat$set == "train"), which(names(dat) %in% c("Time", "Diet"))]
response <- dat[which(dat$set == "train"), which(names(dat) == "weight")]
# set train control parameters
contr <- caret::trainControl(method="repeatedcv", number=3, repeats=5)
# recursive feature elimination caret
set.seed(1)
model <- caret::train(x = all_preds,
                      y = response,
                      method = "rf",
                      ntree = 250,
                      metric = "RMSE",
                      trControl = contr)
# validation set
vali <- dat[which(dat$set == "validation"), ]
# not working
caret::extractPrediction(models = model, testX = vali[,-c(3,5,1)], testY = vali[,1])
caret::extractPrediction(models = model, testX = vali, testY = vali)
# works without problems
caret::predict.train(model, newdata = vali)
I found a solution by looking at the documentation of extractPrediction. Basically, the argument models doesn't want a single model instance, but a list of models. So I just inserted list(my_rf = model) and not just model.
caret::extractPrediction(models = list(my_rf = model), testX = vali[,-c(3,5,1)], testY = vali[,1])

Use of PCA results as input to XGboost model throwing an error: Feature names stored in `object` and `newdata` are different

I run PCA on my training split and project the test split onto the resulting components, after removing irrelevant columns.
data <- read.csv('bottom10.csv')
set.seed(1)
inTrain <- createDataPartition(data$cuisine, p = .8)[[1]]
dataTrain <- data[,-1][inTrain,][,-1]
dataTest <- data[,-1][-inTrain,][,-1]
cuisine.pca <- prcomp(dataTrain[,-1])
Then I extract the first 500 components and project the test dataset.
traincom <- cuisine.pca$x[,1:500]
testcom <- scale(dataTest[,-1], cuisine.pca$center) %*% cuisine.pca$rotation
Then I convert the labels to integers and combine the components and labels into xgb.DMatrix objects.
label_train <- as.integer(dataTrain$cuisine) - 1
label_test <- as.integer(dataTest$cuisine) - 1
xgb_train <- xgb.DMatrix(data = traincom, label = label_train)
xgb_test <- xgb.DMatrix(data = testcom, label = label_test)
Then I build the xgboost model as
xgb.fit <- xgboost(cuisine~., data = xgb_train, nrounds = 40, num_class = 10, early_stopping_rounds = 5)
And after I run this, there is a warning but the training can still run.
xgboost: label will be ignored
I can predict on the training set with the model, but when I try to predict on the test set I get an error.
xgb_pred <- predict(xgb.fit, newdata = xgb_train)
sum(label_train == xgb_pred)/length(label_train)
xgb_pred <- predict(xgb.fit, newdata = xgb_test, rescale = T)
Error in predict.xgb.Booster(xgb.fit, newdata = xgb_test, rescale = T) :
Feature names stored in `object` and `newdata` are different!
Please let me know what I am doing wrong.
Regards
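No accepted answer here, but a hedged sketch of a likely fix: the model was trained on traincom, which only has columns PC1 to PC500, while testcom built from the matrix multiplication carries every principal component, so the feature names differ. Keeping only the first 500 components (and letting predict() on the prcomp object handle the centering) should make the names match; note also that xgboost() takes no formula, it reads the label from the DMatrix.
# hedged sketch: project the test set with predict() and keep the same 500 PCs
testcom <- predict(cuisine.pca, newdata = dataTest[,-1])[, 1:500]
xgb_test <- xgb.DMatrix(data = testcom, label = label_test)
# no formula: the label comes from the DMatrix
xgb.fit <- xgboost(data = xgb_train, nrounds = 40,
                   objective = "multi:softmax", num_class = 10)
xgb_pred <- predict(xgb.fit, newdata = xgb_test)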

Issues with Naive Bayes Text Classification with two Categories in R

I'm trying to implement a Naive Bayes classifier on a data set which contains text data in the form of complaints from customers (Complaint) and Reddit comments (General_Text). The whole set has 250,000 texts for each category; however, I use only 1,000 texts per category in the example posted here, and I get the same result with the whole data set. I did the text preprocessing with the "tm" package beforehand, so it should not be an issue.
The data frame is structured as follows with 1000 entries for Complaint and General_Text:
type text
"General_Text" "random words"
"Complaint" "other random words"
For the classification task I split the data into a training set, on which the algorithm learns, and a test set to measure accuracy. The naive Bayes algorithm is from the "e1071" library.
library(plyr)
library(tm)        # Corpus() and DocumentTermMatrix() below come from tm
library(e1071)
library(caret)
library(MLmetrics)
# Import data and rename columns to $type and $text
General_Text<- read.csv("General_Text.csv", sep=";", head=T, stringsAsFactors = F)
Complaints<- read.csv("Complaints.csv", sep=";", head=T, stringsAsFactors = F)
Data <- rbind(General_Text, Complaints)
colnames(Data) <- c("type", "text")
# $type as factor and $text as string
Data$text <- iconv(Data$text, encoding = "UTF-8")
Data$type <- factor(Data$type)
# Split the data into training set (1400 texts) and test set (600 texts)
set.seed(1234)
trainIndex <- createDataPartition(Data$type, p = 0.7, list = FALSE, times = 1)
trainData <- Data[trainIndex,]
testData <- Data[-trainIndex,]
# Create corpus for training data
corpus<- Corpus(VectorSource(trainData$text))
# Create Document Term Matrix for training data
docs_dtm <- DocumentTermMatrix(corpus, control = list(global = c(2, Inf)))
# Remove Sparse Terms in DTM
docs_dtm_train <- removeSparseTerms(docs_dtm , 0.97)
# Convert counts into "Yes" or "No"
convert_counts <- function(x){
  x <- ifelse(x > 0, 1, 0)
  x <- factor(x, levels = c(0, 1), labels = c("No", "Yes"))
  return(x)
}
# Apply convert_counts function to the training data
docs_dtm_train <- apply(docs_dtm_train, MARGIN = 2, convert_counts)
# Create Corpus for test set
corpus_2 <- Corpus(VectorSource(testData$text))
# Create Document Term Matrix for test data
docs_dtm_2 <- DocumentTermMatrix(corpus_2, list(global = c(2, Inf)))
# Remove Sparse Terms in DTM
docs_dtm_test <- removeSparseTerms(docs_dtm_2, 0.97)
# Apply convert_counts function to the test data
docs_dtm_test <- apply(docs_dtm_test, MARGIN = 2, convert_counts)
# Naive Bayes Classification
nb_classifier <- naiveBayes(docs_dtm_train, trainData$type)
nb_test_pred <- predict(nb_classifier, newdata = docs_dtm_test)
# Output as Confusion Matrix
ConfusionMatrix(nb_test_pred, testData$type)
I'm sorry that I cannot share the data and thus a reproducible example. The result the code delivers is pretty demoralizing: it classifies all the texts as Complaint and none as General_Text.
> ConfusionMatrix(nb_test_pred, testData$type)
              y_pred
y_true         Complaint General_Text
  Complaint          300            0
  General_Text       300            0
I also get the following warning: In data.matrix(newdata) : NAs introduced by coercion
Could anyone clarify if I made any mistakes in my code or give me a heads up if someone had a similar issue?
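One hedged guess based only on the code shown: the test document-term matrix is built from its own vocabulary, so its columns do not line up with the terms the classifier was trained on, and predict() ends up coercing mismatched columns (hence the NAs warning) and produces degenerate predictions. A minimal sketch of the usual fix is to build the test DTM against the training vocabulary:
# rebuild the test DTM using only the terms seen during training (hedged sketch)
train_vocab <- colnames(docs_dtm_train)   # terms the classifier was trained on
docs_dtm_test <- DocumentTermMatrix(corpus_2,
                                    control = list(dictionary = train_vocab))
docs_dtm_test <- apply(docs_dtm_test, MARGIN = 2, convert_counts)
nb_test_pred <- predict(nb_classifier, newdata = docs_dtm_test)
ConfusionMatrix(nb_test_pred, testData$type)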

Fitting data into the Self Organizing Map model using R

I'm new to R and it's the first time I'm using SOM.
I want to predict survival using a Self-Organizing Map.
The following is the code I used to ingest the data:
# Load raw data
train <- read.csv("train.csv", header = TRUE)
test <- read.csv("test.csv", header = TRUE)
Add a "Survived" variable to the test set to allow for combining data sets
test.survived <- data.frame(survived = rep("None", nrow(test)), test[,])
# Combine data sets
data.combined <- rbind(train, test.survived)
# Change the variables to factors
data.combined$Survived <- as.factor(data.combined$survived)
data.combined$Pclass <- as.factor(data.combined$pclass)
# Fit the data to the SOM model
library(kohonen)
# Train SOM
som.train.1 <- data.combined[1:891, c("pclass", "title")]
som.label <- as.factor(train$survived)
table(som.train.1)
table(som.label)
som.train.1.grid <- somgrid(xdim = 20, ydim=20, topo="hexagonal")
set.seed(1234)
som.model <- som(som.label,
                 grid = som.train.1.grid,
                 rlen = 100,
                 alpha = c(0.05, 0.01),
                 keep.data = TRUE,
                 normalizeDataLayers = TRUE)
plot(som.model)
I get an error that says: sort.list(y): 'x' must be atomic for 'sort.list'
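A hedged sketch of one way around this, assuming the error comes from passing a factor (som.label) to som(), which expects a numeric data matrix. For a supervised map, the kohonen package's xyf() can take the class labels as a second layer:
library(kohonen)
# numeric matrix of predictors (data.matrix turns factors into integer codes)
som.data <- scale(data.matrix(som.train.1))
set.seed(1234)
som.model <- xyf(som.data,
                 classvec2classmat(som.label),   # labels as an indicator matrix
                 grid = som.train.1.grid,
                 rlen = 100)
plot(som.model)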

binning continuous variables by IV value in R

I am building a logistic regression model in R. I want to bin continuous predictors in an optimal way with respect to the target variable. There are two approaches that I know of:
the continuous variables are binned such that their IV (information value) is maximized
maximize the chi-square in the two-way contingency table -- the target has two values, 0 and 1, and the binned continuous variable has the binned buckets
Does anyone know of any functions in R that can perform such binning?
Your help will be greatly appreciated.
For the first point, you could bin using the weight of evidence (WoE) with the woeBinning package, which optimizes the number of bins for the IV:
library(woeBinning)
# get the bin cut points from your dataframe
cutpoints <- woe.binning(dataset, "target_name", "Variable_name")
woe.binning.plot(cutpoints)
# apply the cutpoints to your dataframe
dataset_woe <- woe.binning.deploy(dataset, cutpoints, add.woe.or.dum.var = "woe")
It returns your data set with two extra columns:
Variable_name.binned, which contains the bin labels
Variable_name.woe.binned, which contains the WoE values that you can then feed into your regression instead of Variable_name
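For illustration, a minimal hedged sketch of that last step, reusing the placeholder names above (target_name and Variable_name stand in for your own columns):
# use the WoE-replaced column in the logistic regression instead of the raw variable
fit <- glm(target_name ~ Variable_name.woe.binned,
           data = dataset_woe, family = binomial)
summary(fit)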
For the second point, on chi-square, the discretization package seems to handle it, but I haven't tested it.
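Also untested here, but as a hedged sketch: the chiM() function in discretization implements ChiMerge and expects a data frame whose last column is the class label.
library(discretization)
# example on iris: four numeric columns, class label (Species) in the last column
disc <- chiM(iris, alpha = 0.05)
disc$cutp            # chi-square based cut points per numeric column
head(disc$Disc.data) # discretized data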
The methods used by regression splines to set knot locations might be considered. The rpart package probably has relevant code. You do need to penalize the inferential statistics because this results in an implicit hiding of the degrees of freedom expended in the process of moving the breaks around to get the best fit. Another common method is to specify breaks at equally spaced quantiles (quartiles or quintiles) within the subset with IV=1. Something like this untested code:
cont.var.vec <- # names of all your continuous variables
breaks <- function(x, n) quantile(x,
                                  probs = seq(0, 1, length.out = n),
                                  na.rm = TRUE)
lapply(dfrm[dfrm$IV == 1, cont.var.vec], breaks, n = 5)
setwd("D:")
rm(list = ls())
options(scipen = 999)
read.csv("dummy_data.txt") -> dt
head(dt)
summary(dt)
mydata <- dt
head(mydata)
summary(mydata)
## Capping: cap each numeric column at 1.5 * its 75th percentile + 1
for(i in 1:ncol(mydata)){
  if(is.numeric(mydata[,i])){
    val.quant <- unname(quantile(mydata[,i], probs = 0.75))
    mydata[,i] <- sapply(mydata[,i], function(x){ if(x > (1.5*val.quant + 1)) 1.5*val.quant + 1 else x })
  }
}
library(randomForest)
x <- mydata[,!names(mydata) %in% c("Cust_Key","Y")]
y <- as.factor(mydata$Y)
set.seed(21)
fit <- randomForest(x,y,importance=T,ntree = 70)
mydata2 <- mydata[,!names(mydata) %in% c("Cust_Key")]
mydata2$Y <- as.factor(mydata2$Y)
fit$importance
####var reduction#####
vartoremove <- ncol(mydata2) - 20
library(rminer)
#####
for(i in 1:vartoremove){
  rf <- fit(Y ~ ., data = mydata2, model = "randomForest", mtry = 10, ntree = 100)
  varImportance <- Importance(rf, mydata2, method = "sensg")
  Z <- order(varImportance$imp, decreasing = FALSE)
  IND <- Z[2]
  var_to_remove <- names(mydata2[IND])
  mydata2[IND] <- NULL
  print(i)
}
###########
library(smbinning)
as.data.frame(mydata2) -> inp
summary(inp)
attach(inp)
rm(result)
str(inp)
inp$target <- as.numeric(inp$Y) *1
table(inp$target)
ftable(inp$Y,inp$target)
inp$target <- inp$target -1
result= smbinning(df=inp, y="target", x="X37", p=0.0005)
result$ivtable
smbinning.plot(result,option="badrate",sub="test")
summary(inp)
result$ivtable
boxplot(inp$X2~inp$Y,horizontal=T, frame=F, col="red",main="Distribution")
###Sample
require(caTools)
inp$Y <- NULL
sample = sample.split(inp$target, SplitRatio = .7)
train = subset(inp, sample == TRUE)
test = subset(inp, sample == FALSE)
head(train)
nrow(train)
fit1 <- glm(target ~ ., data = train, family = binomial)
summary(rf)
prediction1 <- data.frame(actual = test$target, predicted = predict(fit1,test ,type="response") )
result= smbinning(df=prediction1, y="actual", x="predicted", p=0.005)
result$ivtable
smbinning.plot(result,option="badrate",sub="test")
tail(prediction1)
write.csv(prediction1 , "test_pred_logistic.csv")
predict_train <- data.frame(actual = train$target, predicted = predict(fit1,train ,type="response") )
write.csv(predict_train , "train_pred_logistic.csv")
result= smbinning(df=predict_train, y="actual", x="predicted", p=0.005)
result$ivtable
smbinning.plot(result,option="badrate",sub="train")
####random forest
rf <- fit(target~.,data=train,model = "randomForest", mtry = 10 ,ntree = 200)
prediction2 <- data.frame(actual = test$target, predicted = predict(rf, test))
result= smbinning(df=prediction2, y="actual", x="predicted", p=0.005)
result$ivtable
smbinning.plot(result,option="badrate",sub="train")
###########IV
library(devtools)
install_github("tomasgreif/riv")
library(woe)
##### K-fold Validation ########
library(caret)
cv_fold_count = 2
folds = createFolds(mydata2$Y,cv_fold_count,list=T);
smpl = folds[[i]];
g_train = mydata2[-smpl,!names(mydata2) %in% c("Y")];
g_test = mydata2[smpl,!names(mydata2) %in% c("Y")];
cost_train = mydata2[-smpl,"Y"];
cost_test = mydata2[smpl,"Y"];
rf <- randomForest(g_train,cost_train)
logit.data <- cbind(cost_train,g_train)
logit.fit <- glm(cost_train~.,data=logit.data,family = binomial)
prediction <- data.frame(actual = test$Y, predicted = predict(rf, test))
