R NaiveBayes issue with numeric variables - r

Even though the NaiveBayes() help says that numeric variables can be passed in the first parameter 'x', I am not able to run it successfully. Without the numeric variable (resale) it works fine. Here is the script:
library(readr)
library(klaR)
### load dataset
Dataset <- read_csv("D:/sampledata.csv")
### converting 'model' and 'type' to factor
Dataset$model <- factor(Dataset$model)
Dataset$type <- factor(Dataset$type)
### Executing NaiveBayes with numeric 'resale'
NaiveBayesModel1 <- NaiveBayes(model~type+mylogical+resale,data=Dataset,na.action =na.omit)
### now removing resale. Following works as expected.
NaiveBayesModel1 <- NaiveBayes(model~type+mylogical,data=Dataset,na.action =na.omit)
'model' and 'type' are factors,
'mylogical' is a logical and
'resale' is a numeric variable.
Since I cannot attach my data file, I am pasting a few rows here. Copy these rows and save them as sampledata.csv on your drive, then modify read_csv() in the above script to point to this file.
"model","sales","resale","type","mylogical"
"Integra",16.919,16.36,"Automobile",TRUE
"TL",39.384,19.875,"Automobile",FALSE
"Camry",247.994,13.245,"Automobile",FALSE
"Avalon",63.849,18.14,"Automobile",TRUE
"Celica",33.269,15.445,"Automobile",TRUE
"Tacoma",84.087,9.575,"Truck",TRUE
"RAV4",25.106,13.325,"Truck",FALSE
"4Runner",68.411,19.425,"Truck",FALSE
"Land Cruiser",9.835,34.08,"Truck",TRUE
"Golf",9.761,11.425,"Automobile",FALSE
"Jetta",83.721,13.24,"Automobile",FALSE
"Passat",51.102,16.725,"Automobile",TRUE
"Cabrio",9.569,16.575,"Automobile",FALSE
"GTI",5.596,13.76,"Automobile",FALSE
I get the following error if I run NaiveBayes with "resale":
Error in if (any(temp)) stop("Zero variances for at least one class in variables: ", :
missing value where TRUE/FALSE needed
R help ( help(NaiveBayes) ) says I can use numeric. I don't understand what is wrong. Please help.
Regards,
SG

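The error is caused by the variance of resale within each of the outcomes in model being zero or undefined. NaiveBayes() needs a per-class standard deviation for every numeric predictor. Most likely your training set contains a single training record for each distinct value of model (as in the sample rows), and the standard deviation of a single value is NA, which is why the zero-variance check fails with "missing value where TRUE/FALSE needed".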
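As a rough illustration of the point above, the following base R sketch uses four of the sample rows from the question. It does not call klaR; it only assumes that the per-class spread is what the zero-variance check in the error message looks at.

```r
# A few rows from the question: every model appears exactly once
Dataset <- data.frame(
  model  = factor(c("Integra", "TL", "Camry", "Avalon")),
  resale = c(16.36, 19.875, 13.245, 18.14)
)

# Number of training records per class
records_per_class <- table(Dataset$model)
print(records_per_class)

# Per-class standard deviation of the numeric predictor;
# sd() of a single value is NA
sd_per_class <- tapply(Dataset$resale, Dataset$model, sd)
print(sd_per_class)

# A condition built from these values is NA rather than TRUE or FALSE,
# and if (NA) raises "missing value where TRUE/FALSE needed"
zero_variance_flag <- any(sd_per_class == 0)
print(zero_variance_flag)
```

If the full data set really has one row per model, an outcome with several records per level (type in the sample rows, for instance) would give NaiveBayes() something to estimate a spread from.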

Related

Sentiment Analysis Of A Dataset With Multiple NewsPaper Articles

I'm trying to call get_nrc_sentiment in R but getting the following error:
Error in get_nrc_sentiment(Test) : Data must be a character vector.
Can anyone see what I'm doing wrong?
library("RDSTK")
library("readr")
library("qdap")
library("syuzhet")
library("ggplot2")
library(readxl)
Test <- read_excel("Test.xlsx")
View(Test)
scores = get_nrc_sentiment(Test) # throwing error
I suspect that the Test.xlsx file you are reading in has multiple columns. In any case, read_excel() returns a data frame (a tibble), so the Test object is not a character vector. Passing the data frame to get_nrc_sentiment() causes the error. You can check Test with class(Test) to determine what kind of R object it is, and then pass only the column that holds the text.
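A minimal sketch of that check, using a small hand-built data frame in place of the Excel file. The column name `article` and the sentences are assumptions made up for illustration, since the question does not show the contents of Test.xlsx.

```r
# Hypothetical stand-in for the result of read_excel("Test.xlsx");
# the column name `article` is an assumption, not from the question
Test <- data.frame(
  id      = 1:2,
  article = c("Markets rallied on good news.",
              "Storm damage angers residents."),
  stringsAsFactors = FALSE
)

print(class(Test))          # a data frame, not a character vector
print(is.character(Test))

# Pull out the text column as a plain character vector
article_text <- as.character(Test$article)
print(is.character(article_text))

# Score the text only if the syuzhet package is installed
if (requireNamespace("syuzhet", quietly = TRUE)) {
  scores <- syuzhet::get_nrc_sentiment(article_text)
  print(scores)
}
```

With the real file, the same idea would be get_nrc_sentiment(Test$your_column), where your_column is whichever column contains the article text.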

Why does the error "duplicated name in data frame using '.'" happen?

I have a data frame with 30 rows and 850 columns (features).
When I try to use svm or another classifier with the caret and e1071 packages, I get this error:
Error in terms.formula(formula, data = data) :
duplicated name 'X10Percentile' in data frame using '.'
Even when I use a feature selection method such as Boruta, I get the same error.
I double-checked my features and found nothing. I thought I might have the same column name twice in the data frame, so I created sample data and checked as follows:
test<-data.frame("w1"=c(1:6),"w1.1"=c(2:7),"w1"=c(3:8), "ta"=c("T","F","T","F","F","T"))
set.seed(100)
train <- createDataPartition(y=test$ta,p=0.6,list = FALSE)
TrainSet <- test[train,]
TestSet <- test[-train,]
trcontrol_rcv<- trainControl(method="cv", number=10)
svm_test<-svm(ta ~., data=TrainSet,trControl=trcontrol_rcv)
It works fine and no error occurs.
As far as I can see, no error happens even when the test data has exactly the same column name twice.
I want to know why this error, "Error in terms.formula(formula, data = data) :
duplicated name 'X10Percentile' in data frame using '.'", happens for my data, and how I can eliminate it.
Thank you in advance.
Thank you, everyone. Fortunately, I found the cause of this error.
R was treating the variables as factors when it built the data.frame (which is in fact a list). To solve this problem, I converted the data to numeric in the following way:
test1<-sapply(test,function(x) as.numeric(as.character(x)))
For me that was not the solution: I had a LargeMatrix object containing only numeric vectors.
The problem was that some dimnames(MyLargeMatrix) were duplicated. I changed them and the error went away.
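A short base R sketch of how duplicated names can be found and repaired. It also hints at why the test in the question showed no error: data.frame() repairs duplicate names by default (check.names = TRUE), so that test never contained two columns with the same name. The small data frames here are made up for illustration.

```r
# data.frame() makes names unique by default (check.names = TRUE),
# so this call does not end up with two columns called "w1"
test <- data.frame("w1" = 1:6, "w1.1" = 2:7, "w1" = 3:8)
print(names(test))
print(anyDuplicated(names(test)))   # 0 means no duplicated names

# Force a real duplicate to show how to detect and repair it
dup_df <- data.frame(a = 1:3, b = 4:6, c = 7:9)
names(dup_df) <- c("w1", "w2", "w1")

# Names that occur more than once
dup_names <- unique(names(dup_df)[duplicated(names(dup_df))])
print(dup_names)

# make.unique() appends numeric suffixes to the repeated names
names(dup_df) <- make.unique(names(dup_df))
print(names(dup_df))
```

For a matrix, the same check can be run on colnames() instead of names().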

'x' must be numeric R error when reading from file

I am trying to do Hartigan's diptest in R, however, I get the following error: 'x' must be numeric.
Apologies for such a basic question, but how do I ensure that the data that I load is numeric?
If I make up a set of values as follows (code below), the diptest works without problems:
library(diptest)
x = c(1,2,3,4,4,4,4,4,4,4,5,6,7,8,9,9,9,9,9,9,9,9,9)
hist(x)
dip.test(x)
But for example, when the same values are saved in an Excel file/tab delimited .txt file (saved as one column of values), and imported into R, when I run the diptest the 'x' must be numeric error occurs.
x <- read.csv("x.csv") # comes back in as a data frame
hist(x)
dip.test(x)
Is there a way to check what format the imported data from an Excel/.txt file is in R, and subsequently change it to be numeric? Thanks again.
Any help will be much appreciated thank you.
Here's what's happening. If you run the code that you know works, it's working because the data class is numeric as it should be. When you read it back in it's a data.frame, however. So you need to point to the numeric element of the data.frame:
library(diptest)
x = c(1,2,3,4,4,4,4,4,4,4,5,6,7,8,9,9,9,9,9,9,9,9,9)
write.csv(x, "x.csv", row.names = FALSE)
x <- read.csv("x.csv") # comes back in as a data frame
hist(x$x)
dip.test(x$x)
Hartigans' dip test for unimodality / multimodality
data: x$x
D = 0.15217, p-value = 2.216e-05
alternative hypothesis: non-unimodal, i.e., at least bimodal
If you were to save the file to a .RDS instead of .csv then you could avoid this problem.
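A small sketch of the .RDS round trip, assuming a temporary file path chosen only for this illustration. Unlike a CSV, an RDS file stores the R object itself, so a numeric vector comes back as a numeric vector.

```r
x <- c(1, 2, 3, 4, 4, 4, 5, 6, 7, 8, 9, 9, 9)

# Temporary file used only for this illustration
rds_path <- tempfile(fileext = ".rds")

saveRDS(x, rds_path)        # store the object as-is
x_back <- readRDS(rds_path) # read the same object back

print(is.numeric(x_back))
print(identical(x, x_back))
```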
You could also check whether the first column of your data frame contains any values that do not start with a digit, as follows:
which(!grepl('^[0-9]',your_data_frame[[1]]))
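To answer the "what format is it" part more directly, here is a sketch that inspects the class of imported data and converts it. The inline text stands in for a one-column file; the header `x` and the stray "n/a" entry are assumptions added to show what a non-numeric value does.

```r
# Inline stand-in for a one-column file; the "n/a" entry is made up
raw <- read.csv(text = "x\n1\n2\nn/a\n4", stringsAsFactors = FALSE)

print(class(raw))          # the import is a data frame
print(sapply(raw, class))  # class of each column

# Whole-string check for entries that do not look like plain numbers
bad_rows <- which(!grepl("^-?[0-9]+(\\.[0-9]+)?$", raw$x))
print(raw$x[bad_rows])

# Convert to numeric; entries that cannot be parsed become NA
x_num <- suppressWarnings(as.numeric(raw$x))
print(x_num)
```

Once the column is numeric, it can be passed to dip.test() as a vector, for example dip.test(x_num[!is.na(x_num)]).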

CSV on mac doesn't recognize numbers

I am trying to do some analyses with R and ran into problems on my MacBook. When I read the csv file into R, it tells me that the numeric values are empty.
This is what it looks like when I try to run an analysis:
pall.values = numeric()
df <- data.frame(Estimate=numeric(50), P.value=numeric(50))
for (i in 2:51){ #change the column range based on your data sheet
x <- cor.test(fdata[,c(i)],fdata$ARHQ_dad)
df$Estimate[i-1]=x$estimate # i runs over 2:51, so i-1 indexes rows 1..50
df$P.value[i-1]=x$p.value
}
Then this error occurs:
Error in cor.test.default(fdata[, c(i)], fdata$ARHQ_dad) : 'x' must be a numeric vector
The csv itself recognizes the values as numbers since I used the Data to Column function and divided the columns by spaces and commas. However, in R it doesn't seem to work...
I hope somebody knows the answer to this (simple) problem.
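One way to narrow this down is to ask R which columns it did not read as numeric. The sketch below uses a made-up stand-in for fdata; the column names item1 and item2 and the decimal commas are assumptions, since the question does not show the file. Decimal commas are only one possible cause.

```r
# Hypothetical data standing in for fdata; `item1` holds decimal commas
fdata <- data.frame(
  ARHQ_dad = c(1.5, 2.0, 3.5, 4.0),
  item1    = c("1,2", "2,4", "3,1", "4,8"),
  item2    = c(2, 4, 6, 9),
  stringsAsFactors = FALSE
)

# Which columns are not numeric?
numeric_cols <- sapply(fdata, is.numeric)
non_numeric  <- names(fdata)[!numeric_cols]
print(non_numeric)

# Example repair for decimal commas: swap "," for "." and convert
fdata$item1 <- as.numeric(sub(",", ".", fdata$item1, fixed = TRUE))

# cor.test() now accepts the column
print(cor.test(fdata$item1, fdata$ARHQ_dad)$estimate)
```

If the file turns out to use semicolons as separators and commas as decimal marks, read.csv2() reads that layout directly.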

eqmcc function in R QCA package exiting with error

When I attempt to call eqmcc() against a truthTable object, the result is this error message:
Error: The outcome's length should be the same as the number of rows in the data.
Here's my script:
library(QCA); library (psych); library(readr)
gamson <- read_csv("/path/to/Gamson.csv", col_names = TRUE)
is.na(gamson)
ttACP2 <- truthTable(data=gamson, outcome = "ACP", conditions = "BUR, LOW, DIS, HLP", n.cut=3, incl.cut=0.750, sort.by="incl, n", complete=FALSE, show.cases=TRUE)
ttACP2
csACP2 <- eqmcc(ttACP2, details=TRUE, show.cases=TRUE, row.dom=TRUE, all.sol=FALSE, use.tilde=FALSE)
The is.na() function shows that there are no missing values in my data set. The data set contains 54 rows, of which the first is the column names. The truth table is generated according to expectations. But the minimization of the selected causal conditions fails.
I found a chunk of source code that matches the error message on line 90 here:
https://github.com/cran/QCApro/blob/master/R/pof.R
But I'm not competent enough in programming to understand what conditions lead to the error message being thrown.
This is because your dataset is a tibble instead of a dataframe. After loading the dataset, and before finding the truth table, do this:
gamson <- as.data.frame(gamson)
It should work after that. (In the latest version of the package, the eqmcc function is called minimize.)
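One plausible reason the tibble matters, not confirmed against the QCA source: selecting a single column from a classic data frame gives a vector with one entry per row, while the same selection on a tibble stays a one-column table whose length() is 1. The sketch below uses made-up values and mimics the tibble behaviour in base R with drop = FALSE; the tibble part runs only if that package is installed.

```r
# Hypothetical stand-in for the Gamson data; the values are made up
gamson_df <- data.frame(ACP = c(1, 0, 1, 1), BUR = c(1, 1, 0, 1))

# A classic data frame drops a single column to a plain vector
outcome_vec <- gamson_df[, "ACP"]
print(length(outcome_vec))   # one entry per row

# A tibble keeps it as a one-column table, mimicked here with drop = FALSE
outcome_tbl <- gamson_df[, "ACP", drop = FALSE]
print(length(outcome_tbl))   # the number of columns

# The same comparison with a real tibble, if the package is available
if (requireNamespace("tibble", quietly = TRUE)) {
  gamson_tbl <- tibble::as_tibble(gamson_df)
  print(length(gamson_tbl[, "ACP"]))
  print(class(as.data.frame(gamson_tbl)))
}
```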
