Does the C5.0 algorithm work only on categorical datasets? - r

I found some sample code that uses the iris data set in R.
I want to use the same code with another data set (the heart disease dataset), which has only numerical values. Will that work?

Make sure your data doesn't contain missing values. If values are missing, R will throw an error at the model-building stage, so if some data points are missing, you should probably impute them.
Also make sure your output (class) variable is categorical.
If it is a binary classification problem and the labels are 0 and 1, encode those 0s and 1s as proper text labels and then convert them to a factor.
Example of encoding numbers as categories:
data$class <- ifelse(data$class==0,"not_found","found")
data$class <- factor(data$class, levels=c("found","not_found"))
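
As a rough, self-contained sketch of the encoding step described above: the data frame and its `age` column are made up for illustration and are not the heart disease dataset. It also counts missing values before any model is built.

```r
# Hypothetical toy data: `class` holds 0/1 labels, `age` is a numeric predictor.
data <- data.frame(age = c(63, 41, 57, 50), class = c(0, 1, 1, 0))

# Count missing values across the whole data frame before modelling.
n_missing <- sum(is.na(data))

# Map 0/1 to text labels, then convert to a factor with an explicit level order.
data$class <- ifelse(data$class == 0, "not_found", "found")
data$class <- factor(data$class, levels = c("found", "not_found"))

str(data$class)
```

Numeric predictors such as `age` stay numeric; only the class column is turned into a factor here.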

Related

How to deal with NaN values in R?

I'm testing for random intercepts as a preparation for growth curve modeling.
Therefore, I first created a wide subset and then converted it to a long data set.
When I calculate ModelM1 <- gls(ent_act~1, data=school_l) with the long data set, I get an error message because I have missing values. In my long subset these values are stated as NaN.
After applying temp<-na.omit(school_l$ent_act), I can calculate ModelM1. But when I calculate ModelM2 <- lme(temp~1, random=~1|ID, data=school_l), I get an error message saying my variables are of unequal lengths.
How can I deal with those missing values?
Any ideas or recommendations?
What might work is to make a temp data frame in which you remove entire rows, indexing by the negation of the missing condition: !is.na(school_l$ent_act)
temp<-school_l[ !is.na(school_l$ent_act), ]
Then re-run the lme call on temp. There should now be no mismatch of variable lengths.
ModelM2 <- lme(ent_act ~1, random= ~1|ID, data=temp)
Note that using school_l is going to be potentially confusing because it looks so much like school_1 when viewed in Times font.
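
To illustrate why the lengths differ, here is a minimal sketch on made-up data; the values and IDs are hypothetical, and only the column names follow the question. Calling na.omit() on a single column returns a shorter vector, while row-wise subsetting keeps ID and ent_act aligned.

```r
# Hypothetical long-format data with NaN and NA in the response.
school_l <- data.frame(
  ID      = c(1, 1, 2, 2, 3, 3),
  ent_act = c(2.5, NaN, 3.1, 4.0, NA, 1.8)
)

# na.omit() on one column yields a vector shorter than the ID column.
short_vec <- na.omit(school_l$ent_act)
length(short_vec)   # shorter than nrow(school_l)

# Dropping whole rows keeps every column the same length.
# is.na() is TRUE for both NA and NaN.
temp <- school_l[!is.na(school_l$ent_act), ]
nrow(temp)
```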

R bnlearn - parameter learning with naive.bayes() check.data() error

I have a graph structure, determined from another method, and I want to do parameter learning. The bnlearn methods, however, seem to do parameter learning directly on the dataset (strictly in a dataframe). I have two questions: how do I do parameter learning from an igraph or graphNEL structure with bnlearn?
Second question: I am getting a check.data() error when I try to do parameter learning using my dataset. Their example code works, and I can't understand why my dataset does not. Their code and my reproducible example are below.
Here is their example code:
require(bnlearn)
require(Rgraphviz)
data(learning.test)
bn <- naive.bayes(learning.test, "A")
pred <- predict(bn, learning.test)
table(pred, learning.test[,"A"])
My reproducible example (errors on naive.bayes() call):
require(bnlearn)
require(Rgraphviz)
data <- matrix(sample.int(200, 61*252, TRUE), nrow=252, ncol=61)
data <- as.data.frame(matrix(as.numeric(as.matrix(data)), ncol=ncol(data),
                             byrow=TRUE))
bn <- naive.bayes(data, names(data)[1])
Error message:
Error in check.data(data, allowed.types = discrete.data.types) :
valid data types are:
* all variables must be unordered factors.
* all variables must be ordered factors.
* variables can be either ordered or unordered factors.
I do not think this error comes from detecting integers, because when I cast my data to a dataframe, I first cast it to numeric, because other methods in bnlearn require numeric or factored data. This dataset IS count data, but I want to use the method assuming I am using continuous datasets. Does this make sense?

R - convert from categorical to numeric for KNN

I'm trying to use the caret package in R to apply KNN to the "abalone" database from UCI Machine Learning (link to the data). But it doesn't allow KNN when there are categorical values.
How do I convert the categorical values (in this database: "M", "F", "I") to numeric values, such as 1, 2, 3, respectively?
The first answer seems like a really bad idea. Coding {"M","F","I"} to {1, 2, 3} implies that Infant = 3 * Male, Male = Female/2 and so on.
KNN via caret does allow categorical values as predictors if you use the formula methods. Otherwise you need to encode them as binary dummy variables.
Also, showing your code and having a reproducible example would help a lot.
Max
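
As a hedged illustration of the dummy-variable encoding mentioned above, the following sketch uses base R's model.matrix() on a tiny made-up data frame. The column names Sex and Length and the values are assumptions for the example, not the real abalone file.

```r
# Hypothetical toy data with a three-level categorical predictor.
abalone <- data.frame(
  Sex    = factor(c("M", "F", "I", "M")),
  Length = c(0.455, 0.530, 0.330, 0.440)
)

# One indicator column per level; "- 1" drops the intercept so that
# all three levels get their own 0/1 column.
dummies <- model.matrix(~ Sex - 1, data = abalone)

# Combine the indicator columns with the numeric predictors.
encoded <- cbind(as.data.frame(dummies), Length = abalone$Length)
encoded
```

Each row has exactly one indicator set to 1, so no artificial ordering between "M", "F" and "I" is implied.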
When the data are read in via read.table with stringsAsFactors = TRUE (the default before R 4.0.0), the character data in the first column are factors. Then
data$iGender = as.integer(data$Gender)
would work. If they are character, a detour via factor is easiest:
data$iGender= as.integer(as.factor(data$Gender))
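
One detail worth checking with this approach: the integer codes follow the factor's level order, which is alphabetical by default, so "M", "F", "I" do not map to 1, 2, 3 unless the levels are given explicitly. A small sketch on a made-up vector:

```r
# Hypothetical character vector of the three categories.
Gender <- c("M", "F", "I", "M")

# Default levels are sorted alphabetically: F = 1, I = 2, M = 3.
default_codes <- as.integer(as.factor(Gender))
default_codes   # 3 1 2 3

# Explicit levels give M = 1, F = 2, I = 3.
custom_codes <- as.integer(factor(Gender, levels = c("M", "F", "I")))
custom_codes    # 1 2 3 1
```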
One of the easiest ways to use the kNN algorithm on a dataset in which one feature is categorical ("M", "F" and "I", as you mentioned) is as follows:
In the CSV or Excel file where your dataset exists, go to the relevant column and change M to 1, F to 2 and I to 3. You then have discrete values in your dataset and can easily use the kNN algorithm in R.
You can simply read the file with stringsAsFactors = TRUE
Example
data_raw<-read.csv('...../credit-default.csv', stringsAsFactors = TRUE)
stringsAsFactors = TRUE converts the character columns to factors, which R stores internally as integer codes.
Try using the knncat package in R, which converts categorical variables into numerical counterparts.

CCA Analysis: Error in weighted.mean.default(newX[, i], ...) : 'x' and 'w' must have the same length

I'm very new to R and this might be a very silly question to ask but I'm quite stuck right now.
I'm currently trying to do a Canonical Correspondence Analysis on my data to see which environmental factors have more weight on community distribution. I'm using the vegan package. My data consists of a table for the environmental factors (dataset EFamoA) and another for an abundance matrix (dataset AmoA). I have 41 soils, with 39 environmental factors and 334 species.
After cleaning my data of any variables which are not numerical, I try to perform the cca analysis using the formula notation:
CCA.amoA <- cca (AmoA ~ EFamoA$PH + EFamoA$LOI, data = EFamoA,
scale = TRUE, na.action = na.omit)
But then I get this error:
Error in weighted.mean.default(newX[, i], ...) :
'x' and 'w' must have the same length
I don't really know where to go from here and haven't found much about this problem anywhere (which leads me to think that it must be some very basic mistake I'm making). My environmental factor data is not standardized, as I read in the cca help file that the algorithm does it, but maybe I should standardize it beforehand? (I've also read that scale = TRUE is only for species.) Should I convert the data into matrices?
I hope I made my point clear enough as I've been struggling with this for a while now.
Edit: My environmental data has NA values
Alright, so I was able to figure it out by myself, and it was indeed a silly thing. It turns out my abundance data had soils as columns and species as rows, while the environmental factor (EF) data had soils as rows and EFs as columns.
Using t(), I transposed my abundance data.frame (which also converted it into a matrix), and cca() worked (because the "length" was then the same, I assume). Transposing the data separately and loading it already transposed works too.
The t() approach saves creating a whole new file (in case your data is organized with different rows, as in my case), but it converts the data into a matrix, which might not be desired in some cases. Either way, this turned out to be a very simple and obvious thing to solve (it took me a while, though).
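
A minimal sketch of that orientation fix, using tiny made-up tables: the object names follow the question, but the values, species names and soil names are hypothetical. It shows that t() returns a matrix and how to get a data.frame back if one is needed.

```r
# Hypothetical abundance table: species as rows, soils as columns.
AmoA_raw <- data.frame(
  soil1 = c(5, 0, 2),
  soil2 = c(1, 3, 0),
  row.names = c("sp1", "sp2", "sp3")
)

# Hypothetical environmental table: soils as rows, factors as columns.
EFamoA <- data.frame(
  PH  = c(6.1, 7.2),
  LOI = c(3.4, 5.0),
  row.names = c("soil1", "soil2")
)

# Transpose so that soils are rows in both tables; t() returns a matrix.
AmoA <- t(AmoA_raw)
is.matrix(AmoA)

# Convert back to a data.frame if a matrix is not wanted.
AmoA_df <- as.data.frame(AmoA)

# Both tables should now have one row per soil, in the same order.
same_rows <- identical(rownames(AmoA), rownames(EFamoA))
same_rows
```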

Bandwidth selection using NP package

I'm new to R and having a problem with a very simple task! I have read a few columns of .csv data into R; the variables take values in the natural numbers plus zero and have missing values. After trying to use the nonparametric np package, I have two problems. First, if I use the simple command bw=npregbw(ydat=y, xdat=x, na.omit), where x and y are column vectors, I get the error "number of regression data and response data do not match". Why do I get this, when I have the same number of elements in each vector?
Second, I would like to treat the data as ordered and tell npregbw this, using the command bw=npregbw(ydat=y, xdat=ordered(x)). When I do that, I get the error that x must be atomic for sort.list. But how is x not atomic? It is just a vector of natural numbers and NAs.
Any clarifications would be greatly appreciated!
1) You probably have a different number of NA's in y and x.
2) Can't be sure about this, since there is no example. If it is of following type:
x <- c(3,4,NA,2)
Then ordered(x) should work fine. Please provide an example of your case.
EDIT: You of course tried bw=npregbw(ydat=y, xdat=x)? ordered() makes your vector an ordered factor (see ?ordered), which is not an atomic vector (see 2.1.1 link and ?factor)
EDIT2: So the problem was the way the data were subset. Note the difference between the various ways of subsetting: data$x and data[,i] (where i is the column number of column x) give you vectors, while data[c("x")] and data[i] give a data frame. Functions expect vectors unless they take a data = (your data) argument, in which case they work with column names.
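
The subsetting difference can be checked directly. The sketch below uses a small made-up data frame (the values are hypothetical) and also shows one way to keep only the rows where both x and y are present, so the two vectors stay the same length.

```r
# Hypothetical data with NAs in different positions of x and y.
data <- data.frame(x = c(3, 4, NA, 2), y = c(1, NA, 5, 6))

# These two forms return plain (atomic) vectors.
v1 <- data$x
v2 <- data[, 1]

# These two forms return one-column data frames.
d1 <- data["x"]
d2 <- data[1]

# ordered() works on the atomic vector; NA stays NA.
x_ord <- ordered(v1)

# Keep only positions where both x and y are non-missing.
ok   <- complete.cases(data$x, data$y)
x_ok <- data$x[ok]
y_ok <- data$y[ok]
```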
