R - convert from categorical to numeric for KNN - r

I'm trying to use the caret package in R to apply KNN to the "abalone" dataset from the UCI Machine Learning Repository (link to the data), but it doesn't allow KNN when there are categorical values.
How do I convert the categorical values (in this dataset: "M", "F", "I") to numeric values, such as 1, 2, 3, respectively?

The first answer seems like a really bad idea. Coding {"M","F","I"} to {1, 2, 3} implies that Infant = 3 * Male, Male = Female/2 and so on.
KNN via caret does allow categorical values as predictors if you use the formula methods. Otherwise you need to encode them as binary dummy variables.
Also, showing your code and having a reproducible example would help a lot.
Max
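As a minimal sketch of the dummy-variable route mentioned above, base R's model.matrix can expand a factor into one 0/1 column per level. The data frame, its values and the column names Gender and Length are made up for illustration and are not taken from the asker's code.

```r
# Hypothetical stand-in for a few rows of the abalone data
abalone <- data.frame(
  Gender = factor(c("M", "F", "I", "M")),
  Length = c(0.455, 0.530, 0.330, 0.440)
)

# "- 1" drops the intercept, so every level gets its own 0/1 column
gender_dummies <- model.matrix(~ Gender - 1, data = abalone)

# Combine the dummy columns with the numeric predictor
encoded <- cbind(as.data.frame(gender_dummies), Length = abalone$Length)
print(encoded)
```

Each row has exactly one 1 among the Gender columns, so no artificial ordering between "M", "F" and "I" is introduced.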

When data are read in via read.table with stringsAsFactors = TRUE (the default before R 4.0.0), the data in the first column are factors. Then
data$iGender <- as.integer(data$Gender)
would work. If they are character, a detour via factor is easiest:
data$iGender <- as.integer(as.factor(data$Gender))
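One detail worth checking with this approach: factor levels are sorted alphabetically by default, so the codes come out as F = 1, I = 2, M = 3. The sketch below, using made-up values, shows how explicit levels would give the M = 1, F = 2, I = 3 mapping asked for in the question.

```r
# Hypothetical character column
gender <- c("M", "F", "I", "M")

# Default: levels are sorted, so F = 1, I = 2, M = 3
default_codes <- as.integer(as.factor(gender))

# Explicit level order: M = 1, F = 2, I = 3
custom_codes <- as.integer(factor(gender, levels = c("M", "F", "I")))

print(default_codes)
print(custom_codes)
```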

One of the easiest ways to use the kNN algorithm on a dataset in which one feature is categorical ("M", "F" and "I", as you mentioned) is as follows:
In the CSV or Excel file that holds your dataset, go to the relevant column and change M to 1, F to 2 and I to 3. You then have discrete values in your dataset and can easily use the kNN algorithm in R.

You can simply read the file with stringsAsFactors = TRUE.
Example:
data_raw <- read.csv('...../credit-default.csv', stringsAsFactors = TRUE)
stringsAsFactors = TRUE converts the character columns to factors, which R stores internally as integer codes.
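A small sketch of what this does, reading from an inline string instead of a file (the CSV content is invented for illustration): the column becomes a factor, and the integer codes still have to be extracted explicitly with as.integer.

```r
# Hypothetical CSV content, read from a string rather than a file
csv_text <- "Gender,Length\nM,0.455\nF,0.530\nI,0.330"
d <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# The column is a factor, not a numeric vector
print(class(d$Gender))

# The underlying integer codes follow the sorted level order
gender_levels <- levels(d$Gender)
gender_codes <- as.integer(d$Gender)
print(gender_levels)
print(gender_codes)
```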

Try using the knncat package in R, which converts categorical variables into numerical counterparts.
Here's the link for the package

Related

Is there an R function for converting a dataset into the appropriate format for a Latent Class Analysis?

I'm a beginner with LCA, so I apologise if this is a basic question; I tried to Google it but could not find an answer. I want to conduct an LCA on a dataset I have (in Excel). Some of the variables are binary (female/male, literate/illiterate...) and some have multiple levels (married, single, widowed, divorced). I understand that I can convert the binary variables into 1s and 0s, but I don't know what to do about the ones with multiple levels. Is there an R function that will convert them to a format accepted by poLCA? I would appreciate any help.
Dummy encoding in R is done implicitly, so you can just name columns of the table which aren't numbers:
library(poLCA)
data(election)
poLCA(cbind(AGE, PARTY, CARESB) ~ GENDER, data = election)
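For data imported from Excel as text, a minimal base-R sketch of the conversion could look like the following. It assumes, as far as I understand the poLCA documentation, that manifest variables should be factors or consecutive integers starting at 1; the data frame and column names are hypothetical.

```r
# Hypothetical survey data with text categories
survey <- data.frame(
  sex = c("female", "male", "female", "male"),
  marital = c("married", "single", "widowed", "divorced"),
  stringsAsFactors = FALSE
)

# Convert every column to a factor
survey[] <- lapply(survey, factor)

# Integer codes run from 1 to the number of levels in each column
codes <- as.data.frame(lapply(survey, as.integer))
print(codes)
```

Binary variables coded this way become 1 and 2 rather than 0 and 1.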

Using missForest in R with categorical variables

I am trying to use the missForest package to impute missing data in a fairly large dataset. Most of my variables are categorical with many levels. When I run missForest, it imputes decimal values and sometimes even negative values. Obviously, I'm doing something wrong. Here is my process:
FIRST TRY: Entering predictor data directly. I got decimal values imputed into my dataset. I know that missForest only takes matrices, but I'm not sure how to force it to recognize which columns are factors. Someone on another post recommended dummy coding, so I tried that next, with the same results. Code is below.
SECOND TRY: Dummy coding each predictor (so time consuming) and then running that.
homt_sub_dummy<-homt_sub[c("Psyprob.yes", "Psyprob.no","SUB2.2.0", "SUB2.2.1", "SUB2.2.2", "SUB2.2.3", "SUB2.2.4", "SUB2.2.5", "SUB2.2.6", "SUB2.2.7","Freq1.1", "Freq1.2", "Freq1.3", "Freq1.4","FRSTUSE1.0", "FRSTUSE1.1", "FRSTUSE1.2", "FRSTUSE1.3", "FRSTUSE1.4", "FRSTUSE1.5", "FRSTUSE1.6","FRSTUSE1.7", "FRSTUSE1.8", "FRSTUSE1.9", "FRSTUSE1.10", "FRSTUSE1.11","Freq2.1", "Freq2.2", "Freq2.3", "Freq2.4","AGEcont","Gender_male", "Gender_female", "Race2.0", "Race2.1", "Race2.2", "Arrests.0", "Arrests.1", "Arrests.2")]
homt_dummy_matrix<-data.matrix(homt_sub_dummy, rownames.force = NA)
homt_dummp.imp <- missForest(homt_dummy_matrix, verbose= TRUE, maxiter = 3, ntree = 20)
homt_dummy.imp.df<-as.data.frame(homt_dummp.imp$ximp)
View(homt_dummy.imp.df)
This is a chunk of the data.frame I saved with the imputed values.
Any help would be appreciated. I'm pretty new to imputation. I wanted to compare results of MICE with this but I just can't seem to get missForest to work!!!
You can use the as.factor function to change the class of the columns you want. For example:
cleveland_t <- transform(cleveland,
  V2 = as.factor(V2), V3 = as.factor(V3), V6 = as.factor(V6),
  V7 = as.factor(V7), V9 = as.factor(V9), V11 = as.factor(V11),
  V12 = as.factor(V12), V13 = as.factor(V13), V14 = as.factor(V14))
Then use sapply(cleveland_t, class) to check the class of each column.
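A compact variant of the same idea, sketched on a made-up data frame (column names borrowed from the question, values invented): convert a named set of columns with lapply and confirm the result with sapply. The final missForest call is left commented out and assumes the package is installed and accepts a data frame with factor columns.

```r
# Hypothetical data with missing values
df <- data.frame(
  Psyprob = c("yes", "no", NA, "yes"),
  Race2 = c(0, 1, 2, NA),
  AGEcont = c(34, 51, NA, 28),
  stringsAsFactors = FALSE
)

# Columns that should be treated as categorical
cat_cols <- c("Psyprob", "Race2")
df[cat_cols] <- lapply(df[cat_cols], as.factor)

# Check the class of every column
col_classes <- sapply(df, class)
print(col_classes)

# Assumption: with missForest installed, the data frame is passed as-is
# imp <- missForest::missForest(df)
```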

Does the C5.0 algorithm work only on categorical datasets?

I found some sample code that uses the iris data set in R.
I want to use the same code with another data set (a heart disease dataset) that has only numerical values. Will that work?
Make sure your data doesn't contain missing values. If values are missing, R will throw an error at the model-building stage, so if some data points are missing you should probably impute them first.
Also make sure your output (class) variable is categorical.
Also, if it's a binary classification problem and the labels are 0 and 1, make sure you encode those 0s and 1s as proper text labels and then convert them into a factor.
Example of encoding numbers as categories:
data$class <- ifelse(data$class == 0, "not_found", "found")
data$class <- factor(data$class, levels = c("found", "not_found"))
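The two steps above, run on a tiny made-up data frame (the age values and labels are invented), look like this. The numeric predictor stays numeric and only the class column becomes a factor; the C5.0 call is commented out and assumes the C50 package is installed.

```r
# Hypothetical data: one numeric predictor and a 0/1 class label
data <- data.frame(
  age = c(63, 41, 57, 45),
  class = c(0, 1, 1, 0)
)

# Replace 0/1 with text labels, then convert to a factor
data$class <- ifelse(data$class == 0, "not_found", "found")
data$class <- factor(data$class, levels = c("found", "not_found"))

str(data)

# Assumption: with C50 installed, the formula interface would be used like this
# model <- C50::C5.0(class ~ ., data = data)
```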

Dummy Package in R

Could someone help?
I am using the dummy package in R (function dummy) to convert a categorical variable (10 categories) into dummy variables, because some of the algorithms I am using (adaboost and rotation forest) don't handle categorical variables well.
After using the package I get 10 dummy variables, but they are factors. I expected them to be numeric, with 1s and 0s.
Should I convert them to numeric, or use them as factors?
Thanks a lot!
all the best
Pedro
After one-hot encoding, it makes no difference whether you keep the variables as factors or as numeric. It's better not to one-hot encode for tree-based models, since it will decrease performance. Here is an article describing the effect of one-hot encoded variables. It is better to pass the categorical variables as factors.
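If numeric 0/1 columns are still wanted, one pitfall is that as.numeric on a factor returns the internal level codes (1 and 2), not the labels. A minimal sketch with hypothetical dummy columns, converting through as.character first:

```r
# Hypothetical dummy columns stored as factors with labels "0" and "1"
d <- data.frame(
  cat_A = factor(c("1", "0", "0")),
  cat_B = factor(c("0", "1", "0"))
)

# Pitfall: this returns the internal level codes, not 0/1
wrong <- as.numeric(d$cat_A)

# Going through as.character recovers the original 0/1 values
d_num <- as.data.frame(lapply(d, function(x) as.numeric(as.character(x))))

print(wrong)
print(d_num)
```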

How to convert character/factor to integer?

I know that has been asked quite frequently. However, by applying the previous advice I'm still confused about two things.
How to convert from multinomial values to integers?
How to get the integer back to the factor/character after the analysis?
library(car)
data(Prestige)
View(Prestige)
# here I convert directly from character, which seems quite useless
Prestige$TYPE <- as.numeric(levels(Prestige$type))
# here I generate factors
Prestige$type <- as.factor(Prestige$type)
# and try to convert afterwards; doesn't work either
Prestige$TYPE <- as.numeric(levels(Prestige$type))
Basically, I would like to extract the three levels in type without renaming it manually.
A vector with class factor has an attribute called levels. The levels function acts on that attribute and not on the vector itself.
library(car)
data(Prestige)
length(Prestige$type) # 102
levels(Prestige$type) # Notice that this has length 3.
If you want the numeric values for the vector, use
as.numeric(Prestige$type)
What was bc is now 1, what was prof is now 2, and what was wc is now 3.
If you need to reconstitute the factor from the numeric codes, use
factor(as.numeric(Prestige$type), levels = 1:3, labels = c("bc", "prof", "wc"))
But as a general rule, it's better not to alter your factors unless you need to alter the categories. If you need the numerical codes underlying the data, make a new variable:
Prestige$type_numeric <- as.numeric(Prestige$type)
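The round trip can be sketched without the car package on a small made-up factor. Reading the labels from levels(f) avoids typing the three level names by hand, which is what the question asks for.

```r
# Hypothetical factor with the same three levels as Prestige$type
f <- factor(c("bc", "prof", "wc", "prof", NA))

# Factor to integer codes (NA stays NA)
codes <- as.integer(f)

# Integer codes back to a factor, reusing the original level labels
restored <- factor(codes, levels = seq_along(levels(f)), labels = levels(f))

print(codes)
print(restored)
```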
