Dummy Package in R - r

Could someone help?
I am using the dummy package in R (function dummy) to convert a categorical variable (10 categories) into dummy variables, because some of the algorithms I am using (AdaBoost and rotation forest) don't handle categorical variables well.
After using the package I get 10 dummy variables, but they are factors. I expected them to be numeric with 1s and 0s.
Should I convert them to numeric, or use them as factors?
Thanks a lot!
all the best
Pedro

After one-hot encoding, it makes no difference whether you keep the dummy variables as factors or as numeric. That said, it is better not to one-hot encode for tree-based models, since it can decrease performance. Here is an article describing the effect of one-hot encoded variables. It is better to pass the categorical variables in as factors.
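If numeric 0/1 columns are wanted anyway, the conversion could look roughly like the sketch below. The data frame `df` and its column names are made up for illustration and are not the output of the dummy package. The point it shows is that as.numeric() on a factor returns the internal level codes (1 and 2 here), so the factor has to go through as.character() first.

```r
# Hypothetical data: two dummy columns stored as factors with levels "0" and "1"
df <- data.frame(
  cat_a = factor(c(1, 0, 0, 1)),
  cat_b = factor(c(0, 1, 0, 0))
)

# Pitfall: as.numeric() on a factor gives the level codes, not the labels
codes <- as.numeric(df$cat_a)

# Going through character recovers the original 0/1 values
df_num <- as.data.frame(lapply(df, function(x) as.numeric(as.character(x))))
```

Here `codes` holds 1s and 2s, while every column of `df_num` is numeric with 0s and 1s.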

Related

Is there an R function for converting a dataset into the appropriate format for a Latent Class Analysis?

I'm a beginner with LCA, so I apologise if this is a basic question; I tried to google it but could not find an answer. I want to conduct an LCA on a dataset I have (in Excel). Some of the variables are binary (female/male, literate/illiterate...) and some have multiple levels (married, single, widowed, divorced). I understand that I can convert the binary variables into 1s and 0s, but I don't know what to do about the ones with multiple levels. Is there an R function that will convert them to a format accepted by poLCA? I would appreciate any help.
Dummy encoding in R is done implicitly, so you can just name columns of the table which aren't numbers:
library(poLCA)
data(election)
poLCA(cbind(AGE, PARTY, CARESB) ~ GENDER, data = election)

Does the C5.0 algorithm work only on categorical datasets?

I found some sample code using the iris data set in R.
I want to use the same code with another data set (a heart disease dataset) that has only numerical values. Will that work?
Make sure your data doesn't contain missing values. If values are missing, the model-building step may throw an error, so if some data points are missing you should probably try imputing them.
Also make sure your output/class variable is categorical.
Also, if it's a binary classification problem and the labels are 0 and 1, make sure you encode those 0s and 1s as proper text labels and then convert them into a factor.
Example of encoding numbers as categories:
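# map 0 to "not_found" and everything else to "found"
data$class <- ifelse(data$class == 0, "not_found", "found")
# factor() accepts a levels argument; as.factor() does not
data$class <- factor(data$class, levels = c("found", "not_found"))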

How to convert character/factor to integer?

I know that has been asked quite frequently. However, by applying the previous advice I'm still confused about two things.
How to convert from multinomial values to integers?
How to get the integer back to the factor/character after the analysis?
library(car)
data(Prestige)
View(Prestige)
# here I convert directly from character which seems quite useless
Prestige$TYPE<-as.numeric(levels(Prestige$type))
# here I generate factors
Prestige$type<-as.factor(Prestige$type)
# and try to convert afterwards. doesnt work either
Prestige$TYPE<-as.numeric(levels(Prestige$type))
Basically, I would like to extract the three levels in type without renaming it manually.
A vector of class factor has an attribute called levels. The levels function acts on that attribute, not on the vector itself.
library(car)
data(Prestige)
length(Prestige$type) # 102
levels(Prestige$type) # Notice that this has length 3.
If you want the numeric values for the vector, use
as.numeric(Prestige$type)
What was bc is now 1, what was prof is now 2, and what was wc is now 3.
If you need to reconstitute the factor from those codes, use
factor(as.numeric(Prestige$type), levels = 1:3, labels = c("bc", "prof", "wc"))
But as a general rule, it's better not to alter your factors unless you need to alter the categories. If you need the numerical codes underlying the data, make a new variable:
Prestige$type_numeric <- as.numeric(Prestige$type)
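To avoid typing the level names by hand, the round trip can also be written against the stored levels. This is only a sketch: the vector `type` below is a made-up stand-in for Prestige$type (which also contains missing values), so it runs without the car package.

```r
# Hypothetical factor standing in for Prestige$type
type <- factor(c("prof", "bc", "wc", "bc", NA))

lev   <- levels(type)       # level labels, in alphabetical order
codes <- as.integer(type)   # integer codes; NA stays NA

# Rebuild the factor from the codes by indexing into the stored levels
restored <- factor(lev[codes], levels = lev)
```

Keeping `lev` around during the analysis is what makes the conversion reversible; the codes alone do not carry the labels.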

R - convert from categorical to numeric for KNN

I'm trying to use the caret package in R to apply KNN to the "abalone" database from UCI Machine Learning (link to the data). But it doesn't allow KNN when there are categorical values.
How do I convert the categorical values (in this database: "M","F","I") to numeric values, such as 1,2,3, respectively?
The first answer seems like a really bad idea. Coding {"M","F","I"} to {1, 2, 3} implies that Infant = 3 * Male, Male = Female/2 and so on.
KNN via caret does allow categorical values as predictors if you use the formula methods. Otherwise you need to encode them as binary dummy variables.
Also, showing your code and having a reproducible example would help a lot.
Max
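For the non-formula route, binary dummy variables can be built with base R's model.matrix. The sketch below uses a tiny made-up data frame in place of the abalone data; the column names `Sex` and `Length` are assumptions for illustration.

```r
# Hypothetical stand-in for the abalone data (column names are assumptions)
abalone <- data.frame(
  Sex    = factor(c("M", "F", "I", "M")),
  Length = c(0.455, 0.530, 0.330, 0.440)
)

# One 0/1 indicator column per level; "- 1" drops the intercept so that
# every level of this single factor gets its own column
dummies <- model.matrix(~ Sex - 1, data = abalone)

# Numeric predictor matrix that a distance-based method such as kNN can use
X <- cbind(dummies, Length = abalone$Length)
```

Unlike the 1, 2, 3 coding, no level ends up numerically "closer" to another: each pair of categories differs in exactly two indicator columns.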
When data are read in via read.table with stringsAsFactors = TRUE (the default before R 4.0.0), the character data in the first column become factors. Then
data$iGender = as.integer(data$Gender)
would work. If they are character, a detour via factor is easiest:
data$iGender= as.integer(as.factor(data$Gender))
One of the easiest ways to use the kNN algorithm on a dataset with a categorical feature ("M", "F" and "I", as you mentioned) is as follows:
In the CSV or Excel file that holds your dataset, go to the relevant column and change M to 1, F to 2 and I to 3. You then have discrete values in your dataset and can use the kNN algorithm in R.
You can simply read the file with stringsAsFactors = TRUE
Example
data_raw<-read.csv('...../credit-default.csv', stringsAsFactors = TRUE)
stringsAsFactors = TRUE stores the character columns as factors, which carry integer codes underneath.
Try the knncat package in R, which converts categorical variables into numerical counterparts.
Here's the link for the package

CCA Analysis: Error in weighted.mean.default(newX[, i], ...) : 'x' and 'w' must have the same length

I'm very new to R and this might be a very silly question to ask but I'm quite stuck right now.
I'm currently trying to do a Canonical Correspondence Analysis on my data to see which environmental factors have more weight on community distribution. I'm using the vegan package. My data consists of a table for the environmental factors (dataset EFamoA) and another for an abundance matrix (dataset AmoA). I have 41 soils, with 39 environmental factors and 334 species.
After cleaning my data of any variables which are not numerical, I try to perform the cca analysis using the formula notation:
CCA.amoA <- cca (AmoA ~ EFamoA$PH + EFamoA$LOI, data = EFamoA,
scale = TRUE, na.action = na.omit)
But then I get this error:
Error in weighted.mean.default(newX[, i], ...) :
'x' and 'w' must have the same length
I don't really know where to go from here and haven't found much regarding this problem anywhere (which leads me to think that it must be some sort of very basic mistake I'm making). My environmental factor data is not standardized, as I read in the cca help file that the algorithm does it, but maybe I should standardize it beforehand? (I've also read that scale = TRUE is only for species.) Should I convert the data into matrices?
I hope I made my point clear enough as I've been struggling with this for a while now.
Edit: My environmental data has NA values
Alright, so I was able to figure it out by myself, and it was indeed a silly thing: my abundance data had soils as columns and species as rows, while the environmental factor (EF) data had soils as rows and EFs as columns.
Using t() on my data, I transposed my data.frame (which also converted it into a matrix) and cca() worked (as the "length" was now the same, I assume). Transposing the data separately and loading it already transposed works too.
The t() approach saves creating a whole new file (in case your data was organized with different rows, as in my case), but it converts the data into a matrix, which might not be desired in some cases. Either way, this turned out to be a very simple and obvious thing to solve (it took me a while, though).
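A quick orientation check before calling cca() could look like the sketch below. It uses only base R and small made-up tables; `abund`, `env` and `abund_sites` are hypothetical names, not the AmoA and EFamoA objects from the question.

```r
# Hypothetical toy data: 3 species (rows) by 4 soils (columns)
abund <- data.frame(
  soil1 = c(5, 0, 2), soil2 = c(1, 3, 0),
  soil3 = c(0, 4, 1), soil4 = c(2, 2, 6),
  row.names = c("sp1", "sp2", "sp3")
)

# Environmental table: 4 soils (rows) by 2 variables (columns)
env <- data.frame(
  PH  = c(6.1, 5.8, 7.0, 6.5),
  LOI = c(3.2, 4.1, 2.8, 3.9),
  row.names = paste0("soil", 1:4)
)

# Soils must be rows in both tables; t() returns a matrix, so convert
# back with as.data.frame() if a data frame is wanted
abund_sites <- as.data.frame(t(abund))

# Fail early if the two tables still do not line up
stopifnot(nrow(abund_sites) == nrow(env),
          identical(rownames(abund_sites), rownames(env)))
```

Comparing the row names as well as the row counts also catches the case where both tables have the same number of soils but in a different order.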
