Impute missing data - r

I have the following dataset:
> str(train)
'data.frame': 4619 obs. of 110 variables:
$ UserID : int 1 2 5 6 7 8 9 11 12 13 ...
$ YOB : int 1938 1985 1963 1997 1996 1991 1995 1983 1984 1997 ...
$ Gender : Factor w/ 3 levels "","Female","Male": 3 2 3 3 3 2 3 3 2 2 ...
$ Income : Factor w/ 7 levels "","$100,001 - $150,000",..: 1 3 6 5 4 7 5 2 4 6 ...
$ HouseholdStatus: Factor w/ 7 levels "","Domestic Partners (no kids)",..: 5 6 5 6 6 6 6 5 5 6 ...
$ EducationLevel : Factor w/ 8 levels "","Associate's Degree",..: 1 8 1 7 4 5 4 3 7 4 ...
$ Party : Factor w/ 6 levels "","Democrat",..: 3 2 1 6 1 1 6 3 6 2 ...
$ Happy : int 1 1 0 1 1 1 1 1 0 0 ...
$ Q124742 : Factor w/ 3 levels "","No","Yes": 2 1 2 1 2 3 1 2 2 1 ...
$ Q124122 : Factor w/ 3 levels "","No","Yes": 1 3 3 3 2 3 1 3 3 1 ...
$ Q123464 : Factor w/ 3 levels "","No","Yes": 2 2 2 3 2 2 1 2 2 1 ...
$ Q123621 : Factor w/ 3 levels "","No","Yes": 2 3 3 2 2 1 1 3 2 1 ...
$ Q122769 : Factor w/ 3 levels "","No","Yes": 2 2 2 1 3 1 1 2 2 2 ...
$ Q122770 : Factor w/ 3 levels "","No","Yes": 3 2 2 3 3 1 1 2 3 3 ...
$ Q122771 : Factor w/ 3 levels "","Private","Public": 3 3 2 2 3 3 1 3 3 3 ...
$ Q122120 : Factor w/ 3 levels "","No","Yes": 2 2 2 2 2 3 1 2 2 2 ...
$ Q121699 : Factor w/ 3 levels "","No","Yes": 3 3 3 2 2 3 2 3 3 2 ...
$ Q121700 : Factor w/ 3 levels "","No","Yes": 2 3 2 2 3 3 2 2 2 2 ...
$ Q120978 : Factor w/ 3 levels "","No","Yes": 1 3 2 3 3 2 2 3 3 3 ...
$ Q121011 : Factor w/ 3 levels "","No","Yes": 2 2 2 2 2 3 3 2 3 2 ...
$ Q120379 : Factor w/ 3 levels "","No","Yes": 2 3 3 2 3 3 2 2 2 3 ...
$ Q120650 : Factor w/ 3 levels "","No","Yes": 3 3 3 3 3 2 3 3 3 3 ...
$ Q120472 : Factor w/ 3 levels "","Art","Science": 1 3 3 3 3 2 3 3 2 3 ...
$ Q120194 : Factor w/ 3 levels "","Study first",..: 3 2 3 2 2 3 3 3 3 3 ...
$ Q120012 : Factor w/ 3 levels "","No","Yes": 2 3 3 1 2 3 2 2 3 3 ...
$ Q120014 : Factor w/ 3 levels "","No","Yes": 2 3 2 3 3 1 3 3 2 3 ...
$ Q119334 : Factor w/ 3 levels "","No","Yes": 1 3 2 2 2 3 2 3 2 2 ...
$ Q119851 : Factor w/ 3 levels "","No","Yes": 3 2 2 3 2 2 3 2 2 3 ...
$ Q119650 : Factor w/ 3 levels "","Giving","Receiving": 1 2 2 3 2 1 2 2 2 3 ...
$ Q118892 : Factor w/ 3 levels "","No","Yes": 3 3 3 2 3 2 1 3 2 2 ...
$ Q118117 : Factor w/ 3 levels "","No","Yes": 3 2 2 3 3 3 1 2 2 2 ...
$ Q118232 : Factor w/ 3 levels "","Idealist",..: 2 2 3 3 3 1 1 2 2 3 ...
$ Q118233 : Factor w/ 3 levels "","No","Yes": 2 2 2 2 2 2 1 2 3 2 ...
$ Q118237 : Factor w/ 3 levels "","No","Yes": 2 3 3 3 2 2 1 2 3 2 ...
$ Q117186 : Factor w/ 3 levels "","Cool headed",..: 1 2 2 2 1 3 1 2 3 1 ...
$ Q117193 : Factor w/ 3 levels "","Odd hours",..: 1 2 3 2 3 3 1 3 3 3 ...
$ Q116797 : Factor w/ 3 levels "","No","Yes": 3 3 2 2 2 1 1 2 2 1 ...
$ Q116881 : Factor w/ 3 levels "","Happy","Right": 2 2 3 3 2 2 1 2 2 1 ...
$ Q116953 : Factor w/ 3 levels "","No","Yes": 3 3 3 3 1 3 3 3 3 1 ...
$ Q116601 : Factor w/ 3 levels "","No","Yes": 3 3 3 2 3 3 1 3 3 1 ...
$ Q116441 : Factor w/ 3 levels "","No","Yes": 2 2 2 2 2 2 1 2 2 1 ...
$ Q116448 : Factor w/ 3 levels "","No","Yes": 2 3 3 3 2 2 1 2 3 1 ...
$ Q116197 : Factor w/ 3 levels "","A.M.","P.M.": 3 2 2 2 2 3 1 2 3 1 ...
$ Q115602 : Factor w/ 3 levels "","No","Yes": 3 3 3 3 3 2 1 3 2 1 ...
$ Q115777 : Factor w/ 3 levels "","End","Start": 3 2 3 3 3 3 1 3 2 1 ...
$ Q115610 : Factor w/ 3 levels "","No","Yes": 3 3 3 3 3 1 1 3 2 1 ...
$ Q115611 : Factor w/ 3 levels "","No","Yes": 2 2 3 3 2 2 1 2 2 1 ...
$ Q115899 : Factor w/ 3 levels "","Circumstances",..: 2 3 3 2 2 3 1 2 3 1 ...
$ Q115390 : Factor w/ 3 levels "","No","Yes": 3 2 2 2 1 2 3 3 2 1 ...
$ Q114961 : Factor w/ 3 levels "","No","Yes": 3 3 2 3 2 3 2 2 3 1 ...
$ Q114748 : Factor w/ 3 levels "","No","Yes": 3 2 2 2 3 3 3 2 3 1 ...
$ Q115195 : Factor w/ 3 levels "","No","Yes": 3 3 3 3 3 2 3 3 3 1 ...
$ Q114517 : Factor w/ 3 levels "","No","Yes": 2 3 2 3 2 2 2 2 3 1 ...
$ Q114386 : Factor w/ 3 levels "","Mysterious",..: 1 3 3 2 2 3 3 3 3 1 ...
$ Q113992 : Factor w/ 3 levels "","No","Yes": 3 1 3 2 2 2 2 2 3 1 ...
$ Q114152 : Factor w/ 3 levels "","No","Yes": 3 2 2 2 3 2 2 2 2 1 ...
$ Q113583 : Factor w/ 3 levels "","Talk","Tunes": 2 3 2 3 3 3 3 2 3 1 ...
$ Q113584 : Factor w/ 3 levels "","People","Technology": 3 2 2 3 2 1 3 2 2 1 ...
$ Q113181 : Factor w/ 3 levels "","No","Yes": 2 3 3 3 2 3 3 2 2 1 ...
[list output truncated]
As you can see, I have 110 variables. I am trying to build a predictive model to predict happiness using these variables. If I leave them in factor form, CART models, randomForest etc. struggle, so I'm trying to convert them to a numeric type (to make the algorithm's life a bit easier)...
Currently I am doing it one by one e.g.:
> table(train_new$Q117193)
               Odd hours Standard hours
          1410      1299           1910
> train_new$Q117193 = as.integer(train_new$Q117193)
> table(train_new$Q117193)
   1    2    3
1410 1299 1910
You can notice that almost all the factor variables have missing values denoted by "".
I have converted this dataset to numeric using:
train_numeric$Gender = as.integer(train_numeric$Gender)
train_numeric[,grep(pattern="^Q1",colnames(train_numeric))] = lapply(train_numeric[,grep(pattern="^Q1",colnames(train_numeric))],as.integer)
I am using the mice package to impute this dataset... I am lost to be honest. Any ideas how I could fill these missing values please?

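It seems that you are converting factor variables (like Gender) to numeric format. Note that as.integer() on a factor only returns the internal level codes, not meaningful numbers, so the empty string "" simply becomes code 1 instead of being treated as missing (as your table() output shows). Converting to character instead would keep the original labels.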
To replace all missing values ("") with NAs in your data frame train you could do something like
train[train == ""] <- NA
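A minimal sketch of that replacement on a toy data frame (the `toy` data and its values are made up for illustration; only the column names are borrowed from the question). It also drops the now-unused "" level, which the one-line assignment leaves behind in each factor.

```r
# Toy stand-in for `train`: factor columns where "" marks a missing answer.
toy <- data.frame(
  Gender  = factor(c("Male", "", "Female", "Male")),
  Q124742 = factor(c("No", "Yes", "", "")),
  YOB     = c(1938L, 1985L, 1963L, 1997L)
)

# Turn every "" into NA, then drop the unused "" level from each factor.
toy[toy == ""] <- NA
toy <- droplevels(toy)

colSums(is.na(toy))   # number of missing values per column
levels(toy$Gender)    # "" is no longer a level
```

Once the gaps are real NA values, functions that look for missing data (is.na, and imputation packages such as mice) can see them.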

You can correct this issue while importing your file. For example, assuming you imported a CSV file, the code would be
dataset <- read.csv(file="file location", sep=",", header=TRUE, na.strings=c("","NA"))
This will replace the blanks in your categorical variables with NA.
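A small self-contained sketch of the same idea, reading a made-up CSV from a string instead of a file (the contents of `csv_text` are invented for illustration). `stringsAsFactors = TRUE` is set explicitly because recent R versions no longer create factors by default.

```r
# A tiny CSV held in a string so the example is self-contained;
# with a real file, pass file = "path/to/file.csv" instead of text =.
csv_text <- "UserID,Gender,Q124742
1,Male,No
2,,Yes
3,Female,
"

dataset <- read.csv(text = csv_text, header = TRUE,
                    na.strings = c("", "NA"), stringsAsFactors = TRUE)

levels(dataset$Gender)      # only the real categories; no "" level
colSums(is.na(dataset))     # blanks are counted as NA
```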

Related

rpart warning message and then no proper rpart plot formed

I tried rpart and got the following warning, and the rpart plot only showed 0:
Warning messages: error was: Setting row names on a tibble is deprecated.
No proper rpart plot was formed. This is the code that generated the plot:
library(caret)   # train()
library(doSNOW)  # registerDoSNOW(); also attaches snow for makeSOCKcluster()
rpart.oc <- function(seed, training, labels, otrl) {
  ol <- makeSOCKcluster(6, type="SOCK")  # start 6 socket workers
  registerDoSNOW(ol)
  set.seed(seed)
  rpart.oc <- train(x=training, y=labels, method="rpart", tuneLength=30, trControl=otrl)
  stopCluster(ol)
  return(rpart.oc)
}
rpart.1.cv.1o <- rpart.oc(94622, rpart.train.1o, of.label, otrl.3)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1898 obs. of 6 variables:
$ so : Factor w/ 10 levels "",..: 7 7 7 7 7 7 7 7 7 7 ...
$ a : num 63 7 3 45 2 4 69 0 7 0 ...
$ n : Factor w/ 5 levels "","",..: 3 2 2 1 2 3 1 1 2 2 ...
$ s: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ d : Factor w/ 7 levels "Friday","Monday",..: 6 6 7 7 5 5 1 1 1 1 ...
$ c: Factor w/ 7 levels "A","C",..: 6 2 4 2 2 4 1 2 7 2
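
Since the warning mentions tibbles and the str() output above shows the classes tbl_df and tbl, one thing that may be worth trying is converting the training data to a plain data.frame before passing it to train(). This is only a sketch under that assumption: `fake_tbl` is a hypothetical stand-in built by hand so that the example needs no extra packages, and whether this resolves the empty plot is not confirmed by the source.

```r
# Hypothetical stand-in for rpart.train.1o: a data frame carrying the tibble
# classes, built by hand so that the example needs no extra packages.
fake_tbl <- data.frame(a = c(63, 7, 3), s = factor(c("0", "1", "0")))
class(fake_tbl) <- c("tbl_df", "tbl", "data.frame")

# as.data.frame() strips the tibble classes and keeps the columns as they are.
plain <- as.data.frame(fake_tbl)

class(plain)   # "data.frame"
```

With the real data this would be `as.data.frame(rpart.train.1o)` in place of `rpart.train.1o` in the call to rpart.oc().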

How to combine Columns in R

http://imgur.com/a/q4IdW "table"
Hi, I have a file of coded complaints (you can see it in the link above), and I need to find a way to combine the 4 issue columns (primary issue, secondary issue, etc.) so that I can then sum up all the issues together. A complaint can have multiple issues, which is why it is broken down like this, but for analysis purposes I want to treat all the issue columns as the same. I am very new to R, so please try to speak in terms I'll be able to understand or can google fairly quickly.
> str(mydata)
'data.frame': 136 obs. of 25 variables:
$ ï..Issue.ID : Factor w/ 136 levels "CAO-2017-01",..: 20 21 22 23 24 25 26 27 28 29 ...
$ Reviewer.ID : Factor w/ 1 level "Vinokurov, A": 1 1 1 1 1 1 1 1 1 1 ...
$ Review.Date : Factor w/ 3 levels "6/30/2017","7/14/2017",..: 1 1 1 1 1 2 2 2 2 2 ...
$ CBA.ZIP.CODE : Factor w/ 61 levels "Allentown-Bethlehem-Easton, PA",..: 29 13 24 10 29 13 10 9 47 39 ...
$ Source.of.complaint : Factor w/ 7 levels "Advocate","Beneficiary",..: 7 7 3 7 6 7 2 3 3 3 ...
$ Primary.Issue.Category : Factor w/ 10 levels "Billing, coverage, coordination of benefits",..: 3 8 4 4 4 4 7 4 4 4 ...
$ Secondary.Issue.Category : Factor w/ 15 levels "","ABN issues ",..: 4 1 15 1 15 3 3 15 15 15 ...
$ Third.Issue.Category : Factor w/ 12 levels "","- Error -",..: 1 1 1 1 1 5 1 10 1 1 ...
$ Fourth.Issue.Category : Factor w/ 2 levels "","Low quantity/quality": 1 1 1 1 1 1 1 1 1 1 ...
$ Reviewer.Issue.Notes : logi NA NA NA NA NA NA ...
$ Primary.Equipment.Category : Factor w/ 13 levels "Commode chairs",..: 9 7 2 8 2 9 10 10 2 2 ...
$ Secondary.Equipment.Category : Factor w/ 7 levels "","- Error -",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Third.Equipment.Category : Factor w/ 10 levels "","- Error -",..: 1 1 1 1 1 2 1 2 1 1 ...
$ Fourth.Equipment.Category : logi NA NA NA NA NA NA ...
$ Reviewer.Equipment.Notes : logi NA NA NA NA NA NA ...
$ Primary.Resolution.Category : Factor w/ 16 levels "Beneficiary educated about DMEPOS\n",..: 9 12 15 12 5 14 13 9 5 10 ...
$ Secondary.Resolution.Category: Factor w/ 18 levels "","- Error -",..: 1 1 3 1 4 7 7 17 15 1 ...
$ Third.Resolution.Category : Factor w/ 8 levels "","Beneficiary educated about inquiry ",..: 1 1 1 1 3 1 1 1 1 1 ...
$ Fourth.Resolution.Category : logi NA NA NA NA NA NA ...
$ Reviewer.Resolution.Notes : logi NA NA NA NA NA NA ...
$ Future.Action : Factor w/ 4 levels "no","No","yes",..: 4 4 2 2 2 2 3 4 1 1 ...
$ Coder.1 : Factor w/ 2 levels "Briskin-Limehouse, A",..: 1 1 1 1 1 2 2 2 2 2 ...
$ Coder.1.Coded.Date : Factor w/ 4 levels "6/30/2017","7/13/2017",..: 1 1 1 1 1 2 2 2 2 2 ...
$ Coder.2 : Factor w/ 1 level "Aliu, F": 1 1 1 1 1 1 1 1 1 1 ...
$ Coder.2.Coded.Date : Factor w/ 7 levels "6/30/2017","7/12/2017",..: 1 1 1 1 1 2 3 3 3 3 ...
>
If I understand correctly, you have something like this:
        issue_1 issue_2 issue_3 issue_4
person1       0       0       0       1
person2       1       1       0       1
person3       1       0       1       1
where 1 indicates the presence of an issue and 0 its absence, taken from some survey.
Would you like to show something like this?
issue_1 appeared 2x
issue_2 appeared 1x
issue_3 appeared 1x
issue_4 appeared 3x
Could you check and clarify, please?
Please also include the output of str(your_data), rather than only a link.
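Going by the str(mydata) output in the question, one possible reading is that the four issue columns hold category labels (with "" where no further issue was coded), and the goal is a single count per category across all four. A minimal sketch under that assumption follows; the category values in the toy data are invented, and only the column names come from the question.

```r
# Toy stand-in for mydata, using the four issue columns from the question;
# "" means that no further issue was coded for that complaint.
mydata <- data.frame(
  Primary.Issue.Category   = c("Billing", "Delivery", "Quality", "Quality"),
  Secondary.Issue.Category = c("Delivery", "", "Billing", ""),
  Third.Issue.Category     = c("", "", "Quality", ""),
  Fourth.Issue.Category    = c("", "", "Low quantity/quality", ""),
  stringsAsFactors = TRUE
)

issue_cols <- c("Primary.Issue.Category", "Secondary.Issue.Category",
                "Third.Issue.Category", "Fourth.Issue.Category")

# Convert each column to character first (so factor codes are not mixed up),
# stack the four columns into one vector, and drop the empty entries.
all_issues <- unlist(lapply(mydata[issue_cols], as.character), use.names = FALSE)
all_issues <- all_issues[!is.na(all_issues) & all_issues != ""]

# Count how often each issue appears across all four columns.
sort(table(all_issues), decreasing = TRUE)
```

Converting to character before stacking matters because each factor column has its own set of levels, so the underlying integer codes are not comparable between columns.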

Caret : train() function - Error in train.default(x, y, weights = w, ...) : Stopping

I'm trying to predict a binary variable (level of spending) using caret in R.
My dataset has 415,000 rows and 30 features (all features are factors). I need to compare the performance of several machine learning algorithms.
#Convert factor level
levels(tableRFM_train$niveau_Depense) <- c("A","B")
#Sample rows and select sub-sample
tableRFM_train <-tableRFM_train[sample(1:nrow(tableRFM_train),size=45000),]
Random Forest
When I try to tune the mtry parameter with a sub-sample of more than 45,000 rows, I get this error (with fewer rows I don't get it):
Something is wrong; all the Accuracy metric values are missing:
Accuracy Kappa
Min. : NA Min. : NA
1st Qu.: NA 1st Qu.: NA
Median : NA Median : NA
Mean :NaN Mean :NaN
3rd Qu.: NA 3rd Qu.: NA
Max. : NA Max. : NA
NA's :1 NA's :1
Error in train.default(x, y, weights = w, ...) : Stopping
In addition: Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
I don't understand... I tried a few things to solve the problem:
removed variables with missing values (missing values are grouped into a factor level): sum(is.na(tableRFM_train))=0
removed unbalanced variables and variables with zero variance
paramgrid <- data.frame(mtry=seq(1,3,by=1))
cl <- makeCluster(6)
registerDoParallel(cl)
# tuneGrid is an argument of train(), not of trainControl();
# prox is passed on through train()'s ... to the model function
rfopt <- train(niveau_Depense~., data=tableRFM_train, method="rf",
               trControl=trainControl(method="cv", number=10, search="grid",
                                      classProbs=TRUE, allowParallel=TRUE),
               tuneGrid=paramgrid, prox=TRUE) #,na.remove=T
stopCluster(cl)
When I try to tune GBM with a sub-sample of 200,000 rows, I get the same error, with NA's :1 NA's :1 for Accuracy and Kappa. When I tune SVM with a sub-sample of more than 45,000 rows, I also get this error.
I don't have any missing values in the target or the other variables.
The result of str(tableRFM_train) :
'data.frame': 276664 obs. of 28 variables:
$ q_pm_p_1 : Factor w/ 4 levels "[ 1 - 21 ]","[ 22 - 32 ]",..: 1 4 3 3 2 3 1 1 4 1 ...
$ q_pm_p_2 : Factor w/ 4 levels "[ 1 - 23 ]","[ 24 - 40 ]",..: 2 4 1 2 1 3 1 4 1 1 ...
$ q_pm_p_3 : Factor w/ 4 levels "[ 1 - 25 ]","[ 26 - 43 ]",..: 3 4 4 2 4 3 3 4 2 2 ...
$ q_ir_p_1 : Factor w/ 4 levels "[ 6 - 10 ]","[ 11 - 16 ]",..: 2 1 2 1 4 1 4 1 2 4 ...
$ q_ir_p_2 : Factor w/ 4 levels "[ 0 - 6 ]","[ 7 - 14 ]",..: 2 1 2 2 4 4 2 1 2 2 ...
$ q_ir_p_3 : Factor w/ 3 levels "[ 0 - 7 ]","[ 8 - 16 ]",..: 2 1 1 2 1 2 3 1 3 1 ...
$ q_evol_pm_p_3_p_2 : Factor w/ 4 levels "[ -100 - -40 ]",..: 1 3 3 2 3 1 1 3 1 1 ...
$ q_evol_pm_p_2_p_1 : Factor w/ 4 levels "[ -100 - -25 ]",..: 1 3 4 3 3 1 2 3 4 2 ...
$ q_ecart_ir_p_3_p_2 : Factor w/ 4 levels "[ -20 - -2 ]",..: 1 2 4 2 4 4 1 2 1 4 ...
$ q_ecart_ir_p_2_p_1 : Factor w/ 4 levels "[ -14 - -1 ]",..: 3 4 3 2 2 1 3 3 3 3 ...
$ q_age : Factor w/ 12 levels "[18-24]","[25-29]",..: 10 3 3 12 12 8 3 9 12 10 ...
$ q_anciennete : Factor w/ 3 levels "[ 0 - 4 ]","[ 5 - 5 ]",..: 2 1 2 3 2 2 2 1 2 2 ...
$ q_attachement_mag : Factor w/ 2 levels "Faible","Fort": 1 1 1 1 1 1 1 1 1 2 ...
$ q_diversification : Factor w/ 4 levels "[ 1 - 3 ]","[ 4 - 6 ]",..: 4 4 2 4 2 1 4 1 2 4 ...
$ q_indice_coordo : Factor w/ 4 levels "[ 0 - 4 ]","[ 5 - 5 ]",..: 3 3 1 1 2 3 2 1 3 3 ...
$ q_recence_p_1 : Factor w/ 4 levels "[ 0 - 11 ]","[ 12 - 30 ]",..: 4 1 2 3 2 2 1 4 3 1 ...
$ q_recence_p_2 : Factor w/ 4 levels "[ 0 - 12 ]","[ 13 - 45 ]",..: 2 4 3 2 2 2 3 4 3 3 ...
$ q_recence_p_3 : Factor w/ 4 levels "[ 0 - 14 ]","[ 15 - 47 ]",..: 2 4 4 2 4 2 2 4 1 3 ...
$ q_frequence_p_1 : Factor w/ 4 levels "[ 1 - 2 ]","[ 3 - 5 ]",..: 2 1 2 1 3 1 2 1 2 4 ...
$ q_frequence_p_2 : Factor w/ 4 levels "[ 1 - 3 ]","[ 4 - 7 ]",..: 1 4 1 1 2 2 2 4 1 2 ...
$ q_frequence_p_3 : Factor w/ 4 levels "[ 1 - 3 ]","[ 4 - 7 ]",..: 1 4 4 1 4 1 2 4 2 1 ...
$ q_presence_p_1 : Factor w/ 4 levels "[ 1 - 2 ]","[ 3 - 4 ]",..: 2 1 2 1 3 1 2 1 2 4 ...
$ q_presence_p_2 : Factor w/ 4 levels "[ 1 - 3 ]","[ 4 - 5 ]",..: 1 4 1 1 3 3 2 4 1 2 ...
$ q_presence_p_3 : Factor w/ 4 levels "[ 1 - 3 ]","[ 4 - 6 ]",..: 1 4 4 1 4 1 2 4 2 1 ...
$ q_delai_reachat_p_1: Factor w/ 4 levels "0-13","14-24",..: 3 4 3 4 3 4 3 4 3 2 ...
$ q_delai_reachat_p_2: Factor w/ 5 levels "0-14","15-26",..: 5 4 5 5 3 2 1 4 5 1 ...
$ q_delai_reachat_p_3: Factor w/ 5 levels "0-13","14-24",..: 5 4 4 5 4 5 3 4 3 5 ...
$ niveau_Depense : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 1 1 ...
R version 3.3.3 (2017-03-06) -- "Another Canoe"
Copyright (C) 2017 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit) and latest Caret Version
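
One thing that may be worth checking for the random forest case is `prox=TRUE`. Assuming it is forwarded to randomForest as a request for the proximity matrix, that matrix has one row and one column per training observation, so its size grows with the square of the sample size. The helper `proximity_gib` below is hypothetical and only does the arithmetic. Whether memory is really what makes the resamples fail here is not confirmed by the source, and it would not by itself explain the GBM and SVM failures.

```r
# Rough size of an n x n proximity matrix stored as doubles (8 bytes each).
proximity_gib <- function(n) n^2 * 8 / 1024^3

proximity_gib(45000)    # about 15 GiB
proximity_gib(40500)    # about 12 GiB (the 90% of 45000 rows used in each CV fold)
proximity_gib(200000)   # about 298 GiB
```

With 6 parallel workers, each worker would need its own copy while fitting.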

c50 code called exit with value 1 on Mushroom Data set [duplicate]

This question already has answers here:
C5.0 decision tree - c50 code called exit with value 1
I'm getting an error while working on C5.0 with the Mushroom data set. I've converted the target class to a factor and there are no missing values.
f <-file("https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data", open="r")
data <- read.table(f, sep=",", header=F)
str(data)
gives
'data.frame': 8124 obs. of 23 variables:
$ V1 : Factor w/ 2 levels "e","p": 2 1 1 2 1 1 1 1 2 1 ...
$ V2 : Factor w/ 6 levels "b","c","f","k",..: 6 6 1 6 6 6 1 1 6 1 ...
$ V3 : Factor w/ 4 levels "f","g","s","y": 3 3 3 4 3 4 3 4 4 3 ...
$ V4 : Factor w/ 10 levels "b","c","e","g",..: 5 10 9 9 4 10 9 9 9 10 ...
$ V5 : Factor w/ 2 levels "f","t": 2 2 2 2 1 2 2 2 2 2 ...
$ V6 : Factor w/ 9 levels "a","c","f","l",..: 7 1 4 7 6 1 1 4 7 1 ...
$ V7 : Factor w/ 2 levels "a","f": 2 2 2 2 2 2 2 2 2 2 ...
$ V8 : Factor w/ 2 levels "c","w": 1 1 1 1 2 1 1 1 1 1 ...
$ V9 : Factor w/ 2 levels "b","n": 2 1 1 2 1 1 1 1 2 1 ...
$ V10: Factor w/ 12 levels "b","e","g","h",..: 5 5 6 6 5 6 3 6 8 3 ...
$ V11: Factor w/ 2 levels "e","t": 1 1 1 1 2 1 1 1 1 1 ...
$ V12: Factor w/ 5 levels "?","b","c","e",..: 4 3 3 4 4 3 3 3 4 3 ...
$ V13: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
$ V14: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
$ V15: Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
$ V16: Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
$ V17: Factor w/ 1 level "p": 1 1 1 1 1 1 1 1 1 1 ...
$ V18: Factor w/ 4 levels "n","o","w","y": 3 3 3 3 3 3 3 3 3 3 ...
$ V19: Factor w/ 3 levels "n","o","t": 2 2 2 2 2 2 2 2 2 2 ...
$ V20: Factor w/ 5 levels "e","f","l","n",..: 5 5 5 5 1 5 5 5 5 5 ...
$ V21: Factor w/ 9 levels "b","h","k","n",..: 3 4 4 3 4 3 3 4 3 3 ...
$ V22: Factor w/ 6 levels "a","c","n","s",..: 4 3 3 4 1 3 3 4 5 4 ...
$ V23: Factor w/ 7 levels "d","g","l","m",..: 6 2 4 6 2 2 4 4 2 4 ...
and when i run
C5.model <- C5.0(data[1:4000,-1],data[1:4000,1],trials = 3)
gives
c50 code called exit with value 1
I have no clue how to track this down. Any ideas on debugging are appreciated.
Edit 1: The error is the same as in the linked question, but the solution is different.
Note: When I changed the data set as follows, it works.
f <-file("https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data", open="r")
data <- read.table(f, sep=",", header=F)
str(data)
pacman::p_load(C50)
C5.model <- C5.0(data[1:10000,c(2:16,18:23)],data[1:10000,1],trials = 3,na.action = na.pass)
Column 17 (V17) was the cause of this problem: it is a factor with a single level, so it has no variation to split on.
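Rather than spotting such a column by eye in the str() output, it can be found programmatically. The sketch below uses a made-up three-column stand-in for the mushroom data (only the column names and the single-level V17 mirror the question); `n_distinct`, `constant_cols` and `data_clean` are names introduced here for illustration.

```r
# Toy stand-in for the mushroom data: V17 has a single level, as in str(data).
data <- data.frame(
  V1  = factor(c("p", "e", "e", "p")),
  V16 = factor(c("w", "w", "g", "w")),
  V17 = factor(c("p", "p", "p", "p"))
)

# Count the distinct non-missing values in every column.
n_distinct <- sapply(data, function(col) length(unique(na.omit(col))))

# Columns with fewer than two distinct values carry no information.
constant_cols <- names(n_distinct)[n_distinct < 2]
constant_cols   # "V17"

# Keep only the informative columns before fitting a model.
data_clean <- data[, setdiff(names(data), constant_cols), drop = FALSE]
```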

Cluster analysis with daisy

I'm trying to perform a hierarchical cluster analysis in RStudio, using the daisy function from the cluster package. This is my dataset:
'data.frame': 341 obs. of 28 variables:
$ Impo_Env : Ord.factor w/ 3 levels "Low"<"Med"<"High": 3 2 3 2 3 2 3 3 2 3 ...
$ ComparativePriority_IAS: Ord.factor w/ 3 levels "Low"<"Med"<"High": 3 1 3 2 3 2 3 2 3 2 ...
$ Strategy_Eradication: Ord.factor w/ 3 levels "No intervention"<..: 3 2 3 2 3 2 3 2 2 3 ...
$ Knowl_BiodivLoss: Factor w/ 2 levels "0","1": 2 1 2 2 2 1 2 2 2 2 ...
$ Control_Trade: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ Engagement_Retail: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ Knowl_PastProj: Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 2 1 ...
$ Priority_IAS: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ Knowl_Eradic: Factor w/ 2 levels "0","1": 2 1 2 1 2 2 1 2 2 1 ...
$ Alert_CFS: Factor w/ 2 levels "0","1": 1 2 1 2 1 2 2 1 2 1 ...
$ Alert_Municipality: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ Alert_Park: Factor w/ 2 levels "0","1": 2 1 2 1 2 1 1 2 1 1 ...
$ Alert_Police: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ Alert_Firemen: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ...
$ Supp_AuthorityIAS: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ Knowl_Env: Factor w/ 2 levels "0","1": 2 1 2 2 2 2 2 2 2 2 ...
$ Info_Tv: Factor w/ 2 levels "0","1": 2 2 1 2 2 2 2 1 2 1 ...
$ Info_Web: Factor w/ 2 levels "0","1": 2 1 2 2 2 1 2 1 2 2 ...
$ Info_Radio: Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 2 1 1 ...
$ Info_Magazines: Factor w/ 2 levels "0","1": 1 1 2 1 2 1 1 2 1 1 ...
$ Info_School: Factor w/ 2 levels "0","1": 1 1 2 1 1 1 1 1 1 2 ...
$ Blacklist: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ Workshop: Factor w/ 2 levels "0","1": 1 1 2 1 2 1 2 2 1 1 ...
$ SuppFin_FutProj: Factor w/ 2 levels "0","1": 2 1 2 1 2 2 2 2 2 2 ...
$ Tourist_dummy: Factor w/ 2 levels "0","1": 1 1 1 2 2 1 1 1 2 1 ...
$ Gender: Factor w/ 2 levels "Female","Male": 1 2 1 2 1 1 2 2 2 1 ...
$ logIASknown: num 2.89 2.94 2.89 2.56 3.14 ...
$ Age: int 20 41 14 10 26 33 19 59 23 16 ...
I would like to use the Euclidean distance with daisy, however when I run
daisy(fuu, metric = c("euclidean"), type=list(ordratio = c(1,2,3), asymm=c(4:24), symm=c(25,26)))
The output is not what I expect. Gower's distance is used instead of the Euclidean distance:
Warning message:
In daisy(fuu, metric = c("euclidean"), type = list(ordratio = c(1, :
  with mixed variables, metric "gower" is used automatically
How can I fix it?
As described in the "Details" section of the documentation for the daisy function in the cluster package:
The handling of nominal, ordinal, and (a)symmetric binary data is
achieved by using the general dissimilarity coefficient of Gower
(1971). If x contains any columns of these data-types, both arguments
metric and stand will be ignored and Gower's coefficient will be used
as the metric.
In other words, for euclidean metrics (distances as root sum-of-squares of differences) to be computed, input columns must be numeric (mode) variables (i.e. all columns when x is a matrix) and thus recognised as interval scaled variables, as opposed to nominal (columns of class factor) variables or ordinal (columns of class ordered) variables. Specifying variable type within the type argument does not change this fact.
Given this, and supposing it makes sense for all 28 of your variables even though some are qualitative or binary, you might try converting them with as.numeric before calling daisy. Otherwise, with mixed variables, the "gower" metric is used automatically and overrides your choice.
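A minimal sketch of that conversion on a made-up three-row stand-in for `fuu` (the values are invented; the column names are borrowed from the question). It uses stats::dist() for the Euclidean distance so that the example needs no extra packages, which is a swap made here for illustration rather than what the question calls.

```r
# Toy stand-in for `fuu`: an ordered factor, a binary factor and a numeric column.
fuu <- data.frame(
  Impo_Env = factor(c("High", "Med", "Low"),
                    levels = c("Low", "Med", "High"), ordered = TRUE),
  Workshop = factor(c("0", "1", "1")),
  Age      = c(20, 41, 14)
)

# Replace every factor by its integer level codes (Low/Med/High -> 1/2/3,
# "0"/"1" -> 1/2). Numeric columns are left unchanged.
fuu_num <- as.data.frame(lapply(fuu, as.numeric))

# All columns are numeric now, so a Euclidean distance is well defined.
d <- dist(fuu_num, method = "euclidean")
d
```

According to the documentation quoted above, daisy(fuu_num, metric = "euclidean") on such an all-numeric data frame should no longer fall back to Gower. Note that a wide-ranging column such as Age will dominate the distances unless the columns are standardised first, for example with scale().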

Resources