How to clean a dataset from UCI in R - r

Good afternoon,
Assume we have the following function:
data_preprocessing<-function(link){
link=as.character(link)
dataset=read.csv(link)
dataset=replace(dataset,dataset=="?",NA)
return(dataset)
}
Example (problem with the https protocol):
Echocardiogram=data_preprocessing("https://archive.ics.uci.edu/ml/machine-learning-databases/echocardiogram/echocardiogram.data")
Error in file(file, "rt") : cannot open the connection
After downloading the dataset (over http):
Echocardiogram=data_preprocessing("http://archive.ics.uci.edu/ml/machine-learning-databases/echocardiogram/echocardiogram.data")
head(Echocardiogram)
X11 X0 X71 X0.1 X0.260 X9 X4.600 X14 X1 X1.1 name X1.2 X0.2
1 19 0 72 0 0.380 6 4.100 14 1.700 0.588 name 1 0
2 16 0 55 0 0.260 4 3.420 14 1 1 name 1 0
3 57 0 60 0 0.253 12.062 4.603 16 1.450 0.788 name 1 0
4 19 1 57 0 0.160 22 5.750 18 2.250 0.571 name 1 0
5 26 0 68 0 0.260 5 4.310 12 1 0.857 name 1 0
6 13 0 62 0 0.230 31 5.430 22.5 1.875 0.857 name 1 0
Also:
str(Echocardiogram)
'data.frame': 130 obs. of 12 variables:
$ X11 : Factor w/ 57 levels "",".03",".25",..: 18 16 54 18 27 14 50 18 26 12 ...
$ X0 : Factor w/ 4 levels "","?","0","1": 3 3 3 4 3 3 3 3 3 4 ...
$ X71 : Factor w/ 40 levels "","?","35","46",..: 30 12 17 14 26 19 17 4 11 34 ...
$ X0.1 : int 0 0 0 0 0 0 0 0 0 0 ...
$ X0.260: Factor w/ 74 levels "","?","0.010",..: 65 50 47 26 50 42 59 60 21 19 ...
$ X9 : Factor w/ 93 levels "","?","0","10",..: 69 57 13 46 62 56 79 3 19 29 ...
$ X4.600: Factor w/ 106 levels "","?","2.32",..: 25 6 54 92 38 85 76 70 47 33 ...
$ X14 : Factor w/ 48 levels "","?","10","10.5",..: 16 16 21 27 8 36 16 21 19 27 ...
$ X1 : Factor w/ 67 levels "","?","1","1.04",..: 48 3 37 60 3 52 3 11 16 50 ...
$ X1.1 : Factor w/ 32 levels "","?","0.140",..: 14 30 25 13 27 27 30 31 29 21 ...
$ X1.2 : Factor w/ 5 levels "","?","1","2",..: 3 3 3 3 3 3 3 3 3 3 ...
$ X0.2 : Factor w/ 5 levels "","?","0","1",..: 3 3 3 3 3 3 3 3 3 4 ...
Here, I want to replace every "?" in the dataset with NA. It would also be good to remove duplicated and empty rows (such as row 50).
Thank you for your help!

Something like this?
library(data.table)
DT <- data.table::fread("https://archive.ics.uci.edu/ml/machine-learning-databases/echocardiogram/echocardiogram.data",
fill = TRUE,
na.strings = "?")

When using read.csv from base R, you can set na.strings = "?" and header = FALSE.
Echocardiogram <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/echocardiogram/echocardiogram.data"
, na.strings = "?", header=FALSE)
str(Echocardiogram)
#'data.frame': 133 obs. of 13 variables:
# $ V1 : num 11 19 16 57 19 26 13 50 19 25 ...
# $ V2 : int 0 0 0 0 1 0 0 0 0 0 ...
# $ V3 : num 71 72 55 60 57 68 62 60 46 54 ...
# $ V4 : int 0 0 0 0 0 0 0 0 0 0 ...
# $ V5 : num 0.26 0.38 0.26 0.253 0.16 0.26 0.23 0.33 0.34 0.14 ...
# $ V6 : num 9 6 4 12.1 22 ...
# $ V7 : num 4.6 4.1 3.42 4.6 5.75 ...
# $ V8 : num 14 14 14 16 18 12 22.5 14 16 15.5 ...
# $ V9 : num 1 1.7 1 1.45 2.25 ...
# $ V10: num 1 0.588 1 0.788 0.571 ...
# $ V11: chr "name" "name" "name" "name" ...
# $ V12: chr "1" "1" "1" "1" ...
# $ V13: int 0 0 0 0 0 0 0 0 0 0 ...
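The question also asks about dropping empty and duplicated rows, which the answers above do not cover. A minimal base R sketch, assuming that an "empty" row is one where every field is NA; the inline `raw` text is made-up toy data standing in for the UCI file:

```r
# Toy stand-in for the downloaded file (made-up values, not the real data).
raw <- "11,0,71,?\n19,0,72,1\n,,,\n19,0,72,1\n"

# Treat both "?" and empty fields as missing.
df <- read.csv(text = raw, header = FALSE, na.strings = c("?", ""))

# Drop rows in which every column is NA (the "empty" rows).
df <- df[rowSums(!is.na(df)) > 0, ]

# Drop exact duplicate rows, keeping the first occurrence.
df <- df[!duplicated(df), ]

print(df)
```

The same two filtering lines should work unchanged on the data frame returned by the read.csv call above.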

Related

Problem with putting information in a column in R

I have two data frames: one called datos_octubre with 10131530 rows and another called datos_conductores.
I want to add a new column called Operador, filled in by the following loop:
for( j in 1:100){
for(i in 1:100){
if ( (datos_octubre$FECHA_GPS[i]== datos_conductores$fecha[i])){
if (datos_octubre$EQU_CODIGO[i]== datos_conductores$EQU_CODIGO[j]){
if (datos_octubre$HORA_GPS[i] <= datos_conductores$hora_fin[j]){
datos_octubre$Operador[i] <- datos_octubre$NOMBRE[j]
}
}
}
}
}
This is the structure of data frame datos_octubre and the head:
> str(datos_octubre)
'data.frame': 10131530 obs. of 14 variables:
$ REP_GPS_CODIGO : Factor w/ 9329105 levels "MI051","MI051_1832614921789237",..: 2 3 4 5 6 7 8 9 10 11 ...
$ EQU_CODIGO : chr "MI051" "MI051" "MI051" "MI051" ...
$ TRAM_GPS_CODIGO: Factor w/ 4 levels "01","03","05",..: 4 4 4 4 4 4 4 4 4 4 ...
$ EVE_GPS_CODIGO : Factor w/ 83 levels "01","02","03",..: 3 3 3 3 3 9 3 3 3 3 ...
$ FECHA_GPS : POSIXct, format: "2019-10-01" "2019-10-01" "2019-10-01" "2019-10-01" ...
$ HORA_GPS : Factor w/ 86389 levels "-75.6654","-75.6655",..: 16528 16536 16546 16556 16564 16568 16574 16583 16592 16601 ...
$ LON_GPS : num -75.7 -75.7 -75.7 -75.7 -75.7 ...
$ LAT_GPS : num 4.8 4.8 4.8 4.8 4.8 ...
$ VEL_GPS : num 0 0 0 0 0 0 0 0 0 0 ...
$ DIR_GPS : int 0 0 0 101 101 101 101 101 101 101 ...
$ ACL_GPS : int 0 0 0 0 0 NA 0 0 0 0 ...
$ ODO_GPS : int 28229762 28229762 28229762 28229768 28229770 NA 28229770 28229770 28229770 28229770 ...
$ ALT_GPS : Factor w/ 120 levels "","\"MI051_1902402005409507",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Operador : chr "" "" "" "" ...
> head(datos_octubre)
REP_GPS_CODIGO EQU_CODIGO TRAM_GPS_CODIGO EVE_GPS_CODIGO FECHA_GPS HORA_GPS LON_GPS LAT_GPS VEL_GPS DIR_GPS ACL_GPS ODO_GPS ALT_GPS Operador
1 MI051_1832614921789237 MI051 EV 03 2019-10-01 04:35:38 -75.7444 4.79857 0 0 0 28229762
2 MI051_1832614964979379 MI051 EV 03 2019-10-01 04:35:46 -75.7444 4.79857 0 0 0 28229762
3 MI051_1832616366109032 MI051 EV 03 2019-10-01 04:35:56 -75.7444 4.79857 0 0 0 28229762
4 MI051_1832617794914447 MI051 EV 03 2019-10-01 04:36:06 -75.7442 4.79907 0 101 0 28229768
5 MI051_1832619516509591 MI051 EV 03 2019-10-01 04:36:14 -75.7442 4.79908 0 101 0 28229770
6 MI051_1832619543973570 MI051 EV 10 2019-10-01 04:36:18 -75.7442 4.79908 0 101 NA NA
And this is the structure of datos_conductores and the head:
> str(datos_conductores)
'data.frame': 16522 obs. of 11 variables:
$ fecha : POSIXct, format: "2019-10-01" "2019-10-01" "2019-10-01" "2019-10-01" ...
$ equ_id : int 99 99 99 99 99 99 99 99 99 99 ...
$ conductor : int 34 34 34 34 34 34 34 65 65 65 ...
$ servicio_id: int 533329 533328 533327 533326 533325 533324 533323 533333 533332 533331 ...
$ PERA_ID : int 362 362 362 362 362 362 362 107 107 107 ...
$ hora_ini : POSIXct, format: "2019-11-28 09:16:16" "2019-11-28 08:38:16" "2019-11-28 08:00:16" "2019-11-28 07:22:16" ...
$ hora_fin : POSIXct, format: "2019-11-28 09:21:00" "2019-11-28 09:16:16" "2019-11-28 08:38:16" "2019-11-28 08:00:16" ...
$ ruta_id : int 24 24 24 24 24 24 24 24 24 24 ...
$ NOMBRE : Factor w/ 85 levels "ALBERT HERNAN ZAPATA RESTREPO",..: 71 71 71 71 71 71 71 53 53 53 ...
$ PERA_CEDULA: int 1088253762 1088253762 1088253762 1088253762 1088253762 1088253762 1088253762 10087424 10087424 10087424 ...
$ EQU_CODIGO : Factor w/ 36 levels "MI051","MI052",..: 9 9 9 9 9 9 9 9 9 9 ...
> head(datos_octubre)
REP_GPS_CODIGO EQU_CODIGO TRAM_GPS_CODIGO EVE_GPS_CODIGO FECHA_GPS HORA_GPS LON_GPS LAT_GPS VEL_GPS DIR_GPS ACL_GPS ODO_GPS ALT_GPS Operador
1 MI051_1832614921789237 MI051 EV 03 2019-10-01 04:35:38 -75.7444 4.79857 0 0 0 28229762
2 MI051_1832614964979379 MI051 EV 03 2019-10-01 04:35:46 -75.7444 4.79857 0 0 0 28229762
3 MI051_1832616366109032 MI051 EV 03 2019-10-01 04:35:56 -75.7444 4.79857 0 0 0 28229762
4 MI051_1832617794914447 MI051 EV 03 2019-10-01 04:36:06 -75.7442 4.79907 0 101 0 28229768
5 MI051_1832619516509591 MI051 EV 03 2019-10-01 04:36:14 -75.7442 4.79908 0 101 0 28229770
6 MI051_1832619543973570 MI051 EV 10 2019-10-01 04:36:18 -75.7442 4.79908 0 101 NA NA
I also tried with the pipe operator, but I'm not getting the result I want.
I already found the solution.
I have to convert all the date and time columns with as.POSIXct and, inside the for loop, make a correction using the start time (time_ini) and end time (time_finish) of every record.
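The idea in that answer can be sketched without an explicit loop. This is only an illustration under assumptions: the toy values, the combined FECHA_HORA timestamp column, the helper `id` column, and the names `gps` and `turnos` are all hypothetical; only EQU_CODIGO, hora_ini, hora_fin, NOMBRE and Operador come from the question.

```r
# Toy GPS records (made-up values); FECHA_HORA is a hypothetical combined timestamp.
gps <- data.frame(
  EQU_CODIGO = c("MI051", "MI051", "MI052"),
  FECHA_HORA = as.POSIXct(c("2019-10-01 04:35:38",
                            "2019-10-01 09:00:00",
                            "2019-10-01 04:40:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Toy driver shifts (made-up values).
turnos <- data.frame(
  EQU_CODIGO = c("MI051", "MI051", "MI052"),
  hora_ini = as.POSIXct(c("2019-10-01 04:00:00",
                          "2019-10-01 08:00:00",
                          "2019-10-01 04:00:00"), tz = "UTC"),
  hora_fin = as.POSIXct(c("2019-10-01 08:00:00",
                          "2019-10-01 12:00:00",
                          "2019-10-01 05:00:00"), tz = "UTC"),
  NOMBRE = c("A", "B", "C"),
  stringsAsFactors = FALSE
)

# Pair every GPS record with every shift of the same vehicle ...
gps$id <- seq_len(nrow(gps))
cand <- merge(gps, turnos, by = "EQU_CODIGO")

# ... and keep only the pairs where the record falls inside the shift.
hit <- cand[cand$FECHA_HORA >= cand$hora_ini & cand$FECHA_HORA < cand$hora_fin, ]

# Write the driver name back; records without a matching shift stay NA.
gps$Operador <- hit$NOMBRE[match(gps$id, hit$id)]
print(gps)
```

With ten million rows the intermediate merge can become very large, so a non-equi join in data.table may be a better fit at full scale.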

Classify factor output with factors with >60 levels and numeric inputs

I'm a newbie working on a classification problem to find the causes of coral diseases. The dataset contains 45 variables.
The output variable is a factor with 21 levels (21 diseases), and the inputs are numeric and factor variables. Some of those factors have as many as 94 levels (for example, the coral species), and I don't want to split or collapse them because I want to be as precise as possible: one species may be less resistant than another. The numeric variables are things like population in the area, fishing trips, etc.
First problem: I tried genetic algorithms, random forests, etc. to select the most important variables, but the run gets aborted, so the variables I eliminated were chosen only from correlograms. I want something stronger to decide which variables to select.
Second problem: I've tried everything I know and searched Google extensively for something that runs and produces a classification, but nothing works. I tried SVM, random forests, CART, GBM, bagging and boosting, but none of them can handle this dataset.
This is the structure of the dataset
'data.frame': 136510 obs. of 45 variables:
$ SITE : Factor w/ 144 levels "TUT-1511","TUT-1513",..: 56 15 55 21 12 12 17 53 48 82 ...
$ Zone_Fine : Factor w/ 17 levels "Aunuu_E","Aunuu_W",..: 11 9 10 9 9 9 9 8 10 10 ...
$ TRANSECT : num 1 1 1 1 1 1 1 1 1 1 ...
$ SEGMENT : num 5 1 1 1 7 5 7 5 3 7 ...
$ Seg_WIDTH : num 1 1 1 1 1 1 1 1 1 1 ...
$ Seg_LENGTH : num 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 ...
$ SPECIES : Factor w/ 156 levels "AAAA","AABR",..: 94 126 94 102 9 126 135 94 93 94 ...
$ COLONYLENGTH : num 11 45 10 5 12 10 8 30 20 14 ...
$ OLDDEAD : num 5 2 5 0 0 5 10 0 5 10 ...
$ RECENTDEAD : num 0 10 0 0 0 0 0 0 0 0 ...
$ DZCLASS : Factor w/ 21 levels "Acute Tissue Loss - White Syndrome",..: 14 14 14 14 14 14 14 14 14 14 ...
$ EXTENT : num 52.9 52.9 52.9 52.9 52.9 ...
$ SEVERITY : num 3.11 3.11 3.11 3.11 3.11 ...
$ TAXONNAME.x : Factor w/ 155 levels "Acanthastrea hemprichii",..: 95 132 95 107 7 132 133 95 89 95 ...
$ PHYLUM : Factor w/ 2 levels "Cnidaria","Rhodophyta": 1 1 1 1 1 1 1 1 1 1 ...
$ CLASS : Factor w/ 3 levels "Anthozoa","Florideophyceae",..: 1 1 1 1 1 1 1 1 1 1 ...
$ FAMILY : Factor w/ 20 levels "Acroporidae",..: 1 18 1 2 1 18 18 1 8 1 ...
$ GENUS : Factor w/ 55 levels "Acanthastrea",..: 35 44 35 39 2 44 44 35 34 35 ...
$ RANK : Factor w/ 2 levels "Genus","Species": 1 1 1 1 2 1 2 1 1 1 ...
$ DATE_ : Date, format: "0015-03-27" ...
$ OBS_YEAR : num 2015 2015 2015 2015 2015 ...
$ REEF_ZONE : Factor w/ 2 levels "Backreef","Forereef": 2 2 2 2 2 2 2 2 2 2 ...
$ DEPTH_BIN : Factor w/ 4 levels "Bank","Deep",..: 2 2 4 3 2 2 3 4 3 3 ...
$ LBSP : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
$ Zone_Fine_ReefZone_Depth: Factor w/ 41 levels "Aunuu_E_Deep",..: 30 24 29 25 24 24 25 23 28 28 ...
$ Area_km2.x : num 50.9 49.1 101.8 49.1 49.1 ...
$ Fishing.trips.per.km2 : num 719 1148 1431 1148 1148 ...
$ Area_km2.y : num 50.9 49.1 50.9 49.1 49.1 ...
$ Pop.km2 : num 167.5 49.1 561.9 49.1 49.1 ...
$ SHED_NAME : Factor w/ 35 levels "Aasu","Afao - Asili",..: 2 9 15 17 17 1 1 35 28 26 ...
$ Shed_Cond : Factor w/ 4 levels "Extensive","Intermediate",..: 3 4 2 4 4 3 3 3 1 2 ...
$ Shed_Area_Calc : num 30202 29422 458542 126361 32595 ...
$ Perc_Area : num 0.00128 0.00107 0.00993 0.00458 0.00118 ...
$ Cond_Scale : num 3 4 2 4 4 3 3 3 1 2 ...
$ Shoreline_m : num 23146 33046 45821 33046 33046 ...
$ Rank : num 5 9 3 9 9 9 9 6 3 3 ...
$ Comp.8 : num 0.826 0.814 0.838 0.814 0.814 ...
$ Ble : num 0.958 0.969 0.959 0.969 0.969 ...
$ DZ : num 0.647 0.837 0.732 0.837 0.837 ...
$ Herb : num 0.682 0.564 0.704 0.564 0.564 ...
$ Rec : num 0.375 0.477 0.467 0.477 0.477 ...
$ MA : num 0.965 0.975 0.907 0.975 0.975 ...
$ Dam : num 0.998 1 0.992 1 1 ...
$ TAXONNAME.y : Factor w/ 94 levels "Abudefduf sordidus",..: 94 94 94 94 94 94 94 94 94 94 ...
$ Dummy : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
I expected a classification of "DZCLASS".
Thanks, every recommendation is welcomed!

Error "(subscript) logical subscript too long" with tune.svm from e1071 package in R

I am trying to use SVM for a multi-class classification task.
I have a dataset called df, which I divided into a training and a test set with the following code:
sample <- df[sample(nrow(df), 10000),] # take a random sample of 10,000 from dataset df
sample <- sample %>% arrange(Date) # arrange chronologically
train <- sample[1:8000,] # 80% of the df dataset
test <- sample[8001:10000,] # 20% of the df dataset
This is what the training set looks like:
> str(train)
'data.frame': 8000 obs. of 45 variables:
$ Date : Date, format: "2008-01-01" "2008-01-01" "2008-01-02" ...
$ Weekday : chr "Tuesday" "Tuesday" "Wednesday" "Wednesday" ...
$ Season : Factor w/ 4 levels "Winter","Spring",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Weekend : num 0 0 0 0 0 0 0 0 0 0 ...
$ Icao.type : Factor w/ 306 levels "A124","A225",..: 7 29 112 115 107 10 115 115 115 112 ...
$ Act.description : Factor w/ 389 levels "A300-600F","A330-200F",..: 9 29 161 162 150 13 162 162 162 161 ...
$ Arr.dep : Factor w/ 2 levels "A","D": 2 2 1 1 1 1 1 1 1 1 ...
$ MTOW : num 77 69 46 21 22 238 21 21 21 46 ...
$ Icao.wtc : chr "Medium" "Medium" "Medium" "Medium" ...
$ Wind.direc : int 104 104 82 82 93 93 93 132 132 132 ...
$ Wind.speed.vec : int 35 35 57 57 64 64 64 62 62 62 ...
$ Wind.speed.daily: int 35 35 58 58 65 65 65 63 63 63 ...
$ Wind.speed.max : int 60 60 70 70 80 80 80 90 90 90 ...
$ Wind.speed.min : int 20 20 40 40 50 50 50 50 50 50 ...
$ Wind.gust.max : int 100 100 120 120 130 130 130 140 140 140 ...
$ Temp.daily : int 24 24 -5 -5 4 4 4 34 34 34 ...
$ Temp.min : int -7 -7 -25 -25 -13 -13 -13 11 11 11 ...
$ Temp.max : int 50 50 16 16 13 13 13 55 55 55 ...
$ Temp.10.min : int -11 -11 -32 -32 -18 -18 -18 9 9 9 ...
$ Sun.dur : int 7 7 65 65 19 19 19 0 0 0 ...
$ Sun.dur.prct : int 9 9 83 83 24 24 24 0 0 0 ...
$ Radiation : int 173 173 390 390 213 213 213 108 108 108 ...
$ Precip.dur : int 0 0 0 0 0 0 0 5 5 5 ...
$ Precip.daily : int 0 0 0 0 -1 -1 -1 2 2 2 ...
$ Precip.max : int 0 0 0 0 -1 -1 -1 2 2 2 ...
$ Sea.press.daily : int 10259 10259 10206 10206 10080 10080 10080 10063 10063 10063 ...
$ Sea.press.max : int 10276 10276 10248 10248 10132 10132 10132 10086 10086 10086 ...
$ Sea.press.min : int 10250 10250 10141 10141 10058 10058 10058 10001 10001 10001 ...
$ Visibility.min : int 1 1 40 40 43 43 43 58 58 58 ...
$ Visibility.max : int 59 59 75 75 66 66 66 65 65 65 ...
$ Cloud.daily : int 7 7 3 3 8 8 8 8 8 8 ...
$ Humidity.daily : int 96 96 86 86 77 77 77 82 82 82 ...
$ Humidity.max : int 99 99 92 92 92 92 92 90 90 90 ...
$ Humidity.min : int 91 91 74 74 71 71 71 76 76 76 ...
$ Evapo : int 2 2 4 4 2 2 2 1 1 1 ...
$ Wind.discrete : chr "South East" "South East" "North East" "North East" ...
$ Vmc.imc : chr "Unknown" "Unknown" "Unknown" "Unknown" ...
$ Beaufort : num 3 3 4 4 4 4 4 4 4 4 ...
$ Main.A : num 0 0 0 0 0 0 0 0 0 0 ...
$ Main.B : num 0 0 0 0 0 0 0 0 0 0 ...
$ Main.K : num 0 0 0 0 0 0 0 0 0 0 ...
$ Main.O : num 0 0 0 0 0 0 0 0 0 0 ...
$ Main.P : num 0 0 0 0 0 0 0 0 0 0 ...
$ Main.Z : num 0 0 0 0 0 0 0 0 0 0 ...
$ Runway : Factor w/ 13 levels "04","06","09",..: 3 8 2 2 2 6 2 6 6 6 ...
Then, I try to tune the SVM parameters with the following code:
library(e1071)
tuned <- tune.svm(Runway ~ ., data = train, gamma = 10 ^ (-6:-1), cost = 10 ^ (-1:1))
While this code has worked in the past, it now gives me the following error:
Error in newdata[, object$scaled, drop = FALSE] :
(subscript) logical subscript too long
The only thing I can think of that has changed is the rows in the dataset train, as running the first code block means taking a random sample of 10,000 (out of dataset df, that contains 3.5 million rows).
Does anyone know why I am getting this?
I recognise that this question was rather hard to solve without a good reproducible example.
However, I have found the solution to my problem and wanted to post it here for anyone who might be looking for this in the future.
Running the same code, but with selected columns from the train set:
tuned <- tune.svm(Runway ~ ., data = train[,c(1:2, 45)], gamma = 10 ^ (-6:-1), cost = 10 ^ (-1:1))
gave me absolutely no problem. I continued adding more features until the error was reproduced. I found that the features Vmc.imc and Icao.wtc were causing the error and that they were both chr features. Using the following code:
train$Vmc.imc <- as.factor(train$Vmc.imc)
train$Icao.wtc <- as.factor(train$Icao.wtc)
to change them into factors and then rerunning
tuned <- tune.svm(Runway ~ ., data = train, gamma = 10 ^ (-6:-1), cost = 10 ^ (-1:1))
solved my problem.
I do not know why the other chr features such as Weekday and Wind.discrete are not causing the same issue. If anyone knows the answer to this, I would be glad to find out.
Similar to this thread here. I added the fact that if you neglect making all your character features factors, you will also receive this error when attempting to run predict.
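Rather than converting the chr columns one at a time, they can all be converted in one step. A minimal base R sketch on a made-up three-column stand-in for train (column names taken from the question, values invented, and the one-row `test` frame is hypothetical):

```r
# Made-up stand-in for the training set.
train <- data.frame(
  Weekday = c("Tuesday", "Wednesday", "Tuesday"),
  Vmc.imc = c("Unknown", "VMC", "Unknown"),
  MTOW    = c(77, 69, 46),
  stringsAsFactors = FALSE
)

# Find every character column and convert it to a factor.
chr_cols <- vapply(train, is.character, logical(1))
train[chr_cols] <- lapply(train[chr_cols], factor)

# For a test set, reuse the training levels so both sets agree.
test <- data.frame(Weekday = "Wednesday", Vmc.imc = "VMC", MTOW = 21,
                   stringsAsFactors = FALSE)
for (col in names(train)[chr_cols]) {
  test[[col]] <- factor(test[[col]], levels = levels(train[[col]]))
}

str(train)
```

Values in the test set that never occurred in training become NA with this approach, which is worth checking before calling predict.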

Have issues in getting probability values using SVM in R

I have a data set with 28 attributes. The response variable is binary (0 and 1). I tried fitting an SVM with probability=T, but I still could not get the probability values from the result.
Here is my training data set (last attribute is my response variable):
str(train)
'data.frame': 73630 obs. of 29 variables:
$ EMOTION_INDICATOR: num -2 -0.625 0.9 0 1.625 ...
$ CLUSTER : Factor w/ 8 levels "","cluster0",..: 4 7 5 1 1 3 8 6 7 1 ...
$ GENDER : Factor w/ 3 levels "","Female","Male": 2 2 2 1 1 3 3 2 3 1 ...
$ AGE : num 36 37 70 NA NA ...
$ REGION : Factor w/ 6 levels "","'Northern Ireland'",..: 5 6 5 1 1 6 4 4 6 1 ...
$ WORKING : Factor w/ 14 levels "","A","B","C",..: 4 14 8 1 1 6 4 3 4 1 ...
$ MUSIC : Factor w/ 7 levels "","A","B","C",..: 5 7 6 1 1 4 5 2 5 1 ...
$ LIST_OWN : num 1 1 6 NA NA ...
$ LIST_BACK : num 1 3 2 NA NA 0.5 2 0.5 6 NA ...
$ Q1 : num 10 51 35 NA NA 29 51 25 69 NA ...
$ Q2 : num 53 51 36 NA NA 7 49 25 71 NA ...
$ Q3 : num 12 70 37 NA NA 26 51 23 70 NA ...
$ Q4 : num 11 31 36 NA NA 2 50 24 7 NA ...
$ Q5 : num 12 6 37 NA NA 51 73 22 10 NA ...
$ Q6 : num 12 6 9 NA NA 51 47 22 68 NA ...
$ Q7 : num 76 5 36 NA NA 29 50 30 11 NA ...
$ Q8 : num 76 24 13 NA NA 72 52 10 10 NA ...
$ Q9 : num 51 7 70 NA NA 12 36 48 53 NA ...
$ Q10 : num 53 70 69 NA NA 9 91 18 75 NA ...
$ Q11 : num 76 89 65 NA NA 53 53 18 86 NA ...
$ Q12 : num 76 91 63 NA NA 5 52 16 89 NA ...
$ Q13 : num 52 50 6 NA NA 51 77 17 99 NA ...
$ Q14 : num 75 73 62 NA NA 70 78 21 100 NA ...
$ Q15 : num 11 72 31 NA NA 33 48 19 67 NA ...
$ Q16 : num 12 47.7 24.3 NA NA ...
$ Q17 : num 71 74 51 NA NA 51 52 27 98 NA ...
$ Q18 : num 23.6 52 31 NA NA ...
$ Q19 : num 22.5 52 32 NA NA ...
$ AVERAGE_RATING : Factor w/ 2 levels "0","1": 1 1 2 1 2 1 1 1 1 1 ...
My test set looks similar. It has 24544 obs. of 29 variables.
This is the code that I used for SVM:
fitSVM <- svm(AVERAGE_RATING ~., data=train, na.action = na.omit,probability=T)
predSVM <- predict(fitSVM,test[!rowSums(is.na(test)),],type="probability")
table(predSVM,test$AVERAGE_RATING[!rowSums(is.na(test))],useNA="no")
predSVM 0 1
0 8091 1523
1 3259 9865
I get proper output, but without probability values:
attr(predSVM,"probabilities")
NULL
Am I doing something wrong?
You need to call predict with probability=T rather than type="probability":
predSVM <- predict(fitSVM,test[!rowSums(is.na(test)),], probability=T)
See ?predict.svm
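A minimal sketch of the same fix on the built-in iris data (not the asker's data), assuming the e1071 package is installed. The model has to be fitted with probability = TRUE and predict has to be called with probability = TRUE as well; as far as I can tell, an unrecognised argument such as type = "probability" is simply absorbed by predict's `...` and ignored.

```r
library(e1071)

# Fit with probability estimates enabled.
fit <- svm(Species ~ ., data = iris, probability = TRUE)

# Ask predict for probabilities too; they come back as an attribute.
pred  <- predict(fit, iris, probability = TRUE)
probs <- attr(pred, "probabilities")

# One row per observation, one column per class.
print(dim(probs))
print(head(probs, 3))
```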

Carc data from rda file to numeric matrix

I am trying to run KDA (kernel discriminant analysis) on the carc data, but when I call X<-data.frame(scale(X)); R shows this error:
"Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric"
I tried as.numeric(as.matrix(carc)) and carc<-na.omit(carc), but neither helps.
# install the missing packages once, before loading them
install.packages("klaR")
install.packages("FSelector")
library(ks);library(MASS);library(klaR);library(FSelector)
attach("carc.rda")
data<-load("carc.rda")
data
carc<-na.omit(carc)
head(carc)
class(carc) # check for its class
class(as.matrix(carc)) # change class, and
as.numeric(as.matrix(carc))
XX<-carc
X<-XX[,1:12];X.class<-XX[,13];
X<-data.frame(scale(X));
fit.pc<-princomp(X,scores=TRUE);
plot(fit.pc,type="line")
X.new<-fit.pc$scores[,1:5]; X.new<-data.frame(X.new);
cfs(X.class~.,cbind(X.new,X.class))
X.new<-fit.pc$scores[,c(1,4)]; X.new<-data.frame(X.new);
fit.kda1<-Hkda(x=X.new,x.group=X.class,pilot="samse",
bw="plugin",pre="sphere")
kda.fit1 <- kda(x=X.new, x.group=X.class, Hs=fit.kda1)
Can you help me resolve this problem and run this analysis?
Added: the car data set (Chambers, Cleveland, Kleiner & Tukey, 1983)
> head(carc)
P M R78 R77 H R Tr W L T D G C
AMC_Concord 4099 22 3 2 2.5 27.5 11 2930 186 40 121 3.58 US
AMC_Pacer 4749 17 3 1 3.0 25.5 11 3350 173 40 258 2.53 US
AMC_Spirit 3799 22 . . 3.0 18.5 12 2640 168 35 121 3.08 US
Audi_5000 9690 17 5 2 3.0 27.0 15 2830 189 37 131 3.20 Europe
Audi_Fox 6295 23 3 3 2.5 28.0 11 2070 174 36 97 3.70 Europe
Here is a small dataset with characteristics similar to what you describe, which reproduces this error:
"Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric"
carc <- data.frame(type1=rep(c('1','2'), each=5),
type2=rep(c('5','6'), each=5),
x = rnorm(10,1,2)/10, y = rnorm(10),
stringsAsFactors = TRUE) # needed in R >= 4.0 to get the factor columns shown below
This should be similar to your data.frame
str(carc)
# 'data.frame': 10 obs. of 4 variables:
# $ type1: Factor w/ 2 levels "1","2": 1 1 1 1 1 2 2 2 2 2
# $ type2: Factor w/ 2 levels "5","6": 1 1 1 1 1 2 2 2 2 2
# $ x : num -0.1177 0.3443 0.1351 0.0443 0.4702 ...
# $ y : num -0.355 0.149 -0.208 -1.202 -1.495 ...
scale(carc)
# Similar error
# Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
Using set() from data.table to convert those columns to numeric:
require(data.table)
DT <- data.table(carc)
cols_fix <- c("type1", "type2")
for (col in cols_fix) set(DT, j=col, value = as.numeric(as.character(DT[[col]])))
str(DT)
# Classes ‘data.table’ and 'data.frame': 10 obs. of 4 variables:
# $ type1: num 1 1 1 1 1 2 2 2 2 2
# $ type2: num 5 5 5 5 5 6 6 6 6 6
# $ x : num 0.0465 0.1712 0.1582 0.1684 0.1183 ...
# $ y : num 0.155 -0.977 -0.291 -0.766 -1.02 ...
# - attr(*, ".internal.selfref")=<externalptr>
The first column(s) of your data set may be factors. Taking the data from corrgram:
library(corrgram)
carc <- auto
str(carc)
# 'data.frame': 74 obs. of 14 variables:
# $ Model : Factor w/ 74 levels "AMC Concord ",..: 1 2 3 4 5 6 7 8 9 10 ...
# $ Origin: Factor w/ 3 levels "A","E","J": 1 1 1 2 2 2 1 1 1 1 ...
# $ Price : int 4099 4749 3799 9690 6295 9735 4816 7827 5788 4453 ...
# $ MPG : int 22 17 22 17 23 25 20 15 18 26 ...
# $ Rep78 : num 3 3 NA 5 3 4 3 4 3 NA ...
# $ Rep77 : num 2 1 NA 2 3 4 3 4 4 NA ...
# $ Hroom : num 2.5 3 3 3 2.5 2.5 4.5 4 4 3 ...
# $ Rseat : num 27.5 25.5 18.5 27 28 26 29 31.5 30.5 24 ...
# $ Trunk : int 11 11 12 15 11 12 16 20 21 10 ...
# $ Weight: int 2930 3350 2640 2830 2070 2650 3250 4080 3670 2230 ...
# $ Length: int 186 173 168 189 174 177 196 222 218 170 ...
# $ Turn : int 40 40 35 37 36 34 40 43 43 34 ...
# $ Displa: int 121 258 121 131 97 121 196 350 231 304 ...
# $ Gratio: num 3.58 2.53 3.08 3.2 3.7 3.64 2.93 2.41 2.73 2.87 ...
So exclude them by trying this:
X<-XX[,3:14]
or this
X<-XX[,-(1:2)]
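Instead of hard-coding column positions, the numeric columns can be selected by type. A minimal base R sketch on a made-up four-row stand-in for carc; the column names P, R78 and C follow the head(carc) output in the question, and treating "." as a missing-value marker is an assumption based on that output:

```r
# Made-up stand-in for carc: R78 uses "." for missing, C is the origin.
carc <- data.frame(
  P   = c(4099, 4749, 3799, 9690),
  R78 = c("3", "3", ".", "5"),
  C   = c("US", "US", "US", "Europe"),
  stringsAsFactors = TRUE
)

# Turn "." into NA and convert the column back to numbers.
r78 <- as.character(carc$R78)
r78[r78 == "."] <- NA
carc$R78 <- as.numeric(r78)

# Keep only the numeric columns, then scale them.
num_cols <- vapply(carc, is.numeric, logical(1))
X <- data.frame(scale(carc[, num_cols, drop = FALSE]))
X.class <- carc$C

print(X)
```

Note that princomp does not accept missing values by default, so rows with NA would still need to be dropped or imputed before that step.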
