I have a data set with 28 attributes. The response variable is binary (0 and 1). I ran an SVM with probability=T, but I still could not get the probability values from the result.
Here is my training data set (last attribute is my response variable):
str(train)
'data.frame': 73630 obs. of 29 variables:
$ EMOTION_INDICATOR: num -2 -0.625 0.9 0 1.625 ...
$ CLUSTER : Factor w/ 8 levels "","cluster0",..: 4 7 5 1 1 3 8 6 7 1 ...
$ GENDER : Factor w/ 3 levels "","Female","Male": 2 2 2 1 1 3 3 2 3 1 ...
$ AGE : num 36 37 70 NA NA ...
$ REGION : Factor w/ 6 levels "","'Northern Ireland'",..: 5 6 5 1 1 6 4 4 6 1 ...
$ WORKING : Factor w/ 14 levels "","A","B","C",..: 4 14 8 1 1 6 4 3 4 1 ...
$ MUSIC : Factor w/ 7 levels "","A","B","C",..: 5 7 6 1 1 4 5 2 5 1 ...
$ LIST_OWN : num 1 1 6 NA NA ...
$ LIST_BACK : num 1 3 2 NA NA 0.5 2 0.5 6 NA ...
$ Q1 : num 10 51 35 NA NA 29 51 25 69 NA ...
$ Q2 : num 53 51 36 NA NA 7 49 25 71 NA ...
$ Q3 : num 12 70 37 NA NA 26 51 23 70 NA ...
$ Q4 : num 11 31 36 NA NA 2 50 24 7 NA ...
$ Q5 : num 12 6 37 NA NA 51 73 22 10 NA ...
$ Q6 : num 12 6 9 NA NA 51 47 22 68 NA ...
$ Q7 : num 76 5 36 NA NA 29 50 30 11 NA ...
$ Q8 : num 76 24 13 NA NA 72 52 10 10 NA ...
$ Q9 : num 51 7 70 NA NA 12 36 48 53 NA ...
$ Q10 : num 53 70 69 NA NA 9 91 18 75 NA ...
$ Q11 : num 76 89 65 NA NA 53 53 18 86 NA ...
$ Q12 : num 76 91 63 NA NA 5 52 16 89 NA ...
$ Q13 : num 52 50 6 NA NA 51 77 17 99 NA ...
$ Q14 : num 75 73 62 NA NA 70 78 21 100 NA ...
$ Q15 : num 11 72 31 NA NA 33 48 19 67 NA ...
$ Q16 : num 12 47.7 24.3 NA NA ...
$ Q17 : num 71 74 51 NA NA 51 52 27 98 NA ...
$ Q18 : num 23.6 52 31 NA NA ...
$ Q19 : num 22.5 52 32 NA NA ...
$ AVERAGE_RATING : Factor w/ 2 levels "0","1": 1 1 2 1 2 1 1 1 1 1 ...
My test set looks similar too. It has 24544 obs. with 29 variables.
This is the code that I used for SVM:
fitSVM <- svm(AVERAGE_RATING ~., data=train, na.action = na.omit,probability=T)
predSVM <- predict(fitSVM,test[!rowSums(is.na(test)),],type="probability")
table(predSVM,test$AVERAGE_RATING[!rowSums(is.na(test))],useNA="no")
predSVM 0 1
0 8091 1523
1 3259 9865
I get proper output, but without probability values:
attr(predSVM,"probabilities")
NULL
Am I doing something wrong?
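You need to call predict with probability=T rather than type="probability" (predict.svm has no type argument, so it is silently ignored):
predSVM <- predict(fitSVM,test[!rowSums(is.na(test)),], probability=T)
See ?predict.svm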
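To illustrate the point above, here is a minimal sketch on a built-in data set rather than the asker's data. It assumes the e1071 package is installed; the two-class subset of iris and the names two, fit, pred and probs are only stand-ins for train, fitSVM and predSVM.

```r
library(e1071)

# Two-class stand-in for the real training data (assumption: iris subset).
two <- droplevels(iris[iris$Species != "setosa", ])

# probability = TRUE is needed both when fitting and when predicting.
fit  <- svm(Species ~ ., data = two, probability = TRUE)
pred <- predict(fit, two, probability = TRUE)

# The class probabilities are attached to the prediction as an attribute.
probs <- attr(pred, "probabilities")
head(probs)
```

If attr(pred, "probabilities") still comes back NULL, it is worth checking that probability = TRUE was also passed when the model was fitted.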
I'm new to R and have received data from others who are much better at R than I am.
I need to do some simple descriptive statistics with a deadline tomorrow, and I noticed that the data is in "tibble" format. I would like it as a data frame instead. Can anybody help? It's probably very simple, but my tidyverse skills are still very limited (I'm working on it).
The tibble is laid out as I want it, with 165 rows (one for each patient) and 16 columns.
I would just like it to be a data frame.
Thank you for your time.
I tried the very simple
data_dataframe <- as.data.frame(model_data)
But it didn't work.
My str output is:
> str(model_data)
tibble [165 × 16] (S3: tbl_df/tbl/data.frame)
$ PatientID : Factor w/ 165 levels "Patient_001",..: 1 2 3 4 5 6 7 8 9 10 ...
$ VAS_baseline : num [1:165] NA 99 25 75 50 50 90 81 80 100 ...
$ VAS_followup_30 : num [1:165] 75 95 53 85 88 92 98 NA NA 80 ...
$ VAS_followup_180 : num [1:165] 95 83 35 NA 94 68 98 NA NA 100 ...
$ Index_baseline : num [1:165] NA 1 0.847 1 0.813 0.826 0.967 1 0.96 0.952 ...
$ Index_followup_30 : num [1:165] 0.967 0.919 0.764 0.919 1 0.96 1 NA NA 0.919 ...
$ Index_followup_180: num [1:165] 1 0.88 0.728 NA 1 1 1 NA NA 0.952 ...
$ Age : num [1:165] 68 74 61 64 69 55 74 68 79 66 ...
$ Group : Factor w/ 3 levels "Group_1","Group_2",..: 2 3 1 1 1 3 1 3 2 2 ...
$ Surgeon : Factor w/ 6 levels "Surgeon_1","Surgeon_2",..: 1 3 1 1 5 3 1 6 4 6 ...
$ VAS_to_30 : num [1:165] NA -4 28 10 38 42 8 NA NA -20 ...
$ VAS_to_180 : num [1:165] NA -16 10 NA 44 18 8 NA NA 0 ...
$ VAS_30_to_180 : num [1:165] 20 -12 -18 NA 6 -24 0 NA NA 20 ...
$ Index_to_30 : num [1:165] NA -0.081 -0.083 -0.081 0.187 ...
$ Index_to_180 : num [1:165] NA -0.12 -0.119 NA 0.187 0.174 0.033 NA NA 0 ...
$ Index_30_to_180 : num [1:165] 0.033 -0.039 -0.036
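For what it's worth, as.data.frame() is normally enough to drop the tibble classes, so one thing to check is the class of the result. The sketch below is only an illustration: toy is a made-up two-row stand-in for model_data, and the tibble class vector is set by hand so that the example runs without the tidyverse.

```r
# Made-up stand-in for model_data (values are illustrative only).
toy <- data.frame(PatientID = factor(c("Patient_001", "Patient_002")),
                  VAS_baseline = c(NA, 99))

# Mimic the class vector that str() reports for a tibble.
class(toy) <- c("tbl_df", "tbl", "data.frame")

# as.data.frame() returns a new object; the original is left unchanged,
# so the result has to be assigned and then inspected.
plain <- as.data.frame(toy)
class(plain)
```

If class() on the converted object shows only "data.frame", the conversion did work and the problem probably lies in a later step.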
Good afternoon,
Assume we have the following function:
data_preprocessing<-function(link){
  link=as.character(link)
  dataset=read.csv(link)
  dataset=replace(dataset,dataset=="?",NA)
  return(dataset)
}
Example (https protocol problem):
Echocardiogram=data_preprocessing("https://archive.ics.uci.edu/ml/machine-learning-databases/echocardiogram/echocardiogram.data")
Error in file(file, "rt") : cannot open the connection
After downloading the dataset:
Echocardiogram=data_preprocessing("http://archive.ics.uci.edu/ml/machine-learning-databases/echocardiogram/echocardiogram.data")
head(Echocardiogram)
X11 X0 X71 X0.1 X0.260 X9 X4.600 X14 X1 X1.1 name X1.2 X0.2
1 19 0 72 0 0.380 6 4.100 14 1.700 0.588 name 1 0
2 16 0 55 0 0.260 4 3.420 14 1 1 name 1 0
3 57 0 60 0 0.253 12.062 4.603 16 1.450 0.788 name 1 0
4 19 1 57 0 0.160 22 5.750 18 2.250 0.571 name 1 0
5 26 0 68 0 0.260 5 4.310 12 1 0.857 name 1 0
6 13 0 62 0 0.230 31 5.430 22.5 1.875 0.857 name 1 0
Also:
str(Echocardiogram)
'data.frame': 130 obs. of 12 variables:
$ X11 : Factor w/ 57 levels "",".03",".25",..: 18 16 54 18 27 14 50 18 26 12 ...
$ X0 : Factor w/ 4 levels "","?","0","1": 3 3 3 4 3 3 3 3 3 4 ...
$ X71 : Factor w/ 40 levels "","?","35","46",..: 30 12 17 14 26 19 17 4 11 34 ...
$ X0.1 : int 0 0 0 0 0 0 0 0 0 0 ...
$ X0.260: Factor w/ 74 levels "","?","0.010",..: 65 50 47 26 50 42 59 60 21 19 ...
$ X9 : Factor w/ 93 levels "","?","0","10",..: 69 57 13 46 62 56 79 3 19 29 ...
$ X4.600: Factor w/ 106 levels "","?","2.32",..: 25 6 54 92 38 85 76 70 47 33 ...
$ X14 : Factor w/ 48 levels "","?","10","10.5",..: 16 16 21 27 8 36 16 21 19 27 ...
$ X1 : Factor w/ 67 levels "","?","1","1.04",..: 48 3 37 60 3 52 3 11 16 50 ...
$ X1.1 : Factor w/ 32 levels "","?","0.140",..: 14 30 25 13 27 27 30 31 29 21 ...
$ X1.2 : Factor w/ 5 levels "","?","1","2",..: 3 3 3 3 3 3 3 3 3 3 ...
$ X0.2 : Factor w/ 5 levels "","?","0","1",..: 3 3 3 3 3 3 3 3 3 4 ...
Here, I want to replace all "?" in the dataset with NA. It would also be good to remove duplicated and empty rows (like row 50).
Thank you for your help!
Something like this?
library(data.table)
DT <- data.table::fread("https://archive.ics.uci.edu/ml/machine-learning-databases/echocardiogram/echocardiogram.data",
fill = TRUE,
na.strings = "?")
When using read.csv from base you can set na.strings = "?" and header=FALSE.
Echocardiogram <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/echocardiogram/echocardiogram.data"
, na.strings = "?", header=FALSE)
str(Echocardiogram)
#'data.frame': 133 obs. of 13 variables:
# $ V1 : num 11 19 16 57 19 26 13 50 19 25 ...
# $ V2 : int 0 0 0 0 1 0 0 0 0 0 ...
# $ V3 : num 71 72 55 60 57 68 62 60 46 54 ...
# $ V4 : int 0 0 0 0 0 0 0 0 0 0 ...
# $ V5 : num 0.26 0.38 0.26 0.253 0.16 0.26 0.23 0.33 0.34 0.14 ...
# $ V6 : num 9 6 4 12.1 22 ...
# $ V7 : num 4.6 4.1 3.42 4.6 5.75 ...
# $ V8 : num 14 14 14 16 18 12 22.5 14 16 15.5 ...
# $ V9 : num 1 1.7 1 1.45 2.25 ...
# $ V10: num 1 0.588 1 0.788 0.571 ...
# $ V11: chr "name" "name" "name" "name" ...
# $ V12: chr "1" "1" "1" "1" ...
# $ V13: int 0 0 0 0 0 0 0 0 0 0 ...
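The question also asks about removing duplicated and empty rows, which the snippet above does not cover. A possible base R sketch follows; raw is a small made-up excerpt in the same comma-separated style, not the real file, and treating blank fields as NA is an assumption about what counts as an "empty" row.

```r
# Made-up excerpt: one "?" value, one empty row, one duplicated row.
raw <- "11,0,71,?\n19,0,72,0\n,,,\n19,0,72,0\n"

# Treat both "?" and blank fields as missing.
df <- read.csv(text = raw, header = FALSE, na.strings = c("?", ""))

# Drop rows in which every field is NA, then drop exact duplicates.
df <- df[rowSums(!is.na(df)) > 0, ]
df <- df[!duplicated(df), ]
df
```

The same two filtering lines should apply unchanged to the Echocardiogram data frame read from the URL.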
I am new to R and I am stuck on my data analysis. My Excel data has a lot of NAs, and I have tried troubleshooting the error below. Here's my code if anyone can help, and a link to a sample of my data:
file:///C:/Users/steph/Documents/DLI%20ANOVA%20Sample.htm
Some of my variables have 4 reps instead of all 8 reps, so I have a lot of NAs in the Excel file. I keep getting this error when I call tapply:
Error in tapply(X = data1$gi..m3., INDEX = data1$cultivar, FUN = mean, :
arguments must have same length
library(agricolae)
data1=read.csv("DLI ANOVA Sample.csv", header=T, as.is=T)
#setting factors
block = as.factor(data1$block)
treatmentt = as.factor(data1$trt)
cultivar<-factor(data1$cv,c("CR", "LB","RF","RR","S","SNS","SNY","SSJ","YC"))
str(data1)
#Summary statistics
tapply(X = data1$growth.index, INDEX = data1$cultivar, FUN = mean, na.rm=T)
tapply(X = data1$growth.index, INDEX = data1$treatment, FUN = mean, na.rm=T)
'data.frame': 288 obs. of 24 variables:
$ block : int 1 1 2 2 3 3 4 4 1 1 ...
$ trt : chr "HL-L" "HL-L" "HL-L" "HL-L" ..
$ cv : chr "CR" "CR" "CR" "CR" ...
$ rep : int 1 2 3 4 5 6 7 8 1 2 ...
$ height : int 23 20 25 19 23 19 22 19 19 24
$ growth.index : num 0.0221 0.0258 0.0276 0.0227 0.0209
$ number.of.mature.fruit : int 34 30 35 34 28 25 40 24 12 16 ...
$ mature.fruit.fw : num 163 163 186 152 169 ...
$ number.of.immature.fruit : int 38 28 40 27 35 37 44 48 20 30 ...
$ immature.fruit.fw : num 77.4 66.6 87.6 43.4 81.3 ...
$ Total.number.of.fruit : num 72 58 75 61 63 62 84 72 32 46 ...
$ Total.fruit.fw : num 241 230 273 195 250 ...
$ Fruit.Water.Content..g. : num NA 209 NA 176 NA ...
$ Brix.. : num 4.9 NA 5.6 NA 4.7 NA 5.1 NA 5.6 NA ...
$ pH : num 4.17 NA 4.3 NA 4.1 ...
$ EC.uS.mL : num 4.46 NA 9.19 NA 8.24 ...
$ X..citric.Acid : num 0.704 NA 0.397 NA 0.653 ...
$ Sugar.Acid.Ratio : num 6.96 NA 14.11 NA 7.2 ...
$ oedema.injury.level..1.6. : int 3 3 1 2 1 1 1 2 2 1 ...
$ Stomatal.conductance : num NA 365 NA 422 NA ...
$ spad : num NA NA NA 64.3 NA 65.5 NA 68.7 NA 55.6 ...
$ Irrigation.Events : int NA 14 NA 12 NA 13 NA 16 NA 13 ...
$ WUE : num NA 0.00584 NA 0.00693 NA ...
$ transpiration..g.H2O.lost..g.dry.biomass.: num NA 117 NA 111 NA ...
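One likely cause, judging from the str() output, is that the data frame has no columns named cultivar or treatment (they are cv and trt), so data1$cultivar is NULL and its length does not match growth.index. The sketch below reproduces that on a made-up four-row data frame; the values are illustrative only.

```r
# Made-up stand-in for data1 with the same column names as in str().
data1 <- data.frame(cv  = c("CR", "CR", "LB", "LB"),
                    trt = c("HL-L", "HL-L", "HL-L", "HL-L"),
                    growth.index = c(0.02, NA, 0.03, 0.05))

# There is no column called cultivar, so this is NULL (length 0),
# which is what makes tapply complain about argument lengths.
is.null(data1$cultivar)

# Indexing by the columns that do exist works, and na.rm = TRUE skips the NAs.
means_cv  <- tapply(data1$growth.index, data1$cv,  mean, na.rm = TRUE)
means_trt <- tapply(data1$growth.index, data1$trt, mean, na.rm = TRUE)
means_cv
```

Alternatively, the standalone cultivar and treatmentt factors created earlier in the script could be passed as INDEX directly, without the data1$ prefix.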
I'm a newbie working on a classification problem to identify the causes of coral diseases. The dataset contains 45 variables.
The output variable is a factor with 21 levels (21 diseases), and the inputs are numeric and factor variables. Some of those factors have as many as 94 levels, such as the coral species. I can't merge or split those levels because I want to be as precise as possible: one species may be less resistant than another. The numeric variables are things like the population in the area, fishing trips, etc.
First problem: I tried genetic algorithms, random forests, etc. to select the most important variables, but the runs get aborted, so the variables I eliminated were chosen only from correlograms. I want something stronger to decide which variables to select.
Second problem: I've tried everything I know and searched Google extensively for something that runs and produces a classification, but nothing works. I tried SVM, random forests, CART, GBM, bagging and boosting, but none of them can handle this dataset.
This is the structure of the dataset
'data.frame': 136510 obs. of 45 variables:
$ SITE : Factor w/ 144 levels "TUT-1511","TUT-1513",..: 56 15 55 21 12 12 17 53 48 82 ...
$ Zone_Fine : Factor w/ 17 levels "Aunuu_E","Aunuu_W",..: 11 9 10 9 9 9 9 8 10 10 ...
$ TRANSECT : num 1 1 1 1 1 1 1 1 1 1 ...
$ SEGMENT : num 5 1 1 1 7 5 7 5 3 7 ...
$ Seg_WIDTH : num 1 1 1 1 1 1 1 1 1 1 ...
$ Seg_LENGTH : num 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 ...
$ SPECIES : Factor w/ 156 levels "AAAA","AABR",..: 94 126 94 102 9 126 135 94 93 94 ...
$ COLONYLENGTH : num 11 45 10 5 12 10 8 30 20 14 ...
$ OLDDEAD : num 5 2 5 0 0 5 10 0 5 10 ...
$ RECENTDEAD : num 0 10 0 0 0 0 0 0 0 0 ...
$ DZCLASS : Factor w/ 21 levels "Acute Tissue Loss - White Syndrome",..: 14 14 14 14 14 14 14 14 14 14 ...
$ EXTENT : num 52.9 52.9 52.9 52.9 52.9 ...
$ SEVERITY : num 3.11 3.11 3.11 3.11 3.11 ...
$ TAXONNAME.x : Factor w/ 155 levels "Acanthastrea hemprichii",..: 95 132 95 107 7 132 133 95 89 95 ...
$ PHYLUM : Factor w/ 2 levels "Cnidaria","Rhodophyta": 1 1 1 1 1 1 1 1 1 1 ...
$ CLASS : Factor w/ 3 levels "Anthozoa","Florideophyceae",..: 1 1 1 1 1 1 1 1 1 1 ...
$ FAMILY : Factor w/ 20 levels "Acroporidae",..: 1 18 1 2 1 18 18 1 8 1 ...
$ GENUS : Factor w/ 55 levels "Acanthastrea",..: 35 44 35 39 2 44 44 35 34 35 ...
$ RANK : Factor w/ 2 levels "Genus","Species": 1 1 1 1 2 1 2 1 1 1 ...
$ DATE_ : Date, format: "0015-03-27" ...
$ OBS_YEAR : num 2015 2015 2015 2015 2015 ...
$ REEF_ZONE : Factor w/ 2 levels "Backreef","Forereef": 2 2 2 2 2 2 2 2 2 2 ...
$ DEPTH_BIN : Factor w/ 4 levels "Bank","Deep",..: 2 2 4 3 2 2 3 4 3 3 ...
$ LBSP : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
$ Zone_Fine_ReefZone_Depth: Factor w/ 41 levels "Aunuu_E_Deep",..: 30 24 29 25 24 24 25 23 28 28 ...
$ Area_km2.x : num 50.9 49.1 101.8 49.1 49.1 ...
$ Fishing.trips.per.km2 : num 719 1148 1431 1148 1148 ...
$ Area_km2.y : num 50.9 49.1 50.9 49.1 49.1 ...
$ Pop.km2 : num 167.5 49.1 561.9 49.1 49.1 ...
$ SHED_NAME : Factor w/ 35 levels "Aasu","Afao - Asili",..: 2 9 15 17 17 1 1 35 28 26 ...
$ Shed_Cond : Factor w/ 4 levels "Extensive","Intermediate",..: 3 4 2 4 4 3 3 3 1 2 ...
$ Shed_Area_Calc : num 30202 29422 458542 126361 32595 ...
$ Perc_Area : num 0.00128 0.00107 0.00993 0.00458 0.00118 ...
$ Cond_Scale : num 3 4 2 4 4 3 3 3 1 2 ...
$ Shoreline_m : num 23146 33046 45821 33046 33046 ...
$ Rank : num 5 9 3 9 9 9 9 6 3 3 ...
$ Comp.8 : num 0.826 0.814 0.838 0.814 0.814 ...
$ Ble : num 0.958 0.969 0.959 0.969 0.969 ...
$ DZ : num 0.647 0.837 0.732 0.837 0.837 ...
$ Herb : num 0.682 0.564 0.704 0.564 0.564 ...
$ Rec : num 0.375 0.477 0.467 0.477 0.477 ...
$ MA : num 0.965 0.975 0.907 0.975 0.975 ...
$ Dam : num 0.998 1 0.992 1 1 ...
$ TAXONNAME.y : Factor w/ 94 levels "Abudefduf sordidus",..: 94 94 94 94 94 94 94 94 94 94 ...
$ Dummy : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
I expected a classification of "DZCLASS".
Thanks, every recommendation is welcome!
I have a list of 18 data frames that I read in using read.xlsx. Each data frame has the same number of columns but some columns contain NA for some rows.
Also, in the Abundance column there are rows that contain non-numeric data. I suspect that I may need to remove these rows from each data frame, but I have not been able to find a way to do so.
My data frame structure is like this:
$ :'data.frame': 118 obs. of 10 variables:
..$ Locus : Factor w/ 24 levels "A","CS",..: 14 14 14 14 22 22 NA 22 10 10 ...
..$ Target : Factor w/ 96 levels "[AAAGA]14","[AAAGA]15",..: 88 91 90 87 11 12 NA 9 65 67 ...
..$ Length : num [1:118] 60 76 72 56 24 39 NA 20 139 141 ...
..$ Abundance : num [1:118] 1479 1108 180 144 1786 ...
..$ Size : num [1:118] 15 19 18 14 6 9.3 NA 5 32 32.2 ...
..$ Call : Factor w/ 4 levels "Al","HAs",..: 1 1 3 3 1 1 NA 3 1 1 ...
..$ RAR : num [1:118] NA 74.92 12.17 9.74 NA ...
..$ Position : num [1:118] NA NA NA NA NA NA NA NA NA NA ...
..$ Al.1.s.percent: num [1:118] NA NA 12.17 9.74 NA ...
..$ Al.2.s.percent: num [1:118] NA NA 16.2 13 NA ...
I want to apply this function to each data frame in my list of data frames.
add.sum = function(df){
transform(df, Tot.count = ave(df[[Abundunce]], df[[Locus]], FUN = sum))
}
I tried using this line with lapply
transformed.data = lapply(mydata, add.sum)
I also tried it this way
transformed.data = lapply(mydata, function (x) add.sum(x))
But these give me the following error
Error in .subset2(x, i, exact = exact) : no such index at level 1
Any suggestions on how to get this working correctly?
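A possible fix, sketched on made-up data: the column names inside [[ ]] are unquoted (and Abundunce is misspelled), so R looks them up as variables instead of column names, which is a likely source of the subscript error. The toy list below stands in for the real list of 18 data frames, and the na.rm handling is an assumption about how missing abundances should be treated.

```r
# Made-up stand-in for the real list of data frames.
mydata <- list(
  data.frame(Locus = c("A", "A", "CS"), Abundance = c(10, 5, NA)),
  data.frame(Locus = c("CS", "CS", "A"), Abundance = c(1, 2, 3))
)

# Column names are passed to [[ ]] as quoted strings.
# ave() returns the per-Locus sum repeated for every row of that Locus.
add.sum <- function(df) {
  df$Tot.count <- ave(df[["Abundance"]], df[["Locus"]],
                      FUN = function(v) sum(v, na.rm = TRUE))
  df
}

transformed.data <- lapply(mydata, add.sum)
transformed.data[[1]]
```

Both lapply(mydata, add.sum) and lapply(mydata, function(x) add.sum(x)) are equivalent here; the problem was in the function body, not in the lapply call.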