Im new to R and have received data from others with a much better level of R than me.
I need to due some simple descriptive statistic with deadline tomorrow and noticed that the data is in "tibble" format and I would like it as a dataframe instead. Can anybody help? its probably very simple but my skills in tidyverse are still very limited - but working on it:)
The tibble is as i would like it to be with 165 rows (one for each patient) and 16 columns.
I would just like it to be a dataframe.
Thank you for your time
I tried the very simple
data_dataframe <- as.data.frame(model_data)
But didnt work.
My str out is:
> str(model_data)
tibble [165 × 16] (S3: tbl_df/tbl/data.frame)
$ PatientID : Factor w/ 165 levels "Patient_001",..: 1 2 3 4 5 6 7 8 9 10 ...
$ VAS_baseline : num [1:165] NA 99 25 75 50 50 90 81 80 100 ...
$ VAS_followup_30 : num [1:165] 75 95 53 85 88 92 98 NA NA 80 ...
$ VAS_followup_180 : num [1:165] 95 83 35 NA 94 68 98 NA NA 100 ...
$ Index_baseline : num [1:165] NA 1 0.847 1 0.813 0.826 0.967 1 0.96 0.952 ...
$ Index_followup_30 : num [1:165] 0.967 0.919 0.764 0.919 1 0.96 1 NA NA 0.919 ...
$ Index_followup_180: num [1:165] 1 0.88 0.728 NA 1 1 1 NA NA 0.952 ...
$ Age : num [1:165] 68 74 61 64 69 55 74 68 79 66 ...
$ Group : Factor w/ 3 levels "Group_1","Group_2",..: 2 3 1 1 1 3 1 3 2 2 ...
$ Surgeon : Factor w/ 6 levels "Surgeon_1","Surgeon_2",..: 1 3 1 1 5 3 1 6 4 6 ...
$ VAS_to_30 : num [1:165] NA -4 28 10 38 42 8 NA NA -20 ...
$ VAS_to_180 : num [1:165] NA -16 10 NA 44 18 8 NA NA 0 ...
$ VAS_30_to_180 : num [1:165] 20 -12 -18 NA 6 -24 0 NA NA 20 ...
$ Index_to_30 : num [1:165] NA -0.081 -0.083 -0.081 0.187 ...
$ Index_to_180 : num [1:165] NA -0.12 -0.119 NA 0.187 0.174 0.033 NA NA 0 ...
$ Index_30_to_180 : num [1:165] 0.033 -0.039 -0.036
Related
I am new to R and I am having issues moving forward with the data analysis. My Excel data has a lot of NA's and I tried troubleshooting this error. Here's my code if anyone can help, and a link to a sample of my data
file:///C:/Users/steph/Documents/DLI%20ANOVA%20Sample.htm
Some of my variables have 4 reps instead of all 8reps, so I have a lot of NA's in the excel file. I keep getting this error after I try tapply:
Error in tapply(X = data1$gi..m3., INDEX = data1$cultivar, FUN = mean, :
arguments must have same length
library(agricolae)
data1=read.csv("DLI ANOVA Sample.csv", header=T, as.is=T)
#setting factors
block = as.factor(data1$block)
treatmentt = as.factor(data1$trt)
cultivar<-factor(data1$cv,c("CR", "LB","RF","RR","S","SNS","SNY","SSJ","YC"))
str(data1)
#Summary statistics
tapply(X = data1$growth.index, INDEX = data1$cultivar, FUN = mean, na.rm=T)
tapply(X = data1$growth.index, INDEX = data1$treatment, FUN = mean, na.rm=T)
data.frame': 288 obs. of 24 variables:
$ block : int 1 1 2 2 3 3 4 4 1 1 ...
$ trt : chr "HL-L" "HL-L" "HL-L" "HL-L" ..
$ cv : chr "CR" "CR" "CR" "CR" ...
$ rep : int 1 2 3 4 5 6 7 8 1 2 ...
$ height : int 23 20 25 19 23 19 22 19 19 24
$ growth.index : num 0.0221 0.0258 0.0276 0.0227 0.0209
$ number.of.mature.fruit : int 34 30 35 34 28 25 40 24 12 16 ...
$ mature.fruit.fw : num 163 163 186 152 169 ...
$ number.of.immature.fruit : int 38 28 40 27 35 37 44 48 20 30 ...
$ immature.fruit.fw : num 77.4 66.6 87.6 43.4 81.3 ...
$ Total.number.of.fruit : num 72 58 75 61 63 62 84 72 32 46 ...
$ Total.fruit.fw : num 241 230 273 195 250 ...
$ Fruit.Water.Content..g. : num NA 209 NA 176 NA ...
$ Brix.. : num 4.9 NA 5.6 NA 4.7 NA 5.1 NA 5.6 NA ...
$ pH : num 4.17 NA 4.3 NA 4.1 ...
$ EC.uS.mL : num 4.46 NA 9.19 NA 8.24 ...
$ X..citric.Acid : num 0.704 NA 0.397 NA 0.653 ...
$ Sugar.Acid.Ratio : num 6.96 NA 14.11 NA 7.2 ...
$ oedema.injury.level..1.6. : int 3 3 1 2 1 1 1 2 2 1 ...
$ Stomatal.conductance : num NA 365 NA 422 NA ...
$ spad : num NA NA NA 64.3 NA 65.5 NA 68.7 NA 55.6 ...
$ Irrigation.Events : int NA 14 NA 12 NA 13 NA 16 NA 13 ...
$ WUE : num NA 0.00584 NA 0.00693 NA ...
$ transpiration..g.H2O.lost..g.dry.biomass.: num NA 117 NA 111 NA ...
I am trying to build a decison tree model in R having both categorical and numerical variables.Some categorical variables have 3 levels , so can I just use as.factor and then use in my model? I tried to use model.matrix but my doubt is model.matrix converts the variable in numeric values of 0s and 1s and splitting happens on basis of these numeric values. For eg if Color has 3 level- blue,red,green, the splitting rule will look like color_green < 0.5 instead it should always take 0s and 1s only.
If you are asking whether you can use factors to build an rpart decision tree. Then yes. See below example from the documentation. Note that there are a lot of possible packages for decision trees.
library(rpart)
rpart(Reliability ~ ., data=car90)
#> n=76 (35 observations deleted due to missingness)
#>
#> node), split, n, loss, yval, (yprob)
#> * denotes terminal node
#>
#> 1) root 76 53 average (0.2 0.12 0.3 0.11 0.28)
#> 2) Country=Germany,Korea,Mexico,Sweden,USA 49 29 average (0.31 0.18 0.41 0.1 0)
#> 4) Tires=145,155/80,165/80,185/80,195/60,195/65,195/70,205/60,215/65,225/75,275/40 17 9 Much worse (0.47 0.29 0 0.24 0) *
#> 5) Tires=175/70,185/65,185/70,185/75,195/75,205/70,205/75,215/70 32 12 average (0.22 0.12 0.62 0.031 0)
#> 10) HP.revs< 4650 13 7 Much worse (0.46 0.23 0.31 0 0) *
#> 11) HP.revs>=4650 19 3 average (0.053 0.053 0.84 0.053 0) *
#> 3) Country=Japan,Japan/USA 27 6 Much better (0 0 0.11 0.11 0.78) *
str(car90)
#> 'data.frame': 111 obs. of 34 variables:
#> $ Country : Factor w/ 10 levels "Brazil","England",..: 5 5 4 4 4 4 10 10 10 NA ...
#> $ Disp : num 112 163 141 121 152 209 151 231 231 189 ...
#> $ Disp2 : num 1.8 2.7 2.3 2 2.5 3.5 2.5 3.8 3.8 3.1 ...
#> $ Eng.Rev : num 2935 2505 2775 2835 2625 ...
#> $ Front.Hd : num 3.5 2 2.5 4 2 3 4 6 5 5.5 ...
#> $ Frt.Leg.Room: num 41.5 41.5 41.5 42 42 42 42 42 41 41 ...
#> $ Frt.Shld : num 53 55.5 56.5 52.5 52 54.5 56.5 58.5 59 58 ...
#> $ Gear.Ratio : num 3.26 2.95 3.27 3.25 3.02 2.8 NA NA NA NA ...
#> $ Gear2 : num 3.21 3.02 3.25 3.25 2.99 2.85 2.84 1.99 1.99 2.33 ...
#> $ HP : num 130 160 130 108 168 208 110 165 165 101 ...
#> $ HP.revs : num 6000 5900 5500 5300 5800 5700 5200 4800 4800 4400 ...
#> $ Height : num 47.5 50 51.5 50.5 49.5 51 49.5 50.5 51 50.5 ...
#> $ Length : num 177 191 193 176 175 186 189 197 197 192 ...
#> $ Luggage : num 16 14 17 10 12 12 16 16 16 15 ...
#> $ Mileage : num NA 20 NA 27 NA NA 21 NA 23 NA ...
#> $ Model2 : Factor w/ 21 levels ""," Turbo 4 (3)",..: 1 1 1 1 1 1 1 14 13 1 ...
#> $ Price : num 11950 24760 26900 18900 24650 ...
#> $ Rear.Hd : num 1.5 2 3 1 1 2.5 2.5 4.5 3.5 3.5 ...
#> $ Rear.Seating: num 26.5 28.5 31 28 25.5 27 28 30.5 28.5 27.5 ...
#> $ RearShld : num 52 55.5 55 52 51.5 55.5 56 58.5 58.5 56.5 ...
#> $ Reliability : Ord.factor w/ 5 levels "Much worse"<"worse"<..: 5 5 NA NA 4 NA 3 3 3 NA ...
#> $ Rim : Factor w/ 6 levels "R12","R13","R14",..: 3 4 4 3 3 4 3 3 3 3 ...
#> $ Sratio.m : num NA NA NA NA NA NA NA NA NA NA ...
#> $ Sratio.p : num 0.86 0.96 0.97 0.71 0.88 0.78 0.76 0.83 0.87 0.88 ...
#> $ Steering : Factor w/ 3 levels "manual","power",..: 2 2 2 2 2 2 2 2 2 2 ...
#> $ Tank : num 13.2 18 21.1 15.9 16.4 21.1 15.7 18 18 16.5 ...
#> $ Tires : Factor w/ 30 levels "145","145/80",..: 16 20 20 8 17 28 13 23 23 22 ...
#> $ Trans1 : Factor w/ 4 levels "","man.4","man.5",..: 3 3 3 3 3 3 1 1 1 1 ...
#> $ Trans2 : Factor w/ 4 levels "","auto.3","auto.4",..: 3 3 2 2 3 3 2 3 3 3 ...
#> $ Turning : num 37 42 39 35 35 39 41 43 42 41 ...
#> $ Type : Factor w/ 6 levels "Compact","Large",..: 4 3 3 1 1 3 3 2 2 NA ...
#> $ Weight : num 2700 3265 2935 2670 2895 ...
#> $ Wheel.base : num 102 109 106 100 101 109 105 111 111 108 ...
#> $ Width : num 67 69 71 67 65 69 69 72 72 71 ...
I'm newbie, and working on a classification to see the causes of coral diseases. The dataset contains 45 variables.
The output variable is a factor with 21 levels (21 diseases) and the inputs are numeric and factor variables, and those factors have even 94 levels, those are like "type of specie of coral", so I can't get into a split factor because I want to be as precise as possible, so maybe one species is less resistant than another. So I can't split those factors. Numeric variables are such as, population in the area, fishing trips etc.
First problem: tried genetic algorithms to select most important variables, random forests, etc., but... it gets aborted, so the variables I eliminated were just based on correlograms. I want something stronger to decide which variables select.
Second problem: I've tried everything I know and made tons of searches on Google to find something that runs and make a classification, but nothing goes on. I tried SVM, Random Forests, Cart, GBM, bagging and boosting, but nothing can't with this dataset.
This is the structure of the dataset
'data.frame': 136510 obs. of 45 variables:
$ SITE : Factor w/ 144 levels "TUT-1511","TUT-1513",..: 56 15 55 21 12 12 17 53 48 82 ...
$ Zone_Fine : Factor w/ 17 levels "Aunuu_E","Aunuu_W",..: 11 9 10 9 9 9 9 8 10 10 ...
$ TRANSECT : num 1 1 1 1 1 1 1 1 1 1 ...
$ SEGMENT : num 5 1 1 1 7 5 7 5 3 7 ...
$ Seg_WIDTH : num 1 1 1 1 1 1 1 1 1 1 ...
$ Seg_LENGTH : num 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 ...
$ SPECIES : Factor w/ 156 levels "AAAA","AABR",..: 94 126 94 102 9 126 135 94 93 94 ...
$ COLONYLENGTH : num 11 45 10 5 12 10 8 30 20 14 ...
$ OLDDEAD : num 5 2 5 0 0 5 10 0 5 10 ...
$ RECENTDEAD : num 0 10 0 0 0 0 0 0 0 0 ...
$ DZCLASS : Factor w/ 21 levels "Acute Tissue Loss - White Syndrome",..: 14 14 14 14 14 14 14 14 14 14 ...
$ EXTENT : num 52.9 52.9 52.9 52.9 52.9 ...
$ SEVERITY : num 3.11 3.11 3.11 3.11 3.11 ...
$ TAXONNAME.x : Factor w/ 155 levels "Acanthastrea hemprichii",..: 95 132 95 107 7 132 133 95 89 95 ...
$ PHYLUM : Factor w/ 2 levels "Cnidaria","Rhodophyta": 1 1 1 1 1 1 1 1 1 1 ...
$ CLASS : Factor w/ 3 levels "Anthozoa","Florideophyceae",..: 1 1 1 1 1 1 1 1 1 1 ...
$ FAMILY : Factor w/ 20 levels "Acroporidae",..: 1 18 1 2 1 18 18 1 8 1 ...
$ GENUS : Factor w/ 55 levels "Acanthastrea",..: 35 44 35 39 2 44 44 35 34 35 ...
$ RANK : Factor w/ 2 levels "Genus","Species": 1 1 1 1 2 1 2 1 1 1 ...
$ DATE_ : Date, format: "0015-03-27" ...
$ OBS_YEAR : num 2015 2015 2015 2015 2015 ...
$ REEF_ZONE : Factor w/ 2 levels "Backreef","Forereef": 2 2 2 2 2 2 2 2 2 2 ...
$ DEPTH_BIN : Factor w/ 4 levels "Bank","Deep",..: 2 2 4 3 2 2 3 4 3 3 ...
$ LBSP : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
$ Zone_Fine_ReefZone_Depth: Factor w/ 41 levels "Aunuu_E_Deep",..: 30 24 29 25 24 24 25 23 28 28 ...
$ Area_km2.x : num 50.9 49.1 101.8 49.1 49.1 ...
$ Fishing.trips.per.km2 : num 719 1148 1431 1148 1148 ...
$ Area_km2.y : num 50.9 49.1 50.9 49.1 49.1 ...
$ Pop.km2 : num 167.5 49.1 561.9 49.1 49.1 ...
$ SHED_NAME : Factor w/ 35 levels "Aasu","Afao - Asili",..: 2 9 15 17 17 1 1 35 28 26 ...
$ Shed_Cond : Factor w/ 4 levels "Extensive","Intermediate",..: 3 4 2 4 4 3 3 3 1 2 ...
$ Shed_Area_Calc : num 30202 29422 458542 126361 32595 ...
$ Perc_Area : num 0.00128 0.00107 0.00993 0.00458 0.00118 ...
$ Cond_Scale : num 3 4 2 4 4 3 3 3 1 2 ...
$ Shoreline_m : num 23146 33046 45821 33046 33046 ...
$ Rank : num 5 9 3 9 9 9 9 6 3 3 ...
$ Comp.8 : num 0.826 0.814 0.838 0.814 0.814 ...
$ Ble : num 0.958 0.969 0.959 0.969 0.969 ...
$ DZ : num 0.647 0.837 0.732 0.837 0.837 ...
$ Herb : num 0.682 0.564 0.704 0.564 0.564 ...
$ Rec : num 0.375 0.477 0.467 0.477 0.477 ...
$ MA : num 0.965 0.975 0.907 0.975 0.975 ...
$ Dam : num 0.998 1 0.992 1 1 ...
$ TAXONNAME.y : Factor w/ 94 levels "Abudefduf sordidus",..: 94 94 94 94 94 94 94 94 94 94 ...
$ Dummy : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
I expected a classification of "DZCLASS".
Thanks, every recommendation is welcomed!
I am having a data set which has 28 attributes. The response variable is binary (0 & 1). I tried using SVM with "Probability=T" while running it. But I still could not get the probability values from the result.
Here is my training data set (last attribute is my response variable):
str(train)
'data.frame': 73630 obs. of 29 variables:
$ EMOTION_INDICATOR: num -2 -0.625 0.9 0 1.625 ...
$ CLUSTER : Factor w/ 8 levels "","cluster0",..: 4 7 5 1 1 3 8 6 7 1 ...
$ GENDER : Factor w/ 3 levels "","Female","Male": 2 2 2 1 1 3 3 2 3 1 ...
$ AGE : num 36 37 70 NA NA ...
$ REGION : Factor w/ 6 levels "","'Northern Ireland'",..: 5 6 5 1 1 6 4 4 6 1 ...
$ WORKING : Factor w/ 14 levels "","A","B","C",..: 4 14 8 1 1 6 4 3 4 1 ...
$ MUSIC : Factor w/ 7 levels "","A","B","C",..: 5 7 6 1 1 4 5 2 5 1 ...
$ LIST_OWN : num 1 1 6 NA NA ...
$ LIST_BACK : num 1 3 2 NA NA 0.5 2 0.5 6 NA ...
$ Q1 : num 10 51 35 NA NA 29 51 25 69 NA ...
$ Q2 : num 53 51 36 NA NA 7 49 25 71 NA ...
$ Q3 : num 12 70 37 NA NA 26 51 23 70 NA ...
$ Q4 : num 11 31 36 NA NA 2 50 24 7 NA ...
$ Q5 : num 12 6 37 NA NA 51 73 22 10 NA ...
$ Q6 : num 12 6 9 NA NA 51 47 22 68 NA ...
$ Q7 : num 76 5 36 NA NA 29 50 30 11 NA ...
$ Q8 : num 76 24 13 NA NA 72 52 10 10 NA ...
$ Q9 : num 51 7 70 NA NA 12 36 48 53 NA ...
$ Q10 : num 53 70 69 NA NA 9 91 18 75 NA ...
$ Q11 : num 76 89 65 NA NA 53 53 18 86 NA ...
$ Q12 : num 76 91 63 NA NA 5 52 16 89 NA ...
$ Q13 : num 52 50 6 NA NA 51 77 17 99 NA ...
$ Q14 : num 75 73 62 NA NA 70 78 21 100 NA ...
$ Q15 : num 11 72 31 NA NA 33 48 19 67 NA ...
$ Q16 : num 12 47.7 24.3 NA NA ...
$ Q17 : num 71 74 51 NA NA 51 52 27 98 NA ...
$ Q18 : num 23.6 52 31 NA NA ...
$ Q19 : num 22.5 52 32 NA NA ...
$ AVERAGE_RATING : Factor w/ 2 levels "0","1": 1 1 2 1 2 1 1 1 1 1 ...
My test set looks similar too. It has 24544 obs. with 29 variables.
This is the code that I used for SVM:
fitSVM <- svm(AVERAGE_RATING ~., data=train, na.action = na.omit,probability=T)
predSVM <- predict(fitSVM,test[!rowSums(is.na(test)),],type="probability")
table(predSVM,test$AVERAGE_RATING[!rowSums(is.na(test))],useNA="no")
predSVM 0 1
0 8091 1523
1 3259 9865
I get proper output, but without probability values:
attr(predSVM,"probabilities")
NULL
Am I doing something wrong?
You need to call predict with:
predSVM <- predict(fitSVM,test[!rowSums(is.na(test)),], probability=T)
See ? predict.svm
I have a slightly complicated "data_frame" in the sense that the columns have zeroes stored with different decimal points.
So for example one column has zeroes as 0, the other as 0.0, the other as 0.000 and so on.
I am trying to count all the zeroes in each column of the data frame so when I code:
>colSums(data_frame==0)
I only get the number of zeros in columns that have zero values stored as 0. The others with zeros as 0.00, 0.000.... etc show up as NA count.
This is the format of the data
str(data_frame)
$ P0 : num 0 0 1 1 2 0 0 2 2 5 ...
$ P1 : num 8 10 2 0 5 0 6 4 2 5 ...
$ P2 : num 8 7 4 0 5 1 6 10 2 8 ...
$ P3 : num 7 6 2 3 6 6 6 2 2 10 ...
$ P4 : num 3 14.62 2 1.12 3 ...
$ P.x : num 6.5 9.4062 1.5 0.0312 2.75 ...
$ InvN.x : num 0.8792 1.505 -0.5619 -1.1856 -0.0886 ...
$ h1 : num 65 80 75 40 86 32 75 40 60 76 ...
$ h2 : num 65 75 65 60 86 74 45 0 60 60 ...
$ h3 : num 80 75 75 70 61 91 44 33 40 75 ...
$ h4 : num 65 60 60 45 50 84 40 75 80 85 ...
$ meanh : num 68.8 72.5 68.8 53.8 70.8 ...
$ PQ1 : num 1.663 2.812 0.23 0.015 0.762 ...
$ PQ2 : num 1.755 2.525 0.578 0.125 1.133 ...
$ PQ3 : num 1.843 2.217 0.54 -0.02 0.307 ...
$ change : int 21 24 7 3 12 12 18 12 5 15 ...
$ meanbin : int 15 18 15 5 16 16 3 1 8 19 ...
Can someone please help?
Thanks.
So if you have numerics, your comparisons are subject to floating point errors. Instead of ==, you want to do something like:
colSums(abs(data_frame) < epsilon)
for some small epsilon of your choice: something that makes sense given your data precision. An extreme value might be what the all.equal function uses as default tolerance: .Machine$double.eps ^ 0.5.
You can grep the . (dot) in that column to get a vector of indices where that cell has 0.0 or 0.00 and so on. Then you can manually assign those cells as NA. Finally you continue to use colSums(data_frame==0) and remember to specify na.rm=TRUE.