A chaotic decision tree with the "party" package - r

I'm trying to get a decision tree from the code:
library(party)
tree <- data.frame(mi[,-2])
levels(tree$OS_Gatunek) <- c("ta", "th", "tl", "to", "tp") # shortening species names
model <- ctree(OS_Gatunek~., data=tree)
plot(model)
My data look like this:
> str(mi)
'data.frame': 4393 obs. of 18 variables:
$ OS_Gatunek : Factor w/ 5 levels "Taraxacum ancistrolobum",..: 1 1 1 1 1 1 1 1 1 1 ...
$ PH_CreateDate : Factor w/ 15 levels "2016-04-06","2016-04-19",..: 2 2 2 2 2 2 2 2 2 2 ...
$ L_Dl : num 7.91 8.96 10.18 10.09 9.4 ...
$ L_SzerMaksOs : num 1.93 3.98 3.12 4.04 2.75 2.69 3.69 3.23 2.3 2.49 ...
$ L_DlMax : num 3.51 4.08 5.58 5.04 3.99 3.6 5.65 4.62 3.33 4.18 ...
$ KS_DlSk_Sr : num 1.78 3.28 2.88 4.19 1.88 2.47 3.11 4.04 1.61 2.09 ...
$ KS_Dl_Sr : num 1.68 2.83 2.62 3.84 1.68 2.12 2.87 3.8 1.44 1.86 ...
$ KS_Sz : num 1.35 3.41 2.38 3.31 1.66 2.35 2.45 2.96 1.57 1.9 ...
$ KB_DlSkos_Sr : num 1.07 1.94 1.84 1.69 1.25 1.49 1.96 1.77 1.43 1.55 ...
$ KB_Dl_Sr : num 0.62 1.49 1.12 1.34 0.86 0.77 1.22 1.07 0.82 1.05 ...
$ KB_Szer_Sr : num 0.85 1.23 1.46 0.94 0.89 1.32 1.53 1.41 1.17 1.14 ...
$ KB_SzerPierwKlapy: num 1.75 3.99 2.9 4.1 2.34 2.75 3.11 3.39 1.96 2.46 ...
$ I_Dl_Sr : num 0.25 0.86 0.48 0.61 0.44 0.41 0.7 0.86 0.14 0.59 ...
$ I_SzOs : num 0.37 0.83 0.47 0.87 0.39 0.73 0.53 0.96 0.4 0.33 ...
$ I_DlSz_Sr : num 2.3 4.08 3.35 5.23 2.34 3.39 3.22 4.43 1.96 2.55 ...
$ O_Dl_Sr : num 0.67 0.75 2.02 0.85 0.74 1.4 1.07 0.26 0.6 0.96 ...
$ O_SzerOs : num 1.35 1.59 1.31 0.91 1.08 0.94 1.18 0.84 1.71 0.93 ...
$ O_SzerOskrz_Sr : num 0.55 0.65 0.48 0.34 0.39 0.31 0.49 0.29 0.74 0.27 ...
But the tree looks like this:
And I would really like it to look like this (taken from: https://ademos.people.uic.edu/Chapter24.html):
What should my code look like? I'll be grateful for all suggestions.
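One possible direction, offered as a sketch rather than a definitive fix: a conditional inference tree with many terminal nodes is usually easier to read when its depth is capped and the simpler node style is used. The built-in iris data stand in for mi here, and the maxdepth value of 3 is an arbitrary assumption that would need tuning for the real data.

```r
# Sketch only: iris stands in for the asker's data; maxdepth = 3 is an assumption.
if (requireNamespace("party", quietly = TRUE)) {
  library(party)
  # Cap the tree depth so that fewer nodes have to fit on the page
  model <- ctree(Species ~ ., data = iris,
                 controls = ctree_control(maxdepth = 3))
  # type = "simple" prints a text summary in each leaf instead of a bar plot
  plot(model, type = "simple")
}
```

Opening a larger graphics device before plotting (for example with png() and generous width and height) may help as well, since the node panels scale with the device size.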

Related

Can't compute covariance matrix on a data frame in R

I have the following data set (which I import from a 6-column Excel file with a .csv file extension):
treas <- read.csv(file = 'treas.csv', header = TRUE, stringsAsFactors = FALSE)
2YR 3YR 5YR 7YR 10YR 30YR
0.41 0.85 1.65 2.18 2.6 3.43
0.41 0.85 1.65 2.2 2.61 3.45
0.4 0.82 1.63 2.17 2.59 3.44
0.41 0.86 1.66 2.19 2.6 3.44
0.43 0.88 1.69 2.22 2.62 3.45
0.45 0.93 1.71 2.24 2.64 3.47
0.44 0.91 1.7 2.23 2.65 3.47
0.42 0.88 1.66 2.17 2.58 3.41
0.45 0.93 1.7 2.21 2.6 3.41
0.49 0.95 1.71 2.21 2.61 3.4
0.51 0.99 1.77 2.27 2.66 3.44
0.48 0.95 1.71 2.21 2.61 3.43
0.48 0.94 1.71 2.22 2.64 3.47
0.5 0.94 1.71 2.22 2.63 3.44
0.48 0.96 1.72 2.23 2.63 3.45
0.49 0.95 1.7 2.19 2.59 3.41
0.48 0.92 1.68 2.17 2.57 3.38
0.46 0.9 1.64 2.14 2.53 3.35
0.45 0.88 1.64 2.14 2.54 3.36
0.47 0.88 1.62 2.13 2.53 3.34
0.47 0.9 1.66 2.17 2.58 3.4
0.49 0.95 1.71 2.22 2.64 3.46
0.52 0.98 1.74 2.25 2.65 3.47
0.52 1 1.74 2.24 2.63 3.44
0.51 0.99 1.7 2.19 2.58 3.38
0.51 0.97 1.68 2.17 2.57 3.37
0.46 0.93 1.66 2.15 2.55 3.38
0.48 0.92 1.65 2.13 2.53 3.34
0.48 0.95 1.68 2.17 2.55 3.36
When I call cov() on the treas data frame, I see the following error message:
Error: is.numeric(x) || is.logical(x) is not TRUE
To check the data types, I use:
sapply(treas, typeof)
The result is:
2YR 3YR 5YR 7YR 10YR 30YR
"character" "character" "character" "character" "character" "character"
Calling str(treas) reveals:
str(treas)
'data.frame': 1252 obs. of 6 variables:
$ 2YR : Factor w/ 235 levels ".","0.34","0.35",..: 8 8 7 8 10 12 11 9 12 16 ...
$ 3YR : chr w/ 219 levels ".","0.66","0.69",..: 18 18 15 19 21 26 24 21 26 28 ...
$ 5YR : chr w/ 207 levels ".","0.94","0.95",..: 67 67 65 68 71 73 72 68 72 73 ...
$ 7YR : chr w/ 192 levels ".","1.19","1.20",..: 96 98 95 97 100 102 101 95 99 99 ...
$ 10YR : chr w/ 178 levels ".","1.37","1.38",..: 118 119 117 118 120 122 123 116 118 119 ...
$ 30YR : chr w/ 125 levels ".","2.11","2.14",..: 121 123 122 122 123 125 125 120 120 119 ...
I've tried to force the data frame to numeric using:
lapply(treas, as.numeric)
But, doing so results in:
Warning messages:
1: In lapply(treas, as.numeric) : NAs introduced by coercion
2: In lapply(treas, as.numeric) : NAs introduced by coercion
3: In lapply(treas, as.numeric) : NAs introduced by coercion
4: In lapply(treas, as.numeric) : NAs introduced by coercion
5: In lapply(treas, as.numeric) : NAs introduced by coercion
6: In lapply(treas, as.numeric) : NAs introduced by coercion
Then, I still get the same error when calling cov(treas):
Error: is.numeric(x) || is.logical(x) is not TRUE
Anyone see what I'm doing incorrectly here? Thanks!
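A possible fix, offered as a sketch: the "." level in the str() output suggests that missing values are coded as a dot, and lapply() returns a new list instead of modifying treas in place, so its result has to be assigned back. The inline CSV text below is made up for illustration; na.strings and check.names are standard read.csv() arguments.

```r
# Made-up sample standing in for treas.csv; "." marks a missing value
csv_text <- "2YR,3YR,5YR
0.41,0.85,1.65
.,0.85,1.65
0.40,0.82,1.63
0.43,0.88,1.69"

# Option 1: declare "." as NA on import so every column is parsed as numeric
treas <- read.csv(text = csv_text, header = TRUE,
                  na.strings = ".", check.names = FALSE)

# Option 2: convert after import; note the assignment back into the data frame
raw <- read.csv(text = csv_text, header = TRUE,
                stringsAsFactors = FALSE, check.names = FALSE)
raw[] <- lapply(raw, function(x) suppressWarnings(as.numeric(as.character(x))))

# Rows containing NA are dropped before the covariance is computed
cov_mat <- cov(treas, use = "complete.obs")
```

Going through as.character() first matters when a column is a factor, because as.numeric() on a factor returns the internal level codes rather than the printed values.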

Readability of the decision tree

I have a problem with the decision tree. It is not very clear.
The structure of my data is as follows:
> str(mi)
'data.frame': 4393 obs. of 18 variables:
$ OS_Gatunek : Factor w/ 5 levels "Taraxacum ancistrolobum",..: 1 1 1 1 1 1 1 1 1 1 ...
$ PH_CreateDate : Factor w/ 15 levels "2016-04-06","2016-04-19",..: 2 2 2 2 2 2 2 2 2 2 ...
$ L_Dl : num 7.91 8.96 10.18 10.09 9.4 ...
$ L_SzerMaksOs : num 1.93 3.98 3.12 4.04 2.75 2.69 3.69 3.23 2.3 2.49 ...
$ L_DlMax : num 3.51 4.08 5.58 5.04 3.99 3.6 5.65 4.62 3.33 4.18 ...
$ KS_DlSk_Sr : num 1.78 3.28 2.88 4.19 1.88 2.47 3.11 4.04 1.61 2.09 ...
$ KS_Dl_Sr : num 1.68 2.83 2.62 3.84 1.68 2.12 2.87 3.8 1.44 1.86 ...
$ KS_Sz : num 1.35 3.41 2.38 3.31 1.66 2.35 2.45 2.96 1.57 1.9 ...
$ KB_DlSkos_Sr : num 1.07 1.94 1.84 1.69 1.25 1.49 1.96 1.77 1.43 1.55 ...
$ KB_Dl_Sr : num 0.62 1.49 1.12 1.34 0.86 0.77 1.22 1.07 0.82 1.05 ...
$ KB_Szer_Sr : num 0.85 1.23 1.46 0.94 0.89 1.32 1.53 1.41 1.17 1.14 ...
$ KB_SzerPierwKlapy: num 1.75 3.99 2.9 4.1 2.34 2.75 3.11 3.39 1.96 2.46 ...
$ I_Dl_Sr : num 0.25 0.86 0.48 0.61 0.44 0.41 0.7 0.86 0.14 0.59 ...
$ I_SzOs : num 0.37 0.83 0.47 0.87 0.39 0.73 0.53 0.96 0.4 0.33 ...
$ I_DlSz_Sr : num 2.3 4.08 3.35 5.23 2.34 3.39 3.22 4.43 1.96 2.55 ...
$ O_Dl_Sr : num 0.67 0.75 2.02 0.85 0.74 1.4 1.07 0.26 0.6 0.96 ...
$ O_SzerOs : num 1.35 1.59 1.31 0.91 1.08 0.94 1.18 0.84 1.71 0.93 ...
$ O_SzerOskrz_Sr : num 0.55 0.65 0.48 0.34 0.39 0.31 0.49 0.29 0.74 0.27 ...
The code looks like this:
library(rpart)
model <- rpart(mi[,1] ~ ., data = mi[,-c(1,2)])
plot(model)
text(model, cex = 0.5)
And the tree looks like this:
If I use the fancyRpartPlot() command from "rattle" package:
fancyRpartPlot(model, sub=NULL)
The tree is like this:
And if I use the rpart.plot() from "rpart.plot" package:
rpart.plot(model)
The tree looks like this:
They are completely illegible.
Despite the variety of literature available on the internet, I found nothing that would improve the legibility of my tree.
What should I change? I will be grateful for any suggestions.
When calling rpart.plot, create extra space for bigger text in the plotted tree, by using fallen.leaves=FALSE and/or tweak=1.1 (say).
Also reduce the length of the variable and factor names by using varlen=4 and faclen=4 (say).
See also the suggestions in the FAQ chapter of the rpart.plot vignette.
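The suggestions above can be sketched as follows. The kyphosis data bundled with rpart stand in for mi, and the particular values (1.1 and 4) are the ones suggested above, not tuned settings.

```r
library(rpart)  # ships with R as a recommended package
# kyphosis is a dataset bundled with rpart, used here as a stand-in for mi
model <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

if (requireNamespace("rpart.plot", quietly = TRUE)) {
  library(rpart.plot)
  rpart.plot(model,
             fallen.leaves = FALSE,  # do not align all leaves at the bottom
             tweak = 1.1,            # make the text about 10% larger
             varlen = 4,             # abbreviate variable names to about 4 characters
             faclen = 4)             # abbreviate factor levels to about 4 characters
}
```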

ggseasonalplot for daily data is not working

I am new to R.
I have daily data from a pump. The data cover two years and contain 742 rows.
cw<-read.csv("RCW1.csv")
str(cw)
'data.frame': 742 obs. of 14 variables:
$ date : Date, format: "2016-04-01" "2016-04-02" "2016-04-03" ...
$ amp : num 226 227 0 225 226 ...
$ brg_de_tmp : num 38.1 38.1 39.6 41.6 41.5 ...
$ brg_nde_tmp: num 78.6 79.1 72 79.9 80.4 ...
$ kg : num 2.07 2.07 0.06 2.29 2.28 2.3 2.11 2.1 2.11 2.11 ...
$ level1 : num 8.45 8.46 8.69 8.67 8.43 8.6 8.39 8.5 8.46 8.65 ...
$ level2 : num 8.44 8.46 8.67 8.65 8.42 8.59 8.38 8.48 8.46 8.63 ...
$ mde_xvib : num 1.15 1.35 0.28 1.05 1.15 1.06 1.25 1.25 1.25 1.25 ...
$ mde_zvib : num 1.37 1.57 0.4 1.18 1.13 1.38 1.28 1.57 1.3 1.5 ...
$ rpm : num 296.46 296.91 -4.76 297.09 297.91 ...
$ mde_yvib : num 2.09 2.38 0.34 2 1.82 2.24 2.17 2.56 1.9 2.27 ...
$ m_nde_yvib : num 1.15 1.13 0.35 0.96 0.96 0.96 1.15 1.06 1.15 1.15 ...
$ m_nde_zvib : num 1.53 1.63 0.27 1.33 1.43 1.4 1.76 1.63 1.79 1.71 ...
$ permit : chr "#N/A" "#N/A" "CW Pump house: Motor stand" "#N/A" ...
I convert it into a time series:
cw_x <- xts(cw, order.by=as.Date(cw[,1], "%Y/%m/%d"))
cw_ts<-as.ts(cw_x)
> head(cw_ts)
Time Series:
Start = 1
End = 6
Frequency = 1
date amp brg_de_tmp brg_nde_tmp kg level1 level2 mde_xvib mde_zvib rpm mde_yvib
1 2016-04-01 226.05 38.06 78.61 2.07 8.45 8.44 1.15 1.37 296.46 2.09
2 2016-04-02 226.59 38.08 79.13 2.07 8.46 8.46 1.35 1.57 296.91 2.38
3 2016-04-03 0.00 39.57 71.96 0.06 8.69 8.67 0.28 0.40 -4.76 0.34
4 2016-04-04 225.01 41.57 79.91 2.29 8.67 8.65 1.05 1.18 297.09 2.00
5 2016-04-05 226.41 41.54 80.43 2.28 8.43 8.42 1.15 1.13 297.91 1.82
6 2016-04-06 225.65 41.08 79.89 2.30 8.60 8.59 1.06 1.38 297.55 2.24
m_nde_yvib m_nde_zvib permit
1 1.15 1.53 #N/A
2 1.13 1.63 #N/A
3 0.35 0.27 CW Pump house: Motor stand
4 0.96 1.33 #N/A
5 0.96 1.43 #N/A
6 0.96 1.40 #N/A
I have two questions.
Question 1: How can I get dates on the x-axis when plotting with the following code?
autoplot(cw_ts[,2:5],facets = TRUE) + ylab("parameters")
I am getting row numbers on the x-axis and want dates instead.
Question 2: I am trying to produce seasonal plots with the following two snippets, as described in Rob J Hyndman's book "Forecasting: Principles and Practice":
ggseasonplot(cw_ts, year.labels=TRUE, year.labels.left=TRUE) +
ylab("") +
ggtitle("Seasonal plot: Pump parameter")
AND
ggseasonplot(cw_ts, polar=TRUE) +
ylab("") +
ggtitle("Polar seasonal plot: Pump parameter")
How do I pass my time series to the code above? I am getting the following errors:
ggseasonplot(cw_ts, year.labels=TRUE, year.labels.left=TRUE) +
+ ylab("") +
+ ggtitle("Seasonal plot: Pump parameter")
Error in ggseasonplot(cw_ts, year.labels = TRUE, year.labels.left = TRUE) :
Data are not seasonal
AND
ggseasonplot(cw_ts, polar=TRUE) +
+ ylab("") +
+ ggtitle("Polar seasonal plot: Pump parameter")
Error in ggseasonplot(cw_ts, polar = TRUE) : Data are not seasonal
Any suggestion will help me a lot. Thank You.
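A sketch of one way around the "Data are not seasonal" error, under stated assumptions: ggseasonplot() stops with that message when the series has frequency 1, which is what head(cw_ts) reports above, and it expects a single series instead of the whole table. The values below are synthetic stand-ins for cw$amp, and treating the seasonal period as 365 days is an assumption about the data.

```r
# Synthetic daily values standing in for cw$amp (742 days from 2016-04-01)
set.seed(1)
dates <- seq(as.Date("2016-04-01"), by = "day", length.out = 742)
amp   <- 225 + 5 * sin(2 * pi * as.numeric(format(dates, "%j")) / 365) + rnorm(742)

# One column only, with an explicit seasonal period of 365 days (an assumption).
# start = c(2016, 92) because 2016-04-01 is day 92 of 2016.
amp_ts <- ts(amp, start = c(2016, 92), frequency = 365)

if (requireNamespace("forecast", quietly = TRUE)) {
  library(forecast)
  library(ggplot2)
  p <- ggseasonplot(amp_ts, year.labels = TRUE, year.labels.left = TRUE) +
    ylab("amp") +
    ggtitle("Seasonal plot: pump parameter")
  print(p)
}
```

With the real data, the same ts() call would be applied to one numeric column at a time (for example cw$amp) and not to the full data frame, which also contains the date and the character column permit.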

Error in r: undefined columns selected

I was trying to draw a partition plot and used the following code:
install.packages('klaR')
library(klaR)
partimat(Type~. , data = training, method = "lda")
partimat('Type'~. , data = training, method = "qda")
R gave me this error code:
Error in `[.data.frame`(m, xvars) : undefined columns selected
My data look like this:
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 178 obs. of 13 variables:
$ Alcohol : num 14.2 13.2 13.2 14.4 13.2 ...
$ Malic acid : num 1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
$ Ash : num 2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
$ Alcalinity of ash : num 15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
$ Magnesium : int 127 100 101 113 118 112 96 121 97 98 ...
$ Total phenols : num 2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...
$ Flavanoids : num 3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...
$ Nonflavanoid phenols: num 0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...
$ Proanthocyanins : num 2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ...
$ Color intensity : num 5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...
$ Hue : num 1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
$ Proline : int 1065 1050 1185 1480 735 1450 1290 1295 1045 1045 ...
$ Type : int 1 1 1 1 1 1 1 1 1 1 ...
Please let me know how to solve it!
There is no Type variable in the UCI Machine Learning Wine data set. The classification variable is class, and it is the first column in the data set.
# data source: UCI ML Repository Wine data
# https://archive.ics.uci.edu/ml/datasets/wine
library(klaR)
colNames <- c("class","alcohol","malicAcid","ash","alcalinityOfAsh",
"magnesium","totalPhenols","flavanoids","nonflavanoidPhenols",
"proanthocyanins","colorIntensity","hue","od280.od315OfDilutedWines",
"proline")
wine <- read.csv("./data/wine.csv",header=FALSE,col.names=colNames)
wine$class <- as.factor(wine$class)
partimat(class ~ alcohol + malicAcid, data=wine, method="lda",plot.matrix=FALSE)
...and the output:
I had the same problem and fixed it by changing the names of my variables. In my data set, one variable name had a blank space at the beginning. The program could not recognize it, and that triggered the error. I removed the blank space and the problem disappeared.
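The str() output in the question shows column names that contain spaces (for example "Malic acid"), which fits this explanation. A minimal sketch of the renaming step follows. The training data frame is made up for illustration, and converting Type to a factor is an added assumption about what the classifier expects.

```r
# Made-up data with the same kind of column names as in the question
set.seed(42)
training <- data.frame(
  "Alcohol"    = c(rnorm(20, 13.7), rnorm(20, 12.3), rnorm(20, 13.2)),
  "Malic acid" = c(rnorm(20, 2.0),  rnorm(20, 1.9),  rnorm(20, 3.3)),
  "Type"       = rep(1:3, each = 20),
  check.names = FALSE
)

# Trim stray blanks, then replace remaining spaces so formulas can find the columns
names(training) <- make.names(trimws(names(training)))
# The grouping variable should be a factor, not an integer
training$Type <- factor(training$Type)

if (requireNamespace("klaR", quietly = TRUE)) {
  library(klaR)
  # Unquoted response name; "Malic acid" is now Malic.acid
  partimat(Type ~ Alcohol + Malic.acid, data = training, method = "lda")
}
```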

ggbiplot graphical display in groups

I am learning biplot with wine data set. How does R know Barolo, Grignolino and Barbera are wine.class while we don't see the wine class column in the data set?
More details about the wine data set are in the following links
ggbiplot - how not to use the feature vectors in the plot
https://github.com/vqv/ggbiplot
Thanks very much
In the wine dataset, you have 2 objects, one data.frame wine with 178 observations of 13 quantitative variables:
str(wine)
'data.frame': 178 obs. of 13 variables:
$ Alcohol : num 14.2 13.2 13.2 14.4 13.2 ...
$ MalicAcid : num 1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
$ Ash : num 2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
$ AlcAsh : num 15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
$ Mg : int 127 100 101 113 118 112 96 121 97 98 ...
$ Phenols : num 2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...
$ Flav : num 3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...
$ NonFlavPhenols: num 0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...
$ Proa : num 2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ...
$ Color : num 5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...
$ Hue : num 1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
$ OD : num 3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...
$ Proline : int 1065 1050 1185 1480 735 1450 1290 1295 1045 1045 ...
There is also one vector wine.class that contains 178 observations of the qualitative wine.class variable:
str(wine.class)
Factor w/ 3 levels "barolo","grignolino",..: 1 1 1 1 1 1 1 1 1 1 ...
The 13 quantitative variables are used to compute the PCA:
wine.pca <- prcomp(wine, scale. = TRUE)
while the wine.class variable is only used to color the points on the plot.
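The separation between the data used for the PCA and the grouping vector can be sketched in base R. The built-in iris data stand in for the wine data here: the four numeric columns play the role of wine, and Species plays the role of wine.class.

```r
# iris stands in for the wine data: numeric columns for the PCA, Species kept aside
measurements <- iris[, 1:4]
classes      <- iris$Species   # plays the role of wine.class

pca <- prcomp(measurements, scale. = TRUE)

# The class vector never enters prcomp(); it only picks a colour per point
plot(pca$x[, 1:2], col = as.integer(classes), pch = 19)
legend("topright", legend = levels(classes),
       col = seq_along(levels(classes)), pch = 19)

# With ggbiplot the same kind of vector is passed via the groups argument, e.g.
# ggbiplot(wine.pca, groups = wine.class)
```

The two objects are matched purely by position: element i of the class vector colours row i of the data frame, so they must have the same length and order.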
