Creating decision tree - r

I have a CSV file (298 rows and 24 columns) and I want to create a decision tree to predict the column "salary". I have installed the tree package and loaded it with library().
But when I try to create the decision tree:
model<-tree(salary~.,data)
I get the following error:
Error in tree(salary ~ ., data) :
factor predictors must have at most 32 levels
What is wrong? The data is as follows:
Name bat hit homeruns runs
1 Alan Ashby 315 81 7 24
2 Alvin Davis 479 130 18 66
3 Andre Dawson 496 141 20 65
...
team position putout assists errors
1 Hou. C 632 43 10
2 Sea. 1B 880 82 14
3 Mon. RF 200 11 3
salary league87 team87
1 475 N Hou.
2 480 A Sea.
3 500 N Chi.
And this is the output of str(data):
'data.frame': 263 obs. of 24 variables:
$ Name : Factor w/ 263 levels "Al Newman","Alan Ashby",..: 2 7 8 10 6 1 13 11 9 3 ...
$ bat : int 315 479 496 321 594 185 298 323 401 574 ...
$ hit : int 81 130 141 87 169 37 73 81 92 159 ...
$ homeruns : int 7 18 20 10 4 1 0 6 17 21 ...
$ runs : int 24 66 65 39 74 23 24 26 49 107 ...
$ runs.batted : int 38 72 78 42 51 8 24 32 66 75 ...
$ walks : int 39 76 37 30 35 21 7 8 65 59 ...
$ years.in.major.leagues : int 14 3 11 2 11 2 3 2 13 10 ...
$ bats.during.career : int 3449 1624 5628 396 4408 214 509 341 5206 4631 ...
$ hits.during.career : int 835 457 1575 101 1133 42 108 86 1332 1300 ...
$ homeruns.during.career : int 69 63 225 12 19 1 0 6 253 90 ...
$ runs.during.career : int 321 224 828 48 501 30 41 32 784 702 ...
$ runs.batted.during.career: int 414 266 838 46 336 9 37 34 890 504 ...
$ walks.during.career : int 375 263 354 33 194 24 12 8 866 488 ...
$ league : Factor w/ 2 levels "A","N": 2 1 2 2 1 2 1 2 1 1 ...
$ division : Factor w/ 2 levels "E","W": 2 2 1 1 2 1 2 2 1 1 ...
$ team : Factor w/ 24 levels "Atl.","Bal.",..: 9 21 14 14 16 14 10 1 7 8 ...
$ position : Factor w/ 23 levels "1B","1O","23",..: 10 1 20 1 22 4 22 22 13 22 ...
$ putout : int 632 880 200 805 282 76 121 143 0 238 ...
$ assists : int 43 82 11 40 421 127 283 290 0 445 ...
$ errors : int 10 14 3 4 25 7 9 19 0 22 ...
$ salary : num 475 480 500 91.5 750 ...
$ league87 : Factor w/ 2 levels "A","N": 2 1 2 2 1 1 1 2 1 1 ...
$ team87 : Factor w/ 24 levels "Atl.","Bal.",..: 9 21 5 14 16 13 10 1 7 8 ...

The issue is almost certainly that you're including the Name variable in your model: it is a factor with 263 levels, far more than the 32 that tree() allows. I would also remove it from a methodological standpoint, but this probably isn't the place for that discussion. Try:
train <- data
train$Name <- NULL
model<-tree(salary~.,train)
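A minimal sketch of how such columns could be spotted before fitting, assuming a made-up data frame `players` in place of the real data. The limit of 32 comes from the error message; `players`, `is_wide_factor` and `wide_factors` are names invented for this illustration.

```r
# Toy stand-in for the question's data (values are invented for illustration).
players <- data.frame(
  Name   = factor(paste("Player", 1:40)),      # 40 levels, over the limit
  team   = factor(rep(c("Hou.", "Sea."), 20)), # 2 levels, fine
  hit    = 1:40,
  salary = seq(100, 490, by = 10)
)

# Flag factor columns with more than 32 levels.
is_wide_factor <- vapply(
  players,
  function(col) is.factor(col) && nlevels(col) > 32,
  logical(1)
)
wide_factors <- names(players)[is_wide_factor]
print(wide_factors)

# Keep only the columns that are safe to pass to the model.
train <- players[, !is_wide_factor, drop = FALSE]
```

With the real data, the same check should flag only Name, since team, position and team87 have 24, 23 and 24 levels.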

If your salary is stored as a factor vector while you are trying to perform a regression, it needs to be a numeric vector instead. Convert your salary to numeric and it should work just fine. For more details, read the package's help:
http://cran.r-project.org/web/packages/tree/tree.pdf
Usage
tree(formula, data, weights, subset, na.action = na.pass,
control = tree.control(nobs, ...), method = "recursive.partition",
split = c("deviance", "gini"), model = FALSE, x = FALSE, y = TRUE,
wts = TRUE, ...)
Arguments
formula A formula expression. The left-hand-side (response) should be either a numerical vector when a
regression tree will be fitted or a factor, when a classification tree
is produced. The right-hand-side should be a series of numeric or
factor variables separated by +; there should be no interaction terms.
Both . and - are allowed: regression trees can have offset terms.
(...)
Depending on what exactly is stored in your salary variable, the conversion can be more or less tricky, but this should generally work:
salary = as.numeric(levels(salary))[salary]
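To see why the conversion goes through levels(), here is a small sketch on a made-up factor (`salary_f` is a name invented for the illustration). Calling as.numeric() directly on a factor returns the internal level codes, not the values.

```r
# A factor whose labels happen to look like numbers.
salary_f <- factor(c("475", "480", "500", "91.5", "750"))

# Direct conversion yields the integer level codes (levels sort as text).
codes <- as.numeric(salary_f)
print(codes)    # 1 2 3 5 4

# Indexing the numeric levels by the factor recovers the real values.
values <- as.numeric(levels(salary_f))[salary_f]
print(values)   # 475.0 480.0 500.0  91.5 750.0
```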
EDIT
As pointed out in the comments, the actual error refers to the predictors in data, not to salary. If an offending column really holds numerical data, converting it to numeric also solves the issue; if it has to remain a factor, you will need another model or fewer levels. You can also encode such factors numerically by hand (for example, by defining as many binary features as you have levels), but this can greatly increase the dimensionality of your input space.
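One way to build binary features of that kind in base R is model.matrix(). This is only a sketch on a made-up `position` column (the data frame `d` is invented here), not necessarily what the answer had in mind.

```r
# Small made-up factor column.
d <- data.frame(position = factor(c("C", "1B", "RF", "1B")))

# "- 1" drops the intercept so every level gets its own 0/1 column.
dummies <- model.matrix(~ position - 1, data = d)
print(dummies)

# One column per level: the width grows with the number of levels.
print(ncol(dummies))   # 3
```

A factor with 263 levels, such as Name, would therefore turn into 263 columns, which is why dropping it is usually the better choice.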
EDIT2
It seems that you first have to decide what you are trying to model. You are trying to predict salary, but based on what? Your data appears to consist of players' records, so the players' names are certainly the wrong type of data to use for this prediction (in particular, they are probably what causes the 32 levels error). You should remove from the data variable every column that should not be used for building the prediction. I do not know the exact aim here (there is no information about it in the question), so I can only guess that you are trying to predict a player's salary from his/her stats. In that case you should remove from the input data: players' names, players' teams and obviously salaries (as predicting X using X is not a good idea ;)).
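As an alternative to deleting columns, the formula itself can list only the wanted predictors. A minimal sketch, assuming a shortened, illustrative list of column names; `all_cols`, `excluded` and `predictors` are names made up here.

```r
# Illustrative subset of the column names from the question.
all_cols <- c("Name", "hit", "runs", "team", "position", "salary", "team87")

# Columns that should not act as predictors, plus the response itself.
excluded   <- c("Name", "team", "team87")
predictors <- setdiff(all_cols, c(excluded, "salary"))

# Build salary ~ hit + runs + position programmatically.
f <- reformulate(predictors, response = "salary")
print(f)
```

The resulting formula could then be passed to tree() in place of salary ~ ., leaving the original data frame untouched.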

Related

Viewing dataset in RStudio shows different number of observations compared to R commands

I am currently studying data science with R. To practice, I am using the Auto data of the ISLR package. However, I am encountering a confusing situation when viewing the data. When I view the dataset Auto.df in RStudio, I get the following:
However, when I use dim(Auto.df), I get the following:
> dim(Auto.df)
[1] 392 9
And when I use nrow(Auto.df), I get the following:
> nrow(Auto.df)
[1] 392
And when I use str(Auto.df), I get the following:
> str(Auto.df)
'data.frame': 392 obs. of 9 variables:
$ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
$ cylinders : num 8 8 8 8 8 8 8 8 8 8 ...
$ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
$ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
$ weight : num 3504 3693 3436 3433 3449 ...
$ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
$ year : num 70 70 70 70 70 70 70 70 70 70 ...
$ origin : num 1 1 1 1 1 1 1 1 1 1 ...
$ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
And I have the following in my RStudio "Global Environment" tab:
So why does viewing the dataset in RStudio show 397 rows (observations), whilst everything else says that there are 392 observations?
There are 392 observations in the data. What you are viewing are the rownames of the data. You can set rownames as anything and they do not represent row number in the data.
If you check the rownames of the Auto dataset you'll realise they are not sequential and some rownames jump by 2. For example, after 32 you don't have 33 but 34. Similarly, after 126 there is 128. I don't know why the data is like that, but that is why the last row label is 397 even though there are only 392 rows.
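The effect is easy to reproduce on a small made-up data frame (`toy` and `kept` are names invented for this sketch): subsetting keeps the original row labels, so the last label can exceed the row count.

```r
# Ten rows, two of them with a missing value.
toy <- data.frame(id = 1:10,
                  mpg = c(18, 15, NA, 16, 17, 15, NA, 14, 14, 15))

# Dropping rows keeps the original row names.
kept <- toy[!is.na(toy$mpg), ]
n_rows     <- nrow(kept)
last_label <- tail(rownames(kept), 1)
print(n_rows)       # 8
print(last_label)   # "10"

# Resetting the row names makes labels and row count agree again.
rownames(kept) <- NULL
print(tail(rownames(kept), 1))   # "8"
```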

C50 failed in r with "c50 code called exit with value 1"

I am having an issue training C50 on my dataset. Before posting, I researched all the similar issues and solutions other people had. My dataset has none of the problems they had, but C50 still fails in R. My dataset looks like:
'data.frame': 113967 obs. of 15 variables:
$ region : Factor w/ 51 levels "US:AK","US:AL",..: 2 3 3 4 4 4 4 5 5 5 ...
$ city : Factor w/ 6396 levels "179708","179720",..: 24 156 156 194 214 226 244 276 316 407 ...
$ dma : Factor w/ 211 levels "1","500","501",..: 24 148 148 173 173 173 189 195 204 208 ...
$ user_day : Factor w/ 7 levels "0","1","2","3",..: 6 6 6 6 6 6 6 6 6 6 ...
$ user_hour : Factor w/ 24 levels "0","1","10","11",..: 5 16 16 4 22 7 10 11 15 21 ...
$ os_extended : Factor w/ 71 levels "0","100","113",..: 55 68 68 7 29 14 14 14 29 34 ...
$ browser : Factor w/ 19 levels "0","10","11",..: 19 18 18 8 18 9 18 17 18 18 ...
$ domain : Factor w/ 2685 levels "0calc.com","100daysofrealfood.com",..: 1709 777 777 1406 727 2658 1406 1604 964 2658 ...
$ position : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 2 1 1 1 2 ...
$ placement : Factor w/ 5406 levels "10004098","10008956",..: 3331 1696 1714 3600 438 479 3598 3423 5406 479 ...
$ publisher : Factor w/ 1641 levels "1000773","1000776",..: 581 687 687 663 1369 1525 663 624 1641 1525 ...
$ seller_member_id : Factor w/ 304 levels "1001","1019",..: 19 101 101 40 19 35 40 40 75 35 ...
$ user_group : Factor w/ 1000 levels "0","1","10","100",..: 252 243 243 363 343 342 162 380 122 212 ...
$ size : Factor w/ 7 levels "160x600","300x250",..: 5 2 2 4 5 2 2 1 2 2 ...
$ predict.bid.vector.bin: Factor w/ 2 levels "(0.112,0.831]",..: 1 1 1 1 1 1 1 2 1 2 ...
As you can see, the last variable is my target variable (as a factor) and all features here have more than 1 level. Moreover, there are no NAs in the dataset. Yet, when I execute C50, I get an error:
> library(C50)
> myC50_Tree <- C5.0(x = test_set[,-15], y = test_set$predict.bid.vector.bin)
c50 code called exit with value 1
> summary(myC50_Tree)
Call:
C5.0.default(x = test_set[, -15], y = test_set$predict.bid.vector.bin)
C5.0 [Release 2.07 GPL Edition] Fri Apr 13 14:29:54 2018
-------------------------------
*** line 6 of `undefined.names': attribute `region' has only one value `US'
Error limit exceeded
What would be the issue here?
You can generate a simulated version of my dataset with the following R code:
# --- Set unique feature values
region <- c("US:AL","US:AR","US:AZ","US:CA","US:CO","US:CT","US:DC","US:FL")
city <- c("179944","180802","181120","181212","181251","181315","181400","181512","181762","181842","181934","181953","182259","182295")
dma <- c('522','693','754','875','345','234')
user_day <- c('1','2','3','4','5','6')
user_hour <- c('12','11','10','9','8','7','6','5')
os_extended <- c('187','92','125','87','90')
browser <- c('8','9','18','5')
domain <- c('yahoo.com','youtube.com','mmctw.com','msn.com','frive.com','wework.com')
position <- c('0','1','2','3')
placement <- c('`234123412','34563451','235234624','46785467','234556834','85991927394')
publisher <- c('5345','57867','78034','123452','84567','245645','956752')
seller_memeber_id <- c('234','745','546','687','235')
user_group <- c('112','556','009','345','238')
size <- c('100X20','340X10','300X500','300X600')
predict.bid.vector.bin <- c('(0.831,1.55]', '(0.112,0.831]')
features <- list(region,city,dma,user_day,user_hour,os_extended,browser,domain,position,placement,publisher,seller_memeber_id,user_group,size,predict.bid.vector.bin)
# --- Sample simulated dataset
test_set <- vector()
for (feature in 1:length(features)) {
test_set <- cbind(test_set, sample(features[[feature]],1000,replace=TRUE))
}
test_set <- data.frame(test_set)
colnames(test_set) <- c('region','city','dma','user_day','user_hour',
'os_extended','browser','domain','position',
'placement','publisher','seller_memeber_id',
'user_group','size','predict.bid.vector.bin')
# --- check data
str(test_set)
The problem is the values of the region variable; I think C5.0 doesn't like the colons in them. I recreated your dataset with:
region <- c("AL","AR","AZ","CA","CO","CT","DC","FL")
And then it worked with no errors:
treeModel <- C5.0(x=test_set[,-15],y=test_set[,15])
treeModel
...
Evaluation on training data (1000 cases):
Decision Tree
----------------
Size Errors
103 220(22.0%) <<
(a) (b) <-classified as
---- ----
358 122 (a): class 1
98 422 (b): class 2
Attribute usage:
100.00% user_hour
28.30% region
27.30% dma
24.30% city
17.60% user_day
15.40% size
12.70% placement
9.10% user_group
7.90% browser
6.50% os_extended
4.70% publisher
4.40% position
3.70% domain
3.00% seller_memeber_id
I also recoded the dependent variable as 1 and 2 just in case the string with the ranges was giving it a problem, but that didn't seem to matter at all (however in the output above you'll see that it predicted to Class 1 and Class 2, and that's why).
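On existing data the colons can be replaced in the factor levels instead of recreating the vector. A minimal sketch on a made-up `region` factor; the underscore is an arbitrary replacement chosen for this illustration.

```r
# Small factor with the same "US:XX" pattern as in the question.
region <- factor(c("US:AL", "US:AR", "US:AZ", "US:AL"))

# Rewrite the level labels only; the underlying codes stay untouched.
levels(region) <- gsub(":", "_", levels(region), fixed = TRUE)
print(levels(region))   # "US_AL" "US_AR" "US_AZ"
```

For the real data the same line would be applied to test_set$region before calling C5.0().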

c50 runs for +1 hour, then returns c50 code called exit with value 1

I looked at other questions regarding this error, but none of them matched my issue. I have no empty values, and none of the variable names in the dataset are used by the C50 package.
This is the structure of the used dataset (no empty values):
> str(dataset)
'data.frame': 776973 obs. of 13 variables:
$ CrimeID : int 9446748 9446846 9446876 9447044 9447227 9447263 9447282 9447312 9447340 9447387 ...
$ CaseNumber : Factor w/ 776907 levels "161884","F218264",..: 67 111 157 283 372 404 421 435 457 487 ...
$ CrimeDate : Factor w/ 326056 levels "1/1/2014 0:00",..: 1 1 1 1 1 1 1 1 1 1 ...
$ CrimeBlock : Factor w/ 31381 levels "0000X E 100TH PL",..: 3101 4085 26441 10811 6414 3183 7076 11201 12166 5271 ...
$ IUCR : Factor w/ 357 levels "031A","031B",..: 345 51 52 333 52 347 347 345 52 334 ...
$ LocationDescription: Factor w/ 135 levels "ABANDONED BUILDING",..: 24 18 122 24 122 122 122 18 122 122 ...
$ Arrest : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ Domestic : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ Beat : int 1832 1133 1631 1932 1932 1533 1012 1413 1033 1211 ...
$ District : int 18 11 16 19 19 15 10 14 10 12 ...
$ Ward : int 42 24 36 43 32 24 24 35 12 26 ...
$ CommunityArea : int 8 27 17 7 7 25 29 22 30 24 ...
$ FBICode : Factor w/ 26 levels "01A","01B","04A",..: 24 11 11 24 11 25 25 24 11 24 ...
The variable Arrest will be used as the target variable in the decision tree process. I therefore convert it to a factor, rename the dataset to crimechicago, set the seed to create random training and test datasets, load the C50 library, and run the C5.0 code. This code runs for over an hour and then returns the error: c50 code called exit with value 1
dataset$Arrest<- factor(dataset$Arrest)
crimechicago <- dataset
set.seed(222)
totalvalues <-nrow(crimechicago)
train_sample <- sample(totalvalues, 400000)
crimechicago_train <- crimechicago[train_sample, ]
crimechicago_test <- crimechicago[-train_sample, ]
library(C50)
crimechicago_model <- C5.0(crimechicago_train[-7], crimechicago_train$Arrest)
EDIT:
-removed CrimeID and CaseNumber from dataset as not useful predictors of target variable Arrest
-summary screenshot of the dataset: (the entire dataset, not a subset)
structure of the train dataset (400,000 rows, created by randomly selecting 400,000 rows of the 700,000+ row original dataset)
str(crimechicago_train)
'data.frame': 400000 obs. of 10 variables:
$ CrimeDate : Factor w/ 326056 levels "1/1/2014 0:00",..: 300760 132223 211541 3 287239 54284 93432 133588 284191 232747 ...
$ CrimeBlock : Factor w/ 31381 levels "0000X E 100TH PL",..: 124 14942 2696 24466 143 9024 10613 22404 17613 10766 ...
$ IUCR : Factor w/ 357 levels "031A","031B",..: 209 274 25 51 334 345 329 274 347 329 ...
$ LocationDescription: Factor w/ 135 levels "ABANDONED BUILDING",..: 118 18 80 106 80 110 18 118 122 18 ...
$ Arrest : Factor w/ 2 levels "FALSE","TRUE": 1 2 1 1 1 1 1 1 1 1 ...
$ Domestic : Factor w/ 2 levels "FALSE","TRUE": 1 2 1 2 1 1 1 2 1 1 ...
$ Beat : int 113 1133 1834 825 1834 1434 1921 715 2522 1431 ...
$ District : int 1 11 18 8 18 14 19 7 25 14 ...
$ Ward : int 42 24 42 15 42 32 47 15 30 1 ...
$ CommunityArea : int 32 27 8 66 8 24 5 67 20 22 ...

Summary() function in R -not showing statistics

I am trying to get summary statistics for my data set. The dataset contains cereal yield values for different countries over a number of years. I want to get the summary statistics for each year, and then transpose the dataset and get the summary statistics for each country.
For some reason I am not getting the summary statistics, just a list of some of the values and how often each occurs.
I would appreciate any help with this issue.
Below is a sample of my dataset:
row.names YR1990 YR1991 YR1992
3 1200.6 1160 1097.7
4 320.9 417.4 397
5 2794.3 2071.8 2269.2
6 2216.4 1594 2315.3
7 2232.32 2666.1 3057.3
10 2380.9 1833.3 1722.2
These are the results I get from the summary() function:
summary(CerialData)
YR1990 YR1991 YR1992 YR1993 YR1994
1000 : 1 1000 : 1 943.2 : 2 1000 : 1 1040.03: 1
1003.9 : 1 1043.19: 1 1000 : 1 1055.77: 1 1041.1 : 1
1026.7 : 1 1050.3 : 1 1021.2 : 1 1083.3 : 1 1091.6 : 1
1028.5 : 1 1055.3 : 1 1042.1 : 1 1109.3 : 1 1100 : 1
1033.2 : 1 1094 : 1 1069.7 : 1 1135.5 : 1 1111.1 : 1
1036.8 : 1 1108.3 : 1 1072.3 : 1 1153 : 1 1132.2 : 1
(Other):158 (Other):158 (Other):157 (Other):158 (Other):158
str(CerialData)
'data.frame': 164 obs. of 20 variables:
$ YR1990: Factor w/ 188 levels "","..","0","1000",..: 19 116 103 80 81 85 46 153 26 177 ...
$ YR1991: Factor w/ 191 levels "","..","0","1000",..: 14 141 66 38 93 53 40 154 28 181 ...
$ YR1992: Factor w/ 207 levels "","..","0","1000",..: 10 151 95 96 134 49 67 165 28 197 ...
$ YR1993: Factor w/ 194 levels "","..","0","1000",..: 8 97 99 178 107 35 62 153 23 182 ...
$ YR1994: Factor w/ 214 levels "","..","0","1040.03",..: 11 133 107 74 127 53 15 171 17 207 ...
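The str() output above shows the year columns stored as factors with levels such as "" and "..", which is consistent with summary() printing level counts instead of statistics. A minimal sketch of one possible conversion, assuming those two entries stand for missing values; `cereal` and `to_numeric` are names made up for the illustration.

```r
# Toy version of the data with the same kind of non-numeric entries.
cereal <- data.frame(
  YR1990 = factor(c("1200.6", "..", "2794.3", "")),
  YR1991 = factor(c("1160", "417.4", "..", "1594"))
)

# Convert the labels to numbers, treating "" and ".." as missing (assumption).
to_numeric <- function(col) {
  txt <- as.character(col)
  txt[txt %in% c("", "..")] <- NA
  as.numeric(txt)
}
cereal[] <- lapply(cereal, to_numeric)

# summary() now reports min, quartiles, mean and the NA count per column.
print(summary(cereal))
```

If the data came from a CSV file, re-reading it with na.strings = c("", "..") in read.csv() may be another option.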

R subsetting a data frame based on a factor variable formatted like a range (xx-xx)

I have been stuck on this problem for many hours now, and I suspect I am missing something obvious.
Here is my problem:
I have a data-frame in .xlsx file that can be downloaded here.
I loaded this data frame into R using RStudio on a Mac and called it demoData.
There are 5 variables (AgeRange, Women, Men, Total, and Year).
I am not able to subset this data frame with a condition on AgeRange. The format of this variable is as follows: xx-xx (00-04 meaning people between 00 and 04 years old). When I try, the result says that no row satisfies the condition.
The class of the variable "AgeRange" is factor.
Here is my code:
demoData[demoData$AgeRange=="00-04",]
Thank you for your help.
Edit: from Arun. Here's input from head(demoData):
Age Feminin Masculin. Ensemble Annee
1 00-04 720 745 1465 2004
2 05-09 745 767 1512 2004
3 10-14 813 830 1643 2004
4 15-19 824 820 1644 2004
5 20-24 839 823 1662 2004
6 25-29 752 699 1450 2004
# str(demoData)
'data.frame': 272 obs. of 5 variables:
$ Age : Factor w/ 16 levels "00-04 ","05-09 ",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Feminin : Factor w/ 216 levels "138 ","139 ",..: 112 124 164 165 174 130 106 86 78 66 ...
$ Masculin.: Factor w/ 201 levels "120 ","122 ",..: 132 141 174 169 170 124 111 89 90 75 ...
$ Ensemble : Factor w/ 242 levels "1041 ","1044 ",..: 53 66 115 116 119 50 38 14 9 238 ...
$ Annee : Factor w/ 17 levels "2004 ","2005",..: 1 1 1 1 1 1 1 1 1 1 ...
I read in your xlsx file with the xlsx package:
df<-read.xlsx("C:/Users/swatson1/Downloads/Evolution_Population_2004_2020.xlsx",1)
and it looked like this:
> df
Age Feminin MasculinÂ. Ensemble Annee
1 00-04Â 720Â 745Â 1465Â 2004Â
2 05-09Â 745Â 767Â 1512Â 2004Â
You could replace each column, getting rid of the extra character with something like:
df$Age<-substr(df$Age,1,5)
Alternatively, use gsub as this will work on any column regardless of the length of the entry:
df$Age<-gsub("Â\\s","",df$Age)
Then your code would work:
df[df$Age=="00-04",]
# copied from the Excel file (trailing character written as an escape here)
str1 <- "00-04\u00a0"
utf8ToInt(str1)
#[1] 48 48 45 48 52 160
There seems to be a no-break space at the end of the string. Sanitize your file.
You should be able to remove the no-break spaces using
df$Age <- gsub(intToUtf8(160),"",df$Age)
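Since the str() output in the question shows the trailing character in several columns, the same replacement can be applied to every column at once. A minimal sketch on a made-up two-column data frame (`demo` is invented here; the column names follow the question).

```r
nbsp <- intToUtf8(160)   # the no-break space

# Toy data with a trailing no-break space in every cell.
demo <- data.frame(
  Age     = paste0(c("00-04", "05-09"), nbsp),
  Feminin = paste0(c("720", "745"), nbsp),
  stringsAsFactors = TRUE
)

# Strip the character from all columns, then restore the numeric type.
demo[] <- lapply(demo, function(col) gsub(nbsp, "", as.character(col), fixed = TRUE))
demo$Feminin <- as.numeric(demo$Feminin)

# The original subsetting condition now matches.
hits <- demo[demo$Age == "00-04", ]
print(hits)
```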
