summary() function in R not showing statistics

I am trying to get summary statistics for my data set, which holds cereal yield values for different countries over a number of years. I want the summary statistics for each year, and then I want to transpose the dataset and get the summary statistics for each country.
For some reason summary() does not give me the statistics; it just lists some of the values and how many times each occurs.
I would appreciate any help with this issue.
Below is a sample of my dataset:
row.names YR1990 YR1991 YR1992
3 1200.6 1160 1097.7
4 320.9 417.4 397
5 2794.3 2071.8 2269.2
6 2216.4 1594 2315.3
7 2232.32 2666.1 3057.3
10 2380.9 1833.3 1722.2
These are the results I get from the summary() function:
summary(CerialData)
YR1990 YR1991 YR1992 YR1993 YR1994
1000 : 1 1000 : 1 943.2 : 2 1000 : 1 1040.03: 1
1003.9 : 1 1043.19: 1 1000 : 1 1055.77: 1 1041.1 : 1
1026.7 : 1 1050.3 : 1 1021.2 : 1 1083.3 : 1 1091.6 : 1
1028.5 : 1 1055.3 : 1 1042.1 : 1 1109.3 : 1 1100 : 1
1033.2 : 1 1094 : 1 1069.7 : 1 1135.5 : 1 1111.1 : 1
1036.8 : 1 1108.3 : 1 1072.3 : 1 1153 : 1 1132.2 : 1
(Other):158 (Other):158 (Other):157 (Other):158 (Other):158
str(CerialData)
'data.frame': 164 obs. of 20 variables:
$ YR1990: Factor w/ 188 levels "","..","0","1000",..: 19 116 103 80 81 85 46 153 26 177 ...
$ YR1991: Factor w/ 191 levels "","..","0","1000",..: 14 141 66 38 93 53 40 154 28 181 ...
$ YR1992: Factor w/ 207 levels "","..","0","1000",..: 10 151 95 96 134 49 67 165 28 197 ...
$ YR1993: Factor w/ 194 levels "","..","0","1000",..: 8 97 99 178 107 35 62 153 23 182 ...
$ YR1994: Factor w/ 214 levels "","..","0","1040.03",..: 11 133 107 74 127 53 15 171 17 207 ...
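The str() output above shows every year column stored as a factor whose levels include "" and "..", which is why summary() counts values instead of computing statistics. Below is a minimal sketch of one way to get numeric summaries, assuming that "" and ".." are placeholders for missing values. The small data frame is made up for illustration and is not the real data.

```r
# Made-up stand-in for CerialData: year columns read in as factors
CerialData <- data.frame(
  YR1990 = factor(c("1200.6", "320.9", "..", "2794.3")),
  YR1991 = factor(c("1160", "417.4", "2071.8", "")),
  row.names = c("3", "4", "5", "6")
)

# Convert every column from factor to numeric via its labels,
# treating "" and ".." as missing (an assumption about the source file)
CerialData[] <- lapply(CerialData, function(col) {
  x <- as.character(col)
  x[x %in% c("", "..")] <- NA
  as.numeric(x)
})

summary(CerialData)        # statistics per year

# Transpose so that each row (country) becomes a column
by_country <- as.data.frame(t(CerialData))
summary(by_country)        # statistics per country
```

Going through as.character() matters here: calling as.numeric() directly on a factor returns the internal level codes rather than the yields.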

Related

Unable to designate CSV column heads "as.factor" for R -Error

I am having an issue with assigning factors to my data CSV. Here is a summary of the data frame:
> 'data.frame': 303 obs. of 12 variables:
> PLOT : int 19 177 54 114 41 48 142 134 160 267 ...
> RANGE : int 2 12 4 8 3 4 10 9 11 18 ...
> ROW : int 4 12 9 9 11 3 7 14 10 12 ...
> REP : int 1 1 1 1 1 1 1 1 1 1 ...
> ENTRY : Factor w/ 184 levels "","17_YMG_0293",..: 40 40 77 82 87 88 102 103 103 6 ...
> PLOT_ID : Factor w/ 301 levels "","18_HZG_OvOv_001",..: 20 178 55 115 42 49 143 135 161 268 ...
> Shatter : num 9 9 9 9 9 9 9 9 9 8 ...
> Chaff.Color : Factor w/ 4 levels "","*Blank ones are segregating in color",..: 3 4 3 4 4 4 3 4 4 3 ...
> Heading_d.from.Jan.1: int 138 139 137 133 135 135 133 137 135 136 ...
> Height_cm : int 74 73 77 76 74 79 78 73 76 70 ...
> Plot.weight..kg. : num 0.26 0.18 0.19 0.14 0.33 0.19 0.13 0.11 0.24 0.18 ...
But I get this error:
HAYSData$Rep<-as.factor(HAYSData$Rep)
Error in `$<-.data.frame`(`*tmp*`, Rep, value = integer(0)) :
replacement has 0 rows, data has 303
I get the same type of error for Entry, Range, and Rows. I am not sure why: when I look at length(Entry), for example, I get 300. I even tried converting to numeric instead of factor, but that does not help.
I don't have any NA values in my data, and each category is its own column.
I don't know if something is wrong with my CSV. I have run this same script on another CSV, and this part of the script caused no issues with that data.
Can someone please help me?
R is case-sensitive, and str() shows that the columns are named in upper case. Try:
HAYSData$REP <- as.factor(HAYSData$REP)
HAYSData$ENTRY <- as.factor(HAYSData$ENTRY)
HAYSData$RANGE <- as.factor(HAYSData$RANGE)
HAYSData$ROW <- as.factor(HAYSData$ROW)
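A small sketch of why the misspelled name fails, using a made-up three-row stand-in for HAYSData: asking a data frame for a column that does not exist returns NULL, and assigning the zero-length result back triggers the "replacement has 0 rows" error.

```r
# Made-up stand-in for HAYSData with upper-case column names
HAYSData <- data.frame(REP = c(1L, 1L, 2L), ENTRY = c("a", "b", "a"))

names(HAYSData)            # shows the exact spelling of each column
is.null(HAYSData$Rep)      # TRUE: there is no column called "Rep"

# Assigning to the misspelled name reproduces the error
msg <- tryCatch(
  { HAYSData$Rep <- as.factor(HAYSData$Rep); "no error" },
  error = function(e) conditionMessage(e)
)
msg                        # "replacement has 0 rows, data has 3"

# Using the exact name works
HAYSData$REP <- as.factor(HAYSData$REP)
str(HAYSData)
```

Checking names() first is a quick way to catch this kind of mismatch.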

C50 failed in r with "c50 code called exit with value 1"

I am having an issue training C50 on my dataset. Before posting, I researched the similar issues and solutions other people had. My dataset has none of those issues, yet the C50 execution in R still fails. My dataset looks like:
'data.frame': 113967 obs. of 15 variables:
$ region : Factor w/ 51 levels "US:AK","US:AL",..: 2 3 3 4 4 4 4 5 5 5 ...
$ city : Factor w/ 6396 levels "179708","179720",..: 24 156 156 194 214 226 244 276 316 407 ...
$ dma : Factor w/ 211 levels "1","500","501",..: 24 148 148 173 173 173 189 195 204 208 ...
$ user_day : Factor w/ 7 levels "0","1","2","3",..: 6 6 6 6 6 6 6 6 6 6 ...
$ user_hour : Factor w/ 24 levels "0","1","10","11",..: 5 16 16 4 22 7 10 11 15 21 ...
$ os_extended : Factor w/ 71 levels "0","100","113",..: 55 68 68 7 29 14 14 14 29 34 ...
$ browser : Factor w/ 19 levels "0","10","11",..: 19 18 18 8 18 9 18 17 18 18 ...
$ domain : Factor w/ 2685 levels "0calc.com","100daysofrealfood.com",..: 1709 777 777 1406 727 2658 1406 1604 964 2658 ...
$ position : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 2 1 1 1 2 ...
$ placement : Factor w/ 5406 levels "10004098","10008956",..: 3331 1696 1714 3600 438 479 3598 3423 5406 479 ...
$ publisher : Factor w/ 1641 levels "1000773","1000776",..: 581 687 687 663 1369 1525 663 624 1641 1525 ...
$ seller_member_id : Factor w/ 304 levels "1001","1019",..: 19 101 101 40 19 35 40 40 75 35 ...
$ user_group : Factor w/ 1000 levels "0","1","10","100",..: 252 243 243 363 343 342 162 380 122 212 ...
$ size : Factor w/ 7 levels "160x600","300x250",..: 5 2 2 4 5 2 2 1 2 2 ...
$ predict.bid.vector.bin: Factor w/ 2 levels "(0.112,0.831]",..: 1 1 1 1 1 1 1 2 1 2 ...
As you can see, the last variable is my target variable (a factor), and every feature has more than one level. Moreover, there are no NAs in the dataset. Yet when I execute C50, I get an error:
> library(C50)
> myC50_Tree <- C5.0(x = test_set[,-15], y = test_set$predict.bid.vector.bin)
c50 code called exit with value 1
> summary(myC50_Tree)
Call:
C5.0.default(x = test_set[, -15], y = test_set$predict.bid.vector.bin)
C5.0 [Release 2.07 GPL Edition] Fri Apr 13 14:29:54 2018
-------------------------------
*** line 6 of `undefined.names': attribute `region' has only one value `US'
Error limit exceeded
What could be the issue here?
You can generate a simulated version of my dataset with the following R code:
# --- Set unique feature values
region <- c("US:AL","US:AR","US:AZ","US:CA","US:CO","US:CT","US:DC","US:FL")
city <- c("179944","180802","181120","181212","181251","181315","181400","181512","181762","181842","181934","181953","182259","182295")
dma <- c('522','693','754','875','345','234')
user_day <- c('1','2','3','4','5','6')
user_hour <- c('12','11','10','9','8','7','6','5')
os_extended <- c('187','92','125','87','90')
browser <- c('8','9','18','5')
domain <- c('yahoo.com','youtube.com','mmctw.com','msn.com','frive.com','wework.com')
position <- c('0','1','2','3')
placement <- c('`234123412','34563451','235234624','46785467','234556834','85991927394')
publisher <- c('5345','57867','78034','123452','84567','245645','956752')
seller_memeber_id <- c('234','745','546','687','235')
user_group <- c('112','556','009','345','238')
size <- c('100X20','340X10','300X500','300X600')
predict.bid.vector.bin <- c('(0.831,1.55]', '(0.112,0.831]')
features <- list(region,city,dma,user_day,user_hour,os_extended,browser,domain,position,placement,publisher,seller_memeber_id,user_group,size,predict.bid.vector.bin)
# --- Sample simulated dataset
test_set <- vector()
for (feature in 1:length(features)) {
test_set <- cbind(test_set, sample(features[[feature]],1000,replace=TRUE))
}
test_set <- data.frame(test_set)
colnames(test_set) <- c('region','city','dma','user_day','user_hour',
'os_extended','browser','domain','position',
'placement','publisher','seller_memeber_id',
'user_group','size','predict.bid.vector.bin')
# --- check data
str(test_set)
The problem is the region variable: I think C5.0 doesn't like the colons in its values. I recreated your dataset with:
region <- c("AL","AR","AZ","CA","CO","CT","DC","FL")
And then it worked with no errors:
treeModel <- C5.0(x=test_set[,-15],y=test_set[,15])
treeModel
...
Evaluation on training data (1000 cases):
Decision Tree
----------------
Size Errors
103 220(22.0%) <<
(a) (b) <-classified as
---- ----
358 122 (a): class 1
98 422 (b): class 2
Attribute usage:
100.00% user_hour
28.30% region
27.30% dma
24.30% city
17.60% user_day
15.40% size
12.70% placement
9.10% user_group
7.90% browser
6.50% os_extended
4.70% publisher
4.40% position
3.70% domain
3.00% seller_memeber_id
I also recoded the dependent variable as 1 and 2 just in case the string with the ranges was giving it a problem, but that didn't seem to matter at all (however in the output above you'll see that it predicted to Class 1 and Class 2, and that's why).
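For an existing data frame, the same fix can be applied by relabeling the factor levels instead of regenerating the data. This is a sketch on a made-up stand-in for test_set. It only shows the recoding step and does not call C5.0. That colons are the cause is the hypothesis from the answer above, and it is consistent with the error message reporting the single value `US'.

```r
# Made-up stand-in for the region column
test_set <- data.frame(
  region = factor(c("US:AL", "US:AR", "US:AZ", "US:AL")),
  size   = factor(c("100X20", "340X10", "100X20", "300X500"))
)

# Drop the "US:" prefix from the level labels
# (alternative: replace ":" with "_" using gsub)
levels(test_set$region) <- sub("^US:", "", levels(test_set$region))

levels(test_set$region)                                  # "AL" "AR" "AZ"
any(grepl(":", levels(test_set$region), fixed = TRUE))   # FALSE
```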

R: Changes to data when changing class

Prior to running a randomForest model, I load my data and sort variables into categorical and numerical so the model can process it.
Data as first loaded from the .csv file looks like this:
> str(DataFrame)
'data.frame': 1060 obs. of 6 variables:
$ VarX : int 1 1 1 1 0 0 0 0 1 0 ...
$ Var1 : num 127 135 137 138 138 ...
$ Var2 : Factor w/ 200 levels "#N/A","1690",..: 190 190 190 191 191 191 189 185 183 181 ...
$ Var3 : Factor w/ 138 levels "#N/A","100","101",..: 44 43 43 43 43 43 43 43 43 42 ...
$ Var4 : int 15 15 15 15 15 16 16 16 16 16 ...
$ Var5 : Factor w/ 189 levels "#N/A","10029",..: 87 87 87 87 87 85 85 85 85 85 ...
> head(DataFrame, 3)
VarX Var1 Var2 Var3 Var4 Var5
1 1 126.58 3660 152 15 7159.5
2 1 135.17 3660 150 15 7159.5
3 1 137.25 3660 150 15 7159.5
I then attempt to sort the variables in the following way:
##Sort numerical and categorical values
options(digits = 5)
cols <- c("VarX")
for (i in cols) {
DataFrame[,i] = as.factor(DataFrame[,i])
}
cols2 <- c("Var1", "Var2", "Var3", "Var4", "Var5")
for (i in cols2) {
DataFrame[,i] = as.numeric(DataFrame[,i])
}
However, this does something strange and undesirable to the data:
> str(DataFrame)
'data.frame': 1060 obs. of 6 variables:
$ VarX : Factor w/ 2 levels "0","1": 2 2 2 2 1 1 1 1 2 1 ...
$ Var1 : num 127 135 137 138 138 ...
$ Var2 : num 190 190 190 191 191 191 189 185 183 181 ...
$ Var3 : num 44 43 43 43 43 43 43 43 43 42 ...
$ Var4 : num 15 15 15 15 15 16 16 16 16 16 ...
$ Var5 : num 87 87 87 87 87 85 85 85 85 85 ...
> head(DataFrame,3)
VarX Var1 Var2 Var3 Var4 Var5
1 1 126.58 190 44 15 87
2 1 135.17 190 43 15 87
3 1 137.25 190 43 15 87
Also, while not shown in the above excerpt it turns all NA values into 1, which, depending on the data, can skew the results.
Q: What would be the correct way to process the data so that there is no corruption of the data, while ensuring that it can be used by the randomForest package?
You should use as.numeric(as.character(variable_name)) to convert a factor column to a numeric column; otherwise as.numeric returns the internal integer codes of the levels rather than the original values.
The Warning section of ?factor says:
The interpretation of a factor depends on both the codes and the
"levels" attribute. Be careful only to compare factors with the same
set of levels (in the same order). In particular, as.numeric applied
to a factor is meaningless, and may happen by implicit coercion. To
transform a factor f to approximately its original numeric values,
as.numeric(levels(f))[f] is recommended and slightly more efficient
than as.numeric(as.character(f)).
Instead of for loops, you can also use sapply to convert these columns to numeric, as below (note that sapply returns a matrix here, so wrap the result in as.data.frame() if you need a data frame):
dfnew <- sapply(df[,colms_to_be_converted],function(x)as.numeric(as.character(x)))
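The difference between level codes and level labels is easy to see on a tiny made-up factor. In the question's str() output "#N/A" is the first level, so as.numeric() maps it to code 1, which would explain the NA values turning into 1. The sketch below also shows an alternative, declaring "#N/A" as a missing-value marker when the file is read; the CSV text is invented for illustration.

```r
f <- factor(c("3660", "152", "7159.5", "#N/A"))

as.numeric(f)    # integer level codes, not the original values

# Go through the labels instead; "#N/A" cannot be parsed and becomes NA
vals <- suppressWarnings(as.numeric(as.character(f)))
vals             # 3660, 152, 7159.5, NA

# Alternative: declare the missing-value marker when reading the file
csv_text <- "Var1,Var2\n126.58,3660\n135.17,#N/A\n"
DataFrame <- read.csv(text = csv_text, na.strings = c("NA", "#N/A"))
str(DataFrame)   # Var2 is read as numbers with an NA, not as a factor
```

With na.strings set at read time, the columns never become factors in the first place, so no conversion loop is needed.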

c50 runs for +1 hour, then returns c50 code called exit with value 1

I looked at other questions regarding my error but none had a similar issue as I do. I have no empty values, and none of the variable names in the dataset are used by the C50 package.
This is the structure of the used dataset (no empty values):
> str(dataset)
'data.frame': 776973 obs. of 13 variables:
$ CrimeID : int 9446748 9446846 9446876 9447044 9447227 9447263 9447282 9447312 9447340 9447387 ...
$ CaseNumber : Factor w/ 776907 levels "161884","F218264",..: 67 111 157 283 372 404 421 435 457 487 ...
$ CrimeDate : Factor w/ 326056 levels "1/1/2014 0:00",..: 1 1 1 1 1 1 1 1 1 1 ...
$ CrimeBlock : Factor w/ 31381 levels "0000X E 100TH PL",..: 3101 4085 26441 10811 6414 3183 7076 11201 12166 5271 ...
$ IUCR : Factor w/ 357 levels "031A","031B",..: 345 51 52 333 52 347 347 345 52 334 ...
$ LocationDescription: Factor w/ 135 levels "ABANDONED BUILDING",..: 24 18 122 24 122 122 122 18 122 122 ...
$ Arrest : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ Domestic : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ Beat : int 1832 1133 1631 1932 1932 1533 1012 1413 1033 1211 ...
$ District : int 18 11 16 19 19 15 10 14 10 12 ...
$ Ward : int 42 24 36 43 32 24 24 35 12 26 ...
$ CommunityArea : int 8 27 17 7 7 25 29 22 30 24 ...
$ FBICode : Factor w/ 26 levels "01A","01B","04A",..: 24 11 11 24 11 25 25 24 11 24 ...
The variable Arrest will be used as the target variable in the decision tree. I therefore convert it to a factor, rename the dataset to crimechicago, set the seed, create random training and test datasets, load the C50 library, and run C5.0. The code runs for over an hour and then returns the error: c50 code called exit with value 1
dataset$Arrest<- factor(dataset$Arrest)
crimechicago <- dataset
set.seed(222)
totalvalues <-nrow(crimechicago)
train_sample <- sample(totalvalues, 400000)
crimechicago_train <- crimechicago[train_sample, ]
crimechicago_test <- crimechicago[-train_sample, ]
library(C50)
crimechicago_model <- C5.0(crimechicago_train[-7], crimechicago_train$Arrest)
EDIT:
- Removed CrimeID and CaseNumber from the dataset, as they are not useful predictors of the target variable Arrest.
- Summary screenshot of the dataset: (the entire dataset, not a subset)
Structure of the train dataset (400,000 rows, created by randomly selecting 400,000 rows of the 700,000+ row original dataset):
str(crimechicago_train)
'data.frame': 400000 obs. of 10 variables:
$ CrimeDate : Factor w/ 326056 levels "1/1/2014 0:00",..: 300760 132223 211541 3 287239 54284 93432 133588 284191 232747 ...
$ CrimeBlock : Factor w/ 31381 levels "0000X E 100TH PL",..: 124 14942 2696 24466 143 9024 10613 22404 17613 10766 ...
$ IUCR : Factor w/ 357 levels "031A","031B",..: 209 274 25 51 334 345 329 274 347 329 ...
$ LocationDescription: Factor w/ 135 levels "ABANDONED BUILDING",..: 118 18 80 106 80 110 18 118 122 18 ...
$ Arrest : Factor w/ 2 levels "FALSE","TRUE": 1 2 1 1 1 1 1 1 1 1 ...
$ Domestic : Factor w/ 2 levels "FALSE","TRUE": 1 2 1 2 1 1 1 2 1 1 ...
$ Beat : int 113 1133 1834 825 1834 1434 1921 715 2522 1431 ...
$ District : int 1 11 18 8 18 14 19 7 25 14 ...
$ Ward : int 42 24 42 15 42 32 47 15 30 1 ...
$ CommunityArea : int 32 27 8 66 8 24 5 67 20 22 ...

Creating decision tree

I have a CSV file (298 rows and 24 columns) and I want to create a decision tree to predict the column "salary". I have installed the tree package and loaded it with library().
But when I try to create the decision tree:
model<-tree(salary~.,data)
I get the following error:
Error in tree(salary ~ ., data) :
factor predictors must have at most 32 levels
What is wrong? The data is as follows:
Name bat hit homeruns runs
1 Alan Ashby 315 81 7 24
2 Alvin Davis 479 130 18 66
3 Andre Dawson 496 141 20 65
...
team position putout assists errors
1 Hou. C 632 43 10
2 Sea. 1B 880 82 14
3 Mon. RF 200 11 3
salary league87 team87
1 475 N Hou.
2 480 A Sea.
3 500 N Chi.
And this is the output of str(data):
'data.frame': 263 obs. of 24 variables:
$ Name : Factor w/ 263 levels "Al Newman","Alan Ashby",..: 2 7 8 10 6 1 13 11 9 3 ...
$ bat : int 315 479 496 321 594 185 298 323 401 574 ...
$ hit : int 81 130 141 87 169 37 73 81 92 159 ...
$ homeruns : int 7 18 20 10 4 1 0 6 17 21 ...
$ runs : int 24 66 65 39 74 23 24 26 49 107 ...
$ runs.batted : int 38 72 78 42 51 8 24 32 66 75 ...
$ walks : int 39 76 37 30 35 21 7 8 65 59 ...
$ years.in.major.leagues : int 14 3 11 2 11 2 3 2 13 10 ...
$ bats.during.career : int 3449 1624 5628 396 4408 214 509 341 5206 4631 ...
$ hits.during.career : int 835 457 1575 101 1133 42 108 86 1332 1300 ...
$ homeruns.during.career : int 69 63 225 12 19 1 0 6 253 90 ...
$ runs.during.career : int 321 224 828 48 501 30 41 32 784 702 ...
$ runs.batted.during.career: int 414 266 838 46 336 9 37 34 890 504 ...
$ walks.during.career : int 375 263 354 33 194 24 12 8 866 488 ...
$ league : Factor w/ 2 levels "A","N": 2 1 2 2 1 2 1 2 1 1 ...
$ division : Factor w/ 2 levels "E","W": 2 2 1 1 2 1 2 2 1 1 ...
$ team : Factor w/ 24 levels "Atl.","Bal.",..: 9 21 14 14 16 14 10 1 7 8 ...
$ position : Factor w/ 23 levels "1B","1O","23",..: 10 1 20 1 22 4 22 22 13 22 ...
$ putout : int 632 880 200 805 282 76 121 143 0 238 ...
$ assists : int 43 82 11 40 421 127 283 290 0 445 ...
$ errors : int 10 14 3 4 25 7 9 19 0 22 ...
$ salary : num 475 480 500 91.5 750 ...
$ league87 : Factor w/ 2 levels "A","N": 2 1 2 2 1 1 1 2 1 1 ...
$ team87 : Factor w/ 24 levels "Atl.","Bal.",..: 9 21 5 14 16 13 10 1 7 8 ...
The issue is almost certainly that you're including the Name variable in your model, as it has too many factor levels (263, while tree allows at most 32). I would also remove it from a methodological standpoint, but this probably isn't the place for that discussion. Try:
train <- data
train$Name <- NULL
model<-tree(salary~.,train)
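To find such columns without reading the str() output by eye, one option is to count the levels of every factor column and drop those above the limit named in the error message. This is a sketch on a made-up stand-in for data; the 32 comes from the error text, and the tree() call itself is left out.

```r
# Made-up stand-in: Name has one level per row, team has few levels
data <- data.frame(
  Name   = factor(paste("Player", 1:40)),
  team   = factor(rep(c("Hou.", "Sea.", "Mon.", "Chi."), 10)),
  salary = seq(100, by = 10, length.out = 40)
)

# Count the levels of each factor column
is_fac   <- vapply(data, is.factor, logical(1))
n_levels <- vapply(data[is_fac], nlevels, integer(1))
n_levels

# Drop factor predictors above the 32-level limit from the error message
too_many <- names(n_levels)[n_levels > 32]
train <- data[setdiff(names(data), too_many)]
names(train)     # "team" "salary"
```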
It seems that your salary is a factor vector, while you are trying to perform a regression, so it should be a numeric vector. Simply convert your salary to numeric, and it should work just fine. For more details, read the package's help:
http://cran.r-project.org/web/packages/tree/tree.pdf
Usage
tree(formula, data, weights, subset, na.action = na.pass,
control = tree.control(nobs, ...), method = "recursive.partition",
split = c("deviance", "gini"), model = FALSE, x = FALSE, y = TRUE,
wts = TRUE, ...)
Arguments
formula A formula expression. The left-hand-side (response) should be either a numerical vector when a regression tree will be fitted or a factor, when a classification tree is produced. The right-hand-side should be a series of numeric or factor variables separated by +; there should be no interaction terms. Both . and - are allowed: regression trees can have offset terms.
(...)
Depending on what exactly is stored in your salary variable, the conversion can be more or less tricky, but this should generally work:
salary = as.numeric(levels(salary))[salary]
EDIT
As pointed out in the comment, the actual error concerns the predictors in the data variable. If an offending column actually holds numerical data, converting it to numeric also solves the issue; if it has to be a factor, you will need another model or fewer levels. You can also convert these factors to a numerical format by hand (for example, by defining as many binary features as you have levels), but this can greatly enlarge your input space.
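The by-hand conversion to binary features mentioned above does not have to be written manually. A minimal sketch using base R's model.matrix on a made-up team column, producing one 0/1 indicator column per level:

```r
d <- data.frame(team = factor(c("Hou.", "Sea.", "Mon.", "Hou.")))

# One 0/1 indicator column per level; "- 1" removes the intercept
dummies <- model.matrix(~ team - 1, data = d)
dummies
ncol(dummies) == nlevels(d$team)   # TRUE
```

A factor with k levels becomes k columns, so a column like Name with 263 levels would add 263 mostly-zero features, which is the growth of the input space that the paragraph above warns about.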
EDIT2
It seems that you first have to decide what you are trying to model. You are trying to predict salary, but based on what? Your data appears to consist of players' records, so their names are surely the wrong type of data to use for this prediction (in particular, they are probably causing the 32 levels error). You should remove from the data variable all columns that should not be used for building a prediction. I do not know the exact aim here (the question gives no information about it), so I can only guess that you are trying to predict a person's salary based on his/her stats. In that case you should remove from the input data: players' names, players' teams, and obviously salaries (as predicting X using X is not a good idea ;)).
