I am currently studying data science with R. To practice, I am using the Auto data of the ISLR package. However, I am encountering a confusing situation when viewing the data. When I view the dataset Auto.df in RStudio, I get the following:
However, when I use dim(Auto.df), I get the following:
> dim(Auto.df)
[1] 392 9
And when I use nrow(Auto.df), I get the following:
> nrow(Auto.df)
[1] 392
And when I use str(Auto.df), I get the following:
> str(Auto.df)
'data.frame': 392 obs. of 9 variables:
$ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
$ cylinders : num 8 8 8 8 8 8 8 8 8 8 ...
$ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
$ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
$ weight : num 3504 3693 3436 3433 3449 ...
$ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
$ year : num 70 70 70 70 70 70 70 70 70 70 ...
$ origin : num 1 1 1 1 1 1 1 1 1 1 ...
$ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
And I have the following in my RStudio "Global Environment" tab:
So why does viewing the dataset in RStudio show 397 rows (observations), whilst everything else says that there are 392 observations?
There are 392 observations in the data. What you are viewing are the rownames of the data. You can set rownames as anything and they do not represent row number in the data.
If you check the rownames of Auto dataset you'll realise they are not sequential and some rownames jump by 2. For example, after 32 you don't have 33 but 34. Similarly after 126 there is 128. I don't know why the data is like that but that makes row number at the end to go till 397.
I am having issue with training C50 on my dataset. Before this post, I have researched all the other similar issues/solutions people had. However, my dataset has none of the issue they had but still failed the C50 execution in r. My dataset looks like:
'data.frame': 113967 obs. of 15 variables:
$ region : Factor w/ 51 levels "US:AK","US:AL",..: 2 3 3 4 4 4 4 5 5 5 ...
$ city : Factor w/ 6396 levels "179708","179720",..: 24 156 156 194 214 226 244 276 316 407 ...
$ dma : Factor w/ 211 levels "1","500","501",..: 24 148 148 173 173 173 189 195 204 208 ...
$ user_day : Factor w/ 7 levels "0","1","2","3",..: 6 6 6 6 6 6 6 6 6 6 ...
$ user_hour : Factor w/ 24 levels "0","1","10","11",..: 5 16 16 4 22 7 10 11 15 21 ...
$ os_extended : Factor w/ 71 levels "0","100","113",..: 55 68 68 7 29 14 14 14 29 34 ...
$ browser : Factor w/ 19 levels "0","10","11",..: 19 18 18 8 18 9 18 17 18 18 ...
$ domain : Factor w/ 2685 levels "0calc.com","100daysofrealfood.com",..: 1709 777 777 1406 727 2658 1406 1604 964 2658 ...
$ position : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 2 1 1 1 2 ...
$ placement : Factor w/ 5406 levels "10004098","10008956",..: 3331 1696 1714 3600 438 479 3598 3423 5406 479 ...
$ publisher : Factor w/ 1641 levels "1000773","1000776",..: 581 687 687 663 1369 1525 663 624 1641 1525 ...
$ seller_member_id : Factor w/ 304 levels "1001","1019",..: 19 101 101 40 19 35 40 40 75 35 ...
$ user_group : Factor w/ 1000 levels "0","1","10","100",..: 252 243 243 363 343 342 162 380 122 212 ...
$ size : Factor w/ 7 levels "160x600","300x250",..: 5 2 2 4 5 2 2 1 2 2 ...
$ predict.bid.vector.bin: Factor w/ 2 levels "(0.112,0.831]",..: 1 1 1 1 1 1 1 2 1 2 ...
As you can see, the last variable is my target variable (as factor) and all features here have more than 1 level. Moreover, there is no NA in the dataset. Yet, when i execute the C50, i got error:
> library(C50)
> myC50_Tree <- C5.0(x = test_set[,-15], y = test_set$predict.bid.vector.bin)
c50 code called exit with value 1
> summary(myC50_Tree)
Call:
C5.0.default(x = test_set[, -15], y = test_set$predict.bid.vector.bin)
C5.0 [Release 2.07 GPL Edition] Fri Apr 13 14:29:54 2018
-------------------------------
*** line 6 of `undefined.names': attribute `region' has only one value `US'
Error limit exceeded
What would be the issue here?
***You can get the simulated dataset of mine with following r code:
# --- Set unique feature values
region <- c("US:AL","US:AR","US:AZ","US:CA","US:CO","US:CT","US:DC","US:FL")
city <- c("179944","180802","181120","181212","181251","181315","181400","181512","181762","181842","181934","181953","182259","182295")
dma <- c('522','693','754','875','345','234')
user_day <- c('1','2','3','4','5','6')
user_hour <- c('12','11','10','9','8','7','6','5')
os_extended <- c('187','92','125','87','90')
browser <- c('8','9','18','5')
domain <- c('yahoo.com','youtube.com','mmctw.com','msn.com','frive.com','wework.com')
position <- c('0','1','2','3')
placement <- c('`234123412','34563451','235234624','46785467','234556834','85991927394')
publisher <- c('5345','57867','78034','123452','84567','245645','956752')
seller_memeber_id <- c('234','745','546','687','235')
user_group <- c('112','556','009','345','238')
size <- c('100X20','340X10','300X500','300X600')
predict.bid.vector.bin <- c('(0.831,1.55]', '(0.112,0.831]')
features <- list(region,city,dma,user_day,user_hour,os_extended,browser,domain,position,placement,publisher,seller_memeber_id,user_group,size,predict.bid.vector.bin)
# --- Sample simulated dataset
test_set <- vector()
for (feature in 1:length(features)) {
test_set <- cbind(test_set, sample(features[[feature]],1000,replace=TRUE))
}
test_set <- data.frame(test_set)
colnames(test_set) <- c('region','city','dma','user_day','user_hour',
'os_extended','browser','domain','position',
'placement','publisher','seller_memeber_id',
'user_group','size','predict.bid.vector.bin')
# --- check data
str(test_set)
The problem is the variable name region -- I think C5.0 doesn't like the colons in there. I recreated your dataset with:
region <- c("AL","AR","AZ","CA","CO","CT","DC","FL")
And then it worked with no errors:
treeModel <- C5.0(x=test_set[,-15],y=test_set[,15])
treeModel
...
Evaluation on training data (1000 cases):
Decision Tree
----------------
Size Errors
103 220(22.0%) <<
(a) (b) <-classified as
---- ----
358 122 (a): class 1
98 422 (b): class 2
Attribute usage:
100.00% user_hour
28.30% region
27.30% dma
24.30% city
17.60% user_day
15.40% size
12.70% placement
9.10% user_group
7.90% browser
6.50% os_extended
4.70% publisher
4.40% position
3.70% domain
3.00% seller_memeber_id
I also recoded the dependent variable as 1 and 2 just in case the string with the ranges was giving it a problem, but that didn't seem to matter at all (however in the output above you'll see that it predicted to Class 1 and Class 2, and that's why).
I looked at other questions regarding my error but none had a similar issue as I do. I have no empty values, and none of the variable names in the dataset are used by the C50 package.
This is the structure of the used dataset (no empty values):
> str(dataset)
'data.frame': 776973 obs. of 13 variables:
$ CrimeID : int 9446748 9446846 9446876 9447044 9447227 9447263 9447282 9447312 9447340 9447387 ...
$ CaseNumber : Factor w/ 776907 levels "161884","F218264",..: 67 111 157 283 372 404 421 435 457 487 ...
$ CrimeDate : Factor w/ 326056 levels "1/1/2014 0:00",..: 1 1 1 1 1 1 1 1 1 1 ...
$ CrimeBlock : Factor w/ 31381 levels "0000X E 100TH PL",..: 3101 4085 26441 10811 6414 3183 7076 11201 12166 5271 ...
$ IUCR : Factor w/ 357 levels "031A","031B",..: 345 51 52 333 52 347 347 345 52 334 ...
$ LocationDescription: Factor w/ 135 levels "ABANDONED BUILDING",..: 24 18 122 24 122 122 122 18 122 122 ...
$ Arrest : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ Domestic : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ Beat : int 1832 1133 1631 1932 1932 1533 1012 1413 1033 1211 ...
$ District : int 18 11 16 19 19 15 10 14 10 12 ...
$ Ward : int 42 24 36 43 32 24 24 35 12 26 ...
$ CommunityArea : int 8 27 17 7 7 25 29 22 30 24 ...
$ FBICode : Factor w/ 26 levels "01A","01B","04A",..: 24 11 11 24 11 25 25 24 11 24 ...
The variable Arrest will be used as target variable in the decision tree process. I thus factorize the variable, rename the dataset as crimechicago, set the seed to create random training and test datasets, load librar c50, and run the c50 code. This code runs for over an hour and then returns the error: c50 code called exit with value 1
dataset$Arrest<- factor(dataset$Arrest)
crimechicago <- dataset
set.seed(222)
totalvalues <-nrow(crimechicago)
train_sample <- sample(totalvalues, 400000)
crimechicago_train <- crimechicago[train_sample, ]
crimechicago_test <- crimechicago[-train_sample, ]
library(C50)
crimechicago_model <- C5.0(crimechicago_train[-7], crimechicago_train$Arrest)
EDIT:
-removed CrimeID and CaseNumber from dataset as not useful predictors of target variable Arrest
-summary screenshot of the dataset: (the entire dataset, not a subset)
structure of the train dataset (400,000 rows, created by randomly selecting 400,000 rows of the 700,000+ row original dataset)
str(crimechicago_train)
'data.frame': 400000 obs. of 10 variables:
$ CrimeDate : Factor w/ 326056 levels "1/1/2014 0:00",..: 300760 132223 211541 3 287239 54284 93432 133588 284191 232747 ...
$ CrimeBlock : Factor w/ 31381 levels "0000X E 100TH PL",..: 124 14942 2696 24466 143 9024 10613 22404 17613 10766 ...
$ IUCR : Factor w/ 357 levels "031A","031B",..: 209 274 25 51 334 345 329 274 347 329 ...
$ LocationDescription: Factor w/ 135 levels "ABANDONED BUILDING",..: 118 18 80 106 80 110 18 118 122 18 ...
$ Arrest : Factor w/ 2 levels "FALSE","TRUE": 1 2 1 1 1 1 1 1 1 1 ...
$ Domestic : Factor w/ 2 levels "FALSE","TRUE": 1 2 1 2 1 1 1 2 1 1 ...
$ Beat : int 113 1133 1834 825 1834 1434 1921 715 2522 1431 ...
$ District : int 1 11 18 8 18 14 19 7 25 14 ...
$ Ward : int 42 24 42 15 42 32 47 15 30 1 ...
$ CommunityArea : int 32 27 8 66 8 24 5 67 20 22 ...
I am trying to get summary statistics for my data set. The dataset is values for different countries cereal yield of a number of years. I want to get the summary statistics and for each year and then transpose the dataset and get the summary statistics for each country.
For some reason I am not getting the summary statistics and just a list of some of the values and the quantity of them.
I would appreciate any help with this issue.
Below is a sample of my dataset:
row.names YR1990 YR1991 YR1992
3 1200.6 1160 1097.7
4 320.9 417.4 397
5 2794.3 2071.8 2269.2
6 2216.4 1594 2315.3
7 2232.32 2666.1 3057.3
10 2380.9 1833.3 1722.2
This is the results I am getting after summary() function:
summary(CerialData)
YR1990 YR1991 YR1992 YR1993 YR1994
1000 : 1 1000 : 1 943.2 : 2 1000 : 1 1040.03: 1
1003.9 : 1 1043.19: 1 1000 : 1 1055.77: 1 1041.1 : 1
1026.7 : 1 1050.3 : 1 1021.2 : 1 1083.3 : 1 1091.6 : 1
1028.5 : 1 1055.3 : 1 1042.1 : 1 1109.3 : 1 1100 : 1
1033.2 : 1 1094 : 1 1069.7 : 1 1135.5 : 1 1111.1 : 1
1036.8 : 1 1108.3 : 1 1072.3 : 1 1153 : 1 1132.2 : 1
(Other):158 (Other):158 (Other):157 (Other):158 (Other):158
str(CerialData) 'data.frame': 164 obs. of 20 variables:
$ YR1990: Factor w/ 188 levels "","..","0","1000",..: 19 116 103 80 81 85 46 153 26 177 ...
$ YR1991: Factor w/ 191 levels "","..","0","1000",..: 14 141 66 38 93 53 40 154 28 181 ...
$ YR1992: Factor w/ 207 levels "","..","0","1000",..: 10 151 95 96 134 49 67 165 28 197 ...
$ YR1993: Factor w/ 194 levels "","..","0","1000",..: 8 97 99 178 107 35 62 153 23 182 ...
$ YR1994: Factor w/ 214 levels "","..","0","1040.03",..: 11 133 107 74 127 53 15 171 17 207 ...
I am facing this problem for many hours now, but I know I am missing something obvious.
Here is my problem:
I have a data-frame in .xlsx file that can be downloaded here.
I loaded this data-frame into R using RStudio on MAc and called it demoData.
There are 5 variables (AgeRange, Women, Men, Total, and Year).
I am not able to subset this data frame with a condition on the AgeRange. The format of this variable is as follow: xx-xx (00-04 meaning people between 00 and 04 years old). The message I have when I try to do that is that there is no row filling this condition.
The class of the variable "AgeRange" is factor.
Here is my code:
demoData[demoData$AgeRange=="00-04",]
Thank you for your help.
Edit: from Arun. Here's input from head(demoData):
Age Feminin Masculin. Ensemble Annee
1 00-04 720 745 1465 2004
2 05-09 745 767 1512 2004
3 10-14 813 830 1643 2004
4 15-19 824 820 1644 2004
5 20-24 839 823 1662 2004
6 25-29 752 699 1450 2004
# str(demoData)
'data.frame': 272 obs. of 5 variables:
$ Age : Factor w/ 16 levels "00-04 ","05-09 ",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Feminin : Factor w/ 216 levels "138 ","139 ",..: 112 124 164 165 174 130 106 86 78 66 ...
$ Masculin.: Factor w/ 201 levels "120 ","122 ",..: 132 141 174 169 170 124 111 89 90 75 ...
$ Ensemble : Factor w/ 242 levels "1041 ","1044 ",..: 53 66 115 116 119 50 38 14 9 238 ...
$ Annee : Factor w/ 17 levels "2004 ","2005",..: 1 1 1 1 1 1 1 1 1 1 ...
I read in your xlsx file with the xlsx package:
df<-read.xlsx("C:/Users/swatson1/Downloads/Evolution_Population_2004_2020.xlsx",1)
and it looked like this:
> df
Age Feminin MasculinÂ. Ensemble Annee
1 00-04Â 720Â 745Â 1465Â 2004Â
2 05-09Â 745Â 767Â 1512Â 2004Â
You could replace each column, getting rid of the extra character with something like:
df$Age<-substr(df$Age,1,5)
Alternatively, use gsub as this will work on any column regardless of the length of the entry:
df$Age<-gsub("Â\\s","",df$Age)
Then your code would work:
df[df$Age=="00-04",]
#coppied from the Excel file
str1 <- "00-04 "
utf8ToInt(str1)
#[1] 48 48 45 48 52 160
There seems to be a no-break space at the end of the string. Sanitize your file.
You should be able to remove the no-break spaces using
df$Age <- gsub(intToUtf8(160),"",df$Age)