C5.0 decision tree - c50 code called exit with value 1 - r

I am getting the following error
c50 code called exit with value 1
I am doing this on the titanic data available from Kaggle
# Importing datasets
train <- read.csv("train.csv", sep=",")
# this is the structure
str(train)
Output :-
'data.frame': 891 obs. of 12 variables:
$ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
$ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
$ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ Name : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
$ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
$ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
$ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
$ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
$ Ticket : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
$ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ Cabin : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
$ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
Then I tried using C5.0 dtree
# Trying with C5.0 decision tree
library(C50)
#C5.0 models require a factor outcome otherwise error
train$Survived <- factor(train$Survived)
new_model <- C5.0(train[-2],train$Survived)
So running the above lines gives me this error
c50 code called exit with value 1
I can't figure out what's going wrong. I was using similar code on a different dataset and it worked fine. Any ideas on how I can debug my code?
-Thanks

For anyone interested, the data can be found here: http://www.kaggle.com/c/titanic-gettingStarted/data. I think you need to be registered in order to download it.
Regarding your problem: first off, I think you meant to write
new_model <- C5.0(train[,-2],train$Survived)
Next, notice the structure of the Cabin and Embarked Columns. These two factors have an empty character as a level name (check with levels(train$Embarked)). This is the point where C50 falls over. If you modify your data such that
levels(train$Cabin)[1] = "missing"
levels(train$Embarked)[1] = "missing"
your algorithm will now run without an error.
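A quick way to find such columns before fitting is to scan every factor for an empty-string level. The sketch below uses base R only and a small made-up data frame; the names `toy`, `has_blank_level` and `blank_cols` are illustrative and do not come from the Kaggle data.

```r
# Toy data standing in for the Titanic columns (values are made up).
toy <- data.frame(
  Cabin    = factor(c("", "C85", "", "E46")),
  Embarked = factor(c("S", "C", "", "Q")),
  Fare     = c(7.25, 71.28, 7.92, 53.10)
)

# TRUE for factor columns that have "" among their levels.
has_blank_level <- vapply(
  toy,
  function(col) is.factor(col) && "" %in% levels(col),
  logical(1)
)
blank_cols <- names(toy)[has_blank_level]
blank_cols

# Rename the empty level in each affected column.
for (nm in blank_cols) {
  levels(toy[[nm]])[levels(toy[[nm]]) == ""] <- "missing"
}

levels(toy$Embarked)
```

Matching on `== ""` instead of position `[1]` avoids relying on the empty level being sorted first.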

Just in case: you can take a look at the error with
summary(new_model)
This error also occurs when there are special characters in the name of a variable. For example, you will get it if a variable name contains "я" (a letter from the Russian alphabet).

Here is what worked finally:-
Got this idea after reading this post
library(C50)
test$Survived <- NA
combinedData <- rbind(train,test)
combinedData$Survived <- factor(combinedData$Survived)
# fixing empty character level names
levels(combinedData$Cabin)[1] = "missing"
levels(combinedData$Embarked)[1] = "missing"
new_train <- combinedData[1:891,]
new_test <- combinedData[892:1309,]
new_model <- C5.0(new_train[,-2],new_train$Survived)
new_model_predict <- predict(new_model,new_test)
submitC50 <- data.frame(PassengerId=new_test$PassengerId, Survived=new_model_predict)
write.csv(submitC50, file="c50dtree.csv", row.names=FALSE)
The intuition behind this is that in this way both the train and test data set will have consistent factor levels.
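If combining the two data sets with rbind is not convenient, a different technique with the same goal is to re-declare each test factor using the training levels. This is only a sketch with made-up vectors (`train_port` and `test_port` are hypothetical names), not the approach used above.

```r
# Made-up train/test columns; "Q" never appears in the test set.
train_port <- factor(c("S", "C", "Q", "S"))
test_port  <- factor(c("C", "S"))

levels(test_port)   # only the levels seen in the test data

# Re-declare the test factor with the training levels.
test_port <- factor(test_port, levels = levels(train_port))

identical(levels(train_port), levels(test_port))
```

Values in the test data that are not among the training levels would become NA with this approach, so it is worth checking for new NAs afterwards.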

I had the same error, but I was using a numeric dataset without missing values.
After a long time, I discovered that my dataset had a predictor called "outcome". C5.0Control uses this name by default, and that was the cause of the error :'(
My solution was to rename the column. Another way would be to create a C5.0Control object with a different value for the label argument and pass that object to C5.0 through its control argument.
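Both options can be sketched as follows. The data frame `dat` and the replacement names `outcome_value` and `target` are made up for illustration. The second option is untested here and only builds the control object when the C50 package is installed.

```r
# Made-up numeric data with a predictor that happens to be called "outcome".
dat <- data.frame(
  outcome = c(1.2, 3.4, 2.2, 5.1, 4.8, 0.9),
  other   = c(10, 20, 15, 30, 28, 8),
  y       = factor(c("a", "b", "a", "b", "b", "a"))
)

# Option 1: rename the clashing predictor in a copy of the data.
dat_renamed <- dat
names(dat_renamed)[names(dat_renamed) == "outcome"] <- "outcome_value"
names(dat_renamed)

# Option 2: keep the column name and change the label C5.0 uses for
# the response. "target" is an arbitrary choice.
if (requireNamespace("C50", quietly = TRUE)) {
  ctrl <- C50::C5.0Control(label = "target")
  # ctrl would then be passed on as: C50::C5.0(x, y, control = ctrl)
}
```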

I also struggled for some hours with the same problem (return code "1"), both when building a model and when predicting.
Following the hint in Marco's answer, I wrote a small function that renames any factor level equal to "" in a data frame or vector; see the code below. Since R does not pass arguments to functions by reference, the functions cannot change the original object, so you have to use their return value:
removeBlankLevelsInDataFrame <- function(dataframe) {
  # seq_len() is safe even if the data frame has zero columns
  for (i in seq_len(ncol(dataframe))) {
    lv <- levels(dataframe[, i])
    # non-factor columns have NULL levels and are skipped
    if (!is.null(lv) && lv[1] == "") {
      levels(dataframe[, i])[1] <- "?"
    }
  }
  dataframe
}
removeBlankLevelsInVector <- function(vector) {
  lv <- levels(vector)
  if (!is.null(lv) && lv[1] == "") {
    levels(vector)[1] <- "?"
  }
  vector
}
Call of the functions may look like this:
trainX = removeBlankLevelsInDataFrame(trainX)
trainY = removeBlankLevelsInVector(trainY)
model = C50::C5.0.default(trainX,trainY)
However, it seems that C50 has a similar problem with character columns containing an empty string, so you will probably have to extend this to handle character columns as well if you have any.
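One possible extension that also covers character columns is sketched below. The function name `replace_blanks` and the toy data are hypothetical; the "?" placeholder follows the functions above.

```r
# Replace "" in factor levels and in character columns with a placeholder.
replace_blanks <- function(df, placeholder = "?") {
  for (i in seq_along(df)) {
    col <- df[[i]]
    if (is.factor(col)) {
      levels(col)[levels(col) == ""] <- placeholder
    } else if (is.character(col)) {
      # leave real NAs alone, only touch empty strings
      col[!is.na(col) & col == ""] <- placeholder
    }
    df[[i]] <- col
  }
  df
}

# Made-up data: one factor, one character and one numeric column.
toy <- data.frame(
  f = factor(c("", "x", "y")),
  s = c("a", "", NA),
  n = c(1, 2, 3),
  stringsAsFactors = FALSE
)
cleaned <- replace_blanks(toy)
str(cleaned)
```

As with the functions above, the result has to be assigned, because the input data frame is not modified.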

I also got the same error, but it was because of some illegal characters in the factor levels of one of the columns.
I used make.names function and corrected the factor levels:
levels(FooData$BarColumn) <- make.names(levels(FooData$BarColumn))
Then the problem was resolved.
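To see what make.names does to such levels, here is a small sketch with a made-up factor (`bar` is a hypothetical name):

```r
# Made-up factor whose levels are not syntactically valid R names.
bar <- factor(c("high risk", "low-risk", "5 stars", "ok"))

levels(bar)                              # before
levels(bar) <- make.names(levels(bar))
levels(bar)                              # spaces and "-" become ".", leading digit gets an "X"
```

Two different levels can map to the same cleaned name, in which case they are merged; `make.names(levels(bar), unique = TRUE)` keeps them apart.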

Related

"Number of observations <= number of random effects" error

I am using a package called diagmeta for meta-analysis purposes. I can use this package with a built in data set called Schneider2017. However when I make my own database/data set I get the following error:
Error: number of observations (=300) <= number of random effects (=3074) for term (Group * Cutoff | Study); the random-effects parameters and the residual variance (or scale parameter) are probably unidentifiable
Another thread here on SO suggests the error is caused by the data format of one or more columns. I have made sure every column's data type matches that in the Schneider2017 dataset - no effect.
Link to the other thread
I have tried exporting all of the data from the Schneider2017 dataset to Excel and then importing it again from Excel through RStudio. This makes no difference either, which suggests to me that something in the data format could be different, although I can't see how.
diag2 <- diagmeta(tpos, fpos, tneg, fneg, cutpoint,
studlab = paste(author,year,group),
data = SRschneider,
model = "DIDS", log.cutoff = FALSE,
check.nobs.vs.nRE = "ignore")
The dataset looks like this:
I expected the same successful execution and plotting as with the built-in data set, but keep getting this error.
Result from doing str(mydataset):
> str(SRschneider)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 150 obs. of 10 variables:
$ ...1 : num 1 2 3 4 5 6 7 8 9 10 ...
$ study_id: num 1 1 1 1 1 1 1 1 1 1 ...
$ author : chr "Arora" "Arora" "Arora" "Arora" ...
$ year : num 2006 2006 2006 2006 2006 ...
$ group : chr NA NA NA NA ...
$ cutpoint: chr "6" "7.0" "8.0" "9.0" ...
$ tpos : num 133 131 130 127 119 115 113 110 102 98 ...
$ fneg : num 5 7 8 11 19 23 25 28 36 40 ...
$ fpos : num 34 33 31 30 28 26 25 21 19 19 ...
$ tneg : num 0 1 3 4 6 8 9 13 15 15 ...
Just a quick follow-up on Ben's detailed answer.
The statistical method implemented in diagmeta() expects that argument cutpoint is a continuous variable. We added a corresponding check for argument cutpoint (as well as arguments TP, FP, TN, and FN) in version 0.3-1 of R package diagmeta; see commit in GitHub repository for technical details.
Accordingly, the following R commands will result in a more informative error message:
data(Schneider2017)
diagmeta(tpos, fpos, tneg, fneg, as.character(cutpoint),
studlab = paste(author, year, group), data = Schneider2017)
You said that you
have made sure every column's data type matches that in the Schneider2017 dataset
but that doesn't seem to be true. Besides differences between num (numeric) and int (integer) types (which actually aren't typically important), your data has
$ cutpoint: chr "6" "7.0" "8.0" "9.0" ...
while str(Schneider2017) has
$ cutpoint: num 6 7 8 9 10 11 12 13 14 15 ...
Having your cutpoint be a character rather than numeric means that R will try to treat it as a categorical variable (with many discrete levels). This is very likely the source of your problem.
The cutpoint variable is likely a character because R encountered some value in this column that can't be interpreted as numeric (something as simple as a typographic error). You can use SRschneider$cutpoint <- as.numeric(SRschneider$cutpoint) to convert the variable to numeric by brute force (values that can't be interpreted will be set to NA), but it would be better to go upstream and see where the problem is.
If you use tidyverse packages to load your data you should get a list of "parsing problems" that may be useful. You can also try cp <- SRschneider$cutpoint; cp[which(is.na(as.numeric(cp)))] to look at the values that can't be converted.
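The check for unconvertible values can be sketched on a made-up column (the vector `cp` below is illustrative, not the asker's data):

```r
# Made-up cutpoint column with one typo ("8,0") and one stray label.
cp <- c("6", "7.0", "8,0", "9.0", "n/a")

# as.numeric() warns and returns NA where conversion fails.
cp_num <- suppressWarnings(as.numeric(cp))

# Entries that were present but could not be parsed as numbers.
bad <- !is.na(cp) & is.na(cp_num)
which(bad)   # positions of the offending entries
cp[bad]      # the offending values themselves
```

Excluding entries that were already NA keeps genuinely missing cutpoints from being reported as parse failures.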

undefined columns selected error in R when trying to subset using sapply

I have been tearing my hair out over this for the last hour. The following code was working perfectly a couple of hours ago, and now I have no idea why it no longer does. I have searched for other questions regarding the undefined columns selected error, and I think I have accounted for everything in those answers. I am sure there is some tiny thing I have overlooked or accidentally left in, but I can't see it!
I have a data frame with both factor and numeric variables, I want to subset so that I keep all of the factor variables, and remove numeric variables whose columns have a mean < 0.1.
I found the following code in another question on Stack Overflow. Slightly modified, it worked well on my test data (a smaller sub-dataset I am using for testing before trying the code on a big 3GB object):
meanfunction01 <- function(x){
  if(is.numeric(x)){
    mean(x) > 0.1
  } else {
    TRUE
  }
}
#then apply function to data table
Zdata <- Data1[,sapply(Data1, meanfunction01)]
I swear I was using this a few hours ago. When I came back to it and tried to use it again, it had stopped working, and now it just returns the following error:
Error in `[.data.frame`(Data1, , sapply(Data1, meanfunction01)) :
undefined columns selected
I was trying to modify the function so that it would loop over multiple objects (I have 54 objects I want to apply it to, and didn't want to type them all manually), but I don't think I edited the original function, and now it has stopped working.
A brief str() of my data:
> str(Data1[1:10])
'data.frame': 11 obs. of 10 variables:
$ Name : Factor w/ 11688 levels "GTEX-1117F-0226-SM-5GZZ7",..: 8186 8242 8262 8270 8343 8388 8403 8621 8689 8709 ...
$ SEX : Factor w/ 2 levels "Female","Male": 1 2 2 1 1 2 2 1 2 1 ...
$ AGE : Factor w/ 6 levels "20-29","30-39",..: 4 4 1 3 3 1 3 3 3 2 ...
$ CIRCUMSTANCES: Factor w/ 5 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Tissue.x : Factor w/ 53 levels "Adipose_Subcutaneous",..: 7 7 7 7 7 7 7 7 7 7 ...
$ ENSG00000223972.4 : num 0 0.0701 0.0339 0.1149 0.0549 ...
$ ENSG00000227232.4 : num 12.5 17.2 13.1 16 15.7 ...
$ ENSG00000243485.2 : num 0.0717 0 0.1508 0 0.061 ...
$ ENSG00000237613.2 : num 0 0.0654 0 0.0402 0.0768 ...
$ ENSG00000268020.2 : num 0 0.0421 0.0611 0 0 ...
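One thing worth checking, offered as an assumption because the thread never confirms the cause: if any numeric column contains an NA, mean(x) is NA, the logical vector returned by sapply contains an NA, and indexing columns with it fails with this error. A minimal sketch with made-up data (`d`, `keep_col` and `keep_col_safe` are hypothetical names):

```r
# Made-up data: column b contains an NA, so mean(b) is NA.
d <- data.frame(
  g = factor(c("x", "y", "x")),
  a = c(0.5, 0.2, 0.9),
  b = c(0.3, NA, 0.4)
)

keep_col <- function(x) {
  if (is.numeric(x)) mean(x) > 0.1 else TRUE
}

keep <- sapply(d, keep_col)
keep   # the entry for b is NA

# Indexing columns with an NA in the logical vector raises an error;
# capture its message instead of stopping.
res <- tryCatch(d[, keep], error = function(e) conditionMessage(e))
res

# Variant that ignores NAs and drops columns whose mean is undefined.
keep_col_safe <- function(x) {
  if (!is.numeric(x)) return(TRUE)
  isTRUE(mean(x, na.rm = TRUE) > 0.1)
}
d_ok <- d[, sapply(d, keep_col_safe), drop = FALSE]
names(d_ok)
```

Running `anyNA(sapply(Data1, meanfunction01))` on the real data would show whether this applies.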
So if your only issue is changing the class of the integer variables in your data.frame but you have many columns (>10000) you may want to consider converting your data.frame into a data.table. Your code would then look like this:
library(data.table)
Data1<-data.table(Data1) #or if you have your data in csv document just use fread instead of read.csv which will automatically give you a data.table.
Then you just need to find the integer columns using this:
which(sapply(Data1,is.integer))
Putting it all together using the data.table commands:
Data1[,which(sapply(Data1,is.integer)):=lapply(.SD,as.numeric),.SDcols=which(sapply(Data1,is.integer))]
Note that you don't need to assign the result of the above line to anything: the := operator in data.table updates columns by reference, so the object is not copied, which is why this is much faster than the equivalent operation on a data.frame or tibble. Running the line therefore updates your Data1 object in place. The classes of the other, non-integer columns (i.e., factors) remain unchanged.
Please update if you have further questions but this should answer your comment. Best of luck!
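For comparison, the same integer-to-numeric conversion can be sketched in base R on a plain data frame. The data below is made up, and this version copies the affected columns, so it does not have the by-reference speed of data.table.

```r
# Made-up data with a factor, an integer and a numeric column.
d <- data.frame(
  id    = factor(c("s1", "s2", "s3")),
  count = c(1L, 5L, 9L),
  expr  = c(0.1, 0.2, 0.3)
)

# is.integer() is FALSE for factors, so they are left untouched.
int_cols <- vapply(d, is.integer, logical(1))
d[int_cols] <- lapply(d[int_cols], as.numeric)

sapply(d, class)
```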

Grouping error with lmer

I have a data frame with the following structure:
> t <- read.csv("combinedData.csv")[,1:7]
> str(t)
'data.frame': 699 obs. of 7 variables:
$ Awns : int 0 0 0 0 0 0 0 0 1 0 ...
$ Funnel : Factor w/ 213 levels "MEL001","MEL002",..: 1 1 2 2 2 3 4 4 4 4 ...
$ Plant : int 1 2 1 3 8 1 1 2 3 5 ...
$ Line : Factor w/ 8 levels "a","b","c","cA",..: 2 2 1 1 1 3 1 1 1 1 ...
$ X : int 1 2 3 4 7 8 9 10 11 12 ...
$ ID : Factor w/ 699 levels "MEL_001-1b","MEL_001-2b",..: 1 2 3 4 5 6 7 8 9 10 ...
$ BobWhite_c10082_241: int 2 2 2 2 2 2 0 2 2 0 ...
I want to construct a mixed effect model. I know in my data frame that the random effect I want to include (Funnel) is a factor, but it does not work:
> lmer(t$Awns ~ (1|t$Funnel) + t$BobWhite_c10082_241)
Error: couldn't evaluate grouping factor t$Funnel within model frame: try adding grouping factor to data frame explicitly if possible
In fact this happens whatever I want to include as a random effect e.g. Plant:
> lmer(t$Awns ~ (1|t$Plant) + t$BobWhite_c10082_241)
Error: couldn't evaluate grouping factor t$Plant within model frame: try adding grouping factor to data frame explicitly if possible
Why is R giving me this error? The only other answer I could google fu is that the random effect fed in wasn't a factor in the DF. But as str shows, df$Funnel certainly is.
It is actually not so easy to provide a convenient syntax for modeling functions and at the same time have a robust implementation. Most package authors assume that you use the data parameter and even then scoping issues can occur. Thus, strange things can happen if you specify variables with DF$col syntax since package authors rarely spend a lot of effort to make this work correctly and don't include a lot of unit tests for this.
It is therefore strongly recommended to use the data parameter if the model function offers a formula method. Strange things can happen if you don't follow that practice (also with other model functions like lm).
In your example:
lmer(Awns ~ (1|Funnel) + BobWhite_c10082_241, data = t)
This not only works, but is also more convenient to write.
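The same kind of surprise can be reproduced with lm in base R, without lme4. In this sketch with made-up data, the model fitted with `$` syntax silently ignores the new data at prediction time:

```r
set.seed(1)
d <- data.frame(x = 1:10)
d$y <- 2 * d$x + rnorm(10)

fit_dollar <- lm(d$y ~ d$x)        # variables referenced with $
fit_data   <- lm(y ~ x, data = d)  # variables looked up in `data`

new_d <- data.frame(x = c(11, 12, 13))

# The $ version cannot match `x` in new_d to the term `d$x`, so it
# warns and returns predictions for the original 10 rows instead.
p_dollar <- suppressWarnings(predict(fit_dollar, newdata = new_d))
p_data   <- predict(fit_data, newdata = new_d)

length(p_dollar)
length(p_data)
```

The warning is suppressed here only to keep the output short; in normal use it is the first hint that something is off.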

Reading a csv file using ffdf and subsetting it successfully

I have been researching a way to efficiently extract information from large csv data sets using R. Many seem to recommend the package ff. I was successful in reading the data sets but am now running into problems trying to subset them.
The largest data set contains over 650,000 rows and 1005 columns. Not all columns contain the same data types. Viewed as a dataframe, the structure would look like this:
'data.frame': 5 obs. of 1005 variables:
$ SAMPLING_EVENT_ID : Factor w/ 5 levels "S6230404","S6252242",..: 2 1 3 4 5
$ LATITUDE : num 24.4 24.5 24.5 24.5 24.5
$ LONGITUDE : num -81.9 -81.9 -82 -82 -82
$ YEAR : int 2010 2010 2010 2010 2010
$ MONTH : int 4 3 10 10 10
$ DAY : int 97 88 299 298 300
$ TIME : num 9 10 10 11.58 9.58
$ COUNTRY : Factor w/ 1 level "United_States": 1 1 1 1 1
$ STATE_PROVINCE : Factor w/ 1 level "Florida": 1 1 1 1 1
$ COUNT_TYPE : Factor w/ 2 levels "P21","P22": 2 2 1 1 1
$ EFFORT_HRS : num 6 2 7 6 3.5
$ EFFORT_DISTANCE_KM : num 48.28 8.05 0 0 0
$ EFFORT_AREA_HA : int 0 0 0 0 0
$ OBSERVER_ID : Factor w/ 3 levels "obs132426","obs58643",..: 3 2 1 1 1
$ NUMBER_OBSERVERS : Factor w/ 2 levels "?","1": 2 1 2 2 2
$ Zenaida_macroura : int 0 0 1 0 0
All other variables being similar to this last one i.e. various species of bird.
Here is the code I used to "successfully" read the csv:
B2010 <- read.table.ffdf(x = NULL, "filePath&Name", nrows = -1, first.rows = 50000, next.rows = 50000)
Trying to learn about the ffdf output, I entered commands such as dim(B2010), str(B2010), ls(B2010), etc. dim(B2010) gave the appropriate number of rows but only one column (a string per record of the values separated by commas), and ls(B2010) printed [1] "physical" "row.names" "virtual" instead of the usual list of variables.
I am not sure how to handle this type of output in order to extract, say, STATE_PROVINCE == "California". How do I tell B2010 what the variables are? I think I need to look at this differently but need some of your help to figure it out.
The ultimate goal for me is to subset a bunch of csv data sets (since I have one per year) and put the results back together as dataframe for various analysis.
Thanks,
Joe
To subset an ffdf, use the ffbase package.
As in
require(ffbase)
x <- subset(B2010, STATE_PROVINCE == "California")
I finally found the solution to getting the ffdf variable names and types properly read and accessible for subsetting:
B2010 <- read.csv.ffdf (file = "filepath/name", colClasses = c("factor", "numeric", "numeric", "integer", "integer", "integer", "numeric", rep("factor",998)), first.rows = 10000, next.rows = 50000, nrows = -1)
This took forever to read but seems to have worked, i.e. I was able to create a subset of the data. Next step: to save the subset back to a "normal" dataframe and/or to a csv.
According to the help page at ?read.table.ffdf, you should be using read.csv.ffdf(...). Then go to the page cited by Brandon.

Wrong R data type or bad data?

I'm having trouble doing simple functions on a data frame and am unsure whether it's the data type of the column, or bad data in the data frame.
I exported a SQL query into a CSV file, then loaded it into a data frame, then attached it.
df <-read.csv("~/Desktop/orders.csv")
attach(df)
When I am done, and run str(df), here is what I get:
$ AccountID: Factor w/ 18093 levels "(819947 row(s) affected)",..: 10 97 167 207 207 299 299 309 352 573 ...
$ OrderID : int 1874197767 1874197860 1874196789 1874206918 1874209100 1874207018 1874209111 1874233050 1874196791 1875081598 ...
$ OrderDate : Factor w/ 280 levels "","2010-09-24",..: 2 2 2 2 2 2 2 2 2 2 ...
$ NumofProducts : int 16 6 4 6 10 4 2 4 6 40 ...
$ OrderTotal : num 20.3 13.8 12.5 13.8 16.4 ...
$ SpecialOrder : int 1 1 1 1 1 1 1 1 1 1 ...
Trying to run the following functions, here is what I get:
> length(OrderID)
[1] 0
> min(OrderTotal)
[1] NA
> min(OrderTotal, na.rm=TRUE)
[1] 5.00
> mean(NumofProducts)
[1] NA
> mean(NumofProducts, na.rm=TRUE)
[1] 3.462902
I have two questions related to this data frame:
Do I have the right data types for the columns? Nums versus integers versus decimals.
Is there a way to review the data set to find the rows that are driving the need to use na.rm=TRUE to make the function work? I'd like to know how many there are, etc.
The difference between num and int is pretty irrelevant at this stage.
See help(is.na) for starters on NA handling. Do things like:
sum(is.na(foo))
to see how many foo's are NA values. Then things like:
df[is.na(df$foo),]
to see the rows of df where foo is NA.
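To review all columns at once, the same idea can be applied to the whole data frame. The sketch below uses made-up values in a small stand-in for the orders data (`orders` is a hypothetical name):

```r
# Made-up stand-in for the orders data.
orders <- data.frame(
  OrderID       = c(1L, 2L, 3L, 4L),
  NumofProducts = c(16L, NA, 4L, 6L),
  OrderTotal    = c(20.3, 13.8, NA, NA)
)

colSums(is.na(orders))              # number of NA values per column
orders[!complete.cases(orders), ]   # rows with at least one NA
```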
