Running `ctree` using `party` package, column as factor and not character - r

I have read "convert data.frame column format from character to factor", "Converting multiple data.table columns to factors in R" and "Convert column classes in data.table".
Unfortunately they did not solve my problem. I am working with the bodyfat dataset and my data frame is called bf. I added a column called agegrp to categorize persons of different ages as young, middle or old, thus:
bf$agegrp<-ifelse(bf$age<=40, "young", ifelse(bf$age>40 & bf$age<55,"middle", "old"))
This is the ctree analysis:
> set.seed(1234)
> modelsample<-sample(2, nrow(bf), replace=TRUE, prob=c(0.7, 0.3))
> traindata<-bf[modelsample==1, ]
> testdata<-bf[modelsample==2, ]
> predictor<-agegrp~DEXfat+waistcirc+hipcirc+kneebreadth
> bf_ctree<-ctree(predictor, data=traindata)
I got the following error:
Error in trafo(data = data, numeric_trafo = numeric_trafo, factor_trafo = factor_trafo, :
data class character is not supported
In addition: Warning message:
In storage.mode(RET@predict_trafo) <- "double" : NAs introduced by coercion
Since bf$agegrp is of class "character" I ran,
> bf$agegrp<-as.factor(bf$agegrp)
The agegrp column is now coerced to factor:
> class(bf$agegrp) gives [1] "factor".
I tried running ctree again, but it throws the same error. Does anyone know what the root cause of the problem is?

This works for me:
data("bodyfat", package = "TH.data")  # bodyfat ships with TH.data (older mboost releases also provided it)
library(party)
bf <- bodyfat
bf$agegrp <- cut(bf$age,c(0,40,55,100),labels=c("young","middle","old"))
predictor <- agegrp~DEXfat+waistcirc+hipcirc+kneebreadth
set.seed(1234)
modelsample <-sample(2, nrow(bf), replace=TRUE, prob=c(0.7, 0.3))
traindata <-bf[modelsample==1, ]
testdata <-bf[modelsample==2, ]
bf_ctree <-ctree(predictor, data=traindata)
plot(bf_ctree)
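One possible explanation for the original error, assuming the train/test split was not re-run after the conversion: traindata was copied out of bf while agegrp was still a character column, so converting bf$agegrp afterwards leaves traindata unchanged. A minimal sketch with made-up toy values (the numbers below are not from the bodyfat data):

```r
# Toy stand-in for bf; the values are invented for illustration only.
bf <- data.frame(age = c(25, 45, 60, 38),
                 DEXfat = c(20.1, 31.5, 40.2, 25.3))
bf$agegrp <- ifelse(bf$age <= 40, "young",
                    ifelse(bf$age < 55, "middle", "old"))

# The split is taken while agegrp is still character.
traindata <- bf[1:3, ]

# Converting the column in bf does not touch the earlier copy.
bf$agegrp <- as.factor(bf$agegrp)
cls_bf <- class(bf$agegrp)
cls_train_before <- class(traindata$agegrp)
print(c(bf = cls_bf, traindata = cls_train_before))

# Fix: convert before splitting, or convert the subset as well.
traindata$agegrp <- factor(traindata$agegrp,
                           levels = c("young", "middle", "old"))
print(class(traindata$agegrp))
```

Note that cut() with its default right = TRUE assigns age 55 to "middle", whereas the ifelse() version in the question assigns it to "old".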

Related

Error message using mstate::msprep ??bug?

I have run into an error using mstate::msprep to prepare my data for a fairly classical three-state problem. I can run the code from the mstate package vignette with no difficulty, and my problem is entirely parallel to the vignette example. Subjects receive an islet transplant, then may achieve insulin independence. Whether they do or not, they may have islet graft failure (or loss of insulin independence if it was achieved). The vignette example works with included covariates (retained by the keep parameter). My version works fine if I don't include the keep parameter but fails consistently if I do. Since my example works perfectly well without keep, I very much doubt that there is a problem with my main data. It must be some problem with the "keep" data. See below for the session output.
Neither data set has any missing data. I tried the vignette data limiting it to three covariates -- one categorical, one continuous, and the third with one of the event-time variables, exactly parallel to my three covariates. The vignette still works perfectly, but mine doesn’t. Both covariate "keep" lists are character vectors. In sum, I can't imagine a more parallel "real" question to the vignette example.
I have tracked the problem to a subroutine of msprep "msprepEngine" at line 85 at the second time through the processing loop, but I haven't been able to figure out what the problem is. I suspect that it is a bug, but since I can't identify it, I can't be sure.
I would be very grateful to anyone who can help me with this issue. The vignette code is available with the package. Unfortunately I am not free to share my problem's data, but as I said above, the program works perfectly without the keep parameter. There must be something about my "keep" covariates that is giving the system indigestion.
Thanks in advance for any suggestions.
Larry Hunsicker
> library(magrittr)
> library(survival)
> library(mstate)
>
> #Three state tmat:
> data(ebmt3)
> names(msbmt)
[1] "id" "from" "to" "trans" "Tstart" "Tstop" "time" "status" "dissub" "age"
[11] "prtime"
> dim(msbmt)
[1] 5577 11
> tmat <- trans.illdeath(names = c("Tx", "PR", "RelDeath"))
> covs <- c('dissub', 'age', 'drmatch', 'tcd', 'prtime')
> class(covs)
[1] "character"
> msbmt <- msprep(time = c(NA, "prtime", "rfstime"),
+ status = c(NA, "prstat", "rfsstat"),
+ data = ebmt3, trans = tmat, id = 'id', keep = covs)
>
> names(insfree3)
[1] "PatientID" "YrFree" "Free" "YrLossFail" "LossFail" "StudyID" "IEQ_kg"
> tmat3 <- trans.illdeath(names = c("Tx", "II", "LossFail"))
> IImt <- msprep(time = c(NA, 'YrFree', 'YrLossFail'),
+ status = c(NA, 'Free', 'LossFail'),
+ data = insfree3, trans = tmat3, id = 'PatientID')
>
> tmat3 <- trans.illdeath(names = c("Tx", "II", "LossFail"))
> covs <- c('StudyID', 'IEQ_kg', 'YrFree')
> class(covs)
[1] "character"
> IImt <- msprep(time = c(NA, 'YrFree', 'YrLossFail'),
+ status = c(NA, 'Free', 'LossFail'),
+ data = insfree3, trans = tmat3, id = 'PatientID', keep = covs)
Error in rep(keep[, i], tbl) : invalid 'times' argument
I found the problem, and it is a bug. I just don't know whose bug it is. msprep() works when data is a data.frame, but not when it is a tibble. My repro example:
> library(survival)
> library(mstate)
> library(dplyr)
> data(ebmt3)
> class(ebmt3)
[1] "data.frame"
> tmat <- transMat(x = list(c(2, 3), c(3), c()), names = c("Tx",
+ "PR", "RelDeath"))
> ebmt3$prtime <- ebmt3$prtime/365.25
> ebmt3$rfstime <- ebmt3$rfstime/365.25
> covs <- c("dissub", "age", "drmatch", "tcd", "prtime")
> msbmt <- msprep(time = c(NA, "prtime", "rfstime"),
+ status = c(NA, "prstat", "rfsstat"), data = ebmt3,
+ trans = tmat, keep = covs)
> ebmt3 <- as_tibble(ebmt3)
> class(ebmt3)
[1] "tbl_df" "tbl" "data.frame"
> msbmt <- msprep(time = c(NA, "prtime", "rfstime"),
+ status = c(NA, "prstat", "rfsstat"), data = ebmt3,
+ trans = tmat, keep = covs)
Error in rep(keep[, i], tbl) : invalid 'times' argument
I tracked the error down to line 157 in msprep()
ddcovs <- lapply(1:nkeep, function(i) rep(keep[, i], tbl))
When data is a data.frame, this line works. When it is a tibble, it fails with the above error message.
It was my impression that things that work with a data.frame should also work with a tibble, since a tibble is a data.frame. So I'm not sure whether this is a bug in msprep() or in the tibble code. Either way, the way to avoid the error is to make sure that the data argument passed to msprep() is a plain data.frame, not a tibble.
Larry Hunsicker
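The failing line can be reproduced without mstate. This is a minimal sketch, assuming the difference comes from single-column indexing: keep[, i] returns a plain vector for a data.frame, while a tibble keeps the result as a one-column table. That tibble behaviour is emulated here in base R with drop = FALSE, and the toy data and variable names are made up for illustration.

```r
# Toy covariate table and per-subject repeat counts (hypothetical values).
keep  <- data.frame(age = c(10, 20, 30), grp = c("a", "b", "c"))
times <- c(1L, 2L, 3L)

# data.frame behaviour: a single column drops to a vector, so rep() works.
as_vector <- keep[, 1]
expanded  <- rep(as_vector, times)
print(expanded)

# tibble-like behaviour: the column stays a one-column table (a list of
# length 1), so a 'times' vector of length 3 no longer matches.
as_table <- keep[, 1, drop = FALSE]
msg <- tryCatch(rep(as_table, times),
                error = function(e) conditionMessage(e))
print(msg)

# Workaround sketch: pass a plain data.frame to msprep(), for example
# msprep(..., data = as.data.frame(my_tibble), ...).
```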

Debug error in frame$yval2[where, 1L + nclass + 1L:nclass, drop = FALSE]: subscript out of bounds

I'm using rpart library to build a regression tree, with the following code:
skillcraft <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00272/SkillCraft1_Dataset.csv", header = T, sep =",")
skillcraft$LeagueIndex <- factor(skillcraft$LeagueIndex)
skillcraft <- skillcraft[-1]
skillcraft$Age <- as.numeric(levels(skillcraft$Age))[skillcraft$Age]
skillcraft$TotalHours <- as.numeric(
levels(skillcraft$TotalHours))[skillcraft$TotalHours]
skillcraft$HoursPerWeek <- as.numeric(
levels(skillcraft$HoursPerWeek))[skillcraft$HoursPerWeek]
skillcraft <- skillcraft[complete.cases(skillcraft),]
library(caret)
set.seed(133)
skillcraft_sampling_vector <- createDataPartition(
skillcraft$LeagueIndex, p = 0.8, list = F)
skillcraft_train <- skillcraft[skillcraft_sampling_vector,]
skillcraft_test <- skillcraft[-skillcraft_sampling_vector,]
library(rpart)
regtree <- rpart(LeagueIndex ~., data = skillcraft_train)
regtree_predictions <- predict(regtree, skillcraft_test)
The last line of this code is throwing the error:
Error in frame$yval2[where, 1L + nclass + 1L:nclass, drop = FALSE] :
subscript out of bounds
This doesn't seem very clear, but I've checked that both data frames (train and test) have the same structure and now I'm having trouble in finding a way to debug this code.
Can anyone help?
Thanks in advance!
My best guess is that the problem lies in the LeagueIndex factor. This variable was provided as ordinal data (from Bronze to Professional) and converted to a character factor "1", "2", "3", etc. up to "8".
It looks like in addition to your error with rpart, you get a warning when partitioning the data based on this factor:
In createDataPartition(skillcraft$LeagueIndex, p = 0.8, list = F) :
Some classes have no records ( 8 ) and these will be ignored
Apparently there are no records with a LeagueIndex of 8. This happens after you select complete cases here:
skillcraft <- skillcraft[complete.cases(skillcraft),]
All of the LeagueIndex == 8 cases are removed at that step, because their values for Age, HoursPerWeek, and TotalHours are "?" in the raw file and become NA when converted via as.numeric.
skillcraft[which(skillcraft$LeagueIndex == 8), c("Age", "HoursPerWeek", "TotalHours")]
Age HoursPerWeek TotalHours
3341 ? ? ?
3342 ? ? ?
3343 ? ? ?
...
Assuming you still want a factor, I believe this will work if you drop the unused factor level:
skillcraft$LeagueIndex <- droplevels(skillcraft$LeagueIndex)
Do this before partitioning the data. (You could do it on the training set only in this example, but you want the same factor levels in your test and training sets.)
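The effect of droplevels() can be seen on a small made-up data frame. This sketch only illustrates the unused-level issue and does not use the SkillCraft data; the column names are hypothetical.

```r
# Toy data: every row with league "8" has a missing value.
d <- data.frame(
  league = factor(c("1", "2", "8", "2", "1", "8")),
  hours  = c(10, 20, NA, 30, 40, NA)
)

# Removing incomplete rows drops all "8" rows, but the level remains.
d <- d[complete.cases(d), ]
levels_before <- levels(d$league)
print(table(d$league))

# droplevels() removes levels that no longer occur in the data.
d$league <- droplevels(d$league)
levels_after <- levels(d$league)
print(levels_after)
```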

Error in cor(x, use = use) : supply both 'x' and 'y' or a matrix-like 'x'

I am using the psych package.
I tried the following code:
library(psych)
str(price_per_d)
Least_appealing <-subset(zdf_base, select=c("price_per_h",
"price_per_d", "mileage", "one_way_option", "difficulties",
"vehicle_types", "parking_spot","picking_up","availability", "dirty",
"returning","refilling", "loalty_programs"))
# code from stackoverflow which I use, to get a numeric x
Least_appealing <- gsub(",", "", Least_appealing)
Least_appealing <- as.numeric(Least_appealing)
fa.parallel(Least_appealing)
I get these error messages:
> library(psych)
> str(price_per_d)
Factor w/ 1 level "Price (daily rate too high)": 1 NA 1 1 1 NA NA 1 1
NA ...
> Least_appealing <-subset(zdf_base, select=c("price_per_h",
+ "price_per_d", "mileage", "one_way_option", "difficulties",
+ "vehicle_types", "parking_spot","picking_up","availability", "dirty",
+ "returning","refilling", "loalty_programs"))
>
> Least_appealing <- gsub(",", "", Least_appealing)
> Least_appealing <- as.numeric(Least_appealing)
Warning message:
NAs introduced by coercion
>
> fa.parallel(Least_appealing)
Error in cor(x, use = use) : supply both 'x' and 'y' or a matrix-like 'x'
>
How can I conduct a factor analysis successfully?
At first I got an error message saying that my 'x' must be numeric, which is why I used the code mentioned above.
When I used this code, R told me that NAs were introduced by the conversion.
I carried on anyway and tried fa.parallel, which gives me another error message.
If you have character data intermixed with numeric data (e.g., your coding is categorical and you need to convert it to numerical), you could try using the char2numeric function before doing the fa.
E.g., with data that are a mix of categorical and numerical:
describe(data) #this will flag those variables that are categorical with an asterisk
new.data <- char2numeric(data) #this makes all numeric
fa(new.data, nfactors=3) #to get three factors
It appears that you have only one variable in your 'Least_appealing' object.
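To see why the object ends up in an unusable shape, here is a base R sketch on made-up data (the columns a and b are hypothetical). Calling gsub() on a whole data frame flattens it to one string per column, so as.numeric() then yields only NAs. Converting each column separately keeps the rows intact.

```r
# Toy data frame with factor columns, similar in spirit to survey codes.
df <- data.frame(
  a = factor(c("yes", NA, "yes")),
  b = factor(c("low", "high", "low"))
)

# gsub() coerces the data frame to a character vector: one element per column.
flat <- gsub(",", "", df)
print(length(flat))

# Per-column conversion: factors become their integer codes, rows are kept.
num <- as.data.frame(lapply(df, as.numeric))
print(num)
print(dim(num))
```

A column with a single level and NAs, like price_per_d in the question, has no variance after this conversion, so it may still need recoding before a factor analysis.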

PCA with result non-interactively in R

I would like to run a PCA in R with the ade4 package.
I have the data "PAYSAGE":
All the variables are numeric, PAYSAGE is a data frame, and there are no NAs or blanks.
But when I do :
require(ade4)
ACP<-dudi.pca(PAYSAGE)
2
I get this error message:
You can reproduce this result non-interactively with:
dudi.pca(df = PAYSAGE, scannf = FALSE, nf = NA)
Error in if (nf <= 0) nf <- 2 : missing value where TRUE/FALSE needed
In addition: Warning message:
In as.dudi(df, col.w, row.w, scannf = scannf, nf = nf, call = match.call(), :
NAs introduced by coercion
I don't understand what that means. Do you have any idea?
Thank you so much
I'd suggest sharing a data set/example others could access, if possible. This seems data-specific and with NAs introduced by coercion you may want to check the type of your input - typeof(PAYSAGE) - the manual for dudi.pca states it takes a data frame of numeric values as input.
Yes, for example :
ag_div <- c(75362,68795,78384,79087,79120,73155,58558,58444,68795,76223,50696,0,17161,0,0)
canne <- c(rep(0,10),5214,6030,0,0,0)
prairie_el<- c(60, rep(0,13),76985)
sol_nu <- c(18820,25948,13150,9903,12097,21032,35032,35504,25948,20438,12153,33096,15748,33260,44786)
urb_peu_d <- c(448,459,5575,5902,5562,458,6271,6136,459,1850,40,13871,40,13920,28669)
urb_den <- c(rep(0,12),14579,0,0)
veg_arbo <- c(2366,3327,3110,3006,3049,2632,7546,7620,3327,37100,3710,0,181,0,181)
veg_arbu <- c(18704,18526,15768,15527,15675,18886,12971,12790,18526,15975,22216,24257,30962,24001,14523)
eau <- c(rep(0,10),34747,31621,36966,32165,28054)
PAYSAGE<-data.frame(ag_div,canne,prairie_el,sol_nu,urb_peu_d,urb_den,veg_arbo,veg_arbu,eau)
require(ade4)
ACP<-dudi.pca(PAYSAGE)
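The message in the question shows nf = NA, which suggests the answer to the interactive prompt for the number of axes was not read as a number. A sketch of one way around this, assuming that is the cause: skip the prompt by passing scannf = FALSE and an explicit nf. The toy data frame below is random and only stands in for PAYSAGE, and the call is guarded in case ade4 is not installed.

```r
# Random stand-in data; replace 'toy' with PAYSAGE in a real session.
set.seed(1)
toy <- data.frame(a = rnorm(15), b = rnorm(15), c = rnorm(15))

# Check the input first: every column numeric, no missing values.
stopifnot(all(sapply(toy, is.numeric)), !anyNA(toy))

if (requireNamespace("ade4", quietly = TRUE)) {
  # scannf = FALSE skips the scree-plot prompt; nf = 2 keeps two axes.
  acp <- ade4::dudi.pca(toy, scannf = FALSE, nf = 2)
  print(acp$eig)       # eigenvalues
  print(head(acp$li))  # row coordinates on the kept axes
}
```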

Error running neural net

library(nnet)
set.seed(9850)
train1<- sample(1:155,110)
test1 <- setdiff(1:110,train1)
ideal <- class.ind(hepatitis$class)
hepatitisANN = nnet(hepatitis[train1,-20], ideal[train1,], size=10, softmax=TRUE)
j <- predict(hepatitisANN, hepatitis[test1,-20], type="class")
hepatitis[test1,]$class
table(predict(hepatitisANN, hepatitis[test1,-20], type="class"),hepatitis[test1,]$class)
confusionMatrix(hepatitis[test1,]$class, j)
Error:
Error in nnet.default(hepatitis[train1, -20], ideal[train1, ], size = 10, :
NA/NaN/Inf in foreign function call (arg 2)
In addition: Warning message:
In nnet.default(hepatitis[train1, -20], ideal[train1, ], size = 10, :
NAs introduced by coercion
The hepatitis variable holds the hepatitis dataset available from UCI.
This error message appears because you have character values in your data.
Try reading the hepatitis dataset with na.strings = "?". The "?" marker for missing values is documented in the dataset description on the UCI page.
headers <- c("Class","AGE","SEX","STEROID","ANTIVIRALS","FATIGUE","MALAISE","ANOREXIA","LIVER BIG","LIVER FIRM","SPLEEN PALPABLE","SPIDERS","ASCITES","VARICES","BILIRUBIN","ALK PHOSPHATE","SGOT","ALBUMIN","PROTIME","HISTOLOGY")
hepatitis <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/hepatitis/hepatitis.data", header = FALSE, na.strings = "?")
names(hepatitis) <- headers
library(nnet)
set.seed(9850)
train1<- sample(1:155,110)
test1 <- setdiff(1:155, train1)  # all rows not sampled for training
ideal <- class.ind(hepatitis$Class)
# will give error due to missing values
# 1st column of hepatitis dataset is the class variable
hepatitisANN <- nnet(hepatitis[train1,-1], ideal[train1,], size=10, softmax=TRUE)
This code will not give your error, but it will give an error about missing values. You will need to address those before you can continue.
Also be aware that the class variable is the first variable in the dataset as downloaded from the UCI data repository.
Edit based on comments:
The na.action only works if you use the formula notation of nnet.
So in your case:
hepatitisANN <- nnet(class.ind(Class)~., hepatitis[train1,], size=10, softmax=TRUE, na.action = na.omit)
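The effect of na.strings can be checked on a few inline rows without downloading anything. This is a small sketch with invented values in the same comma-separated style. Without na.strings, a column containing "?" is not read as numeric. With it, the column is numeric and the "?" entries become NA, which na.omit() can then remove.

```r
# Three made-up rows; "?" marks a missing value.
raw <- "2,30,?,1.0\n1,50,1,?\n2,78,2,0.7"

# Default read: columns containing "?" are read as text, not numbers.
as_text <- read.csv(text = raw, header = FALSE)
print(sapply(as_text, class))

# With na.strings, "?" becomes NA and the columns stay numeric.
with_na <- read.csv(text = raw, header = FALSE, na.strings = "?")
print(sapply(with_na, class))

# Drop incomplete rows before fitting a model that cannot handle NA.
complete <- na.omit(with_na)
print(nrow(complete))
```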

Resources