Invalid prediction for "rpart" object error in R

I am using the exact code for best-first search from page 4 of this CRAN document (https://cran.r-project.org/web/packages/FSelector/FSelector.pdf), which uses the iris dataset. It works just fine on the iris dataset, but it does not work on my own data. My data has 37 predictor variables (both numerical and categorical), with the 38th column being the Class prediction.
I'm getting the error:
Error in predict.rpart(tree, test, type = "c") :
Invalid prediction for "rpart" object
Which I think comes from this line:
error.rate = sum(test$Class != predict(tree, test, type="c")) / nrow(test)
I've tried debug() and traceback() but I'm not understanding why this error occurs (and, like I said, it's not reproducible with the iris data).
Here's some of my data so you can see what I'm working with:
> head(data)
Numeric Binary Binary.1 Categorical Binary.2 Numeric.1 Numeric.2 Numeric.3 Numeric.4 Numeric.5 Numeric.6
1 42 1 0 1 0 27.38953 38.93202 27.09122 38.15687 9.798653 18.57313
2 43 1 0 3 0 76.34071 75.18190 73.66722 72.39449 23.546124 54.29957
3 67 0 0 1 0 485.87158 287.35052 471.58863 281.55261 73.454080 389.40092
4 49 0 0 3 0 200.83924 171.77136 164.33999 137.13165 36.525225 122.74080
5 42 1 1 2 0 421.56508 243.05138 388.66823 221.17644 57.803488 285.72923
6 48 1 1 2 0 69.48605 68.86291 67.57764 66.68408 16.661986 43.27868
Numeric.7 Numeric.8 Numeric.9 Numeric.10 Numeric.11 Numeric.12 Numeric.13 Numeric.14 Numeric.15 Numeric.16
1 1.9410 1.6244 1.4063 3.761285 11.07121 12.00510 1.631108 2.061702 0.7911462 1.0196401
2 2.7874 2.4975 1.8621 4.519124 18.09848 15.46028 2.069787 2.650712 0.7808421 0.9650938
3 4.9782 4.5829 4.0747 10.165202 24.66558 18.26303 2.266640 3.504340 0.6468095 1.8816444
4 3.4169 3.0646 2.7983 7.275817 15.15534 13.93672 2.085589 2.309878 0.9028999 1.6726948
5 5.2302 3.7912 3.4401 7.123413 59.64406 28.71171 3.311343 5.645815 0.5865128 0.8572746
6 2.9730 2.2918 1.5164 4.541603 26.81567 18.67885 2.637904 3.523510 0.7486581 0.7908798
Numeric.17 Numeric.18 Numeric.19 Numeric.20 Categorical.1 Numeric.21 Numeric.22 Numeric.23 Numeric.24
1 2.145868 1.752803 64.91618 41.645192 1 9.703708 1.116614 0.09654643 4.0075897
2 2.336676 1.933997 19.93420 11.824950 3 31.512059 1.360054 0.03559176 0.5806225
3 5.473179 1.857276 44.22981 33.698516 1 8.498998 1.067967 0.04122081 0.7760942
4 3.394066 2.143688 10.61420 29.636776 3 39.734071 1.549718 0.04577881 0.3102006
5 1.744118 4.084250 38.28577 87.214615 2 59.519129 2.132184 0.16334461 0.3529899
6 1.124962 4.037118 58.37065 3.894945 2 64.895248 2.190225 0.13461692 0.2672686
Numeric.25 Numeric.26 Numeric.27 Numeric.28 Numeric.29 Numeric.30 Numeric.31 Class
1 0.065523088 1.012919 1.331637 0.18721221 645.60854 144.49088 20.356321 FALSE
2 0.030128214 1.182271 1.633734 0.10035377 206.18575 142.63844 24.376264 FALSE
3 0.005638842 0.802835 1.172351 0.07512149 81.98983 91.44951 18.949937 FALSE
4 0.061873262 1.323395 1.733104 0.12725994 51.14379 113.19654 28.529134 FALSE
5 0.925931194 1.646710 3.096853 0.39408020 151.65062 103.64733 6.769099 FALSE
6 0.548181302 1.767779 2.547693 0.34173633 46.10354 111.04652 9.658817 FALSE

I'm not very familiar with the rpart package yet, so I might be wrong, but this works for me:
Try using type = "vector" instead of type = "c". Your variable Class is logical, so rpart() should have generated a regression tree, not a classification tree. The documentation of predict.rpart states that the types "class" and "prob" are only meant for classification trees.
With the following code you can get your predicted classes:
# numeric predictions from the regression tree, thresholded into classes
your_threshold <- 0.5
predicted_classes <- predict(tree, test, type = "vector") >= your_threshold
Alternatively you can factorize your variable Class before training the tree. rpart will then build a classification tree:
data$Class <- factor(data$Class)
tree <- rpart(Class ~ ., data)
predicted_classes <- predict(tree, test, type = "class") # or type = "c" if you prefer
Your choice ;) Hope that helps!
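In case it helps to see why iris works but your own data doesn't, here is a minimal sketch (my own adaptation, using iris with an artificial logical target) that reproduces the error and the fix:
library(rpart)
d <- iris
d$Class <- d$Species == "setosa"   # logical target, like in your data
d$Species <- NULL
tree_reg <- rpart(Class ~ ., d)    # logical response -> regression tree ("anova")
## predict(tree_reg, d, type = "c")  # errors: Invalid prediction for "rpart" object
d$Class <- factor(d$Class)         # factorize the target instead
tree_cls <- rpart(Class ~ ., d)    # factor response -> classification tree
head(predict(tree_cls, d, type = "class"))  # works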

Related

blueprint$forge$clean error "attempt to apply non-function"

I've trained a naive Bayes model with R using the tidymodels framework.
The whole model is saved in an rds file. Here's a snippet of the content of that model (the whole model has 181 or more such tables so I can't post it here):
══ Workflow [trained] ══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: naive_Bayes()
── Preprocessor ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
0 Recipe Steps
── Model ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
$apriori
grouping
A B C D E F
0.1666667 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667
$tables
$tables$var1
var
grouping 1 2 3 4
1 0.3173302108 0.3728337237 0.2304449649 0.0793911007
2 0.2104513064 0.3885985748 0.2923990499 0.1085510689
3 0.2561613144 0.5481702763 0.1784914115 0.0171769978
4 0.0038167939 0.1059160305 0.5477099237 0.3425572519
5 0.0009017133 0.0324616772 0.3841298467 0.5825067628
6 0.1474328780 0.4399434762 0.3655204899 0.0471031559
$tables$var2
var
grouping 1 2 3 4
1 0.2215456674 0.3592505855 0.2777517564 0.1414519906
2 0.1532066508 0.3446555819 0.3225653207 0.1795724466
3 0.1762509335 0.4458551158 0.3330843913 0.0448095594
4 0.0009541985 0.0324427481 0.4208015267 0.5458015267
5 0.0009017133 0.0189359784 0.2957619477 0.6844003607
6 0.1427225624 0.4371172869 0.3546867640 0.0654733867
$tables$var3
var
grouping 1 2 3 4 5
1 0.7679700304 0.1992507609 0.0320767970 0.0004682744 0.0002341372
2 0.3680835906 0.3526478271 0.2526715744 0.0256471147 0.0009498931
3 0.0432835821 0.2328358209 0.5201492537 0.1694029851 0.0343283582
4 0.0514775977 0.2278360343 0.4642516683 0.1954242135 0.0610104862
5 0.0117117117 0.0702702703 0.3144144144 0.3486486486 0.2549549550
6 0.0150659134 0.1012241055 0.4077212806 0.3436911488 0.1322975518
$tables$var4
var
grouping 1 2 3 4 5
1 0.6518379771 0.3289627722 0.0187309764 0.0002341372 0.0002341372
2 0.1260983139 0.2125385894 0.5079553550 0.1184991688 0.0349085728
3 0.3089552239 0.4783582090 0.2059701493 0.0037313433 0.0029850746
4 0.3441372736 0.4718779790 0.1811248808 0.0019065777 0.0009532888
5 0.0270270270 0.0360360360 0.3432432432 0.3612612613 0.2324324324
6 0.0127118644 0.0555555556 0.4119585687 0.3672316384 0.1525423729
I read that file into R, which works fine, and then I want to use that model to predict some values of a new data set with:
model <- readRDS(file.choose())
new_pred <- predict(model,
                    dat_new,
                    type = "prob")
For me, personally, this runs just fine. But when I send it to a client of mine, they get the following error:
Error in blueprint$forge$clean(blueprint = blueprint, new_data = new_data, :
attempt to apply non-function)
I know that with so little information it is very difficult to figure out what's going on, but I'm still trying. Maybe the tidymodels experts here know where such an error might come from.
Any ideas?
Update to show how the model is created:
library(tidymodels)
library(discrim)

model_recipe <- recipe(outcome_var ~ ., data = dat_train)

model_final <- naive_Bayes(Laplace = 1) |>
  set_mode("classification") |>
  set_engine("klaR", prior = rep(1/6, 6))

model_final_wf <- workflow() |>
  add_recipe(model_recipe) |>
  add_model(model_final)

full_fit <- model_final_wf |>
  fit(data = dat_full)

saveRDS(full_fit, file = "my_model.rds")
You are getting this error because your client is using too old a version of {hardhat}.
Version 1.1.0 of {hardhat} changed a lot of its internals. Among other things, the $clean object is no longer present, which is what causes the error you are seeing.
The recommended course of action is for both of you to use the same version of {hardhat}, preferably the most recent one, which at the time of writing is 1.2.0.
Additionally: when sharing models like this, it is recommended that you also pass along the package versions to make sure everything is in sync, for example with renv, or by using a more dedicated model-deployment tool such as vetiver.
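One way to keep the versions aligned is renv; here is a minimal sketch (project setup details are up to you):
## on the machine that trains and saves the model:
# install.packages("renv")
renv::init()      # set up a project-local library
renv::snapshot()  # writes renv.lock, recording e.g. the hardhat version
## on the client's machine, inside a copy of the same project:
renv::restore()             # installs the exact versions from renv.lock
packageVersion("hardhat")   # verify both sides now match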

lcmm::predictClass with l-spline link function

I am getting an error message trying to predict class membership in lcmm::predictClass(). This seems to be due to using a spline-based link function, as exemplified below. The lcmm::predictClass() function works okay for the default link function.
The following shows 1) a reproducible example giving the error message, and 2) a working example with the same broad approach.
## define initialisation values for a quick result here
BB <- c(-19.064, 21.718, -1.192, -1.295, -1.205, -0.281,  0.110,
        -0.232,  1.339, -1.007,  1.019, -9.395,  1.702,  2.030,
         2.089,  1.352, -9.369,  1.220,  1.532,  2.481,  1.223)
library(lcmm)
m2c <- multlcmm(Ydep1 + Ydep2 ~ 1 + Time * X2,
                random = ~1 + Time,
                subject = "ID",
                link = "3-quant-splines",
                ng = 2,
                mixture = ~1 + Time,
                classmb = ~1 + X1,
                data = data_lcmm,
                B = BB)
## converges in 3 iterations
## define the prediction cases
library(dplyr)
X <- data_lcmm %>%
  filter(ID %in% sample(ID, 10)) %>%   ## 10 random IDs
  select(ID, Ydep1, Ydep2, Time, X1, X2)
## find predicted class memberships
predictClass(m2c, newdata = X)
## Error in multlcmm(fixed = Ydep1 + Ydep2 ~ 1 + Time * X2, mixture = ~1 + :
## Length of vector range is not correct.
On the other hand, a similar approach with a linear link function gives the following. Note that these models are based on the example on the ?multlcmm help page.
library(lcmm)
m2 <- multlcmm(Ydep1 + Ydep2 ~ 1 + Time * X2,
               random = ~1 + Time,
               subject = "ID",
               link = "linear",
               ng = 2,
               mixture = ~1 + Time,
               classmb = ~1 + X1,
               data = data_lcmm,
               B = c(18, -20.77, 1.16, -1.41, -1.39, -0.32, 0.16,
                     -0.26, 1.69, 1.12, 1.1, 10.8, 1.24, 24.88, 1.89))
## converges in 2 iterations
library(dplyr)
X <- data_lcmm %>%
  filter(ID %in% sample(ID, 10)) %>%
  select(ID, Ydep1, Ydep2, Time, X1, X2)
predictClass(m2, newdata = X)
## ID class prob1 prob2
## 1 21 2 0.031948951 9.680510e-01
## 2 25 2 0.042938984 9.570610e-01
## 3 33 2 0.026053178 9.739468e-01
## 4 46 1 0.999999964 3.597409e-08
## 5 50 2 0.066291287 9.337087e-01
## 6 74 2 0.005630593 9.943694e-01
## 7 120 2 0.024787290 9.752127e-01
## 8 171 2 0.053499974 9.465000e-01
## 9 229 1 0.999999996 4.368222e-09
##10 235 2 0.008173507 9.918265e-01
## ...or similar
The other prediction functions predictL() and predictY() seem to work okay. predictRE() throws the same error message.
I will also email the package maintainer.
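In the meantime, a possible stopgap (a lookup, not a fix): if the subjects you want are part of the estimation data, the fitted model already stores their posterior class-membership probabilities in its pprob element, so you can read them off without calling predictClass():
head(m2c$pprob)                  # columns: ID, class, prob1, prob2
subset(m2c$pprob, ID %in% X$ID)  # the 10 sampled IDs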

How do I generate the archetypes of a new dataset from the GLRM predict function?

I have used these sites as references, and though they have been resourceful, I'm unable to regenerate the reduced dimensions of new datasets via the GLRM predict function:
https://bradleyboehmke.github.io/HOML/GLRM.html
https://github.com/h2oai/h2o-tutorials/blob/master/best-practices/glrm/GLRM-BestPractices.Rmd
I work in the sparklyr environment with H2O. I'm keen to use the GLRM function to reduce dimensions for clustering. Though I am able to access the PCs or archetypes from the model, I would like to generate the archetypes from the GLRM predict function on new datasets.
Appreciate your help.
Here is the training of the GLRM model on the training dataset:
glrm_model <-
  h2o.glrm(
    training_frame = train,
    cols = glrm_cols,
    loss = "Absolute",
    model_id = "rank2",
    seed = 1234,
    k = 5,
    transform = "STANDARDIZE",
    loss_by_col_idx = losses$index,
    loss_by_col = losses$loss
  )
# Decompose the training frame into X and Y
X <- h2o.getFrame(glrm_model@model$representation_name)  # as an H2O frame
The archetypes (the X matrix) from the training dataset:
X
Arch1 Arch2 Arch3 Arch4 Arch5
1 0.10141381 0.10958071 0.26773514 0.11584502 0.02865024
2 0.11471676 0.06489475 0.01407714 0.24536782 0.10223535
3 0.08848878 0.26742082 0.04915022 0.11693702 0.03530641
4 -0.03062604 0.29793032 -0.07003814 0.01927975 0.52451867
5 0.09497268 0.12915324 0.21392107 0.08574152 0.03750636
6 0.05857743 0.18863508 0.14570623 0.08695144 0.03448957
But when I use the trained GLRM model on a new dataset to regenerate these archetypes, I get the full set of reconstructed dimensions instead of the archetypes as per X above.
I'm using these archetypes as features for clustering purposes.
# Generate predictions on a validation set (if necessary):
glrm_pred <- h2o.predict(glrm_model, newdata = test)
glrm_pred
reconstr_price reconstr_bedrooms reconstr_bathrooms reconstr_sqft_living reconstr_sqft_lot reconstr_floors reconstr_waterfront reconstr_view reconstr_condition reconstr_grade reconstr_sqft_above reconstr_sqft_basement reconstr_yr_built reconstr_yr_renovated
1 -0.8562455 -1.03334892 -1.9903167 -1.3950774 -0.2025564 -1.6537486 0 4 5 13 -1.20187061 -0.6584413 -1.25146116 -0.3042907
2 -0.7940549 -0.29723926 -0.7863867 -0.4364751 -0.1666500 -0.8527297 0 4 5 13 -0.13831432 -0.6545514 0.54821146 -0.3622902
3 -0.7499614 -0.18296317 0.1970824 -0.3989486 -0.1532677 0.4914559 0 4 5 13 -0.09100889 -0.6614534 1.38779632 -0.1844416
4 -1.0941432 0.08954988 0.7872987 -0.2087964 -0.1599888 0.8254916 0 4 5 13 0.11973488 -0.6623575 2.70176558 -0.2363486
5 0.3727360 0.82848389 0.4965246 1.1134378 -0.9013011 -1.3388791 0 4 5 13 0.08427185 2.1354440 -0.07213625 -1.2213866
6 -0.4042458 -0.59876839 -0.9685556 -0.7093578 -0.1745297 -0.5061798 0 4 5 13 -0.43503836 -0.6628391 -0.55165408 -0.2207544
reconstr_lat reconstr_long reconstr_sqft_living15 reconstr_sqft_lot15
1 -0.07307503 -0.4258959 -1.0132778 -0.1964471
2 -0.52124543 0.7283153 0.1242903 -0.1295341
3 -0.56113519 0.6011221 -0.1616215 -0.1624136
4 -0.99759090 1.3032420 0.1556193 -0.1569607
5 0.70028433 -0.6436112 1.1400189 -0.9272790
6 -0.02222403 -0.2257382 -0.4859787 -0.1817499
[6416 rows x 18 columns]
Thank you!
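A note in case it points you in the right direction: h2o.predict() on a GLRM model returns the reconstruction (X %*% Y) in the original feature space, which is why you get 18 reconstr_* columns back instead of 5 archetypes. Recent h2o releases expose a dedicated transform that returns the X (archetype) representation of a new frame; a sketch, assuming your h2o version ships h2o.transform_frame():
# transform the new frame into archetype space with the fitted GLRM
glrm_test_x <- h2o.transform_frame(glrm_model, test)
glrm_test_x  # Arch1 ... Arch5, one row per row of `test`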

comp() returns ranks instead of p-values

I was given an example in which comp() should return p-values, but it ends up returning ranks, so let me ask:
Why is comp() function from survmisc package returning ranks instead of p-values?
Is there a way to change it?
test_drug <- survfit(Surv(N_Days, Cens) ~ Drug, data = df)
comp(ten(test_drug), p = c(0, 1, 1, 0.5, 0.5), q = c(1, 0, 1, 0.5, 2))
output:
Q Var Z pNorm
1 3.3457e+00 2.7643e+01 0.63634 4
n 3.2000e+02 1.0304e+06 0.31524 10
sqrtN 3.4634e+01 4.8218e+03 0.49877 9
S1 2.1524e+00 1.6867e+01 0.52410 7
S2 2.1294e+00 1.6650e+01 0.52185 8
FH_p=0_q=1 1.1647e+00 2.2356e+00 0.77898 3
FH_p=1_q=0 2.1809e+00 1.7056e+01 0.52809 6
FH_p=1_q=1 8.4412e-01 7.9005e-01 0.94968 1
FH_p=0.5_q=0.5 1.6895e+00 4.1759e+00 0.82678 2
FH_p=0.5_q=2 2.7491e-01 2.2027e-01 0.58575 5
maxAbsZ Var Q pSupBr
1 5.8550e+00 2.7643e+01 1.11361 5
n 9.7000e+02 1.0304e+06 0.95556 6
sqrtN 6.3636e+01 4.8218e+03 0.91643 7
S1 3.5891e+00 1.6867e+01 0.87391 9
S2 3.5737e+00 1.6650e+01 0.87581 8
FH_p=0_q=1 2.2539e+00 2.2356e+00 1.50743 2
FH_p=1_q=0 3.6025e+00 1.7056e+01 0.87230 10
FH_p=1_q=1 1.4726e+00 7.9005e-01 1.65678 1
FH_p=0.5_q=0.5 2.9457e+00 4.1759e+00 1.44148 3
FH_p=0.5_q=2 6.3430e-01 2.2027e-01 1.35150 4
According to the topic here:
https://github.com/dardisco/survMisc/issues/21
and information that I got from the professor who solved the problem earlier, this is an issue with the R version, and an update to the function itself is required from the authors or contributors.
It can be worked around using the attr() function with the 'tft' attribute, which stands for test for trend. Code example here:
test_bilirubin <- survfit(Surv(N_Days, Cens) ~ Bilirubin_cat, data = df)
b <- ten(test_bilirubin)
comp(b, p = c(0, 1, 1, 0.5, 0.5), q = c(1, 0, 1, 0.5, 2))
## comp() stores its results as attributes of the ten object:
## "lrt" - the log-rank family of tests
## "sup" - the Renyi (supremum) tests
## "tft" - the tests for trend
d <- attr(b, "tft")
## test statistics alongside the chi-square p-values
cbind(d$tft$W, round(d$tft$pChisq, 4))

Class probabilities in Neural networks

I use the caret package with a multi-layer perceptron.
My dataset consists of a labelled output value, which can be either A, B, or C. The input vector consists of 4 variables.
I use the following lines of code to calculate the class probabilities for each input value:
fit <- train(device ~ ., data = dataframetrain[1:100, ], method = "mlp",
             trControl = trainControl(classProbs = TRUE))
(p <- predict(fit, newdata = dataframetest, type = "prob"))
I thought that the class probabilities for each record must sum up to one. But I get the following:
rowSums(p)
# 1 2 3 4 5 6 7 8
# 1.015291 1.015265 1.015291 1.015291 1.015291 1.014933 1.015011 1.015291
# 9 10 11 12 13 14 15 16
# 1.014933 1.015206 1.015291 1.015291 1.015291 1.015224 1.015011 1.015291
Can anybody help me? I don't know what I did wrong.
There's probably nothing wrong; it just seems that caret returns the values of the neurons in the output layer without converting them to probabilities (correct me if I'm wrong). When using the RSNNS::mlp function outside of caret, the rows of the predictions also don't sum to one.
Since all output neurons have the same activation function, the outputs can be converted to probabilities by dividing the predictions by the respective row sums; see this question.
This behavior seems to hold when using method = "mlp" or method = "mlpWeightDecay", but when using method = "nnet" the predictions do sum to one.
Example:
library(RSNNS)
data(iris)
# shuffle the rows
iris <- iris[sample(nrow(iris)), ]
irisValues <- iris[,1:4]
irisTargets <- iris[,5]
irisTargetsDecoded <- decodeClassLabels(irisTargets)
iris2 <- splitForTrainingAndTest(irisValues, irisTargetsDecoded, ratio=0.15)
iris2 <- normTrainingAndTestSet(iris2)
set.seed(432)
model <- mlp(iris2$inputsTrain, iris2$targetsTrain,
             size = 5, learnFuncParams = c(0.1), maxit = 50,
             inputsTest = iris2$inputsTest, targetsTest = iris2$targetsTest)
predictions <- predict(model,iris2$inputsTest)
head(rowSums(predictions))
# 139 26 17 104 54 82
# 1.0227419 1.0770722 1.0642565 1.0764587 0.9952268 0.9988647
probs <- predictions / rowSums(predictions)
head(rowSums(probs))
# 139 26 17 104 54 82
# 1 1 1 1 1 1
# nnet example --------------------------------------
library(caret)
training <- sample(seq_along(irisTargets), size = 100, replace = FALSE)
modelCaret <- train(y = irisTargets[training],
                    x = irisValues[training, ],
                    method = "nnet")
predictionsCaret <- predict(modelCaret,
                            newdata = irisValues[-training, ],
                            type = "prob")
head(rowSums(predictionsCaret))
# 122 100 89 134 30 86
# 1 1 1 1 1 1
I don't know how much flexibility the caret package offers in these choices, but the standard way to make a neural net produce outputs which sum to one is to use the softmax function as the activation function in the output layer.
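For reference, a minimal sketch of softmax applied post hoc to the raw outputs (names are illustrative; subtracting the row max keeps exp() from overflowing):
# softmax over each row of a matrix of raw network outputs
softmax <- function(x) {
  z <- exp(x - apply(x, 1, max))  # stabilize, then exponentiate
  z / rowSums(z)                  # each row now sums to one
}
probs2 <- softmax(as.matrix(predictions))
head(rowSums(probs2))  # all 1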
