How do I convert an "RWeka" decision tree into a "party" tree in R? - r

I am using the RWeka package in R to fit M5' trees to a dataset using "M5P". I then want to convert the tree generated into a "party" tree so that I can access variable importances. The issue I am having is that I can't seem to get the function as.party to work without getting the following error:
"Error: all(sapply(split, head, 1) %in% c("<=", ">")) is not TRUE"
This error only arises when I apply the function within a for loop, but the for loop is necessary as I am running 5-fold cross validation.
Below is the code I have been running:
library(RWeka)     # provides M5P
library(partykit)  # provides as.party

n <- nrow(data)
k <- 5
indCV <- sample(rep(1:k, each = ceiling(n/k)), n)
for(i in 1:k){
  # Training data: all observations where indCV is not equal to i
  training_data <- data.frame(x[-which(indCV==i),])
  training_response <- y[-which(indCV==i)]
  # Test data: the fifth of the data where the observation indices are equal to i
  test_data <- x[which(indCV==i),]
  test_response <- y[which(indCV==i)]
  # Fit an M5P model to the training data
  fit <- M5P(training_response~., data=training_data, control=Weka_control(N=TRUE))
  # Convert to party
  p <- as.party(fit)
}

The RWeka package itself ships an example for converting M5P trees into party objects: if you run example("M5P", package = "RWeka"), the tree visualizations are actually drawn by partykit. After running the examples, see plot(m3) and as.party(m3).
However, while for J48 you can get a fully fledged constparty object, the same is not true for M5P. In the latter case, the tree structure itself can be converted to party, but the linear models within the nodes are not completely straightforward to convert into lm objects. Thus, if you want to use the party representation to compute measures that only depend on the tree structure (e.g., variables used for splitting, number of splits, split points, etc.) then you can do so. But if you want to compute measures that depend on the models or the predictions (e.g., mean squared errors etc.) then the party class won't be of much help.
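For instance, a rough sketch of pulling structure-only measures out of the converted tree with partykit accessors (pfit is just an illustrative name, and this assumes the party object carries its model frame):
library(partykit)
pfit <- as.party(fit)
width(pfit)   # number of terminal nodes
depth(pfit)   # depth of the tree
# variables actually used for splitting in the inner nodes
inner <- setdiff(nodeids(pfit), nodeids(pfit, terminal = TRUE))
split_vars <- unique(unlist(nodeapply(pfit, ids = inner, FUN = function(n)
  names(pfit$data)[varid_split(split_node(n))])))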

Plot an envelope for an mppm object in spatstat

My question is closely related to this previous one: Simulation-based hypothesis testing on spatial point pattern hyperframes using "envelope" function in spatstat
I have obtained an mppm object by fitting a model on several independent datasets using the mppm function from the R package spatstat. How can I study its envelope to compare it to my observations?
I fitted my model as such:
data <- listof(NMJ1,NMJ2,NMJ3)
data <- hyperframe(X=1:3, Points=data)
model <- mppm(Points ~ marks * sqrt(x^2 + y^2), data)
where NMJ1, NMJ2, and NMJ3 are marked ppp objects and independent realizations of the same experiment.
However, the envelope function does not accept inputs of type mppm:
> envelope(model, Kcross.inhom, nsim=10)
Error in UseMethod("envelope") :
no applicable method for 'envelope' applied to an object of class "c('mppm', 'list')"
The answer provided to the previously mentioned question indicates how to plot global envelopes for each pattern and how to use the product rule for multiple testing. However, my fitted model implies that my 3 ppp objects are statistically equivalent and are independent realizations of the same experiment (i.e., no different covariates between them). I would thus like to obtain one single plot comparing my fitted model to my 3 datasets. The following code:
gamma <- 1 - 0.95^(1/3)
nsims <- round(1/gamma - 1)
sims <- simulate(model, nsim=2*nsims)
SIMS <- list()
for(i in 1:nrow(sims)) SIMS[[i]] <- as.solist(sims[i,,drop=TRUE])
Hplus <- cbind(data, hyperframe(Sims=SIMS))
EE1 <- with(Hplus, envelope(Points, Kcross.inhom, nsim=nsims, simulate=Sims))
pool(EE1[1],EE1[2],EE1[3])
leads to the following error:
Error in pool.envelope(`1` = list(r = c(0, 0.78125, 1.5625, 2.34375, 3.125, :
Arguments 2 and 3 do not belong to the class “envelope”
Wrong type of subset index. Use
pool(EE1[[1]], EE1[[2]], EE1[[3]])
or just
pool(EE1)
Either of these would have given an error message saying that the envelope commands should have been called with savefuns=TRUE, so you just need to change that step as well.
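Putting both corrections together, a minimal sketch reusing the objects from the question:
EE1 <- with(Hplus, envelope(Points, Kcross.inhom, nsim = nsims,
                            simulate = Sims, savefuns = TRUE))
pool(EE1)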
However, statistically this procedure makes little sense. You have already fitted a model, which allows for rigorous statistical inference using anova.mppm and other tools. Instead of this, you are generating simulated data from the fitted model and performing a Monte Carlo test, with all the attendant issues of multiple testing and low power. There are additional problems with this approach - for example, even if the model is "the same" for each row of the hyperframe, the patterns are not statistically equivalent unless the windows of the point patterns are identical, and so on.
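As a sketch of the model-based inference mentioned above (the reduced model model0 is hypothetical, added only to illustrate a nested comparison with anova.mppm):
model0 <- mppm(Points ~ marks, data)
anova(model0, model, test = "Chi")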

How to convert random forest prediction probabilities to a single classified response?

I have many large random forest classification models (~60 min run time each) that are used for prediction of a raster using the type="prob" option. I am happy with the raster output (probabilities for each of x classes as a raster stack). However, I would like a simple way to convert these probabilities (a raster stack with x layers, where x is the number of classes) to a simple one-layer classification (i.e. winners only, no probabilities). This would be the equivalent of type="response".
Here is a simple example (which is not a raster, but still applies):
library(randomForest)
data(iris)
set.seed(111)
ind <- sample(2, nrow(iris), replace = TRUE, prob=c(0.8, 0.2))
iris.rf <- randomForest(Species ~ ., data=iris[ind == 1,])
iris.prob <- predict(iris.rf, type="prob")
iris.resp <- predict(iris.rf, type="response")
What is the most efficient way to use the iris.prob object to get the equivalent output of iris.resp without re-running the random forests (which, in my case with many large rasters, would take too many hours)?
Thanks in advance
If you are trying to determine the maximum of multiple columns, in the same general format as iris.prob, you can find the maximum of each row and return the corresponding column name.
colnames(iris.prob)[max.col(iris.prob,ties.method="first")]
I got the exact usage from another thread, so if this isn't working you might try another response.
iris.prob should contain the classification result, with the probability that each observation is classified in each category. So you just need to extract the column name of the maximum value of each row.
E.g.:
iris.resp2 = colnames(iris.prob)[apply(iris.prob,1,which.max)]
iris.resp2 == as.character(iris.resp) should return TRUE every time.
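For the raster case described in the question, a rough (untested) sketch of the same idea, assuming prob_stack is the RasterStack of class probabilities returned by predict():
library(raster)
# per-cell index of the layer (class) with the highest probability
class_layer <- calc(prob_stack, fun = which.max)
# names(prob_stack) maps those integer indices back to the class labels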

column names - xgboost predict on new data

I have never productionised an xgboost model and am concerned about how to handle predictions on fresh data, specifically when the column names do not match the trained model's sparse matrix column names, either because new columns have been added or because certain columns drop out when the fresh data is converted to a sparse matrix.
What if I attempt to predict an xgboost model on new data with extra or some missing column names? I see this definitely occurring and would like to create code to account for it so that predictions are correct. I would prefer to avoid hacking together a solution if more elegant ones already exist.
So, specifically, what happens if the new data's sparse matrix has different column names?
My best guess is to factorise the new data (with levels based on the training data's levels), create the sparse matrix, and then remove the columns that do not match between the trained dataset and the new data.
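For illustration, here is a rough sketch of that idea; the helper align_columns is hypothetical, and train_cols would be the column names of the trained model's sparse matrix:
library(Matrix)
align_columns <- function(newmat, train_cols) {
  # add all-zero columns for levels seen in training but absent from the new data
  missing_cols <- setdiff(train_cols, colnames(newmat))
  if (length(missing_cols) > 0) {
    pad <- Matrix(0, nrow = nrow(newmat), ncol = length(missing_cols),
                  sparse = TRUE, dimnames = list(NULL, missing_cols))
    newmat <- cbind(newmat, pad)
  }
  # drop columns unseen in training and restore the training column order
  newmat[, train_cols, drop = FALSE]
}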
I have created dummy data (in below code) as an example of prediction errors given different column names.
1st step = build model (just for illustrative purposes I know it's a bad build)
2nd step = resample entire dataset then predict (= no problems. Predictions match)
3rd step = only select from 10% of data then predict - this gets prediction errors due to different column names.
Here's the code:
Step 1 create dummy data and create a lazy xgboost model just for illustrative purposes.
library(xgboost) # for xgboost algo
library(Matrix) # for sparse matrix
### Create dummy data
num_rows <- 100
set.seed(1234)
target <- runif(num_rows)
dummy_data <- data.frame(
LETTER_SINGLE=sample(LETTERS,num_rows,replace=TRUE),
DOUBLE_LETTER=paste(sample(LETTERS,num_rows,replace=TRUE),sample(LETTERS,num_rows,replace=TRUE),sep=""),
TRIPLE_LETTER=paste(sample(LETTERS,num_rows,replace=TRUE),sample(LETTERS,num_rows,replace=TRUE),sample(LETTERS,num_rows,replace=TRUE),sep=""),
stringsAsFactors=FALSE
)
## STEP 1 CREATE XGBOOST MODEL AND GET PREDICTED VALUES TO COMPARE WITH FUTURE DATA CUTS.
model_data_01 <- dummy_data
target_01 <- target
# create matrix
model_01_sparse <- sparse.model.matrix(~ .-1, data = model_data_01)
# colnames model 1
colnames_trained_model <- colnames(model_01_sparse)
# train a model
xgb_fit_01 <-
xgboost(data = model_01_sparse,
label = target_01,
#param = best_param,
nrounds=100,
verbose = T
)
pred_01 <- predict(xgb_fit_01,newdata=model_01_sparse)
Step 2. Test to see if the order of observations causes differences in predictions. Spoiler - no prediction errors occur.
## STEP 2 CREATE SHUFFLED DATA (SAME DATA SAMPLES BUT SHUFFLED) THEN PREDICT AND COMPARE.
sample_order <- sample(1:num_rows)
model_data_shuffled <- dummy_data[sample_order,]
target_shuffled <- target[sample_order]
# They are different
head(model_data_01)
head(model_data_shuffled)
# create matrix
model_shuffled_sparse <- sparse.model.matrix(~ .-1, data = model_data_shuffled)
# colnames model 1
colnames_shuffled <- colnames(model_shuffled_sparse)
pred_shuffled <- predict(xgb_fit_01,newdata=model_shuffled_sparse)
# check if predictions differ
pred_01[sample_order] - pred_shuffled
## This matched. Yay. sparse.model.matrix function must first sort alphabetically then create column names.
# due to same column names
mean(colnames_trained_model == colnames_shuffled)
Step 3. Only sample a select few rows and predict to see whether missing columns - in sparse matrix - cause prediction errors.
## STEP 2 WORKED FINE SO ONTO...
## STEP 3 RANDOMLY SAMPLE ONLY A HANDFUL OF ROWS PREDICT AND COMPARE.
sample_order_02 <- sample(1:(num_rows*0.1))
model_data_shuffled_02 <- dummy_data[sample_order_02,]
target_shuffled_02 <- target[sample_order_02]
# create matrix
model_shuffled_sparse_02 <- sparse.model.matrix(~ .-1, data = model_data_shuffled_02)
# colnames model 1
colnames_shuffled_02 <- colnames(model_shuffled_sparse_02)
pred_shuffled_02 <- predict(xgb_fit_01,newdata=model_shuffled_sparse_02)
# check if predictions differ
pred_01[sample_order_02] - pred_shuffled_02
## This did not match. Damn.
# Due to different column names
colnames_trained_model
colnames_shuffled_02
mean(colnames_trained_model == colnames_shuffled_02)
As you can see, this last attempt gets differences in the predicted values due solely to missing column names in the sparse matrix.
I don't want to hack an ugly solution together if an elegant one exists for me to learn from.
So my question is... Is there an elegant way to force the sparse model matrix column names to match those of the built model (the one used for predictions on new data)?
I have searched the web and no luck thus far finding any best practices solution.
If anybody could help by answering the Question or pointing me in the right direction that would be much appreciated.
What is your production environment? R, Python, Java or something else?
The idea is to use XGBoost functionality (both training and prediction) via a production environment-specific wrapper library, not directly. For example, in Python, you could use the Scikit-Learn wrappers, which encapsulate feature engineering and feature selection tasks into a reusable sklearn.pipeline.Pipeline object. You would 1) fit the pipeline object (where the XGBoost estimator is the final task) in the development environment and serialize it to a pickle file, 2) move the pickle file from the development to the production environment, and 3) de-serialize it from the pickle file and use it for transforming new data in the production environment. This is a high-level API, which completely abstracts away low-level details such as the layout of XGBoost's "internal" data matrices.
For a platform-independent solution, you could export XGBoost models (and associated data pre-processing logic) in the standardized PMML representation.

Successive training in neuralnet

I have a huge trainData and I want to withdraw random subsets out of it (let's say 1000 times) and use them to train the neural network object successively. Is it possible to do this using the neuralnet R package? What I am thinking about is something like:
library(neuralnet)
for (i in 1:1000){
  classA <- 2000
  classB <- 2000
  dataB <- trainData[sample(which(trainData$class == "B"), classB, replace=TRUE),] # withdraw 2000 samples from class B
  dataU <- trainData[sample(which(trainData$class == "A"), classA, replace=TRUE),] # withdraw 2000 samples from class A
  subset <- rbind(dataB, dataU) # bind them to make a subset
and then feed this subset of actual trainData to train the neuralnet object again and again like:
nn <- neuralnet(formula, data=subset, hidden=c(3,5), linear.output = F, stepmax = 2147483647) #use that subset for training the neural network
}
My question is: will this neuralnet object named nn be trained in every iteration of the loop, and when the loop finishes, will I get a fully trained neural network object? Secondly, what will be the effect of non-convergence in cases where neuralnet is unable to converge for a particular subset? Will it affect the prediction results?
The shortest answer - No
More nuanced answer - Sort of ...
Why? - Because the neuralnet::neuralnet function is not designed to return the weights if the threshold is not reached within stepmax. However, if the threshold is reached, the resulting object will contain the final weights. These weights could then be fed to the neuralnet function as the startweights argument allowing for successive learning. Your call would look like the following:
# nn.prior = previously run neuralnet object
nn <- neuralnet(formula, data=subset, hidden=c(3,5), linear.output = F, stepmax = 2147483647, startweights = nn.prior$weights)
However, I initially answer 'No' because choosing a threshold to get a suitable amount of information out of a subset while also making sure it 'converges' before stepmax would likely be a guessing game and not very objective.
You have essentially four options I can think of:
Find another package that allows for this explicitly
Get the neuralnet source code and modify it to return the weights even when 'convergence' isn't achieved (i.e. the threshold isn't reached).
Take a suitably sized random subset, just build your model on that, and test its performance. (This is actually quite common practice AFAIK.)
Take all your subsets, build a model on each, and look into combining them as an 'ensemble' model (a rough sketch of this is below).
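A rough sketch of the ensemble idea in option 4, assuming subsets is a list of training data frames and newdata contains only the covariates (both names are illustrative):
nets <- lapply(subsets, function(d)
  neuralnet(formula, data = d, hidden = c(3, 5), linear.output = FALSE))
# average the raw outputs of the individual nets
ensemble_predict <- function(nets, newdata) {
  preds <- sapply(nets, function(nn) compute(nn, newdata)$net.result)
  rowMeans(preds)
}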
I would recommend using k-fold cross-validation to train many nets, using library(e1071) and the tune function.

Search for corresponding node in a regression tree using rpart

I'm pretty new to R and I'm stuck with a pretty dumb problem.
I'm calibrating a regression tree using the rpart package in order to do some classification and some forecasting.
Thanks to R the calibration part is easy to do and easy to control.
#the package rpart is needed
library(rpart)
# Loading of a big data file used for calibration
my_data <- read.csv("my_file.csv", sep=",", header=TRUE)
# Regression tree calibration
tree <- rpart(Ratio ~ Attribute1 + Attribute2 + Attribute3 +
Attribute4 + Attribute5,
method="anova", data=my_data,
control=rpart.control(minsplit=100, cp=0.0001))
After having calibrated a big decision tree, I want, for a given data sample, to find the corresponding cluster (leaf) of some new data (and thus the forecasted value).
The predict function seems perfect for this need.
# read validation data
validationData <-read.csv("my_sample.csv", sep=",", header=TRUE)
# search for the probability in the tree
predict <- predict(tree, newdata=validationData, class="prob")
# dump them in a file
write.table(predict, file="dump.txt")
However, with the predict method I just get the forecasted ratio of my new elements, and I can't find a way to get the decision tree leaf to which my new elements belong.
I think it should be pretty easy to get since the predict method must have found that leaf in order to return the ratio.
There are several values that can be given to the predict method through the class= argument, but for a regression tree they all seem to return the same thing (the value of the target attribute of the decision tree).
Does anyone know how to get the corresponding node in the decision tree?
Analyzing the node with the path.rpart method would help me understand the results.
Benjamin's answer unfortunately doesn't work: type="vector" still returns the predicted values.
My solution is pretty kludgy, but I don't think there's a better way. The trick is to replace the predicted y values in the model frame with the corresponding node numbers.
tree2 = tree
tree2$frame$yval = as.numeric(rownames(tree2$frame))
predict = predict(tree2, newdata=validationData)
Now the output of predict will be node numbers as opposed to predicted y values.
(One note: the above worked in my case where tree was a regression tree, not a classification tree. In the case of a classification tree, you probably need to omit as.numeric or replace it with as.factor.)
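As a usage note, the node numbers returned this way can be handed to path.rpart (mentioned in the question) to print the splits leading to each leaf:
nodes <- predict(tree2, newdata = validationData)
path.rpart(tree, nodes = unique(nodes))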
You can use the partykit package:
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
library("partykit")
fit.party <- as.party(fit)
predict(fit.party, newdata = kyphosis[1:4, ], type = "node")
For your example just set
predict(as.party(tree), newdata = validationData, type = "node")
I think what you want is type="vector" instead of class="prob" (I don't think class is an accepted parameter of the predict method), as explained in the rpart docs:
If type="vector": vector of predicted
responses. For regression trees this
is the mean response at the node, for
Poisson trees it is the estimated
response rate, and for classification
trees it is the predicted class (as a
number).
treeClust::rpart.predict.leaves(tree, validationData) returns the node numbers.
Also, tree$where returns the node numbers for the training set.
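Putting those two hints together, a brief sketch (rpart.predict.leaves comes from the treeClust package):
library(treeClust)
# leaf assignments for the new data ("where" gives row numbers of tree$frame)
new_nodes <- rpart.predict.leaves(tree, validationData, type = "where")
# leaf assignments for the training data
train_nodes <- tree$where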
