R:rpart: convertng the rpart fit object into a function - r

I can use rpart to predict (as below),
library(rpart)
datpred <-tail(car.test.frame,10)
fit <- rpart(Mileage ~ Weight+Price, car.test.frame)
predict(fit,newdata=datpred)
plot(fit, uniform=TRUE)
text(fit, use.n=TRUE, all=TRUE, cex=.8)
objects(fit)
Is there an easy way to convert the fit objects into a simple function that contains only the splitting logic on the data input and then outputs the prediction?
The reason for this is that I can then have the function within a single script with no need to load the fit objects from an external source.
Thank you for your help.

You can save your object in a file using dput() then read it with dget():
dput(fit, 'fit.dput')
rm(fit)
fit <- dget('fit.dput')

The output that you are interested in, namely the variable names and the values at which the tree is split, are assembled by the labels.rpart function:
labels(fit)
#-----------
[1] "root" "Weight>=2568" "Weight>=3088" "Weight< 3088"
[5] "Weight>=2748" "Weight< 2748" "Weight< 2568"
The 'splits' element of the fit object is where the cutpoints are stored (in the "index" column):
> fit$splits
count ncat improve index adj
Weight 60 1 0.5953491 2567.5 0
Weight 45 1 0.5045118 3087.5 0
Weight 23 1 0.1476996 2747.5 0
You can look at the code but if you don't already know how to do that then this is not a function that is easy to understand:
> methods(labels)
[1] labels.default labels.dendrogram* labels.dist*
[4] labels.lm* labels.rpart* labels.survreg
[7] labels.terms*
see '?methods' for accessing help and source code
> getAnywhere(labels.rpart)

Related

How to input matrix data into brms formula?

I am trying to input matrix data into the brm() function to run a signal regression. brm is from the brms package, which provides an interface to fit Bayesian models using Stan. Signal regression is when you model one covariate using another within the bigger model, and you use the by parameter like this: model <- brm(response ~ s(matrix1, by = matrix2) + ..., data = Data). The problem is, I cannot input my matrices using the 'data' parameter because it only allows one data.frame object to be inputted.
Here are my code and the errors I obtained from trying to get around that constraint...
First off, my reproducible code leading up to the model-building:
library(brms)
#100 rows, 4 columns. Each cell contains a number between 1 and 10
Data <- data.frame(runif(100,1,10),runif(100,1,10),runif(100,1,10),runif(100,1,10))
#Assign names to the columns
names(Data) <- c("d0_10","d0_100","d0_1000","d0_10000")
Data$Density <- as.matrix(Data)%*%c(-1,10,5,1)
#the coefficients we are modelling
d <- c(-1,10,5,1)
#Made a matrix with 4 columns with values 10, 100, 1000, 10000 which are evaluation points. Rows are repeats of the same column numbers
Bins <- 10^matrix(rep(1:4,times = dim(Data)[1]),ncol = 4,byrow =T)
Bins
As mentioned above, since 'data' only allows one data.frame object to be inputted, I've tried other ways of inputting my matrix data. These methods include:
1) making the matrix within the brm() function using as.matrix()
signalregression.brms <- brm(Density ~ s(Bins,by=as.matrix(Data[,c(c("d0_10","d0_100","d0_1000","d0_10000"))])),data = Data)
#Error in is(sexpr, "try-error") :
argument "sexpr" is missing, with no default
2) making the matrix outside the formula, storing it in a variable, then calling that variable inside the brm() function
Donuts <- as.matrix(Data[,c(c("d0_10","d0_100","d0_1000","d0_10000"))])
signalregression.brms <- brm(Density ~ s(Bins,by=Donuts),data = Data)
#Error: The following variables can neither be found in 'data' nor in 'data2':
'Bins', 'Donuts'
3) inputting a list containing the matrix using the 'data2' parameter
signalregression.brms <- brm(Density ~ s(Bins,by=donuts),data = Data,data2=list(Bins = 10^matrix(rep(1:4,times = dim(Data)[1]),ncol = 4,byrow =T),donuts=as.matrix(Data[,c(c("d0_10","d0_100","d0_1000","d0_10000"))])))
#Error in names(dat) <- object$term :
'names' attribute [1] must be the same length as the vector [0]
None of the above worked; each had their own errors and it was difficult troubleshooting them because I couldn't find answers or examples online that were of a similar nature in the context of brms.
I was able to use the above techniques just fine for gam(), in the mgcv package - you don't have to define a data.frame using 'data', you can call on variables defined outside of the gam() formula, and you can make matrices inside the gam() function itself. See below:
library(mgcv)
signalregression2 <- gam(Data$Density ~ s(Bins,by = as.matrix(Data[,c("d0_10","d0_100","d0_1000","d0_10000")]),k=3))
#Works!
It seems like brms is less flexible... :(
My question: does anyone have any suggestions on how to make my brm() function run?
Thank you very much!
My understanding of signal regression is limited enough that I'm not convinced this is correct, but I think it's at least a step in the right direction. The problem seems to be that brm() expects everything in its formula to be a column in data. So we can get the model to compile by ensuring all the things we want are present in data:
library(tidyverse)
signalregression.brms = brm(Density ~
s(cbind(d0_10_bin, d0_100_bin, d0_1000_bin, d0_10000_bin),
by = cbind(d0_10, d0_100, d0_1000, d0_10000),
k = 3),
data = Data %>%
mutate(d0_10_bin = 10,
d0_100_bin = 100,
d0_1000_bin = 1000,
d0_10000_bin = 10000))
Writing out each column by hand is a little annoying; I'm sure there are more general solutions.
For reference, here are my installed package versions:
map_chr(unname(unlist(pacman::p_depends(brms)[c("Depends", "Imports")])), ~ paste(., ": ", pacman::p_version(.), sep = ""))
[1] "Rcpp: 1.0.6" "methods: 4.0.3" "rstan: 2.21.2" "ggplot2: 3.3.3"
[5] "loo: 2.4.1" "Matrix: 1.2.18" "mgcv: 1.8.33" "rstantools: 2.1.1"
[9] "bayesplot: 1.8.0" "shinystan: 2.5.0" "projpred: 2.0.2" "bridgesampling: 1.1.2"
[13] "glue: 1.4.2" "future: 1.21.0" "matrixStats: 0.58.0" "nleqslv: 3.3.2"
[17] "nlme: 3.1.149" "coda: 0.19.4" "abind: 1.4.5" "stats: 4.0.3"
[21] "utils: 4.0.3" "parallel: 4.0.3" "grDevices: 4.0.3" "backports: 1.2.1"

Function "effects" fails to find data in a model object loaded from a rds file

I am encountering a problem when trying to call the effects function (from the package effects) on a lm object loaded from a rds file. I need to save the files, since I adjust several models in a loop and then later retrieve the models in order to produce graphs for some of the models.
Heres is an example of the problem:
var1<-rnorm(100)
var2<-rnorm(100)
df1<-data.frame(var1,var2)
lm1<-lm(var1~var2,data=df1)
saveRDS(lm1,"lm1.RDS")
rm(lm1,var1,var2,df1)
loaded<-readRDS("lm1.RDS")
library(effects)
eff<-allEffects(loaded)
I don't know why the effects::allEffects function does not do this but it's fairly simple to recover data from a model:
> df1 <- data.frame(var1 =loaded$model$var1, var2=loaded$model$var2)
> eff<-allEffects(loaded)
> eff
model: var1 ~ var2
var2 effect
var2
-2 -1 0 1 2
-0.08501500 -0.09133397 -0.09765294 -0.10397191 -0.11029088

predict.lm with arbitrary coefficients r

I'm trying to predict an lm object using predict.lm. However, I would like to use manually inserted coefficients.
To do this I tried:
model$coefficients <- coeff
(where "coeff" is a vector of correct coefficients)
which would indeed modify the coefficients as I want. Nevertheless, when I execute
predict.lm(model, new.data)
I just get predictions calculated with the "old" parameters. Is there a way I could force predict.lm to use the new ones?
Post Scriptum: I need to do this to fit a bin-smooth (also called regressogram).
In addition, when I predict "by hand" (i.e. using matrix multiplication) the results are fine, hence I'm quite sure that the problem lies in the predict.lm not recognizing my new coefficients.
Thanks in advance for the help!
Hacking the $coefficients element does indeed seem to work. Can you show what doesn't work for you?
dd <- data.frame(x=1:5,y=1:5)
m1 <- lm(y~x,dd)
m1$coefficients <- c(-2,1)
m1
## Call:
## lm(formula = y ~ x, data = dd)
##
## Coefficients:
## [1] -2 1
predict(m1,newdata=data.frame(x=7)) ## 5 = -2+1*7
predict.lm(...) gives the same results.
I would be very careful with this approach, checking each time you do something different with the hacked model.
In general it would be nice if predict and simulate methods took a newparams argument, but they don't in general ...

How to retrieve value of a function in R

when calling a function in R, how can I retrieve the result values. For example, I used 'roc' function and I need to extract AUC value and CI (0.6693 and 0.6196-0.7191 respectively in the following example).
> roc(tmpData[,lenCnames], fitted(model), ci=TRUE)
Call:
roc.default(response = tmpData[, lenCnames], predictor = fitted(model), ci = TRUE)
Data: fitted(model) in 127 controls (tmpData[, lenCnames] 0) < 3248 cases (tmpData[, lenCnames] 1).
Area under the curve: 0.6693
95% CI: 0.6196-0.7191 (DeLong)
I can use the following to fetch these values with associated texts.
> z$auc
Area under the curve: 0.6693
> z$ci
95% CI: 0.6196-0.7191 (DeLong)
Is there a way to get only the values and not the text.
I do now how to get these using 'regular expression' or 'strsplit' function, but I suspect there should be some other way to directly access these values.
It's helpful to use reproducible examples when asking a question. Also best to refer to the library you're asking about ("pROC"), since it is not loaded with base R. pROC has functions that extract auc and ci.auc objects from the roc object.
>library("pROC")
>data(aSAH)
# Basic example
>z <- roc(aSAH$outcome, aSAH$s100b,
levels=c("Good", "Poor"))
# Examining the class of 'auc' output shows us that it is also of class 'numeric'
> class(auc(z))
[1] "auc" "numeric"
# calling 'as.numeric' will extract the value
> as.numeric(auc(z))
[1] 0.7313686
# calling 'as.numeric' on the 'ci.auc' object extracts three values.
as.numeric(ci(z))
[1] 0.6301182 0.7313686 0.8326189
# The ones we want are 1 and 3
> as.numeric(ci(z))[c(1,3)]
[1] 0.6301182 0.8326189
Using the functions str, class, and attributes will often help you figure out how to get what you want out of an object.

Predicting with lm object in R - black box paradigm

I have a function that returns an lm object. I want to produce predicted values based on some new data. The new data is a data.frame in the exact format as the data passed to the lm function, except that the response has been removed (since we're predicting, not training). I would expect to execute the following, but get an error:
predict( model , newdata )
"Error in eval(expr, envir, enclos) : object 'ModelResponse' not found"
In my case, ModelResponse was the name of the response column in the data I originally trained on. So just for kicks, I tried to insert NA reponse:
newdata$ModelResponse = NA
predict( model , newdata )
Error in terms.default(object, data = data) : no terms component nor attribute
Highly frustrating! R's notion of models/regression doesn't match mine: 1. I train a model with some data and get a model object. 2. I can score new data from any environment/function/frame/etc. so long as I input data into the model object that "looks like" the data I trained on (i.e. same column names). This is a standard black-box paradigm.
So here are my questions:
1. What concept(s) am I missing here?
2. How do I get my scenario to work?
3. How can I get model object to be portable? str(model) shows me that the model object saved the original data it trained on! So the model object is massive. I want my model to be portable to any function/environment/etc. and only contain the data it needs to score.
In the absence of str() on either the model or the data offered to the model, here's my guess regarding this error message:
predict( model , newdata )
"Error in eval(expr, envir, enclos) : object 'ModelResponse' not found"
I guess that you made a model object named "model" and that your outcome variable (the left-hand-side of the formula( in the original call to lm was named "ModelResponse" and that you then named a column in newdata by the same name. But what you should have done was leave out the "ModelResponse" columns (because that is what you are predicting) and put in the "Model_Predictor1", Model_Predictor2", etc. ... i.e. all the names on the right-hand-side of the formula given to lm()
The coef() function will allow you to extract the information needed to make the model portable.
mod.coef <- coef(model)
mod.coef
Since you expressed interest in the rms/Hmisc package combo Function, here it is using the help-example from ols and comparing the output with an extracted function and the rms Predict method. Note the capitals, since these are designed to work with the package equivalents of lm and glm(..., family="binomial") and coxph, which in rms become ols, lrm, and cph.
> set.seed(1)
> x1 <- runif(200)
> x2 <- sample(0:3, 200, TRUE)
> distance <- (x1 + x2/3 + rnorm(200))^2
> d <- datadist(x1,x2)
> options(datadist="d") # No d -> no summary, plot without giving all details
>
>
> f <- ols(sqrt(distance) ~ rcs(x1,4) + scored(x2), x=TRUE)
>
> Function(f)
function(x1 = 0.50549065,x2 = 1) {0.50497361+1.0737604* x1-
0.79398383*pmax(x1-0.083887788,0)^3+ 1.4392827*pmax(x1-0.38792825,0)^3-
0.38627901*pmax(x1-0.65115162,0)^3-0.25901986*pmax(x1-0.92736774,0)^3+
0.06374433*x2+ 0.60885222*(x2==2)+0.38971577*(x2==3) }
<environment: 0x11b4568e8>
> ols.fun <- Function(f)
> pred1 <- Predict(f, x1=1, x2=3)
> pred1
x1 x2 yhat lower upper
1 1 3 1.862754 1.386107 2.339401
Response variable (y): sqrt(distance)
Limits are 0.95 confidence limits
# The "yhat" is the same as one produces with the extracted function
> ols.fun(x1=1, x2=3)
[1] 1.862754
(I have learned through experience that the restricted cubic-spline fit functions coming from rms need to have spaces and carriage returns added to improve readability. )
Thinking long-term, you should probably take a look at the caret package. Many or most modeling functions work with data frames and matrices, others have a preference, and there may be other variations of their expectations. It's important to quickly get your head around each, but if you want a single wrapper that will simplify life for you, making the intricacies into a "black box", then caret is as close as you can get.
As a disclaimer: I do not use caret, as I don't think modeling should be a be a black box. I've had more than a few emails to maintainers of modeling packages resulting from looking into their code and seeing something amiss. Wrapping that in another layer would not serve my interests. So, in the very long-run, avoid caret and develop an enjoyment for dissecting what's going into and out of the different modeling functions. :)

Resources