Random Forest in R

If x is a Random Forest in R, for example,
x <- cforest (y~ a+b+c, data = football),
what does x[[9]] mean?

You can't subset this object, so in a sense x[[9]] is nothing; it is not accessible as such.
x is an object of the S4 class "RandomForest". This class is documented on the help page ?'RandomForest-class', where the slots of the object are named and described. You can also get the slot names via slotNames():
library("party")
foo <- cforest(ME ~ ., data = mammoexp, control = cforest_unbiased(ntree = 50))
> slotNames(foo)
[1] "ensemble" "where" "weights"
[4] "initweights" "data" "responses"
[7] "cond_distr_response" "predict_response" "prediction_weights"
[10] "get_where" "update"
If by x[[9]] you meant the 9th slot, then that is prediction_weights, and ?'RandomForest-class' tells us that this is
‘prediction_weights’: a function for extracting weights from
terminal nodes.
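To make the distinction concrete, here is a minimal sketch that uses a toy S4 class rather than a real cforest fit. The class name ToyForest and its three slots are made up for illustration. It shows that positional [[ fails on a plain S4 object, while slots can be reached by name with @ or slot().

```r
library(methods)

# Hypothetical stand-in for a "RandomForest" object: three list-valued slots
setClass("ToyForest", slots = c(ensemble = "list", where = "list", weights = "list"))
toy <- new("ToyForest", ensemble = list(1), where = list(2), weights = list(3))

# `[[` has no meaning for this S4 object, so positional subsetting errors
res <- try(toy[[2]], silent = TRUE)
inherits(res, "try-error")

# Slots are addressed by name: look up the 2nd slot name, then extract it
second <- slotNames(toy)[2]
slot(toy, second)

# Equivalent access with the `@` operator
toy@where
```

With a real fit, the same pattern would be slot(foo, slotNames(foo)[9]) or foo@prediction_weights.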

Related

How to input matrix data into brms formula?

I am trying to input matrix data into the brm() function to run a signal regression. brm is from the brms package, which provides an interface to fit Bayesian models using Stan. Signal regression is when you model one covariate using another within the bigger model, and you use the by parameter like this: model <- brm(response ~ s(matrix1, by = matrix2) + ..., data = Data). The problem is, I cannot input my matrices using the 'data' parameter because it only allows one data.frame object to be inputted.
Here are my code and the errors I obtained from trying to get around that constraint...
First off, my reproducible code leading up to the model-building:
library(brms)
#100 rows, 4 columns. Each cell contains a number between 1 and 10
Data <- data.frame(runif(100,1,10),runif(100,1,10),runif(100,1,10),runif(100,1,10))
#Assign names to the columns
names(Data) <- c("d0_10","d0_100","d0_1000","d0_10000")
Data$Density <- as.matrix(Data)%*%c(-1,10,5,1)
#the coefficients we are modelling
d <- c(-1,10,5,1)
#Made a matrix with 4 columns with values 10, 100, 1000, 10000 which are evaluation points. Rows are repeats of the same column numbers
Bins <- 10^matrix(rep(1:4,times = dim(Data)[1]),ncol = 4,byrow =T)
Bins
As mentioned above, since 'data' only allows one data.frame object to be inputted, I've tried other ways of inputting my matrix data. These methods include:
1) making the matrix within the brm() function using as.matrix()
signalregression.brms <- brm(Density ~ s(Bins,by=as.matrix(Data[,c(c("d0_10","d0_100","d0_1000","d0_10000"))])),data = Data)
#Error in is(sexpr, "try-error") :
#  argument "sexpr" is missing, with no default
2) making the matrix outside the formula, storing it in a variable, then calling that variable inside the brm() function
Donuts <- as.matrix(Data[,c(c("d0_10","d0_100","d0_1000","d0_10000"))])
signalregression.brms <- brm(Density ~ s(Bins,by=Donuts),data = Data)
#Error: The following variables can neither be found in 'data' nor in 'data2':
#  'Bins', 'Donuts'
3) inputting a list containing the matrix using the 'data2' parameter
signalregression.brms <- brm(Density ~ s(Bins,by=donuts),data = Data,data2=list(Bins = 10^matrix(rep(1:4,times = dim(Data)[1]),ncol = 4,byrow =T),donuts=as.matrix(Data[,c(c("d0_10","d0_100","d0_1000","d0_10000"))])))
#Error in names(dat) <- object$term :
#  'names' attribute [1] must be the same length as the vector [0]
None of the above worked; each had their own errors and it was difficult troubleshooting them because I couldn't find answers or examples online that were of a similar nature in the context of brms.
I was able to use the above techniques just fine for gam(), in the mgcv package - you don't have to define a data.frame using 'data', you can call on variables defined outside of the gam() formula, and you can make matrices inside the gam() function itself. See below:
library(mgcv)
signalregression2 <- gam(Data$Density ~ s(Bins,by = as.matrix(Data[,c("d0_10","d0_100","d0_1000","d0_10000")]),k=3))
#Works!
It seems like brms is less flexible... :(
My question: does anyone have any suggestions on how to make my brm() function run?
Thank you very much!
My understanding of signal regression is limited enough that I'm not convinced this is correct, but I think it's at least a step in the right direction. The problem seems to be that brm() expects everything in its formula to be a column in data. So we can get the model to compile by ensuring all the things we want are present in data:
library(tidyverse)
signalregression.brms = brm(Density ~
s(cbind(d0_10_bin, d0_100_bin, d0_1000_bin, d0_10000_bin),
by = cbind(d0_10, d0_100, d0_1000, d0_10000),
k = 3),
data = Data %>%
mutate(d0_10_bin = 10,
d0_100_bin = 100,
d0_1000_bin = 1000,
d0_10000_bin = 10000))
Writing out each column by hand is a little annoying; I'm sure there are more general solutions.
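One possible way to avoid typing each column is to derive the bin columns and the formula text from the covariate names. The sketch below uses only base R and assumes the bin value can be read off the column name (the part after "d0_"). The names orig_cols, bin_cols and signal_formula are illustrative, and the resulting formula has not been checked against brm() here.

```r
# Column names from the question; the numeric suffix is taken as the bin value
orig_cols <- c("d0_10", "d0_100", "d0_1000", "d0_10000")

set.seed(1)
Data <- as.data.frame(matrix(runif(400, 1, 10), ncol = 4,
                             dimnames = list(NULL, orig_cols)))
Data$Density <- as.vector(as.matrix(Data[orig_cols]) %*% c(-1, 10, 5, 1))

# Derive one constant "_bin" column per covariate from its name
bin_cols <- paste0(orig_cols, "_bin")
bin_values <- as.numeric(sub("^d0_", "", orig_cols))
for (i in seq_along(bin_cols)) Data[[bin_cols[i]]] <- bin_values[i]

# Assemble the formula text, then convert it to a formula object
signal_formula <- as.formula(sprintf(
  "Density ~ s(cbind(%s), by = cbind(%s), k = 3)",
  paste(bin_cols, collapse = ", "),
  paste(orig_cols, collapse = ", ")
))
signal_formula
```

The result could then be passed on as brm(signal_formula, data = Data); that call is not run in this sketch.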
For reference, here are my installed package versions:
map_chr(unname(unlist(pacman::p_depends(brms)[c("Depends", "Imports")])), ~ paste(., ": ", pacman::p_version(.), sep = ""))
[1] "Rcpp: 1.0.6" "methods: 4.0.3" "rstan: 2.21.2" "ggplot2: 3.3.3"
[5] "loo: 2.4.1" "Matrix: 1.2.18" "mgcv: 1.8.33" "rstantools: 2.1.1"
[9] "bayesplot: 1.8.0" "shinystan: 2.5.0" "projpred: 2.0.2" "bridgesampling: 1.1.2"
[13] "glue: 1.4.2" "future: 1.21.0" "matrixStats: 0.58.0" "nleqslv: 3.3.2"
[17] "nlme: 3.1.149" "coda: 0.19.4" "abind: 1.4.5" "stats: 4.0.3"
[21] "utils: 4.0.3" "parallel: 4.0.3" "grDevices: 4.0.3" "backports: 1.2.1"

Getting the class of the predictors from an lme4 model

After fitting an lme4 model, how can I get the classes of the predictors in terms(fit)[[3]]?
Here is a simple example, but I would appreciate a functional answer that works for any other lme4 model.
Note: Everything has to be extracted from the model.
library(lme4)
h <- read.csv('https://raw.githubusercontent.com/hkil/m/master/h.csv')
h$year <- as.factor(h$year)
m <- lmer(scale~ year*group + (1|stid), data = h)
terms(m)[[3]] ## What are the `class`es of the variables in here (e.g., `integer`, `factor` etc.)
Maybe not perfectly robust, but:
Extract the names of the variables from the terms object:
av <- all.vars(terms(m)[[3]]) ## c("year","group")
Look them up in the data frame supplied as data=:
setNames(lapply(av, function(x) class(h[[x]])), av)
$year
[1] "factor"
$group
[1] "character"
If you want to get everything from the model this gets MUCH HARDER in general, because the original variables are not necessarily stored. In the example you gave this works:
setNames(lapply(av, function(x) class(model.frame(m)[[x]])), av)
$year
[1] "factor"
$group
[1] "factor"
You'll notice that group has been converted to a factor. You can break this, e.g., by using a term like log(x) in the model ...
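The log(x) failure can be reproduced with a small sketch. It uses lm() from base R as a stand-in for lmer() so that it runs without lme4, and the data frame d is invented for illustration. The model frame stores the transformed column, so looking up the raw variable name finds nothing.

```r
set.seed(1)
# Made-up data: one numeric predictor and one character grouping variable
d <- data.frame(y = rnorm(20), x = runif(20, 1, 10), g = rep(c("a", "b"), 10))
fit <- lm(y ~ log(x) + g, data = d)

# Variable names as they appear in the right-hand side of the formula
av <- all.vars(terms(fit)[[3]])
av

# The model frame holds a column called "log(x)", not "x"
mf <- model.frame(fit)
names(mf)

# So the lookup by raw variable name returns NULL for x
lapply(setNames(av, av), function(v) class(mf[[v]]))

# The classes of the model-frame columns themselves are recorded here
attr(terms(fit), "dataClasses")
```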

Does the h2o.glm model object not save the 'weights_column' parameter?

I am using the h2o.glm function (in R). I tried to find the 'weights_column' value in the resulting H2O GLM model object, but I cannot find it. I looked into model@allparameters and model@parameters; neither of these objects contains the weights column information. Is the weight information saved anywhere in the model object?
If you specify weights_column to GLM or any of the H2O algos, it will store the column name (not the actual column data) in the model object. In R, it stores it in both model@parameters and model@allparameters. Here's an example:
library(h2o)
model <- h2o.glm(x = 1:3, y = 5,
training_frame = as.h2o(iris),
weights_column = names(iris)[4],
family = "multinomial")
You can see the relevant info here:
> model@parameters$weights_column
$`__meta`
$`__meta`$schema_version
[1] 3
$`__meta`$schema_name
[1] "ColSpecifierV3"
$`__meta`$schema_type
[1] "VecSpecifier"
$column_name
[1] "Petal.Width"
$is_member_of_frames
NULL

S4 object creation in R

I am comparing different machine learning techniques in R.
This is the case: I wrote several functions that each automatically create a different prediction model (e.g. logistic regression, random forest, neural network, hybrid ensemble, etc.), along with predictions, confusion matrices, several statistics (e.g. AUC and F-score), and different plots.
Now I would like to create a list of S4 (or S3?) objects in R, where each object contains the model, the predictions, the plots, the confusion matrix, the AUC and the F-score.
The idea is that each function creates such an object and then appends it to the object list in its return statement.
How should I program such a class? And how can I declare that each model can be of a different type? (I suppose that all the models I create are S3 objects, so how can I declare this in my S4 class?)
The end result should allow something like this: modelList[i]@plot should, for example, summon the requested plot, and names(modelList[i]) should give the name of the model used (if this is not possible, modelList[i]@name will do). It should also be possible to select the best model from the list based on a parameter such as AUC.
I am not experienced in creating such object, so this is the code / idea I have at the moment:
modelObject <- setClass(
# Set the name for the class
"modelObject",
# Define the slots
slots = c(
modelName = "character"
model = #should contain a glm, neural network, random forest , etc model
predictions = #should contain a matrix or dataframe of custid and prediction
rocCurve = #when summoned, the ROC curve should be plotted
plotX = #when summoned, plot X should be plotted
AUC = "numeric" #contains the value of the AUC
confusionMatrix = "matrix" #prints the confusion matrix in the console
statX = "numeric"#contains statistic X about the confusion matrix e.g. Fscore
),
# Set the default values for the slots. (optional)
prototype=list(
# I guess i can assign NULL to each variable of the S4 object
),
# Make a function that can test to see if the data is consistent.
# This is not called if you have an initialize function defined!
validity=function(object)
{
#not really an idea how to handle this
}
return(TRUE)
}
)
Use setOldClass() to promote each S3 class to its S4 equivalent:
setOldClass("lm")
setOldClass(c("glm", "lm"))
setOldClass(c("nnet.formula", "nnet"))
setOldClass("xx")
Use setClassUnion() to insert a common base class in the hierarchy
setClassUnion("lmORnnetORxx", c("lm", "nnet", "xx"))
.ModelObject <- setClass("ModelObject", slots=c(model="lmORnnetORxx"))
setMethod("show", "ModelObject", function(object) {
cat("model class: ", class(object@model), "\n")
})
In action:
> library(nnet)
> x <- y <- 1:10
> .ModelObject(model=lm(x~y))
model class: lm
> .ModelObject(model=glm(x~y))
model class: glm lm
> .ModelObject(model=nnet(x~y, size=10, trace=FALSE))
model class: nnet.formula nnet
I think that you would also like to implement a Models object that contains a list where all elements are ModelObject; the constraint would be imposed by a validity method (see ?setValidity).
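A minimal sketch of such a container follows. It is restricted to lm and glm so that it runs with base R only, and the names ScoredModel, Models and bestModel, as well as the AUC numbers, are invented for illustration. The validity method rejects any element that is not a ScoredModel, and bestModel() picks the element with the largest AUC.

```r
library(methods)

# Register the S3 classes so they can be used as slot types
setOldClass("lm")
setOldClass(c("glm", "lm"))
setClassUnion("lmORglm", c("lm", "glm"))

# Hypothetical element class: a fitted model plus a name and an AUC value
setClass("ScoredModel",
         slots = c(name = "character", model = "lmORglm", auc = "numeric"))

# Container: a list whose elements must all be ScoredModel objects
setClass("Models", contains = "list")
setValidity("Models", function(object) {
  ok <- vapply(object@.Data, function(m) is(m, "ScoredModel"), logical(1))
  if (all(ok)) TRUE else "all elements must be ScoredModel objects"
})

# Return the element with the largest AUC
bestModel <- function(models) {
  aucs <- vapply(models@.Data, function(m) m@auc, numeric(1))
  models@.Data[[which.max(aucs)]]
}

x <- y <- 1:10
# The AUC values below are made-up placeholders, not computed results
ms <- new("Models", list(
  new("ScoredModel", name = "lm",  model = lm(x ~ y),  auc = 0.71),
  new("ScoredModel", name = "glm", model = glm(x ~ y), auc = 0.78)
))
bestModel(ms)@name

# A list with a non-ScoredModel element is rejected by the validity method
bad <- try(new("Models", list(1)), silent = TRUE)
inherits(bad, "try-error")
```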
What I would do is, for each slot you want in your modelObject class, determine the range of expected values. For example, your model slot has to support all the possible classes of objects that can be returned by model training functions (e.g. lm(), glm(), nnet(), etc.). In the example case, you see the following objects returned:
```
library(nnet)
x <- y <- 1:10
class(lm(x~y))                           # "lm"
class(glm(x~y))                          # "glm" "lm"
class(nnet(x~y, size=10, trace=FALSE))   # "nnet.formula" "nnet"
```
Since there is no common class among the objects returned, it might make more sense to use an S3 class, which has a less rigorous syntax and would allow you to assign various classes of output to the same field name. Your question is actually quite tough to answer, given that there are so many different approaches to take with R's myriad OO systems.
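As a rough illustration of the S3 route, the sketch below wraps each model in a plain list tagged with a class attribute. The constructor name new_model_object, the field names and the AUC numbers are all invented for this example. Because list fields are untyped, the model field accepts any kind of fitted object.

```r
# Constructor: a plain list tagged with an S3 class (names are illustrative)
new_model_object <- function(name, model, auc) {
  structure(list(name = name, model = model, auc = auc), class = "modelObject")
}

# print method so that each object identifies itself in the console
print.modelObject <- function(x, ...) {
  cat("model:", x$name, "| class:", class(x$model), "| AUC:", x$auc, "\n")
  invisible(x)
}

x <- y <- 1:10
# AUC values are placeholders, not computed results
model_list <- list(
  lm  = new_model_object("lm",  lm(x ~ y),  0.71),
  glm = new_model_object("glm", glm(x ~ y), 0.78)
)

# Element names identify the models; the best one is picked by AUC
names(model_list)
best <- model_list[[which.max(sapply(model_list, function(m) m$auc))]]
print(best)
```

The trade-off is that nothing checks the contents of each field, so a validity check would have to be written by hand in the constructor.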

What is the datatype of "formula" for the e1071 package in R?

I am learning R's e1071 package to perform naive Bayes analysis. According to this tutorial, the package's naiveBayes method takes an input called "formula" -- which it says is "A formula of the form class ~ x1 + x2 + .."
How do I create such a formula? I have a dataset with gender, job and income columns and want to perform analysis on each of those dimensions/factors. Do I need to somehow turn them into a formula object? (I'm pretty new to R so I am unclear if R even supports specific data types like formula).
Just typing ~ x1 + x2 will create an object of the class formula.
Look at ?lm for examples of the basic idea. The domain-specific language is pretty flexible, so different models use it in different ways.
For instance:
dat <- data.frame(x=runif(10), y=runif(10))
lm( y ~ x, data=dat)
f <- y ~ x
class(f)
lm( f, data=dat )
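Applied to the columns named in the question, a formula can be written directly or assembled from column names with reformulate(). This is only a sketch: the data frame people is invented, and treating income as the class to predict is an assumption about what the asker wants.

```r
# Hypothetical data with the three columns described in the question
people <- data.frame(
  gender = factor(c("f", "m", "f", "m")),
  job    = factor(c("nurse", "clerk", "clerk", "nurse")),
  income = factor(c("high", "low", "low", "high"))
)

# Written directly ...
f1 <- income ~ gender + job

# ... or assembled from a character vector of predictor names
f2 <- reformulate(c("gender", "job"), response = "income")

class(f2)
deparse(f2)
```

Either object could then be supplied as the formula argument, for example naiveBayes(f2, data = people) with e1071 loaded; that call is not run here.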
Formulas are created by the use of the ~ function. It can be used either as a prefix operator or an infix operator:
form <- ~ atom
form <- atomA ~ atomB
is.function(`~`)
[1] TRUE
class(form[[2]])
#[1] "name"
is.language(form)
#[1] TRUE
is.language(form[[2]])
#[1] TRUE
Just like functions, formula objects get created with an environment:
str(form)
#Class 'formula' length 3 atomA ~ atomB
# ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
Different functions have different assumptions about which form (prefix or infix) to use. You can extract the components by positional number using list indexing. The first element will always be the tilde operator.
form[[1]]
# returns: `~`
form[[2]]
# atomA
form[[3]]
# atomB
You can demote a formula to a character object with as.character or promote a character object to a formula with as.formula. The formula function does not create formulas, but is rather a generic function that has specific behavior with different classed objects, generally serving to extract a formula from a regression object.
methods(formula)
#-----------
[1] formula.character*
[2] formula.data.frame*
[3] formula.default*
[4] formula.formula*
[5] formula.glm*
[6] formula.lm*
[7] formula.nls*
[8] formula.quantmod*
[9] formula.summary.formula.cross
[10] formula.terms*
Your results for methods(formula) will differ depending on which packages you have loaded at the time.
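The conversions and the extraction described above can be tried out with base R alone. The sketch below builds a formula from a string, splits it back into character pieces, reassembles it, and pulls the formula out of a fitted lm object. The variable names are arbitrary.

```r
# Promote a character string to a formula
f <- as.formula("y ~ x + z")
class(f)

# Demote it again: operator, left-hand side, right-hand side
parts <- as.character(f)
parts

# Reassemble the pieces into an equivalent formula
back <- as.formula(paste(parts[2], parts[1], parts[3]))
deparse(back)

# formula() extracts the formula from a regression object
dat <- data.frame(x = runif(10), y = runif(10))
fit <- lm(y ~ x, data = dat)
formula(fit)
```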
