'mtry' in 'rfcv()' function of R's 'randomForest' library - r

I would like to use cross validation to determine the number of variables to try in a Random Forest method. I don't understand how to use the mtry argument in the rfcv() function.
I have 6 predictors in my dataset. I want to use mtry = 6, 5, 4, 3, 2, 1, i.e., every possible value of m, and cross-validate with 5-fold CV.
I believe this can be done with rfcv() function of randomForest package. I am running the code:
rf_cv<- rfcv(training_x,training_y,cv.fold=5, mtry=function(p) max(1, p-1))
However, calling rf_cv$n.var gives me:
[1] 6 3 1
So this call does not vary the number of variables as I was hoping, even though I told it to subtract 1 from the number of variables used each time.
How can I try every number of variables, with 5-fold cross-validation applied for each one?
I checked this post, however it is not completely related since they are discussing the default of mtry.

The post you referenced explains how step determines which numbers of variables (n.var) get tested. In your case p = 6, and since you did not change step or scale:
p <- 6; step <- 0.5
k <- floor(log(p, base = 1/step))
n.var <- round(p * step^(0:(k - 1)))
n.var
[1] 6 3
And if n.var does not include 1, the function appends it for you, which gives you 6, 3, 1. So if you want to try every number of variables, set mtry to identity, step to -1, and scale to anything other than "log" (the code only special-cases "log"; any other value makes it use a plain sequence from p down to 1):
rf_cv=rfcv(matrix(rnorm(100*6),ncol=6),rnorm(100),cv.fold=3,
mtry=identity,scale="new",step=-1)
rf_cv$n.var
[1] 6 5 4 3 2 1
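Once n.var runs over 6 down to 1, the cross-validated error for each subset size is in rf_cv$error.cv; a quick way to look at it (a sketch adapted from the rfcv help page):
with(rf_cv, plot(n.var, error.cv, type = "o", lwd = 2,
                 xlab = "number of variables", ylab = "CV error"))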


Key methods selection in rpart

What's the rule for selecting the complexity parameter (cp) and the method in the rpart() function from the rpart package? I've read a couple of articles about the package, but the content was too technical for me to grok.
Example:
rpart_1 <- rpart(myFormula, data = kyphosis,
method = "class",
control = rpart.control(minsplit = 0, cp = 0))
plotcp(rpart_1)
printcp(rpart_1)
You typically don't choose the method parameter as such; it's chosen for you as part of the problem you're solving. If it's a classification problem, you use method="class", if it's a regression problem, you use method="anova", and so on. Naturally, this means you have to understand what the problem is you're trying to solve, and whether your data will let you solve it.
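For example (a minimal sketch of the two common cases; kyphosis ships with rpart and Boston with MASS):
library(rpart); library(MASS)
# classification: the response (Kyphosis) is a factor, so method = "class"
rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")
# regression: the response (medv) is numeric, so method = "anova"
rpart(medv ~ ., data = Boston, method = "anova")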
The cp parameter controls the size of the fitted tree. You choose its value via cross-validation or using a separate test dataset. rpart is somewhat different to most other R modelling packages in how it handles this. The rpart function does cross-validation automatically by default. You then examine the model to see the result of the cross-validation, and prune the model based on that.
Worked example, using the MASS::Boston dataset:
library(MASS)
# does 10-fold CV by default
Bos.tree <- rpart(medv ~ ., data=Boston, cp=0)
# look at the result of the CV
plotcp(Bos.tree)
The plot shows that the 10-fold cross-validated error flattens out beginning at a tree size of about 9 leaf nodes. The dotted line is the minimum of the curve plus 1 standard error, which is a standard rule of thumb for pruning decision trees: you pick the smallest tree size that is within 1 SE of the minimum.
Printing the CP values gives a more precise view of how to choose the tree size:
printcp(Bos.tree)
#CP nsplit rel error xerror xstd
#1 0.45274420 0 1.00000 1.00355 0.082973
#2 0.17117244 1 0.54726 0.61743 0.057053
#3 0.07165784 2 0.37608 0.43034 0.046596
#4 0.03616428 3 0.30443 0.34251 0.042502
#5 0.03336923 4 0.26826 0.32642 0.040456
#6 0.02661300 5 0.23489 0.32591 0.040940
#7 0.01585116 6 0.20828 0.29324 0.040908
#8 0.00824545 7 0.19243 0.28256 0.039576
#9 0.00726539 8 0.18418 0.27334 0.037122
#10 0.00693109 9 0.17692 0.27593 0.037326
#11 0.00612633 10 0.16999 0.27467 0.037310
#12 0.00480532 11 0.16386 0.26547 0.036897
# . . .
This shows that a CP value of 0.00612 corresponds to a tree with 10 splits (and hence 11 leaves). This is the value of cp you use to prune the tree. So:
# prune with a value of cp slightly greater than 0.00612633
Bos.tree.cv <- prune(Bos.tree, cp=0.00613)
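If you'd rather automate the 1-SE choice instead of reading it off printcp(), a small sketch using the model's cptable component (its columns are CP, nsplit, rel error, xerror, xstd) could look like this:
cptab  <- Bos.tree$cptable
i.min  <- which.min(cptab[, "xerror"])
thresh <- cptab[i.min, "xerror"] + cptab[i.min, "xstd"]
# smallest tree whose cross-validated error is within 1 SE of the minimum
cp.1se <- cptab[which(cptab[, "xerror"] <= thresh)[1], "CP"]
Bos.tree.1se <- prune(Bos.tree, cp = cp.1se)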

Why is the likelihood/AIC of my poisson regression infinite?

I am trying to evaluate the model fit of several regressions in R, and I have run into a problem I have had multiple times now: the log-likelihood of my Poisson regression is infinite.
I'm using a non-integer dependent variable (Note: I know what I'm doing in this regard), and I'm wondering if maybe that's the problem. However, I don't get an infinite log-likelihood when running the regression with glm.nb.
Code to reproduce the issue is below.
Edit: the problem appears to go away when I coerce the DV to integer. Any idea how to get log likelihood from Poissons with non-integer DVs?
# Input Data
so_data <- data.frame(dv = c(21.0552722691125, 24.3061351414885, 7.84658638053276,
25.0294679770848, 15.8064731063311, 10.8171744654056, 31.3008088413026,
2.26643928259238, 18.4261153345417, 5.62915828161753, 17.0691184593063,
1.11959635820499, 30.0154935602592, 23.0000809735738, 28.4389825676123,
27.7678405415711, 23.7108405071757, 23.5070651053276, 14.2534787168392,
15.2058525068363, 19.7449094187771, 2.52384709295823, 29.7081691356397,
32.4723790240354, 19.2147002673637, 61.7911384519901, 10.5687170234821,
23.9047421013736, 18.4889651451222, 13.0360878554798, 15.1752866581849,
11.5205948111817, 31.3539840929108, 31.7255952728076, 25.3034625215724,
5.00013988265465, 30.2037887018226, 1.86123112349445, 3.06932041603219,
22.6739418581257, 6.33738321053804, 24.2933951601142, 14.8634827414491,
31.8302947881089, 34.8361908525564, 1.29606416941288, 13.206844629927,
28.843579313401, 25.8024295609021, 14.4414831628722, 18.2109680632694,
14.7092063453463, 10.0738043919183, 28.4124482962025, 27.1004208775326,
1.31350378236957, 14.3009307888745, 1.32555197766214, 2.70896028922312,
3.88043749517381, 3.79492216916016, 19.4507965653633, 32.1689088941444,
2.61278585713499, 41.6955885902228, 2.13466761675063, 30.4207256294235,
24.8231524369244, 20.7605955978196, 17.2182798298094, 2.11563574288652,
12.290778250655, 0.957467139696772, 16.1775287334746))
# Run Model
p_mod <- glm(dv ~ 1, data = so_data, family = poisson(link = 'log'))
# Be Confused
logLik(p_mod)
Elaborating on @ekstroem's comment: the Poisson distribution is only supported over the non-negative integers (0, 1, ...). So, technically speaking, the probability of any non-integer value is zero -- although R does allow for a little bit of fuzz, to allow for round-off/floating-point representation issues:
> dpois(1,lambda=1)
[1] 0.3678794
> dpois(1.1,lambda=1)
[1] 0
Warning message:
In dpois(1.1, lambda = 1) : non-integer x = 1.100000
> dpois(1+1e-7,lambda=1) ## fuzz
[1] 0.3678794
It is theoretically possible to compute something like a Poisson log-likelihood for non-integer values:
my_dpois <- function(x, lambda, log = FALSE) {
  LL <- -lambda + x*log(lambda) - lfactorial(x)
  if (log) LL else exp(LL)
}
but I would be very careful - some quick tests with integrate suggest it integrates to 1 (after I fixed the bug in it), but I haven't checked more carefully that this is really a well-posed probability distribution. (On the other hand, some reasonable-seeming posts on CrossValidated suggest that it's not insane ...)
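For what it's worth, the kind of quick check mentioned above could look like this (a sketch, not a proof that this is a well-posed density):
integrate(function(x) my_dpois(x, lambda = 5), lower = 0, upper = Inf)
# should come out very close to 1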
You say "I know what I'm doing in this regard"; can you give some more of the context? Some alternative possibilities (although this is steering into CrossValidated territory) -- the best answer depends on where your data really come from (i.e., why you have "count-like" data that are non-integer but you think should be treated as Poisson).
a quasi-Poisson model (family=quasipoisson) -- a sketch of this and the Gamma option follows this list. (R will still not give you log-likelihood or AIC values in this case, because technically they don't exist -- you're supposed to do inference on the basis of the Wald statistics of the parameters; see e.g. here for more info.)
a Gamma model (probably with a log link)
if the data started out as count data that you've scaled by some measure of effort or exposure, use an appropriate offset model ...
a generalized least-squares model (nlme::gls) with an appropriate heteroscedasticity specification
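Sketches of the first two options above, reusing the question's so_data and p_mod (the names qp_mod and g_mod are just for illustration):
qp_mod <- glm(dv ~ 1, data = so_data, family = quasipoisson(link = "log"))
g_mod  <- glm(dv ~ 1, data = so_data, family = Gamma(link = "log"))
AIC(p_mod); AIC(qp_mod); AIC(g_mod)
# Poisson: infinite; quasi-Poisson: NA (no likelihood); Gamma: a finite value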
The Poisson log-likelihood involves calculating log(factorial(x)) (https://www.statlect.com/fundamentals-of-statistics/Poisson-distribution-maximum-likelihood). For values much larger than 30 this should be done with Stirling's approximation, to avoid exceeding the limits of floating-point arithmetic. Sample code in Python:
import numpy as np

# define a likelihood function
# https://www.statlect.com/fundamentals-of-statistics/Poisson-distribution-maximum-likelihood
def loglikelihood_f(lmba, x):
    # Use Stirling's formula to avoid computing factorial(x) directly:
    # log(x!) ~= x*log(x) - x
    n = x.size
    logfactorial = x * np.log(x + 0.001) - x   # ~ np.log(factorial(x))
    logfactorial[logfactorial == -np.inf] = 0  # guard against zeros
    result = (- np.sum(logfactorial)
              - n * lmba
              + np.log(lmba) * np.sum(x))
    return result

Reparametrize to remove constraints for optimization (R)

I am teaching myself how to run some Markov models in R, by working through the textbook "Hidden Markov Models for Time Series: An Introduction using R". I am a bit stuck on how to go about implementing something that is mentioned in the text.
So, I have the following function:
f <- function(samples,lambda,delta) -sum(log(outer(samples,lambda,dpois)%*%delta))
Which I can optimize with respect to, say, lambda using:
optim(par, fn=f, samples=x, delta=d)
where "par" is the initial guess for lambda, for some x and d.
This works perfectly fine. However, in the part of the text corresponding to the example I am trying to reproduce, they note: "The parameters delta and lambda are constrained by sum(delta_i) = 1 for i = 1,...,m, delta_i > 0, and lambda_i > 0. It is therefore necessary to reparametrize if one wishes to use an unconstrained optimizer such as nlm. One possibility is to maximize the likelihood with respect to the 2m-1 unconstrained parameters."
The unconstrained parameters are given by
eta<-log(lambda)
tau <- log(delta/(1-sum(delta)))
I don't entirely understand how to go about implementing this. How would I write a function to optimize over this transformed parameter space?
When using optim() without parameter transformations, like so:
simpleFun <- function(x)
  (x - 3)^2
out = optim(par = 5,
            fn = simpleFun)
the parameter estimates are obtained via out$par, which is 3 in this
case, as you might expect. Alternatively, you can wrap your function
f in a transformation of the parameters, like so:
out = optim(par=5,
fn=function(x)
# apply the transformation x -> x^3
simpleFun(x^3))
and now, to get the correct set of optimal parameters for your
function, you apply the same transformation to the parameter
estimates, as in:
(out$par)^3
#> 2.99741
(and yes, the parameter estimate is slightly different. For this contrived
example, you could set method="BFGS" for a slightly better estimate. Anyhow, this goes to show that the choice of transformation does matter in
some cases, but that's for another discussion...)
To complete the answer, it sounds like you want to use a wrapper like so:
# the function to be optimized
f <- function(samples, lambda, delta)
  -sum(log(outer(samples, lambda, dpois) %*% delta))
out <- optim(# par is now a 2m vector
             par = c(eta1 = 1,
                     eta2 = 1,
                     eta3 = 1,
                     tau1 = 1,
                     tau2 = 1,
                     tau3 = 1),
             # a wrapper that applies the constraints
             fn = function(x, samples){
               # exp() guarantees that the values of lambda are > 0
               lambda = exp(x[1:3])
               # delta is also > 0
               delta = exp(x[4:6])
               # and now it sums to 1
               delta = delta / sum(delta)
               f(samples, lambda, delta)
             },
             samples = samples)
The above guarantees that the parameters passed to f() have the correct constraints, and as long as you apply the same transformation to out$par, optim() will estimate an optimal set of parameters for f().
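To get the estimates back on the natural scale, apply the same transformations to out$par (a sketch, assuming the 3-state setup above):
lambda.hat <- exp(out$par[1:3])
delta.hat  <- exp(out$par[4:6])
delta.hat  <- delta.hat / sum(delta.hat)   # sums to 1 by construction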

R e1071: Balanced Error Rate (BER) as error criterion in tune function

I'm kind of new to R and machine learning in general, so apologies if this seems stupid!
I'm using the e1071 package to tune the parameters of various models. My dataset is very unbalanced and I would like for the error criterion to be Balanced Error Rate... NOT overall classification error. However, I'm stumped as how to achieve this.
Here is my code:
#Find optimal value 'k' value for k-NN model (feature subset).
c <- data_train_sub[1:13]
d <- data_train_sub[,14]
knn2 <- tune.knn(c, d, k = 1:10, tunecontrol = tune.control(sampling = "cross", performances = TRUE, sampling.aggregate = mean)
)
summary(knn2)
plot(knn2)
Which returns this:
Parameter tuning of ‘knn.wrapper’:
- sampling method: 10-fold cross validation
- best parameters:
k
1
- best performance: 0.001190476
- Detailed performance results:
k error dispersion
1 1 0.001190476 0.003764616
2 2 0.005952381 0.006274360
3 3 0.003557423 0.005728122
4 4 0.005924370 0.008352124
5 5 0.005938375 0.008407043
6 6 0.005938375 0.008407043
7 7 0.007128852 0.008315090
8 8 0.009495798 0.009343555
9 9 0.008305322 0.009751997
10 10 0.008319328 0.009795292
Has anyone any experience of altering the error being assessed in this function?
Look at the class.weights argument of the svm() function:
a named vector of weights for the different classes, used for asymmetric class sizes...
The class weights can easily be calculated like so:
class.weights = table(Xcal$species)/sum(table(Xcal$species))
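If you specifically want the tuning itself to minimise BER rather than overall error, another option worth trying (a suggestion on my part, not part of the answer above) is the error.fun argument of tune.control(), which takes a function of the true and predicted values; a sketch, reusing c and d from the question:
ber <- function(true, predicted) {
  # assumes every class shows up in the predictions, so the table is square
  cm <- table(true, predicted)
  mean(1 - diag(cm) / rowSums(cm))   # average per-class error rate
}
knn_ber <- tune.knn(c, d, k = 1:10,
                    tunecontrol = tune.control(sampling = "cross",
                                               error.fun = ber))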

Help using predict() for kernlab's SVM in R?

I am trying to use the kernlab R package to do Support Vector Machines (SVM). For my very simple example, I have two pieces of training data. A and B.
(A and B are of type matrix - they are adjacency matrices for graphs.)
So I wrote a function which takes A+B and generates a kernel matrix.
> km
[,1] [,2]
[1,] 14.33333 18.47368
[2,] 18.47368 38.96053
Now I use kernlab's ksvm function to generate my predictive model. Right now, I'm just trying to get the darn thing to work - I'm not worried about training error, etc.
So, Question 1: Am I generating my model correctly? Reasonably?
# y are my classes. In this case, A is in class "1" and B is in class "-1"
> y
[1] 1 -1
> model2 = ksvm(km, y, type="C-svc", kernel = "matrix");
> model2
Support Vector Machine object of class "ksvm"
SV type: C-svc (classification)
parameter : cost C = 1
[1] " Kernel matrix used as input."
Number of Support Vectors : 2
Objective Function Value : -0.1224
Training error : 0
So far so good. We created our custom kernel matrix, and then we created a ksvm model using that matrix. We have our training data labeled as "1" and "-1".
Now to predict:
> A
[,1] [,2] [,3]
[1,] 0 1 1
[2,] 1 0 1
[3,] 0 0 0
> predict(model2, A)
Error in as.matrix(Z) : object 'Z' not found
Uh-oh. This is okay. Kind of expected, really. "Predict" wants some sort of vector, not a matrix.
So lets try some things:
> predict(model2, c(1))
Error in as.matrix(Z) : object 'Z' not found
> predict(model2, c(1,1))
Error in as.matrix(Z) : object 'Z' not found
> predict(model2, c(1,1,1))
Error in as.matrix(Z) : object 'Z' not found
> predict(model2, c(1,1,1,1))
Error in as.matrix(Z) : object 'Z' not found
> predict(model2, km)
Error in as.matrix(Z) : object 'Z' not found
Some of the above tests are nonsensical, but that is my point: no matter what I do, I just can't get predict() to look at my data and do a prediction. Scalars don't work, vectors don't work. A 2x2 matrix doesn't work, nor does a 3x3 matrix.
What am I doing wrong here?
(Once I figure out what ksvm wants, then I can make sure that my test data can conform to that format in a sane/reasonable/mathematically sound way.)
If you think about how the support vector machine might "use" the kernel matrix, you'll see that you can't really do this in the way you're trying (as you've seen :-)
I actually struggled a bit with this when I first was using kernlab + a kernel matrix ... coincidentally, it was also for graph kernels!
Anyway, let's first realize that since the SVM doesn't know how to calculate your kernel function, it needs to have these values already calculated between your new (testing) examples, and the examples it picks out as the support vectors during the training step.
So, you'll need to calculate the kernel matrix for all of your examples together. You'll later train on some and test on the others by removing rows + columns from the kernel matrix when appropriate. Let me show you with code.
We can use the example code in the ksvm documentation to load our workspace with some data:
library(kernlab)
example(ksvm)
You'll need to hit return a few (2) times in order to let the plots draw, and let the example finish, but you should now have a kernel matrix in your workspace called K. We'll need to recover the y vector that it should use for its labels (as it has been trampled over by other code in the example):
y <- matrix(c(rep(1,60),rep(-1,60)))
Now, pick a subset of examples to use for testing
holdout <- sample(1:ncol(K), 10)
From this point on, I'm going to:
Create a training kernel matrix named trainK from the original K kernel matrix.
Create an SVM model from my training set trainK
Use the support vectors found from the model to create a testing kernel matrix testK ... this is the weird part. If you look at the code in kernlab to see how it uses the support vector indices, you'll see why it's being done this way. It might be possible to do this another way, but I didn't see any documentation/examples on predicting with a kernel matrix, so I'm doing it "the hard way" here.
Use the SVM to predict on these features and report accuracy
Here's the code:
trainK <- as.kernelMatrix(K[-holdout,-holdout]) # 1
m <- ksvm(trainK, y[-holdout], kernel='matrix') # 2
testK <- as.kernelMatrix(K[holdout, -holdout][,SVindex(m), drop=F]) # 3
preds <- predict(m, testK) # 4
sum(sign(preds) == sign(y[holdout])) / length(holdout) # == 1 (perfect!)
That should just about do it. Good luck!
Responses to comment below
what does K[-holdout,-holdout] mean? (what does the "-" mean?)
Imagine you have a vector x, and you want to retrieve elements 1, 3, and 5 from it, you'd do:
x.sub <- x[c(1,3,5)]
If you want to retrieve everything from x except elements 1, 3, and 5, you'd do:
x.sub <- x[-c(1,3,5)]
So K[-holdout,-holdout] returns all of the rows and columns of K except for the rows we want to holdout.
What are the arguments of your as.kernelMatrix - especially the [,SVindex(m),drop=F] argument (which is particulary strange because it looks like that entire bracket is a matrix index of K?)
Yeah, I inlined two commands into one:
testK <- as.kernelMatrix(K[holdout, -holdout][,SVindex(m), drop=F])
Now that you've trained the model, you want to give it a new kernel matrix with your testing examples. K[holdout,] would give you only the rows which correspond to the training examples in K, and all of the columns of K.
SVindex(m) gives you the indexes of your support vectors from your original training matrix -- remember, those rows/cols have holdout removed. So for those column indices to be correct (ie. reference the correct sv column), I must first remove the holdout columns.
Anyway, perhaps this is more clear:
testK <- K[holdout, -holdout]
testK <- testK[,SVindex(m), drop=FALSE]
Now testK only has the rows of our testing examples and the columns that correspond to the support vectors. testK[1,1] will have the value of the kernel function computed between your first testing example, and the first support vector. testK[1,2] will have the kernel function value between your 1st testing example and the second support vector, etc.
Update (2014-01-30) to answer comment from #wrahool
It's been a while since I've played with this, so the particulars of kernlab::ksvm are a bit rusty, but in principle this should be correct :-) ... here goes:
what is the point of testK <- K[holdout, -holdout] - aren't you removing the columns that correspond to the test set?
Yes. The short answer is that if you want to predict using a kernel matrix, you have to supply a matrix whose dimensions are rows (new examples) by support vectors. For each row of the matrix (a new example you want to predict on), the values in the columns are simply the kernel function evaluated between that example and the corresponding support vector.
The call to SVindex(m) returns the index of the support vectors given in the dimension of the original training data.
So, first doing testK <- K[holdout, -holdout] gives me a testK matrix with the rows of the examples I want to predict on, and the columns are from the same examples (dimension) the model was trained on.
I further subset the columns of testK by SVindex(m) to only give me the columns which (now) correspond to my support vectors. Had I not done the first [, -holdout] selection, the indices returned by SVindex(m) may not correspond to the right examples (unless all N of your testing examples are the last N columns of your matrix).
Also, what exactly does the drop = FALSE condition do?
It's a bit of defensive coding to ensure that after the indexing operation is performed, the object that is returned is of the same type as the object that was indexed.
In R, if you index only one dimension of a 2D (or higher) object, you get back an object of the lower dimension. I don't want to pass a numeric vector into predict() because it wants a matrix.
For instance
x <- matrix(rnorm(50), nrow=10)
class(x)
[1] "matrix"
dim(x)
[1] 10 5
y <- x[, 1]
class(y)
[1] "numeric"
dim(y)
NULL
The same will happen with data.frames, etc.
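And for contrast, keeping drop = FALSE preserves the matrix class and its dimensions (continuing the sketch above):
z <- x[, 1, drop = FALSE]
class(z)
[1] "matrix"
dim(z)
[1] 10 1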
First off, I have not used kernlab much. But simply looking at the docs, I do see working examples for the predict.ksvm() method. Copying and pasting, and omitting the prints to screen:
## example using the promotergene data set
data(promotergene)
## create test and training set
ind <- sample(1:dim(promotergene)[1],20)
genetrain <- promotergene[-ind, ]
genetest <- promotergene[ind, ]
## train a support vector machine
gene <- ksvm(Class~., data=genetrain, kernel="rbfdot",
             kpar=list(sigma=0.015), C=70, cross=4, prob.model=TRUE)
## predict gene type probabilities on the test set
genetype <- predict(gene,genetest,type="probabilities")
That seems pretty straightforward: use random sampling to generate a training set genetrain and its complement genetest, fit via ksvm, then call the predict() method with the fit and new data in a matching format. This is very standard.
You may find the caret package by Max Kuhn useful. It provides a general evaluation and testing framework for a variety of regression, classification and machine learning methods and packages, including kernlab, and contains several vignettes plus a JSS paper.
Steve Lianoglou is right.
In kernlab it is a bit weird: when predicting, it requires the input kernel matrix between each test example and the support vectors, and you need to build this matrix yourself.
That is, a test matrix of dimension [n x m], where n is the number of test samples and m is the number of support vectors in the learned model (ordered in the sequence of SVindex(model)).
Example code
trmat <- as.kernelMatrix(kernels[trainidx,trainidx])
tsmat <- as.kernelMatrix(kernels[testidx,trainidx])
#training
model = ksvm(x=trmat, y=trlabels, type = "C-svc", C = 1)
#testing
thistsmat = as.kernelMatrix(tsmat[,SVindex(model)])
tsprediction = predict(model, thistsmat, type = "decision")
kernels is the input kernel matrix. trainidx and testidx are ids for training and test.
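For completeness, here is one hypothetical way those identifiers could be built (kernels being an N x N kernel matrix and labels a vector of the N class labels; both names are assumptions, not part of the answer):
N        <- nrow(kernels)
testidx  <- sample(seq_len(N), size = round(0.2 * N))   # hold out ~20%
trainidx <- setdiff(seq_len(N), testidx)
trlabels <- labels[trainidx]
tslabels <- labels[testidx]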
Build the labels yourself from the elements of the solution. Use this alternate predictor method, which takes a ksvm model (m) and data in the original training format (d):
predict.alt <- function(m, d){
  sign(d[, m@SVindex] %*% m@coef[[1]] - m@b)
}
K is a kernelMatrix for training. For validation's sake, if you run predict.alt on the training data you will notice that the alternate predictor's -1/+1 output tracks the 0/1 fitted values returned by ksvm, while the native predictor behaves in an unexpected way (here it returns all zeros):
aux <- data.frame(fit = kout@fitted, native = predict(kout, K), alt = predict.alt(m = kout, d = as.matrix(K)))
sample_n(aux, 10)   # sample_n() is from dplyr
fit native alt
1 0 0 -1
100 1 0 1
218 1 0 1
200 1 0 1
182 1 0 1
87 0 0 -1
183 1 0 1
174 1 0 1
94 1 0 1
165 1 0 1
