Univariate feature selection in caret - r

I would like to select features based on anovaScores in caret. I can get the scores with scores <- apply(train_data, 2, anovaScores, train_data$target) and then sort the features and select the n best ones, but I don't know how to do it with sbfControl. The documentation for anovaScores says: "The functions described here are passed to the algorithm via the functions argument of sbfControl."
Doing
featSel_ctrl <- sbfControl(functions = anovaScores)
featSel <- sbf(target ~., data=train_data, sbfControl = featSel_ctrl)
doesn't work. It produces an 'object of type 'closure' is not subsettable' error.

The functions argument expects a list of several functions, and anovaScores is only one piece of it; you are leaving out the other elements. See the documentation, which has some details. If you are doing classification, anovaScores is already being used.
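One way to act on this, sketched under assumptions: start from caret's built-in caretSBF list, which already scores predictors with anovaScores when the outcome is a factor, and replace only its filter element so that the n lowest p-values are kept. The names top_n_sbf and n_keep are made up for illustration, and train_data with a factor column target is taken from the question.

```r
library(caret)

# caretSBF is a list with the elements summary, fit, pred, score and filter.
# Copy it and override only the filter step.
top_n_sbf <- caretSBF

n_keep <- 5  # hypothetical number of features to retain

# The filter receives the per-predictor scores (p-values here) and must
# return one logical per predictor: TRUE means "keep".
top_n_sbf$filter <- function(score, x, y) {
  rank(score, na.last = TRUE, ties.method = "first") <= n_keep
}

featSel_ctrl <- sbfControl(functions = top_n_sbf, method = "cv", number = 5)

# Assumes train_data exists and train_data$target is a factor:
# featSel <- sbf(target ~ ., data = train_data, sbfControl = featSel_ctrl)
```

Because the filter is applied inside each resample, the selected set can differ from fold to fold; that is expected with sbf.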

Related

Error in eval(parse()) - r unable to find argument input

I am very new to R, and this is my first time encountering the eval() function. I am trying to use the med and boot.med functions from the mma package to conduct mediation analysis. med and boot.med take in models such as linear models, and data frames that specify mediators and predictors, and then estimate the mediation effect of each mediator.
The author of the package gives the flexible option of specifying one's own custom.function. From the source code of med, it can be seen that custom.function is passed to eval(). So I tried inserting the gbmt function as the custom function. However, R kept giving me the error message: Error during wrapup: Number of trees to be used in prediction must be provided. I have been searching online for days and tried many ways of specifying the number-of-trees parameter n.trees, but nothing works (I believe others have raised similar issues: post 1, post 2).
The following code is part of the source code of the med function:
cf1 = gsub("responseY", "y[,j]", custom.function[j])
cf1 = gsub("dataset123", "x2", cf1)
cf1 = gsub("weights123", "w", cf1)
full.model[[j]] <- eval(parse(text = cf1))
One custom function example the author gives in the package documentation is as follows:
temp1<-med(data=data.bin,n=2,custom.function = 'glm(responseY~.,data=dataset123,family="quasibinomial",
weights=weights123)')
Here glm is the custom function. This example code works and you can replicate it easily (if you have mma installed and loaded). However, when I try to use the gbmt function on a survival object, I get errors. Here is what my code looks like:
temp1 <- med(data = data.surv,n=2,type = "link",
custom.function = 'gbmt(responseY ~.,
data = dataset123,
distribution = dist,
train_params = start_stop,
cv_folds=10,
keep_gbm_data = TRUE,
)')
Does anyone have any idea how the number-of-trees argument n.trees can be added somewhere in the above code?
Many thanks in advance!
Update: in order to replicate the example code, please install mma and try the following:
library("mma")
data("weight_behavior") ##binary x #binary y
x=weight_behavior[,c(2,4:14)]
pred=weight_behavior[,3]
y=weight_behavior[,15]
data.bin<-data.org(x,y,pred=pred,contmed=c(7:9,11:12),binmed=c(6,10), binref=c(1,1),catmed=5,catref=1,predref="M",alpha=0.4,alpha2=0.4)
temp1<-med(data=data.bin,n=2) #or use self-defined final function
temp1<-med(data=data.bin,n=2, custom.function = 'glm(responseY~.,data=dataset123,family="quasibinomial",
weights=weights123)')
I changed the custom.function to gbmt and used a survival object as responseY and the error occurs. When I use the gbmt function on my data outside the med function, there is no error.
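For readers new to eval(), the substitution quoted from med above can be reproduced in isolation with base R. The sketch below is not mma's code: the objects x2, y, w and j are toy stand-ins that only borrow the names appearing in the quoted snippet, and glm is used because it ships with R.

```r
# Toy stand-ins for the objects the quoted snippet refers to (assumptions).
set.seed(1)
x2 <- data.frame(a = rnorm(50), b = rnorm(50))
y  <- cbind(rbinom(50, 1, 0.5))   # one response column
w  <- rep(1, 50)                  # unit weights
j  <- 1

custom.function <- 'glm(responseY ~ ., data = dataset123, family = "quasibinomial", weights = weights123)'

# Same three substitutions as in the quoted source.
cf1 <- gsub("responseY", "y[,j]", custom.function[j])
cf1 <- gsub("dataset123", "x2", cf1)
cf1 <- gsub("weights123", "w", cf1)

print(cf1)                       # the string that will be parsed and evaluated
fit <- eval(parse(text = cf1))   # runs the call exactly as written in the string
```

Since the string is evaluated as an ordinary call, every argument the fitting function needs has to be written inside the string, and any object the string names (such as dist or start_stop in the question) has to be visible from where eval() runs.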

Specifying custom weights for the nonparametric estimate of spatially-varying relative risk in spatstat

Is there a way to specify weights in relrisk.ppp function in spatstat (version 1.63-3)?
The relrisk.ppp function calls the density.ppp function, which does allow users to specify their own weights.
For example, let us build upon the provided spatstat.data::urkiola data where, instead of individual trees, the locations are tree stands and we have a second numeric mark for the frequency of trees at each point-location:
urkiola_new <- spatstat.data::urkiola
urkiola_new$marks <- data.frame("type" = urkiola_new$marks, "freq" = rpois(urkiola_new$n, 3))
f1 <- spatstat::relrisk(urkiola_new, weights = urkiola_new$marks$freq)
When using the urkiola_new in a call of relrisk, urkiola_new is caught by stopifnot(is.multitype(X)) in relrisk.ppp. I next tried specifying the weights separately as a vector while using the original urkiola data,
f2 <- spatstat::relrisk(urkiola, weights = urkiola_new$marks$freq)
but was stopped by an error from the pixellate.ppp function, which is called inside density.ppp:
Error in pixellate.ppp(x, ..., padzero = TRUE) : length(weights) == npoints(x) || length(weights) == 1 is not TRUE
The same error occurs when I convert the weights into a list
urkiola_weights <- split(urkiola_new$marks$freq, urkiola_new$marks$type)
f3 <- spatstat::relrisk(urkiola, weights = urkiola_weights)
I suspect there is a clever way to specify the weights, but it has escaped me so far. Any suggestions or guidance would be helpful, thank you!
The function relrisk.ppp is not currently designed to handle weights. The help entry for relrisk.ppp does not mention weights.
The example above does not work because relrisk.ppp applies density.ppp separately to the sub-patterns of points of each type, and the extra argument weights is the wrong length for these sub-patterns.
I will take this question as a feature request, to add this capability to relrisk.ppp. It should be done soon.
Update: this is now implemented in the development version, spatstat 1.64-0.018 available at the spatstat github repository
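For anyone on a version without that capability, the per-type behaviour described above suggests a manual route, sketched here under assumptions: estimate a weighted intensity for each type with density.ppp, passing only that type's weights, and take the pixelwise ratio. The bandwidth sigma = 20 and the Poisson frequencies are arbitrary illustrations, and this is not claimed to reproduce what relrisk.ppp does internally (it skips bandwidth selection and standard errors, for instance).

```r
library(spatstat)

set.seed(42)
X    <- urkiola                       # marks: factor with levels birch, oak
freq <- rpois(npoints(X), 3) + 1      # made-up stand frequencies (>= 1)
type <- marks(X)

# One weighted kernel estimate per type; each weights vector has the same
# length as the sub-pattern it is paired with.
sigma <- 20                           # arbitrary bandwidth, an assumption
d_birch <- density(unmark(X[type == "birch"]),
                   weights = freq[type == "birch"], sigma = sigma)
d_oak   <- density(unmark(X[type == "oak"]),
                   weights = freq[type == "oak"], sigma = sigma)

# Pixelwise weighted proportion of "oak" intensity.
p_oak <- eval.im(d_oak / (d_birch + d_oak))
```

Using the same sigma for both types keeps the two estimates comparable; plot(p_oak) shows the resulting surface.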

R implementation of kohonen SOMs: prediction error due to data type.

I have been trying to run an example code for supervised kohonen SOMs from https://clarkdatalabs.github.io/soms/SOM_NBA . When I tried to predict test set data I got the following error:
pos.prediction <- predict(NBA.SOM3, newdata = NBA.testing)
Error in FUN(X[[i]], ...) :
Data type not allowed: should be a matrix or a factor
I tried newdata = as.matrix(NBA.testing) but it did not help. Neither did as.factor().
Why does it happen? And how can I fix that?
You should pass one more argument to the predict function, whatmap, and set its value to 1.
The code would be like:
pos.prediction <- predict(NBA.SOM3, newdata = NBA.testing, whatmap = 1)
To verify the prediction result, you can check using:
table(NBA$Pos[-training_indices], pos.prediction$predictions[[2]], useNA = 'always')
The result may differ from the tutorial's, since the tutorial does not call set.seed().
I suggest calling set.seed() with an arbitrary number somewhere before the training phase.
For simplicity, put it once at the very top of your script, e.g.
set.seed(12345)
This will guarantee a reproducible result of your model next time you re-run your script.
Hope that will help.
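As a small base-R illustration of that point (not taken from the tutorial), resetting the seed before a random draw reproduces the draw exactly, provided nothing else consumes random numbers in between.

```r
set.seed(12345)
first_run <- sample(100, 5)

set.seed(12345)                    # same seed again
second_run <- sample(100, 5)

identical(first_run, second_run)   # the two draws match

third_run <- sample(100, 5)        # no reseed: continues the random stream
```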

Error in Bagging with party::cforest

I'm trying to bag conditional inference trees following the advice of Kuhn et al in 'Applied Predictive Modeling', Ch.8:
Conditional inference trees can also be bagged using the cforest function in the party package if the argument mtry is equal to the number of predictors:
library(party)
## The mtry parameter should be the number of predictors (the
## number of columns minus 1 for the outcome).
bagCtrl <- cforest_control(mtry = ncol(trainData) - 1)
baggedTree <- cforest(y ~ ., data = trainData, controls = bagCtrl)
Note there may be a typo in the above code (and also in the package's help file), as discussed here:
R package 'partykit' unused argument in ctree_control
However, when I try to replicate this code using a data frame (trainData in the code above is also a data frame) with more than one independent/predictor variable, I get an error, though it works with just one independent variable:
Some dummy code for simulations:
library(party)
df = data.frame(y = runif(5000), x = runif(5000), z = runif(5000))
bagCtrl <- cforest_control(mtry = ncol(df) - 1)
baggedTree_cforest <- cforest(y ~ ., data = df, control = bagCtrl)
The error message is:
Error: $ operator not defined for this S4 class
Thanks for any help.
As suggested, posting my comment from above as an answer as a general R 'trick' if something expected doesn't work and the program has several libraries loaded:
but what solved it was adding the party namespace explicitly to the function call, so party::cforest() instead of just cforest(). I've also got library(partykit) loaded in my actual program, which too has a cforest() function, and the error could be stemming from there, though both functions are essentially the same
caret::train() is another example where this often pops up
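The masking behind this can be shown without either package. In the sketch below a user-defined sd (a deliberately silly stand-in) hides stats::sd, much as a later library() call can hide an earlier package's cforest(); the :: prefix bypasses the search path.

```r
# A global definition masks the function of the same name in stats.
sd <- function(x) "masked version"

sd(c(1, 2, 3))         # calls the masking definition
stats::sd(c(1, 2, 3))  # explicit namespace: calls the real one

# find() lists every place on the search path that defines the name,
# in the order R would look them up.
find("sd")

rm(sd)                 # remove the mask again
```

With party and partykit both attached, find("cforest") should list both packages, and an unqualified cforest() resolves to whichever comes first in that list.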

Using mRMRe in R

I am currently working on a project where I have to do some feature selection for building a predictive model. I was led to a package in R called mRMRe. I am just trying to work through the example but cannot get it working. The example can be found here - http://www.inside-r.org/packages/cran/mRMRe/docs/mRMR.ensemble.
Here is my code -
data(cgps)
data <- data.frame(target=cgps.ic50, cgps.ge)
mRMR.ensemble(data, 1, rep.int(1, 30))
When I run this code I get the error -
Error in .local(.Object, ...) : data must be of type mRMRe.Data.
I dug a little further and found that you actually have to convert the data to the mRMRe.Data type. So I made this update -
# Update
data <- mRMR.data(data = data.frame(target=cgps.ic50, cgps.ge))
mRMR.ensemble(data, 1, rep.int(1, 30))
but I still get the same error. When I look at the class I have -
> class(data)
[1] "mRMRe.Data"
attr(,"package")
[1] "mRMRe"
So the data is the requested type but the code is still not functional.
My question is if anyone has experience using this package or any help or comments would be appreciated!
Also want to note that in the example from the link - when I load the data
cgps_ic50 -> cgps.ic50
cgps_ge -> cgps.ge
so the names of the data aren't the same as in the example.
With the code you wrote:
data(cgps)
data <- mRMR.data(data = data.frame(target=cgps.ic50, cgps.ge))
mRMR.ensemble(data, 1, rep.int(1, 30))
The function mRMR.ensemble receives the data as its first positional argument, but the first parameter of this function is solution_count.
I understand that your intention in running that example is to find 30 relevant and non-redundant features using the classic mRMR feature selection algorithm, so try this:
data(cgps)
data <- mRMR.data(data = data.frame(target=cgps.ic50, cgps.ge))
mRMR.ensemble(data = data, target_indices = 1,
feature_count = 30, solution_count = 1)
The target_indices are the positions in the original data.frame of the features used to maximize the relevance (correlation or other quality measure for this issue), so features selected in the end will be good for explaining the features indicated in the target_indices.
For example, in a classification problem, we would choose the position of the class variable as the value for the target_indices parameter.
The feature_count parameter indicates the number of variables to be chosen.
The solution_count is not a parameter of the classic mRMR. It indicates the number of mRMR algorithms to be ensembled to get a final feature selection, so if set to 1 it performs only one classic mRMR.
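The positional mix-up in the original call can be illustrated with a stand-in function. ensemble_sketch below is hypothetical and only mimics the argument order described above (solution_count first); it is not the mRMRe implementation.

```r
# Hypothetical stand-in with solution_count as the first formal argument.
ensemble_sketch <- function(solution_count, feature_count, data, target_indices) {
  list(solution_count = solution_count, feature_count = feature_count)
}

# Positional call in the style of the question: the first value lands in
# solution_count, whatever the caller meant it to be.
positional <- ensemble_sketch("my data", 1, rep.int(1, 30))
positional$solution_count

# Named call: each value goes where it was intended.
named <- ensemble_sketch(data = "my data", target_indices = 1,
                         feature_count = 30, solution_count = 1)
named$solution_count
```

Naming the arguments, as in the corrected call above, avoids depending on the order of the formals.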

Resources