Select Features for Naive Bayes Classification in R

I want to use a naive Bayes classifier to make some predictions.
So far I can make the predictions with the following (sample) code in R:
library(klaR)
library(caret)

# simulate some example data
Faktor <- sample(LETTERS[1:4], 10000, replace = TRUE, prob = c(0.1, 0.2, 0.65, 0.05))
alter <- abs(rnorm(10000, 30, 5))
HF <- abs(rnorm(10000, 1000, 200))
Diffalq <- rnorm(10000)
Geschlecht <- sample(c("Mann", "Frau", "Firma"), 10000, replace = TRUE)
data <- data.frame(Faktor, alter, HF, Diffalq, Geschlecht)

# 10-fold split; hold out fold 1 as the test set
set.seed(5678)
flds <- createFolds(data$Faktor, 10)
train <- data[-flds$Fold01, ]
test <- data[flds$Fold01, ]

# fit a kernel naive Bayes model on the selected features and predict on the test fold
features <- c("HF", "alter", "Diffalq", "Geschlecht")
formel <- as.formula(paste("Faktor ~", paste(features, collapse = " + ")))
nb <- NaiveBayes(formel, train, usekernel = TRUE)
pred <- predict(nb, test)
test$Prognose <- as.factor(pred$class)
Now I want to improve this model by feature selection. My real data set has about 100 features.
So my question is: what would be the best way to select the most important features for naive Bayes classification?
Is there any paper for reference?
I tried the following line of code, but unfortunately it did not work:
rfe(train[, 2:5], train[, 1], sizes = 1:4, rfeControl = rfeControl(functions = ldaFuncs, method = "cv"))
EDIT: It gives me the following error message (translated from German):
Error in { : task 1 failed - "non-numeric argument to binary operator"
Calls: rfe ... rfe.default -> nominalRfeWorkflow -> %op% -> <Anonymous>
You may want to reproduce this on your machine.
How can I adjust the rfe() call to get a working recursive feature elimination?

This error appears to be due to ldaFuncs: apparently it does not handle factors when given matrix input. The main problem can be recreated with your test data using
mm <- ldaFuncs$fit(train[2:5], train[,1])
ldaFuncs$pred(mm,train[2:5])
# Error in FUN(x, aperm(array(STATS, dims[perm]), order(perm)), ...) :
# non-numeric argument to binary operator
And this only seems to happen if you include the factor variable. For example,
mm <- ldaFuncs$fit(train[2:4], train[,1])
ldaFuncs$pred(mm,train[2:4])
does not return the same error (and appears to work correctly). Again, this only appears to be a problem when you use the matrix syntax. If you use the formula/data syntax, you don't have the same problem. For example
mm <- ldaFuncs$fit(Faktor ~ alter + HF + Diffalq + Geschlecht, train)
ldaFuncs$pred(mm,train[2:5])
appears to work as expected. This means you have a few different options. Either you can use the rfe() formula syntax like
rfe(Faktor ~ alter + HF + Diffalq + Geschlecht, train, sizes = 1:4,
    rfeControl = rfeControl(functions = ldaFuncs, method = "cv"))
Or you could expand the dummy variables yourself with something like
train.ex <- cbind(train[,1], model.matrix(~.-Faktor, train)[,-1])
rfe(train.ex[, 2:6],train.ex[, 1], ...)
But this approach doesn't keep track of which dummy columns belong to the same factor, so it's not ideal.
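For completeness, here is a fuller, untested sketch of that second option, filling in the control settings from the question and using data.frame() instead of cbind() so that Faktor stays a factor (cbind() with a numeric matrix would coerce it to integer codes):
train.ex <- data.frame(Faktor = train$Faktor,
                       model.matrix(~ . - Faktor, train)[, -1])
rfe(train.ex[, -1], train.ex[, 1],
    sizes = 1:4,
    rfeControl = rfeControl(functions = ldaFuncs, method = "cv"))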

Related

Error while using the weights option in nlme in R

Sorry, this is cross-posted from https://stats.stackexchange.com/questions/593717/nlme-regression-with-weights-syntax-in-r, but I thought it might be more appropriate to post it here.
I am trying to fit a power curve to model some observations using nlme. However, I know some observations are less reliable than others (the reliability of each OBSID is reflected in WEIV in the dummy data), relatively independently of the variance; I quantified this beforehand and wish to include it as weights in my model. Moreover, I know part of my variance is correlated with my independent variable, so I cannot use the variance directly as weights.
This is my model:
library(nlme)
library(dplyr)  # for filter()

coeffs_start <- lm(log(DEPV) ~ log(INDV), filter(testdummy10, DEPV != 0))$coefficients
nlme_fit <- nlme(DEPV ~ a * INDV^b,
                 data = testdummy10,
                 fixed = a + b ~ 1,
                 random = a ~ 1,
                 groups = ~ PARTID,
                 start = c(a = exp(coeffs_start[1]), b = coeffs_start[2]),
                 verbose = FALSE,
                 method = "REML",
                 weights = varFixed(~WEIV))
This is some sample dummy data (I know it is not a great fit, but it's fake data anyway): https://github.com/FlorianLeprevost/dummydata/blob/main/testdummy10.csv
This runs well without the "weights" argument, but when I add it I get the error below, and I am not sure why, because I believe the syntax is correct:
Error in recalc.varFunc(object[[i]], conLin) :
dims [product 52] do not match the length of object [220]
In addition: Warning message:
In conLin$Xy * varWeights(object) :
longer object length is not a multiple of shorter object length
Thanks in advance!
This looks like a very long-standing bug in nlme. I have a patched version on GitHub, which you can install via remotes::install_github() as below ...
remotes::install_github("bbolker/nlme")
testdummy10 <- read.csv("testdummy10.csv") |> subset(DEPV>0 & INDV>0)
coeffs_start <- coef(lm(log(DEPV)~log(INDV), testdummy10))
library(nlme)
nlme_fit <- nlme(DEPV ~ a * INDV^b,
                 data = testdummy10,
                 fixed = a + b ~ 1,
                 random = a ~ 1,
                 groups = ~ PARTID,
                 start = c(a = exp(coeffs_start[1]),
                           b = coeffs_start[2]),
                 verbose = FALSE,
                 method = "REML",
                 weights = varFixed(~WEIV))
packageVersion("nlme") ## 3.1.160.9000
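Once the fit runs, an optional sanity check (just a sketch) is to inspect the summary, which should list the fixed-weights variance function:
summary(nlme_fit)  # the variance function section should show fixed weights with formula ~WEIV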

Syntax error when trying to parse Bayesian model using RJAGS

I'm running the following code to attempt Bayesian modelling using rjags, but I'm getting the syntax error below.
Error in jags.model(file = "RhoModeldef.txt", data = ModelData, inits
= ModelInits, : Error parsing model file: syntax error on line 4 near "~"
RhoModel.def <- function() {
  for (s in 1:S) {
    log(rhohat[s]) ~ dnorm(log(rho[s]), log(rhovar[s]))
    rho[s] ~ dgamma(Kappa, Beta)
  }
  Kappa ~ dt(0, 2.5, 1) # dt(0, pow(2.5,-2), 1) https://stackoverflow.com/questions/34935606/cauchy-prior-in-jags https://arxiv.org/pdf/0901.4011.pdf
  sig.k <- abs(Kappa)
  Beta ~ dt(0, 2.5, 1)
  sig.b <- abs(Beta)
}
library(rjags)

S <- length(africasad21) - 1 # integer
Rhohat <- afzip30$Rho # vector
Rhovar <- afzip30$RhoVar # vector
ModelData <- list(S = S, rhohat = Rhohat, rhovar = Rhovar)
ModelInits <- list(list(rho = rep(1, S), Kappa = 0.1, Beta = 0.1))
Model.1 <- jags.model(file = 'RhoModeldef.txt', data = ModelData, inits = ModelInits,
                      n.chains = 4, n.adapt = 100)
Does anyone have any ideas about how I might fix this? I'm thinking it might have something to do with my attempt to fit a logged model. Please let me know if more details are needed.
Thanks!
Line 4 of the file 'RhoModeldef.txt' is most likely this one:
log(rhohat[s]) ~ dnorm(log(rho[s]),log(rhovar[s]))
JAGS does not allow log transformations on the left-hand side of stochastic relations, only deterministic ones. Given that you are providing rhohat as data, the easiest solution is to do the log transformation in R and drop that part in JAGS, i.e.:
log_rhohat[s] ~ dnorm(log(rho[s]), log(rhovar[s]))
ModelData <-list(S=S, log_rhohat=log(Rhohat), rhovar=Rhovar)
Alternatively, you could use dlnorm rather than dnorm in JAGS.
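For illustration, that alternative would look something like the sketch below, which keeps the original (and possibly questionable) use of log(rhovar[s]) as the precision and leaves the data on the original scale:
rhohat[s] ~ dlnorm(log(rho[s]), log(rhovar[s]))
# ModelData can then keep rhohat = Rhohat untransformed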
Does that solve your problem? Your example is not self-contained so I can't check myself, but I guess it should work now.

Error when passing the "weights" argument to the coxph function using riskRegression in R

I am trying to use inverse probability of treatment weighting in a cause-specific Cox regression using the CSC function in the riskRegression package.
I calculated the weights without a problem, but when I try to pass the weights to the CSC function I get the following error message:
Error in eval(extras, data, env) :
..1 used in an incorrect context, no ... to look in
A complete reproducible example looks like this:
library(ipw)
library(cmprsk)
library(survival)
library(riskRegression)
data(mgus2)
# get some example data
mgus2$etime <- with(mgus2, ifelse(pstat==0, futime, ptime))
mgus2$event <- with(mgus2, ifelse(pstat==0, 2*death, 1))
mgus2$event <- factor(mgus2$event, 0:2, labels=c("censor", "pcm", "death"))
mgus2$age_cat <- cut(mgus2$age, breaks=seq(0, 100, 25))
mgus2$sex <- ifelse(mgus2$sex=="F", 0, 1)
# remove NA
mgus2 <- subset(mgus2, !is.na(mspike))
# estimate inverse probability weights
weights <- ipwpoint(sex, "binomial", "logit", denominator = ~ age_cat + mspike,
                    data = mgus2)
mgus2$weights <- weights$ipw.weights
# rerun cox model using weights
mod2 <- CSC(Hist(etime, event) ~ sex + age_cat + mspike, cause = "pcm",
            surv.type = "hazard", fitter = "coxph", data = mgus2,
            weights = weights)
I know from the documentation that the CSC function calls the coxph function internally, passing additional arguments to it via the ... syntax. Other arguments can be passed just fine, but the weights argument always produces the error message above.
How can I fix this?
UPDATE:
I have contacted the package maintainer and he has already fixed the bug. It should work fine now, with one small difference: instead of weights=weights one has to use weights=mgus2$weights.
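For reference, the corrected call from the example above would then look like this (a sketch assuming the updated riskRegression is installed):
mod2 <- CSC(Hist(etime, event) ~ sex + age_cat + mspike, cause = "pcm",
            surv.type = "hazard", fitter = "coxph", data = mgus2,
            weights = mgus2$weights)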

R mlogit package: use LAPACK instead of LINPACK

I am estimating a fairly simple McFadden choice model using a very large data set (101.6 million unit-alternatives). I can estimate this model just fine in Stata using the asclogit command, but when I try to use the mlogit package in R, I get the following error:
library(mlogit)
region1 <- mlogit(chosen ~ mean_log.wage + mean_log.rent + bornNear + Dim.1 + regionFE | 0,
                  shape = "long", chid.var = "chid", alt.var = "alternatives", data = ready)
Error in qr.default(na.omit(X)) : too large a matrix for LINPACK
Calls: mlogit ... model.matrix -> model.matrix.mFormula -> qr -> qr.default
If I look at the source code of qr.R, it's clear that the number of elements in my design matrix exceeds the LINPACK limit of 2,147,483,647. However, no such limit exists for LAPACK (as far as I can tell).
From qr.R:
qr.default <- function(x, tol = 1e-07, LAPACK = FALSE, ...)
{
    x <- as.matrix(x)
    if(is.complex(x))
        return(structure(.Internal(La_qr_cmplx(x)), class = "qr"))
    ## otherwise :
    if(LAPACK)
        return(structure(.Internal(La_qr(x)), useLAPACK = TRUE, class = "qr"))
    ## else "Linpack" case:
    p <- as.integer(ncol(x))
    if(is.na(p)) stop("invalid ncol(x)")
    n <- as.integer(nrow(x))
    if(is.na(n)) stop("invalid nrow(x)")
    if(1.0 * n * p > 2147483647) stop("too large a matrix for LINPACK")
    ...
qr() appears to be called in the mFormula method of mlogit, when model.matrix is being created, and probably while checking NAs. But I can't tell if there is a way to pass LAPACK = TRUE to mlogit, or if there is a way to skip the NA checking.
I'm hoping #YvesCroissant will see this.
As I mentioned, I can estimate this model just fine in Stata, so it's not a question of resources. My Stata license is not portable, however, which is why I would like to use R.
Thanks to Julius' comment and this post on namespaces in R, I figured out the answer. I added the following code right after my library statements:
source("mymFormula.R")
tmpfun <- get("model.matrix.mFormula", envir = asNamespace("mlogit"))
environment(mymFormula) <- environment(tmpfun)
attributes(mymFormula) <- attributes(tmpfun) # don't know if this is really needed
assignInNamespace("model.matrix.mFormula", mymFormula, ns="mlogit")
mymFormula.R is an R script where I copy/pasted the contents of mlogit:::model.matrix.mFormula and added mymFormula <- before the function invocation at the top of the file.
I viewed the contents of mlogit:::model.matrix.mFormula by typing trace(mlogit:::model.matrix.mFormula, edit=TRUE) in RStudio. (Thanks to this answer for help on how to do that.)
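The answer does not show the exact edit made inside mymFormula.R, but based on the traceback above it would presumably be a change along these lines (hypothetical, for illustration only; X stands for the design matrix built inside the function):
qr(na.omit(X))                 # original call, routed to LINPACK by default
qr(na.omit(X), LAPACK = TRUE)  # modified call, using LAPACK instead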

"promise already under evaluation" error in R caret's rfe function

I have a matrix X and a vector Y which I use as arguments to the rfe function from the caret package. The call is as simple as rfe(X, Y).
I get a weird error which I can't decipher:
promise already under evaluation: recursive default argument reference or earlier problems?
EDIT:
Here is a reproducible example for the first 5 rows of my data:
library(caret)
X_values = c(29.04,96.57,4.57,94.23,66.81,26.71,69.01,77.06,49.52,97.59,47.57,64.07,24.25,11.27,77.30,90.99,44.05,30.96,96.32,16.04)
X = matrix(X_values, nrow = 5, ncol=4)
Y = c(5608.11,2916.61,5093.05,3949.35,2482.52)
rfe(X, Y)
My R version is 3.2.3; the caret package version is 6.0-76.
Does anybody know what this is?
There are two problems with your code:
1. You need to specify the function/algorithm that you want to fit. (This is what causes the error message you get. I am unsure why rfe throws such a cryptic error; it makes debugging difficult, indeed.)
2. You need to name the columns in your input data.
The following works:
library(caret)
X_values = c(29.04,96.57,4.57,94.23,66.81,26.71,69.01,77.06,49.52,97.59,47.57,64.07,24.25,11.27,77.30,90.99,44.05,30.96,96.32,16.04)
X = matrix(X_values, nrow = 5, ncol=4)
Y = c(5608.11,2916.61,5093.05,3949.35,2482.52)
ctrl <- rfeControl(functions = lmFuncs)
colnames(X) <- letters[1:ncol(X)]
set.seed(123)
rfe(X, Y, rfeControl = ctrl)
I chose a linear model for the rfe.
The reason for the warning messages is the low number of observations in your data during cross validation. You probably also want to set the sizes argument to get a meaningful feature elimination.
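For example, a sketch reusing the objects from above that evaluates subsets of one to three predictors:
rfe(X, Y, sizes = 1:3, rfeControl = ctrl)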
