adabag boosting function throws error when giving mfinal>10 - r

I have a strange issue: whenever I increase the mfinal argument of the boosting function in the adabag package beyond 10, I get an error. Even with mfinal=9 I get warnings.
My training data has a 7-class dependent variable, 100 independent variables, and around 22000 samples (one class was oversampled with SMOTE using DMwR). The dependent variable is the last column of the training dataset.
library(adabag)
gc()
exp_recog_boo <- boosting(V1 ~ .,data=train_dataS,boos=TRUE,mfinal=9)
Error in 1:nrow(object$splits) : argument of length 0
In addition: Warning messages:
1: In acum + acum1 :
longer object length is not a multiple of shorter object length
Thanks in advance.

My mistake was that I hadn't set the target as a factor beforehand.
Try this:
train$target <- as.factor(train$target)
and check the result with:
str(train$target)

This worked for me:
modelADA <- boosting(lettr ~ ., data = trainAll, boos = TRUE, mfinal = 10, control = rpart.control(minsplit = 0))
Essentially I just told rpart to require a minimum split size of zero to generate a tree, and that eliminated the error. I haven't tested this extensively, so I can't guarantee it's a valid solution (what does a tree with a zero-length leaf actually mean?), but it does prevent the error from being thrown.

I think I found the problem.
Ignore this: if you configure your control with cp = 0, this won't happen. I think that if the first node of a tree makes no improvement (or at least none better than the cp), the tree stays with 0 nodes, so you have an empty tree and that makes the algorithm fail.
EDIT: The problem is that rpart generates trees with only one leaf (node), and the boosting method uses the statement "k <- varImp(arboles[[m]], surrogates = FALSE, competes = FALSE)". When arboles[[m]] is a tree with only one node, this gives you the error.
To solve that, you can modify the boosting method:
Run fix(boosting) and add the lines marked with **.
if (boos == TRUE) {
** k <- 1
** while (k == 1){
boostrap <- sample(1:n, replace = TRUE, prob = pesos)
fit <- rpart(formula, data = data[boostrap, -1],
control = control)
** k <- length(fit$frame$var)
** }
flearn <- predict(fit, newdata = data[, -1], type = "class")
ind <- as.numeric(vardep != flearn)
err <- sum(pesos * ind)
}
This will prevent the algorithm from accepting one-leaf trees, but you have to set cp in the control parameter to 0 to avoid an endless loop.

Just ran into the same problem, and setting the complexity parameter to -1 or minimum split to 0 both work for me with rpart.control, e.g.
library(adabag)
r1 <- boosting(Y ~ ., data = data, boos = TRUE,
mfinal = 10, control = rpart.control(cp = -1))
r2 <- boosting(Y ~ ., data = data, boos = TRUE,
mfinal = 10, control = rpart.control(minsplit = 0))

I also ran into this same problem recently, and this example R script solves it completely!
The main idea is that you need to set the control for rpart (which adabag uses for creating trees, see rpart.control) appropriately, so that at least one split is attempted in every tree.
I'm not totally sure but it appears that your "argument of length 0" may be the result of an empty tree, which can happen since there is a default setting of a "complexity" parameter that tells the function not to attempt a split if the decrease in homogeneity/lack of fit is below certain threshold.
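The empty-tree explanation can be reproduced with rpart alone. The following is a minimal sketch, assuming a made-up data frame `d` (not the asker's data) in which the only available split leaves the majority class unchanged in both children, so the split does not reduce the misclassification count. With the default cp the tree should collapse to a single root node, while cp = -1 keeps the split.

```r
library(rpart)

# Hypothetical toy data: class "A" is the majority for both values of x,
# so splitting on x improves purity but not the misclassification count.
d <- data.frame(
  x = rep(c(0, 1), each = 50),
  y = factor(c(rep("A", 45), rep("B", 5), rep("A", 35), rep("B", 15)))
)

# Default control (cp = 0.01): the split is not worth keeping.
fit_default <- rpart(y ~ x, data = d, method = "class")

# cp = -1: every split that rpart can find is kept.
fit_full <- rpart(y ~ x, data = d, method = "class",
                  control = rpart.control(cp = -1))

n_default <- nrow(fit_default$frame)  # number of nodes with the default cp
n_full    <- nrow(fit_full$frame)     # number of nodes with cp = -1
print(c(default = n_default, full = n_full))
```

A one-node tree has no splits to iterate over, which would be consistent with the "argument of length 0" message in the question.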

Use str() to see the attributes of your data frame. In my case, I just converted my class variable to a factor, and then everything ran.
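To see why the type of the response matters, here is a small sketch using rpart directly (the data frame `df` and its columns are hypothetical). When no method is given, rpart picks one from the response type, so integer class labels are fitted as a regression tree until they are converted to a factor.

```r
library(rpart)

# Hypothetical data: the class labels are stored as integers 1..3.
set.seed(1)
df <- data.frame(target = rep(1:3, each = 20), x = rnorm(60))

fit_num <- rpart(target ~ x, data = df)   # response is numeric
df$target <- as.factor(df$target)
fit_fac <- rpart(target ~ x, data = df)   # response is a factor

str(df$target)
print(c(numeric = fit_num$method, factor = fit_fac$method))
```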

Related

Error while using the weights option in nlme in r

Sorry, this is cross-posted from https://stats.stackexchange.com/questions/593717/nlme-regression-with-weights-syntax-in-r, but I thought it might be more appropriate to post it here.
I am trying to fit a power curve to model some observations in an nlme. However, I know some observations to be less reliable than others (reliability of each OBSID reflected in the WEIV in the dummy data), relatively independent of variance, and I quantified this beforehand and wish to include it as weights in my model. Moreover, I know a part of my variance is correlated with my independent variable so I cannot use directly the variance as weights.
This is my model:
coeffs_start = lm(log(DEPV)~log(INDV), filter(testdummy10,DEPV!=0))$coefficients
nlme_fit <- nlme(DEPV ~ a*INDV^b,
data = testdummy10,
fixed=a+b~ 1,
random = a~ 1,
groups = ~ PARTID,
start = c(a=exp(coeffs_start[1]), b=coeffs_start[2]),
verbose = F,
method="REML",
weights=varFixed(~WEIV))
This is some sample dummy data (I know it is not a great fit but it's fake data anyway) : https://github.com/FlorianLeprevost/dummydata/blob/main/testdummy10.csv
This runs well without the "weights" argument, but when I add it I get this error and I am not sure why because I believe it is the correct syntax:
Error in recalc.varFunc(object[[i]], conLin) :
dims [product 52] do not match the length of object [220]
In addition: Warning message:
In conLin$Xy * varWeights(object) :
longer object length is not a multiple of shorter object length
Thanks in advance!
This looks like a very long-standing bug in nlme. I have a patched version on Github, which you can install via remotes::install_github() as below ...
remotes::install_github("bbolker/nlme")
testdummy10 <- read.csv("testdummy10.csv") |> subset(DEPV>0 & INDV>0)
coeffs_start <- coef(lm(log(DEPV)~log(INDV), testdummy10))
library(nlme)
nlme_fit <- nlme(DEPV ~ a*INDV^b,
data = testdummy10,
fixed=a+b~ 1,
random = a~ 1,
groups = ~ PARTID,
start = c(a=exp(coeffs_start[1]),
b=coeffs_start[2]),
verbose = FALSE,
method="REML",
weights=varFixed(~WEIV))
packageVersion("nlme") ## 3.1.160.9000

cpquery function of bnlearn always returns 0 for simple discrete artificial data

I want to calculate conditional probabilities from a Bayesian network fitted to some binary data I created. However, bnlearn::cpquery always returns 0, even though bnlearn::bn.fit fits a correct model.
# Create data for binary chain network X1->Y->X2 with zeros and ones.
# All base rates are 0.5, increase in probability due to parent = .5 (from .25 to.75)
X1<-c(rep(1,9),rep(1,3),1,rep(1,3),rep(0,3),0,rep(0,3),rep(0,9))
Y<-c(rep(1,9),rep(1,3),0,rep(0,3),rep(1,3),1,rep(0,3),rep(0,9))
X2<-c(rep(1,9),rep(0,3),1,rep(0,3),rep(1,3),0,rep(1,3),rep(0,9))
dag1<-data.frame(X1,Y,X2)
# Fit bayes net to data.
res <- hc(dag1)
fittedbn <- bn.fit(res, data = dag1)
# Fitting works as expected, as seen by graph structure and coefficients in fittedbn:
plot(res)
print(fittedbn)
# Conditional probability query
cpquery(fittedbn, event = (Y==1), evidence = (X1==1), method = "ls")
# Using LW method
cpquery(fittedbn, event = (Y==1), evidence = list(X1 = 1), method = "lw")
'cpquery' just returns 0. I have also tried using the predict function, however this returns an error:
predict(object = fittedbn, node = "Y", data = list(X1=="1"), method = "bayes-lw", prob = TRUE)
# Returns:
# Error in check.data(data, allow.levels = TRUE) :
# the data must be in a data frame.
In the above cpquery the expected result is .75, but 0 is returned. This is not specific to this event or evidence, regardless of what event or evidence I put in (e.g event = (X2==1), evidence = (X1==0) or event = (X2==1), evidence = (X1==0 & Y==1)) the function returns 0.
One thing I tried, as I thought the small amount of observations might be an issue, is to just increase the number of observations (i.e. vertically concatenating the above dataframe with itself a bunch of times), however this did not change the output.
I've seen many threads on cpquery and that it can be fragile, but none indicate this issue. To note: the example in 'cpquery' documentation works as expected, so it seems the problem is not due to my environment.
Any help would be greatly appreciated!
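One thing worth checking, offered as an assumption rather than a confirmed diagnosis: the columns of dag1 are numeric, and as far as I know bnlearn treats numeric columns as continuous (Gaussian) nodes, in which case an event such as Y == 1 has probability zero. The sketch below uses only base R. It rebuilds the data, confirms the expected empirical value of 0.75, and converts the columns to factors (the name `dag1_f` is mine) so that they could be passed to hc() and bn.fit() as discrete variables.

```r
# Rebuild the example data from the question (32 rows of 0/1 values).
X1 <- c(rep(1, 9), rep(1, 3), 1, rep(1, 3), rep(0, 3), 0, rep(0, 3), rep(0, 9))
Y  <- c(rep(1, 9), rep(1, 3), 0, rep(0, 3), rep(1, 3), 1, rep(0, 3), rep(0, 9))
X2 <- c(rep(1, 9), rep(0, 3), 1, rep(0, 3), rep(1, 3), 0, rep(1, 3), rep(0, 9))
dag1 <- data.frame(X1, Y, X2)

# Empirical P(Y = 1 | X1 = 1), computed directly from the rows.
p_y_given_x1 <- mean(dag1$Y[dag1$X1 == 1] == 1)
print(p_y_given_x1)

# Assumption: discrete nodes need factor columns, so convert every column.
dag1_f <- as.data.frame(lapply(dag1, factor))
str(dag1_f)
```

With factor columns, the event and evidence would presumably be written against the factor levels, e.g. (Y == "1") and (X1 == "1").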

Errors while performing caret tuning in R

I am building a predictive model with caret/R and I am running into the following problems:
1. When trying to execute the training/tuning, I get this error:
Error in if (tmps < .Machine$double.eps^0.5) 0 else tmpm/tmps :
missing value where TRUE/FALSE needed
After some research it appears that this error occurs when there are missing values in the data, which is not the case in this example (I confirmed that the data set has no NAs). However, I also read somewhere that missing values may be introduced during the re-sampling routine in caret, which I suspect is what's happening.
2. In an attempt to solve problem 1, I tried pre-processing the data during the re-sampling in caret by removing zero-variance and near-zero-variance predictors and automatically imputing missing values using caret's knn imputation method, preProcess(c('zv','nzv','knnImpute')), but now I get the following error:
Error: Matrices or data frames are required for preprocessing
Needless to say, I checked and confirmed that the input data sets are indeed matrices, so I don't understand why I get this second error.
The code follows:
x.train <- predict(dummyVars(class ~ ., data = train.transformed),train.transformed)
y.train <- as.matrix(select(train.transformed,class))
vbmp.grid <- expand.grid(estimateTheta = c(TRUE,FALSE))
adaptive_trctrl <- trainControl(method = 'adaptive_cv',
number = 10,
repeats = 3,
search = 'random',
adaptive = list(min = 5, alpha = 0.05,
method = "gls", complete = TRUE),
allowParallel = TRUE)
fit.vbmp.01 <- train(
x = (x.train),
y = (y.train),
method = 'vbmpRadial',
trControl = adaptive_trctrl,
preProcess(c('zv','nzv','knnImpute')),
tuneGrid = vbmp.grid)
The only difference between the code for problem (1) and (2) is that in (1), the pre-processing line in the train statement is commented out.
In summary,
-There are no missing values in the data
-Both x.train and y.train are definitely matrices
-I tried using a standard 'repeatedcv' method instead of 'adaptive_cv' in trainControl with the exact same outcome
-Forgot to mention that the outcome class has 3 levels
Does anyone have any suggestions as to what may be going wrong?
As always, thanks in advance
reyemarr
I had the same problem with my data. After some digging I found that I had some Inf (infinite) values in one of the columns.
After taking them out (df <- df %>% filter(!is.infinite(variable))) the computation ran without error.
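A check for NAs alone will not catch this, because Inf is not NA. Here is a minimal base R sketch with a made-up matrix `x` (hypothetical, standing in for something like x.train) that counts non-finite values per column and drops the affected rows. It also shows how a single Inf can turn a summary statistic into a missing value, which is the kind of input that makes an if() condition fail with "missing value where TRUE/FALSE needed".

```r
# Hypothetical predictor matrix with one infinite value and no NAs.
x <- cbind(a = c(1, 2, 3, 4), b = c(0.5, Inf, 1.5, 2.0))

has_na <- anyNA(x)                 # Inf does not count as NA
n_bad  <- colSums(!is.finite(x))   # non-finite values per column
sd_b   <- sd(x[, "b"])             # not a usable number once Inf is present

# Keep only the rows in which every value is finite.
x_clean <- x[rowSums(!is.finite(x)) == 0, , drop = FALSE]

print(has_na)
print(n_bad)
print(sd_b)
print(dim(x_clean))
```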

Own error function including weights for neuralnet in R

My question is closely related to this one: https://stackoverflow.com/questions/25510960/how-to-implement-own-error-function-while-using-neuralnet-package-in-r#= which asks for the general syntax for providing a custom error function in the neuralnet package in R.
That question and its answer are already helpful to me, as I also want to use a custom error function. However, in my error function I would like to include the weights of every iteration. E.g. I would like to add lambda*(weights)^2, where my choice of the constant lambda influences whether the optimization selects larger or smaller weights. Any idea how to achieve that?
Thanks!
I'm pretty sure this is impossible with the current neuralnet code. You can look at it here.
The relevant section starts at about line 350
result <- rprop(weights = weights, threshold = threshold,
response = response, covariate = covariate, learningrate.limit = learningrate.limit,
learningrate.factor = learningrate.factor, stepmax = stepmax,
lifesign = lifesign, lifesign.step = lifesign.step, act.fct = act.fct,
act.deriv.fct = act.deriv.fct, err.fct = err.fct, err.deriv.fct = err.deriv.fct,
algorithm = algorithm, linear.output = linear.output,
exclude = exclude, learningrate.bp = learningrate.bp)
startweights <- weights
weights <- result$weights
step <- result$step
reached.threshold <- result$reached.threshold
net.result <- result$net.result
error <- sum(err.fct(net.result, response))
if (is.na(error) & type(err.fct) == "ce")
if (all(net.result <= 1, net.result >= 0))
error <- sum(err.fct(net.result, response), na.rm = T)
Here you can see that the internal err.fct is explicitly passed only the result of the network and not the weights. If you wanted to pass the weights and a lambda parameter you would need to change the source code. Although perhaps not for the 'faint of heart' this is indeed possible as you can always download the source and begin modifying it.
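To make concrete what such a modified error would have to compute, here is a standalone sketch. It is not part of the neuralnet API: the function name `penalized_sse` and its `weights` and `lambda` arguments are hypothetical, and the weights are assumed to be a list of matrices, one per layer.

```r
# Hypothetical helper: sum-of-squares error plus an L2 penalty on the weights.
# `weights` is assumed to be a list of weight matrices, one per layer.
penalized_sse <- function(net_result, response, weights, lambda) {
  sse     <- sum(0.5 * (response - net_result)^2)
  penalty <- lambda * sum(unlist(weights)^2)
  sse + penalty
}

# Toy inputs: two outputs and three weights spread over two layers.
w <- list(matrix(c(1, -2), nrow = 2), matrix(3, nrow = 1))
err <- penalized_sse(net_result = c(0.5, 1), response = c(1, 1),
                     weights = w, lambda = 0.1)
print(err)
```

Because the penalty depends on the weights, the derivative used in the weight update would need a matching term as well, which is why changing err.fct alone would not be enough.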
I wanted to give this a go. So I downloaded neuralnet.r and edited it:
Changed line 364 from line error <- err.fct(net.result, response) to error <- caseweights %*% err.fct(net.result, response).
Changed line 367 from line error <- sum(err.fct(net.result, response), na.rm = T) to error <- caseweights %*% sum(err.fct(net.result, response), na.rm = T).
Changed line 554 from line err.deriv <- err.deriv.fct(result$net.result, response) to err.deriv <- caseweights * err.deriv.fct(result$net.result, response).
Changed line 588 from line err.deriv <- err.deriv.fct(result$net.result, response) to err.deriv <- caseweights * err.deriv.fct(result$net.result, response).
I also added caseweights to appropriate function arguments, also to neuralnet, and renamed some of the functions so I wouldn't accidentally be using the standard library ones.
Not sure if this works, though. A simple test seems to work but a more complicated one has trouble converging - it's difficult to set caseweights so that they are not too small or too big.

(R) function: object not found: environment depth fine?

I'm puzzled by a function error & would appreciate some insight.
The function, very briefly, automates the multiple processes involved in Boosted Regression Trees using gbm.step & other gbm's.
"gbm.auto" <- function (grids, samples, 3 parameters) {
starts 2 counters, require(gbm), does various small processing jobs with grids & samples
for parameter 1{
for parameter 2{
for parameter 3{
Runs 2 BRTs per parameter-combination loop, generates & iteratively updates a 'best' BRT for each, adds to counters. Extensive use of samples.
}}}
closes the loops, function continues as the first } is still open.
The next BRT can't find samples, even though it's at the same environment depth (1?) as the pre-loop processing jobs which used it successfully. Furthermore, adding "globalsamples<<-samples" after the }}} loop successfully saves the object, suggesting that samples is still available. Adding env1,2 & 3<<-environment() before the {{{ loop, within it & after it results in Environment for all three. Also suggesting it's all the same function environment & samples should be available.
What am I missing here? Thanks in advance!
Edit: exact message:
Error in eval(expr, envir, enclos) : object 'samples' not found
Function - loads removed & compacted but still gives same error message:
"gbm.auto" <-
function (samples, expvar, resvar, tc, lr, bf)
{ # open function
require(gbm)
require(dismo)
# create binary (0/1) response variable, for bernoulli BRTs
samples$brv <- ifelse(samples[resvar] > 0, 1, 0)
brvcol <- which(colnames(samples)=="brv") # brv column number for BRT
for(j in tc){ # permutations of tree complexity
for(k in lr){ # permutations of learning rate
for(l in bf){ # permutations of bag fraction
Bin_Best_Model<- gbm.step(data=samples,gbm.x = expvar, gbm.y = brvcol, family = "bernoulli", tree.complexity = j, learning.rate = k, bag.fraction = l)
}}} # close loops, producing all BRT/GBM objects & continue through model selection
Bin_Best_Simp_Check <- gbm.simplify(Bin_Best_Model) # simplify model
# if best number of variables to remove isn't 0 (i.e. it's worth simplifying), re-run the best model (Bin_Best_Model, using gbm.call to get its values)
# with just-calculated best number of variables to remove, removed. gbm.x asks which number of drops has the minimum mean (lowest point on the line)
# & that calls up the list of predictor variables with those removed, from $pred.list
if(min(Bin_Best_Simp_Check$deviance.summary$mean) < 0)
assign("Bin_Best_Simp", gbm.step(data = samples,
gbm.x = Bin_Best_Simp_Check$pred.list[[which.min(Bin_Best_Simp_Check$deviance.summary$mean)]],
gbm.y = brvcol, family = "bernoulli", tree.complexity = j, learning.rate = k, bag.fraction = l))
}
Read in data:
mysamples<-data.frame(response=round(sqrt(rnorm(5000, mean= 2.5, sd=1.5)^2)),
depth=sqrt(rnorm(5000, mean= 35, sd=24)^2),
temp=rnorm(5000, mean= 15, sd=1.2),
sal=rnorm(5000, mean= 34, sd=0.34))
Run this: gbm.auto(expvar=c(2,3,4),resvar=1,samples=mysamples,tc=2,lr=0.00000000000000001,bf=0.5)
Problem now: this causes a different error because my fake data are somehow wrong. ARGHG!
Edit: rounded the response data to integers and kept shrinking the learning rate until it runs. If it doesn't work for you, add zeroes until it does.
Edit: so this worked on my computer but reading it back to a clean sheet from online fails on a DIFFERENT count:
Error in var(cv.cor.stats, use = "complete.obs") :
no complete element pairs
In cor(y_i, u_i) : the standard deviation is zero
Is it allowed to attach or link to a csv of a small clip of my data? I'm currently burrowing deeper & deeper into bugfixing problems created by using fake data which I'm only using for this question, & thus getting off topic from the actual problem. Exasperation mode on!
Cheers
Edit2: if this is allowed: 1000row 4column csv link here: https://drive.google.com/file/d/0B6LsdZetdypkaC1WYXpKU3ZScjQ
