(R) function: object not found: environment depth fine? - r

I'm puzzled by a function error & would appreciate some insight.
The function, very briefly, automates the multiple processes involved in Boosted Regression Trees using gbm.step & other gbm's.
"gbm.auto" <- function (grids, samples, 3 parameters) {
starts 2 counters, require(gbm), does various small processing jobs with grids & samples
for parameter 1{
for parameter 2{
for parameter 3{
Runs 2 BRTs per parameter-combination loop, generates & iteratively updates a 'best' BRT for each, adds to counters. Extensive use of samples.
}}}
closes the loops, function continues as the first } is still open.
The next BRT can't find samples, even though it's at the same environment depth (1?) as the pre-loop processing jobs which used it successfully. Furthermore, adding "globalsamples<<-samples" after the }}} loop successfully saves the object, suggesting that samples is still available. Adding env1,2 & 3<<-environment() before the {{{ loop, within it & after it results in Environment for all three. Also suggesting it's all the same function environment & samples should be available.
What am I missing here? Thanks in advance!
Edit: exact message:
Error in eval(expr, envir, enclos) : object 'samples' not found
Function - loads removed & compacted but still gives same error message:
"gbm.auto" <-
function (samples, expvar, resvar, tc, lr, bf)
{ # open function
require(gbm)
require(dismo)
# create binary (0/1) response variable, for bernoulli BRTs
samples$brv <- ifelse(samples[resvar] > 0, 1, 0)
brvcol <- which(colnames(samples)=="brv") # brv column number for BRT
for(j in tc){ # permutations of tree complexity
for(k in lr){ # permutations of learning rate
for(l in bf){ # permutations of bag fraction
Bin_Best_Model<- gbm.step(data=samples,gbm.x = expvar, gbm.y = brvcol, family = "bernoulli", tree.complexity = j, learning.rate = k, bag.fraction = l)
}}} # close loops, producing all BRT/GBM objects & continue through model selection
Bin_Best_Simp_Check <- gbm.simplify(Bin_Best_Model) # simplify model
# if best number of variables to remove isn't 0 (i.e. it's worth simplifying), re-run the best model (Bin_Best_Model, using gbm.call to get its values)
# with just-calculated best number of variables to remove, removed. gbm.x asks which number of drops has the minimum mean (lowest point on the line)
# & that calls up the list of predictor variables with those removed, from $pred.list
if(min(Bin_Best_Simp_Check$deviance.summary$mean) < 0)
assign("Bin_Best_Simp", gbm.step(data = samples,
gbm.x = Bin_Best_Simp_Check$pred.list[[which.min(Bin_Best_Simp_Check$deviance.summary$mean)]],
gbm.y = brvcol, family = "bernoulli", tree.complexity = j, learning.rate = k, bag.fraction = l))
}
Read in data:
mysamples<-data.frame(response=round(sqrt(rnorm(5000, mean= 2.5, sd=1.5)^2)),
depth=sqrt(rnorm(5000, mean= 35, sd=24)^2),
temp=rnorm(5000, mean= 15, sd=1.2),
sal=rnorm(5000, mean= 34, sd=0.34))
Run this: gbm.auto(expvar=c(2,3,4),resvar=1,samples=mysamples,tc=2,lr=0.00000000000000001,bf=0.5)
Problem now: this causes a different error because my fake data are somehow wrong. ARGHG!
Edit: rounded the response data to integers and kept shrinking the learning rate until it runs. If it doesn't work for you, add zeroes until it does.
Edit: so this worked on my computer but reading it back to a clean sheet from online fails on a DIFFERENT count:
Error in var(cv.cor.stats, use = "complete.obs") :
no complete element pairs
In cor(y_i, u_i) : the standard deviation is zero
Is it allowed to attach or link to a csv of a small clip of my data? I'm currently burrowing deeper & deeper into bugfixing problems created by using fake data which I'm only using for this question, & thus getting off topic from the actual problem. Exasperation mode on!
Cheers
Edit2: if this is allowed: 1000row 4column csv link here: https://drive.google.com/file/d/0B6LsdZetdypkaC1WYXpKU3ZScjQ

Related

bartMachine in caret train error : incorrect number of dimensions

I encounter a strange problem when trying to train a model in R using caret :
> bart <- train(x = cor_data, y = factor(outcome), method = "bartMachine")
Error in tuneGrid[!duplicated(tuneGrid), , drop = FALSE] :
nombre de dimensions incorrect
However, when using rf, xgbTree, glmnet, or svmRadial instead of bartMachine, no error is raised.
Moreover, dim(cor_data) and length(outcome) return [1] 3056 134 and [1] 3056 respectively, which indicates that there is indeed no issue with the dimensions of my dataset.
I have tried changing the tuneGrid parameter in train, which resolved the problem but caused this issue instead :
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "pool-89-thread-1"
My dataset includes no NA, and all variables are either numerical or binary.
My goal is to extract the most important variables in the bart model. For example, I use for random forests:
rf <- train(x = cor_data, y = factor(outcome), method = "rf")
rfImp <- varImp(rf)
rf_select <- row.names(rfImp$importance[order(- rfImp$importance$Overall)[1:43], , drop = FALSE])
Thank you in advance for your help.
Since your goal is to extract the most important variables in the bart model, I will assume you are willing to bypass the caret wrapper and do it directly in R bartMachine, which is the only way I could successfully run it.
For my system, solving the memory issue required 2 further things:
Restart R and before loading anything, allocate 8Gb memory as so:
options(java.parameters = "-Xmx8g")
When running bartMachineCV, turn off mem_cache_for_speed:
library(bartMachine)
set_bart_machine_num_cores(16)
bart <- bartMachineCV(X = cor_data, y = factor(outcome), mem_cache_for_speed = F)
This will iterate through 3 values of k (2, 3 and 5) and 2 values of m (50 and 200) running 5 cross-validations each time, then builds a bartMachine using the best hyperparameter combination. You may also have to reduce the number of cores depending on your system, but this took about an hour on a 20,000 observation x 12 variable training set on 16 cores. You could also reduce the number of hyperparameter combinations it tests using the k_cvs and num_tree_cvs arguments.
Then to get the variable importance:
vi <- investigate_var_importance(bart, num_replicates_for_avg = 20)
print(vi)
You can also use it as a predictive model with predict(bart, new_data=new) similar to the object normally returned by caret::train(). This worked on R4.0.5, bartMachine_1.2.6 and rJava_1.0-4

cpquery function of bnlearn always returns 0 for simple discrete artificial data

I want to calculate conditional probabilities based on a bayesian network based on some binary data I create. However, using the bnlearn::cpquery, always a value of 0 is returned, while bnlearn::bn.fit fits a correct model.
# Create data for binary chain network X1->Y->X2 with zeros and ones.
# All base rates are 0.5, increase in probability due to parent = .5 (from .25 to.75)
X1<-c(rep(1,9),rep(1,3),1,rep(1,3),rep(0,3),0,rep(0,3),rep(0,9))
Y<-c(rep(1,9),rep(1,3),0,rep(0,3),rep(1,3),1,rep(0,3),rep(0,9))
X2<-c(rep(1,9),rep(0,3),1,rep(0,3),rep(1,3),0,rep(1,3),rep(0,9))
dag1<-data.frame(X1,Y,X2)
# Fit bayes net to data.
res <- hc(dag1)
fittedbn <- bn.fit(res, data = dag1)
# Fitting works as expected, as seen by graph structure and coefficients in fittedbn:
plot(res)
print(fittedbn)
# Conditional probability query
cpquery(fittedbn, event = (Y==1), evidence = (X1==1), method = "ls")
# Using LW method
cpquery(fittedbn, event = (Y==1), evidence = list(X1 = 1), method = "lw")
'cpquery' just returns 0. I have also tried using the predict function, however this returns an error:
predict(object = fittedbn, node = "Y", data = list(X1=="1"), method = "bayes-lw", prob = True)
# Returns:
# Error in check.data(data, allow.levels = TRUE) :
# the data must be in a data frame.
In the above cpquery the expected result is .75, but 0 is returned. This is not specific to this event or evidence, regardless of what event or evidence I put in (e.g event = (X2==1), evidence = (X1==0) or event = (X2==1), evidence = (X1==0 & Y==1)) the function returns 0.
One thing I tried, as I thought the small amount of observations might be an issue, is to just increase the number of observations (i.e. vertically concatenating the above dataframe with itself a bunch of times), however this did not change the output.
I've seen many threads on cpquery and that it can be fragile, but none indicate this issue. To note: the example in 'cpquery' documentation works as expected, so it seems the problem is not due to my environment.
Any help would be greatly appreciated!

Error related to randomisation test within lapply() function in R

I have 30 datasets that are conbined in a data list. I wanted to analyze spatial point pattern by L function along with randomisation test. Codes are following.
The first code works well for a single dataset (data1) but once it is applied to a list of dataset with lapply() function as shown in 2nd code, it gives me a very long error like so,
"Error in Kcross(X, i, j, ...) : No points have mark i = Acoraceae
Error in envelopeEngine(X = X, fun = fun, simul = simrecipe, nsim =
nsim, : Exceeded maximum number of errors"
Can anybody tell me what is wrong with 2nd code?
grp <- factor(data1$species)
window <- ripras(data1$utmX, data1$utmY)
pp.grp <- ppp(data1$utmX, data1$utmY, window=window, marks=grp)
L.grp <- alltypes(pp.grp, Lest, correlation = "Ripley")
LE.grp <- alltypes(pp.grp, Lcross, nsim = 100, envelope = TRUE)
plot(L.grp)
plot(LE.grp)
L.LE.sp <- lapply(data.list, function(x) {
grp <- factor(x$species)
window <- ripras(x$utmX, x$utmY)
pp.grp <- ppp(x$utmX, x$utmY, window = window, marks = grp)
L.grp <- alltypes(pp.grp, Lest, correlation = "Ripley")
LE.grp <- alltypes(pp.grp, Lcross, envelope = TRUE)
result <- list(L.grp=L.grp, LE.grp=LE.grp)
return(result)
})
plot(L.LE.sp$LE.grp[1])
This question is about the R package spatstat.
It would help if you could add a minimal working example including data which demonstrate this problem.
If that is not available, please generate the error on your computer, then type traceback() and capture the output and post it here. This will trace the location of the error.
Without this information, my best guess is the following:
The error message says No points have mark i=Acoraceae. That means that the code is expecting a point pattern to include points of type Acoraceae but found that there were none. This can happen because in alltypes(... envelope=TRUE) the code generates random point patterns according to complete spatial randomness. In the simulated patterns, the number of points of type Acoraceae (say) will be random according to a Poisson distribution with a mean equal to the number of points of type Acoraceae in the observed data. If the number of Acoraceae in the actual data is small then there is a reasonable chance that the simulated pattern will contain no Acoraceae at all. This is probably what is causing the error message No points have mark i=Acoraceae.
If this interpretation is correct then you should be able to suppress the error by including the argument fix.marks=TRUE, that is,
alltypes(pp.grp, Lcross, envelope=TRUE, fix.marks=TRUE, nsim=99)
I'm not suggesting this is necessarily appropriate for your application, but this should remove the error message if my guess is correct.
In the latest development version of spatstat, available on github, the code for envelope has been tweaked to detect this error.

Estimate parameters of Frechet distribution using mmedist or fitdist(with mme) error

I'm relatively new in R and I would appreciated if you could take a look at the following code. I'm trying to estimate the shape parameter of the Frechet distribution (or inverse weibull) using mmedist (I tried also the fitdist that calls for mmedist) but it seems that I get the following error :
Error in mmedist(data, distname, start = start, fix.arg = fix.arg, ...) :
the empirical moment function must be defined.
The code that I use is the below:
require(actuar)
library(fitdistrplus)
library(MASS)
#values
n=100
scale = 1
shape=3
# simulate a sample
data_fre = rinvweibull(n, shape, scale)
memp=minvweibull(c(1,2), shape=3, rate=1, scale=1)
# estimating the parameters
para_lm = mmedist(data_fre,"invweibull",start=c(shape=3,scale=1),order=c(1,2),memp = "memp")
Please note that I tried many times en-changing the code in order to see if my mistake was in syntax but I always get the same error.
I'm aware of the paradigm in the documentation. I've tried that as well but with no luck. Please note that in order for the method to work the order of the moment must be smaller than the shape parameter (i.e. shape).
The example is the following:
require(actuar)
#simulate a sample
x4 <- rpareto(1000, 6, 2)
#empirical raw moment
memp <- function(x, order)
ifelse(order == 1, mean(x), sum(x^order)/length(x))
#fit
mmedist(x4, "pareto", order=c(1, 2), memp="memp",
start=c(shape=10, scale=10), lower=1, upper=Inf)
Thank you in advance for any help.
You will need to make non-trivial changes to the source of mmedist -- I recommend that you copy out the code, and make your own function foo_mmedist.
The first change you need to make is on line 94 of mmedist:
if (!exists("memp", mode = "function"))
That line checks whether "memp" is a function that exists, as opposed to whether the argument that you have actually passed exists as a function.
if (!exists(as.character(expression(memp)), mode = "function"))
The second, as I have already noted, relates to the fact that the optim routine actually calls funobj which calls DIFF2, which calls (see line 112) the user-supplied memp function, minvweibull in your case with two arguments -- obs, which resolves to data and order, but since minvweibull does not take data as the first argument, this fails.
This is expected, as the help page tells you:
memp A function implementing empirical moments, raw or centered but
has to be consistent with distr argument. This function must have
two arguments : as a first one the numeric vector of the data and as a
second the order of the moment returned by the function.
How can you fix this? Pass the function moment from the moments package. Here is complete code (assuming that you have made the change above, and created a new function called foo_mmedist):
# values
n = 100
scale = 1
shape = 3
# simulate a sample
data_fre = rinvweibull(n, shape, scale)
# estimating the parameters
para_lm = foo_mmedist(data_fre, "invweibull",
start= c(shape=5,scale=2), order=c(1, 2), memp = moment)
You can check that optimization has occurred as expected:
> para_lm$estimate
shape scale
2.490816 1.004128
Note however, that this actually reduces to a crude way of doing overdetermined method of moments, and am not sure that this is theoretically appropriate.

adabag boosting function throws error when giving mfinal>10

I have a strange issue, whenever I try increasing the mfinal argument in boosting function of adabag package beyond 10 I get an error, Even with mfinal=9 I get warnings.
My train data has 7 class Dependant variable and 100 independant variables and around 22000 samples of data(Smoted one class using DMwR). My Dependant Variable is at the end of the training dataset in sequence.
library(adabag)
gc()
exp_recog_boo <- boosting(V1 ~ .,data=train_dataS,boos=TRUE,mfinal=9)
Error in 1:nrow(object$splits) : argument of length 0
In addition: Warning messages:
1: In acum + acum1 :
longer object length is not a multiple of shorter object length
Thanks in advance.
My mistake was that I didn't set the TARGET as factor before.
Try this:
train$target <- as.factor(train$target)
and check by doing:
str(train$TARGET)
This worked for me:
modelADA <- boosting(lettr ~ ., data = trainAll, boos = TRUE, mfinal = 10, control = (minsplit = 0))
Essentially I just told rpart to require a minimum split length of zero to generate tree, it eliminated the error. I haven't tested this extensively so I can't guarantee it's a valid solution (what does a tree with a zero length leaf actually mean?), but it does prevent the error from being thrown.
I think i Hit the problem.
ignore this -if you configure your control with a cp = 0, this wont happen. I think that if the first node of a tree make no improvement (or at least no better than the cp) the tree stay wiht 0 nodes so you have an empty tree and that make the algorithm fail.
EDIT: The problem is that the rpart generates trees with only one leaf(node) and the boosting metod use this sentence "k <- varImp(arboles[[m]], surrogates = FALSE, competes = FALSE)" being arboles[[m]] a tree with only one node it give you the eror.
To solve that you can modify the boosting metod:
Write: fix(boosting) and add the *'S lines.
if (boos == TRUE) {
** k <- 1
** while (k == 1){
boostrap <- sample(1:n, replace = TRUE, prob = pesos)
fit <- rpart(formula, data = data[boostrap, -1],
control = control)
** k <- length(fit$frame$var)
** }
flearn <- predict(fit, newdata = data[, -1], type = "class")
ind <- as.numeric(vardep != flearn)
err <- sum(pesos * ind)
}
this will prevent the algorith from acepting one leaf trees but you have to set the CP from the control param as 0 to avoid an endless loop..
Just ran into the same problem, and setting the complexity parameter to -1 or minimum split to 0 both work for me with rpart.control, e.g.
library(adabag)
r1 <- boosting(Y ~ ., data = data, boos = TRUE,
mfinal = 10, control = rpart.control(cp = -1))
r2 <- boosting(Y ~ ., data = data, boos = TRUE,
mfinal = 10, control = rpart.control(minsplit = 0))
I also run into this same problem recently and this example R script solves it completely!
The main idea is that you need to set the control for rpart (which adabag uses for creating trees, see rpart.control) appropriately, so that at least a split is attempted in every tree.
I'm not totally sure but it appears that your "argument of length 0" may be the result of an empty tree, which can happen since there is a default setting of a "complexity" parameter that tells the function not to attempt a split if the decrease in homogeneity/lack of fit is below certain threshold.
use str() to see the attributes of your dataframe. For me, I just convert myclass variable as factor, then everything runs.

Resources