Error in optim: L-BFGS-B needs finite value of fn - r

I am trying to run the impute_errors() function of the imputeTestBench package on a series of values. I am using six user-defined methods for selecting the best imputation method. Below is my code:
correctedSalesHistoryMatrix[, 1:2],
matrix(unlist(apply(X = as.matrix(correctedSalesHistoryMatrix[, -c(1, 2)]),
                    MARGIN = 1,
                    FUN = impute_errors,
                    smps = "mcar",
                    methods = c("imputationMethod1",
                                "imputationMethod2",
                                "imputationMethod3",
                                "imputationMethod4",
                                "imputationMethod5",
                                "imputationMethod6"),
                    methodPath = "C:\\Documents\\Imputations.R",
                    errorParameter = "mape",
                    missPercentFrom = 10,
                    missPercentTo = 10)),
       nrow = nrow(correctedSalesHistoryMatrix), byrow = TRUE)
)
When I use a small dataset, the function executes successfully. When I use a large dataset, I get the following error:
Error in optim(init[mask], getLike, method = "L-BFGS-B", lower = rep(0, :
L-BFGS-B needs finite values of 'fn'
Called from: optim(init[mask], getLike, method = "L-BFGS-B", lower = rep(0,
np + 1L), upper = rep(Inf, np + 1L), control = optim.control)

I don't think this is an easy fix.
The error is probably not caused by imputeTestBench itself, but rather by one of your user-defined imputation methods.
Run impute_errors as before, but with na_mean as the only method instead of your user-defined methods (impute_errors(..., methods = 'na_mean')) to check whether this is the case.
The error itself occurs quite often and has to do with stats::optim receiving inputs it can't deal with. Quite likely you are not calling stats::optim directly in your user-defined imputation methods (so you can't easily fix the input). More likely, a package you are using does some calculations and then calls stats::optim. Or, even worse, a package you are using relies on another package that uses stats::optim.
The answers to this question explain the underlying problem. Overall, it seems to occur especially for large datasets, when the fn input parameter to stats::optim becomes Inf.
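The failure mode is easy to reproduce with base R alone. The sketch below uses a made-up objective function that overflows to Inf, standing in for whatever likelihood a package computes on a large dataset:

```r
# Minimal reproduction: L-BFGS-B aborts as soon as the objective
# returns a non-finite value. exp(1000) overflows to Inf at the
# starting point, which triggers the exact error message above.
badfn <- function(x) exp(x[1])

res <- try(optim(par = 1000, fn = badfn, method = "L-BFGS-B"),
           silent = TRUE)
inherits(res, "try-error")  # TRUE
# The condition message contains "L-BFGS-B needs finite values of 'fn'"
```

If a debug print inside your user-defined methods shows the data producing Inf or NaN somewhere before optimization, that is very likely the culprit.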
Here are some examples of the problem also occurring in different R packages and functions (all of which use stats::optim somewhere internally): 1, 2, 3
There is not much you can do overall, unless you want to go extremely deep into the underlying packages.
If you are using the imputeTS package for one of your user-supplied imputation methods, this GitHub issue proposes a workaround, which might help if the error occurs within the na_kalman or na_seadec method.

Error in eval(parse()) - r unable to find argument input

I am very new to R, and this is my first time encountering the eval() function. I am trying to use the med and boot.med functions from the following package: mma. I am using it to conduct mediation analysis. med and boot.med take in models such as linear models, and data frames that specify mediators and predictors, and then estimate the mediation effect of each mediator.
The author of the package gives the flexible option of specifying one's own custom.function. From the source code of med, it can be seen that the custom.function is passed to eval(). So I tried inserting the gbmt function as the custom function. However, R kept giving me the error message: Error during wrapup: Number of trees to be used in prediction must be provided. I have been searching online for days and tried many ways of specifying the number-of-trees parameter n.trees, but nothing works (I believe others have raised similar issues: post 1, post 2).
The following codes are part of the source code of the med function:
cf1 = gsub("responseY", "y[,j]", custom.function[j])
cf1 = gsub("dataset123", "x2", cf1)
cf1 = gsub("weights123", "w", cf1)
full.model[[j]] <- eval(parse(text = cf1))
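To see this mechanism in isolation, here is a small self-contained sketch (with made-up data and an lm() call standing in for the custom function) of how the placeholder tokens in the custom.function string are swapped for real object names before the string is parsed and evaluated:

```r
# Sketch of med()'s substitution mechanism: "responseY" and
# "dataset123" are placeholders that gsub() replaces with the names of
# objects in scope; eval(parse(text = ...)) then runs the result.
custom.function <- 'lm(responseY ~ ., data = dataset123)'
set.seed(1)
y  <- data.frame(out = rnorm(20))               # plays the role of med()'s y
x2 <- data.frame(a = rnorm(20), b = rnorm(20))  # plays the role of x2
j  <- 1

cf1 <- gsub("responseY", "y[,j]", custom.function)
cf1 <- gsub("dataset123", "x2", cf1)
fit <- eval(parse(text = cf1))
class(fit)  # "lm"
```

Because the string is evaluated as-is, any argument the fitted model later needs at prediction time (such as gbm's n.trees) must either be embedded in the string or be supported by the package's own predict logic, which is what the question runs into.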
One custom function example the author gives in the package documentation is as follows:
temp1<-med(data=data.bin,n=2,custom.function = 'glm(responseY~.,data=dataset123,family="quasibinomial",
weights=weights123)')
Here the glm is the custom function. This example code works and you can replicate it easily (if you have mma installed and loaded). However, when I try to use the gbmt function on a survival object, I get errors. Here is what my code looks like:
temp1 <- med(data = data.surv, n = 2, type = "link",
             custom.function = 'gbmt(responseY ~ .,
                                     data = dataset123,
                                     distribution = dist,
                                     train_params = start_stop,
                                     cv_folds = 10,
                                     keep_gbm_data = TRUE,
                                     )')
Does anyone have any idea how the number-of-trees argument n.trees can be added somewhere in the above code?
Many thanks in advance!
Update: to replicate the example code, please install mma and try the following:
library("mma")
data("weight_behavior") ##binary x #binary y
x=weight_behavior[,c(2,4:14)]
pred=weight_behavior[,3]
y=weight_behavior[,15]
data.bin<-data.org(x,y,pred=pred,contmed=c(7:9,11:12),binmed=c(6,10), binref=c(1,1),catmed=5,catref=1,predref="M",alpha=0.4,alpha2=0.4)
temp1<-med(data=data.bin,n=2) #or use self-defined final function
temp1<-med(data=data.bin,n=2, custom.function = 'glm(responseY~.,data=dataset123,family="quasibinomial",
weights=weights123)')
I changed the custom.function to gbmt and used a survival object as responseY, and the error occurs. When I use the gbmt function on my data outside the med function, there is no error.

How to find out the computational complexity of an algorithm using the GuessCompx package

I have been trying to assess the computational complexity of the RoughKMeans_PE algorithm found in the SoftClustering package. To assess the computational complexity of an algorithm, the GuessCompx package is available on CRAN. When I tried to assess the complexity, it works well only with the example code given below.
CompEst(d = ggplot2::diamonds[, 5:10], f = dist, replicates = 10, max.time = 10)
But if I try to assess the complexity of RoughKMeans_PE, I get the following error message:
The code that I used:
library(SoftClustering)
CompEst(iris[,-5], RoughKMeans_PE, random.sampling = FALSE, max.time = 30,
        start.size = NULL, replicates = 4, strata = NULL, power.factor = 2,
        alpha.value = 0.005, plot.result = TRUE)
The error I got:
Error in if (datatypeInteger(meansMatrix)) { : missing value where
TRUE/FALSE needed Timing stopped at: 0.01 0 0.02
Please clarify how I can find out the complexity of my own algorithm using the GuessCompx package.
Thank you.
I'm the maintainer of the GuessCompx package. The error you see actually comes from the clustering function, for which you need to input some mandatory arguments (the documentation is not clear about that):
RoughKMeans_PE(iris[, -5]) # gives the same error
RoughKMeans_PE(iris[, -5], meansMatrix = 1, nClusters = 3) # OK
That's why the CompEst() function gives the error. You need to wrap your clustering function, with the correct arguments, inside an anonymous or custom function; this will work:
f = function(df) RoughKMeans_PE(df, 1, 2, 100)
CompEst(iris[, -5], f)
However, it won't give you the right result: first because the iris dataset is too small to measure any time difference, and second because the computation time of your clustering algorithm has high variability (number of iterations). So I suggest you change the default arguments, increasing replicates and max.time. The resulting plot on a larger dataset seems to show O(N) or O(N log N) asymptotic behavior.

liquidSVM parameter selection clarification in R

This is my first question and I'm only a basic "programmer", so I'm sorry if I do not make myself clear enough.
I'm currently using liquidSVM 1.2.1 on R 3.5.0 and, despite its great potential, I do not understand some of the technicalities, as the help is not explanatory enough for me and I cannot find anything on the internet.
More specifically, I'd like to understand further how the parameter selection works.
The final liquidSVM model does in fact contain info on gammas and lambdas, but I cannot tell whether these parameters are all being used in different cells or whether just a final pair has been chosen for the final model.
This leads to two sub-questions:
If all the values are used, how can I disable grid_choice and select only one value for each parameter?
If the algorithm selects a final pair of values, how can I find out which one it is?
This is the setting I've been using so far:
model = liquidSVM::svm(formula, TRAIN, threads = 3, predict.prob = TRUE,
                       random_seed = 123, folds = 5, scale = FALSE, d = 1,
                       partition_choice = 5, grid_choice = -1)
I tried different things, for example:
setting gamma = 0.01 and lambda = 0.1;
setting max_gamma = 0.01 and min_gamma = 0.01
setting grid_choice = NULL or grid_choice = list(gamma = 0.01, lambda = 0.01)
but it still does a grid selection on its own.
If only I could understand how to disable this grid search and provide my chosen parameters, I'd code a grid search myself (thus knowing what the code is doing).
Thank you in advance.
This question is somewhat older now, but in case someone is still looking for a solution:
You can define the grid to be searched for the best matching values by using the arguments gammas and lambdas. In this case, you set each of them to a single value.
For example:
model <- svm(x1 ~ ., train, display = 1, folds = 5, mc_type = "OvA_ls",
             gammas = 0.01,
             lambdas = 0.1)
would set gamma to 0.01 and lambda to 0.1 only.
However, this is not a grid search anymore, and you should expect to get two hands full of warning messages. If you provide a vector of gammas and a vector of lambdas, it will search that grid rather than the default one. Hence, these arguments can be handy if you want to compare liquidSVM with other packages, for example.
Best of luck

How to debug errors like "dim(X) must have a positive length" with caret

I'm running a predict over a fit similar to what is found in the caret guide:
Caret Measuring Performance
predictions <- predict(caretfit, testing, type = "prob")
But I get the error:
Error in apply(x, 1, paste, collapse = ",") :
dim(X) must have a positive length
I would like to know 1) the general way to diagnose errors like this that result from bad inputs to functions, and 2) why my code is failing.
1)
So looking at the error, it's something to do with 'X'. Which argument is X? Obviously the first one in apply, but which argument to predict is eventually passed to apply? Looking at traceback():
10: stop("dim(X) must have a positive length")
9: apply(x, 1, paste, collapse = ",")
8: paste(apply(x, 1, paste, collapse = ","), collapse = "\n")
7: makeDataFile(x = newdata, y = NULL)
6: predict.C5.0(modelFit, newdata, type = "prob")
5: predict(modelFit, newdata, type = "prob") at C5.0.R#59
4: method$prob(modelFit = modelFit, newdata = newdata, submodels = param)
3: probFunction(method = object$modelInfo, modelFit = object$finalModel,
newdata = newdata, preProc = object$preProcess)
2: predict.train(caretfit, testing, type = "prob")
1: predict(caretfit, testing, type = "prob")
Now, this problem would be easy to solve if I could follow the code through and understand it, as opposed to these general errors. Using this traceback I can trace the code to C5.0.R#59. (It looks like there's no way to get line numbers on every trace?) I can follow this code as far as line 59 and then (I think) the predict function on line 44:
Github Caret C5.0 source
But after this I'm not sure where the logic flows. I don't see 'makeDataFile' anywhere in the caret source, or, if it's in another package, how it got there. I've also tried RStudio debugging, debug() and browser(). None provide the stack trace I would expect from other languages. Any suggestions on how to follow the code when you don't know what an error message means?
2) As for my particular inputs, 'caretfit' is simply the result of a caret fit, and the testing data is 3 million rows by 59 columns:
fitcontrol <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 1,
                           classProbs = TRUE,
                           summaryFunction = custom.summary,
                           allowParallel = TRUE)
fml <- as.formula(paste("OUTVAR ~",
                        paste(colnames(training[, 1:(ncol(training) - 2)]),
                              collapse = "+")))
caretfit <- train(fml,
                  data = training[1:200000, ],
                  method = "C5.0",
                  trControl = fitcontrol,
                  verbose = FALSE,
                  na.action = na.pass)
1. Debugging Procedure
You can pinpoint the problem using a couple of functions.
Although there still doesn't seem to be any way to get a full stack trace with line numbers in code (boo!), you can take the function names from the traceback and use getAnywhere() to search for the function you are looking for. For example, you can do:
getAnywhere(makeDataFile)
to see its location and source. (This also works great on Windows, where libraries are often bundled up as binaries.) Then you have to use the source or GitHub to find the specific line numbers or to trace through the logic of the code.
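As a self-contained illustration of what getAnywhere() reports, here is the same lookup against an unexported S3 method from the always-available stats namespace (standing in for C50's makeDataFile):

```r
# getAnywhere() finds functions even when they are not exported from a
# package's namespace. na.omit.data.frame is a registered S3 method in
# stats that you cannot call directly by its name.
found <- getAnywhere("na.omit.data.frame")
found$where              # reports the namespace(s) where it lives
length(found$objs) >= 1  # TRUE: the source is available for inspection
```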
In my particular problem, if I run:
newdata <- testing
caseString <- C50:::makeDataFile(x = newdata, y = NULL)
(Note the three colons ":::".) I can see that this step completes at this level, so it appears as if something is happening to my dataset along the way.
So, using getAnywhere() and GitHub over and over through my traceback, I can find the line numbers manually (boo!):
1. In caret/R/predict.train.R, predict.train (defined on line 108) calls probFunction on line 153.
2. In caret/R/probFunction, probFunction (defined on line 3) calls the method$prob function, which is a function stored in the fit object (caretfit$modelInfo$prob) and can be inspected by entering that into the console. This is the same function found in caret/models/files/C5.0.R on line 58, which calls predict on line 59.
3. Something in caret knows to use C50/R/predict.C5.0.R, which you can see by searching with getAnywhere().
4. This function runs makeDataFile on line 25 (part of the C50 package),
5. which calls paste, which calls apply, which dies with stop.
2. Particular Problem with caret's predict
As for my problem, I kept inspecting the code and adding inputs at different levels, and it would complete successfully. What happens is that some modification happens to my dataset in predict.train.R, which causes it to fail. It turns out that I wasn't including my 'na.action' argument, which for my tree-based data should be 'na.pass'. If I include this argument:
prediction <- predict(caretfit, testing, type = "prob", na.action = na.pass)
it works as expected. Line 126 of predict.train makes use of this argument to decide whether to include non-complete cases in the prediction. My data has no complete cases, so it failed, complaining of needing a matrix of some positive length.
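The difference between the two na.action choices is easy to see with a toy data frame:

```r
# na.omit drops incomplete rows; na.pass hands the data through
# untouched, leaving the downstream model code to deal with the NAs.
df <- data.frame(a = c(1, NA, 3), b = c(4, 5, NA))
nrow(na.omit(df))  # 1 -- only the first row is complete
nrow(na.pass(df))  # 3 -- all rows kept, NAs included
```

With no complete cases at all, the default filtering leaves a zero-row object, which is what apply() then chokes on.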
Now, how one would know that this apply error is due to a missing na.action argument is not obvious at all, hence the need for a good debugging procedure. If anyone knows of other ways to debug (keeping in mind that on Windows, stepping through library source in RStudio doesn't work very well), please answer or comment.

Estimate parameters of Frechet distribution using mmedist or fitdist(with mme) error

I'm relatively new to R and I would appreciate it if you could take a look at the following code. I'm trying to estimate the shape parameter of the Frechet distribution (or inverse Weibull) using mmedist (I also tried fitdist, which calls mmedist), but I get the following error:
Error in mmedist(data, distname, start = start, fix.arg = fix.arg, ...) :
the empirical moment function must be defined.
The code that I use is below:
require(actuar)
library(fitdistrplus)
library(MASS)
# values
n = 100
scale = 1
shape = 3
# simulate a sample
data_fre = rinvweibull(n, shape, scale)
memp = minvweibull(c(1, 2), shape = 3, rate = 1, scale = 1)
# estimating the parameters
para_lm = mmedist(data_fre, "invweibull", start = c(shape = 3, scale = 1),
                  order = c(1, 2), memp = "memp")
Please note that I tried changing the code many times to see if my mistake was in the syntax, but I always get the same error.
I'm aware of the example in the documentation. I've tried that as well, but with no luck. Please note that for the method to work, the order of the moment must be smaller than the shape parameter (i.e. shape).
The example is the following:
require(actuar)
# simulate a sample
x4 <- rpareto(1000, 6, 2)
# empirical raw moment
memp <- function(x, order)
  ifelse(order == 1, mean(x), sum(x^order)/length(x))
# fit
mmedist(x4, "pareto", order = c(1, 2), memp = "memp",
        start = c(shape = 10, scale = 10), lower = 1, upper = Inf)
Thank you in advance for any help.
You will need to make non-trivial changes to the source of mmedist -- I recommend that you copy out the code and make your own function, foo_mmedist.
The first change you need to make is on line 94 of mmedist:
if (!exists("memp", mode = "function"))
That line checks whether "memp" is a function that exists, as opposed to whether the argument that you actually passed exists as a function. Change it to:
if (!exists(as.character(expression(memp)), mode = "function"))
The second, as I have already noted, relates to the fact that the optim routine actually calls funobj, which calls DIFF2, which calls (see line 112) the user-supplied memp function -- minvweibull in your case -- with two arguments: obs, which resolves to data, and order. Since minvweibull does not take data as the first argument, this fails.
This is expected, as the help page tells you:
memp: A function implementing empirical moments, raw or centered, but has to be consistent with distr argument. This function must have two arguments: as a first one the numeric vector of the data and as a second the order of the moment returned by the function.
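Following that signature, an empirical-moment function can also be written directly, without any extra package (raw moments, in this illustrative sketch):

```r
# An empirical moment function with the signature mmedist expects:
# data vector first, moment order second (raw moments here).
memp_raw <- function(x, order) mean(x^order)

memp_raw(c(1, 2, 3), 1)  # 2
memp_raw(c(1, 2, 3), 2)  # (1 + 4 + 9) / 3 = 4.666667
```

The user's mistake was passing minvweibull, which computes theoretical moments from parameters, not empirical moments from data, so the arguments do not line up.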
How can you fix this? Pass the function moment from the moments package. Here is the complete code (assuming that you have made the change above and created a new function called foo_mmedist):
# values
n = 100
scale = 1
shape = 3
# simulate a sample
data_fre = rinvweibull(n, shape, scale)
# estimating the parameters
para_lm = foo_mmedist(data_fre, "invweibull",
                      start = c(shape = 5, scale = 2), order = c(1, 2),
                      memp = moment)
You can check that optimization has occurred as expected:
> para_lm$estimate
shape scale
2.490816 1.004128
Note, however, that this actually reduces to a crude way of doing an overdetermined method of moments, and I am not sure that it is theoretically appropriate.
