clValid function to work for Big Data - r

Package clValid reports validation measures for clustering results. The function returns an object of class clValid, which contains the clustering results in addition to the validation measures. The validation measures fall into three general categories: internal, stability, and biological.
clValid(obj, nClust, clMethods="hierarchical", validation="stability", maxitems=600,
metric="euclidean", method="average", neighbSize=10, ...)
Any idea how to tweak the above function so that it works on big data (i.e., maxitems >= 50000)?

I donot clearly understand your meaning of "tweak". Maybe you have to parallel this function. BTW, you can set the parameter maxitems=nrow(obj), but it's really slow.
Update:
Here is a paralled clValid.

Related

Error when computing jacobian vector product

I have a group with coupled disciplines which is nested in a model where all other components are uncoupled. I have assigned a nonlinear Newton and linear direct solvers to the coupled group.
When I try to run the model with default "RunOnce" solver everything is OK, but as soon as I try to run optimization I get following error raised from linear_block_gs.py:
File "...\openmdao\core\group.py", line 1790, in _apply_linear scope_out, scope_in)
File "...\openmdao\core\explicitcomponent.py", line 339, in _apply_linear
self.compute_jacvec_product(*args)
File "...\Thermal_Cycle.py", line 51, in compute_jacvec_product
d_inputs['T'] = slope * deff_dT / alp_sc
File "...\openmdao\vectors\vector.py", line 363, in setitem
raise KeyError(msg.format(name)) KeyError: 'Variable name "T" not found.'
Below is the N2 diagram of the model. Variable "T" which is mentioned in the error comes from implicit "temp" component and is fed back to "sc" component (file Thermal_Cycle.py in the error msg) as input.
N2 diagram
The error disappears when I assign DirectSolver on top of the whole model. My impression was that "RunOnce" would work as long as groups with implicit components have appropriate solvers applied to them as suggested here and is done in my case. Why does it not work when trying to compute total derivatives of the model, i.e. why compute_jacvec_product cannot find coupled variable "T"?
The reason I want to use "RunOnce" solver is that optimization with DirecSolver on top becomes very long as my variable vector "T" increases. I suspect it should be much faster with linear "RunOnce"?
I think this example of the compute_jacvec_product method might be helpful.
The problem is that, depending on the solver configuration or the structure of the model, OpenMDAO may only need some of the partials that you provide in this method. For example, your matrix-free component might have two inputs, but only one is connected, so OpenMDAO does not need the derivative with respect to the unconnected input, and in fact, does not allocate space for it in the d_inputs or d_outputs vectors.
So, to fix the problem, you just need to put an if statement before assigning the value, just like in the example.
Based on the N2, I think that I agree with your strategy of putting the direct solver down around the coupling only. That should work fine, however it looks like you're implementing a linear operator in your component, based on:
File "...\Thermal_Cycle.py", line 51, in compute_jacvec_product d_inputs['T'] = slope * deff_dT / alp_sc
You shouldn't use direct solver with matrix-free partials. The direct solver computes an inverse, which requires the full assembly of the matrix. The only reason it works at all is that OM has some fall-back functionality to manually assemble the jacobian by passing columns of the identity matrix through the compute_jacvec_product method.
This fallback mechanism is there to make things work, but its very slow (you end up calling compute_jacvec_product A LOT).
The error you're getting, and why it works when you put the direct solver higher up in the model, is probably due to a lack of necessary if conditions in your compute_jacvec_product implementation.
See the docs on explicit component for some examples, but the key insight is to realize that not every single variable will be present when doing a jacvec product (it depends on what kind of solve is being done --- i.e. one for Newton vs one for total derivatives of the whole model).
So those if-checks are needed to check if variables are relevant. This is done, because for expensive codes (i.e. CFD) some of these operations are quite expensive and you don't want to do them unless you need to.
Are your components so big that you can't use the compute_partials function? Have you tried specifying the sparsity in your jacobian? Usually the matrix-free partial derivative methods are not needed until you start working with really big PDE solvers with 1e6 or more implicit outputs variables.
Without seeing some code, its hard to comment with more detail, but in summary:
You shouldn't use compute_jacvec_product in combination with direct solver. If you really need matrix-free partials, then you need to switch to iterative linear solvers liket PetscKrylov.
If you can post the code for the the component in Thermal_Cycle.py that has the compute_jacvec_product I could give a more detailed recommendation on how to handle the partial derivatives in that case.

maximizing function using optim in r where one of the parameters is an integer

I have a function that I need to maximize that contains 3 parameters, one of which is an integer.
How do I let the optim function know to maximize (instead of minimize which is the default).
And how do I let it know that one of the parameters in an integer?
Will it work if one of the parameters is a binary or categorical?
Max vs min is easy (set fnscale=-1 in the control parameter).
Integer parameters are not easy. I don't know of a simple out-of-the-box solution for this, hopefully someone else does.
Most of the methods implemented in optim assume continuous parameter spaces. (method="SANN" will work since you can give it explicit rules for updating - see the examples - but it's tricky to get it to work efficiently.) Most of the optimizers listed in the Optimization Task View are for continuous optimization - the section on global/stochastic gives the most options for mixed discrete/continuous problems.
If the range of plausible integers is reasonably small you can use brute force (i.e., optimize over the two continuous parameters for each of a range of fixed integer values); you could also use bisection search over the integers.

In R, incomplete gamma function with complex input?

Incomplete gamma functions can be calculated in R with pgamma, or with gamma_inc_Q from library(gsl), or with gammainc from library(expint). However, all of these functions take only real input.
I need an implementation of the incomplete gamma function which will take complex input. Specifically, I have an integer for the first argument, and a complex number for the second argument (the limit in the integral).
This function is well-defined for complex inputs (see Wikipedia), and I've been calculating it in Mathematica. It doesn't seem to be built into R though, and I don't see it in any libraries.
So, can anyone suggest a shorter path to doing these calculations, than looking up an algorithm, implementing it in C, and writing an R interface?
(If I do have to implement it myself, here's the only algorithm for complex inputs that I've found: Kostlan & Gokhman 1987)
Here is an implementation, assuming you want the lower incomplete gamma function. I've compared a couple of values with Wolfram and they match.
library(CharFun)
incgamma <- function(s,z){
z^s * exp(-z) * hypergeom1F1(z, 1, s+1) / s
}
Perhaps the evaluation fails for a large s.
EDIT
Looks like CharFun has been removed from CRAN. You can use IncGamma in HypergeoMat:
> library(HypergeoMat)
> IncGamma(m=50, 2+2i, 5-6i)
[1] 0.3841221+0.3348439i
The result is the same on Wolfram.

Change Objective Function in nls.lm() in "R"

I'm using the function nls.lm {package: minpack.lm} to optimize a parameteristion for a hydrological model. The function is working quite well, but I want to use an other objective function (OF). Normally, the obective function "fn" in the nls.lm is defined as
A function that returns a vector of residuals, the sum square of which
is to be minimized. The first argument of fn must be par.
Now I want to use the Nash-Sutcliff-Efficiency, which is defined as
NSE <- 1 - (sum((obs - sim)^2) / sum((obs - mean(obs))^2))
or other OF. The problem is that nls.lm minimizes the expression sum(x)^2 and only the x is modifiable. I know that the best fit NSE = 1. Thus 1 - NSE creates a real minimization problem.
BTW: Example 1 from a nls.lm help page is a good example; there
observed - getPred(p,xx)
is minimized, what actually means that
sum ( observed - getPred(p,xx) )^2
is minimized by the nls.lm function, whereas getPred(p,xx) returns sim.
Any suggestion would be helpful. Thanks in advance.
Micha
nls.lm (and the related functions nls, and nlsLM) are designed to minimize the sum square of the residuals. For the problem you seek to solve, I would try application of a general-purpose minimizer.
If the problem is 'not too hard' (that is, has a single global minimum, is smooth), you could try to apply 'optim' to it (I would try the 'Nelder-Mead' and 'BFGS' options first), or the 'bobyqa' function from the package 'minqa', among other functions.
If the problem requires a global optimizer, you could try the 'GenSA' function from package 'GenSA', the 'genoud' function from the package 'rgenoud', or the 'DEoptim' function from package 'DEoptim', among other options. A review on 'Global Optimization in R' is forthcoming in the Journal of Statistical Software, and that will give a more complete overview of applicable functions.

S3 and order of classes

I've alway had trouble understanding the the documentation on how S3 methods are called, and this time it's biting me back.
I'll apologize up front for asking more than one question, but they are all closely related. Deep in the heart of a complex set of functions, I create a lot of glmnet fits, in particular logistic ones. Now, glmnet documentation specifies its return value to have both classes "glmnet" and (for logistic regression) "lognet". In fact, these are specified in this order.
However, looking at the end of the implementation of glmnet, righter after the call to (the internal function) lognet, that sets the class of fit to "lognet", I see this line of code just before the return (of the variable fit):
class(fit) = c(class(fit), "glmnet")
From this, I would conclude that the order of the classes is in fact "lognet", "glmnet".
Unfortunately, the fit I had, had (like the doc suggests):
> class(myfit)
[1] "glmnet" "lognet"
The problem with this is the way S3 methods are dispatched for it, in particular predict. Here's the code for predict.lognet:
function (object, newx, s = NULL, type = c("link", "response",
"coefficients", "class", "nonzero"), exact = FALSE, offset,
...)
{
type = match.arg(type)
nfit = NextMethod("predict") #<- supposed to call predict.glmnet, I think
switch(type, response = {
pp = exp(-nfit)
1/(1 + pp)
}, class = ifelse(nfit > 0, 2, 1), nfit)
}
I've added a comment to explain my reasoning. Now when I call predict on this myfit with a new datamatrix mydata and type="response", like this:
predict(myfit, newx=mydata, type="response")
, I do not, as per the documentation, get the predicted probabilities, but the linear combinations, which is exactly the result of calling predict.glmnet immediately.
I've tried reversing the order of the classes, like so:
orgclass<-class(myfit)
class(myfit)<-rev(orgclass)
And then doing the predict call again: lo and behold: it works! I do get the probabilities.
So, here come some questions:
Am I right in 'having learned' that
S3 methods are dispatched in order
of appearance of the classes?
Am I right in assuming the code in
glmnetwould cause the wrong order
for correct dispatching of
predict?
In my code there is nothing that
manipulates classes
explicitly/visibly to my knowledge.
What could cause the order to
change?
For completeness' sake: here's some sample code to play around with (as I'm doing myself now):
library(glmnet)
y<-factor(sample(2, 100, replace=TRUE))
xs<-matrix(runif(100), ncol=1)
colnames(xs)<-"x"
myfit<-glmnet(xs, y, family="binomial")
mydata<-matrix(runif(10), ncol=1)
colnames(mydata)<-"x"
class(myfit)
predict(myfit, newx=mydata, type="response")
class(myfit)<-rev(class(myfit))
class(myfit)
predict(myfit, newx=mydata, type="response")
class(myfit)<-rev(class(myfit))#set it back
class(myfit)
Depending on the data generated, the difference is more or less obvious (in my true dataset I noticed negative values in the so called probabilities, which is how I picked up the problem), but you should indeed see a difference.
Thanks for any input.
Edit:
I just found out the horrible truth: either order worked in glmnet 1.5.2 (which is present on the server where I ran the actual code, resulting in the fit with the class order reversed), but the code from 1.6 requires the order to be "lognet", "glmnet". I have yet to check what happens in 1.7.
Thanks to #Aaron for reminding me of the basics of informatics (besides 'if all else fails, restart': 'check your versions'). I had mistakenly assumed that a package by the gods of statistical learning would be protected from this type of error), and to #Gavin for confirming my reconstruction of how S3 works.
Yes, the order of dispatch is in the order in which the classes are listed in the class attribute. In the simple, every-day case, yes, the first stated class is the one chosen first by method dispatch, and only if it fails to find a method for that class (or NextMethod is called) will it move on to the second class, or failing that search for a default method.
No, I do not think you are right that the order of the classes is wrong in the code. The documentation appears wrong. The intent is clearly to call predict.lognet() first, use the workhorse predict.glmnet() to do the basic computations for all types of lasso/elastic net models fitted by glmnet, and finally do some post processing of those general predictions. That predict.glmnet() is not exported from the glmnet NAMESPACE whilst the other methods are is perhaps telling, also.
I'm not sure why you think the output from this:
predict(myfit, newx=mydata, type="response")
is wrong? I get a matrix of 10 rows and 21 columns, with the columns relating to the intercept-only model prediction plus predictions at 20 values of lambda at which model coefficients along the lasso/elastic net path have been computed. These do not seem to be linear combinations and are one the response scale as you requested.
The order of the classes is not changing. I think you are misunderstanding how the code is supposed to work. There is a bug in the documentation, as the ordering is stated wrong there. But the code is working as I think it should.

Resources