error on boxcox transformation response variable must be positive - r

I have tried, log, sqrt root and arcsine transformation on my data and nothing worked.
I tried to use boxcox and I got the error response variable must be positive
md<-lm(Score1~Location1+Site1+Trial1+Stage1)
summary(md)
plot(md, which = 1)
bc<-boxcox(md, plotit = T, lambda = seq(0.5,1.5, by =0.1))
This is what I ran on R and I got the error message
Any idea on how I can fix my code?

The Box-Cox transformation is defined as BC(y) = (y^lambda - 1)/lambda (and as log(y) for lambda==0). This transformation is not generally well-defined for negative y values (because it requires raising negative values to a power, which generates complex values in most cases).
The car package provides similar transformations that allow negative input, specifically the Yeo-Johnson transformation (see Wikipedia link above) and an adjusted version of the Box-Cox transformation; these are available via car::yjPower() and car::bcnPower() respectively (see the documentation for more details).
(Now that you know the names of some alternatives, you can also search for other R functions/packages that provide them.)

Related

How to get upper and lower bounds of objective vector in gurobi R

I'm trying to get the upper and lower bound vectors of the objective vector that will keep the same optimal solution of a linear program. I am using gurobi in R to solve my LP. The gurobi reference manual says that the attributes SAObjLow and SAObjUP will give you these bounds, but I cannot find them in the output of my gurobi call.
Is there a special way to tell the solver to return these vectors?
The only values that I see in the output of my gurobi call are status, runtime, itercount, baritercount, nodecount, objval, x, slack, rc, pi, vbasis, cbasis, objbound. The dual variables and reduced costs are returned in pi and rc, but not bounds on the objective vector.
I have tried forcing all 6 different 'methods' but none of them return what I'm looking for.
I know I can get these easily using the lpsolve R package, but I'm solving a relatively large problem and I trust gurobi more than this package.
Here's a reproducible example...
library(gurobi)
model = list()
model$obj = c(500,450)
model$modelsense = 'max'
model$A = matrix(c(6,10,1,5,20,0),3,2)
model$rhs = c(60,150,8)
model$sense = '<'
sol = gurobi(model)
names(sol)
Ideally something like SAObjLow would be one of the possible entries in sol.
Not all attributes are available in the Gurobi R interface - this includes the ones for sensitivity analysis.
You may find this example helpful.
Alternatively, you can use a different API, like Python, to query all available information.

Basis provided by Ns() in R Epi package

As I was working out how Epi generates the basis for its spline functions (via the function Ns), I was a little confused by how it handles the detrend argument.
When detrend=T I would have expected that Epi::Ns(...) would more or less project the basis given by splines::ns(...) onto the orthogonal complement of the column space of [1 t] and finally extract the set of linearly independent columns (so that we have a basis).
However, this doesn’t appear to be the exactly the case; I tried
library(Epi)
x=seq(-0.75, 0.75, length.out=5)
Ns(x, knots=c(-0.5,0,0.5), Boundary.knots=c(-1,1), detrend=T)
and
library(splines)
detrend(ns(x, knots=c(-0.5,0,0.5), Boundary.knots=c(-1,1)), x)
The matrices produced by the above code are not the same, however, they do have the same column space (in this example) suggesting that if plugged in to a linear model, the fitted coefficients will be different but the fit (itself) will be the same.
The first question I had was; is this true in general?
The second question is why are the two different?
Regarding the second question - when detrend is specified, Epi::Ns gives a warning that fixsl is ignored.
Diving into Epi github NS.r ... in the construction of the basis, in the call to Epi::Ns above with detrend=T, the worker ns.ld() is called (a function almost identical to the guts of splines::ns()), which passes c(NA,NA) along to splines::spline.des as the derivs argument in determining a matrix const;
const <- splines::spline.des( Aknots, Boundary.knots, 4, c(2-fixsl[1],2-fixsl[2]))$design
This is the difference between what happens in Ns(detrend=T) and the call to ns() above which passes c(2,2) to splineDesign as the derivs argument.
So that explains how they are different, but not why? Does anyone have an explanation for why fixsl=c(NA,NA) is used instead of fixsl=c(F,F) in Epi::Ns()?
And does anyone have a proof/or an answer to the first question?
I think the orthogonal complement of const's column space is used so that second (or desired) derivatives are zero at the boundary (via projection of the general spline basis) - but I'm not sure about this step as I haven't dug into the mathematics, I'm just going by my 'feel' for it. Perhaps if I understood this better, the reason that the differences in the result for const from the call to splineDesign/spline.des (in ns() and Ns() respectively) would explain why the two matrices from the start are not the same, yet yield the same fit.
The fixsl=c(NA,NA) was a bug that has been fixed since a while. See the commits on the CRAN Github mirror.
I have still sent an email to the maintainer to ask if the fix could be made a little bit more consistent with the condition, but in principle this could be closed.

Passing a list to a function in R for using in optimization

I want to program the maximum likelihood of a gamma distribution in R; until now I have done the following:
library(stats4)
x<-scan("http://www.cmc.edu/pages/faculty/MONeill/Math152/Handouts/gamma-arrivals.txt")
loglike2<-function(LL){
alpha<-LL$a
beta<-LL$b
(alpha-1)*sum(log(x))-n*alpha*log(beta)-n*lgamma(alpha)}
mle(loglike2,start=list(a=0.5,b=0.5))
but when I want to run it, the following message appear:
Error in mle(loglike2, start = list(a = 0.5, b = 0.5)) :
some named arguments in 'start' are not arguments to the supplied log-likelihood
What am I doing wrong?
From the error message it sounds like mle needs to be able to see the variable names listed in start= in the function call itself.
loglike2<-function(a, b){
alpha<-a
beta<-b
(alpha-1)*sum(log(x))-n*alpha*log(beta)-n*lgamma(alpha)
}
mle(loglike2,start=list(a=0.5,b=0.5))
If that doesn't work you should post a reproducible example with all variables defined and also explicitly indicate which package the mle function is coming from.
The error message is unfortunately criptic because it indicates mising
values owing to the fact that alpha and gamma have to be positive and mle optimizes over the real numbers. Hence, you need to transfomt the vector over which the function is being optimized, like so:
library(stats4)
x<-scan("http://www.cmc.edu/pages/faculty/MONeill/Math152/Handouts/gamma-arrivals.txt")
loglike<-function(alpha,beta){
(alpha-1)*sum(log(x))-n*alpha*log(beta)-n*lgamma(alpha)
}
fit <- mle(function(alpha,beta)
# transfrom the parameters so they are positive
loglike(exp(alpha),exp(beta)),
start=list(alpha=log(.5),beta=log(.5)))
# of course you would have to exponentiate the estimates too.
exp(coef(fit1))
note that the error now is that you are using n in loglike()
which you have not defined. If you define n, then you get an error stating
Lapack routine dgesv: system is exactly singular: U[1,1] = 0. which is
caused either by a not very good guess for the start value of alpha and
beta or (more likely) that loglike() does not have a minima (I think your
deleted post from last night had a slightly different formula which I was
able to get working, but not able to respond to b/c the post was deleted...)
FYI, if you want to inspect the alpha and beta parameters that cause the
errors, you can use scoping assignment to post the most recently called
parameters to the environment in which loglike() is defined as in:
loglike<-function(alpha,beta){
g <<- c(alpha,beta)
(alpha-1)*sum(log(x))-n*alpha*log(beta)-n*lgamma(alpha)
}

How to handle boundary constraints when using `nls.lm` in R

I asked this question a while ago. I am not sure whether I should post this as an answer or a new question. I do not have an answer but I "solved" the problem by applying the Levenberg-Marquardt algorithm using nls.lm in R and when the solution is at the boundary, I run the trust-region-reflective algorithm (TRR, implemented in R) to step away from it. Now I have new questions.
From my experience, doing this way the program reaches the optimal and is not so sensitive to the starting values. But this is only a practical method to step aside from the issues I encounterd using nls.lm and also other optimization functions in R. I would like to know why nls.lm behaves this way for optimization problems with boundary constraints and how to handle the boundary constraints when using nls.lm in practice.
Following I gave an example illustrating the two issues using nls.lm.
It is sensitive to starting values.
It stops when some parameter reaches the boundary.
A Reproducible Example: Focus Dataset D
library(devtools)
install_github("KineticEval","zhenglei-gao")
library(KineticEval)
data(FOCUS2006D)
km <- mkinmod.full(parent=list(type="SFO",M0 = list(ini = 0.1,fixed = 0,lower = 0.0,upper =Inf),to="m1"),m1=list(type="SFO"),data=FOCUS2006D)
system.time(Fit.TRR <- KinEval(km,evalMethod = 'NLLS',optimMethod = 'TRR'))
system.time(Fit.LM <- KinEval(km,evalMethod = 'NLLS',optimMethod = 'LM',ctr=kingui.control(runTRR=FALSE)))
compare_multi_kinmod(km,rbind(Fit.TRR$par,Fit.LM$par))
dev.print(jpeg,"LMvsTRR.jpeg",width=480)
The differential equations that describes the model/system is:
"d_parent = - k_parent * parent"
"d_m1 = - k_m1 * m1 + k_parent * f_parent_to_m1 * parent"
In the graph on the left is the model with initial values, and in the middle is the fitted model using "TRR"(similar to the algorithm in Matlab lsqnonlin function ), on the right is the fitted model using "LM" with nls.lm. Looking at the fitted parameters(Fit.LM$par) you will find that one fitted parameter(f_parent_to_m1) is at the boundary 1. If I change the starting value for one parameter M0_parent from 0.1 to 100, then I got the same results using nls.lm and lsqnonlin.I have many cases like this one.
newpars <- rbind(Fit.TRR$par,Fit.LM$par)
rownames(newpars)<- c("TRR(lsqnonlin)","LM(nls.lm)")
newpars
M0_parent k_parent k_m1 f_parent_to_m1
TRR(lsqnonlin) 99.59848 0.09869773 0.005260654 0.514476
LM(nls.lm) 84.79150 0.06352110 0.014783294 1.000000
Except for the above problems, it often happens that the Hessian returned by nls.lm is not invertable(especially when some parameters are on the boundary) so I cannot get an estimation of the covariance matrix. On the other hand, the "TRR" algorithm(in Matlab) almost always give an estimation by calculating the Jacobian at the solution point. I think this is useful but I am also sure that R optimization algorithms(the ones I have tried) did not do this for a reason. I would like to know whether I am wrong by using the Matlab way of calculating the covariance matrix to get standard error for the parameter estimates.
One last note, I claimed in my previous post that the Matlab lsqnonlin outperforms R's optimization functions in almost all cases. I was wrong. The "Trust-Region-Reflective" algorithm used in Matlab is in fact slower(sometimes much slower) if also implemented in R as you can see from the above example. However, it is still more stable and reaches a better solution than the R's basic optimization algorithms.
First off, I am not an expert on Matlab and Optimisation and have never used R.
I am not sure I see what your actual question is, but maybe I can shed some light into your puzzlement:
LM is slightly enhanced Gauß-Newton approach - for problems with several local minima it is very sensitive to initial states. Including boundaries typically generates more of those minima.
TRR is akin to LM, but more robust. It has better capabilities for "jumping out of" bad local minima. It is quite feasible that it will behave better, but perform worse, than an LM. Actually explaining why is very hard. You would need to study the algorithms in detail and look at how they behave in this situation.
I cannot explain the difference between Matlab's and R's implementation, but there are several extensions to TRR that maybe Matlab uses and R does not.
Does your approach of using LM and TRR alternatingly converge better than TRR alone?
Using the mkin package, you can find the parameters using the "Port" algorithm (which is also a kind of a TRR algorithm as far as I could tell from its documentation), or the "Marq" algorithm, which uses nls.lm in the background. Then you can use "normal" starting values or "bad" starting values.
library(mkin)
packageVersion("mkin")
Recent mkin version can speed up the process considerably as they compile the models from automatically generated C code if a compiler is available on your system (e.g. you have r-base-dev installed on Debian/Ubuntu, or Rtools on Windows).
This defines the model:
m <- mkinmod(parent = mkinsub("SFO", "m1"),
m1 = mkinsub("SFO"),
use_of_ff = "max")
You can check that the differential equations are correct:
cat(m$diffs, sep = "\n")
Then we fit in four variants, Port and LM, with or without M0 fixed to 0.1:
f.Port = mkinfit(m, FOCUS_2006_D)
f.Port.M0 = mkinfit(m, FOCUS_2006_D, state.ini = c(parent = 0.1, m1 = 0))
f.LM = mkinfit(m, FOCUS_2006_D, method.modFit = "Marq")
f.LM.M0 = mkinfit(m, FOCUS_2006_D, state.ini = c(parent = 0.1, m1 = 0),
method.modFit = "Marq")
Then we look at the results:
results <- sapply(list(Port = f.Port, Port.M0 = f.Port.M0, LM = f.LM, LM.M0 = f.LM.M0),
function(x) round(summary(x)$bpar[, "Estimate"], 5))
which are
Port Port.M0 LM LM.M0
parent_0 99.59848 99.59848 99.59848 39.52278
k_parent 0.09870 0.09870 0.09870 0.00000
k_m1 0.00526 0.00526 0.00526 0.00000
f_parent_to_m1 0.51448 0.51448 0.51448 1.00000
So we can see that the Port algorithm finds the best solution (to the best of my knowledge) even with bad starting values. The speed issue that one may have with more complicated models is alleviated using the automatic generation of C code.

numerical differentiation with Scipy

I was trying to learn Scipy, using it for mixed integrations and differentiations, but at the very initial step I encountered the following problems.
For numerical differentiation, it seems that the only Scipy function that works for callable functions is scipy.derivative() if I'm right!? However, I couldn't work with it:
1st) when I am not going to specify the point at which the differentiation is to be taken, e.g. when the differentiation is under an integral so that it is the integral that should assign the numerical values to its integrand's variable, not me. As a simple example I tried this code in Sage's notebook:
import scipy as sp
from scipy import integrate, derivative
var('y')
f=lambda x: 10^10*sin(x)
g=lambda x,y: f(x+y^2)
I=integrate.quad( sp.derivative(f(y),y, dx=0.00001, n=1, order=7) , 0, pi)[0]; show(I)
show( integral(diff(f(y),y),y,0,1).n() )
also it gives the warning that "Warning: The occurrence of roundoff error is detected, which prevents the requested tolerance from being achieved. The error may be underestimated." and I don't know what does this warning stand for as it persists even with increasing "dx" and decreasing the "order".
2nd) when I want to find the derivative of a multivariable function like g(x,y) in the above example and something like sp.derivative(g(x,y),(x,0.5), dx=0.01, n=1, order=3) gives error, as is easily expected.
Looking forward to hearing from you about how to resolve the above cited problems with numerical differentiation.
Best Regards
There are some strange problems with your code that suggest you need to brush up on some python! I don't know how you even made these definitions in python since they are not legal syntax.
First, I think you are using an older version of scipy. In recent versions (at least from 0.12+) you need from scipy.misc import derivative. derivative is not in the scipy global namespace.
Second, var is not defined, although it is not necessary anyway (I think you meant to import sympy first and use sympy.var('y')). sin has also not been imported from math (or numpy, if you prefer). show is not a valid function in sympy or scipy.
^ is not the power operator in python. You meant **
You seem to be mixing up the idea of symbolic and numeric calculus operations here. scipy won't numerically differentiate an expression involving a symbolic object -- the second argument to derivative is supposed to be the point at which you wish to take the derivative (i.e. a number). As you say you are trying to do numeric differentiation, I'll resolve the issue for that purpose.
from scipy import integrate
from scipy.misc import derivative
from math import *
f = lambda x: 10**10*sin(x)
df = lambda x: derivative(f, x, dx=0.00001, n=1, order=7)
I = integrate.quad( df, 0, pi)[0]
Now, this last expression generates the warning you mentioned, and the value returned is not very close to zero at -0.0731642869874073 in absolute terms, although that's not bad relative to the scale of f. You have to appreciate the issues of roundoff error in finite differencing. Your function f varies on your interval between 0 and 10^10! It probably seems paradoxical, but making the dx value for differentiation too small can actually magnify roundoff error and cause numerical instability. See the second graph here ("Example showing the difficulty of choosing h due to both rounding error and formula error") for an explanation: http://en.wikipedia.org/wiki/Numerical_differentiation
In fact, in this case, you need to increase it, say to 0.001: df = lambda x: derivative(f, x, dx=0.001, n=1, order=7)
Then, you can integrate safely, with no terrible roundoff.
I=integrate.quad( df, 0, pi)[0]
I don't recommend throwing away the second return value from quad. It's an important verification of what happened, as it is "an estimate of the absolute error in the result". In this case, I == 0.0012846582250212652 and the abs error is ~ 0.00022, which is not bad (the interval that implies still does not include zero). Maybe some more fiddling with the dx and absolute tolerances for quad will get you an even better solution, but hopefully you get the idea.
For your second problem, you simply need to create a proper scalar function (call it gx) that represents g(x,y) along y=0.5 (this is called Currying in computer science).
g = lambda x, y: f(x+y**2)
gx = lambda x: g(x, 0.5)
derivative(gx, 0.2, dx=0.01, n=1, order=3)
gives you a value of the derivative at x=0.2. Naturally, the value is huge given the scale of f. You can integrate using quad like I showed you above.
If you want to be able to differentiate g itself, you need a different numerical differentiation functio. I don't think scipy or numpy support this, although you could hack together a central difference calculation by making a 2D fine mesh (size dx) and using numpy.gradient. There are probably other library solutions that I'm not aware of, but I know my PyDSTool software contains a function diff that will do that (if you rewrite g to take one array argument instead). It uses Ridder's method and is inspired from the Numerical Recipes pseudocode.

Resources