User-specified link function in R for glm. How to? (no documentation found, what are the arguments to use, etc.) [duplicate] - r

This question already has answers here:
modify glm function to adopt user-specified link function in R
(2 answers)
Closed 7 years ago.
This question has already been somewhat addressed on this site in the past, but the answers provided are not fully helpful to me. Here are the details of my questions, which are somewhat different from what has already been discussed here:
After working hard on this, I have been unable to understand how to define my own user-specified link function in R for glm. I have several questions on this.
First of all, I understand I have to write my own function (likely modifying one that already exists), and - in it - I need to define the following elements:
linkfun: the link function.
linkinv: the inverse of the link function, as a function of "eta".
mu.eta: the first derivative of the inverse link with respect to eta.
valideta: a function that must return TRUE if the values of eta are in the correct interval.
All of this is returned in a list.
So far, so good.
Here is the first set of my questions:
The link function is sometimes defined as a function of "y" and sometimes as a function of "mu". What must be done in this respect?
Let's take an example and type make.link("sqrt"). We then discover that linkfun is sqrt(mu), linkinv is eta^2, and mu.eta is 2*eta. So far, so good. However, if you look at make.link("log"), mu.eta is not simply exp(eta) as it should be, but pmax(exp(eta), .Machine$double.eps) (i.e., the maximal value of the first derivative over the whole eta vector). Why? I have been unable to understand this.
Just out of curiosity, why does the algorithm need the first derivative of the inverse link with respect to eta? This is not fully clear to me.
In my specific case, I need a quasi-logistic regression for binomial data. Instead of having a standard logit function log(p/(1-p)), I need to have the slightly modified link function (if p is defined as Y/N): log((Y+0.5)/(N-Y+0.5)).
My other question in this case is:
I have been unable to build this. Can someone give me some hints?
Where can I find a detailed explanation of all of this? I have looked at the good old Chambers & Hastie book (1992), but the explanation is not sufficient. Are there any detailed courses available on the web, etc.?

Not sure whether I can answer all of your questions, but I'll give it a try:
Can you specify a linkfun which takes mu and y? To my knowledge, the link function should only take mu, as the GLM (as opposed to the LM) models a function of the expected value mu (the link function) instead of the expected value itself. Hence, mu should be the only argument.
This has to do with vectorization. pmax returns the parallel maxima, and we want to ensure that we do not report values smaller than .Machine$double.eps. So mu.eta does not return the maximum of all exp(eta) (that would be max(exp(eta), .Machine$double.eps)). Imagine that eta is a vector of all the values for which you want to calculate the derivative; with pmax(.) you make sure that exp(eta) is returned only in those cases where it is indeed larger than .Machine$double.eps, and .Machine$double.eps otherwise. So you also get back a vector, of elementwise maxima. (Try pmax(1:6, 4): you will get [1] 4 4 4 4 5 6.)
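To see this on a whole vector, here is a quick base-R check (output shown as comments):
eta <- c(-800, 0, 1)
make.link("log")$mu.eta(eta)
## [1] 2.220446e-16 1.000000e+00 2.718282e+00   # exp(-800) underflows to 0 and is clipped to .Machine$double.eps
pmax(1:6, 4)
## [1] 4 4 4 4 5 6                               # parallel (elementwise) maxima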
You need the first derivative in order to calculate the score function dL/dbeta[j] = sum_i^n (y[i] - mu[i]) * x[i,j] / (a(phi[i]) * V(mu[i]) * g'(mu[i])) = 0. That is, the derivative of the log-likelihood with respect to beta[j] (i.e. dL/dbeta[j]) depends on:
a(phi[i]): a (known) function of the dispersion parameter coming from the respective distribution (e.g. a(phi) = phi = sigma^2 for the normal distribution);
V(mu[i]): for distributions of the exponential family (for which the GLM was designed) one can show that var(Y) can be written as a(phi) * V(mu), indicating that the variance is indeed a function of the mean;
g'(mu[i]): finally, the derivative of the link function. So in order to solve the score function (and thus to get estimates for beta[j]) you need the derivative of the link function.
So in your case you need to define:
the linkfun
the inverse
the derivative
function to validate eta
I see your problem: your link function would also need to take y as a parameter. However, I am not sure whether glm can deal with that, because in its fitting mechanism it will call linkfun at some point, and looking at the pre-defined link functions, all of them take just one parameter. You could get around that if you twist the code of glm, but this will be quite some work (all of it untested and just food for thought, without any guarantee that it will work):
Provide your linkfun/linkinv etc. as something like function(mu, y) [whatever you want to have here]
Create a copy of glm.fit (glm.fit2 say)
Change calls to linkfun(mu), linkinv(eta) etc. into linkfun(mu, y), linkinv(eta, y) and so forth
When you call glm, provide method = "glm.fit2" to tell glm that it should use your own fitting procedure.
The reference book for this is McCullagh & Nelder, Generalized Linear Models, where you will find all the explanations about the exponential family of distributions, score functions, etc.
You can look into the function powerVarianceFamily of the package EQL, which also uses parameterized families for extended quasi-likelihood estimation.
Update
As just learned from the excellent answer in the previous post, there is no need to redefine glm.fit as long as you use y in your linkfun: by the time linkfun is called, y should be known in the encapsulating function. So you should define linkfun like this:
function(mu) [a function which uses mu and y -
as y is known within the context where this function is called]
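To make this concrete, here is a minimal, untested sketch of such a "link-glm" object for the modified logit log((Y + 0.5)/(N - Y + 0.5)). The names make.mod.logit, y and n are mine, not from the posts; the only requirement is that y and n are visible in the environment where the link is created:
make.mod.logit <- function(y, n) {
  linkfun  <- function(mu) log((y + 0.5) / (n - y + 0.5))        # uses y, n from the enclosing environment
  linkinv  <- function(eta) {
    p <- exp(eta) / (1 + exp(eta))
    pmax(pmin(p, 1 - .Machine$double.eps), .Machine$double.eps)  # keep mu strictly inside (0, 1)
  }
  mu.eta   <- function(eta) {
    p <- exp(eta) / (1 + exp(eta))
    pmax(p * (1 - p), .Machine$double.eps)                       # d(linkinv)/d(eta)
  }
  valideta <- function(eta) TRUE
  structure(list(linkfun = linkfun, linkinv = linkinv, mu.eta = mu.eta,
                 valideta = valideta, name = "mod.logit"),
            class = "link-glm")
}
## sketch of the call (x is a hypothetical predictor):
## glm(cbind(y, n - y) ~ x, family = binomial(link = make.mod.logit(y, n)))
A family constructor such as binomial() accepts an object of class "link-glm" as its link argument, so no change to glm.fit itself should be needed.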

Related

CRAN package submission: "Error: C stack usage is too close to the limit"

Right upfront: this is an issue I encountered when submitting an R package to CRAN. So I
don't have control of the stack size (as the issue occurred on one of CRAN's platforms)
can't provide a reproducible example (as I don't know the exact configurations on CRAN)
Problem
When trying to submit the cSEM.DGP package to CRAN, the automatic pretest (for Debian x86_64-pc-linux-gnu; not for Windows!) failed with the NOTE: C stack usage 7975520 is too close to the limit.
I know this is caused by a function with three arguments whose body is about 800 lines long. The function body consists of additions and multiplications of these arguments. It is the function varzeta6(), which you find here (from line 647 onwards).
How can I address this?
Things I can't do:
provide a reproducible example (at least I would not know how)
change the stack size
Things I am thinking of:
try to break the function into smaller pieces. But I don't know how best to do that.
somehow precompile (?) the function (to be honest, I am just guessing) so CRAN doesn't complain?
Let me know your ideas!
Details / Background
The reason why varzeta6() (and varzeta4() / varzeta5(), and even more so varzeta7()) is so long and R-inefficient is that it is essentially copy-pasted from Mathematica (after simplifying the Mathematica code as much as possible and adapting it to be valid R code). Hence, the code is by no means R-optimized (which @MauritsEvers rightly pointed out).
Why do we need Mathematica? Because what we need is the general form of the model-implied construct correlation matrix of a recursive structural equation model with up to 8 constructs, as a function of the parameters of the model equations. In addition there are constraints.
To get a feel for the problem, let's take a system of two equations that can be solved recursively:
Y2 = beta1*Y1 + zeta1
Y3 = beta2*Y1 + beta3*Y2 + zeta2
What we are interested in are the covariances E(Y1*Y2), E(Y1*Y3), and E(Y2*Y3) as functions of beta1, beta2, beta3 under the constraints that
E(Y1) = E(Y2) = E(Y3) = 0,
E(Y1^2) = E(Y2^2) = E(Y3^2) = 1
E(Yi*zeta_j) = 0 (with i = 1, 2, 3 and j = 1, 2)
For such a simple model, this is rather trivial:
E(Y1*Y2) = E(Y1*(beta1*Y1 + zeta1)) = beta1*E(Y1^2) + E(Y1*zeta1) = beta1
E(Y1*Y3) = E(Y1*(beta2*Y1 + beta3*(beta1*Y1 + zeta1) + zeta2)) = beta2 + beta3*beta1
E(Y2*Y3) = ...
But you see how quickly this gets messy when you add Y4, Y5, up to Y8.
In general the model-implied construct correlation matrix can be written as (the expression actually looks more complicated because we also allow for up to 5 exogenous constructs; this is why varzeta1() already looks complicated, but ignore this for now):
V(Y) = (I - B)^-1 V(zeta)(I - B)'^-1
where I is the identity matrix and B a lower triangular matrix of model parameters (the betas). V(zeta) is a diagonal matrix. The functions varzeta1(), varzeta2(), ..., varzeta7() compute the main diagonal elements. Since we constrain Var(Yi) to always be 1, the variances of the zetas follow. Take for example the equation Var(Y2) = beta1^2*Var(Y1) + Var(zeta1) --> Var(zeta1) = 1 - beta1^2. This looks simple here, but it becomes extremely complicated when we take the variance of, say, the 6th equation in such a chain of recursive equations, because Var(zeta6) depends on all previous covariances between Y1, ..., Y5, which are themselves dependent on their respective previous covariances.
OK, I don't know if that makes things any clearer. Here are the main points:
The code for varzeta1(), ..., varzeta7() is copy-pasted from Mathematica and hence not R-optimized.
Mathematica is required because, as far as I know, R cannot handle symbolic calculations.
I could R-optimize "by hand" (which is extremely tedious).
I think the structure of the varzetaX() functions must be taken as given. The question therefore is: can I somehow use these functions anyway?
One conceivable approach is to try to convince the CRAN maintainers that there's no easy way for you to fix the problem. This is a NOTE, not a WARNING; the CRAN repository policy says:
In principle, packages must pass R CMD check without warnings or significant notes to be admitted to the main CRAN package area. If there are warnings or notes you cannot eliminate (for example because you believe them to be spurious) send an explanatory note as part of your covering email, or as a comment on the submission form
So, you could take a chance that your well-reasoned explanation (in the comments field on the submission form) will convince the CRAN maintainers. In the long run it would be best to find a way to simplify the computations, but it might not be necessary to do it before submission to CRAN.
This is a bit too long as a comment, but hopefully this will give you some ideas for optimising the code for the varzeta* functions; or at the very least, it might give you some food for thought.
There are a few things that confuse me:
All varzeta* functions have arguments beta, gamma and phi, which seem to be matrices. However, in varzeta1 you don't use beta, yet beta is the first function argument.
I struggle to link the details you give at the bottom of your post with the code for the varzeta* functions. You don't explain where the gamma and phi matrices come from, nor what they denote. Furthermore, seeing that the betas are the model's parameter estimates, I don't understand why beta should be a matrix.
As I mentioned in my earlier comment, I would be very surprised if these expressions cannot be simplified. R can do a lot of matrix operations quite comfortably; there shouldn't really be a need to pre-calculate individual terms.
For example, you can use crossprod and tcrossprod to calculate cross products, and %*% implements matrix multiplication.
Secondly, a lot of mathematical operations in R are vectorised. I already mentioned that you can simplify
1 - gamma[1,1]^2 - gamma[1,2]^2 - gamma[1,3]^2 - gamma[1,4]^2 - gamma[1,5]^2
as
1 - sum(gamma[1, ]^2)
since the ^ operator is vectorised.
Perhaps more fundamentally, this seems somewhat of an XY problem to me where it might help to take a step back. Not knowing the full details of what you're trying to model (as I said, I can't link the details you give to the cSEM.DGP code), I would start by exploring how to solve the recursive SEM in R. I don't really see the need for Mathematica here. As I said earlier, matrix operations are very standard in R; analytically solving a set of recursive equations is also possible in R. Since you seem to come from the Mathematica realm, it might be good to discuss this with a local R coding expert.
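For illustration, here is a minimal sketch (my own numbers, chosen only so that the diagonal of V(Y) is 1) of computing the model-implied covariance matrix directly from the formula in your question, instead of expanding it term by term:
B <- matrix(0, 3, 3)                   # lower triangular matrix of betas
B[2, 1] <- 0.5                         # beta1
B[3, 1] <- 0.3; B[3, 2] <- 0.4         # beta2, beta3
V_zeta <- diag(c(1, 1 - 0.5^2, 0.63))  # diag: Var(Y1), Var(zeta1), Var(zeta2), set so that Var(Yi) = 1
Iinv   <- solve(diag(3) - B)           # (I - B)^-1
V_Y    <- Iinv %*% V_zeta %*% t(Iinv)  # model-implied covariance matrix V(Y)
round(V_Y, 2)                          # unit diagonal; Cov(Y1,Y2) = 0.5, Cov(Y1,Y3) = 0.3 + 0.4*0.5 = 0.5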
If you must use those scary varzeta* functions (and I really doubt that), an option may be to rewrite them in C++ and then compile them with Rcpp to turn them into R functions. Perhaps that will avoid the C stack usage limit?

Recursive arc-length reparameterization of an arbitrary curve

I have a 3D parametric curve defined as P(t) = [x(t), y(t), z(t)].
I'm looking for a function to reparametrize this curve in terms of arc-length. I'm using OpenSCAD, which is a declarative language with no variables (constants only), so the solution needs to work recursively (and with no variables aside from global constants and function arguments).
More precisely, I need to write a function Q(s) that gives the point on P that is (approximately) distance s along the arc from the point where t=0. I already have functions for numeric integration and derivation that can be incorporated into the answer.
Any suggestions would be greatly appreciated!
P.S. It's not possible to pass functions as a parameter in OpenSCAD; I usually get around this by just using global declarations.
The arc length sigma between parameter values t=0 and t=T can be computed by evaluating the following integral:
sigma(T) = Integral[ sqrt[ x'(t)^2 + y'(t)^2 + z'(t)^2 ],{t,0,T}]
If you want to parametrize your curve with the arc length, you have to invert this formula. This is unfortunately rather difficult from a mathematical point of view. The simplest method is to implement a bisection method as a numeric solver. The computation quickly becomes heavy, so reusing previous results is ideal. Newton's method is also useful, as the derivative of sigma(t) is already known and equals
sigma'(t) = sqrt[ x'(t)^2 + y'(t)^2 + z'(t)^2]
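A rough sketch of the bisection idea, written in R rather than OpenSCAD and using a helix as an example curve (your own numeric integrator and derivatives would replace integrate() and speed()):
## P(t) = c(cos(t), sin(t), t); speed(t) = |P'(t)| = sigma'(t)
speed <- function(t) sqrt(sin(t)^2 + cos(t)^2 + 1)     # = sqrt(2) for this helix
sigma <- function(T) integrate(speed, 0, T)$value      # arc length from t = 0 to t = T
arc_param <- function(s, t_lo = 0, t_hi = 10, tol = 1e-8) {
  while (t_hi - t_lo > tol) {                          # bisect on the parameter interval
    t_mid <- (t_lo + t_hi) / 2
    if (sigma(t_mid) < s) t_lo <- t_mid else t_hi <- t_mid
  }
  (t_lo + t_hi) / 2                                    # t such that sigma(t) is approximately s
}
arc_param(2 * sqrt(2))   # approx. 2, since sigma(t) = sqrt(2) * t here; Q(s) is then P(arc_param(s))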
Maybe not really the most helpful answer, but I hope it gives you some ideas. I cannot help you with the OpenSCAD implementation.

Basis provided by Ns() in R Epi package

As I was working out how Epi generates the basis for its spline functions (via the function Ns), I was a little confused by how it handles the detrend argument.
When detrend=T I would have expected that Epi::Ns(...) would more or less project the basis given by splines::ns(...) onto the orthogonal complement of the column space of [1 t] and finally extract the set of linearly independent columns (so that we have a basis).
However, this doesn't appear to be exactly the case; I tried
library(Epi)
x=seq(-0.75, 0.75, length.out=5)
Ns(x, knots=c(-0.5,0,0.5), Boundary.knots=c(-1,1), detrend=T)
and
library(splines)
detrend(ns(x, knots=c(-0.5,0,0.5), Boundary.knots=c(-1,1)), x)
The matrices produced by the above code are not the same; however, they do have the same column space (in this example), suggesting that if plugged into a linear model, the fitted coefficients will be different but the fit (itself) will be the same.
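For what it's worth, one quick way to check the same-column-space observation in this example is to compare ranks (a sketch reusing x and the packages loaded above):
M1 <- Ns(x, knots = c(-0.5, 0, 0.5), Boundary.knots = c(-1, 1), detrend = TRUE)
M2 <- detrend(ns(x, knots = c(-0.5, 0, 0.5), Boundary.knots = c(-1, 1)), x)
c(qr(M1)$rank, qr(M2)$rank, qr(cbind(M1, M2))$rank)   # all three equal => same column space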
The first question I had was: is this true in general?
The second question is: why are the two different?
Regarding the second question - when detrend is specified, Epi::Ns gives a warning that fixsl is ignored.
Diving into the Epi GitHub NS.r ... in the construction of the basis, in the call to Epi::Ns above with detrend=T, the worker ns.ld() is called (a function almost identical to the guts of splines::ns()), which passes c(NA,NA) along to splines::spline.des as the derivs argument in determining a matrix const:
const <- splines::spline.des( Aknots, Boundary.knots, 4, c(2-fixsl[1],2-fixsl[2]))$design
This is the difference between what happens in Ns(detrend=T) and the call to ns() above which passes c(2,2) to splineDesign as the derivs argument.
So that explains how they are different, but not why. Does anyone have an explanation for why fixsl=c(NA,NA) is used instead of fixsl=c(F,F) in Epi::Ns()?
And does anyone have a proof of, or an answer to, the first question?
I think the orthogonal complement of const's column space is used so that the second (or desired) derivatives are zero at the boundary (via projection of the general spline basis) - but I'm not sure about this step as I haven't dug into the mathematics; I'm just going by my 'feel' for it. Perhaps if I understood this better, the difference in the result for const from the call to splineDesign/spline.des (in ns() and Ns() respectively) would explain why the two matrices from the start are not the same, yet yield the same fit.
The fixsl=c(NA,NA) was a bug that was fixed a while ago. See the commits on the CRAN GitHub mirror.
I have nevertheless sent an email to the maintainer to ask if the fix could be made a little more consistent with the condition, but in principle this could be closed.

Equation of rbfKernel in kernlab is different from the standard?

I have observed that kernlab defines the rbf kernel as
rbf(x,y) = exp(-sigma * euclideanNorm(x-y)^2)
but according to this wiki link, the rbf kernel should be of the form
rbf(x,y) = exp(-euclideanNorm(x-y)^2/(2*sigma^2))
which is also more intuitive since two close samples with a large kernel sigma value will lead to a higher similarity matching.
I am not sure what e1071 svm uses (native code libsvm?)
I hope someone can enlighten me on why there is a difference. I caught this because I was initially using e1071 but switched to ksvm, and saw inconsistent results between the two.
A small example for comparison
library(kernlab)
set.seed(123)
x <- rnorm(3)
y <- rnorm(3)
sigma <- 100
rbf <- rbfdot(sigma=sigma)
rbf(x, y)
exp( -sum((x-y)^2)/(2*sigma^2) )
I would expect the kernel value to be close to 1 (since x,y come from sigma=1, while kernel sigma=100). This is observed only in the second case.
I came across that discrepancy too, and I wound up digging into the source to figure out whether there was a typo in the documentation or what exactly was going on, since sigma in the context of Gaussians traditionally appears as the standard deviation in the denominator, right?
Here's the relevant source
**kernlab\R\kernels.R**
## Define the kernel objects,
## functions with an additional slot for the kernel parameter list.
## kernel functions take two vector arguments and return a scalar (dot product)
rbfdot <- function(sigma = 1)
{
  rval <- function(x, y = NULL)
  {
    if(!is(x,"vector")) stop("x must be a vector")
    if(!is(y,"vector")&&!is.null(y)) stop("y must a vector")
    if (is(x,"vector") && is.null(y)){
      return(1)
    }
    if (is(x,"vector") && is(y,"vector")){
      if (!length(x)==length(y))
        stop("number of dimension must be the same on both data points")
      return(exp(sigma*(2*crossprod(x,y) - crossprod(x) - crossprod(y))))
      # sigma/2 or sigma ??
    }
  }
  return(new("rbfkernel",.Data=rval,kpar=list(sigma=sigma)))
}
You can observe from their comment sigma/2 or sigma ?? that they may perhaps have been a bit unsure about which convention to adopt; the presence of /2 would be consistent with the standard-deviation form /(2*sigma^2), but I can only speculate about this.
Now another corroborating piece of evidence is in the help page for ? rbfdot which reads...
sigma The inverse kernel width used by the Gaussian the Laplacian,
the Bessel and the ANOVA kernel
And that is consistent with the form they use, with sigma in the numerator, since in the denominator it would scale proportionally with the width of the Gaussian. So it indeed looks like they settled on the convention that is described in the Wikipedia article as the gamma form, which says
An equivalent, but simpler, definition involves a parameter gamma = 1/(2*sigma^2)
So the difference just seems to be a matter of adopting different but equivalent conventions. One motivator for the particular convention (which someone may confirm in a comment) may arise from issues of code reuse and consistency, where as you see the parameter is used by three other kernel forms that may have their parameters more traditionally set in the numerator. I'm not sure on that point however since I've never used those alternate kernels and am unfamiliar with each.
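In practice you can translate between the two conventions: kernlab's sigma plays the role of gamma, so to reproduce the standard-deviation form you set sigma = 1/(2*sd^2). A quick check along the lines of the example above (sd_ is my name for the width parameter):
library(kernlab)
set.seed(123)
x <- rnorm(3)
y <- rnorm(3)
sd_ <- 100                                # the "width" sigma from the standard form
rbf <- rbfdot(sigma = 1 / (2 * sd_^2))    # kernlab's sigma is the gamma parameter
rbf(x, y)                                 # now matches the hand-computed value below
exp(-sum((x - y)^2) / (2 * sd_^2))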

Change Objective Function in nls.lm() in "R"

I'm using the function nls.lm {package: minpack.lm} to optimize a parameterisation for a hydrological model. The function is working quite well, but I want to use another objective function (OF). Normally, the objective function "fn" in nls.lm is defined as
A function that returns a vector of residuals, the sum square of which
is to be minimized. The first argument of fn must be par.
Now I want to use the Nash-Sutcliffe efficiency (NSE), which is defined as
NSE <- 1 - (sum((obs - sim)^2) / sum((obs - mean(obs))^2))
or other OFs. The problem is that nls.lm minimizes the expression sum(x^2) and only the x is modifiable. I know that the best fit is NSE = 1. Thus 1 - NSE creates a real minimization problem.
BTW: Example 1 from the nls.lm help page is a good example; there
observed - getPred(p,xx)
is the residual vector, which actually means that
sum((observed - getPred(p,xx))^2)
is minimized by the nls.lm function, where getPred(p,xx) returns sim.
Any suggestion would be helpful. Thanks in advance.
Micha
nls.lm (and the related functions nls and nlsLM) are designed to minimize the sum of squared residuals. For the problem you are trying to solve, I would try a general-purpose minimizer.
If the problem is 'not too hard' (that is, has a single global minimum, is smooth), you could try to apply 'optim' to it (I would try the 'Nelder-Mead' and 'BFGS' options first), or the 'bobyqa' function from the package 'minqa', among other functions.
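For instance, a rough sketch with optim (sim_model, obs and start_values are placeholders for your model function, observations and starting parameters):
## minimise 1 - NSE instead of a sum of squared residuals
obj <- function(par) {
  sim <- sim_model(par)                                     # run the hydrological model
  nse <- 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2)  # Nash-Sutcliffe efficiency
  1 - nse                                                   # best fit has NSE = 1, so minimise 1 - NSE
}
fit <- optim(par = start_values, fn = obj, method = "Nelder-Mead")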
If the problem requires a global optimizer, you could try the 'GenSA' function from package 'GenSA', the 'genoud' function from the package 'rgenoud', or the 'DEoptim' function from package 'DEoptim', among other options. A review on 'Global Optimization in R' is forthcoming in the Journal of Statistical Software, and that will give a more complete overview of applicable functions.
