Vectorising two similar functions in R: it works for one but not the other

Today I came across a problem: two almost identical functions work as intended before vectorisation, but after it one works fine and the other returns an error.
I am examining the robustness of various estimators with respect to different transformations of residuals and aggregating functions. Quantile Regression and Least Median of Squares are particular cases of what I am doing.
So I wrote the following code to see how the Least Trimean of Squares would behave, and found that it works fine if the model parameters are supplied as separate arguments but fails if they come in a vector. I need the first form for plotting (it is convenient to use outer(...) to get a value matrix for persp, or simply to supply f(x, y) to persp3d from library(rgl)), and the second form for estimation (R optimisers expect a vector of inputs as the first argument over which the minimisation is done).
MWE:
set.seed(105)
N <- 204
x <- rlnorm(N)
y <- 1 + x + rnorm(N)*sqrt(.1+.2*x+.3*x^2)
# A simple linear model with heteroskedastic errors
resfun <- function(x) return(x^2)
# Going to minimise a function of squared residuals...
distfun <- function(x) return(mean(quantile(x, c(0.25, 0.5, 0.5, 0.75))))
# ...which in this case is the trimean
penalty <- function(theta0, theta1) {
  r <- y - theta0 - theta1*x
  return(distfun(resfun(r)))
}
pen2 <- function(theta) {
  r <- y - theta[1] - theta[2]*x
  return(distfun(resfun(r)))
}
penalty(1, 1) # 0.5352602
pen2(c(1, 1)) # 0.5352602
vpenalty <- Vectorize(penalty)
vpen2 <- Vectorize(pen2)
vpenalty(1, 1) # 0.5352602
vpen2(c(1, 1))
Error in quantile.default(x, c(0.25, 0.5, 0.5, 0.75)) :
missing values and NaN's not allowed if 'na.rm' is FALSE
Why does vpen2, the vectorised version of pen2, choke even on a single input?

As jogo pointed out, vpen2 works element-wise: it splits the input vector and passes each element to pen2 separately, so inside pen2 only theta[1] is available and theta[2] is NA, which produces the NaN values that quantile() complains about. The right way to go is to use something like
a <- matrix(..., ncol=2)
apply(a, 1, pen2)
This will return a vector of values of pen2 evaluated at each row of the matrix.
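For instance, with the MWE above (the grid limits below are arbitrary and chosen only for illustration):
theta0s <- seq(0, 2, 0.1)
theta1s <- seq(0, 2, 0.1)
# Value matrix for persp/persp3d via the two-argument form
z <- outer(theta0s, theta1s, vpenalty)
persp(theta0s, theta1s, z)
# Row-wise evaluation of the vector form, and the optimiser call it was written for
a <- as.matrix(expand.grid(theta0s, theta1s))
vals <- apply(a, 1, pen2)    # one penalty value per (theta0, theta1) row
optim(c(1, 1), pen2)$par     # optim passes the parameter vector straight to pen2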

Related

Issue with the dimension of matrix being optimised in R

I am attempting to calculate some weights in order to perform an indirect treatment comparison using R. I have altered some code slightly, in order to reflect that I am only centring the mean. However, this code will not run.
I believe this is due to the a1 matrix having two columns instead of one, but I really can't work out how to change this. I have tried adding a column of zeros and ones to the matrix, but I'm not sure if this will give me a correct result.
Of course, this may not be the issue at all, but I fail to see what else could be causing this. I have included the code and any advice would be appreciated.
# Objective function
objfn <- function(a1, X){
  sum(exp(X %*% a1))
}
# Gradient function
gradfn <- function(a1, X){
  colSums(sweep(X, 1, exp(X %*% a1), "*"))
}
X.EM.0 = data$A-age.mean
# Estimate weights
print(opt1 <- optim(par = c(0,0), fn = objfn, gr = gradfn, X = X.EM.0, method = "BFGS"))
a1 <- opt1$par
Such a simple solution; I'm slightly embarrassed to have posted this.
par = c(0,0) should be altered to match the number of columns of X. Here X.EM.0 has a single column, so it should have been a single starting value.
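For instance, a sketch of the corrected call (assuming data$A and age.mean are defined as in the original script and the centred age is the only covariate):
X.EM.0 <- matrix(data$A - age.mean, ncol = 1)  # one covariate, so one column
opt1 <- optim(par = 0, fn = objfn, gr = gradfn, X = X.EM.0, method = "BFGS")
a1 <- opt1$par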

Find out which percentile a number has [duplicate]

Using R, it is trivial to calculate the quantiles for given probabilities in a sampled distribution:
x <- rnorm(1000, mean=4, sd=2)
quantile(x, .9) # results in 6.705755
However, I can't find an easy way to do the inverse: calculate the probability for a given quantile in the sample x. The closest I've come is to use pnorm() with the same mean and standard deviation I used when creating the sample:
pnorm(5, mean=4, sd=2) # results in 0.6914625
However, because this is calculating the probability from the full normal distribution, and not the sample x, it's not entirely accurate.
Is there a function that does the inverse of quantile()? Something that lets me do the same thing as pnorm() but with a sample? Something like this:
backwards_quantile(x, 5)
I've found the ecdf() function, but can't figure out a way to make it return a single probability instead of a whole function object.
ecdf returns a function: you need to apply it.
f <- ecdf(x)
f( quantile(x,.91) )
# Equivalently:
ecdf(x)( quantile(x,.91) )
Just for convenience, this function helps:
quantInv <- function(distr, value) ecdf(distr)(value)
set.seed(1)
x <- rnorm(1000, mean=4, sd=2)
quantInv(x, c(4, 5, 6.705755))
[1] 0.518 0.685 0.904
You more or less have the answer yourself. When you want to write
backwards_quantile(x, 5)
just write
ecdf(x)(5)
This corresponds to the inverse of quantile() with type=1. However, if you want other types (I favour the NIST standard, corresponding to Excel's Percentile.exc, which is type=6), you have more work to do.
In these latter cases, consider which use you are going to put it to. If all you want is to plot it, for instance, then consider
yVals<-seq(0,1,0.01)
plot(quantile(x,yVals,type=6))
But if you want the inverse for a single value, like 5, then you need to write a solving function to find the P that makes
quantile(x,P,type=6) = 5
For instance this, which uses binary search between the extreme values of x:
inverse_quantile <- function(x, y, d=0.01, type=1) {
  A <- min(x)
  B <- max(x)
  # number of bisection steps needed to reach precision d on the quantile scale
  k <- ceiling(log((B-A)/d)/log(2)) + 1
  P <- 0.5
  for (i in 1:k) {
    # halve the step and move P towards the target value y
    P <- P + ifelse(quantile(x, P, type=type) < y, 2^(-i-1), -2^(-i-1))
  }
  P
}
So if you wanted the type 4 inverse quantile of your set x for the number 5, with precision 0.00001, then you would write
inverse_quantile(x, 5, d=0.00001, type=4)
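A quick sanity check (using the x simulated above): the recovered probability should map back to roughly 5.
p <- inverse_quantile(x, 5, d=0.00001, type=4)
quantile(x, p, type=4)   # should be approximately 5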

Problems with Gaussian Quadrature in R

I'm using the gaussquad package to evaluate some integrals numerically.
I thought the ghermite.h.quadrature command worked by evaluating a function f(x) at points x1, ..., xn and then constructing the sum w1*f(x1) + ... + wn*f(xn), where x1, ..., xn and w1, ..., wn are nodes and weights supplied by the user.
Thus I thought the commands
ghermite.h.quadrature(f,rule)
sum(sapply(rule$x,f)*rule$w)
would yield the same output for any function f, where rule is a data frame which stores the nodes in a column labelled "x" and the weights in a column labelled "w". For many functions the output is indeed the same, but for some functions I get very different results. Can someone please help me understand this discrepancy?
Thanks!
Code:
library(gaussquad)
n.quad = 50
rule = hermite.h.quadrature.rules(n.quad)[[n.quad]]
f <- function(z){
  f1 <- function(x,y) pnorm(x+y)
  f2 <- function(y) ghermite.h.quadrature(f1,rule,y = y)
  g <- function(x,y) x/(1+y) / f2(y)*pnorm(x+y)
  h <- function(y) ghermite.h.quadrature(g,rule,y=y)
  h(z)
}
ghermite.h.quadrature(f,rule)
sum(sapply(rule$x,f)*rule$w)
Ok, that problem got me interested.
I've looked into the gaussquad sources, and the author clearly does not run sapply internally: every integrand is expected to return a vector when given a vector argument.
It is clearly stated in the documentation:
functn: an R function which should take a numeric argument x and possibly some parameters. The function returns a numerical vector value for the given argument x.
When you use the built-in functions they are already written that way, so everything works.
You have to rewrite your function to accept a vector argument and return a vector.
UPDATE
Vectorize() rectifies the problem for me, as does a simple wrapper around sapply:
vf <- function(z) {
  sapply(z, f)
}
After either of those changes the results are identical: 0.2029512
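For completeness, a minimal sketch of the Vectorize() route (using f and rule from the question):
vf2 <- Vectorize(f)              # makes f accept a vector of nodes
ghermite.h.quadrature(vf2, rule) # now agrees with sum(sapply(rule$x, f)*rule$w)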

Integration of a vector returns one value

I am using R to do some multivariate analysis. For this work I need to integrate a trivariate PDF. Since I want to use this in an MLE, I want a vector of integrals. Is there a way to make integrate return a vector instead of one value?
Here is a simple example:
library(mvtnorm)   # for dmvnorm
sigma <- diag(3)   # example covariance matrix
f1=function(x, y, z) {dmvnorm(x=as.matrix(cbind(x,y,z)), mean=c(0,0,0), sigma=sigma)}
f1(x=c(1,1,1), y=c(1,1,1), z=c(1,1,1))
integrate(Vectorize(function(x) {f1(x=c(1,1,1), y=c(1,1,1), z=c(1,1,1))}), lower = - Inf, upper = -1)$value
Error in integrate(Vectorize(function(x) { : evaluation of function gave a result of wrong length
To integrate a function of one variable with vector values, you can transform the function into n functions with real values and integrate each of them. This is very inefficient (when integrating the i-th function, I evaluate all the components and discard all but one value).
# Function to integrate
d <- rnorm(10)
f <- function(x) dnorm(d, mean=x)
# Integrate those n functions separately.
n <- length(f(1))
r <- sapply(1:n, function(i)
  integrate(
    Vectorize(function(x) f(x)[i]),  # keep only the i-th component
    lower=-Inf, upper=0
  )$value
)
r
For 2-dimensional integrals you can check pracma::integral2, but the same manipulation (transforming a bivariate function with vector values into n bivariate functions with real values) will probably be needed.
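For instance, a sketch of the same manipulation with pracma::integral2 (the integrand g, the bounds and the wrapper below are made up for illustration; integral2 evaluates its integrand on whole grids, so each scalar component is wrapped to work elementwise):
library(pracma)
set.seed(1)
d <- rnorm(10)
g <- function(x, y) dnorm(d, mean = x + y)  # bivariate function with vector values
# Wrap the i-th component so it evaluates elementwise and keeps the shape
# of the grids that integral2 passes in
component <- function(i) {
  function(x, y) {
    z <- mapply(function(a, b) g(a, b)[i], x, y)
    dim(z) <- dim(x)
    z
  }
}
n <- length(g(0, 0))
r2 <- sapply(1:n, function(i)
  integral2(component(i), -4, 0, -4, 0)$Q)
r2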

Why does glm.nb throw a "missing value" error only on very specific inputs

glm.nb throws an unusual error on certain inputs. While there are a variety of values that cause this error, changing the input even very slightly can prevent the error.
A reproducible example:
library(MASS)
set.seed(11)
pop <- rnbinom(n=1000,size=1,mu=0.05)
glm.nb(pop~1,maxit=1000)
Running this code throws the error:
Error in while ((it <- it + 1) < limit && abs(del) > eps) { :
missing value where TRUE/FALSE needed
At first I assumed that this had something to do with the algorithm not converging. However, I was surprised to find that changing the input even very slightly can prevent the error. For example:
pop[1000] <- pop[1000] + 1
glm.nb(pop~1,maxit=1000)
I've found that it throws this error on 19.4% of the seeds between 1 and 500:
fit.with.seed = function(s) {
set.seed(s)
pop <- rnbinom(n=1000, size=1, mu=0.05)
m = glm.nb(pop~1, maxit=1000)
}
errors = sapply(1:500, function(s) {
is.null(tryCatch(fit.with.seed(s), error=function(e) NULL))
})
mean(errors)
I've found only one mention of this error anywhere, on a thread with no responses.
What could be causing this error, and how can it be fixed (other than by randomly permuting the inputs every time glm.nb throws an error)?
ETA: Setting control=glm.control(maxit=200, trace=3) shows that the theta.ml algorithm breaks down: theta gets very large, then becomes -Inf, then NaN:
theta.ml: iter67 theta =5.77203e+15
theta.ml: iter68 theta =5.28327e+15
theta.ml: iter69 theta =1.41103e+16
theta.ml: iter70 theta =-Inf
theta.ml: iter71 theta =NaN
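For reference, this is the call that produces that trace (a sketch assuming the seed-11 pop from the question; it still ends in the same missing-value error):
library(MASS)
set.seed(11)
pop <- rnbinom(n=1000, size=1, mu=0.05)
glm.nb(pop~1, control=glm.control(maxit=200, trace=3))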
It's a bit crude, but in the past I have been able to work around problems with glm.nb by resorting to straight maximum likelihood estimation (i.e. no clever iterative estimation algorithms as used in glm.nb).
Some poking around/profiling indicates that the MLE for the theta parameter is effectively infinite. I decided to fit it on the inverse scale, so that I could put a boundary at 0 (a fancier version would set up a log-likelihood function that would revert to Poisson at theta=zero, but that would undo the point of trying to come up with a quick, canned solution).
With two of the bad examples given above, this works reasonably well, although it does warn that the parameter fit is on the boundary ...
library(bbmle)
m1 <- mle2(Y~dnbinom(mu=exp(logmu),size=1/invk),
           data=d1,
           parameters=list(logmu~X1+X2+offset(X3)),
           start=list(logmu=0,invk=1),
           method="L-BFGS-B",
           lower=c(rep(-Inf,12),1e-8))
The second example is actually more interesting because it demonstrates numerically that the MLE for theta is essentially infinite even though we have a good-sized data set that is exactly generated from negative binomial deviates (or else I'm confused about something ...)
set.seed(11);pop <- rnbinom(n=1000,size=1,mu=0.05);glm.nb(pop~1,maxit=1000)
m2 <- mle2(pop~dnbinom(mu=exp(logmu),size=1/invk),
           data=data.frame(pop),
           start=list(logmu=0,invk=1),
           method="L-BFGS-B",
           lower=c(-Inf,1e-8))
Edit: the code and answer have been simplified to one sample, as in the question.
Yes, theta can approach Inf in small samples and sparse data (many zeroes, small mean and large skew). I have found that fitting glm.nb fails when the data are all zeroes and returns:
Error in while ((it <- it + 1) < limit && abs(del) > eps) { :
missing value where TRUE/FALSE needed
The following code simulates small samples with a small mean and theta. To prevent the loop from crashing, glm.nb is not fitted when the data are all zeroes.
en1 <- 10
mu1 <- 0.5
size1 <- 0.5
temp <- matrix(nrow=10000, ncol=2)
# theta == Inf is rare so use a large number of reps
for (r in 1:10000){
  dat1 <- rnbinom(n=en1, size=size1, mu=mu1)
  temp[r, 1:2] <- c(mean(dat1), ifelse(max(dat1)!=0, glm.nb(dat1~1)$theta, NA))
}
temp <- as.data.frame(temp)
names(temp) <- c("mean1","theta1")
temp[which(is.na(temp$theta1)),]
# note that it's rare to get all zeroes in the sample
sum(is.na(temp$theta1))/dim(temp)[1]
# a log scale helps see what's happening
with(temp, plot(mean1, log10(theta1)))
# estimated thetas should equal size1 = 0.5
abline(h=log10(0.5), col="red")
text(2.5, 5, "n1 = n2 = 10", col="red", cex=2, adj=1)
text(1, 4, "extreme thetas", col="red", cex=2)
See that estimated thetas can be extremely large when the sample size is small (as the plot produced by the code above shows).
Lesson learnt: don't expect high-quality results from glm.nb for small samples and sparse data; get larger samples instead.
