R: looping to search for max of non-monotonic function

Refer to the R code below. The function (someRfunction) operates on a vector and returns a scalar value. The data are pairs (x,y), where x and y are vectors of length n, which may be large.
I want to know the value of x* such that the result of someRfunction on y where {x>x*} is maximized. The function operates on y values and is non-monotonic in x*. I need to evaluate for all x* (i.e. each element of x). Speed is not an issue if executed once, but the code would be executed many times in a simulation. Is there any way to make this code more efficient/faster?
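### x and y are vectors of length n
### sort x and y such that they are ordered by descending x
ord <- order(-x)
xord <- x[ord]
yord <- y[ord]
maxf <- -Inf
maxcut <- NA
### start at i = 2: at i = 1 the set y[x>x{1}] is empty, and 1:(i-1) would
### evaluate to c(1, 0) and silently select yord[1]
### (a function such as t.test needs at least two values, so start at 3 there)
for (i in 2:n) {
### yi is a subvector of y that corresponds to y[x>x{i}]
### where x{i} is the (n-i+1)th order statistic of x
yi <- yord[seq_len(i-1)]
fxi <- someRfunction(yi)
if (fxi>maxf) {
maxf <- fxi
maxcut <- xord[i]
}
}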
Thanks.
Edit: let someRfunction(yi)=t.test(yi)$statistic.

If you can say anything more about the function, particularly whether it is smooth and whether its gradient can be determined, you will get a better answer. At the moment the only increase in speed will be modest: pre-allocate a vector to hold the results, omit the if-max clause, and then use which.max() on the vector. You might also want to look at the function optimx in the package "optimx".
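A minimal sketch of that suggestion, assuming someRfunction is the one-sample t statistic from the edit to the question. The data are simulated purely for illustration, and the names fx, fx_fast, s1 and s2 are not from the original post. The second version goes a step further than the answer: because the t statistic depends only on the running sum and the running sum of squares, it can be computed for every cut with cumsum() and no loop. This shortcut is specific to statistics of that form and can lose precision when the mean is very large relative to the spread.

```r
set.seed(1)
n <- 200
x <- rnorm(n)
y <- rnorm(n, mean = 0.2)

ord  <- order(-x)
xord <- x[ord]
yord <- y[ord]

# Version 1: pre-allocate a result vector, fill it, then use which.max().
# The loop starts at i = 3 because t.test() needs at least two observations.
fx <- rep(NA_real_, n)
for (i in 3:n) {
  fx[i] <- t.test(yord[seq_len(i - 1)])$statistic
}
best_loop <- which.max(fx)          # NA entries are ignored

# Version 2: the same statistic from running sums (assumes the t statistic).
k  <- 2:(n - 1)                     # number of y values used at cut i = k + 1
s1 <- cumsum(yord)[k]               # running sum of y
s2 <- cumsum(yord^2)[k]             # running sum of y^2
tstat <- (s1 / k) / sqrt((s2 - s1^2 / k) / (k - 1) / k)
fx_fast <- c(NA, NA, tstat)         # align with the index i of the loop version
best_fast <- which.max(fx_fast)

maxf   <- fx_fast[best_fast]
maxcut <- xord[best_fast]
```

Both versions should agree up to floating-point error, so the loop version can serve as a check on the vectorised one before it is used inside a simulation.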

Related

Seeking an lapply like function for a list of lists

I have a list in R which looks something like this
b0=5;b1=2
f <- function(x) b0 + b1*x
Nsim <- 100
my.list <- vector("list", Nsim)
for(i in 1:Nsim){
x <- rep(0,1000)
y <- x
y[1] <- f(x[1])
for(j in 2:1000){
x[j] <- x[j-1] + rnorm(1,0,0.1)
y[j] <- f(x[j])
}
my.list[[i]]$x <- x
my.list[[i]]$y <- y
}
In reality, f is the result of an optimisation routine and x tracks the input value over time and y is the function values which are generated. So in essence, I have Nsim time series. I want to plot metrics of these time series over time by averaging over the index i. For instance, the average performance of the algorithm over time.
At the moment I'm doing this with a bespoke function for each metric I want to calculate (e.g. mean squared error of x from the true value of x, another for generating error bars and so on). I want to use something like lapply to average over i so I can visualise how x and y evolve over time but that doesn't do the right thing.
What I want to output is a pointwise summary of the results. As an analogy, if my.list[[i]]$x was instead stored as a matrix, I could take colMeans() to see the average value of x over "time".
Is there a function/package which is good for working with lists of lists?
At least for what has been presented there is no real reason to use a list of lists. The x's are all the same and equal to 1, 2, 3, ... so this could be represented by a matrix with the x component being implicit or represented by row names or we could represent this as a ts object or zoo object. In the last two cases if X is the object time(X) is the common x.
mat <- sapply(my.list, "[[", "y")
ts(mat)
library(zoo); zoo(mat)
Alternately, get rid of my.list and construct one of these directly in the code.
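A minimal sketch of the matrix approach applied to the pointwise summaries the question asks about. It rebuilds a list with the same shape as my.list (the random walk is generated with cumsum() instead of the inner loop); the names Tlen, xmat, ymat, x_mean, x_sd and mse_x are illustrative, and the "true value" of x is assumed to be 0 only for the sake of the example.

```r
set.seed(42)
b0 <- 5; b1 <- 2
f <- function(x) b0 + b1 * x
Nsim <- 100
Tlen <- 1000   # length of each simulated series

# Same structure as my.list in the question: one list(x, y) per simulation.
my.list <- lapply(seq_len(Nsim), function(i) {
  x <- c(0, cumsum(rnorm(Tlen - 1, 0, 0.1)))
  list(x = x, y = f(x))
})

# One column per simulation, one row per time step.
xmat <- sapply(my.list, "[[", "x")
ymat <- sapply(my.list, "[[", "y")

# Pointwise summaries over simulations (i.e. across columns, per row).
x_mean <- rowMeans(xmat)               # average x at each time step
x_sd   <- apply(xmat, 1, sd)           # spread at each time step, e.g. for error bars
mse_x  <- rowMeans((xmat - 0)^2)       # MSE against an assumed true value of 0
```

Because sapply() puts each simulation in a column, the summary over simulations is rowMeans() here; colMeans() would apply if the series were stored by row, as in the analogy in the question.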

Faster way to generate large list of vectors from permuted datasets [R]

Setup: For the purposes of my simulation, I'm generating a list of B=2000 elements, with each element being the output of a permutation procedure in which I first permute the rows of a 200x8000 matrix and for each column, I calculate the Kolmogorov-Smirnov test statistic between the first and second 100 rows (you can think of the first 100 rows as data from one group and the second 100 rows as data from another group).
Question: This process takes a very long time (about 30-40 minutes) to generate the list. Is there a much faster way? In the future, I'd like to increase B to a larger value.
Code:
B=2000
n.row=200; n.col=8000
#Modified KS-test function (from base package)
#Defined before the loop so that it exists when the loop calls it
ks.test.stat <- function(x,y){
x <- x[!is.na(x)]
n <- length(x)
y <- y[!is.na(y)]
n.x <- as.double(n)
n.y <- length(y)
w <- c(x, y)
z <- cumsum(ifelse(order(w) <= n.x, 1/n.x, -1/n.y))
z <- z[c(which(diff(sort(w)) != 0), n.x + n.y)] #exclude ties
STATISTIC <- max(abs(z))
return(STATISTIC)
}
#Generate sample data
samp.dat = matrix(rnorm(n.row*n.col),nrow=n.row)
#Pre-allocate the list that holds one vector of statistics per permutation
perm.KS.list = vector("list", B)
for (b in 1:B){
#permute the rows
perm.dat.tmp = samp.dat[sample(nrow(samp.dat)),]
#Compute the permutation-based test statistics
perm.KS.list[[b]]= apply(perm.dat.tmp,2,function(y) ks.test.stat(y[1:100],y[101:200]))
}
The 1:B loop has several places to optimize, but I agree that the real consumer is that inner function. Because you're simulating well-behaved bootstrap samples, you can make two simplifying assumptions that the general base function can't:
1. There aren't missing values. This obviates the is.na() adjustments.
2. The two sides (i.e., x & y) have the same number of elements, so you don't need to count them separately. Instead of splitting y in the loop and then joining the halves back together in the function (into w), just keep it together. The balanced sides also permit simplifications like removing the ifelse() clause: the comparison produces a bunch of 0/1s, which are rescaled to -1/1s with integer arithmetic.
The reduced function saves about 25% of the time. I added integers instead of doubles inside cumsum().
ks.test.stat.balanced <- function(w){
n <- as.integer(length(w) * .5)
# z <- cumsum(ifelse(order(w) <= n, 1L, -1L)) / n
z <- cumsum((order(w) <= n)*2L - 1L) / n
# z <- z[c(which(diff(sort(w)) != 0), n + n)] #exclude ties
return( max(abs(z)) )
}
Ties shouldn't occur often with your gaussian rng, and the diff(sort(.)) is very expensive. If you're willing to remove that protection, the time is reduced by about 65%.
If you move the equation for z into abs(), it saves a little time over all those reps. I kept it separate above, so it's easier to read.
Edit: in case of an unbalanced simulation, I'd recommend that you:
still keep out the is.na,
still pass w,
still keep as much as possible in integer, not numeric, but
now include arguments n1 & n2 for the two group sizes.
Also, experiment w/ precalculating 1/n before cumsum() to avoid a lot of expensive divisions. Try to think of other math-y ways to extract calculations from an inner loop so it occurs less frequently.
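A minimal sketch of what the unbalanced variant could look like under the recommendations above. The function name ks.test.stat.unbalanced and its arguments n1 and n2 are illustrative, not code from the answer. It keeps the cumulative count in integers, multiplies by precomputed reciprocals instead of dividing inside the cumulative sum, and, like the balanced version, assumes no missing values and no ties.

```r
# w holds group 1 in its first n1 positions and group 2 in the next n2.
ks.test.stat.unbalanced <- function(w, n1, n2) {
  c1 <- cumsum(order(w) <= n1)   # integer count of group-1 points so far
  c2 <- seq_along(w) - c1        # integer count of group-2 points so far
  inv1 <- 1 / n1                 # reciprocals computed once
  inv2 <- 1 / n2
  max(abs(c1 * inv1 - c2 * inv2))
}

# Check against stats::ks.test() on simulated data without ties.
set.seed(7)
x <- rnorm(60)
y <- rnorm(100, mean = 0.3)
stat_fast <- ks.test.stat.unbalanced(c(x, y), length(x), length(y))
stat_ref  <- unname(ks.test(x, y)$statistic)
```

Comparing against ks.test() on a few simulated columns is a cheap way to confirm that a stripped-down statistic still matches the reference before it goes into the B-fold loop.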

Problems with Gaussian Quadrature in R

I'm using the gaussquad package to evaluate some integrals numerically.
I thought the ghermite.h.quadrature command worked by evaluating a function f(x) at points x1, ..., xn and then constructing the sum w1*f(x1) + ... + wn*f(xn), where x1, ..., xn and w1, ..., wn are nodes and weights supplied by the user.
Thus I thought the commands
ghermite.h.quadrature(f,rule)
sum(sapply(rule$x,f)*rule$w)
would yield the same output for any function f, where "rule" is a data frame which stores the nodes in a column labelled "x" and the weights in a column labelled "w". For many functions the output is indeed the same, but for some functions I get very different results. Can someone please help me understand this discrepancy?
Thanks!
Code:
library(gaussquad)
n.quad = 50
rule = hermite.h.quadrature.rules(n.quad)[[n.quad]]
f <- function(z){
f1 <- function(x,y) pnorm(x+y)
f2 <- function(y) ghermite.h.quadrature(f1,rule,y = y)
g <- function(x,y) x/(1+y) / f2(y)*pnorm(x+y)
h <- function(y) ghermite.h.quadrature(g,rule,y=y)
h(z)
}
ghermite.h.quadrature(f,rule)
sum(sapply(rule$x,f)*rule$w)
OK, that problem got me interested.
I looked into the gaussquad sources, and the author clearly does not run sapply internally: every integrand is expected to return a vector when it is given a vector argument.
This is clearly stated in the documentation:
functn an R function which should take a numeric argument x and possibly some parameters.
The function returns a numerical vector value for the given argument x
When you use built-in functions, they are already written that way, so everything works.
You have to rewrite your function so that it accepts a vector argument and returns a vector of the same length.
UPDATE
Vectorize() works for me to rectify the problem, as does a simple wrapper with sapply:
vf <- function(z) {
sapply(z, f)
}
After either of those changes, results are identical: 0.2029512
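To illustrate the mechanism without depending on gaussquad, here is a minimal sketch using a stand-in quadrature routine. The function quad(), the three-point rule and the integrand are all hypothetical; the only assumption carried over from the answer is that the routine calls the integrand once on the whole vector of nodes.

```r
# Stand-in for a quadrature routine that evaluates functn on all nodes at once.
quad <- function(functn, rule) sum(rule$w * functn(rule$x))

# Toy rule: nodes in column x, weights in column w (not real Hermite nodes).
rule <- data.frame(x = c(-1, 0, 1), w = c(0.25, 0.5, 0.25))

# Integrand that is only correct for a scalar z: it performs an inner
# quadrature and collapses its result to a single number with sum().
f <- function(z) sum(rule$w * exp(z + rule$x))

wrong        <- quad(f, rule)                       # f sees all nodes at once
right_sapply <- sum(sapply(rule$x, f) * rule$w)     # f called node by node
right_vec    <- quad(Vectorize(f), rule)            # vectorised wrapper
```

Here wrong differs from the other two values because f silently adds the outer and inner nodes elementwise, while Vectorize(f) and the sapply form call f once per node and therefore agree.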

Linear regression using a list of function

I've a dataset with X and Y value obtained from a calibration and I have to interpolate them with a predefined list of polynomial functions and choose the one with the best R2.
The most naive version of the function would be
try<-function(x,y){
f1<- x + I(x^2.0) - I(x^3.0)
f2<- x + I(x^1.5) - I(x^3.0)
...
f20<- I(x^2.0) - I(x^2.5) + I(x^0.5)
r1<- lm(y~f1)
r2<- lm(y~f2)
...
r20<-lm(y~f20)
v1<-summary(r1)$r.squared
v2<-summary(r2)$r.squared
...
v20<-summary(r20)$r.squared
v<-c(v1,v2,...,v20)
return(v)
}
I'd like then to make this function shorter and smarter (especially from the definition of r1 to the end). I'd also like to give the user the possibility to choose a function among f1 to f20 (typing the desired row number of v) and see the output of the function print and plot on it.
Please, could you help me?
Thank you.
@mso: the idea of using sapply is nice, but unfortunately in this way I don't use a polynomial for the regression: my x vector is transformed into the f1 vector according to the formula and then used for the regression. I obtain just one parameter instead of 3 (in this case).
Create F as a list and proceed:
F = list(f1, f2, ...., f20)
r = lapply(F, function(x) lm(y~x))
v = sapply(r, function(x) summary(x)$r.squared)
return(v)
lapply takes each element of F, runs lm against y, and puts the fitted models in the list r. It is used here rather than sapply because sapply would try to simplify the lm objects, which are themselves lists, into a matrix. In the next line, sapply takes every element of r, gets its summary, and puts the R-squared values in the vector v.
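The asker's comment notes that regressing on a precomputed vector yields a single slope rather than one coefficient per term. One way around that, sketched here under assumptions, is to store the candidate models as formulas instead of vectors, so that lm() estimates every term. The three formulas, the simulated data and the names forms, fits and best are illustrative only; the terms are joined with + because a minus sign in an R formula removes a term rather than subtracting it.

```r
set.seed(1)
x <- seq(1, 10, length.out = 50)
y <- 2 + 0.5 * x + 0.1 * x^2 + rnorm(50, sd = 0.5)
dat <- data.frame(x = x, y = y)

# Hypothetical candidate models, stored as formulas in a named list.
forms <- list(
  f1 = y ~ x + I(x^2) + I(x^3),
  f2 = y ~ x + I(x^1.5) + I(x^3),
  f3 = y ~ I(x^0.5) + I(x^2) + I(x^2.5)
)

fits <- lapply(forms, lm, data = dat)                  # one fitted model per formula
v    <- sapply(fits, function(m) summary(m)$r.squared) # named vector of R-squared
best <- names(which.max(v))                            # name of the best-fitting model

# Inspect any model chosen by position or by name.
chosen <- fits[[2]]
coef(chosen)    # intercept plus one coefficient per term
```

Keeping the fitted models in a list means the user can pick one by its position in v and then call print(), summary() or plot() on fits[[k]] directly.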

Integration of a vector return one value

I am using R to do some multivariate analysis. For this work I need to integrate the trivariate PDF. Since I want to use this in an MLE, I want a vector of integrals. Is there a way to make integrate return a vector instead of one value?
Here is a simple example:
f1=function(x, y, z) {dmvnorm(x=as.matrix(cbind(x,y,z)), mean=c(0,0,0), sigma=sigma)}
f1(x=c(1,1,1), y=c(1,1,1), z=c(1,1,1))
integrate(Vectorize(function(x) {f1(x=c(1,1,1), y=c(1,1,1), z=c(1,1,1))}), lower = - Inf, upper = -1)$value
Error in integrate(Vectorize(function(x) { : evaluation of function gave a result of wrong length
To integrate a function of one variable, with vector values,
you can transform the function into n functions with real values,
and integrate each of them.
This is very inefficient (when integrating the i-th function,
I evaluate all the functions, and discard all but one value).
# Function to integrate
d <- rnorm(10)
f <- function(x) dnorm(d, mean=x)
# Integrate those n functions separately.
n <- length(f(1))
r <- sapply( 1:n,
function(i) integrate(
Vectorize(function(x) f(x)[i]),
lower=-Inf, upper=0
)$value
)
r
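As a sanity check on this particular example, the integral has a closed form: integrating dnorm(d, mean = x) over x from -Inf to 0 gives the upper tail probability 1 - pnorm(d). The sketch below repeats the component-wise integration and compares it with that value; the seed and the names r_num and closed_form are illustrative, and vapply() is used instead of sapply() only to make the return type explicit.

```r
set.seed(123)
d <- rnorm(10)
f <- function(x) dnorm(d, mean = x)   # vector-valued function of a scalar x

# Integrate each component separately, as in the answer.
r_num <- vapply(seq_along(d), function(i) {
  integrate(Vectorize(function(x) f(x)[i]), lower = -Inf, upper = 0)$value
}, numeric(1))

# Closed form for this integrand: P(Z > d) for a standard normal Z.
closed_form <- pnorm(d, lower.tail = FALSE)

max(abs(r_num - closed_form))   # should be small, within integrate()'s tolerance
```

When a closed form like this exists it is far cheaper than numerical integration inside a likelihood; the component-wise approach is for integrands where no such shortcut is available.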
For 2-dimensional integrals, you can check pracma::integral2,
but the same manipulation (transforming a bivariate function with vector values
into n bivariate functions with real values) will probably be needed.
