How to estimate the correlation between two functions with unknown variables inside? - r

I need to solve this optimization problem in order to estimate lambda:
Basically, I need to find the correlation between these two functions:
f1 <- function(lambda, tau){slope = (1-exp(-lambda*tau))/(lambda*tau)
return(slope)}
f2 <- function(lambda, tau){curve = ((1-exp(-lambda*tau))/(lambda*tau))-exp(-lambda*tau)
return(curve)}
I know the different values of tau. Suppose for example tau = 0.25: now f1 and f2 have only one missing parameter, lambda, which should be estimated. However, when I try to pass them to optim() for minimization, it does not work because f1 and f2 are functions rather than numeric values. How can I build this kind of optimization problem while maintaining f1 and f2 as functions?
Many thanks

If I am understanding correctly, you are trying to minimise the squared correlation between the output of f1 and f2 at different values of lambda. This means that for each value of lambda you are assessing, you need to feed in the complete vector of tau values. This will give a vector output for each value of lambda so that a correlation between the output from the two functions can be calculated at any single value of lambda.
To do this, we create a vectorized function that takes lambda values and calculates the squared correlation between f1 and f2 at those values of lambda across all values of tau:
f3 <- function(lambda) {
sapply(lambda, function(l) {
cor(f1(l, seq(1/12, 10, 1/12)), f2(l, seq(1/12, 10, 1/12)))^2
})
}
To get the optimal value of lambda that minimizes the squared correlation, we just use optimize:
optimize(f3, c(0, 100))$minimum
#> [1] 0.6678021

Perhaps the examples at the bottom of the page help: https://search.r-project.org/CRAN/refmans/NMOF/html/NSf.html
They input a vector of times into the functions (which is fixed for a given yield-curve), so you can compute the correlation for a given lambda. To minimize the correlation, do a grid search over the lambdas.
In your case, for instance,
lambda <- 2
cor(f1(lambda, 1:10), f2(lambda, 1:10))
Note that I have assumed maturity measured in years, 1 to 10. You will need to fill in appropriate values.
To find a lambda that leads to a low correlation, you could run a grid search.
lambdas <- seq(0.00001, 25, length.out = 1000)
squared.corr <- rep(NA_real_, length(lambdas))
for (i in seq_along(lambdas)) {
rho <- cor(f1(lambdas[i], 1:10),
f2(lambdas[i], 1:10))
squared.corr[i] <- rho*rho
}
lambdas[which.min(squared.corr)]
## [1] 0.490
(I am one of the authors of Gilli, Grosse and Schumann (2010), on which the suggestion to minimize the correlation is based.)
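The two answers can be combined: run the coarse grid search first, then refine around the best grid point with optimize(). This is only a sketch. The maturities 1:10 are the placeholder used in the answer above, and the helper name sq_corr is made up for illustration.

```r
# loadings from the question
f1 <- function(lambda, tau) (1 - exp(-lambda * tau)) / (lambda * tau)
f2 <- function(lambda, tau) (1 - exp(-lambda * tau)) / (lambda * tau) - exp(-lambda * tau)

tau <- 1:10  # placeholder maturities in years (assumption)

# squared correlation for a single lambda (hypothetical helper)
sq_corr <- function(lambda) cor(f1(lambda, tau), f2(lambda, tau))^2

# coarse grid search
lambdas <- seq(0.00001, 25, length.out = 1000)
squared.corr <- sapply(lambdas, sq_corr)
i <- which.min(squared.corr)

# refine between the neighbouring grid points
lo <- lambdas[max(i - 1, 1)]
hi <- lambdas[min(i + 1, length(lambdas))]
refined <- optimize(sq_corr, c(lo, hi))
refined$minimum
```

The grid protects against landing in a poor region, while optimize() removes the dependence on the grid spacing.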

Related

Non-linear optimization for exponential function with linear constraints

I try to solve a non-linear optimization problem using the function donlp2 in R. My goal is to find out the maximum value of the following function:
442.8658*(x1+1)^(0.008752747)*(y1+1)^(0.555782)+(x2+1)^(0.008752747)*(y2+1)^(0.555782)
There are no non-linear constraints. The linear constraints are listed below:
x1+x2<=20000;
y1+y2<=20000;
x1<=4662.41;
x2<=149339;
y1<=14013.94;
y2<=1342738;
x1>=0;
x2>=0;
y1>=0;
y2>=0;
Below is my code:
p <- c(rep(0,4))
par.l <- c(rep(0,4))
par.u <- c(4662.41, 149339, 14013.94, 1342738)
fn <- function(par){
x1 <- par[1]; y1<-par[3]
x2 <- par[2]; y2<-par[4]
y <- 1 / (442.8658*(x1+1)^(0.008752747)*(y1+1)^(0.555782)
+ (x2+1)^(0.008752747)*(y2+1)^(0.555782))
}
A <- matrix(c(rep(c(1,0),2), rep(c(0,1),2)), nrow=2)
lin.l <- c(-Inf, 20000)
lin.u <- c(-Inf, 20000)
ret <- donlp2(p, fn, par.u=par.u, par.l=par.l, A=A, lin.l=lin.l, lin.u=lin.u)
I searched and found some related posts saying that donlp2 can only minimize a function, which is why I took the reciprocal of the objective function.
The code runs without errors, but I have doubts about the results: I can easily find other values that give a greater outcome, so the reported solution is not the true minimum of the objective function.
I also found that when I change the initial value or the lower bound of x1,x2,y1,y2, the results will change dramatically. For example, if I set p=c(rep(0,4)), par.l<-c(rep(1,4)) instead of p=c(rep(0,4)), par.l<-c(rep(0,4)), the results will change from
$par
[1] 2.410409e+00 5.442753e-03 1.000000e+04 1.000000e+04
to
$par
[1] 2331.748 74670.025 3180.113 16819.887
Any ideas? I appreciate your suggestions and help!
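One thing that can be checked without running donlp2 is whether the reported solution satisfies the stated constraints. The sketch below uses only base R: it rebuilds A from the question and evaluates A %*% par for the second result. The vectors lin.l.alt and lin.u.alt are hypothetical names showing how "sum <= 20000" might be written as lower and upper limits; that reading of lin.l and lin.u is an assumption, not taken from the donlp2 documentation.

```r
# constraint matrix from the question: row 1 is x1 + x2, row 2 is y1 + y2
A <- matrix(c(rep(c(1, 0), 2), rep(c(0, 1), 2)), nrow = 2)

# second solution reported in the question (order: x1, x2, y1, y2)
par2 <- c(2331.748, 74670.025, 3180.113, 16819.887)
sums <- as.vector(A %*% par2)
sums   # x1 + x2 is far above 20000; y1 + y2 is 20000 up to rounding

# assumed encoding of "sum <= 20000" as lower/upper limits (hypothetical names)
lin.l.alt <- c(-Inf, -Inf)
lin.u.alt <- c(20000, 20000)

# the first constraint (x1 + x2) fails this check
sums[1] >= lin.l.alt[1] & sums[1] <= lin.u.alt[1]
```

If the limits passed to the solver do not describe the intended region, the dependence on starting values and bounds is less surprising.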

Repeating a piece of code in R and storing the results as a vector

I've been running estimations in R by fitting a curve to a price series. I want to evaluate the fitness of the curve by making very small changes to the key parameters m and omega at their optimum values. To do that I want to see how the sum of squared residuals changes at the optimum. I defined the function for residuals as below:
# Define function for sum of squared residuals, to evaluate the fitness of parameters m and omega
residuals <- function(m, omega, tc) {
lm.result <- LPPL(rTicker, m, omega, tc)
return(sum((FittedLPPL(rTicker, lm.result, m, omega, tc) - rTicker$Close) ** 2))
}
I can then yield an absolute value for the SSR at the optimum as follows:
#To return value of SSR
residvalue <- residuals(m, omega,tc)
What I want to do is repeat this code over a sequence of values for m (and then omega).
For instance if the optimum m = 0.5, I want to run this code to calculate the object 'residvalue' for a sequence of m values that lie between 0 < m < 1, interval size = 0.01 (ie run it 100 times for 100 different SSR values). I would then like to store these resulting SSR values in a vector (which I can then turn into a data frame of observations). This appears like a trivial task but I'm not sure how to go about doing it. Any help would be appreciated.
You could use sapply:
sapply(seq(0,1,0.01),function(m) residuals(m,omega,tc))
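A self-contained sketch of the same idea, since LPPL() and FittedLPPL() are not shown in the question: toy_ssr below is a made-up stand-in for residuals(), the values of omega and tc are placeholders, and the grid excludes the end points to match 0 < m < 1.

```r
# hypothetical stand-in for residuals(m, omega, tc); any function returning one number works
toy_ssr <- function(m, omega, tc) (m - 0.5)^2 + 0.1 * omega + 0 * tc

omega <- 6; tc <- 100                 # placeholder values (assumption)
m.grid <- seq(0.01, 0.99, by = 0.01)  # 99 values strictly between 0 and 1

# vapply is like sapply but checks that each call returns a single numeric
ssr <- vapply(m.grid, function(m) toy_ssr(m, omega, tc), numeric(1))

# store the grid and the SSR values side by side
results <- data.frame(m = m.grid, ssr = ssr)
results[which.min(results$ssr), ]
```

The same pattern works for omega by swapping which argument varies inside the anonymous function.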

Function to calculate R2 (R-squared) in R

I have a dataframe with observed and modelled data, and I would like to calculate the R2 value. I expected there to be a function I could call for this, but can't locate one. I know I can write my own and apply it, but am I missing something obvious? I want something like
obs <- 1:5
mod <- c(0.8,2.4,2,3,4.8)
df <- data.frame(obs, mod)
R2 <- rsq(df)
# 0.85
You need a little statistical knowledge to see this. R squared between two vectors is just the square of their correlation. So you can define your function as:
rsq <- function (x, y) cor(x, y) ^ 2
Sandipan's answer will return you exactly the same result (see the following proof), but as it stands it appears more readable (due to the evident $r.squared).
Let's do the statistics
Basically we fit a linear regression of y over x, and compute the ratio of regression sum of squares to total sum of squares.
lemma 1: a regression y ~ x is equivalent to y - mean(y) ~ x - mean(x)
lemma 2: beta = cov(x, y) / var(x)
lemma 3: R.square = cor(x, y) ^ 2
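Lemmas 2 and 3 can be checked numerically. This is a quick sketch using the obs and mod vectors from the question, not a proof.

```r
obs <- 1:5
mod <- c(0.8, 2.4, 2, 3, 4.8)

fit <- lm(mod ~ obs)

# lemma 2: the slope equals cov(x, y) / var(x)
slope.ok <- isTRUE(all.equal(unname(coef(fit)["obs"]), cov(obs, mod) / var(obs)))

# lemma 3: R squared of the fit equals the squared correlation
rsq.ok <- isTRUE(all.equal(summary(fit)$r.squared, cor(obs, mod)^2))

c(slope.ok, rsq.ok)
```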
Warning
R squared between two arbitrary vectors x and y (of the same length) is just a goodness measure of their linear relationship. Think twice!! R squared between x + a and y + b is identical for any constant shifts a and b. So it is a weak or even useless measure of "goodness of prediction". Use MSE or RMSE instead:
How to obtain RMSE out of lm result?
R - Calculate Test MSE given a trained model from a training set and a test set
I agree with 42-'s comment:
The R squared is reported by summary functions associated with regression functions. But only when such an estimate is statistically justified.
R squared can be a (but not the best) measure of "goodness of fit". But there is no justification that it can measure the goodness of out-of-sample prediction. If you split your data into training and testing parts and fit a regression model on the training one, you can get a valid R squared value on training part, but you can't legitimately compute an R squared on the test part. Some people did this, but I don't agree with it.
Here is a very extreme example:
preds <- 1:4/4
actual <- 1:4
The R squared between those two vectors is 1. Yes of course, one is just a linear rescaling of the other so they have a perfect linear relationship. But, do you really think that the preds is a good prediction on actual??
In reply to wordsforthewise
Thanks for your comments 1, 2 and your answer of details.
You probably misunderstood the procedure. Given two vectors x and y, we first fit a regression line y ~ x then compute regression sum of squares and total sum of squares. It looks like you skip this regression step and go straight to the sum of square computation. That is false, since the partition of sum of squares does not hold and you can't compute R squared in a consistent way.
As you demonstrated, this is just one way for computing R squared:
preds <- c(1, 2, 3)
actual <- c(2, 2, 4)
rss <- sum((preds - actual) ^ 2) ## residual sum of squares
tss <- sum((actual - mean(actual)) ^ 2) ## total sum of squares
rsq <- 1 - rss/tss
#[1] 0.25
But there is another:
regss <- sum((preds - mean(preds)) ^ 2) ## regression sum of squares
regss / tss
#[1] 0.75
Also, your formula can give a negative value (the proper value should be 1 as mentioned above in the Warning section).
preds <- 1:4 / 4
actual <- 1:4
rss <- sum((preds - actual) ^ 2) ## residual sum of squares
tss <- sum((actual - mean(actual)) ^ 2) ## total sum of squares
rsq <- 1 - rss/tss
#[1] -2.375
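To see why the regression step matters, here is a small check on the same toy vectors (a sketch): the identity tss = regss + rss holds for the fitted values of lm(actual ~ preds), but not for the raw preds.

```r
preds <- 1:4 / 4
actual <- 1:4

tss <- sum((actual - mean(actual))^2)

# raw predictions: the partition of the sum of squares does not hold
rss.raw <- sum((preds - actual)^2)
regss.raw <- sum((preds - mean(actual))^2)
raw.holds <- isTRUE(all.equal(tss, regss.raw + rss.raw))

# fitted values from the regression: the partition holds
fitted.vals <- fitted(lm(actual ~ preds))
rss.fit <- sum((fitted.vals - actual)^2)
regss.fit <- sum((fitted.vals - mean(actual))^2)
fit.holds <- isTRUE(all.equal(tss, regss.fit + rss.fit))

c(raw.holds, fit.holds)
1 - rss.fit / tss   # R squared after the regression step
```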
Final remark
I had never expected that this answer could eventually be so long when I posted my initial answer 2 years ago. However, given the high views of this thread, I feel obliged to add more statistical details and discussions. I don't want to mislead people that just because they can compute an R squared so easily, they can use R squared everywhere.
Why not this:
rsq <- function(x, y) summary(lm(y~x))$r.squared
rsq(obs, mod)
#[1] 0.8560185
It is not something obvious, but the caret package has a function postResample() that will calculate "A vector of performance estimates" according to the documentation. The "performance estimates" are
RMSE
Rsquared
mean absolute error (MAE)
and have to be accessed from the vector like this
library(caret)
vect1 <- c(1, 2, 3)
vect2 <- c(3, 2, 2)
res <- caret::postResample(vect1, vect2)
rsq <- res[2]
However, this is using the correlation squared approximation for r-squared as mentioned in another answer. I'm not sure why Max Kuhn didn't just use the conventional 1-SSE/SST.
caret also has an R2() method, although it's hard to find in the documentation.
The way to implement the normal coefficient of determination equation is:
preds <- c(1, 2, 3)
actual <- c(2, 2, 4)
rss <- sum((preds - actual) ^ 2)
tss <- sum((actual - mean(actual)) ^ 2)
rsq <- 1 - rss/tss
Not too bad to code by hand of course, but why isn't there a function for it in a language primarily made for statistics? I'm thinking I must be missing the implementation of R^2 somewhere, or no one cares enough about it to implement it. Most of the implementations, like this one, seem to be for generalized linear models.
You can also use the summary for linear models:
summary(lm(obs ~ mod, data=df))$r.squared
Here is the simplest solution based on [https://en.wikipedia.org/wiki/Coefficient_of_determination]
# 1. 'Actual' and 'Predicted' data
df <- data.frame(
y_actual = c(1:5),
y_predicted = c(0.8, 2.4, 2, 3, 4.8))
# 2. R2 Score components
# 2.1. Average of actual data
avr_y_actual <- mean(df$y_actual)
# 2.2. Total sum of squares
ss_total <- sum((df$y_actual - avr_y_actual)^2)
# 2.3. Regression sum of squares
ss_regression <- sum((df$y_predicted - avr_y_actual)^2)
# 2.4. Residual sum of squares
ss_residuals <- sum((df$y_actual - df$y_predicted)^2)
# 3. R2 Score
r2 <- 1 - ss_residuals / ss_total
Not sure why this isn't implemented directly in R, but this answer is essentially the same as Andrii's and Wordsforthewise's; I just turned it into a function for the sake of convenience, in case somebody uses it a lot like me.
r2_general <-function(preds,actual){
return(1- sum((preds - actual) ^ 2)/sum((actual - mean(actual))^2))
}
I use the function MLmetrics::R2_Score from the MLmetrics package to compute R2; it uses the vanilla 1-(RSS/TSS) formula.

Error in optim(): searching for global minimum for a univariate function

I am trying to optimize a function in R.
The function is the likelihood function of the negative binomial when estimating only the mu parameter. This should not be a problem since the function clearly has just one maximum. But I am not able to reach the desired result.
The function to be optimized is:
EMV <- function(data, par) {
Mi <- par
Phi <- 2
N <- NROW(data)
Resultado <- log(Mi/(Mi + Phi))*sum(data) + N*Phi*log(Phi/(Mi + Phi))
return(Resultado)
}
Data is a vector of negative binomial variables with parameters 2 and 2 (rnegbin() comes from the MASS package):
library(MASS)
data <- rnegbin(10000, mu = 2, theta = 2)
When I plot the function having mu as variable with the following code:
x <- seq(0.1, 100, 0.02)
z <- EMV(data,0.1)
for (aux in x) {z <- rbind(z, EMV(data,aux))}
z <- z[2:NROW(z)]
plot(x,z)
I get the following curve:
And the maximum value of z is close to parameter value --> 2
x[which.max(z)]
But the optimization is not working with BFGS
Error in optim(par = theta, fn = EMV, data = data, method = "BFGS") :
non-finite finite-difference value [1]
And is not going to right value using SANN, for example:
$par
[1] 5.19767e-05
$value
[1] -211981.8
$counts
function gradient
10000 NA
$convergence
[1] 0
$message
NULL
The questions are:
What am I doing wrong?
Is there a way to tell optim that the param should be bigger than 0?
Is there a way to tell optim that I want to maximize the function? (I am afraid the optim is trying to minimize and is going to a very small value where function returns smallest values)
Minimization or Maximization?
?optim does say it can do maximization, but only in brackets, so minimization is the default:
fn: A function to be minimized (or maximized) ...
Thus, if we want to maximize an objective function, we need to multiply it by -1 and then minimize the result. This is quite a common situation. In statistics we often want to find the maximum log likelihood, so to use optim(), we have no choice but to minimize the negative log likelihood.
Which method to use?
If we only do 1D minimization, we should use method "Brent". This method requires a finite lower bound and upper bound for the search region, and it looks for the minimum only inside that interval. Such a specification can help you to constrain your parameters. For example, if you don't want mu to be smaller than 0, just set lower = 0.
When we move to 2D or higher dimension, we should resort to "BFGS". In this case, if we want to constrain one of our parameters, say a, to be positive, we need to take log transform log_a = log(a), and reparameterize our objective function using log_a. Now, log_a is free of constraint. The same goes when we want constrain multiple parameters to be positive.
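A minimal sketch of the log reparameterization for two positive parameters, assuming simulated data from base R's rnbinom() and its dnbinom() density (neither appears in the question). The starting value for mu is the sample mean, which is a choice made here to keep the first BFGS steps well behaved.

```r
set.seed(1)
x <- rnbinom(10000, size = 2, mu = 2)   # simulated data (assumption)

# negative log-likelihood in terms of log(mu) and log(size), both unconstrained
nll <- function(logpar, x) {
  mu <- exp(logpar[1])
  size <- exp(logpar[2])
  -sum(dnbinom(x, size = size, mu = mu, log = TRUE))
}

# start at the sample mean for mu and at size = 1 (assumption)
fit <- optim(par = c(log(mean(x)), 0), fn = nll, x = x, method = "BFGS")

# back-transform to (mu, size); both are positive by construction
est <- exp(fit$par)
est
```

Because the optimizer only ever sees log_mu and log_size, it can step anywhere on the real line without producing an invalid parameter.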
How to change your code?
EMV <- function(data, par) {
Mi <- par
Phi <- 2
N <- NROW(data)
Resultado <- log(Mi/(Mi + Phi))*sum(data) + N*Phi*log(Phi/(Mi + Phi))
return(-1 * Resultado)
}
optim(par = theta, fn = EMV, data = data, method = "Brent", lower = 0, upper = 1E5)
The help file for optim says: "By default optim performs minimization, but it will maximize if control$fnscale is negative." So if you either multiply your function output by -1 or change the control object input, you should get the right answer.
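Both routes can be compared side by side. The sketch below assumes simulated data from base R's rnbinom() in place of MASS::rnegbin(), uses par = 1 as a placeholder starting value, and narrows the upper bound to 100; none of these choices come from the question.

```r
set.seed(1)
data <- rnbinom(10000, size = 2, mu = 2)   # simulated data (assumption)

# log-likelihood from the question, unchanged (to be maximized)
EMV <- function(data, par) {
  Mi <- par
  Phi <- 2
  N <- NROW(data)
  log(Mi / (Mi + Phi)) * sum(data) + N * Phi * log(Phi / (Mi + Phi))
}

# option 1: minimize the negated function
fit.neg <- optim(par = 1, fn = function(par, data) -EMV(data, par), data = data,
                 method = "Brent", lower = 0, upper = 100)

# option 2: keep EMV as is and let optim maximize via a negative fnscale
fit.max <- optim(par = 1, fn = EMV, data = data,
                 method = "Brent", lower = 0, upper = 100,
                 control = list(fnscale = -1))

# both estimates should agree with the sample mean
c(fit.neg$par, fit.max$par, mean(data))
```

With Phi held fixed, setting the derivative of this log-likelihood to zero gives Mi = sum(data)/N, so the sample mean is a convenient reference value.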

Does cattell's profile similarity coefficient (Rp) exist as a function in R?

I'm comparing different measures of distance and similarity for vector profiles (subtest results) in R. Most of them are easy to compute and/or exist in dist().
Unfortunately, one that might be interesting and is too difficult for me to calculate myself is Cattell's Rp. I cannot find it in R.
Does anybody know if this exists already?
Or can you help me to write a function?
The formula (Cattell 1994) of Rp is this:
(2k-d^2)/(2k + d^2)
where:
k is the median for chi square on a sample of size n;
d is the sum of the (weighted=m) differences between the two profiles,
something like: sum(m(x(i)-y(i)));
One thing I don't know is how to get the chi square median in there.
Thank you
What i get without defining the k is:
Rp.Cattell <- function(x,y){z <- (2k-(sum(x-y))^2)/(2k+(sum(x-y))^2);return(z)}
Vector examples are:
x <- c(-1.2357,-1.1999,-1.4727,-0.3915,-0.2547,-0.4758)
y <- c(0.7785,0.9357,0.7165,-0.6067,-0.4668,-0.5925)
They are measured by the same device, but relate to different body parts. They don't need to be standardised or weighted, I would say.
This page gives a general formula for k, and then gives a more thorough method using SAS/IML which pretty much gives the same results. So I used the general formula, added calculation of degrees of freedom, which leads to this:
Rp.Cattell <- function(x,y) {
dof <- (2-1) * (length(y)-1)
k <- (1-2/(9*dof))^3
z <- (2*k-sum(sum(x-y))^2)/(2*k+sum(sum(x-y))^2)
return(z)
}
x <- c(-1.2357,-1.1999,-1.4727,-0.3915,-0.2547,-0.4758)
y <- c(0.7785,0.9357,0.7165,-0.6067,-0.4668,-0.5925)
Rp.Cattell(x, y)
# [1] -0.9012083
Does this figure appear to make sense?
Trying to verify the function, I found out now that the median of chisquare is the chisquare value for 50% probability - relating to random. So the function should be:
Rp.Cattell <- function(x,y){
dof <- (2-1) * (length(y)-1)
k <- qchisq(.50, df=dof)
z <- (2k-(sum(x-y))^2)/(2k+(sum(x-y))^2);
return(z)}
It is necessary, though, to standardize the values beforehand, so that the results are distributed correctly.
So:
library ("stringr")
# they are centered already
x <- as.vector(scale(c(-1.2357,-1.1999,-1.4727,-0.3915,-0.2547,-0.4758),center=F, scale=T))
y <- as.vector(scale(c(0.7785,0.9357,0.7165,-0.6067,-0.4668,-0.5925),center=F, scale=T))
Rp.Cattell(x, y)
# [1] -0.584423
This sounds reasonable now - or not?
I think the calculation of z is incorrect.
You need to calculate the sum of the squared differences, not the square of the sum of the differences. Besides, the product operator is missing in 2k.
It should be
z <- (2*k-sum((x-y)^2))/(2*k+sum((x-y)^2))
Do you agree?
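Putting the corrections from this thread together gives the sketch below. The degrees of freedom are carried over from the earlier answer and may need adjusting for your design, and whether to standardize x and y beforehand is left to the caller.

```r
# sketch combining the corrections discussed in this thread
Rp.Cattell <- function(x, y) {
  stopifnot(length(x) == length(y))
  dof <- (2 - 1) * (length(y) - 1)   # degrees of freedom as used earlier in the thread
  k <- qchisq(0.50, df = dof)        # median of the chi-square distribution
  d2 <- sum((x - y)^2)               # sum of squared differences
  (2 * k - d2) / (2 * k + d2)
}

x <- c(-1.2357, -1.1999, -1.4727, -0.3915, -0.2547, -0.4758)
y <- c(0.7785, 0.9357, 0.7165, -0.6067, -0.4668, -0.5925)
Rp.Cattell(x, y)
Rp.Cattell(x, x)   # identical profiles give 1
```

Because d2 is never negative, the result always lies between -1 and 1, which the squared-sum version did not guarantee in any meaningful way.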
