In R, how do I find the optimal variable to minimise the correlation between two datasets [duplicate] - r

This question already has answers here:
Possible Duplicate: In R, how do I find the optimal variable to maximize or minimize correlation between several datasets
Closed 10 years ago.
This can be done in Excel, but my dataset has gotten too large. In Excel, I would use Solver.
I have 5 variables and I want to construct a weighted average of these 5 variables so that it has the lowest correlation to a 6th variable.
Column A,B,C,D,E = random numbers
Column F = random number (which I want to minimise the correlation to)
Column G = A*wi1 + B*wi2 + C*wi3 + D*wi4 + E*wi5
where wi1 to wi5 are the coefficients produced by Solver. In a separate cell, I would have CORREL(F,G).
This is all achieved with the following constraints in mind:
1. wi1 to wi5 have to be between 0 and 1
2. wi1 + wi2 + wi3 + wi4 + wi5 = 1
I'd like to print the results of this so that I can build an efficient-frontier-type chart.
How can I do this in R? Thanks for the help.

I looked at the other thread mentioned by Vincent and I think I have a better solution. I hope it is correct. As Vincent points out, your biggest problem is that the optimization tools for such non-linear problems do not offer a lot of flexibility for dealing with your constraints. Here, you have two types of constraints: 1) all your weights must be >= 0, and 2) they must sum to 1.
The optim function has a lower option that can take care of your first constraint. For the second constraint, you have to be a bit creative: you can force your weights to sum to one by scaling them inside the function to be minimized, i.e. rewrite your correlation function as function(w) cor(X %*% w / sum(w), Y).
# create random data
n.obs <- 100
n.var <- 6
X <- matrix(runif(n.obs * n.var), nrow = n.obs, ncol = n.var)
Y <- matrix(runif(n.obs), nrow = n.obs, ncol = 1)
# function to minimize
correl <- function(w) cor(X %*% w / sum(w), Y)
# initial guess
w0 <- rep(1 / n.var, n.var)
# optimize
opt <- optim(par = w0, fn = correl, method = "L-BFGS-B", lower = 0)
optim.w <- opt$par / sum(opt$par)
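As a quick sanity check, and as a seed for the efficient-frontier-style chart mentioned in the question, one can compare the correlations before and after optimization (a minimal sketch reusing the objects above):
correl(w0) # correlation at the equal-weight starting point
correl(opt$par) # correlation at the optimum; correl() rescales the weights internally
sum(optim.w) # the normalized weights sum to 1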

Related

Finding minimum by optimising a vector in R

I need to find a minimum of an objective function by optimising a vector. The problem is finance-related, if that helps: the function RC (provided below) computes the sum of squared differences of the risk contributions of different assets, where an asset's risk contribution is the product of the input risk measure (RM, given) and its weight in the portfolio.
The goal is to find such weights that the sum is zero, i.e. all assets have equal risk contributions.
RC = function (RM, w){
  w = w/sum(w) # normalizing weights so they sum up to 1
  nAssets = length(RM)
  rc_matrix = matrix(nrow=1, ncol=nAssets)
  rc_matrix = RM*w # risk contributions: RM (risk measure) multiplied by the asset's weight in the portfolio
  rc_sum_squares = numeric(length=1) # placeholder
  rc_sum_squares = sum(combn(
    seq_along(RM),
    2,
    FUN = function(x)
      (rc_matrix[ , x[1]] - rc_matrix[, x[2]]) ** 2
  )) # this sums the squared differences of the risk contributions
  return(rc_sum_squares)
}
I searched and the solution seems to lie in the "optim" function, so I tried:
out <- optim(
  par = rep(1 / length(RM), length(RM)), # initial guess
  fn = RC,
  RM = RM,
  method = "L-BFGS-B",
  lower = 0.00001,
  upper = 1)
However, this returns an error message: "Error in rc_matrix[, x[1]] : incorrect number of dimensions"
I don't know how the optimization algorithm works, so I can't really wrap my head around it. The RC function works on its own, though; here is a sample for reproducibility:
RM <- c(0.06006928, 0.06823795, 0.05716360, 0.08363529, 0.06491009, 0.06673174, 0.03103578, 0.05741140)
w <- matrix(0.125, nrow=1, ncol=8)
I also looked at the CVXR package, which crashes my RStudio for some reason, and at nlm(), which is a little more complicated and for which I can't write the function properly.
A solution might be not to do the funky summation of the squared differences, but to find the weights directly so that the risk contributions (RM*weight) are all equal. I would be very glad for your help.
Note: the vector of the weights has to sum up to 1 and the values have to lie between 0 and 1.
Cheers
Daniel
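A likely explanation, sketched here rather than verified: optim() passes par to RC as a plain numeric vector, so rc_matrix = RM*w becomes a vector rather than a 1-row matrix, and the matrix indexing rc_matrix[, x[1]] then fails with exactly this dimension error. A minimal rewrite using vector indexing works both standalone and inside optim():
RC <- function(RM, w){
  w <- w / sum(w) # normalize weights so they sum to 1
  rc <- RM * w # risk contributions as a plain vector
  sum(combn(seq_along(RM), 2,
            FUN = function(x) (rc[x[1]] - rc[x[2]])^2))
}
RM <- c(0.06006928, 0.06823795, 0.05716360, 0.08363529,
        0.06491009, 0.06673174, 0.03103578, 0.05741140)
out <- optim(par = rep(1/length(RM), length(RM)), fn = RC, RM = RM,
             method = "L-BFGS-B", lower = 0.00001, upper = 1)
w.opt <- out$par / sum(out$par)
# since equal risk contributions RM_i * w_i imply w proportional to 1/RM,
# the analytic solution provides a quick check on the optimizer:
w.check <- (1/RM) / sum(1/RM)
Comparing w.opt against w.check is a quick way to confirm the optimizer found the equal-risk-contribution weights.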

How do I minimize a linear least squares function in R?

I'm reading Deep Learning by Goodfellow et al. and am trying to implement gradient descent as shown in Section 4.5 Example: Linear Least Squares. This is page 92 in the hard copy of the book.
The algorithm can be viewed in detail at https://www.deeplearningbook.org/contents/numerical.html, with the linear least squares example on page 94 of the PDF.
I've tried implementing it in R, and the algorithm as implemented converges on a vector, but this vector does not seem to minimize the least squares function as required. Adding epsilon to the vector in question frequently produces a "minimum" smaller than the minimum output by my program.
options(digits = 15)
dim_square = 2 ### set dimension of square matrix
# generate a random matrix A and a random vector b
set.seed(1234)
A = matrix(nrow = dim_square, ncol = dim_square, byrow = T, rlnorm(dim_square ^ 2)/10)
b = rep(rnorm(1), dim_square)
# having fixed A and b, select x randomly
x = rnorm(dim_square) # a vector of length dim_square; supposed to be arbitrary
f = function(x, A, b){
  total_vector = A %*% x + b # this is the function that we want to minimize
  total = 0.5 * sum(abs(total_vector) ^ 2) # squared L2 norm
  return(total)
}
f(x,A,b)
epsilon = 0.1 # step size
delta = 0.01 # how close do we want to get?
value = (t(A) %*% A) %*% x - t(A) %*% b # gradient of 0.5 * ||Ax - b||^2
L2_norm = (sum(abs(value) ^ 2)) ^ 0.5
steps = vector()
while(L2_norm > delta){
  x = x - epsilon * value
  value = (t(A) %*% A) %*% x - t(A) %*% b
  L2_norm = (sum(abs(value) ^ 2)) ^ 0.5
  print(L2_norm)
}
minimum = f(x, A, b)
minimum
minimum_minus = f(x - 0.5*epsilon, A, b)
minimum_minus # less than the minimum found by gradient descent! Why?
On page 94 of the PDF at https://www.deeplearningbook.org/contents/numerical.html, I am trying to find the values of the vector x such that f(x) is minimized. However, as demonstrated by minimum and minimum_minus in my code, minimum is not the actual minimum, since it exceeds minimum_minus.
Any idea what the problem might be?
Original Problem
Finding the value of x that minimizes the quantity ||Ax - b|| is equivalent (when A is invertible) to finding the x for which Ax - b = 0, i.e. x = A^(-1) b. This is because the L2 norm is the Euclidean norm, more commonly known as the distance formula; by definition a distance cannot be negative, so its minimum is identically zero.
This algorithm, as implemented, actually comes quite close to estimating x. However, because of repeated subtraction and rounding, one quickly runs into the problem of underflow, resulting in the massive oscillation shown below:
[Figure: value of the L2 norm as a function of step size]
[Figure: the above algorithm vs. the solve function in R]
Above we have the results of A %*% x followed by A %*% min_x, with x estimated by the implemented algorithm and min_x estimated by the solve function in R.
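A minimal sketch of that comparison, assuming min_x is computed from the normal equations via solve():
min_x <- solve(t(A) %*% A, t(A) %*% b) # direct least-squares solution
A %*% x # result from the gradient-descent estimate above
A %*% min_x # result from the direct solve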
The problem of underflow, well known to those familiar with numerical analysis, is probably best tackled by the programmers of lower-level libraries best equipped to tackle it.
To summarize, the algorithm appears to work as implemented. It is important to note, however, that not every function has a minimum (think of a straight line), and that this algorithm can only find a local, as opposed to a global, minimum.

How to work with binary constraints in linear optimization?

I have two input matrices, dt (10x3) and wt (3x3), that I need to use to find the optimal decision matrix (same dimension as dt), Par (10x3), so as to maximize an objective function. The R code below gives some direction to the problem (sample inputs used here):
#Input Matrices
dt <- matrix(runif(30),10,3)
wt <- matrix(c(1,0,0,0,2,0,0,0,1),3,3) #weights
#objective function
Obj <- function(Par) {
  P = matrix(Par, nrow = 10, byrow = F) # Reshape
  X = t((dt %*% wt)[, 1]) %*% P[, 1]
  Y = t((dt %*% wt)[, 2]) %*% P[, 2]
  Z = t((dt %*% wt)[, 3]) %*% P[, 3]
  as.numeric(X + Y + Z) # maximize
}
Now I am struggling to apply the following constraints to the problem :
1) Matrix, Par can only have binary values (0 or 1)
2) rowSums(Par) = 1 (Basically a row can only have 1 in one of the three columns)
3) sum(Par[,1]) <= 5, sum(Par[,2]) <= 6, & sum(Par[,3]) <= 4
4) X/(X+Y+Z) < 0.35, & Y/(X+Y+Z) < 0.4 (X,Y,Z are defined in the objective function)
I tried coding the constraints in constrOptim, but am not sure how to input binary and integer constraints. I am reading up on lpSolve, but am not able to figure it out. Any help much appreciated. Thanks!
I believe this is indeed a MIP, so there are no issues with convexity. If I am correct, the model can be formulated as a pure linear (assignment-style) MIP, which can be easily transcribed into R. Note that LP/MIP solvers do not use functions for the objective and constraints (as opposed to NLP solvers); in R one typically builds up matrices of LP coefficients.
Note: I had to make the limits on the column sums much larger (I used 50, 60, 40).
Based on Erwin's response, I am able to formulate the model using lpSolve in R. However, I am still struggling to add the final constraint to the model (the 4th constraint in my question above). Here's what I have so far:
#input dimension
r <- 10
c <- 3
#input matrices
dt <- matrix(runif(r*c),r,c)
wt <- matrix(c(1,0,0,0,2,0,0,0,1),3,3) #weights
#column controller
c.limit <- c(60,50,70)
#create structure for lpSolve (make.lp, set.type, etc. come from the lpSolveAPI package)
library(lpSolveAPI)
ncol <- r*c
lp.create <- make.lp(ncol=ncol)
set.type(lp.create, columns=1:ncol, type="binary")
#create objective values
obj.vals <- as.vector(t(dt%*%wt))
set.objfn(lp.create, obj.vals)
lp.control(lp.create,sense='max')
#Add constraints to ensure sum of parameters for every row (rowSum) <= 1
for (i in 1:r){
  add.constraint(lp.create, xt=c(1,1,1),
                 indices=c(3*i-2,3*i-1,3*i), rhs=1, type="<=")
}
#Add constraints to ensure sum of parameters for every column (colSum) <= column limit (defined above)
for (i in 1:c){
  add.constraint(lp.create, xt=rep(1,r),
                 indices=seq(i,ncol,by=c), rhs=c.limit[i], type="<=")
}
#Add constraints to ensure the per-column objective contributions (t((dt%*%wt)[,i])%*%P[,i]) respect the ratio limits in the problem
#Not sure how to apply a constraint that depends on the objective function; a possible linearization is sketched below
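# A possible linearization (a sketch, not part of the original post):
# X/(X+Y+Z) < 0.35 is equivalent to 0.65*X - 0.35*Y - 0.35*Z < 0 (when X+Y+Z > 0),
# and Y/(X+Y+Z) < 0.4 to -0.4*X + 0.6*Y - 0.4*Z < 0. X, Y and Z are linear in the
# decision variables with coefficients obj.vals, so these become ordinary linear
# constraints (written with <=, since LP solvers do not take strict inequalities).
col.of <- rep(1:c, times = r) # column index of each decision variable (row-major order)
coef.X <- ifelse(col.of == 1, obj.vals, 0)
coef.Y <- ifelse(col.of == 2, obj.vals, 0)
coef.Z <- ifelse(col.of == 3, obj.vals, 0)
add.constraint(lp.create, xt = 0.65*coef.X - 0.35*coef.Y - 0.35*coef.Z,
               indices = 1:ncol, rhs = 0, type = "<=")
add.constraint(lp.create, xt = -0.40*coef.X + 0.60*coef.Y - 0.40*coef.Z,
               indices = 1:ncol, rhs = 0, type = "<=")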
solve(lp.create)
get.objective(lp.create) #20
final.par <- matrix(get.variables(lp.create), ncol = c, byrow=T) # Reshape
Any help that can get me to the finish line is much appreciated :)
Thanks

Constrained optimization of a vector

I have a (non-symmetric) probability matrix, and an observed vector of integer outcomes. I would like to find a vector that maximises the probability of the outcomes, given the transition matrix. Simply, I am trying to estimate a distribution of particles at sea given their ultimate distribution on land, and a matrix of probabilities of a particle released from a given point in the ocean ending up at a given point on the land.
The vector that I want to find is subject to the constraints that all components must be between 0 and 1, and that the sum of the components must equal 1. I am trying to figure out the best optimisation approach for the problem.
My transition matrix and data set are quite large, but I have created a smaller one here:
I used a simulated known at-sea distribution, msim <- c(.3,.2,.1,.3,.1,0), and a simulated probability matrix (t) to come up with an estimated coastal distribution (Datasim2), as follows:
t <- matrix(c(0,.1,.1,.1,.1,.2,0,.1,0,0,.3,0,0,0,0,.4,.1,.3,0,.1,0,.1,.4,0,0,0,.1,0,.1,.1),
            nrow=5, ncol=6, byrow=T) # note: naming this t masks base R's t()
rownames(t) <- c("C1","C2","C3","C4","C5") ### locations on land
colnames(t) <- c("S1","S2","S3","S4","S5","S6") ### locations at sea
Datasim <- as.numeric(round((t %*% msim)*500))
Datasim2<-c(rep("C1",95), rep("C2",35), rep("C3",90),rep("C4",15),rep("C5",30))
M <-c(0.1,0.1,0.1,0.1,0.1,0.1) ## starting M
I started with a straightforward function as follows:
EstimateSource3 <- function(M, Data, T){
  EstEndProbsall <- M %*% T
  TotalLkhd <- rep(NA, times=dim(Data)[1])
  for (j in 1:dim(Data)[1]){
    ObsEstEndLkhd <- 0
    ObsEstEndLkhd <- 1 - EstEndProbsall[1,] ## likelihood of a particle NOT ending up at locations other than the location of interest
    IndexC <- which(colnames(EstEndProbsall)==Data$LocationCode[j], arr.ind=T) ## likelihood of ending up at the location of interest
    ObsEstEndLkhd[IndexC] <- EstEndProbsall[IndexC]
    # total likelihood
    TotalLkhd[j] <- sum(log(ObsEstEndLkhd))
  }
  SumTotalLkhd <- sum(TotalLkhd)
  return(SumTotalLkhd)
}
DistributionEstimate <- optim(par = M, fn = EstimateSource3, Data = Datasim2, T=t,
control = list(fnscale = -1, trace=5, maxit=500), lower = 0, upper = 1)
To constrain the sum to 1, I tried a few of the suggestions posted here: How to set parameters' sum to 1 in constrained optimization
e.g. adding M <- M/sum(M) or SumTotalLkhd <- SumTotalLkhd - (10*pwr) to the body of the function, but neither yielded anything like msim, and in fact the second came up with the error "L-BFGS-B needs finite values of 'fn'".
I thought perhaps the quadprog package might be of some help, but I don’t think I have a symmetric positive definite matrix…
Thanks in advance for your help!
What about this: let D = the distribution on land, M = the distribution at sea, and T the transition matrix. You know D and T, and you want to calculate M. You have
D' = M' T
hence D' T' = M' (T T')
and accordingly D' T' (T T')^(-1) = M'
Basically you solve it as when doing linear regression. (It seems SO does not support math notation: ' is transpose, ^(-1) is the ordinary matrix inverse.)
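A rough transcription of that idea in R (a sketch using the question's objects; note that with the question's 5x6 matrix the system is underdetermined and T T' is singular, so the Moore-Penrose pseudoinverse from MASS stands in for the ordinary inverse):
library(MASS)
D <- as.numeric(t %*% msim) # simulated land distribution (t is the question's matrix, not base R's t())
M.hat <- ginv(t) %*% D # minimum-norm least-squares estimate of the at-sea distribution
M.hat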
Alternatively, D may be counts of particles, and now you can ask questions like: what is the most likely distribution of particles at sea. That needs a different approach though.
Well, I have never done such models, but think along the following lines. Let M be of length 3 and D of length 2, so T is 3x2. We know T, and we observe D_1 particles at location 1 and D_2 particles at location 2.
What is the likelihood of observing one particle at location 1? It is Pr(D = 1) = M_1 T_11 + M_2 T_21 + M_3 T_31. Analogously, Pr(D = 2) = M_1 T_12 + M_2 T_22 + M_3 T_32. Now you can easily write the log-likelihood of observing D_1 and D_2 particles at locations 1 and 2. The code might look like this:
loglik <- function(M) {
  if(M[1] < 0 | M[1] > 1)
    return(NA)
  if(M[2] < 0 | M[2] > 1)
    return(NA)
  M3 <- 1 - M[1] - M[2]
  if(M3 < 0 | M3 > 1)
    return(NA)
  D[1]*log(T[1,1]*M[1] + T[2,1]*M[2] + T[3,1]*M3) +
    D[2]*log(T[1,2]*M[1] + T[2,2]*M[2] + T[3,2]*M3)
}
T <- matrix(c(0.1,0.2,0.3,0.9,0.8,0.7), 3, 2)
D <- c(100,200)
library(maxLik)
m <- maxLik(loglik, start=c(0.4,0.4), method="BFGS")
summary(m)
I get the answer (0, 0.2, 0.8) when I estimate it but standard errors are very large.
As I said, I have never done this, so I don't know if it makes sense.

Generating random sample from the quantiles of unknown density in R [duplicate]

This question already has answers here:
How do I best simulate an arbitrary univariate random variate using its probability function?
(4 answers)
Closed 9 years ago.
How can I generate random sample data from the quantiles of the unknown density f(x) for x between 0 and 4 in R?
f = function(x) ((x-1)^2) * exp(-(x^3/3-2*x^2/2+x))
If I understand you correctly (??) you want to generate random samples with the distribution whose density function is given by f(x). One way to do this is to generate a random sample from a uniform distribution, U[0,1], and then transform this sample to your density. This is done using the inverse cdf of f, a methodology which has been described before, here.
So, let
f(x) = your density function,
F(x) = cdf of f(x), and
F.inv(y) = inverse cdf of f(x).
In R code:
f <- function(x) {((x-1)^2) * exp(-(x^3/3-2*x^2/2+x))}
F <- function(x) {integrate(f,0,x)$value}
F <- Vectorize(F)
F.inv <- function(y){uniroot(function(x){F(x)-y},interval=c(0,10))$root}
F.inv <- Vectorize(F.inv)
x <- seq(0,5,length.out=1000)
y <- seq(0,1,length.out=1000)
par(mfrow=c(1,3))
plot(x,f(x),type="l",main="f(x)")
plot(x,F(x),type="l",main="CDF of f(x)")
plot(y,F.inv(y),type="l",main="Inverse CDF of f(x)")
In the code above, since f(x) is only defined on [0, Inf), we calculate F(x) as the integral of f(x) from 0 to x. Then we invert it using the uniroot(...) function on F(x) - y. The use of Vectorize(...) is needed because, unlike almost all R functions, integrate(...) and uniroot(...) do not operate on vectors. You should look up the help files on these functions for more information.
Now we just generate a random sample X drawn from U[0,1] and transform it with Z = F.inv(X)
X <- runif(1000,0,1) # random sample from U[0,1]
Z <- F.inv(X)
Finally, we demonstrate that Z is indeed distributed as f(x).
par(mfrow=c(1,2))
plot(x,f(x),type="l",main="Density function")
hist(Z, breaks=20, xlim=c(0,5))
Rejection sampling is easy enough:
drawF <- function(n) {
  f <- function(x) ((x-1)^2) * exp(-(x^3/3-2*x^2/2+x))
  x <- runif(n, 0, 4) # uniform proposals on [0, 4]
  z <- runif(n) # uniform acceptance thresholds; valid because f(x) <= 1 on [0, 4]
  subset(x, z < f(x)) # rejection step: keep proposals that fall under the density
}
Not the most efficient, but it gets the job done. (Note that it returns fewer than n draws, since rejected proposals are dropped.)
Use sample. Generate a vector of probabilities from your existing function f, normalized properly. From the help page:
sample(x, size, replace = FALSE, prob = NULL)
Arguments
x Either a vector of one or more elements from which to choose, or a positive integer. See ‘Details.’
n a positive number, the number of items to choose from. See ‘Details.’
size a non-negative integer giving the number of items to choose.
replace Should sampling be with replacement?
prob A vector of probability weights for obtaining the elements of the vector being sampled.
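A minimal sketch of this approach, reusing the f defined earlier: discretize the support, weight each grid point by the density, and let sample() normalize the weights (prob need not sum to 1):
grid <- seq(0, 4, length.out = 10000) # fine grid over the support
Z <- sample(grid, size = 1000, replace = TRUE, prob = f(grid))
hist(Z, breaks = 40) # should match the shape of f(x)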
