I'm trying to convert Matlab code to R. I'm not familiar with Matlab matrix operations, and it appears the results from my R code do not match the results from Matlab, so any help would be greatly appreciated. The Matlab code I'm trying to convert is below (from this website):
% Mean Variance Optimizer
% S is matrix of security covariances
S = [185 86.5 80 20; 86.5 196 76 13.5; 80 76 411 -19; 20 13.5 -19 25]
% Vector of security expected returns
zbar = [14; 12; 15; 7]
% Unity vector..must have same length as zbar
unity = ones(length(zbar),1)
% Vector of security standard deviations
stdevs = sqrt(diag(S))
% Calculate Efficient Frontier
A = unity'*S^-1*unity
B = unity'*S^-1*zbar
C = zbar'*S^-1*zbar
D = A*C-B^2
% Efficient Frontier
mu = (1:300)/10;
% Plot Efficient Frontier
minvar = ((A*mu.^2)-2*B*mu+C)/D;
minstd = sqrt(minvar);
plot(minstd,mu,stdevs,zbar,'*')
title('Efficient Frontier with Individual Securities','fontsize',18)
ylabel('Expected Return (%)','fontsize',18)
xlabel('Standard Deviation (%)','fontsize',18)
Here is my attempt in R:
# S is matrix of security covariances
S <- matrix(c(185, 86.5, 80, 20, 86.5, 196, 76, 13.5, 80, 76, 411, -19, 20, 13.5, -19, 25), nrow=4, ncol=4, byrow=TRUE)
# Vector of security expected returns
zbar = c(14, 12, 15, 7)
# Unity vector..must have same length as zbar
unity <- rep(1, length(zbar))
# Vector of security standard deviations
stdevs <- sqrt(diag(S))
# Calculate Efficient Frontier
A <- unity*S^-1*unity
B <- unity*S^-1*zbar
C <- zbar*S^-1*zbar
D <- A*C-B^2
# Efficient Frontier
mu = (1:300)/10
# Plot Efficient Frontier
minvar = ((A*mu^2)-2*B*mu+C)/D
minstd = sqrt(minvar)
It appears that unity'*S in Matlab is equivalent to colSums(S) in R. But I haven't been able to figure out how to calculate the equivalent of S^-1*unity in R. If I type the same expression in R (S^-1*unity), it runs without error, but it gives a different answer. Because I don't understand the underlying Matlab calculation, I'm not sure how to translate it to R.
I used to do matlab -> R conversions quite a bit a few years ago.
My general suggestion is to open two terminals side by side and work through the code line by line. After each line, check that what you got in MATLAB and in R is equivalent.
This document should be handy: http://mathesaurus.sourceforge.net/octave-r.html
In your case these appear to be the commands that you should have in mind:
Matrix multiplication:
Matlab: A*B
R: A %*% B
Transpose:
Matlab: A'
R: t(A)
Matrix inverse:
Matlab: inv(A) or A^-1
R: solve(A)
Don't try to convert everything at once because you will run into trouble. When the results won't match you will not be able to tell where the error is.
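Putting those three operators together, here is a sketch of how the A/B/C/D block and the frontier might look in R (reusing S, zbar, unity and stdevs from your code; pch = 8 is just a stand-in for Matlab's '*'):
Sinv <- solve(S)                        # matrix inverse, the analogue of S^-1
A <- drop(t(unity) %*% Sinv %*% unity)  # drop() turns the 1x1 matrix into a scalar
B <- drop(t(unity) %*% Sinv %*% zbar)
C <- drop(t(zbar)  %*% Sinv %*% zbar)
D <- A * C - B^2
mu <- (1:300) / 10
minvar <- (A * mu^2 - 2 * B * mu + C) / D
minstd <- sqrt(minvar)
plot(minstd, mu, type = "l",
     xlab = "Standard Deviation (%)", ylab = "Expected Return (%)",
     main = "Efficient Frontier with Individual Securities")
points(stdevs, zbar, pch = 8)           # individual securities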
I'm trying to tackle an optimization problem where the objective function is nonlinear and the constraints are linear. I read a bit about the ROI package in R and decided to use it. However, I am running into a problem while solving the optimization problem.
I am essentially trying to minimize the area under a supply-demand curve. The equation for the supply and demand curves are defined in the code:
Objective function: minimize (integral of supply curve + integral of demand curve),
subject to the constraints
q >= 34155 (stored in a variable called ICR),
q >= 0,
q <= 40000.
I have tried to run this through the ROI package in RStudio and I keep getting an error telling me that there is no solver to be found.
library(tidyverse)
library(ROI)
library(rSymPy)
library(mosaicCalc)
# Initializing parameters for demand curve
A1 <- 6190735.2198302800
B1 <- -1222739.9618776600
C1 <- 103427.9556133250
D1 <- -4857.0627045073
E1 <- 136.7660814828
# Initializing parameters for Supply Curve
S1 <- -1.152
S2 <- 0.002
S3 <- a-9.037e-09
S4 <- 2.082e-13
S5 <- -1.64e-18
ICR <- 34155
demand_curve_integral <- antiD(A1 + B1*q + C1*(q^2)+ D1*(q^3) + E1*(q^4) ~q)
supply_curve_integral <- antiD(S1 + S2*(q) + S3*(q^2) + S4*(q^3) + S5*(q^4)~q)
# Setting up the objective function
obj_func <- function(q){ (18.081*demand_curve_integral(q))+supply_curve_integral(q)}
# Setting up the optimization Problem
lp <- OP(objective = F_objective(obj_func, n=1L),
         constraints = L_constraint(L = matrix(c(1, 1, 1), nrow=3),
                                    dir = c(">=", ">=", "<="),
                                    rhs = c(ICR, 0, 40000, 1)),
         maximum = FALSE)
sol <- ROI_solve(lp)
This is the error that I keep getting in RStudio:
Error in ROI_solve(lp) : no solver found for this signature:
objective: F
constraints: L
bounds: V
cones: X
maximum: FALSE
C: TRUE
I: FALSE
B: FALSE
What should I do to rectify this error?
In general you could use ROI.plugin.alabama or ROI.plugin.nloptr for this optimization problem.
But I looked at the problem and this raised several questions.
a is not defined in the code.
You state that q has length 1 but add 3 linear constraints. The constraints say
q >= 34155, q >= 0, q <= 40000 (or q <= 1?)
I am not entirely sure, since the length of rhs is 4 while L and dir
suggest there are only 3 linear constraints.
What should the constraint look like?
34155 <= q <= 40000?
Then you could specify the constraint as bounds and use ROI.plugin.optimx,
or, since you have a one-dimensional optimization problem, just use optimize
from the stats package: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/optimize.html
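For example, here is a minimal sketch of the optimize route, assuming the constraint really is 34155 <= q <= 40000 and that the stray a in S3 is a typo; the antiderivatives are written out by hand instead of using antiD():
# Coefficients from the question; S3 assumed to be -9.037e-09 (the question has a stray 'a')
A1 <- 6190735.2198302800; B1 <- -1222739.9618776600
C1 <- 103427.9556133250;  D1 <- -4857.0627045073; E1 <- 136.7660814828
S1 <- -1.152; S2 <- 0.002; S3 <- -9.037e-09; S4 <- 2.082e-13; S5 <- -1.64e-18
ICR <- 34155
# Hand-written antiderivatives of the quartic demand and supply curves
demand_int <- function(q) A1*q + B1*q^2/2 + C1*q^3/3 + D1*q^4/4 + E1*q^5/5
supply_int <- function(q) S1*q + S2*q^2/2 + S3*q^3/3 + S4*q^4/4 + S5*q^5/5
obj_func   <- function(q) 18.081 * demand_int(q) + supply_int(q)
# One-dimensional minimization over the feasible interval
sol <- optimize(obj_func, interval = c(ICR, 40000))
sol$minimum    # optimal q
sol$objective  # objective value at the optimum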
I haven't run NLP using ROI. But you have to install an ROI solver plug-in and then load the library in your code. The current solver plug-ins are:
library(ROI.plugin.glpk)
library(ROI.plugin.lpsolve)
library(ROI.plugin.neos)
library(ROI.plugin.symphony)
library(ROI.plugin.cplex)
Neos provides access to NLP solvers but I don't know how to pass solver parameters via an ROI plug-in function call.
https://neos-guide.org/content/nonlinear-programming
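As a sketch of that workflow (the choice of plug-in and the start-value argument below are my assumptions, not tested):
# Install and load a plug-in with NLP support, then re-run the solve
install.packages("ROI.plugin.alabama")
library(ROI)
library(ROI.plugin.alabama)
ROI_registered_solvers()                                  # the new solver should now be listed
sol <- ROI_solve(lp, solver = "alabama", start = 35000)   # 'start' argument name is an assumption
solution(sol)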
I am trying to "translate" these lines from R to Julia:
n <- 100
mean <- 0
sd <- 1
x <- qnorm(seq(1 / n, 1 - 1 / n, length.out = n), mean, sd)
However, I have trouble with the qnorm function. I searched for "quantile function" and found the quantile() function, but the R version returns a vector of length 100, while the Julia version returns a vector of length 5.
Here's my attempt:
import Distributions
n = 100
x = Distributions.quantile(collect(range(1/n, stop=1-1/n, length=n)))
Under Julia 1.1 you should broadcast the call to quantile like this:
quantile.(Normal(0, 1), range(1/n, 1-1/n, length = n))
Try
using Distributions
n = 100
qs = range(1/n, stop=1-1/n, length=n) # no need to collect it
d = Normal() # default is mean = 0, std = 1
result = [quantile(d, q) for q in qs]
Julia uses multiple dispatch to select the appropriate quantile method for a given distribution, in contrast to R, where the distribution is encoded in a function prefix (as in qnorm). According to the documentation, the first argument should be the distribution and the second argument the point at which you want to evaluate the inverse cdf.
Strangely I get an error when I try to do quantile.(d, qs) (broadcast the quantile call). UPDATE: See Bogumil's answer in this case. In my benchmarks, both approaches have the same speed.
I have a joint probability mass function of two variables X,Y like here
How can I calculate the covariance in R?
I created two vectors x,y and fed them into cov(), but I get the wrong result.
How can I do this right?
Thanks in advance and happy coding!
Since SO is a coding forum, I'll leave working out the math/stats details up to you. Here is an implementation in R.
We start by noting the sample spaces of the two variables (called X and Y in the question, G and R below)
# For G
G <- 0:3;
# For R
R <- 0:2;
The joint probability mass function is given by the following matrix
joint_pmf <- matrix(
c(4/84, 12/84, 4/84,
18/84, 24/84, 3/84,
12/84, 6/84, 0,
1/84, 0, 0),
ncol = 3, byrow = T);
We calculate the population means
# For G
mu_G <- rowSums(joint_pmf) %*% G;
# For R
mu_R <- colSums(joint_pmf) %*% R;
We can make use of the theorem Cov(X, Y) = E[XY] - E[X]E[Y] to calculate the covariance
cov_GR <- G %*% joint_pmf %*% R - mu_G * mu_R;
# [,1]
#[1,] -0.1666667
where we have used the fact that E[G] = mu_G and E[R] = mu_R are the respective population means.
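As a quick cross-check (reusing G, R, joint_pmf, mu_G and mu_R from above), E[GR] can also be computed with outer():
EGR <- sum(outer(G, R) * joint_pmf)         # E[GR] as a sum over the joint pmf
EGR - as.numeric(mu_G) * as.numeric(mu_R)   # again about -0.1667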
Suppose f(k) = exp(k/200) - 1 and we want to minimize ( f(a) + f(b) + f(c) + f(d) - pi )^2. The solution should be a = 6, b = 75, c = 89, d = 226. The sum of squares for this solution is ~ 8e-17.
sumsq <- function(theta, n=200) {
f <- function(k) exp(k/n) - 1
(f(theta[1]) + f(theta[2]) + f(theta[3]) + f(theta[4]) - pi)^2
}
theta <- optim(par=c(10, 90, 70, 300), fn=sumsq)
# theta$par = 62.97, 106.89, 78.64, 189.82
# theta$value = 6.32e-10
# sumsq(c(6,75,89,226)) = 8.20e-17
Comparing the sums of squares, the solution a = 6, b = 75, c = 89, d = 226 is clearly better than the one the optim function gave. I would like to know how to make R's optimization more accurate. I have also tried the nlm() function, without success.
The value of pi used is 3.1415926535897931 - I think the accuracy of pi is not the reason why the optim function isn't producing an optimal solution.
As the commenters say, this isn't a problem with the accuracy of optim, but rather that the algorithm used by optim may not be suitable for your particular problem. There are very many optimization packages and interfaces available in R; I have had good results using the rgenoud package to improve maximum-likelihood parameter estimates with the fitdist package (which I believe uses optim by default).
The other question, of course, is whether the problem you are posing actually has a global minimum that is distinguishable from other local minima within the numerical tolerance you can specify / R can detect. 6.32e-10 and 8.20e-17 are both pretty small and far beyond the numerical tolerances I consider acceptable in my work... but I don't know about your field.
This is not a well-posed minimization problem. There are infinitely many possible solutions. One of them is
a=b=c=d=200*log(1+pi/4)
which numerically is approximately
115.92829021682383
The residual sumsq in this case is zero (within the numerical accuracy) if you insert the numbers.
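A quick check in R, reusing sumsq from the question:
k <- 200 * log(1 + pi / 4)   # approximately 115.92829
sumsq(rep(k, 4))             # essentially zero, up to floating-point error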
The problem would probably be far more complex to solve if one imposed, e.g., the restriction that only natural numbers or only integers are allowed. In that case, your combination (and permutations thereof) might be the best, but at the moment I wouldn't know how to verify this. A minimization in the presence of such a constraint would be a qualitatively different problem, which might be interesting for mathematicians. In any case, the usual numerical optimization algorithms won't let you introduce such a constraint.
I used the "BFGS" method:
sumsq <- function(theta, n=200) {
f <- function(k) exp(k/n) - 1
(f(theta[1]) + f(theta[2]) + f(theta[3]) + f(theta[4]) - pi)^2
}
theta <- optim(par=c(10, 90, 70, 300), fn=sumsq, method="BFGS")
Look at the result:
> theta
$par
[1] -2.629695 71.159586 52.952260 246.174513
$value
[1] 4.009243e-22
$counts
function gradient
19 8
$convergence
[1] 0
I'm currently writing a tutorial about bootstrapping in R. I settled on the function boot in the boot package. I got the book "An introduction to the Bootstrap" by Efron/Tibshirani (1993) and just replicate a few of their examples.
Quite often in those examples, they compute statistics based on different samples. For instance, they have this one example where they have a sample of 16 mice. 7 of those mice received a treatment that was meant to prolong survival time after a test surgery. The remaining 9 mice did not receive the treatment. For each mouse, the number of days it survived was collected (values are given below).
Now, I want to use the bootstrapping approach to find out if the difference of mean is significant or not. However, if I understand the help page of boot correctly, I can't just pass two different samples with unequal sample size to the function. My workaround is as follows:
#Load package boot
library(boot)
#Read in the survival time in days for each mouse
treatment <- c(94, 197, 16, 38, 99, 141, 23)
control <- c(52, 104, 146, 10, 51, 30, 40, 27, 46)
#Call boot twice(!)
b1 <- boot(data = treatment,
statistic = function(x, i) {mean(x[i])},
R = 10000)
b2 <- boot(data = control,
statistic = function(x, i) {mean(x[i])},
R = 10000)
#Compute difference of mean manually
mean_diff <- b1$t - b2$t
In my opinion, this solution is a bit of a hack. The statistic I'm interested in is now saved in a vector mean_diff, but I don't get all the great functionality of the boot package anymore. I can't call boot.ci on mean_diff, etc.
So my question basically is if my hack is the only way to do a bootstrap with the boot package in R and statistics that compare two different samples. Or is there another way?
I thought about passing one data.frame in with 16 rows and an additional column "Group":
df <- data.frame(survival=c(treatment, control),
group=c(rep(1, length(treatment)), rep(2, length(control))))
head(df)
survival group
1 94 1
2 197 1
3 16 1
4 38 1
5 99 1
6 141 1
However, now I would have to tell boot that it has to sample always 7 observations from the first 7 rows and 9 observations from the last 9 rows and treat these as separate samples. I would not know how to do that.
I've never really figured out what the big advantage of boot is, since it is so easy to manually code bootstrap procedures. You could try for example the following using replicate:
myboot1 <- function(){
booty <- tapply(df$survival,df$group,FUN=function(x) sample(x,length(x),TRUE))
sapply(booty,mean)
}
out1 <- replicate(1000,myboot1())
From this you can get a bunch of useful statistics quite easily:
rowMeans(out1) # group means
diff(rowMeans(out1)) # difference
mean(out1[1,]-out1[2,]) # another way of getting difference
apply(out1,1,quantile,c(0.025,0.975)) # treatment-group CIs
quantile(out1[1,]-out1[2,],c(0.025,0.975)) # CI for the difference
This is an example in ?boot.return:
diff.means <- function(d, f) {
  n <- nrow(d)
  gp1 <- 1:table(as.numeric(d$series))[1]          # rows belonging to the first series
  m1 <- sum(d[gp1, 1] * f[gp1]) / sum(f[gp1])      # frequency-weighted mean, group 1
  m2 <- sum(d[-gp1, 1] * f[-gp1]) / sum(f[-gp1])   # frequency-weighted mean, group 2
  ss1 <- sum(d[gp1, 1]^2 * f[gp1]) - (m1 * m1 * sum(f[gp1]))
  ss2 <- sum(d[-gp1, 1]^2 * f[-gp1]) - (m2 * m2 * sum(f[-gp1]))
  c(m1 - m2, (ss1 + ss2) / (sum(f) - 2))           # difference of means, pooled variance
}
grav1 <- gravity[as.numeric(gravity[, 2]) >= 7, ]
# stype = "f": the statistic receives frequency weights; strata resamples within each series
boot(grav1, diff.means, R = 999, stype = "f", strata = grav1[, 2])
Section 3.2 of Davison and Hinkley can be referenced.
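The same strata idea carries over directly to the mice data from the question. A sketch, assuming the df with survival and group columns defined there (diff_means is just a hypothetical helper name):
diff_means <- function(d, i) {
  means <- tapply(d$survival[i], d$group[i], mean)
  means[1] - means[2]                          # treatment mean minus control mean
}
b_strat <- boot(data = df, statistic = diff_means, R = 10000, strata = df$group)
boot.ci(b_strat, type = "perc")                # percentile CI for the difference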
Giving it another thought, I realized that I could actually combine Thomas' answer with boot. Here is a solution:
b <- boot(data=df,
statistic = function(x, i) {
booty <- tapply(x$survival,x$group,FUN=function(x) sample(x,length(x),TRUE))
diff(sapply(booty,mean))*-1
},
R=10000)
The trick is that the function you provide to the argument statistic has to accept a parameter i for the index, but you completely ignore this parameter within your function and do the sampling yourself instead. Of course, this is not the most efficient approach (because boot still does its own resampling, which is then discarded), but I guess that in most cases this shouldn't be a big issue.
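A nice side effect is that b is an ordinary "boot" object again, so for example a percentile interval for the difference can be obtained with:
boot.ci(b, type = "perc")   # percentile CI, computed from the resampled values in b$t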