I'm currently working on creating a minimum-variance portfolio and decided to use the
function optimize.portfolio of the PortfolioAnalytics package.
Unfortunately, when I extract the weights, all of them are NA, even though none of my returns has an NA value, which (from my point of view) would be the only reason for the resulting weights to be NA.
My dataset consists of multiple assets (+5000), each with 60 monthly observations.
library(PortfolioAnalytics)
# index_returns is an xts object consisting of 3800 stock IDs (columns) and 60 observations at
# monthly intervals. To illustrate my problem, I set all values in index_returns to 1, to make
# sure that no NA values exist.
any(is.na(index_returns)) # --> evaluates to FALSE
port_spec <- portfolio.spec(assets =colnames(index_returns) )
# Add a full investment constraint such that the weights sum to 1
port_spec <- add.constraint(portfolio = port_spec, type = "full_investment")
# Add a long only constraint such that the weight of an asset is between 0 and 1
port_spec <- add.constraint(portfolio = port_spec, type = "long_only")
# Add an objective to min portfolio variance
port_spec <- add.objective(portfolio = port_spec, type = "risk", name = "var")
# Solve the optimization problem
opt <- optimize.portfolio(R = index_returns, trace=TRUE, portfolio = port_spec,optimize_method = "ROI")
extractWeights(opt) #evaluates to NA for all assets
Does anyone know why this occurs, and does anyone have a suggestion for dealing with it? I know that this optimisation problem very likely faces invertibility issues because there are far more columns than rows, but beyond that I'm struggling to make any progress.
Your optimization most likely fails because you have way more assets than observations. Then, as you correctly assumed, you can't obtain an inverse of the estimated covariance matrix.
To quote from "A Portfolio Optimization Approach with a Large Number of Assets: Applications to the US and Korean Stock Markets" available at: https://onlinelibrary.wiley.com/doi/epdf/10.1111/ajfs.12233
"Many attempts have been made to find an invertible estimator of the covariance matrix when N is larger than T. The pseudoinverse estimators of the covariance matrix are used by Sengupta (1983) and Pappas et al. (2010), and the shrinkage estimators of the covariance matrix are suggested by Ledoit and Wolf (2003). Ledoit and Wolf (2003) propose estimating the covariance matrix by an optimally weighted average of two existing estimators: the sample covariance matrix and single-index covariance matrix."
So I'd suggest you take a look at the Ledoit-Wolf shrinkage method as a first step. The R package RiskPortfolios might also be useful, see https://joss.theoj.org/papers/10.21105/joss.00171
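To make the invertibility point concrete, here is a minimal base-R sketch on simulated data. It is not the Ledoit-Wolf estimator: the shrinkage intensity `delta` and the diagonal target are assumptions picked for illustration, whereas Ledoit and Wolf derive an optimal intensity and use a single-index target. The simulated dimensions (60 observations, 200 assets) are also made up.

```r
set.seed(1)
n_obs    <- 60    # monthly observations (rows)
n_assets <- 200   # assets (columns), deliberately more than n_obs

# Simulated returns, standing in for a real return matrix
R <- matrix(rnorm(n_obs * n_assets, sd = 0.05), nrow = n_obs)

# Sample covariance: its rank is at most n_obs - 1, so it cannot be inverted
S      <- cov(R)
rank_S <- qr(S)$rank

# Shrink toward a diagonal target with a fixed, hypothetical intensity
delta    <- 0.2
target   <- diag(diag(S))
S_shrunk <- (1 - delta) * S + delta * target

# The shrunk matrix is positive definite, so the linear system can be solved.
# These are the unconstrained minimum-variance weights (they may be negative).
ones <- rep(1, n_assets)
w    <- solve(S_shrunk, ones)
w    <- w / sum(w)
```

With the long-only constraint from the question, the shrunk matrix would still have to be passed to a constrained optimizer; the sketch only shows that shrinkage removes the singularity.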


RiskPortfolio Package Optimal Portfolio results

I'm trying to use the RiskPortfolios package to find the optimal portfolio weights for a couple of different optimizations, with long only and weights sum to 100% constraints.
Using the sample data provided with the package as an example (given below), I am getting portfolio weights that are all ~0% for all securities.
rets = Industry_10
Sigma = covEstimation(rets)
optimalPortfolio(Sigma = Sigma, control = list(type = 'maxdiv', constraint = 'lo'))
[1] 1.596802e-03 4.426586e-02 5.115952e-21 1.829356e-01 9.853242e-18
[6] 1.092012e-01 1.528876e-01 1.821066e-01 3.270063e-01 4.557049e-21
Does anyone have any experience with this package or info on where I'm going wrong?
The weights are printed in scientific notation and they do sum to 1 (i.e. 100%). For instance, the 4th weight is 1.829356e-01, i.e. about 18.3%.
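A quick way to check this is to copy the printed weights back into R and display them as percentages. The vector below is taken from the output in the question; the variable names are just for illustration.

```r
# Weights exactly as printed by optimalPortfolio() in the question
w <- c(1.596802e-03, 4.426586e-02, 5.115952e-21, 1.829356e-01, 9.853242e-18,
       1.092012e-01, 1.528876e-01, 1.821066e-01, 3.270063e-01, 4.557049e-21)

# Total allocation (should be 1 up to printing precision)
total <- sum(w)

# Percentages rounded to one decimal; entries like 5e-21 are numerically zero
pct <- round(100 * w, 1)

# Fixed notation is easier to read than scientific notation
print(format(w, scientific = FALSE, digits = 3))
```

Calling `options(scipen = 999)` before printing is another way to discourage scientific notation in the console.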

K-means: Initial centers are not distinct

I am using the GA Package and my aim is to find the optimal initial centroids positions for k-means clustering algorithm. My data is a sparse-matrix of words in TF-IDF score and is downloadable here. Below are some of the stages I have implemented:
0. Libraries and dataset
library(clusterSim) ## for index.DB()
library(GA) ## for ga()
corpus <- read.csv("Corpus_EnglishMalay_tfidf.csv") ## a dataset of 5000 x 1168
1. Binary encoding and generate initial population.
k_min <- 15
initial_population <- function(object) {
  ## generate a population to turn-on 15 cluster bits
  init <- t(replicate(object@popSize, sample(rep(c(1, 0), c(k_min, object@nBits - k_min))), TRUE))
  return(init)
}
2. Fitness Function Minimizes Davies-Bouldin (DB) Index. Where I evaluate DBI for each solution generated from initial_population.
DBI2 <- function(x) {
  ## x is a vector of solution of nBits
  ## exclude first column of corpus
  initial_centroid <- corpus[x==1, -1]
  cl <- kmeans(corpus[-1], initial_centroid)
  dbi <- index.DB(corpus[-1], cl=cl$cluster, centrotypes = "centroids")
  ## GA maximizes fitness, so return the negative DB index
  score <- -dbi$DB
  return(score)
}
3. Running GA. With these settings.
g2<- ga(type = "binary",
fitness = DBI2,
population = initial_population,
selection = ga_rwSelection,
crossover = gabin_spCrossover,
pcrossover = 0.8,
pmutation = 0.1,
popSize = 100,
nBits = nrow(corpus),
seed = 123)
4. The problem. Error in kmeans(corpus[-1], initial_centroid) : initial centers are not distinct.
I found a similar problem here, where the user also had to use a parameter to pass in the number of clusters dynamically. It was solved by hard-coding the number of clusters. In my case, however, I really need to pass in the number of clusters dynamically, since it comes from a randomly generated binary vector in which the 1's represent the initial centroids.
Checking with the kmeans() code, I noticed that the error is caused by duplicated centers:
if(any(duplicated(centers)))
  stop("initial centers are not distinct")
I edited the kmeans function with trace to print out the duplicated centers. The output:
[1] "206" "520" "564" "1803" "2059" "2163" "2652" "2702" "3195" "3206" "3254" "3362" "3375"
[14] "4063" "4186"
Which shows no duplication in the randomly selected initial_centroids and I have no idea why this error keeps occurring. Is there anything else that would lead to this error?
P/S: I do understand some may suggest GA + K-means is not a good idea. But I do hope to finish what I have started. It is better to view this problem as a K-means problem (well at least in solving the initial centers are not distinct error).
Genetic algorithms are not well suited for optimizing k-means by the nature of the problem: the initialization seeds interact too much, so GA will not be better than taking a random sample of all possible seeds.
So my main advice is to not use genetic algorithms at all here!
If you insist, what you would need to do is detect the bad parameters, then simply return a bad score for bad initialization so they don't "survive".
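A minimal sketch of that idea, assuming a toy numeric matrix instead of the real corpus and using the negative total within-cluster sum of squares as a stand-in for the DB index (both are assumptions made so the example runs with base R only). The function name `safe_fitness` and the `penalty` value are hypothetical.

```r
set.seed(42)
# Toy data: 20 random points plus two identical rows (rows 21 and 22)
toy <- rbind(matrix(rnorm(40), ncol = 2), c(9, 9), c(9, 9))

safe_fitness <- function(x, data, penalty = -1e6) {
  # Rows switched on by the binary chromosome become the initial centers
  centers <- data[x == 1, , drop = FALSE]
  # Reject chromosomes that kmeans() would refuse: too few or duplicated centers
  if (nrow(centers) < 2 || anyDuplicated(centers) > 0) {
    return(penalty)
  }
  cl <- kmeans(data, centers)
  # Higher is better for a GA, so negate the within-cluster sum of squares
  -cl$tot.withinss
}

# A chromosome that selects the two identical rows gets the penalty
bad <- integer(nrow(toy));  bad[c(1, 21, 22)] <- 1
# A chromosome with distinct rows is scored normally
good <- integer(nrow(toy)); good[c(1, 2, 21)] <- 1

bad_score  <- safe_fitness(bad, toy)
good_score <- safe_fitness(good, toy)
```

The same guard could wrap the DB-index fitness from the question; the penalty only needs to be worse than any score a valid chromosome can reach.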
To answer your question just do:
any(corpus[520, -1] != corpus[564, -1])
Rows 520 and 564 of corpus hold identical values; they differ only in the row.names attribute, see:
identical(colnames(corpus[520, -1]), colnames(corpus[564, -1])) # just to be sure
rownames(corpus[520, -1])
rownames(corpus[564, -1])
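The effect is easy to reproduce on a small made-up data frame: `duplicated()` compares the values in the rows and ignores row names, which is why printing the row names shows no duplication while `kmeans()` still complains. The data below is invented for illustration.

```r
# Rows "520" and "564" hold the same values but carry different row names
df <- data.frame(a = c(1, 2, 1), b = c(0, 5, 0))
rownames(df) <- c("520", "999", "564")

# Row names are all distinct ...
names_distinct <- !any(duplicated(rownames(df)))

# ... but the third row duplicates the first one value-wise
dup_flags <- duplicated(df)

# identical() also looks at attributes, so it reports a difference
same_object <- identical(df[1, ], df[3, ])

# Element-wise comparison confirms that the values match
same_values <- all(df[1, ] == df[3, ])

# Row names of every row that repeats an earlier row
dup_rows <- rownames(df)[dup_flags]
```

Running `rownames(corpus)[duplicated(corpus[-1])]` on the real data would list every row that repeats an earlier one.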
Regarding the GA and k-means, see e.g.:
Bashar Al-Shboul, Myaeng Sung-Hyon, "Initializing K-Means using Genetic Algorithms", World Academy of Science, Engineering & Technology, Jun2009, Issue 30, p. 114, (especially section II B); or

How to solve a portfolio optimization with a generalised objective function?

I have a portfolio of 5 stocks for which I want to find an optimal mix of minimizing portfolio variance and maximizing expected future dividends. The latter come from analysts' forecasts. My problem is that I know how to solve a minimum-variance problem, but I am not sure how to put the quadratic form into the right matrix form for the objective function of quadprog.
The standard minimum variance problem reads
Min! ( portfolio volatility )
where r has the 252 daily returns of the five stocks and d has the expected yearly dividend yields (firm_A pays 1 %, firm_B pays 2 %, etc.),
and I have programmed it as follows
library(quadprog) # provides solve.QP()
dat = rnorm( 252*5, mean = 0, sd = 1 ) # 252 daily returns for each of the 5 stocks
r = matrix( dat, nrow = 252, ncol = 5 )
d = matrix( c( 1, 2, 1, 2, 2 ) )
# Dmat (covariance) and dvec (penalized returns) are generated easily
risk.param = 0.5
Dmat = cov(r)
dvec = matrix(colMeans(r) * risk.param)
# The weights sum up to 1
n = 5
A = matrix( rep( 1, n ), nr = n )
b = 1
meq = 1
res = solve.QP( Dmat, dvec, A, b, meq = 1 )
Obviously, the returns in r are standard normal, hence each stock gets about 20% weight.
Q1: How can I account for the fact that firm_A pays a dividend of 1, firm_B a dividend of 2, etc?
The new objective function reads:
Max! ( 0.5 * Portfolio_div - 0.5 * Portfolio_variance )
but I don't know how to hard-code it. The portfolio variance was easy to put into Dmat but the new objective function has the Portfolio_div element defined as Portfolio_div = w * d where w has the five weights.
Thanks a lot.
EDIT: Maybe it makes sense to add a higher-level description of the problem:
I am able to run a minimum-variance optimization with the code above. Minimizing the portfolio variance means optimizing the weights on the variance-covariance matrix Dmat (of dimension 5x5). However, I want to add an additional part to the optimization, namely the dividends in d multiplied by the weights (hence of dimension 5x1). The same weights are also used for Dmat.
Q2: How can I add the vector d to the code?
EDIT2: I guess the answer is to simply use
dvec = -1/d
as I maximize expected dividends by minimizing the inverse of the negative.
Q3: Could someone please tell me if that's right?
Opening a can of worms:
TL;DR: While I respect the great work Harry MARKOWITZ ( 1990 Nobel prize ) has performed, I appreciate his wonderful CACI Simulations spin-off, the deterministic simulation framework COMET III, much more than the Portfolio theory assumption that variance per se is the ruling minimiser driver for the portfolio optimisation process.
Driving this principal point of view further, closer to your idea ( it may still meet the somewhat ill-formed motivation of big funds, which live happily from their 2-by-20 fees due to the nature and scale of "their" skewed perception of what direct losses are: they recognise them as hefty, risk-free management fees not acquired because of AUM erosion from crowd-panic churn, rather than as the real profits & losses gained from their (in)ability to deliver any above-average AUM returns ): the problem is in the proper formulation of the { penalty | utility } function.
While variance is taken in classical efficient frontier theory as a penalty factor, operated in a min! global search, it does not have much to do with real profit generation. You get penalised even for positive-side variance components, which is nonsense per se.
On the contrary, the dividend is a direct benefit, an absolute utility, entering the max! optimisation process.
So the first step in Q3 & Q1 ought to be the design of a consistent utility function, isolated from relative, revenue-unrelated factors but containing all other absolute factors ( a cost of entry, transaction costs, rebalancing costs ), as otherwise your utility model would mislead your portfolio wealth management strategy.
A2: Without this a-priori designed property, no one may claim a model is worth a single CPU-hour to even start the model's global optimisation efforts.
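On the purely mechanical side of Q1 to Q3: solve.QP minimizes 1/2 b'Db - d'b, so maximizing 0.5 * w'd - 0.5 * w'Sigma w corresponds to Dmat = cov(r) and dvec = 0.5 * d rather than dvec = -1/d. The sketch below checks that mapping without quadprog: with only the sum-to-one equality constraint, the optimum follows from the KKT linear system, which base R can solve directly. The simulated returns and the helper name `obj` are assumptions for illustration, and the dividend yields are used in the units given in the question, so no claim is made that the two terms are sensibly scaled against each other.

```r
set.seed(1)
n <- 5
r <- matrix(rnorm(252 * n), nrow = 252, ncol = n)  # simulated daily returns
d <- c(1, 2, 1, 2, 2)                              # dividend yields from the question

Sigma <- cov(r)

# Objective to maximize: 0.5 * portfolio dividend - 0.5 * portfolio variance
obj <- function(w) 0.5 * sum(w * d) - 0.5 * drop(t(w) %*% Sigma %*% w)

# KKT conditions for: min 0.5 * w' Sigma w - 0.5 * d' w  subject to sum(w) = 1
#   Sigma w - lambda * 1 = 0.5 * d
#   1' w                 = 1
kkt <- rbind(cbind(Sigma, -rep(1, n)),
             c(rep(1, n), 0))
rhs <- c(0.5 * d, 1)
sol <- solve(kkt, rhs)

w      <- sol[1:n]     # optimal weights (may be negative: no long-only constraint)
lambda <- sol[n + 1]   # Lagrange multiplier of the budget constraint

# Equivalent quadprog call, assuming the package is installed:
# solve.QP(Dmat = Sigma, dvec = 0.5 * d, Amat = matrix(1, n, 1), bvec = 1, meq = 1)
```

A long-only version would add the identity matrix as extra columns of Amat and zeros to bvec in the solve.QP call; the KKT shortcut no longer applies once inequality constraints are active.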

What is the formula to calculate the gini with sample weight

I need help understanding how to obtain the same result as this function:
gini(x, weights=rep(1,length=length(x)))
http://cran.r-project.org/web/packages/reldist/reldist.pdf --> page 2. Gini
Let's say we need to measure the income of a population N. To do that, we divide the population into K subgroups, and in each subgroup k we take n_k individuals and ask for their income. As a result, we get each individual's income, and each individual has a particular "sample weight" representing their contribution to the population N. Here is an example that I took from the previous link; the dataset is from the NLS.
# Convert the wage growth from (log. dollar) to (dollar)
y <- exp(recent$chpermwage);y
# Compute the unweighted estimate
gini_y <- gini(y)
# Compute the weighted estimate
gini_yw <- gini(y,w=recent$wgt)
> --- Here is the result----
> gini_y = 0.3418394
> gini_yw = 0.3483615
I know how to compute the Gini without weights in my own code, so I am happy to keep the command gini(y) as it is. The only thing I am concerned about is how gini(y,w) arrives at the result 0.3483615. I tried another calculation, shown below, to see whether I could come up with the same result as gini_yw. It is based on the CDF, Section 9.5, of the book "Relative Distribution Methods in the Social Sciences" by Mark S. Handcock,
# test how gini computes with the sample weights
z <- exp(recent$chpermwage) * recent$wgt
gini_z <- gini(z)
# Result gini_z = 0.3924161
As you can see, my calculation gini_z differs from the result of gini(y, weights). If anyone knows how to build the correct computation to obtain exactly gini_yw = 0.3483615, please give me your advice.
Thanks a lot, friends.
function (x, weights = rep(1, length = length(x)))
{
    ox <- order(x)
    x <- x[ox]
    weights <- weights[ox]/sum(weights)
    p <- cumsum(weights)
    nu <- cumsum(weights * x)
    n <- length(nu)
    nu <- nu/nu[n]
    sum(nu[-1] * p[-n]) - sum(nu[-n] * p[-1])
}
This is the source code for the function gini which can be seen by entering gini into the console. No parentheses or anything else.
This can be done for any function or object really.
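Reading that source: p is the cumulative population share, nu is the cumulative income share, and the last line is twice the area between the Lorenz curve and the diagonal. The weights therefore count how many people each observation stands for; they do not scale the incomes. The sketch below illustrates this on made-up numbers: `weighted_gini` is a transcription of the listing above, and the toy `x` and `w` are assumptions.

```r
# Transcription of the reldist source shown above
weighted_gini <- function(x, weights = rep(1, length = length(x))) {
  ox <- order(x)
  x <- x[ox]
  weights <- weights[ox] / sum(weights)
  p  <- cumsum(weights)        # cumulative population share
  nu <- cumsum(weights * x)    # cumulative income, normalized below
  n  <- length(nu)
  nu <- nu / nu[n]             # cumulative income share (Lorenz curve)
  sum(nu[-1] * p[-n]) - sum(nu[-n] * p[-1])
}

x <- c(1, 2, 5)   # incomes
w <- c(3, 1, 2)   # each observation represents w[i] people

g_weighted   <- weighted_gini(x, w)
# With integer weights this equals the unweighted Gini of the expanded sample
g_replicated <- weighted_gini(rep(x, times = w))
# Multiplying incomes by weights changes the incomes themselves: a different quantity
g_scaled     <- weighted_gini(x * w)
```

This is why gini(y * wgt) in the question does not reproduce gini(y, w = wgt): the first treats weight times income as a new income for one person.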
This is a bit late, but one may be interested in the concentration/diversity measures contained in the SciencesPo package.

Parameters estimation of a bivariate mixture normal-lognormal model

I have to create a model that is a mixture of a normal and a log-normal distribution. To create it, I need to estimate the 2 covariance matrices and the mixing parameter (7 parameters in total) by maximizing the log-likelihood function. This maximization has to be performed with the nlm routine.
As I use relative data, the means are known and equal to 1.
I've already tried this in 1 dimension (with 1 set of relative data) and it works well. However, when I introduce the 2nd set of relative data, I get illogical results for the correlation and a lot of warning messages (25 in total).
To estimate these parameters, I first defined the log-likelihood function with the 2 commands dmvnorm and dlnorm.rplus. Then I assigned starting values to the parameters, and finally I used the nlm routine to estimate the parameters (see script below).
`P <- read.ascii.grid("d:/Documents/JOINT_FREQUENCY/grid_E727_P-3000.asc", return.header=
V <- read.ascii.grid("d:/Documents/JOINT_FREQUENCY/grid_E727_V-3000.asc", return.header=
p <- c(P); # tranform matrix into a vector
v <- c(V);
p<- p[!is.na(p)] # removing NA values
v<- v[!is.na(v)]
p_rel <- p/mean(p) #Transforming the data to relative values
v_rel <- v/mean(v)
PV <- cbind(p_rel, v_rel) # create a matrix of vectors
L <- function(par,p_rel,v_rel) {
  # negative log-likelihood of the mixture; dmvnorm() is from mvtnorm, dlnorm.rplus() from compositions
  sigN <- matrix(c(par[1]^2, par[1]*par[2]*par[3], par[1]*par[2]*par[3], par[2]^2), nrow=2, ncol=2)
  sigLN <- matrix(c(par[4]^2, par[4]*par[5]*par[6], par[4]*par[5]*par[6], par[5]^2), nrow=2, ncol=2)
  return(-sum(log( (1 - par[7])*dmvnorm(PV, mean=c(1,1), sigma=sigN) +
                   par[7]*dlnorm.rplus(PV, meanlog=c(1,1), varlog=sigLN) )))
}
par.start<- c(0.74, 0.66 ,0.40, 1.4, 1.2, 0.4, 0.5) # log-likelihood estimators
result<-nlm(L,par.start,v_rel=v_rel,p_rel=p_rel, hessian=TRUE, iterlim=200, check.analyticals= TRUE)
Warning messages:
1: In log(eigen(sigma, symmetric = TRUE, only.values = TRUE)$values) :
NaNs produced
2: In sqrt(2 * pi * det(varlog)) : NaNs produced
3: In nlm(L, par.start, p_rel = p_rel, v_rel = v_rel, hessian = TRUE) :
NA/Inf replaced by maximum positive value
4: In log(eigen(sigma, symmetric = TRUE, only.values = TRUE)$values) :
NaNs produced
... and so on, up to 25.
par.hat <- result$estimate
cat("sigN_p =", par.hat[1],"\n","sigN_v =", par.hat[2],"\n","rhoN =", par.hat[3],"\n","sigLN_p =", par.hat[4],"\n","sigLN_v =", par.hat[5],"\n","rhoLN =", par.hat[6],"\n","mixing parameter =", par.hat[7],"\n")
sigN_p = 0.5403361
sigN_v = 0.6667375
rhoN = 0.6260181
sigLN_p = 1.705626
sigLN_v = 1.592832
rhoLN = 0.9735974
mixing parameter = 0.8113369`
Does someone know what is wrong in my model, or what I should do to find these parameters in 2 dimensions?
Thank you very much for taking time to look at my questions.
Gladys Hertzog
When I do this kind of optimization problem, I find that it's important to make sure that all the variables I'm optimizing over are constrained to plausible values. For example, standard deviation variables have to be positive, and from knowledge of the situation that I'm modelling I'll probably be able to put an upper bound on all my standard deviation variables as well. So if s is one of my standard deviation variables, and if m is the maximum value that I want it to take, instead of working with s I'll solve for the variable z which is related to s via
s = m / (1 + e^{-z})
In that formula, z is unconstrained, but s must lie between 0 and m. This is vital because optimization routines where the variables are not constrained to take plausible values will often try completely implausible values while they're trying to bound the solution. Implausible values often cause problems with e.g. precision, that then results in NaN's etc. The general formula that I use for constraining a single variable x to lie between a and b is
x = a + (b - a) / (1 + e^{-z})
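A minimal sketch of that transform in R, using the built-in logistic functions plogis() and qlogis(). The helper names `to_bounded` and `to_unbounded` and the example bounds are hypothetical.

```r
# Map an unconstrained z to x in (a, b):  x = a + (b - a) / (1 + exp(-z))
to_bounded <- function(z, a, b) a + (b - a) * plogis(z)

# Inverse map, handy for turning plausible starting values into z
to_unbounded <- function(x, a, b) qlogis((x - a) / (b - a))

# Example: a standard deviation restricted to (0, 2)
z  <- c(-5, 0, 5)
s  <- to_bounded(z, a = 0, b = 2)

# Example: a mixing proportion restricted to (0, 1), starting from 0.5
z0 <- to_unbounded(0.5, a = 0, b = 1)
```

Inside the likelihood, the optimizer would work on z and the function would call `to_bounded()` as its first step, so nlm never sees a negative standard deviation or a mixing weight outside (0, 1).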
However, for your particular problem, where you're looking for covariance matrices, simply bounding the individual variables is not enough. Covariance matrices must be positive semi-definite, so if you optimize the individual entries of the matrix directly, the optimization will probably fail (producing NaNs) as soon as a matrix that isn't positive definite is fed into the likelihood function. My guess is that this is what's causing your optimization to fail. One way round the problem is to solve for the Cholesky decomposition of the covariance matrix instead of the covariance matrix itself.
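As a rough illustration of the Cholesky idea, the sketch below fits a single bivariate normal covariance with nlm in base R. It is not the mixture model from the question: the simulated data, the known zero mean, and the names `theta_to_sigma` and `negloglik` are all assumptions. The point is only that every parameter vector maps to a valid covariance matrix, so the likelihood never receives an indefinite matrix.

```r
set.seed(123)
n <- 500
Sigma_true <- matrix(c(1.0, 0.6, 0.6, 2.0), nrow = 2)
# Simulated zero-mean data with covariance Sigma_true
X <- matrix(rnorm(n * 2), ncol = 2) %*% chol(Sigma_true)

# theta = (log L11, L21, log L22); Sigma = L L' is positive definite by construction
theta_to_sigma <- function(theta) {
  L <- matrix(c(exp(theta[1]), theta[2], 0, exp(theta[3])), nrow = 2)
  L %*% t(L)
}

# Negative log-likelihood of N(0, Sigma), constants dropped
negloglik <- function(theta, X) {
  L <- matrix(c(exp(theta[1]), theta[2], 0, exp(theta[3])), nrow = 2)
  Y <- forwardsolve(L, t(X))                 # solves L y = x for every row x of X
  nrow(X) * (theta[1] + theta[3]) + 0.5 * sum(Y^2)
}

fit       <- nlm(negloglik, c(0, 0, 0), X = X)
Sigma_hat <- theta_to_sigma(fit$estimate)

# Closed-form maximum-likelihood estimate for comparison (mean known to be 0)
Sigma_mle <- crossprod(X) / n
```

For the 7-parameter mixture, the same construction would be applied to each of the two covariance matrices, with the mixing parameter mapped through a logistic transform.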
