# Calculate NMI(c, t); c: cluster assignment, t: ground truth
NMI <- function(c, t) {
  n <- length(c)  # = length(t)
  r <- length(unique(c))
  g <- length(unique(t))
  # Contingency table; assumes labels are coded 1..r and 1..g
  N <- matrix(0, nrow = r, ncol = g)
  for (i in 1:r) {
    for (j in 1:g) {
      N[i, j] <- sum(t[c == i] == j)
    }
  }
  N_t <- colSums(N)
  N_c <- rowSums(N)
  B <- (1/n) * log(t(t((n * N) / N_c) / N_t))
  W <- B * N
  I <- sum(W, na.rm = TRUE)  # mutual information
  H_c <- sum((1/n) * (N_c * log(N_c/n)), na.rm = TRUE)
  H_t <- sum((1/n) * (N_t * log(N_t/n)), na.rm = TRUE)
  nmi <- I / sqrt(H_c * H_t)
  return(nmi)
}
Running this on some clustering benchmarks gives me a value for the normalized mutual information. But when I compare it with the NMI values obtained from the aricode library, the results generally differ in the second decimal place.
I would be grateful if someone could pinpoint any error that has crept into this code.
I am including a test case using synthetic data:
library(aricode)
c <- c(1,1,2,2,2,3,3,3,3,4,4,4)
t <- c(1,2,2,2,3,4,3,3,3,4,4,2)
print(aricode::NMI(c, t))  # 0.489574
print(NMI(c, t))           # 0.5030771
This might be very late for an answer, but for the sake of posterity:
The difference is in the way you and the aricode package normalise the index. You divide by sqrt(), whereas aricode offers the following options:
function (c1, c2, variant = c("max", "min", "sqrt", "sum", "joint"))
and defaults to "max", so if you select variant = "sqrt" you should get the same answer.
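For instance, on the synthetic case above (a quick sanity check, not part of the original thread):
aricode::NMI(c, t, variant = "sqrt")  # should match the hand-rolled NMI(c, t)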
The NMI package uses sum.
Introduction to the problem
I am trying to write R code to obtain the weights of an Equally-Weighted Risk Contribution (ERC) portfolio. As some of you may know, this portfolio construction was presented by Maillard, Roncalli and Teiletche.
Skipping technicalities, in order to find the optimal weights of an ERC portfolio one needs to solve the following Sequential Quadratic Programming problem:
$$x^{*} = \arg\min_{x} \sum_{i=1}^{N} \sum_{j=1}^{N} \left( x_i (\Sigma x)_i - x_j (\Sigma x)_j \right)^2$$
with:
$$\mathbf{1}^{\top} x = 1, \qquad 0 \le x \le 1$$
Suppose we are analysing N assets. In the above formulas, x is an (N x 1) vector of portfolio weights and Σ is the (N x N) variance-covariance matrix of asset returns.
What I have done so far
Using the function slsqp from the package nloptr, which solves SQP problems, I would like to solve the above minimisation problem. Here is my code. First, the objective function to be minimised:
ObjFuncERC <- function (x, Sigma) {
  sum <- 0
  R <- Sigma %*% x
  for (i in 1:N) {
    for (j in 1:N) {
      sum <- sum + (x[i]*R[i] - x[j]*R[j])^2
    }
  }
}
Second, the starting point (we start from an equally-weighted portfolio):
x0 <- matrix(1/N, nrow = N, ncol = 1)
Then, the equality constraint (weights must sum to one, that is: sum of the weights minus one equal zero):
heqERC <- function (x) {
  h <- numeric(1)
  h[1] <- (t(matrix(1, nrow = N, ncol = 1)) %*% x) - 1
  return(h)
}
Finally, the lower and upper bound constraints (weights cannot exceed one and cannot be lower than zero):
lowerERC <- matrix(0, nrow = N, ncol = 1)
upperERC <- matrix(1, nrow = N, ncol = 1)
So the call that should output the optimal weights is:
slsqp(x0 = x0, fn = ObjFuncERC, Sigma = Sigma, lower = lowerERC, upper = upperERC, heq = heqERC)
Unfortunately, I do not know how to share my variance-covariance matrix with you (it is called Sigma and is a (29 x 29) matrix, so N = 29) so that you can reproduce my result, but you can simulate one.
The output error
Running the above code yields the following error:
Error in nl.grad(x, fn) :
Function 'f' must be a univariate function of 2 variables.
I have no idea what to do. Probably I have misunderstood how things must be written down for slsqp to understand them. Can someone help me understand how to fix the problem and get the result I want?
UPDATE 1: as pointed out by @jogo in the comments, I have updated the code, but it still produces an error. The code and the error above are now updated.
UPDATE 2: as requested by @jaySf, here is the full code that allows you to reproduce my error.
## ERC Portfolio Test
# Preliminary Operations
rm(list=ls())
require(quantmod)
require(nloptr)
# Load Stock Data in R through Yahoo! Finance
stockData <- new.env()
start <- as.Date('2014-12-31')
end <- as.Date('2017-12-31')
tickers <-c('AAPL','AXP','BA','CAT','CSCO','CVX','DIS','GE','GS','HD','IBM','INTC','JNJ','JPM','KO','MCD','MMM','MRK','MSFT','NKE','PFE','PG','TRV','UNH','UTX','V','VZ','WMT','XOM')
getSymbols.yahoo(tickers, env = stockData, from = start, to = end, periodicity = 'monthly')
# Create a matrix containing the price of all assets
prices <- do.call(cbind,eapply(stockData, Op))
prices <- prices[-1, order(colnames(prices))]
colnames(prices) <- tickers
# Compute Returns
returns <- diff(prices)/lag(prices)[-1,]
# Compute variance-covariance matrix
Sigma <- var(returns)
N <- 29
# Set up the minimization problem
ObjFuncERC <- function (x, Sigma) {
  sum <- 0
  R <- Sigma %*% x
  for (i in 1:N) {
    for (j in 1:N) {
      sum <- sum + (x[i]*R[i] - x[j]*R[j])^2
    }
  }
}
x0 <- matrix(1/N, nrow = N, ncol = 1)
heqERC <- function (x) {
  h <- numeric(1)
  h[1] <- t(matrix(1, nrow = N, ncol = 1)) %*% x - 1
}
lowerERC <- matrix(0, nrow = N, ncol = 1)
upperERC <- matrix(1, nrow = N, ncol = 1)
slsqp(x0 = x0, fn = ObjFuncERC, Sigma = Sigma, lower = lowerERC, upper = upperERC, heq = heqERC)
I spotted several mistakes in your code. For instance, ObjFuncERC does not return any value. You should use the following instead:
# Set up the minimization problem
ObjFuncERC <- function (x, Sigma) {
  sum <- 0
  R <- Sigma %*% x
  for (i in 1:N) {
    for (j in 1:N) {
      sum <- sum + (x[i]*R[i] - x[j]*R[j])^2
    }
  }
  sum  # the last evaluated expression is the return value
}
heqERC doesn't return anything either; I also simplified your function a bit:
heqERC <- function (x) {
  sum(x) - 1  # equality constraint: weights must sum to one
}
I made those changes and tried slsqp without lower and upper, and it worked. Another thing to consider is that you set lowerERC and upperERC as matrices. Use plain vectors instead:
lowerERC <- rep(0,N)
upperERC <- rep(1,N)
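Putting the pieces together, a minimal end-to-end sketch with a simulated covariance matrix (the seed and the random Sigma below are illustrative stand-ins, not the questioner's 29-asset data):
library(nloptr)

N <- 29
set.seed(1)
M <- matrix(rnorm(N * N), N, N)
Sigma <- crossprod(M) / N  # random positive semi-definite covariance proxy

ObjFuncERC <- function (x, Sigma) {
  sum <- 0
  R <- Sigma %*% x
  for (i in 1:N) {
    for (j in 1:N) {
      sum <- sum + (x[i]*R[i] - x[j]*R[j])^2
    }
  }
  sum
}
heqERC <- function (x) sum(x) - 1

res <- slsqp(x0 = rep(1/N, N), fn = ObjFuncERC, Sigma = Sigma,
             lower = rep(0, N), upper = rep(1, N), heq = heqERC)
round(res$par, 4)  # candidate ERC weights; should sum to one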
Hope this helps.
I am trying to learn how the CVXR package works, and I was porting a Python example by Steve Diamond here:
https://groups.google.com/forum/#!topic/cvxpy/5hBSB9KVbuI
and
http://nbviewer.jupyter.org/github/cvxgrp/cvx_short_course/blob/master/intro/control.ipynb
The R equivalent of the code is below:
set.seed(1)
n = 8
m = 2
T1 = 50
alpha = 0.2
beta = 5
A = diag(n) + alpha*replicate(n, rnorm(n))
B = replicate(m, rnorm(n))
x_0 = beta*replicate(1, rnorm(n))
# Form and solve control problem.
x = Variable(n, T1+1)
u = Variable(m, T1)
states = c()
for (t in 1:T1) {
  cost = sum_squares(x[, t+1]) + sum_squares(u[, t])
  constr = list(x[, t+1] == A %*% x[, t] + B %*% u[, t],
                norm_inf(u[, t]) <= 1)
  states = c(states, Problem(Minimize(cost), constr))
}
# sums problem objectives and concatenates constraints.
prob <- Reduce("+", states)
constraints(prob) <- c(constraints(prob), x[ ,T1] == 0)
constraints(prob) <- c(constraints(prob), x[ ,0] == x_0)
sol <- solve(prob)
I have a challenge with the second-to-last line (it throws an error):
constraints(prob) <- c(constraints(prob), x[ ,0] == x_0)
My guess is that x[, 0] points to a zero-th index position of the variable x, which does not exist in R. In the Python original, a zero-th index exists because the loop for t in range(T) runs over 0 to 49, whereas the R loop for (t in 1:T1) runs over 1 to 50.
Any ideas to help will be much appreciated. Thank you.
You need to bump up the index number by 1, so x[,1] == x_0 and x[,T1+1] == 0 in the second and third from the last line, respectively. Otherwise, you never set the T1+1 entry.
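Concretely, the two constraint lines would become (same code as above, just reindexed):
constraints(prob) <- c(constraints(prob), x[, T1 + 1] == 0)
constraints(prob) <- c(constraints(prob), x[, 1] == x_0)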
I wrote the following code trying to find all the prime numbers in a randomly generated data set. Sadly, it seems something went wrong. Could anybody help me?
set.seed(20171106)
n <- 10000
num <- sample(1:100000,n,replace=TRUE)
findPrime <- function(x){
  apple <- c()
  n <- length(x)
  for (i in n) {
    if (any(x[i] %% (1:(x[i]-1)) != 0)) apple <- c(apple, x[i])
  }
  return(apple)
}
To get results, type: findPrime(num)
This is the warning message:
Warning message:
In if (x[i]%%(1:(x[i] - 1)) == 0) apple <- c(apple, x[i]) :
the condition has length > 1 and only the first element will be used
so how can I fix the problem?
An if condition must have length one, and in your code it receives a whole vector. I have rewritten your function using an ifelse expression wrapped inside an sapply loop. I hope this works for you.
findPrime <- function(x = 0){
  primes <- c()
  # Prime finder
  primes <- sapply(X = x, FUN = function(x) {
    ifelse(any(x %% (1:(x - 1)) != 0), T, F)
  })
  # Select primes
  primes <- num[primes]
  return(primes)
}
findPrime(num)
I just spotted another silly mistake: inside the function, change num to x in the "select primes" step, and invert the F, T outcomes. It should look like this:
findPrime <- function(x = 0){
  primes <- c()
  # Prime finder
  primes <- sapply(X = x, FUN = function(x) {
    if (x == 2) return(TRUE)  # guard: for x = 2 the range 2:(x - 1) counts backwards
    ifelse(any(x %% (2:(x - 1)) == 0), F, T)
  })
  # Select primes
  primes <- x[primes]
  return(primes)
}
I have just tried it and it works fine.
Use the package gmp, which has a function isprime that returns 0 for non-prime numbers and 2 for prime numbers, and then subset the data based on that. Say you have a vector a = c(1:10):
a <- c(1:10)
b <- gmp::isprime(a)   # 0 = not prime, 2 = prime
c <- cbind(a, b)
c <- as.data.frame(c)
c <- c[c$b == 2, ]     # keep rows flagged as prime
a1 <- c$a
a1
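As a side note (not from the original answer), the same subset can be written in one line:
a[gmp::isprime(a) > 0]  # isprime() returns 0, 1 (probably prime) or 2 (prime)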
Also, in your code the loop for (i in n) only iterates over the single value n; it should be for (i in 1:n).
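Putting that loop fix together with a correct trial-division test, a minimal repair of the original function might look like this (a sketch combining the fixes above, not code from the original answers):
findPrime <- function(x) {
  apple <- c()
  n <- length(x)
  for (i in 1:n) {  # was: for (i in n)
    # prime: equal to 2, or greater than 2 with no divisor in 2:(x - 1)
    if (x[i] == 2 || (x[i] > 2 && !any(x[i] %% (2:(x[i] - 1)) == 0)))
      apple <- c(apple, x[i])
  }
  return(apple)
}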
I have a presence/absence dataset and need to calculate an Ochiai distance matrix with pairwise deletion of missing values. What is the simplest way to do this?
I can use designdist from the vegan package to generate a matrix, but I am not sure what it is doing with the missing values. If they are coded as "?" it produces a result, but if they are coded as "NA" then it produces a matrix of all NAs. In vegdist you can specify whether you want pairwise deletion, but you cannot implement the Ochiai coefficient. None of the other distance matrix functions in other packages have this combination as far as I can tell. Any ideas?
Cheers,
James
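For context, the Ochiai similarity between two presence/absence vectors is a / sqrt((a + b) * (a + c)), where a counts shared presences, and pairwise deletion means that positions where either vector is NA are dropped for that pair. A direct base-R sketch of the requested distance, independent of vegan (a hypothetical helper added for illustration, not part of the original question):
ochiai_dist <- function(x) {
  x <- as.matrix(x)
  n <- nrow(x)
  d <- matrix(0, n, n, dimnames = list(rownames(x), rownames(x)))
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      ok <- !is.na(x[i, ]) & !is.na(x[j, ])      # pairwise deletion
      a <- sum(x[i, ok] == 1 & x[j, ok] == 1)    # shared presences
      A <- sum(x[i, ok] == 1)                    # presences in row i
      B <- sum(x[j, ok] == 1)                    # presences in row j
      d[i, j] <- d[j, i] <- 1 - a / sqrt(A * B)  # NaN if a row has no presences
    }
  }
  as.dist(d)
}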
This could be implemented in vegan::designdist(), but with the current design only for terms = "minimum". Binary data should be handled with a 0/1 transformation of the input, either in straight R or using decostand(..., "pa"). The following changes would do this in vegan::designdist():
--- a/R/designdist.R
+++ b/R/designdist.R
@@ -1,7 +1,7 @@
 `designdist` <-
     function (x, method = "(A+B-2*J)/(A+B)",
               terms = c("binary", "quadratic", "minimum"),
-              abcd = FALSE, alphagamma = FALSE, name)
+              abcd = FALSE, alphagamma = FALSE, name, na.rm = FALSE)
 {
     terms <- match.arg(terms)
     if ((abcd || alphagamma) && terms != "binary")
@@ -9,13 +9,16 @@
     x <- as.matrix(x)
     N <- nrow(x)
     P <- ncol(x)
+    ## check NA
+    if (na.rm && terms != "minimum" && any(is.na(x)))
+        stop("'na.rm = TRUE' can only be used with 'terms = \"minimum\"'")
     if (terms == "binary")
         x <- ifelse(x > 0, 1, 0)
     if (terms == "binary" || terms == "quadratic")
         x <- tcrossprod(x)
     if (terms == "minimum") {
-        r <- rowSums(x)
-        x <- dist(x, "manhattan")
+        r <- rowSums(x, na.rm = na.rm)
+        x <- vegdist(x, "manhattan", na.rm = na.rm)
         x <- (outer(r, r, "+") - as.matrix(x))/2
     }
     d <- diag(x)
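With that patch in place, a binary Ochiai dissimilarity with pairwise deletion could then be requested like this (a hypothetical usage sketch; the method string is the standard Ochiai formula):
library(vegan)
x_pa <- decostand(x, "pa")  # 0/1 presence/absence transform
d_ochiai <- designdist(x_pa, method = "1 - J/sqrt(A*B)",
                       terms = "minimum", na.rm = TRUE)
With terms = "minimum" on 0/1 data, J is the number of shared presences and A, B are the row totals, so J/sqrt(A*B) is exactly the Ochiai coefficient.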
Any help with this would be greatly appreciated. I am optimising the parameters of a lognormal distribution so that the proportion of estimates matches a set of target values (distances). The proportions are calculated using the following functions:
adj_sumifs <- function(sum_array, condition_array, f, m = 1){
  n <- length(condition_array)
  sm <- 0
  if (n == length(sum_array)){  # the two arrays must have equal length
    fun <- function(x, i){ if (f(condition_array[i])) { sum_array[i] + x } else { x } }
    sm <- Reduce(fun, 1:n, 0)
  }
  ifelse(m <= 0, sm, sm/m)
}
and
estimate.inrange <- function(vals, dist, lower, upper, total){
  n <- length(lower)
  if (n == length(upper)){
    sapply(1:n, function(i){
      ifelse(i < n,
             adj_sumifs(vals, dist, (function(x) x >= lower[i] && x < upper[i]), total),
             adj_sumifs(vals, dist, (function(x) x >= lower[i]), total))
    })
  } else {
    # signal a failure in the process
    as.numeric()
  }
}
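As an aside, assuming f is a predicate on single values, adj_sumifs above appears equivalent to this much shorter vectorized sketch (illustrative only, not the questioner's code):
adj_sumifs_vec <- function(sum_array, condition_array, f, m = 1) {
  sm <- sum(sum_array[vapply(condition_array, f, logical(1))])
  if (m <= 0) sm else sm / m
}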
And the function I would like to optimise is:
calculate_Det_ptns <- function(alpha, beta, pxa, low, up, distances, eF){
  temp <- numeric()
  if (length(pxa) == length(distances) && length(low) == length(up)){
    ln_values <- as.numeric(Map(function(pa, d) eF * pa * dlnorm(d, meanlog = alpha, sdlog = beta),
                                pxa, distances))
    temp <- estimate.inrange(ln_values, distances, low, up, total = sum(ln_values))
  }
  temp
}
Optimisation is done using the Levenberg-Marquardt algorithm:
lnVals <- nlsLM(target ~ calculate_Det_ptns(alpha = a, beta = b, pxa = odab, low = low,
                                            up = up, distances = dist, eF = expF),
                start = list(a = mu, b = sd),
                trace = TRUE)
where up, low and target are extracted from the same data file, e.g.
low, up, target
1,2,0.1
2,3,0.4
3,4,0.6
4,5,0.6
5,6,0.9
while odab and dist are vectors of arbitrary length (usually much longer than target, etc.). The process works well when the target file has about 150 rows and dist and odab have about 500000 values. However, for reasons I cannot fathom, it fails when the target file has about 16 rows. The error message is:
Error in model.frame.default(formula = ~target + odab + low + up + dist) :
variable lengths differ (found for 'odab')
which suggests that the function is not being evaluated in the formula. Can anyone suggest a solution or explanation? It is important that the proportions are re-estimated for every new mu and sd.
You could try surrounding the function with I(), which will evaluate it as-is before the formula is interpreted; however, I could not replicate your problem with the code provided because I am missing some of the referenced objects (a, b, odab, dist, expF, mu, sd), so I cannot confirm whether or not this works.
lnVals <- nlsLM(target ~ I(calculate_Det_ptns(alpha = a, beta = b, pxa = odab, low = low,
                                              up = up, distances = dist, eF = expF)),
                start = list(a = mu, b = sd), trace = TRUE)