How does distance weighting work in KNN? - r

I'm writing a KNN classifier in R. I want to add a weighting scheme, e.g. inverse distances 1/d. As it is, for the Iris dataset I get an almost perfect 66% accuracy (no matter which metric I use) since class no. 3 ("virginica") almost never shows up, and I want to make it better with weighting. My question is: what exactly do I weight, and how? I've read that I should weight the classes of the K nearest neighbours by those distances.
I've tried creating vectors of the classes of, and distances to, the K nearest neighbours, and then taking a weighted mean of them:
inverted <- function(vals, distances)
{
  inv_distances <- 1 / distances
  # meant to eliminate division-by-zero errors, but a zero distance has already
  # become Inf at this point, which is what produces the NaNs described below
  inv_distances <- ifelse(inv_distances < 0.01, 0.01, inv_distances)
  weighted.mean(vals, inv_distances)
}
My results are weird: for correct vectors vals (classes) and distances I sometimes get NaN (Not a Number) or NA values. Also, my weights don't sum to 1, and... they probably should? I'm not sure. I just need someone to clear up this weighting scheme for me.
EDIT:
I've debugged the above code: it handled the problem distances too late (only after inverting them), so a distance of 0 was never eliminated and caused the NaNs. I've also changed it to harmonic-series weights that don't use the distances at all (so the first neighbour has weight 1, the second 1/2, the third 1/3, etc.). I still don't know exactly how the weighting is supposed to work and what other weighting schemes there are; one common scheme, distance-weighted voting, is sketched after the code below.
inverted <- function(vals)
{
  weights <- 1 / seq(length(vals))
  res <- weighted.mean(vals, weights)
  res
}
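For reference, here is a minimal sketch of the distance-weighted voting scheme usually meant by "weight the classes of the K nearest neighbours by their distances" (my own illustration, not the asker's code; the function name, the eps cap for zero distances, and the example values are mine). Each neighbour votes for its class with weight 1/d, and the predicted class is the one with the largest total weight; the weights don't need to sum to 1, since only their relative sizes matter.
# Illustrative sketch, not the asker's classifier.
# classes: classes of the K nearest neighbours; distances: their distances to the query point.
weighted_vote <- function(classes, distances, eps = 1e-8)
{
  w <- 1 / pmax(distances, eps)      # 1/d weights; eps keeps a zero distance from giving Inf
  totals <- tapply(w, classes, sum)  # total vote weight per class
  names(which.max(totals))           # class with the largest total weight wins
}
# Example: three nearest neighbours, two classes
weighted_vote(c("setosa", "virginica", "virginica"), c(0.1, 0.4, 0.5))
# "setosa" wins: 1/0.1 = 10 beats 1/0.4 + 1/0.5 = 4.5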

Related

eigen() and the correct eigenvectors

My problem is the following:
I'm trying to use R to compute this problem numerically.
I've set up the problem correctly in my console, and then I tried to compute the eigenvectors.
But I expect the eigenvector associated with lambda = 1 to be (1, 2, 1) instead of what I got here. So the scaling is correct (0.4082483 is indeed half of 0.8164966), but I would like to obtain a consistent result.
My original problem is to find the stationary distribution of a Markov chain using R instead of doing it on paper. From a probabilistic point of view, the stationary distribution is a vector whose components sum to 1. For that reason I was trying to change the scaling in order to obtain what I've called "a consistent result".
How can I do that?
The eigenvectors returned by R are normalized to unit Euclidean norm. If V is an eigenvector, then s * V is an eigenvector as well for any non-zero scalar s. If you want the stationary distribution as in your link, divide by the sum:
V / sum(V)
and you will get (1/4, 1/2, 1/4).
So:
ev <- eigen(t(C))$vectors
t(t(ev) / colSums(ev))  # divide each column (eigenvector) by its sum
to get all the solutions in one shot. (A plain ev / colSums(ev) would recycle the sums down the rows rather than across the columns, hence the double transpose.)
C <- matrix(c(0.5, 0.25, 0, 0.5, 0.5, 0.5, 0, 0.25, 0.5),
            nrow = 3)
ee <- eigen(t(C))$vectors
As suggested by @Stéphane Laurent in the comments, the scaling of eigenvectors is arbitrary; only the relative values of their components are determined. R's default is to return each eigenvector with unit Euclidean norm, i.e. the sum of squares of its components is 1; colSums(ee^2) is a vector of 1s.
Following the link, we can see that you want each eigenvector to sum to 1.
ee2 <- sweep(ee,MARGIN=2,STATS=colSums(ee),FUN=`/`)
(i.e., divide each eigenvector by its sum).
(This is a good general solution, but in this case the sums of the second and third eigenvectors are both approximately zero [theoretically, they are exactly zero], so this only really makes sense for the first eigenvector.)
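As a quick check (my own addition, continuing from the code above): the rescaled first eigenvector is indeed a stationary distribution, since multiplying it by the transition matrix gives it back and its components sum to 1.
pi_hat <- ee[, 1] / sum(ee[, 1])  # eigenvector for lambda = 1, rescaled to sum to 1
pi_hat                            # 0.25 0.50 0.25
pi_hat %*% C                      # gives pi_hat back, so it is stationary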

Calculating error in PCA

I have a question about a result which I did not expect when doing PCA.
I have successfully calculated the principal components using reference data, and then, as a check that what's going on is what I think is going on, I've projected the reference data onto the entire basis of its eigenvectors (kept all components) and then transformed back (this is in Python, so it's pca.fit(ref_data), followed by ref_data_transform = pca.transform(ref_data), followed by pca.inverse_transform(ref_data_transform)), and I get the exact same data back. This is not a surprise.
What is also not a surprise is that, as I choose fewer and fewer principal components, the point-to-point difference between the original data and the data projected onto the smaller basis and then projected back increases. That is, if you plot the original data and the "filtered" data, they look different, with the difference growing as you reduce the size of the subspace onto which you project. I can capture the difference at each data point in a vector called, say, difference_vec.
What IS a surprise (to me at least) is that when I sum over any column of difference_vec it always equals zero. That is, while the actual differences between any original data point and the corresponding one filtered by some number of principal components grow larger as I project onto a smaller and smaller subspace, the TOTAL error is always zero.
I would very much appreciate any insight into whether I'm making a mistake here and, if not, why this "projection-induced error" metric doesn't work.
Thanks.
This happens because ref_data and what I’ll call inv_data = pca.inverse_transform(pca.transform(ref_data)) both have the same mean (taken along the second dimension, i.e., averaging over samples).
To see this, take a look at the code for transform:
transform = lambda X: dot(X - mu, V.T)
whereas inverse_transform can be defined as:
inverse_transform = lambda X: dot(X, V) + mu
where mu is the mean of ref_data and V are the first N eigenvectors of covariance(ref_data).
So if you follow the chain of data and its mean:
ref_data with mean mu;
transform(ref_data) has mean 0 (see the equivalent definition above: X - mu has zero mean, and projecting the result linearly onto some coordinate basis only rotates/shears/flips those zero-mean points, which doesn't alter their mean);
inv_data = inverse_transform(transform(ref_data)) adds mu back, so it has mean mu;
hence ref_data and inv_data both have mean mu.
Finally, sum(ref_data - inv_data) can be seen as sum(mean(ref_data - inv_data) * num_samples), which by linearity simplifies to sum(mu - mu), which is 0.
That’s a lot of words, sorry, but the idea, now that I see it, is really simple. As I mentioned in my comment, in cases like this you want to use a matrix norm, like the Frobenius norm, to measure the distance between two matrices, not just sum(A - B) 😅!
Sample code:
import numpy as np
from sklearn.decomposition import PCA
ref_data = np.random.randn(20, 3)
pca = PCA(n_components=1)
pca.fit(ref_data)
trans_data = pca.transform(ref_data)
inv_data = pca.inverse_transform(trans_data)
np.mean(inv_data, 0) # array([ 0.03664149, 0.51348007, 0.0360179 ])
np.mean(ref_data, 0) # array([ 0.03664149, 0.51348007, 0.0360179 ])
np.mean(trans_data, 0) # array([ -2.49800181e-17]) meanwhile ...
np.sum(inv_data - ref_data) # -1.3877787807814457e-15 !

I want to maximize returns on a portfolio ensuring risk is below a certain level. Which function can I use for optimization?

Objective function to be maximized: pos %*% mu, where pos is the row vector of portfolio weights and mu is the column vector of mean returns of the d stocks.
Constraints: 1) ones %*% t(pos) = 1, where ones is a 1*d row vector of 1's (d is the number of stocks), i.e. the weights sum to 1.
2) pos %*% cov %*% t(pos) = rb^2, where cov is the d*d covariance matrix and rb is the risk budget, the free parameter whose value will be varied to trace out the efficient frontier.
I want to write code for this optimization problem in R, but I can't think of any function or library that would help.
PS: solve.QP in the quadprog library has been used to minimize portfolio variance subject to a target return. Can this function also be used to maximize return subject to a risk budget? How should I specify the Dmat matrix and dvec vector for this problem?
EDIT :
library(quadprog)
mu <- matrix(c(0.01, 0.02, 0.03), 3, 1)
cov  # predefined covariance matrix of size 3*3
pos <- matrix(c(1/3, 1/3, 1/3), 1, 3)  # random weights vector
edr <- pos %*% mu  # expected daily return on portfolio
m1 <- matrix(1, 1, 3)  # constraint no. 1 (sum of weights = 1)
m2 <- pos %*% cov  # constraint no. 2
Amat <- rbind(m1, m2)
bvec <- matrix(c(1, 0.1), 2, 1)
solve.QP(Dmat = , dvec = , Amat = Amat, bvec = bvec, meq = 2)
How should I specify Dmat and dvec? I want to optimize over pos.
Also, I think I have not specified constraint no. 2 correctly. It should make the variance of the portfolio equal to the risk budget.
(Disclaimer: There may be a better way to do this in R. I am by no means an expert in anything related to R, and I'm making a few assumptions about how R is doing things, notably that you're using an interior-point method. Also, there is likely an R package for what you're trying to do, but I don't know what it is or how to use it.)
Minimising risk subject to a target return is a linearly-constrained problem with a quadratic objective, looking like this:
min         x^T Q x
subject to  sum_i x_i = 1
            sum_i ret_i x_i >= target
(and x >= 0 if you want to be long-only).
Maximising return subject to a risk budget is quadratically-constrained, however; it looks like this:
max         ret^T x
subject to  sum_i x_i = 1
            x^T Q x <= riskbudget
(and maybe x >= 0).
Convex quadratic terms in the objective impose less of a computational cost in an interior-point method compared to introducing a convex quadratic constraint. With a quadratic objective term, the Q matrix just shows up in the augmented system. With a convex quadratic constraint, you need to optimise over a more complicated cone containing a second-order cone factor and you need to be careful about how you solve the linear systems that arise.
I would suggest you use the risk-minimisation formulation repeatedly, doing a binary search on the target parameter until you've found a portfolio approximately maximising return subject to your risk budget. I am suggesting this approach because it is likely sufficient for your needs.
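A rough sketch of that binary search with quadprog's solve.QP (my own illustration, not tested against your data; it assumes long-only weights, a positive-definite covariance matrix cov, and a risk budget rb no smaller than the risk of the global minimum-variance portfolio; the function names are mine):
library(quadprog)
# Minimum-variance portfolio subject to sum(w) = 1, mu' w >= target, w >= 0.
min_risk_portfolio <- function(mu, cov, target) {
  d    <- length(mu)
  Dmat <- 2 * cov                        # objective: w' cov w
  dvec <- rep(0, d)
  Amat <- cbind(rep(1, d), mu, diag(d))  # t(Amat) %*% w >= bvec; first meq rows are equalities
  bvec <- c(1, target, rep(0, d))
  solve.QP(Dmat, dvec, Amat, bvec, meq = 1)$solution
}
# Bisect on the target return until the portfolio risk matches the budget rb.
max_return_given_risk <- function(mu, cov, rb, tol = 1e-8) {
  lo <- min(mu); hi <- max(mu)           # range of achievable long-only returns
  while (hi - lo > tol) {
    target <- (lo + hi) / 2
    w      <- min_risk_portfolio(mu, cov, target)
    risk   <- sqrt(drop(t(w) %*% cov %*% w))
    if (risk <= rb) lo <- target else hi <- target
  }
  min_risk_portfolio(mu, cov, lo)
}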
If you really want to solve your problem directly, I would suggest using an interface to Todd, Toh, and Tütüncü's SDPT3. This really is overkill; SDPT3 lets you formulate and solve symmetric cone programs of your choosing. I would also note that portfolio optimisation problems are rather special cases of symmetric cone programs, and other approaches exist that are reportedly very successful; unfortunately, I haven't studied up on them.

Random sampling based on vector of probability weights

I have the vector d<-1:100
I want to sample k = 3 times from this vector without replacement. I would like elements that are at a distance of length(d)/k from the first sampled element to have a higher probability of being sampled. I am not yet sure how much higher. I know that sample has a prob= argument, however I can't seem to find a way for the prob= vector to be recalculated from the location of the initial sample.
Any ideas?
Example:
d <- 1:100. Let's say the first trial samples d[30] = 30. Then the elements of d that are near 0, 60 and 90 should have a higher probability of being sampled. So after the initial sample, the distribution of the sampling probabilities over the rest of the elements of d is as in the image:
I think:
samp <- sample(1:100,1)
prob <- rep(1,100)
prob[samp]=0
MORE EDIT: I'm an idiot today. Now this will make the probability shape you asked for.
peke <- c(2, 5, 7, 10, 7, 5, 2)  # your 'triangle' probability bump
step <- round(length(d) / 3)     # spacing of the probability peaks (length(d)/k)
for (jj in c(-1, 1, 2)) {
  idx  <- samp + jj * step + (-3:3)      # a bump centred on each peak
  keep <- idx >= 1 & idx <= length(d)    # drop the parts that fall off the ends
  prob[idx[keep]] <- peke[keep]
}
newsamp <- sample(1:100, 1, prob = prob)
You may want to add a slight offset if that doesn't place the probability peaks where you wanted them.
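To complete the workflow from the question (k = 3 draws in total, without replacement), one possible wrap-up, assuming samp and prob have been built as above:
# prob[samp] is already 0 (set earlier), so the first element can't be drawn again
rest <- sample(1:100, 2, prob = prob)  # the remaining k - 1 = 2 draws, without replacement
c(samp, rest)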

multivariate skew normal in R

I'm trying to generate random numbers with a multivariate skew-normal distribution using the rmsn command from the sn package in R. Ideally, I would like to get three columns of numbers with specified variances and covariances, while having one column strongly skewed. But I'm struggling to achieve both goals simultaneously.
The post at skew normal distribution was related and useful (and the source of some of the code below), but hasn't completely clarified the issue for me.
I've been trying:
a <- c(5, 0, 0) # set shape parameter
s <- diag(3) # create variance-covariance matrix
w <- sqrt(1/(1-((2*(a^2)/(1 + a^2))/pi))) # determine scale parameter to get sd of 1
xi <- w*a/sqrt(1 + a^2)*sqrt(2/pi) # determine location parameter to get mean of 0
apply(rmsn(n=1000, xi=c(xi), Omega=s, alpha=a), 2, sd)
colMeans(rmsn(n=1000, xi=c(xi), Omega=s, alpha=a))
The column means and SDs are correct for the second and third columns (which have no skew) but not for the first (which does). Can anyone clarify where my code above, or my thinking, has gone wrong? I may be misunderstanding how to use rmsn, or its output. Any assistance would be appreciated.
The location is not the mean (except when there is no skew). From the documentation:
Notice that the location vector ‘xi’ does not represent the mean vector of the distribution (which in fact may not even exist if ‘df <= 1’), and similarly ‘Omega’ is not the covariance matrix of the distribution.
And you may want to replace Omega=s with a matrix built from w: Omega is supposed to be a variance (scale) matrix, so there should be no square root, i.e. its diagonal should hold w^2 rather than w.
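Putting that together, a minimal sketch of one corrected set-up (my own illustration, using the standard skew-normal moment formulas; delta, omega and Omega_bar are names I've introduced, with Omega_bar = diag(3) matching the question's target correlation structure):
library(sn)
a         <- c(5, 0, 0)                                  # shape (skew) parameter
Omega_bar <- diag(3)                                     # target correlation structure
delta     <- c(Omega_bar %*% a) / sqrt(1 + c(t(a) %*% Omega_bar %*% a))
omega     <- sqrt(1 / (1 - 2 * delta^2 / pi))            # per-column scales giving sd = 1
Omega     <- diag(omega) %*% Omega_bar %*% diag(omega)   # scale matrix (w^2, not w, on the diagonal)
xi        <- -omega * delta * sqrt(2 / pi)               # locations giving mean 0
x <- rmsn(n = 1e5, xi = xi, Omega = Omega, alpha = a)
round(colMeans(x), 2)      # approximately c(0, 0, 0)
round(apply(x, 2, sd), 2)  # approximately c(1, 1, 1)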
