Recommendation Engine: Cosine Similarity vs. Measuring % Difference Between Vector Components

Let's say I have a database of users who rate different products on a scale of 1-5. Our recommendation engine recommends products to users based on the preferences of other users who are highly similar. My first approach to finding similar users was to use Cosine Similarity and just treat user ratings as vector components. The main problem with this approach is that it only measures vector angles and doesn't take rating scale or magnitude into consideration.
My question is this:
Can somebody explain to me why Cosine Similarity is in any way better suited for judging user similarity than simply measuring the percent difference between the vector components of two vectors (users)?
For Example, why not just do this:
n = 5 stars
a = (1,4,4)
b = (2,3,4)
similarity(a,b) = 1 - ( (|1-2|/5) + (|4-3|/5) + (|4-4|/5) ) / 3 = .86667
Instead of Cosine Similarity :
a = (1,4,4)
b = (2,3,4)
CosSimilarity(a,b) =
( (1*2)+(4*3)+(4*4) ) / ( sqrt( (1^2)+(4^2)+(4^2) ) * sqrt( (2^2)+(3^2)+(4^2) ) ) = 0.9697
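For concreteness, here is a small R sketch (using the toy vectors above; the variable names are just illustrative) that reproduces both numbers:
a <- c(1, 4, 4)
b <- c(2, 3, 4)
n <- 5  # maximum possible rating

# Percent-difference similarity: 1 minus the mean absolute difference,
# scaled by the maximum rating
pct_sim <- 1 - mean(abs(a - b) / n)                        # 0.86667

# Cosine similarity: dot product divided by the product of the norms
cos_sim <- sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))  # ~0.9697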

I suppose one answer is that not all recommender problems operate on ratings on a 1-5 scale, and some do not operate on the original feature space at all but on a low-rank feature space; the answer changes there.
I don't think cosine similarity is a great metric for ratings. Ratings aren't something you want to normalize away. It makes a bit more sense if you normalize each user's ratings to have mean 0.
I'm not sure it's optimal to use this sort of modified L1 distance either. Consider normal Euclidean / L2 distance too. In the end, empirical testing will tell you what works best for your data.
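As a rough sketch of those two alternatives (mean-centered cosine and plain L2 distance), again using the toy vectors from the question; this is illustrative, not a recommendation of either:
a <- c(1, 4, 4)
b <- c(2, 3, 4)

# Adjusted cosine: subtract each user's mean rating first, so users who
# rate everything high are not automatically similar to users who rate low
a_c <- a - mean(a)
b_c <- b - mean(b)
adj_cos <- sum(a_c * b_c) / (sqrt(sum(a_c^2)) * sqrt(sum(b_c^2)))

# Euclidean (L2) distance, turned into a similarity in (0, 1]
l2_sim <- 1 / (1 + sqrt(sum((a - b)^2)))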

Related

How to find the optimal feature count with the mRMRe package?

I am trying to use the mRMRe package in R to do feature selection on a gene expression dataset. I have RNA-seq data containing over 10K genes, and I would like to find the optimal features to fit the classification model. I am wondering how to find the optimal feature count. Here is my code:
mrEnsemble <- mRMR.ensemble(data = Xdata, target_indices = c(1), feature_count = 100, solution_count = 1)
mrEnsemble_genes <- as.data.frame(apply(solutions(mrEnsemble)[[1]], 2, function(x, y) { return(y[x]) }, y = featureNames(Xdata)))
View(mrEnsemble_genes)
I just set feature_count = 100, but I am wondering how to find the optimal feature count for classification without fixing the number in advance.
The result after extracting mrEnsemble_genes is a list of genes like:
gene05
gene08
gene45
gene67
Are they ranked by a score calculated from mutual information? I mean, the first-ranked gene gains the highest MI and may be a good gene for classifying the sample class, i.e. cancer vs. normal, right? Thank you
As far as I understand, the MRMR method simply ranks the N features you request according to their MRMR score. It is then up to you to decide which features to keep and which to discard.
According to the package documentation of mRMRe, the MRMR score is computed as follows:
For each target, the score of a feature is defined as the
mutual information between the target and this feature minus the average mutual information of
previously selected features and this feature.
So in other words,
Relevancy = Mutual information with the target
Redundancy = Average mutual information with the previous predictors
MRMR Score = Relevancy - Redundancy.
The way I interpret this, the scores themselves don't offer a clear-cut answer for keeping or discarding features. Higher scores are better, but a zero / negative score does not mean the feature has no effect on the target: It could have some mutual information with the target, but higher average mutual information with the other predictors, leading to a negative MRMR score. Finding the exact optimal feature set requires experimentation.
To retrieve the indexes of the features (in the original data), ranked from highest to lowest MRMR score, use:
solutions(mrEnsemble)
To retrieve the actual MRMR scores, use:
scores(mrEnsemble)
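The package itself will not pick the count for you, so one common way to choose it empirically is to cross-validate a classifier over a grid of candidate counts taken from the mRMR ranking. The sketch below is only illustrative: expr (a data frame of expression values without the label column), labels (the class factor) and ranked_idx (feature indices in mRMR order, e.g. from solutions(mrEnsemble)[[1]]) are assumed inputs, and k-NN is just a placeholder classifier.
library(class)  # knn() is used purely as a placeholder classifier

evaluate_feature_count <- function(expr, labels, ranked_idx,
                                   counts = c(5, 10, 25, 50, 100),
                                   folds = 5) {
  # Assign each sample to one of 'folds' cross-validation folds
  fold_id <- sample(rep(seq_len(folds), length.out = nrow(expr)))
  acc <- sapply(counts, function(n_feat) {
    keep <- ranked_idx[seq_len(n_feat)]   # top n_feat features by mRMR rank
    mean(sapply(seq_len(folds), function(f) {
      train <- expr[fold_id != f, keep, drop = FALSE]
      test  <- expr[fold_id == f, keep, drop = FALSE]
      pred  <- knn(train, test, labels[fold_id != f], k = 5)
      mean(pred == labels[fold_id == f])  # accuracy on this fold
    }))
  })
  setNames(acc, counts)
}
# The count with the highest cross-validated accuracy is a reasonable
# choice for feature_count.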

Generating groups of skewed size but whose elements add to a fixed sum

I have some fixed number of people (e.g. 1000). I would like to split these 1000 people into some random number of classes Y (e.g. 5), but not equally. I want them to be distributed unevenly, according to some probability distribution that is heavily skewed (something like a power-law distribution).
My intuition is that I need to generate a distribution of probabilities that is (1) skewed and (2) which also adds up to 1.
My ad hoc solution was to generate random numbers from a power law distribution, multiply these by some scalar that ensures these add up to something close to my target number, adjust my target number to that new number, and then split accordingly.
But it seems awfully inelegant, and 'y_sizes' doesn't always sum to 1000, which requires looping through and trying again. What's a better approach?
require(poweRlaw)
require(magrittr)  # for the %>% pipe
x <- 1000
y <- 10
y_sizes <- rpldis(y, xmin = 5, alpha = 2, discrete_max = x)
y_sizes <- round(y_sizes * x / sum(y_sizes))
newx <- sum(y_sizes)  # newx is only approximately equal to x rather than exactly x
people <- 1:x
groups <- cut(
  people,
  c(0, cumsum(y_sizes))
) %>% as.numeric
data.frame(
  people = people,
  group = groups
)
The algorithm presented by Smith and Tromble in "Sampling Uniformly from the Unit Simplex" shows a solution. I have pseudocode on this algorithm in my section "Random Integers with a Given Positive Sum".
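For reference, a minimal R sketch of that cut-point construction (the function name is mine): it draws y positive integers that sum exactly to x, uniformly over all such compositions. Note that it is uniform rather than power-law skewed, but the sizes always add up exactly.
random_partition <- function(x, y) {
  # y - 1 distinct cut points in 1..(x - 1), sorted
  cuts <- sort(sample(seq_len(x - 1), y - 1))
  # gaps between consecutive cuts are the group sizes
  diff(c(0, cuts, x))
}

set.seed(1)
y_sizes <- random_partition(1000, 5)
sum(y_sizes)  # always exactly 1000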

Topsis - Query regarding negative and positive attributes

In the TOPSIS technique we calculate negative and positive ideal solutions, so we need positive and negative attributes (criteria) measuring the impact. But what if all the attributes in my model have only a positive impact? Is it possible to calculate TOPSIS results using only positive attributes? If yes, how do I calculate the relative-closeness part? Thanks in advance
Good question. Yes, you can have all positive attributes or even all negative ones. While assessing alternatives you might encounter two different types of attributes: desirable attributes and undesirable attributes.
As a decision-maker, you want to maximise desirable attributes (beneficial criteria) and minimise undesirable attributes (costing criteria).
TOPSIS was created in 1981 by Hwang and Yoon*1. The central idea behind this algorithm is that the most desirable alternative is the one most similar to the ideal solution, a hypothetical alternative with the highest possible desirable attributes and the lowest possible undesirable attributes, and least similar to the so-called 'anti-ideal' solution, a hypothetical alternative with the lowest possible desirable attributes and the highest possible undesirable attributes.
That similarity is modelled with a geometric distance, known as Euclidean distance.*2
Assume you have already built the decision matrix, so that you know the alternatives with their respective criteria and values, and you have already identified which attributes are desirable and which are undesirable. (Make sure you normalise and weight the matrix.)
The steps of TOPSIS are:
Model the IDEAL Solution.
Model the ANTI-IDEAL Solution.
Calculate Euclidean distance to the Ideal solution for each alternative.
Calculate Euclidean distance to the Anti-Ideal solution for each alternative.
Calculate the ratio of relative proximity to the ideal solution.
The formula is the following:
relative closeness = D- / (D+ + D-)
that is, the distance to the anti-ideal solution divided by (distance to the ideal solution + distance to the anti-ideal solution).
Then, you have to sort the alternatives by this ratio and select the one that outranks the others.
Now, let's put this theory into practice... let's say you want to select which is the best investment out of different startups. And you will only consider 4 beneficial criteria: (A) Sales revenue, (B) Active Users, (C) Life-time value, (D) Return rate
## Here we have our decision matrix, known in R's MCDA package as a performance table...
performanceTable <- matrix(c(5490, 51.4, 8.5, 285,
                             6500, 70.6, 7,   288,
                             6489, 54.3, 7.5, 290),
                           nrow = 3,
                           ncol = 4,
                           byrow = TRUE)
# The rows of the matrix contain the alternatives.
row.names(performanceTable) <- c("Wolox", "Globant", "Bitex")
# The columns contain the attributes:
colnames(performanceTable) <- c("Revenue", "Users", "LTV", "Rrate")
# You set the weights depending on their importance to the decision-maker.
weights <- c(0.35, 0.25, 0.25, 0.15)
# And here is WHERE YOU INDICATE THAT YOU WANT TO MAXIMISE ALL THOSE ATTRIBUTES!!:
criteriaMinMax <- c("max", "max", "max", "max")
Then for the rest of the process you can follow R documentation on the TOPSIS function: https://www.rdocumentation.org/packages/MCDA/versions/0.0.19/topics/TOPSIS
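Based on that documentation, the call should look roughly like the following (the argument names are taken from the linked page and should be checked against your installed version of MCDA):
library(MCDA)

# Relative closeness to the ideal solution for each alternative
overall <- TOPSIS(performanceTable,
                  criteriaWeights = weights,
                  criteriaMinMax  = criteriaMinMax)

# Higher is better: sort to get the final ranking of the startups
sort(overall, decreasing = TRUE)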
Resources:
YouTube tutorial: TOPSIS - Technique for Order Preference by Similarity to Ideal Solution, by Manoj Mathew. Retrieved from https://www.youtube.com/watch?v=kfcN7MuYVeI
R documentation on TOPSIS function. Retrieved from: https://www.rdocumentation.org/packages/MCDA/versions/0.0.19/topics/TOPSIS
REFERENCES:
1. Hwang, C. L., & Yoon, K. (1981). Methods for multiple attribute decision making. In Multiple attribute decision making (pp. 58-191). Springer, Berlin, Heidelberg.
2. Gentle, J. E. (2007). Matrix Algebra: Theory, Computations, and Applications in Statistics. Springer-Verlag. p. 299. ISBN 0-387-70872-3.

Random Data Sets within loops

Here is what I want to do:
I have a time series data frame with, let us say, 100 time series of length 600, each in one column of the data frame.
I want to pick 4 of the time series at random and then assign them random weights that sum to one (e.g. 0.1, 0.5, 0.3, 0.1). Using those, I want to compute the mean of the sum of the 4 weighted time series (i.e. a convex combination).
I want to do this let us say 100k times and store each result in the form
ts1.name, ts2.name, ts3.name, ts4.name, weight1, weight2, weight3, weight4, mean
so that I get a 9*100k df.
I tried some things already, but R is very bad with loops, and I know vector-oriented solutions are better because of R's design.
Here is what I did, and I know it is horrible.
The df is in the form
v1,v2,v2.....v100
1,5,6,.......9
2,4,6,.......10
3,5,8,.......6
2,2,8,.......2
etc
results <- NULL
for (x in 1:100000)
{
  s <- sample(1:100, 4)                  # pick 4 variables randomly
  a <- sample(seq(0, 1, 0.01), 1)
  b <- sample(seq(0, 1 - a, 0.01), 1)
  c <- sample(seq(0, (1 - a - b), 0.01), 1)
  d <- 1 - a - b - c
  e <- c(a, b, c, d)                     # 4 random weights summing to 1
  average <- mean(as.matrix(timeseries.df[, s]) %*% e)
  results <- rbind(results, c(s, e, average))  # in the end I get the 100k x 9 df
}
The procedure runs way too slowly.
EDIT:
Thanks for the help I've had. I am not used to thinking in R, and I am not very used to translating every problem into a matrix algebra equation, which is what you need in R.
The problem then becomes a little more complex if I want to calculate the standard deviation. I need the covariance matrix, and I am not sure if/how I can pick random elements for each sample from the original timeseries.df covariance matrix and then compute the sample variance
t(sampleweights) %*% sample_cov.mat %*% sampleweights
to get, in the end, the ts.weighted_standard_dev matrix.
A last question: what is the best way to proceed if I want to bootstrap the original df x times and then apply the same computations to test the robustness of my data?
Thanks
Ok, let me try to solve your problem. As a foreword: I can think of no application where it is sensible to do what you are doing. However, that is for you to judge (nonetheless, I would be interested in the application...).
First, note that the mean of the weighted sums equals the weighted sum of the means:
mean(w1*x1 + w2*x2 + ... + wn*xn) = w1*mean(x1) + w2*mean(x2) + ... + wn*mean(xn)
Let's generate some sample data:
timeseries.df <- data.frame(matrix(runif(1000, 1, 10), ncol=40))
n <- 4 # number of items in the convex combination
replications <- 100 # number of replications
Thus, we may first compute the mean of all columns and do all further computations using this mean:
ts.means <- apply(timeseries.df, 2, mean)
Let's create some samples:
samples <- replicate(replications, sample(1:length(ts.means), n))
and the corresponding weights for those samples:
weights <- matrix(runif(replications*n), nrow=n)
# Now norm the weights so that each column sums up to 1:
weights <- weights / matrix(apply(weights, 2, sum), nrow=n, ncol=replications, byrow=T)
That part was a little bit tricky. Run the single functions each on their own with a small number of replications to figure out what they are doing. Note that I took a different approach for generating the weights: first get uniformly distributed data and then norm them by their sum. The result should be identical to your approach, but with arbitrary resolution and much better performance.
Again a little bit tricky: get the means for each time series and multiply them by the weights just computed:
ts.weightedmeans <- matrix(ts.means[samples], nrow=n) * weights
# and sum them up:
weights.sum <- apply(ts.weightedmeans, 2, sum)
Now, we are basically done - all the information is available and ready to use. The rest is just a matter of correctly formatting the data.frame.
result <- data.frame(t(matrix(names(ts.means)[samples], nrow=n)), t(weights), weights.sum)
# For perfectness, use better names:
colnames(result) <- c(paste("Sample", 1:n, sep=''), paste("Weight", 1:n, sep=''), "WeightedMean")
I would assume this approach to be rather fast - on my system the code took 1.25 seconds with the amount of repetitions you stated.
Final word: You were in luck that I was looking for something that kept me thinking for a while. Your question was not asked in a way that encourages users to think about your problem and give good answers. The next time you have a problem, I would suggest you read www.whathaveyoutried.com first and try to break the problem down as far as you are able. The more concrete your problem, the faster and higher quality your answers will be.
Edit
You mentioned correctly that the weights generated above are not uniformly distributed over the whole range of values. (I still have to object that even (0.9, 0.05, 0.025, 0.025) is possible, but it is very unlikely).
Now we are playing in a different league, though. I am pretty sure that the approach you took is not uniformly distributed as well - the probability of the last value being 0.9 is far less than the probability of the first one being that large. Honestly I do not have a good idea ready for you concerning the generation of uniformly distributed random numbers on the unit sphere according to the L_1 distance. (Actually, it is not really a unit sphere, but both problems should be identical).
Thus, I have to give up on this.
I would suggest you raise a new question at stats.stackexchange.com concerning the generation of those random vectors. It probably is fairly simple using the correct technique. However, I doubt that this question, with this heading and a fairly long answer, will attract a potential responder... (If you ask the question over there, I would appreciate a link, as I would like to know the solution ;)
Concerning the variance: I do not fully understand which standard deviation you want to compute. If you just want to compute the standard deviation of each time series, why do you not use the built-in function sd? In the computation above you could just replace mean by it.
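Regarding the quadratic form you wrote in your edit, here is a sketch of how the standard deviation of each weighted combination could be computed with the objects defined above (samples, weights, result); I am assuming this is the quantity you are after:
# Covariance matrix of all time series, computed once
ts.cov <- cov(timeseries.df)

# For each replication, the variance of the weighted sum is w' Sigma w,
# restricted to the sampled columns
weighted.var <- sapply(seq_len(replications), function(i) {
  idx <- samples[, i]
  w   <- weights[, i]
  as.numeric(t(w) %*% ts.cov[idx, idx] %*% w)
})
result$WeightedSD <- sqrt(weighted.var)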
Bootstrapping: That is a whole new question. Separate different topics by starting new questions.

mathematical model to build a ranking/ scoring system

I want to rank a set of sellers. Each seller is defined by parameters var1,var2,var3,var4...var20. I want to score each of the sellers.
Currently I am calculating the score by assigning weights to these parameters (say 10% to var1, 20% to var2, and so on), and these weights are determined based on my gut feeling.
My score equation looks like
score = w1* var1 +w2* var2+...+w20*var20
score = 0.1*var1+ 0.5 *var2 + .05*var3+........+0.0001*var20
My score equation could also look like
score = w1^2* var1 +w2* var2+...+w20^5*var20
where var1,var2,..var20 are normalized.
Which equation should I use?
What are the methods to scientifically determine, what weights to assign?
I want to optimize these weights to revamp the scoring mechanism using some data oriented approach to achieve a more relevant score.
example
I have following features for sellers
1] Order fulfillment rates [numeric]
2] Order cancel rate [numeric]
3] User rating [1-5] { 1-2 : Worst, 3: Average , 5: Good} [categorical]
4] Time taken to confirm the order. (shorter the time taken better is the seller) [numeric]
5] Price competitiveness
Are there better algorithms/approaches to solve this problem, i.e. to calculate the score? I linearly added the various features; I want to know a better approach to build the ranking system.
How do I come up with the values for the weights?
Apart from the above features, a few more I can think of are the ratio of positive to negative reviews, the rate of damaged goods, etc. How will these fit into my score equation?
As a disclaimer, I don't think this is a concise answer, but your question is quite broad. This has not been tested, but it is an approach I would most likely take given a similar problem.
As a possible direction to go, consider the multivariate Gaussian density
f(x) = (2*pi)^(-n/2) * |Sigma|^(-1/2) * exp( -(1/2) * (x - mu)' Sigma^(-1) (x - mu) )
The idea would be that each parameter sits in its own dimension and can therefore be weighted by importance. Example: with Sigma = [1,0,0; 0,2,0; 0,0,3] and a vector [x1, x2, x3], x1 would have the greatest importance.
The covariance matrix Sigma takes care of the scaling in each dimension. To achieve this, simply place the weights on the diagonal elements of an n x n diagonal matrix; you are not really concerned with the cross terms.
Mu is the average of all logs (records) in your data for your sellers, and it is a vector.
x is the mean of every category for a particular seller, written as a vector x = {x1, x2, x3, ..., xn}. This value is continuously updated as more data are collected.
The parameters of the function, based on the total dataset, should evolve as well. That way biased voting, especially in the "feelings"-based categories, can be weeded out.
After that setup, the evaluation of the function f(x) can be played with to give the desired results. It is a probability density function, but its utility is not restricted to statistics.
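To make the idea concrete, here is a small R sketch under stated assumptions: the seller features are already normalized, the importance weights are placed on the diagonal of Sigma, and the density value itself is used as the score (all names and numbers are illustrative):
# Score sellers by a multivariate normal density with a diagonal covariance.
# Smaller diagonal entries mean a tighter penalty for deviating from the mean
# in that dimension, as in the Sigma = diag(1, 2, 3) example above.
score_sellers <- function(X, sigma_diag) {
  mu   <- colMeans(X)                      # average profile across all sellers
  Sinv <- diag(1 / sigma_diag)             # inverse of the diagonal covariance
  norm_const <- (2 * pi)^(-ncol(X) / 2) * prod(sigma_diag)^(-1/2)
  apply(X, 1, function(x) {
    d <- x - mu
    norm_const * exp(-0.5 * as.numeric(t(d) %*% Sinv %*% d))
  })
}

# Illustrative data: 5 sellers, 3 normalized features
set.seed(42)
X <- matrix(runif(15), nrow = 5,
            dimnames = list(paste0("seller", 1:5),
                            c("fulfillment", "rating", "price")))
scores <- score_sellers(X, sigma_diag = c(1, 2, 3))
sort(scores, decreasing = TRUE)            # sellers closest to the typical profile score highest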
