Topsis - Query regarding negative and positive attributes - r

In Topsis technique, we calculate negative and positive ideal solutions, so we need to have positive and negative attributes (criterions) measuring the impact but what if I have attributes in the model having only positive impact? Is it possible to calculate Topsis results using only positive attributes?? If yes then how to calculate the relative part. Thanks in advance

Good question. Yes, you can have all positive attributes or even all negatives. So, while assessing alternatives you might encounter two different types of attributes: desirable attributes or undesirable attributes.
As a decision-maker, you want to maximise desirable attributes (beneficial criteria) and minimise undesirable attributes (costing criteria).
TOPSIS was created in 1981 by Hwang and Yoon*1. The central idea behind this algorithm is that the most desirable solution would be the one that it is the most similar to the ideal solution, so a hypothetical alternative with the highest possible desirable attributes and the lowest possible desirable attributes, and the less similar to the so-called 'anti-ideal' solution, so a hypothetical alternative with the lowest possible desirable attributes and the highest possible undesirable attributes.
That similarity is modelled with a geometric distance, known as Euclidean distance.*2
Assuming you already have built the decision matrix. So that you know the alternatives with their respective criterion and values. And you already identified which attributes are desirable and undesirable. (Make sure you normalise and weight the matrix)
The steps of TOPSIS are:
Model the IDEAL Solution.
Model the ANTI-IDEAL Solution.
Calculate Euclidean distance to the Ideal solution for each alternative.
Calculate Euclidean distance to the Anti-Ideal solution for each alternative.
You have to calculate the ratio of relative proximity to the ideal solution.
The formula is the following:
So, distance to anti-ideal solution divided by distance to ideal solution + distance to anti-ideal solution.
Then, you have to sort the alternatives by this ratio and select the one that outranks the others.
Now, let's put this theory into practice... let's say you want to select which is the best investment out of different startups. And you will only consider 4 beneficial criteria: (A) Sales revenue, (B) Active Users, (C) Life-time value, (D) Return rate
## Here we have our decision matrix, in R known as performance matrix...
performanceTable <- matrix(c(5490,51.4,8.5,285,6500,70.6,7,
288,6489,54.3,7.5,290),
nrow=3,
ncol=4,
byrow=TRUE)
# The rows of the matrix contains the alternatives.
row.names(performanceTable) <- c("Wolox","Globant","Bitex")
# The columns contains the attributes:
colnames(performanceTable) <- c("Revenue","Users",
"LTV","Rrate")
# You set the weights depending on the importance for the decision-maker.
weights <- c(0.35,0.25,0.25,0.15)
# And here is WHERE YOU INDICATE THAT YOU WANT TO MAXIMISE ALL THOSE ATTRIBUTES!! :
criteriaMinMax <- c("max", "max", "max", "max")
Then for the rest of the process you can follow R documentation on the TOPSIS function: https://www.rdocumentation.org/packages/MCDA/versions/0.0.19/topics/TOPSIS
Resources:
Youtube tutorial: TOPSIS - Technique for Order Preference by Similarity to Ideal Solution. Manoj Mathew Retrieved from https://www.youtube.com/watch?v=kfcN7MuYVeI
R documentation on TOPSIS function. Retrieved from: https://www.rdocumentation.org/packages/MCDA/versions/0.0.19/topics/TOPSIS
REFERENCES:
1 Hwang, C. L., & Yoon, K. (1981). Methods for multiple attribute decision making. In Multiple attribute decision making (pp. 58-191).Springer, Berlin, Heidelberg.
James E. Gentle (2007). Matrix Algebra: Theory, Computations, and Applications in Statistics. Springer-Verlag. p. 299. ISBN 0-387-70872-3.

Related

Dealing with "less than"s in R

Perhaps this is a philosophical question rather than a programming question, but here goes...
In R, is there some package or method that will let you deal with "less than"s as a concept?
Backstory: I have some data which, for privacy reasons, is given as <5 for small numbers (representing integers 1, 2, 3 or 4, in fact). I'd like to do some simple arithmetic on this data (adding, subtracting, averaging, etc.) but obviously I need to find some way to deal with these <5s conceptually. I could replace them all with NAs, sure, but of course that's throwing away potentially useful information, and I would like to avoid that if possible.
Some examples of what I mean:
a <- c(2,3,8)
b <- c(<5,<5,8)
mean(a)
> 4.3333
mean(b)
> 3.3333 -> 5.3333
If you are interested in the values at the bounds, I would take each dataset and split it into two datasets; one with all <5s set to 1 and one with all <5s set to 4.
a <- c(2,3,8)
b1 <- c(1,1,8)
b2 <- c(4,4,8)
mean(a)
# 4.333333
mean(b1)
# 3.3333
mean(b2)
# 5.3333
Following #hedgedandlevered proposal, but he's wrong wrt normal and/or uniform. You ask for integer numbers, so you have to use discrete distributions, like Poisson, binomial (including negative one), geometric etc
In statistics "less than" data is known as "left censored" https://en.wikipedia.org/wiki/Censoring_(statistics), searching on "censored data" might help.
My favoured approach to analysing such data is maximum likelihood https://en.wikipedia.org/wiki/Maximum_likelihood. There are a number of R packages for maximum likelihood estimation, I like the survival package https://cran.r-project.org/web/packages/survival/index.html but there are others, e.g. fitdistrplus https://cran.r-project.org/web/packages/fitdistrplus/index.html which "provides functions for fitting univariate distributions to different types of data (continuous censored or non-censored data and discrete data) and allowing different estimation methods (maximum likelihood, moment matching, quantile matching and maximum goodness-of-t estimation)".
You will have to specify (assume?) the form of the distribution of the data; you say it is integer so maybe a Poisson [related] distribution may be appropriate.
Treat them as a certain probability distribution of your choosing, and replace them with actual randomly generated numbers. All equal to 2.5, normal-like distribution capped at 0 and 5, uniform on [0,5] are all options
I deal with similar data regularly. I strongly dislike any of the suggestions of replacing the <5 values with a particular number. Consider the following two cases:
c(<5,<5,<5,<5,<5,<5,<5,<5,6,12,18)
c(<5,6,12,18)
The problem comes when you try to do arithmetic with these.
I think a solution to your issue is to think of the values as factors (in the R sense. You can bin the values above 5 too if that helps, for example
c(<5,<5,<5,<5,<5,<5,<5,<5,5-9,10-14,15-19)
c(<5,5-9,10-14,15-19)
Now, you still wouldn't do arithmetic on these, but your summary statistics (histograms/proportion tables/etc...) would make more sense.

Understanding `scale` in R

I'm trying to understand the definition of scale that R provides. I have data (mydata) that I want to make a heat map with, and there is a VERY strong positive skew. I've created a heatmap with a dendrogram for both scale(mydata) and log(my data), and the dendrograms are different for both. Why? What does it mean to scale my data, versus log transform my data? And which would be more appropriate if I want to look at the dendrogram illustrating the relationship between the columns of my data?
Thank you for any help! I've read the definitions but they are whooping over my head.
log simply takes the logarithm (base e, by default) of each element of the vector.
scale, with default settings, will calculate the mean and standard deviation of the entire vector, then "scale" each element by those values by subtracting the mean and dividing by the sd. (If you use scale(x, scale=FALSE), it will only subtract the mean but not divide by the std deviation.)
Note that this will give you the same values
set.seed(1)
x <- runif(7)
# Manually scaling
(x - mean(x)) / sd(x)
scale(x)
It provides nothing else but a standardization of the data. The values it creates are known under several different names, one of them being z-scores ("Z" because the normal distribution is also known as the "Z distribution").
More can be found here:
http://en.wikipedia.org/wiki/Standard_score
This is a late addition but I was looking for information on the scale function myself and though it might help somebody else as well.
To modify the response from Ricardo Saporta a little bit.
Scaling is not done using standard deviation, at least not in version 3.6.1 of R, I base this on "Becker, R. (2018). The new S language. CRC Press." and my own experimentation.
X.man.scaled <- X/sqrt(sum(X^2)/(length(X)-1))
X.aut.scaled <- scale(X, center = F)
The result of these rows are exactly the same, I show it without centering because of simplicity.
I would respond in a comment but did not have enough reputation.
I thought I would contribute by providing a concrete example of the practical use of the scale function. Say you have 3 test scores (Math, Science, and English) that you want to compare. Maybe you may even want to generate a composite score based on each of the 3 tests for each observation. Your data could look as as thus:
student_id <- seq(1,10)
math <- c(502,600,412,358,495,512,410,625,573,522)
science <- c(95,99,80,82,75,85,80,95,89,86)
english <- c(25,22,18,15,20,28,15,30,27,18)
df <- data.frame(student_id,math,science,english)
Obviously it would not make sense to compare the means of these 3 scores as the scale of the scores are vastly different. By scaling them however, you have more comparable scoring units:
z <- scale(df[,2:4],center=TRUE,scale=TRUE)
You could then use these scaled results to create a composite score. For instance, average the values and assign a grade based on the percentiles of this average. Hope this helped!
Note: I borrowed this example from the book "R In Action". It's a great book! Would definitely recommend.

An "asymmetric" pairwise distance matrix

Suppose there are three sequences to be compared: a, b, and c. Traditionally, the resulting 3-by-3 pairwise distance matrix is symmetric, indicating that the distance from a to b is equal to the distance from b to a.
I am wondering if TraMineR provides some way to produce an asymmetric pairwise distance matrix.
No, TraMineR does not produce 'assymetric' dissimilaries precisely for the reasons stressed in Pat's comment.
The main interest of computing pairwise dissimilarities between sequences is that once we have such dissimilarities we can for instance
measure the discrepancy among sequences, determine neighborhoods, find medoids, ...
run cluster algorithms, self-organizing maps, MDS, ...
make ANOVA-like analysis of the sequences
grow regression trees for the sequences
Inputting a non symmetric dissimilarity matrix in those processes would most probably generate irrelevant outcomes.
It is because of this symmetry requirement that the substitution costs used for computing Optimal Matching distances MUST be symmetrical. It is important to not interpret substitution costs as the cost of switching from one state to the other, but to understand them for what they are, i.e., edit costs. When comparing two sequences, for example
aabcc and aadcc, we can make them equal either by replacing arbitrarily b with d in the first one or d with b in the second one. It would then not make sense not giving the same cost for the two substitutions.
Hope this helps.

how to generate pseudo-random positive definite matrix with constraints on the off-diagonal elements? [duplicate]

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
how to generate pseudo-random positive definite matrix with constraints on the off-diagonal elements?
The user wants to impose a unique, non-trivial, upper/lower bound on the correlation between every pair of variable in a var/covar matrix.
For example: I want a variance matrix in which all variables have 0.9 > |rho(x_i,x_j)| > 0.6, rho(x_i,x_j) being the correlation between variables x_i and x_j.
Thanks.
There are MANY issues here.
First of all, are the pseudo-random deviates assumed to be normally distributed? I'll assume they are, as any discussion of correlation matrices gets nasty if we diverge into non-normal distributions.
Next, it is rather simple to generate pseudo-random normal deviates, given a covariance matrix. Generate standard normal (independent) deviates, and then transform by multiplying by the Cholesky factor of the covariance matrix. Add in the mean at the end if the mean was not zero.
And, a covariance matrix is also rather simple to generate given a correlation matrix. Just pre and post multiply the correlation matrix by a diagonal matrix composed of the standard deviations. This scales a correlation matrix into a covariance matrix.
I'm still not sure where the problem lies in this question, since it would seem easy enough to generate a "random" correlation matrix, with elements uniformly distributed in the desired range.
So all of the above is rather trivial by any reasonable standards, and there are many tools out there to generate pseudo-random normal deviates given the above information.
Perhaps the issue is the user insists that the resulting random matrix of deviates must have correlations in the specified range. You must recognize that a set of random numbers will only have the desired distribution parameters in an asymptotic sense. Thus, as the sample size goes to infinity, you should expect to see the specified distribution parameters. But any small sample set will not necessarily have the desired parameters, in the desired ranges.
For example, (in MATLAB) here is a simple positive definite 3x3 matrix. As such, it makes a very nice covariance matrix.
S = randn(3);
S = S'*S
S =
0.78863 0.01123 -0.27879
0.01123 4.9316 3.5732
-0.27879 3.5732 2.7872
I'll convert S into a correlation matrix.
s = sqrt(diag(S));
C = diag(1./s)*S*diag(1./s)
C =
1 0.0056945 -0.18804
0.0056945 1 0.96377
-0.18804 0.96377 1
Now, I can sample from a normal distribution using the statistics toolbox (mvnrnd should do the trick.) As easy is to use a Cholesky factor.
L = chol(S)
L =
0.88805 0.012646 -0.31394
0 2.2207 1.6108
0 0 0.30643
Now, generate pseudo-random deviates, then transform them as desired.
X = randn(20,3)*L;
cov(X)
ans =
0.79069 -0.14297 -0.45032
-0.14297 6.0607 4.5459
-0.45032 4.5459 3.6549
corr(X)
ans =
1 -0.06531 -0.2649
-0.06531 1 0.96587
-0.2649 0.96587 1
If your desire was that the correlations must ALWAYS be greater than -0.188, then this sampling technique has failed, since the numbers are pseudo-random. In fact, that goal will be a difficult one to achieve unless your sample size is large enough.
You might employ a simple rejection scheme, whereby you do the sampling, then redo it repeatedly until the sample has the desired properties, with the correlations in the desired ranges. This may get tiring.
An approach that might work (but one that I've not totally thought out at this point) is to use the standard scheme as above to generate a random sample. Compute the correlations. I they fail to lie in the proper ranges, then identify the perturbation one would need to make to the actual (measured) covariance matrix of your data, so that the correlations would be as desired. Now, find a zero mean random perturbation to your sampled data that would move the sample covariance matrix in the desired direction.
This might work, but unless I knew that this is actually the question at hand, I won't bother to go any more deeply into it. (Edit: I've thought some more about this problem, and it appears to be a quadratic programming problem, with quadratic constraints, to find the smallest perturbation to a matrix X, such that the resulting covariance (or correlation) matrix has the desired properties.)
This is not a complete answer, but a suggestion of a possible constructive method:
Looking at the characterizations of the positive definite matrices (http://en.wikipedia.org/wiki/Positive-definite_matrix) I think one of the most affordable approaches could be using the Sylvester criterion.
You can start with a trivial 1x1 random matrix with positive determinant and expand it in one row and column step by step while ensuring that the new matrix has also a positive determinant (how to achieve that is up to you ^_^).
Woodship,
"First of all, are the pseudo-random deviates assumed to be normally distributed?"
yes.
"Perhaps the issue is the user insists that the resulting random matrix of deviates must have correlations in the specified range."
Yes, that's the whole difficulty
"You must recognize that a set of random numbers will only have the desired distribution parameters in an asymptotic sense."
True, but this is not the problem here: your strategy works for p=2, but fails for p>2, regardless of sample size.
"If your desire was that the correlations must ALWAYS be greater than -0.188, then this sampling technique has failed, since the numbers are pseudo-random. In fact, that goal will be a difficult one to achieve unless your sample size is large enough."
It is not a sample size issue b/c with p>2 you do not even observe convergence to the right range for the correlations, as sample size growths: i tried the technique you suggest before posting here, it obviously is flawed.
"You might employ a simple rejection scheme, whereby you do the sampling, then redo it repeatedly until the sample has the desired properties, with the correlations in the desired ranges. This may get tiring."
Not an option, for p large (say larger than 10) this option is intractable.
"Compute the correlations. I they fail to lie in the proper ranges, then identify the perturbation one would need to make to the actual (measured) covariance matrix of your data, so that the correlations would be as desired."
Ditto
As for the QP, i understand the constraints, but i'm not sure about the way you define the objective function; by using the "smallest perturbation" off some initial matrix, you will always end up getting the same (solution) matrix: all the off diagonal entries will be exactly equal to either one of the two bounds (e.g. not pseudo random); plus it is kind of an overkill isn't it ?
Come on people, there must be something simpler

mathematical model to build a ranking/ scoring system

I want to rank a set of sellers. Each seller is defined by parameters var1,var2,var3,var4...var20. I want to score each of the sellers.
Currently I am calculating score by assigning weights on these parameters(Say 10% to var1, 20 % to var2 and so on), and these weights are determined based on my gut feeling.
my score equation looks like
score = w1* var1 +w2* var2+...+w20*var20
score = 0.1*var1+ 0.5 *var2 + .05*var3+........+0.0001*var20
My score equation could also look like
score = w1^2* var1 +w2* var2+...+w20^5*var20
where var1,var2,..var20 are normalized.
Which equation should I use?
What are the methods to scientifically determine, what weights to assign?
I want to optimize these weights to revamp the scoring mechanism using some data oriented approach to achieve a more relevant score.
example
I have following features for sellers
1] Order fulfillment rates [numeric]
2] Order cancel rate [numeric]
3] User rating [1-5] { 1-2 : Worst, 3: Average , 5: Good} [categorical]
4] Time taken to confirm the order. (shorter the time taken better is the seller) [numeric]
5] Price competitiveness
Are there better algorithms/approaches to solve this problem? calculating score? i.e I linearly added the various features, I want to know better approach to build the ranking system?
How to come with the values for the weights?
Apart from using above features, few more that I can think of are ratio of positive to negative reviews, rate of damaged goods etc. How will these fit into my Score equation?
Unfortunately stackoverflow doesn't have latex so images will have to do:
Also as a disclaimer, I don't think this is a concise answer but your question is quite broad. This has not been tested but is an approach I would most likely take given a similar problem.
As a possible direction to go, below is the multivariate gaussian. The idea would be that each parameter is in its own dimension and therefore could be weighted by importance. Example:
Sigma = [1,0,0;0,2,0;0,0,3] for a vector [x1,x2,x3] the x1 would have the greatest importance.
The co-variance matrix Sigma takes care of scaling in each dimension. To achieve this simply add the weights to a diagonal matrix nxn to the diagonal elements. You are not really concerned with the cross terms.
Mu is the average of all logs in your data for your sellers and is a vector.
xis the mean of every category for a particular seller and is as a vector x = {x1,x2,x3...,xn}. This is a continuously updated value as more data are collected.
The parameters of the the function based on the total dataset should evolve as well. That way biased voting especially in the "feelings" based categories can be weeded out.
After that setup the evaluation of the function f_x can be played with to give the desired results. This is a probability density function, but its utility is not restricted to stats.

Resources