d <- matrix(rpois(12, 5), nrow = 4)
w <- c(1, 1, 2)
i <- c("+", "-", "+")
topsis(d, w, i)
This is the function available in R for AHP/TOPSIS. I am confused about how to assign the "+" and "-" signs for the "impact" argument here. How is it done in this example?
Good question.
'c("+", "-", "+")' indicates which criteria you need to maximise and which criteria you need to minimise.
TOPSIS was developed in 1981 by Hwang and Yoon [1] and is a common algorithm for MCDM (multi-criteria decision making) problems. It is based on the premise that the 'best' solution out of a set of alternatives is the one with the smallest geometric distance to the ideal solution and the largest geometric distance to the anti-ideal solution.
Each alternative is characterised by several criteria. A criterion can be a benefit or a cost: if it is a benefit you want to maximise it, and if it is a cost you want to minimise it.
So, let's say you want to select the 'best' car from an array of car alternatives.
Price is a cost criterion that you want to minimise, while top speed, say, is something you want to maximise.
As said, those '+' and '-' signs indicate which attributes are costs and which are benefits, so that the ideal and anti-ideal solutions can be computed.
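To make this concrete, here is a minimal, hypothetical car example (the numbers, weights and impacts are all made up, not from the package documentation):

library(topsis)

# Three cars judged on price (a cost), top speed and efficiency (benefits)
d <- matrix(c(25000, 32000, 28000,   # price
                180,   220,   200,   # top speed
                 15,    12,    14),  # efficiency
            nrow = 3)
w <- c(2, 1, 1)        # price weighted twice as heavily as the others
i <- c("-", "+", "+")  # minimise price, maximise top speed and efficiency
topsis(d, w, i)        # one score and rank per alternative (row)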
Resources:
TOPSIS Package documentation. Retrieved from https://cran.r-project.org/web/packages/topsis/topsis.pdf
Manoj Mathew. TOPSIS - Technique for Order Preference by Similarity to Ideal Solution. Retrieved from https://www.youtube.com/watch?v=kfcN7MuYVeI
References:
[1] Hwang, C. L., & Yoon, K. (1981). Methods for multiple attribute decision making. In Multiple attribute decision making (pp. 58-191). Springer, Berlin, Heidelberg.
First off, I don't have any experience with TOPSIS, but the code of that function explains what is going on and matches the description of TOPSIS above. You can see the code by typing topsis at the R prompt.
The matrix d in this example is a 4x3 matrix. Each row represents one alternative (for instance, a model of car available on the market), while each column represents a criterion on which these alternatives are to be judged (for instance, you might use cost, efficiency, torque and ground clearance to select a car).
The + and - simply show how that particular criterion (column) impacts the outcome. For instance, the cost of a car would be a -, while its torque would be a +.
The algorithm uses these impact signs to arrive at a positive ideal solution and a negative (worst) ideal solution.
The positive ideal solution is derived by taking the maximum value of the + columns and the minimum value of the - columns. Here's the relevant line from the code:
u <- as.integer(impacts == "+") * apply(V, 2, max) +
     as.integer(impacts == "-") * apply(V, 2, min)
Negative ideal is the opposite.
From there on, the code finds the distance of each of our alternatives from these best and worst outcomes and ranks them accordingly.
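If it helps, here is a small stand-alone illustration (mine, not taken from the package) of how the impact signs pick out the ideal and anti-ideal values from a hypothetical weighted, normalised matrix V:

V <- matrix(c(0.10, 0.20, 0.15,   # criterion 1: a cost,    impact "-"
              0.30, 0.25, 0.40),  # criterion 2: a benefit, impact "+"
            ncol = 2)
impacts <- c("-", "+")

ideal      <- ifelse(impacts == "+", apply(V, 2, max), apply(V, 2, min))  # best value per criterion
anti_ideal <- ifelse(impacts == "+", apply(V, 2, min), apply(V, 2, max))  # worst value per criterion

Each alternative is then scored by its distance to ideal versus anti_ideal.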
I was hoping that someone could explain a specific part of an academic paper and assist in writing R code for that section:
Name of Paper
Large-scale Analysis of Counseling Conversations:
An Application of Natural Language Processing to Mental Health (https://cs.stanford.edu/people/jure/pubs/counseling-tacl16.pdf)
On page 5, we have the following snippet:
"
...build a TF-IDF vector of word occurrences
to represent the language of counselors within this
subset. We use the global inverse document (i.e.,
conversation) frequencies instead of the ones from
each subset to make the vectors directly comparable
and control for different counselors having different numbers of conversations by weighting conversations so all counselors have equal contributions.
"
What does the paper mean by "global inverse document frequency"?
How can I code this in R with the different subsets (positive and negative counsellors, for example)?
Here is my sample code:
library(tm)
library(slam)   # for crossprod_simple_triplet_matrix() and col_sums()
corp_pos_1 <- Corpus(VectorSource(positive_chats$Text1))
#corp_pos_1 <- tm_map(corp_pos_1, removeWords, stopwords("english"))
# TF-IDF weighted document-term matrix (unnormalised)
tdm_pos_1 <- DocumentTermMatrix(corp_pos_1,
                                control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE)))
ui <- unique(tdm_pos_1$i)     # keep only non-empty rows
tdm_pos_1 <- tdm_pos_1[ui, ]
# pairwise cosine similarities between the columns of the TF-IDF matrix
cosine_tdm_pos_1 <- crossprod_simple_triplet_matrix(tdm_pos_1) /
  (sqrt(col_sums(tdm_pos_1^2) %*% t(col_sums(tdm_pos_1^2))))
In the code 'pos' stands for positive, and 'neg' would stand for negative.
The number at the end of the variable name shows which chunk is being calculated.
I have the conversations chunked into 5 different parts, trying to follow the paper. But how would I be able to calculate the "global inverse document frequency"?
I think I found this Stack Overflow question from before, but I am still not understanding the paper and what I need to do in R:
R: weighted inverse document frequency (tfidf) similarity between strings
TF/IDF is a well-known measure in information retrieval. For more information on it, and formulae that describe how to calculate it, see the Wikipedia page.
In short, you want to have words that are specific to texts; words that occur in all texts do not add any distinctive information. So, the inverse document frequency is the number of all documents divided by the number of documents that contain a given word. For common words such as 'the' or 'of', the IDF would be 1.0, as we would assume they occur in all texts; for that reason they are often excluded as stop words. The IDF can also be scaled, e.g. by taking its logarithm.
If I understand your application correctly, you would take a term and divide the total number of documents by the number of negative documents that contain the term.
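Not the paper's own code, but one way to sketch this in R, assuming all_chats and positive_chats are character vectors of conversations (hypothetical names): compute the IDF once over the whole corpus and reuse it for the subset, so TF comes from the subset and IDF is global.

library(tm)

dtm_all <- as.matrix(DocumentTermMatrix(Corpus(VectorSource(all_chats))))
global_idf <- log(nrow(dtm_all) / colSums(dtm_all > 0))  # global inverse document frequency

dtm_pos <- as.matrix(DocumentTermMatrix(Corpus(VectorSource(positive_chats))))
shared  <- intersect(colnames(dtm_pos), names(global_idf))

# TF from the subset, IDF from the whole corpus, so the vectors stay comparable
tfidf_pos <- sweep(dtm_pos[, shared, drop = FALSE], 2, global_idf[shared], `*`)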
I have a series of evaluations. Each evaluation can take discrete values from 0 to 4. The series should decrease over time; however, since values are entered manually, errors can happen.
Therefore, I would like to modify my series so that it is monotonically decreasing. Moreover, I would like to minimise the number of evaluations modified. Finally, if two or more series satisfy these criteria, I would choose the one with the higher overall sum of values.
E.g.
Recorded evaluation
4332422111
Ideal evaluation
4332222111
Recorded evaluation
4332322111
Ideal evaluation
4333322111
(in this case, 4332222111 would have satisfied the criteria too, but I chose the one with the higher values)
I tried a brute-force approach by generating all possible combinations, selecting those that are monotonically decreasing, and finally comparing each of them with the recorded series.
However, a series can be up to 20 evaluations long, and then there are far too many combinations.
x1 <- c(4,3,3,2,4,2,2,1,1,1)
x2 <- c(4,3,3,2,3,2,2,1,1,1)
You could almost certainly break this algorithm, but here's a first try: replace locations where the value increases with NA, then fill them in with the previous value.
dfun <- function(x) {
  # set entries that increase relative to the previous value to NA,
  # then carry the last observation forward (requires the zoo package)
  r <- replace(x, which(c(0, diff(x)) > 0), NA)
  zoo::na.locf(r)
}
dfun(x1)
dfun(x2)
This gives the "less-ideal" answer in the second case.
For the record, I also tried
dfun2 <- function(x) {
  # isotonic (monotone) regression on the negated series,
  # converted to a step function and flipped back
  s <- as.stepfun(isoreg(-x))
  -s(seq_along(x))
}
but this doesn't handle the first example as desired.
You could also try to do this with discrete programming (about which I know almost nothing), or with a slightly more sophisticated form of brute force: use a stochastic algorithm that strongly penalises non-monotonicity and weakly penalises the distance from the initial sequence, e.g. optim(..., method="SANN") with a candidate function that adds or subtracts 1 from a randomly chosen element.
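A rough sketch of that simulated-annealing idea (my own, with made-up penalty weights, assuming integer evaluations in 0-4):

# heavy penalty for any increase, light penalty per modified entry,
# tiny reward for a higher overall sum (the tie-breaking rule)
obj <- function(y, x) {
  1000 * sum(pmax(diff(y), 0)) + sum(y != x) - 0.001 * sum(y)
}
# candidate generator: nudge one random evaluation up or down by 1, staying in 0..4
neighbour <- function(y, x) {
  i <- sample(seq_along(y), 1)
  y[i] <- max(0, min(4, y[i] + sample(c(-1, 1), 1)))
  y
}
res <- optim(x1, fn = obj, gr = neighbour, x = x1,
             method = "SANN", control = list(maxit = 20000))
res$par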
Background:
On this combinatorics question, the issue is how to determine the sample space: the number of ways 8 different soccer teams can be paired up for the next round of competition. Two different answers have been advanced for that part of the problem: 28 (see the comments to the OP) and 105 (see the edit within the OP and the answer).
I'd like to do this manually to try to home in on the mistake in whichever answer is incorrect.
What I have tried:
teams = 1:8
names(teams) = c("RM", "BCN", "SEV", "JUV", "ROM", "MC", "LIV", "BYN")
split(sample(teams), rep(1:(length(teams)/2), each=2))
Unfortunately, the output is a list, and I wanted a vector to be able to run something like:
unique(...,MARGIN=2)
Is there a way of doing this in an elegant manner?
After a now erased answer (thank you), I would go with
a <- replicate(1e5, unlist(split(sample(teams), rep(1:(length(teams)/2), each=2))))
to simulate 100,000 random samples, and later run
unique(a, MARGIN = 2).
But how can I account for the fact that the order of the 4 pairings of opponents doesn't matter, and that LIV-BYN and BYN-LIV, for example, are the same pairing (field advantage notwithstanding)?
> u = ncol(unique(replicate(1e6, unlist(split(sample(teams), rep(1:(length(teams)/2), each=2)))), MARGIN = 2))
> u / (factorial(4) * 2^4)
[1] 105
The idea of using unlist is from @Song Zhengyi, and if his answer is undeleted, I'll accept it. The complete answer is in the lines above.
u needs to be divided by 4! because
BCN-RM, BYN-SEV, JUV-ROM, LIV-MC
is exactly the same as
LIV-MC, BCN-RM, BYN-SEV, JUV-ROM
or
BCN-RM, LIV-MC, BYN-SEV, JUV-ROM
etc.
The term 2^4 avoids over-counting since, for every possible unique draw, each of the pairings can be flipped without loss (discarding field advantage): BCN-RM is the same as RM-BCN, and there are 4 pairs in each draw.
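As a sanity check, this matches the closed-form count of ways to split 8 teams into 4 unordered, unlabelled pairs:

> factorial(8) / (factorial(4) * 2^4)
[1] 105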
If field advantage is a consideration (real life)...
> u/factorial(4)
[1] 1680
we end up with 1,680 possible draws.
For a text analysis program, I would like to analyze the co-occurrence of certain words in a text. For example, I would like to see that the words "Barack" and "Obama" appear together more often (i.e. have a positive correlation) than other word pairs.
This does not seem to be that difficult. However, to be honest, I only know how to calculate the correlation between two numbers, but not between two words in a text.
How can I best approach this problem?
How can I calculate the correlation between words?
I thought of using conditional probabilities, since e.g. "Barack Obama" is much more probable than "Obama Barack"; however, the problem I am trying to solve is more fundamental and does not depend on the ordering of the words.
The Ngram Statistics Package (NSP) is devoted precisely to this task. They have a paper online which describes the association measures they use. I haven't used the package myself, so I cannot comment on its reliability/requirements.
Well, a simple way to approach your question is by shaping the data into a 2x2 contingency table

              obama    not obama
barack          A          B
not barack      C          D

and scoring all occurring bigrams in the matrix. That way you can, for instance, use a simple chi-squared test.
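For instance, with made-up counts (A = 'barack obama' bigrams, B = 'barack' followed by something else, C = something else followed by 'obama', D = everything else), the test would look like this:

tab <- matrix(c(120,   15,
                 30, 9835),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("barack", "not barack"),
                              c("obama", "not obama")))
chisq.test(tab)  # a large statistic / tiny p-value suggests the pair co-occurs far more often than chance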
I don't know how this is commonly done, but I can think of one crude way to define a notion of correlation that captures word adjacency.
Suppose the text has length N, say it is an array
text[0], text[1], ..., text[N-1]
Suppose the following words appear in the text
word[0], word[1], ..., word[k]
For each word word[i], define a vector of length N-1
X[i] = array(); // of length N-1
as follows: the jth entry of the vector is 1 if word[i] is either the jth word or the (j+1)th word of the text, and zero otherwise.
// compute the vector X[i]
for (j = 0:N-2) {
    if (text[j] == word[i] OR text[j+1] == word[i])
        X[i][j] = 1;
    else
        X[i][j] = 0;
}
Then you can compute the correlation coefficient between word[a] and word[b] as the dot product of X[a] and X[b] (note that the dot product is the number of times these words are adjacent) divided by the lengths (the length is the square root of the number of appearances of the word, or perhaps twice that). Call this quantity COR(X[a], X[b]). Clearly COR(X[a], X[a]) = 1, and COR(X[a], X[b]) is larger if word[a] and word[b] are often adjacent.
This can be generalized from "adjacent" to other notions of near - for example we could have chosen to use 3 word (or 4, 5, etc.) blocks instead. One can also add weights, probably do many more things as well if desired. One would have to experiment to see what is useful, if any of it is of use at all.
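For what it's worth, a rough R translation of the adjacency idea (mine, not the answerer's, on a toy text) could look like this:

text  <- c("barack", "obama", "met", "michelle", "obama", "today")
words <- unique(text)
N     <- length(text)

# X[[w]][j] is 1 if word w is the jth or (j+1)th token of the text
X <- lapply(words, function(w) as.integer(text[-N] == w | text[-1] == w))
names(X) <- words

cor_words <- function(a, b) sum(X[[a]] * X[[b]]) / sqrt(sum(X[[a]]) * sum(X[[b]]))
cor_words("barack", "obama")  # positive: the two are often adjacent
cor_words("barack", "today")  # zero: never adjacent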
This problem sounds like a bigram, a sequence of two "tokens" in a larger body of text. See this Wikipedia entry, which has additional links to the more general n-gram problem.
If you want to do a full analysis, you'd most likely take any given pair of words and do a frequency analysis. E.g., the sentence "Barack Obama is the Democratic candidate for President," has 8 words, so there are 8 choose 2 = 28 possible pairs.
You can then ask statistical questions like: "In how many pairs does 'Obama' follow 'Barack', and in how many pairs does some other word (not 'Obama') follow 'Barack'?" In this case, there are 7 pairs that include 'Barack', but in only one of them is it paired with 'Obama'.
Do the same for every possible word pair (e.g., "in how many pairs does 'candidate' follow 'the'?"), and you've got a basis for comparison.
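In R, those counts can be checked directly with combn() (a toy example, ignoring punctuation and case):

tokens <- c("Barack", "Obama", "is", "the", "Democratic", "candidate", "for", "President")
pairs  <- combn(tokens, 2)   # all choose(8, 2) = 28 unordered pairs
ncol(pairs)                                                          # 28
sum(apply(pairs, 2, function(p) "Barack" %in% p))                    # 7 pairs contain 'Barack'
sum(apply(pairs, 2, function(p) all(c("Barack", "Obama") %in% p)))   # only 1 of them also has 'Obama'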
So, this is a bit different from standard fantasy football. What I have is a list of players, their average "points per game" (PPG) and their salary. I want to maximize points per game under the constraint that my team does not exceed a salary cap. A team consists of 1 QB, 1 TE, 3 WRs, and 2 RBs. So, if we have 15 players at each position, there are 15 * 15 * choose(15, 3) * choose(15, 2) = 10,749,375 possible teams.
Pretty computationally complex. I can use a bit of branch and bound, i.e. once a team has surpassed the salary cap I can prune that branch, but even with that the algorithm is still pretty slow. I tried another option where I used a "genetic algorithm": I made 10 random teams, picked the best one and "mutated" it (randomly changing some of the players) into another 10 teams, picked the best of those, and so on, looping until the points per game of the "best team" stopped improving.
There must be a better way to do this. I'm not a computer scientist and I've only taken an intro course in algorithmics. Programmers - what are your thoughts? I have a feeling that some sort of application of dynamic programming could help.
Thanks
I think a genetic algorithm, intelligently implemented, will yield an acceptable result for you. You might want to use a metric like points per salary dollar rather than straight PPG to decide the best team; this way you are inherently measuring value added. Also, you should consider running the full algorithm/mutation to satisfactory completion numerous times, so that you can identify which players consistently show up in the final outcomes. Those players should then be valued above others.
Of course, the problem with the genetic approach is that you need a good mutation algorithm, and that is highly dependent on how you want to implement it.
Take i to be the number of players considered so far (out of n players) and j to be the remaining salary, and take m[i, j] to be the best total PPG achievable with the first i players and budget j.
Then m[i, 0] = 0 and m[0, j] = 0, and

m[i, j] = m[i - 1, j]    if the salary of player i is greater than j
m[i, j] = max(m[i - 1, j], m[i - 1, j - salary of player i] + PPG of player i)    otherwise
Sorry that I don't know R but I'm good with algorithms so I hope this helps.
A further optimization you can make is that you really only need two rows of m[i, j], because the DP only uses the current row and the previous row (you can save memory this way).
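Since the answer is in pseudocode, here is one possible R rendering of that recurrence (not the answerer's code; it assumes integer salaries and, like the recurrence above, handles only the salary constraint, not the position requirements):

knapsack_ppg <- function(salary, ppg, cap) {
  n <- length(salary)
  m <- matrix(0, nrow = n + 1, ncol = cap + 1)  # m[i + 1, j + 1]: best PPG with first i players, budget j
  for (i in 1:n) {
    for (j in 0:cap) {
      m[i + 1, j + 1] <- if (salary[i] > j) {
        m[i, j + 1]
      } else {
        max(m[i, j + 1], m[i, j + 1 - salary[i]] + ppg[i])
      }
    }
  }
  m[n + 1, cap + 1]
}

# toy data
knapsack_ppg(salary = c(5, 4, 6, 3), ppg = c(10, 40, 30, 50), cap = 10)  # best total PPG under the cap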
First of all, the number of variations you provided should not be right: the best way to build a team is to limit each position to the required count, and there is absolutely no sense in moving 3 players of the same position around between themselves.
Cristiano Ronaldo, Suarez and Messi will give you the same sum of fantasy points in any line-up, e.g.:
Cristiano Ronaldo, Suarez and Messi
or
Suarez, Cristiano Ronaldo and Messi
or
Messi, Suarez, Ronaldo
First step: simplify the variation possibilities.
Next step: calculate the average price, and build the team one player at a time by adding the player with the lower salary but higher points. When you reach the salary limit, remove an expensive player and add a cheaper one with the same fantasy points, and so on. Don't enumerate the variations; value the weight of each player by a combination of salary and fantasy points.
Does this help? It sets up the constraints and maximises points.
You could adapt it to get the data out of Excel.
http://pena.lt/y/2014/07/24/mathematically-optimising-fantasy-football-teams
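If you would rather stay in R, the same kind of formulation can be sketched with lpSolve (my own rough version, not the code from the linked post; players is an assumed data frame with columns name, pos, ppg and salary):

library(lpSolve)

build_team <- function(players, cap) {
  con <- rbind(
    players$salary,                    # total salary <= cap
    as.integer(players$pos == "QB"),   # exactly 1 QB
    as.integer(players$pos == "TE"),   # exactly 1 TE
    as.integer(players$pos == "WR"),   # exactly 3 WRs
    as.integer(players$pos == "RB")    # exactly 2 RBs
  )
  sol <- lp(direction    = "max",
            objective.in = players$ppg,
            const.mat    = con,
            const.dir    = c("<=", "=", "=", "=", "="),
            const.rhs    = c(cap, 1, 1, 3, 2),
            all.bin      = TRUE)       # each player is either picked (1) or not (0)
  players[sol$solution > 0.5, ]
}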