Measures of dissimilarity (distance) between character vectors in R

I have a seemingly easy question which is nevertheless troubling me a bit.
I have pairs of vectors made up of nominal attributes. They can be of different lengths, and sometimes some of the attributes in one are not included in the other. See a and b as two potential examples.
a
1 mathematician
2 engineer
3 mathematician
4 mathematician
5 mathematician
6 engineer
7 mathematician
8 mathematician
9 mathematician
10 mathematician
11 mathematician
12 engineer
13 mathematician
14 mathematician
15 engineer
b
1 physicist
2 surgeon
3 physicist
4 surgeon
5 physicist
6 physicist
7 surgeon
8 surgeon
9 physicist
10 physicist
11 mathematician
Do you have in mind a measure (an index) that could summarize the dissimilarity between them? The type of measure I am looking for is something like the Euclidean distance, but for qualitative vectors.
One option I thought of is to compute the Euclidean distance between the categorical vectors after first transforming them into frequency vectors. That way they would become quantitative and would have the same length. But my question is: do you find this a sound approach?
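To make this concrete, here is a minimal sketch of what I have in mind (rebuilding a and b from the listings above, order ignored; normalising with prop.table() over the union of categories is just one way to line the two vectors up):
a <- c(rep("mathematician", 11), rep("engineer", 4))
b <- c(rep("physicist", 6), rep("surgeon", 4), "mathematician")
# Tabulate both vectors over the union of their categories,
# convert to relative frequencies, then take the Euclidean distance
lev <- union(a, b)
fa <- prop.table(table(factor(a, levels = lev)))
fb <- prop.table(table(factor(b, levels = lev)))
sqrt(sum((fa - fb)^2))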
More generally, is there an R package that tackles this type of distance? Can you suggest other distances suitable for nominal variables?
Many thanks!

I've only come across the unalikeability coefficient.
http://www.amstat.org/publications/jse/v15n2/kader.html
Weird name, intuitive approach, and incredibly simple implementation. For example:
> table(a)
a
     engineer mathematician
            4            11
> unalike(table(a))
[1] 0.391
> table(b)
b
mathematician     physicist       surgeon
            1             6             4
> unalike(table(b))
[1] 0.562
It is clear just from eyeballing that b is the more mixed of the two, and this coefficient puts a number on it.
There are some examples in the paper which I will calculate for you here:
> unalike(3,7)
[1] 0.42
> unalike(5,5)
[1] 0.5
> unalike(1,9)
[1] 0.18
The formula in this function is based on the paper I linked to above:
unalike <- function(...) {
    # Unalikeability: 1 minus the sum of squared category proportions
    props <- c(...)
    zzz <- 1 - sum((props / sum(props))^2)
    round(zzz, 3)
}
Let me know how your thing goes since this is a small side project for me as well.

I am not sure this is a programming question, because you do not yet know exactly what you want to do, so we can't offer a concrete solution. The main question here is what you are going to use this measure for: dissimilarity can be measured in a lot of different ways, and some will be good for what you want while others will not.
But trying to answer anyway: there is the utils::adist function, and there is also a package called stringdist (these are the ones I have used before). They may not be quite what you want, based on your question, because they measure the distance between individual character strings, not between whole vectors. But you could use them as building blocks for a distance between the two vectors; for example, one measure could be how many changes you would have to make to vector a so that it turns into vector b.
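For instance, a small sketch at the level of individual strings (toy vectors; stringdist is on CRAN):
library(stringdist)
a <- c("mathematician", "engineer", "mathematician")
b <- c("physicist", "surgeon", "physicist")
adist(a, b)                      # base utils: matrix of edit distances between all pairs
stringdist(a, b, method = "lv")  # element-wise Levenshtein distances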

Thank you for keeping this open.
One option, which appears to have become available after this discussion, is R's qualvar package (Gombin). The package provides functions for each of Wilcox's (1967, 1973) indices of qualitative variation. Included with the package is a useful vignette summarizing implementation and results. In my limited experience, index selection requires some brute-force testing with actual and simulated data.
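To give a flavour: Wilcox's index of qualitative variation, for instance, is the unalikeability coefficient from the earlier answer rescaled by its maximum, so a hand-rolled sketch (my own code, not qualvar's API) is:
iqv <- function(x) {
  # K/(K-1) * (1 - sum of squared proportions); needs at least two observed categories
  p <- prop.table(table(x))
  k <- length(p)
  (k / (k - 1)) * (1 - sum(p^2))
}
iqv(b)  # 1 for a uniform spread over the observed categories, 0 if all values are identical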

Related

Solving binomial distribution question in R language

Insurance policies are sold to 10 different people aged 25-30 years, all in good health. The probability that a person in similar condition will survive more than 25 years is 4/5. Calculate the probability that within 25 years at most 2 of them will die. Perform this calculation in R syntax without using any direct built-in function.
n <- 10
p_live <- 4/5
p_notlive <- 0.2
Pzerodie <- combn(n,0)*(p_live^0)*(p_notlive^n-0)
print(Pzerodie)
I will do the same for P(one dies) and P(two die), then add all three variables. The code above should print 1.024 * 10^-7 for Pzerodie, but instead it prints [,1]. Can anyone guide me? Thanks
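For what it's worth, the [,1] comes from combn(n, 0), which returns a matrix of combinations (here with zero rows) rather than a count; choose() gives the binomial coefficient as a plain number. A minimal fix of the first term (note the parentheses around the exponent, which matter once k > 0):
n <- 10
p_live <- 4/5
p_notlive <- 1 - p_live
# choose(n, k) is the binomial coefficient; parenthesise the exponent (n - k)
Pzerodie <- choose(n, 0) * p_live^0 * p_notlive^(n - 0)
print(Pzerodie)  # 1.024e-07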

Using a column that contains a frequency/weight/count in R [closed]

This is an easy question to ask, but a hard one to search for: "frequency" is used all over the place. I tried a synonym ("weight"), but since mtcars is so widely used I get a lot of false positives as well. The same goes for "counts".
I'm looking at datasets::HairEyeColor, partly reproduced here:
Hair Eye Sex Freq
1 Black Brown Male 32
2 Brown Brown Male 53
3 Red Brown Male 10
4 Blond Brown Male 3
5 Black Blue Male 11
6 Brown Blue Male 50
7 Red Blue Male 10
8 Blond Blue Male 30
9 Black Hazel Male 10
10 Brown Hazel Male 25
.
.
.
I came across this when trying to show someone how to make a mosaic plot of any two of Hair, Eye, and Sex. On first read, I didn't see a way to tell the function "this row represents 32 of the set members", but I didn't read too carefully.
I suppose I could reshape the data using melt() and reshape() every time I receive data with a frequency column, but that seems kind of drastic.
In other languages I know, I could add a parameter to the fitting function to let it know "there's not just one row with this combination of levels, there are n of them." So if I wanted to see a distribution, I might say
DISTR(Y=Hair, FREQ=freq)
...which would generate a histogram or density plot with n values per row
Alternately,
lm(hair ~ eye + sex, data = HairEyeColor, freq = Freq)
would fit a linear model with 32 replications of the first row rather than 1.
I’m asking about a way to use the 32 in the first row (for example) to tell the modeling or graphing function that there are 32 cases with this combination of levels, 53 with the combination in the second row, etc.
Surely this kind of data shows up a lot; I see it all the time. In other tools there's usually a way to say that the number specifies the frequency that the row represents in the actual data: rather than a data table with 32 rows of Black/Brown/Male, there's one row with frequency 32.
(No plyr please.)
No, there is not a standard way to use this type of data across all of R.
Many of the basic modeling functions, e.g., lm, glm, nls, loess, and more from the stats package accept a weights argument that will meet your needs. prop.test accepts data in either format. But many other modeling functions do not, e.g., knn, princomp, and many others not in base R.
barplot accepts input in either format. mosaicplot expects input as an aggregated contingency table. Other types of plots would require more custom handling, because there are a lot of different things you could do with frequency.
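For instance, a sketch with the HairEyeColor data from the question (treating Sex as the glm response is purely for illustration):
hec <- as.data.frame(HairEyeColor)  # columns: Hair, Eye, Sex, Freq
# Weighted fit: `weights` makes each row count Freq times
fit <- glm(Sex ~ Hair + Eye, family = binomial, data = hec, weights = Freq)
# mosaicplot wants a contingency table; xtabs() rebuilds one from the Freq column
mosaicplot(xtabs(Freq ~ Hair + Eye, data = hec))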
Of course, anything not in base R is up to whoever writes it.
ggplot2 (which is not base R) generally handles this really well, e.g., geom_bar will stack up values by default, or in the case of scatterplots you could map size or color or alpha to visually convey the intensity.
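A sketch of the ggplot2 route, using the weight aesthetic that geom_bar/stat_count understand:
library(ggplot2)
# Each row is counted Freq times when the bars are stacked up
ggplot(as.data.frame(HairEyeColor), aes(x = Hair, fill = Eye, weight = Freq)) +
  geom_bar()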
randomForest, on the other hand, does not accept case weights (xgboost does, via the weight field of an xgb.DMatrix).
I will say that I very rarely find this to be a problem. I'd encourage you to ask specific questions about methods where it is causing you issues. I think mosaicplot is a bad example as it expects a contingency table, so the problem would be the opposite: using it with disaggregated data would require first aggregating it up to a frequency table.

Fisher test more than 2 groups

Major Edit:
I decided to rewrite this question since my original was poorly put. I will leave the original question below to maintain a record. Basically, I need to run Fisher's test on tables as big as 4 x 5 with around 200 observations. It turns out that this is often a major computational challenge, as explained here (I think; I can't follow it completely). As I use both R and Stata, I will frame the question for both, with some made-up data.
Stata:
tabi 1 13 3 27 46 \ 25 0 2 5 3 \ 22 2 0 3 0 \ 19 34 3 8 1 , exact(10)
You can increase exact() to 1000 max (but it will take maybe a day before returning an error).
R:
Job <- matrix(c(1,13,3,27,46, 25,0,2,5,3, 22,2,0,3,0, 19,34,3,8,1), 4, 5,
dimnames = list(income = c("< 15k", "15-25k", "25-40k", ">40k"),
satisfaction = c("VeryD", "LittleD", "ModerateS", "VeryS", "exstatic")))
fisher.test(Job)
For me, at least, it errors out in both programs. So the question is: how can this calculation be done in either Stata or R?
Original Question:
I have Stata and R to play with.
I have a dataset with various categorical variables, some of which have multiple categories.
Therefore I'd like to do Fisher's exact test with more than 2 x 2 categories
i.e. apply Fisher's to a 2 x 6 table or a 4 x 4 table.
Can this be done with either R or Stata ?
Edit: whilst this can be done in Stata, it will not work for my dataset, as I have too many categories. Stata goes through endless iterations, and even being left for a day or more does not produce a solution.
My question is really - can R do this, and can it do it quickly ?
Have you studied the documentation of R function fisher.test? Quoting from help("fisher.test"):
For 2 by 2 cases, p-values are obtained directly using the (central or
non-central) hypergeometric distribution. Otherwise, computations are
based on a C version of the FORTRAN subroutine FEXACT which implements
the network algorithm developed by Mehta and Patel (1986) and improved
by Clarkson, Fan and Joe (1993).
This is an example given in the documentation:
Job <- matrix(c(1,2,1,0, 3,3,6,1, 10,10,14,9, 6,7,12,11), 4, 4,
dimnames = list(income = c("< 15k", "15-25k", "25-40k", "> 40k"),
satisfaction = c("VeryD", "LittleD", "ModerateS", "VeryS")))
fisher.test(Job)
# Fisher's Exact Test for Count Data
#
# data: Job
# p-value = 0.7827
# alternative hypothesis: two.sided
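For a table as large as the 4 x 5 one in the question, where the exact computation runs out of workspace, ?fisher.test also documents two escape hatches (shown here on the question's Job matrix):
# Enlarge the workspace for the network algorithm (may still be slow)
fisher.test(Job, workspace = 2e8)
# Or fall back on a Monte Carlo p-value, which is fast for big sparse tables
fisher.test(Job, simulate.p.value = TRUE, B = 1e5)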
As far as Stata is concerned, your original statement was totally incorrect. search fisher leads quickly to help tabulate twoway, where the help for the exact option explains that it may be applied to r x c as well as to 2 x 2 tables, and the very first example there of Fisher's exact test underlines that Stata is not limited to 2 x 2 tables.
It's a minimal expectation anywhere on this site that you try to read basic documentation. Please!

Testing recurrences and orders in strings matlab

I have observed nurses during 400 episodes of care and recorded the sequence of surface contacts in each.
I categorised the surfaces into 5 groups (1:5) and calculated the probability of touching each one (stored in PDF):
PDF=[ 0.255202629 0.186199343 0.104052574 0.201533406 0.253012048]
I then predicted some 1000 sequences using:
for i = 1:1000  % 1000 different nurses
    seq(i,:) = randsample(1:5, max(observed_seq_length), true, PDF);
end
e.g.
seq = 1 5 2 3 4 2 5 5 2 5
stairs(1:max(observed_seq_length), seq)
hold all
I'd like to compare my empirical sequences with my predicted ones. What would you suggest as the best strategy or property to look at?
EDIT: I added r as a tag, as this may well fall under that category due to the nature of the question rather than the MATLAB code.
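Since r is now a tag: the equivalent simulation there is a one-liner with sample(), which may help if the comparison ends up being done in R (object names are mine; 10 stands in for max(observed_seq_length)):
pdf_probs <- c(0.2552, 0.1862, 0.1041, 0.2015, 0.2530)
# 1000 simulated nurses, one sequence of surface contacts per row
sims <- t(replicate(1000, sample(1:5, 10, replace = TRUE, prob = pdf_probs)))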

simple rank formula

I'm looking for a mathematical ranking formula.
Sample is
  2008 2009 2010
A    5    6    4
B    6    7    5
C    7    8    2
I want to add a rank column for each period code field
                      rank
  2008 2009 2010   2008 2009 2010
B    6    7    5      2    1    1
A    5    6    4      3    2    2
C    7    2    2      1    3    3
Please do not reply with methods that loop through the rows and columns, incrementing the rank value as they go; that's easy. I'm looking for a formula, much like finding the percent of total (item / total). I know I've seen this before but am having a tough time locating it.
Thanks in advance!
sort((letters_col, number_col), descending by number_col)
As efficient as your sorting algorithm.
Then number the rows, of course.
Edit
I was rather upset by your comment ("please don't upvote this answer, sorting and looping is not what I'm asking for; I specifically stated this in my original question") and by the negative votes, because, as you may have noted from the various answers received, it is basically correct.
However, I kept pondering where and how you might "have seen this before".
Well, I think I have the answer: you saw this in Excel.
Look at this:
[Excel screenshot: the data with a RANK() formula column, sorted by column H]
This is the result after entering the formulas and sorting by column H. It's exactly what you want ...
What are you using? If you're using Excel, you're looking for RANK(num, ref).
=RANK(B2,B$2:B$9)
I don't know of any programming language that has that built in; it would always require a loop of some form.
If you want the rank of a single element, you can do it in O(n) by looping through the elements, counting how many have a value above the given element, and adding 1.
If you want the ranks of all the elements, the best (and really the only) way is to sort the elements. Anything else you do will be equivalent to sorting; there is no closed-form "formula".
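For what it's worth, R does have this built in: rank(), with a negated argument for descending order, gives the ranks without writing the loop yourself (it still sorts internally, of course):
x <- c(A = 5, B = 6, C = 7)  # the 2008 column
rank(-x)                     # descending ranks: A 3, B 2, C 1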
Are you using T-SQL? T-SQL's RANK() may do what you want.
