How to calculate similarity of numbers (in list) - math

I am looking for a method for calculating a similarity score for a list of numbers. Ideally the method should give a result in a fixed range, for example from 0 to 1, where 0 means not similar at all and 1 means all numbers are identical.
For clarity, let me provide a few examples:
0 1 2 3 4 5 6 7 8 9 10 => the similarity should be 0 or close to zero, as all numbers are different
1 1 1 1 1 1 1 => 1
10 9 11 10.5 => close to 1
1 1 1 1 1 1 1 1 1 1 100 => the score should still be pretty high, as only the last value is different
I have tried to calculate the similarity based on normalization and averaging, but that gives me really bad results when there is one 'bad number'.
Thank you.

Similarity tests are always incredibly subjective, and the right one to use depends heavily on what you're trying to use it for. We already have three typical measures of central tendency (mean, median, mode). It's hard to say which test will work for you, because different ways of measuring will do what you're asking here but give wildly different scores for other lists (like [1]*7 + [100]*7). Here's one solution:
import statistics as stats

def tester(ell):
    # share of duplicated values: 1.0 when every value is identical
    mode_measure = 1 - len(set(ell)) / len(ell)
    # relative spread: near 1.0 when values cluster tightly around the mean
    # (can go negative for wild outliers, in which case mode_measure wins)
    avg_measure = 1 - stats.stdev(ell) / stats.mean(ell)
    return max(avg_measure, mode_measure)
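For reference, this reproduces the behaviour asked for in the question (values rounded):
print(tester([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))  # ~0.34, all values different
print(tester([1, 1, 1, 1, 1, 1, 1]))               # 1.0, all values identical
print(tester([10, 9, 11, 10.5]))                   # ~0.92, close to 1
print(tester([1] * 10 + [100]))                    # ~0.82, one outlier but still high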

Related

R: Pairwise Matrix Manipulation & Variable Construction with Many Groups

I'm starting with data of scores at the "group-person" level as follows:
group_id person_id score
1 1 3
1 2 1
1 3 5
2 1 3
2 2 3
2 3 6
The goal is to generate data on person-person pairs that looks like the following:
person_id1 person_id2 sumsquarederror
1 2 4
1 3 13
2 3 25
where the "sumsquarederror" variable is defined as the sum across all groups of the squared differences in score values for each possible pair of persons. In mathspeak, this variable would be defined like: for persons i=1 and i=2 and groups j=(1,...,J)
sumsquarederror(i=1,i=2) = sum_j (( score(i=1) - score(i=2) )^2)
Building this data is trivial with small numbers of groups and persons, but I have roughly 1,000 groups and 150,000 persons, so creating matrices/dataframes for all combinations possible quickly becomes computationally burdensome (=150K by 150K by 1K, before collapsing to the sumsquarederror variable)
I'm guessing there might be some linear algebra approaches or regression-type ideas, but am stumped. Any tips or tricks or useful packages would be greatly appreciated!
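One observation that may help (a sketch, assuming complete data, i.e. every person is scored in every group): reshape the scores into a person-by-group matrix; the squared Euclidean distance between two rows is then exactly the sum across groups of squared score differences, so base R's dist() does the collapsing without ever building a 150K by 150K by 1K array:
df <- data.frame(group_id  = c(1, 1, 1, 2, 2, 2),
                 person_id = c(1, 2, 3, 1, 2, 3),
                 score     = c(3, 1, 5, 3, 3, 6))
# person-by-group score matrix (rows = persons, columns = groups)
m <- as.matrix(xtabs(score ~ person_id + group_id, data = df))
# squared Euclidean distance = sum over groups of squared differences
sse <- as.matrix(dist(m))^2
sse  # sse[1, 2] == 4, sse[1, 3] == 13, sse[2, 3] == 25
Even so, 150,000 persons means roughly 11 billion distinct pairs, so at full scale you would want to compute the distances block by block rather than in one call.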

Calculating the mean of compass directions which cross 360/0 degrees

Given a data set of compass directions like this:
A<-c(1,1,1,2,2,2,3,3,3,4,4,4)
CompDir<-c(350,358,355,358,2,356,180,173,170,2,3,359)
DF<-data.frame(A,CompDir)
If I want to take an average by group:
aggregate(DF[,2],list(DF$A),mean)
I run into trouble when I cross the 360/0 threshold.
Group.1 x
1 1 354.3333
2 2 238.6667
3 3 174.3333
4 4 121.3333
The means for groups 2 and 4 are incorrect, so how do you correctly calculate means for this kind of directional data?
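The usual fix for directional data is the circular (vector) mean: convert each bearing to a unit vector, average the vectors, and convert the resultant back to an angle. A minimal base-R sketch, assuming bearings in degrees (packages such as circular offer a ready-made version):
circ_mean <- function(deg) {
  rad <- deg * pi / 180
  # average the unit vectors, then convert the resultant back to degrees
  m <- atan2(mean(sin(rad)), mean(cos(rad))) * 180 / pi
  (m + 360) %% 360  # map the result into [0, 360)
}
aggregate(DF[, 2], list(DF$A), FUN = circ_mean)
# groups 2 and 4 now come out near 358.7 and 1.3 instead of 238.7 and 121.3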

Sum variables conditionally with loop in r

I realize this is a topic that's covered somewhat well but I couldn't find anything that approaches this specific concern:
I have a df with 800 columns: 10 iterations of 80 columns (each column represents an item). Each column is named something like 1_BL_PRE.1, 1_FU_PRE.1, or 1_BL_POST.1, where the first '1' indicates the item number and the final '1' indicates the iteration number.
What I'm trying to figure out is how to get the sums of specific groups of items from all 10 iterations.
As a short example let's say I want to take the 1st and 3rd item of BL_PRE and get the sum of all 10 iterations for those 2 items - how would I do this?
subject 1_BL_PRE.1 2_BL_PRE.1 3_BL_PRE.1 1_BL_PRE.2 2_BL_PRE.2
1 40002 3 4 3 1 2
2 40004 1 2 3 4 4
3 40006 4 3 3 3 1
4 40008 2 3 1 2 3
5 40009 3 4 1 2 3
Expected output (where A represents the sum of 1_BL_PRE.1, 3_BL_PRE.1, 1_BL_PRE.2 and so on):
subject BL_PRE_A
1 40002 12
2 40004 14
3 40006 15
4 40008 20
5 40009 12
My hunch is the solution is related to a for-loop or lapply (and I'm not familiar at all with either). I'm trying to work with apply(finaldata,1,function(x) {sum(x ...)}) but I haven't been able to figure out the conditional statement inside the sum function.
If there's an implementation with plyr I'd be really curious to see what that looks like. (and if there's a thread that answers this, apologies and just re-direct!)
**Edited to include a small example + the code I'm trying to get to work
Thanks!
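A loop-free sketch in base R: select the matching column names with a regular expression and let rowSums do the adding. This assumes the data frame is called finaldata (as in the apply attempt above) and that the naming scheme is exactly item_BL_PRE.iteration:
items <- c(1, 3)  # the items you want to sum
# matches 1_BL_PRE.1, 3_BL_PRE.1, 1_BL_PRE.2, ... across all 10 iterations
pattern <- paste0("^(", paste(items, collapse = "|"), ")_BL_PRE\\.")
cols <- grep(pattern, names(finaldata), value = TRUE)
finaldata$BL_PRE_A <- rowSums(finaldata[, cols])
The same pattern-building works for the FU and POST variants by swapping the suffix inside the pattern.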

Probability of account win/loss using Bayesian Statistics

I am trying to estimate the probability of winning or losing an account, and I'd like to do this using Bayesian Methods. I'm not really that familiar with these methods, but I think I understand the general idea.
I know some information about losses and wins. Wins are usually characterized by some combination of activities; losses are usually characterized by a different combination of activities. I'd like to be able to get some posterior probability of whether or not a new observation will be won or lost, based on the current number of activities that are associated with that account.
Here is an example of my data: (This is just a sample for simplicity)
Email Call Callback Outcome
14 9 2 1
3 2 4 0
16 14 2 0
15 1 3 1
5 2 2 0
1 1 0 0
10 3 5 0
2 0 1 0
17 8 4 1
3 15 2 0
17 1 3 0
10 7 5 0
10 2 3 0
8 0 0 1
14 10 3 0
1 9 3 1
5 10 3 1
13 5 1 0
9 4 4 0
So from here I know that 30% of the observations have an outcome of 1 (win) and 70% have an outcome of 0 (loss). Let's say that I want to use the other columns to get a probability of win/loss for a new observation which may have a small number of events (emails, calls, and callbacks) associated with it.
Now let's say that I want to use the counts/proportions of the different events as priors for a new observation. This is where I start getting tripped up. My thinking is to create two separate Dirichlet distributions, one for wins and one for losses, using the counts/proportions of events for each outcome as the priors. I'm just not sure how to do this in R. I think my course of action would be to estimate a Dirichlet distribution (since I have 3 variables) for each outcome using maximum likelihood. I've been trying to use the dirichlet.simul and dirichlet.mle functions from the sirt package in R, but I'm not sure whether I need to simulate data first.
Another issue: once I have these distributions, it's unclear to me how to get a posterior probability for a new observation. I've read several papers and can't seem to find a straightforward process for doing this (or maybe there are some holes in my understanding). Any pushes in the right direction would be greatly appreciated.
This is the code I've tried so far:
### FOR WON ACCOUNTS
set.seed(789)
N <- 6
probs <- c(0.535714286, 0.330357143, 0.133928571 )
alpha <- probs
alpha <- matrix( alpha , nrow=N , ncol=length(alpha) , byrow=TRUE )
x <- dirichlet.simul( alpha )
dirichlet.mle(x)
$alpha
[1] 0.3385607 0.2617939 0.1972898
$alpha0
[1] 0.7976444
$xsi
[1] 0.4244507 0.3282088 0.2473405
### FOR LOST ACCOUNTS
set.seed(789)
N2 <- 14
probs2 <- c(0.528037383,0.308411215,0.163551402 )
alpha2 <- probs2
alpha2 <- matrix( alpha2 , nrow=N2 , ncol=length(alpha2) , byrow=TRUE )
x2 <- dirichlet.simul( alpha2 )
dirichlet.mle(x2)
$alpha
[1] 0.3388486 0.2488771 0.2358043
$alpha0
[1] 0.8235301
$xsi
[1] 0.4114587 0.3022077 0.2863336
I'm not sure if this is a correct approach, or how to get posteriors from here. I realize all the outputs look similar across won/lost accounts; I just used some simulated data to represent what I'm working with.
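As for getting a posterior for a new account, one simple route (a sketch of one possible approach, not the only correct one) is a two-class Bayes rule with multinomial likelihoods, treating the per-outcome event proportions (your probs vectors, or the $xsi means from dirichlet.mle) as class-conditional probabilities and the 30%/70% win/loss rates as the prior. The counts for the new account below are made up:
# P(win | counts) by Bayes' rule with class-conditional multinomial likelihoods
post_win <- function(counts, p_win, theta_win, theta_loss) {
  ll_win  <- dmultinom(counts, prob = theta_win,  log = TRUE)
  ll_loss <- dmultinom(counts, prob = theta_loss, log = TRUE)
  1 / (1 + exp(ll_loss + log(1 - p_win) - ll_win - log(p_win)))
}
theta_win  <- c(0.535714286, 0.330357143, 0.133928571)  # email, call, callback | win
theta_loss <- c(0.528037383, 0.308411215, 0.163551402)  # email, call, callback | loss
# hypothetical new account with 5 emails, 2 calls and 1 callback
post_win(c(5, 2, 1), p_win = 0.3, theta_win, theta_loss)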

New calculation loop

I want to have a loop that will perform a calculation for me, and export the variable (along with identifying information) into a new data frame.
My data look like this:
Each unique sampling point (UNIQUE) has 4 data points associated with it (they differ by WAVE).
WAVE REFLECT REFEREN PLOT LOCAT COMCOMP DATE UNIQUE
1 679.9 119 0 1 1 1 11.16.12 1
2 799.9 119 0 1 1 1 11.16.12 1
3 899.8 117 0 1 1 1 11.16.12 1
4 970.3 113 0 1 1 1 11.16.12 1
5 679.9 914 31504 1 2 1 11.16.12 2
6 799.9 1693 25194 1 2 1 11.16.12 2
And I want to create a new data frame that will look like this:
For each unique sampling point, I want to calculate "WBI" from 2 specific "WAVE" measurements.
WBI PLOT .... UNIQUE
(WAVE==899.8/WAVE==970) 1 1
(WAVE==899.8/WAVE==970) 1 2
(WAVE==899.8/WAVE==970) 1 3
Depending on the size of your input data.frame there could be better solutions in terms of efficiency, but the following should work fine for small or medium data sets, and is fairly simple:
out.unique = unique(input$UNIQUE)
out.plot = sapply(out.unique, simplify = TRUE, function(uq) {
  # assuming that PLOT is simply the first PLOT value among the rows
  # belonging to that unique number; if not, you should change this
  subset(input, subset = UNIQUE == uq)$PLOT[1]
})
out.wbi = sapply(out.unique, simplify = TRUE, function(uq) {
  # not sure how you compose WBI, but I assume it uses the last two
  # records with that unique number, so it matches the first row of
  # your expected output
  uq.subset = subset(input, subset = UNIQUE == uq)
  uq.nrow = nrow(uq.subset)
  paste("(WAVE=", uq.subset$WAVE[uq.nrow - 1], "/WAVE=", uq.subset$WAVE[uq.nrow], ")", sep = "")
})
output = data.frame(WBI = out.wbi, PLOT = out.plot, UNIQUE = out.unique)
If the input data is big, however, you may want to exploit the fact that records seem to be sorted by "UNIQUE", since repeatedly subsetting the data.frame is costly. Both sapply calls could also be combined into one, but that makes the code a bit more cumbersome, so I have left it like this.
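If WBI is meant to be the numeric ratio of the REFLECT values at the two wavelengths (an assumption; the question only shows which WAVE values are involved), here is a sketch that subsets each point only once:
# assumes every UNIQUE point has exactly one row at WAVE == 899.8
# and one at WAVE == 970.3
wbi.list <- by(input, input$UNIQUE, function(d) {
  data.frame(WBI    = d$REFLECT[d$WAVE == 899.8] / d$REFLECT[d$WAVE == 970.3],
             PLOT   = d$PLOT[1],
             UNIQUE = d$UNIQUE[1])
})
output <- do.call(rbind, wbi.list)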
