Measure similarity of objects over a period of time - math

I've got a dataset with monthly metrics for different stores. Each store has three monthly metrics (total sales, customers and transaction count). My task is, over a year, to find the store that most closely matches a specific test store (e.g. Store 77).
In other words, over the year both the test store and the most similar store need to show similar performance. My question is: how do I go about finding the most similar store? I've currently used Euclidean distance, but would like to know if there's a better way to go about it.
Thanks in advance
STORE   month    Metric 1
22      Jan-18   10
23      Jan-18   20
Is correlation a better way to measure similarity in this case compared to distance? I'm fairly new to data science, so if there are any resources where I can learn more about this stuff it would be much appreciated!!

In general, deciding on the similarity of items is domain-specific, i.e. it depends on the problem you are trying to solve. Therefore, there is no one-size-fits-all solution. Nevertheless, there is a basic procedure you can follow when trying to solve this kind of problem.
Case 1 - only distance matters:
If you want to find the most similar items (stores in our case) using a distance measure, it's a good tactic to firstly scale your features in some way.
Example (min-max normalization), where each value x becomes (x - min) / (max - min):

Store   Month    Total sales   Total sales (normalized)
1       Jan-18   50            0.64
2       Jan-18   40            0.45
3       Jan-18   70            1.00
4       Jan-18   15            0.00
After you apply normalization to all attributes, you can calculate the Euclidean distance or any other metric you think fits your data.
Some resources:
Similarity measures
Feature scaling
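As a concrete illustration of the procedure above, here is a minimal sketch in Python (an assumption — the question is language-agnostic), normalizing the sales column and computing a Euclidean distance between two hypothetical scaled store vectors:

```python
import numpy as np

def min_max_scale(x):
    """Scale a 1-D array to [0, 1] via (x - min) / (max - min)."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

# Total sales for the four stores in Jan-18
sales = np.array([50, 40, 70, 15])
scaled = min_max_scale(sales)  # ~[0.64, 0.45, 1.0, 0.0]

# Euclidean distance between two hypothetical scaled feature vectors
# (total sales, customers, transactions) for two stores
store_a = np.array([0.64, 0.30, 0.80])
store_b = np.array([0.45, 0.25, 0.75])
dist = np.linalg.norm(store_a - store_b)
```

To rank candidate stores against a test store such as Store 77, you would compute this distance between the test store's scaled vector and every other store's, and take the smallest.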
Case 2 - Trend matters:
Now, say that you want to measure similarity over the whole year. If similarity for your problem just means the state of the stores at the end of the year, then distance will do the job.
But if you want to find similar trends of increase/decrease in the attributes of two stores, then distance measures conceal this information. You would have to use correlation metrics or some other technique more sophisticated than plain distance.
Simple example:
To keep it simple, let's say we are interested in a 3-month analysis and use only the sales attribute (unscaled):
Store   Month    Total sales
1       Jan-18   20
1       Feb-18   20
1       Mar-18   20
2       Jan-18   5
2       Feb-18   15
2       Mar-18   40
3       Jan-18   10
3       Feb-18   30
3       Mar-18   78
At the end of March, in terms of distance Store 1 and Store 2 are identical, both having 60 total sales.
But as far as month-over-month growth is concerned, Store 2 and Store 3 are the match: both tripled their sales from January to February, and from February to March they grew by factors of 2.67 and 2.6 respectively.
Bottom line: It really depends on what you want to quantify.
Well-known correlation metrics:
Pearson correlation coefficient
Spearman correlation coefficient
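Using the 3-month table above, the trend match between Store 2 and Store 3 shows up directly in the correlation. A small sketch (Python with NumPy, purely as an illustration — any stats package exposes these coefficients):

```python
import numpy as np

# Monthly sales from the example: Store 2 and Store 3
store_2 = np.array([5, 15, 40])
store_3 = np.array([10, 30, 78])

# Pearson correlation of the two sales trajectories (close to 1 here)
r = np.corrcoef(store_2, store_3)[0, 1]

# Spearman-style rank correlation: correlate the ranks instead of the values
rank_2 = store_2.argsort().argsort()
rank_3 = store_3.argsort().argsort()
rho = np.corrcoef(rank_2, rank_3)[0, 1]
```

Note that Store 1's flat series (20, 20, 20) has zero variance, so its correlation with anything is undefined; constant series need special handling.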

Related

Anomaly Detection - Correlated Variables

I am working on an 'Anomaly' detection assignment in R. My dataset has around 30,000 records of which around 200 are anomalous. It has around 30 columns & all are quantitative. Some of the variables are highly correlated (~0.9). By anomaly I mean some of the records have unusual (high/low) values for some column(s) while some have correlated variables not behaving as expected. The below example will give some idea.
Suppose vehicle speed & heart rate are highly positively correlated. Usually vehicle speed varies between 40 & 60 while heart rate between 55-70.
time_s steering vehicle.speed running.distance heart_rate
0 -0.011734953 40 0.251867414 58
0.01 -0.011734953 50 0.251936555 61
0.02 -0.011734953 60 0.252005577 62
0.03 -0.011734953 60 0.252074778 90
0.04 -0.011734953 40 0.252074778 65
Here we have two types of anomalies. The 4th record has an exceptionally high value for heart_rate, while the 5th record seems okay if we look at individual columns. But since heart_rate increases with speed, we expected a lower heart rate for the 5th record, whereas we have a higher value.
I could identify the column-level anomalies using box plots etc. but find it hard to identify the second type. Somewhere I read about PCA-based anomaly detection but I couldn't find its implementation in R.
Will you please help me with PCA-based anomaly detection in R for this scenario? My Google search was mainly returning time-series-related material, which is not what I am looking for.
Note: There is a similar implementation in Microsoft Azure Machine Learning - 'PCA Based Anomaly Detection for Credit Risk' - which does the job, but I want to know the logic behind it and replicate the same in R.
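The question asks for R, but the logic behind PCA-based anomaly detection is easy to state in any language: project the data onto the leading principal components, reconstruct, and flag points with a large reconstruction error, since those are exactly the points that break the correlation structure. A sketch in Python with simulated data (purely illustrative — the same steps map onto R's prcomp):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate correlated columns, like vehicle speed and heart rate
speed = rng.uniform(40, 60, 500)
heart = 0.5 * speed + 35 + rng.normal(0, 1, 500)
X = np.column_stack([speed, heart])

# Inject a correlation-breaking anomaly: normal speed, abnormal heart rate
X = np.vstack([X, [45.0, 90.0]])

# PCA via SVD on the centered data
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

# Project onto the first principal component, then reconstruct
k = 1
recon = Xc @ Vt[:k].T @ Vt[:k]
recon_error = np.linalg.norm(Xc - recon, axis=1)

# Points far off the principal subspace break the correlation structure,
# even when each column value looks individually normal
threshold = recon_error.mean() + 3 * recon_error.std()
anomalies = np.where(recon_error > threshold)[0]
```

Here the injected point (row 500) has a huge reconstruction error even though both of its column values are individually plausible, which is exactly the second anomaly type described above.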

R Linear programming

Example 1.
Use R, in similar way as above, to solve the following problem:
The Handy-Dandy Company makes three types of kitchen appliances (A, B and C). To make each of these appliance types, just two inputs are required: labour and materials. Each unit of A made requires 7 hours of labour and 4 kg of materials; for each unit of B made the requirements are 3 hours of labour and 4 kg of materials, while for C the unit requirements are 6 hours of labour and 5 kg of material.
The company expects to make a profit of €40 for every unit of A sold, while the profits per unit for B and C are €20 and €30 respectively. Given that the company has available to it 150 hours of labour and 200 kg of material each day, formulate this as a linear programming problem.
x1 <- Rglpk_read_file("F:\\Linear_programming_R\\first.txt", type = "MathProg")
Rglpk_solve_LP(x1$objective, x1$constraints[[1]], x1$constraints[[2]], x1$constraints[[3]],
               x1$bounds, x1$types, x1$maximum)
Can someone explain to me what the 1, 2, 3 in double brackets mean? Thanks
Those access elements of a list: x1$constraints is a list and x1$constraints[[1]] is the first component of that list.
The $ operator accesses a named component of an object (e.g. an element of a list or a column of a data.frame). Have a look at a tutorial about data types in R, for example here
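For reference, the Handy-Dandy problem can also be solved without the MathProg file by writing the matrices out explicitly. A sketch using Python's scipy.optimize.linprog (an assumption — the original uses R's Rglpk):

```python
from scipy.optimize import linprog

# Maximize 40A + 20B + 30C  (linprog minimizes, so negate the objective)
c = [-40, -20, -30]

# Labour:   7A + 3B + 6C <= 150
# Material: 4A + 4B + 5C <= 200
A_ub = [[7, 3, 6],
        [4, 4, 5]]
b_ub = [150, 200]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 3)
```

The optimum turns out to be 50 units of B and none of A or C, for a daily profit of €1,000.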

CART Methodology for data with mutually exhaustive rows

I am trying to use CART to analyse a data set whose each row is a segment, for example
Segment_ID   Attribute_1   Attribute_2   Attribute_3   Attribute_4   Target
1            2             3             100           3             0.1
2            0             6             150           5             0.3
3            0             3             200           6             0.56
4            1             4             103           4             0.23
Each segment has a certain population from the base data (irrelevant to my final use).
I want to condense, for example in the above case, the 4 segments into 2 big segments, based on the 4 attributes and on the target variable. I am currently dealing with 15k segments and want only 10 segments with each of the final segment based on target and also having a sensible attribute distribution.
Now, pardon me if I am wrong, but CHAID on SPSS (if not using autogrow) will generally split the data in a 70:30 ratio, building the tree on 70% of the data and testing on the remaining 30%. I can't use this approach since I need all my segments in the data to be included. I essentially want to club these segments into a few big segments as explained before. My question is whether I can use CART (rpart in R) for the same. There is an explicit 'subset' option in the rpart function in R, but I am not sure whether omitting it will ensure CART uses 100% of my data. I am relatively new to R, hence a very basic question.
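On the 'use 100% of the data' point: rpart fits on every row you pass it unless you explicitly supply subset; any 70:30 validation split is something you would have to do yourself. The condensing idea itself — grow a tree on all segments and cap it at 10 leaves so each leaf becomes one big segment — can be sketched as follows (Python/scikit-learn here purely for illustration, with made-up data; the poster's setting is R):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Toy stand-in for the 15k segments: 4 attributes and a numeric target
X = rng.normal(size=(15000, 4))
y = 0.5 * X[:, 0] + rng.normal(scale=0.1, size=15000)

# Fit on ALL rows (no train/test split) and cap the tree at 10 leaves,
# so every original segment lands in one of at most 10 big segments
tree = DecisionTreeRegressor(max_leaf_nodes=10, random_state=0).fit(X, y)
big_segment = tree.apply(X)          # leaf id for every original segment
n_big = len(np.unique(big_segment))  # at most 10
```

In rpart the equivalent knobs are the complexity parameter cp and maxdepth in rpart.control, which bound the tree size rather than the leaf count directly.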

Testing recurrences and orders in strings matlab

I have observed nurses during 400 episodes of care and recorded the sequence of surface contacts in each.
I categorised the surfaces into 5 groups 1:5 and calculated the probability density functions of touching any one of 1:5 (PDF).
PDF=[ 0.255202629 0.186199343 0.104052574 0.201533406 0.253012048]
I then predicted some 1000 sequences using:
for i = 1:1000 % 1000 different nurses
    seq(i, :) = randsample(1:5, max(observed_seq_length), true, PDF);
end
eg.
seq = 1 5 2 3 4 2 5 5 2 5
stairs(1:max(observed_seq_length), seq)
hold all
I'd like to compare my empirical sequences with my predicted one. What would you suggest to be the best strategy or property to look at?
Regards,
EDIT: I put r as a tag as this may well fall more easily under that category due to the nature of the question rather than the matlab code.
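One concrete property to compare between the 400 empirical sequences and the 1000 simulated ones is the first-order transition structure: randsample draws each surface independently, so if the real contact sequences have memory (e.g. surface 5 tends to follow surface 2), the empirical transition matrix will differ from the simulated one. A sketch of estimating such a matrix (Python used for illustration; the question's code is MATLAB):

```python
import numpy as np

def transition_matrix(seq, n_states=5):
    """Estimate a row-stochastic transition matrix from a 1-based sequence."""
    counts = np.zeros((n_states, n_states))
    for a, b in zip(seq[:-1], seq[1:]):
        counts[a - 1, b - 1] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1  # avoid 0/0 for states never left
    return counts / row_sums

# The example sequence from the question
seq = [1, 5, 2, 3, 4, 2, 5, 5, 2, 5]
P = transition_matrix(seq)
# e.g. P[4, 1] is the estimated probability of moving from surface 5 to 2
```

Averaging these matrices over the empirical episodes and over the simulated ones, then comparing them (elementwise, or with a chi-square test on the transition counts) would show whether the independence assumption behind the simulation holds.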

Average percent

I'm dealing with some math problems here. We have the average sale and loss number in 3 stores.
Site      SALE     LOSS     % LOSS = (LOSS*100)/SALE
----------------------------------------------------
Store 1   474750   336740   70.92996314
Store 2   321920   247810   76.97875249
Store 3   149240   118440   79.36210131
----------------------------------------------------
Total     945910   702990   74.31890983
If I sum the loss percentages of stores 1, 2 and 3 and divide by 3, I get an average of 75.76.
But when I total SALE and LOSS and then calculate the percentage LOSS, I get 74.32.
Shouldn't the numbers match? Or is this the wrong way?
Thank you for all answers!
You can't average percentages when they're taken from different totals. Calculating the total sale and loss is the correct way to do the calculation.
See Averaging Percentages on the Ask Dr. Math forum.
It would be the same only if the sales were equal across the 3 stores.
In general:
(A+B+C)/(D+E+F) != (A/D + B/E + C/F)/3
But imagine equal sales, D = E = F:
(A+B+C)/(D+E+F) = (A+B+C)/(3D)
(A/D + B/E + C/F)/3 = (A/D + B/D + C/D)/3 = (A+B+C)/(3D)
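The difference is easy to check numerically with the figures from the table (Python used here just as a calculator):

```python
sales = [474750, 321920, 149240]
loss = [336740, 247810, 118440]

# Per-store loss percentages
pct = [100 * l / s for l, s in zip(loss, sales)]

# Unweighted mean of the percentages vs. the overall (sales-weighted) ratio
naive_mean = sum(pct) / len(pct)           # ~75.76
overall = 100 * sum(loss) / sum(sales)     # ~74.32
```

The naive mean over-weights the small store (Store 3), whose loss percentage is highest; the sales-weighted ratio is the meaningful overall loss figure.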
