I've got a dataset of monthly metrics for different stores. Each store has three monthly metrics (total sales, customers, and transaction count), and my task is to find the store that most closely matches a specific test store (e.g. Store 77) over a year.
In other words, over the year the test store and its most similar store should show similar performance. How do I go about finding the most similar store? I've currently used Euclidean distance, but I'd like to know if there's a better approach.
Thanks in advance
Store  Month   Metric 1
22     Jan-18  10
23     Jan-18  20
Is correlation a better way to measure similarity in this case than distance? I'm fairly new to data analysis, so any resources where I can learn more about this would be much appreciated!
In general, deciding the similarity of items is domain-specific, i.e. it depends on the problem you are trying to solve, so there is no one-size-fits-all solution. Nevertheless, there is a basic procedure you can follow when tackling this kind of problem.
Case 1 - only distance matters:
If you want to find the most similar items (stores, in our case) using a distance measure, it's a good tactic to first scale your features in some way.
Example (min-max normalization):

Store  Month   Total sales  Total sales (normalized)
1      Jan-18  50           0.64
2      Jan-18  40           0.45
3      Jan-18  70           1
4      Jan-18  15           0
After you apply normalization to all attributes, you can calculate the Euclidean distance, or any other metric you think fits your data.
Some resources:
Similarity measures
Feature scaling
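The normalize-then-measure procedure above can be sketched in R (a minimal sketch; the metric columns and values are made up for illustration, with store 1 playing the role of the test store):

```r
# toy data: one row per store (hypothetical column names and values)
stores <- data.frame(
  store       = c(1, 2, 3, 4),
  total_sales = c(50, 40, 70, 15),
  customers   = c(200, 180, 260, 90)
)

# min-max normalization of every metric column
norm01 <- function(x) (x - min(x)) / (max(x) - min(x))
scaled <- as.data.frame(lapply(stores[-1], norm01))

# pairwise Euclidean distances; column 1 holds distances to store 1
d <- as.matrix(dist(scaled))[, 1]
stores$store[order(d)]   # stores ranked by similarity to store 1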
Case 2 - Trend matters:
Now, say you want to find similarity over the whole year. If your definition of similarity is just the state of the stores at the end of the year, then distance will do the job.
But if you want to find similar trends of increase/decrease in the attributes of two stores, then distance measures conceal this information. You would have to use correlation metrics, or some other technique more sophisticated than plain distance.
Simple example:
To keep it simple, let's say we are interested in a 3-month analysis and use only the (unscaled) sales attribute:
Store  Month   Total sales
1      Jan-18  20
1      Feb-18  20
1      Mar-18  20
2      Jan-18  5
2      Feb-18  15
2      Mar-18  40
3      Jan-18  10
3      Feb-18  30
3      Mar-18  78
At the end of March, Store 1 and Store 2 are identical in terms of distance, both having 60 total sales.
But as far as month-over-month growth is concerned, Store 2 and Store 3 are our match: from January to February both tripled their sales, and from February to March they grew by factors of about 2.67 and 2.6 respectively.
Bottom line: It really depends on what you want to quantify.
Well-known correlation metrics:
Pearson correlation coefficient
Spearman correlation coefficient
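Using the 3-month table above, the difference between the two notions of similarity is easy to see in R:

```r
s1 <- c(20, 20, 20)   # Store 1: flat
s2 <- c(5, 15, 40)    # Store 2: growing
s3 <- c(10, 30, 78)   # Store 3: growing at a very similar rate

sum(s1) == sum(s2)    # TRUE: identical totals, so "end of year" distance is 0
cor(s2, s3)           # close to 1: nearly identical trends
```

Note that cor(s1, s2) returns NA with a warning, because Store 1's sales have zero variance; Pearson correlation is undefined for a constant series.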
I am having trouble with the findCorrelation() function. Here are my input and output:
>findCorrelation(cor(subset(segdata, select=-c(56))),cutoff=0.9)
[1] 16 17 14 15 30 51 31 25 40
>cor(segdata)[c(16,17,14,15,30,51,31,25,40),c(16,17,14,15,30,51,31,25,40)]
(the resulting correlation matrix was posted as an image)
I deleted column 56 because it is a factor variable.
In the code above I use cutoff=0.9, which I understood to mean that only variables whose correlation is greater than or equal to 0.9 are returned.
But in the resulting image, the last variable (P12002900) has a very low correlation. With cutoff=0.9, low-correlation variables such as P12002900 should not be output. Why is it printed?
To give a reproducible example, I also tried the Vehicle dataset that ships with R's mlbench package:
>library(mlbench)
>library(caret)
>data(Vehicle)
>findCorrelation(cor(subset(Vehicle,select=-c(Class))),cutoff=0.9)
[1] 3 8 11 7 9 2
>cor(subset(Vehicle,select=-c(Class)))[c(3,8,11,7,9,2),c(3,8,11,7,9,2)]
Here is the result (again posted as an image).
The last variable (Circ) has a correlation lower than 0.9, but it is still printed.
Please help me... thank you for your help!
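For what it's worth, caret's findCorrelation() returns columns recommended for *removal*: for each pair above the cutoff it flags the member with the larger mean absolute correlation, so the returned columns need not all be highly correlated with one another. Its verbose argument can be used as a diagnostic to see which pair triggered each flag:

```r
library(mlbench)
library(caret)
data(Vehicle)

corr <- cor(subset(Vehicle, select = -c(Class)))
# verbose = TRUE prints, for each pair above the cutoff, which of the
# two columns gets flagged for removal
findCorrelation(corr, cutoff = 0.9, verbose = TRUE)
```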
I need to find the sector with the lowest frequency in my data frame. Using min() gives the minimum number of occurrences, but I would like to obtain the corresponding sector name, so in this case I would like it to print "Consumer Staples". I keep getting the frequency rather than the sector name. Is there a way to do this?
Thank you.
sector_count <- count(portfolio, "Sector")
sector_count
Sector freq
1 Consumer Discretionary 5
2 Consumer Staples 1
3 Health Care 2
4 Industrials 3
5 Information Technology 4
min(sector_count$freq)
[1] 1
You want
sector_count$Sector[which.min(sector_count$freq)]
which.min(sector_count$freq) returns the index (row) at which the minimum value is found; sector_count$Sector is then subset to the corresponding value.
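A self-contained version of the example above (stringsAsFactors is set explicitly so the result is a character string on any R version):

```r
sector_count <- data.frame(
  Sector = c("Consumer Discretionary", "Consumer Staples", "Health Care",
             "Industrials", "Information Technology"),
  freq   = c(5, 1, 2, 3, 4),
  stringsAsFactors = FALSE
)

sector_count$Sector[which.min(sector_count$freq)]
# "Consumer Staples"
```

If there are ties for the minimum, which.min() returns the first one.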
Please help me interpret the results of the SPADE frequent sequence mining algorithm (http://www.inside-r.org/packages/cran/arulesSequences/docs/cspade).
With support = 0.05:
s1 <- cspade(x, parameter = list(support = 0.05), control = list(verbose = TRUE))
I get, for example, these sequences:
4 <{C},{V}> 0.15644023
5 <{C,V}> 0.73127376
Looks like these are the same sequences, aren't they? How does <{C},{V}> semantically differ from <{C,V}>? Any real-life examples?
From the SPADE paper (M. J. Zaki. (2001). SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine Learning Journal, 42, 31--60):
"An input-sequence C is said to contain another sequence A, if A is a subsequence of the input-sequence C. The support or frequency of a sequence is the total number of input-sequences in the database D that contain A."
Then, for example, if:
sequence support
1 <{C}> 1.00000000
Does it mean that sequence <{C}> is contained in all sequences in database D, correct?
Complete output that I get from my data:
> as(s1, "data.frame")
sequence support
1 <{C}> 1.00000000
2 <{L}> 0.20468120
3 <{V}> 0.73127376
4 <{C},{V}> 0.15644023
5 <{C,V}> 0.73127376
6 <{L,V}> 0.07882027
7 <{V},{V}> 0.13343431
8 <{C,V},{V}> 0.13343431
9 <{C},{C},{V}> 0.05558572
10 <{C,L,V}> 0.07882027
11 <{V},{C,V}> 0.13343431
12 <{C},{C,V}> 0.15644023
13 <{C,V},{C,V}> 0.13343431
14 <{C},{C},{C,V}> 0.05558572
15 <{C},{L}> 0.05738619
16 <{C,L}> 0.20468120
17 <{C},{C,L}> 0.05738619
18 <{C},{C}> 0.22128547
19 <{L},{C}> 0.06233031
20 <{V},{C}> 0.16921494
21 <{V},{V},{C}> 0.05047012
22 <{V},{C},{C}> 0.06233031
23 <{C,V},{C}> 0.16921494
24 <{C},{V},{C}> 0.05781487
25 <{C,V},{V},{C}> 0.05047012
26 <{V},{C,V},{C}> 0.05047012
27 <{C},{C,V},{C}> 0.05781487
28 <{C,V},{C,V},{C}> 0.05047012
29 <{C,L},{C}> 0.06233031
30 <{C},{C},{C}> 0.07882027
31 <{C,V},{C},{C}> 0.06233031
> summary(s1)
set of 31 sequences with
most frequent items:
C V L (Other)
27 22 8 8
most frequent elements:
{C} {V} {C,V} {L} {C,L} (Other)
21 12 12 3 3 2
element (sequence) size distribution:
sizes
1 2 3
7 13 11
sequence length distribution:
lengths
1 2 3 4 5
3 9 12 6 1
summary of quality measures:
support
Min. :0.05047
1st Qu.:0.05760
Median :0.07882
Mean :0.17121
3rd Qu.:0.16283
Max. :1.00000
includes transaction ID lists: FALSE
mining info:
data ntransactions nsequences support
x 61000 34991 0.05
>
When using the SPADE algorithm, remember that you are also dealing with temporal data (i.e. you know the order or time of occurrence of the items).
Looks like these are the same sequences, aren't they? How <{C},{V}>
semantically differs from <{C,V}> ? Any real life examples?
In your example, <{C},{V}> means that item C occurred first and then item V; <{C,V}> means that items C and V occurred at the same time.
Then, for example, if:
sequence support
1 <{C}> 1.00000000
Does it mean that sequence <{C}> is contained in all sequences in
database D, correct?
An item with support value of 1 means that it happened (in a market basket analysis example) in ALL transactions.
Hope this helps.
Looks like these are the same sequences, aren't they? How <{C},{V}>
semantically differs from <{C,V}> ? Any real life examples?
As user2552108 pointed out, {C,V} implies that C and V occurred at the same time. In practice, this can be used to encode multi-dimensional sequential data. For example, suppose C was Canada and V was Vancouver. Then an element could have been something like:
[{C,V,M,peanut,butter,maple_syrup}, ... , {}]
In this case, your frequent item-sets can contain not only single-element sets such as {C}, {V}, {U}, {W}, or {X}, but also sets of length > 1 (items that appeared simultaneously, i.e. at the same time).
For this reason, the elements in transactions/sequences are defined as sets rather than single items.
Does it mean that sequence <{C}> is contained in all sequences in
database D, correct?
That's correct!
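The distinction can be reproduced with a tiny dataset (a sketch, assuming arulesSequences is installed; the basket-file format follows the package's bundled zaki.txt example: sequence ID, event ID, item count, then the items):

```r
library(arulesSequences)

# sequence 1: C and V in the same event; sequence 2: C first, then V
tmp <- tempfile()
writeLines(c("1 10 2 C V",
             "2 10 1 C",
             "2 20 1 V"), tmp)
trans <- read_baskets(tmp, info = c("sequenceID", "eventID", "SIZE"))

s <- cspade(trans, parameter = list(support = 0.5))
as(s, "data.frame")
# <{C,V}> gets support 0.5 (contained only in sequence 1), while
# <{C},{V}> also gets support 0.5 (contained only in sequence 2)
```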
I have a dataset that I need to sort by participant (RECORDING_SESSION_LABEL) and by trial_number. However, none of the sort functions I have tried put the variables in the numeric order I want. The participant variable comes out fine, but the trial ID variable comes out in the wrong order.
using:
fix_rep[order(as.numeric(RECORDING_SESSION_LABEL), as.numeric(trial_number)),]
Participant ID comes out as:
118 118 118 etc. 211 211 211 etc. 306 306 306 etc.(which is fine)
trial_number comes out as:
1 1 10 10 11 11 12 12 13 13 14 14 15 15 16 16 17 17 18 18 19 19 2 2 20 20 .... (which is not what I want - it seems to be sorting lexically rather than numerically)
What I would like is trial_number to be order like this within each participant number:
1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11 11 ....
I have checked that these variables are not factors and are numeric, and I also tried without the as.numeric(), but with no joy. Looking around, I saw suggestions that sort() and mixedsort() might do the trick in place of order(), but both produce errors. I am slowly pulling my hair out over what I think should be a simple thing. Can anybody shed some light on how to get what I need?
Even though you claim it is not a factor, it behaves exactly as if it were one. Testing whether something is a factor can be tricky, since a factor is just an integer vector with a levels attribute and a class label. If it is a factor, your code needs a call to as.character() nested inside as.numeric():
fix_rep[order(as.numeric(RECORDING_SESSION_LABEL), as.numeric(as.character(trial_number))),]
To be really sure if it's a factor, I recommend the str() function:
str(trial_number)
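A quick demonstration of the trap, using values like your trial numbers:

```r
x <- factor(c("10", "2", "1"))

levels(x)                    # "1" "10" "2" -- levels sort lexically
as.numeric(x)                # 2 3 1       -- the underlying level codes
as.numeric(as.character(x))  # 10 2 1      -- the values you actually want
```

This is exactly why sorting a factor of numbers produces the 1, 10, 11, ..., 2, 20 ordering you observed.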
I think it may be worthwhile to design your own function in this case. It wouldn't be too hard: basically a bubble sort with a few alterations. You could convert each number to a string and begin by binning the strings by their number of digits (easily done by comparing string lengths). Then, within each bin, the numbers could be sorted by converting the digits back to a numeric type and comparing, starting from the least significant digit. If you're interested, I could come up with some code for this; however, it looks like the two answers above me have beaten me to the punch with built-in functions. I've never used those functions, so I'm not sure they'll work as you intend, but there's no use reinventing the wheel.