I'm running the apriori algorithm like this:
rules <-apriori(dt)
inspect(rules)
where dt is my data.frame with this format:
> head(dt)
Cus T C B
1: C1 0 1 1
2: C2 0 1 0
3: C3 0 1 0
4: C4 0 1 0
5: C5 0 1 0
6: C6 0 1 1
The idea of the data set is to capture the customer and whether he\she bought three different items (T, C and B) on a particular purchase. For example, based on the information above, we can see that C1 bought C and B; customers C2 to C5 bought only C and customer C6 bought only C and B.
the output is the following:
lhs rhs support confidence lift
1 {} => {T=0} 0.90 0.9000000 1.0000000
2 {} => {C=1} 0.91 0.9100000 1.0000000
3 {B=0} => {T=0} 0.40 0.8163265 0.9070295
4 {B=0} => {C=1} 0.40 0.8163265 0.8970621
5 {B=1} => {T=0} 0.50 0.9803922 1.0893246
6 {B=1} => {C=1} 0.51 1.0000000 1.0989011
My questions are:
1) how can I get rid of rules where T,C or B are equal to 0. If you think about it, the rule {B=0} => {T=0} or even {B=1} => {T=0} doesn't really make sense.
2)I was reading about the apriori algorithm and in most of the examples, each line represents the actual transactions so in my case, it should be something like:
C,B
C
C
C
C
C, B
instead of my sets of ones and zeros, is that a rule? Or can I still work with my format?
Thanks
Not sure what the aim of the program is supposed to be, but the aim of the Apriori algorithm is first to extract frequent itemsets of a given data, in which frequent itemsets are a certain quantity of items which often appear as such quantity in the data. And second to generate of those extracted frequent itemsets association rules. An association rule looks for example like this:
B -> C
Which in the stated case means, that customers who bought B buys C too to a certain probability. Whereby the probability is determined by the support and confidence level of the Apriori algorithm. The support level regulates the amount of frequent itemsets and the confidence level the amount of association rules. Association rules over the confidence are called strong association rules.
Do not understand against this backdrop why for the determination whether a customer bought different articles the Apriori algorithm is used. This could be answered by an if statement. And the provided output makes no sense in this context. The output says for example for the third line that if a customer does not buy B then he buys not T with a support of 40% and a confidence of 81.6%. Apart of that association rules does not have a support, only the association rule B -> C is correct, but it's confidence value wrong.
Nevertheless, if the aim is to generate described association rules the original Apriori cannot operate an input in this format:
> head(dt)
Cus T C B
1: C1 0 1 1
2: C2 0 1 0
3: C3 0 1 0
4: C4 0 1 0
5: C5 0 1 0
6: C6 0 1 1
For the uncustomized Apriori algorithm a data set needs this format:
> head(dt)
C1: {B, C}
C2: {C}
C3: {C}
C4: {C}
C5: {C}
C6: {B, C}
See two solutions: Either to format the input wherever or to customize the Apriori algorithm to this format what would be argubaly a change of the input format within the algorithm. To clarify the need of the stated input format, the Apriori algorithm in a nutshell with the provided data:
Support level = 0.3
Confidence level = 0.3
Number of customers = 6
Total number of B's bought = 2
Total number of C's bought = 6
Support of B = 2 / 6 = 0.3 >= 0.3 = support level
Support of C = 6 / 6 = 1 >= 0.3 = support level
Support of B, C = 2 / 6 = 0.3 >= 0.3 = support level
-> Frequent itemsets = {B, C, BC}
-> Association rules = {B -> C}
Confidence of B -> C = 2 / 2 = 1 >= 0.3 = confidence level
-> Strong association rules = {B -> C}
Hope this helps.
Related
I am looking for a method for calculating similarity score for list of numbers. Ideally the method should give result in fixed range. For example from 0 to 1 where 0 is not similar at all and 1 means all numbers are identical.
For clarity let me provide a few examples:
0 1 2 3 4 5 6 7 8 9 10 => the similarity should be 0 or close to zero as all numbers are different
1 1 1 1 1 1 1 => 1
10 9 11 10.5 => close to 1
1 1 1 1 1 1 1 1 1 1 100 => score should be still pretty high as only the last value is different
I have tried to calculate the similarity based on normalization and average, but that gives me really bad results when there is one 'bad number'.
Thank you.
Similarity tests are always incredibly subjective, and the right one to use depends heavily on what you're trying to use it for. We already have three typical measures of central tendency (mean, median, mode). It's hard to say what test will work for you because there are different ways of measuring that will do what you're asking, but have wildly different measures for other lists (like [1]*7 + [100] * 7). Here's one solution:
import statistics as stats
def tester(ell):
mode_measure = 1 - len(set(ell))/len(ell)
avg_measure = 1 - stats.stdev(ell)/stats.mean(ell)
return max(avg_measure, mode_measure)
I am using principal component analysis (PCA) based on ~30 variables to compose an index that classifies individuals in 3 different categories (top, middle, bottom) in R.
I have a dataframe of ~2000 individuals with 28 binary and 2 continuous variables.
Now, I would like to use the loading factors from PC1 to construct an
index that classifies my 2000 individuals for these 30 variables in 3 different groups.
Problem: Despite extensive research, I could not find out how to extract the loading factors from PCA_loadings, give each individual a score (based on the loadings of the 30 variables), which would subsequently allow me to rank each individual (for further classification). Does it make sense to display the loading factors in a graph?
I've performed the following steps:
a) Ran a PCA using PCA_outcome <- prcomp(na.omit(df1), scale = T)
b) Extracted the loadings using PCA_loadings <- PCA_outcome$rotation
c) Removed all the variables for which the loading factors were close to 0.
I have considered creating 30 new variable, one for each loading factor, which I would sum up for each binary variable == 1 (though, I am not sure how to proceed with the continuous variables). Consequently, I would assign each individual a score. However, I would not know how to assemble the 30 values from the loading factors to a score for each individual.
R code
df1 <- read.table(text="
educ call house merge_id school members
A 1 0 1 12_3 0 0.9
B 0 0 0 13_3 1 0.8
C 1 1 1 14_3 0 1.1
D 0 0 0 15_3 1 0.8
E 1 1 1 16_3 3 3.2", header=T)
## Run PCA
PCA_outcome <- prcomp(na.omit(df1), scale = T)
## Extract loadings
PCA_loadings <- PCA_outcome$rotation
## Explanation: A-E are 5 of the 2000 individuals and the variables (education, call, house, school, members) represent my 30 variables (binary and continuous).
Expected results:
- Get a rank score for each individual
- Subsequently, assign a category 1-3 to each individual.
I'm not 100% sure what you're asking, but here's an answer to the question I think you're asking.
First of all, PC1 of a PCA won't necessarily provide you with an index of socio-economic status. As explained here, PC1 simply "accounts for as much of the variability in the data as possible". PC1 may well work as a good metric for socio-economic status for your data set, but you'll have to critically examine the loadings and see if this makes sense. Depending on the signs of the loadings, it could be that a very negative PC1 corresponds to a very positive socio-economic status. As I say: look at the results with a critical eye. An explanation of how PC scores are calculated can be found here. Anyway, that's a discussion that belongs on Cross Validated, so let's get to the code.
It sounds like you want to perform the PCA, pull out PC1, and associate it with your original data frame (and merge_ids). If that's your goal, here's a solution.
# Create data frame
df <- read.table(text = "educ call house merge_id school members
A 1 0 1 12_3 0 0.9
B 0 0 0 13_3 1 0.8
C 1 1 1 14_3 0 1.1
D 0 0 0 15_3 1 0.8
E 1 1 1 16_3 3 3.2", header = TRUE)
# Perform PCA
PCA <- prcomp(df[, names(df) != "merge_id"], scale = TRUE, center = TRUE)
# Add PC1
df$PC1 <- PCA$x[, 1]
# Look at new data frame
print(df)
#> educ call house merge_id school members PC1
#> A 1 0 1 12_3 0 0.9 0.1000145
#> B 0 0 0 13_3 1 0.8 1.6610864
#> C 1 1 1 14_3 0 1.1 -0.8882381
#> D 0 0 0 15_3 1 0.8 1.6610864
#> E 1 1 1 16_3 3 3.2 -2.5339491
Created on 2019-05-30 by the reprex package (v0.2.1.9000)
As you say you have to use PCA, I'm assuming this is for a homework question, so I'd recommend reading up on PCA so that you get a feel of what it does and what it's useful for.
I will present my question in two ways. First, requesting a solution for a task; and second, as a description of my overall objective (in case I am overthinking this and there is an easier solution).
1) Task Solution
Data context: each row contains four price variables (columns) representing (a) the price at which the respondent feels the product is too cheap; (b) the price that is perceived as a bargain; (c) the price that is perceived as expensive; (d) the price that is too expensive to purchase.
## mock data set
a<-c(1,5,3,4,5)
b<-c(6,6,5,6,8)
c<-c(7,8,8,10,9)
d<-c(8,10,9,11,12)
df<-as.data.frame(cbind(a,b,c,d))
## result
# a b c d
#1 1 6 7 8
#2 5 6 8 10
#3 3 5 8 9
#4 4 6 10 11
#5 5 8 9 12
Task Objective: The goal is to create a single column in a new data frame that lists all of the unique values contained in a, b, c, and d.
price
#1 1
#2 3
#3 4
#4 5
#5 6
...
#12 12
My initial thought was to use rbind() and unique()...
price<-rbind(df$a,df$b,df$c,df$d)
price<-unique(price)
...expecting that a, b, c and d would stack vertically.
[Pseudo illustration]
a[1]
a[2]
a[...]
a[n]
b[1]
b[2]
b[...]
b[n]
etc.
Instead, the "columns" are treated as rows and stacked horizontally.
V1 V2 V3 V4 V5
1 1 5 3 4 5
2 6 6 5 6 8
3 7 8 8 10 9
4 8 10 9 11 12
How may I stack a, b, c and d such that price consists of only one column ("V1") that contains all twenty responses? (The unique part I can handle separately afterwards).
2) Overall Objective: The Bigger Picture
Ultimately, I want to create a cumulative share of population for each price (too cheap, bargain, expensive, too expensive) at each price point (defined by the unique values described above). For example, what percentage of respondents felt $1 was too cheap, what percentage felt $3 or less was too cheap, etc.
The cumulative shares for bargain and expensive are later inverted to become not.bargain and not.expensive and the four vectors reside in a data frame like this:
buckets too.cheap not.bargain not.expensive too.expensive
1 0.01 to 0.50 0.000000000 1 1 0
2 0.51 to 1.00 0.000000000 1 1 0
3 1.01 to 1.50 0.000000000 1 1 0
4 1.51 to 2.00 0.000000000 1 1 0
5 2.01 to 2.50 0.001041667 1 1 0
6 2.51 to 3.00 0.001041667 1 1 0
...
from which I may plot something that looks like this:
Above, I accomplished my plotting objective using defined price buckets ($0.50 ranges) and the hist() function.
However, the intersections of these lines have meanings and I want to calculate the exact price at which any of the lines cross. This is difficult when the x-axis is defined by price range buckets instead of a specific value; hence the desire to switch to exact values and the need to generate the unique price variable.
[Postscript: This analysis is based on Peter Van Westendorp's Price Sensitivity Meter (https://en.wikipedia.org/wiki/Van_Westendorp%27s_Price_Sensitivity_Meter) which has known practical limitations but is relevant in the context of my research which will explore consumer perceptions of value under different treatments rather than defining an actual real-world price. I mention this for two reasons 1) to provide greater insight into my objective in case another approach comes to mind, and 2) to keep the thread focused on the mechanics rather than whether or not the Price Sensitivity Meter should be used.]
We can unlist the data.frame to a vector and get the sorted unique elements
sort(unique(unlist(df)))
When we do an rbind, it creates a matrix and unique of matrix calls the unique.matrix
methods('unique')
#[1] unique.array unique.bibentry* unique.data.frame unique.data.table* unique.default unique.IDate* unique.ITime*
#[8] unique.matrix unique.numeric_version unique.POSIXlt unique.warnings
which loops through the rows as the default MARGIN is 1 and then looks for unique elements. Instead, if we use the 'price', either as.vector or c(price) converts into vector
sort(unique(c(price)))
#[1] 1 3 4 5 6 7 8 9 10 11 12
If we use unique.default
sort(unique.default(price))
#[1] 1 3 4 5 6 7 8 9 10 11 12
I have data which is in this format:
User Item
1 A
1 B
1 C
1 D
2 A
2 C
2 E
What I want to get is a frequency count for each pair. Order is not important so I don't want to count the inverse. I want to end up with a result similar to this, where the frequency counts are partitioned by user.
Pair Frequency
AB 1
AC 2
AD 1
AE 1
BC 1
BD 1
BE 0
CD 1
CE 1
What tool can I use to formulate this kind of table? I'd prefer some open source solution if possible.
Edit- Added example for my comment below
I'm reading in data from a CSV file using the following two lines and removing the factors with these two steps in code.
xa<-read.csv("C:/Direcotry/MyData.csv")
xa<-data.frame(lapply(xa, as.character), stringsAsFactors=FALSE)
User Item
1 394324 Item A
2 124209 Item B
3 212457 Item C
4 427052 Item A
5 118281 Item D
6 156831 Item A
7 212442 Item E
8 156831 Item B
9 212442 Item A
10 177734 Item C
When I try running suggested answer, I get an error with this result:
Error in combn(x, 2) : n < m
Well R is open source.
Here's an example based on your tiny sample of data:
Here I just read your data in by copypasting it straight from your post:
> xa=read.table(stdin(),header=TRUE,as.is=TRUE)
0: User Item
1: 1 A
2: 1 B
3: 1 C
4: 1 D
5: 2 A
6: 2 C
7: 2 E
8:
So that's the data in. Then with a couple of lines of code:
> f=function(x) apply(combn(x,2),2,paste0,collapse="")
> table(unlist(tapply(xa$Item,xa$User,f)))
AB AC AD AE BC BD CD CE
1 2 1 1 1 1 1 1
If you need all the empty combinations explicitly as zeroes it takes another line or two (you need to generate all the possible combinations as a factor, rather than just the observed ones and tell table to include the empty ones).
After some research and suggestions by Glen, I came up with the following code which gets me a 3 column CSV file with the pair combination plus frequency count. If anyone sees a better way, let me know! But this seems to work.
The errors I was referring to in my follow up comments were caused by users having purchased only at one location.
library(reshape2)
xa<-read.csv("C:/Input.csv",as.is=TRUE)
xa=xa[!duplicated(xa),]
xa<-data.table(xa)
setkey(xa,ContactId,PurchaseLocation)
tab=table(xa$ContactId)
xa=xa[xa$ContactId %in% names(tab[tab>1]),]
f=function(x) apply(combn(x,2),2,paste0,collapse="--")
xb<-as.data.frame(table(unlist(tapply(xa$PurchaseLocation,xa$ContactId,f))))
xc=with(xb, cbind(Freq, colsplit(xb$Var1, pattern = "--", names = c('a', 'b'))))
xc=subset(xc,a!=b & a!="" & b!="" & Freq>1)
write.csv(xc,file="C:/Output.csv")
Edit- I made a small change to make it order independent by sorting the data table on a key.
So reading through this paper:
http://www.cs.nyu.edu/~mohri/pub/fla.pdf
I see that a weighted finite state transducer (WFST) is a semiring, and many operations on WFST can be expressed in terms of "sum" and "product" over the semiring. For example, composition of Transducers one and two is:
(T1 ◦ T2)(x, y) = ⊕ z∈∆∗ T1(x, z)⊗T2(z, y)
But I can't seem to find an explanation on how do pure sum and product of WFST, and am having trouble backing out the operation from the composition example above.
A demonstration over this example would be much appreciated:
format: state1 state2, input alphabet : output alphabet, transition prob
T1
0 1 a : b, 0.1
0 2 b : b, 0.2
2 3 b : b, 0.3
0 0 a : a, 0.4
1 3 b : a, 0.5
T2
0 1 b : a, 0.1
1 2 b : a, 0.2
1 1 a : d, 0.3
1 2 a : c, 0.4
Example taken from: How to perform FST (Finite State Transducer) composition
--------------- update ------------
Found the answer in this document: http://www.cs.nyu.edu/~mohri/pub/hwa.pdf
page 12