Interpreting results of sequence mining with SPADE - r

Please help me interpret the results of the SPADE frequent sequence mining algorithm (http://www.inside-r.org/packages/cran/arulesSequences/docs/cspade)
With support = 0.05:
s1 <- cspade(x, parameter = list(support = 0.05), control = list(verbose = TRUE))
I get, for example, these sequences:
4 <{C},{V}> 0.15644023
5 <{C,V}> 0.73127376
Looks like these are the same sequences, aren't they? How does <{C},{V}> semantically differ from <{C,V}>? Any real-life examples?
From Spade paper (M. J. Zaki. (2001). SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine Learning Journal, 42, 31--60):
"An input-sequence C is said to contain another sequence A, if A is a
subsequence of the input-sequence C. The support or frequency of a sequence is the total number of input-sequences in the database D that contain A."
Then, for example, if:
sequence support
1 <{C}> 1.00000000
Does it mean that sequence <{C}> is contained in all sequences in database D, correct?
Complete output that I get from my data:
> as(s1, "data.frame")
sequence support
1 <{C}> 1.00000000
2 <{L}> 0.20468120
3 <{V}> 0.73127376
4 <{C},{V}> 0.15644023
5 <{C,V}> 0.73127376
6 <{L,V}> 0.07882027
7 <{V},{V}> 0.13343431
8 <{C,V},{V}> 0.13343431
9 <{C},{C},{V}> 0.05558572
10 <{C,L,V}> 0.07882027
11 <{V},{C,V}> 0.13343431
12 <{C},{C,V}> 0.15644023
13 <{C,V},{C,V}> 0.13343431
14 <{C},{C},{C,V}> 0.05558572
15 <{C},{L}> 0.05738619
16 <{C,L}> 0.20468120
17 <{C},{C,L}> 0.05738619
18 <{C},{C}> 0.22128547
19 <{L},{C}> 0.06233031
20 <{V},{C}> 0.16921494
21 <{V},{V},{C}> 0.05047012
22 <{V},{C},{C}> 0.06233031
23 <{C,V},{C}> 0.16921494
24 <{C},{V},{C}> 0.05781487
25 <{C,V},{V},{C}> 0.05047012
26 <{V},{C,V},{C}> 0.05047012
27 <{C},{C,V},{C}> 0.05781487
28 <{C,V},{C,V},{C}> 0.05047012
29 <{C,L},{C}> 0.06233031
30 <{C},{C},{C}> 0.07882027
31 <{C,V},{C},{C}> 0.06233031
> summary(s1)
set of 31 sequences with
most frequent items:
C V L (Other)
27 22 8 8
most frequent elements:
{C} {V} {C,V} {L} {C,L} (Other)
21 12 12 3 3 2
element (sequence) size distribution:
sizes
1 2 3
7 13 11
sequence length distribution:
lengths
1 2 3 4 5
3 9 12 6 1
summary of quality measures:
support
Min. :0.05047
1st Qu.:0.05760
Median :0.07882
Mean :0.17121
3rd Qu.:0.16283
Max. :1.00000
includes transaction ID lists: FALSE
mining info:
data ntransactions nsequences support
x 61000 34991 0.05

When using the SPADE algorithm, remember that you are also dealing with temporal data (i.e., you know the order or time of occurrence of each item).
Looks like these are the same sequences, aren't they? How does <{C},{V}>
semantically differ from <{C,V}>? Any real-life examples?
In your example, <{C},{V}> means that item C occurred first and then item V; <{C,V}> means that items C and V occurred at the same time.
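To make the difference concrete, here is a minimal sketch (the file contents, item labels and support threshold are made up for illustration): customer 1 buys C and V in the same visit, while customer 2 buys C on one visit and V on a later visit.
library(arulesSequences)
# Hypothetical toy data: each line is sequenceID, eventID, then the items
# bought in that event.
toy <- tempfile()
writeLines(c("1 10 C V",
             "2 10 C",
             "2 20 V"), toy)
x_toy <- read_baskets(toy, info = c("sequenceID", "eventID"))
as(cspade(x_toy, parameter = list(support = 0.1)), "data.frame")
# Expected: <{C,V}> is supported only by customer 1 (support 0.5), while
# <{C},{V}> is supported only by customer 2 (also support 0.5).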
Then, for example, if:
sequence support
1 <{C}> 1.00000000
Does it mean that sequence <{C}> is contained in all sequences in
database D, correct?
An item with a support value of 1 means that it occurred (in a market basket analysis example) in ALL input-sequences, i.e., in every customer's sequence of transactions.
Hope this helps.

Looks like these are the same sequences, aren't they? How does <{C},{V}>
semantically differ from <{C,V}>? Any real-life examples?
As user2552108 pointed out, {C,V} implies that C and V occurred at the same time. In practice this can be used to encode multi-dimensional sequential data. For example, suppose that C was Canada and V was Vancouver. Now this could have been something like:
[{C,V,M,peanut,butter,maple_syrup}, ... , {}]
In this case, the elements of your frequent sequences can be not only single-item sets such as {C}, {V}, {U}, {W}, or {X}, but also sets with length > 1 (the sets of items that appeared simultaneously, at the same time).
For this reason, the elements of transactions/sequences are defined as sets rather than single items.
Does it mean that sequence <{C}> is contained in all sequences in
database D, correct?
That's correct!

Related

Optimizing the sum of a variable in R given a constraint

Using the following dataset:
ID=c(1:24)
COST=c(85,109,90,104,107,87,99,95,82,112,105,89,101,93,111,83,113,81,97,97,91,103,86,108)
POINTS=c(113,96,111,85,94,105,105,95,107,88,113,100,96,89,89,93,100,92,109,90,101,114,112,109)
mydata=data.frame(ID,COST,POINTS)
I need an R function that will consider all combinations of rows where the sum of 'COST' is less than a fixed value - in this case, $500 - and make the optimal selection based on the summed 'POINTS'.
Your help is appreciated.
So since this post is still open I thought I would give my solution. These kinds of problems are always fun. You can try to brute-force the solution by checking all possible combinations (some 2^24, or over 16 million) one by one. This could be done by considering that, for each combination, an ID is either in it or not. Thinking in binary, you could use the following code, which was inspired by this post:
# DO NOT RUN THIS CODE -- roughly 16.7 million iterations
sum_points <- numeric(2^24)
for(i in 1:2^24)
  sum_points[i] <- ifelse(sum(as.numeric(intToBits(i))[1:24] * mydata$COST) < 500,
                          sum(as.numeric(intToBits(i))[1:24] * mydata$POINTS),
                          0)
I estimate this would take many hours to run. Improvements could be made with parallelization, etc., but still this is a rather intense calculation. This method will also not scale very well, as an increase by 1 (to 25 different IDs) would double the computation time. Another option would be to cheat a little. For example, we know that we have to stay under $500. If we added up the n cheapest items, at what n would we definitely be over $500?
which(cumsum(sort(mydata$COST))>500)
[1] 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
So if we choose any more than 5 IDs, we are definitely over $500. What else?
Well we can run a little code and take the max for that portion and see what that tells us.
sum_points <- 1:10000
for(i in 1:10000)
  sum_points[i] <- ifelse(sum(as.numeric(intToBits(i))[1:24]) < 6,
                          ifelse(sum(as.numeric(intToBits(i))[1:24] * mydata$COST) < 500,
                                 sum(as.numeric(intToBits(i))[1:24] * mydata$POINTS),
                                 0),
                          0)
sum_points[which.max(sum_points)]
[1] 549
So we have to try to get over 549 points with the remaining 2^24 - 10000 choices. But:
which(cumsum(rev(sort(mydata$POINTS)))<549)
[1] 1 2 3 4
Even if we sum the 4 highest point values, we still don't beat 549, so there is no reason to even search those. Further, the number of IDs chosen must be greater than 4 but less than 6, so 5 is the number to try. Instead of looking at all 16 million choices, we can just look at all of the ways to choose 5 out of 24, which happens to be 24 choose 5:
num <- 1:choose(24,5)
combs <- combn(24,5)
sum_points <- 1:length(num)
for(i in num)
  sum_points[i] <- ifelse(sum(mydata[combs[,i],]$COST) < 500,
                          sum(mydata[combs[,i],]$POINTS),
                          0)
which.max(sum_points)
[1] 2582
sum_points[2582]
[1] 563
We have a new max on the 2582nd iteration. To retrieve the IDs:
mydata[combs[,2582],]$ID
[1] 1 3 11 22 23
And to verify that nothing went wrong:
sum(mydata[combs[,2582],]$COST)
[1] 469 #less than 500
sum(mydata[combs[,2582],]$POINTS)
[1] 563 #what we expected.
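For what it's worth, the same 0/1 knapsack can also be solved exactly with integer programming. Here is a hedged sketch using the lpSolve package (assuming it is installed); its selection should agree with the enumeration above.
library(lpSolve)
# Maximize total POINTS subject to total COST <= 500, each ID chosen 0 or 1 times.
sol <- lp(direction = "max",
          objective.in = mydata$POINTS,
          const.mat = matrix(mydata$COST, nrow = 1),
          const.dir = "<=",
          const.rhs = 500,
          all.bin = TRUE)
mydata$ID[sol$solution == 1]  # selected IDs
sol$objval                    # total points; should match the 563 found above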

Grouping words that are similar

CompanyName <- c('Kraft', 'Kraft Foods', 'Kfraft', 'nestle', 'nestle usa', 'GM', 'general motors', 'the dow chemical company', 'Dow')
I want to get either:
CompanyName2
Kraft
Kraft
Kraft
nestle
nestle
general motors
general motors
Dow
Dow
But would be absolutely fine with:
CompanyName2
1
1
1
2
2
3
3
I see algorithms for getting the distance between two words, so if I had just one weird name I would compare it to all other names and pick the one with the lowest distance. But I have thousands of names and want to cluster them all into groups.
I do not know anything about elastic search, but would one of the functions in the elastic package or some other function help me out here?
I'm sorry there's no programming here. I know. But this is way out of my area of normal expertise.
Solution: use string distance
You're on the right track. Here is some R code to get you started:
install.packages("stringdist") # install this package
library("stringdist")
CompanyName <- c('Kraft', 'Kraft Foods', 'Kfraft', 'nestle', 'nestle usa', 'GM', 'general motors', 'the dow chemical company', 'Dow')
CompanyName = tolower(CompanyName) # otherwise case matters too much
# Calculate a string distance matrix; LCS is just one option
?"stringdist-metrics" # see others
sdm = stringdistmatrix(CompanyName, CompanyName, useNames=T, method="lcs")
Let's take a look. These are the calculated distances between strings, using the Longest Common Subsequence metric (try others, e.g. cosine, Levenshtein). They all measure, in essence, how many characters the strings have in common. Their pros and cons are beyond this Q&A. You might look into something that gives a higher similarity value to two strings that contain the exact same substring (like "dow").
sdm[1:5,1:5]
kraft kraft foods kfraft nestle nestle usa
kraft 0 6 1 9 13
kraft foods 6 0 7 15 15
kfraft 1 7 0 10 14
nestle 9 15 10 0 4
nestle usa 13 15 14 4 0
Some visualization
# Hierarchical clustering
sdm_dist = as.dist(sdm) # convert to a dist object (you essentially already have distances calculated)
plot(hclust(sdm_dist))
If you want to group them explicitly into k groups, use k-medoids.
library("cluster")
clusplot(pam(sdm_dist, 5), color=TRUE, shade=F, labels=2, lines=0)
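If you want the numeric group labels from the question rather than a plot, one follow-up sketch (k = 4 here is just an assumption matching the Kraft/nestle/GM/Dow example; the right k for real data is a judgment call) is to cut the hierarchical tree:
hc <- hclust(sdm_dist)
CompanyName2 <- cutree(hc, k = 4)  # assign each (lower-cased) name to one of k groups
data.frame(CompanyName, CompanyName2)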

Finding the max and identifying the corresponding name in another column

Hopefully someone can help me solve the following problem.
Here I have data about different birds and their maximum lengths:
a<-c("bird1","bird2","bird1","bird3","bird2","bird2")
b<-c(32,45,35,25,51,47)
c<-data.frame(animal=a,max=b)
animal max
1 bird1 32
2 bird2 45
3 bird1 35
4 bird3 25
5 bird2 51
6 bird2 47
My purpose is to identify the name of the animal which has the maximum length. I know that using max() and which.max() it is easy to identify the maximum length and the corresponding cell, but how can I get the name of the animal?
Any valuable comment will be helpful for me!
This will output the name of the bird with the highest maximum length.
Modification
a<-c("bird1","bird2","bird1","bird3","bird2","bird2")
b<-c(32,45,35,25,51,47)
compined_birds<-data.frame(animal=a,max=b)
compined_birds$animal[which.max(compined_birds$max)]
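As a small follow-up using the same which.max() idea, indexing the whole row returns both the animal and its length at once:
compined_birds[which.max(compined_birds$max), ]
#   animal max
# 5  bird2  51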

How to optimize for loops in extremely large dataframe

I have a dataframe "x" with 5.9 million rows and three columns: idnumber/integer, compdate/integer and judge/character, representing individual cases completed in an administrative court. The data was imported from a Stata dataset and the date field came in as integer, which is fine for my purposes. I want to create a caseload variable by calculating the number of cases completed by the judge within the 30-day window before the completion date of the case at issue.
Here are the first 34 rows of data:
idnumber compdate judge
1 9615 JVC
2 15316 BAN
3 15887 WLA
4 11968 WFN
5 15001 CLR
6 13914 IEB
7 14760 HSD
8 11063 RJD
9 10948 PPL
10 16502 BAN
11 15391 WCP
12 14587 LRD
13 10672 RTG
14 11864 JCW
15 15071 GMR
16 15082 PAM
17 11697 DLK
18 10660 ADP
19 13284 ECC
20 13052 JWR
21 15987 MAK
22 10105 HEA
23 14298 CLR
24 18154 MMT
25 10392 HEA
26 10157 ERH
27 9188 RBR
28 12173 JCW
29 10234 PAR
30 10437 ADP
31 11347 RDW
32 14032 JTZ
33 11876 AMC
34 11470 AMC
Here's what I came up with. For each record I take a subset of the data for that particular judge, then subset the cases decided in the 30-day window, and then assign the length of a vector in the subsetted dataframe to the caseload variable for the case at issue, as follows:
for(i in 1:length(x$idnumber)){
  e <- x$compdate[i]
  f <- e - 29
  a <- x[x$judge == x$judge[i] & !is.na(x$compdate), ]
  b <- a[a$compdate <= e & a$compdate >= f, ]
  x$caseload[i] <- length(b$idnumber)
}
It works, but it is taking extremely long to complete. How can I optimize this or do it more easily? Sorry, I'm very new to R and to programming -- I'm a law professor trying to analyze court data.... Your help is appreciated. Thanks.
Ken
You don't have to loop through every row. You can do operations on the entire column at once. First, create some data:
# Create some data.
n<-6e6 # cases
judges<-apply(combn(LETTERS,3),2,paste0,collapse='') # About 2600 judges
set.seed(1)
x<-data.frame(idnumber=1:n,judge=sample(judges,n,replace=TRUE),compdate=Sys.Date()+round(runif(n,1,120)))
Now, you can make a rolling window function, and run it on each judge.
# Sort
x<-x[order(x$judge,x$compdate),]
# Create a little rolling window function.
rolling.window<-function(y,window=30) seq_along(y) - findInterval(y-window,y)
# Run the little function on each judge.
x$workload <- unlist(by(x$compdate, x$judge, rolling.window))
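To see what rolling.window() computes, here is a tiny check on made-up completion dates for a single judge (already sorted):
d <- as.Date("2014-01-01") + c(0, 5, 20, 40, 45)
rolling.window(as.numeric(d), window = 30)
# [1] 1 2 3 2 3  -- for each case, how many of that judge's cases (including
# itself) were completed within the preceding 30 days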
I don't have much experience with rolling calculations, but...
Calculate this per-day, not per-case (since it will be the same for cases on the same day).
Calculate a cumulative sum of the number of cases, and then take the difference of the current value of this sum and the value of the sum 31 days ago (or min{daysAgo:daysAgo>30} since cases are not resolved every day).
It's probably fastest to use a data.table. This is my attempt, using @nograpes' simulated data. Comments start with #.
require(data.table)
DT <- data.table(x)
DT[,compdate:=as.integer(compdate)]
setkey(DT,judge,compdate)
# count cases for each day
ldt <- DT[,.N,by='judge,compdate']
# cumulative sum of counts
ldt[,nrun:=cumsum(N),by=judge]
# see how far to look back
ldt[,lookbk:=sapply(1:.N,function(i){
z <- compdate[i]-compdate[i:1]
older <- which(z>30)
if (length(older)) min(older)-1L else as(NA,'integer')
}),by=judge]
# compute cumsum(today) - cumsum(more than 30 days ago)
ldt[,wload:=list(sapply(1:.N,function(i)
nrun[i]-ifelse(is.na(lookbk[i]),0,nrun[i-lookbk[i]])
))]
On my laptop, this takes under a minute. Run this command to see the output for one judge:
print(ldt['XYZ'],nrow=120)

Sorting data in R

I have a dataset that I need to sort by participant (RECORDING_SESSION_LABEL) and by trial_number. However, when I sort the data using R none of the sort functions I have tried put the variables in the correct numeric order that I want. The participant variable comes out ok but the trial ID variable comes out in the wrong order for what I need.
using:
fix_rep[order(as.numeric(RECORDING_SESSION_LABEL), as.numeric(trial_number)),]
Participant ID comes out as:
118 118 118 etc. 211 211 211 etc. 306 306 306 etc.(which is fine)
trial_number comes out as:
1 1 10 10 11 11 12 12 13 13 14 14 15 15 16 16 17 17 18 18 19 19 2 2 20 20 .... (which is not what I want - it seems to be sorting lexically rather than numerically)
What I would like is trial_number to be order like this within each participant number:
1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11 11 ....
I have checked that these variables are not factors and are numeric and also tried without the 'as.numeric', but with no joy. Looking around I saw suggestions that sort() and mixedsort() might do the trick in place of 'order', both come up with errors. I am slowly pulling my hair out over what I think should be a simple thing. Can anybody help shed some light on how to do this to get what I need?
Even though you claim it is not a factor, it does behave exactly as if it were a factor. Testing if something is a factor can be tricky since a factor is just an integer vector with a levels attribute and a class label. If it is a factor, your code needs to have a call to as.character() nested inside the as.numeric():
fix_rep[order(as.numeric(RECORDING_SESSION_LABEL), as.numeric(as.character(trial_number))),]
To be really sure if it's a factor, I recommend the str() function:
str(trial_number)
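A tiny illustration of the pitfall, with made-up values:
trial_number <- factor(c("1", "10", "2", "20", "3"))
as.numeric(trial_number)                # 1 2 3 4 5  -- the underlying level codes
as.numeric(as.character(trial_number))  # 1 10 2 20 3 -- the actual numbers you want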
I think it may be worthwhile for you to design your own function in this case. It wouldn't be too hard: basically, you could design a bubble-sort algorithm with a few alterations. These alterations would convert each number to a string and begin by sorting the strings into bins by their number of digits (easily done by checking which strings have the most characters). Then, in a similar fashion, the numbers in each bin could be sorted by converting the least significant digit back to a numeric type and checking which are largest/smallest. If you're interested, I could come up with some code for this; however, it looks like the two answers above have beaten me to the punch with some built-in functions. I've never used those functions, so I'm not sure they'll work as you intend, but there's no use in reinventing the wheel.
