I have a large dataset and I'm trying to mine association rules between the variables.
My problem is that I have 160 variables among which I have to look for association rules, and I also have more than 1,800 itemsets.
Furthermore, my variables are continuous. To mine association rules I usually use the Apriori algorithm, but as is well known, this algorithm requires categorical variables.
Does anyone have any suggestions on what kind of algorithm I can use in this case?
A small example of my dataset is the following:
ID_Order  Model  Ordered quantity
A.1       typeX  20
A.1       typeZ  10
A.1       typeY   5
B.2       typeX  16
B.2       typeW  12
C.3       typeZ   1
D.4       typeX   8
D.4       typeG   4
...
My goal is to mine association rules and correlations between the different products, perhaps with a neural network algorithm in R. Does anyone have any suggestions on how to solve this problem?
Thanks in advance
You can create transactions from your dataset like this:
library(dplyr)
This function is used to get the transactions per ID_Order:
concat <- function(x) {
return(list(as.character(x)))
}
Group df by ID_Order and concatenate. pull() returns the concatenated Models in a list.
a_list <- df %>%
group_by(ID_Order) %>%
summarise(concat = concat(Model)) %>%
pull(concat)
Set names to ID_Order. Note that summarise() returns the groups in sorted order, so sort the unique IDs to keep the names and list elements aligned:
names(a_list) <- sort(unique(df$ID_Order))
Then you can use the package arules:
library(arules)
Get an object of class transactions:
transactions <- as(a_list, "transactions")
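To check that the conversion worked as intended, you can, for instance, summarise or list the transactions (both summary() and inspect() have methods for the transactions class):
summary(transactions)  # number of transactions, number of items, density, ...
inspect(transactions)  # one row of items per ID_Order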
Extract rules. You can set the minimum support and minimum confidence in supp and conf, respectively.
rules <- apriori(transactions,
parameter = list(supp = 0.1, conf = 0.5, target = "rules"))
To inspect the rules use:
inspect(rules)
And this is what you get:
lhs rhs support confidence lift count
[1] {} => {typeZ} 0.50 0.50 1.0000000 2
[2] {} => {typeX} 0.75 0.75 1.0000000 3
[3] {typeW} => {typeX} 0.25 1.00 1.3333333 1
[4] {typeG} => {typeX} 0.25 1.00 1.3333333 1
[5] {typeY} => {typeZ} 0.25 1.00 2.0000000 1
[6] {typeZ} => {typeY} 0.25 0.50 2.0000000 1
[7] {typeY} => {typeX} 0.25 1.00 1.3333333 1
[8] {typeZ} => {typeX} 0.25 0.50 0.6666667 1
[9] {typeY,typeZ} => {typeX} 0.25 1.00 1.3333333 1
[10] {typeX,typeY} => {typeZ} 0.25 1.00 2.0000000 1
[11] {typeX,typeZ} => {typeY} 0.25 1.00 4.0000000 1
From the example section of ?transactions:
## example 4: creating transactions from a data.frame with
## transaction IDs and items (by converting it into a list of transactions first)
a_df3 <- data.frame(
TID = c(1,1,2,2,2,3),
item=c("a","b","a","b","c","b")
)
a_df3
trans4 <- as(split(a_df3[,"item"], a_df3[,"TID"]), "transactions")
trans4
inspect(trans4)
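As for the continuous variables mentioned in the question: apriori() indeed needs categorical items, but arules also ships a discretize() helper that bins continuous values before you build the transactions. A minimal sketch, using the ordered quantities from the example data; the three bin labels are illustrative:
library(arules)
# Ordered quantities from the example dataset
qty <- c(20, 10, 5, 16, 12, 1, 8, 4)
# Bin into three equal-frequency categories
qty_binned <- discretize(qty, method = "frequency", breaks = 3,
                         labels = c("low", "medium", "high"))
qty_binned
The binned quantity could then be pasted onto the model name (e.g. "typeX=high") so that each item carries both pieces of information.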
The acf function in the stats package returns a complex object. For example
x = rnorm(1000, mean=100, sd=10)
acf(x)
returns a plot. If I do
acf_x = acf(x)
acf_x
it returns
Autocorrelations of series ‘x’, by lag
0 1 2 3 4 5 6 7 8 9 10 11
1.000 0.000 -0.031 -0.002 -0.052 0.017 -0.014 0.030 0.011 0.002 -0.044 0.000
12 13 14 15 16 17 18 19 20 21 22 23
0.055 -0.007 0.049 0.025 -0.027 -0.048 0.033 0.027 0.043 -0.007 -0.010 0.025
24 25 26 27 28 29 30
-0.083 0.045 -0.074 0.016 0.041 -0.046 0.010
If I look at class(acf_x) it returns 'acf'.
How do I extract the autocorrelation versus lag into a data frame?
More generally, when presented with a function that returns a complex object, how do I extract the data from it? That is, is there a general pattern for this type of function?
If you look at the help page of acf via ?acf, you'll see under "Value" what the output will look like.
In this case, the acf object is a list with several elements.
If you e.g. want the lags, you can simply access this via:
my_lags <- acf_x$lag
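Note that acf() stores both the acf and lag elements as three-dimensional arrays (lag x series x series), so for a univariate series you may want to drop the extra dimensions to get a plain vector:
my_lags <- acf_x$lag[, 1, 1]  # plain numeric vector of lags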
Deschen's answer to the original question gives the general approach for discovering the elements in a complex model object: str(). For S3 objects one can also use the names() function; the result lists the names one can use to extract elements from the list with the $ or [[ forms of the extraction operator.
set.seed(95014)
x = rnorm(1000, mean=100, sd=10)
acf_x <- acf(x)
names(acf_x)
[1] "acf"    "type"   "n.used" "lag"    "series" "snames"
Since the acf and lag elements are stored as three-dimensional arrays, we need to extract the first column of the second and third dimensions to obtain a simple vector. We can accomplish this by chaining the [ form of the extraction operator onto the object generated by the [[ extraction on the model object.
head(acf_x[["acf"]][, 1, 1])  # second extract returns a simple vector
[1]  1.000000000 -0.034863150  0.037745441 -0.020464290 -0.004974406
[6]  0.016770363
In this case R performs the extraction left to right - first acf_x[["acf"]] is evaluated, and then [,1,1] is applied to the result.
For the concrete part of the question, "how do I create a data frame with this data?", one can build a data frame from the acf() output as follows.
set.seed(95014)
x = rnorm(1000, mean=100, sd=10)
acf_x <- acf(x)
results <- data.frame(acf_value = acf_x$acf[,1,1],
acf_lag = acf_x$lag[,1,1])
head(results)
...and the output:
acf_value acf_lag
1 1.000000000 0
2 -0.034863150 1
3 0.037745441 2
4 -0.020464290 3
5 -0.004974406 4
6 0.016770363 5
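If you then want to reproduce the usual acf() plot from this data frame, a minimal base-R sketch:
# Vertical-line plot of autocorrelation against lag, similar to plot.acf
plot(results$acf_lag, results$acf_value, type = "h",
     xlab = "Lag", ylab = "ACF")
abline(h = 0)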
Try
str(acf_x)
or
print.default(acf_x)
This will give you an idea of what the object looks like internally and how to access the elements in it.
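If the str() output is hard to read for a deeply nested object, a couple of other options along the same lines:
unclass(acf_x)             # strip the "acf" class so plain list printing kicks in
str(acf_x, max.level = 1)  # limit str() to the top level of the structure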
I am stuck on how to proceed with coding in RStudio for the Bonferroni correction and the raw p-values for the Pearson correlation matrix. I am a student and am new to R. When I calculated the Pearson correlation matrix I only got the r values, not the raw probability (p) values, and I am not sure how to code that in RStudio. I then tried to calculate the Bonferroni correction and received an error message saying a list object cannot be coerced to type double; how do I fix my code so this goes away? I am also lost on how to get a table of the mean, SD, and n for the data: I tried to create one and became stuck on how to proceed.
My data is as follows:
Tree_Height  DBA   Leaf_Diameter
45.3         14.9  0.76
75.2         26.6  1.06
70.1         22.9  1.19
95           31.8  1.59
107.8        35.5  0.43
93           26.2  1.49
91.5         29    1.19
78.5         29.2  1.1
85.2         30.3  1.24
50           16.8  0.67
47.1         12.8  0.98
73.2         28.4  1.2
Packages I have installed: dplyr, tidyr, multcomp, multcompview.
I read in the data from an Excel CSV (comma-delimited) file; this creates dataHW8_1, 12 obs. of 3 variables.
summary(dataHW8_1)
I then created Scatterplots of the data
plot(dataHW8_1$Tree_Height,dataHW8_1$DBA,main="Scatterplot Tree Height Vs Trunk Diameter at Breast Height (DBA)",xlab="Tree Height (cm)",ylab="DBA (cm)")
plot(dataHW8_1$Tree_Height,dataHW8_1$Leaf_Diameter,main="Scatterplot Tree Height Vs Leaf Diameter",xlab="Tree Height (cm)",ylab="Leaf Diameter (cm)")
plot(dataHW8_1$DBA,dataHW8_1$Leaf_Diameter,main="Scatterplot Trunk Diameter at Breast Height (DBA) Vs Leaf Diameter",xlab="DBA (cm)",ylab="Leaf Diameter (cm)")
I then noticed that the data was not linear, so I transformed it using the log() function:
dataHW8_1log = log(dataHW8_1)
I then re-created my Scatterplots using the transformed data
plot(dataHW8_1log$Tree_Height, dataHW8_1log$DBA, main = "Scatterplot of Transformed (log) Tree Height Vs Trunk Diameter at Breast Height (DBA)", xlab = "Tree Height (cm)", ylab = "DBA (cm)")
plot(dataHW8_1log$Tree_Height, dataHW8_1log$Leaf_Diameter, main = "Scatterplot of Transformed (log) Tree Height Vs Leaf Diameter", xlab = "Tree Height (cm)", ylab = "Leaf Diameter (cm)")
plot(dataHW8_1log$DBA, dataHW8_1log$Leaf_Diameter, main = "Scatterplot of Transformed (log) Trunk Diameter at Breast Height (DBA) Vs Leaf Diameter", xlab = "DBA (cm)", ylab = "Leaf Diameter (cm)")
I then created a matrix plot of Scatterplots
pairs(dataHW8_1log)
I then calculated the correlation coefficients using the Pearson method, but this does not give an uncorrected matrix of p-values. How do you do that?
cor(dataHW8_1log,method="pearson")
I am stuck on what to do to get a matrix of the raw probabilities (uncorrected p-values) for the data.
I then tried to calculate the Bonferroni correction. How do you do that? Here is my attempt:
Data$Bonferroni = p.adjust(dataHW8_1log, method = "bonferroni")
Doing this gave me the following error:
Error in p.adjust(dataHW8_1log, method = "bonferroni") :
(list) object cannot be coerced to type 'double'
I tried to fix it using lapply, but that did not fix my problem.
I then tried to make a table of mean, SD, and n, but I was only able to create the following incomplete code and became stuck on where to go from there. How do you do that?
(,data = dataHW8_1log,
FUN = function(x) c(Mean = mean(x, na.rm = T),
n = length(x),
sd = sd(x, na.rm = T))
I have tried following examples online, but none of them have helped me get the Bonferroni correction coded correctly. If anyone can help explain what I did wrong and how to make the matrices/table, I would greatly appreciate it.
Here is an example using a 50 rows by 10 columns sample dataframe.
# 50 rows x 10 columns sample dataframe
df <- as.data.frame(matrix(runif(500), ncol = 10));
We can show pairwise scatterplots.
# Pairwise scatterplot
pairs(df);
We can now use cor.test to get p-values for a single comparison. We use a convenience function cor.test.p to do this for all pairwise comparisons. To give credit where credit is due, the function cor.test.p has been taken from this SO post, and takes as an argument a dataframe whilst returning a matrix of uncorrected p-values.
# cor.test on dataframes
# From: https://stackoverflow.com/questions/13112238/a-matrix-version-of-cor-test
cor.test.p <- function(x) {
FUN <- function(x, y) cor.test(x, y)[["p.value"]];
z <- outer(
colnames(x),
colnames(x),
Vectorize(function(i,j) FUN(x[,i], x[,j])));
dimnames(z) <- list(colnames(x), colnames(x));
return(z);
}
# Uncorrected p-values from pairwise correlation tests
pval <- cor.test.p(df);
We now correct for multiple hypothesis testing by applying the Bonferroni correction to every row (or column, since the matrix is symmetric) and we're done. Note that p.adjust takes a vector of p-values as an argument.
# Multiple hypothesis-testing corrected p-values
# Note: pval is a symmetric matrix, so it doesn't matter if we correct
# by column or by row
padj <- apply(pval, 2, p.adjust, method = "bonferroni");
padj;
#V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
#V1 0 1 1.0000000 1 1.0000000 1.0000000 1 1 1.0000000 1
#V2 1 0 1.0000000 1 1.0000000 1.0000000 1 1 1.0000000 1
#V3 1 1 0.0000000 1 0.9569498 1.0000000 1 1 1.0000000 1
#V4 1 1 1.0000000 0 1.0000000 1.0000000 1 1 1.0000000 1
#V5 1 1 0.9569498 1 0.0000000 1.0000000 1 1 1.0000000 1
#V6 1 1 1.0000000 1 1.0000000 0.0000000 1 1 0.5461443 1
#V7 1 1 1.0000000 1 1.0000000 1.0000000 0 1 1.0000000 1
#V8 1 1 1.0000000 1 1.0000000 1.0000000 1 0 1.0000000 1
#V9 1 1 1.0000000 1 1.0000000 0.5461443 1 1 0.0000000 1
#V10 1 1 1.0000000 1 1.0000000 1.0000000 1 1 1.0000000 0
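The question also asked for a table of mean, SD and n per variable. A minimal sketch using sapply() over the columns of the same sample dataframe (substitute your dataHW8_1log):
# One column per variable, one row per statistic; t() flips it around
summary_tab <- sapply(df, function(x) c(Mean = mean(x, na.rm = TRUE),
                                        SD   = sd(x, na.rm = TRUE),
                                        n    = sum(!is.na(x))))
t(summary_tab)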
I created a data.frame that holds my words and their frequencies. Now I would like to run findAssocs against every row of my frame, but I cannot get my code to work. Any help is appreciated.
Here is an example of my data.frame term.df
term.df <- data.frame(word = names(v),freq=v)
word freq
ounce 8917
pack 6724
count 4992
organic 3696
frozen 2534
free 1728
I created a TermDocumentMatrix tdm and the following code works as expected.
findAssocs(tdm, 'frozen', 0.20)
I would like to append the output of findAssocs as a new column.
Here's the code I tried:
library(dplyr)
library(tm)
library(pbapply)
#I would like to append all findings in a new column
res <- merge(do.call(rbind.data.frame, pblapply(term.df, findAssocs(tdm, term.df$word , 0.18))),
term.df[, c("word")], by.x="list.q", by.y="word", all.x=TRUE)
EDIT:
As for the output: the single statement above gets me something like this.
$yogurt
greek ellenos fat chobani dannon fage yoplait nonfat wallaby
0.62 0.36 0.25 0.24 0.24 0.24 0.24 0.22 0.20
I was hoping it would be possible to add a single column (ASSOC) to my original table and put the results in as comma-separated name:value tuples, but I'm really open to ideas.
I think a structure that would be simplest to handle would be a nested list:
lapply(seq_len(nrow(text.df)), function(i) {
list(word=text.df$word[i],
freq=text.df$freq[i],
assoc=findAssocs(tdm, as.character(text.df$word[i]), 0.7)[[1]])
})
# [[1]]
# [[1]]$word
# [1] "oil"
#
# [[1]]$freq
# [1] 3
#
# [[1]]$assoc
# 15.8 opec clearly late trying who winter analysts
# 0.87 0.87 0.80 0.80 0.80 0.80 0.80 0.79
# said meeting above emergency market fixed that prices
# 0.78 0.77 0.76 0.75 0.75 0.73 0.73 0.72
# agreement buyers
# 0.71 0.70
#
#
# [[2]]
# [[2]]$word
# [1] "opec"
#
# [[2]]$freq
# [1] 2
#
# [[2]]$assoc
# meeting emergency oil 15.8 analysts buyers above
# 0.88 0.87 0.87 0.85 0.85 0.83 0.82
# said ability they prices. agreement but clearly
# 0.82 0.80 0.80 0.79 0.76 0.74 0.74
# december. however, late production sell trying who
# 0.74 0.74 0.74 0.74 0.74 0.74 0.74
# winter quota that through bpd market
# 0.74 0.73 0.73 0.73 0.70 0.70
#
#
# [[3]]
# [[3]]$word
# [1] "xyz"
#
# [[3]]$freq
# [1] 1
#
# [[3]]$assoc
# numeric(0)
In my experience this will be easier to handle than a nested string, because you can still access the word associations for each row of your original text.df object by accessing the corresponding element of the returned list.
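For example, if you assign the lapply() call above to res, the associations for the second word are available as:
res[[2]]$assoc          # named numeric vector of associations for "opec"
res[[2]]$assoc["oil"]   # one particular association value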
If you really want to keep a data frame structure, then you could pretty easily convert the findAssocs output to a string representation, for instance using toJSON:
library(RJSONIO)
text.df$assoc <- sapply(text.df$word, function(x) toJSON(findAssocs(tdm, x, 0.7)[[1]], collapse=""))
text.df
# word freq
# 1 oil 3
# 2 opec 2
# 3 xyz 1
# assoc
# 1 { "15.8": 0.87,"opec": 0.87,"clearly": 0.8,"late": 0.8,"trying": 0.8,"who": 0.8,"winter": 0.8,"analysts": 0.79,"said": 0.78,"meeting": 0.77,"above": 0.76,"emergency": 0.75,"market": 0.75,"fixed": 0.73,"that": 0.73,"prices": 0.72,"agreement": 0.71,"buyers": 0.7 }
# 2 { "meeting": 0.88,"emergency": 0.87,"oil": 0.87,"15.8": 0.85,"analysts": 0.85,"buyers": 0.83,"above": 0.82,"said": 0.82,"ability": 0.8,"they": 0.8,"prices.": 0.79,"agreement": 0.76,"but": 0.74,"clearly": 0.74,"december.": 0.74,"however,": 0.74,"late": 0.74,"production": 0.74,"sell": 0.74,"trying": 0.74,"who": 0.74,"winter": 0.74,"quota": 0.73,"that": 0.73,"through": 0.73,"bpd": 0.7,"market": 0.7 }
# 3 [ ]
Data:
library(tm)
data("crude")
tdm <- TermDocumentMatrix(crude)
(text.df <- data.frame(word=c("oil", "opec", "xyz"), freq=c(3, 2, 1), stringsAsFactors=FALSE))
# word freq
# 1 oil 3
# 2 opec 2
# 3 xyz 1
To begin, I have been able to put together a nested for loop to create the object I am after, and it works OK for a small toy data set, but the data I will be working with in general will be larger and I am trying to determine if a package exists in R with a built in function to accomplish this task.
The final object is a data frame or matrix that shows the conditional percentage in a column given the reference row. Here is the code for a toy data and my nested for loop that generates the final output object.
mylist <- list(
ID001=c("apple","orange","grape"),
ID002=c("banana","grape"),
ID003=c("apple","pineapple"),
ID004=c("orange","apple"),
ID005=c("orange","grape", "apple"))
dat <- reshape2::melt(mylist)
names(dat) <- c("fruit","id")
dat <- dat[,c(2,1)]
theFruit <- unique(dat$fruit)
n=length(theFruit)
final.df <- data.frame(matrix(nrow=n,ncol=n, dimnames=list(theFruit,theFruit)))
for(i in theFruit){
for(j in theFruit){
tempid1 <- dat[dat$fruit==i,]$id
tempid2 <- dat[dat$fruit==j,]$id
final.df[i,j] <- round(length(which(tempid1%in%tempid2))/length(tempid1),2)
}
}
final.df
apple orange grape banana pineapple
apple 1.00 0.75 0.50 0.00 0.25
orange 1.00 1.00 0.67 0.00 0.00
grape 0.67 0.67 1.00 0.33 0.00
banana 0.00 0.00 1.00 1.00 0.00
pineapple 1.00 0.00 0.00 0.00 1.00
Reading the output we see that, given a person ate an apple (apple row), 75% also ate an orange (orange column). Similarly, given a person ate an orange (orange row) 100% also ate an apple (apple column). This is not intended to be symmetric with intersections of the two fruits eaten, it is column conditioned on row.
This seems to be akin to a market basket analysis application and I have been working with the arules package the past few days to get at this. In the vernacular of the arules package, I would say the percentages populating the data frame are support values, but I have not been able to generate a matrix or data frame of all of the support percentages from arules.
The data I will be working with will have a couple million IDs but only about 150 "products" so the output matrix would only be about 150x150. I can use arules to identify the compelling pairwise relationships but there is interest in seeing ALL of the conditionals.
Does anyone know if arules or another package can accomplish this?
You are looking for the confidence values (Wikipedia). You can get output similar to yours with arules like this:
library(arules)
library(reshape2)
trans <- as(mylist, "transactions")
rules <- apriori(trans, parameter = list(supp = 0, conf = 0, minlen=2, maxlen=2))
df <- inspect(rules)[, c("lhs", "rhs", "confidence")]
dcast(df, lhs~rhs, value.var="confidence", fill=1)
# lhs {apple} {banana} {grape} {orange} {pineapple}
# 1 {apple} 1.0000000 0.0000000 0.5000000 0.7500000 0.25
# 2 {banana} 0.0000000 1.0000000 1.0000000 0.0000000 0.00
# 3 {grape} 0.6666667 0.3333333 1.0000000 0.6666667 0.00
# 4 {orange} 1.0000000 0.0000000 0.6666667 1.0000000 0.00
# 5 {pineapple} 1.0000000 0.0000000 0.0000000 0.0000000 1.00
Of course you can turn the first column into row names and convert the data frame to a matrix later on. I leave that up to you.
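A minimal sketch of that last step, reusing the df and dcast() call from above:
conf_df <- dcast(df, lhs ~ rhs, value.var = "confidence", fill = 1)
conf_mat <- as.matrix(conf_df[, -1])   # drop the lhs column
rownames(conf_mat) <- conf_df$lhs      # use it as row names instead
conf_mat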
Here is my code, which works fine but gives the output in a format I don't want.
frequencies <- list()
for (k in (1:4))
{
interval <- (t(max_period_set[k]))
intervals <- round(quantile(interval,c(0,0.05,0.15,0.25,0.35,0.45,0.55,0.65,0.75,0.85,0.95,1.0)))
frequency <- c()
for (i in (2:length(intervals)))
{
count = 0;
for (r in (1:length(interval)))
{
if (r == length(interval))
{
if (interval[r] >= intervals[i-1] && interval[r] <= intervals[i])
{
count = count + 1
}
}
else
{
if (interval[r] >= intervals[i-1] && interval[r] < intervals[i])
{
count = count + 1
}
}
}
frequency <- c(frequency,count)
}
frequencies[[length(frequencies)+1]] <- frequency
}
The output is as follows:
> frequencies
[[1]]
[1] 2 6 5 4 5 4 6 5 5 5 3
[[2]]
[1] 1 7 5 4 5 4 5 6 5 5 3
[[3]]
[1] 3 5 5 4 5 4 6 5 5 5 3
[[4]]
[1] 3 5 5 4 4 6 5 5 5 5 3
I would like to have it in a format as follows: [[],[],[],[]], i.e. a list of lists whose first element I can access like frequencies[1] to get the first list, etc.
If that is not possible, how can I access the values of the first list in my current format? frequencies[1] does not give me the first list's values back.
Thanks for your help!
And another question: now I can access the data, but R is printing the last element in a different format:
[[1]]
[1] 1.00 0.96 0.84 0.74 0.66 0.56 0.48 0.36 0.26 0.16 0.06 0.00
[[2]]
[1] 1.00 0.98 0.84 0.74 0.66 0.56 0.48 0.38 0.26 0.16 0.06 0.00
[[3]]
[1] 1.00 0.94 0.84 0.74 0.66 0.56 0.48 0.36 0.26 0.16 0.06 0.00
[[4]]
[1] 1.000000e+00 9.400000e-01 8.400000e-01 7.400000e-01 6.600000e-01 5.800000e-01 4.600000e-01 3.600000e-01 2.600000e-01 1.600000e-01 6.000000e-02 1.110223e-16
Why is this happening with the accuracy? The first three lines are as they should be, but the last line is odd: the numbers are not irrational, so they should be representable with two digits after the decimal point.
frequencies is a list, so you need
frequencies[[1]]
to access the first element. If the list were named, you could also index by element name.
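For example (the names here are illustrative):
names(frequencies) <- paste0("k", 1:4)
frequencies[["k2"]]   # same as frequencies[[2]]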
Lists are the most general data structure, and the only one that can
- be nested: lists within lists within ...
- be ragged: they do not require rectangular dimensions
so you should try to overcome your initial aversion to the fact that they are different. These are very powerful data structures, and are used a lot behind the scenes.
Edit: Also, a number of base functions as well as add-on packages can post-process lists. It starts with something basic like do.call(), goes on to lapply(), and ends all the way over at the plyr package. Keep reading -- there are many ways to skin the same cat, some better than others.
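For instance, a few one-liners that post-process the frequencies list from above:
sapply(frequencies, sum)     # summarise each list element
lapply(frequencies, rev)     # transform each element, keeping the list shape
do.call(rbind, frequencies)  # stack the equal-length vectors into a matrix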
While I completely agree with Dirk on the usefulness of lists: if all of your vectors are the same length, you can convert the list to a dataframe using as.data.frame() and then index it by column i with frequencies[, i] or by row j with frequencies[j, ].
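On the follow-up about the odd last line: 1.110223e-16 is floating-point rounding noise (about half of .Machine$double.eps), and one tiny non-zero value forces the whole vector into scientific notation when printed. Rounding cleans this up, and the dataframe conversion suggested above might then look like this (column names are illustrative):
frequencies <- lapply(frequencies, round, 2)   # drop floating-point noise
freq_df <- as.data.frame(do.call(cbind, frequencies))
names(freq_df) <- paste0("k", 1:4)
freq_df[, 1]   # first column; freq_df[1, ] gives the first row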