R arules, mine only rules from specific column - r

I would like to mine specific rhs rules. There is an example in the documentation which demonstrates that this is possible, but only for a specific case (as we see below). First an data set to illustrate my problem:
input <- matrix( c( rep(10001,6) , rep(10002,3) , rep(10003,3), 100001,100002,100003,100004,100005,100006,100002,100003,100007,100002,100003,100008,rep('a',6),rep('b',6)), ncol=3)
colnames(input) <- c(letters[1:3])
input <- as.data.frame(input)
Now i can create rules:
r <- apriori(input)
To see the rules:
inspect(r)
I would like to only mine rules that have b=... on the rhs. For specific values this can be done by adding:
appearance = list(rhs = c("b=100001", "b=100002"),default="lhs")
to the apriori command. I will also have to adjust the confidence if i want to find them ofcourse. The problem lies in the number of elements in column b. I can manualy type all the elements in the "b=....." format in this example, but I can't in my own data.
I tried to get the values of b using unique() and then giving that to the rhs, but it will generate an error because i give values like: "100001" "100002" instead of "b=100001" "b=100002".
Is there a was to only get rhs rules from a specific column?
If not, is there an easy way to generate 'want' from 'current?
current <- c("100001", "100002", "100003", "100004", "100005", "100006", "100007", "100008")
want <- c("b=100001", "b=100002", "b=100003", "b=100004", "b=100005", "b=100006", "b=100007", "b=100008")
Somewhat related is this question: Creating specific rules with arules in r
But that has the same problem for me, only a different way.

You can use subset:
r <- apriori(input, parameter = list(support = 0.1, confidence = 0.1))
inspect( subset( r, subset = rhs %pin% "b=" ) )
# lhs rhs support confidence lift
# 1 {} => {b=100002} 0.2500000 0.2500000 1.000000
# 2 {} => {b=100003} 0.2500000 0.2500000 1.000000
# 3 {c=b} => {b=100002} 0.1666667 0.3333333 1.333333
# 4 {c=b} => {b=100003} 0.1666667 0.3333333 1.333333
For you second question, you can use paste:
paste0( "b=", current )
# [1] "b=100001" "b=100002" "b=100003" "b=100004" "b=100005" "b=100006" "b=100007"
# [8] "b=100008"

The arules documentation now has an example that does exactly what you want:
bItems <- grep("^b=", itemLabels(input), value = TRUE)
rules <- apriori(input, parameter = list(support = 0.1, confidence = 0.1),
appearance = list(rhs = bItems))
I haven't actually tested this with your example code (the arules documentation example uses a transactions object, not a data.frame), but grep-ing those column labels should work out.

Related

Using cpquery function for several pairs from dataset

I am relatively beginner in R and trying to figure out how to use cpquery function for bnlearn package for all edges of DAG.
First of all, I created a bn object, a network of bn and a table with all strengths.
library(bnlearn)
data(learning.test)
baynet = hc(learning.test)
fit = bn.fit(baynet, learning.test)
sttbl = arc.strength(x = baynet, data = learning.test)
Then I tried to create a new variable in sttbl dataset, which was the result of cpquery function.
sttbl = sttbl %>% mutate(prob = NA) %>% arrange(strength)
sttbl[1,4] = cpquery(fit, `A` == 1, `D` == 1)
It looks pretty good (especially on bigger data), but when I am trying to automate this process somehow, I am struggling with errors, such as:
Error in sampling(fitted = fitted, event = event, evidence = evidence, :
logical vector for evidence is of length 1 instead of 10000.
In perfect situation, I need to create a function that fills the prob generated variable of sttbl dataset regardless it's size. I tried to do it with for loop to, but stumbled over the error above again and again. Unfortunately, I am deleting failed attempts, but they were smt like this:
for (i in 1:nrow(sttbl)) {
j = sttbl[i,1]
k = sttbl[i,2]
sttbl[i,4]=cpquery(fit, fit$j %in% sttbl[i,1]==1, fit$k %in% sttbl[i,2]==1)
}
or this:
for (i in 1:nrow(sttbl)) {
sttbl[i,4]=cpquery(fit, sttbl[i,1] == 1, sttbl[i,2] == 1)
}
Now I think I misunderstood something in R or bnlearn package.
Could you please tell me how to realize this task with filling the column by multiple cpqueries? That would help me a lot with my research!
cpquery is quite difficult to work with programmatically. If you look at the examples in the help page you can see the author uses eval(parse(...)) to build the queries. I have added two approaches below, one using the methods from the help page and one using cpdist to draw samples and reweighting to get the probabilities.
Your example
library(bnlearn); library(dplyr)
data(learning.test)
baynet = hc(learning.test)
fit = bn.fit(baynet, learning.test)
sttbl = arc.strength(x = baynet, data = learning.test)
sttbl = sttbl %>% mutate(prob = NA) %>% arrange(strength)
This uses cpquery and the much maligned eval(parse(...)) -- this is the
approach the the bnlearn author takes to do this programmatically in the ?cpquery examples. Anyway,
# You want the evidence and event to be the same; in your question it is `1`
# but for example using learning.test data we use 'a'
state = "\'a\'" # note if the states are character then these need to be quoted
event = paste(sttbl$from, "==", state)
evidence = paste(sttbl$to, "==", state)
# loop through using code similar to that found in `cpquery`
set.seed(1) # to make sampling reproducible
for(i in 1:nrow(sttbl)) {
qtxt = paste("cpquery(fit, ", event[i], ", ", evidence[i], ",n=1e6", ")")
sttbl$prob[i] = eval(parse(text=qtxt))
}
I find it preferable to work with cpdist which is used to generate random samples conditional on some evidence. You can then use these samples to build up queries. If you use likelihood weighting (method="lw") it is slightly easier to do this programatically (and without evil(parse(...))).
The evidence is added in a named list i.e. list(A='a').
# The following just gives a quick way to assign the same
# evidence state to all the evidence nodes.
evidence = setNames(replicate(nrow(sttbl), "a", simplify = FALSE), sttbl$to)
# Now loop though the queries
# As we are using likelihood weighting we need to reweight to get the probabilities
# (cpquery does this under the hood)
# Also note with this method that you could simulate from more than
# one variable (event) at a time if the evidence was the same.
for(i in 1:nrow(sttbl)) {
temp = cpdist(fit, sttbl$from[i], evidence[i], method="lw")
w = attr(temp, "weights")
sttbl$prob2[i] = sum(w[temp=='a'])/ sum(w)
}
sttbl
# from to strength prob prob2
# 1 A D -1938.9499 0.6186238 0.6233387
# 2 A B -1153.8796 0.6050552 0.6133448
# 3 C D -823.7605 0.7027782 0.7067417
# 4 B E -720.8266 0.7332107 0.7328657
# 5 F E -549.2300 0.5850828 0.5895373

How to coerce stslist.freq to dataframe

I am doing some describtive sequence analysis using the "TraMineR" library. I want to report my findings via R-Markdown in html format. For formating tables I use "kable" and "kableExtra".
To get the frequency and propotions of the most common sequences I use seqtab(). The result is an stslist.freq object. When I try to coerce it to a dataframe, the dataframe is not containing any frequencies and proportions.
I tried to print the results of seqtab() and store this result again. This gives me the dataframe I desire. However there are two "problems" with that: (1) I don't understand what is happening here and it seems like a "dirty" trick, (2) as a result I also get the output of the print command in my final html document if I don't split the code in multiple chunks and disable the ouput in the specific chunk.
Here is some code to replicate the problem:
library("TraMineR")
#Data creation
data.long <- data.frame(
id=rep(1:50, each=4),
time = c(0,1,2,3),
status = sample(letters[1:2], 200, replace = TRUE),
weight=rep(runif(50, 0, 1), each=4)
)
#reshape
data.wide <- reshape(data.long, v.names = "status", idvar="id", direction="wide", timevar="time")
#sequence
sequence <- seqdef(data.wide,
var=c("status.0", "status.1", "status.2", "status.3"),
weights=data.wide$weight)
#frequencies of sequences
##doesn't work:
seqtab.df1 <- as.data.frame(seqtab(sequence))
##works:
seqtab.df2 <- print(seqtab(sequence))
I expect the dataframe to be the same as the one saved in seqtab.df2, however either without using the print command or with "silently" (no output printed) using the print command.
Thank you very much for your help and let me know if I forgot something to make answering the question possible!
If you look at the class() of the object returned by seqtab, it has the type
class(seqtab(sequence))
# [1] "stslist.freq" "stslist" "data.frame"
so if we look at exactly, what's happening in the print statement for such an object we can get a clue what's going on
TraMineR:::print.stslist.freq
# function (x, digits = 2, width = 1, ...)
# {
# table <- attr(x, "freq")
# print(table, digits = digits, width = width, ...)
# }
# <bytecode: 0x0000000003e831f8>
# <environment: namespace:TraMineR>
We see that what it's really giving you is the "freq" attribute. You can extract this directly and skip the print()
attr(seqtab(sequence), "freq")
# Freq Percent
# a/3-b/1 4.283261 20.130845
# b/1-a/1-b/2 2.773341 13.034390
# a/2-b/1-a/1 2.141982 10.067073
# a/1-b/1-a/1-b/1 1.880359 8.837476
# a/1-b/2-a/1 1.723489 8.100203
# b/1-a/2-b/1 1.418302 6.665861
# b/2-a/1-b/1 1.365099 6.415813
# a/1-b/3 1.241644 5.835586
# a/1-b/1-a/2 1.164434 5.472710
# a/2-b/2 1.092656 5.135360

Selecting features from a feature set using mRMRe package

I am a new user of R and trying to use mRMRe R package (mRMR is one of the good and well known feature selection approaches) to obtain feature subset from a feature set. Please excuse if my question is simple as I really want to know how I can fix an error. Below is the detail.
Suppose, I have a csv file (gene.csv) having feature set of 6 attributes ([G1.1.1.1], [G1.1.1.2], [G1.1.1.3], [G1.1.1.4], [G1.1.1.5], [G1.1.1.6]) and a target class variable [Output] ('1' indicates positive class and '-1' stands for negative class). Here's a sample gene.csv file:
[G1.1.1.1] [G1.1.1.2] [G1.1.1.3] [G1.1.1.4] [G1.1.1.5] [G1.1.1.6] [Output]
11.688312 0.974026 4.87013 7.142857 3.571429 10.064935 -1
12.538226 1.223242 3.669725 6.116208 3.363914 9.174312 1
10.791367 0.719424 6.115108 6.47482 3.597122 10.791367 -1
13.533835 0.37594 6.766917 7.142857 2.631579 10.902256 1
9.737828 2.247191 5.992509 5.992509 2.996255 8.614232 -1
11.864407 0.564972 7.344633 4.519774 3.389831 7.909605 -1
11.931818 0 7.386364 5.113636 3.409091 6.818182 1
16.666667 0.333333 7.333333 4.333333 2 8.333333 -1
I am trying to get best feature subset of 2 attributes (out of above 6 attributes) and wrote following R code.
library(mRMRe)
file_n<-paste0("E:\\gene", ".csv")
df <- read.csv(file_n, header = TRUE)
f_data <- mRMR.data(data = data.frame(df))
featureData(f_data)
mRMR.ensemble(data = f_data, target_indices = 7,
feature_count = 2, solution_count = 1)
When I run this code, I am getting following error for the statement f_data <- mRMR.data(data = data.frame(df)):
Error in .local(.Object, ...) :
data columns must be either of numeric, ordered factor or Surv type
However, data in each column of the csv file are real number.So, how can I change the R code to fix this problem? Also, I am not sure what should be the value of target_indices in the statement mRMR.ensemble(data = f_data, target_indices = 7,feature_count = 2, solution_count = 1) as my target class variable name is "[Output]" in the gene.csv file.
I will appreciate much if anyone can help me to obtain the best feature subset based on the gene.csv file using mRMRe R package.
I solved the problem by modifying my code as follows.
library(mRMRe)
file_n<-paste0("E:\\gene", ".csv")
df <- read.csv(file_n, header = TRUE)
df[[7]] <- as.numeric(df[[7]])
f_data <- mRMR.data(data = data.frame(df))
results <- mRMR.classic("mRMRe.Filter", data = f_data, target_indices = 7,
feature_count = 2)
solutions(results)
It worked fine. The output of the code gives the indices of the selected 2 features.
I think it has to do with your Output column which is probably of class integer. You can check that using class(df[[7]]).
To convert it to numeric as required by the warning, just type:
df[[7]] <- as.numeric(df[[7]])
That worked for me.
As for the other question, after reading the documentation, setting target_indices = 7 seems the right choice.

R - Association rules, lhs column is empty

I am using R library arules for rules minning.
So first I tried just to see the rules:
#Get the rules
rules <- apriori(trans, parameter = list(supp=0.05, conf = 0.05)) #minlen = 2
rules <- sort(rules, by="confidence", decreasing=TRUE)
However the lhs column is empty:
inspect(rules)
lhs rhs support confidence lift
3 {} => {product=CM,DD,OS} 0.501 0.501 1
2 {} => {product=CM,DD} 0.223 0.223 1
1 {} => {product=CM} 0.068 0.068 1
So I tried to specifically ask for the lhs column:
rules <- apriori(data=trans, parameter=list(supp=0.05, conf = 0.05),
appearance = list(default="rhs", lhs="product=CM,DD,OS"),
control = list(verbose=F))
rules <- sort(rules, by="confidence", decreasing=TRUE)
inspect(rules)
Unfortunately the output remains same.
One of the reason might be that most of the clients have ~4 products, therefore they might not be any rules, but I find that unlikley.
So the problem was in the format of the data. If I before dump data into .csv and use read.transactions, it works correctly.
trans = read.transactions("C:/.../basket_analysis_data.csv", format="single",sep = ";", cols = c(2,1))
Before I was using direct ODBC connection, put data into Data frame and then convert them like this:
trans <- data.frame(product = as.factor(qry$product_owned))
trans <- as(trans, "transactions")
However use .csv as immediate step is annoying. If anyone can help how to make it work without .csv, I would appreciate it.

How do I display labels from data on Dissimilarity matrix using Coldiss function rather than default numbers?

I think I have read every page on the internet that mentions coldiss and I am still having trouble getting the labels to look correctly. In the image I inserted, the matrices look good but the labels are default numbers (so aren't that useful for a stand alone image) and in the ordered matrix the matrix gets ordered correctly, but the labels didn't re-order, which doesn't make sense.
[Matrix output images][1]
My questions are:
1) How do I get the labels to order properly for the ordered matrix? If the cells in the heat map are changing colors after being ordered, the respective labels should be different too.
2) Is it possible to edit the coldiss function to use my isolate labels that can be found in the top row or first column to label the heat map rather than the default numbers?
Here is the code I'm running.
library(gclus)
library(ape)
source("coldiss.txt")
tree<-read.tree("BP_SNPS_only-BioNJ_tree_100BS")
PatristicDistMatrix100BS<-cophenetic.phylo(tree)
coldiss(D = PatristicDistMatrix100BS, nc = 4, byrank = TRUE, diag = TRUE)
Here is the coldiss.txt file:
# coldiss()
# Color plots of a dissimilarity matrix, without and with ordering
#
# License: GPL-2
# Author: Francois Gillet, 23 August 2012
#
"coldiss" <- function(D, nc = 4, byrank = TRUE, diag = FALSE)
{
require(gclus)
if (max(D)>1) D <- D/max(D)
if (byrank) {
spe.color <- dmat.color(1-D, cm.colors(nc))
}
else {
spe.color <- dmat.color(1-D, byrank=FALSE, cm.colors(nc))
}
spe.o <- order.single(1-D)
speo.color <- spe.color[spe.o, spe.o]
op <- par(mfrow=c(1,2), pty="s")
if (diag) {
plotcolors(spe.color, rlabels=attributes(D)$Labels,
main="Dissimilarity Matrix",
dlabels=attributes(D)$Labels)
plotcolors(speo.color, rlabels=attributes(D)$Labels[spe.o],
main="Ordered Dissimilarity Matrix",
dlabels=attributes(D)$Labels[spe.o])
}
else {
plotcolors(spe.color, rlabels=attributes(D)$Labels,
main="Dissimilarity Matrix")
plotcolors(speo.color, rlabels=attributes(D)$Labels[spe.o],
main="Ordered Dissimilarity Matrix")
}
par(op)
}
# Usage:
# coldiss(D = dissimilarity.matrix, nc = 4, byrank = TRUE, diag = FALSE)
# If D is not a dissimilarity matrix (max(D) > 1), then D is divided by max(D)
# nc number of colours (classes)
# byrank= TRUE equal-sized classes
# byrank= FALSE equal-length intervals
# diag = TRUE print object labels also on the diagonal
# Example:
# coldiss(spe.dj, nc=9, byrank=F, diag=T)
Here is an abbreviated version of PatristicDistMatrix100BS:
CDC-B043_1995 CDC-A267_1994 CDC-A161_1992 CDC-C931_1998
CDC-B043_1995 0 0.00099 0.00099 0.00166
CDC-A267_1994 0.00099 0 0.00066 0.00133
CDC-A161_1992 0.00099 0.00066 0 0.00133
CDC-C931_1998 0.00166 0.00133 0.00133 0
I hope this provides all the relevant information and thank you for any help you can provide even if it's a completely different function.
There is nothing wrong in the code. The main problem I think is some other packages you have loaded. I also had same problem but when I tried separately it worked well and as you require. Just remove other packages or calculate separately. For more details have a look on the code of chapter three of this document (http://adn.biol.umontreal.ca/~numericalecology/numecolR/). Here is the code I work with.
(vegan must be loaded after ade4 to avoid some conflicts)
library(ade4)
library(vegan)
library(gclus)
library(cluster)
library(FD)
files must be in the working directory. You can search this file from internet from this link (https://github.com/JoeyBernhardt/NumericalEcology)
source("coldiss.R")
source("panelutils.R")
Then calculate your dissimilarity matrix and plot using the code
BCD <- vegdist(df[-1])
coldiss(BCD, byrank = FALSE, diag = TRUE)
Hopefully it will work.

Resources