Split an R data frame by rows containing a keyword

Is there a quick way to split a large data.frame by keyword?
For example, given the data set below, is there a quick way to split the data frame at each occurrence of the source:restaurant line? Put another way: is there a quick way to create a factor for the data frame from a list of cut-offs (in this case c(3,7,10)), giving e.g. factors=c(A,A,A,B,B,B,B,C,C,C), that I could then use in split(mylist,factors)? Thanks
mylist=structure(list(V1 = structure(c(5L, 3L, 7L, 8L, 6L, 4L, 7L, 2L,
1L, 7L), .Label = c("cider", "claret", "custard", "krispies",
"rhubarb", "shreddies", "source:restaurant", "weetabix"), class = "factor"),
V2 = c(1L, 5L, NA, 9L, 13L, 17L, NA, 21L, 25L, NA), V3 = c(2L,
6L, NA, 10L, 14L, 18L, NA, 22L, 26L, NA), V4 = c(3L, 7L,
NA, 11L, 15L, 19L, NA, 23L, 27L, NA), V5 = c(4L, 8L, NA,
12L, 16L, 20L, NA, 24L, 28L, NA)), .Names = c("V1", "V2",
"V3", "V4", "V5"), class = "data.frame", row.names = c(NA, -10L
))
A very clunky possible solution is below, but I'm hoping for something a bit more elegant:
temp=NULL
a=which(mylist[,1] == 'source:restaurant')
for(i in seq_along(a)){temp=c(temp,rep(letters[i],(a[i]-length(temp))))}
temp=as.factor(temp)
split(mylist,temp)

The factor:
factor(cumsum(mylist$V1 == "source:restaurant") + 1)
the split:
split(mylist, cumsum(mylist$V1 == "source:restaurant"))
UPDATE: the source:restaurant row probably comes at the end of each group that it marks. To account for this, the following would be better:
factor(cumsum(c(0, head(mylist$V1 == "source:restaurant", -1))) + 1)
split(mylist, cumsum(c(0, head(mylist$V1 == "source:restaurant", -1))))
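For the second phrasing of the question (building the factor directly from the cut-offs), a minimal sketch using base R's cut() might look like this. The labels A, B, C come from the question; the ten-row data frame d is a made-up stand-in for mylist.

```r
# Hypothetical stand-in for mylist: marker rows at positions 3, 7 and 10
d <- data.frame(V1 = c("a", "b", "source:restaurant", "c", "d", "e",
                       "source:restaurant", "f", "g", "source:restaurant"),
                stringsAsFactors = FALSE)

# Cut-offs are the row numbers of the marker rows (the last row of each group)
cutoffs <- which(d$V1 == "source:restaurant")

# cut() uses right-closed intervals by default: (0,3], (3,7], (7,10]
groups <- cut(seq_len(nrow(d)), breaks = c(0, cutoffs),
              labels = LETTERS[seq_along(cutoffs)])

# One data frame per group, named A, B, C
pieces <- split(d, groups)
```

This assumes the last row of the data frame is itself a marker row; any rows after the final cut-off would get NA as their group and be dropped by split().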

Related

how to produce Sankey diagram with auto-references and circular-references using NetworkD3 library R

I have this data:
list(nodes = structure(list(name = c(NA, NA, "1.1.1. Formação Florestal",
"1.1.2. Formação Savanica", NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, "3.1. Pastagem", NA, NA, NA, "3.2.1. Cultura Anual e Perene",
NA, "3.3. Mosaico de Agricultura e Pastagem", NA, NA, "4.2. Infraestrutura Urbana",
"4.5. Outra Área não Vegetada", NA, NA, NA, NA, NA, NA, NA, "5.1 Rio ou Lago ou Oceano"
)), class = "data.frame", row.names = c(NA, -33L)), links = structure(list(
source = c(3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 15L, 15L,
15L, 15L, 15L, 15L, 15L, 19L, 19L, 19L, 19L, 21L, 21L, 21L,
21L, 21L, 21L, 24L, 25L, 25L, 25L, 33L), target = c(3L, 21L,
4L, 21L, 15L, 3L, 25L, 4L, 33L, 19L, 15L, 21L, 3L, 25L, 4L,
33L, 15L, 19L, 4L, 21L, 4L, 21L, 25L, 33L, 15L, 3L, 4L, 25L,
4L, 33L, 33L), value = c(0.544859347827813, 0.00354385993588971,
0.494359662221154, 4.67602736159475, 2.20248911690968, 0.501437742068369,
0.00354375594818463, 24.8427814053755, 0.439418727642527,
0.0079740332093807, 11.8060486886398, 2.76329829691466, 0.000886029792298199,
0.00177186270758855, 3.35504921147758, 0.14263144351167,
1.12170804870686, 0.0478454594554582, 0.217079959877658,
0.00620223918980076, 1.79754946594068, 9.02868098124075,
0.00442981113709027, 0.242743895018645, 0.498770814980772,
0.00265782877794886, 0.000885894856554407, 0.379188333632346,
0.00265793188317263, 0.00265771537700804, 0.39158027235054
)), row.names = c(NA, -31L), class = "data.frame"))
and I'm trying to produce a Sankey diagram using the networkD3 package with this simple code:
sankeyNetwork(Links = landuse$links, Nodes = landuse$nodes, Source = "source",
Target = "target", Value = "value", NodeID = "name",
units = "km²", fontSize = 12, nodeWidth = 30)
I received this message:
Warning message:
It looks like Source/Target is not zero-indexed. This is required in JavaScript and so your plot may not render.
But even if I zero-index the source/target, nothing renders in dev. I have the data in the same format as in this example, so I would like to know what the problem might be.
EDIT:
I have auto-references and circular-references. Is it possible to do the diagram with this type of data using the package?
Based on the example you provided a link to in one of your comments (here), you don't actually want auto and circular references, but instead what you want is two distinct nodes for each thing, one for the left column and one for the right column (e.g. "Formação Florestal" in the left/1985 column and "Formação Florestal" in the right/2017 column).
You can achieve that with the data you provided by distinguishing the source and target nodes that have the same index as separate nodes, like so...
landuse <- list(
nodes = data.frame(
name = c(
NA, NA, "1.1.1. Formação Florestal", "1.1.2. Formação Savanica", NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, "3.1. Pastagem", NA, NA, NA,
"3.2.1. Cultura Anual e Perene", NA,
"3.3. Mosaico de Agricultura e Pastagem", NA, NA,
"4.2. Infraestrutura Urbana", "4.5. Outra Área não Vegetada", NA, NA, NA,
NA, NA, NA, NA,"5.1 Rio ou Lago ou Oceano"
),
stringsAsFactors = FALSE
),
links = data.frame(
source = c(
3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 15L, 15L, 15L, 15L, 15L, 15L, 15L,
19L, 19L, 19L, 19L, 21L, 21L, 21L, 21L, 21L, 21L, 24L, 25L, 25L, 25L, 33L
),
target = c(
3L, 21L, 4L, 21L, 15L, 3L, 25L, 4L, 33L, 19L, 15L, 21L, 3L, 25L, 4L, 33L,
15L, 19L, 4L, 21L, 4L, 21L, 25L, 33L, 15L, 3L, 4L, 25L, 4L, 33L,33L
),
value = c(
0.544859347827813, 0.00354385993588971, 0.494359662221154,
4.67602736159475, 2.20248911690968, 0.501437742068369,
0.00354375594818463, 24.8427814053755, 0.439418727642527,
0.0079740332093807, 11.8060486886398, 2.76329829691466,
0.000886029792298199, 0.00177186270758855, 3.35504921147758,
0.14263144351167, 1.12170804870686, 0.0478454594554582,
0.217079959877658, 0.00620223918980076, 1.79754946594068,
9.02868098124075, 0.00442981113709027, 0.242743895018645,
0.498770814980772, 0.00265782877794886, 0.000885894856554407,
0.379188333632346, 0.00265793188317263, 0.00265771537700804,
0.39158027235054
),
stringsAsFactors = FALSE
)
)
# create a links data frame where the right and left column versions of each node
# are distinguishable
links <-
data.frame(source = paste0(landuse$nodes$name[landuse$links$source], " (1985)"),
target = paste0(landuse$nodes$name[landuse$links$target], " (2017)"),
value = landuse$links$value,
stringsAsFactors = FALSE)
# build a nodes data frame from the new links data frame
nodes <- data.frame(name = unique(c(links$source, links$target)),
stringsAsFactors = FALSE)
# change the source and target variables to be the zero-indexed position of
# each node in the new nodes data frame
links$source <- match(links$source, nodes$name) - 1
links$target <- match(links$target, nodes$name) - 1
# remove the year indicator from the node names
nodes$name <- substring(nodes$name, 1, nchar(nodes$name) - 7)
# plot it
library(networkD3)
sankeyNetwork(Links = links, Nodes = nodes, Source = "source",
Target = "target", Value = "value", NodeID = "name",
units = "km²", fontSize = 12, nodeWidth = 30)
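Because of how sankeyNetwork is built, the source and target indices in your links must start from 0: they are zero-based row positions in the nodes data frame. As you can see from landuse, your indices are one-based (for example, 3 refers to the third row of nodes).
You can reindex the links by subtracting 1, so that each index still points at the same node:
landuse$links$source <- landuse$links$source - 1
landuse$links$target <- landuse$links$target - 1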
sankeyNetwork(Links = landuse$links, Nodes = landuse$nodes, Source = "source",
Target = "target", Value = "value", NodeID = "name",
units = "km²", fontSize = 12, nodeWidth = 30)
Admittedly, it does not look as pretty as the Sankey diagram you link to in your question. Why? Because of your data:
You have "auto-references": links where the source and the target are the same node. These generate the weird semicircles that start and end in the same node.
You have "circular references": links where source 'X' goes to target 'Y', source 'Y' goes to target 'Z', and then source 'Z' goes back to target 'X'. These generate the weird curves.
Some of your values are several orders of magnitude smaller than others, so the small ones are barely visible.
You may need to sanity-check your data:
Are you really interested in the auto-references? If not, delete them.
Are you comfortable with circular references, or would you prefer to duplicate nodes to show a linear Sankey?
Are you interested in showing very small values? If not, delete them.
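The first and third checks above could be sketched in base R as follows. The toy links data frame and the 0.01 threshold are assumptions chosen for illustration, not values from the question.

```r
# Toy zero-indexed links table (hypothetical values)
links <- data.frame(
  source = c(0L, 0L, 1L, 1L, 2L),
  target = c(0L, 1L, 2L, 1L, 0L),
  value  = c(0.54, 0.49, 4.68, 24.84, 0.0009)
)

min_value <- 0.01  # assumed cut-off below which a link is too small to show

is_self_link <- links$source == links$target  # "auto-references"
is_tiny      <- links$value < min_value       # barely visible links

# Keep only links that connect two different nodes and are large enough
clean_links <- links[!is_self_link & !is_tiny, ]
```

Whether dropping these rows is appropriate depends on what the diagram is meant to show: in land-use data, a self-link may simply mean "area that did not change".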

create scatter plot matrix with openair and hexbin

I've worked with the openair and hexbin packages to create two scatter plots with the scatterPlot function:
scatterPlot(mydata, x ="Observed" , y = "Model1",xlab=10, ylab=10,method = "hexbin",mod.line=T,auto.text=F, col = "jet", xbin = 30)
scatterPlot(mydata, x ="Observed" , y = "Model2",xlab=10, ylab=10,method = "hexbin",mod.line=T,auto.text=F, col = "jet", xbin = 30)
I've got the scatter plots, but I want to put them into one plot with a single color scale for the counts, similar to this image: https://ibb.co/rF148kp
How should I proceed?
You could reorganize your data frame so that it has three columns: "Observed", "Modeled", and "Model Type". For example:
structure(list(observed = c(2L, 2L, 4L, 4L, 6L, 6L, 8L, 8L, 10L,
10L, 12L, 12L, 14L, 14L, 16L, 16L, 18L, 18L, 20L, 20L), modelled = c(1L,
5L, 7L, 2L, 5L, 9L, 13L, 15L, 16L, 14L, 18L, 17L, 10L, 21L, 26L,
24L, 22L, 28L, 27L, 30L), model_type = structure(c(1L, 2L, 1L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L,
2L), .Label = c("Model 1", "Model 2"), class = "factor")), class = "data.frame",
row.names = c(NA,
-20L))
With the data in this shape, you can then use the following code:
scatterPlot(mydata, x = "observed", y = "modelled", type = c("model_type"),
method = "hexbin",mod.line=T,auto.text=F, col = "jet", xbin = 5,
linear = TRUE, layout = c(2, 1))
This creates a plot containing the two scatter plots. Note that the code above sets xbin to 5 purely because I used a small data set for testing. Also, excuse the spelling in the y-axis and code ("modelled" should be "modeled")!
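The reorganization step itself might be sketched in base R like this. The column names Observed, Model1 and Model2 are taken from the question; the values are invented, and the output column names match the example data above.

```r
# Hypothetical wide data: one column per model, as in the question
wide <- data.frame(
  Observed = c(2, 4, 6),
  Model1   = c(1, 7, 5),
  Model2   = c(5, 2, 9)
)

# Stack the two model columns and record which model each row came from
long <- data.frame(
  observed   = rep(wide$Observed, times = 2),
  modelled   = c(wide$Model1, wide$Model2),
  model_type = factor(rep(c("Model 1", "Model 2"), each = nrow(wide)))
)
```

The resulting long data frame has one row per (observation, model) pair, which is the shape the type = "model_type" argument relies on.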

select a row based on another row in a data frame

My data looks like this
df <- structure(list(V1 = 1:15, V2 = structure(c(5L, 9L, 7L, 8L, 10L,
2L, 13L, 3L, 11L, 12L, 15L, 1L, 4L, 14L, 6L), .Label = c("A0A087WNY6",
"B2RTL5", "B8JJX9", "D3Z2H7", "E9PZ97", "G3UWX1", "Q2VWQ4", "Q3TMB5",
"Q3TWK2", "Q6ZPS9", "Q7TMW3", "Q8BP71", "Q8R4K2", "Q925B0", "Q9WU01"
), class = "factor"), V3 = c(5L, 7L, 10L, 11L, 13L, 15L, NA,
NA, NA, NA, NA, NA, NA, NA, NA)), .Names = c("V1", "V2", "V3"
), class = "data.frame", row.names = c(NA, -15L))
I want to select the rows of the first two columns whose V1 value appears in the third column.
The expected output is this:
5 Q6ZPS9
7 Q8R4K2
10 Q8BP71
11 Q9WU01
13 D3Z2H7
15 G3UWX1
I feel like V3 should not be in this data frame but in a separate vector. But here is a way:
df[df$V1 %in% df$V3,1:2]
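If the wanted values are kept in a separate vector, as suggested above, a minimal sketch could use match() instead. The small data frame and the vector wanted are hypothetical stand-ins for the question's df and its V3 column.

```r
# Hypothetical stand-in for the first two columns of the question's df
df <- data.frame(V1 = 1:6,
                 V2 = c("a", "b", "c", "d", "e", "f"),
                 stringsAsFactors = FALSE)

# The values to look up, held outside the data frame
wanted <- c(2L, 5L, 6L)

# match() gives the row position of each wanted value in df$V1
result <- df[match(wanted, df$V1), c("V1", "V2")]
```

Unlike %in%, match() returns rows in the order of wanted, and a value that is missing from V1 yields a row of NAs rather than being silently skipped.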

Errors when exporting plots to Plot.ly

I have this data (sample of the first 20 rows):
Codering variable value
1 Z1 Week.0 0
2 Z2 Week.0 0
3 Z3 Week.0 0
4 Z4 Week.0 0
5 Z5 Week.0 0
6 Z6 Week.0 0
7 Z7 Week.0 0
8 Z8 Week.0 0
9 Z9 Week.0 0
10 Z101 Week.0 NA
11 Z102 Week.0 NA
12 Z1 Week.1 0
13 Z2 Week.1 0
14 Z3 Week.1 0
15 Z4 Week.1 0
16 Z5 Week.1 0
17 Z6 Week.1 0
18 Z7 Week.1 0
19 Z8 Week.1 0
and I plot it using:
pZ <- ggplot(zmeltdata,aes(x=variable,y=value,color=Codering,group=Codering)) +
geom_line()+
geom_point()+
theme_few()+
theme(legend.position="right")+
scale_color_hue(name = "Treatment group:")+
scale_y_continuous(labels = percent)+
ylab("Germination percentage")+
xlab("Week number")+
labs(title = "Z. monophyllum germination data")
pZ
The graph displays just fine:
Yet when I try to export this to Plot.ly I get the following errors:
> py <- plotly()
> response<-py$ggplotly(pZ)
Error in if (all(xcomp) && all(ycomp)) { :
missing value where TRUE/FALSE needed
In addition: Warning message:
In trace.list[[lind[1]]]$y == trace.list[[lind[2]]]$y :
longer object length is not a multiple of shorter object length
I have searched for these errors, yet the explanations thoroughly confuse me. "missing value where TRUE/FALSE needed" is supposed to occur if you use logical terms such as IF/ELSE/TRUE/FALSE in your code, which I don't at all! Even when checking the plot object for NAs I get:
> is.na(pZ)
data layers scales mapping theme coordinates facet plot_env labels
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
and 'longer object length is not a multiple of shorter object length' is supposed to pop up when you have objects of different lengths, but I'm only using one object with three columns that all have exactly the same length. The graph object does give me NULL when I ask for its rows, but that is supposed to happen:
> nrow(zmeltdata)
[1] 143
> nrow(test)
NULL
All in all, I'm very confused and don't know how to handle these errors correctly. Could someone elaborate?
Thanks for your time.
EDIT: I have tried exporting a different graph to Plot.ly using a random sample of 1:100, and that worked just fine. I'm pretty sure the error is in my data; I just can't figure out how to fix it.
EDIT2: In response to @Gregor:
> dput(head(zmeltdata, 20))
structure(list(Codering = structure(c(16L, 19L, 20L, 21L, 22L,
23L, 24L, 25L, 26L, 17L, 18L, 16L, 19L, 20L, 21L, 22L, 23L, 24L,
25L, 26L), .Label = c("B1", "C2", "C3", "C8", "M1", "M101", "M102",
"M2", "M3", "M4", "M5", "M6", "M7", "M8", "M9", "Z1", "Z101",
"Z102", "Z2", "Z3", "Z4", "Z5", "Z6", "Z7", "Z8", "Z9"), class = "factor"),
variable = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Week.0",
"Week.1", "Week.2", "Week.3", "Week.4", "Week.5", "Week.6",
"Week.7", "Week.8", "Week.9", "Week.10", "Week.11", "Week.12"
), class = "factor"), value = c(0, 0, 0, 0, 0, 0, 0, 0, 0,
NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0)), .Names = c("Codering",
"variable", "value"), row.names = c(NA, 20L), class = "data.frame")
And the tail:
> dput(tail(zmeltdata, 43))
structure(list(Codering = structure(c(19L, 20L, 21L, 22L, 23L,
24L, 25L, 26L, 17L, 18L, 16L, 19L, 20L, 21L, 22L, 23L, 24L, 25L,
26L, 17L, 18L, 16L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 17L,
18L, 16L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 17L, 18L), .Label = c("B1",
"C2", "C3", "C8", "M1", "M101", "M102", "M2", "M3", "M4", "M5",
"M6", "M7", "M8", "M9", "Z1", "Z101", "Z102", "Z2", "Z3", "Z4",
"Z5", "Z6", "Z7", "Z8", "Z9"), class = "factor"), variable = structure(c(10L,
10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 11L, 11L, 11L, 11L,
11L, 11L, 11L, 11L, 11L, 11L, 11L, 12L, 12L, 12L, 12L, 12L, 12L,
12L, 12L, 12L, 12L, 12L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L,
13L, 13L, 13L), .Label = c("Week.0", "Week.1", "Week.2", "Week.3",
"Week.4", "Week.5", "Week.6", "Week.7", "Week.8", "Week.9", "Week.10",
"Week.11", "Week.12"), class = "factor"), value = c(0.1, 0.06,
0.05, 0.09, 0.04, 0.08, 0.05, 0.08, 0, 0, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), .Names = c("Codering",
"variable", "value"), row.names = 101:143, class = "data.frame")
I am not at all surprised by these; there are quite a few NAs in the dataset, but they shouldn't be an issue, since I have used a similar (bigger) dataset before.
And I also have the .csv file for you to use if you wish: https://www.mediafire.com/?jij1vlp14a29ntt
The issue is about handling NA's... I got https://plot.ly/~marianne2/417/z-monophyllum-germination-data/ by running the following code:
pZ <- ggplot(na.omit(zmeltdata), aes(x=variable, y=value, color=Codering,
group=Codering)) +
geom_line() +
geom_point() +
# theme_few() +
theme(legend.position="right") +
scale_color_hue(name="Treatment group:") +
# scale_y_continuous(labels = percent) +
ylab("Germination percentage") +
xlab("Week number") +
labs(title="Z. monophyllum germination data")
py$ggplotly(pZ, kwargs=list(fileopt="overwrite", filename="test_zdata"))
Note that I had to comment out theme_few() and scale_y_continuous(labels = percent) because from loading only "ggplot2", I would get the following errors:
Error: could not find function "theme_few"
and
Error in structure(list(call = match.call(), aesthetics = aesthetics, :
object 'percent' not found
respectively. I guess these are dependency issues (maybe you're using a version of "ggthemes"?).
I don't know what kind of magic theme_few() does, but if I don't use na.omit() on zmeltdata, my pZ plot looks like this:
Eww, "Week.10" comes after "Week.1" instead of after "Week.9"... So you wouldn't want to send this to plotly anyway! So I cannot exactly reproduce your ggplot example. But I wonder if you really want to keep these NA's (the CSV itself reads "NA", I was expecting blank "cells"). Don't you want to pre-process these anyway?
Note that I get the following warning message when I don't use na.omit() on zmeltdata:
Warning messages:
1: Removed 20 rows containing missing values (geom_path).
2: Removed 47 rows containing missing values (geom_point).
Again, beyond pure displaying/plotting considerations, since this looks like scientific data, wouldn't you want to number weeks with actual numbers, or pad the digits if you really want a string? ("Week.01", "Week.02", etc.)
And it looks like the missing data is all trailing... There's just no data (yet) for weeks 10+, right?
Thanks for reporting,
Marianne
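The week-numbering suggestion above could be sketched as follows, assuming the labels always have the form "Week.<number>" as in the question's data. The vector weeks is a hypothetical sample of those labels.

```r
# Sample of the week labels used in the question
weeks <- c("Week.0", "Week.1", "Week.9", "Week.10", "Week.12")

# Option 1: strip the prefix and use real numbers on the x axis
week_num <- as.integer(sub("^Week\\.", "", weeks))

# Option 2: keep strings, but zero-pad so that alphabetical order
# matches chronological order ("Week.09" sorts before "Week.10")
week_padded <- sprintf("Week.%02d", week_num)
```

Either column could then be mapped to x in the ggplot call in place of the original variable column.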

Converting object of class rules to data frame in R

I have the output of the apriori function, which mines data and returns a set of rules. I want to convert it to a data frame for further processing.
The rules object looks like this:
> inspect(output)
lhs rhs support confidence lift
1 {curtosis=(846,1.27e+03]} => {skewness=(-0.254,419]} 0.2611233 0.8044944 2.418776
2 {variance=(892,1.34e+03]} => {notes.class=FALSE} 0.3231218 0.9888393 1.781470
3 {variance=(-0.336,446]} => {notes.class=TRUE} 0.2859227 0.8634361 1.940608
4 {skewness=(837,1.26e+03]} => {notes.class=FALSE} 0.2924872 0.8774617 1.580815
5 {entropy=(-0.155,386],
class=FALSE} => {skewness=(837,1.26e+03]} 0.1597374 0.9521739 2.856522
6 {variance=(-0.336,446],
curtosis=(846,1.27e+03]} => {skewness=(-0.254,419]} 0.1378556 0.8325991 2.503275
The rules object is created from a data frame, which looks like this:
> data
variance skewness curtosis entropy notes.class
1 (892,1.34e+03] (837,1.26e+03] (-0.268,424] (386,771] FALSE
2 (892,1.34e+03] (-0.254,419] (424,846] (771,1.16e+03] FALSE
3 (892,1.34e+03] (837,1.26e+03] (-0.268,424] (-0.155,386] FALSE
4 (446,892] (-0.254,419] (846,1.27e+03] (386,771] FALSE
Then we can get the output variable using this:
> output <- apriori(data)
The arules package was used. dput(output) gives this:
new("rules"
, lhs = new("itemMatrix"
, data = new("ngCMatrix"
, i = c(8L, 2L, 0L, 5L, 9L, 12L, 0L, 8L, 0L, 3L, 0L, 8L, 8L, 13L, 8L,
10L, 3L, 10L, 8L, 11L, 8L, 13L, 3L, 12L, 2L, 5L, 2L, 6L, 2L,
5L, 2L, 6L, 2L, 10L, 2L, 7L, 2L, 11L, 0L, 3L, 0L, 10L, 0L, 7L,
11L, 13L, 5L, 6L, 6L, 12L, 5L, 10L, 1L, 5L, 4L, 6L, 6L, 13L,
0L, 3L, 8L, 0L, 8L, 13L, 3L, 8L, 13L, 0L, 3L, 13L, 2L, 5L, 6L,
2L, 5L, 12L, 2L, 6L, 12L)
, p = c(0L, 1L, 2L, 3L, 4L, 6L, 8L, 10L, 12L, 14L, 16L, 18L, 20L, 22L,
24L, 26L, 28L, 30L, 32L, 34L, 36L, 38L, 40L, 42L, 44L, 46L, 48L,
50L, 52L, 54L, 56L, 58L, 61L, 64L, 67L, 70L, 73L, 76L, 79L)
, Dim = c(14L, 38L)
, Dimnames = list(NULL, NULL)
, factors = list()
)
, itemInfo = structure(list(labels = structure(c("variance=(-0.336,446]",
"variance=(446,892]", "variance=(892,1.34e+03]", "skewness=(-0.254,419]",
"skewness=(419,837]", "skewness=(837,1.26e+03]", "curtosis=(-0.268,424]",
"curtosis=(424,846]", "curtosis=(846,1.27e+03]", "entropy=(-0.155,386]",
"entropy=(386,771]", "entropy=(771,1.16e+03]", "notes.class=FALSE",
"notes.class=TRUE"), class = "AsIs"), variables = structure(c(5L,
5L, 5L, 4L, 4L, 4L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L), .Label = c("curtosis",
"entropy", "notes.class", "skewness", "variance"), class = "factor"),
levels = structure(c(4L, 8L, 12L, 2L, 6L, 10L, 3L, 7L, 11L,
1L, 5L, 9L, 13L, 14L), .Label = c("(-0.155,386]", "(-0.254,419]",
"(-0.268,424]", "(-0.336,446]", "(386,771]", "(419,837]",
"(424,846]", "(446,892]", "(771,1.16e+03]", "(837,1.26e+03]",
"(846,1.27e+03]", "(892,1.34e+03]", "FALSE", "TRUE"), class = "factor")), .Names = c("labels",
"variables", "levels"), row.names = c(NA, -14L), class = "data.frame")
, itemsetInfo = structure(list(), .Names = character(0), row.names = integer(0), class = "data.frame")
)
, rhs = new("itemMatrix"
, data = new("ngCMatrix"
, i = c(3L, 12L, 13L, 12L, 5L, 3L, 8L, 13L, 0L, 3L, 8L, 3L, 3L, 8L,
6L, 5L, 12L, 12L, 12L, 12L, 12L, 13L, 13L, 13L, 3L, 12L, 5L,
12L, 12L, 13L, 4L, 13L, 3L, 0L, 8L, 12L, 6L, 5L)
, p = 0:38
, Dim = c(14L, 38L)
, Dimnames = list(NULL, NULL)
, factors = list()
)
, itemInfo = structure(list(labels = structure(c("variance=(-0.336,446]",
"variance=(446,892]", "variance=(892,1.34e+03]", "skewness=(-0.254,419]",
"skewness=(419,837]", "skewness=(837,1.26e+03]", "curtosis=(-0.268,424]",
"curtosis=(424,846]", "curtosis=(846,1.27e+03]", "entropy=(-0.155,386]",
"entropy=(386,771]", "entropy=(771,1.16e+03]", "notes.class=FALSE",
"notes.class=TRUE"), class = "AsIs"), variables = structure(c(5L,
5L, 5L, 4L, 4L, 4L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L), .Label = c("curtosis",
"entropy", "notes.class", "skewness", "variance"), class = "factor"),
levels = structure(c(4L, 8L, 12L, 2L, 6L, 10L, 3L, 7L, 11L,
1L, 5L, 9L, 13L, 14L), .Label = c("(-0.155,386]", "(-0.254,419]",
"(-0.268,424]", "(-0.336,446]", "(386,771]", "(419,837]",
"(424,846]", "(446,892]", "(771,1.16e+03]", "(837,1.26e+03]",
"(846,1.27e+03]", "(892,1.34e+03]", "FALSE", "TRUE"), class = "factor")), .Names = c("labels",
"variables", "levels"), row.names = c(NA, -14L), class = "data.frame")
, itemsetInfo = structure(list(), .Names = character(0), row.names = integer(0), class = "data.frame")
)
, quality = structure(list(support = c(0.261123267687819, 0.323121808898614,
0.285922684172137, 0.292487235594457, 0.159737417943107, 0.137855579868709,
0.137855579868709, 0.142231947483589, 0.142231947483589, 0.110138584974471,
0.110138584974471, 0.12399708242159, 0.153902261123268, 0.107221006564551,
0.13056163384391, 0.13056163384391, 0.150984682713348, 0.139314369073669,
0.100656455142232, 0.107221006564551, 0.154631655725748, 0.165572574762947,
0.112326768781911, 0.105762217359592, 0.12180889861415, 0.181619256017505,
0.181619256017505, 0.102844638949672, 0.105762217359592, 0.12837345003647,
0.12837345003647, 0.137855579868709, 0.137855579868709, 0.137855579868709,
0.137855579868709, 0.13056163384391, 0.13056163384391, 0.13056163384391
), confidence = c(0.804494382022472, 0.988839285714286, 0.863436123348018,
0.87746170678337, 0.952173913043478, 0.832599118942731, 0.832599118942731,
0.859030837004405, 0.898617511520737, 0.853107344632768, 0.915151515151515,
0.80188679245283, 0.972350230414747, 0.885542168674699, 0.864734299516908,
0.913265306122449, 1, 0.974489795918367, 1, 1, 0.990654205607477,
1, 0.980891719745223, 0.873493975903614, 0.814634146341463, 0.943181818181818,
0.950381679389313, 1, 0.92948717948718, 0.931216931216931, 0.897959183673469,
1, 0.969230769230769, 0.895734597156398, 0.832599118942731, 1,
0.864734299516908, 0.93717277486911), lift = c(2.41877587226493,
1.78146998779801, 1.94060807395104, 1.580814717477, 2.85652173913043,
2.50327498261071, 2.56515369004603, 1.93070701234925, 2.71366653809456,
2.56493458221826, 2.81948927477017, 2.41093594836147, 2.92344773223381,
2.72826587247868, 2.58853870008227, 2.73979591836735, 1.80157687253614,
1.75561827884899, 1.80157687253614, 1.80157687253614, 1.78473970550309,
2.24754098360656, 2.20459434060771, 1.96321350977681, 2.44926187419769,
1.69921455023295, 2.85114503816794, 1.80157687253614, 1.67454260588295,
2.09294821753838, 2.68799572230639, 2.24754098360656, 2.91406882591093,
2.70496064471679, 2.56515369004603, 1.80157687253614, 2.58853870008227,
2.81151832460733)), row.names = c(NA, 38L), .Names = c("support",
"confidence", "lift"), class = "data.frame")
, info = structure(list(data = data, ntransactions = 1371L, support = 0.1,
confidence = 0.8), .Names = c("data", "ntransactions", "support",
"confidence"))
)
We can't duplicate your data from your question (oh, you just added your data as I was typing this! Sorry!), so I'll use the example from the arules package:
library('arules');
data("Adult")
## Mine association rules.
rules <- apriori(Adult,
parameter = list(supp = 0.5, conf = 0.9,
target = "rules"))
Then I can duplicate the stuff output from inspect(rules):
> ruledf = data.frame(
lhs = labels(lhs(rules))$elements,
rhs = labels(rhs(rules))$elements,
rules@quality)
> head(ruledf)
lhs rhs support confidence lift
1 {} {capital-gain=None} 0.9173867 0.9173867 1.0000000
2 {} {capital-loss=None} 0.9532779 0.9532779 1.0000000
3 {hours-per-week=Full-time} {capital-gain=None} 0.5435895 0.9290688 1.0127342
4 {hours-per-week=Full-time} {capital-loss=None} 0.5606650 0.9582531 1.0052191
5 {sex=Male} {capital-gain=None} 0.6050735 0.9051455 0.9866565
6 {sex=Male} {capital-loss=None} 0.6331027 0.9470750 0.9934931
and do stuff like order by decreasing lift:
head(ruledf[order(-ruledf$lift),])
The help for the rules class: http://www.rdocumentation.org/packages/arules/functions/rules-class.html will tell you what you can get from your rules object. I just used that information to build a data frame. If it's not exactly what you want, then cook one up using your own recipe!
Run apriori on the Adult data
rules <- apriori(Adult, parameter = list(supp = 0.5, conf = 0.9, target =
"rules"))
Inspect LHS, RHS, support, confidence and lift
arules::inspect(rules)
Create a dataframe
df = data.frame(
lhs = labels(lhs(rules)),
rhs = labels(rhs(rules)),
rules@quality)
View top 6 lines in new dataframe
head(df)
This does the trick
rules_dataframe <- as(output, 'data.frame')
