I have a dataset with a repeatedly measured continuous outcome and some covariates of different classes, like in the example below.
Id y Date Soda Team
1 -0.4521 1999-02-07 Coke Eagles
1 0.2863 1999-04-15 Pepsi Raiders
2 0.7956 1999-07-07 Coke Raiders
2 -0.8248 1999-07-26 NA Raiders
3 0.8830 1999-05-29 Pepsi Eagles
4 0.1303 2005-03-04 NA Cowboys
5 0.1375 2013-11-02 Coke Cowboys
5 0.2851 2015-06-23 Coke Eagles
5 -0.3538 2015-07-29 Pepsi NA
6 0.3349 2002-10-11 NA NA
7 -0.1756 2005-01-11 Pepsi Eagles
7 0.5507 2007-10-16 Pepsi Cowboys
7 0.5132 2012-07-13 NA Cowboys
7 -0.5776 2017-11-25 Coke Cowboys
8 0.5486 2009-02-08 Coke Cowboys
I am trying to multiply impute the missing values in Soda and Team using the mice package. As I understand it, because MI is not a causal model, there is no concept of dependent and independent variables. I am not sure how to set up this MI process using mice. I would like some suggestions or advice from others who have encountered missing data in a repeated-measures setting like this and how they used mice to tackle the problem. Thanks in advance.
Edit
This is what I have tried so far, but it does not capture the repeated-measures part of the dataset.
library(mice)
init = mice(dat, maxit = 0)
methd = init$method
predM = init$predictorMatrix
methd[c("Soda")] = "logreg"
methd[c("Team")] = "polyreg"  # Team has three levels, so logistic regression would not be appropriate
imputed = mice(dat, method = methd, predictorMatrix = predM, m = 5)
There are several options to accomplish what you are asking for. I have decided to impute missing values in covariates in the so-called 'wide' format. I will illustrate this with the following worked example, which you can easily apply to your own data.
Let's first make a reprex. Here, I use the longitudinal Mayo Clinic Primary Biliary Cirrhosis Data (pbc2), which comes with the JM package. These data are organized in the so-called 'long' format, meaning that each patient i has multiple rows, and each row contains a measurement of variable x taken at time j. Your dataset is also in the long format. In this example, I assume that pbc2$serBilir is our outcome variable.
# install.packages('JM')
library(JM)
# note: use function(x) instead of \(x) if you use a version of R <4.1.0
# missing values per column
miss_abs <- \(x) sum(is.na(x))
miss_perc <- \(x) round(sum(is.na(x)) / length(x) * 100, 1L)
miss <- cbind('Number' = apply(pbc2, 2, miss_abs), '%' = apply(pbc2, 2, miss_perc))
# --------------------------------
> miss[which(miss[, 'Number'] > 0),]
Number %
ascites 60 3.1
hepatomegaly 61 3.1
spiders 58 3.0
serChol 821 42.2
alkaline 60 3.1
platelets 73 3.8
According to this output, 6 variables in pbc2 contain at least one missing value. Let's pick alkaline from these. We also need patient id and the time variable years.
# subset
pbc_long <- subset(pbc2, select = c('id', 'years', 'alkaline', 'serBilir'))
# sort ascending based on id and, within each id, years
pbc_long <- with(pbc_long, pbc_long[order(id, years), ])
# ------------------------------------------------------
> head(pbc_long, 5)
id years alkaline serBilir
1 1 1.09517 1718 14.5
2 1 1.09517 1612 21.3
3 2 14.15234 7395 1.1
4 2 14.15234 2107 0.8
5 2 14.15234 1711 1.0
Just by quickly eyeballing the output, we see that years does not seem to differ within subjects, even though the variables were repeatedly measured. For the sake of this example, let's add a little bit of time to every row of years except the first measurement.
set.seed(1)
# add a little bit of time to each row of 'years' except the first
new_years <- lapply(split(pbc_long, pbc_long$id), \(x) {
  add_time <- 1:(length(x$years) - 1L) + rnorm(length(x$years) - 1L, sd = 0.25)
  c(x$years[1L], x$years[-1L] + add_time)
})
# replace the original 'years' variable
pbc_long$years <- unlist(new_years)
# integer time variable needed to store repeated measurements as separate columns
pbc_long$measurement_number <- unlist(sapply(split(pbc_long, pbc_long$id), \(x) 1:nrow(x)))
# only keep the first 4 repeated measurements per patient
pbc_long <- subset(pbc_long, measurement_number %in% 1:4)
Since we will perform our multiple imputation in the wide format (meaning that each participant i has one row, with the repeated measurements on x stored in j different columns, so x · j columns in total), we have to convert the data from long to wide. Now that we have prepared our data, we can use reshape to do this for us.
# convert long format into wide format
v_names <- c('years', 'alkaline', 'serBilir')
pbc_wide <- reshape(pbc_long,
                    idvar = 'id',
                    timevar = 'measurement_number',
                    v.names = v_names, direction = 'wide')
# -----------------------------------------------------------------
> head(pbc_wide, 4)[, 1:9]
id years.1 alkaline.1 serBilir.1 years.2 alkaline.2 serBilir.2 years.3 alkaline.3
1 1 1.095170 1718 14.5 1.938557 1612 21.3 NA NA
3 2 14.152338 7395 1.1 15.198249 2107 0.8 15.943431 1711
12 3 2.770781 516 1.4 3.694434 353 1.1 5.148726 218
16 4 5.270507 6122 1.8 6.115197 1175 1.6 6.716832 1157
Now let's multiply impute the missing values in our covariates.
library(mice)
# Setup-run
ini <- mice(pbc_wide, maxit = 0)
meth <- ini$method
pred <- ini$predictorMatrix
visSeq <- ini$visitSequence
# avoid collinearity issues by letting only variables measured
# at the same point in time predict each other
pred[grep("1", rownames(pred), value = TRUE),
grep("2|3|4", colnames(pred), value = TRUE)] <- 0
pred[grep("2", rownames(pred), value = TRUE),
grep("1|3|4", colnames(pred), value = TRUE)] <- 0
pred[grep("3", rownames(pred), value = TRUE),
grep("1|2|4", colnames(pred), value = TRUE)] <- 0
pred[grep("4", rownames(pred), value = TRUE),
grep("1|2|3", colnames(pred), value = TRUE)] <- 0
# variables that should not be imputed
pred[c("id", grep('^year', names(pbc_wide), value = TRUE)), ] <- 0
# variables that should not serve as predictors
pred[, c("id", grep('^year', names(pbc_wide), value = TRUE))] <- 0
# multiply imputed missing values ------------------------------
imp <- mice(pbc_wide, pred = pred, m = 10, maxit = 20, seed = 1)
# Time difference of 2.899244 secs
As can be seen in three example traceplots (which can be obtained with plot(imp)), the algorithm has converged nicely. Refer to the section on convergence in Stef van Buuren's book Flexible Imputation of Missing Data for more info.
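For instance, a minimal call to inspect convergence of the imputed alkaline columns could look like this (the variable names follow from the wide data above):
# traceplots of the chain means and variances across iterations
plot(imp, c('alkaline.1', 'alkaline.2', 'alkaline.3'))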
Now we need to convert the multiply imputed data (which is in wide format) back to long format, so that we can use it for analyses. We also need to make sure that we exclude all rows that had missing values for our outcome variable serBilir, because we do not want to use imputed values of the outcome.
# need unlisted data
implong <- complete(imp, 'long', include = FALSE)
# 'smart' way of getting all the names of the repeated variables in a usable format
v_names <- as.data.frame(matrix(apply(
expand.grid(grep('ye|alk|ser', names(implong), value = TRUE)),
1, paste0, collapse = ''), nrow = 4, byrow = TRUE), stringsAsFactors = FALSE)
names(v_names) <- names(pbc_long)[2:4]
# convert back to long format
longlist <- lapply(split(implong, implong$.imp),
reshape, direction = 'long',
varying = as.list(v_names),
v.names = names(v_names),
idvar = 'id', times = 1:4)
# logical that is TRUE if our outcome was not observed
# which should be based on the original, unimputed data
orig_data <- reshape(imp$data, direction = 'long',
varying = as.list(v_names),
v.names = names(v_names),
idvar = 'id', times = 1:4)
orig_data$logical <- is.na(orig_data$serBilir)
# merge into the list of imputed long-format datasets:
longlist <- lapply(longlist, merge, y = subset(orig_data, select = c(id, time, logical)))
# exclude rows for which logical == TRUE
longlist <- lapply(longlist, \(x) subset(x, !logical))
Finally, convert longlist back into a mids object using datalist2mids from the miceadds package.
imp <- miceadds::datalist2mids(longlist)
# ----------------
> imp$loggedEvents
NULL
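From here, you can fit your analysis model to each of the imputed datasets and pool the results with Rubin's rules. A minimal sketch (the model formula is illustrative, not prescriptive; pooling lmer fits additionally requires the broom.mixed package):
library(lme4)
# random-intercept model fitted to each imputed dataset, then pooled
fit <- with(imp, lmer(serBilir ~ years + alkaline + (1 | id)))
summary(pool(fit))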
I am trying to create a loop to use compare_means (from the ggpubr library in R) across all columns in a data frame and then select only the significant p.adjusted values, but it does not work well.
Here is some code
head(df3)
sampleID Actio Beta Gammes Traw Cluster2
gut10 10 2.2 55 13 HIGH
gut12 20 44 67 12 HIGH
gut34 5.5 3 89 33 LOW
gut26 4 45 23 4 LOW
library(ggpubr)
data <- list()
for (i in 2:length(df3)) {
  data <- compare_means(df3[[i]] ~ Cluster2, data = df3, paired = FALSE,
                        p.adjust.method = "bonferroni", method = "wilcox.test")
}
Error: `df3[i]` must evaluate to column positions or names, not a list
I would like to create an output that can be converted into a data frame with all the information contained in the compare_means output.
Thanks a lot
Try this:
library(ggpubr)
data <- list()
for (i in 2:(length(df3) - 1)) {
  new <- df3[, c(i, "Cluster2")]
  colnames(new) <- c("interest", "Cluster2")
  # store each column's result in the list instead of overwriting it on every iteration
  data[[colnames(df3)[i]]] <- compare_means(interest ~ Cluster2, data = new, paired = FALSE,
                                            p.adjust.method = "bonferroni", method = "wilcox.test")
}
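If you then want a single data frame with only the significant adjusted p-values, something like this should work (the 0.05 cutoff is an assumption; compare_means returns the adjusted p-value in the p.adj column):
# combine the per-column results and keep only significant adjusted p-values
result <- do.call(rbind, data)
result_sig <- subset(result, p.adj < 0.05)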
I am having trouble figuring out how to trim the end off of a string in a data frame.
I want to trim each name down to a "base" name: numbers and letters, then a period, then a number. My goal is to trim everything in my data frame to this "base" name and then sum the values that share the same "base". I was thinking it would be possible to trim, then merge and sum the values.
i.e. from
Gene_name Values
B0222.5 4
B0222.6 16
B0228.7.1 2
B0228.7.2 12
B0350.2h.1 30
B0350.2h.2 2
B0350.2i 15
2RSSE.1a 3
2RSSE.1b 10
R02F11.11 4
to
Gene_name Values
B0222.5 4
B0222.6 16
B0228.7 14
B0350.2 47
2RSSE.1 13
R02F11.11 4
Thank you for any help!
Here is a solution using the dplyr and stringr packages. You first replace Gene_name with the extracted base pattern, and then use the group_by and summarise functions from dplyr to get the sum of values for each name:
library(dplyr)
library(stringr)
df2 <- df %>%
  mutate(Gene_name = str_extract(Gene_name, "[[:alnum:]]+\\.\\d+")) %>%
  group_by(Gene_name) %>%
  summarise(Values = sum(Values))
Gene_name Values
<chr> <int>
1 2RSSE.1 13
2 B0222.5 4
3 B0222.6 16
4 B0228.7 14
5 B0350.2 47
6 R02F11.11 4
As someone else has also suggested, I would get the gene names first, and then search for them in the original data.frame:
df <- data.frame(Gene_name = c("B0222.5", "B0222.6", "B0228.7.1", "B0228.7.2", "B0350.2h.1", "B0350.2h.2", "B0350.2i", "2RSSE.1a", "2RSSE.1b", "R02F11.11"),
Values = c(4, 16, 2, 12, 30, 2, 15, 3, 10, 4),
stringsAsFactors = F)
pat <- "(^[[:alnum:]]+\\.[[:digit:]]*)"
cap.pos <- regexpr(pat, df$Gene_name)
cap.gene <- unique(substr(df$Gene_name, cap.pos, (cap.pos + attributes(cap.pos)$match.length - 1)))
do.call(rbind, lapply(cap.gene, (function(nm){
sumval <- sum(df[grepl(nm, df$Gene_name, fixed = T),]$Values, na.rm = T)
data.frame(Gene_name = nm, Value = sumval)
})))
The result tracks with your request
Gene_name Value
1 B0222.5 4
2 B0222.6 16
3 B0228.7 14
4 B0350.2 47
5 2RSSE.1 13
6 R02F11.11 4
You can also create the Gene_name as a factor and change the levels.
# coerce the vector to a factor
Gene_name <- as.factor(df$Gene_name)
# view the levels
levels(Gene_name)
# to make B0228.7.1 into B0228.7
levels(Gene_name)[levels(Gene_name) == "B0228.7.1"] <- "B0228.7"
You can repeat this for each level that needs to change; rows that share a level will then be treated as the same category, so a grouped summary will add their values together.
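For instance, a short sketch of the full relabel-and-sum approach (assuming df as defined in the answer above; the regular expression mirrors the base-name pattern used earlier, and relabelling levels to duplicate names merges them):
# relabel every level down to its base name, then sum Values per level
Gene_name <- as.factor(df$Gene_name)
levels(Gene_name) <- sub("^([[:alnum:]]+\\.[[:digit:]]+).*", "\\1", levels(Gene_name))
tapply(df$Values, Gene_name, sum)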
I've been trying to plot a network using the networkD3 package in R for a week now. The simpleNetwork function works normally, but it doesn't allow much control over the appearance of the graph. The forceNetwork function exists for exactly this purpose: displaying a graph with richer visual features.
The problem I have is pretty much the same as the one exposed in this question. I have carefully read the package documentation and tried the solution proposed in the above thread, with no luck: all I get is a cloud of nodes with no edges linking them.
Here are my data.frames:
edg
Gene1 Gene2 Prob
1 22 3
2 22 6
3 22 6
4 22 9
5 22 3
6 22 4
7 22 8
8 22 4
9 22 6
10 22 8
11 22 6
12 22 10
13 22 6
14 22 3
15 22 6
16 22 6
17 22 0
18 22 4
19 22 6
20 22 4
vert
Symbol Chr Expr
1 21 9
2 17 10
3 17 0
4 20 0
5 6 9
6 5 11
7 12 0
8 1 20
9 17 11
10 17 7
11 17 11
12 10 0
13 17 0
14 7 7
15 17 6
16 17 0
17 2 5
18 5 10
19 17 10
20 17 9
21 12 4
22 3 2
Well, this results in the above-mentioned cloud of nodes with no edges. The same thing happens if I replace the 'Symbol' column with the actual labels I'd put on the nodes (respecting the order of the Links table, as required by the package).
Note that the package illustrates the use of this function with this example, and if you open the datasets used (MisLinks, MisNodes), their content is the same as mine, except for the labels of the nodes. Running that very same example works; running it with my data does not.
Here is the function I use to plot the network:
forceNetwork( Links = edg, Nodes = vert, Source = "Gene1", Target = "Gene2",
Value = "Prob", NodeID = "Symbol", Group = "Chr", opacity = 0.7,
colourScale = "d3.scale.category20b()", Nodesize = "Expr", zoom = T,
legend = T )
Every other property is correctly displayed (node size, legend, colours), but I keep seeing no edges. There must be a mistake somewhere in my datasets, but I cannot find it.
I was having the same problem (simpleNetwork working normally, forceNetwork first displaying only nodes and no edges, then subsequently no display at all).
The problem (which you presumably fixed when you "rebuilt dataframes starting numbering from 0") was most likely your original Links data, edg, starting from 1 instead of 0.
The networkD3 documentation, http://christophergandrud.github.io/networkD3/, has this note:
Note: You are probably used to R’s 1-based numbering (i.e. counting in R starts from 1). However, networkD3 plots are created using JavaScript, which is 0-based. So, your data links will need to start from 0.
Regarding incorrect data types, which I also originally thought might be the problem: I tested casting all the different columns (except the factor variable for the NodeID) with as.numeric vs as.integer. However, having now corrected my data to be 0-based instead of 1-based, my forceNetwork display works normally with either data type.
Hope this helps!
I have just fixed the same problem in my own forceNetwork. It turned out that the dataframe of edges that I had created (exported from iGraph) had character types, not int types. Casting the edge 'from' and 'to' columns using as.numeric() resolved the problem and the links drew correctly.
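A minimal sketch of that cast (the data frame and column names edges, from, and to are illustrative; substitute your own):
# igraph-exported edge endpoints may be character; forceNetwork needs numeric, 0-based indices
edges$from <- as.numeric(edges$from)
edges$to   <- as.numeric(edges$to)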
I hope this helps.
With regards,
Will
Technically speaking, the reason your example data will not work, even if you address other possible problems (like edg$Gene1 and edg$Gene2 being non-numeric), is that you refer to a node 22 in your edg data, which in "0-based index" terms points to the 23rd row of your vert data frame, which does not exist.
As has been pointed out, this is probably because the data is in 1-based indexing and should be converted, which could easily be done with
edg$Gene1 <- edg$Gene1 - 1
edg$Gene2 <- edg$Gene2 - 1
Alternatively, one might have intended to refer to another node which, for whatever reason, did not make it into the vert data frame; in that case, the node would need to be added to the vert data frame, which could easily be done with (for example)...
vert <- rbind(vert, c(23,1,1))
You could test whether or not you refer to a node in your edg data that doesn't exist in your vert data with something like...
all(unique(c(edg$Gene1, edg$Gene2)) %in% (1:nrow(vert) - 1))
# [1] FALSE
which should return TRUE. If not, something's wrong.
You could determine which nodes are referred to in your edg data that do not exist in your vert data with...
unique(c(edg$Gene1, edg$Gene2))[which(!unique(c(edg$Gene1, edg$Gene2)) %in% (1:nrow(vert) - 1))]
# [1] 22
Fully reproducible example adjusting the indices in edg to be "0-based":
edg <- read.csv(header = TRUE, colClasses = 'character', text = '
Gene1,Gene2,Prob
1,22,3
2,22,6
3,22,6
4,22,9
5,22,3
6,22,4
7,22,8
8,22,4
9,22,6
10,22,8
11,22,6
12,22,10
13,22,6
14,22,3
15,22,6
16,22,6
17,22,0
18,22,4
19,22,6
20,22,4
')
vert <- read.csv(header = TRUE, colClasses = 'character', text = '
Symbol,Chr,Expr
1,21,9
2,17,10
3,17,0
4,20,0
5,6,9
6,5,11
7,12,0
8,1,20
9,17,11
10,17,7
11,17,11
12,10,0
13,17,0
14,7,7
15,17,6
16,17,0
17,2,5
18,5,10
19,17,10
20,17,9
21,12,4
22,3,2
')
# cast to numeric just to be sure
edg$Gene1 <- as.numeric(edg$Gene1)
edg$Gene2 <- as.numeric(edg$Gene2)
# adjust the indices so they're "0-based"
edg$Gene1 <- edg$Gene1 - 1
edg$Gene2 <- edg$Gene2 - 1
# Nodesize is also necessarily numeric
vert$Expr <- as.numeric(vert$Expr)
library(networkD3)
forceNetwork(Links = edg, Nodes = vert, Source = "Gene1", Target = "Gene2",
Value = "Prob", NodeID = "Symbol", Group = "Chr", opacity = 0.7,
Nodesize = "Expr", zoom = TRUE, legend = TRUE)
Fully reproducible example adding a node to vert:
edg <- read.csv(header = TRUE, colClasses = 'character', text = '
Gene1,Gene2,Prob
1,22,3
2,22,6
3,22,6
4,22,9
5,22,3
6,22,4
7,22,8
8,22,4
9,22,6
10,22,8
11,22,6
12,22,10
13,22,6
14,22,3
15,22,6
16,22,6
17,22,0
18,22,4
19,22,6
20,22,4
')
vert <- read.csv(header = TRUE, colClasses = 'character', text = '
Symbol,Chr,Expr
1,21,9
2,17,10
3,17,0
4,20,0
5,6,9
6,5,11
7,12,0
8,1,20
9,17,11
10,17,7
11,17,11
12,10,0
13,17,0
14,7,7
15,17,6
16,17,0
17,2,5
18,5,10
19,17,10
20,17,9
21,12,4
22,3,2
')
# cast to numeric just to be sure
edg$Gene1 <- as.numeric(edg$Gene1)
edg$Gene2 <- as.numeric(edg$Gene2)
vert$Expr <- as.numeric(vert$Expr)
# add another node to the Nodes data frame
vert <- rbind(vert, c(23,1,1))
library(networkD3)
forceNetwork(Links = edg, Nodes = vert, Source = "Gene1", Target = "Gene2",
Value = "Prob", NodeID = "Symbol", Group = "Chr", opacity = 0.7,
Nodesize = "Expr", zoom = TRUE, legend = TRUE)
I met the same problem, but fixed it by setting the factor levels of source and target to be consistent with the node names before converting to numeric:
edg$Gene1<-factor(edg$Gene1,levels=vert$Symbol)
edg$Gene2<-factor(edg$Gene2,levels=vert$Symbol)
edg$source<-as.numeric(edg$Gene1)-1
edg$target<-as.numeric(edg$Gene2)-1
so that the source and target vectors have factor levels consistent with the node names (vert$Symbol). Then
forceNetwork( Links = edg, Nodes = vert, Source = "source", Target = "target",
Value = "Prob", NodeID = "Symbol", Group = "Chr", opacity = 0.7,
colourScale = "d3.scale.category20b()", Nodesize = "Expr", zoom = T,
legend = T )
works for me.
Hope this is helpful.
I'm finding working with the arules package a bit tricky. I'm using the apriori algorithm to find association rules, similar to an example in the arules documentation.
data("AdultUCI")
dim(AdultUCI)
AdultUCI[1:2,]
#Ignore everything from here to the last two lines, this is just data preparation
## remove attributes
AdultUCI[["fnlwgt"]] <- NULL
AdultUCI[["education-num"]] <- NULL
## map metric attributes
AdultUCI[[ "age"]] <- ordered(cut(AdultUCI[[ "age"]], c(15,25,45,65,100)),
labels = c("Young", "Middle-aged", "Senior", "Old"))
AdultUCI[[ "hours-per-week"]] <- ordered(cut(AdultUCI[[ "hours-per-week"]],
c(0,25,40,60,168)),
labels = c("Part-time", "Full-time", "Over-time", "Workaholic"))
AdultUCI[[ "capital-gain"]] <- ordered(cut(AdultUCI[[ "capital-gain"]],
c(-Inf,0,median(AdultUCI[[ "capital-gain"]][AdultUCI[[ "capital-gain"]]>0]),
Inf)), labels = c("None", "Low", "High"))
AdultUCI[[ "capital-loss"]] <- ordered(cut(AdultUCI[[ "capital-loss"]],
c(-Inf,0, median(AdultUCI[[ "capital-loss"]][AdultUCI[[ "capital-loss"]]>0]),
Inf)), labels = c("None", "Low", "High"))
#resume reading here
rules <- apriori(AdultUCI, parameter=list(support=0.6, confidence=0.75, minlen=4))
inspect(rules)
This returns the following four rules:
lhs rhs support confidence lift
1 {race=White,
capital-gain=None,
native-country=United-States} => {capital-loss=None} 0.680398 0.9457029 0.9920537
2 {race=White,
capital-loss=None,
native-country=United-States} => {capital-gain=None} 0.680398 0.9083504 0.9901500
3 {race=White,
capital-gain=None,
capital-loss=None} => {native-country=United-States} 0.680398 0.9189249 1.0239581
4 {capital-gain=None,
capital-loss=None,
native-country=United-States} => {race=White} 0.680398 0.8730100 1.0210133
I must be missing something: how do you find the rows in the source data that match an lhs rule, using just arules functions?
Is there an easy way to build an SQL WHERE clause from the lhs(rules)?
Thanks
This answer is based on the following answer: https://stats.stackexchange.com/questions/21340/finding-suitable-rules-for-new-data-using-arules.
The solution is very slow; I'm not sure if it will work for large applications.
library(arules)
rules <- apriori(AdultUCI, parameter=list(support=0.4, confidence=0.75, minlen=4))
inspect(rules)
rec <- function(rules, data, iter){
  basket <- data[iter]
  # rules whose LHS is fully contained in the basket
  rulesMatchLHS <- is.subset(rules@lhs, basket)
  # ...and whose RHS is not already in the basket
  suitableRules <- rulesMatchLHS & !(is.subset(rules@rhs, basket))
  # keep the suitable rules and return the one with the highest lift
  rules <- sort(rules[suitableRules], decreasing = TRUE, by = "lift")
  as(head(rules, 1), "data.frame")
}
recom_loop <- function(rules, data){
  temp <- lapply(seq_along(data), function(x) rec(rules, data, x))
  temp <- do.call("rbind", temp)
  recom <- gsub(".*=> |\\{|\\}", "", temp$rules)
  as.data.frame(cbind(as(data, "data.frame"), recom))
}
trans <- as(AdultUCI, "transactions")
recom <- recom_loop(rules, trans[1:50])
Here is some example output:
head(recom)
transactionID
1 1
2 2
3 3
4 4
5 5
6 6
items
1 {age=Middle-aged,workclass=State-gov,education=Bachelors,marital-status=Never-married,occupation=Adm-clerical,relationship=Not-in-family,race=White,sex=Male,capital-gain=Low,capital-loss=None,hours-per-week=Full-time,native-country=United-States,income=small}
2 {age=Senior,workclass=Self-emp-not-inc,education=Bachelors,marital-status=Married-civ-spouse,occupation=Exec-managerial,relationship=Husband,race=White,sex=Male,capital-gain=None,capital-loss=None,hours-per-week=Part-time,native-country=United-States,income=small}
3 {age=Middle-aged,workclass=Private,education=HS-grad,marital-status=Divorced,occupation=Handlers-cleaners,relationship=Not-in-family,race=White,sex=Male,capital-gain=None,capital-loss=None,hours-per-week=Full-time,native-country=United-States,income=small}
4 {age=Senior,workclass=Private,education=11th,marital-status=Married-civ-spouse,occupation=Handlers-cleaners,relationship=Husband,race=Black,sex=Male,capital-gain=None,capital-loss=None,hours-per-week=Full-time,native-country=United-States,income=small}
5 {age=Middle-aged,workclass=Private,education=Bachelors,marital-status=Married-civ-spouse,occupation=Prof-specialty,relationship=Wife,race=Black,sex=Female,capital-gain=None,capital-loss=None,hours-per-week=Full-time,native-country=Cuba,income=small}
6 {age=Middle-aged,workclass=Private,education=Masters,marital-status=Married-civ-spouse,occupation=Exec-managerial,relationship=Wife,race=White,sex=Female,capital-gain=None,capital-loss=None,hours-per-week=Full-time,native-country=United-States,income=small}
recom
1 race=White
2 race=White
3 race=White
4 race=White
5 race=White
6 capital-gain=None
As for the first question, the transactions supporting a rule's lhs may be found using this function (it should work faster than the one from the other response):
supp_trans_ids = function(items, transactions){
  # logical matrix: which itemsets (rows) are fully contained in which transactions (columns)
  tmp = is.subset(items, transactions)
  tmp2 = lapply(
    seq_len(nrow(tmp)),
    # 'which' alone would leave a name on each index, which is a complete rule
    # (and would therefore use a lot of memory)
    function(i) {
      t = which(tmp[i, ])
      names(t) = NULL
      t
    }
  )
  # to easily identify the sets of items
  names(tmp2) = rownames(tmp)
  tmp2
}
Now, you may find which transactions support each rule's lhs with:
AdultUCI_trans = as(AdultUCI, 'transactions')
trans_supporting = supp_trans_ids(lhs(rules), AdultUCI_trans)
e.g.
> str(trans_supporting)
List of 4
$ {race=White,capital-gain=None,native-country=United-States} : int [1:35140] 2 3 6 8 13 17 18 19 20 21 ...
$ {race=White,capital-loss=None,native-country=United-States} : int [1:36585] 1 2 3 6 8 9 10 13 17 18 ...
$ {race=White,capital-gain=None,capital-loss=None} : int [1:36164] 2 3 6 8 13 17 18 19 20 21 ...
$ {capital-gain=None,capital-loss=None,native-country=United-States}: int [1:38066] 2 3 4 6 8 11 13 14 17 18 ...
And the underlying data you may find with:
AdultUCI_trans[trans_supporting[[1]]] # transactions supporting
AdultUCI[trans_supporting[[1]],] # data on which these transactions are based
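For the second part of the question (building a SQL WHERE clause from lhs(rules)), a small sketch based on parsing the rule labels may help; the backtick/quote style is an assumption and may need adjusting for your database:
# hypothetical helper: turn one rule's LHS label into a SQL WHERE clause
lhs_to_where <- function(lhs_label) {
  # lhs_label looks like "{race=White,capital-gain=None,native-country=United-States}"
  conds <- strsplit(gsub("[{}]", "", lhs_label), ",", fixed = TRUE)[[1]]
  cols <- sub("=.*", "", conds)
  vals <- sub(".*=", "", conds)
  paste("WHERE", paste(sprintf("`%s` = '%s'", cols, vals), collapse = " AND "))
}
lhs_to_where(labels(lhs(rules))[1])
# e.g. "WHERE `race` = 'White' AND `capital-gain` = 'None' AND `native-country` = 'United-States'"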