How can I partition market basket items into clusters? - r

I have a data set as follows (I took a simple example, but the real data set is much bigger):
V1 V2 V3 V4
1 1 0 0 1
2 0 1 1 0
3 0 0 1 0
4 1 1 1 1
5 0 1 1 0
6 1 0 0 1
7 0 0 0 1
8 0 1 1 1
9 1 0 1 0
10 0 1 1 0
...
where V1, V2, V3, ..., Vn are items and 1, 2, 3, ..., 1000 are transactions. I want to partition these items into k clusters such that each cluster contains the items that appear together most frequently in the same transactions.
To determine the number of times each pair of items appears together, I tried a cross table and got the following results:
V1 V2 V3 V4
V1 4 1 2 3
V2 1 5 5 2
V3 2 5 7 2
V4 3 2 2 5
For this small example, if I want to create 2 clusters (k=2) such that each cluster contains 2 items (to keep the clusters balanced), I will get:
Cluster1={V1,V4}
Cluster2={V2,V3}
because:
1) V1 appears most frequently with V4: (V1,V4)=3 > (V1,V3) > (V1,V2), and the same holds for V4.
2) V2 appears most frequently with V3: (V2,V3)=5 > (V2,V4) > (V2,V1), and the same holds for V3.
How can I do this partition with R, and for a bigger set of data?

I think you are asking about clustering. It is not quite the same as what you are doing above, but you could use hclust to look for similarity between variables with a suitable distance measure.
For example
plot(hclust(dist(t(df),method="binary")))
produces a dendrogram of the items (plot not shown here).
You should look at ?dist to see if this distance measure is meaningful in your context, and ?hclust for other things you can do once you have your dendrogram (such as identifying clusters).
Or you could use your crosstab as a distance matrix (perhaps take the reciprocal of the values, and then as.dist).
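The crosstab idea can be sketched as follows. This is only an illustration on the toy data from the question: the choice of average linkage and the use of cutree() to extract k = 2 groups are assumptions, not something the question prescribes. The plain reciprocal breaks down for item pairs that never co-occur (it yields Inf), so the sketch assumes every pair appears together at least once. Note also that cutree() does not enforce equally sized clusters.

```r
# Toy data from the question: rows are transactions, columns are items.
df <- data.frame(
  V1 = c(1, 0, 0, 1, 0, 1, 0, 0, 1, 0),
  V2 = c(0, 1, 0, 1, 1, 0, 0, 1, 0, 1),
  V3 = c(0, 1, 1, 1, 1, 0, 0, 1, 1, 1),
  V4 = c(1, 0, 0, 1, 0, 1, 1, 1, 0, 0)
)

# Item-by-item co-occurrence counts (same numbers as the crosstab above).
co <- crossprod(as.matrix(df))

# Turn counts into dissimilarities: frequent pairs become "close".
# Assumption: no pair has a count of zero, otherwise 1 / 0 gives Inf.
d <- as.dist(1 / co)

# Hierarchical clustering on the items, then cut the tree into k groups.
hc <- hclust(d, method = "average")
clusters <- cutree(hc, k = 2)

# List the items in each cluster.
split(names(clusters), clusters)
```

On this example the two groups are {V1, V4} and {V2, V3}, which matches the partition described in the question.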

library(data.table)
data:
df<-
fread("
V1 V2 V3
1 1 0 0
2 0 0 1
3 0 0 1
4 1 1 1
5 0 0 1
6 1 0 0
7 0 0 0
8 0 1 1
9 1 0 1
10 0 1 1
")[,-1]
code:
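# fread() already returns a data.table; setDT() is kept only as a safeguard
setDT(df)
# for each item x, keep the transactions containing x and sum every column,
# which gives the co-occurrence counts of x with all items
sapply(names(df),function(x){
df[get(x)==1,lapply(.SD,sum,na.rm=TRUE),.SDcols=names(df)]
})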
result:
V2 V3 V4
V2 4 1 2
V3 1 3 3
V4 2 3 7

df <- read.table(text="
ID V1 V2 V3
1 1 0 0
2 0 0 1
3 0 0 1
4 1 1 1
5 0 0 1
6 1 0 0
7 0 0 0
8 0 1 1
9 1 0 1
10 0 1 1
", header = TRUE)
k = 3 # number of clusters
library(dplyr)
df %>%
# group and count on all except the first id column
group_by_at(2:ncol(df)) %>%
# get the counts, and collect all the transaction ids
summarize(n = n(), tran_ids = paste(ID, collapse = ',')) %>%
ungroup() %>%
# grab the top k summarizations
top_n(k, n)
# V1 V2 V3 n tran_ids
# <int> <int> <int> <int> <chr>
# 1 0 0 1 3 2,3,5
# 2 0 1 1 2 8,10
# 3 1 0 0 2 1,6

You can transpose your table and use standard methods of clustering. Thus, you will cluster the items. The features are the transactions.
Geometric approaches such as k-means can be used. Alternatively, you can use mixture models, which provide information criteria (such as BIC) for selecting the number of clusters. Here is an R script:
require(VarSelLCM)
my.data <- as.data.frame(t(df))
# To consider Gaussian mixture
# Alternatively Poisson mixture can be considered by converting each column into integer.
for (j in 1:ncol(my.data)) my.data[,j] <- as.numeric(my.data[,j])
## Clustering by considering all the variables as discriminative
# Number of clusters is between 1 and 6
res.all <- VarSelCluster(my.data, 1:6, vbleSelec = FALSE)
# partition
res.all@partitions@zMAP
# shiny application
VarSelShiny(res.all)
## Clustering with variable selection
# Number of clusters is between 1 and 6
res.selec <- VarSelCluster(my.data, 1:6, vbleSelec = TRUE)
# partition
res.selec@partitions@zMAP
# shiny application
VarSelShiny(res.selec)
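For the geometric route mentioned above, a minimal k-means sketch on the transposed table might look like this. It uses the four-item toy data from the original question. The values k = 2, nstart = 25 and the seed are arbitrary choices for illustration. k-means does not guarantee balanced cluster sizes, so treat this as a starting point only.

```r
# Toy data from the question: rows are transactions, columns are items.
df <- data.frame(
  V1 = c(1, 0, 0, 1, 0, 1, 0, 0, 1, 0),
  V2 = c(0, 1, 0, 1, 1, 0, 0, 1, 0, 1),
  V3 = c(0, 1, 1, 1, 1, 0, 0, 1, 1, 1),
  V4 = c(1, 0, 0, 1, 0, 1, 1, 1, 0, 0)
)

# Transpose so that each item is an observation and each transaction a feature.
items <- t(as.matrix(df))

# Several random starts, because k-means can stop in a local optimum.
set.seed(1)
km <- kmeans(items, centers = 2, nstart = 25)

# Items grouped by cluster label (the label numbers themselves are arbitrary).
split(rownames(items), km$cluster)
```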

Related

making 1000 contingency tables in R

I have a vector called "combined" with 1's and 0's
combined
1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
I sampled twice from this vector, each with a sample size of 3 and put it into a contingency table of counts as follows.
2 1
1 2
I want to reiterate this sampling 1000 times such that I end with 1000 contingency tables each with counts of 1s and 0s from the sampling.
This is what I tried:
sample1 = as.vector(replicate(10000, sample(combined, 3)))
sample2 = as.vector(replicate(10000, sample(combined, 3)))
con_table = table(sample1,sample2)
but I ended up getting only 1 table instead of 10000:
8109 7573
7306 7012
Hoping to get some help.
You need to wrap the entire expression, sample and table inside replicate. Add a conversion to a factor to ensure you always get a 2x2 table. E.g. a simple version with 2 replications:
combined <- rep(0:1,each=10)
combined <- as.factor(combined)
replicate(2, table(sample(combined,3), sample(combined,3)), simplify=FALSE)
#[[1]]
#
# 0 1
# 0 0 1
# 1 1 1
#
#[[2]]
#
# 0 1
# 0 1 1
# 1 0 1

Sub-setting or arrange the data in R

As I am new to R, this question may seem like a piece of cake to you.
I have data in a text file. The first column holds the cluster number and the second column holds the names of different organisms.
For example:
0 org4|gene759
1 org1|gene992
2 org1|gene1101
3 org4|gene757
4 org1|gene1702
5 org1|gene989
6 org1|gene990
7 org1|gene1699
9 org1|gene1102
10 org4|gene2439
10 org1|gene1374
I need to re-arrange/reshape the data in following format.
Cluster No.  org1  org2  org3  org4
0            0     0     0     1
1            1     0     0     0
I could not figure out how to do it in R.
Thanks
We could use table
out <- cbind(ClusterNo = seq_len(nrow(df1)), as.data.frame.matrix(table(seq_len(nrow(df1)),
factor(sub("\\|.*", "", df1[[2]]), levels = paste0("org", 1:4)))))
head(out, 2)
# ClusterNo org1 org2 org3 org4
#1 1 0 0 0 1
#2 2 1 0 0 0
It is also possible that we need to use the first column to get the frequency
out1 <- as.data.frame.matrix(table(df1[[1]],
factor(sub("\\|.*", "", df1[[2]]), levels = paste0("org", 1:4))))
Reading the table into R can be done with
input <- read.table('filename.txt')
Then we can extract the relevant number from the org4|gene759 string using a regular expression, and set this to a third column of our input:
input[, 3] <- gsub('^org(.+)\\|.*', '\\1', input[, 2])
Our input data now looks like this:
> input
V1 V2 V3
1 0 org4|gene759 4
2 1 org1|gene992 1
3 2 org1|gene1101 1
4 3 org4|gene757 4
5 4 org1|gene1702 1
6 5 org1|gene989 1
7 6 org1|gene990 1
8 7 org1|gene1699 1
9 9 org1|gene1102 1
10 10 org4|gene2439 4
11 10 org1|gene1374 1
Then we need to list the possible values of org:
possibleOrgs <- seq_len(max(input[, 3])) # = c(1, 2, 3, 4)
Now for the tricky part. The following function takes each unique cluster number in turn (I notice that 10 appears twice in your example data), takes all the rows relating to that cluster, and looks at the org value for those rows.
result <- vapply(unique(input[, 1]), function (x)
possibleOrgs %in% input[input[, 1] == x, 3], logical(4))
We can then format this result as we like, perhaps using t to transform its orientation, * 1 to convert from TRUEs and FALSEs to 1s and 0s, and colnames to title its columns:
result <- t(result) * 1
colnames(result) <- paste0('org', possibleOrgs)
rownames(result) <- unique(input[, 1])
I hope that this is what you were looking for -- it wasn't quite clear from your question!
Output:
> result
org1 org2 org3 org4
0 0 0 0 1
1 1 0 0 0
2 1 0 0 0
3 0 0 0 1
4 1 0 0 0
5 1 0 0 0
6 1 0 0 0
7 1 0 0 0
9 1 0 0 0
10 1 0 0 1

Count in buckets (Total by Row, aka Tabulate) [duplicate]

Imagine a group of three machines (a, b, c) that capture data in a series of tests. I need to count, per test, how many times each possible outcome has occurred.
Using this test data and sample output, how might you solve it? (Assume that the test results may be numbers or alpha.)
tests <- data.table(
a = c(1,2,2,3,0),
b = c(1,2,3,0,3),
c = c(2,2,3,0,2)
)
sumry <- data.table(
V0 = c(0,0,0,2,1),
V1 = c(2,0,0,0,0),
V2 = c(1,3,1,0,1),
V3 = c(0,0,2,1,1),
v4 = c(0,0,0,0,0)
)
tests
sumry
The output from sumry shows a column for each possible outcome/value (prefixed with V as in 'value' measured). Note: the sumry output indicates that there is the potential for a value of 4 but that is not observed in any of the test data here and therefore is always zero.
> tests
a b c
1: 1 1 2
2: 2 2 2
3: 2 3 3
4: 3 0 0
5: 0 3 2
> sumry
V0 V1 V2 V3 v4
1: 0 2 1 0 0
2: 0 0 3 0 0
3: 0 0 1 2 0
4: 2 0 0 1 0
5: 1 0 1 1 0
The V0 column from sumry indicates how many times the value zero is observed from any machine in each test. For this set of test data, zero is only observed in the 4th and 5th tests. The same holds true for V1-V4.
I'm sure there's a simple name for this.
Here's one solution built around tabulate():
res <- suppressWarnings(do.call(rbind,apply(tests+1L,1L,tabulate)));
colnames(res) <- paste0('V',seq(0L,len=ncol(res)));
res;
## V0 V1 V2 V3
## [1,] 0 2 1 0
## [2,] 0 0 3 0
## [3,] 0 0 1 2
## [4,] 2 0 0 1
## [5,] 1 0 1 1
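One caveat with the tabulate() version: rows whose largest value differs produce count vectors of different lengths, and rbind() then recycles the shorter ones (that is the warning being suppressed). The sketch below is a variant that fixes the set of possible outcomes up front. The range 0:4 is an assumption taken from the V0 to v4 columns of the sample output. Because it counts factor levels, the same pattern should also work when the outcomes are letters.

```r
# Test data from the question, as a plain data.frame.
tests <- data.frame(
  a = c(1, 2, 2, 3, 0),
  b = c(1, 2, 3, 0, 3),
  c = c(2, 2, 3, 0, 2)
)

# Assumed set of possible outcomes; 4 never occurs but still gets a column.
outcomes <- 0:4

# For each row (test), count how often each possible outcome appears.
res <- t(apply(as.matrix(tests), 1L, function(x) {
  table(factor(x, levels = outcomes))
}))
colnames(res) <- paste0("V", outcomes)
res
```

Each row of res sums to 3 (one result per machine), and the V4 column is all zeros because that value is never observed.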

Data Manipulation in R Project: compare rows

I'm looking to compare values within a dataset.
Every row starts with a unique ID followed by a few binary variables.
The data looks like this:
row.name v1 v2 v3 ...
1 0 0 0
2 1 1 1
3 1 0 1
I want to know which values are the same (if equal assign value of 1) and which are different (if not equal assign value of 0) for all unique pairings.
For example in column v1: row1 == 0 and row2 == 1, which should result in an assignment of 0.
So, the output should look like this
id1 id2 v1 v2 v3 ...
1 2 0 0 0 ...
1 3 0 1 0 ...
2 3 1 0 1 ...
I'm looking for an efficient way of doing this for more than 1000 rows...
There's no way to do this without expanding each combination of rows, so with 1000 rows, it is going to take a bit of time. But here is a solution:
dat <- read.table(header=T, text="row.name v1 v2 v3
1 0 0 0
2 1 1 1
3 1 0 1")
Create the index rows:
indices <- t(combn(dat$row.name, 2))
colnames(indices) <- c('id1', 'id2')
Loop through the index rows, and collect the comparisons:
res1 <- t(apply(indices, 1, function(x) as.numeric(dat[x[1],-1] == dat[x[2],-1])))
colnames(res1) <- names(dat[-1])
Put them together:
result <- cbind(indices, res1)
result
## id1 id2 v1 v2 v3
## [1,] 1 2 0 0 0
## [2,] 1 3 0 1 0
## [3,] 2 3 1 0 1
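For larger inputs, the row-by-row apply() can probably be replaced by a single matrix comparison. The sketch below still enumerates every pair, so memory grows with the number of pairs, but it avoids repeated data frame indexing. It pairs rows by position and only looks up the IDs at the end, which is an assumption that matters if row.name is not simply 1, 2, 3, ...

```r
dat <- data.frame(
  row.name = 1:3,
  v1 = c(0, 1, 1),
  v2 = c(0, 1, 0),
  v3 = c(0, 1, 1)
)

# Binary variables as a numeric matrix (ID column dropped).
m <- as.matrix(dat[, -1])

# All unique pairs of row positions.
pairs <- t(combn(nrow(dat), 2))

# Compare the two sides of every pair at once; TRUE/FALSE becomes 1/0.
same <- (m[pairs[, 1], , drop = FALSE] == m[pairs[, 2], , drop = FALSE]) * 1L

# Attach the IDs of the two rows being compared.
result <- cbind(
  id1 = dat$row.name[pairs[, 1]],
  id2 = dat$row.name[pairs[, 2]],
  same
)
result
```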

Creating subgroups from categorical data by using lapply in R

I was wondering if you kind folks could answer a question I have. In the sample data I've provided below, in column 1 I have a categorical variable, and in column 2 p-values.
x <- c(rep("A",0.1*10000),rep("B",0.2*10000),rep("C",0.65*10000),rep("D",0.05*10000))
categorical_data=as.matrix(sample(x,10000))
p_val=as.matrix(runif(10000,0,1))
combi=as.data.frame(cbind(categorical_data,p_val))
head(combi)
V1 V2
1 A 0.484525170875713
2 C 0.48046557046473
3 C 0.228440979029983
4 B 0.216991128632799
5 C 0.521497668232769
6 D 0.358560319757089
I want to now take one of the categorical variables, let's say "C", and create another variable if it is C (print 1 in column 3, or 0 if it isn't).
combi$NEWVAR[combi$V1=="C"] <-1
combi$NEWVAR[combi$V1!="C"] <-0
V1 V2 NEWVAR
1 A 0.484525170875713 0
2 C 0.48046557046473 1
3 C 0.228440979029983 1
4 B 0.216991128632799 0
5 C 0.521497668232769 1
6 D 0.358560319757089 0
I'd like to do this for each of the variables in V1, and then loop over using lapply:
variables=unique(combi$V1)
loopeddata=lapply(variables,function(x){
combi$NEWVAR[combi$V1==x] <-1
combi$NEWVAR[combi$V1!=x]<-0
}
)
My output however looks like this:
[[1]]
[1] 0
[[2]]
[1] 0
[[3]]
[1] 0
[[4]]
[1] 0
My desired output would be like the table in the second block of code, but when looping over the third column would be A=1, while B,C,D=0. Then B=1, A,C,D=0 etc.
If anyone could help me out that would be very much appreciated.
How about something like this:
model.matrix(~ -1 + V1, data=combi)
Then you can cbind it to combi if you desire:
combi <- cbind(combi, model.matrix(~ -1 + V1, data=combi))
model.matrix is definitely the way to do this in R. You can, however, also consider using table.
Here's an example using the result I get when using set.seed(1) (always use a seed when sharing example problems with random data).
LoopedData <- table(sequence(nrow(combi)), combi$V1)
head(LoopedData)
#
# A B C D
# 1 0 1 0 0
# 2 0 0 1 0
# 3 0 0 1 0
# 4 0 0 1 0
# 5 0 1 0 0
# 6 0 0 1 0
## If you want to bind it back with the original data
combi <- cbind(combi, as.data.frame.matrix(LoopedData))
head(combi)
# V1 V2 A B C D
# 1 B 0.0647124934475869 0 1 0 0
# 2 C 0.676612401846796 0 0 1 0
# 3 C 0.735371692571789 0 0 1 0
# 4 C 0.111299667274579 0 0 1 0
# 5 B 0.0466546178795397 0 1 0 0
# 6 C 0.130910312291235 0 0 1 0
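As for why the lapply() attempt in the question returns a list of zeros: the anonymous function modifies a local copy of combi, and its return value is just the value of the last assignment (0). A sketch of an lapply() version that returns the indicator vectors instead is shown below. The data frame is rebuilt here with data.frame() rather than cbind() on matrices, which is a deviation from the question made so that V2 stays numeric.

```r
set.seed(1)
x <- c(rep("A", 1000), rep("B", 2000), rep("C", 6500), rep("D", 500))
combi <- data.frame(V1 = sample(x), V2 = runif(length(x)))

# One 0/1 indicator vector per category, returned rather than assigned.
categories <- sort(unique(combi$V1))
indicators <- lapply(categories, function(lv) as.integer(combi$V1 == lv))
names(indicators) <- categories

# Bind the indicator columns back onto the original data.
combi <- cbind(combi, as.data.frame(indicators))
head(combi)
```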

Resources