How to count the number of combinations of boolean data in R

What is the best way to determine a factor or create a new category field based on a number of boolean fields? In this example, I need to count the number of unique combinations of medications.
> MultPsychMeds
ID OLANZAPINE HALOPERIDOL QUETIAPINE RISPERIDONE
1 A 1 1 0 0
2 B 1 0 1 0
3 C 1 0 1 0
4 D 1 0 1 0
5 E 1 0 0 1
6 F 1 0 0 1
7 G 1 0 0 1
8 H 1 0 0 1
9 I 0 1 1 0
10 J 0 1 1 0
Perhaps another way to state it is that I need to pivot or cross tabulate the pairs. The final results need to look something like:
Combination Count
OLANZAPINE/HALOPERIDOL 1
OLANZAPINE/QUETIAPINE 3
OLANZAPINE/RISPERIDONE 4
HALOPERIDOL/QUETIAPINE 2
This data frame can be replicated in R with:
MultPsychMeds <- structure(list(ID = structure(1:10, .Label = c("A", "B", "C",
"D", "E", "F", "G", "H", "I", "J"), class = "factor"), OLANZAPINE = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L), HALOPERIDOL = c(1L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L), QUETIAPINE = c(0L, 1L, 1L, 1L,
0L, 0L, 0L, 0L, 1L, 1L), RISPERIDONE = c(0L, 0L, 0L, 0L, 1L,
1L, 1L, 1L, 0L, 0L)), .Names = c("ID", "OLANZAPINE", "HALOPERIDOL",
"QUETIAPINE", "RISPERIDONE"), class = "data.frame", row.names = c(NA,
-10L))

Here's one approach using the reshape and plyr packages:
library(reshape)
library(plyr)
#Melt into long format
dat.m <- melt(MultPsychMeds, id.vars = "ID")
#Group at the ID level and paste the drugs together with "/"
out <- ddply(dat.m, "ID", summarize, combos = paste(variable[value == 1], collapse = "/"))
#Calculate a table
with(out, count(combos))
x freq
1 HALOPERIDOL/QUETIAPINE 2
2 OLANZAPINE/HALOPERIDOL 1
3 OLANZAPINE/QUETIAPINE 3
4 OLANZAPINE/RISPERIDONE 4

Just for fun, a base R solution (that can be turned into a oneliner :-) ):
data.frame(table(apply(MultPsychMeds[,-1], 1, function(currow){
wc<-which(currow==1)
paste(colnames(MultPsychMeds)[wc+1], collapse="/")
})))
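For reference, here is a sketch of the same base R idea written out step by step, with the result labelled like the table requested in the question. The data frame below re-creates the question's columns as plain numeric vectors; the names meds, is_on, combos and res are chosen here for illustration only.

```r
# Re-create the 0/1 medication columns from the question
MultPsychMeds <- data.frame(
  ID          = LETTERS[1:10],
  OLANZAPINE  = c(1, 1, 1, 1, 1, 1, 1, 1, 0, 0),
  HALOPERIDOL = c(1, 0, 0, 0, 0, 0, 0, 0, 1, 1),
  QUETIAPINE  = c(0, 1, 1, 1, 0, 0, 0, 0, 1, 1),
  RISPERIDONE = c(0, 0, 0, 0, 1, 1, 1, 1, 0, 0)
)

# Logical matrix: TRUE where a medication is present for that ID
meds  <- setdiff(names(MultPsychMeds), "ID")
is_on <- as.matrix(MultPsychMeds[meds]) == 1

# One label per row, e.g. "OLANZAPINE/QUETIAPINE"
combos <- apply(is_on, 1, function(r) paste(meds[r], collapse = "/"))

# Tabulate the labels and name the columns as in the question
res <- as.data.frame(table(Combination = combos),
                     responseName = "Count",
                     stringsAsFactors = FALSE)
res
```

Because the labels are built in column order, two rows with the same medications always get the same label, whatever order the rows are in.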

Another way could be:
subset(
as.data.frame(
with(MultPsychMeds, table(OLANZAPINE, HALOPERIDOL, QUETIAPINE, RISPERIDONE)),
responseName="count"
),
count>0
)
which gives
OLANZAPINE HALOPERIDOL QUETIAPINE RISPERIDONE count
4 1 1 0 0 1
6 1 0 1 0 3
7 0 1 1 0 2
10 1 0 0 1 4
It's not exactly the format you asked for, but it is fast and simple.
There is a shorthand for this in the plyr package:
require(plyr)
count(MultPsychMeds, c("OLANZAPINE", "HALOPERIDOL", "QUETIAPINE", "RISPERIDONE"))
# OLANZAPINE HALOPERIDOL QUETIAPINE RISPERIDONE freq
# 1 0 1 1 0 2
# 2 1 0 0 1 4
# 3 1 0 1 0 3
# 4 1 1 0 0 1

Related

Counting the common indices shared between columns of a dataframe containing only binary values in R

Suppose I have a dataframe containing binary values as:
A B C D
a 1 0 0 0
b 0 1 1 0
c 1 1 0 1
d 0 0 1 1
e 1 1 1 1
f 1 0 0 1
I'd like to count the number of common indices shared between each pair of columns in the dataframe,
as shown below. What is the most efficient way to do so in R?
A B C D
A - 2 1 3
B 2 - 2 2
C 1 2 - 2
D 3 2 2 -
Any help is greatly appreciated. Thanks!
Maybe this is what you are after
> `diag<-`(crossprod(as.matrix(df)),NA)
A B C D
A NA 2 1 3
B 2 NA 2 2
C 1 2 NA 2
D 3 2 2 NA
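Why this works: for a 0/1 matrix X, entry (i, j) of t(X) %*% X is the sum over rows of X[, i] * X[, j], which is the number of rows where both columns are 1. The sketch below re-creates the example data with plain numeric vectors and checks one entry against an explicit count; the names X and shared are illustrative.

```r
# Example data from the question
df <- data.frame(
  A = c(1, 0, 1, 0, 1, 1),
  B = c(0, 1, 1, 0, 1, 0),
  C = c(0, 1, 0, 1, 1, 0),
  D = c(0, 0, 1, 1, 1, 1),
  row.names = letters[1:6]
)

X <- as.matrix(df)
shared <- crossprod(X)   # equivalent to t(X) %*% X

# Explicit count of rows where A and D are both 1
explicit_AD <- sum(df$A == 1 & df$D == 1)

shared["A", "D"]   # pairwise count taken from the cross product
explicit_AD        # same number, counted directly

# The diagonal holds each column's own count of 1s,
# which is why the answer overwrites it with NA
diag(shared)
```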
Data
> dput(df)
structure(list(A = c(1L, 0L, 1L, 0L, 1L, 1L), B = c(0L, 1L, 1L,
0L, 1L, 0L), C = c(0L, 1L, 0L, 1L, 1L, 0L), D = c(0L, 0L, 1L,
1L, 1L, 1L)), class = "data.frame", row.names = c("a", "b", "c",
"d", "e", "f"))

Concatenate dichotomous columns to semicolon-separated column

I have a data frame containing the results of a multiple choice question. Each item has either 0 (not mentioned) or 1 (mentioned). The columns are named like this:
F1.2_1, F1.2_2, F1.2_3, F1.2_4, F1.2_5, F1.2_99
etc.
I would like to concatenate these values like this: The new column should be a semicolon-separated string of the selected items. So if a row has a 1 in F1.2_1, F1.2_4 and F1.2_5 it should be: 1;4;5
The last digit(s) of the dichotomous column names are the item codes to be used in the string.
Any idea how this could be achieved with R (and data.table)? Thanks for any help!
edit:
Here is an example DF with the desired result:
structure(list(F1.2_1 = c(0L, 1L, 0L, 1L), F1.2_2 = c(1L, 0L,
0L, 1L), F1.2_3 = c(0L, 1L, 0L, 1L), F1.2_4 = c(0L, 1L, 0L, 0L
), F1.2_5 = c(0L, 0L, 0L, 0L), F1.2_99 = c(0L, 0L, 1L, 0L), desired_result = structure(c(3L,
2L, 4L, 1L), .Label = c("1;2;3", "1;3;4", "2", "99"), class = "factor")), .Names = c("F1.2_1",
"F1.2_2", "F1.2_3", "F1.2_4", "F1.2_5", "F1.2_99", "desired_result"
), class = "data.frame", row.names = c(NA, -4L))
F1.2_1 F1.2_2 F1.2_3 F1.2_4 F1.2_5 F1.2_99 desired_result
1 0 1 0 0 0 0 2
2 1 0 1 1 0 0 1;3;4
3 0 0 0 0 0 1 99
4 1 1 1 0 0 0 1;2;3
In his comment, the OP asked how to deal with more multiple choice questions.
The approach below will be able to handle an arbitrary number of questions and choices for each question. It uses melt() and dcast() from the data.table package.
Sample input data
Let's assume the input data.frame DT for the extended case contains two questions, one with 6 choices and the other with 4 choices:
DT
# F1.2_1 F1.2_2 F1.2_3 F1.2_4 F1.2_5 F1.2_99 F2.7_1 F2.7_2 F2.7_3 F2.7_11
#1: 0 1 0 0 0 0 0 1 1 0
#2: 1 0 1 1 0 0 1 1 1 1
#3: 0 0 0 0 0 1 1 0 1 0
#4: 1 1 1 0 0 0 1 0 1 1
Code
library(data.table)
# coerce to data.table and add row number for later join
setDT(DT)[, rn := .I]
# reshape from wide to long format
molten <- melt(DT, id.vars = "rn")
# alternatively, the measure cols can be specified (in case of other id vars)
# molten <- melt(DT, measure.vars = patterns("^F"))
# split question id and choice id
molten[, c("question_id", "choice_id") := tstrsplit(variable, "_")]
# reshape only selected choices from long to wide format,
# thereby pasting together the ids of the selected choices for each question
result <- dcast(molten[value == 1], rn ~ question_id, paste, collapse = ";",
fill = NA, value.var = "choice_id")
# final join for demonstration only, remove row number as no longer needed
DT[result, on = "rn"][, rn := NULL][]
# F1.2_1 F1.2_2 F1.2_3 F1.2_4 F1.2_5 F1.2_99 F2.7_1 F2.7_2 F2.7_3 F2.7_11 F1.2 F2.7
#1: 0 1 0 0 0 0 0 1 1 0 2 2;3
#2: 1 0 1 1 0 0 1 1 1 1 1;3;4 1;2;3;11
#3: 0 0 0 0 0 1 1 0 1 0 99 1;3
#4: 1 1 1 0 0 0 1 0 1 1 1;2;3 1;3;11
For each question, the final result shows which choices were selected in each row.
Reproducible data
The sample data can be created with
DT <- structure(list(F1.2_1 = c(0L, 1L, 0L, 1L), F1.2_2 = c(1L, 0L,
0L, 1L), F1.2_3 = c(0L, 1L, 0L, 1L), F1.2_4 = c(0L, 1L, 0L, 0L
), F1.2_5 = c(0L, 0L, 0L, 0L), F1.2_99 = c(0L, 0L, 1L, 0L), F2.7_1 = c(0L,
1L, 1L, 1L), F2.7_2 = c(1L, 1L, 0L, 0L), F2.7_3 = c(1L, 1L, 1L,
1L), F2.7_11 = c(0L, 1L, 0L, 1L)), .Names = c("F1.2_1", "F1.2_2",
"F1.2_3", "F1.2_4", "F1.2_5", "F1.2_99", "F2.7_1", "F2.7_2",
"F2.7_3", "F2.7_11"), row.names = c(NA, -4L), class = "data.frame")
We can try
j1 <- do.call(paste, c(as.integer(sub(".*_", "",
names(DF)[-7]))[col(DF[-7])]*DF[-7], sep=";"))
DF$newCol <- gsub("^;+|;+$", "", gsub(";*0;|0$|^0", ";", j1))
DF$newCol
#[1] "2" "1;3;4" "99" "1;2;3"
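An alternative sketch for the single-question case that avoids the regex clean-up: take the item codes from the column names and, for each row, paste the codes of the columns that equal 1. The data frame below re-creates the six item columns from the question (without desired_result); codes, selected and newCol are illustrative names.

```r
# The six item columns from the question
DF <- data.frame(
  F1.2_1  = c(0, 1, 0, 1),
  F1.2_2  = c(1, 0, 0, 1),
  F1.2_3  = c(0, 1, 0, 1),
  F1.2_4  = c(0, 1, 0, 0),
  F1.2_5  = c(0, 0, 0, 0),
  F1.2_99 = c(0, 0, 1, 0)
)

# Item code = everything after the last underscore in the column name
codes <- sub(".*_", "", names(DF))

# TRUE where an item was mentioned
selected <- as.matrix(DF) == 1

# For each row, join the codes of the mentioned items with ";"
newCol <- apply(selected, 1, function(r) paste(codes[r], collapse = ";"))
newCol
```

A row with no item mentioned yields an empty string here; whether that should become NA instead is a choice the question leaves open.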

R: data.table compare sets of rows

I am working in R with data.tables. I have the following data.table encoding a set of points with coordinates A, B, C, D, plus an index column, set, giving the set each point belongs to.
library(data.table)
A B C D set
1: 0 0 0 0 1
2: 1 0 1 0 2
3: 1 1 1 0 2
4: 0 1 0 0 2
5: 1 0 1 1 2
6: 0 1 0 0 3
7: 1 1 0 0 3
8: 0 0 1 0 4
9: 1 0 1 0 4
10: 0 1 0 1 4
11: 0 0 0 0 5
12: 1 0 0 0 5
13: 1 1 1 0 5
14: 1 1 1 1 5
dt = setDT(structure(list(A = c(0L, 1L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L,
0L, 1L, 1L, 1L), B = c(0L, 0L, 1L, 1L, 0L, 1L, 1L, 0L, 0L, 1L,
0L, 0L, 1L, 1L), C = c(0L, 1L, 1L, 0L, 1L, 0L, 0L, 1L, 1L, 0L,
0L, 0L, 1L, 1L), D = c(0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L,
0L, 0L, 0L, 1L), set = c(1L, 2L, 2L, 2L, 2L, 3L, 3L, 4L, 4L,
4L, 5L, 5L, 5L, 5L)), .Names = c("A", "B", "C", "D", "set"), row.names = c(NA,
-14L), class = "data.frame"))
I have another table encoding, e.g., the probability (mass) of each set.
set mass
1: 1 0.27809187
2: 2 0.02614841
3: 3 0.36890459
4: 4 0.28975265
5: 5 0.03710247
wt = setDT(structure(list(set = 1:5, mass = c(0.27809187, 0.02614841, 0.36890459,
0.28975265, 0.03710247)), .Names = c("set", "mass"), row.names = c(NA,
-5L), class = "data.frame"))
I would like to have a procedure to create a projection to a subspace, e.g. C,D. (Note that original points 1, 4, 6, 7, 11, 12 coincide in this case, and that sets 1 and 3 are the same in this subspace, as are sets 2 and 5.)
unique(dt[,c("C","D", "set")])
> C D set
1: 0 0 1
2: 1 0 2
3: 0 0 2
4: 1 1 2
5: 0 0 3
6: 1 0 4
7: 0 1 4
8: 0 0 5
9: 1 0 5
10: 1 1 5
and to identify the same sets, keep just unique ones and sum the corresponding masses. I.e. in this case:
> C D set
1: 0 0 1
2: 1 0 2
3: 0 0 2
4: 1 1 2
5: 1 0 4
6: 0 1 4
set mass
1: 1 0.6469965 % set 1 + set 3
2: 2 0.06325088 % set 2 + set 5
3: 4 0.28975265
Thank you for your ideas.
Similar in concept to Frank's answer, we can map the binary coordinates of each point to a decimal number with sum(x * 2 ^ ((length(x) - 1):0)). Restricting this to "C" and "D", we get:
coords = c("C", "D")
d = data.frame(set = dt$set,
val = Reduce("+", Map("*", list(dt$C, dt$D), 2 ^ ((length(coords) - 1):0))))
d
Then, we can group identical sets following the same idea:
tab = table(d$val, d$set) > 0L ## `table(d) > 0` to ignore the duplicates
gr = colSums(tab * (2 ^ ((nrow(tab) - 1):0)))
gr
# 1 2 3 4 5
# 8 11 8 6 11
## another (pre-edit) alternative with unnecessary overhead
#gr = cutree(hclust(dist(table(d) > 0L)), h = 0)
#gr
#1 2 3 4 5
#1 2 1 3 2
and aggregate based on this group:
rowsum(wt$mass[match(names(gr), wt$set)], gr, reorder = FALSE)
# [,1]
#8 0.64699646
#11 0.06325088
#6 0.28975265
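To make the encoding step concrete, here is a small sketch of the binary-to-decimal mapping on the C and D coordinates, written as a matrix product instead of Reduce()/Map(). The data frame pts holds only the columns needed from the question's dt; pts, weights, val and codes_per_set are illustrative names.

```r
# C, D and set columns from the question's data
pts <- data.frame(
  C   = c(0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1),
  D   = c(0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1),
  set = c(1, 2, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5)
)

coords  <- c("C", "D")
weights <- 2^((length(coords) - 1):0)   # c(2, 1): C is the high bit

# Each point becomes one number: 2 * C + 1 * D
val <- as.vector(as.matrix(pts[coords]) %*% weights)

# The distinct point codes that make up each set
codes_per_set <- lapply(split(val, pts$set), function(v) sort(unique(v)))
codes_per_set

# Two sets coincide in the subspace when their code vectors are identical
identical(codes_per_set[["2"]], codes_per_set[["5"]])
```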
A somewhat clunky option: make a unique string for each set, and then group on that.
coords = c("C", "D")
gDT = setorder(unique(dt[,c(coords, "set"), with=FALSE]))[,
.(s = paste(do.call(paste, c(.SD, .(sep="_"))), collapse="."))
, by=set, .SDcols = coords][,
g := .GRP
, by=s][]
# set s g
# 1: 1 0_0 1
# 2: 2 0_0.1_0.1_1 2
# 3: 3 0_0 1
# 4: 5 0_0.1_0.1_1 2
# 5: 4 0_1.1_0 3
gDT[wt, on=.(set), mass := i.mass ]
gDT[, .(set = first(set), mass = sum(mass)), by=g]
# g set mass
# 1: 1 1 0.64699646
# 2: 2 2 0.06325088
# 3: 3 4 0.28975265
Comments
You can get rid of the g by chaining on [, g := NULL][] in the last line.
setorder is just sorting the data so that the unique string turns out the same for sets that are the same.
Grouped first and sum operations are optimized, as you can see if you add verbose = TRUE to the final line, like gDT[, .(set = first(set), mass = sum(mass)), by=g, verbose=TRUE].
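For readers without data.table, the same "one string per set" idea can be sketched in base R. This is only an illustration under the question's data, not a replacement for the code above; pts, key, sig, res and out are names made up here.

```r
# C, D and set columns from the question, plus the mass table
pts <- data.frame(
  C   = c(0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1),
  D   = c(0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1),
  set = c(1, 2, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5)
)
wt <- data.frame(
  set  = 1:5,
  mass = c(0.27809187, 0.02614841, 0.36890459, 0.28975265, 0.03710247)
)

# One key per point, e.g. "1_0"
key <- paste(pts$C, pts$D, sep = "_")

# One signature per set: its sorted, de-duplicated point keys
sig <- tapply(key, pts$set,
              function(k) paste(sort(unique(k)), collapse = "."))

# Attach masses, then sum them over sets sharing a signature
res <- data.frame(set = as.integer(names(sig)), s = as.vector(sig))
res$mass <- wt$mass[match(res$set, wt$set)]
out <- aggregate(mass ~ s, data = res, FUN = sum)

# Report the first set that carries each signature
out$set <- res$set[match(out$s, res$s)]
out
```

Sorting the keys inside each set plays the role that setorder() plays in the data.table version: it makes the signature independent of row order.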

1 factor column and 15 columns to factor under the 1 factor column then sum each 15 individually

I don't know the vocabulary otherwise I am sure I would be able to effectively search for this. So far I have not found anything and I am running out of time.
So I have 16 columns of information, 1 of them is a factor column, we'll assume dates, and the other 15 are hour times (6 am - 8 pm, representing hour only) with either a 1 or a 0, representing active state or inactive state. What I want to do is
1. Group the data by the factor column (Dates).
2. After everything is grouped, sum each of the 15 columns individually within each group.
3. Display a two-dimensional table with the dates running vertically and the hourly sums running horizontally.
Please, if you can help, please use the vocabulary so I can not only learn it myself, but so I can look up documentation and teach it to others too please.
An example would be
Date Hour1 Hour2 Hour3 Hour4 Hour5 ... Hour15
9-15 0 0 0 1 1 ... 0
9-15 0 1 1 1 1 ... 0
9-16 0 1 1 1 0 ... 0
9-16 0 0 0 0 0 ... 1
9-16 1 1 0 0 0 ... 1
9-18 0 1 0 1 1 ... 0
.
.
.
11-7 0 1 1 1 0 ... 0
What I want is
Hour1 Hour2 Hour3 Hour4 Hour5 ... Hour15
9-15 5 10 15 25 45 ... 20
9-16 5 6 25 28 15 ... 11
9-17 3 45 42 6 17 ... 32
9-18 5 10 15 25 45 ... 20
.
.
.
11-7 12 36 84 9 7 ... 21
where each entry is the sum over the column variable rather than a 1 or 0 frequency count.
You can do that quite easily with dplyr - first group by column "Date", then summarise each of the other columns with sum:
library(dplyr)
df %>%
  group_by(Date) %>%
  # summarise_each(funs(sum)) is deprecated in current dplyr; across() replaces it
  summarise(across(everything(), sum))
#Source: local data frame [4 x 7]
#
# Date Hour1 Hour2 Hour3 Hour4 Hour5 Hour15
#1 11-7 0 1 1 1 0 0
#2 9-15 0 1 1 2 2 0
#3 9-16 1 2 1 1 0 2
#4 9-18 0 1 0 1 1 0
data
df <- structure(list(Date = structure(c(2L, 2L, 3L, 3L, 3L, 4L, 1L), .Label = c("11-7",
"9-15", "9-16", "9-18"), class = "factor"), Hour1 = c(0L, 0L,
0L, 0L, 1L, 0L, 0L), Hour2 = c(0L, 1L, 1L, 0L, 1L, 1L, 1L), Hour3 = c(0L,
1L, 1L, 0L, 0L, 0L, 1L), Hour4 = c(1L, 1L, 1L, 0L, 0L, 1L, 1L
), Hour5 = c(1L, 1L, 0L, 0L, 0L, 1L, 0L), Hour15 = c(0L, 0L,
0L, 1L, 1L, 0L, 0L)), .Names = c("Date", "Hour1", "Hour2", "Hour3",
"Hour4", "Hour5", "Hour15"), class = "data.frame", row.names = c(NA,
-7L))
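On vocabulary: this kind of operation is usually called a grouped aggregation (or "split-apply-combine"), which is a useful search term. As a sketch, the same grouped sum can be written in base R with aggregate(), without any extra packages. The data frame below re-creates the example data with Date as a plain character column; sums is an illustrative name.

```r
# Example data: one Date column and several 0/1 hour columns
df <- data.frame(
  Date   = c("9-15", "9-15", "9-16", "9-16", "9-16", "9-18", "11-7"),
  Hour1  = c(0, 0, 0, 0, 1, 0, 0),
  Hour2  = c(0, 1, 1, 0, 1, 1, 1),
  Hour3  = c(0, 1, 1, 0, 0, 0, 1),
  Hour4  = c(1, 1, 1, 0, 0, 1, 1),
  Hour5  = c(1, 1, 0, 0, 0, 1, 0),
  Hour15 = c(0, 0, 0, 1, 1, 0, 0)
)

# ". ~ Date" means: aggregate every other column, grouped by Date
sums <- aggregate(. ~ Date, data = df, FUN = sum)
sums
```

The result has one row per date and one summed column per hour, which is the two-dimensional layout the question asks for.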

R: use a row as a grouping vector for row sums

If I have a data set laid out like:
Cohort Food1 Food2 Food3 Food4
--------------------------------
Group 1 1 2 3
A 1 1 0 1
B 0 0 1 0
C 1 1 0 1
D 0 0 0 1
I want to sum each row, where I can define food groups into different categories. So I would like to use the Group row as the defining vector.
Which would mean that food1 and food2 are in group 1, food3 is in group 2, food4 is in group 3.
Ideal output something like:
Cohort Group1 Group2 Group3
A 2 0 1
B 0 1 0
C 2 0 1
D 0 0 1
I tried using rowsum()-based functions but had no luck. Do I need to use ddply() instead?
Example data from comment:
dat <-
structure(list(species = c("group", "princeps", "bougainvillei",
"hombroni", "lindsayi", "concretus", "galatea", "ellioti", "carolinae",
"hydrocharis"), locust = c(1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L,
0L), grasshopper = c(1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L),
snake = c(2L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L), fish = c(2L,
1L, 0L, 1L, 1L, 0L, 1L, 0L, 1L, 0L), frog = c(2L, 0L, 0L,
0L, 0L, 0L, 0L, 1L, 0L, 0L), toad = c(2L, 0L, 0L, 0L, 0L,
1L, 0L, 0L, 0L, 0L), fruit = c(3L, 0L, 0L, 0L, 0L, 1L, 1L,
0L, 0L, 0L), seed = c(3L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L,
0L)), .Names = c("species", "locust", "grasshopper", "snake",
"fish", "frog", "toad", "fruit", "seed"), class = "data.frame", row.names = c(NA,
-10L))
There are most likely more direct approaches, but here is one you can try:
First, create a copy of your data minus the second header row.
dat2 <- dat[-1, ]
melt() and dcast() and so on from the "reshape2" package don't work nicely with duplicated column names, so let's make the column names more "reshape2 appropriate".
Seq <- ave(as.vector(unlist(dat[1, -1])),
as.vector(unlist(dat[1, -1])),
FUN = seq_along)
names(dat2)[-1] <- paste("group", dat[1, 2:ncol(dat)],
".", Seq, sep = "")
melt() the dataset
library(reshape2)  # provides melt(), colsplit() and dcast()
m.dat2 <- melt(dat2, id.vars="species")
Use the colsplit() function to split the columns correctly.
m.dat2 <- cbind(m.dat2[-2],
colsplit(m.dat2$variable, "\\.",
c("group", "time")))
head(m.dat2)
# species value group time
# 1 princeps 0 group1 1
# 2 bougainvillei 0 group1 1
# 3 hombroni 1 group1 1
# 4 lindsayi 0 group1 1
# 5 concretus 0 group1 1
# 6 galatea 0 group1 1
Proceed with dcast() as usual
dcast(m.dat2, species ~ group, fun.aggregate = sum, value.var = "value")
# species group1 group2 group3
# 1 bougainvillei 0 0 0
# 2 carolinae 1 1 0
# 3 concretus 0 2 2
# 4 ellioti 0 1 0
# 5 galatea 1 1 1
# 6 hombroni 2 1 0
# 7 hydrocharis 0 0 0
# 8 lindsayi 0 1 0
# 9 princeps 0 1 0
Note: Edited because original answer was incorrect.
Update: An easier way in base R
This problem is much more easily solved if you start by transposing your data.
dat3 <- t(dat[-1, -1])
dat3 <- as.data.frame(dat3)
names(dat3) <- dat[[1]][-1]
t(do.call(rbind, lapply(split(dat3, as.numeric(dat[1, -1])), colSums)))
# 1 2 3
# princeps 0 1 0
# bougainvillei 0 0 0
# hombroni 2 1 0
# lindsayi 0 1 0
# concretus 0 2 2
# galatea 1 1 1
# ellioti 0 1 0
# carolinae 1 1 0
# hydrocharis 0 0 0
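Since the question mentions rowsum(), here is a minimal sketch of how it could be applied directly. rowsum() sums rows by group, so the matrix is transposed first, summed by food group, and transposed back. The sketch uses the small Cohort/Food example from the question rather than the species data; food_group, foods and res are illustrative names.

```r
# Group id for each food column (the "Group" row in the question)
food_group <- c(Food1 = 1, Food2 = 1, Food3 = 2, Food4 = 3)

# 0/1 matrix: one row per cohort, one column per food
foods <- matrix(
  c(1, 1, 0, 1,
    0, 0, 1, 0,
    1, 1, 0, 1,
    0, 0, 0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("A", "B", "C", "D"), names(food_group))
)

# Transpose so foods are rows, sum rows by group, transpose back
res <- t(rowsum(t(foods), group = food_group))
colnames(res) <- paste0("Group", colnames(res))
res
```

With the species data, the group vector would be taken from the first row of dat and the matrix from the remaining rows.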
You can do this using base R fairly easily. Here's an example.
First, figure out which animals belong in which group:
groupings <- as.data.frame(table(as.numeric(dat[1,2:9]),names(dat)[2:9]))
attach(groupings)
grp1 <- groupings[Freq==1 & Var1==1,2]
grp2 <- groupings[Freq==1 & Var1==2,2]
grp3 <- groupings[Freq==1 & Var1==3,2]
detach(groupings)
Then, use the groups to do a rowSums() on the correct columns.
dat <- cbind(dat,rowSums(dat[as.character(grp1)]))
dat <- cbind(dat,rowSums(dat[as.character(grp2)]))
dat <- cbind(dat,rowSums(dat[as.character(grp3)]))
Delete the initial row and the intermediate columns:
dat <- dat[-1,-c(2:9)]
Then, just rename things correctly:
# sort alphabetically by species, then reset the row names
dat <- dat[order(dat$species), ]
row.names(dat) <- NULL
names(dat) <- c("species","group_1","group_2","group_3")
And you ultimately get:
species group_1 group_2 group_3
bougainvillei 0 0 0
carolinae 1 1 0
concretus 0 2 2
ellioti 0 1 0
galatea 1 1 1
hombroni 2 1 0
hydrocharis 0 0 0
lindsayi 0 1 0
princeps 0 1 0
EDITED: Changed sort order to alphabetical, like other answer.
