I have a data frame containing the results of a multiple choice question. Each item has either 0 (not mentioned) or 1 (mentioned). The columns are named like this:
F1.2_1, F1.2_2, F1.2_3, F1.2_4, F1.2_5, F1.2_99
etc.
I would like to concatenate these values like this: The new column should be a semicolon-separated string of the selected items. So if a row has a 1 in F1.2_1, F1.2_4 and F1.2_5 it should be: 1;4;5
The last digit(s) of the dichotomous columns are the item codes to be used in the string.
Any idea how this could be achieved with R (and data.table)? Thanks for any help!
edit:
Here is an example DF with the desired result:
structure(list(F1.2_1 = c(0L, 1L, 0L, 1L), F1.2_2 = c(1L, 0L,
0L, 1L), F1.2_3 = c(0L, 1L, 0L, 1L), F1.2_4 = c(0L, 1L, 0L, 0L
), F1.2_5 = c(0L, 0L, 0L, 0L), F1.2_99 = c(0L, 0L, 1L, 0L), desired_result = structure(c(3L,
2L, 4L, 1L), .Label = c("1;2;3", "1;3;4", "2", "99"), class = "factor")), .Names = c("F1.2_1",
"F1.2_2", "F1.2_3", "F1.2_4", "F1.2_5", "F1.2_99", "desired_result"
), class = "data.frame", row.names = c(NA, -4L))
  F1.2_1 F1.2_2 F1.2_3 F1.2_4 F1.2_5 F1.2_99 desired_result
1      0      1      0      0      0       0              2
2      1      0      1      1      0       0          1;3;4
3      0      0      0      0      0       1             99
4      1      1      1      0      0       0          1;2;3
In a comment, the OP asked how to handle more than one multiple choice question.
The approach below will be able to handle an arbitrary number of questions and choices for each question. It uses melt() and dcast() from the data.table package.
Sample input data
Let's assume the input data.frame DT for the extended case contains two questions, one with 6 choices and the other with 4 choices:
DT
#    F1.2_1 F1.2_2 F1.2_3 F1.2_4 F1.2_5 F1.2_99 F2.7_1 F2.7_2 F2.7_3 F2.7_11
#1:      0      1      0      0      0       0      0      1      1       0
#2:      1      0      1      1      0       0      1      1      1       1
#3:      0      0      0      0      0       1      1      0      1       0
#4:      1      1      1      0      0       0      1      0      1       1
Code
library(data.table)
# coerce to data.table and add row number for later join
setDT(DT)[, rn := .I]
# reshape from wide to long format
molten <- melt(DT, id.vars = "rn")
# alternatively, the measure cols can be specified (in case of other id vars)
# molten <- melt(DT, measure.vars = patterns("^F"))
# split question id and choice id
molten[, c("question_id", "choice_id") := tstrsplit(variable, "_")]
# reshape only selected choices from long to wide format,
# thereby pasting together the ids of the selected choices for each question
result <- dcast(molten[value == 1], rn ~ question_id, paste, collapse = ";",
fill = NA, value.var = "choice_id")
# final join for demonstration only, remove row number as no longer needed
DT[result, on = "rn"][, rn := NULL][]
#    F1.2_1 F1.2_2 F1.2_3 F1.2_4 F1.2_5 F1.2_99 F2.7_1 F2.7_2 F2.7_3 F2.7_11  F1.2     F2.7
#1:      0      1      0      0      0       0      0      1      1       0     2      2;3
#2:      1      0      1      1      0       0      1      1      1       1 1;3;4 1;2;3;11
#3:      0      0      0      0      0       1      1      0      1       0    99      1;3
#4:      1      1      1      0      0       0      1      0      1       1 1;2;3   1;3;11
For each question, the final result shows which choices were selected in each row.
Reproducible data
The sample data can be created with
DT <- structure(list(F1.2_1 = c(0L, 1L, 0L, 1L), F1.2_2 = c(1L, 0L,
0L, 1L), F1.2_3 = c(0L, 1L, 0L, 1L), F1.2_4 = c(0L, 1L, 0L, 0L
), F1.2_5 = c(0L, 0L, 0L, 0L), F1.2_99 = c(0L, 0L, 1L, 0L), F2.7_1 = c(0L,
1L, 1L, 1L), F2.7_2 = c(1L, 1L, 0L, 0L), F2.7_3 = c(1L, 1L, 1L,
1L), F2.7_11 = c(0L, 1L, 0L, 1L)), .Names = c("F1.2_1", "F1.2_2",
"F1.2_3", "F1.2_4", "F1.2_5", "F1.2_99", "F2.7_1", "F2.7_2",
"F2.7_3", "F2.7_11"), row.names = c(NA, -4L), class = "data.frame")
We can try
j1 <- do.call(paste, c(as.integer(sub(".*_", "",
names(DF)[-7]))[col(DF[-7])]*DF[-7], sep=";"))
DF$newCol <- gsub("^;+|;+$", "", gsub(";*0;|0$|^0", ";", j1))
DF$newCol
#[1] "2" "1;3;4" "99" "1;2;3"
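The regular-expression cleanup above removes zeros from the pasted string, so an item code that itself contains a zero (a hypothetical F1.2_10, say) could be altered along the way. Here is a minimal base R sketch that sidesteps this by picking the codes with a logical mask instead. It assumes DF holds only the six item columns of the example; the name newCol follows the answer above, while codes and selected are illustrative names.

```r
# Item columns only (same values as the example data, without desired_result)
DF <- data.frame(
  F1.2_1  = c(0L, 1L, 0L, 1L),
  F1.2_2  = c(1L, 0L, 0L, 1L),
  F1.2_3  = c(0L, 1L, 0L, 1L),
  F1.2_4  = c(0L, 1L, 0L, 0L),
  F1.2_5  = c(0L, 0L, 0L, 0L),
  F1.2_99 = c(0L, 0L, 1L, 0L)
)

# The item code is whatever follows the last underscore in each column name
codes <- sub(".*_", "", names(DF))

# For each row, keep the codes of the columns holding a 1 and paste them
selected <- apply(DF == 1, 1, function(sel) paste(codes[sel], collapse = ";"))
DF$newCol <- unname(selected)
DF$newCol
#[1] "2"     "1;3;4" "99"    "1;2;3"
```

Rows without any selected item end up as an empty string rather than NA in this sketch.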
Related
I would like to identify whether an activity occurs on consecutive days during a week, and how often. The starting point is t1, which records the occurrence of an activity at t1_1, t1_2, t1_3 and so on. For example, in the case of id 12, activity occurred at t1_2, t1_3, t2_2, t3_1, t3_3, t4_2, t5_2, t6_1, t6_2, t6_3 and t7_3. As activity was reported on all 7 days, I assume the activity occurred consecutively. I would like to identify all ids in which an activity occurred consecutively, together with the sum of occurrences.
Input
id t1_1 t1_2 t1_3 t2_1 t2_2 t2_3 t3_1 t3_2 t3_3 t4_1 t4_2 t4_3 t5_1 t5_2 t5_3 t6_1 t6_2 t6_3 t7_1 t7_2 t7_3
12 0 1 1 0 1 0 1 0 1 0 1 0 0 1 0 1 1 1 0 0 1
123 0 0 0 1 1 1 0 0 0 1 1 1 1 1 1 0 0 0 1 1 1
10 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Output
Id Sum
12 11
10 21
Here is an option with rle. Loop over the rows of the dataset (without the 'id' column) using apply with MARGIN = 1, apply rle, and extract the lengths of the runs whose 'values' are 1 ('x1'). If 'x1' contains at least 7 runs, or exactly 1 run (the single-run case covers rows where all the values are 1), get the sum. Then stack the named list into a two-column data.frame and set the names of the columns ('out').
out <- stack(setNames(apply(df1[-1], 1, function(x) {
x1 <- with(rle(x), lengths[as.logical(values)])
if(length(x1) >=7|length(x1) == 1) sum(x1) }), df1$id))[2:1]
names(out) <- c('Id', 'Sum')
out
# Id Sum
#1 12 11
#2 10 21
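To make the rle step more concrete, here is a small sketch on a made-up vector (x is not taken from the data above). It shows how the run lengths belonging to the value 1 are extracted before being summed.

```r
# A short made-up activity vector: one run of two 1s, then a single 1
x <- c(0, 1, 1, 0, 1)

r <- rle(x)
r$lengths  # 1 2 1 1
r$values   # 0 1 0 1

# Keep only the lengths of the runs whose value is 1
ones <- r$lengths[as.logical(r$values)]
ones       # 2 1
sum(ones)  # 3
```

The number of elements in ones is the number of separate runs of 1s, which is what the length(x1) test in the answer inspects.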
data
df1 <- structure(list(id = c(12L, 123L, 10L), t1_1 = c(0L, 0L, 1L),
t1_2 = c(1L, 0L, 1L), t1_3 = c(1L, 0L, 1L), t2_1 = c(0L,
1L, 1L), t2_2 = c(1L, 1L, 1L), t2_3 = c(0L, 1L, 1L), t3_1 = c(1L,
0L, 1L), t3_2 = c(0L, 0L, 1L), t3_3 = c(1L, 0L, 1L), t4_1 = c(0L,
1L, 1L), t4_2 = c(1L, 1L, 1L), t4_3 = c(0L, 1L, 1L), t5_1 = c(0L,
1L, 1L), t5_2 = c(1L, 1L, 1L), t5_3 = c(0L, 1L, 1L), t6_1 = c(1L,
0L, 1L), t6_2 = c(1L, 0L, 1L), t6_3 = c(1L, 0L, 1L), t7_1 = c(0L,
1L, 1L), t7_2 = c(0L, 1L, 1L), t7_3 = c(1L, 1L, 1L)),
class = "data.frame", row.names = c(NA,
-3L))
An option using data.table:
melt(DT, id.vars="id")[,
c("day", "time") := tstrsplit(variable, "_")][
value==1L, if(all(paste0("t", 1L:7L) %chin% day)) .(Sum=sum(value)) , id]
output:
id Sum
1: 10 21
2: 12 11
data:
library(data.table)
DT <- fread("id t1_1 t1_2 t1_3 t2_1 t2_2 t2_3 t3_1 t3_2 t3_3 t4_1 t4_2 t4_3 t5_1 t5_2 t5_3 t6_1 t6_2 t6_3 t7_1 t7_2 t7_3
12 0 1 1 0 1 0 1 0 1 0 1 0 0 1 0 1 1 1 0 0 1
123 0 0 0 1 1 1 0 0 0 1 1 1 1 1 1 0 0 0 1 1 1
10 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1")
Explanation:
Convert into long format using melt.
Use tstrsplit to split the column names into day of the week and time.
Filter for value==1L and then, for each id, check whether all 7 days are present in the subset before summing (i.e. if(all(paste0("t", 1L:7L) %chin% day)) .(Sum=sum(value))).
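The "all seven days present" check described above can also be sketched in base R, without reshaping to long format. This is only one possible variant; the names m, day, per_day, keep and res are illustrative and do not appear in the answers.

```r
df1 <- read.table(header = TRUE, text = "id t1_1 t1_2 t1_3 t2_1 t2_2 t2_3 t3_1 t3_2 t3_3 t4_1 t4_2 t4_3 t5_1 t5_2 t5_3 t6_1 t6_2 t6_3 t7_1 t7_2 t7_3
12 0 1 1 0 1 0 1 0 1 0 1 0 0 1 0 1 1 1 0 0 1
123 0 0 0 1 1 1 0 0 0 1 1 1 1 1 1 0 0 0 1 1 1
10 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1")

m <- as.matrix(df1[-1])            # ids in rows, time slots in columns
day <- sub("_.*", "", colnames(m)) # "t1", "t1", "t1", "t2", ...

# Sum the time slots of each day: result has one row per day, one column per id
per_day <- rowsum(t(m), group = day)

# Keep the ids that show activity on every day
keep <- colSums(per_day > 0) == length(unique(day))

res <- data.frame(Id = df1$id[keep], Sum = unname(rowSums(m)[keep]))
res
#   Id Sum
# 1 12  11
# 2 10  21
```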
I've got genotyping data from several overlapping NPs/individuals which I am attempting to compare.
As you can see in the data structure below, e[1,2] and e[2,3] are NA. Now I want to replace d[1,2] and d[2,3] (both currently 1) with NA.
d <- structure(list(`100099681` = c(0L, 2L, 0L), `101666591` = c(1L, 1L, 0L), `102247652` = c(1L, 1L, 1L), `102284616` = c(0L, 1L, 0L), `103582612` = c(0L, 1L, 1L), `104344528` = c(2L, 1L, 0L), `105729734` = c(1L, 0L, 1L), `109897137` = c(0L, 0L, 2L), `112768301` = c(0L, 1L, 1L), `114724443` = c(1L, 1L, 1L), `114826164` = c(1L, 0L, 1L), `115358770` = c(0L, 2L, 0L), `115399788` = c(1L, 1L, 0L), `118669033` = c(0L, 1L, 1L), `118875482` = c(2L, 1L, 0L), `119366362` = c(0L, 2L, 0L), `119627971` = c(0L, 1L, 1L), `120295351` = c(0L, 2L, 0L), `120998030` = c(0L, 0L, 2L)), .Names = c("100099681", "101666591", "102247652", "102284616", "103582612", "104344528", "105729734", "109897137", "112768301", "114724443", "114826164", "115358770", "115399788", "118669033", "118875482", "119366362", "119627971", "120295351", "120998030"), row.names = c("7:100038150_C", "7:100079759_T", "7:100256942_A"), class = "data.frame")
> d
# 100099681 101666591 102247652 102284616 103582612 104344528 105729734 109897137 112768301 114724443 114826164 115358770 115399788 118669033 118875482 119366362 119627971 120295351 120998030
#7:100038150_C 0 1 1 0 0 2 1 0 0 1 1 0 1 0 2 0 0 0 0
#7:100079759_T 2 1 1 1 1 1 0 0 1 1 0 2 1 1 1 2 1 2 0
#7:100256942_A 0 0 1 0 1 0 1 2 1 1 1 0 0 1 0 0 1 0 2
e<- structure(list(`100099681` = c(1L, 1L, 0L), `101666591` = c(NA, 1L, 1L), `102247652` = c(0L, NA, 0L), `102284616` = c(1L, 1L, 0L), `103582612` = c(1L, 0L, 1L), `104344528` = c(1L, 0L, 1L), `105729734` = c(0L, 0L, 1L), `109897137` = c(1L, 1L, 0L), `112768301` = c(0L, 1L, 1L), `114724443` = c(0L, 2L, 0L), `114826164` = c(0L, 0L, 2L), `115358770` = c(0L, 0L, 2L), `115399788` = c(0L, 2L, 0L), `118669033` = c(0L, 0L, 2L), `118875482` = c(0L, 1L, 1L), `119366362` = c(2L, 1L, 0L), `119627971` = c(0L, 1L, 1L), `120295351` = c(0L, 2L, 0L), `120998030` = c(0L, 2L, 1L)), .Names = c("100099681", "101666591", "102247652", "102284616", "103582612", "104344528", "105729734", "109897137", "112768301", "114724443", "114826164", "115358770", "115399788", "118669033", "118875482", "119366362", "119627971", "120295351", "120998030"), row.names = c("7:100038150_C", "7:100079759_T", "7:100256942_A"), class = "data.frame")
> e
# 100099681 101666591 102247652 102284616 103582612 104344528 105729734 109897137 112768301 114724443 114826164 115358770 115399788 118669033 118875482 119366362 119627971 120295351 120998030
#7:100038150_C 1 NA 0 1 1 1 0 1 0 0 0 0 0 0 0 2 0 0 0
#7:100079759_T 1 1 NA 1 0 0 0 1 1 2 0 0 2 0 1 1 1 2 2
#7:100256942_A 0 1 0 0 1 1 1 0 1 0 2 2 0 2 1 0 1 0 1
Thus my expected output would be
> expected_d
# 100099681 101666591 102247652 102284616 103582612 104344528 105729734 109897137 112768301 114724443 114826164 115358770 115399788 118669033 118875482 119366362 119627971 120295351 120998030
#7:100038150_C 0 NA 1 0 0 2 1 0 0 1 1 0 1 0 2 0 0 0 0
#7:100079759_T 2 1 NA 1 1 1 0 0 1 1 0 2 1 1 1 2 1 2 0
#7:100256942_A 0 0 1 0 1 0 1 2 1 1 1 0 0 1 0 0 1 0 2
I've gotten this far:
g <- which(is.na(e), arr.ind=TRUE)
> g
# row col
#7:100038150_C 1 2
#7:100079759_T 2 3
Then I tried to use an apply function to replace each location with "TEST" (or NA, for that matter):
apply(g, 1, function(x){
e[x[1], x[2]] <- "TEST" }
)
#> apply(g, 1, function(x){ e[x[1], x[2]] <- "TEST" })
#7:100038150_C 7:100079759_T
# "TEST" "TEST"
I will be running this bit of code over several million rows/columns so speed will be an issue.
Thank you in advance:)
We can try doing
NA^(is.na(e))*d
If memory is an issue
d[] <- Map(function(x,y) NA^(is.na(y))* x, d, e)
Another way based on your approach,
d[which(is.na(e), arr.ind = TRUE)] <- NA
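The apply attempt in the question returns "TEST" for each row of g but leaves e untouched, because the assignment inside the anonymous function only modifies a local copy. Here is a minimal sketch on toy data (d_small and e_small are made-up stand-ins for d and e) showing that a logical mask can be used directly for the assignment:

```r
# Toy stand-ins for d and e with identical dimensions
d_small <- data.frame(a = c(0L, 2L), b = c(1L, 1L))
e_small <- data.frame(a = c(1L, 1L), b = c(NA, 1L))

# is.na() on a data frame returns a logical matrix of the same shape,
# which can index the other data frame for assignment
d_small[is.na(e_small)] <- NA
d_small
#   a  b
# 1 0 NA
# 2 2  1
```

This assumes both data frames have the same dimensions and the same row and column order, as d and e do in the question.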
I have a dataframe which looks like this:
>head(df)
chrom     pos strand ref alt A_pos A_neg C_pos C_neg G_pos G_neg T_pos T_neg
 chr1 2283161      -   G   A     3     1     2     0     0     0     0     0
 chr1 2283161      -   G   A     3     1     2     0     0     0     0     0
 chr1 2283313      -   G   C     0     0     0     0     0     0     0     0
 chr1 2283313      -   G   C     0     0     0     0     0     0     0     0
 chr1 2283896      -   G   A     0     0     0     0     0     0     0     0
 chr1 2283896      +   G   A     0     0     0     0     0     0     0     0
I want to extract the values from columns 6:13 (A_pos...T_neg) based on the values of the columns 'strand', 'ref' and 'alt'. For instance, in row 1: strand = '-', ref = 'G' and alt = 'A', so I should extract the values from G_neg and A_neg. Again, in row 6: strand = '+', ref = 'G' and alt = 'A', so I should get the values from G_pos and A_pos. I basically intend to do a chi-square test after extracting these values (these are my observed values; I have another set of expected values), but that is another story.
So the logic is somewhat like:
if(df$strand=="+")
do
print:paste(df$ref,"pos",sep="_") #extract value in column df$ref_pos
print:paste(df$alt,"pos",sep="_") #extract value in column df$alt_pos
else if(df$strand=="-")
do
print:paste(df$ref,"neg",sep="_") #extract value in column df$ref_neg
print:paste(df$alt,"neg",sep="_") #extract value in column df$alt_neg
Here, I am trying to use paste on the values in 'ref' and 'alt' to get the desired column names. For instance, if strand ='+' and ref = 'G', it will fetch value from column G_pos.
The data frame is actually large, so I ruled out using for-loops. I am not sure how else I can do this to make the code as efficient as possible. Any help/suggestions would be appreciated.
Thanks!
Another alternative that looks valid, at least with the sample data:
tmp = ifelse(as.character(DF$strand) == "-", "neg", "pos")
sapply(DF[c("ref", "alt")],
function(x) as.integer(DF[cbind(seq_len(nrow(DF)),
match(paste(x, tmp, sep = "_"), names(DF)))]))
# ref alt
#[1,] 0 1
#[2,] 0 1
#[3,] 0 0
#[4,] 0 0
#[5,] 0 0
#[6,] 0 0
Where DF:
DF = structure(list(chrom = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "chr1", class = "factor"),
pos = c(2283161L, 2283161L, 2283313L, 2283313L, 2283896L,
2283896L), strand = structure(c(1L, 1L, 1L, 1L, 1L, 2L), .Label = c("-",
"+"), class = "factor"), ref = structure(c(1L, 1L, 1L, 1L,
1L, 1L), .Label = "G", class = "factor"), alt = structure(c(1L,
1L, 2L, 2L, 1L, 1L), .Label = c("A", "C"), class = "factor"),
A_pos = c(3L, 3L, 0L, 0L, 0L, 0L), A_neg = c(1L, 1L, 0L,
0L, 0L, 0L), C_pos = c(2L, 2L, 0L, 0L, 0L, 0L), C_neg = c(0L,
0L, 0L, 0L, 0L, 0L), G_pos = c(0L, 0L, 0L, 0L, 0L, 0L), G_neg = c(0L,
0L, 0L, 0L, 0L, 0L), T_pos = c(0L, 0L, 0L, 0L, 0L, 0L), T_neg = c(0L,
0L, 0L, 0L, 0L, 0L)), .Names = c("chrom", "pos", "strand",
"ref", "alt", "A_pos", "A_neg", "C_pos", "C_neg", "G_pos", "G_neg",
"T_pos", "T_neg"), class = "data.frame", row.names = c(NA, -6L
))
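The answer above relies on indexing with a two-column matrix of (row, column) pairs. Here is a minimal sketch of that mechanism on a small made-up matrix (m, rows and cols are illustrative names, not part of the answer):

```r
# Made-up 2 x 3 matrix with column names in the style of the question
m <- matrix(1:6, nrow = 2,
            dimnames = list(NULL, c("A_neg", "G_neg", "G_pos")))

rows <- c(1, 2)
# Translate the wanted column names into column positions
cols <- match(c("G_pos", "A_neg"), colnames(m))

# One value per (row, column) pair: m[1, "G_pos"] and m[2, "A_neg"]
picked <- m[cbind(rows, cols)]
picked
#[1] 5 2
```

When the same kind of index is applied to a data frame with mixed column types, as with DF above, the values are likely to come back as character, which would explain the as.integer() call in the answer.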
Not very elegant, but does the job:
strand.map <- c("-"="_neg", "+"="_pos")
cbind(
df[1:5],
do.call(
rbind,
lapply(
split(df[-(1:2)], 1:nrow(df)),
function(x)
c(
ref=x[-(1:2)][, paste0(x[[2]], strand.map[x[[1]]])],
alt=x[-(1:2)][, paste0(x[[3]], strand.map[x[[1]]])]
) ) ) )
We cycle through each row in your data frame and apply a function that pulls the value based on strand, ref, and alt. This produces:
chrom pos strand ref alt ref alt
1 chr1 2283161 - G A 0 1
2 chr1 2283161 - G A 0 1
3 chr1 2283313 - G C 0 0
4 chr1 2283313 - G C 0 0
5 chr1 2283896 - G A 0 0
6 chr1 2283896 + G A 0 0
An alternate approach is to use melt, but the format of your data makes it rather annoying because we need two melts in a row, and we need to create a unique id column so we can reconstitute the data frame once we're done computing.
df$id <- 1:nrow(df)
df.mlt <-
melt(
melt(df, id.vars=c("id", "chrom", "pos", "strand", "ref", "alt")),
measure.vars=c("ref", "alt"), value.name="base",
variable.name="alt_or_ref"
)
dcast(
subset(df.mlt, paste0(base, strand.map[strand]) == variable),
id + chrom + pos + strand ~ alt_or_ref,
value.var="value"
)
Which produces:
id chrom pos strand ref alt
1 1 chr1 2283161 - 0 1
2 2 chr1 2283161 - 0 1
3 3 chr1 2283313 - 0 0
4 4 chr1 2283313 - 0 0
5 5 chr1 2283896 - 0 0
6 6 chr1 2283896 + 0 0
Another way
testFunc <- function(x){
posneg <- if(x["strand"] == "-") {"neg"} else {"pos"}
cbind(as.numeric(x[paste0(x["ref"],"_",posneg)]), as.numeric(x[paste0(x["alt"],"_",posneg)]))
}
temp <- t(apply(df, 1, testFunc))
colnames(temp) <- c("ref", "alt")
Using the (very) fast data.table package:
library(data.table)
df = fread('df.txt') # fastread
df[,ref := ifelse(strand == "-",
paste(ref,"neg",sep = "_"),
paste(ref,"pos",sep = "_"))]
df[,alt := ifelse(strand == "-",
paste(alt,"neg",sep = "_"),
paste(alt,"pos",sep = "_"))]
df[,strand := NULL] # not required anymore
dfm = melt(df,
id.vars = c("chrom","pos","ref","alt"),
variable.name = "mycol", value.name = "value")
dfm[mycol == ref | mycol == alt,] # matching
I have a simple R problem, but I just can't find the answer.
I have a dataframe like this:
A 1 0 0 0 0 0
B 0 1 0 0 0 0
B 0 0 1 0 0 1
B 0 0 0 0 1 0
C 1 0 0 0 0 0
C 0 0 0 1 1 0
And I want it to be just like this:
A 1 0 0 0 0 0
B 0 1 1 0 1 1
C 1 0 0 1 1 0
Thank you very much!
Regards Lisanne
Here's one possibility using tapply:
cbind(unique(dat[1]), do.call(rbind, tapply(dat[-1], dat[[1]], colSums)))
# V1 V2 V3 V4 V5 V6 V7
# 1 A 1 0 0 0 0 0
# 2 B 0 1 1 0 1 1
# 5 C 1 0 0 1 1 0
where dat is the name of your data frame.
dat <- structure(list(V1 = structure(c(1L, 2L, 2L, 2L, 3L, 3L), .Label = c("A",
"B", "C"), class = "factor"), V2 = c(1L, 0L, 0L, 0L, 1L, 0L),
V3 = c(0L, 1L, 0L, 0L, 0L, 0L), V4 = c(0L, 0L, 1L, 0L, 0L,
0L), V5 = c(0L, 0L, 0L, 0L, 0L, 1L), V6 = c(0L, 0L, 0L, 1L,
0L, 1L), V7 = c(0L, 0L, 1L, 0L, 0L, 0L)), .Names = c("V1",
"V2", "V3", "V4", "V5", "V6", "V7"), class = "data.frame", row.names = c(NA,
-6L))
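As a further option, base R's rowsum() sums the rows of a data frame by a grouping vector. The sketch below rebuilds dat with plain data.frame() so that it runs on its own; res is an illustrative name. Note that the group labels end up as row names rather than as a column.

```r
# Same values as dat above, with V1 stored as character
dat <- data.frame(
  V1 = c("A", "B", "B", "B", "C", "C"),
  V2 = c(1L, 0L, 0L, 0L, 1L, 0L),
  V3 = c(0L, 1L, 0L, 0L, 0L, 0L),
  V4 = c(0L, 0L, 1L, 0L, 0L, 0L),
  V5 = c(0L, 0L, 0L, 0L, 0L, 1L),
  V6 = c(0L, 0L, 0L, 1L, 0L, 1L),
  V7 = c(0L, 0L, 1L, 0L, 0L, 0L)
)

# Column-wise sums of V2..V7 within each level of V1
res <- rowsum(dat[-1], group = dat$V1)
res
#   V2 V3 V4 V5 V6 V7
# A  1  0  0  0  0  0
# B  0  1  1  0  1  1
# C  1  0  0  1  1  0
```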
You could...
aggregate(.~ V1 , data =dat, sum)
or
library(plyr)
ddply(dat, .(V1), function(x) colSums(x[,2:7]) )
If you're working with a data.frame where there are duplicates but you only want the presence or absence of a 1 to be noted, then after these functions you might want to cap the sums at 1 with something like res[-1][res[-1] > 1] <- 1, where res is the aggregated result.
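Where only presence or absence matters, one possible shortcut is to aggregate with max instead of sum, so that no value can exceed 1 in the first place. Here is a minimal sketch on made-up data (dat2 and res2 are hypothetical names), assuming the item columns contain only 0 and 1:

```r
# Made-up data in which group B mentions V2 twice
dat2 <- data.frame(
  V1 = c("A", "B", "B"),
  V2 = c(1L, 1L, 1L),
  V3 = c(0L, 0L, 1L)
)

# max() per group yields 1 if any row of the group has a 1, else 0
res2 <- aggregate(. ~ V1, data = dat2, FUN = max)
res2
#   V1 V2 V3
# 1  A  1  0
# 2  B  1  1
```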
A possibility not mentioned is the aggregate function. I think this is quite 'readable'.
aggregate(cbind(data$X1, data$X2, data$X3, data$X4),
by = list(category = data$group), FUN = sum)
What is the best way to determine a factor or create a new category field based on a number of boolean fields? In this example, I need to count the number of unique combinations of medications.
> MultPsychMeds
ID OLANZAPINE HALOPERIDOL QUETIAPINE RISPERIDONE
1 A 1 1 0 0
2 B 1 0 1 0
3 C 1 0 1 0
4 D 1 0 1 0
5 E 1 0 0 1
6 F 1 0 0 1
7 G 1 0 0 1
8 H 1 0 0 1
9 I 0 1 1 0
10 J 0 1 1 0
Perhaps another way to state it is that I need to pivot or cross tabulate the pairs. The final results need to look something like:
Combination Count
OLANZAPINE/HALOPERIDOL 1
OLANZAPINE/QUETIAPINE 3
OLANZAPINE/RISPERIDONE 4
HALOPERIDOL/QUETIAPINE 2
This data frame can be replicated in R with:
MultPsychMeds <- structure(list(ID = structure(1:10, .Label = c("A", "B", "C",
"D", "E", "F", "G", "H", "I", "J"), class = "factor"), OLANZAPINE = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L), HALOPERIDOL = c(1L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L), QUETIAPINE = c(0L, 1L, 1L, 1L,
0L, 0L, 0L, 0L, 1L, 1L), RISPERIDONE = c(0L, 0L, 0L, 0L, 1L,
1L, 1L, 1L, 0L, 0L)), .Names = c("ID", "OLANZAPINE", "HALOPERIDOL",
"QUETIAPINE", "RISPERIDONE"), class = "data.frame", row.names = c(NA,
-10L))
Here's one approach using the reshape and plyr packages:
library(reshape)
library(plyr)
#Melt into long format
dat.m <- melt(MultPsychMeds, id.vars = "ID")
#Group at the ID level and paste the drugs together with "/"
out <- ddply(dat.m, "ID", summarize, combos = paste(variable[value == 1], collapse = "/"))
#Calculate a table
with(out, count(combos))
x freq
1 HALOPERIDOL/QUETIAPINE 2
2 OLANZAPINE/HALOPERIDOL 1
3 OLANZAPINE/QUETIAPINE 3
4 OLANZAPINE/RISPERIDONE 4
Just for fun, a base R solution (that can be turned into a one-liner :-) ):
data.frame(table(apply(MultPsychMeds[,-1], 1, function(currow){
wc<-which(currow==1)
paste(colnames(MultPsychMeds)[wc+1], collapse="/")
})))
Another way could be:
subset(
as.data.frame(
with(MultPsychMeds, table(OLANZAPINE, HALOPERIDOL, QUETIAPINE, RISPERIDONE)),
responseName="count"
),
count>0
)
which gives
OLANZAPINE HALOPERIDOL QUETIAPINE RISPERIDONE count
4 1 1 0 0 1
6 1 0 1 0 3
7 0 1 1 0 2
10 1 0 0 1 4
It's not exactly the format you want, but it is fast and simple.
There is a shorthand in the plyr package:
require(plyr)
count(MultPsychMeds, c("OLANZAPINE", "HALOPERIDOL", "QUETIAPINE", "RISPERIDONE"))
# OLANZAPINE HALOPERIDOL QUETIAPINE RISPERIDONE freq
# 1 0 1 1 0 2
# 2 1 0 0 1 4
# 3 1 0 1 0 3
# 4 1 1 0 0 1