Binary Variables Combinations Analysis in R

I have a data set, which has a lot of binary variables. For the ease of illustration, here is a smaller version with only 4 variables:
set.seed(5)
my_data <- data.frame("Slept Well"    = sample(c(0, 1), 10, TRUE),
                      "Had Breakfast" = sample(c(0, 1), 10, TRUE),
                      "Worked out"    = sample(c(0, 1), 10, TRUE),
                      "Meditated"     = sample(c(0, 1), 10, TRUE))
In the above, each row corresponds to an observation. I am interested in analysing the frequency of each unique combination of the variables. For example, how many observations both slept well and meditated, but neither had breakfast nor worked out?
I would like to be able to rank the unique combinations from most frequently occurring to the least frequently occurring. What is the best way to go about coding that up?

You can use aggregate.
x <- aggregate(list(n = rep(1, nrow(my_data))), my_data, length)
# x <- aggregate(list(n = my_data[, 1]), my_data, length)  # alternative
x[order(-x$n), ]
#  Slept.Well Had.Breakfast Worked.out Meditated n
#4          0             1          1         0 2
#1          0             0          0         0 1
#2          1             1          0         0 1
#3          0             0          1         0 1
#5          0             0          0         1 1
#6          1             0          0         1 1
#7          0             1          0         1 1
#8          0             0          1         1 1
#9          0             1          1         1 1
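For completeness, the same count can be produced in base R with table(), which cross-tabulates every column at once; a minimal sketch (it enumerates all 16 possible combinations, so zero-count rows are dropped before ranking):
tab <- as.data.frame(table(my_data))  # all 2^4 combinations with their counts
tab <- tab[tab$Freq > 0, ]            # keep only observed combinations
tab[order(-tab$Freq), ]               # rank from most to least frequent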

What about a dplyr solution:
library(dplyr)
my_data %>%
  # group it
  group_by_all() %>%
  # frequencies
  summarise(freq = n()) %>%
  # order decreasing
  arrange(-freq)
# A tibble: 9 x 5
  Slept.Well Had.Breakfast Worked.out Meditated  freq
       <dbl>         <dbl>      <dbl>     <dbl> <int>
1          0             1          1         0     2
2          0             0          0         0     1
3          0             0          0         1     1
4          0             0          1         0     1
5          0             0          1         1     1
6          0             1          0         1     1
7          0             1          1         1     1
8          1             0          0         1     1
9          1             1          0         0     1
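Note that group_by_all() has since been superseded; in current dplyr (>= 1.0) the same grouping is usually written with across(), as in this sketch:
my_data %>%
  group_by(across(everything())) %>%            # group by every column
  summarise(freq = n(), .groups = "drop") %>%   # count each combination
  arrange(desc(freq))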
Or with data.table:
library(data.table)
res <- setorder(data.table(my_data)[, .(freq = .N), by = names(my_data)], -freq)
res
   Slept.Well Had.Breakfast Worked.out Meditated freq
1:          0             1          1         0    2
2:          1             0          0         1    1
3:          0             0          1         0    1
4:          0             0          0         0    1
5:          0             1          0         1    1
6:          0             1          1         1    1
7:          0             0          1         1    1
8:          0             0          0         1    1
9:          1             1          0         0    1
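If the by-reference reordering of setorder() is not needed, the same result can be written as a single chain (a sketch):
as.data.table(my_data)[, .(freq = .N), by = names(my_data)][order(-freq)]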

Related

Generate a dummy variable satisfying a condition for the same individual in a panel dataframe

I have a dataframe of this form:
ID panelid dummy1 dummy2
 1       1      0      1
 1       2      1      0
 2       1      1      0
 2       2      0      1
 3       1      1      0
 3       2      1      0
 4       1      0      1
 4       2      0      1
I want to generate a dummy variable equal to 1 when panelid == 2, but only if the same individual has dummy1 == 1 at panelid == 1 and dummy2 == 1 at panelid == 2. Thus I want to obtain something like this:
ID panelid dummy1 dummy2 result
 1       1      0      1      0
 1       2      1      0      0
 2       1      1      0      0
 2       2      0      1      1
 3       1      1      0      0
 3       2      1      0      0
 4       1      0      1      0
 4       2      0      1      0
Can someone help me with this?
Many thanks to everyone.
This is an almost identical solution to @Cole's.
dataset <- read.table(text = 'ID panelid dummy1 dummy2
1 1 0 1
1 2 1 0
2 1 1 0
2 2 0 1
3 1 1 0
3 2 1 0
4 1 0 1
4 2 0 1',
header = TRUE)
temp_ID <- dataset$ID[dataset$panelid == 1 & dataset$dummy1 == 1]
dataset$result <- as.integer(dataset$panelid == 2 & dataset$dummy2 == 1 & dataset$ID %in% temp_ID)
dataset
  ID panelid dummy1 dummy2 result
1  1       1      0      1      0
2  1       2      1      0      0
3  2       1      1      0      0
4  2       2      0      1      1
5  3       1      1      0      0
6  3       2      1      0      0
7  4       1      0      1      0
8  4       2      0      1      0
Here's a base R approach:
# IDs whose panelid == 1 record has dummy1 == 1
dummy1_in_panelid <- with(df, ID[panelid == 1 & dummy1 == 1])
# initialize, then flag the qualifying panelid == 2 rows
df$result <- 0
df$result[with(df, which(panelid == 2 & ID %in% dummy1_in_panelid & dummy2 == 1))] <- 1
df
  ID panelid dummy1 dummy2 result
1  1       1      0      1      0
2  1       2      1      0      0
3  2       1      1      0      0
4  2       2      0      1      1
5  3       1      1      0      0
6  3       2      1      0      0
7  4       1      0      1      0
8  4       2      0      1      0
And the data...
df <- as.data.frame(data.table::fread('
ID panelid dummy1 dummy2
1 1 0 1
1 2 1 0
2 1 1 0
2 2 0 1
3 1 1 0
3 2 1 0
4 1 0 1
4 2 0 1'))
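A grouped dplyr version of the same logic is also possible; in this sketch, any(dummy1[panelid == 1] == 1) checks within each ID whether the panelid == 1 record had dummy1 == 1:
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(result = as.integer(panelid == 2 & dummy2 == 1 &
                               any(dummy1[panelid == 1] == 1))) %>%
  ungroup()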

Calculating means in R, via case/row from multiple variables; count and exclude NA values

I'm trying to calculate participant average scores as follows:
1. Take a series of values from multiple variables (test items),
2. Calculate an average score only over the items answered Yes or No,
3. Omit NA values from the mean, while still counting their frequency and recording the coordinates of every NA value,
4. Store that mean in a new variable.
I need to do this with binary questions (1 = Yes, 0 = No, -99 = Missing / NA), such as below:
id var1 var2 var3 var4 var5
 1    1    0    0    0    0
 2    1    1    0    1    1
 3    1    0    0    1    0
 4    1    0    0    1    0
 5    1    0    0    0    0
 6    1    1    0    0    1
 7    1    1    0    0    1
 8    1    1    0    0    0
 9    1    0    1    0    1
10    1    0    0  -99    1
11    1    1    0    1    0
12    1    0    0    1    0
13    1    0    0  -99    0
14    1  -99    0    1    1
15    1    0    0    1    0
16    1    0    0    0    1
17    1    0    0    1    0
18    1    0  -99    0    1
19    1    0    0    1    0
20    1    0    0    1    1
21    1    0    0    1    0
22    1    0    0    1    1
23    1    0    0    1    0
24    1    0    0    0    1
25    1    0    0    0    0
26    1    0    0    1    0
27    1    0    0    0    0
28    1    1    0    1    1
And with Likert scale questions (0 = Strongly Disagree / 6 = Strongly Agree, -99 = Missing / NA):
var10 var11 var12 var13 var14
    1     1     1     1     0
    4     1     1     1     1
    1     1     1     1     1
    2     1     1     1     1
    4     1     1     1     1
    2     1     1     1     0
    1     1     1     1     0
    1     1     1     1     1
    2     1     1     1     1
    1     1     1     1     0
    4     1     1     1     1
    4     1     1     1     1
  -99     1     1     1     1
    1     1     2     1     1
    1     4     2     2     0
    4     1     1     1     1
    4     1     1     1     1
    1     1     1     1     1
    2     1     1     1     1
    4     1     1     1     0
    1     1     1     1     1
    4     1     1     1     1
    1     1     1     1     1
    4     1     1     1     1
    1     1     1     1     1
Any ideas on how to go about this? I'm sure it can be done by selecting individual columns or by indicating a range of columns from which to draw data. However, I'm inexperienced in writing such a complex, multi-step function in R, so I'm hoping for a veteran's advice.
Thanks in advance.
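No answer is recorded for this question here, but a minimal base R sketch of the four steps could look like the following, assuming the binary items sit in columns var1 to var5 of a data frame df and -99 codes the missing values (the column and variable names are assumptions taken from the example):
items <- paste0("var", 1:5)                           # assumed item columns
df[items][df[items] == -99] <- NA                     # recode -99 to NA
na_coords <- which(is.na(df[items]), arr.ind = TRUE)  # row/column coordinates of every NA
na_count  <- rowSums(is.na(df[items]))                # NA frequency per participant
df$mean_score <- rowMeans(df[items], na.rm = TRUE)    # mean of answered items, stored in a new variable
The same pattern applies to the Likert items (var10 to var14).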

Aggregate R data frame over count of a field: Pivot table-like result set [duplicate]

This question already has answers here:
How do I get a contingency table? (6 answers)
Faster ways to calculate frequencies and cast from long to wide (4 answers)
Closed 4 years ago.
I have a data frame in the following structure
ChannelId,AuthorId
1,32
28,2393293
2,32
2,32
1,2393293
31,3
3,32
5,4
2,5
What I want is
AuthorId,1,2,3,5,28,31
4,0,0,0,1,0,0
3,0,0,0,0,0,1
5,0,1,0,0,0,0
32,1,2,1,0,0,0
2393293,1,0,0,0,1,0
Is there a way to do this?
The xtabs function can be called with a formula that specifies the table margins:
xtabs(~ AuthorId + ChannelId, data = dat)
         ChannelId
AuthorId  1 2 28 3 31 5
  2393293 1 0  1 0  0 0
  3       0 0  0 0  1 0
  32      1 2  0 1  0 0
  4       0 0  0 0  0 1
  5       0 1  0 0  0 0
Perhaps the simplest way would be t(table(df)):
#          ChannelId
# AuthorId 1 2 3 5 28 31
#  3       0 0 0 0  0  1
#  4       0 0 0 1  0  0
#  5       0 1 0 0  0  0
#  32      1 2 1 0  0  0
#  2393293 1 0 0 0  1  0
If you want to use dplyr::count you could do:
library(dplyr)
library(tidyr)
df %>%
  count(AuthorId, ChannelId) %>%
  spread(ChannelId, n, fill = 0)
Which gives:
#Source: local data frame [5 x 7]
#Groups: AuthorId [5]
#
#   AuthorId     1     2     3     5    28    31
#*     <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1         3     0     0     0     0     0     1
#2         4     0     0     0     1     0     0
#3         5     0     1     0     0     0     0
#4        32     1     2     1     0     0     0
#5   2393293     1     0     0     0     1     0
We can also use dcast from data.table: convert the data.frame to a data.table and call dcast with length as the aggregation function.
library(data.table)
dcast(setDT(df1), AuthorId ~ ChannelId, length)
#   AuthorId 1 2 3 5 28 31
#1:        3 0 0 0 0  0  1
#2:        4 0 0 0 1  0  0
#3:        5 0 1 0 0  0  0
#4:       32 1 2 1 0  0  0
#5:  2393293 1 0 0 0  1  0
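As a side note, spread() has since been superseded; with tidyr >= 1.0 an equivalent sketch uses pivot_wider():
library(dplyr)
library(tidyr)
df %>%
  count(AuthorId, ChannelId) %>%
  pivot_wider(names_from = ChannelId, values_from = n, values_fill = 0)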

count occurrences in unique group combination

I have a data set that resembles the below:
SSN Auto MtgHe Personal Other None
  A    1     1        0     0    0
  B    1     1        0     0    0
  C    1     0        0     0    0
  D    1     0        1     1    0
  E    0     0        0     0    1
  F    0     0        0     0    1
  G    0     0        0     0    1
SSN identifies the person; Auto, MtgHe, Personal, and Other are loan categories, and 'None' means no loans are present. That gives 15 possible loan combinations, plus the 16th possibility of 'None', i.e. no loans at all. So a person could have only an Auto loan, an Auto and a Personal loan, or no loan at all, for example. I would like a count of SSNs for each distinct combination. Using the table above, the results would look like:
Cnt Auto MtgHe Personal Other None
  2    1     1        0     0    0
  1    1     0        0     0    0
  1    1     0        1     1    0
  3    0     0        0     0    1
Any ideas on how to accomplish this in R? My data set really has tens of thousands of cases, but any help would be appreciated.
And the obligatory data.table version (the only one that won't reorder the data set)
library(data.table)
setDT(df)[, .(Cnt = .N), .(Auto, MtgHe, Personal, Other, None)]
#    Auto MtgHe Personal Other None Cnt
# 1:    1     1        0     0    0   2
# 2:    1     0        0     0    0   1
# 3:    1     0        1     1    0   1
# 4:    0     0        0     0    1   3
Or a shorter version could be
temp <- names(df)[-1]
setDT(df)[, .N, temp]
#    Auto MtgHe Personal Other None N
# 1:    1     1        0     0    0 2
# 2:    1     0        0     0    0 1
# 3:    1     0        1     1    0 1
# 4:    0     0        0     0    1 3
And just for fun, here's another (unordered) base R version
Cnt <- rev(tapply(df[,1], do.call(paste, df[-1]), length))
cbind(unique(df[-1]), Cnt)
#   Auto MtgHe Personal Other None Cnt
# 1    1     1        0     0    0   2
# 3    1     0        0     0    0   1
# 4    1     0        1     1    0   1
# 5    0     0        0     0    1   3
And an additional dplyr version for completeness:
library(dplyr)
group_by(df, Auto, MtgHe, Personal, Other, None) %>% tally
# Source: local data frame [4 x 6]
# Groups: Auto, MtgHe, Personal, Other
#
#   Auto MtgHe Personal Other None n
# 1    0     0        0     0    1 3
# 2    1     0        0     0    0 1
# 3    1     0        1     1    0 1
# 4    1     1        0     0    0 2
One option, using dplyr's count function:
library(dplyr)
count(df, Auto, MtgHe, Personal, Other, None) %>% ungroup()
#Source: local data frame [4 x 6]
#
#  Auto MtgHe Personal Other None n
#1    0     0        0     0    1 3
#2    1     0        0     0    0 1
#3    1     0        1     1    0 1
#4    1     1        0     0    0 2
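As a side note, count() can also rank the combinations directly via its sort argument:
count(df, Auto, MtgHe, Personal, Other, None, sort = TRUE)  # most frequent combination first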
And for those who prefer base R and without ordering:
x <- interaction(df[-1])
df <- transform(df, n = ave(seq_along(x), x, FUN = length))[!duplicated(x),-1]
#  Auto MtgHe Personal Other None n
#1    1     1        0     0    0 2
#3    1     0        0     0    0 1
#4    1     0        1     1    0 1
#5    0     0        0     0    1 3
Base R solution using aggregate:
aggregate(count ~ ., data = transform(df[-1], count = 1), FUN = sum)
#  Auto MtgHe Personal Other None count
#1    1     0        0     0    0     1
#2    1     1        0     0    0     2
#3    1     0        1     1    0     1
#4    0     0        0     0    1     3

R data.table condition within group, but recorded at first instance in group

I have data that looks a bit like this:
df <- data.frame(ID = c(rep(1, 4), rep(2, 2), rep(3, 2), 4),
                 TYPE = c(1, 3, 2, 4, 1, 2, 2, 3, 2),
                 SEQUENCE = c(seq(1, 4), 1, 2, 1, 2, 1))
ID TYPE SEQUENCE
 1    1        1
 1    3        2
 1    2        3
 1    4        4
 2    1        1
 2    2        2
 3    2        1
 3    3        2
 4    2        1
I now need to check whether a certain type is present in each ID block (binary), but record the answer only in the first record per block (SEQUENCE == 1).
The best I have come up with so far codes them in the row they occur in, e.g.:
library(data.table)
DT <- data.table(df)
DT$A[DT$TYPE==1] <- 1
DT$B[DT$TYPE==2] <- 1
DT$C[DT$TYPE==3] <- 1
DT$D[DT$TYPE==4] <- 1
DT[is.na(DT)] <- 0
RESULT:
ID TYPE SEQUENCE A B C D
 1    1        1 1 0 0 0
 1    3        2 0 0 1 0
 1    2        3 0 1 0 0
 1    4        4 0 0 0 1
 2    1        1 1 0 0 0
 2    2        2 0 1 0 0
 3    2        1 0 1 0 0
 3    3        2 0 0 1 0
 4    2        1 0 1 0 0
However, the result should look like this:
ID TYPE SEQUENCE A B C D
 1    1        1 1 1 1 1
 1    3        2 0 0 0 0
 1    2        3 0 0 0 0
 1    4        4 0 0 0 0
 2    1        1 1 1 0 0
 2    2        2 0 0 0 0
 3    2        1 0 1 1 0
 3    3        2 0 0 0 0
 4    2        1 0 1 0 0
I assume this can be done with data.table, but I haven't quite found the correct syntax.
This makes one copy of the data.table:
DT[, FAC := factor(TYPE, labels = LETTERS[1:4])]
DT <- dcast.data.table(DT, ID + TYPE + SEQUENCE ~ FAC, fun.aggregate = length)
DT[, LETTERS[1:4] := lapply(.SD, function(x) c(any(as.logical(x)), rep(0L, length(x) - 1))),
   .SDcols = LETTERS[1:4], by = ID]
#   ID TYPE SEQUENCE A B C D
#1:  1    1        1 1 1 1 1
#2:  1    2        3 0 0 0 0
#3:  1    3        2 0 0 0 0
#4:  1    4        4 0 0 0 0
#5:  2    1        1 1 1 0 0
#6:  2    2        2 0 0 0 0
#7:  3    2        1 0 1 1 0
#8:  3    3        2 0 0 0 0
#9:  4    2        1 0 1 0 0
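An alternative sketch that skips the reshape entirely: within each ID, flag whether each TYPE occurs at all, and attach the flag only to the SEQUENCE == 1 row (column names A to D as in the question):
DT <- data.table(df)
DT[, c("A", "B", "C", "D") := lapply(1:4, function(t)
     as.integer(any(TYPE == t) & SEQUENCE == 1)), by = ID]
DT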
