count occurrences in unique group combination - r

I have a data set that resembles the below:
SSN Auto MtgHe Personal Other None
A 1 1 0 0 0
B 1 1 0 0 0
C 1 0 0 0 0
D 1 0 1 1 0
E 0 0 0 0 1
F 0 0 0 0 1
G 0 0 0 0 1
SSN identifies the person; Auto, MtgHe, Personal, and Other are loan categories, and 'None' means no loans are present. With four loan categories there are 15 possible loan combinations, plus the 16th possibility of 'None' (no loans). So a person could have only an Auto loan, an Auto and a Personal loan, or no loan at all, for example. I would like a count of SSNs for each distinct combination. Using the table above, the results would look like:
Cnt Auto MtgHe Personal Other None
2 1 1 0 0 0
1 1 0 0 0 0
1 1 0 1 1 0
3 0 0 0 0 1
Any ideas on how to accomplish this in R? My data set really has tens of thousands of cases, but any help would be appreciated.
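For reference, here is one way to reconstruct the example data as a data frame called df, which is the name the answers below assume:
# Rebuild the example data shown above; the answers refer to it as df
df <- read.table(text = "
SSN Auto MtgHe Personal Other None
A 1 1 0 0 0
B 1 1 0 0 0
C 1 0 0 0 0
D 1 0 1 1 0
E 0 0 0 0 1
F 0 0 0 0 1
G 0 0 0 0 1", header = TRUE, stringsAsFactors = FALSE)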

And the obligatory data.table version (the only one that won't reorder the data set)
library(data.table)
setDT(df)[, .(Cnt = .N), .(Auto, MtgHe, Personal, Other, None)]
# Auto MtgHe Personal Other None Cnt
# 1: 1 1 0 0 0 2
# 2: 1 0 0 0 0 1
# 3: 1 0 1 1 0 1
# 4: 0 0 0 0 1 3
Or a shorter version could be
temp <- names(df)[-1]
setDT(df)[, .N, temp]
# Auto MtgHe Personal Other None N
# 1: 1 1 0 0 0 2
# 2: 1 0 0 0 0 1
# 3: 1 0 1 1 0 1
# 4: 0 0 0 0 1 3
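A small aside (not part of the original answer): if a result sorted by the grouping columns is acceptable, keyby both groups and orders in one step:
setDT(df)[, .N, keyby = temp]  # same counts, ordered by the grouping columns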
And just for fun, here's another (unordered) base R version
Cnt <- tapply(df[,1], do.call(paste, df[-1]), length)
u <- unique(df[-1])
cbind(u, Cnt = Cnt[do.call(paste, u)])
# Auto MtgHe Personal Other None Cnt
# 1 1 1 0 0 0 2
# 3 1 0 0 0 0 1
# 4 1 0 1 1 0 1
# 5 0 0 0 0 1 3
And an additional dplyr version for completeness:
library(dplyr)
group_by(df, Auto, MtgHe, Personal, Other, None) %>% tally
# Source: local data frame [4 x 6]
# Groups: Auto, MtgHe, Personal, Other
#
# Auto MtgHe Personal Other None n
# 1 0 0 0 0 1 3
# 2 1 0 0 0 0 1
# 3 1 0 1 1 0 1
# 4 1 1 0 0 0 2

One option, using dplyr's count function:
library(dplyr)
count(df, Auto, MtgHe, Personal, Other, None) %>% ungroup()
#Source: local data frame [4 x 6]
#
# Auto MtgHe Personal Other None n
#1 0 0 0 0 1 3
#2 1 0 0 0 0 1
#3 1 0 1 1 0 1
#4 1 1 0 0 0 2
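As an aside, count() also takes a sort argument if you want the most frequent combinations listed first:
count(df, Auto, MtgHe, Personal, Other, None, sort = TRUE)  # largest n first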
And for those who prefer base R and without ordering:
x <- interaction(df[-1])
df <- transform(df, n = ave(seq_along(x), x, FUN = length))[!duplicated(x),-1]
# Auto MtgHe Personal Other None n
#1 1 1 0 0 0 2
#3 1 0 0 0 0 1
#4 1 0 1 1 0 1
#5 0 0 0 0 1 3

Base R solution using aggregate:
aggregate(count ~ ., data=transform(df[-1], count=1), FUN=sum)
# Auto MtgHe Personal Other None count
#1 1 0 0 0 0 1
#2 1 1 0 0 0 2
#3 1 0 1 1 0 1
#4 0 0 0 0 1 3
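For completeness, a base R sketch using table(), which crosses all of the 0/1 columns at once (note the grouping columns come back as factors rather than numbers):
# Count every combination of the 0/1 columns, then drop the unobserved ones
tab <- as.data.frame(table(df[-1]), responseName = "Cnt")
subset(tab, Cnt > 0)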

Binary Variables Combinations Analysis in R

I have a data set, which has a lot of binary variables. For the ease of illustration, here is a smaller version with only 4 variables:
set.seed(5)
my_data <- data.frame("Slept Well"    = sample(c(0,1), 10, TRUE),
                      "Had Breakfast" = sample(c(0,1), 10, TRUE),
                      "Worked out"    = sample(c(0,1), 10, TRUE),
                      "Meditated"     = sample(c(0,1), 10, TRUE))
In the above, each row corresponds to an observation. I am interested in analysing the frequency of each unique combination of the variables. For example, how many observations said that they both slept well and meditated, but did not have breakfast or work out?
I would like to be able to rank the unique combinations from most frequently occurring to the least frequently occurring. What is the best way to go about coding that up?
You can use aggregate.
x <- aggregate(list(n=rep(1, nrow(my_data))), my_data, length)
#x <- aggregate(list(n=my_data[,1]), my_data, length) #Alternative
x[order(-x$n),]
# Slept.Well Had.Breakfast Worked.out Meditated n
#4 0 1 1 0 2
#1 0 0 0 0 1
#2 1 1 0 0 1
#3 0 0 1 0 1
#5 0 0 0 1 1
#6 1 0 0 1 1
#7 0 1 0 1 1
#8 0 0 1 1 1
#9 0 1 1 1 1
What about a dplyr solution:
library(dplyr)
my_data %>%
  # group it
  group_by_all() %>%
  # frequencies
  summarise(freq = n()) %>%
  # order decreasing
  arrange(-freq)
# A tibble: 9 x 5
Slept.Well Had.Breakfast Worked.out Meditated freq
<chr> <chr> <chr> <chr> <int>
1 0 1 1 0 2
2 0 0 0 0 1
3 0 0 0 1 1
4 0 0 1 0 1
5 0 0 1 1 1
6 0 1 0 1 1
7 0 1 1 1 1
8 1 0 0 1 1
9 1 1 0 0 1
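In current dplyr (1.0 or later), group_by_all() is superseded; an equivalent spelling, assuming the same my_data, is:
# count() groups by every column and tallies; sort = TRUE ranks by frequency
my_data %>%
  count(across(everything()), sort = TRUE, name = "freq")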
Or with data.table:
res <- setorder(data.table(my_data)[, .(freq = .N), by = names(my_data)], -freq)
res
Slept.Well Had.Breakfast Worked.out Meditated freq
1: 0 1 1 0 2
2: 1 0 0 1 1
3: 0 0 1 0 1
4: 0 0 0 0 1
5: 0 1 0 1 1
6: 0 1 1 1 1
7: 0 0 1 1 1
8: 0 0 0 1 1
9: 1 1 0 0 1

R: How to drop columns with less than 10% 1's

My dataset:
a b c
1 1 0
1 0 0
1 1 0
I want to drop columns which have less than 10% 1's. I have this code but it's not working:
sapply(df, function(x) df[df[,c(x)]==1]>0.1))
Maybe I need a totally different approach.
Try this option with apply() and a small helper function that tests each column's proportion of 1s against the threshold. I have created a dummy example. The index i contains the columns that will be dropped, after using myfun to compute the proportion of 1s in each column. Here is the code:
#Data
df <- as.data.frame(matrix(c(1,0),20,10))
df$V1<-c(1,rep(0,19))
df$V2<-c(1,rep(0,19))
#Function
myfun <- function(x) {sum(x==1)/length(x)}
#Index For removing
i <- unname(which(apply(df,2,myfun)<0.1))
#Drop
df2 <- df[,-i]
The output:
df2
V3 V4 V5 V6 V7 V8 V9 V10
1 1 1 1 1 1 1 1 1
2 0 0 0 0 0 0 0 0
3 1 1 1 1 1 1 1 1
4 0 0 0 0 0 0 0 0
5 1 1 1 1 1 1 1 1
6 0 0 0 0 0 0 0 0
7 1 1 1 1 1 1 1 1
8 0 0 0 0 0 0 0 0
9 1 1 1 1 1 1 1 1
10 0 0 0 0 0 0 0 0
11 1 1 1 1 1 1 1 1
12 0 0 0 0 0 0 0 0
13 1 1 1 1 1 1 1 1
14 0 0 0 0 0 0 0 0
15 1 1 1 1 1 1 1 1
16 0 0 0 0 0 0 0 0
17 1 1 1 1 1 1 1 1
18 0 0 0 0 0 0 0 0
19 1 1 1 1 1 1 1 1
20 0 0 0 0 0 0 0 0
Columns V1 and V2 are dropped because their proportion of 1s is below 0.1.
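One caveat with the index approach (an addition, not part of the original answer): if no column falls below the threshold, i is empty and df[, -i] would drop every column, so it is worth guarding:
# df[, -integer(0)] selects zero columns, so only drop when i is non-empty
df2 <- if (length(i) > 0) df[, -i, drop = FALSE] else df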
You can use colMeans in base R to keep columns that have at least 10% 1s:
df[, colMeans(df == 1) >= 0.1]
Or in dplyr, use select with where():
library(dplyr)
df %>% select(where(~mean(. == 1) >= 0.1))
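Another base R option in the same spirit, using Filter() to keep columns whose share of 1s is at least 10%:
# Filter() treats the data frame as a list of columns and keeps those
# for which the predicate returns TRUE
Filter(function(col) mean(col == 1) >= 0.1, df)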

Transpose and create categorical values in R

I have a data frame with the structure below, from which I am looking to turn the categorical variables into indicator columns. The intent is to find the weighted mix of the variables.
data <- read.table(header=T, text='
subject weight sex test
1 2 M control
2 3 F cond1
3 2 F cond2
4 4 M control
5 3 F control
6 2 F control
')
data
Expected output:
subject weight control_F control_M cond1_F cond1_M cond2_F cond2_M
1 2 0 1 0 0 0 0
2 3 0 0 1 0 0 0
3 2 0 0 0 0 1 0
4 4 0 1 0 0 0 0
5 3 1 0 0 0 0 0
6 2 1 0 0 0 0 0
I tried using a combination of ifelse and cut, but just couldn't produce the output.
Any ideas on how I can do this?
TIA
You may use
model.matrix(~ subject + weight + sex:test - 1, data)
I think model.matrix is most natural here (see @Julius' answer), but here's an alternative:
library(data.table)
setDT(data)
dcast(data, subject+weight~test+sex, fun=length, drop=c(TRUE,FALSE))
subject weight cond1_F cond1_M cond2_F cond2_M control_F control_M
1: 1 2 0 0 0 0 0 1
2: 2 3 1 0 0 0 0 0
3: 3 2 0 0 1 0 0 0
4: 4 4 0 0 0 0 0 1
5: 5 3 0 0 0 0 1 0
6: 6 2 0 0 0 0 1 0
To get the columns in the "right" order (with the control first), set factor levels before casting:
data[, test := relevel(test, "control")]
dcast(data, subject+weight~test+sex, fun=length, drop=c(TRUE,FALSE))
subject weight control_F control_M cond1_F cond1_M cond2_F cond2_M
1: 1 2 0 1 0 0 0 0
2: 2 3 0 0 1 0 0 0
3: 3 2 0 0 0 0 1 0
4: 4 4 0 1 0 0 0 0
5: 5 3 1 0 0 0 0 0
6: 6 2 1 0 0 0 0 0
(Note: reshape2's dcast isn't so good here, since its drop option applies to both rows and cols.)

Faster way to multiplication in data frame

I have a data frame (named t) like this:
ID N com_a com_b com_c
A 3 1 0 0
A 5 0 1 0
B 1 1 0 0
B 1 0 1 0
B 4 0 0 1
B 4 1 0 0
I am trying to compute com_a*N, com_b*N and com_c*N:
ID N com_a com_b com_c com_a_N com_b_N com_c_N
A 3 1 0 0 3 0 0
A 5 0 1 0 0 5 0
B 1 1 0 0 1 0 0
B 1 0 1 0 0 1 0
B 4 0 0 1 0 0 4
B 4 1 0 0 4 0 0
I used a for loop, but it takes a long time. How can I do this quickly on a big data set?
for (i in 1:dim(t)[1]){
  t$com_a_N[i]=t$com_a[i]*t$N[i]
  t$com_b_N[i]=t$com_b[i]*t$N[i]
  t$com_c_N[i]=t$com_c[i]*t$N[i]
}
t <- transform(t,
               com_a_N = com_a*N,
               com_b_N = com_b*N,
               com_c_N = com_c*N)
should be much faster. data.table solutions might be faster still.
You can use sweep for this
(st <- sweep(t[, 3:5], 1, t$N, "*"))
# com_a com_b com_c
#1 3 0 0
#2 0 5 0
#3 1 0 0
#4 0 1 0
#5 0 0 4
#6 4 0 0
The new names can be created with paste and setNames, and you can add the new columns to the existing data.frame with cbind. This will scale for any number of columns.
cbind(t, setNames(st, paste(names(st), "N", sep="_")))
# ID N com_a com_b com_c com_a_N com_b_N com_c_N
#1 A 3 1 0 0 3 0 0
#2 A 5 0 1 0 0 5 0
#3 B 1 1 0 0 1 0 0
#4 B 1 0 1 0 0 1 0
#5 B 4 0 0 1 0 0 4
#6 B 4 1 0 0 4 0 0
A data.table solution as proposed by #BenBolker
library(data.table)
setDT(t)[, c("com_a_N", "com_b_N", "com_c_N") := list(com_a*N, com_b*N, com_c*N)]
## ID N com_a com_b com_c com_a_N com_b_N com_c_N
## 1: A 3 1 0 0 3 0 0
## 2: A 5 0 1 0 0 5 0
## 3: B 1 1 0 0 1 0 0
## 4: B 1 0 1 0 0 1 0
## 5: B 4 0 0 1 0 0 4
## 6: B 4 1 0 0 4 0 0
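If there are many com_* columns, a variant of the same data.table idea avoids typing each name (a sketch, assuming the columns of interest all start with "com_"):
cols <- grep("^com_", names(t), value = TRUE)
# multiply every com_* column by N and store the results as *_N columns
setDT(t)[, paste0(cols, "_N") := lapply(.SD, function(x) x * N), .SDcols = cols]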
Even faster, using vectorised multiplication (each selected column is multiplied element-wise by N):
cbind(t, t[, 3:5]*t$N)
Though you should set the column names afterwards.
To avoid using explicit column indices (not recommended), you can use some grep magic:
cbind(t, t[, grep('com', colnames(t))]*t$N)
Another option with dplyr:
require(dplyr)
t <- mutate(t,
            com_a_N = com_a*N,
            com_b_N = com_b*N,
            com_c_N = com_c*N)
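With current dplyr (1.0+), across() generalises the mutate() call to any number of com_* columns:
library(dplyr)
# .names = "{.col}_N" appends _N to each original column name
t <- t %>% mutate(across(starts_with("com"), ~ .x * N, .names = "{.col}_N"))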

R Data Transformations

I need to look at the data in a data frame in a different way. Here is the problem.
I have a data frame as follows
Person Item BuyOrSell
1 a B
1 b S
1 a S
2 d B
3 a S
3 e S
One of the requirements I have is to see the data as follows: show the number of transactions made by each Person on individual items, broken out by the transaction type (B or S).
Person aB aS bB bS dB dS eB eS
1 1 1 0 1 0 0 0 0
2 0 0 0 0 1 0 0 0
3 1 0 0 0 0 0 0 1
So I created a new column by pasting together the values of Item and BuyOrSell:
df$newcol <- paste(df$Item, "-", df$BuyOrSell, sep="")
with(df, table(Person, newcol))
and was able to achieve the above results.
The last transformation requirement, which was a tough nut to crack, was as follows:
aB aS bB bS dB dS eB eS
aB 1 1 0 1 0 0 0 0
aS 1 2 0 1 0 0 0 1
bB 0 0 0 0 0 0 0 0
bS 1 1 0 0 0 0 0 0
dB 0 0 0 0 1 0 0 0
dS 0 0 0 0 0 0 0 0
eB 0 0 0 0 0 0 0 0
eS 0 1 0 0 0 0 0 1
where the above table has to be filled in with the number of people who made a particular transaction and also made a transaction on another item.
I tried table(newcol, newcol), but it generated counts only on the diagonal (aB-aB, aS-aS, bB-bB, ...) and 0s for all other combinations.
Any ideas on what package or command will let me crack this nut?
Isn't the final result just:
library(reshape2)
# Following Ricardo's solution for casting, but using `acast` instead
A <- acast(Person ~ Item + BuyOrSell, data = df, fun.aggregate = length, drop = FALSE)
# A' * A
t(A) %*% A
# a_B a_S b_B b_S d_B d_S e_B e_S
# a_B 1 1 0 1 0 0 0 0
# a_S 1 2 0 1 0 0 0 1
# b_B 0 0 0 0 0 0 0 0
# b_S 1 1 0 1 0 0 0 0
# d_B 0 0 0 0 1 0 0 0
# d_S 0 0 0 0 0 0 0 0
# e_B 0 0 0 0 0 0 0 0
# e_S 0 1 0 0 0 0 0 1
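The same cross-product can also be written with crossprod(), which avoids the explicit transpose:
crossprod(A)  # identical result to t(A) %*% A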
I think there is a better way, but here's a method using the package reshape2.
require(reshape2)
#reshapes data so each item and buy/sell event interaction occurs once
df2 <- dcast(Person~Item+BuyOrSell,data=df,fun.aggregate=length,drop=FALSE)
df2
# Person a_B a_S b_B b_S d_B d_S e_B e_S
# 1 1 1 1 0 1 0 0 0 0
# 2 2 0 0 0 0 1 0 0 0
# 3 3 0 1 0 0 0 0 0 1
#reshapes data so every row is an interaction by person
df3 <- melt(df2,id.vars="Person")
head(df3)
# Person variable value
# 1 1 a_B 1
# 2 2 a_B 0
# 3 3 a_B 0
# 4 1 a_S 1
# 5 2 a_S 0
# 6 3 a_S 1
#removes empty rows where no action occurred
#removes value column
df4 <- with(df3,
            data.frame(Person = rep.int(Person, value), variable = rep.int(variable, value)))
#performs a self-merge: now each row is
#every combination of two actions that one person has done
df5 <- merge(df4,df4,by="Person")
head(df5)
# Person variable.x variable.y
# 1 1 a_B a_B
# 2 1 a_B a_S
# 3 1 a_B b_S
# 4 1 a_S a_B
# 5 1 a_S a_S
# 6 1 a_S b_S
#tabulates variable interactions
with(df5,table(variable.x,variable.y))
Blue Magister, your solution works perfectly, and I analyzed each and every step that you performed.
The output of df4 was as follows:
Person variable
1 1 a_B
2 1 a_S
3 3 a_S
4 1 b_S
5 2 d_B
6 3 e_S
The output of with(df5,table(variable.x,variable.y)) was
variable.y
variable.x a_B a_S b_B b_S d_B d_S e_B e_S
a_B 1 1 0 1 0 0 0 0
a_S 1 2 0 1 0 0 0 1
b_B 0 0 0 0 0 0 0 0
b_S 1 1 0 1 0 0 0 0
d_B 0 0 0 0 1 0 0 0
d_S 0 0 0 0 0 0 0 0
e_B 0 0 0 0 0 0 0 0
e_S 0 1 0 0 0 0 0 1
which is exactly what I want.
When I look at the output of df4, it is almost identical to my newcol solution (using paste):
> df
Person newcol
1 1 a-B
2 1 b-S
3 1 a-S
4 2 d-B
5 3 a-S
6 3 e-S
The only difference here is the ordering of the rows when compared to your df4.
So, I ended up running these commands:
dfx <- merge(df,df,by="Person")
with(dfx,table(newcol.x,newcol.y))
and it generated the following...
newcol.y
newcol.x a-B a-S b-S d-B e-S
a-B 1 1 1 0 0
a-S 1 2 1 0 1
b-S 1 1 1 0 0
d-B 0 0 0 1 0
e-S 0 1 0 0 1
The above output is missing a few rows and columns. What am I doing differently from you?
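A likely explanation (not part of the original thread): newcol is a plain character vector, so table() only tabulates combinations that actually occur, whereas dcast(..., drop = FALSE) keeps every Item x BuyOrSell level combination (b_B, d_S, e_B) even when its count is zero. Turning newcol into a factor over all possible combinations should reproduce the full 8 x 8 table:
# Give newcol every Item/BuyOrSell combination as a factor level,
# so unused combinations show up as zero rows and columns in table()
all_combos <- levels(interaction(df$Item, df$BuyOrSell, sep = "-"))
df$newcol  <- factor(paste(df$Item, df$BuyOrSell, sep = "-"), levels = all_combos)
dfx <- merge(df, df, by = "Person")
with(dfx, table(newcol.x, newcol.y))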
