Concordance matrix - r

Assume I have a data set with an arbitrary number of rows and columns, like the one shown below.
library(tibble)
tmp <- tibble(id = 1:10,
              v1 = c(0, 0, 0, 1, 1, 0, 0, 0, 0, 1),
              v2 = c(0, 0, 0, 1, 0, 0, 0, 0, 0, 1),
              v3 = c(0, 0, 0, 1, 0, 0, 0, 0, 1, 0),
              v4 = c(0, 0, 0, 1, 1, 0, 0, 0, 1, 0))
Each row is a response. The respondent has either said yes (1) or no (0) to a specific question. Here, we have 4 questions.
What is the easiest way to convert this into a concordance matrix like below:
   v1 v2 v3 v4
v1  3  2  1  2
v2  2  2  1  1
v3  1  1  2  2
v4  2  1  2  3
Each cell shows, of those who answered yes to the question on the row, how many also answered yes to the question on the column.
Please note that the number of questions may be bigger than 4, so I prefer not to hard-code variable names in the solution. I can make sure the variable names always follow a specific format if that is helpful. A solution that doesn't care about variable names is ideal (we can drop the id column if needed).

The easiest way is with matrix multiplication...
mx <- as.matrix(tmp[,-1])
t(mx) %*% mx
#    v1 v2 v3 v4
# v1  3  2  1  2
# v2  2  2  1  1
# v3  1  1  2  2
# v4  2  1  2  3
crossprod(mx) will do the same thing.
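To see why the product gives the counts: entry (i, j) of t(mx) %*% mx is the sum over respondents of mx[, i] * mx[, j], and that product is 1 only when a respondent answered yes to both questions. Below is a minimal sketch that checks one cell by hand, using a plain data.frame copy of the question's data; the names co and by_hand are only illustrative.

```r
# Same data as in the question, as a base data.frame (no tibble needed)
tmp <- data.frame(id = 1:10,
                  v1 = c(0, 0, 0, 1, 1, 0, 0, 0, 0, 1),
                  v2 = c(0, 0, 0, 1, 0, 0, 0, 0, 0, 1),
                  v3 = c(0, 0, 0, 1, 0, 0, 0, 0, 1, 0),
                  v4 = c(0, 0, 0, 1, 1, 0, 0, 0, 1, 0))

mx <- as.matrix(tmp[, -1])   # drop the id column, keep every question column
co <- crossprod(mx)          # same result as t(mx) %*% mx

# Count respondents who said yes to both v1 and v4 directly
by_hand <- sum(mx[, "v1"] == 1 & mx[, "v4"] == 1)

co["v1", "v4"]   # co-occurrence count from the matrix product
by_hand          # the same count, obtained by direct comparison
```

The diagonal of co holds the number of yes answers per question, so it equals colSums(mx). No variable names are hard-coded in the crossprod step itself; the names v1 and v4 appear only in the manual check.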

Using tcrossprod (dropping the id column first, so that it does not end up in the result):
tcrossprod(t(tmp[-1]))

Related

Creating multiple summary tables with one function in R

I couldn't find an answer to this specific question; sorry if it's been asked before.
library(tidyverse)
# sample data
df <- data.frame(group=c(1, 1, 1, 1, 0, 0, 0, 0),
v1=c(1, 0, 0, 1, 0, 1, 1, 1),
v2=c(0, 0, 0, 0, 1, 0, 0, 1),
v3=c(0, 1, 0, 1, 1, 0, 1, 1))
I want to find the number of "1"s and "0"s in each v1, v2, v3 for each level of "group".
Currently I have been using
table(df$group, df$v1)
table(df$group, df$v2)
table(df$group, df$v3)
ad nauseam to get the number of "1"s in each variable, but I can't figure out how to create many such tables with one function. Any help would be greatly appreciated.
We can use lapply to apply the same function to multiple columns.
lapply(df[-1], function(x) table(df$group, x))
#$v1
#   x
#    0 1
#  0 1 3
#  1 2 2
#$v2
#   x
#    0 1
#  0 2 2
#  1 4 0
#$v3
#   x
#    0 1
#  0 1 3
#  1 2 2
Or with dplyr we can use count
purrr::map(names(df)[-1], ~count(df, group, !!sym(.x)))
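If only the counts are needed, a single summary table may be enough instead of one table per variable. Here is a minimal base R sketch, assuming every v column is coded strictly 0/1 (so a sum equals the number of 1s); the names ones and zeros are illustrative.

```r
df <- data.frame(group = c(1, 1, 1, 1, 0, 0, 0, 0),
                 v1 = c(1, 0, 0, 1, 0, 1, 1, 1),
                 v2 = c(0, 0, 0, 0, 1, 0, 0, 1),
                 v3 = c(0, 1, 0, 1, 1, 0, 1, 1))

# Sum each 0/1 column within each level of group: one row per group,
# one column per variable, each cell is the number of 1s
ones <- rowsum(df[-1], group = df$group)

# Flipping 0 and 1 first gives the number of 0s in the same layout
zeros <- rowsum(1 - df[-1], group = df$group)

ones
zeros
```

The row names of the result are the group levels, so ones["0", "v1"] is the number of 1s in v1 for group 0.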

Combining binary variables into one in R

Hope all goes well.
I am working on a data set that has 7 binary variables (each is either 0 or 1), and they are not mutually exclusive.
I need to convert them all into one categorical variable which will have 2^7 levels.
I was wondering if anyone has done such a thing in R before?
I really appreciate your time and answer.
library(tidyr)
data <- data.frame(x1 = c(0, 1, 0, 1), x2 = c(1, 1, 1, 1), x3 = c(0, 0, 0, 0),
x4 = c(1, 0, 1, 0), x5 = c(0, 0, 1, 1), x6 = c(1, 1, 0, 0), x7 = c(1, 0, 0, 1))
data <- unite(data, combine_x, 1:7, remove=FALSE)
data$combine_x <- factor(data$combine_x)
Using paste, something like this should work.
#create dataframe
df<-as.data.frame(cbind(rbinom(100,1,.5), rbinom(100,1,.5),rbinom(100,1,.5)))
#paste columns together using apply to loop over rows
df$new<-apply(df,1,function(x) paste(x, collapse =""))
Any non-NA character string can be used in the collapse argument (for example, collapse = ":" if you want to separate the values with a colon). Since the data is random, your values will differ; example output:
> head(df)
  V1 V2 V3 new
1  0  1  1 011
2  0  0  1 001
3  1  1  1 111
4  0  0  1 001
5  0  1  0 010
6  1  0  0 100
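The same pasting idea can be written without a row-wise apply, and the factor can be given all 2^k possible levels even when some combinations never occur in the data. This is a minimal sketch on a small made-up data frame (bits, key and all_patterns are hypothetical names, and the values are fixed rather than random so the result is reproducible).

```r
# Small fixed example with k = 3 binary columns (made-up values)
bits <- data.frame(x1 = c(0, 1, 0, 1),
                   x2 = c(1, 1, 0, 0),
                   x3 = c(0, 0, 1, 1))

# Paste the columns together row by row; paste0 is vectorised,
# so no explicit loop over rows is needed
key <- do.call(paste0, bits)

# Build every possible 0/1 pattern of length k to use as factor levels,
# so that unobserved combinations are still valid levels
k <- ncol(bits)
all_patterns <- do.call(paste0, expand.grid(rep(list(0:1), k)))

bits$combined <- factor(key, levels = sort(all_patterns))

bits$combined
nlevels(bits$combined)   # 2^k levels
```

With 7 binary columns this gives 2^7 = 128 levels; whether to keep unobserved levels or drop them with droplevels() depends on the analysis.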

Recode a value in a vector based on surrounding values

I'm trying to programmatically change a variable from a 0 to a 1 if there are three 1s before and after a 0.
For example, if the number in a vector were 1, 1, 1, 0, 1, 1, and 1, then I want to change the 0 to a 1.
Here is the data, in the vector dummy_code of the data frame original_df:
original_df <- data.frame(dummy_code = c(1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1))
Here is how I'm trying to have the values be recoded:
desired_df <- data.frame(dummy_code = c(1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1))
I tried to use the function fill in the package tidyr, but this fills in missing values, so it won't work. If I were to recode the 0 values to be missing, then that would not work either, because it would simply code every NA as 1, when I would only want to code every NA surrounded by three 1s as 1.
Is there a way to do this in an efficient way programmatically?
An rle alternative, using the x from @G. Grothendieck's answer:
r <- rle(x)
Find indexes of runs of three 1:
i1 <- which(r$lengths == 3 & r$values == 1)
Check which of the "1 indexes" that surround a 0, and get the indexes of the 0 to be replaced:
i2 <- i1[which(diff(i1) == 2)] + 1
Replace relevant 0 with 1:
r$values[i2] <- 1
Reverse the rle operation on the updated runs:
inverse.rle(r)
# [1] 1 0 0 1 1 1 1 1 1 1 0 0 1
A similar solution based on data.table::rleid, slightly more compact and perhaps easier to read:
library(data.table)
d <- data.table(x)
Calculate length of each run:
d[ , n := .N, by = rleid(x)]
For "x" which are zero and where the preceding and subsequent runs of 1 are both of length 3, set "x" to 1:
d[x == 0 & shift(n) == 3 & shift(n, type = "lead") == 3, x := 1]
d$x
# [1] 1 0 0 1 1 1 1 1 1 1 0 0 1
Here is a one-liner using rollapply from zoo:
library(zoo)
rollapply(c(0, 0, 0, x, 0, 0, 0), 7, function(x) if (all(x[-4] == 1)) 1 else x[4])
## [1] 1 0 0 1 1 1 1 1 1 1 0 0 1
Note: Input used was:
x <- c(1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1)
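For comparison, the rule can also be spelled out in base R with no packages. This is a minimal sketch, not one of the answers above: fill_isolated_zero and its k argument are hypothetical names, and it follows the rollapply reading of the rule (the k values immediately before and after a 0 are all 1), which differs from the rle version when a neighbouring run of 1s is longer than three.

```r
# Replace a 0 with 1 when the k values on each side of it are all 1.
# Checks are made against the original vector, so replacements do not cascade.
fill_isolated_zero <- function(x, k = 3) {
  out <- x
  n <- length(x)
  for (i in seq_len(n)) {
    # Skip non-zero values and positions without k full neighbours on each side
    if (x[i] != 0 || i <= k || i > n - k) next
    neighbours <- x[c((i - k):(i - 1), (i + 1):(i + k))]
    if (all(neighbours == 1)) out[i] <- 1
  }
  out
}

x <- c(1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1)
fill_isolated_zero(x)
# [1] 1 0 0 1 1 1 1 1 1 1 0 0 1
```

An explicit loop is slower than the vectorised answers on long vectors, but it makes the neighbourhood condition easy to adjust.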

generate increasing sequence of varying length in R

Given n, generate a sequence like this:
0, 0, 1, 0, 1, 2, ..., 0, 1, 2, 3, 4, 5, 6, ..., n
Let's say n=3, then the sequence should be:
0, 0, 1, 0, 1, 2, 0, 1, 2, 3
I've tried using rep, but it only generates a fixed length, whereas I need the sequence length to increase each time.
You can use a simple Map with an unlist to get the result you want:
n <- 3
unlist(Map(seq, from=0, to=0:n))
# [1] 0 0 1 0 1 2 0 1 2 3
Alternatively, using sequence (adapted from another answer):
n <- 3
sequence(0:(n+1))-1
# [1] 0 0 1 0 1 2 0 1 2 3
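Both approaches can be wrapped in small functions and compared for several values of n. This is a minimal sketch; seq_map and seq_sequence are hypothetical names. Each call should return 1 + 2 + ... + (n + 1) = (n + 1)(n + 2) / 2 values.

```r
# Map-based version: one run 0..k for each k in 0..n
seq_map <- function(n) unlist(Map(seq, from = 0, to = 0:n))

# sequence-based version: sequence() builds runs 1..k, so shift down by 1;
# the leading 0 in 0:(n + 1) contributes an empty run
seq_sequence <- function(n) sequence(0:(n + 1)) - 1

seq_map(3)
# [1] 0 0 1 0 1 2 0 1 2 3

# Compare the two versions for a few values of n
for (n in 0:5) {
  stopifnot(all(seq_map(n) == seq_sequence(n)))
  stopifnot(length(seq_map(n)) == (n + 1) * (n + 2) / 2)
}
```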

Transform a dataset to summarize table in R

I am learning about market basket analysis in data mining and would like to transform the raw data into a summary table for further calculation of support and confidence.
Below is an example of 4 transactions, each indicating which items the customer purchased.
The example is as follows:
Afterwards, I would like to have all possible item sets. For the above example, the total is 2^4 item sets.
It sounds like you're looking for the crossprod function:
M <- data.frame(ID = 1:4, A = c(1, 0, 1, 0),
B = c(1, 1, 0, 0), C = c(0, 1, 1, 0),
D = c(0, 0, 1, 1))
crossprod(as.matrix(M[-1]))
#   A B C D
# A 2 1 1 1
# B 1 2 1 0
# C 1 1 2 1
# D 1 0 1 2
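Since the question mentions support and confidence, the pairwise versions of both can be read off this matrix. This is a minimal sketch using the usual definitions (support of {X, Y} = share of transactions containing both; confidence of X -> Y = count of both divided by count of X); the names co, support and confidence are illustrative, and this covers item pairs only, not all 2^4 item sets.

```r
M <- data.frame(ID = 1:4, A = c(1, 0, 1, 0),
                B = c(1, 1, 0, 0), C = c(0, 1, 1, 0),
                D = c(0, 0, 1, 1))

items <- as.matrix(M[-1])    # drop the ID column
co <- crossprod(items)       # pairwise co-occurrence counts

# Support: fraction of all transactions that contain both items
support <- co / nrow(items)

# Confidence of row item -> column item: co-occurrence count divided by
# the count of the row item. diag(co) is recycled down the rows, so
# row i is divided by diag(co)[i].
confidence <- co / diag(co)

support["A", "B"]      # 0.25
confidence["A", "B"]   # 0.5
```

If an item never occurs, its diagonal entry is 0 and the corresponding confidence row becomes NaN, which would need handling in real data.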

Resources