Running an interaction matrix between many variables - r

I have a data set with 70 column variables, each a 0-1 dummy variable, and 3500 observations. I am looking to see how often observations with a 'success' in one variable also have a success in another variable. In other words, if obs 1 has a success dummy in variable 1, how often does it also have a success in variable 2, and so on for all the variables? I have found how to create a matrix table showing interactions when only two columns are involved, however I can't find anything involving many columns. Ideally I'd like to present this in an interaction matrix with 70 variables across and 70 down. Here is an idea of the data set:
Dat A B C D
XX 1 1 1 1
XY 0 1 0 1
XZ 0 0 1 1
The output I'm hoping for would be:
Out A B C D
A   0 1 1 1
B     0 1 2
C       0 2
D         0
This shows the number of times that (A,B) is a pairing, (B,C) is a pairing, and so on.
I have tried using the table() command as well as as.matrix, but it seems these require data organized as two columns and cannot handle data spread over many column variables. I am fairly new to R, so I apologize if my question isn't clear or is possibly quite simple.
Any help is appreciated. Thanks

Here's how to create a correlation matrix of arbitrary size. First create a reproducible example of your dataset...
dat <- matrix(sample(0:1, size = 700, replace = TRUE), ncol = 70)
dat <- data.frame(dat)
Then calculate the correlation...
dat <- cor(dat)
And then plot the correlation visually...
library(corrplot)
corrplot(dat, method = "square")
You can also plot the correlation using numbers instead of colors...
corrplot(dat, method = "number")
Obviously you'll want to finesse these charts before using them in a publication. corrplot offers tons of options for chart appearance.

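You can try the following, where df is the example data frame with the row labels in its first column:
# count, for each pair of dummy columns, the rows where both are 1
res <- apply(combn(2:ncol(df), 2), 2, function(x, y) sum(rowSums(y[, x]) == 2), df)
m <- diag(x = 0, ncol(df) - 1)
# combn() orders pairs like the lower triangle read column-wise, so fill it and transpose
m[lower.tri(m)] <- res
m <- t(m)
m[lower.tri(m)] <- NA
dimnames(m) <- list(colnames(df)[-1], colnames(df)[-1])
m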
   A  B  C  D
A  0  1  1  1
B NA  0  1  2
C NA NA  0  2
D NA NA NA  0
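The same counts can also be obtained with a single matrix product, which may scale more comfortably to 70 columns than looping over every pair. The following is only a sketch on the question's four-column example; it assumes the row labels sit in the first column (named Dat here), and the names X and m are just illustrative.

```r
# Example data from the question; the first column holds the row labels
df <- data.frame(Dat = c("XX", "XY", "XZ"),
                 A = c(1, 0, 0), B = c(1, 1, 0),
                 C = c(1, 0, 1), D = c(1, 1, 1))

X <- as.matrix(df[, -1])   # keep only the 0/1 dummy columns
m <- crossprod(X)          # t(X) %*% X: entry (i, j) counts rows where both are 1
diag(m) <- 0               # the question shows 0 on the diagonal
m[lower.tri(m)] <- NA      # keep each pair only once (upper triangle)
m
```

Before the diagonal is overwritten it holds the number of successes per variable, so drop the diag(m) <- 0 line if that is of interest.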

Related

Make dummy variables for a categorical variable

Let's say I have a data frame df as follows:
df <- data.frame(type = c("A","B","AB","O","O","B","A"))
Obviously there are 4 kinds of type. However, in my actual data, I don't know how many kinds are in a column type. The number of dummy variables should be one less than the number of kinds in type. In this example, number of dummy variables should be 3. My expected output looks like this:
df <- data.frame(type = c("A","B","AB","O","O","B","A"),
A = c(1,0,0,0,0,0,1),
B = c(0,1,0,0,0,1,0),
AB = c(0,0,1,0,0,0,0))
Here I used A, B and AB as dummy variables, but it doesn't matter which values of type I choose. Even if I don't know the values of type or how many kinds there are, I want to turn it into dummy variables.
This is treatment contrasts coding. First, you need a factor variable.
## option 1: if you care about the order of the dummy variables
## the 1st level is the baseline and gets no dummy variable
## I do this to match your example output with "A", "B" and "AB"
f <- factor(df$type, levels = c("O", "A", "B", "AB"))
## option 2: if you don't care, then let R automatically order levels
f <- factor(df$type)
Now, apply treatment contrasts coding.
## option 1 (recommended): using contr.treatment()
m <- contr.treatment(nlevels(f))[f, ]
## option 2 (less efficient): using model.matrix()
m <- model.matrix(~ f)[, -1]
Finally you want to have nice row/column names for readability.
dimnames(m) <- list(1:length(f), levels(f)[-1])
The resulting m looks like:
#   A B AB
# 1 1 0  0
# 2 0 1  0
# 3 0 0  1
# 4 0 0  0
# 5 0 0  0
# 6 0 1  0
# 7 1 0  0
This is a matrix. If you want a data frame, do data.frame(m).
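For the case where the values of type are not known in advance, here is a minimal sketch that lets R pick the levels and then binds the dummies back onto the original data frame. The name out is just illustrative, and which level ends up as the baseline depends on how R sorts the values (with the default ordering it is whichever sorts first).

```r
df <- data.frame(type = c("A", "B", "AB", "O", "O", "B", "A"))

f <- factor(df$type)                         # levels are discovered from the data
m <- model.matrix(~ f)[, -1, drop = FALSE]   # drop the intercept column
colnames(m) <- levels(f)[-1]                 # first level is the baseline, so no column for it

out <- cbind(df, as.data.frame(m))           # original column plus nlevels(f) - 1 dummies
out
```

The drop = FALSE is there so the result stays a matrix even if type happens to have only two kinds.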

Create table by pre-specified frequencies

I have a large R matrix with 1,000 rows and 4 attributes, each with 4 levels, such that:
Row A B C D
1 1 3 4 2
2 2 1 3 4
3 1 2 4 3
... ...
1000 3 4 1 2
I want to create a new table with pre-specified proportions such that level 1 of attribute A appears 25% of the time, level 2 50%, level 3 10% and level 4 15% of the time. The table can be smaller than 1,000 rows, and the rows have to be unique.
proportions <- c(0.25,0.5,0.1,0.15)
I know it's kind of a basic question, but I have racked my brain for two hours and haven't found anything on Stack Overflow or elsewhere on the internet.
UPDATE
I want to keep the same combinations within rows. So I want to create a new table with the proportions given but using the table, thus the combinations, that I already have.
You can create your set with the proportions you want then "reshuffle".
A <- c(rep(1,250), rep(2,500), rep(3,100), rep(4,150))
B <- sample(A, 1000)
EDIT:
It's not entirely clear what the OP wants.
If you want the exact same table with its rows in random order, you can try
df_new <- df[sample(1:nrow(df), nrow(df)),]
To get exactly the same proportions, you are restricted to numbers of observations for which all the new counts are divisible by the old counts.
To get the frequency of each unique row, you can try:
# simulating the table
a <- c(rep(1,250), rep(2,500), rep(3,100), rep(4,150))
b <- sample(a, 4000, replace = TRUE)
df <- as.data.frame(matrix(b, ncol = 4))
names(df) <- c('a','b','c','d')
# counting how often each unique combination of a, b, c, d occurs
z <- aggregate(row.names(df), list(df$a, df$b, df$c, df$d), FUN = length)
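Going by the UPDATE in the question (keep the existing row combinations, rows unique, smaller table allowed), one possible reading is to sample whole rows from the existing table, stratified by the level of A. This is only a sketch under that assumption; the simulated table tab, the target size n_new and the other names are made up for illustration.

```r
set.seed(1)

# Simulated stand-in for the existing 1,000-row table (hypothetical data)
tab <- as.data.frame(matrix(sample(1:4, 4000, replace = TRUE), ncol = 4,
                            dimnames = list(NULL, c("A", "B", "C", "D"))))
tab <- unique(tab)                       # rows have to be unique

proportions <- c(0.25, 0.5, 0.1, 0.15)   # target shares for levels 1..4 of A
n_new <- 100                             # hypothetical size of the new table
counts <- round(n_new * proportions)     # rows needed per level of A

# Guard: every level of A must have enough unique rows to draw from
available <- as.vector(table(factor(tab$A, levels = 1:4)))
stopifnot(all(counts <= available))

# Draw the required number of existing rows for each level of A
picked <- unlist(lapply(1:4, function(lev) {
  idx <- which(tab$A == lev)
  idx[sample.int(length(idx), counts[lev])]
}))
new_tab <- tab[picked, ]

prop.table(table(new_tab$A))             # shares of each level of A in the new table
```

The size n_new has to be small enough that the most frequent level can still be filled from the unique rows available, which is what the guard checks.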

Dynamically creating columns of binary values based on a series of True/False conditions

I would like to be able to create new columns in a dataframe, the values of which will be determined by a pre-defined list of conditional statements. The ultimate goal is to arrive at a table of binary values that represent whether a condition is met for each instance. It may seem like a clunky or odd output, but it is a requirement of an economic model I'm trying to build (a repeated sales model).
Here is a much simplified reproducible example:
df <- data.frame(a=c(1,2,3,4,5),b=c(0.3,0.2,0.5,0.3,0.7))
conditions <- data.frame(y=df$b>=0.5, z=df$b>=0.7)
columns <- c("y","z")
for(i in length(columns)){
df[, paste("var_",columns[i],sep="")] <- ifelse(conditions[i],1,0)
}
So in this instance, I'd like to get columns "var_y" and "var_z" which have binary values representing if the criteria for conditions y or z are being met.
Right now, I'm getting this error:
Error in ifelse(conditions[i], 1, 0) : (list) object cannot be
coerced to type 'logical'
Which I don't understand, as all of the information in the dataframe "conditions" is of the type 'logical'.
We can just do
df[paste0("var_", seq_along(columns))] <- +(conditions)
df
#   a   b var_1 var_2
# 1 1 0.3     0     0
# 2 2 0.2     0     0
# 3 3 0.5     1     0
# 4 4 0.3     0     0
# 5 5 0.7     1     1
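As for the error in the question: conditions[i] with single brackets returns a one-column data frame (a list), which ifelse() cannot coerce to logical, and for(i in length(columns)) only visits the last index. A minimal sketch of the loop with both points changed, which also gives the var_y and var_z names the question asked for:

```r
df <- data.frame(a = c(1, 2, 3, 4, 5), b = c(0.3, 0.2, 0.5, 0.3, 0.7))
conditions <- data.frame(y = df$b >= 0.5, z = df$b >= 0.7)
columns <- c("y", "z")

for (i in seq_along(columns)) {            # iterate over 1, 2 rather than just 2
  # [[ ]] extracts the logical vector itself instead of a one-column data frame
  df[[paste0("var_", columns[i])]] <- as.integer(conditions[[columns[i]]])
}
df
```

Here as.integer() plays the same role as ifelse(cond, 1, 0) for logical input without NAs.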

Count the number of instances where a variable or a combination of variables are TRUE

I'm an enthusiastic R newbie that needs some help! :)
I have a data frame that looks like this:
id<-c(100,200,300,400)
a<-c(1,1,0,1)
b<-c(1,0,1,0)
c<-c(0,0,1,1)
y=data.frame(id=id,a=a,b=b,c=c)
Where id is a unique identifier (e.g. a person) and a, b and c are dummy variables for whether the person has this feature or not (as always 1=TRUE).
I want R to create a matrix or data frame where I have the variables a, b and c both as the names of the columns and of the rows. For the values of the matrix R will have to calculate the number of identifiers that have this feature, or the combination of features.
So for example, IDs 100, 200 and 400 have feature a then in the diagonal of the matrix where a and a cross, R will input 3. Only ID 100 has both features a and b, hence R will input 1 where a and b cross, and so forth.
The resulting data frame will have to look like this:
l<-c("","a","b","c")
m<-c("a",3,1,1)
n<-c("b",1,2,1)
o<-c("c",1,1,2)
result<-matrix(c(l,m,n,o),nrow=4,ncol=4)
As my data set has 10 variables and hundreds of observations, I will have to automate the whole process.
Your help will be greatly appreciated.
Thanks a lot!
With base R:
crossprod(as.matrix(y[,-1]))
# a b c
# a 3 1 1
# b 1 2 1
# c 1 1 2
This is called an adjacency matrix. You can do this pretty easily with the qdap package:
library(qdap)
adjmat(y[,-1])$adjacency
## a b c
## a 3 1 1
## b 1 2 1
## c 1 1 2
It throws a warning because you're feeding it a dataframe. Not a big deal and can be ignored. Also notice I dropped the first column (the IDs) with negative indexing y[, -1].
Note that because you started out with a Boolean matrix you could have gotten there with:
Y <- as.matrix(y[,-1])
t(Y) %*% Y
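To see why the matrix product gives the counts: entry (i, j) of t(Y) %*% Y is the sum over rows of Y[, i] * Y[, j], and that product is 1 only when both features are 1 for the same id. A small sketch that checks this against an explicit pairwise count on the question's data (the names vars and manual are just illustrative):

```r
y <- data.frame(id = c(100, 200, 300, 400),
                a = c(1, 1, 0, 1), b = c(1, 0, 1, 0), c = c(0, 0, 1, 1))
Y <- as.matrix(y[, -1])
vars <- colnames(Y)

# For every pair of columns, count the ids where both entries are 1
manual <- sapply(vars, function(j)
  sapply(vars, function(i) sum(Y[, i] == 1 & Y[, j] == 1)))

manual
all(manual == crossprod(Y))   # compare with the matrix-product result
```

The explicit version is much slower for many columns, so it is only meant as a check on what each cell means.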

Calculate daily change in value in R

I have a matrix with the following values: a <- c(4,6,7,78,3,2,5,6,7,8)
I would like to create a second matrix b which lists the changes in a's value at each step.
So the solution would be: b <- c(2,1,71,-75,-1,3,1,1,1)
Is there a function for this in R and if there isn't what is the easiest way to proceed?
a <- c(4,6,7,78,3,2,5,6,7,8) #this is not a matrix in R btw
diff(a)
#[1] 2 1 71 -75 -1 3 1 1 1
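For reference, a short sketch of what diff() does with its default lag of 1, plus its behaviour on an actual matrix (the names b and m are just illustrative):

```r
a <- c(4, 6, 7, 78, 3, 2, 5, 6, 7, 8)

# Subtract each element from the one that follows it
b <- a[-1] - a[-length(a)]
identical(b, diff(a))

# On a real matrix, diff() takes the differences down each column separately
m <- matrix(a, ncol = 2)
diff(m)
```

So if a really is stored as a matrix with one series per column, diff() can be applied to it directly.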
