I am facing a challenge that I cannot solve. I have a list of observations x_i (large, around 30k) and a list of observations y_j (also large). The x_i and y_j are IDs of the same kind of unit (say, firms).
I have a two-column data frame that links x_i and y_j: if they appear on the same row, they are connected. I would like to convert this network into a large square matrix M with one row and one column per unique ID (dimension length(union(x, y))), with entry 1 if the two firms are connected and 0 otherwise.
Here is an example in small dimensions:
x1 x2
x3 x6
x4 x5
x1 x5
What I would like is a matrix:
0 1 0 0 1 0
0 0 0 0 0 0
0 0 0 0 0 1
0 0 0 0 1 0
0 0 0 0 0 0
0 0 0 0 0 0
Right now, the only solution I could think of is a double loop combined with a search in the initial dataframe:
# all firm ids that appear in either column, sorted and de-duplicated
list_firm <- sort(unique(c(as.character(df$col1), as.character(df$col2))))
n <- length(list_firm)
M <- matrix(0, nrow = n, ncol = n, dimnames = list(list_firm, list_firm))
for (i in list_firm) {
  for (j in list_firm) {
    # which() returns integer(0), never NULL, so test with any() instead
    M[i, j] <- as.numeric(any(df$col1 == i & df$col2 == j))
  }
}
Here df is the two-column data frame. This is obviously far too slow to run at this size.
Any suggestions would be very welcome.
We convert both columns to factors whose levels are the unique elements of both columns, then get the frequencies with table:
lvls <- sort(unique(unlist(df)))
df[] <- lapply(df, factor, levels = lvls)
table(df)
# col2
#col1 x1 x2 x3 x4 x5 x6
# x1 0 1 0 0 1 0
# x2 0 0 0 0 0 0
# x3 0 0 0 0 0 1
# x4 0 0 0 0 1 0
# x5 0 0 0 0 0 0
# x6 0 0 0 0 0 0
data
df <- structure(list(col1 = c("x1", "x3", "x4", "x1"), col2 = c("x2",
"x6", "x5", "x5")), class = "data.frame", row.names = c(NA, -4L
))
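For around 30k ids, a dense table would hold 30,000 x 30,000 cells, which takes several gigabytes. The following is a minimal sketch of a sparse alternative, assuming the Matrix package (shipped with R as a recommended package) is acceptable; the names edges, lvls and M are illustrative only.

```r
library(Matrix)  # recommended package shipped with R

df <- data.frame(col1 = c("x1", "x3", "x4", "x1"),
                 col2 = c("x2", "x6", "x5", "x5"),
                 stringsAsFactors = FALSE)

# sparseMatrix() adds up the x values of repeated (i, j) pairs,
# so drop duplicate links first to keep every entry at 0 or 1
edges <- unique(df)

# Shared, sorted set of ids used for both rows and columns
lvls <- sort(unique(c(edges$col1, edges$col2)))

# match() turns each id into its row/column position
M <- sparseMatrix(i = match(edges$col1, lvls),
                  j = match(edges$col2, lvls),
                  x = 1,
                  dims = c(length(lvls), length(lvls)),
                  dimnames = list(lvls, lvls))
M
```

Only the non-zero entries are stored, so memory grows with the number of links rather than with the square of the number of firms.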
The answer provided by @akrun in the comments is a good one. However, this is a good scenario for a data structure other than a data frame. Basically, what you're looking for is an adjacency matrix, a standard data structure in social network analysis. To build it, we can use the igraph package in R.
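library(igraph)
df <- data.frame(source = c('x1', 'x3', 'x4', 'x1'), target = c('x2', 'x6', 'x5', 'x5'))
# directed = FALSE treats every link as two-way, so the matrix is symmetric
g <- graph_from_data_frame(df, directed = FALSE)
# as_adjacency_matrix() replaces the deprecated get.adjacency()
output <- as.matrix(as_adjacency_matrix(g))
output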
x1 x3 x4 x2 x6 x5
x1 0 0 0 1 0 1
x3 0 0 0 0 1 0
x4 0 0 0 0 0 1
x2 1 0 0 0 0 0
x6 0 1 0 0 0 0
x5 1 0 1 0 0 0
The rows and columns aren't in the same order as in your example, but they can be reordered by indexing with the sorted names if needed, e.g. output[sort(rownames(output)), sort(colnames(output))].
I have a data frame in R with 4 variables and would like to create a new variable that flags rows where any 2 of those variables meet a condition.
I have tried if/else statements, but that requires writing out every combination of conditions being true. I also need this to scale, so that I can create a new variable based on any 3 conditions being true. Is there a more efficient method than if/else statements?
My example:
I have a dataframe X with following column variables
x1 = c(1,0,1,0)
X2 = c(0,0,0,0)
X3 = c(1,1,0,0)
X4 = c(0,0,1,0)
I would like to create a new variable X5 that is 1 if any 2 of the variables are true (i.e. == 1).
The new variable based on the above dataframe would produce X5 (1,0,1,0)
This can easily be done by using the apply function:
x1 = c(1,0,1,0)
x2 = c(0,0,0,0)
x3 = c(1,1,0,0)
x4 = c(0,0,1,0)
df <- data.frame(x1,x2,x3,x4)
df$x5 <- apply(df,1,function(row) ifelse(sum(row != 0) == 2, 1, 0))
x1 x2 x3 x4 x5
1 1 0 1 0 1
2 0 0 1 0 0
3 1 0 0 1 1
4 0 0 0 0 0
The second argument to apply, 1, means: apply the function to every row. To scale this up to 3...N true values, just change the number in the ifelse call.
You can try this:
#Data
df <- data.frame(x1,X2,X3,X4)
#Code
df$X5 <- ifelse(rowSums(df, na.rm = TRUE) == 2, 1, 0)
x1 X2 X3 X4 X5
1 1 0 1 0 1
2 0 0 1 0 0
3 1 0 0 1 1
4 0 0 0 0 0
You can use:
df$X5 <- 1*(apply(df == 1, 1, sum) == 2)
or
# pass each column as a separate argument so sum() runs across each row
df$X5 <- 1*(do.call(mapply, c(list(FUN = sum), df)) == 2)
Output
> df
X1 X2 X3 X4 X5
1 0 1 0 1
0 0 1 0 0
1 0 0 1 1
0 0 0 0 0
Data
df <- data.frame(X1,X2,X3,X4)
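The answers above all count the 1s in each row and compare the count with a threshold. A minimal sketch of that idea as a reusable function follows; flag_k_true, its at_least argument and the cols vector are hypothetical names introduced here, not part of any package.

```r
X1 <- c(1, 0, 1, 0)
X2 <- c(0, 0, 0, 0)
X3 <- c(1, 1, 0, 0)
X4 <- c(0, 0, 1, 0)
df <- data.frame(X1, X2, X3, X4)

# Hypothetical helper: flag rows where the number of columns equal to 1
# is exactly k (at_least = FALSE) or k or more (at_least = TRUE)
flag_k_true <- function(data, k, at_least = FALSE) {
  n_true <- rowSums(data == 1, na.rm = TRUE)
  as.integer(if (at_least) n_true >= k else n_true == k)
}

# Name the input columns explicitly so a previously added X5 is never counted
cols <- c("X1", "X2", "X3", "X4")
df$X5 <- flag_k_true(df[cols], k = 2)
df
```

Changing k to 3 covers the "any 3 conditions" case, and at_least = TRUE switches from "exactly k" to "k or more" if that is the intended reading.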
I have a data set with lots of binary variables, all with values 0/1. I want to create a new column that counts, for each observation, how many of the binary variables equal 1: one if exactly one variable is 1, two if two variables are 1, and so on.
Such as:
x1 x2 x3 x4 x5
1 1 1 0 1
0 0 1 0 0
0 0 0 0 0
I want to have
x1 x2 x3 x4 x5 count
1 1 1 0 1 4
0 0 1 0 0 1
0 0 0 0 0 0
If your dataset contains only the binary variables you are interested in, you can use
df$count <- rowSums(df)
Otherwise, please provide a more detailed description of your data.
Another option is Reduce with +
df$count <- Reduce(`+`, df)
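If the data frame also holds non-binary columns, the same count can be restricted to the binary ones. This is a sketch under assumptions: the id column and the binary_cols vector are hypothetical and stand in for whatever the real data contains.

```r
# Toy data: a hypothetical id column next to the five binary variables
df <- data.frame(id = c("a", "b", "c"),
                 x1 = c(1, 0, 0),
                 x2 = c(1, 0, 0),
                 x3 = c(1, 1, 0),
                 x4 = c(0, 0, 0),
                 x5 = c(1, 0, 0),
                 stringsAsFactors = FALSE)

# Count only over the named binary columns; NA entries are ignored
binary_cols <- paste0("x", 1:5)
df$count <- rowSums(df[binary_cols] == 1, na.rm = TRUE)
df
```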
Updated with dummy data and dummy code. Apologies, I assumed my question was simple and that you could advise on the best way without a reproducible example.
dummy<-data.frame(prodA=c(0,0,0,1,1,0,0,1),
prodB=c(0,0,1,1,0,1,1,0),
prodC=c(1,1,1,0,0,0,0,1))
dummy[,4:6]<-dummy[,1:3]
for (j in (1:nrow(dummy))){
  for (i in 4:6){
    dummy[j,i]<-ifelse(dummy[j,i]==1,colnames(dummy[i]),"")
  }
}
dummy2<-dummy[,4:6]
dummy$NewProds<-apply(dummy2,1,paste,collapse="")
dummy$NewProds<-gsub(".1","//",dummy$NewProds)
My second attempt is as:
prods<-dummy[,1:3]
prods[,4:6]<-dummy[,1:3]
for (i in 4:6){
prods[,i]<-colnames(prods[i-3])
}
prods[,7:9]<-prods[,4:6]
#works, but I will need multiple ifs for this to work, suggesting this
#won't be very efficient
prods[,10]<-ifelse(prods[,1]==1,prods[,4],"")
Original Post Follows:
I am playing with the Santander Product recommendation dataset from Kaggle. I have identified which products have been purchased from one month to another. This means I have 23 columns of 1's ( when a new product is added) and 0's (when not).
I created the following code to return the column name when a product has been purchased. It works great on a sample of 6 lines, but it runs forever when I try this on the 48k customers who changed, let alone the million in the dataset.
Is there another way to do this?
df2[,99:122]<-df2[,72:95]
for (j in (1:nrow(df2))){
for (i in 99:122){
df2[j,i]<-ifelse(df2[j,i]==1,colnames(df2[i]),"")}
}
df22<-df2[,99:122]
df2$NewProds<-apply(df22,1,paste,collapse="")
df2$NewProds<-gsub("change.1","//",df2$NewProds)
I figured the problem was that I am visiting every cell, so I tried another approach: take a couple of copies of the data and then, wherever a variable is 1, take its column name. However, I couldn't get this to work, and I think I run into the same issue.
#copy a bunch of 1's and 0's
prods<-df2[,72:95]
#repeat and overwrite with colnames
prods[,25:48]<-df2[,72:95]
for (i in 25:48){
prods[,i]<-colnames(prods[i-24])
}
prods[,49:72]<-prods[,25:48]
#attempt to only populate colnames if it was originally a 1 - doesn't work
prod[,49]<-ifelse(prod[,1]==1,prod[,25],"")
I haven't provided any data, but I hope you can see what I am trying to do and can advise on efficient ways of doing it.
Thanks in advance,
J
Using apply as @AndersEllernBilgrau illustrated is one obvious way to do it, but it will be slow for data sets with many rows.
dummy[["NewProds"]] <- do.call(
paste,
c(mapply(ifelse,
dummy,
names(dummy),
MoreArgs = list(no = ""),
SIMPLIFY = FALSE),
sep = "//"))
is a bit harder to follow, but it will be much faster:
library(microbenchmark)
n <- 10000
dummy <- data.frame(prodA = rep(c(0,0,0,1,1,0,0,1), n),
prodB = rep(c(0,0,1,1,0,1,1,0), n),
prodC = rep(c(1,1,1,0,0,0,0,1), n))
microbenchmark(
do.call = do.call(
paste,
c(mapply(ifelse,
dummy,
names(dummy),
MoreArgs = list(no = ""),
SIMPLIFY = FALSE),
sep = "//")),
apply = apply(
dummy == 1,
1,
function(x) paste0(names(which(x)), collapse = "//")
))
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## do.call 63.92695 65.44777 72.07261 67.8667 73.3850 184.5151 100 a
## apply 296.81323 364.31947 404.71894 397.0927 443.7223 683.3892 100 b
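Note that the do.call/paste version keeps an empty string for every 0 entry, so a row with only prodC comes out as "////prodC". If only the matching names are wanted, one possible sketch (not benchmarked here) is to locate the 1 entries with which(arr.ind = TRUE) and paste within each row; the names m, hits, pasted and new_prods are illustrative.

```r
dummy <- data.frame(prodA = c(0, 0, 0, 1, 1, 0, 0, 1),
                    prodB = c(0, 0, 1, 1, 0, 1, 1, 0),
                    prodC = c(1, 1, 1, 0, 0, 0, 0, 1))

m <- as.matrix(dummy) == 1

# One line per 1 entry, giving its row index and column index
hits <- which(m, arr.ind = TRUE)
# Order by row, then column, so names are pasted in column order
hits <- hits[order(hits[, "row"], hits[, "col"]), , drop = FALSE]

# Paste the column names of each row's 1 entries together
pasted <- tapply(colnames(m)[hits[, "col"]], hits[, "row"],
                 paste, collapse = "//")

# Rows without any 1 entry keep the empty string
new_prods <- character(nrow(dummy))
new_prods[as.integer(names(pasted))] <- pasted
new_prods
```

The work here is proportional to the number of 1 entries plus one paste call per non-empty row, which may or may not beat the do.call version on a given data set.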
Without data, I have a hard time understanding precisely what you want to do.
A couple of things are (almost) certain, however:
You probably do not need for loops.
You should use R's vectorized functions; the dataset is not that big.
Using some toy data, does the following do what you want?
d <- 23
n <- 46e3
# Simulate some toy data
df <- data.frame(matrix(rbinom(d*n, 1, 0.1), n, d),
row.names = paste0("row", 1:n))
head(df)
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20 X21 X22 X23
row1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
row2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
row3 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
row4 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0
row5 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
row6 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0
# Paste together the colnames of all non-zero rows
res <- apply(df == 1, 1, function(x) paste0(names(which(x)), collapse = "-"))
head(res)
# row1 row2 row3 row4 row5 row6
#"X8-X16" "X1" "X8-X20" "X4-X11-X20" "X7-X15" "X4-X18-X21"
I.e. res is a character vector of length n holding, for each row, the names of the columns with a 1 entry pasted together (with separator -). This is at least what your code appears to be doing conceptually.
I'm attempting to use formula to generate a model.matrix object to be used in a custom optimizer function.
It works great for the most part, but when it comes to factor-factor interactions, I'd like to specify the interaction as dummy coded rather than effects coded.
Take for example the following data set:
set.seed(1987)
myDF <- data.frame(Y = rnorm(100),
X1 = factor(LETTERS[sample(1:3, 100, replace = TRUE)]),
X2 = factor(LETTERS[sample(1:3, 100, replace = TRUE)]))
head(myDF)
Both the : and / operators create an effects coded model matrix (the latter being an additive effects structure, I think).
head(model.matrix(formula(Y ~ X1 : X2), data = myDF))
head(model.matrix(formula(Y ~ X1 / X2), data = myDF))
But I am looking to generate a dummy coded model matrix, which would have the first level of X1 omitted for each level of X2. Resulting in these terms (columns):
X1B:X2A
X1C:X2A
X1B:X2B
X1C:X2B
X1B:X2C
X1C:X2C
Is there a way to achieve this?
Is ~X1:X2-1 what you're looking for?
Make test data (as above):
set.seed(1987)
myDF <- data.frame(Y = rnorm(100),
X1 = factor(LETTERS[sample(1:3, 100, replace = TRUE)]),
X2 = factor(LETTERS[sample(1:3, 100, replace = TRUE)]))
Generate model matrix:
mm1 <- model.matrix(formula(Y ~ X1 : X2 - 1), data = myDF)
head(mm1)
## X1A:X2A X1B:X2A X1C:X2A X1A:X2B X1B:X2B X1C:X2B X1A:X2C X1B:X2C X1C:X2C
## 1 0 0 0 0 1 0 0 0 0
## 2 1 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 1 0
## 4 0 0 0 0 0 1 0 0 0
## 5 0 0 0 1 0 0 0 0 0
## 6 0 0 0 0 0 0 1 0 0
Or perhaps you really do just want some columns excluded:
mm0 <- model.matrix(formula(Y ~ X1 : X2), data = myDF)
mm0B <- mm0[,!grepl("(Intercept|^X1A:)",colnames(mm0))]
head(mm0B)
## X1B:X2A X1C:X2A X1B:X2B X1C:X2B X1B:X2C X1C:X2C
## 1 0 0 1 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 1 0
## 4 0 0 0 1 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
I thought you also might be interested in sum-to-zero contrasts:
mm2 <- model.matrix(formula(Y ~ X1 : X2 - 1), data = myDF,
contrasts.arg=list(X1=contr.sum,X2=contr.sum))
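To check the column-dropping approach without relying on random draws, here is a small sketch on a deterministic data set in which every X1/X2 combination appears once; toy, keep and mm_dummy are names made up for this illustration.

```r
# Deterministic toy data: each X1/X2 combination appears exactly once
toy <- expand.grid(X1 = factor(c("A", "B", "C")),
                   X2 = factor(c("A", "B", "C")))
toy$Y <- seq_len(nrow(toy))

# Full set of interaction indicators plus the intercept
mm <- model.matrix(Y ~ X1:X2, data = toy)

# Drop the intercept and every column for the first level of X1
keep <- !grepl("^(\\(Intercept\\)|X1A:)", colnames(mm))
mm_dummy <- mm[, keep, drop = FALSE]
colnames(mm_dummy)
```

Rows with X1 equal to A are all zero in mm_dummy, and every other row has a single 1 in the column for its X1/X2 combination.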
Here is another attempt.
set.seed(1987)
myDF <- data.frame(Y = rnorm(100),
X1 = factor(LETTERS[sample(1:3, 100, replace = TRUE)]),
X2 = factor(LETTERS[sample(1:3, 100, replace = TRUE)]))
# row subsetting to exclude A
modelMat <- model.matrix(formula(Y ~ X1 : X2), data = myDF[myDF$X1 != 'A',])
# column subsetting to eliminate all columns including X1A
modelMat <- modelMat[,substring(colnames(modelMat), 1, 3) != "X1A"]
head(modelMat)
(Intercept) X1B:X2A X1C:X2A X1B:X2B X1C:X2B X1B:X2C X1C:X2C
1 1 0 0 1 0 0 0
3 1 0 0 0 0 1 0
4 1 0 0 0 1 0 0
8 1 0 0 0 0 1 0
10 1 0 0 0 0 0 1
11 1 0 0 0 0 0 1
Having 2 vectors like the following:
vec1<-c("x", "y")
vec2<-c(rep(0, 5))
I would like to create a data.frame object where vec1 becomes the first column of the data frame DF and vec2 fills each row, with column names too. Visually, it would look like this:
vec1 1 2 3 4 5
x 0 0 0 0 0
y 0 0 0 0 0
I have tried the following code, but it adds both vectors as columns:
DF<-data.frame(vec1, vec2)
Instead of generating a vector for your rows, you can generate a whole matrix and then use data.frame to bind it to your first vector. Something like this:
mat <- matrix(0, nrow=2, ncol=5)
vec <- c("x","y")
data.frame(vec, mat)
Which gives:
vec X1 X2 X3 X4 X5
1 x 0 0 0 0 0
2 y 0 0 0 0 0
You can use rbind() inside the data.frame() function to put vec2 values in both rows of new data frame.
vec1<-c("x", "y")
vec2<-c(rep(0, 5))
data.frame(vec1,rbind(vec2,vec2))
vec1 X1 X2 X3 X4 X5
1 x 0 0 0 0 0
2 y 0 0 0 0 0
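Both answers produce column names X1 to X5, whereas the question shows 1 to 5. A minimal sketch that keeps the numeric names, assuming non-syntactic column names are acceptable (check.names = FALSE stops data.frame from prefixing them with X):

```r
vec1 <- c("x", "y")
vec2 <- rep(0, 5)

# One copy of vec2 per element of vec1, stacked as rows
mat <- matrix(vec2, nrow = length(vec1), ncol = length(vec2), byrow = TRUE)
colnames(mat) <- seq_along(vec2)

# check.names = FALSE keeps "1", "2", ... instead of "X1", "X2", ...
DF <- data.frame(vec1, mat, check.names = FALSE)
DF
```

Columns named this way have to be accessed with backticks or strings, for example DF[["1"]].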