data.table versus tidyr::expand_grid - r

I have
XIa <- diag(1, 3)
colnames(XIa) <- rownames(XIa) <- c("a0", "a1", "a2")
XIb <- diag(1, 2)
colnames(XIb) <- rownames(XIb) <- c("b0", "b1")
XIc <- diag(1, 2)
colnames(XIc) <- rownames(XIc) <- c("c0", "c1")
tidyr::expand_grid gives me:
tidyr::expand_grid(as.data.frame(XIa), as.data.frame(XIb), as.data.frame(XIc))
# A tibble: 12 x 7
a0 a1 a2 b0 b1 c0 c1
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 0 1 0 1 0
2 1 0 0 1 0 0 1
3 1 0 0 0 1 1 0
4 1 0 0 0 1 0 1
5 0 1 0 1 0 1 0
6 0 1 0 1 0 0 1
7 0 1 0 0 1 1 0
8 0 1 0 0 1 0 1
9 0 0 1 1 0 1 0
10 0 0 1 1 0 0 1
11 0 0 1 0 1 1 0
12 0 0 1 0 1 0 1
How do I achieve the same result using data.table?
Clearly, there is this way:
dXIa <- data.table(XIa)
dXIb <- data.table(XIb)
dXIc <- data.table(XIc)
cbind(
dXIa[c(rep(1:3, each = 4))],
dXIb[c(rep(1:2, each = 2))],
dXIc[c(rep(1:2, len = 12))]
)
a0 a1 a2 b0 b1 c0 c1
1: 1 0 0 1 0 1 0
2: 1 0 0 1 0 0 1
3: 1 0 0 0 1 1 0
4: 1 0 0 0 1 0 1
5: 0 1 0 1 0 1 0
6: 0 1 0 1 0 0 1
7: 0 1 0 0 1 1 0
8: 0 1 0 0 1 0 1
9: 0 0 1 1 0 1 0
10: 0 0 1 1 0 0 1
11: 0 0 1 0 1 1 0
12: 0 0 1 0 1 0 1
but that is probably not optimal/ideal.

You can use CJ, but it does not accept data.tables as arguments directly. Using the function cjdt from this answer, you can do:
library(data.table)
dXIa <- data.table(XIa)
dXIb <- data.table(XIb)
dXIc <- data.table(XIc)
cjdt <- function(a,b){
cj = CJ(1:nrow(a),1:nrow(b))
cbind(a[cj[[1]],],b[cj[[2]],])
}
Reduce(cjdt, list(dXIa, dXIb, dXIc))
# a0 a1 a2 b0 b1 c0 c1
# 1: 1 0 0 1 0 1 0
# 2: 1 0 0 1 0 0 1
# 3: 1 0 0 0 1 1 0
# 4: 1 0 0 0 1 0 1
# 5: 0 1 0 1 0 1 0
# 6: 0 1 0 1 0 0 1
# 7: 0 1 0 0 1 1 0
# 8: 0 1 0 0 1 0 1
# 9: 0 0 1 1 0 1 0
#10: 0 0 1 1 0 0 1
#11: 0 0 1 0 1 1 0
#12: 0 0 1 0 1 0 1

As an alternative to RonakShah's use of cjdt, here's a modified version that has three more features:
Guards against 0-row frames, which should really be a no-op for the 0-row frame;
Uses a single call to cbind instead of Reduce; while Reduce isn't evil here, there may be benefits with a much longer list of frames/tables; and
While not a stated constraint here, it works with data.frame and data.table alike.
cjdt2 <- function(...) {
dots <- Filter(nrow, list(...))
eg <- do.call(expand.grid, lapply(sapply(dots, nrow), seq_len))
do.call(cbind, Map(function(x, i) x[i,], dots, eg))
}
cjdt2(XIa, XIb, XIc)
# a0 a1 a2 b0 b1 c0 c1
# a0 1 0 0 1 0 1 0
# a1 0 1 0 1 0 1 0
# a2 0 0 1 1 0 1 0
# a0 1 0 0 0 1 1 0
# a1 0 1 0 0 1 1 0
# a2 0 0 1 0 1 1 0
# a0 1 0 0 1 0 0 1
# a1 0 1 0 1 0 0 1
# a2 0 0 1 1 0 0 1
# a0 1 0 0 0 1 0 1
# a1 0 1 0 0 1 0 1
# a2 0 0 1 0 1 0 1
Which you can easily wrap with setDT (either externally or mod the function).

Here's another approach that uses data.table merge
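expgridDT <- function(...) {
  DTs <- list(...)
  # The running result sits in DTs[[jj]] and absorbs DTs[[jj + 1]] on each pass
  for (jj in seq_len(length(DTs) - 1)) {
    DTs[[jj + 1]] <- merge(DTs[[jj]][, c(kfjekflj = 1, .SD)],
                           DTs[[jj + 1]][, c(kfjekflj = 1, .SD)],
                           by = "kfjekflj", allow.cartesian = TRUE)[, !"kfjekflj", with = FALSE]
  }
  return(DTs[[length(DTs)]][])
}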
Essentially, this creates a dummy column on each data.table with a nonsense name (kfjekflj) to make a collision with a real column name unlikely. It uses that dummy column as the join column, then merges two tables at a time with allow.cartesian turned on, repeating until every data.table passed to the function has been merged.
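To make the dummy-key idea concrete, here is a minimal sketch on two tiny tables. The tables x and y and the column name dummy_key are made up for illustration; only merge() and its allow.cartesian argument come from data.table.

```r
library(data.table)

# Hypothetical toy tables, not taken from the question
x <- data.table(a = 1:2)
y <- data.table(b = c("u", "v", "w"))

# Work on copies so the inputs are not modified by reference
xk <- copy(x)[, dummy_key := 1L]
yk <- copy(y)[, dummy_key := 1L]

# Every row of xk matches every row of yk on the constant key,
# so the merge yields the full cross join; then drop the helper column
crossed <- merge(xk, yk, by = "dummy_key", allow.cartesian = TRUE)
crossed[, dummy_key := NULL]
print(crossed)
```

The result should have nrow(x) * nrow(y) rows and only the original columns a and b.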
Here's a benchmark:
XIa <- diag(1, 50)
colnames(XIa) <- rownames(XIa) <- paste0("a",1:ncol(XIa))
XIb <- diag(1, 72)
colnames(XIb) <- rownames(XIb) <- paste0("b",1:ncol(XIb))
XIc <- diag(1, 80)
colnames(XIc) <- rownames(XIc) <- paste0("c",1:ncol(XIc))
XIa <- as.data.table(XIa)
XIb <- as.data.table(XIb)
XIc <- as.data.table(XIc)
microbenchmark(expgridDT(XIa, XIb, XIc), Reduce(cjdt, list(XIa, XIb, XIc)), cjdt2(XIa, XIb, XIc))
Unit: milliseconds
expr min lq mean median uq max neval
expgridDT(XIa, XIb, XIc) 167.5827 191.6542 264.8172 203.8769 231.6937 852.2033 100
Reduce(cjdt, list(XIa, XIb, XIc)) 164.4640 217.2215 252.2262 230.7276 255.6974 689.1763 100
cjdt2(XIa, XIb, XIc) 65611.1425 67829.0407 77024.1458 77151.0220 84385.0727 95048.6625 100


R loop to generate multiple variables based on a condition

The data is given as below.
x1<- c(1,0,0,0,1,1,1,0)
x2<- c(1,0,0,0,1,1,1,0)
x3<- c(1,0,0,0,1,1,1,0)
x4<- c(1,0,0,0,1,1,1,0)
x5<- c(1,0,0,0,1,1,1,0)
x6<- c(1,0,0,0,1,1,1,0)
my_data <- as.data.frame(cbind(x1, x2, x3, x4, x5, x6))
I want to use a loop to automate the following process:
my_data$a1 = ifelse(my_data$x1> 0 & is.na(my_data$x1) != T, 1, 0)
my_data$a2 = ifelse(my_data$x2> 0 & is.na(my_data$x2) != T, 1, 0)
my_data$a3 = ifelse(my_data$x3> 0 & is.na(my_data$x3) != T, 1, 0)
my_data$a4 = ifelse(my_data$x4> 0 & is.na(my_data$x4) != T, 1, 0)
my_data$a5 = ifelse(my_data$x5> 0 & is.na(my_data$x5) != T, 1, 0)
my_data$a6 = ifelse(my_data$x6> 0 & is.na(my_data$x6) != T, 1, 0)
Any help would be appreciated, thanks!
You can use the following code -
my_data[paste0('a', seq_along(my_data))] <- +(my_data > 0 & !is.na(my_data))
my_data
# x1 x2 x3 x4 x5 x6 a1 a2 a3 a4 a5 a6
#1 1 1 1 1 1 1 1 1 1 1 1 1
#2 0 0 0 0 0 0 0 0 0 0 0 0
#3 0 0 0 0 0 0 0 0 0 0 0 0
#4 0 0 0 0 0 0 0 0 0 0 0 0
#5 1 1 1 1 1 1 1 1 1 1 1 1
#6 1 1 1 1 1 1 1 1 1 1 1 1
#7 1 1 1 1 1 1 1 1 1 1 1 1
#8 0 0 0 0 0 0 0 0 0 0 0 0
This assigns 1 where the value is greater than 0 and is not NA. my_data > 0 & !is.na(my_data) returns a logical matrix (TRUE/FALSE); adding + in front of it converts the values to integers (1/0).
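As a small illustration of that coercion, the sketch below applies the same condition to a made-up vector x that contains an NA. The names flag and as_int are introduced here only for the example.

```r
x <- c(2, 0, NA, -1)

# NA > 0 is NA, but NA & FALSE is FALSE, so the NA entry ends up FALSE
flag <- x > 0 & !is.na(x)
print(flag)

# Unary plus coerces the logical vector to integer
as_int <- +flag
print(as_int)
```

This is why the combined condition never leaves an NA in the new columns.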
You can use the following for loop:
for (i in 1:ncol(my_data)) {
my_data[,paste0("a",i)] <- ifelse(my_data[,i] > 0 & !is.na(my_data[,i]),1,0)
}
Output
x1 x2 x3 x4 x5 x6 a1 a2 a3 a4 a5 a6
1 1 1 1 1 1 1 1 1 1 1 1 1
2 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 0 0
5 1 1 1 1 1 1 1 1 1 1 1 1
6 1 1 1 1 1 1 1 1 1 1 1 1
7 1 1 1 1 1 1 1 1 1 1 1 1
8 0 0 0 0 0 0 0 0 0 0 0 0
We can use tidyverse
library(dplyr)
library(stringr)
df <- my_data %>%
mutate(across(everything(), ~ +(. > 0 & !is.na(.)),
.names = "a{.col}")) %>%
rename_with(~ str_remove(., 'x'), starts_with('a'))
df
x1 x2 x3 x4 x5 x6 a1 a2 a3 a4 a5 a6
1 1 1 1 1 1 1 1 1 1 1 1 1
2 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 0 0
5 1 1 1 1 1 1 1 1 1 1 1 1
6 1 1 1 1 1 1 1 1 1 1 1 1
7 1 1 1 1 1 1 1 1 1 1 1 1
8 0 0 0 0 0 0 0 0 0 0 0 0
data.table
x1<- c(1,0,0,0,1,1,1,0)
x2<- c(1,0,0,0,1,1,1,0)
x3<- c(1,0,0,0,1,1,1,0)
x4<- c(1,0,0,0,1,1,1,0)
x5<- c(1,0,0,0,1,1,1,0)
x6<- c(1,0,0,0,1,1,1,0)
my_data <- as.data.frame(cbind(x1, x2, x3, x4, x5, x6))
library(data.table)
setDT(my_data)[, (paste0("a", seq_len(length(names(my_data))))) := lapply(.SD, function(x) ifelse(x > 0 & !is.na(x), 1, 0))][]
#> x1 x2 x3 x4 x5 x6 a1 a2 a3 a4 a5 a6
#> 1: 1 1 1 1 1 1 1 1 1 1 1 1
#> 2: 0 0 0 0 0 0 0 0 0 0 0 0
#> 3: 0 0 0 0 0 0 0 0 0 0 0 0
#> 4: 0 0 0 0 0 0 0 0 0 0 0 0
#> 5: 1 1 1 1 1 1 1 1 1 1 1 1
#> 6: 1 1 1 1 1 1 1 1 1 1 1 1
#> 7: 1 1 1 1 1 1 1 1 1 1 1 1
#> 8: 0 0 0 0 0 0 0 0 0 0 0 0
Created on 2021-06-06 by the reprex package (v2.0.0)

R to recode variables if the categorical variable's frequency is lower than a defined value

Here is an example for the dataset (d):
rs3 rs4 rs5 rs6
1 0 0 0
1 0 1 0
0 0 0 0
2 0 1 0
0 0 0 0
0 2 0 1
0 2 NA 1
0 2 2 1
NA 1 2 1
To check the frequency of the SNP genotype (0,1,2), we can use the table command
table (d$rs3)
The output would be
0 1 2
5 2 1
Here we want to recode the variables if the genotype 2's frequency is <3, the recoded output should be
rs3 rs4 rs5 rs6
1 0 0 0
1 0 1 0
0 0 0 0
1 0 1 0
0 0 0 0
0 2 0 1
0 2 NA 1
0 2 1 1
NA 1 1 1
I have 70,000 SNPs that need to be checked and recoded. How can I use a for loop or another method to do that in R?
Here's another possible (vectorized) solution
indx <- colSums(d == 2, na.rm = TRUE) < 3 # Select columns by condition
d[indx][d[indx] == 2] <- 1 # Insert 1 where the subset selected by the condition equals 2
d
# rs3 rs4 rs5 rs6
# 1 1 0 0 0
# 2 1 0 1 0
# 3 0 0 0 0
# 4 1 0 1 0
# 5 0 0 0 0
# 6 0 2 0 1
# 7 0 2 NA 1
# 8 0 2 1 1
# 9 NA 1 1 1
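For anyone who wants to run this end to end, here is a self-contained sketch. The data frame d is re-typed from the table in the question, and the name rare is introduced here for the column selector.

```r
# Example data re-typed from the question
d <- data.frame(
  rs3 = c(1, 1, 0, 2, 0, 0, 0, 0, NA),
  rs4 = c(0, 0, 0, 0, 0, 2, 2, 2, 1),
  rs5 = c(0, 1, 0, 1, 0, 0, NA, 2, 2),
  rs6 = c(0, 0, 0, 0, 0, 1, 1, 1, 1)
)

# Columns in which genotype 2 occurs fewer than 3 times (NAs ignored)
rare <- colSums(d == 2, na.rm = TRUE) < 3

# Within those columns only, recode genotype 2 as 1
d[rare][d[rare] == 2] <- 1
print(d)
```

Column rs4 keeps its 2s because genotype 2 occurs three times there.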
We can try
d[] <- lapply(d, function(x)
if(sum(x==2, na.rm=TRUE) < 3) replace(x, x==2, 1) else x)
d
# rs3 rs4 rs5 rs6
#1 1 0 0 0
#2 1 0 1 0
#3 0 0 0 0
#4 1 0 1 0
#5 0 0 0 0
#6 0 2 0 1
#7 0 2 NA 1
#8 0 2 1 1
#9 NA 1 1 1
Or the same methodology can be used in dplyr (across() requires dplyr 1.0.0 or later)
library(dplyr)
d %>%
mutate(across(everything(), ~ if (sum(.x == 2, na.rm = TRUE) < 3)
replace(.x, .x == 2, 1) else .x))

Transform data frame

I have a questionnaire with an open-ended question like "Please name up to ten animals", which gives me the following data frame (where each letter stands for an animal):
nrow <- 1000
list <- vector("list", nrow)
for(i in 1:nrow){
na <- rep(NA, sample(1:10, 1))
list[[i]] <- sample(c(letters, na), 10, replace=FALSE)
}
df <- data.frame()
df <- rbind(df, do.call(rbind, list))
head(df)
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
# 1 r <NA> a j w e i h u z
# 2 t o e x d v <NA> z n c
# 3 f y e s n c z i u k
# 4 y <NA> v j h z p i c q
# 5 w s v f <NA> c g b x e
# 6 p <NA> a h v x k z o <NA>
How can I transform this data frame to look like the following data frame? Remember that I don't actually know the column names.
r <- 1000
c <- length(letters)
t1 <- matrix(rbinom(r*c,1,0.5),r,c)
colnames(t1) <- letters
head(t1)
# a b c d e f g h i j k l m n o p q r s t u v w x y z
# [1,] 0 1 0 1 0 0 0 1 0 0 1 1 1 1 0 0 0 1 0 1 0 1 1 0 1 0
# [2,] 1 1 1 1 0 1 0 1 1 1 1 0 1 0 0 0 1 1 1 0 0 1 0 1 0 1
# [3,] 0 1 0 0 0 1 1 1 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0
# [4,] 1 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 1 1 1 0 1 0 1 1 0 0
# [5,] 1 0 1 1 1 1 1 1 1 0 1 1 0 0 0 0 1 1 0 1 1 0 0 1 0 0
# [6,] 1 1 0 1 1 0 0 1 0 0 1 0 0 0 0 0 1 1 1 0 0 0 1 1 0 1
animals <- sort(unique(as.character(unlist(df)))) # sort() also drops the NA entries
td <- data.frame(t(apply(df, 1, function(x) as.numeric(animals %in% x))))
colnames(td) <- animals
letters could be replaced with a vector of animal names colnames(t1).
You can do the following using tidyr, which could be much faster than other approaches, though I like the approach by @germcd very much. You may need to tinker with the filter step, removing NAs as well as a blank space, which may be an artifact of the simulated data you provided:
library(tidyr)
library(dplyr) # filter() comes from dplyr, not tidyr
## Add an ID for each record:
df$id <- 1:nrow(df)
out <- (df %>%
gather(column, animal, -id) %>%
filter(animal != " ") %>%
spread(animal, column)
)
head(out)
This code gathers the unnamed columns into a long format, removes any empty columns or missing data, and then spreads by the unique values of the animal column. This also has the potentially desirable property of preserving the column order in which the animals were named. If it's not desirable then you could easily convert the resulting animal columns to numeric:
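The same gather, filter, spread sequence can be sketched with pivot_longer() and pivot_wider(), which supersede gather() and spread() in current tidyr. The miniature data frame df_small below is made up for illustration.

```r
library(tidyr)
library(dplyr)

# Hypothetical miniature of the survey data
df_small <- data.frame(
  V1 = c("r", "t"),
  V2 = c(NA, "o"),
  V3 = c("a", "r"),
  stringsAsFactors = FALSE
)
df_small$id <- seq_len(nrow(df_small))

out_small <- df_small %>%
  # Long format: one row per (id, column, animal)
  pivot_longer(-id, names_to = "column", values_to = "animal") %>%
  # Drop unanswered slots
  filter(!is.na(animal)) %>%
  # One column per animal, holding the position in which it was named
  pivot_wider(names_from = animal, values_from = column)
print(out_small)
```

Cells for animals a respondent did not name come out as NA, just as with spread().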
out_num <- out
out_num[,-1] <- as.numeric((!is.na(out[,-1])))
head(out_num)
You can try mtabulate from the "qdapTools" package:
library(qdapTools)
head(mtabulate(as.data.frame(t(df))))
# c d i l m o r v x y a f s t k p u b h j n q e g w z
# 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 2 0 1 0 0 1 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
# 3 0 0 1 0 0 0 1 0 1 1 1 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0
# 4 1 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0
# 5 0 1 0 0 0 0 1 0 0 0 0 0 1 0 1 1 0 1 1 0 1 1 0 0 0 0
# 6 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 0 1 1 0 1 0 1 0 0 0 0
There are, of course, many other options.
For example, cSplit_e from my "splitstackshape" package (with the downside that, inefficiently, you first need to paste the values together before you can split them):
library(splitstackshape)
library(dplyr)
As ones and zeroes:
df %>%
mutate(combined = apply(., 1, function(x) paste(na.omit(x), collapse = ","))) %>%
cSplit_e("combined", ",", mode = "binary", type = "character", fill = 0) %>%
select(starts_with("combined_")) %>%
head
# combined_a combined_b combined_c combined_d combined_e combined_f combined_g combined_h combined_i
# 1 0 0 1 1 0 0 0 0 1
# 2 1 0 0 1 0 1 0 0 0
# 3 1 0 0 0 0 0 0 0 1
# 4 0 1 1 0 0 0 0 1 1
# 5 0 1 0 1 0 0 0 1 0
# 6 0 1 0 0 0 0 0 0 0
# combined_j combined_k combined_l combined_m combined_n combined_o combined_p combined_q combined_r
# 1 0 0 1 1 0 1 0 0 1
# 2 0 0 0 1 0 0 0 0 0
# 3 0 1 0 0 0 0 1 0 1
# 4 1 0 1 0 1 0 0 0 0
# 5 0 1 0 0 1 0 1 1 1
# 6 1 1 0 1 0 0 0 1 0
# combined_s combined_t combined_u combined_v combined_w combined_x combined_y combined_z
# 1 0 0 0 1 0 1 1 0
# 2 1 1 0 0 0 0 0 0
# 3 0 1 1 0 0 1 1 0
# 4 0 0 1 0 0 0 1 0
# 5 1 0 0 0 0 0 0 0
# 6 1 1 1 0 0 0 0 0
As the original values:
df %>%
mutate(combined = apply(., 1, function(x) paste(na.omit(x), collapse = ","))) %>%
cSplit_e("combined", ",", mode = "value", type = "character", fill = "") %>%
select(starts_with("combined_")) %>%
head
# combined_a combined_b combined_c combined_d combined_e combined_f combined_g combined_h combined_i
# 1 c d i
# 2 a d f
# 3 a i
# 4 b c h i
# 5 b d h
# 6 b
# combined_j combined_k combined_l combined_m combined_n combined_o combined_p combined_q combined_r
# 1 l m o r
# 2 m
# 3 k p r
# 4 j l n
# 5 k n p q r
# 6 j k m q
# combined_s combined_t combined_u combined_v combined_w combined_x combined_y combined_z
# 1 v x y
# 2 s t
# 3 t u x y
# 4 u y
# 5 s
# 6 s t u
Alternatively, you can use "reshape2":
library(reshape2)
## The values
dcast(melt(as.matrix(df), na.rm = TRUE),
Var1 ~ value, value.var = "value")
## ones and zeroes
dcast(melt(as.matrix(df), na.rm = TRUE),
Var1 ~ value, value.var = "value", fun.aggregate = length)

How to create design matrix in r

I have two factors. Factor A has 2 levels and factor B has 3 levels.
How to create the following design matrix?
factorA1 factorA2 factorB1 factorB2 factorB3
[1,] 1 0 1 0 0
[2,] 1 0 0 1 0
[3,] 1 0 0 0 1
[4,] 0 1 1 0 0
[5,] 0 1 0 1 0
[6,] 0 1 0 0 1
You have a couple of options:
Use base and piece it together yourself:
(iris.dummy<-with(iris,model.matrix(~Species-1)))
(IRIS<-data.frame(iris,iris.dummy))
Or use the ade4 package as follows:
dummy <- function(df) {
require(ade4)
ISFACT <- sapply(df, is.factor)
FACTS <- acm.disjonctif(df[, ISFACT, drop = FALSE])
NONFACTS <- df[, !ISFACT,drop = FALSE]
data.frame(NONFACTS, FACTS)
}
dat <-data.frame(eggs = c("foo", "foo", "bar", "bar"),
ham = c("red","blue","green","red"), x=rnorm(4))
dummy(dat)
## x eggs.bar eggs.foo ham.blue ham.green ham.red
## 1 0.3365302 0 1 0 0 1
## 2 1.1341354 0 1 1 0 0
## 3 2.0489741 1 0 0 1 0
## 4 1.1019108 1 0 0 0 1
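Assuming your data is in a data.frame called dat, let's say the two factors are given as in this example (stringsAsFactors = TRUE is needed in R 4.0.0 and later, where strings are no longer converted to factors by default):
> dat <- data.frame(f1=sample(LETTERS[1:3],20,T),f2=sample(LETTERS[4:5],20,T),id=1:20,stringsAsFactors=TRUE)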
> dat
f1 f2 id
1 C D 1
2 B E 2
3 B E 3
4 A D 4
5 C E 5
6 C E 6
7 C D 7
8 B E 8
9 C D 9
10 A D 10
11 B E 11
12 C E 12
13 B D 13
14 B E 14
15 A D 15
16 C E 16
17 C D 17
18 C D 18
19 B D 19
20 C D 20
> dat$f1
[1] C B B A C C C B C A B C B B A C C C B C
Levels: A B C
> dat$f2
[1] D E E D E E D E D D E E D E D E D D D D
Levels: D E
You can use outer to get a matrix as you showed, for each factor:
> F1 <- with(dat, outer(f1, levels(f1), `==`)*1)
> colnames(F1) <- paste("f1",sep="=",levels(dat$f1))
> F1
f1=A f1=B f1=C
[1,] 0 0 1
[2,] 0 1 0
[3,] 0 1 0
[4,] 1 0 0
[5,] 0 0 1
[6,] 0 0 1
[7,] 0 0 1
[8,] 0 1 0
[9,] 0 0 1
[10,] 1 0 0
[11,] 0 1 0
[12,] 0 0 1
[13,] 0 1 0
[14,] 0 1 0
[15,] 1 0 0
[16,] 0 0 1
[17,] 0 0 1
[18,] 0 0 1
[19,] 0 1 0
[20,] 0 0 1
Now do the same for the second factor:
> F2 <- with(dat, outer(f2, levels(f2), `==`)*1)
> colnames(F2) <- paste("f2",sep="=",levels(dat$f2))
And cbind them to get the final result:
> cbind(F1,F2)
model.matrix is the process that lm and others use in the background to convert for you.
dat <- data.frame(f1=sample(LETTERS[1:3],20,T),f2=sample(LETTERS[4:5],20,T),id=1:20)
dat
model.matrix(~dat$f1 + dat$f2)
It creates the INTERCEPT variable as a column of 1's, but you can easily remove that if you need.
model.matrix(~dat$f1 + dat$f2)[,-1]
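One caveat worth illustrating: dropping the intercept column with [, -1] leaves the first level of each factor without a column of its own, whereas removing the intercept in the formula (- 1) gives the first factor a full set of indicator columns. A small sketch with a made-up factor f:

```r
f <- factor(c("A", "B", "C", "A"))

# Default treatment coding: intercept plus one column per non-reference level
with_int <- model.matrix(~ f)
print(colnames(with_int))

# No intercept in the formula: one indicator column per level
no_int <- model.matrix(~ f - 1)
print(colnames(no_int))
```

With more than one factor in the formula, only the first one gets the full set of columns this way.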
Edit: Now I see that this is essentially the same as one of the other comments, but more concise.
Expanding and generalizing @Ferdinand.kraft's answer:
dat <- data.frame(
f1 = sample(LETTERS[1:3], 20, TRUE),
f2 = sample(LETTERS[4:5], 20, TRUE),
row.names = paste0("id_", 1:20))
covariates <- c("f1", "f2") # in case you have other columns that you don't want to include in the design matrix
design <- do.call(cbind, lapply(covariates, function(covariate){
apply(outer(dat[[covariate]], unique(dat[[covariate]]), FUN = "=="), 2, as.integer)
}))
rownames(design) <- rownames(dat)
colnames(design) <- unlist(sapply(covariates, function(covariate) unique(dat[[covariate]])))
design <- design[, !duplicated(colnames(design))] # duplicated colnames happen sometimes
design
# C A B D E
# id_1 1 0 0 1 0
# id_2 0 1 0 1 0
# id_3 0 0 1 1 0
# id_4 1 0 0 1 0
# id_5 0 1 0 1 0
# id_6 0 1 0 0 1
# id_7 0 0 1 0 1
Model matrix only allows what it calls "dummy" coding for the first factor in a formula.
If the intercept is present, it plays that role. To get the desired effect of a redundant index matrix (where you have a 1 in every column for the corresponding factor level and 0 elsewhere), you can lie to model.matrix() and pretend there's an extra level. Then trim off the intercept column.
> a=rep(1:2,3)
> b=rep(1:3,2)
> df=data.frame(A=a,B=b)
> # Lie and pretend there's a level 0 in each factor.
> df$A=factor(a,as.character(0:2))
> df$B=factor(b,as.character(0:3))
> mm=model.matrix (~A+B,df)
> mm
(Intercept) A1 A2 B1 B2 B3
1 1 1 0 1 0 0
2 1 0 1 0 1 0
3 1 1 0 0 0 1
4 1 0 1 1 0 0
5 1 1 0 0 1 0
6 1 0 1 0 0 1
attr(,"assign")
[1] 0 1 1 2 2 2
attr(,"contrasts")
attr(,"contrasts")$A
[1] "contr.treatment"
attr(,"contrasts")$B
[1] "contr.treatment"
> # mm has an intercept column not requested, so kill it
> dm=as.matrix(mm[,-1])
> dm
A1 A2 B1 B2 B3
1 1 0 1 0 0
2 0 1 0 1 0
3 1 0 0 0 1
4 0 1 1 0 0
5 1 0 0 1 0
6 0 1 0 0 1
> # You can also add interactions
> mm2=model.matrix (~A*B,df)
> dm2=as.matrix(mm2[,-1])
> dm2
A1 A2 B1 B2 B3 A1:B1 A2:B1 A1:B2 A2:B2 A1:B3 A2:B3
1 1 0 1 0 0 1 0 0 0 0 0
2 0 1 0 1 0 0 0 0 1 0 0
3 1 0 0 0 1 0 0 0 0 1 0
4 0 1 1 0 0 0 1 0 0 0 0
5 1 0 0 1 0 0 0 1 0 0 0
6 0 1 0 0 1 0 0 0 0 0 1
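As an alternative to inventing an extra level, the sketch below passes full indicator contrasts through the contrasts.arg argument of model.matrix(). This is a different technique from the one shown above, offered under the assumption that only main effects are needed.

```r
a <- rep(1:2, 3)
b <- rep(1:3, 2)
df <- data.frame(A = factor(a), B = factor(b))

# contrasts(..., contrasts = FALSE) returns an identity matrix,
# i.e. one indicator column per level
full <- model.matrix(
  ~ A + B, df,
  contrasts.arg = list(A = contrasts(df$A, contrasts = FALSE),
                       B = contrasts(df$B, contrasts = FALSE))
)

dm <- full[, -1]  # drop the intercept column
print(dm)
```

Each row of dm should contain exactly one 1 among the A columns and one among the B columns.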
Things get complicated with model.matrix() again if we add a covariate x and interactions of x with factors.
a=rep(1:2,3)
b=rep(1:3,2)
x=1:6
df=data.frame(A=a,B=b,x=x)
# Lie and pretend there's a level 0 in each factor.
df$A=factor(a,as.character(0:2))
df$B=factor(b,as.character(0:3))
mm=model.matrix (~A + B + A:x + B:x,df)
print(mm)
(Intercept) A1 A2 B1 B2 B3 A0:x A1:x A2:x B1:x B2:x B3:x
1 1 1 0 1 0 0 0 1 0 1 0 0
2 1 0 1 0 1 0 0 0 2 0 2 0
3 1 1 0 0 0 1 0 3 0 0 0 3
4 1 0 1 1 0 0 0 0 4 4 0 0
5 1 1 0 0 1 0 0 5 0 0 5 0
6 1 0 1 0 0 1 0 0 6 0 0 6
So mm has an intercept, but now the A:x interaction terms have an unwanted level, A0:x.
If we reintroduce x as a separate term, we cancel that unwanted level:
mm2=model.matrix (~ x + A + B + A:x + B:x, df)
print(mm2)
(Intercept) x A1 A2 B1 B2 B3 x:A1 x:A2 x:B1 x:B2 x:B3
1 1 1 1 0 1 0 0 1 0 1 0 0
2 1 2 0 1 0 1 0 0 2 0 2 0
3 1 3 1 0 0 0 1 3 0 0 0 3
4 1 4 0 1 1 0 0 0 4 4 0 0
5 1 5 1 0 0 1 0 5 0 0 5 0
6 1 6 0 1 0 0 1 0 6 0 0 6
We can get rid of the unwanted intercept and the unwanted bare x term
dm2=as.matrix(mm2[,c(-1,-2)])
print(dm2)
A1 A2 B1 B2 B3 x:A1 x:A2 x:B1 x:B2 x:B3
1 1 0 1 0 0 1 0 1 0 0
2 0 1 0 1 0 0 2 0 2 0
3 1 0 0 0 1 3 0 0 0 3
4 0 1 1 0 0 0 4 4 0 0
5 1 0 0 1 0 5 0 0 5 0
6 0 1 0 0 1 0 6 0 0 6

R Data Transformations

I have a need to look at the data in a data frame in a different way. Here is the problem..
I have a data frame as follows
Person Item BuyOrSell
1 a B
1 b S
1 a S
2 d B
3 a S
3 e S
One of the requirements I have is to see the data as follows. Show the sum of all transactions made by the Person on individual items broken by the transaction type (B or S)
Person aB aS bB bS dB dS eB eS
1 1 1 0 1 0 0 0 0
2 0 0 0 0 1 0 0 0
3 0 1 0 0 0 0 0 1
So I created a new column by pasting together the values of Item and BuyOrSell.
df$newcol <- with(df, paste(Item, "-", BuyOrSell, sep = ""))
with(df, table(Person, newcol))
and was able to achieve the above results.
The last transformation requirement which was a tough nut to crack was as follows....
aB aS bB bS dB dS eB eS
aB 1 1 0 1 0 0 0 0
aS 1 2 0 1 0 0 0 1
bB 0 0 0 0 0 0 0 0
bS 1 1 0 1 0 0 0 0
dB 0 0 0 0 1 0 0 0
dS 0 0 0 0 0 0 0 0
eB 0 0 0 0 0 0 0 0
eS 0 1 0 0 0 0 0 1
where each cell holds the number of people who made the row's transaction and also made the column's transaction.
I tried table(newcol,newcol) but it generated counts only for aB-aB,aS-aS,bB-bB,..... and 0s for all other combinations.
Any ideas on what package or command will let me crack this nut ?
Isn't the final result just:
# Following Ricardo's solution for casting, but using `acast` instead
library(reshape2)
A <- acast(Person~Item+BuyOrSell,data=df,fun.aggregate=length,drop=FALSE)
# A' * A
t(A) %*% A
# a_B a_S b_B b_S d_B d_S e_B e_S
# a_B 1 1 0 1 0 0 0 0
# a_S 1 2 0 1 0 0 0 1
# b_B 0 0 0 0 0 0 0 0
# b_S 1 1 0 1 0 0 0 0
# d_B 0 0 0 0 1 0 0 0
# d_S 0 0 0 0 0 0 0 0
# e_B 0 0 0 0 0 0 0 0
# e_S 0 1 0 0 0 0 0 1
I think there is a better way, but here's a method using the package reshape2.
require(reshape2)
#reshapes data so each item and buy/sell event interaction occurs once
df2 <- dcast(Person~Item+BuyOrSell,data=df,fun.aggregate=length,drop=FALSE)
df2
# Person a_B a_S b_B b_S d_B d_S e_B e_S
# 1 1 1 1 0 1 0 0 0 0
# 2 2 0 0 0 0 1 0 0 0
# 3 3 0 1 0 0 0 0 0 1
#reshapes data so every row is an interaction by person
df3 <- melt(df2,id.vars="Person")
head(df3)
# Person variable value
# 1 1 a_B 1
# 2 2 a_B 0
# 3 3 a_B 0
# 4 1 a_S 1
# 5 2 a_S 0
# 6 3 a_S 1
#removes empty rows where no action occurred
#removes value column
df4 <- with(df3,
data.frame(Person=rep.int(Person,value),variable=rep.int(variable,value)))
#performs a self-merge: now each row is
#every combination of two actions that one person has done
df5 <- merge(df4,df4,by="Person")
head(df5)
# Person variable.x variable.y
# 1 1 a_B a_B
# 2 1 a_B a_S
# 3 1 a_B b_S
# 4 1 a_S a_B
# 5 1 a_S a_S
# 6 1 a_S b_S
#tabulates variable interactions
with(df5,table(variable.x,variable.y))
Blue Magister, your solution works perfectly, and I analyzed each and every step that you performed.
The output of df4 was follows:
Person variable
1 1 a_B
2 1 a_S
3 3 a_S
4 1 b_S
5 2 d_B
6 3 e_S
The output of with(df5,table(variable.x,variable.y)) was
variable.y
variable.x a_B a_S b_B b_S d_B d_S e_B e_S
a_B 1 1 0 1 0 0 0 0
a_S 1 2 0 1 0 0 0 1
b_B 0 0 0 0 0 0 0 0
b_S 1 1 0 1 0 0 0 0
d_B 0 0 0 0 1 0 0 0
d_S 0 0 0 0 0 0 0 0
e_B 0 0 0 0 0 0 0 0
e_S 0 1 0 0 0 0 0 1
which is exactly what I want.
When I look at the output of df4, it was almost identical to my newcol solution (using paste)
> df
Person newcol
1 1 a-B
2 1 b-S
3 1 a-S
4 2 d-B
5 3 a-S
6 3 e-S
The only difference here is the ordering of the rows when compared to your df4.
So I ended up running this command
dfx <- merge(df,df,by="Person")
with(dfx,table(newcol.x,newcol.y))
and it generated the following...
newcol.y
newcol.x a-B a-S b-S d-B e-S
a-B 1 1 1 0 0
a-S 1 2 1 0 1
b-S 1 1 1 0 0
d-B 0 0 0 1 0
e-S 0 1 0 0 1
The above output omits some rows and columns. What am I doing differently from you?
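A likely explanation, offered as a sketch rather than a definitive answer: table() only tabulates the values that actually occur in a character vector, whereas dcast(..., drop=FALSE) kept every Item and BuyOrSell combination. Declaring all combinations as factor levels before the self-merge should restore the missing rows and columns. The data frame below re-creates the example from the question; all_levels is a name introduced here.

```r
df <- data.frame(
  Person    = c(1, 1, 1, 2, 3, 3),
  Item      = c("a", "b", "a", "d", "a", "e"),
  BuyOrSell = c("B", "S", "S", "B", "S", "S"),
  stringsAsFactors = FALSE
)

# All Item/BuyOrSell combinations, ordered a-B, a-S, b-B, ...
all_levels <- as.vector(t(outer(sort(unique(df$Item)), c("B", "S"),
                                paste, sep = "-")))

# A factor keeps unused levels, so table() reports them as zero counts
df$newcol <- factor(paste(df$Item, df$BuyOrSell, sep = "-"),
                    levels = all_levels)

# Self-merge: every pair of transactions made by the same person
dfx <- merge(df, df, by = "Person")
tab <- with(dfx, table(newcol.x, newcol.y))
print(tab)
```

With the levels declared up front, combinations such as b-B and d-S appear as all-zero rows and columns instead of being dropped.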
