Showing missing levels in model matrix - r

I would like to know if there a way to insert a column into a matrix such that..
p1 <- c("a","b","c","e","d","a","c")
p2 <- c("a","b","c","e","e","a","c")
p1mat <- model.matrix(~p1 + 0)
p2mat <- model.matrix(~p2 + 0)
colnames(p1mat) <- gsub("p1","",colnames(p1mat))
colnames(p2mat) <- gsub("p2","",colnames(p2mat))
this would give me for p1mat
a b c d e
1 1 0 0 0 0
2 0 1 0 0 0
3 0 0 1 0 0
4 0 0 0 0 1
5 0 0 0 1 0
6 1 0 0 0 0
7 0 0 1 0 0
and for p2mat
a b c e
1 1 0 0 0
2 0 1 0 0
3 0 0 1 0
4 0 0 0 1
5 0 0 0 1
6 1 0 0 0
7 0 0 1 0
My question is, is there a way to sneak in a column vector d consisting of only zeros into the matrix p2mat? such that
d
0
0
0
0
0
0
0
and the vector is automatically ordered and placed between columns c and e resulting in to following matrix for p2mat
a b c d e
1 1 0 0 0 0
2 0 1 0 0 0
3 0 0 1 0 0
4 0 0 0 0 1
5 0 0 0 0 1
6 1 0 0 0 0
7 0 0 1 0 0
Basically I want matrix p2mat to look into every column in p1mat to create an identical size matrix and to keep track of the data via dummy matrices.
Thank you.

You can factor both your inputs, making sure they both have the same levels. Then model.matrix should work as you expected.
Example:
p1 <- c("a","b","c","e","d","a","c")
p2 <- c("a","b","c","e","e","a","c")
levs <- sort(unique(c(p1, p2)))
f1 <- factor(p1, levs)
f2 <- factor(p2, levs)
model.matrix(~f1 + 0)
# f1a f1b f1c f1d f1e
# 1 1 0 0 0 0
# 2 0 1 0 0 0
# 3 0 0 1 0 0
# 4 0 0 0 0 1
# 5 0 0 0 1 0
# 6 1 0 0 0 0
# 7 0 0 1 0 0
# attr(,"assign")
# [1] 1 1 1 1 1
# attr(,"contrasts")
# attr(,"contrasts")$f1
# [1] "contr.treatment"
model.matrix(~f2 + 0)
# f2a f2b f2c f2d f2e
# 1 1 0 0 0 0
# 2 0 1 0 0 0
# 3 0 0 1 0 0
# 4 0 0 0 0 1
# 5 0 0 0 0 1
# 6 1 0 0 0 0
# 7 0 0 1 0 0
# attr(,"assign")
# [1] 1 1 1 1 1
# attr(,"contrasts")
# attr(,"contrasts")$f2
# [1] "contr.treatment"
If you're really looking to write a function, you might want to look at something like the following:
myfun <- function(..., overwrite = FALSE) {
l <- setNames(list(...), sapply(substitute(list(...))[-1], deparse))
cols <- sort(unique(unlist(lapply(l, colnames), use.names = FALSE)))
out <- lapply(l, function(x) {
cols_x <- c(colnames(x), setdiff(cols, colnames(x)))
temp <- `colnames<-`(x[, match(cols, colnames(x))], cols_x)[, cols]
replace(temp, is.na(temp), 0)
})
if (isTRUE(overwrite)) list2env(out, envir = .GlobalEnv)
out
}
This will take any number of items as inputs, compare the columns in all of them, and add missing columns where necessary. The output is stored as a list, which is a convenient structure to keep if you want to continue doing similar operations on all of the matrices. If you want to overwrite the original object, then you can change the "overwrite" argument to TRUE.
Here's some more sample data to work with.
set.seed(1)
p1 <- c("a","b","c","e","d","a","c"); p2 <-c("a","b","x","e","e","a","x")
p3 <- sample(c(cols, "z"), 7, TRUE)
p1mat <- model.matrix(~p1 + 0)
p2mat <- model.matrix(~p2 + 0)
p3mat <- model.matrix(~p3 + 0)
colnames(p1mat) <- gsub("p1","",colnames(p1mat))
colnames(p2mat) <- gsub("p2","",colnames(p2mat))
colnames(p3mat) <- gsub("p3","",colnames(p3mat))
Try the function out:
myfun(p1mat, p2mat)
myfun(p2mat, p1mat)
myfun(p3mat, p1mat)
myfun(p3mat, p1mat, p2mat)

This function takes 2 matrices, and compares their dimensions. If their dimensions differ, it inserts a new column of zeros into the matrix with fewer columns, at the exact column position that is lacking. It thus produces a new matrix with the same dimensions as the other.
match_matrices <- function(matrix1, matrix2) {
if(ncol(matrix1) != ncol(matrix2)) {
get_cols <- function(x) { l <- list(); for(i in 1:ncol(x)) { l[i] <- list(as.numeric(x[,i])) }; return(l) }
k <- get_cols(matrix2)
odd_one_out <- setdiff(colnames(matrix1), colnames(matrix2))
insert_at <- which(colnames(matrix1) == odd_one_out)
res <- t(do.call('rbind', append(k, list(rep(0, nrow(matrix2))), insert_at-1)))
colnames(res) <- colnames(matrix1)
}
return(res)
}
Using your matrices:
match_matrices(p1mat, p2mat)

Related

How to set a loop to assign lots of variables

I just started using R for a psych class, so please go easy on me. I watched a bunch of youtube videos on For loops, but none have answered my question. I have 4 data frames (A, B, C, D), each with 25 columns. I want to combine the nth column from each data frame together, and save them as an object, like so:
Q1 <- cbind(A[1], B[1], C[1], D[1])
Q2 <- cbind(A[2], B[2], C[2], D[2])
How can I set a loop to do this for all 25 so I don’t have to do it manually?
Thanks in advance
Each of my data frames looks like this (with column headings reflecting the letter of the data frame (i.e. B has QB1, QB2, etc.
QA1 QA2 QA3 QA4 QA5 QA6 QA7 QA8 QA9 QA10 QA11 QA12 QA13 QA14 QA15
1 1 2 2 0 0 2 0 1 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0
3 1 0 0 0 0 0 1 0 0 2 1 1 0 0 0
4 1 0 0 0 0 0 1 1 0 1 0 2 0 0 0
In order to do it in a for loop, you need to use assign() from baseR and eval_tidy(), sym() from rlang(). Basically, you will need to evaluate strings as variables.
Create simulation data
library(rlang)
nrows = 10
ncols = 25
df_names <- c("A","B","C","D")
for(df_name in df_names){
# assign value to a string as variable
assign(
df_name,
as.data.frame(
matrix(
data = sample(
c(0,1),
size = nrows * ncols,
replace = TRUE
),
ncol = 25
)
)
)
# rename columns
assign(
df_name,
setNames(eval_tidy(sym(df_name)),paste0("Q",df_name,1:ncols))
)
}
Show A
> head(A)
QA1 QA2 QA3 QA4 QA5 QA6 QA7 QA8 QA9 QA10 QA11 QA12 QA13 QA14 QA15 QA16 QA17 QA18 QA19 QA20 QA21 QA22 QA23 QA24 QA25
1 1 1 0 0 1 0 1 0 1 1 0 0 1 1 1 0 0 1 0 0 1 1 0 1 1
2 0 1 0 1 1 1 1 0 1 1 1 1 0 0 0 1 0 1 1 0 1 0 1 1 0
3 0 0 0 1 1 0 1 0 1 0 0 1 0 0 0 1 1 1 1 0 0 0 1 1 1
4 0 0 1 1 1 0 0 1 1 1 1 0 1 1 0 1 0 0 0 1 0 1 1 1 1
5 1 1 0 1 1 1 1 1 1 0 1 0 0 0 0 0 1 0 1 0 1 1 0 1 1
6 1 1 0 0 1 1 0 1 1 1 0 1 0 1 1 0 1 0 0 1 1 0 1 1 0
To answer your question:
This should create 25 variables from Q1 to Q25:
# assign dataframes from Q1 to Q25
for(i in 1:25){
new_df_name <- paste0("Q",i)
# initialize Qi with the same number of rows as A,B,C,D ...
assign(
new_df_name,
data.frame(tmp = matrix(NA,nrow = rows))
)
# loop A,B,C,D ... and bind them
for(df_name in df_names){
assign(
new_df_name,
cbind(
eval_tidy(sym(new_df_name)),
eval_tidy(sym(df_name))[,i,drop = FALSE]
)
)
}
# drop tmp to clean up
assign(
new_df_name,
eval_tidy(sym(new_df_name))[,-1]
)
}
Show result:
> Q25
QA25 QB25 QC25 QD25
1 1 0 1 1
2 0 1 0 0
3 1 1 0 0
4 1 0 1 1
5 1 1 0 0
6 0 1 1 1
7 1 0 0 0
8 0 0 0 1
9 1 1 1 0
10 0 0 1 1
The codes should be much easier if you save results in a list using map(). The major complexity is from assigning values to separate variables.
You can combine some dplyr verbs in a for loop to combine the columns from each data set and assign them to 25 new objects.
# merge data, gather, split by var numbers, assign each df to environment
for (i in 1:25) {
df <- cbind(q1,q2,q3,q4) %>% mutate(id=row_number()) %>%
gather(k,v,-id) %>%
mutate(num=sub('A|B|C|D','',k)) %>%
filter(num==i) %>% select(-num) %>% spread(k,v)
assign(paste0('df',i),df)
}
ls(pattern = 'df')
[1] "df1" "df10" "df11" "df12" "df13" "df14" "df15" "df16" "df17" "df18" "df19" "df2"
[13] "df20" "df21" "df22" "df23" "df24" "df25" "df3" "df4" "df5" "df6" "df7" "df8"
[25] "df9"
Code to create initial 4 toy data frames.
# create four toy data frames
q1 <- data.frame(matrix(runif(100),ncol=25))
q2 <- data.frame(matrix(runif(100),ncol=25))
q3 <- data.frame(matrix(runif(100),ncol=25))
q4 <- data.frame(matrix(runif(100),ncol=25))
# set var names for each toy data
names(q1) <- sub('X','A',names(q1))
names(q2) <- sub('X','B',names(q2))
names(q3) <- sub('X','C',names(q3))
names(q4) <- sub('X','D',names(q4))

find how many times and which columns a string is repeated

My data consists of 6 strings per each element. It has string with 6 characters. The data has white space too.
I want to know how many times each string is repeated in all columns
for example P67809 is repeated 2 times in column a and column d
so the output should look likes
string No columns
P67809 2 a,b
Based on this function I can assign a row number to each string
normalize <- function(x, delim) {
x <- gsub(")", "", x, fixed=TRUE)
x <- gsub("(", "", x, fixed=TRUE)
idx <- rep(seq_len(length(x)), times=nchar(gsub(sprintf("[^%s]",delim), "", as.character(x)))+1)
names <- unlist(strsplit(as.character(x), delim))
return(setNames(idx, names))
}
Then I apply the function on all and each columns string like
myS <- lapply(mydata, normalize,";")
but I don't know how to then search and get the output
We could melt the data from 'wide' to 'long' format. Split the 'value' column with ; to get a list output. We set the names of the list as the 'variable' column of 'dM'. Then stack the list to a two column output, and get the frequency count with 'tbl'. It may be easier to understand the result from the 'tbl' output.
library(reshape2)
dM <- melt(mydata, id.var=NULL)
lst1 <- setNames(strsplit(dM$value, ";"), dM$variable)
tbl <- table(stack(lst1)[2:1])
tbl
values
#ind A4QPH2 O60814 P0CG47 P0CG48 P14923 P15924 P19338 P35908 P42356 P57053 P58876 P62750 P62807 P62851 P62979 P63241 P67809 Q02413 Q06830 Q07955 Q16658 Q5QNW6 Q6IS14 Q8N8J0 Q93079 Q969S3
# a 0 1 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0
# b 3 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 1 0 0 0
# c 1 0 0 1 1 0 0 0 0 0 0 0 0 0 1 1 1 0 1 0 0 1 0 1 0 0 0
# d 0 0 1 1 1 0 0 0 0 0 1 1 1 1 0 1 0 1 0 1 1 0 1 0 0 1 1
# values
#ind Q99877 Q99879 Q9Y2T7
# a 0 0 1
# b 0 0 0
# c 0 0 0
# d 1 1 1
We get the total number of each element with colSums.
cS <- colSums(tbl)
If we need to get the output as in the OP's post, we can melt the list output to create a 2 column data.frame. From this, we convert to 'data.table' (setDT(), grouped by 'value' column , we get the length of unique elements of 'variable' and also paste together the unique elements.
library(data.table)
res <- setDT(melt(lst1))[, list(No= uniqueN(L1),
columns= toString(unique(L1))) ,.(string=value)]
head(res,2)
# string No columns
#1: P67809 2 a, d
#2: Q9Y2T7 2 a, d
One approach might be:
res <- apply(mydata, 2, function(x) unlist(strsplit(x, ";")))
un <- unique(unlist(res))
res2 <- sapply(un, function(x) lapply(res, function(y) as.numeric(x %in% y)))
res2
P67809 Q9Y2T7 P42356 Q8N8J0 A4QPH2 P35908 P19338 P15924 P14923 Q02413 P63241 Q6IS14
a 1 1 1 1 1 1 1 1 1 0 0 0
b 0 0 0 0 0 0 0 0 0 1 1 1
c 0 0 0 0 0 0 0 0 0 1 1 1
d 1 1 0 0 0 0 0 0 0 0 0 0
P62979 P0CG47 P0CG48 Q16658 P62851 Q07955 Q06830 P62807 O60814 P57053 Q99879 Q99877
a 0 0 0 0 0 0 0 0 0 0 0 0 0
b 1 1 1 1 0 0 0 0 0 0 0 0 0
c 1 1 1 1 1 1 0 0 0 0 0 0 0
d 1 1 1 0 0 0 1 1 1 1 1 1 1
Q93079 Q5QNW6 P58876 P62750 Q969S3
a 0 0 0 0 0
b 0 0 0 0 0
c 0 0 0 0 0
d 1 1 1 1 1
as.data.frame(t(apply(t(res2), 1, function(x) cbind(sum(as.numeric(x)), paste(names(x)[which(as.logical(x))], collapse = ",")))))
V1 V2
P67809 2 a,d
Q9Y2T7 2 a,d
P42356 1 a
Q8N8J0 1 a
A4QPH2 1 a
P35908 1 a
P19338 1 a
P15924 1 a
P14923 1 a
Q02413 2 b,c
P63241 2 b,c
Q6IS14 2 b,c
P62979 3 b,c,d
P0CG47 3 b,c,d
P0CG48 3 b,c,d
2 b,c
Q16658 1 c
P62851 1 c
Q07955 1 d
Q06830 1 d
P62807 1 d
O60814 1 d
P57053 1 d
Q99879 1 d
Q99877 1 d
Q93079 1 d
Q5QNW6 1 d
P58876 1 d
P62750 1 d
Q969S3 1 d
An alternative approach with cSplit from splitstackshape and gather from tidyr.
library(splitstackshape)
library(tidyr)
library(dplyr)
splitted <- cSplit(mydata, splitCols = names(mydata), sep = ";") %>% gather() # Split cols and melt data
splitted$key <- substring(splitted$key, 1, 1) # Lose irrelevant string
table(splitted) # Generate frequency table

Transform data frame

I have a questionnaire with an open-ended question like "Please name up to ten animals", which gives me the following data frame (where each letter stands for an animal):
nrow <- 1000
list <- vector("list", nrow)
for(i in 1:nrow){
na <- rep(NA, sample(1:10, 1))
list[[i]] <- sample(c(letters, na), 10, replace=FALSE)
}
df <- data.frame()
df <- rbind(df, do.call(rbind, list))
head(df)
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
# 1 r <NA> a j w e i h u z
# 2 t o e x d v <NA> z n c
# 3 f y e s n c z i u k
# 4 y <NA> v j h z p i c q
# 5 w s v f <NA> c g b x e
# 6 p <NA> a h v x k z o <NA>
How can I transform this data frame to look like the following data frame? Remember that I don't actually know the column names.
r <- 1000
c <- length(letters)
t1 <- matrix(rbinom(r*c,1,0.5),r,c)
colnames(t1) <- letters
head(t1)
# a b c d e f g h i j k l m n o p q r s t u v w x y z
# [1,] 0 1 0 1 0 0 0 1 0 0 1 1 1 1 0 0 0 1 0 1 0 1 1 0 1 0
# [2,] 1 1 1 1 0 1 0 1 1 1 1 0 1 0 0 0 1 1 1 0 0 1 0 1 0 1
# [3,] 0 1 0 0 0 1 1 1 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0
# [4,] 1 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 1 1 1 0 1 0 1 1 0 0
# [5,] 1 0 1 1 1 1 1 1 1 0 1 1 0 0 0 0 1 1 0 1 1 0 0 1 0 0
# [6,] 1 1 0 1 1 0 0 1 0 0 1 0 0 0 0 0 1 1 1 0 0 0 1 1 0 1
td <- data.frame(t(apply(df, 1, function(x) as.numeric( unique(unlist(df)) %in% x))))
colnames (td) <- unique(unlist(df))
letters could be replaced with a vector of animal names colnames(t1).
You can do the following using tidyr which could be much faster than other approaches, though I like the approach by #germcd very much. You may need to tinker with the select, removing NAs as well as a blank space, which may be an artifact of the simulated data you provided:
require(tidyr)
## Add an ID for each record:
df$id <- 1:nrow(df)
out <- (df %>%
gather(column, animal, -id) %>%
filter(animal != " ") %>%
spread(animal, column)
)
head(out)
This code gathers the unnamed columns into a long format, removes any empty columns or missing data, and then spreads by the unique values of the animal column. This also has the potentially desirable property of preserving the column order in which the animals were named. If it's not desirable then you could easily convert the resulting animal columns to numeric:
out_num <- out
out_num[,-1] <- as.numeric((!is.na(out[,-1])))
head(out_num)
You can try mtabulate from the "qdapTools" package:
library(qdapTools)
head(mtabulate(as.data.frame(t(df))))
# c d i l m o r v x y a f s t k p u b h j n q e g w z
# 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 2 0 1 0 0 1 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
# 3 0 0 1 0 0 0 1 0 1 1 1 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0
# 4 1 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0
# 5 0 1 0 0 0 0 1 0 0 0 0 0 1 0 1 1 0 1 1 0 1 1 0 0 0 0
# 6 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 0 1 1 0 1 0 1 0 0 0 0
There are, of course, many other options.
For example, cSplit_e from my "splitstackshape" package (with the downside that inefficiently, you need to paste the values together first before you can split them):
library(splitstackshape)
library(dplyr)
As ones and zeroes:
df %>%
mutate(combined = apply(., 1, function(x) paste(na.omit(x), collapse = ","))) %>%
cSplit_e("combined", ",", mode = "binary", type = "character", fill = 0) %>%
select(starts_with("combined_")) %>%
head
# combined_a combined_b combined_c combined_d combined_e combined_f combined_g combined_h combined_i
# 1 0 0 1 1 0 0 0 0 1
# 2 1 0 0 1 0 1 0 0 0
# 3 1 0 0 0 0 0 0 0 1
# 4 0 1 1 0 0 0 0 1 1
# 5 0 1 0 1 0 0 0 1 0
# 6 0 1 0 0 0 0 0 0 0
# combined_j combined_k combined_l combined_m combined_n combined_o combined_p combined_q combined_r
# 1 0 0 1 1 0 1 0 0 1
# 2 0 0 0 1 0 0 0 0 0
# 3 0 1 0 0 0 0 1 0 1
# 4 1 0 1 0 1 0 0 0 0
# 5 0 1 0 0 1 0 1 1 1
# 6 1 1 0 1 0 0 0 1 0
# combined_s combined_t combined_u combined_v combined_w combined_x combined_y combined_z
# 1 0 0 0 1 0 1 1 0
# 2 1 1 0 0 0 0 0 0
# 3 0 1 1 0 0 1 1 0
# 4 0 0 1 0 0 0 1 0
# 5 1 0 0 0 0 0 0 0
# 6 1 1 1 0 0 0 0 0
As the original values:
df %>%
mutate(combined = apply(., 1, function(x) paste(na.omit(x), collapse = ","))) %>%
cSplit_e("combined", ",", mode = "value", type = "character", fill = "") %>%
select(starts_with("combined_")) %>%
head
# combined_a combined_b combined_c combined_d combined_e combined_f combined_g combined_h combined_i
# 1 c d i
# 2 a d f
# 3 a i
# 4 b c h i
# 5 b d h
# 6 b
# combined_j combined_k combined_l combined_m combined_n combined_o combined_p combined_q combined_r
# 1 l m o r
# 2 m
# 3 k p r
# 4 j l n
# 5 k n p q r
# 6 j k m q
# combined_s combined_t combined_u combined_v combined_w combined_x combined_y combined_z
# 1 v x y
# 2 s t
# 3 t u x y
# 4 u y
# 5 s
# 6 s t u
Alternatively, you can use "reshape2":
library(reshape2)
## The values
dcast(melt(as.matrix(df), na.rm = TRUE),
Var1 ~ value, value.var = "value")
## ones and zeroes
dcast(melt(as.matrix(df), na.rm = TRUE),
Var1 ~ value, value.var = "value", fun.aggregate = length)

Using loop to make column selections using different vectors

Let's say I have 3 vectors (strings of 10):
X <- c(1,1,0,1,0, 1,1, 0, NA,NA)
H <- c(0,0,1,0,NA,1,NA,1, 1, 1 )
I <- c(0,0,0,0,0, 1,NA,NA,NA,1 )
Data.frame Y contains 10 columns and 6 rows:
1 2 3 4 5 6 7 8 9 10
0 1 0 0 1 1 1 0 1 0
1 1 1 0 1 0 1 0 0 0
0 0 0 0 1 0 0 1 0 1
1 0 1 1 0 1 1 1 0 0
0 0 0 0 0 0 1 0 0 0
1 1 0 1 0 0 0 0 1 1
I'd like to use vector X, H en I to make column selections in data.frame Y, using "1's" and "0's" in the vector as selection criterium .
So the results for vector X using the '1' as selection criterium should be:
X <- c(1,1,0,1,0, 1,1, 0, NA,NA)
1 2 4 6 7
0 1 0 1 1
1 1 0 0 1
0 0 0 0 0
1 0 1 1 1
0 0 0 0 1
1 1 1 0 0
For vector H using the '1' as selection criterium:
H <- c(0,0,1,0,NA,1,NA,1, 1, 1 )
3 6 8 9 10
0 1 0 1 0
1 0 0 0 0
0 0 1 0 1
1 1 1 0 0
0 0 0 0 0
0 0 0 1 1
For vector I using the '1' as selection criterium:
I <- c(0,0,0,0,0, 1,NA,NA,NA,1 )
6 10
1 0
0 0
0 1
1 0
0 0
0 1
For convenience and speed I'd like to use a loop. It might be something like this:
all.ones <- lapply[,function(x) x %in% 1]
In the outcome (all.ones), the result for each vector should stay separate. For example:
X 1,2,4,6,7
H 3,6,8,9,10
I 6,10
The standard way of doing this is using the %in% operator:
Y[, X %in% 1]
To do this for multiple vectors (assuming you want an AND operation):
mylist = list(X, H, I, D, E, K)
Y[, Reduce(`&`, lapply(mylist, function(x) x %in% 1))]
The problem is the NA, use which to get round it. Consider the following:
x <- c(1,0,1,NA)
x[x==1]
[1] 1 1 NA
x[which(x==1)]
[1] 1 1
How about this?
idx <- which(X==1)
Y[,idx]
EDIT: For six vectors, do
idx <- which(X==1 & H==1 & I==1 & D==1 & E==1 & K==1)
Y[,idx]
Replace & with | if you want all columns of Y where at least one of the lists has a 1.

How to do row-wise subtraction and replace a specific number with zero?

Step 1: I have a simplified dataframe like this:
df1 = data.frame (B=c(1,0,1), C=c(1,1,0)
, D=c(1,0,1), E=c(1,1,0), F=c(0,0,1)
, G=c(0,1,0), H=c(0,0,1), I=c(0,1,0))
B C D E F G H I
1 1 1 1 1 0 0 0 0
2 0 1 0 1 0 1 0 1
3 1 0 1 0 1 0 1 0
Step 2: I want to do row wise subtraction, i.e. (row1 - row2), (row1 - row3) and (row2 - row3)
row1-row2 1 0 1 0 0 -1 0 -1
row1-row3 0 1 0 1 -1 0 -1 0
row2-row3 -1 1 -1 1 -1 1 -1 1
step 3: replace all -1 to 0
row1-row2 1 0 1 0 0 0 0 0
row1-row3 0 1 0 1 0 0 0 0
row2-row3 0 1 0 1 0 1 0 1
Could you mind to teach me how to do so?
I like using the plyr library for things like this using the combn function to generate all possible pairs of rows/columns.
require(plyr)
combos <- combn(nrow(df1), 2)
adply(combos, 2, function(x) {
out <- data.frame(df1[x[1] , ] - df1[x[2] , ])
out[out == -1] <- 0
return(out)
}
)
Results in:
X1 B C D E F G H I
1 1 1 0 1 0 0 0 0 0
2 2 0 1 0 1 0 0 0 0
3 3 0 1 0 1 0 1 0 1
If necessary, you can drop the first column, plyr spits that out automagically for you.
Similar questions:
Sum pairwise rows with R?
Chi Square Analysis using for loop in R
Compare one row to all other rows in a file using R
For the record, I would do this:
cmb <- combn(seq_len(nrow(df1)), 2)
out <- df1[cmb[1,], ] - df1[cmb[2,], ]
out[out < 0] <- 0
rownames(out) <- apply(cmb, 2,
function(x) paste("row", x[1], "-row", x[2], sep = ""))
This yields (the last line above is a bit of sugar, and may not be needed):
> out
B C D E F G H I
row1-row2 1 0 1 0 0 0 0 0
row1-row3 0 1 0 1 0 0 0 0
row2-row3 0 1 0 1 0 1 0 1
Which is fully vectorised and exploits indices to extend/extract the elements of df1 required for the row-by-row operation.
> df2 <- rbind(df1[1,]-df1[2,], df1[1,]-df1[3,], df1[2,]-df1[3,])
> df2
B C D E F G H I
1 1 0 1 0 0 -1 0 -1
2 0 1 0 1 -1 0 -1 0
21 -1 1 -1 1 -1 1 -1 1
> df2[df2==-1] <- 0
> df2
B C D E F G H I
1 1 0 1 0 0 0 0 0
2 0 1 0 1 0 0 0 0
21 0 1 0 1 0 1 0 1
If you'd like to change the name of the rows to those in your example:
> rownames(df2) <- c('row1-row2', 'row1-row3', 'row2-row3')
> df2
B C D E F G H I
row1-row2 1 0 1 0 0 0 0 0
row1-row3 0 1 0 1 0 0 0 0
row2-row3 0 1 0 1 0 1 0 1
Finally, if the number of rows is not known ahead of time, the following should do the trick:
df1 = data.frame (B=c(1,0,1), C=c(1,1,0), D=c(1,0,1), E=c(1,1,0), F=c(0,0,1), G=c(0,1,0), H=c(0,0,1), I=c(0,1,0))
n <- length(df1[,1])
ret <- data.frame()
for (i in 1:(n-1)) {
for (j in (i+1):n) {
diff <- df1[i,] - df1[j,]
rownames(diff) <- paste('row', i, '-row', j, sep='')
ret <- rbind(ret, diff)
}
}
ret[ret==-1] <- 0
print(ret)

Resources