Outside Union in R

I need to perform what appears to be a union on two tables in R. However, the union needs to include columns that are not common to both parent matrices / tables.
This scenario looks very similar to the Outer Union described here: https://cs.stackexchange.com/questions/6997/what-is-outer-union-and-why-is-it-partially-compatible
I have two Matrices:
Matrix 1
Name Var1 Var2
1 1 0
2 1 0
Matrix 2
Name Var1 Var3
3 0 1
4 0 1
That I need to combine into Matrix 3:
Name Var1 Var2 Var3
1 1 0 0
2 1 0 0
3 0 0 1
4 0 0 1

A base R solution using merge
# full outer join on the shared columns, then replace the NAs it introduces with 0
M <- as.matrix(merge(data.frame(M1), data.frame(M2), all = TRUE))
M[is.na(M)] <- 0
such that
> M
Name Var1 Var2 Var3
[1,] 1 1 0 0
[2,] 2 1 0 0
[3,] 3 0 0 1
[4,] 4 0 0 1
DATA
M1 <- structure(c(1L, 2L, 1L, 1L, 0L, 0L), .Dim = 2:3, .Dimnames = list(
NULL, c("Name", "Var1", "Var2")))
M2 <- structure(c(3L, 4L, 0L, 0L, 1L, 1L), .Dim = 2:3, .Dimnames = list(
NULL, c("Name", "Var1", "Var3")))
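Note that merge joins on the columns the two inputs share, so rows that agree on every shared column would be collapsed into one. If a plain row-wise stack is wanted instead, a minimal base R sketch could look like the following. The helper name outer_union and its fill argument are made up for illustration and are not part of any package.

```r
# Hypothetical helper (not from any package): stack two named matrices
# row-wise, padding columns that are absent from one of them with `fill`.
outer_union <- function(a, b, fill = 0L) {
  all_cols <- union(colnames(a), colnames(b))
  pad <- function(m) {
    out <- matrix(fill, nrow = nrow(m), ncol = length(all_cols),
                  dimnames = list(NULL, all_cols))
    out[, colnames(m)] <- m   # copy the existing columns by name
    out
  }
  rbind(pad(a), pad(b))
}

M1 <- structure(c(1L, 2L, 1L, 1L, 0L, 0L), .Dim = 2:3,
                .Dimnames = list(NULL, c("Name", "Var1", "Var2")))
M2 <- structure(c(3L, 4L, 0L, 0L, 1L, 1L), .Dim = 2:3,
                .Dimnames = list(NULL, c("Name", "Var1", "Var3")))

M3 <- outer_union(M1, M2)
M3
```

Because nothing is matched on keys here, every input row survives and the row order of the inputs is preserved.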

We can convert to data.frame and use bind_rows. By default, it fills the missing values with NA
library(dplyr)
library(tidyr)
bind_rows(as.data.frame(m1), as.data.frame(m2)) %>%
mutate_all(replace_na, 0) %>%
as.matrix
# Name Var1 Var2 Var3
#[1,] 1 1 0 0
#[2,] 2 1 0 0
#[3,] 3 0 0 1
#[4,] 4 0 0 1
Or, as @markus mentioned, rbind.fill.matrix from plyr would be useful (it fills the missing columns with NA, so replace those with 0 afterwards if needed)
plyr::rbind.fill.matrix(m1, m2)
data
m1 <- structure(c(1L, 2L, 1L, 1L, 0L, 0L), .Dim = 2:3, .Dimnames = list(
NULL, c("Name", "Var1", "Var2")))
m2 <- structure(c(3L, 4L, 0L, 0L, 1L, 1L), .Dim = 2:3, .Dimnames = list(
NULL, c("Name", "Var1", "Var3")))

Related

Alternative of Stata's foreach command in R

I am a new user of R and am used to using Stata software.
I used to loop through multiple variables by foreach command in Stata. So, for example, I can convert multiple numerical variables to factor ones.
In Stata, first, define the label:
label define NoYes 0 "No" 1 "Yes"
Then, apply the loop command:
foreach x in var1 var2 var3 {
label values `x' NoYes
}
I am figuring out how I can do so in R; any help would be appreciated.
In base R we can use lapply.
dat[c(1, 3)] <- lapply(dat[c(1, 3)], factor, levels=0:1, labels=c('No', 'Yes'))
dat
# X1 X2 X3 X4 X5
# 1 No 1 No 1 0
# 2 No 1 Yes 1 1
# 3 No 0 No 0 0
# 4 No 1 No 0 0
# 5 Yes 0 Yes 0 0
# 6 Yes 1 Yes 0 0
To avoid confusion, I generally recommend not using too many fancy packages while you're new to R.
The literal translation could look like this (reload dat before trying):
for (x in c('X1', 'X3')) {
dat[[x]] <- factor(dat[[x]], levels=0:1, labels=c('No', 'Yes'))
}
dat
# X1 X2 X3 X4 X5
# 1 No 1 No 1 0
# 2 No 1 Yes 1 1
# 3 No 0 No 0 0
# 4 No 1 No 0 0
# 5 Yes 0 Yes 0 0
# 6 Yes 1 Yes 0 0
Data:
dat <- structure(list(X1 = c(0L, 0L, 0L, 0L, 1L, 1L), X2 = c(1L, 1L,
0L, 1L, 0L, 1L), X3 = c(0L, 1L, 0L, 0L, 1L, 1L), X4 = c(1L, 1L,
0L, 0L, 0L, 0L), X5 = c(0L, 1L, 0L, 0L, 0L, 0L)), class = "data.frame", row.names = c(NA,
-6L))
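To stay closer to the Stata workflow (define the label once, then apply it to a list of variables), the labelling step can be wrapped in a small function. This is only a sketch: the function name no_yes and the three-row data frame are invented for illustration.

```r
# Made-up example data with 0/1 indicator columns
dat <- data.frame(X1 = c(0L, 0L, 1L),
                  X2 = c(1L, 0L, 1L),
                  X3 = c(0L, 1L, 1L))

# Rough counterpart of Stata's `label define NoYes 0 "No" 1 "Yes"`
# (no_yes is a hypothetical helper name)
no_yes <- function(x) factor(x, levels = 0:1, labels = c("No", "Yes"))

# Rough counterpart of `foreach x in X1 X3 { label values ... }`
vars <- c("X1", "X3")
dat[vars] <- lapply(dat[vars], no_yes)

sapply(dat, class)   # X1 and X3 are now factors, X2 is untouched
```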
Example data:
library(dplyr)
df <- data.frame(var1 = c(1, 1, 0, 0),
var2 = c(0, 0, 1, 1),
var3 = c(0, 1, 0, 1))
There are several alternatives; across() from dplyr is one of them.
new_df <- df %>% mutate(across(var1:var3, ~ factor(.x, levels = c(0, 1), labels=c("No", "Yes"))))
new_df
var1 var2 var3
1 Yes No No
2 Yes No Yes
3 No Yes No
4 No Yes Yes
You do not really need levels = c(0, 1) here, but I would always do it in real data just to be safe.
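A small base R sketch of why spelling out the levels is the safer habit: if a column happens to contain only one of the two values, factor() cannot match two labels to a single observed level. The vector x below is made up for illustration.

```r
# A column that happens to contain only 1s
x <- c(1, 1, 1)

# With explicit levels the labels are matched by value
safe <- factor(x, levels = c(0, 1), labels = c("No", "Yes"))

# Without levels there is one observed level but two labels, so factor() stops
unsafe <- tryCatch(
  factor(x, labels = c("No", "Yes")),
  error = function(e) conditionMessage(e)
)

safe     # all "Yes", with both levels kept
unsafe   # the error message as a string
```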

Find overlapping ranges in a dataframe and assign them values

This is a simpler version of a question I asked earlier that nobody has answered yet.
I have a huge input file (a representative sample of which is shown below as input):
> input
CT1 CT2 CT3
1 chr1:200-400 chr1:250-450 chr1:400-800
2 chr1:800-970 chr2:200-500 chr1:700-870
3 chr2:300-700 chr2:600-1000 chr2:700-1400
I want to process it by following a rule (described below) so that I get an output like:
> output
CT1 CT2 CT3
chr1:200-400 1 1 0
chr1:800-970 1 0 1
chr2:300-700 1 1 0
chr1:250-450 1 1 1
chr2:200-500 1 1 0
chr2:600-1000 1 1 1
chr1:400-800 0 1 1
chr1:700-870 1 0 1
chr2:700-1400 0 1 1
Rule:
Take every entry of the data frame (the first in this case is chr1:200-400) and check whether it overlaps with any other value in the data frame. If it does, write 1 under the column in which the overlapping value exists; if not, write 0.
For example, take the first entry of the input, input[1,1], which is chr1:200-400. Because it exists in column 1, we write 1 under CT1. Next we check whether this range overlaps with any range in the other columns of the input. It overlaps only with the first value (chr1:250-450) of the second column (CT2), so we write 1 under CT2 as well. As there is no overlap with any of the values in CT3, we write 0 under CT3 in the output data frame.
Here are the dput of input and output:
> dput(input)
structure(list(CT1 = structure(1:3, .Label = c("chr1:200-400",
"chr1:800-970", "chr2:300-700"), class = "factor"), CT2 = structure(1:3, .Label = c("chr1:250-450",
"chr2:200-500", "chr2:600-1000"), class = "factor"), CT3 = structure(1:3, .Label = c("chr1:400-800",
"chr1:700-870", "chr2:700-1400"), class = "factor")), .Names = c("CT1",
"CT2", "CT3"), class = "data.frame", row.names = c(NA, -3L))
> dput(output)
structure(list(CT1 = c(1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L), CT2 = c(1L,
0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L), CT3 = c(0L, 0L, 0L, 0L, 0L,
1L, 1L, 1L, 1L)), .Names = c("CT1", "CT2", "CT3"), class = "data.frame", row.names = c("chr1:200-400",
"chr1:800-970", "chr2:300-700", "chr1:250-450", "chr2:200-500",
"chr2:600-1000", "chr1:400-800", "chr1:700-870", "chr2:700-1400"
))
A possible solution using the data.table-package:
# load the 'data.table'-package and convert 'input' to a data.table with 'setDT'
library(data.table)
setDT(input)
# reshape 'input' to long format and split the strings in 3 columns
DT <- melt(input, measure.vars = 1:3)[, c('chr','low','high') := tstrsplit(value, split = ':|-', type.convert = TRUE)
, by = variable][]
# create aggregation function; needed in the last reshape step
f <- function(x) as.integer(length(x) > 0)
# cartesian self join & reshape result back to wide format with aggregation function
DT[DT, on = .(chr, low < high, high > low), allow.cartesian = TRUE
][, dcast(.SD, value ~ i.variable, fun = f)]
which gives:
value CT1 CT2 CT3
1: chr1:200-400 1 1 0
2: chr1:250-450 1 1 1
3: chr1:400-800 0 1 1
4: chr1:700-870 1 0 1
5: chr1:800-970 1 0 1
6: chr2:200-500 1 1 0
7: chr2:300-700 1 1 0
8: chr2:600-1000 1 1 1
9: chr2:700-1400 0 1 1
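The join condition on = .(chr, low < high, high > low) encodes the usual interval test: two ranges on the same chromosome overlap when each one starts before the other ends. As a rough illustration in base R, here is the same test applied to just the first row of the input (one range per column, so the column mapping is trivial). This is a sketch of the condition only, not a replacement for the join on the full data.

```r
# First row of the input: one range per column
ranges <- c(CT1 = "chr1:200-400", CT2 = "chr1:250-450", CT3 = "chr1:400-800")

# Split "chr:low-high" into its three parts
parts <- do.call(rbind, strsplit(ranges, ":|-"))
chr   <- unname(parts[, 1])
low   <- as.integer(parts[, 2])
high  <- as.integer(parts[, 3])

# Entry [i, j] is TRUE when range i overlaps range j
# (same chromosome, strict inequalities as in the join)
overlaps <- outer(chr, chr, "==") &
            outer(low, high, "<") &
            outer(high, low, ">")
dimnames(overlaps) <- list(unname(ranges), names(ranges))

m <- overlaps + 0L   # logical -> 0/1
m
```

Because the inequalities are strict, chr1:200-400 and chr1:400-800 (which only touch at 400) are not counted as overlapping, which matches the 0 in the expected output.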

Concatenate dichotomous columns to semicolon-separated column

I have a data frame containing the results of a multiple choice question. Each item has either 0 (not mentioned) or 1 (mentioned). The columns are named like this:
F1.2_1, F1.2_2, F1.2_3, F1.2_4, F1.2_5, F1.2_99
etc.
I would like to concatenate these values like this: The new column should be a semicolon-separated string of the selected items. So if a row has a 1 in F1.2_1, F1.2_4 and F1.2_5 it should be: 1;4;5
The last digit(s) of the dichotomous columns' names are the item codes to be used in the string.
Any idea how this could be achieved with R (and data.table)? Thanks for any help!
edit:
Here is an example DF with the desired result:
structure(list(F1.2_1 = c(0L, 1L, 0L, 1L), F1.2_2 = c(1L, 0L,
0L, 1L), F1.2_3 = c(0L, 1L, 0L, 1L), F1.2_4 = c(0L, 1L, 0L, 0L
), F1.2_5 = c(0L, 0L, 0L, 0L), F1.2_99 = c(0L, 0L, 1L, 0L), desired_result = structure(c(3L,
2L, 4L, 1L), .Label = c("1;2;3", "1;3;4", "2", "99"), class = "factor")), .Names = c("F1.2_1",
"F1.2_2", "F1.2_3", "F1.2_4", "F1.2_5", "F1.2_99", "desired_result"
), class = "data.frame", row.names = c(NA, -4L))
F1.2_1 F1.2_2 F1.2_3 F1.2_4 F1.2_5 F1.2_99 desired_result
1 0 1 0 0 0 0 2
2 1 0 1 1 0 0 1;3;4
3 0 0 0 0 0 1 99
4 1 1 1 0 0 0 1;2;3
In his comment, the OP asked how to deal with more multiple choice questions.
The approach below will be able to handle an arbitrary number of questions and choices for each question. It uses melt() and dcast() from the data.table package.
Sample input data
Let's assume the input data.frame DT for the extended case contains two questions, one with 6 choices and the other with 4 choices:
DT
# F1.2_1 F1.2_2 F1.2_3 F1.2_4 F1.2_5 F1.2_99 F2.7_1 F2.7_2 F2.7_3 F2.7_11
#1: 0 1 0 0 0 0 0 1 1 0
#2: 1 0 1 1 0 0 1 1 1 1
#3: 0 0 0 0 0 1 1 0 1 0
#4: 1 1 1 0 0 0 1 0 1 1
Code
library(data.table)
# coerce to data.table and add row number for later join
setDT(DT)[, rn := .I]
# reshape from wide to long format
molten <- melt(DT, id.vars = "rn")
# alternatively, the measure cols can be specified (in case of other id vars)
# molten <- melt(DT, measure.vars = patterns("^F"))
# split question id and choice id
molten[, c("question_id", "choice_id") := tstrsplit(variable, "_")]
# reshape only selected choices from long to wide format,
# thereby pasting together the ids of the selected choices for each question
result <- dcast(molten[value == 1], rn ~ question_id, paste, collapse = ";",
fill = NA, value.var = "choice_id")
# final join for demonstration only, remove row number as no longer needed
DT[result, on = "rn"][, rn := NULL][]
# F1.2_1 F1.2_2 F1.2_3 F1.2_4 F1.2_5 F1.2_99 F2.7_1 F2.7_2 F2.7_3 F2.7_11 F1.2 F2.7
#1: 0 1 0 0 0 0 0 1 1 0 2 2;3
#2: 1 0 1 1 0 0 1 1 1 1 1;3;4 1;2;3;11
#3: 0 0 0 0 0 1 1 0 1 0 99 1;3
#4: 1 1 1 0 0 0 1 0 1 1 1;2;3 1;3;11
For each question, the final result shows which choices were selected in each row.
Reproducible data
The sample data can be created with
DT <- structure(list(F1.2_1 = c(0L, 1L, 0L, 1L), F1.2_2 = c(1L, 0L,
0L, 1L), F1.2_3 = c(0L, 1L, 0L, 1L), F1.2_4 = c(0L, 1L, 0L, 0L
), F1.2_5 = c(0L, 0L, 0L, 0L), F1.2_99 = c(0L, 0L, 1L, 0L), F2.7_1 = c(0L,
1L, 1L, 1L), F2.7_2 = c(1L, 1L, 0L, 0L), F2.7_3 = c(1L, 1L, 1L,
1L), F2.7_11 = c(0L, 1L, 0L, 1L)), .Names = c("F1.2_1", "F1.2_2",
"F1.2_3", "F1.2_4", "F1.2_5", "F1.2_99", "F2.7_1", "F2.7_2",
"F2.7_3", "F2.7_11"), row.names = c(NA, -4L), class = "data.frame")
We can try
j1 <- do.call(paste, c(as.integer(sub(".*_", "",
names(DF)[-7]))[col(DF[-7])]*DF[-7], sep=";"))
DF$newCol <- gsub("^;+|;+$", "", gsub(";*0;|0$|^0", ";", j1))
DF$newCol
#[1] "2" "1;3;4" "99" "1;2;3"
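For the single-question case, a more explicit base R variant may be easier to read than the regex clean-up above. This is a sketch that assumes every column of DF is a 0/1 item column; the column name selected is made up.

```r
DF <- data.frame(F1.2_1  = c(0L, 1L, 0L, 1L),
                 F1.2_2  = c(1L, 0L, 0L, 1L),
                 F1.2_3  = c(0L, 1L, 0L, 1L),
                 F1.2_4  = c(0L, 1L, 0L, 0L),
                 F1.2_5  = c(0L, 0L, 0L, 0L),
                 F1.2_99 = c(0L, 0L, 1L, 0L))

# Item codes are whatever follows the last underscore in the column names
codes <- sub(".*_", "", names(DF))

# For each row, keep the codes of the columns that equal 1 and paste them
DF$selected <- apply(DF == 1, 1, function(picked) {
  paste(codes[picked], collapse = ";")
})

DF$selected
```

A row with no selected items ends up as an empty string here, which may or may not be what is wanted.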

Create n-way frequency tables using R

I need some help to create an n-way frequency table.
I am using the code below:
tab <- table(VAR1,VAR2,VAR3)
finaltab <- ftable(tab,row.vars=c(2,3))
print(finaltab)
VAR1, VAR2 and VAR3 are all factor variables. By doing this, I produce the following table:
But since VAR2 and VAR3 have several categories, I get a lot of rows containing only "0". I wish to remove those rows so that, within each category of VAR2, only the categories of VAR3 that actually have frequencies are kept, as follows:
Does anyone know how to do it, either by subsetting the table I created first or by using another function that doesn't return all levels of VAR3 in each VAR2 category but only those which actually have frequencies?
Contingency tables have the same number of rows in every category. If you
remove rows from one category you no longer have a table but a matrix.
t <- structure(c(0L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 1L, 1L), .Dim = c(3L, 2L, 3L), .Dimnames = structure(list(c("A", "B", "C"), c("A", "B"), c("A", "B", "C")), .Names = c("","", "")), class = "table")
> (ft <- ftable(t, row.vars=c(2,3)))
A B C
A A 0 0 1
B 1 1 1
C 0 1 0
B A 1 1 0
B 0 0 0
C 1 1 1
> ft[apply(ft, 1, any), ]
[,1] [,2] [,3]
[1,] 0 0 1
[2,] 1 1 1
[3,] 0 1 0
[4,] 1 1 0
[5,] 1 1 1
The unfortunate effect of subsetting the table is that the names are lost. This
can be mitigated to some extent by coercing the table to a matrix before
taking the subset, but the printed output still isn't as pretty as that of a
contingency table.
> as.matrix(ft)[apply(ft, 1, any), ]
_ A B C
A_A 0 0 1
A_B 1 1 1
A_C 0 1 0
B_A 1 1 0
B_C 1 1 1
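If the wide layout is not essential, another option is to switch to long format, where dropping empty combinations is an ordinary row filter. The sketch below uses made-up vectors for VAR1, VAR2 and VAR3, and it drops every zero cell rather than only the rows that are zero across all VAR1 categories.

```r
# Made-up data standing in for the three factor variables
tab <- table(VAR1 = c("A", "B", "B", "C"),
             VAR2 = c("x", "x", "y", "y"),
             VAR3 = c("p", "q", "q", "q"))

# One row per cell of the table, with the count in column Freq
long <- as.data.frame(tab)

# Keep only the combinations that actually occur
nonzero <- long[long$Freq > 0, ]
nonzero
```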

How to count the number of combinations of boolean data in R

What is the best way to determine a factor or create a new category field based on a number of boolean fields? In this example, I need to count the number of unique combinations of medications.
> MultPsychMeds
ID OLANZAPINE HALOPERIDOL QUETIAPINE RISPERIDONE
1 A 1 1 0 0
2 B 1 0 1 0
3 C 1 0 1 0
4 D 1 0 1 0
5 E 1 0 0 1
6 F 1 0 0 1
7 G 1 0 0 1
8 H 1 0 0 1
9 I 0 1 1 0
10 J 0 1 1 0
Perhaps another way to state it is that I need to pivot or cross tabulate the pairs. The final results need to look something like:
Combination Count
OLANZAPINE/HALOPERIDOL 1
OLANZAPINE/QUETIAPINE 3
OLANZAPINE/RISPERIDONE 4
HALOPERIDOL/QUETIAPINE 2
This data frame can be replicated in R with:
MultPsychMeds <- structure(list(ID = structure(1:10, .Label = c("A", "B", "C",
"D", "E", "F", "G", "H", "I", "J"), class = "factor"), OLANZAPINE = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L), HALOPERIDOL = c(1L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L), QUETIAPINE = c(0L, 1L, 1L, 1L,
0L, 0L, 0L, 0L, 1L, 1L), RISPERIDONE = c(0L, 0L, 0L, 0L, 1L,
1L, 1L, 1L, 0L, 0L)), .Names = c("ID", "OLANZAPINE", "HALOPERIDOL",
"QUETIAPINE", "RISPERIDONE"), class = "data.frame", row.names = c(NA,
-10L))
Here's one approach using the reshape and plyr packages:
library(reshape)
library(plyr)
#Melt into long format
dat.m <- melt(MultPsychMeds, id.vars = "ID")
#Group at the ID level and paste the drugs together with "/"
out <- ddply(dat.m, "ID", summarize, combos = paste(variable[value == 1], collapse = "/"))
#Calculate a table
with(out, count(combos))
x freq
1 HALOPERIDOL/QUETIAPINE 2
2 OLANZAPINE/HALOPERIDOL 1
3 OLANZAPINE/QUETIAPINE 3
4 OLANZAPINE/RISPERIDONE 4
Just for fun, a base R solution (that can be turned into a oneliner :-) ):
data.frame(table(apply(MultPsychMeds[,-1], 1, function(currow){
wc<-which(currow==1)
paste(colnames(MultPsychMeds)[wc+1], collapse="/")
})))
Another way could be:
subset(
as.data.frame(
with(MultPsychMeds, table(OLANZAPINE, HALOPERIDOL, QUETIAPINE, RISPERIDONE)),
responseName="count"
),
count>0
)
which gives
OLANZAPINE HALOPERIDOL QUETIAPINE RISPERIDONE count
4 1 1 0 0 1
6 1 0 1 0 3
7 0 1 1 0 2
10 1 0 0 1 4
It's not exactly the format you want, but it is fast and simple.
There is a shorthand in the plyr package:
require(plyr)
count(MultPsychMeds, c("OLANZAPINE", "HALOPERIDOL", "QUETIAPINE", "RISPERIDONE"))
# OLANZAPINE HALOPERIDOL QUETIAPINE RISPERIDONE freq
# 1 0 1 1 0 2
# 2 1 0 0 1 4
# 3 1 0 1 0 3
# 4 1 1 0 0 1
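Since the question also mentions cross tabulating the pairs, a pairwise co-occurrence count is another possible reading. A minimal base R sketch uses crossprod() on the 0/1 matrix. The data is retyped from the question, the names meds, co and idx are made up, and note that this counts pairs: a patient on three drugs would contribute to three pairs, which is not the same as counting unique combinations.

```r
# 0/1 medication indicators, retyped from the question (ID column dropped)
meds <- data.frame(
  OLANZAPINE  = c(1, 1, 1, 1, 1, 1, 1, 1, 0, 0),
  HALOPERIDOL = c(1, 0, 0, 0, 0, 0, 0, 0, 1, 1),
  QUETIAPINE  = c(0, 1, 1, 1, 0, 0, 0, 0, 1, 1),
  RISPERIDONE = c(0, 0, 0, 0, 1, 1, 1, 1, 0, 0)
)

# co[i, j] = number of patients taking both drug i and drug j
co <- crossprod(as.matrix(meds))

# Keep each unordered pair once (upper triangle) and drop empty pairs
idx <- which(upper.tri(co) & co > 0, arr.ind = TRUE)
res <- data.frame(
  Combination = paste(rownames(co)[idx[, "row"]],
                      colnames(co)[idx[, "col"]], sep = "/"),
  Count = co[idx]
)
res
```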
