This is a simpler version of my original question, which nobody has answered yet.
I have a huge input file (a representative sample of which is shown below as input):
> input
CT1 CT2 CT3
1 chr1:200-400 chr1:250-450 chr1:400-800
2 chr1:800-970 chr2:200-500 chr1:700-870
3 chr2:300-700 chr2:600-1000 chr2:700-1400
I want to process it by following a rule (described below) so that I get an output like:
> output
CT1 CT2 CT3
chr1:200-400 1 1 0
chr1:800-970 1 0 1
chr2:300-700 1 1 0
chr1:250-450 1 1 1
chr2:200-500 1 1 0
chr2:600-1000 1 1 1
chr1:400-800 0 1 1
chr1:700-870 1 0 1
chr2:700-1400 0 1 1
Rule:
Take every value in the data frame (the first one here is chr1:200-400) and check whether it overlaps with any other value in the data frame. For each column, write 1 if the value exists in that column or overlaps with a value in it; otherwise write 0.
For example, take the first value of the input, input[1,1], which is chr1:200-400. As it exists in column 1, we write 1 below CT1. Next we check whether this range overlaps with any range in the other columns of the input. It overlaps only with the first value (chr1:250-450) of the second column (CT2), so we write 1 below CT2 as well. As there is no overlap with any of the values in CT3, we write 0 below CT3 in the output data frame.
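The overlap test in this rule can be sketched in base R. This is only an illustration: the helper names parse_range and overlaps are made up here, and the sketch assumes that two ranges that merely touch (one ends at 400, the other starts at 400) do not count as overlapping, which is what the expected output above implies for chr1:200-400 and chr1:400-800.

```r
# Hypothetical helper: split "chr:start-end" strings into chromosome, start and end
parse_range <- function(x) {
  parts <- strsplit(x, "[:-]")
  data.frame(chr  = vapply(parts, `[`, character(1), 1),
             low  = as.integer(vapply(parts, `[`, character(1), 2)),
             high = as.integer(vapply(parts, `[`, character(1), 3)),
             stringsAsFactors = FALSE)
}

# Hypothetical helper: TRUE if both ranges are on the same chromosome
# and their intervals share more than a single end point
overlaps <- function(a, b) {
  ra <- parse_range(a)
  rb <- parse_range(b)
  ra$chr == rb$chr & ra$low < rb$high & ra$high > rb$low
}

overlaps("chr1:200-400", "chr1:250-450")  # overlapping ranges on chr1
overlaps("chr1:200-400", "chr1:400-800")  # ranges that only touch at 400
overlaps("chr1:200-400", "chr2:200-500")  # different chromosomes
```

Under these assumptions the first call returns TRUE and the other two return FALSE.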
Here is the dput() output for input and output:
> dput(input)
structure(list(CT1 = structure(1:3, .Label = c("chr1:200-400",
"chr1:800-970", "chr2:300-700"), class = "factor"), CT2 = structure(1:3, .Label = c("chr1:250-450",
"chr2:200-500", "chr2:600-1000"), class = "factor"), CT3 = structure(1:3, .Label = c("chr1:400-800",
"chr1:700-870", "chr2:700-1400"), class = "factor")), .Names = c("CT1",
"CT2", "CT3"), class = "data.frame", row.names = c(NA, -3L))
> dput(output)
structure(list(CT1 = c(1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L), CT2 = c(1L,
0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L), CT3 = c(0L, 0L, 0L, 0L, 0L,
1L, 1L, 1L, 1L)), .Names = c("CT1", "CT2", "CT3"), class = "data.frame", row.names = c("chr1:200-400",
"chr1:800-970", "chr2:300-700", "chr1:250-450", "chr2:200-500",
"chr2:600-1000", "chr1:400-800", "chr1:700-870", "chr2:700-1400"
))
A possible solution using the data.table-package:
# load the 'data.table'-package and convert 'input' to a data.table with 'setDT'
library(data.table)
setDT(input)
# reshape 'input' to long format and split the strings in 3 columns
DT <- melt(input, measure.vars = 1:3)[, c('chr','low','high') := tstrsplit(value, split = ':|-', type.convert = TRUE)
, by = variable][]
# create aggregation function; needed in the last reshape step
f <- function(x) as.integer(length(x) > 0)
# cartesian self join & reshape result back to wide format with aggregation function
DT[DT, on = .(chr, low < high, high > low), allow.cartesian = TRUE
][, dcast(.SD, value ~ i.variable, fun = f)]
which gives:
value CT1 CT2 CT3
1: chr1:200-400 1 1 0
2: chr1:250-450 1 1 1
3: chr1:400-800 0 1 1
4: chr1:700-870 1 0 1
5: chr1:800-970 1 0 1
6: chr2:200-500 1 1 0
7: chr2:300-700 1 1 0
8: chr2:600-1000 1 1 1
9: chr2:700-1400 0 1 1
Related
I need to perform what appears to be a union on two tables in R. However, the union needs to include columns that are not common to the two parent matrices / tables.
This scenario looks very similar to the Outer Union described here: https://cs.stackexchange.com/questions/6997/what-is-outer-union-and-why-is-it-partially-compatible
I have two Matrices:
Matrix 1
Name Var1 Var2
1 1 0
2 1 0
Matrix 2
Name Var1 Var3
3 0 1
4 0 1
That I need to combine into Matrix 3:
Name Var1 Var2 Var3
1 1 0 0
2 1 0 0
3 0 0 1
4 0 0 1
A base R solution using merge
# outer merge on the shared columns, then replace the resulting NAs with 0
M <- as.matrix(merge(data.frame(M1), data.frame(M2), all = TRUE))
M[is.na(M)] <- 0
such that
> M
Name Var1 Var2 Var3
[1,] 1 1 0 0
[2,] 2 1 0 0
[3,] 3 0 0 1
[4,] 4 0 0 1
DATA
M1 <- structure(c(1L, 2L, 1L, 1L, 0L, 0L), .Dim = 2:3, .Dimnames = list(
NULL, c("Name", "Var1", "Var2")))
M2 <- structure(c(3L, 4L, 0L, 0L, 1L, 1L), .Dim = 2:3, .Dimnames = list(
NULL, c("Name", "Var1", "Var3")))
We can convert to data.frame and use bind_rows. By default, it fills the missing values with NA, which we then replace with 0:
library(dplyr)
library(tidyr)
bind_rows(as.data.frame(m1), as.data.frame(m2)) %>%
mutate_all(replace_na, 0) %>%
as.matrix
# Name Var1 Var2 Var3
#[1,] 1 1 0 0
#[2,] 2 1 0 0
#[3,] 3 0 0 1
#[4,] 4 0 0 1
Or, as @markus mentioned, rbind.fill.matrix from plyr would be useful (it fills the missing columns with NA, so these still need to be replaced with 0):
plyr::rbind.fill.matrix(m1, m2)
data
m1 <- structure(c(1L, 2L, 1L, 1L, 0L, 0L), .Dim = 2:3, .Dimnames = list(
NULL, c("Name", "Var1", "Var2")))
m2 <- structure(c(3L, 4L, 0L, 0L, 1L, 1L), .Dim = 2:3, .Dimnames = list(
NULL, c("Name", "Var1", "Var3")))
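If loading an extra package is not desirable, the same padding idea can be sketched in base R. The function name outer_union is hypothetical, and the sketch assumes both inputs are numeric matrices with column names and that missing cells should become 0.

```r
m1 <- structure(c(1L, 2L, 1L, 1L, 0L, 0L), .Dim = 2:3,
                .Dimnames = list(NULL, c("Name", "Var1", "Var2")))
m2 <- structure(c(3L, 4L, 0L, 0L, 1L, 1L), .Dim = 2:3,
                .Dimnames = list(NULL, c("Name", "Var1", "Var3")))

# Hypothetical helper: stack two matrices, filling columns missing in one with 0
outer_union <- function(a, b) {
  cols <- union(colnames(a), colnames(b))
  pad <- function(m) {
    # start from an all-zero matrix with every column, then copy the known ones
    out <- matrix(0L, nrow(m), length(cols), dimnames = list(NULL, cols))
    out[, colnames(m)] <- m
    out
  }
  rbind(pad(a), pad(b))
}

m3 <- outer_union(m1, m2)
m3
```

Because the padding starts from zeros, no separate NA replacement step is needed.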
I have rows with recurring IDs that I would like to merge. The columns are binary, so I would like to sum them within each ID.
Example before:
id nam1 nam2
1 1 1
1 0 0
2 1 0
2 0 1
3 1 1
3 1 0
Example after:
id nam1 nam2
1 1 1
2 1 1
3 2 1
Any ideas on how to do this?
@d.b's answer in comment:
aggregate(.~id, df, sum)
or using dplyr:
library(dplyr)
df %>%
group_by(id) %>%
summarize_all("sum")
Result:
# A tibble: 3 x 3
id nam1 nam2
<int> <int> <int>
1 1 1 1
2 2 1 1
3 3 2 1
Data
df = structure(list(id = c(1L, 1L, 2L, 2L, 3L, 3L), nam1 = c(1L, 0L,
1L, 0L, 1L, 1L), nam2 = c(1L, 0L, 0L, 1L, 1L, 0L)), .Names = c("id",
"nam1", "nam2"), row.names = c(NA, -6L), class = "data.frame")
#Sample data:
df <- data.frame(id=c(1,1,2,2,3,3),
nam1=c(1,0,1,0,1,1),
nam2=c(1,0,0,1,1,0))
library(data.table)
setDT(df)[, lapply(.SD, sum), by=.(id)]
id nam1 nam2
1 1 1
2 1 1
3 2 1
I have a data frame containing the results of a multiple choice question. Each item has either 0 (not mentioned) or 1 (mentioned). The columns are named like this:
F1.2_1, F1.2_2, F1.2_3, F1.2_4, F1.2_5, F1.2_99
etc.
I would like to concatenate these values like this: The new column should be a semicolon-separated string of the selected items. So if a row has a 1 in F1.2_1, F1.2_4 and F1.2_5 it should be: 1;4;5
The last digit(s) of the dichotomous column names are the item codes to be used in the string.
Any idea how this could be achieved with R (and data.table)? Thanks for any help!
edit:
Here is an example DF with the desired result:
structure(list(F1.2_1 = c(0L, 1L, 0L, 1L), F1.2_2 = c(1L, 0L,
0L, 1L), F1.2_3 = c(0L, 1L, 0L, 1L), F1.2_4 = c(0L, 1L, 0L, 0L
), F1.2_5 = c(0L, 0L, 0L, 0L), F1.2_99 = c(0L, 0L, 1L, 0L), desired_result = structure(c(3L,
2L, 4L, 1L), .Label = c("1;2;3", "1;3;4", "2", "99"), class = "factor")), .Names = c("F1.2_1",
"F1.2_2", "F1.2_3", "F1.2_4", "F1.2_5", "F1.2_99", "desired_result"
), class = "data.frame", row.names = c(NA, -4L))
F1.2_1 F1.2_2 F1.2_3 F1.2_4 F1.2_5 F1.2_99 desired_result
1 0 1 0 0 0 0 2
2 1 0 1 1 0 0 1;3;4
3 0 0 0 0 0 1 99
4 1 1 1 0 0 0 1;2;3
In his comment, the OP asked how to deal with more multiple choice questions.
The approach below will be able to handle an arbitrary number of questions and choices for each question. It uses melt() and dcast() from the data.table package.
Sample input data
Let's assume the input data.frame DT for the extended case contains two questions, one with 6 choices and the other with 4 choices:
DT
# F1.2_1 F1.2_2 F1.2_3 F1.2_4 F1.2_5 F1.2_99 F2.7_1 F2.7_2 F2.7_3 F2.7_11
#1: 0 1 0 0 0 0 0 1 1 0
#2: 1 0 1 1 0 0 1 1 1 1
#3: 0 0 0 0 0 1 1 0 1 0
#4: 1 1 1 0 0 0 1 0 1 1
Code
library(data.table)
# coerce to data.table and add row number for later join
setDT(DT)[, rn := .I]
# reshape from wide to long format
molten <- melt(DT, id.vars = "rn")
# alternatively, the measure cols can be specified (in case of other id vars)
# molten <- melt(DT, measure.vars = patterns("^F"))
# split question id and choice id
molten[, c("question_id", "choice_id") := tstrsplit(variable, "_")]
# reshape only selected choices from long to wide format,
# thereby pasting together the ids of the selected choices for each question
result <- dcast(molten[value == 1], rn ~ question_id, paste, collapse = ";",
fill = NA, value.var = "choice_id")
# final join for demonstration only, remove row number as no longer needed
DT[result, on = "rn"][, rn := NULL][]
# F1.2_1 F1.2_2 F1.2_3 F1.2_4 F1.2_5 F1.2_99 F2.7_1 F2.7_2 F2.7_3 F2.7_11 F1.2 F2.7
#1: 0 1 0 0 0 0 0 1 1 0 2 2;3
#2: 1 0 1 1 0 0 1 1 1 1 1;3;4 1;2;3;11
#3: 0 0 0 0 0 1 1 0 1 0 99 1;3
#4: 1 1 1 0 0 0 1 0 1 1 1;2;3 1;3;11
For each question, the final result shows which choices were selected in each row.
Reproducible data
The sample data can be created with
DT <- structure(list(F1.2_1 = c(0L, 1L, 0L, 1L), F1.2_2 = c(1L, 0L,
0L, 1L), F1.2_3 = c(0L, 1L, 0L, 1L), F1.2_4 = c(0L, 1L, 0L, 0L
), F1.2_5 = c(0L, 0L, 0L, 0L), F1.2_99 = c(0L, 0L, 1L, 0L), F2.7_1 = c(0L,
1L, 1L, 1L), F2.7_2 = c(1L, 1L, 0L, 0L), F2.7_3 = c(1L, 1L, 1L,
1L), F2.7_11 = c(0L, 1L, 0L, 1L)), .Names = c("F1.2_1", "F1.2_2",
"F1.2_3", "F1.2_4", "F1.2_5", "F1.2_99", "F2.7_1", "F2.7_2",
"F2.7_3", "F2.7_11"), row.names = c(NA, -4L), class = "data.frame")
We can try
j1 <- do.call(paste, c(as.integer(sub(".*_", "",
names(DF)[-7]))[col(DF[-7])]*DF[-7], sep=";"))
DF$newCol <- gsub("^;+|;+$", "", gsub(";*0;|0$|^0", ";", j1))
DF$newCol
#[1] "2" "1;3;4" "99" "1;2;3"
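For the single-question case, a perhaps more readable row-wise variant could look like the sketch below. It is not the answer's method, only an illustration: it assumes every column of DF is a 0/1 item column whose code follows the last underscore, and the names codes and selected are made up here.

```r
DF <- data.frame(F1.2_1 = c(0L, 1L, 0L, 1L), F1.2_2 = c(1L, 0L, 0L, 1L),
                 F1.2_3 = c(0L, 1L, 0L, 1L), F1.2_4 = c(0L, 1L, 0L, 0L),
                 F1.2_5 = c(0L, 0L, 0L, 0L), F1.2_99 = c(0L, 0L, 1L, 0L))

# item codes are whatever follows the last underscore in the column names
codes <- sub(".*_", "", names(DF))

# for each row, keep the codes of the columns equal to 1 and paste them together
selected <- apply(DF == 1, 1, function(sel) paste(codes[sel], collapse = ";"))
unname(selected)
```

A row without any selected item yields an empty string in this sketch.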
I have data that looks like this
ID v1 v2
1 1 0
2 0 1
3 1 0
3 0 1
4 0 1
I want to replace all values with 'NA' if the ID occurs more than once in the dataframe. The final product should look like this
ID v1 v2
1 1 0
2 0 1
3 NA NA
3 NA NA
4 0 1
I could do this by hand, but I want R to detect all the duplicate cases (in this case two times ID '3') and replace the values with 'NA'.
Thanks for your help!
You could use duplicated() from either end, and then replace.
idx <- duplicated(df$ID) | duplicated(df$ID, fromLast = TRUE)
df[idx, -1] <- NA
which gives
ID v1 v2
1 1 1 0
2 2 0 1
3 3 NA NA
4 3 NA NA
5 4 0 1
This will also work if the duplicated IDs are not next to each other.
Data:
df <- structure(list(ID = c(1L, 2L, 3L, 3L, 4L), v1 = c(1L, 0L, 1L,
0L, 0L), v2 = c(0L, 1L, 0L, 1L, 1L)), .Names = c("ID", "v1",
"v2"), class = "data.frame", row.names = c(NA, -5L))
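The claim about non-adjacent duplicates can be checked with a small variation of the data. The data frame df2 below is made up for this purpose; it moves one of the rows with ID 3 to the top.

```r
# Hypothetical data: the two rows with ID 3 are not next to each other
df2 <- data.frame(ID = c(3L, 1L, 2L, 3L, 4L),
                  v1 = c(1L, 1L, 0L, 0L, 0L),
                  v2 = c(0L, 0L, 1L, 1L, 1L))

# flag every row whose ID occurs more than once, scanning from both ends
idx <- duplicated(df2$ID) | duplicated(df2$ID, fromLast = TRUE)
df2[idx, -1] <- NA
df2
```

Rows 1 and 4 (both with ID 3) end up with NA in v1 and v2, while the ID column and the other rows are unchanged.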
One more option:
df1[df1$ID %in% df1$ID[duplicated(df1$ID)], -1] <- NA
#> df1
# ID v1 v2
#1 1 1 0
#2 2 0 1
#3 3 NA NA
#4 3 NA NA
#5 4 0 1
data
df1 <- structure(list(ID = c(1L, 2L, 3L, 3L, 4L), v1 = c(1L, 0L, 1L,
0L, 0L), v2 = c(0L, 1L, 0L, 1L, 1L)), .Names = c("ID", "v1",
"v2"), class = "data.frame", row.names = c(NA, -5L))
Here is a base R method
# get list of repeated IDs
repeats <- rle(df$ID)$values[rle(df$ID)$lengths > 1]
# set the corresponding variables to NA
df[, -1] <- sapply(df[, -1], function(i) {i[df$ID %in% repeats] <- NA; i})
In the first line, we use rle to extract repeated IDs. In the second, we use sapply to loop through the non-ID variables and set the values in rows with a repeated ID to NA.
Note that this assumes that the data set is sorted by ID, because rle only detects runs of adjacent equal values. Sorting may be accomplished with the order function: df <- df[order(df$ID), ].
If the dataset is very large, you might break up the first line into two steps to avoid computing the rle twice:
dfRle <- rle(df$ID)
repeats <- dfRle$values[dfRle$lengths > 1]
data
df <- read.table(header=T, text="ID v1 v2
1 1 0
2 0 1
3 1 0
3 0 1
4 0 1")
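Why the sorting matters can be seen in a short sketch. The vector ids is made up for illustration; it contains ID 3 twice, but not in adjacent positions.

```r
# Hypothetical IDs: 3 occurs twice, but the two occurrences are not adjacent
ids <- c(3L, 1L, 2L, 3L, 4L)

# rle only sees runs of adjacent equal values, so the repeat is missed here
r <- rle(ids)
unsorted_repeats <- r$values[r$lengths > 1]

# after sorting, the two 3s form a run of length 2 and are detected
s <- rle(sort(ids))
sorted_repeats <- s$values[s$lengths > 1]

unsorted_repeats
sorted_repeats
```

On the unsorted vector no repeat is found, whereas after sorting the repeated ID 3 is returned.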
What is the best way to determine a factor or create a new category field based on a number of boolean fields? In this example, I need to count the number of unique combinations of medications.
> MultPsychMeds
ID OLANZAPINE HALOPERIDOL QUETIAPINE RISPERIDONE
1 A 1 1 0 0
2 B 1 0 1 0
3 C 1 0 1 0
4 D 1 0 1 0
5 E 1 0 0 1
6 F 1 0 0 1
7 G 1 0 0 1
8 H 1 0 0 1
9 I 0 1 1 0
10 J 0 1 1 0
Perhaps another way to state it is that I need to pivot or cross tabulate the pairs. The final results need to look something like:
Combination Count
OLANZAPINE/HALOPERIDOL 1
OLANZAPINE/QUETIAPINE 3
OLANZAPINE/RISPERIDONE 4
HALOPERIDOL/QUETIAPINE 2
This data frame can be replicated in R with:
MultPsychMeds <- structure(list(ID = structure(1:10, .Label = c("A", "B", "C",
"D", "E", "F", "G", "H", "I", "J"), class = "factor"), OLANZAPINE = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L), HALOPERIDOL = c(1L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L), QUETIAPINE = c(0L, 1L, 1L, 1L,
0L, 0L, 0L, 0L, 1L, 1L), RISPERIDONE = c(0L, 0L, 0L, 0L, 1L,
1L, 1L, 1L, 0L, 0L)), .Names = c("ID", "OLANZAPINE", "HALOPERIDOL",
"QUETIAPINE", "RISPERIDONE"), class = "data.frame", row.names = c(NA,
-10L))
Here's one approach using the reshape and plyr packages:
library(reshape)
library(plyr)
#Melt into long format
dat.m <- melt(MultPsychMeds, id.vars = "ID")
#Group at the ID level and paste the drugs together with "/"
out <- ddply(dat.m, "ID", summarize, combos = paste(variable[value == 1], collapse = "/"))
#Calculate a table
with(out, count(combos))
x freq
1 HALOPERIDOL/QUETIAPINE 2
2 OLANZAPINE/HALOPERIDOL 1
3 OLANZAPINE/QUETIAPINE 3
4 OLANZAPINE/RISPERIDONE 4
Just for fun, a base R solution (that can be turned into a one-liner :-) ):
data.frame(table(apply(MultPsychMeds[,-1], 1, function(currow){
wc<-which(currow==1)
paste(colnames(MultPsychMeds)[wc+1], collapse="/")
})))
Another way could be:
subset(
as.data.frame(
with(MultPsychMeds, table(OLANZAPINE, HALOPERIDOL, QUETIAPINE, RISPERIDONE)),
responseName="count"
),
count>0
)
which gives
OLANZAPINE HALOPERIDOL QUETIAPINE RISPERIDONE count
4 1 1 0 0 1
6 1 0 1 0 3
7 0 1 1 0 2
10 1 0 0 1 4
It's not exactly the format you want, but it is fast and simple.
There is a shorthand for this in the plyr package:
require(plyr)
count(MultPsychMeds, c("OLANZAPINE", "HALOPERIDOL", "QUETIAPINE", "RISPERIDONE"))
# OLANZAPINE HALOPERIDOL QUETIAPINE RISPERIDONE freq
# 1 0 1 1 0 2
# 2 1 0 0 1 4
# 3 1 0 1 0 3
# 4 1 1 0 0 1