I would like to identify the duration of an activity that can start at t1 and end at t7. The starting point is t1, which records the occurrence of an activity at t1_1, t1_2, t1_3, and so on. For example, for id 12 the activity occurred at t1_2 and t1_3 (I would like to keep this run), t2_2 (there is no activity immediately before or after, so I am not interested in it), t3_1 (same as t2_2), t3_3, t4_2, t5_2, t6_1, t6_2, t6_3 and t7_3. For every id in which an activity occurred, I would like to identify the start and end of each run, its duration, and the most frequent one.
Input:
id t1_1 t1_2 t1_3 t2_1 t2_2 t2_3 t3_1 t3_2 t3_3 t4_1 t4_2 t4_3 t5_1 t5_2 t5_3 t6_1 t6_2 t6_3 t7_1 t7_2 t7_3
12 0 1 1 0 1 0 1 0 1 0 1 0 0 1 0 1 1 1 0 0 1
123 0 0 0 1 1 1 0 0 0 1 1 1 1 1 1 0 0 0 1 1 1
10 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Output for id 12
Id Start/End Duration Frequency
12 t1_1, t1_3 2 1
12 t6_1, t6_3 3 1
One way to do this is with a Bioconductor library, but is there a better solution?
Sample data
df1 <- structure(list(id = c(12L, 123L, 10L), t1_1 = c(0L, 0L, 1L),
t1_2 = c(1L, 0L, 1L), t1_3 = c(1L, 0L, 1L), t2_1 = c(0L,
1L, 1L), t2_2 = c(1L, 1L, 1L), t2_3 = c(0L, 1L, 1L), t3_1 = c(1L,
0L, 1L), t3_2 = c(0L, 0L, 1L), t3_3 = c(1L, 0L, 1L), t4_1 = c(0L,
1L, 1L), t4_2 = c(1L, 1L, 1L), t4_3 = c(0L, 1L, 1L), t5_1 = c(0L,
1L, 1L), t5_2 = c(1L, 1L, 1L), t5_3 = c(0L, 1L, 1L), t6_1 = c(1L,
0L, 1L), t6_2 = c(1L, 0L, 1L), t6_3 = c(1L, 0L, 1L), t7_1 = c(0L,
1L, 1L), t7_2 = c(0L, 1L, 1L), t7_3 = c(1L, 1L, 1L)),
class = "data.frame", row.names = c(NA,
-3L))
We convert to 'long' format with pivot_longer, then create a grouping variable with rleid (from data.table) that increments each time adjacent elements of 'value' differ. We filter the rows where 'value' is 1 and, grouped by 'id' and 'grp', keep only the groups with more than one row. Finally, we summarise by pasting (str_c) the first and last elements of 'name', get the count (n()) as the duration, and arrange if necessary.
library(dplyr)
library(tidyr)
library(stringr)
library(data.table)
df1 %>%
pivot_longer(cols = -id) %>%
mutate(grp = rleid(value)) %>%
filter(as.logical(value)) %>%
group_by(id, grp) %>%
filter(n() > 1) %>%
summarise(Start_End = str_c(first(name), last(name), sep=", "),
Duration = n()) %>%
arrange(id, grp)
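The run-detection idea does not depend on the tidyverse. As a rough sketch in base R, assuming a hypothetical helper find_runs (not part of either answer) and the values for id 12 typed in by hand, rle can report the start, end and duration of each run of 1s longer than one slot. Note that it reports the first active slot (t1_2) as the start of the first run.

```r
# Hypothetical helper, for illustration only: find runs of 1s longer than one slot.
find_runs <- function(values, slots) {
  r <- rle(values)
  ends <- cumsum(r$lengths)          # last position of each run
  starts <- ends - r$lengths + 1     # first position of each run
  keep <- r$values == 1 & r$lengths > 1
  data.frame(Start = slots[starts[keep]],
             End = slots[ends[keep]],
             Duration = r$lengths[keep],
             stringsAsFactors = FALSE)
}

# Slot names t1_1, t1_2, t1_3, t2_1, ..., t7_3
slots <- paste0("t", rep(1:7, each = 3), "_", 1:3)
# Values for id 12, copied from the input table
id12 <- c(0,1,1, 0,1,0, 1,0,1, 0,1,0, 0,1,0, 1,1,1, 0,0,1)

runs12 <- find_runs(id12, slots)
print(runs12)
```

Applying the helper to each row of the wide table (for example with apply or lapply) would give one small data frame per id.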
library('data.table')
df1 <- melt(setDT(df1), id.var = 'id')
df1[, c('time', 'subtime') := tstrsplit(as.character(variable), "_", fixed = TRUE)]
df2 <- df1[, rle(value), by = .(id, time)][lengths > 1 & values == 1, ]
df3 <- df1[df2, on = c('id', 'time')]
df3 <- df3[, .(`Start/End` = paste0(time, '_', c(min(subtime), max(subtime)), collapse = " - "),
Duration = unique(lengths)),
by = .(id, time)]
df3[, Frequency := .N, by = .(id, `Start/End`)]
df3[, time := NULL]
df3[order(id), ]
# id Start/End Duration Frequency
# 1: 10 t1_1 - t1_3 3 1
# 2: 10 t2_1 - t2_3 3 1
# 3: 10 t3_1 - t3_3 3 1
# 4: 10 t4_1 - t4_3 3 1
# 5: 10 t5_1 - t5_3 3 1
# 6: 10 t6_1 - t6_3 3 1
# 7: 10 t7_1 - t7_3 3 1
# 8: 12 t1_1 - t1_3 2 1
# 9: 12 t6_1 - t6_3 3 1
# 10: 123 t2_1 - t2_3 3 1
# 11: 123 t4_1 - t4_3 3 1
# 12: 123 t5_1 - t5_3 3 1
# 13: 123 t7_1 - t7_3 3 1
Related
I would like to identify whether an activity occurs on consecutive days and how often during a week. The starting point is t1, which records the occurrence of an activity at t1_1, t1_2, t1_3 and so on. For example, for id 12 the activity occurred at t1_2, t1_3, t2_2, t3_1, t3_3, t4_2, t5_2, t6_1, t6_2, t6_3 and t7_3. Since activity was reported on all 7 days, I assume the activity occurred consecutively. I would like to identify all ids in which an activity occurred consecutively, and the sum of occurrences.
Input
id t1_1 t1_2 t1_3 t2_1 t2_2 t2_3 t3_1 t3_2 t3_3 t4_1 t4_2 t4_3 t5_1 t5_2 t5_3 t6_1 t6_2 t6_3 t7_1 t7_2 t7_3
12 0 1 1 0 1 0 1 0 1 0 1 0 0 1 0 1 1 1 0 0 1
123 0 0 0 1 1 1 0 0 0 1 1 1 1 1 1 0 0 0 1 1 1
10 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Output
Id Sum
12 11
10 21
Here is an option with rle. Loop over the rows of the dataset (without the 'id' column) with apply (MARGIN = 1), apply rle, and extract the lengths of the runs whose 'values' are 1 ('x1'). If 'x1' contains at least 7 runs or exactly 1 run, return the sum; the single-run case covers a row in which all the values are 1. Then stack the named list into a two-column data.frame and set the column names ('out').
out <- stack(setNames(apply(df1[-1], 1, function(x) {
x1 <- with(rle(x), lengths[as.logical(values)])
if(length(x1) >=7|length(x1) == 1) sum(x1) }), df1$id))[2:1]
names(out) <- c('Id', 'Sum')
out
# Id Sum
#1 12 11
#2 10 21
data
df1 <- structure(list(id = c(12L, 123L, 10L), t1_1 = c(0L, 0L, 1L),
t1_2 = c(1L, 0L, 1L), t1_3 = c(1L, 0L, 1L), t2_1 = c(0L,
1L, 1L), t2_2 = c(1L, 1L, 1L), t2_3 = c(0L, 1L, 1L), t3_1 = c(1L,
0L, 1L), t3_2 = c(0L, 0L, 1L), t3_3 = c(1L, 0L, 1L), t4_1 = c(0L,
1L, 1L), t4_2 = c(1L, 1L, 1L), t4_3 = c(0L, 1L, 1L), t5_1 = c(0L,
1L, 1L), t5_2 = c(1L, 1L, 1L), t5_3 = c(0L, 1L, 1L), t6_1 = c(1L,
0L, 1L), t6_2 = c(1L, 0L, 1L), t6_3 = c(1L, 0L, 1L), t7_1 = c(0L,
1L, 1L), t7_2 = c(0L, 1L, 1L), t7_3 = c(1L, 1L, 1L)),
class = "data.frame", row.names = c(NA,
-3L))
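To make the run-count test in the rle answer more concrete, here is a small worked example for id 12 only. It is a sketch with the row typed in by hand rather than taken from df1; the variable names mirror the answer.

```r
# Values for id 12, copied from the input table
x <- c(0,1,1, 0,1,0, 1,0,1, 0,1,0, 0,1,0, 1,1,1, 0,0,1)

r <- rle(x)
# Lengths of the runs whose value is 1
x1 <- r$lengths[as.logical(r$values)]

x1          # 2 1 1 1 1 1 3 1
length(x1)  # 8 separate runs, so the ">= 7" condition holds
sum(x1)     # 11, the Sum reported for id 12
```

For id 123 the same steps give only three runs (lengths 3, 6 and 3), so neither condition holds and the id is dropped.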
An option using data.table:
melt(DT, id.vars="id")[,
c("day", "time") := tstrsplit(variable, "_")][
value==1L, if(all(paste0("t", 1L:7L) %chin% day)) .(Sum=sum(value)) , id]
output:
id Sum
1: 10 21
2: 12 11
data:
library(data.table)
DT <- fread("id t1_1 t1_2 t1_3 t2_1 t2_2 t2_3 t3_1 t3_2 t3_3 t4_1 t4_2 t4_3 t5_1 t5_2 t5_3 t6_1 t6_2 t6_3 t7_1 t7_2 t7_3
12 0 1 1 0 1 0 1 0 1 0 1 0 0 1 0 1 1 1 0 0 1
123 0 0 0 1 1 1 0 0 0 1 1 1 1 1 1 0 0 0 1 1 1
10 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1")
Explanation:
Convert into long format using melt
use tstrsplit to split the columns names into day of the week and time
filter for value==1L and then for each id, check if all 7 days are in subset before summing (i.e. if(all(paste0("t", 1L:7L) %chin% day)) .(Sum=sum(value)))
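The "all 7 days present" check described in these steps can also be sketched in base R. This is only an illustration under the assumption that the columns are ordered t1_1 ... t7_3 with three slots per day; the matrix vals and the vector day are names made up for this example.

```r
# Input values, one row per id, columns ordered t1_1 ... t7_3
vals <- rbind(
  c(0,1,1, 0,1,0, 1,0,1, 0,1,0, 0,1,0, 1,1,1, 0,0,1),
  c(0,0,0, 1,1,1, 0,0,0, 1,1,1, 1,1,1, 0,0,0, 1,1,1),
  rep(1, 21)
)
rownames(vals) <- c("12", "123", "10")

# Day index (1 to 7) for each of the 21 columns
day <- rep(1:7, each = 3)

# TRUE for ids that have at least one 1 on every day
active_all_days <- apply(vals, 1, function(x) all(tapply(x, day, max) == 1))

# Sum of occurrences, kept only for those ids
sums <- rowSums(vals)[active_all_days]
sums
```

Id 123 is dropped because it has no activity on days 1, 3 and 6.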
I have rows with recurring IDs that I would like to merge. The columns are binary, so I would like to sum them within each ID.
Example before:
id nam1 nam2
1 1 1
1 0 0
2 1 0
2 0 1
3 1 1
3 1 0
Example after:
id nam1 nam2
1 1 1
2 1 1
3 2 1
Any ideas on how to do this?
#d.b's answer in comment:
aggregate(.~id, df, sum)
or using dplyr:
library(dplyr)
df %>%
group_by(id) %>%
summarize_all("sum")
Result:
# A tibble: 3 x 3
id nam1 nam2
<int> <int> <int>
1 1 1 1
2 2 1 1
3 3 2 1
Data
df = structure(list(id = c(1L, 1L, 2L, 2L, 3L, 3L), nam1 = c(1L, 0L,
1L, 0L, 1L, 1L), nam2 = c(1L, 0L, 0L, 1L, 1L, 0L)), .Names = c("id",
"nam1", "nam2"), row.names = c(NA, -6L), class = "data.frame")
#Sample data:
df <- data.frame(id=c(1,1,2,2,3,3),
nam1=c(1,0,1,0,1,1),
nam2=c(1,0,0,1,1,0))
library(data.table)
setDT(df)[, lapply(.SD, sum), by=.(id)]
id nam1 nam2
1 1 1
2 1 1
3 2 1
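One more base R possibility, not mentioned in the answers above, is rowsum, which sums columns within groups. A minimal sketch using the same sample data:

```r
df <- data.frame(id   = c(1, 1, 2, 2, 3, 3),
                 nam1 = c(1, 0, 1, 0, 1, 1),
                 nam2 = c(1, 0, 0, 1, 1, 0))

# Sum the value columns within each id; the ids become the row names
res <- rowsum(df[c("nam1", "nam2")], group = df$id)
res
```

Unlike aggregate, the grouping variable ends up in the row names rather than in an id column, so it may need to be added back with something like res$id <- rownames(res).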
I have a data frame containing the results of a multiple choice question. Each item has either 0 (not mentioned) or 1 (mentioned). The columns are named like this:
F1.2_1, F1.2_2, F1.2_3, F1.2_4, F1.2_5, F1.2_99
etc.
I would like to concatenate these values like this: The new column should be a semicolon-separated string of the selected items. So if a row has a 1 in F1.2_1, F1.2_4 and F1.2_5 it should be: 1;4;5
The last digit(s) of the dichotomous column names are the item codes to be used in the string.
Any idea how this could be achieved with R (and data.table)? Thanks for any help!
edit:
Here is a example DF with the desired result:
structure(list(F1.2_1 = c(0L, 1L, 0L, 1L), F1.2_2 = c(1L, 0L,
0L, 1L), F1.2_3 = c(0L, 1L, 0L, 1L), F1.2_4 = c(0L, 1L, 0L, 0L
), F1.2_5 = c(0L, 0L, 0L, 0L), F1.2_99 = c(0L, 0L, 1L, 0L), desired_result = structure(c(3L,
2L, 4L, 1L), .Label = c("1;2;3", "1;3;4", "2", "99"), class = "factor")), .Names = c("F1.2_1",
"F1.2_2", "F1.2_3", "F1.2_4", "F1.2_5", "F1.2_99", "desired_result"
), class = "data.frame", row.names = c(NA, -4L))
F1.2_1 F1.2_2 F1.2_3 F1.2_4 F1.2_5 F1.2_99 desired_result
1 0 1 0 0 0 0 2
2 1 0 1 1 0 0 1;3;4
3 0 0 0 0 0 1 99
4 1 1 1 0 0 0 1;2;3
In his comment, the OP asked how to deal with more multiple choice questions.
The approach below will be able to handle an arbitrary number of questions and choices for each question. It uses melt() and dcast() from the data.table package.
Sample input data
Let's assume the input data.frame DT for the extended case contains two questions, one with 6 choices and the other with 4 choices:
DT
# F1.2_1 F1.2_2 F1.2_3 F1.2_4 F1.2_5 F1.2_99 F2.7_1 F2.7_2 F2.7_3 F2.7_11
#1: 0 1 0 0 0 0 0 1 1 0
#2: 1 0 1 1 0 0 1 1 1 1
#3: 0 0 0 0 0 1 1 0 1 0
#4: 1 1 1 0 0 0 1 0 1 1
Code
library(data.table)
# coerce to data.table and add row number for later join
setDT(DT)[, rn := .I]
# reshape from wide to long format
molten <- melt(DT, id.vars = "rn")
# alternatively, the measure cols can be specified (in case of other id vars)
# molten <- melt(DT, measure.vars = patterns("^F"))
# split question id and choice id
molten[, c("question_id", "choice_id") := tstrsplit(variable, "_")]
# reshape only selected choices from long to wide format,
# thereby pasting together the ids of the selected choices for each question
result <- dcast(molten[value == 1], rn ~ question_id, paste, collapse = ";",
fill = NA, value.var = "choice_id")
# final join for demonstration only, remove row number as no longer needed
DT[result, on = "rn"][, rn := NULL][]
# F1.2_1 F1.2_2 F1.2_3 F1.2_4 F1.2_5 F1.2_99 F2.7_1 F2.7_2 F2.7_3 F2.7_11 F1.2 F2.7
#1: 0 1 0 0 0 0 0 1 1 0 2 2;3
#2: 1 0 1 1 0 0 1 1 1 1 1;3;4 1;2;3;11
#3: 0 0 0 0 0 1 1 0 1 0 99 1;3
#4: 1 1 1 0 0 0 1 0 1 1 1;2;3 1;3;11
For each question, the final result shows which choices were selected in each row.
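For comparison, the same per-question result can be sketched without reshaping, in base R. This is an illustrative alternative rather than the method above; the data frame answers and the helper vectors questions and codes are names assumed for this example, and a row with no selected choice would yield an empty string instead of NA.

```r
answers <- data.frame(
  F1.2_1 = c(0, 1, 0, 1), F1.2_2 = c(1, 0, 0, 1), F1.2_3 = c(0, 1, 0, 1),
  F1.2_4 = c(0, 1, 0, 0), F1.2_5 = c(0, 0, 0, 0), F1.2_99 = c(0, 0, 1, 0),
  F2.7_1 = c(0, 1, 1, 1), F2.7_2 = c(1, 1, 0, 0), F2.7_3 = c(1, 1, 1, 1),
  F2.7_11 = c(0, 1, 0, 1)
)

# Split each column name into question id (before "_") and choice id (after "_")
questions <- sub("_.*", "", names(answers))
codes <- sub(".*_", "", names(answers))

# For each question, paste the codes of the selected choices row by row
res <- sapply(split(seq_along(answers), questions), function(cols) {
  apply(answers[cols] == 1, 1, function(sel) {
    paste(codes[cols][sel], collapse = ";")
  })
})
res
```

The result is a character matrix with one column per question, which can be bound to the original data with cbind.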
Reproducible data
The sample data can be created with
DT <- structure(list(F1.2_1 = c(0L, 1L, 0L, 1L), F1.2_2 = c(1L, 0L,
0L, 1L), F1.2_3 = c(0L, 1L, 0L, 1L), F1.2_4 = c(0L, 1L, 0L, 0L
), F1.2_5 = c(0L, 0L, 0L, 0L), F1.2_99 = c(0L, 0L, 1L, 0L), F2.7_1 = c(0L,
1L, 1L, 1L), F2.7_2 = c(1L, 1L, 0L, 0L), F2.7_3 = c(1L, 1L, 1L,
1L), F2.7_11 = c(0L, 1L, 0L, 1L)), .Names = c("F1.2_1", "F1.2_2",
"F1.2_3", "F1.2_4", "F1.2_5", "F1.2_99", "F2.7_1", "F2.7_2",
"F2.7_3", "F2.7_11"), row.names = c(NA, -4L), class = "data.frame")
We can try
j1 <- do.call(paste, c(as.integer(sub(".*_", "",
names(DF)[-7]))[col(DF[-7])]*DF[-7], sep=";"))
DF$newCol <- gsub("^;+|;+$", "", gsub(";*0;|0$|^0", ";", j1))
DF$newCol
#[1] "2" "1;3;4" "99" "1;2;3"
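The one-liner above is dense, so here is a step-by-step sketch of the same multiplication idea: each 1 is replaced by its item code and the zeros are then dropped. The intermediate names m and coded are assumptions for this example, and the final step uses paste on the non-zero entries instead of the regular expressions.

```r
DF <- data.frame(F1.2_1 = c(0, 1, 0, 1), F1.2_2 = c(1, 0, 0, 1),
                 F1.2_3 = c(0, 1, 0, 1), F1.2_4 = c(0, 1, 0, 0),
                 F1.2_5 = c(0, 0, 0, 0), F1.2_99 = c(0, 0, 1, 0))

# Item codes taken from the part of each column name after "_"
codes <- as.integer(sub(".*_", "", names(DF)))

m <- as.matrix(DF)
# col(m) gives the column index of every cell, so each 1 becomes its item code
coded <- m * codes[col(m)]

# Keep the non-zero codes of each row and join them with ";"
newCol <- apply(coded, 1, function(r) paste(r[r != 0], collapse = ";"))
newCol
```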
I have a question about calculating the rate of 1s within each id number.
Here is the sample dataset d:
id answer
1 1
1 0
1 0
1 1
1 1
1 1
1 0
2 0
2 0
2 0
3 1
3 0
The ideal output is
id rate freq
1 4/7 (=0.5714) 7
2 0 3
3 1/2 (=0.5) 2
Thanks.
Just for fun, you can use aggregate
> aggregate(answer~id, function(x) c(rate=mean(x), freq=length(x)), data=df1)
id answer.rate answer.freq
1 1 0.5714286 7.0000000
2 2 0.0000000 3.0000000
3 3 0.5000000 2.0000000
Try
library(data.table)
setDT(df1)[,list(rate= mean(answer), freq=.N) ,id]
# id rate freq
#1: 1 0.5714286 7
#2: 2 0.0000000 3
#3: 3 0.5000000 2
Or
library(dplyr)
df1 %>%
group_by(id) %>%
summarise(rate=mean(answer), freq=n())
data
df1 <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
3L, 3L), answer = c(1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L,
0L)), .Names = c("id", "answer"), class = "data.frame",
row.names = c(NA, -12L))
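The answers above return the rate as a decimal only. If the fraction form from the ideal output (such as 4/7) is also wanted, one possible sketch in base R builds it with sprintf. The column name label is an assumption made for this example.

```r
d <- data.frame(id = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3),
                answer = c(1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0))

hits <- tapply(d$answer, d$id, sum)     # number of 1s per id
freq <- tapply(d$answer, d$id, length)  # number of rows per id

out <- data.frame(
  id    = names(hits),
  rate  = as.vector(hits / freq),
  label = sprintf("%d/%d", as.integer(hits), as.integer(freq)),
  freq  = as.vector(freq)
)
out
```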
I have a dataframe which looks like this:
>head(df)
chrom pos strand ref alt A_pos A_neg C_pos C_neg G_pos G_neg T_pos T_neg
chr1 2283161 - G A 3 1 2 0 0 0 0 0
chr1 2283161 - G A 3 1 2 0 0 0 0 0
chr1 2283313 - G C 0 0 0 0 0 0 0 0
chr1 2283313 - G C 0 0 0 0 0 0 0 0
chr1 2283896 - G A 0 0 0 0 0 0 0 0
chr1 2283896 + G A 0 0 0 0 0 0 0 0
I want to extract the value from columns 6:13 (A_pos...T_neg) based on the values of the columns 'strand', 'ref' and 'alt'. For instance, in row 1: strand = '-', ref = 'G' and alt = 'A', so I should extract the values from G_neg and A_neg. Likewise, in row 6: strand = '+', ref = 'G' and alt = 'A', so I should get the values from G_pos and A_pos. I intend to do a chi-square test after extracting these values (they are my observed values; I have another set of expected values), but that is another story.
So the logic for a single row i is somewhat like:
if (df$strand[i] == "+") {
  ref_col <- paste(df$ref[i], "pos", sep = "_")  # name of the column <ref>_pos to extract
  alt_col <- paste(df$alt[i], "pos", sep = "_")  # name of the column <alt>_pos to extract
} else if (df$strand[i] == "-") {
  ref_col <- paste(df$ref[i], "neg", sep = "_")  # name of the column <ref>_neg to extract
  alt_col <- paste(df$alt[i], "neg", sep = "_")  # name of the column <alt>_neg to extract
}
Here, I am trying to use paste on the values in 'ref' and 'alt' to get the desired column names. For instance, if strand ='+' and ref = 'G', it will fetch value from column G_pos.
The data frame is actually large, so I ruled out using for-loops. I am not sure how else I can do this to make the code as efficient as possible. Any help or suggestions would be appreciated.
Thanks!
Another alternative that looks valid, at least with the sample data:
tmp = ifelse(as.character(DF$strand) == "-", "neg", "pos")
sapply(DF[c("ref", "alt")],
function(x) as.integer(DF[cbind(seq_len(nrow(DF)),
match(paste(x, tmp, sep = "_"), names(DF)))]))
# ref alt
#[1,] 0 1
#[2,] 0 1
#[3,] 0 0
#[4,] 0 0
#[5,] 0 0
#[6,] 0 0
Where DF:
DF = structure(list(chrom = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "chr1", class = "factor"),
pos = c(2283161L, 2283161L, 2283313L, 2283313L, 2283896L,
2283896L), strand = structure(c(1L, 1L, 1L, 1L, 1L, 2L), .Label = c("-",
"+"), class = "factor"), ref = structure(c(1L, 1L, 1L, 1L,
1L, 1L), .Label = "G", class = "factor"), alt = structure(c(1L,
1L, 2L, 2L, 1L, 1L), .Label = c("A", "C"), class = "factor"),
A_pos = c(3L, 3L, 0L, 0L, 0L, 0L), A_neg = c(1L, 1L, 0L,
0L, 0L, 0L), C_pos = c(2L, 2L, 0L, 0L, 0L, 0L), C_neg = c(0L,
0L, 0L, 0L, 0L, 0L), G_pos = c(0L, 0L, 0L, 0L, 0L, 0L), G_neg = c(0L,
0L, 0L, 0L, 0L, 0L), T_pos = c(0L, 0L, 0L, 0L, 0L, 0L), T_neg = c(0L,
0L, 0L, 0L, 0L, 0L)), .Names = c("chrom", "pos", "strand",
"ref", "alt", "A_pos", "A_neg", "C_pos", "C_neg", "G_pos", "G_neg",
"T_pos", "T_neg"), class = "data.frame", row.names = c(NA, -6L
))
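The answer above relies on indexing with a two-column matrix of (row, column) pairs, which picks one cell per row. A tiny sketch of that mechanism on a made-up matrix (the values and the vector wanted are purely illustrative):

```r
# Toy matrix with three rows and four named columns
m <- matrix(1:12, nrow = 3,
            dimnames = list(NULL, c("A_pos", "A_neg", "G_pos", "G_neg")))

# The column wanted from each row, by name
wanted <- c("G_neg", "A_pos", "A_neg")

# One (row, column) pair per row; match() turns names into column positions
idx <- cbind(seq_len(nrow(m)), match(wanted, colnames(m)))

picked <- m[idx]
picked  # 10 2 6
```

Because the lookup is a single vectorised indexing operation, it avoids an explicit loop over rows.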
Not very elegant, but does the job:
strand.map <- c("-"="_neg", "+"="_pos")
cbind(
df[1:5],
do.call(
rbind,
lapply(
split(df[-(1:2)], 1:nrow(df)),
function(x)
c(
ref=x[-(1:2)][, paste0(x[[2]], strand.map[as.character(x[[1]])])],  # as.character: look up by strand label, not factor code
alt=x[-(1:2)][, paste0(x[[3]], strand.map[as.character(x[[1]])])]
) ) ) )
We cycle through each row in your data frame and apply a function that pulls the value based on strand, ref, and alt. This produces:
chrom pos strand ref alt ref alt
1 chr1 2283161 - G A 0 1
2 chr1 2283161 - G A 0 1
3 chr1 2283313 - G C 0 0
4 chr1 2283313 - G C 0 0
5 chr1 2283896 - G A 0 0
6 chr1 2283896 + G A 0 0
An alternate approach is to use melt, but the format of your data makes it rather annoying because we need two melts in a row, and we need to create a unique id column so we can reconstitute the data frame once we're done computing.
library(reshape2)  # provides melt() and dcast() for data frames
df$id <- 1:nrow(df)
df.mlt <-
melt(
melt(df, id.vars=c("id", "chrom", "pos", "strand", "ref", "alt")),
measure.vars=c("ref", "alt"), value.name="base",
variable.name="alt_or_ref"
)
dcast(
subset(df.mlt, paste0(base, strand.map[as.character(strand)]) == variable),
id + chrom + pos + strand ~ alt_or_ref,
value.var="value"
)
Which produces:
id chrom pos strand ref alt
1 1 chr1 2283161 - 0 1
2 2 chr1 2283161 - 0 1
3 3 chr1 2283313 - 0 0
4 4 chr1 2283313 - 0 0
5 5 chr1 2283896 - 0 0
6 6 chr1 2283896 + 0 0
Another way
testFunc <- function(x){
posneg <- if(x["strand"] == "-") {"neg"} else {"pos"}
cbind(as.numeric(x[paste0(x["ref"],"_",posneg)]), as.numeric(x[paste0(x["alt"],"_",posneg)]))
}
temp <- t(apply(df, 1, testFunc))
colnames(temp) <- c("ref", "alt")
using the [very] fast data.table library:
library(data.table)
df = fread('df.txt') # fastread
df[,ref := ifelse(strand == "-",
paste(ref,"neg",sep = "_"),
paste(ref,"pos",sep = "_"))]
df[,alt := ifelse(strand == "-",
paste(alt,"neg",sep = "_"),
paste(alt,"pos",sep = "_"))]
df[,strand := NULL] # not required anymore
dfm = melt(df,
id.vars = c("chrom","pos","ref","alt"),
variable.name = "mycol", value.name = "value")
dfm[mycol == ref | mycol == alt,] # matching