Efficiently Repeating Observations by Group - R

I am trying to find an efficient way to repeat rows by group in data.table for only some groups. Please consider the following example:
library(data.table)
DT <- data.table(x = c("A","A", "B", "B", "C","C", "D","D"),
y = 1:8)
This dataset looks like:
head(DT)
x y
1: A 1
2: A 2
3: B 3
4: B 4
5: C 5
6: C 6
Say I have a separate vector rep <- c("A", "A", "A", "B", "B", "C"). Given this vector, I want to be able to repeat all rows of A three times (because "A" occurs three times in rep) and all rows associated with B two times. Thus, the final dataset should look like:
x y
1: A 1
2: A 2
3: A 1
4: A 2
5: A 1
6: A 2
7: B 3
8: B 4
9: B 3
10: B 4
11: C 5
12: C 6
Notice that I did not repeat "C" because "C" occurs only once in rep. I have a hackish way of doing this procedure at the moment, but I'm wondering if there is a more efficient data.table way of doing the above.
Thank you!
P.S. The reason I am doing this is because I am doing some matching with replacement in my regressions and sometimes, the same control firm is assigned to more than one treatment firm.

A data.table merge won't give you the same ordering, but you aren't supposed to rely on row order in data.tables anyway:
merge(DT, data.frame(x=rep), by="x")
x y
1: A 1
2: A 1
3: A 1
4: A 2
5: A 2
6: A 2
7: B 3
8: B 3
9: B 4
10: B 4
11: C 5
12: C 6
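If the block ordering from the question matters, here is a sketch of one way to reproduce it (not part of the answer above): repeat each group's whole block of rows as many times as the group occurs in the vector. I use grp instead of the question's rep to avoid masking base::rep:
library(data.table)
grp <- c("A", "A", "A", "B", "B", "C")
grp_counts <- table(grp)                      # A = 3, B = 2, C = 1
# repeat each group's block grp_counts[x] times, preserving the
# within-group order (A1, A2, A1, A2, A1, A2, B3, B4, B3, B4, C5, C6)
DT[x %in% names(grp_counts),
   .SD[rep(seq_len(.N), times = grp_counts[[.BY$x]])],
   by = x]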

One option is to left join onto the vector of group occurrences (this needs dplyr for left_join and %>%):
library(data.table)
library(dplyr)
DT <- data.table(x = c("A","A", "B", "B", "C","C", "D","D"),
y = 1:8)
rep_vec <- c("A", "A", "A", "B", "B", "C")
rep_DT <- DT %>%
left_join(data.frame(group = rep_vec), by = c("x" = "group"))
Are you sure duplicating rows in a dataframe is your ideal option though?

We can do
DT[ data.table(x = v1)[, .N, x], on = .(x)][rep(seq_len(.N), N)]
Or to return in the same order
DT[, .(y = list(y)), x][data.table(x = v1), on = .(x)][, .(x, y = unlist(y))]
data
v1 <- c("A", "A", "A", "B", "B", "C")
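For reference, here is my unrolled reading of the first one-liner, as a sketch (the intermediate names counts and joined are mine):
# per-group multiplicities from v1: A = 3, B = 2, C = 1
counts <- data.table(x = v1)[, .N, x]
# join DT to the counts: one block of matching rows per group in v1, with N attached
joined <- DT[counts, on = .(x)]
# repeat each row N times
joined[rep(seq_len(.N), N)]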


.N in data.table with conditions

I am trying to compute multiple .N counts in a data.table, subject to conditions.
I have a data.table
SD = data.table(x=c(1,1,1,1,2,2,2,2), y=c("a", "a", "b", "b","a", "a",
"b", "b"), z= c("x", "x", "y","y", "x","z","x","w"))
I want to count two things:
counts of each combination by x and y
I did it this way
SD[,.N, .(x,y)]
and output is what I want
x y N
1: 1 a 2
2: 1 b 2
3: 2 a 2
4: 2 b 2
counts of each combination by x and y where z == "x"
As a quick solution I did
SD[z == "x", .N, .(x,y)]
merge.data.table(SD[,.N, .(x,y)],
SD[z == "x", .N, .(x,y)],
by = c('x','y'), all.x = TRUE)
Is there any way to do both in one line of code without joins?
It works but takes a long time due to the large size of the data.table.
Any help is appreciated
How about this one:
SD[, .(.N, count2 = sum(z == 'x')), .(x,y)]
# x y N count2
# 1: 1 a 2 2
# 2: 1 b 2 0
# 3: 2 a 2 1
# 4: 2 b 2 1
In addition to counting the number of rows by unique x-y combination, we also count the number of rows where z == 'x' in each x-y combination.
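A sketch extending the same idea (the extra column names are mine): since sum() over a logical vector counts its TRUE values, any number of conditional counts can be folded into the same grouped pass:
SD[, .(N = .N,
count2 = sum(z == 'x'),
count3 = sum(z == 'y')), .(x,y)]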

R - building new variables from sequenced data

This is an update / follow-up on this question. The answer outlined there doesn't meet the new requirements.
I am looking for an efficient way (data.table?) to construct two new measures for each ID.
Measure 1 and Measure 2 need to meet the following conditions:
Condition 1:
Find a sequence of three rows for which:
the first has count > 0,
the second has count > 1, and
the third has count == 1.
Condition 2 for Measure 1:
take the elements in product of the third row of the sequence that are:
in the product of the second row of the sequence, and
NOT in the stock of the first row of the sequence.
Condition 2 for Measure 2:
take the elements in product of the third row of the sequence that are:
NOT in the product of the second row of the sequence, and
NOT in the stock of the first row of the sequence.
Data:
df2 <- data.frame(ID = c(1,1,1,1,1,1,1,2,2,2,3,3,3,3),
seqs = c(1,2,3,4,5,6,7,1,2,3,1,2,3,4),
count = c(2,1,3,1,1,2,3,1,2,1,3,1,4,1),
product = c("A", "B", "C", "A,C,E", "A,B", "A,B,C", "D", "A", "B", "A", "A", "A,B,C", "D", "D"),
stock = c("A", "A,B", "A,B,C", "A,B,C,E", "A,B,C,E", "A,B,C,E", "A,B,C,D,E", "A", "A,B", "A,B", "A", "A,B,C", "A,B,C,D", "A,B,C,D"))
> df2
ID seqs count product stock
1 1 1 2 A A
2 1 2 1 B A,B
3 1 3 3 C A,B,C
4 1 4 1 A,C,E A,B,C,E
5 1 5 1 A,B A,B,C,E
6 1 6 2 A,B,C A,B,C,E
7 1 7 3 D A,B,C,D,E
8 2 1 1 A A
9 2 2 2 B A,B
10 2 3 1 A A,B
11 3 1 3 A A
12 3 2 1 A,B,C A,B,C
13 3 3 4 D A,B,C,D
14 3 4 1 D A,B,C,D
The desired output looks like this:
ID seq1 seq2 seq3 measure1 measure2
1: 1 2 3 4 C E
2: 2 1 2 3
3: 3 2 3 4 D
How would you code this?
A few things you need to know to be able to do this:
shift function to compare values in your groups
separate_rows function to split your strings to get to the normalised data view.
library(data.table)
dt <- data.table(ID = c(1,1,1,1,1,1,1,2,2,2,3,3,3,3),
seqs = c(1,2,3,4,5,6,7,1,2,3,1,2,3,4),
count = c(2,1,3,1,1,2,3,1,2,1,3,1,4,1),
product = c("A", "B", "C", "A,C,E", "A,B", "A,B,C", "D", "A", "B", "A", "A", "A,B,C", "D", "D"),
stock = c("A", "A,B", "A,B,C", "A,B,C,E", "A,B,C,E", "A,B,C,E", "A,B,C,D,E", "A", "A,B", "A,B", "A", "A,B,C", "A,B,C,D", "A,B,C,D"))
dt[, count.2 := shift(count, type = "lead")]
dt[, count.3 := shift(count, n = 2, type = "lead")]
dt[, product.2 := shift(product, type = "lead")]
dt[, product.3 := shift(product, n = 2, type = "lead")]
dt <- dt[count > 0 & count.2 > 1 & count.3 == 1]
dt <- unique(dt, by = "ID")
library(tidyr)
dt.measure <- separate_rows(dt, product.3, sep = ",")
dt.measure <- separate_rows(dt.measure, stock, sep = ",")
dt.measure <- separate_rows(dt.measure, product, sep = ",")
dt.measure[, measure.1 := (product.3 == product.2 & product.3 != stock)]
dt.measure[, measure.2 := (product.3 != product.2 & product.3 != stock)]
res <- dt.measure[,
.(
measure.1 = max(ifelse(measure.1, product.3, NA_character_), na.rm = TRUE),
measure.2 = max(ifelse(measure.2, product.3, NA_character_), na.rm = TRUE)
),
ID
]
dt <- merge(dt, res, by = "ID")
dt[, .(ID, measure.1, measure.2)]
# ID measure.1 measure.2
# 1: 1 C E
# 2: 2 <NA> <NA>
# 3: 3 D <NA>
I'm not sure what the criterion for efficient is, but here's an approach using embed and tidyverse style. It filters down so you are working with less and less data.
Loading up the data and packages (note that later on setdiff and intersect are from dplyr):
library(purrr)
library(dplyr)
df1 <- data.frame(ID = c(1,1,1,1,1,1,1,2,2,2,3,3,3,3),
seqs = c(1,2,3,4,5,6,7,1,2,3,1,2,3,4),
count = c(2,1,3,1,1,2,3,1,2,1,3,1,4,1),
product = c("A", "B", "C", "A,C,E", "A,B",
"A,B,C", "D", "A", "B", "A", "A",
"A,B,C", "D", "D"),
stock = c("A", "A,B", "A,B,C", "A,B,C,E", "A,B,C,E",
"A,B,C,E", "A,B,C,D,E", "A", "A,B", "A,B", "A",
"A,B,C", "A,B,C,D", "A,B,C,D"),
stringsAsFactors = FALSE)
Define a helper function to evaluate condition 1
meetsCond1 <- function(rseg) {
seg <- rev(rseg)
all(seg[1] > 0, seg[2] > 1, seg[3] == 1)
}
The embed function wraps a time series into a matrix where essentially each row is a window of the length of interest. Using apply, you filter down to which rows start relevant sequences.
cond1Match<- embed(df1$count, 3) %>%
apply(1, meetsCond1) %>%
which()
You can translate that back to final products, the previous products, and stock rows of interest to determine the measures by adding offsets. Split them into a list of individual components.
finalProds <- df1$product[cond1Match + 2] %>%
strsplit(",")
prevProds <- df1$product[cond1Match + 1] %>%
strsplit(",")
initialStock <- df1$stock[cond1Match] %>%
strsplit(",")
For both measures, the candidate elements cannot be in the stock.
notStock <- map2(finalProds, initialStock, ~.x[!(.x %in% .y)])
Then generate your data.frame by retrieving the seqs and ID values of the window. The measures then are just the intersect and setdiff of the final products with those in the previous rows.
data.frame(ID = df1$ID[cond1Match],
seq1 = df1$seqs[cond1Match],
seq2 = df1$seqs[cond1Match + 1],
seq3 = df1$seqs[cond1Match + 2],
measure1 = imap_chr(notStock,
~intersect(.x, prevProds[[.y]]) %>%
{if(length(.) == 0) "" else paste(., collapse = ",")}
),
measure2 = imap_chr(notStock,
~setdiff(.x, prevProds[[.y]]) %>%
{if(length(.) == 0) "" else paste(., collapse = ",")}
),
stringsAsFactors = FALSE
) %>%
slice(match(unique(ID), ID))
which yields the desired output, which seems limited to at most one row per ID. In the original post you specify that you want all matches reported; removing the slice call would then instead yield
#> ID seq1 seq2 seq3 measure1 measure2
#> 1 1 2 3 4 C E
#> 2 1 6 7 1
#> 3 2 1 2 3
#> 4 2 3 1 2 C
#> 5 3 2 3 4 D
If you're looking to really squeeze efficiency, you might be able to gain some by inlining the definitions of finalProds, prevProds, and initialStock instead of assigning them to variables first. I would imagine that unless your set of matches is really large, the difference is negligible.
A rolling window approach using data.table with base R code in j:
library(data.table)
cols <- c("product", "stock")
setDT(df2)[, (cols) := lapply(.SD, function(x) strsplit(as.character(x), split=",")), .SDcols=cols]
ans <- df2[,
transpose(lapply(1L:(.N-2L), function(k) {
if(count[k]>0 && count[k+1L]>1 && count[k+2L]==1) {
m1 <- setdiff(intersect(product[[k+2L]], product[[k+1L]]), stock[[k]])
m2 <- setdiff(setdiff(product[[k+2L]], product[[k+1L]]), stock[[k]])
c(seq1=seqs[k], seq2=seqs[k+1L], seq3=seqs[k+2L],
measure1=if(length(m1) > 0) paste(m1, collapse=",") else "",
measure2=if(length(m2) > 0) paste(m2, collapse=",") else "")
}
}), ignore.empty=TRUE),
ID]
setnames(ans, names(ans)[-1L], c(paste0("seq", 1:3), paste0("measure", 1:2)))
ans
output:
ID seq1 seq2 seq3 measure1 measure2
1: 1 2 3 4 C E
2: 2 1 2 3
3: 3 2 3 4 D

Merging two data frames of different length by group id [duplicate]

This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 3 years ago.
I am trying to merge two data frames by group id. However, both data frames are not of the same length and some elements of certain groups are missing in the second data frame. In the merged file, the missing elements of a certain group should be NAs.
The data looks something like this
df1 <- data.frame(id = c(1,1,1,2,3,3,4), x = c("a", "b", "c", "d", "e", "f", "g"))
df2 <- data.frame(id = c(1,1,2,3,4), y = c("A", "B", "D", "E", "G"))
Ideally, the result would look like this:
id x y
1 a A
1 b B
1 c <NA>
2 d D
3 e E
3 f <NA>
4 g G
It would be great if the code worked for additional columns that also correspond to the same group ids but may miss elements at different places.
I have tried full_join and merge so far but without success, as they just repeat the y values instead of introducing NAs.
I know there are similar questions out there, but I have found none that solves this problem. Any help is appreciated.
This data.table solution might work.
First, create row ids per group, then join by id on these row ids.
library(data.table)
dt1 <- data.table(id = c(1,1,1,2,3,3,4), x = c("a", "b", "c", "d", "e", "f", "g"))
dt2 <- data.table(id = c(1,1,2,3,4), y = c("A", "B", "D", "E", "G"))
# number rows within each group
dt1[ , row_id := seq_len(.N), by = .(id)]
dt2[ , row_id := seq_len(.N), by = .(id)]
dt1[dt2, y := i.y, on = .(id, row_id)][, row_id := NULL][]
# id x y
# 1: 1 a A
# 2: 1 b B
# 3: 1 c <NA>
# 4: 2 d D
# 5: 3 e E
# 6: 3 f <NA>
# 7: 4 g G
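For comparison, here is a sketch of the same row-number trick with dplyr (my translation, not part of the answer above):
library(dplyr)
df1 %>%
  group_by(id) %>%
  mutate(row_id = row_number()) %>%         # number rows within each group
  left_join(df2 %>% group_by(id) %>% mutate(row_id = row_number()) %>% ungroup(),
            by = c("id", "row_id")) %>%     # unmatched rows get NA in y
  ungroup() %>%
  select(-row_id)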

Removing a row in a data frame depending on the value of the previous row (R)

I am trying to clean some data after importing it as a data frame to R.
My data looks like this:
Event Time
A 10:59:36
B 11:00:27
A 11:01:36
B 11:02:01
A 11:02:15
A 11:02:20
B 11:02:45
Time values are POSIXct objects; events are strings.
The correct form for the data should be:
A followed by B.
However, sometimes A is followed by A and B is followed by B. This is an error and I need to remove the latter row.
So, if two subsequent rows have the same value for 'Event' the second row has to be removed.
Any help would be appreciated.
We can do this with rleid from data.table
library(data.table)
setDT(df1)[!duplicated(rleid(Event))]
# Event Time
#1: A 10:59:36
#2: B 11:00:27
#3: A 11:01:36
#4: B 11:02:01
#5: A 11:02:15
#6: B 11:02:45
data
df1 <- structure(list(Event = c("A", "B", "A", "B", "A",
"A", "B"),
Time = c("10:59:36", "11:00:27", "11:01:36", "11:02:01",
"11:02:15", "11:02:20", "11:02:45")), .Names = c("Event",
"Time"), class = "data.frame", row.names = c(NA, -7L))
You can use the cumsum() and rle() functions to achieve what you want:
events <- data.frame(Event=c("A", "B", "A", "B", "A", "A", "B"),
Time=c("10:59:36", "11:00:27", "11:01:36",
"11:02:01", "11:02:15", "11:02:20", "11:02:45"))
# cumsum of the run lengths gives the index of the last row in each run
rows.keep <- cumsum(rle(as.character(events$Event))$lengths)
# for runs of length 2, step back one row so the first of the pair is kept
# (note: this assumes no run of identical events is longer than two)
y <- c(FALSE, rows.keep[1:(length(rows.keep) - 1)] == rows.keep[2:length(rows.keep)] - 2)
rows.keep[y] <- rows.keep[y] - 1
events <- events[rows.keep, ]
> events
Event Time
1 A 2016-01-25 10:59:36
2 B 2016-01-25 11:00:27
3 A 2016-01-25 11:01:36
4 B 2016-01-25 11:02:01
5 A 2016-01-25 11:02:15
6 B 2016-01-25 11:02:45
A dplyr solution. The row_number condition is a little clumsy for my tastes, but it might nonetheless be more readable than other solutions.
library(dplyr)
Time <- as.POSIXct("2016-01-25 10:59:36")
set.seed(10)
dat <-
data_frame(Event = sample(c("A", "B"), size = 15, replace = TRUE)) %>%
mutate(Time = Sys.time() + rnorm(15, 0, 999)) %>%
arrange(Time)
dat %>%
arrange(Time) %>%
filter(Event != lag(Event) | row_number(Time) == 1)
# Source: local data frame [9 x 2]
#
# Event Time
# (chr) (time)
# 1 B 2016-01-25 18:36:16
# 2 A 2016-01-25 18:46:30
# 3 B 2016-01-25 18:55:18
# 4 A 2016-01-25 18:58:18
# 5 B 2016-01-25 19:03:10
# 6 A 2016-01-25 19:07:20
# 7 B 2016-01-25 19:09:24
# 8 A 2016-01-25 19:14:35
# 9 B 2016-01-25 19:26:27
Without the | row_number(Time) == 1 condition, the first row would be omitted. Note that if there are multiple duplicated consecutive Events, only the first will be retained.
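An alternative sketch: give lag() an explicit default so the first row compares unequal and the row_number() guard becomes unnecessary (this assumes no real event is literally named "start"):
dat %>%
  arrange(Time) %>%
  filter(Event != lag(Event, default = "start"))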

Get number of same individuals for different groups

I have a data set with individuals (ID) that can be part of more than one group.
Example:
library(data.table)
DT <- data.table(
ID = rep(1:5, c(3:1, 2:3)),
Group = c("A", "B", "C", "B",
"C", "A", "A", "C",
"A", "B", "C")
)
DT
# ID Group
# 1: 1 A
# 2: 1 B
# 3: 1 C
# 4: 2 B
# 5: 2 C
# 6: 3 A
# 7: 4 A
# 8: 4 C
# 9: 5 A
# 10: 5 B
# 11: 5 C
I want to know, for each pair of groups, how many individuals they have in common.
The result should look like this:
Group.1 Group.2 Sum
A B 2
A C 3
B C 3
Where Sum indicates the number of individuals the two groups have in common.
Here's my version:
# size-1 IDs can't contribute; skip
DT[ , if (.N > 1)
# simplify = FALSE returns a list;
# transpose turns the 3-length list of 2-length vectors
# into a length-2 list of 3-length vectors (efficiently)
transpose(combn(Group, 2L, simplify = FALSE)), by = ID
][ , .(Sum = .N), keyby = .(Group.1 = V1, Group.2 = V2)]
With output:
# Group.1 Group.2 Sum
# 1: A B 2
# 2: A C 3
# 3: B C 3
As of version 1.9.8 (on CRAN 25 Nov 2016), data.table has gained the ability to do non-equi joins. So, a self non-equi join can be used:
library(data.table) # v1.9.8+
setDT(DT)[, Group:= factor(Group)]
DT[DT, on = .(ID, Group < Group), nomatch = 0L, .(ID, x.Group, i.Group)][
, .N, by = .(x.Group, i.Group)]
x.Group i.Group N
1: A B 2
2: A C 3
3: B C 3
Explanation
The non-equi join on ID, Group < Group is a data.table version of combn() (but applied group-wise):
DT[DT, on = .(ID, Group < Group), nomatch = 0L, .(ID, x.Group, i.Group)]
ID x.Group i.Group
1: 1 A B
2: 1 A C
3: 1 B C
4: 2 B C
5: 4 A C
6: 5 A B
7: 5 A C
8: 5 B C
We self-join the dataset on 'ID', subset the rows where the 'Group' columns differ, count the rows (.N) grouped by the 'Group' columns, order 'Group.1' and 'Group.2' within each row using pmin/pmax, and take the unique value of 'N'.
library(data.table)#v1.9.6+
DT[DT, on='ID', allow.cartesian=TRUE][Group!=i.Group, .N ,.(Group, i.Group)][,
list(Sum=unique(N)) ,.(Group.1=pmin(Group, i.Group), Group.2=pmax(Group, i.Group))]
# Group.1 Group.2 Sum
#1: A B 2
#2: A C 3
#3: B C 3
Or, as mentioned in the comments by @MichaelChirico and @Frank, we can convert 'Group' to factor class, subset the rows based on as.integer(Group) < as.integer(i.Group), group by 'Group' and 'i.Group', and count the rows (.N):
DT[, Group:= factor(Group)]
DT[DT, on='ID', allow.cartesian=TRUE][as.integer(Group) < as.integer(i.Group), .N,
by = .(Group.1= Group, Group.2= i.Group)]
Great answers above.
Just an alternative using dplyr in case you, or someone else, is interested.
library(dplyr)
cmb = combn(unique(DT$Group), 2)
data.frame(g1 = cmb[1,],
g2 = cmb[2,]) %>%
group_by(g1,g2) %>%
summarise(l=length(intersect(DT[DT$Group==g1,]$ID,
DT[DT$Group==g2,]$ID)))
# g1 g2 l
# (fctr) (fctr) (int)
# 1 A B 2
# 2 A C 3
# 3 B C 3
yet another solution (base R):
tmp <- split(DT, DT$Group)
ans <- apply(combn(names(tmp), 2), 2, FUN = function(ind){
out <- length(intersect(tmp[[ind[1]]]$ID, tmp[[ind[2]]]$ID))
c(group1 = ind[1], group2 = ind[2], sum_ = out)
}
)
data.frame(t(ans))
# group1 group2 sum_
#1 A B 2
#2 A C 3
#3 B C 3
First split the data into a list of groups; then, for each pairwise combination of two groups, count how many subjects they have in common using length(intersect(...)).
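One more sketch, not among the answers above: build the ID-by-Group incidence matrix and take its cross-product; each off-diagonal entry counts the IDs two groups share.
m <- table(DT$ID, DT$Group)   # 0/1 incidence matrix: rows = IDs, cols = groups
crossprod(m)                  # entry [i, j] = number of IDs shared by groups i and j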
