Group a data.table using a column which is a list - r

I have a really big problem: looping through the data.table to do what I want is too slow, so I am trying to get around looping. Let's assume I have a data.table as follows:
a <- data.table(i = c(1,2,3), j = c(2,2,6), k = list(c("a","b"),c("a","c"),c("b")))
> a
i j k
1: 1 2 a,b
2: 2 2 a,c
3: 3 6 b
And I want to group based on the values in k. So something like this:
a[, sum(j), by = k]
right now I am getting the following error:
Error in `[.data.table`(a, , sum(j), by = k) :
The items in the 'by' or 'keyby' list are length (2,2,1). Each must be same length as rows in x or number of rows returned by i (3).
The answer I am looking for is to first group all the rows having "a" in column k and calculate sum(j), then all rows having "b", and so on. So the desired answer would be:
k V1
a 4
b 8
c 2
Any hint on how to do it efficiently? I can't melt column k by repeating the rows, since the resulting data.table would be too big for my case.

I think this might work:
a[, .(k = unlist(k)), by=.(i,j)][,sum(j),by=k]
k V1
1: a 4
2: b 8
3: c 2

If we are using tidyr, a compact option would be
library(tidyr)
unnest(a, k)[, sum(j) ,k]
# k V1
#1: a 4
#2: b 8
#3: c 2
Or using the dplyr/tidyr pipes:
library(dplyr)
unnest(a, k) %>%
group_by(k) %>%
summarise(V1 = sum(j))
# k V1
# <chr> <dbl>
#1 a 4
#2 b 8
#3 c 2

Since by-group operations can be slow, I'd consider...
dat = a[rep(1:.N, lengths(k)), c(.SD, .(k = unlist(a$k))), .SDcols=setdiff(names(a), "k")]
i j k
1: 1 2 a
2: 1 2 b
3: 2 2 a
4: 2 2 c
5: 3 6 b
We're repeating the rows of columns i:j to match the unlisted k. The data should probably be kept in this long format instead of using a list column. From there, as in #MikeyMike's answer, we can do dat[, sum(j), by=k].
In data.table 1.9.7+, we can similarly do
dat = a[, c(.SD[rep(.I, lengths(k))], .(k = unlist(k))), .SDcols=i:j]
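Either way, with the data in this long format, the grouped sum from the question follows directly (the output below simply restates the desired result given in the question):
dat[, sum(j), by = k]
#    k V1
# 1: a  4
# 2: b  8
# 3: c  2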

Related

Manipulating data.table column recursively on other column condition

I need to calculate a formula in a data frame. Each set of values across a few columns has to be, let's say for simplicity's sake, aggregated. However, I do not want a calculation across rows only; I want to combine each set with another set based on a condition elsewhere.
This is what I mean:
I have a data.table.
data = data.table(A = c("a","c","b","b","a"),
                  B = c(1:5),
                  C = c(1:5))
setorder(data, A)
> data
A B C
1: a 1 1
2: a 5 5
3: b 3 3
4: b 4 4
5: c 2 2
In column D I need an aggregate of the values in B and C of the current row with the values of B and C of the rows where A is "a". As I have more than one "a", multiple aggregations are needed. From every set of aggregates, the minimum should be written in.
Here is an example.
For row 1: (1+1)+(1+1)=4 and (1+1)+(5+5)=12, so 4 is the minimum - D1 = 4.
For row 3: (3+3)+(1+1)=8 and (3+3)+(5+5)=16, so D3 = 8. And so on.
This is what I expect
> data_new
A B C D
1: a 1 1 4
2: a 5 5 12
3: b 3 3 8
4: b 4 4 10
5: c 2 2 6
I tried this and ran into issues.
for (i in data)data[i, D:=(min((data[i,B+C]) + (data[a=="a",(B+C)])))]
The expression below for the minimum selection works fine on its own when I substitute a row number for i: it returns a list of two numbers to min(), which returns the proper value. The answer below is 8.
min((data[3,B+C]) + (data[A=="a",(B+C)]))
My previous attempts involved expand.grid() and intersect(). However, with the size of my data set I ran into memory issues and RStudio quit on me. As a side note, I need to run the calculations because I could not project the smallest outcome by "a" beforehand - it is a set of coordinates and they do not correlate with the magnitude of the answer.
Any suggestion where my glaring issue is?
You can store the values of B + C where A is 'a' in a variable (val). For each row, you can then take the minimum of B + C + val.
library(data.table)
val <- data[A =='a', B + C]
data[, D := min(B + C + val), seq_len(nrow(data))]
data
# A B C D
#1: a 1 1 4
#2: a 5 5 12
#3: b 3 3 8
#4: b 4 4 10
#5: c 2 2 6
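To make the mechanics explicit, here is the intermediate val and one row worked out by hand (my own check on the example data, not part of the original answer):
val
# [1]  2 10
# row 4 (A = "b", B = 4, C = 4): min(4 + 4 + val) = min(10, 18) = 10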
You can also use sapply:
data[, D := sapply(B + C, function(x) min(x + val))]
Another option is to replicate the 'a' row after taking the min of 'B' and 'C', and then do a direct + with the 'B', 'C' columns. The advantage is that we don't have to group or loop.
library(data.table)
Reduce(`+`, (data[A == 'a', .(B = min(B), C = min(C))][rep(seq_len(.N), nrow(data))] + data[, .(B, C)]))
#[1] 4 12 8 10 6
Or in a single line
data[, D := B + C + min(B[A== 'a']) + min(C[A== 'a'])]
data$D
#[1] 4 12 8 10 6

calculated columns in new datatable without altering the original

I have a dataset which looks like this:
set.seed(43)
dt <- data.table(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10),
  e = sample(c("x","y"), 10, replace = TRUE),
  f = sample(c("t","s"), 10, replace = TRUE)
)
I need (for example) a count of negative values in columns 1:4 for each value of e and f. The result would have to look like this:
e neg_a_count neg_b_count neg_c_count neg_d_count
1: x 6 3 5 3
2: y 2 1 3 NA
1: s 4 2 3 1
2: t 4 2 5 2
Here's my code:
for (k in 5:6) {  # these are the *by* columns
  for (i in 1:4) {  # these are the columns whose negative values I'm counting
    n = paste("neg", names(dt[, i, with=F]), "count", "by", names(dt[, k, with=F]), sep="_")
    dt[dt[[i]] < 0, (n) := .N, by = names(dt[, k, with=F])]
  }
}
dcast(unique(melt(dt[,5:14], id=1, measure=3:6))[!is.na(value),],e~variable)
dcast(unique(melt(dt[,5:14], id=2, measure=7:10))[!is.na(value),],f~variable)
which obviously produces two tables, not one:
e neg_a_count_by_e neg_b_count_by_e neg_c_count_by_e neg_d_count_by_e
1: x 6 3 5 3
2: y 2 1 3 NA
f neg_a_count_by_f neg_b_count_by_f neg_c_count_by_f neg_d_count_by_f
1: s 4 2 3 1
2: t 4 2 5 2
and need to be rbind-ed to produce one table.
This approach modifies dt by adding eight additional columns (4 data columns x 2 by columns), and the counts related to the levels of e and f get recycled (as expected). I was wondering if there is a cleaner way to achieve the result, one which does not modify dt. Also, casting after melting seems inefficient; there should be a better way, especially since my dataset has several e- and f-like columns.
If there are only two grouping columns, we could do an rbindlist after grouping by them separately:
rbindlist(list(dt[, lapply(.SD, function(x) sum(x < 0)), .(e), .SDcols = a:d],
               dt[, lapply(.SD, function(x) sum(x < 0)), .(f), .SDcols = a:d]))
# e a b c d
#1: y 2 1 3 0
#2: x 6 3 5 3
#3: s 4 2 3 1
#4: t 4 2 5 2
Or make it more dynamic by looping through the grouping column names
rbindlist(lapply(c('e', 'f'), function(x)
  dt[, lapply(.SD, function(.x) sum(.x < 0)), by = x, .SDcols = a:d]))
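If you also need to record which grouping variable each block came from (the question mentions several e- and f-like columns), a possible variation is to give the grouping column a common name and add a label column; the names grp_val and grp_var below are my own, not from the answer:
rbindlist(lapply(c("e", "f"), function(g) {
  out <- dt[, lapply(.SD, function(.x) sum(.x < 0)), by = g, .SDcols = a:d]
  setnames(out, g, "grp_val")   # common name so the tables stack cleanly
  out[, grp_var := g][]
}))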
You can melt before aggregating as follows:
cols <- c("a","b","c", "d")
melt(dt, id.vars=cols)[,
lapply(.SD, function(x) sum(x < 0)), by=value, .SDcols=cols]

Using .BY, .GRP or other methods to add a multicolumn aggregation with data.table

Say we have this toy data.table example:
temp <- data.table(V=c("A", "B", "C", "D","A"), GR=c(1,1,1,2,2))
"V" "GR"
A 1
B 1
C 1
D 2
A 2
I would like to generate all ordered combinations with combn within each subset defined by GR, and create from them a new data.table with a new column holding the grouping factor.
For example, for GR=1 we have (A,B),(A,C),(B,C)
for GR=2 we have (D,A)
If I create the result manually it would be
cbind(V=c(1,1,1,2),rbind(t(combn(c("A", "B", "C"),2)),t(combn(c( "D","A"),2))))
1 A B
1 A C
1 B C
2 D A
But I would like to do it with data.table easily instead.
These two options don't work:
temp[,cbind(rep(.GRP,.N),as.data.frame(t(combn(V,2)))),by=GR]
temp[,cbind(rep(.BY,.N),as.data.frame(t(combn(V,2)))),by=GR]
This one works, but I don't understand why. I'm afraid it could copy the whole V vector as-is instead of the proper values.
temp[,.(GR,as.list(as.data.frame((combn(V,2))))),by=GR]
And I guess there should be a shorter way to write it.
This works:
> temp[, {v_comb = combn(V,2); .(v_comb[1,], v_comb[2,])}, by=GR]
GR V1 V2
1: 1 A B
2: 1 A C
3: 1 B C
4: 2 D A
In general, I would avoid, when possible, reshaping operations within the data.table using cbind(), rep(), as.data.frame() or t()... It takes a lot of trial and error to figure out the right way to do it, and it produces code that is very hard to maintain.
On the other hand, using code blocks {...} improves the readability of the code.
This uses data.table, though not everything happens within [] using .BY or .GRP.
library(data.table)
temp <- data.table(V=c("A", "B", "C", "D","A"), GR=c(1,1,1,2,2))
tempfunc <- function(x){
  dat <- as.data.table(t(combn(temp[GR == x, V], 2)))
  dat[, GR := x]
  setcolorder(dat, c("GR", "V1", "V2"))
  dat[]
}
rbindlist(lapply(unique(temp$GR), tempfunc))
GR V1 V2
1: 1 A B
2: 1 A C
3: 1 B C
4: 2 D A
Here are two other approaches which also work if there is a group with just one row, e.g., row 6 below:
library(data.table)
temp <- data.table(V=c("A", "B", "C", "D","A","E"), GR=c(1,1,1,2,2,3))
temp
V GR
1: A 1
2: B 1
3: C 1
4: D 2
5: A 2
6: E 3
Using combinat::combn2
temp[, as.data.table(combinat::combn2(V)), by = GR]
GR V1 V2
1: 1 A B
2: 1 A C
3: 1 B C
4: 2 D A
Using a non-equi join
temp[, V := factor(V)][temp, on = .(GR, V < V), .(GR, x.V, i.V),
  nomatch = 0L, allow.cartesian = TRUE]
GR x.V i.V
1: 1 A B
2: 1 A C
3: 1 B C
4: 2 A D
I have one solution but it seems long and complex too.
temp[,do.call(c, apply(t(combn(V,2)), 2, list)),by=GR]
I've also found that combn is about 10 times slower than specialized packages such as iterpc or combinat:
library(combinat)
temp[, do.call(c, apply(combn2(V), 2, list)), by = GR]
You must also first filter out any group having just one row, because otherwise it would cause an error.
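A minimal sketch of that filtering step, using base combn as in the attempt above (my own illustration, not from the answer; groups whose j evaluates to NULL are simply dropped):
temp <- data.table(V = c("A","B","C","D","A","E"), GR = c(1,1,1,2,2,3))
temp[, if (.N > 1) do.call(c, apply(t(combn(V, 2)), 2, list)), by = GR]
#    GR V1 V2
# 1:  1  A  B
# 2:  1  A  C
# 3:  1  B  C
# 4:  2  D  A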
And this is my final version, much faster and needs much less memory:
temp[,.(from=rep(V,(.N-1):0),to=V[unlist(sapply(2:.N, seq, .N, simplify = T))]), by=GR]

Expand data.table with combinations of two columns given condition in another column

I have a data.table that gives me the connections between locations (origin and destination) for different bus routes (route_id).
library(data.table)
library(magrittr)
# data for reproducible example
dt <- data.table(origin = c('A','B','C','F','G','H'),
                 destination = c('B','C','D','G','H','I'),
                 freq = c(2,2,2,10,10,10),
                 route_id = c(1,1,1,2,2,2), stringsAsFactors = FALSE)
# > dt
# origin destination freq route_id
# 1: A B 2 1
# 2: B C 2 1
# 3: C D 2 1
# 4: F G 10 2
# 5: G H 10 2
# 6: H I 10 2
For the purposes of what I'd want to do, if there is a route_id that gives a connection A-B and a connection B-C, then I want to add to the data a connection A-C for that same route_id and so on.
Problems: So far, I've created simple code that does the job, but:
it uses a for loop, which takes a long time (my real data has hundreds of thousands of observations)
it still does not cope well with direction. The direction of the connections matters here, so although there is a B-C connection in the original data, there should be no C-B in the output.
My slow solution
# loop
# a) get a data subset corresponding to each route_id
# b) get all combinations of origin-destination pairs
# c) row bind the new pairs to original data
for (i in unique(dt$route_id)) {
  temp <- dt[route_id == i, ]
  subset_of_pairs <- expand.grid(temp$origin, temp$destination) %>% setDT()
  setnames(subset_of_pairs, c("origin", "destination"))
  dt <- rbind(dt, subset_of_pairs, fill = TRUE)
}
# assign route_id and freq to new pairs
dt[, route_id := route_id[1L], by=origin]
dt[, freq := freq[1L], by=route_id]
# Keep only pairs whose origin differs from destination, and drop duplicates
dt[, origin := as.character(origin) ][, destination := as.character(destination) ]
dt <- dt[ origin != destination, ][order(route_id, origin, destination)]
dt <- unique(dt)
Desired output
origin destination freq route_id
1: A B 2 1
2: A C 2 1
3: A D 2 1
4: B C 2 1
5: B D 2 1
6: C D 2 1
7: F G 10 2
8: F H 10 2
9: F I 10 2
10: G H 10 2
11: G I 10 2
12: H I 10 2
One way:
res = dt[, {
  stops = c(origin, last(destination))
  pairs = combn(.N + 1L, 2L)
  .(o = stops[pairs[1, ]], d = stops[pairs[2, ]])
}, by = route_id]
route_id o d
1: 1 A B
2: 1 A C
3: 1 A D
4: 1 B C
5: 1 B D
6: 1 C D
7: 2 F G
8: 2 F H
9: 2 F I
10: 2 G H
11: 2 G I
12: 2 H I
This is assuming that c(origin, last(destination)) is a full list of stops in order. If dt does not contain enough info to construct a complete order, the task becomes much more difficult.
If vars from dt are needed, an update join like res[dt, on=.(route_id), freq := i.freq] works.
Tasks like this always risk running out of memory. In this case, the OP has up to a million rows containing groups of up to 341 stops, so the end result could be as large as 1e6/341*choose(341,2) = 170 million rows. That's manageable, but in general this sort of analysis does not scale.
How it works
Generally, data.table syntax can be treated just like a loop over groups:
DT[, {
...
}, by=g]
This has a few advantages over loops:
Nothing created in the ... body will pollute the workspace.
All columns can be referenced by name.
Special symbols .N, .SD, .GRP and .BY are available, along with .() for list().
In the code above, pairs holds pairs of indices taken from 1 .. #stops (= .N + 1, where .N is the number of rows in the subset of the data associated with a given route_id). It is a matrix whose first row corresponds to the first element of each pair and whose second row to the second. The ... should evaluate to a list of columns, and here list() is abbreviated as .().
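For a route with 3 rows (4 stops), for instance, the index matrix is just plain combn output, shown here only for illustration:
combn(4, 2)
#      [,1] [,2] [,3] [,4] [,5] [,6]
# [1,]    1    1    1    2    2    3
# [2,]    2    3    4    3    4    4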
Further improvements
I guess the time is mostly devoted to computing combn many times. If multiple routes have the same #stops, this can be addressed by computing beforehand:
Ns = dt[, .N, by=route_id][, unique(N)]
cb = lapply(setNames(, Ns), function(n) combn(n + 1L, 2L))
Then grab pairs = cb[[as.character(.N)]] in the main code. Alternately, define a pairs function that uses memoization to avoid recomputing.
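A sketch of that memoized helper (the cache, the function name get_pairs, and its use are my own illustration, not from the answer):
pair_cache <- new.env()
get_pairs <- function(n) {
  key <- as.character(n)
  if (!exists(key, envir = pair_cache, inherits = FALSE))
    assign(key, combn(n, 2L), envir = pair_cache)
  get(key, envir = pair_cache, inherits = FALSE)
}
# inside the grouped j above, pairs = get_pairs(.N + 1L) then replaces combn(.N + 1L, 2L)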

Speed up data.frame rearrangement

I have a data frame with coordinates ("start","end") and labels ("group"):
a <- data.frame(start=1:4, end=3:6, group=c("A","B","C","D"))
a
start end group
1 1 3 A
2 2 4 B
3 3 5 C
4 4 6 D
I want to create a new data frame in which labels are assigned to every element of the sequence spanning the range of coordinates:
V1 V2
1 1 A
2 2 A
3 3 A
4 2 B
5 3 B
6 4 B
7 3 C
8 4 C
9 5 C
10 4 D
11 5 D
12 6 D
The following code works but it is extremely slow with wide ranges:
df <- data.frame()
for (i in 1:dim(a)[1]) {
  s <- seq(a[i, 1], a[i, 2])
  df <- rbind(df, data.frame(s, rep(a[i, 3], length(s))))
}
colnames(df) <- c("V1", "V2")
How can I speed this up?
You can try data.table
library(data.table)
setDT(a)[, start:end, by = group]
which gives
group V1
1: A 1
2: A 2
3: A 3
4: B 2
5: B 3
6: B 4
7: C 3
8: C 4
9: C 5
10: D 4
11: D 5
12: D 6
Obviously this would only work if you have one row per group, which it seems you have here.
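If the data did have several rows per group, one hedged data.table variant is to group by a row id instead (my own sketch, not part of the answer):
a_dt <- as.data.table(a)
a_dt[, rid := .I]  # row id, so every original row forms its own group
a_dt[, .(group = group, V1 = start:end), by = rid][, .(group, V1)]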
If you want a very fast solution in base R, you can manually create the data.frame in two steps:
Use mapply to create a list of your ranges from "start" to "end".
Use rep + lengths to repeat the "group" column to the expected number of rows.
The base R approach shared here won't depend on having only one row per group.
Try:
temp <- mapply(":", a[["start"]], a[["end"]], SIMPLIFY = FALSE)
data.frame(group = rep(a[["group"]], lengths(temp)),
           values = unlist(temp, use.names = FALSE))
If you're doing this a lot, just put it in a function:
myFun <- function(indf) {
  temp <- mapply(":", indf[["start"]], indf[["end"]], SIMPLIFY = FALSE)
  data.frame(group = rep(indf[["group"]], lengths(temp)),
             values = unlist(temp, use.names = FALSE))
}
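For reference, calling it on the question's example data reproduces the expected expansion (my own check; note the columns come out named group/values rather than V1/V2):
head(myFun(a))
#   group values
# 1     A      1
# 2     A      2
# 3     A      3
# 4     B      2
# 5     B      3
# 6     B      4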
Then, if you want some larger sample data to try it with, you can use the following:
set.seed(1)
a <- data.frame(start=1:4, end=sample(5:10, 4, TRUE), group=c("A","B","C","D"))
x <- do.call(rbind, replicate(1000, a, FALSE))
y <- do.call(rbind, replicate(100, x, FALSE))
Note that this does seem to slow down as the number of different unique values in "group" increases.
(In other words, the "data.table" approach will make the most sense in general. I'm just sharing a possible base R alternative that should be considerably faster than your existing approach.)
