Progress bar in data.table aggregate action

ddply has a .progress to get a progress bar while it's running, is there an equivalent for data.table in R?

Yes, you can print any progress status you want from within j:
library(data.table)
dt = data.table(a=1:4, b=c("a","b"))
dt[, {cat("group:",b,"\n"); sum(a)}, b]
#group: a
#group: b
# b V1
#1: a 4
#2: b 6
If you are asking about progress when loading a csv file with fread, a progress bar is displayed automatically for bigger datasets. Also, as mentioned by Sergey in a comment, you can use the verbose argument to get more information, both in fread and in [.data.table.
If you want the percentage of groups processed:
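For illustration, a minimal sketch (it writes a small temporary csv so the example is self-contained; the column names are made up):

```r
library(data.table)

# write a small csv to a temporary file so the example is self-contained
tmp <- tempfile(fileext = ".csv")
fwrite(data.table(x = 1:5, y = letters[1:5]), tmp)

# verbose = TRUE prints detailed step/timing information while reading;
# for large files fread also shows a progress bar automatically
dt_in <- fread(tmp, verbose = TRUE)
```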
grpn = uniqueN(dt$b)
dt[, {cat("progress",.GRP/grpn*100,"%\n"); sum(a)}, b]
#progress 50 %
#progress 100 %
# b V1
#1: a 4
#2: b 6

Following up on @jangorecki's excellent answer, here's a way to use a text progress bar:
library(data.table)
dt = data.table(a=1:4, b=c("a","b"))
grpn = uniqueN(dt$b)
pb <- txtProgressBar(min = 0, max = grpn, style = 3)
dt[, {setTxtProgressBar(pb, .GRP); Sys.sleep(0.5); sum(a)}, b]
close(pb)

Following up on @jangorecki and other great answers, you can use the data.table symbol .NGRP instead of calculating grpn as in the other answers:
dt[, {cat("progress",.GRP/.NGRP*100,"%\n"); sum(a)}, b]

Following up again on @jangorecki's great answer.
If you don't want to spam your terminal too much, you can make an external function equivalent to jangorecki's, but which does a modulus check and only prints if .GRP is divisible by a certain number mod. Note that using the if function within the data.table curly brackets itself doesn't work, which I assume is because if in R also uses curly brackets.
progress <- function(.GRP, grpn, mod) {
  if (!(.GRP %% mod)) {
    cat("progress", .GRP / grpn * 100, "%\n")
  }
}
Then call it in j. Here I use mod = 1000, so it only prints the percentage every 1000 groups.
dt[, {progress(.GRP, grpn, 1000); sum(a)}, b]

Related

Modify list-column by reference in nested data.table

When using a list column of data.tables in a nested data.table it is easy to apply a function over the column. Example:
dt<- data.table(mtcars)[, list(dt.mtcars = list(.SD)), by = gear]
We can use:
dt[ ,list(length = nrow(dt.mtcars[[1]])), by = gear]
gear length
1: 4 12
2: 3 15
3: 5 5
or
dt[, list( length = lapply(dt.mtcars, nrow)), by = gear]
gear length
1: 4 12
2: 3 15
3: 5 5
I would like to do the same process and apply a modification by reference using the operator := to each data.table of the column.
Example:
modify_by_ref <- function(d){
  d[, max_hp := max(hp)]
}
dt[, modify_by_ref(dt.mtcars[[1]]), by = gear]
That returns the error:
Error in `[.data.table`(d, , `:=`(max_hp, max(hp))) :
.SD is locked. Using := in .SD's j is reserved for possible future use; a tortuously flexible way to modify by group. Use := in j directly to modify by group by reference.
Using the tip in the error message does not work in any way for me; it seems to target another case, but maybe I am missing something. Is there any recommended way or flexible workaround to modify list columns by reference?
This can be done either in the following two steps or in a single step:
The given table is:
dt<- data.table(mtcars)[, list(dt.mtcars = list(.SD)), by = gear]
Step 1 - Let's add a list of hp column vectors in each row of dt:
dt[, hp_vector := .(list(dt.mtcars[[1]][, hp])), by = list(gear)]
Step 2 - Now calculate the max of hp
dt[, max_hp := max(hp_vector[[1]]), by = list(gear)]
The given table is:
dt<- data.table(mtcars)[, list(dt.mtcars = list(.SD)), by = gear]
Single step - this is actually the combination of both of the above steps:
dt[, max_hp := .(list(max(dt.mtcars[[1]][, hp])[[1]])), by = list(gear)]
If we wish to populate values within the nested table by reference, then the following link talks about how to do it; we just need to ignore a warning message. I would be happy if anyone can point out how to fix the warning message, or whether there is any pitfall. For more detail, please refer to the link:
https://stackoverflow.com/questions/48306010/how-can-i-do-fast-advance-data-manipulation-in-nested-data-table-data-table-wi/48412406#48412406
Taking inspiration from the same, I am going to show how to do it here for the given data set.
Let's first clean everything:
rm(list = ls())
Let's re-define the given table in different way:
dt<- data.table(mtcars)[, list(dt.mtcars = list(data.table(.SD))), by = list(gear)]
Note that I have defined the table slightly differently: I have wrapped .SD in data.table() in addition to list() in the above definition.
Next, populate the max by reference within nested table:
dt[, dt.mtcars := .(list(dt.mtcars[[1]][, max_hp := max(hp)])), by = list(gear)]
And, better still, we can perform further manipulation within the nested table:
dt[, dt.mtcars := .(list(dt.mtcars[[1]][, weighted_hp_carb := max_hp*carb])), by = list(gear)]
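As a quick sanity check (a sketch repeating the definitions above), the nested tables do carry the new column after the by-reference update:

```r
library(data.table)

# nested table with a data.table() wrapper around .SD, as above
dt <- data.table(mtcars)[, list(dt.mtcars = list(data.table(.SD))), by = list(gear)]

# populate max_hp by reference within each nested table
dt[, dt.mtcars := .(list(dt.mtcars[[1]][, max_hp := max(hp)])), by = list(gear)]

# every nested table now has a max_hp column equal to its group maximum
sapply(dt$dt.mtcars, function(x) "max_hp" %in% names(x))
sapply(dt$dt.mtcars, function(x) x$max_hp[1] == max(x$hp))
```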

Using data.table function in lapply on a list with data.frames elements (Answer = setDT)

First question; please let me know in the comments if more info or background is needed.
Many answers on here and elsewhere deal with calling lapply inside a data.table. I want to do the opposite, which on paper should be as easy as lapply(list.of.dfs, function(x) ...), but I can't get it to work with data.table functions.
I have a list that contains several data.frames with the same columns but differing numbers of rows. This comes from the output of several simulation scenarios, so they must be treated separately and not rbind'ed.
#sample list of data.frames
scenarios <- replicate(5, data.frame(a = sample(letters[1:4], 10, TRUE),
                                     b = sample(1:2, 10, TRUE),
                                     x = sample(1:10, 10),
                                     y = runif(10)), simplify = FALSE)
I want to add a column to every element that is the sum of x/y by a and b.
From the data.table documentation in the examples section the process to do this for one data.frame is the following (search: add new column by reference by group in the doc page):
test <- as.data.table(scenarios[[1]]) #must specify data.table class
test[, newcol := sum(x/y), by = .(a , b)][]
I want to use lapply to do the same thing to every element in the scenarios list and return the list.
My most recent attempt:
lapply(scenarios, function(i) {as.data.table(i[, z := sum(x/y), by=.(a,b)]); i})
but I keep getting the error unused argument (by = .(a, b))
After poring over the results of this and other sites, I have been unable to solve this problem, which I'm fairly sure means there is something I don't understand about calling anonymous functions and/or using data.table functions. Is this one of those cases where you use [ as the function? Or possibly my as.data.table is out of place.
This answer was a step in the right direction (I think); it covers the use of function(x) {... ; x} to use an anonymous function and return x.
Thanks!
You can use setDT here instead.
scenarios <- lapply(scenarios, function(i) setDT(i)[, z := sum(x/y), by=.(a,b)])
scenarios[[1]]
a b x y z
1: c 2 2 0.87002174 2.298793
2: b 2 10 0.19720775 78.611837
3: b 2 8 0.47041670 78.611837
4: b 2 4 0.36705023 78.611837
5: a 1 5 0.78922686 12.774035
6: a 1 6 0.93186209 12.774035
7: b 1 3 0.83118438 3.609307
8: c 1 1 0.08248658 30.047494
9: c 1 7 0.89382050 30.047494
10: c 1 9 0.89172831 30.047494
Using as.data.table, the syntax would be
scenarios <- lapply(scenarios, function(i) {
  i <- as.data.table(i)
  i[, z := sum(x/y), by = .(a, b)]
})
but this wouldn't be recommended as it will create an additional copy, which is avoided by setDT.
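The in-place behaviour can be seen on a single data.frame (a minimal sketch): setDT converts the object itself, so no reassignment with <- is needed before using :=.

```r
library(data.table)

df <- data.frame(x = 1:3)
setDT(df)           # converts df itself to a data.table, no "<-" needed
class(df)           # now includes "data.table"
df[, y := x * 2]    # add a column by reference
```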

Unexpected .GRP sequence in data.table

Given a data.table such as:
library(data.table)
n = 5000
set.seed(123)
pop = data.table(id=1:n, age=sample(18:80, n, replace=TRUE))
and a function which converts a numeric vector into an ordered factor, such as:
toAgeGroups <- function(x){
  groups = c('Under 40', '40-64', '65+')
  grp = findInterval(x, c(40, 65)) + 1
  factor(groups[grp], levels = groups, ordered = TRUE)
}
I am seeing unexpected results when grouping on the output of this function as a key and indexing with .GRP.
pop[, .(age_segment_id = .GRP, pop_count=.N), keyby=.(age_segment = toAgeGroups(age))]
returns:
age_segment age_segment_id pop_count
1: Under 40 1 1743
2: 40-64 3 2015
3: 65+ 2 1242
I would have expected the age_segment_id values to be c(1,2,3), not c(1,3,2), but .GRP seems set on order of occurrence in underlying data (as in by= order) rather than sorted order (as in keyby=).
I was planning on using .GRP as an index for some additional labelling, but instead I need to do something like:
pop[, .(pop_count=.N), keyby=.(age_segment = toAgeGroups(age))][, age_segment_id := .I][]
to get what I want.
Is this expected behavior? If so, is there a better workaround?
(v. 1.9.6)
This issue should no longer occur in versions 1.9.8+ of data.table.
library(data.table) #1.9.8+
pop[, .(age_segment_id = .GRP, pop_count=.N),
keyby=.(age_segment = toAgeGroups(age))]
# age_segment age_segment_id pop_count
# 1: Under 40 1 1743
# 2: 40-64 2 2015
# 3: 65+ 3 1242
For some more, see the discussion here. Basically, by works internally by returning sorted rows for each group, then re-sorting the table back to its original order.
The change recognized that this re-sort is unnecessary if keyby is specified, so now your approach works as you expected.
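To see the difference between the two numbering behaviours, a minimal sketch (assuming data.table 1.9.8+):

```r
library(data.table)

dt <- data.table(g = c("b", "a", "b", "a"))

# with by=, .GRP numbers groups in order of first appearance in the data
res_by  <- dt[, .(grp = .GRP), by = g]     # b gets 1, a gets 2

# with keyby=, groups are sorted first, so .GRP follows sorted order
res_key <- dt[, .(grp = .GRP), keyby = g]  # a gets 1, b gets 2
```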
Before (through 1.9.6), keyby would just re-sort the answer at the end by running setkey, as documented in ?data.table:
[keyby is the s]ame as by, but with an additional setkey() run on the by columns of the result.
Thus, on less-than-brand-new versions of data.table, you'd have to fix your code as:
pop[order(age), .(age_segment_id = .GRP, pop_count=.N),
    keyby=.(age_segment = toAgeGroups(age))]

R : Efficient loop on row with data.table

I am using data.table in R and looping over my table; it's really slow because of my table size.
I wonder if someone has an idea on how to speed this up.
I have a set of values that I want to "cluster".
Each line has a position, a positive integer. You can load a simple view of that:
library(data.table)
#Here is a toy example
fulltable = c(seq(1, 4)) * c(seq(1, 1000, 10))
fulltable = data.table(pos = fulltable[order(fulltable)])
fulltable$id = 1
So I loop over the rows, and when there is a gap of more than 50 between two positions I change the group:
#here is the main loop
lastposition = fulltable[1]$pos
lastid = fulltable[1]$id
for (i in 2:nrow(fulltable)) {
  if (fulltable[i]$pos - 50 > lastposition) {
    lastid = lastid + 1
    print(lastid)
  }
  fulltable[i]$id = lastid
  lastposition = fulltable[i]$pos
}
Any idea for an efficient way to do this?
fulltable[which((c(fulltable$pos[-1], NA) - fulltable$pos) > 50) + 1, new_group := 2:(.N+1)]
fulltable[is.na(new_group), new_group := 1]
fulltable[, c("lastid_new", "new_group") := list(cummax(new_group), NULL)]
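The same grouping rule can also be written as a one-liner with cumsum (a sketch; a new group starts whenever the gap to the previous position is strictly greater than 50, matching the loop's semantics):

```r
library(data.table)

# toy data from the question
fulltable = c(seq(1, 4)) * c(seq(1, 1000, 10))
fulltable = data.table(pos = fulltable[order(fulltable)])

# id increments whenever the gap to the previous row exceeds 50
fulltable[, id := cumsum(c(1L, diff(pos) > 50))]
```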

split data.table

I have a data.table which I want to split into two. I do this as follows:
dt <- data.table(a = c(1, 2, 3, 3), b = c(1, 1, 2, 2))
sdt <- split(dt, dt$b == 2)
but if I want to do something like this as a next step
sdt[[1]][,c:=.N,by=a]
I get the following warning message.
Warning message: In [.data.table(sdt[[1]], , :=(c, .N), by = a) :
Invalid .internal.selfref detected and fixed by taking a copy of the
whole table, so that := can add this new column by reference. At an
earlier point, this data.table has been copied by R. Avoid key<-,
names<- and attr<- which in R currently (and oddly) may copy the whole
data.table. Use set* syntax instead to avoid copying: setkey(),
setnames() and setattr(). Also, list(DT1,DT2) will copy the entire DT1
and DT2 (R's list() copies named objects), use reflist() instead if
needed (to be implemented). If this message doesn't help, please
report to datatable-help so the root cause can be fixed.
Just wondering if there is a better way of splitting the table so that it would be more efficient (and I would not get this message)?
This works in v1.8.7 (and may work in v1.8.6 too) :
> sdt = lapply(split(1:nrow(dt), dt$b==2), function(x)dt[x])
> sdt
$`FALSE`
a b
1: 1 1
2: 2 1
$`TRUE`
a b
1: 3 2
2: 3 2
> sdt[[1]][,c:=.N,by=a] # now no warning
> sdt
$`FALSE`
a b c
1: 1 1 1
2: 2 1 1
$`TRUE`
a b
1: 3 2
2: 3 2
But, as @mnel said, that's inefficient. Please avoid splitting if possible.
While looking for a way to do a split in data.table, I came across this old question.
Sometimes a split is what you want to do, and the data.table "by" approach is not convenient.
Actually, you can easily do the split by hand with data.table-only instructions, and it works very efficiently:
SplitDataTable <- function(dt, attr) {
  boundaries = c(0,
                 which(head(dt[[attr]], -1) != tail(dt[[attr]], -1)),
                 nrow(dt))
  return(mapply(function(start, end) dt[start:end, ],
                head(boundaries, -1) + 1,
                tail(boundaries, -1),
                SIMPLIFY = FALSE))
}
As mentioned above (@jangorecki), the data.table package already has its own function for splitting. In this simplified case we can use:
> dt <- data.table(a = c(1, 2, 3, 3), b = c(1, 1, 2, 2))
> split(dt, by = "b")
$`1`
a b
1: 1 1
2: 2 1
$`2`
a b
1: 3 2
2: 3 2
For more difficult/concrete cases, I would recommend creating a new variable in the data.table using the by-reference functions := or set, and then calling split on it. If you care about performance, make sure to always remain in the data.table environment, e.g. dt[, SplitCriteria := (...)], rather than computing the splitting variable externally.
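For example, a sketch of that pattern (the SplitCriteria column name and the "low"/"high" labels are just illustrative):

```r
library(data.table)

dt <- data.table(a = c(1, 2, 3, 3), b = c(1, 1, 2, 2))

# add the splitting variable by reference, then split on it;
# keep.by = FALSE drops the helper column from the results
dt[, SplitCriteria := ifelse(b > 1, "high", "low")]
out <- split(dt, by = "SplitCriteria", keep.by = FALSE)
```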
