how to avoid an optimization warning in data.table - r

I have the following code:
> dt <- data.table(a=c(rep(3,5),rep(4,5)),b=1:10,c=11:20,d=21:30,key="a")
> dt
a b c d
1: 3 1 11 21
2: 3 2 12 22
3: 3 3 13 23
4: 3 4 14 24
5: 3 5 15 25
6: 4 6 16 26
7: 4 7 17 27
8: 4 8 18 28
9: 4 9 19 29
10: 4 10 20 30
> dt[,lapply(.SD,sum),by="a"]
Finding groups (bysameorder=TRUE) ... done in 0secs. bysameorder=TRUE and o__ is length 0
Optimized j from 'lapply(.SD, sum)' to 'list(sum(b), sum(c), sum(d))'
Starting dogroups ... done dogroups in 0 secs
a b c d
1: 3 15 65 115
2: 4 40 90 140
> dt[,c(count=.N,lapply(.SD,sum)),by="a"]
Finding groups (bysameorder=TRUE) ... done in 0secs. bysameorder=TRUE and o__ is length 0
Optimization is on but j left unchanged as 'c(count = .N, lapply(.SD, sum))'
Starting dogroups ... The result of j is a named list. It's very inefficient to create the same names over and over again for each group. When j=list(...), any names are detected, removed and put back after grouping has completed, for efficiency. Using j=transform(), for example, prevents that speedup (consider changing to :=). This message may be upgraded to warning in future.
done dogroups in 0 secs
a count b c d
1: 3 5 15 65 115
2: 4 5 40 90 140
How do I avoid the scary "very inefficient" warning?
I can add the count column before aggregating:
> dt$count <- 1
> dt
a b c d count
1: 3 1 11 21 1
2: 3 2 12 22 1
3: 3 3 13 23 1
4: 3 4 14 24 1
5: 3 5 15 25 1
6: 4 6 16 26 1
7: 4 7 17 27 1
8: 4 8 18 28 1
9: 4 9 19 29 1
10: 4 10 20 30 1
> dt[,lapply(.SD,sum),by="a"]
Finding groups (bysameorder=TRUE) ... done in 0secs. bysameorder=TRUE and o__ is length 0
Optimized j from 'lapply(.SD, sum)' to 'list(sum(b), sum(c), sum(d), sum(count))'
Starting dogroups ... done dogroups in 0 secs
a b c d count
1: 3 15 65 115 5
2: 4 40 90 140 5
but this does not look too elegant...

One way I could think of is to assign count by reference:
dt.out <- dt[, lapply(.SD,sum), by = a]
dt.out[, count := dt[, .N, by=a][, N]]
# alternatively: count := table(dt$a)
# a b c d count
# 1: 3 15 65 115 5
# 2: 4 40 90 140 5
Edit 1: I still think it's just a message, not a warning. But if you still want to avoid it, just do:
dt.out[, count := as.numeric(dt[, .N, by=a][, N])]
Edit 2: Very interesting. Doing the equivalent with the multiple-assignment form of := does not produce the same message.
dt.out[, `:=`(count = dt[, .N, by=a][, N])]
# Detected that j uses these columns: a
# Finding groups (bysameorder=TRUE) ... done in 0.001secs. bysameorder=TRUE and o__ is length 0
# Detected that j uses these columns: <none>
# Optimization is on but j left unchanged as '.N'
# Starting dogroups ... done dogroups in 0 secs
# Detected that j uses these columns: N
# Assigning to all 2 rows
# Direct plonk of unnamed RHS, no copy.

This solution removes the message about the named elements. But you have to put the names back afterwards.
require(data.table)
options(datatable.verbose = TRUE)
dt <- data.table(a=c(rep(3,5),rep(4,5)),b=1:10,c=11:20,d=21:30,key="a")
dt[, c(.N, unname(lapply(.SD, sum))), by = "a"]
Output
> dt[, c(.N, unname(lapply(.SD, sum))), by = "a"]
Finding groups (bysameorder=TRUE) ... done in 0secs. bysameorder=TRUE and o__ is length 0
Optimization is on but j left unchanged as 'c(.N, unname(lapply(.SD, sum)))'
Starting dogroups ... done dogroups in 0.001 secs
a V1 V2 V3 V4
1: 3 5 15 65 115
2: 4 5 40 90 140
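To put the names back afterwards, one option is a single setnames() call once grouping is done (a sketch; the column order is known here, and "count" is just the name I'm choosing for the .N column):

```r
library(data.table)
dt <- data.table(a = c(rep(3, 5), rep(4, 5)),
                 b = 1:10, c = 11:20, d = 21:30, key = "a")
res <- dt[, c(.N, unname(lapply(.SD, sum))), by = "a"]
# The unnamed j gives V1..V4; restore the intended names once, after grouping:
setnames(res, c("a", "count", "b", "c", "d"))
res
```

This keeps the per-group work name-free (which is what the message asks for) and pays the naming cost exactly once.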


How do I get the value of a variable using the lag position that comes from another variable?

I am trying to get the values of a variable (B) that come from the lag position given by another variable (A).
The variables are something like this:
# A B
# 1: 1 10
# 2: 1 20
# 3: 1 30
# 4: 2 40
# 5: 2 50
I want the output (C) to be like this; the first value would be zero and the condition starts in the second row:
# A B C
# 1: 1 10 0
# 2: 1 20 10
# 3: 1 30 20
# 4: 2 40 20
# 5: 2 50 30
I have done it with loops, but because it's a large amount of data it takes a long time to run. I hope someone could give me an idea.
Here's a way with dplyr:
library(dplyr)
x %>%
mutate(
C = c(0, B[(2:n()) - A[-1]])
)
# A B C
# 1: 1 10 0
# 2: 1 20 10
# 3: 1 30 20
# 4: 2 40 20
# 5: 2 50 30
It translates directly to data.table (from the colons in your row names, I guessed you might already be using that package):
library(data.table)
dt = as.data.table(x)
dt[, C := c(0, B[(2:.N) - A[-1]])]
dt
# A B C
# 1: 1 10 0
# 2: 1 20 10
# 3: 1 30 20
# 4: 2 40 20
# 5: 2 50 30
Using this data:
x = read.table(text =' A B
1: 1 10
2: 1 20
3: 1 30
4: 2 40
5: 2 50', header = T)
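To see why the indexing works: for each row i ≥ 2, `(2:n()) - A[-1]` evaluates to i - A[i], so C picks up B lagged by A positions. A minimal base-R check:

```r
A <- c(1, 1, 1, 2, 2)
B <- c(10, 20, 30, 40, 50)
idx <- (2:length(B)) - A[-1]  # i - A[i] for i = 2..5, i.e. 1 2 2 3
C <- c(0, B[idx])
C  # 0 10 20 20 30
```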

Why can't I remove current observation using .I in data.table?

Recently I saw a question (can't find the link) that was something like this
I want to add a column on a data.frame that computes the variance of a different column while removing the current observation.
dt = data.table(
id = c(1:13),
v = c(9,5,8,1,25,14,7,87,98,63,32,12,15)
)
So, with a for() loop:
res = NULL
for(i in 1:13){
res[i] = var(dt[-i,v])
}
I tried doing this in data.table, using negative indexing with .I, but to my surprise none of the following works:
#1
dt[,var := var(dt[,v][-.I])]
#2
dt[,var := var(dt$v[-.I])]
#3
fun = function(x){
v = c(9,5,8,1,25,14,7,87,98,63,32,12,15)
var(v[-x])
}
dt[,var := fun(.I)]
#4
fun = function(x){
var(dt[-x,v])
}
dt[,var := fun(.I)]
All of those give the same output:
id v var
1: 1 9 NA
2: 2 5 NA
3: 3 8 NA
4: 4 1 NA
5: 5 25 NA
6: 6 14 NA
7: 7 7 NA
8: 8 87 NA
9: 9 98 NA
10: 10 63 NA
11: 11 32 NA
12: 12 12 NA
13: 13 15 NA
What am I missing? I thought it was a problem with .I being passed to functions, but a dummy example:
fun = function(x,c){
x*c
}
dt[,dummy := fun(.I,2)]
id v var
1: 1 9 2
2: 2 5 4
3: 3 8 6
4: 4 1 8
5: 5 25 10
6: 6 14 12
7: 7 7 14
8: 8 87 16
9: 9 98 18
10: 10 63 20
11: 11 32 22
12: 12 12 24
13: 13 15 26
works fine.
Why can't I use .I in this specific scenario?
You may use .BY:
a list containing a length 1 vector for each item in by
dt[ , var_v := dt[id != .BY$id, var(v)], by = id]
Variance is calculated once per row (by = id). In each calculation, the current row is excluded using id != .BY$id in the 'inner' i.
all.equal(dt$var_v, res)
# [1] TRUE
Why doesn't your code work? Because...
.I is an integer vector equal to seq_len(nrow(x)),
...your -.I doesn't just remove the current observation; it removes all rows of 'v' in one go.
A small illustration which starts with your attempt (just without the assignment :=) and simplifies it step by step:
# your attempt
dt[ , var(dt[, v][-.I])]
# [1] NA
# without the `var`, indexing only
dt[ , dt[ , v][-.I]]
# numeric(0)
# an empty vector
# same indexing written in a simpler way
dt[ , v[-.I]]
# numeric(0)
# even more simplified, with a vector of values
# and its corresponding indexes (equivalent to .I)
v <- as.numeric(11:14)
i <- 1:4
v[i]
# [1] 11 12 13 14
v[-i]
# numeric(0)
Here's a brute-force thought:
exvar <- function(x, na.rm = FALSE) sapply(seq_len(length(x)), function(i) var(x[-i], na.rm = na.rm))
dt[,var := exvar(v)]
dt
# id v var
# 1: 1 9 1115.538
# 2: 2 5 1098.265
# 3: 3 8 1111.515
# 4: 4 1 1077.841
# 5: 5 25 1153.114
# 6: 6 14 1132.697
# 7: 7 7 1107.295
# 8: 8 87 822.447
# 9: 9 98 684.697
# 10: 10 63 1040.265
# 11: 11 32 1153.697
# 12: 12 12 1126.424
# 13: 13 15 1135.538
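That said, .I can still be used here: with by = id (each id is unique, so each group is a single row), .I inside j is the global row index of that group's row, and -.I then drops exactly the current observation. A sketch:

```r
library(data.table)
dt <- data.table(id = 1:13,
                 v  = c(9, 5, 8, 1, 25, 14, 7, 87, 98, 63, 32, 12, 15))
# One row per group, so .I is a length-1 vector: the current row's index.
# dt$v reaches the full column, bypassing the group-local 'v'.
dt[, var := var(dt$v[-.I]), by = id]
```

This matches the for() loop result, at the cost of a by-group call per row.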

Efficient altering of variables in dataframe referencing previous row's value

I have a database like the following:
df <- data.frame(id=c(1,1,1,2,2,3,3,4),
num=c(12,12,12,28,28,17,17,7))
id num
1 1 12
2 1 12
3 1 12
4 2 28
5 2 28
6 3 17
7 3 17
8 4 7
I want to increment the value of num by 1 for every time there is another row for the id. I have the following code to do it:
for (i in 2:nrow(df)) {
if(df[i,1]==df[i-1,1]) {
df[i,2]=df[i-1,2]+1
}
}
This would result in an answer like this:
id num
1 1 12
2 1 13
3 1 14
4 2 28
5 2 29
6 3 17
7 3 18
8 4 7
This code works, but my actual dataset has hundreds of millions of rows, so this is very inefficient. I have tried using the lag() function from dplyr in different ways but have had no success. One such way was to bring the previous row's id onto the same row to compare; here was my attempt:
df[,lag := shift(Id, 1L, type="lag")]
df[df$id==df$lag,2]<-shift(df[df$id==df$lag,2], 1L, type="lag")+1
This will obviously not run, however. Any help speeding up my approach would be great! Thanks.
library(data.table)
setDT(df)
df[, num := num + rowid(id) - 1L]
result:
# id num
# 1: 1 12
# 2: 1 13
# 3: 1 14
# 4: 2 28
# 5: 2 29
# 6: 3 17
# 7: 3 18
# 8: 4 7
Using ave
df$num+ave(df$id,df$id,FUN = seq_along)-1
[1] 12 13 14 28 29 17 18 7
Another data.table approach in case you like an explicit by:
library(data.table)
setDT(df)
df[ , num := num + 1:.N - 1, by=id]

R: strange behavior in data.table when attempting to operate on a column using another data.table

I'm trying to operate on a data.table column using a different data.table, and assign the result to a new column in the first data.table. But I keep having this issue:
Warning messages:
1: In from:(from + len) :
numerical expression has 10 elements: only the first used
Here is the data:
tstamps = c(1504306173, NA, NA, NA, NA, 1504393006, NA, NA, 1504459211, NA)
set.seed(0.1)
dt1 = data.table(utc_tstamp = sample(rep(tstamps, 100), 100))
dt2 = data.table(from = sample((1:90), 10), len = sample(1:10, 10))
> dt2
from len
1: 55 6
2: 59 9
3: 32 10
4: 24 3
5: 86 7
6: 54 1
7: 18 5
8: 11 8
9: 40 4
10: 75 2
I'm trying to count the number of NA's in dt1[from:(from+len), ] and assign the result to a new column, count, in dt2.
What I currently have for that is this
dt2[, count := dt1[from:(from+len), ][is.na(utc_tstamp), .N]]
but this is only using dt2[1,]$from and dt2[1,]$len, all the counts are just the number of NA's in dt1[dt2[1,]$from:(dt2[1,]$from + dt2[1,]$len), ], and I receive the following warning
Warning messages:
1: In from:(from + len) :
numerical expression has 10 elements: only the first used
2: In from:(from + len) :
numerical expression has 10 elements: only the first used
and the result is this:
> dt2
from len count
1: 55 6 5
2: 59 9 5
3: 32 10 5
4: 24 3 5
5: 86 7 5
6: 54 1 5
7: 18 5 5
8: 11 8 5
9: 40 4 5
10: 75 2 5
while it should be this:
> dt2
from len count
1: 55 6 5
2: 59 9 5
3: 32 10 8
4: 24 3 3
5: 86 7 5
6: 54 1 2
7: 18 5 4
8: 11 8 5
9: 40 4 4
10: 75 2 2
I'd appreciate it if someone explains why this is happening and how I can get what I want.
Based on the description: get the sequence between 'from' and 'from + len', use this position index to extract the corresponding elements of the 'utc_tstamp' column from 'dt1', convert them to logical with is.na(), and take the sum, i.e. the number of NA elements. Assign (:=) the result to create a new column 'count' in 'dt2'.
dt2[, count := unlist(Map(function(x, y)
sum(is.na(dt1$utc_tstamp[x:y])), from , from + len))]
dt2
# from len count
# 1: 55 6 5
# 2: 59 9 5
# 3: 32 10 8
# 4: 24 3 3
# 5: 86 7 5
# 6: 54 1 2
# 7: 18 5 4
# 8: 11 8 5
# 9: 40 4 4
#10: 75 2 2
Or another option is to group by row number and then build the sequence (:) from the 'from' and 'len' columns to subset the column values from 'dt1' and take the sum of the logical vector
dt2[, count := sum(is.na(dt1$utc_tstamp[from:(from + len)])), by = 1:nrow(dt2)]
Or define the joining variables explicitly and use a non-equi join:
dt2[, to := from+len]
dt1[, r := .I]
dt2[, ct := dt1[is.na(utc_tstamp)][dt2, on=.(r >= from, r <= to), .N, by=.EACHI]$N]
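To see that these approaches line up, here is a self-contained check on small deterministic data (my own toy vectors, not the OP's sampled ones):

```r
library(data.table)
dt1 <- data.table(utc_tstamp = c(1, NA, NA, 4, NA, 6, NA, 8, NA, 10))
dt2 <- data.table(from = c(1, 3, 6), len = c(3, 4, 3))

# Row-by-row with Map:
dt2[, count := unlist(Map(function(x, y) sum(is.na(dt1$utc_tstamp[x:y])),
                          from, from + len))]

# Non-equi join over the NA rows gives the same counts:
dt2[, to := from + len]
dt1[, r := .I]
dt2[, ct := dt1[is.na(utc_tstamp)][dt2, on = .(r >= from, r <= to),
                                   .N, by = .EACHI]$N]
dt2[, identical(count, ct)]  # TRUE
```

The non-equi join only scans the NA rows once, which is why it scales better than evaluating `from:(from + len)` per row.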

Selecting rows based on index positions, when the index positions are passed through variables in data.table R

I can select one column by index position in data.table by passing the index position through a variable like this:
DT <- data.table(a = 1:6, b=10:15, c=20:25, d=30:35, e = 40:45)
i <- 1
j <- 5
DT[, ..i]
But how can I select columns i : i+2 and j in one line of code using data.table syntax?
Your advice will be appreciated.
If you don't want to use lukeA's approach using the with = FALSE parameter you have other choices as well:
DT[, .SD, .SDcols = c(i:(i+2), j)]
# a b c e
#1: 1 10 20 40
#2: 2 11 21 41
#3: 3 12 22 42
#4: 4 13 23 43
#5: 5 14 24 44
#6: 6 15 25 45
Note the parentheses around (i+2) because the colon operator takes precedence.
This one is a modification of OP's code and not exactly a one-liner as requested:
icol <- c(i:(i+2), j); DT[, ..icol]
a b c e
1: 1 10 20 40
2: 2 11 21 41
3: 3 12 22 42
4: 4 13 23 43
5: 5 14 24 44
6: 6 15 25 45
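For completeness, the with = FALSE route mentioned at the top looks like this (same columns selected):

```r
library(data.table)
DT <- data.table(a = 1:6, b = 10:15, c = 20:25, d = 30:35, e = 40:45)
i <- 1
j <- 5
# with = FALSE makes j a vector of column positions rather than an expression:
DT[, c(i:(i + 2), j), with = FALSE]
```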
