Summing in data.table returns different values in R 3 vs 4

I am getting a weird summation problem when using data.table in R 4.0.2. When I group data by one column and sum the other (the bar[,.(C = sum(B)), by = A] line), I get some incorrect numbers. Here is a reprex where I only load data.table:
> library(data.table)
> bar <- data.table(data.frame("A" = as.character(c(1,2,3,2,3,2)), "B" = as.numeric(c(1,2,3,4,5,6))))
> bar
A B
1: 1 1
2: 2 2
3: 3 3
4: 2 4
5: 3 5
6: 2 6
> bar[,.(C = sum(B)), by = A]
A C
1: 1 2
2: 2 10
3: 3 8
> bar[A == 1, sum(B)]
[1] 1
> bar[A == 2, sum(B)]
[1] 12
> bar[A == 3, sum(B)]
[1] 8
> bar[,.(C = sum(as.integer(B))), by = A]
A C
1: 1 1
2: 2 12
3: 3 8
Yet, if I do this on R 3.6.3, everything works as I expect, and the problematic portion above now looks like:
> bar[,.(C = sum(B)), by = A]
A C
1: 1 1
2: 2 12
3: 3 8
And everything else is the same.
Did R 4.* change some method of how numerics are summed? Why is it fixed when I convert to integers first?

Related

Add max column using variable

I am trying to do the same thing as in this question: Add max value to a new column in R. However, I want to pass in a variable instead of the column name directly, so I don't hard-code the column's name into the formula.
Sample code:
a <- c(1,1,2,2,3,3)
b <- c(1,3,5,9,4,NA)
d <- data.table(a, b)
d
a b
1 1
1 3
2 5
2 9
3 4
3 NA
I can get this:
a b max_b
1 1 3
1 3 3
2 5 9
2 9 9
3 4 4
3 NA 4
By hard coding it: setDT(d)[, max_b:= max(b, na.rm = T), a] but I would like to do something like this instead:
cn <- "b"
setDT(d)[, paste0("max_", cn):= max(cn, na.rm = T), a]
However, this does not work because inside max() cn evaluates to the character string rather than the column, so I get a column named max_b that contains the value "b" (because max("b") is "b"). I understand why this happens; I just do not know a workaround.
What is a solution to this?
Note: the above stack question I tagged was marked as a duplicate and closed, but I chose that question because I am using the accepted answer from it in my code. I also do not 100% agree that it is a duplicate question anyways.
Try setDT(d)[, paste0("max_", cn) := eval(parse(text = max(eval(parse(text = cn))))), a]
# output
a b max_b
1: 1 1 3
2: 1 3 3
3: 2 5 9
4: 2 9 9
5: 3 4 4
# example with missing values
a <- c(1,1,2,2,3,3)
b <- c(1,3,5,9,4,NA)
d <- data.table(a, b)
cn <- "b"
setDT(d)[, paste0("max_", cn) := eval(parse(text = max(eval(parse(text = cn)),
na.rm = TRUE))), a]
#output
a b max_b
1: 1 1 3
2: 1 3 3
3: 2 5 9
4: 2 9 9
5: 3 4 4
6: 3 NA 4
One option is to specify the variable in .SDcols and then apply the function on .SD (Subset of Data.table).
d[, paste0("max_", cn) := lapply(.SD, max, na.rm = TRUE), by = a, .SDcols = cn]
d
# a b max_b
#1: 1 1 3
#2: 1 3 3
#3: 2 5 9
#4: 2 9 9
#5: 3 4 4
#6: 3 NA 4
Another option is to convert the name to a symbol and then evaluate it:
d[, paste0("max_", cn) := max(eval(as.symbol(cn)), na.rm = TRUE), by = a]
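For comparison, here is a minimal sketch of one more variant, using get() to look the column up by the name stored in cn. The helper variable new_col is only there for illustration and is not part of the answers above; the data are the sample from the question.

```r
library(data.table)

d <- data.table(a = c(1, 1, 2, 2, 3, 3), b = c(1, 3, 5, 9, 4, NA))
cn <- "b"                      # column name held in a variable
new_col <- paste0("max_", cn)  # hypothetical helper: the new column's name

# max(cn) would be the max of the string "b", i.e. "b" itself.
# get(cn) instead fetches the column called "b" within each group.
d[, (new_col) := max(get(cn), na.rm = TRUE), by = a]
print(d)
# max_b is 3, 3, 9, 9, 4, 4
```

Wrapping new_col in parentheses on the left of := tells data.table to use the value of the variable as the column name rather than creating a column literally called new_col.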

Vectorised between: datatable R

I am having a hard time understanding the "Vectorised between" example in the data.table package documentation (v1.10.4).
X = data.table(a=1:5, b=6:10, c=c(5:1))
> X
a b c
1: 1 6 5
2: 2 7 4
3: 3 8 3
4: 4 9 2
5: 5 10 1
# NEW feature in v1.9.8, vectorised between
> X[c %between% list(a,b)]
a b c
1: 1 6 5
2: 2 7 4
3: 3 8 3
X[between(c, a, b)] # same as above
Can someone please explain how it works? Why were only 5, 4, 3 from c selected? Thanks.
-----As posted in comments----
In row 4, 2 is not between 4 and 9: between(c=2, a=4, b=9) is FALSE. The same goes for row 5, where 1 is not between 5 and 10.
between uses >= and <= (rather than > and <). That's why row 3 is returned: 3 >= 3 is TRUE.
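To see the row-by-row comparison spelled out, here is a small sketch that rebuilds the same filter from plain >= and <= comparisons. The column names lower_ok, upper_ok and inside are made up for illustration.

```r
library(data.table)

X <- data.table(a = 1:5, b = 6:10, c = 5:1)

# Each row compares its own c against its own a and b.
X[, lower_ok := c >= a]             # is c at or above the lower bound a?
X[, upper_ok := c <= b]             # is c at or below the upper bound b?
X[, inside := lower_ok & upper_ok]  # both must hold, bounds included
print(X)
# inside is TRUE for rows 1-3 and FALSE for rows 4-5

# Filtering on the hand-built flag keeps the rows where c is 5, 4, 3.
print(X[inside == TRUE, c])
```

Rows 4 and 5 fail on the lower bound (2 < 4 and 1 < 5), which is why they drop out.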

get rows of unique values by group

I have a data.table and want to pick those lines of the data.table where some values of a variable x are unique relative to another variable y
It's possible to get the unique values of x, grouped by y in a separate dataset, like this
dt[,unique(x),by=y]
But I want to pick the rows in the original dataset where this is the case. I don't want a new data.table because I also need the other variables.
So, what do I have to add to my code to get the rows in dt for which the above is true?
dt <- data.table(y=rep(letters[1:2],each=3),x=c(1,2,2,3,2,1),z=1:6)
y x z
1: a 1 1
2: a 2 2
3: a 2 3
4: b 3 4
5: b 2 5
6: b 1 6
What I want:
y x z
1: a 1 1
2: a 2 2
3: b 3 4
4: b 2 5
5: b 1 6
The idiomatic data.table way is:
require(data.table)
unique(dt, by = c("y", "x"))
# y x z
# 1: a 1 1
# 2: a 2 2
# 3: b 3 4
# 4: b 2 5
# 5: b 1 6
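duplicated works a little differently for a data.table. Older versions compared only the key columns by default; since v1.9.8 the default is all columns, so pass by = key(dt) explicitly. Here's the approach I've seen around here somewhere before:
dt <- data.table(y=rep(letters[1:2],each=3),x=c(1,2,2,3,2,1),z=1:6)
setkey(dt, "y", "x")
key(dt)
# [1] "y" "x"
!duplicated(dt, by = key(dt))
# [1] TRUE TRUE FALSE TRUE TRUE TRUE
dt[!duplicated(dt, by = key(dt))]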
# y x z
# 1: a 1 1
# 2: a 2 2
# 3: b 1 6
# 4: b 2 5
# 5: b 3 4
The simpler data.table solution is to grab the first element of each group
> dt[, head(.SD, 1), by=.(y, x)]
y x z
1: a 1 1
2: a 2 2
3: b 3 4
4: b 2 5
5: b 1 6
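A related sketch, in case copying .SD for every group is slow on a large table: collect the row number of the first row of each (y, x) group with .I and subset the original table once. The names first_rows and res are only for illustration.

```r
library(data.table)

dt <- data.table(y = rep(letters[1:2], each = 3),
                 x = c(1, 2, 2, 3, 2, 1),
                 z = 1:6)

# .I[1] is the position in dt of the first row of each (y, x) group.
first_rows <- dt[, .I[1], by = .(y, x)]$V1

# Subsetting by those positions keeps every column of the original rows.
res <- dt[first_rows]
print(res)
# keeps the rows with z = 1, 2, 4, 5, 6
```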
Using dplyr:
library(dplyr)
col1 = c(1,1,3,3,5,6,7,8,9)
col2 = c("cust1", 'cust1', 'cust3', 'cust4', 'cust5', 'cust5', 'cust5', 'cust5', 'cust6')
df1 = data.frame(col1, col2)
df1
distinct(select(df1, col1, col2))

R data.table with rollapply

Is there an existing idiom for computing rolling statistics using data.table grouping?
For example, given the following code:
library(data.table)
library(zoo)  # provides rollapply
DT = data.table(x=rep(c("a","b","c"),each=2), y=c(1,3), v=1:6)
setkey(DT, y)
stat.ror <- DT[,rollapply(v, width=1, by=1, mean, na.rm=TRUE), by=y]
If there isn't one yet, what would be the best way to do it?
In fact I am trying to solve this very problem right now. Here is a partial solution which will work for grouping by a single column:
Edit: got it with RcppRoll, I think:
windowed.average <- function(input.table,
                             window.width = 2,
                             id.cols = names(input.table)[3],
                             index.col = names(input.table)[1],
                             val.col = names(input.table)[2]) {
  require(RcppRoll)
  avg.with.group <-
    input.table[, roll_mean(get(val.col), n = window.width), by = c(id.cols)]
  avg.index <-
    input.table[, roll_mean(get(index.col), n = window.width), by = c(id.cols)]$V1
  output.table <- data.table(
    Group = avg.with.group,
    Index = avg.index)
  # rename columns to (sensibly) match inputs
  setnames(output.table, old = colnames(output.table),
           new = c(id.cols, val.col, index.col))
  return(output.table)
}
A (badly written) unit test that will pass the above:
require(testthat)
require(zoo)
test.datatable <- data.table(Time = rep(seq_len(10), times=2),
                             Voltage = runif(20),
                             Channel = rep(seq_len(2), each=10))
test.width <- 8
# first test: single id column
test.avgtable <- data.table(
  test.datatable[, rollapply(Voltage, width = test.width, mean, na.rm=TRUE),
                 by=c("Channel")],
  Time = test.datatable[, rollapply(Time, width = test.width, mean, na.rm=TRUE),
                        by=c("Channel")]$V1)
setnames(test.avgtable, old=names(test.avgtable),
         new=c("Channel","Voltage","Time"))
expect_that(test.avgtable,
            is_identical_to(windowed.average(test.datatable, test.width)))
How it looks:
> test.datatable
Time Voltage Channel Class
1: 1 0.310935570 1 1
2: 2 0.565257533 1 2
3: 3 0.577278573 1 1
4: 4 0.152315111 1 2
5: 5 0.836052122 1 1
6: 6 0.655417230 1 2
7: 7 0.034859642 1 1
8: 8 0.572040136 1 2
9: 9 0.268105436 1 1
10: 10 0.126484340 1 2
11: 1 0.139711248 2 1
12: 2 0.336316520 2 2
13: 3 0.413086486 2 1
14: 4 0.304146029 2 2
15: 5 0.399344631 2 1
16: 6 0.581641210 2 2
17: 7 0.183586025 2 1
18: 8 0.009775488 2 2
19: 9 0.449576242 2 1
20: 10 0.938517952 2 2
> test.avgtable
Channel Voltage Time
1: 1 0.4630195 4.5
2: 1 0.4576657 5.5
3: 1 0.4028191 6.5
4: 2 0.2959510 4.5
5: 2 0.3346841 5.5
6: 2 0.4099593 6.5
Unfortunately, I haven't managed to make it work with multiple groupings (as this second section shows):
Looks okay for multiple column groups:
# second test: multiple id columns
# Depends on the first test passing to be meaningful.
test.width <- 4
test.datatable[,Class:= rep(seq_len(2),times=ceiling(nrow(test.datatable)/2))]
# windowed.average(test.datatable,test.width,id.cols=c("Channel","Class"))
test.avgtable <- rbind(windowed.average(test.datatable[Class==1,],test.width),
windowed.average(test.datatable[Class==2,],test.width))
# somewhat artificially attaching expected class labels
test.avgtable[,Class:= rep(seq_len(2),times=nrow(test.avgtable)/4,each=2)]
setkey(test.avgtable,Channel)
setcolorder(test.avgtable,c("Channel","Class","Voltage","Time"))
expect_that(test.avgtable,
is_equivalent_to(windowed.average(test.datatable,test.width,
id.cols=c("Channel","Class"))))
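For readers on a newer data.table (1.12.0 or later, as far as I know), the package ships its own rolling mean, frollmean, which can be used directly inside a grouped j. The sketch below is not the code from this answer; the table and the column names g, v and roll are made up for illustration.

```r
library(data.table)

DT <- data.table(g = rep(c("a", "b"), each = 4),
                 v = c(1, 2, 3, 4, 10, 20, 30, 40))

# Rolling mean over a window of 2 rows, restarted within each group.
# By default the window is right-aligned and incomplete windows give NA.
DT[, roll := frollmean(v, n = 2), by = g]
print(DT)
# roll is NA, 1.5, 2.5, 3.5 for group a and NA, 15, 25, 35 for group b
```

Unlike the roll_mean and rollapply calls above, this keeps one output row per input row, so the result can be added as a column with := instead of building a separate table.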

Add a countdown column to data.table containing rows until a special row encountered

I have a data.table with ordered, labelled data, and I want to add a column that tells me how many records remain until I reach a "special" record that resets the countdown.
For example:
DT = data.table(idx = c(1,3,3,4,6,7,7,8,9),
name = c("a", "a", "a", "b", "a", "a", "b", "a", "b"))
setkey(DT, idx)
#manually add the answer
DT[, countdown := c(3,2,1,0,2,1,0,1,0)]
Gives
> DT
idx name countdown
1: 1 a 3
2: 3 a 2
3: 3 a 1
4: 4 b 0
5: 6 a 2
6: 7 a 1
7: 7 b 0
8: 8 a 1
9: 9 b 0
See how the countdown column tells me how many rows until a row called "b".
The question is how to create that column in code.
Note that the key is not evenly spaced and may contain duplicates (so is not very useful in solving the problem). In general the non-b names could be different, but I could add a dummy column that is just True/False if the solution requires this.
Here's another idea:
## Create groups that end at each occurrence of "b"
DT[, cd:=0L]
DT[name=="b", cd:=1L]
DT[, cd:=rev(cumsum(rev(cd)))]
## Count down within them
DT[, cd:=max(.I) - .I, by=cd]
# idx name cd
# 1: 1 a 3
# 2: 3 a 2
# 3: 3 a 1
# 4: 4 b 0
# 5: 6 a 2
# 6: 7 a 1
# 7: 7 b 0
# 8: 8 a 1
# 9: 9 b 0
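To make the grouping step easier to follow, here is the same idea traced in base R on the name vector alone. The variable names is_b, grp and countdown are only for illustration, and ave() stands in for the by= step.

```r
name <- c("a", "a", "a", "b", "a", "a", "b", "a", "b")

# 1 where the row is a "b", 0 elsewhere.
is_b <- as.integer(name == "b")

# Counting the b's from the bottom up labels every row with the number of
# b's at or below it, so each group ends at a "b".
grp <- rev(cumsum(rev(is_b)))
print(grp)
# 3 3 3 3 2 2 2 1 1

# Within each group, count down to the last row (the "b").
countdown <- ave(seq_along(name), grp, FUN = function(i) max(i) - i)
print(countdown)
# 3 2 1 0 2 1 0 1 0
```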
I'm sure (or at least hopeful) that a purely "data.table" solution would be generated, but in the meantime, you could make use of rle. In this case, you're interested in reversing the countdown, so we'll use rev to reverse the "name" values before proceeding.
output <- sequence(rle(rev(DT$name))$lengths)
makezero <- cumsum(rle(rev(DT$name))$lengths)[c(TRUE, FALSE)]
output[makezero] <- 0
DT[, countdown := rev(output)]
DT
# idx name countdown
# 1: 1 a 3
# 2: 3 a 2
# 3: 3 a 1
# 4: 4 b 0
# 5: 6 a 2
# 6: 7 a 1
# 7: 7 b 0
# 8: 8 a 1
# 9: 9 b 0
Here's a mix of Josh's and Ananda's solutions: I use rle to build the groups, then count down within them the way Josh's answer does:
t <- rle(DT$name)
t <- t$lengths[t$values == "a"]
DT[, cd := rep(t, t+1)]
DT[, cd:=max(.I) - .I, by=cd]
Even better: making use of the fact that each run of a's is followed by exactly one b (an assumption here), you could simplify this further:
t <- rle(DT$name)
t <- t$lengths[t$values == "a"]
DT[, cd := rev(sequence(rev(t+1)))-1]
Edit: From the OP's comment, it seems clear that more than one consecutive b is possible, and in such cases all b's should be 0. The first step is to create groups, each ending after the run of b's that follows a run of a's.
DT <- data.table(idx=sample(10), name=c("a","a","a","b","b","a","a","b","a","b"))
t <- rle(DT$name)
val <- cumsum(t$lengths)[t$values == "b"]
DT[, grp := rep(seq_along(val), c(val[1], diff(val)))]
DT[, val := c(rev(seq_len(sum(name == "a"))),
rep(0, sum(name == "b"))), by = grp]
# idx name grp val
# 1: 1 a 1 3
# 2: 7 a 1 2
# 3: 9 a 1 1
# 4: 4 b 1 0
# 5: 2 b 1 0
# 6: 8 a 2 2
# 7: 6 a 2 1
# 8: 3 b 2 0
# 9: 10 a 3 1
# 10: 5 b 3 0

Resources