I get different results if I use order() in data.frame and data.table. For example:
A <- data.frame(one=c("k"),two=c("3_28","31_60","48_68"))
B <- as.data.table(A)
A[order(A$one,A$two),]
one two
1 k 3_28
2 k 31_60
3 k 48_68
B[order(B$one, B$two),]
one two
1: k 31_60
2: k 3_28
3: k 48_68
I must admit this was a bit of a nasty shock, as I have assumed equivalent results for order() from data.frame and data.table for many years. I guess there is a lot of code I need to check!
Is there any way to ensure order() gives the same results in data.frame and data.table?
Many apologies if this difference in behavior is already well known, and is just an example of my ignorance.
When used inside a data.table operation, order(...) uses data.table:::forder. According to the Introduction to data.table vignette:
order() is internally optimised
We can use "-" on character columns within the frame of a data.table to sort in decreasing order.
In addition, order(...) within the frame of a data.table uses data.table's internal fast radix order forder(). This sort provided such a compelling improvement over R's base::order that the R project adopted the data.table algorithm as its default sort in 2016 for R 3.3.0, see ?sort and the R Release NEWS.
The key to seeing the difference is that data.table uses a "fast radix order". If you look at base::order, though, it has an argument method=, which is documented as:
method: the method to be used: partial matches are allowed. The
default ('"auto"') implies '"radix"' for short numeric
vectors, integer vectors, logical vectors and factors.
Otherwise, it implies '"shell"'. For details of methods
'"shell"', '"quick"', and '"radix"', see the help for 'sort'.
Since the second column of your data is a character vector (not numeric, integer, logical, or factor), base::order uses the "shell" method for sorting, which produces different results.
However, if we force base::order to use method="radix", we get the same result.
order(A$two)
# [1] 1 2 3
order(A$two, method="radix")
# [1] 2 1 3
A[order(A$one, A$two, method = "radix"),]
# one two
# 2 k 31_60
# 1 k 3_28
# 3 k 48_68
You can effect the same ordering in the data.table by calling base::order explicitly:
B[base::order(B$one,B$two),]
# one two
# <char> <char>
# 1: k 3_28
# 2: k 31_60
# 3: k 48_68
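As far as I understand, the underlying reason the two orders differ is collation: the radix sort compares character vectors byte-wise in the C locale, while base::order's default "shell" method uses the session's collation locale. A minimal sketch of just that comparison (the result of the first call depends on your locale):
Sys.getlocale("LC_COLLATE")                   # which collation locale is in effect
sort(c("3_28", "31_60"))                      # shell sort, locale-aware collation
sort(c("3_28", "31_60"), method = "radix")    # radix sort, C locale: "1" sorts before "_"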
I have a question related to Why does median trip up data.table (integer versus double)?, except in my case I am using a maximum and excluding missing values. In base R, the max of a length-0 vector is -Inf, which, interestingly, is a double and not an integer. I think there may be a bug in data.table's recent optimization routines.
Take this data table:
dt <- data.table(id = c(1,1,1,3,3,3), num = 1:6, log = c(F,F,F,T,F,T))
If we perform:
dt[, .(mnum = max(num[log], na.rm=T)), by=id]
We find the error:
Error in `[.data.table`(dt, , .(mnum = max(num[log], na.rm=T)), by=id) :
Column 1 of result for group 2 is type 'integer' but expecting type 'double'. Column types must be consistent for each group.
Am I correct in thinking this is a bug or is there a syntactic omission here?
The expected output would, of course, be:
mnum id
-Inf 1
6 3
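For what it's worth, the type mismatch the error message refers to can be reproduced in base R, and one possible workaround (my own suggestion, not part of the original post) is to coerce to double so every group returns the same type:
typeof(max(integer(0), na.rm = TRUE)) # "double": -Inf, with a warning (the id == 1 group)
typeof(max(c(4L, 6L), na.rm = TRUE))  # "integer" (the id == 3 group)
# coerce to double so both groups return a double (prints a warning for the empty group)
dt[, .(mnum = max(as.numeric(num[log]), na.rm = TRUE)), by = id]
#    id mnum
# 1:  1 -Inf
# 2:  3    6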
I have a very interesting problem, though I'd rather not have one.
I have to round a number to the closest multiple, so I followed the solution here.
It used to work OK, until I discovered this bug with data.table:
library(data.table)
options(digits = 20) # to see number representation
mround <- function (number, multiple) {
return(multiple * round(number / multiple))
}
DT = data.table(a = mround(112.3, 0.1), b = "B")
DT[a == 112.3,] # works as expected, i.e. returns one row
DT[a == 112.3 & b == 'B', ] # doesn't work
To be fair, with data.frame even the first filter doesn't work. Any ideas how to fix that?
Just to add to @Tens' great answer.
Three things seem to be happening here:
You have a floating point issue (as mentioned already)
You are using an old data.table version
Secondary indices are kicking in without you being aware of it
Using your setup
library(data.table)
options(digits = 20) # to see number representation
mround <- function (number, multiple) {
return(multiple * round(number / multiple))
}
DT = data.table(a = mround(112.3, 0.1), b = "B")
So let's address the points above. You have a floating point issue; quoting ?setNumericRounding:
Computers cannot represent some floating point numbers (such as 0.6) precisely, using base 2. This leads to unexpected behaviour when joining or grouping columns of type 'numeric'; i.e. 'double'.
This led the data.table devs to implement setNumericRounding, which auto-rounds floats so that the radix algorithm behaves as expected.
Prior to v1.9.8, setNumericRounding(2) was the default (hence your first example works), but after some complaints from users about inconsistency on GH (IIRC), since v1.9.8 the default has been setNumericRounding(0) in order to be consistent with data.frame behavior (see here). So if you update your data.table to the latest version, you will see that data.table and data.frame behave the same for both of your examples (and both will fail).
Compare
setNumericRounding(0)
DT[a == 112.3]
## Empty data.table (0 rows) of 2 cols: a,b
To
setNumericRounding(1)
DT[a == 112.3]
# a b
# 1: 112.30000000000001 B
So you will ask, "what on earth does the radix algorithm have to do with anything here?" Here we reach the third point above: secondary indices (please read this). Let's see what actually happens when you run your code:
options(datatable.verbose = TRUE)
DT[a == 112.3] # works as expected, i.e returns one row
# Creating new index 'a' <~~~~
# forder took 0 sec
# Starting bmerge ...done in 0 secs
# a b
# 1: 112.30000000000001 B
Let's check the new secondary index:
indices(DT)
#[1] "a"
When you ran ==, data.table set a as a secondary index in order to perform future operations much more efficiently (this was introduced in v1.9.4, see here). In other words, you performed a binary join on a instead of the usual vector scan used prior to v1.9.4. (As a side note, this can be disabled with options(datatable.auto.index = FALSE); in that case, none of your examples will work even with setNumericRounding(1), unless you explicitly specify a key using setkey or the on argument.)
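To make that side note concrete, here is a minimal sketch (assuming the same DT and setNumericRounding(1) as above; the comments describe what the explanation above implies rather than output I am quoting):
options(datatable.auto.index = FALSE)   # no automatic secondary indices
setNumericRounding(1)
DT[a == 112.3]                          # plain vector scan now, so the rounding no longer helps
DT[.(112.3), on = "a", nomatch = 0L]    # an explicit join via `on` still goes through bmerge
options(datatable.auto.index = TRUE)    # restore the default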
This mechanism probably also explains why
DT[a == 112.30000 & b == 'B']
doesn't work. You are subsetting here by two columns, and neither secondary indices nor binary joins (automatically) kick in for expressions such as == & == (yet). Hence you did a normal vector scan and setNumericRounding(1) didn't kick in.
You can, though, set the keys manually and make it work. For instance (as I commented under @Tens' answer), you can do
setNumericRounding(1) # make sure auto-rounding is turned on
DT[.(112.3, 'B'), nomatch = 0L, on = .(a, b)]
# Calculated ad hoc index in 0 secs
# Starting bmerge ...done in 0 secs
# a b
# 1: 112.3 B
Or using the old way
setkey(DT, a, b)
DT[.(112.3, 'B'), nomatch = 0L]
# Starting bmerge ...done in 0 secs
# a b
# 1: 112.3 B
It's a problem of floating point precision.
Using an error margin of 0.000001, e.g. DT[abs(a - 112.3) < 1.e-6 & b == 'B', ], will give you the proper result.
If you want more precision you can use .Machine$double.eps^0.5, as all.equal does.
The general advice is to never test floats for equality, but to compare the difference against a value near enough to the machine precision to get around the precision drift between 0 and 1; more details here.
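As a concrete version of that advice, you could wrap the comparison in a small helper (the name near and the default tolerance below are my own illustration, not from the original answer):
near <- function(x, y, tol = .Machine$double.eps^0.5) abs(x - y) < tol
DT[near(a, 112.3) & b == 'B', ]   # tolerance-based filter instead of ==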
One way to fix your problem could be to refactor your function to:
mround <- function(number, multiple,
                   digits = nchar(strsplit(as.character(multiple), ".", fixed = TRUE)[[1]][2])) {
  round(multiple * round(number / multiple), digits)
}
I used a "convoluted" method to get the digits needed from the multiple passed as default significant digits, adapt to your needs (you may used 2 here for example, or force the precision when calling).
I removed the unnecessary return which just cause the interpreter to look for a function already called at end of the function call.
This way your output should be precise enough, but you'll still have corner cases:
> mround(112.37,0.2)
[1] 112.40000000000001
To use floats in joins, you can use (courtesy of David Arenburg):
setNumericRounding(1)
DT[.(112.3, 'B'), nomatch = 0L, on = .(a, b)]
I would like to split one row into two (or more) rows when the cumsum of one of the columns breaks the period.
Is there any elegant way to perform such specific row explosion using data.table?
Do not focus on the cumsum (which I used in reversed order to have the cumsum run from the most recent row to the oldest one); strictly speaking, I want to transform dt into rdt from the code below.
# current data
dt <- data.table(
time_id = 101:110,
desc = c('asd','qwe','xyz','qwe','qwe','xyz','asd','asd','qwe','asd'),
value = c(5.5,3.5,14,0.7,6,5.5,9.3,29.8,4,7.2)
)
dt[, cum_value_from_now := rev(cumsum(rev(value)))]
period_width <- 10
dt[, value_period := ceiling(cum_value_from_now/period_width)*period_width]
dt
# expected result
rdt <- data.table(
time_id = c(101,102,103,103,104,105,105,106,107,107,108,108,108,108,109,109,110),
desc = c('asd','qwe','xyz','xyz','qwe','qwe','qwe','xyz','asd','asd','asd','asd','asd','asd','qwe','qwe','asd'),
value = c(5.5,3.5,6.5,7.5,0.7,1.8,4.2,5.5,0.3,9,1,10,10,8.8,1.2,2.8,7.2)
)[, cum_value_from_now := rev(cumsum(rev(value)))][, value_period := ceiling(cum_value_from_now/period_width)*period_width]
rdt
# validation
all.equal(
dt[,list(time_id,desc,value)],
rdt[,list(value = sum(value)), by=c('time_id','desc')]
)
edit: I realized my question did not explain well the transformation I want to perform. To better understand what breaks the period means, please look at the cum_value_from_now values in my rdt, from the last row to the first. Each value_period is completely filled by the cumsum of value; the remainder of value is pushed into a new row (if value is big enough, it is split across multiple rows) so it fits into the next period(s). Thanks
First, you seem to be applying your rules inconsistently. If "breaking the period" means that a row has value_period different from the previous row, then row 2 breaks the period, but you do not treat it that way.
Second, you never explain the partitioning of value. For instance, row 3 has value=14. This is replaced in rdt with two rows with values 6.5 and 7.5. These add to 14 all right, but there is no explanation of why this should be 6.5 and 7.5, rather than, say, 7 and 7. So in the solution below I partition equally.
The code below produces a result which passes your test, but it is not quite the same as your rdt, due to the above-mentioned problems with your question.
dt[,diff:=c(-diff(value_period)/10,0)]
rdt <- dt[,list(value=as.numeric(rep(value/(diff+1),diff+1))),
by=list(time_id,desc,cum_value_from_now, value_period)]
all.equal(
dt[,list(time_id,desc,value)],
rdt[,list(value = sum(value)), by=c('time_id','desc')]
)
# [1] TRUE
In R 2.15.0 and data.table 1.8.9:
d = data.table(a = 1:5, value = 2:6, key = "a")
d[J(3), value]
# a value
# 3 4
d[J(3)][, value]
# 4
I expected both to produce the same output (the 2nd one) and I believe they should.
In the interest of clearing up that this is not a J syntax issue, the same expectation applies to the following (identical to the above) expressions:
t = data.table(a = 3, key = "a")
d[t, value]
d[t][, value]
I would expect both of the above to return the exact same output.
So let me rephrase the question - why is (data.table designed so that) the key column printed out automatically in d[t, value]?
Update (based on answers and comments below): Thanks @Arun et al., I understand the design-why now. The reason the above prints the key is that there is a hidden by present every time you do a data.table merge via the X[Y] syntax, and that by is the key. The reason it's designed this way seems to be the following - since the by operation has to be performed when merging, one might as well take advantage of that and not do another by if you are going to do that by the key of the merge.
Now that said, I believe that's a syntax design flaw. The way I read data.table syntax d[i, j, by = b] is
take d, apply the i operation (be that subsetting or merging or whatnot), and then do the j expression "by" b
The by-without-by breaks this reading and introduces cases one has to think about specifically (am I merging on i, is by just the key of the merge, etc). I believe this should be the job of the data.table - the commendable effort to make data.table faster in one particular case of the merge, when the by is equal to the key, should be done in an alternative way (e.g. by checking internally if the by expression is actually the key of the merge).
Edit number Infinity: FAQ 1.12 exactly answers your question. (Also useful/relevant is FAQ 1.13, not pasted here.)
1.12 What is the difference between X[Y] and merge(X,Y)?
X[Y] is a join, looking up X's rows using Y (or Y's key if it has one) as an index. Y[X] is a join, looking up Y's rows using X (or X's key if it has one) as an index. merge(X,Y) does both ways at the same time. The number of rows of X[Y] and Y[X] usually differ; whereas the number of rows returned by merge(X,Y) and merge(Y,X) is the same. BUT that misses the main point. Most tasks require something to be done on the data after a join or merge. Why merge all the columns of data, only to use a small subset of them afterwards?
You may suggest merge(X[,ColsNeeded1],Y[,ColsNeeded2]), but that takes copies of the subsets of data, and it requires the programmer to work out which columns are needed. X[Y,j] in data.table does all that in one step for you. When you write X[Y,sum(foo*bar)], data.table
automatically inspects the j expression to see which columns it uses. It will subset only those columns; the others are ignored. Memory is only created for the columns the j uses, and Y columns enjoy standard R recycling rules within the context of each group. Let's say foo is in X, and bar is in Y (along with 20 other columns in Y). Isn't X[Y,sum(foo*bar)] quicker to program and quicker to run than a merge followed by a subset?
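To make the FAQ's point concrete, here is a minimal sketch with hypothetical tables X and Y keyed on id (shown with the current default join semantics, i.e. without by = .EACHI):
library(data.table)
X <- data.table(id = 1:3, foo = c(2, 4, 6), key = "id")
Y <- data.table(id = c(1, 3), bar = c(10, 100), other = c("a", "b"), key = "id")
X[Y, sum(foo * bar)]          # join and j in one step; only foo and bar are touched
# [1] 620
sum(merge(X, Y)[, foo * bar]) # the merge-then-subset alternative the FAQ argues against
# [1] 620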
(Old answer which, according to the OP's comment, did nothing to answer the OP's question; retained here because I believe it does.)
When you give a value for j like d[, 4] or d[, value] in data.table, the j is evaluated as an expression. From the data.table FAQ 1.1 on accessing DT[, 5] (the very first FAQ):
Because, by default, unlike a data.frame, the 2nd argument is an expression which is evaluated within the scope of DT. 5 evaluates to 5.
The first thing, therefore, to understand is, in your case:
d[, value] # produces a "vector"
# [1] 2 3 4 5 6
This is not different when the query for i is a basic indexing like:
d[3, value] # produces a vector of length 1
# [1] 4
However, this is different when i is by itself a data.table. From data.table introduction (page 6):
d[J(3)] # is equivalent to d[data.table(a = 3)]
Here, you are performing a join. If you just do d[J(3)] then you'd get all columns corresponding to that join. If you do,
d[J(3), value] # which is equivalent to d[J(3), list(value)]
Since you say this answer does nothing to answer your question, I'll point to where the answer to your "rephrased" question, I believe, lies: ---> then you'd get just that column, but since you're performing a join, the key column will also be output (as it's a join between two tables based on the key column).
Edit: Following your 2nd edit, if your question is why so?, then I'd reluctantly (or rather ignorantly) answer that Matthew Dowle designed it so in order to differentiate between a data.table join-based subset and an index-based subsetting operation.
Your second syntax is equivalent to:
d[J(3)][, value] # is equivalent to:
dd <- d[J(3)]
dd[, value]
where, again, in dd[, value], j is evaluated as an expression and therefore you get a vector.
To answer your 3rd modified question: for the 3rd time, it's because it is a JOIN between two data.tables based on the key column. If I join two data.tables, I'd expect a data.table.
From data.table introduction, once again:
Passing a data.table into a data.table subset is analogous to A[B] syntax in base R where A is a matrix and B is a 2-column matrix. In fact, the A[B] syntax in base R inspired the data.table package.
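For reference, the base R analogy the introduction mentions looks like this (a minimal sketch):
A <- matrix(1:9, nrow = 3, dimnames = list(NULL, c("x", "y", "z")))
B <- cbind(c(1, 3), c(2, 3)) # a 2-column matrix of (row, column) pairs
A[B]                         # picks out A[1, 2] and A[3, 3]
# [1] 4 9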
As of data.table 1.9.3, the default behavior has been changed and the examples below produce the same result. To get the by-without-by result, one now has to specify an explicit by=.EACHI:
d = data.table(a = 1:5, value = 2:6, key = "a")
d[J(3), value]
#[1] 4
d[J(3), value, by = .EACHI]
# a value
#1: 3 4
And here's a slightly more complicated example, illustrating the difference:
d = data.table(a = 1:2, b = 1:6, key = 'a')
# a b
#1: 1 1
#2: 1 3
#3: 1 5
#4: 2 2
#5: 2 4
#6: 2 6
# normal join
d[J(c(1,2)), sum(b)]
#[1] 21
# join with a by-without-by, or by-each-i
d[J(c(1,2)), sum(b), by = .EACHI]
# a V1
#1: 1 9
#2: 2 12
# and a more complicated example:
d[J(c(1,2,1)), sum(b), by = .EACHI]
# a V1
#1: 1 9
#2: 2 12
#3: 1 9
This is not unexpected behaviour; it is documented behaviour. Arun has done a good job of explaining and demonstrating this via the FAQ, where it is clearly documented.
There is a feature request, FR 1757, that proposes the use of a drop argument for this case. When implemented, the behaviour you want might be coded as:
d = data.table(a = 1:5, value = 2:6, key = "a")
d[J(3), value, drop = TRUE]
I agree with Arun's answer. Here's another wording: After you do a join, you often will use the join column as a reference or as an input to further transformation. So you keep it, and you have an option to discard it with the (more roundabout) double [ syntax. From a design perspective, it is easier to keep frequently relevant information and then discard when desired, than to discard early and risk losing data that is difficult to reconstruct.
Another reason that you'd want to keep the join column is that you can perform aggregate operations at the same time as you perform a join (the by without by). For example, the results here are much clearer by including the join column:
d <- data.table(a=rep.int(1:3,2),value=2:7,other=100:105,key="a")
d[J(1:3),mean(value)]
# a V1
#1: 1 3.5
#2: 2 4.5
#3: 3 5.5
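As noted in the earlier answer about .EACHI, with a current data.table version this by-without-by has to be requested explicitly, so the equivalent call would be:
d[J(1:3), mean(value), by = .EACHI]
# a V1
#1: 1 3.5
#2: 2 4.5
#3: 3 5.5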