order() in data.frame and data.table

I get different results if I use order() in data.frame and data.table. For example:
A <- data.frame(one=c("k"),two=c("3_28","31_60","48_68"))
B <- as.data.table(A)
A[order(A$one,A$two),]
one two
1 k 3_28
2 k 31_60
3 k 48_68
B[order(B$one, B$two),]
one two
1: k 31_60
2: k 3_28
3: k 48_68
I must admit this was a bit of a nasty shock, as I have assumed equivalent results for order() from data.frame and data.table for many years. I guess there is a lot of code I need to check!
Is there any way to ensure order() gives the same results in data.frame and data.table?
Many apologies if this difference in behavior is already well known, and is just an example of my ignorance.

When used inside a data.table operation, order(...) uses data.table's internal forder (data.table:::forder). According to the Introduction to data.table vignette:
order() is internally optimised
We can use "-" on a character columns within the frame of a data.table to sort in decreasing order.
In addition, order(...) within the frame of a data.table uses data.table's internal fast radix order forder(). This sort provided such a compelling improvement over R's base::order that the R project adopted the data.table algorithm as its default sort in 2016 for R 3.3.0, see ?sort and the R Release NEWS.
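For example (a sketch using B from this question; the ordering shown reflects the byte-wise radix sort discussed below):
B[order(one, -two)]
#       one    two
#    <char> <char>
# 1:      k  48_68
# 2:      k  3_28
# 3:      k  31_60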
The key to the difference is that data.table uses a "fast radix order". base::order, on the other hand, has a method= argument:
method: the method to be used: partial matches are allowed. The
default ('"auto"') implies '"radix"' for short numeric
vectors, integer vectors, logical vectors and factors.
Otherwise, it implies '"shell"'. For details of methods
'"shell"', '"quick"', and '"radix"', see the help for 'sort'.
Since the second column of your data is a character vector (not numeric, integer, logical, or factor), base::order uses the "shell" method for sorting, which produces different results.
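As a quick check (a sketch, not part of the original answer): for an integer vector the default method already resolves to radix, so base R and data.table agree; the mismatch only appears for character data, where the default falls back to shell sort and locale-aware collation.
x <- c(10L, 2L, 33L)
identical(order(x), order(x, method = "radix"))
# [1] TRUE
y <- c("3_28", "31_60", "48_68")
identical(order(y), order(y, method = "radix"))
# [1] FALSE  (in a typical non-C locale; radix compares character data byte-wise)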
However, if we force base::order to use method="radix", we get the same result.
order(A$two)
# [1] 1 2 3
order(A$two, method="radix")
# [1] 2 1 3
A[order(A$one, A$two, method = "radix"),]
# one two
# 2 k 31_60
# 1 k 3_28
# 3 k 48_68
You can get the same ordering (matching the data.frame result) within the data.table by calling base::order explicitly:
B[base::order(B$one,B$two),]
# one two
# <char> <char>
# 1: k 3_28
# 2: k 31_60
# 3: k 48_68


R: detect changing characters without loop

I'm analyzing a huge dataset of ~700000 rows.
I would like to detect where (in which rows) the character changes from the previous one, without using loops.
For instance, in the array "dat", the ideal function would give c(4,6)
dat=c(BIS84003, BIS84003, BIS84003, BIS84005, BIS84005, BIS84006)
Does someone have any idea?
Here are two ways of doing this:
1. Use run-length encoding
2. Directly compare vectors
Method 1: Use run length encoding with the function rle().
dat=c("BIS84003", "BIS84003", "BIS84003", "BIS84005", "BIS84005", "BIS84006")
head(cumsum(rle(dat)$lengths) + 1, -1)
[1] 4 6
Method 2: compare vectors
1 + which(dat[-1] != dat[-length(dat)])
[1] 4 6
Using diff
which(!!c(0,diff(as.numeric(factor(dat)))))
#[1] 4 6
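If the data already lives in a data.table workflow, data.table::rleid() offers another loop-free option (a sketch, reusing the quoted dat defined above):
library(data.table)
which(diff(rleid(dat)) != 0) + 1
# [1] 4 6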

Using .I to return row numbers with data.table package

Would someone please explain to me the correct usage of .I for returning the row numbers of a data.table?
I have data like this:
require(data.table)
DT <- data.table(X=c(5, 15, 20, 25, 30))
DT
# X
# 1: 5
# 2: 15
# 3: 20
# 4: 25
# 5: 30
I want to return a vector of row indices where a condition in i is TRUE, e.g. which rows have an X greater than 20.
DT[X > 20]
# rows 4 & 5 are greater than 20
To get the indices, I tried:
DT[X > 20, .I]
# [1] 1 2
...but clearly I am doing it wrong, because that simply returns a vector containing 1 to the number of returned rows. (Which I thought was pretty much what .N was for?).
Sorry if this seems extremely basic, but all I have been able to find in the data.table documentation is WHAT .I and .N do, not HOW to use them.
If all you want is the row numbers rather than the rows themselves, then use which = TRUE, not .I.
DT[X > 20, which = TRUE]
# [1] 4 5
That way you get the benefits of optimization of i, for example fast joins or using an automatic index. The which = TRUE makes it return early with just the row numbers.
Here's the manual entry for the which argument inside data.table:
TRUE returns the row numbers of x that i matches to. If NA, returns
the row numbers of i that have no match in x. By default FALSE and the
rows in x that match are returned.
Explanation:
Notice there is a specific relationship between .I and the i = .. argument in DT[i = .., j = .., by = ..]
Namely, .I is a vector of row numbers of the subsetted table.
### Let's create some sample data
set.seed(1)
LL <- sample(LETTERS[1:5], 20, TRUE)
DT <- data.table(X=LL)
Look at the difference between using .I when i subsets the table and subsetting .I itself in j:
DT[X == "B", .I]
# [1] 1 2 3 4 5 6
DT[ , .I[X == "B"] ]
# [1] 1 2 5 11 14 19
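Applying the same idea to the DT from the question, you can evaluate the condition inside j so that .I still indexes the full table (a sketch):
DT <- data.table(X = c(5, 15, 20, 25, 30))
DT[, .I[X > 20]]
# [1] 4 5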
Sorry if this seems extremely basic, but all I have been able to find in the data.table documentation is WHAT .I and .N do, not HOW to use them.
First let's check the documentation. I typed ?data.table and searched for .I. Here's what's there :
Advanced: When grouping, symbols .SD, .BY, .N, .I and .GRP may be used
in the j expression, defined as follows.
.I is an integer vector equal to seq_len(nrow(x)). While grouping, it
holds for each item in the group its row location in x. This is
useful to subset in j; e.g. DT[, .I[which.max(somecol)], by=grp].
Emphasis added by me here. The original intention was for .I to be used while grouping. Note that there is in fact an example there in the documentation of HOW to use .I.
You aren't grouping.
That said, what you tried was reasonable. Over time these symbols have come to be used when not grouping as well. There might be a case that .I should return what you expected. I can see that using .I in j together with both i and by could be useful. Currently .I doesn't seem helpful when i is present, as you pointed out.
Using the which() function is good but might then circumvent optimization in i (which() needs a long logical input which has to be created and passed to it). Using the which=TRUE argument is good but then just returns the row numbers (you couldn't then do something with those row numbers in j by group).
Feature request #1494 filed to discuss changing .I to work the way you expected. The documentation does contain the words "its row location in x" which would imply what you expected since x is the whole data.table.
Alternatively,
DataTable[ , which(X>10) ]
is probably easier to understand and more idiomatically R.

data.table join and j-expression unexpected behavior

In R 2.15.0 and data.table 1.8.9:
d = data.table(a = 1:5, value = 2:6, key = "a")
d[J(3), value]
# a value
# 3 4
d[J(3)][, value]
# 4
I expected both to produce the same output (the 2nd one) and I believe they should.
In the interest of clearing up that this is not a J syntax issue, the same expectation applies to the following (identical to the above) expressions:
t = data.table(a = 3, key = "a")
d[t, value]
d[t][, value]
I would expect both of the above to return the exact same output.
So let me rephrase the question - why is (data.table designed so that) the key column printed out automatically in d[t, value]?
Update (based on answers and comments below): Thanks @Arun et al., I understand the design-why now. The reason the above prints the key is that there is a hidden by present every time you do a data.table merge via the X[Y] syntax, and that by is by the key. The reason it's designed this way seems to be the following: since the by operation has to be performed when merging, one might as well take advantage of that and not do another by if you are going to do that by the key of the merge.
Now that said, I believe that's a syntax design flaw. The way I read data.table syntax d[i, j, by = b] is
take d, apply the i operation (be that subsetting or merging or whatnot), and then do the j expression "by" b
The by-without-by breaks this reading and introduces cases one has to think about specifically (am I merging on i, is by just the key of the merge, etc). I believe this should be the job of the data.table - the commendable effort to make data.table faster in one particular case of the merge, when the by is equal to the key, should be done in an alternative way (e.g. by checking internally if the by expression is actually the key of the merge).
Edit number Infinity: FAQ 1.12 exactly answers your question (FAQ 1.13 is also useful/relevant, but not pasted here).
1.12 What is the difference between X[Y] and merge(X,Y)?
X[Y] is a join, looking up X's rows using Y (or Y's key if it has one) as an index. Y[X] is a join, looking up Y's rows using X (or X's key if it has one) as an index. merge(X,Y) does both ways at the same time. The number of rows of X[Y] and Y[X] usually differ; whereas the number of rows returned by merge(X,Y) and merge(Y,X) is the same. BUT that misses the main point. Most tasks require something to be done on the data after a join or merge. Why merge all the columns of data, only to use a small subset of them afterwards?
You may suggest merge(X[,ColsNeeded1],Y[,ColsNeeded2]), but that takes copies of the subsets of data, and it requires the programmer to work out which columns are needed. X[Y,j] in data.table does all that in one step for you. When you write X[Y,sum(foo*bar)], data.table automatically inspects the j expression to see which columns it uses. It will subset only those columns; the others are ignored. Memory is only created for the columns j uses, and Y columns enjoy standard R recycling rules within the context of each group. Let's say foo is in X, and bar is in Y (along with 20 other columns in Y). Isn't X[Y,sum(foo*bar)] quicker to program and quicker to run than a merge followed by a subset?
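A small sketch of that column inspection (toy tables and column names assumed here, not taken from the FAQ): only foo and bar are needed for the computation, and the other column is never touched.
library(data.table)
X <- data.table(id = 1:3, foo = c(2, 4, 6), other = 101:103, key = "id")
Y <- data.table(id = c(1, 3), bar = c(10, 100), key = "id")
X[Y, sum(foo * bar)]   # 2*10 + 6*100
# [1] 620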
(Old answer which, according to the OP's comment, did nothing to answer the question; retained here because I believe it does.)
When you give a value for j like d[, 4] or d[, value] in data.table, the j is evaluated as an expression. From the data.table FAQ 1.1 on accessing DT[, 5] (the very first FAQ):
Because, by default, unlike a data.frame, the 2nd argument is an expression which is evaluated within the scope of DT. 5 evaluates to 5.
The first thing, therefore, to understand is, in your case:
d[, value] # produces a "vector"
# [1] 2 3 4 5 6
This is not different when the query for i is a basic indexing like:
d[3, value] # produces a vector of length 1
# [1] 4
However, this is different when i is by itself a data.table. From data.table introduction (page 6):
d[J(3)] # is equivalent to d[data.table(a = 3)]
Here, you are performing a join. If you just do d[J(3)] then you'd get all columns corresponding to that join. If you do,
d[J(3), value] # which is equivalent to d[J(3), list(value)]
Since you say this answer does nothing to answer your question, I'll point to where I believe the answer to your "rephrased" question lies: ---> you'd get just that column, but since you're performing a join, the key column will also be output (as it's a join between two tables based on the key column).
Edit: Following your 2nd edit, if your question is why so?, then I'd reluctantly (or rather ignorantly) answer: Matthew Dowle designed it that way to differentiate between a data.table join-based subset and an index-based subsetting operation.
Your second syntax is equivalent to:
d[J(3)][, value] # is equivalent to:
dd <- d[J(3)]
dd[, value]
where, again, in dd[, value], j is evaluated as an expression and therefore you get a vector.
To answer your 3rd modified question: for the 3rd time, it's because it is a JOIN between two data.tables based on the key column. If I join two data.tables, I'd expect a data.table to be returned.
From data.table introduction, once again:
Passing a data.table into a data.table subset is analogous to A[B] syntax in base R where A is a matrix and B is a 2-column matrix. In fact, the A[B] syntax in base R inspired the data.table package.
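For reference, here is a tiny base-R illustration of that A[B] analogy (a sketch; values chosen arbitrarily): B is a two-column matrix of (row, column) positions used to index the matrix A.
A <- matrix(1:9, nrow = 3)
B <- matrix(c(1, 1,
              3, 2), ncol = 2, byrow = TRUE)
A[B]
# [1] 1 6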
As of data.table 1.9.3, the default behavior has been changed and the examples below produce the same result. To get the by-without-by result, one now has to specify an explicit by=.EACHI:
d = data.table(a = 1:5, value = 2:6, key = "a")
d[J(3), value]
#[1] 4
d[J(3), value, by = .EACHI]
# a value
#1: 3 4
And here's a slightly more complicated example, illustrating the difference:
d = data.table(a = 1:2, b = 1:6, key = 'a')
# a b
#1: 1 1
#2: 1 3
#3: 1 5
#4: 2 2
#5: 2 4
#6: 2 6
# normal join
d[J(c(1,2)), sum(b)]
#[1] 21
# join with a by-without-by, or by-each-i
d[J(c(1,2)), sum(b), by = .EACHI]
# a V1
#1: 1 9
#2: 2 12
# and a more complicated example:
d[J(c(1,2,1)), sum(b), by = .EACHI]
# a V1
#1: 1 9
#2: 2 12
#3: 1 9
This is not unexpected behaviour, it is documented behaviour. Arun has done a good job of explaining and demonstrating in the FAQ where this is clearly documented.
There is a feature request, FR 1757, that proposes the use of the drop argument in this case. When implemented, the behaviour you want might be coded as:
d = data.table(a = 1:5, value = 2:6, key = "a")
d[J(3), value, drop = TRUE]
I agree with Arun's answer. Here's another wording: After you do a join, you often will use the join column as a reference or as an input to further transformation. So you keep it, and you have an option to discard it with the (more roundabout) double [ syntax. From a design perspective, it is easier to keep frequently relevant information and then discard when desired, than to discard early and risk losing data that is difficult to reconstruct.
Another reason that you'd want to keep the join column is that you can perform aggregate operations at the same time as you perform a join (the by without by). For example, the results here are much clearer by including the join column:
d <- data.table(a=rep.int(1:3,2),value=2:7,other=100:105,key="a")
d[J(1:3),mean(value)]
# a V1
#1: 1 3.5
#2: 2 4.5
#3: 3 5.5
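Note that, under the post-1.9.3 behaviour described above, this grouped result now needs an explicit by = .EACHI; without it you get the overall mean of the matched rows (a sketch using the same d):
d[J(1:3), mean(value)]
#[1] 4.5
d[J(1:3), mean(value), by = .EACHI]
# a V1
#1: 1 3.5
#2: 2 4.5
#3: 3 5.5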

Different behaviour of intersect on vectors and factors

I try to compare multiple vectors of Entrez IDs (integer vectors) by using Reduce(intersect,...). The vectors are selected from a database using "DISTINCT" so a single vector does not contain duplicates.
length(factor(c(l1$entrez)))
gives the same length (and the same IDs w/o the length function) as
length(c(l1$entrez))
When I compare multiple vectors with
length(Reduce(intersect,list(c(l1$entrez),c(l2$entrez),c(l3$entrez),c(l4$entrez))))
or
length(Reduce(intersect,list(c(factor(l1$entrez)),c(factor(l2$entrez)),c(factor(l3$entrez)),c(factor(l4$entrez)))))
the result is not the same. I know that factor!=originalVector but I cannot understand why the result differs although the length and the levels of the initial factors/vectors are the same.
Could somebody please explain the different behaviour of the intersect function on vectors and factors? Is it that the intersect of two factor lists is again a factor list, and that duplicates are then treated differently?
Edit - Example:
> head(l1)
entrez
1 1
2 503538
3 29974
4 87769
5 2
6 144568
> head(l2)
entrez
1 1743
2 1188
3 8915
4 7412
5 51082
6 5538
The lists contain around 500 to 20K Entrez IDs. So the vectors contain pure integers and should give the intersect among all tested vectors.
> length(Reduce(intersect,list(c(factor(l1$entrez)),c(factor(l2$entrez)),c(factor(l3$entrez)),c(factor(l4$entrez)))))
[1] 514
> length(Reduce(intersect,list(c(l1$entrez),c(l2$entrez),c(l3$entrez),c(l4$entrez))))
[1] 338
> length(Reduce(intersect,list(l1$entrez,l2$entrez,l3$entrez,l4$entrez)))
[1] 494
I have to apologize profusely. The different behaviour of the intersect function may be caused by a problem with the data. I have found fields in the dataset containing comma-separated Entrez IDs (22038, 23207, ...). I should have had a more detailed look at the data first. Thank you for the answers and your time. Although I do not understand the different results yet, I am sure that this is the cause of the different behaviour. Can somebody confirm that?
As Roman says, an example would be very helpful.
Nevertheless, one possibility is that your variables l1$entrez, l2$entrez etc have the same levels but in different orders.
intersect converts its arguments via as.vector, which turns factors into character variables. This is usually the right thing to do, as it means that varying level order doesn't make any difference to the result.
Passing factor(l1$entrez) as an argument to intersect also removes the impact of varying level order, as it effectively creates a new factor with level ordering set to the default. However, if you pass c(l1$entrez), you strip the factor attributes off your variable and what you're left with is the raw integer codes which will depend on level ordering.
Example:
a <- factor(letters[1:3], levels=letters)
b <- factor(letters[1:3], levels=rev(letters))
# returns 1 2 3
intersect(c(factor(a)), c(factor(b)))
# returns integer(0)
intersect(c(a), c(b))
I don't see any reason why you should use c() here. Just let R handle factors by itself (although, to be fair, there are other scenarios where you do want to step in).
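As a follow-up sketch (toy IDs, not the OP's data): intersect() itself handles factors safely because it converts both arguments with as.vector(), which yields characters, so the comparison does not depend on level ordering.
f <- factor(c("1743", "1188"), levels = c("9999", "1743", "1188"))
as.vector(f)
# [1] "1743" "1188"
intersect(f, c("1188", "22038"))
# [1] "1188"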

Explain R tapply description

I understand what tapply() does in R. However, I cannot parse this description of it from the documentation:
Apply a Function Over a "Ragged" Array
Description:
Apply a function to each cell of a ragged array, that is to each
(non-empty) group of values given by a unique combination of the
levels of certain factors.
Usage:
tapply(X, INDEX, FUN = NULL, ..., simplify = TRUE)
When I think of tapply, I think of group by in sql. You group values in X together by its parallel factor levels in INDEX and apply FUN to those groups. I have read the description of tapply 100 times and still can't figure out how what it says maps to how I understand tapply. Perhaps someone can help me parse it?
@joran's great answer helped me understand it (so please vote for his - I would have added this as a comment if it weren't too long), but this may be of help to some:
In quite a few languages, you have two-dimensional arrays. Depending on the language, these arrays have fixed dimensions (i.e. each row has the same number of columns), or the number of items per row may differ. So instead of:
A: 1 2 3
B: 4 5 6
C: 7 8 9
You could get something like
A: 1 3
B: 4 5 6
C: 8
This is called a ragged array because, well, the right side of it looks ragged.
In typical R-style, we might represent this as two vectors:
values<-c(1,3,4,5,6,8)
names<-c("A", "A", "B", "B", "B", "C")
So tapply with these two vectors as its first two parameters indeed allows us to apply a function to each 'row' of our ragged array.
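For instance (a sketch using the two vectors above, with sum as an arbitrary choice of FUN):
tapply(values, names, sum)
#  A  B  C
#  4 15  8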
Let's see what the R documentation says on the subject:
The combination of a vector and a labelling factor is an example of what is sometimes called a ragged array, since the subclass sizes are possibly irregular. When the subclass sizes are all the same the indexing may be done implicitly and much more efficiently, as we see in the next section.
The list of factors you supply via INDEX together specify a collection of subsets of X, of possibly different lengths (hence, the 'ragged' descriptor). And then FUN is applied to each subset.
EDIT: @Joris makes an excellent point in the comments. It may be helpful to think of tapply(X,Y,...) as a wrapper for sapply(split(X,Y),...) in that if Y is a list of grouping factors, it builds a new, single grouping factor based on their unique levels, splits X accordingly and applies FUN to each piece.
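A quick sketch of that equivalence, reusing the values/names vectors from earlier:
split(values, names)
# $A
# [1] 1 3
# $B
# [1] 4 5 6
# $C
# [1] 8
sapply(split(values, names), sum)
#  A  B  C
#  4 15  8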
EDIT: Here's an illustrative example:
library(lattice)
library(plyr)
library(reshape2)  # melt() used below comes from reshape2
set.seed(123)
#Make this example unbalanced
dat <- barley[sample(1:120,50),]
#Suppose we want the avg yield by year/site:
table(dat$year,dat$site)
#That's what they mean by 'ragged' array; there are different
# numbers of obs at each comb of levels
#In plyr we could use ddply:
ddply(dat,.(year,site),.fun=function(x){mean(x$yield)})
#Which gives the same result (listed in a diff order) as:
melt(tapply(dat$yield, list(dat$year, dat$site), mean))
