R data.table transform with row explosion - r

I would like to split one row into two (or more) rows when the cumsum of one of the column breaks the period.
Is there any elegant way to perform such specific row explosion using data.table?
Do not focus on cumsum (which I used in reversed order to have cumsum from most recent row to the oldest one), strictly speaking I want transform dt into rdt from code below.
# current data
dt <- data.table(
time_id = 101:110,
desc = c('asd','qwe','xyz','qwe','qwe','xyz','asd','asd','qwe','asd'),
value = c(5.5,3.5,14,0.7,6,5.5,9.3,29.8,4,7.2)
)
dt[, cum_value_from_now := rev(cumsum(rev(value)))]
period_width <- 10
dt[, value_period := ceiling(cum_value_from_now/period_width)*period_width]
dt
# expected result
rdt <- data.table(
time_id = c(101,102,103,103,104,105,105,106,107,107,108,108,108,108,109,109,110),
desc = c('asd','qwe','xyz','xyz','qwe','qwe','qwe','xyz','asd','asd','asd','asd','asd','asd','qwe','qwe','asd'),
value = c(5.5,3.5,6.5,7.5,0.7,1.8,4.2,5.5,0.3,9,1,10,10,8.8,1.2,2.8,7.2)
)[, cum_value_from_now := rev(cumsum(rev(value)))][, value_period := ceiling(cum_value_from_now/period_width)*period_width]
rdt
# validation
all.equal(
dt[,list(time_id,desc,value)],
rdt[,list(value = sum(value)), by=c('time_id','desc')]
)
edit: I realized my question is not explained well the transformation I want to perform. To better understand the breaks the period meaning please take a look at my rdt the cum_value_from_now values from the last to first. Each value_period is completely filled by cumsum on value, the rest of value is produced as new row (if value is big enough then it is produced to multiple rows) to fit into next period(s). Thanks

First, you seem to be applying your rules inconsistently. If "breaking the period" means that a row has value_period different from the previous row, then row 2 breaks the period, but you do not treat it that way.
Second, you never explain the partitioning of value. For instance, row 3 has value=14. This is replaced in rdt with two rows with values 6.5 and 7.5. These add to 14 all right, but there is no explanation of why this should be 6.5 and 7.5, rather than, say, 7 and 7. So in the solution below I partition equally.
The code below produces a result which passes your test, but it is not quite the same as your rdt, due to the above-mentioned problems with your question.
dt[,diff:=c(-diff(value_period)/10,0)]
rdt <- dt[,list(value=as.numeric(rep(value/(diff+1),diff+1))),
by=list(time_id,desc,cum_value_from_now, value_period)]
all.equal(
dt[,list(time_id,desc,value)],
rdt[,list(value = sum(value)), by=c('time_id','desc')]
)
# [1] TRUE

Related

Grouping conditional linked values within a data.table

I have a data.table with 3 input columns as follows and a fourth column representing my target output:
require(data.table)
Test <- data.table(Created = c(5,9,13,15,19,23,27,31,39,42,49),
Next_peak = c(9,15,15,23,27,27,31,39,49,49,50),
Valid_reversal = c(T,T,F,F,T,F,T,F,T,F,F),
Target_output = c(5,5,13,5,19,23,19,19,39,42,39))
I'm not sure if this is completely necessary, but I'll try to explain the dataset to hopefully make it easier to see what I'm trying to do. This is a little hard to explain in writing, so please bear with me!
The "Created" column represents the row number location of a price 'peak' (i.e. reversal point) in a time-series of financial data that I'm analysing. The "Next_peak" column represents the corresponding row number (in the original data set) of the next peak which exceeds the peak for that row. e.g. looking at row 1, the "Next_peak" value is 9, corresponding to the same row location as the "Created" level on row 2 of this summarised table. This means that the second peak exceeds the first peak. Conversely, in row 2 where the second peak's data is stored, the "Next peak" value of 15 suggests that it isn't until the 4th peak (i.e. corresponding to the '15' value in the "Created" column) that the second peak's price level is exceeded.
Lastly, the "Valid_reversal" column denotes whether the "Created" and "Next_peak" levels are within a predefined threshold. For example, "T" in the first row suggests that the peaks at rows 5 and 9 ("Next_peak") met this criteria. If I then go to the value of "Created" corresponding to a value of 9, there is also a "T", suggesting that the "Next_peak" value of 15 also meet the criteria. However, when I go to the 4th row where Created = 15, there is a "F", we find that the next peak does not meet the criteria.
What I'm trying to do is to link the 'chains' of valid reversal points and then return the original starting "Created" value. i.e. I want rows 1, 2 and 4 to have a value of '5', suggesting that the peaks for these rows were all within a predefined threshold of the original peak in row 5 of the original data-set.
Conversely, row 3 should simply return 13 as there were no valid reversals at the "Next_peak" value of 15 relative to the peak formed at row 13.
I can create the desired output with the following code, however, it's not a workable solution as the number of steps could easily exceed 3 with my actual data sets where there are more than 3 peaks which are 'linked' with the same reversal point.
I could do this with a 'for' loop, but I'm wondering if there is a better way to do this, preferably in a manner which is as vectorised as possible as the actual data set that I'm using contains millions of rows.
Here's my current approach:
Test[Valid_reversal == T,Step0 := Next_peak]
Test[,Step1 := sapply(seq_len(.N),function(x) ifelse(any(!(Created[x] %in% Step0[seq_len(x)])),
Created[x],NA))]
Test[,Step2 := unlist(ifelse(is.na(Step1),
lapply(.I,function(x) Step1[which.max(Step0[seq_len(x-1)] == Created[x])]),
Step1))]
Test[,Step3 := unlist(ifelse(is.na(Step2),
lapply(.I,function(x) Step2[which.max(Step0[seq_len(x-1)] == Created[x])]),
Step2))]
As you can see, while this data set only needs 3 iterations, the number of steps in the approach that I've taken is not definable in advance (as far as I can see). Therefore, to implement this approach, I'd have to repeat Step 2 until all values had been calculated, potentially via a 'while' loop. I'm struggling a little to work out how to do this.
Please let me know if you have any thoughts on how to address this in a more efficient way.
Thanks in advance,
Phil
Edit: Please note that I didn't mention in the above that the "Next_peak" values aren't necessarily monotonically increasing. The example above meant that nafill could be used, however, as the following example / sample output shows, it wouldn't give the correct output in the following instance:
Test <- data.table(Created = c(5,9,13,15,19,23,27,31,39,42,49),
Next_peak = c(27,15,15,19,23,27,42,39,42,49,50),
Valid_reversal = c(T,T,F,T,F,F,T,F,F,T,F),
Target_output = c(5,9,13,9,9,23,5,31,39,5,5))
Not sure if I understand your requirements correctly, you can use nafill after Step 1:
#step 0 & 1
Test[, out :=
Test[(Valid_reversal)][.SD, on=.(Next_peak=Created), mult="last",
fifelse(is.na(x.Created), i.Created, NA_integer_)]
]
#your steps 2, 3, ...
Test[Valid_reversal | is.na(out), out := nafill(out, "locf")]
edit for the new example. You can use igraph to find the chains:
#step 0 & 1
Test[, out :=
Test[(Valid_reversal)][.SD, on=.(Next_peak=Created), mult="last",
fifelse(is.na(x.Created), i.Created, NA_integer_)]
]
#steps 2, 3, ...
library(igraph)
g <- graph_from_data_frame(Test[Valid_reversal | is.na(out)])
DT <- setDT(stack(clusters(g)$membership), key="ind")[,
ind := as.numeric(levels(ind))[ind]][,
root := min(ind), values]
Test[Valid_reversal | is.na(out), out := DT[.SD, on=.(ind=Created), root]]
just for completeness, here is a while loop version:
#step 0 & 1
Test[, out :=
Test[(Valid_reversal)][.SD, on=.(Next_peak=Created), mult="last",
fifelse(is.na(x.Created), i.Created, NA_integer_)]
]
#step 2, 3, ...
while(Test[, any(is.na(out))]) {
Test[is.na(out), out := Test[.SD, on=.(Next_peak=Created), mult="first", x.out]]
}
Test

R data.table Multiple Conditions Join

I’ve devised a solution to lookup values from multiple columns of two separate data tables and add a new column based calculations of their values (multiple conditional comparisons). Code below. It involves using a data.table and join while calculating values from both tables, however, the tables aren’t joined on the columns I’m comparing, and therefore I suspect I may not be getting the speed advantages inherent to data.tables that I’ve read so much about and am excited about tapping into. Said another way, I’m joining on a ‘dummy’ column, so I don’t think I’m joining “properly.”
The exercise is, given an X by X grid dtGrid and a list of X^2 random Events dtEvents within the grid, to determine how many Events occur within a 1 unit radius of each grid point. The code is below. I picked a grid size of 100 X 100, which takes ~1.5 sec to run the join on my machine. But I can’t go much bigger without introducing an enormous performance hit (200 X 200 takes ~22 sec).
I really like the flexibility of being able to add multiple conditions to my val statement (e.g., if I wanted to add a bunch of AND and OR combinations I could do that), so I'd like to retain that functionality.
Is there a way to use data.table joins ‘properly’ (or any other data.table solution) to achieve a much speedier / efficient outcome?
Thanks so much!
#Initialization stuff
library(data.table)
set.seed(77L)
#Set grid size constant
#Increasing this number to a value much larger than 100 will result in significantly longer run times
cstGridSize = 100L
#Create Grid
vecXYSquare <- seq(0, cstGridSize, 1)
dtGrid <- data.table(expand.grid(vecXYSquare, vecXYSquare))
setnames(dtGrid, 'Var1', 'x')
setnames(dtGrid, 'Var2', 'y')
dtGrid[, DummyJoin:='A']
setkey(dtGrid, DummyJoin)
#Create Events
xrand <- runif(cstGridSize^2, 0, cstGridSize + 1)
yrand <- runif(cstGridSize^2, 0, cstGridSize + 1)
dtEvents <- data.table(x=xrand, y=yrand)
dtEvents[, DummyJoin:='A']
dtEvents[, Counter:=1L]
setkey(dtEvents, DummyJoin)
#Return # of events within 1 unit radius of each grid point
system.time(
dtEventsWithinRadius <- dtEvents[dtGrid, {
val = Counter[(x - i.x)^2 + (y - i.y)^2 < 1^2]; #basic circle fomula: x^2 + y^2 = radius^2
list(col_i.x=i.x, col_i.y=i.y, EventsWithinRadius=sum(val))
}, by=.EACHI]
)
Very interesting question.. and great use of by = .EACHI! Here's another approach using the NEW non-equi joins from the current development version, v1.9.7.
Issue: Your use of by=.EACHI is completely justified because the other alternative is to perform a cross join (each row of dtGrid joined to all rows of dtEvents) but that's too exhaustive and is bound to explode very quickly.
However by = .EACHI is performed along with an equi-join using a dummy column, which results in computing all distances (except that it does one at a time, therefore memory efficient). That is, in your code, for each dtGrid, all possible distances are still computed with dtEvents; hence it doesn't scale as well as expected.
Strategy: Then you'd agree that an acceptable improvement is to restrict the number of rows that would result from joining each row of dtGrid to dtEvents.
Let (x_i, y_i) come from dtGrid and (a_j, b_j) come from from dtEvents, say, where 1 <= i <= nrow(dtGrid) and 1 <= j <= nrow(dtEvents). Then, i = 1 implies, all j that satisfies (x1 - a_j)^2 + (y1 - b_j)^2 < 1 needs to be extracted. That can only happen when:
(x1 - a_j)^2 < 1 AND (y1 - b_j)^2 < 1
This helps reduce the search space drastically because, instead of looking at all rows in dtEvents for each row in dtGrid, we just have to extract those rows where,
a_j - 1 <= x1 <= a_j + 1 AND b_j - 1 <= y1 <= b_j + 1
# where '1' is the radius
This constraint can be directly translated to a non-equi join, and combined with by = .EACHI as before. The only additional step required is to construct the columns a_j-1, a_j+1, b_j-1, b_j+1 as follows:
foo1 <- function(dt1, dt2) {
dt2[, `:=`(xm=x-1, xp=x+1, ym=y-1, yp=y+1)] ## (1)
tmp = dt2[dt1, on=.(xm<=x, xp>=x, ym<=y, yp>=y),
.(sum((i.x-x)^2+(i.y-y)^2<1)), by=.EACHI,
allow=TRUE, nomatch=0L
][, c("xp", "yp") := NULL] ## (2)
tmp[]
}
## (1) constructs all columns necessary for non-equi joins (since expressions are not allowed in the formula for on= yet.
## (2) performs a non-equi join that computes distances and checks for all distances that are < 1 on the restricted set of combinations for each row in dtGrid -- hence should be much faster.
Benchmarks:
# Here's your code (modified to ensure identical column names etc..):
foo2 <- function(dt1, dt2) {
ans = dt2[dt1,
{
val = Counter[(x - i.x)^2 + (y - i.y)^2 < 1^2];
.(xm=i.x, ym=i.y, V1=sum(val))
},
by=.EACHI][, "DummyJoin" := NULL]
ans[]
}
# on grid size of 100:
system.time(ans1 <- foo1(dtGrid, dtEvents)) # 0.166s
system.time(ans2 <- foo2(dtGrid, dtEvents)) # 1.626s
# on grid size of 200:
system.time(ans1 <- foo1(dtGrid, dtEvents)) # 0.983s
system.time(ans2 <- foo2(dtGrid, dtEvents)) # 31.038s
# on grid size of 300:
system.time(ans1 <- foo1(dtGrid, dtEvents)) # 2.847s
system.time(ans2 <- foo2(dtGrid, dtEvents)) # 151.32s
identical(ans1[V1 != 0]L, ans2[V1 != 0L]) # TRUE for all of them
The speedups are ~10x, 32x and 53x respectively.
Note that the rows in dtGrid for which the condition is not satisfied even for a single row in dtEvents will not be present in the result (due to nomatch=0L). If you want those rows, you'll have to also add one of the xm/xp/ym/yp cols.. and check them for NA (= no matches).
This is the reason we had to remove all 0 counts to get identical = TRUE.
HTH
PS: See history for another variation where the entire join is materialised and then the distance is computed and counts generated.

Excell or R: writting code to automate filtering of non-osicllatory changes in data.

I am new to coding and need direction to turn my method into code.
In my lab I am working on a time-series project to discover which gene's in a cell naturally change over the organism's cell cycle. I have a tabular data set with numerical values (originally 10 columns, 27,000 rows). To analyze whether a gene is cycling over the data set I divided the values of one time point (or column) by each subsequent time point (or column), and continued that trend across the data set (the top section of the picture is an example of spread sheet with numerical value at each time-point. The bottom section is an example of what the time-comparisons looked like across the data.
I then imposed an advanced filter with multiple AND / OR criteria that followed the logic (Source Jeeped)
WHERE (column A >= 2.0 AND column B <= 0.5)
OR (column A >= 2.0 AND column C <= 0.5)
OR (column A >= 2.0 AND column D <= 0.5)
OR (column A >= 2.0 AND column E <= 0.5)
(etc ...)
From there, I slid the advanced filter across the entire data set(in the photograph, A on the left -- exanple of the original filter, and B -- the filter sliding across the data)
The filters produced multiple sheets of genes that fit my criteria. To figure how many unique genes met this criteria I merged Column A (Gene_ID's) of all the sheets and removed duplicates to produce a list of unique gene ID's.
The process took me nearly 3 hours due to the size of each spread sheet (37 columns, 27000 rows before filtering). Can this process be expedited? and if so can someone point me in the right direction or help me create the code to do so?
Thank you for your time, and if you need any clarification please don't hesitate to ask.
There are a few ways to do this in R. I think but a common an easy to think about way is to use the any function. This basically takes a series of logical tests and puts an "OR" between each of them, so that if any of them return true then it returns true. You can pass each column to it and then combine it with an AND for the logical test for column a. There are probably other ways to abstract this as well, but this should get you started:
df <- data.frame(
a = 1:100,
b = 1:100,
c = 51:150,
d = 101:200,
value = rep("a", 100)
)
df[ df$a > 2 & any(df$b > 5, df$c > 5, df$d > 5), "value"] <- "Test Passed!"

Using .I to return row numbers with data.table package

Would someone please explain to me the correct usage of .I for returning the row numbers of a data.table?
I have data like this:
require(data.table)
DT <- data.table(X=c(5, 15, 20, 25, 30))
DT
# X
# 1: 5
# 2: 15
# 3: 20
# 4: 25
# 5: 30
I want to return a vector of row indices where a condition in i is TRUE, e.g. which rows have an X greater than 20.
DT[X > 20]
# rows 4 & 5 are greater than 20
To get the indices, I tried:
DT[X > 20, .I]
# [1] 1 2
...but clearly I am doing it wrong, because that simply returns a vector containing 1 to the number of returned rows. (Which I thought was pretty much what .N was for?).
Sorry if this seems extremely basic, but all I have been able to find in the data.table documentation is WHAT .I and .N do, not HOW to use them.
If all you want is the row numbers rather than the rows themselves, then use which = TRUE, not .I.
DT[X > 20, which = TRUE]
# [1] 4 5
That way you get the benefits of optimization of i, for example fast joins or using an automatic index. The which = TRUE makes it return early with just the row numbers.
Here's the manual entry for the which argument inside data.table :
TRUE returns the row numbers of x that i matches to. If NA, returns
the row numbers of i that have no match in x. By default FALSE and the
rows in x that match are returned.
Explanation:
Notice there is a specific relationship between .I and the i = .. argument in DT[i = .., j = .., by = ..]
Namely, .I is a vector of row numbers of the subsetted table.
### Lets create some sample data
set.seed(1)
LL <- sample(LETTERS[1:5], 20, TRUE)
DT <- data.table(X=LL)
look at the difference between subsetting the whole table, and subsetting just .I
DT[X == "B", .I]
# [1] 1 2 3 4 5 6
DT[ , .I[X == "B"] ]
# [1] 1 2 5 11 14 19
Sorry if this seems extremely basic, but all I have been able to find in the data.table documentation is WHAT .I and .N do, not HOW to use them.
First let's check the documentation. I typed ?data.table and searched for .I. Here's what's there :
Advanced: When grouping, symbols .SD, .BY, .N, .I and .GRP may be used
in the j expression, defined as follows.
.I is an integer vector equal to seq_len(nrow(x)). While grouping, it
holds for each item in the group its row location in x. This is
useful to subset in j; e.g. DT[, .I[which.max(somecol)], by=grp].
Emphasis added by me here. The original intention was for .I to be used while grouping. Note that there is in fact an example there in the documentation of HOW to use .I.
You aren't grouping.
That said, what you tried was reasonable. Over time these symbols have become to be used when not grouping as well. There might be a case that .I should return what you expected. I can see that using .I in j together with both i and by could be useful. Currently .I doesn't seem helpful when i is present, as you pointed out.
Using the which() function is good but might then circumvent optimization in i (which() needs a long logical input which has to be created and passed to it). Using the which=TRUE argument is good but then just returns the row numbers (you couldn't then do something with those row numbers in j by group).
Feature request #1494 filed to discuss changing .I to work the way you expected. The documentation does contain the words "its row location in x" which would imply what you expected since x is the whole data.table.
Alternatively,
DataTable[ , which(X>10) ]
is probably easier to understand and more idiomatically R.

data.table join and j-expression unexpected behavior

In R 2.15.0 and data.table 1.8.9:
d = data.table(a = 1:5, value = 2:6, key = "a")
d[J(3), value]
# a value
# 3 4
d[J(3)][, value]
# 4
I expected both to produce the same output (the 2nd one) and I believe they should.
In the interest of clearing up that this is not a J syntax issue, same expectation applies to the following (identical to the above) expressions:
t = data.table(a = 3, key = "a")
d[t, value]
d[t][, value]
I would expect both of the above to return the exact same output.
So let me rephrase the question - why is (data.table designed so that) the key column printed out automatically in d[t, value]?
Update (based on answers and comments below): Thanks #Arun et al., I understand the design-why now. The reason the above prints the key is because there is a hidden by present every time you do a data.table merge via the X[Y] syntax, and that by is by the key. The reason it's designed this way seems to be the following - since the by operation has to be performed when merging, one might as well take advantage of that and not do another by if you are going to do that by the key of the merge.
Now that said, I believe that's a syntax design flaw. The way I read data.table syntax d[i, j, by = b] is
take d, apply the i operation (be that subsetting or merging or whatnot), and then do the j expression "by" b
The by-without-by breaks this reading and introduces cases one has to think about specifically (am I merging on i, is by just the key of the merge, etc). I believe this should be the job of the data.table - the commendable effort to make data.table faster in one particular case of the merge, when the by is equal to the key, should be done in an alternative way (e.g. by checking internally if the by expression is actually the key of the merge).
Edit number Infinity: Faq 1.12 exactly answers your question: (Also useful/relevant is FAQ 1.13, not pasted here).
1.12 What is the difference between X[Y] and merge(X,Y)?
X[Y] is a join, looking up X's rows using Y (or Y's key if it has one) as an index. Y[X] is a join, looking up Y's rows using X (or X's key if it has one) as an index. merge(X,Y)1 does both ways at the same time. The number of rows of X[Y] and Y[X] usually dier; whereas the number of rows returned by merge(X,Y) and merge(Y,X) is the same. BUT that misses the main point. Most tasks require something to be done on the data after a join or merge. Why merge all the columns of data, only to use a small subset of them afterwards?
You may suggest merge(X[,ColsNeeded1],Y[,ColsNeeded2]), but that takes copies of the subsets of data, and it requires the programmer to work out which columns are needed. X[Y,j] in data.table does all that in one step for you. When you write X[Y,sum(foo*bar)], data.table
automatically inspects the j expression to see which columns it uses. It will only subset those columns only; the others are ignored. Memory is only created for the columns the j uses, and Y columns enjoy standard R recycling rules within the context of each group. Let's say foo is in X, and bar is in Y (along with 20 other columns in Y). Isn't X[Y,sum(foo*bar)] quicker to program and quicker to run than a merge followed by a subset?
Old answer which did nothing to answer the OP's question (from OP's comment), retained here because I believe it does).
When you give a value for j like d[, 4] or d[, value] in data.table, the j is evaluated as an expression. From the data.table FAQ 1.1 on accessing DT[, 5] (the very first FAQ) :
Because, by default, unlike a data.frame, the 2nd argument is an expression which is evaluated within the scope of DT. 5 evaluates to 5.
The first thing, therefore, to understand is, in your case:
d[, value] # produces a "vector"
# [1] 2 3 4 5 6
This is not different when the query for i is a basic indexing like:
d[3, value] # produces a vector of length 1
# [1] 4
However, this is different when i is by itself a data.table. From data.table introduction (page 6):
d[J(3)] # is equivalent to d[data.table(a = 3)]
Here, you are performing a join. If you just do d[J(3)] then you'd get all columns corresponding to that join. If you do,
d[J(3), value] # which is equivalent to d[J(3), list(value)]
Since you say this answer does nothing to answer your question, I'll point where the answer to your "rephrased" question, I believe, lies: ---> then you'd get just that column, but since you're performing a join, the key column will also be output'd (as it's a join between two tables based on the key column).
Edit: Following your 2nd edit, If your question is why so?, then I'd reluctantly (or rather ignorantly) answer, Matthew Dowle designed so to differentiate between a data.table join-based-subset and a index-based-subsetting operation.
Your second syntax is equivalent to:
d[J(3)][, value] # is equivalent to:
dd <- d[J(3)]
dd[, value]
where, again, in dd[, value], j is evaluated as an expression and therefore you get a vector.
To answer your 3rd modified question: for the 3rd time, it's because it is a JOIN between two data.tables based on the key column. If I join two data.tables, I'd expect a data.table
From data.table introduction, once again:
Passing a data.table into a data.table subset is analogous to A[B] syntax in base R where A is a matrix and B is a 2-column matrix. In fact, the A[B] syntax in base R inspired the data.table package.
As of data.table 1.9.3, the default behavior has been changed and the examples below produce the same result. To get the by-without-by result, one now has to specify an explicit by=.EACHI:
d = data.table(a = 1:5, value = 2:6, key = "a")
d[J(3), value]
#[1] 4
d[J(3), value, by = .EACHI]
# a value
#1: 3 4
And here's a slightly more complicated example, illustrating the difference:
d = data.table(a = 1:2, b = 1:6, key = 'a')
# a b
#1: 1 1
#2: 1 3
#3: 1 5
#4: 2 2
#5: 2 4
#6: 2 6
# normal join
d[J(c(1,2)), sum(b)]
#[1] 21
# join with a by-without-by, or by-each-i
d[J(c(1,2)), sum(b), by = .EACHI]
# a V1
#1: 1 9
#2: 2 12
# and a more complicated example:
d[J(c(1,2,1)), sum(b), by = .EACHI]
# a V1
#1: 1 9
#2: 2 12
#3: 1 9
This is not unexpected behaviour, it is documented behaviour. Arun has done a good job of explaining and demonstrating in the FAQ where this is clearly documented.
there is a feature request FR 1757 that proposes the use of the drop argument in this case
When implemented, the behaviour you want might be coded
d = data.table(a = 1:5, value = 2:6, key = "a")
d[J(3), value, drop = TRUE]
I agree with Arun's answer. Here's another wording: After you do a join, you often will use the join column as a reference or as an input to further transformation. So you keep it, and you have an option to discard it with the (more roundabout) double [ syntax. From a design perspective, it is easier to keep frequently relevant information and then discard when desired, than to discard early and risk losing data that is difficult to reconstruct.
Another reason that you'd want to keep the join column is that you can perform aggregate operations at the same time as you perform a join (the by without by). For example, the results here are much clearer by including the join column:
d <- data.table(a=rep.int(1:3,2),value=2:7,other=100:105,key="a")
d[J(1:3),mean(value)]
# a V1
#1: 1 3.5
#2: 2 4.5
#3: 3 5.5

Resources