Joining tables with identical (non-keyed) column names in R data.table

How do you deal with identically named, non-key columns when joining data.tables? I am looking for something like the table.field notation in SQL.
For instance, let's say I have a table DT that is repopulated with new data for column v every time period. I also have a table DT_HIST that stores entries from previous time periods (t). I want to find the difference between the current and previous time period for each x.
In this case: DT is time period 3, and DT_HIST has time periods 1 and 2:
DT <- data.table(x=c(1,2,3,4),v=c(20,20,35,30))
setkey(DT,x)
DT_HIST <- data.table(x=rep(seq(1,4,1),2),v=c(40,40,40,40,30,25,45,40),t=c(rep(1,4),rep(2,4)))
setkey(DT_HIST,x)
> DT
x v
1: 1 20
2: 2 20
3: 3 35
4: 4 30
> DT_HIST
x v t
1: 1 40 1
2: 1 30 2
3: 2 40 1
4: 2 25 2
5: 3 40 1
6: 3 45 2
7: 4 40 1
8: 4 40 2
I would like to join DT with DT_HIST[t==2,] on x and calculate the difference in v.
Just joining the tables results in columns v and v.1.
> DT[DT_HIST[t==2],]
x v v.1 t
1: 1 20 30 2
2: 2 20 25 2
3: 3 35 45 2
4: 4 30 40 2
However, I can't find a way to refer to the different v columns when doing the join.
> DT[DT_HIST[t==2],list(delta=v-v.1)]
Error in `[.data.table`(DT, DT_HIST[t == 2], list(delta = v - v.1)) :
object 'v.1' not found
> DT[DT_HIST[t==2],list(delta=v-v)]
x delta
1: 1 0
2: 2 0
3: 3 0
4: 4 0
If this is a duplicate, I apologize. I searched and couldn't find a similar question.
Also, I realize that I can simply rename the columns after joining and then run my desired expression, but I want to know if I'm doing this in the completely wrong way.

You can use i.colname to access a column of the i-expression data.table. I see you're using an old data.table version. There have been a few changes since then: duplicated joined column names now get the prefix i. instead of a numeric suffix (making them consistent with the i. access of joined columns), and there is no by-without-by by default anymore.
In the latest version (1.9.3), this is what you get:
DT[DT_HIST[t==2],list(delta = v - i.v)]
# delta
#1: -10
#2: -5
#3: -10
#4: -10
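If you also want to keep the join column x in the result, you can return it alongside the computed difference; a minimal sketch in the same syntax:
# x comes from DT, i.v is v from the i table (DT_HIST)
DT[DT_HIST[t==2], list(x, delta = v - i.v)]
#    x delta
# 1: 1   -10
# 2: 2    -5
# 3: 3   -10
# 4: 4   -10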

Related

Summing sequences in R using data.table

I am trying to sum pieces of a series using data.table in r. The idea is that I define a start index and an end index as columns in the table, then make a third column for "sum of the series between start and end indexes."
series = c(1,2,3,4,5,6)
a = data.table(start=c(1,2,3),end=c(4,5,6))
a[,S := sum(series[start:end])]
a
Expected result:
start end S
1: 1 4 10
2: 2 5 14
3: 3 6 18
Actual result:
Warning messages:
1: In start:end : numerical expression has 3 elements: only the first used
2: In start:end : numerical expression has 3 elements: only the first used
> a
start end S
1: 1 4 10
2: 2 5 10
3: 3 6 10
What am I missing here? If I just do a[,S := start+end] the code executes as one would expect.
An option is to loop over the 'start' and 'end' columns with Map, build the sequence (:) from each pair of elements, take the sum, and unlist the resulting list column to assign (:=) it as a new column:
a[, S := unlist(Map(function(x, y) sum(x:y), start, end))]
-output
a
# start end S
#1: 1 4 10
#2: 2 5 14
#3: 3 6 18
The : operator is not vectorized over its operands, i.e. it takes just a single value on either side, and that is why the warning was shown.
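You can see this directly at the prompt:
c(1,2,3):c(4,5,6)  # only the first elements (1 and 4) are used
# [1] 1 2 3 4
# ...with the same "only the first used" warnings as above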
Maybe you can try cumsum as below, which lets you apply vectorized operations within data.table:
cs <- cumsum(series)
a[,S := cs[end]-c(0,cs)[start]]
which gives
start end S
1: 1 4 10
2: 2 5 14
3: 3 6 18
Since series here is just 1:6 (so series[start:end] contains the integers from start to end), you can use the arithmetic series formula:
a[, S := (end - start + 1) * (start + end) / 2]
Gives:
start end S
1: 1 4 10
2: 2 5 14
3: 3 6 18
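A quick sanity check, using the same a and series as above, that the closed form matches the brute-force sums:
mapply(function(s, e) sum(series[s:e]), a$start, a$end)
# [1] 10 14 18
(a$end - a$start + 1) * (a$start + a$end) / 2
# [1] 10 14 18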
Your code would work if you made this a rowwise operation, so that each start and end represents a single value at a time.
library(data.table)
a[, S := sum(series[start:end]), by = 1:nrow(a)]
a
# start end S
#1: 1 4 10
#2: 2 5 14
#3: 3 6 18

Empty data.table message generated with .SD and fwrite when grouping by single column

I have a large dataframe containing data from several participants. Each participant's data extends over many rows and I need to save the data for each participant in a separate text file. As suggested elsewhere (Split dataframe into multiple output files), I am using data.table::fwrite for this job. However, this prompts a message and I am unsure about its interpretation in my case:
Empty data.table (0 rows) of 1 col: A
I have already searched for possible explanations, but felt that none of these directly addressed my problem (e.g. see: unclear error/message: "Empty data.table (0 rows) of 1 col:"; https://github.com/Rdatatable/data.table/issues/3262).
To illustrate, I have generated a sample script, which is similar in nature and replicates this message:
# generate data table
DT = data.table(A=c(1,2,1,2,1,2), B=c(1,2,1,1,2,2), C=c(1:6), D=c(20:15))
A B C D
1: 1 1 1 20
2: 2 2 2 19
3: 1 1 3 18
4: 2 1 4 17
5: 1 2 5 16
6: 2 2 6 15
# save data in separate files based on column A
DT[, fwrite(.SD, stringr::str_c("T", unique(A),".csv")), by=.(A)]
This generates the above message. Nonetheless, two files are created (as needed) and contain the relevant data.
fread("T1.csv")
B C D
1: 1 1 20
2: 1 3 18
3: 2 5 16
fread("T2.csv")
B C D
1: 2 2 19
2: 1 4 17
3: 2 6 15
The message suggests to me, however, that something in my script didn't quite work (I don't intend to have, or save, empty data tables). Is something indeed amiss here, or is the message itself misleading? Interestingly, when removing fwrite, the message disappears, suggesting to me that it is linked to fwrite:
DT[, .SD, by=.(A)]
   A B C D
1: 1 1 1 20
2: 1 1 3 18
3: 1 2 5 16
4: 2 2 2 19
5: 2 1 4 17
6: 2 2 6 15
Any advice (and explanations) would be much appreciated.
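For what it's worth, nothing appears to be amiss: fwrite() returns NULL, so each group contributes zero rows to the result, and the message is just the auto-print of that empty result table (which retains only the grouping column A). The files themselves are written correctly. A minimal sketch, assuming the same file-naming scheme, that suppresses the print:
# wrapping the call in invisible() stops the empty result from auto-printing
invisible(DT[, fwrite(.SD, stringr::str_c("T", unique(A), ".csv")), by = .(A)])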

R data.table subsetting on multiple conditions

With the below data set, how do I write a data.table call that subsets this table and returns all customer IDs and associated orders for that customer IF that customer has ever purchased SKU 1?
The expected result should be a table that excludes cids 3 and 5 (they never bought SKU 1) and includes every row for the customers matching sku==1.
I am getting stuck because I don't know how to write a "contains" statement; a literal == returns only the rows where sku matches the condition... I am sure there is a better way.
library("data.table")
df<-data.frame(cid=c(1,1,1,1,1,2,2,2,2,2,3,4,5,5,6,6),
order=c(1,1,1,2,3,4,4,4,5,5,6,7,8,8,9,9),
sku=c(1,2,3,2,3,1,2,3,1,3,2,1,2,3,1,2))
dt=as.data.table(df)
This is similar to a previous answer, but here the subsetting works in a more data.table-like manner.
First, let's take the cids that meet our condition:
matching_cids = dt[sku==1, cid]
The %in% operator allows us to filter to just those items that are contained in the list. So, using the above:
dt[cid %in% matching_cids]
or on one line:
> dt[cid %in% dt[sku==1, cid]]
cid order sku
1: 1 1 1
2: 1 1 2
3: 1 1 3
4: 1 2 2
5: 1 3 3
6: 2 4 1
7: 2 4 2
8: 2 4 3
9: 2 5 1
10: 2 5 3
11: 4 7 1
12: 6 9 1
13: 6 9 2
I would have thought that it was more (?!) data.table to use keys. I couldn't quite work out how to stick the whole lot on a single line, but I think that this would be a bit quicker on large data, because as I understand it (and I may very well be mistaken) this is the only solution presented thus far that avoids vector scanning (which is slow compared to binary search):
# Set initial key
setkey(dt,sku)
# Select only rows with 1 in the sku and return first example of each, setting key to customer id
dts <- dt[ J(1) , .SD[1] , keyby = cid ]
# change key of dt to cid to match customer id
setkey(dt,cid)
# join based on common key
dt[dts,.SD]
# cid order sku
# 1: 1 1 1
# 2: 1 1 2
# 3: 1 2 2
# 4: 1 1 3
# 5: 1 3 3
# 6: 2 4 1
# 7: 2 5 1
# 8: 2 4 2
# 9: 2 4 3
#10: 2 5 3
#11: 4 7 1
#12: 6 9 1
#13: 6 9 2
An alternative that you can do on one line is to use a data.table merge like so...
setkey(dt,sku)
merge( dt[ J(1) , .SD[1] , keyby = cid ] , dt , by = "cid" )
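On newer data.table versions (1.9.6+) the same join can be written inline with the on= argument, so no setkey() is needed; a sketch:
# join dt to the unique cids that ever bought SKU 1, keeping all their rows
dt[dt[sku == 1, .(cid = unique(cid))], on = "cid"]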

data.table aggregations that return vectors, such as scale()

I have recently been working with much larger datasets and have started learning and migrating to data.table to improve the performance of aggregation/grouping. I have been unable to get certain expressions or functions to group as expected. Here is an example of a basic group-by operation that I am having trouble with.
library(data.table)
category <- rep(1:10, 10)
value <- rnorm(100)
df <- data.frame(category, value)
dt <- data.table(df)
If I want to simply calculate the mean for each group by category, this works easily enough:
dt[,mean(value),by="category"]
category V1
1: 1 -0.67555478
2: 2 -0.50438413
3: 3 0.29093723
4: 4 -0.41684790
5: 5 0.33921764
6: 6 0.01970997
7: 7 -0.23684245
8: 8 -0.04280998
9: 9 0.01838804
10: 10 0.44295978
I run into problems if I try to use the scale function or even a simple expression subtracting the value from itself. The grouping is ignored and I get the function/expression applied to each row instead. The following returns all 100 rows instead of 10 rows, one per category:
dt[,scale(value),by="category"]
dt[,value-mean(value),by="category"]
I thought recreating scale as a function that returns a numeric vector instead of a matrix might help.
zScore <- function(x) {
z=(x-mean(x,na.rm=TRUE))/sd(x,na.rm = TRUE)
return(z)
}
dt[,zScore(value),by="category"]
category V1
1: 1 -1.45114132
2: 1 -0.35304528
3: 1 -0.94075418
4: 1 1.44454416
5: 1 1.39448268
6: 1 0.55366652
....
97: 10 -0.43190602
98: 10 -0.25409244
99: 10 0.35496694
100: 10 0.57323480
category V1
This also returns the zScore function applied to all rows (N=100), ignoring the grouping. What am I missing in order to get scale() or a custom function to respect the grouping like mean() did above?
You've clarified in the comments that you'd like the same behaviour as:
ddply(df,"category",transform, zscorebycategory=zScore(value))
which gives:
category value zscorebycategory
1 1 0.28860691 0.31565682
2 1 1.17473759 1.33282374
3 1 0.06395503 0.05778463
4 1 1.37825487 1.56643607
etc
The data table option you gave gives:
category V1
1: 1 0.31565682
2: 1 1.33282374
3: 1 0.05778463
4: 1 1.56643607
etc
That is exactly the same data. However, you'd also like to repeat the value column in your result and to rename the V1 variable with something more descriptive. data.table gives you the grouping variable in the result, along with the result of the expression you provide. So let's modify that to give the rows you'd like:
Your
dt[,zScore(value),by="category"]
becomes:
dt[,list(value=value, zscorebycategory=zScore(value)),by="category"]
Where the named items in the list become columns in the result.
plyr = data.table(ddply(df,"category",transform, zscorebycategory=zScore(value)))
dt = dt[,list(value=value, zscorebycategory=zScore(value)),by="category"]
identical(plyr, dt)
# [1] TRUE
(note I converted your ddply data.frame result into a data.table, to allow the identical command to work).
Your claim that data.table does not group is wrong:
library(data.table)
category <- rep(1:2, each=4)
value <- c(rep(c(1:2),each=2),rep(c(4,10),each=2))
dt <- data.table(category, value)
dt
   category value
1: 1 1
2: 1 1
3: 1 2
4: 1 2
5: 2 4
6: 2 4
7: 2 10
8: 2 10
dt[,value-mean(value),by=category]
category V1
1: 1 -0.5
2: 1 -0.5
3: 1 0.5
4: 1 0.5
5: 2 -3.0
6: 2 -3.0
7: 2 3.0
8: 2 3.0
If you want to scale/transform, this is exactly the behavior you want, because these operations by definition return an object of the same size as their input.
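And because the result has one value per input row, := can assign it back by group without collapsing the table; a minimal sketch:
# group-wise z-scores kept alongside the original columns; as.vector()
# drops the matrix attributes that scale() returns
dt[, value_scaled := as.vector(scale(value)), by = category]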

Is my way of duplicating rows in data.table efficient?

I have monthly data in one data.table and annual data in another data.table and now I want to match the annual data to the respective observation in the monthly data.
My approach is as follows: duplicate the annual data for every month and then join the monthly and annual data. Now I have a question regarding the duplication of rows: I know how to do it, but I'm not sure it is the best way, so some opinions would be great.
Here is an example data.table DT for my annual data and how I currently duplicate:
library(data.table)
DT <- data.table(ID = paste(rep(c("a", "b"), each=3), c(1:3, 1:3), sep="_"),
values = 10:15,
startMonth = seq(from=1, by=2, length=6),
endMonth = seq(from=3, by=3, length=6))
DT
ID values startMonth endMonth
[1,] a_1 10 1 3
[2,] a_2 11 3 6
[3,] a_3 12 5 9
[4,] b_1 13 7 12
[5,] b_2 14 9 15
[6,] b_3 15 11 18
#1. Alternative
DT1 <- DT[, list(MONTH=startMonth:endMonth), by="ID"]
setkey(DT, ID)
setkey(DT1, ID)
DT1[DT]
ID MONTH values startMonth endMonth
a_1 1 10 1 3
a_1 2 10 1 3
a_1 3 10 1 3
a_2 3 11 3 6
[...]
The last join is exactly what I want. However, DT[, list(MONTH=startMonth:endMonth), by="ID"] already does everything I want except adding the other columns of DT, so I was wondering if I could get rid of the last three lines of my code, i.e. the setkey and join operations. It turns out you can; just do the following:
#2. Alternative: more intuitive and just one line of code
DT[, list(MONTH=startMonth:endMonth, values, startMonth, endMonth), by="ID"]
ID MONTH values startMonth endMonth
a_1 1 10 1 3
a_1 2 10 1 3
a_1 3 10 1 3
a_2 3 11 3 6
...
This, however, only works because I hardcoded the column names into the list expression. In my real data, I do not know the names of all columns in advance, so I was wondering if I could just tell data.table to return the column MONTH that I compute as shown above and all the other columns of DT. .SD seemed to be able to do the trick, but:
DT[, list(MONTH=startMonth:endMonth, .SD), by="ID"]
Error in `[.data.table`(DT, , list(YEAR = startMonth:endMonth, .SD), by = "ID") :
maxn (4) is not exact multiple of this j column's length (3)
So to summarize, I know how to get it done, but I was just wondering if this is the best way, because I'm still struggling a little bit with the syntax of data.table and often read in posts and on the wiki that there are good and bad ways of doing things. Also, I don't quite get why I get an error when using .SD. I thought it was just an easy way to tell data.table that you want all columns. What am I missing?
Looking at this I realized that the answer was only possible because ID was a unique key (without duplicates). Here is another answer with duplicates. But, by the way, some NA seem to creep in. Could this be a bug? I'm using v1.8.7 (commit 796).
library(data.table)
DT <- data.table(x=c(1,1,1,1,2,2,3),y=c(1,1,2,3,1,1,2))
DT[,rep:=1L][c(2,7),rep:=c(2L,3L)] # duplicate row 2 and triple row 7
DT[,num:=1:.N] # to group each row by itself
DT
x y rep num
1: 1 1 1 1
2: 1 1 2 2
3: 1 2 1 3
4: 1 3 1 4
5: 2 1 1 5
6: 2 1 1 6
7: 3 2 3 7
DT[,cbind(.SD,dup=1:rep),by="num"]
num x y rep dup
1: 1 1 1 1 1
2: 2 1 1 1 NA # why these NA?
3: 2 1 1 2 NA
4: 3 1 2 1 1
5: 4 1 3 1 1
6: 5 2 1 1 1
7: 6 2 1 1 1
8: 7 3 2 3 1
9: 7 3 2 3 2
10: 7 3 2 3 3
Just for completeness, a faster way is to rep the row numbers and then take the subset in one step (no grouping and no use of cbind or .SD):
DT[rep(num,rep)]
x y rep num
1: 1 1 1 1
2: 1 1 2 2
3: 1 1 2 2
4: 1 2 1 3
5: 1 3 1 4
6: 2 1 1 5
7: 2 1 1 6
8: 3 2 3 7
9: 3 2 3 7
10: 3 2 3 7
where in this example data the column rep happens to be the same name as the rep() base function.
Great question. What you tried was very reasonable. Assuming you're using v1.7.1, it's now easier to make list columns. In this case it's trying to make one list column out of .SD (3 items) alongside the MONTH column of the 2nd group (4 items). I'll raise it as a bug [EDIT: now fixed in v1.7.5], thanks.
In the meantime, try :
DT[, cbind(MONTH=startMonth:endMonth, .SD), by="ID"]
ID MONTH values startMonth endMonth
a_1 1 10 1 3
a_1 2 10 1 3
a_1 3 10 1 3
a_2 3 11 3 6
...
Also, just to check: have you seen roll=TRUE? Typically you'd have just one startMonth column (irregular with gaps) and then roll join to it. Your example data has overlapping month ranges, though, which complicates things.
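For reference, a sketch of that roll-join idea on hypothetical tables (one irregular month column in the annual data, rolled forward onto a monthly grid):
annual  <- data.table(ID = "a", month = c(1L, 5L, 11L), value = 10:12)
monthly <- data.table(ID = "a", month = 1:12)
setkey(annual, ID, month)
annual[monthly, roll = TRUE]  # each month carries the most recent annual value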
Here is a function I wrote which mimics disaggregate (I needed something that handled complex data). It might be useful for you, if it isn't overkill. To expand only rows, set the argument fact to c(1,12) where 12 would be for 12 'month' rows for each 'year' row.
zexpand <- function(inarray, fact=2, interp=FALSE, ...) {
  fact <- as.integer(round(fact))
  switch(as.character(length(fact)),
    '1' = xfact <- yfact <- fact,
    '2' = { xfact <- fact[1]; yfact <- fact[2] },
    { xfact <- fact[1]; yfact <- fact[2]
      warning('fact is too long. First two values used.') })
  if (xfact < 1) { stop('fact[1] must be > 0') }
  if (yfact < 1) { stop('fact[2] must be > 0') }
  # new nonloop method, seems to work just ducky
  bigtmp <- matrix(rep(t(inarray), each=xfact), nrow(inarray), ncol(inarray)*xfact, byrow=TRUE)
  # does column expansion
  bigx <- t(matrix(rep(bigtmp, each=yfact), ncol(bigtmp), nrow(bigtmp)*yfact, byrow=TRUE))
  return(invisible(bigx))
}
The fastest and most succinct way of doing it:
DT[rep(1:nrow(DT), endMonth - startMonth + 1)]
We can also enumerate the duplicated rows within each group:
dd <- DT[rep(1:nrow(DT), endMonth - startMonth + 1)]
dd[, nn := 1:.N, by = ID]
dd
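If the goal is the MONTH column from the question, one more line recovers it (a sketch): nn runs from 1 to endMonth - startMonth + 1 within each ID, so
dd[, MONTH := startMonth + nn - 1L]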
