I have two data.tables (dat and results) that share column names. As a side note, results holds summary statistics computed earlier on *sub*groups of dat; in other words, nrow(results) != nrow(dat) (but I don't think this is relevant to the question).
Now I want to incorporate these results back into dat (i.e. the original data.table) by adding a new column, NewColZ, to dat.
This doesn't work as I expect:
dat[, list(colA, colB, NewColZ = results[colX == colX & colY == colY, colZ]),
    by = list(colX, colY)]
Why not? Because colX and colY are column names in both data.tables (dat and results), so inside the inner query each name resolves to results' own columns and the condition compares them with themselves, which is always TRUE. What I want to say is, in effect, results[take_from_self(colX) == take_from_parent(colX)].
Therefore the following works (observe I have only RENAMED the columns in results):
dat[, list(colA, colB, NewColZ = results[cx == colX & cy == colY, colZ]),
    by = list(colX, colY)]
I have a feeling this can be done simply and easily with a join, but dat has many more columns than results.
What you are trying to do is a join on colX and colY. You can use := to assign by reference. The join is most straightforward when the key combinations in results are unique (which I am assuming they are).
keys <- c('colX', 'colY')
setkeyv(dat, keys)
setkeyv(results, keys)
dat[results, newcolZ := colZ]
# perhaps use `i.` if there is already a colZ in dat
# dat[results, newcolZ := i.colZ]
I do concur with the comments suggesting you read the FAQ and introduction vignettes, as well as work through the many examples in ?data.table.
Your immediate problem was one of scoping, but the underlying issue was not being fully aware of the data.table idioms. The join approach is the idiomatic data.table way.
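To make this concrete, here is a minimal reproducible sketch with made-up data (the column values and the summary statistic are invented purely for illustration):
library(data.table)
dat <- data.table(colX = rep(1:2, each = 3), colY = rep(letters[1:3], 2),
                  colA = rnorm(6), colB = rnorm(6))
# toy stand-in for the precomputed per-subgroup summaries
results <- dat[, list(colZ = mean(colA)), by = list(colX, colY)]
setkeyv(dat, c("colX", "colY"))
setkeyv(results, c("colX", "colY"))
dat[results, NewColZ := i.colZ]  # i.colZ refers to results' colZ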
Related
I have no experience with data.table, so I don't know if there is a solution to my question (30 minutes on Google gave no answer, at least), but here goes.
With data.frame I often use the following command to check the number of observations of a unique value:
df$Obs=with(df, ave(v1, ID-Date, FUN=function(x) length(unique(x))))
Is there any corresponding method when working with data.table?
Yes, there is. Happily, you've asked about one of the newest features of data.table, added in v1.8.2:
:= by group is now implemented (FR#1491) and sub-assigning to a new column
by reference now adds the column automatically (initialized with NA where
the sub-assign doesn't touch) (FR#1997). := by group can be combined with all
types of i, so := by group includes grouping by i as well as by by.
Since := by group is by reference, it should be significantly faster than any
method that (directly or indirectly) cbinds the grouped results to DT, since
no copy of the (large) DT is made at all. It's a short and natural syntax that
can be compounded with other queries.
DT[, newcol := sum(colB), by = colA]
In your example, if I understand correctly, it should be something like:
DT[, Obs := .N, by = ID-Date]
instead of:
df$Obs=with(df, ave(v1, ID-Date, FUN=function(x) length(unique(x))))
Note that .N is the number of rows in each group; if you specifically need the number of distinct v1 values per group, use length(unique(v1)) in j instead.
Note that := by group scales well for large data sets (and for smaller datasets with a lot of small groups).
See ?":=" and Search data.table tag for "reference"
I want to merge 2 data frames (data1 and data2). Both initially contain around 35 million observations (around 2GB each).
I removed the duplicates from data2. I need to keep the duplicates in data1, as I wish to use them for further calculations per observation in data1.
I initially get the well documented error:
Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including j and dropping by (by-without-by) so that j runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.
As a solution (I looked at several topics, such as here, here, and here), I included allow.cartesian=TRUE, but now I run into memory issues. Also, for a subset it works, but it gives me more observations than I expect (data1 now has 50 million observations, although I specify all.x=TRUE).
My code is:
#Remove duplicates before merge
data2 <- unique(data2)
#Merge
require(data.table)
data1 <- merge(data1, data2, by="ID", all.x=TRUE, allow.cartesian=TRUE)
Any advice on how to merge this is very welcome.
In order to do a left join, the merge statement needs to understand which column you are using as the "key" or "index" for the join. If you have duplicate column names that are used as the key/index, it doesn't know what to do and gives that error. Further, it needs to know what to do if columns being joined have the same name as existing columns.
The solution is to temporarily rename the key/index column in your left (data1) dataset. As a general rule, having duplicate column names is "bad" in R because it will confuse a lot of functions; many functions silently call make.unique() to de-duplicate column names and avoid that confusion.
If you have duplicate ID columns in data1, change them with colnames(data1) <- make.unique(colnames(data1)), which will rename them to ID.1, ID.2, etc. Then do your merge (make sure to specify by.x="ID.1", by.y="ID" because of the rename). By default, duplicate non-key columns that are merged will be suffixed with .x and .y, although you can control this with the suffixes= option (see the merge help file for details).
Lastly, it's worth noting that the merge() method in the data.table package tends to be a lot faster than base merge() with similar syntax. See page 47 of the data.table manual.
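If memory is the binding constraint, a by-reference update join avoids materialising a merged copy of data1 altogether. A sketch, assuming both objects are (or are converted to) data.tables, data2 ends up with one row per ID, and the value to carry over lives in a single column (here called val, a made-up name):
library(data.table)
setDT(data1); setDT(data2)
data2 <- unique(data2, by = "ID")  # one row per ID
# add data2's val to data1 in place; IDs absent from data2 get NA (like all.x=TRUE)
data1[data2, on = "ID", val := i.val]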
This is almost a duplicate of this. I want to drop columns from a data table, but I want to do it efficiently. I have a list of names of columns that I want to keep. All the answers to the linked question imply doing something akin to
data.table.new <- data.table.old[, my.list, with = FALSE]
which at some crucial point will give me a new object, while the old object is still in memory. However, my data.table.old is huge, and hence I prefer to do this via reference, as suggested here
set(data.table.old, j = 'a', value = NULL)
However, as I have a whitelist of columns and not a blacklist, I would need to iterate through all the column names, check whether each is in my.list, and then apply set(). Is there any cleaner/other way of doing so?
I'm not sure you can do by-reference operations on a data.frame without making it a data.table.
The code below should work if you are willing to use data.table.
library(data.table)
setDT(data.frame.old)  # convert in place, without copying
# columns that are not on the whitelist
dropcols <- names(data.frame.old)[!names(data.frame.old) %in% my.list]
data.frame.old[, c(dropcols) := NULL]  # remove them by reference
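Equivalently, as a one-liner on a toy table (same behaviour, arguably cleaner with setdiff):
library(data.table)
DT <- data.table(a = 1:3, b = 4:6, c = 7:9)   # toy data
my.list <- c("a", "c")                        # whitelist of columns to keep
DT[, (setdiff(names(DT), my.list)) := NULL]   # drop the rest by reference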
I would like to create a data.table in tidy form containing the columns articleID, period and demand (with articleID and period as key). The demand is subject to a random function with input data from another data.frame (params). It is created at runtime for differing numbers of periods.
It is easy to do this in "non-tidy" form:
#example data
params <- data.frame(shape=runif(10), rate=runif(10)*2)
rownames(params) <- letters[1:10]
periods <- 10
# create non-tidy data with one row per article and one column per period
df <- replicate(periods,
                rgamma(nrow(params), shape = params[, "shape"], rate = params[, "rate"]))
rownames(df) <- rownames(params)
Is there a "tidy" way to do this creation? I would need to replicate the rgamma(), but I am not sure how to make it use the parameters of the corresponding article. I tried starting with a Cross Join from data.table:
dt <- CJ(articleID=rownames(params), per=1:periods, demand=0)
but I don't know how to pass the rgamma() to dt[, demand] directly and correctly at creation, nor how to change the values afterwards without some ugly for loop. I also considered using gather() from the tidyr package, but as far as I can see I would need a for loop there as well.
It does not really matter to me whether I use data.frame or data.table for my current use case. Solutions for any (or both!) would be highly appreciated.
This'll do (note that it assumes params is sorted by its row names; if not, you can convert it to a data.table and merge the two):
CJ(articleID = rownames(params), per = 1:periods)[,
  demand := rgamma(.N, shape = params[, "shape"], rate = params[, "rate"]), by = per]
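For completeness, a sketch of the merge variant mentioned above, which does not depend on the row order of params (the explicit articleID column is introduced here purely for illustration):
library(data.table)
paramsDT <- data.table(articleID = rownames(params),
                       shape = params[, "shape"], rate = params[, "rate"])
dt <- CJ(articleID = paramsDT$articleID, per = 1:periods)
dt <- paramsDT[dt, on = "articleID"]  # attach each article's parameters to the grid
dt[, demand := rgamma(.N, shape = shape, rate = rate)]
dt[, c("shape", "rate") := NULL]      # keep only articleID, per, demand
setkey(dt, articleID, per)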