I have a project that has already been written using data.frame syntax. In order to improve calculation times I'm trying to leverage the speed of data.table instead. My approach has been to construct wrapper functions that read in frames, convert them to tables, do the calculations and then convert back to frames. Here's one of the simple examples...
FastAgg <- function(x, FUN, aggFields, byFields = NULL, ...) {
  require('data.table')
  y <- setDT(x)
  y <- y[, lapply(X = .SD, FUN = FUN, ...), .SDcols = aggFields, by = byFields]
  y <- data.frame(y)
  y
}
The problem I'm having is that after running this function, x has been converted to a data.table, and lines of code that I have written using data.frame notation then fail. How do I make sure that the data.frame I feed in is left unchanged by the function?
For your case, I'd recommend (of course) using data.table throughout, and not just inside a function :-).
But if that's not likely to happen, then I'd recommend the setDT + setDF setup: use setDT outside the function (and provide the data.table as input) to convert your data.frame to a data.table by reference, and then, after finishing the operations you'd like, use setDF to convert the result back to a data.frame and return that from the function. Be aware, though, that setDT(x) changes x itself to a data.table, as it operates by reference.
If that is not ideal, then use as.data.table(.) inside your function, as it operates on a copy. Then, you can still use setDF() to convert the resulting data.table to data.frame and return that data.frame from your function.
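For example, a minimal sketch of that second option applied to the FastAgg wrapper from the question (untested, argument names carried over from the original):

FastAgg <- function(x, FUN, aggFields, byFields = NULL, ...) {
  require('data.table')
  y <- as.data.table(x)   # copies, so the caller's data.frame x is left untouched
  y <- y[, lapply(.SD, FUN, ...), .SDcols = aggFields, by = byFields]
  setDF(y)                # convert the result back to a data.frame by reference
  y
}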
These functions were introduced recently (mostly due to user requests). The idea for avoiding this confusion is to export a shallow() function and keep track of objects whose columns require copying, and do it all internally (and automatically). It's all at a very early stage right now. When we've managed it, I'll update this post.
Also have a look at ?copy, ?setDT and ?setDF. The first paragraph of these functions' help pages is:
In data.table parlance, all set* functions change their input by reference. That is, no copy is made at all, other than temporary working memory, which is as large as one column. The only other data.table operator that modifies input by reference is :=. Check out the See Also section below for other set* functions data.table provides.
And the example for setDT:
set.seed(45L)
X = data.frame(A = sample(3, 10, TRUE),
               B = sample(letters[1:3], 10, TRUE),
               C = sample(10), stringsAsFactors = FALSE)
# get the frequency of each "A,B" combination
setDT(X)[, .N, by="A,B"][]
does no assignment (although I admit it could be explained slightly better here).
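Even though there is no assignment, X itself has been converted by reference, which you can check directly:

class(X)
# [1] "data.table" "data.frame"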
In setDF:
X = data.table(x=1:5, y=6:10)
## convert 'X' to data.frame, without any copy.
setDF(X)
I think this is pretty clear. But I'll try to provide more clarity. Also, I'll try and add how best to use these functions in the documentation as well.
I am working on a project where we frequently work with a list of usernames. We also have a function to take a username and return a dataframe with that user's data. E.g.
users = c("bob", "john", "michael")
get_data_for_user = function(user)
{
data.frame(user=user, data=sample(10))
}
We often:
Iterate over each element of users
Call get_data_for_user to get their data
rbind the results into a single data frame
I am currently doing this in a purely imperative way:
ret = get_data_for_user(users[1])
for (i in 2:length(users))
{
ret = rbind(ret, get_data_for_user(users[i]))
}
This works, but my impression is that all the cool kids are now using libraries like purrr to do this in a single line. I am fairly new to purrr, and the closest I can see is using map_df to convert the vector of usernames to a vector of dataframes. I.e.
dfs = map_df(users, get_data_for_user)
That is, it seems like I would still be on the hook for writing a loop to do the rbind.
I'd like to clarify whether my solution (which works) is currently considered best practice in R / amongst users of the tidyverse.
Thanks.
That looks right to me - map_df handles the rbind internally (you'll need {dplyr} in addition to {purrr}).
FWIW, purrr::map_dfr() will do the same thing, but the function name is a bit more explicit, noting that it will be binding rows; purrr::map_dfc() binds columns.
I would suggest a slight adjustment:
dfs = map_dfr(users, get_data_for_user)
map_dfr() explicitly states that you want to do a row bind. And I would be inclined to call this best practice when working with purrr.
For the sake of completeness, here are some additional approaches:
using built-in functions
Reduce(rbind, lapply(users, get_data_for_user))
using data.table approach
library(data.table)
rbindlist(lapply(users, get_data_for_user))
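and, for what it's worth, the classic do.call() idiom, which performs a single rbind call instead of Reduce()'s pairwise binds:

do.call(rbind, lapply(users, get_data_for_user))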
I read in an SO topic an answer from Matt Dowle about a shallow function to make shallow copies in data.table. However, I can't find the topic again.
data.table does not have any exported function called shallow. There is an internal one, but it is not documented. Can I use it safely? What is its behavior?
What I would like to do is make a memory-efficient copy of a big table. Let DT be a big table with n columns and f a function which memory-efficiently adds a column. Is something like that possible?
DT2 = f(DT)
with DT2 being a data.table with n columns pointing to the original addresses (no deep copies) and an extra one existing only for DT2. If yes, what happens to DT if I do DT2[, col3 := NULL]?
You can't use data.table:::shallow safely, no. It is deliberately not exported and not meant for user use. Either from the point of view of it itself working, or its name or arguments changing in future.
Having said this, you could decide to use it as long as you can either i) guarantee that := or set* won't be called on the result either by you or your users (if you're creating a package) or ii) if := or set* is called on the result then you're ok with both objects being changed by reference. When shallow is used internally by data.table, that's what we promise ourselves.
More background in this answer from a few days ago:
https://stackoverflow.com/a/45891502/403310
In that question I asked for the bigger picture: why is this needed? Having that clear would help to raise the priority in either investigating ALTREP or perhaps doing our own reference count.
In your question you alluded to your bigger picture and that is very useful. So you'd like to create a function which adds working columns to a big data.table inside the function but doesn't change the big data.table. Can you explain more about why you'd like to create a function like that? Why not load the big data.table, add the ephemeral working columns directly to it, and then proceed? Your R session is already a working copy in memory of the data which is persistent somewhere else.
Note that I am not saying no. I'm not saying that you don't have a valid reason. I'm asking to discover more about that valid reason so the priority can be raised.
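In the meantime, the fully supported way to get the behaviour you describe is copy(), at the cost of duplicating the data rather than sharing column addresses. A minimal sketch (the function and column names are purely illustrative):

f <- function(DT) {
  DT2 <- copy(DT)          # deep copy: safe, but duplicates all n columns
  DT2[, col_extra := 1L]   # only DT2 gains the new column
  DT2[]
}
DT2 <- f(DT)
DT2[, col_extra := NULL]   # removing it later cannot touch DT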
If that isn't the answer you had seen, there are currently 39 questions or answers returned by the search string "[data.table] shallow". Worst case, you could trawl through those to find it again.
I would like to convert this data frame
data <- data.frame(color = c("red","red","red","green","green","green","blue","blue","blue"),
                   object = c("box","chair","table","box","chair","table","box","chair","table"),
                   units = c(1:9),
                   price = c(11.5,12.5,13.5,14.5,15.5,16.5,17.5,18.5,19.5))
to this other one
output <- data.frame(color = c("red","green","blue"),
                     units_box = c(1,4,7), price_box = c(11.5,14.5,17.5),
                     units_chair = c(2,5,8), price_chair = c(12.5,15.5,18.5),
                     units_table = c(3,6,9), price_table = c(13.5,16.5,19.5))
Therefore, I am using reshape2::melt and reshape2::dcast to build a user-defined function as follows
fun <- function(df, var, group){
  r <- reshape2::melt(df, id.vars = var)
  r <- reshape2::dcast(r, var ~ group)
  return(r)
}
When I use the function as follows
fun(data,color,object)
I get the following error message
Error in melt_check(data, id.vars, measure.vars, variable.name,
value.name) : object 'color' not found
Do you know how I can solve it? I think the problem is that I should pass the variables to reshape2::melt with quotes, but I do not know how.
Note 1: I would like to keep the original number format of the variables (i.e. units without decimals and price with one decimal)
Note 2: I would like to remark that my real code (this is just a simplified example) is much longer and involves dplyr functions (including enquo() and UQ()). Therefore the solution for this case should be compatible with dplyr.
Note 3: I do not use tidyr (I am a big fan of the whole tidyverse) because the current tidyr still uses the old language for these functions, and I share the script with other people who might not be willing to use the development version of tidyr.
We can use dcast from data.table
library(data.table)
dcast(setDT(data), color ~ object, value.var = c("units", "price"))
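Since every color/object combination appears exactly once in data, no aggregation function is needed here. The result has the columns units_box, units_chair, units_table, price_box, price_chair and price_table; if you want the exact column order of the desired output, an optional reorder by reference would look something like this:

res <- dcast(setDT(data), color ~ object, value.var = c("units", "price"))
setcolorder(res, c("color", "units_box", "price_box", "units_chair",
                   "price_chair", "units_table", "price_table"))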
I solved the issue by myself (although I do not fully understand the reasons behind it).
The main problem, as I suspected, was that passing the variables of the user-defined function to melt and dcast causes some kind of conflict, maybe due to the lack of quotes (?).
Anyway, I renamed the variables using dplyr::rename so that the names no longer depend on the variables passed in but are plain character names. Here is the final code I am applying:
fun <- function(df, var, group){
  enquo_var <- enquo(var)
  enquo_group <- enquo(group)
  r <- df %>%
    reshape2::melt(., id.var = 1, variable.name = "parameter") %>%
    dplyr::rename(var = UQ(enquo_var)) %>%
    reshape2::dcast(data = ., formula = var ~ parameter, value.var = "value")
  return(r)
}
funx<-fun(data,color,object)
Although I found the solution to my particular problem, I would very much appreciate it if someone could explain the reasons behind it.
PS: I hope the new version of tidyr is ready soon to make such tasks easier. Thanks @hadley for your fantastic work.
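For what it's worth, the likely reason is that reshape2::melt() expects id.vars as a character vector of column names, so a bare symbol like color is evaluated as an object in the calling environment and is not found. Passing the names as quoted strings sidesteps the problem without enquo()/UQ(). A minimal sketch (fun_chr is hypothetical and mirrors the structure of the function above, not the exact target output):

fun_chr <- function(df, var, group){
  # group is kept only to mirror the original signature
  r <- reshape2::melt(df, id.vars = var, variable.name = "parameter")
  reshape2::dcast(r, stats::as.formula(paste(var, "~ parameter")), value.var = "value")
}
fun_chr(data, "color", "object")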
I have a very large table in an R package that I'm writing. To keep the size down for distribution, I'm eliminating all columns from the table that can be calculated from other columns. For example, day of week can be calculated from the date, so I leave day of week out of the package data set. However, I want to make it convenient to recalculate these columns in a standard way for anyone that uses the package. I'd like to do it with data.table's in-place assignment, for the sake of efficiency. I'm imagining something like this:
dt = myPackageData # minimal data set included in the package
extend_dow = function(your_data_table) {
your_data_table[,`:=`(day_of_week = lubridate::wday(my_date))]
}
extend_dow(dt)
And then dt would have the day_of_week column available for use.
The problem that I'm running into is that the in-place assignment of the new column seems to be occurring in a lower level environment, and the data.table that I pass to the function doesn't actually get modified.
Does anyone know how I can store the complete formula for a new column, which can be applied using a single function call to the same data.table that the user passes to the function?
I figured it out. The example that I posted above does work, but only if you make a data.table::copy of the data.table before feeding it to the function, like so:
library(myPackage)
library(data.table)
dt = copy(myPackageData)
extend.weekday = function(your_data_table) {
your_data_table[,`:=`(day_of_week = lubridate::wday(my_date))]
}
extend.weekday(dt)
The mistake in my example was that I was assigning the package data directly via dt = myPackageData, without making a copy. In that case, the column extension does not get applied. I would guess that this is because the object is still referencing the package data somehow, which prevents any changes from being applied when the function is executed.
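A variant that takes the copy inside the function, so callers don't need to remember to copy() first (a sketch, untested against the real package; it returns the extended table instead of modifying in place):

extend_dow = function(your_data_table) {
  out = data.table::copy(your_data_table)   # copy first, so the package data is never touched
  out[, `:=`(day_of_week = lubridate::wday(my_date))]
  out[]
}
dt = extend_dow(myPackageData)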
I am trying to add some columns to a large data.table that are based on rolling calculations split by a unique identifier.
Based on "Equivalent to ddply(...,transform,...) in data.table" I have generated this statement:
alldefaults[,`:=`(.SD[,list(obsPaymentDownMax24m=rollapplyr(obsPaymentsDown,24,max,partial=TRUE)
,obsPaymentDownAvg24m=rollapplyr(obsPaymentsDown,24,mean,partial=TRUE)
,obsPaymentDownMax12m=rollapplyr(obsPaymentsDown,12,max,partial=TRUE)
,obsPaymentDownAvg12m=rollapplyr(obsPaymentsDown,12,mean,partial=TRUE))]),by=OriginalApplicationID]
It produces an error
Error in `[.data.table`(alldefaults, , `:=`(.SD[, list(obsPaymentDownMax24m = rollapplyr(obsPaymentsDown, :
In `:=`(col1=val1, col2=val2, ...) form, all arguments must be named.
When I run this without the := assignment it works well, but the result is a new dataset and joining it back on would then be required.
Inserting the assignment within .SD
alldefaults[,.SD[,`:=`(list(obsPaymentDownMax24m=rollapplyr(obsPaymentsDown,24,max,partial=TRUE)
,obsPaymentDownAvg24m=rollapplyr(obsPaymentsDown,24,mean,partial=TRUE)
,obsPaymentDownMax12m=rollapplyr(obsPaymentsDown,12,max,partial=TRUE)
,obsPaymentDownAvg12m=rollapplyr(obsPaymentsDown,12,mean,partial=TRUE)))],by=OriginalApplicationID]
Produces this error
Error in `[.data.table`(.SD, , `:=`(list(obsPaymentDownMax24m = rollapplyr(obsPaymentsDown, :
.SD is locked. Using := in .SD's j is reserved for possible future use; a tortuously flexible way to modify by group. Use := in j directly to modify by group by reference.
Is there a trick to making this update work that I've missed?
PS - Not sure if this is a hefty enough question to require a reproducible example, as it seems primarily syntax-oriented and it is hopefully easy to point out what the statement ought to be. Also, if anyone has recommendations for making this faster still, I'd be very appreciative!
I'm guessing (guessing, because the question is not well formed, which is probably why you got downvoted) you want to do this:
alldefaults[, `:=`(obsPaymentDownMax24m = rollapplyr(obsPaymentsDown, 24, max, partial = TRUE),
                   obsPaymentDownAvg24m = rollapplyr(obsPaymentsDown, 24, mean, partial = TRUE),
                   obsPaymentDownMax12m = rollapplyr(obsPaymentsDown, 12, max, partial = TRUE),
                   obsPaymentDownAvg12m = rollapplyr(obsPaymentsDown, 12, mean, partial = TRUE)),
            by = OriginalApplicationID]
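For completeness, a minimal self-contained illustration of the same := by-group idiom (toy data and made-up column names, just to show the pattern; rollapplyr() comes from the zoo package):

library(data.table)
library(zoo)

toy <- data.table(OriginalApplicationID = rep(1:2, each = 5),
                  obsPaymentsDown = rnorm(10))
toy[, `:=`(obsPaymentDownMax3m = rollapplyr(obsPaymentsDown, 3, max, partial = TRUE),
           obsPaymentDownAvg3m = rollapplyr(obsPaymentsDown, 3, mean, partial = TRUE)),
    by = OriginalApplicationID]
toy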