Is it possible to reuse generated columns in ddply? - r

I have a script where I'm using ddply, as in the following example:
ddply(df, .(col),
function(x) data.frame(
col1=some_function(x$y),
col2=some_other_function(x$y)
)
)
Within ddply, is it possible to reuse col1 without calling the entire function again?
For example:
ddply(df, .(col),
  function(x) data.frame(
    col1 = some_function(x$y),
    col2 = some_other_function(x$y),
    col3 = col1 * col2
  )
)

You've got a whole function to play with! Doesn't have to be a one-liner! This should work:
ddply(df, .(col), function(x) {
  tmp1 <- some_function(x$y)
  tmp2 <- some_other_function(x$y)
  data.frame(
    col1 = tmp1,
    col2 = tmp2,
    col3 = tmp1 * tmp2
  )
})

This appears to be a good candidate for data.table using the scoping rules of the j component. See FAQ 2.8 for details.
From the FAQ:
"No anonymous function is passed to the j. Instead, an anonymous body is passed to the j."
So, for your case:
library(data.table)
DT <- as.data.table(df)
DT[, {
  col1 <- some_function(y)
  col2 <- some_other_function(y)
  col3 <- col1 * col2
  list(col1 = col1, col2 = col2, col3 = col3)
}, by = col]
Or a slightly more direct way:
DT[, list(
  col1 = col1 <- some_function(y),
  col2 = col2 <- some_other_function(y),
  col3 = col1 * col2
), by = col]
This avoids one repetition each of col1 and col2, and avoids two repeats of col3; repetition is something we strive to reduce in data.table. The = followed by <- might initially look cumbersome, but it allows the following syntactic sugar:
DT[, list(
  "Projected return (%)" = col1 <- some_function(y),
  "Investment ($m)" = col2 <- some_other_function(y),
  "Return on Investment ($m)" = col1 * col2
), by = col]
where the output can be sent directly to LaTeX or HTML, for example.
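For that last step, one option (an addition here, not part of the original answer) is knitr::kable(); xtable is another common choice. A minimal sketch, reusing the placeholder functions from above:
library(knitr)
res <- DT[, list(
  "Projected return (%)" = col1 <- some_function(y),
  "Investment ($m)" = col2 <- some_other_function(y),
  "Return on Investment ($m)" = col1 * col2
), by = col]
kable(res, format = "latex")  # or format = "html"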

I don't think that's possible, but it shouldn't matter too much, because at that point it's not an aggregation function anymore. For example:
# use summarize() in ddply()
data.means <- ddply(data, .(groups), summarize, mean = mean(x), sd = sd(x), n = length(x))
data.means$se <- data.means$sd / sqrt(data.means$n)
data.means$Upper <- data.means$mean + (data.means$se * 1.96)
data.means$Lower <- data.means$mean - (data.means$se * 1.96)
So I didn't calculate the SEs directly inside ddply(), but it wasn't so bad calculating them outside of it. If you really wanted to, you could also do
ddply(data, .(groups), summarize, se = sd(x) / sqrt(length(x)))
Or, to put it in terms of your example:
ddply(df, .(col), summarize,
  col1 = some_function(y),
  col2 = some_other_function(y),
  col3 = some_function(y) * some_other_function(y)
)
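Applying the same "finish it afterwards" idea to the original col1/col2/col3 case, a sketch (assuming some_function() and some_other_function() behave sensibly inside summarize):
res <- ddply(df, .(col), summarize,
  col1 = some_function(y),
  col2 = some_other_function(y))
res$col3 <- res$col1 * res$col2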

Related

data.table .. notation with functions in j

I am trying to use data.table's .. notation with functions, here is the code I have so far:
set.seed(42)
dt <- data.table(
x = rnorm(10),
y = runif(10)
)
test_func <- function(data, var, var2) {
vars <- c(var, var2)
data[, ..vars]
}
test_func(dt, 'x', 'y') # this works
test_func2 <- function(data, var, var2) {
data[, ..var]
}
test_func2(dt, 'x', 'y') # this works too
test_func3 <- function(data, var, var2) {
data[, sum(..var)]
}
test_func3(dt, 'x', 'y')
# this does not work
# Error in eval(jsub, SDenv, parent.frame()) : object '..var' not found
It seems data.table does not recognize .. once it's wrapped inside another function in j. I know I can use sum(get(var)) to achieve the result, but I want to know whether I am following best practice for most situations.
Parroting an answer to a different problem that works here as well. Not the prettiest solution, but variants on this have worked for me numerous times in the past.
Thanks @Frank for a non-parse() solution here!
I'm well familiar with the old adage "If the answer is parse() you should usually rethink the question," but I often have a hard time coming up with alternatives when evaluating within the data.table calling environment. I'd love to see a robust solution that doesn't execute arbitrary code passed in as a character string. In fact, half the reason I'm posting an answer like this is in hopes that someone can recommend a better option.
test_func3 <- function(data, var, var2) {
expr = substitute(sum(var), list(var=as.symbol(var)))
data[, eval(expr)]
}
test_func3(dt, 'x', 'y')
## [1] 5.472968
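For reference, the parse()-based variant that the disclaimer below refers to would look roughly like this (a sketch; test_func3_parse is an illustrative name, not from the original answer, and it builds and evaluates code from a string, which is exactly the risk discussed next):
test_func3_parse <- function(data, var, var2) {
  # build "sum(x)" as text, then parse and evaluate it inside j
  data[, eval(parse(text = paste0("sum(", var, ")")))]
}
test_func3_parse(dt, 'x', 'y')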
Quick disclaimer on hypothetical doomsday scenarios possible with eval(parse(...))
There are far more in depth discussions on the dangers of eval(parse(...)), but I'll avoid repeating them in full.
Theoretically you could have issues if one of your columns is named something unfortunate like "(system(paste0('kill ',Sys.getpid())))" (Do not execute that, it will kill your R session on the spot!). This is probably enough of an outside chance to not lose sleep over it unless you plan on putting this in a package on CRAN.
Update:
For the specific case in the comments below, where the table is grouped and sum() is then applied to all columns, .SDcols is potentially useful. The only way I'm aware of to make sure the function returns consistent results even if dt had a column named var3 is to evaluate the arguments within the function environment but outside of the data.table environment, using c().
set.seed(42)
dt <- data.table(
x = rnorm(10),
y = rnorm(10),
z = sample(c("a","b","c"),size = 10, replace = TRUE)
)
test_func3 <- function(data, var, var2, var3) {
  ListOfColumns <- c(var, var2)
  GroupColumn <- c(var3)
  data[, lapply(.SD, sum), by = eval(GroupColumn), .SDcols = ListOfColumns]
}
test_func3(dt, 'x', 'y','z')
returns
z x y
1: b 1.0531555 2.121852
2: a 0.3631284 -1.388861
3: c 4.0566838 -2.367558
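For completeness, the sum(get(var)) route mentioned in the question also works as-is (a sketch against the dt defined earlier; test_func3_get is just an illustrative name):
test_func3_get <- function(data, var, var2) {
  # get() looks the column name up inside the data.table scope
  data[, sum(get(var))]
}
test_func3_get(dt, 'x', 'y')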

Optimizing processing time in the nested for loops - R

I have two datasets with 24k and 15k rows. I used nested for loops in order to rewrite some data... however, it takes forever to compute.
Does anyone have a suggestion for how to optimize the code to speed up the process?
My code:
for(i in 1:length(data$kolicina)){
for(j in 1:length(df$kolicina)){
if(data$LIXcode[i] == df$LIXcode[j]){
data$kolicina[i] <- df$kolicina[j]
}
}
}
The full code with the input looks like this:
df <- data[grepl("Trennscheiben", data$a_naziv) & data$SestavKolicina > 1,]
for(i in 1:length(df$kolicina)){
df$kolicina[i] <- df$kolicina[i] / 10
}
for(i in 1:length(data$kolicina)){
for(j in 1:length(df$kolicina)){
if(data$LIXcode[i] == df$LIXcode[j]){
data$kolicina[i] <- df$kolicina[j]
}
}
}
The data:
LIXcode a_naziv RacunCenaNaEM kolicina
LIX2017396957 MINI HVLP Spritzpistole 20,16 1
LIX2017396957 MINI HVLP Spritzpistole 20,16 1
LIX2017396963 Trennscheiben Ø115 Ø12 12,53 30
LIX2017396963 Trennscheiben Ø115 Ø12 12,53 1
I haven't tried this on my own machine, but something along these lines should work:
fun <- function(x, y){
  hit <- x$LIXcode %in% y$LIXcode
  x$kolicina[hit] <- y$kolicina[match(x$LIXcode[hit], y$LIXcode)]
  x
}
data <- fun(data, df)
R can do all of these comparisons at once (vectorised) instead of looping over every pair of rows.
As far as I understand, the question concerns table "dt1" with key column "a" and any number of value columns and any number of observations. And then we have a "dt2" that has some sort of mapping - which means that column "a" has unique values and some column "b" has values that need to be written into "dt1" where columns "a" match.
I would suggest joining tables:
require(data.table)
dt1 <- data.table(a = sample(1:10, 1000, replace = T),
b = sample(letters, 1000, replace = T))
dt2 <- data.table(a = 1:10,
b = letters[1:10])
output <- merge(dt1, dt2, by = "a", all.x = T)
Also you can try:
dt1[, new_value := dt2$b[match(a, dt2$a)]]
Both of these solutions are vectorized, therefore almost instant.
Base solution (no data.table syntax, although I'd highly recommend you learn it):
dt1$new_value <- dt2$b[match(dt1$a, dt2$a)]
And that's if I understood the question correctly...
Here's a working solution to accommodate the expected output:
dt1[a %in% dt2$a, b:=dt2$b[match(a, dt2$a)]]
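For what it's worth (an aside, not from the original answer), the same overwrite can also be written as an update join in any reasonably recent data.table version, which avoids the explicit match() call; i.b refers to column b of the joined table:
dt1[dt2, b := i.b, on = "a"]  # overwrite b only where a matches a row of dt2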

plyr outperforms dplyr and data.table - What's wrong?

I have to apply a function to every row of a large table (~ 2M rows). I used to use plyr for that, but the table is growing continuously and the current solution starts to approach unacceptable runtimes. I thought I could just switch to data.table or dplyr and all would be fine, but that's not the case.
Here's an example:
library(data.table)
library(plyr)
library(dplyr)
dt = data.table("ID_1" = c(1:1000), # unique ID
"ID_2" = ceiling(runif(1000, 0, 100)), # other ID, duplicates possible
"group" = sample(LETTERS[1:10], 1000, replace = T),
"value" = runif(1000),
"ballast1" = "X", # keeps unchanged in derive_dt
"ballast2" = "Y", # keeps unchanged in derive_dt
"ballast3" = "Z", # keeps unchanged in derive_dt
"value_derived" = 0)
setkey(dt, ID_1)
extra_arg = c("A", "F", "G", "H")
ID_1 is guaranteed to contain no duplicates. Now I define a function to apply to every row/ID_1:
derive = function(tmprow, extra_arg){
if(tmprow$group %in% extra_arg){return(NULL)} # exclude entries occurring in extra_arg
group_index = which(LETTERS == tmprow$group)
group_index = ((group_index + sample(1:26, 1)) %% 25) + 1
new_group = LETTERS[group_index]
if(new_group %in% unique(dt$group)){return(NULL)}
new_value = runif(1)
row_derived = tmprow
row_derived$group = new_group
row_derived$value = runif(1)
row_derived$value_derived = 1
return(row_derived)
}
This one doesn't do anything useful (the actual one does). The point is that the function takes one row and computes a new row of the same format.
Now the comparison:
set.seed(42)
system.time(result_dt <- dt[, derive(.SD, extra_arg), by = ID_1])
set.seed(42)
system.time(result_dplyr <- dt %>% group_by(ID_1) %>% do(derive(., extra_arg)))
set.seed(42)
system.time(results_plyr <- x <- ddply(dt, .variable = "ID_1", .fun = derive, extra_arg))
plyr is about 8x faster than both data.table and dplyr. Obviously I'm doing something wrong here, but what?
EDIT
Thanks to eddi's answer I could reduce runtimes for data.table and dplyr to ~ 0.6 and 0.8 of the plyr version, respectively. I initialized row_derived as a data.frame: row_derived = as.data.frame(tmprow). That's cool, but I still expected a bigger performance gain from these packages... any further suggestions?
The issue is that the assignment you use has a very high overhead in data.table; plyr converts the row to a data.frame before passing it to your derive function and thus avoids it:
library(microbenchmark)
df = as.data.frame(dt)
microbenchmark({dt$group = dt$group}, {df$group = df$group})
#Unit: microseconds
# expr min lq mean median uq max neval
# { dt$group = dt$group } 1895.865 2667.499 3092.38903 3080.3620 3389.049 4984.406 100
# { df$group = df$group } 26.045 45.244 64.13909 61.6045 79.635 157.266 100
I can't suggest a good fix, since you say your example is not the real problem, so there's no point in solving it better. Some basic suggestions to look at are vectorizing the code and using := or set() instead (depending on what exactly you end up doing).
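To make the := / set() suggestion a bit more concrete, a minimal sketch (dt_toy is an illustrative object added here, not part of the benchmark above); set() assigns by reference and sidesteps the dt$group <- ... overhead measured above:
dt_toy <- data.table(group = LETTERS[1:5], value = runif(5))
for (i in seq_len(nrow(dt_toy))) {
  # update one cell by reference, no copy of the table
  set(dt_toy, i = i, j = "value", value = dt_toy$value[i] * 2)
}
dt_toy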

Find max per group and return another column

Given the following test matrix:
testMatrix <- matrix( c(1,1,2,10,20,30,300,100,200,"A","B","C"), 3, 4)
colnames(testMatrix) <- c("GroupID", "ElementID", "Value", "Name")
Here I want to find the max per group and then return the name of that column.
E.g. I would expect 1, A and 2, C. If there is a tie with max, the first match would be fine.
After that I would have to attach this to the matrix as a new column "GroupName".
How can I do this?
I already have the Group, Max Value combination:
groupMax <- aggregate (as.numeric(testMatrix[,3]), by=list( testMatrix[,1] ), max )
The way I used to add columns to my matrix works like this (let's assume there is also already a matrix groupNames with GroupID, Name combinations):
testMatrix <- cbind ( testMatrix, groupNames[match( testMatrix[,1], groupNames[,1] ), 2] )
A data.table solution for time and memory efficiency and syntactic elegance
library(data.table)
DT <- as.data.table(testMatrix)
DT[, Value := as.numeric(Value)] # Value comes through as character from the matrix
DT[, list(Name = Name[which.max(Value)]), by = GroupID]
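Since the question also asks to attach the result back as a "GroupName" column, the same which.max() idea can be written as a grouped update by reference (a sketch using the DT created above; GroupName is the column name the question mentions):
DT[, GroupName := Name[which.max(Value)], by = GroupID]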
Base solution, not as simple as Dan M's:
testMatrix <- data.frame(GroupID = c(1,1,2), ElementID = c(10,20,30),
Value=c(300,100,200), Name=c("A","B","C"))
A <- lapply(split(testMatrix, testMatrix$GroupID), function(x) {
  x[which.max(x$Value), c(1, 4)]
})
do.call(rbind, A)
As @Tyler said, a data.frame is easier to work with. Here's an option:
testMatrix <- data.frame(GroupID = c(1,1,2), ElementID = c(10,20,30), Value=c(300,100,200), Name=c("A","B","C"))
ddply(testMatrix, .(GroupID), summarize, Name=Name[which.max(Value)])
I figured out a nice way to do this via dplyr
library(dplyr)
filter(group_by(testMatrix, GroupID), min_rank(desc(Value)) == 1)

R data.table sliding window

What is the best (fastest) way to implement a sliding window function with the data.table package?
I'm trying to calculate a rolling median but have multiple rows per date (due to 2 additional factors), which I think means that the zoo rollapply function wouldn't work. Here is an example using a naive for loop:
library(data.table)
df <- data.frame(
id=30000,
date=rep(as.IDate(as.IDate("2012-01-01")+0:29, origin="1970-01-01"), each=1000),
factor1=rep(1:5, each=200),
factor2=1:5,
value=rnorm(30, 100, 10)
)
dt = data.table(df)
setkeyv(dt, c("date", "factor1", "factor2"))
get_window <- function(date, factor1, factor2) {
criteria <- data.table(
date=as.IDate((date - 7):(date - 1), origin="1970-01-01"),
factor1=as.integer(factor1),
factor2=as.integer(factor2)
)
return(dt[criteria][, value])
}
output <- data.table(unique(dt[, list(date, factor1, factor2)]))[, window_median:=as.numeric(NA)]
for(i in nrow(output):1) {
print(i)
output[i, window_median:=median(get_window(date, factor1, factor2))]
}
data.table doesn't currently have any special features for rolling windows. Further detail in my answer to another similar question:
Is there a fast way to run a rolling regression inside data.table?
Rolling median is interesting. It would need a specialized function to do it efficiently (same link as in an earlier comment):
Rolling median algorithm in C
The data.table solutions in the question and answers here are all very inefficient, relative to a proper specialized rollingmedian function (which isn't available for R afaik).
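(Editorial aside: since this answer was written, newer data.table versions gained frollapply(), which a later answer below also uses. A rough sketch for this problem, assuming one row per date within each factor1/factor2 group and a trailing window that includes the current row; roll7 is an illustrative column name, not from the question:)
dt[order(date), roll7 := frollapply(value, 7, median), by = .(factor1, factor2)]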
I managed to get the example down to 1.4s by creating a lagged dataset and doing a huge join.
df <- data.frame(
id=30000,
date=rep(as.IDate(as.IDate("2012-01-01")+0:29, origin="1970-01-01"), each=1000),
factor1=rep(1:5, each=200),
factor2=1:5,
value=rnorm(30, 100, 10)
)
dt <- data.table(df)
setkeyv(dt, c("date", "factor1", "factor2"))
unique_set <- data.table(unique(dt[, list(original_date=date, factor1, factor2)]))
output2 <- data.table()
for(i in 1:7) {
output2 <- rbind(output2, unique_set[, date:=original_date-i])
}
setkeyv(output2, c("date", "factor1", "factor2"))
output2 <- output2[dt]
output2 <- output2[, median(value), by=c("original_date", "factor1", "factor2")]
That works pretty well on this test dataset, but on my real one it fails with 8GB of RAM. I'm going to try moving up to one of the High Memory EC2 instances (with 17, 34 or 68GB RAM) to get it working. Any ideas on how to do this in a less memory-intensive way would be appreciated.
This solution works but it takes a while.
df <- data.frame(
id=30000,
date=rep(seq.Date(from=as.Date("2012-01-01"),to=as.Date("2012-01-30"),by="d"),each=1000),
factor1=rep(1:5, each=200),
factor2=1:5,
value=rnorm(30, 100, 10)
)
myFun <- function(dff,df){
median(df$value[df$date>as.Date(dff[2])-8 & df$date<as.Date(dff[2])-1 & df$factor1==dff[3] & df$factor2==dff[4]])
}
week_Med <- apply(df,1,myFun,df=df)
week_Med_df <- cbind(df,week_Med)
I address this in a related thread: https://stackoverflow.com/a/62399700/7115566
I suggest looking into the frollapply function. For instance, see below:
library(data.table)
set.seed(17)
dt <- data.table(i = 1:100,
x = sample(1:10, 100, replace = T),
y = sample(1:10, 100, replace = T))
dt$index <- dt$x == dt$y
dt[, MA := frollapply(index, 10, mean)]
head(dt,12)
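A small aside (an addition, not from the original answer): for a plain rolling mean, data.table also provides frollmean(), which avoids the per-window function call of frollapply(); MA2 is just an illustrative column name:
dt[, MA2 := frollmean(as.numeric(index), 10)]  # logical converted to numeric explicitly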
