I have to apply a function to every row of a large table (~2M rows). I used to use plyr for that, but the table is growing continuously and the current solution is approaching unacceptable runtimes. I thought I could just switch to data.table or dplyr and all would be fine, but that's not the case.
Here's an example:
library(data.table)
library(plyr)
library(dplyr)
dt = data.table("ID_1" = c(1:1000),                    # unique ID
                "ID_2" = ceiling(runif(1000, 0, 100)), # other ID, duplicates possible
                "group" = sample(LETTERS[1:10], 1000, replace = TRUE),
                "value" = runif(1000),
                "ballast1" = "X",                      # stays unchanged in derive
                "ballast2" = "Y",                      # stays unchanged in derive
                "ballast3" = "Z",                      # stays unchanged in derive
                "value_derived" = 0)
setkey(dt, ID_1)
extra_arg = c("A", "F", "G", "H")
ID_1 is guaranteed to contain no duplicates. Now I define a function to apply to every row/ID_1:
derive = function(tmprow, extra_arg){
  if(tmprow$group %in% extra_arg){return(NULL)} # exclude entries occurring in extra_arg
  group_index = which(LETTERS == tmprow$group)
  group_index = ((group_index + sample(1:26, 1)) %% 25) + 1
  new_group = LETTERS[group_index]
  if(new_group %in% unique(dt$group)){return(NULL)}
  new_value = runif(1)
  row_derived = tmprow
  row_derived$group = new_group
  row_derived$value = new_value
  row_derived$value_derived = 1
  return(row_derived)
}
This one doesn't do anything useful (the actual one does). The point is that the function takes one row and computes a new row of the same format.
Now the comparison:
set.seed(42)
system.time(result_dt <- dt[, derive(.SD, extra_arg), by = ID_1])
set.seed(42)
system.time(result_dplyr <- dt %>% group_by(ID_1) %>% do(derive(., extra_arg)))
set.seed(42)
system.time(result_plyr <- ddply(dt, .variables = "ID_1", .fun = derive, extra_arg))
plyr is about 8x faster than both data.table and dplyr. Obviously I'm doing something wrong here, but what?
EDIT
Thanks to eddi's answer I could reduce runtimes for data.table and dplyr to ~0.6 and ~0.8 of the plyr version, respectively. I initialized row_derived as a data.frame: row_derived = as.data.frame(tmprow). That's cool, but I still expected a bigger performance gain from these packages... any further suggestions?
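For reference, here is a sketch of the modified function described above (derive2 and result_dt2 are just illustrative names; the only change from derive is the as.data.frame conversion):
derive2 = function(tmprow, extra_arg){
  if(tmprow$group %in% extra_arg){return(NULL)} # exclude entries occurring in extra_arg
  group_index = which(LETTERS == tmprow$group)
  group_index = ((group_index + sample(1:26, 1)) %% 25) + 1
  new_group = LETTERS[group_index]
  if(new_group %in% unique(dt$group)){return(NULL)}
  new_value = runif(1)
  row_derived = as.data.frame(tmprow) # the one change: cheap data.frame assignment below
  row_derived$group = new_group
  row_derived$value = new_value
  row_derived$value_derived = 1
  return(row_derived)
}
set.seed(42)
system.time(result_dt2 <- dt[, derive2(.SD, extra_arg), by = ID_1])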
The issue is that the assignment you use has very high overhead on a data.table; plyr converts the row to a data.frame before passing it to your derive function and thus avoids that overhead:
library(microbenchmark)
df = as.data.frame(dt)
microbenchmark({dt$group = dt$group}, {df$group = df$group})
#Unit: microseconds
# expr min lq mean median uq max neval
# { dt$group = dt$group } 1895.865 2667.499 3092.38903 3080.3620 3389.049 4984.406 100
# { df$group = df$group } 26.045 45.244 64.13909 61.6045 79.635 157.266 100
I can't suggest a good fix, since you say your example is not the real problem, so there's no point in solving it better. Some basic directions to look at are vectorizing the code and using := or set instead (depending on what exactly you end up doing).
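For illustration only, here is a rough sketch of what vectorizing the dummy derive above could look like (the real derivation may not map onto this; res and new_group are just illustrative names, everything else comes from the question):
set.seed(42)
res <- dt[!(group %in% extra_arg)] # drop excluded groups up front; returns a new data.table
res[, new_group := LETTERS[((match(group, LETTERS) + sample(1:26, .N, replace = TRUE)) %% 25) + 1]]
res <- res[!(new_group %in% unique(dt$group))] # same filter as in derive()
res[, `:=`(group = new_group, value = runif(.N), value_derived = 1)]
res[, new_group := NULL]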
Related
I am struggling to understand what apply = TRUE/FALSE means within collapse::settransformv. Maybe somebody can give me a hand with this? Below is an example where the code fails when I specify apply = TRUE.
library(collapse)
library(data.table)
lagamount <- 1
testdf_1 <- data.table(group = c(1,1,1,1,1,2,2,2,2,2),
                       counter = as.integer(c(1,2,3,5,6,7,8,9,11,12)),
                       xval = seq(100, 1000, 100))
testdf_2 <- copy(testdf_1)
settransformv(testdf_1, "xval", flag, 1:3, group, counter, apply = FALSE)
settransformv(testdf_2, "xval", flag, 1:3, group, counter, apply = TRUE)
So the apply argument is there mainly because all .FAST_FUN in the package have a data.frame method, which is more efficient than applying the function to multiple columns using lapply, especially if some grouping/indexing is done in the call. This example should clarify it.
library(collapse)
library(magrittr)
library(microbenchmark)
# These two are equivalent
settransformv(wlddev, PCGDP:POP, flag, 1, iso3c, year)
fselect(wlddev, PCGDP:POP) %<>% lapply(flag, 1, wlddev$iso3c, wlddev$year)
# -> so here we redo the indexing for each column
# These two are also equivalent
settransformv(wlddev, PCGDP:POP, flag, 1, iso3c, year, apply = FALSE)
fselect(wlddev, PCGDP:POP) %<>% flag(1, wlddev$iso3c, wlddev$year)
# -> we only index once, so this is efficient
# To prove the point:
microbenchmark(A = settransformv(wlddev, PCGDP:POP, flag, 1, iso3c, year),
B = settransformv(wlddev, PCGDP:POP, flag, 1, iso3c, year, apply = FALSE))
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> A 485.276 578.5715 1354.6314 610.572 653.7245 31025.151 100
#> B 211.232 243.4785 264.8313 262.031 284.4375 320.497 100
Created on 2022-09-08 by the reprex package (v2.0.1)
Note that for a single column there are no benefits to setting apply = FALSE, except if you have a function that only operates on data frames. For the single column case you also don't need to use the (v) version of the function, but can simply do
settransform(wlddev, PCGDP_lag = flag(PCGDP, 1, iso3c, year))
I would like to build a structure which, for each record, stores a string, an index and a numeric value. I would like to be able to access the numeric value by querying the data structure with either the index or the string. Also, the data structure is small (on the order of 30 records) but it must be accessed and modified many times (possibly even a million times). Normally I would just use a data frame, but given the efficiency requirements, do you think there would be a better (faster) way? Judging by the syntax, I have the impression that my_struct needs to be accessed two times for each operation (read or write): maybe it's not a big deal, but I wonder if expert R coders, when efficiency is a constraint, would use this code or something different.
# define data structure
my_struct <- data.frame(index = c(3:14, 24),
                        variable = c("Pin", "Pout", "Tout", "D", "L", "mu", "R",
                                     "K", "c", "omega", "alpha", "beta", "gamma"),
                        value = runif(13),
                        stringsAsFactors = FALSE)
# examples of read/write statements
my_struct$value[my_struct$variable == "Pin"]
my_struct$value[my_struct$index %in% c(3:14)]
my_struct$value[my_struct$index %in% c(3,5)] <- rnorm(2)
The data.table package supports indices and has nice syntax for read and write:
library(data.table)
dat <- data.table(index = c(3:14, 24),
                  variable = c("Pin", "Pout", "Tout", "D", "L", "mu", "R",
                               "K", "c", "omega", "alpha", "beta", "gamma"),
                  value = runif(13))
setindex(dat, index)
setindex(dat, variable)
# read
dat[ index %in% 3:4, value ]
# write
dat[ index %in% 3:4, value := 2:3 ]
To see how the index works, add verbose = TRUE, like dat[ index %in% 3:4, value := 2:3, verbose = TRUE ] and read the vignettes. (Indices are covered in the fourth one.)
Benchmark for OP's example
library(microbenchmark)
datDF = data.frame(dat)
n_idx = 2L
idxcol = "variable"
idx = sample(dat[[idxcol]], n_idx)
v = rnorm(length(idx))
e = substitute(idxcol %in% idx, list(idxcol = as.name(idxcol)))
microbenchmark(
DT = dat[eval(e), value := v ],
DF = datDF$value[ datDF[[idxcol]] %in% idx ] <- v
)
# Unit: microseconds
# expr min lq mean median uq max neval
# DT 449.694 473.136 487.17583 481.042 487.0065 1049.193 100
# DF 27.742 30.239 44.21525 36.065 38.4225 854.723 100
So it's actually slower. I'd still go with it for the (in my opinion) nicer syntax. Note that dplyr has no syntax for updating a subset of rows.
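For comparison, a sketch of the usual dplyr workaround, which rewrites the whole column rather than updating a subset in place (reusing the objects from the benchmark above, with idxcol = "variable"):
library(dplyr)
datDF %>% mutate(value = replace(value, variable %in% idx, v)) # returns a modified copy; no by-reference update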
With a large table, you'd see the benchmark reversed:
dat = data.table(variable = do.call(paste0, CJ(LETTERS, LETTERS, LETTERS, LETTERS)))
dat[, index := .I ]
dat[, value := rnorm(.N) ]
setindex(dat, index)
setindex(dat, variable)
datDF = data.frame(dat)
n_idx = 2L
idxcol = "variable"
idx = sample(dat[[idxcol]], n_idx)
v = rnorm(length(idx))
e = substitute(idxcol %in% idx, list(idxcol = as.name(idxcol)))
microbenchmark(
DT = dat[eval(e), value := v ],
DF = datDF$value[ datDF[[idxcol]] %in% idx ] <- v
)
# Unit: microseconds
# expr min lq mean median uq max neval
# DT 471.887 492.5545 701.7914 757.766 817.827 1647.582 100
# DF 17387.134 17729.3280 23750.6721 22629.490 25912.309 83057.928 100
Note: The DF way can also be written datDF$value[ match(idx, datDF[[idxcol]]) ] <- v, but I'm seeing about the same timing.
How can I convert a for loop that I've written into a function in R? I have no experience with writing my own functions in R. I looked here and here, but these did not seem to offer much help. I am aware that for loops are not necessary, and overall I'm trying to do something similar to this blog post.
The for loop with reproducible data is here:
library(caTools) # provides combs()

P <- c(1:50)
y <- length(P)
D <- as.data.frame(combs(P, 2))
Z <- choose(y, 2)
Num = NULL
Denom = NULL
Diff = NULL
for(n in 1:Z)
{
  Num[n] = abs(D$V1[n] - D$V2[n])
  Denom[n] = max(D$V1[n], D$V2[n])
  Diff[n] = Num[n] / Denom[n]
}
PV=mean(Diff)
PV
But, I'm interested in calculating PV based on levels such as in this data:
DATA <- c(1:500)
NAME <- c("a", "b", "c", "d", "e")
mydf <- as.data.frame(cbind(DATA, NAME))
Therefore, my final code I would like to use would be:
ANSWER <- tapply(mydf$DATA, mydf$NAME, MY.FUNCTION)
So, if I could turn the above for loop into a working function I could run the tapply function to get PV based on levels.
Any help would be appreciated, as would any suggestions other than the approach I offer.
Thanks!
Once you have your library loaded:
library(caTools)
Here's a function you can run on your data:
mymeandiff <- function(values){
  df <- as.data.frame(combs(values, 2))
  diff <- abs(df$V1 - df$V2) / pmax(df$V1, df$V2)
  mean(diff)
}
mymeandiff(1:50)
Then we can use dplyr to run on each group (after correcting the data):
mydf$DATA <- as.numeric(as.character(mydf$DATA))
library(dplyr)
mydf %>% group_by(NAME) %>%
summarise(mymeandiff(DATA))
For apply, rather than dplyr:
tapply(mydf$DATA, mydf$NAME, FUN = mymeandiff)
Let's time it:
microbenchmark::microbenchmark(tapply = tapply(mydf$DATA, mydf$NAME, FUN=mymeandiff),
dplyr = mydf %>% group_by(NAME) %>%
summarise(mymeandiff(DATA)))
Unit: milliseconds
expr min lq mean median uq max neval
tapply 60.36543 61.08658 63.81995 62.61182 66.13671 80.37819 100
dplyr 61.84766 62.53751 67.33161 63.61270 67.58688 287.78364 100
tapply is slightly faster.
I have a problem similar to the one explained in this question. As in that question, I have a data frame with 3 columns (id, group, value), and I want to take n samples with replacement from each group and produce a smaller data frame with n samples from each group.
However, I am doing hundreds of subsamples in a simulation, and the ddply-based solution is too slow to be used in my code. I tried to write a simpler version to see if I could get better performance, but it is still slow (no better than the ddply solution, if not worse). Below is my code. I am wondering if it can be improved for performance.
#Producing example DataFrame
dfsize <- 10
groupsize <- 7
test.frame.1 <- data.frame(id = 1:dfsize,
                           group = rep(1:groupsize, each = ceiling(dfsize/groupsize))[1:dfsize],
                           junkdata = sample(1:10000, size = dfsize))
#Main function for subsampling
sample.from.group <- function(df, dfgroup, size, replace){
  outputsize <- 1
  newdf <- df # assuming a sample cannot be larger than the original
  uniquegroups <- unique(dfgroup)
  for (uniquegroup in uniquegroups){
    dataforgroup <- which(dfgroup == uniquegroup)
    mysubsample <- df[sample(dataforgroup, size, replace),]
    sizeofsample <- nrow(mysubsample)
    newdf[outputsize:(outputsize + sizeofsample - 1), ] <- mysubsample
    outputsize <- outputsize + sizeofsample
  }
  return(newdf[1:(outputsize - 1),])
}
#Using the function
sample.from.group(test.frame.1, test.frame.1$group, 100, replace = TRUE)
Here are two plyr-based solutions:
library(plyr)
dfsize <- 1e4
groupsize <- 7
testdf <- data.frame(
  id = seq_len(dfsize),
  group = rep(1:groupsize, length = dfsize),
  junkdata = sample(1:10000, size = dfsize))
sample_by_group_1 <- function(df, dfgroup, size, replace) {
  ddply(df, dfgroup, function(x) {
    x[sample(nrow(x), size = size, replace = replace), , drop = FALSE]
  })
}
sample_by_group_2 <- function(df, dfgroup, size, replace) {
  idx <- split_indices(df[[dfgroup]])
  subs <- lapply(idx, sample, size = size, replace = replace)
  df[unlist(subs, use.names = FALSE), , drop = FALSE]
}
library(microbenchmark)
microbenchmark(
ddply = sample_by_group_1(testdf, "group", 100, replace = TRUE),
plyr = sample_by_group_2(testdf, "group", 100, replace = TRUE)
)
# Unit: microseconds
# expr min lq median uq max neval
# ddply 4488 4723 5059 5360 36606 100
# plyr 443 487 507 536 31343 100
The second approach is much faster because it does the subsetting in a single step: if you can figure out how to do the subsetting in one step, it's usually an easy way to get better performance.
I think this is cleaner and possibly faster:
z <- sapply(unique(test.frame.1$group), FUN = function(x){
  sample(which(test.frame.1$group == x), 100, TRUE)
})
out <- test.frame.1[z,]
out
What is the best (fastest) way to implement a sliding window function with the data.table package?
I'm trying to calculate a rolling median but have multiple rows per date (due to 2 additional factors), which I think means that the zoo rollapply function wouldn't work. Here is an example using a naive for loop:
library(data.table)
df <- data.frame(
  id=30000,
  date=rep(as.IDate(as.IDate("2012-01-01")+0:29, origin="1970-01-01"), each=1000),
  factor1=rep(1:5, each=200),
  factor2=1:5,
  value=rnorm(30, 100, 10)
)
dt = data.table(df)
setkeyv(dt, c("date", "factor1", "factor2"))
get_window <- function(date, factor1, factor2) {
  criteria <- data.table(
    date=as.IDate((date - 7):(date - 1), origin="1970-01-01"),
    factor1=as.integer(factor1),
    factor2=as.integer(factor2)
  )
  return(dt[criteria][, value])
}
output <- data.table(unique(dt[, list(date, factor1, factor2)]))[, window_median:=as.numeric(NA)]
for(i in nrow(output):1) {
  print(i)
  output[i, window_median:=median(get_window(date, factor1, factor2))]
}
data.table doesn't have any special features for rolling windows, currently. There is further detail in my answer to another similar question:
Is there a fast way to run a rolling regression inside data.table?
Rolling median is interesting. It would need a specialized function to do efficiently (same link as in an earlier comment):
Rolling median algorithm in C
The data.table solutions in the question and answers here are all very inefficient, relative to a proper specialized rolling-median function (which isn't available for R, afaik).
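As an aside, base R does ship a specialized running median, stats::runmed, although it operates on a single ordered vector with a centred window. A minimal sketch of applying it per group, assuming a hypothetical table dt1 with one ordered row per date in each (factor1, factor2) group (which is not the case in the example above, where each date/factor combination has many rows):
library(data.table)
setorder(dt1, factor1, factor2, date) # dt1 is hypothetical: one row per date per group
dt1[, roll_med := runmed(value, k = 7, endrule = "keep"), by = .(factor1, factor2)] # k must be odd and no larger than the group size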
I managed to get the example down to 1.4s by creating a lagged dataset and doing a huge join.
df <- data.frame(
  id=30000,
  date=rep(as.IDate(as.IDate("2012-01-01")+0:29, origin="1970-01-01"), each=1000),
  factor1=rep(1:5, each=200),
  factor2=1:5,
  value=rnorm(30, 100, 10)
)
dt <- data.table(df)
setkeyv(dt, c("date", "factor1", "factor2"))
unique_set <- data.table(unique(dt[, list(original_date=date, factor1, factor2)]))
output2 <- data.table()
for(i in 1:7) {
  output2 <- rbind(output2, unique_set[, date:=original_date-i])
}
setkeyv(output2, c("date", "factor1", "factor2"))
output2 <- output2[dt]
output2 <- output2[, median(value), by=c("original_date", "factor1", "factor2")]
That works pretty well on this test dataset, but on my real one it fails with 8 GB of RAM. I'm going to try moving up to one of the High-Memory EC2 instances (with 17, 34 or 68 GB of RAM) to get it working. Any ideas on how to do this in a less memory-intensive way would be appreciated.
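One possibility, sketched only and assuming a data.table version with non-equi joins (1.9.8+): aggregate inside a non-equi self-join with by = .EACHI, which avoids materializing the 7-fold lagged copy before taking the median. win, win_lo, win_hi and out are illustrative names; dt is the keyed table from above.
win <- unique(dt[, .(date, factor1, factor2)])
win[, `:=`(win_lo = as.IDate(date - 7L), win_hi = as.IDate(date - 1L))]
out <- dt[win, on = .(factor1, factor2, date >= win_lo, date <= win_hi),
          .(window_median = median(value)), by = .EACHI]
# out has one row per row of win, with the median taken over the matched dt rows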
This solution works but it takes a while.
df <- data.frame(
  id=30000,
  date=rep(seq.Date(from=as.Date("2012-01-01"), to=as.Date("2012-01-30"), by="d"), each=1000),
  factor1=rep(1:5, each=200),
  factor2=1:5,
  value=rnorm(30, 100, 10)
)
myFun <- function(dff, df){
  # dff is one row of df, coerced to a character vector by apply();
  # dff[2] is the date, dff[3] and dff[4] are factor1 and factor2
  median(df$value[df$date > as.Date(dff[2]) - 8 & df$date < as.Date(dff[2]) - 1 &
                    df$factor1 == dff[3] & df$factor2 == dff[4]])
}
week_Med <- apply(df, 1, myFun, df=df)
week_Med_df <- cbind(df, week_Med)
I address this in a related thread: https://stackoverflow.com/a/62399700/7115566
I suggest looking into the frollapply function. For instance, see below:
library(data.table)
set.seed(17)
dt <- data.table(i = 1:100,
                 x = sample(1:10, 100, replace = TRUE),
                 y = sample(1:10, 100, replace = TRUE))
dt$index <- dt$x == dt$y
dt[, MA := frollapply(index, 10, mean)]
head(dt,12)