I am working on a large dataframe in R of 2.3 million records that contains transactions of users at locations, with start and stop times. My goal is to create a new dataframe that contains the amount of time connected per user, per location. Let's call this dataframe hourly connected.
Transactions can range from 8 minutes to 48 hours, so the target dataframe will hold around 100 million records and will grow each month.
The code below shows how the final dataframe is built, although the full code is much more complex. Running the full code takes ~9 hours on an Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz with 16 cores and 128 GB RAM.
library(dplyr)

numsessions <- 1000000
startdate <- as.POSIXlt(runif(numsessions, 1, 365 * 60 * 60) * 24, origin = "2015-1-1")

df.Sessions <- data.frame(userID     = round(runif(numsessions, 1, 500)),
                          postalcode = round(runif(numsessions, 1, 100)),
                          daynr      = format(startdate, "%w"),
                          start      = startdate,
                          end        = startdate + runif(1, 1, 60 * 60 * 10))

dfhourly.connected <- df.Sessions %>%
  rowwise %>%
  do(data.frame(userID     = .$userID,
                hourlydate = as.Date(seq(.$start, .$end, by = 60 * 60)),
                hournr     = format(seq(.$start, .$end, by = 60 * 60), "%H")))
We want to parallelize this procedure over (some of) the 16 cores to speed things up. A first attempt was to use the multidplyr package, with the partition made based on daynr:
df.hourlyconnected <- df.Sessions %>%
  partition(daynr, cluster = init_cluster(6)) %>%
  rowwise %>%
  do(data.frame(userID     = .$userID,
                hourlydate = as.Date(seq(.$start, .$end, by = 60 * 60)),
                hournr     = format(seq(.$start, .$end, by = 60 * 60), "%H"))) %>%
  collect()
Now, the rowwise function appears to require a dataframe as input instead of a partition.
My questions are
Is there a workaround to perform a rowwise calculation on partitions per core?
Has anyone got a suggestion to perform this calculation with a different R package and methods?
(I think posting this as an answer could benefit future readers who are interested in efficient coding.)
R is a vectorized language, so by-row operations are among the most costly operations, especially if you are evaluating lots of functions, dispatching methods, converting classes, and creating new data sets while you are at it.
Hence, the first step is to reduce the "by" operations. Looking at your code, it seems that you are enlarging your data set according to userID, start and end; all the rest of the operations could come afterwards (and hence be vectorized). Also, running seq (which isn't a very efficient function by itself) twice per row adds nothing. Lastly, calling seq.POSIXt explicitly on a POSIXt class will save you the overhead of method dispatch.
I'm not sure how to do this efficiently with dplyr, because mutate can't handle it and the do function (IIRC) has always proved itself to be highly inefficient. Hence, let's try the data.table package, which can handle this task easily:
library(data.table)
res <- setDT(df.Sessions)[, seq.POSIXt(start, end, by = 3600), by = .(userID, start, end)]
Again, please note that I minimized the "by row" operations to a single function call while avoiding method dispatch.
Now that we have the data set ready, we don't need any by-row operations any more; everything can be vectorized from now on.
Vectorizing isn't the end of the story, though. We also need to take class conversions, method dispatch, etc. into consideration. For instance, we can create both hourlydate and hournr using different Date class functions, or using format, or maybe even substr. The trade-off to take into account is that, for instance, substr will be the fastest, but its result will be a character vector rather than a Date one; it's up to you to decide whether you prefer the speed or the quality of the end product. Sometimes you can win both, but first you should check your options. Let's benchmark 3 different vectorized ways of calculating the hournr variable:
library(microbenchmark)
set.seed(123)
N <- 1e5
test <- as.POSIXlt(runif(N, 1, 1e5), origin = "1900-01-01")
microbenchmark("format" = format(test, "%H"),
"substr" = substr(test, 12L, 13L),
"data.table::hour" = hour(test))
# Unit: microseconds
# expr min lq mean median uq max neval cld
# format 273874.784 274587.880 282486.6262 275301.78 286573.71 384505.88 100 b
# substr 486545.261 503713.314 529191.1582 514249.91 528172.32 667254.27 100 c
# data.table::hour 5.121 7.681 23.9746 27.84 33.44 55.36 100 a
data.table::hour is the clear winner in both speed and quality (results are in an integer vector rather than a character one), improving the speed of your previous solution by a factor of ~12,000 (and I haven't even tested it against your by-row implementation).
Now let's try 3 different vectorized ways of calculating the hourlydate variable:
microbenchmark("as.Date" = as.Date(test),
"substr" = substr(test, 1L, 10L),
"data.table::as.IDate" = as.IDate(test))
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# as.Date 19.56285 20.09563 23.77035 20.63049 21.16888 50.04565 100 a
# substr 492.61257 508.98049 525.09147 515.58955 525.20586 663.96895 100 b
# data.table::as.IDate 19.91964 20.44250 27.50989 21.34551 31.79939 145.65133 100 a
Seems like the first and third options are pretty much the same speed-wise; I prefer as.IDate because of its integer storage mode.
Now that we know where both efficiency and quality lie, we can simply finish the task by running:
res[, `:=`(hourlydate = as.IDate(V1), hournr = hour(V1))]
(You can then easily remove any unnecessary columns with similar syntax, res[, yourcolname := NULL], which I'll leave to you.)
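Putting the pieces together, a compact end-to-end sketch of the above (assuming df.Sessions as generated in the question; adjust the column names for your real data) could look like:
library(data.table)
res <- setDT(df.Sessions)[, seq.POSIXt(start, end, by = 3600), by = .(userID, start, end)]
res[, `:=`(hourlydate = as.IDate(V1), hournr = hour(V1))]
res[, V1 := NULL]   # e.g. drop the intermediate hourly timestamp once the derived columns exist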
There are probably more efficient ways of solving this, but this demonstrates one possible way to make your code more efficient.
As a side note, if you want to investigate data.table syntax/features further, here's a good read:
https://github.com/Rdatatable/data.table/wiki/Getting-started
Related
I'm hoping someone more knowledgeable than myself can help optimize this code. I've tried a number of methods, including foreach with doParallel (and snow) and compiler, but I think there may be simpler ways to improve the code, such as changing dataframes to data.tables/matrices, and perhaps pre-allocating empty objects instead of concatenating vectors in a loop.
Most of the variables listed below must be allowed to change in length depending on previous steps in the pipeline. Dimensions listed are taken from 1 example to show relative magnitude.
s.ids = a factor with length 66510. Haven't noticed a difference in speed when changed to a character vector.
g.list = a character vector with length 978.
l_signatures = a 978x66511 matrix.
d_g_up and d_g_down = small dataframes (nx10, n ranging from 5-200) with metadata related to g.list
c_score_new() computes a score. It's complex enough that it's essentially unchangeable in this scenario. It expects e_signature to have 2 columns, 1 made of g.list ("ids"), and the other as corresponding "rank"s generated by: rank(-1 * l_signatures[,as.character(id)], ties.method="random")
d_scores <- c()  # starts empty and grows each iteration (the concatenation mentioned above)
for (id in s.ids) {
  e_signature <- data.frame(g.list,
                            rank(-1 * l_signatures[, as.character(id)],
                                 ties.method = "random"))
  colnames(e_signature) <- c("ids", "rank")
  d_scores <- c(d_scores, c_score_new(d_g_up$Symbol, d_g_down$Symbol, e_signature))
}
In total, this takes 5-10 minutes to compute, with 3-5 minutes attributable to the generation of e_signature, which is not computationally complex. That's where I suspect optimization might be of the most benefit.
If we generated e_signature in a more efficient way, combining the results into one object (978 x 66510) before calling c_score_new(), might it be faster?
I'm having trouble working out the details, and I'm not confident this is the best method anyhow. So before I chase this wild goose, I thought the community might be able to steer me in the right direction.
The largest amount of time is taken by rank. It is possible to reduce computation time by more than 50% by replacing base::rank in a for loop with Rfast::colRanks; please see below:
library(microbenchmark)
library(Rfast)
n <- 978
m <- 40000 #66510
x <- matrix(rnorm(n * m), ncol = m)
microbenchmark(
Initial = {
for (i in 1:ncol(x)) {
base::rank(x[, i], ties.method = "random")
}
},
Optimized = {
colRanks(x, method = "min")
},
times = 1
)
Output:
Unit: seconds
expr min lq mean median uq max neval
Initial 8.092186 8.092186 8.092186 8.092186 8.092186 8.092186 1
Optimized 3.397526 3.397526 3.397526 3.397526 3.397526 3.397526 1
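For context, here is a rough, untested sketch of how colRanks might replace base::rank inside the original loop. It assumes l_signatures, g.list, s.ids, d_g_up, d_g_down and c_score_new() exist as described in the question; note that method = "min" breaks ties differently from ties.method = "random", and you should check that your Rfast version returns the ranks in the same orientation as the input.
library(Rfast)
# Rank every column once, up front (check dim(all_ranks); transpose with t() if your version differs).
all_ranks <- colRanks(-1 * l_signatures, method = "min")
d_scores <- numeric(length(s.ids))   # pre-allocate instead of growing with c()
for (i in seq_along(s.ids)) {
  j <- match(as.character(s.ids[i]), colnames(l_signatures))
  e_signature <- data.frame(ids = g.list, rank = all_ranks[, j])
  d_scores[i] <- c_score_new(d_g_up$Symbol, d_g_down$Symbol, e_signature)
}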
I got curious about the speed of string comparison in R, specifically when it's right to use != vs ==, and was wondering how much short-circuiting they do.
Suppose I have a vector with two levels, one of which occurs frequently and the other rarely (to amplify the effect I'm looking for):
x <- sample(c('ALICE', 'HAL90000000000'), replace = TRUE, 1000, prob = c(0.05,0.95))
I would assume (if there is short-circuiting) that the operation
x != 'ALICE'
would be considerably faster than:
x == 'HAL90000000000'
since to check equality in the latter case, I would assume every character needs to be compared, while the former would be invalidated by the first (or last) character alone (depending on which end the algorithm checks from).
But when I benchmark, either that does not seem to be the case (it was inconclusive in repeated trials, though with a very slight bias toward the == operation being faster?!), or this isn't a fair trial:
> microbenchmark(x != 'ALICE', x == 'HAL90000000000')
Unit: microseconds
expr min lq mean median uq max neval
x != "ALICE" 4.520 4.5505 4.61831 4.5775 4.6525 4.970 100
x == "HAL90000000000" 3.766 3.8015 4.00386 3.8425 3.9200 13.766 100
Why is this?
EDIT:
I'm assuming it's because R is doing full string matching, but if so, is there a way to get R to optimize these comparisons? I gain nothing from obscuring how long it takes to match long or short strings; there are no password-timing concerns here.
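One way to probe the hypothesis further (the vectors and names below are made up for illustration, not part of the original benchmark) is to compare strings of very different lengths on otherwise identical data:
library(microbenchmark)
set.seed(1)
short_target <- "A"
long_target  <- paste(rep("A", 1000), collapse = "")   # a 1000-character string
x_short <- sample(c(short_target, "B"), 1e5, replace = TRUE)
x_long  <- sample(c(long_target,  "B"), 1e5, replace = TRUE)
microbenchmark(x_short == short_target, x_long == long_target)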
I have a large data.table which I need to subset, sum, and group the same way at several points in my code. Therefore, I store the result to save time. The operation still takes rather long, and I would like to know how to speed it up.
inco <- inventory[period > p, sum(incoming), by = articleID][,V1]
The keys of inventory are period and articleID. The size varies depending on the parameters but is always greater than 3 GB. It has about 62,670,000 rows of 7 variables.
My thoughts so far, point by point:
1. Subset: period > p
This might be faster with a keyed (binary search) subset instead of a vector scan, but I would need to generate the sequence from p to max(period) for that, taking additional time. Plus, the data.table is already sorted by period. So I suppose the gain in speed would not be high.
2. Aggregate: sum(incoming)
No idea how to improve this.
3. Group: by = articleID
This grouping might be faster with another key setting of the table, but this would have a bad impact on my other code.
4. Access: [, V1]
This could be skipped and done during later operations, but I doubt there would be a speed gain.
Do you have ideas for detailed profiling or improving this operation?
Minimal reproducible example
(decrease n to make it run on your machine, if necessary):
library(data.table)
p <- 100
n <- 10000
inventory <- CJ(period = seq(1, n, 1),
                weight = c(0.1, 1),
                volume = c(1, 10),
                price = c(1, 1000),
                E_demand = c(1000),
                VK = seq(from = 0.2, to = 0.8, by = 0.2),
                s = c(seq(1, 99, 1), seq(from = 100, to = 1000, by = 20)))
inventory[, articleID := paste0("W", weight, "V", volume, "P", price,
                                "E", round(E_demand, 2), "VK", round(VK, 3), "s", s)]
inventory[, incoming := rgamma(dim(inventory)[1], shape = 0.3, rate = 1)]
setkey(inventory, period, articleID)
inco <- inventory[period > p, sum(incoming), by = articleID][, V1]
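As a starting point for profiling (rather than a guaranteed speed-up), data.table's verbose output shows which grouping strategy is used (e.g. whether GForce handles the sum) and where the time goes. A minimal sketch, using the same query as above:
# Same query, with verbose = TRUE to print the grouping strategy and internal timings.
inco <- inventory[period > p, sum(incoming), by = articleID, verbose = TRUE][, V1]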
I have a program in R that I have run for about a day now, and it has only reached about 10 percent completion. The main source of slowness comes from having to make thousands of sqldf(SELECT ...) calls against a data set of length ~1 million using the R package sqldf. My select statements currently take the following form:
sqldf("SELECT V1, V2 FROM mytable WHERE cast(start as real) <= sometime AND cast(realized as real) > sometime")
Here sometime is just an integer representing a Unix timestamp, and start and realized are columns of mytable that are also filled with Unix timestamp entries. What I additionally know, however, is that |realized - start| < 172800 always, which is quite a small window given that the dataset spans over a year. My thought is that I should be able to exploit this fact to tell R to only check the dataframe within sometime ± 172800 in each of these calls.
Is the package sqldf inappropriate to use here? Should I be using traditional [,] indexing of the data.frame? Is there an easy way to incorporate this fact to speed up the program? My gut feeling is to break up the data frame, sort the vectors, and then build custom functions that traverse and select the appropriate entries themselves, but I'm looking for some confirmation that this is the best way.
First, the slow part is probably cast(...), so rather than doing that twice for each record in each query, why not leave start and realized as timestamps and change the query to accommodate that?
Second, the data.table option is still about 100 times faster (but see the bit at the end about indexing with sqldf).
library(sqldf)
library(data.table)
N <- 1e6
# sqldf option
set.seed(1)
df <- data.frame(start=as.character(as.POSIXct("2000-01-01")+sample(0:1e6,N,replace=T)),
realized=as.character(as.POSIXct("2000-01-01")+sample(0:1e6,N,replace=T)),
V1=rnorm(N), V2=rpois(N,4))
sometime <- "2000-01-05 00:00:00"
query <- "SELECT V1, V2 FROM df WHERE start <= datetime('%s') and realized > datetime('%s')"
query <- sprintf(query,sometime,sometime)
system.time(result.sqldf <- sqldf(query))
# user system elapsed
# 12.17 0.03 12.23
# data.table option
set.seed(1)
DT <- data.table(start=as.POSIXct("2000-01-01")+sample(0:1e6,N,replace=T),
realized=as.POSIXct("2000-01-01")+sample(0:1e6,N,replace=T),
V1=rnorm(N), V2=rpois(N,4))
setkey(DT,start,realized)
system.time(result.dt <- DT[start<=as.POSIXct(sometime) & realized > as.POSIXct(sometime),list(V1,V2)])
# user system elapsed
# 0.15 0.00 0.15
Note that the two result-sets will be sorted differently.
EDIT: Based on comments below from @G.Grothendieck (author of the sqldf package).
This is turning into a really good comparison of the packages...
# code from G. Grothendieck comment
sqldf() # opens connection
sqldf("create index ix on df(start, realized)")
query <- fn$identity("SELECT V1, V2 FROM main.df WHERE start <= '$sometime' and realized > '$sometime'")
system.time(result.sqldf <- sqldf(query))
sqldf() # closes connection
# user system elapsed
# 1.28 0.00 1.28
So creating an index speeds sqldf up by about a factor of 10 in this case. Index creation is slow, but you only have to do it once. "Key" creation in data.table (which physically sorts the table) is extremely fast, but does not improve performance all that much in this case (only about a factor of 2).
Benchmarking using system.time() is a bit risky (1 data point), so it's better to use microbenchmark(...). Note that for this to work, we have to run the code above and leave the connection open (i.e., remove the last call to sqldf()).
f.dt <- function() result.dt <- DT[start<=as.POSIXct(sometime) & realized > as.POSIXct(sometime),list(V1,V2)]
f.sqldf <- function() result.sqldf <- sqldf(query)
library(microbenchmark)
microbenchmark(f.dt(),f.sqldf())
# Unit: milliseconds
# expr min lq median uq max neval
# f.dt() 110.9715 184.0889 200.0634 265.648 833.4041 100
# f.sqldf() 916.8246 1232.6155 1271.6862 1318.049 1951.5074 100
So we can see that, in this case, data.table using keys is about 6 times faster than sqldf using indexes. The actual times will depend on the size of the result-set, so you might want to compare the two options.
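As a further, untested sketch building on the asker's own observation that |realized - start| < 172800 seconds: any row satisfying start <= sometime and realized > sometime must also have start > sometime - 172800, so the scan can be restricted to that 48-hour window. This assumes DT and sometime as defined above.
st <- as.POSIXct(sometime)
# Only rows whose start falls inside the 48-hour window before 'sometime' can possibly qualify.
result.window <- DT[start > st - 172800 & start <= st & realized > st, list(V1, V2)]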
I am using R and attempting to recover frequencies (really, just a number close to the actual frequency) from a large number of sound waves (1000s of audio files) by applying Fast Fourier transforms to each of them and identifying the frequency with the highest magnitude for each file. I'd like to be able to recover these peak frequencies as quickly as possible. The FFT method is one method that I've learned about recently and I think it should work for this task, but I am open to answers that do not rely on FFTs. I have tried a few ways of applying the FFT and getting the frequency of highest magnitude, and I have seen significant performance gains since my first method, but I'd like to speed up the execution time much more if possible.
Here is sample data:
s.rate<-44100 # sampling frequency
t <- 2 # seconds, for my situation, I've got 1000s of 1 - 5 minute files to go through
ind <- seq(s.rate*t)/s.rate # time indices for each step
# let's add two sin waves together to make the sound wave
f1 <- 600 # Hz: freq of sound wave 1
y <- 100*sin(2*pi*f1*ind) # sine wave 1
f2 <- 1500 # Hz: freq of sound wave 2
z <- 500*sin(2*pi*f2*ind+1) # sine wave 2
s <- y+z # the sound wave: my data isn't this nice, but I think this is an OK example
The first method I tried was using the fpeaks and spec functions from the seewave package, and it seems to work. However, it is prohibitively slow.
library(seewave)
fpeaks(spec(s, f=s.rate), nmax=1, plot=F) * 1000 # *1000 in order to recover freq in Hz
[1] 1494
# pretty close, quite slow
After doing a bit more reading, I tried this next approach, in which I pull the frequency with the maximum amplitude directly from the spec output:
spec(s, f=s.rate, plot=F)[which(spec(s, f=s.rate, plot=F)[,2]==max(spec(s, f=s.rate, plot=F)[,2])),1] * 1000 # again need to *1000 to get Hz
x
1494
# pretty close, definitely faster
After a bit more looking around, I found this approach to work reasonably well.
which(Mod(fft(s)) == max(abs(Mod(fft(s))))) * s.rate / length(s)
[1] 1500
# recovered the exact frequency, and quickly!
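For reference, a slightly tidier variant of the same idea (a sketch, not the code benchmarked below) computes the FFT once, keeps only the first half of the spectrum (the second half mirrors it), and uses which.max; bin 1 is the 0 Hz (DC) component, hence the "- 1".
mag <- Mod(fft(s))[1:(length(s) %/% 2)]       # magnitudes for the first half of the spectrum
(which.max(mag) - 1) * s.rate / length(s)     # convert the winning bin index to Hz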
Here is some performance data:
library(microbenchmark)
microbenchmark(
WHICH.MOD = which(Mod(fft(s))==max(abs(Mod(fft(s))))) * s.rate / length(s),
SPEC.WHICH = spec(s,f=s.rate,plot=F)[which(spec(s,f=s.rate,plot=F)[,2] == max(spec(s,f=s.rate,plot=F)[,2])),1] * 1000, # this is spec from the seewave package
# to recover a number around 1500, you have to multiply by 1000
FPEAKS.SPEC = fpeaks(spec(s,f=s.rate),nmax=1,plot=F)[,1] * 1000, # fpeaks is from the seewave package... again, need to multiply by 1000
times=10)
Unit: milliseconds
expr min lq median uq max neval
WHICH.MOD 10.78 10.81 11.07 11.43 12.33 10
SPEC.WHICH 64.68 65.83 66.66 67.18 78.74 10
FPEAKS.SPEC 100297.52 100648.50 101056.05 101737.56 102927.06 10
Good solutions will be the ones that recover a frequency close (± 10 Hz) to the real frequency the fastest.
More Context
I've got many files (several GBs), each containing a tone that gets modulated several times a second, and sometimes the signal actually disappears altogether so that there is just silence. I want to identify the frequency of the unmodulated tone. I know they should all be somewhere less than 6000 Hz, but I don't know more precisely than that. If (big if) I understand correctly, I've got an OK approach here, it's just a matter of making it faster. Just fyi, I have no previous experience in digital signal processing, so I appreciate any tips and pointers related to the mathematics / methods in addition to advice on how better to approach this programmatically.
After coming to a better understanding of this task and some of the terminology involved, I came across some additional approaches, which I'll present here. These additional approaches allow for window functions and a lot more, whereas the fastest approach in my question does not. I also sped things up a bit by assigning the result of some of the functions to an object and indexing that object instead of running the function again:
#i.e.
(ms<-meanspec(s,f=s.rate,wl=1024,plot=F))[which.max(ms[,2]),1]*1000
# instead of
meanspec(s,f=s.rate,wl=1024,plot=F)[which.max(meanspec(s,f=s.rate,wl=1024,plot=F)[,2]),1]*1000
I have my favorite approach, but I welcome constructive warnings, feedback, and opinions.
microbenchmark(
WHICH.MOD = which((mfft<-Mod(fft(s)))[1:(length(s)/2)] == max(abs(mfft[1:(length(s)/2)]))) * s.rate / length(s),
MEANSPEC = (ms<-meanspec(s,f=s.rate,wl=1024,plot=F))[which.max(ms[,2]),1]*1000,
DFREQ.HIST = (h<-hist(dfreq(s,f=s.rate,wl=1024,plot=F)[,2],200,plot=F))$mids[which.max(h$density)]*1000,
DFREQ.DENS = (dens <- density(dfreq(s,f=s.rate,wl=1024,plot=F)[,2],na.rm=T))$x[which.max(dens$y)]*1000,
FPEAKS.MSPEC = fpeaks(meanspec(s,f=s.rate,wl=1024,plot=F),nmax=1,plot=F)[,1]*1000 ,
times=100)
Unit: milliseconds
expr min lq median uq max neval
WHICH.MOD 8.119499 8.394254 8.513992 8.631377 10.81916 100
MEANSPEC 7.748739 7.985650 8.069466 8.211654 10.03744 100
DFREQ.HIST 9.720990 10.186257 10.299152 10.492016 12.07640 100
DFREQ.DENS 10.086190 10.413116 10.555305 10.721014 12.48137 100
FPEAKS.MSPEC 33.848135 35.441716 36.302971 37.089605 76.45978 100
DFREQ.DENS returns a frequency value farthest from the real value. The other approaches return values close to the real value.
With one of my audio files (i.e. real data) the performance results are a bit different (see below). One potentially relevant difference between the data being used above and the real data used for the performance data below is that above the data is just a vector of numerics and my real data is stored in a Wave object, an S4 object from the tuneR package.
library(Rmpfr) # to avoid an integer overflow problem in `WHICH.MOD`
microbenchmark(
WHICH.MOD = which((mfft<-Mod(fft(d@left)))[1:(length(d@left)/2)] == max(abs(mfft[1:(length(d@left)/2)]))) * mpfr(s.rate,100) / length(d@left),
MEANSPEC = (ms<-meanspec(d,f=s.rate,wl=1024,plot=F))[which.max(ms[,2]),1]*1000,
DFREQ.HIST = (h<-hist(dfreq(d,f=s.rate,wl=1024,plot=F)[,2],200,plot=F))$mids[which.max(h$density)]*1000,
DFREQ.DENS = (dens <- density(dfreq(d,f=s.rate,wl=1024,plot=F)[,2],na.rm=T))$x[which.max(dens$y)]*1000,
FPEAKS.MSPEC = fpeaks(meanspec(d,f=s.rate,wl=1024,plot=F),nmax=1,plot=F)[,1]*1000 ,
times=25)
Unit: seconds
expr min lq median uq max neval
WHICH.MOD 3.249395 3.320995 3.361160 3.421977 3.768885 25
MEANSPEC 1.180119 1.234359 1.263213 1.286397 1.315912 25
DFREQ.HIST 1.468117 1.519957 1.534353 1.563132 1.726012 25
DFREQ.DENS 1.432193 1.489323 1.514968 1.553121 1.713296 25
FPEAKS.MSPEC 1.207205 1.260006 1.277846 1.308961 1.390722 25
WHICH.MOD actually has to run twice to account for the left and right audio channels (i.e. my data is stereo), so it takes longer than the output indicates. Note: I needed to use the Rmpfr library in order for the WHICH.MOD approach to work with my real data, as I was having problems with integer overflow.
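For completeness, here is a rough sketch of that per-channel handling (it assumes d is a stereo tuneR Wave object, e.g. from tuneR::readWave, with @left and @right slots; peak_freq is a made-up helper name, not part of any package):
peak_freq <- function(ch, s.rate) {
  mfft <- Mod(fft(ch))[1:(length(ch) %/% 2)]   # first half of the spectrum only
  (which.max(mfft) - 1) * s.rate / length(ch)  # bin index to Hz
}
c(left = peak_freq(d@left, s.rate), right = peak_freq(d@right, s.rate))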
Interestingly, FPEAKS.MSPEC performed really well with my data, and it seems to return a pretty accurate frequency (based on my visual inspection of a spectrogram). DFREQ.HIST and DFREQ.DENS are quick, but the output frequency isn't as close to what I judge to be the real value, and both are relatively ugly solutions. My favorite solution so far, MEANSPEC, uses meanspec and which.max. I'll mark this as the answer since I haven't had any other answers, but feel free to provide another answer; I'll vote for it and maybe select it as the answer if it provides a better solution.