What is the most efficient way to do linear interpolation when the desired interpolation points are sparse compared to the available data? I have a very long data frame with many columns: one holds a timestamp and the rest hold variables that I want to interpolate at only a very small number of timestamps. For example, consider the two-variable case:
microbenchmark::microbenchmark(approx(1:2, 1:2, 1.5)$y)
# Unit: microseconds
# expr min lq mean median uq max neval
# ... 39.629 41.3395 46.80514 42.195 52.8865 138.558 100
microbenchmark::microbenchmark(approx(seq_len(1e6), seq_len(1e6), 1.5)$y)
# Unit: milliseconds
# expr min lq mean median uq max neval
# ... 129.5733 231.0047 229.3459 236.3845 247.3096 369.4621 100
We see that although only one interpolated value (at t = 1.5) is desired, increasing the number of (x, y) pairs increases the running time by several orders of magnitude.
Another example, this time with a data table.
library(data.table)
tmp_dt <- data.table(time = seq_len(1e7), a = seq_len(1e7), b = seq_len(1e7), c = seq_len(1e7))
Running tmp_dt[, lapply(.SD, function(col) {approx(time, col, 1.5)$y}), .SDcols = c("a", "b", "c")] produces a one-row data.table, but it takes a while.
I am thinking there must be some efficiency to be gained by removing all rows in the data table that are not necessary for interpolation.
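For instance, a minimal sketch of that idea (my own illustration, not benchmarked): keep only the two rows that bracket the target time, then call approx on that tiny subset.
t_target <- 1.5
i <- findInterval(t_target, tmp_dt$time)   # row at or just below the target time
small <- tmp_dt[c(i, i + 1L)]              # keep only the bracketing pair of rows
small[, lapply(.SD, function(col) approx(time, col, t_target)$y),
      .SDcols = c("a", "b", "c")]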
If your linear interpolation is weighted.mean(c(x0, x1), c(t1-t, t-t0)), where (t0, x0) is the nearest point below and (t1, x1) the nearest above...
# fix bad format
tmp_dt[, names(tmp_dt) := lapply(.SD, as.numeric)]
# enumerate target times
tDT = data.table(t = seq(1.5, 100.5, by=.5))
# handle perfect matches
tDT[, a := tmp_dt[.SD, on=.(time = t), x.a]]
# handle interpolation
tDT[is.na(a), a := {
  w = findInterval(t, tmp_dt$time)
  cbind(tmp_dt[w,    .(t0 = time, a0 = a)],
        tmp_dt[w+1L, .(t1 = time, a1 = a)])[, (a0*(t1-t) + a1*(t-t0))/(t1-t0)]
}]
The extension to more columns is a little messy, but can be shoehorned in here.
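For example, here is one way it might be shoehorned in (a sketch of mine rather than the answer's code, using the question's columns a, b, c; data.table::fifelse keeps exact time matches exact, so this variant does not need the separate join step above):
cols <- c("a", "b", "c")
tDT[, (cols) := {
  w  <- findInterval(t, tmp_dt$time)   # lower bracketing row for each target time
  t0 <- tmp_dt$time[w]
  t1 <- tmp_dt$time[w + 1L]
  lapply(cols, function(cl) {
    x0 <- tmp_dt[[cl]][w]
    x1 <- tmp_dt[[cl]][w + 1L]
    fifelse(t == t0, x0, (x0*(t1 - t) + x1*(t - t0))/(t1 - t0))
  })
}]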
Some sort of rolling, like w = tmp_dt[t, on=.(time), roll=TRUE, which=TRUE], might be faster than findInterval, but I haven't looked into it.
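For completeness, a concrete version of that rolling suggestion might look like this (my sketch, untested here); roll = TRUE returns, for each t, the index of the last tmp_dt row with time <= t, i.e. the same lower-bracket index that findInterval() produces.
w <- tmp_dt[tDT, on = .(time = t), roll = TRUE, which = TRUE]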
I have a data.frame with columns "a" and "b". I want to add columns called "high" and "low" that contain the highest and the lowest among columns a and b.
Is there a way of doing this without looping over the lines in the dataframe?
edit: this is for OHLC data, and so the high and low column should contain the highest and lowest element between a and b on the same line, and not among the whole columns. sorry if this is badly worded.
Sounds like you're looking for pmax and pmin ("parallel" max/min):
Extremes package:base R Documentation
Maxima and Minima
Description:
Returns the (parallel) maxima and minima of the input values.
Usage:
max(..., na.rm = FALSE)
min(..., na.rm = FALSE)
pmax(..., na.rm = FALSE)
pmin(..., na.rm = FALSE)
pmax.int(..., na.rm = FALSE)
pmin.int(..., na.rm = FALSE)
Arguments:
...: numeric or character arguments (see Note).
na.rm: a logical indicating whether missing values should be
removed.
Details:
‘pmax’ and ‘pmin’ take one or more vectors (or matrices) as
arguments and return a single vector giving the ‘parallel’ maxima
(or minima) of the vectors. The first element of the result is
the maximum (minimum) of the first elements of all the arguments,
the second element of the result is the maximum (minimum) of the
second elements of all the arguments and so on. Shorter inputs
are recycled if necessary. ‘attributes’ (such as ‘names’ or
‘dim’) are transferred from the first argument (if applicable).
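For instance, applied to the question's columns (assuming the data.frame is called df; the question doesn't name it):
df$high <- pmax(df$a, df$b)   # row-wise maximum of a and b
df$low  <- pmin(df$a, df$b)   # row-wise minimum of a and b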
Here's a version I implemented using Rcpp. I compared pmin with my version, and my version is roughly 3 times faster.
library(Rcpp)
cppFunction("
NumericVector min_vec(NumericVector vec1, NumericVector vec2) {
  int n = vec1.size();
  if(n != vec2.size()) return 0;
  else {
    NumericVector out(n);
    for(int i = 0; i < n; i++) {
      out[i] = std::min(vec1[i], vec2[i]);
    }
    return out;
  }
}
")
x1 <- rnorm(100000)
y1 <- rnorm(100000)
microbenchmark::microbenchmark(min_vec(x1, y1))
microbenchmark::microbenchmark(pmin(x1, y1))
x2 <- rnorm(500000)
y2 <- rnorm(500000)
microbenchmark::microbenchmark(min_vec(x2, y2))
microbenchmark::microbenchmark(pmin(x2, y2))
The microbenchmark function output for 100,000 elements is:
> microbenchmark::microbenchmark(min_vec(x1, y1))
Unit: microseconds
            expr     min       lq     mean  median       uq     max neval
 min_vec(x1, y1) 215.731 222.3705 230.7018 224.484 228.1115 284.631   100
> microbenchmark::microbenchmark(pmin(x1, y1))
Unit: microseconds
         expr     min       lq     mean  median      uq      max neval
 pmin(x1, y1) 891.486 904.7365 943.5884 922.899 954.873 1098.259   100
And for 500,000 elements:
> microbenchmark::microbenchmark(min_vec(x2, y2))
Unit: milliseconds
            expr      min       lq     mean   median       uq     max neval
 min_vec(x2, y2) 1.493136 2.008122 2.109541 2.140318 2.300022 2.97674   100
> microbenchmark::microbenchmark(pmin(x2, y2))
Unit: milliseconds
         expr      min       lq     mean   median       uq      max neval
 pmin(x2, y2) 4.652925 5.146819 5.286951 5.264451 5.445638 6.639985   100
So you can see the Rcpp version is faster.
You could make it better by adding some error checking in the function, for instance: check that both vectors are the same length, or that they are comparable (not character vs. numeric, or boolean vs. numeric).
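For example, a hedged sketch of a length check using Rcpp's stop() (my addition; the name min_vec_checked is made up):
library(Rcpp)
cppFunction("
NumericVector min_vec_checked(NumericVector vec1, NumericVector vec2) {
  // Signal an R error instead of silently returning an empty vector.
  if (vec1.size() != vec2.size()) stop(\"vectors must have the same length\");
  NumericVector out(vec1.size());
  for (int i = 0; i < vec1.size(); i++) {
    out[i] = std::min(vec1[i], vec2[i]);
  }
  return out;
}
")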
If your data.frame is named dat:
dat$pmin <- do.call(pmin,dat[c("a","b")])
dat$pmax <- do.call(pmax,dat[c("a","b")])
Another possible solution:
set.seed(21)
Data <- data.frame(a=runif(10),b=runif(10))
Data$low <- apply(Data[,c("a","b")], 1, min)
Data$high <- apply(Data[,c("a","b")], 1, max)
I have this data.frame:
set.seed(1)
df <- cbind(matrix(rnorm(26,100),26,100),data.frame(id=LETTERS,parent.id=sample(letters[1:5],26,replace = T),stringsAsFactors = F))
Each row is 100 measurements from a certain subject (designated by id), which is associated with a parent ID (designated by parent.id). The relationship between parent.id and id is one-to-many.
I'm looking for a fast way to get the fraction of each df$id (for each of its 100 measurements) out the measurements of its parent.id. Meaning that for each id in df$id I want to divide each of its 100 measurements by the sum of its measurements across all df$id's which correspond to its df$parent.id.
What I'm trying is:
sum.df <- dplyr::select(df,-id) %>% dplyr::group_by(parent.id) %>% dplyr::summarise_all(sum)
fraction.df <- do.call(rbind, lapply(df$id, function(i){
  pid <- dplyr::filter(df, id == i)$parent.id
  (dplyr::filter(df, id == i) %>% dplyr::select(-id, -parent.id)) /
    (dplyr::filter(sum.df, parent.id == pid) %>% dplyr::select(-parent.id))
}))
But for the real dimensions of my data: length(df$id) = 10,000 with 1,024 measurements, this is not fast enough.
Any idea how to improve this, ideally using dplyr functions?
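For reference, a sketch of the grouped-divide idea in dplyr (my own illustration, assuming dplyr >= 1.0 for across(); it is not one of the benchmarked methods below):
library(dplyr)
# Divide every numeric measurement column by its column sum within each parent.id group.
fraction.df2 <- df %>%
  group_by(parent.id) %>%
  mutate(across(where(is.numeric), ~ .x / sum(.x))) %>%
  ungroup()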
Let's compare these options with microbenchmark, all using the new dataset definition from @Sathish's answer:
OP method:
Units: seconds
min lq mean median uq max neval
1.423583 1.48449 1.602001 1.581978 1.670041 2.275105 100
@Sathish's method speeds it up by a factor of about 5. This is valuable, to be sure:
Units: milliseconds
min lq mean median uq max neval
299.3581 334.787 388.5283 363.0363 398.6714 951.4654 100
One possible base R implementation below, using principles of efficient R code, improves things by a factor of about 65 (24 milliseconds, vs 1,582 milliseconds):
Units: milliseconds
min lq mean median uq max neval
21.49046 22.59205 24.97197 23.81264 26.36277 34.72929 100
Here's the base R implementation. As is the case for the OP's implementation, the parent.id and id columns are not included in the resulting structure (here fractions). fractions is a matrix with rows ordered according to sort(interaction(df$id, df$parent.id, drop = TRUE)).
values <- df[1:100]
parents <- split(values, df$parent.id)
sums <- vapply(parents, colSums, numeric(100), USE.NAMES = FALSE)
fractions <- matrix(0, 26, 100)
f_count <- 0
for (p_count in seq_along(parents)){
  parent <- as.matrix(parents[[p_count]])
  dimnames(parent) <- NULL
  n <- nrow(parent)
  for (p_row in seq_len(nrow(parent))){
    fractions[(f_count + p_row),] <- parent[p_row,] / sums[,p_count]
  }
  f_count <- f_count + p_row
}
Note: there's still room for improvement. split() is not particularly efficient.
Note 2: What "principles of efficient R code" were used?
Get rid of names whenever you can
It's faster to find things in a matrix than a data frame
Don't be afraid of for loops for efficiency, provided you're not growing an object
Prefer vapply to the other apply family functions.
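To illustrate the first note, here is one split()-free sketch of my own (not benchmarked above); unlike fractions, its rows stay in df's original row order:
# rowsum() sums the measurement matrix by parent.id in one call;
# indexing that result by df$parent.id lines each row up with its parent's sums.
m <- as.matrix(df[1:100])
parent_sums <- rowsum(m, df$parent.id)          # one row per parent.id
fractions2 <- m / parent_sums[df$parent.id, ]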
The problem with your example data is that every measurement column is identical (each row just repeats a single value), so I changed it slightly so the dataset contains distinct values.
Data:
set.seed(1L)
df <- cbind(matrix(rnorm(2600), nrow = 26, ncol = 100),data.frame(id=LETTERS,parent.id=sample(letters[1:5],26,replace = T),stringsAsFactors = F))
Code:
library('data.table')
setDT(df) # assign data.table class by reference
# compute sum for each `parent.id` for each column (100 columns)
sum_df <- df[, .SD, .SDcols = which(colnames(df) != 'id' )][, lapply(.SD, sum ), by = .(parent.id ) ]
# get column names for sum_df and df which are sorted for consistency
no_pid_id_df <- gtools::mixedsort( colnames(df)[ ! ( colnames(df) %in% c( 'id', 'parent.id' ) ) ] )
no_pid_sum_df <- gtools::mixedsort( colnames(sum_df)[ colnames(sum_df) != 'parent.id' ] )
# match the `parent.id` for each `id` and then divide its value by the value of `sum_df`.
df[, .( props = {
  pid <- parent.id
  unlist( .SD[, .SD, .SDcols = no_pid_id_df ] ) /
    unlist( sum_df[ parent.id == pid, ][, .SD, .SDcols = no_pid_sum_df ] )
}, parent.id ), by = .(id)]
Output:
# id props parent.id
# 1: A -0.95157186 e
# 2: A 0.06105359 e
# 3: A -0.42267771 e
# 4: A -0.03376174 e
# 5: A -0.16639600 e
# ---
# 2596: Z 2.34696158 e
# 2597: Z 0.23762369 e
# 2598: Z 0.60068440 e
# 2599: Z 0.14192337 e
# 2600: Z 0.01292592 e
Benchmark:
library('microbenchmark')
microbenchmark( sathish(), frank(), dan())
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# sathish() 404.450219 413.456675 433.656279 420.46044 429.876085 593.44202 100 c
# frank() 2.035302 2.304547 2.707019 2.47257 2.622025 18.31409 100 a
# dan() 17.396981 18.230982 19.316653 18.59737 19.700394 27.13146 100 b
Is there a way to preserve the index of an xts object when passing an xts object to rowSums?
Currently I recast the result as an xts object, but this doesn't seem to be as fast as it could be if rowSums were able to simply return what it was passed.
xts(rowSums(abs(data)),index(data))
Interesting question. Let's ignore the abs calculation, as it's not relevant a lot of the time with just prices. If your concern is performance, here is a set of timings to consider for the current suggestions:
library(microbenchmark)
sample.xts <- xts(order.by = as.POSIXct("2004-01-01 00:00:00") + 1:1e6, matrix(rnorm(1e6 *4), ncol = 4), dimnames = list(NULL, c("A", "B", "C", "D")))
# See how quickly rowSums works on just the underlying matrix of data in the timings below:
xx <- coredata(sample.xts)
microbenchmark(
  coredata(sample.xts),
  rowSums(xx),
  rowSums(sample.xts),
  rowSums(coredata(sample.xts)),
  .xts(x = rowSums(sample.xts), .index(sample.xts)),
  xts(rowSums(coredata(sample.xts)), index(sample.xts)),
  xts(rowSums(sample.xts), index(sample.xts)),
  Reduce("+", as.list(sample.xts)),
  times = 100)
# Unit: milliseconds
# expr min lq mean median uq max neval
# coredata(sample.xts) 2.558479 2.661242 6.884048 2.817607 6.356423 104.57993 100
# rowSums(xx) 10.314719 10.824184 11.872108 11.289788 12.382614 18.39334 100
# rowSums(sample.xts) 10.358009 10.887609 11.814267 11.335977 12.387085 17.16193 100
# rowSums(coredata(sample.xts)) 12.916714 13.839761 18.968731 15.950048 17.836838 113.78552 100
# .xts(x = rowSums(sample.xts), .index(sample.xts)) 14.402382 15.764736 20.307027 17.808984 19.072600 114.24039 100
# xts(rowSums(coredata(sample.xts)), index(sample.xts)) 20.490542 24.183286 34.251031 25.566188 27.900599 125.93967 100
# xts(rowSums(sample.xts), index(sample.xts)) 17.436137 19.087269 25.259143 21.923877 22.805013 119.60638 100
# Reduce("+", as.list(sample.xts)) 21.745574 26.075326 41.696152 27.669601 30.442397 136.38650 100
y = .xts(x = rowSums(sample.xts), .index(sample.xts))
y2 = xts(rowSums(sample.xts),index(sample.xts))
all.equal(y, y2)
#[1] TRUE
coredata(sample.xts) returns the underlying numeric matrix. I think the fastest performance you can expect is given by rowSums(xx) for doing the computation, which can be considered a "benchmark". The question is then, what's the quickest way to do it in an xts object. It seems
.xts(x = rowSums(sample.xts), .index(sample.xts)) gives decent performance.
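If this pattern is used often, it could be wrapped in a small helper (a sketch of my own; row_sums_xts is a made-up name, not an xts function):
row_sums_xts <- function(x) .xts(x = rowSums(x), .index(x))
y3 <- row_sums_xts(sample.xts)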
If your objection is having to pick apart and reassemble the components of the input, then, with x as your xts object, try this; it returns an xts object directly:
Reduce("+", as.list(x))
I have to subset a sequence of data.frames frequently (millions of times each run). The data.frames are of approximate size 200 rows x 30 columns. Depending on the state, the values in the data.frame change from one iteration to the next. Thus, doing one subset in the beginning is not working.
In contrast to the question of when a data.table starts to be faster than a data.frame, I am looking for a speed-up of subsetting for a given size of the data.frame/data.table.
The following minimal reproducible example shows that data.frame seems to be the fastest:
library(data.table)
nmax <- 1e2 # for 1e7 the results look as expected: data.table is really fast!
set.seed(1)
x<-runif(nmax,min=0,max=10)
y<-runif(nmax,min=0,max=10)
DF<-data.frame(x,y)
DT<-data.table(x,y)
summary(microbenchmark::microbenchmark(
  setkey(DT,x,y),
  times = 10L, unit = "us"))
# expr min lq mean median uq max neval
# 1 setkey(DT, x, y) 70.326 72.606 105.032 80.3985 126.586 212.877 10
summary(microbenchmark::microbenchmark(
  DF[DF$x>5, ],
  `[.data.frame`(DT,DT$x < 5,),
  DT[x>5],
  times = 100L, unit = "us"))
# expr min lq mean median uq max neval
# 1 DF[DF$x > 5, ] 41.815 45.426 52.40197 49.9885 57.4010 82.110 100
# 2 `[.data.frame`(DT, DT$x < 5, ) 43.716 47.707 58.06979 53.5995 61.2020 147.873 100
# 3 DT[x > 5] 205.273 214.777 233.09221 222.0000 231.6935 900.164 100
Is there anything I can do to improve performance?
Edit after input:
I am running a discrete event simulation, and for each event I have to search in a list (I don't mind whether it is a data.frame or data.table). Most likely I could implement a different approach, but then I would have to rewrite code that was developed over more than 3 years. At the moment this is not an option, but if there is no way to make it faster it might become one in the future.
Technically, it is not a sequence of data.frames but just one data.frame, which changes with each iteration. However, this has no impact on "how to get the subset faster" and I hope that the question is now more comprehensive.
You will see a performance boost by converting to matrices. This is a viable alternative if the whole content of your data.frame is numerical (or can be converted without too much trouble).
Here we go. First, I modified the data so that it has size 200 x 30:
library(data.table)
nmax = 200
cmax = 30
set.seed(1)
x<-runif(nmax,min=0,max=10)
DF = data.frame(x)
for (i in 2:cmax) {
  DF = cbind(DF, runif(nmax,min=0,max=10))
  colnames(DF)[ncol(DF)] = paste0('x',i)
}
DT = data.table(DF)
DM = as.matrix(DF) # # # or data.matrix(DF) if you have factors
And the comparison, ranked from quickest to slowest:
summary(microbenchmark::microbenchmark(
  DM[DM[, 'x']>5, ],            # # # # Quickest
  as.matrix(DF)[DF$x>5, ],      # # # # Still quicker with conversion
  DF[DF$x>5, ],
  `[.data.frame`(DT,DT$x < 5,),
  DT[x>5],
  times = 100L, unit = "us"))
# expr min lq mean median uq max neval
# 1 DM[DM[, "x"] > 5, ] 13.883 19.8700 22.65164 22.4600 24.9100 41.107 100
# 2 as.matrix(DF)[DF$x > 5, ] 141.100 181.9140 196.02329 195.7040 210.2795 304.989 100
# 3 DF[DF$x > 5, ] 198.846 238.8085 260.07793 255.6265 278.4080 377.982 100
# 4 `[.data.frame`(DT, DT$x < 5, ) 212.342 268.2945 346.87836 289.5885 304.2525 5894.712 100
# 5 DT[x > 5] 322.695 396.3675 465.19192 428.6370 457.9100 4186.487 100
If your use case involves querying the data multiple times, then you can do the conversion only once and increase the speed by one order of magnitude.
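A hedged sketch of that idea, with a dummy loop standing in for the repeated queries:
DM <- as.matrix(DF)                          # one-time conversion outside the loop
for (event in seq_len(1000)) {               # stand-in for the repeated-query loop
  hits <- DM[DM[, "x"] > 5, , drop = FALSE]  # fast matrix subsetting each iteration
  # ... use `hits` ...
}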