Is there a vectorized parallel max() and min()?

I have a data.frame with columns "a" and "b". I want to add columns called "high" and "low" that contain, for each row, the highest and the lowest of columns a and b.
Is there a way of doing this without looping over the rows of the data.frame?
Edit: this is for OHLC data, so the high and low columns should contain the highest and lowest element between a and b on the same row, not over the whole columns.

Sounds like you're looking for pmax and pmin ("parallel" max/min):
Extremes package:base R Documentation
Maxima and Minima
Description:
Returns the (parallel) maxima and minima of the input values.
Usage:
max(..., na.rm = FALSE)
min(..., na.rm = FALSE)
pmax(..., na.rm = FALSE)
pmin(..., na.rm = FALSE)
pmax.int(..., na.rm = FALSE)
pmin.int(..., na.rm = FALSE)
Arguments:
...: numeric or character arguments (see Note).
na.rm: a logical indicating whether missing values should be removed.
Details:
‘pmax’ and ‘pmin’ take one or more vectors (or matrices) as
arguments and return a single vector giving the ‘parallel’ maxima
(or minima) of the vectors. The first element of the result is
the maximum (minimum) of the first elements of all the arguments,
the second element of the result is the maximum (minimum) of the
second elements of all the arguments and so on. Shorter inputs
are recycled if necessary. ‘attributes’ (such as ‘names’ or
‘dim’) are transferred from the first argument (if applicable).
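Applied to the question's data.frame, that gives (a minimal sketch, using the column names from the question):
df <- data.frame(a = c(1, 5, 3), b = c(4, 2, 3))
df$high <- pmax(df$a, df$b)  # row-wise maximum of a and b
df$low  <- pmin(df$a, df$b)  # row-wise minimum of a and b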

Here's a version I implemented using Rcpp. I compared my version with pmin, and it is roughly 3 times faster.
library(Rcpp)
cppFunction("
NumericVector min_vec(NumericVector vec1, NumericVector vec2) {
    int n = vec1.size();
    if (n != vec2.size()) {
        return 0;  // length mismatch: silently yields an empty vector (see the note on error checking below)
    } else {
        NumericVector out(n);
        for (int i = 0; i < n; i++) {
            out[i] = std::min(vec1[i], vec2[i]);  // element-wise minimum
        }
        return out;
    }
}
")
x1 <- rnorm(100000)
y1 <- rnorm(100000)
microbenchmark::microbenchmark(min_vec(x1, y1))
microbenchmark::microbenchmark(pmin(x1, y1))
x2 <- rnorm(500000)
y2 <- rnorm(500000)
microbenchmark::microbenchmark(min_vec(x2, y2))
microbenchmark::microbenchmark(pmin(x2, y2))
The microbenchmark output for 100,000 elements is:
> microbenchmark::microbenchmark(min_vec(x1, y1))
Unit: microseconds
            expr     min       lq     mean  median       uq     max neval
 min_vec(x1, y1) 215.731 222.3705 230.7018 224.484 228.1115 284.631   100
> microbenchmark::microbenchmark(pmin(x1, y1))
Unit: microseconds
         expr     min       lq     mean  median      uq      max neval
 pmin(x1, y1) 891.486 904.7365 943.5884 922.899 954.873 1098.259   100
And for 500,000 elements:
> microbenchmark::microbenchmark(min_vec(x2, y2))
Unit: milliseconds
            expr      min       lq     mean   median       uq     max neval
 min_vec(x2, y2) 1.493136 2.008122 2.109541 2.140318 2.300022 2.97674   100
> microbenchmark::microbenchmark(pmin(x2, y2))
Unit: milliseconds
         expr      min       lq     mean   median       uq      max neval
 pmin(x2, y2) 4.652925 5.146819 5.286951 5.264451 5.445638 6.639985   100
So you can see the Rcpp version is faster.
You could make it better by adding some error checking to the function: for instance, check that both vectors have the same length, and that their types are comparable (not character vs. numeric, or logical vs. numeric).
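For instance, a minimal sketch of the length check (the name min_vec_checked is mine, not from the original; Rcpp's stop() raises an R error instead of silently returning an empty vector):
cppFunction('
NumericVector min_vec_checked(NumericVector vec1, NumericVector vec2) {
    // raise an R error on length mismatch instead of returning an empty vector
    if (vec1.size() != vec2.size()) stop("vec1 and vec2 must have the same length");
    int n = vec1.size();
    NumericVector out(n);
    for (int i = 0; i < n; i++) {
        out[i] = std::min(vec1[i], vec2[i]);
    }
    return out;
}
')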

If your data.frame is named dat:
dat$pmin <- do.call(pmin, dat[c("a","b")])
dat$pmax <- do.call(pmax, dat[c("a","b")])
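The do.call() form generalizes to any number of columns; a sketch assuming hypothetical extra columns c and d:
dat$pmin <- do.call(pmin, dat[c("a","b","c","d")])
dat$pmax <- do.call(pmax, dat[c("a","b","c","d")])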

Another possible solution:
set.seed(21)
Data <- data.frame(a=runif(10),b=runif(10))
Data$low <- apply(Data[,c("a","b")], 1, min)
Data$high <- apply(Data[,c("a","b")], 1, max)
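Note that apply() works row by row, calling min()/max() once per row, so on long data it will usually be slower than pmin()/pmax(). A vectorized row-wise alternative, sketched with the matrixStats package that appears in the answers below:
library(matrixStats)
Data$low <- rowMins(as.matrix(Data[, c("a","b")]))   # row-wise minima
Data$high <- rowMaxs(as.matrix(Data[, c("a","b")]))  # row-wise maxima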

Related

For two equal sized vectors, how to create a new vector containing the max (or min) of either vector at each index? [duplicate]


Apply function (quantile) to matrix rows and use result to modify row

I have a matrix, A, filled with random values with shape 10x10. How can I perform a function on each row (finding the 75th quantile), and divide each element in that row of A by that result?
In the attempt below, I get a single value for q, but q should contain 10 values (one for each row). At that point I should be able to do element-wise division with A/q. What am I doing wrong?
A <- matrix(rnorm(10 * 10), 10, 10)
q <- c(quantile(A[1,], 0.75))
A/q
There's rowQuantiles from the matrixStats package:
library(matrixStats)
res <- A / rowQuantiles(A, probs=0.75)
Same result?
identical(apply(A, 1, quantile, probs=0.75), rowQuantiles(A, probs=0.75))
[1] TRUE
Is it faster?
library(microbenchmark)
microbenchmark(apply=apply(A, 1, quantile, probs=0.75),
               matStat=rowQuantiles(A, probs=0.75))
Unit: microseconds
expr min lq mean median uq max neval cld
apply 788.298 808.9675 959.816 829.3515 855.154 13259.652 100 b
matStat 246.611 267.2800 278.302 276.1180 284.386 362.075 100 a
On this matrix, definitely.
What about on a bigger matrix (1000 X 1000)?
A <- matrix(rnorm(1e6), 1000, 1000)
microbenchmark(apply=apply(A, 1, quantile, probs=0.75),
               matStat=rowQuantiles(A, probs=0.75))
Unit: milliseconds
expr min lq mean median uq max neval cld
apply 115.57328 123.4831 183.1455 139.82021 308.3715 353.1725 100 b
matStat 74.22657 89.2162 136.1508 95.41482 113.0969 745.1526 100 a
Not as dramatic, but still yes (ignoring the max value).
Solved the issue by using apply, as below:
A <- matrix(rnorm(10 * 10), 10, 10)
q <- apply(A, 1, quantile, probs = c(0.75), na.rm = TRUE)
A <- A/q
It technically answers the question, but a vectorized approach would be nice.
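For the division step itself, sweep() is the explicit base-R idiom for "divide each row by its own value". A sketch (note that A/q above only happens to work because length(q) equals nrow(A), so column-major recycling lines up with the rows):
A <- sweep(A, 1, q, "/")  # divide row i of A by q[i]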

Efficient sparse linear interpolation of row by row data

What is the most efficient way to do linear interpolation when the desired interpolation points are sparse compared to the available data? I have a very long data frame containing many columns, one of which represents a timestamp and the rest of which are variables, and I am interested in interpolating at only a very small number of timestamps. For example, consider the two-variable case:
microbenchmark::microbenchmark(approx(1:2, 1:2, 1.5)$y)
# Unit: microseconds
# expr min lq mean median uq max neval
# ... 39.629 41.3395 46.80514 42.195 52.8865 138.558 100
microbenchmark::microbenchmark(approx(seq_len(1e6), seq_len(1e6), 1.5)$y)
# Unit: milliseconds
# expr min lq mean median uq max neval
# ... 129.5733 231.0047 229.3459 236.3845 247.3096 369.4621 100
We see that although only one interpolated value (at t = 1.5) is desired, increasing the number of (x, y) pairs costs a few orders of magnitude in running time.
Another example, this time with a data table.
library(data.table)
tmp_dt <- data.table(time = seq_len(1e7), a = seq_len(1e7), b = seq_len(1e7), c = seq_len(1e7))
Running tmp_dt[, lapply(.SD, function(col) {approx(time, col, 1.5)$y}), .SDcols = c("a", "b", "c")] produces a one-row data.table, but it takes a while.
I am thinking there must be some efficiency to be gained by removing all rows in the data table that are not necessary for interpolation.
If your linear interpolation is weighted.mean(c(x0, x1), c(t1-t, t-t0)), where (t0, x0) is the nearest point below and (t1, x1) the nearest above...
# fix bad format
tmp_dt[, names(tmp_dt) := lapply(.SD, as.numeric)]
# enumerate target times
tDT = data.table(t = seq(1.5, 100.5, by=.5))
# handle perfect matches
tDT[, a := tmp_dt[.SD, on=.(time = t), x.a]]
# handle interpolation
tDT[is.na(a), a := {
    w = findInterval(t, tmp_dt$time)
    cbind(tmp_dt[w, .(t0 = time, a0 = a)], tmp_dt[w+1L, .(t1 = time, a1 = a)])[,
        (a0*(t1-t) + a1*(t-t0))/(t1-t0)]
}]
The extension to more columns is a little messy, but can be shoehorned in here.
Some sort of rolling, like w = tmp_dt[t, on=.(time), roll=TRUE, which=TRUE], might be faster than findInterval, but I haven't looked into it.
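The subsetting idea from the question can also be sketched in base R: use findInterval() to locate the two rows that bracket a target time and hand only those to approx(). A sketch for a single target time t = 1.5 and column a:
i <- findInterval(1.5, tmp_dt$time)  # index of the last time <= 1.5
approx(tmp_dt$time[i + 0:1], tmp_dt$a[i + 0:1], xout = 1.5)$y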

R: fast method to check if integer value is in sorted integer vector and return its index

Let's say I have a vector x that:
- is very large (> 200,000 elements),
- is integer,
- is sorted,
- has all unique values.
I would like to check whether an integer value y is in this vector, and if it is, to get its index. I would like to take advantage of the fact that the vector is sorted, so this can be done fast.
How would I accomplish such a thing?
Here's some data
set.seed(123)
x = sort(unique(floor(runif(1e6, 1, 1e7))))
y = sample(1e7, 10000)
And a couple of approaches
f0 = function(y, vec) y %in% vec
f1 = function(y, vec) vec[findInterval(y, vec)] == y
The %in% does a full scan; findInterval() does a binary search on the sorted vector, returning for each y the index of the largest element of vec that is <= y, so vec[findInterval(y, vec)] == y is TRUE exactly when y is present. They generate the same result
> identical(f0(y, x), f1(y, x))
[1] TRUE
And have approximately similar amortized performance
> library(microbenchmark)
> microbenchmark(f0(y, x), f1(y, x), times=10)
Unit: milliseconds
expr min lq mean median uq max neval
f0(y, x) 99.35425 100.87319 102.32160 102.20107 103.67718 105.70854 10
f1(y, x) 94.83219 95.05068 95.93625 95.77822 96.72601 97.50961 10
But findInterval() is, I think, faster for small queries
> microbenchmark(f0(y[1:10], x), f1(y[1:10], x), times=10)
Unit: milliseconds
expr min lq mean median uq max neval
f0(y[1:10], x) 83.441578 85.116818 86.264751 86.07515 87.13516 89.430801 10
f1(y[1:10], x) 7.731606 7.734207 7.757201 7.75199 7.77210 7.810957 10
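Since the question also asks for the index, here is a sketch (f3 is my name, not from the answers) that returns the position in vec, or NA when y is absent, again built on findInterval():
f3 = function(y, vec) {
    i = findInterval(y, vec)                          # index of largest element <= y
    i[i == 0L | vec[pmax(i, 1L)] != y] = NA_integer_  # below range, or no exact match
    i
}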
David suggests (I think)
f2 = function(x, vec) vec[which.max(x == vec)] == x
which.max() only works for a scalar x, and operating on scalars one at a time is seldom (saying this for the benefit of the OP) a good use of R. It also appears less performant than findInterval():
> microbenchmark(f1(x[1000], x), f2(x[1000], x), times=10)
Unit: milliseconds
expr min lq mean median uq max neval
f1(x[1000], x) 7.707420 7.709047 7.714576 7.711979 7.718953 7.729688 10
f2(x[1000], x) 9.353225 9.358874 9.381781 9.378680 9.400808 9.426102 10
Contrary to #Laterow, I don't see any particular performance difference between which() and which.max() (in current R-devel or R-3-2-branch); also, the results aren't the same, so it's an apples-to-oranges comparison. I have a vague recollection of an R-devel conversation about this in the last 6 months...
> set.seed(123) ; x <- sample(2e5, replace = TRUE)
> microbenchmark(which.max(x == 1e7), which(x == 1e7)[1])
Unit: milliseconds
                  expr      min       lq     mean   median       uq      max neval
 which.max(x == 1e+07) 4.240606 4.266470 5.975966 5.015947 5.217903 43.78467   100
  which(x == 1e+07)[1] 4.060040 4.132667 5.550078 4.986287 5.059128 43.88074   100
Performance of which() versus which.max() might have changed with this commit, where previously which.max() would coerce a logical vector to numeric before the scan, triggering a copy.

Set an alpha trim in colMeans

Is there an implementation of colMeans in R that includes an alpha trimmed mean parameter?
If not, how could I make one?
The original question has been answered in the comments below.
apply(x, 2, mean, trim=.05) is not as optimized as colMeans;
what is an implementation of equal efficiency?
Here are some examples of different ways to calculate trimmed colMeans, with a comparison of their performance.
m <- matrix(runif(1000000), nc=1000)
trim <- 0.1
Using apply:
out1 <- apply(m, 2, mean, trim=trim)
Using sapply:
out2 <- sapply(seq_len(ncol(m)), function(i) mean(m[, i], trim=trim))
Using Rcpp:
library(inline)
library(RcppArmadillo)
f <- 'using namespace arma;
      mat x = sort(as<mat>(x_));
      double trim = as<double>(trim_);
      int low;
      if (x.n_rows % 2 == 1) {
          low = ceil(trim * x.n_rows) - 1;
      } else {
          low = ceil(trim * x.n_rows);
      }
      int high = ceil((1 - trim) * x.n_rows) - 1;
      return(wrap(mean(x.rows(low, high))));'
trim.colMeans <- cxxfunction(signature(x_='matrix', trim_='numeric'),
                             f, plugin="RcppArmadillo")
out3 <- trim.colMeans(m, trim)
Comparison
identical(out1, out2)
[1] TRUE
identical(out1, c(out3))
[1] TRUE
library(microbenchmark)
microbenchmark(apply=apply(m, 2, mean, trim=trim),
sapply=sapply(seq_len(ncol(m)), function(i) mean(m[, i], trim=trim)),
Rcpp=trim.colMeans(m, trim),
colMeans=colMeans(m))
Unit: microseconds
expr min lq median uq max neval
apply 68907.162 100439.4775 102555.396 109044.4025 136034.067 100
sapply 64675.928 66383.6010 66937.615 68152.1115 98680.906 100
Rcpp 43614.629 44297.6980 44761.360 45164.4850 46883.602 100
colMeans 782.458 805.7995 828.538 988.4625 1452.877 100
I'm sure my Rcpp implementation is sub-optimal, so feel free to chime in with improvements. As you can see, none of these methods is as efficient as an untrimmed calculation of colMeans, yet I suspect equivalent efficiency is impossible, since additional calculations must be made, including sorting and subsetting of the matrix. This penalty for trimming data is evident when benchmarking the mean of a vector vs. the trimmed counterpart:
v <- runif(1000)
microbenchmark(mean(v), mean(v, trim=0.1))
Unit: microseconds
expr min lq median uq max neval
mean(v) 5.722 6.325 6.927 7.229 124.989 100
mean(v, trim = 0.1) 42.165 43.671 44.574 44.876 84.630 100
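If you want to stay in base R, one possible middle ground (a sketch of my own, not from the answer above) is to sort each column once and then hand the retained middle rows to colMeans(), mirroring the trimming rule that mean.default() uses:
trim_colMeans_base <- function(m, trim) {
    s <- apply(m, 2, sort)                 # sort each column
    lo <- floor(nrow(m) * trim) + 1        # first retained row, as in mean.default()
    hi <- nrow(m) - floor(nrow(m) * trim)  # last retained row
    colMeans(s[lo:hi, , drop = FALSE])
}
all.equal(trim_colMeans_base(m, trim), apply(m, 2, mean, trim=trim))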
