Is there an implementation of colMeans in R that includes an alpha trimmed mean parameter?
If not, how could I make one?
Edit: the original question has been answered in the comments below: apply(x, 2, mean, trim=.05). However, that is not as optimized as colMeans; what would an implementation of comparable efficiency look like?
Here are some examples of different ways to calculate trimmed colMeans, with a comparison of their performance.
m <- matrix(runif(1000000), nc=1000)
trim <- 0.1
Using apply:
out1 <- apply(m, 2, mean, trim=trim)
Using sapply:
out2 <- sapply(seq_len(ncol(m)), function(i) mean(m[, i], trim=trim))
Using Rcpp:
library(inline)
library(RcppArmadillo)
f <- 'using namespace arma;
// sort() on an Armadillo matrix sorts each column independently
mat x = sort(as<mat>(x_));
double trim = as<double>(trim_);
// 0-based index of the first row to keep after trimming
int low;
if(x.n_rows % 2 == 1) {
  low = ceil(trim * x.n_rows) - 1;
} else {
  low = ceil(trim * x.n_rows);
}
// 0-based index of the last row to keep
int high = ceil((1 - trim) * x.n_rows) - 1;
// mean() on a matrix returns the column means
return(wrap(mean(x.rows(low, high))));'
trim.colMeans <- cxxfunction(signature(x_='matrix', trim_='numeric'),
f, plugin="RcppArmadillo")
out3 <- trim.colMeans(m, trim)
Comparison
identical(out1, out2)
[1] TRUE
identical(out1, c(out3))
[1] TRUE
library(microbenchmark)
microbenchmark(apply=apply(m, 2, mean, trim=trim),
sapply=sapply(seq_len(ncol(m)), function(i) mean(m[, i], trim=trim)),
Rcpp=trim.colMeans(m, trim),
colMeans=colMeans(m))
Unit: microseconds
expr min lq median uq max neval
apply 68907.162 100439.4775 102555.396 109044.4025 136034.067 100
sapply 64675.928 66383.6010 66937.615 68152.1115 98680.906 100
Rcpp 43614.629 44297.6980 44761.360 45164.4850 46883.602 100
colMeans 782.458 805.7995 828.538 988.4625 1452.877 100
I'm sure my Rcpp implementation is sub-optimal, so feel free to chime in with improvements. As you can see, none of these methods is as efficient as an untrimmed calculation of colMeans, yet I suspect equivalent efficiency is impossible, since additional calculations must be made, including sorting and subsetting of the matrix. This penalty for trimming data is evident when benchmarking the mean of a vector vs. the trimmed counterpart:
v <- runif(1000)
microbenchmark(mean(v), mean(v, trim=0.1))
Unit: microseconds
expr min lq median uq max neval
mean(v) 5.722 6.325 6.927 7.229 124.989 100
mean(v, trim = 0.1) 42.165 43.671 44.574 44.876 84.630 100
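For reference, here's a base-R sketch that pays the sorting cost once per column and then reuses colMeans; the cut points mirror those used by mean.default for 0 <= trim < 0.5, and the name trim_colMeans_base is my own (I haven't benchmarked it against the timings above):
trim_colMeans_base <- function(m, trim) {
  n <- nrow(m)
  lo <- floor(n * trim) + 1   # same cut points mean(..., trim=) uses
  hi <- n + 1 - lo
  # apply(m, 2, sort) sorts each column; keep only the untrimmed rows
  colMeans(apply(m, 2, sort)[lo:hi, , drop = FALSE])
}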
Related
I have a data.frame with columns "a" and "b". I want to add columns called "high" and "low" that contain the highest and the lowest among columns a and b.
Is there a way of doing this without looping over the lines in the dataframe?
Edit: this is for OHLC data, so the "high" and "low" columns should contain the highest and lowest element of a and b on the same row, not across the whole columns. Sorry if this is badly worded.
Sounds like you're looking for pmax and pmin ("parallel" max/min):
Extremes package:base R Documentation
Maxima and Minima
Description:
Returns the (parallel) maxima and minima of the input values.
Usage:
max(..., na.rm = FALSE)
min(..., na.rm = FALSE)
pmax(..., na.rm = FALSE)
pmin(..., na.rm = FALSE)
pmax.int(..., na.rm = FALSE)
pmin.int(..., na.rm = FALSE)
Arguments:
...: numeric or character arguments (see Note).
na.rm: a logical indicating whether missing values should be
removed.
Details:
‘pmax’ and ‘pmin’ take one or more vectors (or matrices) as
arguments and return a single vector giving the ‘parallel’ maxima
(or minima) of the vectors. The first element of the result is
the maximum (minimum) of the first elements of all the arguments,
the second element of the result is the maximum (minimum) of the
second elements of all the arguments and so on. Shorter inputs
are recycled if necessary. ‘attributes’ (such as ‘names’ or
‘dim’) are transferred from the first argument (if applicable).
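For the question's use case, a minimal sketch with made-up numbers:
Data <- data.frame(a = c(1, 5, 3), b = c(4, 2, 6))
Data$high <- pmax(Data$a, Data$b)  # row-wise max of a and b
Data$low  <- pmin(Data$a, Data$b)  # row-wise min of a and b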
Here's a version I implemented using Rcpp. I compared pmin with my version, and my version is roughly 3 times faster.
library(Rcpp)
cppFunction("
NumericVector min_vec(NumericVector vec1, NumericVector vec2) {
  int n = vec1.size();
  // On a length mismatch this returns 0, which Rcpp converts to a
  // zero-length NumericVector; raising an error would be cleaner.
  if(n != vec2.size()) return 0;
  else {
    NumericVector out(n);
    // element-wise minimum of the two vectors
    for(int i = 0; i < n; i++) {
      out[i] = std::min(vec1[i], vec2[i]);
    }
    return out;
  }
}
")
x1 <- rnorm(100000)
y1 <- rnorm(100000)
microbenchmark::microbenchmark(min_vec(x1, y1))
microbenchmark::microbenchmark(pmin(x1, y1))
x2 <- rnorm(500000)
y2 <- rnorm(500000)
microbenchmark::microbenchmark(min_vec(x2, y2))
microbenchmark::microbenchmark(pmin(x2, y2))
The microbenchmark function output for 100,000 elements is:
> microbenchmark::microbenchmark(min_vec(x1, y1))
Unit: microseconds
            expr     min       lq     mean  median       uq     max neval
 min_vec(x1, y1) 215.731 222.3705 230.7018 224.484 228.1115 284.631   100
> microbenchmark::microbenchmark(pmin(x1, y1))
Unit: microseconds
         expr     min       lq     mean  median      uq      max neval
 pmin(x1, y1) 891.486 904.7365 943.5884 922.899 954.873 1098.259   100
And for 500,000 elements:
> microbenchmark::microbenchmark(min_vec(x2, y2))
Unit: milliseconds
            expr      min       lq     mean   median       uq     max neval
 min_vec(x2, y2) 1.493136 2.008122 2.109541 2.140318 2.300022 2.97674   100
> microbenchmark::microbenchmark(pmin(x2, y2))
Unit: milliseconds
         expr      min       lq     mean   median       uq      max neval
 pmin(x2, y2) 4.652925 5.146819 5.286951 5.264451 5.445638 6.639985   100
So you can see the Rcpp version is faster.
You could make it better by adding some error checking to the function: for instance, check that both vectors are the same length, or that they are comparable (not character vs. numeric, or logical vs. numeric).
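As an illustration, here's a sketch of the length check (min_vec2 is a hypothetical name, not the function above); note that character input is already rejected by the NumericVector coercion at the call boundary:
library(Rcpp)
cppFunction('
NumericVector min_vec2(NumericVector vec1, NumericVector vec2) {
  int n = vec1.size();
  // stop() raises an R error instead of silently returning an empty vector
  if(n != vec2.size())
    stop("vec1 and vec2 must have the same length");
  NumericVector out(n);
  for(int i = 0; i < n; i++) {
    out[i] = std::min(vec1[i], vec2[i]);
  }
  return out;
}
')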
If your data.frame is named dat:
dat$pmin <- do.call(pmin,dat[c("a","b")])
dat$pmax <- do.call(pmax,dat[c("a","b")])
Another possible solution:
set.seed(21)
Data <- data.frame(a=runif(10),b=runif(10))
Data$low <- apply(Data[,c("a","b")], 1, min)
Data$high <- apply(Data[,c("a","b")], 1, max)
I have a matrix, A, filled with random values with shape 10x10. How can I perform a function on each row (finding the 75th quantile), and divide each element in that row of A by that result?
In the attempt below, I get a single value for q, but q should be 10 values (one for every row). At that point I should be able to do element-wise division with A/q. What am I doing wrong?
A <- matrix(rnorm(10 * 10), 10, 10)
q <- c(quantile(A[1,], 0.75))
A/q
There's rowQuantiles from the matrixStats package:
library(matrixStats)
res <- A / rowQuantiles(A, probs=0.75)
Same result?
identical(apply(A, 1, quantile, probs=0.75), rowQuantiles(A, probs=0.75))
[1] TRUE
Is it faster?
library(microbenchmark)
microbenchmark(apply=apply(A, 1, quantile, probs=0.75),
matStat=rowQuantiles(A, probs=0.75))
Unit: microseconds
expr min lq mean median uq max neval cld
apply 788.298 808.9675 959.816 829.3515 855.154 13259.652 100 b
matStat 246.611 267.2800 278.302 276.1180 284.386 362.075 100 a
On this matrix, definitely.
What about on a bigger matrix (1000 X 1000)?
A <- matrix(rnorm(1e6), 1000, 1000)
microbenchmark(apply=apply(A, 1, quantile, probs=0.75),
matStat=rowQuantiles(A, probs=0.75))
Unit: milliseconds
expr min lq mean median uq max neval cld
apply 115.57328 123.4831 183.1455 139.82021 308.3715 353.1725 100 b
matStat 74.22657 89.2162 136.1508 95.41482 113.0969 745.1526 100 a
Not as dramatic, but still yes (ignoring the max value).
Solved the issue by using apply, as below:
A <- matrix(rnorm(10 * 10), 10, 10)
q <- apply(A, 1, quantile, probs = c(0.75), na.rm = TRUE)
A <- A/q
It technically answers the question, but a vectorized approach would be nice.
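As an aside, sweep() makes the row-wise division explicit rather than relying on recycling; a sketch combining it with rowQuantiles from above (same result as A/q):
library(matrixStats)
q <- rowQuantiles(A, probs = 0.75)
res <- sweep(A, 1, q, "/")  # divide each row of A by its own quantile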
Let's say I have a vector x that:
is very large ( > 200 000 )
is integer
is sorted
all of its values are unique
I would like to check whether an integer value y is in this vector, and if it is, I would like to get its index. I would like to take advantage of the fact that the vector is sorted, so the lookup can be done fast.
How would I accomplish such a thing?
Here's some data
set.seed(123)
x = sort(unique(floor(runif(1e6, 1, 1e7))))
y = sample(1e7, 10000)
And a couple of approaches
f0 = function(y, vec) y %in% vec
f1 = function(y, vec) vec[findInterval(y, vec)] == y
%in% does a full scan; findInterval() does a binary search (I think). They generate the same result
> identical(f0(y, x), f1(y, x))
[1] TRUE
And have approximately similar amortized performance
> library(microbenchmark)
> microbenchmark(f0(y, x), f1(y, x), times=10)
Unit: milliseconds
expr min lq mean median uq max neval
f0(y, x) 99.35425 100.87319 102.32160 102.20107 103.67718 105.70854 10
f1(y, x) 94.83219 95.05068 95.93625 95.77822 96.72601 97.50961 10
But findInterval() is, I think, faster for small queries
> microbenchmark(f0(y[1:10], x), f1(y[1:10], x), times=10)
Unit: milliseconds
expr min lq mean median uq max neval
f0(y[1:10], x) 83.441578 85.116818 86.264751 86.07515 87.13516 89.430801 10
f1(y[1:10], x) 7.731606 7.734207 7.757201 7.75199 7.77210 7.810957 10
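Note that f0 and f1 return logicals, while the question also asks for the index. Here's a sketch building on findInterval() that returns the matching position, or NA when y is absent (index_of is my own name, not from the thread):
index_of = function(y, vec) {
  i = findInterval(y, vec)          # last position with vec[i] <= y; 0 if none
  found = i > 0L & vec[pmax(i, 1L)] == y
  ifelse(found, i, NA_integer_)
}
index_of(c(x[5], 0), x)             # 5 NA, since 0 is not in x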
David suggests (I think)
f2 = function(x, vec) vec[which.max(x == vec)] == x
which.max() is only good for scalar y, which is seldom (saying this for the benefit of OP) a good use of R. It appears less performant than findInterval()
> microbenchmark(f1(x[1000], x), f2(x[1000], x), times=10)
Unit: milliseconds
expr min lq mean median uq max neval
f1(x[1000], x) 7.707420 7.709047 7.714576 7.711979 7.718953 7.729688 10
f2(x[1000], x) 9.353225 9.358874 9.381781 9.378680 9.400808 9.426102 10
Contrary to #Laterow I don't see any particular performance difference between which() and which.max() (in current R-devel or R-3-2-branch; also, the results aren't the same, so it's an apples-to-oranges comparison). I have a vague recollection of an R-devel conversation about this in the last 6 months...
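To see why the results aren't the same, consider the no-match case (my example, not from the thread):
z <- c(5, 7, 9)
which.max(z == 10)  # 1: all comparisons are FALSE, so the first element wins
which(z == 10)[1]   # NA: there is no TRUE to find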
> set.seed(123) ; x <- sample(2e5, replace = TRUE)
> microbenchmark(which.max(x == 1e7), which(x == 1e7)[1])
Unit: milliseconds
                  expr      min       lq     mean   median       uq      max neval
 which.max(x == 1e+07) 4.240606 4.266470 5.975966 5.015947 5.217903 43.78467   100
  which(x == 1e+07)[1] 4.060040 4.132667 5.550078 4.986287 5.059128 43.88074   100
Performance of which versus which.max might have changed with this commit, where previously which.max() would coerce logical to numeric vectors before the scan, triggering a copy.
I was looking at the benchmarks in this answer, and wanted to compare them with diag (used in a different answer). Unfortunately, it seems that diag takes ages:
nc <- 1e4
set.seed(1)
m <- matrix(sample(letters,nc^2,replace=TRUE), ncol = nc)
microbenchmark(
diag = diag(m),
cond = m[row(m)==col(m)],
vec = m[(1:nc-1L)*nc+1:nc],
mat = m[cbind(1:nc,1:nc)],
times=10)
Comments: I tested these with identical. I took "cond" from one of the answers to this homework question. Results are similar with a matrix of integers, 1:26 instead of letters.
Results:
Unit: microseconds
expr min lq mean median uq max neval
diag 604343.469 629819.260 710371.3320 706842.3890 793144.019 837115.504 10
cond 3862039.512 3985784.025 4175724.0390 4186317.5260 4312493.742 4617117.706 10
vec 317.088 329.017 432.9099 350.1005 629.460 651.376 10
mat 272.147 292.953 441.7045 345.9400 637.506 706.860 10
It is just a matrix-subsetting operation, so I don't know why there's so much overhead. Looking inside the function, I see a few checks and then c(m)[v], where v is the same vector used in the "vec" benchmark. Timing these two...
v <- (1:nc-1L)*nc+1:nc
microbenchmark(diaglike=c(m)[v],vec=m[v])
# Unit: microseconds
# expr min lq mean median uq max neval
# diaglike 579224.436 664853.7450 720372.8105 712649.706 767281.5070 931976.707 100
# vec 334.843 339.8365 568.7808 646.799 663.5825 1445.067 100
...it seems I have found my culprit. So, the new variation on my question is: Why is there a seemingly unnecessary and very time-consuming c in diag?
Summary
As of R version 3.2.1 (World-Famous Astronaut), diag() has received an update. The discussion moved to r-devel, where it was noted that c() strips non-name attributes, which may have been why it was placed there. While some people worried that removing c() would cause unknown issues on matrix-like objects, Peter Dalgaard found that "The only case where the c() inside diag() has an effect is where M[i,j] != M[(i-1)*m+j] AND c(M) will stringize M in column-major order, so that M[i,j] == c(M)[(i-1)*m+j]."
Luke Tierney tested #Frank's removal of c(), finding that it did not affect anything on CRAN or BIOC, and so it was implemented, replacing c(x)[...] with x[...] on line 27. This leads to relatively large speedups in diag(). Below is a speed test showing the improvement with R 3.2.1's version of diag().
library(microbenchmark)
nc <- 1e4
set.seed(1)
m <- matrix(sample(letters,nc^2,replace=TRUE), ncol = nc)
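# diagOld() is not defined in this snippet; a stand-in mimicking the
# pre-3.2.1 c(x)[...] subsetting might look like:
diagOld <- function(x) {
  n <- min(dim(x))
  c(x)[seq.int(1L, by = nrow(x) + 1L, length.out = n)]
}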
microbenchmark(diagOld(m),diag(m))
Unit: microseconds
expr min lq mean median uq max neval
diagOld(m) 451189.242 526622.2775 545116.5668 531905.5635 540008.704 682223.733 100
diag(m) 222.563 646.8675 644.7444 714.4575 740.701 1015.459 100