I have a 10x10 matrix, A, filled with random values. How can I apply a function to each row (finding the 75th quantile) and divide each element in that row of A by the result?
In the attempt below, I am getting a single value for q, but q should be 10 values (one for every row). At that point I should be able to do element-wise division with A/q. What am I doing wrong?
A <- matrix(rnorm(10 * 10), 10, 10)
q <- c(quantile(A[1,], 0.75))
A/q
There's rowQuantiles from the matrixStats package:
library(matrixStats)
res <- A / rowQuantiles(A, probs=0.75)
Same result?
identical(apply(A, 1, quantile, probs=0.75), rowQuantiles(A, probs=0.75))
[1] TRUE
Is it faster?
library(microbenchmark)
microbenchmark(apply=apply(A, 1, quantile, probs=0.75),
matStat=rowQuantiles(A, probs=0.75))
Unit: microseconds
expr min lq mean median uq max neval cld
apply 788.298 808.9675 959.816 829.3515 855.154 13259.652 100 b
matStat 246.611 267.2800 278.302 276.1180 284.386 362.075 100 a
On this matrix, definitely.
What about on a bigger matrix (1000 X 1000)?
A <- matrix(rnorm(1e6), 1000, 1000)
microbenchmark(apply=apply(A, 1, quantile, probs=0.75),
matStat=rowQuantiles(A, probs=0.75))
Unit: milliseconds
expr min lq mean median uq max neval cld
apply 115.57328 123.4831 183.1455 139.82021 308.3715 353.1725 100 b
matStat 74.22657 89.2162 136.1508 95.41482 113.0969 745.1526 100 a
Not as dramatic, but still yes (ignoring the max value).
Solved the issue by using apply, as below:
A <- matrix(rnorm(10 * 10), 10, 10)
q <- apply(A, 1, quantile, probs = c(0.75), na.rm = TRUE)
A <- A/q
It technically answers the question, but a vectorized approach would be nice.
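A side note (my addition, not from either answer): A/q happens to do the right thing here because R recycles the length-10 q down the columns of the 10x10 matrix, so each element of row i gets divided by q[i]. If you prefer the row-wise intent to be explicit, sweep() performs the same division:
q <- apply(A, 1, quantile, probs = 0.75, na.rm = TRUE)
A.scaled <- sweep(A, 1, q, "/")   # divide each row by its own 75th quantile
all.equal(A.scaled, A / q)        # should be TRUE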
#BEGIN CODE
my.kernel <- function(Yt){
  for (i in 1:length(Yt)) {
    Yt[i] <- ifelse(abs(Yt[i]) <= 1, (35/32)*(1 - Yt[i]^2)^3, 0)
  }
  Yt
}
# Print results
my.kernel.density.estimator <- function(y, Yt, h){
  result <- 0
  for (i in 1:length(Yt)) {
    result <- result + (1/(length(Yt)*h))*my.kernel((Yt[i]-y)/h)
  }
  result
}
# Print results
my.loglik.cv <- function(Yt, h){
  result <- 0
  for (i in 1:length(Yt)) {
    result <- result + log(my.kernel.density.estimator(Yt[i], Yt[-i], h))
  }
  result
}
# Print the results
# END CODE
Yt, h and y can be any vector/number. Here is one example.
Yt<- seq(0, 10, 0.01)
h <- 1
y<- 1
The main point is to understand how to make it run faster.
In R, mathematical operations are vectorized: you do not need to apply the same operation to each vector element separately; you can perform it on all elements at once.
The function
my.kernel <- function(Yt){
  for (i in 1:length(Yt)) {
    Yt[i] <- ifelse(abs(Yt[i]) <= 1, (35/32)*(1 - Yt[i]^2)^3, 0)
  }
  Yt
}
can be rewritten as
my.kernel.vec <- function(x) ifelse(abs(x) <= 1, (35/32)*(1 - x^2)^3, 0)
Yt <- seq(0, 10, 0.01)
h <- 1
y <- 1
all.equal(my.kernel(Yt),
my.kernel.vec(Yt))
#output
TRUE
The difference in speed is not minor:
library(microbenchmark)
microbenchmark(my.kernel(Yt),
my.kernel.vec(Yt))
Unit: microseconds
expr min lq mean median uq max neval cld
my.kernel(Yt) 1110.8 1179.2 1438.136 1311.35 1708.9 6756.4 100 b
my.kernel.vec(Yt) 54.3 66.3 104.204 70.20 74.3 3495.4 100 a
That is quite a speedup.
Similarly
my.kernel.density.estimator <- function(y, Yt, h){
  result <- 0
  for (i in 1:length(Yt)) {
    result <- result + (1/(length(Yt)*h))*my.kernel((Yt[i]-y)/h)
  }
  result
}
can be changed to utilize R's vectorized operations:
my.kernel.density.estimator.vec <- function(y, Yt, h) sum((1/(length(Yt)*h)) * my.kernel.vec((Yt-y)/h))
all.equal(my.kernel.density.estimator.vec(1, Yt, 1),
my.kernel.density.estimator(1, Yt, 1))
#output
TRUE
microbenchmark(my.kernel.density.estimator.vec(1, Yt, 1),
my.kernel.density.estimator(1, Yt, 1))
Unit: microseconds
expr min lq mean median uq max neval cld
my.kernel.density.estimator.vec(1, Yt, 1) 57.8 59.6 101.918 63.10 70.25 3716.4 100 a
my.kernel.density.estimator(1, Yt, 1) 2110.8 2163.6 2285.316 2231.35 2283.20 7826.7 100 b
Finally, in
my.loglik.cv <- function(Yt, h){
  result <- 0
  for (i in 1:length(Yt)) {
    result <- result + log(my.kernel.density.estimator(Yt[i], Yt[-i], h))
  }
  result
}
you need the loop in order to create the vectors Yt[i] and Yt[-i], so the loop itself stays; only the estimator call inside it changes (see the sketch below).
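The my.loglik.cv.vec used in the benchmarks below is not shown in the original answer; presumably it is the same loop with the vectorized estimator substituted, something like:
my.loglik.cv.vec <- function(Yt, h){
  result <- 0
  for (i in 1:length(Yt)) {
    # same leave-one-out loop, but each density evaluation is now vectorized
    result <- result + log(my.kernel.density.estimator.vec(Yt[i], Yt[-i], h))
  }
  result
}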
microbenchmark(my.loglik.cv.vec(Yt, 1),
my.loglik.cv(Yt, 1), times = 10)
Unit: milliseconds
expr min lq mean median uq max neval cld
my.loglik.cv.vec(Yt, 1) 59.1957 59.6794 79.13856 90.46365 92.7877 93.4487 10 a
my.loglik.cv(Yt, 1) 2240.7176 2280.7737 2309.83982 2299.39885 2343.6714 2412.8111 10 b
Not to mention the speedup on larger vectors:
Yt <- seq(0, 10, 0.001)
microbenchmark(my.loglik.cv.vec(Yt, 1),
my.loglik.cv(Yt, 1), times = 1)
Unit: seconds
expr min lq mean median uq max neval
my.loglik.cv.vec(Yt, 1) 5.460431 5.460431 5.460431 5.460431 5.460431 5.460431 1
my.loglik.cv(Yt, 1) 230.221194 230.221194 230.221194 230.221194 230.221194 230.221194 1
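Going one step further (my addition, not part of the original answer): for moderate length(Yt) the leave-one-out loop itself can be removed by building the full kernel matrix with outer(), trading O(n^2) memory for speed. This sketch assumes my.kernel.vec from above and relies on the kernel being an even function, so the orientation of the pairwise differences does not matter:
my.loglik.cv.outer <- function(Yt, h) {
  n <- length(Yt)
  # K((Yt[i] - Yt[j]) / h) for every pair i, j
  Kmat <- my.kernel.vec(outer(Yt, Yt, "-") / h)
  # leave-one-out sums: subtract the diagonal terms K(0) from each row sum
  loo <- (rowSums(Kmat) - my.kernel.vec(0)) / ((n - 1) * h)
  sum(log(loo))
}
# sanity check against the loop version (should be TRUE):
# all.equal(my.loglik.cv.outer(Yt, 1), my.loglik.cv(Yt, 1))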
I have an array A with hourly data (covering one month) and dim [116 152 744].
I am trying to create an array B with daily data and dim [116 152 31],
where every TSTEP in B is the average of 24 consecutive TSTEPs in A.
I was successful in creating a matrix C with monthly data with a simple apply
C <- apply(A, c(1,2), function (x) mean(x))
But I can't quite figure out how to average over every N values. Thanks.
Taking one vector only: to get the mean of every 24 values, you can do:
mean24 <- function(x) {
dim(x) <- c(24, length(x) / 24)
colMeans(x)
}
x <- 1:48
mean24(x)
[1] 12.5 36.5
So, in your case, you just have to do:
apply(A, c(1, 2), mean24)
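One caveat worth noting (my addition, not from the original answer): when FUN returns a vector, apply() places that vector along the first dimension of the result, so the call above returns a 31 x 116 x 152 array rather than the requested [116 152 31]. aperm() puts it back in the desired order:
B <- apply(A, c(1, 2), mean24)   # dims: 31 x 116 x 152
B <- aperm(B, c(2, 3, 1))        # reorder to 116 x 152 x 31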
You could also do it with sapply and some indexing:
# create data
arr <- array(dim = c(116, 152, 744))
arr[] <- runif(length(arr))

daily <- sapply(seq(1, 744, 24), function(ix) {
  rowMeans(arr[, , ix:(ix + 23)], dims = 2)
}, simplify = 'array')
> str(daily)
num [1:116, 1:152, 1:31] 0.451 0.522 0.407 0.536 0.432 ...
Edit:
It's also fairly quick compared with the other solution (microbenchmark):
Unit: milliseconds
expr min lq mean median uq max neval
apply(arr, c(1, 2), mean24) 464.4121 509.9772 653.9486 667.2114 699.498 1221.733 100
Unit: milliseconds
expr: sapply(seq(1, 744, 24), function(ix) { rowMeans(arr[, , ix:(ix + 23)], dims = 2) }, simplify = "array")
min lq mean median uq max neval
164.8211 168.3295 189.8147 171.4008 196.2403 409.9638 100
If a function returns 2 or more values and fill = NA is used, rollapply becomes much slower.
Is there any ways to avoid it?
library(zoo)
library(microbenchmark)
f1 <- function(v) c(mean(v) + median(v))  # returns a vector of length 1
f2 <- function(v) c(mean(v), median(v))   # returns a vector of length 2
v <- rnorm(1000)
microbenchmark(rollapplyr(v, 20, f1), rollapplyr(v, 20, f1, fill = NA))
# expr min lq mean median uq max neval
# rollapplyr(v, 20, f1) 50.84485 53.68726 57.21892 54.63793 57.78519 75.88305 100
# rollapplyr(v, 20, f1, fill = NA) 52.11355 54.69866 59.73473 56.20600 63.10546 99.96493 100
microbenchmark(rollapplyr(v, 20, f2), rollapplyr(v, 20, f2, fill = NA))
# expr min lq mean median uq max neval
# rollapplyr(v, 20, f2) 51.77687 52.29403 56.80307 53.44605 56.65524 105.6713 100
# rollapplyr(v, 20, f2, fill = NA) 69.93853 71.08953 76.48056 72.21896 80.58282 151.4455 100
The reason lies in the speed of na.fill() on different types of data, which is what happens internally in the rollapply() function. Your f1 returns a single vector, whereas f2 returns a matrix of two columns (well, both are zoo objects actually, but you catch my drift).
The slowdown from inserting the NAs is not proportionate to the mere doubling of the number of elements, as this shows:
library(zoo)
library(microbenchmark)
v <- zoo(rnorm(1000))
m <- zoo(matrix(rnorm(2000), ncol=2))
ix <- seq(1000)>50
microbenchmark(na.fill(v, NA, ix), na.fill(m, NA, ix))
# Unit: microseconds
# expr min lq mean median uq max neval
# na.fill(v, NA, ix) 402.861 511.912 679.1114 659.597 754.8385 4716.46 100
# na.fill(m, NA, ix) 9746.643 10091.038 14281.5598 14057.304 17589.9670 22249.96 100
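One way to sidestep the cost (a sketch of a workaround, my addition rather than part of the answer): call rollapplyr() without fill and prepend the missing rows of NAs yourself, so na.fill() never has to operate on the two-column result:
library(zoo)
f2 <- function(v) c(mean(v), median(v))
v <- rnorm(1000)

res <- coredata(rollapplyr(v, 20, f2))                # 981 x 2, no fill applied
pad <- matrix(NA_real_, nrow = 19, ncol = ncol(res))  # rows lost to the 20-wide window
res_filled <- rbind(pad, res)                         # should match the values fill = NA gives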
Is there an implementation of colMeans in R that includes an alpha trimmed mean parameter?
If not, how could I make one?
The original question has since been answered in the comments:
apply(x, 2, mean, trim=.05) is not as optimized as colMeans;
what is an implementation of equal efficiency?
Here are some examples of different ways to calculate trimmed colMeans, with a comparison of their performance.
m <- matrix(runif(1000000), nc=1000)
trim <- 0.1
Using apply:
out1 <- apply(m, 2, mean, trim=trim)
Using sapply:
out2 <- sapply(seq_len(ncol(m)), function(i) mean(m[, i], trim=trim))
Using Rcpp:
library(inline)
library(RcppArmadillo)
f <- 'using namespace arma;
      mat x = sort(as<mat>(x_));
      double trim = as<double>(trim_);
      int low;
      if (x.n_rows % 2 == 1) {
        low = ceil(trim * x.n_rows) - 1;
      } else {
        low = ceil(trim * x.n_rows);
      }
      int high = ceil((1 - trim) * x.n_rows) - 1;
      return(wrap(mean(x.rows(low, high))));'
trim.colMeans <- cxxfunction(signature(x_='matrix', trim_='numeric'),
f, plugin="RcppArmadillo")
out3 <- trim.colMeans(m, trim)
Comparison
identical(out1, out2)
[1] TRUE
identical(out1, c(out3))
[1] TRUE
library(microbenchmark)
microbenchmark(apply=apply(m, 2, mean, trim=trim),
sapply=sapply(seq_len(ncol(m)), function(i) mean(m[, i], trim=trim)),
Rcpp=trim.colMeans(m, trim),
colMeans=colMeans(m))
Unit: microseconds
expr min lq median uq max neval
apply 68907.162 100439.4775 102555.396 109044.4025 136034.067 100
sapply 64675.928 66383.6010 66937.615 68152.1115 98680.906 100
Rcpp 43614.629 44297.6980 44761.360 45164.4850 46883.602 100
colMeans 782.458 805.7995 828.538 988.4625 1452.877 100
I'm sure my Rcpp implementation is sub-optimal, so feel free to chime in with improvements. As you can see, none of these methods is as efficient as an untrimmed calculation of colMeans, yet I suspect equivalent efficiency is impossible, since additional calculations must be made, including sorting and subsetting of the matrix. This penalty for trimming data is evident when benchmarking the mean of a vector vs. the trimmed counterpart:
v <- runif(1000)
microbenchmark(mean(v), mean(v, trim=0.1))
Unit: microseconds
expr min lq median uq max neval
mean(v) 5.722 6.325 6.927 7.229 124.989 100
mean(v, trim = 0.1) 42.165 43.671 44.574 44.876 84.630 100
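A base-R middle ground (my addition; the answer above stops at the Rcpp version): sort each column once and average only the middle rows, mimicking what mean(x, trim=) does internally (it keeps rows lo = floor(n*trim) + 1 through hi = n + 1 - lo). For columns without NAs and trim < 0.5 this should match apply(m, 2, mean, trim=trim); whether it beats the Rcpp version will depend on the matrix shape:
trim.colMeans.base <- function(m, trim = 0.1) {
  n <- nrow(m)
  lo <- floor(n * trim) + 1          # first row kept after trimming
  hi <- n + 1 - lo                   # last row kept after trimming
  sorted <- apply(m, 2, sort)        # sort each column independently
  colMeans(sorted[lo:hi, , drop = FALSE])
}
out4 <- trim.colMeans.base(m, trim)
# all.equal(out1, out4) should be TRUE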
How can I obtain a matrix where each original value is divided by the square root of the sum of the squared original values for that column of the original matrix?
data(longley)
X <- as.matrix(longley[,-7])
X/sqrt(colSums(X^2))
I am getting the wrong results.
Try this:
t(t(X)/sqrt(colSums(X^2)))
Benchmarks:
library(microbenchmark)
microbenchmark(t(t(X)/sqrt(colSums(X^2))),
apply(X, 2 , function(x) x/sqrt(sum(x^2))))
# Unit: microseconds
# expr min lq median uq max neval
# t(t(X)/sqrt(colSums(X^2))) 28.783 33.1305 34.9455 40.5640 68.147 100
# apply(X, 2, function(x) x/sqrt(sum(x^2))) 100.307 105.1940 106.9975 108.1075 193.015 100
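For completeness (my note, not from the answer): X/sqrt(colSums(X^2)) gives wrong results because R recycles the divisor down the columns rather than across them, so values get divided by the wrong column norms. sweep() makes the intended orientation explicit and should match the double-transpose version:
norms <- sqrt(colSums(X^2))
X.norm <- sweep(X, 2, norms, "/")   # divide each column by its own norm
all.equal(X.norm, t(t(X)/norms))    # should be TRUE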
Does this work?
data(longley)
X <- as.matrix(longley[,-7])
X <- apply(X, 2 , function(x) x/sqrt(sum(x^2)))