I have an array A of hourly data (covering one month) with dim [116, 152, 744].
I am trying to create an array B of daily data with dim [116, 152, 31],
where each daily TSTEP in B is the average of the corresponding block of 24 hourly TSTEPs in A.
I was successful in creating a matrix C with monthly data with a simple apply
C <- apply(A, c(1,2), function (x) mean(x))
But I can't quite figure out how to average over every N values. Thanks.
Taking one vector on its own, to get the mean of every 24 values you can do:
mean24 <- function(x) {
  # reshape into a 24-row matrix (one column per block of 24) and average each column
  dim(x) <- c(24, length(x) / 24)
  colMeans(x)
}
x <- 1:48
mean24(x)
[1] 12.5 36.5
So, in your case, you just have to do:
apply(A, c(1, 2), mean24)
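One caveat (not spelled out in the answer): apply() puts the function's output as the first dimension, so the call above returns a [31, 116, 152] array. If you want the [116, 152, 31] layout from the question, something like this should work:
B <- apply(A, c(1, 2), mean24)  # dim: 31 x 116 x 152
B <- aperm(B, c(2, 3, 1))       # reorder to 116 x 152 x 31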
You could also do it with sapply and some indexing:
# create data
arr <- array(dim = c(116, 152, 744))
arr[] <- runif(length(arr))

daily <- sapply(seq(1, 744, 24), function(ix) {
  rowMeans(arr[, , ix:(ix + 23)], dims = 2)
}, simplify = 'array')
> str(daily)
num [1:116, 1:152, 1:31] 0.451 0.522 0.407 0.536 0.432 ...
Edit:
It's also fairly quick compared with the other solution (microbenchmark):
Unit: milliseconds
                        expr      min       lq     mean   median      uq      max neval
 apply(arr, c(1, 2), mean24) 464.4121 509.9772 653.9486 667.2114 699.498 1221.733   100

Unit: milliseconds
 expr: sapply(seq(1, 744, 24), function(ix) { rowMeans(arr[, , ix:(ix + 23)], dims = 2) }, simplify = "array")
      min       lq     mean   median       uq      max neval
 164.8211 168.3295 189.8147 171.4008 196.2403 409.9638   100
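Another possibility, not from the answers above and just a sketch: since 744 = 24 * 31, you can reshape the hour dimension into 24 x 31 with array() and let colMeans() average over the 24 hours:
# reinterpret the 744 hourly steps as 24 hours x 31 days
AA <- array(arr, dim = c(116, 152, 24, 31))
# average over the hour axis; result has dim 116 x 152 x 31
daily2 <- colMeans(aperm(AA, c(3, 1, 2, 4)), dims = 1)
This should agree with the sapply result (all.equal(daily, daily2)).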
I have the following functions for a kernel density estimator and its leave-one-out cross-validated log-likelihood:
#BEGIN CODE
my.kernel <- function(Yt){
  # triweight kernel: (35/32)*(1 - u^2)^3 on [-1, 1], zero outside
  for (i in 1:length(Yt)) {
    Yt[i] <- ifelse(abs(Yt[i]) <= 1, (35/32)*(1 - Yt[i]^2)^3, 0)
  }
  Yt
}

my.kernel.density.estimator <- function(y, Yt, h){
  # kernel density estimate at point y from sample Yt with bandwidth h
  result <- 0
  for (i in 1:length(Yt)) {
    result <- result + (1/(length(Yt)*h))*my.kernel((Yt[i] - y)/h)
  }
  result
}

my.loglik.cv <- function(Yt, h){
  # leave-one-out cross-validated log-likelihood for bandwidth h
  result <- 0
  for (i in 1:length(Yt)) {
    result <- result + log(my.kernel.density.estimator(Yt[i], Yt[-i], h))
  }
  result
}
# END CODE
Yt, h and y can be any vector/number. Here is one example.
Yt<- seq(0, 10, 0.01)
h <- 1
y<- 1
The main point is to understand how to make it run faster.
In R, mathematical operations are vectorized: you do not need to apply the same operation to each vector element separately; you can perform it on all elements at once.
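For example, each of these operations acts on the whole vector in one call:
x <- c(0.5, 1.5, -0.2)
x^2          # 0.25 2.25 0.04
abs(x) <= 1  # TRUE FALSE TRUE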
The function
my.kernel <- function(Yt){
  for (i in 1:length(Yt)) {
    Yt[i] <- ifelse(abs(Yt[i]) <= 1, (35/32)*(1 - Yt[i]^2)^3, 0)
  }
  Yt
}
can be rewritten as
my.kernel.vec <- function(x) ifelse(abs(x) <= 1, (35/32)*(1 - x^2)^3, 0)
Yt <- seq(0, 10, 0.01)
h <- 1
y <- 1
all.equal(my.kernel(Yt),
my.kernel.vec(Yt))
#output
TRUE
The difference in speed is not minor:
library(microbenchmark)
microbenchmark(my.kernel(Yt),
my.kernel.vec(Yt))
Unit: microseconds
expr min lq mean median uq max neval cld
my.kernel(Yt) 1110.8 1179.2 1438.136 1311.35 1708.9 6756.4 100 b
my.kernel.vec(Yt) 54.3 66.3 104.204 70.20 74.3 3495.4 100 a
That is quite a speed-up.
Similarly
my.kernel.density.estimator <- function(y, Yt, h){
  result <- 0
  for (i in 1:length(Yt)) {
    result <- result + (1/(length(Yt)*h))*my.kernel((Yt[i] - y)/h)
  }
  result
}
can be changed to utilize R's vectorized operations:
my.kernel.density.estimator.vec <- function(y,Yt,h) sum((1/(length(Yt)*h))*my.kernel.vec((Yt-y)/h))
all.equal(my.kernel.density.estimator.vec(1, Yt, 1),
my.kernel.density.estimator(1, Yt, 1))
#output
TRUE
microbenchmark(my.kernel.density.estimator.vec(1, Yt, 1),
my.kernel.density.estimator(1, Yt, 1))
Unit: microseconds
expr min lq mean median uq max neval cld
my.kernel.density.estimator.vec(1, Yt, 1) 57.8 59.6 101.918 63.10 70.25 3716.4 100 a
my.kernel.density.estimator(1, Yt, 1) 2110.8 2163.6 2285.316 2231.35 2283.20 7826.7 100 b
Finally, in
my.loglik.cv <- function(Yt, h){
  result <- 0
  for (i in 1:length(Yt)) {
    result <- result + log(my.kernel.density.estimator(Yt[i], Yt[-i], h))
  }
  result
}
you still need the loop to build Yt[i] and Yt[-i] for each i, so I left the loop as is and only swapped in the vectorized estimator.
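The benchmarks below call my.loglik.cv.vec, which is not written out in the answer; presumably it is just the same loop with the vectorized estimator plugged in:
my.loglik.cv.vec <- function(Yt, h){
  result <- 0
  for (i in 1:length(Yt)) {
    # same loop as before, but each density evaluation is internally vectorized
    result <- result + log(my.kernel.density.estimator.vec(Yt[i], Yt[-i], h))
  }
  result
}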
microbenchmark(my.loglik.cv.vec(Yt, 1),
my.loglik.cv(Yt, 1), times = 10)
Unit: milliseconds
expr min lq mean median uq max neval cld
my.loglik.cv.vec(Yt, 1) 59.1957 59.6794 79.13856 90.46365 92.7877 93.4487 10 a
my.loglik.cv(Yt, 1) 2240.7176 2280.7737 2309.83982 2299.39885 2343.6714 2412.8111 10 b
Not to mention the speedup on larger vectors:
Yt <- seq(0, 10, 0.001)
microbenchmark(my.loglik.cv.vec(Yt, 1),
my.loglik.cv(Yt, 1), times = 1)
Unit: seconds
expr min lq mean median uq max neval
my.loglik.cv.vec(Yt, 1) 5.460431 5.460431 5.460431 5.460431 5.460431 5.460431 1
my.loglik.cv(Yt, 1) 230.221194 230.221194 230.221194 230.221194 230.221194 230.221194 1
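If memory allows, even the leave-one-out loop can be removed by forming all pairwise differences at once with outer(); this is only a sketch (my.loglik.cv.outer is not from the answer) and it assumes Yt is short enough that an n x n matrix fits in memory:
my.loglik.cv.outer <- function(Yt, h){
  n <- length(Yt)
  # n x n matrix of kernel values for all scaled pairwise differences
  K <- my.kernel.vec(outer(Yt, Yt, "-")/h)
  # remove each point's own contribution K(0) = 35/32, then normalise
  f <- (rowSums(K) - 35/32)/((n - 1)*h)
  sum(log(f))
}
With the shorter Yt <- seq(0, 10, 0.01), all.equal(my.loglik.cv.outer(Yt, 1), my.loglik.cv(Yt, 1)) should return TRUE.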
I have a matrix, A, filled with random values with shape 10x10. How can I perform a function on each row (finding the 75th quantile), and divide each element in that row of A by that result?
In the attempt below, I am getting a single value for q, but q should be 10 values (one for every row). At that point I should be able to do element-wise division with A/q. What am I doing wrong?
A <- matrix(rnorm(10 * 10), 10, 10)
q <- c(quantile(A[1,], 0.75))
A/q
There's rowQuantiles from the matrixStats package:
library(matrixStats)
res <- A / rowQuantiles(A, probs=0.75)
Same result?
identical(apply(A, 1, quantile, probs=0.75), rowQuantiles(A, probs=0.75))
[1] TRUE
Is it faster?
library(microbenchmark)
microbenchmark(apply=apply(A, 1, quantile, probs=0.75),
matStat=rowQuantiles(A, probs=0.75))
Unit: microseconds
expr min lq mean median uq max neval cld
apply 788.298 808.9675 959.816 829.3515 855.154 13259.652 100 b
matStat 246.611 267.2800 278.302 276.1180 284.386 362.075 100 a
On this matrix, definitely.
What about on a bigger matrix (1000 X 1000)?
A <- matrix(rnorm(1e6), 1000, 1000)
microbenchmark(apply=apply(A, 1, quantile, probs=0.75),
matStat=rowQuantiles(A, probs=0.75))
Unit: milliseconds
expr min lq mean median uq max neval cld
apply 115.57328 123.4831 183.1455 139.82021 308.3715 353.1725 100 b
matStat 74.22657 89.2162 136.1508 95.41482 113.0969 745.1526 100 a
Not as dramatic, but still yes (ignoring the max value).
Solved the issue by using apply, as below:
A <- matrix(rnorm(10 * 10), 10, 10)
q <- apply(A, 1, quantile, probs = c(0.75), na.rm = TRUE)
A <- A/q
It technically answers the question, but a vectorized approach would be nice.
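For what it's worth, A/q only works here because length(q) equals nrow(A) and R recycles column-wise; sweep() makes the row-wise division explicit and is safer if the shapes ever change:
q <- apply(A, 1, quantile, probs = 0.75, na.rm = TRUE)
A_scaled <- sweep(A, 1, q, "/")  # divide each row of A by its own 75th percentile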
I have a string matrix (my_data) of dimensions 9000000x10 with each value being a single character string. I want to transform it to a numeric matrix using the function utf8ToInt, but it takes a long time and crashes my session.
new_matrix <- apply(my_data, 1:2, "utf8ToInt")
The result is what I expect, but I need a more efficient way of doing that.
Any help is deeply appreciated.
Imagine my data is:
my_data <- matrix(c("a","b","c","d"), ncol = 2)
but it is actually 9000000x10 instead of 2x2.
stringi::stri_enc_toutf32 may be an alternative.
From ?stri_enc_toutf32:
This function is roughly equivalent to a vectorized call to utf8ToInt(enc2utf8(str))
On a 1e3 * 2 matrix, stri_enc_toutf32 is about 10 and 20 times faster than vapply / apply + utf8ToInt respectively:
library(stringi)
library(microbenchmark)
nr = 1e3
nc = 2
m = matrix(sample(letters, nr*nc, replace = TRUE), nrow = nr, ncol = nc)
microbenchmark(
f_apply = apply(m, 1:2, utf8ToInt),
f_vapply = structure(vapply(m, utf8ToInt, numeric(1)), dim=dim(m)),
f = matrix(unlist(stri_enc_toutf32(m), use.names = FALSE), nrow = nrow(m)),
times = 10L, check = "equal")
# Unit: microseconds
# expr min lq mean median uq max neval
# f_apply 2283.4 2297.2 2351.17 2325.40 2354.5 2583.6 10
# f_vapply 1276.1 1298.0 1348.88 1322.00 1353.4 1611.3 10
# f 87.6 92.3 108.53 105.15 111.0 163.8 10
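Applied to the original matrix, the last expression would look something like this (each entry is a single character, so unlist() yields exactly one integer per cell):
new_matrix <- matrix(unlist(stri_enc_toutf32(my_data), use.names = FALSE),
                     nrow = nrow(my_data))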
Using vapply would be almost twice as fast. Since vapply returns a vector, it is necessary to re-establish the matrix format (here with structure).
library(microbenchmark)
my_data <- matrix(sample(letters, 2*100, replace = TRUE), ncol = 2)
microbenchmark(
apply = apply(my_data, 1:2, utf8ToInt),
vapply = structure(vapply(my_data, utf8ToInt, numeric(1)), dim=dim(my_data)),
times = 500L, check = 'equal'
)
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> apply 199.201 208.001 224.811 213.801 220.1515 1560.400 500
#> vapply 111.000 115.501 136.343 120.401 124.9505 1525.901 500
Created on 2021-03-06 by the reprex package (v1.0.0)
I have a data frame (called Coor) that contains a vector of x coordinates and a vector of y coordinates.
I have another data frame (called PickedPoint) that contains specified (x, y) pairs that are of interest.
The goal is to associate each Coor point with its nearest PickedPoint, using the Euclidean (L2) norm. If at all possible, please use tidyverse methods.
Coor = data.frame(row = rep(1:96, each = 72),
col = rep(1:72, times = 96))
PickedPoint = data.frame(ppRow = sample(96,10),
ppCol = sample(72,10))
There is another thread that is similar but is posted in python:
How to find the closest (x, y) position to (x,y) position in another list?
I have included a benchmark for the answers thus far:
microbenchmark(CPak(), latemail(),Jul(), times=10L)
Unit: milliseconds
expr min lq mean median uq max neval
CPak() 37.83691 38.60585 43.66030 39.86094 44.9592 62.784 10
latemail() 4275.10 4536.783 4674.966 4712.938 4855.860 5045.069 10
Jul() 37.38809 39.87625 46.17202 44.90693 53.08938 57.33 10
I often work on this kind of problem.
You're better off avoiding a tidyverse answer and using a vectorized approach. I like to use outer in this case, which is fast. I calculate distance as Dist = sqrt((x1-x2)^2 + (y1-y2)^2).
myfun <- function() {
  Dx <- outer(Coor$row, PickedPoint$ppRow, "-")**2  # ** is same as ^
  Dy <- outer(Coor$col, PickedPoint$ppCol, "-")**2
  Dist <- sqrt(Dx + Dy)                    # full Coor x PickedPoint distance matrix
  minDistind <- apply(Dist, 1, which.min)  # index of the nearest PickedPoint per Coor row
  ans <- PickedPoint[minDistind, ]
}
Output (head)
ppRow ppCol
8 10 32
8.1 10 32
8.2 10 32
8.3 10 32
8.4 10 32
8.5 10 32
I compare to the other answers just for completeness
latemail <- function() {
closest <- sapply( 1:nrow(Coor), function(x) which.min(sqrt(rowSums(sweep(PickedPoint, MARGIN=1, STATS=unlist(Coor[x,]))^2))) )
}
Note I added sol <- PickedPoint[Coor$closest,] to Jul's function because the original function only returned the indexes
Jul <- function() {
require(sp)
require(dplyr)
Coor$closest <- spDists(as.matrix(Coor),as.matrix(PickedPoint)) %>% apply(1,which.min)
sol <- PickedPoint[Coor$closest,]
}
Benchmarking
library(microbenchmark)
microbenchmark(myfun(), latemail(), times=10L)
Unit: milliseconds
       expr        min         lq       mean     median         uq         max neval
    myfun()   50.34484   50.93591   53.75279   51.46284   55.46526    66.09656    10
 latemail() 9683.82227 9733.03489 9863.94716 9856.65472 9974.46137 10065.89549    10
microbenchmark(myfun(), Jul(), times=10L)
Unit: milliseconds
expr min lq mean median uq max neval
myfun() 47.85368 50.13398 63.84994 50.82162 58.62493 167.69221 10
Jul() 54.27473 54.38482 59.22976 58.56265 61.97588 69.11861 10
This illustrates why you should avoid a tidyverse approach here; a purely tidyverse solution would likely be even slower than the sapply one.
Note this answer compares all-to-all, which could be important if you're not using a simple toy example; with your toy example, you could use clever tricks to avoid all-to-all comparisons
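For instance (not part of the answers above, just a sketch of such a trick), a k-d tree lookup with RANN::nn2 avoids building the full distance matrix:
library(RANN)  # nn2() does Euclidean nearest-neighbour search via a k-d tree
nn <- nn2(data = as.matrix(PickedPoint), query = as.matrix(Coor), k = 1)
ans <- PickedPoint[nn$nn.idx[, 1], ]  # nearest PickedPoint for each Coor row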
I'd suggest using the sp package for this
library(sp)
library(dplyr)
Coor$closest <- spDists(as.matrix(Coor),as.matrix(PickedPoint)) %>% apply(1,which.min)
I'd like to sort out the fastest way to get the norm of a set of vectors contained in a matrix (one vector per row). I was using apply (this is an example; my matrices are much bigger):
a = matrix(1:9, 3,3)
norm_a = apply(a, 1, function(x) sqrt(sum(x^2)))
but then I wanted to speed up my code and moved to:
norm_a = sqrt(a^2%*%rep(1,dim(a)[2]))
which is actually much faster (checked with system.time; I'm not an expert in benchmarking). But I haven't found any definitive answer to this question so far. Does anyone have an insight about this?
Thanks
This depends on the size of your matrix:
library(microbenchmark)
microbenchmark(f1 = apply(a, 1, function(x) sqrt(sum(x^2))),
f2 = sqrt(a^2%*%rep(1,dim(a)[2])),
f3 = sqrt(rowSums(a^2)))
#Unit: microseconds
# expr min lq mean median uq max neval cld
# f1 44.656 46.812 52.21050 47.5815 49.4295 191.248 100 c
# f2 1.849 2.772 4.07532 4.3120 4.6210 16.323 100 a
# f3 6.160 7.392 9.25537 9.5480 10.1630 20.943 100 b
set.seed(42)
b <- matrix(rnorm(1e6), 1000)
microbenchmark(f1 = apply(b, 1, function(x) sqrt(sum(x^2))),
f2 = sqrt(b^2%*%rep(1,dim(b)[2])),
f3 = sqrt(rowSums(b^2)))
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# f1 30.851752 55.513228 86.84168 109.439043 112.54796 152.27730 100 b
# f2 5.503050 7.434152 14.36080 8.861268 10.42327 66.41539 100 a
# f3 4.430403 5.895553 12.92235 7.359163 8.62321 74.65256 100 a