R: What can I do about a slow double sapply?

R: What can I do about a slow double sapply? - r

I have a computation that does a version of this:
n <- 5
l <- 3
m <- seq(0,1,length.out = n)
r <- seq(3,4,length.out = n)
y <- 1:n
pp <- sapply(0:l, function(h) cumsum(y[(h+1):n]*y[1:(n-h)]))
rec.acf <- sapply(0:l, function(h) pp[[h+1]] + sapply((h+1):n, function(j) m[j] + r[j-h]) )
to obtain
> rec.acf
[[1]]
[1] 4.0 8.5 18.0 34.5 60.0
[[2]]
[1] 5.25 11.75 24.25 44.75
[[3]]
[1] 6.5 15.0 30.5
[[4]]
[1] 7.75 18.25
In practice, of course, n and l are much larger (and the actual functions, computing autocovariances over samples of increasing size, are more complicated).
When l is relatively small, as I had hoped for, the computations work much faster than other implementations I had worked out that do not take into account that I can recycle many identical computations through pp.
However, the picture reverses when l is large relative to n, likely because the outer sapply then sends off many inner loops. Is there anything obviously inefficient about my approach?
I tinkered with mapply, without much success.

The inner loop is unnecessary:
mm = lapply(0:l, function(h) tail(m, length(m) - h) + head(r, length(r) - h))
mapply("+", pp, mm)
#[[1]]
#[1] 4.0 8.5 18.0 34.5 60.0
#
#[[2]]
#[1] 5.25 11.75 24.25 44.75
#
#[[3]]
#[1] 6.5 15.0 30.5
#
#[[4]]
#[1] 7.75 18.25

Related

Fill NAs by smoothing the peak values

df <- data.frame(date = seq(from=as.POSIXct(as.Date("2020-10-01")),
to= as.POSIXct(as.Date("2020-10-02")) , by = 'hour'),
val = c(15,20,18,22,17,NA,NA,NA,80,14,23,16,19,21,NA,NA,60,18,15,20,22,19,NA,35,18))
There are uneven sequences of 'NA's followed by peak values e.g.: val = 80, 60 and 35 .
I would like to fill the 'NA's by smoothing the peak values. For example: in the first NA sequence, three NAs are followed by 80, which equals four data points therefore, 80 is divided by 4 = 20 .
Note: the peak values are NOT outliers, so the total sum of the data points should not change.
If possible, I would like to fill the NAs with the above conditions while reserving the signal behavior (trend and seasonality).
Many thanks.

The following function fills sequences of NA values with the next non-NA value divided by the sequence length.
fill_na <- function(x){
na <- is.na(x)
r <- rle(na)
div <- r$lengths[r$values] + 1L
cs <- cumsum(r$lengths)[r$values]
for(i in seq_along(div)){
if(cs[i] < length(x)){
x[ (cs[i] - div[i] + 1L):(cs[i] + 1L) ] <- x[ cs[i] + 1L ]/div[i]
}
}
x
}
fill_na(df$val)
# [1] 15.0 20.0 18.0 22.0 20.0 20.0 20.0 20.0 20.0 14.0 23.0
#[12] 16.0 19.0 20.0 20.0 20.0 20.0 18.0 15.0 20.0 22.0 17.5
#[23] 17.5 17.5 18.0

How to generate all possible vectors based on sampling without replacement in R?

I have a pool of seven numbers. I would like to generate all vectors of length 7 with:
two first elements are drawn from the pool of the 7 numbers.
two next elements are drawn from the 5 numbers left.
three final elements are drawn from the 3 numbers left.
That manner could be described by the vector c(2,2,3).
For example:
sample <- c(8.93,9.11,9.12,9.05,8.87,8.95,9.02)
structure <- c(2,2,3)
I know that there are 7C2*5C2*3C3 = 210 vectors like that. To be clearer I do not need permutation within each group of elements, for example, two vectors c(8.93,9.11,9.12,9.05,8.87,8.95,9.02) and c(9.11,8.93,9.12,9.05,8.87,8.95,9.02) are the same for me, I only need one of them appear on the list of 210 vectors.
Here is what I did using for loop, combn and setdiff. However, I would like to use the vector structure in the code and also make it more flexible, for example c(2,5) instead of c(2,2,3). Is there any tidier solution to generalize my problem, with the apply function family for example?
df<-data.frame()
sample <- c(8.93,9.11,9.12,9.05,8.87,8.95,9.02)
combn(sample,2) -> com1
for (i in 1:ncol(com1)){
com1[,i]
setdiff(sample,com1[,i]) -> com2
combn(com2,2) -> com3
for (j in 1:ncol(com3)){
setdiff(com2,com3[,j]) -> com4
c(com1[,i],com3[,j],com4) -> de
df <- rbind(df,de)
}
}
df

A recursive version in base R:
x <- c(8.93,9.11,9.12,9.05,8.87,8.95,9.02)
k <- c(2, 2, 3)
f <- function(el, l) {
if (length(l)==1L) {
return(data.frame(t(el)))
}
do.call(rbind, combn(el, l[1L],
#using code directly from setdiff for slight speedup and
#comparing integers for robustness
function(s) cbind(data.frame(t(s)), f(el[match(el, s, 0L) == 0L], l[-1L])),
simplify=FALSE))
}
apply(f(seq_along(x), k), 1L:2L, function(i) x[i])

Is this what you need? Felt like a contradiction in the question with "I would like to generate all vectors of length 7", but then saying you only need one of the 2 eg's. Using combn wouldn't you just end with one random sample?
library(combinat)
x1 <- permn(sample[1:2])
x2 <- permn(sample[3:4])
x3 <- permn(sample[5:7])
all <- expand.grid(x1, x2, x3)
apply(all, 1, unlist)

Since you mention combn and setdiff here is a possibility:
We first create a convenience function draw that draws ndraw samples from x and stores results in lst.
draw <- function(x, ndraw, lst) {
unlist(lapply(lst, function(y) {
lapply(
combn(setdiff(x, y), ndraw, simplify = F),
function(z) c(y, z))
}), recursive = F)
}
We can then define a function generate_samples to draw as many samples from x as there are entries in draws. I've added a check to ensure that the sum of draws equals the number of elements in x.
generate_samples <- function(x, draws) {
stopifnot(sum(draws) == length(x))
res <- list(NULL)
for (i in seq_along(draws)) res <- draw(x, draws[i], res)
res
}
In your particular case, we would do
lst <- generate_samples(sample, draws = structure)
#[[1]]
#[1] 8.93 9.11 9.12 9.05 8.87 8.95 9.02
#
#[[2]]
#[1] 8.93 9.11 9.12 8.87 9.05 8.95 9.02
#
#[[3]]
#[1] 8.93 9.11 9.12 8.95 9.05 8.87 9.02
#
#[[4]]
#[1] 8.93 9.11 9.12 9.02 9.05 8.87 8.95
#
#[[5]]
#[1] 8.93 9.11 9.05 8.87 9.12 8.95 9.02
#
#[[6]]
#[1] 8.93 9.11 9.05 8.95 9.12 8.87 9.02
# ....
We confirm that this indeed produces 210 elements in the output list
length(lst)
#[1] 210

find_combns_in_remainders <- function(list_combns_and_remainders, m) {
unlist(lapply(
list_combns_and_remainders,
function(.) combn(x = .$remainder,
m = m,
FUN = function(combination)
list(combination = c(.$combination, combination),
remainder = setdiff(.$remainder, combination)),
simplify = FALSE)
), recursive = FALSE)
}
Reduce(
x = structure,
f = find_combns_in_remainders,
init = list(list(combination = numeric(0), remainder = sample))
)
# [[1]]
# [[1]]$combination
# [1] 8.93 9.11 9.12 9.05 8.87 8.95 9.02
#
# [[1]]$remainder
# numeric(0)
#
#
# [[2]]
# [[2]]$combination
# [1] 8.93 9.11 9.12 8.87 9.05 8.95 9.02
#
# [[2]]$remainder
# numeric(0)
#
#
# [[3]]
# [[3]]$combination
# [1] 8.93 9.11 9.12 8.95 9.05 8.87 9.02
#
# [[3]]$remainder
# numeric(0)
#
#
# ....
#
#
# [[208]]
# [[208]]$combination
# [1] 8.95 9.02 9.12 9.05 8.93 9.11 8.87
#
# [[208]]$remainder
# numeric(0)
#
#
# [[209]]
# [[209]]$combination
# [1] 8.95 9.02 9.12 8.87 8.93 9.11 9.05
#
# [[209]]$remainder
# numeric(0)
#
#
# [[210]]
# [[210]]$combination
# [1] 8.95 9.02 9.05 8.87 8.93 9.11 9.12
#
# [[210]]$remainder
# numeric(0)

Estimating missing values in time-series data frame based on a "rate of change"

I am trying to use a loop in R to estimate values that will replace the NAs in my data frame based on a rate of change ("rate") that multiplies my last value (ok, this is confusing, but please refer to the example below). This is something similar to my data:
l1 <- c(NA,NA,NA,27,31,0.5)
l2 <- c(NA,8,12,28,39,0.5)
l3 <- c(NA,NA,NA,NA,39,0.3)
l4 <- c(NA,NA,11,15,31,0.2)
l5 <- c(NA,NA,NA,NA,51,0.9)
data <- as.data.frame(rbind(l1,l2,l3,l4,l5))
colnames(data) <- c("dbh1","dbh2","dbh3","dbh4","dbh5","rate")
So I created a loop to identify my first no-NA value in each line, then use that value to estimate its previous values based on the "rate". So for instance, in row 1, the first NA value would be replace by "27-(0.5*3)", then the second one would be "27-(0.5*2)" and the third one by "27-(0.5*1)". This is the loop I came up with. I know the first part (the outside loop) works but the the inside one doesn't:
for (i in 1: nrow(data)) {
dbh.cols <- data3[i,c("dbh1","dbh2","dbh3","dbh4","dbh5")]
sample.year <- which(dbh.cols != "NA")
data$first.dbh[i] <- min(dbh.cols, na.rm = T)
data$first.index[i] <- min(sample.year)
for (j on 1: (min(sample.year)-1)) {
ifelse(is.na(data[i,j]), min(dbh.cols, na.rm = T) - (min(sample.year)-j)*rate[i,j], data[i,j])
}
}
I am not good at programming so probably my internal loop strategy with "ifelse" is too weird (and wrong) but I just couldn't think of anything else that would work here... Any suggestions?

1) This uses no explicit loops, just an apply. It assumes that the NAs are all leading as in the example given.
fillIn <- function(x) {
rate <- tail(x, 1)
n <- sum(is.na(x)) # no of NAs
c(x[n+1] - rate * seq(n, 1), na.omit(x))
}
replace(data, TRUE, t(apply(data, 1, fillIn)))
giving:
dbh1 dbh2 dbh3 dbh4 dbh5 rate
l1 25.5 26.0 26.5 27.0 31 0.5
l2 7.5 8.0 12.0 28.0 39 0.5
l3 37.8 38.1 38.4 38.7 39 0.3
l4 10.6 10.8 11.0 15.0 31 0.2
l5 47.4 48.3 49.2 50.1 51 0.9
2) Here is a second approach that uses na.approx from the zoo package. It does not require apply. Here data1 has the same content as data except that the first column is filled in. The other NAs remain. The last line uses na.approx to fill in the remaining NAs linearly.
library(zoo)
NAs <- rowSums(is.na(data))
data1 <- cbind( data[cbind(1:nrow(data), NAs + 1)] - data$rate * NAs, data[-1] )
replace(data, TRUE, t(na.approx(t(data1))))
giving:
dbh1 dbh2 dbh3 dbh4 dbh5 rate
l1 25.5 26.0 26.5 27.0 31 0.5
l2 7.5 8.0 12.0 28.0 39 0.5
l3 37.8 38.1 38.4 38.7 39 0.3
l4 10.6 10.8 11.0 15.0 31 0.2
l5 47.4 48.3 49.2 50.1 51 0.9
2a) A variation on (2) uses na.locf in the middle line to bring forward the first non-NA in each row. The first and last lines are the same.
library(zoo)
NAs <- rowSums(is.na(data))
data1 <- cbind(na.locf(t(data), fromLast = TRUE)[1, ] - data$rate * NAs, data[-1])
replace(data, TRUE, t(na.approx(t(data1))))

You do not need to use multiple for loops for this. Here is some simplified code to do what you want just for the for loop. Working explicitly with your data we need to get the first non-NA value from each row.
for_estimate <- apply(data, 1, function(x) x[min(which(is.na(x) == FALSE))])
Secondly, we need to determine what integer to multiply the rate by for each row depending on how many NA values there are.
# total number of NA values per row
n_na <- apply(data,1, function(x) sum(is.na(x)) )
# make it a matrix with a 0's appended on
n_na <- matrix(c(n_na, rep(0, nrow(data) * (ncol(data)-1))),
nrow = nrow(data), ncol = ncol(data)-1)
# fill in the rest of the matrix
for(i in 2:ncol(n_na)){
n_na[,i] <- n_na[,i-1] -1
}
Once we have that we can use this code to back fill the NA values in that way you are interested in.
for(i in (ncol(data)-1):1){
if(sum(is.na(data[,i]))>0){
to_fill <- which(is.na(data[,i])==TRUE)
data[to_fill,i] <- for_estimate[to_fill] - (data$rate[to_fill]*(n_na[to_fill,i])
}
}
output
dbh1 dbh2 dbh3 dbh4 dbh5 rate
l1 25.5 26.0 26.5 27.0 31 0.5
l2 7.5 8.0 12.0 28.0 39 0.5
l3 37.8 38.1 38.4 38.7 39 0.3
l4 10.6 10.8 11.0 15.0 31 0.2
l5 47.4 48.3 49.2 50.1 51 0.9

Compute mean across intervals of rows

I have a long matrix, of which I need to compute the mean for a certain interval of rows. At the moment I am doing this manually like this:
values <- cbind(1:50,1)
meanqual10 <- mean(values[1:10,1])
meanqual10
[1] 5.5
meanqual15 <- mean(values[11:20,1])
meanqual15
[1] 15.5
meanqual20 <- mean(values[21:30,1])
meanqual20
[1] 25.5
meanqual25 <- mean(values[31:40,1])
meanqual25
[1] 35.5
meanqual30 <- mean(values[41:50,1])
meanqual30
[1] 45.5
There must be a nicer way of doing this. Can anybody help please?

Señor O's answer is nice if you have regular intervals. Another approach, if you want to select arbitrary rows could be something like:
l <- list(1:10,11:20,21:30,31:40,41:50) # vectors of any length or ordering
sapply(l, function(x) mean(values[x,1]))
Which gives:
[1] 5.5 15.5 25.5 35.5 45.5
And of course by can do this for arbitrary rows, too. This is just a slightly different approach.

by(values[,1], ceiling(1:50 / 10), mean)
ceiling(1:50/10) creates a vector of length 50 with a new integer every 10 numbers.
by will then take the mean for each group of the same integer.
Result:
ceiling(1:50/10): 1
[1] 5.5
------------------------------------------------------------------------------------------------------
ceiling(1:50/10): 2
[1] 15.5
------------------------------------------------------------------------------------------------------
ceiling(1:50/10): 3
[1] 25.5
------------------------------------------------------------------------------------------------------
ceiling(1:50/10): 4
[1] 35.5
------------------------------------------------------------------------------------------------------
ceiling(1:50/10): 5
[1] 45.5

Automatic multiplication between vector and matrix

I have this R code:
> coef
[1] 1.5 2.4 3.9 4.4
> y
[,1] [,2] [,3] [,4]
[1,] 1 2 12 45
[2,] 5 6 7 8
[3,] 9 10 2 12
[4,] 13 14 15 45
[5,] 17 18 39 7
I have to multiply each value of the column with the respective coef. The result should be:
First column:
1*1.5
5*1.5
9*1.5
13*1.5
17*1.5
Second column:
2*2.4
6*2.4
10*2.4
14*2.4
18*2.4
Third column:
12*3.9
7*3.9
2*3.9
15*3.9
39*3.9
Fourth column:
45*4.4
8*4.4
12*4.4
45*4.4
7*4.4
All the column's values moltiplied by the same coefficient at the same index in the vector.
How can I do this calculation?
The solution could be:
> y[,1] <- y[,1] * coef[1]
> y[,2] <- y[,2] * coef[2]
> y[,3] <- y[,3] * coef[3]
> y[,4] <- y[,4] * coef[4]
But doesn't seem too optimized! Something better?
Thank you!

This will give you what you want:
t( t(y) * coef )

Two more possibilities: sweep and scale (the latter only operates columnwise, and seems to me to be a bit of hack).
coef <- c(1.5,2.4,3.9,4.4)
y <- matrix(c(seq(1,17,by=4),
seq(2,18,by=4),
c(12,7,2,15,39,
45,8,12,45,7)),
ncol=4)
t(t(y)*coef)
t(apply(y,1,"*",coef))
sweep(y,2,coef,"*")
scale(y,center=FALSE,scale=1/coef)
library(rbenchmark)
benchmark(t(t(y)*coef),
y %*% diag(coef),
t(apply(y,1,"*",coef)),
sweep(y,2,coef,"*"),
scale(y,center=FALSE,scale=1/coef),
replications=1e4)
test replications elapsed relative
5 scale(y, center = FALSE, scale = 1/coef) 10000 0.990 4.342105
4 sweep(y, 2, coef, "*") 10000 0.846 3.710526
3 t(apply(y, 1, "*", coef)) 10000 1.537 6.741228
1 t(t(y) * coef) 10000 0.228 1.000000
2 y %*% diag(coef) 10000 0.365 1.600877
edit: added y %*% diag(coef) from #baptiste [not fastest, although it might be so for a big problem with a sufficiently optimized BLAS package ...] [and it was fastest in another trial, so I may just not have had a stable estimate]
edit: fixed typo in t(t(y)*coef) [thanks to Timur Shtatland] (but did not update timings, so they might be slightly off ...)
I also tried library(Matrix); y %*% Diagonal(x=coef), which is very slow for this example but might be fast for a large matrix (??). (I also tried constructing the diagonal matrix just once, but even multiplication by a predefined matrix was slow in this example (25x slower than the best, vs. 47x slower when defining the matrix on the fly.)
I have a mild preference for sweep as I think it expresses most clearly the operation being done ("multiply the columns by the elements of coef")

apply(y, 1, "*", coef)
# -- result --
[,1] [,2] [,3] [,4] [,5]
[1,] 1.5 7.5 13.5 19.5 25.5
[2,] 4.8 14.4 24.0 33.6 43.2
[3,] 46.8 27.3 7.8 58.5 152.1
[4,] 198.0 35.2 52.8 198.0 30.8

A late entry:
coef[col(y)]*y
On my system, this is the fastest.
test replications elapsed relative
6 coef[col(y)] * y 10000 0.068 1.000
5 scale(y, center = FALSE, scale = 1/coef) 10000 0.640 9.412
4 sweep(y, 2, coef, "*") 10000 0.535 7.868
3 t(apply(y, 1, "*", coef)) 10000 0.837 12.309
1 t(t(y) * coef) 10000 0.176 2.588
2 y %*% diag(coef) 10000 0.187 2.750

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

R: What can I do about a slow double sapply? - r

The inner loop is unnecessary: mm = lapply(0:l, function(h) tail(m, length(m) - h) + head(r, length(r) - h)) mapply("+", pp, mm) #[[1]] #[1] 4.0 8.5 18.0 34.5 60.0 # #[[2]] #[1] 5.25 11.75 24.25 44.75 # #[[3]] #[1] 6.5 15.0 30.5 # #[[4]] #[1] 7.75 18.25

Related

Fill NAs by smoothing the peak values

How to generate all possible vectors based on sampling without replacement in R?

Estimating missing values in time-series data frame based on a "rate of change"

Compute mean across intervals of rows

Automatic multiplication between vector and matrix

Categories

Resources