In R, split a vector randomly into k chunks? - r

I have seen many variations on the "split vector X into Y chunks in R" question on here. See for example: here and here for just two. So, when I realized I needed to split a vector into Y chunks of random size, I was surprised to find that the randomness requirement might be "new"--I couldn't find a way to do this on here.
So, here's what I've drawn up:
k.chunks = function(seq.size, n.chunks) {
break.pts = sample(1:seq.size, n.chunks, replace=F) %>% sort() #Get a set of break points chosen from along the length of the vector without replacement so no duplicate selections.
groups = rep(NA, seq.size) #Set up the empty output vector.
groups[1:break.pts[1]] = 1 #Set the first set of group affiliations because it has a unique start point of 1.
for (i in 2:(n.chunks)) { #For all other chunks...
groups[break.pts[i-1]:break.pts[i]] = i #Set the respective group affiliations
}
groups[break.pts[n.chunks]:seq.size] = n.chunks #Set the last group affiliation because it has a unique endpoint of seq.size.
return(groups)
}
My question is: Is this inelegant or inefficient somehow? It will get called 1000s of times in the code I plan to do, so efficiency is important to me. It'd be especially nice to avoid the for loop or having to set both the first and last groups "manually." My other question: Are there logical inputs that could break this? I recognize that n.chunks cannot > seq.size, so I mean other than that.

That should be pretty quick for smaller numbers. But here a more concise way.
k.chunks2 = function(seq.size, n.chunks) {
break.pts <- sort(sample(1:seq.size, n.chunks - 1, replace = FALSE))
break.len <- diff(c(0, break.pts, seq.size))
groups <- rep(1:n.chunks, times = break.len)
return(groups)
}
If you really get a huge number of groups, I think the sort will start to cost you execution time. So you can do something like this (probably can be tweaked to be even faster) to split based on proportions. I am not sure how I feel about this, because as n.chunks gets very large, the proportions will get very small. But it is faster.
k.chunks3 = function(seq.size, n.chunks) {
props <- runif(n.chunks)
grp.props <- props / sum(props)
chunk.size <- floor(grp.props[-n.chunks] * seq.size)
break.len <- c(chunk.size, seq.size - sum(chunk.size))
groups <- rep(1:n.chunks, times = break.len)
return(groups)
}
Running a benchmark, I think any of these will be fast enough (unit is microseconds).
n <- 1000
y <- 10
microbenchmark::microbenchmark(k.chunks(n, y),
k.chunks2(n, y),
k.chunks3(n, y))
Unit: microseconds
expr min lq mean median uq max neval
k.chunks(n, y) 49.9 52.05 59.613 53.45 58.35 251.7 100
k.chunks2(n, y) 46.1 47.75 51.617 49.25 52.55 107.1 100
k.chunks3(n, y) 8.1 9.35 11.412 10.80 11.75 44.2 100
But as the numbers get larger, you will notice a meaningful speedup (note the unit is now milliseconds).
n <- 1000000
y <- 100000
microbenchmark::microbenchmark(k.chunks(n, y),
k.chunks2(n, y),
k.chunks3(n, y))
Unit: milliseconds
expr min lq mean median uq max neval
k.chunks(n, y) 46.9910 51.38385 57.83917 54.54310 56.59285 113.5038 100
k.chunks2(n, y) 17.2184 19.45505 22.72060 20.74595 22.73510 69.5639 100
k.chunks3(n, y) 7.7354 8.62715 10.32754 9.07045 10.44675 58.2093 100
All said and done, I would probably use my k.chunks2() function.

Random is probably inefficient, but it would seem to be expected that it should be so. Random suggests all input elements should also be random. So, considering a desired random selection from a vector Y; it would seem the effort should be applied to an index of Y, and successive Y(s), that would be or seem random. With sufficient sets of Y(s) it can be discerned how far from completely random the indexing is, but maybe that is not material, or perhaps merely thousands of repetitions is insufficient to demonstrate it.
None the less, my sense is that both inputs to sample need to be 'random' in some way as a certainty in one reduces the randomness of the other.
my_vector <- c(1:100000)
sample_1 <- sample(my_vector, 50, replace = FALSE)
sample_2 <- sample(my_vector, 80, replace = FALSE)
full_range <- c(1, sort(unique(sample1,sample2)), 100000)
starts <- full_range[c(TRUE,FALSE)]#[generally](https://stackoverflow.com/questions/33257610/how-to-return-the-elements-in-the-odd-position)
ends <- full_range[c(FALSE, TRUE)]
!unique(diff(full_range))
And absent setting seed, I think non-reproducible is as close as you get to a random selection upon Y(s). This answer is just to suggest an approach to indexing Y. The use of indices thereafter might follow #Adam 's approach. And, of course, I could be completely wrong about all of this. Clearer random thinkers than me might well weigh in...

Related

Code performance: apply family or optimized alternatives

I've always taken it as fact that colMeans() or colSums() are the fastest way to perform their respective operations. As a ground rule, I am talking about within base and not dplyr or data.table implementations.
While teaching some new users, I ran the benchmark myself to prove the point. I am now consistently seeing contradicting conclusions.
n = 10000
p = 100
test_matrix <- matrix(runif(n*p), n, p)
test_df <- as.data.frame(test_matrix)
benchmark <- microbenchmark(
colMeans(test_df),
colMeans(as.matrix(test_df)),
sapply(test_df, mean),
vapply(test_df, mean, 0),
colMeans(test_matrix),
apply(test_matrix, 2, mean)
)
Unit: microseconds
expr min lq mean median uq max neval
colMeans(test_df) 3099.941 3165.8290 3733.024 3241.345 3617.039 11387.090 100
colMeans(as.matrix(test_df)) 3091.634 3158.0880 3553.537 3241.345 3548.507 8531.067 100
sapply(test_df, mean) 2209.227 2267.3750 2723.176 2338.172 2602.289 10384.612 100
vapply(test_df, mean, 0) 2180.153 2228.2945 2611.982 2270.584 2514.123 7421.356 100
colMeans(test_matrix) 904.307 915.0685 1020.085 939.422 1002.667 2985.911 100
apply(test_matrix, 2, mean) 9748.388 9957.0020 12098.328 10330.429 12582.889 34873.009 100
For a matrix, colMeans() torches apply() That is expected. But for a data frame, sapply() and vapply() routinely beat colMeans(), even as I increase n and p. Is there a reason why I would want to use colMeans() on a data frame? It appears that the difference comes from the overhead associated with converting the data frame back into a matrix.
Main Question
In other words, is there a reason why (a more formal version of) the following would be inadvisable? Benchmarks show basically no drop off. Obviously this makes an assumption about the input the user pushes in, but that is not the point here.
colMeans2 <- function(myobject) {
if (typeof(myobject) == "double") {
colMeans(myobject)
} else if (typeof(myobject) == "list") {
vapply(myobject, mean, 0)
} else {
stop("what is this")
}
}
For Reference
Here are two other posts I could find, both somewhat related and mentioning how colMeans() should be faster.
Grouping functions (tapply, by, aggregate) and the *apply family
Why are `colMeans()` and `rowMeans()` functions faster than using the mean function with `lapply()`?

Build a Grid based on two input vectors

I'm trying (by using R) to build a "grid" in a matrix based on two input vectors. So, the idea is to avoid nested loop like this:
inputVector1=1:4
inputVector2=1:4
grid=NULL
for(i in inputVector1){
line=NULL
for(j in inputVector2){
cellValue=i+j # Instead of i+j it can be anything like taking a value in a dataframe
line=cbind(line,cellValue)
}
grid=rbind(grid,line)
}
Is there a dedicated function in R to do this kind of job faster and simpler ? I know there is apply family functions but I didn't found a proper way to do it (without combining multiple apply family functions). Thank you for the help.
Loops are kind of simple and they are not necessarily slow. However, it depends on how to use those loops. In your code (I call your approach L.GUEGAN(), for further reference), for instance, you don't exploit the fact that you know the size of your ultimate grid and you keep expanding vectors, matrices. That slows things down. A very simple alternative would be
niceFor <- function() {
grid <- matrix(0, nrow = length(inputVector1), ncol = length(inputVector2))
for(i in seq_along(inputVector1))
for(j in seq_along(inputVector2))
grid[i, j] <- i + j
grid
}
where the essential difference is predefining the grid object and updating its values, rather than creating new objects.
Yes, you may say that there is a dedicated function for what:
outer(inputVector1, inputVector2, `+`)
However, one needs to keep in mind that the function in the third argument needs to be vectorized, which is the case in this situation. That is, vectors are allowed when using addition
1:2 + 3:4
# [1] 4 6
`+`(1:2, 3:4)
# [1] 4 6
However, some other functions are not vectorized. E.g.,
seq(3:4, 6:7)
# Error in seq.default(3:4, 6:7) : 'from' must be of length 1
In that case, if you use outer, take a look at ?Vectorize.
Certain operations have even "more direct" dedicated functions. E.g., if we had
grid[i, j] <- i * j
Then you should use
inputVector1 %*% t(inputVector2)
as it would be faster and cleaner than both loops and outer.
A comparison of the three approaches mentioned before
microbenchmark(L.GUEGAN(), niceFor(), funOuter(), times = 2000)
# Unit: microseconds
# expr min lq mean median uq max neval cld
# L.GUEGAN() 24.354 33.8645 38.933968 35.6315 40.878 295.661 2000 c
# niceFor() 4.011 4.7820 6.576742 5.4050 7.697 29.547 2000 a
# funOuter() 4.928 6.1935 8.701545 7.3085 10.619 74.449 2000 b
So, the nice for loop seems even to be superior if speed matters. Notice that you could further improve it by exploiting symmetry of your grid: you could compute only half of the matrix manually and then use your results to fill the other triangle.
Thanks to #hrbrmstr this is what I was looking for:
outer( 1:4, 1:4, function(a,b){mapply(FUN = function(x,y){return(x+y)},a,b)} )

Difference between mean and manual calculation in R?

I am writing a simple function in R to calculate percentage differences between two input numbers.
pdiff <-function(a,b)
{
if(length(a>=1)) a <- median(a)
if(length(b>=1)) b <- median(b)
(abs(a-b)/((a+b)/2))*100
}
pdiffa <-function(a,b)
{
if(length(a>=1)) a <- median(a)
if(length(b>=1)) b <- median(b)
(abs(a-b)/mean(a,b))*100
}
When you run it with a random value of a and b, the functions give different results
x <- 5
y <- 10
pdiff(x,y) #gives 66%
pdiffa(x,y) #gives 100%
When I go into the code, apparently the values of (x+y)/2 = 7.5 and mean(x,y) = 5 differ......Am I missing something really obvious and stupid here?
This is due to a nasty "gotcha" in the mean() function (not listed in the list of R traps, but probably should be): you want mean(c(a,b)), not mean(a,b). From ?mean:
mean(x, ...)
[snip snip snip]
... further arguments passed to or from other methods.
So what happens if you call mean(5,10)? mean calls the mean.default method, which has trim as its second argument:
trim the fraction (0 to 0.5) of observations to be trimmed from each end of x before the mean is computed. Values of trim outside that range are taken as the nearest endpoint.
The last phrase "values of trim outside that range are taken as the nearest endpoint" means that values of trim larger than 0.5 are set to 0.5, which means that we're asking mean to throw away 50% of the data on either end of the data set, which means that all that's left is the median. Debugging our way through mean.default, we see that we indeed end up at this code ...
if (trim >= 0.5)
return(stats::median(x, na.rm = FALSE))
So mean(c(x,<value_greater_than_0.5>)) returns the median of c(5), which is just 5 ...
Try mean(5, 10) by itself.
mean(5, 10)
[1] 5
Now try mean(c(5, 10)).
mean(c(5, 10))
[1] 7.5
mean takes a vector as its first argument.

Speed up the calculation of squared error for large arrays in R

Basically I am helping someone to write some code for their research, but my usual time saving tactics have not reduced the run time of her algorithm enough for it to be reasonable. I was hoping someone else might know a better way to make a function run quickly based on an example I have written to avoid including information about the research.
The object in the example is smaller than the one she is using (but can easily be made larger). For the actual algorithm, this piece takes about 3 minutes in a small case, but might take 8-10 in the full case, and needs to run probably 1000-10000 times. This is the reason I need to seriously reduce the run time.
How I am currently doing this (hopefully with enough comments to make my thought process obvious):
example<-array(rnorm(100000), dim=c(5, 25, 40, 20))
observation <- array(rnorm(600), dim=c(5, 5, 12))
calc.err<-function(value, observation){
#'This creates the squared error for each observation, and each point in the
#'example array, across the five values in the first dimension of each
sqError<-(value-observation)^2
#'the apply function here sums up the squared error for each observation and
#'point. This is the value returned
return(apply(sqError, c(2,3), function(x) sum(x)))
}
run<-apply(example, c(2,3,4), function(x) calc.err(x, observation))
#'It isn't returned in the right format (small problem) but reformatting is fast
format<-array(run, dim=c(5, 12, 25, 40, 20))
Will clarify if necessary.
edit:
The data.table package appears to be very helpful. I will have to learn that package, but preliminaries seem to be much faster. I guess I was working with arrays because the code she gave me to make faster had the objects formatted that way. Didn't even think about changing it
Here's a couple simple refactors along with the timings:
calc.err2 <- function(value, observation){
#'This creates the squared error for each observation, and each point in the
#'example array, across the five values in the first dimension of each
sqError<-(value-observation)^2
#' getting rid of the anonymous function
apply(sqError, c(2,3), sum)
}
calc.err3 <- function(value, observation){
#'This creates the squared error for each observation, and each point in the
#'example array, across the five values in the first dimension of each
sqError<-(value-observation)^2
#' replacing with colSums
colSums(sqError)
}
R>microbenchmark(times=8, apply(example, 2:4, calc.err, observation),
+ apply(example, 2:4, calc.err2, observation),
+ apply(example, 2:4, calc.err3, observation)
+ )
Unit: milliseconds
expr min lq
apply(example, 2:4, calc.err, observation) 2284.350162 2321.875878
apply(example, 2:4, calc.err2, observation) 2194.316755 2257.007572
apply(example, 2:4, calc.err3, observation) 645.004808 652.567611
mean median uq max neval
2349.7524509 2336.6661645 2393.3452420 2409.894876 8
2301.7896566 2298.9346090 2362.5479790 2383.020177 8
681.3176878 667.9070175 720.7049605 723.177516 8
colSums is way faster than the corresponding apply.

Writing a window function with state using only R's basics

I am trying to write R code which acts as a "moving window", just with memory (state). I have figured out (thanks to this question) how to apply a function to subsequent tuples of elements. For example, if I wish to write a (simple) moving average with a typical period 4, I would do the following:
mapply(myfunc, x[1:(length(x)-4)], x[2:(length(x)-3)], x[3:(length(x)-2)], x[4:(length(x)-1)])
Where myfunc is a function with 4 arguments, which calculates their mean (I cannot use mean, as it expects only 1 argument, and I don't know how to make the 4 arguments a single vector).
That's quite cumbersome, though, and if the typical period is 100, say, I am not sure how to do it.
So here's my first question: how do I generalize this?
But here's another issue: suppose I wish the applied function to be able to save state. A simple example would be to keep record of how many values it was applied on so far. Another example is the exponential moving average (EMA), which is not really a window function, but instead a function which works on single values but which keeps state (the last resulted mean).
How can I write a function which when applied to a vector, works on its values one by one, returning a vector of the same length, which is able to retain its last output every time, or save any other "state" during its calculations? In Python, for example, I'd use classes for that, but that's quite difficult in R.
Important note: I am not interested in auxiliary R packages like zoo or TTR to do the work for me. I am trying to learn R, and in any case the functions I wish to write, while having similarities with MA or EMA, are custom, and do not exist in any of these packages.
Regarding your first question,
n <- length(x)
k <- 4
r <- embed(x, n-k)[1:k, seq(n-k, 1)]
do.call("mapply", c("myfunc", split(r, 1:k)))
Regarding the second question, Reduce can be used to iterate over a vector saving state.
For things like this you should consider using a plain for loop:
x <- runif(10000)
k <- 100
n <- length(x)
res <- numeric(n - k)
library(microbenchmark)
microbenchmark(times=5,
for(i in k:n) res[i - k + 1] <- sum(vec[i:(i + k)]),
{
r <- embed(x, n-k)[1:k, seq(n-k, 1)]
gg <- do.call("mapply", c("sum", split(r, 1:k)))
},
flt <- filter(x, rep(1, k))
)
Produces:
Unit: milliseconds
min lq median uq max neval
for 163.5403 164.4929 165.2543 166.6315 167.0608 5
embed/mapply 1255.2833 1307.3708 1338.2748 1341.5719 1405.1210 5
filter 6.7101 6.7971 6.8073 6.8161 6.8991 5
Now, the results are not identical and I don't pretend to understand exactly what GGrothendieck is doing with embed, but generally speaking for loops are just as fast as *pply functions so long as you initialize your result vectors first. Windowed calculations don't lend themselves well to vectorization, so might as well use a for loop.
EDIT: as several have pointed out in comments, there appears to be an internally implemented function to do (filter) this that is quite a bit faster, so that seems to be the best option (though you should confirm it actually does what you want as again, the results are not exactly identical and I am not personally familiar with the function; in it's default configuration it appears to do a rolling weighted sum, or sum if weights are 1, with a centered window).

Resources