Sampling using common random numbers in R (efficiently!)

Is there any way to perform sampling using common random numbers with R?
There are many cases where you do the following many times (for instance, if you want to plot Monte Carlo estimates at many different parameter values). First, you sample, say, ten thousand variates from a normal distribution, and second, you take the average of some function of these samples, returning a single floating-point number. Now, if I wanted to change a few parameters, altering either of these two steps, I would have to redo them over and over again.
The naive way would be to sample fresh draws over and over again using some function like rnorm(). A less naive way would be to use a different function that takes a large collection of common random numbers. However, with this approach there might still be a lot of copying going on, because R mostly uses pass-by-value semantics. What are some tools that would allow me to get around this and avoid all that copying in the second situation?
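For concreteness, here is a toy version of the pattern I mean (the function of the samples is arbitrary and purely illustrative):

# Toy version of the pattern: for each parameter value, draw fresh normals
# and average some (arbitrary, illustrative) function of them.
mc_estimate <- function(mu, n = 1e4) {
  draws <- rnorm(n, mean = mu)   # step 1: sample
  mean(pmax(draws, 0))           # step 2: average a function of the samples
}

mus <- seq(-2, 2, by = 0.25)
estimates <- sapply(mus, mc_estimate)  # re-samples 10,000 variates at every mu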

I think you're asking two types of questions here:
1. Programmatically, can we preserve a large pull of random data in a way that side-steps R's default pass-by-value semantics?
2. Mathematically, if we make a large pull of random data and pick from it piecemeal, can we arbitrarily change the parameters used in the pull?
The answer to 1 is "yes": pass-by-reference semantics are possible in R, but they take a little more work. All of the implementations I've seen and played with are done with environments or non-R-native objects (C/C++ pointers to structs or such). Here is one example that caches a large pull of random "normal" data and checks the pool of available data on each call:
my_rnorm_builder <- function(deflen = 10000) {
  .cache <- numeric(0)
  .index <- 0L
  .deflen <- deflen
  check <- function(n) {
    if ((.index + n) > length(.cache)) {
      message("reloading") # this should not be here "in-production"
      l <- length(.cache)
      .cache <<- c(.cache[.index + seq_len(l - .index)],
                   rnorm(.deflen + n + l))
      .index <<- 0L
    }
  }
  function(n, mean = 0, sd = 1) {
    check(n)
    if (n > 0) {
      out <- mean + sd * .cache[.index + seq_len(n)]
      .index <<- .index + as.integer(n)
      return(out)
    } else return(numeric(0))
  }
}
It is by no means resilient to hostile users or other likely mistakes, and it does not guarantee the length of the remaining available random numbers. (Putting in checks like that would slow it down below a threshold of reasonableness, with the benchmark below in mind.)
Demo of it in operation:
my_rnorm <- my_rnorm_builder(1e6)
# starts empty
get(".index", env=environment(my_rnorm))
# [1] 0
length(get(".cache", env=environment(my_rnorm)))
# [1] 0
set.seed(2)
my_rnorm(3) # should see "reloading"
# reloading
# [1] -0.8969145 0.1848492 1.5878453
my_rnorm(3) # should not see "reloading"
# [1] -1.13037567 -0.08025176 0.13242028
# prove that we've changed things internally
get(".index", env=environment(my_rnorm))
# [1] 6
length(get(".cache", env=environment(my_rnorm)))
# [1] 1000003
head(my_rnorm(1e6)) # should see "reloading"
# reloading
# [1] 0.7079547 -0.2396980 1.9844739 -0.1387870 0.4176508 0.9817528
Let's make sure that the random-number scaling of sigma*x+mu makes sense by starting over and re-setting our seed:
# reload the definition of my_rnorm
my_rnorm <- my_rnorm_builder(1e6)
length(get(".cache", env=environment(my_rnorm)))
# [1] 0
set.seed(2)
my_rnorm(3) # should see "reloading"
# reloading
# [1] -0.8969145 0.1848492 1.5878453
my_rnorm(3, mean = 100) # should not see "reloading"
# [1] 98.86962 99.91975 100.13242
So to answer question 2: "yes". Quick inspection reveals that those last three numbers are indeed "100 plus" the numbers in the second my_rnorm(3) in the previous block. So shifting/scaling "normal" random numbers by mu and sigma works. And we did this while still using the large pre-pulled cache of random data.
But is it worth it? This is a naïve test/comparison in and of itself; constructive suggestions are welcome.
t(sapply(c(1, 5, 10, 100, 1000, 10000), function(n) {
  s <- summary(microbenchmark::microbenchmark(
    base = rnorm(n),
    my = my_rnorm(n),
    times = 10000, unit = "ns"
  ))
  c(n = n, setNames(s$median, s$expr))
}))
# reloading
# ... ("reloading" is printed several more times as the cache refills during the benchmark)
# n base my
# [1,] 1 1100 1100
# [2,] 5 1400 1300
# [3,] 10 1600 1400
# [4,] 100 6400 2000
# [5,] 1000 53100 6600
# [6,] 10000 517000 49900
(All medians are in nanoseconds.) So while it would have seemed intuitive that "smaller pulls done more frequently" (with rnorm) would benefit most from this caching, I cannot explain why it is not very helpful until pulls of 100 and greater.
Can this be extended to other distributions? Almost certainly. "Uniform" would be straightforward (similarly scale and shift), but some others might take a little more calculus to do correctly. (For instance, it is not obvious without more research how the "t" distribution could alter the degrees-of-freedom on pre-pulled data ... if that's even possible. Though I do count myself a statistician in some ways, I am not prepared to claim yes/no/maybe on that one yet :-)
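For the uniform case, a minimal sketch (leaving out the reload logic from my_rnorm_builder above, purely to show the rescaling):

my_runif_builder <- function(deflen = 10000) {
  .cache <- runif(deflen)   # pool of standard Uniform(0,1) draws
  .index <- 0L
  function(n, min = 0, max = 1) {
    stopifnot(.index + n <= length(.cache))   # no reload in this sketch
    out <- min + (max - min) * .cache[.index + seq_len(n)]
    .index <<- .index + as.integer(n)
    out
  }
}

my_runif <- my_runif_builder(1e6)
my_runif(3, min = 10, max = 20)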

An addition to r2evans' answer concerning "is it worth it?": I don't think so, since instead of caching random draws one could also use a faster RNG. Here I am adding dqrnorm from my dqrng package to the comparison. In my benchmarks:
dqrnorm is the fastest method for n <= 100
for n > 100, caching and dqrnorm are comparable and both much faster than rnorm
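For reference, a rough sketch of how dqrnorm slots into the earlier benchmark (assuming the dqrng package is installed; exact timings will of course vary by machine):

library(dqrng)
dqset.seed(42)   # dqrng uses its own seeding function

microbenchmark::microbenchmark(
  base  = rnorm(1e4),
  my    = my_rnorm(1e4),
  dqrng = dqrnorm(1e4),
  times = 1000
)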

Related

Using parSapply to generate random numbers

I am trying to run a function that has a random number generator within it. The results are not what I expected, so I have done the following test:
# Case 1
set.seed(100)
A1 = matrix(NA,20,10)
for (i in 1:10) {
A1[,i] = sample(1:100,20)
}
# Case 2
set.seed(100)
A2 = sapply(seq_len(10),function(x) sample(1:100,20))
# Case 3
require(parallel)
set.seed(100)
cl <- makeCluster(detectCores() - 1)
A3 = parSapply(cl,seq_len(10), function(x) sample(1:100,20))
stopCluster(cl)
# Check: Case 1 result equals Case 2 result
identical(A1,A2)
# [1] TRUE
# Check: Case 1 result does NOT equal to Case 3 result
identical(A1,A3)
# [1] FALSE
# Check2: Would like to check if it's a matter of ordering
range(rowSums(A1))
# [1] 319 704
range(rowSums(A3))
# [1] 288 612
In the above code, parSapply generates a different set of random numbers than A1 and A2. My purpose in having Check2 is that I suspected parSapply might simply alter the ordering; however, that does not seem to be the case, as the max and min row sums of these random numbers are different.
I would appreciate it if someone could shed some light on why parSapply gives a different result from sapply. What am I missing here?
Thanks in advance!
Have a look at vignette("parallel") and in particular at "Section 6 Random-number generation". Among other things it states the following:
Some care is needed with parallel computation using (pseudo-)random numbers: the processes/threads which run separate parts of the computation need to run independent (and preferably reproducible) random-number streams.
When an R process is started up it takes the random-number seed from the object .Random.seed in a saved workspace or constructs one from the clock time and process ID when random-number generation is first used (see the help on RNG). Thus worker processes might get the same seed
because a workspace containing .Random.seed was restored or the random number generator has been used before forking: otherwise these get a non-reproducible seed (but with very high probability a different seed for each worker).
You should also have a look at ?clusterSetRNGStream.
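For example, a minimal sketch of reproducible parallel draws for your test case (the L'Ecuyer-CMRG streams set by clusterSetRNGStream make repeated parallel runs identical, though they will still differ from the serial A1/A2):

library(parallel)
cl <- makeCluster(detectCores() - 1)
clusterSetRNGStream(cl, iseed = 100)   # reproducible per-worker streams
A3  <- parSapply(cl, seq_len(10), function(x) sample(1:100, 20))
clusterSetRNGStream(cl, iseed = 100)   # reset the streams
A3b <- parSapply(cl, seq_len(10), function(x) sample(1:100, 20))
stopCluster(cl)
identical(A3, A3b)   # TRUE: reproducible across runs, but still not equal to A1/A2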

R probability simulation that won't terminate?

I'm teaching a statistics class where I'm having students explore questions in probability and statistics through simulation using R. Recently there was some confusion about the probability of getting exactly two 6's when rolling 5 dice. The answer is choose(5,2)*5^3/6^5, but some students were convinced that "order shouldn't matter"; i.e. that the answer should be choose(5,2)*choose(25,3)/choose(30,5).
I thought it would be fun to have them simulate rolling 5 dice thousands of times, keeping track of the empirical probability for each experiment, and then repeat the experiment many times. The problem is the two numbers above are sufficiently close that it's quite hard to get a simulation to tease out the difference in a statistically significant fashion (of course I could just be doing it wrong).
I tried rolling 5 dice 100000 times, then repeating the experiment 10000 times. This took an hour or so to run on my i7 Linux machine and still allowed for a 25% chance that the correct answer is choose(5,2)*choose(25,3)/choose(30,5). So I increased the number of dice rolls per experiment to 10^6. Now the code has been running for over 2 days and shows no sign of finishing. I'm confused by this, as I only increased the number of operations by an order of magnitude, implying that the run time should be closer to 10 hours.
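For reference, the two candidate answers are very close:

choose(5,2) * 5^3 / 6^5                    # the correct answer
## [1] 0.160751
choose(5,2) * choose(25,3) / choose(30,5)  # the "order shouldn't matter" answer
## [1] 0.1613967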
Second question: Is there a better way to do this? See code posted below:
probdist = rep(0, 10000)
for (j in 1:length(probdist)) {
  outcome = rep(0, 1000000)
  for (k in 1:1000000) {
    rolls = sample(1:6, 5, replace = T)
    if (length(rolls[rolls == 6]) == 2) outcome[k] = 1
  }
  probdist[j] = sum(outcome) / length(outcome)
}
A good rule of thumb is to never, ever write a for loop in R. Here's an alternative solution:
doSample <- function() {
  sum(sample(1:6, size = 5, replace = TRUE) == 6) == 2
}
> system.time(samples <- replicate(n=10000,expr=doSample()))
user system elapsed
0.06 0.00 0.06
> mean(samples)
[1] 0.1588
> choose(5,2)*5^3/6^5
[1] 0.160751
Doesn't seem to be too accurate with 10,000 samples. Better with 100,000:
> system.time(samples <- replicate(n=100000,expr=doSample()))
user system elapsed
0.61 0.02 0.61
> mean(samples)
[1] 0.16135
I had originally awarded a correct answer check to M. Berk for his/her suggestion to use the R replicate() function. Further investigation has forced me to rescind my previous endorsement. It turns out that replicate() is just a wrapper for sapply(), which doesn't actually afford any performance benefits over a for loop (this seems to be a common misconception). In any case, I prepared 3 versions of the simulation, 2 using a for loop, and one using replicate, as suggested, and ran them one after the other, starting from a fresh R session each time, in order to compare the execution times:
# dice26dist1.r: for() loop version with unnecessary array allocation
probdist = rep(0, 100)
for (j in 1:length(probdist)) {
  outcome = rep(0, 1000000)
  for (k in 1:1000000) {
    rolls = sample(1:6, 5, replace = T)
    if (length(rolls[rolls == 6]) == 2) outcome[k] = 1
  }
  probdist[j] = sum(outcome) / length(outcome)
}
system.time(source('dice26dist1.r'))
user system elapsed
596.365 0.240 598.614
# dice26dist2.r: for() loop version
probdist = rep(0, 100)
for (j in 1:length(probdist)) {
  outcomes = 0
  for (k in 1:1000000) {
    rolls = sample(1:6, 5, replace = T)
    if (length(rolls[rolls == 6]) == 2) outcomes = outcomes + 1
  }
  probdist[j] = outcomes / 1000000
}
system.time(source('dice26dist2.r'))
user system elapsed
506.331 0.076 508.104
# dice26dist3.r: replicate() version
doSample <- function() {
  sum(sample(1:6, size = 5, replace = TRUE) == 6) == 2
}
probdist = rep(0, 100)
for (j in 1:length(probdist)) {
  samples = replicate(n = 1000000, expr = doSample())
  probdist[j] = mean(samples)
}
system.time(source('dice26dist3.r'))
user system elapsed
804.042 0.472 807.250
From this you can see that the replicate() version is considerably slower than either of the for-loop versions by any system.time metric. I had originally thought that my problem was mostly due to cache misses from allocating the million-element outcome[] array, but comparing the times of dice26dist1.r and dice26dist2.r indicates that this has only a nominal impact on performance (although the impact on system time is considerable: a >300% difference).
One might argue that I'm still using for loops in all three simulations, but as far as I can tell this is completely unavoidable when simulating a random process; I have to simulate actually going through the random process (in this case, rolling 5 dice) every time. I would love to know about any technique that would allow me to avoid using a for loop (in a way that improves performance, of course). I understand that this problem would lend itself very effectively to parallelization, but I'm talking about using a single R session -- is there a way to make this faster?
Vectorization is almost always preferred to any for loop. In this case, you should see substantial speedup by generating all your dice throws first, then checking how many in each group of five equal 6.
set.seed(5)
N <- 1e6
foo <- matrix(sample(1:6, 5*N, replace=TRUE), ncol=5)
p <- mean(rowSums(foo==6)==2)
se <- sqrt(p*(1-p)/N)
p
## [1] 0.160382
Here's a 95% confidence interval:
p + se*qnorm(0.975)*c(-1,1)
## [1] 0.1596628 0.1611012
We can see that the true answer (ans1) is in the interval, but the false answer (ans2) is not, or we could perform significance tests; the p-value when testing the true answer is 0.31 but for the false answer is 0.0057.
(ans1 <- choose(5,2)*5^3/6^5)
## [1] 0.160751
pnorm(abs((ans1-p)/se), lower=FALSE)*2
## [1] 0.3145898
(ans2 <- choose(5,2)*choose(25,3)/choose(30,5))
## [1] 0.1613967
pnorm(abs((ans2-p)/se), lower=FALSE)*2
## [1] 0.005689008
Note that I'm generating all the dice throws at once; if memory is an issue, you could split this up into pieces and combine, as you did in your original post. This is possibly what caused your unexpected slowdown in time; if it was necessary to use swap memory, this would slow it down substantially. If so, better to increase the number of times you run the loop, not the number of rolls within the loop.
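A sketch of what such a chunked version might look like (the chunk size here is arbitrary):

set.seed(5)
N <- 1e6
chunk <- 1e5
hits <- 0
for (start in seq(1, N, by = chunk)) {
  m <- min(chunk, N - start + 1)                           # rolls in this chunk
  rolls <- matrix(sample(1:6, 5 * m, replace = TRUE), ncol = 5)
  hits <- hits + sum(rowSums(rolls == 6) == 2)
}
p <- hits / N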

subset slow in large matrix

I have a numeric vector of length 5,000,000
>head(coordvec)
[1] 47286545 47286546 47286547 47286548 47286549 472865
and a 1,400,000 x 3 numeric matrix
>head(subscores)
V1 V2 V3
1 47286730 47286725 0.830
2 47286740 47286791 0.065
3 47286750 47286806 -0.165
4 47288371 47288427 0.760
5 47288841 47288890 0.285
6 47288896 47288945 0.225
What I am trying to accomplish is, for each number in coordvec, to find the average of V3 over the rows of subscores whose V1 and V2 encompass that number. To do that, I am taking the following approach:
results <- numeric(length(coordvec))
for (i in 1:length(coordvec)) {
  select_rows <- subscores[, 1] < coordvec[i] & subscores[, 2] > coordvec[i]
  scores_subset <- subscores[select_rows, 3]
  results[i] <- mean(scores_subset)
}
This is very slow, and would take a few days to finish. Is there a faster way?
Thanks,
Dan
I think there are two challenging parts to this question. The first is finding the overlaps. I'd use the IRanges package from Bioconductor (?findInterval in the base package might also be useful)
library(IRanges)
Create width-1 ranges representing the coordinate vector, and a set of ranges representing the scores. I sort the coordinate vector for convenience, assuming that duplicate coordinates can be treated the same:
coord <- sort(sample(.Machine$integer.max, 5000000))
starts <- sample(.Machine$integer.max, 1200000)
scores <- runif(length(starts))
q <- IRanges(coord, width=1)
s <- IRanges(starts, starts + 100L)
Here we find which query overlaps which subject
system.time({
  olaps <- findOverlaps(q, s)
})
This takes about 7s on my laptop. There are different types of overlaps (see ?findOverlaps) so maybe this step requires a bit of refinement.
The result is a pair of vectors indexing the query and overlapping subject.
> olaps
Hits of length 281909
queryLength: 5000000
subjectLength: 1200000
queryHits subjectHits
<integer> <integer>
1 19 685913
2 35 929424
3 46 1130191
4 52 37417
I think this is the end of the first complicated part, finding the 281909 overlaps. (I don't think the data.table answer offered elsewhere addresses this, though I could be mistaken...)
The next challenging part is calculating a large number of means. The built-in way would be something like
olaps0 <- head(olaps, 10000)
system.time({
  res0 <- tapply(scores[subjectHits(olaps0)], queryHits(olaps0), mean)
})
which takes about 3.25s on my computer and appears to scale linearly, so maybe 90s for the 280k overlaps. But I think we can accomplish this tabulation efficiently with data.table. The original coordinates are start(q)[queryHits(olaps)], so:
require(data.table)
dt <- data.table(coord = start(q)[queryHits(olaps)],
                 score = scores[subjectHits(olaps)])
res1 <- dt[, mean(score), by = coord]$V1
which takes about 2.5s for all 280k overlaps.
Some more speed can be had by recognizing that the query hits are ordered. We want to calculate a mean for each run of query hits. We start by creating a variable to indicate the ends of each query hit run
idx <- c(queryHits(olaps)[-1] != queryHits(olaps)[-length(olaps)], TRUE)
and then calculate the cumulative scores at the ends of each run, the length of each run, and the difference between the cumulative score at the end and at the start of the run
scoreHits <- cumsum(scores[subjectHits(olaps)])[idx]
n <- diff(c(0L, seq_along(idx)[idx]))
xt <- diff(c(0L, scoreHits))
And finally, the mean is
res2 <- xt / n
This takes about 0.6s for all the data, and is identical to (though more cryptic than?) the data.table result
> identical(res1, res2)
[1] TRUE
The original coordinates corresponding to the means are
start(q)[ queryHits(olaps)[idx] ]
Something like this might be faster:
require(data.table)
subscores <- as.data.table(subscores)
subscores[, cond := V1 < coordvec & V2 > coordvec]
subscores[list(cond)[[1]], mean(V3)]
list(cond)[[1]] because: "When i is a single variable name, it is not considered an expression of column names and is instead evaluated in calling scope." source: ?data.table
Since your example isn't easily reproducible, and even if it were, none of the subscores shown meet your boolean condition, I'm not sure whether this does exactly what you're looking for, but you can use one of the apply family with a function.
myfun <- function(x) {
  y <- subscores[, 1] < x & subscores[, 2] > x
  mean(subscores[y, 3])
}
sapply(coordvec, myfun)
You can also take a look at mclapply. If you have enough memory this will probably speed things up significantly. However, you could also look at the foreach package with similar results. You've got your for loop "correct" by assigning into results rather than growing it, but really, you're doing a lot of comparisons. It will be hard to speed this up much.
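For example, a minimal mclapply sketch (it forks the current session, so it works on Unix-alikes but not on Windows; the core count is arbitrary):

library(parallel)
results <- unlist(mclapply(coordvec, myfun, mc.cores = 4))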

R: Sample into bins of predefined sizes (partition sample vector)

I'm working on a dataset that consists of ~10^6 values which are clustered into a variable number of bins. In the course of my analysis, I am trying to randomize my clustering while keeping the bin sizes constant. As a toy example (in pseudocode), this would look something like this:
data <- list(c(1,5,6,3), c(2,4,7,8), c(9), c(10,11,15), c(12,13,14));
sizes <- lapply(data, length);
for (rand in 1:no.of.randomizations) {
rand.data <- partition.sample(seq(1,15), partitions=sizes, replace=F)
}
So, I am looking for a function like "partition.sample" that will take a vector (like seq(1,15)) and randomly sample from it, returning a list with the data partitioned into the right bin sizes given already by "sizes".
I've been trying to write one such function myself, since the task seems to be not so hard. However, the partitioning of a vector into given bin sizes looks like it would be a lot faster and more efficient if done "under the hood", meaning probably not in native R. So I wonder whether I have just missed the name of the appropriate function, or whether someone could please point me to a smart solution that is around :-)
Your help & time are very much appreciated! :-)
Best,
Lymond
UPDATE:
By "no.of.randomizations" I mean the actual number of times I run through the whole "randomization loop". This will, later on, obviously include more steps than just the actual sampling.
Moreover, I would in addition be interested in a trick to do the above feat for sampling without replacement.
Thanks in advance, your help is very much appreciated!
Revised: This should be fairly efficient. Its complexity should be primarily in the permutation step:
# A single step:
x <- sample(unlist(data))
list(one = x[1:4], two = x[5:8], three = x[9], four = x[10:12], five = x[13:15])
As mentioned above, the "no.of.randomizations" may have been the number of repeated applications of this process, in which case you may want to wrap replicate around that:
replic <- replicate(n = 4, {
  x <- sample(unlist(data))
  list(x[1:4], x[5:8], x[9], x[10:12], x[13:15])
})
After some more thinking and googling, I have come up with a feasible solution. However, I am still not convinced that this is the fastest and most efficient way to go.
In principle, I can generate one long vector containing a permutation of "data" and then split it into a list of vectors of lengths "sizes" by going via a factor argument supplied to split. For this, I need an additional ID scheme for my different groups of "data", which I happen to have in my case.
It becomes clearer when viewed as code:
data <- list(c(1,5,6,3), c(2,4,7,8), c(9), c(10,11,15), c(12,13,14));
sizes <- lapply(data, length);
So far, everything as above
names <- c("set1", "set2", "set3", "set4", "set5");
In my case, I am lucky enough to have "names" already provided from the data. Otherwise, I would have to obtain them as (e.g.)
names <- seq(1, length(data));
This "names" vector can then be expanded by "sizes" using rep:
cut.by <- rep(names, times = sizes);
[1] 1 1 1 1 2 2 2 2 3 4 4 4 5
[14] 5 5
This new vector "cut.by" can then be provided as an argument to split():
rand.data <- split(sample(1:15, 15), cut.by)
$`1`
[1] 8 9 14 4
$`2`
[1] 10 2 15 13
$`3`
[1] 12
$`4`
[1] 11 3 5
$`5`
[1] 7 6 1
This does the job I was looking for alright. It samples from the background "1:15" and splits the result into vectors of lengths "sizes" through the vector "cut.by".
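For convenience, this can be wrapped up as the kind of partition.sample function I originally asked about (a rough sketch of my own):

partition.sample <- function(pool, sizes) {
  # expand bin IDs by bin sizes, then split a permutation of the pool by them
  cut.by <- rep(seq_along(sizes), times = unlist(sizes))
  split(sample(pool, length(cut.by)), cut.by)
}
rand.data <- partition.sample(seq(1, 15), sizes)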
However, I am still not happy to have to go via an additional (possibly) long vector to indicate the split positions, such as "cut.by" in the code above. This definitely works, but for very long data vectors, it could become quite slow, I guess.
Thank you anyway for the answers and pointers provided! Your help is very much appreciated :-)

Testing if rows of a matrix or data frame are sorted in R

What is an efficient way to test if rows in a matrix are sorted? [Update: see Aaron's Rcpp answer - straightforward & very fast.]
I am porting some code that uses issorted(,'rows') from Matlab. As it seems that is.unsorted does not extend beyond vectors, I'm writing or looking for something else. The naive method is to check that the sorted version of the matrix (or data frame) is the same as the original, but that's obviously inefficient.
NB: For sorting, a la sortrows() in Matlab, my code essentially uses SortedDF <- DF[do.call(order, DF),] (it's wrapped in a larger function that converts matrices to data frames, passes parameters to order, etc.). I wouldn't be surprised if there are faster implementations (data table comes to mind).
Update 1: To clarify: I'm not testing for sorting intra-row or intra-column. (Such sorting generally results in an algebraically different matrix.)
As an example for creating an unsorted matrix:
set.seed(0)
x <- as.data.frame(matrix(sample(3, 60, replace = TRUE), ncol = 6, byrow = TRUE))
Its sorted version is:
y <- x[do.call(order, x),]
A proper test, say testSorted, would return FALSE for testSorted(x) and TRUE for testSorted(y).
Update 2:
The answers below are all good - they are concise and do the test. Regarding efficiency, it looks like these are sorting the data after all.
I've tried these with rather large matrices, such as 1M x 10, (just changing the creation of x above) and all have about the same time and memory cost. What's peculiar is that they all consume more time for unsorted objects (about 5.5 seconds for 1Mx10) than for sorted ones (about 0.5 seconds for y). This suggests they're sorting before testing.
I tested by creating a z matrix:
z <- y
z[,2] <- y[,1]
z[,1] <- y[,2]
In this case, all of the methods take about 0.85 seconds to complete. Anyway, finishing in 5.5 seconds isn't terrible (in fact, that seems to be right about the time necessary to sort the object), but knowing that a sorted matrix is 11X faster suggests that a test that doesn't sort could be even faster. In the case of the 1M row matrix, the first three rows of x are:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 3 1 2 2 3 1 3 3 2 2
2 1 1 1 3 2 3 2 3 3 2
3 3 3 1 2 1 1 2 1 2 3
There's no need to look beyond row 2, though vectorization isn't a bad idea.
(I've also added the byrow argument for the creation of x, so that row values don't depend on the size of x.)
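(A pure-R sketch of such an early-exit, row-by-row check, just to illustrate the idea; it assumes numeric columns and is not tuned for speed:)

testSortedLoop <- function(df) {
  m <- as.matrix(df)
  for (i in seq_len(nrow(m) - 1)) {
    d <- m[i + 1, ] - m[i, ]                         # compare consecutive rows
    nz <- which(d != 0)
    if (length(nz) && d[nz[1]] < 0) return(FALSE)    # first differing column decreases
  }
  TRUE
}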
Update 3:
Another comparison for this testing can be found with the sort -c command in Linux. If the file is already written (using write.table()), with 1M rows, then time sort -c myfile.txt takes 0.003 seconds for the unsorted data and 0.101 seconds for the sorted data. I don't intend to write out to a file, but it's a useful comparison.
Update 4:
Aaron's Rcpp method bested all other methods offered here and that I've tried (including the sort -c comparison above: in-memory is expected to beat on-disk). As for the ratio relative to other methods, it's hard to tell: the denominator is too small to give an accurate measurement, and I've not extensively explored microbenchmark. The speedups can be very large (4-5 orders of magnitude) for some matrices (e.g. one made with rnorm), but this is misleading - checking can terminate after only a couple of rows. I've had speedups with the example matrices of about 25-60 for the unsorted and about 1.1X for the sorted, as the competing methods were already very fast if the data is sorted.
Since this does the right thing (i.e. no sorting, just testing), and does it very quickly, it's the accepted answer.
If y is sorted then do.call(order, y) returns 1:nrow(y).
testSorted = function(y) { all(do.call(order, y) == 1:nrow(y)) }
Note that this doesn't compare the matrices, but it also doesn't bail out as soon as it finds a non-match.
Well, why don't you use:
all(do.call(order, y)==seq(nrow(y)))
That avoids creating the ordered matrix, and ensures it checks your style of ordering.
Newer: I decided I could use the Rcpp practice...
library(Rcpp)
library(inline)
isRowSorted <- cxxfunction(signature(A = "numeric"), body = '
  Rcpp::NumericMatrix Am(A);
  for (int i = 1; i < Am.nrow(); i++) {
    for (int j = 0; j < Am.ncol(); j++) {
      if (Am(i-1, j) < Am(i, j)) { break; }
      if (Am(i-1, j) > Am(i, j)) { return(wrap(false)); }
    }
  }
  return(wrap(true));
', plugin = "Rcpp")
rownames(y) <- NULL # because as.matrix is faster without rownames
isRowSorted(as.matrix(y))
New: This R-only hack is the same speed for all matrices; it's definitely faster for sorted matrices; for unsorted ones it depends on the nature of the unsortedness.
iss3 <- function(x) {
  x2 <- sign(do.call(cbind, lapply(x, diff)))
  x3 <- t(x2) * (2^((ncol(x) - 1):0))
  all(colSums(x3) >= 0)
}
Original: This is faster for some unsorted matrices. How much faster will depend on where the unsorted elements are; this looks at the matrix column by column, so unsortedness on the left side will be noticed much faster than unsortedness on the right, while top/bottom position doesn't matter nearly as much.
iss2 <- function(y) {
  b <- c(0, nrow(y))
  for (i in 1:ncol(y)) {
    z <- rle(y[, i])
    b2 <- cumsum(z$lengths)
    sp <- split(z$values, cut(b2, breaks = b))
    for (spi in sp) {
      if (is.unsorted(spi)) return(FALSE)
    }
    b <- c(0, b2)
  }
  return(TRUE)
}
Well, the brute-force approach is to loop and compare, aborting as soon as a violation is found.
That approach can be implemented and tested easily in R, and then be carried over to a simple C++ function we can connect to R via inline and Rcpp (or plain C if you must) as looping is something that really benefits from an implementation in a compiled language.
Otherwise, can you not use something like diff() and check if all increments are non-negative?
You can use your do.call statement with is.unsorted:
issorted.matrix <- function(A) {!is.unsorted(do.call("order",data.frame(A)))}
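With the x and y from the question, this should give FALSE and TRUE respectively:

issorted.matrix(x)   # expected: FALSE
issorted.matrix(y)   # expected: TRUE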
