Testing if rows of a matrix or data frame are sorted in R - r

What is an efficient way to test if rows in a matrix are sorted? [Update: see Aaron's Rcpp answer - straightforward & very fast.]
I am porting some code that uses issorted(,'rows') from Matlab. As it seems that is.unsorted does not extend beyond vectors, I'm writing or looking for something else. The naive method is to check that the sorted version of the matrix (or data frame) is the same as the original, but that's obviously inefficient.
NB: For sorting, a la sortrows() in Matlab, my code essentially uses SortedDF <- DF[do.call(order, DF),] (it's wrapped in a larger function that converts matrices to data frames, passes parameters to order, etc.). I wouldn't be surprised if there are faster implementations (data.table comes to mind).
Update 1: To clarify: I'm not testing for sorting within each row or within each column. (Such sorting generally results in an algebraically different matrix.)
As an example for creating an unsorted matrix:
set.seed(0)
x <- as.data.frame(matrix(sample(3, 60, replace = TRUE), ncol = 6, byrow = TRUE))
Its sorted version is:
y <- x[do.call(order, x),]
A proper test, say testSorted, would return FALSE for testSorted(x) and TRUE for testSorted(y).
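For concreteness, a minimal sketch of that naive baseline (the name testSortedNaive is mine); it gives the right answer but pays the full cost of sorting every time:
# naive baseline: sort a copy and compare values element-wise (row names are ignored)
testSortedNaive <- function(d) {
  all(as.matrix(d) == as.matrix(d[do.call(order, d), ]))
}
testSortedNaive(x)  # FALSE
testSortedNaive(y)  # TRUE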
Update 2:
The answers below are all good - they are concise and do the test. Regarding efficiency, it looks like these are sorting the data after all.
I've tried these with rather large matrices, such as 1M x 10 (just changing the creation of x above), and all have about the same time and memory cost. What's peculiar is that they all consume more time for unsorted objects (about 5.5 seconds for 1Mx10) than for sorted ones (about 0.5 seconds for y). This suggests they're sorting before testing.
I tested by creating a z matrix:
z <- y
z[,2] <- y[,1]
z[,1] <- y[,2]
In this case, all of the methods take about 0.85 seconds to complete. Anyway, finishing in 5.5 seconds isn't terrible (in fact, that seems to be right about the time necessary to sort the object), but knowing that a sorted matrix is 11X faster suggests that a test that doesn't sort could be even faster. In the case of the 1M row matrix, the first three rows of x are:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 3 1 2 2 3 1 3 3 2 2
2 1 1 1 3 2 3 2 3 3 2
3 3 3 1 2 1 1 2 1 2 3
There's no need to look beyond row 2, though vectorization isn't a bad idea.
(I've also added the byrow argument for the creation of x, so that row values don't depend on the size of x.)
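For reference, a timing comparison along these lines can be set up as follows (assuming the 1M x 10 variant of x and any of the testSorted implementations from the answers below):
set.seed(0)
x <- as.data.frame(matrix(sample(3, 1e7, replace = TRUE), ncol = 10, byrow = TRUE))
y <- x[do.call(order, x), ]
system.time(testSorted(x))  # unsorted case
system.time(testSorted(y))  # sorted case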
Update 3:
Another comparison for this testing can be found with the sort -c command in Linux. If the file is already written (using write.table()), with 1M rows, then time sort -c myfile.txt takes 0.003 seconds for the unsorted data and 0.101 seconds for the sorted data. I don't intend to write out to a file, but it's a useful comparison.
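For reference, that comparison can be set up along these lines (the exact write.table options are an assumption; the point is just to get a plain text file that sort -c can check):
write.table(x, "myfile.txt", row.names = FALSE, col.names = FALSE)
# then, from a shell:
#   time sort -c myfile.txt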
Update 4:
Aaron's Rcpp method bested all other methods offered here and that I've tried (including the sort -c comparison above: in-memory is expected to beat on-disk). As for the ratio relative to other methods, it's hard to tell: the denominator is too small to give an accurate measurement, and I've not extensively explored microbenchmark. The speedups can be very large (4-5 orders of magnitude) for some matrices (e.g. one made with rnorm), but this is misleading - checking can terminate after only a couple of rows. I've had speedups with the example matrices of about 25-60X for the unsorted and about 1.1X for the sorted, as the competing methods were already very fast if the data is sorted.
Since this does the right thing (i.e. no sorting, just testing), and does it very quickly, it's the accepted answer.
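For reference, such a comparison could be set up with microbenchmark, using isRowSorted and testSorted as defined in the answers below (this is only a sketch, and the small-denominator caveat above still applies):
library(microbenchmark)
ym <- as.matrix(y)  # convert once, outside the benchmark
microbenchmark(rcpp = isRowSorted(ym), order_based = testSorted(y), times = 20)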

If y is sorted then do.call(order,y) returns 1:nrow(y).
testSorted <- function(y) all(do.call(order, y) == 1:nrow(y))
Note this avoids constructing and comparing the sorted matrix, but it doesn't bail out as soon as it finds a non-match.

Well, why don't you use:
all(do.call(order, y)==seq(nrow(y)))
That avoids creating the ordered matrix, and ensures it checks your style of ordering.

Newer: I decided I could use the Rcpp practice...
library(Rcpp)
library(inline)
isRowSorted <- cxxfunction(signature(A = "numeric"), body='
  Rcpp::NumericMatrix Am(A);
  for (int i = 1; i < Am.nrow(); i++) {
    for (int j = 0; j < Am.ncol(); j++) {
      // row i-1 is strictly smaller in this column: this pair of rows is in order
      if( Am(i-1,j) < Am(i,j) ) { break; }
      // row i-1 is strictly larger: the rows are out of order
      if( Am(i-1,j) > Am(i,j) ) { return(wrap(false)); }
      // equal in this column: fall through and compare the next column
    }
  }
  return(wrap(true));
', plugin="Rcpp")
rownames(y) <- NULL # because as.matrix is faster without rownames
isRowSorted(as.matrix(y))
New: This R-only hack takes about the same time whatever the matrix; compared with the other tests here, it's definitely faster for sorted matrices, while for unsorted ones it depends on the nature of the unsortedness.
iss3 <- function(x) {
  # sign of the row-to-row difference in each column: -1, 0, or +1
  x2 <- sign(do.call(cbind, lapply(x, diff)))
  # weight the columns so that an earlier column outweighs all later ones combined
  x3 <- t(x2)*(2^((ncol(x)-1):0))
  # every pair of consecutive rows must compare >= 0 under these weights
  all(colSums(x3)>=0)
}
Original: This is faster for some unsorted matrices. How much faster depends on where the unsorted elements are; it looks at the matrix column by column, so unsortedness on the left side will be noticed much sooner than unsortedness on the right, while position from top to bottom matters much less.
iss2 <- function(y) {
  b <- c(0, nrow(y))
  for (i in 1:ncol(y)) {
    # run-length encode the current column
    z <- rle(y[,i])
    b2 <- cumsum(z$lengths)
    # split the run values into the blocks defined so far
    sp <- split(z$values, cut(b2, breaks=b))
    for (spi in sp) {
      if (is.unsorted(spi)) return(FALSE)
    }
    # the runs of this column define the blocks for the next column
    b <- c(0, b2)
  }
  return(TRUE)
}

Well, the brute-force approach is to loop and compare, aborting as soon as a violation is found.
That approach can be implemented and tested easily in R, and then carried over to a simple C++ function we can connect to R via inline and Rcpp (or plain C if you must), as looping is something that really benefits from an implementation in a compiled language.
Otherwise, can you not use something like diff() and check if all increments are non-negative?
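For illustration, a rough R-only version of that brute-force loop might look like this (a sketch, not optimized; it stops at the first out-of-order pair of rows):
isRowSortedR <- function(m) {
  m <- as.matrix(m)
  if (nrow(m) < 2) return(TRUE)
  for (i in 2:nrow(m)) {
    for (j in 1:ncol(m)) {
      if (m[i - 1, j] < m[i, j]) break          # this pair of rows is in order
      if (m[i - 1, j] > m[i, j]) return(FALSE)  # violation found: abort early
      # equal in this column: compare the next column
    }
  }
  TRUE
}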

You can use your do.call statement with is.unsorted:
issorted.matrix <- function(A) {!is.unsorted(do.call("order",data.frame(A)))}

Related

modelling an infinite series in R

I'm trying to write code in R to approximate the following infinite Taylor series from the Theis hydrogeological equation.
I'm pretty new to functional programming, so this was a challenge! This is my attempt:
Wu <- function(u, repeats = 100) {
  result <- numeric(repeats)
  for (i in seq_along(result)) {
    result[i] <- -((-u)^i) / (i * factorial(i))
  }
  return(sum(result) - log(u) - 0.5772)
}
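For reference, the series being approximated here (the well function, i.e. the exponential integral $E_1$, with Euler's constant $\gamma \approx 0.5772$) is
\[
W(u) = -\gamma - \ln u + u - \frac{u^2}{2\cdot 2!} + \frac{u^3}{3\cdot 3!} - \cdots
     = -\gamma - \ln u - \sum_{n=1}^{\infty} \frac{(-u)^n}{n \cdot n!},
\]
which is what the loop above accumulates term by term.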
I've compared the results with values from a data table available here: https://pubs.usgs.gov/wsp/wsp1536-E/pdf/wsp_1536-E_b.pdf - see below (excuse verbose code - should have made a csv, with hindsight):
Wu_QC <- data.frame(
  u = c(1.0*10^-15, 4.1*10^-14, 9.9*10^-13, 7.0*10^-12, 3.7*10^-11,
        2.3*10^-10, 6.8*10^-9,  5.7*10^-8,  8.4*10^-7,  6.3*10^-6,
        3.1*10^-5,  7.4*10^-4,  5.1*10^-3,  2.9*10^-2,  8.7*10^-1,
        4.6, 9.90),
  Wu_table = c(33.9616, 30.2480, 27.0639, 25.1079, 23.4429,
               21.6157, 18.2291, 16.1030, 13.4126, 11.3978,
               9.8043, 6.6324, 4.7064, 2.9920, 0.2742,
               0.001841, 0.000004637))
Wu_QC$rep_100 <- Wu(Wu_QC$u,100)
The good news is the formula gives identical results for repeats = 50, 100, 150 and 170 (so I've just given you the 100 version above). The bad news is that, while the function performs well for u < ~10^-3, it goes off the rails and gives negative outputs for numbers within an order of magnitude or so of 1. This doesn't happen when I call the function on an individual number, e.g.:
> Wu(4.6)
[1] 0.001856671
which is the correct answer to 2 significant figures.
Can anyone spot what I've done wrong and/or suggest a better way to code this equation? I think the problem is something to do with my for loop and/or an issue with the factorials generating infinite numbers as u gets larger, but I'm not at all certain.
Thanks!
As it says on page 93 of your reference, W is also known as the exponential integral. See also here.
Then, e.g., the package expint provides a function to compute W(u):
library(expint)
expint(10^(-8))
# [1] 17.84347
expint(4.6)
# [1] 0.001841006
where the results match the table you referenced exactly.
You can write a function that takes in a value together with the repetition times and outputs the required value:
w <- function(u, l) {
  a <- 2:l
  -0.5772 - log(u) + u + sum(u^a * rep(c(-1, 1), length.out = l - 1) / a / factorial(a))
}
transform(Wu_QC, new = Vectorize(w)(u, 170))
          u     Wu_table          new
1   1.0e-15  3.39616e+01  3.396158e+01
2   4.1e-14  3.02480e+01  3.024800e+01
3   9.9e-13  2.70639e+01  2.706387e+01
4   7.0e-12  2.51079e+01  2.510791e+01
5   3.7e-11  2.34429e+01  2.344290e+01
6   2.3e-10  2.16157e+01  2.161574e+01
7   6.8e-09  1.82291e+01  1.822914e+01
8   5.7e-08  1.61030e+01  1.610301e+01
9   8.4e-07  1.34126e+01  1.341266e+01
10  6.3e-06  1.13978e+01  1.139777e+01
11  3.1e-05  9.80430e+00  9.804354e+00
12  7.4e-04  6.63240e+00  6.632400e+00
13  5.1e-03  4.70640e+00  4.706408e+00
14  2.9e-02  2.99200e+00  2.992051e+00
15  8.7e-01  2.74200e-01  2.741930e-01
16  4.6e+00  1.84100e-03  1.856671e-03
17  9.9e+00  4.63700e-06  2.030179e-05
As u becomes large the approximation is not so good, so we would need to go beyond 170 terms; but R cannot do that, since factorial() overflows to Inf beyond 170. Maybe you can try another platform, e.g. Python.
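A quick illustration of that limit (it is a property of double precision, not of the series itself):
factorial(170)  # about 7.26e306, still representable as a double
factorial(171)  # Inf: overflows, so terms beyond a = 170 cannot be computed this way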
I think I may have solved this myself (though borrowing heavily from Onyambo's answer!) Here's my code:
well_func2 <- function(u, l = 100) {
  result <- numeric(length(u))
  a <- 2:l
  for (i in seq_along(u)) {
    result[i] <- -0.5772 - log(u[i]) + u[i] +
      sum(u[i]^a * rep(c(-1, 1), length.out = l - 1) / a / factorial(a))
  }
  return(result)
}
As far as I can tell so far, this matches the tabulated results well for u <5 (as did Onyambo's code), and it also gives the same result for vector vs single-value inputs.
Still needs a bit more testing, and there's probably a tidier way to code it using map() or similar instead of the for loop, but I'm happy enough for now. Thought I'd share in case anyone else has the same problem.
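For example, a loop-free variant along those lines might look like this (same formula, just vapply instead of the for loop; purrr::map_dbl would be similar). This is a sketch and hasn't been tested beyond the values above:
well_func3 <- function(u, l = 100) {
  a <- 2:l
  vapply(u, function(ui) {
    -0.5772 - log(ui) + ui + sum(ui^a * rep(c(-1, 1), length.out = l - 1) / a / factorial(a))
  }, numeric(1))
}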

R lookup time for very long vector

In the R programming language...
Bottleneck in my code:
a <- a[b]
where:
a and b are vectors of length 90 million.
a is a logical vector.
b is a permutation of the indices of a.
This operation is slow: it takes ~ 1.5 - 2.0 seconds.
I thought straightforward indexing would be much faster, even for large vectors.
Am I simply stuck? Or is there a way to speed this up?
Context:
P is a large matrix (10k rows, 5k columns).
rows = names, columns = features, values = real numbers.
Problem: Given a subset of names, I need to obtain matrix Q, where:
Each column of Q is sorted (independently of the other columns of Q).
The values in a column of Q come from the corresponding column of P and are only those from the rows of P which are in the given subset of names.
Here is a naive implementation:
Psub <- P[names,]
Q <- sapply( Psub , sort )
But I am given 10,000 distinct subsets of names (each subset covers 20% to 90% of the total), and taking the subset and sorting each time is incredibly slow.
Instead, I can pre-compute the order vector:
b <- sapply(P, order)
b <- convert_to_linear_index(as.data.frame(b), dim(P))  # my own function
# now b is a vector of length nrow(P) * ncol(P)
a <- rownames(P) %in% myNames
a <- rep(a, ncol(P))
a <- a[b]
a <- as.matrix(a, nrow = length(myNames))
I don't see this getting much faster than that. You can try to write an optimized C function to do exactly this, which might cut the time in half or so (and that's optimistic -- vectorized R operations like this don't have much overhead), but not much more than that.
You've got approx 10^8 values to go through. Each time through the internal loop, it needs to increment the iterator, get the index b[i] out of memory, look up a[b[i]] and then save that value into newa[i]. I'm not a compiler/assembly expert by a long shot, but this sounds like on the order of 5-10 instructions, which means you're looking at "big O" of 1 billion instructions total, so there's a clock rate limit to how fast this can go.
Also, R stores logical values as 32 bit ints, so the array a will take up about 400 megs, which doesn't fit into cache, so if b is a more or less random permutation, then you're going to be missing the cache regularly (on most lookups to a, in fact). Again, I'm not an expert, but I would think it's likely that the cache misses here are the bottleneck, and if that's the case, optimized C won't help much.
Aside from writing it in C, the other thing to do is determine whether there are any assumptions you can make that would let you not go through the whole array. For example, if you know most of the indices will not change, and you can figure out which ones do change, you might be able to make it go faster.
On edit, here are some numbers. I have an AMD with clock speed of 2.8GHz. It takes me 3.4 seconds with a random permutation (i.e. lots of cache misses) and 0.7 seconds with either 1:n or n:1 (i.e. very few cache misses), which breaks into 0.6 seconds of execution time and 0.1 of system time, presumably to allocate the new array. So it does appear that cache misses are the thing. Maybe optimized C code could shave something like 0.2 or 0.3 seconds off of that base time, but if the permutation is random, that won't make much difference.
> x<-sample(c(T,F),90*10**6,T)
> prm<-sample(90*10**6)
> prm1<-1:length(prm)
> prm2<-rev(prm1)
> system.time(x<-x[prm])
user system elapsed
3.317 0.116 3.436
> system.time(x<-x[prm1])
user system elapsed
0.593 0.140 0.734
> system.time(x<-x[prm2])
user system elapsed
0.631 0.112 0.743
>
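A quick way to check the memory figure mentioned above:
object.size(logical(90e6))  # roughly 360 MB: R stores logicals as 32-bit values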

counting matching elements of two vectors but not including repeated elements in the count

I've searched a lot in this forum; however, I didn't find a problem similar to the one I'm facing.
My question is:
I have two vectors
x <- c(1,1,2,2,3,3,3,4,4,4,6,7,8) and z <- c(1,1,2,4,5,5,5)
I need to count how many elements the two vectors have in common, counting repeated values only as many times as they occur in both.
The answer for this problem should be 4, because:
both vectors contain two 1s, one 2, and one 4.
Functions like match() don't help, since they count repeated values even when they aren't repeated in the other vector. Using unique() would also change the final answer from 4 to 3.
What I came up with was a loop that, every time it finds a number from one vector in the other, removes it from that vector so it won't be counted again.
The loop works fine for an example of this size; however, running it on larger vectors many times makes it inefficient and too slow for my purposes.
system.time({
  for (n in 1:1000) {
    x <- c(1,1,2,2,3,3,3,4,4,4,6,7,8)
    z <- c(1,1,2,4,5,5,5)
    score <- 0
    for (s in z) {
      if (s %in% x) {
        # drop the matched element so it can't be counted twice
        x <- x[-which(x == s)[1]]
        score <- score + 1
      }
    }
  }
})
Can someone suggest a better method?
I've tried using lapply; it is faster for short vectors, but it becomes slower for longer ones.
Use R's vectorization to your advantage here. There's no looping necessary.
You could use a table to look at the frequencies,
table(z[z %in% x])
#
# 1 2 4
# 2 1 1
And then take the sum of the table for the total
sum(table(z[z %in% x]))
# [1] 4

R efficient index matching elements, cluster evaluation

This may be a fairly esoteric question.
I'm trying to implement some of the ideas from Albatineh et al. (2006) (DOI: 10.1007/s00357-006-0017-z) for a spatial clustering algorithm. The basic idea is that one way to assess the stability of a clustering result is to examine how often pairs of observations end up in the same class. In a well-defined solution, pairs of observations should frequently end up in the same group.
The challenge is that in a large data set there are n^2 possible pairs (and most don't occur). We have structured our output as follows:
A B C C A
B A A A B
A B C C A
Where the column index is the observation ID and each row represents a run from the clustering algorithm. In this example there are 5 observations and the algorithm was run 3 times. The cluster labels A:C are essentially arbitrary between runs. I'd like an efficient way to calculate something like this:
ID1 ID2
1 5
2
3 4
4 3
5 1
1 2
2 3
2 4
...
This accomplishes my goal but is super slow, especially for a large data frame:
testData <- matrix(data=sample(x=c("A", "B", "C"), 15, replace=TRUE), nrow=3)
cluPr <- function(pr.obs){
pairs <- data.frame()
for (row in 1:dim(pr.obs)[1]){
for (ob in 1:dim(pr.obs)[2]){
ob.pairs <- which(pr.obs[row,] %in% pr.obs[row,ob], arr.ind=TRUE)
pairs <- rbind(pairs, cbind(ob, ob.pairs))
}
}
return(pairs)
}
cluPr(testData)
Here's a relatively quick approach using the combn() function. I assumed that the name of your matrix was m.
results <- t(combn(dim(m)[2], 2, function(x) c(x[1], x[2], sum(m[, x[1]] == m[, x[2]]))))
results2 <- results[results[, 3]>0, ]
Try this:
clu.pairs <- function(k, row)
{
  w <- which(row==k)
  expand.grid(w, w)
}

row.pairs <- function(row)
{
  do.call(rbind, lapply(unique(row), function(k) clu.pairs(k, row)))
}

full.pairs <- function(data)
{
  do.call(rbind, lapply(seq_len(nrow(data)), function(i) row.pairs(data[i,])))
}
And use full.pairs(testData). The result is not in the same order as yours, but it's equivalent.
My first implementation of the pair-counting metrics (not in R; my Java code is much faster) used ordered generators and a merge-sort style computation of the intersection. It was still O(n^2) in run time, but with much lower memory use.
However, you need to realize that you don't need to know the exact pairs. You only need the sizes of the intersections, and those can be computed straightforwardly from the intersection matrix, just like most other similarity measures. It's substantially faster if you only need the set intersection sizes; with hash tables, set intersection should be O(n).
I don't have time to look it up, but we may have touched on this in the discussion of:
Elke Achtert, Sascha Goldhofer, Hans-Peter Kriegel, Erich Schubert, Arthur Zimek: "Evaluation of Clusterings – Metrics and Visual Support", Data Engineering (ICDE), 2012 IEEE 28th International Conference on,
where we demonstrated a visual tool to explore the pair-counting based measures, also for more than two clustering solutions (unfortunately, a visual inspection mostly works for toy data sets, not for real data, which is usually too messy and high-dimensional).
Roughly said: try computing the values using the formulas on page 303 in the publication you cited, instead of computing and then counting the pairs as explained in the intuition/motivation!
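To make that concrete, here is a rough R sketch of getting the pair counts for two label vectors directly from their contingency (intersection) table instead of enumerating pairs; the function name and return layout are mine, not from the paper:
pair_counts <- function(labels1, labels2) {
  n  <- length(labels1)
  ct <- table(labels1, labels2)                 # intersection / contingency matrix
  together_both <- sum(choose(ct, 2))           # pairs grouped together in both runs
  together_1    <- sum(choose(rowSums(ct), 2))  # pairs grouped together in run 1
  together_2    <- sum(choose(colSums(ct), 2))  # pairs grouped together in run 2
  c(both    = together_both,
    only1   = together_1 - together_both,
    only2   = together_2 - together_both,
    neither = choose(n, 2) - together_1 - together_2 + together_both)
}
# e.g. for the first two runs of testData: pair_counts(testData[1, ], testData[2, ])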

subset slow in large matrix

I have a numeric vector of length 5,000,000
>head(coordvec)
[1] 47286545 47286546 47286547 47286548 47286549 472865
and a 1,400,000 x 3 numeric matrix
>head(subscores)
V1 V2 V3
1 47286730 47286725 0.830
2 47286740 47286791 0.065
3 47286750 47286806 -0.165
4 47288371 47288427 0.760
5 47288841 47288890 0.285
6 47288896 47288945 0.225
What I am trying to accomplish is, for each number in coordvec, to find the average of V3 over the rows of subscores whose V1 and V2 values bracket that number. To do that, I am taking the following approach:
results <- numeric(length(coordvec))
for (i in 1:length(coordvec)) {
  select_rows   <- subscores[, 1] < coordvec[i] & subscores[, 2] > coordvec[i]
  scores_subset <- subscores[select_rows, 3]
  results[i]    <- mean(scores_subset)
}
This is very slow, and would take a few days to finish. Is there a faster way?
Thanks,
Dan
I think there are two challenging parts to this question. The first is finding the overlaps. I'd use the IRanges package from Bioconductor (?findInterval in the base package might also be useful)
library(IRanges)
Create width-1 ranges representing the coordinate vector, and a set of ranges representing the scores; I sort the coordinate vector for convenience, assuming that duplicate coordinates can be treated the same:
coord <- sort(sample(.Machine$integer.max, 5000000))
starts <- sample(.Machine$integer.max, 1200000)
scores <- runif(length(starts))
q <- IRanges(coord, width=1)
s <- IRanges(starts, starts + 100L)
Here we find which query overlaps which subject
system.time({
olaps <- findOverlaps(q, s)
})
This takes about 7s on my laptop. There are different types of overlaps (see ?findOverlaps) so maybe this step requires a bit of refinement.
The result is a pair of vectors indexing the query and overlapping subject.
> olaps
Hits of length 281909
queryLength: 5000000
subjectLength: 1200000
queryHits subjectHits
<integer> <integer>
1 19 685913
2 35 929424
3 46 1130191
4 52 37417
I think this is the end of the first complicated part, finding the 281909 overlaps. (I don't think the data.table answer offered elsewhere addresses this, though I could be mistaken...)
The next challenging part is calculating a large number of means. The built-in way would be something like
olaps0 <- head(olaps, 10000)
system.time({
res0 <- tapply(scores[subjectHits(olaps0)], queryHits(olaps0), mean)
})
which takes about 3.25s on my computer and appears to scale linearly, so maybe 90s for the 280k overlaps. But I think we can accomplish this tabulation efficiently with data.table. The original coordinates are start(q)[queryHits(olaps)], so:
require(data.table)
dt <- data.table(coord = start(q)[queryHits(olaps)],
                 score = scores[subjectHits(olaps)])
res1 <- dt[,mean(score), by=coord]$V1
which takes about 2.5s for all 280k overlaps.
Some more speed can be had by recognizing that the query hits are ordered. We want to calculate a mean for each run of query hits. We start by creating a variable to indicate the ends of each query hit run
idx <- c(queryHits(olaps)[-1] != queryHits(olaps)[-length(olaps)], TRUE)
and then calculate the cumulative scores at the ends of each run, the length of each run, and the difference between the cumulative score at the end and at the start of the run
scoreHits <- cumsum(scores[subjectHits(olaps)])[idx]
n <- diff(c(0L, seq_along(idx)[idx]))
xt <- diff(c(0L, scoreHits))
And finally, the mean is
res2 <- xt / n
This takes about 0.6s for all the data, and is identical to (though more cryptic than?) the data.table result
> identical(res1, res2)
[1] TRUE
The original coordinates corresponding to the means are
start(q)[ queryHits(olaps)[idx] ]
Something like this might be faster :
require(data.table)
subscores <- as.data.table(subscores)
subscores[, cond := V1 < coordvec & V2 > coordvec]
subscores[list(cond)[[1]], mean(V3)]
list(cond)[[1]] because: "When i is a single variable name, it is not considered an expression of column names and is instead evaluated in calling scope." source: ?data.table
Since your example isn't easily reproducible, and even if it were, none of your subscores meet your boolean condition, I'm not sure if this does exactly what you're looking for, but you can use one of the apply family and a function.
myfun <- function(x) {
  y <- subscores[, 1] < x & subscores[, 2] > x
  mean(subscores[y, 3])
}
sapply(coordvec, myfun)
You can also take a look at mclapply. If you have enough memory this will probably speed things up significantly. However, you could also look at the foreach package with similar results. You've got your for loop "correct" by assigning into results rather than growing it, but really, you're doing a lot of comparisons. It will be hard to speed this up much.
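For instance, a minimal sketch of the mclapply version (assuming a Unix-alike; pick mc.cores to match your machine):
library(parallel)
results <- unlist(mclapply(coordvec, myfun, mc.cores = 4))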
