subset slow in large matrix - r

I have a numeric vector of length 5,000,000
>head(coordvec)
[1] 47286545 47286546 47286547 47286548 47286549 472865
and a 3 x 1,400,000 numeric matrix
>head(subscores)
V1 V2 V3
1 47286730 47286725 0.830
2 47286740 47286791 0.065
3 47286750 47286806 -0.165
4 47288371 47288427 0.760
5 47288841 47288890 0.285
6 47288896 47288945 0.225
What I am trying to accomplish is that for each number in coordvec, find the average of V3 for rows in subscores in which V1 and V2 encompass the number in coordvec. To do that, I am taking the following approach:
results<-numeric(length(coordvec))
for(i in 1:length(coordvec)){
select_rows <- subscores[, 1] < coordvec[i] & subscores[, 2] > coordvec[i]
scores_subset <- subscores[select_rows, 3]
results[m]<-mean(scores_subset)
}
This is very slow, and would take a few days to finish. Is there a faster way?
Thanks,
Dan

I think there are two challenging parts to this question. The first is finding the overlaps. I'd use the IRanges package from Bioconductor (?findInterval in the base package might also be useful)
library(IRanges)
creating width 1 ranges representing the coordinate vector, and set of ranges representing the scores; I sort the coordinate vectors for convenience, assuming that duplicate coordinates can be treated the same
coord <- sort(sample(.Machine$integer.max, 5000000))
starts <- sample(.Machine$integer.max, 1200000)
scores <- runif(length(starts))
q <- IRanges(coord, width=1)
s <- IRanges(starts, starts + 100L)
Here we find which query overlaps which subject
system.time({
olaps <- findOverlaps(q, s)
})
This takes about 7s on my laptop. There are different types of overlaps (see ?findOverlaps) so maybe this step requires a bit of refinement.
The result is a pair of vectors indexing the query and overlapping subject.
> olaps
Hits of length 281909
queryLength: 5000000
subjectLength: 1200000
queryHits subjectHits
<integer> <integer>
1 19 685913
2 35 929424
3 46 1130191
4 52 37417
I think this is the end of the first complicated part, finding the 281909 overlaps. (I don't think the data.table answer offered elsewhere addresses this, though I could be mistaken...)
The next challenging part is calculating a large number of means. The built-in way would be something like
olaps0 <- head(olaps, 10000)
system.time({
res0 <- tapply(scores[subjectHits(olaps0)], queryHits(olaps0), mean)
})
which takes about 3.25s on my computer and appears to scale linearly, so maybe 90s for the 280k overlaps. But I think we can accomplish this tabulation efficiently with data.table. The original coordinates are start(v)[queryHits(olaps)], so as
require(data.table)
dt <- data.table(coord=start(q)[queryHits(olaps)],
score=scores[subjectHits(olaps)])
res1 <- dt[,mean(score), by=coord]$V1
which takes about 2.5s for all 280k overlaps.
Some more speed can be had by recognizing that the query hits are ordered. We want to calculate a mean for each run of query hits. We start by creating a variable to indicate the ends of each query hit run
idx <- c(queryHits(olaps)[-1] != queryHits(olaps)[-length(olaps)], TRUE)
and then calculate the cumulative scores at the ends of each run, the length of each run, and the difference between the cumulative score at the end and at the start of the run
scoreHits <- cumsum(scores[subjectHits(olaps)])[idx]
n <- diff(c(0L, seq_along(idx)[idx]))
xt <- diff(c(0L, scoreHits))
And finally, the mean is
res2 <- xt / n
This takes about 0.6s for all the data, and is identical to (though more cryptic than?) the data.table result
> identical(res1, res2)
[1] TRUE
The original coordinates corresponding to the means are
start(q)[ queryHits(olaps)[idx] ]

Something like this might be faster :
require(data.table)
subscores <- as.data.table(subscores)
subscores[, cond := V1 < coordvec & V2 > coordvec]
subscores[list(cond)[[1]], mean(V3)]
list(cond)[[1]] because: "When i is a single variable name, it is not considered an expression of column names and is instead evaluated in calling scope." source: ?data.table

Since your answer isn't easily reproducible and even if it were, none of your subscores meet your boolean condition, I'm not sure if this does exactly what you're looking for but you can use one of the apply family and a function.
myfun <- function(x) {
y <- subscores[, 1] < x & subscores[, 2] > x
mean(subscores[y, 3])
}
sapply(coordvec, myfun)
You can also take a look at mclapply. If you have enough memory this will probably speed things up significantly. However, you could also look at the foreach package with similar results. You've got your for loop "correct" by assigning into results rather than growing it, but really, you're doing a lot of comparisons. It will be hard to speed this up much.

Related

find all unique combinations of n numbers between 1 and k

I want a list of all possible sets of five (or n) numbers between 1 and 63 (or more generalizably 1 and k)
If computing time wasn't an issue, I could do something like
#Get all combenations of numbers between 1 and 63
indexCombinations <- expand.grid(1:63, 1:63, 1:63, 1:63, 1:63)
#Throw out the rows that have more than one of the same number in them
allDifferent <- apply(indexCombinations, 1, function(x){
length(x) == length(unique(x))
} # function
) # apply
indexCombinationsValid <- indexCombinations[allDifferent,]
# And then just take the unique values
indexCombinationsValidUnique <- unique(indexCombinationsValid)
The finding of unique values, I am concerned, is going to be prohibitively slow. Furthermore, I end up having to make a bunch of rows in the first place I never use. I was wondering if anyone has a more elegant and efficient way of getting a data frame or matrix of unique combinations of each of five numbers (or n numbers) between one and some some range of values.
Credit to #SymbolixAU for a very elegant solution, which I re-post here as an answer:
n <- 1:63; x <- combn(n, m = 5)

counting matching elements of two vectors but not including repeated elements in the count

I've search a lot in this forum. However, I didn't found a similar problem as the one I'm facing.
My question is:
I have two vectors
x <- c(1,1,2,2,3,3,3,4,4,4,6,7,8) and z <- c(1,1,2,4,5,5,5)
I need to count the number of times x or z appears in each other including if they are repeated or not.
The answer for this problem should be 4 because :
There are two number 1, one number 2, and one number 4 in each vector.
Functions like match() don't help since they will return the answer of repeated for non repeated numbers. Using unique() will also alter the final answer from 4 to 3
What I came up with was a loop that every time it found one number in the other, it would remove from the list so it won't be counted again.
The loop works fine for this size of this example; however, searching for larger vectors numerous times makes my loop inefficient and too slow for my purposes.
system.time({
for(n in 1:1000){
x <- c(1,1,2,2,3,3,3,4,4,4,6,7,8)
z <- c(1,1,2,4,5,5,5)
score <- 0
for(s in spectrum){
if(s %in% sequence){
sequence <- sequence[-which(sequence==s)[1]]
score <- score + 1
}
}
}
})
Can someone suggest a better method?
I've tried using lapply, for short vectors it is faster, but it became slower for longer ones..
Use R's vectorization to your advantage here. There's no looping necessary.
You could use a table to look at the frequencies,
table(z[z %in% x])
#
# 1 2 4
# 2 1 1
And then take the sum of the table for the total
sum(table(z[z %in% x]))
# [1] 4

Aligning sequences with missing values

The language I'm using is R, but you don't necessarily need to know about R to answer the question.
Question:
I have a sequence that can be considered the ground truth, and another sequence that is a shifted version of the first, with some missing values. I'd like to know how to align the two.
setup
I have a sequence ground.truth that is basically a set of times:
ground.truth <- rep( seq(1,by=4,length.out=10), 5 ) +
rep( seq(0,length.out=5,by=4*10+30), each=10 )
Think of ground.truth as times where I'm doing the following:
{take a sample every 4 seconds for 10 times, then wait 30 seconds} x 5
I have a second sequence observations, which is ground.truth shifted with 20% of the values missing:
nSamples <- length(ground.truth)
idx_to_keep <- sort(sample( 1:nSamples, .8*nSamples ))
theLag <- runif(1)*100
observations <- ground.truth[idx_to_keep] + theLag
nObs <- length(observations)
If I plot these vectors this is what it looks like (remember, think of these as times):
What I've tried. I want to:
calculate the shift (theLag in my example above)
calculate a vector idx such that ground.truth[idx] == observations - theLag
First, assume we know theLag. Note that ground.truth[1] is not necessarily observations[1]-theLag. In fact, we have ground.truth[1] == observations[1+lagI]-theLag for some lagI.
To calculate this, I thought I'd use cross-correlation (ccf function).
However, whenever I do this I get a lag with a max. cross-correlation of 0, meaning ground.truth[1] == observations[1] - theLag. But I've tried this in examples where I've explicitly made sure that observations[1] - theLag is not ground.truth[1] (i.e. modify idx_to_keep to make sure it doesn't have 1 in it).
The shift theLag shouldn't affect the cross-correlation (isn't ccf(x,y) == ccf(x,y-constant)?) so I was going to work it out later.
Perhaps I'm misunderstanding though, because observations doesn't have as many values in it as ground.truth? Even in the simpler case where I set theLag==0, the cross correlation function still fails to identify the correct lag, which leads me to believe I'm thinking about this wrong.
Does anyone have a general methodology for me to go about this, or know of some R functions/packages that could help?
Thanks a lot.
For the lag, you can compute all the differences (distances) between your two sets of points:
diffs <- outer(observations, ground.truth, '-')
Your lag should be the value that appears length(observations) times:
which(table(diffs) == length(observations))
# 55.715382960625
# 86
Double check:
theLag
# [1] 55.71538
The second part of your question is easy once you have found theLag:
idx <- which(ground.truth %in% (observations - theLag))
The following should work if your time series are not too long.
You have two vectors of time-stamps,
the second one being a shifted and incomplete copy of the first,
and you want to find by how much it was shifted.
# Sample data
n <- 10
x <- cumsum(rexp(n,.1))
theLag <- rnorm(1)
y <- theLag + x[sort(sample(1:n, floor(.8*n)))]
We can try all possible lags and, for each one,
compute how bad the alignment is,
by matching each observed timestamp with the closest
"truth" timestamp.
# Loss function
library(sqldf)
f <- function(u) {
# Put all the values in a data.frame
d1 <- data.frame(g="truth", value=x)
d2 <- data.frame(g="observed", value=y+u)
d <- rbind(d1,d2)
# For each observed value, find the next truth value
# (we could take the nearest, on either side,
# but it would be more complicated)
d <- sqldf("
SELECT A.g, A.value,
( SELECT MIN(B.value)
FROM d AS B
WHERE B.g='truth'
AND B.value >= A.value
) AS next
FROM d AS A
WHERE A.g = 'observed'
")
# If u is greater than the lag, there are missing values.
# If u is smaller, the differences decrease
# as we approach the lag.
if(any(is.na(d))) {
return(Inf)
} else {
return( sum(d$`next` - d$value, na.rm=TRUE) )
}
}
We can now search for the best lag.
# Look at the loss function
sapply( seq(-2,2,by=.1), f )
# Minimize the loss function.
# Change the interval if it does not converge,
# i.e., if it seems in contradiction with the values above
# or if the minimum is Inf
(r <- optimize(f, c(-3,3)))
-r$minimum
theLag # Same value, most of the time

Matching elements in a list

Just starting to program in R... Got stumped on this one, perhaps because I don't know where to begin.
Define a random variable to be equal to the number of trials before there is a match. So if you have a list of numbers, (4,5,7,11,3,11,12,8,8,1....), the first value of the random variable is 6 because by then there are two 11's.(4,5,7,11,3,11) The second value is 3 because then you have 2 8's..12,8,8.
The code below creates the list of numbers, u, by simulating from a uniform distribution.
Thank-you for any help or pointers. I've included a full description of the problem I am solving below if anyone is interested (trying to learn by coding a statistics text).
set.seed(1); u = matrix(runif(1000), nrow=1000)
u[u > 0 & u <= 1/12] <- 1
u[u > 1/12 & u <= 2/12] <- 2
u[u > 2/12 & u <= 3/12] <- 3
u[u > 3/12 & u <= 4/12] <- 4
u[u > 4/12 & u <= 5/12] <- 5
u[u > 5/12 & u <= 6/12] <- 6
u[u > 6/12 & u <= 7/12] <- 7
u[u > 7/12 & u <= 8/12] <- 8
u[u > 8/12 & u <= 9/12] <- 9
u[u > 9/12 & u <= 10/12] <- 10
u[u > 10/12 & u <= 11/12] <- 11
u[u > 11/12 & u < 12/12] <- 12
table(u); u[1:10,]
Example 2.6-3 Concepts in Probability and Stochastic Modeling, Higgins
Suppose we were to ask people at random in which month they were born. Let the random variable X denote the number of people we would need to ask before we found two people born in the same month. The possible values for X are 2,3,...13. That is, at least two people must be asked in order to have a match and no more than 13 need to be asked. With the simplifying assumption that every month is an equally likely candidate for a response, a computer simulation was used to estimate the probabilitiy mass function of X. The simulation generated birth months until a match was found. Based on 1000 repetitions of this experiment, the following empirical distribution and sample statistics were obtained...
R has a steep initial learning curve. I don't think it's fair to assume this is your homework, and yes, it's possible to find solutions if you know what you're looking for. However, I remember it being difficult at times to research problems online simply because I didn't know what to search for (I wasn't familiar enough with the terminology).
Below is an explanation of one approach to solving the problem in R. Read the commented code and try and figure out exactly what it's doing. Still, I would recommend working through a good beginner resource. From memory, a good one to get up and running is icebreakeR, but there are many out there...
# set the number of simulations
nsim <- 10000
# Create a matrix, with nsim columns, and fill it with something.
# The something with which you'll populate it is a random sample,
# with replacement, of month names (held in a built-in vector called
# 'month.abb'). We're telling the sample function that it should take
# 13*nsim samples, and these will be used to fill the matrix, which
# has nsim columns (and hence 13 rows). We've chosen to take samples
# of length 13, because as your textbook states, 13 is the maximum
# number of month names necessary for a month name to be duplicated.
mat <- matrix(sample(month.abb, 13*nsim, replace=TRUE), ncol=nsim)
# If you like, take a look at the first 10 columns
mat[, 1:10]
# We want to find the position of the first duplicated value for each column.
# Here's one way to do this, but it might be a bit confusing if you're just
# starting out. The 'apply' family of functions is very useful for
# repeatedly applying a function to columns/rows/elements of an object.
# Here, 'apply(mat, 2, foo)' means that for each column (2 represents columns,
# 1 would apply to rows, and 1:2 would apply to every cell), do 'foo' to that
# column. Our function below extends this a little with a custom function. It
# says: for each column of mat in turn, call that column 'x' and perform
# 'match(1, duplicated(x))'. This match function will return the position
# of the first '1' in the vector 'duplicated(x)'. The vector 'duplicated(x)'
# is a logical (boolean) vector that indicates, for each element of x,
# whether that element has already occurred earlier in the vector (i.e. if
# the month name has already occurred earlier in x, the corresponding element
# of duplicated(x) will be TRUE (which equals 1), else it will be false (0).
# So the match function returns the position of the first duplicated month
# name (well, actually the second instance of that month name). e.g. if
# x consists of 'Jan', 'Feb', 'Jan', 'Mar', then duplicated(x) will be
# FALSE, FALSE, TRUE, FALSE, and match(1, duplicated(x)) will return 3.
# Referring back to your textbook problem, this is x, a realisation of the
# random variable X.
# Because we've used the apply function, the object 'res' will end up with
# nsim realisations of X, and these can be plotted as a histogram.
res <- apply(mat, 2, function(x) match(1, duplicated(x)))
hist(res, breaks=seq(0.5, 13.5, 1))

Testing if rows of a matrix or data frame are sorted in R

What is an efficient way to test if rows in a matrix are sorted? [Update: see Aaron's Rcpp answer - straightforward & very fast.]
I am porting some code that uses issorted(,'rows') from Matlab. As it seems that is.unsorted does not extend beyond vectors, I'm writing or looking for something else. The naive method is to check that the sorted version of the matrix (or data frame) is the same as the original, but that's obviously inefficient.
NB: For sorting, a la sortrows() in Matlab, my code essentially uses SortedDF <- DF[do.call(order, DF),] (it's wrapped in a larger function that converts matrices to data frames, passes parameters to order, etc.). I wouldn't be surprised if there are faster implementations (data table comes to mind).
Update 1: To clarify: I'm not testing for sorting intra-row or intra-columns. (Such sorting generally results in an algebraically different matrix.)
As an example for creating an unsorted matrix:
set.seed(0)
x <- as.data.frame(matrix(sample(3, 60, replace = TRUE), ncol = 6, byrow = TRUE))
Its sorted version is:
y <- x[do.call(order, x),]
A proper test, say testSorted, would return FALSE for testSorted(x) and TRUE for testSorted(y).
Update 2:
The answers below are all good - they are concise and do the test. Regarding efficiency, it looks like these are sorting the data after all.
I've tried these with rather large matrices, such as 1M x 10, (just changing the creation of x above) and all have about the same time and memory cost. What's peculiar is that they all consume more time for unsorted objects (about 5.5 seconds for 1Mx10) than for sorted ones (about 0.5 seconds for y). This suggests they're sorting before testing.
I tested by creating a z matrix:
z <- y
z[,2] <- y[,1]
z[,1] <- y[,2]
In this case, all of the methods take about 0.85 seconds to complete. Anyway, finishing in 5.5 seconds isn't terrible (in fact, that seems to be right about the time necessary to sort the object), but knowing that a sorted matrix is 11X faster suggests that a test that doesn't sort could be even faster. In the case of the 1M row matrix, the first three rows of x are:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 3 1 2 2 3 1 3 3 2 2
2 1 1 1 3 2 3 2 3 3 2
3 3 3 1 2 1 1 2 1 2 3
There's no need to look beyond row 2, though vectorization isn't a bad idea.
(I've also added the byrow argument for the creation of x, so that row values don't depend on the size of x.)
Update 3:
Another comparison for this testing can be found with the sort -c command in Linux. If the file is already written (using write.table()), with 1M rows, then time sort -c myfile.txt takes 0.003 seconds for the unsorted data and 0.101 seconds for the sorted data. I don't intend to write out to a file, but it's a useful comparison.
Update 4:
Aaron's Rcpp method bested all other methods offered here and that I've tried (including the sort -c comparison above: in-memory is expected to beat on-disk). As for the ratio relative to other methods, it's hard to tell: the denominator is too small to give an accurate measurement, and I've not extensively explored microbenchmark. The speedups can be very large (4-5 orders of magnitude) for some matrices (e.g. one made with rnorm), but this is misleading - checking can terminate after only a couple of rows. I've had speedups with the example matrices of about 25-60 for the unsorted and about 1.1X for the sorted, as the competing methods were already very fast if the data is sorted.
Since this does the right thing (i.e. no sorting, just testing), and does it very quickly, it's the accepted answer.
If y is sorted then do.call(order,y) returns 1:nrow(y).
testSorted = function(y){all(do.call(order,y)==1:nrow(y))}
note this doesn't compare the matrices, but it doesn't dip out as soon as it finds a non-match.
Well, why don't you use:
all(do.call(order, y)==seq(nrow(y)))
That avoids creating the ordered matrix, and ensures it checks your style of ordering.
Newer: I decided I could use the Rcpp practice...
library(Rcpp)
library(inline)
isRowSorted <- cxxfunction(signature(A="numeric"), body='
Rcpp::NumericMatrix Am(A);
for(int i = 1; i < Am.nrow(); i++) {
for(int j = 0; j < Am.ncol(); j++) {
if( Am(i-1,j) < Am(i,j) ) { break; }
if( Am(i-1,j) > Am(i,j) ) { return(wrap(false)); }
}
}
return(wrap(true));
', plugin="Rcpp")
rownames(y) <- NULL # because as.matrix is faster without rownames
isRowSorted(as.matrix(y))
New: This R-only hack is the same speed for all matrices; it's definitely faster for sorted matrices; for unsorted ones it depends on the nature of the unsortedness.
iss3 <- function(x) {
x2 <- sign(do.call(cbind, lapply(x, diff)))
x3 <- t(x2)*(2^((ncol(x)-1):0))
all(colSums(x3)>=0)
}
Original: This is faster for some unsorted matrices. How much faster will depends on where the unsorted elements are; this looks at the matrix column by column so unsortedness on the left side will be noticed much faster than unsorted on the right, while top/bottomness doesn't matter nearly as much.
iss2 <- function(y) {
b <- c(0,nrow(y))
for(i in 1:ncol(y)) {
z <- rle(y[,i])
b2 <- cumsum(z$lengths)
sp <- split(z$values, cut(b2, breaks=b))
for(spi in sp) {
if(is.unsorted(spi)) return(FALSE)
}
b <- c(0, b2)
}
return(TRUE)
}
Well, the brute-force approach is to loop and compare, aborting as soon as a violation is found.
That approach can be implemented and tested easily in R, and then be carried over to a simple C++ function we can connect to R via inline and Rcpp (or plain C if you must) as looping is something that really benefits from an implementation in a compiled language.
Otherwise, can you not use something like diff() and check if all increments are non-negative?
You can use your do.call statement with is.unsorted:
issorted.matrix <- function(A) {!is.unsorted(do.call("order",data.frame(A)))}

Resources