Related
I would like to find out the three closest numbers in a vector.
Something like
v = c(10,23,25,26,38,50)
c = findClosest(v,3)
c
23 25 26
I tried with sort(colSums(as.matrix(dist(x))))[1:3], and it kind of works, but it selects the three numbers with minimum overall distance not the three closest numbers.
There is already an answer for matlab, but I do not know how to translate it to R:
%finds the index with the minimal difference in A
minDiffInd = find(abs(diff(A))==min(abs(diff(A))));
%extract this index, and it's neighbor index from A
val1 = A(minDiffInd);
val2 = A(minDiffInd+1);
How to find two closest (nearest) values within a vector in MATLAB?
My assumption is that the for the n nearest values, the only thing that matters is the difference between the v[i] - v[i - (n-1)]. That is, finding the minimum of diff(x, lag = n - 1L).
findClosest <- function(x, n) {
x <- sort(x)
x[seq.int(which.min(diff(x, lag = n - 1L)), length.out = n)]
}
findClosest(v, 3L)
[1] 23 25 26
Let's define "nearest numbers" by "numbers with minimal sum of L1 distances". You can achieve what you want by a combination of diff and windowed sum.
You could write a much shorter function but I wrote it step by step to make it easier to follow.
v <- c(10,23,25,26,38,50)
#' Find the n nearest numbers in a vector
#'
#' #param v Numeric vector
#' #param n Number of nearest numbers to extract
#'
#' #details "Nearest numbers" defined as the numbers which minimise the
#' within-group sum of L1 distances.
#'
findClosest <- function(v, n) {
# Sort and remove NA
v <- sort(v, na.last = NA)
# Compute L1 distances between closest points. We know each point is next to
# its closest neighbour since we sorted.
delta <- diff(v)
# Compute sum of L1 distances on a rolling window with n - 1 elements
# Why n-1 ? Because we are looking at deltas and 2 deltas ~ 3 elements.
withingroup_distances <- zoo::rollsum(delta, k = n - 1)
# Now it's simply finding the group with minimum within-group sum
# And working out the elements
group_index <- which.min(withingroup_distances)
element_indices <- group_index + 0:(n-1)
v[element_indices]
}
findClosest(v, 2)
# 25 26
findClosest(v, 3)
# 23 25 26
A base R option, idea being we first sort the vector and subtract every ith element with i + n - 1 element in the sorted vector and select the group which has minimum difference.
closest_n_vectors <- function(v, n) {
v1 <- sort(v)
inds <- which.min(sapply(head(seq_along(v1), -(n - 1)), function(x)
v1[x + n -1] - v1[x]))
v1[inds: (inds + n - 1)]
}
closest_n_vectors(v, 3)
#[1] 23 25 26
closest_n_vectors(c(2, 10, 1, 20, 4, 5, 23), 2)
#[1] 1 2
closest_n_vectors(c(19, 23, 45, 67, 89, 65, 1), 2)
#[1] 65 67
closest_n_vectors(c(19, 23, 45, 67, 89, 65, 1), 3)
#[1] 1 19 23
In case of tie this will return the numbers with smallest value since we are using which.min.
BENCHMARKS
Since we have got quite a few answers, it is worth doing a benchmark of all the solutions till now
set.seed(1234)
x <- sample(100000000, 100000)
identical(findClosest_antoine(x, 3), findClosest_Sotos(x, 3),
closest_n_vectors_Ronak(x, 3), findClosest_Cole(x, 3))
#[1] TRUE
microbenchmark::microbenchmark(
antoine = findClosest_antoine(x, 3),
Sotos = findClosest_Sotos(x, 3),
Ronak = closest_n_vectors_Ronak(x, 3),
Cole = findClosest_Cole(x, 3),
times = 10
)
#Unit: milliseconds
# expr min lq mean median uq max neval cld
#antoine 148.751 159.071 163.298 162.581 167.365 181.314 10 b
# Sotos 1086.098 1349.762 1372.232 1398.211 1453.217 1553.945 10 c
# Ronak 54.248 56.870 78.886 83.129 94.748 100.299 10 a
# Cole 4.958 5.042 6.202 6.047 7.386 7.915 10 a
An idea is to use zoo library to do a rolling operation, i.e.
library(zoo)
m1 <- rollapply(v, 3, by = 1, function(i)c(sum(diff(i)), c(i)))
m1[which.min(m1[, 1]),][-1]
#[1] 23 25 26
Or make it into a function,
findClosest <- function(vec, n) {
require(zoo)
vec1 <- sort(vec)
m1 <- rollapply(vec1, n, by = 1, function(i) c(sum(diff(i)), c(i)))
return(m1[which.min(m1[, 1]),][-1])
}
findClosest(v, 3)
#[1] 23 25 26
For use in a dataframe,
data%>%
group_by(var1,var2)%>%
do(data.frame(findClosest(.$val,3)))
Actually a really nice problem to which I came up with a solution (see below), which is, however, not beautiful:
Assume you have a vector x and a matrix A which contains the start of an interval in the first column and the end of the interval in the second.
How can I get the elements of A, which fall into the intervals given by A?
x <- c(4, 7, 15)
A <- cbind(c(3, 9, 14), c(5, 11, 16))
Expected output:
[1] 4 15
You could you the following information, if this would be helpful for increasing the performance:
Both, the vector and the rows of the matrix are ordered and the intervals don't overlap. All intervals have the same length. All numbers are integers, but can be huge.
Now I did not want to be lazy and came up with the following solution, which is too slow for long vectors and matrices:
x <- c(4, 7, 15) # Define input vector
A <- cbind(c(3, 9, 14), c(5, 11, 16)) # Define matrix with intervals
b <- vector()
for (i in 1:nrow(A)) {
b <- c(b, A[i, 1]:A[i, 2])
}
x[x %in% b]
I know that loops in R can be slow, but I did not know how to write the operation without one (maybe there is a way with apply).
We can use sapply to loop over each element of x and find if it lies in the range of any of those matrix values.
x[sapply(x, function(i) any(i > A[, 1] & i < A[,2]))]
#[1] 4 15
In case, if length(x) and nrow(A) are same then we don't even need the sapply loop and we can use this comparison directly.
x[x > A[, 1] & x < A[,2]]
#[1] 4 15
Here is a method that does not use an explicit loop or an apply function. outer is sometimes much faster.
x[rowSums(outer(x, A[,1], `>=`) & outer(x, A[,2], `<=`)) > 0]
[1] 4 15
This answer is late, but today I had the same problem to solve and my answer is maybe helpful for future readers. My solution was the following:
f3 <- function(x,A) {
Reduce(f = "|",
x = lapply(1:NROW(A),function(k) x>A[k,1] & x<A[k,2]),
init = logical(length(x)))
}
This function return a logical vector of length(x) indicating whether the corresponding value in x can be found in the intervals or not. If I want to get the elements I simply have to write
x[f3(x,A)]
I did some benchmarks and my function seems to work very well, also while testing with larger data.
Lets define the other solutions suggested here in this post:
f1 <- function(x,A) {
sapply(x, function(i) any(i > A[, 1] & i < A[,2]))
}
f2 <- function(x,A) {
rowSums(outer(x, A[,1], `>`) & outer(x, A[,2], `<`)) > 0
}
Now they are also returning a logical vector.
The benchmarks on my machine are following:
x <- c(4, 7, 15)
A <- cbind(c(3, 9, 14), c(5, 11, 16))
microbenchmark::microbenchmark(f1(x,A), f2(x,A), f3(x,A))
#Unit: microseconds
# expr min lq mean median uq max neval
#f1(x, A) 21.5 23.20 25.023 24.30 25.40 61.8 100
#f2(x, A) 18.8 21.20 23.606 22.75 23.70 75.4 100
#f3(x, A) 13.9 15.85 18.682 18.30 19.15 52.2 100
It seems like there is no big difference, but the follwoing example will make it more obvious:
x <- seq(1,100,length.out = 1e6)
A <- cbind(20:70,(20:70)+0.5)
microbenchmark::microbenchmark(f1(x,A), f2(x,A), f3(x,A), times=10)
#Unit: milliseconds
# expr min lq mean median uq max neval
#f1(x, A) 4176.172 4227.6709 4419.6010 4484.2946 4539.9668 4569.7412 10
#f2(x, A) 1418.498 1511.5647 1633.4659 1571.0249 1703.6651 1987.8895 10
#f3(x, A) 614.556 643.4138 704.3383 672.5385 770.7751 873.1291 10
That the functions all return the same result can be checked e.g. via:
all(f1(x,A)==f3(x,A))
R offers max and min, but I do not see a really fast way to find another value in the order, apart from sorting the whole vector and then picking a value x from this vector.
Is there a faster way to get the second highest value, for example?
Use the partial argument of sort(). For the second highest value:
n <- length(x)
sort(x,partial=n-1)[n-1]
Slightly slower alternative, just for the records:
x <- c(12.45,34,4,0,-234,45.6,4)
max( x[x!=max(x)] )
min( x[x!=min(x)] )
Rfast has a function called nth_element that does exactly what you ask.
Further the methods discussed above that are based on partial sort, don't support finding the k smallest values
Update (28/FEB/21) package kit offers a faster implementation (topn) see https://stackoverflow.com/a/66367996/4729755, https://stackoverflow.com/a/53146559/4729755
Disclaimer: An issue appears to occur when dealing with integers which can by bypassed by using as.numeric (e.g. Rfast::nth(as.numeric(1:10), 2)), and will be addressed in the next update of Rfast.
Rfast::nth(x, 5, descending = T)
Will return the 5th largest element of x, while
Rfast::nth(x, 5, descending = F)
Will return the 5th smallest element of x
Benchmarks below against most popular answers.
For 10 thousand numbers:
N = 10000
x = rnorm(N)
maxN <- function(x, N=2){
len <- length(x)
if(N>len){
warning('N greater than length(x). Setting N=length(x)')
N <- length(x)
}
sort(x,partial=len-N+1)[len-N+1]
}
microbenchmark::microbenchmark(
Rfast = Rfast::nth(x,5,descending = T),
maxn = maxN(x,5),
order = x[order(x, decreasing = T)[5]])
Unit: microseconds
expr min lq mean median uq max neval
Rfast 160.364 179.607 202.8024 194.575 210.1830 351.517 100
maxN 396.419 423.360 559.2707 446.452 487.0775 4949.452 100
order 1288.466 1343.417 1746.7627 1433.221 1500.7865 13768.148 100
For 1 million numbers:
N = 1e6
x = rnorm(N)
microbenchmark::microbenchmark(
Rfast = Rfast::nth(x,5,descending = T),
maxN = maxN(x,5),
order = x[order(x, decreasing = T)[5]])
Unit: milliseconds
expr min lq mean median uq max neval
Rfast 89.7722 93.63674 114.9893 104.6325 120.5767 204.8839 100
maxN 150.2822 207.03922 235.3037 241.7604 259.7476 336.7051 100
order 930.8924 968.54785 1005.5487 991.7995 1031.0290 1164.9129 100
I wrapped Rob's answer up into a slightly more general function, which can be used to find the 2nd, 3rd, 4th (etc.) max:
maxN <- function(x, N=2){
len <- length(x)
if(N>len){
warning('N greater than length(x). Setting N=length(x)')
N <- length(x)
}
sort(x,partial=len-N+1)[len-N+1]
}
maxN(1:10)
Here is an easy way to find the indices of N smallest/largest values in a vector(Example for N = 3):
N <- 3
N Smallest:
ndx <- order(x)[1:N]
N Largest:
ndx <- order(x, decreasing = T)[1:N]
So you can extract the values as:
x[ndx]
For nth highest value,
sort(x, TRUE)[n]
Here you go... kit is the obvious winner!
N = 1e6
x = rnorm(N)
maxN <- function(x, N=2){
len <- length(x)
if(N>len){
warning('N greater than length(x). Setting N=length(x)')
N <- length(x)
}
sort(x,partial=len-N+1)[len-N+1]
}
microbenchmark::microbenchmark(
Rfast = Rfast::nth(x,5,descending = T),
maxN = maxN(x,5),
order = x[order(x, decreasing = T)[5]],
kit = x[kit::topn(x, 5L,decreasing = T)[5L]]
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# Rfast 12.311168 12.473771 16.36982 12.702134 16.110779 102.749873 100
# maxN 12.922118 13.124358 17.49628 18.977537 20.053139 28.928694 100
# order 50.443100 50.926975 52.54067 51.270163 52.323116 66.561606 100
# kit 1.177202 1.216371 1.29542 1.240228 1.297286 2.771715 100
Edit: I forgot that kit::topn has hasna option...let's do another run.
microbenchmark::microbenchmark(
Rfast = Rfast::nth(x,5,descending = T),
maxN = maxN(x,5),
order = x[order(x, decreasing = T)[5]],
kit = x[kit::topn(x, 5L,decreasing = T)[5L]],
kit2 = x[kit::topn(x, 5L,decreasing = T,hasna = F)[5L]],
unit = "ms"
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# Rfast 13.194314 13.358787 14.7227116 13.4560340 14.551194 24.524105 100
# maxN 7.378960 7.527661 10.0747803 7.7119715 12.217756 67.409526 100
# order 50.088927 50.488832 52.4714347 50.7415680 52.267003 70.062662 100
# kit 1.180698 1.217237 1.2975441 1.2429790 1.278243 3.263202 100
# kit2 0.842354 0.876329 0.9398055 0.9109095 0.944407 2.135903 100
Here is the simplest way I found,
num <- c(5665,1615,5154,65564,69895646)
num <- sort(num, decreasing = F)
tail(num, 1) # Highest number
head(tail(num, 2),1) # Second Highest number
head(tail(num, 3),1) # Third Highest number
head(tail(num, n),1) # Generl equation for finding nth Highest number
I found that removing the max element first and then do another max runs in comparable speed:
system.time({a=runif(1000000);m=max(a);i=which.max(a);b=a[-i];max(b)})
user system elapsed
0.092 0.000 0.659
system.time({a=runif(1000000);n=length(a);sort(a,partial=n-1)[n-1]})
user system elapsed
0.096 0.000 0.653
dplyr has the function nth, where the first argument is the vector and the second is which place you want. This goes for repeating elements as well.
For example:
x = c(1,2, 8, 16, 17, 20, 1, 20)
Finding the second largest value:
nth(unique(x),length(unique(x))-1)
[1] 17
When I was recently looking for an R function returning indexes of top N max/min numbers in a given vector, I was surprised there is no such a function.
And this is something very similar.
The brute force solution using base::order function seems to be the easiest one.
topMaxUsingFullSort <- function(x, N) {
sort(x, decreasing = TRUE)[1:min(N, length(x))]
}
But it is not the fastest one in case your N value is relatively small compared to length of the vector x.
On the other side if the N is really small, you can use base::whichMax function iteratively and in each iteration you can replace found value by -Inf
# the input vector 'x' must not contain -Inf value
topMaxUsingWhichMax <- function(x, N) {
vals <- c()
for(i in 1:min(N, length(x))) {
idx <- which.max(x)
vals <- c(vals, x[idx]) # copy-on-modify (this is not an issue because idxs is relative small vector)
x[idx] <- -Inf # copy-on-modify (this is the issue because data vector could be huge)
}
vals
}
I believe you see the problem - the copy-on-modify nature of R. So this will perform better for very very very small N (1,2,3) but it will rapidly slow down for larger N values. And you are iterating over all elements in vector x N times.
I think the best solution in clean R is to use partial base::sort.
topMaxUsingPartialSort <- function(x, N) {
N <- min(N, length(x))
x[x >= -sort(-x, partial=N)[N]][1:N]
}
Then you can select the last (Nth) item from the result of functions defiend above.
Note: functions defined above are just examples - if you want to use them, you have to check/sanity inputs (eg. N > length(x)).
I wrote a small article about something very similar (get indexes of top N max/min values of a vector) at http://palusga.cz/?p=18 - you can find here some benchmarks of similar functions I defined above.
head(sort(x),..) or tail(sort(x),...) should work
This will find the index of the N'th smallest or largest value in the input numeric vector x. Set bottom=TRUE in the arguments if you want the N'th from the bottom, or bottom=FALSE if you want the N'th from the top. N=1 and bottom=TRUE is equivalent to which.min, N=1 and bottom=FALSE is equivalent to which.max.
FindIndicesBottomTopN <- function(x=c(4,-2,5,-77,99),N=1,bottom=FALSE)
{
k1 <- rank(x)
if(bottom==TRUE){
Nindex <- which(k1==N)
Nindex <- Nindex[1]
}
if(bottom==FALSE){
Nindex <- which(k1==(length(x)+1-N))
Nindex <- Nindex[1]
}
return(Nindex)
}
topn = function(vector, n){
maxs=c()
ind=c()
for (i in 1:n){
biggest=match(max(vector), vector)
ind[i]=biggest
maxs[i]=max(vector)
vector=vector[-biggest]
}
mat=cbind(maxs, ind)
return(mat)
}
this function will return a matrix with the top n values and their indices.
hope it helps
VDevi-Chou
You can identify the next higher value with cummax(). If you want the location of the each new higher value for example you can pass your vector of cummax() values to the diff() function to identify locations at which the cummax() value changed. say we have the vector
v <- c(4,6,3,2,-5,6,8,12,16)
cummax(v) will give us the vector
4 6 6 6 6 6 8 12 16
Now, if you want to find the location of a change in cummax() you have many options I tend to use sign(diff(cummax(v))). You have to adjust for the lost first element because of diff(). The complete code for vector v would be:
which(sign(diff(cummax(v)))==1)+1
You can use the sort keyword like this:
sort(unique(c))[1:N]
Example:
c <- c(4,2,44,2,1,45,34,2,4,22,244)
sort(unique(c), decreasing = TRUE)[1:5]
will give the first 5 max numbers.
This question is related to my previous one. Here is a small sample data. I have used both data.table and data.frame to find a faster solution.
test.dt <- data.table(strt=c(1,1,2,3,5,2), end=c(2,1,5,5,5,4), a1.2=c(1,2,3,4,5,6),
a2.3=c(2,4,6,8,10,12), a3.4=c(3,1,2,4,5,1), a4.5=c(5,1,15,10,12,10),
a5.6=c(4,8,2,1,3,9))
test.dt[,rown:=as.numeric(row.names(test.dt))]
test.df <- data.frame(strt=c(1,1,2,3,5,2), end=c(2,1,5,5,5,4), a1.2=c(1,2,3,4,5,6),
a2.3=c(2,4,6,8,10,12), a3.4=c(3,1,2,4,5,1), a4.5=c(5,1,15,10,12,10),
a5.6=c(4,8,2,1,3,9))
test.df$rown <- as.numeric(row.names(test.df))
> test.df
strt end a1.2 a2.3 a3.4 a4.5 a5.6 rown
1 1 2 1 2 3 5 4 1
2 1 1 2 4 1 1 8 2
3 2 5 3 6 2 15 2 3
4 3 5 4 8 4 10 1 4
5 5 5 5 10 5 12 3 5
6 2 4 6 12 1 10 9 6
I want to use the start and end column values to determine the range of columns to subset (columns from a1.2 to a5.6) and obtain the mean. For example, in the first row, since strt=1 and end=2, I need to get the mean of a1.2 and a2.3; in the third row, I need to get the mean of a2.3, a3.4, a4.5, and a5.6
The output should be a vector like this
> k
1 2 3 4 5 6
1.500000 2.000000 6.250000 5.000000 3.000000 7.666667
Here, is what I tried:
Solution 1: This uses the data.table and applies a function over it.
func.dt <- function(rown, x, y) {
tmp <- paste0("a", x, "." , x+1)
tmp1 <- paste0("a", y, "." , y+1)
rowMeans(test.dt[rown,get(tmp):get(tmp1), with=FALSE])
}
k <- test.dt[, func.dt(rown, strt, end), by=.(rown)]
Solution 2: This uses the data.frame and applies a function over it.
func.df <- function(rown, x, y) {
rowMeans(test.df[rown,(x+2):(y+2), drop=FALSE])
}
k1 <- mapply(func.df, test.df$rown, test.df$strt, test.df$end)
Solution 3: This uses the data.frame and loops through it.
test.ave <- rep(NA, length(test1$strt))
for (i in 1 : length(test.df$strt)) {
test.ave[i] <- rowMeans(test.df[i, as.numeric(test.df[i,1]+2):as.numeric(test.df[i,2]+2), drop=FALSE])
}
Benchmarking shows that Solution 2 is the fastest.
test replications elapsed relative user.self sys.self user.child sys.child
1 sol1 100 0.67 4.786 0.67 0 NA NA
2 sol2 100 0.14 1.000 0.14 0 NA NA
3 sol3 100 0.15 1.071 0.16 0 NA NA
But, this is not good enough for me. Given the size of my data, these functions would need to run for a few days before I get the output. I am sure that I am not fully utilizing the power of data.table and I also know that my functions are crappy (they refer to the dataset in the global environment without passing it). Unfortunately, I am out of my depth and do not know how to fix these issues and make my functions fast. I would greatly appreciate any suggestions that help in improving my function(s) or point to alternate solutions.
I was curious how fast I could make this without resorting to writing custom C or C++ code. The best I could come up with is below. Note that using mean.default will provide greater precision, since it does a second pass over the data for error correction.
f_jmu <- compiler::cmpfun({function(m) {
# remove start/end columns from 'm' matrix
ma <- m[,-(1:2)]
# column index for each row in 'ma' matrix
cm <- col(ma)
# logical index of whether we need the column for each row
i <- cm >= m[,1L] & cm <= m[,2L]
# multiply the input matrix by the index matrix and sum it
# divide by the sum of the index matrix to get the mean
rowSums(i*ma) / rowSums(i)
}})
The Rcpp function is still faster (not surprisingly), but the function above gets respectably close. Here's an example on 50 million observations on my laptop with an i7-4600U and 12GB of RAM.
set.seed(21)
N <- 5e7
test.df <- data.frame(strt = 1L,
end = sample(5, N, replace = TRUE),
a1.2 = sample(3, N, replace = TRUE),
a2.3 = sample(7, N, replace = TRUE),
a3.4 = sample(14, N, replace = TRUE),
a4.5 = sample(8, N, replace = TRUE),
a5.6 = sample(30, N, replace = TRUE))
test.df$strt <- pmax(1L, test.df$end - sample(3, N, replace = TRUE) + 1L)
test.m <- as.matrix(test.df)
Also note that I take care to ensure that test.m is an integer matrix. That helps reduce the memory footprint, which can help make things faster.
R> system.time(st1 <- MYrcpp(test.m))
user system elapsed
0.900 0.216 1.112
R> system.time(st2 <- f_jmu(test.m))
user system elapsed
6.804 0.756 7.560
R> identical(st1, st2)
[1] TRUE
Unless you can think of a way to do this with a clever subsetting approach, I think you've reached R's speed barrier. You'll want to use a low-level language like C++ for this problem. Fortunately, the Rcpp package makes interfacing with C++ in R simple. Disclaimer: I've never written a single line of C++ code in my life. This code may be very inefficient.
library(Rcpp)
cppFunction('NumericVector MYrcpp(NumericMatrix x) {
int nrow = x.nrow(), ncol = x.ncol();
NumericVector out(nrow);
for (int i = 0; i < nrow; i++) {
double avg = 0;
int start = x(i,0);
int end = x(i,1);
int N = end - start + 1;
while(start<=end){
avg += x(i, start + 1);
start = start + 1;
}
out[i] = avg/N;
}
return out;
}')
For this code I'm going to pass the data.frame as a matrix (i.e. testM <- as.matrix(test.df))
Let's see if it works...
MYrcpp(testM)
[1] 1.500000 2.000000 6.250000 5.000000 3.000000 7.666667
How fast is it?
Unit: microseconds
expr min lq mean median uq max neval
f2() 1543.099 1632.3025 2039.7350 1843.458 2246.951 4735.851 100
f3() 1859.832 1993.0265 2642.8874 2168.012 2493.788 19619.882 100
f4() 281.541 315.2680 364.2197 345.328 375.877 1089.994 100
MYrcpp(testM) 3.422 10.0205 16.7708 19.552 21.507 56.700 100
Where f2(), f3() and f4() are defined as
f2 <- function(){
func.df <- function(rown, x, y) {
rowMeans(test.df[rown,(x+2):(y+2), drop=FALSE])
}
k1 <- mapply(func.df, test.df$rown, test.df$strt, test.df$end)
}
f3 <- function(){
test.ave <- rep(NA, length(test.df$strt))
for (i in 1 : length(test.df$strt)) {
test.ave[i] <- rowMeans(test.df[i,as.numeric(test.df[i,1]+2):as.numeric(test.df[i,2]+2), drop=FALSE])
}
}
f4 <- function(){
lapply(
apply(test.df,1, function(x){
x[(x[1]+2):(x[2]+2)]}),
mean)
}
That's roughly a 20x increase over the fastest.
Note, to implement the above code you'll need a C complier which R can access. For windows look into Rtools. For more on Rcpp read this
Now let's see how it scales.
N = 5e3
test.df <- data.frame(strt = 1,
end = sample(5, N, replace = TRUE),
a1.2 = sample(3, N, replace = TRUE),
a2.3 = sample(7, N, replace = TRUE),
a3.4 = sample(14, N, replace = TRUE),
a4.5 = sample(8, N, replace = TRUE),
a5.6 = sample(30, N, replace = TRUE))
test.df$rown <- as.numeric(row.names(test.df))
test.dt <- as.data.table(test.df)
microbenchmark(f4(), MYrcpp(testM))
Unit: microseconds
expr min lq mean median uq max neval
f4() 88647.256 108314.549 125451.4045 120736.073 133487.5295 259502.49 100
MYrcpp(testM) 196.003 216.533 242.6732 235.107 261.0125 499.54 100
With 5e3 rows MYrcpp is now 550x faster. This partially due to the fact that f4() is not going to scale well as Richard discusses in the comment. The f4() is essentially invoking a nested for loop by calling an apply within a lapply. Interestingly, the C++ code is also invoking a nested loop by utilizing a while loop inside a for loop. The speed disparity is due in large part to the fact that the C++ code is already complied and does not need to be interrupted into something the machine can understand at run time.
I'm not sure how big your data set is, but when I run MYrcpp on a data.frame with 1e7 rows, which is the largest data.frame I could allocate on my crummy laptop, it ran in 500 milliseconds.
Update: R equivalent of C++ code
MYr <- function(x){
nrow <- nrow(x)
ncol <- ncol(x)
out <- matrix(NA, nrow = 1, ncol = nrow)
for(i in 1:nrow){
avg <- 0
start <- x[i,1]
end <- x[i,2]
N <- end - start + 1
while(start<=end){
avg <- avg + x[i, start + 2]
start = start + 1
}
out[i] <- avg/N
}
out
}
Both MYrcpp and MYr are similar in many ways. Let me discuss a couple of the differences
The first line of MYrcpp is different from the MYr. In words the first line of MYrcpp, NumericVector MYrcpp(NumericMatrix x), means that we are defining a function whose name is MYrcpp which returns an output of class NumericVector and takes an input x of class NumericMatrix.
In C++ you have to define the class of a variable when you introduce it, i.e. int nrow = x.row() is a variable whose name is nrow whose class is int (i.e. integer) and is assigned to be x.nrow() i.e. the number of rows of x. (IGNORE if you're overwhelmed, nrow() is a method for instances of class `NumericVector. Like in Python you call a method by attaching it to the instance. The R equivalent is S3 and S4 methods)
When you subset in C++ you use () instead of [] like in R. Also, indexing begins at zero (like in Python). For example, x(0,1) in C++ is equivalent to x[1,2] in R
++ is an operator that means increment by 1, i.e. j++ is the same as j + 1. += is an operator that means add to together and assign, i.e. a += b is the same as a = a + b
My solution is the first one in the benchmark
library(microbenchmark)
microbenchmark(
lapply(
apply(test.df,1, function(x){
x[(x[1]+2):(x[2]+2)]}),
mean),
test.dt[, func.dt(rown, strt, end), by=.(rown)]
)
min lq mean median uq max neval
138.654 175.7355 254.6245 201.074 244.810 3702.443 100
4243.641 4747.5195 5576.3399 5252.567 6247.201 8520.286 100
It seems to be 25 times faster, but this is a small dataset. I am sure there is a better way to do this than what I have done.
Sorry in advance if "inversion score" isn't the proper terminology. Here's a wiki entry.
Consider a list of values, for instance
1 2 3 4 7 6 9 10 8
would have three penalties (a score of 3)
The 6 comes after 7
The 8 comes after 9
The 8 comes after 10
How can I calculate this inversion for a given vector of numbers in R? Note that some values will be NA, and I just want to skip these.
Your "inversion score" is a central component of Kendall's tau statistic. According to Wikipedia (see link), the tau statistic is (# concordant pairs-#discordant pairs)/(n*(n-1)/2). I believe that what R reports as T is the number of concordant pairs. Therefore, we should be able to reconstruct the number of discordant pairs (which I think is what you want) via n*(n-1)/2-T, as follows
x <- c(1,2,3,4,7,6,9,10,8)
(cc <- cor.test(sort(x),x,method="kendall"))
## Kendall's rank correlation tau
## data: sort(x) and x
## T = 33, p-value = 0.0008543
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## 0.8333333
So this function should work:
ff <- function(x) {
cc <- cor.test(sort(x),x,method="kendall")
n <- length(x)
n*(n-1)/2-unname(cc$statistic["T"])
}
ff(x) is 3 as requested (it would be good if you gave more examples of desired output ...) Haven't checked speed, but this has the advantage of being implemented in underlying C code.
I quickly came up with two strategies. A naive and a more clever using the outer function.
We look at two vectors of numbers A and B, where A is your example.
A <- scan(text = "1 2 3 4 7 6 9 10 8")
B <- sample(1:2321)
Define and try the naive inversion counting:
simpleInversion <- function(A) {
sum <- 0
n <- length(A)
for (i in 1:(n-1)) {
for (j in (i+1):n) {
sum <- sum + (A[i] > A[j])
}
}
return(sum)
}
simpleInversion(A)
simpleInversion(B)
Define and try the slightly more clever inversion counting:
cleverInversion <- function(A) {
tab <- outer(A, A, FUN = ">")
return(sum(tab[upper.tri(tab)]))
}
cleverInversion(A)
cleverInversion(B)
For the version which ignores NAs we can simply add an na.omit:
cleverInversion2 <- function(A) {
AA <- na.omit(A)
Tab <- outer(AA, AA, FUN = ">")
return(sum(Tab[upper.tri(Tab)]))
}
A[2] <- NA
cleverInversion2(A)
Hope this helps.
Edit: A faster version
Both functions become quite slow quickly when the size of the vector grows. So I came up with at faster version:
fastInversion <- function(A) {
return(sum(cbind(1, -1) %*% combn(na.omit(AA), 2) > 0))
}
C <- sample(c(1:500, NA))
library("microbenchmark")
microbenchmark(
simpleInversion(C),
cleverInversion(C),
fastInversion(C))
#Unit: microseconds
# expr min lq median uq max neval
# simpleInversion(C) 128538.770 130483.626 133999.272 144660.116 185767.208 100
# cleverInversion(C) 9546.897 9893.358 10513.799 12564.298 17041.789 100
# fastInversion(C) 104.632 114.229 193.144 198.209 324.614 100
So we gain quite a speed-up of nearly two orders of magnitude. The speed-up is even greater for larger vectors.
You could test each pair of values in your vector, counting the number that are inverted:
inversion.score <- function(vec) {
sum(apply(combn(length(vec), 2), 2, function(x) vec[x[2]] < vec[x[1]]), na.rm=T)
}
inversion.score(c(1, 2, 3, 7, 6, 9, 10, 8, NA))
# [1] 3