I want to write a function to reverse the digits of any number. Here is what I have, but it doesn't work. Please help me!
n=123
rev_number=function(n){
m=strsplit(as.character(n),"")
if (m==rev(m)) print("reversed number")
}
The desired output is n=321
I feel like reversing an integer should stay in the integer world instead of getting into the string-manipulation world. It seems there isn't a built-in function for such a task in R, so we can create one, using the Rcpp package for instance. Here's an example:
library(Rcpp)
cppFunction('int Reverse_CPP(int x) {
  int reverse = 0;
  while (x != 0) {
    int remainder = x % 10;
    reverse = reverse * 10 + remainder;
    x /= 10;
  }
  return reverse;
}')
Reverse_CPP(1234)
# [1] 4321
And here's a vectorized version
cppFunction('IntegerVector Reverse_CPP2(IntegerVector x) {
  int n = x.size();
  IntegerVector out(n);
  IntegerVector xx = clone(x); // needed if you do not want to modify x in place
  for (int i = 0; i < n; ++i) {
    int reverse = 0;
    while (xx[i] != 0) {
      int remainder = xx[i] % 10;
      reverse = reverse * 10 + remainder;
      xx[i] /= 10;
    }
    out[i] = reverse;
  }
  return out;
}')
Reverse_CPP2(c(12345, 21331, 4324234, 4243))
# [1] 54321 13312 4324234 3424
Note that I had to add IntegerVector xx = clone(x); which slows the function down drastically (see @alexis_laz's comment), because Rcpp would otherwise modify the original x by reference. You don't need that if you are passing a bare vector or if you don't care whether the original vector gets modified.
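To illustrate that by-reference behaviour, here is a sketch of the same function without clone() (the function name and the test vector are mine); when an integer vector is passed from R, it gets zeroed out as a side effect:
cppFunction('IntegerVector Reverse_CPP2_noclone(IntegerVector x) {
  int n = x.size();
  IntegerVector out(n);
  for (int i = 0; i < n; ++i) {
    int reverse = 0;
    while (x[i] != 0) {
      reverse = reverse * 10 + x[i] % 10;
      x[i] /= 10;
    }
    out[i] = reverse;
  }
  return out;
}')
v <- c(123L, 456L)
Reverse_CPP2_noclone(v)
# [1] 321 654
v
# [1] 0 0   (the original vector was modified by reference)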
Some benchmarks against other vectorized string manipulation functions
Stringi <- function(x) as.integer(stringi::stri_reverse(x))
Base <- function(x) {
as.integer(vapply(lapply(strsplit(as.character(x), "", fixed = TRUE), rev),
paste, collapse = "", FUN.VALUE = character(1L)))
}
library(microbenchmark)
set.seed(123)
x <- sample(1e3L:1e7L, 1e5, replace = TRUE)
microbenchmark(
Base(x),
Stringi(x),
Reverse_CPP2(x)
)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# Base(x) 855.985729 913.602215 994.60640 976.836206 1025.482170 1867.448511 100 c
# Stringi(x) 86.244426 94.882566 105.58049 102.962924 110.334702 179.918461 100 b
# Reverse_CPP2(x) 1.842699 1.865594 2.06674 1.947703 2.076983 6.546552 100 a
Here's my attempt at extending your code:
n=-123
rev_number=function(n){
if(n>0)
return(as.integer(paste0(rev(unlist(strsplit(as.character(n), ""))), collapse = "")))
else{
return(-as.integer(paste0(rev(unlist(strsplit(as.character(abs(n)), ""))), collapse = "")))
}
}
rev_number(n)
The if/else is there to handle negative numbers. First the integer is converted to characters, then strsplit splits the number into individual digits. unlist converts the resulting list into a vector, rev reverses the vector, and paste0 collapses it back into a single string. Finally, as.integer converts the string back into an integer.
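To make the pipeline concrete, here is a step-by-step illustration for n = -123 (the intermediate variable names are mine):
chars  <- unlist(strsplit(as.character(abs(-123)), ""))  # "1" "2" "3"
revd   <- rev(chars)                                     # "3" "2" "1"
joined <- paste0(revd, collapse = "")                    # "321"
-as.integer(joined)                                      # -321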
If you do not wish to convert the number to characters, the function below needs no additional packages, but it only works with positive integers:
reverse_number <- function(n){
rev <- 0
while (n > 0) {
r <- n %% 10
rev <- rev * 10 + r
n <- n %/% 10
}
return(rev)
}
reverse_number(134576)
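If negative inputs are also needed, one possible extension (a sketch reusing reverse_number() from above) is to strip the sign first:
reverse_number_signed <- function(n) {
  if (n < 0) -reverse_number(-n) else reverse_number(n)
}
reverse_number_signed(-134576)
# [1] -675431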
For integers n > 9 this function can be used:
reverse_int <- function(n) {
t1 <- floor(log10(n))
t2 <- 0
for (i in t1:1) t2 <- t2 + floor(n/10^i) * 10^(t1-i)
return(n*10^t1 - 99*t2)
}
reverse_int(678754)
#[1] 457876
Note that the function is not vectorized; it only takes one parameter n as input.
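If a vectorized version is needed, a simple (unoptimized) wrapper is to apply it element-wise, for example:
reverse_int_vec <- function(v) vapply(v, reverse_int, numeric(1))
reverse_int_vec(c(678754, 1234))
# [1] 457876   4321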
An R function to reverse numbers based on integer division with successive powers of 10. This came up in a school project related to palindrome numbers.
Reverse_number <- function(x){
n <- trunc(log10(x)) # how many powers of 10 are we dealing with
x.rem <- x # the remaining numbers to be reversed
x.out <- 0 # stores the output
for(i in n:0){
x.out <- x.out + (x.rem %/% 10^i)*10^(n-i) # multiply and add
x.rem <- x.rem - (x.rem %/% 10^i)*10^i # multiply and subtract
}
return(x.out)
}
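A quick check, reusing the example value from the previous answer:
Reverse_number(678754)
# [1] 457876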
Here are some other base R options using utf8ToInt or substring
> n <- 123478634
> as.integer(intToUtf8(rev(utf8ToInt(as.character(n)))))
[1] 436874321
> as.integer(paste0(substring(n, nchar(n):1, nchar(n):1), collapse = ""))
[1] 436874321
Related
I am trying to create a function that computes the sum of digits of large numbers, of the order of 100^100. The approach described in this question does not work, as shown below. I tried to come up with a function that does the job, but have not been able to get very far.
The inputs would be of the form a^b, where 1 < a, b < 100 and a and b are integers. So, in that sense, I am open to making digitSumLarge a function that accepts two arguments.
digitSumLarge <- function(x) {
pow <- floor(log10(x)) + 1L
rem <- x
i <- 1L
num <- integer(length = pow)
# Individually isolate each digit starting from the largest and store it in num
while(rem > 0) {
num[i] <- rem%/%(10^(pow - i))
rem <- rem%%(10^(pow - i))
i <- i + 1L
}
return(num)
}
# Function in the highest voted answer of the linked question.
digitsum <- function(x) sum(floor(x / 10^(0:(nchar(x) - 1))) %% 10)
Consider the following tests:
x <- c(1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9)
as.numeric(paste(x, collapse = ''))
# [1] 1.234568e+17
sum(x)
# 90
digitSumLarge(as.numeric(paste(x, collapse = '')))
# 85
digitsum(as.numeric(paste(x, collapse = '')))
# 81, with warning message about loss of accuracy
Is there any way I can write such a function in R?
You need arbitrary-precision numbers. a^b with R's numerics (double-precision floats) can only be represented with limited precision, and not exactly for sufficiently large inputs.
library(gmp)
a <- as.bigz(13)
b <- as.bigz(67)
sum(as.numeric(strsplit(as.character(a^b), split = "")[[1]]))
#[1] 328
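Wrapped as a function of a and b, along the lines the question suggests (a sketch; digitSumLarge is just the question's name reused):
library(gmp)
digitSumLarge <- function(a, b) {
  digits <- strsplit(as.character(as.bigz(a)^as.bigz(b)), split = "")[[1]]
  sum(as.integer(digits))
}
digitSumLarge(13, 67)
#[1] 328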
I want to produce all permutations of 20 minus ones (-1) and 21 ones (1). This matrix has 269,128,937,220 rows and 41 columns, and I want to do the following calculation on each row of this matrix:
(SLS')/4
where:
S is each row of this matrix (a 1 by 41 array).
S' is the transpose of S (a 41 by 1 array).
L is a 41 by 41 matrix.
The final result of each calculation is a single number.
Is there any way to produce this matrix and do the calculation without running out of memory, and in a reasonable time?
Thanks in advance.
First off, you are probably better off rethinking your approach. With that said, let's get started attacking your problem.
This is a very difficult problem mainly due to the limitations of resources. Below, I have a solution that will complete in a reasonable amount of time on a home computer given that you have access to a decent amount of storage (at least 7 TB). The algorithm below does not require that much memory and can be tuned to reduce memory usage.
Before we begin, we note that merely generating that many permutations seems impossible at first. However with the help of highly optimized C++ code and parallel computing, the task is brought back into the realm of possibility. This was demonstrated in my answer to the OP's previous question. We utilized RcppAlgos (I am the author) and the parallel package to generate about 36 million permutations per second in chunks of one million using 8 cores.
Now, we are charged with carrying out specific computations on each permutation as fast as possible. The computation is as follows:
(SLS') / 4, where S is a permutation, L is a 41 x 41 matrix
Here are a couple of base R approaches (N.B. m1[x, ] %*% m2 %*% m1[x, ] is the same as m1[x, ] %*% m2 %*% as.matrix(m1[x, ], ncol = 1)):
baseTest1 <- function(m1, m2) {
vapply(1:nrow(m1), function(x) {
m1[x, ] %*% m2 %*% m1[x, ]
}, FUN.VALUE = 1.1111, USE.NAMES = FALSE) / 4
}
baseTest2 <- function(m1, m2) {
temp <- m1 %*% m2
vapply(1:nrow(m1), function(x) {
crossprod(temp[x, ], m1[x, ])
}, FUN.VALUE = 1.1111, USE.NAMES = FALSE) / 4
}
Let's think about this a little bit. We have a bunch of permutations of the numbers one and negative one. When we multiply these permutations by a matrix of real numbers, say M, we end up simply adding and subtracting values of M. I bet we can speed this up quite a bit using Rcpp and avoid wasteful (and useless) identity multiplications (i.e. multiplications by 1).
#include <Rcpp.h>

//[[Rcpp::export]]
Rcpp::NumericVector makeVecCpp(Rcpp::NumericMatrix A,
                               Rcpp::NumericMatrix B,
                               unsigned long int mySize) {
    Rcpp::NumericVector result = Rcpp::no_init_vector(mySize);
    double temp = 0;

    for (std::size_t i = 0; i < mySize; ++i) {
        for (std::size_t j = 0; j < 41u; ++j) {
            for (std::size_t k = 0; k < 41u; ++k) {
                // A(i, j) and A(i, k) are both +1 or -1, so their sum is nonzero
                // exactly when they are equal, i.e. when their product is +1
                if (A(i, j) + A(i, k)) {
                    temp += B(j, k);
                } else {
                    temp -= B(j, k);
                }
            }
        }
        result[i] = temp / 4;
        temp = 0;
    }

    return result;
}
Now let's see if they give the same results and also benchmark them:
options(scipen = 999)
library(RcppAlgos)
library(microbenchmark)
set.seed(42)
M <- matrix(rnorm(41*41), nrow = 41, ncol = 41)
negOne <- permuteGeneral(c(1L, -1L), freqs = c(21, 20), upper = 100000)
all.equal(baseTest1(negOne, M), baseTest2(negOne, M))
# [1] TRUE
all.equal(baseTest1(negOne, M), makeVecCpp(negOne, M, 100000))
# [1] TRUE
microbenchmark(base1 = baseTest1(negOne, M), base2 = baseTest2(negOne, M),
myRcpp = makeVecCpp(negOne, M, 100000), times = 25)
Unit: milliseconds
expr min lq mean median uq max neval
base1 555.0256 582.2273 597.6447 593.7708 599.1380 690.3882 25
base2 471.0251 494.2367 541.2632 531.1858 586.6774 632.7279 25
myRcpp 202.7637 207.2463 210.0255 209.0399 209.9648 240.6664 25
Our Rcpp implementation is the clear winner!! Moving on, we incorporate this into our final answer:
## WARNING Don't run this unless you have a few DAYS on your hand
library(parallel)
## break up into even intervals of one hundred thousand
firstPart <- mclapply(seq(1, 269128900000, 100000), function(x) {
negOne <- permuteGeneral(c(1L, -1L), freqs = c(21, 20),
lower = x, upper = x + 99999)
vals <- makeVecCpp(negOne, M, 100000)
write.csv(vals, paste0("myFile", x, ".csv", collapse = ""))
x
}, mc.cores = 8)
## get the last few results and complete analysis
lastPart <- permuteGeneral(c(1L, -1L), freqs = c(21, 20),
lower = 269128900001, upper = 269128937220)
vals <- makeVecCpp(lastPart, M, 37220)
write.csv(vals, paste0("myFile", 269128900001, ".csv", collapse = ""))
You will note that we avoid storing everything in memory by writing every one hundred thousand results to main storage, hence the need for a huge hard drive. When I tested this, each file was about 2.5 MB, which would total about 6.5 TB:
a <- 2.5 * (2^20) ### convert to bytes
a * (269128937220 / 1e5) / 2^40 ## get terabytes
[1] 6.416534
To give you an idea of how long this computation will take, here is a timing for the first one hundred million results:
system.time(firstPart <- mclapply(seq(1, 100000000, 100000), function(x) {
negOne <- permuteGeneral(c(1L, -1L), freqs = c(21, 20),
lower = x, upper = x + 99999)
vals <- makeVecCpp(negOne, M, 100000)
write.csv(vals, paste0("myFile", x, ".csv", collapse = ""))
x
}, mc.cores = 8))
user system elapsed
529.931 9.557 80.690
80 seconds ain't that bad! That means we will only have to wait around for about 2.5 days!!!!!:
(269128937220 / 100000000 / 60 / 60 / 24) * 80
[1] 2.491935
If you really want to reduce this time, you will have to utilize a high performance computing service.
All results were obtained on a MacBook Pro 2.8GHz quad core (with 4 virtual cores.. 8 total).
First note that the result you expect is a numeric vector with more than 269 billion elements. You will need 8 bytes per element, i.e. more than 2 TB of RAM just to store the result. If you don't have that much, it's hopeless to do what you ask for. Note also that you will need a long vector to store the result.
If you do have this amount of RAM, here is a solution based on combn with its FUN argument. This should be fairly optimal for the memory use. If you want to make it faster, try to implement compute_one directly with Rcpp.
k = 15 # should be 20
n = 2*k+1
L = matrix(runif(n*n), ncol=n)
compute_one = function(indices) {
s = rep.int(1,n)
s[indices] = -1
drop(t(s) %*% L %*% s / 4)
}
res = combn(n, k, compute_one)
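As a tiny sanity check (my own sketch, not part of the original answer), the combn call can be compared against an explicit loop over the same index combinations for a much smaller k; compute_one picks up n and L from the global environment, so redefining them is enough:
k = 2
n = 2*k+1
L = matrix(runif(n*n), ncol=n)
res_combn = combn(n, k, compute_one)
res_loop = apply(combn(n, k), 2, function(idx) {
  s = rep.int(1, n)
  s[idx] = -1
  as.numeric(t(s) %*% L %*% s / 4)
})
all.equal(res_combn, res_loop)
# [1] TRUE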
I want to optimize the implementation of this formula.
Here is the formula:
re[i] = x[i]^2 - x[i-1] * x[i+1]
x is an array of values. i goes from 1 to N, where N > 2400000. For i=1, i-1 is the last element, and for i=lastElement, i+1 is the first element. Here is the code which I have written:
x <- 1:2400000
re <- array(data=NA, dim = NROW(x))
lastIndex = NROW(x)
for(i in 1:lastIndex){
if (i==1) {
re[i] = x[i]*x[i] - x[lastIndex]*x[i+1]
} else if(i==lastIndex) {
re[i] = x[i]*x[i] - x[i-1]*x[1]
} else {
re[i] = x[i]*x[i] - x[i-1]*x[i+1]
}
}
Can it be done with apply in R?
We can use direct vectorization for this
# Make fake data
x <- 1:10
n <- length(x)
# create vectors for the plus/minus indices
xminus1 <- c(x[n], x[-n])
xplus1 <- c(x[-1], x[1])
# Use direct vectorization to get re
re <- x^2 - xminus1*xplus1
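For the fake data above (x = 1:10) this gives:
re
# [1] -19   1   1   1   1   1   1   1   1  91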
If each x[i] really is equal to i, then you can do a little math:
x[i]^2 - x[i-1]*x[i+1] = x[i]^2 - (x[i]-1)*(x[i]+1) = 1
so all elements of the result are 1 (only the first and the last are not 1).
The result is:
c(1-2*N, rep(1, N-2), N*N-(N-1))
In the general case (arbitrary values in x) you can do (as in the answer from Dason):
x*x - c(x[N], x[-N])*c(x[-1], x[1])
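A quick check (my own, not part of the original answer) that the closed form matches the general expression for a small N:
N <- 10
x <- 1:N
all.equal(c(1-2*N, rep(1, N-2), N*N-(N-1)),
          x*x - c(x[N], x[-N])*c(x[-1], x[1]))
# [1] TRUE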
Here is a solution with rollapply() from zoo:
library("zoo")
rollapply(c(x[length(x)],x, x[1]), width=3, function(x) x[2]^2 - x[1]*x[3]) # or:
rollapply(c(tail(x,1), x, x[1]), width=3, function(x) x[2]^2 - x[1]*x[3])
Here is the benchmark:
library("microbenchmark")
library("zoo")
N <- 10000
x <- 1:N
microbenchmark(
math=c(1-2*N, rep(1, N-2), N*N-(N-1)), # for the data from the question
vect.i=x*x - c(x[N], x[-N])*c(x[-1], x[1]), # general data
roll.i=rollapply(c(x[length(x)],x, x[1]), width=3, function(x) x[2]^2 - x[1]*x[3]), # or:
roll.tail=rollapply(c(tail(x,1), x, x[1]), width=3, function(x) x[2]^2 - x[1]*x[3])
)
# Unit: microseconds
# expr min lq mean median uq max neval cld
# math 33.613 34.4950 76.18809 36.9130 38.0355 2002.152 100 a
# vect.i 188.928 192.5315 732.50725 197.1955 198.5245 51649.652 100 a
# roll.i 56748.920 62217.2550 67666.66315 68195.5085 71214.9785 109195.049 100 b
# roll.tail 57661.835 63855.7060 68815.91001 67315.5425 71339.6045 119428.718 100 b
An lapply implementation of your formula would look like this:
x <- c(1:2400000)
last <- length(x)
re <- lapply(x, function(i) {
if(i == 1) {
x[i]*x[i] - x[last]*x[i+1]
} else if (i == last) {
x[i]*x[i] - x[i-1]*x[1]
} else {
x[i]*x[i] - x[i-1]*x[i+1]
}
})
re <- unlist(re)
lapply will return a list, so conversion to a vector is done using unlist()
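A closely related variant (a sketch) iterates over indices with seq_along rather than over the values of x; for the x above the two coincide because x is just 1:2400000, but the index form also works for arbitrary values:
re <- unlist(lapply(seq_along(x), function(i) {
  if (i == 1) {
    x[i]*x[i] - x[last]*x[i+1]
  } else if (i == last) {
    x[i]*x[i] - x[i-1]*x[1]
  } else {
    x[i]*x[i] - x[i-1]*x[i+1]
  }
}))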
1) You can avoid all the special-casing in the computation by padding the start and end of the vector x with copies of the last and first elements; something like this:
N <- NROW(x)
x <- c(x[N], x, x[1]) # pad start and end to give wraparound
re <- lapply(2:(N+1), function(i) { x[i]*x[i] - x[i-1]*x[i+1] })
#re <- unlist(re) as andbov wrote
# and remember not to use all of x, just x[2:(N+1)], elsewhere
2) Directly vectorize, as in @Dason's answer:
# Do the padding trick on x, then
x[2:(N+1)]^2 - x[1:N]*x[3:(N+2)]
3) If performance matters, I suspect using data.table or else a for-loop over i will be faster, since it references three consecutive elements.
4) For more performance, use byte-compiling (see the sketch below this list).
5) If you need even more speed, use an Rcpp extension (C++ under the hood); see How to use Rcpp to speed up a for loop?
See those questions I cited for good examples of using lineprof and microbenchmarking to figure out where your bottleneck is.
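For point 4, a minimal byte-compiling sketch with the compiler package (the function and names here are illustrative; recent R versions JIT-compile functions by default, so the gain may be small):
library(compiler)
compute_re <- function(x) {          # assumes x has already been padded as in point 1
  N <- length(x) - 2
  vapply(2:(N + 1), function(i) x[i]*x[i] - x[i-1]*x[i+1], numeric(1))
}
compute_re_c <- cmpfun(compute_re)   # byte-compiled version of the same function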
I want a very quick search, and it seems that using hashes (via environments) is the best way. I got an example running with environments, but it does not return what I need.
Here is an example:
a <- data.table::data.table(a=c(1, 3, 5), b=c(2, 4, 6), time=c(10, 20, 30))
my_env <- list2env(a)
x <- a[2, .(a, b)] # x=c(3,4)
found <- get("x", envir = my_env)
I would expect found = c(3, 4, 20) but receive found = c(3, 4)
(I want the whole row to be returned instead of the unknown row subset)
Background: I have a huge list containing source and destination of routes calculated with osrm, e.g.
latitude1, longitude1, latitude2, longitude2, travel-time
46.12, 8.32, 47.87, 9.92, 1036
...
The list contains, in a first example, about 100,000 rows. Using binary search in a data.table sped up my code by a factor of 100, but one search still takes about 1 ms. As I have to search for many routes during a simulation (about 2e5 searches), I would like to get even faster.
@Gregor: I am a beginner in R, but I don't think my question is a duplicate:
I knew the second link, which is an abstract overview for experts listing possibilities. Furthermore, it is 4 years old.
I didn't know the first link, but from those answers I can't see whether I should switch to environments or how an implementation could work at all. There is also no discussion about searching a part of a huge list.
Summary (Thanks to DigEmAll for his running example below):
Using Rcpp on integers, the search is less memory-consuming without any loss of quality. Furthermore, it is about a factor of 3 faster.
Do not use hashed environments when you want to look up doubles (which have to be converted to strings).
Implementation in existing code should be easy.
Here's an example using an environment and a data.table; the code is pretty self-explanatory:
library(data.table)
# create a big random example (160k rows)
set.seed(123)
fromTo <- expand.grid(1:400,1:400)
colnames(fromTo) <- c('a','b')
DF <- as.data.frame(cbind(fromTo,time=as.integer(runif(nrow(fromTo), min = 1, max=500))))
# setup the environment to use it as hashtable:
# we simply put the times inside an environment using
# a|b (concatenation of a with b) as key
timesList <- as.list(DF$time)
names(timesList) <- paste(DF$a,DF$b,sep='|')
timesEnv <- list2env(timesList)
# setup the data.table to use it as hashtable
DT <- setDT(DF,key=c('a','b'))
# create search functions
searchUsingEnv <- function(a,b){
time <- get(paste(a,b,sep='|'),envir=timesEnv,inherits=FALSE)
return(time)
}
searchUsingDataTable <- function(from,to){
time <- DT[.(from,to),time]
return(time)
}
Benchmark :
# benchmark functions
# i.e. we try to search ~16K rows in our two kinds of hashtables
benchEnv <- function(){
n <- nrow(fromTo)
s <- as.integer(n * 0.9)
for(i in s:n){
searchUsingEnv(fromTo[i,'a'],fromTo[i,'b'])
}
}
benchDT <- function(){
n <- nrow(fromTo)
s <- as.integer(n * 0.9)
for(i in s:n){
searchUsingDataTable(fromTo[i,'a'],fromTo[i,'b'])
}
}
# let's measure the performances
> system.time(benchEnv(), gcFirst = TRUE)
user system elapsed
2.26 0.00 2.30
> system.time(benchDT(), gcFirst = TRUE)
user system elapsed
42.34 0.00 42.56
Conclusions:
environment seems much faster than data.table for repeated single-key access, so you can try to use it.
EDIT:
Environments have fast access, but they can only have string keys, which occupy more memory than doubles. So I've added an example using Rcpp and a std::map<> with a multi-part key:
(note: if you are on Windows you need to install RTools in order to make Rcpp work)
library(data.table)
library(Rcpp)
library(inline)
nRows <- 1e7
############# create data.table "DT" containing coordinates and times
generate_routes_dt <- function(nmax) {
set.seed(123)
routes <- data.table(lat1 = numeric(nmax),
lng1 = numeric(nmax),
lat2 = numeric(nmax),
lng2 = numeric(nmax),
time = numeric(nmax))
tmp <- sample(seq(46, 49, length.out = nmax), nmax)
routes$lat1 <- tmp
tmp <- sample(seq(8, 10, length.out = nmax), nmax)
routes$lng1 <- tmp
tmp <- sample(seq(46, 49, length.out = nmax), nmax)
routes$lat2 <- tmp
tmp <- sample(seq(8, 10, length.out = nmax), nmax)
routes$lng2 <- tmp
tmp <- sample(seq(0, 1e7, length.out = nmax), nmax)
routes$time <- as.integer(tmp)
data.table::setkey(routes, lat1, lng1, lat2, lng2)
return(routes)
}
DT <- generate_routes_dt(nRows)
############# create data.table search function
searchUsingDataTable <- function(lat_1,lng_1,lat_2,lng_2){
time <- DT[.(lat_1,lng_1,lat_2,lng_2),time]
return(time)
}
#############
############# create Rcpp search function
# the following code create 2 functions: createMap and getTime
# usage:
# map <- createMap(lat1Vec,lng1Vec,lat2Vec,lng2Vec,timesVec)
# t <- getTime(map,lat1,lng1,lat2,lng2)
sourceCpp(code=
'
#include <Rcpp.h>
class MultiKey {
public:
  double lat1;
  double lng1;
  double lat2;
  double lng2;

  MultiKey(double la1, double ln1, double la2, double ln2)
    : lat1(la1), lng1(ln1), lat2(la2), lng2(ln2) {}

  bool operator<(const MultiKey &right) const
  {
    if ( lat1 == right.lat1 ) {
      if ( lng1 == right.lng1 ) {
        if ( lat2 == right.lat2 ) {
          return lng2 < right.lng2;
        } else {
          return lat2 < right.lat2;
        }
      } else {
        return lng1 < right.lng1;
      }
    } else {
      return lat1 < right.lat1;
    }
  }
};

// [[Rcpp::export]]
SEXP createMap(Rcpp::NumericVector lat1,
               Rcpp::NumericVector lng1,
               Rcpp::NumericVector lat2,
               Rcpp::NumericVector lng2,
               Rcpp::NumericVector times){

  std::map<MultiKey, double>* map = new std::map<MultiKey, double>;

  int n1 = lat1.size();
  int n2 = lng1.size();
  int n3 = lat2.size();
  int n4 = lng2.size();
  int n5 = times.size();

  if(!(n1 == n2 && n2 == n3 && n3 == n4 && n4 == n5)){
    throw std::range_error("input vectors lengths are different");
  }

  for(int i = 0; i < n1; i++){
    MultiKey key(lat1[i],lng1[i],lat2[i],lng2[i]);
    map->insert(std::pair<MultiKey, double>(key, times[i]));
  }

  Rcpp::XPtr< std::map<MultiKey, double> > p(map, true);
  return( p );
}

// [[Rcpp::export]]
Rcpp::NumericVector getTime(SEXP mapPtr,
                            double lat1,
                            double lng1,
                            double lat2,
                            double lng2){

  Rcpp::XPtr< std::map<MultiKey, double> > ptr(mapPtr);
  MultiKey key(lat1,lng1,lat2,lng2);

  std::map<MultiKey,double>::iterator it = ptr->find(key);
  if(it == ptr->end())
    return R_NilValue;

  return Rcpp::wrap(it->second);
}
')
map <- createMap(DT$lat1,DT$lng1,DT$lat2,DT$lng2,DT$time)
searchUsingRcpp <- function(lat_1,lng_1,lat_2,lng_2){
time <- getTime(map,lat_1,lng_1,lat_2,lng_2)
return(time)
}
#############
############# benchmark
set.seed(1234)
rowsToSearchOneByOne <- DT[sample.int(nrow(DT),size=nrow(DT),replace=FALSE),]
bench <- function(searchFun2Use){
for(i in nrow(rowsToSearchOneByOne)){
key <- rowsToSearchOneByOne[i,]
searchFun2Use(key$lat1,key$lng1,key$lat2,key$lng2)
}
}
microbenchmark::microbenchmark(
bench(searchUsingRcpp),
bench(searchUsingDataTable),
times=100)
#############
Benchmark result :
Unit: microseconds
expr min lq mean median uq max neval
bench(searchUsingRcpp) 360.959 381.7585 400.4466 391.999 403.9985 665.597 100
bench(searchUsingDataTable) 1103.034 1138.0740 1214.3008 1163.514 1224.9530 2035.828 100
Note:
I really don't think that using doubles as keys is a good idea... floating point values should be searched with a certain tolerance or inside a range, not looked up for a perfect match inside a map.
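If you want to keep exact lookups, one option (a sketch, not part of the benchmark above) is to scale the coordinates to a fixed precision and use integer keys instead of doubles; the map (or the data.table key) can then be built on those integer columns, which is also what the "Rcpp on integers" remark in the question's summary refers to:
# scale coordinates to 5 decimal places and store them as integers (names are illustrative)
to_int_key <- function(coord, digits = 5) as.integer(round(coord * 10^digits))
DT[, `:=`(lat1_i = to_int_key(lat1), lng1_i = to_int_key(lng1),
          lat2_i = to_int_key(lat2), lng2_i = to_int_key(lng2))]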
I am stuck on a difficult problem in R and am not able to resolve it. The problem goes like this.
x and y are two vectors, as given below:
x<- c(1,2,3,4,5)
y<- c(12,4,2,5,7,18,9,10)
I want to create a new vector p, where length(p) = length(x), in the following manner:
For each id in x, find the id in y which has the minimum absolute distance in terms of values. For instance, for id=1 in x, value_x(id=1)=1, the closest value in y is 2, and id_y(value==2) = 3. Thus, the answer for id 1 in x is 3. We therefore create a new vector p, which will have the following values: p = (3,3,3,2,4);
Now we have to update p, in the following manner:
As 3 has been the id corresponding to id_x=1, it can't be the id for id_x=2. Hence, we have to discard id_y=3 with value 2 when calculating the next minimum distance for id_x=2. The next best minimum distance for id_x=2 is id_y=2 with value 4. Hence, the updated p is (3,2,3,2,4).
Similarly, as 3 has been the id corresponding to id_x=1, it can't be the id for id_x=3 either. Hence, we have to discard id_y=3 with value 2 when calculating the next minimum distance for id_x=3. The next best minimum distance for id_x=3 is 2 (at id_y=4). Hence, the updated p is (3,2,4,2,4).
As the next values in p, 2 and 4, are now duplicated, we have to repeat what we did in the last two steps. In summary, while calculating the minimum distance between x and y, for each id of x we have to get an id of y which hasn't previously appeared. Thus, all the elements of p have to be unique.
Any answers would be appreciated.
I tried something like this, though not a complete solution:
minID <- function(x,y) {return(which(abs(x-y)==min(abs(x-y))))};
p1 <- sapply(x,minID,y=y);
#Calculates the list of all minimum elements -no where close to actual solution :(
x and y both have over 1 million elements, hence a for loop would be extremely slow. I am looking for a faster solution.
This can be implemented efficiently with a binary search tree on the elements of y, deleting elements as they're matched and added to p. I've implemented this using std::set from the STL in C++, using Rcpp to get the code into R:
library(Rcpp)
getVals = cppFunction(
'NumericVector getVals(NumericVector x, NumericVector y) {
  NumericVector p(x.size());

  // store (value, original 0-based index) pairs so that duplicate values stay distinct
  std::vector<std::pair<double, int> > init;
  for (int j = 0; j < y.size(); ++j) {
    init.push_back(std::pair<double, int>(y[j], j));
  }

  // ordered set (balanced BST) over the remaining elements of y
  std::set<std::pair<double, int> > s(init.begin(), init.end());

  for (int i = 0; i < x.size(); ++i) {
    std::set<std::pair<double, int> >::iterator p1, p2, selected;

    // p1 is the first element not less than x[i]; p2 starts as the element before it
    p1 = s.lower_bound(std::pair<double, int>(x[i], 0));
    p2 = p1;
    --p2;
    if (p1 == s.end()) {
      selected = p2;
    } else if (p2 == s.begin()) {
      selected = p1;
    } else if (fabs(x[i] - p1->first) < fabs(x[i] - p2->first)) {
      selected = p1;
    } else {
      selected = p2;
    }
    p[i] = selected->second + 1; // 1-indexed
    s.erase(selected);           // remove the matched element so it cannot be reused
  }
  return p;
}')
Here's a runtime comparison against the pure-R solution that was posted -- the binary search tree solution is much faster and enables solutions with vectors of length 1 million in just a few seconds:
# Pure-R posted solution
getVals2 = function(x, y) {
n <- length(x)
p <- rep(NA, n)
for(i in 1:n) {
id <- which.min(abs(y - x[i]))
y[id] <- Inf
p[i] <- id
}
return(p)
}
# Test with medium-sized vectors
set.seed(144)
x = rnorm(10000)
y = rnorm(20000)
system.time(res1 <- getVals(x, y))
# user system elapsed
# 0.008 0.000 0.008
system.time(res2 <- getVals2(x, y))
# user system elapsed
# 1.284 2.919 4.211
all.equal(res1, res2)
# [1] TRUE
# Test with large vectors
set.seed(144)
x = rnorm(1000000)
y = rnorm(2000000)
system.time(res3 <- getVals(x, y))
# user system elapsed
# 4.402 0.097 4.467
The reason for the speedup is because this approach is asymptotically faster -- if x is of size n and y is of size m, then the binary search tree approach runs in O((n+m)log(m)) time -- O(m log(m)) to construct the BST and O(n log(m)) to compute p -- while the which.min approach runs in O(nm) time.
n <- length(x)
p <- rep(NA, n)
for(i in 1:n) {
id <- which.min(abs(y - x[i]))
y[id] <- Inf
p[i] <- id
}
I have tried to develop code in R and have gotten around a 20x improvement over the for loop. The piece of code goes as follows:
Generalized.getMinId <- function(a,b)
{
sapply(a, FUN = function(x) which.min(abs(x-b)))
}
Generalized.getAbsDiff <- function(a,b)
{
lapply(a, FUN = function(x) abs(x-b))
}
min_id = Generalized.getMinId(tlist,clist);  # tlist and clist correspond to x and y above
dup = which(duplicated(min_id));
while(length(dup) > 0)
{
absdiff = Generalized.getAbsDiff(tlist[dup],clist);
infind = lapply(dup, function(x,y)
{l <- head(y,x-1); l[l>0]}, y = min_id);
absdiff = Map(`[<-`, absdiff, infind, Inf);
dupind = sapply(absdiff, which.min);
min_id[dup] = dupind;
dup = which(duplicated(min_id));
}
In case someone can make an improvement over this piece of code, it would be awesome.