I have two vectors:
vec1 <- c(0, 1, 2, 3, 4, 5, 6, 7, 9)
vec2 <- c(1, 2, 7, 5, 3, 6, 80, 4, 8)
I would like vec1 to follow the same order as vec2. For example, the highest number in vec2 (80, rank 9) is in position 7, so I would like to move the highest number in vec1 (9, currently in position 9) to position 7.
Expected output:
vec1 <- c(0, 1, 6, 4, 2, 5, 9, 3, 7)
I don't have any duplicated values in any vector.
I'm primarily interested in efficient Rcpp solutions but also anything in R is welcome.
Another baseR option is match
vec1[match(vec2, sort(vec2))]
# [1] 0 1 6 4 2 5 9 3 7
edit
Including a benchmark with larger sample size
set.seed(42)
n <- 1e6
vec1 <- seq_len(n)
vec2 <- sample(1:1e7, size = n)
benchmarks <- bench::mark(match = vec1[match(vec2, sort(vec2))],
rank = vec1[rank(vec2)],
frank = vec1[data.table::frank(vec2)],
order_order = vec1[order(order(vec2))],
rcpp_order_order = foo(vec1, vec2),
iterations = 25)
benchmarks[ , 1:3]
Result
# A tibble: 5 x 3
# expression min median
# <bch:expr> <bch:tm> <bch:tm>
#1 match 259.8ms 322ms
#2 rank 825.9ms 876ms
#3 frank 88.6ms 134ms
#4 order_order 110.6ms 139ms
#5 rcpp_order_order 793.5ms 893ms
We can adapt the Rcpp version of order() from this answer (dropping the duplicate check, since you have no duplicates, and adding a function that indexes by the order of the order) to make the following Rcpp solution:
#include <Rcpp.h>
Rcpp::IntegerVector order(const Rcpp::NumericVector& x) {
return Rcpp::match(Rcpp::clone(x).sort(), x);
}
Rcpp::IntegerVector order(const Rcpp::IntegerVector& x) {
return Rcpp::match(Rcpp::clone(x).sort(), x);
}
// [[Rcpp::export]]
Rcpp::NumericVector foo(const Rcpp::NumericVector x,
const Rcpp::NumericVector y) {
return x[order(order(y))-1];
}
Then we get the expected results:
library(Rcpp)
sourceCpp("foo.cpp")
vec1 <- c(0, 1, 2, 3, 4, 5, 6, 7, 9)
vec2 <- c(1, 2, 7, 5, 3, 6, 80, 4, 8)
foo(vec1, vec2)
# [1] 0 1 6 4 2 5 9 3 7
with decent performance (comparisons are to the R solutions presented by other answers):
benchmarks <- bench::mark(match = vec1[match(vec2, sort(vec2))],
rank = vec1[rank(vec2)],
order_order = vec1[order(order(vec2))],
rcpp_order_order = foo(vec1, vec2),
iterations = 10000)
benchmarks[ , 1:3]
# # A tibble: 4 x 3
# expression min median
# <bch:expr> <bch:tm> <bch:tm>
# 1 match 28.4µs 31.72µs
# 2 rank 7.99µs 9.84µs
# 3 order_order 26.27µs 30.61µs
# 4 rcpp_order_order 2.51µs 3.23µs
Note that this solution only works if there are no duplicates. (If you might run into duplicates, adding a check is demonstrated in the linked-to answer). Also note that these benchmarks were just done on this data; I don't know for sure how they change at scale.
We could use rank
vec1[rank(vec2)]
#[1] 0 1 6 4 2 5 9 3 7
Or with order
vec1[order(order(vec2))]
#[1] 0 1 6 4 2 5 9 3 7
Or, as @markus suggested, an option with frank from data.table
library(data.table)
vec1[frank(vec2)]
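To make the connection between these options explicit, here is a small sketch (using the vectors from the question) showing that rank, order(order()) and match against the sorted vector all produce the same positions when there are no duplicates. The variable names pos_rank, pos_order and pos_match are only for illustration.

```r
vec1 <- c(0, 1, 2, 3, 4, 5, 6, 7, 9)
vec2 <- c(1, 2, 7, 5, 3, 6, 80, 4, 8)

# Three ways to get the rank of each element of vec2
pos_rank  <- rank(vec2)               # numeric ranks (ties would be averaged)
pos_order <- order(order(vec2))       # integer ranks
pos_match <- match(vec2, sort(vec2))  # integer ranks

# With no duplicates, all three agree
stopifnot(all(pos_rank == pos_order), all(pos_order == pos_match))

result <- vec1[pos_order]
result
# [1] 0 1 6 4 2 5 9 3 7
```

With ties the three would differ, since rank averages tied ranks by default while match returns the first matching position.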
If I understand you correctly, you want vec1 to follow the same order as vec2. That is, if vec2 is increasing, so should the values of vec1; if vec2 is decreasing, so should vec1, and so on.
sort(vec1)[order(order(vec2))]
# [1] 0 1 6 4 2 5 9 3 7
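If vec1 might not already be sorted, one way to sketch this is to sort it first and then index by the ranks of vec2. The helper name reorder_like and the shuffled input are hypothetical, made up for this example; ties.method = "first" is chosen only so that the ranks are always whole numbers.

```r
# reorder_like() is a hypothetical helper: put the values of x
# into the same relative order as the values of y
reorder_like <- function(x, y) {
  stopifnot(length(x) == length(y))
  sort(x)[rank(y, ties.method = "first")]
}

vec2 <- c(1, 2, 7, 5, 3, 6, 80, 4, 8)
shuffled <- c(9, 0, 7, 1, 6, 2, 5, 3, 4)  # same values as vec1, unsorted

reorder_like(shuffled, vec2)
# [1] 0 1 6 4 2 5 9 3 7
```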
Related
I have the following data:
c(1, 2, 4, 5, 1, 8, 9)
I set l = 2 and u = 6
I want to find all the values in the range (3,7)
How can I do this?
In base R we can use comparison operators to create a logical vector and use that for subsetting the original vector
x[x > 2 & x <= 6]
#[1] 3 5 6
Or, using a for loop: initialize an empty vector, loop through the elements of 'x', and if a value is between 2 and 6, append it to that vector
v1 <- c()
for(i in x) {
if(i > 2 && i <= 6) v1 <- c(v1, i)
}
v1
#[1] 3 5 6
data
x <- c(3, 5, 6, 8, 1, 2, 1)
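As a small sketch, the same subsetting can be wrapped in a function so the bounds are not hard-coded. The name in_range and its arguments lower and upper are hypothetical; the bounds follow the answer above (exclusive lower, inclusive upper), which is an assumption about what the question intends.

```r
x <- c(3, 5, 6, 8, 1, 2, 1)

# in_range() is a hypothetical helper: keep values with lower < value <= upper
in_range <- function(x, lower, upper) {
  x[x > lower & x <= upper]  # `&` is the vectorized (elementwise) AND
}

in_range(x, lower = 2, upper = 6)
# [1] 3 5 6
```

Note that the vectorized `&` is needed here; `&&` only works on single values, which is why it fits inside the for loop but not in the subsetting expression.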
How do I add each element of one vector to another vector while keeping the first vector constant? For example, if I had c(1, 2, 3) + 1, I would get 2, 3, 4. If I wanted to scale this up to, say, + 1 and + 2, what could I do to get
2, 3, 4, 3, 4, 5
Intuitively I wanted to c(1, 2, 3) + c(1, 2) but this does not work.
Turning the comments into an answer, we can use outer as @jogo showed
c(outer(1:3, 1:2, FUN='+'))
# [1] 2 3 4 3 4 5
Another option is rep
f <- function(x, y) {
x + rep(y, each = length(x))
}
f(1:3, 1:2)
# [1] 2 3 4 3 4 5
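For context, a short sketch of why the plain `+` does not give the desired result and what outer builds before it is flattened. The variable names recycled and sums are only for illustration.

```r
x <- 1:3
y <- 1:2

# Plain `+` recycles y positionally: 1+1, 2+2, 3+1 (and warns about the lengths)
recycled <- suppressWarnings(x + y)
recycled
# [1] 2 4 4

# outer() computes every pairwise sum; column j holds x + y[j]
sums <- outer(x, y, FUN = "+")
sums
#      [,1] [,2]
# [1,]    2    3
# [2,]    3    4
# [3,]    4    5

# Flattening column by column gives x + 1 followed by x + 2
as.vector(sums)
# [1] 2 3 4 3 4 5
```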
I am looking for a function which takes a vector and keeps dropping the first value until the sum of the vector is less than 20. Return the remaining values.
I've tried both a for-loop and while-loop and can't find a solution.
vec <- c(3,5,3,4,3,9,1,8,2,5)
short <- function(vec){
for (i in 1:length(vec)){
while (!is.na((sum(vec)) < 20)){
vec <- vec[i+1:length(vec)]
#vec.remove(i)
}
}
The expected output should be:
1,8,2,5
which is less than 20.
Looking at the expected output it looks like you want to drop values until sum of remaining values is less than 20.
We can create a function
drop_20 <- function(vec) {
tail(vec, sum(cumsum(rev(vec)) < 20))
}
drop_20(vec)
#[1] 1 8 2 5
Trying it on another input
drop_20(1:10)
#[1] 9 10
Breaking down the function, first the vec
vec = c(3,5,3,4,3,9,1,8,2,5)
We then reverse it
rev(vec)
#[1] 5 2 8 1 9 3 4 3 5 3
take cumulative sum over it (cumsum)
cumsum(rev(vec))
#[1] 5 7 15 16 25 28 32 35 40 43
Find out the number of entries that are less than 20
cumsum(rev(vec)) < 20
#[1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
sum(cumsum(rev(vec)) < 20)
#[1] 4
and finally subset these last entries using tail.
A slight modification in the code and it should be able to handle NAs as well
drop_20 <- function(vec) {
tail(vec, sum(cumsum(replace(rev(vec), is.na(rev(vec)), 0)) < 20))
}
vec = c(3, 2, NA, 4, 5, 1, 2, 3, 4, 9, NA, 1, 2)
drop_20(vec)
#[1] 3 4 9 NA 1 2
The logic being we replace NA with zeroes and then take the cumsum
You need to remove the first value each time, so your while loop should be,
while (sum(x, na.rm = TRUE) >= 20) {
x <- x[-1]
}
#[1] 1 8 2 5
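A minimal sketch of that loop wrapped in a function, so it can be called on any vector. The name drop_until_below and the limit argument are hypothetical; the length check is an added guard so the loop also stops on an empty vector if a non-positive limit is passed.

```r
# drop_until_below() is a hypothetical wrapper around the while loop above
drop_until_below <- function(x, limit = 20) {
  # Drop the first element until the sum of what remains is below the limit
  while (length(x) > 0 && sum(x, na.rm = TRUE) >= limit) {
    x <- x[-1]
  }
  x
}

drop_until_below(c(3, 5, 3, 4, 3, 9, 1, 8, 2, 5))
# [1] 1 8 2 5
drop_until_below(1:10)
# [1]  9 10
```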
base solution without loops
not my most readable code ever, but it's pretty fast (see benchmarking below)
rev( rev(vec)[cumsum( replace( rev(vec), is.na( rev(vec) ), 0 ) ) < 20] )
#[1] 1 8 2 5
note: 'borrowed' the NA-handling from @Ronak's answer
sample data
vec = c(3, 2, NA, 4, 5, 1, 2, 3, 4, 9, NA, 1, 2)
benchmarks
microbenchmark::microbenchmark(
Sotos = {
while (sum(vec, na.rm = TRUE) >= 20) {
vec <- vec[-1]
}
},
Ronak = tail(vec, sum(cumsum(replace(rev(vec), is.na(rev(vec)), 0)) < 20)),
Wimpel = rev( rev(vec)[cumsum( replace( rev(vec), is.na( rev(vec) ), 0 ) ) < 20]),
WimpelMarkus = vec[rev(cumsum(rev(replace(vec, is.na(vec), 0))) < 20)]
)
# Unit: microseconds
# expr min lq mean median uq max neval
# Sotos 2096.795 2127.373 2288.15768 2152.6795 2425.4740 3071.684 100
# Ronak 30.127 33.440 42.54770 37.2055 49.4080 101.827 100
# Wimpel 13.557 15.063 17.65734 16.1175 18.5285 38.261 100
# WimpelMarkus 7.532 8.737 12.60520 10.0925 15.9680 45.491 100
I would go with Reduce
vec[Reduce(f = "+", x = vec, accumulate = T, right = T) < 20]
##[1] 1 8 2 5
Alternatively, call Reduce with a function that wraps sum with the argument na.rm = T, in order to handle NAs if desired:
vec2 <- c(3, 2, NA, 4, 5, 1, 2, 3, 4, 9, NA, 1, 2)
vec2[Reduce(f = function(a,b) sum(a, b, na.rm = T), x = vec2, accumulate = TRUE, right = T) < 20]
##[1] 3 4 9 NA 1 2
I find it convenient that Reduce can start from the right (the end of the vector), so the vector does not have to be reversed first.
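To illustrate what Reduce returns here, a small sketch comparing it with the reversed cumulative sum used in the other answers. The names suffix_reduce and suffix_cumsum are only for illustration.

```r
vec <- c(3, 5, 3, 4, 3, 9, 1, 8, 2, 5)

# Accumulating from the right gives, at each position, the sum of that
# element and everything after it
suffix_reduce <- Reduce(`+`, vec, accumulate = TRUE, right = TRUE)
suffix_reduce
# [1] 43 40 35 32 28 25 16 15  7  5

# The same values via reverse, cumulative sum, reverse back
suffix_cumsum <- rev(cumsum(rev(vec)))
stopifnot(all(suffix_reduce == suffix_cumsum))

vec[suffix_reduce < 20]
# [1] 1 8 2 5
```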
I have a vector:
as <- c(1,2,3,4,5,9)
I need to extract the first continuous sequence in the vector, starting at index 1, such that the output is the following:
1 2 3 4 5
Is there a smart function for doing this, or do I have to do something not so elegant like this:
a <- c(1,2,3,4,5,9)
is_continunous <- c()
for (i in 1:length(a)) {
if(a[i+1] - a[i] == 1) {
is_continunous <- c(is_continunous, i)
} else {
break
}
}
continunous_numbers <- c()
if(is_continunous[1] == 1) {
is_continunous <- c(is_continunous, length(is_continunous)+1)
continunous_numbers <- a[is_continunous]
}
It does the trick, but I would expect that there is a function that can already do this.
It isn't clear whether you need the indices of the continuous sequence only if it starts at index one, or the indices of the first continuous sequence wherever it begins.
In both cases, you need to start by checking the difference between adjacent elements:
d_as <- diff(as)
If you need the first sequence only if it starts at index 1:
if(d_as[1]==1) 1:(rle(d_as)$lengths[1]+1) else NULL
# [1] 1 2 3 4 5
rle gives the length and value of each run of consecutive equal values.
If you need the first continuous sequence, whatever the starting index is:
rle_d_as <- rle(d_as)
which(d_as==1)[1]+(0:(rle_d_as$lengths[rle_d_as$values==1][1]))
Examples (for the second option):
as <- c(1,2,3,4,5,9)
d_as <- diff(as)
rle_d_as <- rle(d_as)
which(d_as==1)[1]+(0:(rle_d_as$lengths[rle_d_as$values==1][1]))
#[1] 1 2 3 4 5
as <- c(4,3,1,2,3,4,5,9)
d_as <- diff(as)
rle_d_as <- rle(d_as)
which(d_as==1)[1]+(0:(rle_d_as$lengths[rle_d_as$values==1][1]))
# [1] 3 4 5 6 7
as <- c(1, 2, 3, 6, 7, 8)
d_as <- diff(as)
rle_d_as <- rle(d_as)
which(d_as==1)[1]+(0:(rle_d_as$lengths[rle_d_as$values==1][1]))
# [1] 1 2 3
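For the first case (a run that must start at index 1), here is an alternative sketch that returns the values rather than the indices. It counts the leading differences equal to 1 with cumprod instead of rle; the function name leading_run is hypothetical.

```r
# leading_run() is a hypothetical helper: values of the run of
# consecutive integers that starts at the first element
leading_run <- function(x) {
  if (length(x) < 2) return(x)
  is_step <- diff(x) == 1
  # cumprod() stays 1 until the first FALSE, so its sum counts leading TRUEs
  n_steps <- sum(cumprod(is_step))
  x[seq_len(n_steps + 1)]
}

leading_run(c(1, 2, 3, 4, 5, 9))
# [1] 1 2 3 4 5
leading_run(c(1, 2, 3, 6, 7, 8))
# [1] 1 2 3
```

When no run starts at index 1, this sketch returns just the first element; return NULL instead if that suits your use better.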
A simple way to catch the sequence would be to find the diff of your vector and grab all elements with diff == 1 plus the very next element, i.e.
d1<- which(diff(as) == 1)
as[c(d1, d1[length(d1)]+1)]
NOTE
This will only work if you have only one sequence in your vector. However, if we want to make it more general, then I'd suggest creating a function like so,
get_seq <- function(vec){
d1 <- which(diff(vec) == 1)
if(all(diff(d1) == 1)){
return(c(d1, d1[length(d1)]+1))
}else{
d2 <- split(d1, cumsum(c(1, diff(d1) != 1)))[[1]]
return(c(d2, d2[length(d2)]+1))
}
}
#testing it
as <- c(3, 5, 1, 2, 3, 4, 9, 7, 5, 4, 5, 6, 7, 8)
get_seq(as)
#[1] 3 4 5 6
as <- c(8, 9, 10, 11, 1, 2, 3, 4, 7, 8, 9, 10)
get_seq(as)
#[1] 1 2 3 4
as <- c(1, 2, 3, 4, 5, 6, 11)
get_seq(as)
#[1] 1 2 3 4 5 6
I need to find all positions in my vector corresponding to any of values of another vector:
needles <- c(4, 3, 9)
hay <- c(2, 3, 4, 5, 3, 7)
mymatches(needles, hay) # should give vector: 2 3 5
Is there any predefined function allowing to do this?
This should work:
which(hay %in% needles) # 2 3 5
R already has the match() function and the %in% operator (which is defined in terms of match()), and both are vectorized. Your solution:
which(!is.na(match(hay, needles)))
[1] 2 3 5
or the shorter syntax which(hay %in% needles) as @jalapic showed.
With match(), if you wanted to, you could see which specific value was matched at each position...
match(hay, needles)
[1] NA 2 1 NA 2 NA
or just a logical vector of where the matches occurred:
!is.na(match(hay, needles))
[1] FALSE TRUE TRUE FALSE TRUE FALSE
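Putting these pieces together, a short sketch using the data from the question that shows the matched positions, the needle matched at each of them, and how %in% relates to match. The variable names idx and found are only for illustration.

```r
needles <- c(4, 3, 9)
hay <- c(2, 3, 4, 5, 3, 7)

idx <- match(hay, needles)   # position in needles for each element of hay
found <- !is.na(idx)

which(found)                 # positions in hay that matched
# [1] 2 3 5
needles[idx[found]]          # the needle found at each of those positions
# [1] 3 4 3

# %in% gives the same logical vector as match() with nomatch = 0
stopifnot(identical(hay %in% needles, match(hay, needles, nomatch = 0L) > 0L))
```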
If you want to match a single integer against an integer vector, then the following code is about twice as fast for longer integer vectors:
library(microbenchmark)
library(parallel)
library(Rcpp)
library(RcppArmadillo)
cppFunction(depends = "RcppArmadillo",
'std::vector<double> findMatches(const int &x, const arma::ivec &y) {
arma::uvec temp = arma::find(y == x) + 1;
return as<std::vector<double>>(wrap(temp));
}')
x <- 1L
y <- as.integer(1:1e8)
microbenchmark(findMatches(x, y))
microbenchmark(which(y %in% x))
To match all elements of a vector we can do:
needles <- c(4, 3, 9)
hay <- c(2, 3, 4, 5, 3, 7)
unlist(lapply(FUN = findMatches, X = needles, y=hay))
# The same thing in parallel
unlist(mclapply(FUN = findMatches, X = needles, y=hay))
Benchmarking:
# on a 8 core server
hay <- as.integer(1:1e7)
needles <- sample(hay, 10)
microbenchmark(which(hay %in% needles)) # 74 milliseconds
microbenchmark(unlist(lapply(FUN = findMatches, X = needles, y=hay))) # 44 milliseconds
microbenchmark(unlist(mclapply(FUN = findMatches, X = needles, y=hay))) # 46 milliseconds
Running in parallel will only be faster if each embarrassingly parallel task is long enough to make it worth the overhead. In this example that does not seem to be the case.
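For readers without a compiler set up, the per-needle search can be sketched in plain R. The function find_matches_r is a hypothetical stand-in for findMatches (it is not benchmarked here); it mainly shows that searching needle by needle returns positions grouped by needle rather than sorted.

```r
# find_matches_r() is a hypothetical plain-R stand-in for findMatches()
find_matches_r <- function(x, y) which(y == x)

needles <- c(4, 3, 9)
hay <- c(2, 3, 4, 5, 3, 7)

per_needle <- unlist(lapply(needles, find_matches_r, y = hay))
per_needle
# [1] 3 2 5

# Sorting recovers the same positions as which(hay %in% needles)
sort(per_needle)
# [1] 2 3 5
```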