R: Counting rows in a dataframe in which all values fall within individual ranges - r

I have a dataframe (A rows x K columns). For each column, I get the 5th and 95th percentile value. I want to know how many rows in the df have all K of their values within these K sets of 5th and 95th percentile values.
Example code below works, removing rows that do not fall within the bounds (1 and 9 here, but in practice will be percentiles), and then counting what remains. But A will be 10K and K will be 40, and I am simulating this dataframe 10K times, so I am wondering if there is code that will run faster.
data <- rbind(c(1,2,3,4,5), c(3,5,7,8,5), c(2,8,9,5,9), c(9,1,1,8,9),
c(3,5,6,7,5))
Lower_Bound <- rbind(1,1,1,1,1)
Upper_Bound <- rbind(9,9,9,9,9)
for (i in c(1:5)) {
  data <- data[data[,i] > Lower_Bound[i,],]
  data <- data[data[,i] < Upper_Bound[i,],]
}
N <- nrow(data)

If I understand correctly, the OP is only interested in the number of rows which fulfill the condition. So, there is no need to actually remove rows from data that do not fall within the bounds. It is sufficient to count the number of rows which do fall within the bounds.
This answer contains solutions for
matrices
data.frames
and a benchmark which compares
OP's approach,
apply() with matrices and data.frames,
an approach using purrr's map() and reduce() functions.
apply() with matrices
Let's start with the provided sample data and fixed Lower_Bound and Upper_Bound. Please note that all three objects are matrices created by rbind(). This is in contrast to the text of the question, which refers to a dataframe (A rows x K columns). Anyhow, we will provide solutions for both cases.
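A quick check that rbind() on vectors yields a matrix (in R >= 4.0 the class vector also contains "array"):
class(data)
[1] "matrix" "array"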
apply(data, 1, function(x) all(x > Lower_Bound & x < Upper_Bound))
returns a vector of type logical
[1] FALSE TRUE FALSE FALSE TRUE
The number of rows which fulfill the condition can be derived by
N <- sum(apply(data, 1, function(x) all(x > Lower_Bound & x < Upper_Bound)))
N
[1] 2
because TRUE is coerced to 1L and FALSE to 0L.
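A quick illustration of this coercion:
sum(c(TRUE, FALSE, TRUE, TRUE))
[1] 3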
The next step is to also compute the bounds for each column as the 5th and 95th percentiles. For this, we have to create a new sample dataset mat, again as a matrix:
# create sample data
n_col <- 5
n_row <- 10
set.seed(42) # required for reproducible results
mat <- sapply(1:n_col, function(x) rnorm(n_row, mean = x))
mat
[,1] [,2] [,3] [,4] [,5]
[1,] 2.3709584 3.3048697 2.693361 4.455450 5.205999
[2,] 0.4353018 4.2866454 1.218692 4.704837 4.638943
[3,] 1.3631284 0.6111393 2.828083 5.035104 5.758163
[4,] 1.6328626 1.7212112 4.214675 3.391074 4.273295
[5,] 1.4042683 1.8666787 4.895193 4.504955 3.631719
[6,] 0.8938755 2.6359504 2.569531 2.282991 5.432818
[7,] 2.5115220 1.7157471 2.742731 3.215541 4.188607
[8,] 0.9053410 -0.6564554 1.236837 3.149092 6.444101
[9,] 3.0184237 -0.4404669 3.460097 1.585792 4.568554
[10,] 0.9372859 3.3201133 2.360005 4.036123 5.655648
For demonstration, each column has a different mean.
# count number of rows
probs <- c(0.05, 0.95)
bounds <- apply(mat, 2, quantile, probs)
idx <- apply(mat, 1, function(x) all(x > bounds[1, ] & x < bounds[2, ]))
N <- sum(idx)
N
[1] 5
If required, the subset of mat which fulfills the condition can be derived by
mat[idx, ]
[,1] [,2] [,3] [,4] [,5]
[1,] 2.3709584 3.304870 2.693361 4.455450 5.205999
[2,] 1.6328626 1.721211 4.214675 3.391074 4.273295
[3,] 0.8938755 2.635950 2.569531 2.282991 5.432818
[4,] 2.5115220 1.715747 2.742731 3.215541 4.188607
[5,] 0.9372859 3.320113 2.360005 4.036123 5.655648
The bounds are
bounds
[,1] [,2] [,3] [,4] [,5]
5% 0.641660 -0.5592606 1.226857 1.899532 3.882318
95% 2.790318 3.8517060 4.588960 4.886484 6.135429
apply() with data.frames
In case the dataset is a data.frame we can use the same code, i.e.,
df <- as.data.frame(mat)
probs <- c(0.05, 0.95)
bounds <- apply(df, 2, quantile, probs)
idx <- apply(df, 1, function(x) all(x > bounds[1, ] & x < bounds[2, ]))
N <- sum(idx)
Benchmark
The OP is looking for code which is faster than OP's own approach because the OP wants to replicate the simulation 10000 times.
So, here is a benchmark which compares
OP1: OP's own approach using matrices
OP2: a slightly modified version of OP1
apply_mat: the apply() function with matrices
apply_df: the apply() function with data.frames
purrr: using map(), pmap(), and reduce() from the purrr package
(Note that the list of methods is not exhaustive)
The benchmark is repeated for varying problem sizes, i.e., 5, 10, and 40 columns as well as 100, 1000, and 10000 rows. The largest problem size corresponds to the size of OP's simulations. As some codes modify the input dataset, all runs start with a fresh copy of the input data.
library(bench)
library(purrr)
library(ggplot2)
bm <- press(
  n_col = c(5L, 10L, 40L)
  , n_row = 10L^(2:4)
  , {
    set.seed(42)
    mat0 <- sapply(1:n_col, function(x) rnorm(n_row, mean = x))
    df0 <- as.data.frame(mat0)
    mark(
      OP1 = {
        data <- data.table::copy(mat0)
        Lower_Bound <- as.matrix(apply(data, 2, quantile, probs = 0.05), ncol = 1L)
        Upper_Bound <- as.matrix(apply(data, 2, quantile, probs = 0.95), ncol = 1L)
        for (i in seq_len(ncol(data))) {
          data <- data[data[, i] > Lower_Bound[i, ], ]
          data <- data[data[, i] < Upper_Bound[i, ], ]
        }
        nrow(data)
      },
      OP2 = {
        data <- data.table::copy(mat0)
        Lower_Bound <- as.matrix(apply(data, 2, quantile, probs = 0.05), ncol = 1L)
        Upper_Bound <- as.matrix(apply(data, 2, quantile, probs = 0.95), ncol = 1L)
        for (i in seq_len(ncol(data))) {
          data <- data[data[, i] > Lower_Bound[i, ] & data[, i] < Upper_Bound[i, ], ]
        }
        nrow(data)
      },
      apply_mat = {
        mat <- data.table::copy(mat0)
        probs <- c(0.05, 0.95)
        bounds <- apply(mat, 2, quantile, probs)
        idx <- apply(mat, 1, function(x) all(x > bounds[1, ] & x < bounds[2, ]))
        sum(idx)
      },
      apply_df = {
        df <- data.table::copy(df0)
        probs <- c(0.05, 0.95)
        bounds <- apply(df, 2, quantile, probs)
        idx <- apply(df, 1, function(x) all(x > bounds[1, ] & x < bounds[2, ]))
        sum(idx)
      },
      purrr = {
        data.table::copy(df0) %>%
          map2(map_dfc(., quantile, probs), ~ (.x > .y[1L] & .x < .y[2L])) %>%
          pmap(all) %>%
          reduce(`+`)
      }
    )
  }
)
autoplot(bm)
Note the logarithmic time scale.
print(bm[, 1:11], n = Inf)
# A tibble: 45 x 11
expression n_col n_row min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
<bch:expr> <int> <dbl> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
1 OP1 5 100 1.46ms 1.93ms 493. 88.44KB 0 248 0 503ms
2 OP2 5 100 1.34ms 1.78ms 534. 71.56KB 0 267 0 500ms
3 apply_mat 5 100 1.16ms 1.42ms 621. 26.66KB 2.17 286 1 461ms
4 apply_df 5 100 1.41ms 1.8ms 526. 34.75KB 0 263 0 500ms
5 purrr 5 100 2.34ms 2.6ms 374. 17.86KB 0 187 0 500ms
6 OP1 10 100 2.42ms 2.78ms 344. 205.03KB 0 172 0 500ms
7 OP2 10 100 2.37ms 2.71ms 354. 153.38KB 2.07 171 1 484ms
8 apply_mat 10 100 1.76ms 2.12ms 457. 51.64KB 0 229 0 501ms
9 apply_df 10 100 2.31ms 2.63ms 367. 67.78KB 0 184 0 501ms
10 purrr 10 100 3.44ms 4.1ms 222. 34.89KB 2.09 106 1 477ms
11 OP1 40 100 9.4ms 10.57ms 92.9 955.41KB 0 47 0 506ms
12 OP2 40 100 9.18ms 10.08ms 96.8 638.92KB 0 49 0 506ms
13 apply_mat 40 100 5.44ms 6.46ms 146. 429.95KB 2.12 69 1 472ms
14 apply_df 40 100 6.12ms 6.75ms 141. 608.66KB 0 71 0 503ms
15 purrr 40 100 10.43ms 11.8ms 84.9 149.53KB 0 43 0 507ms
16 OP1 5 1000 1.75ms 1.94ms 478. 837.55KB 2.10 228 1 477ms
17 OP2 5 1000 1.69ms 1.94ms 487. 674.36KB 0 244 0 501ms
18 apply_mat 5 1000 4.84ms 5.62ms 176. 255.17KB 0 89 0 506ms
19 apply_df 5 1000 6.37ms 7.66ms 122. 333.58KB 0 62 0 506ms
20 purrr 5 1000 9.86ms 11.22ms 87.7 165.52KB 2.14 41 1 467ms
21 OP1 10 1000 3.35ms 3.91ms 253. 1.89MB 0 127 0 503ms
22 OP2 10 1000 3.33ms 3.72ms 256. 1.41MB 2.06 124 1 484ms
23 apply_mat 10 1000 5.86ms 6.93ms 142. 491.09KB 0 72 0 508ms
24 apply_df 10 1000 7.74ms 10.08ms 99.2 647.86KB 0 50 0 504ms
25 purrr 10 1000 14.55ms 15.44ms 62.5 323.17KB 2.23 28 1 448ms
26 OP1 40 1000 13.8ms 16.28ms 58.8 8.68MB 2.18 27 1 459ms
27 OP2 40 1000 13.29ms 14.72ms 67.9 5.84MB 0 34 0 501ms
28 apply_mat 40 1000 12.17ms 13.85ms 68.5 4.1MB 2.14 32 1 467ms
29 apply_df 40 1000 14.61ms 15.86ms 62.9 5.78MB 0 32 0 509ms
30 purrr 40 1000 41.85ms 43.66ms 22.7 1.25MB 0 12 0 529ms
31 OP1 5 10000 5.57ms 6.55ms 147. 8.15MB 2.07 71 1 482ms
32 OP2 5 10000 5.38ms 6.27ms 157. 6.55MB 2.06 76 1 485ms
33 apply_mat 5 10000 43.98ms 46.9ms 20.7 2.48MB 0 11 0 532ms
34 apply_df 5 10000 53.59ms 56.53ms 17.8 3.24MB 3.57 5 1 280ms
35 purrr 5 10000 86.32ms 88.83ms 11.1 1.6MB 0 6 0 540ms
36 OP1 10 10000 12.03ms 13.63ms 72.3 18.97MB 2.07 35 1 484ms
37 OP2 10 10000 11.66ms 12.97ms 76.5 14.07MB 4.25 36 2 471ms
38 apply_mat 10 10000 50.31ms 51.77ms 18.5 4.77MB 0 10 0 541ms
39 apply_df 10 10000 62.09ms 65.17ms 15.1 6.3MB 0 8 0 528ms
40 purrr 10 10000 125.82ms 128.3ms 7.35 3.13MB 2.45 3 1 408ms
41 OP1 40 10000 53.38ms 56.34ms 16.2 87.79MB 5.41 6 2 369ms
42 OP2 40 10000 46.24ms 47.43ms 20.3 58.82MB 2.25 9 1 444ms
43 apply_mat 40 10000 78.25ms 83.79ms 11.4 40.94MB 2.85 4 1 351ms
44 apply_df 40 10000 95.66ms 97.02ms 10.3 57.58MB 2.06 5 1 486ms
45 purrr 40 10000 361.26ms 373.23ms 2.68 12.31MB 0 2 0 746ms
Conclusions
To my surprise, OP's approach performs quite well despite the repeated copy operations. In fact, for OP's problem size of 10000 rows and 40 columns, the modified version OP2 is nearly two times faster than apply_mat.
A possible explanation (which needs to be verified, though) is that OP's approach is somewhat recursive: the number of rows to be checked shrinks while iterating over the columns.
Interestingly, the purrr variant has the lowest memory requirements.
Taking the median run time of about 50 ms for the OP2 method from this benchmark, 10000 repetitions of the simulation may take less than 10 minutes.
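For completeness, here is a further vectorized variant which was not part of the benchmark: instead of looping over rows with apply(), it compares the whole matrix against the bounds via sweep() and counts the qualifying rows with rowSums(). A minimal sketch (the function name is mine):
count_within_bounds <- function(mat, probs = c(0.05, 0.95)) {
  bounds <- apply(mat, 2, quantile, probs)
  # compare every column against its own lower and upper bound in one vectorized step
  ok <- sweep(mat, 2, bounds[1, ], ">") & sweep(mat, 2, bounds[2, ], "<")
  # a row qualifies if all of its columns pass
  sum(rowSums(ok) == ncol(mat))
}
count_within_bounds(mat)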

Related

Comparing two lists of values of different lengths

I have a long list of random numbers between 1 and 100, and I would like to count how many of them are larger than 10, 20, 30, etc.
x <- c(sample(1:100, 500, replace = T))
y <- seq(0,100, by = 10)
I am looking for this to return an output such as:
Total Count
   10     7
   20    13
   30    17
   40    28
   50    42
where Count is the number of x values that are larger than Total (each y value).
So far, I have tried
Count = ifelse(x > y, 1, 0)
However, this returns a binary 1/0 result for each of the 500 values of x, rather than one count per threshold.
I'd appreciate any help.
This answer assumes you're looking for intervals, not for a cumulative count of numbers greater than each threshold.
cut + table are useful here:
table(cut(x, breaks = y))
(0,10] (10,20] (20,30] (30,40] (40,50] (50,60] (60,70] (70,80] (80,90] (90,100]
51 66 36 44 54 49 55 46 58 41
findInterval + table will give you the same result
table(findInterval(x, y, left.open = TRUE))
Data
set.seed(505)
x <- c(sample(1:100, 500, replace = T))
y <- seq(0,100, by = 10)
With base R this is one approach
rbind(Total = y, Count = rowSums(sapply(x, ">", y)))
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
Total 0 10 20 30 40 50 60 70 80 90 100
Count 500 444 381 329 279 241 198 150 104 52 0
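A close variant of the same idea (my sketch, not part of the original answer): outer() builds the 500 x 11 logical matrix in a single call, and colSums() tallies each column:
rbind(Total = y, Count = colSums(outer(x, y, ">")))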
If I understood correctly, this might work:
x <- c(sample(1:100, 500, replace = T))
y <- seq(0,100, by = 10)
is_bigger_than <- function(y){
  data.frame(y, n = sum(x > y, na.rm = TRUE))
}
purrr::map_df(y,is_bigger_than)
y n
1 0 500
2 10 450
3 20 403
4 30 359
5 40 305
6 50 264
7 60 201
8 70 155
9 80 100
10 90 52
11 100 0

Is it possible to do vectorized sampling by base::sample function in r?

I tried to draw 25 samples by using lapply:
a = list(c(1:5),c(100:105),c(110:115),c(57:62),c(27:32))
lapply(a,function(x)sample(x,5))
Is it possible to use base::sample to do the vectorized sampling?
i.e.
sample(c(5,5),a)
It is not possible using base::sample; however, this kind of vectorized sampling is possible by using runif.
I don't have a good way to vectorize sampling without replacement for an arbitrary number of samples from each vector in x. But we can sample each element of each vector.
Here's a function that vectorizes sampling over a list of vectors. It will return a single vector of samples:
multisample <- function(x, n = lengths(x), replace = FALSE) {
  if (replace) {
    unlist(x)[rep.int(lengths(x), n) * runif(sum(n)) + 1 + rep.int(c(0, cumsum(lengths(x[-length(x)]))), n)]
  } else {
    unlist(x)[rank(runif(sum(n)) + rep.int(seq_along(x), n))]
  }
}
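A brief illustration (my addition) of why the rank() trick in the replace = FALSE branch works: adding the group index to a uniform draw from (0, 1) keeps each group's values in disjoint ranges, so rank() yields a random permutation of positions within every group.
set.seed(42)
g <- rep.int(1:2, c(3, 3))  # group index for two groups of three elements
rank(runif(6) + g)
#> [1] 2 3 1 6 5 4            # positions 1-3 permute 1:3, positions 4-6 permute 4:6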
The equivalent function using lapply:
multisample2 <- function(x, n = lengths(x), replace = FALSE) {
  if (replace) {
    unlist(lapply(seq_along(n), function(i) sample(x[[i]], n[i], 1)))
  } else {
    unlist(lapply(x, sample))
  }
}
Example usage:
x <- list(c(1:9), c(11:18), c(21:27), c(31:36), c(41:45))
# sampling without replacement
multisample(x)
#> [1] 9 3 5 8 7 2 1 4 6 18 11 17 12 16 14 13 15 22 26 25 21 27 24 23 36
#> [26] 31 35 34 33 32 45 43 42 44 41
multisample2(x)
#> [1] 3 6 7 9 2 1 8 4 5 17 16 11 15 14 13 12 18 23 22 26 21 27 24 25 33
#> [26] 32 35 34 31 36 42 43 41 44 45
# sampling with replacement
n <- 7:3 # the number of samples from each vector
multisample(x, n, 1)
#> [1] 9 8 5 9 3 5 3 12 18 12 17 12 16 26 26 24 26 27 33 33 35 32 44 44 43
multisample2(x, n, 1)
#> [1] 9 8 3 7 8 7 8 15 14 15 16 18 14 27 27 21 27 27 33 36 33 34 45 44 41
The vectorized version is considerably faster:
x <- lapply(sample(10:15, 1e4, 1), seq)
n <- sample(10, 1e4, 1)
microbenchmark::microbenchmark(multisample = multisample(x),
multisample2 = multisample2(x))
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> multisample 7.4963 7.993501 8.629845 8.273701 8.732952 13.2050 100
#> multisample2 36.4702 40.518801 41.929437 41.701352 43.040650 63.4695 100
microbenchmark::microbenchmark(multisample = multisample(x, n, 1),
multisample2 = multisample2(x, n, 1))
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> multisample 2.326502 2.39170 2.842023 2.7672 3.183101 4.161801 100
#> multisample2 33.700001 37.61035 39.468619 39.1137 40.055901 72.030602 100
If a list of vectors is desired instead, the functions can be modified:
multisample <- function(x, n = lengths(x), replace = FALSE) {
  i <- rep.int(seq_along(x), n)
  if (replace) {
    split(unlist(x)[rep.int(lengths(x), n) * runif(sum(n)) + 1 + rep.int(c(0, cumsum(lengths(x[-length(x)]))), n)], i)
  } else {
    split(unlist(x)[rank(runif(sum(lengths(x))) + i)], i)
  }
}
multisample2 <- function(x, n = lengths(x), replace = FALSE) {
  if (replace) {
    lapply(seq_along(n), function(i) sample(x[[i]], n[i], 1))
  } else {
    lapply(x, sample)
  }
}
The vectorized version is still much faster.
No. There's no option to stratify the sampling vector with sample(). lapply() is the way to go.
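As a small usage note on the lapply() route (my addition): when each vector needs a different sample size, Map() pairs each vector with its own size:
a <- list(c(1:5), c(100:105), c(110:115), c(57:62), c(27:32))
Map(sample, a, c(2, 3, 1, 4, 5))  # draws n[i] values from a[[i]]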

R lpSolve setting constraints (fantasy football related)

I am trying to solve a linear optimization problem.
I have 100 players with a score and a price for each player. My goal is to select 11 players and to maximize the possible score while staying within budget (83 in this example).
The code below solves this task.
library(data.table)
library(lpSolve)
set.seed(1)
# simulating data
data <- data.table(id = 1:100,
                   price = round(rnorm(100, 9, 2), 1),
                   score = round(rnorm(100, 150, 25)))
# objective which should be maximized
obj <- data$score
constraints <- matrix(
  c(data$price,
    rep(1, 100)),
  byrow = TRUE, nrow = 2
)
constr_dir <- c("<=", "==")
rhs <- c(83, 11)
result <- lp("max", obj, constraints, constr_dir, rhs, all.bin = TRUE)
# joining result for easier observing
data$result <- result$solution
data[result != 0,]
Here is the question: my team should include a captain, whose score counts twice. How do I modify my code to add this condition? (Please pay attention to the all.bin argument of the lp function; maybe this should be changed in the final solution.)
So here is the current result:
id price score coeff
1: 6 7.4 194 1
2: 10 8.4 192 1
3: 13 7.8 186 1
4: 14 4.6 134 1
5: 24 5.0 146 1
6: 35 6.2 158 1
7: 60 8.7 197 1
8: 66 9.4 205 1
9: 71 10.0 208 1
10: 78 9.0 202 1
11: 97 6.4 186 1
What I want to achieve is that 10 coefficients should be equal to 1 and one equal to 2 (the result may differ from the one below; this is just an example):
id price score coeff
1: 6 7.4 194 1
2: 10 8.4 192 1
3: 13 7.8 186 1
4: 14 4.6 134 1
5: 24 5.0 146 1
6: 35 6.2 158 1
7: 60 8.7 197 1
8: 66 9.4 205 1
9: 71 10.0 208 2
10: 78 9.0 202 1
11: 97 6.4 186 1
You can run 100 augmented linear problems. In the i-th problem you multiply the score of the i-th player by 2. At the end you pick the solution with the highest score:
constraints <- matrix(c(data$price, rep(1, 100)), byrow = TRUE, nrow = 2)
constr_dir <- c("<=", "==")
rhs <- c(83, 11)
res <- vector("list", nrow(data))
for(i in seq_len(nrow(data))){
  cat("iter:", i, "\n")
  obj <- data$score
  obj[[i]] <- obj[[i]] * 2
  res[[i]] <- lp("max", obj, constraints, constr_dir, rhs, all.bin = TRUE)
}
captain_index <- which.max(unlist(lapply(res, function(x) x$objval)))
data[res[[captain_index]]$solution == 1,]
The chosen captain has the index captain_index.
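An alternative formulation (my addition, not part of the answer above) solves a single problem by doubling the variables: x_i = 1 if player i is selected, y_i = 1 if player i is the captain. The captain's extra score enters the objective through y, and the constraint y_i <= x_i forces the captain to be one of the selected players. A sketch:
n <- nrow(data)
obj <- c(data$score, data$score)  # second block counts the captain's score a second time
constr <- rbind(
  c(data$price, rep(0, n)),       # budget applies to the selection block only
  c(rep(1, n), rep(0, n)),        # exactly 11 selected players
  c(rep(0, n), rep(1, n)),        # exactly 1 captain
  cbind(-diag(n), diag(n))        # y_i - x_i <= 0: the captain must be selected
)
constr_dir <- c("<=", "==", "==", rep("<=", n))
rhs <- c(83, 11, 1, rep(0, n))
res <- lp("max", obj, constr, constr_dir, rhs, all.bin = TRUE)
team <- which(res$solution[1:n] == 1)
captain <- which(res$solution[(n + 1):(2 * n)] == 1)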

Outputting percentiles by filtering a data frame

Note that, as requested in the comments, this question has been revised.
Consider the following example:
df <- data.frame(FILTER = rep(1:10, each = 10), VALUE = 1:100)
I would like to, for each value of FILTER, create a data frame which contains the 1st, 2nd, ..., 99th percentiles of VALUE. The final product should be
PERCENTILE df_1 df_2 ... df_10
1 [first percentiles]
2 [second percentiles]
etc., where df_i is based on FILTER == i.
Note that FILTER, although it contains numbers, is actually categorical.
The way I have been doing this is by using dplyr:
nums <- 1:10
library(dplyr)
for (i in nums){
  df_temp <- filter(df, FILTER == i)$VALUE
  assign(paste0("df_", i), quantile(df_temp, probs = (1:99)/100))
}
and then I would have to cbind these (with 1:99 in the first column), but I would rather not type in every single df name. I have considered using a loop on the names of these data frames, but this would involve using eval(parse()).
Here's a basic outline of a possibly smoother approach. I have not included every single aspect of your desired output, but the modification should be fairly straightforward.
df <- data.frame(FILTER = rep(1:10, each = 10), VALUE = 1:100)
df_s <- lapply(split(df,df$FILTER),
FUN = function(x) quantile(x$VALUE,probs = c(0.25,0.5,0.75)))
out <- do.call(cbind,df_s)
colnames(out) <- paste0("df_",colnames(out))
> out
df_1 df_2 df_3 df_4 df_5 df_6 df_7 df_8 df_9 df_10
25% 3.25 13.25 23.25 33.25 43.25 53.25 63.25 73.25 83.25 93.25
50% 5.50 15.50 25.50 35.50 45.50 55.50 65.50 75.50 85.50 95.50
75% 7.75 17.75 27.75 37.75 47.75 57.75 67.75 77.75 87.75 97.75
I did this for just 3 quantiles to keep things simple, but it obviously extends. And you can add the 1:99 column afterwards as well.
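Spelled out for the OP's full output (a sketch following the same split()/lapply() pattern, with the PERCENTILE column added at the end):
df_s <- lapply(split(df, df$FILTER),
               FUN = function(x) quantile(x$VALUE, probs = (1:99)/100))
out <- do.call(cbind, df_s)
colnames(out) <- paste0("df_", colnames(out))
out <- data.frame(PERCENTILE = 1:99, out)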
I suggest that you use a list.
list_of_dfs <- list()
nums <- 1:10
for (i in nums){
  list_of_dfs[[i]] <- nums*i
}
df <- data.frame(list_of_dfs[[1]])
df <- do.call("cbind",args=list(df,list_of_dfs))
colnames(df) <- paste0("df_",1:10)
You'll get the result you want:
df_1 df_2 df_3 df_4 df_5 df_6 df_7 df_8 df_9 df_10
1 1 2 3 4 5 6 7 8 9 10
2 2 4 6 8 10 12 14 16 18 20
3 3 6 9 12 15 18 21 24 27 30
4 4 8 12 16 20 24 28 32 36 40
5 5 10 15 20 25 30 35 40 45 50
6 6 12 18 24 30 36 42 48 54 60
7 7 14 21 28 35 42 49 56 63 70
8 8 16 24 32 40 48 56 64 72 80
9 9 18 27 36 45 54 63 72 81 90
10 10 20 30 40 50 60 70 80 90 100
How about using get?
df <- data.frame(1:99)  # 99 rows, matching the percentile vectors df_1, ..., df_10 from the loop above
for (i in nums) {
  df <- cbind(df, get(paste0("df_", i)))
}
# get rid of the first placeholder column
df <- df[, -1]
# set the names
names(df) <- paste0("df_", nums)
df

Selecting the pairs of numbers in a vector whose difference is equal to a predetermined value

I have a vector of numbers from which I would like to select the pairs that are 2 units apart. So if I have the vector p defined as follows:
p<-c(2,3,5,7,11,13,17,19,23,29,31,37,41,43,47)
I would like to select the following pairs:
3,5; 5,7; 11,13; 17,19; 29,31; 41,43
I tried, unsuccessfully, to collect at least these numbers in a vector:
j<-NULL
for(i in seq(p)) if (p[i+1]-p[i]==2) j<-c(j,i,i+1)
But it does not give the desired output. Thanks for your help.
Here is my solution using base R functions:
dif=which(abs(diff(p))==2)
sapply(dif, function(x) c(p[x],p[x+1]))
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] 3 5 11 17 29 41
# [2,] 5 7 13 19 31 43
By changing 2 to any other value, you can get the pairs whose elements are any other desired number of units apart.
abs() is used to take care of cases in which the vector's elements are not ordered.
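One caveat worth noting (my addition): diff() only compares neighbouring elements, so on an unsorted vector, pairs whose members are not adjacent are missed; sorting first makes the approach exhaustive.
p_shuffled <- c(7, 2, 11, 5, 3, 13)
which(abs(diff(p_shuffled)) == 2)        # only finds the adjacent pair (5, 3)
which(abs(diff(sort(p_shuffled))) == 2)  # finds (3,5), (5,7) and (11,13)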
BENCHMARK (small scale)
p<-c(2,3,5,7,11,13,17,19,23,29,31,37,41,43,47) # length(p)=15
library(dplyr)
library(data.table)
library(microbenchmark)
func_Sotos <- function(p){df <- expand.grid(p, p);df[df[,1]-df[,2] == 2,];}
func_m0h3n <- function(p){dif=which(abs(diff(p))==2);sapply(dif, function(x) c(p[x],p[x+1]));}
func_David_B <- function(p){data.frame(p) %>% mutate(lagp = lag(p)) %>% filter(p - lagp == 2)}
func_akrun1 <- function(p){setDT(list(p=p))[, p1 := shift(p)][p-p1 ==2];}
func_akrun2 <- function(p){unique(CJ(p=p, p1=p)[abs(p-p1)==2][.(p=pmin(p,p1), p1=pmax(p, p1))]);}
func_RHertel1 <- function(p){d2_mat <- which(as.matrix(dist(p))==2, arr.ind=TRUE);unique(t(apply(cbind(p[d2_mat[,1]],p[d2_mat[,2]]),1,sort)));}
func_RHertel2 <- function(p){m2 <- t(combn(sort(p),2));m2[abs(m2[,1] - m2[,2]) == 2,];}
func_RHertel3 <- function(p){d2 <- as.matrix(dist(p));d2[lower.tri(d2)] <- 0;idx <- which(d2 == 2, arr.ind=TRUE);cbind(p[idx[,1]], p[idx[,2]]);}
func_Tomas.H <- function(p) {a<-which(p-lag(p)==2);b<-a-1;df<-data.frame(pair1=p[b],pair2=p[a]);df;}
func_Arun.kumar.mahesh <- function(p) {
  j <- c()
  for (i in 1:length(p)) {
    if (sum(p[i] - p[i+1], na.rm = T) == -2) {
      j[i] <- paste(p[i], p[i+1], sep = ",")
    }
  }
  j <- j[!is.na(j)]
}
microbenchmark(func_Sotos(p), func_m0h3n(p), func_David_B(p), func_akrun1(p), func_akrun2(p), func_RHertel1(p), func_RHertel2(p), func_RHertel3(p), func_Tomas.H(p), func_Arun.kumar.mahesh(p))
Unit: microseconds
expr min lq mean median uq max neval
func_Sotos(p) 403.770 455.9520 470.6952 469.6390 485.4640 594.961 100
func_m0h3n(p) 72.713 92.8155 125.7504 98.8040 104.7920 2622.790 100
func_David_B(p) 1986.340 2148.2335 2260.4203 2207.0450 2292.1615 5547.553 100
func_akrun1(p) 1321.233 1404.2110 1472.6807 1464.3060 1504.7255 1872.566 100
func_akrun2(p) 2524.414 2623.2185 2777.9167 2700.2080 2816.5485 5595.885 100
func_RHertel1(p) 1160.838 1230.5560 1349.9502 1267.7680 1328.7185 4133.504 100
func_RHertel2(p) 249.362 281.2270 298.3233 296.1975 308.3880 562.027 100
func_RHertel3(p) 159.968 182.8515 204.4812 203.1675 223.6985 304.538 100
func_Tomas.H(p) 275.453 316.0865 337.7593 334.6925 350.7320 646.716 100
func_Arun.kumar.mahesh(p) 123.184 142.2175 174.5620 148.4200 158.0435 2579.163 100
BENCHMARK (medium scale)
set.seed(100)
p=sample(1000) # length(p)=1000
microbenchmark(func_Sotos(p), func_m0h3n(p), func_David_B(p), func_akrun1(p), func_akrun2(p), func_RHertel1(p), func_RHertel2(p), func_RHertel3(p), func_Tomas.H(p), func_Arun.kumar.mahesh(p))
Unit: microseconds
expr min lq mean median uq max neval
func_Sotos(p) 30711.250 35060.8410 53640.60456 64290.0265 69224.6310 98474.248 100
func_m0h3n(p) 41.465 68.9580 88.75608 83.5305 102.1600 196.808 100
func_David_B(p) 854.835 1067.1160 1220.68932 1150.1960 1261.5205 3934.944 100
func_akrun1(p) 524.319 748.9200 830.18763 811.5670 896.2995 1549.519 100
func_akrun2(p) 12986.877 17372.2235 34010.07038 21836.1435 52173.1590 58796.699 100
func_RHertel1(p) 76813.429 107942.6315 112380.30785 115049.1765 119579.6505 163399.316 100
func_RHertel2(p) 280275.495 297188.4505 307531.70976 304330.0005 314177.5760 360689.445 100
func_RHertel3(p) 45957.354 85348.1045 103999.44879 113351.6765 118847.8575 170738.875 100
func_Tomas.H(p) 154.742 212.4325 263.66812 260.8075 295.0610 536.037 100
func_Arun.kumar.mahesh(p) 972.619 1072.5250 1192.35206 1152.4500 1238.9850 2483.979 100
There is a better way than this, but here is an idea with expand.grid,
df <- expand.grid(p, p)
unname(apply(df[df[,1]-df[,2] == -2,], 1,paste, collapse = ','))
#[1] "3,5" "5,7" "11,13" "17,19" "29,31" "41,43"
If you want a data frame then simply,
df[df[,1]-df[,2] == 2,]
# Var1 Var2
#18 5 3
#34 7 5
#66 13 11
#98 19 17
#146 31 29
#194 43 41
If the desired outcome is a data frame, then try this (note that lag() here is dplyr's lag(), not stats::lag()):
library(dplyr)
p <- c(2,3,5,7,11,13,17,19,23,29,31,37,41,43,47)
a <- which(p - lag(p) == 2)
b <- a - 1
df <- data.frame(pair1 = p[b],
                 pair2 = p[a])
If you want a vector back, then this should work:
res <- NULL
for (i in a){
  res <- c(res, p[i-1], p[i])
}
You could do this using dplyr, which will return the pairs in a data frame:
> library(dplyr)
> data.frame(p) %>% mutate(lagp = lag(p)) %>% filter(p - lagp == 2)
p lagp
1 5 3
2 7 5
3 13 11
4 19 17
5 31 29
6 43 41
Here is another using data.table
library(data.table)
setDT(list(p=p))[, p1 := shift(p)][p-p1 ==2]
# p p1
#1: 5 3
#2: 7 5
#3: 13 11
#4: 19 17
#5: 31 29
#6: 43 41
If the vector p is not ordered, order it before doing the operation.
setDT(list(p=p))[order(p)][, p1 := shift(p)][p-p1==2]
Update
Using the new vector provided by @RHertel
p <- c(2, 3, 4, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47)
unique(CJ(p=p, p1=p)[abs(p-p1)==2][.(p=pmin(p,p1), p1=pmax(p, p1))])
# p p1
#1: 2 4
#2: 3 5
#3: 5 7
#4: 11 13
#5: 17 19
#6: 29 31
#7: 41 43
Kind of hacky, but here's another way.
d2_mat <- which(as.matrix(dist(p))==2, arr.ind=TRUE)
unique(t(apply(cbind(p[d2_mat[,1]],p[d2_mat[,2]]),1,sort)))
# [,1] [,2]
#[1,] 3 5
#[2,] 5 7
#[3,] 11 13
#[4,] 17 19
#[5,] 29 31
#[6,] 41 43
In contrast to some of the other answers, this does not require any specific order of the numbers in the vector p.
A vectorized version of the same could be:
d2 <- as.matrix(dist(p))
d2[lower.tri(d2)] <- 0
idx <- which(d2 == 2, arr.ind=TRUE)
cbind(p[idx[,1]], p[idx[,2]])
In the last line, instead of cbind(), one could also use paste(), depending on the desired output:
paste(p[idx[,1]], p[idx[,2]], sep=",")
#[1] "3,5" "5,7" "11,13" "17,19" "29,31" "41,43"
The following variant is simpler and probably (much) faster than my previous suggestions.
m2 <- t(combn(sort(p),2))
m2[abs(m2[,1] - m2[,2]) == 2,]
This version, too, finds all pairs of values that are 2 units apart within any integer vector.
Here's an example:
p <- c(13, 19, 43, 29, 47, 17, 7, 37, 2, 41, 3, 4, 31, 11, 5, 23)
m2 <- t(combn(sort(p), 2))
m2[abs(m2[,1] - m2[,2]) == 2,]
# [,1] [,2]
#[1,] 2 4
#[2,] 3 5
#[3,] 5 7
#[4,] 11 13
#[5,] 17 19
#[6,] 29 31
#[7,] 41 43
The output can be modified, if desired, by using:
m2 <- t(combn(sort(p), 2))
m2 <- m2[abs(m2[,1] - m2[,2]) == 2,]
paste(m2[,1], m2[,2], sep=",")
#[1] "2,4" "3,5" "5,7" "11,13" "17,19" "29,31" "41,43"
Use the length() function instead of seq() to get the desired output:
j <- c()
for (i in 1:length(p)){
  if (sum(p[i] - p[i+1], na.rm = T) == -2){
    j[i] <- paste(p[i], p[i+1], sep = ",")
  }
}
j <- j[!is.na(j)]
print(j)
[1] "3,5" "5,7" "11,13" "17,19" "29,31" "41,43"
