Subset by multiple ranges [duplicate] - r

This question already has answers here:
Efficient way to filter one data frame by ranges in another
(3 answers)
Closed 5 years ago.
I want to get a list of values that fall in between multiple ranges.
library(data.table)
values <- data.table(value = c(1:100))
range <- data.table(start = c(6, 29, 87), end = c(10, 35, 92))
I need the results to include only the values that fall in between those ranges:
results <- c(6, 7, 8, 9, 10, 29, 30, 31, 32, 33, 34, 35, 87, 88, 89, 90, 91, 92)
I am currently doing this with a for loop,
results <- data.table(NULL)
for (i in 1:NROW(range)) {
  results <- rbind(results,
                   data.table(result = values[value >= range[i, start] &
                                                value <= range[i, end], value]))
}
however the actual dataset is quite large and I am looking for a more efficient way.
Any suggestions are appreciated! Thank you!

Using the non-equi join possibility of data.table:
values[range, on = .(value >= start, value <= end), .(results = x.value)]
which gives:
results
1: 6
2: 7
3: 8
4: 9
5: 10
6: 29
7: 30
8: 31
9: 32
10: 33
11: 34
12: 35
13: 87
14: 88
15: 89
16: 90
17: 91
18: 92
Or, as per the suggestion of @Henrik: values[value %inrange% range]. This also works very well on data.tables with multiple columns:
# create new data
set.seed(26042017)
values2 <- data.table(value = c(1:100), let = sample(letters, 100, TRUE), num = sample(100))
> values2[value %inrange% range]
value let num
1: 6 v 70
2: 7 f 77
3: 8 u 21
4: 9 x 66
5: 10 g 58
6: 29 f 7
7: 30 w 48
8: 31 c 50
9: 32 e 5
10: 33 c 8
11: 34 y 19
12: 35 s 97
13: 87 j 80
14: 88 o 4
15: 89 h 65
16: 90 c 94
17: 91 k 22
18: 92 g 46

If you have the latest CRAN version of data.table you can use non-equi joins. For example, you can create an index which you can then use to subset your original data:
idx <- values[range, on = .(value >= start, value <= end), which = TRUE]
# [1] 6 7 8 9 10 29 30 31 32 33 34 35 87 88 89 90 91 92
values[idx]

Here is one method using lapply and %between%
rbindlist(lapply(seq_len(nrow(range)), function(i) values[value %between% range[i]]))
This method loops through the ranges data.table and subsets values in each iteration according to the variable in ranges. lapply returns a list, which rbindlist constructs into a data.table. If you want a vector, replace rbindlist with unlist.
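For reference, here is a minimal sketch of that vector variant (assuming the same values and range objects as above), selecting the value column explicitly so that unlist returns a plain numeric vector:
# returns c(6, 7, ..., 92) as a numeric vector rather than a data.table
unlist(lapply(seq_len(nrow(range)), function(i) values[value %between% range[i], value]))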
benchmarks
Just to check the speeds of each suggestion on the given data, I ran a quick comparison
library(microbenchmark)
microbenchmark(
lmo=rbindlist(lapply(seq_len(nrow(range)), function(i) values[value %between% range[i]])),
dd={idx <- values[range, on = .(value >= start, value <= end), which = TRUE]; values[idx]},
jaap=values[range, on = .(value >= start, value <= end), .(results = x.value)],
inrange=values[value %inrange% range])
This returned
Unit: microseconds
expr min lq mean median uq max neval cld
lmo 1238.472 1460.5645 1593.6632 1520.8630 1613.520 3101.311 100 c
dd 688.230 766.7750 885.1826 792.8615 825.220 3609.644 100 b
jaap 798.279 897.6355 935.9474 921.7265 970.906 1347.380 100 b
inrange 463.002 518.3110 563.9724 545.5375 575.758 1944.948 100 a
As might be expected, my looping solution is quite a bit slower than the others. However, the clear winner is %inrange%, which is essentially a vectorized extension of %between%.
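To make that relationship concrete, here is a small sketch (using the values and range objects from above): %between% tests a single lower/upper pair, while %inrange% tests membership in any of a set of ranges.
values$value %between% c(6, 10)   # TRUE only for the single interval [6, 10]
values$value %inrange% range      # TRUE if a value falls in any of the start/end ranges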

Related

Calculate second highest cumulative value by group

I have data with a grouping variable 'grps' and a value 'x'. I have calculated the cummax within each group 'cmx'. Now I need to find the second highest cumulative value of 'x' within each group, scmx.
Some data, including the desired column scmx:
library(data.table)
d = structure(list(date = structure(rep(c(18690, 18691, 18692, 18693, 18694, 18695, 18696, 18697), 2), class = "Date"),
x = c(18, 70, 57, 94, 94, 13, 98, 23, 20, 72, 59, 96, 96, 15, 100, 25),
grps = c(rep("g1", 8), rep("g2", 8))),
row.names = c(NA, -16L), class = c("data.table", "data.frame"))
d[, cmx := cummax(x), by = .(grps)]
d[, scmx := c(18, 18, 57, 70, 70, 70, 94, 94, 20, 20, 59, 72, 72, 72, 96, 96)]
Context
If x corresponds to a performance rating, what I am trying to do is locate the date when they achieved their best performance and their second best. A similar question of mine where I needed to locate the row which corresponded to the highest cumulative value in a column:
Fill down first row within each cumulative max, with a twist
A data.table alternative:
d[ , scmx2 := {
c(x[1], sapply(seq(.N)[-1], function(i){
v = x[1:i]
v[frank(-v, ties.method = "dense") == 2][1]
}))
}, by = grps]
# date x grps cmx scmx scmx2
# 1: 2021-03-04 18 g1 18 18 18
# 2: 2021-03-05 70 g1 70 18 18
# 3: 2021-03-06 57 g1 70 57 57
# 4: 2021-03-07 94 g1 94 70 70
# 5: 2021-03-08 94 g1 94 70 70
# 6: 2021-03-09 13 g1 94 70 70
# 7: 2021-03-10 98 g1 98 94 94
# 8: 2021-03-11 23 g1 98 94 94
# 9: 2021-03-04 20 g2 20 20 20
# 10: 2021-03-05 72 g2 72 20 20
# 11: 2021-03-06 59 g2 72 59 59
# 12: 2021-03-07 96 g2 96 72 72
# 13: 2021-03-08 96 g2 96 72 72
# 14: 2021-03-09 15 g2 96 72 72
# 15: 2021-03-10 100 g2 100 96 96
# 16: 2021-03-11 25 g2 100 96 96
Within each group (by = grps), loop (sapply) over a sequence from 2 to the number of rows in the current group (seq(.N)[-1]). In each step, subset 'x' from the start of the vector to the index 'i' (v = x[1:i]).
Calculate the dense rank and check if the rank is 2 (frank(-v, ties.method = "dense") == 2), i.e. the rank of the second largest number. Use the logical indices to subset 'v' (v[...]). Select the first match ([1], in case there are several values with rank 2). Concatenate the result from this 'expanding window' with the first element of 'x' (c(x[1], ...)).
In the first window, with only one value, there is clearly no second highest value. Here the OP has chosen to return the first value. The same choice needs to be made for longer windows where all values are equal, which occurs when there are leading runs of equal values. If we would rather return NA than the first value, then replace the x[1] in the line
c(x[1], sapply(seq(.N)[-1], function(i){
...with NA_real_.
Small demo:
d = data.table(grps = c(1, 1, 2, 2, 2), x = c(3, 3, 4, 4, 5))
d[ , scmx2 := {
c(NA_real_, sapply(seq(.N)[-1], function(i){
v = x[1:i]
v[frank(-v, ties.method = "dense") == 2][1]
}))
}, by = grps]
# grps x scmx2
# 1: 1 3 NA # grp 1: all values equal in all windows -> all NA
# 2: 1 3 NA
# 3: 2 4 NA
# 4: 2 4 NA
# 5: 2 5 4 # grp 2: only the last window has a second highest value
This question is indeed similar to the post I linked to above (Finding cumulative second max per group in R). However, here the OP asked for a data.table solution.
Here is another option using non-equi join:
d[, s2 := .SD[.SD, on=.(grps, date<=date, x<cmx), by=.EACHI, max(x.x)]$V1]
d[is.na(s2), s2 := x][]
output:
date x grps cmx scmx s2
1: 2021-03-04 18 g1 18 18 18
2: 2021-03-05 70 g1 70 18 18
3: 2021-03-06 57 g1 70 57 57
4: 2021-03-07 94 g1 94 70 70
5: 2021-03-08 94 g1 94 70 70
6: 2021-03-09 13 g1 94 70 70
7: 2021-03-10 98 g1 98 94 94
8: 2021-03-11 23 g1 98 94 94
9: 2021-03-04 20 g2 20 20 20
10: 2021-03-05 72 g2 72 20 20
11: 2021-03-06 59 g2 72 59 59
12: 2021-03-07 96 g2 96 72 72
13: 2021-03-08 96 g2 96 72 72
14: 2021-03-09 15 g2 96 72 72
15: 2021-03-10 100 g2 100 96 96
16: 2021-03-11 25 g2 100 96 96
Create a sequence along the column x. For each position in that sequence, apply the function to the values of x from index 1 up to that position, keeping only the unique values. Rfast::nth can be used to take the 2nd highest number in a vector.
library(Rfast)
sapply(seq(length(d$x)), function(x) {
return(nth(unique(d$x[1:x]), 2, descending=TRUE))
})
[1] 2.652495e-315 1.800000e+01 5.700000e+01 7.000000e+01
[5] 7.000000e+01 7.000000e+01 9.400000e+01 9.400000e+01
To do it per group for the full data frame, we can still use the function created above. Reshape the data so that each group's values are in their own column, then use lapply with rollapplyr to capture the 2nd largest unique value.
library(dplyr)
library(tidyr)  # pivot_wider
library(zoo)    # rollapplyr
d1 <- d %>% select(-cmx) %>%
  pivot_wider(names_from = grps, values_from = x)
lapply(d1[-1], function(x) {
  my_list <- rollapplyr(x, seq(length(x)), function(x) {
    return(nth(sort(unique(x), decreasing = TRUE), 2))
  })
  return(my_list)
})

reducing repetitive tasks in data.table in R

I notice that I am doing the same thing multiple times, just with slightly different values:
HCCtreshold <- 40000
claimsMonthly[, HCC12mnth := +(HCCtreshold < claim12month)][ HCC12mnth == 1, `:=` (aboveHCCth12mnth = (claim12month - HCCtreshold))][is.na(aboveHCCth12mnth),aboveHCCth12mnth := 0]
claimsMonthly[, HCC11mnth := +(HCCtreshold < claim11month)][ HCC11mnth == 1, `:=` (aboveHCCth11mnth = (claim11month - HCCtreshold))][is.na(aboveHCCth11mnth),aboveHCCth11mnth := 0]
claimsMonthly[, HCC10mnth := +(HCCtreshold < claim10month)][ HCC10mnth == 1, `:=` (aboveHCCth10mnth = (claim10month - HCCtreshold))][is.na(aboveHCCth10mnth),aboveHCCth10mnth := 0]
So I started with something like this:
k <- seq.default(from = 8, to = 12, by = 1)
claimsMonthly[paste0("HCC", k, "mnth") := lapply(k, function(x) (+(HCCtreshold < paste0("HCC", k, "mnth"))))]
I get an error:
Error: Check that is.data.table(DT) == TRUE. Otherwise, := and `:=`(...) are defined for use in j, once only and in particular ways. See help(":=").
I also tried:
for(k in 8:12){
claimsMonthly[, paste0("HCC", k, "mnth") := +(HCCtreshold < paste0("HCC", k, "mnth"))]
}
The columns are created correctly, but I get incorrect values inside them: a 1 everywhere.
I am not sure what I am doing wrong.
I can offer some suggestions and, with some fake data, try them out. (As an aside, the reason your loop gives a 1 everywhere is that HCCtreshold < paste0("HCC", k, "mnth") compares a number with a character string, not with the column of that name; R coerces the number to character and the lexicographic comparison comes out TRUE.)
You can programmatically define names on the left-hand side of := if you wrap a vector in c(...), so for instance DT[ c(vec_of_names) := list(some, values)].
You can programmatically retrieve the values of variables with a vector of variable names and mget. While I generally think mget can indicate problematic code, I believe that here it works with low risk. (While mget and get normally retrieve variables from the operating environment, often .GlobalEnv, from within a data.table operation they retrieve columns just as easily.)
Instead of a double-tap of assignment with == 1 and then is.na(...), we can use some logical trickery and the data.table::fcoalesce function. (If you aren't familiar, fcoalesce operates like SQL's coalesce function, which is a vector-friendly way of finding the first non-NA value across its vector arguments.)
fcoalesce(c(1, 2, NA, NA), c(11, 12, 13, NA), c(21, 22, 23, 24))
# [1] 1 2 13 24
We can use fcoalesce(some + math * calc, 0) to do the math and, if NA, replace it with 0. (We use it on the above* variables below, and not necessarily on the HCC* logical variables. It can apply there too, if desired. If those HCC* variables are throw-away, though, it just doesn't matter.)
Fake data:
library(data.table)
set.seed(42)
hccthreshold <- 50
dat <- data.table( claim10month = sample(99, 10), claim11month = sample(99, 10), claim12month = sample(99, 10) )
dat$claim11month[5] <- NA
dat
# claim10month claim11month claim12month
# 1: 91 46 90
# 2: 92 71 14
# 3: 28 91 96
# 4: 80 25 91
# 5: 61 NA 8
# 6: 49 89 49
# 7: 69 97 37
# 8: 13 11 84
# 9: 60 95 41
# 10: 64 51 76
First, let's programmatically determine the column names we want to act on, and from them create matching vectors of names for the new variables. (I'm a big fan of determining and adapting these variable names programmatically, so that if you get a partial data set your code still works. You might consider setting checks and alarms to catch something wrong, for instance stopifnot(length(claimnames) == 12L), in case you are expecting to always have precisely 12 months.)
claimnames <- grep("^claim[0-9]+month", colnames(dat), value = TRUE)
hccnames <- gsub("^claim", "HCC", claimnames)
abovenames <- gsub("^claim", "aboveHCC", claimnames)
claimnames
# [1] "claim10month" "claim11month" "claim12month"
hccnames
# [1] "HCC10month" "HCC11month" "HCC12month"
abovenames
# [1] "aboveHCC10month" "aboveHCC11month" "aboveHCC12month"
And now, we can process the data.
dat[, c(hccnames) := lapply(mget(claimnames), `>`, hccthreshold) ]
dat[, c(abovenames) := Map(function(hcc, clm) fcoalesce(clm - hcc * hccthreshold, 0),
mget(hccnames), mget(claimnames)) ]
dat
# claim10month claim11month claim12month HCC10month HCC11month HCC12month aboveHCC10month aboveHCC11month aboveHCC12month
# 1: 91 46 90 TRUE FALSE TRUE 41 46 40
# 2: 92 71 14 TRUE TRUE FALSE 42 21 14
# 3: 28 91 96 FALSE TRUE TRUE 28 41 46
# 4: 80 25 91 TRUE FALSE TRUE 30 25 41
# 5: 61 NA 8 TRUE NA FALSE 11 0 8
# 6: 49 89 49 FALSE TRUE FALSE 49 39 49
# 7: 69 97 37 TRUE TRUE FALSE 19 47 37
# 8: 13 11 84 FALSE FALSE TRUE 13 11 34
# 9: 60 95 41 TRUE TRUE FALSE 10 45 41
# 10: 64 51 76 TRUE TRUE TRUE 14 1 26
I chose to keep the HCC* variables as logical instead of your +(...) integers, but it's directly translatable and up to you.
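If you prefer integer 0/1 flags like the original +(...) columns, one possible in-place conversion (a sketch using the hccnames vector from above) is:
# coerce the logical HCC* columns to integer 0/1 flags by reference
dat[, c(hccnames) := lapply(.SD, as.integer), .SDcols = hccnames]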

How to effectively determine the maximum difference between the variable value in each row and same variable subsequent row values in data.table in R

What is the most efficient way to determine the maximum positive difference between the value (X) for each row and the subsequent values of the same variable (X) within each group (Y) in data.table in R?
Example:
set.seed(1)
dt <- data.table(X = sample(100:200, 500455, replace = TRUE),
Y = unlist(sapply(10:1000, function(x) rep(x, x))))
Here's my solution, which I consider inefficient and slow:
dt[, max_diff := vapply(1:.N, function(x) max(X[x:.N] - X[x]), numeric(1)), by = Y]
head(dt, 21)
X Y max_diff
1: 126 10 69
2: 137 10 58
3: 157 10 38
4: 191 10 4
5: 120 10 75
6: 190 10 5
7: 195 10 0
8: 166 10 0
9: 163 10 0
10: 106 10 0
11: 120 11 80
12: 117 11 83
13: 169 11 31
14: 138 11 62
15: 177 11 23
16: 150 11 50
17: 172 11 28
18: 200 11 0
19: 138 11 56
20: 178 11 16
21: 194 11 0
Can you advise a more efficient (faster) solution?
Here's a dplyr solution that is about 20x faster and gets the same results. I presume the data.table equivalent would be yet faster. (EDIT: see bottom - it is!)
The speedup comes from reducing how many comparisons need to be performed. The largest difference will always be found against the largest remaining number in the group, so it's faster to identify that number first and do only the one subtraction per row.
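To illustrate the idea on a made-up 5-value group: the running maximum of the remaining values (a reversed cummax) gives the same answer as comparing each value against everything after it.
x <- c(3, 9, 1, 7, 2)
sapply(seq_along(x), function(i) max(x[i:length(x)] - x[i]))  # OP-style, compares against all later values
# [1] 6 0 6 0 0
rev(cummax(rev(x))) - x                                       # reversed cumulative max, one pass
# [1] 6 0 6 0 0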
First, the original solution takes about 4 sec on my machine:
tictoc::tic("OP data.table")
dt[, max_diff := vapply(1:.N, function(x) max(X[x:.N] - X[x]), numeric(1)), by = Y]
tictoc::toc()
# OP data.table: 4.594 sec elapsed
But in only 0.2 sec we can take that data.table, convert to a data frame, add the orig_row row number, group by Y, reverse sort by orig_row, take the difference between X and the cumulative max of X, ungroup, and rearrange in original order:
library(dplyr)
tictoc::tic("dplyr")
dt2 <- dt %>%
as_data_frame() %>%
mutate(orig_row = row_number()) %>%
group_by(Y) %>%
arrange(-orig_row) %>%
mutate(max_diff2 = cummax(X) - X) %>%
ungroup() %>%
arrange(orig_row)
tictoc::toc()
# dplyr: 0.166 sec elapsed
all.equal(dt2$max_diff, dt2$max_diff2)
#[1] TRUE
EDIT: as @David Arenburg suggests in the comments, this can be done lightning-fast in data.table with an elegant one-liner:
dt[.N:1, max_diff2 := cummax(X) - X, by = Y]
On my computer, that's about 2-4x faster than the dplyr solution above.
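A quick sanity check (assuming the one-liner above has been run on dt, so that max_diff2 exists) that it reproduces the original column:
all.equal(dt$max_diff, dt$max_diff2)
# [1] TRUE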

Multiple different conditions and if statments within a loop

I want to assign different letters from A:U to a new column vector according to some conditions that depend on a different column that takes the numbers 1:99.
I came up with the following solution, but I want to write it more efficiently.
for (i in 1:99){
if (i %in% 1:3 == T ){
id<-which(H07_NACE$NACE2.Code==i)
H07_NACE$NACE2.Sectors[id]<-"A"
}
.............
if (i %in% 45:60 == T ){
id<-which(H07_NACE$NACE2.Code==i)
H07_NACE$NACE2.Sectors[id]<-"D"
}
.....................
if (i == 99 ){
id<-which(H07_NACE$NACE2.Code==i)
H07_NACE$NACE2.Sectors[id]<-"U"
}
}
In the previous code I skipped multiple other lines which essentially do the same thing. Notice that the conditions change throughout this loop and are of two types: one of the form i %in% 45:60 == T, and the other of the form i == 99.
My original code has multiple such ifs within this loop so any help on how I can write it more efficiently or compactly will be appreciated.
The OP wants to map the numbers given in H07_NACE$NACE2.Code to the letters "A" to "U" according to rules hardcoded in a number of if clauses.
A more flexible approach (and less tedious to code) is to use a lookup table (or constraint vector as Joseph Wood called it in his answer).
With data.table, we can use either a rolling join or a non-equi update join to do the mapping.
Sample data to be mapped
set.seed(1)
H07_NACE <- data.frame(NACE2.Code = sample(99, 10, replace = TRUE))
Rolling join
For the rolling join, we specify the mapping rules by tiling the number range 1:99 contiguously and giving the start number of each tile.
library(data.table)
# set up lookup table
lookup <- data.table(Code = c(1, 4, 21, 45, 61:75, 98, 99),
Sector = LETTERS[1:21])
lookup
Code Sector
1: 1 A
2: 4 B
3: 21 C
4: 45 D
5: 61 E
6: 62 F
7: 63 G
8: 64 H
9: 65 I
10: 66 J
11: 67 K
12: 68 L
13: 69 M
14: 70 N
15: 71 O
16: 72 P
17: 73 Q
18: 74 R
19: 75 S
20: 98 T
21: 99 U
Code Sector
# map Code to Sector
lookup[setDT(H07_NACE), on = .(Code = NACE2.Code), roll = TRUE]
Code Sector
1: 27 C
2: 37 C
3: 57 D
4: 90 S
5: 20 B
6: 89 S
7: 94 S
8: 66 J
9: 63 G
10: 7 B
If the H07_NACE is to be updated we can append a new column by
setDT(H07_NACE)[, NACE2.Sector := lookup[H07_NACE, on = .(Code = NACE2.Code),
roll = TRUE, Sector]][]
NACE2.Code NACE2.Sector
1: 27 C
2: 37 C
3: 57 D
4: 90 S
5: 20 B
6: 89 S
7: 94 S
8: 66 J
9: 63 G
10: 7 B
Non-equi update join
For the non-equi update join, we specify the mapping rules by giving the lower and upper bounds. This can be derived from lookup by
lookup2 <- lookup[, .(Sector, lower = Code,
upper = shift(Code - 1L, type = "lead", fill = max(Code)))]
lookup2
Sector lower upper
1: A 1 3
2: B 4 20
3: C 21 44
4: D 45 60
5: E 61 61
6: F 62 62
7: G 63 63
8: H 64 64
9: I 65 65
10: J 66 66
11: K 67 67
12: L 68 68
13: M 69 69
14: N 70 70
15: O 71 71
16: P 72 72
17: Q 73 73
18: R 74 74
19: S 75 97
20: T 98 98
21: U 99 99
Sector lower upper
The new column is created by
setDT(H07_NACE)[lookup2, on = .(NACE2.Code >= lower, NACE2.Code <= upper),
NACE2.Sector := Sector][]
NACE2.Code NACE2.Sector
1: 27 C
2: 37 C
3: 57 D
4: 90 S
5: 20 B
6: 89 S
7: 94 S
8: 66 J
9: 63 G
10: 7 B
Here is a quick and dirty solution that should do the job (I'm sure there is a more efficient/elegant way to do this). We can set up a constraint vector and use indexing from there to produce the desired results.
## Here is some random data that resembles the OP's
set.seed(3)
H07_NACE <- data.frame(NACE2.Code = sample(99, replace = TRUE))
## "T" is the 20th element... we need to gurantee
## that the number corresponding to "U"
## corresponds to max(NACE2.Code)
maxCode <- max(H07_NACE$NACE2.Code)
constraintVec <- sort(sample(maxCode - 1, 20))
constraintVec <- c(constraintVec, maxCode)
H07_NACE$NACE2.Sector <- LETTERS[vapply(H07_NACE$NACE2.Code, function(x) {
which(constraintVec >= x)[1]
}, 1L)]
## Add optional check column to ensure we are mapping the
## Code to the correct Sector
H07_NACE$NACE2.Check <- constraintVec[vapply(H07_NACE$NACE2.Code, function(x) {
which(constraintVec >= x)[1]
}, 1L)]
head(H07_NACE)
NACE2.Code NACE2.Sector NACE2.Check
1 17 E 18
2 80 R 85
3 39 K 54
4 33 J 37
5 60 N 66
6 60 N 66
Update courtesy of @Frank
As suspected, there is a much simpler solution assuming the above logic is correct. We use findInterval and set the arguments rightmost.closed and left.open to TRUE (we also have to add 1L to the resulting vector):
H07_NACE$NACE2.Sector2 <- LETTERS[findInterval(H07_NACE$NACE2.Code, constraintVec,
rightmost.closed = TRUE, left.open = TRUE) + 1L]
head(H07_NACE)
NACE2.Code NACE2.Sector NACE2.Check NACE2.Sector2
1 17 E 18 E
2 80 R 85 R
3 39 K 54 K
4 33 J 37 J
5 60 N 66 N
6 60 N 66 N
identical(H07_NACE$NACE2.Sector, H07_NACE$NACE2.Sector2)
[1] TRUE
Here are two tidyverse examples, though I'm not completely certain what the original poster is really asking for.
library(tidyverse)
data.frame(NACE2.Code = sample(99, replace = TRUE)) %>%
mutate(Sectors = ifelse(NACE2.Code %in% 1:3, "A",
ifelse(NACE2.Code %in% 45:60, "D",
ifelse(NACE2.Code ==99, "U", NA))))
data.frame(NACE2.Code = sample(99, replace = TRUE)) %>%
mutate(Sectors = case_when(NACE2.Code %in% 1:3 ~ "A",
NACE2.Code %in% 45:60 ~ "D",
NACE2.Code ==99 ~ "U")) %>%
drop_na

Selecting the pairs of numbers in a vector which difference is equal to a predetermined value

I have a vector of numbers from which I would like to select the pairs that are 2 units apart. So if I have the vector p defined as follows:
p<-c(2,3,5,7,11,13,17,19,23,29,31,37,41,43,47)
I would like to select the following pairs:
3,5; 5,7; 11,13; 17,19; 29,31; 41,43
I tried, unsuccessfully, to select at least these numbers in a vector:
j<-NULL
for(i in seq(p)) if (p[i+1]-p[i]==2) j<-c(j,i,i+1)
But it does not give the desired output. Thanks for your help.
Here is my solution using base R functions:
dif=which(abs(diff(p))==2)
sapply(dif, function(x) c(p[x],p[x+1]))
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] 3 5 11 17 29 41
# [2,] 5 7 13 19 31 43
By changing 2 to any other value, you can find the pairs whose elements are any other desired number of units apart.
abs is used to take care of the cases in which the vector's elements are not ordered.
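For example (a quick sketch), the same idea finds the pairs that are 4 units apart in the same vector p:
dif4 <- which(abs(diff(p)) == 4)
sapply(dif4, function(x) c(p[x], p[x + 1]))
#      [,1] [,2] [,3] [,4] [,5]
# [1,]    7   13   19   37   43
# [2,]   11   17   23   41   47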
BENCHMARK (small scale)
p<-c(2,3,5,7,11,13,17,19,23,29,31,37,41,43,47) # length(p)=15
library(dplyr)
library(data.table)
library(microbenchmark)
func_Sotos <- function(p){df <- expand.grid(p, p);df[df[,1]-df[,2] == 2,];}
func_m0h3n <- function(p){dif=which(abs(diff(p))==2);sapply(dif, function(x) c(p[x],p[x+1]));}
func_David_B <- function(p){data.frame(p) %>% mutate(lagp = lag(p)) %>% filter(p - lagp == 2)}
func_akrun1 <- function(p){setDT(list(p=p))[, p1 := shift(p)][p-p1 ==2];}
func_akrun2 <- function(p){unique(CJ(p=p, p1=p)[abs(p-p1)==2][.(p=pmin(p,p1), p1=pmax(p, p1))]);}
func_RHertel1 <- function(p){d2_mat <- which(as.matrix(dist(p))==2, arr.ind=TRUE);unique(t(apply(cbind(p[d2_mat[,1]],p[d2_mat[,2]]),1,sort)));}
func_RHertel2 <- function(p){m2 <- t(combn(sort(p),2));m2[abs(m2[,1] - m2[,2]) == 2,];}
func_RHertel3 <- function(p){d2 <- as.matrix(dist(p));d2[lower.tri(d2)] <- 0;idx <- which(d2 == 2, arr.ind=TRUE);cbind(p[idx[,1]], p[idx[,2]]);}
func_Tomas.H <- function(p) {a<-which(p-lag(p)==2);b<-a-1;df<-data.frame(pair1=p[b],pair2=p[a]);df;}
func_Arun.kumar.mahesh <- function(p) {
j<-c()
for(i in 1:length(p)){
if(sum(p[i]-p[i+1],na.rm=T)==-2){
j[i] <- paste(p[i],p[i+1],sep=",")
}
}
j <- j[!is.na(j)]
}
microbenchmark(func_Sotos(p), func_m0h3n(p), func_David_B(p), func_akrun1(p), func_akrun2(p), func_RHertel1(p), func_RHertel2(p), func_RHertel3(p), func_Tomas.H(p), func_Arun.kumar.mahesh(p))
Unit: microseconds
expr min lq mean median uq max neval
func_Sotos(p) 403.770 455.9520 470.6952 469.6390 485.4640 594.961 100
func_m0h3n(p) 72.713 92.8155 125.7504 98.8040 104.7920 2622.790 100
func_David_B(p) 1986.340 2148.2335 2260.4203 2207.0450 2292.1615 5547.553 100
func_akrun1(p) 1321.233 1404.2110 1472.6807 1464.3060 1504.7255 1872.566 100
func_akrun2(p) 2524.414 2623.2185 2777.9167 2700.2080 2816.5485 5595.885 100
func_RHertel1(p) 1160.838 1230.5560 1349.9502 1267.7680 1328.7185 4133.504 100
func_RHertel2(p) 249.362 281.2270 298.3233 296.1975 308.3880 562.027 100
func_RHertel3(p) 159.968 182.8515 204.4812 203.1675 223.6985 304.538 100
func_Tomas.H(p) 275.453 316.0865 337.7593 334.6925 350.7320 646.716 100
func_Arun.kumar.mahesh(p) 123.184 142.2175 174.5620 148.4200 158.0435 2579.163 100
BENCHMARK (medium scale)
set.seed(100)
p=sample(1000) # length(p)=1000
microbenchmark(func_Sotos(p), func_m0h3n(p), func_David_B(p), func_akrun1(p), func_akrun2(p), func_RHertel1(p), func_RHertel2(p), func_RHertel3(p), func_Tomas.H(p), func_Arun.kumar.mahesh(p))
Unit: microseconds
expr min lq mean median uq max neval
func_Sotos(p) 30711.250 35060.8410 53640.60456 64290.0265 69224.6310 98474.248 100
func_m0h3n(p) 41.465 68.9580 88.75608 83.5305 102.1600 196.808 100
func_David_B(p) 854.835 1067.1160 1220.68932 1150.1960 1261.5205 3934.944 100
func_akrun1(p) 524.319 748.9200 830.18763 811.5670 896.2995 1549.519 100
func_akrun2(p) 12986.877 17372.2235 34010.07038 21836.1435 52173.1590 58796.699 100
func_RHertel1(p) 76813.429 107942.6315 112380.30785 115049.1765 119579.6505 163399.316 100
func_RHertel2(p) 280275.495 297188.4505 307531.70976 304330.0005 314177.5760 360689.445 100
func_RHertel3(p) 45957.354 85348.1045 103999.44879 113351.6765 118847.8575 170738.875 100
func_Tomas.H(p) 154.742 212.4325 263.66812 260.8075 295.0610 536.037 100
func_Arun.kumar.mahesh(p) 972.619 1072.5250 1192.35206 1152.4500 1238.9850 2483.979 100
There is a better way than this, but here is an idea with expand.grid,
df <- expand.grid(p, p)
unname(apply(df[df[,1]-df[,2] == -2,], 1,paste, collapse = ','))
#[1] "3,5" "5,7" "11,13" "17,19" "29,31" "41,43"
If you want a data frame then simply,
df[df[,1]-df[,2] == 2,]
# Var1 Var2
#18 5 3
#34 7 5
#66 13 11
#98 19 17
#146 31 29
#194 43 41
If the desired outcome is a data frame, then try this:
library(dplyr)  # for lag()
p <- c(2,3,5,7,11,13,17,19,23,29,31,37,41,43,47)
a <- which(p - lag(p) == 2)
b <- a - 1
df <- data.frame(pair1 = p[b],
                 pair2 = p[a])
If you want a vector back, this should work:
res<-NULL
for (i in a){
res<-c(res,p[i-1],p[i])
}
You could do this using dplyr, which will return the pairs in a data frame:
> library(dplyr)
> data.frame(p) %>% mutate(lagp = lag(p)) %>% filter(p - lagp == 2)
p lagp
1 5 3
2 7 5
3 13 11
4 19 17
5 31 29
6 43 41
Here is another using data.table
library(data.table)
setDT(list(p=p))[, p1 := shift(p)][p-p1 ==2]
# p p1
#1: 5 3
#2: 7 5
#3: 13 11
#4: 19 17
#5: 31 29
#6: 43 41
If the vector p is not ordered, order it before doing the operation.
setDT(list(p=p))[order(p)][, p1 := shift(p)][p-p1==2]
Update
Using the new vector provided by #RHertel
p <- c(2, 3, 4, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47)
unique(CJ(p=p, p1=p)[abs(p-p1)==2][.(p=pmin(p,p1), p1=pmax(p, p1))])
# p p1
#1: 2 4
#2: 3 5
#3: 5 7
#4: 11 13
#5: 17 19
#6: 29 31
#7: 41 43
Kind of hacky, but here's another way.
d2_mat <- which(as.matrix(dist(p))==2, arr.ind=TRUE)
unique(t(apply(cbind(p[d2_mat[,1]],p[d2_mat[,2]]),1,sort)))
# [,1] [,2]
#[1,] 3 5
#[2,] 5 7
#[3,] 11 13
#[4,] 17 19
#[5,] 29 31
#[6,] 41 43
In contrast to some of the other answers, this does not require any specific order of the numbers in the vector p.
A vectorized version of the same could be:
d2 <- as.matrix(dist(p))
d2[lower.tri(d2)] <- 0
idx <- which(d2 == 2, arr.ind=TRUE)
cbind(p[idx[,1]], p[idx[,2]])
In the last line, instead of cbind(), one could also use paste(), depending on the desired output:
paste(p[idx[,1]], p[idx[,2]], sep=",")
#[1] "3,5" "5,7" "11,13" "17,19" "29,31" "41,43"
The following variant is simpler and probably (much) faster than my previous suggestions.
m2 <- t(combn(sort(p),2))
m2[abs(m2[,1] - m2[,2]) == 2,]
This version, too, finds all pairs of values that are 2 units apart within any integer vector.
Here's an example:
p <- c(13, 19, 43, 29, 47, 17, 7, 37, 2, 41, 3, 4, 31, 11, 5, 23)
# [,1] [,2]
#[1,] 2 4
#[2,] 3 5
#[3,] 5 7
#[4,] 11 13
#[5,] 17 19
#[6,] 29 31
#[7,] 41 43
The output can be modified, if desired, by using:
m2 <- t(combn(sort(p), 2))
m2 <- m2[abs(m2[,1] - m2[,2]) == 2,]
paste(m2[,1], m2[,2], sep=",")
#[1] "2,4" "3,5" "5,7" "11,13" "17,19" "29,31" "41,43"
Use the length function instead of seq to get the desired output:
j<-c()
for(i in 1:length(p)){
if(sum(p[i]-p[i+1],na.rm=T)==-2){
j[i] <- paste(p[i],p[i+1],sep=",")
}
}
j <- j[!is.na(j)]
print(j)
[1] "3,5" "5,7" "11,13" "17,19" "29,31" "41,43"
