Closest other Value in the same Vector - r

I have a vector
set.seed(2)
x <- sample.int(20, 5)
[1] 4 14 11 3 16
Now, for every element I want to find
the element with the minimum distance (min(abs(x[i]-x[-i])) for element i), which here would be
[1] 3 16 14 4 14
the (first) index of the element with the minimum distance, which here would be
[1] 4 5 2 1 2
The point is that the element itself is not considered, only all the other elements, which is why the question "R - Fastest way to find nearest value in vector" is not the answer.
If the actual answer is out there, sorry - I didn't find it.
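For reference, here is a direct (if quadratic) translation of this definition that returns both the nearest value and its first index; it is only a naive sketch, not from the original post, useful for cross-checking the faster answers below:
x <- c(4, 14, 11, 3, 16)
t(sapply(seq_along(x), function(i) {
  j <- which.min(abs(x[i] - x[-i]))   # position among the other elements
  if (j >= i) j <- j + 1              # map back to a position in x
  c(value = x[j], index = j)
}))
#      value index
# [1,]     3     4
# [2,]    16     5
# [3,]    14     2
# [4,]     4     1
# [5,]    14     2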

1) Rfast Using dista in Rfast we get the indexes of the two closest values. Take the second closest, as the closest will be the element itself.
library(Rfast)
x <- c(4, 14, 11, 3, 16) # input
x[ dista(x, x, k = 2, index = TRUE)[, 2] ]
## [1] 3 16 14 4 14
2) sqldf Using SQL we can left join DF to itself, excluding the same value, and take the row with the minimum distance.
DF <- data.frame(x) # x is from (1)
sqldf("select a.x, b.x nearest, min(abs(a.x - b.x))
from DF a
left join DF b on a.x != b.x
group by a.rowid")[1:2]
giving:
x nearest
1 4 3
2 14 16
3 11 14
4 3 4
5 16 14
3) zoo Sort the input, take the value corresponding to the least difference on either side of each element, and order it back.
library(zoo)
ix <- order(x)
least <- function(x) if (x[2] - x[1] < x[3] - x[2]) x[1] else x[3]
rollapply(c(-Inf, x[ix], Inf), 3, least)[order(ix)]
## [1] 3 16 14 4 14
4) Base R Using ix and least from (3) we can mimic (3) using only base functions as follows.
apply(embed(c(-Inf, x[ix], Inf), 3)[, 3:1], 1, least)[order(ix)]
## [1] 3 16 14 4 14
4a) This slightly shorter variation would also work:
-apply(embed(-c(-Inf, x[ix], Inf), 3), 1, least)[order(ix)]
## [1] 3 16 14 4 14
4b) Simplifying further we have the following base solution where, again, ix is from (3):
xx <- x[ix]
x1 <- c(-Inf, xx[-length(xx)])
x2 <- c(xx[-1], Inf)
ifelse(xx - x1 < x2 - xx, x1, x2)[order(ix)]
## [1] 3 16 14 4 14
Duplicates
The example in the question had no duplicates, but if there were duplicates there is some question regarding the problem definition. For example, if we had c(1, 3, 4, 1), then looking at the first value, 1, there is another value exactly equal to it, so the closest value is 1. Another interpretation is that the closest value not equal to 1 should be returned, which in this case is 3. Of the solutions above, sqldf gives the closest value not equal to the current value, whereas the others give the closest value among the remaining values.
If we wanted the closest-not-equal interpretation for the solutions other than sqldf, we could use rle after ordering to compress the input down to unique values and then apply inverse.rle afterwards, as shown in this modified 4b:
x <- c(1, 3, 4, 1)
ix <- order(x)
r <- rle(x[ix])
xx <- r$values
x1 <- c(-Inf, xx[-length(xx)])
x2 <- c(xx[-1], Inf)
r$values <- ifelse(xx - x1 < x2 - xx, x1, x2)
inverse.rle(r)[order(ix)]
## [1] 3 4 3 3
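For comparison (an added check, not in the original answer), running the unmodified 4b code on the same input gives the other interpretation, the closest value among the remaining values, so the duplicate 1 is matched to the other 1:
x <- c(1, 3, 4, 1)
ix <- order(x)
xx <- x[ix]
x1 <- c(-Inf, xx[-length(xx)])
x2 <- c(xx[-1], Inf)
ifelse(xx - x1 < x2 - xx, x1, x2)[order(ix)]
## [1] 1 4 3 1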

I was very interested in this question and in the approaches suggested in the other responses, so I compared them with regard to their running time (and I added another approach using the package RANN). The code is appended below. TL;DR: The base R version 4b by G. Grothendieck was the most efficient, by a significant margin.
library(RANN)
library(zoo)
library(data.table)
library(Rfast)
library(sqldf)
# All functions take a vector as argument,
# and return the values of nearest neighbours (not their index)
# Using base R, by ThomasIsCoding
base_nn <- function(x) {
  d <- data.frame(`diag<-`(as.matrix(dist(x)), Inf))
  id <- unlist(Map(which.min, d))
  x[id]
}
# Using Rfast, by G. Grothendieck
rfast_nn <- function(x) {
  x[ dista(x, x, k = 2, index = TRUE)[, 2] ]
}
# Using sqldf, by G. Grothendieck
sqldf_nn <- function(x) {
  DF <- data.frame(x)
  unname(
    unlist(sqldf("select a.x, b.x nearest, min(abs(a.x - b.x))
                  from DF a
                  left join DF b on a.x != b.x
                  group by a.rowid")[2])
  )
}
# Using `zoo`, by G. Grothendieck
zoo_nn <- function(x) {
  ix <- order(x)
  least <- function(x) if (x[2] - x[1] < x[3] - x[2]) x[1] else x[3]
  rollapply(c(-Inf, x[ix], Inf), 3, least)[order(ix)]
}
# Using base R (v 4), by G. Grothendieck
base2_nn <- function(x) {
  ix <- order(x)
  least <- function(x) if (x[2] - x[1] < x[3] - x[2]) x[1] else x[3]
  apply(embed(c(-Inf, x[ix], Inf), 3)[, 3:1], 1, least)[order(ix)]
}
# Using base R (v 4a), by G. Grothendieck
base3_nn <- function(x) {
  ix <- order(x)
  least <- function(x) if (x[2] - x[1] < x[3] - x[2]) x[1] else x[3]
  -apply(embed(-c(-Inf, x[ix], Inf), 3), 1, least)[order(ix)]
}
# Using base R (v 4b), by G. Grothendieck
base4_nn <- function(x) {
  ix <- order(x)
  xx <- x[ix]
  x1 <- c(-Inf, xx[-length(xx)])
  x2 <- c(xx[-1], Inf)
  ifelse(xx - x1 < x2 - xx, x1, x2)[order(ix)]
}
# Using data.table, by IceCreamToucan
dt_nn <- function(x) {
  dt <- setkey(data.table(x), x)
  dt[dt, on = .(x > x), mult = 'first', lowx := i.x][, lowx := fcoalesce(lowx + .0, -Inf)]
  dt[dt, on = .(x < x), mult = 'last', highx := i.x][, highx := fcoalesce(highx + .0, Inf)]
  dt[, closex := fifelse(x - lowx < highx - x, lowx, highx)]
  unname(unlist(dt[, .(closex)]))
}
# Using RANN, by me
rann_nn <- function(x) {
  id <- RANN::nn2(as.matrix(x), k = 2)$nn.idx[, 2]
  x[id]
}
### Apply all methods
# Test that all have the same output:
x <- c(4, 14,11,3,16)
rann_nn(x)
# [1] 3 16 14 4 14
base_nn(x)
# [1] 3 16 14 4 14
rfast_nn(x)
# [1] 3 16 14 4 14
sqldf_nn(x)
# [1] 3 16 14 4 14
zoo_nn(x)
# [1] 3 16 14 4 14
base2_nn(x)
# [1] 3 16 14 4 14
base3_nn(x)
# [1] 3 16 14 4 14
base4_nn(x)
# [1] 3 16 14 4 14
dt_nn(x) # differently ordered because setkey() sorts the input
# [1] 4 3 14 16 14
# Compare running times
library(microbenchmark)
# Compare for N = 1000 elements
benchmark_data <- rnorm(1000)
microbenchmark(
rann_nn(benchmark_data),
base_nn(benchmark_data),
rfast_nn(benchmark_data),
sqldf_nn(benchmark_data),
zoo_nn(benchmark_data),
base2_nn(benchmark_data),
base3_nn(benchmark_data),
base4_nn(benchmark_data),
dt_nn(benchmark_data)
)
# Unit: microseconds
# expr min lq mean median uq max neval
# rann_nn(benchmark_data) 641.180 684.1975 776.5467 711.6680 775.3635 3822.023 100
# base_nn(benchmark_data) 166523.177 179240.8130 209471.1333 187633.0515 249740.8425 330864.712 100
# rfast_nn(benchmark_data) 45160.603 47032.5225 47681.0557 47594.0075 48308.8440 50579.839 100
# sqldf_nn(benchmark_data) 133916.594 138769.8175 143505.9315 140543.3250 143830.2765 211873.960 100
# zoo_nn(benchmark_data) 4359.359 4604.0275 5008.4291 4785.1515 5037.9705 14999.802 100
# base2_nn(benchmark_data) 1292.322 1407.4875 1747.8404 1462.7295 1588.1580 11297.321 100
# base3_nn(benchmark_data) 1263.644 1396.9210 1615.7495 1472.9940 1571.8575 11828.015 100
# base4_nn(benchmark_data) 119.543 146.1080 254.5075 178.1065 197.4265 7726.156 100
# dt_nn(benchmark_data) 5290.337 6580.6965 7111.1816 6892.3800 7351.3795 29469.815 100
# For N = 100000, leaving out the slowest versions (e.g., `base_nn()`
# no longer works because a distance matrix cannot be computed for
# N = 100000)
benchmark_data <- rnorm(100000)
microbenchmark(
rann_nn(benchmark_data),
zoo_nn(benchmark_data),
base2_nn(benchmark_data),
base3_nn(benchmark_data),
base4_nn(benchmark_data),
dt_nn(benchmark_data)
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# rann_nn(benchmark_data) 130.957025 141.02904 149.94052 148.60184 156.14506 271.1882 100
# zoo_nn(benchmark_data) 606.690004 673.88980 720.12545 717.51658 766.98190 886.4397 100
# base2_nn(benchmark_data) 142.554407 176.30358 198.58375 193.34812 212.33885 329.5470 100
# base3_nn(benchmark_data) 142.074126 168.78195 189.65122 184.45025 205.89414 287.0740 100
# base4_nn(benchmark_data) 9.354764 10.46687 17.22086 12.36354 14.22882 166.4758 100
# dt_nn(benchmark_data) 96.503882 104.06914 117.95408 108.20284 121.11428 247.2092 100

Here is a base R solution
d <- data.frame(`diag<-`(as.matrix(dist(x)),Inf))
ids <- unlist(Map(which.min,d))
val <- x[ids]
such that
> ids
X1 X2 X3 X4 X5
4 5 2 1 2
> val
[1] 3 16 14 4 14
DATA
x <- c(4, 14,11,3,16)

Option with a data.table non-equi-join
dt <- setkey(data.table(x), x)
dt[dt, on = .(x > x), mult = 'first', lowx := i.x][, lowx := fcoalesce(lowx + .0, -Inf)]
dt[dt, on = .(x < x), mult = 'last', highx := i.x][, highx := fcoalesce(highx + .0, Inf)]
dt[, closex := fifelse(x - lowx < highx - x, lowx, highx)]
dt[, .(x, closex)]
# x closex
# 1: 3 4
# 2: 4 3
# 3: 11 14
# 4: 14 16
# 5: 16 14
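As seen in the benchmark answer above, setkey() sorts the table, so the result comes back in x order rather than the input order. A small variation (an added sketch, not part of the original answer) carries an index column through the joins and restores the original ordering at the end:
dt <- data.table(x, orig = seq_along(x))
setkey(dt, x)
dt[dt, on = .(x > x), mult = 'first', lowx := i.x][, lowx := fcoalesce(lowx + .0, -Inf)]
dt[dt, on = .(x < x), mult = 'last', highx := i.x][, highx := fcoalesce(highx + .0, Inf)]
dt[, closex := fifelse(x - lowx < highx - x, lowx, highx)]
dt[order(orig), closex]
# [1]  3 16 14  4 14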

Related

Treat NAs as zeros when adding vectors

Adding two vectors is easy:
> c(1:5) + c(6:10)
[1] 7 9 11 13 15
But since adding any number to NA gives NA, this happens:
> c(1,NA,3:5)+c(6:10)
[1] 7 NA 11 13 15
How can I add two vectors where there may be some NAs, treating them as zeros? I need to get this result:
> c(1,NA,3:5)+c(6:10)
[1] 7 7 11 13 15
Any ideas on how to do this using {base} and not changing the NAs to zeros on the original vectors?
You can also use colSums or rowSums, e.g.:
rowSums(cbind(x, y), na.rm = T)
# [1] 7 7 11 13 15
colSums(rbind(x, y), na.rm = T)
# [1] 7 7 11 13 15
Benchmarks; surprisingly colSums works the fastest:
microbenchmark::microbenchmark(fn_replace(x, y),
                               fn_rowSums(x, y),
                               fn_colSums(x, y),
                               fn_coalesce(x, y))
# Unit: milliseconds
# expr min lq mean median uq max neval
# fn_replace(x, y) 121.4322 130.99067 174.1531 162.2454 183.1781 385.7348 100
# fn_rowSums(x, y) 143.0654 146.20815 172.5396 149.3953 179.0337 370.1625 100
# fn_colSums(x, y) 96.8848 99.46521 121.5916 106.8800 140.9279 298.1607 100
# fn_coalesce(x, y) 259.2923 310.16915 357.0241 326.1245 360.9110 595.9711 100
## Code to generate x, y and functions for benchmark:
fn_replace <- function(x, y) {
  replace(x, is.na(x), 0) + replace(y, is.na(y), 0)
}
fn_rowSums <- function(x, y) {
  rowSums(cbind(x, y), na.rm = T)
}
fn_colSums <- function(x, y) {
  colSums(rbind(x, y), na.rm = T)
}
fn_coalesce <- function(x, y) {
  dplyr::coalesce(x, rep(0, length(x))) +
    dplyr::coalesce(y, rep(0, length(y)))
}
n_rep <- 1e6
x <- as.numeric(rep(c(1, NA, 3:5, NA, NA, 5), n_rep))
y <- as.numeric(rep(c(NA, 6:9, NA, 3, 4), n_rep))
Maybe replace NA's with 0 and then add the vectors
x <- c(1,NA,3:5)
y <- c(6:10)
replace(x, is.na(x), 0) + replace(y, is.na(y), 0)
#[1] 7 7 11 13 15
We could try using coalesce() from the dplyr package:
require(dplyr)
x <- c(1,NA,3:5)
y <- c(6:10)
coalesce(x, rep(0, 5)) + coalesce(y, rep(0, 5))
coalesce(x, y) works by taking the first non-NA value at each position: the value from x if it is not NA, otherwise the value from y, e.g.
x rep(0, 5) => result
1 0 1
NA 0 0
3 0 3
4 0 4
5 0 5
Instead of base::replace() and dplyr::coalesce() as above, we can also use tidyr::replace_na():
library(tidyr)
replace_na(x, 0) + replace_na(y, 0)
#[1] 7 7 11 13 15
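For more than two vectors the same idea extends naturally; a small sketch (not from the original answers) using Reduce over a list of vectors:
vecs <- list(c(1, NA, 3:5), 6:10, c(NA, 1, NA, 1, NA))
Reduce(`+`, lapply(vecs, function(v) replace(v, is.na(v), 0)))
# [1]  7  8 11 14 15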

Define the value of a column in a dataframe based on 2 keys from a different dataframe

I have the following dataframe:
a <- seq(0, 5, by = 0.25)
b <-seq(0, 20, by = 1)
df <- data.frame(a, b)
and I'd like to create a new column "value", based on columns a and b, and the conversion table below:
a_min <- c(0,2, 0,2)
a_max <- c(2,5,2,5)
b_min <- c(0,0,10,10)
b_max <- c(10,10,30,30)
output <-c(1,2,3,4)
conv <- data.frame(a_min, a_max, b_min, b_max, output)
I've tried to do it using dplyr::mutate without much success...
require(dplyr)
mutate(df, value = calcula(conv, a, b))
longer object length is not a multiple of shorter object length
My expectation would be to obtain a dataframe like the 'df' above with the additional column value as per below:
df$value <- c(rep(1,8), rep(2,2), rep(4,11))
A relatively simple and very efficient data.table solution using binary non-equi joins:
library(data.table) # v1.10.0
setDT(conv)[setDT(df), output, on = .(a_min <= a, a_max >= a, b_min <= b, b_max >= b)]
## [1] 1 1 1 1 1 1 1 1 1 2 2 2 4 4 4 4 4 4 4 4 4 4 4
As a side note, if the output column is just the row index within conv, you could make this join even more efficient by asking only for the row indices, specifying which = TRUE
setDT(conv)[setDT(df), on = .(a_min <= a, a_max >= a, b_min <= b, b_max >= b), which = TRUE]
## [1] 1 1 1 1 1 1 1 1 1 2 2 2 4 4 4 4 4 4 4 4 4 4 4
One more option, this time with matrices.
with(df, with(conv, output[max.col(
outer(a, a_min, `>=`) + outer(a, a_max, `<=`) +
outer(b, b_min, `>=`) + outer(b, b_max, `<=`))]))
## [1] 1 1 1 1 1 1 1 1 1 2 2 4 4 4 4 4 4 4 4 4 4
outer compares each element of the vector from df with each one from conv, producing a matrix of Booleans for each call. Since TRUE is 1, if you add all four matrices, the index you want will be the column with the most TRUEs, which you can get with max.col. Subset output, and you've got your result.
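To make the intermediate step explicit (an illustrative expansion of the same computation, not part of the original answer), using df and conv from the question:
# Each outer() call yields an nrow(df) x nrow(conv) logical matrix; summing
# the four of them counts how many conditions hold, so max.col() picks the
# conv row where all four are TRUE (score 4) for each row of df.
score <- outer(df$a, conv$a_min, `>=`) + outer(df$a, conv$a_max, `<=`) +
  outer(df$b, conv$b_min, `>=`) + outer(df$b, conv$b_max, `<=`)
dim(score)                  # 21 x 4 for the example data
conv$output[max.col(score)]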
The benefit of working with matrices is that they're fast. Using @Phann's benchmarks on 1,000 rows:
Unit: microseconds
expr min lq mean median uq max neval cld
alistaire 276.099 320.4565 349.1045 339.8375 357.2705 941.551 100 a
akr1 830.934 966.6705 1064.8433 1057.6610 1152.3565 1507.180 100 ab
akr2 11431.246 11731.3125 12835.5229 11947.5775 12408.4715 36767.488 100 d
Pha 11985.129 12403.1095 13330.1465 12660.4050 13044.9330 29653.842 100 d
Ron 71132.626 74300.3540 81136.9408 78034.2275 88952.8765 98950.061 100 e
Dav1 2506.205 2765.4095 2971.6738 2948.6025 3082.4025 4065.368 100 c
Dav2 2104.481 2272.9180 2480.9570 2478.8775 2575.8740 3683.896 100 bc
and on 100,000 rows:
Unit: milliseconds
expr min lq mean median uq max neval cld
alistaire 30.00677 36.49348 44.28828 39.43293 54.28207 64.36581 100 a
akr1 36.24467 40.04644 48.46986 41.59644 60.15175 77.34415 100 a
Dav1 51.74218 57.23488 67.70289 64.11002 68.86208 382.25182 100 c
Dav2 48.48227 54.82818 60.25256 59.81041 64.92611 91.20212 100 b
We can try using Map together with na.locf:
library(zoo)
f1 <- function(u, v, x, y, z) z * NA^!((with(df, a >= u & a < v) & (b >= x & b < y)))
na.locf(do.call(pmax, c(do.call(Map, c(f = f1, unname(conv))), na.rm = TRUE)))
#[1] 1 1 1 1 1 1 1 1 2 2 4 4 4 4 4 4 4 4 4 4 4
Another way to write the Map solution is to pass the 'a' and 'b' columns as arguments, do the logical evaluation against the columns of 'conv' to extract the 'output' value, and unlist the list output:
unlist(Map(function(x, y)
  with(conv, output[x >= a_min & a_max > x & y >= b_min & b_max > y]),
  df$a, df$b))
#[1] 1 1 1 1 1 1 1 1 2 2 4 4 4 4 4 4 4 4 4 4
NOTE: The second solution should be slower, as it loops through the rows of the dataset, while the first solution only loops through the rows of 'conv' (of which we assume there are not many).
Another approach using apply:
df$value <- unlist(apply(df, 1, function(x){
  ifelse(length(OUT <- output[which(x[1] >= a_min & x[1] <= a_max & x[2] >= b_min & x[2] <= b_max)]) > 0, OUT, 0)
}))
EDIT:
Because there are several answers so far, I checked the time needed to process the data. I created a somewhat bigger example (similar to the given one, with random numbers):
set.seed(23563)
a <- runif(1000, 0, 5)
b <- runif(1000, 0, 20)
df <- data.frame(a, b)
require(microbenchmark)
library(zoo)
require(data.table)
microbenchmark(
  akr1 = { # akrun 1
    f1 <- function(u, v, x, y, z) z * NA^!((with(df, a >= u & a < v) & (b >= x & b < y)))
    na.locf(do.call(pmax, c(do.call(Map, c(f = f1, unname(conv))), na.rm = TRUE)))
  },
  akr2 = { # akrun 2
    unlist(Map(function(x, y)
      with(conv, output[x >= a_min & a_max > x & y >= b_min & b_max > y]),
      df$a, df$b))
  },
  Pha = { # Phann
    df$value <- unlist(apply(df, 1, function(x){
      ifelse(length(OUT <- output[which(x[1] >= a_min & x[1] <= a_max & x[2] >= b_min & x[2] <= b_max)]) > 0, OUT, 0)
    }))
  },
  Ron = { # Ronak Shah
    unlist(mapply(function(x, y)
      conv$output[x >= conv$a_min & conv$a_max > x & y >= conv$b_min & conv$b_max > y],
      df$a, df$b))
  },
  Dav1 = { # David Arenburg 1
    setDT(conv)[setDT(df), on = .(a_min <= a, a_max >= a, b_min <= b, b_max >= b)]$output
  },
  Dav2 = { # David Arenburg 2
    setDT(conv)[setDT(df), on = .(a_min <= a, a_max >= a, b_min <= b, b_max >= b), which = TRUE]
  },
  times = 100L
)
With 1000 random numbers:
# Unit: milliseconds
# expr min lq mean median uq max neval
# akr1 4.267206 4.749576 6.259695 5.351494 6.843077 54.39187 100
# akr2 33.437853 39.912785 49.932875 47.416888 57.070369 91.55602 100
# Pha 30.433779 36.939692 48.205592 46.393800 55.800204 83.91640 100
# Ron 174.765021 199.648315 227.493117 223.314661 240.579057 370.26929 100
# Dav1 6.944759 7.814469 10.685460 8.536694 11.974102 44.47915 100
# Dav2 6.106978 6.706424 8.961821 8.161707 10.376085 28.91255 100
With 10000 random numbers (same seed), I get:
# Unit: milliseconds
# expr min lq mean median uq max neval
# akr1 23.48180 24.03962 26.16747 24.46897 26.19565 41.83238 100
# akr2 357.38290 398.69965 434.92052 409.15385 440.98210 829.85113 100
# Pha 320.39285 347.66632 376.98118 361.76852 383.08231 681.28500 100
# Ron 1661.50669 1788.06228 1873.70929 1837.28187 1912.04123 2499.23235 100
# Dav1 20.91486 21.60953 23.12278 21.94707 22.42773 44.71900 100
# Dav2 19.69506 20.22077 21.63715 20.55793 21.27578 38.96819 100
Here is another attempt to utilize findInterval's efficiency in both memory and speed. A more convenient format for the conv "data.frame" could be
(i) a "list" of the non-overlapping intervals for each variable:
vecs = list(a = unique(c(conv$a_min, conv$a_max)),
b = unique(c(conv$b_min, conv$b_max)))
vecs
#$a
#[1] 0 2 5
#
#$b
#[1] 0 10 30
and (ii) a lookup structure that contains the group for each pair of intervals of the two variables:
maps = xtabs(output ~ a_min + b_min)
maps
# b_min
#a_min 0 10
# 0 1 3
# 2 2 4
where, for example, we note that the first interval of "a" combined with the second interval of "b" is assigned a "3", etc.
Then we can use:
maps[mapply(findInterval, df, vecs, all.inside = TRUE)]
# [1] 1 1 1 1 1 1 1 1 2 2 4 4 4 4 4 4 4 4 4 4 4
And extending the benchmarks of Phann and alistaire (re-written, partly, for convenience):
n = 1e6
set.seed(23563); a = runif(n, 0, 5); b = runif(n, 0, 20); df = data.frame(a, b)
library(microbenchmark); library(zoo); library(data.table)
alistaire = function() {
  with(df, with(conv, output[max.col(
    outer(a, a_min, `>=`) + outer(a, a_max, `<=`) +
    outer(b, b_min, `>=`) + outer(b, b_max, `<=`))]))
}
david = function() {
  as.data.table(conv)[setDT(df), output, on = .(a_min <= a, a_max >= a, b_min <= b, b_max >= b)]
}
akrun = function() {
  f1 = function(u, v, x, y, z) z * NA^!((with(df, a >= u & a < v) & (b >= x & b < y)))
  na.locf(do.call(pmax, c(do.call(Map, c(f = f1, unname(conv))), na.rm = TRUE)))
}
alex = function() {
  vecs = list(a = unique(c(conv$a_min, conv$a_max)), b = unique(c(conv$b_min, conv$b_max)))
  maps = xtabs(output ~ a_min + b_min)
  maps[mapply(findInterval, df, vecs, all.inside = TRUE)]
}
identical(alistaire(), david())
#[1] TRUE
identical(david(), akrun())
#[1] TRUE
identical(akrun(), alex())
#[1] TRUE
microbenchmark(alistaire(), david(), akrun(), alex(), times = 20)
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# alistaire() 592.46700 718.07148 799.28933 792.98107 860.16414 1136.4489 20 b
# david() 1363.76196 1375.43935 1398.53515 1385.11747 1425.69837 1457.1693 20 d
# akrun() 824.11962 850.88831 903.58723 906.21007 958.04310 995.2129 20 c
# alex() 70.82439 72.65993 82.87961 76.77627 81.20356 179.7669 20 a
We can use mapply on the two variables a and b and find the correct output value based on the range:
unlist(mapply(function(x, y)
conv$output[x >= conv$a_min & conv$a_max > x & y >= conv$b_min & conv$b_max > y],
df$a, df$b))
#[1] 1 1 1 1 1 1 1 1 2 2 4 4 4 4 4 4 4 4 4 4

Most efficient way to multiply a data frame by a vector

What would be the most efficient way to multiply each column of a data frame by a vector?
e.g. data frame (df) has the following columns (col1, col2, col3, col4) and vector (v) has the following elements (v1,v2,v3).
I want the output to be: col2*v1, col3*v2, col4*v3
I've been trying df[c(2:4)] * c(v1,v2,v3) but it seems like the elements of the vector are not multiplying every single row of each column.
You could try (using df and v from Richard Scriven's answer):
df[-1] <- t(t(df[-1]) * v)
df
# a x y z
# 1 a 5 40 105
# 2 b 10 50 120
# 3 c 15 60 135
When you multiply a matrix by a vector, it multiplies columnwise. Since you want to multiply your rows by the vector, we transpose df[-1] using t, multiply by v, and transpose back using t.
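A small illustration of that recycling behaviour (an added example, not from the original answer):
m <- matrix(1:6, nrow = 2)    # 2 x 3
m * c(10, 100)                # the vector recycles down the columns, scaling rows
#      [,1] [,2] [,3]
# [1,]   10   30   50
# [2,]  200  400  600
t(t(m) * c(10, 100, 1000))    # double transpose: column j is scaled by v[j]
#      [,1] [,2] [,3]
# [1,]   10  300 5000
# [2,]   20  400 6000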
It seems like this approach has a slight edge in benchmarking over the Map approach, and a significant advantage over sweep:
library(microbenchmark)
rscriven <- function(df, v) cbind(df[1], Map(`*`, df[-1], v))
josilber <- function(df, v) cbind(df[1], t(t(df[-1]) * v))
dardisco <- function(df, v) cbind(df[1], sweep(df[-1], MARGIN=2, STATS=v, FUN="*"))
df2 <- cbind(data.frame(rep("a", 1000)), matrix(rnorm(100000), nrow=1000))
v2 <- rnorm(100)
all.equal(rscriven(df2, v2), josilber(df2, v2))
# [1] TRUE
all.equal(rscriven(df2, v2), dardisco(df2, v2))
# [1] TRUE
microbenchmark(rscriven(df2, v2), josilber(df2, v2), dardisco(df2, v2))
# Unit: milliseconds
# expr min lq median uq max neval
# rscriven(df2, v2) 5.276458 5.378436 5.451041 5.587644 9.470207 100
# josilber(df2, v2) 2.545144 2.753363 3.099589 3.704077 8.955193 100
# dardisco(df2, v2) 11.647147 12.761184 14.196678 16.581004 132.428972 100
Thanks to @thelatemail for pointing out that the Map approach is a good deal faster for 100x larger data frames:
df2 <- cbind(data.frame(rep("a", 10000)), matrix(rnorm(10000000), nrow=10000))
v2 <- rnorm(1000)
microbenchmark(rscriven(df2, v2), josilber(df2, v2), dardisco(df2, v2))
# Unit: milliseconds
# expr min lq median uq max neval
# rscriven(df2, v2) 75.74051 90.20161 97.08931 115.7789 259.0855 100
# josilber(df2, v2) 340.72774 388.17046 498.26836 514.5923 623.4020 100
# dardisco(df2, v2) 928.81128 1041.34497 1156.39293 1271.4758 1506.0348 100
It seems like you'll need to benchmark to determine which approach is fastest for your application.
You can use Map for this. Here's an example
> ( df <- data.frame(a = letters[1:3], x = 1:3, y = 4:6, z = 7:9) )
# a x y z
# 1 a 1 4 7
# 2 b 2 5 8
# 3 c 3 6 9
> v <- c(5, 10, 15)
> cbind(df[1], Map(`*`, df[-1], v))
# a x y z
# 1 a 5 40 105
# 2 b 10 50 120
# 3 c 15 60 135
In this example,
column x is multiplied by v[1] (5)
column y is multiplied by v[2] (10)
column z is multiplied by v[3] (15)
cbind is used to attach the unused column a to the columns we operated on
Not as fast, but more flexible:
sweep(df[-1], MARGIN=2, STATS=v, FUN="*")
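The flexibility comes from sweep()'s MARGIN and FUN arguments; for instance, the same pattern subtracts a per-column offset instead (an illustrative variation, not from the original answer):
sweep(df[-1], MARGIN = 2, STATS = v, FUN = "-")   # subtract v[j] from column j
#    x  y  z
# 1 -4 -6 -8
# 2 -3 -5 -7
# 3 -2 -4 -6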
A simple apply() call can also be used here, operating row-wise:
df[-1]= (t(apply(df[-1],1, FUN=function(x)x*v)))
df
a x y z
1 a 5 40 105
2 b 10 50 120
3 c 15 60 135

Numeric comparison during merge in R

Dataframe d1:
x y
4 10
6 20
7 30
Dataframe d2:
x z
3 100
6 200
9 300
How do I merge d1 and d2 by "x", where d1$x should be matched against an exact match or the next higher number in d2$x? The output should look like:
x y z
4 10 200 # (4 is matched against next higher value that is 6)
6 20 200 # (6 is matched against 6)
7 30 300 # (7 is matched against next higher value that is 9)
If merge() cannot do this, then is there any other way to do this? For loops are painfully slow.
This is pretty straightforward using rolling joins with data.table:
require(data.table) ## >= 1.9.2
setkey(setDT(d1), x) ## convert to data.table, set key for the column to join on
setkey(setDT(d2), x) ## same as above
d2[d1, roll=-Inf]
# x z y
# 1: 4 200 10
# 2: 6 200 20
# 3: 7 300 30
Input data:
d1 <- data.frame(x=c(4,6,7), y=c(10,20,30))
d2 <- data.frame(x=c(3,6,9), z=c(100,200,300))
You basically wish to extend d1 by a new column. So let's copy it.
d3 <- d1
Next I assume that d2$x is sorted nondecreasingly and that max(d1$x) <= max(d2$x).
d3$z <- sapply(d1$x, function(x) d2$z[which(x <= d2$x)[1]])
Which reads: for each x in d1$x, take the z value at the smallest d2$x which is not smaller than x.
Under these assumptions, the above may also be written as (and should be a bit faster):
d3$z <- sapply(d1$x, function(x) d2$z[which.max(x <= d2$x)])
In result we get:
d3
## x y z
## 1 4 10 200
## 2 6 20 200
## 3 7 30 300
EDIT1: Inspired by @MatthewLundberg's cut-based solution, here's another one using findInterval:
d3$z <- d2$z[findInterval(d1$x, d2$x+1)+1]
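A quick check of that one-liner on the example data (an added verification, not in the original answer):
d1 <- data.frame(x = c(4, 6, 7), y = c(10, 20, 30))
d2 <- data.frame(x = c(3, 6, 9), z = c(100, 200, 300))
# Shifting d2$x up by one turns "smallest d2$x that is >= x" into a plain
# interval lookup; the +1 shift is valid here because the x values are integers.
findInterval(d1$x, d2$x + 1)            # 1 1 2
d2$z[findInterval(d1$x, d2$x + 1) + 1]
# [1] 200 200 300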
EDIT2: (Benchmark)
Exemplary data:
set.seed(123)
d1 <- data.frame(x=sort(sample(1:10000, 1000)), y=sort(sample(1:10000, 1000)))
d2 <- data.frame(x=sort(c(sample(1:10000, 999), 10000)), z=sort(sample(1:10000, 1000)))
Results:
microbenchmark::microbenchmark(
{d3 <- d1; d3$z <- d2$z[findInterval(d1$x, d2$x+1)+1] },
{d3 <- d1; d3$z <- sapply(d1$x, function(x) d2$z[which(x <= d2$x)[1]]) },
{d3 <- d1; d3$z <- sapply(d1$x, function(x) d2$z[which.max(x <= d2$x)]) },
{d1$x2 <- d2$x[as.numeric(cut(d1$x, c(-Inf, d2$x, Inf)))]; merge(d1, d2, by.x='x2', by.y='x')},
{d1a <- d1; setkey(setDT(d1a), x); d2a <- d2; setkey(setDT(d2a), x); d2a[d1a, roll=-Inf] }
)
## Unit: microseconds
## expr min lq median uq max neval
## findInterval 221.102 1357.558 1394.246 1429.767 17810.55 100
## which 66311.738 70619.518 85170.175 87674.762 220613.09 100
## which.max 69832.069 73225.755 83347.842 89549.326 118266.20 100
## cut 8095.411 8347.841 8498.486 8798.226 25531.58 100
## data.table 1668.998 1774.442 1878.028 1954.583 17974.10 100
cut can be used to find the appropriate matches in d2$x for the values in d1$x.
The computation to find the matches with cut is as follows:
as.numeric(cut(d1$x, c(-Inf, d2$x, Inf)))
## [1] 2 2 3
These are the values:
d2$x[as.numeric(cut(d1$x, c(-Inf, d2$x, Inf)))]
[1] 6 6 9
These can be added to d1 and the merge performed:
d1$x2 <- d2$x[as.numeric(cut(d1$x, c(-Inf, d2$x, Inf)))]
merge(d1, d2, by.x='x2', by.y='x')
## x2 x y z
## 1 6 4 10 200
## 2 6 6 20 200
## 3 9 7 30 300
The added column may then be removed, if desired.
Try: sapply(d1$x, function(y) d2$z[d2$x >= y][which.min(abs(y - d2$x[d2$x >= y]))]) (using >= so that exact matches such as 6 are kept rather than skipped).

Piecewise linear transformation without for loop or nested ifelse

I'm trying to perform a piecewise linear transformation of my data. Here's an example table describing a transformation:
dat <- data.frame(x.low = 0:2, x.high = 1:3, y.low=c(0, 2, 3), y.high=c(2, 3, 10))
dat
# x.low x.high y.low y.high
# 1 0 1 0 2
# 2 1 2 2 3
# 3 2 3 3 10
If I defined x <- c(1.75, 2.5), I would expect transformed values 2.75 and 6.5 (my elements would be matched by rows 2 and 3 of dat, respectively).
I know how to solve this problem with a for loop, iterating through the rows of dat and transforming the corresponding values:
pw.lin.trans <- function(x, m) {
  out <- rep(NA, length(x))
  for (i in seq(nrow(m))) {
    matching <- x >= m$x.low[i] & x <= m$x.high[i]
    out[matching] <- m$y.low[i] + (x[matching] - m$x.low[i]) /
      (m$x.high[i] - m$x.low[i]) * (m$y.high[i] - m$y.low[i])
  }
  out
}
pw.lin.trans(x, dat)
# [1] 2.75 6.50
While this works, it strikes me there should be a better approach that matches x values to rows of dat and then performs all the interpolations in a single computation. Could somebody point me to a non-for-loop solution for this problem?
Try approx:
(xp <- unique(c(dat$x.low, dat$x.high)))
## [1] 0 1 2 3
(yp <- unique(c(dat$y.low, dat$y.high)))
## [1] 0 2 3 10
x <- c(1.75, 2.5)
approx(xp, yp, x)
## $x
## [1] 1.75 2.50
##
## $y
## [1] 2.75 6.50
or approxfun (which returns a new function):
f <- approxfun(xp, yp)
f(x)
## [1] 2.75 6.50
Some benchmarks:
set.seed(123L)
x <- runif(10000, min(xp), max(yp))
library(microbenchmark)
microbenchmark(
pw.lin.trans(x, dat),
approx(xp, yp, x)$y,
f(x)
)
## Unit: microseconds
## expr min lq median uq max neval
## pw.lin.trans(x, dat) 3364.241 3395.244 3614.0375 3641.7365 6170.268 100
## approx(xp, yp, x)$y 359.080 379.669 424.0895 453.6800 522.756 100
## f(x) 202.899 209.168 217.8715 232.3555 293.499 100
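For completeness, the row matching asked about in the question can also be done directly with findInterval and the interpolation vectorised in one step; a sketch assuming, as in the example, that dat's rows are sorted and contiguous (approx remains the simpler choice):
x <- c(1.75, 2.5)
i <- findInterval(x, dat$x.low)   # row of dat whose interval contains each x
dat$y.low[i] + (x - dat$x.low[i]) /
  (dat$x.high[i] - dat$x.low[i]) * (dat$y.high[i] - dat$y.low[i])
## [1] 2.75 6.50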
