Most efficient way to multiply a data frame by a vector

What would be the most efficient way to multiply each column of a data frame by a vector?
e.g. data frame (df) has the following columns (col1, col2, col3, col4) and vector (v) has the following elements (v1,v2,v3).
I want the output to be: col2*v1, col3*v2, col4*v3
I've been trying df[c(2:4)] * c(v1,v2,v3) but it seems like the elements of the vector are not multiplying every single row of each column.

You could try (using df and v from Richard Scriven's answer):
df[-1] <- t(t(df[-1]) * v)
df
# a x y z
# 1 a 5 40 105
# 2 b 10 50 120
# 3 c 15 60 135
When you multiply a matrix by a vector, it multiplies columnwise. Since you want to multiply your rows by the vector, we transpose df[-1] using t, multiply by v, and transpose back using t.
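For intuition, here is a minimal sketch (toy values, not from the question) of that recycling behaviour:
m <- matrix(1:6, nrow = 3)    # 3 rows, 2 columns
m * c(10, 100, 1000)          # the vector recycles down each column, i.e. it scales the rows
t(t(m) * c(10, 100))          # transposing first scales column 1 by 10 and column 2 by 100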
It seems like this approach has a slight edge in benchmarking over the Map approach, and a significant advantage over sweep:
library(microbenchmark)
rscriven <- function(df, v) cbind(df[1], Map(`*`, df[-1], v))
josilber <- function(df, v) cbind(df[1], t(t(df[-1]) * v))
dardisco <- function(df, v) cbind(df[1], sweep(df[-1], MARGIN=2, STATS=v, FUN="*"))
df2 <- cbind(data.frame(rep("a", 1000)), matrix(rnorm(100000), nrow=1000))
v2 <- rnorm(100)
all.equal(rscriven(df2, v2), josilber(df2, v2))
# [1] TRUE
all.equal(rscriven(df2, v2), dardisco(df2, v2))
# [1] TRUE
microbenchmark(rscriven(df2, v2), josilber(df2, v2), dardisco(df2, v2))
# Unit: milliseconds
# expr min lq median uq max neval
# rscriven(df2, v2) 5.276458 5.378436 5.451041 5.587644 9.470207 100
# josilber(df2, v2) 2.545144 2.753363 3.099589 3.704077 8.955193 100
# dardisco(df2, v2) 11.647147 12.761184 14.196678 16.581004 132.428972 100
Thanks to @thelatemail for pointing out that the Map approach is a good deal faster for 100x larger data frames:
df2 <- cbind(data.frame(rep("a", 10000)), matrix(rnorm(10000000), nrow=10000))
v2 <- rnorm(1000)
microbenchmark(rscriven(df2, v2), josilber(df2, v2), dardisco(df2, v2))
# Unit: milliseconds
# expr min lq median uq max neval
# rscriven(df2, v2) 75.74051 90.20161 97.08931 115.7789 259.0855 100
# josilber(df2, v2) 340.72774 388.17046 498.26836 514.5923 623.4020 100
# dardisco(df2, v2) 928.81128 1041.34497 1156.39293 1271.4758 1506.0348 100
It seems like you'll need to benchmark to determine which approach is fastest for your application.

You can use Map for this. Here's an example
> ( df <- data.frame(a = letters[1:3], x = 1:3, y = 4:6, z = 7:9) )
# a x y z
# 1 a 1 4 7
# 2 b 2 5 8
# 3 c 3 6 9
> v <- c(5, 10, 15)
> cbind(df[1], Map(`*`, df[-1], v))
# a x y z
# 1 a 5 40 105
# 2 b 10 50 120
# 3 c 15 60 135
In this example,
- column x is multiplied by v[1] (5)
- column y is multiplied by v[2] (10)
- column z is multiplied by v[3] (15)
- cbind is used to attach the unused column a to the columns we operated on

Not as fast, but more flexible:
sweep(df[-1], MARGIN=2, STATS=v, FUN="*")
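As a sketch of that flexibility (using df and v from above), the same call can swap in a different function or operate over rows instead:
sweep(df[-1], MARGIN=2, STATS=v, FUN="/")          # divide each column by v instead of multiplying
sweep(df[-1], MARGIN=1, STATS=c(2, 3, 4), FUN="*") # or scale each row by a per-row vector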

A simple apply() over the rows can also be used here:
df[-1] <- t(apply(df[-1], 1, function(x) x * v))
df
a x y z
1 a 5 40 105
2 b 10 50 120
3 c 15 60 135

Related

Closest other Value in the same Vector

I have a vector
set.seed(2)
x <- sample.int(20, 5)
[1] 4 14 11 3 16
Now, for every element I want to find
the element with the minimum distance (min(abs(x[i]-x[-i])) for element i), which here would be
[1] 3 16 14 4 14
the (first) index of the element with the minimum distance, which here would be
[1] 4 5 2 1 2
The point is that the element itself is not considered, but only all the other elements, which is why this question (R - Fastest way to find nearest value in vector) is not the answer.
If the actual answer is out there, sorry - I didn't find it.
1) Rfast: Using dista in Rfast we get the indexes of the closest two. Take the second closest, as the closest will be the element itself.
library(Rfast)
x <- c(4, 14, 11, 3, 16) # input
x[ dista(x, x, k = 2, index = TRUE)[, 2] ]
## [1] 3 16 14 4 14
2) sqldf: Using SQL we can left join DF to itself, excluding rows with the same value, and take the row with the minimum distance.
DF <- data.frame(x) # x is from (1)
sqldf("select a.x, b.x nearest, min(abs(a.x - b.x))
from DF a
left join DF b on a.x != b.x
group by a.rowid")[1:2]
giving:
x nearest
1 4 3
2 14 16
3 11 14
4 3 4
5 16 14
3) zoo: Sort the input, take the value corresponding to the least difference on either side of each element, and order it back.
library(zoo)
ix <- order(x)
least <- function(x) if (x[2] - x[1] < x[3] - x[2]) x[1] else x[3]
rollapply(c(-Inf, x[ix], Inf), 3, least)[order(ix)]
## [1] 3 16 14 4 14
4) Base R: Using ix and least from (3) we can mimic (3) using only base functions as follows.
apply(embed(c(-Inf, x[ix], Inf), 3)[, 3:1], 1, least)[order(ix)]
## [1] 3 16 14 4 14
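As a side note on how (4) works, embed() lays the sliding windows out as rows with the columns in reverse order (hence the [, 3:1], so that least sees left neighbour, value, right neighbour). A quick look with the sorted, padded values from this example:
embed(c(-Inf, 3, 4, 11, 14, 16, Inf), 3)
##      [,1] [,2] [,3]
## [1,]    4    3 -Inf
## [2,]   11    4    3
## [3,]   14   11    4
## [4,]   16   14   11
## [5,]  Inf   16   14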
4a) This slightly shorter variation would also work:
-apply(embed(-c(-Inf, x[ix], Inf), 3), 1, least)[order(ix)]
## [1] 3 16 14 4 14
4b) Simplifying further we have the following base solution where, again, ix is from (3):
xx <- x[ix]                      # values in sorted order
x1 <- c(-Inf, xx[-length(xx)])   # left (smaller) neighbour of each sorted value
x2 <- c(xx[-1], Inf)             # right (larger) neighbour
ifelse(xx - x1 < x2 - xx, x1, x2)[order(ix)]  # pick the closer neighbour, then restore the original order
## [1] 3 16 14 4 14
Duplicates
The example in the question has no duplicates, but if there were duplicates there is some question about the problem definition. For example, if we had c(1, 3, 4, 1), then looking at the first value, 1, there is another value exactly equal to it, so the closest value is 1. Another interpretation is that the closest value not equal to 1 should be returned, which in this case is 3. In the code above, the sqldf solution gives the closest value not equal to the current value, whereas the others give the closest value among the remaining values.
If we wanted the closest-not-equal interpretation for the solutions other than sqldf, we could use rle after ordering to compress the input down to unique values and then apply inverse.rle afterwards, as shown in this modification of (4b):
x <- c(1, 3, 4, 1)
ix <- order(x)
r <- rle(x[ix])
xx <- r$values
x1 <- c(-Inf, xx[-length(xx)])
x2 <- c(xx[-1], Inf)
r$values <- ifelse(xx - x1 < x2 - xx, x1, x2)
inverse.rle(r)[order(ix)]
## [1] 3 4 3 3
I was very interested in this question and in the approaches suggested in the other responses, so I compared them with regard to their running time (and I added another approach using the package RANN). The code is appended below. TL;DR: The base R version 4b by user G. Grothendieck was most efficient, and by a significant margin.
library(RANN)
library(zoo)
library(data.table)
library(Rfast)
library(sqldf)
# All functions take a vector as argument,
# and return the values of nearest neighbours (not their index)
# Using base R, by ThomasIsCoding
base_nn <- function(x) {
d <- data.frame(`diag<-`(as.matrix(dist(x)),Inf))
id <- unlist(Map(which.min,d))
x[id]
}
# Using Rfast, by G. Grothendieck
rfast_nn <- function(x) {
x[ dista(x, x, k = 2, index = TRUE)[, 2] ]
}
# Using sqldf, by G. Grothendieck
sqldf_nn <- function(x) {
DF <- data.frame(x) # x is from (1)
unname(
unlist(sqldf("select a.x, b.x nearest, min(abs(a.x - b.x))
from DF a
left join DF b on a.x != b.x
group by a.rowid")[2])
)
}
# Using `zoo`, by G. Grothendieck
zoo_nn <- function(x) {
ix <- order(x)
least <- function(x) if (x[2] - x[1] < x[3] - x[2]) x[1] else x[3]
rollapply(c(-Inf, x[ix], Inf), 3, least)[order(ix)]
}
# Using base R (v 4), by G. Grothendieck
base2_nn <- function(x) {
ix <- order(x)
least <- function(x) if (x[2] - x[1] < x[3] - x[2]) x[1] else x[3]
apply(embed(c(-Inf, x[ix], Inf), 3)[, 3:1], 1, least)[order(ix)]
}
# Using base R (v 4a), by G. Grothendieck
base3_nn <- function(x) {
ix <- order(x)
least <- function(x) if (x[2] - x[1] < x[3] - x[2]) x[1] else x[3]
-apply(embed(-c(-Inf, x[ix], Inf), 3), 1, least)[order(ix)]
}
# Using base R (v 4b), by G. Grothendieck
base4_nn <- function(x) {
ix <- order(x)
xx <- x[ix]
x1 <- c(-Inf, xx[-length(xx)])
x2 <- c(xx[-1], Inf)
ifelse(xx - x1 < x2 - xx, x1, x2)[order(ix)]
}
# Using data.table, by IceCreamToucan
dt_nn <- function(x) {
dt <- setkey(data.table(x), x)
dt[dt, on = .(x > x), mult = 'first', lowx := i.x][, lowx := fcoalesce(lowx + .0, -Inf)]
dt[dt, on = .(x < x), mult = 'last', highx := i.x][, highx := fcoalesce(highx + .0, Inf)]
dt[, closex := fifelse(x - lowx < highx - x, lowx, highx)]
unname(unlist(dt[, .(closex)]))
}
# Using RANN, by me
rann_nn <- function(x) {
id <- RANN::nn2(as.matrix(x), k = 2)$nn.idx[, 2]
x[id]
}
### Apply all methods
# Test that all have the same output:
x <- c(4, 14,11,3,16)
rann_nn(x)
# [1] 3 16 14 4 14
base_nn(x)
# [1] 3 16 14 4 14
rfast_nn(x)
# [1] 3 16 14 4 14
sqldf_nn(x)
# [1] 3 16 14 4 14
zoo_nn(x)
# [1] 3 16 14 4 14
base2_nn(x)
# [1] 3 16 14 4 14
base3_nn(x)
# [1] 3 16 14 4 14
base4_nn(x)
# [1] 3 16 14 4 14
dt_nn(x) # output is ordered by x because setkey() sorts the table
# [1] 4 3 14 16 14
# Compare running times
library(microbenchmark)
# Compare for N = 1000 elements
benchmark_data <- rnorm(1000)
microbenchmark(
rann_nn(benchmark_data),
base_nn(benchmark_data),
rfast_nn(benchmark_data),
sqldf_nn(benchmark_data),
zoo_nn(benchmark_data),
base2_nn(benchmark_data),
base3_nn(benchmark_data),
base4_nn(benchmark_data),
dt_nn(benchmark_data)
)
# Unit: microseconds
# expr min lq mean median uq max neval
# rann_nn(benchmark_data) 641.180 684.1975 776.5467 711.6680 775.3635 3822.023 100
# base_nn(benchmark_data) 166523.177 179240.8130 209471.1333 187633.0515 249740.8425 330864.712 100
# rfast_nn(benchmark_data) 45160.603 47032.5225 47681.0557 47594.0075 48308.8440 50579.839 100
# sqldf_nn(benchmark_data) 133916.594 138769.8175 143505.9315 140543.3250 143830.2765 211873.960 100
# zoo_nn(benchmark_data) 4359.359 4604.0275 5008.4291 4785.1515 5037.9705 14999.802 100
# base2_nn(benchmark_data) 1292.322 1407.4875 1747.8404 1462.7295 1588.1580 11297.321 100
# base3_nn(benchmark_data) 1263.644 1396.9210 1615.7495 1472.9940 1571.8575 11828.015 100
# base4_nn(benchmark_data) 119.543 146.1080 254.5075 178.1065 197.4265 7726.156 100
# dt_nn(benchmark_data) 5290.337 6580.6965 7111.1816 6892.3800 7351.3795 29469.815 100
# For N = 100000, leaving out the slowest versions (e.g., `base_nn()`
# no longer works because a distance matrix cannot be computed for
# N = 100000)
benchmark_data <- rnorm(100000)
microbenchmark(
rann_nn(benchmark_data),
zoo_nn(benchmark_data),
base2_nn(benchmark_data),
base3_nn(benchmark_data),
base4_nn(benchmark_data),
dt_nn(benchmark_data)
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# rann_nn(benchmark_data) 130.957025 141.02904 149.94052 148.60184 156.14506 271.1882 100
# zoo_nn(benchmark_data) 606.690004 673.88980 720.12545 717.51658 766.98190 886.4397 100
# base2_nn(benchmark_data) 142.554407 176.30358 198.58375 193.34812 212.33885 329.5470 100
# base3_nn(benchmark_data) 142.074126 168.78195 189.65122 184.45025 205.89414 287.0740 100
# base4_nn(benchmark_data) 9.354764 10.46687 17.22086 12.36354 14.22882 166.4758 100
# dt_nn(benchmark_data) 96.503882 104.06914 117.95408 108.20284 121.11428 247.2092 100
Here is a base R solution
d <- data.frame(`diag<-`(as.matrix(dist(x)),Inf))
ids <- unlist(Map(which.min,d))
val <- x[ids]
such that
> ids
X1 X2 X3 X4 X5
4 5 2 1 2
> val
[1] 3 16 14 4 14
DATA
x <- c(4, 14,11,3,16)
Option with a data.table non-equi-join
dt <- setkey(data.table(x), x)
dt[dt, on = .(x > x), mult = 'first', lowx := i.x][, lowx := fcoalesce(lowx + .0, -Inf)]
dt[dt, on = .(x < x), mult = 'last', highx := i.x][, highx := fcoalesce(highx + .0, Inf)]
dt[, closex := fifelse(x - lowx < highx - x, lowx, highx)]
dt[, .(x, closex)]
# x closex
# 1: 3 4
# 2: 4 3
# 3: 11 14
# 4: 14 16
# 5: 16 14

Distance between two data frames of unequal size

I have two data frames of unequal size:
>df1
b c d
a 2 3 4
>df2
g h i
e 1 1 5
f 0 4 3
I need to calculate distances between elements of these data frames, by subtracting values contained in df1 from every row in df2, thus I want to get:
c d e
a 1 2 1
b 2 1 1
Trying myfunc1 <- function(x1, x2) { abs(x1 - x2) }; myfunc1(df1, df2), as well as df3 <- abs(df2 - df1), doesn't help because of the unequal sizes.
require(purrr)
map2_df(df1, df2, ~abs(.x - .y))
Or Gregor's method: abs(df2 - df1[rep(1, nrow(df2)), ])
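As an aside, the rep(1, nrow(df2)) indexing simply repeats the single row of df1 so that ordinary element-wise subtraction lines up with df2:
df1[rep(1, nrow(df2)), ]              # two copies of the single row of df1: 2 3 4, twice
abs(df2 - df1[rep(1, nrow(df2)), ])   # shapes now match, so subtraction is element-wise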
From my limited test, map2_df appears to be faster
library(data.table)
df1 <- fread( "
b c d
2 3 4
")
df2 <- fread("
g h i
1 1 5
0 4 3
")
df1 <- rbindlist(replicate(10000, df1, simplify = F))
df2 <- rbindlist(replicate(10000, df2, simplify = F))
require(purrr)
f1 <- function(){
map2_df(df1, df2, ~abs(.x - .y))
}
f2 <- function(){
abs(df2 - df1[rep(1, nrow(df2)), ])
}
library(microbenchmark)
microbenchmark(f1(), f2())
#Unit: microseconds
# expr min lq mean median uq max neval
# f1() 727.385 891.4875 1268.775 956.923 1471.179 4651.075 100
# f2() 1737.025 2011.2815 2666.744 2218.666 2889.846 8572.715 100
If it is always the case that there is only one row in the first data frame, then a base R apply method can be used:
t(apply(df2, 1, function(x) abs(x - df1[1,])))

In R, find elements of a vector in a list using vectorization

I have a vector v1
v1 = c(1, 200, 4000)
I would like to find the indices of the elements of v1 in a list L1 vectorially, i.e. without a loop, where
> L1
[[1]]
[1] 1 2 3 4
[[2]]
[1] 100 200 300 400
[[3]]
[1] 1000 2000 3000 4000
The output should be c(1, 2, 4).
Is there a way to do this without using a loop or apply (which is computationally the same as using a loop?) I have to do this for very long vectors.
We can do
sapply(L1, function(x) which(x %in% v1))
#[1] 1 2 4
Or with Vectorize
Vectorize(function(x) which(x %in% v1))(L1)
#[1] 1 2 4
If each list element is checked against the corresponding element of v1:
mapply(function(x, y) which(x %in% y), L1, v1)
#[1] 1 2 4
As @nicola mentioned, match could also be used to get the first index. If there are duplicate elements, then which would be useful:
mapply(match, v1, L1)
#[1] 1 2 4
Or using the purrr::map2
purrr::map2_int(L1, v1, ~ .x %in% .y %>% which)
#[1] 1 2 4
We can do this; it seems to be the fastest by far.
v1 <- c(1, 200, 4000)
L1 <- list(1:4, 1:4*100, 1:4*1000)
sequence(lengths(L1))[match(v1, unlist(L1))]
# [1] 1 2 4
sequence(lengths(L1))[which(unlist(L1) %in% v1)]
# [1] 1 2 4
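A quick look at the intermediate pieces shows why this indexing works (same L1 and v1 as above):
unlist(L1)             # 1 2 3 4 100 200 300 400 1000 2000 3000 4000
sequence(lengths(L1))  # 1 2 3 4 1 2 3 4 1 2 3 4 (position of each element within its list element)
match(v1, unlist(L1))  # 1 6 12 (positions of v1 in the flattened vector)
# indexing the second vector by the third recovers the within-element positions: 1 2 4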
library(microbenchmark)
library(tidyverse)
microbenchmark(
akrun_sapply = {sapply(L1, function(x) which(x %in% v1))},
akrun_Vectorize = {Vectorize(function(x) which(x %in% v1))(L1)},
akrun_mapply = {mapply(function(x, y) which(x %in% y), L1, v1)},
akrun_mapply_match = {mapply(match, v1, L1)},
akrun_map2 = {purrr::map2_int(L1, v1, ~ .x %in% .y %>% which)},
CPak = {setNames(rep(1:length(L1), times=lengths(L1)), unlist(L1))[as.character(v1)]},
zacdav = {sequence(lengths(L1))[match(v1, unlist(L1))]},
zacdav_which = {sequence(lengths(L1))[which(unlist(L1) %in% v1)]},
times = 10000
)
Unit: microseconds
expr min lq mean median uq max neval
akrun_sapply 18.187 22.7555 27.17026 24.6140 27.8845 2428.194 10000
akrun_Vectorize 60.119 76.1510 88.82623 83.4445 89.9680 2717.420 10000
akrun_mapply 19.006 24.2100 29.78381 26.2120 29.9255 2911.252 10000
akrun_mapply_match 14.136 18.4380 35.45528 20.0275 23.6560 127960.324 10000
akrun_map2 217.209 264.7350 303.64609 277.5545 298.0455 9204.243 10000
CPak 15.741 19.7525 27.31918 24.7150 29.0340 235.245 10000
zacdav 6.649 9.3210 11.30229 10.4240 11.5540 2399.686 10000
zacdav_which 7.364 10.2395 12.22632 11.2985 12.4515 2492.789 10000
You can try something like this (note that it returns, for each value, the index of the list element that contains it rather than the position within that element):
v1 = c(1, 200, 4000)
L1 <- list(1:4, 1:4*100, 1:4*1000)
setNames(rep(1:length(L1), times=lengths(L1)), unlist(L1))[as.character(v1)]
# 1 200 4000
# 1 2 3
We can also use
unlist(lapply(L1, function(x) which(x %in% v1)))
#[1] 1 2 4
Or use
unlist(Map(function(x, y) which(x %in% y), L1, v1 ))
#[1] 1 2 4

How to omit rows with NA in only two columns in R?

I want to omit rows where NA appears in both of two columns.
I'm familiar with na.omit, is.na, and complete.cases, but can't figure out how to use these to get what I want. For example, I have the following dataframe:
(df <- structure(list(x = c(1L, 2L, NA, 3L, NA),
y = c(4L, 5L, NA, 6L, 7L),
z = c(8L, 9L, 10L, 11L, NA)),
.Names = c("x", "y", "z"),
class = "data.frame",
row.names = c(NA, -5L)))
x y z
1 4 8
2 5 9
NA NA 10
3 6 11
NA 7 NA
and I want to remove only those rows where NA appears in both the x and y columns (ignoring anything in z), to give
x y z
1 4 8
2 5 9
3 6 11
NA 7 NA
Does anyone know an easy way to do this? Using na.omit, is.na, or complete.cases is not working.
df[!with(df,is.na(x)& is.na(y)),]
# x y z
#1 1 4 8
#2 2 5 9
#4 3 6 11
#5 NA 7 NA
I benchmarked on a slightly bigger dataset. Here are the results:
set.seed(237)
df <- data.frame(x=sample(c(NA,1:20), 1e6, replace=T), y= sample(c(NA, 1:10), 1e6, replace=T), z= sample(c(NA, 5:15), 1e6,replace=T))
f1 <- function() df[!with(df,is.na(x)& is.na(y)),]
f2 <- function() df[rowSums(is.na(df[c("x", "y")])) != 2, ]
f3 <- function() df[ apply( df, 1, function(x) sum(is.na(x))>1 ), ]
library(microbenchmark)
microbenchmark(f1(), f2(), f3(), unit="relative")
Unit: relative
#expr min lq median uq max neval
# f1() 1.000000 1.000000 1.000000 1.000000 1.000000 100
# f2() 1.044812 1.068189 1.138323 1.129611 0.856396 100
# f3() 26.205272 25.848441 24.357665 21.799930 22.881378 100
dplyr solution
require("dplyr")
df %>% filter_at(.vars = vars(x, y), .vars_predicate = any_vars(!is.na(.)))
It can be modified to take any number of columns using the .vars argument.
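For example, a sketch of the same pattern extended to three columns (this particular call would drop only rows where x, y, and z are all NA):
df %>% filter_at(.vars = vars(x, y, z), .vars_predicate = any_vars(!is.na(.)))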
Update: dplyr 1.0.4
df %>%
filter(!if_all(c(x, y), is.na))
See similar answer: https://stackoverflow.com/a/66136167/6105259
You can use apply to flag the rows where more than one value is NA:
sel <- apply( df, 1, function(x) sum(is.na(x)) > 1 )
Then you can drop those rows:
df[ !sel, ]
To ignore the z column, just omit it from the apply:
sel <- apply( df[,c("x","y")], 1, function(x) sum(is.na(x))>1 )
If they all have to be TRUE, just change the function up a little:
sel <- apply( df[,c("x","y")], 1, function(x) all(is.na(x)) )
The other solutions here are more specific to this particular problem, but apply is worth learning as it solves many other problems. The cost is speed (usual caveats about small datasets and speed testing apply):
> microbenchmark( df[!with(df,is.na(x)& is.na(y)),], df[rowSums(is.na(df[c("x", "y")])) != 2, ], df[ apply( df, 1, function(x) sum(is.na(x))>1 ), ] )
Unit: microseconds
expr min lq median uq max neval
df[!with(df, is.na(x) & is.na(y)), ] 67.148 71.5150 76.0340 86.0155 1049.576 100
df[rowSums(is.na(df[c("x", "y")])) != 2, ] 132.064 139.8760 145.5605 166.6945 498.934 100
df[apply(df, 1, function(x) sum(is.na(x)) > 1), ] 175.372 184.4305 201.6360 218.7150 321.583 100
Use rowSums with is.na, like this:
> df[rowSums(is.na(df[c("x", "y")])) != 2, ]
x y z
1 1 4 8
2 2 5 9
4 3 6 11
5 NA 7 NA
Jumping on the benchmarking bandwagon, and demonstrating what I was referring to about this being a fairly easy-to-generalize solution, consider the following:
## Sample data with 10 columns and 1 million rows
set.seed(123)
df <- data.frame(replicate(10, sample(c(NA, 1:20),
1e6, replace = TRUE)))
First, here's what things look like if you're just interested in two columns. Both solutions are pretty legible and short. Speed is quite close.
f1 <- function() {
df[!with(df, is.na(X1) & is.na(X2)), ]
}
f2 <- function() {
df[rowSums(is.na(df[1:2])) != 2, ]
}
library(microbenchmark)
microbenchmark(f1(), f2(), times = 20)
# Unit: milliseconds
# expr min lq median uq max neval
# f1() 745.8378 1100.764 1128.047 1199.607 1310.236 20
# f2() 784.2132 1101.695 1125.380 1163.675 1303.161 20
Next, let's look at the same problem, but this time, we are considering NA values across the first 5 columns. At this point, the rowSums approach is slightly faster and the syntax does not change much.
f1_5 <- function() {
df[!with(df, is.na(X1) & is.na(X2) & is.na(X3) &
is.na(X4) & is.na(X5)), ]
}
f2_5 <- function() {
df[rowSums(is.na(df[1:5])) != 5, ]
}
microbenchmark(f1_5(), f2_5(), times = 20)
# Unit: seconds
# expr min lq median uq max neval
# f1_5() 1.275032 1.294777 1.325957 1.368315 1.572772 20
# f2_5() 1.088564 1.169976 1.193282 1.225772 1.275915 20
Here is also a very basic dplyr solution:
library(dplyr)
df %>%
filter(!(is.na(x) & is.na(y)))
x y z
1 1 4 8
2 2 5 9
3 3 6 11
4 NA 7 NA

Numeric comparison during merge in R

Dataframe d1:
x y
4 10
6 20
7 30
Dataframe d2:
x z
3 100
6 200
9 300
How do I merge d1 and d2 by "x" where d1$x should be matched against exact match or the next higher number in d2$x. Output should look like:
x y z
4 10 200 # (4 is matched against next higher value that is 6)
6 20 200 # (6 is matched against 6)
7 30 300 # (7 is matched against next higher value that is 9)
If merge() cannot do this, then is there any other way to do this? For loops are painfully slow.
This is pretty straightforward using rolling joins with data.table:
require(data.table) ## >= 1.9.2
setkey(setDT(d1), x) ## convert to data.table, set key for the column to join on
setkey(setDT(d2), x) ## same as above
d2[d1, roll=-Inf] ## "roll backwards": where there is no exact match, take the next higher x in d2
# x z y
# 1: 4 200 10
# 2: 6 200 20
# 3: 7 300 30
Input data:
d1 <- data.frame(x=c(4,6,7), y=c(10,20,30))
d2 <- data.frame(x=c(3,6,9), z=c(100,200,300))
You basically wish to extend d1 by a new column. So let's copy it.
d3 <- d1
Next I assume that d2$x is sorted nondecreasingly and that max(d1$x) <= max(d2$x).
d3$z <- sapply(d1$x, function(x) d2$z[which(x <= d2$x)[1]])
Which reads: for each x in d1$x, get the smallest value from d2$x which is not smaller than x.
Under these assumptions, the above may also be written as (& should be a bit faster):
d3$z <- sapply(d1$x, function(x) d2$z[which.max(x <= d2$x)])
In result we get:
d3
## x y z
## 1 4 10 200
## 2 6 20 200
## 3 7 30 300
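As an aside, the which.max variant works because which.max applied to a logical vector returns the position of the first TRUE, e.g.:
which.max(c(FALSE, FALSE, TRUE, TRUE))
## [1] 3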
EDIT1: Inspired by @MatthewLundberg's cut-based solution, here's another one using findInterval:
d3$z <- d2$z[findInterval(d1$x, d2$x+1)+1]
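A step-by-step sketch of the shift-by-one trick (assuming the integer-valued example data):
findInterval(d1$x, d2$x + 1)            ## 1 1 2 -- how many d2$x values lie strictly below each d1$x
findInterval(d1$x, d2$x + 1) + 1        ## 2 2 3 -- index of the first d2$x that is >= each d1$x
d2$z[findInterval(d1$x, d2$x + 1) + 1]  ## 200 200 300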
EDIT2: (Benchmark)
Exemplary data:
set.seed(123)
d1 <- data.frame(x=sort(sample(1:10000, 1000)), y=sort(sample(1:10000, 1000)))
d2 <- data.frame(x=sort(c(sample(1:10000, 999), 10000)), z=sort(sample(1:10000, 1000)))
Results:
microbenchmark::microbenchmark(
{d3 <- d1; d3$z <- d2$z[findInterval(d1$x, d2$x+1)+1] },
{d3 <- d1; d3$z <- sapply(d1$x, function(x) d2$z[which(x <= d2$x)[1]]) },
{d3 <- d1; d3$z <- sapply(d1$x, function(x) d2$z[which.max(x <= d2$x)]) },
{d1$x2 <- d2$x[as.numeric(cut(d1$x, c(-Inf, d2$x, Inf)))]; merge(d1, d2, by.x='x2', by.y='x')},
{d1a <- d1; setkey(setDT(d1a), x); d2a <- d2; setkey(setDT(d2a), x); d2a[d1a, roll=-Inf] }
)
## Unit: microseconds
## expr min lq median uq max neval
## findInterval 221.102 1357.558 1394.246 1429.767 17810.55 100
## which 66311.738 70619.518 85170.175 87674.762 220613.09 100
## which.max 69832.069 73225.755 83347.842 89549.326 118266.20 100
## cut 8095.411 8347.841 8498.486 8798.226 25531.58 100
## data.table 1668.998 1774.442 1878.028 1954.583 17974.10 100
cut can be used to find the appropriate matches in d2$x for the values in d1$x.
The computation to find the matches with cut is as follows:
as.numeric(cut(d1$x, c(-Inf, d2$x, Inf)))
## [1] 2 2 3
These are the values:
d2$x[as.numeric(cut(d1$x, c(-Inf, d2$x, Inf)))]
[1] 6 6 9
These can be added to d1 and the merge performed:
d1$x2 <- d2$x[as.numeric(cut(d1$x, c(-Inf, d2$x, Inf)))]
merge(d1, d2, by.x='x2', by.y='x')
## x2 x y z
## 1 6 4 10 200
## 2 6 6 20 200
## 3 9 7 30 300
The added column may then be removed, if desired.
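For instance, a small sketch of that cleanup:
merged <- merge(d1, d2, by.x='x2', by.y='x')
merged[, c('x', 'y', 'z')]   # keep only the original columns, dropping the helper x2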
Try: sapply(d1$x, function(y) d2$z[d2$x >= y][which.min(abs(y - d2$x[d2$x >= y]))]) (note the >=, so that exact matches such as 6 are kept).
