Dataframe d1:
x y
4 10
6 20
7 30
Dataframe d2:
x z
3 100
6 200
9 300
How do I merge d1 and d2 by "x" so that each value of d1$x is matched against an exact match or the next higher value in d2$x? The output should look like:
x y z
4 10 200 # (4 is matched against the next higher value, which is 6)
6 20 200 # (6 is matched exactly against 6)
7 30 300 # (7 is matched against the next higher value, which is 9)
If merge() cannot do this, then is there any other way to do this? For loops are painfully slow.
This is pretty straightforward using rolling joins with data.table:
require(data.table) ## >= 1.9.2
setkey(setDT(d1), x) ## convert to data.table, set key for the column to join on
setkey(setDT(d2), x) ## same as above
d2[d1, roll=-Inf]
# x z y
# 1: 4 200 10
# 2: 6 200 20
# 3: 7 300 30
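If you want the columns in the order shown in the question (x, y, z) rather than the join's default (x, z, y), you can select them in the same chain; this is a small variation on the call above:
d2[d1, roll=-Inf][, .(x, y, z)]
#    x  y   z
# 1: 4 10 200
# 2: 6 20 200
# 3: 7 30 300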
Input data:
d1 <- data.frame(x=c(4,6,7), y=c(10,20,30))
d2 <- data.frame(x=c(3,6,9), z=c(100,200,300))
You basically wish to extend d1 by a new column. So let's copy it.
d3 <- d1
Next I assume that d2$x is sorted nondecreasingly and that max(d1$x) <= max(d2$x).
d3$z <- sapply(d1$x, function(x) d2$z[which(x <= d2$x)[1]])
Which reads: for each x in d1$x, get the smallest value from d2$x which is not smaller than x.
Under these assumptions, the above may also be written as (& should be a bit faster):
d3$z <- sapply(d1$x, function(x) d2$z[which.max(x <= d2$x)])
As a result we get:
d3
## x y z
## 1 4 10 200
## 2 6 20 200
## 3 7 30 300
EDIT1: Inspired by @MatthewLundberg's cut-based solution, here's another one using findInterval:
d3$z <- d2$z[findInterval(d1$x, d2$x+1)+1]
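To see why this works on the (integer-valued) example data: findInterval(d1$x, d2$x+1) counts, for each element of d1$x, how many values of d2$x lie strictly below it, so adding 1 gives the index of the smallest d2$x that is >= d1$x. The note about the +1 shift relying on integer values is my caveat, not part of the original answer.
findInterval(d1$x, d2$x + 1)         ## [1] 1 1 2   (values of d2$x strictly below each d1$x)
findInterval(d1$x, d2$x + 1) + 1     ## [1] 2 2 3   (index of the matching row in d2)
d2$z[findInterval(d1$x, d2$x + 1) + 1]
## [1] 200 200 300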
EDIT2: (Benchmark)
Example data:
set.seed(123)
d1 <- data.frame(x=sort(sample(1:10000, 1000)), y=sort(sample(1:10000, 1000)))
d2 <- data.frame(x=sort(c(sample(1:10000, 999), 10000)), z=sort(sample(1:10000, 1000)))
Results:
microbenchmark::microbenchmark(
{d3 <- d1; d3$z <- d2$z[findInterval(d1$x, d2$x+1)+1] },
{d3 <- d1; d3$z <- sapply(d1$x, function(x) d2$z[which(x <= d2$x)[1]]) },
{d3 <- d1; d3$z <- sapply(d1$x, function(x) d2$z[which.max(x <= d2$x)]) },
{d1$x2 <- d2$x[as.numeric(cut(d1$x, c(-Inf, d2$x, Inf)))]; merge(d1, d2, by.x='x2', by.y='x')},
{d1a <- d1; setkey(setDT(d1a), x); d2a <- d2; setkey(setDT(d2a), x); d2a[d1a, roll=-Inf] }
)
## Unit: microseconds
## expr min lq median uq max neval
## findInterval 221.102 1357.558 1394.246 1429.767 17810.55 100
## which 66311.738 70619.518 85170.175 87674.762 220613.09 100
## which.max 69832.069 73225.755 83347.842 89549.326 118266.20 100
## cut 8095.411 8347.841 8498.486 8798.226 25531.58 100
## data.table 1668.998 1774.442 1878.028 1954.583 17974.10 100
cut can be used to find the appropriate matches in d2$x for the values in d1$x.
The computation to find the matches with cut is as follows:
as.numeric(cut(d1$x, c(-Inf, d2$x, Inf)))
## [1] 2 2 3
These are the values:
d2$x[as.numeric(cut(d1$x, c(-Inf, d2$x, Inf)))]
## [1] 6 6 9
These can be added to d1 and the merge performed:
d1$x2 <- d2$x[as.numeric(cut(d1$x, c(-Inf, d2$x, Inf)))]
merge(d1, d2, by.x='x2', by.y='x')
## x2 x y z
## 1 6 4 10 200
## 2 6 6 20 200
## 3 9 7 30 300
The added column may then be removed, if desired.
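For instance, storing the merge result first (m is just a name introduced here for illustration):
m <- merge(d1, d2, by.x='x2', by.y='x')
m$x2 <- NULL   # drop the helper column
m
##   x  y   z
## 1 4 10 200
## 2 6 20 200
## 3 7 30 300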
Try: sapply(d1$x, function(y) d2$z[d2$x >= y][which.min(abs(y - d2$x[d2$x >= y]))])
Related
I have a vector
set.seed(2)
x <- sample.int(20, 5)
[1] 4 14 11 3 16
Now, for every element I want to find
the element with the minimum distance (min(abs(x[i]-x[-i])) for element i), which here would be
[1] 3 16 14 4 14
the (first) index of the element with the minimum distance, which here would be
[1] 4 5 2 1 2
The point is that the element itself is not considered, only all the other elements, which is why the existing question "R - Fastest way to find nearest value in vector" is not the answer.
If the actual answer is out there, sorry - I didn't find it.
1) Rfast Using dista in Rfast we get the indices of the two closest values; take the second closest, since the closest will be the value itself.
library(Rfast)
x <- c(4, 14, 11, 3, 16) # input
x[ dista(x, x, k = 2, index = TRUE)[, 2] ]
## [1] 3 16 14 4 14
2) sqldf Using SQL we can left join DF to itself, excluding rows with the same value, and take the row with the minimum distance.
DF <- data.frame(x) # x is from (1)
sqldf("select a.x, b.x nearest, min(abs(a.x - b.x))
from DF a
left join DF b on a.x != b.x
group by a.rowid")[1:2]
giving:
x nearest
1 4 3
2 14 16
3 11 14
4 3 4
5 16 14
3) zoo Sort the input, take the value corresponding to the least difference on either side of each element, and order it back.
library(zoo)
ix <- order(x)
least <- function(x) if (x[2] - x[1] < x[3] - x[2]) x[1] else x[3]
rollapply(c(-Inf, x[ix], Inf), 3, least)[order(ix)]
## [1] 3 16 14 4 14
4) Base R Using ix and least from (3) we can mimic (3) using only base functions as follows.
apply(embed(c(-Inf, x[ix], Inf), 3)[, 3:1], 1, least)[order(ix)]
## [1] 3 16 14 4 14
4a) This slightly shorter variation would also work:
-apply(embed(-c(-Inf, x[ix], Inf), 3), 1, least)[order(ix)]
## [1] 3 16 14 4 14
4b) Simplifying further we have the following base solution where, again, ix is from (3):
xx <- x[ix]
x1 <- c(-Inf, xx[-length(xx)])
x2 <- c(xx[-1], Inf)
ifelse(xx - x1 < x2 - xx, x1, x2)[order(ix)]
## [1] 3 16 14 4 14
Duplicates
The example in the question had no duplicates, but if there were duplicates there is some question about the problem definition. For example, if we had c(1, 3, 4, 1), then looking at the first value, 1, there is another value exactly equal to it, so the closest value is 1. Another interpretation is that the closest value not equal to 1 should be returned, which in this case is 3. In the code above, the sqldf solution gives the closest value not equal to the current value, whereas the others give the closest value among the remaining values.
If we wanted the closest-not-equal interpretation for the solutions other than sqldf, we could use rle after ordering to compress the input down to unique values and then apply inverse.rle afterwards, as shown in this modified 4b:
x <- c(1, 3, 4, 1)
ix <- order(x)
r <- rle(x[ix])
xx <- r$values
x1 <- c(-Inf, xx[-length(xx)])
x2 <- c(xx[-1], Inf)
r$values <- ifelse(xx - x1 < x2 - xx, x1, x2)
inverse.rle(r)[order(ix)]
## [1] 3 4 3 3
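For comparison, the unmodified 4b on the same input gives the closest value among the remaining elements, so a duplicated value matches its duplicate; recomputing from x and ix above:
xx <- x[ix]
x1 <- c(-Inf, xx[-length(xx)])
x2 <- c(xx[-1], Inf)
ifelse(xx - x1 < x2 - xx, x1, x2)[order(ix)]
## [1] 1 4 3 1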
I was very interested in this question and in the approaches suggested in the other responses, so I compared them with regard to their running time (and I added another approach using the package RANN). The code is appended below. TL;DR: The base R version 4b by user G. Grothendieck was most efficient, and by a significant margin.
library(RANN)
library(zoo)
library(data.table)
library(Rfast)
library(sqldf)
# All functions take a vector as argument,
# and return the values of nearest neighbours (not their index)
# Using base R, by ThomasIsCoding
base_nn <- function(x) {
d <- data.frame(`diag<-`(as.matrix(dist(x)),Inf))
id <- unlist(Map(which.min,d))
x[id]
}
# Using Rfast, by G. Grothendieck
rfast_nn <- function(x) {
x[ dista(x, x, k = 2, index = TRUE)[, 2] ]
}
# Using sqldf, by G. Grothendieck
sqldf_nn <- function(x) {
DF <- data.frame(x) # x is from (1)
unname(
unlist(sqldf("select a.x, b.x nearest, min(abs(a.x - b.x))
from DF a
left join DF b on a.x != b.x
group by a.rowid")[2])
)
}
# Using `zoo`, by G. Grothendieck
zoo_nn <- function(x) {
ix <- order(x)
least <- function(x) if (x[2] - x[1] < x[3] - x[2]) x[1] else x[3]
rollapply(c(-Inf, x[ix], Inf), 3, least)[order(ix)]
}
# Using base R (v 4), by G. Grothendieck
base2_nn <- function(x) {
ix <- order(x)
least <- function(x) if (x[2] - x[1] < x[3] - x[2]) x[1] else x[3]
apply(embed(c(-Inf, x[ix], Inf), 3)[, 3:1], 1, least)[order(ix)]
}
# Using base R (v 4a), by G. Grothendieck
base3_nn <- function(x) {
ix <- order(x)
least <- function(x) if (x[2] - x[1] < x[3] - x[2]) x[1] else x[3]
-apply(embed(-c(-Inf, x[ix], Inf), 3), 1, least)[order(ix)]
}
# Using base R (v 4b), by G. Grothendieck
base4_nn <- function(x) {
ix <- order(x)
xx <- x[ix]
x1 <- c(-Inf, xx[-length(xx)])
x2 <- c(xx[-1], Inf)
ifelse(xx - x1 < x2 - xx, x1, x2)[order(ix)]
}
# Using data.table, by IceCreamToucan
dt_nn <- function(x) {
dt <- setkey(data.table(x), x)
dt[dt, on = .(x > x), mult = 'first', lowx := i.x][, lowx := fcoalesce(lowx + .0, -Inf)]
dt[dt, on = .(x < x), mult = 'last', highx := i.x][, highx := fcoalesce(highx + .0, Inf)]
dt[, closex := fifelse(x - lowx < highx - x, lowx, highx)]
unname(unlist(dt[, .(closex)]))
}
# Using RANN, by me
rann_nn <- function(x) {
id <- RANN::nn2(as.matrix(x), k = 2)$nn.idx[, 2]
x[id]
}
### Apply all methods
# Test that all have the same output:
x <- c(4, 14,11,3,16)
rann_nn(x)
# [1] 3 16 14 4 14
base_nn(x)
# [1] 3 16 14 4 14
rfast_nn(x)
# [1] 3 16 14 4 14
sqldf_nn(x)
# [1] 3 16 14 4 14
zoo_nn(x)
# [1] 3 16 14 4 14
base2_nn(x)
# [1] 3 16 14 4 14
base3_nn(x)
# [1] 3 16 14 4 14
base4_nn(x)
# [1] 3 16 14 4 14
dt_nn(x) # differently ordered for some reason
# [1] 4 3 14 16 14
# Compare running times
library(microbenchmark)
# Compare for N = 1000 elements
benchmark_data <- rnorm(1000)
microbenchmark(
rann_nn(benchmark_data),
base_nn(benchmark_data),
rfast_nn(benchmark_data),
sqldf_nn(benchmark_data),
zoo_nn(benchmark_data),
base2_nn(benchmark_data),
base3_nn(benchmark_data),
base4_nn(benchmark_data),
dt_nn(benchmark_data)
)
# Unit: microseconds
# expr min lq mean median uq max neval
# rann_nn(benchmark_data) 641.180 684.1975 776.5467 711.6680 775.3635 3822.023 100
# base_nn(benchmark_data) 166523.177 179240.8130 209471.1333 187633.0515 249740.8425 330864.712 100
# rfast_nn(benchmark_data) 45160.603 47032.5225 47681.0557 47594.0075 48308.8440 50579.839 100
# sqldf_nn(benchmark_data) 133916.594 138769.8175 143505.9315 140543.3250 143830.2765 211873.960 100
# zoo_nn(benchmark_data) 4359.359 4604.0275 5008.4291 4785.1515 5037.9705 14999.802 100
# base2_nn(benchmark_data) 1292.322 1407.4875 1747.8404 1462.7295 1588.1580 11297.321 100
# base3_nn(benchmark_data) 1263.644 1396.9210 1615.7495 1472.9940 1571.8575 11828.015 100
# base4_nn(benchmark_data) 119.543 146.1080 254.5075 178.1065 197.4265 7726.156 100
# dt_nn(benchmark_data) 5290.337 6580.6965 7111.1816 6892.3800 7351.3795 29469.815 100
# For N = 100000, leaving out the slowest versions (e.g., `base_nn()`
# no longer works because a distance matrix cannot be computed for
# N = 100000)
benchmark_data <- rnorm(100000)
microbenchmark(
rann_nn(benchmark_data),
zoo_nn(benchmark_data),
base2_nn(benchmark_data),
base3_nn(benchmark_data),
base4_nn(benchmark_data),
dt_nn(benchmark_data)
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# rann_nn(benchmark_data) 130.957025 141.02904 149.94052 148.60184 156.14506 271.1882 100
# zoo_nn(benchmark_data) 606.690004 673.88980 720.12545 717.51658 766.98190 886.4397 100
# base2_nn(benchmark_data) 142.554407 176.30358 198.58375 193.34812 212.33885 329.5470 100
# base3_nn(benchmark_data) 142.074126 168.78195 189.65122 184.45025 205.89414 287.0740 100
# base4_nn(benchmark_data) 9.354764 10.46687 17.22086 12.36354 14.22882 166.4758 100
# dt_nn(benchmark_data) 96.503882 104.06914 117.95408 108.20284 121.11428 247.2092 100
Here is a base R solution
d <- data.frame(`diag<-`(as.matrix(dist(x)),Inf))
ids <- unlist(Map(which.min,d))
val <- x[ids]
such that
> ids
X1 X2 X3 X4 X5
4 5 2 1 2
> val
[1] 3 16 14 4 14
DATA
x <- c(4, 14,11,3,16)
Option with a data.table non-equi-join
dt <- setkey(data.table(x), x)
dt[dt, on = .(x > x), mult = 'first', lowx := i.x][, lowx := fcoalesce(lowx + .0, -Inf)]
dt[dt, on = .(x < x), mult = 'last', highx := i.x][, highx := fcoalesce(highx + .0, Inf)]
dt[, closex := fifelse(x - lowx < highx - x, lowx, highx)]
dt[, .(x, closex)]
# x closex
# 1: 3 4
# 2: 4 3
# 3: 11 14
# 4: 14 16
# 5: 16 14
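The rows come back sorted because of the setkey call (which is also why dt_nn() in the benchmark answer above returned a differently ordered result). If you need the values in the original order of x (and x has no duplicates), one option is to join the original vector back in; this is a sketch of mine, not part of the original answer:
orig <- data.table(x = x)          # x in its original order
dt[orig, closex, on = "x"]
# [1]  3 16 14  4 14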
I have a vector
x <- c(1,2,5,4,3,1,1,4,2,6,7,2,4,1,5)
and I want to add the values that are the same, giving me the new vector
x <- c(4, 6, 3, 12, 10, 6, 7)
It sounds quite simple but I am stuck.
You can use sapply to iterate over the vector of the unique values, and then sum each one of the corresponding entries, like so:
> sapply(unique(x), function(i) sum(x[x == i]))
[1] 4 6 10 12 3 6 7
If the order is relevant, please indicate which order you want.
In this solution, the order is the same as the output of unique, which you can use to know which sum corresponds to which value.
> unique(x)
[1] 1 2 5 4 3 6 7
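To keep that mapping explicit, you could also name the result by the corresponding unique values (a small variation of mine, not in the original answer):
setNames(sapply(unique(x), function(i) sum(x[x == i])), unique(x))
#  1  2  5  4  3  6  7
#  4  6 10 12  3  6  7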
Edit
It looks like you want the ascending order of unique values. In that case, you can do like this:
> sapply(sort(unique(x)), function(i) sum(x[x == i]))
[1] 4 6 3 12 10 6 7
aggregate(x, list(number = x), FUN = sum )
# number x
#1 1 4
#2 2 6
#3 3 3
#4 4 12
#5 5 10
#6 6 6
#7 7 7
The result is a data.frame and you can extract the second column as usual.
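For example:
aggregate(x, list(number = x), FUN = sum)$x
# [1]  4  6  3 12 10  6  7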
Here's another option, for fun:
with(rle(sort(x)), lengths * values)
# [1] 4 6 3 12 10 6 7
Benchmarks
library(microbenchmark)
x <- c(1,2,5,4,3,1,1,4,2,6,7,2,4,1,5)
x <- rep(x, length.out=1000)
matthew <- function() with(rle(sort(x)), lengths * values)
iled <- function() sapply(sort(unique(x)), function(i) sum(x[x == i]))
kota <- function() as.numeric(table(x) * as.integer(names(table(x))))
deena <- function() {
freqTable = as.data.frame(table(x))
as.numeric(as.character(freqTable$x)) * freqTable$Freq
}
roland <- function() aggregate(x, list(number = x), FUN = sum )$x
microbenchmark(matthew(), iled(), kota(), deena(), roland())
# Unit: microseconds
# expr min lq mean median uq max neval
# matthew() 105.5 116.9 167.5 122.5 131.3 1466 100
# iled() 111.2 125.6 160.3 131.4 138.8 1449 100
# kota() 1821.5 1899.3 1960.4 1915.9 1940.7 3031 100
# deena() 1124.7 1175.6 1221.1 1187.9 1207.7 2700 100
# roland() 1912.2 1967.9 2116.6 1995.5 2078.5 3610 100
One way to do that would be to multiply each element by its frequency. The table function does a good job of that:
freqTable = as.data.frame(table(x))
requiredResult = as.numeric(as.character(freqTable$x)) * freqTable$Freq
You want to compute #n * n for each n, where #n is the number of occurrences of n.
Just an alternative approach to the sapply above.
table(x) * as.integer(names(table(x)))
# x
# 1 2 3 4 5 6 7
# 4 6 3 12 10 6 7
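A small variation (mine, not part of the original answer) computes the table only once and drops the names with as.numeric:
tab <- table(x)
as.numeric(tab) * as.integer(names(tab))
# [1]  4  6  3 12 10  6  7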
Consider this data:
set.seed(200914)
y <- round(runif(20, 5, 15))
y
table(y)
In the real application y is a categorical variable such as "outcome code". I want to recode y so that its values are 1:n, while preserving order (sometimes the variable may be ordinal).
One answer is:
(ya <- y - min(y) + 1)
table(ya)
But this solution does not have minimal range, which may make subsequent code inefficient. Trying again...
(suy <- sort(unique(y)))
(n <- length(suy))
yb <- y
for (i in 1:n) yb[which(y == suy[i])] <- i
table(yb)
yb is exactly what I want, but I wonder if I am computing it in the most efficient way?
Try
yc <- as.numeric(factor(y))
table(yc)
#yc
#1 2 3 4 5 6 7 8
#1 4 1 1 6 3 3 1
since essentially what you're looking for are the factor codes (I think).
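As a side note (mine, not from the original answer), as.integer gives the same codes while keeping them as integers rather than converting to double:
yc <- as.integer(factor(y))
table(yc)
#yc
#1 2 3 4 5 6 7 8
#1 4 1 1 6 3 3 1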
Try:
yc <- match(y, sort(unique(y)))
table(yc)
#1 2 3 4 5 6 7 8
#1 4 1 1 6 3 3 1
all.equal(yb,yc)
#[1] TRUE
Another option might be findInterval
table(findInterval(y, sort( unique(y))))
#1 2 3 4 5 6 7 8
#1 4 1 1 6 3 3 1
Benchmarks
set.seed(25)
y <- sample(1:20, 1e6,replace=TRUE)
f1 <- function() {suy <- sort(unique(y))
n <- length(suy)
yb <- y
for (i in 1:n) yb[which(y == suy[i])] <- i
table(yb)}
f2 <- function() {yc <- as.numeric(factor(y))
table(yc)}
f3 <- function() {yd <- match(y, sort(unique(y)))
table(yd)}
f4 <- function() {ye <- findInterval(y, sort(unique(y)))
table(ye)}
library(microbenchmark)
microbenchmark(f1(), f2(), f3(), f4(), unit="relative", times=25L)
# Unit: relative
# expr min lq median uq max neval
# f1() 1.198901 1.208551 1.235237 1.242697 1.600400 25
# f2() 3.745317 3.593736 3.593330 3.596990 3.488292 25
# f3() 1.000000 1.000000 1.000000 1.000000 1.000000 25
# f4() 1.017857 1.038056 1.047112 1.038731 1.014825 25
What would be the most efficient way to multiply each column of a data frame by a vector?
e.g. data frame (df) has the following columns (col1, col2, col3, col4) and vector (v) has the following elements (v1,v2,v3).
I want the output to be: col2*v1, col3*v2, col4*v3
I've been trying df[c(2:4)] * c(v1,v2,v3), but it seems the vector elements are not being applied to every row of their respective columns.
You could try (using df and v from Richard Scriven's answer):
df[-1] <- t(t(df[-1]) * v)
df
# a x y z
# 1 a 5 40 105
# 2 b 10 50 120
# 3 c 15 60 135
When you multiply a matrix by a vector, the vector is recycled down the columns. Since you want v applied across each row (one element per column), we transpose df[-1] using t, multiply by v, and transpose back using t.
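A tiny illustration of that columnwise recycling (my example, not from the original answer):
m <- matrix(1:6, nrow = 3)    # columns are 1 2 3 and 4 5 6
m * c(10, 100, 1000)          # the vector is recycled down each column
#      [,1] [,2]
# [1,]   10   40
# [2,]  200  500
# [3,] 3000 6000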
It seems like this approach has a slight edge in benchmarking over the Map approach, and a significant advantage over sweep:
library(microbenchmark)
rscriven <- function(df, v) cbind(df[1], Map(`*`, df[-1], v))
josilber <- function(df, v) cbind(df[1], t(t(df[-1]) * v))
dardisco <- function(df, v) cbind(df[1], sweep(df[-1], MARGIN=2, STATS=v, FUN="*"))
df2 <- cbind(data.frame(rep("a", 1000)), matrix(rnorm(100000), nrow=1000))
v2 <- rnorm(100)
all.equal(rscriven(df2, v2), josilber(df2, v2))
# [1] TRUE
all.equal(rscriven(df2, v2), dardisco(df2, v2))
# [1] TRUE
microbenchmark(rscriven(df2, v2), josilber(df2, v2), dardisco(df2, v2))
# Unit: milliseconds
# expr min lq median uq max neval
# rscriven(df2, v2) 5.276458 5.378436 5.451041 5.587644 9.470207 100
# josilber(df2, v2) 2.545144 2.753363 3.099589 3.704077 8.955193 100
# dardisco(df2, v2) 11.647147 12.761184 14.196678 16.581004 132.428972 100
Thanks to @thelatemail for pointing out that the Map approach is a good deal faster for 100x larger data frames:
df2 <- cbind(data.frame(rep("a", 10000)), matrix(rnorm(10000000), nrow=10000))
v2 <- rnorm(1000)
microbenchmark(rscriven(df2, v2), josilber(df2, v2), dardisco(df2, v2))
# Unit: milliseconds
# expr min lq median uq max neval
# rscriven(df2, v2) 75.74051 90.20161 97.08931 115.7789 259.0855 100
# josilber(df2, v2) 340.72774 388.17046 498.26836 514.5923 623.4020 100
# dardisco(df2, v2) 928.81128 1041.34497 1156.39293 1271.4758 1506.0348 100
It seems like you'll need to benchmark to determine which approach is fastest for your application.
You can use Map for this. Here's an example
> ( df <- data.frame(a = letters[1:3], x = 1:3, y = 4:6, z = 7:9) )
# a x y z
# 1 a 1 4 7
# 2 b 2 5 8
# 3 c 3 6 9
> v <- c(5, 10, 15)
> cbind(df[1], Map(`*`, df[-1], v))
# a x y z
# 1 a 5 40 105
# 2 b 10 50 120
# 3 c 15 60 135
In this example,
column x is multiplied by v[1] (5)
column y is multiplied by v[2] (10)
column z is multiplied by v[3] (15)
cbind is used to attach the unused column a to the columns we operated on
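To make the pairing explicit, here is a minimal standalone Map call (my illustration, not from the original answer):
Map(`*`, list(x = 1:3, y = 4:6), c(5, 10))
# $x
# [1]  5 10 15
# $y
# [1] 40 50 60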
Not as fast, but more flexible:
sweep(df[-1], MARGIN=2, STATS=v, FUN="*")
The simple apply function can also be used here, operating row by row:
df[-1]= (t(apply(df[-1],1, FUN=function(x)x*v)))
df
a x y z
1 a 5 40 105
2 b 10 50 120
3 c 15 60 135
Example Data
A<-c(1,4,5,6,2,3,4,5,6,7,8,7)
B<-c(4,6,7,8,2,2,2,3,8,8,7,8)
DF<-data.frame(A,B)
What I would like to do is apply a correction factor to column A, based on the values of column B. The rules would be something like this
If B is less than 4, multiply A by 1
If B is at least 4 and less than 6, multiply A by 2
If B is 6 or greater, multiply A by 4
I suppose I could write an "if" statement (and I'd be glad to see a good example), but I'd also be interested in using square bracket indexing to speed things up.
The end result would look like this
A B
2 4
16 6
20 7
24 8
etc.
Use this:
within(DF, A <- ifelse(B>=6, 4, ifelse(B<4, 1, 2)) * A)
Or this (corrected by #agstudy):
within(DF, {A[B>=6] <- A[B>=6]*4; A[B>=4 & B<6] <- A[B>=4 & B<6]*2})
Benchmarking:
DF <- data.frame(A=rpois(1e4, 5), B=rpois(1e4, 5))
a <- function(DF) within(DF, A <- ifelse(B>=6, 4, ifelse(B<4, 1, 2)) * A)
b <- function(DF) within(DF, {A[B>=6] <- A[B>=6]*4; A[B>=4 & B<6] <- A[B>=4 & B<6]*2})
identical(a(DF), b(DF))
#[1] TRUE
microbenchmark(a(DF), b(DF), times=1000)
#Unit: milliseconds
# expr min lq median uq max neval
# a(DF) 8.603778 10.253799 11.07999 11.923116 53.91140 1000
# b(DF) 3.763470 3.889065 5.34851 5.480294 39.72503 1000
Similar to @Ferdinand's solution, but using transform:
transform(DF, newcol = ifelse(B<4, A,
ifelse(B>=6,4*A,2*A)))
A B newcol
1 1 4 2
2 4 6 16
3 5 7 20
4 6 8 24
5 2 2 2
6 3 2 3
7 4 2 4
8 5 3 5
9 6 8 24
10 7 8 28
11 8 7 32
12 7 8 28
I prefer to use findInterval as an index into a set of factors for such operations. The proliferation of nested test-conditional and consequent vectors with multiple ifelse calls offends my efficiency sensibilities:
DF$A <- DF$A * c(1,2,4)[findInterval(DF$B, c(-Inf,4,6,Inf) ) ]
DF
A B
1 2 4
2 16 6
3 20 7
4 24 8
snipped ....
Benchmark:
DF <- data.frame(A=rpois(1e4, 5), B=rpois(1e4, 5))
a <- function(DF) within(DF, A <- ifelse(B>=6, 4, ifelse(B<4, 1, 2)) * A)
b <- function(DF) within(DF, {A[B>=6] <- A[B>=6]*4; A[B>=4 & B<6] <- A[B>=4 & B<6]*2})
ccc <- function(DF) within(DF, {A <- A * c(1,2,4)[findInterval(B, c(-Inf,4,6,Inf) ) ]})
microbenchmark(a(DF), b(DF), ccc(DF), times=1000)
#-----------
Unit: microseconds
expr min lq median uq max neval
a(DF) 7616.107 7843.6320 8105.0340 8322.5620 93549.85 1000
b(DF) 2638.507 2789.7330 2813.8540 3072.0785 92389.57 1000
ccc(DF) 604.555 662.5335 676.0645 698.8665 85375.14 1000
Note: I would not have done this using within if I were coding my own function, but thought for fairness to the earlier effort, I would make it apples <-> apples.