Quick manipulation of data frame in R

I have the following example data frame:
> a = data.frame(a=c(1, 2, 3), b=c(10, 11, 12), c=c(1, 1, 0))
> a
a b c
1 1 10 1
2 2 11 1
3 3 12 0
For every row, I want to set a$a to a$b if a$c == 1; otherwise a$a should keep its value. The final data frame a should look like this:
> a
a b c
1 10 10 1
2 11 11 1
3 3 12 0
What is the fastest way to do this? My real problem has hundreds of thousands of rows, so looping over the entire data frame and updating rows one by one is extremely slow.
Thanks!

Easy as 1-2-3:
df = data.frame(a=c(1, 2, 3), b=c(10, 11, 12), c=c(1, 1, 0))
df$a[df$c == 1] <- df$b[df$c == 1]
df
## a b c
## 1 10 10 1
## 2 11 11 1
## 3 3 12 0
It reads: substitute all the elements in a corresponding to c==1 with all the elements in b corresponding to c==1.
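The same replacement can be written with the condition computed only once. This is a minimal sketch, not part of the original answer: the name idx is made up for illustration, and wrapping the condition in which() is an assumption that rows with a missing c should simply be left alone (R can reject a logical subscript containing NA in an assignment like this).

```r
df <- data.frame(a = c(1, 2, 3), b = c(10, 11, 12), c = c(1, 1, 0))

# Compute the matching row positions once and reuse them on both sides.
# which() drops NA conditions, so rows with a missing c are left untouched.
idx <- which(df$c == 1)
df$a[idx] <- df$b[idx]
df$a
```

Reusing idx also avoids evaluating df$c == 1 twice, which matters little here but keeps the two sides of the assignment in sync.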
A benchmark:
df <- data.frame(a=runif(100000), b=runif(100000), c=sample(c(1,0), 100000, replace=TRUE))
library(microbenchmark)
microbenchmark(df$a[df$c == 1] <- df$b[df$c == 1], df$a <- with(df, ifelse(c == 1, b, a)))
## Unit: milliseconds
## expr min lq median uq max neval
## df$a[df$c == 1] <- df$b[df$c == 1] 13.85375 15.13073 16.61701 74.5387 88.47949 100
## df$a <- with(df, ifelse(c == 1, b, a)) 44.23750 78.85029 103.01894 105.1750 118.09492 100

a$a <- with(a, ifelse(c == 1, b, a))

Related

Compare three (or more) variables in R with ifelse at once with loop

I want to compare three variables. If all have the same result (eg 0, 0, 0, and 2, 2, 2) returns a value (eg 'match').
I try this:
df_1 <- data.frame(
x = c(0, 1, 0, 2, 0),
y = c(0, 2, 1, 2, 1),
z = c(0, 2, 1, 2, 1)
)
ifelse(df_1$x == df_1$y == df_1$z, 'match', 'not')
Error: unexpected '==' in "ifelse(df_1$x == df_1$y =="
But it doesn't work. Thanks.
You need an & in there, so df_1$x == df_1$y & df_1$y == df_1$z, i.e. x equals y AND y equals z. You also don't need ifelse for this kind of comparison. Just do the comparison and add the output to your data frame:
df_1$match <- df_1$x == df_1$y & df_1$y == df_1$z
#### OUTPUT ####
x y z match
1 0 0 0 TRUE
2 1 2 2 FALSE
3 0 1 1 FALSE
4 2 2 2 TRUE
5 0 1 1 FALSE
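As a side note on why the chained form fails: R's parser does not accept an unparenthesised chain of == operators, and adding parentheses would make it parse but would not test what was intended. The sketch below uses plain vectors with the same values as the question; the names chained and pairwise are only illustrative.

```r
x <- c(0, 1, 0, 2, 0)
y <- c(0, 2, 1, 2, 1)
z <- c(0, 2, 1, 2, 1)

# Parenthesising makes the chained form parse, but it compares the
# logical result of x == y (TRUE/FALSE, treated as 1/0) with z.
chained <- (x == y) == z
chained

# Pairwise comparisons joined with & test what was actually intended.
pairwise <- x == y & y == z
pairwise
```

Here chained is FALSE for every row, including the rows where all three values agree, whereas pairwise flags rows 1 and 4.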
However, if you really want "match" and "not" you can do that too:
df_1$match <- ifelse(df_1$x == df_1$y & df_1$y == df_1$z, "match", "not")
#### OUTPUT ####
x y z match
1 0 0 0 match
2 1 2 2 not
3 0 1 1 not
4 2 2 2 match
5 0 1 1 not
Edit based on comment:
For an arbitrary number of variables you could try something like this, which checks that unique only returns one value, i.e. all are equal:
df_1$match <- apply(df_1, 1, function(r) length(unique(r)) == 1)
If you have a large number of variables you can do:
df_1$match <- c("match", "no match")[apply(df_1, 1, function(x) length(unique(x)) != 1) + 1]
df_1
x y z match
1 0 0 0 match
2 1 2 2 no match
3 0 1 1 no match
4 2 2 2 match
5 0 1 1 no match
This post gives various ways to test whether all elements of a vector are the same. Since a data frame is a list of vectors, you can choose one of these methods and apply it to your data frame with one of the *apply() functions, purrr, or a loop.
Here is one option with purrr:
library(purrr)
df_1$comparison <- map_chr(as.data.frame(t(df_1)), ~ ifelse(
length(unique(.x)) == 1, 'match', 'not'))
Output:
x y z comparison
1 0 0 0 match
2 1 2 2 not
3 0 1 1 not
4 2 2 2 match
5 0 1 1 not
You could potentially also use rowSums():
rowSums(df_1[, -1] == df_1[, 1]) == length(df_1[, -1])
[1] TRUE FALSE FALSE TRUE FALSE
It checks whether the columns from the second on are the same as the first column. If all of them are the same, it returns TRUE for that row.
And if you need a match/not result:
ifelse(rowSums(df_1[, -1] == df_1[, 1]) == length(df_1[, -1]), "match", "not")
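To make the rowSums() idea concrete, here is a small sketch that exposes the intermediate logical matrix. The names eq and all_same are illustrative and not part of the answer; the data are the question's df_1.

```r
df_1 <- data.frame(
  x = c(0, 1, 0, 2, 0),
  y = c(0, 2, 1, 2, 1),
  z = c(0, 2, 1, 2, 1)
)

# Each remaining column is compared element-wise with the first column,
# giving a logical matrix with one row per data frame row.
eq <- df_1[, -1] == df_1[, 1]
eq

# A row matches when every comparison in it is TRUE, i.e. when the
# number of TRUEs equals the number of compared columns.
all_same <- rowSums(eq) == ncol(eq)
all_same
```

Because the comparison and the row sums are both vectorised, this avoids calling a function once per row as apply() does.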
You may try ifelse with apply, and use unique to see if matched:
df_1$match <- apply(df_1, 1, function(x) ifelse(length(unique(x))==1, 'match','not'))
Here's an approach with Reduce()
n_cols <- length(df_1)
Reduce(`&`,
lapply(seq_len(n_cols - 1),
function(j) df_1[[j]] == df_1[[j+1]])
)
Here is the performance of some of the answers evaluating to TRUE or FALSE:
# A tibble: 4 x 13
expression min median
<bch:expr> <bch:t> <bch:t>
1 Reduce_way 47.7us 50.5us
2 rowSums(df_1[, -1] == df_1[, 1]) == length(df_1[, -1]) 159.6us 168.6us
3 apply(df_1, 1, function(x) length(unique(x)) == 1) 150.6us 158.1us
4 df_1[[1]] == df_1[[2]] & df_1[[2]] == df_1[[3]] 27.5us 29.6us
The performance depends on the number of columns and rows being evaluated. For instance 100,000 x 3:
df_1 <- as.data.frame(replicate(3, sample(3, 100000, replace = T)))
expression min median
<bch:expr> <bch:tm> <bch:t>
1 Reduce_way 931.5us 1.13ms
2 rowSums(df_1[, -1] == df_1[, 1]) == length(df_1[, -1]) 10.96ms 12.69ms
3 apply(df_1, 1, function(x) length(unique(x)) == 1) 1.01s 1.01s
4 df_1[[1]] == df_1[[2]] & df_1[[2]] == df_1[[3]] 894.8us 1.06ms
# following is used from here on out instead of writing out df_1[[1]] == ...
n_cols <- length(df_1)
eval_parse <- paste(
apply(matrix(rep(seq_len(n_cols), c(1, rep(2, n_cols - 2), 1)), 2),
2,
function(cols) paste0("df_1[[", cols, "]]", collapse = ' == ')
),
collapse = ' & '
)
## for 100 x 1000 data.frame
df_1 <- as.data.frame(replicate(1000, sample(3, 100, replace = T)))
# A tibble: 4 x 13
expression min median `itr/sec`
<bch:expr> <bch:> <bch:> <dbl>
1 Reduce_way 15.9ms 16.3ms 60.9
2 rowSums(df_1[, -1] == df_1[, 1]) == length(df_1[, -1]) 16.5ms 17.1ms 58.1
3 apply(df_1, 1, function(x) length(unique(x)) == 1) 10.4ms 10.7ms 92.4
4 eval(parse(text = eval_parse)) 20.1ms 20.6ms 47.4
Similar to @tmfmnk's answer (updated according to @Cole's comment):
ifelse(rowMeans(df_1 == df_1[, 1]) == 1, 'match', 'not')
#[1] "match" "not" "not" "match" "not"

Efficient implementation in computing pairwise differences

Suppose I have a data frame as follows:
> foo = data.frame(x = 1:9, id = c(1, 1, 2, 2, 2, 3, 3, 3, 3))
> foo
x id
1 1 1
2 2 1
3 3 2
4 4 2
5 5 2
6 6 3
7 7 3
8 8 3
9 9 3
I want a very efficient implementation of h(a, b) that computes the sum of all (a - xi)*(b - xj) for xi, xj belonging to the same id class. For example, my current implementation is
h <- function(a, b, foo){
a.diff = a - foo$x
b.diff = b - foo$x
prod = a.diff%*%t(b.diff)
id.indicator = as.matrix(ifelse(dist(foo$id, diag = T, upper = T),0,1)) + diag(nrow(foo))
return(sum(prod*id.indicator))
}
For example, with (a, b) = (0, 1), here is the output from each step in the function
> a.diff
[1] -1 -2 -3 -4 -5 -6 -7 -8 -9
> b.diff
[1] 0 -1 -2 -3 -4 -5 -6 -7 -8
> prod
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] 0 1 2 3 4 5 6 7 8
[2,] 0 2 4 6 8 10 12 14 16
[3,] 0 3 6 9 12 15 18 21 24
[4,] 0 4 8 12 16 20 24 28 32
[5,] 0 5 10 15 20 25 30 35 40
[6,] 0 6 12 18 24 30 36 42 48
[7,] 0 7 14 21 28 35 42 49 56
[8,] 0 8 16 24 32 40 48 56 64
[9,] 0 9 18 27 36 45 54 63 72
> id.indicator
1 2 3 4 5 6 7 8 9
1 1 1 0 0 0 0 0 0 0
2 1 1 0 0 0 0 0 0 0
3 0 0 1 1 1 0 0 0 0
4 0 0 1 1 1 0 0 0 0
5 0 0 1 1 1 0 0 0 0
6 0 0 0 0 0 1 1 1 1
7 0 0 0 0 0 1 1 1 1
8 0 0 0 0 0 1 1 1 1
9 0 0 0 0 0 1 1 1 1
In reality, there can be up to 1000 id clusters, and each cluster will contain at least 40 rows, making this method too inefficient because of the sparse entries in id.indicator and the extra computations in prod on the off-block-diagonals, which won't be used.
I played around a bit. First, your implementation:
foo = data.frame(x = 1:9, id = c(1, 1, 2, 2, 2, 3, 3, 3, 3))
h <- function(a, b, foo){
a.diff = a - foo$x
b.diff = b - foo$x
prod = a.diff%*%t(b.diff)
id.indicator = as.matrix(ifelse(dist(foo$id, diag = T, upper = T),0,1)) +
diag(nrow(foo))
return(sum(prod*id.indicator))
}
h(a = 1, b = 0, foo = foo)
#[1] 891
Next, I tried a variant using a proper sparse matrix implementation (via the Matrix package) and functions for the index matrix. I also use tcrossprod which I often find to be a bit faster than a %*% t(b).
library("Matrix")
h2 <- function(a, b, foo) {
a.diff <- a - foo$x
b.diff <- b - foo$x
prod <- tcrossprod(a.diff, b.diff) # the same as a.diff%*%t(b.diff)
id.indicator <- do.call(bdiag, lapply(table(foo$id), function(n) matrix(1,n,n)))
return(sum(prod*id.indicator))
}
h2(a = 1, b = 0, foo = foo)
#[1] 891
Note that this function relies on foo$id being sorted.
Lastly, I tried to avoid creating the full n by n matrix.
h3 <- function(a, b, foo) {
a.diff <- a - foo$x
b.diff <- b - foo$x
ids <- unique(foo$id)
res <- 0
for (i in seq_along(ids)) {
indx <- which(foo$id == ids[i])
res <- res + sum(tcrossprod(a.diff[indx], b.diff[indx]))
}
return(res)
}
h3(a = 1, b = 0, foo = foo)
#[1] 891
Benchmarking on your example:
library("microbenchmark")
microbenchmark(h(a = 1, b = 0, foo = foo),
h2(a = 1, b = 0, foo = foo),
h3(a = 1, b = 0, foo = foo))
# Unit: microseconds
# expr min lq mean median uq max neval
# h(a = 1, b = 0, foo = foo) 248.569 261.9530 493.2326 279.3530 298.2825 21267.890 100
# h2(a = 1, b = 0, foo = foo) 4793.546 4893.3550 5244.7925 5051.2915 5386.2855 8375.607 100
# h3(a = 1, b = 0, foo = foo) 213.386 227.1535 243.1576 234.6105 248.3775 334.612 100
Now, in this example, h3 is the fastest and h2 is really slow. But I guess that both will be faster than h for larger examples, and h3 will probably still win. While there is plenty of room for more optimization, h3 should be faster and more memory efficient. So, I think you should go for a variant of h3, which does not create unnecessarily large matrices.
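One further optimization follows from plain algebra rather than from any of the answers: within a group, the sum over all pairs of (a - xi)*(b - xj) equals the sum of the (a - xi) terms times the sum of the (b - xj) terms, so no outer product is needed at all. The sketch below is an illustration under that observation; h_sums and h_ref are hypothetical names, and h_ref only restates the per-group tcrossprod idea for comparison.

```r
foo <- data.frame(x = 1:9, id = c(1, 1, 2, 2, 2, 3, 3, 3, 3))

# Hypothetical helper: only per-group sums are computed.
h_sums <- function(a, b, foo) {
  sum_a <- tapply(a - foo$x, foo$id, sum)  # sum of (a - x_i) per id
  sum_b <- tapply(b - foo$x, foo$id, sum)  # sum of (b - x_j) per id
  sum(sum_a * sum_b)
}

# Reference: explicit per-group outer products, in the spirit of h3.
h_ref <- function(a, b, foo) {
  sum(sapply(split(foo$x, foo$id),
             function(x) sum(tcrossprod(a - x, b - x))))
}

h_sums(0, 1, foo)
h_ref(0, 1, foo)
```

Both calls return 891 on the example data. On non-integer data the two results may differ by floating-point rounding, so all.equal() is a safer comparison than ==.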
tapply lets you apply a function across groups of a vector, and will simplify the results to a matrix or vector if it can. Using tcrossprod to multiply all the combinations within each group, it performs well on suitably large data:
# setup
set.seed(47)
foo = data.frame(x = 1:9, id = c(1, 1, 2, 2, 2, 3, 3, 3, 3))
foo2 <- data.frame(id = sample(1000, 40000, TRUE), x = rnorm(40000))
h_OP <- function(a, b, foo){
a.diff = a - foo$x
b.diff = b - foo$x
prod = a.diff %*% t(b.diff)
id.indicator = as.matrix(ifelse(dist(foo$id, diag = T, upper = T),0,1)) + diag(nrow(foo))
return(sum(prod * id.indicator))
}
h3_AEBilgrau <- function(a, b, foo) {
a.diff <- a - foo$x
b.diff <- b - foo$x
ids <- unique(foo$id)
res <- 0
for (i in seq_along(ids)) {
indx <- which(foo$id == ids[i])
res <- res + sum(tcrossprod(a.diff[indx], b.diff[indx]))
}
return(res)
}
h_d.b <- function(a, b, foo){
sum(sapply(split(foo, foo$id), function(d) sum(outer(a-d$x, b-d$x))))
}
h_alistaire <- function(a, b, foo){
sum(tapply(foo$x, foo$id, function(x){sum(tcrossprod(a - x, b - x))}))
}
All return the same thing, and are not that different on small data:
h_OP(0, 1, foo)
#> [1] 891
h3_AEBilgrau(0, 1, foo)
#> [1] 891
h_d.b(0, 1, foo)
#> [1] 891
h_alistaire(0, 1, foo)
#> [1] 891
# small data test
microbenchmark::microbenchmark(
h_OP(0, 1, foo),
h3_AEBilgrau(0, 1, foo),
h_d.b(0, 1, foo),
h_alistaire(0, 1, foo)
)
#> Unit: microseconds
#> expr min lq mean median uq max neval cld
#> h_OP(0, 1, foo) 143.749 157.8895 189.5092 189.7235 214.3115 262.258 100 b
#> h3_AEBilgrau(0, 1, foo) 80.970 93.8195 112.0045 106.9285 125.9835 225.855 100 a
#> h_d.b(0, 1, foo) 355.084 381.0385 467.3812 437.5135 516.8630 2056.972 100 c
#> h_alistaire(0, 1, foo) 148.735 165.1360 194.7361 189.9140 216.7810 287.990 100 b
On bigger data, the differences become more stark, though. The original threatened to crash my laptop, but here are benchmarks for the fastest two:
# on 1k groups, 40k rows
microbenchmark::microbenchmark(
h3_AEBilgrau(0, 1, foo2),
h_alistaire(0, 1, foo2)
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> h3_AEBilgrau(0, 1, foo2) 336.98199 403.04104 412.06778 410.52391 423.33008 443.8286 100 b
#> h_alistaire(0, 1, foo2) 14.00472 16.25852 18.07865 17.22296 18.09425 96.9157 100 a
Another possibility is to use a data.frame to summarize by group, then sum the appropriate column. In base R you'd do this with aggregate, but dplyr and data.table are popular for making such an approach simpler with more complicated aggregations.
aggregate is slower than tapply. dplyr is faster than aggregate, but still slower than tapply. data.table, which is designed for speed, is almost exactly as fast as tapply.
library(dplyr)
library(data.table)
h_aggregate <- function(a, b, foo){sum(aggregate(x ~ id, foo, function(x){sum(tcrossprod(a - x, b - x))})$x)}
tidy_h <- function(a, b, foo){foo %>% group_by(id) %>% summarise(x = sum(tcrossprod(a - x, b - x))) %>% select(x) %>% sum()}
h_dt <- function(a, b, foo){setDT(foo)[, .(x = sum(tcrossprod(a - x, b - x))), by = id][, sum(x)]}
microbenchmark::microbenchmark(
h_alistaire(1, 0, foo2),
h_aggregate(1, 0, foo2),
tidy_h(1, 0, foo2),
h_dt(1, 0, foo2)
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> h_alistaire(1, 0, foo2) 13.30518 15.52003 18.64940 16.48818 18.13686 62.35675 100 a
#> h_aggregate(1, 0, foo2) 93.08401 96.61465 107.14391 99.16724 107.51852 143.16473 100 c
#> tidy_h(1, 0, foo2) 39.47244 42.22901 45.05550 43.94508 45.90303 90.91765 100 b
#> h_dt(1, 0, foo2) 13.31817 15.09805 17.27085 16.46967 17.51346 56.34200 100 a
sum(sapply(split(foo, foo$id), function(d) sum(outer(a-d$x, b-d$x))))
#[1] 891
#TESTING
foo = data.frame(x = sample(1:9,10000,replace = TRUE),
id = sample(1:3, 10000, replace = TRUE))
system.time(sum(sapply(split(foo, foo$id), function(d) sum(outer(a-d$x, b-d$x)))))
# user system elapsed
# 0.15 0.01 0.17

Define the value of a column in a dataframe based on 2 keys from a different dataframe

I have the following dataframe:
a <- seq(0, 5, by = 0.25)
b <-seq(0, 20, by = 1)
df <- data.frame(a, b)
and I'd like to create a new column "value", based on columns a and b, and the conversion table below:
a_min <- c(0,2, 0,2)
a_max <- c(2,5,2,5)
b_min <- c(0,0,10,10)
b_max <- c(10,10,30,30)
output <-c(1,2,3,4)
conv <- data.frame(a_min, a_max, b_min, b_max, output)
I've tried to do it using dplyr::mutate without much success...
require(dplyr)
mutate(df, value = calcula(conv, a, b))
longer object length is not a multiple of shorter object length
My expectation would be to obtain a dataframe like the 'df' above with the additional column value as per below:
df$value <- c(rep(1,8), rep(2,2), rep(4,11))
A possible relatively simple and very efficient data.table solution using binary non-equi joins
library(data.table) # v1.10.0
setDT(conv)[setDT(df), output, on = .(a_min <= a, a_max >= a, b_min <= b, b_max >= b)]
## [1] 1 1 1 1 1 1 1 1 1 2 2 2 4 4 4 4 4 4 4 4 4 4 4
As a side note, if output column is just the row index within conv, you could make this join even more efficient by just asking for the row indices by specifying which = TRUE
setDT(conv)[setDT(df), on = .(a_min <= a, a_max >= a, b_min <= b, b_max >= b), which = TRUE]
## [1] 1 1 1 1 1 1 1 1 1 2 2 2 4 4 4 4 4 4 4 4 4 4 4
One more option, this time with matrices.
with(df, with(conv, output[max.col(
outer(a, a_min, `>=`) + outer(a, a_max, `<=`) +
outer(b, b_min, `>=`) + outer(b, b_max, `<=`))]))
## [1] 1 1 1 1 1 1 1 1 1 2 2 4 4 4 4 4 4 4 4 4 4
outer compares each element of the vector from df with each element of the one from conv, producing a matrix of Booleans for each call. Since TRUE is 1, if you add all four matrices, the index you want will be the column with the highest count of TRUEs, which you can get with max.col. Subset output, and you've got your result.
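A small sketch of that intermediate matrix may help. It uses the question's conv table and two made-up points chosen to lie strictly inside one rectangle each; the names hits and value are illustrative. Note that max.col breaks ties at random by default, so the sketch passes ties.method = "first" as an assumption about wanting reproducible results for values that sit exactly on a boundary.

```r
conv <- data.frame(a_min = c(0, 2, 0, 2), a_max = c(2, 5, 2, 5),
                   b_min = c(0, 0, 10, 10), b_max = c(10, 10, 30, 30),
                   output = c(1, 2, 3, 4))

# Two illustrative points, each strictly inside a single rectangle.
a <- c(1, 3)
b <- c(5, 15)

# One row per point, one column per row of conv. Each cell counts how many
# of the four bound checks hold (4 means the point is inside that rectangle).
hits <- with(conv,
  outer(a, a_min, `>=`) + outer(a, a_max, `<=`) +
  outer(b, b_min, `>=`) + outer(b, b_max, `<=`))
hits

# Pick the column with the highest count for each point.
value <- conv$output[max.col(hits, ties.method = "first")]
value
```

The first point satisfies all four checks only for the first row of conv and the second point only for the fourth row, so value is 1 and 4.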
The benefit of working with matrices is that they're fast. Using @Phann's benchmarks on 1,000 rows:
Unit: microseconds
expr min lq mean median uq max neval cld
alistaire 276.099 320.4565 349.1045 339.8375 357.2705 941.551 100 a
akr1 830.934 966.6705 1064.8433 1057.6610 1152.3565 1507.180 100 ab
akr2 11431.246 11731.3125 12835.5229 11947.5775 12408.4715 36767.488 100 d
Pha 11985.129 12403.1095 13330.1465 12660.4050 13044.9330 29653.842 100 d
Ron 71132.626 74300.3540 81136.9408 78034.2275 88952.8765 98950.061 100 e
Dav1 2506.205 2765.4095 2971.6738 2948.6025 3082.4025 4065.368 100 c
Dav2 2104.481 2272.9180 2480.9570 2478.8775 2575.8740 3683.896 100 bc
and on 100,000 rows:
Unit: milliseconds
expr min lq mean median uq max neval cld
alistaire 30.00677 36.49348 44.28828 39.43293 54.28207 64.36581 100 a
akr1 36.24467 40.04644 48.46986 41.59644 60.15175 77.34415 100 a
Dav1 51.74218 57.23488 67.70289 64.11002 68.86208 382.25182 100 c
Dav2 48.48227 54.82818 60.25256 59.81041 64.92611 91.20212 100 b
We can try with Map with na.locf
library(zoo)
f1 <- function(u, v, x, y, z) z * NA^!((with(df, a >= u & a <v) & (b >=x & b <y)))
na.locf(do.call(pmax, c(do.call(Map, c(f=f1, unname(conv))), na.rm = TRUE)))
#[1] 1 1 1 1 1 1 1 1 2 2 4 4 4 4 4 4 4 4 4 4 4
Or another way to write the Map solution is to pass the 'a' and 'b' columns as arguments, and then do the logical evaluation with columns of 'conv' to extract the 'output' value and unlist the list output
unlist(Map(function(x, y)
with(conv, output[x >= a_min & a_max > x & y >= b_min & b_max > y]),
df$a, df$b))
#[1] 1 1 1 1 1 1 1 1 2 2 4 4 4 4 4 4 4 4 4 4
NOTE: The second solution should be slower as we are looping through the rows of the dataset while the first solution loops through the 'conv' rows (which we assume should not be many rows)
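The NA^! idiom used in f1 can look cryptic, so here is a minimal sketch of what it does on its own. The vector inside and the value z are made up for illustration; the idiom relies on NA^0 being 1 and NA^1 being NA.

```r
inside <- c(TRUE, FALSE, TRUE)
z <- 3

# !inside is FALSE (0) where the condition holds and TRUE (1) elsewhere.
# NA^0 is 1 and NA^1 is NA, so the product keeps z only where inside is TRUE.
masked <- z * NA^!inside
masked

# An equivalent, more explicit spelling of the same masking.
explicit <- ifelse(inside, z, NA)
explicit
```

Both give 3, NA, 3. In f1 each row of conv produces one such masked vector, and pmax(..., na.rm = TRUE) then merges them.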
Another approach using apply:
df$value <- unlist(apply(df, 1, function(x){
ifelse(length(OUT <- output[which(x[1] >= a_min & x[1] <= a_max & x[2] >= b_min & x[2] <= b_max)]) > 0, OUT, 0)
}))
EDIT:
Because there are several answers so far, I checked the time needed to process the data. I created a little bit bigger example (similar to the given one with random numbers):
set.seed(23563)
a <- runif(1000, 0, 5)
b <- runif(1000, 0, 20)
df <- data.frame(a, b)
require(microbenchmark)
library(zoo)
require(data.table)
microbenchmark(
akr1 = { #akrun 1
f1 <- function(u, v, x, y, z) z * NA^!((with(df, a >= u & a <v) & (b >=x & b <y)))
na.locf(do.call(pmax, c(do.call(Map, c(f=f1, unname(conv))), na.rm = TRUE)))
},
akr2 = { #akrun 2
unlist(Map(function(x, y)
with(conv, output[x >= a_min & a_max > x & y >= b_min & b_max > y]),
df$a, df$b))
},
Pha = { #Phann
df$value <- unlist(apply(df, 1, function(x){
ifelse(length(OUT <- output[which(x[1] >= a_min & x[1] <= a_max & x[2] >= b_min & x[2] <= b_max)]) > 0, OUT, 0)
}))
},
Ron = { #Ronak Shah
unlist(mapply(function(x, y)
conv$output[x >= conv$a_min & conv$a_max > x & y >= conv$b_min & conv$b_max > y],
df$a, df$b))
},
Dav1 ={ #David Arenburg 1
setDT(conv)[setDT(df), on = .(a_min <= a, a_max >= a, b_min <= b, b_max >= b)]$output
},
Dav2 = { #David Arenburg 2
setDT(conv)[setDT(df), on = .(a_min <= a, a_max >= a, b_min <= b, b_max >= b), which = TRUE]
},
times = 100L
)
With 1000 random numbers:
# Unit: milliseconds
# expr min lq mean median uq max neval
# akr1 4.267206 4.749576 6.259695 5.351494 6.843077 54.39187 100
# akr2 33.437853 39.912785 49.932875 47.416888 57.070369 91.55602 100
# Pha 30.433779 36.939692 48.205592 46.393800 55.800204 83.91640 100
# Ron 174.765021 199.648315 227.493117 223.314661 240.579057 370.26929 100
# Dav1 6.944759 7.814469 10.685460 8.536694 11.974102 44.47915 100
# Dav2 6.106978 6.706424 8.961821 8.161707 10.376085 28.91255 100
With 10000 random numbers (same seed), I get:
# Unit: milliseconds
# expr min lq mean median uq max neval
# akr1 23.48180 24.03962 26.16747 24.46897 26.19565 41.83238 100
# akr2 357.38290 398.69965 434.92052 409.15385 440.98210 829.85113 100
# Pha 320.39285 347.66632 376.98118 361.76852 383.08231 681.28500 100
# Ron 1661.50669 1788.06228 1873.70929 1837.28187 1912.04123 2499.23235 100
# Dav1 20.91486 21.60953 23.12278 21.94707 22.42773 44.71900 100
# Dav2 19.69506 20.22077 21.63715 20.55793 21.27578 38.96819 100
Here is another attempt to utilize findInterval's efficiency in both memory and speed. A more convenient format of the conv "data.frame" could be
(i) a "list" of the intervals for each variable which are not overlapping:
vecs = list(a = unique(c(conv$a_min, conv$a_max)),
b = unique(c(conv$b_min, conv$b_max)))
vecs
#$a
#[1] 0 2 5
#
#$b
#[1] 0 10 30
and, (ii) a lookup structure that contains the group of each paired interval between the two variables:
maps = xtabs(output ~ a_min + b_min)
maps
# b_min
#a_min 0 10
# 0 1 3
# 2 2 4
where, for example, we note that the first interval of "a" && second of "b" are assigned a "3" etc.
Then we can use:
maps[mapply(findInterval, df, vecs, all.inside = TRUE)]
# [1] 1 1 1 1 1 1 1 1 2 2 4 4 4 4 4 4 4 4 4 4 4
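The two building blocks here, findInterval and indexing a matrix with a two-column matrix, can be seen in isolation in the sketch below. The break vectors match vecs above, the matrix mirrors the layout of the xtabs table, and the sample points a and b are made up for illustration.

```r
breaks_a <- c(0, 2, 5)
breaks_b <- c(0, 10, 30)
# Same layout as the xtabs() table: rows index the a interval,
# columns index the b interval.
maps <- matrix(c(1, 2, 3, 4), nrow = 2)

a <- c(0, 1.5, 2, 5)
b <- c(3, 12, 10, 20)

# all.inside = TRUE keeps values at the upper limit in the last interval
# instead of returning an index one past it.
i <- findInterval(a, breaks_a, all.inside = TRUE)
j <- findInterval(b, breaks_b, all.inside = TRUE)
i
j

# A two-column matrix index picks one cell per (row, column) pair.
maps[cbind(i, j)]
```

For these points i is 1, 1, 2, 2 and j is 1, 2, 2, 2, so the lookup returns 1, 3, 4, 4. Since findInterval uses left-closed intervals, a value on an interior boundary falls into the upper interval.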
And extending the benchmarks of Phann and alistaire (re-written, partly, for convenience):
n = 1e6
set.seed(23563); a = runif(n, 0, 5); b = runif(n, 0, 20); df = data.frame(a, b)
library(microbenchmark); library(zoo); library(data.table)
alistaire = function() {
with(df, with(conv, output[max.col(
outer(a, a_min, `>=`) + outer(a, a_max, `<=`) +
outer(b, b_min, `>=`) + outer(b, b_max, `<=`))]))
}
david = function() {
as.data.table(conv)[setDT(df), output, on = .(a_min <= a, a_max >= a, b_min <= b, b_max >= b)]
}
akrun = function() {
f1 = function(u, v, x, y, z) z * NA^!((with(df, a >= u & a <v) & (b >=x & b <y)))
na.locf(do.call(pmax, c(do.call(Map, c(f=f1, unname(conv))), na.rm = TRUE)))
}
alex = function() {
vecs = list(a = unique(c(conv$a_min, conv$a_max)), b = unique(c(conv$b_min, conv$b_max)))
maps = xtabs(output ~ a_min + b_min)
maps[mapply(findInterval, df, vecs, all.inside = TRUE)]
}
identical(alistaire(), david())
#[1] TRUE
identical(david(), akrun())
#[1] TRUE
identical(akrun(), alex())
#[1] TRUE
microbenchmark(alistaire(), david(), akrun(), alex(), times = 20)
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# alistaire() 592.46700 718.07148 799.28933 792.98107 860.16414 1136.4489 20 b
# david() 1363.76196 1375.43935 1398.53515 1385.11747 1425.69837 1457.1693 20 d
# akrun() 824.11962 850.88831 903.58723 906.21007 958.04310 995.2129 20 c
# alex() 70.82439 72.65993 82.87961 76.77627 81.20356 179.7669 20 a
We can use mapply on two variables a and b and find the correct output variable based on the range
unlist(mapply(function(x, y)
conv$output[x >= conv$a_min & conv$a_max > x & y >= conv$b_min & conv$b_max > y],
df$a, df$b))
#[1] 1 1 1 1 1 1 1 1 2 2 4 4 4 4 4 4 4 4 4 4

R: how to check if all columns in a data.frame are the same

> df = data.frame(A = c(1, 2, 3), B = c(3, 2, 2), C = c(3, 2, 1)); df
A B C
1 1 3 3
2 2 2 2
3 3 2 1
> df2 = data.frame(A = c(1, 2, 3), B = c(1, 2, 3), C = c(1, 2, 3)); df2
A B C
1 1 1 1
2 2 2 2
3 3 3 3
I want to know if all the columns in my data.frame are the same. For df, it should be FALSE, whereas for df2 it should be TRUE.
You could check if the number of unique variable vectors is equal to one:
length(unique(as.list(df))) == 1
# [1] FALSE
length(unique(as.list(df2))) == 1
# [1] TRUE
Another way could be to check if each variable is identical to the first variable:
all(sapply(df, identical, df[,1]))
# [1] FALSE
all(sapply(df2, identical, df2[,1]))
# [1] TRUE
You can also check it using ‘all.equal’.
sapply(2:ncol(df),function(x) isTRUE(all.equal(df[,x-1],df[,x])))
[1] FALSE FALSE
sapply(2:ncol(df2),function(x) isTRUE(all.equal(df2[,x-1],df2[,x])))
[1] TRUE TRUE
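Whether to prefer identical or all.equal depends on whether small numeric differences should count. The sketch below is a generic illustration with made-up values, not taken from the answers: identical compares exactly, while all.equal allows a small numeric tolerance.

```r
x <- 0.1 + 0.2
y <- 0.3

# Exact comparison: the two doubles differ in their last bits.
exact <- identical(x, y)
exact

# Comparison within a numeric tolerance; isTRUE() is needed because
# all.equal() returns a description rather than FALSE when values differ.
approx <- isTRUE(all.equal(x, y))
approx
```

Here exact is FALSE and approx is TRUE, so columns that are the result of floating-point arithmetic may be better compared with all.equal.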
Perhaps worth mentioning the speed difference between the two solutions by josliber. The length(unique(..)) solution is the winner with small data, while all(sapply(...)) wins with large data.
df = data.frame(A = c(1, 2, 3), B = c(3, 2, 2), C = c(3, 2, 1))
df2 = data.frame(A = c(1, 2, 3), B = c(1, 2, 3), C = c(1, 2, 3))
# enlarge:
# df = do.call("rbind", replicate(10000, df, simplify = FALSE))
# df2 = do.call("rbind", replicate(10000, df2, simplify = FALSE))
microbenchmark::microbenchmark(
uniq1 =
{
length(unique(as.list(df))) == 1
},
uniq2 =
{
length(unique(as.list(df2))) == 1
},
ident1 =
{
all(sapply(df, identical, df[,1]))
},
ident2 =
{
all(sapply(df2, identical, df2[,1]))
}
)
# small:
Unit: microseconds
expr min lq mean median uq max neval cld
uniq1 4.243 4.5975 5.41435 5.0620 5.3685 19.852 100 a
uniq2 4.337 4.6425 5.80585 5.1340 5.3920 31.652 100 a
ident1 24.476 25.0100 28.22507 25.4255 26.4865 157.661 100 b
ident2 24.558 25.0380 28.08906 25.5215 26.6605 76.284 100 b
# large:
Unit: microseconds
expr min lq mean median uq max neval cld
uniq1 529.882 531.1020 537.98098 532.9360 538.0695 628.057 100 c
uniq2 872.855 874.7085 893.56305 884.1715 903.2400 987.257 100 d
ident1 25.004 26.2735 29.68082 27.7770 29.1075 55.286 100 a
ident2 369.629 371.1610 379.34730 372.6670 379.2495 455.276 100 b
Here is a new handy update to this relatively old question:
You can use the function all_equal from the package dplyr. The function returns TRUE if the two data frames are identical, otherwise a character vector describing the reasons why they are not equal.
Here is some more information: https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/all_equal

Apply a correction factor to one column based on the value of a second column

Example Data
A<-c(1,4,5,6,2,3,4,5,6,7,8,7)
B<-c(4,6,7,8,2,2,2,3,8,8,7,8)
DF<-data.frame(A,B)
What I would like to do is apply a correction factor to column A, based on the values of column B. The rules would be something like this
If B is less than 4 <- Multiply A by 1
If B is equal to or greater than 4 and less than 6 <- Multiply A by 2
If B is equal to or greater than 6 <- Multiply A by 4
I suppose I could write an "if" statement (and I'd be glad to see a good example), but I'd also be interested in using square bracket indexing to speed things up.
The end result would look like this
A B
2 4
16 6
20 7
24 8
etc.
Use this:
within(DF, A <- ifelse(B>=6, 4, ifelse(B<4, 1, 2)) * A)
Or this (corrected by @agstudy):
within(DF, {A[B>=6] <- A[B>=6]*4; A[B>=4 & B<6] <- A[B>=4 & B<6]*2})
Benchmarking:
DF <- data.frame(A=rpois(1e4, 5), B=rpois(1e4, 5))
a <- function(DF) within(DF, A <- ifelse(B>=6, 4, ifelse(B<4, 1, 2)) * A)
b <- function(DF) within(DF, {A[B>=6] <- A[B>=6]*4; A[B>=4 & B<6] <- A[B>=4 & B<6]*2})
identical(a(DF), b(DF))
#[1] TRUE
microbenchmark(a(DF), b(DF), times=1000)
#Unit: milliseconds
# expr min lq median uq max neval
# a(DF) 8.603778 10.253799 11.07999 11.923116 53.91140 1000
# b(DF) 3.763470 3.889065 5.34851 5.480294 39.72503 1000
Similar to @Ferdinand's solution but using transform
transform(DF, newcol = ifelse(B<4, A,
ifelse(B>=6,4*A,2*A)))
A B newcol
1 1 4 2
2 4 6 16
3 5 7 20
4 6 8 24
5 2 2 2
6 3 2 3
7 4 2 4
8 5 3 5
9 6 8 24
10 7 8 28
11 8 7 32
12 7 8 28
I prefer to use findInterval as an index into a set of factors for such operations. The proliferation of nested test-conditional and consequent vectors with multiple ifelse calls offends my efficiency sensibilities:
DF$A <- DF$A * c(1,2,4)[findInterval(DF$B, c(-Inf,4,6,Inf) ) ]
DF
A B
1 2 4
2 16 6
3 20 7
4 24 8
snipped ....
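To see how the lookup works step by step, here is a small sketch with a made-up vector B that covers every case, including the boundary values 4 and 6. The names bin and multiplier are illustrative.

```r
B <- c(2, 4, 5, 6, 8)

# Intervals are closed on the left: [-Inf, 4), [4, 6), [6, Inf).
bin <- findInterval(B, c(-Inf, 4, 6, Inf))
bin

# Use the interval number to look up the multiplier for each element.
multiplier <- c(1, 2, 4)[bin]
multiplier
```

bin is 1, 2, 2, 3, 3 and multiplier is 1, 2, 2, 4, 4, which matches the rules in the question: values of exactly 4 get a factor of 2 and values of exactly 6 get a factor of 4.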
Benchmark:
DF <- data.frame(A=rpois(1e4, 5), B=rpois(1e4, 5))
a <- function(DF) within(DF, A <- ifelse(B>=6, 4, ifelse(B<4, 1, 2)) * A)
b <- function(DF) within(DF, {A[B>=6] <- A[B>=6]*4; A[B>=4 & B<6] <- A[B>=4 & B<6]*2})
ccc <- function(DF) within(DF, {A <- A * c(1,2,4)[findInterval(B, c(-Inf,4,6,Inf) ) ]})
microbenchmark(a(DF), b(DF), ccc(DF), times=1000)
#-----------
Unit: microseconds
expr min lq median uq max neval
a(DF) 7616.107 7843.6320 8105.0340 8322.5620 93549.85 1000
b(DF) 2638.507 2789.7330 2813.8540 3072.0785 92389.57 1000
ccc(DF) 604.555 662.5335 676.0645 698.8665 85375.14 1000
Note: I would not have done this using within if I were coding my own function, but thought for fairness to the earlier effort, I would make it apples <-> apples.
