R: match elements between vectors - how to optimize the code

Let's imagine that we have a data frame with descriptions of several individuals:
des <- c('mad', 'crazy','stupid', 'crazy','wise','dumb','mad','furious')
id <- c(1,2,3,4,5,6,7,8)
d <-data.frame(id,des)
d$dangerous <- NA
dan <-c('mad','crazy','furious')
We want to match d$des against the descriptions in the vector dan.
I prepared the following loop:
for (i in 1:nrow(d)) {
  for (j in 1:length(dan)) {
    if (d$des[i] == dan[j]) {
      d$dangerous[i] <- 1
    }
  }
}
d
id des dangerous
1 1 mad 1
2 2 crazy 1
3 3 stupid NA
4 4 crazy 1
5 5 wise NA
6 6 dumb NA
7 7 mad 1
8 8 furious 1
The code works, but I wonder how to optimize it so that it can handle longer vectors and a larger data frame. Any ideas?

Here are timings of the several solutions posted, plus one of my own.
I have timed the functions with the original data.frame d and with a bigger data.frame, since the OP says this is an optimization problem.
OP <- function(DF, dan){
  DF$dangerous <- NA
  for (i in 1:nrow(DF)){
    for (j in 1:length(dan)){
      if (DF$des[i] == dan[j]) DF$dangerous[i] <- 1
    }
  }
  DF
}
Carles <- function(DF, dan){
  DF$dangerous <- ifelse(DF$des %in% dan, 1, NA)
  DF
}
arg0naut91_1 <- function(DF, dan){
  DF$dangerous <- NA
  transform(DF, dangerous = replace(dangerous, des %in% dan, 1))
}
arg0naut91_2 <- function(DF, dan){
  DF$dangerous <- NA
  DF$dangerous[DF$des %in% dan] <- 1
  DF
}
Rui <- function(DF, dan){
  DF$dangerous <- c(NA, 1)[(DF$des %in% dan) + 1]
  DF
}
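A brief aside on the indexing trick in Rui(), added here for clarity (not part of the original answer): the logical result of %in% plus 1 gives index 1 for FALSE and 2 for TRUE, which then picks NA or 1 out of c(NA, 1). On the question's data:
idx <- (d$des %in% dan) + 1   # FALSE -> 1, TRUE -> 2
idx
# [1] 2 2 1 2 1 1 2 2
c(NA, 1)[idx]
# [1]  1  1 NA  1 NA NA  1  1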
library(microbenchmark)
mb <- microbenchmark(
  OP = OP(d, dan),
  Carles = Carles(d, dan),
  Rui = Rui(d, dan),
  arg0naut91_1 = arg0naut91_1(d, dan),
  arg0naut91_2 = arg0naut91_2(d, dan)
)
print(mb, order = "median")
#Unit: microseconds
# expr min lq mean median uq max neval cld
# Rui 22.623 25.1865 82.73746 27.2510 31.6630 5441.491 100 a
# Carles 31.740 34.4120 76.82339 36.9385 42.1760 3753.407 100 a
# arg0naut91_2 34.131 36.7140 89.10827 39.5925 46.6930 4577.938 100 a
# arg0naut91_1 226.237 230.1020 296.23198 234.6225 243.3040 4847.553 100 a
# OP 757.831 770.1875 926.88995 781.5630 818.2745 10992.040 100 b
e <- d
for(i in 1:10) e <- rbind(e, e)
mb2 <- microbenchmark(
  OP = OP(e, dan),
  Carles = Carles(e, dan),
  Rui = Rui(e, dan),
  arg0naut91_1 = arg0naut91_1(e, dan),
  arg0naut91_2 = arg0naut91_2(e, dan),
  times = 10
)
print(mb2, order = "median")
#Unit: microseconds
# expr min lq mean median uq max neval cld
# Rui 291.090 294.690 346.3638 298.9580 301.238 776.769 10 a
# arg0naut91_2 288.123 292.236 312.6684 311.2435 314.495 388.212 10 a
# Carles 427.500 430.120 447.7170 450.2570 453.884 480.424 10 a
# arg0naut91_1 513.059 517.822 611.0255 666.7095 670.059 688.023 10 a
# OP 898781.320 909717.469 911988.3906 914269.7245 916975.858 919223.886 10 b
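Most of the cost of OP() on the bigger data comes from indexing and assigning into data-frame columns inside the loop. A sketch of the same loop working on extracted vectors (my addition, not timed above) illustrates the point:
OP_vec <- function(DF, dan){
  des <- as.character(DF$des)            # pull the column out once
  dangerous <- rep(NA_real_, length(des))
  for (i in seq_along(des)) {
    if (des[i] %in% dan) dangerous[i] <- 1
  }
  DF$dangerous <- dangerous              # write back in a single assignment
  DF
}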

Using ifelse() with %in% will do the trick:
d$dangerous <- ifelse(d$des %in% dan, 1, NA)
> d
id des dangerous
1 1 mad 1
2 2 crazy 1
3 3 stupid NA
4 4 crazy 1
5 5 wise NA
6 6 dumb NA
7 7 mad 1
8 8 furious 1

Another option:
transform(d, dangerous = replace(dangerous, des %in% dan, 1))
id des dangerous
1 1 mad 1
2 2 crazy 1
3 3 stupid NA
4 4 crazy 1
5 5 wise NA
6 6 dumb NA
7 7 mad 1
8 8 furious 1
Or:
d$dangerous[d$des %in% dan] <- 1
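For completeness, the same flagging can also be written with dplyr, if that workflow is preferred (a sketch assuming the dplyr package is installed; not one of the original answers):
library(dplyr)
d <- mutate(d, dangerous = ifelse(des %in% dan, 1, NA))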

Related

How to create a vector sequencing along non-NA elements

In R, given a vector with some elements as NA, how can I count along it skipping NAs?
For example:
let <- letters[1:10]
let[c(2,3,7,9)] <- NA
How would I get the following vector?
1,NA,NA,2,3,4,NA,5,NA,6
One way is to match all the indices of let to the non-NA indices:
> match(seq_along(let), which(!is.na(let)))
[1] 1 NA NA 2 3 4 NA 5 NA 6
Another option using seq_along
let[!is.na(let)] <- seq_along(let[!is.na(let)])
as.numeric(let)
# [1] 1 NA NA 2 3 4 NA 5 NA 6
Benchmark
library(microbenchmark)
n <- 1e7
let_long <- seq_len(n)
set.seed(1)
let_long[sample(seq_len(n), size = 1e6)] <- NA
benchmark <- microbenchmark(
  Karolis = Karolis(let_long),
  Markus = Markus(let_long),
  Snoram = Snoram(let_long),
  Alexandra = Alexandra(let_long),
  Frank = Frank(let_long)  # see comment under Snoram's answer
)
The timings are summarised below; autoplot(benchmark) produces the corresponding chart.
#Unit: milliseconds
# expr min lq mean median uq max neval
# Karolis 1042.0708 1216.6241 1314.9765 1290.3428 1374.7090 1807.4604 100
# Markus 210.3860 259.9957 310.0776 293.8244 363.4317 488.2171 100
# Snoram 714.4514 938.5760 1033.6168 1029.8205 1104.5614 1546.3733 100
# Alexandra 4317.5206 4470.2634 4665.9004 4603.6446 4771.5768 6495.3595 100
# Frank 103.3624 126.2842 166.7555 159.3568 190.5186 290.0422 100
Functions compared so far.
Karolis <- function(x) {
  match(seq_along(x), which(!is.na(x)))
}
Markus <- function(x) {
  x[!is.na(x)] <- seq_along(x[!is.na(x)])
  as.numeric(x)
}
Snoram <- function(x) {
  ifelse(is.na(x), NA, cumsum(!is.na(x)))
}
Alexandra <- function(x) {
  j <- 0
  for (i in 1:length(x)) {
    if (!is.na(x[i])) {
      j <- j + 1
      x[i] <- j
    }
  }
  as.numeric(x)
}
Frank <- function(x) {
  replace(cumsum(!is.na(x)), is.na(x), NA)
}
Other options include:
Use ifelse() and cumsum()
ifelse(is.na(let), NA, cumsum(!is.na(let)))
#[1] 1 NA NA 2 3 4 NA 5 NA 6
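To see why this works, look at the intermediate cumulative count before the NA positions are masked (a small illustration using the question's let, added here):
!is.na(let)
# [1]  TRUE FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE
cumsum(!is.na(let))
# [1] 1 1 1 2 3 4 4 5 5 6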
This would get the result you are looking for, but it isn't efficient:
let <- letters[1:10]
let[c(2,3,7,9)] <- NA
j <- 0
for (i in 1:length(let)) {
  if (!is.na(let[i])) {
    j <- j + 1
    let[i] <- j
  }
}

Faster method for getting a data frame of ordered column names from a numeric data frame in R

I have a dataframe with three columns:
set.seed(123)
df <- data.frame(x = abs(rnorm(10)), y = abs(rnorm(10)), z = abs(rnorm(10)))
df
x y z
1 0.56047565 1.2240818 1.0678237
2 0.23017749 0.3598138 0.2179749
3 1.55870831 0.4007715 1.0260044
4 0.07050839 0.1106827 0.7288912
5 0.12928774 0.5558411 0.6250393
6 1.71506499 1.7869131 1.6866933
7 0.46091621 0.4978505 0.8377870
8 1.26506123 1.9666172 0.1533731
9 0.68685285 0.7013559 1.1381369
10 0.44566197 0.4727914 1.2538149
I want to construct a data frame with the same number of rows, containing in each row the column names of df ordered by the corresponding row values in df. I have a for-loop based approach that works, but it is too slow for a large data frame, so I am looking for a faster, vectorized approach. Here is the for-loop based approach:
df_names <- df
df_names[,] <- NA
df_names
x y z
1 NA NA NA
2 NA NA NA
3 NA NA NA
4 NA NA NA
5 NA NA NA
6 NA NA NA
7 NA NA NA
8 NA NA NA
9 NA NA NA
10 NA NA NA
for (r in 1:nrow(df)) {
  sorted_row <- sort(df[r, ], decreasing = TRUE)
  df_names[r, ] <- colnames(sorted_row)
}
df_names
x y z
1 y z x
2 y x z
3 x z y
4 z y x
5 z y x
6 y x z
7 z y x
8 y x z
9 z y x
10 z y x
How do I do this faster using the apply family or vectorization?
Revised: I merged all attempts and the corrections by @rawr; @rawr's approach was the best so far, with about a 30x saving. @989 then added a much faster approach; see the accepted answer by @989.
library(microbenchmark)
set.seed(123)
df <- data.frame(x = abs(rnorm(1000)), y = abs(rnorm(1000)), z = abs(rnorm(1000)))
get_name_df_with_for <- function(df) {
  df_names <- df
  df_names[,] <- NA
  for (r in 1:nrow(df)) {
    df_names[r, ] <- colnames(sort(df[r, ], decreasing = TRUE))
  }
  return(df_names)
}
get_name_df_with_apply <- function(df) {
  df_names <- data.frame(t(apply(df, 1, function(row) names(sort(row, decreasing = TRUE)))))
  return(df_names)
}
get_name_df_with_apply_names <- function(df) {
  df_names <- data.frame(t(apply(df, 1, function(row) names(row)[order(row, decreasing = TRUE)])))
  return(df_names)
}
get_name_df_double_t <- function(df) {
  df_names <- data.frame(t(apply(t(df), 2, function(col) names(sort(col, decreasing = TRUE)))))
  return(df_names)
}
microbenchmark(
  "for" = get_name_df_with_for(df),
  "double_transpose" = get_name_df_double_t(df),
  "apply" = get_name_df_with_apply(df),
  "apply_with_names" = get_name_df_with_apply_names(df),
  times = 10
)
Unit: milliseconds
expr min lq mean median uq max neval
for 417.08341 424.37019 446.00655 451.67451 459.64900 480.33351 10
double_transpose 28.46577 29.96637 32.44685 33.02763 33.51309 36.77468 10
apply 27.54800 28.27331 38.02239 30.36667 37.29727 71.46596 10
apply_with_names 12.35264 12.59502 14.16868 13.92946 15.80656 17.22005 10
If the number of columns in your df is just three, here is a faster solution using max.col. It is roughly 8x faster than the fastest solution proposed in the other answer when nrow(df) = 100.
The case where nrow(df) = 100:
library(microbenchmark)
set.seed(123)
size <- 100
df <- data.frame(x = abs(rnorm(size)), y = abs(rnorm(size)), z = abs(rnorm(size)))
f1 <- function(df){
  vec <- unlist(t(df))                  # flatten df row by row into one vector
  sq <- seq(0, (nrow(df)-1)*3, 3)       # offset of each row's first element in vec
  m1 <- max.col(df)                     # column index of the largest value per row
  vec[sq + m1] <- -Inf                  # mask the largest value in each row
  m2 <- max.col(matrix(vec, ncol = 3, byrow = TRUE))  # second largest per row
  vec[sq + m2] <- -Inf                  # mask it as well
  m3 <- max.col(matrix(vec, ncol = 3, byrow = TRUE))  # what remains is the smallest
  nm <- names(df)
  cbind(nm[m1], nm[m2], nm[m3])
}
all(f1(df)==get_name_df_with_for(df))
# [1] TRUE
all(f1(df)==get_name_df_with_apply(df))
# [1] TRUE
all(f1(df)==get_name_df_with_apply_names(df))
# [1] TRUE
all(f1(df)==get_name_df_double_t(df))
# [1] TRUE
microbenchmark(f1(df), "f2"=get_name_df_with_for(df), "f3"=get_name_df_with_apply(df),
"f4"=get_name_df_with_apply_names(df), "f5"=get_name_df_double_t(df))
# Unit: microseconds
# expr min lq mean median uq max neval
# f1(df) 395.643 458.0905 470.8278 472.633 492.7355 701.464 100
# f2 59262.146 61773.0865 63098.5840 62963.223 64309.4780 74246.953 100
# f3 5491.521 5637.1605 6754.3912 5801.619 5956.4545 90457.611 100
# f4 3392.689 3463.9055 3603.1546 3569.125 3707.2795 4237.012 100
# f5 5513.335 5636.3045 5954.9277 5781.089 5971.2115 8622.017 100
Significantly faster when nrow(df)=1000
# Unit: microseconds
# expr min lq mean median uq max neval
# f1(df) 693.765 769.8995 878.3698 815.6655 846.4615 3559.929 100
# f2 627876.429 646057.8155 671925.4799 657768.6270 694047.9940 797900.142 100
# f3 49570.397 52038.3515 54334.0501 53838.8465 56181.0515 62517.965 100
# f4 28892.611 30046.8180 31961.4085 31262.4040 33057.5525 48694.850 100
# f5 49866.379 51491.7235 54413.8287 53705.3970 55962.0575 75287.600 100

Resampling from subject IDs in R

Assume we have the following data
set.seed(123)
dat <- data.frame(var1=c(10,35,13,19,15,20,19), id=c(1,1,2,2,2,3,4))
(sampledIDs <- sample(min(dat$id):max(dat$id), size=3, replace=TRUE))
> [1] 2 4 2
The sampledIDs is a vector of id's that is sampled (with replacement) from dat$id.
I need the code that results in (and works also for a large dataset with more variables):
var1 id
13 2
19 2
15 2
19 4
13 2
19 2
15 2
The code dat[which(dat$id %in% sampledIDs), ] does not give me what I want, since the result of this code is
var1 id
13 2
19 2
15 2
19 4
where the rows with dat$id == 2 appear only once in this data (I understand why this is the result, but I don't know how to get what I want). Can someone please help?
EDIT: Thank you for the answers; here are the runtimes of all the answers (for those who are interested):
test replications elapsed relative user.self
3 dat[unlist(lapply(sampledIDs, function(x) which(x == dat$id))), ] 1000 0.67 1.000 0.64
1 dat[which(sapply(sampledIDs, "==", dat$id), arr.ind = TRUE)[, 1], ] 1000 0.67 1.000 0.67
2 do.call(rbind, split(dat, dat$id)[as.character(sampledIDs)]) 1000 1.83 2.731 1.83
4 setkey(setDT(dat), id)[J(sampledIDs)] 1000 1.33 1.985 1.33
This would probably be the fastest approach for a big data set, using a data.table binary search:
library(data.table)
setkey(setDT(dat), id)[J(sampledIDs)]
# var1 id
# 1: 13 2
# 2: 19 2
# 3: 15 2
# 4: 19 4
# 5: 13 2
# 6: 19 2
# 7: 15 2
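A closely related formulation for newer data.table versions (roughly 1.9.6 and later) joins with the on= argument and avoids setting a key; this is a sketch of the same idea, not part of the original answer, and dtk is just an illustrative name:
library(data.table)
dtk <- as.data.table(dat)           # work on a copy, leaving dat untouched
dtk[.(id = sampledIDs), on = "id"]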
Edit:
Here's a benchmark on a not-so-big data set (1e+05 rows) which illustrates the clear winner:
library(data.table)
library(microbenchmark)
set.seed(123)
n <- 1e5
dat <- data.frame(var1 = sample(seq_len(100), n, replace = TRUE), id = sample(seq_len(10), n, replace = TRUE))
(sampledIDs <- sample(min(dat$id) : max(dat$id), size = 3, replace = TRUE))
dat2 <- copy(dat)
Sven1 <- function(dat) dat[unlist(lapply(sampledIDs, function(x) which(x == dat$id))), ]
Sven2 <- function(dat) dat[which(sapply(sampledIDs, "==", dat$id), arr.ind = TRUE)[ , 1], ]
flodel <- function(dat) do.call(rbind, split(dat, dat$id)[as.character(sampledIDs)])
David <- function(dat2) setkey(setDT(dat2), id)[J(sampledIDs)]
Res <- microbenchmark(Sven1(dat),
Sven2(dat),
flodel(dat),
David(dat2))
Res
# Unit: milliseconds
# expr min lq median uq max neval
# Sven1(dat) 4.356151 4.817557 6.715533 7.313877 45.407768 100
# Sven2(dat) 9.750984 12.385677 14.324671 16.655005 54.797096 100
# flodel(dat) 36.097602 39.680006 42.236017 44.314981 82.261879 100
# David(dat2) 1.813387 2.068749 2.154774 2.335442 8.665379 100
boxplot(Res)
If, for example, we would like to sample more than just 3 IDs, let's say 10, the gap becomes ridiculous:
(sampledIDs <- sample(min(dat$id) : max(dat$id), size = 10, replace = TRUE))
[1] 7 6 10 9 5 9 5 3 7 3
# Unit: milliseconds
# expr min lq median uq max neval
# Sven1(dat) 80.124502 89.141162 97.908365 104.111738 175.40919 100
# Sven2(dat) 99.010410 127.797966 159.404395 170.751069 209.96887 100
# flodel(dat) 129.722435 144.847505 157.737362 178.242103 232.41293 100
# David(dat2) 2.431682 2.721038 2.855103 3.057796 19.60826 100
You can do:
do.call(rbind, split(dat, dat$id)[as.character(sampledIDs)])
One approach:
dat[unlist(lapply(sampledIDs, function(x) which(x == dat$id))), ]
# var1 id
# 3 13 2
# 4 19 2
# 5 15 2
# 7 19 4
# 3.1 13 2
# 4.1 19 2
# 5.1 15 2
An alternative approach:
dat[which(sapply(sampledIDs, "==", dat$id), arr.ind = TRUE)[ , 1], ]
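How this works, for clarity (an illustration on the question's data, added here): sapply builds one logical column per sampled ID, and which(..., arr.ind = TRUE) lists the TRUE positions column by column, so the first column of that result gives the matching row numbers in the order of sampledIDs.
m <- sapply(sampledIDs, "==", dat$id)   # 7 x 3 logical matrix, one column per sampled ID
which(m, arr.ind = TRUE)[ , 1]
# [1] 3 4 5 7 3 4 5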

R code to 'tidy' the values of a discrete variable

Consider this data:
set.seed(200914)
y <- round(runif(20, 5, 15))
y
table(y)
In the real application y is a categorical variable such as "outcome code". I want to recode y so that its values are 1:n, while preserving order (sometimes the variable may be ordinal).
One answer is:
(ya <- y - min(y) + 1)
table(ya)
But this solution does not have minimal range, which may make subsequent code inefficient. Trying again...
(suy <- sort(unique(y)))
(n <- length(suy))
yb <- y
for (i in 1:n) yb[which(y == suy[i])] <- i
table(yb)
yb is exactly what I want, but I wonder if I am computing it in the most efficient way?
Try
yc <- as.numeric(factor(y))
table(yc)
#yc
#1 2 3 4 5 6 7 8
#1 4 1 1 6 3 3 1
since essentially what you're looking for are the factor codes (I think).
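A small variation on the same idea (my note, not part of the original answer): the factor codes are already integers, so as.integer() gives the same values without the conversion to double:
yc2 <- as.integer(factor(y))
identical(as.integer(yc), yc2)
# [1] TRUE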
Try:
yc <- match(y, sort(unique(y)))
table(yc)
#1 2 3 4 5 6 7 8
#1 4 1 1 6 3 3 1
all.equal(yb,yc)
#[1] TRUE
Another option might be findInterval
table(findInterval(y, sort( unique(y))))
#1 2 3 4 5 6 7 8
#1 4 1 1 6 3 3 1
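Why findInterval works here (my note): because sort(unique(y)) contains every distinct value of y in increasing order, each element of y falls exactly on one of the breakpoints, so findInterval returns the position of that value among the distinct values, i.e. the same codes as above. A quick way to inspect this:
suy <- sort(unique(y))
head(data.frame(y = y, code = findInterval(y, suy)))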
Benchmarks
set.seed(25)
y <- sample(1:20, 1e6,replace=TRUE)
f1 <- function() {
  suy <- sort(unique(y))
  n <- length(suy)
  yb <- y
  for (i in 1:n) yb[which(y == suy[i])] <- i
  table(yb)
}
f2 <- function() {
  yc <- as.numeric(factor(y))
  table(yc)
}
f3 <- function() {
  yd <- match(y, sort(unique(y)))
  table(yd)
}
f4 <- function() {
  ye <- findInterval(y, sort(unique(y)))
  table(ye)
}
library(microbenchmark)
microbenchmark(f1(), f2(), f3(), f4(), unit="relative", times=25L)
# Unit: relative
# expr min lq median uq max neval
# f1() 1.198901 1.208551 1.235237 1.242697 1.600400 25
# f2() 3.745317 3.593736 3.593330 3.596990 3.488292 25
# f3() 1.000000 1.000000 1.000000 1.000000 1.000000 25
# f4() 1.017857 1.038056 1.047112 1.038731 1.014825 25

Apply a correction factor to one column based on the value of a second column

Example Data
A<-c(1,4,5,6,2,3,4,5,6,7,8,7)
B<-c(4,6,7,8,2,2,2,3,8,8,7,8)
DF<-data.frame(A,B)
What I would like to do is apply a correction factor to column A, based on the values of column B. The rules would be something like this
If B is less than 4, multiply A by 1.
If B is greater than or equal to 4 and less than 6, multiply A by 2.
If B is greater than or equal to 6, multiply A by 4.
I suppose I could write an "if" statement (and I'd be glad to see a good example), but I'd also be interested in using square bracket indexing to speed things up.
The end result would look like this
A B
2 4
16 6
20 7
24 8
etc.
Use this:
within(DF, A <- ifelse(B>=6, 4, ifelse(B<4, 1, 2)) * A)
Or this (corrected by #agstudy):
within(DF, {A[B>=6] <- A[B>=6]*4; A[B>=4 & B<6] <- A[B>=4 & B<6]*2})
Benchmarking:
library(microbenchmark)
DF <- data.frame(A=rpois(1e4, 5), B=rpois(1e4, 5))
a <- function(DF) within(DF, A <- ifelse(B>=6, 4, ifelse(B<4, 1, 2)) * A)
b <- function(DF) within(DF, {A[B>=6] <- A[B>=6]*4; A[B>=4 & B<6] <- A[B>=4 & B<6]*2})
identical(a(DF), b(DF))
#[1] TRUE
microbenchmark(a(DF), b(DF), times=1000)
#Unit: milliseconds
# expr min lq median uq max neval
# a(DF) 8.603778 10.253799 11.07999 11.923116 53.91140 1000
# b(DF) 3.763470 3.889065 5.34851 5.480294 39.72503 1000
Similar to #Ferdinand solution but using transform
transform(DF, newcol = ifelse(B<4, A,
ifelse(B>=6,4*A,2*A)))
A B newcol
1 1 4 2
2 4 6 16
3 5 7 20
4 6 8 24
5 2 2 2
6 3 2 3
7 4 2 4
8 5 3 5
9 6 8 24
10 7 8 28
11 8 7 32
12 7 8 28
I prefer to use findInterval as an index into a set of factors for such operations. The proliferation of nested test-conditional and consequent vectors with multiple ifelse calls offends my efficiency sensibilities:
DF$A <- DF$A * c(1,2,4)[findInterval(DF$B, c(-Inf,4,6,Inf) ) ]
DF
A B
1 2 4
2 16 6
3 20 7
4 24 8
snipped ....
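To make the indexing step concrete (an illustration on the example DF, added here): findInterval() maps each B value to bin 1, 2, or 3 according to the breaks, and that bin number picks the multiplier out of c(1, 2, 4).
findInterval(DF$B, c(-Inf, 4, 6, Inf))
# [1] 2 3 3 3 1 1 1 1 3 3 3 3
c(1, 2, 4)[findInterval(DF$B, c(-Inf, 4, 6, Inf))]
# [1] 2 4 4 4 1 1 1 1 4 4 4 4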
Benchmark:
DF <- data.frame(A=rpois(1e4, 5), B=rpois(1e4, 5))
a <- function(DF) within(DF, A <- ifelse(B>=6, 4, ifelse(B<4, 1, 2)) * A)
b <- function(DF) within(DF, {A[B>=6] <- A[B>=6]*4; A[B>=4 & B<6] <- A[B>=4 & B<6]*2})
ccc <- function(DF) within(DF, A <- A * c(1,2,4)[findInterval(B, c(-Inf,4,6,Inf))])
microbenchmark(a(DF), b(DF), ccc(DF), times=1000)
#-----------
Unit: microseconds
expr min lq median uq max neval
a(DF) 7616.107 7843.6320 8105.0340 8322.5620 93549.85 1000
b(DF) 2638.507 2789.7330 2813.8540 3072.0785 92389.57 1000
ccc(DF) 604.555 662.5335 676.0645 698.8665 85375.14 1000
Note: I would not have done this using within if I were coding my own function, but I thought that, for fairness to the earlier efforts, I would keep it apples <-> apples.
