Related
I have a data frame where one column is a list of time-stamps. I need to annotate which time-stamps are valid or not, depending on whether or not they are close enough (i.e., within 1 second) to an element of another list of valid time-stamps. For this I have a helper function.
valid_times <- c(219.934, 229.996, 239.975, 249.935, 259.974, 344)
actual_times <- c(200, 210, 215, 220.5, 260)
strain <- c("green", "green", "green", "green", "green", "green")
valid_or_not <- c(rep("NULL", 6))
df <- data.frame(strain, actual_times, valid_or_not)
My data-frame looks like this:
strain actual_times valid_or_not
1 green 200.0 NULL
2 green 210.0 NULL
3 green 215.0 NULL
4 green 220.5 NULL
5 green 260.0 NULL
My helper (that checks to see if an actual_time is within 1 second of a valid time) is as follows:
valid_or_not_fxn<- function(actual_time){
c = "not valid"
for (i in 1:length(valid_times))
if (abs(valid_times[i] - actual_time) <= 1) {
c <- "valid"
} else {
}
return(c)
}
What I've tried to do is loop through the entire data-frame using a for loop with this helper function.
However....it's really slow (on my real data-set) because it's a nested loop cross-comparing two lists that are 100s of elements long. I can't figure out to optimize this.
df$valid_or_not <- as.character(df$valid_or_not)
for (i in 1:nrow(df))
print(df[i, "valid_or_not"])
df[i, "valid_or_not"] <- valid_or_not_fxn(df[i, "actual_times"])
Thank you for any help!
No matter what you do, you essentially have to do at least length(valid_times) comparisons. Probably better off looping over valid_times and comparing each item of that vector to your actual_times column as a vectorised operation. That way you'd only have 5 loop iterations.
One way of doing this is then:
df$test <- Reduce(`|`, lapply(valid_times, function(x) abs(df$actual_times - x) <= 1))
# strain actual_times valid_or_not
#1 green 200.0 FALSE
#2 green 210.0 FALSE
#3 green 215.0 FALSE
#4 green 220.5 TRUE
#5 green 260.0 TRUE
100K rows in df and 1000 valid_times test finishes in <4 seconds:
df2 <- df[sample(1:5,1e5,replace=TRUE),]
valid_times2 <- valid_times[sample(1:5,1000,replace=TRUE)]
system.time(Reduce(`|`, lapply(valid_times2, function(x) abs(df2$actual_times - x) <= 1)))
# user system elapsed
# 3.13 0.40 3.54
The easist way to do it is avoiding data frame operations. So you can do this check and populate the valid_or_not vector before combining them into the dataframe as:
valid_or_not[sapply(actual_times, function(x) any(abs(x - valid_times) <= 1))] <- "valid"
Note that, by this line, the valid_or_not vector is indexed with an equal length vector of boolean values (whether the condition is satisfied, T or F). So only TRUE valued indices from the vector are updated. valid_or_not and actual_times vectors must be of same length where as valid_times vector can be of different length.
By the way "plying" a for loop does not enhance the performance significantly since it is just a "wrapper" for "for" loops. Only performance increase comes from avoiding intermediary objects due to neater and more concise style of code and avoiding redundant copying in some cases. The same case is true for the Vectorize function: It just wraps the for loop that goes through the function and in for example "outer" function, the FUN must be "vectorized" in that manner. In fact it does not give the performance of a truely vectorized operation. In my example the performance enhancement comes from the substitution of the for loop with the "any" function.
And because of some kind of a "bug", subsetting data frames has an important penalty. As Hadley Wickham explains in Performance topic of Advanced-R:
Extracting a single value from a data frame
The following microbenchmark shows five ways to access a single value
(the number in the bottom-right corner) from the built-in mtcars
dataset. The variation in performance is startling: the slowest method
takes 30x longer than the fastest. There’s no reason that there has to
be such a huge difference in performance. It’s simply that no one has
had the time to fix it.
microbenchmark(
"[32, 11]" = mtcars[32, 11],
"$carb[32]" = mtcars$carb[32],
"[[c(11, 32)]]" = mtcars[[c(11, 32)]],
"[[11]][32]" = mtcars[[11]][32],
".subset2" = .subset2(mtcars, 11)[32] )
## Unit: nanoseconds
## expr min lq mean median uq max neval
## [32, 11] 15,300 16,300 18354 17,000 17,800 76,400 100
## $carb[32] 8,860 9,930 12836 10,600 11,600 85,400 100
## [[c(11, 32)]] 7,200 8,110 9293 8,780 9,350 21,300 100
## [[11]][32] 6,330 7,580 8377 8,100 8,690 20,900 100
## .subset2 334 566 4461 669 800 368,000 100
The most efficient way to subset a data frame is to use the .subset2 method. Your poor performance can mostly be attributed to this fact.
And as last notes:
If the "else" in your conditional statment does not do anything (just like in your example: else {}) you do not have to include it. R has some lazy operations (does not evaluate a statement as long as it is not executed inside the code), but that does not mean it always skips non-executed code portions.
The "character" values in your example are in fact categoric: Only
one of few values can be chosen for each entry. So there is no need
to store them as "characters" and they can be converted into factors
(which are just integer values). This can also enhance
performance.
An addition for #thelatemail 's working solution:
In R, "or" (|) operator isn't lazy while "any" function is. A ply combining or's work till the end while "any" function stops at the first encounter of a TRUE value - which enhances the performance (I will write a blog post on this topic ASAP). And vectorized "any" is almost as fast as native C code while *ply can be slightly faster than for loops in R (That I will benchmark and show in another blog post soon).
Some benchmarks showing this:
Pure "any" and | comparison:
> microbenchmark(any(T,F,F,F,F,F), T|F|F|F|F|F)
Unit: nanoseconds
expr min lq mean median uq max neval cld
any(T, F, F, F, F, F) 274 307.0 545.86 366.5 429.5 16380 100 a
T | F | F | F | F | F 597 626.5 903.47 668.5 730.0 18966 100 a
Pure "Reduce" and vectorization comparison:
> vec0 <- rep(1, 1e6)
> microbenchmark(Reduce("+", vec0), sum(vec0), times = 10)
Unit: microseconds
expr min lq mean median uq
Reduce("+", vec0) 308415.064 310071.953 318503.6048 312940.6355 317648.354
sum(vec0) 930.625 936.775 944.2416 943.5425 949.257
max neval cld
369864.993 10 b
962.349 10 a
And a reduced "|" vs. vectorized "any" comparison (for an extreme case). "any" beats by more than 1e5 times:
> vec1 <- c(T, rep(F, 1e6))
> microbenchmark(Reduce("|", vec1), any(vec1), times = 10)
Unit: nanoseconds
expr min lq mean median uq
Reduce("|", vec1) 394040518 395792399 402703632.6 399191803 400990304
any(vec1) 154 267 1932.5 2588 2952
max neval cld
441805451 10 b
3420 10 a
When the single TRUE is at the very end (so "any" is not lazy anymore and has to check the whole vector), "any" still beats by more than 400 times:
> vec2 <- c(rep(F, 1e6), T)
> microbenchmark(Reduce("|", vec2), any(vec2), times = 10)
Unit: microseconds
expr min lq mean median uq
Reduce("|", vec2) 396625.318 401744.849 416732.5087 407447.375 424538.222
any(vec2) 736.975 787.047 857.5575 832.137 926.076
max neval cld
482116.632 10 b
1013.732 10 a
When I was re-reading Hadley's Advanced R recently, I noticed that he said in Chapter 6 that `if` can be used as a function like
`if`(i == 1, print("yes"), print("no"))
(If you have the physical book in hand, it's on Page 80)
We know that ifelse is slow (Does ifelse really calculate both of its vectors every time? Is it slow?) as it evaluates all arguments. Will `if` be a good alternative to that as if seems to only evaluate TRUE arguments (this is just my assumption)?
Update: Based on the answers from #Benjamin and #Roman and the comments from #Gregor and many others, ifelse seems to be a better solution for vectorized calculations. I'm taking #Benjamin's answer here as it provides a more comprehensive comparison and for the community wellness. However, both answers(and the comments) are worth reading.
This is more of an extended comment building on Roman's answer, but I need the code utilities to expound:
Roman is correct that if is faster than ifelse, but I am under the impression that the speed boost of if isn't particularly interesting since it isn't something that can easily be harnessed through vectorization. That is to say, if is only advantageous over ifelse when the cond/test argument is of length 1.
Consider the following function which is an admittedly weak attempt at vectorizing if without having the side effect of evaluating both the yes and no conditions as ifelse does.
ifelse2 <- function(test, yes, no){
result <- rep(NA, length(test))
for (i in seq_along(test)){
result[i] <- `if`(test[i], yes[i], no[i])
}
result
}
ifelse2a <- function(test, yes, no){
sapply(seq_along(test),
function(i) `if`(test[i], yes[i], no[i]))
}
ifelse3 <- function(test, yes, no){
result <- rep(NA, length(test))
logic <- test
result[logic] <- yes[logic]
result[!logic] <- no[!logic]
result
}
set.seed(pi)
x <- rnorm(1000)
library(microbenchmark)
microbenchmark(
standard = ifelse(x < 0, x^2, x),
modified = ifelse2(x < 0, x^2, x),
modified_apply = ifelse2a(x < 0, x^2, x),
third = ifelse3(x < 0, x^2, x),
fourth = c(x, x^2)[1L + ( x < 0 )],
fourth_modified = c(x, x^2)[seq_along(x) + length(x) * (x < 0)]
)
Unit: microseconds
expr min lq mean median uq max neval cld
standard 52.198 56.011 97.54633 58.357 68.7675 1707.291 100 ab
modified 91.787 93.254 131.34023 94.133 98.3850 3601.967 100 b
modified_apply 645.146 653.797 718.20309 661.568 676.0840 3703.138 100 c
third 20.528 22.873 76.29753 25.513 27.4190 3294.350 100 ab
fourth 15.249 16.129 19.10237 16.715 20.9675 43.695 100 a
fourth_modified 19.061 19.941 22.66834 20.528 22.4335 40.468 100 a
SOME EDITS: Thanks to Frank and Richard Scriven for noticing my shortcomings.
As you can see, the process of breaking up the vector to be suitable to pass to if is a time consuming process and ends up being slower than just running ifelse (which is probably why no one has bothered to implement my solution).
If you're really desperate for an increase in speed, you can use the ifelse3 approach above. Or better yet, Frank's less obvious* but brilliant solution.
by 'less obvious' I mean, it took me two seconds to realize what he did. And per nicola's comment below, please note that this works only when yes and no have length 1, otherwise you'll want to stick with ifelse3
if is a primitive (complied) function called through the .Primitive interface, while ifelse is R bytecode, so it seems that if will be faster. Running some quick benchmarks
> microbenchmark(`if`(TRUE, "a", "b"), ifelse(TRUE, "a", "b"))
Unit: nanoseconds
expr min lq mean median uq max neval cld
if (TRUE) "a" else "b" 46 54 372.59 60.0 68.0 30007 100 a
ifelse(TRUE, "a", "b") 1212 1327 1581.62 1442.5 1617.5 11743 100 b
> microbenchmark(`if`(FALSE, "a", "b"), ifelse(FALSE, "a", "b"))
Unit: nanoseconds
expr min lq mean median uq max neval cld
if (FALSE) "a" else "b" 47 55 91.64 61.5 73 2550 100 a
ifelse(FALSE, "a", "b") 1256 1346 1688.78 1460.0 1677 17260 100 b
It seems that if not taking into account the code that is in actual branches, if is at least 20x faster than ifelse. However, note that this doesn't account the complexity of expression being tested and possible optimizations on that.
Update: Please note that this quick benchmark represent a very simplified and somewhat biased use case of if vs ifelse (as pointed out in the comments). While it is correct, it underrepresents the ifelse use cases, for that Benjamin's answer seems to provided more fair comparison.
Yes. I develop a for 152589 records using ifelse() took 90 min and using if() improve to 25min
for(i in ...){
# "Case 1"
# asesorMinimo<-( dummyAsesor%>%filter(FechaAsignacion==min(FechaAsignacion)) )[1,]
# asesorRegla<-tail(dummyAsesor%>%filter( FechaAsignacion<=dumFinClase)%>%arrange(FechaAsignacion),1)
# #Asigna Asesor
# dummyRow<-dummyRow%>%mutate(asesorRetencion=ifelse(dim(asesorRegla)[1]==0,asesorMinimo$OperadorNombreApellido,asesorRegla$OperadorNombreApellido))
# "Case 2"
asesorRegla<-tail(dummyAsesor%>%filter( FechaAsignacion<=dumFinClase)%>%arrange(FechaAsignacion),1)
asesorMinimo<-( dummyAsesor%>%filter(FechaAsignacion==min(FechaAsignacion)) )[1,]
if(dim(asesorRegla)[1]==0){
dummyRow<-dummyRow%>%mutate(asesorRetencion=asesorMinimo[1,7])
}else{
dummyRow<-dummyRow%>%mutate(asesorRetencion=asesorRegla[1,7])
}
}
Suppose I have a a 5 million row data frame, with two columns, as such (this data frame only has ten rows for simplicity):
df <- data.frame(start=c(11,21,31,41,42,54,61,63), end=c(20,30,40,50,51,63,70,72))
I want to be able to produce the following numbers in a numeric vector:
11 to 20, 21 to 30, 31 to 40, 41 to 50, 51, 54-63, 64-70, 71-72
And then take the length of the new vector (in this case, 10+10+10+10+1+10+7+2) = 60
*NOTE, I do not need the vector itself, just it's length will suffice. So if someone has a more intelligent logical approach to obtain the length, that is welcomed.
Essentially, what was done, was the for each row in the dataframe, the sequence from the start to end was taken, and all these sequences were combined, and then filtered for UNIQUE values.
So I used an approach as such:
length(unique(c(apply(df, 1, function(x) {
return(as.numeric(x[1]):as.numeric(x[2]))
}))))
which proves incredibly slow on my five million row data frame.
Any quicker more efficient solutions? Bonus, please try to add system time.
user system elapsed
19.946 0.620 20.477
This should work, assuming your data is sorted.
library(dplyr) # for the lag function
with(df, sum(end - pmax(start, lag(end, 1, default = 0)+1) + 1))
#[1] 60
library(microbenchmark)
microbenchmark(
beginneR={with(df, sum(end - pmax(start, lag(end, 1, default = 0)+1) + 1))},
r2evans={vec <- pmax(mm[,1], c(0,1+head(mm[,2],n=-1))); sum(mm[,2]-vec+1);},
times = 1000
)
Unit: microseconds
expr min lq median uq max neval
beginneR 37.398 41.4455 42.731 44.0795 74.349 1000
r2evans 31.788 35.2470 36.827 38.3925 9298.669 1000
So matrix is still faster, but not much (and the conversion step is still not included here). And I wonder why the max duration in #r2evans's answer is so high compared to all other values (which are really fast)
Another method:
mm <- as.matrix(df) ## critical for performance/scalability
(vec <- pmax(mm[,1], c(0,1+head(mm[,2],n=-1))))
## [1] 11 21 31 41 51 54 64 71
sum(mm[,2] - vec + 1)
## [1] 60
(This should scale reasonable well, certainly better than data.frames.)
Edit: after I updated my code to use matrices and no apply calls, I did a quick benchmark of my implementation compared with the other answer (which is also correct):
library(microbenchmark)
library(dplyr)
microbenchmark(
beginneR={
df <- data.frame(start=c(11,21,31,41,42,54,61,63),
end=c(20,30,40,50,51,63,70,72))
with(df, sum(end - pmax(start, lag(end, 1, default = 0)+1) + 1))
},
r2evans={
mm <- matrix(c(11,21,31,41,42,54,61,63,
20,30,40,50,51,63,70,72), nc=2)
vec <- pmax(mm[,1], c(0,1+head(mm[,2],n=-1)))
sum(mm[,2]-vec+1)
}
)
## Unit: microseconds
## expr min lq median uq max neval
## beginneR 230.410 238.297 244.9015 261.228 443.574 100
## r2evans 37.791 40.725 44.7620 47.880 147.124 100
This benefits greatly from the use of matrices instead of data.frames.
Oh, and system time is not that helpful here :-)
system.time({
mm <- matrix(c(11,21,31,41,42,54,61,63,
20,30,40,50,51,63,70,72), nc=2)
vec <- pmax(mm[,1], c(0,1+head(mm[,2],n=-1)))
sum(mm[,2]-vec+1)
})
## user system elapsed
## 0 0 0
Here's an example of what I mean, this code outputs the right thing:
list1 = list(c(1,2,3,4), c(5,6,7), c(8,9), c(10, 11))
matrix1 = rbind(c(1,2), c(1,5), c(8, 10))
compare <- function(list.t, matrix.t) {
pairs <- 0
for (i in 1:nrow(matrix.t)) {
for (j in 1:length(list.t)) {
if (length(intersect(matrix.t[i,], list.t[[j]])) == 2) {
pairs <- pairs + 1
}
}
}
return(pairs / nrow(matrix.t))
}
compare(list1, matrix1)
# = 0.33333
I hope that makes sense. I'm trying to take an nx2 matrix, and see if the two elements of each row of the matrix are also found in each section of the list. So, in the example above, the first row of the matrix is (1,2), and this pair is found in the first section of the list. The (1,5) or the (8,10) pairs are not found in any section of the list. So that's why I'm outputting 0.3333 (1/3).
I'm wondering if anyone knows a way that doesn't use two for-loops to compare each row to each section? I have larger matrices and lists, and so this is too slow.
Thank you for any help!
Wouldn't this work just the same? You could call sapply over the list and compare with all rows of the matrix simultaneously.
> list1 = list(c(1,2,3,4), c(5,6,7), c(8,9), c(10, 11))
> matrix1 = rbind(c(1,2), c(1,5), c(8, 10))
> s <- sapply(seq_along(list1), function(i){
length(intersect(list1[[i]], matrix1)) == 2
})
> sum(s)/nrow(matrix1)
# [1] 0.3333333
If we call your function f1(), and this sapply version of the same function f2(), we get the following difference in speed.
> library(microbenchmark)
> microbenchmark(f1(), f2())
# Unit: microseconds
# expr min lq median uq max neval
# f1() 245.017 261.2240 268.843 281.7350 1265.706 100
# f2() 113.727 117.7045 125.478 135.6945 268.310 100
Hopefully that's the increase in efficiency you're looking for.
This is offered in the spirit of your R golf challenge for your problem, a compact bu potentially inscrutable solution:
mean( apply(matrix1, 1,
function(x) any( {lapply(list1, function(z) {all(x %in% z) } )}) )
)
[1] 0.3333333
The inner lapply tests whether a particular element of list1 has both of the items in the two-element vector pass as a row from matrix1. Then the any function tests whether any of the 4 elements met the challenge for a particular row. The intermediate logical vector c(TRUE,FALSE,FALSE) is converted into a fraction by the mean. (It still really two nested loops.)
I have a vector of scalar values of which I'm trying to get: "How many different values there are".
For instance in group <- c(1,2,3,1,2,3,4,6) unique values are 1,2,3,4,6 so I want to get 5.
I came up with:
length(unique(group))
But I'm not sure it's the most efficient way to do it. Isn't there a better way to do this?
Note: My case is more complex than the example, consisting of around 1000 numbers with at most 25 different values.
Here are a few ideas, all points towards your solution already being very fast. length(unique(x)) is what I would have used as well:
x <- sample.int(25, 1000, TRUE)
library(microbenchmark)
microbenchmark(length(unique(x)),
nlevels(factor(x)),
length(table(x)),
sum(!duplicated(x)))
# Unit: microseconds
# expr min lq median uq max neval
# length(unique(x)) 24.810 25.9005 27.1350 28.8605 48.854 100
# nlevels(factor(x)) 367.646 371.6185 380.2025 411.8625 1347.343 100
# length(table(x)) 505.035 511.3080 530.9490 575.0880 1685.454 100
# sum(!duplicated(x)) 24.030 25.7955 27.4275 30.0295 70.446 100
You can use rle from base package
x<-c(1,2,3,1,2,3,4,6)
length(rle(sort(x))$values)
rle produces two vectors (lengths and values ). The length of values vector gives you the number of unique values.
I have used this function
length(unique(array))
and it works fine, and doesn't require external libraries.
uniqueN function from data.table is equivalent to length(unique(group)). It is also several times faster on larger datasets, but not so much on your example.
library(data.table)
library(microbenchmark)
xSmall <- sample.int(25, 1000, TRUE)
xBig <- sample.int(2500, 100000, TRUE)
microbenchmark(length(unique(xSmall)), uniqueN(xSmall),
length(unique(xBig)), uniqueN(xBig))
#Unit: microseconds
# expr min lq mean median uq max neval cld
#1 length(unique(xSmall)) 17.742 24.1200 34.15156 29.3520 41.1435 104.789 100 a
#2 uniqueN(xSmall) 12.359 16.1985 27.09922 19.5870 29.1455 97.103 100 a
#3 length(unique(xBig)) 1611.127 1790.3065 2024.14570 1873.7450 2096.5360 3702.082 100 c
#4 uniqueN(xBig) 790.576 854.2180 941.90352 896.1205 974.6425 1714.020 100 b
We can use n_distinct from dplyr
dplyr::n_distinct(group)
#[1] 5
If one wants to get number of unique elements in a matrix or data frame or list, the following code would do:
if( typeof(Y)=="list"){ # Y is a list or data frame
# data frame to matrix
numUniqueElems <- length( na.exclude( unique(unlist(Y)) ) )
} else if ( is.null(dim(Y)) ){ # Y is a vector
numUniqueElems <- length( na.exclude( unique(Y) ) )
} else { # length(dim(Y))==2, Yis a matrix
numUniqueElems <- length( na.exclude( unique(c(Y)) ) )
}