I'm running a simulation where I need to repeatedly extract 1 column from a matrix and check each of its values against some condition (e.g. < 10). However, doing so with a matrix is 3 times slower than doing the same thing with a data.frame. Why is this the case?
I'd like to use matrices to store the simulation data because they are faster for some other operations (e.g. updating columns by adding/subtracting values). How can I extract columns / subset a matrix in a faster way?
Extract column from data.frame vs matrix:
df <- data.frame(a = 1:1e4)
m <- as.matrix(df)
library(microbenchmark)
microbenchmark(
df$a,
m[ , "a"])
# Results; Unit: microseconds
# expr min lq mean median uq max neval cld
# df$a 5.463 5.8315 8.03997 6.612 8.0275 57.637 100 a
# m[ , "a"] 64.699 66.6265 72.43631 73.759 75.5595 117.922 100 b
Extract single value from data.frame vs matrix:
microbenchmark(
df[1, 1],
df$a[1],
m[1, 1],
m[ , "a"][1])
# Results; Unit: nanoseconds
# expr min lq mean median uq max neval cld
# df[1, 1] 8248 8753.0 10198.56 9818.5 10689.5 48159 100 c
# df$a[1] 4072 4416.0 5247.67 5057.5 5754.5 17993 100 b
# m[1, 1] 517 708.5 828.04 810.0 920.5 2732 100 a
# m[ , "a"][1] 45745 47884.0 51861.90 49100.5 54831.5 105323 100 d
I expected the matrix column extraction to be faster, but it was slower. However, extracting a single value from a matrix (i.e. m[1, 1]) was faster than both of the ways of doing so with a data.frame. I'm lost as to why this is.
Extract row vs column, data.frame vs matrix:
The above is only true for selecting columns. When selecting rows, matrices are much faster than data.frames. Still don't know why.
microbenchmark(
df[1, ],
m[1, ],
df[ , 1],
m[ , 1])
# Result: Unit: nanoseconds
# expr min lq mean median uq max neval cld
# df[1, ] 16359 17243.5 18766.93 17860.5 19849.5 42973 100 c
# m[1, ] 718 999.5 1175.95 1181.0 1327.0 3595 100 a
# df[ , 1] 7664 8687.5 9888.57 9301.0 10535.5 42312 100 b
# m[ , 1] 64874 66218.5 72074.93 73717.5 74084.5 97827 100 d
data.frame
Consider the built-in data frame BOD. Data frames are stored as a list of columns, and the inspect output below shows the address of each of the two columns of BOD. We then assign its second column to BOD2. Note that the address of BOD2 is the same memory location as the second column shown in the inspect output for BOD. That is, all R did to create BOD2 was point it at memory within BOD; no data was moved at all. Another way to see this is to compare the sizes of BOD, BOD2 and both together: both together take up the same amount of memory as BOD alone, so there cannot have been any copying. (Continued after code.)
library(pryr)
BOD2 <- BOD[[2]]
inspect(BOD)
## <VECSXP 0x507c278>
## <REALSXP 0x4f81f48>
## <REALSXP 0x4f81ed8> <--- compare this address to address shown below
## ...snip...
BOD2 <- BOD[,2]
address(BOD2)
## [1] "0x4f81ed8"
object_size(BOD)
## 1.18 kB
object_size(BOD2)
## 96 B
object_size(BOD, BOD2) # same as object_size(BOD) above
## 1.18 kB
matrix
Matrices are stored as one long vector with dimensions rather than as a list of columns so the strategy for extraction of a column is different. If we look at the memory used by a matrix m, an extracted column m2 and both together we see below that both together use the sum of the memories of the individual objects showing that there was data copying.
set.seed(123)
n <- 10000L
m <- matrix(rnorm(2*n), n, 2)
m2 <- m[, 2]
object_size(m)
## 160 kB
object_size(m2)
## 80 kB
object_size(m, m2)
## 240 kB <-- unlike for data.frames this equals sum of above
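To make the "one long vector with dimensions" point concrete, here is a minimal base-R sketch. The tiny 3 x 2 matrix is just an assumed stand-in for the simulation data. It shows that a column is a contiguous block of the underlying vector while a row is spread across it with a stride; either way, m[, j] and m[i, ] hand back a newly built vector.

```r
# Assumption: a small 3 x 2 matrix stands in for the real data.
m <- matrix(1:6, nrow = 3, ncol = 2)

# A matrix is a plain vector with a "dim" attribute, stored column by column.
flat <- as.vector(m)

# Column j occupies positions (j - 1) * nrow + 1 ... j * nrow of that vector.
j <- 2
col_by_offset <- flat[(j - 1) * nrow(m) + seq_len(nrow(m))]

# Row i is spread across the vector with a stride of nrow(m).
i <- 1
row_by_stride <- flat[i + nrow(m) * (seq_len(ncol(m)) - 1)]

# Both hand-computed slices agree with ordinary matrix indexing.
stopifnot(identical(col_by_offset, m[, j]),
          identical(row_by_stride, m[i, ]))
```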
what to do
If your program only needs column extraction up to a certain point, you could use a data frame for that portion, then do a one-time conversion to a matrix and process it as a matrix for the rest.
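A rough sketch of that two-phase idea, in base R only. The column names a and b, the sizes and the loop count are made up for illustration; whether the split pays off depends on how the real simulation is structured.

```r
# Assumption: two numeric columns named a and b; the names are illustrative.
set.seed(1)
sim <- data.frame(a = runif(1000, 0, 20), b = runif(1000, 0, 20))

# Phase 1: repeated column checks on the data frame (cheap column extraction).
n_small <- 0L
for (iter in 1:100) {
  n_small <- n_small + sum(sim$a < 10)
}

# One-time conversion once the column-extraction phase is over.
m <- as.matrix(sim)

# Phase 2: arithmetic updates on the matrix.
m[, "b"] <- m[, "b"] - 1
```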
I suppose it comes down to how R lays out these data structures in memory.
A matrix in R is a 2-d array, which is stored the same way as a 1-d array (a single vector). A variable points directly to that memory, so extracting a single value is very fast. To extract a column from a matrix, R has to do some computation, allocate new memory and copy the values into it. A data frame, on the other hand, is actually a list of columns, so returning a whole column is faster.
That's my guess; I hope someone can confirm it.
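The "list of columns" part of this guess is easy to check in base R. A small sketch with a made-up data frame:

```r
df <- data.frame(a = 1:5, b = letters[1:5])

is.list(df)    # a data frame is a list under the hood
length(df)     # its length is the number of columns, not rows

# $, [[ and .subset2 all return the column vector stored in that list.
col1 <- df$a
col2 <- df[["a"]]
col3 <- .subset2(df, "a")
stopifnot(identical(col1, col2), identical(col2, col3))

# A matrix, by contrast, is an atomic vector carrying a dim attribute.
m <- as.matrix(df["a"])
is.list(m)     # FALSE
dim(m)
```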
Related
I have a data frame where one column is a list of time-stamps. I need to annotate which time-stamps are valid or not, depending on whether or not they are close enough (i.e., within 1 second) to an element of another list of valid time-stamps. For this I have a helper function.
valid_times <- c(219.934, 229.996, 239.975, 249.935, 259.974, 344)
actual_times <- c(200, 210, 215, 220.5, 260)
strain <- c("green", "green", "green", "green", "green")
valid_or_not <- c(rep("NULL", 5))
df <- data.frame(strain, actual_times, valid_or_not)
My data-frame looks like this:
strain actual_times valid_or_not
1 green 200.0 NULL
2 green 210.0 NULL
3 green 215.0 NULL
4 green 220.5 NULL
5 green 260.0 NULL
My helper (that checks to see if an actual_time is within 1 second of a valid time) is as follows:
valid_or_not_fxn<- function(actual_time){
c = "not valid"
for (i in 1:length(valid_times))
if (abs(valid_times[i] - actual_time) <= 1) {
c <- "valid"
} else {
}
return(c)
}
What I've tried to do is loop through the entire data-frame using a for loop with this helper function.
However, it's really slow (on my real data-set) because it's a nested loop cross-comparing two lists that are 100s of elements long. I can't figure out how to optimize this.
df$valid_or_not <- as.character(df$valid_or_not)
for (i in 1:nrow(df)) {
  print(df[i, "valid_or_not"])
  df[i, "valid_or_not"] <- valid_or_not_fxn(df[i, "actual_times"])
}
Thank you for any help!
No matter what you do, you essentially have to do at least length(valid_times) comparisons. You are probably better off looping over valid_times and comparing each item of that vector to your actual_times column as a vectorised operation. That way you'd only have 6 loop iterations (one per element of valid_times).
One way of doing this is then:
df$valid_or_not <- Reduce(`|`, lapply(valid_times, function(x) abs(df$actual_times - x) <= 1))
# strain actual_times valid_or_not
#1 green 200.0 FALSE
#2 green 210.0 FALSE
#3 green 215.0 FALSE
#4 green 220.5 TRUE
#5 green 260.0 TRUE
100K rows in df and 1000 valid_times test finishes in <4 seconds:
df2 <- df[sample(1:5,1e5,replace=TRUE),]
valid_times2 <- valid_times[sample(1:5,1000,replace=TRUE)]
system.time(Reduce(`|`, lapply(valid_times2, function(x) abs(df2$actual_times - x) <= 1)))
# user system elapsed
# 3.13 0.40 3.54
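If both vectors get much larger, one possible alternative (not from the answer above, just a sketch) is to sort valid_times once and only compare each actual time against its two nearest neighbours, found with base R's findInterval. near_valid is a hypothetical helper name; it assumes valid is non-empty and contains no NA values.

```r
# Hypothetical helper: TRUE where x lies within tol of at least one
# element of valid. Assumes valid is non-empty and has no NAs.
near_valid <- function(x, valid, tol = 1) {
  v <- sort(valid)
  idx <- findInterval(x, v)               # number of sorted valid times <= x
  below <- v[pmax(idx, 1L)]               # nearest candidate at or below x
  above <- v[pmin(idx + 1L, length(v))]   # nearest candidate above x
  pmin(abs(x - below), abs(x - above)) <= tol
}

valid_times  <- c(219.934, 229.996, 239.975, 249.935, 259.974, 344)
actual_times <- c(200, 210, 215, 220.5, 260)
near_valid(actual_times, valid_times)
# [1] FALSE FALSE FALSE  TRUE  TRUE
```

This does one binary search per actual time instead of one full pass per valid time, so it should scale better when valid_times is long. I have not benchmarked it against the Reduce version.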
The easiest way to do it is to avoid data frame operations. You can do this check and populate the valid_or_not vector before combining the vectors into the data frame:
valid_or_not[sapply(actual_times, function(x) any(abs(x - valid_times) <= 1))] <- "valid"
Note that, by this line, the valid_or_not vector is indexed with an equal length vector of boolean values (whether the condition is satisfied, T or F). So only TRUE valued indices from the vector are updated. valid_or_not and actual_times vectors must be of same length where as valid_times vector can be of different length.
By the way, "plying" a for loop does not enhance performance significantly, since the apply family is just a "wrapper" around "for" loops. The only performance gain comes from avoiding intermediate objects, thanks to a neater and more concise coding style, and from avoiding redundant copying in some cases. The same is true of the Vectorize function: it just wraps a for loop around the function, which is how FUN gets "vectorized" for use in, for example, the "outer" function. It does not give the performance of a truly vectorized operation. In my example the performance gain comes from replacing the for loop with the "any" function.
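To illustrate the distinction being drawn here, a small sketch with made-up numbers: all three forms give the same answer, but only the last one is a single call whose loop runs in compiled code.

```r
x <- c(3, 8, 1, 9, 5)

# Explicit loop: one R-level comparison per element.
res_loop <- logical(length(x))
for (i in seq_along(x)) res_loop[i] <- x[i] > 4

# vapply: the same per-element function calls, written more compactly.
res_apply <- vapply(x, function(v) v > 4, logical(1))

# Truly vectorised: one call on the whole vector.
res_vec <- x > 4

stopifnot(identical(res_loop, res_apply), identical(res_apply, res_vec))
```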
And because of some kind of a "bug", subsetting data frames has an important penalty. As Hadley Wickham explains in Performance topic of Advanced-R:
Extracting a single value from a data frame
The following microbenchmark shows five ways to access a single value
(the number in the bottom-right corner) from the built-in mtcars
dataset. The variation in performance is startling: the slowest method
takes 30x longer than the fastest. There’s no reason that there has to
be such a huge difference in performance. It’s simply that no one has
had the time to fix it.
microbenchmark(
"[32, 11]" = mtcars[32, 11],
"$carb[32]" = mtcars$carb[32],
"[[c(11, 32)]]" = mtcars[[c(11, 32)]],
"[[11]][32]" = mtcars[[11]][32],
".subset2" = .subset2(mtcars, 11)[32] )
## Unit: nanoseconds
## expr min lq mean median uq max neval
## [32, 11] 15,300 16,300 18354 17,000 17,800 76,400 100
## $carb[32] 8,860 9,930 12836 10,600 11,600 85,400 100
## [[c(11, 32)]] 7,200 8,110 9293 8,780 9,350 21,300 100
## [[11]][32] 6,330 7,580 8377 8,100 8,690 20,900 100
## .subset2 334 566 4461 669 800 368,000 100
The most efficient way to subset a data frame is to use the .subset2 method. Your poor performance can mostly be attributed to this fact.
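A short sketch of how that might be applied, using the built-in mtcars dataset from the quoted benchmark. The loop is only illustrative; the idea is to pull the column out once and then index the plain vector.

```r
# mtcars ships with R; row 32, column 11 is the bottom-right cell.
v1 <- mtcars[32, 11]
v2 <- mtcars$carb[32]
v3 <- .subset2(mtcars, 11)[32]
stopifnot(identical(v1, v2), identical(v2, v3))

# In a hot loop, extract the column once and index the plain vector,
# instead of going through `[.data.frame` on every iteration.
carb <- .subset2(mtcars, "carb")
total <- 0
for (i in seq_along(carb)) total <- total + carb[i]
stopifnot(total == sum(mtcars$carb))
```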
And as last notes:
If the "else" in your conditional statement does not do anything (as in your example: else {}), you do not have to include it. R has some lazy operations (it does not evaluate a statement until it is actually executed in the code), but that does not mean it always skips non-executed code portions.
The "character" values in your example are in fact categoric: Only
one of few values can be chosen for each entry. So there is no need
to store them as "characters" and they can be converted into factors
(which are just integer values). This can also enhance
performance.
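For reference, this is roughly what the conversion looks like (a sketch with made-up values). Whether it is actually faster depends on the workload, so treat the performance claim as something to measure rather than assume.

```r
status <- c("valid", "not valid", "valid", "valid")

# A factor stores one integer code per element plus a small table of levels.
f <- factor(status, levels = c("not valid", "valid"))
levels(f)
as.integer(f)   # 2 1 2 2

# The original strings can always be recovered.
stopifnot(identical(as.character(f), status))
```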
An addition to @thelatemail's working solution:
In R, the "or" operator (|) isn't lazy, while the "any" function is. A *ply call that combines ORs keeps working to the end, whereas "any" stops at the first TRUE it encounters, which improves performance (I will write a blog post on this topic ASAP). A vectorized "any" is almost as fast as native C code, while *ply can be slightly faster than for loops in R (I will benchmark this and show it in another blog post soon).
Some benchmarks showing this:
Pure "any" and | comparison:
> microbenchmark(any(T,F,F,F,F,F), T|F|F|F|F|F)
Unit: nanoseconds
expr min lq mean median uq max neval cld
any(T, F, F, F, F, F) 274 307.0 545.86 366.5 429.5 16380 100 a
T | F | F | F | F | F 597 626.5 903.47 668.5 730.0 18966 100 a
Pure "Reduce" and vectorization comparison:
> vec0 <- rep(1, 1e6)
> microbenchmark(Reduce("+", vec0), sum(vec0), times = 10)
Unit: microseconds
expr min lq mean median uq
Reduce("+", vec0) 308415.064 310071.953 318503.6048 312940.6355 317648.354
sum(vec0) 930.625 936.775 944.2416 943.5425 949.257
max neval cld
369864.993 10 b
962.349 10 a
And a reduced "|" vs. vectorized "any" comparison (for an extreme case). "any" beats by more than 1e5 times:
> vec1 <- c(T, rep(F, 1e6))
> microbenchmark(Reduce("|", vec1), any(vec1), times = 10)
Unit: nanoseconds
expr min lq mean median uq
Reduce("|", vec1) 394040518 395792399 402703632.6 399191803 400990304
any(vec1) 154 267 1932.5 2588 2952
max neval cld
441805451 10 b
3420 10 a
When the single TRUE is at the very end (so "any" is not lazy anymore and has to check the whole vector), "any" still beats by more than 400 times:
> vec2 <- c(rep(F, 1e6), T)
> microbenchmark(Reduce("|", vec2), any(vec2), times = 10)
Unit: microseconds
expr min lq mean median uq
Reduce("|", vec2) 396625.318 401744.849 416732.5087 407447.375 424538.222
any(vec2) 736.975 787.047 857.5575 832.137 926.076
max neval cld
482116.632 10 b
1013.732 10 a
I have to subset a sequence of data.frames frequently (millions of times each run). The data.frames are of approximate size 200 rows x 30 columns. Depending on the state, the values in the data.frame change from one iteration to the next. Thus, doing one subset in the beginning is not working.
In contrast to the question of when a data.table starts to be faster than a data.frame, I am looking for a way to speed up subsetting for a given size of the data.frame/data.table.
The following minimal reproducible example shows that data.frame seems to be the fastest:
library(data.table)
nmax <- 1e2 # for 1e7 the results look as expected: data.table is really fast!
set.seed(1)
x<-runif(nmax,min=0,max=10)
y<-runif(nmax,min=0,max=10)
DF<-data.frame(x,y)
DT<-data.table(x,y)
summary(microbenchmark::microbenchmark(
setkey(DT,x,y),
times = 10L, unit = "us"))
# expr min lq mean median uq max neval
# 1 setkey(DT, x, y) 70.326 72.606 105.032 80.3985 126.586 212.877 10
summary(microbenchmark::microbenchmark(
DF[DF$x>5, ],
`[.data.frame`(DT,DT$x < 5,),
DT[x>5],
times = 100L, unit = "us"))
# expr min lq mean median uq max neval
# 1 DF[DF$x > 5, ] 41.815 45.426 52.40197 49.9885 57.4010 82.110 100
# 2 `[.data.frame`(DT, DT$x < 5, ) 43.716 47.707 58.06979 53.5995 61.2020 147.873 100
# 3 DT[x > 5] 205.273 214.777 233.09221 222.0000 231.6935 900.164 100
Is there anything I can do to improve performance?
Edit after input:
I am running a discrete event simulation and for each event I have to search in a list (I don't mind whether it is a data.frame or data.table). Most likely, I could implement a different approach, but then I have to re-write the code which was developed over more than 3 years. At the moment, this is not an option. But if there is no way to get it faster this might become an option in the future.
Technically, it is not a sequence of data.frames but just one data.frame, which changes with each iteration. However, this has no impact on "how to get the subset faster" and I hope that the question is now more comprehensive.
You will see a performance boost by converting to matrices. This is a viable alternative if the whole content of your data.frame is numerical (or can be converted without too much trouble).
Here we go. First I modified the data to have it with size 200x30:
library(data.table)
nmax = 200
cmax = 30
set.seed(1)
x<-runif(nmax,min=0,max=10)
DF = data.frame(x)
for (i in 2:cmax) {
DF = cbind(DF, runif(nmax,min=0,max=10))
colnames(DF)[ncol(DF)] = paste0('x',i)
}
DT = data.table(DF)
DM = as.matrix(DF) # # # or data.matrix(DF) if you have factors
And the comparison, ranked from quickest to slowest:
summary(microbenchmark::microbenchmark(
DM[DM[, 'x']>5, ], # # # # Quickest
as.matrix(DF)[DF$x>5, ], # # # # Still quicker with conversion
DF[DF$x>5, ],
`[.data.frame`(DT,DT$x < 5,),
DT[x>5],
times = 100L, unit = "us"))
# expr min lq mean median uq max neval
# 1 DM[DM[, "x"] > 5, ] 13.883 19.8700 22.65164 22.4600 24.9100 41.107 100
# 2 as.matrix(DF)[DF$x > 5, ] 141.100 181.9140 196.02329 195.7040 210.2795 304.989 100
# 3 DF[DF$x > 5, ] 198.846 238.8085 260.07793 255.6265 278.4080 377.982 100
# 4 `[.data.frame`(DT, DT$x < 5, ) 212.342 268.2945 346.87836 289.5885 304.2525 5894.712 100
# 5 DT[x > 5] 322.695 396.3675 465.19192 428.6370 457.9100 4186.487 100
If your use-case involves querying the data multiple times, you can do the conversion only once and increase the speed by an order of magnitude.
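A minimal sketch of the "convert once, query many times" pattern, assuming all columns are numeric. The two columns, the five iterations and the update rule are invented for illustration; the point is that the state lives in the matrix so no re-conversion is needed inside the loop.

```r
set.seed(1)
DF <- data.frame(x = runif(200, 0, 10), y = runif(200, 0, 10))
DM <- as.matrix(DF)   # conversion done once, outside the loop

n_hits <- integer(5)
for (step in 1:5) {
  # drop = FALSE keeps the result a matrix even if 0 or 1 rows match.
  hits <- DM[DM[, "x"] > 5, , drop = FALSE]
  n_hits[step] <- nrow(hits)
  # State update applied directly to the matrix.
  DM[, "x"] <- (DM[, "x"] + 1) %% 10
}
n_hits
```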
Suppose I have a 5 million row data frame with two columns, like this (this example only has eight rows for simplicity):
df <- data.frame(start=c(11,21,31,41,42,54,61,63), end=c(20,30,40,50,51,63,70,72))
I want to be able to produce the following numbers in a numeric vector:
11 to 20, 21 to 30, 31 to 40, 41 to 50, 51, 54-63, 64-70, 71-72
And then take the length of the new vector (in this case, 10+10+10+10+1+10+7+2) = 60
*NOTE: I do not need the vector itself; just its length will suffice. So if someone has a smarter approach to obtain the length, that is welcome.
Essentially, for each row in the data frame, the sequence from start to end is taken, all these sequences are combined, and the result is filtered for unique values.
So I used an approach as such:
length(unique(c(apply(df, 1, function(x) {
return(as.numeric(x[1]):as.numeric(x[2]))
}))))
which proves incredibly slow on my five million row data frame.
Any quicker more efficient solutions? Bonus, please try to add system time.
user system elapsed
19.946 0.620 20.477
This should work, assuming your data is sorted.
library(dplyr) # for the lag function
with(df, sum(end - pmax(start, lag(end, 1, default = 0)+1) + 1))
#[1] 60
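The idea behind that one-liner is that each row only contributes the integers beyond everything already covered by earlier rows. Below is a base-R sketch of the same idea that also sorts the input and guards against an interval lying entirely inside an earlier one. count_covered is a hypothetical helper name, and it assumes integer-valued start and end with start <= end.

```r
# Hypothetical helper: number of distinct integers covered by the
# intervals [start, end]. Assumes integer-valued inputs, start <= end.
count_covered <- function(start, end) {
  o <- order(start, end)
  s <- start[o]
  e <- end[o]
  # Largest end seen in any earlier row (-Inf for the first row).
  prev_max <- c(-Inf, head(cummax(e), -1))
  # Each row adds the integers from max(start, prev_max + 1) to end;
  # pmax(0, ...) handles rows fully contained in earlier ones.
  sum(pmax(0, e - pmax(s, prev_max + 1) + 1))
}

df <- data.frame(start = c(11, 21, 31, 41, 42, 54, 61, 63),
                 end   = c(20, 30, 40, 50, 51, 63, 70, 72))
count_covered(df$start, df$end)
# [1] 60
```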
library(microbenchmark)
microbenchmark(
beginneR={with(df, sum(end - pmax(start, lag(end, 1, default = 0)+1) + 1))},
r2evans={vec <- pmax(mm[,1], c(0,1+head(mm[,2],n=-1))); sum(mm[,2]-vec+1);},
times = 1000
)
Unit: microseconds
expr min lq median uq max neval
beginneR 37.398 41.4455 42.731 44.0795 74.349 1000
r2evans 31.788 35.2470 36.827 38.3925 9298.669 1000
So the matrix version is still faster, but not by much (and the conversion step is still not included here). And I wonder why the max duration in @r2evans's answer is so high compared to all the other values (which are really fast).
Another method:
mm <- as.matrix(df) ## critical for performance/scalability
(vec <- pmax(mm[,1], c(0,1+head(mm[,2],n=-1))))
## [1] 11 21 31 41 51 54 64 71
sum(mm[,2] - vec + 1)
## [1] 60
(This should scale reasonably well, certainly better than data.frames.)
Edit: after I updated my code to use matrices and no apply calls, I did a quick benchmark of my implementation compared with the other answer (which is also correct):
library(microbenchmark)
library(dplyr)
microbenchmark(
beginneR={
df <- data.frame(start=c(11,21,31,41,42,54,61,63),
end=c(20,30,40,50,51,63,70,72))
with(df, sum(end - pmax(start, lag(end, 1, default = 0)+1) + 1))
},
r2evans={
mm <- matrix(c(11,21,31,41,42,54,61,63,
20,30,40,50,51,63,70,72), nc=2)
vec <- pmax(mm[,1], c(0,1+head(mm[,2],n=-1)))
sum(mm[,2]-vec+1)
}
)
## Unit: microseconds
## expr min lq median uq max neval
## beginneR 230.410 238.297 244.9015 261.228 443.574 100
## r2evans 37.791 40.725 44.7620 47.880 147.124 100
This benefits greatly from the use of matrices instead of data.frames.
Oh, and system time is not that helpful here :-)
system.time({
mm <- matrix(c(11,21,31,41,42,54,61,63,
20,30,40,50,51,63,70,72), nc=2)
vec <- pmax(mm[,1], c(0,1+head(mm[,2],n=-1)))
sum(mm[,2]-vec+1)
})
## user system elapsed
## 0 0 0
I have a vector of scalar values of which I'm trying to get: "How many different values there are".
For instance in group <- c(1,2,3,1,2,3,4,6) unique values are 1,2,3,4,6 so I want to get 5.
I came up with:
length(unique(group))
But I'm not sure it's the most efficient way to do it. Isn't there a better way to do this?
Note: My case is more complex than the example, consisting of around 1000 numbers with at most 25 different values.
Here are a few ideas, all pointing towards your solution already being very fast. length(unique(x)) is what I would have used as well:
x <- sample.int(25, 1000, TRUE)
library(microbenchmark)
microbenchmark(length(unique(x)),
nlevels(factor(x)),
length(table(x)),
sum(!duplicated(x)))
# Unit: microseconds
# expr min lq median uq max neval
# length(unique(x)) 24.810 25.9005 27.1350 28.8605 48.854 100
# nlevels(factor(x)) 367.646 371.6185 380.2025 411.8625 1347.343 100
# length(table(x)) 505.035 511.3080 530.9490 575.0880 1685.454 100
# sum(!duplicated(x)) 24.030 25.7955 27.4275 30.0295 70.446 100
You can use rle from base package
x<-c(1,2,3,1,2,3,4,6)
length(rle(sort(x))$values)
rle produces two vectors (lengths and values ). The length of values vector gives you the number of unique values.
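A quick look at what rle returns for the example vector. As a side effect you also get how often each value occurs:

```r
x <- c(1, 2, 3, 1, 2, 3, 4, 6)

# Sorting first makes equal values adjacent, so each run is one distinct value.
r <- rle(sort(x))
r$values    # the distinct values
r$lengths   # how many times each one occurs

stopifnot(length(r$values) == length(unique(x)))
```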
I have used this function
length(unique(array))
and it works fine, and doesn't require external libraries.
uniqueN function from data.table is equivalent to length(unique(group)). It is also several times faster on larger datasets, but not so much on your example.
library(data.table)
library(microbenchmark)
xSmall <- sample.int(25, 1000, TRUE)
xBig <- sample.int(2500, 100000, TRUE)
microbenchmark(length(unique(xSmall)), uniqueN(xSmall),
length(unique(xBig)), uniqueN(xBig))
#Unit: microseconds
# expr min lq mean median uq max neval cld
#1 length(unique(xSmall)) 17.742 24.1200 34.15156 29.3520 41.1435 104.789 100 a
#2 uniqueN(xSmall) 12.359 16.1985 27.09922 19.5870 29.1455 97.103 100 a
#3 length(unique(xBig)) 1611.127 1790.3065 2024.14570 1873.7450 2096.5360 3702.082 100 c
#4 uniqueN(xBig) 790.576 854.2180 941.90352 896.1205 974.6425 1714.020 100 b
We can use n_distinct from dplyr
dplyr::n_distinct(group)
#[1] 5
If one wants to get number of unique elements in a matrix or data frame or list, the following code would do:
if( typeof(Y)=="list"){ # Y is a list or data frame
# flatten the list / data frame into one vector
numUniqueElems <- length( na.exclude( unique(unlist(Y)) ) )
} else if ( is.null(dim(Y)) ){ # Y is a vector
numUniqueElems <- length( na.exclude( unique(Y) ) )
} else { # length(dim(Y))==2, Y is a matrix
numUniqueElems <- length( na.exclude( unique(c(Y)) ) )
}
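The same logic could be wrapped in a small function. This is only a sketch, count_unique is a hypothetical name, and it flattens the matrix explicitly because unique() on a matrix would otherwise return unique rows rather than unique elements.

```r
# Hypothetical helper: number of distinct non-NA elements in a vector,
# matrix, list or data frame.
count_unique <- function(Y) {
  vals <- if (is.list(Y)) unlist(Y, use.names = FALSE) else as.vector(Y)
  length(unique(vals[!is.na(vals)]))
}

count_unique(c(1, 2, 2, NA))                            # vector
count_unique(matrix(c(1, 1, 2, 3), nrow = 2))           # matrix
count_unique(data.frame(a = 1:3, b = c(3, 4, NA)))      # data frame
```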
I would like to have unique numeric factors as part of an xts, so that over time...each number refers to a specific factor, independent of time.
To give an example, imagine a stock index that changes its constituents every day. We can simulate this if I have the following universe of two letter stock tickers
universe <- apply(as.data.frame(expand.grid(letters,letters)),1,paste0,collapse="")
and each day an index is created that is a random subsample of 20 of the stock tickers from the universe.
subsample.list <- lapply(1:50, function(y){
sort(sample(universe,20,replace=FALSE))
})
the key of unique stocks over the 50 days is:
uni.subsample <- sort(unique(unlist(subsample.list)))
Basically, I would like to be able to see which stocks were in the index each day, given the xts object and the unique factors.
Although it is not meant to be used this way....I was thinking something like:
tmp <- xts(do.call(rbind,subsample.list),Sys.Date()-c(50:1))
to create the xts.
However, I would like to convert the coredata into a numeric matrix, where each number stands for a ticker from uni.subsample.
So tmp.adjusted['20130716'][1,] would be a numeric vector of length 20 holding the positions in uni.subsample of the index members for 16 July 2013. I would then expect to get all of that day's index members with uni.subsample[tmp.adjusted['20130716'][1,]]. In other words, the adjustment from tmp to tmp.adjusted converts the strings into factors whose levels are tied to uni.subsample.
I hope this makes sense; it's kind of hard to explain.
Here is a vectorized solution:
tmp.int <- xts(matrix(as.integer(factor(tmp,levels=uni.subsample,ordered=TRUE)),
ncol=ncol(tmp)),index(tmp))
You are basically trying to encode a matrix of ordered factors by the order of their levels.
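Stripped of xts, the encoding step can be sketched in base R as follows. The three-ticker key and the 2 x 2 matrix are made-up stand-ins for uni.subsample and coredata(tmp). With explicit levels, the integer codes of the factor are simply the positions in the key, which is also what match returns.

```r
key <- c("aa", "ab", "zz")                              # stands in for uni.subsample
tickers <- matrix(c("ab", "zz", "aa", "ab"), nrow = 2)  # stands in for coredata(tmp)

# factor() drops the dim attribute, so the matrix shape is restored afterwards.
codes  <- matrix(as.integer(factor(tickers, levels = key)), ncol = ncol(tickers))
codes2 <- matrix(match(tickers, key), ncol = ncol(tickers))
stopifnot(identical(codes, codes2))

# Decoding: index the key with the codes to get the tickers back.
decoded <- matrix(key[as.vector(codes)], ncol = ncol(codes))
stopifnot(identical(decoded, tickers))
```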
EDIT adding some benchmarking :
set.seed(1233)
N <- 5000
subsample.list <- lapply(seq(N), function(y){
sort(sample(universe,20,replace=FALSE))
})
uni.subsample <- sort(unique(unlist(subsample.list)))
tmp <- xts(do.call(rbind,subsample.list),Sys.Date()-seq(N))
ag <- function() xts(matrix(as.integer(factor(tmp,levels=uni.subsample,ordered=TRUE)),
ncol=ncol(tmp)),index(tmp))
no <- function()xts(apply(X=tmp,
MARGIN=c(1,2), function(x) which(uni.subsample == x)),
index(tmp))
library(microbenchmark)
microbenchmark(ag(),no(),times=1)
## N = 50 ag 24 faster
microbenchmark(ag(),no(),times=1)
Unit: milliseconds
expr min lq median uq max neval
ag() 1.126405 1.126405 1.126405 1.126405 1.126405 1
no() 24.000003 24.000003 24.000003 24.000003 24.000003 1
## N = 500 ag 135 faster
microbenchmark(ag(),no(),times=10)
Unit: milliseconds
expr min lq median uq max neval
ag() 23.38484 26.19744 31.13428 35.51057 44.96251 10
no() 3115.24902 3220.04940 3250.63773 3288.66867 3470.35053 10
How about:
tmp.int <- xts(apply(X=tmp, MARGIN=c(1,2), function(x) which(uni.subsample == x)),
index(tmp))
# to perform the lookup (e.g., 'find the name of the first value on May 27, 2013'):
uni.subsample[tmp.int['2013-05-27'][,1]]