Preserve xts index when using rowSums on xts

Is there a way to preserve the index of an xts object when passing rowSums a xts object?
Currently I recast the result as an xts object, but this doesn't seem to be as fast as it could be if rowSums was able to simply return what it was passed.
xts(rowSums(abs(data)),index(data))

Interesting question. Let's ignore the abs calculation, as it's often not needed when you're just working with prices. If your concern is performance, here is a set of timings for the current suggestions:
library(microbenchmark)
sample.xts <- xts(order.by = as.POSIXct("2004-01-01 00:00:00") + 1:1e6,
                  matrix(rnorm(1e6 * 4), ncol = 4),
                  dimnames = list(NULL, c("A", "B", "C", "D")))
# See how quickly rowSums works on just the underlying matrix of data in the timings below:
xx <- coredata(sample.xts)
microbenchmark(
  coredata(sample.xts),
  rowSums(xx),
  rowSums(sample.xts),
  rowSums(coredata(sample.xts)),
  .xts(x = rowSums(sample.xts), .index(sample.xts)),
  xts(rowSums(coredata(sample.xts)), index(sample.xts)),
  xts(rowSums(sample.xts), index(sample.xts)),
  Reduce("+", as.list(sample.xts)),
  times = 100)
# Unit: milliseconds
# expr min lq mean median uq max neval
# coredata(sample.xts) 2.558479 2.661242 6.884048 2.817607 6.356423 104.57993 100
# rowSums(xx) 10.314719 10.824184 11.872108 11.289788 12.382614 18.39334 100
# rowSums(sample.xts) 10.358009 10.887609 11.814267 11.335977 12.387085 17.16193 100
# rowSums(coredata(sample.xts)) 12.916714 13.839761 18.968731 15.950048 17.836838 113.78552 100
# .xts(x = rowSums(sample.xts), .index(sample.xts)) 14.402382 15.764736 20.307027 17.808984 19.072600 114.24039 100
# xts(rowSums(coredata(sample.xts)), index(sample.xts)) 20.490542 24.183286 34.251031 25.566188 27.900599 125.93967 100
# xts(rowSums(sample.xts), index(sample.xts)) 17.436137 19.087269 25.259143 21.923877 22.805013 119.60638 100
# Reduce("+", as.list(sample.xts)) 21.745574 26.075326 41.696152 27.669601 30.442397 136.38650 100
y = .xts(x = rowSums(sample.xts), .index(sample.xts))
y2 = xts(rowSums(sample.xts),index(sample.xts))
all.equal(y, y2)
#[1] TRUE
coredata(sample.xts) returns the underlying numeric matrix. The fastest performance you can expect for the computation itself is given by rowSums(xx), which can be considered the "benchmark". The question is then what the quickest way is to return the result as an xts object. It seems
.xts(x = rowSums(sample.xts), .index(sample.xts)) gives decent performance.
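If you do this often, you could wrap the fastest variant above in a small helper (a minimal sketch; the name xts_rowSums is just for illustration, and .xts() skips some of the input checking that xts() performs):
# hypothetical convenience wrapper around the fastest variant above
xts_rowSums <- function(x, na.rm = FALSE) {
  .xts(rowSums(x, na.rm = na.rm), .index(x))
}
y3 <- xts_rowSums(sample.xts)
all.equal(y3, xts(rowSums(sample.xts), index(sample.xts)))
# expect TRUE, as with all.equal(y, y2) above; check in your own session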

If your objection is having to pick apart and reassemble the components of the input, and x is your xts object, then try this; it returns an xts object directly:
Reduce("+", as.list(x))

Related

Efficiently find set differences and generate random sample

I have a very large data set with categorical labels in a vector a, and a vector b that contains all possible labels in the data set:
a <- c(1,1,3,2) # artificial data
b <- c(1,2,3,4) # fixed categories
Now I want to find for each observation in a the set of all remaining categories (that is, the elements of b excluding the given observation in a). From these remaining categories, I want to sample one at random.
My approach using a loop is
goal <- numeric() # container for results
for (i in 1:4) {
  d <- setdiff(b, a[i])   # find the categories except the one observed in the data
  goal[i] <- sample(d, 1) # sample one of the remaining categories randomly
}
goal
# [1] 4 4 1 1
However, this has to be done a large number of times and applied to very large data sets. Does anyone have a more efficient version that leads to the desired result?
EDIT:
The function by akrun is unfortunately slower than the original loop. If anyone has a creative idea with a competitive result, I'm happy to hear it!
We can use vapply
vapply(a, function(x) sample(setdiff(b, x), 1), numeric(1))
set.seed(24)
a <- sample(c(1:4), 10000, replace=TRUE)
b <- 1:4
system.time(vapply(a, function(x) sample(setdiff(b, x), 1), numeric(1)))
# user system elapsed
# 0.208 0.007 0.215
It turns out that an even faster approach is to draw labels for all observations at once and then redraw only the positions where the drawn label equals the observed one:
test <- sample(b, length(a), replace = TRUE)
resample <- (a == test)  # positions where the draw collides with the observed label
while (any(resample)) {
  test[resample] <- sample(b, sum(resample), replace = TRUE)
  resample <- (a == test)
}
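As a quick sanity check (purely illustrative, assuming a, b, and test from the loop above are still in the workspace), the result should never repeat the observed label and should only contain valid categories:
any(test == a)   # expect FALSE: no draw equals its observed label
all(test %in% b) # expect TRUE: every draw is one of the allowed categories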
Updated Benchmarks for N=10,000:
Unit: microseconds
expr min lq mean median uq max neval
loop 14337.492 14954.595 16172.2165 15227.010 15585.5960 24071.727 100
akrun 14899.000 15507.978 16271.2095 15736.985 16050.6690 24085.839 100
resample 87.242 102.423 113.4057 112.473 122.0955 174.056 100
shree(data = a, labels = b) 5195.128 5369.610 5472.4480 5454.499 5574.0285 5796.836 100
shree_mapply(data = a, labels = b) 1500.207 1622.516 1913.1614 1682.814 1754.0190 10449.271 100
Update: Here's a faster version with mapply. This method avoids calling sample() for every element, so it is a bit quicker:
mapply(function(x, y) b[!b == x][y], a, sample(length(b) - 1, length(a), replace = T))
Here's a version without setdiff (setdiff can be a bit slow), although I think even more optimization is possible:
vapply(a, function(x) sample(b[!b == x], 1), numeric(1))
Benchmarks:
set.seed(24)
a <- sample(c(1:4), 1000, replace=TRUE)
b <- 1:4
microbenchmark::microbenchmark(
akrun = vapply(a, function(x) sample(setdiff(b, x), 1), numeric(1)),
shree = vapply(a, function(x) sample(b[!b == x], 1), numeric(1)),
shree_mapply = mapply(function(x, y) b[!b == x][y], a, sample(length(b) - 1, length(a), replace = T))
)
Unit: milliseconds
expr min lq mean median uq max neval
akrun 28.7347 30.66955 38.319655 32.57875 37.45455 237.1690 100
shree 5.6271 6.05740 7.531964 6.47270 6.87375 45.9081 100
shree_mapply 1.8286 2.01215 2.628989 2.14900 2.54525 7.7700 100

Efficient sparse linear interpolation of row by row data

What is the most efficient way to do linear interpolation when the desired interpolation points are sparse compared to the available data? I have a very long data frame with many columns, one of which represents a timestamp, and the rest are variables that I am interested in interpolating at a very small number of timestamps. For example, consider the two variable case:
microbenchmark::microbenchmark(approx(1:2, 1:2, 1.5)$y)
# Unit: microseconds
# expr min lq mean median uq max neval
# ... 39.629 41.3395 46.80514 42.195 52.8865 138.558 100
microbenchmark::microbenchmark(approx(seq_len(1e6), seq_len(1e6), 1.5)$y)
# Unit: milliseconds
# expr min lq mean median uq max neval
# ... 129.5733 231.0047 229.3459 236.3845 247.3096 369.4621 100
We see that although only one interpolated value (at t = 1.5) is desired, increasing the number of (x, y) pairs can cause a few orders of magnitude difference in running time.
Another example, this time with a data table.
library(data.table)
tmp_dt <- data.table(time = seq_len(1e7), a = seq_len(1e7), b = seq_len(1e7), c = seq_len(1e7))
Running tmp_dt[, lapply(.SD, function(col) {approx(time, col, 1.5)$y}), .SDcols = c("a", "b", "c")] produces a one-row data.table, but it takes a while.
I am thinking there must be some efficiency to be gained by removing all rows in the data table that are not necessary for interpolation.
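For instance (an illustrative sketch, not part of the original post), restricting the call to the two rows that bracket the target time via findInterval makes the cost essentially independent of the table size; continuing with tmp_dt from above:
t_out <- 1.5  # the single target time from the example above
i <- findInterval(t_out, tmp_dt$time)  # grid row at or just below t_out
tmp_dt[i:(i + 1L),
       lapply(.SD, function(col) approx(time, col, t_out)$y),
       .SDcols = c("a", "b", "c")]
This ignores edge cases (t_out outside the observed range, or exactly on the last timestamp), which would need a separate check.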
If your linear interpolation is weighted.mean(c(x0, x1), c(t1-t, t-t0)), where (t0, x0) is the nearest point below and (t1, x1) the nearest above...
# convert the integer columns to numeric so the interpolated values can be stored
tmp_dt[, names(tmp_dt) := lapply(.SD, as.numeric)]
# enumerate target times
tDT = data.table(t = seq(1.5, 100.5, by = .5))
# handle perfect matches
tDT[, a := tmp_dt[.SD, on = .(time = t), x.a]]
# handle interpolation
tDT[is.na(a), a := {
  w = findInterval(t, tmp_dt$time)
  cbind(tmp_dt[w, .(t0 = time, a0 = a)], tmp_dt[w + 1L, .(t1 = time, a1 = a)])[,
    (a0*(t1 - t) + a1*(t - t0))/(t1 - t0)]
}]
The extension to more columns is a little messy, but can be shoehorned in here (a rough sketch follows below).
Some sort of rolling, like w = tmp_dt[t, on=.(time), roll=TRUE, which=TRUE], might be faster than findInterval, but I haven't looked into it.
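For what it's worth, a hedged sketch of how the extension to more columns might look (not from the original answer); it skips the exact-match join, because the interpolation formula already returns a0 when t equals t0, and it assumes every target time lies strictly inside the range of tmp_dt$time:
cols <- c("a", "b", "c")             # value columns to interpolate
tt <- tDT$t                          # target times
w  <- findInterval(tt, tmp_dt$time)  # grid row at or just below each target
t0 <- tmp_dt$time[w]
t1 <- tmp_dt$time[w + 1L]
for (col in cols) {
  x0 <- tmp_dt[[col]][w]
  x1 <- tmp_dt[[col]][w + 1L]
  set(tDT, j = col, value = (x0 * (t1 - tt) + x1 * (tt - t0)) / (t1 - t0))
}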

Copying the data from a small data.frame to a bigger data.frame

Let d be a pre-allocated big data.frame:
d = as.data.frame(matrix(NA,ncol=3,nrow=5e7))
names(d) = c("x","y","z")
dsub is a small data.frame with the same number of columns and the same column names as d:
dsub = data.frame(x = 1:4,y=1:4,z=1:4)
I wish to copy the data from dsub into d at rows 5 to 8:
d[5:8,] = dsub
This operation is very slow. It seems that R is copying the entire data.frame d!
Why is it so?
How can one make this process faster?
In this comment the data.table package was mentioned as a way to avoid copying the whole object when modifying only a few rows.
The best way to demonstrate the effect is a benchmark, which also allows the different approaches the data.table package offers to be compared.
Setting up data
library(data.table)
df <- as.data.frame(matrix(NA_integer_, ncol = 3, nrow = 5e7))
names(df) <- c("x", "y", "z")
dt <- setDT(copy(df))
dsub <- data.frame(x = 1:4, y = 1:4, z = 1:4)
Note that the target object is initialized with NA_integer_ instead of NA, which is of type logical. This avoids the overhead caused by coercing the left-hand side to integer (and the respective warnings issued by data.table).
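To see the type issue in isolation (a small illustrative check, not part of the original benchmark): a column created from plain NA starts out logical, so the first integer assignment has to coerce the whole column.
typeof(as.data.frame(matrix(NA, ncol = 3, nrow = 5))$V1)          # "logical"
typeof(as.data.frame(matrix(NA_integer_, ncol = 3, nrow = 5))$V1) # "integer"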
Benchmarking
mb <- microbenchmark::microbenchmark(
  df  = df[5:8, ] <- dsub,
  dt1 = dt[5:8] <- dsub,
  dt2 = dt[5:8, (c("x", "y", "z")) := .SD],
  dt3 = set(dt, 5:8, 1:3, dsub),
  times = 10,
  unit = "ms"
)
print(mb, unit = "relative")
#Unit: relative
# expr min lq mean median uq max neval cld
# df 56458.1921 27397.98069 27932.40685 29796.52860 34413.21160 29487.64751 10 b
# dt1 49142.9608 24959.42180 22909.58526 20687.62826 30129.96416 21349.51295 10 b
# dt2 111.9582 86.57717 54.36988 70.89935 69.36287 31.89704 10 a
# dt3 1.0000 1.00000 1.00000 1.00000 1.00000 1.00000 10 a
Note that the benchmark results are printed relative to the fastest method, which is data.table's set() function. Updating by reference using regular data.table syntax (case dt2) is also orders of magnitude faster than the data.frame way.
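As a usage note (a small sketch repeating the winning call outside the benchmark), set() updates dt in place, so the copied rows can be inspected directly afterwards:
set(dt, i = 5:8, j = 1:3, value = dsub)  # modifies dt by reference, no copy
dt[4:9]                                  # rows 5 to 8 now hold dsub's values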

Is there a way to speed up subsetting of smaller data.frames

I have to subset a sequence of data.frames frequently (millions of times each run). The data.frames are of approximate size 200 rows x 30 columns. Depending on the state, the values in the data.frame change from one iteration to the next, so doing a single subset at the beginning won't work.
In contrast to the question "when does a data.table start to be faster than a data.frame", I am looking for a speed-up of subsetting for a given size of the data.frame/data.table.
The following minimal reproducible example shows that data.frame seems to be the fastest:
library(data.table)
nmax <- 1e2 # for 1e7 the results look as expected: data.table is really fast!
set.seed(1)
x<-runif(nmax,min=0,max=10)
y<-runif(nmax,min=0,max=10)
DF<-data.frame(x,y)
DT<-data.table(x,y)
summary(microbenchmark::microbenchmark(
setkey(DT,x,y),
times = 10L, unit = "us"))
# expr min lq mean median uq max neval
# 1 setkey(DT, x, y) 70.326 72.606 105.032 80.3985 126.586 212.877 10
summary(microbenchmark::microbenchmark(
DF[DF$x>5, ],
`[.data.frame`(DT,DT$x < 5,),
DT[x>5],
times = 100L, unit = "us"))
# expr min lq mean median uq max neval
# 1 DF[DF$x > 5, ] 41.815 45.426 52.40197 49.9885 57.4010 82.110 100
# 2 `[.data.frame`(DT, DT$x < 5, ) 43.716 47.707 58.06979 53.5995 61.2020 147.873 100
# 3 DT[x > 5] 205.273 214.777 233.09221 222.0000 231.6935 900.164 100
Is there anything I can do to improve performance?
Edit after input:
I am running a discrete event simulation and for each event I have to search in a list (I don't mind whether it is a data.frame or data.table). Most likely, I could implement a different approach, but then I have to re-write the code which was developed over more than 3 years. At the moment, this is not an option. But if there is no way to get it faster this might become an option in the future.
Technically, it is not a sequence of data.frames but just one data.frame, which changes with each iteration. However, this has no impact on "how to get the subset faster" and I hope that the question is now more comprehensive.
You will see a performance boost by converting to matrices. This is a viable alternative if the whole content of your data.frame is numerical (or can be converted without too much trouble).
Here we go. First I modified the data so that it has size 200 x 30:
library(data.table)
nmax <- 200
cmax <- 30
set.seed(1)
x <- runif(nmax, min = 0, max = 10)
DF <- data.frame(x)
for (i in 2:cmax) {
  DF <- cbind(DF, runif(nmax, min = 0, max = 10))
  colnames(DF)[ncol(DF)] <- paste0("x", i)
}
DT <- data.table(DF)
DM <- as.matrix(DF)  # or data.matrix(DF) if you have factors
And the comparison, ranked from quickest to slowest:
summary(microbenchmark::microbenchmark(
  DM[DM[, 'x'] > 5, ],           # quickest
  as.matrix(DF)[DF$x > 5, ],     # still quicker, even with the conversion
  DF[DF$x > 5, ],
  `[.data.frame`(DT, DT$x < 5, ),
  DT[x > 5],
  times = 100L, unit = "us"))
# expr min lq mean median uq max neval
# 1 DM[DM[, "x"] > 5, ] 13.883 19.8700 22.65164 22.4600 24.9100 41.107 100
# 2 as.matrix(DF)[DF$x > 5, ] 141.100 181.9140 196.02329 195.7040 210.2795 304.989 100
# 3 DF[DF$x > 5, ] 198.846 238.8085 260.07793 255.6265 278.4080 377.982 100
# 4 `[.data.frame`(DT, DT$x < 5, ) 212.342 268.2945 346.87836 289.5885 304.2525 5894.712 100
# 5 DT[x > 5] 322.695 396.3675 465.19192 428.6370 457.9100 4186.487 100
If your use case involves querying the data multiple times, you can do the conversion only once and increase the speed by an order of magnitude.
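A minimal sketch of that pattern (illustrative only; run_once and the threshold are made up for the example): convert once outside the hot loop, then subset the matrix inside it.
DM <- as.matrix(DF)  # one-off conversion outside the simulation loop
run_once <- function(threshold) {
  DM[DM[, "x"] > threshold, , drop = FALSE]  # the repeated, performance-critical subset
}
res <- replicate(1000, run_once(5), simplify = FALSE)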

Compute digit-sums in specific columns of a data frame

I'm trying to sum the digits of integers in the last 2 columns of my data frame. I have found a function that does the summing, but I think I may have an issue with applying the function - not sure?
Dataframe
a = c("a", "b", "c")
b = c(1, 11, 2)
c = c(2, 4, 23)
data <- data.frame(a,b,c)
#Digitsum function
digitsum <- function(x) sum(floor(x / 10^(0:(nchar(as.character(x)) - 1))) %% 10)
#Applying function
data[2:3] <- lapply(data[2:3], digitsum)
These are the warnings that I get:
Warning messages:
1: In 0:(nchar(as.character(x)) - 1) :
  numerical expression has 3 elements: only the first used
2: In 0:(nchar(as.character(x)) - 1) :
  numerical expression has 3 elements: only the first used
Your function digitsum at the moment works fine for a single scalar input, for example,
digitsum(32)
# [1] 5
But it cannot take a vector input, otherwise ":" will complain. You need to vectorize this function, using Vectorize:
vec_digitsum <- Vectorize(digitsum)
Then it works for a vector input:
b = c(1, 11, 2)
vec_digitsum(b)
# [1] 1 2 2
Now you can use lapply without trouble.
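For completeness, this is the apply step from the question with the vectorized function (assuming data, b, and c are defined as in the question):
data[2:3] <- lapply(data[2:3], vec_digitsum)
data
#   a b c
# 1 a 1 2
# 2 b 2 4
# 3 c 2 5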
Zheyuan Li's answer solved your problem of using lapply, though I'd like to add several points:
Vectorize is just a wrapper around mapply, so it doesn't give you the performance of true vectorization (see the sketch at the end of this answer).
The function itself can be improved for much better readability; compare:
digitsum <- function(x) sum(floor(x / 10^(0:(nchar(as.character(x)) - 1))) %% 10)
vec_digitsum <- Vectorize(digitsum)
sumdigits <- function(x) {
  digits <- strsplit(as.character(x), "")[[1]]
  sum(as.numeric(digits))
}
vec_sumdigits <- Vectorize(sumdigits)
microbenchmark::microbenchmark(digitsum(12324255231323),
sumdigits(12324255231323), times = 100)
Unit: microseconds
expr min lq mean median uq max neval cld
digitsum(12324255231323) 12.223 12.712 14.50613 13.201 13.690 96.801 100 a
sumdigits(12324255231323) 13.689 14.667 15.32743 14.668 15.157 38.134 100 a
The performance of the two versions is similar, but the second one is much easier to understand.
Interestingly, the Vectorize wrapper adds considerable overhead for a single input:
microbenchmark::microbenchmark(vec_digitsum(12324255231323),
vec_sumdigits(12324255231323), times = 100)
Unit: microseconds
expr min lq mean median uq max neval cld
vec_digitsum(12324255231323) 92.890 96.801 267.2665 100.223 108.045 16387.07 100 a
vec_sumdigits(12324255231323) 94.357 98.757 106.2705 101.445 107.556 286.00 100 a
Another advantage of this function is that it will still work if you have really big numbers stored as strings (with the small modification of removing the as.character), whereas the first version will have problems with big numbers or may introduce errors.
Note: at first my benchmark compared the vectorized version of the OP's function with the non-vectorized version of mine, which gave me the wrong impression that my function was much faster. It turned out that this was caused by the Vectorize overhead.
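Since Vectorize does not buy true vectorization, here is a hedged sketch (not from either answer) of a digit sum that works on the whole vector at once with integer arithmetic; it assumes non-negative numbers and, like the original function, is limited by double precision for very large values:
digitsum_vec <- function(x) {
  x <- abs(as.numeric(x))
  out <- numeric(length(x))
  while (any(x > 0)) {    # peel off the last digit of every element
    out <- out + x %% 10
    x <- x %/% 10
  }
  out
}
digitsum_vec(c(1, 11, 2, 23))
# [1] 1 2 2 5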
