I'm a beginner in R. I'm having trouble with a loop calculation in which each value depends on the previous one, like a recursion.
An example of my data:
dt <- data.table(a = c(0:4), b = c( 0, 1, 2, 1, 3))
The calculated value y is y[n] = (y[n-1] + b[n]) * a[n], with the initial value y[1] = 0.
I used a for loop; the code and result are below.
dt$y <- 0
for (i in 2:nrow(dt)) {
  dt$y[i] <- (dt$y[i - 1] + dt$b[i]) * dt$a[i]
}
dt
a b y
1: 0 0 0
2: 1 1 1
3: 2 2 6
4: 3 1 21
5: 4 3 96
This result is what I want. However, my data has over 1,000,000 rows and several columns, so I'm looking for an approach that avoids the for loop. I tried Reduce(), but I could only get it to work with a single vector (e.g. y[n] = y[n-1] + b[n]). As shown above, my function uses two vectors, a and b, so I can't find a solution.
Is there a faster way to do this without a for loop, such as a recursive function or a suitable package function?
This kind of computation cannot make use of R's advantage of vectorization because of the iterative dependencies. But the slow-down appears to really be coming from indexing performance on a data.frame or data.table.
Interestingly, I was able to speed up the loop considerably by accessing a, b, and y directly as numeric vectors (1000+ fold advantage for 2*10^5 rows) or as matrix "columns" (100+ fold advantage for 2*10^5 rows) versus as columns in a data.table or data.frame.
This old discussion may still shed some light on this rather surprising result: https://stat.ethz.ch/pipermail/r-help/2011-July/282666.html
Please note that I also made a different toy data.frame, so I could test a larger example without returning Inf as y grew with i:
Option data.frame (numeric vectors embedded in a data.frame or data.table per your example):
vec_length <- 200000
dt <- data.frame(a=seq(from=0, to=1, length.out = vec_length), b=seq(from=0, to=-1, length.out = vec_length), y=0)
system.time(for (i in 2:nrow(dt)) {
  dt$y[i] <- (dt$y[i - 1] + dt$b[i]) * dt$a[i]
})
#user system elapsed
#79.39 146.30 225.78
#NOTE: Sorry, I didn't have the patience to let the data.table version finish for vec_length=2*10^5.
tail(dt$y)
#[1] -554.1953 -555.1842 -556.1758 -557.1702 -558.1674 -559.1674
Option vector (numeric vectors extracted in advance of loop):
vec_length <- 200000
dt <- data.frame(a=seq(from=0, to=1, length.out = vec_length), b=seq(from=0, to=-1, length.out = vec_length), y=0)
y <- as.numeric(dt$y)
a <- as.numeric(dt$a)
b <- as.numeric(dt$b)
system.time(for (i in 2:length(y)) {
  y[i] <- (y[i - 1] + b[i]) * a[i]
})
#user system elapsed
#0.03 0.00 0.03
tail(y)
#[1] -554.1953 -555.1842 -556.1758 -557.1702 -558.1674 -559.1674
Option matrix (data.frame converted to matrix before loop):
vec_length <- 200000
dt <- as.matrix(data.frame(a=seq(from=0, to=1, length.out = vec_length), b=seq(from=0, to=-1, length.out = vec_length), y=0))
system.time(for (i in 2:nrow(dt)) {
  # column 1 is a, column 2 is b, column 3 is y
  dt[i, 3] <- (dt[i - 1, 3] + dt[i, 2]) * dt[i, 1]
})
#user system elapsed
#0.67 0.01 0.69
tail(dt[, 3])
#[1] -554.1953 -555.1842 -556.1758 -557.1702 -558.1674 -559.1674
#NOTE: a matrix is actually a vector with an additional attribute (its "dim") that says how the "matrix" should be organized into rows and columns
Option data.frame with matrix style indexing:
vec_length <- 200000
dt <- data.frame(a=seq(from=0, to=1, length.out = vec_length), b=seq(from=0, to=-1, length.out = vec_length), y=0)
system.time(for (i in 2:nrow(dt)) {
  dt[i, 3] <- (dt[(i - 1), 3] + dt[i, 2]) * dt[i, 1]
})
#user system elapsed
#110.69 0.03 112.01
tail(dt$y)
#[1] -554.1953 -555.1842 -556.1758 -557.1702 -558.1674 -559.1674
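Applied to the small table from the question, the vector option might look like the sketch below. It is only an illustration: `recur_y` is a hypothetical helper name, and a data.frame is used so that no package is needed (with a data.table, `dt[, y := recur_y(a, b)]` should do the same job).

```r
# Hypothetical helper: runs y[n] = (y[n-1] + b[n]) * a[n] on plain
# numeric vectors, with y[1] fixed at 0.
recur_y <- function(a, b) {
  y <- numeric(length(a))
  for (i in seq_along(a)[-1]) {
    y[i] <- (y[i - 1] + b[i]) * a[i]
  }
  y
}

dt <- data.frame(a = 0:4, b = c(0, 1, 2, 1, 3))
# Loop over the extracted vectors, then assign the whole column once
dt$y <- recur_y(as.numeric(dt$a), as.numeric(dt$b))
dt$y
# [1]  0  1  6 21 96
```

The point of the design is that the data.frame is touched only twice (once to read the columns, once to write the result), so the per-iteration cost is that of plain vector indexing.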
One option is to use Rcpp, since this recursive equation is easy to code in C++:
library(Rcpp)
cppFunction("
NumericVector func(NumericVector b, NumericVector a) {
  int len = b.size();
  NumericVector y(len);  // initialised to 0, so y[0] = 0
  for (int i = 1; i < len; i++) {
    y[i] = (y[i-1] + b[i]) * a[i];
  }
  return y;
}")
func(c( 0, 1, 2, 1, 3), c(0:4))
#[1] 0 1 6 21 96
timing code:
vec_length <- 1e7
dt <- data.frame(a=1:vec_length, b=1:vec_length, y=0)
y <- as.numeric(dt$y)
a <- as.numeric(dt$a)
b <- as.numeric(dt$b)
system.time(for (i in 2:length(y)) {
  y[i] <- (y[i - 1] + b[i]) * a[i]
})
# user system elapsed
# 19.22 0.06 19.44
system.time(func(b, a))
# user system elapsed
# 0.09 0.02 0.09
Here is a base R solution.
According to the information from @ThetaFC, one way to speed this up is to use a matrix or vector (rather than a data.frame or data.table). Thus, it is better to do the following preprocessing before calculating df$y:
a <- as.numeric(df$a)
b <- as.numeric(df$b)
Then, you have two approaches to get df$y:
writing your customized recursion function
f <- function(k) {
  if (k == 1) return(0)
  prev <- f(k - 1)  # evaluate the recursion only once per level
  c(prev, (tail(prev, 1) + b[k]) * a[k])
}
df$y <- f(nrow(df))
Or a non-recursive function (I guess this will be much faster than the recursive approach)
g <- Vectorize(function(k) sum(rev(cumprod(rev(a[2:k])))*b[2:k]))
df$y <- g(seq(nrow(df)))
such that
> df
a b y
1 0 0 0
2 1 1 1
3 2 2 6
4 3 1 21
5 4 3 96
I don't think this will be any faster, but here's one way to do it without an explicit loop
dt[, y := purrr::accumulate2(a, b, function(last, a, b) (last + b)*a
, .init = 0)[-1]]
# a b y
# 1: 0 0 0
# 2: 1 1 1
# 3: 2 2 6
# 4: 3 1 21
# 5: 4 3 96
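For comparison, the same fold can be sketched in base R with Reduce(). The trick, which is an assumption of this sketch rather than something from the answers above, is to iterate over the row indices and look up a[i] and b[i] inside the function, so having two input vectors is no obstacle. Like the purrr version, it is unlikely to beat a loop over plain vectors.

```r
a <- 0:4
b <- c(0, 1, 2, 1, 3)

# Fold over the indices 2..n, starting from y[1] = 0.
# accumulate = TRUE keeps every intermediate value, not just the last one.
y <- Reduce(function(prev, i) (prev + b[i]) * a[i],
            seq_along(a)[-1], init = 0, accumulate = TRUE)
y
# [1]  0  1  6 21 96
```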
I just saw a YouTube video from Numberphile on the Yellowstone sequence (A098550). It's based on a sequence starting with 1 and 2, with subsequent terms generated by the rules:
no repeated terms
always pick the lowest integer
gcd(a_n, a_(n-1)) = 1
gcd(a_n, a_(n-2)) > 1
The first 15 terms would be: 1 2 3 4 9 8 15 14 5 6 25 12 35 16 7
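As a quick sanity check, the two gcd rules and the no-repeats rule can be verified on these 15 terms. This is a minimal sketch: `gcd2` is a hypothetical scalar helper written for this check, and the "lowest integer" rule is not tested here.

```r
# Scalar Euclidean gcd (hypothetical helper for this check)
gcd2 <- function(x, y) {
  while (y != 0) {
    r <- x %% y
    x <- y
    y <- r
  }
  x
}

a <- c(1, 2, 3, 4, 9, 8, 15, 14, 5, 6, 25, 12, 35, 16, 7)
idx <- 4:length(a)  # the gcd rules apply from the 4th term on

# gcd with the previous term must be 1;
# gcd with the term before that must exceed 1
coprime_prev <- mapply(gcd2, a[idx], a[idx - 1]) == 1
shared_prev2 <- mapply(gcd2, a[idx], a[idx - 2]) > 1

all(coprime_prev, shared_prev2, !duplicated(a))
# [1] TRUE
```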
A quick-and-dirty approach in R could be something like this, but understandably it becomes very slow when generating longer sequences. It also makes an assumption about the highest number that can appear in the sequence (for information: the sequence of 10,000 items never goes higher than 5000).
What can we do to make this faster?
library(numbers)  # assumption: GCD() comes from the numbers package; the post does not say
a <- c(1, 2, 3)
p <- length(a)
# all natural numbers
all_ints <- 1:5000
for (n in p:1000) {
  # rule 1 - remove all numbers that are in the sequence already
  next_a_set <- all_ints[which(!all_ints %in% a)]
  # rule 3 - search the remaining set for numbers that have gcd == 1
  next_a_option <- next_a_set[which(
    sapply(
      next_a_set,
      function(x) GCD(a[n], x)
    ) == 1
  )]
  # rule 4 - search the remaining numbers for gcd > 1
  next_a <- next_a_option[which(
    sapply(
      next_a_option,
      function(x) GCD(a[n - 1], x)
    ) > 1
  )]
  # select the lowest
  a <- c(a, min(next_a))
  n <- n + 1
}
Here's a version that's about 20 times faster than yours, with comments about the changes:
# Set a to the final length from the start.
a <- c(1, 2, 3, rep(NA, 997))
p <- 3
# Define a vectorized gcd() function. We'll be testing
# lots of gcds at once. This uses the Euclidean algorithm.
gcd <- function(x, y) { # vectorized gcd
  while (any(y != 0)) {
    x1 <- ifelse(y == 0, x, y)
    y <- ifelse(y == 0, 0, x %% y)
    x <- x1
  }
  x
}
# Guess at a reasonably large vector to work from,
# but we'll grow it later if not big enough.
allnum <- 1:1000
# Keep a logical record of what has been used
used <- c(rep(TRUE, 3), rep(FALSE, length(allnum) - 3))
for (n in p:1000) {
  # rule 1 - remove all numbers that are in the sequence already
  # nothing to do -- used already records that.
  repeat {
    # rule 3 - search the remaining set for numbers that have gcd == 1
    keep <- !used & gcd(a[n], allnum) == 1
    # rule 4 - search the remaining numbers for gcd > 1
    keep <- keep & gcd(a[n-1], allnum) > 1
    # If we found anything, break out of this loop
    if (any(keep))
      break
    # Otherwise, make the set of possible values twice as big,
    # and try again
    allnum <- seq_len(2*length(allnum))
    used <- c(used, rep(FALSE, length(used)))
  }
  # select the lowest
  newval <- which.max(keep)
  # Assign into the appropriate place
  a[n+1] <- newval
  # Record that it has been used
  used[newval] <- TRUE
}
If you profile it, you'll see it spends most of its time in the gcd() function. You could probably make that a lot faster by redoing it in C or C++.
The biggest change here is pre-allocation and restricting the search to numbers that have not yet been used.
N <- 5e3
a <- integer(N)
a[1:3] <- 1:3
b <- logical(N) # which numbers have been used already?
b[1:3] <- TRUE
NN <- 1:N
system.time({
  for (n in 4:N) {
    a1 <- a[n - 1L]
    a2 <- a[n - 2L]
    # only scan numbers that have not been used yet, lowest first
    for (k in NN[!b]) {
      if (GCD(k, a1) == 1L & GCD(k, a2) > 1L) {
        a[n] <- k
        b[k] <- TRUE
        break
      }
    }
    # no candidate <= N found: truncate the sequence and stop
    if (!a[n]) {
      a <- a[1:(n - 1L)]
      break
    }
  }
})
#> user system elapsed
#> 1.28 0.00 1.28
#> [1] 1137
For a fast C++ algorithm, see here.
Let's assume an integer x. I want to split this quantity into n mostly equal chunks and save the values in a vector. E.g. if x = 10 and n = 4 then the resulting vector would be:
and if n = 3:
Note: The order of the resulting vector does not matter
While this will create a (probably unnecessarily) large object when x is large, it is still pretty quick:
x <- 10
n <- 4
tabulate(cut(1:x, n))
#[1] 3 2 2 3
On a decent modern machine dividing 10M records into 100K groups, it takes only 5 seconds:
x <- 1e7
n <- 1e5
system.time(tabulate(cut(1:x, n)))
# user system elapsed
# 5.07 0.06 5.13
Here are some solutions.
1) lpSolve Solve this integer linear program. It should be fast even for large x (but not if n is also large). I also tried it for x = 10,000 and n = 3 and it returned the solution immediately.
For example, for n = 4 and x = 10 it corresponds to
min x4 - x1 such that 0 <= x1 <= x2 <= x3 <= x4 and
x1 + x2 + x3 + x4 = 10 and
x1, x2, x3, x4 are all integer
The R code is:
library(lpSolve)
x <- 10
n <- 4
D <- diag(n)
# rows 1..n-1 encode x[i+1] - x[i] >= 0; the last row encodes sum(x) = x
mat <- (col(D) - row(D) == 1) - D
mat[n, ] <- 1
obj <- replace(numeric(n), c(1, n), c(-1, 1))
dir <- replace(rep(">=", n), n, "=")
rhs <- replace(numeric(n), n, x)
result <- lp("min", obj, mat, dir, rhs, all.int = TRUE)
result$solution
## [1] 2 2 3 3
and if we repeat the above with n = 3 we get:
## [1] 3 3 4
2) lpSolveAPI The lpSolveAPI package's interface to lpSolve supports a sparse matrix specification which may reduce storage if n is large although it may still be slow if n is sufficiently large. Rewriting (1) using this package we have:
library(lpSolveAPI)
x <- 10
n <- 4
mod <- make.lp(n, n)
set.type(mod, 1:n, "integer")
set.objfn(mod, c(-1, 1), c(1, n))
for(i in 2:n) add.constraint(mod, c(-1, 1), ">=", 0, c(i-1, i))
add.constraint(mod, rep(1, n), "=", x)
solve(mod)
get.variables(mod)
## [1] 2 2 3 3
3) Greedy Heuristic This alternative uses no packages. It starts with a candidate solution having n-1 values of x/n rounded down and one remaining value. On each iteration it tries to improve the current solution by subtracting one from the largest values and adding 1 to the same number of smallest values. It stops when it can make no further improvement in the objective, diff(range(soln)).
Note that for x <- 1e7 and n <- 1e5 the problem is quite easy to solve since n divides evenly into x. In particular system.time(tabulate(cut(...))) reports 18 sec on my machine and for the same problem the code below takes 0.06 seconds as it gets the answer after 1 iteration.
For x <- 1e7 and n <- 1e5-1 system.time(tabulate(cut(...))) reports 16 seconds on my machine and for the same problem the code below takes 4 seconds finishing after 100 iterations.
In the example below, taken from the question, 10/4 rounded down is 2 so it starts out with c(2, 2, 2, 4). On the first iteration it gets c(2, 2, 3, 3). On the second iteration it cannot get any improvement and so returns the answer.
x <- 10
n <- 4
a <- x %/% n
soln <- replace(rep(a, n), n, x - (n-1)*a)
obj <- diff(range(soln))
iter <- 0
while(TRUE) {
  iter <- iter + 1
  soln_new <- soln
  mx <- which(soln == max(soln))
  ix <- seq_along(mx)
  # move one unit from each of the largest values to as many of the smallest
  soln_new[ix] <- soln_new[ix] + 1
  soln_new[mx] <- soln_new[mx] - 1
  soln_new <- sort(soln_new)
  obj_new <- diff(range(soln_new))
  if (obj_new >= obj) break
  soln <- soln_new
  obj <- obj_new
}
iter
## [1] 2
soln
## [1] 2 2 3 3
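For this particular problem the balanced split can also be written down directly, without solving anything: every chunk gets x %/% n, and x %% n of the chunks get one extra. The sketch below is not taken from the answers above, and `split_evenly` is a hypothetical name.

```r
split_evenly <- function(x, n) {
  base  <- x %/% n   # every chunk gets at least this much
  extra <- x %% n    # this many chunks get one more
  rep(base, n) + (seq_len(n) <= extra)
}

split_evenly(10, 4)
# [1] 3 3 2 2
split_evenly(10, 3)
# [1] 4 3 3
```

Since the order of the result does not matter in the question, these agree with the c(2, 2, 3, 3) and c(3, 3, 4) results shown above.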
I've implemented a simple dynamic programming example described here, using data.table, in the hope that it would be as fast as vectorized code.
B=100; M=50; alpha=0.5; beta=0.9;
n = B + M + 1
m = M + 1
u <- function(c)c^alpha
dt <- data.table(s = 0:(B+M))[, .(a = 0:min(s, M)), s] # State Space and corresponding Action Space
dt[, u := (s-a)^alpha,] # rewards r(s, a)
dt <- dt[, .(s_next = a:(a+B), u = u), .(s, a)] # all possible (s') for each (s, a)
dt[, p := 1/(B+1), s] # transition probs
# s a s_next u p
# 1: 0 0 0 0 0.009901
# 2: 0 0 1 0 0.009901
# 3: 0 0 2 0 0.009901
# 4: 0 0 3 0 0.009901
# 5: 0 0 4 0 0.009901
# ---
#649022: 150 50 146 10 0.009901
#649023: 150 50 147 10 0.009901
#649024: 150 50 148 10 0.009901
#649025: 150 50 149 10 0.009901
#649026: 150 50 150 10 0.009901
To give a little context to my question: conditional on s and a, the future value of s (s_next) is realized as one of a:(a+B), each with probability p = 1/(B + 1). The u column gives u(s, a) for each combination (s, a).
Given the initial values V (always an n by 1 vector) for each unique state s, V updates according to V[s] = max over a of { u(s, a) + beta * sum(p * V[s_next]) } (Bellman equation).
Maximization is with respect to a, hence [, `:=`(v = max(v), i = s_next[which.max(v)]), by = .(s)] in the iteration below.
Actually, there is a very efficient vectorized solution. I thought the data.table solution would be comparable in performance to the vectorized approach.
I know that the main culprit is dt[, v := V[s_next + 1]]. Alas, I have no idea how to fix it.
# Iteration starts here
V <- rep(0, n) # initial guess for Value function
i <- 1
tol <- 1
while(tol > 0.0001){
  dt[, v := V[s_next + 1]]
  dt[, v := u + beta * sum(p*v), by = .(s, a)
     ][, `:=`(v = max(v), i = s_next[which.max(v)]), by = .(s)] # Iteration
  dt1 <- dt[, .(v[1L], i[1L]), by = s]
  Vnew <- dt1$V1
  sig <- dt1$V2
  tol <- max(abs(V - Vnew))
  V <- Vnew
  i <- i + 1
}
# user system elapsed
# 5.81 0.40 6.25
To my dismay, the data.table solution is even slower than the following highly non-vectorized solution. As a sloppy data.table user, I must be missing some data.table functionality. Is there a way to improve things, or is data.table not suitable for these kinds of computations?
S <- 0:(n-1) # StateSpace
VFI <- function(V){
  out <- rep(0, length(V))
  for(s in S){
    x <- -Inf
    for(a in 0:min(s, M)){
      s_next <- a:(a+B) # (s')
      x <- max(x, u(s-a) + beta * sum(V[s_next + 1]/(B+1)))
    }
    out[s+1] <- x
  }
  out
}
V <- rep(0, n) # initial guess for Value function
i <- 1
tol <- 1
while(tol > 0.0001){
  Vnew <- VFI(V)
  tol <- max(abs(V - Vnew))
  V <- Vnew
  i <- i + 1
}
# user system elapsed
# 3.81 0.00 3.81
Here's how I would do this...
DT = CJ(s = seq_len(n)-1L, a = seq_len(m)-1L, s_next = seq_len(n)-1L)
DT[ , p := 0]
#p is 0 unless this is true
DT[between(s_next, a, a + B), p := 1/(B+1)]
#may as well subset to eliminate irrelevant states
DT = DT[p>0 & s>=a]
DT[ , util := u(s - a)]
#don't technically need by, but just to be careful
DT[ , V0 := rep(0, n), by = .(a, s_next)]
while(TRUE) {
  #for each s, maximize given past value;
  # within each s, have to sum over s_nexts,
  # to do so, sum by a
  DT[ , V1 := max(.SD[ , util[1L] + beta*sum(V0*p), by = a],
                  na.rm = TRUE), by = s]
  if (DT[ , max(abs(V0 - V1))] < 1e-4) break
  DT[ , V0 := V1]
}
On my machine this takes about 15 seconds (so not good)... but maybe this will give you some ideas. For example, this data.table is far too large since there really only are n unique values of V ultimately.
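One way to act on that last observation is to iterate on the length-n value vector directly. The sketch below is an assumption about what a vectorized update could look like, not the vectorized solution the question refers to. It relies on the fact that the expected continuation value depends only on a, so it can be computed for all actions at once from a cumulative sum. `vfi_step`, `R` and `EV` are hypothetical names.

```r
B <- 100; M <- 50; alpha <- 0.5; beta <- 0.9
n <- B + M + 1
u <- function(c) c^alpha

s <- 0:(n - 1)   # states
a <- 0:M         # actions

# Reward matrix r(s, a): one row per state, one column per action.
# Infeasible actions (a > s) get -Inf so they never win the max.
R <- outer(s, a, function(s, a) ifelse(a <= s, u(pmax(s - a, 0)), -Inf))

# One Bellman update on the length-n value vector
vfi_step <- function(V) {
  cs <- c(0, cumsum(V))
  # mean of V over s_next = a:(a+B), for every action a at once
  EV <- (cs[a + B + 2] - cs[a + 1]) / (B + 1)
  # add beta * EV[a] to column a, then take the row-wise max over actions
  apply(sweep(R, 2, beta * EV, "+"), 1, max)
}

V <- rep(0, n)   # initial guess for the value function
repeat {
  Vnew <- vfi_step(V)
  tol <- max(abs(V - Vnew))
  V <- Vnew
  if (tol <= 1e-4) break
}
```

Each iteration touches an n by (M + 1) matrix instead of a table with one row per (s, a, s_next) combination.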
I have data that looks like this :
char_column date_column1 date_column2 integer_column
415 18JT9R6EKV 2014-08-28 2014-09-06 1
26 18JT9R6EKV 2014-12-08 2014-12-11 2
374 18JT9R6EKV 2015-03-03 2015-03-09 1
139 1PEGXAVCN5 2014-05-06 2014-05-10 3
969 1PEGXAVCN5 2014-06-11 2014-06-15 2
649 1PEGXAVCN5 2014-08-12 2014-08-16 3
I want to loop over the rows, check each row against the preceding one, and, when certain conditions hold, assign both the same number (so I can group them later). The point is that if the date segments are close enough, I would collapse them into one segment.
My attempt is the following:
i <- 1
z <- 1
v <- 1
for (i in 2:nrow(df)){
z[i] <- ifelse(df[i,'char_column'] == df[i-1,'char_column'],
ifelse((df[i,'date_column1'] - df[i-1,'date_column2']) <= 5,
ifelse(df[i,'integer_column'] == df[i-1,'integer_column'],
v, v<- v+1),
v <- v+1),
v <- v+1)}
df$grouping <- z
Then I would just group using min(date_column1) and max(date_column2).
This method works fine for, say, 100,000 rows (22.86 seconds), but a million rows take 33.18 minutes! I have over 60 million rows to process.
Is there a way to make the process more efficient?
PS: to generate a similar table you can use the following code :
x <- NULL
for (i in 1:200) { x[i] <- paste(sample(c(LETTERS, 1:9), 10), collapse = '')}
y <- sample((as.Date('2014-01-01')):as.Date('2015-05-01'), 1000, replace = T)
y2 <- y + sample(1:10)
df <- data.frame(char_column = sample(x, 1000, rep = T),
date_column1 = as.Date(y, origin = '1970-01-01'),
date_column2 = as.Date(y2,origin = '1970-01-01'),
integer_column = sample(1:3,1000, replace = T),
row.names = NULL)
df <- df[order(df$char_column, df$date_column1),]
Since data.table::rleid does not work, I post another (hopefully) fast solution
1. Get rid of nested ifelse
ifelse is often slow, especially for scalar evaluation, use if.
Nested ifelse should be avoided whenever possible: observe that ifelse(A, ifelse(B, x, y), y) can be suitably replaced by if (A&B) x else y
f1 <- function(df){
  z <- rep(NA, nrow(df))
  z[1] <- 1
  char_col <- df[, 'char_column']
  date_col1 <- df[, 'date_column1']
  date_col2 <- df[, 'date_column2']
  int_col <- df[, 'integer_column']
  for (i in 2:nrow(df)){
    if((char_col[i] == char_col[i-1])&((date_col1[i] - date_col2[i-1]) <= 5)&(int_col[i] == int_col[i-1]))
    {
      z[i] <- z[i-1]    # same segment as the previous row
    } else {
      z[i] <- z[i-1]+1  # start a new segment
    }
  }
  z
}
f1 is about 40% faster than the original solution for 10,000 rows.
user system elapsed
2.72 0.00 2.79
2. Vectorize
Upon closer inspection the conditions inside if can be vectorized
f2 <- function(df){
  z <- rep(NA, nrow(df))
  z[1] <- 1
  char_col <- df[, 'char_column']
  date_col1 <- df[, 'date_column1']
  date_col2 <- df[, 'date_column2']
  int_col <- df[, 'integer_column']
  # shift() is data.table's lag; cond[i] compares row i with row i-1
  cond <- (char_col==shift(char_col))&(date_col1 - shift(date_col2) <= 5)&(int_col==shift(int_col))
  for (i in 2:nrow(df)){
    if (cond[i]) {
      z[i] <- z[i-1]
    } else {
      z[i] <- z[i-1]+1
    }
  }
  z
}
# for 10000 rows
# user system elapsed
# 0.01 0.00 0.02
3. Vectorize, Vectorize
While f2 is already quite fast, a further vectorization is possible. Observe how z is calculated: cond is a logical vector, and z[i] = z[i-1] + 1 when cond is FALSE. This is none other than cumsum(!cond).
f3 <- function(df){
  # df must be a data.table; both columns are added by reference
  df[, cond := (char_column==shift(char_column))&(date_column1 - shift(date_column2) <= 5)&(integer_column==shift(integer_column)),]
  df[, group := cumsum(!c(FALSE, cond[-1L])),]
}
For 1M rows
# user system elapsed
# 0.05 0.05 0.09
# user system elapsed
# 1.83 0.05 1.87
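The same cumsum idea can be sketched in base R on a tiny hand-made table, which makes it easy to check the grouping by eye. The column names follow the question. `group_ids` and its `max_gap` argument are hypothetical names introduced for this illustration.

```r
# Hypothetical helper: a row continues the previous row's group when the
# key and integer match and the gap between segments is at most max_gap days.
group_ids <- function(df, max_gap = 5) {
  n <- nrow(df)
  same <- df$char_column[-1] == df$char_column[-n] &
    as.numeric(df$date_column1[-1] - df$date_column2[-n]) <= max_gap &
    df$integer_column[-1] == df$integer_column[-n]
  # the first row always opens a group; every FALSE in same opens another
  cumsum(c(TRUE, !same))
}

df <- data.frame(
  char_column    = c("A", "A", "A", "B"),
  date_column1   = as.Date(c("2014-01-01", "2014-01-08", "2014-01-12", "2014-01-16")),
  date_column2   = as.Date(c("2014-01-05", "2014-01-10", "2014-01-15", "2014-01-18")),
  integer_column = c(1, 1, 2, 2),
  stringsAsFactors = FALSE
)
group_ids(df)
# [1] 1 1 2 3
```

Rows 1 and 2 merge (same key, same integer, 3-day gap). Row 3 starts a new group because the integer changes, and row 4 because the key changes.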
Given 3 vectors A, B and x, I would like to get a vector Sol that is TRUE whenever the nearest lower value in the union of A and B comes from A, and FALSE if it comes from B.
Information about the data
A is a relatively short (~80 elements) vector of integers.
B is a relatively short (~200 elements) vector of integers.
x is a very long (1e7-1e8 maybe) vector of integers.
Example Data
These data are much shorter than the one I am to
repeat {
  A = floor(runif(30, 1,1e5))
  B = c(1,floor(runif(80, 1,1e5)))
  # redraw until A and B share no value
  if (!any(c(A %in% B, B %in% A))){break}
}
x = floor(runif(300, 1,1e5))
Non-performant solution
The following should work; however, it will be very slow as x becomes very long.
Sol = rep(NA,length(x))
for (i in 1:length(x))
{
  xi = x[i]
  mA = max(A[A<=xi])
  mB = max(B[B<=xi])
  if (mA>mB) {Sol[i]=TRUE} else {Sol[i]=FALSE}
}
Note on performance
The process will be repeated maybe 1000 times. I currently don't have accurate estimation of the number of repeats and the length of x though.
Benchmark of @MaratTalipov's answer in comparison with mine
repeat {
  A = floor(runif(100, 1,1e6))
  B = c(1,floor(runif(300, 1,1e6)))
  # redraw until A and B share no value
  if (!any(c(A %in% B, B %in% A))){break}
}
x = floor(runif(1e5, 1,1e6))
library(rbenchmark)
Marat = function (A,B,x)
{
  d <- rbind(data.frame(x=TRUE,y=A),data.frame(x=FALSE,y=B))
  d <- d[order(d$y),]
  return (d$x[findInterval(x,d$y)])
}
Remi = function (A,B,x)
{
  Sol = rep(NA,length(x))
  for (i in 1:length(x))
  {
    xi = x[i]
    mA = max(A[A<=xi])
    mB = max(B[B<=xi])
    if (mA>mB) {Sol[i]=TRUE} else {Sol[i]=FALSE}
  }
  return(Sol)
}
benchmark(s1 <- Marat(A,B,x), s2 <- Remi(A,B,x), order="elapsed")
                  test replications elapsed relative user.self sys.self user.child sys.child
1 s1 <- Marat(A, B, x)          100   1.003       NA     0.964    0.065          0         0
2  s2 <- Remi(A, B, x)          100 144.118       NA   130.320   14.867          0         0
How about this approach:
d <- rbind(data.frame(x=TRUE,y=A),data.frame(x=FALSE,y=B))
d <- d[order(d$y),]
out <- d$x[findInterval(x,d$y)]
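To see why this works, here is a small worked example with hand-picked values (the numbers are made up for illustration). It uses plain vectors instead of a data.frame, but the idea is the same: sort the union of A and B, remember where each value came from, and let findInterval() locate the nearest lower value for every x in one call.

```r
A <- c(5, 20)
B <- c(1, 10)
x <- c(3, 7, 12, 25)

y      <- c(A, B)
from_A <- rep(c(TRUE, FALSE), times = c(length(A), length(B)))
o      <- order(y)   # y[o] is 1, 5, 10, 20

# findInterval() returns, for each x, the index of the largest sorted value <= x
idx <- findInterval(x, y[o])
idx
# [1] 1 2 3 4
from_A[o][idx]
# [1] FALSE  TRUE FALSE  TRUE
```

Note that an x below every value in A and B would get index 0 and be silently dropped by the subsetting. The example data avoids this because B always contains 1.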
Try this approach, which uses the apply family of functions. It still has to loop through all values of x, so there is no avoiding a linear scan, but it is more efficient than a plain loop. I added a condition for the case where a value of x is below all values of A or B, and return NA in this case. You can remove it if you want NULLs instead.
sapply(x, function(x) if (x < min(A) || x < min(B)) NA else if (max(A[A <= x]) > max(B[B <= x])) TRUE else FALSE)