Say I have
a <- c(0, 22, 0, 2, 0, 0, 20, 20, 20, 0, 0)
I want a cumulative sum in which I subtract 5 from each value in a and then add the previous result.
However, the running total must also be bounded: if it falls below 0 it should be set to 0, and if it rises above 40 it should be set to 40.
So, I want to get
(0, 17, 12, 9, 4, 0, 15, 30, 40, 35, 30)
Can anyone help? I've been trying out a lot of things for a few hours now!
@Holger, that method doesn't always work.
If I add a couple of extra zeros, it does not give the right result:
a <- c(0, 22, 0, 2, 0, 0, 0, 0, 20, 20, 20, 0, 0)
gives
0 17 12 9 4 0 1 7 22 37 52 47 42
Here are some alternatives:
1) Loop. Create a one-line loop like this:
b <- a; for(i in seq_along(b)[-1]) b[i] <- min(40, max(0, a[i] - 5 + b[i-1]))
b
## [1] 0 17 12 9 4 0 15 30 40 35 30
2) Reduce
f <- function(b, a) min(40, max(0, a - 5 + b))
Reduce(f, a, accumulate = TRUE)
## [1] 0 17 12 9 4 0 15 30 40 35 30
3) Recursion. This recursive solution is limited to inputs that are not too long.
rec <- function(a) {
n <- length(a)
if (n <= 1) a
else {
rec.hd <- Recall(a[-n])
c(rec.hd, min(40, max(0, rec.hd[n-1] + a[n] - 5)))
}
}
rec(a)
## [1] 0 17 12 9 4 0 15 30 40 35 30
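As a quick check of the alternatives above on the longer vector from the follow-up comment, here is a minimal sketch that wraps approach 2 in a function. The name clamped_cumsum and the parameters d, lower and upper are made up for illustration; they are not part of the original answer.

```r
# Hypothetical helper: bounded running sum built on Reduce (approach 2).
# d, lower and upper are illustrative parameter names.
clamped_cumsum <- function(a, d = 5, lower = 0, upper = 40) {
  step <- function(prev, cur) min(upper, max(lower, cur - d + prev))
  # Without init, Reduce keeps a[1] unchanged as the starting value.
  Reduce(step, a, accumulate = TRUE)
}

a2 <- c(0, 22, 0, 2, 0, 0, 0, 0, 20, 20, 20, 0, 0)
clamped_cumsum(a2)
## [1]  0 17 12  9  4  0  0  0 15 30 40 35 30
```

The extra zeros keep the total pinned at 0 until the first 20 arrives, which is the behaviour the question asks for.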
Try
cumsum_up_low <- function(a, d=5, up=40, low=0 ){
  out = rep(0, length(a))
  # clamp the first value to [low, up]
  out[1] = min(up, max(low, a[1]))
  for(i in seq_along(a)[-1]){
    s = out[i-1] + a[i] - d   # unclamped running total
    if(s > low && s < up){
      out[i] = s
    } else if(s <= low){
      out[i] = low
    } else out[i] = up
  }
  out
}
cumsum_up_low(a, d=5, up=40, low=0)
# [1] 0 17 12 9 4 0 15 30 40 35 30
For long vectors
a <- sample(a, 1e6, TRUE)
system.time(cumsum_up_low(a))
# user system elapsed
# 3.59 0.00 3.59
library(compiler)
cumsum_up_low_compiled <- cmpfun(cumsum_up_low)
system.time(cumsum_up_low_compiled(a))
# user system elapsed
# 0.28 0.00 0.28
For really long vectors
library(Rcpp)
cppFunction('
NumericVector cumsum_up_low_cpp(NumericVector a, double d, double up, double low) {
  NumericVector out(a.size());
  if (a.size() == 0) return out;
  // clamp the first value to [low, up]
  out[0] = std::min(up, std::max(low, a[0]));
  for(int i=1; i<a.size(); i++){
    double s = out[i-1] + a[i] - d;  // unclamped running total
    if(s > low && s < up){
      out[i] = s;
    } else if(s <= low){
      out[i] = low;
    } else out[i] = up;
  }
  return out;
}')
a <- sample(a, 5e6, replace = TRUE)
system.time(cumsum_up_low_compiled(a, d=5, up=40, low=0))
# user system elapsed
# 1.45 0.00 1.46
system.time(cumsum_up_low_cpp(a, d=5, up=40, low=0))
# user system elapsed
# 0.04 0.02 0.05
You can use Reduce to get the cumulative sum and combine this with max and min or pmin and pmax to get the bounds.
It is unclear whether you want to apply the 0 and 40 bounds within the cumulative summation or to bound the result afterwards. Below, I've provided both possibilities.
Bound within the summation:
Reduce(function(x, y) min(max(x + y - 5, 0), 40), a, 0, accumulate=TRUE)
[1] 0 0 17 12 9 4 0 15 30 40 35 30
Bound after the summation
pmin(pmax(Reduce(function(x, y) x + y - 5, a, 0, accumulate=TRUE), 0), 40)
[1] 0 0 12 7 4 0 0 9 24 39 34 29
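Both results above have 12 elements for an 11-element input because Reduce includes the init value as the first element of the accumulated output. A small sketch, assuming you want exactly one output per input, drops that leading element:

```r
a <- c(0, 22, 0, 2, 0, 0, 20, 20, 20, 0, 0)

# Reduce(..., init, accumulate = TRUE) returns init followed by one value
# per input element, so drop the first entry to align the result with a.
bounded <- Reduce(function(x, y) min(max(x + y - 5, 0), 40), a, 0,
                  accumulate = TRUE)[-1]
bounded
## [1]  0 17 12  9  4  0 15 30 40 35 30
length(bounded) == length(a)
## [1] TRUE
```

Note that this variant also subtracts 5 from the first element, which makes no difference here because a[1] is 0.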
This is definitely not the most efficient way to do this, but it might be the easiest to understand:
a <- c(0, 22, 0, 2, 0, 0, 20, 20, 20, 0, 0)
## Initialize another vector just like a
c <- a
## Do it easy-to-understand'ly in a for loop:
for (i in seq_along(a)){
b <- a[i]
if (i>1) {
b <- b+c[i-1]
b <- b-5
}
if (b<0) b <- 0
if (b>40) b <- 40
c[i] <- b
print(c[i])
}
Try to figure out each part, and if you need help, lemme know!
I have a numeric vector like this x <- c(1, 23, 7, 10, 9, 2, 4) and I want to group the elements from left to right with the constraint that each group sum must not exceed 25. Thus, here the first group is c(1, 23), the second is c(7, 10) and the last is c(9, 2, 4). The expected output is a data frame with a second column containing the groups:
data.frame(x= c(1, 23, 7, 10, 9, 2, 4), group= c(1, 1, 2, 2, 3, 3, 3))
I have tried different things with cumsum but am not able to kind of dynamically restart cumsum for the new group once the limit sum of 25 for the last group is reached.
I think a C++ function is the fastest way:
library(Rcpp)
cppFunction(
"IntegerVector GroupBySum(const NumericVector& x, const double& max_sum = 25)
{
double sum = 0;
int cnt = 0;
int period = 1;
IntegerVector res(x.size());
for (int i = 0; i < x.size(); ++i)
{
++cnt;
sum += x[i];
if (sum > max_sum)
{
sum = x[i];
if (cnt > 1)
++period;
cnt = 1;
}
res[i] = period;
}
return res;
}"
)
GroupBySum(c(1, 23, 7, 10, 9, 2, 4), 25)
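If no C++ toolchain is available, the same loop can be sketched in plain R. The name group_by_sum_r is hypothetical, and this is a line-by-line port of the logic above rather than a tuned implementation; expect it to be slower than the compiled version.

```r
# Plain R port of the GroupBySum loop (hypothetical name).
group_by_sum_r <- function(x, max_sum = 25) {
  res <- integer(length(x))
  running <- 0      # sum of the current group
  cnt <- 0L         # number of elements in the current group
  period <- 1L      # current group id
  for (i in seq_along(x)) {
    cnt <- cnt + 1L
    running <- running + x[i]
    if (running > max_sum) {
      # x[i] does not fit: it starts a new group,
      # unless it is the only element in the current one
      running <- x[i]
      if (cnt > 1L) period <- period + 1L
      cnt <- 1L
    }
    res[i] <- period
  }
  res
}

group_by_sum_r(c(1, 23, 7, 10, 9, 2, 4), 25)
## [1] 1 1 2 2 3 3 3
```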
We can try this as a programming practice if you like :)
f1 <- function(x) {
group <- c()
while (length(x)) {
idx <- cumsum(x) <= 25
x <- x[!idx]
group <- c(group, rep(max(group, 0) + 1, sum(idx)))
}
group
}
or
f2 <- function(x) {
group <- c()
g <- 0
while (length(x)) {
cnt <- s <- 0
for (i in seq_along(x)) {
s <- s + x[i]
if (s <= 25) {
cnt <- cnt + 1
} else {
break
}
}
g <- g + 1
group <- c(group, rep(g, cnt))
x <- x[-(1:cnt)]
}
group
}
or
f3 <- function(x) {
s <- cumsum(x)
r <- c()
grp <- 1
while (length(s)) {
idx <- (s <= 25)
r <- c(r, rep(grp, sum(idx)))
grp <- grp + 1
s <- s[!idx] - tail(s[idx], 1)
}
r
}
which gives
[1] 1 1 2 2 3 3 3
and benchmarking among them looks like
library(microbenchmark)
library(ggplot2)  # provides autoplot()
set.seed(1)
x <- runif(1e3, 0, 25)
bm <- microbenchmark(
f1(x),
f2(x),
f3(x),
check = "equivalent"
)
autoplot(bm)
Recursion version
Another option is using recursion (based on f1())
f <- function(x, res = c()) {
if (!length(x)) {
return(res)
}
idx <- cumsum(x) <= 25
Recall(x[!idx], res = c(res, list(x[idx])))
}
and you will see
> f(x)
[[1]]
[1] 1 23
[[2]]
[1] 7 10
[[3]]
[1] 9 2 4
You can use the cumsumbinning function from the MESS package:
# install.packages("MESS")
MESS::cumsumbinning(x, 25, cutwhenpassed = FALSE)
# [1] 1 1 2 2 3 3 3
Or it can be done with purrr::accumulate:
cumsum(x == purrr::accumulate(x, ~ifelse(.x + .y <= 25, .x + .y, .y)))
# [1] 1 1 2 2 3 3 3
Output:
group <- MESS::cumsumbinning(x, 25, cutwhenpassed = FALSE)
data.frame(x= c(1, 23, 7, 10, 9, 2, 4),
group = group)
x group
1 1 1
2 23 1
3 7 2
4 10 2
5 9 3
6 2 3
7 4 3
Quick benchmark:
x<- c(1, 23, 7, 10, 9, 2, 4)
bm <- microbenchmark(
fThomas(x),
fThomasRec(x),
fJKupzig(x),
fCumsumbinning(x),
fAccumulate(x),
fReduce(x),
fRcpp(x),
times = 100L,
setup = gc(FALSE)
)
autoplot(bm)
Егор Шишунов's Rcpp function is the fastest, closely followed by MESS::cumsumbinning and both of ThomasIsCoding's functions.
With n = 100, the gap gets bigger but Rcpp and cumsumbinning are still the top choices and the while loop option is no longer efficient (I had to remove ThomasIsCoding's functions because the execution time was too long):
x = runif(100, 1, 50)
In base R you could also use Reduce:
do.call(rbind, Reduce(\(x,y) if((z<-x[1] + y) > 25) c(y, x[2]+1)
else c(z, x[2]), x[-1], init = c(x[1], 1), accumulate = TRUE))
[,1] [,2]
[1,] 1 1
[2,] 24 1
[3,] 7 2
[4,] 17 2
[5,] 9 3
[6,] 11 3
[7,] 15 3
Breaking it down:
f <- function(x, y){
z <- x[1] + y
if(z > 25) c(y, x[2] + 1)
else c(z, x[2])
}
do.call(rbind, Reduce(f, x[-1], init = c(x[1], 1), accumulate = TRUE))
If using purrr::accumulate:
library(tidyverse)
accumulate(x[-1], f, .init = c(x[1], 1)) %>%
do.call(rbind, .)
[,1] [,2]
[1,] 1 1
[2,] 24 1
[3,] 7 2
[4,] 17 2
[5,] 9 3
[6,] 11 3
[7,] 15 3
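To get from this two-column matrix to the data frame the question asks for, only the second column (the group counter) is needed. A minimal base R sketch, redefining the step function f from above so the block is self-contained:

```r
x <- c(1, 23, 7, 10, 9, 2, 4)

# State carried through Reduce: c(running sum, group id).
f <- function(state, y) {
  z <- state[1] + y
  if (z > 25) c(y, state[2] + 1) else c(z, state[2])
}

res <- do.call(rbind, Reduce(f, x[-1], init = c(x[1], 1), accumulate = TRUE))

# Column 1 is the running sum within each group, column 2 the group id.
out <- data.frame(x = x, group = res[, 2])
out$group
## [1] 1 1 2 2 3 3 3
```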
Here is a solution using base R and cumsum (and lapply for iteration):
id <- c(seq(1, length(x),1)[!duplicated(cumsum(x) %/% 25)], length(x)+1)
id2 <- 1:length(id)
group <- unlist(lapply(1:(length(id)-1), function(x) rep(id2[x], diff(id)[x])))
data.frame(x=x, group=group)
x group
1 1 1
2 23 1
3 7 2
4 10 2
5 9 3
6 2 3
7 4 3
Edit: New Approach using recursive function
Here is a new approach, written as a recursive function, that should also cover the special case that @ЕгорШишунов considered.
recursiveFunction <- function(x, maxN=25, sumX=0, period=1, period2return=c()){
  sumX <- sumX + x[1]
  # start a new group only when the running sum exceeds maxN
  if (sumX > maxN) { sumX <- x[1]; period <- period + 1 }
  period2return <- c(period2return, period)
  if (length(x) == 1) { return(period2return) }
  # pass maxN along instead of a hard-coded 25
  return(recursiveFunction(x[-1], maxN, sumX, period, period2return))
}
recursiveFunction(x, maxN=25)
Note that you should not change the entries for the last three function parameters (sumX=0, period=1, period2return=c()) because they are only important during the recursive call of the function.
I need to create a function of five variables,
a (multiplier)
n (sample size)
c (increment with default 0)
m (modulus)
x0 (Initial seed value)
I need to generate a sequence of random numbers with the equation
x_i = (a * x_{i-1} + c) mod m, for i = 1, 2, ..., n
The result should be the vector x = (x_1, ..., x_n).
My attempt:
my.unif1 <- function(n, a,c = 0, m, x = x[0]) {
while(n > 0) {
x[n] <- (a*x[n-1]+c)%%m
}
}
It sounds like you want to learn more about Linear Congruential Generators. Here's a resource that will probably help you solve your code problem: https://qualityandinnovation.com/2015/03/03/a-linear-congruential-generator-lcg-in-r/
lcg <- function(a,c,m,run.length,seed) {
x <- rep(0,run.length)
x[1] <- seed
for (i in seq_len(run.length - 1)) {
x[i+1] <- (a * x[i] + c) %% m
}
U <- x/m # scale all of the x's to
# produce uniformly distributed
# random numbers between [0,1)
return(list(x=x,U=U))
}
> z <- lcg(6,7,23,20,5)
> z
$x
[1] 5 14 22 1 13 16 11 4 8 9 15 5 14 22 1 13 16 11
[19] 4 8
$U
[1] 0.21739130 0.60869565 0.95652174 0.04347826 0.56521739
[6] 0.69565217 0.47826087 0.17391304 0.34782609 0.39130435
[11] 0.65217391 0.21739130 0.60869565 0.95652174 0.04347826
[16] 0.56521739 0.69565217 0.47826087 0.17391304 0.34782609
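The question asks for a slightly different interface: arguments n, a, c (default 0), m and x0, returning only x_1, ..., x_n. A sketch along those lines follows; the name my_lcg is hypothetical, and the body is the same recurrence as lcg above.

```r
# Hypothetical wrapper matching the question's argument list.
# Note: the argument named c shadows base::c only inside this function,
# and c() is not called in the body.
my_lcg <- function(n, a, c = 0, m, x0) {
  x <- numeric(n)
  prev <- x0
  for (i in seq_len(n)) {
    x[i] <- (a * prev + c) %% m   # x_i = (a * x_{i-1} + c) mod m
    prev <- x[i]
  }
  x
}

my_lcg(n = 5, a = 6, c = 7, m = 23, x0 = 5)
## [1] 14 22  1 13 16
```

The seed itself is not part of the returned vector, so the output matches elements 2 to 6 of z$x above.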
That could help:
my.fct.1 <- function(x, multiplier, increment, modulus){
increment <- ifelse(missing(increment), 0, increment) # setting the default increment to 0
newval <- (multiplier*x + increment) %% modulus
return(newval)
}
my.fct.2 <- function(x0, n, multiplier, increment, modulus){
if(n == 1){
val <- my.fct.1(x = x0, multiplier = multiplier, increment = increment, modulus = modulus)
vec <- c(x0, val)
return(vec)
}
if(n > 1){
vec <- my.fct.2(x0 = x0, n = n-1, multiplier = multiplier, increment = increment, modulus = modulus)
val <- vec[length(vec)]
newval <- my.fct.1(x = val, multiplier = multiplier, increment = increment, modulus = modulus)
newvec <- c(vec, newval)
return(newvec)
}
}
my.fct.2 does the required, the arguments are pretty much self explanatory. Watch out though, because it is a recursive function (which can affect speed among other things).
And here are some examples of such generated sequences:
> my.fct.2(3, 9, 7, -1, 4)
[1] 3 0 3 0 3 0 3 0 3 0
> my.fct.2(1, 9, 2, 1, 13)
[1] 1 3 7 2 5 11 10 8 4 9
> my.fct.2(0, 17, 5, 3, 7)
[1] 0 3 4 2 6 5 0 3 4 2 6 5 0 3 4 2 6 5
# and here the arguments are set to cross-check against @mysteRious's answer
> my.fct.2(5, 20, 6, 7, 23)
[1] 5 14 22 1 13 16 11 4 8 9 15 5 14 22 1 13 16 11 4 8 9
U <- my.fct.2(5, 20, 6, 7, 23)/23
> U
[1] 0.21739130 0.60869565 0.95652174 0.04347826 0.56521739 0.69565217 0.47826087 0.17391304
[9] 0.34782609 0.39130435 0.65217391 0.21739130 0.60869565 0.95652174 0.04347826 0.56521739
[17] 0.69565217 0.47826087 0.17391304 0.34782609 0.39130435
I have a vector of numbers:
x <- c(0, 0, 0, 30, 60, 0, 0, 0, 0, 0, 10, 0, 0, 15, 45, 0, 0)
For each element i in x, I would like to do the following
If x[i] > 0, return 0
If all 4 elements before x[i] are 0, return NA
If the 4 elements before x[i] are not 0, count how many elements are between the last not-0 element and x[i]
I expect this output:
#> x
#[1] 0 0 0 30 60 0 0 0 0 0 10 0 0 15 45 0 0
#> x_out
#[1] NA NA NA 0 0 1 2 3 4 NA 0 1 2 0 0 1 2
Notice that the solution should also work when there are fewer than 4 elements available at the beginning of the vector (i.e. conditions 2 and 3 should use as many elements as are available). Does anybody have a solution for this? A vectorised approach is preferred because the vectors are long and the dataset is fairly big.
Here is a simple Rcpp solution. Create a new C++ file in RStudio and paste the code into it and source the file. Obviously, you'll need to have installed Rtools if you use Windows.
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector funRcpp(const IntegerVector x) {
const int n = x.size();
int counter = 4;
IntegerVector y(n);
for (int i = 0; i < n; ++i) {
if (x(i) > 0) {
y(i) = 0;
counter = 0;
}
else {
if (counter > 3) {
y(i) = NA_INTEGER;
} else {
counter++;
y(i) = counter;
}
}
}
return y;
}
/*** R
x <- c(0, 0, 0, 30, 60, 0, 0, 0, 0, 0, 10, 0, 0, 15, 45, 0, 0)
funRcpp(x)
*/
This returns the desired result:
> funRcpp(x)
[1] NA NA NA 0 0 1 2 3 4 NA 0 1 2 0 0 1 2
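Since the question prefers a vectorised approach, here is a sketch in base R that avoids an explicit loop. It relies on cummax to carry forward the position of the most recent non-zero element. The function name since_last_nonzero and the window argument are assumptions made for illustration.

```r
since_last_nonzero <- function(x, window = 4L) {
  idx <- seq_along(x)
  # Position of the most recent non-zero element (0 if none seen yet).
  last <- cummax(ifelse(x > 0, idx, 0L))
  # Distance back to that element; 0 where x itself is non-zero.
  dist <- idx - last
  # NA when nothing non-zero has been seen yet, or it lies more than
  # `window` positions back.
  ifelse(last == 0L | dist > window, NA_integer_, dist)
}

x <- c(0, 0, 0, 30, 60, 0, 0, 0, 0, 0, 10, 0, 0, 15, 45, 0, 0)
since_last_nonzero(x)
##  [1] NA NA NA  0  0  1  2  3  4 NA  0  1  2  0  0  1  2
```

Like the C++ version, this treats only values greater than 0 as non-zero and assumes x contains no NA.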
This is my current approach:
library(dplyr)
last_x_months <- 4
my_list <- vector("list", 1 + last_x_months)
my_list[[1]] <- x
# create lagged variants of vector
for (j in seq_along(1:last_x_months)) {
my_list[[1 + j]] <- lag(my_list[[1]], n = j, default = NA)
}
# row bind it to a data.frame
i_dat <- do.call(rbind, my_list) %>%
as.data.frame()
# apply function to each column in dataframe
sapply(i_dat, function(x) {
if (sum(x, na.rm = TRUE) == 0) {
NA
} else if (x[1] > 0) {
0
} else {
rle(x)$lengths[1]
}
})
This is the output I get:
#> output
#[1] NA NA NA 0 0 1 2 3 4 NA 0 1 2 0 0 1 2
Is this good practice or could I improve performance with a shortcut? I am pretty inexperienced when it comes to performance optimisation, that's why I posed the question.
Here is a simple example of one type of iterative calc:
vals <- data.frame( "x"=c( 14, 15, 12, 10, 17 ), "ema"=0 )
vals$ema[1] <- vals$x[1]
K <- 0.90
for( jj in 2:nrow( vals ) )
vals$ema[jj] <- K * vals$ema[jj-1] + (1-K) * vals$x[jj]
vals
x ema
1 14 14.0000
2 15 14.1000
3 12 13.8900
4 10 13.5010
5 17 13.8509
The more involved examples use if...else to determine the next value:
for( jj in 2:nrow( vals ) )
if( K * vals$ema[jj-1] + (1-K) * vals$x[jj] < 5.0 )
vals$ema[jj] <- 5.0
else if( K * vals$ema[jj-1] + (1-K) * vals$x[jj] > 15.0 )
vals$ema[jj] <- 15.0
else
vals$ema[jj] <- K * vals$ema[jj-1] + (1-K) * vals$x[jj]
I am not sure if it would be more involved or not, but the decision can be based on the previous value as well:
K1 <- 0.999
K2 <- 0.95
K3 <- 0.90
for( jj in 2:nrow( vals ) )
if( vals$ema[jj-1] < 0.0 )
vals$ema[jj] <- K1 * vals$ema[jj-1] + (1-K1) * vals$x[jj]
else if( vals$ema[jj-1] > 100.0 )
vals$ema[jj] <- K3 * vals$ema[jj-1] + (1-K3) * vals$x[jj]
else
vals$ema[jj] <- K2 * vals$ema[jj-1] + (1-K2) * vals$x[jj]
This answer by WaltS to a similar question I had about recursive calculations provides two potential solutions. Adapting one of them to your question:
vals$ema.Reduce <- Reduce(function(myema, x) K * myema + (1-K) * x,
x = tail(vals$x, -1), init = 14, accumulate = TRUE)
vals
# x ema ema.Reduce
#1 14 14.0000 14.0000
#2 15 14.1000 14.1000
#3 12 13.8900 13.8900
#4 10 13.5010 13.5010
#5 17 13.8509 13.8509
Explanation of the function:
Reduce() calculates ema for the current row jj; myema is the previous value (row jj-1), starting with init. The x vector passed to Reduce consists of vals$x for the rows you want to calculate, i.e. row 2 to the last row: x = tail(vals$x, -1). The accumulate = TRUE option returns the whole vector of intermediate values instead of only the final value. (Note that the x argument of Reduce is a generic name and not the same as vals$x in the example data. For calculations that do not require the additional term vals$x, a vector of 0's would work, as in the linked answer.)
Adding if/else conditions to Reduce (note: init is changed in these examples to illustrate the conditional statements):
Reduce(function(myema, x) {
if(myema < 5) {
5
} else if(myema > 15) {
15
} else {
K * myema + (1-K) * x
}
}, x = tail(vals$x, -1), init = 16, accumulate = TRUE)
#[1] 16.000 15.000 14.700 14.230 14.507
Reduce(function(myema, x) {
if(myema < 0) {
K1 * myema + (1-K1) * x
} else if(myema > 100) {
K3 * myema + (1-K3) * x
} else {
K2 * myema + (1-K2) * x
}
}, x = tail(vals$x, -1), init = 110, accumulate = TRUE)
#[1] 110.00000 100.50000 91.65000 87.56750 84.03912
K3*110 + (1-K3)*vals$x[2] #100.5
K3*100.5 + (1-K3)*vals$x[3] #91.65
K2*91.65 + (1-K2)*vals$x[4] #87.5675
K2*87.5675 + (1-K2)*vals$x[5] #84.03912
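Note that the first Reduce call above tests the previous value, whereas the second loop in the question clamps the newly computed value to the range [5, 15]. A sketch that mirrors the question's loop more literally might look like this; ema_clamped, lo and hi are hypothetical names.

```r
# Hypothetical helper: EMA where each new value is clamped to [lo, hi].
ema_clamped <- function(x, K, lo, hi) {
  step <- function(prev, xi) min(hi, max(lo, K * prev + (1 - K) * xi))
  # Seed with x[1], then fold over the remaining values.
  Reduce(step, tail(x, -1), init = x[1], accumulate = TRUE)
}

ema_clamped(c(14, 15, 12, 10, 17), K = 0.9, lo = 5, hi = 15)
## [1] 14.0000 14.1000 13.8900 13.5010 13.8509
```

With the bounds 5 and 15 the example data never hits a limit, so the result equals the unclamped ema column.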
Seems this succeeds:
vals$ema2 <- c(vals$ema[1], K*vals$ema[1:4] +(1-K)*vals$x[2:5] )
> vals
x ema ema2
1 14 14.0000 14.0000
2 15 14.1000 14.1000
3 12 13.8900 13.8900
4 10 13.5010 13.5010
5 17 13.8509 13.8509
Sometimes it is best to work with the time series and data munging libraries. In this case, lag.zoo from the zoo library handles lagged values for you.
library(dplyr)
library(zoo)
vals <- data.frame( "x"=c( 14, 15, 12, 10, 17 ) )
K <- 0.90
vals %>% mutate(ema = (1-K)*vals$x + K*(lag(vals$x,1)))
For this particular problem, the weight for each value is some function of k and i (as in the ith value). We can write a function for the weights, and vectorize it:
weights <- function(i, k) {
q <- 1-k
qs <- '^'(q, 1:i)
rev(qs) * c(1, rep(k, (i-1)))
}
v_weights <- Vectorize(weights)
An example:
> v_weights(1:3, .1)
[[1]]
[1] 0.9
[[2]]
[1] 0.81 0.09
[[3]]
[1] 0.729 0.081 0.090
where these are the weights of the "preceding" x values. We proceed with some matrix algebra. I write a function to make the weights (above) into a matrix:
weight_matrix <- function(j, k) {
w <- v_weights(1:j, k=k)
Ws <- matrix(0, j+1, j+1)
Ws[row(Ws)+col(Ws)<(j+2)] <- unlist(rev(w))
Ws <- t(Ws)
Ws[row(Ws)+col(Ws)==(j+2)] <- k
Ws[(j+1),1] <- 1
Ws
}
Example:
> weight_matrix(3, .1)
[,1] [,2] [,3] [,4]
[1,] 0.729 0.081 0.09 0.1
[2,] 0.810 0.090 0.10 0.0
[3,] 0.900 0.100 0.00 0.0
[4,] 1.000 0.000 0.00 0.0
Then multiply this with the vector of xs. Function: ema <- function(x, k) rev(weight_matrix(length(x)-1, k) %*% x[1:(length(x))]).
To get the dataframe above (I "flipped" the k so it's 0.1 instead of 0.9):
> x <- c(14, 15, 12, 10, 17)
> k <- .1
> vals <- data.frame("x"=x, "ema"=ema(x, k))
> vals
x ema
1 14 14.0000
2 15 14.1000
3 12 13.8900
4 10 13.5010
5 17 13.8509
@shayaa's answer is 99% correct. dplyr implements lag just fine, and apart from a typo in that answer (one value of x should be ema), extraneous calls to column names, and a missing default value (otherwise it puts NA in the first row) it works perfectly well.
library(dplyr)
vals %>% mutate(ema = K*lag(ema, 1, default=ema[1]) + (1-K)*x)
#> x ema
#> 1 14 14.0000
#> 2 15 14.1000
#> 3 12 13.8900
#> 4 10 13.5010
#> 5 17 13.8509
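For the plain recurrence without conditions, base R's stats::filter with method = "recursive" is another loop-free option. This is a different technique from the answers above, offered only as a sketch; ema_filter is a hypothetical name.

```r
ema_filter <- function(x, K) {
  if (length(x) < 2) return(x)
  # Recursive filter: y[i] = input[i] + K * y[i-1], seeded with y[0] = x[1].
  # The input is (1 - K) * x for rows 2..n, which gives the EMA recurrence.
  y <- stats::filter((1 - K) * x[-1], filter = K, method = "recursive",
                     init = x[1])
  c(x[1], as.numeric(y))
}

ema_filter(c(14, 15, 12, 10, 17), K = 0.9)
## [1] 14.0000 14.1000 13.8900 13.5010 13.8509
```

Unlike the lag-based version, this does not need an existing ema column to read from.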
I have this data frame:
A <- c(10, 20, 30, 40, 5)
B <- c(5, 0, 0, 0, 0)
df = data.frame(A, B)
And I want to replace the 0's in B with the sum of A and B[i-1]. I have searched everywhere, but I feel like I am missing something really basic. This is my desired result:
A B
1 10 5
2 20 25
3 30 55
4 40 95
5 5 100
I have tried this, but it didn't work:
for(i in 2:length(df)){
df$B <- A[i] + B[i-1]
}
In Excel, this would be something like B$2 = A$2 + B$1. I cannot figure out how to do this in R. Any help would be greatly appreciated since I feel like I am missing something basic. Thanks!
You were very close. Try this:
for(i in 2:nrow(df)){
df$B[i] <- df$A[i] + df$B[i-1]
}
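Because each B is just the previous B plus the current A, the unconditional version can also be written without a loop using cumsum. A minimal sketch with the question's data:

```r
A <- c(10, 20, 30, 40, 5)
B <- c(5, 0, 0, 0, 0)
df <- data.frame(A, B)

# B[i] = B[1] + A[2] + ... + A[i]; the leading 0 keeps B[1] unchanged.
df$B <- df$B[1] + cumsum(c(0, df$A[-1]))
df$B
## [1]   5  25  55  95 100
```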
And to expand to those comments, could something like this work?
for(i in 2:nrow(df)){
if((df$A[i] + df$B[i-1]) > 60) df$B[i] <- df$B[i-1] else{
df$B[i] <- df$A[i] + df$B[i-1]}
}
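The conditional version cannot be expressed with cumsum alone, since each step depends on the previous capped result, but Reduce can carry that state. A sketch, assuming the same rule as the loop above (keep the previous value whenever the new sum would exceed 60); B1 and step are illustrative names:

```r
A <- c(10, 20, 30, 40, 5)
B1 <- 5  # starting value of B

# Carry the previous B forward; only add A when the sum stays at or below 60.
step <- function(prev, a) if (prev + a > 60) prev else prev + a
B <- Reduce(step, A[-1], init = B1, accumulate = TRUE)
B
## [1]  5 25 55 55 60
```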
# Data
# I changed one of the later values of B to non-zero to confirm that only
# the zero values of B were getting changed
A <- c(10, 20, 30, 40, 5)
B <- c(5, 0, 0, 0, 10)
(df = data.frame(A, B))
# A B
# 1 10 5
# 2 20 0
# 3 30 0
# 4 40 0
# 5 5 10
for(i in 2:nrow(df)) {
if(df$B[i]==0) df$B[i] <- df$A[i] + df$B[i-1]
if(df$B[i] >= 60) df$B[i] <- df$B[i-1]
}
df
# A B
# 1 10 5
# 2 20 25
# 3 30 55
# 4 40 55
# 5 5 10