(dplyr) Sum of N values most recent to a date

I'm trying to create a function that sums the closest n values to a given date. So if I had 5 weeks of data, and n=2, the value on week 1 would be the sum of weeks 2&3, the value on week 2 would be the sum of weeks 1&3, etc. Example:
library(dplyr)
library(data.table)
Week <- 1:5
Sales <- c(1, 3, 5, 7, 9)
frame <- data.table(Week, Sales)
frame
   Week Sales Recent
1:    1     1      8
2:    2     3      6
3:    3     5     10
4:    4     7     14
5:    5     9     12
I want to make a function that does this for me with an input for most recent n (not just 2), but for now I want to get 2 right. Here's my function using lag/lead:
RecentSum = function(Variable, Lags){
  Sum = 0
  for(i in 1:(Lags/2)){ # Lags/2 because I want half the values before and half after
    # Check to see if you can go backwards. If not, go forward (i.e. use lead).
    if(is.na(lag(Variable, i))){
      LoopSum = lead(Variable, i)
    } else {
      LoopSum = lag(Variable, i)
    }
    Sum = Sum + LoopSum
  }
  for(i in 1:(Lags/2)){
    # Check to see if you can go forward. If not, go backwards (i.e. use lag).
    if(is.na(lead(Variable, i))){
      LoopSum = lag(Variable, i)
    } else {
      LoopSum = lead(Variable, i)
    }
    Sum = Sum + LoopSum
  }
  Sum
}
When I do RecentSum(frame$Sales, 2) I get [1] 6 10 14 18 NA, which is wrong for a number of reasons:
My if statements only ever test the first element (week one), so lag is always NA and lead is always non-NA.
I need a way to track whether lag or lead was used the first time. The first value is 6 instead of 8 because the first for-loop sends it to lead(_, 1), but then the second for-loop does the same. I can't think of how to make the second for-loop recognize this.
Is there a function or library (Zoo?) that makes this task easy? I'd like to get my own function to work for the sake of practice/understanding, but at this point I'd rather just get it done.
Thanks!

To elaborate on my comment, lead and lag are vectorized functions: they operate on a whole column at once, so they are meant to be used inside dplyr verbs such as mutate rather than element by element. Here is a way to do it within dplyr without writing your own function:
df <- tibble(week = Week, sales = Sales)
df %>%
  mutate(recent = case_when(is.na(lag(sales))  ~ lead(sales, n = 1) + lead(sales, n = 2),
                            is.na(lead(sales)) ~ lag(sales, n = 1) + lag(sales, n = 2),
                            TRUE               ~ lag(sales) + lead(sales)))
That gives you this:
# A tibble: 5 x 3
   week sales recent
  <int> <dbl>  <dbl>
1     1     1      8
2     2     3      6
3     3     5     10
4     4     7     14
5     5     9     12
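If you need this to work for an arbitrary n rather than a hard-coded 2, here is one rough sketch (recent_sum is just a placeholder name I'm using here): for each week it picks the n index-nearest other weeks, breaking ties toward earlier weeks, and sums their sales.
recent_sum <- function(x, n) {
  sapply(seq_along(x), function(i) {
    nearest <- order(abs(seq_along(x) - i))[-1][1:n]  # drop i itself, keep the n closest
    sum(x[nearest])
  })
}
recent_sum(Sales, 2)
#> [1]  8  6 10 14 12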

1) Assuming that k is even, define to as a vector of indices such that, for each element of to, we sum the k+1 elements of Sales that end at that index and then subtract Sales from that sum:
k <- 2 # number of elements to sum
n <- nrow(frame)
to <- pmax(k+1, pmin(1:n + k/2, n))
Sum <- function(to, Sales) sum(Sales[seq(to = to, length = k+1)])
frame %>% mutate(recent = sapply(to, Sum, Sales) - Sales)
giving:
  Week Sales recent
1    1     1      8
2    2     3      6
3    3     5     10
4    4     7     14
5    5     9     12
Note that by replacing the last line of code above with the following line the solution can be done entirely in base R:
transform(frame, recent = sapply(to, Sum, Sales) - Sales)
2) This concatenates the appropriate elements before and after the Sales series so that an ordinary rolling sum gives the result.
library(zoo)
ix <- c(seq(to = k+1, length = k/2), 1:n, seq(to = n-k, length = k/2))
frame %>% mutate(recent = rollsum(Sales[ix], k+1) - Sales)
Note that if k=2 then it reduces to this one-liner:
frame %>% mutate(recent = rollsum(Sales[c(3, 1:n(), n()-2)], 3) - Sales)
giving:
  Week Sales recent
1    1     1      8
2    2     3      6
3    3     5     10
4    4     7     14
5    5     9     12
Update: fixed for k > 2
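If you want to reuse approach (2) with different k, it can be wrapped in a small helper. This is only a sketch (recent_rollsum is a made-up name) and it assumes an even k with length(Sales) > k:
library(zoo)
# hypothetical wrapper around approach (2); assumes even k and length(Sales) > k
recent_rollsum <- function(Sales, k) {
  n  <- length(Sales)
  ix <- c(seq(to = k + 1, length = k / 2), 1:n, seq(to = n - k, length = k / 2))
  rollsum(Sales[ix], k + 1) - Sales
}
recent_rollsum(c(1, 3, 5, 7, 9), 2)
#> [1]  8  6 10 14 12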

Related

Data Filtering. Remove zero rows but only until the first non zero value for each ID

I have a data table that captures the monthly returns for different stocks over a period of time. The data was collected in such a way that if a stock was delisted, the remaining return values are filled with zeroes (which does not align with reality). The filtering I need is the following:
for each stock: working backwards from the end of the data collection period, remove all rows with a zero return until the first non-zero return is reached. All returns up to and including that first non-zero return are valid and should not be deleted.
My idea was to order the data by date in descending order, group it by Id and then apply a function that removes all the zeroes until the first non-zero. However, I am not sure how to apply this.
I wanted to use the data.table package so I created this function.
i <- 1
nzero <- i + 1
while (i <= count(stocks) + 1) {
  if (stocks[i, 6] == 0 | is.na(stocks[i, 6])) {
    while (stocks[nzero, 6] == 0 | is.na(stocks[nzero, 6])) {
      nzero <- nzero + 1
    }
    test89 <- rbind(test89, stocks[nzero, count(stocks)])
    i <- nzero
  } else {
    if (stocks[i, 1] != stocks[(i + 1), 1]) {
      test89 <- rbind(test89, stocks[(i + 1):count(stocks)])
    }
  }
  i <- i + 1
}
The first column of stocks refers to the Id of the stock, and the 6th column refers to the return.
Does this make sense? How can I change it so that it works, and how can I apply a custom function after grouping the data?
Thanks for the support and sorry for the not so qualified question
Best
Perhaps something like this is what you are looking for? I made some dummy data for 12 months where returns are either 1 or 0. The first stock has a run of 6 non-zero values, the second has 4.
library(data.table)
stocks <- data.table(
  stock = do.call('c', lapply(letters[1:2], rep, 12)),
  month = rep(1:12, 2),
  return = c(rep(1, 6), rep(0, 6),
             rep(1, 4), rep(0, 8))
)
# reverse date order
stocks <- stocks[order(-month)]
# calculate (first) run of zero values
zero_run_length <- function(x) {
  r <- rle(x)
  if (r$values[1] == 0) {
    return(r$lengths[1])
  } else return(0)
}
# calculate "zero run length" (on reversed order table)
zero_idx <- stocks[, list(zrl = zero_run_length(return)), by = "stock"]
zero_idx
#>    stock zrl
#> 1:     b   8
#> 2:     a   6
# select target rows
stocks2 <- stocks[zero_idx, on = "stock"][, .SD[((zrl[1] + 1):.N),], by = "stock"]
# put back in order
stocks2[order(stock, month),]
#>     stock month return zrl
#>  1:     a     1      1   6
#>  2:     a     2      1   6
#>  3:     a     3      1   6
#>  4:     a     4      1   6
#>  5:     a     5      1   6
#>  6:     a     6      1   6
#>  7:     b     1      1   8
#>  8:     b     2      1   8
#>  9:     b     3      1   8
#> 10:     b     4      1   8
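If you prefer to avoid the rle step, here is a hedged alternative sketch (my own variant, not part of the answer above) that flags trailing zeros directly with a reversed cumulative sum, assuming the same dummy stocks table re-sorted into ascending month order:
# a row is in the trailing-zero block iff no non-zero return follows it,
# i.e. wherever the reversed cumulative count of non-zero returns is still 0
stocks_asc <- stocks[order(stock, month)]
stocks_asc[, keep := rev(cumsum(rev(return != 0)) > 0), by = stock]
stocks3 <- stocks_asc[keep == TRUE][, keep := NULL]
stocks3
# gives the same 10 rows as stocks2 above (without the zrl column)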

How to Efficiently Populate R Dataframe with Lookup Values from Another Dataframe [duplicate]

This question already has answers here:
Expand ranges defined by "from" and "to" columns
I have a question regarding efficiently populating an R dataframe based on data retrieved from another dataframe.
So my input typically looks like:
dfInput <- data.frame(start = c(1,6,17,29), end = c(5,16,28,42), value = c(1,2,3,4))
  start end value
1     1   5     1
2     6  16     2
3    17  28     3
4    29  42     4
I want to find the min and max values in cols 1 and 2 and create a new dataframe with a row for each value in that range:
rangeMin <- min(dfInput$start)
rangeMax <- max(dfInput$end)
dfOutput <- data.frame(index = c(rangeMin:rangeMax), value = 0)
And then populate it with the appropriate "values" from the input dataframe:
for (i in seq(nrow(dfOutput))) {
  lookup <- dfOutput[i, "index"]
  dfOutput[i, "value"] <- dfInput[which(dfInput$start <= lookup &
                                          dfInput$end >= lookup), "value"]
}
This for-loop achieves what I want to do, but it feels like this is a very convoluted way to do it.
Is there a way that I can do something like:
dfOutput$value <- dfInput[which(dfInput$start <= dfOutput$index &
                                  dfInput$end >= dfOutput$index), "value"]
Or something else to populate the values when instantiating dfOutput.
I feel like this is pretty basic but I'm new to R, so many thanks for any help!
You can create a sequence between start and end:
library(dplyr)
dfInput %>%
  mutate(index = purrr::map2(start, end, seq)) %>%
  tidyr::unnest(index) %>%
  select(-start, -end)
# A tibble: 42 x 2
#    value index
#    <dbl> <int>
#  1     1     1
#  2     1     2
#  3     1     3
#  4     1     4
#  5     1     5
#  6     2     6
#  7     2     7
#  8     2     8
#  9     2     9
# 10     2    10
# … with 32 more rows
In base R:
do.call(rbind, Map(function(x, y, z)
  data.frame(index = x:y, value = z), dfInput$start, dfInput$end, dfInput$value))
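A data.table sketch of the same expansion (my own addition, not from the answers above; it assumes each row has a distinct value, as in the example):
library(data.table)
# expand each start/end pair into its full sequence, one group per value
as.data.table(dfInput)[, .(index = start:end), by = value]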

Append values of a column if values in another column are sequential in R

I am looking for a convenient way to concatenate values of a column if the values of another column increment by 1.
My dataframe
     st row_index
1 alpha         2
2  beta         7
3 gamma        11
4 delta        12
5  zero        15
6   one        16
7   two        17
Target data frame
            st row_index
1        alpha         2
2         beta         7
3  gamma delta        11
4 zero one two        15
You can use lag and cumsum to create a helper group variable g, and then summarise by this variable. The expression row_index - lag(row_index, default = 0) != 1 compares the current row_index with the previous one and returns TRUE whenever the difference is not 1 (default = 0 removes the NA that lag would otherwise introduce in the first row). Combined with cumsum, it gives a unique id for each consecutive chunk of rows where row_index increases by one:
df %>%
  group_by(g = cumsum(row_index - lag(row_index, default = 0) != 1)) %>%
  summarise(st = paste(st, collapse = " "), row_index = first(row_index)) %>%
  select(-g)
# A tibble: 4 x 2
#  st           row_index
#  <chr>            <int>
#1 alpha                2
#2 beta                 7
#3 gamma delta         11
#4 zero one two        15
Here is an option with data.table. Grouped by the cumulative sum of a logical vector marking where the difference in 'row_index' is not 1, paste the elements of 'st' together and also take the first value of 'row_index':
library(data.table)
setDT(df1)[, .(st = paste(st, collapse = ' '),
               row_index = row_index[1]), .(grp = cumsum(c(TRUE, diff(row_index) != 1)))
           ][, .(st, row_index)]
#              st row_index
#1:        alpha          2
#2:         beta          7
#3:  gamma delta         11
#4: zero one two         15
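The same cumsum grouping also works in plain base R; here is a hedged sketch (my own variant, assuming the df shown in the question):
g <- cumsum(c(TRUE, diff(df$row_index) != 1))
data.frame(st = tapply(as.character(df$st), g, paste, collapse = " "),
           row_index = tapply(df$row_index, g, function(v) v[1]))
# gives the same four rows as the dplyr and data.table answers above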

Binning data by row values with minimum sample size

I’m trying to figure out how to create bins with a minimum sample size that also accounts for values in a specific column.
So, in the dummy data below, I want to create bins with a minimum of 6 samples each, but if a bin includes a row with a given value in column a, that bin should also include all other rows with that same value. I also do not want any bin to contain only one unique value of column a. The output should then have, for each bin, the mean of the unique values in column a, the mean of all values in column b, and a column with the sample size.
df<-data.frame(a=c(1,1,2,2,2,3,3,3,3,4,4,5,6,6,6,7,7,7,7,7,7,8,8,8,9,9,9,9,10,10,10),
b=c(12,13,11,12,12,11,15,13,12,11,14,15,11,14,12,11,14,12,13,15,11,11,12,13,14,16,14,13,15,13,15))
I want the output to look something like this:
  mean.a   mean.b n
1    2.0 12.33333 9
2    5.0 12.83333 6
3    7.0 12.66667 6
4    8.5 13.28571 7
This is what I have so far:
x <- df
final <- NULL
for(i in 1:16){
  x1 <- x[1:6, ]
  x2 <- x[-c(1:6), ]
  x3 <- rbind(x1, x2[x2$a == x1$a[6], ])
  n <- nrow(x3)
  y <- mean(x3$b)
  z <- mean(unique(x3$a))
  f <- data.frame(mean.a = z, mean.b = y, n = n)
  final <- rbind(final, f)
  x <- x[-c(1:n), ]
}
final <- final[complete.cases(final), ]
The problem I'm having is that I can't figure out how to prevent a bin from containing only one unique value in column a. For example, in the third bin all 6 rows have a = 7, but I would like to add the next sequential row, plus all rows sharing its value in column a, to that bin (which here would be all rows with a = 8).
Also, I can't figure out how to get the loop to keep going without hard-coding 1:number at the top and then deleting the rows with NAs afterwards; this isn't a huge deal, but it's why the code is a bit messy.
I'm not attached to this loop by any means, and if there's a simpler way to answer this question, I'm all for it!
Here is a recursive solution for the problem, where get_6 returns a group variable based on column a. The conditions are checked inside the get_i function: starting from index 6, it moves forward until the next value differs from the current one and the bin contains more than one unique value. Every time such a run is found, the id is incremented by one, so the result is similar to what the rleid function from data.table would give. From there, summary statistics can be calculated based on this group variable:
get_6 <- function(vec, id = 1) {
  if(length(vec) < 6) NULL
  else {
    get_i <- function(x, i = 6) {
      if(length(x) == i) i
      else if(x[i + 1] != x[i] && length(unique(x[1:i])) != 1) i
      else get_i(x, i + 1)
    }
    ind <- get_i(vec)
    c(rep(id, ind), get_6(vec[-(1:ind)], id + 1))
  }
}
s <- get_6(df$a)
s
# [1] 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4
library(dplyr)
df[1:length(s), ] %>%
  mutate(g = s) %>% group_by(g) %>%
  summarize(n = n(), mean.a = mean(unique(a)), mean.b = mean(b))
#Source: local data frame [4 x 4]
#      g     n mean.a   mean.b
#  <dbl> <int>  <dbl>    <dbl>
#1     1     9    2.0 12.33333
#2     2     6    5.0 12.83333
#3     3     9    7.5 12.44444
#4     4     7    9.5 14.28571

aggregate data frame by equal buckets

I would like to aggregate an R data.frame by equal amounts of the cumulative sum of one of the variables in the data.frame. I googled quite a lot, but I probably don't know the correct terminology to find anything useful.
Suppose I have this data.frame:
> x <- data.frame(cbind(p=rnorm(100, 10, 0.1), v=round(runif(100, 1, 10))))
> head(x, 20)
           p  v
1  10.002904  4
2  10.132200  2
3  10.026105  6
4  10.001146  2
5   9.990267  2
6  10.115907  6
7  10.199895  9
8   9.949996  8
9  10.165848  8
10  9.953283  6
11 10.072947 10
12 10.020379  2
13 10.084002  3
14  9.949108  8
15 10.065247  6
16  9.801699  3
17 10.014612  8
18  9.954638  5
19  9.958256  9
20 10.031041  7
I would like to reduce x to a smaller data.frame where each line contains the weighted average of p, weighted by v, corresponding to an amount of n units of v. Something of this sort:
> n <- 100
> cum.v <- cumsum(x$v)
> f <- cum.v %/% n
> x.agg <- aggregate(cbind(v*p, v) ~ f, data=x, FUN=sum)
> x.agg$'v * p' <- x.agg$'v * p' / x.agg$v
> x.agg
  f     v * p   v
1 0 10.039369  98
2 1  9.952049  94
3 2 10.015058 104
4 3  9.938271 103
5 4  9.967244 100
6 5  9.995071  69
First question: I was wondering if there is a better (more efficient) approach than the code above. The second, more important, question is how to correct the code above in order to obtain more precise bucketing. Namely, each row in x.agg should contain exactly 100 units of v, not just approximately, as is the case above. For example, the first row contains the aggregate of the first 17 rows of x, which correspond to 98 units of v. The next row (row 18) contains 5 units of v and is fully included in the next bucket. What I would like instead is to attribute 2 units of row 18 to the first bucket and the remaining 3 units to the following one.
Thanks in advance for any help provided.
Here's another method that does this without repeating each p value v times. The way I understand it, the place where the cumulative sum crosses 100 (see below)
18 9.954638 5 98
19 9.958256 9 107
should be changed to:
18 9.954638 5 98
19.1 9.958256 2 100 # ---> 2 units will be considered with previous group
19.2 9.958256 7 107 # ----> remaining 7 units will be split for next group
The code:
n <- 100
# get cumulative sum, an id column (for retrace) and current group id
x <- transform(x, cv = cumsum(x$v), id = seq_len(nrow(x)), grp = cumsum(x$v) %/% n)
# Paste these two lines in R to install IRanges
source("http://bioconductor.org/biocLite.R")
biocLite("IRanges")
require(IRanges)
ir1 <- successiveIRanges(x$v)
ir2 <- IRanges(seq(n, max(x$cv), by=n), width=1)
o <- findOverlaps(ir1, ir2)
# gets position where multiple of n(=100) occurs
# (where we'll have to do something about it)
pos <- queryHits(o)
# how much do the values differ from multiple of 100?
val <- start(ir2)[subjectHits(o)] - start(ir1)[queryHits(o)] + 1
# we need "pos" new rows of "pos" indices
x1 <- x[pos, ]
x1$v <- val # corresponding values
# reduce the group by 1, so that multiples of 100 will
# belong to the previous row
x1$grp <- x1$grp - 1
# subtract val in the original data x
x$v[pos] <- x$v[pos] - val
# bind and order them
x <- rbind(x1,x)
x <- x[with(x, order(id)), ]
# remove unnecessary entries
x <- x[!(duplicated(x$id) & x$v == 0), ]
x$cv <- cumsum(x$v) # updated cumsum
x$id <- NULL
require(data.table)
x.dt <- data.table(x, key="grp")
x.dt[, list(res = sum(p*v)/sum(v), cv = tail(cv, 1)), by=grp]
Running on your data:
#    grp       res  cv
# 1:   0 10.037747 100
# 2:   1  9.994648 114
Running on #geektrader's data:
#    grp       res  cv
# 1:   0  9.999680 100
# 2:   1 10.040139 200
# 3:   2  9.976425 300
# 4:   3 10.026622 400
# 5:   4 10.068623 500
# 6:   5  9.982733 562
Here's a benchmark on relatively big data (FN1 here is presumably the steps above wrapped into a function):
set.seed(12345)
x <- data.frame(cbind(p=rnorm(1e5, 10, 0.1), v=round(runif(1e5, 1, 10))))
require(rbenchmark)
benchmark(out <- FN1(x), replications=10)
#            test replications elapsed relative user.self
# 1 out <- FN1(x)           10  13.817        1    12.586
It takes about 1.4 seconds on 1e5 rows.
If you are looking for precise bucketing, I am assuming the value of p is the same for the two "split" parts of v,
i.e. in your example, the value of p for the 2 units of row 18 that go into the first bucket is 9.954638.
With the above assumption, you can do the following for not-super-large datasets:
> set.seed(12345)
> x <- data.frame(cbind(p=rnorm(100, 10, 0.1), v=round(runif(100, 1, 10))))
> z <- unlist(mapply(function(x,y) rep(x,y), x$p, x$v, SIMPLIFY=T))
This creates a vector in which each value of p is repeated v times for its row, and the results are combined into a single vector using unlist.
After this, aggregation is trivial using the aggregate function:
> aggregate(z, by=list((1:length(z)-0.5)%/%100), FUN=mean)
  Group.1         x
1       0  9.999680
2       1 10.040139
3       2  9.976425
4       3 10.026622
5       4 10.068623
6       5  9.982733
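Equivalently (my own compact rephrasing of the same idea, offered as a sketch), rep can take the repeat counts as a vector directly, and the grouping can be done with tapply:
z <- rep(x$p, x$v)                            # each p repeated v times
tapply(z, (seq_along(z) - 1) %/% 100, mean)   # mean per block of 100 units
# returns the same six means as the aggregate call above, as a named vector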
