Consecutive wins/losses R - r

I am still new to R and learning methods for conducting analysis. I have a df which I want to count the consecutive wins/losses based on column "x9". This shows the gain/loss (positive value or negative value) for the trade entered. I did find some help on code that helped with assigning a sign, sign lag and change, however, I am looking for counter to count the consecutive wins until a loss is achieved then reset, and then count the consecutive losses until a win is achieved. Overall am looking for assistance to adjust the counter to reset when consecutive wins/losses are interrupted. I have some sample code below and a attached .png to explain my thoughts
#Read in df
df=vroom::vroom(file = "analysis.csv")
#Filter df for specfic order types
df1 = filter(df, (x3=="s/l") |(x3=="t/p"))
#Create additional column to tag wins/losses in df1
index <- c("s/l","t/p")
values <- c("Loss", "Win")
df1$col2 <- values[match(df1$x3, index)]
df1
#Mutate df to review changes, attempt to review consecutive wins and losses & reset when a
#positive / negative value is encountered
df2=df1 %>%
mutate(sign = ifelse(x9 > 0, "pos", ifelse(x9 < 0, "neg", "zero")), # get the sign of the value
sign_lag = lag(sign, default = sign[9]), # get previous value (exception in the first place)
change = ifelse(sign == sign_lag, 1 , 0), # check if there's a change
series_id = cumsum(change)+1) %>% # create the series id
print() -> dt2

I think you can use rle for this. By itself, it doesn't immediately provide a grouping-like functionality, but we can either use data.table::rleid or construct our own function:
# borrowed from https://stackoverflow.com/a/62007567/3358272
myrleid <- function(x) {
rl <- rle(x)$lengths
rep(seq_along(rl), times = rl)
}
x9 <- c(-40.57,-40.57,-40.08,-40.08,-40.09,-40.08,-40.09,-40.09,-39.6,-39.6,-49.6,-39.6,-39.61,-39.12,-39.12-39.13,782.58,-41.04)
tibble(x9) %>%
mutate(grp = myrleid(x9 > 0)) %>%
group_by(grp) %>%
mutate(row = row_number()) %>%
ungroup()
# # A tibble: 17 x 3
# x9 grp row
# <dbl> <int> <int>
# 1 -40.6 1 1
# 2 -40.6 1 2
# 3 -40.1 1 3
# 4 -40.1 1 4
# 5 -40.1 1 5
# 6 -40.1 1 6
# 7 -40.1 1 7
# 8 -40.1 1 8
# 9 -39.6 1 9
# 10 -39.6 1 10
# 11 -49.6 1 11
# 12 -39.6 1 12
# 13 -39.6 1 13
# 14 -39.1 1 14
# 15 -78.2 1 15
# 16 783. 2 1
# 17 -41.0 3 1

Related

Sum of elements in a forward looking rolling window by month

I have the following data.frame with columns: Id, Month, have
library(dplyr)
dt <- read.table(header = TRUE, text = '
Id Month have want
1 01-Jan-2018 1.000000000000000 1.234567901220000
1 01-Feb-2018 0.200000000000000 0.234567901233000
1 01-Mar-2018 0.030000000000000 0.034567901234400
1 01-Apr-2018 0.004000000000000 0.004567901234550
1 01-May-2018 0.000500000000000 0.000567901234566
1 01-Jun-2018 0.000060000000000 0.000067901234566
1 01-Jul-2018 0.000007000000000 0.000007901234566
1 01-Aug-2018 0.000000800000000 0.000000901234566
1 01-Sep-2018 0.000000090000000 0.000000101234566
1 01-Oct-2018 0.000000010000000 0.000000011234566
1 01-Nov-2018 0.000000001100000 0.000000001234566
1 01-Dec-2018 0.000000000120000 0.000000000134566
1 01-Jan-2019 0.000000000013000 0.000000000014566
1 01-Feb-2019 0.000000000001400 0.000000000001566
1 01-Mar-2019 0.000000000000150 0.000000000000166
1 01-Apr-2019 0.000000000000016 0.000000000000016
2 01-Jan-2018 1337.00 1338.00
2 01-Feb-2018 1.00 1.00
3 01-Jan-2018 5.000000000000000000 5.000000000000000
') %>% mutate(Month=as.Date(Month, format='%d-%b-%Y')
I would like to programmatically calculate sum of elements in a 12 month forward looking rolling window by Month and grouped by Id as demonstrated in column want. If the rolling observation window is less than 12 months, the missing elements should be ignored.
For bonus points would the solution would also allow for missing months, such as in:
dt <- read.table(header = TRUE, text = '
Id Month have want
1 01-Jan-18 1.000000000000000 1.200000000000000
1 01-Dec-18 0.200000000000000 0.230000000000000
1 01-Jan-19 0.030000000000000 0.030000000000000
') %>% mutate(Month=as.Date(Month, format='%d-%b-%Y')
I have tried different solutions, e.g. rollapplyr() of the zoo package and some functions in the runner package, but it doesn't seem to give me what I need.
You can use zoo's rollaply with partial = TRUE
library(dplyr)
dt %>%
group_by(Id) %>%
tidyr::complete(Month = seq(min(Month), max(Month), "month")) %>%
mutate(result = zoo::rollapply(have, 12, sum, na.rm = TRUE,
align = 'left', partial = TRUE)) -> result
result
If you have data for every month for each Id like in the example shared you can remove the complete step.
I suggest to use runner package in this case. runner function let you to calculate rolling window having a full control in time. k is a window length, lag is a lag of the window and in idx you specify index column which window depends on.
library(runner)
dt %>%
group_by(Id) %>%
mutate(want2 = runner(
.,
f = function(x) sum(x$have),
k = 12, # or "12 months"
lag = -11, # or "-11 months"
idx = Month)
)
# # A tibble: 19 x 5
# # Groups: Id [3]
# Id Month have want want2
# <int> <date> <dbl> <dbl> <dbl>
# 1 1 2018-01-01 1.00e+ 0 1.23e+ 0 1.00e+ 0
# 2 1 2018-02-01 2.00e- 1 2.35e- 1 2.00e- 1
# 3 1 2018-03-01 3.00e- 2 3.46e- 2 3.00e- 2
# 4 1 2018-04-01 4.00e- 3 4.57e- 3 4.00e- 3
# 5 1 2018-05-01 5.00e- 4 5.68e- 4 5.00e- 4
# 6 1 2018-06-01 6.00e- 5 6.79e- 5 6.00e- 5

Rolling sum of one variable in data.frame in number of steps defined by another variable

I'm trying to sum up the values in a data.frame in a cumulative way.
I have this:
df <- data.frame(
a = rep(1:2, each = 5),
b = 1:10,
step_window = c(2,3,1,2,4, 1,2,3,2,1)
)
I'm trying to sum up the values of b, within the groups a. The trick is, I want the sum of b values that corresponds to the number of rows following the current row given by step_window.
This is the output I'm looking for:
data.frame(
a = rep(1:2, each = 5),
step_window = c(2,3,1,2,4,
1,2,3,2,1),
b = 1:10,
sum_b_step_window = c(3, 9, 3, 9, 5,
6, 15, 27, 19, 10)
)
I tried to do this using the RcppRoll but I get an error Expecting a single value:
df %>%
group_by(a) %>%
mutate(sum_b_step_window = RcppRoll::roll_sum(x = b, n = step_window))
I'm not sure if having variable window size is possible in any of the rolling function. Here is one way to do this using map2_dbl :
library(dplyr)
df %>%
group_by(a) %>%
mutate(sum_b_step_window = purrr::map2_dbl(row_number(), step_window,
~sum(b[.x:(.x + .y - 1)], na.rm = TRUE)))
# a b step_window sum_b_step_window
# <int> <int> <dbl> <dbl>
# 1 1 1 2 3
# 2 1 2 3 9
# 3 1 3 1 3
# 4 1 4 2 9
# 5 1 5 4 5
# 6 2 6 1 6
# 7 2 7 2 15
# 8 2 8 3 27
# 9 2 9 2 19
#10 2 10 1 10
1) rollapply
rollapply in zoo supports vector widths. partial=TRUE says that if the width goes past the end then use just the values within the data. (Another possibility would be to use fill=NA instead in which case it would fill with NA's if there were not enough data left) . align="left" specifies that the current value at each step is the left end of the range to sum.
library(dplyr)
library(zoo)
df %>%
group_by(a) %>%
mutate(sum = rollapply(b, step_window, sum, partial = TRUE, align = "left")) %>%
ungroup
2) SQL
This can also be done in SQL by left joining df to itself on the indicated condition and then for each row summing over all rows for which the condition matches.
library(sqldf)
sqldf("select A.*, sum(B.b) as sum
from df A
left join df B on B.rowid between A.rowid and A.rowid + A.step_window - 1
and A.a = B.a
group by A.rowid")
Here is a solution with the package slider.
library(dplyr)
library(slider)
df %>%
group_by(a) %>%
mutate(sum_b_step_window = hop_vec(b, row_number(), step_window+row_number()-1, sum)) %>%
ungroup()
It is flexible on different window sizes.
Output:
# A tibble: 10 x 4
a b step_window sum_b_step_window
<int> <int> <dbl> <int>
1 1 1 2 3
2 1 2 3 9
3 1 3 1 3
4 1 4 2 9
5 1 5 4 5
6 2 6 1 6
7 2 7 2 15
8 2 8 3 27
9 2 9 2 19
10 2 10 1 10
slider is a couple-of-months-old tidyverse package specific for sliding window functions. Have a look here for more info: page, vignette
hop is the engine of slider. With this solution we are triggering different .start and .stop to sum the values of b according to the a groups.
With _vec you're asking hop to return a vector: a double in this case.
row_number() is a dplyr function that allows you to return the row number of each group, thus allowing you to slide along the rows.
data.table solution using cumulative sums
setDT(df)
df[, sum_b_step_window := {
cs <- c(0,cumsum(b))
cs[pmin(.N+1, 1:.N+step_window)]-cs[pmax(1, (1:.N))]
},by = a]

replace a value in one column with a value from a second column on condition of a value from a third column from different rows

I have a data frame:
df1 <- data.frame(Object = c("Klaus","Klaus","Peter","Peter","Daniel","Daniel"),
PointA = as.numeric(c("7",NA,"17",NA,NA,NA)),
PointB = as.numeric(c("18","22",NA,NA,"17",NA)),
measure = c("1","2","1","2","1","2")
)
And I want this:
df2 <- data.frame(Object = c("Klaus","Klaus","Peter","Peter","Daniel","Daniel"),
PointA = as.numeric(c("7","18","17",NA,NA,"17")),
PointB = as.numeric(c("18","22",NA,NA,"17",NA)),
measure = c("1","2","1","2","1","2")
)
Which is, if there is a no value for an Object for PointA for measure == 2, I want it replaced with PointB from measure == 1 of the same Object.
First thing that comes to mind is:
library(dplyr)
df$PointA <- coalesce(df$PointA, df$PointB)
But afaik there is no way to make this condional.
Then I thought maybe something like:
df$PointA[is.na(df$PointA)] <- df$PointB
But this does not differentiate for the measure.
So I thought about:
df$PointA <- ifelse(df$measure == 2 & is.na(df$PointA), df$PointB, df$PointA)
But that does not take into account that I need the corresponding value from measure == 1.
Now, I am at a loss here. I am out of ideas how to approch this. Help?
Edit: I got two very good solutions already, but both rely on the order in the data frame. I tried, but obviously my example was to simple. I am looking for something that works under the following condition, too:
df1 <- df1[sample(nrow(df1)), ]
One possible option is using row_number() from dplyr. In case you need to sort your dataframe first, you can insert an arrange statement.
library(dplyr)
df1 %>%
arrange(Object, measure) %>%
group_by(Object) %>%
mutate(PointA = if_else(measure == 2 & is.na(PointA), PointB[row_number()-1], PointA))
# A tibble: 6 x 4
# Groups: Object [3]
# Object PointA PointB measure
# <chr> <dbl> <dbl> <chr>
# 1 Daniel NA 17 1
# 2 Daniel 17 NA 2
# 3 Klaus 7 18 1
# 4 Klaus 18 22 2
# 5 Peter 17 NA 1
# 6 Peter NA NA 2
You could use coalesce +lag as shown below:
library(tidyverse)
df1 %>%
arrange(Object, measure) %>%
group_by(Object) %>%
mutate(PointA = coalesce(PointA, lag(PointB)))
# A tibble: 6 x 4
# Groups: Object [3]
Object PointA PointB measure
<chr> <dbl> <dbl> <chr>
1 Klaus 7 18 1
2 Klaus 18 18 2
3 Peter 17 NA 1
4 Peter NA NA 2
5 Daniel NA 17 1
6 Daniel 17 NA 2
This could be condensed, but it should be relatively clear and doesn't rely on the row order at all. Beware if you have multiple rows for the same Object/Measure pair - the self-join will have multiple matches and you'll end up with a lot more rows than you started with.
library(dplyr)
df_fill = df1 %>%
filter(measure == 1) %>%
select(Object, fill_in = PointB) %>%
mutate(needs_fill = 1L)
result = df1 %>%
mutate(needs_fill = if_else(measure == 2 & is.na(PointA), 1L, NA_integer_)) %>%
left_join(df_fill) %>%
mutate(PointA = coalesce(PointA, fill_in)) %>%
select(-fill_in, -needs_fill)
result
# Object PointA PointB measure
# 1 Klaus 7 18 1
# 2 Klaus 18 22 2
# 3 Peter 17 NA 1
# 4 Peter NA NA 2
# 5 Daniel NA 17 1
# 6 Daniel 17 NA 2
Same as above but without saving the intermediate object:
result = df1 %>%
mutate(needs_fill = if_else(measure == 2 & is.na(PointA), 1L, NA_integer_)) %>%
left_join(
df1 %>%
filter(measure == 1) %>%
select(Object, fill_in = PointB) %>%
mutate(needs_fill = 1L)
) %>%
mutate(PointA = coalesce(PointA, fill_in)) %>%
select(-fill_in, -needs_fill)

dplyr: calculate the days for product replenishment

I am working on a dataset in which I need to calculate how long does it take for a retail store to replenish some products from shortage, and here is a quick view of the dataset in the simplest form:
Date <- c("2019-1-1","2019-1-2","2019-1-3","2019-1-4","2019-1-5","2019-1-6","2019-1-7","2019-1-8")
Product <- rep("Product A",8)
Net_Available_Qty <- c(-2,-2,10,8,-5,-6,-7,0)
sample_df <- data.frame(Date,Product,Net_Available_Qty)
When the Net_Available_Qty becomes negative, it means there is a shortage. When it turns back to 0 or positive qty, it means the supply has been recovered. What I need to calculate is the days between when we first see shortage and when it is recovered. In this case, for the 1st shortage, it took 2 days to recover and for the second shortage, it took 3 days to recover.
A tidyverse solution would be most welcome.
I hope someone else finds a cleaner solution. But this produces diffDate which assigns the date difference from when a negative turns positive/zero.
sample_df %>%
mutate(sign = ifelse(Net_Available_Qty > 0, "pos", ifelse(Net_Available_Qty < 0, "neg", "zero")),
sign_lag = lag(sign, default = sign[1]), # get previous value (exception in the first place)
change = ifelse(sign != sign_lag, 1 , 0), # check if there's a change
sequence=sequence(rle(as.character(sign))$lengths)) %>%
group_by(sequence) %>%
mutate(diffDate = as.numeric(difftime(Date, lag(Date,1))),
diffDate=ifelse(Net_Available_Qty <0, NA, ifelse((sign=='pos'| sign=='zero') & sequence==1, diffDate, NA))) %>%
ungroup() %>%
select(Date, Product, Net_Available_Qty, diffDate)
#Schilker had a great idea using rle. I am building on his answer and offering a slightly shorter version including the use of cumsum
Date <- c("2019-1-1","2019-1-2","2019-1-3","2019-1-4","2019-1-5","2019-1-6","2019-1-7","2019-1-8")
Product <- rep("Product A",8)
Net_Available_Qty <- c(-2,-2,10,8,-5,-6,-7,0)
sample_df <- data.frame(Date,Product,Net_Available_Qty)
library(tidyverse)
sample_df %>%
mutate(
diffDate = c(1, diff(as.Date(Date))),
sequence = sequence(rle(Net_Available_Qty >= 0)$lengths),
group = cumsum(c(TRUE, diff(sequence)) != 1L)
) %>%
group_by(group) %>%
mutate(n_days = max(cumsum(diffDate)))
#> # A tibble: 8 x 7
#> # Groups: group [4]
#> Date Product Net_Available_Qty diffDate sequence group n_days
#> <fct> <fct> <dbl> <dbl> <int> <int> <dbl>
#> 1 2019-1-1 Product A -2 1 1 0 2
#> 2 2019-1-2 Product A -2 1 2 0 2
#> 3 2019-1-3 Product A 10 1 1 1 2
#> 4 2019-1-4 Product A 8 1 2 1 2
#> 5 2019-1-5 Product A -5 1 1 2 3
#> 6 2019-1-6 Product A -6 1 2 2 3
#> 7 2019-1-7 Product A -7 1 3 2 3
#> 8 2019-1-8 Product A 0 1 1 3 1
Created on 2020-02-23 by the reprex package (v0.3.0)

Adding a column that counts sequential numbers

I would like to add a column that counts the number of consecutive values. Most of what I am seeing on here is how to count duplicate values (1,1,1,1,1) and I would like to count a when the number goes up by 1 ( 5,6,7,8,9). The ID column is what I have and the counter column is what I would like to create. Thanks!
ID Counter
5 1
6 2
7 3
8 4
10 1
11 2
13 1
14 2
15 3
16 4
A solution using the dplyr package. The idea is to calculate the difference between each number to create a grouping column, and then assign counter to each group.
library(dplyr)
dat2 <- dat %>%
mutate(Diff = ID - lag(ID, default = 0),
Group = cumsum(Diff != 1)) %>%
group_by(Group) %>%
mutate(Counter = row_number()) %>%
ungroup() %>%
select(-Diff, -Group)
dat2
# # A tibble: 10 x 2
# ID Counter
# <int> <int>
# 1 5 1
# 2 6 2
# 3 7 3
# 4 8 4
# 5 10 1
# 6 11 2
# 7 13 1
# 8 14 2
# 9 15 3
# 10 16 4
DATA
dat <- read.table(text = "ID
5
6
7
8
10
11
13
14
15
16",
header = TRUE, stringsAsFactors = FALSE)
A loop version is simple:
for (i in 2:length(ID))
if (diff(ID)[i-1] == 1)
counter[i] <- counter[i-1] +1
else
counter[i] <- 1
But this loop will perform very bad for n > 10^4! I'll try to think of a vector-solution!
You can using
s=df$ID-shift(df$ID)
s[is.na(s)]=1
ave(s,cumsum(s!=1),FUN=seq_along)
[1] 1 2 3 4 1 2 1 2 3 4
This one makes use solely of highly efficient vector-arithmetic. Idea goes as follows:
1.take the cumulative sum of the differences of ID
2.subtract the value if jump is bigger than one
cum <- c(0, cumsum(diff(ID))) # take the cumulative difference of ID
ccm <- cum * c(1, (diff(ID) > 1)) # those with jump > 1 will remain its value
# subtract value with jump > 1 for all following numbers (see Link for reference)
# note: rep(0, n) is because ccm[...] starts at first non null value
counter <- cum - c(rep(0, which(diff(dat) != 1)[1]),
ccm[which(ccm != 0)][cumsum(ccm != 0)]) + 1
enter code here
Notes:
Reference for highliy efficient fill-function by nacnudus: Fill in data frame with values from rows above
Restriction: Id must be monotonically increasing
That should deal with your millions of data efficiently!
Another solution:
breaks <- c(which(diff(ID)!=1), length(ID))
x <- c(breaks[1], diff(breaks))
unlist(sapply(x, seq_len))

Resources