I have the following data.frame with columns: Id, Month, have
library(dplyr)
dt <- read.table(header = TRUE, text = '
Id Month have want
1 01-Jan-2018 1.000000000000000 1.234567901220000
1 01-Feb-2018 0.200000000000000 0.234567901233000
1 01-Mar-2018 0.030000000000000 0.034567901234400
1 01-Apr-2018 0.004000000000000 0.004567901234550
1 01-May-2018 0.000500000000000 0.000567901234566
1 01-Jun-2018 0.000060000000000 0.000067901234566
1 01-Jul-2018 0.000007000000000 0.000007901234566
1 01-Aug-2018 0.000000800000000 0.000000901234566
1 01-Sep-2018 0.000000090000000 0.000000101234566
1 01-Oct-2018 0.000000010000000 0.000000011234566
1 01-Nov-2018 0.000000001100000 0.000000001234566
1 01-Dec-2018 0.000000000120000 0.000000000134566
1 01-Jan-2019 0.000000000013000 0.000000000014566
1 01-Feb-2019 0.000000000001400 0.000000000001566
1 01-Mar-2019 0.000000000000150 0.000000000000166
1 01-Apr-2019 0.000000000000016 0.000000000000016
2 01-Jan-2018 1337.00 1338.00
2 01-Feb-2018 1.00 1.00
3 01-Jan-2018 5.000000000000000000 5.000000000000000
') %>% mutate(Month=as.Date(Month, format='%d-%b-%Y')
I would like to programmatically calculate sum of elements in a 12 month forward looking rolling window by Month and grouped by Id as demonstrated in column want. If the rolling observation window is less than 12 months, the missing elements should be ignored.
For bonus points would the solution would also allow for missing months, such as in:
dt <- read.table(header = TRUE, text = '
Id Month have want
1 01-Jan-18 1.000000000000000 1.200000000000000
1 01-Dec-18 0.200000000000000 0.230000000000000
1 01-Jan-19 0.030000000000000 0.030000000000000
') %>% mutate(Month=as.Date(Month, format='%d-%b-%Y')
I have tried different solutions, e.g. rollapplyr() of the zoo package and some functions in the runner package, but it doesn't seem to give me what I need.
You can use zoo's rollaply with partial = TRUE
library(dplyr)
dt %>%
group_by(Id) %>%
tidyr::complete(Month = seq(min(Month), max(Month), "month")) %>%
mutate(result = zoo::rollapply(have, 12, sum, na.rm = TRUE,
align = 'left', partial = TRUE)) -> result
result
If you have data for every month for each Id like in the example shared you can remove the complete step.
I suggest to use runner package in this case. runner function let you to calculate rolling window having a full control in time. k is a window length, lag is a lag of the window and in idx you specify index column which window depends on.
library(runner)
dt %>%
group_by(Id) %>%
mutate(want2 = runner(
.,
f = function(x) sum(x$have),
k = 12, # or "12 months"
lag = -11, # or "-11 months"
idx = Month)
)
# # A tibble: 19 x 5
# # Groups: Id [3]
# Id Month have want want2
# <int> <date> <dbl> <dbl> <dbl>
# 1 1 2018-01-01 1.00e+ 0 1.23e+ 0 1.00e+ 0
# 2 1 2018-02-01 2.00e- 1 2.35e- 1 2.00e- 1
# 3 1 2018-03-01 3.00e- 2 3.46e- 2 3.00e- 2
# 4 1 2018-04-01 4.00e- 3 4.57e- 3 4.00e- 3
# 5 1 2018-05-01 5.00e- 4 5.68e- 4 5.00e- 4
# 6 1 2018-06-01 6.00e- 5 6.79e- 5 6.00e- 5
I am still new to R and learning methods for conducting analysis. I have a df which I want to count the consecutive wins/losses based on column "x9". This shows the gain/loss (positive value or negative value) for the trade entered. I did find some help on code that helped with assigning a sign, sign lag and change, however, I am looking for counter to count the consecutive wins until a loss is achieved then reset, and then count the consecutive losses until a win is achieved. Overall am looking for assistance to adjust the counter to reset when consecutive wins/losses are interrupted. I have some sample code below and a attached .png to explain my thoughts
#Read in df
df=vroom::vroom(file = "analysis.csv")
#Filter df for specfic order types
df1 = filter(df, (x3=="s/l") |(x3=="t/p"))
#Create additional column to tag wins/losses in df1
index <- c("s/l","t/p")
values <- c("Loss", "Win")
df1$col2 <- values[match(df1$x3, index)]
df1
#Mutate df to review changes, attempt to review consecutive wins and losses & reset when a
#positive / negative value is encountered
df2=df1 %>%
mutate(sign = ifelse(x9 > 0, "pos", ifelse(x9 < 0, "neg", "zero")), # get the sign of the value
sign_lag = lag(sign, default = sign[9]), # get previous value (exception in the first place)
change = ifelse(sign == sign_lag, 1 , 0), # check if there's a change
series_id = cumsum(change)+1) %>% # create the series id
print() -> dt2
I think you can use rle for this. By itself, it doesn't immediately provide a grouping-like functionality, but we can either use data.table::rleid or construct our own function:
# borrowed from https://stackoverflow.com/a/62007567/3358272
myrleid <- function(x) {
rl <- rle(x)$lengths
rep(seq_along(rl), times = rl)
}
x9 <- c(-40.57,-40.57,-40.08,-40.08,-40.09,-40.08,-40.09,-40.09,-39.6,-39.6,-49.6,-39.6,-39.61,-39.12,-39.12-39.13,782.58,-41.04)
tibble(x9) %>%
mutate(grp = myrleid(x9 > 0)) %>%
group_by(grp) %>%
mutate(row = row_number()) %>%
ungroup()
# # A tibble: 17 x 3
# x9 grp row
# <dbl> <int> <int>
# 1 -40.6 1 1
# 2 -40.6 1 2
# 3 -40.1 1 3
# 4 -40.1 1 4
# 5 -40.1 1 5
# 6 -40.1 1 6
# 7 -40.1 1 7
# 8 -40.1 1 8
# 9 -39.6 1 9
# 10 -39.6 1 10
# 11 -49.6 1 11
# 12 -39.6 1 12
# 13 -39.6 1 13
# 14 -39.1 1 14
# 15 -78.2 1 15
# 16 783. 2 1
# 17 -41.0 3 1
I have a data table that looks like
|userId|36|37|38|39|40|
|1|1|0|3|0|0|
|2|3|0|0|0|1|
Where each numbered column (36-40) represent week numbers. I want to calculate the number of weeks before the 1st occurrence of a non-zero value, and the last.
For instance, for userId 1 in my dataset, the first value appears at week 36, and the last one appears at week 38, so the value I want is 2. For userId 2 it's 40-36 which is 4.
I would like to store the data like:
|userId|lifespan|
|1|2|
|2|4|
I'm struggling to do this, can someone please help?
General method I would take is to melt it, convert the character column names to numeric, and take the delta by each userID. Here is an example using data.table.
library(data.table)
dt <- fread("userId|36|37|38|39|40
1|1|0|3|0|0
2|3|0|0|0|1",
header = TRUE)
dt <- melt(dt, id.vars = "userId")
dt[, variable := as.numeric(as.character(variable))]
dt
# userId variable value
# 1: 1 36 1
# 2: 2 36 3
# 3: 1 37 0
# 4: 2 37 0
# 5: 1 38 3
# 6: 2 38 0
# 7: 1 39 0
# 8: 2 39 0
# 9: 1 40 0
# 10: 2 40 1
dt[!value == 0, .(lifespan = max(variable) - min(variable)), by = .(userId)]
# userId lifespan
# 1: 1 2
# 2: 2 4
Here's a dplyr method:
df %>%
gather(var, value, -userId) %>%
mutate(var = as.numeric(sub("X", "", var))) %>%
group_by(userId) %>%
slice(c(which.max(value!=0), max(which(value!=0)))) %>%
summarize(lifespan = var[2]-var[1])
Result:
# A tibble: 2 x 2
userId lifespan
<int> <dbl>
1 1 2
2 2 4
Data:
df = read.table(text = "userId|36|37|38|39|40
1|1|0|3|0|0
2|3|0|0|0|1", header = TRUE, sep = "|")
I tried asking this question before but was it was poorly stated. This is a new attempt cause I haven't solved it yet.
I have a dataset with winners, losers, date, winner_points and loser_points.
For each row, I want two new columns, one for the winner and one for the loser that shows how many points they have scored so far (as both winners and losers).
Example data:
winner <- c(1,2,3,1,2,3,1,2,3)
loser <- c(3,1,1,2,1,1,3,1,2)
date <- c("2017-10-01","2017-10-02","2017-10-03","2017-10-04","2017-10-05","2017-10-06","2017-10-07","2017-10-08","2017-10-09")
winner_points <- c(2,1,2,1,2,1,2,1,2)
loser_points <- c(1,0,1,0,1,0,1,0,1)
test_data <- data.frame(winner, loser, date = as.Date(date), winner_points, loser_points)
I want the output to be:
winner_points_sum <- c(0, 0, 1, 3, 1, 3, 5, 3, 5)
loser_points_sum <- c(0, 2, 2, 1, 4, 5, 4, 7, 4)
test_data <- data.frame(winner, loser, date = as.Date(date), winner_points, loser_points, winner_points_sum, loser_points_sum)
How I've solved it thus far is to do a for loop such as:
library(dplyr)
test_data$winner_points_sum_loop <- 0
test_data$loser_points_sum_loop <- 0
for(i in row.names(test_data)) {
test_data[i,]$winner_points_sum_loop <-
(
test_data %>%
dplyr::filter(winner == test_data[i,]$winner & date < test_data[i,]$date) %>%
dplyr::summarise(points = sum(winner_points, na.rm = TRUE))
+
test_data %>%
dplyr::filter(loser == test_data[i,]$winner & date < test_data[i,]$date) %>%
dplyr::summarise(points = sum(loser_points, na.rm = TRUE))
)
}
test_data$winner_points_sum_loop <- unlist(test_data$winner_points_sum_loop)
Any suggestions how to tackle this problem? The queries take quite some time when the row numbers add up. I've tried elaborating with the AVE function, I can do it for one column to sum a players point as winner but can't figure out how to add their points as loser.
winner <- c(1,2,3,1,2,3,1,2,3)
loser <- c(3,1,1,2,1,1,3,1,2)
date <- c("2017-10-01","2017-10-02","2017-10-03","2017-10-04","2017-10-05","2017-10-06","2017-10-07","2017-10-08","2017-10-09")
winner_points <- c(2,1,2,1,2,1,2,1,2)
loser_points <- c(1,0,1,0,1,0,1,0,1)
test_data <- data.frame(winner, loser, date = as.Date(date), winner_points, loser_points)
library(dplyr)
library(tidyr)
test_data %>%
unite(winner, winner, winner_points) %>% # unite winner columns
unite(loser, loser, loser_points) %>% # unite loser columns
gather(type, pl_pts, winner, loser, -date) %>% # reshape
separate(pl_pts, c("player","points"), convert = T) %>% # separate columns
arrange(date) %>% # order dates (in case it's not)
group_by(player) %>% # for each player
mutate(sum_points = cumsum(points) - points) %>% # get points up to that date
ungroup() %>% # forget the grouping
unite(pl_pts_sumpts, player, points, sum_points) %>% # unite columns
spread(type, pl_pts_sumpts) %>% # reshape
separate(loser, c("loser", "loser_points", "loser_points_sum"), convert = T) %>% # separate columns and give appropriate names
separate(winner, c("winner", "winner_points", "winner_points_sum"), convert = T) %>%
select(winner, loser, date, winner_points, loser_points, winner_points_sum, loser_points_sum) # select the order you prefer
# # A tibble: 9 x 7
# winner loser date winner_points loser_points winner_points_sum loser_points_sum
# * <int> <int> <date> <int> <int> <int> <int>
# 1 1 3 2017-10-01 2 1 0 0
# 2 2 1 2017-10-02 1 0 0 2
# 3 3 1 2017-10-03 2 1 1 2
# 4 1 2 2017-10-04 1 0 3 1
# 5 2 1 2017-10-05 2 1 1 4
# 6 3 1 2017-10-06 1 0 3 5
# 7 1 3 2017-10-07 2 1 5 4
# 8 2 1 2017-10-08 1 0 3 7
# 9 3 2 2017-10-09 2 1 5 4
I finally understood what you want. And I took an approach of getting cumulative points of each player at each point in time and then joining it to the original test_data data frame.
winner <- c(1,2,3,1,2,3,1,2,3)
loser <- c(3,1,1,2,1,1,3,1,2)
date <- c("2017-10-01","2017-10-02","2017-10-03","2017-10-04","2017-10-05","2017-10-06","2017-10-07","2017-10-08","2017-10-09")
winner_points <- c(2,1,2,1,2,1,2,1,2)
loser_points <- c(1,0,1,0,1,0,1,0,1)
test_data <- data.frame(winner, loser, date = as.Date(date), winner_points, loser_points)
library(dplyr)
library(tidyr)
cum_points <- test_data %>%
gather(end_game_status, player_id, winner, loser) %>%
gather(which_point, how_many_points, winner_points, loser_points) %>%
filter(
(end_game_status == "winner" & which_point == "winner_points") |
(end_game_status == "loser" & which_point == "loser_points")) %>%
arrange(date = as.Date(date)) %>%
group_by(player_id) %>%
mutate(cumulative_points = cumsum(how_many_points)) %>%
mutate(cumulative_points_sofar = lag(cumulative_points, default = 0))
select(player_id, date, cumulative_points)
output <- test_data %>%
left_join(cum_points, by = c('date', 'winner' = 'player_id')) %>%
rename(winner_points_sum = cumulative_points_sofar) %>%
left_join(cum_points, by = c('date', 'loser' = 'player_id')) %>%
rename(loser_points_sum = cumulative_points_sofar)
output
The difference to the previous question of the OP is that the OP is now asking for the cumulative sum of points each player has scored so far, i.e., before the actual date. Furthermore, the sample data set now contains a date column which uniquely identifies each row.
So, my previous approach can be used here as well, with some modifications. The solution below reshapes the data from wide to long format whereby two value variables are reshaped simultaneously, computes the cumulative sums for each player id , and finally reshapes from long back to wide format, again. In order to sum only points scored before the actual date, the rows are lagged by one.
It is important to note that the winner and loser columns contain the respective player ids.
library(data.table)
cols <- c("winner", "loser")
setDT(test_data)[
# reshape multiple value variables simultaneously from wide to long format
, melt(.SD, id.vars = "date",
measure.vars = list(cols, paste0(cols, "_points")),
value.name = c("id", "points"))][
# rename variable column
, variable := forcats::lvls_revalue(variable, cols)][
# order by date and cumulate the lagged points by id
order(date), points_sum := cumsum(shift(points, fill = 0)), by = id][
# reshape multiple value variables simultaneously from long to wide format
, dcast(.SD, date ~ variable, value.var = c("id", "points", "points_sum"))]
date id_winner id_loser points_winner points_loser points_sum_winner points_sum_loser
1: 2017-10-01 1 3 2 1 0 0
2: 2017-10-02 2 1 1 0 0 2
3: 2017-10-03 3 1 2 1 1 2
4: 2017-10-04 1 2 1 0 3 1
5: 2017-10-05 2 1 2 1 1 4
6: 2017-10-06 3 1 1 0 3 5
7: 2017-10-07 1 3 2 1 5 4
8: 2017-10-08 2 1 1 0 3 7
9: 2017-10-09 3 2 2 1 5 4