Below is my attempt at a minimal reproducible example. Briefly explained, I am using rollApply from the rowr package to calculate a function over a rolling window, and using data from two columns simultaneously. If possible, I would like to skip n steps between each time the function is calculated on a new window. I will try to make it clear what I mean in the example below.
Here is the example data:
df1 <- tibble(
x = c(1:9),
y = c(1:9),
Date = as.Date(c("2015-08-08", "2015-08-15", "2015-08-22",
"2015-08-29","2015-09-05", "2015-09-12", "2015-09-19",
"2015-09-26", "2015-10-03"))
)
Here are the example functions:
calc_ex <- function(y){
sum(y[,1] + y[,2])
}
roll_calc_ex <- function(y){
vec <- c(rep(NA, 2), rowr::rollApply(y, calc_ex, window = 3, minimum = 3))
y <- y %>%
mutate(estimate = vec)
return(y)
}
Applying the function roll_calc_ex() to df1, I get the following output:
> roll_calc_ex(df1)
# A tibble: 9 x 4
x y Date estimate
<int> <int> <date> <int>
1 1 1 2015-08-08 NA
2 2 2 2015-08-15 NA
3 3 3 2015-08-22 12
4 4 4 2015-08-29 18
5 5 5 2015-09-05 24
6 6 6 2015-09-12 30
7 7 7 2015-09-19 36
8 8 8 2015-09-26 42
9 9 9 2015-10-03 48
Ideally, I would like to have a rolling window that skips n steps, say n=2, to produce the following output:
# A tibble: 9 x 4
x y Date estimate
<int> <int> <date> <int>
1 1 1 2015-08-08 NA
2 2 2 2015-08-15 NA
3 3 3 2015-08-22 12
4 4 4 2015-08-29 NA
5 5 5 2015-09-05 NA
6 6 6 2015-09-12 30
7 7 7 2015-09-19 NA
8 8 8 2015-09-26 NA
9 9 9 2015-10-03 48
Alternatively, instead of returning NA for every row skipped, the number from the previous calculation could be filled in (something I am planning to do later anyway using fill() from the tidyverse).
If this is possible to solve using for example rollapply() from the zoo package, that would also be interesting to hear. I am only using rowr::rollApply() because I need to apply the function to two columns simultaneously. I know it is possible to use runner() from the package "runner", but in my more complicated problem I need to run parallel computations. I am using the furrr package for parallelization, and my code works well with rollApply, but not with runner(). The problem I have with runner is explained here: Problem with parallelization using furrr [and runner::runner() ] in R .
Thanks to anyone that took the time to read this post. Any help will be much appreciated.
1) The rowr package was removed from CRAN but we can use rollapplyr (like rollapply but the r on the end means to default to right alignment) from zoo which has a by.column= argument to specify whether processing is performed column by column (TRUE) or all columns are passed at once (FALSE) and a by= argument which causes skipping.
library(dplyr)
library(zoo)
mutate(df1, roll =
rollapplyr(cbind(x, y), 3, calc_ex, fill = NA, by.column = FALSE, by = 2)
)
giving:
x y Date roll
1 1 1 2015-08-08 NA
2 2 2 2015-08-15 NA
3 3 3 2015-08-22 12
4 4 4 2015-08-29 NA
5 5 5 2015-09-05 24
6 6 6 2015-09-12 NA
7 7 7 2015-09-19 36
8 8 8 2015-09-26 NA
9 9 9 2015-10-03 48
2) Using complex arithmetic would also work:
f <- function(v) calc_ex(cbind(Re(v), Im(v)))
mutate(df1, roll = rollapplyr(x + y * 1i, 3, f, fill = NA, by = 2))
3) and if we look into calc_ex then it could be written more simply (although this does not generalize):
mutate(df1, roll = rollapplyr(x + y, 3, sum, fill = NA, by = 2))
4) We could also consider using zoo objects rather than data frames:
z <- read.zoo(df1, index = "Date")
merge(z, roll = rollapplyr(z, 3, calc_ex, by.column = FALSE, by = 2))
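For reference, the skipping logic itself can be sketched in base R without any rolling-window package. This is only an illustration: roll_step is a hypothetical helper (it is not part of zoo or rowr), and it assumes right-aligned windows with NA wherever no calculation is made. With width = 3 and step = 3 it evaluates calc_ex at rows 3, 6 and 9, which is the pattern in the question's desired output.

```r
# Hypothetical helper: apply `f` to right-aligned windows of `width` rows,
# but only at every `step`-th window end; all other rows stay NA.
roll_step <- function(df, f, width, step) {
  n <- nrow(df)
  out <- rep(NA_real_, n)
  if (n < width) return(out)          # no complete window available
  ends <- seq(width, n, by = step)    # rows at which a window ends
  for (end in ends) {
    out[end] <- f(df[(end - width + 1):end, , drop = FALSE])
  }
  out
}

# Same toy data and function as in the question (Date column omitted)
df1 <- data.frame(x = 1:9, y = 1:9)
calc_ex <- function(y) sum(y[, 1] + y[, 2])

est <- roll_step(df1, calc_ex, width = 3, step = 3)
est
```

In the zoo call in (1), the distance between evaluations is what the by= argument controls.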
If we were to use the slider package
library(tidyverse)
library(slider)
df1 <- tibble(
x = c(1:9),
y = c(1:9),
Date = as.Date(c("2015-08-08", "2015-08-15", "2015-08-22",
"2015-08-29","2015-09-05", "2015-09-12", "2015-09-19",
"2015-09-26", "2015-10-03")))
df1 |>
mutate(rolling_sum = slide2_dbl(.x = x, .y = y, .f = sum,
.step = 3, .before = 2, .complete = TRUE
))
#> # A tibble: 9 x 4
#> x y Date rolling_sum
#> <int> <int> <date> <dbl>
#> 1 1 1 2015-08-08 NA
#> 2 2 2 2015-08-15 NA
#> 3 3 3 2015-08-22 12
#> 4 4 4 2015-08-29 NA
#> 5 5 5 2015-09-05 NA
#> 6 6 6 2015-09-12 30
#> 7 7 7 2015-09-19 NA
#> 8 8 8 2015-09-26 NA
#> 9 9 9 2015-10-03 48
Created on 2021-10-21 by the reprex package (v2.0.1)
I have a dataframe where I would like to keep a row as soon as the cumulative value of a column reaches a certain level. The dataset could look like this:
set.seed(0)
n <- 10
dat <- data.frame(id=1:n,
group=rep(LETTERS[1:2], n/2),
age=sample(18:30, n, replace=TRUE),
type=factor(paste("type", 1:n)),
x=abs(rnorm(n)))
dat
id group age type x
1 1 A 26 type 1 0.928567035
2 2 B 21 type 2 0.294720447
3 3 A 24 type 3 0.005767173
4 4 B 18 type 4 2.404653389
5 5 A 19 type 5 0.763593461
6 6 B 30 type 6 0.799009249
7 7 A 24 type 7 1.147657009
8 8 B 28 type 8 0.289461574
9 9 A 19 type 9 0.299215118
10 10 B 28 type 10 0.411510833
Where I want to keep a row as soon as the cumulative value of x reaches a threshold (e.g. 1), starting to count again as soon as a row was retained. Which would result in the following output:
id group age type x
2 2 B 21 type 2 0.294720447
4 4 B 18 type 4 2.404653389
6 6 B 30 type 6 0.799009249
7 7 A 24 type 7 1.147657009
10 10 B 28 type 10 0.411510833
I am trying to get a dplyr based solution but can't seem to figure it out. Any tips?
You can use purrr::accumulate to compute a cumulative sum that resets once it passes the threshold, then use dplyr::slice_tail to keep the last row of each group, i.e. the row at which the threshold is passed:
library(dplyr)
library(purrr)
dat %>%
group_by(gp = cumsum(x == accumulate(x, ~ ifelse(.x <= 1, .x + .y, .y)))) %>%
slice_tail(n = 1)
# id group age type x gp
# 1 2 B 21 type 2 0.295 1
# 2 4 B 18 type 4 2.40 2
# 3 6 B 30 type 6 0.799 3
# 4 7 A 24 type 7 1.15 4
# 5 10 B 28 type 10 0.412 5
Another option is to use MESS::cumsumbinning, which might be more friendly to use:
library(MESS)
library(dplyr)
dat %>%
group_by(a = cumsumbinning(x, 1, cutwhenpassed = TRUE)) %>%
slice_tail(n = 1)
Mael beat me with the cumsumbinning() from the MESS-package...
Here is a data.table option using that function:
library(MESS)
library(data.table)
setDT(dat)[, .SD[.N], by = MESS::cumsumbinning(x, 1, cutwhenpassed = TRUE)]
# MESS id group age type
# 1: 1 2 B 21 type 2
# 2: 2 4 B 18 type 4
# 3: 3 6 B 30 type 6
# 4: 4 7 A 24 type 7
# 5: 5 10 B 28 type 10
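The resetting logic behind these answers can also be written as a plain loop, which may help when checking results. This is a sketch under assumptions: keep_when_reached is a hypothetical helper, the x values are copied from the printed data in the question, and "reaches" is taken to mean running sum >= threshold. Unlike the grouped slice_tail() approach, it does not keep the final row of a trailing group that never reaches the threshold.

```r
# x values copied from the question's printed data frame
x <- c(0.928567035, 0.294720447, 0.005767173, 2.404653389, 0.763593461,
       0.799009249, 1.147657009, 0.289461574, 0.299215118, 0.411510833)

# Hypothetical helper: flag a row once the running sum reaches the
# threshold, then restart the running sum from zero.
keep_when_reached <- function(x, threshold) {
  keep <- logical(length(x))
  running <- 0
  for (i in seq_along(x)) {
    running <- running + x[i]
    if (running >= threshold) {
      keep[i] <- TRUE
      running <- 0
    }
  }
  keep
}

kept_rows <- which(keep_when_reached(x, 1))
kept_rows   # row indices to retain, e.g. dat[kept_rows, ]
```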
I am trying to find a way to replicate the following code 4 times:
df3_1 <- df3_1 %>% add_row(.before = 2)
I tried the 'rep' function but it didn't work out. Is there any way to repeat this code 4 times so I can add multiple blank rows exactly in the manner described above (i.e. with respect to a specific row number)?
Thanks!
In base R, you can do:
n <- 4
nr <- nrow(df3_1)
df3_1[append(seq(nr), values = rep(nr + 1, n), after = 1), ]
Or a tidyverse approach:
library(tibble)
n <- 4
df3_1 %>%
add_row(!!names(.)[1] := rep(NA, n), .before = 2)
I would think base::Reduce() or purrr::reduce() are best designed for this kind of job
library(dplyr)
library(purrr)
df3_1 <- data.frame(a = 1:5,
b = letters[1:5])
# purrr reduce
reduce(1:4,
function(df, x){
add_row(df, .before = 2)
},
.init = df3_1)
#> a b
#> 1 1 a
#> 2 NA <NA>
#> 3 NA <NA>
#> 4 NA <NA>
#> 5 NA <NA>
#> 6 2 b
#> 7 3 c
#> 8 4 d
#> 9 5 e
# base R Reduce
Reduce(function(df, x){
add_row(df, .before = 2)
},
1:4,
init = df3_1)
#> a b
#> 1 1 a
#> 2 NA <NA>
#> 3 NA <NA>
#> 4 NA <NA>
#> 5 NA <NA>
#> 6 2 b
#> 7 3 c
#> 8 4 d
#> 9 5 e
You don't actually use the x parameter inside the function; the vector it iterates over only determines how often the function is called, that is, how many blank rows are inserted.
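The same idea can be seen on a plain vector, without any data frame packages. This is just a minimal sketch: the second argument of the reducer (called i here, a name chosen for illustration) is ignored, and the length of 1:4 alone decides that the insertion happens four times.

```r
# Insert one NA after the first element on each call; `i` is never used,
# it only makes Reduce() call the function once per element of 1:4.
insert_na <- function(acc, i) c(acc[1], NA, acc[-1])

res <- Reduce(insert_na, 1:4, 1:5)
res
```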
Try this (updated simplified version based on @27ϕ9's generalised approach):
df3_1 <- data.frame(a = 1:5,
b = letters[1:5])
library(dplyr)
df3_1 %>%
add_row(a = rep(NA, 4), .before = 2)
#> a b
#> 1 1 a
#> 2 NA <NA>
#> 3 NA <NA>
#> 4 NA <NA>
#> 5 NA <NA>
#> 6 2 b
#> 7 3 c
#> 8 4 d
#> 9 5 e
Created on 2020-07-01 by the reprex package (v0.3.0)
I would like to know how many animals will show up on a specific day. This table describes how people register their animals in advance.
For instance, at 7 days ahead, someone registered 4 cats to show up on 5/3/2019; at 6 days ahead, another 9 cats were registered for 5/3/2019. So there will be 4+9=13 cats showing up on 5/3/2019.
When days_ahead = 0, it simply means someone registered on the event day. For instance, 4 wolves registered for 5/1/2019 on 5/1/2019 (0 days ahead), and there will be 4 wolves that day.
library(dplyr)
set.seed(0)
animal = c(rep('cat', 5), rep('dog', 6), rep('wolf', 3))
date = sample(seq(as.Date("2019/5/1"), as.Date('2019/5/10'), by='day'), 14, replace=TRUE)
days_ahead = sample(seq(0,14), 14, replace=FALSE)
number = sample.int(10, 14, replace=TRUE)
dt = data.frame(animal, date, days_ahead, number) %>% arrange(animal, date)
The expected outcome should have the same 1-3 columns as the example, but the fourth column should be an accumulated number by each date, accumulating on days_ahead.
I added an expected outcome here. The comments are used to explain the accumulated_number column.
I've considered using a loop, but I'm not entirely sure how to loop over three variables (animal, date, and days_ahead). Any advice is appreciated!
The accumulated_number is somewhat easy with cumsum(). See this link for your comments field:
Cumulatively paste (concatenate) values grouped by another variable
dt%>%
group_by(animal,date)%>%
mutate(accumulated_number = cumsum(number)
,comments = Reduce(function(x1, x2) paste(x1, x2, sep = '+'), as.character(number), accumulate = T)
)%>%
ungroup()
Also, my dataset is slightly different than yours with the same seed. Still, it seems to work.
# A tibble: 14 x 6
animal date days_ahead number accumulated_number comments
<fct> <date> <int> <int> <int> <chr>
1 cat 2019-05-03 10 9 9 9
2 cat 2019-05-04 6 4 4 4
3 cat 2019-05-06 8 5 5 5
4 cat 2019-05-09 5 4 4 4
5 cat 2019-05-10 13 6 6 6
6 dog 2019-05-01 0 2 2 2
7 dog 2019-05-03 3 5 5 5
8 dog 2019-05-07 1 7 7 7
9 dog 2019-05-07 9 8 15 7+8
10 dog 2019-05-09 12 2 2 2
11 dog 2019-05-10 7 9 9 9
12 wolf 2019-05-02 14 5 5 5
13 wolf 2019-05-03 11 8 8 8
14 wolf 2019-05-07 4 9 9 9
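A base R version of the grouped running total, using the cats and wolves from the question's description, might look like the sketch below. The small reg data frame is made up for illustration from those numbers, and the sort by decreasing days_ahead is an assumption (earlier registrations are accumulated first); the dplyr code above accumulates in whatever order the rows already have.

```r
# Toy registrations taken from the question's worked example
reg <- data.frame(
  animal     = c("cat", "cat", "wolf"),
  date       = as.Date(c("2019-05-03", "2019-05-03", "2019-05-01")),
  days_ahead = c(6, 7, 0),
  number     = c(9, 4, 4)
)

# Assumption: a larger days_ahead means an earlier registration,
# so sort by decreasing days_ahead within each animal and date.
reg <- reg[order(reg$animal, reg$date, -reg$days_ahead), ]

# Running total within each animal/date combination
reg$accumulated_number <- ave(reg$number, reg$animal, format(reg$date),
                              FUN = cumsum)
reg
```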
I'm not sure I understand your question, is this what you want?
I'm adding an "animals_arriving" column and keeping the rest of dt
library(dplyr)
library(lubridate)
dt %>%
mutate(date_arrival = date + days(days_ahead)) %>%
group_by(date = date_arrival) %>%
summarise(animals_arriving = n()) %>%
full_join(dt,by="date")
I have a table that has dates as a number and a value with each date. Now I'd like to add another column, weekSum, which contains the sum of value over the last week. However some dates are missing (so I can't always use the current and last 6 rows). My table looks like this:
df <- data.frame('date' = c(20160309, 20160310, 20160311, 20160312, 20160313, 20160314, 20160315, 20160316, 20160318, 20160319, 20160321), 'value' = c(1, 2, 3, 4, 5, 6, 7 ,8, 9, 10, 11))
date value
20160309 1
20160310 2
20160311 3
20160312 4
20160313 5
20160314 6
20160315 7
20160316 8
20160318 9 #17th skipped
20160319 10
20160321 11 #20th skipped
I'd like to get the following as output:
date value weekSum
20160309 1 NA
20160310 2 NA
20160311 3 NA
20160312 4 NA
20160313 5 NA
20160314 6 NA
20160315 7 28 # 1+2+3+4+5+6+7
20160316 8 35 # 2+3+4+5+6+7+8
20160318 9 39 # 4+5+6+7+8+9
20160319 10 45 # 5+6+7+8+9+10
20160321 11 45 # 7+8+9+10+11
How can this be done?
1) Convert the data frame to zoo and define a weekSum function which subsets its input to the last week and sums that. Then use rollapplyr with coredata = FALSE so that it passes a zoo object with times, not just the core data, to the weekSum function.
library(zoo)
z <- read.zoo(df, format = "%Y%m%d")
weekSum <- function(z) sum(z[time(z) > tail(time(z), 1) - 7])
transform(df, weekSum = rollapplyr(z, 7, weekSum, fill = NA, coredata = FALSE))
giving:
date value weekSum
2016-03-09 20160309 1 NA
2016-03-10 20160310 2 NA
2016-03-11 20160311 3 NA
2016-03-12 20160312 4 NA
2016-03-13 20160313 5 NA
2016-03-14 20160314 6 NA
2016-03-15 20160315 7 28
2016-03-16 20160316 8 35
2016-03-18 20160318 9 39
2016-03-19 20160319 10 45
2016-03-21 20160321 11 45
2) An alternative is to fill in the value at the missing dates with zero and then just use rollsumr with width of 7. z is from (1).
z0 <- merge(z, zoo(, seq(start(z), end(z), "day")), fill = 0)
transform(df, weekSum = rollsumr(z0, 7, fill = NA)[z0 != 0])
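The date-based window in (1) can also be sketched without zoo by comparing dates directly. This is an illustration under assumptions: the dates are the ones printed in the question's table (with the 17th and 20th missing), and a row gets NA whenever its 7-day window would start before the first available date.

```r
# Dates as printed in the question's table, converted from yyyymmdd numbers
dates <- as.Date(as.character(c(20160309, 20160310, 20160311, 20160312,
                                20160313, 20160314, 20160315, 20160316,
                                20160318, 20160319, 20160321)),
                 format = "%Y%m%d")
value <- as.numeric(1:11)

week_sum <- vapply(seq_along(dates), function(i) {
  start <- dates[i] - 6                      # first day of the 7-day window
  if (start < dates[1]) return(NA_real_)     # window not fully covered
  sum(value[dates >= start & dates <= dates[i]])
}, numeric(1))

data.frame(date = dates, value = value, weekSum = week_sum)
```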
With base R it can be done like this:
df$date <- as.Date(as.character(df$date), format = "%Y%m%d") # numeric yyyymmdd to Date
res <- merge(df, data.frame(date = seq(df$date[1], to = df$date[nrow(df)], by = "days")), all.y = TRUE)
res$weekSum <- NA
for(i in seq_len(nrow(res))[-seq_len(6)]){
res$weekSum[i] <- sum(res$value[(i - 6):i], na.rm = TRUE)
}
res <- res[!is.na(res$value), ]
res
# date value sum weekSum
#1 2016-03-09 1 NA NA
#2 2016-03-10 2 NA NA
#3 2016-03-11 3 NA NA
#4 2016-03-12 4 NA NA
#5 2016-03-13 5 NA NA
#6 2016-03-14 6 NA NA
#7 2016-03-15 7 28 28
#9 2016-03-17 8 33 35
#10 2016-03-18 9 39 42
#11 2016-03-19 10 45 49
#13 2016-03-21 11 45 56
Here is an approach using tidyverse tools. This method uses tidyr::complete to construct the full date sequence, making it easy to take the current row and the previous 6 as suggested. Be careful here if there are NA values in value to begin with, as currently those rows will be filtered out at the end. Tweaks are possible to avoid this case if necessary.
library(tidyverse)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#>
#> date
df <- data.frame('date' = c(20160309, 20160310, 20160311, 20160312, 20160313, 20160314, 20160315, 20160317, 20160318, 20160319, 20160321), 'value' = c(1, 2, 3, 4, 5, 6, 7 ,8, 9, 10, 11))
df %>%
mutate(date = ymd(date)) %>%
complete(date = seq.Date(min(date), max(date), by = 1)) %>%
arrange(date) %>%
mutate(
newval = replace_na(value, 0),
weekSum = newval + lag(newval) + lag(newval, 2) + lag(newval, 3) +
lag(newval, 4) + lag(newval, 5) + lag(newval, 6)
) %>%
select(-newval) %>%
filter(!is.na(value))
#> # A tibble: 11 x 3
#> date value weekSum
#> <date> <dbl> <dbl>
#> 1 2016-03-09 1. NA
#> 2 2016-03-10 2. NA
#> 3 2016-03-11 3. NA
#> 4 2016-03-12 4. NA
#> 5 2016-03-13 5. NA
#> 6 2016-03-14 6. NA
#> 7 2016-03-15 7. 28.
#> 8 2016-03-17 8. 33.
#> 9 2016-03-18 9. 39.
#> 10 2016-03-19 10. 45.
#> 11 2016-03-21 11. 45.
Created on 2018-05-07 by the reprex package (v0.2.0).
I have time series data that I'm predicting on, so I am creating lag variables to use in my statistical analysis. I'd like a quick way to create multiple variables given specific inputs so that I can easily cross-validate and compare models.
The following is example code that adds 2 lags for 2 different variables (4 total) given a certain category (A, B, C):
# Load dplyr
library(dplyr)
# create day, category, and 2 value vectors
days = 1:9
cats = rep(c('A','B','C'),3)
set.seed = 19
values1 = round(rnorm(9, 16, 4))
values2 = round(rnorm(9, 16, 16))
# create data frame
data = data.frame(days, cats, values1, values2)
# mutate new lag variables
LagVal = data %>% arrange(days) %>% group_by(cats) %>%
mutate(LagVal1.1 = lag(values1, 1)) %>%
mutate(LagVal1.2 = lag(values1, 2)) %>%
mutate(LagVal2.1 = lag(values2, 1)) %>%
mutate(LagVal2.2 = lag(values2, 2))
LagVal
days cats values1 values2 LagVal1.1 LagVal1.2 LagVal2.1 LagVal2.2
<int> <fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 A 16 -10 NA NA NA NA
2 2 B 14 24 NA NA NA NA
3 3 C 16 -6 NA NA NA NA
4 4 A 12 25 16 NA -10 NA
5 5 B 20 14 14 NA 24 NA
6 6 C 18 -5 16 NA -6 NA
7 7 A 21 2 12 16 25 -10
8 8 B 19 5 20 14 14 24
9 9 C 18 -3 18 16 -5 -6
My problem comes in at the # mutate new lag variables step, since I have about a dozen predictor variables that I would potentially want to lag up to 10 times (~13k row dataset), and I don't have the heart to create 120 new variables.
Here is my attempt at writing a function which mutates new variables given the inputs for data (dataset to mutate), variables (the variables you wish to lag), and lags (the number of lags per variable):
MultiMutate = function(data, variables, lags){
# select the data to be working with
FuncData = data
# Loop through desired variables to mutate
for (i in variables){
# Loop through number of desired lags
for (u in 1:lags){
FuncData = FuncData %>% arrange(days) %>% group_by(cats) %>%
# Mutate new variable for desired number of lags. Give new variable a name with the lag number appended
mutate(paste(i, u) = lag(i, u))
}
}
FuncData
}
To be honest I'm just sort of lost on how to get this to work. The ordering of my for-loops and overall logic makes sense, but the way the function takes characters into variables and the overall syntax seems way off. Is there a simple way to fix up this function to get my desired result?
In particular, I'm looking for:
A function like MultiMutate(data = data, variables = c(values1, values2), lags = 2) that would create the exact result of LagVal from above.
Dynamically naming the variables based on the variable and their lag. I.e. value1.1, value1.2, value2.1, value2.2, etc.
Thank you in advance and let me know if you need additional information. If there's a simpler way to get what I'm looking for, then I am all ears.
You'll have to reach deeper into the tidyverse toolbox to add them all at once. If you nest data for each value of cats, you can iterate over the nested data frames, iterating the lags over the values* columns in each.
library(tidyverse)
set.seed(47)
df <- data_frame(days = 1:9,
cats = rep(c('A','B','C'),3),
values1 = round(rnorm(9, 16, 4)),
values2 = round(rnorm(9, 16, 16)))
df %>% nest(-cats) %>%
mutate(lags = map(data, function(dat) {
imap_dfc(dat[-1], ~set_names(map(1:2, lag, x = .x),
paste0(.y, '_lag', 1:2)))
})) %>%
unnest() %>%
arrange(days)
#> # A tibble: 9 x 8
#> cats days values1 values2 values1_lag1 values1_lag2 values2_lag1
#> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 A 1 24. -7. NA NA NA
#> 2 B 2 19. 1. NA NA NA
#> 3 C 3 17. 17. NA NA NA
#> 4 A 4 15. 24. 24. NA -7.
#> 5 B 5 16. -13. 19. NA 1.
#> 6 C 6 12. 17. 17. NA 17.
#> 7 A 7 12. 27. 15. 24. 24.
#> 8 B 8 16. 15. 16. 19. -13.
#> 9 C 9 15. 36. 12. 17. 17.
#> # ... with 1 more variable: values2_lag2 <dbl>
data.table::shift makes this simpler, as it's vectorized. Naming takes more work than the actual lagging:
library(data.table)
setDT(df)
df[, sapply(1:2, function(x){paste0('values', x, '_lag', 1:2)}) := shift(.SD, 1:2),
by = cats, .SDcols = values1:values2][]
#> days cats values1 values2 values1_lag1 values1_lag2 values2_lag1
#> 1: 1 A 24 -7 NA NA NA
#> 2: 2 B 19 1 NA NA NA
#> 3: 3 C 17 17 NA NA NA
#> 4: 4 A 15 24 24 NA -7
#> 5: 5 B 16 -13 19 NA 1
#> 6: 6 C 12 17 17 NA 17
#> 7: 7 A 12 27 15 24 24
#> 8: 8 B 16 15 16 19 -13
#> 9: 9 C 15 36 12 17 17
#> values2_lag2
#> 1: NA
#> 2: NA
#> 3: NA
#> 4: NA
#> 5: NA
#> 6: NA
#> 7: -7
#> 8: 1
#> 9: 17
In these cases, I rely on the magic of dplyr and tidyr:
library(dplyr)
library(tidyr)
set.seed(47)
# create data
s_data = data_frame(
days = 1:9,
cats = rep(c('A', 'B', 'C'), 3),
values1 = round(rnorm(9, 16, 4)),
values2 = round(rnorm(9, 16, 16))
)
max_lag = 2 # define max number of lags
# create lags
s_data %>%
gather(select = -c("days", "cats")) %>% # gather all variables that will be lagged
mutate(n_lag = list(0:max_lag)) %>% # add list-column with lag numbers
unnest() %>% # unnest the list column
arrange(cats, key, n_lag, days) %>% # order the data.frame
group_by(cats, key, n_lag) %>% # group by relevant variables
# create lag. when grouped by vars above, n_lag is a constant vector, take 1st value
mutate(lag_val = lag(value, n_lag[1])) %>%
ungroup() %>%
# create some fancy labels
mutate(var_name = ifelse(n_lag == 0, key, paste0("Lag", key, ".", n_lag))) %>%
select(-c(key, value, n_lag)) %>% # drop unnecessary data
spread(var_name, lag_val) %>% # spread your newly created variables
select(days, cats, starts_with("val"), starts_with("Lag")) # reorder
## # A tibble: 9 x 8
## days cats values1 values2 Lagvalues1.1 Lagvalues1.2 Lagvalues2.1 Lagvalues2.2
## <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 A 24. -7. NA NA NA NA
## 2 2 B 19. 1. NA NA NA NA
## 3 3 C 17. 17. NA NA NA NA
## 4 4 A 15. 24. 24. NA -7. NA
## 5 5 B 16. -13. 19. NA 1. NA
## 6 6 C 12. 17. 17. NA 17. NA
## 7 7 A 12. 27. 15. 24. 24. -7.
## 8 8 B 16. 15. 16. 19. -13. 1.
## 9 9 C 15. 36. 12. 17. 17. 17.
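For comparison, a function with roughly the interface asked for in the question can be sketched in base R. Everything here is an assumption for illustration: add_lags and lag_vec are hypothetical helpers, variables are passed as character names rather than bare symbols, and the new columns are named Lag<variable>.<k>. The example values are copied from the LagVal output printed in the question.

```r
# Shift a vector down by k positions, padding the start with NA
lag_vec <- function(x, k) c(rep(NA, k), x)[seq_along(x)]

# Hypothetical helper: add lags 1..lags of each variable, computed within
# each level of `group` after sorting the rows by `order_by`.
add_lags <- function(data, variables, lags, group, order_by) {
  data <- data[order(data[[order_by]]), , drop = FALSE]
  for (v in variables) {
    for (k in seq_len(lags)) {
      new_name <- paste0("Lag", v, ".", k)
      data[[new_name]] <- ave(data[[v]], data[[group]],
                              FUN = function(x) lag_vec(x, k))
    }
  }
  data
}

# Values copied from the question's printed output
dat <- data.frame(
  days    = 1:9,
  cats    = rep(c("A", "B", "C"), 3),
  values1 = c(16, 14, 16, 12, 20, 18, 21, 19, 18),
  values2 = c(-10, 24, -6, 25, 14, -5, 2, 5, -3)
)

out <- add_lags(dat, c("values1", "values2"), lags = 2,
                group = "cats", order_by = "days")
out
```

Passing names as strings avoids non-standard evaluation entirely, which is the part the original MultiMutate attempt struggled with.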