Set up output-list names in slider::slide_index() - r

slider::slide_index() uses a purrr::map()-like syntax to apply a function over rolling (here, time) windows. The output of this function is a list. My question is: how can I set the names of this list?
The slide_index() function does not have a .names_to argument like slide_index_dfr().
See reproducible example below:
library(slider)
library(lubridate)
library(dplyr)
storms_summary <- storms %>%
  # filter dataset so it does not eat all memory and save computing time ;)
  filter(year <= 2000 & year >= 1999) %>%
  # make a date column to mimic my real data
  mutate(storm_date = as.Date(paste(year, month, day, sep = "-"))) %>%
  arrange(storm_date) %>%
  slider::slide_index(
    .x = .,
    .i = .$storm_date,
    .f = ~ summary(.x),
    .after = lubridate::weeks(2),
    .complete = TRUE
  )
Expected output is the list storms_summary with names that are character strings indicating the first and the last date of the period summarized. With the code above we get:
> names(storms_summary)
NULL

If there is no way to assign the names within the function, we can create the names ourselves and use setNames() to assign them.
library(slider)
library(lubridate)
library(dplyr)
storms_summary <- storms %>%
  # filter dataset so it does not eat all memory and save computing time ;)
  filter(year <= 2000 & year >= 1999) %>%
  # make a date column to mimic my real data
  mutate(storm_date = as.Date(paste(year, month, day, sep = "-"))) %>%
  arrange(storm_date) %>%
  mutate(
    sliding_output = slider::slide_index(
      .x = cur_data(),
      .i = storm_date,
      .f = ~ summary(.x),
      .after = lubridate::weeks(2),
      .complete = TRUE
    ),
    names = slider::slide_index_chr(
      .x = storm_date,
      .i = storm_date,
      .f = ~ paste0(range(.x), collapse = '-'),
      .after = lubridate::weeks(2)
    ),
    sliding_output = setNames(sliding_output, names)
  ) %>%
  select(-names)
names(storms_summary$sliding_output)
# [1] "1999-07-02-1999-07-03" "1999-07-03-1999-07-03" "1999-07-03-1999-07-03"
# [4] "1999-07-03-1999-07-03" "1999-08-24-1999-09-07" "1999-08-24-1999-09-07"
# [7] "1999-08-24-1999-09-07" "1999-08-25-1999-09-08" "1999-08-25-1999-09-08"
#[10] "1999-08-25-1999-09-08" "1999-08-25-1999-09-08" "1999-08-26-1999-09-09"
#...
#...
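For what it's worth, the same idea can be written more compactly by keeping the prepared data in its own object and attaching the labels with rlang::set_names(). This is just a sketch of the approach above; the object names storms_prep, summaries, and labels are ours:
library(slider)
library(lubridate)
library(dplyr)
library(rlang)
storms_prep <- storms %>%
  filter(year <= 2000 & year >= 1999) %>%
  mutate(storm_date = as.Date(paste(year, month, day, sep = "-"))) %>%
  arrange(storm_date)
# summaries over two-week forward-looking windows
summaries <- slide_index(
  .x = storms_prep,
  .i = storms_prep$storm_date,
  .f = ~ summary(.x),
  .after = weeks(2),
  .complete = TRUE
)
# one label per window, built from the same index and the same window
# specification, so labels[i] describes the window behind summaries[[i]]
labels <- slide_index_chr(
  .x = storms_prep$storm_date,
  .i = storms_prep$storm_date,
  .f = ~ paste(range(.x), collapse = "-"),
  .after = weeks(2)
)
summaries <- set_names(summaries, labels)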

Related

R rolling mean on a non-continuous time series

I want to compute a rolling mean over the last X days. rollmean() does that using rows. Since I am using loggers that sometimes fail, and the data were also cleaned, the time series is not continuous (rows do not necessarily represent a constant time difference).
A colleague suggested the solution below, which works great, except that my data need to be grouped (in the example, by treatment). For each day, I want the rolling mean of the last X days for each treatment.
Thanks
library(dplyr)
# making some example data
# vector with days since the beginning of the experiment
days <- 0:30
# random values
df1 <- tibble::tibble(
  days_since_beginning = days,
  value_to_used = rnorm(length(days)),
  treatment = sample(letters[1], 31, replace = TRUE)
)
df2 <- tibble::tibble(
  days_since_beginning = days,
  value_to_used = rnorm(length(days)),
  treatment = sample(letters[2], 31, replace = TRUE)
)
df <- full_join(df1, df2)
# how long the averaging period should be
time_period <- 10 # calculate for the last 10 days
df_mean <- df %>%
  dplyr::mutate(
    # calculate rolling mean
    roll_mean = purrr::map_dbl(
      .x = days_since_beginning,
      .f = ~ df %>%
        # select only data for the last `time_period` days
        dplyr::filter(days_since_beginning >= .x - time_period &
                        days_since_beginning <= .x) %>%
        purrr::pluck("value_to_used") %>%
        mean()
    )
  )
This takes the mean over the last 10 days by treatment. The width argument includes a computation of how many rows back to use so that it corresponds to 10 days rather than 10 rows. This uses the fact that width can be a vector.
library(dplyr)
library(zoo)
df %>%
  group_by(treatment) %>%
  mutate(roll = rollapplyr(
    value_to_used,
    seq_along(days_since_beginning) -
      findInterval(days_since_beginning - 10, days_since_beginning),
    mean
  )) %>%
  ungroup()
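To see what the vector width does, here is a small hand-checkable sketch (the toy vectors d and x are ours, not part of the answer):
library(zoo)
# toy irregular day index with gaps
d <- c(0, 1, 4, 5, 12, 13)
x <- c(10, 20, 30, 40, 50, 60)
# for each position: how many trailing rows fall inside the
# half-open 10-day window (d[i] - 10, d[i]]
w <- seq_along(d) - findInterval(d - 10, d)
w
# [1] 1 2 3 4 3 4   (at d = 12 only days 4, 5 and 12 lie within 10 days)
rollapplyr(x, w, mean)
# [1] 10 15 20 25 40 45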
The same colleague came up with his own solution:
df_mean <-
  df %>%
  dplyr::group_by(treatment) %>%
  tidyr::nest() %>%
  dplyr::mutate(
    data_with_mean = purrr::map(
      .x = data,
      .f = ~ {
        dataset <- .x
        dataset %>%
          dplyr::mutate(
            # calculate rolling mean within this treatment's data
            roll_mean = purrr::map_dbl(
              .x = days_since_beginning,
              .f = ~ dataset %>%
                # select only data for the last `time_period` days
                dplyr::filter(days_since_beginning >= .x - time_period &
                                days_since_beginning <= .x) %>%
                purrr::pluck("value_to_used") %>%
                mean()
            )
          )
      }
    )
  ) %>%
  dplyr::select(-data) %>%
  tidyr::unnest(data_with_mean) %>%
  dplyr::ungroup()
I compared the results with G. Grothendieck's idea, and they only match if I use time_period in my colleague's code and time_period + 1 in G. Grothendieck's code. So there is a difference in how time_period is used, and I am confused about why that happens.
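A minimal check suggests it is an endpoint issue (the toy vector d below is ours): the filter() condition keeps both endpoints, a closed interval [x - 10, x] that spans 11 distinct days, while findInterval() counts only days strictly after x - 10, a half-open interval (x - 10, x] spanning exactly 10 days.
d <- 0:30  # one observation per day
x <- 15    # an example reference day
# colleague's window: closed interval [x - 10, x]
sum(d >= x - 10 & d <= x)
# [1] 11
# findInterval() window: half-open interval (x - 10, x]
which(d == x) - findInterval(x - 10, d)
# [1] 10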

Manipulating a data.frame in a loop and storing the results in a list

I have two pieces of code that manipulate and filter (by date) my data.frame, and they work perfectly. Now I want to run the code not just for one day, but for every day in the vector:
seq(from = as.Date('2020-03-02'), to = Sys.Date(), by = 'days') # .... 538 days
The code I want to run for all the days between 2020-03-02 and today is:
KOKOKO <- data.frame %>%
  filter(DATE < '2020-03-02') %>%
  summarize(DATE = '2020-03-02', CZK = sum(Objem.v.CZK, na.rm = TRUE))
STAVPTF <- data.frame %>%
  filter(DATE < '2020-03-02') %>%
  group_by(CP) %>%
  summarize(mnozstvi = last(AKTUALNI_MNOZSTVI_AKCIE), DATE = '2020-03-02') %>%
  select(DATE, CP, mnozstvi) %>%
  rbind(KOKOKO) %>%
  drop_na()
So instead of '2020-03-02' I want to fill in every day since 2020-03-02, one after another. Each KOKOKO and STAVPTF created for a given day should be saved as a separate data.frame, and all of them stored in a list.
We could use map to loop over the sequence and apply the code:
library(dplyr)
library(purrr)
out <- map(s1, ~ data.frame %>%
             filter(DATE < .x) %>%
             summarize(DATE = .x, CZK = sum(Objem.v.CZK, na.rm = TRUE)))
As this is a repeated cycle, a function would make it cleaner:
library(tidyr) # for drop_na()
f1 <- function(dat, date_col, group_col, Objem_col, aktualni_col, date_val) {
  filtered <- dat %>%
    filter({{ date_col }} < date_val)
  KOKOKO <- filtered %>%
    summarize({{ date_col }} := date_val,
              CZK = sum({{ Objem_col }}, na.rm = TRUE))
  STAVPTF <- filtered %>%
    group_by({{ group_col }}) %>%
    summarize(mnozstvi = last({{ aktualni_col }}),
              {{ date_col }} := date_val) %>%
    select({{ date_col }}, {{ group_col }}, mnozstvi) %>%
    bind_rows(KOKOKO) %>%
    drop_na()
  return(STAVPTF)
}
and call it as
map(s1, ~ f1(data.frame, DATE, CP, Objem.v.CZK, AKTUALNI_MNOZSTVI_AKCIE, .x))
where
s1 <- seq(from=as.Date('2020-03-02'), to=Sys.Date(), by='days')
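If you want each day's KOKOKO and STAVPTF kept as separate data.frames inside one list, as the question describes, one option is a variant of f1 that returns both. This is only a sketch; f1_list and its hard-coded column names are our assumptions:
library(dplyr)
library(purrr)
f1_list <- function(dat, date_val) {
  filtered <- dat %>%
    filter(DATE < date_val)
  KOKOKO <- filtered %>%
    summarize(DATE = date_val, CZK = sum(Objem.v.CZK, na.rm = TRUE))
  STAVPTF <- filtered %>%
    group_by(CP) %>%
    summarize(mnozstvi = last(AKTUALNI_MNOZSTVI_AKCIE),
              DATE = date_val, .groups = "drop") %>%
    select(DATE, CP, mnozstvi)
  # both tables for this day, side by side
  list(KOKOKO = KOKOKO, STAVPTF = STAVPTF)
}
# one list element per day, named by the date
out <- map(s1, ~ f1_list(data.frame, .x)) %>%
  set_names(as.character(s1))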
It would be easier to answer your question if you provided a minimal reproducible example. That is easily done with the tidyverse's reprex package.
However, your KOKOKO code can be rewritten as a simple cumulative sum:
KOKOKO <- data.frame %>%
  arrange(DATE) %>% # if necessary
  group_by(DATE) %>%
  summarise(CZK = sum(Objem.v.CZK), .groups = 'drop') %>% # summarise per DATE (if necessary)
  mutate(CZK = cumsum(CZK) - CZK) # cumulative sum excluding current row (current DATE)
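The cumsum(CZK) - CZK trick is easy to verify on toy numbers (ours):
CZK <- c(5, 3, 2, 10)
cumsum(CZK) - CZK
# [1] 0 5 8 10   (the total of everything strictly before each element)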
Even the STAVPTF code can probably be rewritten without iteration. First find the last value of AKTUALNI_MNOZSTVI_AKCIE per CP and DATE; then this value is assigned to the next DATE:
STAVPTF <- data.frame %>%
  group_by(CP, DATE) %>%
  summarise(mnozstvi = last(AKTUALNI_MNOZSTVI_AKCIE), .groups = 'drop_last') %>%
  arrange(DATE) %>% # if necessary
  mutate(DATE = lead(DATE))
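The lead(DATE) step can be checked on toy data (ours): the last known quantity on each date is shifted forward so that it describes the holdings entering the next recorded date, and the final row per CP gets NA.
library(dplyr)
toy <- tibble::tibble(
  CP = c("A", "A", "A"),
  DATE = as.Date(c("2020-03-02", "2020-03-03", "2020-03-05")),
  mnozstvi = c(10, 12, 15)
)
toy %>%
  group_by(CP) %>%
  arrange(DATE) %>%
  mutate(DATE = lead(DATE))
# the quantity known on 03-02 is now attached to 03-03, and so on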

Is there a faster way than applying 'ddply' to aggregate columns by groups with a large dataset?

Purpose
I am trying to check whether a pair of values in two columns appears in the previous event, and to aggregate the resulting dummy variable by groups.
Specifically, I have an event id (i.e., oid) and dyad-level observations associated with each event: agent (i.e., aid) and partner (i.e., pid). The events are sorted by the time at which they occur (i.e., o4.in).
(1) I made a dummy variable indicating whether a pair of agent and partner appears together in the previous event.
(2) Also, I used ddply to aggregate the dummy variable by groups, as specified in the example below.
I find that ddply and the lag functions take a very long time with a large dataset, and I am wondering if there is a faster way to achieve these tasks.
Dataset
library(plyr) # for ddply(); attached before dplyr so that dplyr's verbs win
library(tidyverse)
library(tibble)
rename <- dplyr::rename
select <- dplyr::select
set.seed(10001)
cases <- sample(1:5, 1000, replace = TRUE)
set.seed(10002)
agent <- sample(1:20, 1000, replace = TRUE)
set.seed(10003)
partner <- sample(1:20, 1000, replace = TRUE)
set.seed(123)
n <- 1000 # number of random datetimes needed
minDate <- as.POSIXct("1999/01/01")
maxDate <- as.POSIXct("2000-01-01")
epoch <- "1970-01-01"
timestamps <- as.POSIXct(
  pmax(runif(n, minDate, maxDate), runif(n, minDate, maxDate)),
  origin = epoch
)
df <- data.frame(cases, agent, partner, timestamps) %>%
  rename(
    aid = agent,
    pid = partner,
    oid = cases,
    o4.in = timestamps
  ) %>%
  filter(aid != pid)
Current Methods
# creating the dummy variable
d <- df %>%
  arrange(o4.in) %>%
  group_by(aid) %>%
  mutate(oid.lag.a = lag(oid)) %>%
  ungroup() %>%
  group_by(pid) %>%
  mutate(oid.lag.p = lag(oid)) %>%
  ungroup() %>%
  mutate(
    j2.consecutive = ifelse(oid.lag.a == oid.lag.p, 1, 0),
    j2.consecutive = ifelse(is.na(j2.consecutive), 0, j2.consecutive)
  ) %>%
  select(-oid.lag.a, -oid.lag.p)
# aggregating the dummy variable by groups
t <- d %>%
  ungroup() %>%
  ddply(c('oid', 'aid'), function(i) {
    i %>%
      mutate(aj1.consecutive = (sum(j2.consecutive) - j2.consecutive) / (n() - 1))
  }, .progress = 'text') %>%
  arrange(oid, pid) %>%
  ddply(c('oid', 'pid'), function(i) {
    i %>%
      mutate(apj1.consecutive = (sum(j2.consecutive) - j2.consecutive) / (n() - 1))
  }, .progress = 'text')
Update for Future Readers
Task (1) is achieved by the answer from @akrun below.
The task (2) solution is answered by @akrun in a separate post: A faster way than applying 'ddply' to aggregate a variable by a function by groups
Special thanks to @akrun!!
We can use data.table methods to make it faster
library(data.table)
df2 <- copy(df)
df3 <- setDT(df2)[order(o4.in)]
df3[, oid.lag.a := shift(oid), by = aid
][, oid.lag.p := shift(oid), by = pid]
df3[, j2.consecutive := fcoalesce(+(oid.lag.a == oid.lag.p), 0L)]
Also, note that some things in the OP's code are unnecessary, e.g. using ifelse to convert a logical to binary: it can just be as.integer, or coercion with +. The second ifelse line can likewise be removed with coalesce:
library(dplyr)
out <- df %>%
  arrange(o4.in) %>%
  group_by(aid) %>%
  mutate(oid.lag.a = lag(oid)) %>%
  group_by(pid) %>%
  mutate(oid.lag.p = lag(oid)) %>%
  ungroup() %>%
  mutate(j2.consecutive = coalesce(+(oid.lag.a == oid.lag.p), 0))
Checking the output from dplyr against data.table:
all(out$j2.consecutive == df3$j2.consecutive )
[1] TRUE
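Task (2) can be expressed with the same grouped arithmetic in data.table. A sketch that follows the formula in the ddply code above (df3 is the table built in this answer; this is not akrun's linked solution):
# leave-one-out group mean: (group sum - own value) / (group size - 1)
df3[, aj1.consecutive := (sum(j2.consecutive) - j2.consecutive) / (.N - 1),
    by = .(oid, aid)]
df3[, apj1.consecutive := (sum(j2.consecutive) - j2.consecutive) / (.N - 1),
    by = .(oid, pid)]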

Rolling mutate function applied to a list of data frames

I have some time series data where I use the rolling_origin function to apply different time series splits to the data which generates a number of lists. The time series starts from 2020-03-01 until 2020-10-30.
I want to start from 2020-04-15 such that I have 1 month before (2020-03-15) and 1 month after (2020-05-15). I can use an ifelse statement to add a 1 for observations after and a 0 for observations before.
rolledData %>%
  map(., ~ mutate(.x,
                  treatment_control = ifelse(date >= as.Date("2020-04-15"), 1, 0)))
But what I want to do is increment the ifelse date as it is mapped over the list. So the first one might start on 2020-04-15, but in the next list in the sequence it would change to 2020-04-16, the next to 2020-04-17, ..., until the end.
I could manually write out the results:
lst1 <- rolledData[[12]] %>%
  mutate(treatment_control = ifelse(date >= as.Date("2020-04-15"), 1, 0))
lst2 <- rolledData[[13]] %>%
  mutate(treatment_control = ifelse(date >= as.Date("2020-04-16"), 1, 0))
lst3 <- rolledData[[14]] %>%
  mutate(treatment_control = ifelse(date >= as.Date("2020-04-17"), 1, 0))
How can I map over the list and increment the treatment_control mutate?
Note: because I am using financial data (which was just the easiest to obtain for a reproducible example), weekends are removed (in my real data I have full weeks).
Data:
library(tidyquant)
library(rsample)
library(purrr) # for map()
data <- tq_get(c("AAPL"),
               get = "stock.prices",
               from = "2020-03-01",
               to = "2020-10-30")
rolledData <- data %>%
  rolling_origin(
    data = .,
    initial = 60, # 2 months of data
    assess = 0,
    cumulative = FALSE,
    skip = 0
  )
rolledData <- rolledData$splits %>%
  map(., ~ analysis(.x))
If the dates are different, we can pass a vector of custom dates with the same length as rolledData in map2:
library(dplyr)
library(purrr)
rolledData2 <- rolledData %>%
  map2(., newdates,
       ~ .x %>%
         mutate(treatment_control = +(date >= .y)))
where
newdates <- seq(as.Date("2020-03-15"), length.out = length(rolledData), by = "1 day")
If it is based on the next month from the first 'date' value:
library(lubridate)
rolledData2 <- rolledData %>%
  map(~ .x %>%
        mutate(treatment_control =
                 +(date >= (first(date) %m+% months(1)))))
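A side note on %m+%: unlike adding a period directly, it rolls impossible dates back to the last valid day of the target month, which is why it is the safer choice here. A minimal illustration (the dates are ours):
library(lubridate)
as.Date("2020-01-31") %m+% months(1)
# [1] "2020-02-29"
as.Date("2020-01-31") + months(1)
# [1] NA   (2020-02-31 does not exist)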

R: transform dummies back into factor variables

I have a panel dataset where the time and group variables were already converted to dummies. I want to reverse that transformation, though, back to simple id and time variables.
Let's create comparable data:
library(plm)
library(tidyverse)
library(fastDummies)
data(EmplUK)
EmplUK %>%
  select(-sector) %>%
  dummy_cols(.data = .,
             select_columns = c("firm", "year"),
             remove_selected_columns = TRUE,
             remove_first_dummy = TRUE) -> paneldata
head(paneldata)
So basically all my dummy variables are now firm_X and year_X, and I would like to have a Firm and Year variable again.
This is slightly complicated by the fact that Firm 1 and Year 1 do not exist as dummies (they would not be needed in a regression model).
I'm fine with these precise values missing (I can simply infer that the first firm would be Firm 1 and that the first year would be 1976, one less than the smallest dummy year).
Any ideas how to do this nicely? Ideally using tidyverse?
After some thinking, I figured it out and created a small function:
getfactorback <- function(data,
                          groupdummyprefix,
                          timedummyprefix,
                          grouplabel,
                          timelabel,
                          firstgroup,
                          firsttime) {
  data %>%
    # reconstruct the dropped first dummies: a row belongs to the first
    # group/time exactly when none of its dummies equals 1
    mutate(newgroup = ifelse(rowSums(cur_data() %>% select(starts_with(groupdummyprefix))) == 1, 0, 1),
           newtime = ifelse(rowSums(cur_data() %>% select(starts_with(timedummyprefix))) == 1, 0, 1)) %>%
    rename(!!paste0(groupdummyprefix, firstgroup) := newgroup,
           !!paste0(timedummyprefix, firsttime) := newtime) %>%
    pivot_longer(cols = starts_with(groupdummyprefix),
                 names_to = grouplabel,
                 names_prefix = groupdummyprefix) %>%
    filter(value == 1) %>%
    select(-value) %>%
    pivot_longer(cols = starts_with(timedummyprefix),
                 names_to = timelabel,
                 names_prefix = timedummyprefix) %>%
    filter(value == 1) %>%
    select(-value) %>%
    mutate(across(.cols = c(all_of(grouplabel), all_of(timelabel)), factor)) %>%
    relocate(all_of(c(grouplabel, timelabel))) -> output
  return(output)
}
getfactorback(data = paneldata,
              groupdummyprefix = "firm_",
              grouplabel = "firm",
              timedummyprefix = "year_",
              timelabel = "year",
              firstgroup = "1",
              firsttime = 1976)
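As a quick sanity check (our sketch, not part of the original post), the recovered columns should match the original EmplUK identifiers up to type, since the recovered values are factor labels:
recovered <- getfactorback(data = paneldata,
                           groupdummyprefix = "firm_",
                           grouplabel = "firm",
                           timedummyprefix = "year_",
                           timelabel = "year",
                           firstgroup = "1",
                           firsttime = 1976)
all(as.character(recovered$firm) == as.character(EmplUK$firm)) # expect TRUE
all(as.character(recovered$year) == as.character(EmplUK$year)) # expect TRUE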
