Apply mlr3 pipes on group by basis - r

I would like to know whether it is possible to apply mlr3 PipeOp preprocessing on a group-by basis.
For example, from the mlr3pipelines documentation, we can scale predictors with the following code:
library(mlr3)
library(mlr3pipelines)
task = tsk("iris")
pop = po("scalemaxabs")
pop$train(list(task))[[1]]$data()
But is it possible to do the scaling by group? For example, let's add a month column to the iris data:
library(mlr3)
library(mlr3pipelines)
task = tsk("iris")
dt = task$data()
dt[, month := c(rep(1, 50), rep(2, 50), rep(3, 50))]
task = as_task_classif(dt, target = "Species", id = "iris")
Is it possible to scale predictors by month column? That is, we want to scale every month separately.
Using data.table, this is easy:
task$data()[, lapply(.SD, function(x) as.vector(scale(x))), .SDcols = setdiff(task$feature_names, "month"), by = month]
but is it possible to do this inside the mlr3pipe graph?

If there is no PipeOp that has exactly this functionality, you can write your own. You already solved the problem with data.tables. mlr3pipelines also uses data.tables internally, so it should be no problem to put your code into a PipeOp. The mlr3book explains how to write your own PipeOp.
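As a rough illustration of that suggestion, here is a minimal sketch of such a PipeOp. It assumes the `PipeOpTaskPreproc` base class with its `.train_dt()` / `.predict_dt()` hooks as described in the mlr3book. The class name `PipeOpScaleByGroup`, the `group_col` field and the `.scale_groups()` helper are hypothetical names made up for this example, not part of mlr3pipelines.

```r
library(mlr3)
library(mlr3pipelines)
library(data.table)
library(R6)

# Hypothetical PipeOp (not part of mlr3pipelines): standardizes double-typed
# features separately within each level of `group_col`.
PipeOpScaleByGroup = R6Class("PipeOpScaleByGroup",
  inherit = PipeOpTaskPreproc,
  public = list(
    group_col = NULL,
    initialize = function(id = "scale_by_group", group_col = "month") {
      self$group_col = group_col
      super$initialize(id = id)
    }
  ),
  private = list(
    # Learn per-group means and standard deviations, then scale the training data.
    .train_dt = function(dt, levels, target) {
      grp = self$group_col
      cols = setdiff(names(dt)[vapply(dt, is.double, logical(1))], grp)
      self$state = list(
        cols = cols,
        center = dt[, lapply(.SD, mean), by = grp, .SDcols = cols],
        scale = dt[, lapply(.SD, sd), by = grp, .SDcols = cols]
      )
      private$.scale_groups(dt)
    },
    # Reuse the statistics learned during training; unseen groups yield NA.
    .predict_dt = function(dt, levels) {
      private$.scale_groups(dt)
    },
    # Look up each row's group statistics and standardize column by column.
    .scale_groups = function(dt) {
      st = self$state
      idx = match(dt[[self$group_col]], st$center[[self$group_col]])
      out = copy(dt)
      for (col in st$cols) {
        set(out, j = col, value = (dt[[col]] - st$center[[col]][idx]) / st$scale[[col]][idx])
      }
      out
    }
  )
)

# Usage on the iris task with the extra month column from the question
dt = tsk("iris")$data()
dt[, month := rep(c(1, 2, 3), each = 50)]
task = as_task_classif(dt, target = "Species", id = "iris")
scaled = PipeOpScaleByGroup$new(group_col = "month")$train(list(task))[[1]]$data()
```

The grouping column is kept as a plain field here to keep the sketch short; a hyperparameter in the PipeOp's parameter set would probably be the more idiomatic place for it. Because the per-group statistics live in the state, prediction data is scaled with the training means and standard deviations rather than its own.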

Related

How can I re-write code that applies a function on subset of rows based on another vector in different R ecosystems?

In my problem I have to apply a function on a subset of individual time-series based on a set of dates extracted from the original data.
So, I have a data.frame with a time-series for each individual between 2005-01-01 and 2010-12-31 (test_final_ind_series) and a sample of pairs individual-date (sample_events) ideally extracted from the same data.
With these, in my example I attempt to calculate an average on a subset of the time-series values exp conditional on individual and date in the sample_events.
I did this in 2 different ways:
1: a simple but effective code that gets the job done very quickly
I simply ask the user to input the data for a specific individual and define a lag of time and a window width (like a rolling average). The function exp_summary then outputs the requested average.
To repeat the operation for each row in sample_events I decided to nest the individual series by ID of the individuals and then attach the sample of dates. Eventually, I just run a loop that applies the function to each individual nested dataframe.
# Packages used below
library(data.table)
library(dplyr)
library(tidyr)
#Sample data
set.seed(111)
exp_series <- data.frame(
id = as.character(rep(1:10000, each=2191)),
date = rep(seq(as.Date('2005-01-01'),
as.Date('2010-12-31'), by = 'day'),times=10000),
exp = rep(rnorm(n=10000, mean=10, sd=5),times=2191)
)
sample_dates <- data.frame(
Event_id = as.character(replicate(10000,sample(1:10000,size = 1,replace = TRUE))),
Event_date = sample(
seq(as.Date('2005-01-01'),
as.Date('2010-12-31'), by = 'day'),
size =10000,replace = TRUE)
)
#This function, given a dataframe with dates and exposure series (df)
#an event_date
#a lag value
#a width of the window
#Outputs the average for a user-defined time window
exp_summary<- function(df, event_date, lag=0,width=0){
df<-as.data.table(df)
end<-as.character(as.Date(event_date)-lag)
start<-as.character(max(as.Date(end)-width, min(df$date)))# I need this in case the time window goes beyond the time limits (earliest date)
return(mean(df[date %between% c(start,end)]$exp))
}
#Nest dataframes
exp_series_nest <- exp_series %>%
group_by(id) %>%
nest()
#Merge with sample events, including only the necessary dates
full_data<-merge(exp_series_nest,sample_dates, by.x="id", by.y="Event_id",all.x = FALSE, all.y=TRUE)
#Initialize dataframe in advance
summaries1<-setNames(data.frame(matrix(ncol = 2, nrow = nrow(full_data))), c("id", "mean"))
summaries1$id<-full_data$id
#Loop over each id, which is a nested data.frame
system.time(for (i in 1:nrow(full_data)){
summaries1$mean[i]<-exp_summary(full_data$data[[i]], full_data$Event_date[i], lag=1, width=365)
})
2: using the highly-flexible package runner
With the same data, I need to specify the arguments properly. I have also opened an issue on the GitHub repository about speeding up this code with parallelization.
library(runner)
system.time(summaries2 <- sample_dates %>%
group_by(Event_id) %>%
mutate(
mean = runner(
x = exp_series[exp_series$id == Event_id[1],],
k = "365 days",
lag = "1 days",
idx =exp_series$date[exp_series$id == Event_id[1]],
at = Event_date,
f = function(x) {mean(x$exp)},
na_pad=FALSE
)
)
)
They give the very same results up to the second decimal, but method 1 is much faster than method 2, and the difference shows when you use very large datasets.
My question is, for method 1, how can I write the last loop in a more concise way within the data.table and/or tidyverse ecosystems? I really struggle to make nested lists and "normal" columns work together when they are embedded in the same dataframe.
Also, if you have any other recommendations I am open to hearing them! I am here more out of curiosity than need, as method 1 already solves my problem acceptably.
With data.table, you could join exp_series with the range you wish in sample_dates and calculate mean by=.EACHI:
library(data.table)
setDT(exp_series)
setDT(sample_dates)
lag <- 1
width <- 365
# Define range
sample_dates[,':='(begin=Event_date-width-lag,end=Event_date-lag)]
# Calculate mean by .EACHI
summariesDT <- exp_series[sample_dates,.(id,mean=mean(exp))
,on=.(id=Event_id,date>=begin,date<=end),by=.EACHI][
,.(id,mean)]
Note that this returns the same results as summaries1 only for Event_id without duplicates in sample_dates.
The results are different in case of duplicates, for instance Event_id==1002:
sample_dates[Event_id==1002]
Event_id Event_date begin end
<char> <Date> <Date> <Date>
1: 1002 2010-08-17 2009-08-16 2010-08-16
2: 1002 2010-06-23 2009-06-22 2010-06-22
If you don't have duplicates in your real data, this shouldn't be a problem.
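To make the mechanics of the non-equi join easier to follow, here is a small sketch on made-up data. The tables `series` and `events` are hypothetical stand-ins for exp_series and sample_dates, and only the mean is returned in j.

```r
library(data.table)

# Toy stand-ins for exp_series and sample_dates (values are made up)
series <- data.table(
  id   = rep(c("a", "b"), each = 5),
  date = rep(as.Date("2020-01-01") + 0:4, times = 2),
  exp  = c(1, 2, 3, 4, 5, 10, 20, 30, 40, 50)
)
events <- data.table(
  Event_id   = c("a", "a", "b"),
  Event_date = as.Date(c("2020-01-04", "2020-01-05", "2020-01-03"))
)

lag <- 1
width <- 2
# Window bounds for every event: [Event_date - width - lag, Event_date - lag]
events[, `:=`(begin = Event_date - width - lag, end = Event_date - lag)]

# For each row of events, average exp over the rows of series in its window
res <- series[events, .(mean = mean(exp)),
              on = .(id = Event_id, date >= begin, date <= end), by = .EACHI]
res$mean
# → 2 3 15
```

In this toy case by = .EACHI produces one result per row of events, in the order of events, so the means can be attached to the events with events[, mean := res$mean].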

dplyr dynamically create lag and ma features

I am trying to create a process that takes in a dataframe and creates additional lagged and rolling window features (e.g. moving average). This is what I have so far.
# dummy dataframe
n <- 20
set.seed(123)
foo <- data.frame(
date = seq(as.Date('2020-01-01'),length.out = n, by = 'day'),
var1 = sample.int(n),
var2 = sample.int(n))
library(dplyr)
library(zoo)
# creates lags and, based on (some of) them, rolling average features
foo %>%
mutate_at(vars(starts_with('var')),
list(lag_1 = ~lag(.), lag_2 = ~lag(., 2))) %>%
mutate_at(vars(contains('lag_1')),
list(ra_3 = ~rollmean(., k = 3, align = 'right', fill = NA)))
The above chunk:
creates lag_1 and lag_2 features for the selected variables
based on a subset of the newly created columns, creates rolling average features
What I am now looking for is a way to create an arbitrary number of lagged features (e.g. lag3, lag6, lag9 and so on) as well as an arbitrary number of rolling average features of different window lengths (i.e. var1_lag_1_ra_3, var1_lag_1_ra_6, var2_lag_1_ra_3, var2_lag_1_ra_6). At the moment the settings to generate such features are hardcoded. Ideally I would have a couple of vectors to adjust the outcome, like so:
lag_features <- c(3,6,9)
ma_features <- c(12,15)
Lastly, it would be quite nice if there was a way to configure the names of the generated features dynamically. I've seen the {{}}, !! and := operators, but I am not really in a position to tell the difference or how to use them.
I have also implemented the above using some readily available functions from the timetk package, but since I am looking for some additional flexibility, I was wondering how I could replicate such behavior myself.
library(timetk)
foo %>%
select(date,starts_with('var')) %>%
tk_augment_lags(.value = starts_with("var"),
.lags = 1) %>%
tk_augment_slidify(.value = ends_with("lag1"),
.period = seq(0,24,3)[-1],
.f = mean,
.align = 'right',
.partial = TRUE
)
Any support would be really appreciated.
You can use purrr's map_dfc() to build the lagged columns for each value in lag_features, and the .names argument of across() to name the new columns.
library(dplyr)
library(purrr)
library(zoo)
lag_features <- c(3,6,9)
ma_features <- c(12,15)
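# Step 1: add the lagged columns first, so that they exist in foo for step 2
foo <- bind_cols(foo, map_dfc(lag_features, function(lag_n) foo %>%
transmute(across(starts_with('var'),
~ lag(.x, lag_n), .names = '{.col}_lag{lag_n}'))))
# Step 2: rolling means of the lag3 columns, one set per window length
foo <- bind_cols(foo, map_dfc(ma_features, function(win) foo %>%
transmute(across(contains('lag3'), ~ rollmeanr(.x, k = win,
fill = NA), .names = '{.col}_{win}'))))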
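On the part of the question about `!!` and `:=`: one way to read them is that `:=` lets the left-hand side of a mutate() argument be computed, and `!!` injects the value of a variable as that name. A minimal sketch, where the helper `add_lag()` and the data frame `toy` are hypothetical names used only for illustration:

```r
library(dplyr)

# Hypothetical helper: adds one lagged copy of `col`, named "<col>_lag<n>"
add_lag <- function(df, col, n) {
  new_name <- paste0(col, "_lag", n)
  # !! injects the string as the column name; := allows a computed name
  mutate(df, !!new_name := dplyr::lag(.data[[col]], n))
}

toy <- data.frame(var1 = 1:5)
out <- toy
for (n in c(1, 3)) out <- add_lag(out, "var1", n)
names(out)
# → "var1" "var1_lag1" "var1_lag3"
```

`{{ }}` addresses a different situation: it forwards an unquoted column name that was passed as a function argument, which is not needed here because the column is given as a string and accessed through `.data[[col]]`.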

Nested Individual-time conditions on panel data in R

So I have a large data set of students at a school that looks like this:
library(data.table)
set.seed(1)
school <- data.table("id" = rep(1:10, each = 10), "year" = rep(2000:2009, each = 10),
"grade" = sample(c(9:11, rep(NA, 5)), 100, replace = T))
What I want to do is create a column that indicates if a student has previously been in the same grade as he is now.
The desired output for this example can be found here (I created a link to save space).
This may sound simple but it is not since students can go back in grades, or be absent in years prior.
I would like a way to do this using data.table as the dataset is very large. So far I've tried the following:
library(dplyr)
library(scales)
school[, repetition := any(school[censor((.I - 10):(.I + 10),
range = c(0, NROW(school))) %>% na.omit
][school[.I, id] == id] == grade)]
However, this doesn't work as I don't know how to distinguish "upper level" (from the first school[...] call) operators like .I and id from inside the second school[...] call.
P.S.: I'll accept suggestions for a better title. Thanks!
We can use duplicated to get a logical value for grades that repeat within each id and year.
library(data.table)
school[, repetition := duplicated(grade, incomparables = NA), .(id, year)]
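To see what duplicated() does with incomparables = NA, here is a small sketch on a made-up panel. The table `toy` is hypothetical and, unlike the sample data in the question, has one row per student and year; under that assumption the grouping is by id alone, with rows sorted by year first.

```r
library(data.table)

# Hypothetical panel: one row per student (id) and year
toy <- data.table(
  id    = c(1, 1, 1, 1, 2, 2, 2),
  year  = c(2000:2003, 2000:2002),
  grade = c(9, NA, 9, 10, 9, NA, 9)
)

# Sort so that "previously" means "in an earlier year"
setorder(toy, id, year)
# A grade is a repetition if the same student already had it; NA never counts
toy[, repetition := duplicated(grade, incomparables = NA), by = id]
toy$repetition
# → FALSE FALSE TRUE FALSE FALSE FALSE TRUE
```

Absent years (NA grades) are never flagged, and a student who drops back to an earlier grade is flagged because duplicated() looks at all earlier rows in the group, not only the previous one.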

Rolling correlation with 'grouped by' - Error: incorrect number of dimensions

I'm trying to calculate rolling correlations with a five year window based on daily stock data. My dataframe test consists of 20 columns, with "logRet3" being located in column #17 and "logMarRet3" in #18. I want to calculate the correlation of these two return measures.
What makes it difficult is the fact that I want the rolling correlation to be grouped by my share indicator "PERMNO" in column #1. By that I mean that the rolling correlation "restarts" whenever the time-series data of a particular stock ends.
Through research I came up with the following code, using the dplyr, zoo and magrittr packages:
test <- test %>%
group_by(PERMNO) %>%
mutate(CorSecMar = zoo::rollapply(test, width = 1255, function(x) cor(x[,logRet3], x[,logMarRet3]), fill = NA, align = "right"))
However, when I run this code, I get the following error:
Error in x[,logMarRet3]: Incorrect number of dimensions
Me being a newbie, I tried adjusting the code by deleting the ,:
test <- test %>%
group_by(PERMNO) %>%
mutate(CorSecMar = zoo::rollapply(test, width = 1255, function(x) cor(x[logRet3], x[logMarRet3]), fill = NA, align = "right"))
resulting in the following error (translated to English):
Error in x[logMarRet3]: Only zeros are allowed to be mixed with negative indices
Any help on how to fix these errors or alternative ways of calculating the rolling correlation by group would be greatly appreciated.
EDIT: Thanks to G. Grothendieck for pointing out some flaws in my question. I'm referring to his answer for reproducible input and will keep that in mind for further posts.
There are several problems:
rollapply applies to each column separately unless by.column = FALSE is used.
using test within group_by will not cause test to be subsetted. It will refer to the entire dataset. Use individual column names instead.
the column names in the code in the question must have quotes around them; otherwise, it is saying there are variables of those names containing the column names.
when posting to SO you need to reduce your problem to a complete reproducible example and post that. I have done it this time for you in the Note at the end.
With reference to the Note, use this code:
library(dplyr)
library(zoo)
mycor <- function(x) cor(x[, 1], x[, 2])
DF %>%
group_by(stock) %>%
mutate(Cor = rollapplyr(cbind(a, b), 4, mycor, by.column = FALSE, fill = NA)) %>%
ungroup
or this code which only uses zoo. mycor is from above.
library(zoo)
n <- nrow(DF)
roll <- function(i) rollapplyr(DF[i, c("a", "b")], 4, mycor, by.column = FALSE, fill = NA)
transform(DF, Cor = ave(1:n, stock, FUN = roll))
Note
The input in reproducible form is:
DF <- data.frame(stock = rep(LETTERS[1:2], each = 6), a = 1:6, b = (1:6)^3)
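As a dependency-free cross-check of the grouped rolling correlation, the same idea can be sketched in base R. The helper `roll_cor()` is a hypothetical name; it uses a right-aligned window and pads with NA, matching the rollapplyr(..., fill = NA) calls above.

```r
# Input from the Note above
DF <- data.frame(stock = rep(LETTERS[1:2], each = 6), a = 1:6, b = (1:6)^3)

# Hypothetical helper: right-aligned rolling correlation, NA until the
# window is full
roll_cor <- function(x, y, width) {
  n <- length(x)
  out <- rep(NA_real_, n)
  if (n >= width) {
    for (i in width:n) {
      idx <- (i - width + 1):i
      out[i] <- cor(x[idx], y[idx])
    }
  }
  out
}

# Restart the window for every stock by filling each group separately
DF$Cor <- NA_real_
for (s in unique(DF$stock)) {
  rows <- DF$stock == s
  DF$Cor[rows] <- roll_cor(DF$a[rows], DF$b[rows], width = 4)
}
```

The explicit loop is slower than rollapplyr on long series, but it makes visible that each stock gets its own leading NA values and that no window spans two stocks.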

Using data.table for multiple aggregation steps

I am trying to do multiple aggregation steps using data.table. First I want to find the median value at each concentration for a specific type of sample by plate, then I want to find the maximum of the medians for each plate.
library(data.table)
set.seed(1)
DT <- data.table(plate = rep(paste0("plate",1:3),each=11),
type = rep(c(rep(1,9),2,2),3),
value = sample(1:25,33,replace=TRUE),
conc = rep(c(rep(1:3,each=3),4,4),3)
)
I got the following to work:
DT[,med := median(value[type==1]),by=list(plate,conc)]
DT[,max := max(med,na.rm=TRUE),by=plate]
Is it possible to do a multiple step aggregation without adding the intermediate med column?
You could e.g. do the following:
DT[, max := max(.SD[, median(value[type == 1]), by = conc]$V1, na.rm = T),
by = plate]
but I'm pretty sure your two-line approach is much faster.
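Another option, if a per-plate summary table is acceptable as an intermediate result, is to chain two aggregations. This is a sketch on the sample data from the question; `plate_max` is a name introduced here, and as.numeric() is only there so that median() returns the same type for every group.

```r
library(data.table)
set.seed(1)
DT <- data.table(plate = rep(paste0("plate", 1:3), each = 11),
                 type = rep(c(rep(1, 9), 2, 2), 3),
                 value = sample(1:25, 33, replace = TRUE),
                 conc = rep(c(rep(1:3, each = 3), 4, 4), 3))

# Step 1: median per plate and concentration for type 1 samples only
# Step 2: maximum of those medians per plate
plate_max <- DT[type == 1, .(med = median(as.numeric(value))), by = .(plate, conc)][
  , .(max = max(med)), by = plate]

# Optional: attach the result to the original rows with an update join
DT[plate_max, on = "plate", max := i.max]
```

Filtering on type == 1 in i drops the concentrations that have no type 1 samples, so no NA medians arise and na.rm is not needed.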
