Using data.table for multiple aggregation steps - r

I am trying to do multiple aggregation steps using data.table. First I want to find the median value at each concentration for a specific type of sample by plate, then I want to find the maximum of the medians for each plate.
library(data.table)
set.seed(1)
DT <- data.table(plate = rep(paste0("plate", 1:3), each = 11),
                 type = rep(c(rep(1, 9), 2, 2), 3),
                 value = sample(1:25, 33, replace = TRUE),
                 conc = rep(c(rep(1:3, each = 3), 4, 4), 3))
I got the following to work:
DT[,med := median(value[type==1]),by=list(plate,conc)]
DT[,max := max(med,na.rm=TRUE),by=plate]
Is it possible to do a multiple step aggregation without adding the intermediate med column?

You could e.g. do the following:
DT[, max := max(.SD[, median(value[type == 1]), by = conc]$V1, na.rm = T),
by = plate]
but I'm pretty sure your two-line way is much faster.
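For reference, a chained-aggregation sketch that avoids the intermediate med column entirely (assuming the same DT as above): aggregate first, then join the per-plate maximum back with an update join.
# median per plate/conc among the type 1 rows, then the max of those medians per plate
agg <- DT[type == 1, .(med = median(value)), by = .(plate, conc)
          ][, .(max = max(med)), by = plate]
# update join: attach each plate's maximum to DT without ever keeping a med column
DT[agg, max := i.max, on = "plate"]
Because the conc == 4 groups contain no type 1 rows, they simply drop out of the aggregation, which matches the na.rm = TRUE behaviour of the two-line version.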

Related

Apply mlr3 pipes on group by basis

I would like to know whether it is possible to apply mlr3 pipe processing on a group-by basis.
For example, from the mlr3pipelines documentation, we can scale predictors with the following code:
library(mlr3)
library(mlr3pipelines)
task = tsk("iris")
pop = po("scalemaxabs")
pop$train(list(task))[[1]]$data()
But is it possible to do the scaling by group? For example, let's add a month column to the iris data:
library(mlr3)
library(mlr3pipelines)
task = tsk("iris")
dt = task$data()
dt[, month := c(rep(1, 50), rep(2, 50), rep(3, 50))]
task = as_task_classif(dt, target = "Species", id = "iris")
Is it possible to scale the predictors by the month column? That is, we want to scale every month separately.
Using data.table, this is easy:
task$data()[, lapply(.SD, function(x) as.vector(scale(x))), .SDcols = names(dt)[2:5], by = month]
but is it possible to do this inside the mlr3pipe graph?
If there is no PipeOp that has exactly this functionality, you can write your own. You already solved the problem with data.tables. mlr3pipelines also uses data.tables internally, so it should be no problem to put your code into a PipeOp. The mlr3book explains how to write your own PipeOp.
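If you go that route, here is a rough, untested sketch built on PipeOpTaskPreproc's .train_dt/.predict_dt interface. The class name PipeOpScaleByGroup, the hard-coded "month" group column, and the empty $state are assumptions of this sketch, not the book's example:
library(R6)
library(mlr3pipelines)
library(data.table)

PipeOpScaleByGroup = R6Class("PipeOpScaleByGroup",
  inherit = PipeOpTaskPreproc,
  public = list(
    initialize = function(id = "scale_by_group") {
      super$initialize(id = id)
    }
  ),
  private = list(
    .train_dt = function(dt, levels, target) {
      # scale every other feature within each month group
      cols = setdiff(names(dt), "month")
      dt[, (cols) := lapply(.SD, function(x) as.vector(scale(x))),
         .SDcols = cols, by = month]
      # a production PipeOp would store per-group centers/scales here for predict()
      self$state = list()
      dt
    },
    .predict_dt = function(dt, levels) {
      # simplification: rescales with the new data's own group statistics
      cols = setdiff(names(dt), "month")
      dt[, (cols) := lapply(.SD, function(x) as.vector(scale(x))),
         .SDcols = cols, by = month]
      dt
    }
  )
)
# usage sketch: po_grp = PipeOpScaleByGroup$new(); po_grp$train(list(task))[[1]]$data()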

Trouble constructing a function properly in R

In the code below, I'm trying to find the mean correct score for each item in the "category" column of the "regular season" dataset I'm working with.
rs_category <- list2env(split(regular_season, regular_season$category),
.GlobalEnv)
unique_categories <- unique(regular_season$category)
for (i in unique_categories)
Mean_[i] <- mean(regular_season$correct[regular_season$category == i], na.rm = TRUE, .groups = 'drop')
eapply(rs_category, Mean_[i])
print(i)
I'm having trouble getting this to work though. I have created a list of the items in the category as sub-datasets and separately, (I think) I have created a vector of the unique items in the category in order to run the for loop with. I have a feeling the problem may be with how I defined the mean function because an error occurs at the "eapply()" line and tells me "Mean_[i]" is not a function, but I can't think of how else to define the function. If someone could help, I would greatly appreciate it.
The issue would be that Mean_ wouldn't have an i name. In the code below, we initialize the object 'Mean_' as type numeric with the same length as 'unique_categories', then loop over the sequence of 'unique_categories', get the subset of 'correct', apply the mean function, and store that as the ith value of 'Mean_'.
Mean_ <- numeric(length(unique_categories))
for (i in seq_along(unique_categories)) {
  Mean_[i] <- mean(regular_season$correct[regular_season$category == unique_categories[i]],
                   na.rm = TRUE)
}
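If the result should carry the category labels, the vector can be named afterwards (a small optional addition to the loop above):
names(Mean_) <- unique_categories
Mean_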
If we need faster execution, use data.table:
library(data.table)
setDT(regular_season)[, .(Mean_ = mean(correct, na.rm = TRUE)), category]
Or using collapse
library(collapse)
fmean(slt(regular_season, correct), g = regular_season$category)
Instead of splitting the dataset and using a for loop, R has functions for such grouping operations which I think can be used here. You can apply a function to each unique group (value).
library(dplyr)
regular_season %>%
group_by(category) %>%
summarise(Mean_ = mean(correct, na.rm = TRUE)) -> result
This gives you the average value of correct for each category, where result$Mean_ is the vector that you are looking for.
In base R, this can be solved with aggregate.
result <- aggregate(correct~category, regular_season, mean, na.rm = TRUE)
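Since the goal was a Mean_ value per category, tapply is one more base R option worth noting (an addition, not part of the original answers); it returns a named vector indexed by category directly:
Mean_ <- tapply(regular_season$correct, regular_season$category, mean, na.rm = TRUE)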

Nested Individual-time conditions on panel data in R

So I have a large data set of students at a school that looks like this:
library(data.table)
set.seed(1)
school <- data.table("id" = rep(1:10, each = 10), "year" = rep(2000:2009, each = 10),
"grade" = sample(c(9:11, rep(NA, 5)), 100, replace = T))
What I want to do is create a column that indicates if a student has previously been in the same grade as he is now.
The desired output for this example can be found here (I created a link to save space).
This may sound simple but it is not since students can go back in grades, or be absent in years prior.
I would like a way to do this using data.table as the dataset is very large. So far I've tried the following:
library(dplyr)
library(scales)
school[, repetition := any(school[censor((.I - 10):(.I + 10),
range = c(0, NROW(school))) %>% na.omit
][school[.I, id] == id] == grade)]
However, this doesn't work as I don't know how to distinguish "upper level" (from the first school[...] call) operators like .I and id from inside the second school[...] call.
P.S.: I'll accept suggestions for a better title. Thanks!
We can use duplicated to get a logical value for grades that repeat for each id and year.
library(data.table)
school[, repetition := duplicated(grade, incomparables = NA), .(id, year)]
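As a side note, if each student appears across several years (a true panel), a hedged variant is to sort chronologically within student and group by id alone, so "previously" means any earlier row for that student:
setorder(school, id, year)  # chronological order within each student
school[, repetition := duplicated(grade, incomparables = NA), by = id]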

Rolling correlation with 'grouped by' - Error: incorrect number of dimensions

I'm trying to calculate rolling correlations with a five year window based on daily stock data. My dataframe test consists of 20 columns, with "logRet3" being located in column #17 and "logMarRet3" in #18. I want to calculate the correlation of these two return measures.
What makes it difficult is the fact that I want the rolling correlation to be grouped by my share indicator "PERMNO" in column #1. By that I mean that the rolling correlation "restarts" whenever the time-series data of a particular stock ends.
Through research I came up with the following code, using the dplyr, zoo and magrittr packages:
test <- test %>%
group_by(PERMNO) %>%
mutate(CorSecMar = zoo::rollapply(test, width = 1255, function(x) cor(x[,logRet3], x[,logMarRet3]), fill = NA, align = "right"))
However, when I run this code, I get the following error:
Error in x[,logMarRet3]: Incorrect number of dimensions
Being a newbie, I tried adjusting the code by deleting the comma:
test <- test %>%
group_by(PERMNO) %>%
mutate(CorSecMar = zoo::rollapply(test, width = 1255, function(x) cor(x[logRet3], x[logMarRet3]), fill = NA, align = "right"))
resulting in the following error (translated to English):
Error in x[logMarRet3]: Only zeros are allowed to be mixed with negative indices
Any help on how to fix these errors or alternative ways of calculating the rolling correlation by group would be greatly appreciated.
EDIT: Thanks to G. Grothendieck for pointing out some flaws in my question. I'm referring to his answer for reproducible input and will keep that in mind for further posts.
There are several problems:
rollapply applies to each column separately unless by.column = FALSE is used.
using test within group_by will not cause test to be subsetted. It will refer to the entire dataset. Use individual column names instead.
the column names in the code in the question must have quotes around them; otherwise, it is saying there are variables of those names containing the column names.
when posting to SO you need to reduce your problem to a complete reproducible example and post that. I have done it this time for you in the Note at the end.
With reference to the Note, use this code:
library(dplyr)
library(zoo)
mycor <- function(x) cor(x[, 1], x[, 2])
DF %>%
group_by(stock) %>%
mutate(Cor = rollapplyr(cbind(a, b), 4, mycor, by.column = FALSE, fill = NA)) %>%
ungroup
or this code which only uses zoo. mycor is from above.
library(zoo)
n <- nrow(DF)
roll <- function(i) rollapplyr(DF[i, c("a", "b")], 4, mycor, by.column = FALSE, fill = NA)
transform(DF, Cor = ave(1:n, stock, FUN = roll))
Note
The input in reproducible form is:
DF <- data.frame(stock = rep(LETTERS[1:2], each = 6), a = 1:6, b = (1:6)^3)
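For completeness, a data.table sketch of the same grouped rolling correlation, using DF from the Note and the mycor helper defined above (an addition, not part of the original answer):
library(data.table)
library(zoo)
setDT(DF)
DF[, Cor := rollapplyr(cbind(a, b), 4, mycor, by.column = FALSE, fill = NA),
   by = stock]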

Rolling Regression by Group

Hi, I have a panel data set. I'd like to do a rolling-window regression for each firm and extract the coefficient of the independent variable. y is the dependent variable and x is the independent variable. The rolling window is 12. That is,
the first regression uses row 1 to row 12 data,
the second regression uses row 2 to row 13 data, etc.
Rollapply is used.
Here is a question that has the exact same error that I encountered:
Rolling by group in data.table R
The lucky thing about that question is that it only takes one column but mine takes two columns for regression so I can't make the change accordingly to the recommended answer in that post.
Here is another post that uses a for loop. My real data has more than 2 million observations so it is too slow:
rolling regression with dplyr
Can any one help?
My fake data set is as follows:
dt<-rep(c("AAA","BBB","CCC"),each=24)
dt<-as.data.frame(dt)
names(dt)[names(dt)=="dt"] <- "firm"
a<-c(20100131,20100228,20100331,20100430,20100531,20100630,20100731,20100831,20100930,20101031,20101130,20101231,20110131,20110228,20110331,20110430,20110531,20110630,20110731,20110831,20110930,20111031,20111130,20111231)
dt$time<-rep(a,3)
dt<-dt%>% group_by(firm)%>%
mutate(y=rnorm(24,10,5))
dt<-dt%>% group_by(firm)%>%
mutate(x=rnorm(24,5,2))
dt<-as.data.table(dt)
I tried this code:
# create rolling regression function
roll <- function(Z)
{
t = lm(formula=y~x, data = as.data.frame(Z), na.rm=T);
return(t$coef[2])
}
dt[,beta := rollapply(dt, width=12, roll, fill=NA, by.column=FALSE, align="right") , by=firm]
I am trying to create a column called "beta" that shows the coefficient of var x. So for each firm, the first value should kick in at the 12th observation.
It looks like the regression takes x and y from the 1st row for different groups, and the coefficients seem a bit off compared to the result I got from Excel.
The second method I tried is the dplyr version:
dt %>%
group_by(firm) %>%
mutate(dt,beta = rollapply(dt,12,function(x) coef(lm(y~x,data=as.data.frame(x)))[2],by.column= FALSE, fill = NA, align = "right"))
It gives me the same issue: each group has the same number. It looks like for each firm, the regression takes y and x from the 1st row.
Any thoughts? Thank you so much.
Here is a solution that uses the rollRegres and data.table packages. I have also added a modified version of the OP's solution which works (see eddi's comment) and used an example with 2 million observations, as the OP mentions.
#####
# setup data
library(rollRegres)
library(data.table)
library(dplyr)
set.seed(33700919)
n_firms <- 83334 # yields roughly the 2M rows the OP mentions
dt <- rep(1:n_firms, each = 24)
dt <- data.frame(firm = dt)
a <-c(20100131,20100228,20100331,20100430,20100531,20100630,20100731,20100831,20100930,20101031,20101130,20101231,20110131,20110228,20110331,20110430,20110531,20110630,20110731,20110831,20110930,20111031,20111130,20111231)
dt$time <- rep(a, n_firms)
dt <- dt %>% group_by(firm) %>% mutate(y=rnorm(24,10,5))
dt <- dt %>% group_by(firm) %>% mutate(x=rnorm(24,5,2))
dt <- as.data.table(dt)
nrow(dt) # roughly the 2M rows that the OP mentions
#R [1] 2000016
#####
# fit models
setkey(dt, firm, time) # make sure data is sorted correctly
start_time <- Sys.time() # to show computation time
dt[
, beta :=
roll_regres.fit(x = cbind(1, .SD[["x"]]), y = .SD[["y"]],
width = 12L)$coefs[, 2],
by = firm]
Sys.time() - start_time
#R Time difference of 6.526595 secs
# gives the same as OP's solution with minor corrections
library(zoo)
start_time <- Sys.time()
roll <- function(Z)
lm.fit(x = cbind(1, Z[, "x"]), y = Z[, "y"])$coef[2]
dt[
, beta_zoo :=
rollapply(.SD, width=12, roll, fill=NA, by.column=FALSE, align="right"),
by=firm]
Sys.time() - start_time # much slower
#R Time difference of 1.87341 mins
# gives the same
all.equal(dt$beta, dt$beta_zoo)
#R [1] TRUE
Maybe you can try to change the first argument in rollapply: replace dt with just the needed columns, dt[, c("y","x")]. See if it works.
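A sketch of what that suggestion could look like inside the grouped assignment; note that it is .SD (the per-group subset), not dt, that makes the window respect by = firm, as in the corrected code above:
dt[, beta := rollapply(.SD[, c("y", "x")], width = 12, roll,
                       fill = NA, by.column = FALSE, align = "right"),
   by = firm]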
