Rolling average for multiple variables in R using the data.table package

I would like to get a rolling average for each of my numeric variables. Using the data.table package, I know how to compute it for a single variable. How should I revise the code so that it processes multiple variables at once, rather than changing the variable name and repeating the procedure several times? Thanks.
Suppose I have other numeric variables named as "V2", "V3", and "V4".
require(data.table)
require(zoo)  # rollmean() comes from the zoo package
setDT(data)
setkey(data,Receptor,date)
data[ , `:=` ('RollConc' = rollmean(AvgConc, 48, align="left", na.pad=TRUE)) , by=Receptor]
A copy of my sample data can be found at:
https://drive.google.com/file/d/0B86_a8ltyoL3OE9KTUstYmRRbFk/view?usp=sharing
I would like to get 5-hour rolling means for "AvgConc","TotDep","DryDep", and "WetDep" by each receptor.

From your description, you want something like this, which is similar to an example in one of the data.table vignettes:
library(data.table)
set.seed(42)
DT <- data.table(x = rnorm(10), y = rlnorm(10), z = runif(10), g = c("a", "b"), key = "g")
library(zoo)
# fill = NA replaces na.pad = TRUE, which zoo documents as deprecated
DT[, paste0("ravg_", c("x", "y")) := lapply(.SD, rollmean, k = 3, fill = NA),
   by = g, .SDcols = c("x", "y")]

data.table now ships its own frollmean function, which can be used for this.
library(data.table)
xy <- c("x", "y")
DT[, (xy):= lapply(.SD, frollmean, n = 3, fill = NA, align="center"),
by = g, .SDcols = xy]
Here, I am replacing the x and y columns with their rolling averages.
# Data
set.seed(42)
DT <- data.table(x = rnorm(10), y = rlnorm(10), z = runif(10),
g = c("a", "b"), key = "g")
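Applied to the column names from the question, the same pattern might look like the sketch below. The data here is a made-up stand-in (the real file is not reproduced), and the window of 5 with left alignment is an assumption based on the question's wording.

```r
library(data.table)

# Toy stand-in for the asker's data; all values are fabricated for illustration
set.seed(1)
data <- data.table(
  Receptor = rep(c("R1", "R2"), each = 8),
  date     = rep(seq_len(8), times = 2),
  AvgConc  = runif(16),
  TotDep   = runif(16),
  DryDep   = runif(16),
  WetDep   = runif(16)
)
setkey(data, Receptor, date)

# Columns to average; the "Roll" prefix for the new columns is an arbitrary choice
vars <- c("AvgConc", "TotDep", "DryDep", "WetDep")

# One left-aligned 5-row rolling mean per variable, computed within each receptor
data[, paste0("Roll", vars) := lapply(.SD, frollmean, n = 5, align = "left"),
     by = Receptor, .SDcols = vars]
```

With left alignment, the last four rows of each receptor have incomplete windows and are filled with NA.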

Related

R fast cosine distance between consecutive rows of a data.table

How can I efficiently calculate distances between (almost) consecutive rows of a large-ish (~4m rows) of a data.table? I've outlined my current approach, but it is very slow. My actual data has up to a few hundred columns. I need to calculate lags and leads for future use, so I create these and use them to calculate distances.
library(data.table)
library(proxy)
set_shift_col <- function(df, shift_dir, shift_num, data_cols, byvars = NULL){
df[, (paste0(data_cols, "_", shift_dir, shift_num)) := shift(.SD, shift_num, fill = NA, type = shift_dir), byvars, .SDcols = data_cols]
}
set_shift_dist <- function(dt, shift_dir, shift_num, data_cols){
stopifnot(shift_dir %in% c("lag", "lead"))
shift_str <- paste0(shift_dir, shift_num)
dt[, (paste0("dist", "_", shift_str)) := as.numeric(
proxy::dist(
rbindlist(list(
.SD[,data_cols, with=FALSE],
.SD[, paste0(data_cols, "_" , shift_str), with=FALSE]
), use.names = FALSE),
method = "cosine")
), 1:nrow(dt)]
}
n <- 10000
test_data <- data.table(a = rnorm(n), b = rnorm(n), c = rnorm(n), d = rnorm(n))
cols <- c("a", "b", "c", "d")
set_shift_col(test_data, "lag", 1, cols)
set_shift_col(test_data, "lag", 2, cols)
set_shift_col(test_data, "lead", 1, cols)
set_shift_col(test_data, "lead", 2, cols)
set_shift_dist(test_data, "lag", 1, cols)
I'm sure this is a very inefficient approach. Any suggestions would be appreciated!
You aren't using the vectorisation efficiencies in the proxy::dist function: rather than calling it once for each row, you can get all the distances you need from a single call.
Try this replacement function and compare the speed:
set_shift_dist2 <- function(dt, shift_dir, shift_num, data_cols){
stopifnot(shift_dir %in% c("lag", "lead"))
shift_str <- paste0(shift_dir, shift_num)
dt[, (paste0("dist2", "_", shift_str)) := proxy::dist(
.SD[,data_cols, with=FALSE],
.SD[, paste0(data_cols, "_" , shift_str), with=FALSE],
method = "cosine",
pairwise = TRUE
)]
}
You could also do it in one go, without storing copies of the data in the table:
test_data[, dist_lag1 := proxy::dist(
.SD,
as.data.table(shift(.SD, 1)),
pairwise = TRUE,
method = 'cosine'
), .SDcols = c('a', 'b', 'c', 'd')]
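For reference, the same row-wise quantity can be computed with plain matrix arithmetic and no proxy dependency. This is a minimal sketch, assuming cosine distance is defined as 1 minus cosine similarity; `cosine_dist_lag` and `toy` are hypothetical names introduced here.

```r
library(data.table)

# Hypothetical helper: cosine distance between each row and the row n steps before it
cosine_dist_lag <- function(dt, cols, n = 1L) {
  cur  <- as.matrix(dt[, cols, with = FALSE])            # current rows
  prev <- as.matrix(dt[, shift(.SD, n), .SDcols = cols]) # rows lagged by n
  # 1 - (row-wise dot product) / (product of row norms); the first n rows are NA
  1 - rowSums(cur * prev) / sqrt(rowSums(cur^2) * rowSums(prev^2))
}

set.seed(1)
toy  <- data.table(a = rnorm(6), b = rnorm(6), c = rnorm(6))
cols <- c("a", "b", "c")
d <- cosine_dist_lag(toy, cols)
toy[, dist_lag1 := d]
```

Everything is done in a few vectorised passes over the matrix, so the cost grows linearly with the number of rows.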

Grouping to form more than one comma-separated columns in data.table

Problem: I basically want to group data based on the data.table syntax and in parallel create two or more columns which contain comma-separated values (as in the example below).
Approach: I thought about an lapply where I can provide a list of columns which I want to comma-separate; however this did not turn out as expected.
Any suggestions?
EDIT
I am somehow looking for an approach where I only have to provide a list/vector of columns and then apply the function on this list (similar to the not-working lapply approach)
library(data.table)
dt <- data.table(
x = c(1, 1, 1, 3, 3, 2),
y = c("AA", "BB", "CC", "BB", "EE", "AA"),
z = c("H", "A", "C", "Z", "F", "G")
)
## Attempts
dt[, paste0(y, collapse = ","), by = .(x)]
dt[, lapply(c("y", "z"), paste0, collapse = ","), by = x]
## Desired Output
   x        y       z
1: 1 AA,BB,CC H, A, C
2: 3    BB,EE    Z, F
3: 2       AA       G
library(data.table)
dt[, lapply(.SD, toString), by = x, .SDcols = names(dt)[sapply(dt, is.character)]]
dt_sum <- dt[,.(yy=toString(unique(y)),zz=toString(unique(z))),by=c("x")]
dt_sum
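The lapply attempt in the question iterates over the strings "y" and "z" themselves rather than over the columns. A sketch of the vector-driven variant the EDIT asks for, passing the column names through .SDcols (`cols` and `res` are names chosen here for illustration):

```r
library(data.table)

dt <- data.table(
  x = c(1, 1, 1, 3, 3, 2),
  y = c("AA", "BB", "CC", "BB", "EE", "AA"),
  z = c("H", "A", "C", "Z", "F", "G")
)

# Only this vector needs to change to collapse a different set of columns
cols <- c("y", "z")

# .SD holds just the columns named in .SDcols, so lapply() sees the data, not the names
res <- dt[, lapply(.SD, paste, collapse = ","), by = x, .SDcols = cols]
res
```

Groups appear in order of first occurrence in x (1, 3, 2), as in the desired output.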

Utilizing roll functions with data.table

I'm having problems specifically applying functions from the roll package using data.table. I'm attempting to calculate rolling metrics on column DT$obs for each group DT$group. I'm able to calculate rolling metrics using the zoo package, but I'd like to use some of the additional arguments in roll package functions.
Demo of the error is below.
require(data.table)
require(zoo)
require(roll)
# Fabricated Data:
DT <- data.table(group = rep(c("A", "B"), each = 20), obs = runif(40, min = 0, max = 100))
# Calculate a rolling sum (this is working properly)
DT[, RollingSum := lapply(.SD, function(x) zoo::rollsumr(x, k = 5, fill = NA)), by = "group", .SDcols = "obs"]
# Attempt to calculate a rolling z-score (this throws me an error)
DT[, RollingZScore := lapply(.SD, function(x) roll::roll_scale(as.matrix(x), width = 10, min_obs = 5)), by = "group", .SDcols = "obs"]
I can't figure out what's different about the zoo function and the roll function. They each return numeric vectors. Any guidance appreciated.
As @Frank describes, the problem is that the result of roll_scale (and thus each element of the lapply output) is a matrix. You can either use sapply instead of lapply, or put as.vector in your function definition.
DT[, RollingZScore := sapply(.SD,
function(x) roll::roll_scale(as.matrix(x), width = 10, min_obs = 5)),
by = "group", .SDcols = "obs"]
or
DT[, RollingZScore := lapply(.SD,
function(x) as.vector(roll::roll_scale(as.matrix(x), width = 10, min_obs = 5))),
by = "group", .SDcols = "obs"]
This can be done with rollapplyr by simply defining a function that returns NA if the input has fewer than 5 elements:
Scale <- function(x) if (length(x) < 5) NA else tail(scale(x), 1)
DT[, rollingScore := rollapplyr(obs, 10, Scale, partial = TRUE), by = "group"]
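If neither roll nor zoo is required, a rolling z-score can also be sketched with data.table's own frollmean and frollapply. This assumes the z-score of interest is the latest observation centred and scaled by its window's mean and standard deviation; unlike min_obs = 5 above, it only returns values for full windows of 10.

```r
library(data.table)

set.seed(1)
DT <- data.table(group = rep(c("A", "B"), each = 20),
                 obs = runif(40, min = 0, max = 100))

# (latest value - window mean) / window sd, right-aligned window of 10 rows per group;
# both rolling functions return plain numeric vectors, so := accepts the result directly
DT[, RollingZ := (obs - frollmean(obs, 10)) / frollapply(obs, 10, sd), by = group]
```

The first nine rows of each group are NA because their windows are incomplete.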

Subset data.table columns independently

I'm starting with the table dt below and trying to subset its columns by the list keys:
library(data.table)
set.seed(123)
randomchar <- function(n, w){
chararray <- replicate(w, sample(c(letters, LETTERS), n, replace = TRUE))
apply(chararray, 1, paste0, collapse = "")
}
dt <- data.table(x = randomchar(1000, 3),
y = randomchar(1000, 3),
z = randomchar(1000, 3),
key = c("x", "y", "z"))
keys <- with(dt, list(x = sample(x, 501),
y = sample(y, 500),
z = sample(z, 721)))
I can get the result I want by using a loop:
desired <- copy(dt)
for(i in seq_along(keys)){
keyname <- names(keys)[i]
desired <- desired[get(keyname) %in% keys[[i]]]
}
desired
The question is - Is there a more data.table idiomatic way to do this subset?
I tried using CJ: dt[CJ(keys)], but it takes a very long time.
What about building a mask and filtering dt on it:
dt[Reduce(`&`, Map(function(key, col) col %in% key, keys, dt)),]
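The Map call above pairs keys with the columns of dt by position, so it relies on both being in the same order. A variant that matches by name instead might look like this sketch; the small table and key list are made up for illustration.

```r
library(data.table)

# Toy data: keys is deliberately in a different order and covers only two columns
dt <- data.table(x = c("a", "b", "c", "d"),
                 y = c("p", "q", "r", "s"),
                 z = c("u", "v", "w", "t"))
keys <- list(z = c("u", "v", "w"),
             x = c("a", "b", "d"))

# One logical vector per named key, looked up by column name, then combined with &
mask <- Reduce(`&`, lapply(names(keys), function(nm) dt[[nm]] %in% keys[[nm]]))
res <- dt[mask]
res
```

Only rows whose x and z values both appear in the corresponding key vectors are kept; y is left unconstrained because it has no entry in keys.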

R ddply vector version

I am looking for a vector version of ddply.
I would like to do the following:
vector_ddply(frame1, frame2, ..., frameN, c("column1", "column2"), processingFunction);
Here all frames have both "column1" and "column2" and processingFunction takes N parameters.
Note that in my specific case it doesn't make sense to merge the N data frames into one.
The resulting frame would be made up of the union of all the keys of the N frames.
Is there a way to achieve this?
Thanks
Let's start with some sample data:
ll <- list(
f1 = data.frame( x = c("a", "b", "a", "b"), y = c(1,1,2,2), z = rnorm(4), p = 1:4 ),
f2 = data.frame( x = c("a", "b", "a", "b"), y = c(1,1,2,2), z = rnorm(4), q = 1:4 ),
f3 = data.frame( x = c("a", "b", "a", "b"), y = c(1,1,2,2), z = rnorm(4), r = 1:4 )
)
1. Solution: apply data.frame-wise
You want to ddply processingFunction on each data.frame individually, and combine the results to one resulting data.frame:
ldply( ll, ddply, .(x, y), summarise, z = processingFunction(z) )
2. Solution: apply on one rbinded data.frame
You want to apply processingFunction over all rows of the data.frames at once. In that case, just rbind all the data.frames together into one large data.frame. If this is not directly possible because the individual frames do not share all their columns, rbind on the subset of common columns:
commonCols <- Reduce( "intersect", lapply(ll, colnames) )
oneDf <- do.call( "rbind", lapply( ll, "[", commonCols ) )
ddply( oneDf, .(x,y), summarise, z = processingFunction(z) )
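For comparison, both variants can be sketched with data.table instead of plyr. Here `mean` is only a stand-in for processingFunction, which the question does not define, and `per_frame` and `pooled` are names chosen for illustration.

```r
library(data.table)

set.seed(1)
ll <- list(
  f1 = data.frame(x = c("a", "b", "a", "b"), y = c(1, 1, 2, 2), z = rnorm(4), p = 1:4),
  f2 = data.frame(x = c("a", "b", "a", "b"), y = c(1, 1, 2, 2), z = rnorm(4), q = 1:4),
  f3 = data.frame(x = c("a", "b", "a", "b"), y = c(1, 1, 2, 2), z = rnorm(4), r = 1:4)
)

# 1. Per frame: stack the frames, keep the list name as an id column, group by it too
per_frame <- rbindlist(ll, idcol = "frame", fill = TRUE)[
  , .(z = mean(z)), by = .(frame, x, y)]

# 2. Pooled: stack the frames and group by the key columns only
# fill = TRUE pads the columns that are not shared (p, q, r) with NA
pooled <- rbindlist(ll, fill = TRUE)[, .(z = mean(z)), by = .(x, y)]
```

The fill = TRUE argument takes the place of the manual intersection of common columns, since non-shared columns are simply ignored by the grouping step.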
