I often have a need to mutate a data frame through the additional of several columns at once using a custom function, preferably using parallelization. Below are the ways I already know how to do this.
Setup
library(dplyr)
library(plyr)
library(purrr)
library(doMC)
registerDoMC(2)
df <- data.frame(x = rnorm(10), y = rnorm(10), z = rnorm(10))
Suppose that I want two new columns, foocol = x + y and barcol = (x + y) * 100, but that these are actually complex calculations done in a custom function.
Method 1: Add columns separately using rowwise and mutate
foo <- function(x, y) return(x + y)
bar <- function(x, y) return((x + y) * 100)
df_out1 <- df %>% rowwise() %>% mutate(foocol = foo(x, y), barcol = bar(x, y))
This is not a good solution since it requires two function calls for each row and two "expensive" calculations of x + y. It's also not parallelized.
Method 2: Trick ddply into rowwise operation
df2 <- df
df2$id <- 1:nrow(df2)
df_out2 <- ddply(df2, .(id), function(r) {
foocol <- r$x + r$y
barcol <- foocol * 100
return(cbind(r, foocol, barcol))
}, .parallel = T)
Here I trick ddply into calling a function on each row by splitting on a unique id column I just created. It's clunky, though, and requires maintaining a useless column.
Method 3: splat
foobar <- function(x, y, ...) {
foocol <- x + y
barcol <- foocol * 100
return(data.frame(x, y, ..., foocol, barcol))
}
df_out3 <- splat(foobar)(df)
I like this solution since you can reference the columns of df in the custom function (which can be anonymous if desired) without array comprehension. However, this method isn't parallelized.
Method 4: by_row
df_out4 <- df %>% by_row(function(r) {
foocol <- r$x + r$y
barcol <- foocol * 100
return(data.frame(foocol = foocol, barcol = barcol))
}, .collate = "cols")
The by_row function from purrr eliminates the need for the unique id column, but this operation isn't parallelized.
Method 5: pmap_df
df_out5 <- pmap_df(df, foobar)
# or equivalently...
df_out5 <- df %>% pmap_df(foobar)
This is the best option I've found. The pmap family of functions also accept anonymous functions to apply to the arguments. I believe pmap_df converts df to a list and back, though, so maybe there is a performance hit.
It's also a bit annoying that I need to reference all the columns I plan on using for calculation in the function definition function(x, y, ...) instead of just function(r) for the row object.
Am I missing any good or better options? Are there any concerns with the methods I described?
How about using data.table?
library(data.table)
foo <- function(x, y) return(x + y)
bar <- function(x, y) return((x + y) * 100)
dt <- as.data.table(df)
dt[, foocol:=foo(x,y)]
dt[, barcol:=bar(x,y)]
The data.table library is quite fast and has at least some some potential for parallelization.
Related
I have this data frame in R:
x <- seq(1, 10,0.1)
y <- seq(1, 10,0.1)
data_frame <- expand.grid(x,y)
I also have this function:
some_function <- function(x,y) { return(x+y) }
Basically, I want to create a new column in the data frame based on "some_function". I thought I could do this with the "lapply" function in R:
data_frame$new_column <-lapply(c(data_frame$x, data_frame$y),some_function)
This does not work:
Error in `$<-.data.frame`(`*tmp*`, f, value = list()) :
replacement has 0 rows, data has 8281
I know how to do this in a more "clunky and traditional" way:
data_frame$new_column = x + y
But I would like to know how to do this using "lapply" - in the future, I will have much more complicated and longer functions that will be a pain to write out like I did above. Can someone show me how to do this using "lapply"?
Thank you!
When working within a data.frame you could use apply instead of lapply:
x <- seq(1, 10,0.1)
y <- seq(1, 10,0.1)
data_frame <- expand.grid(x,y)
head(data_frame)
some_function <- function(x,y) { return(x+y) }
data_frame$new_column <- apply(data_frame, 1, \(x) some_function(x["Var1"], x["Var2"]))
head(data_frame)
To apply a function to rows set MAR = 1, to apply a function to columns set MAR = 2.
lapply, as the name suggests, is a list-apply. As a data.frame is a list of columns you can use it to compute over columns but within rectangular data, apply is often the easiest.
If some_function is written for that specific purpose, it can be written to accept a single row of the data.frame as in
x <- seq(1, 10,0.1)
y <- seq(1, 10,0.1)
data_frame <- expand.grid(x,y)
head(data_frame)
some_function <- function(row) { return(row[1]+row[2]) }
data_frame$yet_another <- apply(data_frame, 1, some_function)
head(data_frame)
Final comment: Often functions written for only a pair of values come out as perfectly vectorized. Probably the best way to call some_function is without any function of the apply-familiy as in
some_function <- function(x,y) { return(x + y) }
data_frame$last_one <- some_function(data_frame$Var1, data_frame$Var2)
I want to apply rolling on the function that requires 2 vector arguments. Here is the exmample (that doesn't work) using data.table:
library(data.table)
df <- as.data.table(cbind.data.frame(x=1:100, y=101:200))
my_sum <- function(x, y) {
x <- log(x)
y <- x * y
return(x + y)
}
roll_df <- frollapply(df, 10, function(x, y) {
my_sum(x, y)})
It doesn't recognize y column. Ofc, the solution can be using xts or some other package.
EDIT:
This is the real function I want to apply:
library(dpseg)
dpseg_roll <- function(time, price) {
p <- estimateP(x=time, y=price, plot=FALSE)
segs <- dpseg(time, price, jumps=jumps, P=p, type=type, store.matrix=TRUE)
slope_last <- segs$segments$slope[length(segs$segments$slope)]
return(slope_last)
}
With runner you can apply any function in rolling window. Running window can be created also on a rows of data.frame inserted to x argument. Let's focus on simpler function my_sum. Argument f in runner can accept only one object (data in this case). I encourage to put browser() to the function to debug row-by-row before you apply some fancy model on the subset (some algorithms requires some minimal number of observations).
my_sum <- function(data) {
# browser()
x <- log(data$x)
y <- x * data$y
tail(x + y, 1) # return only one value
}
my_sum should return only one value, because runner computes for each row - if my_sum returns vector, you would get a list.
Because runner is an independent function you need to pass data.table object to x. Best way to do this is to use x = .SD (see here why)
df[,
new_col := runner(
x = .SD,
f = my_sum,
k = 10
)]
I have no idea what you are going to do with frollapply (mean or sum or something else?).
Assuming you are about to use rolling sum, here might be one example. I rewrote your function my_sum such that it applies to df directly.
my_sum <- function(...) {
v <- c(...)
x <- log(v[[1]])
y <- Reduce(`*`,v)
return(x + y)
}
roll_df <- frollapply(
my_sum(df),
10,
FUN = sum)
rollapply in zoo passes a zoo object to the function to be applied if coredata=FALSE is used. The zoo object is made up of a time and a value part so we can use the following if the x value represents ascending values (which I gather it does). Note that my_sum in the question returns a 10 element result if the two arguments are length 10 so out shown below is a 100 x 10 zoo object with the first 9 rows filled with NAs.
If you don't want the NAs omit fill=NA or if you want to apply the function to partial inputs at the beginning instead of fill=NA use partial=TRUE. If you only want one of the 10 elements, such as the last one, then use function(x) my_sum(time(x), coredata(x))[10] in place of the function shown or just use out[, 10].
fortify.zoo(out) can be used to turn a zoo object out to a data frame if you need the result in that form or use as.data.frame(out) if you want to drop the times. as.data.table(out) also works in a similar manner.
library(zoo)
z <- read.zoo(df) # df$x becomes the time part and df$y the value part
out <- rollapplyr(z, 10, function(u) my_sum(time(u), coredata(u)),
coredata = FALSE, fill = NA)
dim(out)
## [1] 100 10
Note that in dpseg_roll that jumps and type are not defined.
reprod:
df1 <- data.frame(X = c(0:9), Y = c(10:19))
df2 <- data.frame(X = c(0:9), Y = c(10:19))
df3 <- data.frame(X = c(0:9), Y = c(10:19))
list_of_df <- list(A = df1, B = df2, C = df3)
list_of_df
I'm trying to apply the rollmean function from zoo to every 'Y' column in this list of dataframes.
I've tried lapply with no success, It seems no matter which way i spin it, there is no way to get around specifying the dataframe you want to apply to at some point.
This does one of the dataframes
roll_mean <- rollmean(list_of_df$A, 2)
roll_mean
obviously this doesn't work:
roll_mean1 <- rollmean(list_of_df, 2)
roll_mean1
I also tried this:
subset(may not be necessary)
Sub1 <- lapply(list_of_df, "[", 2)
roll_mean1 <- rollmean(Sub1, 2)
roll_mean1
there doesn't seem to be a way to do it without having to
specify the particular dataframe in the rollmean function
lapply(list_of_df), function(x) rollmean(list_of_df, 2))
for loop? also no success
For (i in list_of_df) {roll_mean1 <- rollmean(Sub1, 2)
Exp
}
Stating the obvious but I'm very new to coding in general and would appreciate some pointers.
It has occurred to me that even if it did work, the column that has been averaged would be one value longer than the rest of the dataframe; how would I get around that?
The question at one point says that it wants to perform the rollmean only on Y and at another point says that this works roll_mean <- rollmean(list_of_df$A, 2) but that does all columns.
1) Assuming that you want to apply rollmean to all columns:
Use lapply like this:
lapply(list_of_df, rollmean, 2)
This also works:
for(i in seq_along(list_of_df)) list_of_df[[i]] <- rollmean(list_of_df[[i]], 2)
2) If you only want to apply it to the Y column:
lapply(list_of_df, transform, Y = rollmean(Y, 2, fill = NA))
or
for(i in seq_along(list_of_df)) {
list_of_df[[i]]$Y <- rollmean(list_of_df[[i]]$Y, 2, fill = NA)
}
I would like to compute in R something of the following kind:
It is important that the summand could be any function f(y,x).
My approach so far is with nested for loops:
n <- 5
fun <- function(y,x){y^2 + sqrt(y*x)} # might be any function of y and x
sum_x <- c()
for(x in 1:n){
sum_y <- c()
for(y in 0:x){
sum_y[y+1] <- fun(y,x)
}
sum_x[x] <- sum(sum_y)
}
sum(sum_x) # 147.6317
However, I do not like this approach. It's pretty ugly and becomes very inconvenient if lower and upper bound need to be more flexible. I thought about using expand.grid and then applying fun to it using mapply, but couldn't figure out how to express the nested structure of the sums. Any suggestions how to do this?
You can perform the outer product based on a function. This outer product will look at all combinations of two input variables and place the result in a matrix; it takes the following form:
outer(<rows>, <cols>, FUN)
In your case specifically, the following suffices:
n <- 5
fun <- function(x, y) {ifelse (y > x, 0, y^2 + sqrt(x * y))}
outer(1:n, 1:n, FUN = fun) %>% sum() # 147.6317
Since y ranges from 0 and y occurs in both terms, it defaults to 0 (by chance). Regardless, it's necessary in this case to include some form of indexing in the function definition since the nested summing of y is dependent on x.
You could use nested sapply which will apply fun for only required terms and then take sum of it.
sum(unlist(sapply(seq_len(n), function(x) sapply(0:x, fun, x))))
#[1] 147.6317
We can also use outer with use of rowCumsums from matrixStats
library(matrixStats)
sum(outer(seq_len(n), seq_len(n), FUN = fun) * rowCumsums(diag(n)))
#[1] 147.6317
Or with crossing from tidyr
library(tidyr)
library(dplyr)
crossing(x = seq_len(n), y = seq_len(n)) %>%
filter(y <= x) %>%
transmute(out = fun(y, x)) %>%
summarise(out = sum(out)) %>%
pull(out)
#[1] 147.6317
I'm doing some programming using dplyr, and am curious about how to pass an expression as (specifically a MoreArgs) argument to mapply?
Consider a simple function F that subsets a data.frame based on some ids and a time_range, then outputs a summary statistic based on some other column x.
require(dplyr)
F <- function(ids, time_range, df, date_column, x) {
date_column <- enquo(date_column)
x <- enquo(x)
df %>%
filter(person_id %chin% ids) %>%
filter(time_range[1] <= (!!date_column) & (!!date_column) <= time_range[2]) %>%
summarise(newvar = sum(!!x))
}
We can make up some example data to which we can apply our function F.
person_ids <- lapply(1:2, function(i) sample(letters, size = 10))
time_ranges <- lapply(list(c("2014-01-01", "2014-12-31"),
c("2015-01-01", "2015-12-31")), as.Date)
require(data.table)
dt <- CJ(person_id = letters,
date_col = seq.Date(from = as.Date('2014-01-01'), to = as.Date('2015-12-31'), by = '1 day'))
dt[, z := rnorm(nrow(dt))] # The variable we will later sum over, i.e. apply F to.
We can successfully apply our function to each of our inputs.
F(person_ids[[1]], time_ranges[[1]], dt, date_col, z)
F(person_ids[[2]], time_ranges[[2]], dt, date_col, z)
And so if I wanted, I could write a simple for-loop to solve my problem. But if we try to apply syntactic sugar and wrap everything within mapply, we get an error.
mapply(F, ids = person_ids, time_range = time_ranges, MoreArgs = list(df = dt, date_column = date_col, x = z))
# Error in mapply... object 'date_col' not found
In mapply, MoreArgs is provided as a list, but R tries to evaluate the list elements, causing the error. As suggested by #Gregor, you can quote those MoreArgs that we don't want to evaluate immediately, preventing the error and allowing the function to proceed. This can be done with base quote or dplyr quo:
mapply(F, person_ids, time_ranges, MoreArgs = list(dt, quote(date_col), quote(z)))
mapply(F, person_ids, time_ranges, MoreArgs = list(dt, quo(date_col), quo(z)))
Another option is to use map2 from the purrr package, which is the tidyverse equivalent of mapply with two input vectors. tidyverse functions are set up to work with non-standard evaluation, which avoids the error you're getting with mapply without the need for quoting the arguments:
library(purrr)
map2(person_ids, time_ranges, F, dt, date_col, z)
[[1]]
newvar
1 40.23419
[[2]]
newvar
1 71.42327
More generally, you could use pmap, which iterates in parallel over any number of input vectors:
pmap(list(person_ids, time_ranges), F, dt, date_col, z)