I would like to compute in R a double sum of the following kind:
sum(x = 1..n) sum(y = 0..x) f(y, x)
It is important that the summand can be any function f(y, x).
My approach so far is with nested for loops:
n <- 5
fun <- function(y,x){y^2 + sqrt(y*x)} # might be any function of y and x
sum_x <- c()
for(x in 1:n){
  sum_y <- c()
  for(y in 0:x){
    sum_y[y + 1] <- fun(y, x)
  }
  sum_x[x] <- sum(sum_y)
}
sum(sum_x) # 147.6317
However, I do not like this approach. It's pretty ugly and becomes very inconvenient if the lower and upper bounds need to be more flexible. I thought about using expand.grid and then applying fun to it with mapply, but couldn't figure out how to express the nested structure of the sums. Any suggestions on how to do this?
You can use outer, which evaluates a function on all combinations of two input vectors and places the results in a matrix. It takes the following form:
outer(<rows>, <cols>, FUN)
In your case specifically, the following suffices:
library(magrittr) # provides %>%

n <- 5
fun <- function(x, y) {ifelse(y > x, 0, y^2 + sqrt(x * y))}
outer(1:n, 1:n, FUN = fun) %>% sum() # 147.6317
Note that the inner sum starts at y = 0, while the grid above only covers y = 1, ..., n; dropping the y = 0 terms happens to be harmless here because f(0, x) = 0 for this particular function. Regardless, some form of indexing in the function definition (the ifelse on y > x) is necessary, since the upper bound of the inner sum over y depends on x.
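If you prefer to keep the y = 0 terms explicit rather than relying on that coincidence, you can widen the grid; a sketch, using the question's original summand (renamed fun0 here so the fun above is not clobbered):
fun0 <- function(y, x){y^2 + sqrt(y*x)} # the question's original summand
sum(outer(0:n, 1:n, function(y, x) ifelse(y > x, 0, fun0(y, x)))) # 147.6317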
You could use nested sapply calls, which apply fun only to the required terms, and then sum the result (this uses the question's original fun(y, x)):
sum(unlist(sapply(seq_len(n), function(x) sapply(0:x, fun, x))))
#[1] 147.6317
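If the lower and upper bounds need to be more flexible (the concern raised in the question), the same idea generalizes to a small helper. This is only a sketch: nested_sum and its inner argument (a function mapping each outer index to its vector of inner indices) are hypothetical names, not from any package.
# Generic double sum: f(y, x) summed over x in outer_idx and y in inner(x).
nested_sum <- function(f, outer_idx, inner) {
  sum(vapply(outer_idx,
             function(x) sum(vapply(inner(x), f, numeric(1), x = x)),
             numeric(1)))
}
fun <- function(y, x){y^2 + sqrt(y*x)} # the question's original summand
nested_sum(fun, 1:n, function(x) 0:x) # 147.6317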
We can also use outer together with rowCumsums from matrixStats, which builds a 0/1 mask that zeroes out the terms with y > x (again with the original fun(y, x)):
library(matrixStats)
sum(outer(seq_len(n), seq_len(n), FUN = fun) * rowCumsums(diag(n)))
#[1] 147.6317
Or with crossing from tidyr
library(tidyr)
library(dplyr)
crossing(x = seq_len(n), y = seq_len(n)) %>%
filter(y <= x) %>%
transmute(out = fun(y, x)) %>%
summarise(out = sum(out)) %>%
pull(out)
#[1] 147.6317
I want to apply a rolling computation with a function that requires 2 vector arguments. Here is an example (that doesn't work) using data.table:
library(data.table)
df <- data.table(x = 1:100, y = 101:200)
my_sum <- function(x, y) {
  x <- log(x)
  y <- x * y
  return(x + y)
}

roll_df <- frollapply(df, 10, function(x, y) {
  my_sum(x, y)
})
It doesn't recognize the y column. Of course, the solution could use xts or some other package.
EDIT:
This is the real function I want to apply:
library(dpseg)
dpseg_roll <- function(time, price) {
  p <- estimateP(x = time, y = price, plot = FALSE)
  segs <- dpseg(time, price, jumps = jumps, P = p, type = type, store.matrix = TRUE)
  slope_last <- segs$segments$slope[length(segs$segments$slope)]
  return(slope_last)
}
With runner you can apply any function in a rolling window. A running window can also be created over the rows of a data.frame passed to the x argument. Let's focus on the simpler function my_sum. The f argument of runner accepts only one object (the windowed data in this case). I encourage you to put browser() inside the function to debug it row by row before you apply some fancy model to the subset (some algorithms require a minimal number of observations).
my_sum <- function(data) {
  # browser()
  x <- log(data$x)
  y <- x * data$y
  tail(x + y, 1) # return only one value
}
my_sum should return only one value, because runner computes one result per row - if my_sum returned a vector, you would get a list.
Because runner is an independent function, you need to pass the data.table object to x. The best way to do this is to use x = .SD (see here why).
df[, new_col := runner(
  x = .SD,
  f = my_sum,
  k = 10
)]
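As a sanity check (a sketch), the value runner places in row 10 should match calling my_sum on the first 10 rows by hand:
my_sum(df[1:10]) # should equal df$new_col[10]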
I have no idea what you are going to do with frollapply (a rolling mean, sum, or something else?). Assuming you are after a rolling sum, here is one example. I rewrote your function my_sum so that it applies to df directly.
my_sum <- function(...) {
  v <- c(...)
  x <- log(v[[1]])
  y <- Reduce(`*`, v)
  return(x + y)
}

roll_df <- frollapply(
  my_sum(df),
  10,
  FUN = sum
)
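Note that by default frollapply pads the first k - 1 positions with NA, so a quick look at the head shows where the complete windows begin:
head(roll_df, 12) # the first 9 entries are NA; entry 10 is the first full-window sum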
rollapply in zoo passes a zoo object to the function being applied if coredata = FALSE is used. A zoo object is made up of a time part and a value part, so we can use the following if the x column represents ascending values (which I gather it does). Note that my_sum in the question returns a 10-element result when its two arguments have length 10, so out shown below is a 100 x 10 zoo object whose first 9 rows are filled with NAs.
If you don't want the NAs, omit fill = NA; if you want to apply the function to partial windows at the beginning instead, use partial = TRUE in place of fill = NA. If you only want one of the 10 elements, such as the last one, use function(x) my_sum(time(x), coredata(x))[10] in place of the function shown, or just use out[, 10].
fortify.zoo(out) can be used to turn the zoo object out into a data frame if you need the result in that form, or use as.data.frame(out) if you want to drop the times. as.data.table(out) works in a similar manner.
library(zoo)

z <- read.zoo(df) # df$x becomes the time part and df$y the value part
out <- rollapplyr(z, 10, function(u) my_sum(time(u), coredata(u)),
                  coredata = FALSE, fill = NA)
dim(out)
## [1] 100 10
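For instance, if only the last element of each window is of interest (the out[, 10] case mentioned above), a short follow-up might look like this:
last10 <- out[, 10] # last element of each 10-wide window
head(fortify.zoo(last10)) # back to a data frame, with the x values as the index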
Note that in dpseg_roll, jumps and type are not defined.
I'm doing some programming using dplyr, and am curious how to pass an expression as an argument (specifically, inside MoreArgs) to mapply.
Consider a simple function F that subsets a data.frame based on some ids and a time_range, then outputs a summary statistic based on some other column x.
require(dplyr)
F <- function(ids, time_range, df, date_column, x) {
  date_column <- enquo(date_column)
  x <- enquo(x)
  df %>%
    filter(person_id %chin% ids) %>% # %chin% is data.table's fast %in% for character vectors
    filter(time_range[1] <= (!!date_column) & (!!date_column) <= time_range[2]) %>%
    summarise(newvar = sum(!!x))
}
We can make up some example data to which we can apply our function F.
person_ids <- lapply(1:2, function(i) sample(letters, size = 10))
time_ranges <- lapply(list(c("2014-01-01", "2014-12-31"),
                           c("2015-01-01", "2015-12-31")), as.Date)

require(data.table)
dt <- CJ(person_id = letters,
         date_col = seq.Date(from = as.Date('2014-01-01'), to = as.Date('2015-12-31'), by = '1 day'))
dt[, z := rnorm(nrow(dt))] # The variable we will later sum over, i.e. apply F to.
We can successfully apply our function to each of our inputs.
F(person_ids[[1]], time_ranges[[1]], dt, date_col, z)
F(person_ids[[2]], time_ranges[[2]], dt, date_col, z)
And so if I wanted, I could write a simple for-loop to solve my problem. But if we try to apply syntactic sugar and wrap everything within mapply, we get an error.
mapply(F, ids = person_ids, time_range = time_ranges, MoreArgs = list(df = dt, date_column = date_col, x = z))
# Error in mapply... object 'date_col' not found
In mapply, MoreArgs is provided as a list, and R tries to evaluate the list's elements, causing the error. As suggested by @Gregor, you can quote the MoreArgs elements that you don't want evaluated immediately, preventing the error and letting the function proceed. This can be done with base quote or dplyr's quo:
mapply(F, person_ids, time_ranges, MoreArgs = list(dt, quote(date_col), quote(z)))
mapply(F, person_ids, time_ranges, MoreArgs = list(dt, quo(date_col), quo(z)))
Another option is to use map2 from the purrr package, which is the tidyverse equivalent of mapply with two input vectors. tidyverse functions are set up to work with non-standard evaluation, which avoids the error you're getting with mapply without the need for quoting the arguments:
library(purrr)
map2(person_ids, time_ranges, F, dt, date_col, z)
[[1]]
    newvar
1 40.23419

[[2]]
    newvar
1 71.42327
More generally, you could use pmap, which iterates in parallel over any number of input vectors:
pmap(list(person_ids, time_ranges), F, dt, date_col, z)
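A small readability tweak: naming the elements of the input list lets pmap match them to F's parameters by name rather than by position:
pmap(list(ids = person_ids, time_range = time_ranges), F, dt, date_col, z)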
I often need to mutate a data frame by adding several columns at once using a custom function, preferably with parallelization. Below are the ways I already know how to do this.
Setup
library(plyr) # load plyr before dplyr so it doesn't mask dplyr's verbs
library(dplyr)
library(purrr)
library(doMC)

registerDoMC(2)
df <- data.frame(x = rnorm(10), y = rnorm(10), z = rnorm(10))
Suppose that I want two new columns, foocol = x + y and barcol = (x + y) * 100, but that these are actually complex calculations done in a custom function.
Method 1: Add columns separately using rowwise and mutate
foo <- function(x, y) return(x + y)
bar <- function(x, y) return((x + y) * 100)
df_out1 <- df %>% rowwise() %>% mutate(foocol = foo(x, y), barcol = bar(x, y))
This is not a good solution since it requires two function calls for each row and two "expensive" calculations of x + y. It's also not parallelized.
Method 2: Trick ddply into rowwise operation
df2 <- df
df2$id <- 1:nrow(df2)
df_out2 <- ddply(df2, .(id), function(r) {
  foocol <- r$x + r$y
  barcol <- foocol * 100
  return(cbind(r, foocol, barcol))
}, .parallel = TRUE)
Here I trick ddply into calling a function on each row by splitting on a unique id column I just created. It's clunky, though, and requires maintaining a useless column.
Method 3: splat
foobar <- function(x, y, ...) {
  foocol <- x + y
  barcol <- foocol * 100
  return(data.frame(x, y, ..., foocol, barcol))
}
df_out3 <- splat(foobar)(df)
I like this solution since you can reference the columns of df in the custom function (which can be anonymous if desired) without having to index into a row object. However, this method isn't parallelized.
Method 4: by_row
df_out4 <- df %>% by_row(function(r) {
  foocol <- r$x + r$y
  barcol <- foocol * 100
  return(data.frame(foocol = foocol, barcol = barcol))
}, .collate = "cols")
The by_row function (from purrr; in more recent versions it lives in the purrrlyr package) eliminates the need for the unique id column, but this operation isn't parallelized either.
Method 5: pmap_df
df_out5 <- pmap_df(df, foobar)
# or equivalently...
df_out5 <- df %>% pmap_df(foobar)
This is the best option I've found. The pmap family of functions also accepts anonymous functions to apply to the arguments. I believe pmap_df converts df to a list and back, though, so there may be a performance hit.
It's also a bit annoying that I need to reference all the columns I plan on using for calculation in the function definition function(x, y, ...) instead of just function(r) for the row object.
Am I missing any good or better options? Are there any concerns with the methods I described?
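One option not listed above: the furrr package provides parallel drop-in versions of the purrr mappers, so Method 5 can be parallelized with minimal changes. A sketch, assuming furrr is installed (future_pmap_dfr is its parallel analogue of pmap_df):
library(furrr)
plan(multisession, workers = 2) # set up a parallel backend

df_out6 <- future_pmap_dfr(df, foobar)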
How about using data.table?
library(data.table)
foo <- function(x, y) return(x + y)
bar <- function(x, y) return((x + y) * 100)
dt <- as.data.table(df)
dt[, foocol := foo(x, y)]
dt[, barcol := bar(x, y)]
The data.table library is quite fast and has at least some potential for parallelization.
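If the shared intermediate result is expensive (the concern in the question), data.table can also assign both columns in one pass, so the sum is computed only once; a sketch:
# Assign both columns in a single `:=` call, reusing the intermediate value.
dt[, c("foocol", "barcol") := {
  s <- x + y # the "expensive" shared computation, done once
  list(s, s * 100)
}]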
Neither this post nor this post apply to my case.
Assume:
set.seed(42)
x <- rep(c("A", "B", "C"), c(3, 4, 1))
y <- rep(c("V", "W"), c(5, 3))
z <- rnorm(8, -2, 1)
df <- data.frame(x, y, z)
boxplot(z ~ x + y, df)
I want my plot to include only groups with more than, say, one element. This means that I want my plot to show only A.V, B.V and B.W.
Furthermore, since my graph has about 70 groups, I don't want to do it by writing a list by hand.
Thanks
You can create a new column ('xy') using paste, create a logical index with ave for the 'xy' groups having more than one element, and then do the boxplot.
df$xy <- factor(paste(df$x, df$y, sep = '.'))
index <- with(df, ave(1:nrow(df), xy, FUN = length)) > 1
boxplot(z ~ xy, droplevels(df[index, ]))
Or using ggplot
library(dplyr)
library(tidyr)
library(ggplot2)
df %>%
  group_by(x, y) %>%
  filter(n() > 1) %>%
  unite(xy, x, y) %>%
  ggplot(., aes(xy, z)) +
  geom_boxplot()
You can check whether any bp$n are 0 and subset by that:
set.seed(42)
df <- data.frame(x = rep(c("A", "B", "C"), c(3, 4, 1)),
                 y = rep(c("V", "W"), c(5, 3)),
                 z = rnorm(8, -2, 1))
bp <- boxplot(z ~ x + y, df, plot = FALSE)
wh <- which(bp$n == 0)
bp[] <- lapply(bp, function(x) if (length(x)) {
  ## `bp` contains a list of boxplot statistics, some vectors and
  ## some matrices, which need to be indexed accordingly
  if (!is.null(nrow(x))) x[, -wh] else x[-wh]
  ## some elements of `bp` will not be present depending on how you called
  ## `boxplot`; in that case, you need to leave them alone
  ## to keep the original structure so `bxp` can deal with it
} else x)
## call `bxp` on the subset of `bp`
bxp(bp)
Or you can use any value you like:
wh <- which(bp$n <= 1)
bp[] <- lapply(bp, function(x) if (length(x)) {
  if (!is.null(nrow(x))) x[, -wh] else x[-wh]
} else x)
bxp(bp)
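For the record, the surviving group names can also be inspected directly from a fresh set of statistics, which helps when there are ~70 groups as in the question:
bp_all <- boxplot(z ~ x + y, df, plot = FALSE)
bp_all$names[bp_all$n > 1] # "A.V" "B.V" "B.W" - the groups that pass the cutoff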
I usually like to use lapply() instead of a for loop:
lx <- split( x, x$hr) # with the next step being lapply( lx, function( x) ...)
But now each element of lx includes the column hr, which is inefficient because that information is already in names( lx).
So now I must do:
lx <- lapply( lx, function( X) select( X, -hr))
(An alternative is:
HR <- x$hr
x2 <- select( x, -hr)
lx <- split( x2, HR)
)
The whole point of lapply() over a for loop is to be concise and efficient, so these extra lines bother me. It seems like such a common use case, and my experience has shown that R usually has something more efficient, or I'm missing something.
Can this be achieved in a single function call or one-liner?
EDIT: Specific Example
DF <- data.frame( A = 1:2, B = 2:3, C = 3:4)
DF <- split( DF, factor( DF$A)) # but each list element still contains the column A,
                                # which is redundant (because the names() of the list
                                # elements already equal A), so I have to write the
                                # following line if I want to be efficient, especially
                                # with large datasets
DF <- lapply( DF, function( x) select( x, -A)) # I hate always writing this line!
Remove the split column first:
split(DF[-1], DF[[1]])
or
split(subset(DF, select = -A), DF$A)
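A tidyverse variant of the same idea (a sketch, assuming tidyr >= 1.0): nest() stores each group's rows without the grouping column, and names can be reattached from the nesting key.
library(tidyr)

nested <- nest(DF, data = -A) # DF as originally defined, before splitting
lx <- setNames(nested$data, nested$A) # a named list of tibbles, each without column A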