Clean, simple function factories in R

Short example. I am exploring the behavior of a function by testing it with different "specs", f(spec). I wrote down one spec by hand, spec1, and am creating new specs as variations on it. To do this, I decided to write a function:
spec1 = list(fy = list(a = 1), fx = list(f1 = function(x) 10-x, f2 = function(x) 2-x))
make_spec = function(f = function(x) 10-x, xtheta = 2)
list(fy = list(a = 1), fx = list(f1 = f, f2 = function(x) xtheta-x))
res1 = make_spec()
# first problem: they don't match
all.equal(res1,spec1)
# [1] "Component “fx”: Component “f2”: target, current do not match when deparsed"
# ^ this happens, even though...
res1$fx$f2(4) == spec1$fx$f2(4)
# TRUE
# second problem: res1 is fugly
res1
# $fy
# $fy$a
# [1] 1
#
#
# $fx
# $fx$f1
# function (x)
# 10 - x
# <environment: 0x000000000f8f2e20>
#
# $fx$f2
# function (x)
# xtheta - x
# <environment: 0x000000000f8f2e20>
str(res1)
# even worse
My goals for make_spec are:

- all.equal(spec1, res1) and/or identical(spec1, res1)
- for str(res1) to be human-readable (no <environment: ptr> tags or srcfilecopy)
- to avoid substitute and eval altogether if possible (not a high priority)
- to avoid writing out the second arg of substitute (see "full" example below)

Is there an idiomatic way to achieve some or all of these goals?
Full example. I'm not sure the example above fully covers my use case, so here is the full version:
spec0 = list(
v_dist = list(
pdf = function(x) 1,
cdf = function(x) x,
q = function(x) x,
supp = c(0,1)
)
,
ucondv_dist = {
ucondv_dist = list()
ucondv_dist$condmean = function(v) 10-v
ucondv_dist$pdf = function(u,v) dnorm(u, ucondv_dist$condmean(v), 50)
ucondv_dist$cdf = function(u,v) pnorm(u, ucondv_dist$condmean(v), 50)
ucondv_dist
}
)
make_spec = function(ycondx_condmean = function(x) 10-x, ycondx_sd = 50){
s = substitute(list(
x_dist = list(
pdf = function(x) 1,
cdf = function(x) x,
q = function(x) x,
supp = c(0,1)
)
,
ycondx_dist = {
ycondx_dist = list()
ycondx_dist$condmean = ycondx_condmean
ycondx_dist$pdf = function(u,v) dnorm(u, ycondx_dist$condmean(v), ycondx_sd)
ycondx_dist$cdf = function(u,v) pnorm(u, ycondx_dist$condmean(v), ycondx_sd)
ycondx_dist
}
)
, list(ycondx_condmean=ycondx_condmean, ycondx_sd = ycondx_sd))
eval(s, .GlobalEnv)
}
res0 = make_spec()
Side note. I don't know if "function factory" is the right term here, since I am not a computer scientist, but it seems related. I found only a paragraph on the concept as it relates to R.

The enclosing environments of the functions are different, which leads to the difference in deparsing and hence in output. So there are two things to do to get the desired result:

- make the environments the same;
- substitute the variables from the enclosing environments into the function bodies.

However, doing it this way you get a double dose of the eval/substitute you wanted to avoid, so maybe there is an alternative.
make_spec <- function(f = function(x) 10-x, xtheta = 2) {
e <- parent.frame()
# substitute(substitute(func)) captures the unevaluated argument; the inner
# eval in parent.frame() lets substitute() replace xtheta with its value in
# make_spec's frame, and the outer eval in the caller's frame gives the new
# closure the same enclosing environment as spec1's functions
fixClosure <- function(func)
eval(eval(substitute(substitute(func)), parent.frame()), e)
list(fy = list(a = 1), fx = list(
f1 = fixClosure(f),
f2 = fixClosure(function(x) xtheta-x)
))
}
spec1 <- list(fy = list(a = 1), fx = list(f1 = function(x) 10-x, f2 = function(x) 2-x))
res1 <- make_spec()
all.equal(res1, spec1)
# [1] TRUE
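For the "avoid substitute/eval" goal, here is an alternative sketch (make_spec2 is my name, not from the post; it assumes spec1 was defined at top level, so .GlobalEnv is the environment to match, and it relies on body<- dropping the srcref so both functions deparse identically):
make_spec2 <- function(f = function(x) 10-x, xtheta = 2) {
f2 <- function(x) NULL
body(f2) <- bquote(.(xtheta) - x)  # splice the value of xtheta into the body
environment(f2) <- .GlobalEnv      # match the enclosing environment of spec1
environment(f) <- .GlobalEnv
list(fy = list(a = 1), fx = list(f1 = f, f2 = f2))
}
all.equal(make_spec2(), spec1)
# [1] TRUE (expected: same deparsed bodies, same enclosing environments)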

Related

Avoiding duplication in R

I am trying to fit a variety of (truncated) probability distributions to the same very thin set of quantiles. I can do it but it seems to require lots of duplication of the same code. Is there a neater way?
I am using this code by Nadarajah and Kotz to compute the quantile function of the truncated distributions:
qtrunc <- function(p, spec, a = -Inf, b = Inf, ...)
{
tt <- p
G <- get(paste("p", spec, sep = ""), mode = "function")
Gin <- get(paste("q", spec, sep = ""), mode = "function")
tt <- Gin(G(a, ...) + p*(G(b, ...) - G(a, ...)), ...)
return(tt)
}
where spec can be the name of any untruncated distribution for which code in R exists, and the ... argument is used to provide the names of the parameters of that untruncated distribution.
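For instance, with spec = "norm" the helper dispatches to pnorm/qnorm, so a quick sanity check (my example, not from the original post) is:
# median of a standard normal truncated to [0, Inf)
qtrunc(p = 0.5, spec = "norm", a = 0, mean = 0, sd = 1)
# internally qnorm(pnorm(0) + 0.5 * (1 - pnorm(0))) = qnorm(0.75), about 0.6745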
To achieve the best fit I need to measure the distance between the given quantiles and those calculated using arbitrary values of the parameters of the distribution. In the case of the gamma distribution, for example, the code is as follows:
spec <- "gamma"
fit_gamma <- function(x, l = 0, h = 20, t1 = 5, t2 = 13){
ct1 <- qtrunc(p = 1/3, spec, a = l, b = h, shape = x[1],rate = x[2])
ct2 <- qtrunc(p = 2/3, spec, a = l, b = h, shape = x[1],rate = x[2])
dist <- vector(mode = "numeric", length = 2)
dist[1] <- (t1 - ct1)^2
dist[2] <- (t2 - ct2)^2
return(sqrt(sum(dist)))
}
where l is the lower truncation bound, h the upper, and I am given the two tertiles t1 and t2.
Finally, I seek the best fit using optim, thus:
gamma_fit <- optim(par = c(2, 4),
fn = fit_gamma,
l = l,
h = h,
t1 = t1,
t2 = t2,
method = "L-BFGS-B",
lower = c(1.01, 1.4))
Now suppose I want to do the same thing but fitting a normal distribution instead. The names of the parameters of the normal distribution that I am using in R are mean and sd.
I can achieve what I want but only by writing a whole new function fit_normal that is extremely similar to my fit_gamma function but with the new parameter names used in the definition of ct1 and ct2.
The problem of duplication of code becomes very severe because I wish to try fitting a large number of different distributions to my data.
What I want to know is whether there is a way of writing a generic fit_spec as it were so that the parameter names do not have to be written out by me.
Use x as a named list to create a list of arguments to pass into qtrunc() using do.call().
fit_distro <- function(x, spec, l = 0, h = 20, t1 = 5, t2 = 13){
args <- c(x, list(spec = spec, a = l, b = h))
ct1 <- do.call(qtrunc, args = c(list(p = 1/3), args))
ct2 <- do.call(qtrunc, args = c(list(p = 2/3), args))
dist <- vector(mode = "numeric", length = 2)
dist[1] <- (t1 - ct1)^2
dist[2] <- (t2 - ct2)^2
return(sqrt(sum(dist)))
}
This is called as follows, giving the same result as your original function.
fit_distro(list(shape = 2, rate = 3), "gamma")
# [1] 13.07425
fit_gamma(c(2, 3))
# [1] 13.07425
This will work with other distributions, for however many parameters they have.
fit_distro(list(mean = 10, sd = 3), "norm")
# [1] 4.08379
fit_distro(list(shape1 = 2, shape2 = 3, ncp = 10), "beta")
# [1] 12.98371
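To close the loop with the original workflow, the same fit_distro() drops straight into optim(); a sketch for the normal case (the start values and bounds here are placeholders, not tuned):
normal_fit <- optim(par = c(mean = 10, sd = 3),
fn = function(par) fit_distro(as.list(par), "norm"),
method = "L-BFGS-B",
lower = c(0.01, 0.01))
normal_fit$par  # fitted mean and sd of the truncated normal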

Changing behavior for closure stored in data.table between R 3.4.3 and R 3.6.0

I noticed the following peculiar behavior when I upgraded from R 3.4.3 to R 3.6.0 (both using data.table 1.12.6). In 3.4.3 the code below leads to the all.equal statement being TRUE, but in 3.6.0 there is a mean relative difference. It comes from the fact that even though we are trying to access the approxfun calculated from group "a", the values from group "b" are used (probably somehow due to lazy evaluation). In 3.6.0 this issue can be solved by adding a copy statement in the calls to approxfun, based on this question:
Handling of closures in data.table
The fascinating thing to me is that I do not get an error in 3.4.3. Any idea what changed?
library(data.table)
data <- data.table(
group = c(rep("a", 4), rep("b", 4)),
x = rep(c(.02, .04, .12, .21), 2),
y = c(
0.0122, 0.01231, 0.01325, 0.01374, 0.01218, 0.01229, 0.0133, 0.01379)
)
dtFuncs <- data[ , list(
func = list(stats::approxfun(x, y, rule = 2))
), by = group]
f <- function(group, x) {
dtResults <- CJ(group = group, x = x)
dtResults <- dtResults[ , {
.g <- group
f2 <- dtFuncs[group == .g, func][[1]]
list(x = x, y = f2(x))
}, by = group]
dtResults
}
x0 <- .07
g <- "a"
all.equal(
with(data[group == g], approx(x, y, x0, rule = 2)$y),
f(group = g, x = x0)$y
)
After running git bisect on the r-source, I was able to deduce that it was this commit that caused the behavior: https://github.com/wch/r-source/commit/adcf18b773149fa20f289f2c8f2e45e6f7b0dbfe
What fundamentally happened is that when the x's passed to approxfun were already ordered, an internal copy was no longer made. If the data had been randomly sorted, the code would have continued to work! (See the snippet below.)
The lesson for me is that it's probably best not to mix complicated objects with data.table, since the same environment is reused for each "by" group (or else to be very deliberate with data.table::copy).
## should be run under R > 3.6.0 to see disparity
library(data.table)
## original sorted x (does not work)
data <- data.table(
group = c(rep("a", 4), rep("b", 4)),
x = rep(c(.02, .04, .12, .21), 2),
y = c(
0.0122, 0.01231, 0.01325, 0.01374, 0.01218, 0.01229, 0.0133, 0.01379)
)
dtFuncs <- data[ , {
print(environment())
list(
func = list(stats::approxfun(x, y, rule = 2))
)
}, by = group]
f <- function(group, x) {
dtResults <- CJ(group = group, x = x)
dtResults <- dtResults[ , {
.g <- group
f2 <- dtFuncs[group == .g, func][[1]]
list(x = x, y = f2(x))
}, by = group]
dtResults
}
get("y", environment(dtFuncs$func[[1]]))
get("y", environment(dtFuncs$func[[2]]))
x0 <- .07
g <- "a"
all.equal(
with(data[group == g], approx(x, y, x0, rule = 2)$y),
f(group = g, x = x0)$y
)
## unsorted x (works)
data <- data.table(
group = c(rep("a", 4), rep("b", 4)),
x = rep(c(.02, .04, .12, .21), 2),
y = c(
0.0122, 0.01231, 0.01325, 0.01374, 0.01218, 0.01229, 0.0133, 0.01379)
)
set.seed(10)
data <- data[sample(1:.N, .N)]
dtFuncs <- data[ , {
print(environment())
list(
func = list(stats::approxfun(x, y, rule = 2))
)
}, by = group]
f <- function(group, x) {
dtResults <- CJ(group = group, x = x)
dtResults <- dtResults[ , {
.g <- group
f2 <- dtFuncs[group == .g, func][[1]]
list(x = x, y = f2(x))
}, by = group]
dtResults
}
get("y", environment(dtFuncs$func[[1]]))
get("y", environment(dtFuncs$func[[2]]))
x0 <- .07
g <- "a"
all.equal(
with(data[group == g], approx(x, y, x0, rule = 2)$y),
f(group = g, x = x0)$y
)
## better approach: it may be safer to avoid mixing objects treated by
## reference (data.table & closures) altogether...
fList <- lapply(split(data, by = "group"), function(d){
with(d, stats::approxfun(x, y, rule = 2))  # x and y are columns of each group
})
fList
fList[[1]](.07) != fList[[2]](.07)
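For reference, the copy() fix mentioned at the top of the question would look roughly like this (a sketch; it forces fresh vectors so each group's closure captures its own data rather than the shared "by" environment):
dtFuncs <- data[ , list(
func = list(stats::approxfun(copy(x), copy(y), rule = 2))
), by = group]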

R: redefine a string as an argument

Following on from some earlier help, I think all I need for this to work is a way to define the variable dimxST below as something other than a string, since I need it to point to the data frame...
cpkstudy <- function(x,y){
dxST <- paste(x,"$",y, sep = "")
dLSL <- paste(y, "LSL", sep = "")
dUSL <- paste(y, "USL", sep = "")
dTar <- paste(y, "Target", sep = "")
dimxST <-  # incomplete: this needs to resolve to the data frame column, not a string
dimLSL <- PivSpecs[[dLSL]]
dimUSL <- PivSpecs[[dUSL]]
dimTar <- PivSpecs[[dTar]]
ss.study.ca(dimxST, LSL = dimLSL, USL = dimUSL, Target = dimTar,
alpha = 0.05, f.na.rm = TRUE, f.main = "Six Sigma Study")
}
cpkstudy("cam1","D1")
link to the previous post
This is a different direction, and you may find the learning curve a bit steeper, but it's a lot more powerful. Instead of passing everything in as strings, we pass them without quotes, and use the rlang package to figure out where to evaluate D1.
# the same dummy data frame from Katia's answer
cam1 <- data.frame(D1 = rnorm(10),
D2 = rnorm(10))
PivSpecs <- list(D1LSL = 740, D1USL = 760, D1Target = 750)
library(rlang)
cpkstudy <- function(df, y){
quo_y <- enquo(y)
dLSL <- paste0(quo_name(quo_y), "LSL")
dUSL <- paste0(quo_name(quo_y), "USL")
dTar <- paste0(quo_name(quo_y), "Target")
dimxST <- eval_tidy(quo_y, data = df)
dimLSL <- PivSpecs[[dLSL]]
dimUSL <- PivSpecs[[dUSL]]
dimTar <- PivSpecs[[dTar]]
print(dimxST)
print (paste("dimLSL=", dimLSL) )
print (paste("dimUSL=", dimUSL) )
print (paste("dimTar=", dimTar) )
# ss.study.ca(dimxST, LSL = dimLSL, USL = dimUSL, Target = dimTar,
# alpha = 0.05, f.na.rm = TRUE, f.main = "Six Sigma Study")
}
# notice that I am not quoting cam1 and D1
cpkstudy(cam1, D1)
If you want to learn more about this, I would suggest looking at https://dplyr.tidyverse.org/articles/programming.html as an overview (the dplyr package imports some of the functions used in rlang), and http://rlang.r-lib.org/index.html for a more complete list of all the functions and examples.
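As a standalone illustration of the two rlang pieces doing the work above, enquo() captures the unquoted argument and quo_name() turns it back into a string:
g <- function(col) {
quo_col <- enquo(col)  # capture the argument without evaluating it
quo_name(quo_col)      # recover its name as a string
}
g(D1)
# [1] "D1"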
You can use the function get() to obtain an object's value from its string representation. In this solution I did not evaluate the ss.study.ca() function itself, since I do not have your real-case input data; instead I just print the values that would go there:
cpkstudy <- function(x,y){
#dxST <- paste0(x,"$",y)
dLSL <- paste0(y, "LSL")
dUSL <- paste0(y, "USL")
dTar <- paste0(y, "Target")
dimxST <- get(x)[,y]
print(dimxST)
dimLSL <- PivSpecs[[dLSL]]
dimUSL <- PivSpecs[[dUSL]]
dimTar <- PivSpecs[[dTar]]
print (paste("dimLSL=", dimLSL) )
print (paste("dimUSL=", dimUSL) )
print (paste("dimTar=", dimTar) )
#ss.study.ca(dimxST, LSL = dimLSL, USL = dimUSL, Target = dimTar,
# alpha = 0.05, f.na.rm = TRUE, f.main = "Six Sigma Study")
}
# create some dummy dataframe to test with this example
cam1 <- data.frame(D1 = rnorm(10),
D2 = rnorm(10))
# define a list that will be used within a function
PivSpecs <- list(D1LSL = 740, D1USL = 760, D1Target = 750)
#test function
cpkstudy("cam1","D1")
#[1]  1.82120625 -0.08857998 -0.08452232 -0.43263828  0.17974556 -0.91141414
#[7] -2.30595203 -1.24014396 -1.83814577 -0.24812598
#[1] "dimLSL= 740"
#[1] "dimUSL= 760"
#[1] "dimTar= 750"
I also changed your paste() calls to paste0(), which uses sep = "" as the default.
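The same string-based lookup works standalone as well (using the cam1 defined above):
identical(get("cam1")[, "D1"], cam1$D1)
# [1] TRUE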

Running aggregate function within dmapply (ddR package)

I would like to run the aggregate function within the dmapply function as offered through the ddR package.
Desired results
The desired results reflect a simple output generated via aggregate in base:
aggregate(
x = mtcars$mpg,
FUN = function(x) {
mean(x, na.rm = TRUE)
},
by = list(trans = mtcars$am)
)
which produces:
trans x
1 0 17.14737
2 1 24.39231
Attempt - dmapply
I would like to arrive at the same results while utilising dmapply, as attempted below:
# ddR
require(ddR)
# ddR object creation
distMtcars <- as.dframe(mtcars)
# Aggregate / ddmapply
dmapply(
FUN = function(x, y) {
aggregate(FUN = mean(x, na.rm = TRUE),
x = x,
by = list(trans = y))
},
distMtcars$mpg,
y = distMtcars$am,
output.type = "dframe",
combine = "rbind"
)
The code fails:
Error in match.fun(FUN) : 'mean(x, na.rm = TRUE)' is not a
function, character or symbol Called from: match.fun(FUN)
Updates
Fixing the error pointed out by @Mike removes the error message, but does not produce the desired result. The code:
# Avoid namespace conflict with other packages
ddR::collect(
dmapply(
FUN = function(x, y) {
aggregate(
FUN = function(x) {
mean(x, na.rm = TRUE)
},
x = x,
by = list(trans = y)
)
},
distMtcars$mpg,
y = distMtcars$am,
output.type = "dframe",
combine = "rbind"
)
)
yields:
[1] trans x
<0 rows> (or 0-length row.names)
It works fine for me if you change your aggregate function to be consistent with the one you call earlier: FUN = function(x) mean(x, na.rm = TRUE). The reason it can't find mean(x, na.rm = TRUE) is that it isn't a function (it's a function call); mean by itself is the function.
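To see the distinction in one line (my illustration, not from the answer):
class(mean)                          # "function" — fine for FUN =
class(quote(mean(x, na.rm = TRUE)))  # "call" — a function call, not a function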
Also it will give you NA results unless you change your x = distMtcars$mpg to x = collect(distMtcars)$mpg. Same for y. With all that said, I think this should work for you:
res <-dmapply(
FUN = function(x, y) {
aggregate(FUN = function(x) mean(x, na.rm = TRUE),
x = x,
by = list(trans = y))
},
x = list(collect(distMtcars)$mpg),
y = list(collect(distMtcars)$am),
output.type = "dframe",
combine = "rbind"
)
Then you can do collect(res) to see the result.
collect(res)
# trans x
#1 0 17.14737
#2 1 24.39231
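As a sanity check, the distributed result can be compared with the base-R aggregate() output from the top of the post (assuming collect(res) returns a plain data frame; check.attributes = FALSE ignores row-name differences):
all.equal(
collect(res),
aggregate(x = mtcars$mpg,
FUN = function(x) mean(x, na.rm = TRUE),
by = list(trans = mtcars$am)),
check.attributes = FALSE
)
# TRUE expected: both compute the same per-group means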

Is it a bug in Rglpk?

I used Rglpk to solve a linear programming problem, but its results seem weird. I changed to lpSolve, and the two results are different.
Please comment out the Rglpk statements and uncomment the lpSolve ones to switch the solver to lpSolve.
# Lo, S.-F., & Lu, W.-M. (2009). An integrated performance evaluation of financial holding companies in Taiwan.
# European Journal of Operational Research, 198(1), 341–350. doi:10.1016/j.ejor.2008.09.006
sbm = function(X,Y)
{
# Here X is N * m matrix, Y is N*s matrix.
library(Rglpk)
# require(lpSolve)
N = nrow(X)
m = ncol(X)
s = ncol(Y)
# variables are
# t
# gamma_j,j=1..N
# s_i^(-),i=1..m
# s_r^(+),r=1..s
efficiency = numeric(N)
# drop = FALSE keeps the matrix shape when s == 1
max_positive_y = apply(Y[, 1:s, drop = FALSE], MARGIN = 2, function(x) max(x[x>0]))
min_positive_y = apply(Y[, 1:s, drop = FALSE], MARGIN = 2, function(x) min(x[x>0]))
dir = rep("==",1+m+s+1)
rhs = c(1,rep(0,m),rep(0,s),0)
for(i in 1:N)
{
x = X[i,]
y = Y[i,]
#variables
coef_t = 1
coef_gamma = rep(0,N)
coef_s_i = -1/(m * x)
coef_s_r = rep(0,s)
obj = c(coef_t,coef_gamma,coef_s_i,coef_s_r)
coef_constraint1_s=y
for(r in 1:s)
{
if(y[r]<0){
coef_constraint1_s[r] =
min_positive_y[r] * (max_positive_y[r] - min_positive_y[r])/
(max_positive_y[r] - y[r])
}
}
constraint1 = c(1, rep(0,N), rep(0,m) , 1/(s*coef_constraint1_s))
constraint2 = cbind(-x, t(X), diag(m), matrix(0,m,s))
constraint3 = cbind(-y, t(Y), matrix(0,s,m), -diag(s))
constraint4 = c(-1, rep(1,N), rep(0,m), rep(0,s))
mat = rbind(constraint1,constraint2,constraint3,constraint4)
results = Rglpk_solve_LP(obj = obj,mat = mat,dir = dir,rhs = rhs,max = FALSE)
efficiency[i] = results$optimum
# results <- lp("min", obj, mat, dir, rhs)
# efficiency[i] = results$objval
}
efficiency
}
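The post does not include a call to sbm(); a hypothetical smoke test with random data (the numbers are meaningless, this only exercises the shapes) might look like:
set.seed(42)
X <- matrix(runif(20), nrow = 10, ncol = 2)  # N = 10 units, m = 2 inputs
Y <- matrix(runif(20), nrow = 10, ncol = 2)  # s = 2 outputs
sbm(X, Y)  # one SBM efficiency score per row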
