In R, making packages with functions that are by() and with() compliant - r

Working on improving my package summarytools, I'm looking for a way to use the information on each of by()'s groups to integrate this info in some function's output. To give a little bit more of a context, the functions in this package print out the dataframe name and variable name(s) being summarized. Functions like by() make it difficult because they use generic names such as dd[x, ] when slicing the data and feeding it to functions. substitute() is thus not an option to get at the x parameter in that case, and the values of the IND variable(s) are also hidden (to a certain level).
To illustrate, in the following example, the group information (c.g. "gender: F" and "smoker: No") is simply printed out with cat() when print.by() is invoked, using attributes of the object of class "by":
dat <- data.frame(gender=rep(c("F","M"),each=15),
smoker=rep(c("Yes", "No")),
someQty=runif(n = 30,min = 0, max = 10))
by(dat$someQty, INDICES = list(gender=dat$gender, smoker=dat$smoker), FUN = mean)
## gender: F
## smoker: No
## [1] 5.560505
## -------------------------------------------------------------------------------
## gender: M
## smoker: No
## [1] 2.568055
## -------------------------------------------------------------------------------
## gender: F
## smoker: Yes
## [1] 4.057938
## -------------------------------------------------------------------------------
## gender: M
## smoker: Yes
## [1] 3.416027
Now what I need is to get the info for each group during the by-group processing (as opposed to recuperating them after the "by" object has been created).
I worked on a solution, but before I repeat a similar work for making functions comply with with(), %>%, and possibly others... and their combinations, I'm wondering if there might be a simpler approach to this.
Here's what I have so far do deal with by():
# Initialise variable in package-specific environment that
# will help keeping track of the by-processing
myenv <- new.env()
myenv$byInfo <- list()
# Declare some function that will return the `by` variables values
# at each iteration (it's a sort of dummy function that does just that)
myfunc <- function(x) {
sc <- sys.calls()
sf <- sys.frames()
# Find position of by.default() and tapply() in the sys.calls list
by_pos <- which(as.character(lapply(sc, head, 1))=="by.default()")
tapply_pos <- which(as.character(lapply(sc, head, 1))=="tapply()")
if (length(by_pos) == 1) {
# check if this is the first "by" iteration
if(length(myenv$byInfo) == 0) {
# Standardise the call (adds argument names)
by_call <- as.list(pryr::standardise_call(sc[[by_pos]]))
# Extract the data argument
by_data <- deparse(by_call$data)
# Extract the IND variable names
by_IND <- as.character(by_call$IND)
by_IND <- by_IND[-which(by_IND=="list")]
# Get the levels of these IND variables
by_levels <- sf[[tapply_pos]]$namelist
levels_df <- expand.grid(by_levels, stringsAsFactors = FALSE)
# Store the info in the package-specific environment
myenv$byInfo$iter <- 1
myenv$byInfo$levels_df <- levels_df
myenv$byInfo$nb_iter <- nrow(levels_df)
}
levels_df <- myenv$byInfo$levels_df
info <- paste(colnames(myenv$byInfo$levels_df),
as.character(myenv$byInfo$levels_df[myenv$byInfo$iter,]),
sep=" = ", collapse = ", ")
if (myenv$byInfo$iter == myenv$byInfo$nb_iter)
myenv$byInfo <- list()
else
myenv$byInfo$iter = myenv$byInfo$iter + 1
return(info)
}
return()
}
b <- by(data = dat$someQty,
INDICES = list(gender = dat$gender, smoker = dat$smoker),
FUN = myfunc)
b[1:4]
## [1] "gender = F, smoker = No" "gender = M, smoker = No"
## [3] "gender = F, smoker = Yes" "gender = M, smoker = Yes"
So yes, it does give me what I want, but I'd like to know if I'm missing something more straightforward here.
Note: I thought adding a by= parameter to some functions and just ignore base R's by() altogether but I'd rather use the preexisting base functions people are accustomed to.

Related

How to use the `purrr` package in R instead of for-loop to iterate over indices

I have a list of S4 objects, and I'm trying to iterate a function over these lists where I select an index position, and then from that position extract keywords I'm interested in. I am able to do a for loop and apply the function successfully, but is there a way this could be done using the purrr package? I'm not sure how to replicate the S4 object exactly, so I've included a very high level example just to get an idea of my process.
list_1 <- list("Sample", "test", "test Date")
list_2 <- list("test", "sample", "test Date")
listoflists <- list(list_1, list_2)
I created a list of indices of "Sample":
groupList <- map(listoflists,~which(toupper(.) == "SAMPLE"))
As well as a list of keywords that I'd like to extract:
keywordsList <- list(c("One test", "two test"), c("one test", "two test"))
I have a function that takes the S4 objects, selects the index where "sample" is found, and from that extracts the keywords.
for(i in seq_along(listoflists){
output[[i]] <- some_function(listoflists[[i]], index = groupList[[i]], keywords = keywordsList[[i]]) }
I tried using imap, but it seems like when I do this, the output's sublist only has 1 keyword (say "One test" in first list and "two test" in second list) instead of 3:
output <- listoflists %>% imap(~some_function(.x,index = groupList[[.y]], keywords = keywordsList[[.y]])
You are missing an closing bracket in your for loop but other than that your code should work. I am going to define a trivial some_function() to demonstrate:
some_function <- function(x, index, keywords) {
c(x[[index]], keywords)
}
loop_output <- vector(mode = "list", length = length(listoflists))
for (i in seq_along(listoflists)) {
loop_output[[i]] <- some_function(listoflists[[i]], index = groupList[[i]], keywords = keywordsList[[i]])
}
purr_output <- imap(
listoflists,
~ some_function(
.x,
index = groupList[[.y]],
keywords = keywordsList[[.y]]
)
)
identical(loop_output, purr_output)
# TRUE
If even with the correct brackets, your example works in a loop but not using imap I doubt that the use of S4 objects is relevant.
You can be tripped up if you have a named list. From the imap docs:
imap_xxx(x, ...), an indexed map, is short hand for map2(x, names(x), ...) if x has names, or map2(x, seq_along(x), ...) if it does not.
See for example:
listoflists <- list(list_1, list_2)
imap(listoflists, ~.y)
# [[1]]
# [1] 1
# [[2]]
# [1] 2
listoflists <- list(l1 = list_1, l2 = list_2)
imap(listoflists, ~.y)
# $l1
# [1] "l1"
# $l2
# [1] "l2"
Make sure you are looping over the indices rather than the names and the output should be identical.
You could also do this with purrr::pmap(), which maps in parallel over an arbitrary number of lists (passed within a super-list):
output <-
pmap(.l = list(listoflists, index = groupList, keywords = keywordsList),
.f = some_function)

print list names while iterating in lapply not working in 3.2

I'm trying o output the list names every time I run the function thru lapply. I posted this question earlier that I posted earlier, and the answer provided by #Ananda Mahto worked fine until I upgraded my R to version 3.2.0. It is no longer working and I get the following error message: Error in eval.parent(quote(names(X)))[substitute(x)[[3]]] : invalid subscript type 'symbol'
x <- ts(rnorm(40,5), start = c(1961, 1), frequency = 12)
y <- ts(rnorm(50,20), start = c(1971, 1), frequency = 12)
z <- ts(rnorm(50,39), start = c(1981, 1), frequency = 12)
a <- ts(rnorm(50,59), start = c(1991, 1), frequency = 12)
dat.list <- list(x=x,y=y,z=z,a=a)
abc <- function(x) {
r <- mean(x)
print(eval.parent(quote(names(X)))[substitute(x)[[3]]])
return(r)
}
forl <- lapply(dat.list, abc)
I'm not sure what the issues is, but I checked all the syntax in the new version of R nothing has changed. Any help is greatly appreciated. I'm open to any new ideas as well.
If you restructure it a little, you can get the same effect with a simpler appearance:
set.seed(42)
## using your dat.list construction code from above
abc <- function(x) { r <- mean(x); return(r); }
forl <- mapply(function(n, x) {
message(n)
abc(x)
}, names(dat.list), dat.list, SIMPLIFY=FALSE)
## x
## y
## z
## a
forl
## $x
## [1] 4.960464
## $y
## [1] 20.1141
## $z
## [1] 38.87175
## $a
## [1] 58.89825
I'm presuming you wanted the output in a list vice a vector, ergo SIMPLIFY=FALSE to mimic lapply. Otherwise, it's a simpler vector.
It's not as generic in that you have to explicitly pass the names as well as the dat, though you can create a sandwich function at the cost of having to reimplement lapply functionality.
A personal coding preference I'm using here is the use of message over print. The rationale is that the user has the option to use suppressMessages (outside the mapply, assuming it could/would be buried in a function), whereas suppressing the output of a print call is a bit more work.

Get name of function from loop variable

I need to loop through a number of functions, and plot/print the result next to the function name. I learned (this question/answer) that I have to use substitute / eval. This works nicely if each function name per se is enclosed in substitute() (see (A) and (B) below). Is there a way to automatize this, e.g. by using a construction similar to sapply? (C) obviously fails because substitute encloses also the c() clause, but maybe there is a something that I missed to make (D) work? Or any other ideas? Or is there no way?
This is what I tried (small examples, real code has many more functions and plotting stuff).
# A
x <- c(1:10)
my.fun <- substitute(mean) # works
print(paste(deparse(my.fun), ": ", eval(my.fun)(x), sep=""))
# B
for (my.fun in c(substitute(mean), substitute(median))) { # works, but lots of typing for longer function lists
print(paste(deparse(my.fun), ": ", eval(my.fun)(x), sep=""))
}
# C
for (my.fun in substitute(c(mean, median))) { # error: invalid for() loop sequence
print(paste(deparse(my.fun), ": ", eval(my.fun)(x), sep=""))
}
# D
for (my.fun in sapply(c(mean, median), substitute)) { # error: '...' used in an incorrect context
print(paste(deparse(my.fun), ": ", eval(my.fun)(x), sep=""))
}
# E # also not helpful
my.functions <- c(mean, median)
my.fun.2 <- NULL
for (i in 1:2) {
my.fun.2 <- c(my.fun.2, substitute(my.functions[i]))
}
# my.fun.2
# [[1]]
# my.functions[i]
# [[2]]
# my.functions[i]
What about this "one-liner"? :)
> x <- c(1:10)
> f <- function(...)
+ sapply(
+ sapply(
+ as.list(substitute(list(...)))[-1L],
+ deparse),
+ function(fn)
+ get(fn)(x))
> f(mean, median)
mean median
5.5 5.5
In short, you can pass the functions as multiple arguments, then quickly deparse those before actually evaluating the functions one by one. So the above function with a few extra comments:
#' Evaluate multiple functions on some values
#' #param ... any number of function that will take \code{x} as the argument
#' #param x values to be passed to the functions
#' #examples
#' f(mean, median)
#' f(mean, median, sum, x = mtcars$hp)
f <- function(..., x = c(1:10)) {
## get provided function names
fns <- sapply(as.list(substitute(list(...)))[-1L], deparse)
## run each function as an anonymous function on "x"
sapply(fns, function(fn) get(fn)(x))
}
Or with do.call instead of this latter anonymous function:
f <- function(..., x = c(1:10)) {
## get provided function names
fns <- sapply(as.list(substitute(list(...)))[-1L], deparse)
## run each function on "x"
sapply(fns, do.call, args = list(x))
}

how to provide output of user-defined-function as data.frame in R in .GlobalEnv

I have created a function 'mywsobj' which takes one input & 2 output
user input: environment
output1: a data.frame name "wsobj"
(data of list of objects,their classes,their memory usage)
output to console
output2: a barplot of list of objects and their memory usage based.
So far all ok,
My question1: but How to save that "wsobj" data.frame from inside the function in that user-specified input environment or at least in the .GlobalEnv ?
I tried reading things like: use of <<, use of pos/parent.environment etc.
but things are so far beyond me.
My question2: Is it possible to specify project-specific/user-specfic environment in R/specially in Rstudio server ?
Though my code here may not matter still it is below:
# creating SOME data
dfAtoZ<- data.frame(LETTERS)
df1to1Cr <- data.frame(1:10000000)
vec1to1Cr <- as.vector(1:10000000)
mat1to1Cr <- as.matrix(1:10000000)
set.seed<-10
randvec<-runif(1000,min=-100,max=100)
# creating MY function
mywsobj<-function(myenvironmentName)
{#step1 creating vector of object names
wslist <- vector(length=length(ls(myenvironmentName)))
for(i in 1:length(ls(myenvironmentName)))
{wslist[i]<-ls(myenvironmentName)[i]}
# wslist<-cbind(unlist(ls()))
#step2 creating vector of object classes
wsclass <- vector(length=length(wslist))
wsmemKb <- vector(mode="numeric",length=length(wslist))
for(i in 1:length(wslist))
{wsclass[i]<-class(get(wslist[i]))
wsmemKb[i]<- object.size(get(wslist[i]))/1000*1024}
#step4 combining them in a data.frame
wobj<-data.frame(cbind(wslist,wsclass,wsmemKb))
# library(sqldf)
# sqldf("pragma table_info(wobj)") shows col 3(wsmem) still non-numeric
wobj[,3] <- as.numeric( as.character(wobj[,3]) )
# create data to return matrix of memory consumption
objmemsize <- rev(sort(sapply(ls(envir=myenvironmentName),
function (object.name)object.size(get(object.name))/1000)))
# draw bar plot
barplot(objmemsize,main="Memory usage by object in myenvironment",
ylab="KiloByte", xlab="Variable name",
col=heat.colors(length(objmemsize)))
# result <- sqldf("select * from wobj order by 1/wsmemKb")
return(wobj)
# return(data.frame(myenvironmentName,wobj))
# return(assign("myname",wobj,envir = .GlobalEnv))
# attach(wobj,pos=2,"wobj")
return(barplot)
}
# use of mywsobj function
mywsobj(.GlobalEnv)
# saving output of mywsobj function
output<-as.data.frame(mywsobj(.GlobalEnv))
Not sure if this is what you're after. But, you can assign values to that environment with $.
my_fun <- function(in.env) {
# you may want to check if input argument is an environment
# do your computations
set.seed(45)
x <- sample(10)
y <- runif(10)
in.env$val <- sum(x*y)
}
my_fun(my.env <- new.env())
ls(my.env)
[1] "val"
my.env$val
# [1] 22.30493
Alternatively, you can also use assign as follows:
my_fun <- function(in.env) {
# you may want to check if input argument is an environment
# do your computations
set.seed(45)
x <- sample(10)
y <- runif(10)
assign("val", sum(x*y), envir=in.env)
}
# assign to global environment
my_fun(globalenv())
> val
# [1] 22.30493
# assign to local environment, say, v
v <- new.env()
my_fun(v)
> v$val
# [1] 22.30493

Writing a function with arguments that indicate what function to apply

Is there a way to write a function in which one of the arguments indicates what function to apply?
For example, if I have a function:
mf = function(data, option, level)
where I want option to tell whether to calculate the mean, median or sd of a data set?
Yes, one option is to just pass a function to option. E.g.
mf <- function(data, option) {
option <- match.fun(option)
option(data)
}
set.seed(42)
dat <- rnorm(10)
mf(dat, option = mean)
Which gives:
> set.seed(42)
> dat <- rnorm(10)
> mean(dat)
[1] 0.5472968
> mf(dat, option = mean)
[1] 0.5472968
> sd(dat)
[1] 0.8354488
> mf(dat, option = sd)
[1] 0.8354488
match.fun() is the standard R way of matching to an available function. In the example I pass the function itself, but match.fun() allows other ways of referring to a function, for example as a character string:
> mf(dat, option = "mean")
[1] 0.5472968
match.fun() returns a function that can be used as any other function, hence option() is a function that is essentially the same as the function passed to the option argument or is the function named in the option argument.
It isn't clear how the level argument was supposed to be used to I have ignored that above.
I should probably add that if you want to pass in any arguments to the applied function then you'll want to use ... in the function definition, e.g.:
mf <- function(data, option, ...) {
option <- match.fun(option)
option(data, ...)
}
Hence we can do things like this
set.seed(42)
dat2 <- rnorm(10)
dat2[4] <- NA
mean(dat2)
mean(dat2, na.rm = TRUE)
mf(dat2, mean, na.rm = TRUE)
the last three lines giving
> mean(dat2)
[1] NA
> mean(dat2, na.rm = TRUE)
[1] 0.5377895
> mf(dat2, mean, na.rm = TRUE)
[1] 0.5377895
There is a bit of a problem in that "data set" in R usually means a dataframe and there is no median.data.frame so you need to use both lapply and do.call:
df <- data.frame(x=rnorm(10), y=rnorm(10))
mf = function(data, option="mean") {lapply( data,
function(col) do.call(option, list(col))) }
mf(df)
#-------------
$x
[1] 0.01646814
$y
[1] 0.5388518
You did not indicate what "level" was supposed to do, so I left it out of the equation,
> mf(df, sd)
$x
[1] 1.169847
$y
[1] 0.8907117

Resources