I am writing a package where a large number of different methods will take the same pattern of arguments and build a dataframe. I am trying to make a helper function that would call substitute on various passed parameters to identify columns in a dataframe, and I can't figure out how to get it to work two levels up. Here is a small example (the real one would have Yobs, B, Z, siteID all as different variables to be fetched from the passed data):
worker <- function(YY,
                   data = NULL,
                   env = NULL) {
  vv <- substitute(YY, env = env)
  res <- tibble(V = eval(vv, data))
  return(res)
}

compare_methods <- function(Yobs, data = NULL) {
  env <- rlang::env()
  dat <- worker(Yobs, data = data, env = env)
  return(dat)
}

dat <- tibble(kitty = 1:10, pig = LETTERS[1:10], orc = 1:10 * 10)
compare_methods(kitty, data = dat)
This is so I can avoid having all the variable names quoted in the function call, i.e. tidyverse-style methods. I think I am fundamentally misunderstanding some of the magic behind how the tidyverse passes variable names rather than strings, however, and perhaps I should be using an entirely different toolset here?
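For what it's worth, here is a minimal sketch of the tidy-eval route (my assumption about the intent; rlang::enquo() and rlang::eval_tidy() do not appear in the code above): capture the expression once at the top level and pass the resulting quosure down, instead of calling substitute() again at each level.

library(rlang)
library(tibble)

# worker receives an already-captured quosure (yy_quo is a made-up name)
worker <- function(yy_quo, data = NULL) {
  tibble(V = eval_tidy(yy_quo, data = data))
}

compare_methods <- function(Yobs, data = NULL) {
  yy <- enquo(Yobs)        # capture the caller's expression once
  worker(yy, data = data)  # pass the quosure down, however many levels
}

dat <- tibble(kitty = 1:10, pig = LETTERS[1:10], orc = 1:10 * 10)
compare_methods(kitty, data = dat)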
I need to compute an aggregate using the native R function IQR.
df1 <- SparkR::createDataFrame(iris)
df2 <- SparkR::agg(
  SparkR::groupBy(df1, "Species"),
  IQR_Sepal_Length = IQR(df1$Sepal_Length, na.rm = TRUE)
)
returns
Error in as.numeric(x): cannot coerce type 'S4' to vector of type 'double'
How can I do it?
This is exactly what gapply, dapply, and gapplyCollect were created for! Essentially, you can use a user-defined function in Spark; it will not run as optimally as native Spark functions, but at least you will get what you want.
I would suggest starting with gapplyCollect, then moving on to gapply.
df1 <- SparkR::createDataFrame(iris)

# gapplyCollect does not require you to specify an output schema,
# but it collects all the distributed workload back to the driver node,
# so it is not efficient if you expect a huge output
df2 <- SparkR::gapplyCollect(
  df1,
  c("Species"),
  function(key, x) {
    df_agg <- data.frame(
      Species = key[[1]],
      IQR_Sepal_Length = IQR(x$Sepal_Length, na.rm = TRUE)
    )
  }
)
# This is how you do the same thing using gapply - specify the schema
df3 <- SparkR::gapply(
  df1,
  c("Species"),
  function(key, x) {
    df_agg <- data.frame(
      Species = key[[1]],
      IQR_Sepal_Length = IQR(x$Sepal_Length, na.rm = TRUE)
    )
  },
  schema = "Species STRING, IQR_Sepal_Length DOUBLE"
)
SparkR::head(df3)
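If you later want a fully native route (the UDF approach above is flexible but slower), one hedged sketch is to approximate the IQR with Spark SQL's percentile_approx through SparkR::expr(); note this is an approximation, not base R's exact IQR():

# Assumption: SparkR::expr() evaluates a Spark SQL expression string
df4 <- SparkR::agg(
  SparkR::groupBy(df1, "Species"),
  IQR_Sepal_Length = SparkR::expr(
    "percentile_approx(Sepal_Length, 0.75) - percentile_approx(Sepal_Length, 0.25)"
  )
)
SparkR::head(df4)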
I have the following situation: I have several dataframes, and for each dataframe I would like to create two dataframes according to the value of one of the columns (log2FoldChange > 1 and log2FoldChange < -1).
For this I use the following code:
DJ29_T0_Overexpr = DJ29_T0[which(DJ29_T0$log2FoldChange > 1),]
DJ29_T0_Underexpr = DJ29_T0[which(DJ21_T0$log2FoldChange < -1),]
DJ29_T0 being one of my dataframes.
First problem: the sign for the dataframe where log2FoldChange < -1 is not taken into account.
But the main problem is at the time of making the function, I wrote the following:
spliteOverUnder <- function(res) {
  nm <- deparse(substitute(res))
  assign(paste(nm, "_Overexpr", sep = ""),
         res[which(as.numeric(as.character(res$log2FoldChange)) > 1), ])
  assign(paste(nm, "_Underexpr", sep = ""),
         res[which(as.numeric(as.character(res$log2FoldChange)) < -1), ])
}
Which I then ran with:
spliteOverUnder(DJ29_T0)
No error message, but my objects are not exported to my global environment. I tried return(paste(nm, "_Overexpr", sep = "")), but it only returns the object name, not the associated dataframe.
Using paste() forces the use of assign(), so I can't do:
spliteOverUnder <- function(res) {
  nm <- deparse(substitute(res))
  paste(nm, "_Overexpr", sep = "") <<- res[which(as.numeric(as.character(res$log2FoldChange)) > 1), ]
  paste(nm, "_Underexpr", sep = "") <<- res[which(as.numeric(as.character(res$log2FoldChange)) < -1), ]
}
spliteOverUnder(DJ24_T0)
I encounter the following error:
Error in paste(nm, "_Overexpr", sep = "") <<- res[which(as.numeric(as.character(res$log2FoldChange)) > :
could not find function "paste<-"
If you've encountered this difficulty before, I'd appreciate a little help. And once the function works, if you know how to apply it to each of my dataframes with a for loop over a list containing all of them, I'd welcome that too.
Thanks
When assigning, use the pos argument to hoist the new objects out of the function.
function() {
  assign(x = ..., value = ...,
         pos = 1  ## see below
  )
}
... where pos counts positions on the search list: pos = 1 is the first entry, i.e. the global environment. (To target the calling function's environment instead, pass envir = parent.frame() rather than pos.)
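A minimal, hypothetical demonstration (the names are made up) of assigning into the caller's environment explicitly with envir = parent.frame():

make_df <- function(name) {
  assign(x = name,
         value = data.frame(a = 1:3),
         envir = parent.frame())  # the environment of whoever called make_df
}
make_df("demo_df")
exists("demo_df")  # TRUE: demo_df now lives in the calling environment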
Edit
A general function to create the split dataframes in your global environment follows. However, you might rather save the new dataframes from within the function, or simply forward them to downstream functions, than cram your workspace with intermediary objects.
splitOverUnder <- function(the_name_of_the_frame) {
  df <- get(the_name_of_the_frame)
  df$cat <- cut(df$log2FoldChange,
                breaks = c(-Inf, -1, 1, Inf),
                labels = c('underexpr', 'normal', 'overexpr')
  )
  split_data <- split(df, df$cat)
  sapply(c('underexpr', 'overexpr'),
         function(n) {
           new_df_name <- paste(the_name_of_the_frame, n, sep = '_')
           assign(x = new_df_name,
                  value = split_data[[n]],  ## [[n]], not $n, so the label held in n is used
                  envir = .GlobalEnv
           )
         }
  )
}
## say, df1 and df2 are your initial dataframes to split:
sapply(c('df1', 'df2'), function(n) splitOverUnder(n))
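As a hedged alternative sketch (my addition, not part of the answer above): return the two subsets as a named list instead of assigning into the global environment. This also covers the for-loop part of the question, since lapply() can walk a list of dataframes.

splitOverUnder_list <- function(df) {
  list(overexpr  = df[df$log2FoldChange >  1, ],
       underexpr = df[df$log2FoldChange < -1, ])
}

res <- lapply(list(df1 = df1, df2 = df2), splitOverUnder_list)
res$df1$overexpr  # the over-expressed subset of df1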
I am using Seurat to analyze some scRNA-seq data. I have managed to put all the SCT integration one-liners from satijalab into a function, with basically
SCT_normalization <- function(f1, f2) {
  f_merge <- merge(f1, y = f2)
  f.list <- SplitObject(f_merge, split.by = "stim")
  f.list <- lapply(X = f.list, FUN = SCTransform)
  features <- SelectIntegrationFeatures(object.list = f.list, nfeatures = 3000)
  f.list <<- PrepSCTIntegration(object.list = f.list, anchor.features = features)
  return(f.list)
}
so that I will have f.list in the global environment for downstream analysis and plotting. The problem I am running into is that every time I run the function, the output is named f.list; I want it to be specific to the input names (i.e., f1 and/or f2), so that I know which input values were used to generate the final output. I saw something using the assign function, but someone wrote a warning about it being "the evil and wrong...", so I am not sure how to approach this.
From what it sounds like, you don't need to use the super-assignment operator <<-. In my opinion, <<- shouldn't be used at all, as it can cause unexpected changes to objects; I assume that is what the other person was warning about. For example, say you have the following function:
AverageVector <- function(v) x <<- mean(v, na.rm = TRUE)
Now you're trying to find the average of a vector you have, along with more analysis
library(tidyverse)
x <- unique(iris$Species)
avg_sl <- AverageVector(iris$Sepal.Length)
Now where x used to hold the unique species values, it's now a numeric vector of length 1.
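A hedged fix for that toy helper: drop <<- and simply return the value, capturing it with a normal assignment.

AverageVector <- function(v) mean(v, na.rm = TRUE)
avg_sl <- AverageVector(iris$Sepal.Length)  # x is left untouched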
So I would remove the <<- and call your function like this
object_list_1_2 <- SCT_normalization(object1, object2)
If you wanted a slightly more programmatic way to keep track of objects, you could do something like this:
SCT_normalization <- function(f1, f2) {
  f_merge <- merge(f1, y = f2)
  f.list <- SplitObject(f_merge, split.by = "stim")
  f.list <- lapply(X = f.list, FUN = SCTransform)
  features <- SelectIntegrationFeatures(object.list = f.list, nfeatures = 3000)
  f.list <- PrepSCTIntegration(object.list = f.list, anchor.features = features)
  to_return <- list(inputs = list(f1, f2), normalized = f.list)
  return(to_return)
}
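Hypothetical usage of that shape (the object names are placeholders): the returned list keeps the inputs alongside the result, so you can always recover which objects produced a given normalization.

result_1_2 <- SCT_normalization(object1, object2)
result_1_2$normalized  # the prepped f.list for downstream integration and plots
result_1_2$inputs      # the original f1 and f2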
I am writing an R function using dplyr 0.7.2 syntax to pass input and output data frame names and a column name to sort on. The following is the code I have.
# test data frame creation
lb <- data.frame(study = replicate(25, "ABC"),
                 subjid = c("x1", "x2", "x3", "x4", "x5"),
                 visit = c("SCREENING", "VISIT1", "VISIT2", "VISIT3", "EOT"),
                 visitn = c(-1, 1, 2, 3, 4),
                 param = c("ALB", "AST", "HGB", "HCT", "LDL"),
                 aval = replicate(5, sample(c(20:100), 1, rep = TRUE)))
# sort function - user provides input/output df names and a column name to sort on
sortdf <- function(ind, outd, col) {
  col <- enquo(col)
  outd <- ind %>% arrange(!!col)
  outd <<- outd  # return dataframe to workspace
}
sortdf(lb, lb_sort, visitn)
The above code works, but the output df name is not getting resolved to lb_sort; the output df is named after the parameter itself (outd). Need some help!
Thanks,
Prasanna
You do not need to make use of <<- in this context. In effect, your function is a wrapper for arrange:
my_sort <- function(df, col) {
  col <- enquo(col)
  df %>%
    arrange(!!col)
}
my_sort(df = lb, col = visitn)
Then you could create your objects as usual:
my_sort(df = lb, col = visitn) -> sorted_stuff
Edit
As per request, forcing the creation of a named object in the parent environment.
my_sort <- function(df, col, some_name) {
  col <- enquo(col)
  df %>%
    arrange(!!col) -> dta_a
  # Gather environment info
  e <- environment()  # current environment
  p <- parent.env(e)  # parent environment
  # Create object in parent env.
  assign(x = some_name,
         value = dta_a,
         envir = p)
  # If desired, return another object
  # return(some_other_data)
}
my_sort(df = lb, col = visitn, some_name ="created_data")
Explanation
The e and p objects capture the function's current environment and its parent (the enclosing environment in which the function was defined).
assign takes a string and creates a named object in that parent environment; since my_sort is defined at top level in this example, that is the global environment.
Remarks
This is odd behaviour when called as shown:
> ls()
[1] "lb"      "my_sort"
> my_sort(df = lb, col = visitn, some_name = "created_data")
> ls()
[1] "created_data" "lb"           "my_sort"
The function leaves a created_data object in the global environment. This is inconsistent with the expected behaviour, where the user would usually create objects explicitly:
my_sort(df = lb, col = visitn) -> created_data
and I wouldn't encourage using it. If the actual problem is about returning multiple objects, a better approach may be to pack all the results into a list and return that one list:
list(result_1 = mtcars,
result_2 = airquality)
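A minimal illustration of that pattern: the function builds one named list, and the caller picks out whichever pieces it needs (get_results is a made-up name).

get_results <- function() {
  list(result_1 = mtcars,
       result_2 = airquality)
}
res <- get_results()
head(res$result_1)  # one object returned, results retrieved by name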
I'm trying to use ezANOVA from the ez package within a function where I want to allow the dv to be specified using a parameter. Normally, ezANOVA will accept the column variable as a symbol or character string (see "This Works" below). However, trying to give ezANOVA a parameter that holds a symbol or character doesn't work (see "This Does Not Work" below). ezANOVA complains that '"the_dv" is not a variable in the data frame provided'. I've tried wrapping the variable name in various methods like as.symbol(), as.formula(), and even tried various ways to incorporate eval() and substitute(), but all with no luck. How is this achieved?
If the why of it helps, I have a project where I need to do many compound analyses (means, ANOVAs, post-hocs, graphs) that are identical except for the dataset or the variable being analyzed. I want a function so I can write it once and run it many times. The code below is just a simple example.
library(ez)
df <- data.frame(ID = as.factor(101:120),
                 Training = rep(c("Jedi", "Sith"), 10),
                 Wins = sample(1:50, 20),
                 Losses = sample(1:50, 20))
# ----------
# This Works
# ----------
myfunc1 <- function(the_data) {
  ezANOVA(
    data = the_data,
    wid = ID,
    dv = Wins,
    between = Training
  )
}
myfunc1(the_data = df)
# ------------------
# This Does Not Work
# ------------------
myfunc2 <- function(the_data, the_dv) {
  ezANOVA(
    data = the_data,
    wid = ID,
    dv = the_dv,
    between = Training
  )
}
myfunc2(the_data = df, the_dv = Wins)  # the_dv = 'Wins' also fails
Had to solve this one myself. Turns out that a combination of eval() and substitute() solves this puzzle:
# --------------
# Aha, it works!
# --------------
library(ez)
df <- data.frame(ID = as.factor(101:120),
                 Training = rep(c("Jedi", "Sith"), 10),
                 Wins = sample(1:50, 20),
                 Losses = sample(1:50, 20))

myfunc2 <- function(the_data, the_dv) {
  eval(
    substitute(
      ezANOVA(data = the_data,
              wid = ID,
              dv = the_dv,
              between = Training),
      list(the_dv = the_dv)))
}

myfunc2(the_data = df, the_dv = 'Wins')
myfunc2(the_data = df, the_dv = 'Losses')
Enjoy!!
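For completeness, a hedged variant (my addition, not from the original answer): if you would rather pass a bare column name than a string, capture it with substitute() inside the wrapper before splicing it into the call.

myfunc3 <- function(the_data, the_dv) {
  dv_expr <- substitute(the_dv)  # capture the unevaluated column name
  eval(
    substitute(
      ezANOVA(data = the_data,
              wid = ID,
              dv = DV,
              between = Training),
      list(DV = dv_expr)))
}

myfunc3(the_data = df, the_dv = Wins)  # bare name now works
myfunc3(the_data = df, the_dv = Losses)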