SparkR: custom aggregate function - r

I need to calculate a grouped aggregate using the native R function IQR.
df1 <- SparkR::createDataFrame(iris)
df2 <- SparkR::agg(SparkR::groupBy(df1, "Species"),
IQR_Sepal_Length=IQR(df1$Sepal_Length, na.rm = TRUE)
)
returns
Error in as.numeric(x): cannot coerce type 'S4' to vector of type 'double'
How can I do it?

This is exactly what gapply, dapply, and gapplyCollect were created for. They let you run a user-defined R function in Spark; it will not run as efficiently as native Spark functions, but it will give you the result you want.
I suggest starting with gapplyCollect and then moving on to gapply.
df1 <- SparkR::createDataFrame(iris)
# gapplyCollect does not require you to specify output schema
# but, it will collect all the distributed workload back to driver node
# hence, it is not efficient if you expect huge sized output
df2 <- SparkR::gapplyCollect(
df1,
c("Species"),
function(key, x){
df_agg <- data.frame(
Species = key[[1]],
IQR_Sepal_Length = IQR(x$Sepal_Length, na.rm = TRUE)
)
}
)
# This is how you do the same thing using gapply - specify schema
df3 <- SparkR::gapply(
df1,
c("Species"),
function(key, x){
df_agg <- data.frame(
Species = key[[1]],
IQR_Sepal_Length = IQR(x$Sepal_Length, na.rm = TRUE)
)
},
schema = "Species STRING, IQR_Sepal_Length DOUBLE"
)
SparkR::head(df3)
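Because gapply and gapplyCollect call the function once per group with a key and an ordinary R data.frame, the function can be checked locally before it is sent to Spark. The following is only a sketch of that idea in base R, not SparkR itself: iqr_by_group, iris_local, and local_result are names made up for this illustration, and the renaming step mimics the dots-to-underscores column names (Sepal_Length) seen in the question.

```r
# Per-group function with the (key, x) signature used by gapply/gapplyCollect.
iqr_by_group <- function(key, x) {
  data.frame(
    Species = key[[1]],
    IQR_Sepal_Length = IQR(x$Sepal_Length, na.rm = TRUE),
    stringsAsFactors = FALSE
  )
}

# Assumption: a local copy of iris with underscores instead of dots,
# so the column names match what the Spark DataFrame exposes.
iris_local <- iris
names(iris_local) <- gsub(".", "_", names(iris_local), fixed = TRUE)
iris_local$Species <- as.character(iris_local$Species)

# Local split-apply-combine: call the function once per group, then row-bind.
groups <- split(iris_local, iris_local$Species)
local_result <- do.call(
  rbind,
  lapply(names(groups), function(g) iqr_by_group(list(g), groups[[g]]))
)
print(local_result)
```

If the function behaves as expected here, the same function object can be passed to gapplyCollect or gapply unchanged.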

Related

Getting substitute to substitute two levels up in R

I am writing a package where a large number of different methods will take the same pattern of arguments and build a dataframe. I am trying to make a helper function that would call substitute on various passed parameters to identify columns in a dataframe, and I can't figure out how to get it to work two levels up. Here is a small example (the real one would have Yobs, B, Z, siteID all as different variables to be fetched from the passed data):
worker <- function( YY,
data = NULL,
env = NULL ) {
vv = substitute(YY, env=env)
res <- tibble( V = eval(vv, data) )
return( res )
}
compare_methods <- function(Yobs, data = NULL ) {
env = rlang::env()
dat = worker( Yobs, data=data, env = env )
return( dat )
}
dat = tibble( kitty = 1:10, pig = LETTERS[1:10], orc = 1:10 * 10 )
compare_methods( kitty, data=dat )
The goal is to avoid quoting the variable names in the function call, as tidyverse functions allow. I think I am fundamentally misunderstanding some of the magic behind how the tidyverse passes variable names as bare symbols rather than strings, however, and perhaps I should be using an entirely different toolset here?
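One possible way to reach the expression two levels up, sketched here in base R under a few assumptions: the env argument is dropped, data.frame stands in for tibble so the example has no dependencies, and worker is always called directly from a function that forwards its own argument. The trick is that substitute(substitute(YY)) builds the call substitute(Yobs), which is then evaluated in the caller's frame. rlang's enquo() is the tidyverse route and may be a better fit for a package.

```r
worker <- function(YY, data = NULL) {
  # Inner step: substitute(substitute(YY)) yields the call substitute(Yobs).
  # Evaluating that call in the caller's frame returns what the user typed.
  vv <- eval.parent(substitute(substitute(YY)))
  # Look the expression up in `data`, falling back to the user's environment.
  data.frame(V = eval(vv, data, parent.frame(2)))
}

compare_methods <- function(Yobs, data = NULL) {
  worker(Yobs, data = data)
}

dat <- data.frame(kitty = 1:10, pig = LETTERS[1:10], orc = 1:10 * 10)
res <- compare_methods(kitty, data = dat)
head(res)
```

This sketch only works for exactly one level of forwarding; each additional wrapper would need its own substitute step, which is the bookkeeping that quosures are designed to remove.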

Name seurat function in r with name of each experiment/variable

I am using Seurat to analyze some scRNAseq data. I have managed to put all the one-line SCT integration steps from satijalab into a function, basically:
SCT_normalization <- function (f1, f2) {
f_merge <- merge (f1, y=f2)
f.list <- SplitObject(f_merge, split.by = "stim")
f.list <- lapply(X = f.list, FUN = SCTransform)
features <- SelectIntegrationFeatures(object.list = f.list, nfeatures = 3000)
f.list <<- PrepSCTIntegration(object.list = f.list, anchor.features = features)
return (f.list)
}
so that I will have f.list in the global environment for downstream analysis and plotting. The problem is that every time I run the function, the output is named f.list; I want the name to be specific to the input objects (i.e., f1 and/or f2), so that I know which inputs were used to generate the final output. I saw a suggestion to use the assign function, but someone wrote a warning about "the evil and wrong...", so I am not sure how to approach this.
It sounds like you don't need the super-assignment operator <<-. In my opinion, <<- should be avoided because it can cause unexpected changes to objects, which is probably what the other person meant. For example, suppose you have the following function:
AverageVector <- function(v) x <<- mean(v, na.rm = TRUE)
Now you use it to find the average of a vector as part of a larger analysis:
library(tidyverse)
x <- unique(iris$Species)
avg_sl <- AverageVector(iris$Sepal.Length)
Where x used to be a factor holding the species names, it is now a numeric vector of length 1.
So I would remove the <<- and call your function like this:
object_list_1_2 <- SCT_normalization(object1, object2)
If you want a slightly more programmatic way to keep track of the objects, you could do something like this:
SCT_normalization <- function(f1, f2) {
f_merge <- merge (f1, y = f2)
f.list <- SplitObject(f_merge, split.by = "stim")
f.list <- lapply(X = f.list, FUN = SCTransform)
features <- SelectIntegrationFeatures(object.list = f.list, nfeatures = 3000)
f.list <- PrepSCTIntegration(object.list = f.list, anchor.features = features)
to_return <- list(inputs = list(f1, f2), normalized = f.list)
return(to_return)
}
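If the aim is to record which input names produced a result, rather than storing the input objects themselves, deparse(substitute(...)) can capture the names as typed by the caller. The sketch below is only an illustration: combine_inputs is a hypothetical stand-in for the Seurat pipeline, and SCT_normalization_named, ctrl, and stim are made-up names.

```r
# Hypothetical stand-in for the Seurat steps: it just pairs the two inputs.
combine_inputs <- function(f1, f2) list(f1, f2)

SCT_normalization_named <- function(f1, f2) {
  # Capture the names of the objects exactly as the caller typed them.
  input_names <- c(deparse(substitute(f1)), deparse(substitute(f2)))
  result <- combine_inputs(f1, f2)
  list(inputs = input_names, normalized = result)
}

ctrl <- 1:3
stim <- 4:6
out <- SCT_normalization_named(ctrl, stim)
out$inputs
```

The returned list then carries its own provenance, so no global variable or assign call is needed.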

R dplyr 0.7.2 - functional programming. Resolving dataframe name

I am writing an R function using dplyr 0.7.2 syntax that takes input and output data frame names and a column name to sort on. The following is the code I have.
#test data frame creation
lb<- data.frame(study = replicate(25,"ABC"),
subjid = c("x1","x2","x3","x4","x5"),
visit = c("SCREENING","VISIT1","VISIT2","VISIT3","EOT"),
visitn = c(-1,1,2,3,4),
param = c("ALB","AST","HGB","HCT","LDL"),
aval = replicate(5, sample(c(20:100), 1, rep = TRUE)))
#sort function- user to provide input/output df names and column name to sort on
sortdf <- function(ind,outd,col){
col <- enquo(col)
outd <- ind %>% arrange(!!col)
outd <<- outd # return dataframe to workspace
}
sortdf(lb,lb_sort, visitn)
The above code works, but the output data frame name is not resolved to lb_sort; the output data frame is named after the parameter (outd) instead. Need some help!
Thanks,
Prasanna
You do not need to use <<- in this context. In effect, your function is a wrapper for arrange:
my_sort <- function(df, col) {
col <- enquo(col)
df %>%
arrange(!!col)
}
my_sort(df = lb, col = visitn)
Then you could create your objects as usual:
my_sort(df = lb, col = visitn) -> sorted_stuff
Edit
As per request, forcing creation of a named object in the parent environment.
my_sort <- function(df, col, some_name) {
col <- enquo(col)
df %>%
arrange(!!col) -> dta_a
# Gather env. inf
e <- environment() # current environment
p <- parent.env(e)
# Create object in parent env.
assign(x = some_name,
value = dta_a,
envir = p)
# If desired return another object
# return(some_other_data)
}
my_sort(df = lb, col = visitn, some_name ="created_data")
Explanation
The e and p objects hold the function's current environment and its parent (enclosing) environment.
assign takes a string and creates an object with that name in the function's parent environment, which is the global environment when the function is defined and called as in the example.
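One detail worth checking is that parent.env(environment()) is the environment where the function was defined, which is not necessarily the environment of the caller; parent.frame() gives the latter. The sketch below, with made-up function names, shows the difference when the function is called from inside another function.

```r
# Writes into the environment where this function was DEFINED.
assign_enclosing <- function(name, value) {
  assign(name, value, envir = parent.env(environment()))
  invisible(NULL)
}

# Writes into the environment of whoever CALLED this function.
assign_caller <- function(name, value) {
  assign(name, value, envir = parent.frame())
  invisible(NULL)
}

caller <- function() {
  assign_enclosing("a", 1)
  assign_caller("b", 2)
  # Check which of the two names now exist locally inside caller().
  here <- environment()
  c(a = exists("a", envir = here, inherits = FALSE),
    b = exists("b", envir = here, inherits = FALSE))
}

created <- caller()
created
```

Only b is created inside caller(); a ends up in the environment where assign_enclosing was defined. When everything is run at top level, as in the answer, the two environments coincide.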
Remarks
This is odd behaviour, when called as shown:
>> ls()
[1] "lb" "my_sort"
>> my_sort(df = lb, col = visitn, some_name ="created_data")
>> ls()
[1] "created_data" "lb" "my_sort"
The function leaves "created_data" object in global environment. This is inconsistent with expected behaviour where the user would usually create objects:
my_sort(df = lb, col = visitn) -> created_data
and I wouldn't encourage using it. If the actual problem is returning multiple objects, a potentially better approach is to pack all the results into a list and return that one list:
list(result_1 = mtcars,
result_2 = airquality)

How to create an object out of the 'preProcess' function in R?

I'd like to know how to create an object out of the preProcess function (from the 'caret' package).
The following code isn't creating any object and I don't understand why :
function(dt1, dt2, norm = "spatialSign"){
X<-dt1[ ,-ncol(dt1)]
Y<-dt1[ ,ncol(dt1)]
t<-holdout(Y, ratio = 8/10, mode = "random")
prepr<-preProcess(X[t$tr, ], method = norm)}
I would like to obtain a prepr object at the end of the function call.
Can you help me ?
This is a fairly basic question, so a general introduction to R functions is worth reading for more background.
For now, there are multiple solutions, depending on how you want to call your function.
One solution is to define your function under a name:
my_func <- function(dt1, dt2, norm = "spatialSign"){
  X <- dt1[ , -ncol(dt1)]
  Y <- dt1[ , ncol(dt1)]
  t <- holdout(Y, ratio = 8/10, mode = "random")
  prepr <- preProcess(X[t$tr, ], method = norm)
}
Then call your function and store its result in an object:
my_outcome <- my_func(dt1, dt2)
Alternatively, assign the local prepr variable to a global variable with the <<- operator:
my_func <- function(dt1, dt2, norm = "spatialSign"){
  X <- dt1[ , -ncol(dt1)]
  Y <- dt1[ , ncol(dt1)]
  t <- holdout(Y, ratio = 8/10, mode = "random")
  prepr <<- preProcess(X[t$tr, ], method = norm)
}
and then call the function as is in your code:
my_func(dt1, dt2)
I recommend the first option, though, as it shows more clearly what is going on, and if you want to change the function it only needs to be done in one place.
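The reason the first option works even without return() is that an R function returns the value of its last expression, and an assignment returns its value invisibly. A small base-R sketch, with made-up function names and no caret involved, illustrates this:

```r
# Last expression is an assignment: the value is returned, but invisibly,
# so calling the function at the console prints nothing.
center_invisible <- function(x) {
  centered <- x - mean(x)
}

# Same computation with an explicit return, so the value also prints.
center_visible <- function(x) {
  centered <- x - mean(x)
  return(centered)
}

a <- center_invisible(c(1, 2, 3))  # the value is still captured here
b <- center_visible(c(1, 2, 3))
identical(a, b)
```

Either way, the local variable itself disappears when the function exits; only the returned value survives, and only if the caller stores it.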

Referencing a column name within function and ddply

I am trying to create a function that uses ddply to summarize data about a particular column that I pass in. I am able to reference the column I want outside of ddply, but I'm not sure how to do it within ddply:
exp_group = c('test','test','control','control')
value = c(1,3,2,3)
df <- data.frame(exp_group, value)
compare_means <- function(df,cols_detail, col_to_eval){
df_int <- df[, c(cols_detail, col_to_eval)] # this part works fine
summary <- ddply(df_int
, .(exp_group)
, summarize
, mean = t.test(col_to_eval)$estimate #these ones don't
, lo_bound = t.test(col_to_eval)$conf.int[1]
, hi_bound = t.test(col_to_eval)$conf.int[2]
)
return(summary)
}
test <- compare_means(df, 'exp_group','value')
When I do this, I get an error that col_to_eval is not found. I've also tried df_int[,col_to_eval] and df_int[,2] (the column index), and then it says df_int is not found.
I want to find the means of the test and control groups.
How do I reference the column I want in the t.test calls?
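OK, I went through a few iterations and finally got it to work by passing ddply an anonymous function that receives each group's data frame as x:
library(plyr)  # provides ddply and .()
exp_group = c('test','test','control','control')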
value = c(1,3,2,3)
df <- data.frame(exp_group, value)
compare_means <- function(df,cols_detail, col_to_eval){
df_int <- df[, c(cols_detail, col_to_eval)]
summary <- ddply(df_int
, .(exp_group)
, function(x){
mean = t.test(x[,col_to_eval])$estimate
lo_bound = t.test(x[,col_to_eval])$conf.int[1]
hi_bound = t.test(x[,col_to_eval])$conf.int[2]
data.frame(mean, lo_bound, hi_bound)
}
)
return(summary)
}
test <- compare_means(df, 'exp_group','value')
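The same pattern, a per-group function that indexes the group's data frame with a column name held in a string, can be sketched without plyr. This is only a base-R analogue with a made-up name (compare_means_base); unlike the version above, it also takes the grouping column as a string and runs t.test once per group.

```r
df <- data.frame(exp_group = c("test", "test", "control", "control"),
                 value = c(1, 3, 2, 3))

compare_means_base <- function(df, group_col, col_to_eval) {
  # One data frame per group, keyed by the group label.
  pieces <- split(df, df[[group_col]])
  rows <- lapply(names(pieces), function(g) {
    tt <- t.test(pieces[[g]][[col_to_eval]])  # one test per group
    data.frame(group = g,
               mean = unname(tt$estimate),
               lo_bound = tt$conf.int[1],
               hi_bound = tt$conf.int[2])
  })
  do.call(rbind, rows)
}

res <- compare_means_base(df, "exp_group", "value")
print(res)
```

Using [[col_to_eval]] avoids the non-standard evaluation that made summarize look for a column literally named col_to_eval.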
