I'd like to know how to create an object out of the preProcess function (from the 'caret' package).
The following code isn't creating any object and I don't understand why:
function(dt1, dt2, norm = "spatialSign"){
  X <- dt1[ , -ncol(dt1)]
  Y <- dt1[ , ncol(dt1)]
  t <- holdout(Y, ratio = 8/10, mode = "random")
  prepr <- preProcess(X[t$tr, ], method = norm)
}
I would like to obtain a prepr object at the end of the function call.
Can you help me?
This is a fairly basic question; you can find more information on writing functions here. For now, there are multiple solutions, depending on how you want to call your function.
Define your function so that its last expression is the object you want returned:
library(caret)   # preProcess()
library(rminer)  # holdout()

my_func <- function(dt1, dt2, norm = "spatialSign"){
  X <- dt1[ , -ncol(dt1)]
  Y <- dt1[ , ncol(dt1)]
  t <- holdout(Y, ratio = 8/10, mode = "random")
  prepr <- preProcess(X[t$tr, ], method = norm)
  prepr  # the last expression is what the function returns
}
Then call your function and store the result in an object:
my_outcome <- my_func(dt1, dt2)
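The returned preProcess object can then be used downstream. As a minimal sketch (assuming the data layout from your question; predict() applied to a preProcess object is how caret applies the learned transformation):

# apply the learned transformation to the predictor columns
dt1_transformed <- predict(my_outcome, newdata = dt1[ , -ncol(dt1)])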
Alternatively, assign the local prepr variable to a global variable with the <<- operator:
my_func <- function(dt1, dt2, norm = "spatialSign"){
  X <- dt1[ , -ncol(dt1)]
  Y <- dt1[ , ncol(dt1)]
  t <- holdout(Y, ratio = 8/10, mode = "random")
  prepr <<- preProcess(X[t$tr, ], method = norm)
}
and then use the function as is in your code.
my_func(dt1, dt2)
But I recommend the first option: it shows more clearly what is going on, and if you want to make changes to the function, they need to be made in only one place.
I am writing a package where a large number of different methods will take the same pattern of arguments and build a dataframe. I am trying to make a helper function that would call substitute on various passed parameters to identify columns in a dataframe, and I can't figure out how to get it to work two levels up. Here is a small example (the real one would have Yobs, B, Z, siteID all as different variables to be fetched from the passed data):
library(tibble)  # rlang is also used below, via rlang::env()

worker <- function(YY, data = NULL, env = NULL) {
  vv <- substitute(YY, env = env)
  res <- tibble(V = eval(vv, data))
  return(res)
}

compare_methods <- function(Yobs, data = NULL) {
  env <- rlang::env()
  dat <- worker(Yobs, data = data, env = env)
  return(dat)
}

dat <- tibble(kitty = 1:10, pig = LETTERS[1:10], orc = 1:10 * 10)
compare_methods(kitty, data = dat)
This is so I can avoid having all the variable names quoted in the function call, as in tidyverse methods. I think I am fundamentally misunderstanding some of the magic behind how the tidyverse passes variable names rather than strings, however, and perhaps I should be using an entirely different toolset here?
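For reference, the tidyverse toolset for exactly this is rlang's quosures: enquo() captures a caller's bare expression together with the environment it came from, so it can be forwarded through any number of intermediate calls, and eval_tidy() evaluates it against the data. A hedged sketch of how the example above is usually rewritten (enquo() and eval_tidy() are real rlang functions; the rest mirrors the example):

library(rlang)
library(tibble)

worker <- function(yy_quo, data = NULL) {
  # eval_tidy() evaluates the quosure, using `data` as a data mask
  tibble(V = eval_tidy(yy_quo, data))
}

compare_methods <- function(Yobs, data = NULL) {
  # enquo() captures the caller's bare expression plus its environment
  yy_quo <- enquo(Yobs)
  worker(yy_quo, data = data)
}

dat <- tibble(kitty = 1:10, pig = LETTERS[1:10], orc = 1:10 * 10)
compare_methods(kitty, data = dat)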
I need to calculate an aggregate using the native R function IQR.
df1 <- SparkR::createDataFrame(iris)
df2 <- SparkR::agg(
  SparkR::groupBy(df1, "Species"),
  IQR_Sepal_Length = IQR(df1$Sepal_Length, na.rm = TRUE)
)
returns
Error in as.numeric(x): cannot coerce type 'S4' to vector of type 'double'
How can I do it?
This is exactly what gapply, dapply, and gapplyCollect were created for! Essentially, you can run a user-defined function in Spark; it will not run as optimally as native Spark functions, but at least you will get what you want.
I would suggest starting with gapplyCollect, then moving on to gapply.
df1 <- SparkR::createDataFrame(iris)

# gapplyCollect does not require you to specify an output schema,
# but it collects all the distributed workload back to the driver node,
# so it is not efficient if you expect a huge output
df2 <- SparkR::gapplyCollect(
  df1,
  c("Species"),
  function(key, x) {
    df_agg <- data.frame(
      Species = key[[1]],
      IQR_Sepal_Length = IQR(x$Sepal_Length, na.rm = TRUE)
    )
    df_agg
  }
)
# This is how you do the same thing using gapply: specify the schema
df3 <- SparkR::gapply(
  df1,
  c("Species"),
  function(key, x) {
    df_agg <- data.frame(
      Species = key[[1]],
      IQR_Sepal_Length = IQR(x$Sepal_Length, na.rm = TRUE)
    )
    df_agg
  },
  schema = "Species STRING, IQR_Sepal_Length DOUBLE"
)
SparkR::head(df3)
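If you would rather stay in native Spark functions, one option is to approximate the IQR as a difference of percentiles. An untested sketch: percentile_approx is a Spark SQL built-in, exposed here through SparkR::expr, so the result is approximate rather than R's exact IQR:

df4 <- SparkR::agg(
  SparkR::groupBy(df1, "Species"),
  IQR_Sepal_Length = SparkR::expr(
    "percentile_approx(Sepal_Length, 0.75) - percentile_approx(Sepal_Length, 0.25)"
  )
)
SparkR::head(df4)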
I am using Seurat to analyze some scRNA-seq data. I have managed to put all the one-line SCT integration commands from satijalab into a function, basically:
SCT_normalization <- function(f1, f2) {
  f_merge <- merge(f1, y = f2)
  f.list <- SplitObject(f_merge, split.by = "stim")
  f.list <- lapply(X = f.list, FUN = SCTransform)
  features <- SelectIntegrationFeatures(object.list = f.list, nfeatures = 3000)
  f.list <<- PrepSCTIntegration(object.list = f.list, anchor.features = features)
  return(f.list)
}
so that I will have f.list in the global environment for downstream analysis and making plots. The problem I am running into is that every time I run the function, the output is named f.list; I want it to be specific to the input value names (i.e., f1 and/or f2), basically something I can set so that I know which input values were used to generate the final output. I saw something about using the assign function, but someone wrote a warning about it being "evil and wrong...", so I am not sure how to approach this.
From what it sounds like, you don't need the superassignment operator <<-. In my opinion, <<- should be avoided, as it can cause unexpected changes to objects; this is presumably what the warning you saw was about. For example, suppose you have the following function:
AverageVector <- function(v) x <<- mean(v, na.rm = TRUE)
Now suppose you're finding the average of a vector as part of a larger analysis:
library(tidyverse)
x <- unique(iris$Species)
avg_sl <- AverageVector(iris$Sepal.Length)
Now, where x used to hold the species values, it's a numeric vector of length 1.
So I would remove the <<- and call your function like this:
object_list_1_2 <- SCT_normalization(object1, object2)
If you wanted a slightly more programmatic way to keep track of objects, you could do something like this:
SCT_normalization <- function(f1, f2) {
  f_merge <- merge(f1, y = f2)
  f.list <- SplitObject(f_merge, split.by = "stim")
  f.list <- lapply(X = f.list, FUN = SCTransform)
  features <- SelectIntegrationFeatures(object.list = f.list, nfeatures = 3000)
  f.list <- PrepSCTIntegration(object.list = f.list, anchor.features = features)
  to_return <- list(inputs = list(f1, f2), normalized = f.list)
  return(to_return)
}
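If you also want the output labelled by the names of the inputs (rather than just their values), base R can recover the expression a caller passed with deparse(substitute()). A small hedged sketch; name_result_by_inputs is a hypothetical helper, not Seurat API:

# Hypothetical helper: label a result by the names of the caller's arguments
name_result_by_inputs <- function(f1, f2) {
  # substitute() captures the unevaluated argument; deparse() makes it text
  paste(deparse(substitute(f1)), deparse(substitute(f2)), sep = "_")
}

obj_a <- 1; obj_b <- 2
name_result_by_inputs(obj_a, obj_b)
# [1] "obj_a_obj_b"

You could store that string in the returned list (e.g. as a label element) instead of creating differently named globals.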
I'm able to successfully modify the behaviour of [.data.frame, but fail to do so for [.data.table.
For data.frame:
# Exact same signature as "[.data.frame":
"[.my.data.frame" <- function(x, i, j,
                              drop = if (missing(i)) TRUE
                                     else length(cols) == 1) {
  if (!missing(j) && j == 8) {
    cat("Oy vey\n")
  }
  NextMethod()
}
df <- data.frame(a=1,b=2)
class(df) <- c("my.data.frame", class(df))
# Works as expected:
df[1,2] # 2
df[1,8] # Oy Vey NULL
df[1,] # 1 2
However, for (the considerably more complicated) data.table:
# Exact same signature as "[.data.table":
"[.my.data.table" <- function(x, i, j, by, keyby, with = TRUE,
                              nomatch = getOption("datatable.nomatch"),
                              mult = "all", roll = FALSE,
                              rollends = if (roll == "nearest") c(TRUE, TRUE)
                                         else if (roll >= 0) c(FALSE, TRUE)
                                         else c(TRUE, FALSE),
                              which = FALSE, .SDcols,
                              verbose = getOption("datatable.verbose"),
                              allow.cartesian = getOption("datatable.allow.cartesian"),
                              drop = NULL, on = NULL) {
  if (!missing(j) && j == 8) {
    cat("Oy vey\n")
  }
  NextMethod()
}
library(data.table)
dt <- data.table(a = 1, b = 2)
class(dt) <- c("my.data.table", class(dt))
dt[1,2] # ERROR: i is not found in calling scope and it is not a column of type logical. When the first argument inside DT[...] is a single symbol, data.table looks for it in calling scope.
I know better than to pass arguments to NextMethod. It looks like I must call [.data.table explicitly, capturing and passing the arguments as unevaluated promises, but all my attempts with quote, substitute, or match.call have so far failed. Any insight would be appreciated.
I've found a partial solution; posting it here in the hope that someone might improve on it.
"[.my.data.table" <- function (x, ...) {
# Modifications and tests galore - which can be tricky with this signature
class(x) <- class(x)[-1]
ret <- x[...]
class(x) <- c("my.data.table", class(x))
ret
}
I still consider this partial because actually doing something in the function probably involves at least something like arglist <- list(...), and that fails when [ is called like this:
dt[1,]
Other directions are still very welcome.
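One direction for the inspection part: look at the unevaluated call with match.call() instead of forcing the dots with list(...), so an empty index survives as the empty symbol rather than erroring. An untested sketch building on the above (quote(expr = ) is base R's way of writing the empty symbol):

"[.my.data.table" <- function(x, ...) {
  mc   <- match.call(expand.dots = TRUE)
  args <- as.list(mc)[-(1:2)]   # drop the function name and x
  if (length(args) >= 2L) {
    j <- args[[2L]]
    # skip missing indices (as in dt[1, ]) and non-literal j expressions
    if (!identical(j, quote(expr = )) && is.numeric(j) && j == 8) {
      cat("Oy vey\n")
    }
  }
  class(x) <- class(x)[-1]
  ret <- x[...]
  class(x) <- c("my.data.table", class(x))
  ret
}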
I am trying to create a function that uses ddply to summarize data about a particular column that I pass in. I am able to reference the column I want outside of ddply, but I'm not sure how to do it within ddply:
exp_group = c('test','test','control','control')
value = c(1,3,2,3)
df <- data.frame(exp_group, value)
library(plyr)

compare_means <- function(df, cols_detail, col_to_eval) {
  df_int <- df[, c(cols_detail, col_to_eval)]  # this part works fine
  summary <- ddply(df_int,
                   .(exp_group),
                   summarize,
                   mean = t.test(col_to_eval)$estimate,        # these ones don't
                   lo_bound = t.test(col_to_eval)$conf.int[1],
                   hi_bound = t.test(col_to_eval)$conf.int[2])
  return(summary)
}
test <- compare_means(df, 'exp_group', 'value')
When I do this, it returns an error: col_to_eval not found. I've also tried it with df_int[, col_to_eval], as well as df_int[, 2] (the column index), and then it says df_int not found. I want to find the means of the test and control groups.
How do I reference the column I want in the t.test calls?
OK, I went through a few iterations and finally got it to work by doing this:
library(plyr)

exp_group <- c('test', 'test', 'control', 'control')
value <- c(1, 3, 2, 3)
df <- data.frame(exp_group, value)

compare_means <- function(df, cols_detail, col_to_eval) {
  df_int <- df[, c(cols_detail, col_to_eval)]
  summary <- ddply(df_int,
                   .(exp_group),
                   function(x) {
                     mean <- t.test(x[, col_to_eval])$estimate
                     lo_bound <- t.test(x[, col_to_eval])$conf.int[1]
                     hi_bound <- t.test(x[, col_to_eval])$conf.int[2]
                     data.frame(mean, lo_bound, hi_bound)
                   })
  return(summary)
}
test <- compare_means(df, 'exp_group', 'value')
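As an aside, plyr has since been superseded by dplyr. A roughly equivalent hedged sketch using dplyr's .data pronoun (assuming dplyr >= 1.0), which resolves string column names without the environment lookup problem that summarize ran into:

library(dplyr)

compare_means_dplyr <- function(df, group_col, col_to_eval) {
  df %>%
    group_by(.data[[group_col]]) %>%
    summarise(
      mean     = t.test(.data[[col_to_eval]])$estimate,
      lo_bound = t.test(.data[[col_to_eval]])$conf.int[1],
      hi_bound = t.test(.data[[col_to_eval]])$conf.int[2]
    )
}

compare_means_dplyr(df, 'exp_group', 'value')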