I am using Seurat to analyze some scRNA-seq data. I have managed to put the one-line SCT integration commands from the satijalab vignette into a function, basically:
SCT_normalization <- function(f1, f2) {
  f_merge <- merge(f1, y = f2)
  f.list <- SplitObject(f_merge, split.by = "stim")
  f.list <- lapply(X = f.list, FUN = SCTransform)
  features <- SelectIntegrationFeatures(object.list = f.list, nfeatures = 3000)
  f.list <<- PrepSCTIntegration(object.list = f.list, anchor.features = features)
  return(f.list)
}
so that I will have f.list in the global environment for downstream analysis and plotting. The problem I am running into is that every time I run the function, the output is called f.list; I want the name to reflect the inputs (i.e., f1 and/or f2), so that I can tell which input objects were used to generate a given output. I saw something about using the assign function, but someone wrote a warning about it being "evil and wrong...", so I am not sure how to approach this.
From what it sounds like, you don't need the superassignment operator <<-. In my opinion, <<- should be avoided because it can cause unexpected changes to objects in other environments, which I assume is what the other person was warning about. For example, say you have the following function:
AverageVector <- function(v) x <<- mean(v, na.rm = TRUE)
and later you try to find the average of a vector as part of some other analysis:
library(tidyverse)
x <- unique(iris$Species)
avg_sl <- AverageVector(iris$Sepal.Length)
Now where x used to be a character vector, it's now a numeric vector of length 1.
So I would remove the <<- and call your function like this:
object_list_1_2 <- SCT_normalization(object1, object2)
If you wanted a slightly more programmatic way to keep track of which objects were used, you could do something like this:
SCT_normalization <- function(f1, f2) {
  f_merge <- merge(f1, y = f2)
  f.list <- SplitObject(f_merge, split.by = "stim")
  f.list <- lapply(X = f.list, FUN = SCTransform)
  features <- SelectIntegrationFeatures(object.list = f.list, nfeatures = 3000)
  f.list <- PrepSCTIntegration(object.list = f.list, anchor.features = features)
  to_return <- list(inputs = list(f1, f2), normalized = f.list)
  return(to_return)
}
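If you also want the result to carry a record of which objects went into it, a hedged variant can capture the argument names with deparse(substitute()) and attach them to the returned list (SCT_normalization_tracked is just an illustrative name; the Seurat steps are unchanged from above):
SCT_normalization_tracked <- function(f1, f2) {
  ## record the names the caller used for f1 and f2 (e.g. "object1", "object2")
  input_names <- c(deparse(substitute(f1)), deparse(substitute(f2)))
  f_merge <- merge(f1, y = f2)
  f.list <- SplitObject(f_merge, split.by = "stim")
  f.list <- lapply(X = f.list, FUN = SCTransform)
  features <- SelectIntegrationFeatures(object.list = f.list, nfeatures = 3000)
  f.list <- PrepSCTIntegration(object.list = f.list, anchor.features = features)
  list(inputs = setNames(list(f1, f2), input_names), normalized = f.list)
}
res_1_2 <- SCT_normalization_tracked(object1, object2)
names(res_1_2$inputs)   # "object1" "object2"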
I have the following situation: I have several dataframes, and for each of them I would like to create two dataframes according to the value of one of the columns (log2FoldChange > 1 and log2FoldChange < -1).
For this I use the following code:
DJ29_T0_Overexpr = DJ29_T0[which(DJ29_T0$log2FoldChange > 1),]
DJ29_T0_Underexpr = DJ29_T0[which(DJ21_T0$log2FoldChange < -1),]
DJ29_T0 being one of my dataframes.
First problem: the sign for the dataframe where log2FoldChange < -1 is not taken into account.
But the main problem is at the time of making the function, I wrote the following:
spliteOverUnder <- function(res){
  nm <- deparse(substitute(res))
  assign(paste(nm, "_Overexpr", sep = ""), res[which(as.numeric(as.character(res$log2FoldChange)) > 1),])
  assign(paste(nm, "_Underexpr", sep = ""), res[which(as.numeric(as.character(res$log2FoldChange)) < -1),])
}
Which I then ran with:
spliteOverUnder(DJ29_T0)
No error message, but my objects are not exported to my global environment. I tried return(paste(nm, "_Overexpr", sep = "")), but it only returns the object name, not the associated dataframe.
Using paste() forces the use of assign(), so I can't do:
spliteOverUnder <- function(res){
  nm <- deparse(substitute(res))
  paste(nm, "_Overexpr", sep = "") <<- res[which(as.numeric(as.character(res$log2FoldChange)) > 1),]
  paste(nm, "_Underexpr", sep = "") <<- res[which(as.numeric(as.character(res$log2FoldChange)) < -1),]
}
spliteOverUnder(DJ24_T0)
I encounter the following error:
Error in paste(nm, "_Overexpr", sep = "") <<- res[which(as.numeric(as.character(res$log2FoldChange)) > :
could not find function "paste<-"
If you've encountered this difficulty before, I'd appreciate a little help.
And once the function works, if you know how to use a for loop over a list containing all my dataframes to apply this function to each of them, I would be glad to hear that as well.
Thanks
When assigning, use the pos argument to hoist the new objects out of the function.
function(){
  assign(x = ..., value = ...,
         pos = 1  ## see below
  )
}
... where pos = 1 is the first entry on the search list, i.e. the global environment; the default, pos = -1, is the current (here, the function's local) environment. You can also pass an environment directly with the envir argument (e.g. envir = .GlobalEnv).
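For instance, a minimal sketch (make_x and x_from_function are just illustrative names):
make_x <- function() {
  assign(x = "x_from_function", value = 42, pos = 1)  # pos = 1 writes to the global environment
}
make_x()
x_from_function   # now exists in the global environment, value 42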
edit
A general function that creates the split dataframes in your global environment follows. However, you might rather save the new dataframes from within the function, or simply return them and hand them to downstream functions, than cram your workspace with intermediate objects (a sketch of that alternative follows the usage example below).
splitOverUnder <- function(the_name_of_the_frame){
  df <- get(the_name_of_the_frame)
  df$cat <- cut(df$log2FoldChange,
                breaks = c(-Inf, -1, 1, Inf),
                labels = c('underexpr', 'normal', 'overexpr')
                )
  split_data <- split(df, df$cat)
  sapply(c('underexpr', 'overexpr'),
         function(n){
           new_df_name <- paste(the_name_of_the_frame, n, sep = '_')
           assign(x = new_df_name,
                  value = split_data[[n]],  ## [[n]] (not $n) so that n is evaluated
                  envir = .GlobalEnv
                  )
         }
         )
}
## say, df1 and df2 are your initial dataframes to split:
sapply(c('df1', 'df2'), function(n) splitOverUnder(n))
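If you would rather not write into the global environment at all, here is a hedged sketch of the alternative mentioned above, returning the pieces as a list instead (splitOverUnder_list is just an illustrative name; df1 and df2 are assumed to have a log2FoldChange column):
splitOverUnder_list <- function(df){
  category <- cut(df$log2FoldChange,
                  breaks = c(-Inf, -1, 1, Inf),
                  labels = c('underexpr', 'normal', 'overexpr'))
  ## return only the under/over pieces, keyed by label
  split(df, category)[c('underexpr', 'overexpr')]
}
all_splits <- lapply(list(df1 = df1, df2 = df2), splitOverUnder_list)
all_splits$df1$overexpr   # the over-expressed rows of df1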
Sorry if this is a duplicate. I am very new to data.table. Basically, I am able to get my code to work outside of functions, but when I pack the operations inside of a function, they break down. Ultimately, I had hoped to make the functions age.inds and m.inds internal functions in a package.
# required functions ------------------------------------------------------
library(data.table)
# create object
create.obj <- function(n = 100){
  obj = list()
  obj$inds <- data.table(age = rep(0.1, n), m = NA)
  obj$m$model <- function(age, a){return(age^a)}
  obj$m$params <- list(a = 2)
  return(obj)
}
# calculate new 'age' of inds
age.inds <- function(obj){
  obj$inds[, age := age + 1]
  return(obj)
}
# calculate new 'm' of inds
m.inds <- function(obj){
  ARGS <- list()
  args.incl <- which(names(obj$m$params) %in% names(formals(obj$m$model)))
  ARGS <- c(ARGS, obj$m$params[args.incl])
  args.incl <- names(obj$inds)[names(obj$inds) %in% names(formals(obj$m$model))]
  ARGS <- c(ARGS, obj$inds[, ..args.incl]) # double dot '..' version
  # ARGS <- c(ARGS, inds[, args.incl, with = FALSE]) # 'with' version
  obj$inds[, m := do.call(obj$m$model, ARGS)]
  return(obj)
}
# advance object
adv.obj <- function(obj, times = 1){
  for(i in seq(times)){
    obj <- age.inds(obj)
    obj <- m.inds(obj)
  }
  return(obj)
}
# Example ----------------------------------------------------------------
# this doesn't work
obj <- create.obj(n = 10)
obj # so far so good
obj <- age.inds(obj)
obj # 'inds' gone
# would ultimately like to call adv.obj
obj <- adv.obj(obj, times = 5)
Also, as a side note: most of what I would like to do in my code is vectorized calculations (i.e. updating variables in obj$inds), so I don't even know if moving to data.table makes much sense for me (i.e. no grouping operations with by as of yet). I am dealing with large objects and wondered if switching from data.frame objects would speed things up (I can get my code to work using data.frames).
Thanks
Update
OK, the issue with the printing has been solved thanks to @eddi. I am, however, unable to use these "inds" functions when they are located internally within a package (i.e. not exported). I made a small package (DTtester) that has this example in the help file for adv.obj:
obj <- create.obj(n=10)
obj <- adv.obj(obj, times = 5)
# Error in `:=`(age, new.age) :
# Check that is.data.table(DT) == TRUE. Otherwise, := and `:=`(...) are
# defined for use in j, once only and in particular ways. See help(":=").
Any idea why the functions would fail in this way?
I am writing a function that calls another function (e.g. lm), and I would like to pass other arguments to it using the ellipsis (...). However, the data to be used is not in the global environment, but inside a list. A minimal example:
L <- list(data = chickwts, other = 1:5)
wrapper <- function(list, formula = NULL, ...){
  if (missing(formula)) formula <- formula(weight~feed)
  lm(formula, data = list$data, ...)
}
wrapper(L, subset = feed != "casein") #fails
I can make it work using attach, but I'm sure there are more efficient ways of doing it by specifying the evaluation frame...?
wrapper2 <- function(list, formula = NULL, ...){
  if (missing(formula)) formula <- formula(weight~feed)
  attach(list$data)
  m <- lm(formula, ...)
  detach(list$data)
  return(m)
}
wrapper2(L, subset = feed != "casein") #works
Another solution I have used before is to take list(...) and deal with the arguments manually, but that would not be practical in the real situation.
I can see that this is fairly basic, but I couldn't find a solution. Any suggestion to the specific problem and also a link to a good conceptual explanation of environments in general would be appreciated.
We would need to construct a call and eval it.
wrapper <- function(list, formula = NULL, ...){
  if (missing(formula)) formula <- weight ~ feed
  ## capture the call to wrapper() and rewrite it into a call to lm()
  cl <- match.call()
  cl$list <- NULL                # drop the argument lm() does not know about
  cl$formula <- formula
  cl$data <- quote(list$data)
  cl[[1]] <- quote(stats::lm)
  ## evaluate the rewritten call here, where `list` is in scope,
  ## so lm() sees data, subset, etc. directly
  eval(cl)
}
Reproducible example:
L <- list(data = trees, other = 1:5)
wrapper(L, Height ~ Girth, subset = Volume > 20)
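And with the chickwts example from the question, the call that failed with the naive wrapper should now go through (it uses the default formula weight ~ feed):
L <- list(data = chickwts, other = 1:5)
wrapper(L, subset = feed != "casein")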
uniq <- unique(file[,12])
pdf("SKAT.pdf")
for(i in 1:length(uniq)) {
  dat <- subset(file, file[,12] == uniq[i])
  names <- paste("Sample_filtered_on_", uniq[i], sep="")
  qq.chisq(-2*log(as.numeric(dat[,10])), df = 2, main = names, pvals = T,
           sub=subtitle)
}
dev.off()
file[,12] is an integer so I convert it to a factor when I'm trying to run it with by instead of a for loop as follows:
pdf("SKAT.pdf")
by(file, as.factor(file[,12]), function(x) { qq.chisq(-2*log(as.numeric(x[,10])), df = 2, main = paste("Sample_filtered_on_", file[1,12], sep=""), pvals = T, sub=subtitle) } )
dev.off()
It works fine to sort the data frame by this (now a factor) column. My problem is that for the plot title, I want to label it with the correct index from that column. This is easy to do in the for loop by uniq[i]. How do I do this in a by function?
Hope this makes sense.
A more vectorized (== cooler?) version would pull the common operations out of the loop and let R do the book-keeping about unique factor levels.
dat <- split(-2 * log(as.numeric(file[,10])), file[,12])
names(dat) <- paste0("IoOPanos_filtered_on_pc_", names(dat))
(paste0 is a convenience function for the common use case where one would otherwise use paste with the argument sep=""). The for loop is entirely appropriate when you're running it for its side effects (plotting pretty pictures) rather than trying to capture values for further computation. It's definitely un-cool to use T instead of TRUE, and seq_along(dat) means that your code won't produce unexpected results when length(dat) == 0.
pdf("SKAT.pdf")
for(i in seq_along(dat)) {
vals <- dat[[i]]
nm <- names(dat)[[i]]
qq.chisq(val, main = nm, df = 2, pvals = TRUE, sub=subtitle)
}
dev.off()
If you did want to capture values, the basic observation is that your function takes two arguments that vary. So by or tapply or sapply or ... are not appropriate; each of these assumes that just a single argument is varying. Instead, use mapply or the comparable Map:
Map(qq.chisq, dat, main=names(dat),
MoreArgs=list(df=2, pvals=TRUE, sub=subtitle))
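For reference, the mapply form of the same call would look something like this (SIMPLIFY = FALSE keeps one result per factor level, mirroring Map):
res <- mapply(qq.chisq, dat, main = names(dat),
              MoreArgs = list(df = 2, pvals = TRUE, sub = subtitle),
              SIMPLIFY = FALSE)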
How can I write this all in one line?
mydata is a "zoo" series, limit is a numeric vector of the same size
tmp <- ave(coredata(mydata), as.Date(index(mydata)),
FUN = function(x) cummax(x)-x)
tmp <- (tmp < limit)
final <- ave(tmp, as.Date(index(mydata)),
FUN = function(x) cumprod(x))
I've tried to use two vectors as arguments to ave(...) but it seems to accept just one, even if I join them into a matrix.
This is just an example; any other function could be used.
Here I need to compare the value of cummax(mydata) - mydata with a numeric vector, and once it surpasses it I'll keep zeros until the end of the day. The cummax is calculated from the beginning of each day.
If limit were a single number instead of a vector (with different possible values) I could write it as:
ave(coredata(mydata), as.Date(index(mydata)),
FUN = function(x) cumprod((cummax(x) - x) < limit))
But I can't pass in a vector longer than x there (it would need to have the same length as each day's data), and I don't know how to introduce it as another argument in ave().
It seems like this routine imposes an intraday stop-loss based on maximum drawdown. So I assume you want to be able to pass in the variable limit as a second argument to your aggregation function, which currently takes only one varying argument due to the way ave works.
If putting all this in one line is not an absolute must, I can share a function I've written that generalizes aggregation via "cut variables". Here's the code:
mtapplylist2 <- function(t, IDX, DEF, MoreArgs=NULL, ...)
{
  if(mode(DEF) != "list")
  {
    cat("Definition must be list type\n");
    return(NULL);
  }
  a <- c();
  colnames <- names(DEF);
  for ( i in 1:length(DEF) )
  {
    def <- DEF[[i]];
    func <- def[1];
    if(mode(func) == "character") { func <- get(func); }
    cols <- def[-1];
    # build the argument list to be called
    arglist <- list();
    arglist[[1]] <- func;
    for( j in 1:length(cols) )
    {
      col <- cols[j];
      grp <- split(t[,col], IDX);
      arglist[[1+j]] <- grp;
    }
    arglist[["MoreArgs"]] <- MoreArgs;
    v <- do.call("mapply", arglist);
    # print(class(v)); print(v);
    if(is.matrix(v))  # is.matrix(): comparing class(v) == "matrix" breaks in recent R, where matrices also have class "array"
    {
      a <- cbind(a, as.vector(v));
    } else {
      a <- cbind(a, v);
    }
  }
  colnames(a) <- colnames;
  return(a);
}
And you can use it like this:
# assuming you have the data in the data.frame
df <- data.frame(date=rep(1:10,10), ret=rnorm(100), limit=rep(c(0.25,0.50),50))
dfunc <- function(x, ...) { return(cummax(x)-x ) }
pfunc <- function(x,y, ...) { return((cummax(x)-x) < y) }
# assumes you have the function declared in the same namespace
def <- list(
"drawdown" = c("dfunc", "ret"),
"hasdrawdown" = c("pfunc", "ret", "limit")
);
# from R console
> def <- list("drawdown" = c("dfunc", "ret"),"happened" = c("pfunc","ret","limit"))
> dim( mtapplylist2(df, df$date, def) )
[1] 100 2
Notice that the "def" variable is a list containing the following items:
computed column name
vector arg function name as a string
name of the variable in the input data.frame that are inputs into the function
If you look at the guts of "mtapplylist2" function, the key components would be "split" and "mapply". These functions are sufficiently fast (I think split is implemented in C).
This works with functions requiring multiple arguments, and also with functions returning either a vector of the same size or an aggregated value.
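If you just want the one-off result for this particular question, here is a minimal sketch of the same split + mapply idea applied directly to your objects (assuming mydata is the zoo series and limit the vector of matching length described above):
library(zoo)
day        <- as.Date(index(mydata))
x_by_day   <- split(coredata(mydata), day)   # one piece of data per day
lim_by_day <- split(limit, day)              # the matching limits per day
pieces <- mapply(function(x, lim) cumprod((cummax(x) - x) < lim),
                 x_by_day, lim_by_day, SIMPLIFY = FALSE)
final <- unsplit(pieces, day)                # back in the original order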
Try it out and let me know if this solves your problem.