I'm trying to pass a custom R function inside spark_apply but keep running into issues, and I can't figure out what some of the errors mean.
library(sparklyr)
sc <- spark_connect(master = "local")
perf_df <- data.frame(predicted = c(5, 7, 20),
                      actual = c(4, 6, 40))
perf_tbl <- sdf_copy_to(sc = sc,
                        x = perf_df,
                        name = "perf_table")
#custom function
ndcg <- function(predicted_rank, actual_rank) {
  # y is a vector of relevance scores
  DCG <- function(y) y[1] + sum(y[-1] / log(2:length(y), base = 2))
  DCG(predicted_rank) / DCG(actual_rank)
}
#works in R using R data frame
ndcg(perf_df$predicted, perf_df$actual)
#does not work
perf_tbl %>%
  spark_apply(function(e) ndcg(e$predicted, e$actual),
              names = "ndcg")
OK, I'm seeing two possible problems.
(1) spark_apply expects a function that takes a single parameter, a data frame.
(2) You may need to make a package, depending on how complex the function is.
Let's say you modify ndcg to receive a data frame as its parameter:
ndcg <- function(dataset) {
  predicted_rank <- dataset$predicted
  actual_rank <- dataset$actual
  # y is a vector of relevance scores
  DCG <- function(y) y[1] + sum(y[-1] / log(2:length(y), base = 2))
  DCG(predicted_rank) / DCG(actual_rank)
}
And you put that in a package called ndcg_package.
Now your code will be similar to:
spark_apply(perf_tbl, ndcg, packages = TRUE, names = "ndcg")
Doing this from memory, so there may be a few typos, but it'll get you close.
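If the helper is this small, an alternative to building a package is to define it inside the closure you pass to spark_apply, so it gets shipped to the workers along with the closure. A minimal sketch (I haven't run this against a live cluster, so treat it as an assumption; note the function runs once per partition):
result <- perf_tbl %>%
  spark_apply(function(e) {
    # everything the workers need is defined inside the closure
    DCG <- function(y) y[1] + sum(y[-1] / log(2:length(y), base = 2))
    data.frame(ndcg = DCG(e$predicted) / DCG(e$actual))
  },
  names = "ndcg")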
I am using Seurat to analyze some scRNA-seq data. I have managed to put all of the one-line SCT integration commands from satijalab into a function, basically:
SCT_normalization <- function(f1, f2) {
  f_merge <- merge(f1, y = f2)
  f.list <- SplitObject(f_merge, split.by = "stim")
  f.list <- lapply(X = f.list, FUN = SCTransform)
  features <- SelectIntegrationFeatures(object.list = f.list, nfeatures = 3000)
  f.list <<- PrepSCTIntegration(object.list = f.list, anchor.features = features)
  return(f.list)
}
so that I will have f.list in the global environment for downstream analysis and plotting. The problem I am running into is that every time I run the function, the output is named f.list; I want the name to be specific to the input values (i.e., f1 and/or f2). Basically, I want something I can set so that I know which input values were used to generate the final output. I saw something using the assign function, but someone wrote a warning about it being "evil and wrong...", so I am not sure how to approach this.
From what it sounds like, you don't need the super-assignment operator <<-. In my opinion, <<- should be avoided because it can cause unexpected changes to objects, which I assume is what the warning you saw was about. For example, suppose you have the following function:
AverageVector <- function(v) x <<- mean(v, na.rm = TRUE)
Now suppose you're trying to find the average of a vector as part of a larger analysis:
library(tidyverse)
x <- unique(iris$Species)
avg_sl <- AverageVector(iris$Sepal.Length)
Now, where x used to be a factor of species names, it's suddenly a numeric vector of length 1.
So I would remove the <<-, return f.list normally, and call your function like this:
object_list_1_2 <- SCT_normalization(object1, object2)
If you wanted a slightly more programmatic way to keep track of which objects were used, you could do something like this:
SCT_normalization <- function(f1, f2) {
  f_merge <- merge(f1, y = f2)
  f.list <- SplitObject(f_merge, split.by = "stim")
  f.list <- lapply(X = f.list, FUN = SCTransform)
  features <- SelectIntegrationFeatures(object.list = f.list, nfeatures = 3000)
  f.list <- PrepSCTIntegration(object.list = f.list, anchor.features = features)
  to_return <- list(inputs = list(f1, f2), normalized = f.list)
  return(to_return)
}
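A hypothetical call (sample1 and sample2 are placeholder object names, not from the question) would then look like this, with the result name encoding the inputs and the inputs themselves stored alongside the normalized list:
res_sample1_sample2 <- SCT_normalization(sample1, sample2)
res_sample1_sample2$inputs      # the two Seurat objects that went in
res_sample1_sample2$normalized  # the prepped list for downstream integration steps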
I'm dealing with several outputs I obtain from QIIME, text files that I want to manipulate to produce boxplots. Every input is formatted the same way, so the manipulation is always the same; only the source name changes. For each input, I want to extract the last 5 rows, compute a mean for each column/sample, associate the values with the experimental labels (Group) taken from the mapfile, and put them in the order I use to make a boxplot of all 6 datasets.
In bash, I do something like "for i in GG97 GG100 SILVA97 SILVA100 NCBI RDP; do cp ${i}/alpha/collated_alpha/chao1.txt alpha_tot/${i}_chao1.txt; done" to run a command several times, changing the names in the code automatically through ${i}.
I'm struggling to find a way to do the same in R. I thought of creating a vector containing the names and then using a for loop, indexing with [1], [2], etc., but it doesn't work; it stops at the read.delim line, not finding the file in the working directory.
Here's the manipulation code I wrote. After the comment, it repeats 6 times for the 6 databases I'm using (GG97, GG100, SILVA97, SILVA100, NCBI, RDP).
On top of that, I repeat this process 4 times because I have 4 metrics to use (here I'm showing shannon, but I also have a copy of the code for chao1, observed_species, and PD_whole_tree).
library(tidyverse)
library(labelled)
mapfile <- read.delim(file="mapfile_HC+BV.txt", check.names=FALSE);
mapfile <- mapfile[,c(1,4)]
colnames(mapfile) <- c("SampleID","Pathology_group")
#GG97
collated <- read.delim(file="alpha_diversity/GG97_shannon.txt", check.names=FALSE);
collated <- tail(collated,5); collated <- collated[,-c(1:3)]
collated_reorder <- collated[,match(mapfile[,1], colnames(collated))]
labels <- t(mapfile)
colnames(collated_reorder) <- labels[2,]
mean <- colMeans(collated_reorder, na.rm = FALSE, dims = 1)
mean = as.matrix(mean); mean <- t(mean)
GG97_shannon <- as.data.frame(rbind(labels[2,],mean))
GG97_shannon <- t(GG97_shannon);
DB_type <- list(DB = "GG97"); DB_type <- rep(DB_type, 41)
GG97_shannon <- as.data.frame(cbind(DB_type,GG97_shannon))
colnames(GG97_shannon) <- c("DB","Group","value")
rm(collated,collated_reorder,DB_type,labels,mean)
Here I paste all the outputs together, freeze the order and make the boxplot.
alpha_shannon <- as.data.frame(rbind(GG97_shannon,GG100_shannon,SILVA97_shannon,SILVA100_shannon,NCBI_shannon,RDP_shannon))
rownames(alpha_shannon) <- NULL
rm(GG97_shannon,GG100_shannon,SILVA97_shannon,SILVA100_shannon,NCBI_shannon,RDP_shannon)
alpha_shannon$Group = factor(alpha_shannon$Group, unique(alpha_shannon$Group))
alpha_shannon$DB = factor(alpha_shannon$DB, unique(alpha_shannon$DB))
library(ggplot2)
ggplot(data = alpha_shannon) +
  aes(x = DB, y = value, colour = Group) +
  geom_boxplot() +
  labs(title = 'Shannon',
       x = 'Database',
       y = 'Diversity') +
  theme(legend.position = 'bottom') +
  theme_grey(base_size = 16)
How do I keep this code "DRY" so I don't need 146 lines of code repeating the same things over and over? Thank you!!
You didn't provide a minimal reproducible example, so this answer can't guarantee correctness.
An important point to note is that you use rm(...), which means some variables are only relevant within a certain scope. Therefore, encapsulate that scope in a function. This makes your code reusable and spares you the manual variable removal:
process <- function(file, DB) {
  # -> Use the function parameter `file` instead of a hardcoded filename
  collated <- read.delim(file = file, check.names = FALSE)
  collated <- tail(collated, 5); collated <- collated[, -c(1:3)]
  collated_reorder <- collated[, match(mapfile[, 1], colnames(collated))]
  labels <- t(mapfile)
  colnames(collated_reorder) <- labels[2, ]
  mean <- colMeans(collated_reorder, na.rm = FALSE, dims = 1)
  mean <- as.matrix(mean); mean <- t(mean)
  # -> Rename this variable to a more general name, e.g. `result`
  result <- as.data.frame(rbind(labels[2, ], mean))
  result <- t(result)
  # -> Use the function parameter `DB` instead of a hardcoded string
  DB_type <- list(DB = DB); DB_type <- rep(DB_type, 41)
  result <- as.data.frame(cbind(DB_type, result))
  colnames(result) <- c("DB", "Group", "value")
  # -> After the end of this function, the variables defined in it
  #    vanish automatically; you just need to return the result
  return(result)
}
Now you can reuse that block:
GG97_shannon <- process(file = "alpha_diversity/GG97_shannon.txt", DB = "GG97")
GG100_shannon <- process(file =...., DB = ....)
SILVA97_shannon <- ...
SILVA100_shannon <- ...
NCBI_shannon <- ...
RDP_shannon <- ...
Alternatively, you can use looping structures:
General-purpose for:
datasets <- c("GG97_shannon", "GG100_shannon", "SILVA97_shannon",
"SILVA100_shannon", "NCBI_shannon", "RDP_shannon")
files <- c("alpha_diversity/GG97_shannon.txt", .....)
DBs <- c("GG97", ....)
result <- list()
for (i in seq_along(datasets)) {
  result[[datasets[i]]] <- process(files[i], DBs[i])
}
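If you then want the single combined data frame that the original script builds by hand with rbind, you can collapse the list in one step (this assumes result holds the six data frames from the loop above):
alpha_shannon <- do.call(rbind, result)
rownames(alpha_shannon) <- NULL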
mapply, a "specialized for loop" for iterating over several vectors in parallel:
# the first argument is the function from above, the other ones are given as arguments
# to our process(.) function
results <- mapply(process, files, DBs)
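One caveat (base R behavior, not something from the question): mapply() simplifies its result by default, which can turn a list of data frames into an awkward matrix. Pass SIMPLIFY = FALSE, or use Map(), to keep a plain list:
results <- mapply(process, files, DBs, SIMPLIFY = FALSE)
# or, equivalently
results <- Map(process, files, DBs)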
I am pretty new to Spark, I have tried to look for something on the web but I haven't found anything satisfactory.
I have always run parallel computations using the command mclapply, and I like its structure (i.e., the first parameter is the index to iterate over, the second argument is the function to be parallelized, and the remaining optional parameters are passed to that function).
Now I am trying to do much the same thing via Spark, i.e., I would like to distribute my computations among all the nodes of the Spark cluster. This is, briefly, what I have learned and how I think the code should be structured (I'm using the sparklyr package):
I create a connection to Spark using the command spark_connect;
I copy my data.frame into the Spark environment with copy_to and access it through its tibble;
I would like to implement a "Spark-friendly" version of mclapply, but I have seen there is no similar function in the package (I have seen there exists the function spark.lapply in the SparkR package, but unfortunately it is no longer on CRAN).
Below is a simple test script I have implemented that works using mclapply.
#### Standard code that works with mclapply #########
library(parallel)
dfTest = data.frame(X = rep(1, 10000), Y = rep(2, 10000))
.testFunc = function(X = 1, df, str) {
  rowSelected = df[X, ]
  y = as.numeric(rowSelected[1] + rowSelected[2])
  return(list(y = y, str = str))
}
lOutput = mclapply(X = 1:nrow(dfTest), FUN = .testFunc, df = dfTest,
                   str = "useless string", mc.cores = 2)
######################################################
###### Similar code that should work with Spark ######
library(sparklyr)
sc = spark_connect(master = "local")
dfTest = data.frame(X = rep(1, 10000), Y = rep(2, 10000))
.testFunc = function(X = 1, df, str) {
  rowSelected = df[X, ]
  nSum = as.numeric(rowSelected[1] + rowSelected[2])
  return(list(nSum = nSum, str = str))
}
dfTest_tbl = copy_to(sc, dfTest, "test_tbl", overwrite = TRUE)
# Apply similar function mclapply to dfTest_tbl, that works with
# Spark
# ???
######################################################
If someone has already found a solution for this, that would be great. Other references/guides/links are also more than welcome. Thanks!
sparklyr
spark_apply is the existing function you're looking for:
spark_apply(sdf, function(data) {
...
})
Please refer to Distributed R in the sparklyr documentation for details.
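Applied to the tibble from the question, a minimal sketch (untested; note that the function is called once per partition, not once per row) might look like:
dfTest_tbl %>%
  spark_apply(function(df) {
    # `df` is the partition as an ordinary R data.frame, so work on it vectorised
    data.frame(nSum = df$X + df$Y, str = "useless string")
  },
  names = c("nSum", "str"))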
SparkR
With SparkR, use gapply / gapplyCollect:
gapply(df, groupingCols, function(data) {...}, schema)
or dapply / dapplyCollect:
dapply(df, function(data) {...}, schema)
Refer to the dapply and gapply docs on UDFs for details.
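For completeness, a rough SparkR equivalent of the same row-wise sum (a sketch, assuming a SparkR session is available; dapplyCollect collects the result locally, so no output schema needs to be declared):
library(SparkR)
sparkR.session()
sdf <- createDataFrame(dfTest)
# the function receives each partition as an R data.frame
res <- dapplyCollect(sdf, function(df) data.frame(nSum = df$X + df$Y))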
Be warned that all of these solutions are inferior to native Spark code and should be avoided when high performance is required.
sparklyr::spark_apply can now pass external variables, such as models, to the workers via its context argument.
Here is my example of running an xgboost model with sparklyr:
bst <- xgboost::xgb.load("project/models/xgboost.model")
res3 <- spark_apply(x = ft_union_price %>% sdf_repartition(partitions = 1500, partition_by = "uid"),
                    f = inference_fn,
                    packages = FALSE,
                    memory = FALSE,
                    names = c("uid",
                              "action_1",
                              "pred"),
                    context = {model <- bst})
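inference_fn above is the answerer's own function and isn't shown; a purely hypothetical sketch of its shape (the feature-column handling is an assumption, not from the original answer) could be:
inference_fn <- function(df, context) {
  # `context` is the xgboost booster that spark_apply serialized to the workers
  # via `context = {model <- bst}`; assume every non-key column is a feature
  feature_cols <- setdiff(names(df), c("uid", "action_1"))
  preds <- predict(context, as.matrix(df[, feature_cols]))
  data.frame(uid = df$uid, action_1 = df$action_1, pred = preds)
}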
I need to open files (matrices) from a directory and apply a function pca to each one. It uses another function, count_pc, which is supposed to zero out diagonals in the matrix step by step and add the recalculated PC1 to the table pcs from the enclosing function. At first I didn't think about environments, so count_pc was crashing with an "unknown variable" error. Then I tried to do it this way:
files <- list.files()
count_pc <- function(x, env = parent.frame()) {
  diag(file[x:nrow(file), ]) <- 0
  diag(file[, x:nrow(file)]) <- 0
  pcn <- prcomp(file, scale = FALSE)
  pcn <- data.frame(pcn$rotation)
  pcs <- cbind(pcs, pcn$PC1)
}
pca <- function(filename) {
  file <- as.matrix(read.table(filename))
  pc <- prcomp(file, scale = FALSE)
  pc <- data.frame(pc$rotation)
  pc1 <- pc$PC1
  pcs <- data.frame(pc1)
  for (k in 1:40) {
    count_pc(k)
  }
  new_filename <- strsplit(filename, "_")[[1]][3]
  print(pcs)
  colnames(pcs) <- paste0(0:40, rep("_bins_deleted", 40))
  write.table(pcs, file = paste(new_filename, "eigenvectors", sep = "_"))
  return(apply(pcs, 2, cor, y = pc1))
}
ldply(files, pca)
And indeed, count_pc no longer crashes with the above error but, unfortunately, it crashes with a new one:
Error in `colnames<-`(`*tmp*`, value = c("0_bins_deleted", "1_bins_deleted",  :
  'names' attribute [41] must be the same length as the vector [1]
which means that count_pc does not change the variables it needs to. First, I thought the problem might be connected with using sapply(1:40, count_pc), so I replaced it with a loop, but that didn't help. I've also tried environment(count_pc) <- environment() inside pca, but that didn't help either (nor did changing the variable names in count_pc to env$'name'). I don't know what to do, and googling doesn't seem to help.
I came across this function a while back that was created for fixing PCA values. The problem with the function was that it wasn't compatible with xts time series objects.
amend <- function(result) {
  result.m <- as.matrix(result)
  n <- dim(result.m)[1]
  delta <- apply(abs(result.m[-1, ] - result.m[-n, ]), 1, sum)
  delta.1 <- apply(abs(result.m[-1, ] + result.m[-n, ]), 1, sum)
  signs <- c(1, cumprod(rep(-1, n - 1) ^ (delta.1 <= delta)))
  zoo(result * signs)
}
The full example can be found at https://stats.stackexchange.com/questions/34396/im-getting-jumpy-loadings-in-rollapply-pca-in-r-can-i-fix-it
The problem is that applying the function to an xts object with multiple columns and rows won't work. Is there an elegant way of applying the algorithm to a multi-column xts object?
My current solution, given a single column with multiple rows, is to loop through row by row, which is slow and tedious. Imagine having to do it column by column as well.
Thanks,
Here is some code to get one started:
rm(list=ls())
require(RCurl)
sit = getURLContent('https://github.com/systematicinvestor/SIT/raw/master/sit.gz', binary=TRUE, followlocation = TRUE, ssl.verifypeer = FALSE)
con = gzcon(rawConnection(sit, 'rb'))
source(con)
close(con)
load.packages('quantmod')
data <- new.env()
tickers<-spl("VTI,IEF,VNQ,TLT")
getSymbols(tickers, src = 'yahoo', from = '1980-01-01', env = data, auto.assign = T)
for(i in ls(data)) data[[i]] = adjustOHLC(data[[i]], use.Adjusted=T)
bt.prep(data, align='remove.na', dates='1990::2013')
prices<-data$prices[,-10] #don't include cash
retmat<-na.omit(prices/mlag(prices) - 1)
rollapply(retmat, 500, function(x) summary(princomp(x))$loadings[, 1], by.column = FALSE, align = "right") -> princomproll
require(lattice)
xyplot(amend(princomproll))
plotting "princomproll" will get you jumpy loadings...
It isn't very obvious how the amend function relates to the script below it (apart from the final xyplot call), or what you are trying to achieve. There are a couple of small changes that can be made. I haven't profiled the difference, but it's a little more readable, if nothing else.
You remove the first and last rows of the result twice.
rowSums might be slightly more efficient for getting the row sums than apply.
rep.int is a little bit faster than rep.
amend <- function(result) {
  result <- as.matrix(result)
  n <- nrow(result)
  without_first_row <- result[-1, ]
  without_last_row <- result[-n, ]
  delta_minus <- rowSums(abs(without_first_row - without_last_row))
  delta_plus <- rowSums(abs(without_first_row + without_last_row))
  signs <- c(1, cumprod(rep.int(-1, n - 1) ^ (delta_plus <= delta_minus)))
  zoo(result * signs)
}
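As a usage sketch (untested, assuming princomproll from the question's script), the rewritten function can be applied directly to the multi-column rollapply result, since the row sums already aggregate across all loading columns:
fixed <- amend(princomproll)   # zoo object with sign-corrected loadings
xyplot(fixed)                  # same plot call as before, now on the corrected series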