Recently I've been playing with doing some parallel processing in R using future (and future.apply and furrr), which has mostly been great, but I've stumbled onto something I can't explain. It's possible that this is a bug somewhere, but it may also be sloppy coding on my part. If anyone can explain this behavior it would be much appreciated.
The setup
I'm running simulations on different subgroups of my data. For each group, I want to run the simulation n times and then calculate some summary stats on the results. Here is some example code to reproduce my basic setup and demonstrate the issue I'm seeing:
library(tidyverse)
library(future)
library(future.apply)
# Helper functions
#' Calls out to `free` to get total system memory used
sys_used <- function() {
.f <- system2("free", "-b", stdout = TRUE)
as.numeric(unlist(strsplit(.f[2], " +"))[3])
}
#' Write time and memory usage to a log file in CSV format
#' @param .f the file to write to
#' @param .id identifier for the row to be written
mem_string <- function(.f, .id) {
.s <- paste(.id, Sys.time(), sys_used(), Sys.getpid(), sep = ",")
write_lines(.s, .f, append = TRUE)
}
# Inputs
fake_inputs <- 1:16
nsim <- 100
nrows <- 1e6
log_file <- "future_mem_leak_log.csv"
if (fs::file_exists(log_file)) fs::file_delete(log_file)
test_cases <- list(
list(
name = "multisession-sequential",
plan = list(multisession, sequential)
),
list(
name = "sequential-multisession",
plan = list(sequential, multisession)
)
)
# Test code
for (.t in test_cases) {
plan(.t$plan)
# loop over subsets of the data
final_out <- future_lapply(fake_inputs, function(.i) {
# loop over simulations
out <- future_lapply(1:nsim, function(.j) {
# in real life this would be doing simulations,
# but here we just create "results" using rnorm()
res <- data.frame(
id = rep(.j, nrows),
col1 = rnorm(nrows) * .i,
col2 = rnorm(nrows) * .i,
col3 = rnorm(nrows) * .i,
col4 = rnorm(nrows) * .i,
col5 = rnorm(nrows) * .i,
col6 = rnorm(nrows) * .i
)
# write memory usage to file
mem_string(log_file, .t$name)
# in real life I would write res to file to read in later, but here we
# only return head of df so we know the returned value isn't filling up memory
res %>% slice_head(n = 10)
})
})
# clean up any leftover objects before testing the next plan
try(rm(final_out))
try(rm(out))
try(rm(res))
}
The outer loop is for testing two parallelization strategies: whether to parallelize over the subsets of data or over the 100 simulations.
Some caveats
I realize that parallelizing over the simulations is not the ideal design, and also that chunking the data to send 10-20 simulations to each core would be more efficient, but that's not really the point here. I'm just trying to understand what is happening in memory.
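(For the record, by chunking I mean something like the following, using future.apply's future.chunk.size argument; this is just a sketch, not something I benchmarked.)
# Sketch only: hand the simulations to the workers in chunks of 10 rather than one at a time
plan(multisession)
out <- future_lapply(1:nsim, function(.j) {
  res <- data.frame(id = rep(.j, nrows), col1 = rnorm(nrows))
  head(res, 10)
}, future.chunk.size = 10, future.seed = TRUE)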
I also considered that maybe plan(multicore) would be better here (though I'm not sure if it would), but I'm more interested in figuring out what's happening with plan(multisession).
The results
I ran this on an 8-vCPU Linux EC2 instance (I can give more specs if people need them) and created the following plot from the results (plotting code at the bottom for reproducibility):
First off, plan(list(multisession, sequential)) is faster (as expected, see caveat above), but what I'm confused about is the memory profile. The total system memory usage remains pretty constant for plan(list(multisession, sequential)) which I would expect, because I assumed the res object is overwritten each time through the loop.
However, the memory usage for plan(list(sequential, multisession)) steadily grows as the program runs. It appears that each time through the loop the res object is created and then hangs around in limbo somewhere, taking up memory. In my real example this got large enough that it filled my entire (32GB) system memory and killed the process about halfway through.
Plot twist: it only happens when nested
And here's the part that really has me confused! When I change the outer future_lapply to a regular lapply and set plan(multisession), I don't see the problem. From my reading of the "Future: Topologies" vignette, this should be equivalent to plan(list(sequential, multisession)), but the plot doesn't show the memory growing at all (in fact, it's almost identical to plan(list(multisession, sequential)) in the plot above).
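Concretely, the non-nested variant looks roughly like this (a simplified sketch with a single result column, reusing the helpers defined above):
# Plain lapply over the data subsets; only the inner simulation loop is parallelised
plan(multisession)
final_out <- lapply(fake_inputs, function(.i) {
  future_lapply(1:nsim, function(.j) {
    res <- data.frame(id = rep(.j, nrows), col1 = rnorm(nrows) * .i)
    mem_string(log_file, "lapply-multisession")
    res %>% slice_head(n = 10)
  })
})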
Note on other options
I actually originally found this with furrr::future_map_dfr(), but to be sure it wasn't a bug in furrr, I tried it with future.apply::future_lapply() and got the results shown. I also tried to code this up with just future::future() and got very different results, though quite possibly because what I coded up wasn't actually equivalent. I don't have much experience with using futures directly, without the abstraction layer provided by furrr or future.apply.
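For reference, the kind of explicit-future version I was attempting looked roughly like this (a sketch; it may well not be equivalent to the future_lapply version, which could explain the different results):
# One explicit future per simulation, collected afterwards with value()
plan(multisession)
fs <- lapply(1:nsim, function(.j) {
  future({
    res <- data.frame(id = rep(.j, nrows), col1 = rnorm(nrows))
    res %>% slice_head(n = 10)
  })
})
out <- lapply(fs, value)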
Again, any insight on this is much appreciated.
Plotting code
library(tidyverse)
logDat <- read_csv("future_mem_leak_log.csv",
col_names = c("plan", "time", "sys_used", "pid")) %>%
group_by(plan) %>%
mutate(
start = min(time),
time_elapsed = as.numeric(difftime(time, start, units = "secs"))
)
ggplot(logDat, aes(x = time_elapsed/60, y = sys_used/1e9, group = plan, colour = plan)) +
geom_line() +
xlab("Time elapsed (in mins)") + ylab("Memory used (in GB)") +
ggtitle("Memory Usage\n list(multisession, sequential) vs list(sequential, multisession)")
Related
I am struggling with a huge data set at the moment.
What I would like to do is not very complicated, but it is just too slow. In the first step, I need to check whether a website is active or not. For this, I used the following code (here with a sample of three API paths):
library(httr)
Updated <- function(x){http_error(GET(x))}
websites <- data.frame(c("https://api.crunchbase.com/v3.1/organizations/designpitara","www.twitter.com","www.sportschau.de"))
abc <- apply(websites,1,Updated)
I already noticed that a for loop is quite a bit faster than the apply function. However, the full code (which has around 1 million APIs to check) would still take around 55 hours to execute. Any help is appreciated :)
Alternatively, something like this would work for passing multiple libraries to the PSOCK cluster:
clusterEvalQ(cl, {
library(data.table)
library(survival)
})
The primary limiting factor will probably be the time taken to query the website. Currently, you're waiting for each query to return a result before executing the next one. The best way to speed up the workflow would be to execute batches of queries in parallel.
If you're using a Unix system you could try the following:
### Packages ###
library(parallel)
### On your example ###
abc <- unlist(mclapply(websites[[1]], Updated, mc.cores = 3))
### On a larger number of sites ###
abc <- unlist(mclapply(websites[[1]], Updated, mc.cores = detectCores()))
### You can even go beyond your machine's core count ###
abc <- unlist(mclapply(websites[[1]], Updated, mc.cores = 40))
However, the precise number of threads at which you saturate your processor or internet connection depends on your machine and your connection.
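One rough way to probe that (a sketch, assuming you run it on a representative sample of your URLs) is to time a small batch at a few different worker counts and stop increasing the count once the elapsed time stops improving:
### Sketch: benchmark a small sample at different worker counts ###
sample_sites <- head(websites[[1]], 50)
for (k in c(2, 4, 8, 16)) {
  elapsed <- system.time(unlist(mclapply(sample_sites, Updated, mc.cores = k)))["elapsed"]
  cat(k, "workers:", elapsed, "seconds\n")
}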
Alternatively, if you're stuck on Windows:
### For a larger number of sites ###
cl <- makeCluster(detectCores(), type = "PSOCK")
clusterExport(cl, varlist = "websites")
clusterEvalQ(cl = cl, library(httr))
abc <- parSapply(cl = cl, X = websites[[1]], FUN = Updated, USE.NAMES = FALSE)
stopCluster(cl)
In the case of PSOCK clusters, I'm not sure whether there are any benefits to be had from exceeding your machine's core count, although I'm not a Windows person, and I welcome any correction.
I am trying to optimize a simple R code I wrote on two aspects:
1) For loops
2) Writing data into my PostgreSQL database
For 1), I know for loops should be avoided at all costs and it's recommended to use lapply, but I am not clear on how to translate my code below using lapply.
For 2), what I do below works, but I am not sure it is the most efficient way (for example, doing it this way versus rbinding all the data into an R data frame and then loading the whole data frame into my PostgreSQL database).
EDIT: I updated my code with a reproducible example below.
# Requires rvest (read_html, html_nodes, html_text) and an open DBI connection con
for (i in 1:100){
  search <- paste0("https://github.com/search?o=desc&p=", i, "&q=R&type=Repositories")
  download.file(search, destfile = 'scrape.html', quiet = TRUE)
  url <- read_html('scrape.html')
  github_title <- url %>% html_nodes(xpath = "//div[@class='mt-n1']") %>% html_text()
  github_link <- url %>% html_nodes(xpath = "//div[@class='mt-n1']//@href") %>% html_text()
  df <- data.frame(github_title, github_link)
  colnames(df) <- c("title", "link")
  dbWriteTable(con, "my_database", df, append = TRUE, row.names = FALSE)
  cat(i)
}
Thanks a lot for all your inputs!
First of all, it is a myth that needs to be thoroughly debunked that lapply is in any way faster than equivalent code using a for loop. For years this has been fixed, and a for loop should in every case be at least as fast as the equivalent lapply.
I will illustrate this using a for loop, as you seem to find that more intuitive. Note, however, that I work mostly in T-SQL and there might be some conversion necessary.
n <- 1e5
outputDat <- vector('list', n)
for (i in 1:n){
id <- element_a[i]
location <- element_b[i]
language <- element_c[i]
date_creation <- element_d[i]
df <- data.frame(id, location, language, date_creation)
colnames(df) <- c("id", "location", "language", "date_creation")
outputDat[[i]] <- df
}
## Combine data.frames
outputDat <- do.call('rbind', outputDat)
#Write the combined data.frame into the database.
##dbBegin(con) #<= might speed up might not.
dbWriteTable(con, "my_database", outputDat, append = TRUE, row.names = FALSE)
##dbCommit(con) #<= might speed up might not.
Using Transact-SQL, you could alternatively combine the entire string into a single INSERT INTO statement. Here I'll deviate and use apply to iterate over the rows, as it is much more readable in this case. A for loop is once again just as fast if done properly.
# Create the statement here
statement <- paste0("('", apply(outputDat, 1, paste0, collapse = "','"), "')", collapse = ",\n") #\n can be removed, but makes printing nicer.
##Optional: Print a bit of the statement
# cat(substr(statement, 1, 2000))
##dbBegin(con) #<= might speed up, might not.
dbExecute(con, statement <- paste0(
  "
  /*
  SET NOCOUNT ON seems to be necessary with the DBI API.
  It seems to react to 'n rows affected' messages.
  Note: this only affects this method, not the one using dbWriteTable.
  */
  --SET NOCOUNT ON
  INSERT INTO [my table] VALUES ", statement))
##dbCommit(con) #<= might speed up, might not.
As noted in the comment, this might simply fail to upload the table properly, as the DBI package sometimes seems to fail this kind of transaction if it results in one or more 'n rows affected' messages.
Last but not least, once the statement is built it can be copied from R into any GUI that directly accesses the database, for example with writeLines(statement, 'clipboard') or by writing it to a text file (a file is more robust if your data contains many rows). In rare cases this last resort can be faster, if for whatever reason DBI or alternative R packages run unusually slowly. As this seems to be somewhat of a personal project, that might be sufficient for your use.
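For example, a small sketch (not part of the original workflow) of dumping the generated statement to a file:
# Write the full generated INSERT statement to a .sql file that any database GUI can open
writeLines(statement, "insert_statement.sql")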
I am trying to use the quasi-quotation syntax (quo, exprs, !!, etc.) as well as the foreach function to create several new variables by means of a named list of expressions to be evaluated inside the rxDataStep function, specifically, the transforms argument. I am getting the following error:
Error in rxLinkTransformComponents(transforms = transforms, transformFunc = transformFunc, : 'transforms' must be of the form list(...)
I have a dataset which includes a number of variables which I need to log-transform in order to perform further analyses. I have been using the rx functions from the "RevoScaleR" package for roughly three years and totally missed the "tidyverse"/pipeline method of data transformation. I do occasionally dabble with these tools but prefer to stick with the aforementioned rx functions, given my relative familiarity with them and the fact that they have served me very well thus far.
As a MWE:
Required libraries:
library(foreach)
library(rlang)
Creating variables which need to be log-transformed.
vars <- foreach(i = 10:20, .combine = "cbind") %do% rnorm(10, i)
Dataframe with identifier and above variables.
data_in <- data.frame(id = 1:10, vars)
Object which creates the expressions of the log-transformed variables; this creates a named list.
log_vars <- foreach(i = names(data_in[-1]), .final = function(x) set_names(x, paste0(names(data_in[-1]), "_log"))) %do%
expr(log10(!!sym(i)))
Now attempting to add the variables to the existing dataframe.
data_out <- rxDataStep(inData = data_in, transforms = log_vars, transformObjects = list(log_vars = log_vars))
The resulting error is the following:
Error in rxLinkTransformComponents(transforms = transforms, transformFunc = transformFunc, : 'transforms' must be of the form list(...)
I simply cannot understand the error given that log_vars is defined as a named list. One can check this with str and typeof.
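For reference, the check I mean is simply (output omitted):
str(log_vars)     # a named list of language objects
typeof(log_vars)  # "list"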
I have tried a slightly different way of defining the new variables:
log_vars <- unlist(foreach(i = names(data_in[-1]), j = paste0(names(data_in[-1]), "_log")) %do%
exprs(!!j := log10(!!sym(i))))
I have to use unlist given that exprs delivers a list as output already. Either way, I get the same error as before.
Naturally, I expect to have 10 new variables named result.1_log, result.2_log, etc. inserted into the dataframe. Instead, I receive the above error and the new dataframe is not created.
I suspected that the rx functions do not like working with the quasi-quotation syntax; however, I have used it before to identify subjects with NA values for certain variables, using the rowSelection argument of rxDataStep. I do realise that rowSelection requires a single logical expression, while transforms requires a named list of expressions.
Any help would be much appreciated, since this type of data transformation will keep coming up in my analyses. I suspect that I simply do not understand the inner workings of the quasi-quotation syntax, or perhaps how lists work in general, but hopefully there is a simple fix.
I am using Microsoft R Open 3.4.3.
My session info is the following:
R Services Information:
Local R: C:\Program Files\Microsoft\ML Server\R_SERVER\
Version: 1.3.40517.1016
Operating System: Microsoft Windows 10.0.17134
CPU Count: 4
Physical Memory: 12169 MB, 6810 MB free
Virtual Memory: 14025 MB, 7984 MB free
Video controller[1]: Intel(R) HD Graphics 620
GPU[1]: Intel(R) HD Graphics Family
Video memory[1]: 1024 MB
Connected users: 1
I'm not quite sure what you're trying to do as I think you've made things too complicated.
If all you want to do is take the log of each number in each data point, then I show two approaches below.
Approach #1 is static: you know the fixed number of columns and hard-code the transforms. rxDataStep runs a bit faster with this approach.
Approach #2 is a bit more dynamic, taking advantage of a transformFunc. A transformFunc works in chunks, so it can be used safely in a clustered fashion, and rxDataStep knows how to integrate the chunks together. But there will be a bit of a performance hit for it.
You might have been trying to find a hybrid approach: dynamically building the list for the transforms parameter of rxDataStep. I haven't found a way to get that to work. Here's a similar question about doing it with rxSetVarInfo (Change a dynamic variable name with rxSetVarInfo), but that approach hasn't yielded success for me yet.
Let me know if I've completely missed the mark!
library(foreach)
library(rlang)
startSize <- 10
endSize <- 20
vars <- foreach(i = startSize:endSize, .combine = "cbind") %do% rnorm(10, i)
data_in <- data.frame(vars)
tempInput <- tempfile(fileext = ".xdf")
tempOutput <- tempfile(fileext = ".xdf")
rxImport(inData = data_in, outFile = tempInput, overwrite = T)
rxGetInfo(tempInput, getVarInfo = T)
### Approach #1
print("Approach #1")
rxDataStep(inData = tempInput, outFile = tempOutput, overwrite = T,
transforms = list(
log_R1 = log10(result.1),
log_R2 = log10(result.2),
log_R3 = log10(result.3),
log_R4 = log10(result.4),
log_R5 = log10(result.5),
log_R6 = log10(result.6),
log_R7 = log10(result.7),
log_R8 = log10(result.8),
log_R9 = log10(result.9),
log_R10 = log10(result.10),
log_R11 = log10(result.11)))
rxGetInfo(tempOutput, getVarInfo = T)
### Approach #2
print("Approach #2")
logxform <- function(dataList) {
numRowsInChunk <- length(dataList$result.1)
for (j in 1:columnDepth) {
dataList[[paste0("log_R",j)]] <- rep(0, times=numRowsInChunk)
for (i in 1:numRowsInChunk) {
dataList[[paste0("log_R",j)]][i] <- log10(dataList[[paste0("result.",j)]][i])
}
}
return(dataList)
}
rxDataStep(inData = tempInput, outFile = tempOutput, overwrite = T,
transformObjects = list(columnDepth = endSize - startSize + 1),
transformFunc = logxform)
rxGetInfo(tempOutput, getVarInfo = T)
I am working with a list where each element is itself a list of R data.tables. My task is to grab the nth element of each sublist and then stack those data.tables into a larger data.table. So, from a list of twenty lists, each having twelve elements, I end up with a total list of twelve elements, where each element is a data.table.
I'm not having difficulty with the code to do this, but I am having some confusion about what is happening with R's memory management in this case. It is relatively simple to do the extraction, like this (just to show context, not a MWE on its own):
lst_new <- lapply(X = list_indices,
FUN = function(idx) {return(rbindlist(l = lapply(X = lst_old,FUN = `[[`,idx)))})
My question is, why is R not releasing the memory that was originally allocated to lst_old when I delete it? More generally, why is that my rbind operations seem to hold onto memory after the object is removed? Below is a minimal working example.
library(data.table)
# Create list elements of large enough size
uFunc_MakeElement <- function() {
clicode <- paste(sample(x = c(letters,LETTERS),size = 4,replace = T),collapse = "")
column_data <- replicate(n = 100,expr = {sample(x = c(0:20),size = 600000,replace = T)},simplify = FALSE)
names(column_data) <- paste("var",1:100,sep = "")
return(as.data.table(cbind(clicode = clicode,as.data.frame(column_data))))
}
lst_big <- replicate(n = 15,expr = uFunc_MakeElement(),simplify = FALSE)
# At this point, the rsession is consuming 4.01GB according to top (RES)
# According to RStudio, lst_big was 3.4Gb
# Transform to a data.table
dt_big <- rbindlist(l = lst_big)
# According to top, RES was 7.293Gb
rm(lst_big)
# RES does not change
dt_big <- rbind(dt_big,NULL)
# RES goes to 0.010t
gc()
# RES goes back down to 6.833Gb
I'm not sure why the memory is not returned to me when I remove lst_big after creating the new data.table using rbindlist. Even after manually calling gc (which you shouldn't have to do), I still don't get back the memory that seems to be allocated to lst_big. Am I doing something wrong? Is there a better way to concatenate data.tables so that I don't leak memory?
(Tagging this with RStudio in case there's a chance it's somehow related to the IDE. This example is coming from RStudio Server running on an Ubuntu 14.04 box).
EDITED TO ADD: I just noticed that this memory usage issue remains even if I overwrite the list itself (rather than creating a new list, I just assign the output of my operations to the old list).
I have the following, somewhat large dataset:
> dim(dset)
[1] 422105 25
> class(dset)
[1] "data.frame"
>
Without doing anything, the R process seems to take about 1GB of RAM.
I am trying to run the following code:
dset <- ddply(dset, .(tic), transform,
date.min <- min(date),
date.max <- max(date),
daterange <- max(date) - min(date),
.parallel = TRUE)
Running that code, RAM usage skyrockets. It completely saturated 60GB of RAM on a 32-core machine. What am I doing wrong?
If performance is an issue, it might be a good idea to switch to using data.tables from the package of the same name. They are fast. You'd do something roughly equivalent to this:
library(data.table)
dat <- data.frame(x = runif(100),
dt = seq.Date(as.Date('2010-01-01'),as.Date('2011-01-01'),length.out = 100),
grp = rep(letters[1:4],each = 25))
dt <- as.data.table(dat)
setkey(dt, grp)
dt[,mutate(.SD,date.min = min(dt),
date.max = max(dt),
daterange = max(dt) - min(dt)), by = grp]
Here's an alternative application of data.table to the problem, illustrating how blazing-fast it can be. (Note: this uses dset, the data.frame constructed by Brian Diggs in his answer, except with 30000 rather than 10 levels of tic).
(The reason this is much faster than @joran's solution is that it avoids the use of .SD, instead using the columns directly. The style is a bit different from plyr, but it typically buys huge speed-ups. For another example, see the data.table wiki, which (a) includes this as recommendation #1, and (b) shows a 50X speedup for code that drops the .SD.)
library(data.table)
system.time({
dt <- data.table(dset, key="tic")
# Summarize by groups and store results in a summary data.table
sumdt <- dt[ ,list(min.date=min(date), max.date=max(date)), by="tic"]
sumdt[, daterange:= max.date-min.date]
# Merge the summary data.table back into dt, based on key
dt <- dt[sumdt]
})
# ELAPSED TIME IN SECONDS
# user system elapsed
# 1.45 0.25 1.77
A couple of things come to mind.
First, I would write it as:
dset <- ddply(dset, .(tic), summarise,
date.min = min(date),
date.max = max(date),
daterange = max(date) - min(date),
.parallel = TRUE)
Well, actually, I would probably avoid double calculating min/max date and write
dset <- ddply(dset, .(tic), function(DF) {
mutate(summarise(DF, date.min = min(date),
date.max = max(date)),
daterange = date.max - date.min)},
.parallel = TRUE)
but that's not the main point you are asking about.
With a dummy data set of your dimensions
n <- 422105
dset <- data.frame(date=as.Date("2000-01-01")+sample(3650, n, replace=TRUE),
tic = factor(sample(10, n, replace=TRUE)))
for (i in 3:25) {
dset[i] <- rnorm(n)
}
this ran comfortably (in under a minute) on my laptop. In fact, the plyr step took less time than creating the dummy data set, so it couldn't have been swapping at anything like the scale you saw.
A second possibility is that there is a large number of unique values of tic, which could increase the size needed. However, when I tried increasing the possible number of unique tic values to 1000, it didn't really slow things down.
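(That variant only changes the tic column of the dummy data, roughly like this:)
# Same dummy data, but with up to 1000 distinct tic values instead of 10
dset$tic <- factor(sample(1000, n, replace = TRUE))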
Finally, it could be something in the parallelization. I don't have a parallel backend registered for foreach, so it was just doing a serial approach. Perhaps that is causing your memory explosion.
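If you do want the parallel path, a minimal sketch of registering a backend (assuming the doParallel package, with plyr already loaded as in your code) is shown below; note that with .parallel = TRUE each worker gets its own copy of a group's data, which can multiply memory usage.
library(doParallel)
cl <- makeCluster(4)    # one worker per core you want to use
registerDoParallel(cl)  # makes .parallel = TRUE in ddply use these workers
dset <- ddply(dset, .(tic), summarise,
              date.min = min(date),
              date.max = max(date),
              daterange = max(date) - min(date),
              .parallel = TRUE)
stopCluster(cl)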
Are there many factor levels in the data frame? I've found that this type of excessive memory usage is common in adply and possibly other plyr functions, but it can be remedied by removing unnecessary factors and levels. If the large data frame was read into R, make sure stringsAsFactors is set to FALSE in the import:
dat = read.csv(header=TRUE, sep="\t", file="dat.tsv", stringsAsFactors=FALSE)
Then assign the factors you actually need.
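For example (a hypothetical sketch; the column names stand in for whatever your data actually contains):
# Convert only the columns that genuinely need to be factors, and drop unused levels
dat$grp <- factor(dat$grp)
dat <- droplevels(dat)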
I haven't looked into Hadley's source yet to discover why.