I am really struggling with a huge data set at the moment.
What I would like to do is not very complicated, but it is just too slow. In the first step, I need to check whether a website is active or not. For that, I used the following code (here with a sample of three API paths):
library(httr)
Updated <- function(x){http_error(GET(x))}
websites <- data.frame(c("https://api.crunchbase.com/v3.1/organizations/designpitara","www.twitter.com","www.sportschau.de"))
abc <- apply(websites,1,Updated)
I already noticed that a for loop is quite a bit faster than the apply function. However, the full code (which has around 1 million APIs to check) would still take around 55 hours to execute. Any help is appreciated :)
A note on the PSOCK approach below: something like this works for loading multiple libraries on the cluster:
clusterEvalQ(cl, {
  library(data.table)
  library(survival)
})
The primary limiting factor will probably be the time taken to query the website. Currently, you're waiting for each query to return a result before executing the next one. The best way to speed up the workflow would be to execute batches of queries in parallel.
If you're using a Unix system you could try the following:
### Packages ###
library(parallel)
### On your example ###
abc <- unlist(mclapply(websites[[1]], Updated, mc.cores = 3))
### On a larger number of sites ###
abc <- unlist(mclapply(websites[[1]], Updated, mc.cores = detectCores()))
### You can even go beyond your machine's core count ###
abc <- unlist(mclapply(websites[[1]], Updated, mc.cores = 40))
However, the precise number of threads at which you saturate your processor or internet connection depends on your machine and your connection.
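If you want to find that saturation point empirically, a rough sketch (my addition, assuming websites[[1]] holds a reasonably large sample of your URLs and Updated is defined as above) is to time a small batch at a few worker counts and pick the fastest:
library(parallel)
# time a sample of queries at several worker counts (Unix only, since it uses mclapply)
sample_sites <- head(websites[[1]], 50)
for (n in c(2, 4, 8, 16)) {
  t <- system.time(unlist(mclapply(sample_sites, Updated, mc.cores = n)))["elapsed"]
  cat(n, "workers:", round(t, 1), "seconds\n")
}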
Alternatively, if you're stuck on Windows:
### For a larger number of sites ###
cl <- makeCluster(detectCores(), type = "PSOCK")
clusterExport(cl, varlist = "websites")
clusterEvalQ(cl = cl, library(httr))
abc <- parSapply(cl = cl, X = websites[[1]], FUN = Updated, USE.NAMES = FALSE)
stopCluster(cl)
In the case of PSOCK clusters, I'm not sure whether there are any benefits to be had from exceeding your machine's core count, although I'm not a Windows person, and I welcome any correction.
Recently I've been playing with doing some parallel processing in R using future (and future.apply and furrr), which has mostly been great, but I've stumbled onto something that I can't explain. It's possible that this is a bug somewhere, but it may also be sloppy coding on my part. If anyone can explain this behavior it would be much appreciated.
The setup
I'm running simulations on different subgroups of my data. For each group, I want to run the simulation n times and then calculate some summary stats on the results. Here is some example code to reproduce my basic setup and demonstrate the issue I'm seeing:
library(tidyverse)
library(future)
library(future.apply)
# Helper functions
#' Calls out to `free` to get total system memory used
sys_used <- function() {
  .f <- system2("free", "-b", stdout = TRUE)
  as.numeric(unlist(strsplit(.f[2], " +"))[3])
}
#' Write time and memory usage to a log file in CSV format
#' @param .f the file to write to
#' @param .id identifier for the row to be written
mem_string <- function(.f, .id) {
  .s <- paste(.id, Sys.time(), sys_used(), Sys.getpid(), sep = ",")
  write_lines(.s, .f, append = TRUE)
}
# Inputs
fake_inputs <- 1:16
nsim <- 100
nrows <- 1e6
log_file <- "future_mem_leak_log.csv"
if (fs::file_exists(log_file)) fs::file_delete(log_file)
test_cases <- list(
  list(
    name = "multisession-sequential",
    plan = list(multisession, sequential)
  ),
  list(
    name = "sequential-multisession",
    plan = list(sequential, multisession)
  )
)
# Test code
for (.t in test_cases) {
  plan(.t$plan)
  # loop over subsets of the data
  final_out <- future_lapply(fake_inputs, function(.i) {
    # loop over simulations
    out <- future_lapply(1:nsim, function(.j) {
      # in real life this would be doing simulations,
      # but here we just create "results" using rnorm()
      res <- data.frame(
        id = rep(.j, nrows),
        col1 = rnorm(nrows) * .i,
        col2 = rnorm(nrows) * .i,
        col3 = rnorm(nrows) * .i,
        col4 = rnorm(nrows) * .i,
        col5 = rnorm(nrows) * .i,
        col6 = rnorm(nrows) * .i
      )
      # write memory usage to file
      mem_string(log_file, .t$name)
      # in real life I would write res to file to read in later, but here we
      # only return head of df so we know the returned value isn't filling up memory
      res %>% slice_head(n = 10)
    })
  })
  # clean up any leftover objects before testing the next plan
  try(rm(final_out))
  try(rm(out))
  try(rm(res))
}
The outer loop is for testing two parallelization strategies: whether to parallelize over the subsets of data or over the 100 simulations.
Some caveats
I realize that parallelizing over the simulations is not the ideal design, and that chunking the data to send 10-20 simulations to each core would be more efficient, but that's not really the point here. I'm just trying to understand what is happening in memory.
I also considered that maybe plan(multicore) would be better here (though I'm not sure it would), but I'm more interested in figuring out what's happening with plan(multisession).
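For reference on the chunking caveat above, future.apply can batch iterations so that each worker handles a block of simulations via the future.chunk.size argument; a minimal sketch (my addition, not part of the test code, with a trivial stand-in for the simulation body):
plan(multisession)
# batch ~20 simulations per future instead of one future per simulation
out <- future_lapply(1:nsim, function(.j) {
  data.frame(id = .j, value = mean(rnorm(nrows)))  # stand-in for one simulation
}, future.chunk.size = 20)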
The results
I ran this on an 8-vCPU Linux EC2 (I can give more specs if people need them) and created the following plot from the results (plotting code at the bottom for reproducibility):
First off, plan(list(multisession, sequential)) is faster (as expected, see the caveat above), but what I'm confused about is the memory profile. The total system memory usage remains pretty constant for plan(list(multisession, sequential)), which I would expect, because I assumed the res object is overwritten each time through the loop.
However, the memory usage for plan(list(sequential, multisession)) steadily grows as the program runs. It appears that each time through the loop the res object is created and then hangs around in limbo somewhere, taking up memory. In my real example this got large enough that it filled my entire (32GB) system memory and killed the process about halfway through.
Plot twist: it only happens when nested
And here's the part that really has me confused! When I changed the outer future_lapply to just a regular lapply and set plan(multisession), I don't see it! From my reading of the "Future: Topologies" vignette, this should be the same as plan(list(sequential, multisession)), but the plot doesn't show the memory growing at all (in fact, it's almost identical to plan(list(multisession, sequential)) in the above plot).
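For concreteness, the non-nested variant looked roughly like this (a sketch from memory, not the exact code I ran):
plan(multisession)
final_out <- lapply(fake_inputs, function(.i) {
  future_lapply(1:nsim, function(.j) {
    res <- data.frame(id = rep(.j, nrows), col1 = rnorm(nrows) * .i)
    mem_string(log_file, "lapply-multisession")  # same logging helper as above
    res %>% slice_head(n = 10)
  })
})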
Note on other options
I actually originally found this with furrr::future_map_dfr(), but to be sure it wasn't a bug in furrr, I tried it with future.apply::future_lapply() and got the results shown. I tried to code this up with just future::future() and got very different results, but quite possibly because what I coded up wasn't actually equivalent. I don't have much experience with using futures directly, without the abstraction layer provided by either furrr or future.apply.
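For what it's worth, a bare-futures version might look something like the sketch below (an illustration, not necessarily what I ran):
plan(multisession)
# create one future per simulation explicitly, then collect their values
futs <- lapply(1:nsim, function(.j) {
  future({
    res <- data.frame(id = rep(.j, nrows), col1 = rnorm(nrows))
    head(res, 10)
  })
})
out <- lapply(futs, value)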
Again, any insight on this is much appreciated.
Plotting code
library(tidyverse)
logDat <- read_csv("future_mem_leak_log.csv",
                   col_names = c("plan", "time", "sys_used", "pid")) %>%
  group_by(plan) %>%
  mutate(
    start = min(time),
    time_elapsed = as.numeric(difftime(time, start, units = "secs"))
  )
ggplot(logDat, aes(x = time_elapsed/60, y = sys_used/1e9, group = plan, colour = plan)) +
  geom_line() +
  xlab("Time elapsed (in mins)") + ylab("Memory used (in GB)") +
  ggtitle("Memory Usage\n list(multisession, sequential) vs list(sequential, multisession)")
I have a function that creates a cartogram of fish catch per country per year and puts that cartogram into a list of cartograms, depending on which year I feed it:
fishtogram <- function(year) {
  dfname <- paste0("carto", year)  # name of the cartogram being made
  # 'map_year' is one SpatialPolygonsDataFrame of a year of fishing/country data,
  # pulled from the list of spdf's 'map_years'
  map_year <- get(paste0("map", year), map_years)
  # this is the part that takes forever: create the cartogram named 'dfname' and put it into the carto_maps list
  carto_maps[[dfname]] <<- cartogram(map_year, "CATCH", itermax = 1)
  plot(carto_maps[[dfname]], main = dfname)  # plot it
  print(paste("Finished", dfname, "at", Sys.time()))  # print the time the cartogram finished
  # save the cartogram as a shapefile
  writeOGR(obj = carto_maps[[dfname]], dsn = "Shapefiles", layer = dfname,
           driver = "ESRI Shapefile", overwrite_layer = TRUE)
}
Originally this was all in a for loop (for the years 1950-2014), and it does the job, just extremely slowly. The part that slows me down is the cartogram function. Currently, producing one cartogram takes about an hour and uses about 13% of my CPU. I would like to try using parallel processing to make 3-4 cartograms at once and hopefully speed things up.
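For context, the serial version is roughly equivalent to this (my paraphrase, not the exact original loop):
# one cartogram per year, computed one after another (~1 hour each)
for (year in 1950:2014) {
  fishtogram(year)
}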
How do I wrap this in an apply function correctly to both loop through the years I want and use parallel processing? I've been using this R bloggers post for guidance. My attempt:
lapply(seq(1975, 2014, 10), fishtogram, .parallel=TRUE)
>Error in FUN(X[[i]], ...) : unused argument (.parallel = TRUE)
Thank you to @patL for telling me to use lapply vs. apply.
My code & data can be found here: https://github.com/popovs/400m-cartograms/blob/master/400m_cartograms.R
To go parallel, you can try the parApply family of functions from the parallel package.
Following the steps from this page, you first need to detect the number of cores:
library(parallel)
no_cores <- detectCores() - 1  # it is recommended to use one core less than the total
cl <- makeCluster(no_cores)    # initiate the cluster
It is important to export all functions and objects you will use during your parallelization:
clusterExport(cl, "fishtogram")
clusterExport(cl, "dfname")
clusterExport(cl, "map_years")
...
Then you can run your parallelized version of lapply:
parLapply(cl, seq(1975, 2014, 10), fishtogram)
and finally stop the cluster
stopCluster(cl)
There are other functions you can use to run your code in parallel (foreach, from the foreach library; mclapply, also from the parallel library; etc.).
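For example, a hedged foreach/doParallel sketch (my addition, not tested on the original data; it assumes cartogram() comes from the cartogram package and writeOGR() from rgdal, and note that the <<- assignment inside fishtogram will not update carto_maps in your main session when it runs on a worker, although the shapefiles are still written):
library(foreach)
library(doParallel)

cl <- makeCluster(4)
registerDoParallel(cl)

# .export pushes fishtogram and the objects it uses to the workers
foreach(yr = seq(1975, 2014, 10),
        .packages = c("cartogram", "rgdal"),
        .export = c("fishtogram", "map_years", "carto_maps")) %dopar% {
  fishtogram(yr)
}

stopCluster(cl)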
Your specific error comes from the .parallel = TRUE argument: lapply has no such argument, so it gets passed on to fishtogram, which doesn't accept it (.parallel belongs to the plyr functions, not to base apply/lapply). Dropping it:
lapply(seq(1975, 2014, 10), fishtogram)
..would fix that error, although the loop still runs sequentially.
I have a large data.frame of 20M lines. This data frame is not only numeric; there are character columns as well. Using a split-and-conquer approach, I want to split this data frame so it can be processed in parallel with the snow package (the parLapply function, specifically). The problem is that the nodes run out of memory because the data frame parts are held in RAM. I looked for a package to help with this problem and found only one that handles mixed-type data frames: the ff package. Another problem comes from using this package: splitting an ffdf does not give the same result as splitting a common data.frame, so it is not possible to run parLapply on it.
Do you know of other packages for this goal? bigmemory only supports matrices.
I've benchmarked some ways of splitting the data frame and parallelizing to see how effective they are with large data frames. This may help you deal with the 20M line data frame and not require another package.
The results are here. The description is below.
This suggests that for large data frames the best option is (not quite the fastest, but has a progress bar):
library(doSNOW)
library(itertools)
library(tictoc)  # for tic()/toc() timing
# if the size on the cores exceeds available memory, increase the chunk factor
chunk.factor <- 1
chunk.num <- kNoCores * chunk.factor
tic()
# init the cluster
cl <- makePSOCKcluster(kNoCores)
registerDoSNOW(cl)
# init the progress bar
pb <- txtProgressBar(max = 100, style = 3)
progress <- function(n) setTxtProgressBar(pb, n)
opts <- list(progress = progress)
# conduct the parallelisation
travel.queries <- foreach(m = isplitRows(coord.table, chunks = chunk.num),
                          .combine = 'cbind',
                          .packages = c('httr', 'data.table'),
                          .export = c("QueryOSRM_dopar", "GetSingleTravelInfo"),
                          .options.snow = opts) %dopar% {
  QueryOSRM_dopar(m, osrm.url, int.results.file)
}
# close progress bar
close(pb)
# stop cluster
stopCluster(cl)
toc()
Note that:
- coord.table is the data frame/table
- kNoCores (= 25 in this case) is the number of cores

The options I benchmarked were:
1. Distributed memory. Sends coord.table to all nodes.
2. Shared memory. Shares coord.table with the nodes.
3. Shared memory with cuts. Shares a subset of coord.table with the nodes.
4. doPar with cuts. Sends a subset of coord.table to the nodes.
5. SNOW with cuts and progress bar. Sends a subset of coord.table to the nodes.
6. Option 5 without the progress bar.
More information about the other options I compared can be found here.
Some of these answers might suit you, although they don't relate to distributed parLapply, and I've included some of them in my benchmarking options.
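If you want to stay with a plain data.frame and parLapply, a minimal split-and-send sketch (my addition, reusing the coord.table name from above with a placeholder per-chunk function) could look like this:
library(parallel)

# split the data frame into row chunks so each worker only receives its own chunk
n.chunks <- 4
chunk.id <- cut(seq_len(nrow(coord.table)), breaks = n.chunks, labels = FALSE)
chunks   <- split(coord.table, chunk.id)

cl <- makeCluster(n.chunks, type = "PSOCK")
res <- parLapply(cl, chunks, function(chunk) {
  nrow(chunk)  # placeholder: replace with your real per-chunk processing
})
stopCluster(cl)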
I have a set of data files (around 50,000 of them, each about 1.5 MB). To load and process the data, I first used this code:
data <- list()  # creates a list
listcsv <- dir(pattern = "*.txt")  # creates the list of all the .txt files in the directory
Then I use a for loop to load and process each file:
a <- numeric(length(listcsv))
for (k in 1:length(listcsv)) {
  data[[k]] <- read.csv(listcsv[k], sep = "", as.is = TRUE, comment.char = "", skip = 37)
  my <- as.matrix(as.double(data[[k]][1:57600, 2]))
  mesh <- array(my, dim = c(120, 60, 8))
  Ms <- 1350 * 10^3                    # A/m
  asd2 <- (mesh[70:75, 24:36, 2]) / Ms # in A/m
  ort_my <- mean(asd2)
  print(ort_my)
  a[k] <- ort_my
  write(a, file = "D:/ddd/ads.txt", sep = '\t', ncolumns = 1)
}
So I set the program running, but even after 6 hours it hadn't finished, although I have a decent PC with 32 GB of RAM and a 6-core CPU.
I have searched the forum, and people say the fread function might be helpful. However, all the examples I have found so far deal with reading a single file with fread.
Can anyone suggest a faster way to loop over, read, and process data with this many rows and columns?
I am guessing there has to be a way to make the extraction of what you want more efficient. But I think running in parallel could save you a lot of time, and save memory by not storing each file.
library("data.table")
#Create function you want to eventually loop through in parallel
readFiles <- function(x) {
  data <- fread(x, skip = 37)
  my <- as.matrix(data[1:57600, 2, with = FALSE])
  mesh <- array(my, dim = c(120, 60, 8))
  Ms <- 1350 * 10^3                    # A/m
  asd2 <- (mesh[70:75, 24:36, 2]) / Ms # in A/m
  ort_my <- mean(asd2)
  return(ort_my)
}
#R Code to run functions in parallel
library("foreach"); library("parallel"); library("doMC")
detectCores() #This will tell you how many cores are available
registerDoMC(8) #Register the parallel backend
#Can change .combine from rbind to list
OutputList <- foreach(x = listcsv, .combine = rbind, .packages = c("data.table")) %dopar% (readFiles(x))
registerDoSEQ() #Very important to close out parallel backend.
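If you need the values on disk, as in your original loop, you could then write them out once at the end rather than rewriting the file on every iteration (a suggestion; the path is yours to choose):
# write the combined results a single time after the parallel run
write.table(OutputList, file = "ads.txt", sep = "\t", row.names = FALSE, col.names = FALSE)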
Good evening,
I am trying to analyse the aforementioned data (edgelist or Pajek format). My first thought was R with the igraph package, but memory limitations (6 GB) won't do the trick. Will a 128 GB PC be able to handle the data? Are there any alternatives that don't require the whole graph in RAM?
Thanks in advance.
P.S.: I have found several programs, but I would like to hear some pro (yeah, that's you) opinions on the matter.
If you only want degree distributions, you likely don't need a graph package at all. I recommend the bigtabulate package, so that:
- your R objects are file-backed, meaning you aren't limited by RAM
- you can parallelize the degree computation using foreach (see the sketch at the end of this answer)
Check out their website for more details. To give a quick example of this approach, let's first create an example with an edgelist involving 1 million edges among 1 million nodes.
set.seed(1)
N <- 1e6
M <- 1e6
edgelist <- cbind(sample(1:N, M, replace = TRUE),
                  sample(1:N, M, replace = TRUE))
colnames(edgelist) <- c("sender", "receiver")
write.table(edgelist, file = "edgelist-small.csv", sep = ",",
            row.names = FALSE, col.names = FALSE)
I next concatenate this file 10 times to make the example a bit bigger.
system("
for i in $(seq 1 10)
do
cat edgelist-small.csv >> edgelist.csv
done")
Next we load the bigtabulate package and read in the text file with our edgelist. The command read.big.matrix() creates a file-backed object in R.
library(bigmemory)    # provides read.big.matrix()
library(bigtabulate)
x <- read.big.matrix("edgelist.csv", header = FALSE,
                     type = "integer", sep = ",",
                     backingfile = "edgelist.bin",
                     descriptorfile = "edgelist.desc")
nrow(x) # 1e7 as expected
We can compute the outdegrees by using bigtable() on the first column.
outdegree <- bigtable(x,1)
head(outdegree)
Quick sanity check to make sure table is working as expected:
# Check table worked as expected for first "node"
j <- as.numeric(names(outdegree[1]))  # get the id of the first node
all.equal(as.numeric(outdegree[1]),   # outdegree's answer
          sum(x[, 1] == j))           # manual outdegree count
To get indegree, just do bigtable(x,2).
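As for the second point above (parallelizing with foreach), here is a hedged sketch of one way it could look, assuming a doParallel backend and that each worker re-attaches the file-backed matrix through its descriptor file; the combining step at the end is illustrative rather than optimized:
library(foreach)
library(doParallel)
library(bigmemory)
library(bigtabulate)

cl <- makeCluster(4)
registerDoParallel(cl)

# split the row indices into blocks, one per worker task
row.chunks <- split(seq_len(nrow(x)), cut(seq_len(nrow(x)), 4, labels = FALSE))

partial <- foreach(idx = row.chunks,
                   .packages = c("bigmemory", "bigtabulate")) %dopar% {
  xi <- attach.big.matrix("edgelist.desc")  # re-attach the file-backed matrix in the worker
  bigtable(xi[idx, , drop = FALSE], 1)      # outdegree counts for this block of rows
}
stopCluster(cl)

# sum the per-block counts into a single named vector of outdegrees
counts <- unlist(partial)
outdegree.par <- tapply(counts, names(counts), sum)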