spark: java.io.IOException: No space left on device [again!] - r

I am getting the java.io.IOException: No space left on device that occurs after running a simple query in sparklyr. I use both last versions of Spark (2.1.1) and Sparklyr
df_new <-spark_read_parquet(sc, "/mypath/parquet_*", name = "df_new", memory = FALSE)
myquery <- df_new %>% group_by(text) %>% summarize(mycount = n()) %>%
arrange(desc(mycount)) %>% head(10)
#this FAILS
get_result <- collect(myquery)
I do have set both
spark.local.dir <- "/mypath/"
spark.worker.dir <- "/mypath/"
using the usual
config <- spark_config()
config$`spark.executor.memory` <- "100GB"
config$`spark.executor.cores` <- "3"
config$`spark.local.dir` <- "/mypath/"
config$`spark.worker.dir` <- "mypath/"
config$`spark.cores.max`<- "2000"
config$`spark.default.parallelism`<- "4"
config$`spark.total-executor-cores`<- "80"
config$`sparklyr.shell.driver-memory` <- "100G"
config$`sparklyr.shell.executor-memory` <- "100G"
config$`spark.yarn.executor.memoryOverhead` <- "100G"
config$`sparklyr.shell.num-executors` <- "90"
config$`spark.memory.fraction` <- "0.2"
Sys.setenv(SPARK_HOME="mysparkpath")
sc <- spark_connect(master = "spark://mynode", config = config)
where mypath has more than 5TB of disk space (I can see these options in the Environment tab). I tried a similar command in Pyspark and it failed the same way (same error).
By looking at the Stages tab in Spark, I see that the error occurs when shuffle write is about 60 GB. (input is about 200GB). This is puzzling given that I have plenty of space available. I have have looked at the other SO solutions already...
The cluster job is started with magpie https://github.com/LLNL/magpie/blob/master/submission-scripts/script-sbatch-srun/magpie.sbatch-srun-spark
Every time I start a Spark job, I see a directory called spark-abcd-random_numbers in my /mypath folder. but the size of the files in there is very small (nowhere near the 60GB shuffle write)
there are about 40 parquet files. each is 700K (original csv files were 100GB) They contain strings essentially.
cluster is 10 nodes, each has 120GB RAM and 20 cores.
What is the problem here?
Thanks!!

I ve had this problem multiple times before. The reason behind is the temporary files. most of servers have a very small size partition for /tmp/ which is the default temporary directory for spark.
Usually, I used to change that by setting that in spark-submit command as the following:
$spark-submit --master local[*] --conf "spark.driver.extraJavaOptions=-Djava.io.tmpdir=/mypath/" ....
In your case, I think that you can provide that to the configuration in R as following (I have not tested that but that should work):
config$`spark.driver.extraJavaOptions` <- "-Djava.io.tmpdir=/mypath/"
config$`spark.executor.extraJavaOptions ` <- "-Djava.io.tmpdir=/mypath/"
Notice that you have to change that for the driver and executors since you're using Spark standalone master (as I can see in your question)
I hope that will help

change following settings in your magpie script
export MAGPIE_LOCAL_DIR="/tmp/${USER}/magpie"
export SPARK_LOCAL_DIR="/tmp/${USER}/spark"
to have mypath prefix and not /tmp

Once you set the parameter, you can see the new value of spark.local.dir in Spark environment UI. But it doesn't reflect.
Even I faced the similar problem. After setting this parameter, I restarted the machines and then started working.

Since you need to set this when the JVM is launched via spark-submit, you need to use the sparklyr java-options, e.g.
config$`sparklyr.shell.driver-java-options` <- "-Djava.io.tmpdir=/mypath"

I had this very problem this week on a Standalone mode cluster and after trying different things, like some of the recommendations in this thread, it ended up being a sub folder called "work" inside the Spark home folder grew unchecked for a while thus filling up the worker's hhd

Related

Is there a way to transfer R objects to separate R sessions on linux?

I've got a program that repeatedly loads largish datasets that are stored in R's Rds format. Here's a silly example that has all of the salient features:
# make and save the data
big_data <- matrix(rnorm(1e6^2), 1e6)
saveRDS(big_data, file = "big_data.Rds")
# write a program that uses the data
big_data <- readRDS("big_data.Rds")
BIGGER_data <- big_data+rnorm(1)
print("hooray!")
# save this in a text file called `my_program.R`
# run this program a bunch
for (i = 1:1000){
system("Rscript my_program.R")
}
The bottleneck is loading the data. But what if I had a separate process somewhere that held the data in memory?
Maybe something like this:
# write a program to hold the data in memory
big_data <- readRDS("big_data.Rds")
# save this as `holder.R` open a terminal and do
Rscript holder.R
Now there is a process running somewhere with my data in memory. How can I get it from a different R session? (I'm assuming that this would be faster than loading it -- but is this correct?)
Maybe something like this:
# write another program:
big_data <- get_big_data_from_holder()
BIGGER_data <- big_data+1
print("yahoo!")
# save this as `my_improved_program.R`
# now do the following:
for (i = 1:1000){
system("Rscript my_improved_program.R")
}
So I guess my question is what would the function get_big_data_from_holder() look like? Is it possible to do this? Practical?
Backstory: I'm trying to work around what appears to be a memory leak in R's interface to keras/tensorflow, that I've described here. The workaround is to let the OS clean up all of the cruft left over from a TF session, so that I can run TF sessions one after another without my computer slowing to a crawl.
Edit: maybe I could do this with a clone() system call? Conceptually I can imagine that I'd clone the process running holder and then run all of the commands in the program that depend on the data that's loaded. But I don't know how this would be done.
You may also improve the performance of saving and loading the data by turning off compression:
saveRDS(..., compress = FALSE)
You may find my filematrix package useful for storing and quickly accessing the big matrix.
To create it, run:
big_data = matrix(rnorm(1e4^2), 1e4)
library(filematrix)
fm = fm.create.from.matrix('matrix_file', big_data)
close(fm)
To access it from another R session:
library(filematrix)
fm = fm.open('matrix_file')
show(fm[1:3,1:3])
close(fm)

Get number of active spark executors with sparklyr and R

When launching a spark cluster via sparklyr, I notice that it can take between 10-60 seconds for all the executors to come online.
Right now I'm using Sys.sleep(60) to allow time for them to come online, but sometimes it takes longer than that, and sometimes it is shorter than that. I want a programmatic way to adjust for this time variance, similar to this question regarding Python. So I think I want to pass getExecutorMemoryStatus via sparklyr, but I'm not sure how to do this.
To see what I'm seeing, run the following code to launch a yarn-client spark connection, and check the Yarn UI. In the Event Timeline we can see at which time each executor comes online.
spark_config <- spark_config()
spark_config$spark.executor.memory <- "11G"
spark_config$`sparklyr.shell.driver-memory` <- "11G"
spark_config$spark.dynamicAllocation.enabled <- FALSE
spark_config$`spark.yarn.executor.memoryOverhead` <- "1G"
spark_config$spark.executor.instances <- 32
sc <- spark_connect(master = "yarn-client", config = spark_config)
So I think I want to pass getExecutorMemoryStatus via sparklyr, but I'm not sure how to do this.
You have to retrieve SparkContext object:
sc <- spark_connect(...)
spark_context(sc) %>%
...
and then invoke the method:
... %>% invoke("getExecutorMemoryStatus")
Together:
spark_context(sc) %>%
invoke("getExecutorMemoryStatus") %>%
names()
should give you a list of active executors.

Retain valid workspace reference after project transfer.

I've been working on a R project (projectA) that I want to hand over to a colleague, what would be the best way to handle workspace references in the scripts? To illustrate, let's say projectA consists of several R scripts that each read input and write output to certain directories (dirs). All dirs are contained within my local dropbox. The I/O part of the scripts look as follows:
# Script 1.
# Give input and output names and dirs:
dat1Dir <- "D:/Dropbox/ProjectA/source1/"
dat1In <- "foo1.asc"
dat2Dir <- "D:/Dropbox/ProjectA/source2/"
dat2In <- "foo2.asc"
outDir <- "D:/Dropbox/ProjectA/output1/"
outName <- "fooOut1.asc"
# Read data
setwd(dat1Dir)
dat1 <- read.table(dat1In)
setwd(dat2Dir)
dat2 <- read.table(dat2In)
# do stuff with dat1 and dat2 that result in new data foo
# Write new data foo to file
setwd(outDir)
write.table(foo, outName)
# Script 2.
# Give input and output names and dirs
dat1Dir <- "D:/Dropbox/ProjectA/output1/"
dat1In <- "fooOut1.asc"
outDir <- "D:/Dropbox/ProjectA/output2/"
outName <- "fooOut2.asc"
Etc. Each script reads and write data from/to file and subsequent scripts read the output of previous scripts. The question is: how can I ensure that the directory-strings remain valid after transfer to another user?
Let's say we copy the ProjectA folder, including subfolders, to another PC, where it is stored at, e.g., C:/Users/foo/my documents/. Ideally, I would have a function FindDir() that finds the location of the lowest common folder in the project, here "ProjectA", so that I can replace every directory string with:
dat1Dir <- paste(FindDir(), "ProjectA/source1", sep= "")
So that:
# At my own PC
dat1Dir <- paste(FindDir(), "ProjectA/source1", sep= "")
> "D:/Dropbox/ProjectA/source1/"
# At my colleagues PC
dat1Dir <- paste(FindDir(), "ProjectA/source1", sep= "")
> "C:Users/foo/my documents/ProjectA/source1/"
Or perhaps there is a different way? Our work IT infrastructure currently does not allow using a shared disc. I'll put helper-functions in an 'official' R project (ie, hosted on R forge), but I'd like to use scripts when many I/O parameters are required and because the code can easily be viewed and commented.
Many thanks in advance!
You should be able to do this by using relative directory paths. This is what I do for my R projects that I have in Dropbox and that I edit/run on both my Windows and OS X machines where the Dropbox folder is D:/Dropbox and /Users/robin/Dropbox respectively.
To do this, you'll need to
Set the current working directory in R (either in the first line of your script, or interactively at the console before running), using setwd('/Users/robin/Dropbox;) (see the full docs for that command).
Change your paths to relative paths, which mean they just have the bit of the path from the current directory, in this case the 'ProjectA/source1' bit if you've set your current directory to your Dropbox folder, or just 'source1' if you've set your current directory to the ProjectA folder (which is a better idea).
Then everything should just work!
You may also be interested in an R library that I love called ProjectTemplate - it gives you really nice functionality for making self-contained projects for this sort of work in R, and they're entirely reproducible, moveable between computers and so on. I've written an introductory blog post which may be useful.

How to export data from a shell and then how to plot it in other computer? R

Eyy guys, I am processing some datum using R in a Mac, first of all let me show you my code.
require(spatstat)
require(maptools)
setwd("/Whereever") # in that folder I have the files sis1993.txt and colombia.shp...
mydata <- read.table("sis1993.txt", header = TRUE)
attach(mydata)
summary(mydata)
datos=read.table("sis1993.txt", header=T, dec=",", sep="\t")
summary(datos)
attach(datos)
S=readShapePoly("colombia.shp")
SP=as(S,"SpatialPolygons")
W=as(S,"owin")
sis1993=ppp(datos$x, datos$y, window=W)
unitname(sis1993)="meter"
sis1993=ppp(datos$x, datos$y, window=W, marks=m )
unitname(sis1993)="meter"
summary(sis1993)
Kenv <- envelope(sis1993,fun="Kest",nsim=199,nrank=5)
plot(Kenv,xlab="r",ylab="Khat(r)",cex.lab=1.6,cex.axis=1.5,main="K-Hat", cex.main=1.5)
Then, as a result of the function plot, I see this:
Now, let me explain you my issue... It is just a test, it only has 2 points, but the task that I am facing has around 1000-1200 points and I also must do it around ten times, thanks to that I cannot do this simulation using my computer because it will spend so much time, instead of that I plan to rent one of these services: Amazon EC2 (Amazon Elastic Compute Cloud).
In there, I must do it through a Linux/UNIX shell (SO: Ubuntu, I guess.), now my question is related to the last two lines in my code, how can I export Kenv into a file or something in order to import it and then plot it in my computer.
I hope you have understood me, let me know if you have any question.
See help(save) in R. That saves things to a file.
Then copy the file from the Amazon instance to your PC (beyond the scope of me).
then see help(load) for how to load it back in.

Downloading multiple file as parallel in R

I am trying to download 460,000 files from ftp server ( which I got from the TRMM archive data). I made a list of all files and separated them into different jobs, but can any one help me how to run those jobs at the same time in R. Just an example what I have tried to do
my.list <-readLines("1998-2010.txt") # lists the ftp address of each file
job1 <- for (i in 1: 1000) {
download.file(my.list[i], name[i], mode = "wb")
}
job2 <- for (i in 1001: 2000){
download.file(my.list[i], name[i], mode = "wb")
}
job3 <- for (i in 2001: 3000){
download.file(my.list[i], name[i], mode = "wb")
}
Now I m stuck on how to run all of the Jobs at the same time.
Appreciate your help
Dont do that. Really. Dont. It won't be any faster because the limiting factor is going to be the network speed. You'll just end up with a large number of even slower downloads, and then the server will just give up and throw you off, and you'll end up with a large number of half-downloaded files.
Downloading multiple files will also increase the disk load since now your PC is trying to save a large number of files.
Here's another solution.
Use R (or some other tool, its one line of awk script starting from your list) to write an HTML file which just looks like this:
file-1.dat
file-2.dat
and so on. Now open this file in your web browser and use a download manager (eg DownThemAll for Firefox) and tell it to download all the links. You can specify how many simultaneous downloads, how many times to retry fails and so on with DownThemAll.
A good option is to use the mclapply or parLapply from the builtin parallel package. You then make a function that accepts a list of files that need to be downloaded:
library(parallel)
dowload_list = function(file_list) {
return(lapply(download.file(file_list)))
}
list_of_file_lists = c(my_list[1:1000], my_list[1001:2000], etc)
mclapply(list_of_file_lists, download_list)
I think it is wise to first split up the big list of files into a set a sublists, as for each entry in the list fed to mclapply a process is spawned. If this list is big, and the processing time per item in the list small, the overhead of parallelisation is probably going to make the downloading slower in stead of faster.
Do note that mclapply only works on Linux, parLapply should also work fine under Windows.
First make a while loop which looks for all the destination file. If the current predefined destination file is in the existing destination file, the script creates a new destination file. This will create many destination files which correspond to each one downloading. Next, I parallel the script. If I have 5 cores on my machine, I will get 5 destination files on my disk. I can also use lapply function to do that.
For example:
id <- 0
newDestinationFile <- "File.xlsx"
while(newDestinationFile %in% list.files(path =getwd(),pattern ="[.]xlsx"))
{
newDestinationFile <- paste0("File",id,".xlsx")
id <- id+1
download.file(url = URLS,method ="libcurl",mode ="wb",quiet = TRUE,destfile =newDestinationFile)
}

Resources