How can I speed up Pkg.add() in Julia?

Take Cairo as an example: when I run Pkg.add("Cairo"), nothing is displayed in the console.
Is there a way to make Pkg.add() display more information while it is working?
What steps does Pkg.add() carry out? Download, compile?
Is it possible to speed it up? I waited for 15 minutes and got no output. Maybe it's Julia's problem, or maybe it's my system's problem; how can one tell?
Edit
Julia version: 0.3.9 (Installed using binary from julia-lang.org)
OS: Windows 7, 64-bit.
CPU: Core Duo 2.4 GHz
RAM: 4 GB
Hard Disk: SSD
ping github.com passed, 0% loss.
Internet download speedtest: ~30 Mbps.
I don't know whether this is normal: it took me 11 seconds to get the version.
PS C:\Users\Nick> Measure-Command {julia --version}
Days : 0
Hours : 0
Minutes : 0
Seconds : 11
Milliseconds : 257
Ticks : 112574737
TotalDays : 0.000130294834490741
TotalHours : 0.00312707602777778
TotalMinutes : 0.187624561666667
TotalSeconds : 11.2574737
TotalMilliseconds : 11257.4737
And it took nearly 2 minutes to load the Gadfly package:
julia> @time require("Gadfly")
elapsed time: 112.131236102 seconds (442839856 bytes allocated, 0.39% gc time)
Does it run faster on Linux/Mac than on Windows? It is usually not easy to build software on Windows; still, would building from source improve performance?
Julia is awesome, I really hope it works!

As mentioned by Colin T. Bowers, your particular case is abnormally slow and indicates something wrong with your installation. But Pkg.add (along with other Pkg operations) is known to be slow on Julia 0.4. Luckily, this issue has been fixed.
Pkg operations have seen a substantial increase in performance in Julia v0.5, which is being released today (September 19, 2016). You can go to the downloads page to get v0.5. (It might be a few hours before they're up as of the time of this writing.)

Related

Memory allocation for entities in R only 2 GB?

Windows 10 64-bit, 32 GB RAM, RStudio 1.1.383 and R 3.4.2 (up to date)
I have several CSV files, each of which has at least one or two lines full of nul values. So I wrote a script that uses read_lines_raw() from the readr package, which reads the file in raw format and produces a list with one raw vector per row. I then check each line for 00 (the nul byte), and any line that contains it gets deleted.
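For reference, a minimal sketch of that kind of filter (not the actual script; "data.csv" and "data_clean.csv" are placeholder names):
library(readr)
lines   <- read_lines_raw("data.csv")                      # list of raw vectors, one per line
has_nul <- vapply(lines, function(x) any(x == as.raw(0)), logical(1))
clean   <- lines[!has_nul]                                 # drop lines containing a nul byte
writeLines(vapply(clean, rawToChar, character(1)), "data_clean.csv")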
One of the files is 2.5 GB in size and also has nul values somewhere in it. The problem is that read_lines_raw() is not able to read this file and throws an error:
Error in read_lines_raw_(ds, n_max = n_max, progress = progress) :
  negative length vectors are not allowed
I don't even understand the problem. My research hints at something related to the file size, but not even half of the RAM is used. Some other files it was able to read were 1.5 GB in size. Is this file too big, or is something else causing this?
Update 1:
I tried to read in the whole file using scan but that also gave me an error:
could not allocate memory (2048 Mb) in C function 'R_AllocStringBuffer'
So although my PC has 32 GB of RAM, the maximum allowed space for a single entity is 2 GB? I checked to make sure I am running 64-bit R, and yes, I am.
> version
_
platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 3
minor 4.2
year 2017
month 09
day 28
svn rev 73368
language R
version.string R version 3.4.2 (2017-09-28)
nickname Short Summer
It seems like many people are facing similar issues, but I could not find a solution. How can we increase the memory allocation for individual entities? memory.limit() returns 32 GB, which is the RAM size, but that isn't helpful. memory.size() returns something close to 2 GB, and since the file is 2.7 GB on disk, I assume this is the reason for the error.
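For what it's worth, one workaround I may try (only a sketch, using base R and placeholder file names; note it strips nul bytes rather than dropping whole lines) is to stream the file in fixed-size chunks so that no single buffer has to hold the whole file:
infile  <- file("data.csv", "rb")                          # placeholder names
outfile <- file("data_clean.csv", "wb")
repeat {
  chunk <- readBin(infile, what = "raw", n = 64 * 1024^2)  # 64 MB at a time
  if (length(chunk) == 0) break
  writeBin(chunk[chunk != as.raw(0)], outfile)             # drop nul bytes
}
close(infile)
close(outfile)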
Thank you.

"GC overhead limit exceeded" on cache of large dataset into spark memory (via sparklyr & RStudio)

I am very new to the big data technologies I am attempting to work with, but have so far managed to set up sparklyr in RStudio to connect to a standalone Spark cluster. Data is stored in Cassandra, and I can successfully bring large datasets into Spark memory (cache) to run further analysis on them.
However, recently I have been having a lot of trouble bringing one particularly large dataset into Spark memory, even though the cluster should have more than enough resources (60 cores, 200 GB RAM) to handle a dataset of its size.
I thought that by limiting the data being cached to just a few select columns of interest I could overcome the issue (using the answer code from my previous query here), but it does not help. What happens is that the jar process on my local machine ramps up to take up all the local RAM and CPU resources and the whole process freezes, while on the cluster executors keep getting dropped and re-added. Weirdly, this happens even when I select only 1 row for caching (which should make this dataset much smaller than other datasets I have had no problem caching into Spark memory).
I've had a look through the logs, and these seem to be the only informative errors/warnings early on in the process:
17/03/06 11:40:27 ERROR TaskSchedulerImpl: Ignoring update with state FINISHED for TID 33813 because its task set is gone (this is likely the result of receiving duplicate task finished status updates) or its executor has been marked as failed.
17/03/06 11:40:27 INFO DAGScheduler: Resubmitted ShuffleMapTask(0, 8167), so marking it as still running
...
17/03/06 11:46:59 WARN TaskSetManager: Lost task 3927.3 in stage 0.0 (TID 54882, 213.248.241.186, executor 100): ExecutorLostFailure (executor 100 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 167626 ms
17/03/06 11:46:59 INFO DAGScheduler: Resubmitted ShuffleMapTask(0, 3863), so marking it as still running
17/03/06 11:46:59 WARN TaskSetManager: Lost task 4300.3 in stage 0.0 (TID 54667, 213.248.241.186, executor 100): ExecutorLostFailure (executor 100 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 167626 ms
17/03/06 11:46:59 INFO DAGScheduler: Resubmitted ShuffleMapTask(0, 14069), so marking it as still running
And then after 20min or so the whole job crashes with:
java.lang.OutOfMemoryError: GC overhead limit exceeded
I've changed my connection config to increase the heartbeat interval (spark.executor.heartbeatInterval: '180s'), and have seen how to increase memoryOverhead by changing settings on a YARN cluster (using spark.yarn.executor.memoryOverhead), but not on a standalone cluster.
In my config file, I have experimented by adding each of the following settings one at a time (none of which have worked):
spark.memory.fraction: 0.3
spark.executor.extraJavaOptions: '-Xmx24g'
spark.driver.memory: "64G"
spark.driver.extraJavaOptions: '-XX:MaxHeapSize=1024m'
spark.driver.extraJavaOptions: '-XX:+UseG1GC'
UPDATE: and my full current yml config file is as follows:
default:
  # local settings
  sparklyr.sanitize.column.names: TRUE
  sparklyr.cores.local: 3
  sparklyr.shell.driver-memory: "8G"
  # remote core/memory settings
  spark.executor.memory: "32G"
  spark.executor.cores: 5
  spark.executor.heartbeatInterval: '180s'
  spark.ext.h2o.nthreads: 10
  spark.cores.max: 30
  spark.memory.storageFraction: 0.6
  spark.memory.fraction: 0.3
  spark.network.timeout: 300
  spark.driver.extraJavaOptions: '-XX:+UseG1GC'
  # other configs for spark
  spark.serializer: org.apache.spark.serializer.KryoSerializer
  spark.executor.extraClassPath: /var/lib/cassandra/jar/guava-18.0.jar
  # cassandra settings
  spark.cassandra.connection.host: <cassandra_ip>
  spark.cassandra.auth.username: <cassandra_login>
  spark.cassandra.auth.password: <cassandra_pass>
  spark.cassandra.connection.keep_alive_ms: 60000
  # spark packages to load
  sparklyr.defaultPackages:
  - "com.datastax.spark:spark-cassandra-connector_2.11:2.0.0-M1"
  - "com.databricks:spark-csv_2.11:1.3.0"
  - "com.datastax.cassandra:cassandra-driver-core:3.0.2"
  - "com.amazonaws:aws-java-sdk-pom:1.10.34"
So my questions are:
Does anyone have any ideas about what to do in this instance?
Are there config settings I can change to help with this issue?
Alternatively, is there a way to import the Cassandra data in batches with RStudio/sparklyr as the driver?
Or, alternatively, is there a way to munge/filter/edit the data as it is brought into the cache so that the resulting table is smaller (similar to using SQL querying, but with more complex dplyr syntax)? (See the sketch below.)
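As a sketch of what I mean by that last point (the table and column names below are made up, and sc is the existing sparklyr connection):
library(sparklyr)
library(dplyr)
small <- tbl(sc, "big_table") %>%        # hypothetical table already visible to Spark
  select(col_a, col_b) %>%               # keep only the columns of interest
  filter(col_b > 0) %>%                  # push the filter down before materialising
  compute("big_table_small")             # materialise the reduced result as a Spark table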
OK, I've finally managed to make this work!
I'd initially tried the suggestion of @user6910411 to decrease the Cassandra input split size, but this failed in the same way. After playing around with LOTS of other things, today I tried changing that setting in the opposite direction:
spark.cassandra.input.split.size_in_mb: 254
By INCREASING the split size, there were fewer Spark tasks and thus less overhead and fewer calls to the GC. It worked!
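For anyone who prefers to set this from R code rather than the yml file, a minimal sketch (the master URL is a placeholder):
library(sparklyr)
conf <- spark_config()
conf$spark.cassandra.input.split.size_in_mb <- 254   # larger splits -> fewer tasks -> less GC overhead
sc <- spark_connect(master = "spark://<master_ip>:7077", config = conf)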

Error: Maximal number of DLLs reached

I'm writing an R package which depends upon many other packages. When I load too many packages into the session I frequently get this error:
Error in dyn.load(file, DLLpath = DLLpath, ...) :
unable to load shared object '/Library/Frameworks/R.framework/Versions/3.2/Resources/library/proxy/libs/proxy.so':
maximal number of DLLs reached...
This post, Exceeded maximum number of DLLs in R, pointed out that the issue is with Rdynload.c in the base R code:
#define MAX_NUM_DLLS 100
Is there any way to bypass this issue except modifying and building from source?
As of R 3.4, you can set a different maximum number of DLLs using an environment variable, R_MAX_NUM_DLLS. From the release notes:
The maximum number of DLLs that can be loaded into R e.g. via
dyn.load() can now be increased by setting the environment
variable R_MAX_NUM_DLLS before starting R.
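For example (a sketch, assuming R >= 3.4), you can put the variable in ~/.Renviron and then verify it from inside R:
# in ~/.Renviron (read when R starts up):
#   R_MAX_NUM_DLLS=500
# then, after restarting R:
Sys.getenv("R_MAX_NUM_DLLS")   # confirm the setting is visible
length(getLoadedDLLs())        # how many DLLs are currently loaded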
Increasing that number is of course "possible"... but it also costs a bit
(adding to the fixed memory footprint of R).
I did not set that limit, but I'm pretty sure it was also meant as a reminder for the useR to "clean up" a bit in his or her R session, i.e., not to load package namespaces unnecessarily. I cannot yet imagine that you need more than 100 packages / namespaces loaded in your R session.
OTOH, some packages nowadays have a host of dependencies, so I agree that this may at least happen accidentally more frequently than in the past.
The real solution of course would be a code improvement that starts with a relatively small number of "DLLinfo" structures (say 32), and then allocates more batches (of size say 32) if needed.
Patches to the R sources (development trunk in subversion at https://svn.r-project.org/R/trunk/ ) are very welcome!
---- added Jan. 26, 2017: In the meantime, we've had a public bug report about this, a proposed patch (which was not good enough: there is always an OS-dependent limit on the number of open files), and today that bug report has been closed by R Core member @TomasKalibera, who implemented new code where the maximal number of loaded DLLs is set at
pmax(100, pmin(1000, 0.6* OS_dependent_getrlimit_or_equivalent()))
and so on Windows and Linux (and not yet tested, but "almost surely" macOS), the limit should be considerably higher than previously.
----- Update #2 (written Jan.5, 2018):
In Oct'17, the above change was made more automatic with the following commit to the sources (of the development version of R - only!)
r73545 | kalibera | 2017-10-12 14:41:20
Increase the number of DLLs that can be loaded by default. If needed,
increase the soft limit on open files.
and on the help page ?dyn.load (https://stat.ethz.ch/R-manual/R-devel/library/base/html/dynload.html) the ulimit -n <num_open_files> is now mentioned (section Note close to bottom).
So you might consider using R's development version till that becomes "main stream" in April.
Alternatively, you can do (in a terminal / shell)
ulimit -n 2048
and then start R from that terminal. Tomas Kalibera mentioned that this works on macOS.
I had this issue with the simpleSingleCell library in Bioconductor.
On macOS you can't exceed 256, so I set this in my .Renviron in my home dir:
R_MAX_NUM_DLLS=150
It's easy:
Go to the environment variables and add/edit
variable_name = R_MAX_NUM_DLLS
value = 1000
Restart R.
This worked well for me.

Parallel detectCores() giving incorrect output on a Virtual machine

I'm just doing some performance testing on a new laptop. My problem starts when I try to test parallel computing.
So, when I run the function detectCores() from parallel, the result is 1. The problem is that the laptop has an i7-4800MQ processor, which has 4 cores.
As a result, when I run my code it thinks it has only one core, and the time to execute the code is exactly the same as without parallelization.
I've tested the code on a different machine with an i5 processor, also with 4 cores, using the same R version (R 3.0.2, 64-bit), and the code runs perfectly. The only difference is that the new computer has Windows 8.1 installed while the old one has Windows 7.
Also, when I run Sys.getenv("NUMBER_OF_PROCESSORS") I get 1 as the answer.
I've searched the internet looking for an answer with no joy. Has anyone come across this problem before?
Many thanks
Make sure you are loading the parallel package before running detectCores(). I also have an i7 processor (Windows 8.1, 64-bit) and I see 8 cores when I run detectCores(logical = TRUE) and 4 when I run detectCores(logical = FALSE). For more, kindly refer to this link. HTH
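A quick check along those lines (parallel ships with base R, so nothing extra is assumed):
library(parallel)
detectCores(logical = TRUE)             # logical processors, hyper-threads included
detectCores(logical = FALSE)            # physical cores
Sys.getenv("NUMBER_OF_PROCESSORS")      # what Windows itself reports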

How to optimize R performance

We have a recent performance benchmark that I am trying to understand. We have a large script whose performance appears 50% slower on a Red Hat Linux machine than on a Windows 7 laptop with comparable specs. The Linux machine is virtualized using KVM and has 4 cores assigned to it along with 16 GB of memory. The script is not I/O intensive but has quite a few for loops. Mainly I am wondering if there are any R compile options I can use to optimize, or any kernel compiler options, that might help make this more comparable. Any pointers would be appreciated. I will try to get another machine and test it on bare metal as well for a better comparison.
These are the configure flags I am using to compile R on the Linux machine. I have experimented quite a bit, and these options seem to cut 12 seconds off the execution time for larger data sets. Essentially I went from 2.087 to 1.48 seconds with these options.
./configure CFLAGS="-O3 -g -std=gnu99" CXXFLAGS="-O3 -g" FFLAGS="-O2 -g" LDFLAGS="-Bdirect,--hash-style=both,-Wl,-O1" --enable-R-shlib --without-x --with-cairo --with-libpng --with-jpeglib
Update 1
The script has not been optimized yet. Another group is actually working on the script, and we have put in requests to use the apply functions, but I am not sure how this explains the disparity in the times.
The top of the profile looks like this. Most of these functions will later be optimized using the apply functions, but right now it is benchmarked apples to apples on both machines.
"eval.with.vis" 8.66 100.00 0.12 1.39
"source" 8.66 100.00 0.00 0.00
"[" 5.38 62.12 0.16 1.85
"GenerateQCC" 5.30 61.20 0.00 0.00
"[.data.frame" 5.20 60.05 2.40 27.71
"ZoneCalculation" 5.12 59.12 0.02 0.23
"cbind" 3.22 37.18 0.34 3.93
"[[" 2.26 26.10 0.38 4.39
"[[.data.frame" 1.88 21.71 0.82 9.47
My first suspicion, which I will be testing shortly and updating with my findings, is that the KVM Linux virtualization is to blame. This script is very memory intensive, and due to the large number of array operations and R being pass-by-copy (which of course has to malloc), this may be causing the problem. Since the VM does not have direct access to the memory controller and must share it with its other VMs, this could very likely cause the problem. I will be getting a bare-metal machine later today and will update with my findings.
Thank you all for the quick updates.
Update 2
We originally thought the cause of the performance problem was hyper-threading within a VM, but this turned out to be incorrect: performance was comparable on a bare-metal machine.
We later realized that the Windows laptop was using a 32-bit version of R for the computations. This led us to try the 64-bit version of R, and the result was ~140% slower than 32-bit on the exact same script. This leads me to the question of how it is possible that the 64-bit version could be ~140% slower than the 32-bit version of R.
What we are seeing is:
Windows 32 bit execution time 48 seconds
Windows 64 bit execution time 2.33 seconds.
Linux 64 bit execution time 2.15 seconds.
Linux 32 bit execution time < in progress > (built a 32-bit version of R on RHEL 6.3 x86_64 but did not see a performance improvement; I am going to reload with a 32-bit version of RHEL 6.3)
I found this link, but it only explains a 15-20% hit on some 64-bit machines.
http://www.hep.by/gnu/r-patched/r-admin/R-admin_51.html
Sorry I cannot legally post the script.
Have a look at the sections on "profiling" in the Writing R Extensions manual.
From 30,000 feet, you can't say much more: you will need profiling information. The "general consensus" (and I am putting this in quotes as you can't really generalize these things) is that Linux tends to do better at memory management and file access, so I am a little astonished by your results.
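A minimal profiling sketch along those lines (the script name is a placeholder):
Rprof("profile.out")                              # start collecting samples
source("my_script.R")                             # placeholder for the actual script
Rprof(NULL)                                       # stop profiling
head(summaryRprof("profile.out")$by.total, 10)    # functions ranked by total time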
Building R with --enable-R-shlib can cause a performance penalty. This is discussed in R Installation and Administration, Appendix B, Section B.1. That alone could explain 10-20% of the variation. Other differences could come from the specs not being as "comparable" as they appear.
The issue was resolved: it was caused by a non-optimized BLAS library.
This article was a great help; switching to ATLAS made a big difference.
http://www.cybaea.net/Blogs/Data/Faster-R-through-better-BLAS.html
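A crude way to see the effect of swapping BLAS libraries is to time a large matrix multiply before and after the change (the matrix size here is arbitrary); on recent R versions, sessionInfo() also reports which BLAS/LAPACK libraries are in use:
set.seed(1)
m <- matrix(rnorm(2000 * 2000), nrow = 2000)
system.time(m %*% m)    # compare this timing before and after switching BLAS
sessionInfo()           # lists the BLAS/LAPACK libraries on recent R versions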
