How does Hydras sweeper, specifically Ax-sweeper free/allocate memory? - fb-hydra

So I'm using Hydra 1.1 and hydra-ax-sweeper==1.1.5 to manage my configuration, and run some hyper-parameter optimization on minerl environment. For this purpose, I load a lot of data in to memory (peak around 50Gb while loading with multiprocessing, drops to 30Gb after fully loaded) with multiprocessing (by pytorch).
On a normal run this is not a problem (My machine have 90+Gb RAM), one training finish without any issue.
However, when I run the same code with -m option (and hydra/sweeper: ax in config), the code stops after about 2-3 sweeper runs, getting stuck at the data loading phase, because all memories of the system (+swap memory) is occupied.
First I thought this was some issue with minerl environment code, which starts java-code in sub-process. So I tried to run my code without the environment (only the 30Gb data), and I still have the same issue. So I suspect I have some memory-leak inbetween the Hydra sweeper.
So my question is, How does Hydra sweeper(or ax-sweeper) work in-between sweeps? I always had the impression that it runs the main(cfg: DictConfig) decorated with #hydra.main(...), takes a scalar return(score) and run the Bayesian optimizer with this score, with main() called similar to a function (everything inside being properly deallocated/garbage collected between each sweep-run).
Is this not the case? Should I then load the data somewhere outside the main() and keep it between sweeps?
Thank you very much in advance!

The hydra-ax-sweeper may run trials in parallel, depending on the result of calling the get_max_parallelism function defined in ax.service.ax_client.
I suspect that your machine is running out of memory because of this parallelism.
Hydra's Ax plugin does not currently have a config group for configuring this max_parallelism setting, so it is automatically set by ax.
Loading the data outside of main (as you suggested) may be a good workaround for this issue.

Hydra sweepers in general does not have a facility to control concurrency. This is the responsibility of the launcher you are using.
The built-in basic launcher runs the jobs serially so it should not trigger memory issues.
If you are using other launchers, you may need to control their parallelism via Launcher specific parameters.

Related

Azure Machine Learning throws error "Invalid graph: You have invalid compute target(s) in node(s)" while running the pipeline

I am facing a strange issue while dealing with Azure Machine Learning (Preview) interface.
I have designed a training pipeline, which was getting initiated on certain compute node (2 node cluster, with minimal configurations). However, it used to take lot of time for execution. So, I tried to create a new training cluster (8 node cluster, with higher config). During this process, I had created and deleted some of the training clusters.
But, strangely, since then, while submitting the pipeline I am getting error as "Invalid graph: You have invalid compute target(s) in node(s)".
Could you please advise on this situation.
Thanks,
Mitul
I bet this was pretty frustrating. A common debugging strategy I have is to delete compute targets and create new ones. Perhaps this was another "transient" error?
The issue should have been fixed and will be rolled out soon. Meanwhile, as a temporary solution, you can refresh the page to make it work.

Why Symfony3 so slow?

I installed Symfony3 framework-standard-edition. I'm trying to open the home page(app.php prod) and it is loaded 300-400ms.
This is my profiler information:
also I use php7.
Why it is so long?
You can try to optimize Zend OPCache.
Here are some recommended settings
opcache.revalidate_freq
Basically put, how often (in seconds) should the code cache expire and check if your code has changed. 0 means it checks your PHP code every single request (which adds lots of stat syscalls). Set it to 0 in your development environment. Production doesn't matter because of the next setting.
opcache.validate_timestamps
When this is enabled, PHP will check the file timestamp per your opcache.revalidate_freq value.
When it's disabled, opcache.revaliate_freq is ignored and PHP files are NEVER checked for updated code. So, if you modify your code, the changes won't actually run until you restart or reload PHP (you force a reload with kill -SIGUSR2).
Yes, this is a pain in the ass, but you should use it. Why? While you're updating or deploying code, new code files can get mixed with old ones— the results are unknown. It's unsafe as hell
opcache.max_accelerated_files
Controls how many PHP files, at most, can be held in memory at once. It's important that your project has LESS FILES than whatever you set this at. For a codebase at ~6000 files, I use the prime number 8000 for maxacceleratedfiles.
You can run find . -type f -print | grep php | wc -l to quickly calculate the number of files in your codebase.
opcache.memory_consumption
The default is 64MB. You can use the function opcachegetstatus() to tell how much memory opcache is consuming and if you need to increase the amount.
opcache.interned_strings_buffer
A pretty neat setting with like 0 documentation. PHP uses a technique called string interning to improve performance— so, for example, if you have the string "foobar" 1000 times in your code, internally PHP will store 1 immutable variable for this string and just use a pointer to it for the other 999 times you use it. Cool.
This setting takes it to the next level— instead of having a pool of these immutable string for each SINGLE php-fpm process, this setting shares it across ALL of your php-fpm processes. It saves memory and improves performance, especially in big applications.
The value is set in megabytes, so set it to "16" for 16MB. The default is low, 4MB.
opcache.fast_shutdown
Another interesting setting with no useful documentation. "Allows for faster shutdown".
Oh okay. Like that helps me. What this actually does is provide a faster mechanism for calling the destructors in your code at the end of a single request to speed up the response and recycle php workers so they're ready for the next incoming request faster.
Set it to 1 and turn it on.
opcache=1
opcache.memory_consumption=256
opcache.interned_strings_buffer=16
opcache.max_accelerated_files=8000
opcache.validate_timestamps=0
opcache.revalidate_freq=0
opcache.fast_shutdown=1
I hope it will help improve your performances
[EDIT]
You might also want to look at this answer:
Are Doctrine relations affecting application performance?
TheMrbikus, try some optimization with the following elements:
Use APC
Use Bootstrap files
Reference: http://symfony.com/doc/current/performance.html
Use the OPCache PHP7
Use Apache PHP-FPM.
E-mail sending process, and may slow down during the form rendering operations. Create a blank test Controller.

JavaFX 8 QuantumRenderer high CPU usage

I have a JavaFX APP containing two listviews displaying incoming customer orders (using a custom cellfactory) received from my server. I also have a few tableview displaying information from a Postgres database (this are spread across a few tabs inside a tabpane).
The user has to take an order (by clicking on it), and enter a few information inside textboxes.
The application was initially written an deployed using Java7. I had no problem whatsoever.
But recently I decided to switch to Java8. I modified my code to use lambdas and added a few extras stuff to the app:
a timeline to check and display orders status every minute, inside a textflow;
modified the customcellfactory class to use an external CSS, with setId instead of setStyle;
...
Now, the application is running fine but, after 2-3 hours of uptime it becomes sluggish. Since is hard for me simulate the behavior inside a profiler I used jstack, top -H, and matching pid with nid to find out what is happening.
This way I found out that the culprit was QuantumRenderer with 95+% CPU usage:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
30300 utilizat+ 20 0 5801608 527412 39696 S 95,1 6,5 60:57.34 java
"QuantumRenderer-0" #9 daemon prio=5 os_prio=0 tid=0x00007f4f182bb800 nid=0x765c runnable [0x00007f4eeb2a1000]
java.lang.Thread.State: RUNNABLE
at com.sun.prism.es2.X11GLDrawable.nSwapBuffers(Native Method)
at com.sun.prism.es2.X11GLDrawable.swapBuffers(X11GLDrawable.java:50)
at com.sun.prism.es2.ES2SwapChain.present(ES2SwapChain.java:186)
at com.sun.javafx.tk.quantum.PresentingPainter.run(PresentingPainter.java:107)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at com.sun.javafx.tk.RenderJob.run(RenderJob.java:58)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at com.sun.javafx.tk.quantum.QuantumRenderer$PipelineRunnable.run(QuantumRenderer.java:125)
at java.lang.Thread.run(Thread.java:745)
The machine running the application is using a 64Bit version of Lubuntu.
I can't figure out where should I look to find out what is the problem...
It seems your renderer is using the X11 pipeline (Java2D?) which could be the cause of high CPU usage (software acceleration). Does your graphics card supports hardware acceleration?
Try getting more information with -Dprism.verbose=true if your graphics card does support hardware acceleration you might want to try to force it with -Dprism.forceGPU=true, also try enabling the OpenGL pipeline to increase Java2D performance with
-Dprism.order=es2,es1,sw,j2d (you could also try with the old Java2D flag
-Dsun.java2d.opengl=true but I think that won't affect prism).
I would also recommend taking a look at the OpenJFX performance tips and tricks checklist I've seen high CPU usage in nodes that was somewhat fixed with the usage of Node.setCache(true) and its CacheHints when using any kind of animation (with the downside that this uses more memory).
Also, take a look at how you are updating your UI from your worker threads. It's important to do minimum work in the FX UI Thread and update it from your workers correctly and only when necessary, take a look at this other question to learn more about the javafx.concurrent.Task class and its correct usage to update the UI from worker threads.
This seems much more like a software acceleration issue and Dprism.verbose should let you know more but following the other suggestions never hurts! Hope this helps!

Programs have to access to global variables in package

I've a package with global variables related to a file open
(*os.File), and its logger associated.
By another side, I'll build several commands that are going to use
that package and I want not open the file to set it as logger every
time I run a command.
So, the first program to run will set the global variables, and here
is my question:
Do the next programs to use the package can access to those global
variables without problem? It could be created a command with a flag
to initialize those values before of be used by another programs, and
another flag to finish it (unset the global variables in the package).
If that is not possible, which is the best option to avoid such IO-bound? To use a server in Unix sockets?
Assuming by 'program' you actually mean 'process', the answer is no.
If you want to share a (customized perhaps) logging functionality between processes then I would consider a daemon-like (Go doesn't yet AFAIK support writing true daemons) process/server and any kind of IPC you see handy.

have R halt the EC2 machine it's running on

I have a few work flows where I would like R to halt the Linux machine it's running on after completion of a script. I can think of two similar ways to do this:
run R as root and then call system("halt")
run R from a root shell script (could run the R script as any user) then have the shell script run halt after the R bit completes.
Are there other easy ways of doing this?
The use case here is for scripts running on AWS where I would like the instance to stop after script completion so that I don't get charged for machine time post job run. My instance I use for data analysis is an EBS backed instance so I don't want to terminate it, simply suspend. Issuing a halt command from inside the instance is the same effect as a stop/suspend from AWS console.
I'm impressed that works. (For anyone else surprised that an instance can stop itself, see notes 1 & 2.)
You can also try "sudo halt", as you wouldn't need to run as a root user, as long as the user account running R is capable of running sudo. This is pretty common on a lot of AMIs on EC2.
Be careful about what constitutes an assumption of R quitting - believe it or not, one can crash R. It may be better to have a separate script that watches the R pid and, once that PID is no longer active, terminates the instance. Doing this command inside of R means that if R crashes, it never reaches the call to halt. If you call it from within another script, that can be dangerous, too. If you know Linux well, what you're looking for is the PID from starting R, which you can pass to another script that checks ps, say every 1 second, and then terminates the instance once the PID is no longer running.
I think a better solution is to use the EC2 API tools (see: http://docs.amazonwebservices.com/AWSEC2/latest/APIReference/ for documentation) to terminate OR stop instances. There's a difference between the two of these, and it matters if your instance is EBS backed or S3 backed. You needn't run as root in order to terminate the instance - the fact that you have the private key and certificate shows Amazon that you're the BOSS, way above the hoi polloi who merely have root access on your instance.
Because these credentials can be used for mischief, be careful about running API tools from a given server, you'll need your certificate and private key on the server. That's a bad idea in the event that you have a security problem. It would be better to message to a master server and have it shut down the instance. If you have messaging set up in any way between instances, this can do all the work for you.
Note 1: Eric Hammond reports that the halt will only suspend an EBS instance, so you still have storage fees. If you happen to start a lot of such instances, this can clutter things up. Your original question seems unclear about whether you mean to terminate or stop an instance. He has other good advice on this page
Note 2: A short thread on the EC2 developers forum gives advice for Linux & Windows users.
Note 3: EBS instances are billed for partial hours, even when restarted. (See this thread from the developer forum.) Having an auto-suspend close to the hour mark can be useful, assuming the R process isn't working, in case one might re-task that instance (i.e. to save on not restarting). Other useful tools to consider: setTimeLimit and setSessionTimeLimit, and various checkpointing tools (I have a Q that mentions a couple). Using an auto-kill is useful if one has potentially badly behaved code.
Note 4: I recently learned of the shutdown command in package fun. This is multi-platform. See this blog post for commentary, and code is here. Dangerous stuff, but it could be useful if you want to adapt to Windows. I haven't tried it, though.
Update 1. Three more ideas:
You could use .Last() and runLast = TRUE for q() and quit(), which could shut down the instance.
If using littler or a script that invokes the script via Rscript, the same command line functions could be used.
My favorite package of today, tcltk2 has a neat timer mechanism, called tclTaskSchedule() that can be used to schedule the execution of an expression. You could then go crazy with the execution of stuff just before a hourly interval has elapsed.
system("echo 'rootpassword' | sudo halt")
However, the downside is having your root password in plain text in the script.
AFAIK those ways you mentioned are the only ones. In any case the script will have to run as root to be able to shut down the machine (if you find a way to do it without root that's possibly an exploit). You ask for an easier way but system("halt") is just an additional line at the end of your script.
sudo is an option -- it allows you to run certain commands without prompting for any password. Just put something like this in /etc/sudoers
<username> ALL=(ALL) PASSWD: ALL, NOPASSWD: /sbin/halt
(of course replacing with the name of user running R) and system('sudo halt') should just work.

Resources