How to properly parallelize similar tasks in Luigi

I spent a good five or six hours the other day trying to parallelize some work in Luigi, based on the method used here: http://rjbaxley.com/posts/2016/03/13/parallel_jobs_in_luigi.html
The problem I was having was that I kept getting a luigi.task_register.TaskClassAmbigiousException, which drove me crazy. Ultimately I threw luigi.auto_namespace(scope=__name__) at the top of my package and everything started working, but I don't know why. Roughly described, I had 3 tasks:
TaskA - requires nothing; produces a .txt file with paths
TaskB - requires only the input parameters p1 and p2; produces a .csv file
TaskC - requires the output of TaskA; yields one TaskB for each path pair in that output; is complete when all yielded TaskBs are complete
If anyone can sketch how I should have done this correctly, instead of the hacked-together nonsense I have now, I'd be very grateful.

Did you yield TaskB in your run() method, or in requires()? I ran into this issue when yielding in the run() method. It turned out that the task it was yielding was broken for other reasons, but Luigi wasn't very clear about propagating the full exception stack for tasks yielded in run().
To help debug, try calling luigi.build([TaskB]) and see whether that throws an exception. If you can provide the full exception stack, that would be helpful.
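For reference, here is a minimal sketch of that three-task layout using dynamic dependencies yielded from run(). The file names, the "path1,path2" line format, and the md5-based output naming are illustrative assumptions, not your actual code:

import hashlib

import luigi


class TaskA(luigi.Task):
    """Requires nothing; writes a text file listing path pairs."""

    def output(self):
        return luigi.LocalTarget("paths.txt")

    def run(self):
        with self.output().open("w") as f:
            # one "path1,path2" pair per line (hypothetical format)
            f.write("/data/in1.dat,/data/in2.dat\n")


class TaskB(luigi.Task):
    """Requires only the parameters p1 and p2; writes a .csv file."""
    p1 = luigi.Parameter()
    p2 = luigi.Parameter()

    def output(self):
        # a distinct output per parameter pair, so each TaskB is its own target
        key = hashlib.md5(("%s|%s" % (self.p1, self.p2)).encode()).hexdigest()
        return luigi.LocalTarget("taskb_%s.csv" % key)

    def run(self):
        with self.output().open("w") as f:
            f.write("p1,p2\n%s,%s\n" % (self.p1, self.p2))


class TaskC(luigi.Task):
    """Requires TaskA; dynamically yields one TaskB per path pair."""

    def requires(self):
        return TaskA()

    def output(self):
        return luigi.LocalTarget("taskc.done")

    def run(self):
        with self.input().open("r") as f:
            pairs = [line.strip().split(",") for line in f if line.strip()]
        # dynamic dependencies: yield the whole batch so the scheduler can run
        # the TaskBs in parallel, then resume here once they are all complete
        yield [TaskB(p1=a, p2=b) for a, b in pairs]
        with self.output().open("w") as f:
            f.write("done\n")


if __name__ == "__main__":
    luigi.build([TaskC()], workers=4, local_scheduler=True)

Because TaskC yields the whole list of TaskBs at once, the scheduler can run them in parallel across workers, and TaskC resumes only when every TaskB is complete.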

Related

First token could not be read or is not the keyword 'FoamFile' in OpenFOAM

I am a beginner to programming. I am trying to run a simulation of a combustion chamber using reactingFoam.
I have modified the counterflow2D tutorial.
For those who may not know OpenFOAM: it is a program written in C++, but using it does not require C++ programming, only correctly defining the variables in the required input files.
In one of my first attempts I built a very simple model, but since I wanted to check it thoroughly I set it to run for 60 seconds with a 1e-6 timestep.
My computer is not very powerful, so the run took approximately a day (which is why I'd like to find a fix rather than repeat the simulation).
I executed the solver reactingFoam on 4 processors in parallel using
mpirun -np 4 reactingFoam -parallel > log
The log does not show any evidence of error.
The problem is that reconstructPar works perfectly, but when I try to view the results with paraFoam this error is shown:
From function bool Foam::IOobject::readHeader(Foam::Istream&)
in file db/IOobject/IOobjectReadHeader.C at line 88
Reading "mypath/constant/reactions" at line 1
First token could not be read or is not the keyword 'FoamFile'
I have read that this can happen when files that are supposed to have content are empty, but I have not found any such file.
My 'reactions' file has not been modified from the tutorial and has always worked.
edit:
Sorry for the vague question. I have modified it a bit.
A typical OpenFOAM dictionary file always starts with a header entry named FoamFile (read in as a Foam::Istream). An example from a typical system/controlDict file can be seen below:
FoamFile
{
    version     2.0;
    format      ascii;
    class       dictionary;
    location    "system";
    object      controlDict;
}
While constructing the dictionary, if this header is absent, OpenFOAM stops and raises the error message you have experienced:
First token could not be read or is not the keyword 'FoamFile'
The header presumably exists to support OpenFOAM's abstraction mechanisms, which would be hard to implement otherwise.
As mentioned in the comments, adding the missing header almost always solves this problem.
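For example, the header at the top of constant/reactions would look roughly like this (the class and location entries here are assumptions; the safest route is to copy the header from another dictionary in the same case and change the object name):
FoamFile
{
    version     2.0;
    format      ascii;
    class       dictionary;
    location    "constant";
    object      reactions;
}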

SparkR collect method crashes with OutOfMemory on Java heap space

For a PoC with SparkR, I'm trying to collect an RDD that I created from text files containing around 4M lines.
My Spark cluster is running in Google Cloud, was deployed with bdutil, and is composed of 1 master and 2 workers with 15 GB of RAM and 4 cores each. My HDFS repository is based on Google Cloud Storage with gcs-connector 1.4.0.
SparkR is installed on each machine, and basic tests work on small files.
Here is the script I use:
Sys.setenv("SPARK_MEM" = "1g")
sc <- sparkR.init("spark://xxxx:7077", sparkEnvir=list(spark.executor.memory="1g"))
lines <- textFile(sc, "gs://xxxx/dir/")
test <- collect(lines)
The first time I run this, it seems to work fine: all the tasks run successfully, Spark's UI says the job completed, but I never get the R prompt back:
15/06/04 13:36:59 WARN SparkConf: Setting 'spark.executor.extraClassPath' to ':/home/hadoop/hadoop-install/lib/gcs-connector-1.4.0-hadoop1.jar' as a work-around.
15/06/04 13:36:59 WARN SparkConf: Setting 'spark.driver.extraClassPath' to ':/home/hadoop/hadoop-install/lib/gcs-connector-1.4.0-hadoop1.jar' as a work-around.
15/06/04 13:36:59 INFO Slf4jLogger: Slf4jLogger started
15/06/04 13:37:00 INFO Server: jetty-8.y.z-SNAPSHOT
15/06/04 13:37:00 INFO AbstractConnector: Started SocketConnector@0.0.0.0:52439
15/06/04 13:37:00 INFO Server: jetty-8.y.z-SNAPSHOT
15/06/04 13:37:00 INFO AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
15/06/04 13:37:54 INFO GoogleHadoopFileSystemBase: GHFS version: 1.4.0-hadoop1
15/06/04 13:37:55 WARN LoadSnappy: Snappy native library is available
15/06/04 13:37:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/06/04 13:37:55 WARN LoadSnappy: Snappy native library not loaded
15/06/04 13:37:55 INFO FileInputFormat: Total input paths to process : 68
[Stage 0:=======================================================> (27 + 10) / 68]
Then, after pressing Ctrl-C to get the R prompt back, I try to run the collect method again. Here is the result:
[Stage 1:==========================================================> (28 + 9) / 68]15/06/04 13:42:08 ERROR ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.remote.default-remote-dispatcher-5] shutting down ActorSystem [sparkDriver]
java.lang.OutOfMemoryError: Java heap space
at org.spark_project.protobuf.ByteString.toByteArray(ByteString.java:515)
at akka.remote.serialization.MessageContainerSerializer.fromBinary(MessageContainerSerializer.scala:64)
at akka.serialization.Serialization$$anonfun$deserialize$1.apply(Serialization.scala:104)
at scala.util.Try$.apply(Try.scala:161)
at akka.serialization.Serialization.deserialize(Serialization.scala:98)
at akka.remote.MessageSerializer$.deserialize(MessageSerializer.scala:23)
at akka.remote.DefaultMessageDispatcher.payload$lzycompute$1(Endpoint.scala:58)
at akka.remote.DefaultMessageDispatcher.payload$1(Endpoint.scala:58)
at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:76)
at akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:937)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:415)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
I understand the exception message, but I don't understand why I am getting it the second time.
Also, why does collect never return after the job completes in Spark?
I have Googled everything I could think of, but had no luck finding a solution. Any help or hint would be greatly appreciated!
Thanks
This appears to be a combination of inefficient Java in-memory object representations and some apparently long-lived object references, which cause some collections to fail to be garbage-collected in time for the new collect() call to overwrite the old one in place.
I experimented with some options. For my sample 256MB file containing ~4M lines, I could indeed reproduce your behavior: with SPARK_MEM=1g, collect is fine the first time but OOMs the second. I then set SPARK_MEM=4g instead, and after that I was able to Ctrl-C and re-run test <- collect(lines) as many times as I wanted.
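In terms of the script from the question, that amounts to raising SPARK_MEM before calling sparkR.init; the master URL and path below are the placeholders from the question, and the executor setting can stay as it was:
Sys.setenv("SPARK_MEM" = "4g")   # more heap for the SparkRBackend / driver JVM
sc <- sparkR.init("spark://xxxx:7077", sparkEnvir=list(spark.executor.memory="1g"))
lines <- textFile(sc, "gs://xxxx/dir/")
test <- collect(lines)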
For one thing, even if no references leaked, note that after the first run of test <- collect(lines), the variable test holds that gigantic array of lines. The second time you call it, collect(lines) executes before the result is finally assigned to test, so under any straightforward instruction ordering there is no way to garbage-collect the old contents of test. This means the second run makes the SparkRBackend process hold two copies of the entire collection at the same time, leading to the OOM you saw.
To diagnose, on the master I started SparkR and first ran
dhuo#dhuo-sparkr-m:~$ jps | grep SparkRBackend
8709 SparkRBackend
I also checked top and it was using around 22MB of memory. I fetched a heap profile with jmap:
jmap -heap:format=b 8709
mv heap.bin heap0.bin
Then I ran the first round of test <- collect(lines) at which point running top showed it using ~1.7g of RES memory. I grabbed another heap dump. Finally, I also tried test <- {} to get rid of references to allow garbage-collection. After doing this, and printing out test and showing it to be empty, I grabbed another heap dump and noticed RES still showed 1.7g. I used jhat heap0.bin to analyze the original heap dump, and got:
Heap Histogram
All Classes (excluding platform)
Class Instance Count Total Size
class [B 25126 14174163
class [C 19183 1576884
class [<other> 11841 1067424
class [Lscala.concurrent.forkjoin.ForkJoinTask; 16 1048832
class [I 1524 769384
...
After running collect, I had:
Heap Histogram
All Classes (excluding platform)
Class Instance Count Total Size
class [C 2784858 579458804
class [B 27768 70519801
class java.lang.String 2782732 44523712
class [Ljava.lang.Object; 2567 22380840
class [I 1538 8460152
class [Lscala.concurrent.forkjoin.ForkJoinTask; 27 1769904
Even after I nulled out test, it remained about the same. This shows us 2784858 instances of char[], for a total size of 579MB, and also 2782732 instances of String, presumably holding those char[]'s above it. I followed the reference graph all the way up, and got something like
char[] -> String -> String[] -> ... -> class scala.collection.mutable.DefaultEntry -> class [Lscala.collection.mutable.HashEntry; -> class scala.collection.mutable.HashMap -> class edu.berkeley.cs.amplab.sparkr.JVMObjectTracker$ -> java.util.Vector@0x785b48cd8 (36 bytes) -> sun.misc.Launcher$AppClassLoader@0x7855c31a8 (138 bytes)
And then AppClassLoader had thousands of inbound references. So somewhere along that chain, something should have been removing its reference but failed to do so, causing the entire collected array to sit in memory while we try to fetch a second copy of it.
Finally, to answer your question about hanging after the collect: it appears to be related to the data not fitting in the R process's memory; here is a thread related to that issue: https://www.mail-archive.com/user@spark.apache.org/msg29155.html
I confirmed that with a smaller file containing only a handful of lines, running collect indeed does not hang.

R Package IBrokers placeOrder() function fails

I'm using the package: IBrokers. It works well for me when I request historical data. Also the call to reqAccountUpdates() works well.
I am having problems with this script:
# myscript.r
.libPaths("rpackages")
library(IBrokers)
tws2 = twsConnect(2)
print('Attempting BUY')
mytkr = twsFuture("ES","GLOBEX","201412")
myorderid = sample(1001:3001, 1)
IBrokers:::.placeOrder(tws2, mytkr, twsOrder(myorderid, "BUY", "1", "MKT"))
twsDisconnect(tws2)
Sometimes the above script works okay. Usually, though, it fails. When it fails, the connection itself still seems to succeed.
Then I see this in my TWS console:
03:47:45:581 JTS-EServerSocket-290: [2:47:71:1:0:0:0:ERR] Message type -1. Socket I/O error -
03:47:45:581 JTS-EServerSocket-290: Anticipated error
jextend.d: Socket I/O error -
at jextend.sc.b(sc.java:364)
at jextend.ch.sb(ch.java:1534)
at jextend.ch.run(ch.java:1390)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:196)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at java.net.SocketInputStream.read(SocketInputStream.java:210)
at jextend.xh.d(xh.java:45)
at jextend.sc.c(sc.java:579)
at jextend.sc.r(sc.java:227)
at jextend.af.a(af.java:232)
at jextend.sc.f(sc.java:650)
at jextend.pd.a(pd.java:822)
at jextend.sc.b(sc.java:358)
... 3 more
03:47:45:583 JTS-EServerSocket-290: [2:47:71:1:0:0:0:ERR] Socket connection for client{2} has closed.
03:47:45:583 JTS-EWriter14-291: [2:47:71:1:0:0:0:ERR] Unable write to socket client{2} -
03:47:45:584 JTS-EServerSocketNotifier-288: Terminating
Can you offer any ideas on how you would wrestle with this issue?
One other piece of info:
I think the call to reqIds() may be necessary. Sometimes reqIds() would return an id that was not high enough; when I used it, placeOrder() would fail. So I call reqIds(), but then use Sys.time() to generate an id that is larger than the last ID I used.
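A rough sketch of that id-generation idea (combining reqIds() with the clock this way is my own illustration, not code from the thread):
# ask TWS for its next valid id, then make sure the id never goes backwards
nextid    <- as.integer(reqIds(tws2))
myorderid <- max(nextid, as.integer(Sys.time()), na.rm = TRUE)
IBrokers:::.placeOrder(tws2, mytkr, twsOrder(myorderid, "BUY", "1", "MKT"))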
Another problem may have been some code-text I copied out of a PowerPoint. Some of the code-characters may have been corrupt.
The main problem was the orderid.
I need to be careful how I generate orderid.
I may also have pasted in bad characters from a PowerPoint presentation that had an example.
I posted some code which works in a comment in this thread.
Dan

Out-of-memory error in parfor: kill the slave, not the master

When an out-of-memory error is raised in a parfor, is there any way to kill only one MATLAB slave to free some memory instead of having the entire script terminate?
Here is what happens by default when an out-of-memory error occurs in a parfor: the entire script terminates (the original question showed a screenshot of the resulting error).
I wish there were a way to just kill one slave (i.e. remove a worker from the parpool), or to stop using it, in order to release as much memory as possible from it.
If you get an out-of-memory error in the master process, there is no way to fix this. For an out-of-memory error on a slave, the following should do it:
The basic idea of the code: restart the parfor again and again with the missing iterations until you have all the results. If one iteration fails, a flag (a file) is written, which makes all other iterations throw an error as soon as the first error has occurred. This way we get out of the loop without wasting time producing further out-of-memory errors.
%Your intended iterator
iterator = 1:10;
%flags which indicate which iterations succeeded
succeeded = false(size(iterator));
%result array
result = nan(size(iterator));
FLAG = 'ANY_WORKER_CRASHED';
while ~all(succeeded)
    fprintf('Another try\n')
    %determine which iterations still need to be done
    todo = iterator(~succeeded);
    %initialize array for the remaining results
    partresult = nan(size(todo));
    %initialize flags which indicate which iterations succeeded (we cannot
    %throw errors inside the parfor, that would throw away the partial results)
    partsucceeded = false(size(todo));
    %the flag file indicates that a worker crashed; a file-based solution is
    %used because I don't know a better one
    if exist(FLAG, 'file')
        delete(FLAG);
    end
    try
        parfor falseindex = 1:sum(~succeeded)
            realindex = todo(falseindex);
            try
                % The flag is used to let all other workers jump out of the
                % loop as soon as one calculation has crashed.
                if exist(FLAG, 'file')
                    error('some other worker crashed');
                end
                % insert your code here
                %dummy code which randomly throws an exception
                if rand < .5
                    error('hit out of memory')
                end
                partresult(falseindex) = realindex * 2;
                % End of user code
                partsucceeded(falseindex) = true;
                fprintf('trying to run %d and succeeded\n', realindex)
            catch ME
                % catch errors within workers to preserve the work done so far
                partresult(falseindex) = nan;
                partsucceeded(falseindex) = false;
                fprintf('trying to run %d but it failed\n', realindex)
                fclose(fopen(FLAG, 'w'));
            end
        end
    catch
        %reduce the pool size by 1
        newsize = matlabpool('size') - 1;
        matlabpool close
        matlabpool(newsize)
    end
    %put the results of this pass into the full result
    result(~succeeded) = partresult;
    succeeded(~succeeded) = partsucceeded;
end
After quite a bit of research and a lot of trial and error, I think I have a decent, compact answer. What you're going to do is:
Declare some max memory value. You can set it dynamically using the MATLAB function memory, but I like to set it directly.
Call memory inside your parfor loop, which returns the memory information for that particular worker.
If the memory used by the worker exceeds the threshold, cancel the task that worker was working on. Now, here it gets a bit tricky. Depending on the way you're using parfor, you'll need to delete or cancel either the task or the worker. I've verified that it works with the code below when there is one task per worker, on a remote cluster.
Insert the following code at the beginning of your parfor contents. Tweak as necessary.
memLimit = 280000000; %// This doesn't have to be in the parfor. Everything else does.
memData = memory;
if memData.MemUsedMATLAB > memLimit
    task = getCurrentTask();
    cancel(task);
end
Enjoy! (Fun question, by the way.)
One other option to consider: since R2013b, you can open a parallel pool with 'SpmdEnabled' set to false, which allows MATLAB worker processes to die without the whole pool being shut down; see the documentation at http://www.mathworks.co.uk/help/distcomp/parpool.html. Of course, you still need to arrange somehow to shut down the workers.
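For example (the pool size here is arbitrary):
parpool('local', 4, 'SpmdEnabled', false);  % workers can now crash without taking the whole pool down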

Clear console for each run of Testacular/Karma + Jasmine

It is difficult for me to spot the boundary between test runs.
Is it possible to clear the console for each run of Testacular/Karma + Jasmine, or at least to print something easy to spot there, for example a series of newlines?
Note
This is currently an abandoned question, because I am no longer trying to perform the tasks described in it. Please do not ask for additional info; answer only if you know for sure what to do. It will help other people.
Write your own reporter, and do whatever you want with it.
Also, if you're on a Mac and use Growl, take a look at karma-growl-reporter.
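A minimal sketch of the write-your-own-reporter route (the plugin and reporter names here are made up; it simply prints a visible separator at the start of every run):
// karma-separator-reporter/index.js (hypothetical local plugin)
var SeparatorReporter = function (baseReporterDecorator) {
  baseReporterDecorator(this);
  var self = this;
  this.onRunStart = function () {
    // a burst of newlines makes the boundary between runs easy to spot
    self.write('\n\n\n\n========== new test run ==========\n\n');
  };
};
SeparatorReporter.$inject = ['baseReporterDecorator'];
module.exports = { 'reporter:separator': ['type', SeparatorReporter] };
It would then be registered through the plugins and reporters entries in karma.conf.js.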
I am not sure I fully understand your need, but karma-spec-reporter can give you a detailed view of your test execution. Example output from karma-spec-reporter-example:
array:
  push:
    PASSED - should add an element
    PASSED - should remove an element
    FAILED - should do magic (this test will fail) expected [] to include 'magic'
      at /home/michael/development/codecentric/karma-spec-reporter-example/node_modules/chai/chai.js:401
...
PhantomJS 1.8.1 (Linux): Executed 3 of 3 (1 FAILED) (0.086 secs / NaN secs)
There's now a reporter available for this: https://github.com/arthurc/karma-clear-screen-reporter
It's working for me on OSX.
