Dataflow from Colab issue

I'm trying to run a Dataflow job from Colab and getting the following worker error:
sdk_worker_main.py: error: argument --flexrs_goal: invalid choice: '/root/.local/share/jupyter/runtime/kernel-1dbd101c-a79e-432e-89b3-5ba68df104d7.json' (choose from 'COST_OPTIMIZED', 'SPEED_OPTIMIZED')
I haven't provided the flexrs_goal argument, and if I do it doesn't fix this issue. Here are my pipeline options:
beam_options = PipelineOptions(
    runner='DataflowRunner',
    project=...,
    job_name=...,
    temp_location=...,
    subnetwork='regions/us-west1/subnetworks/default',
    region='us-west1'
)
My pipeline is very simple, it's just:
with beam.Pipeline(options=beam_options) as pipeline:
    (pipeline
     | beam.io.ReadFromBigQuery(
           query=f'SELECT column FROM {BQ_TABLE} LIMIT 100')
     | beam.Map(print))
It looks like the command line args for the sdk worker are getting polluted by jupyter somehow. I've rolled back to the past two apache-beam library versions and it hasn't helped. I could move over to Vertex Workbench but I've invested a lot in this Colab notebook (plus I like the easy sharing) and I'd rather not migrate.

Figured it out. The PipelineOptions constructor will pull in sys.argv if no parameter is given for the first argument (called flags). In my case it was pulling in the command line args that my jupyter notebook was started with and passing them as Beam options to the workers.
I fixed my issue by doing this:
beam_options = PipelineOptions(
    flags=[],
    ...
)
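Putting it together, the constructor call looks like this; a minimal sketch in which the project, job name, and temp location values are placeholders standing in for the elided ones above:

from apache_beam.options.pipeline_options import PipelineOptions

beam_options = PipelineOptions(
    flags=[],  # don't fall back to sys.argv, which holds the Jupyter kernel args
    runner='DataflowRunner',
    project='my-project',                 # placeholder
    job_name='my-job',                    # placeholder
    temp_location='gs://my-bucket/temp',  # placeholder
    subnetwork='regions/us-west1/subnetworks/default',
    region='us-west1',
)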

Related

Trimmomatic-0.39 error message "Unknown option -baseout ..."

I have used Trimmomatic-0.39 a few times already for trimming sequencing data. This time I tried the -baseout option as stated in the manual, but it is not recognised as a valid option and the command does not run. If I run the command as I usually do, with all the output files listed explicitly, it works though.
I type in the command line:
java -jar trimmomatic-0.39.jar PE -phred33 -trimlog trimmed_file18_log.log -baseout file18.fastq.gz file18-R1.fastq.gz file18-R2.fastq.gz ILLUMINACLIP:NexteraPE-PE.fa:2:30:10 MAXINFO:25:0.2 MINLEN:20
What I get back is:
Unknown option -baseout file18.fastq.gz
Usage:
PE [-version] [-threads <threads>] [-phred33|-phred64] [-trimlog <trimLogFile>] [-summary <statsSummaryFile>] [-quiet] [-validatePairs] [-basein <inputBase> | <inputFile1> <inputFile2>] [-baseout <outputBase> | <outputFile1P> <outputFile1U> <outputFile2P> <outputFile2U>] <trimmer1>...
or: .....
I get back the same error message even if I move the '-baseout file18.fastq.gz' option after '...jar PE' and before the list of all the other options.

OpenMDAO: adding command line args for ExternalCodeComp that won't result in a runtime error

In OpenMDAO V3.1 I am using an ExternalCodeComp to execute a CFD code. Typically, I would call it as such:
mpirun nodet_mpi --design_run
If the above call is made in the appropriate directory, it will find the appropriate run file and execute the CFD run. I have tried these command args for the ExternalCodeComp:
execute = ['mpirun', 'nodet_mpi', '--design_run']
execute = ['mpirun', 'nodet_mpi --design_run']
execute = ['mpirun nodet_mpi --design_run']
I either get an error such as:
RunTimeError: 255, execvp error on file "nodet_mpi --design_run" (No such file or directory)
Or that the command cannot be found.
Is there any way to setup the execute statement to include commandline args for the flow solver when an input file is not defined?
Thanks in advance!
One detail in your question seems incorrect: you state that you have tried execute = "...". The ExternalCodeComp uses an option called command, so I will assume that you are using the correct option in your code.
The most correct form to use is the list with all arguments as single entries in the list:
self.options['command'] = ['mpirun', 'nodet_mpi', '--design_run']
Your error msg seems to indicate that the directory that OpenMDAO is running in is not the same as the directory you would like to execute the CFD code from. The absolute simplest solution would be to make sure that you are in the correct directory via cd in the terminal window before executing your python script.
However, there is likely a reason that your python script is in a different place so there are other options I can suggest:
1. You can use a combination of os.getcwd() and os.chdir() inside the compute method that you have implemented to make sure you switch into and out of the working directory for the CFD code (see the sketch after the option 2 code below).
2. If you would like to, you can modify the entries of the list you've assigned to the self.options['command'] option on the fly within your compute method. You would again be relying on some of the methods in the os module for help: os.path.exists can be used to test whether the specific input files you need exist, and you can modify the command option accordingly.
For option 2, code would look something like this:
def compute(self, inputs, outputs):
    if os.path.exists('some_input.file'):
        self.options['command'] = ['mpirun', 'nodet_mpi', '--design_run']
    else:
        self.options['command'] = ['mpirun', 'nodet_mpi', '--design_run', '--other_options']
    # the parent compute function actually runs the external code
    super().compute(inputs, outputs)
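For option 1, a minimal sketch, assuming the CFD run directory is known up front (the class name and path below are placeholders, not from the question):

import os
import openmdao.api as om

class CFDComp(om.ExternalCodeComp):

    def setup(self):
        self.options['command'] = ['mpirun', 'nodet_mpi', '--design_run']

    def compute(self, inputs, outputs):
        start_dir = os.getcwd()           # remember where OpenMDAO is running
        os.chdir('/path/to/cfd_run_dir')  # placeholder: directory containing the run file
        try:
            # the parent compute actually runs the external code
            super().compute(inputs, outputs)
        finally:
            os.chdir(start_dir)           # switch back even if the run fails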

How to get the 'execution count' of the most recent execution in IPython or a Jupyter notebook?

IPython and Jupyter notebooks keep track of an 'execution count'. This is e.g. shown in the prompts for input and output: In[...] and Out[...].
How can you get the latest execution count programmatically?
The variable Out cannot be used for this:
Out is a dictionary, with execution counts as keys, but only for those cells that yielded a result; note that not all (cell) executions need to yield a result (e.g. print(...)).
The variable In seems to be usable:
In is a list of all inputs; it seems that In[0] is always primed with an empty string, so that you can use the 1-based execution count as index.
However, when using len(In) - 1 in your code, you need to take into account that the execution count seems to be updated before that new code executes. So it actually seems you need to use len(In) - 2 for the execution count of the most recent already completed execution.
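In notebook terms, the observation boils down to something like the following, which only runs inside an IPython/Jupyter cell (In is the input history list maintained by IPython):

# len(In) counts the primed In[0] plus every input, including the one currently running,
# so subtracting 2 gives the count of the most recent already completed execution.
most_recent_completed = len(In) - 2
print(most_recent_completed)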
Questions:
Is there a better way to get the current execution count?
Can you rely on the above observations (the 'seems')?
After more digging, I found the (surprisingly simple) answer:
from IPython import get_ipython
ipython = get_ipython()
... ipython.execution_count ...
This does not show up in the IPython documentation, though it could (should?) have been mentioned here: https://ipython.readthedocs.io/en/stable/api/generated/IPython.core.interactiveshell.html (I guess attributes are not documented). You do find run_line_magic there (which was mentioned in a comment to my question).
The way I found this attribute was by defining ipython as above and then using code completion (TAB) on ipython; the attribute itself has no documentation.
help(ipython)
gives you documentation about InteractiveShell, and it does mention execution_count (though it does not confirm its purpose):
| execution_count
| An int trait.
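A minimal usage sketch; whether you still need to subtract 1, as with len(In) above, depends on whether you want the cell currently executing or the last completed one, so treat the comment as an assumption and verify in your own environment:

from IPython import get_ipython

ipython = get_ipython()
# execution_count is the counter behind the In[...]/Out[...] prompts
print(ipython.execution_count)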

SparkR collect method crashes with OutOfMemory on Java heap space

With SparkR, I'm trying, for a PoC, to collect an RDD that I created from text files containing around 4M lines.
My Spark cluster is running in Google Cloud, was deployed with bdutil, and is composed of 1 master and 2 workers with 15 GB of RAM and 4 cores each. My HDFS repository is based on Google Storage with gcs-connector 1.4.0.
SparkR is installed on each machine, and basic tests are working on small files.
Here is the script I use:
Sys.setenv("SPARK_MEM" = "1g")
sc <- sparkR.init("spark://xxxx:7077", sparkEnvir=list(spark.executor.memory="1g"))
lines <- textFile(sc, "gs://xxxx/dir/")
test <- collect(lines)
The first time I run this, it seems to work fine: all the tasks run successfully, Spark's UI says the job completed, but I never get the R prompt back:
15/06/04 13:36:59 WARN SparkConf: Setting 'spark.executor.extraClassPath' to ':/home/hadoop/hadoop-install/lib/gcs-connector-1.4.0-hadoop1.jar' as a work-around.
15/06/04 13:36:59 WARN SparkConf: Setting 'spark.driver.extraClassPath' to ':/home/hadoop/hadoop-install/lib/gcs-connector-1.4.0-hadoop1.jar' as a work-around.
15/06/04 13:36:59 INFO Slf4jLogger: Slf4jLogger started
15/06/04 13:37:00 INFO Server: jetty-8.y.z-SNAPSHOT
15/06/04 13:37:00 INFO AbstractConnector: Started SocketConnector@0.0.0.0:52439
15/06/04 13:37:00 INFO Server: jetty-8.y.z-SNAPSHOT
15/06/04 13:37:00 INFO AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
15/06/04 13:37:54 INFO GoogleHadoopFileSystemBase: GHFS version: 1.4.0-hadoop1
15/06/04 13:37:55 WARN LoadSnappy: Snappy native library is available
15/06/04 13:37:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/06/04 13:37:55 WARN LoadSnappy: Snappy native library not loaded
15/06/04 13:37:55 INFO FileInputFormat: Total input paths to process : 68
[Stage 0:=======================================================> (27 + 10) / 68]
Then, after a CTRL-C to get the R prompt back, I try to run the collect method again; here is the result:
[Stage 1:==========================================================> (28 + 9) / 68]15/06/04 13:42:08 ERROR ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.remote.default-remote-dispatcher-5] shutting down ActorSystem [sparkDriver]
java.lang.OutOfMemoryError: Java heap space
at org.spark_project.protobuf.ByteString.toByteArray(ByteString.java:515)
at akka.remote.serialization.MessageContainerSerializer.fromBinary(MessageContainerSerializer.scala:64)
at akka.serialization.Serialization$$anonfun$deserialize$1.apply(Serialization.scala:104)
at scala.util.Try$.apply(Try.scala:161)
at akka.serialization.Serialization.deserialize(Serialization.scala:98)
at akka.remote.MessageSerializer$.deserialize(MessageSerializer.scala:23)
at akka.remote.DefaultMessageDispatcher.payload$lzycompute$1(Endpoint.scala:58)
at akka.remote.DefaultMessageDispatcher.payload$1(Endpoint.scala:58)
at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:76)
at akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:937)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:415)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
I understand the exception message, but I don't understand why I am getting it the second time.
Also, why does collect never return after completing in Spark?
I Googled every piece of information I have, but I had no luck finding a solution. Any help or hint would be greatly appreciated!
Thanks
This appears to be a combination of Java's inefficient in-memory object representations and some apparently long-lived object references, which cause certain collections to fail to be garbage-collected in time for the new collect() call to overwrite the old one in place.
I experimented with some options, and for my sample 256MB file that contains ~4M lines, I indeed reproduced your behavior: collect is fine the first time but OOMs the second time when using SPARK_MEM=1g. I then set SPARK_MEM=4g instead, and with that I was able to ctrl+c and re-run test <- collect(lines) as many times as I wanted.
For one thing, even if references didn't leak, note that after the first time you ran test <- collect(lines), the variable test is holding that gigantic array of lines. The second time you call it, collect(lines) executes before finally being assigned to the test variable, so with any straightforward instruction ordering there's no way to garbage-collect the old contents of test. This means the second run will make the SparkRBackend process hold two copies of the entire collection at the same time, leading to the OOM you saw.
To diagnose, on the master I started SparkR and first ran
dhuo#dhuo-sparkr-m:~$ jps | grep SparkRBackend
8709 SparkRBackend
I also checked top and it was using around 22MB of memory. I fetched a heap profile with jmap:
jmap -heap:format=b 8709
mv heap.bin heap0.bin
Then I ran the first round of test <- collect(lines) at which point running top showed it using ~1.7g of RES memory. I grabbed another heap dump. Finally, I also tried test <- {} to get rid of references to allow garbage-collection. After doing this, and printing out test and showing it to be empty, I grabbed another heap dump and noticed RES still showed 1.7g. I used jhat heap0.bin to analyze the original heap dump, and got:
Heap Histogram
All Classes (excluding platform)
Class Instance Count Total Size
class [B 25126 14174163
class [C 19183 1576884
class [<other> 11841 1067424
class [Lscala.concurrent.forkjoin.ForkJoinTask; 16 1048832
class [I 1524 769384
...
After running collect, I had:
Heap Histogram
All Classes (excluding platform)
Class Instance Count Total Size
class [C 2784858 579458804
class [B 27768 70519801
class java.lang.String 2782732 44523712
class [Ljava.lang.Object; 2567 22380840
class [I 1538 8460152
class [Lscala.concurrent.forkjoin.ForkJoinTask; 27 1769904
Even after I nulled out test, it remained about the same. This shows us 2784858 instances of char[], for a total size of 579MB, and also 2782732 instances of String, presumably holding those char[]'s above it. I followed the reference graph all the way up, and got something like
char[] -> String -> String[] -> ... -> class scala.collection.mutable.DefaultEntry -> class [Lscala.collection.mutable.HashEntry; -> class scala.collection.mutable.HashMap -> class edu.berkeley.cs.amplab.sparkr.JVMObjectTracker$ -> java.util.Vector@0x785b48cd8 (36 bytes) -> sun.misc.Launcher$AppClassLoader@0x7855c31a8 (138 bytes)
And then AppClassLoader had something like thousands of inbound references. So somewhere along that chain something should have been removing its reference but failed to do so, causing the entire collected array to sit in memory while we try to fetch a second copy of it.
Finally, to answer your question about hanging after the collect, it appears to have to do with the data not fitting in the R process's memory; here's a thread related to that issue: https://www.mail-archive.com/user@spark.apache.org/msg29155.html
I confirmed this by using a smaller file with only a handful of lines; running collect on it indeed does not hang.

How to limit the number of output lines in a given cell of the IPython notebook?

Sometimes my Ipython notebooks crash because I left a print statement in a big loop or in a recursive function. The kernel shows busy and the stop button is usually unresponsive. Eventually Chrome asks me if I want to kill the page or wait.
Is there a way to limit the number of output lines in a given cell? Or any other way to avoid this problem?
You can suppress output by putting ';' at the end of a line.
Perhaps create a condition in your loop to suppress output past a certain threshold (see the sketch below).
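Following that suggestion, a minimal sketch of capping the output inside the loop itself (MAX_LINES and the range are illustrative, not from the question):

MAX_LINES = 100              # illustrative threshold
for i in range(1000000):     # stand-in for the big loop
    if i < MAX_LINES:
        print(i)
    elif i == MAX_LINES:
        print('... output truncated ...')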
For anyone else stumbling across:
If you want to see some of the output rather than suppress the output entirely, there is an extension called limit-output.
You'll have to follow the installation instructions for the extensions at the first link. Then I ran the following code to update the maximum number of characters output by each cell:
from notebook.services.config import ConfigManager
cm = ConfigManager().update('notebook', {'limit_output': 10})
Note: you'll need to run the block of code, then restart your notebook server entirely (not just the kernel) for the change to take effect.
Results on jupyter v4.0.6 running a Python 2.7.12 kernel
for i in range(0,100):
    print i
0
1
2
3
4
limit_output extension: Maximum message size exceeded

Resources